# Capstone Report

## I. Introduction/Business Problem

In this project, we compare Foursquare data for Toronto with demographic data on Toronto neighborhoods to see whether Foursquare data is a good predictor of demographic characteristics (and vice versa). The purpose of this exploration is to help citizens and governments use proxy data to understand neighborhoods when the desired data is not available. For instance, can the proportion of check-ins by a particular age group at trending restaurants in a neighborhood predict the age range of that neighborhood's residents?

## II. Data

This project uses two main sources of data. The first is Foursquare, a local search-and-discovery app that provides personalized recommendations of places to go near a specific location. We signed up for a Foursquare developer account to access its API, then made calls to the API for a list of Toronto neighborhoods.

The second data source contains demographic information on Toronto's neighborhoods. This data comes from the City of Toronto's Open Data Portal (https://portal0.cf.opendata.inter.sandbox-toronto.ca/).

In [320]:
#############
### Rerun ###
#############

# IDs and Tokens
projectID = "***"
projectToken = "***"

clientID = "***"
clientSecret = "***"

#libraries
from project_lib import Project
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt
import json, requests

# Global variables
project = Project(None,projectID,projectToken)

### Example of Foursquare Data

In [33]:
url = 'https://api.foursquare.com/v2/venues/explore'

params = dict(
  client_id=clientID,          # credentials are defined in the setup cell above
  client_secret=clientSecret,
  v='20180323',
  ll='43.8066863,-79.1943534',
  limit=1
)
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)

In [41]:
print('Name:', data['response']['groups'][0]['items'][0]['venue']['name'])
print('Location:', data['response']['groups'][0]['items'][0]['venue']['location']['city'], data['response']['groups'][0]['items'][0]['venue']['location']['state'])

Name: African Rainforest Pavilion
Location: Toronto ON


### Example of Demographic Data

In [2]:
#############
### Rerun ###
#############

demo_df = pd.read_csv('https://www.toronto.ca/ext/open_data/catalog/data_set_files/2016_neighbourhood_profiles.csv', index_col = 3, encoding = 'iso-8859-1')
demo_df = demo_df.drop(['Category', 'Topic', 'Data Source'], axis = 1).T
demo_df.head()

Characteristic,Neighbourhood Number,TSNS2020 Designation,"Population, 2016","Population, 2011",Population Change 2011-2016,Total private dwellings,Private dwellings occupied by usual residents,Population density per square kilometre,Land area in square kilometres,Children (0-14 years),...,External migrants,Total - Mobility status 5 years ago - 25% sample data,Non-movers,Movers,Non-migrants,Migrants,Internal migrants,Intraprovincial migrants,Interprovincial migrants,External migrants.1
City of Toronto,,,2731571,2615060,4.50%,1179057,1112929,4334,630.2,398135,...,59945,2556120,1516110,1040015,639060,400950,184120,141135,42985,216835
Agincourt North,129.0,No Designation,29113,30279,-3.90%,9371,9120,3929,7.41,3840,...,605,27490,18865,8610,5445,3170,880,735,135,2280
Agincourt South-Malvern West,128.0,No Designation,23757,21988,8.00%,8535,8136,3034,7.83,3075,...,490,22325,13565,8775,5610,3145,980,760,220,2170
Alderwood,20.0,No Designation,12054,11904,1.30%,4732,4616,2435,4.95,1760,...,70,11370,8235,3130,2200,925,680,615,70,245
Annex,95.0,No Designation,30526,29177,4.60%,18109,15934,10863,2.81,2360,...,835,27715,12980,14735,8340,6390,3930,2630,1310,2460


## III. Methodology

### Data Collection
#### About the Demographic Data

Demographic data on Toronto neighborhoods was collected from the City of Toronto's Open Data Portal. The CSV file was read directly from the download link and stored in a dataframe. To make the data usable for our purposes, we dropped the descriptive columns we did not need (Category, Topic, and Data Source) and transposed the table so that neighborhoods are the observations (rows) and the demographic characteristics are the features (columns). Afterwards we extracted the list of neighborhoods for use in searching Foursquare.
*See **Example of Demographic Data** for collection code*
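
The reshaping step described above can be sketched on a toy table. The frame below is made up for illustration (two characteristics, two placeholder neighbourhoods, invented values); only the column names `Category`, `Topic`, `Data Source`, and `Characteristic` mirror the real file.

```python
import pandas as pd

# Toy stand-in for the profile CSV: one row per characteristic,
# one column per neighbourhood (all values are invented).
raw = pd.DataFrame({
    'Category': ['Population', 'Population'],
    'Topic': ['Population and dwellings', 'Age characteristics'],
    'Data Source': ['Census Profile', 'Census Profile'],
    'Characteristic': ['Population, 2016', 'Children (0-14 years)'],
    'Neighbourhood A': ['1,000', '150'],
    'Neighbourhood B': ['2,000', '400'],
})

# Index by characteristic, drop the descriptive columns, then transpose so
# that neighbourhoods become rows and characteristics become columns.
tidy = (raw.set_index('Characteristic')
           .drop(['Category', 'Topic', 'Data Source'], axis=1)
           .T)

print(tidy.shape)
print(list(tidy.index))
```

After the transpose, each neighbourhood can be looked up by name with `tidy.loc[...]`, which is what makes the later merge on neighborhood names possible.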

#### About the Foursquare Data
Foursquare data is accessible through the company's API. After creating a developer account, ***Explore*** calls to the API were made for various neighborhoods to get a list of recommended venues near the searched location. Information about the Explore calls can be found at https://developer.foursquare.com/docs/api/venues/explore. 

After collecting recommended venues for each neighborhood, we made ***Details*** calls for each venue in the search results to get details such as location, tips, and categories. Because Details is a premium call, we were limited to 500 calls a day. We therefore limited the number of neighborhoods in our analysis and requested 50 venues (the maximum per Explore call) for each one. The neighborhoods were selected at random.
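
The effect of the daily cap on the collection schedule can be worked out with a few lines of arithmetic. This is only a sketch: `daily_batches` is a hypothetical helper, and the figure of 450 calls per day mirrors the ranges used in the collection cells of this notebook (reading it as a margin below the 500-call cap is an assumption).

```python
def daily_batches(total_calls, calls_per_day):
    """Split `total_calls` sequential calls into (start, stop) ranges,
    one range per day, each at most `calls_per_day` long."""
    return [(start, min(start + calls_per_day, total_calls))
            for start in range(0, total_calls, calls_per_day)]

# 33 neighborhoods x 50 venues each gives the number of Details calls needed
n_venues = 33 * 50
batches = daily_batches(n_venues, 450)

for day, (start, stop) in enumerate(batches, start=1):
    print(f'Day {day}: venues {start} to {stop - 1}')
```

Each `(start, stop)` pair can be passed straight to `range(start, stop)`, so the last day simply covers whatever remains.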

***Note:*** Searching the Foursquare app by neighborhood name is inconsistent, but often returns results. Searching the API by neighborhood name, however, rarely returns results (even when the app did). We therefore used Google Maps to identify the center coordinates of each neighborhood and entered them manually into a dataframe.
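
Because the coordinates are stored as `'lat, lng'` strings, it can be handy to move between that form and numeric values. The two helpers below are hypothetical and not part of the original notebook; they are a minimal sketch of that conversion.

```python
def split_ll(ll):
    """Turn a 'lat, lng' string into a (lat, lng) tuple of floats."""
    lat, lng = (float(part) for part in ll.split(','))
    return lat, lng

def format_ll(lat, lng):
    """Build a compact 'lat,lng' string from two numbers."""
    return f'{lat},{lng}'

# coordinate string for Agincourt North, as entered in this notebook
lat, lng = split_ll('43.807553, -79.268030')
print(lat, lng)
print(format_ll(lat, lng))
```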

***Searched Neighborhoods:***
*Agincourt North, Alderwood, Annex, Bathurst Manor, Bayview Village, Cliffcrest, Dorset Park, Flemingdon Park, Forest Hill North, Guildwood, Henry Farm, Highland Creek, Hillcrest Village, Humber Summit, Ionview, Kennedy Park, Little Portugal, Long Branch, Malvern, Markland Wood, Morningside, Mount Dennis, New Toronto, Oakridge, Regent Park, Roncesvalles, Rouge, Scarborough Village, The Beaches, Thorncliffe Park, West Hill, Weston, and Woburn*


The API calls returned JSON files, which were converted to data frames and stored as csv files over several days. 
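
As a rough illustration of the JSON-to-rows step, the sketch below flattens an Explore-style response into one record per venue. `flatten_explore` is a hypothetical helper, and the sample payload is hand-written: it contains only the fields this notebook reads, with values borrowed from the first row of the recommendations table, not a full API response.

```python
def flatten_explore(data, neighborhood):
    """Return one dict per venue found in an Explore-style response."""
    rows = []
    groups = data.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    for item in items:
        venue = item['venue']
        rows.append({
            'id': venue['id'],
            'name': venue['name'],
            'lat': venue['location']['lat'],
            'lng': venue['location']['lng'],
            'neighborhood': neighborhood,
        })
    return rows

# Hand-written sample shaped like the fields used in this report
sample = {'response': {'groups': [{'items': [
    {'venue': {'id': '4d83951502eb5481e3c71cf5',
               'name': 'Pizza Pizza',
               'location': {'lat': 43.808318, 'lng': -79.268537}}},
]}]}}

rows = flatten_explore(sample, 'Agincourt North')
print(rows)
```

A list of dicts like `rows` can be handed to `pd.DataFrame(rows)` and then written out with `to_csv`.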

#### Finalize Demographic Data

In [31]:
#############
### Rerun ###
#############

# manual set of neighborhoods and lat/lng coordinates
neighborhoods = pd.DataFrame(data = {'Neighborhood':['Agincourt North', 'Alderwood', 'Annex', 'Bathurst Manor', 'Bayview Village', 
                                                'Cliffcrest', 'Dorset Park', 'Flemingdon Park', 'Forest Hill North', 'Guildwood', 
                                                'Henry Farm', 'Highland Creek', 'Hillcrest Village', 'Humber Summit', 'Ionview', 
                                                'Kennedy Park', 'Little Portugal', 'Long Branch', 'Malvern', 'Markland Wood',
                                                'Morningside', 'Mount Dennis', 'New Toronto', 'Oakridge', 'Regent Park', 
                                                'Roncesvalles', 'Rouge', 'Scarborough Village', 'The Beaches', 'Thorncliffe Park', 
                                                'West Hill', 'Weston', 'Woburn'], 
                               'LL':['43.807553, -79.268030', '43.604498, -79.541228', '43.671220, -79.404673', 
                                     '43.765731, -79.457152', '43.777287, -79.378287', '43.725333, -79.231912', 
                                     '43.759675, -79.278793', '43.716436, -79.331415', '43.704375, -79.427817', 
                                     '43.748745, -79.194773', '43.771804, -79.341777', '43.792310, -79.176724', 
                                     '43.803083, -79.353281', '43.759372, -79.555372', '43.736031, -79.272930', 
                                     '43.726305, -79.260049', '43.648403, -79.430358', '43.592414, -79.533756', 
                                     '43.806878, -79.218038', '43.634009, -79.575008', '43.782131, -79.203709', 
                                     '43.688643, -79.498612', '43.601285, -79.510259', '43.696451, -79.280519', 
                                     '43.659917, -79.361629', '43.647618, -79.449624', '43.804952, -79.165872', 
                                     '43.740604, -79.215696', '43.671681, -79.298558', '43.708681, -79.348812', 
                                     '43.769107, -79.176750', '43.703432, -79.517863', '43.768958, -79.227792']})

# merge the neighborhood locations and demographic data
neighborhoods = neighborhoods.merge(demo_df, how='left', left_on='Neighborhood', right_index = True)
neighborhoods.shape


(33, 2385)

In [126]:
neighborhoods.head()

Unnamed: 0,LL,Neighborhood,Neighbourhood Number,TSNS2020 Designation,"Population, 2016","Population, 2011",Population Change 2011-2016,Total private dwellings,Private dwellings occupied by usual residents,Population density per square kilometre,...,External migrants,Total - Mobility status 5 years ago - 25% sample data,Non-movers,Movers,Non-migrants,Migrants,Internal migrants,Intraprovincial migrants,Interprovincial migrants,External migrants.1
0,"43.807553, -79.268030",Agincourt North,129,No Designation,29113,30279,-3.90%,9371,9120,3929,...,605,27490,18865,8610,5445,3170,880,735,135,2280
1,"43.604498, -79.541228",Alderwood,20,No Designation,12054,11904,1.30%,4732,4616,2435,...,70,11370,8235,3130,2200,925,680,615,70,245
2,"43.671220, -79.404673",Annex,95,No Designation,30526,29177,4.60%,18109,15934,10863,...,835,27715,12980,14735,8340,6390,3930,2630,1310,2460
3,"43.765731, -79.457152",Bathurst Manor,34,No Designation,15873,15434,2.80%,6418,6089,3377,...,375,14750,8845,5905,3680,2235,915,745,170,1310
4,"43.777287, -79.378287",Bayview Village,52,No Designation,21396,17671,21.10%,10111,9532,4195,...,625,20250,9055,11200,6175,5015,1995,1485,510,3030


#### Get Foursquare Recommendations

In [134]:
#############################
### Only need to run once ###
#############################

# Get Foursquare Recommendations

# collect one row per recommended venue
rows = []

# api url
url = 'https://api.foursquare.com/v2/venues/explore'

# for each neighborhood
for i in range(len(neighborhoods.index)):

    params = dict(
        client_id=clientID,
        client_secret=clientSecret,
        v='20180323',
        section='food',
        time='any',
        day='any',
        ll=neighborhoods.LL[i],
        sortByDistance=1,
        limit=50
    )

    resp = requests.get(url=url, params=params)
    data = json.loads(resp.text)

    # extract desired info for each venue returned
    # (looping over the items avoids an IndexError when fewer than 50 come back)
    for item in data['response']['groups'][0]['items']:
        venue = item['venue']
        rows.append([venue['id'], venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     neighborhoods.Neighborhood[i]])

# build the data frame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=['id', 'name', 'lat','lng','neighborhood'])


In [135]:
df.shape

(1650, 5)

In [136]:
df.head()

Unnamed: 0,id,name,lat,lng,neighborhood
0,4d83951502eb5481e3c71cf5,Pizza Pizza,43.808318,-79.268537,Agincourt North
1,4d7e9db4e7e1721e6cb7e30b,Popeyes Louisiana Kitchen,43.808652,-79.267929,Agincourt North
2,4d658b7c89238cfaa5129b36,Subway,43.809007,-79.267627,Agincourt North
3,4dacc7855da32d679da9ee55,Congee Town 太皇名粥,43.809035,-79.267634,Agincourt North
4,4b93d4a7f964a520eb5334e3,Saravanaa Bhavan South Indian Restaurant,43.810117,-79.269275,Agincourt North


#### Get Foursquare Venue Details

In [196]:
#############################
### Only need to run once ###
#############################

# Add columns for venue details
df['likes'] = np.nan
df['price'] = np.nan
df['rating'] = np.nan
df['checkinsCount'] = np.nan
df['usersCount'] = np.nan
df['visitsCount'] = np.nan

In [197]:
df.head()

Unnamed: 0,id,name,lat,lng,neighborhood,likes,price,rating,checkinsCount,usersCount,visitsCount
0,4d83951502eb5481e3c71cf5,Pizza Pizza,43.808318,-79.268537,Agincourt North,,,,,,
1,4d7e9db4e7e1721e6cb7e30b,Popeyes Louisiana Kitchen,43.808652,-79.267929,Agincourt North,,,,,,
2,4d658b7c89238cfaa5129b36,Subway,43.809007,-79.267627,Agincourt North,,,,,,
3,4dacc7855da32d679da9ee55,Congee Town 太皇名粥,43.809035,-79.267634,Agincourt North,,,,,,
4,4b93d4a7f964a520eb5334e3,Saravanaa Bhavan South Indian Restaurant,43.810117,-79.269275,Agincourt North,,,,,,


In [210]:
#############################
### Only need to run once ###
#############################

### DAY 1

for i in range(0,450):
    
    # api url
    url = 'https://api.foursquare.com/v2/venues/' + df.id[i]
    
    params = dict(
        client_id=clientID,
        client_secret=clientSecret,
        v='20180323',
    )
    
    resp = requests.get(url=url, params=params)
    data = json.loads(resp.text)
    
    venue = data['response']['venue']

    if venue.get("likes") is not None:
        df.loc[i, 'likes'] = venue['likes']['count']

    if venue.get("price") is not None:
        df.loc[i, 'price'] = venue['price']['tier']

    if venue.get("rating") is not None:
        df.loc[i, 'rating'] = venue['rating']

    if venue.get("stats") is not None:
        df.loc[i, 'checkinsCount'] = venue['stats']['checkinsCount']
        df.loc[i, 'usersCount'] = venue['stats']['usersCount']
        df.loc[i, 'visitsCount'] = venue['stats']['visitsCount']


In [11]:
#############################
### Only need to run once ###
#############################

### DAY 2

for i in range(450,900):
    
    # api url
    url = 'https://api.foursquare.com/v2/venues/' + df.id[i]
    
    params = dict(
        client_id=clientID,
        client_secret=clientSecret,
        v='20180323',
    ) 
    
    resp = requests.get(url=url, params=params)
    data = json.loads(resp.text)
    
    venue = data['response']['venue']

    if venue.get("likes") is not None:
        df.loc[i, 'likes'] = venue['likes']['count']

    if venue.get("price") is not None:
        df.loc[i, 'price'] = venue['price']['tier']

    if venue.get("rating") is not None:
        df.loc[i, 'rating'] = venue['rating']

    if venue.get("stats") is not None:
        df.loc[i, 'checkinsCount'] = venue['stats']['checkinsCount']
        df.loc[i, 'usersCount'] = venue['stats']['usersCount']
        df.loc[i, 'visitsCount'] = venue['stats']['visitsCount']


In [7]:
#############################
### Only need to run once ###
#############################

### DAY 3

for i in range(900,1350):
    
    # api url
    url = 'https://api.foursquare.com/v2/venues/' + df.id[i]
    
    params = dict(
        client_id=clientID,
        client_secret=clientSecret,
        v='20180323',
    )
    
    resp = requests.get(url=url, params=params)
    data = json.loads(resp.text)
    
    venue = data['response']['venue']

    if venue.get("likes") is not None:
        df.loc[i, 'likes'] = venue['likes']['count']

    if venue.get("price") is not None:
        df.loc[i, 'price'] = venue['price']['tier']

    if venue.get("rating") is not None:
        df.loc[i, 'rating'] = venue['rating']

    if venue.get("stats") is not None:
        df.loc[i, 'checkinsCount'] = venue['stats']['checkinsCount']
        df.loc[i, 'usersCount'] = venue['stats']['usersCount']
        df.loc[i, 'visitsCount'] = venue['stats']['visitsCount']


In [7]:
#############################
### Only need to run once ###
#############################

### DAY 4


for i in range(1350,1650):
    
    # api url
    url = 'https://api.foursquare.com/v2/venues/' + df.id[i]
    
    params = dict(
        client_id=clientID,
        client_secret=clientSecret,
        v='20180323',
    )
    
    resp = requests.get(url=url, params=params)
    data = json.loads(resp.text)
    
    venue = data['response']['venue']

    if venue.get("likes") is not None:
        df.loc[i, 'likes'] = venue['likes']['count']

    if venue.get("price") is not None:
        df.loc[i, 'price'] = venue['price']['tier']

    if venue.get("rating") is not None:
        df.loc[i, 'rating'] = venue['rating']

    if venue.get("stats") is not None:
        df.loc[i, 'checkinsCount'] = venue['stats']['checkinsCount']
        df.loc[i, 'usersCount'] = venue['stats']['usersCount']
        df.loc[i, 'visitsCount'] = venue['stats']['visitsCount']
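
The four daily cells repeat the same field extraction. One way to factor that out is a small function that maps a venue-details dictionary to the six fields used in this report. This is a sketch, not the code that produced the data: `extract_details` is a hypothetical name, and the sample dictionary is hand-written with values borrowed from one row of the recommendations table.

```python
def extract_details(venue):
    """Pull the six fields used in this report from a venue-details dict.
    Missing fields come back as None instead of raising a KeyError."""
    stats = venue.get('stats') or {}
    return {
        'likes': (venue.get('likes') or {}).get('count'),
        'price': (venue.get('price') or {}).get('tier'),
        'rating': venue.get('rating'),
        'checkinsCount': stats.get('checkinsCount'),
        'usersCount': stats.get('usersCount'),
        'visitsCount': stats.get('visitsCount'),
    }

# Hand-written sample with no 'stats' entry
sample_venue = {'likes': {'count': 32}, 'price': {'tier': 1}, 'rating': 6.6}
details = extract_details(sample_venue)
print(details)
```

Inside a collection loop, the returned dictionary could be written to the data frame one key at a time, skipping keys whose value is `None`.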


In [11]:
# Save dataframe as csv file to storage
project.save_data(data=df.to_csv(index=False),file_name='recommendations.csv',overwrite=True)

{'asset_id': '5fdaa8ff-1cbd-4c48-b4ef-9cb6438e5a3d',
 'bucket_name': 'capstone-donotdelete-pr-ht0edkl25hrrir',
 'file_name': 'recommendations.csv',
 'message': 'File saved to project storage.'}

In [7]:
# Fetch the file
my_file = project.get_file("recommendations.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
fs = pd.read_csv(my_file)
fs.head()

Unnamed: 0,id,name,lat,lng,neighborhood,likes,price,rating,checkinsCount,usersCount,visitsCount
0,4d83951502eb5481e3c71cf5,Pizza Pizza,43.808318,-79.268537,Agincourt North,1.0,1.0,6.1,0.0,0.0,0.0
1,4d7e9db4e7e1721e6cb7e30b,Popeyes Louisiana Kitchen,43.808652,-79.267929,Agincourt North,6.0,1.0,6.1,0.0,0.0,0.0
2,4d658b7c89238cfaa5129b36,Subway,43.809007,-79.267627,Agincourt North,1.0,1.0,6.3,0.0,0.0,0.0
3,4dacc7855da32d679da9ee55,Congee Town 太皇名粥,43.809035,-79.267634,Agincourt North,32.0,1.0,6.6,0.0,0.0,0.0
4,4b93d4a7f964a520eb5334e3,Saravanaa Bhavan South Indian Restaurant,43.810117,-79.269275,Agincourt North,28.0,1.0,7.6,0.0,0.0,0.0


#### Demographic Feature Selection

In [106]:
# location, neighborhood name, population density and population change
neighborhoods_short = pd.concat([neighborhoods['LL'], 
                                 neighborhoods['Neighborhood'],
                                 neighborhoods['Population density per square kilometre'].str.replace(',', '').astype(int),
                                 neighborhoods['Population Change 2011-2016'].str.replace('%', '').astype(float)], 
                                axis = 1)

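# change raw numbers into percentages for population ages
# note: 'Seniors (65+ years)' already contains 'Older Seniors (85+ years)',
# so the total below counts the 85+ group twice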
temp = pd.concat([neighborhoods['Children (0-14 years)'].str.replace(',', '').astype(int), 
                 neighborhoods['Youth (15-24 years)'].str.replace(',', '').astype(int), 
                 neighborhoods['Working Age (25-54 years)'].str.replace(',', '').astype(int), 
                 neighborhoods['Pre-retirement (55-64 years)'].str.replace(',', '').astype(int), 
                 neighborhoods['Seniors (65+ years)'].str.replace(',', '').astype(int), 
                 neighborhoods['Older Seniors (85+ years)'].str.replace(',', '').astype(int)], axis = 1)
total = temp.iloc[:,0] + temp.iloc[:,1] + temp.iloc[:,2] + temp.iloc[:,3] + temp.iloc[:,4] + temp.iloc[:,5]
temp.iloc[:,0] = temp.iloc[:,0] / total
temp.iloc[:,1] = temp.iloc[:,1] / total
temp.iloc[:,2] = temp.iloc[:,2] / total
temp.iloc[:,3] = temp.iloc[:,3] / total
temp.iloc[:,4] = temp.iloc[:,4] / total
temp.iloc[:,5] = temp.iloc[:,5] / total
neighborhoods_short = pd.concat([neighborhoods_short, temp], axis = 1)

# change marital status into percentages
temp = pd.concat([neighborhoods['  Married or living common law'].astype(int),
                  neighborhoods['  Not married and not living common law'].astype(int)], axis = 1) 
total = temp.iloc[:,0] + temp.iloc[:,1]
temp.iloc[:,0] = temp.iloc[:,0] / total
temp.iloc[:,1] = temp.iloc[:,1] / total
neighborhoods_short = pd.concat([neighborhoods_short, temp['  Married or living common law']], axis = 1)

# persons living alone
neighborhoods_short = pd.concat([neighborhoods_short,
                                 neighborhoods['Persons living alone (per cent)'].str.replace('%', '').astype(float)], axis = 1)

# Average household size
neighborhoods_short = pd.concat([neighborhoods_short,
                                 neighborhoods[' Average household size'].astype(float)], axis = 1)

# Average income
neighborhoods_short = pd.concat([neighborhoods_short, 
                                 neighborhoods['Total income: Average amount ($)'].str.replace(',', '').astype(int)], axis = 1)


# change housing to percentages
temp = pd.concat([neighborhoods['  Owner'].str.replace(',', '').astype(int),
                  neighborhoods['  Renter'].str.replace(',', '').astype(int)], axis = 1) 
total = temp.iloc[:,0] + temp.iloc[:,1]
temp.iloc[:,0] = temp.iloc[:,0] / total
temp.iloc[:,1] = temp.iloc[:,1] / total
neighborhoods_short = pd.concat([neighborhoods_short, temp['  Owner']], axis = 1)

# change education levels to percentages
temp = neighborhoods.iloc[:,[1702, 1703, 1705, 1708, 1709, 1711, 1712, 1713, 1714, 1715]].apply(pd.to_numeric)
total = temp.sum(axis = 1)
temp.iloc[:,0] = temp.iloc[:,0] / total
temp.iloc[:,1] = temp.iloc[:,1] / total
temp.iloc[:,2] = (temp.iloc[:,2] + temp.iloc[:,3] + temp.iloc[:,4]) / total
temp.iloc[:,5] = temp.iloc[:,5] / total
temp.iloc[:,6] = (temp.iloc[:,6] + temp.iloc[:,7] + temp.iloc[:,8] + temp.iloc[:,9]) / total

neighborhoods_short = pd.concat([neighborhoods_short, temp.iloc[:, [0,1,2,5,6]]], axis = 1)

# add employment rate data
neighborhoods_short = pd.concat([neighborhoods_short,
                                 neighborhoods['Participation rate'].astype(float),
                                 neighborhoods['Employment rate'].astype(float),
                                 neighborhoods['Unemployment rate'].astype(float)], axis = 1)

neighborhoods_short.head()


Unnamed: 0,LL,Neighborhood,Population density per square kilometre,Population Change 2011-2016,Children (0-14 years),Youth (15-24 years),Working Age (25-54 years),Pre-retirement (55-64 years),Seniors (65+ years),Older Seniors (85+ years),...,Total income: Average amount ($),Owner,"No certificate, diploma or degree",Secondary (high) school diploma or equivalency certificate,Apprenticeship or trades certificate or diploma,Bachelor's degree,University certificate or diploma above bachelor level,Participation rate,Employment rate,Unemployment rate
0,"43.807553, -79.268030",Agincourt North,3929,-3.9,0.127787,0.123295,0.376206,0.140765,0.201165,0.030782,...,30414,0.8113,0.261843,0.298221,0.207875,0.175095,0.056966,55.4,50.0,9.8
1,"43.604498, -79.541228",Alderwood,2435,1.3,0.142222,0.099798,0.421818,0.147475,0.162828,0.025859,...,47709,0.794595,0.195324,0.288358,0.293717,0.161715,0.060887,66.5,62.4,6.1
2,"43.671220, -79.404673",Annex,10863,4.6,0.074731,0.118746,0.476251,0.110196,0.187144,0.032932,...,112766,0.380414,0.060358,0.162605,0.14642,0.347867,0.282749,70.6,65.8,6.7
3,"43.765731, -79.457152",Bathurst Manor,3377,2.8,0.14006,0.116867,0.400904,0.122289,0.177108,0.042771,...,45936,0.539852,0.125424,0.255367,0.242561,0.231638,0.145009,65.0,60.3,7.2
4,"43.777287, -79.378287",Bayview Village,4195,21.1,0.109798,0.11389,0.468743,0.115481,0.164356,0.027734,...,52035,0.587834,0.069941,0.196476,0.211959,0.323278,0.198345,63.4,58.5,7.7
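
The count-to-share conversion used throughout the cell above can also be written as a single row-wise division. The sketch below uses invented owner and renter counts for two placeholder neighbourhoods; it shows the idiom, not the notebook's actual numbers.

```python
import pandas as pd

# Invented household counts for two placeholder neighbourhoods
counts = pd.DataFrame(
    {'Owner': [300, 100], 'Renter': [100, 300]},
    index=['Neighbourhood A', 'Neighbourhood B'],
)

# Divide every column by the row total so that each row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

`axis=0` tells `div` to align the row totals with the row index, which replaces the column-by-column `iloc` assignments.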


#### Add Foursquare data to demographic data

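We averaged the Foursquare venue data within each neighborhood and merged the result with the demographic features, so that correlations between the Foursquare measures (such as average price) and the demographic information could be examined.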



In [107]:
# aggregate Foursquare data across neighborhoods
# (numeric_only=True skips the text columns 'id' and 'name', which newer pandas versions refuse to average)
neighborhoods_short = neighborhoods_short.merge(fs.groupby(['neighborhood'], as_index = False).mean(numeric_only = True), how = 'left', left_on = 'Neighborhood', right_on = 'neighborhood').drop(['neighborhood', 'checkinsCount', 'usersCount', 'visitsCount'], axis = 1)
neighborhoods_short.rename(index=str, columns={'Population density per square kilometre': 'Population density', 
                              'Population Change 2011-2016': 'Population Change',
                              'Children (0-14 years)' : 'Percent children',
                              'Youth (15-24 years)' : 'Percent youth',
                              'Working Age (25-54 years)' : 'Percent working Age',
                              'Pre-retirement (55-64 years)' : 'Percent pre-retirement',
                              'Seniors (65+ years)' : 'Percent seniors',
                              'Older Seniors (85+ years)' : 'Percent older Seniors',
                              '  Married or living common law' : 'Percent married',
                              'Persons living alone (per cent)' : 'Percent living alone',
                              ' Average household size' : 'Average household size',
                              'Total income: Average amount ($)' : 'Average income',
                              '  Owner' : 'Percent homeowners',
                              '  No certificate, diploma or degree' : 'No degree',
                              '  Secondary (high) school diploma or equivalency certificate' : 'Secondary diploma',
                              '    Apprenticeship or trades certificate or diploma' : 'Postsecondary degree',
                              '      Bachelor\'s degree' : 'Bachelors degree',
                              '      University certificate or diploma above bachelor level' : 'Postgraduate degree'},
                          inplace = True)
neighborhoods_short.head()


Unnamed: 0,LL,Neighborhood,Population density,Population Change,Percent children,Percent youth,Percent working Age,Percent pre-retirement,Percent seniors,Percent older Seniors,...,Bachelors degree,Postgraduate degree,Participation rate,Employment rate,Unemployment rate,lat,lng,likes,price,rating
0,"43.807553, -79.268030",Agincourt North,3929,-3.9,0.127787,0.123295,0.376206,0.140765,0.201165,0.030782,...,0.175095,0.056966,55.4,50.0,9.8,43.809872,-79.276526,9.16,1.387755,6.678788
1,"43.604498, -79.541228",Alderwood,2435,1.3,0.142222,0.099798,0.421818,0.147475,0.162828,0.025859,...,0.161715,0.060887,66.5,62.4,6.1,43.609265,-79.545475,22.44,1.489796,7.122449
2,"43.671220, -79.404673",Annex,10863,4.6,0.074731,0.118746,0.476251,0.110196,0.187144,0.032932,...,0.347867,0.282749,70.6,65.8,6.7,43.670651,-79.405661,38.52,1.77551,7.563043
3,"43.765731, -79.457152",Bathurst Manor,3377,2.8,0.14006,0.116867,0.400904,0.122289,0.177108,0.042771,...,0.231638,0.145009,65.0,60.3,7.2,43.769912,-79.464654,10.82,1.479167,6.848571
4,"43.777287, -79.378287",Bayview Village,4195,21.1,0.109798,0.11389,0.468743,0.115481,0.164356,0.027734,...,0.323278,0.198345,63.4,58.5,7.7,43.783453,-79.364768,20.24,1.468085,6.835556


In [62]:
# get descriptive statistics for all features
neighborhoods_desc = neighborhoods_short.describe()
neighborhoods_desc.head()

Unnamed: 0,Population density per square kilometre,Population Change 2011-2016,Children (0-14 years),Youth (15-24 years),Working Age (25-54 years),Pre-retirement (55-64 years),Seniors (65+ years),Older Seniors (85+ years),Married or living common law,Persons living alone (per cent),...,University certificate or diploma below bachelor level,"University certificate, diploma or degree at bachelor level or above",Participation rate,Employment rate,Unemployment rate,lat,lng,likes,price,rating
count,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,...,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0
mean,5540.69697,4.018182,0.154549,0.12456,0.420573,0.12579,0.152372,0.022156,0.5099,13.833333,...,0.019466,0.202918,62.778788,57.10303,9.142424,43.722468,-79.350828,12.893939,1.447949,6.865969
std,3416.597782,9.072295,0.034868,0.018163,0.057407,0.02063,0.04149,0.010923,0.049826,6.117938,...,0.004042,0.064036,5.71652,6.191652,2.14826,0.062732,0.123723,7.750275,0.165438,0.3455
min,1260.0,-4.6,0.074731,0.086139,0.32218,0.077334,0.067188,0.004602,0.402399,3.9,...,0.010822,0.108169,53.6,48.6,5.7,43.601417,-79.570694,4.56,1.16,6.424324
25%,3148.0,-0.5,0.130435,0.113982,0.384464,0.113121,0.133025,0.014522,0.481662,10.5,...,0.016736,0.156408,58.5,51.9,7.4,43.671377,-79.448512,6.98,1.28,6.621875


### Data Analysis

#### Exploratory
We first looked at a correlation matrix to determine which demographic features had the highest correlations with the Foursquare features.

In [108]:
neighborhoods_short.corr()[['likes', 'price', 'rating']]

Unnamed: 0,likes,price,rating
Population density,0.576117,0.425788,0.6593
Population Change,0.286917,0.230166,0.402191
Percent children,-0.471049,-0.236485,-0.297164
Percent youth,-0.280733,-0.408641,-0.467931
Percent working Age,0.585589,0.396203,0.723585
Percent pre-retirement,-0.38745,-0.387217,-0.454116
Percent seniors,-0.092038,0.006017,-0.283084
Percent older Seniors,-0.025788,0.060563,-0.143268
Percent married,-0.241567,0.017862,-0.261396
Percent living alone,0.620684,0.49744,0.706935


The correlation matrix shows that the following features generally have a medium to strong correlation with likes, price, and rating:

* population density
* change in population
* age (children to pre-retirement)
* percentage of persons living alone
* percentage of home owners
* education
* workforce participation rate
* workforce employment rate
* workforce unemployment rate

We looked at these features to see if they can predict Foursquare data, and if Foursquare data can predict any of them.
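
For readers who want to see what each entry of the correlation matrix measures, the sketch below computes a Pearson correlation by hand. The inputs are the population density and mean rating of the five neighborhoods shown in the table previews above, so the result will not match the full 33-neighborhood value in the matrix; `pearson_r` is a hypothetical helper written for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# First five neighborhoods only (Agincourt North through Bayview Village)
density = [3929, 2435, 10863, 3377, 4195]
rating = [6.678788, 7.122449, 7.563043, 6.848571, 6.835556]

r = pearson_r(density, rating)
print(round(r, 3))
```

With only five points the estimate is very noisy, which is one reason the report relies on all 33 neighborhoods.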


### Predicting Foursquare data

We used linear regression models to identify which features of the demographic data can predict the three main types of Foursquare data: likes, price, and rating.

In [266]:
# all demographic features (columns 2 through 22)
X_d = sm.add_constant(neighborhoods_short.iloc[:, 2:23])
y_d_l = neighborhoods_short['likes']
y_d_p = neighborhoods_short['price']
y_d_r = neighborhoods_short['rating']

#### Predicting Likes

In [207]:
print(sm.OLS(y_d_l, sm.add_constant(neighborhoods_short.iloc[:,[2,17,22]])).fit().summary())


                            OLS Regression Results                            
Dep. Variable:                  likes   R-squared:                       0.619
Model:                            OLS   Adj. R-squared:                  0.580
Method:                 Least Squares   F-statistic:                     15.71
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           2.95e-06
Time:                        18:30:57   Log-Likelihood:                -97.964
No. Observations:                  33   AIC:                             203.9
Df Residuals:                      29   BIC:                             209.9
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   37.7569 
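
The adjusted R-squared in the summary above follows from the plain R-squared, the number of observations, and the number of predictors. The sketch below applies the standard formula to the reported figures (R-squared 0.619, 33 observations, 3 predictors); `adjusted_r2` is a hypothetical helper, and small differences from the printed summaries can arise because the inputs are already rounded.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Figures taken from the likes model summary
likes_adj = adjusted_r2(0.619, 33, 3)
print(round(likes_adj, 3))
```

Because the sample has only 33 neighborhoods, each additional predictor costs a noticeable amount of adjusted R-squared, which is a reason to keep the models small.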

#### Predicting Price

In [217]:
print(sm.OLS(y_d_p, sm.add_constant(neighborhoods_short.iloc[:,[12,22]])).fit().summary())


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.346
Model:                            OLS   Adj. R-squared:                  0.302
Method:                 Least Squares   F-statistic:                     7.926
Date:                Wed, 20 Mar 2019   Prob (F-statistic):            0.00172
Time:                        18:34:26   Log-Likelihood:                 20.054
No. Observations:                  33   AIC:                            -34.11
Df Residuals:                      30   BIC:                            -29.62
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                      2

#### Predicting Rating

In [256]:
print(sm.OLS(y_d_r, sm.add_constant(neighborhoods_short.iloc[:,[22,7,11]])).fit().summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.790
Model:                            OLS   Adj. R-squared:                  0.769
Method:                 Least Squares   F-statistic:                     36.44
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           5.74e-10
Time:                        18:44:20   Log-Likelihood:                 14.531
No. Observations:                  33   AIC:                            -21.06
Df Residuals:                      29   BIC:                            -15.08
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                      8

In [272]:
X_f = sm.add_constant(neighborhoods_short[['likes', 'price', 'rating']])
y_f_2 = neighborhoods_short.iloc[:, 2]
y_f_3 = neighborhoods_short.iloc[:, 3]
y_f_4 = neighborhoods_short.iloc[:, 4]
y_f_5 = neighborhoods_short.iloc[:, 5]
y_f_6 = neighborhoods_short.iloc[:, 6]
y_f_7 = neighborhoods_short.iloc[:, 7]
y_f_8 = neighborhoods_short.iloc[:, 8]
y_f_9 = neighborhoods_short.iloc[:, 9]
y_f_10 = neighborhoods_short.iloc[:, 10]
y_f_11 = neighborhoods_short.iloc[:, 11]
y_f_12 = neighborhoods_short.iloc[:, 12]
y_f_13 = neighborhoods_short.iloc[:, 13]
y_f_14 = neighborhoods_short.iloc[:, 14]
y_f_15 = neighborhoods_short.iloc[:, 15]
y_f_16 = neighborhoods_short.iloc[:, 16]
y_f_17 = neighborhoods_short.iloc[:, 17]
y_f_18 = neighborhoods_short.iloc[:, 18]
y_f_19 = neighborhoods_short.iloc[:, 19]
y_f_20 = neighborhoods_short.iloc[:, 20]
y_f_21 = neighborhoods_short.iloc[:, 21]
y_f_22 = neighborhoods_short.iloc[:, 22]
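
The assignments above could also be expressed as a single mapping from column name to Series, which avoids tracking integer positions by hand. The sketch below uses a small made-up frame in place of `neighborhoods_short`; the column names and the `targets` dictionary are assumptions for illustration, not names from this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for neighborhoods_short: hypothetical columns, random values.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((5, 6)),
    columns=["name_a", "name_b", "pop_density", "working_age", "living_alone", "likes"],
)

# Positions 2..4 play the role of positions 2..22 in the notebook.
target_positions = range(2, 5)

# Map each column name to its Series so targets can be looked up by name.
targets = {demo.columns[i]: demo.iloc[:, i] for i in target_positions}

print(list(targets))
```

Looking targets up by name (for example `targets["working_age"]`) makes it easier to see which demographic variable a given regression uses.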

In [426]:
print(sm.OLS(y_f_18+y_f_19, sm.add_constant(neighborhoods_short[['likes']])).fit().summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.382
Model:                            OLS   Adj. R-squared:                  0.362
Method:                 Least Squares   F-statistic:                     19.13
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           0.000128
Time:                        19:54:02   Log-Likelihood:                 30.947
No. Observations:                  33   AIC:                            -57.89
Df Residuals:                      31   BIC:                            -54.90
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1878      0.033      5.622      0.0
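
The summary above labels the dependent variable `y`, most likely because adding two pandas Series with different names yields an unnamed Series. A minimal sketch of giving a combined target an explicit name before fitting, using made-up data; `col_18`, `col_19`, and `combined_target` are hypothetical names.

```python
import pandas as pd

# Two hypothetical demographic columns standing in for y_f_18 and y_f_19.
col_18 = pd.Series([0.10, 0.20, 0.15], name="col_18")
col_19 = pd.Series([0.05, 0.10, 0.20], name="col_19")

# Adding Series whose names differ drops the name from the result.
combined = col_18 + col_19
print(combined.name)  # None

# Renaming gives a regression summary a readable label.
combined = combined.rename("combined_target")
print(combined.name)
```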

## IV. Results


### Predicting Venue Likes
We found that population density, postsecondary degree attainment, and unemployment rate were the best predictors of venue likes. Together they account for about 62% of the variability in likes (R-squared = 0.619).

### Predicting Venue Price
Price was difficult to predict with the available demographic data. The best model we found used average household size and unemployment rate, both of which are negatively correlated with price and together account for only about a third of the variability in price (R-squared = 0.346).
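
As a cross-check on the regression summaries in this notebook, adjusted R-squared can be recomputed from R-squared, the number of observations n, and the number of predictors p with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper name `adjusted_r2` below is ours; the inputs are the values printed in the likes and price summaries.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for an OLS model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values taken from the OLS summaries: 33 neighborhoods in each model.
likes_adj = adjusted_r2(0.619, n_obs=33, n_predictors=3)
price_adj = adjusted_r2(0.346, n_obs=33, n_predictors=2)

print(round(likes_adj, 3))  # 0.58, matching the reported 0.580
print(round(price_adj, 3))  # 0.302, matching the reported 0.302
```

With only 33 observations, each extra predictor costs a noticeable amount of adjusted R-squared, which is one reason to keep these models small.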

### Predicting Venue Rating
We see that the percentage of working-age adults accounts for about half of the variability in rating, and that the higher the percentage, the higher the rating. This suggests new businesses should consider opening in areas with large working-age populations. The highest R-squared value we found (0.790), however, was for a model using unemployment rate, percentage of pre-retirement individuals, and percentage of people living alone. A lower unemployment rate and a lower percentage of pre-retirees, along with a slightly higher percentage of people living alone, are associated with higher ratings.

### Predicting Demographic Information
In testing various regression models that use Foursquare data to predict demographic information, we noticed that individual features can be fairly good predictors (accounting for 30 to 50% of the variability in the targets), but combining predictors achieves no significant increase in R-squared. This is likely because the three Foursquare predictors are themselves highly correlated, so each adds little information once the others are in the model.
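
The effect of correlated predictors can be illustrated with a small simulation. This is a sketch on synthetic data, not the notebook's data: a single shared factor drives three made-up predictors (named `likes`, `price`, and `rating` only for readability) and a made-up target, and the `r_squared` helper is our own least-squares implementation rather than a call from the notebook.

```python
import numpy as np
import pandas as pd

def r_squared(X: pd.DataFrame, y: pd.Series) -> float:
    """R-squared of an OLS fit with intercept, computed with least squares."""
    design = np.column_stack([np.ones(len(X)), X.to_numpy()])
    coef, *_ = np.linalg.lstsq(design, y.to_numpy(), rcond=None)
    residuals = y.to_numpy() - design @ coef
    total = y.to_numpy() - y.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)

# Synthetic data: one shared factor drives all three predictors and the target.
rng = np.random.default_rng(42)
factor = rng.normal(size=33)
data = pd.DataFrame({
    "likes": factor + 0.3 * rng.normal(size=33),
    "price": factor + 0.3 * rng.normal(size=33),
    "rating": factor + 0.3 * rng.normal(size=33),
})
target = pd.Series(factor + rng.normal(size=33), name="target")

# Pairwise correlations between the predictors.
print(data.corr().round(2))

# R-squared for each predictor alone, then for all three together.
single = {col: r_squared(data[[col]], target) for col in data.columns}
combined = r_squared(data, target)
print({k: round(v, 2) for k, v in single.items()}, round(combined, 2))
```

The combined fit can never be worse than the best single-predictor fit, but when the predictors carry nearly the same information the improvement tends to be small.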

In any case, here are the outcomes of various regression models:

#### A higher average number of *likes* for venues in a particular neighborhood indicates
* a greater population density (R-squared = 0.31)
* a higher percent of working age people (R-squared = 0.32)
* a higher percent of people living alone (R-squared = 0.37)
* a lower average household size (R-squared = 0.32)
* a lower percent of people who have not completed at least a bachelor's degree (R-squared = 0.39)
* a higher percent of people who have completed at least a bachelor's degree (R-squared = 0.36)

#### A higher average *price* for venues in a particular neighborhood indicates
* a lower percent of people who have not completed at least a bachelor's degree (R-squared = 0.42)
* a higher percent of people who have completed at least a bachelor's degree (R-squared = 0.35)

#### A higher average *rating* for venues in a particular neighborhood indicates
* a greater population density (R-squared = 0.42)
* a higher percent of working age people (R-squared = 0.51)
* a higher percent of people living alone (R-squared = 0.48)
* a lower average household size (R-squared = 0.42)
* a lower percent of people who have not completed at least a bachelor's degree (R-squared = 0.44)
* a higher percent of people who have completed at least a bachelor's degree (R-squared = 0.35)
* a higher workforce participation rate (R-squared = 0.38)
* a higher workforce employment rate (R-squared = 0.40)
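
A table like the lists above can be produced with a loop over feature and target pairs. The sketch below is run on synthetic data with hypothetical column names, so its numbers are unrelated to the results reported here. It relies on the fact that, for a regression with one predictor and an intercept, R-squared equals the squared Pearson correlation and the slope has the same sign as the correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for neighborhoods_short; all column names are hypothetical.
rng = np.random.default_rng(7)
n = 33
likes = rng.normal(size=n)
frame = pd.DataFrame({
    "likes": likes,
    "rating": 0.5 * likes + rng.normal(size=n),
    "pop_density": 0.8 * likes + rng.normal(size=n),
    "household_size": -0.8 * likes + rng.normal(size=n),
})

features = ["likes", "rating"]
targets = ["pop_density", "household_size"]

rows = []
for feature in features:
    for target in targets:
        # One predictor plus intercept: R-squared is the squared correlation,
        # and the sign of the correlation gives the direction of the trend.
        r = frame[feature].corr(frame[target])
        rows.append({
            "feature": feature,
            "target": target,
            "direction": "higher" if r > 0 else "lower",
            "r_squared": round(r ** 2, 2),
        })

summary = pd.DataFrame(rows)
print(summary)
```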

## V. Discussion

None of the regression models we created had R-squared values high enough for us to make meaningful recommendations. However, they do show some general trends. For instance, neighborhoods whose venues have higher-than-average ratings and likes are likely to have a higher proportion of working-age people and smaller households. Does this indicate that working-age people in smaller (or no) families are more likely to rate or like a venue? Or does it mean that they are more likely to rate it higher? Further analysis is needed.

Another interesting, although logical, trend is that neighborhoods with higher-priced venues (higher average price) are more likely to have more highly educated residents. Interestingly, though, average household income was not a good predictor of price, which may suggest that people patronize expensive venues not because they have more money but because they are more educated.

## VI. Conclusion

We began this project with the hope that we could find some linkages between Toronto's demographic data and Foursquare data for venues in Toronto. Although some of the results were promising, neither data set gave any complete conclusions. We recommend that governments and businesses that would like to use these results for their own predictions do so carefully. The results show general trends, but cannot be used to predict exact amounts. There is still too much variability that is not explained by the predictors used here, both for predicting demographic information and for predicting Foursquare information. However, we do believe these results can serve as a quick check for new businesses to see whether their other research matches the results found in this project.