# Is life in Toronto linear or nonlinear?

## 1. Preparing data

In the week 3 assignment of the Applied Data Science Capstone, venues for each neighborhood in each borough of Toronto are acquired through Foursquare. The postal codes, boroughs, and neighbourhoods are listed on Wikipedia, and their coordinates come from a CSV file provided on the assignment page.

1.1 import packages

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

1.2 get table from wiki page

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
table = soup.find('table')


1.3 transform the table into dataframe

In [4]:
df = pd.read_html(str(table))[0]
df.head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


1.4 change first row to column name

In [5]:
df.columns = df.iloc[0]
df.drop([0],inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
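
The header promotion in step 1.4 can be tried on a small stand-in table. This is only a minimal sketch: the `raw` frame below is made-up data shaped like the scraped table, not the real Wikipedia content, and the name `promoted` is introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the scraped table: the real header sits in row 0.
raw = pd.DataFrame([
    ['Postcode', 'Borough', 'Neighbourhood'],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
])

promoted = raw.copy()
promoted.columns = promoted.iloc[0]   # use the first row as column names
promoted = promoted.drop(index=0)     # drop the row that held the names
promoted.columns.name = None          # clear the leftover axis label (0)

print(promoted)
```

When the header row is known in advance, passing `header=0` to `pd.read_html` should give the same result in one step.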


1.5 remove all rows where borough is not assigned

In [6]:
df1 = df[df['Borough']!='Not assigned'].copy() # copy, so that later assignments do not act on a slice of df

1.6 for each neighbourhood that is not assigned, assign the borough value to neighbourhood value

In [7]:
index_1 = df1[df1['Neighbourhood']=='Not assigned'].index
df1.loc[index_1,'Neighbourhood'] = df1.loc[index_1,'Borough'].values

1.7 combine rows that share a postcode, with their neighbourhoods separated by commas

In [8]:
df2 = df1.groupby('Postcode',as_index=False).agg({'Borough':lambda x:x.iloc[0],\
                 'Neighbourhood':','.join})
df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
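
Steps 1.5 to 1.7 can be checked end to end on a few rows. The sketch below assumes made-up rows (`toy`) that cover the three cases: an unassigned borough, a postcode with two neighbourhoods, and a borough without a neighbourhood. Taking an explicit `.copy()` after filtering is one way to avoid pandas' `SettingWithCopyWarning` on the later assignment.

```python
import pandas as pd

# Made-up rows that mimic the three cases handled in steps 1.5 to 1.7.
toy = pd.DataFrame({
    'Postcode':      ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':       ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1.5: keep rows with a borough; .copy() makes the result an independent frame.
kept = toy[toy['Borough'] != 'Not assigned'].copy()

# 1.6: where the neighbourhood is missing, fall back to the borough name.
missing = kept['Neighbourhood'] == 'Not assigned'
kept.loc[missing, 'Neighbourhood'] = kept.loc[missing, 'Borough']

# 1.7: one row per postcode, neighbourhoods joined by commas.
combined = kept.groupby('Postcode', as_index=False).agg(
    {'Borough': 'first', 'Neighbourhood': ','.join}
)
print(combined)
```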


1.8 print the shape of the dataframe

In [9]:
df2.shape

(103, 3)

## 2. Get latitude and longitude from Geocoder

2.1 define the function to return coordinates

In [10]:
# import geocoder
# def get_coords(postcode):
#     # initialize your variable to None
#     lat_lng_coords = None
#
#     # loop until you get the coordinates
#     while(lat_lng_coords is None):
#         g = geocoder.google('{}, Toronto, Ontario'.format(postcode))
#         lat_lng_coords = g.latlng
#
#     latitude = lat_lng_coords[0]
#     longitude = lat_lng_coords[1]
#     return latitude,longitude

# the geocoder version above always fails with the error "max retries exceed with url...",
# so the csv provided on the assignment page is used instead
def get_coords(postcode,data):
    latitude = data.loc[postcode,'Latitude']
    longitude = data.loc[postcode,'Longitude']
    return latitude,longitude

2.2 use a for loop to get the coordinates

In [11]:
data = pd.read_csv('https://cocl.us/Geospatial_data',index_col=0) # load geographical coordinates, indexed by postal code
lat_list = []
lon_list = []
for postcode_i in df2['Postcode']:
    lat_i,lon_i = get_coords(postcode_i,data)
    lat_list.append(lat_i)
    lon_list.append(lon_i)
df2['Latitude'] = lat_list
df2['Longitude'] = lon_list


In [12]:
df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
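
The per-row lookup can also be expressed as a single join. The sketch below is an alternative under stated assumptions, not the notebook's method: `areas` stands in for `df2`, `coords` stands in for the coordinates CSV, and the index name `Postal Code` is a guess at how that file labels its first column.

```python
import pandas as pd

# Hypothetical stand-ins: `areas` plays the role of df2, `coords` the role of
# the coordinates CSV indexed by postal code.
areas = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Borough': ['Scarborough', 'Scarborough']})
coords = pd.DataFrame({'Latitude': [43.784535, 43.806686],
                       'Longitude': [-79.160497, -79.194353]},
                      index=pd.Index(['M1C', 'M1B'], name='Postal Code'))

# Left join on the postcode: the row order in `coords` does not matter.
located = areas.join(coords, on='Postcode')
print(located)
```

A join also keeps postcodes that are missing from the coordinates table (as NaN) instead of raising a `KeyError`, which may or may not be the behaviour you want.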


## 3. Explore and cluster the neighborhoods in Toronto

### 3.1 a closer look at the dataframe

In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df2['Borough'].unique()),
        df2.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


### 3.2 use geopy library to get the latitude and longitude values of Toronto

first import some packages

In [14]:
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library

get coordinates of Toronto

In [15]:
latitude = 43.653963
longitude = -79.387207


### 3.3 create a map of Toronto with neighborhoods superimposed on top

In [16]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Define Foursquare credentials and version

In [17]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


### 3.4 explore the neighborhoods

Define a function to get nearby venues

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500,LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
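
The flattening step inside `getNearbyVenues` can be exercised without calling the API. The sketch below is illustrative only: `flatten_items` is a hypothetical helper, and `sample_items` is hand-written data in the shape that the function above indexes into, not real Foursquare output. It also assumes that a venue may arrive with an empty `categories` list, in which case `categories[0]` would raise an `IndexError`.

```python
import pandas as pd

def flatten_items(name, lat, lng, items):
    """Turn one neighborhood's list of explore items into flat row tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None rather than raising IndexError on an empty list.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written sample (assumed shape); not real API output.
sample_items = [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.80, 'lng': -79.19},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.81, 'lng': -79.20},
               'categories': []}},
]

columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
venues = pd.DataFrame(flatten_items('Toy Area', 43.8, -79.2, sample_items),
                      columns=columns)
print(venues)
```

Passing `columns=` to the constructor also keeps the frame well-formed when no venues are returned at all, whereas assigning seven names to an empty frame afterwards would fail.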

for each neighborhood in Toronto, get its venues

In [19]:
toronto_venues = getNearbyVenues(names=df2['Neighbourhood'],
                                   latitudes=df2['Latitude'],
                                   longitudes=df2['Longitude']
                                  )

Rouge,Malvern
Highland Creek,Rouge Hill,Port Union
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Clairlea,Golden Mile,Oakridge
Cliffcrest,Cliffside,Scarborough Village West
Birch Cliff,Cliffside West
Dorset Park,Scarborough Town Centre,Wexford Heights
Maryvale,Wexford
Agincourt
Clarks Corners,Sullivan,Tam O'Shanter
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview,Henry Farm,Oriole
Bayview Village
Silver Hills,York Mills
Newtonbrook,Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park,Don Mills South
Bathurst Manor,Downsview North,Wilson Heights
Northwood Park,York University
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens,Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West,Riverdale
The Beaches West,Indi

check the size of the resulting dataframe

In [20]:
print(toronto_venues.shape)
toronto_venues.head()

(2255, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge,Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,"Guildwood,Morningside,West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


check how many venues were returned for each neighborhood

In [21]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100
Agincourt,5,5,5,5,5,5
"Agincourt North,L'Amoreaux East,Milliken,Steeles East",2,2,2,2,2,2
"Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown",9,9,9,9,9,9
"Alderwood,Long Branch",9,9,9,9,9,9
"Bathurst Manor,Downsview North,Wilson Heights",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park,Lawrence Manor East",21,21,21,21,21,21
Berczy Park,57,57,57,57,57,57
"Birch Cliff,Cliffside West",4,4,4,4,4,4


find out how many unique categories can be curated from all the returned venues

In [22]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 274 unique categories.


### 3.5 group the venues by type

In [23]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


examine the new dataframe size.

In [24]:
toronto_onehot.shape

(2255, 274)
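
The shape (2255, 274) matches the 274 unique categories even though a `Neighborhood` column was added, and the preview above begins with `Yoga Studio` rather than `Neighborhood`. One possible explanation, which the notebook does not confirm, is that a venue category is itself called `Neighborhood`, so the assignment overwrote that one-hot column instead of appending a new one. The sketch below reproduces that effect on made-up data and shows one possible guard; the column label `Area` is an arbitrary choice.

```python
import pandas as pd

# Toy venues; one of them has the category "Neighborhood" (assumed for illustration).
toy_venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Neighborhood', 'Park'],
})

onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="", dtype=int)
n_before = onehot.shape[1]                            # 2 category columns
onehot['Neighborhood'] = toy_venues['Neighborhood']   # overwrites the category column
n_after = onehot.shape[1]                             # still 2

# One possible guard: store the area name under a label that cannot clash.
safe = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="", dtype=int)
safe.insert(0, 'Area', toy_venues['Neighborhood'])
print(n_before, n_after, safe.shape[1])
```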

group rows by neighborhood, summing the occurrences of each category

In [25]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').sum().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,1,0,0,0
1,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Alderwood,Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"Bathurst Manor,Downsview North,Wilson Heights",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
6,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Bedford Park,Lawrence Manor East",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Berczy Park,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
9,"Birch Cliff,Cliffside West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
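
The cell above sums the one-hot rows, so each entry is a venue count rather than a frequency. The difference is easy to see on a toy frame. This is a minimal sketch with made-up venues; `pd.crosstab` is shown only as an alternative route to the same counts.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Park'],
})

onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

counts = onehot.groupby('Neighborhood').sum()    # number of venues per category
shares = onehot.groupby('Neighborhood').mean()   # fraction of an area's venues per category
crosstab = pd.crosstab(toy_venues['Neighborhood'], toy_venues['Venue Category'])

print(counts)
print(shares)
```

Counts favour neighborhoods with many venues overall, while shares describe the mix of venue types, so the choice affects what a later regression on these columns measures.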


find venue types that have more than 30 nonzero entries

In [30]:
valid_col =toronto_grouped.columns[toronto_grouped.astype(bool).sum(axis=0)>30]
toronto_grouped2 = toronto_grouped[valid_col]
toronto_grouped2.head()

Unnamed: 0,Neighborhood,Café,Coffee Shop,Park,Pizza Place,Restaurant,Sandwich Place
0,"Adelaide,King,Richmond",5,8,0,2,3,1
1,Agincourt,0,0,0,0,0,1
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0,0,1,0,0,0
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0,1,0,1,0,1
4,"Alderwood,Long Branch",0,1,0,2,0,1
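
The filter above keeps the `Neighborhood` column only because non-empty strings convert to `True` under `astype(bool)`. A more explicit variant separates the label column from the count columns. The sketch below assumes made-up counts and a toy threshold of 2 in place of the notebook's 30.

```python
import pandas as pd

toy_counts = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Café':         [5, 0, 1, 2],
    'Park':         [0, 0, 1, 0],
})

threshold = 2   # toy value; the notebook uses 30
category_cols = toy_counts.columns.drop('Neighborhood')

# Number of neighborhoods with at least one venue of each category.
nonzero = (toy_counts[category_cols] != 0).sum(axis=0)

kept_cols = ['Neighborhood'] + nonzero[nonzero > threshold].index.tolist()
filtered = toy_counts[kept_cols]
print(filtered)
```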


add coordinates information to the dataframe

In [34]:
df2.columns = ['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']
toronto_merged = df2

# join toronto_grouped2 onto df2 so that each neighborhood row carries both its coordinates and its venue counts
toronto_merged = toronto_merged.join(toronto_grouped2.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Café,Coffee Shop,Park,Pizza Place,Restaurant,Sandwich Place
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,0.0,0.0,0.0,1.0,0.0,0.0
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,2.0,0.0,0.0,0.0,0.0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,0.0,0.0,0.0,0.0,0.0
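
The counts appear as floats (`0.0`, `1.0`) after the join. A likely reason is that `join` is a left join: neighborhoods with no returned venues have no matching row, get NaN, and NaN forces the column to a float dtype. The sketch below shows this on made-up frames; the `fillna` step is an optional choice, not something the notebook does.

```python
import pandas as pd

areas = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Latitude': [43.7, 43.8]})
counts = pd.DataFrame({'Neighborhood': ['A'], 'Café': [3]})

merged = areas.join(counts.set_index('Neighborhood'), on='Neighborhood')

# 'B' has no row in `counts`, so its Café value is NaN and the column becomes float.
n_missing = int(merged['Café'].isna().sum())

# Optional: treat "no venues returned" as a count of zero.
filled = merged.fillna({'Café': 0})
print(merged)
print(filled)
```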


## 4. Perform linear regression and nonlinear regression on the processed data

Please note that, because the Foursquare server is unstable, the result table is not exactly the same as in the report.

In [44]:
from sklearn.linear_model import LinearRegression
y_name_list = valid_col[1:]
res = pd.DataFrame(index=y_name_list,columns=['linear_r2','nonlinear_r2'])
x_list=[]
x2_list=[]
y_list=[]
for name in y_name_list:
    y=toronto_merged[name]
    x = toronto_merged[['Latitude','Longitude']]
    # keep only neighborhoods that have at least one venue of this type
    ind = np.where((y!=0) & ~np.isnan(y))[0]
    y_valid = y.iloc[ind]
    x_valid = x.iloc[ind,:]
    # "nonlinear" model: add squared latitude and longitude as extra features
    x2_valid = pd.concat([x_valid,np.square(x_valid)],axis=1)
    x2_valid.columns=['Latitude','Longitude','Latitude2','Longitude2']
    y_list.append(y_valid)
    x_list.append(x_valid)
    x2_list.append(x2_valid)
    
    # in-sample R^2 of the linear fit
    lm = LinearRegression()
    lm.fit(x_valid,y_valid)    
    res.loc[name,'linear_r2'] = lm.score(x_valid,y_valid)
    # in-sample R^2 of the fit with squared terms
    lm2 = LinearRegression()
    lm2.fit(x2_valid,y_valid)
    res.loc[name,'nonlinear_r2'] = lm2.score(x2_valid,y_valid)

# benchmark: pool the samples of all venue types into one fit
x_all = pd.concat(x_list,axis=0)
x2_all = pd.concat(x2_list,axis=0)
y_all = pd.concat(y_list,axis=0)
lm = LinearRegression()
lm.fit(x_all,y_all)    
res.loc['benchmark','linear_r2'] = lm.score(x_all,y_all)
lm2 = LinearRegression()
lm2.fit(x2_all,y_all)
res.loc['benchmark','nonlinear_r2'] = lm2.score(x2_all,y_all)
res

Unnamed: 0,linear_r2,nonlinear_r2
Café,0.248444,0.339181
Coffee Shop,0.248735,0.306205
Park,0.120205,0.147143
Pizza Place,0.0672385,0.129402
Restaurant,0.232242,0.397214
Sandwich Place,0.0143486,0.0784157
benchmark,0.106438,0.127103
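
When reading this table, note that both columns are in-sample R² values and that the linear model is nested inside the model with squared terms. For least-squares fits, adding features cannot lower the in-sample R², so `nonlinear_r2 >= linear_r2` is expected even when the extra terms carry no signal. The sketch below illustrates this on synthetic noise. It uses plain NumPy least squares in place of `LinearRegression`, and the data and the helper `r2_least_squares` are made up for illustration.

```python
import numpy as np

def r2_least_squares(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))   # synthetic stand-in for (Latitude, Longitude)
y = rng.normal(size=40)        # pure noise: no real dependence on X

r2_linear = r2_least_squares(X, y)
r2_quadratic = r2_least_squares(np.hstack([X, X ** 2]), y)
print(r2_linear, r2_quadratic)
```

A gap between the two columns is therefore not evidence of nonlinearity on its own. Comparing adjusted R² or scores on held-out neighborhoods would be a fairer test.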
