## Capstone Week 3

Contents:
1. Notebook Setup
2. Scraping and Cleaning Toronto Data
3. Enriching Data With Coordinates
4. Exploring and Clustering Neighbourhoods In Toronto

## 1. Notebook Setup

Install required libraries - _code hidden for convenience_

<!--

try:
    print("Installing BeautifulSoup4...")
    !conda install -c conda-forge beautifulsoup4 --yes
    print("BeautifulSoup4 has been successfully installed")
except:
    print("ERROR: could not install BeautifulSoup4")

try:
    print('Installing GeoPy...')
    !conda install -c conda-forge geopy --yes
    print('GeoPy has been successfully installed')
except:
    print('ERROR: could not install GeoPy')

try:
    print('Installing Folium...')
    !conda install -c conda-forge folium=0.5.0 --yes
    print('Folium has been successfully installed')
except:
    print('ERROR: could not install Folium')

-->

In [74]:
try:
    import numpy as np
    import pandas as pd
    from pandas import json_normalize  # top-level location since pandas 1.0
    import requests
    import matplotlib as mp
    import matplotlib.cm as cm
    import matplotlib.colors as colors
    from sklearn.cluster import KMeans
    from bs4 import BeautifulSoup as bts
    from geopy.geocoders import Nominatim
    import folium
    print('All libraries imported successfully')
except ImportError as err:
    # Report which import failed instead of hiding the cause
    print('ERROR: Could not import all libraries:', err)

All libraries imported successfully
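
The import cell reports success or failure as a whole. A minimal sketch of a more specific check, using the standard library's `importlib.util.find_spec`, is shown below. The helper name `missing_modules` and the list `required` are illustrative assumptions and are not part of the original notebook.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


# Import names (not package names): beautifulsoup4 is imported as bs4,
# scikit-learn as sklearn.
required = ['numpy', 'pandas', 'requests', 'matplotlib',
            'sklearn', 'bs4', 'geopy', 'folium']
print('Missing:', missing_modules(required) or 'none')
```

Running this before the import cell would show exactly which packages still need to be installed.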


## 2. Scraping and Cleaning Toronto Data

***Download and parse Toronto data***

- Read the Wikipedia page
- Parse the page using BeautifulSoup
- Find the table and convert it into a list

In [2]:
toronto_wiki_data  = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
print('Downloaded the wikipedia page as HTML')

toronto_soup = bts(toronto_wiki_data,'html.parser')
print('Parsed HTML by using BeautifulSoup')

toronto_table_list = toronto_soup.table.tbody.text.split('\n')
print('Converted HTML table into a list')

Downloaded the wikipedia page as HTML
Parsed HTML by using BeautifulSoup
Converted HTML table into a list


***Clean and split list into header / row***

- Remove empty elements for simplicity
- Get header and row data into separate lists for further processing

In [3]:
toronto_table_list = list(filter(lambda x: x != '', toronto_table_list))
print('Removed empty elements')

toronto_table_headers = toronto_table_list[0:3]
toronto_table_rows = np.array(toronto_table_list[3:]).reshape(len(toronto_table_list[3:]) // 3, 3)
print('Split headers and rows')

Removed empty elements
Split headers and rows
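
Dropping empty strings and reshaping into three columns only works if every table row has exactly three non-empty cells. The sketch below shows the same split in plain Python with an explicit length check. The function name `chunk_rows` and the sample cells are illustrative assumptions. The check catches a cell count that is not a multiple of three, but it cannot detect every misalignment.

```python
def chunk_rows(cells, width=3):
    """Split a flat list of cell strings into rows of `width` cells each."""
    if len(cells) % width != 0:
        raise ValueError(
            'expected a multiple of {} cells, got {}'.format(width, len(cells)))
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Illustrative sample shaped like the scraped table: header first, then rows.
cells = ['Postcode', 'Borough', 'Neighbourhood',
         'M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods']

headers, *rows = chunk_rows(cells)
print(headers)
print(rows)
```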


***Create dataframe and perform additional cleaning and transformation***

- Create the dataframe using header and row data
- Eliminate 'Not assigned' boroughs 
- Update 'Not assigned' neighbourhoods
- Combine neighbourhoods that share a postal code into a single row

In [4]:
df_toronto = pd.DataFrame(np.nan_to_num(toronto_table_rows),columns = toronto_table_headers)
print('Created dataframe')

df_toronto = df_toronto[df_toronto['Borough']!='Not assigned'].reset_index(drop=True)
print('Eliminated \'Not assigned\' boroughs')

df_toronto['Neighbourhood'] = df_toronto.apply(lambda x: x['Borough'] if x['Neighbourhood']=='Not assigned' else x['Neighbourhood'], axis=1)
print('Updated \'Not assigned\' neighbourhoods')

df_temp = pd.DataFrame()
df_temp['Neighbourhood'] = df_toronto.groupby(['Postcode']).apply(lambda x: ', '.join(x['Neighbourhood'].values))
df_temp = df_temp.reset_index().merge(df_toronto[['Postcode','Borough']], how='left', on='Postcode').drop_duplicates().reset_index(drop=True)
df_toronto = df_temp[['Postcode','Borough','Neighbourhood']]
del df_temp
print('Combined neighbourhoods in one row')

df_toronto.rename({'Postcode':'Postal Code'}, axis=1, inplace=True)

Created dataframe
Eliminated 'Not assigned' boroughs
Updated 'Not assigned' neighbourhoods
Combined neighbourhoods in one row
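
To make the three cleaning rules easier to follow, here is a minimal plain-Python sketch of the same logic applied to a handful of rows. The function name `clean_postcodes` and the sample rows are illustrative assumptions. Unlike the pandas `groupby` in the cell, this version keeps the input order rather than sorting by postal code, and it assumes each postal code belongs to a single borough.

```python
def clean_postcodes(rows):
    """Apply the cleaning rules to (postcode, borough, neighbourhood) tuples."""
    combined = {}  # postcode -> (borough, [neighbourhoods]); keeps insertion order
    for postcode, borough, neighbourhood in rows:
        if borough == 'Not assigned':
            continue  # rule 1: drop rows without a borough
        if neighbourhood == 'Not assigned':
            neighbourhood = borough  # rule 2: fall back to the borough name
        combined.setdefault(postcode, (borough, []))[1].append(neighbourhood)
    # rule 3: one row per postal code, neighbourhoods joined with ', '
    return [(pc, b, ', '.join(ns)) for pc, (b, ns) in combined.items()]


sample = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
cleaned = clean_postcodes(sample)
print(cleaned)
```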


In [5]:
df_toronto.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [6]:
df_toronto.shape

(103, 3)

## 3. Enriching Data With Coordinates

***Read coordinates data***

- Use the CSV file of coordinates, as it is faster and less error-prone

In [7]:
df_toronto_coords = pd.read_csv('https://cocl.us/Geospatial_data')
print('Downloaded coordinates')

Downloaded coordinates


***Merge coordinates with existing dataframe***

- Append coordinate data to the Toronto dataset

In [8]:
df_toronto = df_toronto.merge(df_toronto_coords, how='left', on='Postal Code')
del df_toronto_coords
print('Merged coordinates with Toronto dataset')

Merged coordinates with Toronto dataset
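
A left merge silently leaves `NaN` coordinates for any postal code missing from the CSV. The sketch below shows one way to surface such rows with the `validate` and `indicator` parameters of `DataFrame.merge`. The two small frames are illustrative assumptions, and `M9Z` is a made-up postal code used only to produce an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code
    'Borough': ['Scarborough', 'Scarborough', 'Example'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate raises MergeError if either side has duplicate postal codes;
# indicator adds a '_merge' column recording where each row came from.
merged = left.merge(coords, how='left', on='Postal Code',
                    validate='one_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print('Postal codes without coordinates:', unmatched)
```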


In [9]:
df_toronto.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


## 4. Exploring and Clustering Neighbourhoods In Toronto

***Details of Toronto neighbourhoods***

- Borough / Neighbourhood breakdown
- Visualization on map

In [10]:
print('Toronto has {} boroughs and {} neighborhoods.'.format(len(df_toronto['Borough'].unique()),df_toronto.shape[0]))

geolocator = Nominatim(user_agent="toronto")
location = geolocator.geocode('Toronto, Ontario')
lat_offset = 0.05 # tiny bit offset for better map visibility
map_toronto = folium.Map(location=[location.latitude+lat_offset, location.longitude], zoom_start=11)

for lat, lng, borough, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Toronto has 11 boroughs and 103 neighborhoods.


***Foursquare setup***

- Credentials (Hidden)
- Base URI

In [11]:
# The code was removed by Watson Studio for sharing.

Foursquare Client Key and Base URI set up successfully
Base URI: https://api.foursquare.com/v2


***Fetch and explore nearby venues***

- Venue breakdown
- Visualization on map

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = '{}/venues/explore?&{}&ll={},{}&radius={}'.format(BASE_URI, CLIENT_KEY, lat, lng, radius)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']    
    return nearby_venues

df_toronto_venues = getNearbyVenues(df_toronto['Neighbourhood'], df_toronto['Latitude'], df_toronto['Longitude'])
print('Fetched all the venues')

print(df_toronto.shape, df_toronto_venues.shape)

Fetched all the venues
(103, 5) (1338, 7)
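
`getNearbyVenues` indexes straight into the JSON response, so it would raise a `KeyError` or `IndexError` if a response had no `groups` entry or a venue had no category. The sketch below flattens one response defensively. The function name `venue_rows` and the sample payload are illustrative assumptions. The payload only mirrors the keys that the cell above reads and is not a complete Foursquare response.

```python
def venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into venue tuples for a neighbourhood."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of failing
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload containing only the keys used above.
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': "Wendy's",
               'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 43.8060, 'lng': -79.1990},
               'categories': []}},
]}]}}

rows = venue_rows('Rouge, Malvern', 43.806686, -79.194353, payload)
print(rows)
print(venue_rows('Nowhere', 0.0, 0.0, {'response': {}}))
```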


In [13]:
df_toronto_venues.head()

| | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Rouge, Malvern | 43.806686 | -79.194353 | Wendy's | 43.807448 | -79.199056 | Fast Food Restaurant |
| 1 | Rouge, Malvern | 43.806686 | -79.194353 | Interprovincial Group | 43.80563 | -79.200378 | Print Shop |
| 2 | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 | RIGHT WAY TO GOLF | 43.785177 | -79.161108 | Golf Course |
| 3 | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 | Royal Canadian Legion | 43.782533 | -79.163085 | Bar |
| 4 | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 | Scarborough Historical Society | 43.788755 | -79.162438 | History Museum |


In [14]:
print('There are total {} venues and {} unique categories'.format(df_toronto_venues.shape[0], len(df_toronto_venues['Venue Category'].unique())))

geolocator = Nominatim(user_agent="toronto")
location = geolocator.geocode('Toronto, Ontario')
lat_offset = 0.05 # tiny bit offset for better map visibility
map_toronto_venue = folium.Map(location=[location.latitude+lat_offset, location.longitude], zoom_start=11)

for lat, lng, venue_name, venue_category in zip(df_toronto_venues['Venue Latitude'], df_toronto_venues['Venue Longitude'], df_toronto_venues['Venue'], df_toronto_venues['Venue Category']):
    label = '{}, {}'.format(venue_name, venue_category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_venue)  
    
map_toronto_venue

There are total 1338 venues and 238 unique categories


***Analyse neighbourhoods***

- One-hot encode the venue categories
- Compute the mean frequency of each category per neighbourhood
- Cluster the neighbourhoods using k-means
- Visualize the clusters on a map

In [79]:
df_toronto_onehot = pd.get_dummies(df_toronto_venues[['Venue Category']], prefix="", prefix_sep="")
df_toronto_onehot['Neighbourhood'] = df_toronto_venues['Neighbourhood'] 
df_toronto_onehot = df_toronto_onehot[[df_toronto_onehot.columns[-1]] + list(df_toronto_onehot.columns[:-1])]
print('Encoded using one-hot encode')

df_toronto_onehot = df_toronto_onehot.groupby('Neighbourhood').mean().reset_index()
print('Calculated the mean of frequency of each category')

df_toronto_onehot.head()

Encoded using one-hot encode
Calculated the mean of frequency of each category


| | Neighbourhood | Accessories Store | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Antique Shop | ... | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.033333 | 0.0 | ... | 0.033333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Agincourt North, L'Amoreaux East, Milliken, St... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Albion Gardens, Beaumond Heights, Humbergate, ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Alderwood, Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
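
Each value in this table is the share of a neighbourhood's venues that fall into a category, because the mean of 0/1 indicator columns is a proportion. A minimal sketch on made-up data illustrates this. The neighbourhood names `A` and `B` and their venues are assumptions chosen so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Made-up venues: 'A' has two cafes and one park, 'B' has one park.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Park'],
})

# One indicator column per category (1.0 where the venue has that category).
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Averaging the indicators per neighbourhood gives category frequencies;
# each row of the result sums to 1.
freq = onehot.groupby('Neighbourhood').mean()
print(freq)

# Most frequent category in each neighbourhood.
top_category = freq.idxmax(axis=1)
print(top_category)
```

For `A`, the `Cafe` column averages to 2/3 and `Park` to 1/3, which is the kind of value (such as 0.033333) seen in the notebook output.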


In [97]:
kclusters = 5
df_toronto_clustering = df_toronto_onehot.copy()
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_toronto_clustering.drop(columns='Neighbourhood'))
print('Completed clustering')

df_toronto_clustering.insert(0, 'Cluster Labels', kmeans.labels_)
df_toronto_clustering = df_toronto_clustering.merge(df_toronto[['Neighbourhood', 'Latitude', 'Longitude']], how='left', on='Neighbourhood')
df_toronto_clustering = df_toronto_clustering[[df_toronto_clustering.columns[1]] + list(df_toronto_clustering.columns[-2:]) + [df_toronto_clustering.columns[0]] + list(df_toronto_clustering.columns[2:-2])]
print('Updated dataframe with cluster information')

df_toronto_clustering.head()

| | Neighbourhood | Latitude | Longitude | Cluster Labels | Accessories Store | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | ... | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | 43.650571 | -79.384568 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.033333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Agincourt | 43.7942 | -79.262029 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Agincourt North, L'Amoreaux East, Milliken, St... | 43.815252 | -79.284577 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Albion Gardens, Beaumond Heights, Humbergate, ... | 43.739416 | -79.588437 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Alderwood, Long Branch | 43.602414 | -79.543484 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
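
For intuition about what the cluster labels mean, the sketch below implements the basic assign-then-update loop of k-means (Lloyd's algorithm) in NumPy. It is a simplified illustration and not scikit-learn's actual implementation: it takes fixed starting centroids instead of the library's initialisation, and the function name `kmeans_sketch` and the four sample points are assumptions.

```python
import numpy as np


def assign_labels(points, centroids):
    """Return the index of the nearest centroid for every point."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans_sketch(points, initial_centroids, n_iter=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(n_iter):
        labels = assign_labels(points, centroids)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids
    return assign_labels(points, centroids), centroids


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans_sketch(points, initial_centroids=[[0, 0], [10, 10]])
print(labels)
print(centroids)
```

In the notebook each point is a neighbourhood's vector of category frequencies, so neighbourhoods with similar venue mixes end up with the same label.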


In [98]:
geolocator = Nominatim(user_agent="toronto")
location = geolocator.geocode('Toronto, Ontario')
lat_offset = 0.05 # tiny bit offset for better map visibility
map_toronto_clusters = folium.Map(location=[location.latitude+lat_offset, location.longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(df_toronto_clustering['Latitude'], df_toronto_clustering['Longitude'], df_toronto_clustering['Neighbourhood'], df_toronto_clustering['Cluster Labels']):
    label = folium.Popup(str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels are 0-based, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_toronto_clusters)
    
map_toronto_clusters