# Applied Data Science - Capstone Project Notebook

This notebook documents my progress on the Capstone Project, part of Coursera's class on Applied Data Science.

## Week 3 Activities

- Scrape the list of Toronto neighborhoods off the following Wikipedia page [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
- Get the latitude and the longitude coordinates of each neighborhood
- Explore and cluster the neighborhoods in Toronto
- Generate maps to visualize Toronto neighborhoods and how they cluster together


### Scraping Wikipedia

In this activity I use the [`pandas.read_html`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html) function. Unfortunately, the wiki page contains some information that I don't need for my dataframe. I get rid of it by:
- using a *regexp* matching the Canadian postal codes format for Toronto area
- selecting just the first dataframe
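
The cell in this section only selects the first table, so the regexp step is sketched here separately. This is a minimal sketch, assuming that Toronto codes have the form `M`, one digit, one letter; the sample rows and the names `sample_df` and `fsa_pattern` are made up for illustration.

```python
import pandas as pd

# hypothetical sample standing in for the scraped table
sample_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'K1A', 'Legend'],
    'Borough': ['North York', 'North York', 'Ottawa', 'n/a'],
})

# assumed format of a Toronto postal code prefix: 'M', one digit, one letter
fsa_pattern = r'^M\d[A-Z]$'

# keep only the rows whose postal code matches the pattern
is_toronto = sample_df['Postal Code'].str.match(fsa_pattern)
toronto_only = sample_df[is_toronto].reset_index(drop=True)
print(toronto_only)
```

Filtering on the column values rather than on the page text keeps the check independent of how the wiki page is laid out.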

In [10]:

import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

# form the URL
postal_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# fetch the needed data
postal_df = pd.read_html(postal_url)[0]
postal_df.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [11]:
# get rid of the rows where borough is "Not assigned"
postal_df.drop(postal_df[postal_df['Borough'] == 'Not assigned'].index, inplace = True)
postal_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [12]:
# demonstrate the resulting number of rows in the neighborhoods dataframe
postal_df.shape

(103, 3)

### Getting Latitude and Longitude coordinates for each neighborhood

Unfortunately, `geocoder` seems to be an unreliable method for coordinate look-up. Also, both Google and Bing insist on an API key associated with billing information. Since I'm not interested in spending actual money on an experiment, I will use the coordinates CSV file provided in the assignment: [https://cocl.us/Geospatial_data](https://cocl.us/Geospatial_data).

In [13]:
geo_url = "https://cocl.us/Geospatial_data"
geo_df = pd.read_csv(geo_url)
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [85]:
toronto_df = pd.merge(postal_df, geo_df, on = 'Postal Code')
toronto_df.rename(columns={'Neighbourhood':'Neighborhood'}, inplace = True)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
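
The merge above silently drops any postal code that has no coordinates. A minimal sketch of how one might check for such rows, using a left merge with `indicator=True`; the two small dataframes and the postal code `M9Z` are hypothetical stand-ins for `postal_df` and `geo_df`.

```python
import pandas as pd

# hypothetical miniature versions of postal_df and geo_df
postal_small = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                             'Borough': ['North York', 'North York', 'Unknown']})
geo_small = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                          'Latitude': [43.753259, 43.725882],
                          'Longitude': [-79.329656, -79.315572]})

# a left merge keeps every postal code; `_merge` records where each row came from
checked = pd.merge(postal_small, geo_small, on='Postal Code', how='left', indicator=True)

# postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner merge lost nothing.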


### Exploring and clustering Toronto neighborhoods


In [86]:
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library


In [87]:
# brief summary of the dataset
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_df['Borough'].unique()),
        toronto_df.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [88]:
# get Toronto coordinates
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [89]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

In [25]:
import os

# read the Foursquare credentials from the environment instead of hard-coding them
# (the environment variable names below are my own choice)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # my Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # my Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius


We re-use the `getNearbyVenues()` function from the Week 3 lab.

In [94]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    url = 'https://api.foursquare.com/v2/venues/explore'
    venues_list = []
    for counter, (name, lat, lng) in enumerate(zip(names, latitudes, longitudes)):
        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request
        r = requests.get(url, params=params)
        print(counter, r.status_code, name)

        # guard against a response that comes back without any groups
        groups = r.json()['response'].get('groups', [])
        results = groups[0]['items'] if groups else []

        # keep only the relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                # guard against a venue without a category
                categories[0]['name'] if categories else None))

    return pd.DataFrame(venues_list, columns=[
        'Neighborhood',
        'Neighborhood Latitude',
        'Neighborhood Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])
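
The parsing step of `getNearbyVenues()` can be tried without calling the API. Below is a minimal sketch of a stand-alone parser; the helper name `extract_venues` and both payloads are hypothetical, shaped only after the fields that the function above reads.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# hypothetical payloads for illustration
full = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Venue Without Category',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}
empty = {'response': {}}

print(extract_venues(full))
print(extract_venues(empty))
```

Keeping the parsing separate from the HTTP call makes it easy to test the edge cases (no groups, no categories) offline.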

Now we iterate through all Toronto neighborhoods and collect the data.

In [95]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

0 200 Parkwoods
1 200 Victoria Village
2 200 Regent Park, Harbourfront
3 200 Lawrence Manor, Lawrence Heights
4 200 Queen's Park, Ontario Provincial Government
5 200 Islington Avenue, Humber Valley Village
6 200 Malvern, Rouge
7 200 Don Mills
8 200 Parkview Hill, Woodbine Gardens
9 200 Garden District, Ryerson
10 200 Glencairn
11 200 West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
12 200 Rouge Hill, Port Union, Highland Creek
13 200 Don Mills
14 200 Woodbine Heights
15 200 St. James Town
16 200 Humewood-Cedarvale
17 200 Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
18 200 Guildwood, Morningside, West Hill
19 200 The Beaches
20 200 Berczy Park
21 200 Caledonia-Fairbanks
22 200 Woburn
23 200 Leaside
24 200 Central Bay Street
25 200 Christie
26 200 Cedarbrae
27 200 Hillcrest Village
28 200 Bathurst Manor, Wilson Heights, Downsview North
29 200 Thorncliffe Park
30 200 Richmond, Adelaide, King
31 200 Dufferin, Dovercourt Village
32 200 Scarborough Vill

Let's take a look at the collected data

In [99]:
print(toronto_venues.shape)
print("Obtained venue information for ", len(toronto_venues['Neighborhood'].unique()), "neighborhoods")
toronto_venues.head()

(2141, 7)
Obtained venue information for  96 neighborhoods


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Let's look at how many venues we have per neighborhood.

In [103]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",7,7,7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
"Willowdale, Willowdale East",34,34,34,34,34,34
"Willowdale, Willowdale West",6,6,6,6,6,6
Woburn,4,4,4,4,4,4
Woodbine Heights,5,5,5,5,5,5
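
Since `count()` repeats the same number in every column, a single count per neighborhood may be easier to read. A minimal sketch using `groupby(...).size()` on the five sample rows shown earlier; the threshold `min_venues` is an arbitrary value chosen for illustration.

```python
import pandas as pd

# the five sample rows from the toronto_venues.head() output
venues_small = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Parkwoods',
                     'Victoria Village', 'Victoria Village'],
    'Venue': ['Brookbanks Park', 'Variety Store', 'Corrosion Service Company Limited',
              'Victoria Village Arena', 'Portugril'],
})

# one number per neighborhood, largest first
venue_counts = venues_small.groupby('Neighborhood').size().sort_values(ascending=False)
print(venue_counts)

# neighborhoods with fewer than `min_venues` venues (arbitrary threshold)
min_venues = 3
sparse = venue_counts[venue_counts < min_venues].index.tolist()
print(sparse)
```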


Let's find out how many unique categories can be curated from all the returned venues

In [104]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 269 unique categories.


#### Analyze each neighborhood

In [105]:
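# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# (if a venue category is itself called 'Neighborhood', this overwrites that dummy column instead of appending a new one)
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

# move the last column to the first position
# (this is the neighborhood column only if the assignment above appended a new column)
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]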

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [106]:
toronto_onehot.shape

(2141, 269)
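
The shape `(2141, 269)` equals the number of unique categories, which suggests that one of the venue categories is itself named `Neighborhood` and that the assignment overwrote its dummy column instead of adding a new one. A minimal sketch of one way to avoid such a collision, on made-up rows; the replacement label `Neighborhood (venue category)` is my own choice.

```python
import pandas as pd

# hypothetical rows; the second venue has the colliding category name
venues_small = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Victoria Village'],
    'Venue Category': ['Park', 'Neighborhood', 'Hockey Arena'],
})

# rename the colliding category before encoding
categories = venues_small['Venue Category'].replace(
    {'Neighborhood': 'Neighborhood (venue category)'})
onehot = pd.get_dummies(categories)

# insert() raises a ValueError if the column already exists,
# so a collision cannot pass silently
onehot.insert(0, 'Neighborhood', venues_small['Neighborhood'])
print(onehot.columns.tolist())
```

With this approach the encoded frame has one column more than there are categories, which is a quick sanity check on the shape.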

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [107]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0
92,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
93,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
94,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
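
To see why the mean of one-hot columns is a frequency, here is a small worked example on made-up data: each row is one venue, so the mean of a 0/1 column within a neighborhood is the share of that neighborhood's venues in the category.

```python
import pandas as pd

# hypothetical one-hot rows: four venues in neighborhood A, two in B
onehot_small = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Coffee Shop':  [1, 1, 0, 0, 0, 0],
    'Park':         [0, 0, 1, 0, 1, 1],
    'Gym':          [0, 0, 0, 1, 0, 0],
})

grouped_small = onehot_small.groupby('Neighborhood').mean().reset_index()
print(grouped_small)

# every venue has exactly one category, so each row of frequencies sums to 1
row_sums = grouped_small.drop(columns='Neighborhood').sum(axis=1)
print(row_sums.tolist())
```

Neighborhood A has two coffee shops out of four venues, hence a frequency of 0.5.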


#### Let's print each neighborhood along with the top 5 most common venues

In [108]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


2           Skating Rink  0.25
3                   Café  0.25
4                   Park  0.00


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.14
1  Breakfast Spot  0.09
2     Coffee Shop  0.09
3             Gym  0.05
4    Climbing Gym  0.05


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0    Light Rail Station  0.12
1           Pizza Place  0.06
2         Garden Center  0.06
3        Farmers Market  0.06
4  Fast Food Restaurant  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0   Airport Service  0.17
1    Airport Lounge  0.11
2  Airport Terminal  0.11
3             Plane  0.06
4       Coffee Shop  0.06


----Caledonia-Fairbanks----
                        venue  freq
0                        Park  0.50
1               Women's Store  0.25
2                    

#### Let's put that into a *pandas* dataframe

In [109]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [110]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Latin American Restaurant,Skating Rink,Breakfast Spot,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Gym,Pharmacy,Pub,Sandwich Place,Dessert Shop,Curling Ice,Dance Studio,Deli / Bodega
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Grocery Store,Fried Chicken Joint,Diner,Sandwich Place,Bridal Shop,Deli / Bodega,Restaurant,Ice Cream Shop
3,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Deli / Bodega,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Women's Store
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Restaurant,Grocery Store,Thai Restaurant,Juice Bar,Fast Food Restaurant,Indian Restaurant,Pub
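
The row-by-row loop above could also be written in a vectorized way. A minimal sketch using `numpy.argsort` on made-up frequencies; the names `grouped_small`, `top_n` and the `Top N` column labels are illustrative only.

```python
import numpy as np
import pandas as pd

# hypothetical frequency table in the same layout as toronto_grouped
grouped_small = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Coffee Shop': [0.50, 0.00],
    'Gym': [0.20, 0.25],
    'Park': [0.30, 0.75],
})
top_n = 2

freqs = grouped_small.drop(columns='Neighborhood')
# sorting the negated values orders categories from most to least frequent;
# the stable sort keeps the original column order on ties
order = np.argsort(-freqs.to_numpy(), axis=1, kind='stable')[:, :top_n]
top_categories = freqs.columns.to_numpy()[order]

top_df = pd.DataFrame(top_categories,
                      columns=['Top {}'.format(i + 1) for i in range(top_n)])
top_df.insert(0, 'Neighborhood', grouped_small['Neighborhood'])
print(top_df)
```

Note that categories with a frequency of zero still fill the remaining slots, which is why sparse neighborhoods in the table above list venues they do not actually have.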


### Cluster the neighborhoods

Run *k-means* to cluster the neighborhoods into 5 clusters.

In [111]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_



array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 1, 0, 3, 1, 3, 3, 3, 2, 3, 3, 3, 3, 0, 3, 3, 3, 0,
       3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3,
       0, 3, 1, 3, 3, 3, 3, 0])
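
The label array is easier to judge once the cluster sizes are counted. A minimal sketch using `numpy.bincount` on the labels printed above (copied by hand into `labels`).

```python
import numpy as np

# labels copied from the kmeans.labels_ output
labels = np.array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                   3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3,
                   3, 3, 3, 3, 3, 1, 0, 3, 1, 3, 3, 3, 2, 3, 3, 3, 3, 0, 3, 3, 3, 0,
                   3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3,
                   0, 3, 1, 3, 3, 3, 3, 0])

# number of neighborhoods per cluster label 0..4
cluster_sizes = np.bincount(labels, minlength=5)
print(cluster_sizes)  # [ 7  4  2 82  1]

# share of neighborhoods in the largest cluster
largest_share = cluster_sizes.max() / labels.size
print(round(largest_share, 2))
```

One cluster holds 82 of the 96 neighborhoods, so the other four clusters are small outlier groups.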

In [112]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)


In [115]:
# merge toronto_df with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood

toronto_merged = pd.merge(toronto_df, neighborhoods_venues_sorted, on='Neighborhood')
# note: `.head` without parentheses displays the bound method, whose repr includes the whole dataframe
toronto_merged.head
#toronto_merged.head() # check the last columns!

<bound method NDFrame.head of    Postal Code           Borough  \
0          M3A        North York   
1          M4A        North York   
2          M5A  Downtown Toronto   
3          M6A        North York   
4          M7A  Downtown Toronto   
..         ...               ...   
95         M8X         Etobicoke   
96         M4Y  Downtown Toronto   
97         M7Y      East Toronto   
98         M8Y         Etobicoke   
99         M8Z         Etobicoke   

                                         Neighborhood   Latitude  Longitude  \
0                                           Parkwoods  43.753259 -79.329656   
1                                    Victoria Village  43.725882 -79.315572   
2                           Regent Park, Harbourfront  43.654260 -79.360636   
3                    Lawrence Manor, Lawrence Heights  43.718518 -79.464763   
4         Queen's Park, Ontario Provincial Government  43.662301 -79.389494   
..                                                ...        ..

Let's visualize the clusters


In [124]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observations

- The clustering of Toronto neighborhoods based on **Foursquare** venue data seems quite boring: 82 of the 96 clustered neighborhoods fall into the same cluster (label 3), with just a few exceptions.
- Some of the neighborhoods return no venues from **Foursquare**: the merged dataframe keeps only 100 of the 103 rows in `toronto_df`.
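
To list the neighborhoods behind the second observation, one could compare the neighborhood names against the venue data. A minimal sketch with `Series.isin` on made-up frames; `Hypothetical Hood` is an invented name standing in for a neighborhood without venues.

```python
import pandas as pd

# hypothetical stand-ins for toronto_df and toronto_venues
hoods_small = pd.DataFrame(
    {'Neighborhood': ['Parkwoods', 'Victoria Village', 'Hypothetical Hood']})
venues_small = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Victoria Village'],
    'Venue': ['Brookbanks Park', 'Variety Store', 'Portugril'],
})

# True for neighborhoods that appear at least once in the venue data
has_venues = hoods_small['Neighborhood'].isin(venues_small['Neighborhood'])
no_venues = hoods_small.loc[~has_venues, 'Neighborhood'].tolist()
print(no_venues)
```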