# Applied Data Science Capstone: Week 3 
## Segmenting and Clustering Neighborhoods in the city of Toronto, Canada, Part 3

A Jupyter Notebook that uses pandas and other Python libraries to demonstrate 
*k-means clustering*. It builds on the separate Part 1 and Part 2 notebooks.
New work begins under "Part 3".

## Code from Part 1

As in the Part 2 notebook, I've condensed the code from Part 1 into a single cell.

In [1]:
import pandas as pd
import lxml

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
raw_postal_codes = df[0]
raw_postal_codes = raw_postal_codes[(raw_postal_codes.Borough != 'Not assigned')]
raw_postal_codes = raw_postal_codes[(raw_postal_codes.Neighbourhood != 'Not assigned')]

grouped = raw_postal_codes.groupby('Postal Code')
grouped_data = {'PostalCode':[], 'Borough':[], 'Neighborhood':[]}
for a, b in grouped:
    grouped_data['PostalCode'].append(a)
    grouped_data['Borough'].append(', '.join(b['Borough'].tolist()))
    grouped_data['Neighborhood'].append(', '.join(b['Neighbourhood'].tolist()))

postal_codes = pd.DataFrame(grouped_data)
print("Dataframe ready!")
print(postal_codes.shape)

Dataframe ready!
(103, 3)


## Code from Part 2

I've removed the geocoder and geopy code that returned no 
results, and just load the CSV file instead.

In [2]:
long_lat = pd.read_csv("https://cocl.us/Geospatial_data")
long_lat.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
long_lat.head()
postal_codes = postal_codes.merge(long_lat, on='PostalCode')
print("Longitude and Latitude added to dataframe")
print(postal_codes.shape)

Longitude and Latitude added to dataframe
(103, 5)


## Part 3 code starts here

Let's explore the data, looking at the same kinds of things
as in the New York exercise.

In [3]:
pip install folium

You should consider upgrading via the '/usr/local/Cellar/jupyterlab/2.2.7/libexec/bin/python3.8 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [4]:
from geopy.geocoders import Nominatim
import folium
import requests
from pandas import json_normalize
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [5]:
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### Determine number of Boroughs and Neighborhoods

I think it would be interesting to see the overall
number of boroughs and neighborhoods. However, since
the Neighborhood column holds several neighborhoods merged into one string,
the only way to get a count is to split each value and count the results.

In [6]:
postal_codes['NumberNeighborhoods'] = postal_codes.apply(lambda row: len(row['Neighborhood'].split(",")), axis=1)

print('The dataframe has {} boroughs and {} neighborhoods'.format(
            len(postal_codes['Borough'].unique()),
            postal_codes['NumberNeighborhoods'].sum()))

The dataframe has 10 boroughs and 217 neighborhoods
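
The same count can be done without a row-wise `apply`. This is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not the real Toronto table), using `Series.str.split` and `Series.str.len`:

```python
import pandas as pd

# Hypothetical stand-in for postal_codes; the real frame has 103 rows.
toy = pd.DataFrame({
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighborhood": ["Malvern, Rouge", "Woburn", "Hillcrest Village"],
})

# Split each comma-separated string into a list, then take the list length.
toy["NumberNeighborhoods"] = toy["Neighborhood"].str.split(",").str.len()

n_boroughs = toy["Borough"].nunique()
n_neighborhoods = int(toy["NumberNeighborhoods"].sum())
print(n_boroughs, n_neighborhoods)
```

Either form gives the same numbers; the vectorized one is just shorter to read.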


### Use geopy to get the latitude and longitude values of Toronto, Ontario, Canada

In [7]:
address = "Toronto, Ontario, Canada"
geolocator = Nominatim(user_agent="Capstone_Week_3")
location = geolocator.geocode(address)
if location is not None:
    latitude = location.latitude
    longitude = location.longitude
    print("The coordinates of Toronto are {}, {}".format(latitude, longitude))
else:
    print("Coordinates not found!")

The coordinates of Toronto are 43.6534817, -79.3839347


### Create a map of Toronto with the postal codes superimposed on top

I also color the points by borough

In [8]:
# The Color names come from the folium documentation. I experimented
# a little to position them.

folium_colors = ['purple', 'black', 'green', 'red', 'orange', 'darkred',
                 'blue', 'beige', 'darkblue', 'darkgreen']

boroughs = postal_codes['Borough'].unique()
borough_colors = dict(zip(boroughs, folium_colors))

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for i, row in postal_codes.iterrows():
    label_text = "{}, {}".format(row['PostalCode'], row['Borough'])
    label = folium.Popup(label_text, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=label,
        color=borough_colors[row['Borough']],
        fill=True,
        fill_color=borough_colors[row['Borough']],
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

### Build a dataframe of venues using Foursquare

The New York clustering exercise just did this with
Manhattan, but I'll use the whole Toronto data set, 
which is only slightly larger. 

In [13]:
# Adapted from New York exercise. Refactored slightly for clarity.

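# Credentials are read from the environment rather than hard-coded.
# The variable names are my own choice, not a Foursquare convention.
import os
client_secret = os.environ["FOURSQUARE_CLIENT_SECRET"]
client_id = os.environ["FOURSQUARE_CLIENT_ID"]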
api_version = '20180323' # different from what we used in the NY exercise, from FS documentation
limit = 100


# Radius is in meters, according to the Foursquare API documentation.

def get_nearby_venues(postal_code, latitude, longitude, radius=500):
    
    venues_list = []
        
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        client_id, 
        client_secret, 
        api_version, 
        latitude, 
        longitude, 
        radius, 
        limit)

    # Note: "groups" and 'groups' are the same string in Python, so the
    # quoting style cannot be what made an earlier run fail. More likely
    # that response simply had no 'groups' key (for example, an API error).
    
    results = requests.get(url).json()["response"]['groups'][0]["items"]
    for v in results:
        venues_list.append((postal_code,
                           latitude,
                           longitude,
                           v['venue']['name'],
                           v['venue']['location']['lat'],
                           v['venue']['location']['lng'],
                           v['venue']['categories'][0]['name']))
    
    return venues_list
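
The function above indexes straight into the JSON, so a response without a `groups` key, or a venue without any categories, raises an exception. Below is a minimal, more defensive sketch of just the parsing step. `extract_venues` and the `sample` payload are hypothetical, and the payload only imitates the shape that the code above reads:

```python
def extract_venues(payload, postal_code, latitude, longitude):
    """Flatten an explore-style response into tuples, skipping gaps."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []  # e.g. an error response or an empty result

    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to a placeholder when a venue has no category.
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((postal_code, latitude, longitude,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Made-up payload for illustration only.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A",
               "location": {"lat": 43.80, "lng": -79.19},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Mystery Spot",
               "location": {"lat": 43.81, "lng": -79.20},
               "categories": []}},
]}]}}

venues = extract_venues(sample, "M1B", 43.806686, -79.194353)
empty = extract_venues({"meta": {"code": 429}}, "M1B", 43.806686, -79.194353)
```

Keeping the parsing separate from the HTTP call also makes it possible to check the logic without touching the network.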


In [12]:
venues_list = []

for i, row in postal_codes.iterrows():
    print(".", end="")
    venues_list += get_nearby_venues(row['PostalCode'],
                                        row['Latitude'],
                                        row['Longitude'])

print("\nVenues list created")

.......................................................................................................
Venues list created


In [14]:
# I keep "PostalCode" without the space to stay compatible with the
# earlier required step, although I wouldn't have named it that way myself.

toronto_venues = pd.DataFrame(data=venues_list, columns=['PostalCode',
                                                        'Postal Code Latitude',
                                                        'Postal Code Longitude',
                                                        'Venue',
                                                        'Venue Latitude',
                                                        'Venue Longitude',
                                                        'Venue Category'])

In [15]:
print(toronto_venues.shape)
toronto_venues.head()

(2122, 7)


Unnamed: 0,PostalCode,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,M1B,43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,M1C,43.784535,-79.160497,Chris Effects Painting,43.784343,-79.163742,Construction & Landscaping
3,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
4,M1E,43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank


### Basic statistics

In [16]:
toronto_venues.groupby('PostalCode').count()

PostalCode,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,2,2,2,2,2,2
M1C,2,2,2,2,2,2
M1E,7,7,7,7,7,7
M1G,4,4,4,4,4,4
M1H,8,8,8,8,8,8
...,...,...,...,...,...,...
M9N,1,1,1,1,1,1
M9P,7,7,7,7,7,7
M9R,4,4,4,4,4,4
M9V,9,9,9,9,9,9


In [17]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 267 unique categories.


### Analyze each neighborhood

In [18]:
# one hot encoding for the venue categories

toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], 
                                prefix="", 
                                prefix_sep="")

toronto_onehot['PostalCode'] = toronto_venues['PostalCode']

# In the New York exercise, the order of the columns was changed 
# at this point so the Neighborhood column came first. 
# But this isn't necessary, because the group step further
# down does that.

toronto_onehot.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio,PostalCode
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,M1B
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,M1B
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,M1C
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,M1C
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,M1E


In [19]:
toronto_onehot.shape

(2122, 268)

In [22]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Truck Stop,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,M9N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
toronto_grouped.shape

(100, 268)
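
To make the grouping step concrete, here is a small sketch on made-up data (the `toy_venues` frame is an assumption). Because each one-hot column holds 0 or 1, the per-postal-code mean is the share of that postal code's venues that fall in each category:

```python
import pandas as pd

# Hypothetical venue list: two venues in M1B, four in M1C.
toy_venues = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1C", "M1C", "M1C", "M1C"],
    "Venue Category": ["Park", "Bar", "Park", "Park", "Park", "Bank"],
})

# dtype=float keeps the indicator columns numeric across pandas versions.
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot["PostalCode"] = toy_venues["PostalCode"]

# Mean of 0/1 indicators per postal code = category frequency.
toy_grouped = toy_onehot.groupby("PostalCode").mean().reset_index()
print(toy_grouped)
```

These frequency rows are what k-means later treats as feature vectors, so a postal code with very few venues ends up with a few large values and many zeros.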

### Build a dataframe with the top five categories

In [58]:
def most_common_venues(row, number=5):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:number]

number_top_venues = 5

# The New York notebook used numpy.arange for this.
# Why use numpy.arange for something trivial like this?

columns = ['PostalCode']
for i in range(1, number_top_venues + 1):
    if i == 1:
        columns.append("1st Most Common Venue")
    elif i == 2:
        columns.append("2nd Most Common Venue")
    elif i == 3:
        columns.append("3rd Most Common Venue")
    else:
        columns.append("{}th Most Common Venue".format(i))

postal_code_venues_sorted = pd.DataFrame(columns=columns)
postal_code_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

# Again: why use numpy.arange for something trivial like this?

for i in range(0, toronto_grouped.shape[0]):
    postal_code_venues_sorted.iloc[i, 1:] = most_common_venues(
            toronto_grouped.iloc[i, :], number_top_venues)

postal_code_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Print Shop,Fast Food Restaurant,Yoga Studio,Dim Sum Restaurant,Diner
1,M1C,Construction & Landscaping,Bar,Yoga Studio,Donut Shop,Diner
2,M1E,Restaurant,Rental Car Location,Breakfast Spot,Medical Center,Intersection
3,M1G,Coffee Shop,Pharmacy,Korean BBQ Restaurant,Escape Room,Ethiopian Restaurant
4,M1H,Fried Chicken Joint,Gas Station,Hakka Restaurant,Bakery,Athletics & Sports
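
The column-naming loop and the row-by-row fill can be folded into two small helpers. This is a sketch under assumptions: `ordinal` and `top_categories` are hypothetical names, and `toy_grouped` is made-up data shaped like `toronto_grouped`:

```python
import pandas as pd


def ordinal(n):
    """Return n with its English suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(grouped, number=5, key="PostalCode"):
    """One row per key, with columns holding the highest-frequency categories."""
    columns = ["{} Most Common Venue".format(ordinal(i))
               for i in range(1, number + 1)]
    freqs = grouped.set_index(key)
    # Sort each row's frequencies and keep the names of the largest ones.
    rows = [row.sort_values(ascending=False).index[:number].tolist()
            for _, row in freqs.iterrows()]
    return pd.DataFrame(rows, index=freqs.index, columns=columns).reset_index()


# Made-up frequencies with no ties, so the ordering is unambiguous.
toy_grouped = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Bank": [0.1, 0.6],
    "Bar": [0.2, 0.1],
    "Park": [0.7, 0.3],
})
top2 = top_categories(toy_grouped, number=2)
print(top2)
```

Note that when several categories tie (for example, many zeros), the order among the tied names depends on the sort, which is why categories such as "Yoga Studio" can show up as filler in sparse rows.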


### Cluster postal codes

In [59]:
clusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='PostalCode')

kmeans = KMeans(n_clusters=clusters, random_state=3).fit(toronto_grouped_clustering)
kmeans.labels_[0:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [60]:
# In the New York notebook the Cluster Labels column stayed int32. Here the
# join below turns it into float64, because postal codes with no venues have
# no label and get NaN. The astype call only affects the frame before the join.

postal_code_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
postal_code_venues_sorted['Cluster Labels'] = postal_code_venues_sorted['Cluster Labels'].astype('int32')


toronto_merged = postal_codes
toronto_merged = toronto_merged.join(
        postal_code_venues_sorted.set_index("PostalCode"), on='PostalCode')
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,NumberNeighborhoods,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2,1.0,Print Shop,Fast Food Restaurant,Yoga Studio,Dim Sum Restaurant,Diner
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,3,1.0,Construction & Landscaping,Bar,Yoga Studio,Donut Shop,Diner
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,3,1.0,Restaurant,Rental Car Location,Breakfast Spot,Medical Center,Intersection
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,1.0,Coffee Shop,Pharmacy,Korean BBQ Restaurant,Escape Room,Ethiopian Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,1.0,Fried Chicken Joint,Gas Station,Hakka Restaurant,Bakery,Athletics & Sports
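
The `1.0` values in the Cluster Labels column come from the join, not from k-means. A small sketch on made-up frames shows the effect; `M1X` is a hypothetical postal code with no venues, and the nullable `Int64` dtype is one possible way to keep whole numbers, not something the New York notebook does:

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1X"]})
labels = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                       "Cluster Labels": [1, 0]})

joined = left.join(labels.set_index("PostalCode"), on="PostalCode")

# M1X has no label, so pandas stores NaN and upcasts the column to float.
dtype_after_join = joined["Cluster Labels"].dtype

# The nullable integer dtype keeps whole numbers and still allows gaps.
joined["Cluster Labels"] = joined["Cluster Labels"].astype("Int64")
print(dtype_after_join, joined["Cluster Labels"].dtype)
```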


In [61]:
from math import isnan

# create a map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# One evenly spaced rainbow color per cluster.
colors_array = cm.rainbow(np.linspace(0, 1, clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, pcode, cluster in zip(toronto_merged['Latitude'],
                                 toronto_merged['Longitude'],
                                 toronto_merged['PostalCode'],
                                 toronto_merged['Cluster Labels']):
    label = folium.Popup(str(pcode) + ' Cluster ' + str(cluster), parse_html=True)

    # Unlike the New York notebook, my Cluster Labels column is a float,
    # because postal codes with no venues have NaN instead of a label.
    # Those are drawn in gray so they aren't confused with a real cluster.
    if isnan(cluster):
        current_color = 'gray'
    else:
        current_color = rainbow[int(cluster)]
        
    folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=current_color,
            fill=True,
            fill_color=current_color,
            fill_opacity=0.7).add_to(map_clusters)

map_clusters

### Examine Clusters

In [62]:
# This is (more or less) how it was done in the New York notebook.
# I'm displaying the top 5 venues for each cluster.

for i in range(0, clusters):
    print("---------- Cluster {} ----------".format(i))
    print(toronto_merged.loc[toronto_merged['Cluster Labels'] == i,
                            ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue',
                            '4th Most Common Venue', '5th Most Common Venue']])

---------- Cluster 0 ----------
    1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue  \
14                   Park            Playground          Intersection   
23      Convenience Store                  Park           Yoga Studio   
25                   Park                  Pool     Food & Drink Shop   
30                   Park              Bus Stop               Airport   
40      Convenience Store                  Park           Yoga Studio   
44                   Park              Bus Line           Swim School   
50                   Park            Playground                 Trail   
74                   Park         Women's Store                   Bar   
79                  Trail                  Park                Bakery   
90                   Park                 River           Yoga Studio   
100                  Park              Bus Line        Sandwich Place   

          4th Most Common Venue 5th Most Common Venue  
14                  Yoga Studio    

### Analysis

I tried running this with the top 10 venue types, but the end results were *very*
imbalanced: three of the clusters wound up with only 1 member, which seemed
wrong. Running it with 5 venues is better. Running with 3 didn't
change the results from 5, so I left it at 5. 

I also ran it with 5 clusters, but one of the clusters had only one member,
and it looked like it belonged in one of the other clusters, so I
dropped that down to 4. 

With only 4 clusters and the top 5 venues, the results are still pretty 
imbalanced, but I think this is as good as it's going to get:


| Cluster | Number of Postal Codes |
| :------ | :--------------------- |
| 0       | 11                     |
| 1       | 83                     |
| 2       | 3                      |
| 3       | 3                      |
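
The comparison of cluster sizes for different values of k can be scripted rather than rerun by hand. This is a minimal sketch, assuming synthetic 2-D points in place of the real `toronto_grouped_clustering` features; `cluster_sizes` is a hypothetical helper:

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_sizes(features, k, seed=3):
    """Fit k-means and return the number of rows in each cluster."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    return np.bincount(model.labels_, minlength=k)


# Synthetic stand-in data: three loose groups of unequal size.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(10, 10), scale=0.5, size=(8, 2)),
    rng.normal(loc=(20, 0), scale=0.5, size=(2, 2)),
])

# Compare how the rows are distributed for several candidate values of k.
sizes_by_k = {k: cluster_sizes(features, k) for k in (3, 4, 5)}
for k, sizes in sizes_by_k.items():
    print(k, sizes.tolist())
```

On the notebook's data the helper would be called with `toronto_grouped_clustering`; a cluster with only one or two members is a hint that k may be too high.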

That's not really a problem, but it is interesting. Scanning the top venues
in each cluster, it looks like we could describe them as follows:

* Cluster 0: Leisure (mostly parks and similar venues)
* Cluster 1: Retail and Entertainment (mostly restaurants and shops)
* Cluster 2: Baseball (one of the members includes Rogers Centre)
* Cluster 3: Retail and Entertainment (also mostly restaurants and shops)

