# Segmenting and Clustering Neighborhoods in Toronto

### Libraries

Let's import all needed libraries.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import requests
import folium
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

## 1. Step 1: Scraping

Let's repeat what we did in the first notebook.

In [0]:
wikipedia_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(wikipedia_page).text
soup = BeautifulSoup(source, 'lxml')
table = soup.find('table')
data = []
for i, row in enumerate(table.find_all('tr')):
  entry = [item.text for item in row.find_all('td')]
  if entry and entry[1] != 'Not assigned':
    if entry[0] not in [item['Postcode'] for item in data]:
      data.append({'Postcode': entry[0],
                   'Borough': entry[1],
                   'Neighbourhood': entry[2].rstrip() if entry[2].rstrip() != 'Not assigned' else entry[1]})
    else:
      index = [i for i, item in enumerate(data) if item['Postcode'] == entry[0]][0]
      data[index]['Neighbourhood'] += ', '+entry[2].rstrip()
df = pd.DataFrame(data=data, columns=['Postcode', 'Borough', 'Neighbourhood'])
df.columns = ['PostalCode', 'Borough', 'Neighborhood']

## Step 2: Adding coordinates

Let's repeat what we did in the second notebook.

In [0]:
coord_df = pd.read_csv('http://cocl.us/Geospatial_data')
df_with_coords = pd.merge(df, coord_df, left_on='PostalCode', right_on='Postal Code', suffixes=('', '')).drop('Postal Code', 1)

## Step 3: Clustering

### Step 3.0: Visualization

First, let's visualize the neighborhoods on a map. In order to do that, we need to find out Toronto's latitude and longitude.

In [4]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent='coursera-capstone-project')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


Now we can have a look at the map.

In [7]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df_with_coords['Latitude'], df_with_coords['Longitude'], df_with_coords['Borough'], df_with_coords['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill_color='#3186cc',
        fill_opacity=0.7,
    ).add_to(map_toronto)
    
map_toronto

### Step 3.1: Data import

We need to import our Foursquare's client informations.

In [0]:
CLIENT_ID = 'VHSY234TCBMJ31GRMHCVW3FMO1A3412INITUQKSUMXPZ4XEK'
CLIENT_SECRET = 'DJGTTS2AIKGUHNDS3WOC34SLMV1HUMPJ4KDY1NVGG351MNFO'
VERSION = '20181219'

And we define the function to extract the venues.

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Let's use it to get the data we need. First we select from our DataFrame the subset of boroughs containing the word Toronto. 

In [10]:
boroughs = set(df_with_coords['Borough'])
toronto_boroughs = {borough for borough in boroughs if 'Toronto' in borough}
toronto_data = df_with_coords[df_with_coords['Borough'].isin(toronto_boroughs)].reset_index(drop=True)
toronto_data.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


Then we use the function we defined earlier to query Foursquare to get information about the neighborhoods.

In [11]:
toronto_venues = getNearbyVenues(toronto_data.PostalCode, toronto_data.Latitude, toronto_data.Longitude)
toronto_venues.head(5)

Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,43.65426,-79.360636,Toronto Cooper Koo Family Cherry St YMCA Centre,43.653191,-79.357947,Gym / Fitness Center
3,M5A,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,M5A,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


### Step 3.2: Data Transformation

Now we need to transform the data we just extracted, because we need to have numerical values for the column 'Venue Category' in order for the K-Means algorithm to work.

We start by converting the 'Venue Category' column in dummies in a new DataFrame, and adding the 'PostalCode' column to this new DataFrame.

In [0]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['PostalCode'] = toronto_venues['PostalCode'] 
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

We can now group this DataFrame by its 'PostalCode' column and taking the mean of each group.

In [13]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped.head(5)

Unnamed: 0,PostalCode,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can create now a new DataFrame to better display which are the 10 most common venues in each postal code area.

In [15]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(5)

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Coffee Shop,Burger Joint,Neighborhood,Pub,BBQ Joint,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store,Diner
1,M4K,Greek Restaurant,Italian Restaurant,Ice Cream Shop,Yoga Studio,Pub,Pizza Place,Juice Bar,Indian Restaurant,Health Food Store,Fruit & Vegetable Store
2,M4L,Sandwich Place,Pizza Place,Coffee Shop,Sushi Restaurant,Brewery,Board Shop,Liquor Store,Burger Joint,Burrito Place,Pub
3,M4M,Coffee Shop,Café,American Restaurant,Italian Restaurant,Bakery,Comfort Food Restaurant,Fish Market,Bookstore,Sandwich Place,Juice Bar
4,M4N,Lake,Park,Dim Sum Restaurant,Bus Line,Swim School,Dance Studio,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


### Step 3.3: Model creation and training

Now, we can create our K-Means classifier, and train it to segment our venues data in 5 clusters.

In [0]:
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop('PostalCode', 1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

We can now insert the labels predicted by our model in the DataFrame, along with the information about the most common venues in each area.

In [17]:
toronto_merged = toronto_data
toronto_merged['Cluster Labels'] = kmeans.labels_
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('PostalCode'), on='PostalCode')
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,0,Coffee Shop,Park,Café,Mexican Restaurant,Pub,Bakery,Breakfast Spot,Dessert Shop,Performing Arts Venue,Italian Restaurant
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0,Café,Clothing Store,Beer Bar,Plaza,Sandwich Place,Japanese Restaurant,Diner,Ramen Restaurant,Burger Joint,Pizza Place
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Gastropub,Coffee Shop,Italian Restaurant,Restaurant,Japanese Restaurant,Hotel,Food Truck,Diner,Spa,Middle Eastern Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Burger Joint,Neighborhood,Pub,BBQ Joint,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store,Diner
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Cocktail Bar,Seafood Restaurant,Café,Farmers Market,Belgian Restaurant,Jazz Club,Italian Restaurant,Basketball Stadium,Beer Bar,Bistro


### Step 3.5: Visualization

Let's display on a map how the clusters we predicted are distributed.

In [18]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Step 3.6: Cluster examination

Looking at the most common venues, we can try to deduce what venue categories distinguish each cluster.

The first cluster, the most populated, seems a bit generic, but there's definitely a common theme in Coffe Shops and Cafès.

In [19]:
toronto_merged[toronto_merged['Cluster Labels']==0].head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,0,Coffee Shop,Park,Café,Mexican Restaurant,Pub,Bakery,Breakfast Spot,Dessert Shop,Performing Arts Venue,Italian Restaurant
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0,Café,Clothing Store,Beer Bar,Plaza,Sandwich Place,Japanese Restaurant,Diner,Ramen Restaurant,Burger Joint,Pizza Place
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Gastropub,Coffee Shop,Italian Restaurant,Restaurant,Japanese Restaurant,Hotel,Food Truck,Diner,Spa,Middle Eastern Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Burger Joint,Neighborhood,Pub,BBQ Joint,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store,Diner
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Cocktail Bar,Seafood Restaurant,Café,Farmers Market,Belgian Restaurant,Jazz Club,Italian Restaurant,Basketball Stadium,Beer Bar,Bistro
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Chinese Restaurant,Spa,Seafood Restaurant,Café,Ramen Restaurant,Sandwich Place,Japanese Restaurant,Bar
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park,Athletics & Sports,Nightclub,Convenience Store,Restaurant,Diner,Italian Restaurant,Baby Store
7,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,0,Steakhouse,Café,American Restaurant,Hotel,Asian Restaurant,Seafood Restaurant,Noodle House,Monument / Landmark,Gastropub,Concert Hall
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,0,Park,Café,Hotel,New American Restaurant,Bubble Tea Shop,Basketball Stadium,Ice Cream Shop,Skating Rink,Italian Restaurant,Japanese Restaurant
11,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Italian Restaurant,Ice Cream Shop,Yoga Studio,Pub,Pizza Place,Juice Bar,Indian Restaurant,Health Food Store,Fruit & Vegetable Store


The second cluster seems characterized by sports, with Gyms and Yoga Studios alongside Sporting Goods Shops.

In [20]:
toronto_merged[toronto_merged['Cluster Labels']==1]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,1,Sporting Goods Shop,Clothing Store,Coffee Shop,Yoga Studio,Chinese Restaurant,Dessert Shop,Diner,Fast Food Restaurant,Gift Shop,Gym / Fitness Center


The third cluster sees a prevalence of stores, like Pharmacies, Supermarkets, Bakeries, Liquor Stores and Discount Stores.

In [21]:
toronto_merged[toronto_merged['Cluster Labels']==2]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259,2,Pharmacy,Supermarket,Bakery,Music Venue,Pool,Café,Middle Eastern Restaurant,Brewery,Liquor Store,Discount Store


The fourth cluster has many Bars and Restaurants, especially of Asian cuisine (like Vietnamese and Korean).

In [22]:
toronto_merged[toronto_merged['Cluster Labels']==3]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,3,Bar,Vietnamese Restaurant,Pizza Place,Asian Restaurant,Yoga Studio,Bakery,Coffee Shop,Cocktail Bar,Cuban Restaurant,Korean Restaurant


Finally, the fifth cluster distinguishes itself by its Sandwich Places and BBQ and Burger Joints.

In [23]:
toronto_merged[toronto_merged['Cluster Labels']==4]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,4,Café,Coffee Shop,Sandwich Place,Pizza Place,French Restaurant,Indian Restaurant,BBQ Joint,Burger Joint,Cosmetics Shop,History Museum
