# Segmenting and Clustering Neighborhoods in Toronto

### Applied Data Science Capstone by IBM/Coursera 

## Table of contents
* [Introduction](#intro)
* [Methodology](#methodology)
* [Analysis](#analysis)

### Exercises
* Point 3: [Dataframe](#dataframe)
* Point 4: [Dataframe with coordinates](#dfcoordinates)
* Point 5: [Clusters](#clusters)

<div id='intro' />

## Introduction

In this project we will explore, segment and cluster the neighbourhoods in the city of **Toronto**. The neighbourhood data is not readily available on the internet in a structured form. However, a Wikipedia page lists all the information we need to explore and cluster the neighbourhoods in Toronto.

To segment and cluster that information we need to scrape the Wikipedia page, wrangle and clean the data, and read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will run the analysis that clusters the neighbourhoods.

You can find the Wikipedia page [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

<div id='methodology' />

## Methodology

The clusters are built with K-means, one of the simplest and most popular unsupervised machine learning algorithms.

The objective of K-means is to group similar data points and discover underlying patterns. To achieve that, K-means looks for a fixed number (k) of clusters in a dataset.

We fix the number of clusters (k) in advance; it is also the number of centroids the algorithm places in the dataset. A centroid is the imaginary centre of a cluster and is used to allocate each point to a cluster.

K-means allocates each data point to its nearest centroid and re-evaluates the centroids after each iteration.
In this way it reduces the distance between each data point and its respective centroid.
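
The two alternating steps described above (assign each point to the nearest centroid, then recompute the centroids) can be sketched in a few lines of NumPy. This is only an illustration on four hand-made points, not the implementation used later in the notebook (which relies on scikit-learn); the function name `kmeans`, the random initialisation and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: returns (centroids, labels) for a 2-D array of points."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points;
        # a centroid without points keeps its previous position.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data: two points near the origin and two points near (5, 5).
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.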

<div id='analysis' />

## Analysis

First of all, we import the libraries needed for the project:
**pandas**, to work with dataframes,
**BeautifulSoup** from **bs4**, to parse the HTML, and
**requests**, to download the Wikipedia page.

In [1]:
#import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

Then we can make the request and the soup in order to put the information into a dataframe.

In [33]:
#get the data from the source and make the soup
result = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
src = result.content
soup = BeautifulSoup(src, 'html.parser')

In [34]:
#read the table into a dataframe
table = soup.find('table')
df = pd.read_html(str(table))
df = df[0]

In [35]:
#let's take a look at the first rows
df.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Postcode | Borough | Neighbourhood |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


In [36]:
#and the shape
df.shape

(288, 3)

Once the dataframe has been created, we can discard every row whose Borough value is Not assigned.
If a remaining row has Not assigned in the Neighbourhood column, we give it the value of its Borough.
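
The two rules just described can be tried on a small hand-made frame before applying them to the scraped table. The sketch below is only an illustration: the frame `raw` and its three rows are assumptions that mimic the layout of the Wikipedia table.

```python
import pandas as pd

# Toy rows in the same three-column layout as the Wikipedia table.
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: keep only the rows that have a borough assigned.
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# Rule 2: where the neighbourhood is missing, fall back to the borough name.
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

print(clean)
```

Using a boolean mask for the second rule handles any number of affected rows, so the row position does not need to be known in advance.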

In [37]:
#process only the rows that have a borough assigned
df = df[df[1] != 'Not assigned']

In [44]:
#shape without the rows deleted
df.shape

(210, 3)

In [39]:
#set the first row as columns and reset index
df = df.rename(columns=df.iloc[0]).drop(df.index[0])

In [40]:
df = df.reset_index(drop=True)

In [41]:
#check rows with not assigned neighbourhood
df.loc[df['Neighbourhood'] == 'Not assigned']

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 5 | M7A | Queen's Park | Not assigned |


In [46]:
#replace the single Not assigned value with its Borough value
#(.loc avoids chained assignment, which may not modify df)
df.loc[5, 'Neighbourhood'] = df.loc[5, 'Borough']

In [47]:
#check that no Not assigned value remains in the Neighbourhood column
sum(df.Neighbourhood == 'Not assigned')

0

Now we can organise the dataframe as indicated in point 3.

In [50]:
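#combine neighbourhoods on rows with the same postal code
#df_g collects one row per postal code (initialised here because no earlier cell defines it)
df_g = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
for code, borough, hood in df[['Postcode','Borough','Neighbourhood']].values:
    if code in df_g.Postcode.values:
        #postal code already seen: append this neighbourhood to the existing list
        previous_hood = df_g.loc[df_g['Postcode'] == code, 'Neighbourhood'].values[0]
        df_g.loc[df_g['Postcode'] == code, 'Neighbourhood'] = previous_hood + ', ' + hood
    else:
        #first time we see this postal code: add a new row
        #(pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        new_row = pd.DataFrame([[code, borough, hood]], columns=df_g.columns)
        df_g = pd.concat([df_g, new_row])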

In [51]:
#reset the index inplace
df_g.reset_index(drop=True, inplace=True)
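
The same combination of neighbourhoods per postal code can also be written without an explicit loop. The following is a sketch of that alternative on toy data, not the code used to produce the results in this notebook; the frame `toy` and the name `combined` are assumptions for the example.

```python
import pandas as pd

# Toy rows: two postal codes appear twice, one appears once.
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M1B', 'M1B', 'M3A'],
    'Borough': ['North York', 'North York', 'Scarborough', 'Scarborough', 'North York'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', 'Rouge', 'Malvern', 'Parkwoods'],
})

# Group by postal code and borough, then join the neighbourhood names.
# sort=False keeps the postal codes in order of first appearance.
combined = (
    toy.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(combined)
```

This version already returns a fresh 0..n-1 index, so no separate index reset is needed.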

<div id='dataframe' />

#### Point 3: Dataframe

In [53]:
#take a look at the first rows
df_g.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Queen's Park | Queen's Park |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [54]:
#shape of the new dataframe
df_g.shape

(103, 3)

To get the coordinates of each postal code we use the csv file provided. We add them to the dataframe as follows to complete point 4.

In [55]:
#read the csv with the geoinformation into a dataframe
df_latlng = pd.read_csv(r'C:\Users\aniba\OneDrive\Escritorio\Anibal\List of postal codes of Canada_ M - Wikipedia_files\Geospatial_Coordinates.csv')
#set the correct names for the columns
df_latlng.columns = ['Postcode','Latitude','Longitude']

In [56]:
#first check if all the postcodes are the same in both dataframes
df_latlng['Postcode'].isin(df_g['Postcode']).value_counts()

True    103
Name: Postcode, dtype: int64
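
The check above confirms that every postal code in the coordinates file also appears in our dataframe. A merge can verify both directions at once. The sketch below shows the idea on two rows; the frames `hoods` and `coords` are assumptions for the example, and `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

hoods = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
})
coords = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate='one_to_one' raises if a postal code is duplicated on either side;
# indicator=True adds a _merge column telling where each row was found.
merged = hoods.merge(coords, on='Postcode', how='left',
                     validate='one_to_one', indicator=True)

# Rows that did not find coordinates would show up here.
unmatched = merged[merged['_merge'] != 'both']
print(len(unmatched))
```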

<div id='dfcoordinates' />

#### Point 4: Dataframe with coordinates

In [57]:
#join dataframes to include the geo information
df = df_g.join(df_latlng.set_index('Postcode'), on='Postcode')
df.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |


In [58]:
#import libraries
import folium

from geopy.geocoders import Nominatim

In [59]:
#coordinates of Toronto
latlon_tor = Nominatim(user_agent='api')
location = latlon_tor.geocode('Toronto, Ontario')
latitude = location.latitude
longitude = location.longitude
print(latitude,longitude)

43.653963 -79.387207


In [60]:
#Let's visualize the neighbourhoods in Toronto
map_Toronto = folium.Map(location=[latitude,longitude], zoom_start=10)

for lat,lon,borough,neighbourhood in zip(df['Latitude'],df['Longitude'],df['Borough'],df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood,borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)
map_Toronto

In [61]:
#coordinates of the boroughs whose name contains the word Toronto
map_Toronto = folium.Map(location=[latitude,longitude], zoom_start=12)

#filter the dataframe
df_Toronto = df.set_index('Borough')
df_Toronto = df_Toronto.filter(like='Toronto', axis=0)

#Add the labels to the map
for lat,lon,neighbourhood in zip(df_Toronto['Latitude'],df_Toronto['Longitude'],df_Toronto['Neighbourhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)

map_Toronto

In [62]:
df_Toronto.reset_index(inplace=True)

In [63]:
#import libraries
import numpy as np
import matplotlib.pyplot as plt #plot library
#backend for rendering plots within the browser
%matplotlib inline
#transform json into a pandas dataframe
#(json_normalize is available at the top level since pandas 1.0; the pandas.io.json path is deprecated)
from pandas import json_normalize

In [64]:
#credentials to use the Foursquare api
#(replace the placeholders with your own values; real credentials should not be published)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare secret
VERSION = '20180604'
LIMIT = 100
radius = 500

In [65]:
#function that gets the venues near every neighbourhood in the Toronto boroughs
def getNearbyvenues(names, latitudes, longitudes, radius=500):
    
    venues_list = []
    for name,lat,lon in zip(names,latitudes,longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lon,
            radius,
            LIMIT)
        
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lon, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
            v['venue']['categories'][0]['shortName']) for v in results])
    
    
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue Category short']
    
    return(nearby_venues)
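
The function above indexes straight into `['response']['groups'][0]['items']`, so a response without those keys (for example an error reply) would raise an exception. A more defensive way to flatten one response is sketched below. It is an illustration only: `flatten_venues` is a hypothetical helper, and the `sample` payload is hand-written in the shape the function above expects, not a real API reply.

```python
def flatten_venues(payload, name, lat, lon):
    """Turn one explore-style response into a list of flat tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # A venue without categories yields None instead of an IndexError.
        categories = venue.get('categories') or [{}]
        rows.append((
            name, lat, lon,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0].get('name'),
            categories[0].get('shortName'),
        ))
    return rows

# Hand-written payload with a single venue.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop', 'shortName': 'Coffee Shop'}]}},
]}]}}

rows = flatten_venues(sample, 'Downtown Toronto', 43.654260, -79.360636)
# A response without groups gives an empty list instead of an exception.
empty = flatten_venues({'response': {}}, 'Downtown Toronto', 43.654260, -79.360636)
print(rows, empty)
```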

In [66]:
#get the dataframe of the venues
Toronto_venues = getNearbyvenues(names=df_Toronto['Borough'],
                                   latitudes=df_Toronto['Latitude'],
                                   longitudes=df_Toronto['Longitude'])
Toronto_venues

| | Borough | Borough Latitude | Borough Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue Category short |
|---|---|---|---|---|---|---|---|---|
| 0 | Downtown Toronto | 43.654260 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery | Bakery |
| 1 | Downtown Toronto | 43.654260 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop | Coffee Shop |
| 2 | Downtown Toronto | 43.654260 | -79.360636 | Cooper Koo Family YMCA | 43.653191 | -79.357947 | Gym / Fitness Center | Gym / Fitness |
| 3 | Downtown Toronto | 43.654260 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa | Spa |
| 4 | Downtown Toronto | 43.654260 | -79.360636 | Morning Glory Cafe | 43.653947 | -79.361149 | Breakfast Spot | Breakfast |
| 5 | Downtown Toronto | 43.654260 | -79.360636 | Impact Kitchen | 43.656369 | -79.356980 | Restaurant | Restaurant |
| 6 | Downtown Toronto | 43.654260 | -79.360636 | Figs Breakfast & Lunch | 43.655675 | -79.364503 | Breakfast Spot | Breakfast |
| 7 | Downtown Toronto | 43.654260 | -79.360636 | Corktown Common | 43.655618 | -79.356211 | Park | Park |
| 8 | Downtown Toronto | 43.654260 | -79.360636 | The Distillery Historic District | 43.650244 | -79.359323 | Historic Site | Historic Site |
| 9 | Downtown Toronto | 43.654260 | -79.360636 | Dominion Pub and Kitchen | 43.656919 | -79.358967 | Pub | Pub |


In [67]:
Toronto_venues.groupby('Borough').count()

| Borough | Borough Latitude | Borough Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue Category short |
|---|---|---|---|---|---|---|---|
| Central Toronto | 116 | 116 | 116 | 116 | 116 | 116 | 116 |
| Downtown Toronto | 1287 | 1287 | 1287 | 1287 | 1287 | 1287 | 1287 |
| East Toronto | 123 | 123 | 123 | 123 | 123 | 123 | 123 |
| West Toronto | 179 | 179 | 179 | 179 | 179 | 179 | 179 |


In [68]:
#unique categories in the venue category short column
print('There are {} unique short category names'.format(len(Toronto_venues['Venue Category short'].unique())))

There are 231 unique short category names


In [69]:
#get dummies
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category short']], prefix='',prefix_sep='')

In [70]:
Toronto_onehot['Borough'] = Toronto_venues['Borough']

In [71]:
#sort the columns
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

In [72]:
#check the dataframe
Toronto_onehot.head()

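| | Borough | Afghan | Airport | Airport Service | American | Antiques | Apparel | Aquarium | Art Gallery | Arts | ... | Trail | Train Station | Travel | Vegetarian / Vegan | Video Games | Vietnamese | Wine Bar | Wine Shop | Wings | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Downtown Toronto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Downtown Toronto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Downtown Toronto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Downtown Toronto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Downtown Toronto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |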


In [73]:
#groupby
Toronto_grouped = Toronto_onehot.groupby('Borough').mean().reset_index()
Toronto_grouped

| | Borough | Afghan | Airport | Airport Service | American | Antiques | Apparel | Aquarium | Art Gallery | Arts | ... | Trail | Train Station | Travel | Vegetarian / Vegan | Video Games | Vietnamese | Wine Bar | Wine Shop | Wings | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Central Toronto | 0.0 | 0.0 | 0.0 | 0.025862 | 0.0 | 0.034483 | 0.0 | 0.0 | 0.0 | ... | 0.017241 | 0.0 | 0.0 | 0.008621 | 0.0 | 0.008621 | 0.0 | 0.0 | 0.0 | 0.008621 |
| 1 | Downtown Toronto | 0.000777 | 0.000777 | 0.002331 | 0.01554 | 0.001554 | 0.011655 | 0.003885 | 0.00777 | 0.000777 | ... | 0.000777 | 0.002331 | 0.003108 | 0.012432 | 0.002331 | 0.005439 | 0.006993 | 0.000777 | 0.000777 | 0.001554 |
| 2 | East Toronto | 0.0 | 0.0 | 0.0 | 0.02439 | 0.0 | 0.00813 | 0.0 | 0.0 | 0.0 | ... | 0.01626 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01626 |
| 3 | West Toronto | 0.0 | 0.0 | 0.0 | 0.0 | 0.005587 | 0.0 | 0.0 | 0.005587 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.011173 | 0.0 | 0.011173 | 0.005587 | 0.0 | 0.0 | 0.005587 |


In [74]:
Toronto_grouped.shape

(4, 232)
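
The one-hot encoding followed by a group mean turns the list of venues into one row per borough, where each value is the share of that borough's venues falling in a category. A toy example makes this easier to see; the frame `venues` and the borough names `A` and `B` are assumptions for the illustration.

```python
import pandas as pd

# Toy venues: borough A has three venues, borough B has one.
venues = pd.DataFrame({
    'Borough': ['A', 'A', 'A', 'B'],
    'Venue Category short': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bar'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues[['Venue Category short']],
                        prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Borough', venues['Borough'])

# The mean of a 0/1 column is the fraction of venues in that category.
grouped = onehot.groupby('Borough').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, which is why the values can be read as frequencies.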

In [75]:
#show the most frequent venues in each borough
num_top_venues = 5

for borough in Toronto_grouped['Borough']:
    print('----{}----'.format(borough))
    temp = Toronto_grouped[Toronto_grouped['Borough'] == borough].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq',ascending=False).reset_index(drop=True).head(num_top_venues))
    print()

----Central Toronto----
         venue  freq
0  Coffee Shop  0.07
1   Sandwiches  0.06
2         Park  0.06
3         Café  0.05
4        Pizza  0.04

----Downtown Toronto----
         venue  freq
0  Coffee Shop  0.09
1         Café  0.06
2   Restaurant  0.03
3      Italian  0.03
4        Hotel  0.03

----East Toronto----
         venue  freq
0        Greek  0.07
1  Coffee Shop  0.06
2      Italian  0.05
3    Ice Cream  0.04
4         Café  0.04

----West Toronto----
         venue  freq
0          Bar  0.07
1  Coffee Shop  0.06
2         Café  0.06
3       Bakery  0.04
4   Restaurant  0.03



<div id='clusters' />

#### Point 5: Clusters

To simplify the analysis, we fix the number of clusters at 3 and use only the venue information from the boroughs whose name contains the word Toronto.

In [76]:
#import k-means
from sklearn.cluster import KMeans

In [77]:
#set the clusters
num_clusters = 3

#drop the label column (the positional axis argument of drop was removed in pandas 2.0)
Toronto_grouped_clustering = Toronto_grouped.drop(columns='Borough')

#run k-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 1, 2, 1])
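
The label array is positional: its entries follow the row order of the grouped dataframe (Central, Downtown, East, West). A short sketch of how to read it is shown below; the labels are copied from the output above, and the cluster numbers themselves are arbitrary identifiers.

```python
from collections import defaultdict

# Boroughs in the row order of Toronto_grouped.
boroughs = ['Central Toronto', 'Downtown Toronto', 'East Toronto', 'West Toronto']
labels = [0, 1, 2, 1]  # copied from the kmeans.labels_ output above

# Collect the boroughs that share a cluster label.
members = defaultdict(list)
for borough, label in zip(boroughs, labels):
    members[label].append(borough)

for label in sorted(members):
    print(label, members[label])
```

With four boroughs and three clusters, two clusters necessarily contain a single borough, so the only grouping the model expresses here is that Downtown Toronto and West Toronto are the most similar pair.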

In [78]:
#latitude and longitude of each borough
B_lat = []
B_lon = []

#create the geocoder once and reuse it for every borough
latlon_b = Nominatim(user_agent='foursquare')
for borough in Toronto_grouped['Borough']:
    location = latlon_b.geocode('{}, Toronto, Canada'.format(borough))
    B_lat.append(location.latitude)
    B_lon.append(location.longitude)

In [79]:
Toronto_grouped['Cluster'] = kmeans.labels_
Toronto_grouped['Latitude'] = B_lat
Toronto_grouped['Longitude'] = B_lon

In [80]:
Toronto_grouped.shape

(4, 235)

In [81]:
#sort the dataframe
columns = Toronto_grouped.columns.tolist()
columns = columns[-3:] + columns[:-3]
Toronto_grouped = Toronto_grouped[columns]

In [82]:
Toronto_grouped

| | Cluster | Latitude | Longitude | Borough | Afghan | Airport | Airport Service | American | Antiques | Apparel | ... | Trail | Train Station | Travel | Vegetarian / Vegan | Video Games | Vietnamese | Wine Bar | Wine Shop | Wings | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 43.653963 | -79.387207 | Central Toronto | 0.0 | 0.0 | 0.0 | 0.025862 | 0.0 | 0.034483 | ... | 0.017241 | 0.0 | 0.0 | 0.008621 | 0.0 | 0.008621 | 0.0 | 0.0 | 0.0 | 0.008621 |
| 1 | 1 | 43.655115 | -79.380219 | Downtown Toronto | 0.000777 | 0.000777 | 0.002331 | 0.01554 | 0.001554 | 0.011655 | ... | 0.000777 | 0.002331 | 0.003108 | 0.012432 | 0.002331 | 0.005439 | 0.006993 | 0.000777 | 0.000777 | 0.001554 |
| 2 | 2 | 43.626243 | -79.396962 | East Toronto | 0.0 | 0.0 | 0.0 | 0.02439 | 0.0 | 0.00813 | ... | 0.01626 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01626 |
| 3 | 1 | 43.653963 | -79.387207 | West Toronto | 0.0 | 0.0 | 0.0 | 0.0 | 0.005587 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.011173 | 0.0 | 0.011173 | 0.005587 | 0.0 | 0.0 | 0.005587 |


In [83]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [85]:
map_Toronto = folium.Map(location=[latitude,longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_grouped['Latitude'], Toronto_grouped['Longitude'], Toronto_grouped['Borough'], Toronto_grouped['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  #labels start at 0, so they index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_Toronto)
       
map_Toronto
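
The colour scheme only needs one distinct hex colour per cluster label. As a sketch of that idea without matplotlib, the snippet below builds such a palette with the standard-library `colorsys` module. This is a swapped-in technique for illustration, not the method used in the cell above, and `cluster_palette` is a hypothetical helper.

```python
import colorsys

def cluster_palette(n):
    """Return n evenly spaced hex colours, one per cluster label 0..n-1."""
    palette = []
    for i in range(n):
        # Spread the hues evenly around the colour wheel.
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(3)
print(palette)
```

A marker for cluster label `c` would then use `palette[c]` for both its outline and fill colour.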