# Introduction
Montreal is the second most populous city in Canada, and the size and shape of the island of Montreal make daily transportation a challenge. <br> A good classification of its neighborhoods by type of venue could help plan transportation resources at different times of day. For instance, an area with many bars and nightclubs will not need transportation resources (bus, taxi, etc.) at the same times as an area made up mainly of offices or commercial venues. <br>
A taxi company could use such a neighborhood classification to dispatch its fleet more effectively throughout the day. <br>
The city's transit system could also use this input alongside the statistics it already collects. It could also inform an expansion of the metro network, which is likely to happen in Montreal in the near future.

# Data

We will use the Foursquare API to get the number of venues by category for each neighborhood. To query the API, we first need to build a dataset containing the coordinates of every neighborhood in Montreal.

#### Creation of the Neighborhood dataset

In [2]:
import numpy as np 
import pandas as pd
import json 
import requests 
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium 

We start by scraping the Wikipedia page that lists the postal codes of Montreal:

In [13]:
df = pd.read_html('https://fr.wikipedia.org/wiki/Liste_des_codes_postaux_canadiens_d%C3%A9butant_par_H')[2]
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,H0ANon assigné,H1APointe-aux-Trembles,H2ASaint-MichelEst,H3ACentre-ville de MontréalNord (Université Mc...,H4ANotre-Dame-de-GrâceNord-est,H5APlace Bonaventure,H7ADuvernay-Est,H8ANon assigné,H9ADollard-Des-OrmeauxNord-ouest
1,H0BNon assigné,H1BMontréal-Est,H2BAhuntsicNord,H3BCentre-ville de MontréalEst,H4BNotre-Dame-de-GrâceSud-ouest,H5BComplexe Desjardins,H7BSaint-François,H8BNon assigné,H9BDollard-Des-OrmeauxEst
2,H0CNon assigné,H1CRivière-des-PrairiesNord-est,H2CAhuntsicCentre,H3CGriffintown(Incluant Île Notre-Dame & Île S...,H4CSaint-Henri,H5CNon assigné,H7CSaint-Vincent-de-Paul,H8CNon assigné,H9CL'Île-BizardNord-est
3,H0ENon assigné,H1ERivière-des-PrairiesSud-ouest,H2EVillerayNord-est,H3EÎle des Sœurs,H4EVille Émard,H5ENon assigné,H7EDuvernay,H8ENon assigné,H9EL'Île-BizardSud-ouest
4,H0GNon assigné,H1GMontréal-NordNord,H2GPetite-PatrieNord-est,H3GCentre-ville de MontréalSud-est (Université...,H4GVerdunNord,H5GNon assigné,H7GPont-Viau,H8GNon assigné,H9GDollard-Des-OrmeauxSud-ouest


This format is not convenient: each cell fuses a postal code with a neighborhood name. We reshape the table into one row per postal code:

In [14]:
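# Flatten the 9-column grid into a single column, reading row by row
df2 = pd.DataFrame([df.loc[label, i] for label, row in df.iterrows() for i in range(9)])
# The first three characters are the postal code; the rest is the neighborhood name
df2["Postal Code"] = df2[0].str.slice(0, 3)
df2["Neighborhood"] = df2[0].str.slice(3)
df2 = df2[["Postal Code", "Neighborhood"]]
# Drop unassigned codes ("Non assigné")
df2 = df2[df2["Neighborhood"] != "Non assigné"]
df2 = df2.reset_index(drop=True)
df2.head()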

Unnamed: 0,Postal Code,Neighborhood
0,H1A,Pointe-aux-Trembles
1,H2A,Saint-MichelEst
2,H3A,Centre-ville de MontréalNord (Université McGill)
3,H4A,Notre-Dame-de-GrâceNord-est
4,H5A,Place Bonaventure
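
The reshaping step above can also be sketched on a toy table. This is only an illustration under assumptions: the two-column `wide` frame stands in for the scraped table, and the regular expression assumes every cell starts with a letter-digit-letter postal prefix. The names `wide`, `cells` and `tidy` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the scraped table: each cell fuses a 3-character code with a name.
wide = pd.DataFrame([
    ["H0ANon assigné", "H1APointe-aux-Trembles"],
    ["H0BNon assigné", "H1BMontréal-Est"],
])

# ravel() reads the grid row by row and returns a single flat array of cells.
cells = pd.Series(wide.to_numpy().ravel())

# Assumed pattern: letter-digit-letter prefix, then the neighborhood name.
tidy = cells.str.extract(r"^(?P<Code>[A-Z]\d[A-Z])(?P<Neighborhood>.*)$")

# Drop unassigned codes, as in the notebook.
tidy = tidy[tidy["Neighborhood"] != "Non assigné"].reset_index(drop=True)
print(tidy)
```

Compared with fixed-position slicing, a pattern-based split leaves a cell as missing values when it does not match the expected prefix, which can make malformed cells easier to spot.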


After many unsuccessful attempts with the geopy geocoder, which is not reliable with incomplete addresses, I was lucky to find geocoordinates for most of the postal codes on the GeoNames website:

In [15]:
df_add = pd.read_html('https://www.geonames.org/postal-codes/CA/QC/quebec.html')[2]
df_add.head()

Unnamed: 0.1,Unnamed: 0,Place,Code,Country,Admin1,Admin2,Admin3
0,1.0,Mont-Joli,G5H,Canada,Quebec,Bas-Saint-Laurent,Mont-Joli
1,,48.584/-68.192,48.584/-68.192,48.584/-68.192,48.584/-68.192,48.584/-68.192,48.584/-68.192
2,2.0,Duvernay-Est,H7A,Canada,Quebec,,
3,,45.674/-73.592,45.674/-73.592,45.674/-73.592,45.674/-73.592,45.674/-73.592,45.674/-73.592
4,3.0,Saint-Vincent-de-Paul,H7C,Canada,Quebec,,


Here again we have to reformat: postal codes and coordinates alternate from one row to the next.

In [16]:
# In the "Code" column, even rows hold postal codes and odd rows hold "lat/lng" strings
liste = df_add["Code"].tolist()
coordinate = liste[1::2]
postal = liste[0::2]
del postal[-1]  # the last even row has no coordinate row after it
df_coord = pd.DataFrame({"Code": postal, "Coordonnées": coordinate})
# Split "lat/lng" into separate columns
df_coord["Latitude"] = df_coord["Coordonnées"].apply(lambda x: x.split("/")[0])
df_coord["Longitude"] = df_coord["Coordonnées"].apply(lambda x: x.split("/")[1])
df_coord.drop("Coordonnées", axis=1, inplace=True)
# Keep only the numbered rows, which carry the place name and postal code
df_add = df_add[df_add["Unnamed: 0"].notna()]
df_add = df_add[["Code", "Place"]]
df_add_final = df_add.merge(df_coord, on="Code")
df_add_final.head()

Unnamed: 0,Code,Place,Latitude,Longitude
0,G5H,Mont-Joli,48.584,-68.192
1,H7A,Duvernay-Est,45.674,-73.592
2,H7C,Saint-Vincent-de-Paul,45.617,-73.649
3,H7E,Duvernay,45.623,-73.695
4,H7G,Pont-Viau,45.577,-73.687
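
As an illustration of the pairing logic, here is a minimal sketch on a toy column. It assumes codes and coordinate strings alternate strictly; the names `raw`, `codes`, `coords` and `paired` are hypothetical. Unlike the notebook cell, it converts the coordinates to floats right away.

```python
import pandas as pd

# Toy stand-in for the scraped "Code" column: codes and "lat/lng" strings alternate.
raw = pd.Series(["G5H", "48.584/-68.192", "H7A", "45.674/-73.592"])

# Even positions are postal codes, odd positions are coordinate strings.
codes = raw.iloc[0::2].reset_index(drop=True)
coords = raw.iloc[1::2].reset_index(drop=True)

# expand=True returns one column per piece of the split.
lat_lng = coords.str.split("/", expand=True).astype(float)
lat_lng.columns = ["Latitude", "Longitude"]

paired = pd.concat([codes.rename("Code"), lat_lng], axis=1)
print(paired)
```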


Now we can merge the Wikipedia data with the geocoordinates dataset:

In [17]:
df2.rename(columns = {"Postal Code": "Code"}, inplace = True)
df_add_final.set_index("Code", inplace =True)
df2.set_index("Code", inplace =True)
df_total = df2.join(df_add_final)
df_total["Latitude"] = df_total["Latitude"].astype(float)
df_total["Longitude"] = df_total["Longitude"].astype(float)
df_total.drop("Place", axis = 1, inplace = True)
df_total.head()

Code,Neighborhood,Latitude,Longitude
H1A,Pointe-aux-Trembles,,
H2A,Saint-MichelEst,,
H3A,Centre-ville de MontréalNord (Université McGill),45.504,-73.575
H4A,Notre-Dame-de-GrâceNord-est,45.472,-73.615
H5A,Place Bonaventure,,
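
One way to see which postal codes still lack coordinates after the join is to filter on missing values. The sketch below uses a toy frame named `toy` in place of `df_total`; that name and `missing_codes` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_total after the join: some codes have no coordinates.
toy = pd.DataFrame(
    {
        "Neighborhood": ["Pointe-aux-Trembles", "Notre-Dame-de-GrâceNord-est", "Place Bonaventure"],
        "Latitude": [None, 45.472, None],
        "Longitude": [None, -73.615, None],
    },
    index=pd.Index(["H1A", "H4A", "H5A"], name="Code"),
)

# A row is incomplete if either coordinate is missing.
incomplete = toy[["Latitude", "Longitude"]].isna().any(axis=1)
missing_codes = toy.index[incomplete].tolist()
print(missing_codes)
```

The resulting list is the set of codes that need to be looked up by another means.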


To fill in the missing coordinates, the only option is to look them up manually on Google Maps:

In [18]:
manual_search ={"H1A": [45.674145, -73.500435],
"H2A": [45.561809, -73.601338],
"H5A": [45.49984, -73.56597],
"H9A":[45.495801, -73.832858],
"H2B":[45.576195, -73.649025],
"H9B":[45.489237, -73.801538],
"H4C":[45.476287, -73.586520],
"H3J":[45.486234, -73.573269],
"H9J":[45.450587, -73.873719],
"H3K":[45.483597, -73.552917],
"H9K":[45.456928, -73.915459],
"H3N":[45.528919, -73.629486],
"H8R":[45.42775, -73.646858],
"H9R":[45.460739, -73.813332],
"H1V":[45.559093, -73.542107],
"H1W":[45.544970, -73.547551],
"H9W":[45.431661, -73.866620],
"H3X":[45.482044, -73.64106],
"H2Y":[45.505745, -73.553612],
"H7Y":[45.531265, -73.856384],
"H8Y":[45.506451, -73.788071],
"H8Z":[45.505234, -73.839447],
"H8P":[45.425505, -73.605799]}
for key, (lat, lng) in manual_search.items():
    df_total.loc[key, "Latitude"] = lat
    df_total.loc[key, "Longitude"] = lng
df_total.head()

Code,Neighborhood,Latitude,Longitude
H1A,Pointe-aux-Trembles,45.674145,-73.500435
H2A,Saint-MichelEst,45.561809,-73.601338
H3A,Centre-ville de MontréalNord (Université McGill),45.504,-73.575
H4A,Notre-Dame-de-GrâceNord-est,45.472,-73.615
H5A,Place Bonaventure,45.49984,-73.56597
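
As an alternative sketch, the manual values could be applied in one step with `DataFrame.update`, which aligns on the index. The frames `toy` and `manual_df` are hypothetical stand-ins. One difference to be aware of: `update` only touches codes already present in the index, whereas assigning through `.loc` with an unknown key adds a new row.

```python
import pandas as pd

# Toy stand-in for df_total: H1A is missing its coordinates.
toy = pd.DataFrame(
    {"Latitude": [float("nan"), 45.472], "Longitude": [float("nan"), -73.615]},
    index=pd.Index(["H1A", "H4A"], name="Code"),
)

# Manually collected coordinates, keyed by postal code.
manual = {"H1A": [45.674145, -73.500435]}
manual_df = pd.DataFrame.from_dict(
    manual, orient="index", columns=["Latitude", "Longitude"]
)

# update() overwrites values in place, row by row, wherever manual_df has data.
toy.update(manual_df)
print(toy)
```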


The last step is to remove some postal codes that are not useful for the study (one is the Santa Claus postal code; the others correspond to individual buildings in downtown Montreal):

In [19]:
df_total.drop(["H5B","H0P","H4Z","H0H"], axis=0,inplace=True)

Now we can display the neighborhoods on a map to check the result:

In [20]:
latitude = 45.508888
longitude = -73.561668
map_montreal= folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, neighborhood in zip(df_total['Latitude'], df_total['Longitude'], df_total["Neighborhood"]):
    label = folium.Popup(str(neighborhood), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_montreal)
    
map_montreal

### Foursquare data: venues and categories

We will use the Foursquare API to explore the venue categories in each neighborhood. Venues can be categorized as residential, professional, shopping or leisure. First we need to know which venue categories exist in the Foursquare database.

In [21]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [32]:
categories_url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
            
results = requests.get(categories_url).json()

for category in results['response']['categories']:
    print(category["name"] + "    " + category["id"])

Arts & Entertainment    4d4b7104d754a06370d81259
College & University    4d4b7105d754a06372d81259
Event    4d4b7105d754a06373d81259
Food    4d4b7105d754a06374d81259
Nightlife Spot    4d4b7105d754a06376d81259
Outdoors & Recreation    4d4b7105d754a06377d81259
Professional & Other Places    4d4b7105d754a06375d81259
Residence    4e67e38e036454776db1fb3a
Shop & Service    4d4b7105d754a06378d81259
Travel & Transport    4d4b7105d754a06379d81259


There are 10 top-level categories that we will use to classify the neighborhoods. We also extracted the ID of each category, which will be passed as a parameter in the request URLs.
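
To illustrate how a category ID could enter a request URL, here is a minimal sketch. It rests on several assumptions: that the v2 `venues/search` endpoint is the one queried, that it is filtered with the `categoryId` parameter, and that a 500 m radius and a limit of 50 are suitable. The function `build_search_url` is a hypothetical helper, not part of the notebook, and it only builds the URL without sending a request.

```python
from urllib.parse import urlencode

def build_search_url(lat, lng, category_id, client_id, client_secret,
                     version="20180605", radius=500, limit=50):
    """Build a category-filtered venue search URL (assumed endpoint and parameters)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",        # "latitude,longitude" of the neighborhood
        "categoryId": category_id,   # one of the top-level category IDs
        "radius": radius,            # search radius in meters (assumed value)
        "limit": limit,              # maximum number of venues (assumed value)
    }
    # urlencode percent-encodes the values, including the comma in "ll".
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Example: the "Food" category around the H3A coordinates from the table above.
url = build_search_url(45.504, -73.575, "4d4b7105d754a06374d81259",
                       "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

Calling this once per neighborhood and per category would give one URL for each pair, whose responses could then be counted.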