# Clustering Neighborhoods in Toronto 
This notebook shows how to scrape web data about boroughs and neighborhoods in Toronto, enrich it with location information from Foursquare, and cluster the neighborhoods by the types of venues they contain.

## Scraping data from Wikipedia and preparing a dataframe

In [2]:
# Import packages
import numpy as np
import pandas as pd
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # the samples_generator submodule is gone in newer scikit-learn releases
import lxml  # parser backend used by BeautifulSoup
import json
#!conda install -c conda-forge geopy --yes # uncomment this line if necessary
from geopy.geocoders import Nominatim
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if necessary
from bs4 import BeautifulSoup
print('Libraries imported.')


Libraries imported.




### Data Scraping 
The data is scraped from the table on the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. 

The BeautifulSoup package extracts the table from the page, and the pandas function `read_html` converts it into a DataFrame.

In [3]:
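from io import StringIO

# Download the page and parse it with the lxml backend
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)

# soup.table is the first <table> on the page; read_html returns a list of DataFrames
tab = str(soup.table)
df_scraped = pd.read_html(StringIO(tab))  # newer pandas versions deprecate passing literal HTML strings
df = df_scraped[0]
df.head(40)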




<title>List of postal codes of Canada: M - Wikipedia</title>


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
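
To see the parsing step in isolation, the following minimal sketch runs BeautifulSoup on a small inline HTML snippet instead of the live page. The snippet, the variable names (`html`, `df_demo`), and the use of the built-in `html.parser` backend with a manual row loop are assumptions for illustration; the cell above uses `lxml` and `pd.read_html` instead.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical inline HTML standing in for the Wikipedia table
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""

# Parse the markup and locate the first <table> element
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Collect the text of every header and data cell, row by row
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row holds the column names, the rest hold the data
header, *body = rows
df_demo = pd.DataFrame(body, columns=header)
print(df_demo)
```

Walking the rows by hand makes it explicit what `pd.read_html` does behind the scenes; for real pages `pd.read_html` is usually the shorter route.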


### Cleaning and preparing the Pandas dataframe built from the HTML table
Postal codes with no assigned borough are dropped from the data set. Neighborhoods that share a postal code are combined into one comma-separated entry, and neighborhoods listed as "Not assigned" are replaced with the name of their borough.

In [4]:
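# Drop postal codes that have no borough assigned
df = df[df.Borough != 'Not assigned']

# Combine neighborhoods that share a postal code into one comma-separated string
df = df.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)

# Use the borough name where the neighborhood is "Not assigned"
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

display(df.head(20))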

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
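
The three cleaning rules are easier to follow on a toy table in which each case actually occurs. The sketch below is an illustration only: the rows, including the postal code `M9Z`, are hypothetical and not taken from the Wikipedia table, and the name `toy` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering all three cleaning cases
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M9Z"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1. Drop rows whose borough is not assigned
toy = toy[toy["Borough"] != "Not assigned"]

# 2. Join neighborhoods that share a postal code into one string
toy = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Fall back to the borough name where the neighborhood is not assigned
toy["Neighborhood"] = np.where(
    toy["Neighborhood"] == "Not assigned", toy["Borough"], toy["Neighborhood"]
)
print(toy)
```

Selecting the `Neighborhood` column before aggregating keeps the join from being applied to any other column that might be present.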


In [5]:
df.shape

(103, 3)

## Data preparation
I download a CSV file with the geographical coordinates of each postal code and merge it with the existing dataframe on the `Postal Code` column.

In [9]:
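# Load the latitude and longitude of every postal code
coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
display(coordinates.head())

# Inner join on the shared 'Postal Code' column
df_toronto = pd.merge(df,coordinates,on='Postal Code')
display(df_toronto)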


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing Centre,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
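
Because `pd.merge` performs an inner join by default, a postal code that is missing from the coordinates file would silently disappear from the result. One possible way to check for this, sketched here on toy data, is a left join with `indicator=True` and `validate="one_to_one"`. The frames `neighborhoods` and `coords` and the postal code `M0X` are hypothetical.

```python
import pandas as pd

# Toy frames; M0X is a hypothetical code that has no coordinates
neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# The left join keeps every neighborhood row; validate raises if a key repeats
checked = pd.merge(
    neighborhoods,
    coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows marked 'left_only' found no matching coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join used above loses no rows.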
