# Segmenting and Clustering Neighborhoods in Toronto
--------------------------------------------------------------------------------------

# -----> Go to [Part 2](#Part-2) <-----

# **Part 1**

This part scrapes the Toronto neighborhood data from the Wikipedia page https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&diff=942851379&oldid=942655599 and loads it into a pandas DataFrame.

## Import libraries and scrape the table from Wikipedia

In [2]:
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
import numpy as np
import geopy
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
import sklearn
import json
from sklearn.cluster import KMeans
import folium

In [3]:
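# Download every table found on the Wikipedia revision page (read_html returns a list of DataFrames);
# the postal-code table is the one at index 1
source = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&diff=942851379&oldid=942655599")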
source[1]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


In [4]:
df=source[1]
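
Picking the table with `source[1]` relies on its position on the page, which can change between page revisions. As a minimal sketch, the table could instead be selected by its column names. The helper `pick_table` and the toy `tables` list below are hypothetical and only stand in for the list that `pd.read_html` returns; they are not part of the original notebook.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(map(str, table.columns))):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Toy stand-in for the list returned by pd.read_html (not the real page content)
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

df_demo = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(df_demo)
```

With the real data, the call would be `pick_table(source, ['Postcode', 'Borough', 'Neighbourhood'])`, assuming the page still uses those column headers.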

### Remove rows where Borough is "Not assigned"

In [5]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop = True)
print('After dropping rows where borough is "Not assigned", Shape is: ',df.shape)
print('Number of rows with Neighbourhood = "Not assigned" but Borough with some value: ', 
      df[df['Neighbourhood'] == 'Not assigned'].shape[0])

After dropping rows where borough is "Not assigned", Shape is:  (210, 3)
Number of rows with Neighbourhood = "Not assigned" but Borough with some value:  0
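
The check above finds no rows that have a borough but no neighbourhood, so no further cleaning is needed here. If such rows did appear in another revision of the page, one possible fallback (an assumption, not something this notebook does) would be to reuse the borough name as the neighbourhood. The sketch below uses toy data; the `M7A` row is a made-up case for illustration.

```python
import pandas as pd

# Toy rows; the M7A row is a hypothetical case that does not occur in the scraped table
demo = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: drop rows without a borough, as in the cell above
demo = demo[demo["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2 (assumed fallback): copy the borough name into unassigned neighbourhoods
unassigned = demo["Neighbourhood"] == "Not assigned"
demo.loc[unassigned, "Neighbourhood"] = demo.loc[unassigned, "Borough"]

print(demo)
```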


In [6]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


### Group by Postcode and Borough, joining the Neighbourhood names

In [7]:
# Collapse rows that share a postcode and borough into one row,
# with their neighbourhood names joined by ", "
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join)
df = df.reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
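
To make the grouping step easier to follow, here is a small self-contained sketch on five rows taken from the table shown earlier. It uses `agg` instead of `apply`, which should give the same result for this case; the names `rows` and `grouped` are only for illustration.

```python
import pandas as pd

# Five rows copied from the cleaned table above
rows = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M7A", "M1B", "M1B"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Scarborough", "Scarborough"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Queen's Park",
                      "Rouge", "Malvern"],
})

# One row per (Postcode, Borough); neighbourhood names are joined in their original order
grouped = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)

print(grouped)
```

Note that `groupby` sorts the group keys by default, which is why `M1B` comes first in the result even though it appears last in the input.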


In [21]:
df.shape

(103, 3)

# Part 2 
Get the latitude and longitude of each postal code.

### Load the geospatial table *(download here: http://cocl.us/Geospatial_data)*

In [27]:
df_geo = pd.read_csv('Geospatial_Coordinates.csv')

### Rename `Postal Code` to `Postcode` and merge

In [31]:
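# Align the key column name with the neighbourhood table
df_geo.rename(columns={'Postal Code':'Postcode'},inplace=True)
# Inner join on Postcode: only postcodes present in both tables are kept
df_merge = pd.merge(df_geo, df, on='Postcode')
# Reorder the columns for readability
df_2 = df_merge[['Postcode','Borough','Neighbourhood','Latitude','Longitude']]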
df_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
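
Because the merge above is an inner join, a postcode that is missing from the coordinates file would be dropped without any message. A possible check, sketched here on toy data, is a left join with `validate` and `indicator`, which are standard `DataFrame.merge` parameters. The `M9Z` row is a made-up postcode used only to show what a miss looks like.

```python
import pandas as pd

# Toy neighbourhood table; M9Z is a hypothetical code with no coordinates
hoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    "Neighbourhood": ["Rouge, Malvern",
                      "Highland Creek, Rouge Hill, Port Union",
                      "Nowhere"],
})

# Toy coordinates table with the same column names as Geospatial_Coordinates.csv
geo = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = hoods.merge(
    geo.rename(columns={"Postal Code": "Postcode"}),
    on="Postcode",
    how="left",             # keep every neighbourhood row
    validate="one_to_one",  # raise if a postcode is duplicated on either side
    indicator=True,         # add a _merge column showing where each row came from
)

# Rows marked left_only found no matching coordinates
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", missing)
```

If `missing` is empty for the real tables, the inner join used in the notebook loses no rows.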
