# Applied Data Science Capstone

## Segmenting and Clustering Neighborhoods in Toronto

We start by importing the required libraries.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

We read the tables from the Wikipedia page and pass "match='Postal Code'" so that only tables containing that text are returned.

In [6]:
table_MN = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', match='Postal Code')
print(f'Total tables: {len(table_MN)}')

Total tables: 1


Then we create a data frame from the selected table and rename the columns.

In [7]:
df = table_MN[0]
df.columns = ['PostalCode', 'Borough','Neighbourhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


We check the data types of the data frame's columns.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   PostalCode     180 non-null    object
 1   Borough        180 non-null    object
 2   Neighbourhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


We only process cells that have an assigned borough, so we drop the rows whose borough is "Not assigned".

In [9]:
df = df.loc[df['Borough'] != 'Not assigned']
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
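
As a minimal sketch, the same filter can be tried on a small hand-made frame. The sample rows and the names `sample`, `has_borough` and `filtered` are illustrative assumptions; the `.copy()` and `reset_index(drop=True)` calls are optional additions that the notebook itself does not use.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only)
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Regent Park, Harbourfront'],
})

# Boolean mask: True for rows whose borough is assigned
has_borough = sample['Borough'] != 'Not assigned'

# .copy() gives an independent frame, so later assignments do not
# trigger SettingWithCopyWarning; reset_index renumbers rows from 0
filtered = sample.loc[has_borough].copy().reset_index(drop=True)
print(filtered)
```

Resetting the index is a matter of taste here, because the groupby step that follows rebuilds the index anyway.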


We combine rows that share a PostalCode into a single row, with the neighborhoods separated by commas.

In [10]:
df = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(lambda x: ','.join(x)).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
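
To see what the groupby step does when a postal code really does span several rows, here is a small sketch on made-up input. The frame `sample` is an assumption for illustration, and the separator `', '` (comma plus space) is a deliberate variation on the notebook's `','`.

```python
import pandas as pd

# Toy input in which one postal code is split over two rows
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern', 'Rouge', 'Woburn'],
})

# Group on postal code and borough, then join the neighbourhood
# names of each group into one comma-separated string
combined = (
    sample.groupby(['PostalCode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)
```

If every postal code already occupies a single row, as the output above suggests, the groupby leaves the neighbourhood strings unchanged and only sorts the rows by PostalCode.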


If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [11]:
# rows that have a borough but no assigned neighbourhood
no_neighbourhood = df['Neighbourhood'] == 'Not assigned'
# copy the borough name into the neighbourhood column for those rows
df.loc[no_neighbourhood, 'Neighbourhood'] = df.loc[no_neighbourhood, 'Borough']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
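
The replacement rule can be checked on a tiny example. This is only a sketch: the two rows below are invented for illustration, and the final check is an extra sanity test that is not part of the notebook.

```python
import pandas as pd

# Toy rows: the first has a borough but no assigned neighbourhood
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Select the rows to fix once, then assign through a single .loc call
missing = sample['Neighbourhood'] == 'Not assigned'
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']

# Sanity check: no placeholder value should remain
remaining = int((sample['Neighbourhood'] == 'Not assigned').sum())
print(sample)
print('Rows still not assigned:', remaining)
```

Assigning through one `.loc[mask, column]` call writes to the frame itself, whereas chained indexing such as `df.loc[mask]['Neighbourhood']` may operate on a temporary copy.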


Let's print the number of rows and columns in the data frame.

In [12]:
df.shape

(103, 3)

## Get the latitude and the longitude coordinates of each neighborhood

In [None]:
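import geocoder  # third-party package (pip install geocoder)

def get_lat_lng(postal_code, max_attempts=5):
    """Return (latitude, longitude) for one postal code, or (None, None) if every attempt fails."""
    # bounded retries, so a failing service cannot cause an endless loop
    for _ in range(max_attempts):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        if g.latlng is not None:
            return g.latlng[0], g.latlng[1]
    return None, None

# geocode a single postal code; formatting the whole column into the
# query string would send the text of a pandas Series instead
postal_code = df['PostalCode'].iloc[0]
latitude, longitude = get_lat_lng(postal_code)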
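
The retry idea behind this cell can be sketched without any network access by passing the geocoding function in as a parameter. Everything named below (`lookup_with_retries`, `flaky_geocode`, `max_attempts`) is hypothetical and exists only for illustration; the stub returns the M1B coordinates from the Geospatial_data file rather than calling a real service.

```python
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]


def lookup_with_retries(postal_code: str,
                        geocode: Callable[[str], Optional[Coords]],
                        max_attempts: int = 3) -> Optional[Coords]:
    """Call `geocode` up to `max_attempts` times; return None if all attempts fail."""
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
    return None


# Hypothetical stand-in for a geocoding service:
# it fails on the first call and succeeds afterwards
calls = {'count': 0}


def flaky_geocode(query: str) -> Optional[Coords]:
    calls['count'] += 1
    if calls['count'] < 2:
        return None
    return (43.806686, -79.194353)


result = lookup_with_retries('M1B', flaky_geocode)
always_none = lookup_with_retries('M1B', lambda q: None, max_attempts=2)
print(result, always_none)
```

Capping the number of attempts means an unavailable service yields `None` instead of a cell that never finishes, which is one reason to fall back on a prepared coordinates file.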

We import the geographical coordinates from the Geospatial_data file and set the "Postal Code" column as the index.

In [23]:
df_geo = pd.read_csv("http://cocl.us/Geospatial_data")
df_geo = df_geo.set_index('Postal Code')
df_geo

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
...,...,...
M9N,43.706876,-79.518188
M9P,43.696319,-79.532242
M9R,43.688905,-79.554724
M9V,43.739416,-79.588437


Then we join our data frames on "PostalCode", matching that column against the "Postal Code" index of df_geo.

In [24]:
df = df.join(df_geo, on='PostalCode')
df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
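
Because `DataFrame.join` performs a left join by default, a postal code with no match in the coordinates table would silently receive NaN values. The sketch below shows one way to detect that; the frames `neighbourhoods` and `coords` are toy data, and 'M0X' is a made-up postal code included only to force a miss.

```python
import pandas as pd

# Toy neighbourhood table; 'M0X' is a made-up code with no coordinates
neighbourhoods = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M0X'],
    'Borough': ['Scarborough', 'Scarborough', 'Nowhere'],
})

# Toy coordinates table indexed by postal code, like df_geo
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
}).set_index('Postal Code')

# Left join: the 'PostalCode' column is matched against the index of coords
joined = neighbourhoods.join(coords, on='PostalCode')

# Rows that found no coordinates have NaN in Latitude
unmatched = joined[joined['Latitude'].isna()]
print(unmatched['PostalCode'].tolist())
```

On the real data, an empty `unmatched` frame would indicate that all 103 postal codes received coordinates.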


## Exploring and clustering the neighborhoods in Toronto