# Applied Data Science Capstone

## Segmenting and Clustering Neighborhoods in Toronto

We start by importing the required libraries.

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

We read the tables from the Wikipedia page with pd.read_html and pass match='Postal Code' so that only tables containing that text are returned.

In [52]:
table_MN = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', match='Postal Code')
print(f'Total tables: {len(table_MN)}')

Total tables: 1


Then we create a data frame from the selected table and rename the columns.

In [53]:
df = table_MN[0]
df.columns = ['PostalCode', 'Borough','Neighbourhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


We check the data types and non-null counts of the data frame's columns.

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   PostalCode     180 non-null    object
 1   Borough        180 non-null    object
 2   Neighbourhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


We only process rows that have an assigned borough, so we drop every row whose borough is Not assigned.

In [59]:
df = df.loc[df['Borough'] != 'Not assigned']
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
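
The filtering step can be illustrated on a small frame. This is only a sketch: the `toy` rows and the `assigned` name are made up for illustration and are not part of the notebook. Adding `reset_index(drop=True)` is an optional extra that renumbers the remaining rows from 0.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True for rows whose borough is assigned
has_borough = toy['Borough'] != 'Not assigned'

# Keep only those rows and renumber the index from 0
assigned = toy.loc[has_borough].reset_index(drop=True)
print(assigned)
```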


We combine rows that share a PostalCode into a single row, with the neighborhoods separated by commas.

In [66]:
df = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].agg(','.join).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
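
To see what the grouping does when a postal code really is spread over several rows, here is a minimal sketch on made-up data (the `toy` frame and the `merged` name are assumptions for illustration). It joins with `', '` rather than `','` purely for readability.

```python
import pandas as pd

# Hypothetical rows: M5A appears twice with different neighbourhoods
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (PostalCode, Borough); neighbourhood names are joined in
# their original order, and the group keys come back as columns
merged = (
    toy.groupby(['PostalCode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(merged)
```

Because `groupby` sorts the group keys by default, the result is ordered by postal code, which may explain why the outputs in this notebook start at M1B.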


If a row has a borough but its neighborhood is Not assigned, the neighborhood is set to the borough name.

In [65]:
# Rows whose neighbourhood is missing; compute the mask once and reuse it
not_assigned = df['Neighbourhood'] == 'Not assigned'
# Copy the borough name into the neighbourhood column for those rows only
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
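
An alternative way to express the same replacement is `Series.mask`, which swaps in values from another Series wherever a condition holds. The sketch below uses a made-up `toy` frame; it is not the notebook's data and is only meant to show the technique.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no neighbourhood
toy = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Where the neighbourhood is 'Not assigned', take the value from Borough;
# every other row keeps its existing neighbourhood
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned',
    toy['Borough'],
)
print(toy)
```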


Finally, we print the shape of the data frame, that is, its number of rows and columns.

In [67]:
df.shape

(103, 3)
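
Beyond the row count, a few quick checks can confirm that the cleaning steps did what was intended. The helper below is a hypothetical sketch (`check_cleaned` and the `toy` frame are not part of the notebook); it assumes the column names used above.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned table breaks an expected invariant."""
    # Each postal code should appear exactly once after grouping
    assert frame['PostalCode'].is_unique, 'duplicate postal codes remain'
    # No placeholder values should survive in either column
    assert not (frame['Borough'] == 'Not assigned').any(), 'unassigned borough remains'
    assert not (frame['Neighbourhood'] == 'Not assigned').any(), 'unassigned neighbourhood remains'

# Hypothetical cleaned rows, for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})

check_cleaned(toy)
print(toy.shape)
```

In the notebook itself, the call would be `check_cleaned(df)` after the last cleaning cell.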