# Segmenting and Clustering Neighborhoods in Toronto

First, we install lxml, which the read_html() function uses to parse HTML.

In [2]:
!pip install lxml

Collecting lxml
  Downloading lxml-4.6.2-cp37-cp37m-manylinux1_x86_64.whl (5.5 MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.2


Then we import pandas and read the list from Wikipedia into a dataframe. read_html() returns a list with one dataframe per table on the page, so we take the first one.

In [12]:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# read_html() returns a list of dataframes; index 0 is the first table on the page
df = pd.read_html(url)[0]

We print the head of the dataframe to check that it looks as intended.

In [14]:
print(df.head())

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront


We create df_cleaned, a subset of the original dataframe that excludes the rows whose Borough is "Not assigned". The index is reset so that it starts again at 0.

In [15]:
df_cleaned = df[df.Borough != 'Not assigned'].reset_index(drop=True)

We take a look at the result

In [17]:
print(df_cleaned.head())

  Postal Code           Borough                                Neighbourhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government
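To make the filtering step easier to follow, here is a minimal sketch on a small hand-made dataframe. The names `toy`, `keep` and `toy_cleaned` and the three rows are illustrative assumptions, not part of the scraped data.

```python
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# The comparison yields a boolean Series with one True/False per row.
keep = toy["Borough"] != "Not assigned"

# Indexing with that Series keeps only the True rows;
# reset_index(drop=True) renumbers them from 0 and discards the old index.
toy_cleaned = toy[keep].reset_index(drop=True)
print(toy_cleaned)
```

Without reset_index(drop=True), the remaining rows would keep their original labels (1 and 2 here) instead of starting at 0.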


We then combine rows that share the same postal code and borough, joining their neighbourhoods into a single comma-separated string.

In [19]:
df_cleaned_grouped = df_cleaned.groupby(['Postal Code','Borough'], as_index=False).agg(lambda x: ','.join(x))

We take a look at the result

In [20]:
print(df_cleaned_grouped.head())

  Postal Code      Borough                           Neighbourhood
0         M1B  Scarborough                          Malvern, Rouge
1         M1C  Scarborough  Rouge Hill, Port Union, Highland Creek
2         M1E  Scarborough       Guildwood, Morningside, West Hill
3         M1G  Scarborough                                  Woburn
4         M1H  Scarborough                               Cedarbrae
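In the output above each postal code already sits on one row, so the effect of the grouping is hard to see. The following sketch uses a hypothetical toy dataframe in which one postal code is split across two rows. The data and the names `toy` and `toy_grouped` are assumptions for illustration, and the separator here is ", " rather than the "," used in the cell above.

```python
import pandas as pd

# Toy data: postal code M5A appears on two rows (illustrative values only).
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Malvern"],
})

# Group on the two key columns and join the remaining column's strings.
# as_index=False keeps the keys as ordinary columns instead of an index.
toy_grouped = (
    toy.groupby(["Postal Code", "Borough"], as_index=False)
       .agg(lambda names: ", ".join(names))
)
print(toy_grouped)
```

The two M5A rows collapse into one row whose Neighbourhood is the joined string, and groupby sorts the result by the key columns by default.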


We create a boolean mask that is True for every row whose neighbourhood is "Not assigned". For each row selected by the mask, we set the neighbourhood to the name of the borough.

In [22]:
mask = df_cleaned_grouped['Neighbourhood'] == "Not assigned"
df_cleaned_grouped.loc[mask, 'Neighbourhood'] = df_cleaned_grouped.loc[mask, 'Borough']
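As a small illustration of the mask assignment, the sketch below applies the same two lines to a hypothetical two-row dataframe. The rows and the name `toy` are made up for the example.

```python
import pandas as pd

# Toy data: the first row has a borough but no neighbourhood (illustrative only).
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M1B"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Malvern, Rouge"],
})

# Boolean Series: True where the neighbourhood is missing.
mask = toy["Neighbourhood"] == "Not assigned"

# .loc with the mask selects the same rows on both sides, so each
# missing neighbourhood is filled with the borough from its own row.
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Rows where the mask is False are left untouched, so "Malvern, Rouge" stays as it is.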

We then look at the size of the resulting dataframe

In [24]:
df_cleaned_grouped.shape

(103, 3)