# Segmenting and Clustering Neighborhoods in Toronto

## First get the page from Wikipedia

For this we will use **Beautiful Soup**, as suggested.

### Install Beautiful Soup in case it is not already installed

In [1]:
!pip install beautifulsoup4



### Make some imports

In [2]:
import requests
import numpy as np
import pandas as pd

### We use a variable to hold the URL

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Retrieve the table

The table is retrieved along with the rest of the page, so we will have to extract it and clean it.

**Beautiful Soup** has a `prettify()` method that pretty-prints the page, which makes the markup easier to inspect.

In [4]:
page = requests.get(url).text
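
`requests.get(url).text` returns whatever body the server sends, error pages included. Below is a minimal sketch of a more defensive fetch. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors."""
    # timeout (seconds) is an assumed value; without one the call can hang
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

It could be used as `page = fetch_html(url)` in place of the cell above.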

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page,'lxml')
# print(soup.prettify())

### Locate the table

Looking at the result, we can see where the table is (its tags and class), and we retrieve it

In [6]:
table = soup.find('table',{'class':'wikitable sortable'})
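
`soup.find` returns `None` when nothing matches, so a change in the page's markup would only show up later as an `AttributeError`. Here is a small sketch of an explicit check, run on a hand-written HTML snippet that stands in for the Wikipedia page. The names `sample_html`, `sample_soup` and `sample_table` are hypothetical, and the standard-library `html.parser` is used so the snippet does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Tiny hand-written HTML standing in for the real page (assumption)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

# Fail early with a clear message if the table is not where we expect it
if sample_table is None:
    raise ValueError("No table with class 'wikitable sortable' found")

n_rows = len(sample_table.find_all("tr"))
```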

### Loop over the table

#### Loop over the rows
#### And over the cells inside each row

Everything is collected in a list and then turned into a dataframe

In [7]:
boroughs = []
for row in table.find_all("tr"):
    arrayrow = []
    # Collect the text of every <td> cell in this row
    cells = row.find_all("td")
    for cell in cells:
        # Cell text ends with a newline; remove it
        celtext = cell.text.replace('\n','')
        arrayrow.append(celtext)
    boroughs.append(arrayrow)

df_boroughs = pd.DataFrame(boroughs)
df_boroughs.columns = ['PostalCode','Borough','Neighborhood']
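
As an alternative sketch, rows without any `<td>` cells could be skipped while looping, so that no empty row reaches the dataframe in the first place. The snippet runs on a hand-written table; `sample_html`, `rows` and `df_sample` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (assumption)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
  <tr><td>M4A
</td><td>North York
</td><td>Victoria Village
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) removes the trailing newline of each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

df_sample = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
```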

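### Clean

### First, the not-assigned rows

When both the borough and the neighborhood are not assigned, drop the row.

### Then the first row, which is empty

The header row of the table contains no `<td>` cells, so it came through as an empty row.

### Fix neighborhoods

Where only the neighborhood is not assigned, we read the borough and set the neighborhood to it.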

In [8]:
df_boroughs.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


In [9]:
df_boroughs.drop(
    df_boroughs[(df_boroughs.Borough == 'Not assigned') &
                (df_boroughs.Neighborhood == 'Not assigned')].index, inplace=True)

In [10]:
df_boroughs = df_boroughs.iloc[1:]
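
The two cleaning steps can also be written without `inplace=True` and without relying on the empty row being first. The sketch below uses a four-row toy frame in place of `df_boroughs`; `df_toy`, `both_unassigned` and `df_clean` are hypothetical names.

```python
import pandas as pd

# Toy data mirroring the first rows shown above (assumption)
df_toy = pd.DataFrame(
    [
        [None, None, None],
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# True only where BOTH columns are 'Not assigned'
both_unassigned = (df_toy.Borough == "Not assigned") & (
    df_toy.Neighborhood == "Not assigned"
)

# Keep the other rows, then drop rows in which every value is missing
df_clean = df_toy[~both_unassigned].dropna(how="all")
```

Row `M7A` survives because only its neighborhood is not assigned.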

##### How many neighborhoods are still not assigned?

One

In [11]:
df_boroughs[df_boroughs.Neighborhood == 'Not assigned']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 9 | M7A | Queen's Park | Not assigned |


### Assign the borough name to not-assigned neighborhoods

This is pretty straightforward: we just select the rows whose neighborhood is _Not assigned_ and assign the name of the borough to them

In [12]:
df_boroughs.loc[df_boroughs.Neighborhood == 'Not assigned', 'Neighborhood'] = df_boroughs.Borough
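
The assignment works because pandas aligns the right-hand side with the selected rows by index label. Here is a minimal sketch on a two-row toy frame (the names `df_toy` and `mask` are hypothetical) that makes the selection explicit on both sides:

```python
import pandas as pd

# Toy stand-in for df_boroughs (assumption)
df_toy = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighborhood": ["Parkwoods", "Not assigned"],
    }
)

# Rows whose neighborhood still needs a value
mask = df_toy.Neighborhood == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only
df_toy.loc[mask, "Neighborhood"] = df_toy.loc[mask, "Borough"]
```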

In [13]:
df_boroughs.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
| 6 | M5A | Downtown Toronto | Regent Park |
| 7 | M6A | North York | Lawrence Heights |


### Group neighborhoods that share a postal code

This is done by grouping by PostalCode and Borough and joining the neighborhood names of each group into one comma-separated string

In [14]:
df_result = df_boroughs.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
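
To see what the group-and-join does, here is a sketch on three toy rows taken from the preview above; `df_toy` and `df_grouped` are hypothetical names.

```python
import pandas as pd

# Two rows share postal code M5A, one row has M6A
df_toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M6A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
    }
)

# One output row per (PostalCode, Borough) pair; names joined with commas
df_grouped = (
    df_toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(",".join)
    .reset_index()
)
```

The two `M5A` rows collapse into a single row whose neighborhood is `Harbourfront,Regent Park`.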

In [15]:
df_result.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


## Final results

The shape is **(103, 3)**, meaning there are 103 different postal codes, each with its borough and its neighborhood names joined into one string

In [16]:
df_result.shape

(103, 3)
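
The shape alone does not show that every postal code appears exactly once. Here is a small sketch of how that could be checked, on a three-row toy frame standing in for `df_result` (the name `df_toy` is hypothetical):

```python
import pandas as pd

# Toy stand-in for df_result, using rows from the preview above
df_toy = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M1E"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighborhood": [
            "Rouge,Malvern",
            "Highland Creek,Rouge Hill,Port Union",
            "Guildwood,Morningside,West Hill",
        ],
    }
)

n_rows, n_cols = df_toy.shape

# True when no postal code is repeated
codes_unique = df_toy["PostalCode"].is_unique

# Number of distinct postal codes; equals n_rows when all are unique
n_codes = df_toy["PostalCode"].nunique()
```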

!pip install geocoder

### geocoder didn't work

So we just load the coordinates from the CSV file instead.

In [17]:
lat_lng_coords = pd.read_csv('Geospatial_Coordinates.csv')

In [18]:
lat_lng_coords.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


The key column is named _Postal Code_ here but _PostalCode_ in our dataframe, so we rename it to match

In [19]:
lat_lng_coords.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

Now we just merge the two dataframes on _PostalCode_

In [20]:
df_result = pd.merge(df_result, lat_lng_coords, on='PostalCode')
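
By default `pd.merge` performs an inner join, so a postal code with no coordinates would silently disappear from the result. The sketch below uses toy data to show one way to surface such rows with a left join and the `indicator` option. The postal code `M9Z` and the borough `Hypothetical` are made up for the example, and `left`, `coords`, `merged` and `missing` are hypothetical names.

```python
import pandas as pd

# Toy borough data; 'M9Z' is an invented code with no coordinates
left = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)

# Toy coordinates, with the same column names as the CSV preview
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
).rename(columns={"Postal Code": "PostalCode"})

merged = pd.merge(
    left,
    coords,
    on="PostalCode",
    how="left",             # keep every postal code from the left frame
    validate="one_to_one",  # raise if a code is duplicated on either side
    indicator=True,         # adds a '_merge' column describing each row
)

# Rows that found no matching coordinates
missing = merged[merged["_merge"] == "left_only"]
```

An empty `missing` frame would mean the inner join used in the notebook loses nothing.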

and voilà

In [21]:
df_result.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
