# Segmenting and Clustering Neighborhoods in Toronto

## First get the page from Wikipedia

For this we will use **Beautiful Soup**, as suggested

### Install Beautiful Soup in case it is not already installed

In [None]:
!pip install beautifulsoup4

### Make some imports

In [None]:
import requests
import numpy as np
import pandas as pd

### We use a variable to hold the URL

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Retrieve the table

The table comes back together with the rest of the page, so we will have to extract it and clean it properly

**Beautiful Soup** has a `prettify()` method that returns the page as indented HTML, which makes the table tags easier to spot

In [None]:
page = requests.get(url).text

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page,'lxml')
# print(soup.prettify())

### Locate the table

Looking at the prettified output we can see which tags hold the table, and we retrieve it by its class

In [None]:
table = soup.find('table',{'class':'wikitable sortable'})
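
As a small offline illustration of this lookup, the sketch below runs the same `find` call on a made-up HTML snippet (the `sample_html` string and its contents are assumptions, not the real Wikipedia page). It also shows a CSS selector as an alternative: the dictionary form matches the exact class string, while the selector matches any table carrying both classes.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the downloaded HTML.
sample_html = """
<html><body>
  <table class="navbox"><tr><td>navigation</td></tr></table>
  <table class="wikitable sortable">
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  </table>
</body></html>
"""

# html.parser ships with Python, so no extra parser is needed here.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same style of lookup as above: exact match on the class attribute string.
by_attrs = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches a table that has both classes, in any order.
by_css = sample_soup.select_one("table.wikitable.sortable")

print(by_css.get("class"))
print(len(by_attrs.find_all("tr")))
```

If the page ever adds another class to the table, the selector form would probably still match while the exact-string form would return `None`, so it is worth checking the result before looping over it.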

### Loop over the table

#### Loop over the rows
#### And, inside each row, over the cells

Each row becomes a list of cell texts, and the lists are then loaded into a dataframe

In [None]:
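boroughs = []
for row in table.find_all("tr"):
    # A row with no <td> cells (such as a <th>-only header row) gives an empty list
    arrayrow = []
    cells = row.find_all("td")
    for cell in cells:
        # Remove the newline characters inside the cell text
        celltext = cell.text.replace('\n','')
        arrayrow.append(celltext)
    boroughs.append(arrayrow)

# Empty lists become rows of missing values in the dataframe
df_boroughs = pd.DataFrame(boroughs)
df_boroughs.columns = ['PostalCode','Borough','Neighborhood']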
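
To see why the resulting dataframe can start with an empty row, here is a minimal sketch of the same loop on a hypothetical two-row table (the HTML string is made up for illustration). Because only `td` cells are collected, a row built from `th` cells contributes an empty list, which pandas pads with missing values.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table: one header row (<th>) and one data row (<td>).
tiny_table = BeautifulSoup(
    "<table>"
    "<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>"
    "</table>",
    "html.parser",
).find("table")

rows = []
for tr in tiny_table.find_all("tr"):
    # Only <td> cells are read, so the header row yields [].
    rows.append([td.text.replace("\n", "") for td in tr.find_all("td")])

# The empty list is padded with missing values to match the widest row.
tiny_df = pd.DataFrame(rows)
print(rows)
print(tiny_df)
```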

### Clean

### First, the not assigned rows

When both the borough and the neighborhood are _Not assigned_, drop the row

### Then the first row, which is empty

### Finally, fix the neighborhoods

To do this we read the borough and set the neighborhood to the same name

In [None]:
df_boroughs.head()

In [None]:
df_boroughs.drop(
    df_boroughs[(df_boroughs.Borough == 'Not assigned') &
                (df_boroughs.Neighborhood == 'Not assigned')].index, inplace=True)

In [None]:
df_boroughs = df_boroughs.iloc[1:]

##### How many rows still have a _Not assigned_ neighborhood?

One

In [None]:
df_boroughs[df_boroughs.Neighborhood == 'Not assigned']

### Assign the borough name to not assigned neighborhoods

This is pretty straightforward: we just select the rows whose neighborhood is _Not assigned_ and assign them the name of the borough

In [None]:
df_boroughs.loc[df_boroughs.Neighborhood == 'Not assigned', 'Neighborhood'] = df_boroughs.Borough
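
The two cleaning rules (drop rows where everything is unassigned, then reuse the borough name for an unassigned neighborhood) can be tried on a small made-up dataframe. This is only a sketch: the `toy` frame and its three rows are illustrative assumptions, not the scraped data.

```python
import pandas as pd

# Hypothetical sample rows, one per case we care about.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M7A", "M3A"],
    "Borough": ["Not assigned", "Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# Rule 1: drop rows where both columns are unassigned.
both_missing = (toy.Borough == "Not assigned") & (toy.Neighborhood == "Not assigned")
toy = toy[~both_missing].copy()

# Rule 2: where only the neighborhood is unassigned, copy the borough name.
mask = toy.Neighborhood == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

print(toy)
```

The explicit `.copy()` is a design choice of this sketch: it makes the filtered frame independent of the original, so the later `.loc` assignment does not act on a view.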

In [None]:
df_boroughs.head()

### Group neighborhoods of the same postal code

This is done by grouping by PostalCode and Borough and joining the neighborhoods of each group into a single comma-separated string

In [None]:
df_result = df_boroughs.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
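
A minimal sketch of what this grouping does, on a made-up frame in which one postal code appears twice (the `toy_codes` rows are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical rows: M5A is listed once per neighborhood.
toy_codes = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (PostalCode, Borough); neighborhoods joined by commas.
grouped = (
    toy_codes.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(",".join)
    .reset_index()
)

print(grouped)
```

Note that `groupby` sorts the group keys by default, so the output is ordered by postal code rather than by first appearance.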

In [None]:
df_result.head()

## Final results

The shape is **(103, 3)**, meaning there are 103 different postal codes, each with a proper borough name and its neighborhoods grouped together

In [None]:
df_result.shape

In [None]:
!pip install geocoder

### Geocoder didn't work

So we just load the coordinates from the CSV file instead.

In [None]:
lat_lng_coords = pd.read_csv('Geospatial_Coordinates.csv')

In [None]:
lat_lng_coords.head()

We see that the column _Postal Code_ is named differently from our _PostalCode_, so we just rename it

In [None]:
lat_lng_coords.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

Now we just merge them

In [None]:
df_result = pd.merge(df_result, lat_lng_coords, on='PostalCode')
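
Since `pd.merge` defaults to an inner join, a postal code missing from the coordinates file would silently disappear from the result. The sketch below shows one way to check for that with `indicator=True`; the two small frames, the `M9Z` code, and the coordinate values are made-up placeholders, not real data.

```python
import pandas as pd

# Hypothetical inputs; the coordinates are placeholder numbers.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Made-up borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [1.0, 2.0],
    "Longitude": [-1.0, -2.0],
})
coords = coords.rename(columns={"Postal Code": "PostalCode"})

# Default inner join: rows without a match on either side are dropped.
inner = pd.merge(left, coords, on="PostalCode")

# Left join with an indicator column to list the codes that found no match.
checked = pd.merge(left, coords, on="PostalCode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(len(inner), unmatched)
```

Comparing the row count before and after the merge is a simpler check that serves the same purpose.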

and voilà

In [None]:
df_result.head()