# Segmenting and Clustering Neighborhoods in Toronto

## Install dependency

In [1]:
!conda install -c anaconda beautifulsoup4 -y

Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs:
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.1.23  |                0         126 KB  anaconda
    certifi-2019.3.9           |           py36_0         155 KB  anaconda
    conda-4.6.8                |           py36_0         1.7 MB  anaconda
    ------------------------------------------------------------
                                           Total:         1.9 MB

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2018.11.~ --> anaconda::ca-certificates-2019.1.23-0
  certifi            conda-forge::certifi-2018.11.29-py36_~ --> anaconda::certifi-2019.3.9-py36_0
  conda                     conda-forge::conda-4.6.4-py36_0 --> anaconda::conda-4.

## Scrape page with Beautiful Soup

### Use requests to get the page source

In [46]:
import requests
import pandas as pd
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

### Parse the page with Beautiful Soup

In [48]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find(class_='wikitable')
table_rows = table.find_all('tr')
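
The parsing step can be tried offline on a small inline table. This is a minimal sketch, not the notebook's actual data: `sample_html` and the `sample_*` names are hypothetical stand-ins for the Wikipedia page source, and the markup is only assumed to resemble the real table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Wikipedia page source.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find(class_="wikitable")
sample_rows = sample_table.find_all("tr")

# The header row holds only <th> cells, so find_all("td") returns an empty list for it.
cells_per_row = [[td.text for td in tr.find_all("td")] for tr in sample_rows]
print(cells_per_row)
```

In this sample the header row yields no `td` cells and the last cell of each data row keeps its trailing newline, which is why a later step needs a length check and a newline cleanup.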

### Transform table rows to DataFrame

In [54]:
rows = []
for tr in table_rows:
    cells = [td.text for td in tr.find_all('td')]
    # Keep only complete rows and skip unassigned boroughs
    if len(cells) == 3 and cells[1] != 'Not assigned':
        cells[2] = cells[2].replace('\n', '')  # Remove the newline from Neighbourhood
        if cells[2] == 'Not assigned':
            cells[2] = cells[1]  # Copy Borough to Neighbourhood if the neighbourhood is not assigned
        rows.append(cells)
# Create a new DataFrame from the list of rows
df = pd.DataFrame(rows, columns=["Postcode", "Borough", "Neighbourhood"])
# Aggregate by Postcode and Borough, joining the neighbourhoods with commas
df_grouped = df.groupby(by=['Postcode', 'Borough']).Neighbourhood.agg(', '.join).reset_index()
df_grouped.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
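
The aggregation step can be checked on a hand-made sample. This is a minimal sketch under assumptions: the `sample` rows are hypothetical, shaped like the scraped table, and it uses a plain `.agg(", ".join)` on the grouped `Neighbourhood` column, which should yield the same three columns after `reset_index()`.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
sample = pd.DataFrame(
    [
        ["M1B", "Scarborough", "Rouge"],
        ["M1B", "Scarborough", "Malvern"],
        ["M1G", "Scarborough", "Woburn"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# One row per (Postcode, Borough); neighbourhoods are joined in their original order.
sample_grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(sample_grouped)
```

Rows that share a postcode and borough collapse into one, so the row count of the grouped frame equals the number of distinct (Postcode, Borough) pairs.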


In [55]:
df_grouped.shape

(103, 3)