# Segmenting and Clustering Neighborhoods in Toronto 

Import the packages we will use.

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np 

## Preprocessing

### Web Scraping
We will scrape the Toronto postal code table from Wikipedia using `requests` and BeautifulSoup.

In [4]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

In [29]:
toronto_table = soup.find('table', class_='wikitable sortable')

We will use a for loop to go through the table rows and build a list of tuples, keeping only rows that contain exactly three `td` cells. This skips rows such as the header, which has no `td` cells.

In [35]:
table_con = []
for row in toronto_table.find_all('tr'):
    col = row.find_all('td')
    if len(col) == 3:    # keep only rows with exactly three data cells (skips the header row)
        table_con.append((col[0].text.strip(), col[1].text.strip(), col[2].text.strip()))
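To see what the row filter does without requesting the Wikipedia page, here is a minimal sketch on a hand-written table. The HTML snippet and the names `sample_html`, `sample_soup`, and `rows` are made up for illustration, and the sketch assumes Python's built-in `html.parser` backend rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row only has <th> cells, so it yields zero <td> cells and is skipped
    if len(cells) == 3:
        rows.append(tuple(cell.text.strip() for cell in cells))

print(rows)
```

The header row drops out because it has no `td` cells, and `strip()` removes the whitespace around each cell's text.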

Now we will convert the list to a NumPy array.

In [36]:
toronto_array = np.asarray(table_con)

### Data Cleaning
Next, we convert the array to a DataFrame and name its columns.

In [190]:
df = pd.DataFrame(toronto_array)
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |
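As a side note, the intermediate array is not strictly required: `pd.DataFrame` also accepts a list of tuples together with a `columns` argument. A minimal sketch, assuming a small hand-made list (`sample_rows` is a hypothetical stand-in for `table_con`):

```python
import pandas as pd

# Hypothetical sample in the same shape as table_con (a list of 3-tuples)
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M7A", "Queen's Park", "Not assigned"),
]

# Build the DataFrame and name the columns in one step
sample_df = pd.DataFrame(sample_rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```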


In [200]:
print('Borough: ', (df.Borough == 'Not assigned').sum())
print('Neighborhood: ', (df.Neighborhood == 'Not assigned').sum())

Borough:  77
Neighborhood:  78


For convenience, all "Not assigned" values will be replaced with `NaN` using NumPy.

In [201]:
df.replace('Not assigned', np.nan, inplace=True)
df.head(10)

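|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | NaN |
| 9 | M8A | NaN | NaN |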


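Verify the replacement: the missing-value counts should match the "Not assigned" counts found earlier (77 for borough, 78 for neighborhood).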

In [202]:
df.isnull().sum()

PostalCode       0
Borough         77
Neighborhood    78
dtype: int64
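The same replace-then-verify pattern can be tried on a toy frame. This is only a sketch: `sample_df` and its three rows are invented for illustration and are not the scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample, not the scraped table
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Count the placeholder strings per column before replacing them
before = (sample_df == "Not assigned").sum()

# Replace the placeholder with NaN and count missing values per column
cleaned = sample_df.replace("Not assigned", np.nan)
after = cleaned.isna().sum()

print(before)
print(after)
```

If the replacement worked, the two printed count series agree column by column.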

Assume that the neighborhood is the same as the borough when the neighborhood is missing but the borough is present.

In [203]:
for i in range(len(df.index)):
    # Borough present but Neighborhood missing: copy the borough name over
    if pd.notna(df.iloc[i, 1]) and pd.isna(df.iloc[i, 2]):
        df.iloc[i, 2] = df.iloc[i, 1]

df.head(10)

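|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 9 | M8A | NaN | NaN |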


In [204]:
df.isnull().sum()

PostalCode       0
Borough         77
Neighborhood    77
dtype: int64
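The row-by-row fill can also be written without a loop. The sketch below is an alternative, not the notebook's own method: it uses `Series.fillna` with another Series, which fills each missing neighborhood from the borough in the same row. The toy `sample_df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one fully missing row, one complete row,
# and one row with a borough but no neighborhood
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": [np.nan, "North York", "Queen's Park"],
    "Neighborhood": [np.nan, "Parkwoods", np.nan],
})

# Fill missing neighborhoods from the borough on the same row (aligned by index).
# Rows where the borough is also missing stay NaN.
sample_df["Neighborhood"] = sample_df["Neighborhood"].fillna(sample_df["Borough"])

print(sample_df)
```

Only the row with a borough but no neighborhood changes. The fully missing row is left for the later drop step.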

Drop all remaining rows with missing values. After the previous step, these are exactly the rows where both borough and neighborhood are missing.

In [205]:
df.dropna(inplace=True)
df.isnull().sum()

PostalCode      0
Borough         0
Neighborhood    0
dtype: int64
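If the frame had other columns where missing values were acceptable, the drop could be limited to the two columns of interest with the `subset` parameter of `dropna`. A small sketch on invented data (`sample_df` and `kept` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one row missing both borough and neighborhood
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": [np.nan, "North York", "Queen's Park"],
    "Neighborhood": [np.nan, "Parkwoods", "Queen's Park"],
})

# Drop rows with a missing value in either listed column, then renumber the index
kept = sample_df.dropna(subset=["Borough", "Neighborhood"]).reset_index(drop=True)

print(kept)
```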

Group records that share a postal code into a single row, joining their neighborhood names with commas.

In [213]:
df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood]], Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
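To make the grouping step concrete, here is a minimal sketch on a few rows taken from the earlier preview. It uses `agg` with `', '.join`, which should behave like the `apply` call above for this case; `sample_df` and `grouped` are illustrative names.

```python
import pandas as pd

# A few rows copied from the preview above, one row per neighborhood
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A", "M6A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park",
                     "Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Collapse rows sharing (PostalCode, Borough) and join the neighborhood names
grouped = (
    sample_df.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```

After grouping, each postal code appears once, and `groupby` sorts the result by the grouping keys by default.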


Finally, check the dimensions of the resulting DataFrame.

In [212]:
df.shape

(103, 3)