# Segmenting and Clustering Neighborhoods in Toronto City

## 1. Setup

### 1.1 Import Libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### 1.2 Notebook Settings

In [33]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

## 2. Getting Data

All Toronto neighborhood data is scraped from the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) using [BeautifulSoup](http://beautiful-soup-4.readthedocs.io/en/latest/), a Python library for web scraping.

### 2.1 Getting Data Using BeautifulSoup

In [34]:
def scrape_table_using_bs(url, cname):
    """
    Extract the first HTML table with CSS class `cname` from `url`
    and return it as a DataFrame.
    """
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_=cname)
    # Keep the first text node of each header and data cell
    header = [th.find_all(string=True)[0].strip() for th in table.find_all("th")]
    data = [[td.find_all(string=True)[0].strip() for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    # Drop rows (such as the header row) whose cell count differs from the header
    data = [row for row in data if len(row) == len(header)]
    df = pd.DataFrame(data, columns=header)

    return df
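
To see how this kind of parsing behaves without hitting the network, the same idea can be tried on a small inline HTML snippet. This is only a minimal sketch: the HTML string is hand-written for illustration and merely mimics the shape of the Wikipedia table, `demo_df` is a made-up name, and `get_text(strip=True)` is used here instead of taking the first text node as the function above does.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the Wikipedia table (not the real page)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ matches any single CSS class, so "wikitable" finds "wikitable sortable"
table = soup.find("table", class_="wikitable")

# Header cells live in <th>, data cells in <td>
header = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
# The header row has no <td> cells, so it yields an empty list and is dropped
rows = [row for row in rows if len(row) == len(header)]

demo_df = pd.DataFrame(rows, columns=header)
print(demo_df)
```

Filtering on `len(row) == len(header)` is what removes the header row from the data, since that row contains only `<th>` cells.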

In [35]:
# Source URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Get the dataframe
df_raw = scrape_table_using_bs(url=url, cname='wikitable')
df_raw.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


## 3. Preparing Data

The scraped table contains some unwanted data that needs cleaning. The following tasks will be performed:

1. Keep only rows that have an assigned borough; drop rows whose borough is `Not assigned`.
2. If a row has a borough but its neighborhood is `Not assigned`, use the borough name as the neighborhood.
3. Group the table by `Postcode` and `Borough`; neighborhoods that share a postal code are combined into a single comma-separated value in the `Neighbourhood` column.
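
The three steps can be sketched end to end on a tiny hand-written frame. This is an illustrative sketch only: the `toy` rows are typed in by hand rather than scraped, and the variable names are assumptions, not part of the notebook.

```python
import pandas as pd

# Hand-written rows for illustration; not the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Keep only rows with an assigned borough
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# 2. Fall back to the borough name where the neighbourhood is not assigned
missing = toy["Neighbourhood"] == "Not assigned"
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]

# 3. One row per Postcode/Borough, neighbourhoods joined with ", "
toy = (toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
          .apply(", ".join)
          .reset_index())
print(toy)
```

The order matters: step 2 relies on step 1 having already removed rows whose borough is itself `Not assigned`, otherwise the fallback would copy `Not assigned` into the neighbourhood.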

### 3.1 Drop rows whose borough is 'Not assigned'

In [36]:
df = df_raw[df_raw['Borough'] != 'Not assigned']
df = df.sort_values(by=['Postcode', 'Borough', 'Neighbourhood']).reset_index(drop=True)
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern |
| 1 | M1B | Scarborough | Rouge |
| 2 | M1C | Scarborough | Highland Creek |
| 3 | M1C | Scarborough | Port Union |
| 4 | M1C | Scarborough | Rouge Hill |


### 3.2 Replace 'Not assigned' neighbourhoods with the borough name

In [37]:
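# Where the neighbourhood is missing, copy the borough name from the same rows
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']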
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern |
| 1 | M1B | Scarborough | Rouge |
| 2 | M1C | Scarborough | Highland Creek |
| 3 | M1C | Scarborough | Port Union |
| 4 | M1C | Scarborough | Rouge Hill |


### 3.3 Grouping Neighbourhood by Postcode and Borough

In [39]:
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Highland Creek, Port Union, Rouge Hill |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [40]:
df.shape

(103, 3)
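
Beyond checking the shape, a few assertions can make the expectations of the cleaning steps explicit. The sketch below is one possible way to do this, assuming each postal code maps to a single borough; `check_clean` is a hypothetical helper, and `sample` is a stand-in frame copied from the first rows of the grouped output rather than the full `df`.

```python
import pandas as pd

def check_clean(frame):
    """Hypothetical helper: raise AssertionError if the cleaned table breaks the stated rules."""
    # After grouping, each postal code should appear exactly once
    assert frame["Postcode"].is_unique, "duplicate postal codes"
    # No unassigned boroughs or neighbourhoods should remain
    assert not (frame["Borough"] == "Not assigned").any(), "unassigned borough"
    assert not (frame["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood"
    return True

# Stand-in data copied from the first rows of the grouped output
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": [
        "Malvern, Rouge",
        "Highland Creek, Port Union, Rouge Hill",
        "Guildwood, Morningside, West Hill",
    ],
})

print(check_clean(sample))
```

The same call could be applied to the notebook's `df` after section 3.3, for example `check_clean(df)`.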