<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

So far, this notebook completes the data scraping step using BeautifulSoup.

### 1. Read data from HTML and format it properly

In [2]:
# Import necessary packages
import bs4
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

#### First, we read the raw data from the saved HTML file.

In [3]:
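with open('List of postal codes of Canada_ M - Wikipedia.htm') as wiki: 
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(wiki, 'html.parser')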

head = soup.find('div', class_="mw-parser-output").table.thead.tr

#### Then we extract the values we want into a dataframe. 

In [4]:
# Get the column names
cols = []
for t in head.find_all('th'):
    col = t.text.strip()
    cols.append(col)

In [5]:
# Get the table body
body = soup.find('div', class_="mw-parser-output").table.tbody
data_raw = body.find_all('tr')

# Make a dataframe to store all data
df_1 = pd.DataFrame(columns = cols)

# Loop through all rows
for data in data_raw: 
    row = [] # make an empty list to store data of one row
    
    # Store the data
    for cell in data.find_all('td'):
        cell = cell.text.strip()
        row.append(cell)
    
    # Append this row to our dataframe if it is complete and has an assigned borough
    if len(row) == len(cols) and row[1] != 'Not assigned':
        df_1.loc[len(df_1), :] = row

# Check the dataframe
df_1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
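
The row-by-row `df_1.loc[len(df_1), :] = row` pattern above works, but it grows the dataframe one row at a time. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below illustrates this on a small inline HTML snippet; the snippet is made up for illustration and is not the real Wikipedia page, and `table_to_dataframe` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Toy HTML standing in for the saved Wikipedia page (illustrative only)
SAMPLE_HTML = """
<table>
  <thead><tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr></thead>
  <tbody>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
  </tbody>
</table>
"""

def table_to_dataframe(html):
    """Hypothetical helper: parse the first table, skipping unassigned boroughs."""
    table = BeautifulSoup(html, 'html.parser').table
    # Column names come from the header cells
    cols = [th.text.strip() for th in table.thead.tr.find_all('th')]
    rows = []
    for tr in table.tbody.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        # Keep only complete rows whose borough is assigned
        if len(cells) == len(cols) and cells[1] != 'Not assigned':
            rows.append(cells)
    # Build the dataframe once from the collected rows
    return pd.DataFrame(rows, columns=cols)

sample_df = table_to_dataframe(SAMPLE_HTML)
```

Here `sample_df` keeps the `M3A` and `M7A` rows and drops `M1A`, whose borough is not assigned.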


### 2. Data Wrangling

#### First, we take care of the missing values in the 'Neighbourhood' column. 

In [6]:
# Replace every "Not assigned" in Neighbourhood with the Borough name from the same row
not_assigned = df_1['Neighbourhood'] == "Not assigned"
df_1.loc[not_assigned, 'Neighbourhood'] = df_1.loc[not_assigned, 'Borough']

df_1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


#### Then we merge the neighbourhood names that share a postcode into a single cell, separated by ", ".

In [16]:
# Create a new dataframe to store wrangled data
df = pd.DataFrame(columns = cols)

# Group dataset by postcode, loop through each group to get the correct data in correct format
for name, group in df_1.groupby('Postcode', as_index = False):
    
    # Get the corresponding data in correct format
    post = name
    bor = group['Borough'].tolist()[0]
    neigh = ', '.join(group['Neighbourhood'].tolist())
    row = [post, bor, neigh] # Combine all data into a list
    df.loc[len(df), :] = row # Add list to the dataframe

#### The cleaned data looks like this:

In [17]:
# Display dataframe
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
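
The same grouping step can also be written as a single `groupby(...).agg(...)` call instead of an explicit loop. A minimal sketch on a toy dataframe follows; the rows are copied from the sample output earlier in this notebook, and `toy` and `merged` are names introduced here for illustration.

```python
import pandas as pd

# Toy data: a few rows in the same shape as df_1
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M1B', 'M1B', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park',
                      'Rouge', 'Malvern', 'Victoria Village'],
})

# One row per postcode: keep the first borough, join the neighbourhood names
merged = (
    toy.groupby('Postcode', as_index=False)
       .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
```

`groupby` sorts the group keys by default, which is why the cleaned table starts at `M1B` rather than at the first postcode scraped.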


#### Finally, we check the shape of the cleaned dataset.

In [9]:
df.shape

(103, 3)

### 3. Adding coordinates to the dataset

Instead of geocoding each postcode myself, I used the geospatial dataset provided by IBM in the assignment description.

In [15]:
geo = pd.read_csv("http://cocl.us/Geospatial_data")
geo.rename(columns = {'Postal Code' : 'Postcode'}, inplace = True)

In [18]:
df = pd.merge(df, geo, on = 'Postcode')
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
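
`pd.merge` performs an inner join by default, so a postcode missing from the coordinates file would be dropped without warning. One way to check for that is a left join with `indicator=True`, sketched here on toy data. The coordinates are copied from the output above, while `M9Z` and its borough are made-up values with no match.

```python
import pandas as pd

# Toy stand-ins for df and geo; 'M9Z' is a made-up postcode with no coordinates
left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Example'],
})
coords = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every postcode; the _merge column records where each row matched
checked = pd.merge(left, coords, on='Postcode', how='left', indicator=True)

# Postcodes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
```

An empty `unmatched` list on the real data would suggest that the inner join above did not drop any postcodes.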
