<a href="https://colab.research.google.com/github/fulcrum3/IBM_Capstone_Project/blob/master/toronto_neighbourhood_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Toronto Neighbourhoods Clustering
This notebook contains my work for the week 3 assignment of the IBM Capstone course. The objective is to scrape Toronto neighbourhood data from Wikipedia, cluster it using **k-means clustering**, and present the results using a graphing library.

In [0]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Scraping data with Beautiful Soup
Scraping the postal code table from [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and loading it into a pandas DataFrame.

In [2]:
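# Download the page and parse the first table on it
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')
# Collect the text of every <td> cell, one list per table row
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    rows.append([cell.text for cell in cells])
neighbourhoods = pd.DataFrame(rows, columns=["PostCode", "Borough", "Neighbourhood"])
# Strip the newline characters left over from the HTML cells
neighbourhoods = neighbourhoods.replace('\n', '', regex=True)
# The first row comes from the table header and holds no data, so drop it
neighbourhoods = neighbourhoods.drop(neighbourhoods.index[0])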
print('Table Shape:', neighbourhoods.shape)
neighbourhoods.head()

Table Shape: (288, 3)


Unnamed: 0,PostCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
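
The row-extraction loop can be tried out on a small hand-written table without hitting the network. This is only a sketch: the HTML string is made up for illustration (it is not the Wikipedia page), and it assumes Python's built-in `html.parser` in place of `lxml`. It also skips rows without `<td>` cells up front, as an alternative to dropping the first DataFrame row afterwards.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, for illustration only
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for tr in soup.find('table').find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    # The header row only has <th> cells, so `cells` is empty for it
    if cells:
        rows.append(cells)

sample = pd.DataFrame(rows, columns=["PostCode", "Borough", "Neighbourhood"])
print(sample.shape)  # (2, 3)
```

Filtering inside the loop keeps the DataFrame index starting at 0, whereas dropping the first row afterwards leaves it starting at 1.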


## Cleaning the data

In [3]:
# Remove rows that haven't been assigned a borough
neighbourhoods = neighbourhoods[neighbourhoods["Borough"] != 'Not assigned']
# Combine neighbourhoods that share a postal code into one comma-separated row
neighbourhoods = neighbourhoods.groupby(['PostCode', 'Borough'])['Neighbourhood'].apply(','.join).reset_index()
# Where a neighbourhood is not assigned, use the borough name instead
neighbourhoods['Neighbourhood'] = np.where(neighbourhoods['Neighbourhood'] == 'Not assigned',
                                           neighbourhoods['Borough'],
                                           neighbourhoods['Neighbourhood'])
print("Table shape after cleaning: ", neighbourhoods.shape)
neighbourhoods.head()

Table shape after cleaning:  (103, 3)


Unnamed: 0,PostCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
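
The three cleaning steps are easier to follow on a handful of rows. The sketch below uses a small illustrative frame (the name `toy` and its rows are assumptions made for this example, not the scraped data) and applies the same filter, group-and-join, and fallback logic.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not taken from the scraped table
toy = pd.DataFrame({
    "PostCode": ["M1A", "M1B", "M1B", "M7A"],
    "Borough": ["Not assigned", "Scarborough", "Scarborough", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Rouge", "Malvern", "Not assigned"],
})

# 1. Drop rows with no borough
toy = toy[toy["Borough"] != "Not assigned"]

# 2. One row per postal code, with neighbourhood names joined by commas
toy = (toy.groupby(["PostCode", "Borough"])["Neighbourhood"]
          .apply(",".join)
          .reset_index())

# 3. Fall back to the borough name where no neighbourhood was assigned
toy["Neighbourhood"] = np.where(toy["Neighbourhood"] == "Not assigned",
                                toy["Borough"],
                                toy["Neighbourhood"])

print(toy["Neighbourhood"].tolist())  # ['Rouge,Malvern', "Queen's Park"]
```

The four input rows collapse to two: `M1A` is removed, the two `M1B` rows are merged, and `M7A` takes its borough name as the neighbourhood.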


## Getting latitude and longitude data
- Reading the coordinates from the provided CSV file
- Merging them into the existing dataset on the postal code

In [5]:
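# Load the coordinates for each postal code
lat_data = pd.read_csv('http://cocl.us/Geospatial_data')
# Join the coordinates onto the neighbourhood table by postal code
neighbourhoods = pd.merge(neighbourhoods, lat_data, left_on='PostCode', right_on='Postal Code')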
neighbourhoods.head()

103
103


Unnamed: 0,PostCode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
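
`pd.merge` defaults to an inner join, so a postal code that is missing from the CSV would be dropped without warning. One possible way to check for this is sketched below. The frames `left` and `right` are small illustrative stand-ins, and `M9Z` is a made-up postal code used to trigger the unmatched case.

```python
import pandas as pd

# Illustrative stand-ins for the neighbourhood table and the coordinates CSV
left = pd.DataFrame({
    "PostCode": ["M1B", "M1C", "M9Z"],  # "M9Z" is a made-up code
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every neighbourhood row; `validate` raises an error if a
# postal code is duplicated on either side; `indicator` adds a `_merge` column
merged = pd.merge(left, right, how="left",
                  left_on="PostCode", right_on="Postal Code",
                  validate="one_to_one", indicator=True)

# Postal codes that found no coordinates
missing = merged.loc[merged["_merge"] == "left_only", "PostCode"].tolist()
print(missing)  # ['M9Z']

# Drop the duplicate key column and the helper column once checked
merged = merged.drop(columns=["Postal Code", "_merge"])
```

If `missing` is empty, the inner join used in the notebook and this left join produce the same rows.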
