# Segmenting and Clustering Neighbourhoods in Toronto

In this notebook, the neighbourhoods of the city of Toronto are segmented and clustered based on postal code and borough information.<br><br>
A Wikipedia page lists the Toronto postal codes together with the borough and neighbourhood information needed to explore and cluster the neighbourhoods.<br>
The HTML table on that page is scraped and read into a pandas dataframe.<br>

## Importing libraries

In [None]:
!pip install beautifulsoup4
!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

from bs4 import BeautifulSoup # Library for scraping webpage
from IPython.display import display_html # Library for displaying HTML

print('Importing ready!')

## Retrieve the postal codes of Canada from the Wikipedia webpage

In [None]:
# Get webpage
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# Scrape webpage
bsoup = BeautifulSoup(source,'lxml')
# Check title of webpage
print(bsoup.title)
# Get table from webpage
html_table = str(bsoup.table)
# Display table
display_html(html_table,raw=True)

## Convert HTML table to Dataframe for preprocessing

In [None]:
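from io import StringIO # recent pandas versions expect a file-like object rather than a raw HTML string
df_list = pd.read_html(StringIO(html_table))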
df = df_list[0]
df.rename(columns={'Postal Code':'Postcode'},inplace=True)
print(df.shape)
df

## Cleaning and preparing the dataset

### 1 - Check for 'Not assigned' boroughs.

In [None]:
# Count the rows whose borough is 'Not assigned'
df.loc[df.Borough == 'Not assigned', 'Borough'].count()

There are 77 'Not assigned' boroughs.<br>
Rows whose borough is 'Not assigned' carry no usable location information, so they are dropped.

In [None]:
# Creating a new dataframe without the 'Not assigned' Boroughs
df1 = df[df.Borough != 'Not assigned']
df1.shape
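
As a small illustration of this filtering step, the sketch below applies the same boolean mask to a hypothetical three-row dataframe (the sample rows are invented for the example and are not taken from the Wikipedia table). The call to `.copy()` is an optional addition that is not part of the cell above; it makes the result an independent dataframe, which can help avoid a `SettingWithCopyWarning` when columns are assigned later.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True for every row whose borough is assigned
mask = sample['Borough'] != 'Not assigned'

# Keep only the assigned rows; .copy() detaches the result from `sample`
assigned = sample[mask].copy()

print(assigned.shape)
```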

### 2 - Check for the existence of more than one neighbourhood in one postal code area.

In [None]:
# Create a temporary dataframe with the number of neighbourhoods per postcode, borough
temp = df1.groupby(['Postcode','Borough'], sort=False).count().rename({'Neighbourhood': 'counts'}, axis=1)
print(temp[temp == 1].count())
print(temp[temp > 1].count())

There are no postal code areas in the dataframe with more than one neighbourhood.<br>
The following code, which combines the neighbourhoods that share a postal code, is therefore not strictly necessary here.

In [None]:
# Combining the neighbourhoods with the same postal code
df1 = df1.groupby(['Postcode','Borough'], sort=False).agg(', '.join)
df1.reset_index(inplace=True)
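
Because the combine step changes nothing on this dataset, its effect may be easier to see on data where it matters. The following is a minimal sketch on a hypothetical dataframe in which one postal code appears on two rows; the sample rows and the names `sample`, `keys`, `counts` and `combined` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows in which postal code M5A is listed twice
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

keys = ['Postcode', 'Borough']

# Count the rows per postcode/borough pair; a count above 1 means
# that several neighbourhoods share the same postal code
counts = sample.groupby(keys, sort=False).size()
print(counts[counts > 1])

# Join the neighbourhood names of each pair into one comma-separated string
combined = (
    sample.groupby(keys, sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)
```

Selecting the `Neighbourhood` column before aggregating keeps the join limited to that column, which is a slightly more explicit variant of the cell above.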

### 3 - Check for 'Not assigned' neighbourhoods and replace them with their borough

In [None]:
# Count number of 'Not assigned' Neighbourhoods
df1.loc[df1.Neighbourhood == 'Not assigned', 'Neighbourhood'].count()

There are no neighbourhoods that are 'Not assigned'.<br>
So the following code, which sets 'Not assigned' neighbourhoods to the name of their borough, is not strictly necessary here.

In [None]:
# Replacing the name of the 'Not assigned' neighbourhoods with names of Borough
df1['Neighbourhood'] = np.where(df1['Neighbourhood'] == 'Not assigned',df1['Borough'], df1['Neighbourhood'])
df1
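
Since no neighbourhood is 'Not assigned' in this dataset, the replacement leaves it unchanged. The sketch below shows what `np.where` would do on a hypothetical dataframe that does contain such a row; the sample values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first one has a borough but no assigned neighbourhood
sample = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Where the neighbourhood is 'Not assigned', take the borough name;
# otherwise keep the existing neighbourhood name
sample['Neighbourhood'] = np.where(
    sample['Neighbourhood'] == 'Not assigned',
    sample['Borough'],
    sample['Neighbourhood'],
)

print(sample)
```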

### Dataframe after preprocessing the data

In [None]:
# Shape of data frame
df1.shape