# IBM Applied Data Science Capstone Project
> ## Segmenting and Clustering Neighborhoods in Toronto

>In this notebook, neighborhoods in the city of Toronto are explored, segmented, and clustered. A [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) page lists the postal codes, boroughs, and neighborhoods needed to explore and cluster the neighborhoods in Toronto. The data is scraped from that page, cleaned, and loaded into a pandas dataframe so that it is in a structured format.

>Once the data is in a structured format, the neighborhoods in the city of Toronto are explored and clustered.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Web-scrape and Explore Dataset</a>

2. <a href="#item2">Fetch Latitude and Longitude of each Neighborhood</a>

3. <a href="#item3">Explore and Cluster the Neighborhoods in Toronto</a>

</font>
</div>

Import all the dependencies.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests
from bs4 import BeautifulSoup  # library to handle web scraping

<a id='item1'></a>

## 1. Web-scrape and Explore Dataset

Fetch the content of the URL: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(URL) 

Use the BeautifulSoup library to parse the content of the URL.

In [3]:
soup = BeautifulSoup(r.content, 'html5lib')
# print(soup.prettify()) # visual representation of the parse tree

Inspecting the HTML above shows that the table of postal codes has the class attribute `wikitable sortable`.

In [4]:
table = soup.find('table', attrs = {'class':'wikitable sortable'}) 
# print(table.prettify())
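
Passing `attrs = {'class':'wikitable sortable'}` matches the class attribute as one exact string, so the lookup would return `None` if the page ever added another class token to the table. The sketch below illustrates this on a small hypothetical HTML snippet (not the live page) and shows a CSS selector as one possible, more tolerant alternative. It uses the built-in `html.parser` so that it runs without `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the class attribute carries an extra token,
# as could happen if the page template changes.
html = """
<table class="wikitable sortable jquery-tablesorter">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: misses this table.
exact = soup.find('table', attrs={'class': 'wikitable sortable'})

# CSS selector: matches any table that carries both class tokens.
by_selector = soup.select_one('table.wikitable.sortable')

headers = [th.text.strip() for th in by_selector.find_all('th')]
print(exact is None)
print(headers)
```

Whether the extra robustness is needed depends on how stable the page markup is; for a one-off scrape the exact match used in this notebook is sufficient.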

Loop through the table to collect the column names and the rows into lists.

In [5]:
# Store the header cells in column_names
column_names = [header.text.strip() for header in table.find_all('th')]
column_names

['Postcode', 'Borough', 'Neighbourhood']

In [6]:
rows_list = []

# Loop through the table rows and keep those with exactly 3 cells
for row in table.find_all('tr'):
    row_data = row.find_all('td')
    if len(row_data) == 3:
        rows_list.append((row_data[0].text.strip(), row_data[1].text.strip(), row_data[2].text.strip()))

rows_list[:5]

[('M1A', 'Not assigned', 'Not assigned'),
 ('M2A', 'Not assigned', 'Not assigned'),
 ('M3A', 'North York', 'Parkwoods'),
 ('M4A', 'North York', 'Victoria Village'),
 ('M5A', 'Downtown Toronto', 'Harbourfront')]

In [7]:
# instantiate the dataframe with column_names & rows_list
neighborhoods = pd.DataFrame(data = rows_list, columns = column_names)
neighborhoods.head()

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


> * Process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

In [8]:
neighborhoods.shape

(288, 3)

In [9]:
print('The dataframe has {} unique boroughs and {} unique neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        len(neighborhoods['Neighbourhood'].unique())
    )
)

The dataframe has 12 unique boroughs and 209 unique neighborhoods.


In [10]:
print(neighborhoods.groupby('Borough').count().reset_index())

             Borough  Postcode  Neighbourhood
0    Central Toronto        17             17
1   Downtown Toronto        37             37
2       East Toronto         7              7
3          East York         6              6
4          Etobicoke        45             45
5        Mississauga         1              1
6         North York        38             38
7       Not assigned        77             77
8       Queen's Park         1              1
9        Scarborough        37             37
10      West Toronto        13             13
11              York         9              9


There are **77 rows** whose borough is **Not assigned**.

In [11]:
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned'].reset_index(drop=True)
neighborhoods.head(10)

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


In [12]:
# Confirms that the 77 rows with a 'Not assigned' borough were dropped (288 - 77 = 211)
neighborhoods.shape

(211, 3)
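
The filtering step can be checked on a toy dataframe. The sketch below uses a few rows copied from the scraped table and confirms that the number of rows kept equals the total minus the number of `Not assigned` boroughs; the variable names `df`, `unassigned`, and `kept` are illustrative only.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table.
df = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Not assigned'],
})

# Boolean mask of the rows to drop.
unassigned = df['Borough'] == 'Not assigned'

# Keep everything else and renumber the index from 0.
kept = df[~unassigned].reset_index(drop=True)

# The row counts must add up.
assert len(kept) == len(df) - unassigned.sum()
print(kept)
```

Note that the `M7A` row survives: it has a borough, even though its neighbourhood is still `Not assigned`.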

> * If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [13]:
neighborhoods[neighborhoods['Neighbourhood'] == 'Not assigned']

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 6 | M7A | Queen's Park | Not assigned |


In [14]:
neighborhoods.loc[neighborhoods['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = neighborhoods['Borough']
neighborhoods.head(10)

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


In [15]:
# Confirms that no 'Not assigned' neighbourhoods remain (an empty result is expected)
neighborhoods[neighborhoods['Neighbourhood'] == 'Not assigned']

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
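
The `.loc` assignment in cell 14 works because pandas aligns the right-hand `Borough` series with the selected rows by index. An equivalent and arguably more explicit form is `Series.mask`, sketched below on two toy rows; this is an alternative spelling, not the method the notebook uses.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table.
df = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights', 'Not assigned'],
})

# Where the neighbourhood is missing, take the borough from the same row.
missing = df['Neighbourhood'] == 'Not assigned'
df['Neighbourhood'] = df['Neighbourhood'].mask(missing, df['Borough'])

print(df)
```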


In [16]:
neighborhoods['Neighbourhood'].count()

211

In [17]:
len(neighborhoods['Neighbourhood'].unique())

209

There are 209 unique `Neighbourhood` values, but the total `Neighbourhood` count is 211.<br>
Explore the `Neighbourhood` values that occur more than once.

In [18]:
neighborhoods.groupby('Neighbourhood').filter(lambda x: len(x) > 1)

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 27 | M5C | Downtown Toronto | St. James Town |
| 115 | M6N | York | Runnymede |
| 145 | M6S | West Toronto | Runnymede |
| 190 | M4X | Downtown Toronto | St. James Town |
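
The same rows can also be found without a `groupby`, using `Series.duplicated(keep=False)`, which flags every occurrence of a repeated value rather than only the later ones. A minimal sketch on toy rows taken from the output above:

```python
import pandas as pd

# Toy rows: two names that repeat across postcodes, one that does not.
df = pd.DataFrame({
    'Postcode': ['M5C', 'M6N', 'M6S', 'M4X', 'M3A'],
    'Neighbourhood': ['St. James Town', 'Runnymede', 'Runnymede',
                      'St. James Town', 'Parkwoods'],
})

# keep=False marks all members of each repeated group as True.
repeated = df[df['Neighbourhood'].duplicated(keep=False)]
print(repeated)
```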


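* The repeated names above fall under different postal codes, so they are not duplicate rows.
* More than one neighborhood can exist in one postal code area.
* Rows that share a postal code will be combined into one row, with the neighborhoods separated by commas.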

In [19]:
neighborhoods['Postcode'].count()

211

In [20]:
len(neighborhoods['Postcode'].unique())

103

In [21]:
neighborhoods = neighborhoods.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [22]:
neighborhoods.head()

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [23]:
# Confirms one row per 'Postcode' (103 unique postcodes)
neighborhoods.shape

(103, 3)
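
To see what the `groupby(...).apply(', '.join)` step does, it helps to run it on three toy rows. The sketch below uses `agg` in place of `apply`, which gives the same result here, and then checks that each postcode appears exactly once.

```python
import pandas as pd

# Toy rows: M1B has two neighbourhoods, M1G has one.
df = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Join the neighbourhood names of each (Postcode, Borough) group.
combined = (
    df.groupby(['Postcode', 'Borough'])['Neighbourhood']
      .agg(', '.join)
      .reset_index()
)

# After combining, every postcode should be unique.
assert combined['Postcode'].is_unique
print(combined)
```

Grouping on both `Postcode` and `Borough` keeps the borough column in the result; it assumes each postcode belongs to a single borough, which the row count of 103 above is consistent with.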

<a id='item2'></a>

## 2. Fetch Latitude and Longitude of each Neighborhood

Download the latitude and longitude data and save it as **Geospatial_Coordinates.csv**.

In [24]:
import os
import wget
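if not os.path.exists('Geospatial_Coordinates.csv'):
    # save under an explicit name so the read_csv call below finds the file
    wget.download('http://cocl.us/Geospatial_data', out='Geospatial_Coordinates.csv')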

In [25]:
# read csv into pandas dataframe
geospatial_coordinates = pd.read_csv('Geospatial_Coordinates.csv')
geospatial_coordinates.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [26]:
neighborhoods = neighborhoods.join(geospatial_coordinates.set_index('Postal Code'), on=['Postcode'])
neighborhoods.head()

|  | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [27]:
neighborhoods[neighborhoods['Postcode'] == 'M5G']

|  | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 57 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
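
The `join` above silently leaves `NaN` coordinates for any postcode missing from the CSV. One way to guard against that, sketched here on toy data with coordinates copied from the file, is `DataFrame.merge` with `validate` and `indicator`. The names `left`, `coords`, and `unmatched` are illustrative, and this is an optional check rather than part of the original workflow.

```python
import pandas as pd

left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
})

merged = left.merge(
    coords,
    left_on='Postcode',
    right_on='Postal Code',
    how='left',             # keep every neighbourhood row
    validate='one_to_one',  # raise if a postcode repeats on either side
    indicator=True,         # add a _merge column recording the match status
)

# Rows found only on the left have no coordinates.
unmatched = merged[merged['_merge'] == 'left_only']

# Drop the helper columns once the check is done.
merged = merged.drop(columns=['Postal Code', '_merge'])
print(len(unmatched))
print(merged)
```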


<a id='item3'></a>

## 3. Explore and Cluster the Neighborhoods in Toronto