# Segmenting and Clustering Neighborhoods in Toronto
#### First part of the capstone project for the IBM Data Science Professional Certificate.

<br>

## 1. Scraping the data from Wikipedia

Install and import the required packages and libraries.

In [79]:
# Install the parsing libraries before importing them.
!pip3 install BeautifulSoup4
!pip3 install lxml

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup



### First we parse the page with BeautifulSoup to get the plain HTML content.

In [80]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')

### There is only one table on the page, so we convert it to a pandas DataFrame.

In [114]:
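from io import StringIO
table = soup.find('table')
# Wrap the HTML in StringIO; newer pandas versions deprecate passing literal HTML strings.
df = pd.read_html(StringIO(str(table)))[0]
df.head()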

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### As required, we remove every entry where no Borough was assigned.

In [115]:
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
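
The filter above relies on boolean-mask indexing. A minimal sketch on a toy frame (the rows are illustrative, not the scraped data) shows the mask being built and then applied. The `.copy()` call is an optional addition, assumed here so that later modifications do not trigger chained-assignment warnings.

```python
import pandas as pd

# Toy data standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose Borough was assigned.
assigned = toy["Borough"] != "Not assigned"

# Keep only those rows; .copy() makes the result an independent frame.
filtered = toy[assigned].copy()
print(filtered)
```

Note that the original index labels are kept after filtering, which is why the notebook output starts at index 2.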


### Check whether any Neighborhoods are unassigned; there are none.

In [116]:
df[df['Neighborhood'] == 'Not assigned'].count()

Postal Code     0
Borough         0
Neighborhood    0
dtype: int64
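
The preview of the raw table shows blank (missing) neighborhoods rather than the string `'Not assigned'`, so a string comparison alone may not catch every gap. The sketch below, on toy data, counts missing values with `isna()` and then falls back to the borough name. The fallback rule is an assumption for illustration, not a step this notebook applies.

```python
import pandas as pd

# Toy data (illustrative values only); the second row has no neighborhood.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", None],
})

# Count neighborhoods that are missing (NaN/None) rather than a literal string.
n_missing = toy["Neighborhood"].isna().sum()

# Hypothetical fallback: reuse the borough name where the neighborhood is missing.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
print(n_missing, toy["Neighborhood"].tolist())
```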

### Now we need to merge all entries with the same Postal Code by combining the neighborhoods into a comma-separated value.

#### As we see there are 103 unique postal codes, so we expect a DataFrame with 103 rows.

In [128]:
df['Postal Code'].nunique()

103

In [119]:
df = df.groupby(["Postal Code", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
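
To see the aggregation in isolation, here is a small sketch on a toy frame in which one postal code appears on two rows. The split rows are made up for illustration. It uses `", ".join` directly as the aggregation function and `reset_index()` in place of `as_index=False`, which should give the same shape of result.

```python
import pandas as pd

# Toy data: postal code M1B is spread over two rows (illustrative only).
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern", "Rouge", "Woburn"],
})

# Group by postal code and borough, then join the neighborhood names per group.
merged = (
    toy.groupby(["Postal Code", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(merged)
```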


### As expected we get 103 rows.

In [72]:
df.shape

(103, 3)

<br>
<br>
<br>

## 2. Getting the geographical coordinates

#### Install and import the required packages and libraries

In [131]:
!pip3 install geocoder
import geocoder



### Load the coordinates from the provided CSV file.

In [133]:
coordinates = pd.read_csv("Geospatial_Coordinates.csv")
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merge our current DataFrame with the new coordinates DataFrame to get the coordinates for each entry.

In [135]:
df2 = df.merge(coordinates, on="Postal Code", how="left")
df2.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
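
A left merge keeps every row of the left frame and fills the coordinate columns with NaN where no postal code matches. The sketch below, on toy frames, shows one way to check for such gaps using the `indicator` and `validate` parameters of `merge`. The postal code `M9Z` and the borough `Example` are made up to force an unmatched row.

```python
import pandas as pd

# Toy frames (illustrative); "M9Z" has no entry in the coordinates table.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column; validate raises if keys are duplicated.
checked = left.merge(coords, on="Postal Code", how="left",
                     indicator=True, validate="one_to_one")

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```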


<br>
<br>
<br>

## 3. Clustering and segmenting

In [136]:
import os
# Read the API credentials from environment variables (names are placeholders) instead of hardcoding them.
client_id = os.environ['CLIENT_ID']
client_secret = os.environ['CLIENT_SECRET']