<h1 style="color:blue">Segmenting and Clustering Neighborhoods in Toronto</h1>

## Part I

<p>Start by importing the necessary libraries:</p>

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

<p>Then use <b>BeautifulSoup</b> and <b>Pandas</b> to create the initial dataframe:</p>

In [3]:
from io import StringIO

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'lxml')
# The first table on the page holds the postal codes
table = soup.find_all('table')[0]
# read_html returns a list of dataframes; wrap the HTML in StringIO and take the first one
df = pd.read_html(StringIO(str(table)))[0]
df = df[['Postal code', 'Borough', 'Neighborhood']]
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


<p>Next, drop the rows where 'Borough' is 'Not assigned'. Note that, contrary to what is stated in the lab instructions, there are no rows where Borough is assigned and Neighborhood is not. The next line of code is therefore enough to clean the dataframe:</p>

In [4]:
df = df.drop(df[df.Borough == 'Not assigned'].index)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
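
The claim above, that no row has an assigned Borough but a missing Neighborhood, can be checked rather than taken on trust. The following is a minimal sketch on a toy dataframe, not the scraped one: the rows are illustrative, and the names `toy`, `cleaned` and `n_missing` are my own.

```python
import pandas as pd

# Toy rows mirroring the scraped table's columns (values are illustrative).
toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [None, "Parkwoods", "Regent Park / Harbourfront"],
})

# Keep only the rows with an assigned borough.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Count rows that have a borough but no usable neighborhood.
missing_neighborhood = (
    cleaned["Neighborhood"].isna() | (cleaned["Neighborhood"] == "Not assigned")
)
n_missing = int(missing_neighborhood.sum())
print(n_missing)
```

Running the same two checks on the real `df` should also give zero if the observation holds for the current version of the Wikipedia table.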


<p>The 'Neighborhood' column of the source Wikipedia table uses ' / ' to separate multiple neighborhoods, instead of the commas required in the lab. Replace the separators wherever they occur in that column:</p>

In [5]:
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ',', ')
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
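
The default of the `regex` argument of `Series.str.replace` has differed between pandas versions, so it can be safer to state explicitly that the separator is a literal string. A small sketch on a toy series (the third value is made up to show a three-way split):

```python
import pandas as pd

# Toy neighborhood strings; "A / B / C" is an invented example.
names = pd.Series(["Parkwoods", "Regent Park / Harbourfront", "A / B / C"])

# regex=False treats ' / ' as plain text, not as a pattern.
converted = names.str.replace(" / ", ", ", regex=False)
print(converted.tolist())
```

Every occurrence in a string is replaced, so cells listing more than two neighborhoods are handled as well.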


In [6]:
df.shape

(103, 3)

## Part II

To save time, I will load the latitude and longitude data from the provided .csv file and rename its 'Postal Code' column to match the first dataframe:

In [7]:
df_geo = pd.read_csv("http://cocl.us/Geospatial_data")
df_geo = df_geo.rename(columns={"Postal Code": "Postal code"})
df_geo.head()

Unnamed: 0,Postal code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
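
If the URL is unreachable, the load-and-rename step can still be tried offline. The sketch below reads two of the rows shown above from an in-memory string instead of the network; `csv_text` and `geo` are names I introduce for the illustration only.

```python
import io

import pandas as pd

# Two rows copied from the output above, with the file's original header.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

geo = pd.read_csv(io.StringIO(csv_text))

# Align the key column name with the scraped dataframe before merging.
geo = geo.rename(columns={"Postal Code": "Postal code"})
print(list(geo.columns))
```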


In [10]:
# Merge the two datasets on the 'Postal code' column:
df_merged = pd.merge(df, df_geo, on='Postal code')
df_merged.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
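
`pd.merge` performs an inner join by default, so a postal code without coordinates would be dropped silently. A hedged sketch of how this could be detected, using toy frames: `M9Z` and its borough are hypothetical values added only to produce an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every row of `left`; `indicator` records where each row
# came from, and `validate` raises if a postal code is duplicated on either side.
checked = pd.merge(
    left, right, on="Postal code", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)
```

On the real data, an empty `unmatched` list together with an unchanged row count (103) would confirm that the inner join above lost nothing.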
