# Segmenting and Clustering Neighborhoods in Toronto

### Setup
All dependencies needed:

In [52]:
import numpy as np
import pandas as pd
import lxml  # HTML parser that pd.read_html relies on
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import requests

<hr>

## Task #1

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and load its table into a DataFrame.

In [16]:
wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables = pd.read_html(wiki)  # returns a list of DataFrames, one per HTML table found
df = tables[0]  # the first table holds the postal codes
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Now we need to clean the data:

* Rename the column **"Postal Code"** to **"PostalCode"**.
* Drop the rows whose **"Borough"** value is **"Not assigned"**.
* Check that every borough has a neighborhood assigned.

In [17]:
# Step 1: rename "Postal Code" to "PostalCode"
df.rename(columns={"Postal Code":"PostalCode"}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [18]:
# Step 2: drop unassigned boroughs and renumber the rows from 0
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [19]:
# Step 3: confirm that no column has missing values
df.info()
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   PostalCode    103 non-null    object
 1   Borough       103 non-null    object
 2   Neighborhood  103 non-null    object
dtypes: object(3)
memory usage: 1.3+ KB


PostalCode      0
Borough         0
Neighborhood    0
dtype: int64
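
Note that `isna()` only counts true missing values. The raw table uses the string "Not assigned" as a placeholder, so a neighborhood holding that text would still count as non-null. The following is a minimal sketch of an extra check on a small hypothetical frame (the rows are illustrative, not taken from the scraped table). Falling back to the borough name is one possible convention and is an assumption here, not something the data above required.

```python
import pandas as pd

# Hypothetical sample rows for illustration; not taken from the scraped table.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Not assigned"],
})

# Rows whose neighborhood is missing or still carries the placeholder string.
unassigned = sample["Neighborhood"].isna() | (sample["Neighborhood"] == "Not assigned")
print(unassigned.sum())

# One possible convention (an assumption here): reuse the borough name.
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
```

If the count is zero, as it appears to be for the cleaned dataframe above, the fallback line changes nothing.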

Finally, we use the **.shape** attribute to get the dimensions of the dataframe as (rows, columns).

In [20]:
df.shape

(103, 3)

<hr>


## Task #2

Starting from the dataframe built in **Task #1**, we need to add the latitude and longitude of each postal code. Since the **Geocoder package** is very unreliable, we'll use the following CSV instead: http://cocl.us/Geospatial_data

In [22]:
coords = pd.read_csv("http://cocl.us/Geospatial_data")
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [27]:
# Rename "Postal Code" to "PostalCode" so it matches the key column in df
coords.rename(columns={"Postal Code":"PostalCode"}, inplace=True)
coords.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [28]:
# Check that it has the same number of rows as df
coords.shape

(103, 3)
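
Equal row counts do not guarantee that both dataframes hold the same postal codes. A minimal sketch of a stricter check compares the key sets directly. The frames `left` and `right` below are hypothetical miniature stand-ins for `df` and `coords`.

```python
import pandas as pd

# Hypothetical miniature versions of `df` and `coords`, for illustration only.
left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M5A"]})
right = pd.DataFrame({"PostalCode": ["M5A", "M3A", "M4A"]})

left_codes = set(left["PostalCode"])
right_codes = set(right["PostalCode"])

# Codes present on one side only; both sets are empty when the keys agree.
missing_coords = left_codes - right_codes
unused_coords = right_codes - left_codes
same_keys = not missing_coords and not unused_coords
print(same_keys)
```

Row order does not matter for this comparison, since sets ignore ordering.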

Now we merge both dataframes on the shared **PostalCode** column:

In [33]:
df_toronto = df.merge(coords, on="PostalCode")
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [109]:
df_toronto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PostalCode    103 non-null    object 
 1   Borough       103 non-null    object 
 2   Neighborhood  103 non-null    object 
 3   Latitude      103 non-null    float64
 4   Longitude     103 non-null    float64
dtypes: float64(2), object(3)
memory usage: 3.6+ KB
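
By default, `merge` performs an inner join, so a postal code without matching coordinates would be dropped silently. The sketch below shows one way to make such a mismatch visible, using the `how`, `validate` and `indicator` parameters of `DataFrame.merge`. The frames `areas` and `points` are hypothetical miniatures, and "M0X" is a made-up code added only to trigger the unmatched case.

```python
import pandas as pd

# Miniature frames for illustration; "M0X" is a made-up code with no coordinates.
areas = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Example Borough"],
})
points = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every row of `areas`; `validate` raises a MergeError if a
# postal code is duplicated on either side; `indicator` adds a `_merge` column.
merged = areas.merge(points, on="PostalCode", how="left",
                     validate="one_to_one", indicator=True)

# Rows that found no coordinates are marked "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

For the notebook's data, the `info()` output above shows 103 non-null coordinates, which suggests every postal code found a match.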


<hr>
