# Segmenting and Clustering Neighborhoods in Toronto

### Part 1: Scraping data from Wikipedia
The following code scrapes the postal code table at <https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M> and produces a DataFrame listing the neighborhoods in Toronto.

In [1]:
import pandas as pd

We will use pandas to read the tables at the URL. `pd.read_html` returns a list of DataFrames, one for each HTML table it finds on the page.

In [23]:
data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
type(data)

list

In [24]:
len(data)

3
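
Because `pd.read_html` returns one DataFrame per table, picking a table by position depends on the page layout staying the same. As a minimal sketch of an alternative, the table could be selected by its column names instead. The helper `find_table` and the toy `tables` list below are hypothetical stand-ins, not part of the original notebook:

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")

# Toy stand-in for the list returned by pd.read_html (made-up data).
tables = [
    pd.DataFrame({"Province": ["ON"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    }),
]

# The matching table is found even though it is not the first element.
postal_table = find_table(tables, ["Postal Code", "Borough", "Neighborhood"])
print(postal_table.shape)  # (2, 3)
```

With the real page, the same helper could be applied to `data` in place of the toy list.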

From the webpage, we can see that the first element of the list should be the table that we are after. We will save this table as the DataFrame 'df'.

In [25]:
df = data[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


First, we will remove all rows where 'Borough' is 'Not assigned'. The remaining rows keep their original index labels, which is why the output below starts at 2.

In [26]:
df = df[df['Borough']!="Not assigned"]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
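
The filter above works by building a Boolean Series and using it to select rows. The sketch below shows the intermediate mask on a small made-up DataFrame (`toy` is an illustrative name, not from the notebook), and how `reset_index(drop=True)` could renumber the rows if a clean 0-based index is wanted:

```python
import pandas as pd

# Made-up sample shaped like the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# The comparison yields one True/False value per row.
mask = toy["Borough"] != "Not assigned"
print(mask.tolist())  # [False, False, True, True]

# Indexing with the mask keeps only the True rows, with their original labels.
kept = toy[mask]
print(kept.index.tolist())  # [2, 3]

# Optional: renumber the rows from 0.
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```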


Using a Boolean mask, we'll check to see if any Neighborhoods are defined as "Not assigned".

In [27]:
check = df[df['Neighborhood']=="Not assigned"]
check.head()

Unnamed: 0,Postal Code,Borough,Neighborhood


No neighborhoods are defined as "Not assigned", so the DataFrame is ready to go. The number of rows in the DataFrame is shown below:

In [28]:
df.shape

(103, 3)
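
As a complement to inspecting the filtered rows by eye, the "Not assigned" check can also be reduced to a single number or flag. This is a minimal sketch on made-up data, where one row is deliberately left unassigned to show what a positive result looks like:

```python
import pandas as pd

# Made-up sample; the second row is intentionally "Not assigned".
toy = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "North York"],
    "Neighborhood": ["Parkwoods", "Not assigned", "Victoria Village"],
})

# Boolean Series marking the unassigned neighborhoods.
unassigned = toy["Neighborhood"].eq("Not assigned")

print(int(unassigned.sum()))   # 1
print(bool(unassigned.any()))  # True
```

On the notebook's `df`, a count of 0 would confirm the same conclusion as the empty table above.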

### Part 2: Getting latitude and longitude for each postal code

We will load the latitude and longitude of each postal code from the CSV file `Geospatial_Coordinates.csv`.

In [29]:
lat_long = pd.read_csv("Geospatial_Coordinates.csv")
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [30]:
lat_long.shape

(103, 3)

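This DataFrame has the same number of rows as 'df' (103), which suggests one coordinate row per postal code, so we can merge the two DataFrames on the 'Postal Code' column. Note that `pd.merge` performs an inner join by default, so any postal code missing from either DataFrame would be silently dropped.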

In [31]:
df = pd.merge(df, lat_long, on='Postal Code')
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
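
Matching row counts do not by themselves guarantee that every postal code found a partner. One way to check, sketched here on made-up data, is a left merge with the `validate` and `indicator` options of `pd.merge`. The DataFrame names and the postal code `M9Z` with its zero coordinates are hypothetical:

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
# M5A is deliberately missing here; M9Z is an invented extra code.
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A", "M9Z"],
    "Latitude": [43.725882, 43.753259, 0.0],
    "Longitude": [-79.315572, -79.329656, 0.0],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
checked = pd.merge(
    neighborhoods, coords,
    on="Postal Code", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)  # ['M5A']
```

An empty `missing` list on the real data would confirm that the inner merge above dropped nothing.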
