
# Segmenting and Clustering Neighborhoods in Toronto

George Jordan <br>
IBM Applied Data Science Capstone <br>
Last Updated: 2-21-21 <br>

This notebook demonstrates the methods I used to cluster the neighborhoods of Toronto. First, I used pandas to scrape a Wikipedia article and load its table of postal codes into a dataframe. Then, I added geographic coordinates from a .csv file to that dataframe.

## Scraping and Creating the Dataframe, Part 1

First, I imported pandas, the only library needed to scrape the page.

In [2]:
import pandas as pd

Using pandas, I read the tables on the page (read_html returns a list with one dataframe per table, so I took the first) and displayed the first five rows to check that the dataframe was built properly.

In [81]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


I removed any row that had an unassigned Borough. After that, I checked the size of the dataframe before and after the removal of rows.

In [82]:
df1 = df[df.Borough != "Not assigned"]
print(df.shape)
print(df1.shape)

(180, 3)
(103, 3)
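
As a quick illustration of the filter above, here is a minimal sketch on a three-row toy table. The toy dataframe and the names `toy`, `assigned`, and `dropped` are assumptions made for this example and are not part of the scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped table (hypothetical, three rows only).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose Borough has been assigned.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Number of rows removed by the filter.
dropped = len(toy) - len(assigned)
```

Comparing `len(toy)` with `len(assigned)` gives the same before-and-after check as printing the two shapes.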


Next, I combined the Neighbourhoods that share a postal code so that each postal code has exactly one row.

In [86]:
combined = df1.groupby(['Postal Code','Borough'], as_index=False).agg(lambda x: ','.join(x))
combined.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
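
To show what the grouping does when a postal code really is spread over several rows, here is a small sketch on made-up input. The toy rows and the names `toy` and `merged` are assumptions for illustration. It also joins with `", "` rather than `","`, which is a stylistic choice and not what the cell above uses.

```python
import pandas as pd

# Hypothetical input where one postal code is split across two rows.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Woburn"],
})

# Group by postal code and borough, then join the neighbourhood names
# of each group into a single comma-separated string.
merged = (
    toy.groupby(["Postal Code", "Borough"], as_index=False)["Neighbourhood"]
       .agg(lambda names: ", ".join(names))
)
```

After the grouping, `merged` holds one row per postal code, with the two M5A rows collapsed into one.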


For every row that has a named Borough but an unassigned Neighbourhood, I set the Neighbourhood to the name of the Borough.

In [88]:
empty = combined['Neighbourhood'] == "Not assigned"
combined.loc[empty, 'Neighbourhood'] = combined.loc[empty, 'Borough']

In [89]:
combined.shape

(103, 3)
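
The shape alone does not show whether any "Not assigned" Neighbourhoods are left. One possible check, sketched here on a hypothetical two-row table (the rows and the names `toy`, `unnamed`, and `remaining` are assumptions for this example), is to count them after the fill:

```python
import pandas as pd

# Hypothetical rows: one with an unassigned Neighbourhood, one without.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Woburn"],
})

# Boolean mask of the rows that need a Neighbourhood name.
unnamed = toy["Neighbourhood"] == "Not assigned"

# Copy the Borough name into the Neighbourhood column for those rows.
toy.loc[unnamed, "Neighbourhood"] = toy.loc[unnamed, "Borough"]

# Count the rows that are still unassigned after the fill.
remaining = int((toy["Neighbourhood"] == "Not assigned").sum())
```

A count of zero confirms that every row now has a Neighbourhood name.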

This concludes the first part of this assignment.

## Adding Coordinates to the Data, Part 2

Using the provided link, I loaded the .csv file of postal code coordinates into a second dataframe.

In [93]:
url = "https://cocl.us/Geospatial_data"
coord = pd.read_csv(url)
coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I then combined the two dataframes. By default, pd.merge performs an inner join on the columns the two dataframes share (here, Postal Code), so only the postal codes present in both dataframes appear in the combined dataframe.

In [95]:
postal_coord = pd.merge(combined, coord)
postal_coord.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


To ensure that I didn't lose or gain any rows, I once again checked the shape of the dataframe.

In [96]:
postal_coord.shape

(103, 5)
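
The merge above relies on the defaults of `pd.merge`. A more explicit variant, sketched here on toy data, names the join key and join type and asks pandas to verify that each postal code appears at most once on each side. The postal code M9Z, the trimmed tables, and the names `left`, `right`, `joined`, and `unmatched` are assumptions for illustration only.

```python
import pandas as pd

# Toy left table; "M9Z" is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Toy right table; "M1E" has coordinates but no row on the left.
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Explicit inner join; validate raises MergeError if a key is duplicated.
joined = pd.merge(left, right, on="Postal Code", how="inner",
                  validate="one_to_one")

# An outer join with indicator=True lists codes found on one side only.
outer = pd.merge(left, right, on="Postal Code", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]
```

If `unmatched` is empty for the real data, the inner join dropped nothing, which agrees with the unchanged row count of 103.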

This concludes the second part of the assignment.

## Clustering the Neighbourhoods in Toronto