# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

#### Author: Altaaf Khan

In this assignment, we will explore, segment and cluster the neighborhoods in the city of Toronto. 

A Wikipedia page lists the postal codes, boroughs, and neighborhoods of Toronto, which is all the information we need to explore and cluster the neighborhoods. We will scrape that page, clean the data, and load it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

In [1]:
# Import various modules
import pandas as pd
import numpy as np
import requests

from pandas import json_normalize  # transform a JSON file into a pandas dataframe (pandas >= 1.0)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from bs4 import BeautifulSoup

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Folium for generating maps
import folium

# k-means for the clustering stage
from sklearn.cluster import KMeans

## 1. Scrape the Wikipedia page
Link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
wiki_page='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_source = requests.get(wiki_page).text

soup = BeautifulSoup(wiki_source, 'lxml')

table = soup.find("table")
table_rows = table.tbody.find_all("tr")

pc = []

for tr in table_rows:
    td = tr.find_all("td")
    row = [cell.text.strip() for cell in td]  # avoid shadowing the loop variable tr
    
    # Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    if row != [] and row[1] != "Not assigned":
        # If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
        if "Not assigned" in row[2]: 
            row[2] = row[1]
        pc.append(row)

# Dataframe of three columns: PostalCode, Borough, and Neighborhood
df_pc = pd.DataFrame(pc, columns = ["PostalCode", "Borough", "Neighborhood"])
df_pc.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
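
The two cleaning rules used in the scraping cell can be tried offline, without fetching the Wikipedia page. The following is a minimal sketch, not the notebook's own code: `clean_rows` is a hypothetical helper, and `sample_rows` is hand-written illustrative data standing in for the cell text extracted from each table row. Unlike the cell above, it skips any row that does not have exactly three cells, which is an added assumption.

```python
def clean_rows(rows):
    """Apply the two cleaning rules to rows of [postal_code, borough, neighborhood]."""
    cleaned = []
    for row in rows:
        # Header rows contain no <td> cells and arrive as empty lists (assumption:
        # every data row has exactly three cells).
        if len(row) != 3:
            continue
        postal_code, borough, neighborhood = row
        # Rule 1: ignore rows whose borough is "Not assigned".
        if borough == "Not assigned":
            continue
        # Rule 2: a "Not assigned" neighborhood takes the borough's name.
        if "Not assigned" in neighborhood:
            neighborhood = borough
        cleaned.append([postal_code, borough, neighborhood])
    return cleaned


# Hand-written sample rows (illustrative only, not scraped data).
sample_rows = [
    [],                                        # header row
    ["M1A", "Not assigned", "Not assigned"],   # dropped by rule 1
    ["M3A", "North York", "Parkwoods"],        # kept unchanged
    ["M7A", "Queen's Park", "Not assigned"],   # neighborhood filled by rule 2
]

cleaned = clean_rows(sample_rows)
print(cleaned)
```

Keeping the rules in a small function makes them easy to check on a few rows before running them against the live page.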


In [3]:
# Group all neighborhoods that share a postal code into one comma-separated string
df_pc = df_pc.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join).reset_index()
df_pc.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
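
To see what the `groupby` / `join` step does when a postal code appears on more than one row, here is a small sketch on a hand-made dataframe. The `sample` data is illustrative only and assumes one neighborhood per input row; the grouping expression is the same one used in the cell above.

```python
import pandas as pd

# Hand-made sample: M1B appears twice, M1G once (illustrative data only).
sample = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1B", "M1G"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge", "Malvern", "Woburn"],
    }
)

# Collapse rows that share (PostalCode, Borough) into one row whose
# Neighborhood is the comma-separated list of the original values.
grouped = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)

print(grouped)
print("Unique postal codes:", grouped["PostalCode"].is_unique)
```

After grouping, each postal code should occur exactly once, so checking `is_unique` on the `PostalCode` column is a cheap sanity check to run alongside the shape.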


In [4]:
# Shape (rows, columns) of the dataframe
print("Shape: ", df_pc.shape)

Shape:  (103, 3)
