# Segmenting and Clustering Neighborhoods in Toronto, Canada - Henry

## Introduction

In this assignment, we explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, however, Toronto's neighborhood data is not readily available online in a structured form. Each data science project is challenging in its own way, so we need to stay agile and be able to pick up new libraries and tools quickly as the project demands.

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods. We will scrape that page, wrangle and clean the data, and then read it into a pandas DataFrame so that it is in a structured format like the New York dataset.

Once the data is in a structured format, we can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

## Table of Contents

###     1. Download and Explore Dataset

###     2. Explore Neighborhoods in Toronto, Canada

###     3. Analyze Each Neighborhood

###     4. Cluster Neighborhoods

###     5. Examine Clusters

## Import libraries

In [55]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# import scraping library
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

### Set up URL and download the web page

In [56]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
response.raise_for_status() # fail early if the page could not be downloaded
source = response.text

In [57]:
soup = BeautifulSoup(source, 'lxml')
article = soup.find('div', class_="mw-parser-output")
table = article.table
rows = table.tbody.find_all("tr")

In [58]:
html_data = []
for row in rows:
    info = row.text.split('\n')[1:-1] # remove empty str (first and last items)
    html_data.append(info)
    
html_data[0:10]

[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned']]
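
Splitting `row.text` on newlines works for this table, but it depends on how the page lays out whitespace. As a minimal sketch of an alternative, the cells can be read one by one. The HTML string below is a hypothetical stand-in for the Wikipedia table (not the live page), and `parse_table_rows` is an illustrative helper name, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure of the Wikipedia table
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_table_rows(html):
    """Return one list of stripped cell texts per <tr> of the first table."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml required
    table = soup.find('table')
    parsed = []
    for tr in table.find_all('tr'):
        cells = tr.find_all(['th', 'td'])  # header and data cells alike
        parsed.append([cell.get_text(strip=True) for cell in cells])
    return parsed

sample_rows = parse_table_rows(sample_html)
print(sample_rows)
```

Reading each `<th>`/`<td>` separately avoids relying on the first and last split items being empty strings.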

In [59]:
html_data[0][-1] = 'Neighborhood' # change to American spelling
df_toronto = pd.DataFrame(html_data[1:], columns=html_data[0])
df_toronto.head(10)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


In [60]:
df_toronto.shape

(289, 3)

## Data Cleaning

### Define dropping masks

In [61]:
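# Each "mask" holds the index labels of the matching rows (not a boolean array)
borough_mask = df_toronto.index[df_toronto['Borough'] == 'Not assigned']
neighborhood_mask = df_toronto.index[df_toronto['Neighborhood'] == 'Not assigned']
# Rows in both sets; .intersection() is explicit, whereas & on Index objects is not a set operation in recent pandas
neighborhood_and_borough_mask = borough_mask.intersection(neighborhood_mask)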

### Print out the statistics before cleaning

In [62]:
print('Statistics before cleaning:\n')
print('  {} Postal codes'.format(df_toronto['Postcode'].unique().shape[0]))
print('  {} Boroughs'.format(df_toronto['Borough'].unique().shape[0] - 1)) # minus 1 to exclude the 'Not assigned' placeholder
print('  {} Neighborhoods'.format(df_toronto['Neighborhood'].unique().shape[0]))
print('  {} rows with Not assigned Borough'.format(borough_mask.shape[0]))
print('  {} rows with Not assigned Neighborhood'.format(neighborhood_mask.shape[0]))
print('  {} rows with Not assigned Neighborhood and Borough'.format(neighborhood_and_borough_mask.shape[0]),'\n')

print('The DataFrame shape is {}'.format(df_toronto.shape),'\n')

Statistics before cleaning:

  180 Postal codes
  11 Boroughs
  210 Neighborhoods
  77 rows with Not assigned Borough
  78 rows with Not assigned Neighborhood
  77 rows with Not assigned Neighborhood and Borough 

The DataFrame shape is (289, 3) 
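
The counts above come from index-label sets. As a minimal sketch of an equivalent approach, boolean masks can be combined with `&` and summed directly. The three-row `sample` frame and the variable names below are illustrative assumptions, not the scraped data.

```python
import pandas as pd

# Hypothetical three-row sample in the same shape as df_toronto
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Boolean Series: True where the placeholder value appears
no_borough = sample['Borough'] == 'Not assigned'
no_neighborhood = sample['Neighborhood'] == 'Not assigned'

# True counts as 1, so .sum() gives the number of matching rows
n_no_borough = int(no_borough.sum())
n_no_neighborhood = int(no_neighborhood.sum())
n_both = int((no_borough & no_neighborhood).sum())  # element-wise AND

print(n_no_borough, n_no_neighborhood, n_both)
```

Boolean masks stay aligned with the DataFrame rows, so they can be reused for filtering without recomputing index labels.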



### Dropping rows with "Not assigned" in the "Borough" column

In [63]:
df_toronto.drop(borough_mask, inplace=True) # borough_mask already holds the row labels to drop
df_toronto.reset_index(drop=True, inplace=True)
print(df_toronto.shape)
df_toronto.head(10)

(212, 3)


|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
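
The same rows can also be removed by filtering instead of dropping by label. The following is a minimal sketch on a hypothetical three-row `sample` frame (an assumption for illustration, not the scraped data).

```python
import pandas as pd

# Hypothetical sample in the same shape as df_toronto
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with an assigned borough, then renumber the index from 0
kept = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(kept)
```

This returns a new DataFrame rather than modifying the original in place, which makes it easier to rerun a cell without side effects.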


### Replace "Not assigned" values in the "Neighborhood" column with the "Borough" name of the same row

In [64]:
# Rerun this because the indexes on the dataframe were reset
neighborhood_mask = df_toronto.index[df_toronto['Neighborhood'] == 'Not assigned']

In [65]:
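# Assign through .loc; chained indexing (df[col][idx] = ...) can raise SettingWithCopyWarning
df_toronto.loc[neighborhood_mask, 'Neighborhood'] = df_toronto.loc[neighborhood_mask, 'Borough']
print(df_toronto.shape)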
df_toronto.head(10)

(212, 3)


|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
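
A vectorized way to express this replacement is `Series.mask`, which swaps in values from another Series wherever a condition holds. This is a minimal sketch on a hypothetical two-row `sample` frame, not the scraped data.

```python
import pandas as pd

# Hypothetical sample: the first row has no neighborhood assigned
sample = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

missing = sample['Neighborhood'] == 'Not assigned'

# Where `missing` is True, take the value from 'Borough'; otherwise keep the original
sample['Neighborhood'] = sample['Neighborhood'].mask(missing, sample['Borough'])
print(sample)
```

Because the replacement is aligned by index, no explicit loop over row labels is needed.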


### Print out the statistics after cleaning

In [66]:
borough_mask = df_toronto.index[df_toronto['Borough'] == 'Not assigned']
neighborhood_mask = df_toronto.index[df_toronto['Neighborhood'] == 'Not assigned']
neighborhood_and_borough_mask = borough_mask.intersection(neighborhood_mask) # explicit set intersection of index labels

print('Statistics after cleaning:\n')
print('  {} Postal codes'.format(df_toronto['Postcode'].unique().shape[0]))
print('  {} Boroughs'.format(df_toronto['Borough'].unique().shape[0]))
print('  {} Neighborhoods'.format(df_toronto['Neighborhood'].unique().shape[0]))
print('  {} rows with Not assigned Borough'.format(borough_mask.shape[0]))
print('  {} rows with Not assigned Neighborhood'.format(neighborhood_mask.shape[0]))
print('  {} rows with Not assigned Neighborhood and Borough'.format(neighborhood_and_borough_mask.shape[0]),'\n')

print('The DataFrame shape is {}'.format(df_toronto.shape),'\n')

Statistics after cleaning:

  103 Postal codes
  11 Boroughs
  210 Neighborhoods
  0 rows with Not assigned Borough
  0 rows with Not assigned Neighborhood
  0 rows with Not assigned Neighborhood and Borough 

The DataFrame shape is (212, 3) 



### Group by the 'Postcode' column and consolidate the content in 'Neighborhood' cells

In [67]:
# lambda functions to handle the cell operations
f_neighborhoods = lambda x: ', '.join(x) # join all neighborhoods of one postcode
f_boroughs = lambda x: x.iloc[0] # each postcode maps to one borough, so take the first value

temp = df_toronto.groupby('Postcode')
temp_neighborhoods = temp['Neighborhood'].apply(f_neighborhoods)
temp_boroughs = temp['Borough'].apply(f_boroughs)

columns_list = list(zip(temp_boroughs.index, temp_boroughs, temp_neighborhoods))
df_toronto_grouped = pd.DataFrame(columns_list)

df_toronto_grouped.columns = ['Postcode', 'Borough', 'Neighborhood']

df_toronto_grouped.head(10)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


### Print out the final DataFrame shape

In [68]:
print('The final DataFrame shape is {}'.format(df_toronto_grouped.shape),'\n')

The final DataFrame shape is (103, 3) 
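
Beyond the shape, a couple of quick sanity checks can confirm that the grouped frame looks as expected. This is a minimal sketch on a hypothetical `final` frame; the pattern `M`, digit, uppercase letter is an assumption based on the postcodes shown above.

```python
import pandas as pd

# Hypothetical stand-in for df_toronto_grouped
final = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M3A'],
    'Borough': ['Scarborough', 'Scarborough', 'North York'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union', 'Parkwoods'],
})

# After grouping, every postcode should appear exactly once
postcodes_unique = final['Postcode'].is_unique

# Assumed format for this table: 'M', one digit, one uppercase letter
format_ok = bool(final['Postcode'].str.match(r'^M\d[A-Z]$').all())

# No placeholder values should remain in any column
no_placeholders = not bool((final == 'Not assigned').any().any())

print(postcodes_unique, format_ok, no_placeholders)
```

If any of these checks fails on the real data, it points back to the cleaning or grouping step as the place to look.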

