<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

In this notebook I follow the same procedure used in the lab on clustering neighborhoods in New York. However, I had to carry out extra data-processing steps before applying the k-means clustering algorithm.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto</a>

3. <a href="#item4">Cluster Neighborhoods</a>

4. <a href="#item5">Exploring the clusters</a>    
</font>
</div>

## 1. Download and Explore Dataset

In [2]:
import pandas
import requests
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
my_table = pandas.read_html(url,match='.+', flavor=['lxml'],header=0)[0]
my_table = my_table.rename(columns={'Postcode':'PostalCode'})
my_table.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


To start the project, I downloaded the data with the read_html function from the pandas library and renamed the column Postcode to PostalCode to match the dataframe required for the first assignment.

In [3]:
my_table = my_table[my_table.Borough != 'Not assigned'].reindex()

The line above removes every postal code that has no borough assigned. The original row labels are kept, as the output of the next cell shows.
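As a small illustration of this filtering step, the sketch below applies the same boolean mask to a toy dataframe. The rows are copied from the preview above, and the names `toy`, `assigned` and `assigned_renumbered` are made up for this example. It also shows `reset_index(drop=True)`, an optional extra call that this notebook does not use, which renumbers the rows from zero after filtering.

```python
import pandas

# Toy data copied from the preview of my_table above.
toy = pandas.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only the rows whose Borough is assigned.
# The surviving rows keep their original labels (1 and 2).
assigned = toy[toy['Borough'] != 'Not assigned']

# Optional step (an assumption, not part of the notebook):
# renumber the rows so the index runs 0..n-1 again.
assigned_renumbered = assigned.reset_index(drop=True)

print(assigned)
print(assigned_renumbered)
```

Either form works for the steps that follow, because the later cells group and index by PostalCode rather than by row label.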

In [4]:
my_table.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [5]:
toronto_nb = my_table.groupby('PostalCode')['Neighbourhood'].apply(', '.join).reset_index()
toronto_nb.sort_values(by='PostalCode',inplace=True)
toronto_nb.set_index('PostalCode',inplace=True)
toronto_nb.head(10)

PostalCode,Neighbourhood
M1B,"Rouge, Malvern"
M1C,"Highland Creek, Rouge Hill, Port Union"
M1E,"Guildwood, Morningside, West Hill"
M1G,Woburn
M1H,Cedarbrae
M1J,Scarborough Village
M1K,"East Birchmount Park, Ionview, Kennedy Park"
M1L,"Clairlea, Golden Mile, Oakridge"
M1M,"Cliffcrest, Cliffside, Scarborough Village West"
M1N,"Birch Cliff, Cliffside West"


The cell above creates a new dataframe in which neighbourhoods that share a postal code are combined into a single comma-separated row.
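To make the grouping easier to follow, here is a minimal sketch of the same groupby-and-join pattern on a toy dataframe. The rows are taken from the preview of my_table, and the names `toy` and `grouped` are hypothetical.

```python
import pandas

# Toy data: two postal codes with two neighbourhoods each, one with a single neighbourhood.
toy = pandas.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A', 'M6A', 'M3A'],
    'Neighbourhood': ['Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# Group by postal code and join the neighbourhood names of each group with ', '.
# groupby sorts the group keys by default, so the result is ordered by PostalCode.
grouped = toy.groupby('PostalCode')['Neighbourhood'].apply(', '.join).reset_index()

print(grouped)
```

Within each group the names keep the order in which they appear in the input, which is why "Rouge, Malvern" in the real output follows the order of the Wikipedia table.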

In [6]:
borough_df = my_table
borough_df = borough_df.drop(['Neighbourhood'], axis=1)
borough_df.sort_values(by='PostalCode',inplace=True)
borough_df.drop_duplicates(inplace=True)
borough_df.set_index('PostalCode',inplace=True)
borough_df.head()

PostalCode,Borough
M1B,Scarborough
M1C,Scarborough
M1E,Scarborough
M1G,Scarborough
M1H,Scarborough


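The cell above builds a second dataframe that maps each postal code to its borough. In the following cells I combine both dataframes to obtain the dataframe required in the third step of the assignment.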

In [7]:
toronto_nb['Borough'] = borough_df['Borough']

In [8]:
toronto_nb.reset_index(inplace=True)

In [9]:
toronto_nb = toronto_nb[['PostalCode','Borough','Neighbourhood']]
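The column assignment used to combine the two dataframes works because pandas aligns a Series on the index of the target dataframe, not on row position. The sketch below shows this on toy data whose rows are deliberately in a different order. The names `neighbourhoods`, `boroughs` and `combined` are made up for the example.

```python
import pandas

# Both toy dataframes are indexed by postal code, but in different row order.
neighbourhoods = pandas.DataFrame(
    {'Neighbourhood': ['Parkwoods', 'Rouge, Malvern']},
    index=pandas.Index(['M3A', 'M1B'], name='PostalCode'),
)
boroughs = pandas.DataFrame(
    {'Borough': ['Scarborough', 'North York']},
    index=pandas.Index(['M1B', 'M3A'], name='PostalCode'),
)

# Assigning a Series as a new column matches values by index label (postal code).
neighbourhoods['Borough'] = boroughs['Borough']

# Turn the index back into a column and put the columns in the required order.
combined = neighbourhoods.reset_index()[['PostalCode', 'Borough', 'Neighbourhood']]

print(combined)
```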

In the following cell I assign the value of the Borough column to every postal code whose Neighbourhood is 'Not assigned'.

In [11]:
# Select the rows that have no neighbourhood name.
not_assigned = toronto_nb['Neighbourhood'] == 'Not assigned'
# Use .loc instead of chained indexing so the write goes to toronto_nb itself.
toronto_nb.loc[not_assigned, 'Neighbourhood'] = toronto_nb.loc[not_assigned, 'Borough']
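The replacement of 'Not assigned' neighbourhoods with the borough name can also be tried on a toy dataframe. The sketch below uses `Series.where`, an alternative that this notebook does not use. The name `toy` is hypothetical and the rows are taken from the tables above.

```python
import pandas

toy = pandas.DataFrame({
    'PostalCode': ['M7A', 'M1G'],
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Woburn'],
})

# where() keeps a value when the condition is True and
# takes the value from `other` (here the Borough column) when it is False.
has_name = toy['Neighbourhood'] != 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].where(has_name, toy['Borough'])

print(toy)
```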

In [10]:
toronto_nb.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


The shape of the resulting dataframe is:

In [12]:
toronto_nb.shape

(103, 3)

## 2. Explore Neighborhoods in Toronto

I chose to take the latitude and longitude coordinates from the link provided in the assignment, so I downloaded that data and combined it with the previous dataframe.

In [13]:
!wget -q -O 'toronto_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [14]:
geodata = pandas.read_csv('toronto_data.csv', delimiter = ',')

In [15]:
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
toronto_nb['Latitude'] = geodata['Latitude']
toronto_nb['Longitude'] = geodata['Longitude']

In [17]:
toronto_nb.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
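The column assignment above copies the coordinates by row label, so it relies on both dataframes listing the postal codes in the same order. A possible alternative, sketched here on toy data and not what this notebook does, is to merge on the postal code so that the order no longer matters. The names `codes`, `geo` and `merged` are hypothetical.

```python
import pandas

# Toy versions of the two dataframes, deliberately in different row order.
codes = pandas.DataFrame({
    'PostalCode': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
})
geo = pandas.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Match rows on the postal code; a left merge keeps every row of `codes`.
merged = codes.merge(geo, left_on='PostalCode', right_on='Postal Code', how='left')

# The key from the right-hand table is now redundant.
merged = merged.drop(columns='Postal Code')

print(merged)
```

With a left merge, a postal code missing from the coordinate file would show up as NaN in Latitude and Longitude instead of silently receiving another row's values.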


Next, I checked how many boroughs the dataframe contains.

In [18]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_nb['Borough'].unique()),
        toronto_nb.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


And their names:

In [19]:
toronto_nb['Borough'].unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)