# Segmenting and Clustering Neighbourhoods in Toronto - Fan

## Scrape the raw data from Wikipedia

As a first step, we need to scrape the data from the Wikipedia page [Toronto Data](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). For this step we essentially only need the pandas library, whose `read_html` function parses the HTML tables on a page into DataFrames. Let's import the necessary packages.

In [2]:
import pandas as pd
import numpy as np
print("library imported")

library imported


In [3]:
dataurl = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_list = pd.read_html(dataurl)
print("df list created")

df list created


In [9]:
len(df_list)  # check how many DataFrames were loaded
df_list[0].head()  # confirm this is the target table
df_raw = df_list[0]  # keep the target table
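
Taking `df_list[0]` works as long as the postal code table is the first table on the page. A slightly more defensive alternative, sketched below, picks the first table that contains the expected column names. The helper `pick_table` and the toy `tables` list are hypothetical and exist only for illustration; they stand in for the list returned by `pd.read_html`.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table with columns {sorted(required)}")


# Toy stand-ins for the list returned by pd.read_html (illustrative data only)
tables = [
    pd.DataFrame({"Region": ["A"], "Codes": [1]}),
    pd.DataFrame(
        {
            "Postal Code": ["M3A"],
            "Borough": ["North York"],
            "Neighbourhood": ["Parkwoods"],
        }
    ),
]

target = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(target.columns.tolist())
```

Selecting by column names rather than by position keeps the notebook working if the page gains another table ahead of the one we want.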

Now that we've got the raw data, the next step is to perform the wrangling as required:

1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
2. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
3. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
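
The three rules above can be sketched on a small frame. This is a minimal illustration, assuming a one-row-per-neighbourhood layout in which a postal code can repeat; the `toy` data is made up for the example and is not the scraped table.

```python
import pandas as pd

# Toy data (illustrative only): M5A appears twice, M7A has no neighbourhood
toy = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Rule 1: keep only rows with an assigned borough
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 3: a missing neighbourhood takes the borough's name
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

# Rule 2: combine neighbourhoods that share a postal code into one row
cleaned = (
    assigned.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(cleaned)
```

Rule 3 is applied before the grouping so that a placeholder value never ends up inside a comma-separated list.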

In [12]:
df_raw.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [27]:
df_raw_1 = df_raw[df_raw['Borough'] != "Not assigned"].copy()  # drop rows with no assigned borough; copy so later in-place edits do not touch df_raw
df_raw_1['Postal Code'].value_counts().unique()  # a result of array([1]) means no postal code is duplicated
df_raw_1[df_raw_1['Neighbourhood'] == "Not assigned"]  # list rows with an unassigned neighbourhood; any such row should take its borough's name

Unnamed: 0,Postal Code,Borough,Neighbourhood
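
The same two checks can also be written as an explicit boolean and a count, which may be easier to read than inspecting the output of `value_counts().unique()`. A minimal sketch, assuming a frame with the same column names; `sample` reuses three rows shown earlier and is only a stand-in for `df_raw_1`.

```python
import pandas as pd

# Stand-in for df_raw_1, built from three rows of the table above
sample = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
    }
)

# True if any postal code appears more than once
has_duplicates = not sample["Postal Code"].is_unique

# Number of rows whose neighbourhood is still unassigned
unassigned_count = int((sample["Neighbourhood"] == "Not assigned").sum())

print(has_duplicates, unassigned_count)
```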


Now that we've performed the required checks and cleaning, the last step is to reset the index and return the shape.

In [32]:
df_raw_1.reset_index(drop=True, inplace=True)  # reset the index

In [35]:
df_raw_1.shape  # return the shape of the cleaned DataFrame

(103, 3)

## Find the Longitude and Latitude

The next step of the assignment is to find the longitude and latitude for each postal code. Given the intermittent connection to the Foursquare API, I've decided to use the CSV file that was provided.

In [37]:
df_coord = pd.read_csv('http://cocl.us/Geospatial_data')

In [38]:
df_coord.head()  # visually check the ingested DataFrame

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [41]:
df_raw_2 = pd.merge(df_raw_1, df_coord, on='Postal Code', how='left')  # left join to bring in the latitude and longitude
df_raw_2.head(10)  # check the result

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
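
A left join keeps every postal code from the left table even when no coordinates match, so a visual check of the first rows can miss gaps further down. The sketch below shows one way to make the check explicit with the `validate` and `indicator` parameters of `pd.merge`. The frames `boroughs` and `coords` are small stand-ins built from rows shown above, not the full tables.

```python
import pandas as pd

# Stand-ins for df_raw_1 and df_coord, using rows from the outputs above
boroughs = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M1B"],
        "Borough": ["North York", "North York", "Scarborough"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M3A", "M4A"],
        "Latitude": [43.806686, 43.753259, 43.725882],
        "Longitude": [-79.194353, -79.329656, -79.315572],
    }
)

merged = pd.merge(
    boroughs,
    coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",  # raises MergeError if a postal code repeats on either side
    indicator=True,  # adds a _merge column recording where each row matched
)

# Rows present only on the left side have no coordinates
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))

merged = merged.drop(columns="_merge")
```

If `unmatched` is not empty, those postal codes would carry `NaN` coordinates into any later step, so it is worth handling them before moving on.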


The second step is now complete.

## Cluster the Neighbourhoods

In [42]:
import requests
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
print("library imported")

library imported
