# Segmenting and Clustering Neighborhoods in Toronto
##### Submitted by Nilo Villanueva

### Assignment guidelines and requirements

Task: Explore and cluster the neighborhoods in Toronto.

1. Scrape the postal codes, boroughs, and neighborhood names from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. The resulting pandas dataframe should have PostalCode, Borough, and Neighborhood as column names.
3. Rows with a Borough marked "Not assigned" are not processed.
4. Postal codes with multiple neighborhood entries are combined into one row, with the neighborhoods separated by commas.
5. If a row has a borough but its neighborhood is "Not assigned", the neighborhood takes the same value as the borough.
6. Show the dimensions of the dataframe.

In [1]:
# Getting the needed dependencies

import numpy as np  #Handles data in vectorized manner

!conda install -c conda-forge lxml --yes
!conda install -c conda-forge bs4 --yes
!conda install -c conda-forge html5lib --yes

!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup

import pandas as pd #To perform data analysis and make dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from IPython.display import display_html
from IPython.display import display

import json #Library for handling JSON data

!conda install -c conda-forge geopy --yes # For geographical data
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.


## Scraping Data From Toronto Postal Wiki-page

##### Scraping and parsing with the Python package BeautifulSoup and the lxml parser

In [40]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
table = str(soup.table) # raw structured data is saved in table
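
To make the scraping step easier to follow offline, here is a minimal sketch of how `soup.table` picks out the first table on a page and how its rows can be read. The `SAMPLE_HTML` string and the `first_table_rows` helper are hypothetical stand-ins (not part of the notebook), and the sketch assumes Python's built-in `html.parser` instead of lxml so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for the downloaded Wikipedia page.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table><tr><td>some other table</td></tr></table>
</body></html>
"""


def first_table_rows(html):
    """Return the cell text of every row in the first <table> of the page."""
    # Assumption: "html.parser" (standard library) replaces the notebook's lxml parser.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.table  # shorthand for soup.find("table"): only the first match
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are collected alike.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


rows = first_table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because `soup.table` returns only the first `<table>` element, this approach relies on the postal code table being the first one on the page; if the page layout changes, a more specific selector would be needed.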

##### Inspecting the raw data shows "Not assigned" entries in the Borough and Neighbourhood columns that need to be cleaned up.

In [41]:
display_html(table, raw=True) #displays the scraped raw data from the Wiki

Postal Code,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"


##### Converting the HTML data into a pandas dataframe

In [43]:
df = pd.read_html(table, header=0)[0] #the function read_html returns a list of DataFrame objects
print(type(df))
display(df.head(12))

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


#### Cleaning the dataframe as required 
1. The pandas dataframe should have PostalCode, Borough, and Neighborhood as column names.
2. Rows with a Borough marked "Not assigned" are not processed.
3. Postal codes with multiple neighborhood entries are combined into one row, with the neighborhoods separated by commas.
4. If a row has a borough but its neighborhood is "Not assigned", the neighborhood takes the same value as the borough.

In [44]:
df = df[df.Borough != "Not assigned"] # Remove Not assigned Boroughs
df = df.groupby(["Postal Code","Borough"],sort = False).agg(','.join) # Joins everything after Postal Code & Borough, in this case, Neighborhood. 
df.reset_index(inplace=True)
display(df.head(12))

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
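
The cleaning rules can also be checked on a small toy table. The sketch below is an illustration under stated assumptions, not the notebook's own cell: the rows in `raw` are hypothetical, `clean_postal_codes` is a made-up helper name, and the borough fallback for unnamed neighbourhoods and the final column rename are written out as explicit steps.

```python
import pandas as pd

# Hypothetical toy rows; the column names match the scraped table.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park",
                      "Harbourfront", "Not assigned"],
})


def clean_postal_codes(frame):
    """Apply the cleaning rules to a postal code table (illustrative helper)."""
    # Rule: rows whose borough is "Not assigned" are dropped.
    out = frame[frame["Borough"] != "Not assigned"].copy()

    # Rule: an unnamed neighbourhood takes the name of its borough.
    unnamed = out["Neighbourhood"].str.strip() == "Not assigned"
    out.loc[unnamed, "Neighbourhood"] = out.loc[unnamed, "Borough"]

    # Rule: neighbourhoods sharing a postal code are joined into one row.
    out = (
        out.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
    )
    return out


cleaned = clean_postal_codes(raw)

# Optional: switch to the column names listed in the assignment guidelines.
renamed = cleaned.rename(
    columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"}
)

print(cleaned)
print(cleaned.shape)
```

Selecting the `Neighbourhood` column before aggregating keeps the join limited to that column, and `sort=False` preserves the order in which postal codes first appear.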


#### Displaying the dimensions of the dataframe

In [47]:
print("The cleaned and finalized dataframe is",df.shape)

The cleaned and finalized dataframe is (103, 3)


### Latitude and Longitude Data

Task: Add latitude and longitude to the cleaned dataframe.

1. A csv file (source: http://cocl.us/Geospatial_data) will be used to provide the geographical coordinates of each postal code.

In [49]:
df_geo = pd.read_csv("./Geospatial_Coordinates.csv")
display(df_geo.head(12))

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [50]:
pd.options.display.max_columns = None

The cleaned dataframe (df) and the latitude/longitude table (df_geo) are merged on their shared Postal Code column.

In [52]:
df_locations = pd.merge(df, df_geo, how='left', on='Postal Code') # left join keeps every row of df
display(df_locations.head(12))

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
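
A left join keeps every cleaned row even when no coordinates are found, so it can be worth checking for unmatched postal codes afterwards. The sketch below is one possible check, not part of the original notebook: the two small frames are toy data (the coordinates for M3A and M4A are taken from the output above, while `M9Z` is a hypothetical code with no match), and it uses the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for df and df_geo; "M9Z" is a hypothetical unmatched code.
cleaned = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

merged = pd.merge(
    cleaned,
    coords,
    how="left",              # keep every row of the cleaned table
    on="Postal Code",
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # add a "_merge" column describing each row's match
)

# Rows marked "left_only" found no coordinates in the lookup table.
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()

print(merged)
print("Postal codes without coordinates:", missing)
```

An empty `missing` list on the real data would confirm that every postal code in the cleaned dataframe received a latitude and longitude.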
