# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada #

## Introduction

The aim of this project is to write code that identifies neighborhood segments in Toronto and clusters them according to the venues available in the vicinity of each neighborhood.

**The process is made up of the following steps**

1. Input and structure the data from [Wikipedia: List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

2. Explore Neighbourhood Data

3. Analyse and Cluster Data

In [18]:
import numpy as np 
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



**Install Beautiful Soup & Geocoder**

In [21]:
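# Install the packages first, then import them.
! pip install beautifulsoup4
! pip install geocoder
import requests
import json
from bs4 import BeautifulSoup
# geopy is a separate package from geocoder and is assumed to be installed already.
from geopy.geocoders import Nominatim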


**Install Folium, then import the libraries used to turn JSON into a Pandas DataFrame, plot, and cluster**

In [22]:
!conda install -c conda-forge folium=0.5.0 --yes

Solving environment: done

# All requested packages already installed.



In [23]:
# Flattens nested JSON into a DataFrame (available as pd.json_normalize in pandas >= 1.0).
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

import folium

print('Libraries imported.')

Libraries imported.


**Download Toronto Data**

In [24]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [29]:
# Parse the downloaded page and locate the postal-code table.
with open('toronto_data.html', encoding='utf-8') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
table_html = soup.find('table', class_='wikitable sortable')

**Create a DataFrame from the table data**

In [35]:
col_names = ['PostalCode', 'Borough', 'Neighborhood']

# Collect one dict per table row, then build the DataFrame once.
# (Appending to a DataFrame inside a loop is slow, and DataFrame.append
# was removed in pandas 2.0.)
rows = []
for tr in table_html.tbody.find_all('tr')[1:]:  # skip the header row
    cells = [td.text for td in tr.find_all('td')]
    if len(cells) == len(col_names):
        rows.append(dict(zip(col_names, cells)))

neighborhoods = pd.DataFrame(rows, columns=col_names)

print('Size', neighborhoods.shape)
neighborhoods.head()

Size (288, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |
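
The row-collection pattern can be tried without downloading the page. The following is a minimal sketch that assumes a small hand-written HTML table in place of the Wikipedia one; `SAMPLE_HTML` and the `table_to_rows` helper are hypothetical names introduced for illustration, and the built-in `html.parser` is used instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative rows only).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""


def table_to_rows(html, col_names):
    """Hypothetical helper: return one dict per data row of the wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    rows = []
    for tr in table.find_all("tr"):
        # The header row holds <th> cells only, so it yields no <td> cells
        # and is skipped by the length check below.
        cells = [td.get_text() for td in tr.find_all("td")]
        if len(cells) == len(col_names):
            rows.append(dict(zip(col_names, cells)))
    return rows


rows = table_to_rows(SAMPLE_HTML, ["PostalCode", "Borough", "Neighborhood"])
print(rows)
```

As in the notebook output, the last cell of each row keeps its trailing newline, which is why a later cleaning step strips it.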


1. **Replace 'Not assigned' neighborhoods with the borough's name**
2. **Combine neighborhoods belonging to the same postal code**

In [39]:
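# Strip the trailing newline left by td.text. Assigning to the row yielded by
# iterrows() is not guaranteed to update the DataFrame, so use column operations.
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.rstrip()

# Drop postal codes that have no borough assigned.
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned']
neighborhoods = neighborhoods.reset_index(drop=True)

# Where the neighborhood is not assigned, fall back to the borough name.
unassigned = neighborhoods['Neighborhood'] == 'Not assigned'
neighborhoods.loc[unassigned, 'Neighborhood'] = neighborhoods.loc[unassigned, 'Borough']

print('Size', neighborhoods.shape)
neighborhoods.head()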

Size (103, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [41]:
neighborhoods = neighborhoods.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
print('Size',neighborhoods.shape)
neighborhoods.head(15)

Size (103, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [42]:
neighborhoods.shape

(103, 3)
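
As a quick check of the cleaning logic, the sketch below runs the same steps (strip whitespace, drop unassigned boroughs, fall back to the borough name, combine by postal code) on four hand-written rows. The `clean_neighborhoods` helper and the sample rows are assumptions made for illustration and are not part of the scraped data.

```python
import pandas as pd

# Hand-written sample rows for illustration only (not scraped from the page).
sample = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned\n"),
        ("M5A", "Downtown Toronto", "Harbourfront\n"),
        ("M5A", "Downtown Toronto", "Regent Park\n"),
        ("M7A", "Queen's Park", "Not assigned\n"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)


def clean_neighborhoods(df):
    """Hypothetical helper: apply the cleaning steps to a raw postal-code table."""
    out = df.copy()
    # Remove the trailing newline left over from the HTML cells.
    out["Neighborhood"] = out["Neighborhood"].str.rstrip()
    # Drop postal codes with no borough assigned.
    out = out[out["Borough"] != "Not assigned"].reset_index(drop=True)
    # Use the borough name where the neighborhood is not assigned.
    unassigned = out["Neighborhood"] == "Not assigned"
    out.loc[unassigned, "Neighborhood"] = out.loc[unassigned, "Borough"]
    # Combine neighborhoods that share a postal code into one row.
    return (
        out.groupby(["PostalCode", "Borough"])["Neighborhood"]
        .apply(", ".join)
        .reset_index()
    )


cleaned = clean_neighborhoods(sample)
print(cleaned)
```

With these sample rows, the two M5A entries collapse into a single row and the M7A row takes its borough name as its neighborhood.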