## Segmenting and Clustering Neighbourhoods in Toronto

The project includes scraping the Wikipedia page for the postal codes of Canada and then process and clean the data for the clustering. The clustering is carried out by K Means and the clusters are plotted using the Folium Library. The Boroughs containing the name 'Toronto' in it are first plotted and then clustered and plotted again.


## This notebook contain design for 3 task: web scraping, cleaning and clustering

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Web scraping</a>
    

2.  <a href="#item2">Data preprocessing and cleaning</a>
    

3.  <a href="#item3">Clustering and the plotting</a>
    </font>
    </div>

Before we get the data and start exploring it, let's download all the dependencies that we will need.


In [5]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>


## 1. Web scraping


BeautifulSoup Library of Python is used for web scraping of table from the Wikipedia. The title of the webpage is printed to check if the page has been scraped successfully or not. Then the table of postal codes of Canada is printed.

In [134]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
table_contents=[]
table=soup.find('table')

#### The html table is converted to Pandas DataFrame for cleaning and preprocessing.


In [135]:
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

In [136]:
# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

## 2. Data preprocessing and cleaning


In [156]:
# combine multiple neighborhoods with the same post code
df2 = pd.DataFrame({'Postcode':df.PostalCode.unique()})
df2['Borough']=pd.DataFrame(list(set(df['Borough'].loc[df['PostalCode'] == x['Postcode']])) for i, x in df2.iterrows())
df2['Neighborhood']=pd.Series(list(set(df['Neighborhood'].loc[df['PostalCode'] == x['Postcode']])) for i, x in df2.iterrows())

In [157]:
# Shape of data frame
df2.shape

(103, 3)

#### Importing the csv file conatining the latitudes and longitudes for various neighbourhoods in Canada


In [161]:
#add Geo-spatial data
dfll= pd.read_csv("https://cocl.us/Geospatial_data")
dfll.rename(columns={'Postal Code':'Postcode'}, inplace=True)
dfll.set_index("Postcode")
df2.set_index("Postcode")

AttributeError: 'NoneType' object has no attribute 'items'

                        Borough  \
Postcode                          
M3A                  North York   
M4A                  North York   
M5A            Downtown Toronto   
M6A                  North York   
M7A                Queen's Park   
...                         ...   
M8X                   Etobicoke   
M4Y            Downtown Toronto   
M7Y       East Toronto Business   
M8Y                   Etobicoke   
M8Z                   Etobicoke   

                                               Neighborhood  
Postcode                                                     
M3A                                             [Parkwoods]  
M4A                                      [Victoria Village]  
M5A                             [Regent Park, Harbourfront]  
M6A                      [Lawrence Manor, Lawrence Heights]  
M7A                         [Ontario Provincial Government]  
...                                                     ...  
M8X         [The Kingsway, Montgomery Road, Old 

#### Merging the two tables for getting the Latitudes and Longitudes for various neighbourhoods in Canada

In [163]:
df3=pd.merge(df2, dfll)
df3.head()

AttributeError: 'NoneType' object has no attribute 'items'

  Postcode           Borough                        Neighborhood   Latitude  \
0      M3A        North York                         [Parkwoods]  43.753259   
1      M4A        North York                  [Victoria Village]  43.725882   
2      M5A  Downtown Toronto         [Regent Park, Harbourfront]  43.654260   
3      M6A        North York  [Lawrence Manor, Lawrence Heights]  43.718518   
4      M7A      Queen's Park     [Ontario Provincial Government]  43.662301   

   Longitude  
0 -79.329656  
1 -79.315572  
2 -79.360636  
3 -79.464763  
4 -79.389494  

## 3. Clustering and the plotting


Getting all the rows from the data frame which contains Toronto in their Borough.

In [164]:
df4 = df3[df3['Borough'].str.contains('Toronto',regex=False)]
df4

AttributeError: 'NoneType' object has no attribute 'items'

    Postcode                 Borough  \
2        M5A        Downtown Toronto   
9        M5B        Downtown Toronto   
15       M5C        Downtown Toronto   
19       M4E            East Toronto   
20       M5E        Downtown Toronto   
24       M5G        Downtown Toronto   
25       M6G        Downtown Toronto   
30       M5H        Downtown Toronto   
31       M6H            West Toronto   
35       M4J  East York/East Toronto   
36       M5J        Downtown Toronto   
37       M6J            West Toronto   
41       M4K            East Toronto   
42       M5K        Downtown Toronto   
43       M6K            West Toronto   
47       M4L            East Toronto   
48       M5L        Downtown Toronto   
54       M4M            East Toronto   
61       M4N         Central Toronto   
62       M5N         Central Toronto   
67       M4P         Central Toronto   
68       M5P         Central Toronto   
69       M6P            West Toronto   
73       M4R         Central Toronto   


#### Visualizing all the Neighbourhoods of the above data frame using Folium

In [167]:
map_toronto = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

for lat,lng,borough,neighborhood in zip(df4['Latitude'],df4['Longitude'],df4['Borough'],df4['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lat,lng],
    radius=5,
    popup=label,
    color='blue',
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.7,
    parse_html=False).add_to(map_toronto)
map_toronto

#### Using KMeans clustering for the clsutering of the neighbourhoods


In [169]:
k=5
toronto_clustering = df4.drop(['Postcode','Borough','Neighborhood'],1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
kmeans.labels_
df4.insert(0, 'Cluster Labels', kmeans.labels_)

In [170]:
df4

AttributeError: 'NoneType' object has no attribute 'items'

     Cluster Labels Postcode                 Borough  \
2                 0      M5A        Downtown Toronto   
9                 0      M5B        Downtown Toronto   
15                0      M5C        Downtown Toronto   
19                3      M4E            East Toronto   
20                0      M5E        Downtown Toronto   
24                0      M5G        Downtown Toronto   
25                2      M6G        Downtown Toronto   
30                0      M5H        Downtown Toronto   
31                4      M6H            West Toronto   
35                3      M4J  East York/East Toronto   
36                0      M5J        Downtown Toronto   
37                2      M6J            West Toronto   
41                3      M4K            East Toronto   
42                0      M5K        Downtown Toronto   
43                2      M6K            West Toronto   
47                3      M4L            East Toronto   
48                0      M5L        Downtown Tor

In [172]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(df4['Latitude'], df4['Longitude'], df4['Neighborhood'], df4['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters