# Applied Data Science Capstone

### Segmenting and Clustering Neighborhoods in Toronto

This notebook explores, segments, and clusters the neighborhoods of the city of Toronto.

**Note:** All three tasks (*web scraping, cleaning, and clustering*) are implemented in this one notebook.

#### Installation of all required libraries and packages:

In [2]:
!pip install beautifulsoup4
!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 

from IPython.display import display_html
    
# function for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported!')

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/55/6f/c87dffdd88a54dd26a3a9fef1d14b6384a9933c455c54ce3ca7d64a84c88/lxml-4.5.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.1
Col

### Scraping data from the Wikipedia page:

In [3]:
# Get the html source
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')
tab = str(soup.table)

# Convert table to pandas dataframe
df = pd.read_html(tab)[0]

# Drop rows whose Borough is 'Not assigned'
df = df[df.Borough != 'Not assigned']

# Combine neighborhoods that share a postal code and borough into one comma-separated row
df = df.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)

# If Neighborhood is not assigned, replace it with the Borough name.
# Assigning through df.loc updates df itself; rows yielded by iterrows() are copies,
# so writing to them leaves the dataframe unchanged.
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']

# Show first 5 rows of dataframe
df.head()


|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [4]:
# Shape of data frame
df.shape

(103, 3)
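
The three cleaning steps can be checked in isolation on a handful of rows. The sketch below is illustrative only: the rows in `raw` are made up for the example (they are not taken from the Wikipedia table), and it applies the borough fallback before grouping, which is one possible ordering.

```python
import pandas as pd

# Hypothetical rows that mimic the three columns of the scraped table.
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# 1. Drop rows that have no borough.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Where the neighborhood is missing, fall back to the borough name.
mask = clean['Neighborhood'] == 'Not assigned'
clean.loc[mask, 'Neighborhood'] = clean.loc[mask, 'Borough']

# 3. Collapse rows that share a postal code and borough into one row.
clean = (clean.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
              .agg(', '.join)
              .reset_index())

print(clean)
```

Selecting the `Neighborhood` column before `agg` makes it explicit which column is being joined.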

### Adding longitude and latitude to the data frame:

In [6]:
# Fetch longitude and latitude data from the csv file
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')

# Merge the two tables on their shared 'Postal Code' column
df = pd.merge(df,lat_lon,on='Postal Code')
df.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
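
`pd.merge` performs an inner join by default, so a postal code that is missing from the coordinates file would be dropped without any warning. A minimal sketch of one way to check for that, assuming a left join with `indicator=True`; the code `M9Z` is a made-up value used only to trigger a mismatch:

```python
import pandas as pd

# Two real rows from the tables above plus one hypothetical, unmatched code.
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Hypothetical Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every row of `codes`; the `_merge` column records
# whether a matching row was found in `coords`.
merged = codes.merge(coords, on='Postal Code', how='left', indicator=True)

# Postal codes with no coordinates end up as 'left_only'.
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list would indicate that the inner join used in the notebook loses no rows.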


### Exploring and clustering the neighborhoods of Toronto

##### The following section is limited to working with neighborhoods that contain "Toronto" in their Borough.

In [8]:
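# Limiting dataframe to neighborhoods with "Toronto" in their Borough
# (.copy() gives an independent dataframe, so columns can be added later without a SettingWithCopyWarning)
df_Toronto = df[df['Borough'].str.contains('Toronto',regex=False)].copy()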
df_Toronto

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 24 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 25 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 30 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 31 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |


##### Map of all the neighborhoods with "Toronto" in their Borough:

In [16]:
# base map centered on Toronto
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

# one circle marker per neighborhood, labeled "Neighborhood, Borough"
for lat, lng, borough, neighborhood in zip(df_Toronto['Latitude'], df_Toronto['Longitude'],
                                           df_Toronto['Borough'], df_Toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3195cc',
        fill_opacity=0.6,
        parse_html=False).add_to(map_toronto)
map_toronto

#### Using k-means clustering for the clustering of the Toronto neighborhoods:

In [17]:
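k = 4  # number of clusters

# Cluster on latitude and longitude only
Tor_clust = df_Toronto.drop(['Postal Code','Borough','Neighborhood'], axis=1)
kmeans = KMeans(n_clusters=k, random_state=0).fit(Tor_clust)

# Attach each neighborhood's cluster label as the first column
df_Toronto.insert(0, 'Cluster Labels', kmeans.labels_)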
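
To make clear what the clustering step does with the two coordinate columns, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm). It is not scikit-learn's implementation, which uses a smarter initialization and several restarts. The names `simple_kmeans` and `squared_distance` are made up for this illustration, and the six points are coordinates from the table above. Like the notebook, it treats latitude and longitude as plain Euclidean coordinates, which is an approximation.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def simple_kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: basic k-means on a list of 2-D tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point gets the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids

# (Latitude, Longitude) pairs taken from the Toronto table above.
points = [
    (43.65426, -79.360636),   # Regent Park, Harbourfront
    (43.662301, -79.389494),  # Queen's Park, Ontario Provincial Government
    (43.657162, -79.378937),  # Garden District, Ryerson
    (43.676357, -79.293031),  # The Beaches
    (43.669542, -79.422564),  # Christie
    (43.669005, -79.442259),  # Dufferin, Dovercourt Village
]
k = 2
labels, centroids = simple_kmeans(points, k)
print(labels)
print(centroids)
```

The labels depend on the random starting centroids, which is why the notebook fixes `random_state` when calling `KMeans`.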

#### Map of the Toronto neighborhood clusters:

In [26]:
# map generation
cluster_map = folium.Map(location=[43.651070,-79.347015],zoom_start=11)

# set colors for the different clusters: one rainbow color per cluster
color_range = cm.rainbow(np.linspace(0, 1, k))
palette = [colors.rgb2hex(i) for i in color_range]

# add markers to the map (Borough is included so each popup shows its own borough)
for lat, lon, borough, neighborhood, cluster in zip(df_Toronto['Latitude'], df_Toronto['Longitude'],
                                                    df_Toronto['Borough'], df_Toronto['Neighborhood'],
                                                    df_Toronto['Cluster Labels']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=4,
        popup=label,
        color=palette[cluster],
        fill=True,
        fill_color=palette[cluster],
        fill_opacity=0.6).add_to(cluster_map)
    
# show map   
cluster_map
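
The color lookup can be tried out without drawing a map. A small sketch, assuming `k = 4` as in the clustering cell: k-means labels run from 0 to k-1, so a label can index the palette directly.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 4  # assumed to match the number of clusters used above

# Sample k evenly spaced colors from the rainbow colormap and convert to hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

# Labels 0..k-1 map one-to-one onto palette entries.
for label in range(k):
    print(label, palette[label])
```

Subtracting 1 from the label would still run, because Python wraps index -1 around to the last entry, but cluster 0 would then take the last color in the palette.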