# Clustering and Segmentation in the city of Toronto, Canada

### The aim of this project is to cluster the neighborhoods of Toronto based on their location data. Because the dataset is not readily available, web scraping is used to load the required data into a pandas dataframe for preprocessing and analysis. The steps carried out are outlined below, along with the relevant theory.

#### A) Importing and installing the required libraries

###### We will need the following libraries:
##### 1) Folium - to display maps
##### 2) geopy.geocoders.Nominatim - to geocode addresses
##### 3) requests - to fetch data from the Foursquare API
##### 4) pandas.json_normalize - to flatten JSON objects into a dataframe
##### 5) Beautiful Soup 4 - to scrape data from HTML and XML pages
##### 6) urllib.request - to fetch the HTML page

In [1]:
#installing  bs4
!conda install -c conda-forge bs4 --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - bs4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.9.1       |   py36h9f0ad1d_0         163 KB  conda-forge
    bs4-4.9.1                  |                0           4 KB  conda-forge
    soupsieve-2.0.1            |   py36h9f0ad1d_0          56 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         223 KB

The following NEW packages will be INSTALLED:

  beautifulsoup4     conda-forge/linux-64::beautifulsoup4-4.9.1-py36h9f0ad1d_0
  bs4                conda-forge/noarch::bs4-4.9.1-0
  soupsieve          conda-forge/linux-64::soupsieve-2.0.1-py36h9f0ad1d_0



Downloading and Extracting Packag

In [2]:
#install folium
!conda install -c conda-forge folium --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.4.1               |             py_0          26 KB  conda-forge
    brotlipy-0.7.0             |py36h8c4c3a4_1000         346 KB  conda-forge
    chardet-3.0.4              |py36h9f0ad1d_1006         188 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  conda-forge
    folium-0.11.0              |             py_0          61 KB  conda-forge
    pysocks-1.7.1              |   py36h9f0ad1d_1          27 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.2 MB

The following NEW packages will be INSTALLED:

  bra

In [3]:
import pandas as pd
import numpy as np
import requests
import json
from pandas.io.json import json_normalize
!conda install -c conda-forge geopy --yes
print("Geopy installed")
print("Libraries imported")

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          97 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.22.0-pyh9f0ad1d_0



Downloading and Extracting Packages
geopy-1.22.0         | 63 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ###############################

### B) Web scraping the dataset from Wikipedia

In [4]:
import urllib.request
import bs4 as bs

#### Getting the Toronto dataset using bs4

In [5]:
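sauce = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
# name the parser explicitly so the result does not depend on which parsers are installed
soup = bs.BeautifulSoup(sauce, "html.parser")
table = soup.find("table")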

In [6]:
column_names = ["PostalCode","Borough","Neighborhood"]


In [7]:
#iterating through the table rows to fetch the cell data
table_rows = table.find_all("tr")
dataList=[]
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text for i in td]
    dataList.append(row)

In [8]:
toronto_df = pd.DataFrame(dataList, columns = column_names)

##### Replacing the new line escape character in the dataset

In [9]:
toronto_df["PostalCode"] = toronto_df["PostalCode"].str.replace("\n","")
toronto_df["Borough"] = toronto_df["Borough"].str.replace("\n","")
toronto_df["Neighborhood"] = toronto_df["Neighborhood"].str.replace("\n","")

In [10]:
toronto_df.drop(0,axis=0,inplace=True)

In [11]:
toronto_df.reset_index(inplace=True)
toronto_df

Unnamed: 0,index,PostalCode,Borough,Neighborhood
0,1,M1A,Not assigned,Not assigned
1,2,M2A,Not assigned,Not assigned
2,3,M3A,North York,Parkwoods
3,4,M4A,North York,Victoria Village
4,5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...,...
175,176,M5Z,Not assigned,Not assigned
176,177,M6Z,Not assigned,Not assigned
177,178,M7Z,Not assigned,Not assigned
178,179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [12]:

toronto_df.drop("index",axis=1,inplace=True)

In [13]:
toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


### C) Preprocessing the data

In [14]:
#removing entries where the borough is undefined
toronto_df = toronto_df[toronto_df["Borough"] != "Not assigned"]

In [15]:
toronto_df.reset_index()

Unnamed: 0,index,PostalCode,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [16]:
toronto_df.reset_index(drop=False,inplace=True)

In [17]:
toronto_df.drop("index",axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
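
The warning above typically appears when a dataframe obtained by boolean filtering (as in cell 14) is later modified in place. One common way to avoid it is to take an explicit copy right after filtering. The sketch below uses a small illustrative frame; `raw_df` and `assigned_df` are hypothetical names, not variables from this notebook.

```python
import pandas as pd

# Illustrative rows only; the column names follow toronto_df.
raw_df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# .copy() makes the filtered result an independent dataframe, so later
# in-place operations no longer act on something pandas treats as a slice.
assigned_df = raw_df[raw_df["Borough"] != "Not assigned"].copy()

# drop=True discards the old index instead of adding it as an "index" column,
# which would replace the separate reset_index / drop("index") steps.
assigned_df.reset_index(drop=True, inplace=True)
print(assigned_df)
```

With this pattern the original frame is left untouched and the filtered frame gets a clean 0-based index in one step.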


In [18]:
toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


###### More than one neighborhood can exist in one postal code area. For example, 
###### in the table on the Wikipedia page, you will notice that M5A is listed twice and has two 
###### neighborhoods: Harbourfront and Regent Park. 
###### These two rows will be combined into one row with the neighborhoods separated with a comma.

###### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
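
The cells that apply these two rules are not shown in this notebook. A minimal sketch of how they could be implemented with pandas follows, assuming a dataframe with the same three columns. The sample rows and the names `sample_df` and `combined_df` are illustrative and are not taken from the scraped table.

```python
import pandas as pd

# Illustrative rows only; the column names match toronto_df above.
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Not assigned"],
})

# A "Not assigned" neighborhood takes the name of its borough.
not_assigned = sample_df["Neighborhood"] == "Not assigned"
sample_df.loc[not_assigned, "Neighborhood"] = sample_df.loc[not_assigned, "Borough"]

# Rows that share a postal code collapse into one comma-separated row.
combined_df = (
    sample_df.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
print(combined_df)
```

Filling the missing neighborhoods before grouping keeps the literal string "Not assigned" out of the joined names.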

In [21]:
#The shape (rows, columns) of the toronto dataframe
toronto_df.shape

(103, 3)

### D) Geocoding the dataset on the basis of postal codes

In [22]:
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [23]:
geo_coord = pd.read_csv("Geospatial_Coordinates.csv")
geo_coord.columns = ["PostalCode", "Latitude", "Longitude"]
geo_coord

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [24]:
toronto_df = pd.merge(toronto_df,geo_coord,on="PostalCode")
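
`pd.merge` performs an inner join by default, so any postal code without a match in the coordinates file would be dropped silently. The sketch below shows one way to check for that, using a left join with the `indicator` and `validate` options. The two small frames are illustrative: the coordinates for M5A are left out on purpose to show a miss, and `neighborhoods`, `coords`, `merged` and `missing` are hypothetical names.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
# M5A is deliberately absent here to demonstrate an unmatched postal code.
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighborhood row; validate raises if a postal code
# is duplicated on either side; indicator adds a "_merge" column per row.
merged = pd.merge(neighborhoods, coords, on="PostalCode", how="left",
                  validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

If `missing` is empty, the default inner join and the left join give the same rows, and the `_merge` column can be dropped.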

In [66]:
toronto_df.head(6)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242


### E) Exploring the first neighborhood in Toronto - Parkwoods

In [26]:
# defining the foursquare api keys
CLIENT_ID = 'QVWGG5LGOZVOBV5YZBX2ZKR5ACBGHVIM5OVHJKAHGLPEPTV1' # your Foursquare ID
CLIENT_SECRET = '3U14W2WYWO0T3D1NETWMPOC3CPTBHZJRLFS14BM1JWP5XW4Y'
VERSION = '20180605' # Foursquare API version
RADIUS = 500
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: QVWGG5LGOZVOBV5YZBX2ZKR5ACBGHVIM5OVHJKAHGLPEPTV1
CLIENT_SECRET:3U14W2WYWO0T3D1NETWMPOC3CPTBHZJRLFS14BM1JWP5XW4Y
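
Hardcoding and printing API keys makes them visible to anyone who reads the notebook. A possible alternative, sketched here under assumptions, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are hypothetical, chosen for this example only.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Set {} before running the notebook".format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Example with a stand-in mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "abcd1234", "FOURSQUARE_CLIENT_SECRET": "wxyz5678"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(client_id))
```

In the notebook itself the function would be called without arguments so that it reads from `os.environ`.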


In [30]:
url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    toronto_df["Latitude"][0],
    toronto_df["Longitude"][0],
    RADIUS,
    LIMIT
)
print(url)

https://api.foursquare.com/v2/venues/explore?client_id=QVWGG5LGOZVOBV5YZBX2ZKR5ACBGHVIM5OVHJKAHGLPEPTV1&client_secret=3U14W2WYWO0T3D1NETWMPOC3CPTBHZJRLFS14BM1JWP5XW4Y&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100
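
The same URL can also be assembled from a dictionary of parameters, which keeps each value next to its name and handles escaping. The sketch below uses `urllib.parse.urlencode` from the standard library; `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build a Foursquare v2 venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; the coordinates are those of Parkwoods printed above.
demo_url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.7532586, -79.3296565)
print(demo_url)
```

The only visible difference from the hand-formatted URL is that the comma in `ll` is percent-encoded as `%2C`.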


In [31]:
#use the requests module to call the url and parse the JSON response
results = requests.get(url).json()
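
Indexing straight into `results['response']['groups'][0]['items']` raises an error when the API returns no groups, for example on an invalid request. A small defensive helper could look like the sketch below. `extract_venue_items` is a hypothetical name, and the shape of `error_response` is an assumption made for illustration rather than a documented Foursquare payload.

```python
def extract_venue_items(results):
    """Return the list of venue items from an explore response, or [] if absent."""
    groups = results.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])

# A response shaped like the one used in this notebook, and an assumed error shape.
ok_response = {"response": {"groups": [{"items": [{"venue": {"name": "Brookbanks Park"}}]}]}}
error_response = {"meta": {"code": 400}, "response": {}}

print(len(extract_venue_items(ok_response)))
print(extract_venue_items(error_response))
```

Returning an empty list lets the calling code skip a neighborhood without venues instead of stopping with a `KeyError` or `IndexError`.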

In [34]:
venues = results['response']['groups'][0]['items']
#convert the json to a dataframe using pandas json_normalize
venues_df = json_normalize(venues)

In [35]:
venues_df

Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.address,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,venue.location.distance,venue.location.cc,venue.location.city,venue.location.state,venue.location.country,venue.location.formattedAddress,venue.categories,venue.photos.count,venue.photos.groups
0,e-0-4e8d9dcdd5fbbbb6b3003c7b-0,0,"[{'summary': 'This spot is popular', 'type': '...",4e8d9dcdd5fbbbb6b3003c7b,Brookbanks Park,Toronto,43.751976,-79.33214,"[{'label': 'display', 'lat': 43.75197604605557...",245,CA,Toronto,ON,Canada,"[Toronto, Toronto ON, Canada]","[{'id': '4bf58dd8d48988d163941735', 'name': 'P...",0,[]
1,e-0-4cb11e2075ebb60cd1c4caad-1,0,"[{'summary': 'This spot is popular', 'type': '...",4cb11e2075ebb60cd1c4caad,Variety Store,29 Valley Woods Road,43.751974,-79.333114,"[{'label': 'display', 'lat': 43.75197441585782...",312,CA,Toronto,ON,Canada,"[29 Valley Woods Road, Toronto ON, Canada]","[{'id': '4bf58dd8d48988d1f9941735', 'name': 'F...",0,[]


In [36]:
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
venues_df =venues_df.loc[:, filtered_columns]

In [37]:
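#function to get the category type
def get_category_type(row):
    # the category list is stored under "categories" or "venue.categories",
    # depending on how the row was built
    try:
        categories = row["categories"]
    except KeyError:
        categories = row["venue.categories"]
    # return the name of the first category, or None if the venue has none
    if len(categories) != 0:
        return categories[0]["name"]
    return None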

In [39]:
#call the above function to get the categories
venues_df["venue.categories"] = venues_df.apply(get_category_type,axis=1)

In [40]:
#rename the columns, keeping only the part after the last dot
venues_df.columns = [col.split(".")[-1] for col in venues_df.columns]
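
The renaming step keeps only the text after the last dot of each column name. The sketch below illustrates this on the four columns kept above, and shows that the same rule could produce duplicate names if it were applied to other columns of the raw response. `shorten` is a hypothetical helper written for this example.

```python
from collections import Counter

def shorten(columns):
    """Keep only the part of each dotted column name after the last dot."""
    return [col.split(".")[-1] for col in columns]

kept = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
print(shorten(kept))

# Two columns from the unfiltered response that would both be renamed to "count".
clashing = ["reasons.count", "venue.photos.count"]
duplicates = [name for name, n in Counter(shorten(clashing)).items() if n > 1]
print(duplicates)
```

Filtering the columns before renaming, as done in this notebook, avoids such clashes.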

In [41]:
#display the final df
venues_df.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114


In [49]:
# visualizing the neighborhood and its venues on the folium map
import folium
parkwoods_map = folium.Map(location=[toronto_df["Latitude"][0],toronto_df["Longitude"][0]],zoom_start=16)

for lat,lng,name,category in zip(venues_df["lat"],venues_df["lng"],venues_df["name"],venues_df["categories"]):
    label = folium.Popup(str(name) +","+str(category))
    folium.CircleMarker(
        [lat,lng],
        popup = label,
        radius = 5,
        fill = True,
        fill_color = "red",
        fill_opacity = 0.7
    ).add_to(parkwoods_map)

In [51]:
#display the parkwoods map
parkwoods_map

### F) Analyzing all the neighborhoods in Toronto.

#### Next, we fetch the venues for the neighborhoods in Toronto and collect them in a venues dataframe. The function below retrieves the venues near each neighborhood passed to it.

In [95]:
def getNearbyVenues(neighborhoods,latitudes,longitudes):
    #collect one dataframe of venues per neighborhood, then combine them
    frames = []
    for neighborhood, lat, lng in zip(neighborhoods,latitudes,longitudes):

        print(neighborhood)
        #define the url
        url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            RADIUS,
            LIMIT
        )

        results = requests.get(url).json()
        venues = results['response']['groups'][0]['items']
        #skip neighborhoods for which no venues were returned
        if not venues:
            continue
        temp_df = json_normalize(venues)
        temp_df["venue.categories"] = temp_df.apply(get_category_type,axis=1)

        #keep only the columns of interest and give them short names
        df2 = temp_df[["venue.name","venue.categories","venue.location.lat","venue.location.lng"]].copy()
        df2.columns = ["name","categories","lat","lng"]
        frames.append(df2)

    #return the venues of all neighborhoods, not only those of the last one
    if not frames:
        return pd.DataFrame(columns=["name","categories","lat","lng"])
    return pd.concat(frames, ignore_index=True)

In [102]:
# These parameters can be set to the neighborhood whose venues we wish to check
toronto_venues = getNearbyVenues(neighborhoods= [toronto_df["Neighborhood"][10]],
                           latitudes = [toronto_df["Latitude"][10]],
                           longitudes = [toronto_df["Longitude"][10]])

Glencairn




In [103]:
toronto_venues

Unnamed: 0,name,categories,lat,lng
0,Miyako Sushi Restaurant,Japanese Restaurant,43.709111,-79.44393
1,Li Cheng Restaurant,Asian Restaurant,43.708828,-79.443366
2,"Chalker's Pub, Billiards and Bistro",Pub,43.705747,-79.442378
3,Pizza Nova,Pizza Place,43.707504,-79.443141
4,Domino's Pizza,Pizza Place,43.70717,-79.442658


### G) Clustering the neighborhoods in the city of Toronto, Canada

In [111]:
toronto_grouped_clustering = toronto_df.drop(["Neighborhood","PostalCode", "Borough"],axis=1)

In [112]:
toronto_grouped_clustering

Unnamed: 0,Latitude,Longitude
0,43.753259,-79.329656
1,43.725882,-79.315572
2,43.654260,-79.360636
3,43.718518,-79.464763
4,43.662301,-79.389494
...,...,...
98,43.653654,-79.506944
99,43.665860,-79.383160
100,43.662744,-79.321558
101,43.636258,-79.498509


In [108]:
#import the sklearn cluster module KMeans
from sklearn.cluster import KMeans

In [113]:
#initialize the kmeans clustering algorithm and fit the training data
kmeans = KMeans(n_clusters=4,random_state=42).fit(toronto_grouped_clustering)

In [114]:
#add the cluster labels to the original dataframe
toronto_df["Cluster Labels"] = kmeans.labels_
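
Once the labels are attached, a quick way to inspect the result is to count the neighborhoods in each cluster and compute the mean coordinates per cluster. The sketch below uses only the five rows shown in the output that follows, so the numbers it prints describe that sample and not the full dataset. `labelled` and `summary` are hypothetical names.

```python
import pandas as pd

# First five rows of toronto_df (coordinates and cluster labels only).
labelled = pd.DataFrame({
    "Latitude": [43.753259, 43.725882, 43.654260, 43.718518, 43.662301],
    "Longitude": [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
    "Cluster Labels": [1, 1, 3, 0, 3],
})

# One row per cluster: number of members and mean position of the members.
summary = labelled.groupby("Cluster Labels").agg(
    size=("Latitude", "size"),
    mean_lat=("Latitude", "mean"),
    mean_lng=("Longitude", "mean"),
)
print(summary)
```

On the full dataframe the per-cluster means should be close to the centroids stored in `kmeans.cluster_centers_`.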

In [116]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels
0,M3A,North York,Parkwoods,43.753259,-79.329656,1
1,M4A,North York,Victoria Village,43.725882,-79.315572,1
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3


### H) Visualizing the Clusters

In [122]:
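#Getting the coordinates of Toronto
from geopy.geocoders import Nominatim
address = "toronto, Ontario, Canada"
# geopy expects a custom user_agent for Nominatim; the value here is an arbitrary application name
geocoder = Nominatim(user_agent="toronto_explorer")
location = geocoder.geocode(address)
toronto_latitude = location.latitude
toronto_longitude = location.longitude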

In [124]:
from folium import plugins
toronto_map = folium.Map(location = [toronto_latitude,toronto_longitude],zoom_start = 13)
cluster_neighborhoods = plugins.MarkerCluster().add_to(toronto_map)
for lat,lng,label in zip(toronto_df["Latitude"],toronto_df["Longitude"],toronto_df["Neighborhood"]):
    folium.Marker(location = [lat,lng], icon=None,popup=label).add_to(cluster_neighborhoods)
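
Note that `MarkerCluster` groups markers by their on-screen proximity at the current zoom level and does not use the K-means labels. One possible way to show the K-means result is to color each marker by its cluster label. The sketch below only generates the colors; `cluster_colors` is a hypothetical helper, and the resulting strings could be passed as `color` and `fill_color` to `folium.CircleMarker` in the same way as in section E.

```python
import colorsys

def cluster_colors(n_clusters):
    """Return n_clusters distinct hex colors, one per cluster label."""
    colors = []
    for k in range(n_clusters):
        # spread the hues evenly around the color wheel
        r, g, b = colorsys.hsv_to_rgb(k / n_clusters, 0.8, 0.9)
        colors.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return colors

palette = cluster_colors(4)  # 4 matches n_clusters used for KMeans above
print(palette)
```

A marker for a row with label `k` would then use `palette[k]` as its color.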

In [125]:
toronto_map