
# Segmenting and Clustering Neighborhoods in Toronto

Project: Ari Bimo Prakoso

# Table of Contents

### Part 1: Getting Dataframe about Postal Code, Borough, and Neighborhoods
* Import libraries.
* Save the link of postal codes.
* Extract information from the website with BeautifulSoup.
* Preprocess data into dataframe.
* Process cells with defined borough.
* Combine multiple neighborhoods.
* Assign borough to 'Not assigned' neighborhood.


### Part 2: Getting Dataframe with Latitude and Longitude Data
* Get the coordinates data.
* Get the latitude and longitude data, put it into dataframe by matching postal code.
* Dataframe ready for clustering.

### Part 3: Neighborhood Clustering

* Render Toronto map and each postal code.
* Cluster based on venues of interest.
* Explore first neighborhood.
* Explore all neighborhoods.
* Get data of venues in Toronto.
* Analyze neighborhoods.
* Use machine learning to cluster the neighborhood.
* Plot the cluster on Folium map.
* Examine clusters.

# Part 1: Getting Dataframe about Postal Code, Borough, and Neighborhoods

### Import Libraries

In [1]:
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

!conda install -c conda-forge geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests
from pandas.io.json import json_normalize  # transform JSON file into dataframe

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

# Install folium if necessary. The note sits on its own line because a trailing
# "# ..." after a "!" shell command is passed to conda as an argument.
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

!conda install -c anaconda beautifulsoup4
!conda install -c anaconda lxml
from lxml import etree
from bs4 import BeautifulSoup

print("Libraries imported.")



### Save the link of postal codes

In [2]:
url_toronto = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
source = requests.get(url_toronto).text

## Extract information from the website

### Create BeautifulSoup instance from the website to parse it

In [4]:
soup = BeautifulSoup(source,'lxml')
#print(soup.prettify())

In [5]:
My_table = (soup.find('table',{'class':'wikitable sortable'}))
#My_table

In [6]:
table_entry_list = My_table.findAll('td')
#table_entry_list 

### Define functions to simplify preprocessing

In [7]:
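# This function removes the <td> and </td> tags from the string.
# Note: str.strip() removes a set of characters, not a substring, so this
# only works while the visible cell text does not begin with 't' or 'd'.
def strip_td(entry_i):
    stringx = str(entry_i)
    stringx = stringx.strip('<td>')
    stringx = stringx.strip('</')
    return stringx

# This function extracts the link text from a cell that contains an <a href=...> tag
def get_name_from_ahref(stringx):
    try:
        if stringx != 'Not assigned':
            stringx = stringx.split('>')
            stringx = stringx[1]
            stringx = stringx.split('<')
            stringx = stringx[0]
    except IndexError:
        # no link in the cell: stringx is now a one-element list,
        # which get_name_from_residual_list unwraps
        pass
    return stringx

# This function unwraps the one-element list left over when a cell contains no link
def get_name_from_residual_list(stringx):
    if isinstance(stringx, list):
        stringx = stringx[0]
    return stringx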
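
A note on the tag stripping above: `str.strip('<td>')` removes any of the characters `<`, `t`, `d`, `>` from both ends of the string, not the literal substring `<td>`. The sketch below illustrates the difference on made-up cell strings. `strip_tags` is a hypothetical helper that is not part of this notebook, and the regular expression is only one possible way to drop tags.

```python
import re

def strip_tags(html_fragment):
    """Remove every <...> tag from a fragment and trim surrounding whitespace."""
    return re.sub(r'<[^>]+>', '', html_fragment).strip()

# Made-up cells in the same shape as the scraped table entries
cell_plain = '<td>M3A</td>'
cell_link = '<td><a href="/wiki/North_York" title="North York">North York</a></td>'
cell_edge = '<td>downtown</td>'  # hypothetical value that starts with 'd'

print(strip_tags(cell_plain))
print(strip_tags(cell_link))

# Character-set stripping also eats the leading 'd' of the text itself
eaten = cell_edge.strip('<td>').strip('</')
print(eaten)
print(strip_tags(cell_edge))
```

The capitalized names in this table avoid the problem, which is why the simpler functions are sufficient here.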

### Extract and save the BeautifulSoup result to three lists

In [8]:
list_PostalCode = []
list_Borough = []
list_Neighborhood = []

for i in range(len(table_entry_list)):
    if i%3 == 0:
        a = strip_td(table_entry_list[i])
        list_PostalCode.append(a)
    elif i%3 == 1:
        b = strip_td(table_entry_list[i])
        b = get_name_from_ahref(b)
        b = get_name_from_residual_list(b)
        list_Borough.append(b)
    else:
        c = strip_td(table_entry_list[i])
        c = get_name_from_ahref(c)
        c = get_name_from_residual_list(c)
        c = c.strip('\n')
        list_Neighborhood.append(c)
         
# a few examples
print(list_PostalCode[0:10])
print(list_Borough[0:10])
print(list_Neighborhood[0:10])

['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M6A', 'M7A', 'M8A', 'M9A']
['Not assigned', 'Not assigned', 'North York', 'North York', 'Downtown Toronto', 'North York', 'North York', 'Downtown Toronto', 'Not assigned', "Queen's Park"]
['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village', 'Harbourfront', 'Lawrence Heights', 'Lawrence Manor', "Queen's Park", 'Not assigned', 'Not assigned']
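
Because the table cells arrive as one flat list in the order postal code, borough, neighborhood, the `i%3` bookkeeping can also be expressed with slice steps. This is a minimal sketch on a hand-written list. `cells` is a stand-in for the already cleaned strings, and its values are copied from the example output above.

```python
# Flat list of cleaned cell values: code, borough, neighborhood, code, ...
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods',
         'M4A', 'North York', 'Victoria Village']

# Every third element, starting at offsets 0, 1 and 2
postal_codes = cells[0::3]
boroughs = cells[1::3]
neighborhoods = cells[2::3]

# Re-assemble the rows to check that the three lists line up
rows = list(zip(postal_codes, boroughs, neighborhoods))
for row in rows:
    print(row)
```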


## Combine the lists into a dataframe

In [9]:
# Build the dataframe directly from the three lists
df = pd.DataFrame({
    'PostalCode': list_PostalCode,
    'Borough': list_Borough,
    'Neighborhood': list_Neighborhood,
})
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Only process the cells that have an assigned borough. Ignore cells with a borough that is 'Not assigned'.

In [10]:
df_x = df[df['Borough']!='Not assigned']
df_x = df_x.reset_index(drop=True)
df_x.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


### Combine entries that share a postal code

The assignment instructions state:
"More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table."

When I retrieved this URL (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), there was only one entry for M5A. This can easily be verified by searching for "M5A" on the page in a web browser. I also checked in the dataframe which postal codes do have multiple entries.

For the rows that do share a postal code, I followed the instructions and combined the neighborhoods as described.

In [11]:
# Find the postal codes that have multiple entries
df_temp = df_x['PostalCode'].value_counts()
print(df_temp.head(20))

print('')
print("We can locate and count the number for each entry")
print('number of M5A entry:', df_temp['M5A'])
print('number of M5R entry:', df_temp['M5R'])
print('number of M5J entry:', df_temp['M5J'], 'and so on')

M9V    8
M8Y    8
M5V    7
M4V    5
M9B    5
M8Z    5
M9R    4
M1V    4
M9C    4
M6M    4
M6L    3
M1C    3
M1P    3
M1E    3
M1M    3
M5T    3
M8V    3
M1K    3
M1L    3
M5R    3
Name: PostalCode, dtype: int64

We can locate and count the number for each entry
number of M5A entry: 1
number of M5R entry: 3
number of M5J entry: 3 and so on


In [12]:
# Re-check df_x
df_x.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


In [13]:
df_x['Neighborhood'] = df_x['Neighborhood']+", "
df_x.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,"Parkwoods,"
1,M4A,North York,"Victoria Village,"
2,M5A,Downtown Toronto,"Harbourfront,"
3,M6A,North York,"Lawrence Heights,"
4,M6A,North York,"Lawrence Manor,"


In [14]:
df_x[df_x['Neighborhood'] == 'Not assigned, '].sum()

PostalCode                 M9A
Borough           Queen's Park
Neighborhood    Not assigned, 
dtype: object

In [15]:
# Combine the neighborhoods of each postal code: .sum() concatenates the strings. Then reset the index.
df_y = df_x.groupby(['PostalCode','Borough']).sum().reset_index()
df_y.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern,"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union,"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill,"
3,M1G,Scarborough,"Woburn,"
4,M1H,Scarborough,"Cedarbrae,"


In [16]:
# remove the trailing ", " (comma and space) left over from the concatenation
def my_cut_last_char_func(stringx):
    stringx = stringx[0:-2]
    return stringx

df_y['Neighborhood'] = df_y['Neighborhood'].apply(my_cut_last_char_func) 
df_y.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
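
The steps above (append `", "`, concatenate with `.sum()`, cut the last two characters) could also be collapsed into a single aggregation with `', '.join`. The following is only a sketch on a three-row sample, not the code that produced the tables in this notebook. The `sample` and `combined` names are made up for illustration.

```python
import pandas as pd

# Small sample in the same shape as df_x
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighborhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Join the neighborhood names of each (postal code, borough) group with ", "
combined = (sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
                  .agg(', '.join)
                  .reset_index())
print(combined)
```

With `join` there is no trailing separator to remove afterwards.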


### If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough. 

For example, for the 9th cell in the table on the Wikipedia page, the value of both the Borough and the Neighborhood columns will be Queen's Park.

In [17]:
df_y[df_y['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
93,M9A,Queen's Park,Not assigned


In [18]:
# Since there is only one, we will simply assign it:
df_y.loc[df_y['PostalCode']=='M9A','Neighborhood'] = df_y.loc[df_y['PostalCode']=='M9A','Borough']
df_y[df_y['PostalCode'] == 'M9A']

Unnamed: 0,PostalCode,Borough,Neighborhood
93,M9A,Queen's Park,Queen's Park
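
If more than one postal code had a 'Not assigned' neighborhood, the same rule could be applied to all of them at once with a boolean mask instead of naming M9A explicitly. This is a minimal sketch on a two-row sample. The `sample` dataframe is illustrative, with its values taken from the tables above.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M9A', 'M1B'],
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Rouge, Malvern'],
})

# Rows whose neighborhood is missing take the borough name instead
mask = sample['Neighborhood'] == 'Not assigned'
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']
print(sample)
```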


### Cleaning up the dataframe:

In [19]:
# The function below helps to show the same entries, in the same order, as the example in the assignment notebook
def postal_code_example(stringx):
    example_of_postal_code = {'M5G':1, 'M2H':2, 'M4B':3, 'M1J':4, 'M4G':5, 'M4M':6, 'M1R':7, 'M9V':8, 'M9L':9, 
                          'M5V':10, 'M1B':11, 'M5A':12}
    # postal codes that are not in the example get rank 13 and sort after it
    return example_of_postal_code.get(stringx, 13)

# We use temporary 'rank' column to help sort the dataframe
df_y['rank'] = df_y['PostalCode'].map(postal_code_example)
df_y = df_y.sort_values('rank')
df_y = df_y.drop('rank', axis = 'columns')
df_y = df_y.reset_index()
df_y = df_y[['PostalCode', 'Borough', 'Neighborhood']]
df_y.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


In [20]:
# save to final dataframe
df_final = df_y

# Final Dataframe

So here is the final dataframe:

In [21]:
print(df_final.shape)

(103, 3)


Hence, our final dataframe has 103 rows and 3 columns.

The top rows are shown in the following cell:

In [22]:
df_final.head(14)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


# Part 2: Getting the Dataframe with Coordinates

### Load the coordinates file

Here is a link to a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use pandas to read the CSV file from that URL:

In [24]:
coordinate_list = pd.read_csv('https://cocl.us/Geospatial_data')

In [25]:
coordinate_list.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Match the postal codes to get the latitude and longitude

We need to look up the latitude and longitude of each postal code in this list and add them to our Toronto neighborhood dataframe.

For reference, that dataframe looks like this:

In [26]:
df_final.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


In [27]:
rowscount, columnscount = df_final.shape
rowscount

103

In [28]:
# Add the latitude and longitude of each postal code with a left merge,
# which keeps the row order of df_final.
df_final = df_final.merge(
    coordinate_list.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left')


### Dataframe with Latitude and Longitude

In [29]:
df_final.head(14)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442
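
As a sanity check on the matching step, one could verify that every postal code actually found coordinates. The sketch below uses the `indicator` option of `merge` on a tiny sample. The code `M0X` is a made-up value that has no match on purpose, and the dataframe names are illustrative.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M5G', 'M2H', 'M0X'],  # 'M0X' is hypothetical
    'Neighborhood': ['Central Bay Street', 'Hillcrest Village', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M5G', 'M2H'],
    'Latitude': [43.657952, 43.803762],
    'Longitude': [-79.387383, -79.363452],
})

# indicator=True adds a '_merge' column that says where each row was found
checked = neighborhoods.merge(coords, left_on='PostalCode',
                              right_on='Postal Code', how='left',
                              indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

An empty list would mean that every row received a latitude and longitude.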


# Part 3: Neighborhood Clustering

### Render Toronto map and each postal code

In [30]:
import folium # map rendering library

In [31]:
# Get one coordinate for central Toronto. 
latitude = 43.6532
longitude = -79.3832
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6532, -79.3832.


In [32]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Cluster based on venues of interest

### Explore first neighborhood

In [33]:
import os

# Read the Foursquare credentials from environment variables instead of
# hard-coding them in the notebook. The variable names are an assumption;
# use whichever names you export in your own environment.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [34]:
df_final.loc[0, 'Neighborhood']

'Central Bay Street'

In [35]:
lat = df_final.loc[0, 'Latitude']
lon = df_final.loc[0, 'Longitude']

In [36]:
# Get 100 nearest venues
LIMIT = 100   #limit number of venues

radius = 500  #define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    lon, 
    radius, 
    LIMIT)
# The URL is not displayed here because it embeds the API credentials.
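
The request URL could also be assembled from a dictionary of parameters, which keeps each query field next to its name. The sketch below uses only the standard library. `build_explore_url` is a hypothetical helper and the credentials are placeholders. The endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lon, radius, limit):
    """Return the venues/explore URL for one latitude/longitude pair."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma inside the ll value readable
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials, not real ones
example_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                                '20180605', 43.6579524, -79.3873826, 500, 100)
print(example_url)
```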

In [37]:
# import requests # library to handle requests
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e298553b57e88001b98e2bc'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bay Street Corridor',
  'headerFullLocation': 'Bay Street Corridor, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 85,
  'suggestedBounds': {'ne': {'lat': 43.6624524045, 'lng': -79.38117421839567},
   'sw': {'lat': 43.6534523955, 'lng': -79.39359098160432}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '537d4d6d498ec171ba22e7fe',
       'name': "Jimmy's Coffee",
       'location': {'address': '82 Gerrard Street W',
        'crossStreet': 'Gerrard & LaPlante',
        'lat': 43.65842123574496,
        'lng': -79.38561319551111,
        'label

In [38]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the flattened JSON uses the prefixed column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [39]:
venues = results['response']['groups'][0]['items']

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Jimmy's Coffee,Coffee Shop,43.658421,-79.385613
1,Tim Hortons,Coffee Shop,43.65857,-79.385123
2,Hailed Coffee,Coffee Shop,43.658833,-79.383684
3,The Elm Tree Restaurant,Modern European Restaurant,43.657397,-79.383761
4,The Queen and Beaver Public House,Gastropub,43.657472,-79.383524


In [40]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

85 venues were returned by Foursquare.


### Explore all neighborhoods

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
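
One thing to watch in `getNearbyVenues` is that `v['venue']['categories'][0]` raises an `IndexError` for a venue with an empty category list. A defensive variant could look like the sketch below. `first_category_name` is a hypothetical helper, and `sample_items` is hand-written data that only mimics the shape of the `items` list in the response shown earlier. The second venue is invented.

```python
def first_category_name(venue):
    """Return the name of the venue's first category, or None if it has none."""
    categories = venue.get('categories', [])
    return categories[0]['name'] if categories else None

# Hand-written items in the shape of results['response']['groups'][0]['items']
sample_items = [
    {'venue': {'name': "Jimmy's Coffee",
               'location': {'lat': 43.65842123574496, 'lng': -79.38561319551111},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',  # hypothetical venue
               'location': {'lat': 43.0, 'lng': -79.0},
               'categories': []}},
]

rows = [(item['venue']['name'],
         item['venue']['location']['lat'],
         item['venue']['location']['lng'],
         first_category_name(item['venue'])) for item in sample_items]
for row in rows:
    print(row)
```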

### Get data of venues in Toronto

In [None]:
toronto_venues = getNearbyVenues(names=df_final['Neighborhood'],
                                   latitudes=df_final['Latitude'],
                                   longitudes=df_final['Longitude']
                                  )

Central Bay Street
Hillcrest Village
Woodbine Gardens, Parkview Hill
Scarborough Village
Leaside
Studio District
Maryvale, Wexford
Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown
Humber Summit
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rouge, Malvern
Harbourfront
Chinatown, Grange Park, Kensington Market
Lawrence Heights, Lawrence Manor
Glencairn
Humewood-Cedarvale
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Roselawn
Bedford Park, Lawrence Manor East
Commerce Court, Victoria Hotel
Design Exchange, Toronto Dominion Centre
Harbourfront East, Toronto Islands, Union Station
Adelaide, King, Richmond
Berczy Park
Harbord, University of Toronto
Caledonia-Fairbanks
Little Portugal, Trinity
Dovercourt Village, Dufferin
Kingsview Village, Martin Grove Gardens, Richview Garden

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

### Check the number of venues for each neighborhood

In [None]:
toronto_venues.groupby('Neighborhood').count()

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

### Analyze each neighborhood

In [None]:
#toronto_venues.head()

In [None]:
# del toronto_onehot

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
#toronto_onehot.head()

In [None]:
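# add neighborhood column back to dataframe
df_neigbor = toronto_venues['Neighborhood']

# If a venue category is itself named 'Neighborhood', drop that one-hot column
# so it does not clash with the neighborhood name column; otherwise do nothing.
toronto_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True, errors='ignore')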

toronto_onehot.insert(0, 'Neighborhood', df_neigbor)
toronto_onehot.head()

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

In [None]:
toronto_grouped.shape
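
To make the one-hot and `groupby(...).mean()` steps concrete, here is a toy example with two made-up neighborhoods, A and B. After grouping, each value is the share of a neighborhood's venues that fall in that category. The data and the names `venues`, `onehot` and `grouped` are invented for illustration only.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Park'],
})

# One column per category, with 1.0 where the venue has that category
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the 0/1 columns is the frequency of each category per neighborhood
grouped = onehot.groupby('Neighborhood').mean()
print(grouped)
```

Neighborhood A has three venues, two of which are coffee shops, so its Coffee Shop value is 2/3.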

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
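
The column naming and the top-venue lookup can be tried out on a single made-up row. In the sketch below, `ordinal` is a hypothetical helper that builds the same 1st, 2nd, 3rd, 4th labels without relying on an exception, and `row` is invented data in the shape of one row of `toronto_grouped`.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(4)]
print(columns)

# One made-up row: neighborhood name first, then category frequencies
row = pd.Series({'Neighborhood': 'A', 'Coffee Shop': 0.5, 'Park': 0.3, 'Gym': 0.2})
top = row.iloc[1:].astype(float).sort_values(ascending=False).index[:2].tolist()
print(top)
```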

## Use machine learning to cluster the neighborhood

In [None]:
# set number of clusters
kclusters = 8

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
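
To see what the cluster labels mean, here is k-means on four made-up frequency rows where the grouping is obvious. This is only an illustration with invented numbers and two clusters, not the Toronto data. `n_init` is set explicitly so that the sketch does not depend on the library default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows 0-1 are dominated by the first category, rows 2-3 by the second
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

Rows with similar venue profiles receive the same label. The label numbers themselves carry no meaning.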

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join the cluster labels and top venues onto df_final, which holds the latitude/longitude of each neighborhood
toronto_merged = df_final.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')


In [None]:
toronto_merged['Cluster Labels'].value_counts()
toronto_merged = toronto_merged.dropna()
toronto_merged.head(10)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
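
A note on the color scheme: in the cell above the `ys` list is only used for its length, and `rainbow[cluster-1]` still gives every cluster its own color because index -1 wraps around to the last entry. A more direct mapping could look like the sketch below, which assumes the same `kclusters = 8`. `cluster_color` is a name made up for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 8

# One evenly spaced color from the rainbow colormap per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(rgba) for rgba in colors_array]

# Map each cluster label directly to its hex color
cluster_color = {label: rainbow[label] for label in range(kclusters)}
print(cluster_color)
```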

## Examine Clusters

#### Cluster 0

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 6

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 6, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 7

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
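
The eight cells above differ only in the cluster number, so the same inspection could be written as a loop over the groups. This is a sketch on a made-up three-row table. `merged` stands in for `toronto_merged`, and its values are invented.

```python
import pandas as pd

merged = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York', 'Scarborough'],
    'Cluster Labels': [0, 1, 0],
    '1st Most Common Venue': ['Coffee Shop', 'Park', 'Coffee Shop'],
})

# One sub-table per cluster label
clusters = {label: group.drop(columns='Cluster Labels')
            for label, group in merged.groupby('Cluster Labels')}

for label, table in clusters.items():
    print('Cluster', label)
    print(table)
    print()
```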