# Segmenting and Clustering Neighbourhoods in Toronto
##### Submitted by Nilo Villanueva

### Assignment guidelines and requirements

Task: Explore and cluster the neighbourhoods in Toronto.

1. The relevant Wikipedia data to scrape are the postal codes, boroughs, and neighbourhood names available at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.
2. The required pandas dataframe should have PostalCode, Borough, and Neighbourhood as column names.
3. Cells with a Borough marked as "Not assigned" are not processed.
4. Postal codes with multiple neighbourhood entries are combined into one row, with the neighbourhoods separated by commas.
5. For cells with a borough entry but a neighbourhood of "Not assigned", the neighbourhood takes the same value as the borough.
6. Show the dimensions of the dataframe.

In [3]:
# Getting the needed dependencies

import numpy as np  #Handles data in vectorized manner

!conda install -c conda-forge lxml --yes
!conda install -c conda-forge bs4 --yes
!conda install -c conda-forge html5lib --yes

!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup

import pandas as pd #To perform data analysis and make dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from IPython.display import display_html
from IPython.display import display

import json #Library for handling JSON data

!conda install -c conda-forge geopy --yes # For geographical data
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.


## Scraping Data From Toronto Postal Wiki-page

##### Scraping and parsing with the Python package BeautifulSoup and the lxml parser.

In [4]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
table = str(soup.table) # raw structured data is saved in table

##### Inspecting the raw data shows "Not assigned" entries in the Borough and Neighbourhood columns that need to be cleaned up.

In [5]:
display_html(table, raw=True) #displays the scraped raw data from the Wiki

Postal Code,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"


##### Converting html data into a Pandas dataframe

In [6]:
df = pd.read_html(table, header=0)[0] #the function read_html returns a list of DataFrame objects
print(type(df))
display(df.head(12))

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


#### Cleaning the dataframe as required
*The required pandas dataframe should have PostalCode, Borough, and Neighbourhood as column names.
Cells with a Borough marked as "Not assigned" are not processed.
Postal codes with multiple neighbourhood entries are combined into one row, with the neighbourhoods separated by commas.
For cells with a borough entry but a neighbourhood of "Not assigned", the neighbourhood takes the same value as the borough.*

In [7]:
df = df[df.Borough != "Not assigned"] # Remove rows whose Borough is "Not assigned"
df = df.groupby(["Postal Code","Borough"],sort = False).agg(','.join) # Join the remaining column (Neighbourhood) for each Postal Code and Borough pair
df.reset_index(inplace=True)
display(df.head(12))

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
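
The cell above covers requirements 3 and 4, but it does not show requirement 5 (an unassigned neighbourhood takes the borough name). The following is a minimal sketch of all three steps on a made-up frame. The rows in `toy` are illustrative and are not taken from the Wikipedia table.

```python
import pandas as pd

# Hypothetical rows that mimic the scraped table; not real Wikipedia data.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# Requirement 3: drop rows whose borough is not assigned.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Requirement 5: an unassigned neighbourhood takes the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Requirement 4: one row per postal code, neighbourhoods joined by ", ".
cleaned = (
    cleaned.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(cleaned)
```

Selecting the `Neighbourhood` column before aggregating keeps the join explicit, so a stray extra column would not be concatenated by accident.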


#### Displaying the dimensions of the dataframe

In [8]:
print("The cleaned and finalized dataframe is",df.shape)

The cleaned and finalized dataframe is (103, 3)


### Latitude and Longitude Data

Task: Add latitude and longitude to the cleaned dataframe.

1. A csv file (source: http://cocl.us/Geospatial_data) will be used to provide the geographical coordinates of each postal code.

In [9]:
df_geo = pd.read_csv("./Geospatial_Coordinates.csv")
display(df_geo.head(12))

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [10]:
pd.options.display.max_columns = None

The cleaned dataframe (df) and the coordinates dataframe (df_geo) are merged on their shared Postal Code column.

In [11]:
df_locations = pd.merge(df, df_geo, how='left', on='Postal Code')
display(df_locations.head(12))

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
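
A left merge silently leaves NaN coordinates for any postal code missing from the coordinates file. As a hedged sketch, pandas' `validate` and `indicator` options can surface such rows. The frames `codes` and `coords` and the postal code `M9Z` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of df and df_geo.
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is an invented code
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if either side has duplicate keys;
# indicator adds a "_merge" column recording where each row came from.
merged = pd.merge(codes, coords, how="left", on="Postal Code",
                  validate="one_to_one", indicator=True)

# Postal codes that found no coordinates in the right-hand frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```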


## Exploring and clustering the neighbourhoods in Toronto

Tasks:
1. Add enough Markdown cells to explain observations reported.
2. Generate maps to visualize the neighbourhoods and clustering.

### 1. Exploring neighbourhoods in Toronto

#### Review of the Toronto dataframe from the previous tasks. Each row carries the latitude and longitude for a postal code, along with its borough and neighbourhoods.

In [20]:
display(df_locations.head())

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


#### We start by getting an overview of the number of locations available.

In [67]:
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(df_locations['Borough'].unique()),
        df_locations.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighbourhoods.


#### Use the geopy library to get the latitude and longitude of Toronto

In [34]:
address = "Toronto, ON"
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address,latitude,longitude))

The geographical coordinates of Toronto, ON are 43.6534817, -79.3839347.


#### Creating a map of Toronto with each neighbourhood shown as a purple circle marker.

In [32]:
# Using Folium, create a map using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
for lat, lng, borough, neighborhood in zip(df_locations['Latitude'], df_locations['Longitude'], df_locations['Borough'], df_locations['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### We will focus on exploring neighbourhoods in East York for segmentation and clustering. The East York data will be sliced from the original Toronto dataframe.

In [40]:
east_york_data = df_locations[df_locations['Borough'] == 'East York'].reset_index(drop=True)
east_york_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
1,M4C,East York,Woodbine Heights,43.695344,-79.318389
2,M4G,East York,Leaside,43.70906,-79.363452
3,M4H,East York,Thorncliffe Park,43.705369,-79.349372
4,M4J,East York,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106


### Get the coordinates of East York area

In [41]:
address = 'East York'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address,latitude, longitude))

The geographical coordinates of East York are 43.699971000000005, -79.33251996261595.


### Using Folium to visualize the map of East York area and marking the neighbourhoods.

In [42]:
map_east_york = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(east_york_data['Latitude'], east_york_data['Longitude'], east_york_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_east_york)  
map_east_york

# Utilizing the Foursquare API to explore the neighbourhoods and segment them

### Define Foursquare Credentials and Version

In [68]:
CLIENT_ID = '11AZKMMXMZTOOMBTOFDIPG4BL3XO1DEOOBZZHF51TV5QFZJH' # Foursquare ID
CLIENT_SECRET = '5HWCFPYL5KRAANFGPVZLUWPQGLMMQLMHSO3PQ0CDEUDTME35' # Foursquare Secret
VERSION = '20180604' # Foursquare API version
print('Credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Credentials:
CLIENT_ID: 11AZKMMXMZTOOMBTOFDIPG4BL3XO1DEOOBZZHF51TV5QFZJH
CLIENT_SECRET:5HWCFPYL5KRAANFGPVZLUWPQGLMMQLMHSO3PQ0CDEUDTME35
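
The cell above hard-codes the credentials and prints them into the notebook output. One possible alternative, sketched here, reads them from environment variables and masks them before printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of variables.

    The two variable names are hypothetical; pick whatever your setup uses.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


def mask(secret, visible=4):
    """Keep the first few characters and star out the rest for safe printing."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Usage with a dummy mapping standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "abc123", "FOURSQUARE_CLIENT_SECRET": "s3cr3t-value"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_SECRET: " + mask(client_secret))
```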


### Exploring Leaside, the third-listed neighbourhood in the East York dataframe.

In [49]:
east_york_data.loc[2, 'Neighbourhood']

'Leaside'

### Get the neighbourhood's latitude and longitude values

In [69]:
neighbourhood_latitude = east_york_data.loc[2, 'Latitude'] 
neighbourhood_longitude = east_york_data.loc[2, 'Longitude']
neighbourhood_name = east_york_data.loc[2, 'Neighbourhood']
print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, neighbourhood_latitude, neighbourhood_longitude))

Latitude and longitude values of Leaside are 43.7090604, -79.3634517.


### Getting the top 100 venues in Leaside within a radius of 500 meters.
### The GET request URL is built and stored in the variable `url`.

In [70]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=11AZKMMXMZTOOMBTOFDIPG4BL3XO1DEOOBZZHF51TV5QFZJH&client_secret=5HWCFPYL5KRAANFGPVZLUWPQGLMMQLMHSO3PQ0CDEUDTME35&v=20180604&ll=43.7090604,-79.3634517&radius=500&limit=100'
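
The URL above is assembled with string formatting, which leaves a stray `&` right after the `?`. As a sketch of an alternative, the query string can be built from a dictionary with the standard library's `urlencode`. The function name `build_explore_url` is made up for this example; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore request URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value unescaped, as in the original URL
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


# Placeholder credentials, not real ones.
demo_url = build_explore_url("ID", "SECRET", "20180604", 43.7090604, -79.3634517)
print(demo_url)
```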

### Send the GET request and display the results

In [71]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f53236b18a1642b1cdf0a87'},
 'response': {'headerLocation': 'Leaside',
  'headerFullLocation': 'Leaside, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 33,
  'suggestedBounds': {'ne': {'lat': 43.7135604045, 'lng': -79.3572380270639},
   'sw': {'lat': 43.704560395499996, 'lng': -79.3696653729361}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5531956d498e24c6e9994f2e',
       'name': 'Local Leaside',
       'location': {'address': '180 Laird Dr',
        'lat': 43.71001166793114,
        'lng': -79.36351433524794,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.71001166793114,
          'lng': -79.36351433524794}],
        'distance': 106,
        'postalCode': 'M4G 3V7',
        'cc':

### As covered in this module, the 'items' key contains the information we need. The get_category_type function from the Foursquare lab extracts each venue's category.

In [56]:
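def get_category_type(row):
    # The category list sits under 'categories' or, after json_normalize, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # Return the first category name, or None when the venue has no category
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']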

### Clean the json and structure it into a pandas dataframe of venues located in Leaside area.

In [61]:
venues = results['response']['groups'][0]['items'] 
nearby_venues = json_normalize(venues)
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head(10)

  


Unnamed: 0,name,categories,lat,lng
0,Local Leaside,Sports Bar,43.710012,-79.363514
1,Rack Attack,Sporting Goods Shop,43.706934,-79.362261
2,Olde Yorke Fish & Chips,Fish & Chips Shop,43.706141,-79.361829
3,LCBO,Liquor Store,43.710571,-79.360287
4,Enduro Sport,Bike Shop,43.706059,-79.361835
5,The Leaside Pub,Restaurant,43.710468,-79.363848
6,Kintako Japanese Restaurant,Sushi Restaurant,43.711597,-79.363962
7,Aroma Espresso Bar,Coffee Shop,43.705611,-79.360775
8,Bulk Barn,Grocery Store,43.706116,-79.360541
9,Longo's,Supermarket,43.706433,-79.359753


In [62]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

33 venues were returned by Foursquare.
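
To make the flattening step easier to follow without calling the API, here is a small sketch on a hand-written payload. The two venues in `items` are invented and only mimic the shape of the response shown earlier.

```python
import pandas as pd

# Hypothetical payload shaped like the 'items' list in the response above.
items = [
    {"venue": {"name": "Venue A", "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.71, "lng": -79.36}}},
    {"venue": {"name": "Venue B", "categories": [],
               "location": {"lat": 43.70, "lng": -79.35}}},
]

# Nested keys become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(items)
wanted = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
flat = flat[wanted].copy()

# Keep the first category name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Shorten 'venue.location.lat' to 'lat', and so on.
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```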


## Explore Neighbourhoods in East York

### Function to repeat the same process to all the neighbourhoods in East York

In [76]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
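
The function above assumes every response contains at least one group and every venue has at least one category, so an error response or an uncategorised venue would raise. A more defensive variant of the row-extraction step might look like the sketch below. `extract_venue_rows` is a hypothetical helper, and the payloads are hand-written stand-ins for a real response.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response dict into a list of flat tuples.

    Returns an empty list when the payload has no groups, and uses None
    for the category when a venue has none, instead of raising.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payloads: one uncategorised venue, and one error-style response.
fake_ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 1.0, "lng": 2.0}, "categories": []}}
]}]}}
fake_error = {"meta": {"code": 400}}

print(extract_venue_rows("N", 0.0, 0.0, fake_ok))
print(extract_venue_rows("N", 0.0, 0.0, fake_error))
```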

### Call getNearbyVenues function for each East York neighbourhood and create a new dataframe.

In [77]:
east_york_venues = getNearbyVenues(names=east_york_data['Neighbourhood'],
                                   latitudes=east_york_data['Latitude'],
                                   longitudes=east_york_data['Longitude']
                                  )

Parkview Hill, Woodbine Gardens
Woodbine Heights
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)


### Examining the new dataframe of venues in East York vicinity.

In [88]:
east_york_venues.head(11)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Jawny Bakers,43.705783,-79.312913,Gastropub
1,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,East York Gymnastics,43.710654,-79.309279,Gym / Fitness Center
2,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Shoppers Drug Mart,43.705933,-79.312825,Pharmacy
3,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,TD Canada Trust,43.70574,-79.31227,Bank
4,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Pizza Pizza,43.705159,-79.31313,Pizza Place
5,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Rise & Dine Eatery,43.705769,-79.311638,Breakfast Spot
6,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Nostalgia,43.706833,-79.311783,Café
7,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,St. Clair Ave E & O'Connor Dr,43.705233,-79.313274,Intersection
8,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Venice Pizza,43.705921,-79.313957,Pizza Place
9,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,TKTO - Toronto Knife Throwing Organization,43.709966,-79.313411,Athletics & Sports


In [87]:
print(east_york_venues.shape)

(74, 7)


### Number of venues returned for each neighbourhood

In [89]:
east_york_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"East Toronto, Broadview North (Old East York)",3,3,3,3,3,3
Leaside,33,33,33,33,33,33
"Parkview Hill, Woodbine Gardens",10,10,10,10,10,10
Thorncliffe Park,21,21,21,21,21,21
Woodbine Heights,7,7,7,7,7,7


### Examine how many unique categories can be curated from all the returned venues

In [90]:
print('There are {} unique categories.'.format(len(east_york_venues['Venue Category'].unique())))

There are 46 unique categories.


### Analyze each neighbourhood in East York

In [95]:
east_york_onehot = pd.get_dummies(east_york_venues[['Venue Category']], prefix="", prefix_sep="")  # one hot encoding
east_york_onehot['Neighbourhood'] = east_york_venues['Neighbourhood']  # add neighbourhood column back to dataframe

# move neighbourhood column to the first column
fixed_columns = [east_york_onehot.columns[-1]] + list(east_york_onehot.columns[:-1])
east_york_onehot = east_york_onehot[fixed_columns]

east_york_onehot.head(11)

Unnamed: 0,Neighbourhood,Athletics & Sports,Bagel Shop,Bank,Beer Store,Bike Shop,Breakfast Spot,Brewery,Burger Joint,Café,Coffee Shop,Convenience Store,Curling Ice,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant,Fish & Chips Shop,Furniture / Home Store,Gas Station,Gastropub,Grocery Store,Gym,Gym / Fitness Center,Indian Restaurant,Intersection,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Park,Pet Store,Pharmacy,Pizza Place,Restaurant,Sandwich Place,Shopping Mall,Skating Rink,Smoothie Shop,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Video Store,Warehouse Store,Yoga Studio
0,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Parkview Hill, Woodbine Gardens",0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
5,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,"Parkview Hill, Woodbine Gardens",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
9,"Parkview Hill, Woodbine Gardens",1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [101]:
print("The shape of the East York venue onehot encoding set is {}.".format(east_york_onehot.shape))

The shape of the East York venue onehot encoding set is (74, 47).


### Group rows by neighbourhood, taking the mean frequency of occurrence of each category

In [105]:
east_york_grouped = east_york_onehot.groupby('Neighbourhood').mean().reset_index()
east_york_grouped.head(11)

Unnamed: 0,Neighbourhood,Athletics & Sports,Bagel Shop,Bank,Beer Store,Bike Shop,Breakfast Spot,Brewery,Burger Joint,Café,Coffee Shop,Convenience Store,Curling Ice,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant,Fish & Chips Shop,Furniture / Home Store,Gas Station,Gastropub,Grocery Store,Gym,Gym / Fitness Center,Indian Restaurant,Intersection,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Park,Pet Store,Pharmacy,Pizza Place,Restaurant,Sandwich Place,Shopping Mall,Skating Rink,Smoothie Shop,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Video Store,Warehouse Store,Yoga Studio
0,"East Toronto, Broadview North (Old East York)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Leaside,0.0,0.030303,0.060606,0.030303,0.030303,0.030303,0.030303,0.060606,0.0,0.121212,0.0,0.0,0.030303,0.030303,0.0,0.030303,0.0,0.030303,0.060606,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.030303,0.030303,0.0,0.0,0.030303,0.0,0.0,0.030303,0.030303,0.030303,0.0,0.030303,0.0,0.090909,0.030303,0.030303,0.030303,0.0,0.0,0.0
2,"Parkview Hill, Woodbine Gardens",0.1,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Thorncliffe Park,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.047619,0.0,0.047619,0.0,0.0,0.047619,0.0,0.047619,0.047619,0.0,0.095238,0.0,0.047619,0.0,0.047619,0.047619,0.0,0.047619,0.047619,0.047619,0.095238,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.047619,0.047619
4,Woodbine Heights,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.142857,0.0,0.0


In [106]:
print("The shape of the East York grouped dataset is {}.".format(east_york_grouped.shape))

The shape of the East York grouped dataset is (5, 47).
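
The one-hot encoding and the group mean can be checked by hand on a tiny example. In this sketch the neighbourhoods "A" and "B" and their venues are invented. Because each venue row has exactly one 1, the mean per neighbourhood is the share of venues in each category, and every row of the grouped frame sums to 1.

```python
import pandas as pd

# Hypothetical venues: three in neighbourhood A, one in neighbourhood B.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Bank", "Bank"],
})

# One column per category, with a 1 marking the category of each venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of the 0/1 columns = fraction of a neighbourhood's venues per category.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```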


### Displaying the top 5 most common venues in each neighbourhood

In [110]:
num_top_venues = 5
for hood in east_york_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = east_york_grouped[east_york_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    display(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----East Toronto, Broadview North (Old East York)----


Unnamed: 0,venue,freq
0,Park,0.33
1,Coffee Shop,0.33
2,Convenience Store,0.33
3,Athletics & Sports,0.0
4,Shopping Mall,0.0




----Leaside----


Unnamed: 0,venue,freq
0,Coffee Shop,0.12
1,Sporting Goods Shop,0.09
2,Bank,0.06
3,Burger Joint,0.06
4,Furniture / Home Store,0.06




----Parkview Hill, Woodbine Gardens----


Unnamed: 0,venue,freq
0,Pizza Place,0.2
1,Athletics & Sports,0.1
2,Café,0.1
3,Pharmacy,0.1
4,Intersection,0.1




----Thorncliffe Park----


Unnamed: 0,venue,freq
0,Indian Restaurant,0.1
1,Sandwich Place,0.1
2,Yoga Studio,0.05
3,Middle Eastern Restaurant,0.05
4,Liquor Store,0.05




----Woodbine Heights----


Unnamed: 0,venue,freq
0,Athletics & Sports,0.14
1,Spa,0.14
2,Park,0.14
3,Curling Ice,0.14
4,Skating Rink,0.14






### Building a pandas dataframe of the most common venues

#### Function to sort the venues in descending order

In [111]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### New dataframe displaying the top 10 most common venues in each neighbourhood

In [113]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighbourhood'] # create columns according to number of top venues
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = east_york_grouped['Neighbourhood']

for ind in np.arange(east_york_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(east_york_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"East Toronto, Broadview North (Old East York)",Convenience Store,Park,Coffee Shop,Curling Ice,Gas Station,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant,Electronics Store,Discount Store
1,Leaside,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Bank,Burger Joint,Pet Store,Breakfast Spot,Dessert Shop,Fish & Chips Shop,Liquor Store
2,"Parkview Hill, Woodbine Gardens",Pizza Place,Athletics & Sports,Bank,Gym / Fitness Center,Breakfast Spot,Intersection,Café,Gastropub,Pharmacy,Dessert Shop
3,Thorncliffe Park,Indian Restaurant,Sandwich Place,Yoga Studio,Park,Bank,Burger Joint,Coffee Shop,Discount Store,Fast Food Restaurant,Gas Station
4,Woodbine Heights,Athletics & Sports,Curling Ice,Video Store,Beer Store,Spa,Skating Rink,Park,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant
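
The column labels above come from a three-item suffix list with a fallback to "th", which is enough for the top 10. If a longer list were ever needed, a small helper could handle cases such as 11th and 21st. The function `ordinal` below is a hypothetical sketch and not part of the original lab code.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, for example 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Rebuild the same column labels used for the top 10 venues.
demo_columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(demo_columns)
```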


# Cluster Neighbourhoods

#### Run k-means to cluster the neighbourhoods into 5 clusters.

In [120]:
# set number of clusters
kclusters = 5
east_york_grouped_clustering = east_york_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(east_york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 3, 4, 1, 0], dtype=int32)
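
With five neighbourhoods and `kclusters = 5`, each neighbourhood lands in its own cluster, which is what the five distinct labels above show. To illustrate how k-means groups rows when there are fewer clusters than rows, here is a sketch on invented frequency data. The `features` array and the choice of two clusters are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency rows: the first two areas are dominated by the first
# category, the last two by the third category.
features = np.array([
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# Two clusters for four rows, so similar rows have to share a label.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.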

### Consolidating everything by merging the cluster labels and the top 10 venues for each neighbourhood into one dataframe.

In [124]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
east_york_merged = east_york_data

# join east_york_data with neighbourhoods_venues_sorted to add cluster labels and top venues to each neighbourhood's coordinates
east_york_merged = east_york_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
east_york_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,4,Pizza Place,Athletics & Sports,Bank,Gym / Fitness Center,Breakfast Spot,Intersection,Café,Gastropub,Pharmacy,Dessert Shop
1,M4C,East York,Woodbine Heights,43.695344,-79.318389,0,Athletics & Sports,Curling Ice,Video Store,Beer Store,Spa,Skating Rink,Park,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant
2,M4G,East York,Leaside,43.70906,-79.363452,3,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Bank,Burger Joint,Pet Store,Breakfast Spot,Dessert Shop,Fish & Chips Shop,Liquor Store
3,M4H,East York,Thorncliffe Park,43.705369,-79.349372,1,Indian Restaurant,Sandwich Place,Yoga Studio,Park,Bank,Burger Joint,Coffee Shop,Discount Store,Fast Food Restaurant,Gas Station
4,M4J,East York,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106,2,Convenience Store,Park,Coffee Shop,Curling Ice,Gas Station,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant,Electronics Store,Discount Store


#### Creating the new dataframe and display the top 10 venues for each neighborhood

In [115]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = east_york_grouped['Neighbourhood']

for ind in np.arange(east_york_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(east_york_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"East Toronto, Broadview North (Old East York)",Convenience Store,Park,Coffee Shop,Curling Ice,Gas Station,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant,Electronics Store,Discount Store
1,Leaside,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Bank,Burger Joint,Pet Store,Breakfast Spot,Dessert Shop,Fish & Chips Shop,Liquor Store
2,"Parkview Hill, Woodbine Gardens",Pizza Place,Athletics & Sports,Bank,Gym / Fitness Center,Breakfast Spot,Intersection,Café,Gastropub,Pharmacy,Dessert Shop
3,Thorncliffe Park,Indian Restaurant,Sandwich Place,Yoga Studio,Park,Bank,Burger Joint,Coffee Shop,Discount Store,Fast Food Restaurant,Gas Station
4,Woodbine Heights,Athletics & Sports,Curling Ice,Video Store,Beer Store,Spa,Skating Rink,Park,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant
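
The column names above rely on a three-item suffix list plus a fallback, which is enough for ten venues. As a hedged alternative, the sketch below builds the same headers with a small helper that also handles values such as 11 or 21. The function name `ordinal` is hypothetical and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 11, 12 and 13 (and 111, 112, ...) always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10

# Same header layout as the notebook: 'Neighbourhood' followed by ranked venue columns
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(rank))
    for rank in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this yields the same eleven headers as the try/except loop, so it could be swapped in without changing the resulting dataframe.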


#### Visualization of resulting clusters of venues observed in the East York area

In [122]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(east_york_merged['Latitude'], east_york_merged['Longitude'], east_york_merged['Neighbourhood'], east_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels are 0-based, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
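
The map cell assigns one colour per cluster label. The sketch below illustrates that label-to-colour mapping on its own, swapping in the standard-library `colorsys` module in place of the `matplotlib` rainbow colormap so it runs without extra dependencies. The names `cluster_palette` and `colour_for_label` are hypothetical, and the value of 5 clusters is taken from the labels 0 to 4 shown in the tables.

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters hex colours evenly spaced around the hue wheel."""
    palette = []
    for k in range(n_clusters):
        # Fixed saturation and value; only the hue changes between clusters
        r, g, b = colorsys.hsv_to_rgb(k / n_clusters, 0.9, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


kclusters = 5  # assumption: labels 0..4, as in the East York tables
palette = cluster_palette(kclusters)

# Map each 0-based cluster label straight to its own colour
colour_for_label = {label: palette[label] for label in range(kclusters)}
print(colour_for_label)
```

Any string from `colour_for_label` can be passed as the `color` or `fill_color` argument of `folium.CircleMarker`, since those accept hex colour codes.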

# Examining the East York Clusters

#### With the East York clusters identified, we can compare the venue categories that distinguish each cluster and, based on those defining categories, assign a name to each cluster.
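One way to put the clusters side by side is to aggregate by label. The sketch below is only an illustration: it rebuilds a small dataframe from the Neighbourhood, Cluster Labels, and 1st Most Common Venue values shown in the merged table earlier, then counts the neighbourhoods in each cluster and picks the most frequent top venue. The names `toy` and `summary` are hypothetical, and named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

# Toy copy of three columns from the merged East York table
toy = pd.DataFrame({
    'Neighbourhood': [
        'Parkview Hill, Woodbine Gardens',
        'Woodbine Heights',
        'Leaside',
        'Thorncliffe Park',
        'East Toronto, Broadview North (Old East York)',
    ],
    'Cluster Labels': [4, 0, 3, 1, 2],
    '1st Most Common Venue': [
        'Pizza Place',
        'Athletics & Sports',
        'Coffee Shop',
        'Indian Restaurant',
        'Convenience Store',
    ],
})

# One row per cluster: how many neighbourhoods it holds and its leading venue category
summary = toy.groupby('Cluster Labels').agg(
    neighbourhoods=('Neighbourhood', 'count'),
    top_venue=('1st Most Common Venue', lambda s: s.mode().iat[0]),
)
print(summary)
```

With only five neighbourhoods and five clusters, every cluster holds a single neighbourhood, so each cluster's description below is really a description of one neighbourhood.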

#### Cluster 1 (label 0)

In [125]:
east_york_merged.loc[east_york_merged['Cluster Labels'] == 0, east_york_merged.columns[[1] + list(range(5, east_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East York,0,Athletics & Sports,Curling Ice,Video Store,Beer Store,Spa,Skating Rink,Park,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant


In [131]:
print("Cluster 0's most common venue category is Athletics & Sports.")

Cluster 0's most common venue category is Athletics & Sports.


#### Cluster 2 (label 1)

In [126]:
east_york_merged.loc[east_york_merged['Cluster Labels'] == 1, east_york_merged.columns[[1] + list(range(5, east_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,East York,1,Indian Restaurant,Sandwich Place,Yoga Studio,Park,Bank,Burger Joint,Coffee Shop,Discount Store,Fast Food Restaurant,Gas Station


In [132]:
print("Cluster 1's most common venue category is Indian Restaurant.")

Cluster 1's most common venue category is Indian Restaurant.


#### Cluster 3 (label 2)

In [127]:
east_york_merged.loc[east_york_merged['Cluster Labels'] == 2, east_york_merged.columns[[1] + list(range(5, east_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,East York,2,Convenience Store,Park,Coffee Shop,Curling Ice,Gas Station,Furniture / Home Store,Fish & Chips Shop,Fast Food Restaurant,Electronics Store,Discount Store


In [133]:
print("Cluster 2's most common venue category is Convenience Store.")

Cluster 2's most common venue category is Convenience Store.


#### Cluster 4 (label 3)

In [128]:
east_york_merged.loc[east_york_merged['Cluster Labels'] == 3, east_york_merged.columns[[1] + list(range(5, east_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,East York,3,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Bank,Burger Joint,Pet Store,Breakfast Spot,Dessert Shop,Fish & Chips Shop,Liquor Store


In [134]:
print("Cluster 3's most common venue category is Coffee Shop.")

Cluster 3's most common venue category is Coffee Shop.


#### Cluster 5 (label 4)

In [129]:
east_york_merged.loc[east_york_merged['Cluster Labels'] == 4, east_york_merged.columns[[1] + list(range(5, east_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East York,4,Pizza Place,Athletics & Sports,Bank,Gym / Fitness Center,Breakfast Spot,Intersection,Café,Gastropub,Pharmacy,Dessert Shop


In [135]:
print("Cluster 4 is a pizza-lover's paradise.")

Cluster 4 is a pizza-lover's paradise.
