# SEGMENTING AND CLUSTERING NEIGHBORHOODS
# IN TORONTO

# IBM CAPSTONE PROJECT
<br>
<br>
by George Pinto

### PART 1 -- CREATE DATAFRAME

Let's first import requests and Beautiful Soup

In [1]:
import requests #this is the library we use so we can make http requests

In [2]:
from bs4 import BeautifulSoup #this is the library we use to parse HTML

Let's define the URL that we want to request data from. We are getting the Canadian postal codes that start with M (the Toronto area) from Wikipedia

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" 

Now let's retrieve data from the resource we defined

In [4]:
r = requests.get(url)

Let's get the HTML of the response as a text string

In [5]:
html_doc = r.text

We will now parse the HTML text so that we can use it for this project. We first turn it into a Beautiful Soup object and then use its find method to pull the first table on the page, which we will use to build our dataframe.

Next, we pull the information we need from the table. First we read the field names from the header row, and then we use a for loop to pull the cell text from each remaining row, building one dictionary per row and collecting the dictionaries in a list.

In [6]:
soup = BeautifulSoup(html_doc, 'html.parser')
table = soup.find("table")

# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)

['Postcode', 'Borough', 'Neighbourhood']
[{'Postcode': 'M1A', 'Borough': 'Not assigned', 'Neighbourhood': 'Not assigned\n'}, {'Postcode': 'M2A', 'Borough': 'Not assigned', 'Neighbourhood': 'Not assigned\n'}, {'Postcode': 'M3A', 'Borough': 'North York', 'Neighbourhood': 'Parkwoods\n'}, {'Postcode': 'M4A', 'Borough': 'North York', 'Neighbourhood': 'Victoria Village\n'}, {'Postcode': 'M5A', 'Borough': 'Downtown Toronto', 'Neighbourhood': 'Harbourfront\n'}, {'Postcode': 'M5A', 'Borough': 'Downtown Toronto', 'Neighbourhood': 'Regent Park\n'}, {'Postcode': 'M6A', 'Borough': 'North York', 'Neighbourhood': 'Lawrence Heights\n'}, {'Postcode': 'M6A', 'Borough': 'North York', 'Neighbourhood': 'Lawrence Manor\n'}, {'Postcode': 'M7A', 'Borough': "Queen's Park", 'Neighbourhood': 'Not assigned\n'}, {'Postcode': 'M8A', 'Borough': 'Not assigned', 'Neighbourhood': 'Not assigned\n'}, {'Postcode': 'M9A', 'Borough': 'Etobicoke', 'Neighbourhood': 'Islington Avenue\n'}, {'Postcode': 'M1B', 'Borough': 'Scarbo
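
The row-to-dictionary step is easier to see in isolation. Here is a minimal sketch with toy lists standing in for the header cells and row cells (the `raw_rows` values are made up for illustration, not read from the live page). It also strips the trailing newline at extraction time, which is one possible way to avoid cleaning it up later:

```python
# Toy stand-ins for the scraped header cells and row cells (assumed values).
headings = ["Postcode", "Borough", "Neighbourhood"]
raw_rows = [
    ["M3A", "North York", "Parkwoods\n"],
    ["M5A", "Downtown Toronto", "Harbourfront\n"],
]

# zip pairs each heading with the cell in the same position;
# dict turns those pairs into one record per table row.
records = [
    dict(zip(headings, (cell.strip() for cell in row)))
    for row in raw_rows
]

print(records[0])
```

Because zip stops at the shorter input, a row with fewer cells than headings simply yields a smaller dictionary instead of raising an error.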

Let's import pandas so that we can start building the dataframe

In [7]:
import pandas as pd #library for data analysis -- we will use it to clean up the dataframe

Create the dataframe from the list of dictionaries (datasets)

In [8]:
df = pd.DataFrame(datasets)

In [9]:
df

Unnamed: 0,Borough,Neighbourhood,Postcode
0,Not assigned,Not assigned\n,M1A
1,Not assigned,Not assigned\n,M2A
2,North York,Parkwoods\n,M3A
3,North York,Victoria Village\n,M4A
4,Downtown Toronto,Harbourfront\n,M5A
5,Downtown Toronto,Regent Park\n,M5A
6,North York,Lawrence Heights\n,M6A
7,North York,Lawrence Manor\n,M6A
8,Queen's Park,Not assigned\n,M7A
9,Not assigned,Not assigned\n,M8A


Let's first get rid of the trailing newline characters ('\n') in the 'Neighbourhood' column entries by using assign and str.strip()

In [10]:
df=df.assign(Neighbourhood =df['Neighbourhood'].str.strip('\n'))
df

Unnamed: 0,Borough,Neighbourhood,Postcode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A
7,North York,Lawrence Manor,M6A
8,Queen's Park,Not assigned,M7A
9,Not assigned,Not assigned,M8A


Next, let's remove the rows whose Borough is 'Not assigned', since these boroughs cannot be identified. For this we use str.contains() to find them and ~ to negate the match

In [11]:
df = df[~df['Borough'].str.contains('Not assigned', na=False)]
df

Unnamed: 0,Borough,Neighbourhood,Postcode
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A
7,North York,Lawrence Manor,M6A
8,Queen's Park,Not assigned,M7A
10,Etobicoke,Islington Avenue,M9A
11,Scarborough,Rouge,M1B
12,Scarborough,Malvern,M1B


Next we will group our dataframe by Postcode and aggregate the Borough and Neighbourhood columns

In [12]:
df_grouped = df.groupby(by='Postcode').agg({'Borough':sum,'Neighbourhood':sum}).reset_index()
df_grouped

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,ScarboroughScarborough,RougeMalvern
1,M1C,ScarboroughScarboroughScarborough,Highland CreekRouge HillPort Union
2,M1E,ScarboroughScarboroughScarborough,GuildwoodMorningsideWest Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,ScarboroughScarboroughScarborough,East Birchmount ParkIonviewKennedy Park
7,M1L,ScarboroughScarboroughScarborough,ClairleaGolden MileOakridge
8,M1M,ScarboroughScarboroughScarborough,CliffcrestCliffsideScarborough Village West
9,M1N,ScarboroughScarborough,Birch CliffCliffside West


That didn't look great in terms of the strings, but it gave me enough of a clue to realize I need a join, so that the aggregated neighbourhoods are separated by commas. I can't think of a way to return only one Borough instead of the duplicates, though, so we will deal with those afterwards

In [13]:
df_grouped = df.groupby(by='Postcode').agg({'Borough':' '.join,'Neighbourhood':', '.join}).reset_index()
df_grouped

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough Scarborough,"Rouge, Malvern"
1,M1C,Scarborough Scarborough Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough Scarborough Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough Scarborough Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough Scarborough Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough Scarborough Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough Scarborough,"Birch Cliff, Cliffside West"


Ok, after reading the Python docs and several searches I found a way to collapse the duplicates: OrderedDict.fromkeys() builds a dictionary whose keys are the items of a list, keeping each item only once and in the order it first appeared. If we split each 'Borough' string into words, pass the words through OrderedDict.fromkeys() inside a lambda, and join the keys back together, the repeated words are dropped and only the first copy of the name remains, which sounded perfect for our 'Borough' problem.

Let's import OrderedDict from the collections library and try this out:

In [14]:
from collections import OrderedDict
df_grouped['Borough'] = df_grouped['Borough'].str.split().apply(lambda x: ' '.join(OrderedDict.fromkeys(x).keys()))
df_grouped

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
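
To see what the de-duplication does on its own, and as one possible alternative, here is a small sketch on toy data. The helper name `dedupe_words` and the toy rows are my own for illustration. The alternative uses pandas' built-in 'first' aggregation and assumes every Postcode belongs to exactly one Borough, which holds for the rows shown above but is an assumption for the full table:

```python
from collections import OrderedDict

import pandas as pd


def dedupe_words(text):
    """Drop repeated words, keeping the first occurrence of each in order."""
    return " ".join(OrderedDict.fromkeys(text.split()))


# Toy rows in the same shape as the scraped table (assumed values).
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M6A", "M6A"],
    "Borough": ["Scarborough", "Scarborough", "North York", "North York"],
    "Neighbourhood": ["Rouge", "Malvern", "Lawrence Heights", "Lawrence Manor"],
})

# Alternative: keep the first Borough per Postcode, so no cleanup is needed.
toy_grouped = (
    toy.groupby("Postcode")
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
       .reset_index()
)

print(dedupe_words("North York North York"))
print(toy_grouped)
```

The word-level approach works on whole repeated names, but it would also drop a word that legitimately appears twice in a name, so the 'first' aggregation may be the safer choice when the one-borough-per-postcode assumption holds.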


Next we replace the 'Neighbourhood' entries that are 'Not assigned' with the entries in the 'Borough' column, associating the two columns through the replace function

In [15]:
df_grouped['Neighbourhood'] = df_grouped['Neighbourhood'].replace('Not assigned',df_grouped['Borough'])
df_grouped

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
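
None of the rows printed above had a 'Not assigned' neighbourhood, so the effect is hard to see there. Here is a small sketch on toy data (two rows modelled on the Queen's Park and Parkwoods entries from earlier) of another way to express the same row-by-row fill, using Series.where. This is an alternative formulation, not the code used in this notebook:

```python
import pandas as pd

# Toy rows modelled on entries shown earlier in the notebook.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# where() keeps a value when the condition is True and otherwise
# takes the value from the same row of `other` (here, Borough).
has_name = toy["Neighbourhood"] != "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].where(has_name, toy["Borough"])

print(toy)
```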


Let's just rename the columns as per the picture on the assignment page:

In [16]:
df_grouped.rename(columns={'Postcode':'Postal Code', 'Neighbourhood': 'Neighborhood'},inplace=True)

And lastly (for this section) the dataframe shape as requested in the assignment page

In [17]:
df_grouped.shape

(103, 3)

### PART 2 -- INSERT LATITUDE AND LONGITUDE

I was having issues with Geocoder, so I used the CSV file provided instead (latitude and longitude data with postal code information for Toronto)

In [18]:
df_geo = pd.read_csv('geospatial_coordinates.csv') #creating dataframe out of csv file 

Let's take a look to make sure the dataframe looks right

In [19]:
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Let's try a merge on 'Postal Code' serving as an ID common between the two data sets:

In [20]:
df_grouped_geo = pd.merge(df_geo, df_grouped, on='Postal Code')

df_grouped_geo

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Rouge, Malvern"
1,M1C,43.784535,-79.160497,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
5,M1J,43.744734,-79.239476,Scarborough,Scarborough Village
6,M1K,43.727929,-79.262029,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,43.711112,-79.284577,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,43.716316,-79.239476,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,43.692657,-79.264848,Scarborough,"Birch Cliff, Cliffside West"


Let's check if we've got any NaN values:

In [21]:
df_grouped_geo.isnull().values.any()

False
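
One caveat: pd.merge does an inner join by default, so a postal code missing from either file is dropped rather than filled with NaN, and the null check cannot reveal it. A minimal sketch of a stricter check on toy data follows ('M9Z' is a made-up code used only to show an unmatched row):

```python
import pandas as pd

# Toy coordinate table and toy neighbourhood table (assumed values).
geo = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})
hoods = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Nowhere"],
})

# A left join keeps every neighbourhood row; indicator=True adds a
# '_merge' column saying whether a coordinate match was found.
checked = pd.merge(hoods, geo, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()

print(unmatched)
```

Comparing the row counts before and after the merge is a quicker version of the same check.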

### PART 3 -- CLUSTERING

Let's first import all the libraries we might need

In [22]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a JSON file into a pandas dataframe
from pandas import json_normalize # pandas >= 1.0; older versions use: from pandas.io.json import json_normalize

import folium # plotting library

print('Folium imported')
print('Libraries imported.')

Folium imported
Libraries imported.


If we want to keep our credentials out of the notebook, we can import pickle and os and store them in a pickle file that the notebook loads without showing the actual values. The first run creates the file with empty placeholders.

In [23]:
import pickle
import os

In [24]:
if not os.path.exists('secret_foursquare_credentials.pkl'):
    # first run: create the file with placeholder values
    Foursquare={}
    Foursquare['CLIENT_ID'] = ''
    Foursquare['CLIENT_SECRET'] = ''
    with open('secret_foursquare_credentials.pkl','wb') as f:
        pickle.dump(Foursquare, f)
else:
    # later runs: load the stored credentials and close the file afterwards
    with open('secret_foursquare_credentials.pkl','rb') as f:
        Foursquare = pickle.load(f)

In [25]:
CLIENT_ID = Foursquare['CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = Foursquare['CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Credentials are ready for Foursquare API')

Credentials are ready for Foursquare API
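
Another common way to keep secrets out of a notebook is to read them from environment variables. The sketch below is only an illustration: the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper name are my own assumptions, not something Foursquare or this course defines.

```python
import os


def load_foursquare_credentials(env=None):
    """Read the two credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID", "")          # assumed name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET", "")  # assumed name
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret


# Demo with a plain dictionary standing in for the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("Credentials loaded for", demo_id)
```

Failing loudly on empty values catches the case where placeholders were never filled in, before any API call is made.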


There are a lot of neighborhoods, since a Postal Code can have several. We are going to explore the neighborhoods in Toronto, so let's first filter for the boroughs with 'Toronto' in their name to see which neighborhoods are there:

In [26]:
Toronto = df_grouped_geo[df_grouped_geo['Borough'].str.contains('Toronto')].reset_index(drop=True)
Toronto

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M4E,43.676357,-79.293031,East Toronto,The Beaches
1,M4K,43.679557,-79.352188,East Toronto,"The Danforth West, Riverdale"
2,M4L,43.668999,-79.315572,East Toronto,"The Beaches West, India Bazaar"
3,M4M,43.659526,-79.340923,East Toronto,Studio District
4,M4N,43.72802,-79.38879,Central Toronto,Lawrence Park
5,M4P,43.712751,-79.390197,Central Toronto,Davisville North
6,M4R,43.715383,-79.405678,Central Toronto,North Toronto West
7,M4S,43.704324,-79.38879,Central Toronto,Davisville
8,M4T,43.689574,-79.38316,Central Toronto,"Moore Park, Summerhill East"
9,M4V,43.686412,-79.400049,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."


Geographical coordinates of Toronto are:

In [27]:
latitude = 43.6532
longitude = -79.3832

In [28]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

Let's check out the first neighborhood in our dataset

In [29]:
Toronto.loc[0, 'Neighborhood']

'The Beaches'

Get the neighborhood's latitude and longitude values:

In [30]:
neighborhood_latitude = Toronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Toronto.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = Toronto.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


We'll explore the top 100 venues in The Beaches. I want to start our exploration this way because I'm seriously interested in Toronto and what there is to do in each neighborhood.

Let's first define our URL. We'll reset the limit to 100 venues and set the radius to 500 meters, just like we did in the lab (it really seems like a good standard for checking stuff out):

In [31]:
# define the Foursquare explore request for this neighborhood
VERSION = '20180604'
radius = 500
LIMIT = 100
latitude = neighborhood_latitude
longitude = neighborhood_longitude
search_query = ''
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, radius, LIMIT)
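
The query string can also be assembled from a dictionary with the standard library, which makes each parameter easier to read and handles escaping. This is just a sketch: the function name `build_explore_url` is my own, and the demo call uses placeholder credentials.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in ll readable instead of encoding it as %2C
    base = "https://api.foursquare.com/v2/venues/explore?"
    return base + urlencode(params, safe=",")


# Placeholder credentials; coordinates are those of The Beaches from above.
demo_url = build_explore_url("ID", "SECRET", "20180604",
                             43.6763574, -79.2930312, 500, 100)
print(demo_url)
```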



Let's send the GET request to Foursquare and check out the JSON results

In [32]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d5606c666dc060025b51a92'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

Let's see what types of venues we've got by extracting the categories

In [33]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's convert into a DataFrame

In [34]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


How many venues did we get in The Beaches?

In [35]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


Let's check out all the neighborhoods in Toronto. What if we ended up working for IBM and were interested in moving there? Which place would we like best?


Let's borrow the function from our lab that returns the venues for all the neighborhoods

In [36]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
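
One thing to watch in the function above: it indexes `categories[0]` directly, which would raise an IndexError for a venue that comes back with an empty category list. Here is a hedged sketch of the row-building step as a standalone helper (the name `extract_venue_rows` and the demo payload are my own, shaped like the JSON response shown earlier) that returns None for the category in that case:

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Demo payload: the second venue is made up and has no categories.
demo_items = [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]
demo_rows = extract_venue_rows("The Beaches", 43.676357, -79.293031, demo_items)
print(demo_rows)
```

Keeping this step separate from the HTTP call also makes it possible to test it without using up API quota.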

Let's run it!

In [37]:
# calling the function getNearbyVenues to create the dataframe

Toronto_Venues = getNearbyVenues(names=Toronto['Neighborhood'],
                                   latitudes=Toronto['Latitude'],
                                   longitudes=Toronto['Longitude']
                                  )


The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

In [38]:
print(Toronto_Venues.shape) #checking out the size of our resulting data frame
Toronto_Venues.head(15)     #let's see what it looks like

(1689, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
5,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
7,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
8,"The Danforth West, Riverdale",43.679557,-79.352188,La Diperie,43.67753,-79.352295,Ice Cream Shop
9,"The Danforth West, Riverdale",43.679557,-79.352188,Messini Authentic Gyros,43.677827,-79.350569,Greek Restaurant


How many venues did we get per neighborhood?

In [39]:
Toronto_Venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,56,56,56,56,56,56
"Brockton, Exhibition Place, Parkdale Village",21,21,21,21,21,21
Business Reply Mail Processing Centre 969 Eastern,19,19,19,19,19,19
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",15,15,15,15,15,15
"Cabbagetown, St. James Town",45,45,45,45,45,45
Central Bay Street,84,84,84,84,84,84
"Chinatown, Grange Park, Kensington Market",100,100,100,100,100,100
Christie,16,16,16,16,16,16
Church and Wellesley,84,84,84,84,84,84


That's great! Our clustering algorithm can be put to good use. I was a little concerned about our Beaches request returning only 4 venues, but clearly there's a lot going on in Toronto that we can explore!


Let's check how many unique categories of venues we have:

In [40]:
print('There are {} unique categories.'.format(len(Toronto_Venues['Venue Category'].unique())))

There are 236 unique categories.


Let's one-hot encode the categories (one 0/1 column per category) so we can analyze the neighborhoods based on the types of venues that are present:

In [41]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_Venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighborhood'] = Toronto_Venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's check the size of our encoded dataframe

In [42]:
Toronto_onehot.shape

(1689, 236)
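
A side note on that shape: 236 categories plus a neighborhood label column would normally give 237 columns. Foursquare has a venue category that is itself called 'Neighborhood' (it showed up for Upper Beaches above), so the label assignment may have overwritten that dummy column instead of adding a new one. That would also explain why 'Yoga Studio' ended up first in the output. Below is a small sketch on toy data of one way to avoid the name clash. The column name 'Neighborhood Name' is my own choice, and using it in this notebook would mean adjusting the later cells too:

```python
import pandas as pd

# Toy venues; one of them has the category literally named 'Neighborhood'.
toy_venues = pd.DataFrame({
    "Neighborhood": ["The Beaches", "The Beaches", "Christie"],
    "Venue Category": ["Trail", "Neighborhood", "Pub"],
})

onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="")

# Insert the label at position 0 under a name that cannot collide
# with any venue category, so no dummy column is overwritten.
onehot.insert(0, "Neighborhood Name", toy_venues["Neighborhood"])

print(onehot.columns.tolist())
```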

Group the rows by neighborhood and take the mean of each column, which gives the frequency of each venue category in that neighborhood:

In [43]:
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.066667,0.066667,0.133333,0.2,0.133333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.011905,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.06,0.0,0.04,0.01,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011905,0.011905,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,...,0.011905,0.0,0.0,0.0,0.0,0.0,0.011905,0.011905,0.0,0.0


Check the new size out:

In [44]:
Toronto_grouped.shape

(38, 236)
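
Why does the mean give a frequency? Each venue is a row with a single 1 in its category column, so averaging a column within a neighborhood gives the share of that neighborhood's venues in that category. A tiny sketch with made-up numbers:

```python
import pandas as pd

# Toy one-hot rows: neighborhood A has 4 venues, B has 2 (assumed values).
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Coffee Shop":  [1, 1, 1, 0, 0, 1],
    "Park":         [0, 0, 0, 1, 1, 0],
})

freq = toy.groupby("Neighborhood").mean().reset_index()

# Within a neighborhood the category shares add up to 1.
row_totals = freq.drop(columns="Neighborhood").sum(axis=1)

print(freq)
```

This also means that a neighborhood with very few venues gets large frequencies from a single venue, which is worth keeping in mind when reading the clusters.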

Let's check the 5 most common venues in each neighborhood. This is the first clue about which 'hood' I would like to live in if I moved to Toronto (I love the 'hood' term in this for loop):

In [45]:
num_top_venues = 5

for hood in Toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Toronto_grouped[Toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.08
1             Café  0.05
2       Steakhouse  0.04
3              Bar  0.04
4  Thai Restaurant  0.04


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2                Café  0.04
3              Bakery  0.04
4  Seafood Restaurant  0.04


----Brockton, Exhibition Place, Parkdale Village----
            venue  freq
0  Breakfast Spot  0.10
1     Coffee Shop  0.10
2            Café  0.10
3             Gym  0.05
4      Restaurant  0.05


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.11
1         Yoga Studio  0.05
2          Restaurant  0.05
3          Smoke Shop  0.05
4             Brewery  0.05


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0   Airport Service  0.20
1    Airport Lounge  0.13
2  Ai

Define a function that sorts the venue categories of a row by frequency, in descending order, and returns the top ones:

In [46]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create a dataframe displaying the top ten venues for each neighborhood

In [58]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(10)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Gym,Breakfast Spot,Restaurant,Hotel,American Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Farmers Market,Beer Bar,Bakery,Steakhouse,Seafood Restaurant,Cheese Shop,Café,Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Café,Breakfast Spot,Coffee Shop,Nightclub,Bar,Restaurant,Climbing Gym,Caribbean Restaurant,Furniture / Home Store,Intersection
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Auto Workshop,Comic Shop,Pizza Place,Moving Target,Recording Studio,Restaurant,Burrito Place,Brewery
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Airport,Airport Food Court,Harbor / Marina,Sculpture Garden,Bar,Boutique,Boat or Ferry
5,"Cabbagetown, St. James Town",Coffee Shop,Bakery,Café,Restaurant,Pub,Park,Pizza Place,Italian Restaurant,Japanese Restaurant,Indian Restaurant
6,Central Bay Street,Coffee Shop,Italian Restaurant,Ice Cream Shop,Café,Burger Joint,Sandwich Place,Restaurant,Bakery,Bubble Tea Shop,Chinese Restaurant
7,"Chinatown, Grange Park, Kensington Market",Café,Vegetarian / Vegan Restaurant,Chinese Restaurant,Vietnamese Restaurant,Mexican Restaurant,Bar,Dumpling Restaurant,Bakery,Coffee Shop,Donut Shop
8,Christie,Café,Grocery Store,Park,Convenience Store,Restaurant,Baby Store,Italian Restaurant,Athletics & Sports,Diner,Nightclub
9,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Pub,Mediterranean Restaurant,Café,Pizza Place,Burger Joint
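
The column naming above works for the ten columns used here, but it would produce '21th' if num_top_venues ever went past 20. A small sketch of a more general ordinal helper (the function name `ordinal` is my own):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Same column layout as above, built with the helper.
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```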


### CLUSTERING

We'll import KMeans first to create the model

In [48]:
from sklearn.cluster import KMeans

Set k = 10 and fit the model. This is a large, diverse area, and I think we need more groups to show the differences accurately

In [59]:
# set number of clusters
kclusters = 10

Toronto_Clustering = Toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_Clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 7, 0, 0, 0, 0, 0])
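
Nine of the ten labels printed above are 0, so it may be worth checking how many neighborhoods land in each cluster before reading too much into them. A minimal sketch using only the ten labels shown (with the fitted model, `kmeans.labels_` would take the place of the list):

```python
from collections import Counter

# The ten labels printed above; the full array has one label per neighborhood.
first_labels = [0, 0, 0, 0, 7, 0, 0, 0, 0, 0]

sizes = Counter(first_labels)
for cluster, count in sorted(sizes.items()):
    print("Cluster {}: {} neighborhoods".format(cluster, count))
```

If most neighborhoods fall into one cluster and the rest are singletons, that could be a hint to try a different k or a different set of features.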

Add cluster labels to the dataframe

In [60]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = Toronto

# join the cluster labels and top venues onto the Toronto dataframe, which holds latitude/longitude for each neighborhood
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,43.676357,-79.293031,East Toronto,The Beaches,4,Health Food Store,Trail,Pub,Discount Store,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
1,M4K,43.679557,-79.352188,East Toronto,"The Danforth West, Riverdale",0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Bookstore,Brewery,Bubble Tea Shop,Burger Joint
2,M4L,43.668999,-79.315572,East Toronto,"The Beaches West, India Bazaar",0,Park,Sandwich Place,Italian Restaurant,Pet Store,Gym,Coffee Shop,Pub,Movie Theater,Burrito Place,Burger Joint
3,M4M,43.659526,-79.340923,East Toronto,Studio District,0,Café,Coffee Shop,Italian Restaurant,Bakery,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
4,M4N,43.72802,-79.38879,Central Toronto,Lawrence Park,1,Park,Bus Line,Swim School,Women's Store,Dog Run,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant
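
One thing worth checking after this step: `DataFrame.join` performs a left join by default, so a neighborhood with no venue data would keep its row but receive `NaN` for `Cluster Labels`, turning the column into floats. The sketch below illustrates this with toy frames; the names `toronto` and `venues_sorted` and all values are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for Toronto and neighborhoods_venues_sorted.
toronto = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [43.60, 43.70, 43.80],
})
venues_sorted = pd.DataFrame({
    "Cluster Labels": [0, 1],
    "Neighborhood": ["A", "B"],
    "1st Most Common Venue": ["Park", "Café"],
})

# DataFrame.join defaults to a left join: every row of `toronto` is kept.
merged = toronto.join(venues_sorted.set_index("Neighborhood"), on="Neighborhood")

# Neighborhood "C" has no venue data, so its label is NaN and the column is float.
unmatched = merged[merged["Cluster Labels"].isna()]
print(unmatched["Neighborhood"].tolist())

# Drop unmatched rows and restore integer labels before plotting or indexing.
cleaned = merged.dropna(subset=["Cluster Labels"]).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
```

In the output above the labels are already integers, which suggests every neighborhood in the first rows matched; the check only matters if some did not.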


Let's first explore the clusters visually:

In [61]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighborhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
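
To make the color assignment easier to inspect, the palette can be built and printed on its own. This is a minimal sketch, assuming the same `kclusters = 10` as above; it only shows which hex color each cluster label would receive if the list is indexed directly by label.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 10  # same value as in the clustering cell

# Sample the rainbow colormap at kclusters evenly spaced points in [0, 1].
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA tuple to a '#rrggbb' string that folium accepts.
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Show the label-to-color mapping when the list is indexed by the label itself.
for label, hex_color in enumerate(rainbow):
    print(label, hex_color)
```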

We can now explore the individual clusters -- by number:

### CLUSTER 0

In [62]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 0, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"The Danforth West, Riverdale",0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Bookstore,Brewery,Bubble Tea Shop,Burger Joint
2,"The Beaches West, India Bazaar",0,Park,Sandwich Place,Italian Restaurant,Pet Store,Gym,Coffee Shop,Pub,Movie Theater,Burrito Place,Burger Joint
3,Studio District,0,Café,Coffee Shop,Italian Restaurant,Bakery,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
6,North Toronto West,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Dessert Shop,Restaurant,Rental Car Location,Salon / Barbershop,Diner,Furniture / Home Store,Chinese Restaurant
7,Davisville,0,Dessert Shop,Sandwich Place,Pizza Place,Sushi Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Farmers Market,Restaurant
9,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",0,Coffee Shop,Light Rail Station,Pub,Liquor Store,Supermarket,Sushi Restaurant,Fried Chicken Joint,Restaurant,Sports Bar,Bagel Shop
11,"Cabbagetown, St. James Town",0,Coffee Shop,Bakery,Café,Restaurant,Pub,Park,Pizza Place,Italian Restaurant,Japanese Restaurant,Indian Restaurant
12,Church and Wellesley,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Pub,Mediterranean Restaurant,Café,Pizza Place,Burger Joint
13,"Harbourfront, Regent Park",0,Coffee Shop,Bakery,Park,Pub,Café,Breakfast Spot,Mexican Restaurant,Theater,Gym / Fitness Center,Spa
14,"Ryerson, Garden District",0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Middle Eastern Restaurant,Tea Room,Ice Cream Shop,Diner,Pizza Place,Plaza


### CLUSTER 1

In [63]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 1, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Lawrence Park,1,Park,Bus Line,Swim School,Women's Store,Dog Run,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant


### CLUSTER 2

In [64]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 2, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Roselawn,2,Garden,Women's Store,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


### CLUSTER 3

In [65]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 3, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Moore Park, Summerhill East",3,Playground,Tennis Court,Restaurant,Discount Store,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


### CLUSTER 4

In [66]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 4, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,4,Health Food Store,Trail,Pub,Discount Store,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


### CLUSTER 5

In [67]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 5, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Rosedale,5,Park,Playground,Trail,Building,Dive Bar,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant


### CLUSTER 6

In [68]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 6, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,"Forest Hill North, Forest Hill West",6,Trail,Bus Line,Jewelry Store,Sushi Restaurant,Women's Store,Doner Restaurant,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant


### CLUSTER 7

In [69]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 7, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,"CN Tower, Bathurst Quay, Island airport, Harbo...",7,Airport Service,Airport Lounge,Airport Terminal,Airport,Airport Food Court,Harbor / Marina,Sculpture Garden,Bar,Boutique,Boat or Ferry


### CLUSTER 8

In [70]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 8, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Davisville North,8,Breakfast Spot,Park,Sandwich Place,Clothing Store,Hotel,Food & Drink Shop,Gym,Falafel Restaurant,Event Space,Dog Run


### CLUSTER 9

In [71]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 9, Toronto_merged.columns[[4] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,"Dovercourt Village, Dufferin",9,Pharmacy,Supermarket,Bakery,Park,Music Venue,Café,Middle Eastern Restaurant,Bar,Bank,Brewery
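
In the output shown above, each of clusters 1 through 9 lists a single neighborhood, so a quick size summary can be more informative than one cell per cluster. The sketch below uses a toy frame with hypothetical labels (not the notebook's results) to show how cluster sizes and members could be collected in one pass.

```python
import pandas as pd

# Hypothetical labels for illustration only; not the notebook's actual output.
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [0, 0, 0, 1, 2],
})

# Number of neighborhoods in each cluster, ordered by cluster label.
sizes = toy["Cluster Labels"].value_counts().sort_index()

# Clusters that contain exactly one neighborhood.
singletons = sizes[sizes == 1].index.tolist()

# One pass over the groups replaces a separate cell per cluster.
members = {label: group["Neighborhood"].tolist()
           for label, group in toy.groupby("Cluster Labels")}

print(sizes)
print("singleton clusters:", singletons)
```

Applied to `Toronto_merged`, many singleton clusters alongside one large cluster would suggest that k-means is isolating outliers more than it is finding groups, which may be a reason to revisit k or the features.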
