<h1>Final project of IBM Data Science Certification</h1>
<h2>Segmenting and Clustering Neighborhoods in Toronto</h2>
<h3>Problem 3 - Analyze data</h3>
<h3>By: Aurelio Álvarez Ibarra</h3>

This notebook contains the analysis of different neighborhoods in Toronto. For details on the first sections of this notebook, please refer to <a href="https://github.com/aurelioai/Coursera_Capstone/blob/master/Final_proyect_AAI_1.ipynb">Problem 1</a> and <a href='https://github.com/aurelioai/Coursera_Capstone/blob/master/Final_proyect_AAI_2.ipynb'>Problem 2</a>.

<h4>0.1 Initializing data from Problem 1</h4>

In [1]:
# Get packages and libraries ready
!pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd



In [2]:
# Save data from webpage
myurl = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(myurl).text
mysoup = BeautifulSoup(source,'lxml')
mytable = mysoup.find('table')
toronto_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])

# Process data from webpage
header = True
for mytr in mytable.find_all('tr'): # Looping for each row in the table
    # Initialize data list (row)
    data = []
    for mycell in mytr.find_all('td'): # Looping for each cell in the row
        data.append(mycell.text.strip()) # Strip removes the \n at the end of the cell data
        # The previous line works for any number of columns (cells) in a row.
    # Write values from the row
    size = len(toronto_df) # Current size of dataframe
    if header: # The header row (which is only one) leaves "data" as a blank list!
        header = False
    else: # Non-header rows can be assigned to dataframe
        toronto_df.loc[size] = data # Appending data after last row of dataframe

# Clean data

# List of boroughs with an assignment
condition1 = toronto_df['Borough']!='Not assigned'
tmp1 = toronto_df[condition1]
tmp1 = tmp1.reset_index(drop=True) # Drops the old index column

# Copying borough name to neighborhood when neighborhood is not assigned
tmp2 = tmp1.copy() # Work on a real copy; "tmp2 = tmp1" would only create a second name for tmp1
for i,hood in enumerate(tmp1['Neighborhood']):
    if (hood=='Not assigned' or hood==''):
        bor = tmp2.loc[i, 'Borough']
        print('Updating Neighborhood name for ',bor,' in index ',i)
        tmp2.loc[i, 'Neighborhood'] = bor # .loc avoids chained assignment
tmp2 = tmp2.reset_index(drop=True) # Drops the old index column

# Merge neighborhoods with the same PostalCode (separated by commas)
tmp3 = tmp2.groupby('PostalCode')['Neighborhood'].apply(','.join).reset_index()
tmp3.rename(columns={'Neighborhood':'Neighborhood_comb'},inplace=True)
merged = pd.merge(tmp2, tmp3, on='PostalCode')
merged.drop(['Neighborhood'],axis=1,inplace=True) # Dropping "old" Neighborhood column
merged.drop_duplicates(inplace=True) # Dropping duplicated rows
merged.rename(columns={'Neighborhood_comb':'Neighborhood'},inplace=True)

# Replacing / by , as the exercise required
dataframe = merged.replace(' / ', ', ',regex=True)
print('After cleaning, the size of the dataframe is: ',dataframe.shape)
dataframe.head(10)

After cleaning, the size of the dataframe is:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
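
The cleaning steps above can also be written without explicit loops. The following is only a minimal sketch on made-up rows (the `raw` dataframe and its values are illustrative assumptions, not the scraped table), and it joins neighborhoods with `', '` directly instead of replacing separators afterwards:

```python
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values only)
raw = pd.DataFrame({
    'PostalCode':   ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'North York', 'Downtown Toronto',
                     'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park',
                     'Harbourfront', 'Not assigned'],
})

# 1. Keep only rows whose borough is assigned (copy, so raw is not modified)
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Where the neighborhood is missing, fall back to the borough name
missing = clean['Neighborhood'].isin(['Not assigned', ''])
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. One row per postal code, with its neighborhoods joined by ", "
clean = (clean.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .agg(', '.join)
              .reset_index())
print(clean)
```

Grouping by both `PostalCode` and `Borough` keeps the borough column without needing a separate merge and `drop_duplicates` step.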


<h4>0.2 Getting coordinates for the neighborhoods from Problem 2</h4>

In [3]:
# Download provided CSV file
!wget -q -O 'toronto_data.csv' https://cocl.us/Geospatial_data

In [4]:
# Convert file to dataframe
toronto_df = pd.read_csv('toronto_data.csv')
# Changing the name of the first column of downloaded data
toronto_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)
# Merging provided data into the original dataframe
# dataframe is the original data retrieved and cleaned from wikipedia
# toronto_df is the downloaded data
full_df = pd.merge(dataframe, toronto_df, on='PostalCode')
full_df.drop_duplicates(inplace=True) # Dropping duplicated rows
print('Shape of merged dataframe: ',full_df.shape)
full_df.head(10)

Shape of merged dataframe:  (103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
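
A merge like the one above silently drops postal codes that have no match in the coordinates file. As a hedged sketch (the two small dataframes and the code `M9Z` are made up for illustration), `pd.merge` can be asked to verify the key relationship and to report unmatched rows:

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is a made-up code with no match
    'Latitude': [43.753259, 43.725882, 0.0],
    'Longitude': [-79.329656, -79.315572, 0.0],
})

# validate raises an error if a postal code is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from
checked = pd.merge(neighborhoods, coords, on='PostalCode',
                   how='left', validate='one_to_one', indicator=True)

# Postal codes present in the neighborhoods table but missing coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every neighborhood received coordinates.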


<h4>1.1 Analyzing neighborhoods in Toronto</h4>
The purpose of the following code is to group (cluster) the neighborhoods of Toronto in order to see how similar they are to each other and which types of facilities (venues) they have. Maybe you would like to visit neighborhoods with coffee shops and bars one day, and neighborhoods with malls and beauty shops another day!

In [5]:
# Get required packages and libraries ready

import numpy as np # library to handle data in a vectorized manner

# import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


<h4>1.2 A first look at Toronto</h4>
Let's get some characteristics of the dataframe we have, as well as the location of Toronto on a map.

In [6]:
# How many boroughs and neighborhoods does Toronto have?
print('The dataframe "full_df" for Toronto has {} boroughs and {} neighborhoods.'
      .format(len(full_df['Borough'].unique()),
              full_df.shape[0]
    )
)

The dataframe "full_df" for Toronto has 10 boroughs and 103 neighborhoods.


In [7]:
# Where is Toronto?
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [8]:
# Create map of Toronto using its latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers of neighborhoods to map
for lat, lng, borough, neighborhood, pcode in zip(full_df['Latitude'], full_df['Longitude'],
                                                  full_df['Borough'], full_df['Neighborhood'],
                                                  full_df['PostalCode']):
    label = '{} ({}) {}'.format(neighborhood, borough, pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) # Do not forget to add CircleMarker to the map!!  
    
map_toronto

To simplify the analysis, the exercise suggests restricting it to boroughs that include 'Toronto' in their name. Let's extract that information:

In [9]:
# Define the dataframe by appending the desired boroughs
tmp = []
for i,x in enumerate(full_df['Borough']): # Create an enumerated list of boroughs
    if 'Toronto' in x: # Check if Toronto appears in the borough's name
        tmp.append(full_df.iloc[i])

justtoronto_df = pd.DataFrame(tmp).reset_index(drop=True) # Transform result to dataframe
print('Shape of dataframe for Toronto boroughs: ',justtoronto_df.shape)
justtoronto_df.head()

Shape of dataframe for Toronto boroughs:  (39, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
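
The same selection can be expressed with a boolean mask instead of a loop. This is only a sketch on a four-row sample (the `boroughs` dataframe is an illustrative stand-in for `full_df`):

```python
import pandas as pd

boroughs = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M4E', 'M9A'],
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto', 'Etobicoke'],
})

# True for every row whose borough name contains the literal text 'Toronto'
mask = boroughs['Borough'].str.contains('Toronto', regex=False)
just_toronto = boroughs[mask].reset_index(drop=True)
print(just_toronto)
```

The mask keeps the original column types, whereas rebuilding a dataframe from a list of rows may require checking them afterwards.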


Now, let's adapt the map to the Toronto zone:

In [10]:
# Just changed "full_df" to "justtoronto_df"
# And I will overwrite the previous map
# Create map of Toronto using its latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11) # Larger zoom

# Add markers of neighborhoods to map
for lat, lng, borough, neighborhood, pcode in zip(justtoronto_df['Latitude'], justtoronto_df['Longitude'],
                                                  justtoronto_df['Borough'], justtoronto_df['Neighborhood'],
                                                  justtoronto_df['PostalCode']):
    label = '{} ({}) {}'.format(neighborhood, borough, pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) # Do not forget to add CircleMarker to the map!!  
    
map_toronto

<h4>2.1 Setting up Foursquare credentials</h4>
Please don't use up my API call quota! XD

In [11]:
CLIENT_ID = 'RRYOHBWLN3VNML1RBPM0TRVDW2R41TKNWMZSH0VTOQKGNO2T' # your Foursquare ID
CLIENT_SECRET = 'X22FCK21ZCS0UVXZ11TILJFRGXGWVMD5ZADQLIOSMDHHSHHN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

<h4>2.2 Exploring one neighborhood</h4>
In order to make things clear, let's establish the analysis plan using just one neighborhood. Choose by setting a number between 0 and 38 in the following cell.

In [12]:
# Setting up neighborhood to be analyzed
nnum = 5

myneigh = justtoronto_df.loc[nnum, 'Neighborhood']
myneigh_lat = justtoronto_df.loc[nnum, 'Latitude'] # neighborhood latitude value
myneigh_lon = justtoronto_df.loc[nnum, 'Longitude'] # neighborhood longitude value

print('Your selected neighborhood is {}, located at (latitude,longitude) = ({},{}).'
      .format(myneigh, myneigh_lat, myneigh_lon))
print('Don\'t forget to update this cell when you want to analyze another neighborhood!')

Your selected neighborhood is Berczy Park, located at (latitude,longitude) = (43.644770799999996,-79.3733064).
Don't forget to update this cell when you want to analyze another neighborhood!


The following code requests the top 100 venues within 500 meters of the location of your neighborhood:

In [13]:
LIMIT=100 # Remember the number and type of calls you have in your credit
radius=500 # in meters
# The URL structure is straightforward to read.
# Just remember the information you have to provide for each type of request.
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
           CLIENT_ID,
           CLIENT_SECRET,
           VERSION,
           myneigh_lat,
           myneigh_lon,
           radius,
           LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=RRYOHBWLN3VNML1RBPM0TRVDW2R41TKNWMZSH0VTOQKGNO2T&client_secret=X22FCK21ZCS0UVXZ11TILJFRGXGWVMD5ZADQLIOSMDHHSHHN&v=20180605&ll=43.644770799999996,-79.3733064&radius=500&limit=100'
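
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a sketch only: `build_explore_url` is a hypothetical helper, and `'YOUR_ID'` and `'YOUR_SECRET'` are placeholders, not real credentials.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in 'll' readable instead of encoding it as %2C
    return ('https://api.foursquare.com/v2/venues/explore?'
            + urlencode(params, safe=','))

url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                        43.6447708, -79.3733064)
print(url)
```

Keeping the credentials out of the printed URL (or reading them from environment variables) also avoids publishing them together with the notebook.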

In [14]:
# Call to Foursquare. Do not run this cell more often than needed!!!
results = requests.get(url).json()
### results # Careful. Long result ahead. Uncomment just to be sure that it worked

All the information is in the <i>items</i> key. The following function <code>get_category_type</code> is used to extract the name of a category (remember the structure of the information in the <code>json</code> files).

In [15]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # Fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

The previous function helps to clean the data from the request:

In [16]:
# Getting "items" to work with a smaller amount of data
venues = results['response']['groups'][0]['items']

# Convert JSON-style data into a table
nearby_venues = json_normalize(venues)

# Getting only the columns we will use
# The names come by looking at the json_normalize result
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns] # All rows, only the filtered columns

# venue.categories looks messy from the previous result. This is why you apply "get_category_type"
#   to that column, then you get the cleaned name. Of course, the function's design comes after
#   checking the data structure in "venues".
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# Remove the "venue." prefix from the column names
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# Check the result
print('{} venues were returned by Foursquare in {}.'.format(nearby_venues.shape[0],myneigh))
nearby_venues.head()

57 venues were returned by Foursquare in Berczy Park.



Unnamed: 0,name,categories,lat,lng
0,LCBO,Liquor Store,43.642944,-79.37244
1,The Keg Steakhouse + Bar - Esplanade,Restaurant,43.646712,-79.374768
2,Fresh On Front,Vegetarian / Vegan Restaurant,43.647815,-79.374453
3,Meridian Hall,Concert Hall,43.646292,-79.376022
4,Hockey Hall Of Fame (Hockey Hall of Fame),Museum,43.646974,-79.377323
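
As a quick sanity check on the 500 m radius, the distance between the neighborhood location and a returned venue can be estimated with the haversine formula. The sketch below uses the Berczy Park coordinates and the first venue of the table above; the function name and the Earth radius constant are assumptions of this example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

center = (43.6447708, -79.3733064)  # Berczy Park
lcbo = (43.642944, -79.37244)       # LCBO, first row of the venues table

distance = haversine_m(*center, *lcbo)
print(round(distance), 'm, inside radius:', distance <= 500)
```

For this pair the distance comes out well inside the requested radius, as expected.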


<h4>3.1 Exploring the full zone</h4>
Now that the analysis has been done for one neighborhood, it can be extended to the full set of neighborhoods in the selected region of Toronto.

The following function repeats the previous steps for a list of neighborhoods, given the name and coordinates of each one (and, optionally, the search radius around each location and the maximum number of venues to retrieve).

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Searching for venues in ',name,'...')
            
        # Create URL for API request
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make GET request, directly retrieving only the interesting part
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results]) # This is a "list comprehension"
        # In this type of list, you include an implicit for, which can be useful to reduce the number of lines
        #   in a code. In this case, it looks in the "results" data for the specific elements and values of the
        #   previously defined lists.

    # Transform result into a dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list]) # Nested list comprehension
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print()
    print('Done!!',end='\n\n')
    print('Returned a dataframe with shape ',nearby_venues.shape)
    return(nearby_venues)

Now, apply the function to the full set of neighborhoods in Toronto:

In [18]:
toronto_venues = getNearbyVenues(names=justtoronto_df['Neighborhood'],
                                   latitudes=justtoronto_df['Latitude'],
                                   longitudes=justtoronto_df['Longitude']
                                  )

Searching for venues in  Regent Park, Harbourfront ...
Searching for venues in  Queen's Park, Ontario Provincial Government ...
Searching for venues in  Garden District, Ryerson ...
Searching for venues in  St. James Town ...
Searching for venues in  The Beaches ...
Searching for venues in  Berczy Park ...
Searching for venues in  Central Bay Street ...
Searching for venues in  Christie ...
Searching for venues in  Richmond, Adelaide, King ...
Searching for venues in  Dufferin, Dovercourt Village ...
Searching for venues in  Harbourfront East, Union Station, Toronto Islands ...
Searching for venues in  Little Portugal, Trinity ...
Searching for venues in  The Danforth West, Riverdale ...
Searching for venues in  Toronto Dominion Centre, Design Exchange ...
Searching for venues in  Brockton, Parkdale Village, Exhibition Place ...
Searching for venues in  India Bazaar, The Beaches West ...
Searching for venues in  Commerce Court, Victoria Hotel ...
Searching for venues in  Studio Distric

In [19]:
toronto_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
5,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
6,"Regent Park, Harbourfront",43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
7,"Regent Park, Harbourfront",43.65426,-79.360636,The Extension Room,43.653313,-79.359725,Gym / Fitness Center
8,"Regent Park, Harbourfront",43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
9,"Regent Park, Harbourfront",43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
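
The flattening step inside `getNearbyVenues` can be tried without calling the API. The following sketch uses a mocked response in the same nested shape the function relies on (`item['venue']['name']`, `['location']`, `['categories']`); the helper `venue_rows` and all mock values except the LCBO row are hypothetical. It also guards against a venue with an empty category list, which would otherwise raise an `IndexError`.

```python
def venue_rows(neighborhood, items):
    """Turn one neighborhood's list of items into flat tuples (hypothetical helper)."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of failing on [0]
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Mocked API results, keyed by neighborhood name
mock_results = {
    'Berczy Park': [
        {'venue': {'name': 'LCBO',
                   'location': {'lat': 43.642944, 'lng': -79.37244},
                   'categories': [{'name': 'Liquor Store'}]}},
        {'venue': {'name': 'Made-up kiosk',
                   'location': {'lat': 43.6450, 'lng': -79.3730},
                   'categories': []}},  # no category on purpose
    ],
    'Christie': [
        {'venue': {'name': 'Made-up cafe',
                   'location': {'lat': 43.6690, 'lng': -79.4220},
                   'categories': [{'name': 'Café'}]}},
    ],
}

# One list of rows per neighborhood, then a nested comprehension to flatten them
per_neighborhood = [venue_rows(name, items) for name, items in mock_results.items()]
flat = [row for rows in per_neighborhood for row in rows]
print(flat)
```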


In [20]:
# How many venues does each neighborhood have?
print('Number of venues retrieved per neighborhood (dataframe):')
toronto_venues.groupby('Neighborhood').count()

Number of venues retrieved per neighborhood (dataframe):


Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,57,57,57,57,57,57
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
Business reply mail Processing Centre,16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,64,64,64,64,64,64
Christie,17,17,17,17,17,17
Church and Wellesley,78,78,78,78,78,78
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,32,32,32,32,32,32
Davisville North,8,8,8,8,8,8


Check that the number of venues returned by Foursquare here matches the one in your "one neighborhood" analysis.

In [21]:
# How many types of venues are there in this dataframe?
print('There are {} unique categories of venues in the dataframe.'.format(len(toronto_venues['Venue Category'].unique())))

There are 231 unique categories of venues in the dataframe.
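
Both counts can be obtained more compactly. This is a sketch on a toy dataframe (the neighborhood names `A` and `B` are placeholders): `size()` returns a single count per group instead of repeating it in every column, and `nunique()` counts distinct categories directly.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bakery', 'Coffee Shop', 'Coffee Shop', 'Park', 'Bakery'],
})

# One number per neighborhood: how many venue rows it has
venues_per_neighborhood = venues.groupby('Neighborhood').size()

# Number of distinct venue categories in the whole dataframe
n_categories = venues['Venue Category'].nunique()

print(venues_per_neighborhood)
print(n_categories)
```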


<h4>3.2 Managing the information</h4>
The following code creates a dataframe that shows how many venues of a given type exist in each neighborhood. The dataframe will be large, but this is just the preparation step.

In [22]:
# One hot encoding
# Create a dummy dataframe with columns after (unique) values in 'Venue Category'
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
# With this you just create a 'Neighborhood' column in toronto_onehot
#   with the info from toronto_venues['Neighborhood']
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighborhood column to the first column
# The previous code results in an alphabetical order in the columns (left-to-right)
#   thus let's move the 'Neighborhood' column to the beginning.
colind = toronto_onehot.columns.get_loc("Neighborhood") # Getting the position of column in dataframe
fixed_columns = [toronto_onehot.columns[colind]] + list(toronto_onehot.columns[0:colind]) + list(toronto_onehot.columns[colind+1:])
toronto_onehot = toronto_onehot[fixed_columns]

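### Warning! In the lab exercise, the 'Neighborhood' column was added at the end of
###   the dataframe. That is why there you see a '-1' index to refer to that column.
###       fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
###       toronto_onehot = toronto_onehot[fixed_columns]
###   While checking here, I found the column in its alphabetical position instead.
###   The most likely reason: 'Neighborhood' is itself one of the venue categories
###   returned by Foursquare, so the assignment above overwrote that existing dummy
###   column in place instead of appending a new one at the end.
###   Thus, I had to modify the code to look for the column by name.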

toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Butcher,Cable Car,Café,Cajun / Creole Restaurant,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Auditorium,College Cafeteria,College Gym,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Convention Center,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Home Service,Hookah Bar,Hospital,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Market,Martial Arts Dojo,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Rental Car Location,Restaurant,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
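
The position of the `Neighborhood` column can be reproduced on a toy example. The sketch below assumes, as a plausible explanation rather than a confirmed one, that one venue category is literally named `Neighborhood`; the key name `Neighborhood Name` is a hypothetical choice to avoid the clash.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['Berczy Park', 'Berczy Park', 'Christie'],
    'Venue Category': ['Park', 'Neighborhood', 'Bakery'],  # note the 2nd category
})

# Dummy columns come out sorted by category name
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
columns_before = list(onehot.columns)

# Assigning to an existing column name overwrites it in place
onehot['Neighborhood'] = venues['Neighborhood']
position = onehot.columns.get_loc('Neighborhood')
print(columns_before, position)

# Using a distinct key name keeps the dummy column and puts the key first
safe = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
safe.insert(0, 'Neighborhood Name', venues['Neighborhood'])
print(list(safe.columns))
```

With a distinct key name, the information about venues of category `Neighborhood` is no longer lost when the key column is added.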


The previous dataframe marks the occurrence of a given venue category for each venue in a particular neighborhood. Let's group the rows by neighborhood and take the <code>mean</code> of each one-hot column. Since every row is one venue, the mean is the fraction of the neighborhood's venues that belong to that category. That is, out of all the venues in a given neighborhood, how likely it is to find a given type of venue.

In [23]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Butcher,Cable Car,Café,Cajun / Creole Restaurant,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Auditorium,College Cafeteria,College Gym,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Convention Center,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Home Service,Hookah Bar,Hospital,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Market,Martial Arts Dojo,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Rental Car Location,Restaurant,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.035088,0.0,0.0,0.0,0.017544,0.017544,0.0,0.035088,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.035088,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.017544,0.052632,0.070175,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.130435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478
2,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,0.0,0.0,0.0,0.0,0.078125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.03125,0.015625,0.015625,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.03125,0.015625,0.0,0.0,0.0,0.0625,0.03125,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.0,0.0,0.03125,0.0,0.046875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.015625


As the previous results show, you are more likely to find a coffee shop than an art gallery in Berczy Park. This is easier to see if you print the top 5 venues (by frequency) for each neighborhood.

In [24]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----" + hood + "----") # "plus" signs do not work if you mix strings and numbers!
    # T is for Transposed. It gets the venue categories to the index side.
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 3})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
          venue   freq
0   Coffee Shop  0.070
1  Cocktail Bar  0.053
2      Beer Bar  0.035
3          Café  0.035
4        Bakery  0.035


----Brockton, Parkdale Village, Exhibition Place----
                venue   freq
0                Café  0.130
1         Coffee Shop  0.087
2      Breakfast Spot  0.087
3         Yoga Studio  0.043
4  Italian Restaurant  0.043


----Business reply mail Processing Centre----
           venue   freq
0    Yoga Studio  0.062
1  Garden Center  0.062
2     Comic Shop  0.062
3    Pizza Place  0.062
4     Restaurant  0.062


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue   freq
0    Airport Lounge  0.125
1   Airport Service  0.125
2  Airport Terminal  0.125
3   Harbor / Marina  0.062
4           Airport  0.062


----Central Bay Street----
                venue   freq
0         Coffee Shop  0.172
1                Café  0.078
2  Italian Restaurant  0.0

<b>Note</b>: Remember that this <i>frequency</i> analysis depends on the number of venues in the neighborhood. If you see very small numbers in the top 5, it may mean that there are a lot of venues in the neighborhood.
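
The venue counts behind these frequencies can be estimated from the table itself. The sketch below assumes that the smallest nonzero frequency in a row corresponds to exactly one venue (an assumption, not something the notebook states), so its reciprocal approximates the number of venues in the neighborhood. The values are copied from the table above; the variable names are illustrative.

```python
# Smallest nonzero frequency and top frequency per neighborhood,
# copied from the grouped table shown above.
smallest_freq = {
    'Berczy Park': 0.017544,
    'Brockton, Parkdale Village, Exhibition Place': 0.043478,
    'Central Bay Street': 0.015625,
}
top_freq = {
    'Berczy Park': 0.070175,
    'Brockton, Parkdale Village, Exhibition Place': 0.130435,
    'Central Bay Street': 0.171875,
}

estimated = {}
for hood, freq in smallest_freq.items():
    # Assumption: the smallest nonzero frequency stands for a single venue
    n_venues = round(1 / freq)
    # Number of venues in the most common category
    n_top = round(top_freq[hood] * n_venues)
    estimated[hood] = (n_venues, n_top)
    print('{}: about {} venues, {} in the top category'.format(hood, n_venues, n_top))
```

Under that assumption, a top frequency of 0.070 in Berczy Park and one of 0.130 in Brockton both stand for only a handful of venues, which is why frequencies should be compared with the neighborhood size in mind.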

To get this information into a dataframe, it is easier to create a function that returns the top venues in a row. The next cell then creates the dataframe in a readable way.

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [26]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd'] # Not needed if you use "Venue #X" for X = 1 to num_top_venues

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind])) # Indicators for 1, 2 and 3
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1)) # When you run out of "indicators"

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns) # As wide as num_top_venues + 1
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood'] # Copy neighborhoods from dataframe

for ind in np.arange(toronto_grouped.shape[0]): # For the number of neighborhoods in the dataframe...
    # The function returns the first "num_top_venues" from the ordered list from each row
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Café,Bakery,Beer Bar,Cheese Shop,Restaurant,Seafood Restaurant,Concert Hall,Basketball Stadium
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Bakery,Stadium,Burrito Place,Restaurant,Climbing Gym,Pet Store,Performing Arts Venue
2,Business reply mail Processing Centre,Yoga Studio,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Light Rail Station,Comic Shop,Pizza Place,Butcher
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Sculpture Garden,Boutique,Plane,Rental Car Location,Harbor / Marina,Boat or Ferry,Bar
4,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Burger Joint,Salad Place,Ice Cream Shop,Dessert Shop,Thai Restaurant,Japanese Restaurant
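
The same table can also be built without an explicit loop. The following is a minimal sketch on made-up data (the dataframe <code>toy</code>, the helper <code>top_venues</code> and the column names are hypothetical, not part of the notebook) that applies a sorting function to every row.

```python
import pandas as pd

# Toy frequency table (made-up values, not the Toronto data)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Café': [0.5, 0.1],
    'Park': [0.3, 0.7],
    'Bar': [0.2, 0.2],
})

def top_venues(row, n):
    """Return the n category names with the highest frequency in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()

# Move the neighborhood name to the index so each row holds only frequencies
freqs = toy.set_index('Neighborhood')
top = freqs.apply(lambda r: pd.Series(top_venues(r, 2)), axis=1)
top.columns = ['Venue #1', 'Venue #2']
print(top)
```

Note that categories with equal frequency may come out in a different order from one sort to the next, which is likely why the tied entries for Berczy Park (Beer Bar, Café, Bakery) are ordered differently in the printed top 5 and in the table above.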


<h4>4.1 Clustering neighborhoods using <i>K means</i></h4>
The following code runs the <code>K means</code> model on several values for number of clusters and random-number-generator seeds.

In [27]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # Keep only the frequency columns
for kclusters in range(3,6):
    print()
    print('Results for K-means with k = ',kclusters)
    for seed in range(0,5):
        # Execute k-means clustering for given conditions
        kmeans = KMeans(n_clusters=kclusters, random_state=seed, n_init=12).fit(toronto_grouped_clustering)    
        # Check cluster labels generated for each row in the dataframe
        print('For k = {} and seed = {} the labels are: \n {}'.format(kclusters,seed,kmeans.labels_[0:]))


Results for K-means with k =  3
For k = 3 and seed = 0 the labels are: 
 [2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 0 2 1 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2
 2 2]
For k = 3 and seed = 1 the labels are: 
 [1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 2 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1]
For k = 3 and seed = 2 the labels are: 
 [2 2 2 2 2 2 2 2 2 0 2 2 0 2 2 2 2 2 0 2 1 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2
 2 2]
For k = 3 and seed = 3 the labels are: 
 [1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 2 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1]
For k = 3 and seed = 4 the labels are: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0]

Results for K-means with k =  4
For k = 4 and seed = 0 the labels are: 
 [0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 2 3 0 0 0 0 0 0 0 0 0
 0 0]
For k = 4 and seed = 1 the labels are: 
 [3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 0 3 2 3 3 3 3 3 1 0 3 3 3 3 3 3 3 3 3
 3 3]
For k = 4 and seed = 2 the labels are: 
 [1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 3 

The value of the <code>seed</code> for the random number generator that initializes the centroids of the clusters seems to have more influence at lower <code>kclusters</code> values. With <code>kclusters=5</code> the results are the same across seeds. Let's use those values for the clustering.
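
When comparing runs like these, keep in mind that K-means label numbers are arbitrary: two runs can produce the same grouping with the labels permuted. The sketch below checks this for three of the <code>k = 3</code> label arrays printed above (seeds 0, 1 and 2). The helpers <code>make_labels</code> and <code>same_partition</code> are hypothetical names introduced here for illustration.

```python
def make_labels(n, default, overrides):
    """Build a label list of length n from a default label and a few exceptions."""
    labels = [default] * n
    for idx, lab in overrides.items():
        labels[idx] = lab
    return labels

def same_partition(a, b):
    """True if both labelings group the rows identically, up to renaming labels."""
    pairs = set(zip(a, b))
    # A one-to-one mapping between label sets means the same grouping
    return len(pairs) == len(set(a)) == len(set(b))

# Transcribed from the k = 3 output above (39 neighborhoods)
seed0 = make_labels(39, 2, {12: 0, 18: 0, 26: 0, 20: 1})
seed1 = make_labels(39, 1, {12: 0, 18: 0, 26: 0, 20: 2})
seed2 = make_labels(39, 2, {9: 0, 12: 0, 18: 0, 26: 0, 20: 1})

print(same_partition(seed0, seed1))  # same grouping, labels swapped
print(same_partition(seed0, seed2))  # row 9 moved to another cluster
```

Seeds 0 and 1 give the same grouping even though the printed arrays look different, while seed 2 differs in one neighborhood.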

In [28]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # Keep only the frequency columns
kmeans = KMeans(n_clusters=5, random_state=0, n_init=12).fit(toronto_grouped_clustering)
kmeans

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=12, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
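
Before merging, it can help to see how many neighborhoods fall into each cluster. The sketch below counts the labels reported in this notebook for <code>k = 5</code> and <code>seed = 0</code>; the list is transcribed by hand here, so treat it as an illustration rather than live output.

```python
from collections import Counter

# Labels for k = 5, seed = 0 as printed in this notebook (39 neighborhoods);
# every position not listed below belongs to cluster 0.
labels = [0] * 39
for idx, lab in {12: 2, 18: 4, 20: 1, 26: 2, 27: 3}.items():
    labels[idx] = lab

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print('Cluster {}: {} neighborhoods'.format(cluster, size))

# Share of neighborhoods that end up in the largest cluster
largest_share = max(sizes.values()) / len(labels)
print('Largest cluster share: {:.2f}'.format(largest_share))
```

In the notebook itself, <code>pd.Series(kmeans.labels_).value_counts()</code> would give the same counts directly from the fitted model.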

Let's complete the dataframe for Toronto neighborhoods with the data from the neighborhoods, cluster label and top venues.

In [29]:
# Add clustering labels to the sorted neighborhood venues
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [30]:
# Recover the original dataframe (in this case, "justtoronto_df")
toronto_merged = justtoronto_df

# Add neighborhoods_venues_sorted to toronto_merged according to the neighborhood name
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Park,Pub,Café,Restaurant,Breakfast Spot,Theater,Beer Store,Ice Cream Shop
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Sushi Restaurant,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café,Park,College Auditorium
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Café,Middle Eastern Restaurant,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Italian Restaurant,Restaurant,Hotel
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Gastropub,Cocktail Bar,American Restaurant,Clothing Store,Art Gallery,Seafood Restaurant,Beer Bar,Moroccan Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Pub,Trail,Health Food Store,Yoga Studio,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Diner


For the final presentation, a map with colored markers for each cluster is shown as follows.

In [31]:
# Getting Toronto's coordinates
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [32]:
# Create map object
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for each cluster
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.gnuplot(np.linspace(0, 1, len(ys))) # Look for color maps in matplotlib
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, hood, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'],
                                  toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(hood) + ' (in Cluster ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h4>4.2 Examining clusters</h4>
Why are so many neighborhoods in one specific cluster? Let's see the top venues in each cluster and compare them. Since cluster 0 is the most populated, let's check that one first.

In [39]:
mycluster = 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",0,Coffee Shop,Bakery,Park,Pub,Café,Restaurant,Breakfast Spot,Theater,Beer Store,Ice Cream Shop
1,"Queen's Park, Ontario Provincial Government",0,Coffee Shop,Sushi Restaurant,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café,Park,College Auditorium
2,"Garden District, Ryerson",0,Clothing Store,Coffee Shop,Café,Middle Eastern Restaurant,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Italian Restaurant,Restaurant,Hotel
3,St. James Town,0,Café,Coffee Shop,Gastropub,Cocktail Bar,American Restaurant,Clothing Store,Art Gallery,Seafood Restaurant,Beer Bar,Moroccan Restaurant
4,The Beaches,0,Coffee Shop,Pub,Trail,Health Food Store,Yoga Studio,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Diner
5,Berczy Park,0,Coffee Shop,Cocktail Bar,Café,Bakery,Beer Bar,Cheese Shop,Restaurant,Seafood Restaurant,Concert Hall,Basketball Stadium
6,Central Bay Street,0,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Burger Joint,Salad Place,Ice Cream Shop,Dessert Shop,Thai Restaurant,Japanese Restaurant
7,Christie,0,Grocery Store,Café,Park,Baby Store,Candy Store,Nightclub,Coffee Shop,Restaurant,Athletics & Sports,Italian Restaurant
8,"Richmond, Adelaide, King",0,Coffee Shop,Café,Restaurant,Deli / Bodega,Gym,Clothing Store,Thai Restaurant,Hotel,Sushi Restaurant,Seafood Restaurant
9,"Dufferin, Dovercourt Village",0,Pharmacy,Bakery,Brewery,Bar,Bank,Pool,Supermarket,Café,Middle Eastern Restaurant,Smoke Shop


<b>NOTE</b>: If you run this notebook again, the "big" cluster may get a different label. In this example, it is 0.

For cluster 0, coffee shops and cafés are the common venues at the top of the list. What about neighborhoods like "Dufferin, Dovercourt Village" (index 9)? It does not seem very similar. It shares bakery and bar among its top venues with a couple of other neighborhoods, but it still seems rather odd. Maybe the clustering weights the separation toward the top venues rather than the whole set. In any case, remember that we are looking only at the top venues here, not at every one of them. For the rest of the clusters, the comparison is straightforward:

In [40]:
mycluster = 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,"Moore Park, Summerhill East",1,Gym,Restaurant,Yoga Studio,Dance Studio,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


In [41]:
mycluster = 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,Forest Hill North & West,2,Park,Jewelry Store,Trail,Sushi Restaurant,Cupcake Shop,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
33,Rosedale,2,Park,Playground,Trail,Cuban Restaurant,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store


In [42]:
mycluster = 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Roselawn,3,Garden,Home Service,Ice Cream Shop,Yoga Studio,Dance Studio,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run


In [43]:
mycluster = 4
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Lawrence Park,4,Park,Bus Line,Swim School,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
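
One possible reason for a single large cluster plus several singletons is the geometry of the frequency vectors. The singleton clusters above appear to list only three or four real categories before zero-frequency filler entries, which suggests they have very few venues. The sketch below uses purely synthetic profiles (all names and numbers are made up for illustration) to show that, in Euclidean distance, two neighborhoods with many venues are close to each other even when they share no category, while neighborhoods with few venues are far from everything.

```python
import math

def uniform_profile(categories, all_categories):
    """Frequency vector where each listed category has share 1/len(categories)."""
    share = 1 / len(categories)
    return [share if c in categories else 0.0 for c in all_categories]

def euclidean(u, v):
    """Euclidean distance, the metric that K-means minimizes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

all_categories = ['cat{}'.format(i) for i in range(120)]

# Two "busy" neighborhoods: 50 venue categories each, no overlap at all
busy_a = uniform_profile(set(all_categories[0:50]), all_categories)
busy_b = uniform_profile(set(all_categories[50:100]), all_categories)
# Two "quiet" neighborhoods: 4 venue categories each, no overlap
quiet_a = uniform_profile(set(all_categories[100:104]), all_categories)
quiet_b = uniform_profile(set(all_categories[104:108]), all_categories)

d_busy = euclidean(busy_a, busy_b)
d_quiet = euclidean(quiet_a, quiet_b)
d_mixed = euclidean(busy_a, quiet_a)
print(round(d_busy, 3), round(d_mixed, 3), round(d_quiet, 3))
```

In this toy setup the busy neighborhoods sit close together near the origin, so K-means would tend to group them, while each quiet neighborhood is a distant outlier that can claim a cluster of its own. Whether this is what happens with the Toronto data would need to be checked against the actual venue counts.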


For the clusters with more than one element, the top venues are very similar, so the clustering makes sense there. Explaining why the clustering puts so many neighborhoods into a single cluster takes further analysis (remember the results for <code>kclusters</code> from 3 to 4 at the beginning of section 4.1). Some straightforward ideas on this can be found <a href="https://zerowithdot.com/mistakes-with-k-means-clustering/">here</a> and some solutions are suggested <a href="https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/">here</a>. Since this is a high-dimensional problem, my suggestion is to try several values for the number of clusters and check the label distribution. Just set <code>maxclusters</code> in the following cell and see what's a good candidate! After that, rinse and repeat.

In [38]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # Keep only the frequency columns
maxclusters = 10
seed = 0
save_k = 10
for kclusters in range(3,maxclusters+1):
    print()
    print('Results for K-means with k = ',kclusters)
    # Execute k-means clustering for given conditions
    tmp = KMeans(n_clusters=kclusters, random_state=seed, n_init=12).fit(toronto_grouped_clustering)    
    # Check cluster labels generated for each row in the dataframe
    print('For k = {} and seed = {} the labels are: \n {}'.format(kclusters,seed,tmp.labels_[0:]))
    if kclusters == save_k:
        kmeans = tmp
print()
print('Saved results for kclusters = ',save_k,' in "kmeans"')


Results for K-means with k =  3
For k = 3 and seed = 0 the labels are: 
 [2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 0 2 1 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2
 2 2]

Results for K-means with k =  4
For k = 4 and seed = 0 the labels are: 
 [0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 2 3 0 0 0 0 0 0 0 0 0
 0 0]

Results for K-means with k =  5
For k = 5 and seed = 0 the labels are: 
 [0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 4 0 1 0 0 0 0 0 2 3 0 0 0 0 0 0 0 0 0
 0 0]

Results for K-means with k =  6
For k = 6 and seed = 0 the labels are: 
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 4 0 3 0 0 0 0 0 5 2 0 0 0 0 0 0 0 0 0
 0 0]

Results for K-means with k =  7
For k = 7 and seed = 0 the labels are: 
 [0 0 0 0 0 0 0 0 0 6 0 0 5 0 0 0 0 0 3 0 1 0 0 0 0 0 2 4 0 0 0 0 0 0 0 0 0
 0 0]

Results for K-means with k =  8
For k = 8 and seed = 0 the labels are: 
 [1 1 1 1 1 1 1 1 1 0 1 1 4 1 1 1 1 1 2 1 3 1 1 1 1 1 6 7 1 1 1 1 1 1 1 5 1
 1 1]

Results for K-means with k =  9
For k = 9 and seed = 0 the labels are