# __Capstone Project - The Battle of Neighborhoods__

## __Introduction__

Canada is ranked as one of the top countries to live in, thanks to its quality of life, public education system, and medical facilities, and it has seen a large rise in the number of new immigrants over the past decades. When newcomers decide where to settle, Toronto is usually among the first destinations they consider.

The objective of this project is to help new immigrants find a suitable neighborhood in Toronto, based on a comparative analysis of features and amenities across neighborhoods. Immigrants will choose a neighborhood according to their own circumstances and preferences. This analysis targets families who care most about the income level and education background of a neighborhood's residents. A high income and education level is assumed here to reflect a relatively safe community and well-mannered residents. Common amenities nearby, such as restaurants and grocery stores, are also taken into account. The analysis can be modified and customized for families who weigh other attributes when deciding where to settle.



## __Data__

The data used in this analysis comes from the following sources.

 - __Neighborhood location information of Toronto :__

> Location information of Toronto neighborhoods is available from the previous assignment in this course, which includes the postal code, longitude and latitude information for each neighborhood in the city of Toronto.


 - __Neighborhood profile of Toronto :__

> The neighborhood profile is obtained from the Census of Population. The profile covers age and sex, language, immigration and internal migration, ethnocultural diversity, housing, education, income, and labour; of these, the population, education, and income information is of interest in this analysis. Because the Census is held across Canada every 5 years, the data used is from the most recent Census, held in 2016.

> Data source publicly available : https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#8c732154-5012-9afe-d0cd-ba3ffc813d5a


 - __Venue information of interest :__

> Venue information of interest (i.e. restaurants and grocery stores) is obtained using the Foursquare API based on the longitude and latitude coordinates of the neighborhood. 



## __Methodology__

The methodology adopted in this analysis consists of the following steps.

 - __Step 1 : Identify the neighborhoods of Toronto in which residents have high income and education background to construct the pool of neighborhoods of interest for the analysis__
 
> As mentioned in the Introduction, a high income and education level is assumed to reflect a relatively safe community and well-mannered residents. The population of each neighborhood, the average individual income, and the number of residents with a university Bachelor's degree or above are extracted from the neighborhood profile dataset. Because neighborhoods differ in population size, the share of residents with a Bachelor's degree or above is used, calculated as the number of such residents divided by the neighborhood population. The income and education values are then normalized by dividing by the maximum value across all neighborhoods.

> This analysis targets neighborhoods that have both a high income and a high education level, so a weighted average of the normalized income and education values is computed for each neighborhood. The pool of neighborhoods of interest for the remaining analysis consists of those with the highest weighted averages. In this analysis, 12 neighborhoods are chosen.

 - __Step 2 : Locate those neighborhoods selected from previous step__ 
 
> The postal codes of the neighborhoods of interest are looked up in the neighborhood location dataset created in the previous assignment in this course. The latitude and longitude of each neighborhood are then obtained with the geocoder library. A dataset containing the neighborhood name, the borough it belongs to, its postal code, and its latitude and longitude is created and saved for later analysis.

 - __Step 3 : Connect to Foursquare and retrieve venue data of interest within each neighborhood__ 

> After the neighborhood dataset is created, venue information for each neighborhood is collected through the Foursquare API. The search radius is set to 1 kilometer from the center of each neighborhood. Since this analysis focuses on food-related amenities, such as restaurants, grocery stores, and convenience stores, some post-processing is required to keep only the venues of interest.
>
> Once the venue information is gathered and post-processed, the Venue Category column is one-hot encoded so that each venue category has its own feature column, which is used for the subsequent machine learning and statistical analysis.

 - __Step 4 : Apply machine learning technique (K-Means Clustering) to analyze the data__ 

> In this step, the K-Means Clustering technique is applied to the dataset to cluster the neighborhoods. The value of "K" is set to 5, which is deemed sufficient to cover the complexity of the problem. After clustering, each neighborhood is assigned to one of the 5 clusters.

 - __Step 5 : Make decisions on the most suitable neighborhood based on statistical indicators__
 
> The final step is to determine the most suitable neighborhood by comparing the sum score of all venues for each cluster. The cluster with the highest score is identified, and the neighborhoods within that cluster are returned as the most suitable communities to choose.   
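
Steps 4 and 5 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's own implementation: it assumes scikit-learn's `KMeans`, uses made-up venue counts for six hypothetical neighborhoods (`A` to `F`), and uses K=2 instead of 5 only because the toy set is so small. The "sum score" is read here as the total venue count over all neighborhoods in a cluster, which is one possible interpretation of Step 5.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy per-neighborhood venue counts (hypothetical values, not the Foursquare data)
venue_counts = pd.DataFrame(
    {
        "Restaurant": [10, 9, 11, 1, 0, 2],
        "Grocery Store": [4, 5, 4, 0, 1, 0],
        "Coffee Shop": [8, 7, 9, 1, 1, 0],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Step 4: cluster the neighborhoods (K=2 here only because the toy set is tiny)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(venue_counts.to_numpy())
cluster_labels = pd.Series(kmeans.labels_, index=venue_counts.index, name="Cluster")

# Step 5: score each cluster by the sum of all venue counts of its members
neighborhood_totals = venue_counts.sum(axis=1)
cluster_scores = neighborhood_totals.groupby(cluster_labels).sum()
best_cluster = cluster_scores.idxmax()
best_neighborhoods = sorted(cluster_labels[cluster_labels == best_cluster].index)

print(cluster_scores)
print("Suggested neighborhoods:", best_neighborhoods)
```

Note that a sum-based score favors clusters with more members; comparing the mean per neighborhood instead would remove that effect.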


## __Analysis__

This section presents the detailed Python code used to perform the analysis, following each of the main steps summarized in the previous section.

#### __Import required libraries__

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from bs4 import BeautifulSoup
import requests # library to handle requests
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

%pip install geocoder

import geocoder

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries have been successfully imported.')

Libraries have been successfully imported.


#### __Download neighborhood profiles from the Open Data Catalogue containing income and education information for Toronto neighborhoods__

In [2]:
# Open Data Catalogue - Neighbourhood Profiles
# https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#8c732154-5012-9afe-d0cd-ba3ffc813d5a

neigh_profile_csv_path='https://www.toronto.ca/ext/open_data/catalog/data_set_files/2016_neighbourhood_profiles.csv'
df = pd.read_csv(neigh_profile_csv_path,encoding='latin1')

print('Data has been successfully loaded')

Data has been successfully loaded


#### __Construct dataframe summarizing population, income, and education information for each neighborhood__

In [3]:
Toronto_Neighborhoods = list(df.columns.values)
Toronto_Neighborhoods = Toronto_Neighborhoods[5:]

# Set up dataframe with the following columns:
# - Neighborhood population
# - Average total income
# - # of residents with a Bachelor's degree and above within each neighborhood
# - Normalized income information
# - Normalized education information
# - Weighted average score for each neighborhood between income and education

df_Toronto = pd.DataFrame(index=Toronto_Neighborhoods, columns=["Population","Average Total Income","Education (Bachelor Level and Above)","NORM Income","NORM Education","SUM Total Score"])
df_Toronto.head()

Unnamed: 0,Population,Average Total Income,Education (Bachelor Level and Above),NORM Income,NORM Education,SUM Total Score
Agincourt North,,,,,,
Agincourt South-Malvern West,,,,,,
Alderwood,,,,,,
Annex,,,,,,
Banbury-Don Mills,,,,,,


#### __Populate each column of the dataframe with the appropriate data__

In [4]:
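# First three columns
# The row positions below (2, 2264, 1708) are hard-coded for this version of the profile CSV

for index, row in df_Toronto.iterrows():
    # Row 2: neighborhood population
    df_Toronto.at[index, 'Population'] = df[index][2]
    # Row 2264: average total income
    df_Toronto.at[index, 'Average Total Income'] = df[index][2264]
    # Row 1708: number of residents with a Bachelor's degree and above
    df_Toronto.at[index, 'Education (Bachelor Level and Above)'] = df[index][1708]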

# Convert object types to numeric values   
    
#df_Toronto.dtypes    
df_Toronto = df_Toronto.replace(to_replace=r',', value='', regex=True)   

df_Toronto["Population"] = pd.to_numeric(df_Toronto["Population"],errors='coerce')
df_Toronto["Average Total Income"] = pd.to_numeric(df_Toronto["Average Total Income"],errors='coerce')
df_Toronto["Education (Bachelor Level and Above)"] = pd.to_numeric(df_Toronto["Education (Bachelor Level and Above)"],errors='coerce')

df_Toronto.head()

Unnamed: 0,Population,Average Total Income,Education (Bachelor Level and Above),NORM Income,NORM Education,SUM Total Score
Agincourt North,29113,30414,5805,,,
Agincourt South-Malvern West,23757,31825,5765,,,
Alderwood,12054,47709,2290,,,
Annex,30526,112766,16590,,,
Banbury-Don Mills,27695,67757,10850,,,
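
Hard-coded row positions are fragile if the profile file is ever re-issued with a different row order. The sketch below shows one possible alternative: looking a row up by its label and converting it to numbers in one step. It is only an illustration on a toy frame; the `Characteristic` column name, the row labels, and the helper `row_as_numbers` are assumptions introduced here and should be checked against the real CSV.

```python
import pandas as pd

# Toy stand-in for the profile CSV. The "Characteristic" column name and the
# row labels are assumptions for illustration; check them against the real file.
profile = pd.DataFrame(
    {
        "Characteristic": ["Population, 2016", "Average total income"],
        "Alderwood": ["12,054", "47,709"],
        "Annex": ["30,526", "112,766"],
    }
)

def row_as_numbers(profile, label, neighborhoods):
    """Return the row whose Characteristic equals `label` as numeric values."""
    matches = profile.loc[profile["Characteristic"] == label, neighborhoods]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one row labelled {!r}, found {}".format(label, len(matches))
        )
    # Strip thousands separators before converting, as the notebook does
    cleaned = matches.iloc[0].astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

population = row_as_numbers(profile, "Population, 2016", ["Alderwood", "Annex"])
print(population)
```

Raising an error when the label matches zero or several rows makes a changed file layout fail loudly instead of silently reading the wrong row.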


In [5]:
# Last three columns including normalized income and education and total score

# Normalized value is the absolute value divided by the max value among all neighborhoods
# For education, because each neighborhood has a different population size, the number of residents with a bachelor's degree is first divided by the neighborhood population before normalizing
df_Toronto["NORM Income"] = df_Toronto["Average Total Income"].div(df_Toronto["Average Total Income"].max())
df_Toronto["NORM Education"] = df_Toronto["Education (Bachelor Level and Above)"].div(df_Toronto["Population"])
df_Toronto["NORM Education"] = df_Toronto["NORM Education"].div(df_Toronto["NORM Education"].max())

# Calculate weighted average between normalized income and normalized education as the total score for each neighborhood
# Same weight (50%) is used for both income and education
df_Toronto["SUM Total Score"] = df_Toronto["NORM Income"]*0.5 + df_Toronto["NORM Education"]*0.5

# Final dataframe obtained
df_Toronto.head()

Unnamed: 0,Population,Average Total Income,Education (Bachelor Level and Above),NORM Income,NORM Education,SUM Total Score
Agincourt North,29113,30414,5805,0.098744,0.31491,0.206827
Agincourt South-Malvern West,23757,31825,5765,0.103325,0.383247,0.243286
Alderwood,12054,47709,2290,0.154894,0.300037,0.227466
Annex,30526,112766,16590,0.366111,0.858316,0.612214
Banbury-Don Mills,27695,67757,10850,0.219983,0.618727,0.419355
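
The scoring logic can also be wrapped in a small function so that the income/education weighting is easy to change. The sketch below is one way to do that, not the notebook's code: the function name `add_total_score` and the `income_weight` parameter are introduced here for illustration. The sample rows are copied from the output above, but the maximum is taken over these three rows only, so the normalized values differ from the full-dataset values shown in the notebook.

```python
import pandas as pd

def add_total_score(df, income_weight=0.5):
    """Return a copy of df with normalized income, education and a weighted score."""
    if not 0.0 <= income_weight <= 1.0:
        raise ValueError("income_weight must be between 0 and 1")
    out = df.copy()
    # Normalize income by the maximum across the neighborhoods in df
    out["NORM Income"] = out["Average Total Income"] / out["Average Total Income"].max()
    # Convert the degree count to a share of the population, then normalize
    degree_share = out["Education (Bachelor Level and Above)"] / out["Population"]
    out["NORM Education"] = degree_share / degree_share.max()
    # Weighted average of the two normalized indicators
    out["SUM Total Score"] = (income_weight * out["NORM Income"]
                              + (1.0 - income_weight) * out["NORM Education"])
    return out

# Three rows copied from the notebook output (not the full dataset)
sample = pd.DataFrame(
    {
        "Population": [29113, 12054, 30526],
        "Average Total Income": [30414, 47709, 112766],
        "Education (Bachelor Level and Above)": [5805, 2290, 16590],
    },
    index=["Agincourt North", "Alderwood", "Annex"],
)
scored = add_total_score(sample)
print(scored[["NORM Income", "NORM Education", "SUM Total Score"]])
```

A family that cares more about income than education could call `add_total_score(sample, income_weight=0.7)` without touching the rest of the pipeline.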


#### __Sort total score to get the top neighborhoods to consider in following analysis__

In [6]:
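# Sort in ascending order: the lowest scores are listed first and the highest scores appear at the end of the output
df_Toronto.sort_values('SUM Total Score')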

Unnamed: 0,Population,Average Total Income,Education (Bachelor Level and Above),NORM Income,NORM Education,SUM Total Score
Black Creek,21737,25989,1565,0.084377,0.113707,0.099042
Glenfield-Jane Heights,30491,27984,2110,0.090854,0.10929,0.100072
Rustic,9941,31800,800,0.103243,0.127096,0.11517
Beechborough-Greenbrook,6577,33829,610,0.109831,0.146478,0.128154
Keelesdale-Eglinton West,11058,33316,1105,0.108165,0.157818,0.132992
Brookhaven-Amesbury,17757,32483,1920,0.105461,0.170766,0.138114
Humber Summit,12416,30731,1450,0.099773,0.184441,0.142107
Mount Dennis,13593,30827,1650,0.100084,0.191708,0.145896
Elms-Old Rexdale,9456,32012,1130,0.103932,0.18873,0.146331
Humbermede,15545,29528,1955,0.095867,0.198622,0.147244
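
Because the table is sorted in ascending order, the highest-scoring neighborhoods end up at the bottom. As a hedged alternative, `DataFrame.nlargest` returns the top rows directly. The sketch below uses only the five scores from the preview shown earlier, so it selects the top 2 rather than the top 12.

```python
import pandas as pd

# Scores copied from the five-row preview shown earlier (not the full dataset)
scores = pd.DataFrame(
    {"SUM Total Score": [0.206827, 0.243286, 0.227466, 0.612214, 0.419355]},
    index=[
        "Agincourt North",
        "Agincourt South-Malvern West",
        "Alderwood",
        "Annex",
        "Banbury-Don Mills",
    ],
)

# nlargest returns the n highest-scoring rows, already sorted from high to low
top_two = scores.nlargest(2, "SUM Total Score")
print(top_two)

# Equivalent result using sort_values in descending order
same_rows = scores.sort_values("SUM Total Score", ascending=False).head(2)
```

On the full dataframe, the same call with `12` in place of `2` would list the candidates for the pool of neighborhoods of interest.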


#### __Identify the top neighborhoods to consider in following analysis__

Based on the SUM total score column, the following 12 neighborhoods are selected from those with highest scores:

 - North York - York Mills
 - Downtown Toronto - Rosedale
 - Central Toronto - Moore Park
 - Central Toronto - Forest Hill South
 - Central Toronto - Lawrence Park South
 - Downtown Toronto - Waterfront Communities
 - Downtown Toronto - The Islands
 - Central Toronto - Annex
 - East York - Leaside
 - Downtown Toronto - Bay Street Corridor
 - North York - Bedford Park
 - East Toronto - The Beaches
 
This list is saved in a csv file named "toronto_neighborhoods_of_interest.csv", which also includes the postal code of each of the 12 neighborhoods. The postal codes were looked up in the source used in the previous assignment in this course and entered manually.

In [7]:
# Load neighborhood list
toronto_neigh_df = pd.read_csv('toronto_neighborhoods_of_interest.csv')
toronto_neigh_df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M2L,North York,York Mills
1,M4W,Downtown Toronto,Rosedale
2,M4T,Central Toronto,Moore Park
3,M4V,Central Toronto,Forest Hill South
4,M4N,Central Toronto,Lawrence Park South


#### __Obtain latitude and longitude coordinates for each neighborhood of interest__

In [8]:
# Create function to get latitude and longitude coordinates

def get_latlng(postal_code):
    # initialize the coordinates to None
    lat_lng_coords = None
    # loop until the geocoder returns coordinates (note: this never exits if the lookup keeps failing)
    while lat_lng_coords is None:
        # geocoder.google did not work properly here
        # g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        # use the ArcGIS provider instead
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
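
The loop above retries indefinitely if a lookup keeps failing. A bounded variant is sketched below as one possible safeguard. The function name `get_latlng_bounded` and the parameters `geocode`, `max_attempts`, and `delay_seconds` are introduced here for illustration; by default it falls back to `geocoder.arcgis`, the same provider used above. The demonstration call uses a stand-in geocoder so that it runs without network access.

```python
import time
from types import SimpleNamespace

def get_latlng_bounded(postal_code, geocode=None, max_attempts=5, delay_seconds=1.0):
    """Return [lat, lng] for a postal code, or None after max_attempts failed lookups."""
    if geocode is None:
        import geocoder  # imported lazily; same provider as get_latlng above
        geocode = geocoder.arcgis
    query = '{}, Toronto, Ontario'.format(postal_code)
    for attempt in range(max_attempts):
        coords = geocode(query).latlng
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    return None

# Offline demonstration: a stand-in geocoder returning the M2L coordinates from this notebook
def fake_geocode(query):
    return SimpleNamespace(latlng=[43.757192, -79.379865])

coords = get_latlng_bounded('M2L', geocode=fake_geocode)
print(coords)
```

Callers then need to handle a `None` result, for example by skipping the neighborhood or reporting which postal code could not be resolved.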

In [9]:
# Obtain latitude and longitude coordinates for each neighborhood of interest based on its postal code
neigh_postal_codes = toronto_neigh_df['Postalcode']    
lat_lng_coords = [get_latlng(postal_code) for postal_code in neigh_postal_codes.tolist()]

In [11]:
# Save the obtained latitude and longitude coordinates to the dataframe
neigh_coords_df = pd.DataFrame(lat_lng_coords, columns=['Latitude', 'Longitude'])
toronto_neigh_df['Latitude'] = neigh_coords_df['Latitude']
toronto_neigh_df['Longitude'] = neigh_coords_df['Longitude']
toronto_neigh_df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M2L,North York,York Mills,43.757192,-79.379865
1,M4W,Downtown Toronto,Rosedale,43.68194,-79.378474
2,M4T,Central Toronto,Moore Park,43.690655,-79.383561
3,M4V,Central Toronto,Forest Hill South,43.686083,-79.402335
4,M4N,Central Toronto,Lawrence Park South,43.72816,-79.387085


In [12]:
# Save the updated dataframe to csv for later use
toronto_neigh_df.to_csv('toronto_neighborhoods_of_interest_coords.csv',index=False)

In [13]:
# Get Toronto latitude and longitude coordinates
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the City of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the City of Toronto are 43.653963, -79.387207.


#### __Visualize all neighborhoods of interest on the map of Toronto__

In [14]:
# Create a map of Toronto using latitude and longitude values and plot all neighborhoods of interest on it
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_neigh_df['Latitude'], toronto_neigh_df['Longitude'], toronto_neigh_df['Borough'], toronto_neigh_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

#### __Get venue information for each neighborhood by connecting to the Foursquare API__

In [16]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [17]:
# Create a function to get venue information by making an API request
def get_foursquare_venue(postal_code_list, neighborhood_list, lat_list, lng_list, LIMIT=500, radius=1000):
    result_final = []
    counter = 0
    for postal_code, neighborhood, lat, lng in zip(postal_code_list, neighborhood_list, lat_list, lng_list):

        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
              CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

        # Make the API request and keep the list of venue items from the first group
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # Store the raw venue items together with the neighborhood they belong to
        tmp_result = {}
        tmp_result['Postal Code'] = postal_code
        tmp_result['Neighborhood(s)'] = neighborhood
        tmp_result['Latitude'] = lat
        tmp_result['Longitude'] = lng
        tmp_result['Venue Result'] = results
        result_final.append(tmp_result)

        counter = counter + 1
        print('{}.'.format(counter))
        print('Venue data is successfully obtained for the neighborhood {}.'.format(neighborhood))

    return result_final
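
The request above assumes that every call succeeds and that the response always contains at least one group. A slightly more defensive version is sketched below. The helper names `explore_params` and `fetch_venue_items` and the `http_get` parameter are introduced here for illustration; the endpoint and query parameters are the ones used in the cell above. The demonstration uses a stub in place of `requests.get` so that it runs offline and without credentials.

```python
def explore_params(lat, lng, client_id, client_secret, version, radius=1000, limit=500):
    """Build the query parameters used by the explore request above."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }

def fetch_venue_items(params, http_get=None, timeout=10):
    """Return the venue items for one location, or [] if the response has no groups."""
    if http_get is None:
        import requests  # imported lazily so the helper can be exercised with a stub
        http_get = requests.get
    response = http_get('https://api.foursquare.com/v2/venues/explore',
                        params=params, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError
    groups = response.json().get('response', {}).get('groups', [])
    return groups[0]['items'] if groups else []

# Offline demonstration with a stub standing in for requests.get
class StubResponse:
    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass

    def json(self):
        return self._payload

def stub_get(url, params=None, timeout=None):
    payload = {'response': {'groups': [{'items': [{'venue': {'name': 'Summerhill Market'}}]}]}}
    return StubResponse(payload)

params = explore_params(43.68194, -79.378474, 'ID', 'SECRET', '20180605')
items = fetch_venue_items(params, http_get=stub_get)
print(len(items), params['ll'])
```

Passing the parameters as a dictionary lets the HTTP library handle URL encoding, and the timeout keeps a stalled request from blocking the whole loop.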

In [18]:
# Get venue information for each neighborhood
Toronto_foursquare_venue_dataset = get_foursquare_venue(list(toronto_neigh_df['Postalcode']),
                                                        list(toronto_neigh_df['Neighborhood']),
                                                        list(toronto_neigh_df['Latitude']),
                                                        list(toronto_neigh_df['Longitude']))

1.
Venue data is successfully obtained for the neighborhood York Mills.
2.
Venue data is successfully obtained for the neighborhood Rosedale.
3.
Venue data is successfully obtained for the neighborhood Moore Park.
4.
Venue data is successfully obtained for the neighborhood Forest Hill South.
5.
Venue data is successfully obtained for the neighborhood Lawrence Park South.
6.
Venue data is successfully obtained for the neighborhood Waterfront Communities.
7.
Venue data is successfully obtained for the neighborhood The Islands.
8.
Venue data is successfully obtained for the neighborhood Annex.
9.
Venue data is successfully obtained for the neighborhood Leaside.
10.
Venue data is successfully obtained for the neighborhood Bay Street Corridor.
11.
Venue data is successfully obtained for the neighborhood Bedford Park.
12.
Venue data is successfully obtained for the neighborhood The Beaches.


In [19]:
# Create function to extract venue information from the raw Foursquare dataset obtained previously for each neighborhood

def extract_venue_dataset(foursquare_dataset):
    columns = ['Postal Code', 'Neighborhood',
               'Neighborhood Latitude', 'Neighborhood Longitude',
               'Venue', 'Venue Summary', 'Venue Category', 'Distance']
    rows = []

    for neigh_dict in foursquare_dataset:
        postal_code = neigh_dict['Postal Code']
        neigh = neigh_dict['Neighborhood(s)']
        lat = neigh_dict['Latitude']
        lng = neigh_dict['Longitude']
        print('Number of venues in "{}" Neighborhood(s) is:'.format(neigh))
        print(len(neigh_dict['Venue Result']))

        for venue_dict in neigh_dict['Venue Result']:
            summary = venue_dict['reasons']['items'][0]['summary']
            name = venue_dict['venue']['name']
            dist = venue_dict['venue']['location']['distance']
            cat = venue_dict['venue']['categories'][0]['name']

            # Collect rows in a list; DataFrame.append is slow and was removed in pandas 2.0
            rows.append({'Postal Code': postal_code, 'Neighborhood': neigh,
                         'Neighborhood Latitude': lat, 'Neighborhood Longitude': lng,
                         'Venue': name, 'Venue Summary': summary,
                         'Venue Category': cat, 'Distance': dist})

    # Build the dataframe once from all collected rows
    return pd.DataFrame(rows, columns=columns)

In [20]:
# Extract venue information for each neighborhood
toronto_venues = extract_venue_dataset(Toronto_foursquare_venue_dataset)

Number of venues in "York Mills" Neighborhood(s) is:
4
Number of venues in "Rosedale" Neighborhood(s) is:
20
Number of venues in "Moore Park" Neighborhood(s) is:
69
Number of venues in "Forest Hill South" Neighborhood(s) is:
85
Number of venues in "Lawrence Park South" Neighborhood(s) is:
10
Number of venues in "Waterfront Communities" Neighborhood(s) is:
100
Number of venues in "The Islands" Neighborhood(s) is:
22
Number of venues in "Annex" Neighborhood(s) is:
100
Number of venues in "Leaside" Neighborhood(s) is:
73
Number of venues in "Bay Street Corridor" Neighborhood(s) is:
100
Number of venues in "Bedford Park" Neighborhood(s) is:
38
Number of venues in "The Beaches" Neighborhood(s) is:
81


In [21]:
toronto_venues.head()

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance
0,M2L,York Mills,43.757192,-79.379865,St. Andrews Park,This spot is popular,Park,542
1,M2L,York Mills,43.757192,-79.379865,Hwy 401 at Bayview,This spot is popular,Intersection,887
2,M2L,York Mills,43.757192,-79.379865,Liberty Club Gym,This spot is popular,Gym / Fitness Center,931
3,M2L,York Mills,43.757192,-79.379865,The Empire Fitness Room,This spot is popular,Gym,1000
4,M4W,Rosedale,43.68194,-79.378474,Summerhill Market,This spot is popular,Grocery Store,539


In [22]:
toronto_venues.tail()

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance
697,M4E,The Beaches,43.676845,-79.295225,Alex Farm Products (Beaches),This spot is popular,Cheese Shop,950
698,M4E,The Beaches,43.676845,-79.295225,SUPgirlz,This spot is popular,Scenic Lookout,954
699,M4E,The Beaches,43.676845,-79.295225,Little Elf House Under A Tree,This spot is popular,Tree,955
700,M4E,The Beaches,43.676845,-79.295225,Beaches Sports Centre,This spot is popular,Skating Rink,964
701,M4E,The Beaches,43.676845,-79.295225,The Thai Grill,This spot is popular,Thai Restaurant,977


In [43]:
# Save toronto venue dataset for the neighborhoods of interest
toronto_venues.to_csv('toronto_venues.csv')

In [42]:
# Load saved toronto venue dataset for the neighborhoods of interest
toronto_venues = pd.read_csv('toronto_venues.csv')
toronto_venues.head()

Unnamed: 0.1,Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance
0,0,M2L,York Mills,43.757192,-79.379865,St. Andrews Park,This spot is popular,Park,542
1,1,M2L,York Mills,43.757192,-79.379865,Hwy 401 at Bayview,This spot is popular,Intersection,887
2,2,M2L,York Mills,43.757192,-79.379865,Liberty Club Gym,This spot is popular,Gym / Fitness Center,931
3,3,M2L,York Mills,43.757192,-79.379865,The Empire Fitness Room,This spot is popular,Gym,1000
4,4,M4W,Rosedale,43.68194,-79.378474,Summerhill Market,This spot is popular,Grocery Store,539
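
The extra `Unnamed: 0` column appears because `to_csv` writes the dataframe index by default and `read_csv` then loads it back as an ordinary column. The sketch below illustrates this on a two-row toy frame (values copied from the venue table above) using an in-memory buffer instead of a file; it is an illustration of the pandas behaviour, not a change to the notebook.

```python
import io
import pandas as pd

venues = pd.DataFrame(
    {
        "Neighborhood": ["York Mills", "Rosedale"],
        "Venue": ["St. Andrews Park", "Summerhill Market"],
        "Venue Category": ["Park", "Grocery Store"],
    }
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
venues.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)

# index=False keeps the file limited to the real columns
without_index = io.StringIO()
venues.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(list(reloaded.columns))
```

If the save cell used `index=False`, the later step that drops `Unnamed: 0` would no longer be needed.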


In [24]:
# Confirm number of neighborhoods
neigh_list = list(toronto_venues['Neighborhood'].unique())
print('Number of Neighborhoods inside toronto:')
print(len(neigh_list))
print('List of Neighborhoods inside toronto:')
neigh_list

Number of Neighborhoods inside toronto:
12
List of Neighborhoods inside toronto:


['York Mills',
 'Rosedale',
 'Moore Park',
 'Forest Hill South',
 'Lawrence Park South',
 'Waterfront Communities',
 'The Islands',
 'Annex',
 'Leaside',
 'Bay Street Corridor',
 'Bedford Park',
 'The Beaches']

In [25]:
# Confirm the list of venues
print('There are {} unique categories across all neighborhoods of interest.'.format(len(toronto_venues['Venue Category'].unique())))
print('The list of different venue categories is:')
list(toronto_venues['Venue Category'].unique())

There are 174 unique categories across all neighborhoods of interest.
The list of different venue categories is:


['Park',
 'Intersection',
 'Gym / Fitness Center',
 'Gym',
 'Grocery Store',
 'Athletics & Sports',
 'Italian Restaurant',
 'Sporting Goods Shop',
 'Gourmet Shop',
 'Bank',
 'Beer Store',
 'Neighborhood',
 'Playground',
 'Candy Store',
 'Trail',
 'Café',
 'Bagel Shop',
 'Tapas Restaurant',
 'Tea Room',
 'American Restaurant',
 'Cemetery',
 'Yoga Studio',
 'Restaurant',
 'Chiropractor',
 'Cantonese Restaurant',
 'Thai Restaurant',
 'Breakfast Spot',
 'Coffee Shop',
 'German Restaurant',
 'Mexican Restaurant',
 'Sushi Restaurant',
 'Pharmacy',
 'Movie Theater',
 'Bakery',
 'Burger Joint',
 'Modern European Restaurant',
 'Japanese Restaurant',
 'Sandwich Place',
 'Pub',
 'Fried Chicken Joint',
 'Gastropub',
 'Fast Food Restaurant',
 'Pizza Place',
 'Tennis Court',
 'Vietnamese Restaurant',
 'Office',
 'Electronics Store',
 'Bookstore',
 'Liquor Store',
 'Supermarket',
 'French Restaurant',
 'Spa',
 'Middle Eastern Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Hotel',
 'History Museum',

#### __Apply one-hot encoding to the "Venue Category" column so that each unique category becomes its own column for the machine learning step later__

In [26]:
# Apply one hot encoding
toronto_venues_onehot = pd.get_dummies(data=toronto_venues, drop_first=False, prefix="", prefix_sep="", columns=['Venue Category'])

toronto_venues_onehot.drop(columns=['Unnamed: 0']).head()


Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Distance,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Beach,Beer Bar,Beer Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Butcher,Café,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Cemetery,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Clothing Store,Coffee Shop,College Gym,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Cupcake Shop,Curling Ice,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Electronics Store,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gastropub,German Restaurant,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Health Food Store,Historic Site,History Museum,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Latin American Restaurant,Liquor Store,Lounge,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Museum,Music School,Music Store,Music Venue,Nail Salon,Neighborhood.1,New American Restaurant,Nudist Beach,Office,Opera House,Other Great Outdoors,Paper / Office Supplies Store,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Restaurant,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Social Club,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Trail,Tree,University,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M2L,York Mills,43.757192,-79.379865,St. Andrews Park,This spot is popular,542,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M2L,York Mills,43.757192,-79.379865,Hwy 401 at Bayview,This spot is popular,887,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M2L,York Mills,43.757192,-79.379865,Liberty Club Gym,This spot is popular,931,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M2L,York Mills,43.757192,-79.379865,The Empire Fitness Room,This spot is popular,1000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,M4W,Rosedale,43.68194,-79.378474,Summerhill Market,This spot is popular,539,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
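
One detail in the output above is the `Neighborhood.1` column: Foursquare has a venue category that is itself called "Neighborhood", so encoding with an empty prefix produces a second column with the same name as the existing `Neighborhood` column. The sketch below shows one way to sidestep the clash while counting venues per neighborhood, which is a common next step before clustering. It runs on toy rows; "Venue X" and "Venue Y" are made-up names, and the index name `Neighborhood Name` is chosen here only for illustration.

```python
import pandas as pd

# Toy venue rows; "Venue X" and "Venue Y" are hypothetical entries
venues = pd.DataFrame(
    {
        "Neighborhood": ["York Mills", "Rosedale", "Rosedale", "Annex"],
        "Venue": ["St. Andrews Park", "Summerhill Market", "Venue X", "Venue Y"],
        "Venue Category": ["Park", "Grocery Store", "Grocery Store", "Neighborhood"],
    }
)

# Encode only the category column. Keeping the grouping key outside the
# encoded frame avoids a name clash with the "Neighborhood" venue category.
category_dummies = pd.get_dummies(venues["Venue Category"], dtype=int)

# Count venues of each category per neighborhood
group_key = venues["Neighborhood"].rename("Neighborhood Name")
counts = category_dummies.groupby(group_key).sum()
print(counts)
```

The resulting frame has one row per neighborhood and one column per category, which is the shape a clustering step typically expects.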


In [27]:
# Confirm data type for each column
toronto_venues_onehot.dtypes

Unnamed: 0                         int64
Postal Code                       object
Neighborhood                      object
Neighborhood Latitude            float64
Neighborhood Longitude           float64
Venue                             object
Venue Summary                     object
Distance                           int64
Accessories Store                  uint8
Airport                            uint8
Airport Food Court                 uint8
Airport Gate                       uint8
Airport Lounge                     uint8
Airport Service                    uint8
Airport Terminal                   uint8
American Restaurant                uint8
Animal Shelter                     uint8
Art Gallery                        uint8
Art Museum                         uint8
Arts & Crafts Store                uint8
Asian Restaurant                   uint8
Athletics & Sports                 uint8
Auto Dealership                    uint8
BBQ Joint                          uint8
Baby Store      

In [28]:
# Keep only the venue categories of interest in this study: places for food (restaurants, cafés, grocery stores)
# plus a few everyday amenities such as pharmacies and shopping malls. This list was created manually.
list_of_features_of_interest = [
 'Neighborhood',
 'Neighborhood Latitude',
 'Neighborhood Longitude',
 'American Restaurant',
 'Asian Restaurant',
 'Bagel Shop',
 'Bakery',
 'BBQ Joint',
 'Breakfast Spot',
 'Bistro',
 'Bubble Tea Shop',
 'Burger Joint',
 'Café',
 'Caribbean Restaurant',
 'Chinese Restaurant',
 'Coffee Shop',
 'Comfort Food Restaurant',
 'Falafel Restaurant',
 'Fast Food Restaurant',
 'Fish & Chips Shop',
 'Food & Drink Shop',
 'Food Court',
 'French Restaurant',
 'Fried Chicken Joint',
 'German Restaurant',
 'Greek Restaurant',
 'Indian Restaurant',
 'Italian Restaurant',
 'Japanese Restaurant',
 'Jewish Restaurant',
 'Latin American Restaurant',
 'Mediterranean Restaurant',
 'Mexican Restaurant',
 'Middle Eastern Restaurant',
 'Modern European Restaurant',
 'New American Restaurant',
 'Pizza Place',
 'Portuguese Restaurant',
 'Poutine Place',
 'Ramen Restaurant',
 'Restaurant',
 'Sandwich Place',
 'Seafood Restaurant',
 'Smoothie Shop',
 'Spanish Restaurant',
 'Steakhouse',
 'Sushi Restaurant',
 'Tapas Restaurant',
 'Thai Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Vietnamese Restaurant',
 'Wings Joint',
 
 'Convenience Store',
 'Farmers Market',
 'Grocery Store',
 'Health Food Store',
 'Supermarket',
 'Shopping Mall',
 'Pharmacy'
]
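
The list above was assembled by hand. As a rough sketch of an alternative, a similar list could be derived by keyword matching on the column names. The keyword tuple and the sample column names below are assumptions chosen for illustration, and the result of such a filter would still need a manual review.

```python
# Hypothetical keyword filter; FOOD_KEYWORDS and sample_columns are illustrative assumptions.
FOOD_KEYWORDS = ("Restaurant", "Joint", "Café", "Coffee", "Bakery", "Grocery", "Supermarket")


def select_food_columns(columns, keywords=FOOD_KEYWORDS):
    """Return the column names that contain any keyword, keeping the original order."""
    return [col for col in columns if any(key in col for key in keywords)]


sample_columns = [
    "Airport", "American Restaurant", "BBQ Joint", "Bank",
    "Café", "Coffee Shop", "Grocery Store", "Yoga Studio",
]
food_columns = select_food_columns(sample_columns)
print(food_columns)
```

On the real data, the argument would be the column index of the one-hot frame, and the returned list could be compared against the manual one to spot omissions.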


In [29]:
# Some clean-up on the venue dataframe
toronto_venues_onehot = toronto_venues_onehot.drop(columns = ['Unnamed: 0'])

In [30]:
toronto_venues_onehot.head()

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Distance,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Beach,Beer Bar,Beer Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Butcher,Café,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Cemetery,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Clothing Store,Coffee Shop,College Gym,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Cupcake Shop,Curling Ice,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Electronics Store,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gastropub,German Restaurant,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Health Food Store,Historic Site,History Museum,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Latin American Restaurant,Liquor Store,Lounge,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Museum,Music School,Music Store,Music Venue,Nail Salon,Neighborhood.1,New American Restaurant,Nudist Beach,Office,Opera House,Other Great Outdoors,Paper / Office Supplies Store,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Restaurant,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Social Club,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Trail,Tree,University,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M2L,York Mills,43.757192,-79.379865,St. Andrews Park,This spot is popular,542,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M2L,York Mills,43.757192,-79.379865,Hwy 401 at Bayview,This spot is popular,887,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M2L,York Mills,43.757192,-79.379865,Liberty Club Gym,This spot is popular,931,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M2L,York Mills,43.757192,-79.379865,The Empire Fitness Room,This spot is popular,1000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,M4W,Rosedale,43.68194,-79.378474,Summerhill Market,This spot is popular,539,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
# Some more clean-up on the venue dataframe
# One of the venue categories is itself named 'Neighborhood', which conflicts with the 'Neighborhood' key column, so the duplicate column is removed
toronto_venues_onehot_clean = toronto_venues_onehot.loc[:, ~toronto_venues_onehot.columns.duplicated()]
toronto_venues_onehot_clean.head()

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Distance,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Beach,Beer Bar,Beer Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Butcher,Café,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Cemetery,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Clothing Store,Coffee Shop,College Gym,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Cupcake Shop,Curling Ice,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Electronics Store,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gastropub,German Restaurant,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Health Food Store,Historic Site,History Museum,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Latin American Restaurant,Liquor Store,Lounge,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Museum,Music School,Music Store,Music Venue,Nail Salon,New American Restaurant,Nudist Beach,Office,Opera House,Other Great Outdoors,Paper / Office Supplies Store,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Restaurant,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Social Club,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Trail,Tree,University,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M2L,York Mills,43.757192,-79.379865,St. Andrews Park,This spot is popular,542,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M2L,York Mills,43.757192,-79.379865,Hwy 401 at Bayview,This spot is popular,887,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M2L,York Mills,43.757192,-79.379865,Liberty Club Gym,This spot is popular,931,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M2L,York Mills,43.757192,-79.379865,The Empire Fitness Room,This spot is popular,1000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,M4W,Rosedale,43.68194,-79.378474,Summerhill Market,This spot is popular,539,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
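
To illustrate how the `columns.duplicated()` mask used above behaves, here is a minimal sketch on a toy frame. The frame and its values are made up for illustration only.

```python
import pandas as pd

# Toy frame with a repeated column label, standing in for the venue table.
toy = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0]],
    columns=["Neighborhood", "Park", "Neighborhood"],
)

# duplicated() marks the second and later occurrences of a label as True.
mask = toy.columns.duplicated()

# Negating the mask keeps only the first occurrence of each label.
toy_clean = toy.loc[:, ~mask]

print(mask.tolist())
print(list(toy_clean.columns))
```

One caveat worth keeping in mind: when a table is loaded with `read_csv`, pandas by default renames a repeated header to a suffixed name such as `Neighborhood.1`. In that case the labels are no longer identical, and the extra column would have to be dropped by name instead.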


#### __Apply K-means clustering__

In [32]:
# Final construction of dataframe to be used in machine learning
# Plus group all rows by neighborhood
toronto_venues_onehot_clean_ml = toronto_venues_onehot_clean[list_of_features_of_interest].drop(columns = ['Neighborhood Latitude', 'Neighborhood Longitude']).groupby('Neighborhood').sum()

toronto_venues_onehot_clean_ml.head()

Neighborhood,American Restaurant,Asian Restaurant,Bagel Shop,Bakery,BBQ Joint,Breakfast Spot,Bistro,Bubble Tea Shop,Burger Joint,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Falafel Restaurant,Fast Food Restaurant,Fish & Chips Shop,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,German Restaurant,Greek Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Latin American Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,New American Restaurant,Pizza Place,Portuguese Restaurant,Poutine Place,Ramen Restaurant,Restaurant,Sandwich Place,Seafood Restaurant,Smoothie Shop,Spanish Restaurant,Steakhouse,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Convenience Store,Farmers Market,Grocery Store,Health Food Store,Supermarket,Shopping Mall,Pharmacy
Annex,1,0,0,2,1,0,0,1,2,5,0,0,8,0,0,1,0,0,0,2,0,0,1,2,4,1,2,1,2,2,0,0,1,4,0,0,0,4,2,0,0,0,2,0,0,1,4,0,1,0,0,1,0,0,0,1
Bay Street Corridor,2,1,0,1,0,2,0,2,1,5,0,2,6,0,1,0,0,0,1,0,0,0,1,0,3,3,0,0,0,0,1,1,0,2,1,0,3,1,1,2,0,0,2,2,1,1,3,0,0,0,0,0,0,1,1,0
Bedford Park,1,0,1,1,0,0,0,0,0,1,0,0,3,1,0,2,0,0,0,0,0,0,1,1,3,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,1,1,0,0,0,1
Forest Hill South,1,0,2,1,0,1,0,0,2,4,0,0,9,0,0,0,0,1,0,1,1,1,0,0,4,1,0,0,0,0,1,1,0,4,0,0,0,1,3,0,0,0,0,4,1,2,2,1,0,0,0,2,0,1,0,1
Lawrence Park South,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
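
The grouping step above turns one row per venue into one row per neighborhood, with each cell holding a venue count. A minimal sketch on a toy one-hot frame (all values are made up for illustration):

```python
import pandas as pd

# Toy one-hot venue table: one row per venue, exactly one category flag set per row.
venues = pd.DataFrame({
    "Neighborhood": ["Rosedale", "York Mills", "Rosedale", "York Mills", "Rosedale"],
    "Café": [1, 0, 0, 1, 1],
    "Park": [0, 1, 0, 0, 0],
    "Grocery Store": [0, 0, 1, 0, 0],
})

# Summing the flags per neighborhood yields the number of venues in each category.
counts = venues.groupby("Neighborhood").sum()
print(counts)
```

Note that `groupby` sorts the resulting index by the group key by default, so the rows come out in alphabetical order of neighborhood name rather than in their original order of appearance.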


In [33]:
# Import k-means library
from sklearn.cluster import KMeans

# Do k-means clustering, the number of clusters is set to 5
kmeans_cluster = KMeans(n_clusters = 5, random_state = 0).fit(toronto_venues_onehot_clean_ml)
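
The number of clusters is fixed at 5 above. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (`inertia_`) across several values of k. The sketch below does this on a small made-up matrix; the data, the range of k, and `n_init=10` are assumptions for illustration and are not taken from this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy count matrix: three loose groups of "neighborhoods" described by two venue columns.
X = np.array([
    [0, 1], [1, 0], [1, 1],
    [8, 9], [9, 8], [9, 9],
    [0, 9], [1, 8], [1, 9],
], dtype=float)

# Fit k-means for k = 1..4 and record the inertia of each fit.
k_values = list(range(1, 5))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(model.inertia_)

for k, value in zip(k_values, inertias):
    print(k, round(value, 2))
```

A value of k beyond which the inertia stops dropping sharply is often taken as a reasonable choice, although this remains a heuristic.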

#### __Sum all the venues and sort in descending order among 5 clusters__

In [35]:
cluster_venue_df = pd.DataFrame(kmeans_cluster.cluster_centers_)
cluster_venue_df.columns = toronto_venues_onehot_clean_ml.columns
cluster_venue_df.index = ['Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5']
cluster_venue_df['Total Sum'] = cluster_venue_df.sum(axis = 1)
cluster_venue_df.sort_values(axis = 0, by = ['Total Sum'], ascending=False)

Unnamed: 0,American Restaurant,Asian Restaurant,Bagel Shop,Bakery,BBQ Joint,Breakfast Spot,Bistro,Bubble Tea Shop,Burger Joint,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Falafel Restaurant,Fast Food Restaurant,Fish & Chips Shop,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,German Restaurant,Greek Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Latin American Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,New American Restaurant,Pizza Place,Portuguese Restaurant,Poutine Place,Ramen Restaurant,Restaurant,Sandwich Place,Seafood Restaurant,Smoothie Shop,Spanish Restaurant,Steakhouse,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Convenience Store,Farmers Market,Grocery Store,Health Food Store,Supermarket,Shopping Mall,Pharmacy,Total Sum
Cluster 5,2.0,1.0,0.0,1.0,0.0,2.0,0.0,2.0,1.0,5.0,0.0,2.0,6.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,2.0,1.0,0.0,3.0,1.0,1.0,2.0,0.0,0.0,2.0,2.0,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,54.0
Cluster 3,0.0,0.0,0.0,3.0,1.0,3.0,1.0,0.0,0.0,4.0,0.0,0.0,13.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2.0,4.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,2.0,0.0,1.0,0.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,53.0
Cluster 1,1.0,0.0,1.333333,1.333333,0.333333,0.666667,0.0,0.333333,1.666667,3.666667,0.0,0.0,7.666667,0.0,0.0,0.666667,0.0,0.333333,0.0,1.0,0.666667,0.666667,0.333333,0.666667,4.0,1.0,0.666667,0.333333,0.666667,1.0,0.333333,0.666667,0.333333,3.333333,0.0,0.0,0.0,2.333333,2.333333,0.0,0.0,0.0,0.666667,1.666667,0.666667,1.333333,2.0,0.666667,0.333333,0.0,0.0,2.0,0.0,0.333333,0.0,1.333333,50.333333
Cluster 4,0.5,1.0,1.0,1.5,1.5,2.0,0.0,0.0,2.0,0.5,1.0,0.0,5.0,0.0,0.0,0.0,0.5,0.5,0.0,0.5,0.5,0.0,0.5,1.5,0.5,1.5,0.0,0.0,0.0,1.5,0.0,0.0,0.0,2.0,0.5,0.0,0.5,1.5,2.5,0.0,0.5,0.0,0.0,1.5,0.0,1.0,0.5,0.0,0.0,0.5,0.0,1.5,0.5,1.0,0.5,0.5,38.5
Cluster 2,0.2,0.0,0.2,0.2,0.0,0.0,0.0,0.0,0.4,0.6,0.0,0.0,1.0,0.2,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,1.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.4,0.2,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.2,0.0,0.2,0.4,0.0,0.0,0.0,0.4,7.2
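
To make the meaning of the `Total Sum` column concrete: once k-means has converged, each cluster center is (up to the convergence tolerance) the mean of the rows assigned to that cluster, so the sum across a center is the average number of venues per neighborhood in the cluster. A minimal sketch with made-up counts and an assumed cluster assignment:

```python
import pandas as pd

# Toy venue counts for three neighborhoods (illustrative values only).
counts = pd.DataFrame(
    {"Café": [2, 4, 0], "Bakery": [1, 1, 0]},
    index=["A", "B", "C"],
)

# Assumed cluster assignment: A and B share a cluster, C is on its own.
labels = pd.Series([0, 0, 1], index=counts.index)

# The per-cluster mean reproduces what a converged k-means reports as its centers.
centroids = counts.groupby(labels).mean()
centroids["Total Sum"] = centroids.sum(axis=1)
print(centroids)
```

Here cluster 0 averages 3 cafés and 1 bakery per neighborhood, giving a total of 4, while the single empty neighborhood in cluster 1 gives a total of 0.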


## __Results__

This section presents the analysis results. The table in the previous section shows that Cluster 5 has the highest venue score (54), followed closely by Cluster 3 (53) and Cluster 1 (about 50.3). Each score is the sum of a cluster's center values, which can be read as the average number of venues of interest per neighborhood in that cluster. These three clusters therefore offer the best access to places for food, including restaurants of all kinds, joints, and convenience stores.

#### __Show the cluster ID that each neighborhood belongs to__

As the final step, we identify the neighborhoods that belong to the clusters with the highest venue scores.

In [38]:
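# Pair each neighborhood with its cluster ID (shifted so that IDs start at 1).
# Note: labels_ follows the row order of the dataframe passed to fit(), so neigh_list must list the neighborhoods in that same order.
neigh_cluster = pd.DataFrame([neigh_list, 1 + kmeans_cluster.labels_]).T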
neigh_cluster.columns = ['Neighborhood', 'Cluster']
neigh_cluster.head(15)

Unnamed: 0,Neighborhood,Cluster
0,York Mills,1
1,Rosedale,5
2,Moore Park,2
3,Forest Hill South,1
4,Lawrence Park South,2
5,Waterfront Communities,4
6,The Islands,1
7,Annex,2
8,Leaside,4
9,Bay Street Corridor,2
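
`labels_` is ordered like the rows of the frame passed to `fit`, and a `groupby` result is sorted by neighborhood name by default. A minimal sketch of building the mapping directly from that frame's index, so that names and labels cannot drift apart, is given below. `features` and `labels` are toy stand-ins (assumptions) for `toronto_venues_onehot_clean_ml` and `kmeans_cluster.labels_`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: `features` plays the role of the grouped frame, `labels` of kmeans labels_.
features = pd.DataFrame(
    {"Café": [5, 0, 4]},
    index=pd.Index(["Annex", "Moore Park", "Rosedale"], name="Neighborhood"),
)
labels = np.array([0, 1, 0])

# Take the names from the same frame that was clustered, so row i matches label i.
neigh_cluster_sketch = pd.DataFrame({
    "Neighborhood": features.index,
    "Cluster": labels + 1,  # shift so cluster IDs start at 1
})
print(neigh_cluster_sketch)
```

A quick sanity check on the real data is that a cluster with a single member should have a center identical to that neighborhood's row of counts.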


## __Discussions__

This section discusses the observations and makes recommendations based on the results. The neighborhoods that belong to Clusters 5, 3, and 1 are identified in turn below. Cluster 5, which has the highest venue score, contains only one neighborhood, Rosedale, located in downtown Toronto. It is followed closely by Bedford Park in Cluster 3, and by York Mills, Forest Hill South, and The Islands in Cluster 1. These neighborhoods are therefore recommended to new immigrants who are looking for a high-income, well-educated community with good access to places to eat.

In [39]:
neigh_cluster[neigh_cluster['Cluster'] == 5]

Unnamed: 0,Neighborhood,Cluster
1,Rosedale,5


In [40]:
neigh_cluster[neigh_cluster['Cluster'] == 3]

Unnamed: 0,Neighborhood,Cluster
10,Bedford Park,3


In [41]:
neigh_cluster[neigh_cluster['Cluster'] == 1]

Unnamed: 0,Neighborhood,Cluster
0,York Mills,1
3,Forest Hill South,1
6,The Islands,1
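
The three per-cluster lookups above can also be done in one step. The sketch below selects the three clusters with the highest `Total Sum` and filters the neighborhoods accordingly; the two small tables are re-typed from the outputs shown in this notebook, and the variable names are assumptions.

```python
import pandas as pd

# Total Sum per cluster, copied from the cluster table above.
cluster_totals = pd.Series({1: 50.333333, 2: 7.2, 3: 53.0, 4: 38.5, 5: 54.0})

# Neighborhood-to-cluster pairs, copied from the outputs above.
neigh_cluster_copy = pd.DataFrame({
    "Neighborhood": [
        "York Mills", "Rosedale", "Moore Park", "Forest Hill South",
        "Lawrence Park South", "Waterfront Communities", "The Islands",
        "Annex", "Leaside", "Bay Street Corridor", "Bedford Park",
    ],
    "Cluster": [1, 5, 2, 1, 2, 4, 1, 2, 4, 2, 3],
})

# Pick the three highest-scoring clusters, then keep only their neighborhoods.
top_clusters = cluster_totals.nlargest(3).index.tolist()
recommended = neigh_cluster_copy[neigh_cluster_copy["Cluster"].isin(top_clusters)]
print(top_clusters)
print(recommended)
```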


## __Conclusions__

It can be concluded that the following neighborhoods are suitable for new immigrants who are seeking a community with a relatively high income and education level and good access to places for food within the neighborhood:

 - Rosedale
 - Bedford Park
 - York Mills
 - Forest Hill South
 - The Islands


This study presented a comparative analysis of the neighborhoods of Toronto to provide some guidance to new immigrants who are deciding on the most suitable community in which to settle down. The analysis particularly targets immigrants who are looking for a neighborhood with a high income and education level, as well as good access to places for food such as restaurants and convenience stores. The study can be customized for other immigrants who take other criteria and driving factors into account. Due to its complexity, the analysis did not consider housing availability and prices within each neighborhood. Future studies may include these factors and other considerations to achieve a more comprehensive scope.

### __THANK YOU__