<h1>Find the best neighborhood in Toronto to open a Coffee shop.</h1>

<i>Vu Khoa 2021</i>

<h2>1. Data Collection</h2>
<h3>1.1 Toronto neighborhoods broken down by postal code</h3>

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
List_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(List_url).text
soup = BeautifulSoup(source, 'lxml')
table_contents=[]
table=soup.find('table')
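for row in table.find_all('td'):
    # skip postal codes that have no borough assigned
    if row.span.text == 'Not assigned':
        continue
    cell = {}
    # the first three characters of the cell are the postal code
    cell['PostalCode'] = row.p.text[:3]
    # the text before '(' is the borough, the text inside the parentheses lists the neighborhoods
    cell['Borough'] = (row.span.text).split('(')[0]
    cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
    table_contents.append(cell)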

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
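The string handling in the loop above can be tried on a single cell. This is only a sketch: the helper name `split_cell` and the sample strings are assumptions for illustration, and the real cell text comes from the Wikipedia table.

```python
def split_cell(text):
    """Split 'Borough(Neighborhood A / Neighborhood B)' into borough and neighborhoods."""
    # everything before the first '(' is the borough
    borough, _, rest = text.partition('(')
    # drop the closing parenthesis and turn ' /' separators into commas
    neighborhoods = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighborhoods


# assumed sample of the cell text, modelled on the rows shown above
sample = 'Downtown Toronto(Regent Park / Harbourfront)'
borough, neighborhoods = split_cell(sample)
print(borough, '|', neighborhoods)
```

Using `partition` instead of chained `split` calls keeps the borough and the neighborhood list in one pass and does not fail when a cell has no parentheses.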


<b>1.2 Load Toronto geospatial coordinates and merge them with the Toronto postal code data</b>

In [3]:
# raw string so the backslashes in the Windows path are not treated as escape sequences
geo_df=pd.read_csv(r'C:\Khoa\DataScience_Learning\Geospatial_Coordinates.csv')
geo_df.head()
geo_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)
geo_merged = pd.merge(geo_df, df, on='PostalCode')
geo_data=geo_merged[['PostalCode','Borough','Neighborhood','Latitude','Longitude']]
geo_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
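When two tables are merged on `PostalCode`, it can be worth checking which codes found no match. The sketch below uses the `indicator` and `validate` options of `pd.merge` on toy data; the `M9Z` row is made up purely to show an unmatched code.

```python
import pandas as pd

# toy postal-code table; the 'M9Z' row is invented for illustration
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Unknown'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# indicator=True adds a '_merge' column; validate raises if a key is duplicated
checked = pd.merge(codes, coords, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```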


<b>1.3 Toronto neighborhood populations broken down by postal code</b>

In [4]:
# Load this data from Stats Canada
df_pop = pd.read_csv('https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/File.cfm?T=1201&SR=1&RPP=9999&PR=0&CMA=0&CSD=0&S=22&O=A&Lang=Eng&OFT=CSV',encoding = 'unicode_escape')
# Rename the columns appropriately
df_pop = df_pop.rename(columns={'Geographic code':'PostalCode', 'Geographic name':'PostalCod2', 'Province or territory':'Province', 'Incompletely enumerated Indian reserves and Indian settlements, 2016':'Incomplete', 'Population, 2016':'Population_2016', 'Total private dwellings, 2016':'TotalPrivDwellings', 'Private dwellings occupied by usual residents, 2016':'PrivDwellingsOccupied'})
df_pop= df_pop.drop(columns=['PostalCod2', 'Province', 'Incomplete', 'TotalPrivDwellings', 'PrivDwellingsOccupied'])

# Get rid of the first row 
df_pop = df_pop.iloc[1:]
df_pop.head()

Unnamed: 0,PostalCode,Population_2016
1,A0A,46587.0
2,A0B,19792.0
3,A0C,12587.0
4,A0E,22294.0
5,A0G,35266.0


<b>1.4 Merge Toronto neighborhood population data with Toronto postal code data</b>

In [5]:
#Merge the Toronto Pop data with geo postalcode data
gf_new = pd.merge(df_pop, geo_data, on='PostalCode', how='right')
# sort on population
gf_new = gf_new.sort_values(by=['Population_2016'], ascending=False)

# display the new dataframe
gf_new.head()

Unnamed: 0,PostalCode,Population_2016,Borough,Neighborhood,Latitude,Longitude
22,M2N,75897.0,North York,Willowdale South,43.77012,-79.408493
0,M1B,66108.0,Scarborough,"Malvern, Rouge",43.806686,-79.194353
18,M2J,58293.0,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
100,M9V,55959.0,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
14,M1V,54680.0,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577


<b>1.5 Toronto neighborhood average after-tax income broken down by postal code</b>

In [6]:
# It was easier to extract this data manually from Stats Canada and load it than it was to scrape it.
# It was only accessible through individual queries per postal code on the StatCan web site.
df_income = pd.read_csv(r'C:\TorontoAvgIncome.csv',encoding = 'unicode_escape')
# Rename the after-tax income column to a more manageable name
df_income = df_income.rename(columns={"after-tax income of households in 2015":"AfterTaxIncome2015"})
df_income.head()

Unnamed: 0,PostalCode,AfterTaxIncome2015,Population_2016
0,M2P,115237,7843.0
1,M5M,111821,25975.0
2,M4N,109841,15330.0
3,M5R,108271,26496.0
4,M8X,97210,10787.0


<b>1.6 Merge Toronto neighborhood income data with Toronto postal code data</b>

In [7]:
#Merge the Toronto Income data with geo postalcode data
gf_new = pd.merge(df_income, gf_new, on='PostalCode', how='right')
# replace the 'Null' placeholders with 0 so the column can be cast to float
gf_new = gf_new.replace('Null', 0)
# cast the income column to float
gf_new['AfterTaxIncome2015'] = gf_new['AfterTaxIncome2015'].astype('float64') 
# Sort on Income
gf_new = gf_new.sort_values(by=['AfterTaxIncome2015'], ascending=False)

# save to a local file called After_all.csv
gf_new.to_csv('After_all.csv')
# display the new dataframe
gf_new.head()


Unnamed: 0,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude
0,M2P,115237.0,7843.0,7843.0,North York,York Mills West,43.752758,-79.400049
1,M5M,111821.0,25975.0,25975.0,North York,"Bedford Park, Lawrence Manor East",43.733283,-79.41975
2,M4N,109841.0,15330.0,15330.0,Central Toronto,Lawrence Park,43.72802,-79.38879
3,M5R,108271.0,26496.0,26496.0,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
4,M8X,97210.0,10787.0,10787.0,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
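The `Population_2016_x` and `Population_2016_y` columns appear because both frames carry a `Population_2016` column when they are merged. One possible way to avoid the duplicate, and to keep missing incomes as `NaN` rather than 0, is sketched below on toy data. The third row's values are made up, and this is an alternative, not what the cell above does.

```python
import pandas as pd

# first two rows follow the table above; the last row is an invented 'Null' case
income = pd.DataFrame({
    'PostalCode': ['M2P', 'M5M', 'M5W'],
    'AfterTaxIncome2015': ['115237', '111821', 'Null'],
    'Population_2016': [7843.0, 25975.0, 15.0],
})
base = pd.DataFrame({
    'PostalCode': ['M2P', 'M5M', 'M5W'],
    'Population_2016': [7843.0, 25975.0, 15.0],
})

# keep only the columns that are new, so no _x/_y suffixes are created
merged = pd.merge(income[['PostalCode', 'AfterTaxIncome2015']], base,
                  on='PostalCode', how='right')
# non-numeric entries such as 'Null' become NaN instead of 0
merged['AfterTaxIncome2015'] = pd.to_numeric(merged['AfterTaxIncome2015'], errors='coerce')
print(merged)
```

Keeping the gaps as `NaN` makes it harder to mistake a missing income for a real income of 0 in later steps.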


<h1>2. Explore and cluster the neighborhoods in Toronto</h1>

In [7]:
#!pip install geopy
#!pip install folium

In [8]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<i>List all rows whose borough name contains "Toronto"</i>

In [9]:
toronto_data=gf_new[gf_new['Borough'].str.contains("Toronto")]
toronto_data.head()

Unnamed: 0,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude
2,M4N,109841.0,15330.0,15330.0,Central Toronto,Lawrence Park,43.72802,-79.38879
3,M5R,108271.0,26496.0,26496.0,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
10,M5V,89901.0,49195.0,49195.0,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442
11,M4J,87847.0,35738.0,35738.0,East York/East Toronto,The Danforth East,43.685347,-79.338106
14,M6J,81956.0,32684.0,32684.0,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975


In [10]:
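import os
# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names below are an assumed convention)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'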

In [11]:
def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
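The function above indexes straight into the response, so a venue without a category or an empty result would raise an error. Below is a hedged sketch of a more defensive parsing step. The helper name `parse_explore_items` and the sample payload are assumptions; the payload only mimics the nesting used in the function above.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten an explore-style response into row tuples, skipping incomplete venues."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        location = venue.get('location', {})
        # skip venues that lack a category or coordinates
        if not categories or 'lat' not in location or 'lng' not in location:
            continue
        rows.append((name, lat, lng, venue.get('name'),
                     location['lat'], location['lng'], categories[0]['name']))
    return rows


# made-up payload with one complete venue and one venue without categories
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Sample Cafe',
               'location': {'lat': 43.675, 'lng': -79.405},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'No Category Place',
               'location': {'lat': 43.676, 'lng': -79.406},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample_payload, 'The Annex', 43.67271, -79.405678)
print(rows)
```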

In [12]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Lawrence Park
The Annex, North Midtown, Yorkville
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
The Danforth  East
Little Portugal, Trinity
Rosedale
Moore Park, Summerhill East
Roselawn
The Beaches
North Toronto West
Runnymede, Swansea
Berczy Park
Forest Hill North & West
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Harbourfront East, Union Station, Toronto Islands
Richmond, Adelaide, King
St. James Town
India Bazaar, The Beaches West
Parkdale, Roncesvalles
Studio District
Davisville
Christie
High Park, The Junction South
The Danforth West, Riverdale
Dufferin, Dovercourt Village
Davisville North
Brockton, Parkdale Village, Exhibition Place
University of Toronto, Harbord
Regent Park, Harbourfront
Church and Wellesley
Garden District, Ryerson
Central Bay Street
Kensington Market, Chinatown, Grange Park
St. James Town, Cabbagetown
Commerce Court, Victoria Hotel
First Canadian Place, Underground city
Toronto 

In [13]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.72802,-79.38879,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72802,-79.38879,Zodiac Swim School,43.728532,-79.38286,Swim School
2,Lawrence Park,43.72802,-79.38879,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
3,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Ezra's Pound,43.675153,-79.405858,Café
4,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Roti Cuisine of India,43.674618,-79.408249,Indian Restaurant


<i>Let's check how many venues were returned for each neighborhood</i>

In [14]:
toronto_venues.groupby('Neighborhood').count().head()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,59,59,59,59,59,59
"Brockton, Parkdale Village, Exhibition Place",25,25,25,25,25,25
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,64,64,64,64,64,64
Christie,16,16,16,16,16,16


<b>Let's find out how many unique categories can be curated from all the returned venues</b>

In [15]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 229 unique categories.


<b>1.7 Keep only coffee shops and coffee-related venue categories</b>

In [16]:
# Here we manually pick out the coffee-related categories ('features') from the unique venue list that we want to compare for similarity during clustering
rest_list = ['Coffee Shop', 'Café','Food & Drink Shop']
rest_pd = pd.DataFrame(rest_list)
#rest_pd
# rename the column so that it matches toronto_venues
rest_pd = rest_pd.rename(columns={0:'Venue Category'})

# Join the 2 dataframes on the venue category
TO_new = pd.merge(toronto_venues, rest_pd, on='Venue Category', how='right')

# display the new dataframe
#TO_new

TO_new.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,7,7,7,7,7,7
"Brockton, Parkdale Village, Exhibition Place",5,5,5,5,5,5
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",1,1,1,1,1,1
Central Bay Street,15,15,15,15,15,15
Christie,4,4,4,4,4,4
Church and Wellesley,8,8,8,8,8,8
"Commerce Court, Victoria Hotel",19,19,19,19,19,19
Davisville,4,4,4,4,4,4
Davisville North,1,1,1,1,1,1
"Dufferin, Dovercourt Village",1,1,1,1,1,1


<h2>Analyze Each Neighborhood</h2>

In [17]:
# one hot encoding
toronto_onehot = pd.get_dummies(TO_new[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = TO_new['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Café,Coffee Shop,Food & Drink Shop
0,"The Annex, North Midtown, Yorkville",1,0,0
1,"The Annex, North Midtown, Yorkville",1,0,0
2,"The Annex, North Midtown, Yorkville",1,0,0
3,"Little Portugal, Trinity",1,0,0
4,"Little Portugal, Trinity",1,0,0


In [18]:
toronto_onehot.shape

(233, 4)

<h2>Group rows by neighborhood, taking the mean of the frequency of occurrence of each category</h2>

In [19]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Café,Coffee Shop,Food & Drink Shop
0,Berczy Park,0.142857,0.857143,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.6,0.4,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,1.0,0.0
3,Central Bay Street,0.266667,0.733333,0.0
4,Christie,0.75,0.25,0.0
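Taking the mean of one-hot columns gives the share of each category within a neighborhood. As a small check, the sketch below rebuilds the Berczy Park row, assuming its 7 coffee-related venues are 1 Café and 6 Coffee Shops (which is consistent with the shares shown above).

```python
import pandas as pd

# assumed breakdown of the 7 Berczy Park venues: 1 Café and 6 Coffee Shops
venues = pd.DataFrame({
    'Neighborhood': ['Berczy Park'] * 7,
    'Venue Category': ['Café'] + ['Coffee Shop'] * 6,
})

# one column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the fraction of venues in that category
shares = onehot.groupby('Neighborhood').mean()
print(shares.round(6))
```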


<h3>Print out the top 3 most common venues for each neighborhood</h3>

In [20]:
import numpy as np
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 3rd, fall back to the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Berczy Park,Coffee Shop,Café,Food & Drink Shop
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Food & Drink Shop
2,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Food & Drink Shop,Café
3,Central Bay Street,Coffee Shop,Café,Food & Drink Shop
4,Christie,Café,Coffee Shop,Food & Drink Shop


<h1>2. Cluster Neighborhoods</h1>

<h3>2.1 Finding clusters based on the ratio of Café, Coffee Shop and Food & Drink Shop venues</h3>

<b>2.1.1 Finding the best number of clusters</b>

In [21]:
toronto_grouped.head()

Unnamed: 0,Neighborhood,Café,Coffee Shop,Food & Drink Shop
0,Berczy Park,0.142857,0.857143,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.6,0.4,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,1.0,0.0
3,Central Bay Street,0.266667,0.733333,0.0
4,Christie,0.75,0.25,0.0


In [22]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

TO_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# Use silhouette score to find optimal number of clusters to segment the data
kclusters = np.arange(2,10)
results = {}
for size in kclusters:
    model = KMeans(n_clusters = size).fit(TO_grouped_clustering)
    predictions = model.predict(TO_grouped_clustering)
    results[size] = silhouette_score(TO_grouped_clustering, predictions)

best_size = max(results, key=results.get)
print('The best number of clusters is:',best_size)

The best number of clusters is: 8
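Because `KMeans` starts from random centres, the silhouette search can return a different best size from run to run unless `random_state` is fixed. The sketch below shows the same search with a fixed seed on synthetic points; the three centres and the noise level are made up so that the expected answer is easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# 10 points scattered tightly around each of 3 made-up centres
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
points = np.vstack([c + rng.normal(scale=0.05, size=(10, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    # fixed random_state and explicit n_init make the run repeatable
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```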


<b>2.1.2 Run k-means, segment the data into clusters and generate labels</b>

In [23]:
kclusters = best_size
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(TO_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 7, 4, 0, 5, 0, 6, 1, 2, 3])

In [24]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged2 = toronto_data
# join the sorted venue table onto toronto_data so each neighborhood keeps its latitude/longitude
toronto_merged2 = toronto_merged2.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged2.shape # check the last columns!

(39, 12)

<b>2.1.3 Remove unlabeled rows (caution: filtering is deferred to 2.2.3 so the row count still matches the labels added in 2.2.2)</b>

In [25]:
#Clear dataframe
#toronto_merged2=toronto_merged2[toronto_merged2['Cluster Labels'].notna()]
#toronto_merged2.head()
toronto_merged2.shape

(39, 12)

<h3>2.2 Finding clusters based on population and income</h3>

<b>2.2.1 Finding the best number of clusters</b>

In [26]:
toronto_data2=toronto_data.iloc[:,1:3]
toronto_data2.head()

Unnamed: 0,AfterTaxIncome2015,Population_2016_x
2,109841.0,15330.0
3,108271.0,26496.0
10,89901.0,49195.0
11,87847.0,35738.0
14,81956.0,32684.0


In [27]:
# Use silhouette score to find optimal number of clusters to segment the data
kclusters = np.arange(2,10)
results = {}
for size in kclusters:
    model = KMeans(n_clusters = size).fit(toronto_data2)
    predictions = model.predict(toronto_data2)
    results[size] = silhouette_score(toronto_data2, predictions)

best_size = max(results, key=results.get)
print('The best number of clusters is:',best_size)

The best number of clusters is: 2
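K-means relies on Euclidean distance, so a feature with a larger spread tends to dominate the result. A possible refinement, which is not what the cells above do, is to standardize income and population before clustering. The sketch below uses the five rows shown in `toronto_data2.head()`.

```python
import numpy as np

# income and population for the five rows shown above
features = np.array([
    [109841.0, 15330.0],
    [108271.0, 26496.0],
    [89901.0, 49195.0],
    [87847.0, 35738.0],
    [81956.0, 32684.0],
])

# z-score each column: subtract its mean and divide by its standard deviation
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
print(scaled.round(3))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distance purely because of its units.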


<b>2.2.2 Run k-means, segment the data into clusters and generate labels</b>

In [28]:
kclusters = best_size
# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_data2)
# check cluster labels generated for each row in the dataframe
kmeans2.labels_.shape

(39,)

In [29]:
# add clustering labels
toronto_merged2.insert(0, 'PI_CLabels', kmeans2.labels_)
toronto_merged2.head()

Unnamed: 0,PI_CLabels,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
2,0,M4N,109841.0,15330.0,15330.0,Central Toronto,Lawrence Park,43.72802,-79.38879,,,,
3,0,M5R,108271.0,26496.0,26496.0,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,7.0,Café,Coffee Shop,Food & Drink Shop
10,0,M5V,89901.0,49195.0,49195.0,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442,4.0,Coffee Shop,Food & Drink Shop,Café
11,0,M4J,87847.0,35738.0,35738.0,East York/East Toronto,The Danforth East,43.685347,-79.338106,,,,
14,0,M6J,81956.0,32684.0,32684.0,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,1.0,Coffee Shop,Café,Food & Drink Shop


<b>2.2.3 Remove unlabeled rows</b>

In [30]:
toronto_merged2=toronto_merged2[toronto_merged2['Cluster Labels'].notna()]
toronto_merged2=toronto_merged2[toronto_merged2['PI_CLabels'].notna()]
toronto_merged2.head()

Unnamed: 0,PI_CLabels,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
3,0,M5R,108271.0,26496.0,26496.0,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,7.0,Café,Coffee Shop,Food & Drink Shop
10,0,M5V,89901.0,49195.0,49195.0,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442,4.0,Coffee Shop,Food & Drink Shop,Café
14,0,M6J,81956.0,32684.0,32684.0,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,1.0,Coffee Shop,Café,Food & Drink Shop
20,0,M4R,78625.0,11394.0,11394.0,Central Toronto,North Toronto West,43.715383,-79.405678,6.0,Coffee Shop,Café,Food & Drink Shop
21,0,M6S,76142.0,34299.0,34299.0,West Toronto,"Runnymede, Swansea",43.651571,-79.48445,1.0,Coffee Shop,Café,Food & Drink Shop


In [31]:
toronto_merged2.shape

(31, 13)

<b>Let's find the number of items per cluster</b>

<i>Based on the ratio of coffee-related venues</i>

In [32]:
dtt=toronto_merged2.groupby('Cluster Labels').count().sort_values(by=['PostalCode'], ascending=False)
dtt

Unnamed: 0_level_0,PI_CLabels,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0.0,8,8,8,8,8,8,8,8,8,8,8,8
6.0,6,6,6,6,6,6,6,6,6,6,6,6
1.0,5,5,5,5,5,5,5,5,5,5,5,5
4.0,3,3,3,3,3,3,3,3,3,3,3,3
7.0,3,3,3,3,3,3,3,3,3,3,3,3
2.0,2,2,2,2,2,2,2,2,2,2,2,2
3.0,2,2,2,2,2,2,2,2,2,2,2,2
5.0,2,2,2,2,2,2,2,2,2,2,2,2


<i>Based on population and income</i>

In [43]:
toronto_merged2.groupby('PI_CLabels').count().sort_values(by=['PostalCode'], ascending=False)

Unnamed: 0_level_0,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
PI_CLabels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,27,27,27,27,27,27,27,27,27,27,27,27
1,4,4,4,4,4,4,4,4,4,4,4,4
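The share of neighborhoods in each population-and-income group follows directly from the counts above (27 and 4). A short sketch of the arithmetic:

```python
# counts per PI_CLabels group, taken from the table above
counts = {0: 27, 1: 4}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(label, round(share, 3))
```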


<h3>3. Results</h3>

<b>As we can see, based on income and population, PI_CLabels group 0 is by far the more common (27 of 31 neighborhoods), so that is the group we want to show on the map.</b>

Let's look at the mean income and population of each PI_CLabels group.

In [36]:
print('avg income of 1 group:',toronto_merged2[toronto_merged2['PI_CLabels']==1]['AfterTaxIncome2015'].mean())
print('avg income of 0 group:',toronto_merged2[toronto_merged2['PI_CLabels']==0]['AfterTaxIncome2015'].mean())
print('avg population of 1 group:',toronto_merged2[toronto_merged2['PI_CLabels']==1]['Population_2016_y'].mean())
print('avg population of 0 group:',toronto_merged2[toronto_merged2['PI_CLabels']==0]['Population_2016_y'].mean())

avg income of 1 group: 0.0
avg income of 0 group: 63086.0
avg population of 1 group: 6.25
avg population of 0 group: 24480.37037037037
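An average income of 0.0 suggests that group 1 may consist of rows where the 'Null' income was replaced with 0 earlier on. Below is a sketch of how such rows could be set aside before clustering, assuming that an income of 0 only ever means "missing". The first two rows follow the data above; the last two are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    # 'X1X' and 'X2X' are invented placeholder rows
    'PostalCode': ['M4N', 'M5R', 'X1X', 'X2X'],
    'AfterTaxIncome2015': [109841.0, 108271.0, 0.0, 0.0],
    'Population_2016': [15330.0, 26496.0, 10.0, 5.0],
})

# treat an income of 0 as a missing-value placeholder (assumption)
has_income = data['AfterTaxIncome2015'] > 0
usable = data[has_income]
placeholders = data[~has_income]

print(len(usable), 'usable rows,', len(placeholders), 'placeholder rows')
```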


In [38]:
latitude=43.651070
longitude=-79.347015

In [42]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
kclusters=8
dtt=toronto_merged2
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged2['Latitude'], toronto_merged2['Longitude'], toronto_merged2['Neighborhood'], toronto_merged2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [40]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
kclusters=2
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged2['Latitude'], toronto_merged2['Longitude'], toronto_merged2['Neighborhood'], toronto_merged2['PI_CLabels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's see the distribution of shops

In [44]:
toronto_final=toronto_merged2[toronto_merged2['Cluster Labels']==0]
toronto_final=toronto_final[toronto_final['PI_CLabels']==0]
toronto_final.head()

Unnamed: 0,PI_CLabels,PostalCode,AfterTaxIncome2015,Population_2016_x,Population_2016_y,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
22,0,M5E,74061.0,9118.0,9118.0,Downtown Toronto,Berczy Park,43.644771,-79.373306,0.0,Coffee Shop,Café,Food & Drink Shop
26,0,M5J,70843.0,14545.0,14545.0,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752,0.0,Coffee Shop,Café,Food & Drink Shop
56,0,M4K,57366.0,31583.0,31583.0,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0.0,Coffee Shop,Café,Food & Drink Shop
87,0,M5A,46938.0,41078.0,41078.0,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Café,Food & Drink Shop
88,0,M4Y,46324.0,30472.0,30472.0,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0.0,Coffee Shop,Food & Drink Shop,Café
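The two filters above can also be written as one boolean mask, which makes the selection rule easier to read. A sketch on three rows taken from the tables above:

```python
import pandas as pd

merged = pd.DataFrame({
    'Neighborhood': ['Berczy Park', 'Little Portugal, Trinity', 'Church and Wellesley'],
    'Cluster Labels': [0.0, 1.0, 0.0],
    'PI_CLabels': [0, 0, 0],
})

# keep rows that are in venue cluster 0 AND in population/income group 0
mask = (merged['Cluster Labels'] == 0) & (merged['PI_CLabels'] == 0)
final = merged[mask]
print(final['Neighborhood'].tolist())
```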


In [184]:
#location of Toronto
latitude=43.651070
longitude=-79.347015

<i> Location based on shop cluster </i>

In [45]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_final['Latitude'], toronto_final['Longitude'], toronto_final['Neighborhood'], toronto_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[1],
        fill=True,
        fill_color=rainbow[1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Discussion:

### 5.1 Explaining the results

Because we built our list of neighborhoods exclusively from coffee-related venues, we found that most neighborhoods look similar and that the greatest concentration of coffee shops is in downtown Toronto and east Toronto. This might seem obvious, but these areas also appear to include some of the more affluent neighborhoods in Toronto, so there appears to be a correlation. By locating in the general vicinity of these neighborhoods, my friend would be geographically centered in this cluster and well placed to serve his coffee shop customer base efficiently.<br>
When we ran k-means on the venue data, silhouette analysis suggested 8 clusters of neighborhoods, grouped by the mix of coffee-related venue types they contain. The largest of these was cluster 0, with 8 of the 31 labeled neighborhoods. So although Toronto has many coffee shops, the neighborhoods in the largest cluster sit mostly near the center (downtown) of Toronto.<br>
After running k-means on population and income, we found 2 clusters, and most neighborhoods fell into cluster 0. In other words, Toronto's neighborhoods split into 2 income-and-population groups, and the most common is cluster 0, with 27 of the 31 locations (about 87%).
After combining the two conditions, we found that the best place to locate our new coffee shop <b>is around downtown and east Toronto, near the beach or the University of Toronto.</b>

<h3>Conclusion:</h3>
I feel confident in the recommendation I have given my friend, as it is backed by data analysis. While nothing can ever be 100% certain, he will be better informed than he was before asking for my help.
Much more could be inferred with more work. A potential side business for my friend might be advising new restaurant owners on where to locate a new restaurant, who their competition is and who their clientele might be.