## Part 2 of Toronto neighborhood new business opportunities

## Introduction

The purpose of this study is to use a recommender system to suggest new business opportunities for Toronto neighborhoods that currently have few venues (<20 venues, "sparse neighborhoods"). The study proposes to use a user-based collaborative filtering mechanism where the neighborhoods (defined by postal codes) are the "users", and the frequency of various kind of venues are the "scores" of the venues in each neighborhood. Based on similarity among various neighborhoods, this study seeks to recommend venues that should be characteristic of but currently absent from sparse neighborhoods. 

This is the part two of the project. For part 1, see: [Toronto_postcode_clustering](https://github.com/chencheng23/Toronto/blob/master/Toronto_postcode_clustering.ipynb).

The proposed steps include:
1. <a href=#data>Importing data </a>from Toronto venues (prepared from Foursquare data) containing the postal codes and venue categories generated from [Toronto_postcode_clustering](https://github.com/chencheng23/Toronto/blob/master/Toronto_postcode_clustering.ipynb).
2. <a href=#one-hot>One-hot encoding the venue categories </a>
3. <a href=#select>Selecting postal code areas with >=20 venues as the dense areas, and postal code areas with 10-19 venues as neighborhoods this study aims to recommend venues to (sparse areas) </a>
4. <a href=#freq>Calculating the average score (frequency) of the venue categories in each postal code area </a>
5. <a href=#sim>Generating the similarity matrix of each of the sparse areas with all dense areas </a>
6. <a href=#rec>Generating a weighted matrix for sparse areas in terms of the score of all venue categories, and rank the scores of the venue categories </a>
7. <a href=#result>Generating a result table </a>

Some caveats include:
1. Some of the venues in from the dataset (see below) are public infrastructure, such as bus stops or airport, and should not be considered business opportunities. The study will ~~both~~ include ~~and exclude~~ those venues when determining the similarity among neighborhoods, ~~and determine which recommendation system provides most sensible results~~. (after the study, it  was thought that public infrastruture is a part of the characteristics of a neighborhood, and should be included)
2. Some venues may be mutually exclusive or competitive with each other, such as different kinds of restaurants. This should be factored in when analyzing the recommendation results.
3. Obviously, other factors and analyses, such as population, population density, type of the neighborhood (urban vs suburban), socioeconomic factors, projected costs and revenues, etc. need to be factored in when making business decisions. This study merely assumes that neighborhoods that similar to each other should have similar kinds of venues.

In [1]:
import numpy as np
import pandas as pd
from math import sqrt

<a name='data' /> 
### Importing data from Toronto_postcode_clustering

In [36]:
toronto_venues = pd.read_csv('toronto_venues.csv')
toronto_venues.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,2,M1C,43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,3,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


<a name='one-hot' />   
### One-hot encoding the venue categories

In [276]:
#onehot encode the Venue Category:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Postal Code'] = toronto_venues['Neighborhood']
#move 'Postal Code' column to the front
toronto_onehot = toronto_onehot[[toronto_onehot.columns[-1]] + list(toronto_onehot.columns[0:-1])]
print(toronto_onehot.shape)
# toronto_onehot.head()
# just curious to see below that Postal Code M5V has a high frequency of Airport related stuff, thus displaying all M5V venues
# toronto_onehot.loc[toronto_onehot['Postal Code'] == 'M5V']

(2251, 280)


<a name='select' />  
### Defining the "dense" and "sparse" areas

In [34]:
#select Postal Code areas where there are at least 20 venues
venue_count = toronto_venues.groupby(by = 'Neighborhood').count().reset_index()
venue_count = venue_count[venue_count['Venue'] >= 20]
known_neighbor = toronto_onehot[toronto_onehot['Postal Code'].isin(venue_count['Neighborhood'])]

venue_count = toronto_venues.groupby(by = 'Neighborhood').count().reset_index()
venue_count = venue_count[venue_count['Venue'] < 20]
venue_count = venue_count[venue_count['Venue'] >= 10]
test_neighbor = toronto_onehot[toronto_onehot['Postal Code'].isin(venue_count['Neighborhood'])]


<a name='freq' />  
### Calculating the average score (frequency) of the venue categories in each postal code area

In [270]:
#calculate the frequency of the venue categories in each postal code area
known_freq = known_neighbor.groupby(by='Postal Code').mean()
test_freq = test_neighbor.groupby(by='Postal Code').mean()
test_freq.head()

Unnamed: 0_level_0,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
M1L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1W,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M3H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M4B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Wonder if for calculating correlations, should include the zero values.. The data table has a lot of zero values, and the zeros may overwhelm the actual meaningful data (i.e. if two neighborhoods have mostly 0 values, would they be considered very similar, even though no information on what venue to recommend can be discerned?)

For right now, keep all zero values for coefficient calculation. 

(after the study, it was thought that the zero values are also valuable, since the lack of certain venue categories is also characteristic of a neighborhood)

In [274]:
#check the order of the coefficients when feeding in the whole known_freq table 
# test_1 = test_freq.iloc[0,:]
# known_1 = known_freq.iloc[1,:]
# np.corrcoef(known_1, test_1)

<a name='sim' />  
### Generating the similarity matrix

In [234]:
#getting the correlation coefficients of the 1st test postal code with all known postal codes
#using the np.corrcoef function (Return Pearson product-moment correlation coefficients, see doc) here, can also use other similarity measurements
corr_test_1 = np.corrcoef(test_1, known_freq)
#corr_test_1 is a 32*32 matrix, first column in this order: coefficient of test_1 with test_1, known_freq.iloc[0,:], \ 
#known_freq.iloc[1,:],... 
df_corr_test_1 = pd.DataFrame(corr_test_1).iloc[1:, 0]
# df_corr_test_1.head()
#reshape to be passed to the element-wise multiplication below
df_corr_test_1 = df_corr_test_1.values.reshape(31, 1)


<a name='rec' />  
### Generating the weighted score matrix and recommendations

In [158]:
#multiply the coefficients with the frequency, to get the weighted frequency
weighted_known_freq_test_1 = np.multiply(known_freq, df_corr_test_1)
weighted_known_freq_test_1.head()

Unnamed: 0_level_0,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
M2J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002656,0.0,...,0.0,0.002656,0.0,0.0,0.0,0.0,0.0,0.002656,0.005312,0.0
M2N,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,...,-0.0,-0.0,-0.0,-0.000319,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0
M3C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M4G,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,...,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0
M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000276,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000276


In [137]:
# summing the weighted frequency for test_1 by venue category and divide by the summed similarity (coefficients)
sum_known_freq_test_1 = weighted_known_freq_test_1.sum(axis = 0)
# sum_known_freq_test_1.shape

(279,)

In [168]:
#sum the similarity coefficients of all postal codes for each venue category where the freq is not 0
sum_coef_test_1 = []
for i in range(known_freq.shape[1]):
    one_null_array = list()  #create a list of 1s and 0s to make the summation easier (using dot product)
    for m in range(known_freq.shape[0]):
        if known_freq.iloc[m, i] != 0:
            one_null_array.append(1)
        else:
            one_null_array.append(0)
    sum_coef_col = np.dot(one_null_array, df_corr_test_1)
    sum_coef_test_1.append(sum_coef_col)
print(np.shape(sum_coef_test_1))
type(sum_coef_test_1)

(279, 1)


list

In [173]:
#change the dimension for the division below
sum_coef_test_1 = np.array(sum_coef_test_1).reshape(279, )
np.shape(sum_coef_test_1)

(279,)

In [230]:
#divide the sum of weighted frequency of each venue category by the sum of similarity
# score_test_1 = np.divide(sum_known_freq_test_1, sum_coef_test_1, )
score_test_1 = sum_known_freq_test_1 / sum_coef_test_1
print(type(score_test_1))

#display the top recommended venue categories for test_1 postal code:
sorted_score_test_1 = score_test_1.sort_values(ascending=False)
sorted_score_test_1.head()

<class 'pandas.core.series.Series'>


Coffee Shop              0.091584
Supermarket              0.089797
Portuguese Restaurant    0.073070
Café                     0.064672
Pharmacy                 0.061504
dtype: float64

### Those above are the top 10 recommended venue categories for test_1 area

In [186]:
#compare to the current top ten venue categories of test_1 postal code:
test_1.sort_values(ascending=False).head(10)

Bakery                  0.2
Bus Line                0.2
Fast Food Restaurant    0.1
Soccer Field            0.1
Metro Station           0.1
Bus Station             0.1
Intersection            0.1
Park                    0.1
Costume Shop            0.0
Coworking Space         0.0
Name: M1L, dtype: float64

from the comparison, some top recommendations have values approaching the existing venues, especially when some of the top current venues are public infrastructure and not small business opportunities.

#### what's interesting is how to think about recommendation scores for venue categories that already exist in this neighborhood. If the recommendation score is higher than the current frequency, it might suggest additional opportunity for this type of venue. If the recommendation score is lower than the current frequency, it might suggest that there's an overflow of this type of venue. 

In [277]:
#put the recommendation results into a table (top 5 recommendations)
Rec_table = pd.DataFrame(columns = [[ '1st, score', '2nd, score', '3rd, score', '4th, score', '5th, score']], \
                         index = test_freq.index.values)
Rec_table.index.name = 'Postal Code'



In [279]:
for i in range(5):
    Rec_table.iloc[0,i] = sorted_score_test_1.index[i], sorted_score_test_1[i].round(4)

In [280]:
Rec_table.head()

Unnamed: 0_level_0,"1st, score","2nd, score","3rd, score","4th, score","5th, score"
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1L,"(Coffee Shop, 0.0916)","(Supermarket, 0.0898)","(Portuguese Restaurant, 0.0731)","(Café, 0.0647)","(Pharmacy, 0.0615)"
M1T,,,,,
M1W,,,,,
M3H,,,,,
M4B,,,,,


### Defining a function to automate this process for the rest of test_neighbor

In [239]:
# define a function to automate this process for the rest of test_neighbor

def GetRecResults (test_area):   #test_area is the n-th of all the test_freq neighborhoods
    n = test_area
    test_n = test_freq.iloc[n,:]
    #getting the similarity (correlation coefficients) of the n-th test postal code with all known postal codes
    corr_test_n = np.corrcoef(test_n, known_freq)
    #since the np.corrcoef returns an n*n matrix, take only the first column, and exlude the first row (self, 1)
    df_corr_test_n = pd.DataFrame(corr_test_n).iloc[1:, 0]
    #reshape to be passed to the element-wise multiplication below
    df_corr_test_n = df_corr_test_n.values.reshape(known_freq.shape[0], 1)
    
    #multiply the coefficients with the frequency, to get the weighted frequency
    weighted_known_freq_test_n = np.multiply(known_freq, df_corr_test_n)
    
    # summing the weighted frequency for test_n by venue category
    sum_known_freq_test_n = weighted_known_freq_test_n.sum(axis = 0)
    
    #sum the similarity coefficients of all postal codes for each venue category where the freq is not 0
    sum_coef_test_n = []
    for i in range(known_freq.shape[1]):
        one_null_array = list()  #create a list of 1s and 0s to make the summation easier (using a dot product)
        for m in range(known_freq.shape[0]):
            if known_freq.iloc[m, i] != 0:
                one_null_array.append(1)
            else:
                one_null_array.append(0)
        sum_coef_col = np.dot(one_null_array, df_corr_test_n)
        sum_coef_test_n.append(sum_coef_col)
    
    #reshape for the division below
    sum_coef_test_n = np.array(sum_coef_test_1).reshape(279, )
    
    #divide the sum of weighted frequency of each venue category by the sum of similarity
    score_test_n = sum_known_freq_test_n / sum_coef_test_n
    #sort (descending order) the recommended venue categories for test_n postal code:
    sorted_score_test_n = score_test_n.sort_values(ascending=False)
    
    #add the top five venue categories and scores to the Rec_table
    for i in range(5):
        Rec_table.iloc[n,i] = sorted_score_test_n.index[i], sorted_score_test_n[i].round(4)
    return 

<a name='result' />  
### Result table

In [272]:
for i in range(test_freq.shape[0]):
    GetRecResults(i)
Rec_table

Unnamed: 0_level_0,"1st, score","2nd, score","3rd, score","4th, score","5th, score"
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1L,"(Coffee Shop, 0.0916)","(Supermarket, 0.0898)","(Portuguese Restaurant, 0.0731)","(Café, 0.0647)","(Pharmacy, 0.0615)"
M1T,"(Jewish Restaurant, 1.119)","(Sushi Restaurant, 0.4917)","(Dim Sum Restaurant, 0.4364)","(Smoke Shop, 0.3981)","(Ramen Restaurant, 0.3964)"
M1W,"(Jewish Restaurant, 2.1901)","(Portuguese Restaurant, 1.6299)","(Dim Sum Restaurant, 1.4404)","(Intersection, 1.3762)","(Climbing Gym, 1.3762)"
M3H,"(Jewish Restaurant, 2.1491)","(Portuguese Restaurant, 1.4928)","(Smoke Shop, 1.0292)","(Mediterranean Restaurant, 1.0173)","(Ramen Restaurant, 1.0113)"
M4B,"(Jewish Restaurant, 1.6291)","(Intersection, 0.6996)","(Stadium, 0.6996)","(Climbing Gym, 0.6996)","(Ramen Restaurant, 0.5636)"
M4H,"(Jewish Restaurant, 3.1079)","(Portuguese Restaurant, 1.1213)","(Climbing Gym, 0.9046)","(Stadium, 0.9046)","(Intersection, 0.9046)"
M4P,"(Intersection, 1.2296)","(Stadium, 1.2296)","(Climbing Gym, 1.2296)","(Dim Sum Restaurant, 1.1097)","(Jewish Restaurant, 0.9351)"
M4R,"(Portuguese Restaurant, 1.4439)","(Dim Sum Restaurant, 0.9825)","(Jewish Restaurant, 0.9799)","(Climbing Gym, 0.7996)","(Intersection, 0.7996)"
M4V,"(Jewish Restaurant, 2.3073)","(Portuguese Restaurant, 1.3511)","(Smoke Shop, 1.3301)","(Mediterranean Restaurant, 1.2951)","(Sushi Restaurant, 0.9463)"
M5V,"(Portuguese Restaurant, 0.51)","(Climbing Gym, 0.3069)","(Stadium, 0.3069)","(Intersection, 0.3069)","(Jewish Restaurant, 0.3069)"


## Analysis and brief conclusion

Here it is, the recommended venues for each of the postal code areas with 10-19 venue categories. Some areas, like M5V, have overall low scores, and upon closer inspection, M5V is likely the airport area, so it's not too surprising that there aren't many businesses around. Jewish restaurants appear to be good opportunities in many areas, some with high scores, such as M4H or M8V, and it may be worthwhile to follow up on those areas and figure out if there's market demand for additional restaurants in those areas, and in particular Jewish restaurants. 

Overall though, the top 5 recommendation results are quite homogeneous, with Restaurants (Jewish, Portuguese, Mediterranean, Ramen), climbing gyms, and some stadiums and intersections dominating the table. This is likely a result from the fact that most of those areas with 10-19 venue categories highly resemble the crowded downtown areas (they fall into the same cluster, see pics below, generated from [Toronto_postcode_clustering](https://github.com/chencheng23/Toronto/blob/master/Toronto_postcode_clustering.ipynb)), and whatever is popular in downtown areas are recommended to the venue-sparse areas.

#### Toronto neighborhoods with >20 venue categories
<img src="files/toronto_clustering_20_or_more.jpg">

#### All Toronto neighborhoods: most peripheral neighborhoods are in the same cluster as the dense downtown neighborhoods (red), when clustered by the frequency of venue categories
<img src="files/toronto_clustering.jpg">

maybe add a table of top 5 venues from the 20 most venue-dense areas??

Backup: spot checking some of the variables from the GetRecResults function. 

In [246]:
test_M8V = test_freq.loc['M8V']
test_M8V.loc['Intersection']

0.0

In [266]:
n = 1
test_n = test_freq.iloc[n,:]
#getting the similarity (correlation coefficients) of the n-th test postal code with all known postal codes
corr_test_n = np.corrcoef(test_n, known_freq)
#since the np.corrcoef returns an n*n matrix, take only the first column, and exlude the first row (self, 1)
df_corr_test_n = pd.DataFrame(corr_test_n).iloc[1:, 0]
#reshape to be passed to the element-wise multiplication below
df_corr_test_n = df_corr_test_n.values.reshape(known_freq.shape[0], 1)

#multiply the coefficients with the frequency, to get the weighted frequency
weighted_known_freq_test_n = np.multiply(known_freq, df_corr_test_n)

# summing the weighted frequency for test_n by venue category
sum_known_freq_test_n = weighted_known_freq_test_n.sum(axis = 0)

#sum the similarity coefficients of all postal codes for each venue category where the freq is not 0
sum_coef_test_n = []
for i in range(known_freq.shape[1]):
    one_null_array = list()  #create a list of 1s and 0s to make the summation easier (using a dot product)
    for m in range(known_freq.shape[0]):
        if known_freq.iloc[m, i] != 0:
            one_null_array.append(1)
        else:
            one_null_array.append(0)
    sum_coef_col = np.dot(one_null_array, df_corr_test_n)
    sum_coef_test_n.append(sum_coef_col)

#reshape for the division below
sum_coef_test_n = np.array(sum_coef_test_1).reshape(279, )

#divide the sum of weighted frequency of each venue category by the sum of similarity
score_test_n = sum_known_freq_test_n / sum_coef_test_n
#sort (descending order) the recommended venue categories for test_n postal code:
sorted_score_test_n = score_test_n.sort_values(ascending=False)

In [258]:
weighted_known_freq_test_n.loc[:, 'Coffee Shop']

Postal Code
M2J    0.036395
M2N    0.033754
M3C    0.044560
M4G    0.032721
M4K    0.014191
M4L    0.014536
M4M    0.014171
M4S    0.017506
M4X    0.026268
M4Y    0.026279
M5A    0.057579
M5B    0.034431
M5C    0.020062
M5E    0.025126
M5G    0.056475
M5H    0.015423
M5J    0.034305
M5K    0.050432
M5L    0.032280
M5M    0.033845
M5R    0.048730
M5S    0.005221
M5T    0.007074
M5W    0.042217
M5X    0.022527
M6H    0.000000
M6J    0.009553
M6K    0.024260
M6P    0.000000
M6S    0.019690
M7A    0.114596
Name: Coffee Shop, dtype: float64

In [267]:
sum_known_freq_test_n[-26]

0.09189938522104202

In [268]:
sum_coef_test_n[-26]

0.18690937243124472