In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.max_columns', None)

<h3>Question: does separating features into 'factors' change the cosine similarity?</h3>

In [4]:
# Import dummy data
events = pd.read_csv('../data/TEST_factor1.csv')
stroll = pd.read_csv('../data/TEST_factor2.csv')
commute = pd.read_csv('../data/TEST_factor3.csv')

<h3>For instance, our Eventbrite data will have MANY more features than the stroll quality and the commute quality data sets.
<br>
<ul>
<li>Eventbrite: 25+ features (all categories and subcategories of events)</li>
<li>Stroll quality: 3 features (tree density, walk score, transit score)</li>
<li>Commute quality: 4 features (walking, biking, transit, and driving times in minutes)</li>
</ul>
<br>
If we simply merge all features together, won't the Eventbrite features overwhelm the features in the other two factors? If so, how can we weight the features so that they are, in essence, separated by 'factor' (event, stroll, and commute quality)?</h3>

In [7]:
commute.head()

Unnamed: 0,hood,bicycling,driving,transit,walking
0,soma,5,5,10,15
1,fremont,15,20,20,35


In [36]:
stroll.head()

Unnamed: 0,hood,trees,walkscore,transitscore
0,soma,38,99,98
1,fremont,45,83,89


In [37]:
events.head()

Unnamed: 0,hood,event1,event2,event3,event4,event5,event6,event7,event8,event9,event10,event11,event12,event13,event14,event15,event16,event17,event18,event19,event20,event21,event22,event23,event24,event25
0,soma,0,0,0,20,4,0,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,4,8,1
1,fremont,1,0,0,5,0,2,0,1,2,6,0,0,1,0,1,2,8,0,0,2,0,0,1,2,0


<h3>First, let's try merging all data sets together to see the overall cosine similarity.
<br>Cosine similarity: 0.96259804058040244</h3>

In [16]:
merged = events.merge(stroll, on='hood', how='outer')
merged = merged.merge(commute, on='hood', how='outer')

In [17]:
soma = merged[merged['hood'] == 'soma']
fremont = merged[merged['hood'] == 'fremont']

In [19]:
fremont.head()

Unnamed: 0,hood,event1,event2,event3,event4,event5,event6,event7,event8,event9,event10,event11,event12,event13,event14,event15,event16,event17,event18,event19,event20,event21,event22,event23,event24,event25,trees,walkscore,transitscore,bicycling,driving,transit,walking
1,fremont,1,0,0,5,0,2,0,1,2,6,0,0,1,0,1,2,8,0,0,2,0,0,1,2,0,45,83,89,15,20,20,35


In [20]:
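def convert_hood_to_array(input_hood):
    # take the first (only) row, drop the leading 'hood' name column,
    # and cast the remaining feature values to floats
    input_hood = np.array(input_hood)[0][1:].astype(float)
    # cosine_similarity expects 2D input of shape (n_samples, n_features)
    input_hood = input_hood.reshape(1, -1)
    return input_hood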

def compare_hood_to_all_city_hoods(input_hood, hood_city_df, comparison_city_df, input_dict):
    hood1 = hood_city_df.loc[hood_city_df['hood'] == input_hood]
    hood1 = convert_hood_to_array(hood1)
    for hood in comparison_city_df['hood']:
        hood2 = comparison_city_df.loc[comparison_city_df['hood'] == hood]
        hood2 = convert_hood_to_array(hood2)
        # create the inner dict on first use, then record the similarity
        input_dict.setdefault(input_hood, {})[hood] = cosine_similarity(hood1, hood2)[0][0]
    return input_dict

# compare all Seattle neighborhoods with all SF neighborhoods
# note: one optimization for this will be to, instead of a dictionary of dictionaries,
# have a dictionary of tuples (hood_name, cosine_similarity), sorted by c_s
comparisons = {}
for hood in soma['hood']:
    compare_hood_to_all_city_hoods(hood, soma, fremont, comparisons)
for hood in fremont['hood']:
    compare_hood_to_all_city_hoods(hood, fremont, soma, comparisons)

In [21]:
comparisons

{'fremont': {'soma': 0.96259804058040244},
 'soma': {'fremont': 0.96259804058040244}}
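
To see which factor is actually driving the merged score, one option is to split the merged dot product into per-factor pieces. The sketch below hardcodes the two dummy rows shown above; the names `blocks`, `dots`, and `dot_share` are just for illustration and are not part of the original notebook.

```python
import numpy as np

# Dummy rows copied from the tables above, as (soma, fremont) per factor.
blocks = {
    'events': (
        [0, 0, 0, 20, 4, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 4, 8, 1],
        [1, 0, 0, 5, 0, 2, 0, 1, 2, 6, 0, 0, 1, 0, 1, 2, 8, 0, 0, 2, 0, 0, 1, 2, 0],
    ),
    'stroll': ([38, 99, 98], [45, 83, 89]),
    'commute': ([5, 5, 10, 15], [15, 20, 20, 35]),
}

# The merged dot product is the sum of the per-factor dot products,
# so each factor's share shows how much it contributes to the numerator.
dots = {name: float(np.dot(a, b)) for name, (a, b) in blocks.items()}
total = sum(dots.values())
dot_share = {name: d / total for name, d in dots.items()}

for name in ['events', 'stroll', 'commute']:
    print("{}: {:.3f}".format(name, dot_share[name]))
```

With these two rows, the three stroll columns account for roughly 95% of the dot product. So for this dummy data it is the large-valued scores (walk score, transit score) that dominate the merged similarity, not the 25 event counts: feature scale matters at least as much as feature count.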

<h3>Now, let's get the cosine similarities for each factor. 
<br><br>Originally, I thought we could make each factor's cosine similarity into one feature, then redo the cosine similarity calculation. Except, oops, that leaves only a single vector of three similarities for the pair, and comparing a vector with itself always gives a similarity of 1!
<br><br>Is there a better way to 'weight' the features by factor? 
<ul>
<li>Multiplying the cosine similarities together?</li>
<li>Using Euclidean distance somehow?</li>
<li>Normalizing the features somehow before merging them for one single cosine similarity calculation?</li>
</ul>
</h3>
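
As a rough sketch of the "normalize before merging" idea: scale each factor's block of features to unit length, then concatenate the blocks and compute a single cosine similarity. The helpers `unit`, `cosine`, and `block_normalized_cosine`, and the optional `weights` argument, are hypothetical names introduced here. The sketch assumes no factor block is all zeros (that would divide by zero).

```python
import numpy as np

def unit(v):
    """Scale a vector to unit (L2) length. Assumes v is not all zeros."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Plain cosine similarity between two 1D vectors."""
    return float(np.dot(unit(a), unit(b)))

def block_normalized_cosine(blocks_a, blocks_b, weights=None):
    """Cosine similarity after scaling each factor block to unit length.

    `weights` is a hypothetical per-factor importance; default is equal.
    """
    if weights is None:
        weights = [1.0] * len(blocks_a)
    # sqrt(w) on each side makes the factor contribute w to the dot product
    a = np.concatenate([np.sqrt(w) * unit(x) for w, x in zip(weights, blocks_a)])
    b = np.concatenate([np.sqrt(w) * unit(x) for w, x in zip(weights, blocks_b)])
    return cosine(a, b)

# Dummy rows from the tables above, in the order events, stroll, commute
soma_blocks = [
    [0, 0, 0, 20, 4, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 4, 8, 1],
    [38, 99, 98],
    [5, 5, 10, 15],
]
fremont_blocks = [
    [1, 0, 0, 5, 0, 2, 0, 1, 2, 6, 0, 0, 1, 0, 1, 2, 8, 0, 0, 2, 0, 0, 1, 2, 0],
    [45, 83, 89],
    [15, 20, 20, 35],
]

per_factor = [cosine(a, b) for a, b in zip(soma_blocks, fremont_blocks)]
score = block_normalized_cosine(soma_blocks, fremont_blocks)
print(per_factor)
print(score)
```

Because every block has unit length before concatenation, this score is the (weighted) average of the per-factor cosine similarities, so a 25-column factor counts no more than a 3-column one. For the dummy rows it comes out to about 0.815 with equal weights.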

In [22]:
events.head()

Unnamed: 0,hood,event1,event2,event3,event4,event5,event6,event7,event8,event9,event10,event11,event12,event13,event14,event15,event16,event17,event18,event19,event20,event21,event22,event23,event24,event25
0,soma,0,0,0,20,4,0,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,4,8,1
1,fremont,1,0,0,5,0,2,0,1,2,6,0,0,1,0,1,2,8,0,0,2,0,0,1,2,0


In [23]:
events_soma = events[events['hood'] == 'soma']
events_fremont = events[events['hood'] == 'fremont']

stroll_soma = stroll[stroll['hood'] == 'soma']
stroll_fremont = stroll[stroll['hood'] == 'fremont']

commute_soma = commute[commute['hood'] == 'soma']
commute_fremont = commute[commute['hood'] == 'fremont']

In [25]:
comparisons_events = {}
for hood in events_soma['hood']:
    compare_hood_to_all_city_hoods(hood, events_soma, events_fremont, comparisons_events)

comparisons_stroll = {}
for hood in stroll_soma['hood']:
    compare_hood_to_all_city_hoods(hood, stroll_soma, stroll_fremont, comparisons_stroll)

comparisons_commute = {}
for hood in commute_soma['hood']:
    compare_hood_to_all_city_hoods(hood, commute_soma, commute_fremont, comparisons_commute)


In [28]:
# single-argument print() calls run under both Python 2 and Python 3
print("SOMA/Fremont events: {}".format(comparisons_events['soma']['fremont']))
print("SOMA/Fremont stroll:  {}".format(comparisons_stroll['soma']['fremont']))
print("SOMA/Fremont commute:  {}".format(comparisons_commute['soma']['fremont']))

SOMA/Fremont events: 0.470102133516
SOMA/Fremont stroll:  0.995420026523
SOMA/Fremont commute:  0.979795897113


In [30]:
lst = [comparisons_events['soma']['fremont'], comparisons_stroll['soma']['fremont'], comparisons_commute['soma']['fremont']]

In [31]:
lst

[0.47010213351595742, 0.99542002652317363, 0.97979589711327142]

In [32]:
factors = np.array(lst)

In [33]:
factors

array([ 0.47010213,  0.99542003,  0.9797959 ])

In [34]:
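# cosine_similarity needs 2D input; one row means the vector is compared with itself
print(cosine_similarity(factors.reshape(1, -1)))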

[[ 1.]]




<h3>Here I was wondering if multiplying the cosine similarities for each factor could be interesting. In the full data set, I could still sort by this 'multiplied similarity' score.</h3>

In [35]:
[comparisons_events['soma']['fremont'] * comparisons_stroll['soma']['fremont'] * comparisons_commute['soma']['fremont']]

[0.45849458689108852]
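
For comparison, here is a small sketch of a few ways to fold the three per-factor similarities into one sortable score. The similarity values are copied from the output above; the `weights` are made up purely for illustration.

```python
import numpy as np

# Per-factor similarities for SOMA/Fremont (events, stroll, commute),
# copied from the output above
sims = np.array([0.47010213351595742, 0.99542002652317363, 0.97979589711327142])

# Product: one weak factor drags the whole score down
product = float(np.prod(sims))

# Geometric mean: same ranking as the product, but stays on a 0-1 scale
# that does not shrink as more factors are added
geometric_mean = product ** (1.0 / len(sims))

# Weighted arithmetic mean with hypothetical weights (events, stroll, commute)
weights = np.array([0.5, 0.25, 0.25])
weighted_mean = float(np.average(sims, weights=weights))

print(product)
print(geometric_mean)
print(weighted_mean)
```

The product and the geometric mean sort neighborhood pairs identically, since one is a monotonic function of the other. Both rely on the similarities being non-negative, which holds here because every feature is a count, score, or time. The weighted mean is the option that lets one factor deliberately count for more than the others.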

<h3>Alternatively, perhaps I could distill the Eventbrite data a bit and make 2-3 features based on events that I think will be most predictive (for example, what is the density of tech meetups per neighborhood?).</h3>
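
A minimal sketch of that distilling step, assuming the 25 event columns can be grouped into a few broader categories. The group names and the column-to-group assignments in `groups` are entirely hypothetical, since the dummy columns (`event1` ... `event25`) carry no real category labels.

```python
import pandas as pd

# Rebuild the dummy events table shown earlier
event_cols = ['event{}'.format(i) for i in range(1, 26)]
events = pd.DataFrame(
    [
        ['soma', 0, 0, 0, 20, 4, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 4, 8, 1],
        ['fremont', 1, 0, 0, 5, 0, 2, 0, 1, 2, 6, 0, 0, 1, 0, 1, 2, 8, 0, 0, 2, 0, 0, 1, 2, 0],
    ],
    columns=['hood'] + event_cols,
)

# Hypothetical grouping of event columns into a few summary features
groups = {
    'tech': ['event4', 'event5'],
    'arts': ['event10', 'event17'],
    'food': ['event23', 'event24'],
}

# Sum the event counts within each group, one row per neighborhood
distilled = pd.DataFrame({'hood': events['hood']})
for name, cols in groups.items():
    distilled[name] = events[cols].sum(axis=1)

print(distilled)
```

This shrinks the events factor to the same order of size as the stroll and commute factors, at the cost of choosing the groups by hand.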