# Similar Users Lab

BUT FIRST a quick word about strings, lists, and sets:

## Working with sets

In mathematics, a set is a collection of distinct objects.  In Python, sets are unordered collections with no duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

_Fun fact for your next party:  Technically, CPython sets are implemented as hash tables, much like dictionaries (under the hood)._

Here are two sets of colors:


In [8]:
a = set(["Red", "Green", "Blue"])
b = set(["Black", "White", "Green"])

To find out which items are in both sets (**both sets only**), use the "intersection" method:

In [9]:
a.intersection(b)

{'Green'}

To find the items in a but not in b, use the "difference" method:

In [10]:
a.difference(b)

{'Blue', 'Red'}

To find the items in b but not in a:

In [11]:
b.difference(a)

{'Black', 'White'}

To find all unique items across both sets (aka: union):

In [12]:
a.union(b)

{'Black', 'Blue', 'Green', 'Red', 'White'}

How many items are in b but not in a?

In [19]:
print("Number of different items in b:  %d" % len(b.difference(a)))

Number of different items in b:  2
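
The set methods above also have operator shorthands, and the symmetric difference mentioned earlier works the same way. A minimal sketch using the same two color sets:

```python
a = {"Red", "Green", "Blue"}
b = {"Black", "White", "Green"}

# Operator shorthands for the methods used above
assert a & b == a.intersection(b)  # items in both sets
assert a | b == a.union(b)         # items in either set
assert a - b == a.difference(b)    # items in a but not in b

# Symmetric difference: items that appear in exactly one of the two sets
print(sorted(a ^ b))  # ['Black', 'Blue', 'Red', 'White']
```

Sets are unordered, so `sorted()` is used here only to get a predictable print order.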


## From Sets to Lists

Now that we're experts at working with Python sets, let's get savvy working with lists and unstructured data.

Using the split() method on a string, we can "split" it by a delimiter into a list.  The .split() method is available on any string object and, by default, splits on whitespace.

*You can pass a parameter to split to change which character it will split on, such as ",", if you're trying to turn a comma-separated string of items into a list.*
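
For instance, a comma-separated answer could be split as sketched below. The sample string is made up for illustration; stripping each piece guards against uneven spacing around the commas.

```python
# Hypothetical comma-separated answer with uneven spacing
answer = "Blues, Reggae,Electronic Music"

# Split on the comma, then strip stray whitespace from each piece
genres = [item.strip() for item in answer.split(",")]
print(genres)  # ['Blues', 'Reggae', 'Electronic Music']
```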

The following will turn a space-delimited *string* into a **list**.

In [20]:
"my name is dave my name is dave my name is dave".split()

['my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave']

What's up with this though?  We all know "my name is dave", but if we had many values, it would be hard to know which of them are unique.  That's when we use sets.

In [21]:
set("my name is dave my name is dave my name is dave".split())

{'dave', 'is', 'my', 'name'}

OK, so we should know enough to conquer our Jaccard distance problem and step into our real problem:

## Who has similar tastes in music?

What we will attempt is to build a small process that takes feedback from a survey and applies a distance function to find similar users based on Jaccard distance.

Along the way we will be:
* Working with requests
* Understanding Python fundamentals with sets and lists
* Cleaning up bad data
* Implementing a Jaccard distance function
* Finding similar users

First, we will be taking a survey!  Let's all visit the survey posted in the channel before continuing.

*[Check out #dsi-sf-2-Lounge]*

Hopefully everything goes smoothly.  It's possible that I may need to modify the permissions on the sheet or provide a CSV snapshot if we hit a snag.

We will fetch our results via HTTP, then wrap them in StringIO, which lets us operate on strings as if they were files, and load them into Pandas as a DataFrame.  This is set up for us now.
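
As a minimal sketch of the StringIO step on its own, the example below skips the HTTP request and feeds pandas a small, made-up CSV string. It assumes Python 3 (`io.StringIO`); the column names and rows are illustrative only, not the real survey data.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text standing in for the body of the HTTP response
raw_csv = (
    "Timestamp,Name,genres\n"
    '10/3/2016,Dave,"Blues, Classical"\n'
    "10/4/2016,Random,Rock\n"
)

# StringIO wraps the string in a file-like object that read_csv can consume
toy_df = pd.read_csv(StringIO(raw_csv), index_col=0)
print(toy_df.shape)  # (2, 2)
```

The quoted field keeps "Blues, Classical" together as one value, which is why the genre answers arrive as a single comma-separated string per person.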

In [22]:
import pandas as pd
import requests

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

%matplotlib inline

spreadsheet = "https://docs.google.com/spreadsheets/d/1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0/export?format=csv&id=1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0&gid=216538035"
http = requests.get(spreadsheet)
# http.text is the decoded response body, which StringIO accepts on both Python 2 and 3
csv_data = StringIO(http.text)
df = pd.read_csv(csv_data, index_col=0)

In [23]:
df

| Timestamp | Name | Favorite Genres / Genres you like | What time of day do you like to listen to music? |
|---|---|---|---|
| 10/3/2016 23:12:00 | Dave | Blues, Classical, Dance, Easy Listening, Elect... | 24/7 |
| 10/3/2016 23:15:04 | Kiefer | Easy Listening, New Age, Ultra Speed Metal | hunting truffles |
| 10/4/2016 11:19:10 | Alberto Rios | Classical, Electronic Music, Jazz, Latin Music... | Morning, Afternoon |
| 10/4/2016 11:19:29 | Kat | Alternative Music, Country, Electronic Music, ... | Morning, Afternoon, Special occasions |
| 10/4/2016 11:19:29 | DOPE/SWAG | Electronic Music, Hip Hop / Rap, Indie Pop, Pop | 24/7 |
| 10/4/2016 11:19:28 | ? | Alternative Music, Dance, Electronic Music, In... | Morning, Noon, Afternoon, Night, 24/7 |
| 10/4/2016 11:19:41 | Gwar! | Easy Listening, Opera, Singer / Songwriter (in... | Special occasions, Singing in the shower |
| 10/4/2016 11:19:54 | EDWARD | Classical, Electronic Music, Asian Pop (J-Pop,... | 24/7 |
| 10/4/2016 11:19:56 | Random | Rock | Morning |
| 10/4/2016 11:19:56 | Tim | Alternative Music, Blues, Rock, Singer / Songw... | Morning, Night, Special occasions, doing chores |


**1. Rename the genre feature**

We get bad data from spreadsheets all the time.  In this case, it's coming from a survey.  For ease of reference, rename the feature **"Favorite Genres / Genres you like"** to **"genres"**.
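
One possible approach, sketched on a small made-up frame rather than the real survey data, is `DataFrame.rename` with a `columns` mapping:

```python
import pandas as pd

# Toy frame that reuses the survey's original column name
toy = pd.DataFrame({
    "Name": ["Dave", "Random"],
    "Favorite Genres / Genres you like": ["Blues, Classical", "Rock"],
})

# rename returns a new DataFrame, so assign it back to keep the change
toy = toy.rename(columns={"Favorite Genres / Genres you like": "genres"})
print(list(toy.columns))  # ['Name', 'genres']
```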


**2. Select only your response from the new "genres" feature**

Try printing out only the first value, where df["Name"] == "[Your name]".

**3. Take your survey response for "genres" and split it into a list with one element per genre you chose**

For example, if you chose "Blues, Reggae, Electronic Music", convert it to a list that looks like ["Blues", "Reggae", "Electronic Music"].
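
Steps 2 and 3 could be sketched as follows on a made-up frame. The names and answers are illustrative, and the column is assumed to already be called "genres".

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Dave", "Random"],
    "genres": ["Blues, Reggae, Electronic Music", "Rock"],
})

# The boolean mask keeps matching rows; .iloc[0] takes the first match
my_genres = toy.loc[toy["Name"] == "Dave", "genres"].iloc[0]

# Split on ", " so each chosen genre becomes one list element
my_list = my_genres.split(", ")
print(my_list)  # ['Blues', 'Reggae', 'Electronic Music']
```

Note that a genre whose own name contains a comma, such as the "Asian Pop (J-Pop,..." entry in the survey output, would be cut into pieces by this simple split.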

**4. Create a function that takes 2 lists and calculates the Jaccard distance**

That's 0 to 60 mph, I know, but you can do this!  Double-check our slides, and refer to the set operations above for how to calculate this.

Here is some boilerplate to get you going.

In [None]:
def jaccard(list1, list2):
    print("list1:", list1)
    print("list2:", list2)
    # Update / your code here

list1 = ['blue', 'green', 'yellow']
list2 = ['black', 'orange', 'yellow', 'green']

jaccard(list1, list2)
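
For reference, the usual definition of Jaccard similarity $J$ and Jaccard distance $d_J$ for two sets $A$ and $B$ is given below. Check it against the slides, which may use different notation.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B)
$$

For the two boilerplate lists, the intersection is {green, yellow} (2 items) and the union is {black, blue, green, orange, yellow} (5 items), so $J = 2/5 = 0.4$ and $d_J = 0.6$.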

**5.  Now for our final trick, calculate the distance between your genre preferences and everyone else's.**

Loop through everyone in the dataframe, create a list out of their "genres" string, print their name, and finally print the distance between your set and theirs.
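
The looping pattern might look like the sketch below. It runs on a made-up frame and uses the number of shared genres as a stand-in metric, so that the `jaccard` function from step 4 is still yours to write and swap in.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Dave", "Kat", "Random"],
    "genres": ["Blues, Rock", "Rock, Country", "Rock"],
})

# Your own answer as a set (here we pretend to be "Dave")
me = set(toy.loc[toy["Name"] == "Dave", "genres"].iloc[0].split(", "))

shared_counts = {}
for _, row in toy.iterrows():
    theirs = set(row["genres"].split(", "))
    # Stand-in metric: count of shared genres; replace with your jaccard call
    shared_counts[row["Name"]] = len(me & theirs)
    print(row["Name"], shared_counts[row["Name"]])
```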

**Optional 6. Try calculating the distance on the time of day feature.**

Try to make a new dataframe, for just you vs everyone, using Jaccard and time of day.  Are there any interesting patterns you see?

**Optional 7. What can you say about the selection of options for genre or time and what they mean?**

## Build a Cosine Sim Function for DSI-SF-2!

In [145]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

# cosine_similarity expects 2D input (one row per sample), so wrap each vector in a list
cosine_similarity([[1,3,4,5,6]], [[10,3,3,4,8]])



array([[ 0.75429803]])
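
To make the number above less of a black box: cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; the helper name `cosine` is ours, not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Same vectors as the scikit-learn call above
print(round(cosine([1, 3, 4, 5, 6], [10, 3, 3, 4, 8]), 8))
```

Identical directions give 1.0, and vectors with no overlapping non-zero positions give 0.0, which is why unrelated sentences score 0.0 further down.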

In [148]:
spreadsheet = "https://docs.google.com/spreadsheets/d/1pUlkU1nR_Akw_ghIhOUqxnv5tadh45HLwoOcMkkPYcE/export?format=csv&id=1pUlkU1nR_Akw_ghIhOUqxnv5tadh45HLwoOcMkkPYcE#gid=1294015227"
http = requests.get(spreadsheet)
csv_data = StringIO(http.text)
df = pd.read_csv(csv_data, index_col=0)

In [149]:
df = df.rename(columns={"User your words to describe your favorite food in 1 sentence.": 'text'}).dropna()
df.index = range(df.shape[0])
df

|   | text |
|---|---|
| 0 | Pasta. |
| 1 | Korean bibimbap |
| 2 | My favorite comfort food is nachos. |
| 3 | Pizza |
| 4 | Spicy |
| 5 | anything indian |
| 6 | French fries are made from potatoes, so techni... |
| 7 | Minced raw lamb/beef with bulgur wheat and spi... |
| 8 | These are MY chips, nachos |
| 9 | unhealthy junk food |


In [157]:
vectorizer = CountVectorizer()
# vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text']).toarray()
X[6]

array([0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
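
Each position in the array above corresponds to one word in the vocabulary learned from all the sentences, and the value is how many times that word occurs in row 6. A minimal sketch on two made-up sentences, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["nachos and chips", "chips chips"]  # made-up sentences

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()

# vocabulary_ maps each learned word to its column index
print(sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]))
print(counts.tolist())  # [[1, 1, 1], [0, 2, 0]]
```

The columns are ordered alphabetically (and, chips, nachos), so the second sentence has a 2 in the "chips" column and zeros elsewhere.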

In [159]:
# vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text']).toarray()
X[6]

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.42923366,  0.        ,  0.42923366,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.49157401,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.49157401,
        0.        ,  0.        ,  0.        ,  0.        ,  0.38500248,  0.        ])

In [160]:
# similarities = cosine_similarity(X[8].reshape(1, -1), X)

for user_index, sentence in enumerate(df['text']):
    print("Source text:  %s \n-------------------------------------------" % sentence)
    user_recs = df.copy()  # copy so the original frame is left untouched
    # Despite the column name, this is cosine similarity: higher means closer
    user_recs['cosine_dist'] = cosine_similarity(X[user_index].reshape(1, -1), X)[0]
    print("Top 5 closest items by cosine: ")
    print(user_recs[['cosine_dist', 'text']].sort_values("cosine_dist", ascending=False).head(5).values)
    print("\n\n===========================================================")

Source text:  Pasta.  
-------------------------------------------
Top 5 closest items by cosine: 
[[1.0 'Pasta. ']
 [0.0 'unhealthy junk food']
 [0.0 'Messy Melting Mozzarella is bliss.']
 [0.0 'Any kind of french fries']
 [0.0
  "I like all foods that don't taste terrible and have large portion sizes."]]


Source text:  Korean bibimbap 
-------------------------------------------
Top 5 closest items by cosine: 
[[1.0000000000000002 'Korean bibimbap']
 [0.0 'Pasta. ']
 [0.0 'unhealthy junk food']
 [0.0 'Messy Melting Mozzarella is bliss.']
 [0.0 'Any kind of french fries']]


Source text:  My favorite comfort food is nachos. 
-------------------------------------------
Top 5 closest items by cosine: 
[[1.0 'My favorite comfort food is nachos.']
 [0.3058993997556311 'These are MY chips, nachos']
 [0.24433725158459513 'unhealthy junk food']
 [0.0 'Messy Melting Mozzarella is bliss.']
 [0.0 'Any kind of french fries']]


Source text:  Pizza 
-------------------------------------------
To

