Recommendation Engines
- Content-Based and Collaborative Filtering
- Jaccard Similarity
- Modified KNN

In [3]:
########################################
## Collaborative-Based User Filtering ##
########################################
import pandas as pd

#read in brands data
user_brands = pd.read_csv('../data/user_brand.csv')

In [4]:
#look at count of stores
user_brands.Store.value_counts()

Target             1866
Old Navy           1200
Home Depot         1186
Kohl's             1157
Banana Republic     932
Nordstrom           904
Gap                 860
Crate & Barrel      816
Express             785
KitchenAid          700
J.Crew              569
Container Store     564
Steve Madden        539
Guess               509
Cuisinart           506
...
Joseph Abboud        1
Hugo Boss            1
Chanel               1
Sky                  1
Gymboree             1
Marmot               1
Stride Rite          1
Oakley               1
DC                   1
J Jill               1
Carter               1
Brooks Brothers      1
David Tutera         1
Barneys Warehouse    1
Walk-Over            1
Length: 198, dtype: int64

In [5]:
# Series of user IDs, note the duplicates
user_ids = user_brands.ID

In [6]:
# groupby ID to see what each user likes!
user_brands.groupby('ID').Store.value_counts()

ID                    
80002  Home Depot         1
       Target             1
80010  Kohl's             1
       DKNY               1
       Converse           1
       Levi's             1
       Express            1
       Old Navy           1
       Container Store    1
       Puma               1
       Cuisinart          1
       Nordstrom          1
80011  Crate & Barrel     1
       BCBGMAXAZRIA       1
       Banana Republic    1
...
91944  Crate & Barrel       1
       Target               1
       Nine West            1
       Guess                1
       French Connection    1
91946  Nordstrom            1
       Target               1
       Levi's               1
       Old Navy             1
91955  Serena and Lily      1
91957  Container Store      1
       Target               1
       BCBGMAXAZRIA         1
       Express              1
       Old Navy             1
Length: 23804, dtype: int64

In [7]:
# turns my data frame into a dictionary
# where the key is a user ID, and the value is a 
# list of stores that the user "likes"
brandsfor = {str(k): list(v) for k,v in user_brands.groupby("ID")["Store"]}
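To make the comprehension above concrete, here is a minimal sketch of the same grouping done without pandas. The `rows` list is hypothetical sample data standing in for the ID and Store columns of `user_brands`, and `brandsfor_sketch` is a name introduced here only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the ID and Store columns
rows = [
    (83065, "Kohl's"),
    (80002, "Home Depot"),
    (83065, "Target"),
    (80002, "Target"),
]

# Group stores by user ID; keys are strings, as in brandsfor
brandsfor_sketch = defaultdict(list)
for user_id, store in rows:
    brandsfor_sketch[str(user_id)].append(store)

print(brandsfor_sketch["83065"])  # ["Kohl's", 'Target']
```

Each key ends up holding the stores for that user in the order the rows were read, which is what the `groupby` comprehension produces for the real data frame.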

In [8]:
# try it out. User 83065 likes Kohl's and Target
brandsfor['83065']

["Kohl's", 'Target']

In [None]:
# User 82983 likes many more!
brandsfor['82983']
  


In [9]:
########################
## Jaccard Similarity ##
########################

'''
The Jaccard Similarity allows us to compare two sets.
If we regard people as merely being a set of brands they prefer,
the Jaccard Similarity allows us to compare people.

Example: the Jaccard similarity between user 82983 and 83065 is .125
            because
            brandsfor['83065'] == ["Kohl's", 'Target']
            brandsfor['82983'] == ['Hanky Panky', 'Betsey Johnson', 'Converse', 'Steve Madden', 'Old Navy', 'Target', 'Nordstrom']

the intersection of these two sets is just {'Target'}
the union of the two sets is {"Kohl's", 'Target', 'Hanky Panky', 'Betsey Johnson', 'Converse', 'Steve Madden', 'Old Navy', 'Nordstrom'}
so len(intersection) / len(union) = 1 / 8 == .125

EXERCISE: what is the Jaccard Similarity 
          between user 82956 and user 82963?
# ANSWER == 0.3333333333
'''


In [10]:
brandsfor['82956'] # == ['Diesel', 'Old Navy', 'Crate & Barrel', 'Target']

brandsfor['82963'] # == ['Puma', 'New Balance', 'Old Navy', 'Target']


['Puma', 'New Balance', 'Old Navy', 'Target']

EXERCISE: 

Complete the jaccard method below.
          It should take in two lists of brands, and output the 
          jaccard similarity between them

This should work with lists of any hashable items, for example
jaccard([1,2,3], [2,3,4,5,6])  == .3333333

HINT: set1 & set2 is the intersection
      set1 | set2 is the union
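
One possible solution, offered as a sketch rather than the reference answer: convert both inputs to sets and divide the size of the intersection by the size of the union. Returning 0.0 when both inputs are empty is an assumption made here to avoid dividing by zero; the exercise does not specify that case.

```python
def jaccard(first, second):
    """Jaccard similarity between two collections of hashable items."""
    set1, set2 = set(first), set(second)
    union = set1 | set2
    if not union:
        # Assumption: two empty collections get similarity 0.0
        return 0.0
    # float() keeps the division fractional on Python 2 as well
    return float(len(set1 & set2)) / len(union)

print(jaccard([1, 2, 3], [2, 3, 4, 5, 6]))  # 0.3333333333333333
```

Converting to sets first means duplicate entries in either list do not affect the result.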

In [None]:
# try it out!
brandsfor['83065'] # brands for user 83065
brandsfor['82983'] # brands for user 82983
jaccard(brandsfor['83065'], brandsfor['82983'])


jaccard(brandsfor['82956'], brandsfor['82963'])

In [11]:
#######################
### Our Recommender ###
#######################

'''
Our recommender will be a modified KNN collaborative algorithm.
Input: A given user's brands that they like
Output: A set (no repeats) of brand recommendations based on
        similar users' preferences

1. When a user's brands are given to us, we will calculate the input user's
jaccard similarity with every person in our brandsfor dictionary

2. We will pick the K most similar users and recommend
the brands that they like that the given user doesn't know about

EXAMPLE:
Given User likes ['Target', 'Old Navy', 'Banana Republic', 'H&M']
Outputs: ['Forever 21', 'Gap', 'Steve Madden']
'''

In [None]:
given_user = ['Target', 'Old Navy', 'Banana Republic', 'H&M']

# similarity between user 83065 and given user
brandsfor['83065']
jaccard(brandsfor['83065'], given_user) 
# should be 0.2

In [12]:
'''
EXERCISE
    Find the similarity between given_user and ALL of our users
    output should be a dictionary where
    the key is a user id and the value is the jaccard similarity
{...
 '83055': 0.25,
 '83056': 0.0,
 '83058': 0.1111111111111111,
 '83060': 0.07894736842105263,
 '83061': 0.4,
 '83064': 0.25,
 '83065': 0.2,
 ...}
 '''
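
A possible way to build this dictionary is a dict comprehension over the user-to-brands mapping. The sketch below uses a small made-up `brandsfor_sample` (only user 83065's brands come from this notebook; `u2` and `u3` are invented) and a compact `jaccard` helper so that it runs on its own. The real exercise would iterate over the full `brandsfor`.

```python
def jaccard(first, second):
    # Compact helper: intersection size over union size, 0.0 if both are empty
    set1, set2 = set(first), set(second)
    union = set1 | set2
    return float(len(set1 & set2)) / len(union) if union else 0.0

given_user = ['Target', 'Old Navy', 'Banana Republic', 'H&M']

# Hypothetical sample; users 'u2' and 'u3' are invented for illustration
brandsfor_sample = {
    '83065': ["Kohl's", 'Target'],
    'u2': ['Target', 'Old Navy', 'Gap'],
    'u3': ['Home Depot'],
}

# Map every user ID to its similarity with given_user
similarities = {
    user_id: jaccard(brands, given_user)
    for user_id, brands in brandsfor_sample.items()
}
print(similarities['83065'])  # 0.2
```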

In [None]:
K = 5 #number of similar users to look at


# Now for the top K most similar users, let's aggregate the brands they like.
# I sort by the jaccard similarity so the most similar users come first.
# sorted() iterates over the dictionary's keys (the user IDs), so I pass
# key=similarities.get to sort those IDs by their similarity value,
# and reverse=True so the most similar users end up on top.
most_similar_users = sorted(similarities, key=similarities.get, reverse=True)[:K]

In [None]:
# list of K similar users' IDs
most_similar_users

In [None]:
# let's see what some of the most similar users like
brandsfor[most_similar_users[0]]

brandsfor[most_similar_users[3]]

In [None]:
# Aggregate all brands liked by the K most similar users into a single set
brands_to_recommend = set()
for user in most_similar_users:
    # for each user
    brands_to_recommend.update(set(brandsfor[user]))
    # add to the set of brands_to_recommend

In [None]:
# UH OH, the set includes brands that given_user already likes:
# Banana Republic, Old Navy, Target are all repeats.


# EXERCISE: use a set difference so brands_to_recommend only has
# brands that given_user hasn't seen yet
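
One way to approach this exercise is Python's set difference operator. The brand set below is made up for illustration; in the notebook, `brands_to_recommend` would be the set built in the previous cell. The name `new_brands` is introduced here only for the sketch.

```python
given_user = ['Target', 'Old Navy', 'Banana Republic', 'H&M']

# Hypothetical aggregate of brands liked by the K most similar users
brands_to_recommend = {'Target', 'Old Navy', 'Banana Republic', 'Gap', 'Forever 21'}

# Keep only the brands given_user does not already like
new_brands = brands_to_recommend - set(given_user)
print(sorted(new_brands))  # ['Forever 21', 'Gap']
```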

In [None]:
######################
## One Step Further ##
######################

# We can take this one step further and calculate a recommendation "score".
# We will define the score as the number of times
# a brand appears among the K most similar users
brands_to_recommend = []
for user in most_similar_users:
    brands_to_recommend += list(set(brandsfor[user]) - set(given_user))

In [None]:
# Use a counter to count the number of times a brand appears
from collections import Counter
recommend_with_scores = Counter(brands_to_recommend)


In [None]:

# Now we see Gap has the highest score!
recommend_with_scores
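
`Counter.most_common` returns the brands ordered by score, which is a convenient way to read off the top recommendations. The list below is invented sample data, not this notebook's actual output, and `top_two` is a name used only for this sketch.

```python
from collections import Counter

# Hypothetical brands gathered from the K most similar users,
# after removing the brands given_user already likes
brands_to_recommend = ['Gap', 'Forever 21', 'Gap', 'Steve Madden', 'Gap', 'Forever 21']

recommend_with_scores = Counter(brands_to_recommend)

# Top two brands with their scores, highest first
top_two = recommend_with_scores.most_common(2)
print(top_two)  # [('Gap', 3), ('Forever 21', 2)]
```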

In [None]:

#################################
#### Collaborative Item based ###
#################################

'''
We can also define a similarity between items using jaccard similarity.
We can say that the similarity between two items is the jaccard similarity
between the sets of people who like the two brands.

Example: similarity of Gap to Old Navy is:
'''

In [None]:
# sets of user IDs who like Gap and who like Old Navy
gap_lovers = set(user_brands[user_brands.Store == 'Gap'].ID)
old_navy_lovers = set(user_brands[user_brands.Store == 'Old Navy'].ID)

In [None]:

# similarity between Gap and Old Navy
jaccard(gap_lovers, old_navy_lovers)

In [None]:
guess_lovers = set(user_brands[user_brands.Store == 'Guess'].ID)
# similarity between Gap and Guess
jaccard(guess_lovers, gap_lovers)

In [None]:

calvin_lovers = set(user_brands[user_brands.Store == 'Calvin Klein'].ID)
# similarity between Gap and Calvin Klein
jaccard(calvin_lovers, gap_lovers)
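
The same pattern extends to any pair of brands. As a sketch, the helpers below invert a user-to-brands dictionary into a brand-to-users dictionary and compare two brands by the users they share. The names `invert`, `brand_similarity`, and `usersfor`, along with the sample data, are assumptions introduced here for illustration and are not part of the original notebook.

```python
from collections import defaultdict

def jaccard(first, second):
    # Intersection size over union size, 0.0 if both are empty
    set1, set2 = set(first), set(second)
    union = set1 | set2
    return float(len(set1 & set2)) / len(union) if union else 0.0

def invert(brandsfor):
    """Map each brand to the set of user IDs who like it."""
    usersfor = defaultdict(set)
    for user_id, brands in brandsfor.items():
        for brand in brands:
            usersfor[brand].add(user_id)
    return usersfor

def brand_similarity(usersfor, brand_a, brand_b):
    """Jaccard similarity between the sets of users who like each brand."""
    return jaccard(usersfor.get(brand_a, set()), usersfor.get(brand_b, set()))

# Hypothetical sample data
sample = {
    'u1': ['Gap', 'Old Navy'],
    'u2': ['Gap', 'Old Navy', 'Target'],
    'u3': ['Gap', 'Guess'],
    'u4': ['Old Navy'],
}
usersfor = invert(sample)
print(brand_similarity(usersfor, 'Gap', 'Old Navy'))  # 0.5
```

Building the inverted dictionary once avoids re-filtering the data frame for every brand pair.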