<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Recommendations: Collaborative Filtering Lab

_Author: Dan Wilhelm (LA)_

---

# Collaborative Filtering Lab

Today, we will be writing a User-Item Collaborative Filtering recommendation engine. This engine ranks all users by their similarity to a given user. Then, it recommends brands liked by the most similar users, weighting each brand by the similarity of the users who like it.

Because Collaborative Filtering is relatively easy to implement and not part of Scikit-learn, we will be writing it from scratch using vanilla Python.

In [1]:
from collections import Counter

%matplotlib inline
from matplotlib import pyplot as plt

BRANDS_FILE = './datasets/user_brand.csv'

## Load in User-Brands Data

In [2]:
from io import open

user_brands = dict()

with open(BRANDS_FILE, 'r', encoding='utf-8') as fin:
    data = [line.strip().split(",") for line in fin]

data[:20]

[['80002', 'Target'],
 ['80002', 'Home Depot'],
 ['80010', "Levi's"],
 ['80010', 'Puma'],
 ['80010', 'Cuisinart'],
 ['80010', 'Converse'],
 ['80010', 'DKNY'],
 ['80010', 'Express'],
 ['80010', "Kohl's"],
 ['80010', 'Old Navy'],
 ['80010', 'Container Store'],
 ['80010', 'Nordstrom'],
 ['80011', 'Kenneth Cole'],
 ['80011', 'Calvin Klein'],
 ['80011', 'French Connection'],
 ['80011', 'BCBGMAXAZRIA'],
 ['80011', 'Nine West'],
 ['80011', 'Steve Madden'],
 ['80011', 'Diesel'],
 ['80011', 'Guess']]

## Explore the Data

To assist you in exploring, make the following variables:

In [9]:
users = []   # List of all users
brands = []  # List of all brands

# user_brands = {'48132': {'Target', 'H&M', 'Gap'}, '31341': {'Zipcar'}, ... }
user_brands = {}

# brand_users = {'Target': {'48132', '84172', '12353'}, 'Zipcar': {'31341'}, ... }
brand_users = {}
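
One possible way to fill in these lookups is sketched below. It runs on a few made-up rows rather than the real `data` list, and every `sample_*` name is hypothetical so that it does not overwrite the variables you are asked to build. Treat it as an illustration of the idea, not as the expected solution.

```python
from collections import defaultdict

# Made-up rows in the same [user_id, brand] shape as `data`.
sample_rows = [
    ['80002', 'Target'],
    ['80002', 'Home Depot'],
    ['80010', 'Puma'],
    ['80010', 'Target'],
]

sample_user_brands = defaultdict(set)   # user id -> set of brands the user likes
sample_brand_users = defaultdict(set)   # brand   -> set of user ids who like it

# A single pass over the rows fills both lookups.
for user_id, brand in sample_rows:
    sample_user_brands[user_id].add(brand)
    sample_brand_users[brand].add(user_id)

# The dictionary keys give the unique users and brands.
sample_users = sorted(sample_user_brands)
sample_brands = sorted(sample_brand_users)

print(len(sample_users), len(sample_brands))  # 2 3
```

Using sets as the dictionary values means a repeated (user, brand) row would be counted only once.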

1 - How many unique users are there?

2 - How many unique brands are there?

3 - What is the distribution of the number of brands liked by users?

_Hint:_ Make a list of ```likes_per_user``` then plot a histogram!

+ For example: ```plt.hist([1, 1, 1, 2, 2, 3], bins=3)```

In [10]:
likes_per_user = []   # number of brands liked by each user (one entry per user)

#plt.hist(likes_per_user, bins=50);

4 - What is the distribution of the number of users who like a brand?

In [11]:
likes_per_brand = []   # number of users who like each brand (one entry per brand)

#plt.hist(likes_per_brand, bins=50);

5 - How many people like **Target**?

6 - How many people like **Banana Republic**?

7 - What brands does **user 86184** like?

8 - What brands does **user 83126** like?

## Jaccard Distance Measure

Given two sets of brands, e.g. user1 = {'Target', 'Starbucks', 'Gap'} and user2 = {'Starbucks', 'Old Navy'}, the Jaccard distance is:

+ jaccard(user1, user2) = 1 - (# brands both users like) / (# distinct brands either user likes).
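
To make the formula concrete, here is a minimal sketch applied to the two example sets above. The function name `jaccard_distance_sketch` is hypothetical (chosen so it does not clash with the `jaccard` function you will write), and returning infinity for two empty sets is an assumption that mirrors the stub in the next cell. It assumes Python 3 division.

```python
def jaccard_distance_sketch(set1, set2):
    """Jaccard distance: 1 - |intersection| / |union|."""
    union = set1 | set2
    if not union:
        # Both sets are empty, so the ratio is undefined; return infinity
        # (an assumption matching the lab's stub).
        return float('inf')
    return 1.0 - len(set1 & set2) / len(union)


user1 = {'Target', 'Starbucks', 'Gap'}
user2 = {'Starbucks', 'Old Navy'}

# One brand in common (Starbucks) out of four distinct brands: 1 - 1/4.
print(jaccard_distance_sketch(user1, user2))  # 0.75
```

A distance of 0 means the two users like exactly the same brands, and a distance of 1 means they share none.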

In [12]:
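def jaccard(set1, set2):
    # The distance is undefined when both sets are empty, so return infinity.
    if len(set1) == 0 and len(set2) == 0:
        return float('inf')

    # TODO: replace this placeholder with the Jaccard distance.
    return 0.0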

In [13]:
# Make test sets by hand, for example using 'Target' and 'Banana Republic'.
# Compute what the jaccard score should be for your test set.
# Does calling your function yield the same result?



## Weighted Jaccard

This metric does not fully capture our intuition of distance between two users and the brands they like. For example, two users who have Target in common are less likely to be similar than two users who have Autozone in common. So, let's add a weighting that emphasizes less frequent brands.

+ Weight each brand by 1/(brand's total likes). A brand with only 2 likes gets a large weight (1/2), while a brand with 100 likes gets a much smaller one (1/100).
+ This weighting works because "Target" is liked by most users, so it is a less meaningful brand for similarity than "Zipcar".
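
The weighting can be sketched as follows. This is only one reasonable reading of "weighted Jaccard": it replaces the counts in the Jaccard ratio with sums of 1/(likes) weights. The like counts, the name `weighted_jaccard_sketch`, and the extra `freq` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical like counts; in the lab these would be tallied from `data`.
sample_brand_freq = Counter({'Target': 100, 'Zipcar': 2, 'Gap': 10})


def weighted_jaccard_sketch(set1, set2, freq):
    """1 - (weight of shared brands) / (weight of all brands), weight = 1/likes."""
    def total_weight(brand_set):
        # Brands missing from `freq` are treated as having one like (an assumption).
        return sum(1.0 / max(freq[b], 1) for b in brand_set)

    union_weight = total_weight(set1 | set2)
    if union_weight == 0:
        # Both sets are empty, so the ratio is undefined.
        return float('inf')
    return 1.0 - total_weight(set1 & set2) / union_weight


# Sharing the popular brand (Target) should leave users farther apart
# than sharing the rare brand (Zipcar).
share_common = weighted_jaccard_sketch({'Target', 'Gap'}, {'Target', 'Zipcar'}, sample_brand_freq)
share_rare = weighted_jaccard_sketch({'Zipcar', 'Gap'}, {'Target', 'Zipcar'}, sample_brand_freq)
print(share_common > share_rare)  # True
```

With unweighted Jaccard, both pairs would have the same distance, since each pair shares one brand out of three.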

In [15]:
# count how many times each brand appears in the entire dataset
brand_freq = Counter()  # ???

def weighted_jaccard(set1, set2):
    return 0.0

In [16]:
# Make test sets by hand, for example using 'Target' and 'Banana Republic'.
# Compute what the weighted jaccard score should be for your test set.
# Does calling your function yield the same result?


## Recommendation Engine

First, we'll define two already completed helper functions. 

```
# Returns a formatted string recommending brands similar to Target
similar_brands('Target')

# Returns a formatted string recommending brands that user 86184 may like
similar_users('86184')
```



In [1]:
def similar_brands(brand_name):
    """
    Given a brand name **string**, returns a pretty-print string of 
        recommendations of more brands.
    """

    # IMPORTANT: 'recommend_for_brands' expects a set of brand names.
    #   Because 'brand_name' is a string, we convert the single name 
    #   to a set containing the brand name

    recs = recommend_for_brands({brand_name})

    return "For a user who likes {liked}, we recommend {recs}.".format(
        liked=brand_name,
        recs=", ".join(recs))


def similar_users(user):
    """
    Given a user name **string**, returns a pretty-print string 
      of recommendations for a user.
    """
    recs = recommend_for_user(user)

    return "For user {user}, who likes {liked}, we recommend {recs}.".format(
        user=user,
        liked=", ".join(user_brands.get(user, ["nothing"])),
        recs=", ".join(recs))

Next, we define how to recommend brands to users. Make sure you understand how it works: we simply ask for brands similar to the ones the user already likes.

The function below is complete, so no additional code needs to be written!

In [18]:
def recommend_for_user(user_string):
    """
    Recommend items to a user that are similar
      to the brands the user already likes.
    """

    # NOTE: The second parameter to get() is the default value
    #   returned if the user is not a key in 'user_brands'.
    return recommend_for_brands(user_brands.get(user_string, set()))

Finally, we implement recommending brands based on a set of brands. To do this, we will find the users whose liked brands are closest to that set. Then, we will recommend other brands that those users like.

Use ```jaccard``` or ```weighted_jaccard``` as your distance function to rank users. Then, use your best judgment for how to collect the additional brands.

In [101]:
def recommend_for_brands(brands_set):
    """
    Return top five recommended brands
      based on the brands in 'brands_set'.
    """
    
    # The strategy is:
    #   (1) Find the users most similar to the 'brands_set'.
    #   (2) Get 5 brands those users also like
    #   (3) OPTIONAL: Weight the 5 brands by most unique to least unique

    return []
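
The strategy in the comments above could be sketched like this. It is one of many valid designs, not the reference solution: the users are made up, the names ending in `_sketch` and the parameters `n_users` and `n_recs` are hypothetical, and it ranks by Jaccard similarity (1 minus the distance) so that larger scores mean closer users.

```python
from collections import Counter

# Made-up users for illustration only.
sample_user_brands = {
    'u1': {'Target', 'Gap', 'Old Navy'},
    'u2': {'Target', 'Gap', 'Puma'},
    'u3': {'Zipcar', 'Diesel'},
}


def jaccard_similarity(set1, set2):
    """Jaccard similarity, i.e. 1 - Jaccard distance (0.0 if both sets are empty)."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


def recommend_for_brands_sketch(brands_set, user_brands, n_users=2, n_recs=5):
    # (1) Rank users from most to least similar to 'brands_set'.
    ranked = sorted(
        ((jaccard_similarity(brands_set, liked), user)
         for user, liked in user_brands.items()),
        reverse=True,
    )

    # (2) Each of the closest users votes for the brands they like that are
    #     not already in 'brands_set'; a vote counts as much as the user's
    #     similarity.
    scores = Counter()
    for similarity, user in ranked[:n_users]:
        if similarity <= 0:
            continue  # users with nothing in common contribute nothing
        for brand in user_brands[user] - brands_set:
            scores[brand] += similarity

    # (3) Return the names of the highest-scoring brands.
    return [brand for brand, _ in scores.most_common(n_recs)]


print(sorted(recommend_for_brands_sketch({'Target', 'Gap'}, sample_user_brands)))
# ['Old Navy', 'Puma']
```

Here `u1` and `u2` are equally similar to the input set, so each contributes the one brand the input does not already contain, while `u3` shares nothing and is ignored.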

## Testing

In [None]:
# For testing, recommend brands similar to Target and Banana Republic
print("\n" + similar_brands("Target"))
print(similar_brands("Banana Republic"))

# For testing, recommend brands for users 86184 and 83126
#    NOTE: This is based on the brands each user likes
print(similar_users("86184"))
print(similar_users("83126"))

## Recap

You just implemented User-Item collaborative filtering, making recommendations as follows:
+ **Step One:** Find users similar to the target user.
+ **Step Two:** Recommend things that similar users like.

## Challenge: Item-Item Collaborative Filtering

Now, let's try making an Item-Item recommendation system, as follows. Given a user, we will again recommend other brands the user may like. Instead of looking at other similar users, however, this time we will look at similar items to what the user likes:

+ **Step One:** Determine similarity of brands to one another. 
    - Of all users who like brand X, what other brands do they most like?


+ **Step Two:** Given a target user, recommend similar brands to the brands the user likes.

Note there may be many valid ways of implementing both steps!
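
As an illustration of the two steps, here is a small self-contained sketch on made-up data. All names ending in `_sketch` are hypothetical. Both the popularity weighting (dividing co-like counts by a brand's total likes) and the rank-based pooling in step two are assumptions, and other choices would be equally valid.

```python
from collections import Counter

# Made-up users for illustration only.
sample_user_brands = {
    'u1': {'Target', 'Gap', 'Old Navy'},
    'u2': {'Target', 'Gap'},
    'u3': {'Target', 'Zipcar'},
    'u4': {'Gap', 'Old Navy'},
}

# Invert the mapping: brand -> set of users who like it.
sample_brand_users = {}
for user, liked in sample_user_brands.items():
    for brand in liked:
        sample_brand_users.setdefault(brand, set()).add(user)


def most_similar_brands_sketch(brand, max_n=10):
    # Step one: among users who like 'brand', count the other brands they like.
    co_likes = Counter()
    for user in sample_brand_users.get(brand, set()):
        for other in sample_user_brands[user] - {brand}:
            co_likes[other] += 1
    # Down-weight popular brands by dividing by their total number of likes.
    scores = {other: count / len(sample_brand_users[other])
              for other, count in co_likes.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda b: (-scores[b], b))[:max_n]


def user_recommendations_sketch(user, max_n=10):
    # Step two: pool the brands similar to each brand the user likes,
    # giving more credit to brands that rank higher in each list.
    liked = sample_user_brands.get(user, set())
    pooled = Counter()
    for brand in liked:
        for rank, similar in enumerate(most_similar_brands_sketch(brand)):
            if similar not in liked:
                pooled[similar] += 1.0 / (rank + 1)
    return [brand for brand, _ in pooled.most_common(max_n)]


print(most_similar_brands_sketch('Old Navy'))   # ['Gap', 'Target']
print(user_recommendations_sketch('u2'))        # ['Old Navy', 'Zipcar']
```

In a real system, the per-brand similarity lists would typically be precomputed and cached, so that step two only needs fast lookups.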

In [3]:
# Often, this is computed overnight for all brands and 
#     stored for making quick recommendations

# For a given brand, returns the most similar other brands
def get_most_similar_brands(brand, max_n=10):
    most_similar_brand_names = []
    
    # 1. Of the users who like this brand, 
    #     count how many times they like other brands.
    
    
    # 2. Weight the brands to eliminate common ones such as Target.
    
    
    # 3. Sort the brands by largest score and return only the names
    
    return most_similar_brand_names[:max_n]

In [None]:
import random

def get_user_recommendations(user, max_n=10):
    final_brand_names = []
    
    # 1. Given the brands the user likes, return the similar brands!
    
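    # Sample at most max_n names; random.sample raises ValueError if asked
    #   for more items than the list contains.
    return random.sample(final_brand_names, min(max_n, len(final_brand_names)))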