# Who To Follow: Recommending Brands

In this exercise, we consider a simple dataset: users following brands. We only know whether a user follows a brand, not how much they like it. Given the brands a user is following, we would like to recommend similar brands that they might be interested in.


This is another example of item-based collaborative filtering. The original data comes in as a list of (user, brand) pairs, so if we convert it into a user x brand matrix, each value is encoded as 0 (user does not follow brand) or 1 (user follows brand).

Given that we only have 1's and 0's, we probably should not use Pearson correlation or cosine similarity as a similarity metric.

You could easily extend this to product recommendations, with 1's and 0's recording whether a customer purchased a product: "You purchased this product. Here are some other products you might like."

### Import code and data

In [7]:
import numpy as np
import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt
%matplotlib inline

In [8]:
data = pd.read_csv('user-brands.csv')
data.head()

,id,brand
0,80002,Target
1,80002,Home Depot
2,80010,Levi's
3,80010,Puma
4,80010,Cuisinart


In [9]:
print("Shape:", data.shape, "Unique User IDs:", data.id.nunique(), "Unique Brands:", data.brand.nunique())

Shape: (23804, 2) Unique User IDs: 3759 Unique Brands: 198


### User-by-brand matrix

Note that our data above is in long format, with one row per (user, brand) pair. We could pivot it into a user-by-brand matrix, which is mostly zeros but might be easier to work with. You could do this with `pd.pivot_table`:

    M = pd.pivot_table(data, index='id', columns='brand', aggfunc='size', fill_value=0)

Alternatively, a `groupby` statement on `id` and `brand` gives us a multi-index series, and an `unstack` call transforms it into a dataframe again.

Note that these steps are not strictly necessary, as you could complete this exercise in several different ways.
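As a rough illustration of the `groupby`/`unstack` route mentioned above, here is a minimal sketch on a tiny made-up frame. The `toy` data and the names `M_pivot` and `M_group` are assumptions for the example, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for user-brands.csv (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "brand": ["Target", "Gap", "Target", "Gap", "Puma", "Target"],
})

# Route 1: pivot_table, counting rows per (id, brand) pair
M_pivot = pd.pivot_table(toy, index="id", columns="brand", aggfunc="size", fill_value=0)

# Route 2: groupby gives a multi-index Series of counts;
# unstack moves the brand level into the columns
M_group = toy.groupby(["id", "brand"]).size().unstack(fill_value=0)

print(M_group)
print((M_pivot.values == M_group.values).all())
```

Both routes should produce the same user-by-brand table here, so the choice between them is mostly a matter of taste.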

In [10]:
M = pd.pivot_table(data, index="id", columns="brand", aggfunc="size", fill_value=0)

In [11]:
M.head()

brand,6pm.com,Abercrombie & Fitch,Adidas,Aeropostale,Aldo,All Saints,Amazon.com,American Apparel,American Eagle,Ann Taylor,...,Walgreens,Walk-Over,Wet Seal,Windsor,YSL,Yves Saint Laurent,ZOO,Zara,Zipcar,vineyard vines
id
80002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80010,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80011,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Jaccard distance

Since we will use a neighborhood method, we need a definition of _distance_.  We'll use the _Jaccard distance_ for this. 

The [_Jaccard index_](https://en.wikipedia.org/wiki/Jaccard_index) is a similarity metric between two sets.  It measures how many elements two sets have in common, as a fraction of the total number of distinct elements in both sets.  

$$\text{Jaccard index} = \frac{ |A \cap B | }{ |A \cup B| }$$

We could make a Jaccard matrix $J$, with pairwise similarities $J_{ij}$ as entries.
- `J[i, j]` = Jaccard similarity between brands _i_ and _j_ (between 0 and 1)
- `J[i, i]` = 1, since every set is identical to itself, and
- `J[i, j]` = `J[j, i]`, i.e., the matrix is symmetric.

We could also define the _Jaccard distance_, which has $D_{ii} = 0$ for identical sets and grows as the sets have fewer elements in common.  We define $D = 1 - J$, which also has values between 0 and 1.

Common applications of the Jaccard index include text clustering, but we can use it for brand clustering as well, treating each brand as the set of its followers and counting the followers two brands have in common.
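To make the definition concrete, here is a small worked example with plain Python sets. The helper `jaccard_index` and the follower IDs are hypothetical, chosen only to illustrate the formula.

```python
def jaccard_index(a, b):
    """Jaccard index of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical follower IDs for two brands
followers_x = {1, 2, 3, 4}
followers_y = {3, 4, 5}

j = jaccard_index(followers_x, followers_y)  # 2 shared followers out of 5 distinct
d = 1 - j                                    # corresponding Jaccard distance
print(j, d)
```

The two brands share 2 of 5 distinct followers, so the index is 2/5 and the distance is 3/5.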

<hr>
## Exercise


- Create a brand-by-brand matrix $N$, with the Jaccard distance between two brands in each entry.
   - You'd have $N_{ii} = 0$ for each brand $i$, and $N_{ij} = N_{ji}$ for each pair of brands.
   - You can create a 2-dimensional `np.array` for this, or a nested dictionary `N = {i: {j: distance}}`, or anything you like.
      
      
- For a few brands of your choice, show the most similar brands.  
   - Do your results make sense? Would you agree?
   
   
- For a few users, make a few top recommendations.
   - Per user, display the brands they are already following
   - For each of those brands, compute the distance to all other brands
   - Average these distances per candidate brand and keep the few brands with the shortest average distance
   - Make sure you exclude the brands the user is already following from the recommendations
   
   
**Hint: Remember that in this case, lower distances are closer matches!**

In [20]:
def jaccard_distance(M):
    n_brands = M.shape[1]
    I = M.T.dot(M)  # number of users in common 
    n_users_per_brand = np.diag(I)
    N = n_users_per_brand.reshape(n_brands, 1) * np.ones(n_brands)
    U = N + N.T - I  # total unique followers = n_users_i + n_users_j - users in common
    J = I / U.astype(float)  # similarity matrix
    D = 1 - J  # distance
    return D
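One way to convince yourself that the matrix arithmetic above matches the set definition is to compare it with a slow, explicit version on a tiny matrix. This is only a sanity-check sketch: `toy_M`, `jaccard_distance_naive` and `D_vec` are made-up names, and the vectorized part restates the same steps with plain NumPy arrays.

```python
import numpy as np
import pandas as pd

# Toy user-by-brand matrix; user IDs and brand names are made up
toy_M = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [0, 0, 1]],
    index=[101, 102, 103, 104],
    columns=["A", "B", "C"],
)

def jaccard_distance_naive(M):
    """Pairwise Jaccard distance computed from explicit follower sets."""
    followers = {b: set(M.index[(M[b] > 0).values]) for b in M.columns}
    D = pd.DataFrame(0.0, index=M.columns, columns=M.columns)
    for i in M.columns:
        for j in M.columns:
            shared = followers[i] & followers[j]
            union = followers[i] | followers[j]
            D.loc[i, j] = 1 - len(shared) / len(union)
    return D

# Same steps as the vectorized function, written with NumPy arrays
X = toy_M.values
I = X.T.dot(X)                    # followers in common
n = np.diag(I)                    # followers per brand
U = n[:, None] + n[None, :] - I   # size of the union
D_vec = 1 - I / U.astype(float)

D_naive = jaccard_distance_naive(toy_M)
print(D_naive.round(3))
print(np.allclose(D_vec, D_naive.values))
```

The identity used is $|A \cup B| = |A| + |B| - |A \cap B|$, which is why the diagonal of the intersection matrix is enough to build the denominator.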

In [14]:
n_brands = M.shape[1]
### Top part of fraction
I = M.T.dot(M)  # number of users in common 
n_users_per_brand = np.diag(I)
N = n_users_per_brand.reshape(n_brands, 1) * np.ones(n_brands)
print(N)

### Bottom part of fraction
U = N + N.T - I  # total unique followers = n_users_i + n_users_j - users in common 
print(U)
J = I / U.astype(float)  # similarity matrix
D = 1 - J  # distance

[[ 260.  260.  260. ...,  260.  260.  260.]
 [   8.    8.    8. ...,    8.    8.    8.]
 [   1.    1.    1. ...,    1.    1.    1.]
 ..., 
 [   4.    4.    4. ...,    4.    4.    4.]
 [   1.    1.    1. ...,    1.    1.    1.]
 [   1.    1.    1. ...,    1.    1.    1.]]
brand                  6pm.com  Abercrombie & Fitch  Adidas  Aeropostale  \
brand                                                                      
6pm.com                    260                  267     260          266   
Abercrombie & Fitch        267                    8       9           14   
Adidas                     260                    9       1            8   
Aeropostale                266                   14       8            7   
Aldo                       261                    9       2            8   
All Saints                 261                    9       2            8   
Amazon.com                 261                    9       2            8   
American Apparel           262              

In [18]:
brand_distance = jaccard_distance(M)
brand_distance.head(3)

brand,6pm.com,Abercrombie & Fitch,Adidas,Aeropostale,Aldo,All Saints,Amazon.com,American Apparel,American Eagle,Ann Taylor,...,Walgreens,Walk-Over,Wet Seal,Windsor,YSL,Yves Saint Laurent,ZOO,Zara,Zipcar,vineyard vines
brand
6pm.com,0.0,0.996255,0.996154,0.996241,1,1,1,0.996183,0.996441,0.992481,...,1,1,1.0,1,1,1,1,1.0,1,1
Abercrombie & Fitch,0.996255,0.0,1.0,0.928571,1,1,1,1.0,0.888889,1.0,...,1,1,0.9375,1,1,1,1,0.909091,1,1
Adidas,0.996154,1.0,0.0,1.0,1,1,1,1.0,1.0,1.0,...,1,1,1.0,1,1,1,1,1.0,1,1


Note that this is a _distance_ matrix, so the lower, the closer, the more similar.  Hence we have zeros on the diagonal.

Let's show the top most similar brands for some known brands.

In [23]:
top = 5
for brand in ['Apple']:
    # the closest entry is always the brand itself (distance 0)
    closest = brand_distance[brand].sort_values(ascending=True).index[:top]
    print("%-20s:" % brand, ", ".join(closest))

Apple               : Apple, Diesel, Levi's, 6pm.com, Nambe
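Because every brand has distance 0 to itself, it always appears first in its own list. If that is not wanted, one option is to drop the brand before sorting. The sketch below uses a made-up distance matrix, and `most_similar` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy symmetric distance matrix; brand names and values are made up
toy_distance = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.5],
     [0.9, 0.5, 0.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

def most_similar(brand, distance, top=5):
    """Closest brands to `brand`, excluding the brand itself."""
    return list(distance[brand].drop(brand).sort_values().index[:top])

print(most_similar("A", toy_distance, top=2))
```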


And let's pick some other random brands.

In [24]:
brands = M.columns

n_show = 10  # show a few brands
print("Top %d similar brands for some random %d brands" % (top, n_show))
for brand in np.random.choice(brands, n_show, replace=False):
    closest = brand_distance[brand].sort_values().index[:top]
    print("%-20s:" % brand, ", ".join(closest))

Top 5 similar brands for some random 10 brands
Sunglass Hut        : Sunglass Hut, CB2, Diesel, Last Call by Neiman Marcus, Kenneth Cole
Patagonia           : Patagonia, The North Face, Louis Vuitton, Ann Taylor, Coach
Adidas              : Adidas, MAC, Louis Vuitton, Gucci, Pottery Barn
Janie and Jack      : Janie and Jack, Pottery Barn, John Varvatos, Diesel, DKNY
Journeys            : Journeys, Topshop, UGG Australia, Lululemon, Wet Seal
New Balance         : New Balance, KitchenAid, Columbia, Levi's, Shoebuy
Bali                : Bali, Armani Exchange, YSL, Charles David, Lancome
Eddie Bauer         : Eddie Bauer, Life is good, Columbia, New Balance, Converse
Vitamix             : Vitamix, Cuisinart, KitchenAid, Kohl's, Old Navy
Children's Place    : Children's Place, Justice, Melissa & Doug, Puma, Old Navy


### Recommendations
Given a user, return the recommended brands, ordered from closest to farthest.

In [18]:
def recommend_brands_for_user(user, M, top=5):
    user_brands = M.loc[user][M.loc[user] > 0].index  # brands the user already follows
    brand_distance = jaccard_distance(M)
    # average distance from every brand to the user's brands, closest first
    recs = brand_distance[user_brands].mean(axis=1).sort_values(ascending=True).index
    # drop brands that are already on this user's list
    recs = [rec for rec in recs if rec not in user_brands]
    return recs[:top]
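The function above recomputes the full distance matrix on every call and returns only brand names. A possible variant, sketched here under the assumption that the distance matrix has already been computed once, takes it as an argument and also returns the average distances as scores. `recommend_from_distance` and the toy inputs are hypothetical.

```python
import pandas as pd

# Toy inputs; user IDs, brand names and distances are made up
toy_M = pd.DataFrame(
    [[1, 1, 0, 0],
     [0, 0, 1, 1]],
    index=[1, 2],
    columns=["A", "B", "C", "D"],
)
toy_distance = pd.DataFrame(
    [[0.0, 0.3, 0.4, 0.9],
     [0.3, 0.0, 0.8, 0.7],
     [0.4, 0.8, 0.0, 0.5],
     [0.9, 0.7, 0.5, 0.0]],
    index=toy_M.columns,
    columns=toy_M.columns,
)

def recommend_from_distance(user, M, distance, top=5):
    """Recommend brands for `user` from a precomputed distance matrix."""
    followed = M.columns[(M.loc[user] > 0).values]
    # average distance from every brand to the brands the user follows
    scores = distance[followed].mean(axis=1)
    # drop followed brands, then keep the closest remaining ones
    return scores.drop(followed).sort_values().head(top)

result = recommend_from_distance(1, toy_M, toy_distance, top=2)
print(result)
```

Lower scores mean closer matches, so the first entry of the returned series is the strongest recommendation.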

In [19]:
n_users = 5
# for user in [90217, 86156, 89116, 89112]:
for user in np.random.choice(M.index, n_users, replace=False):
    print("User %s" % user)
    print("Already following:", ", ".join(brands[M.loc[user] > 0]))
    print("Recommended:", ", ".join(recommend_brands_for_user(user, M)))
    print()

User 82446
Already following: Guess
Recommended: Calvin Klein, Steve Madden, Express, DKNY, BCBGMAXAZRIA

User 81989
Already following: Crocs, Home Depot, Nordstrom, Old Navy, Target
Recommended: Kohl's, Gap, Crate & Barrel, KitchenAid, Express

User 80895
Already following: Target
Recommended: Old Navy, Kohl's, Home Depot, Gap, Crate & Barrel

User 81320
Already following: CB2, Crate & Barrel, Gap, J.Crew, Melissa & Doug, Nordstrom
Recommended: Banana Republic, KitchenAid, Container Store, Restoration Hardware, Cuisinart

User 82327
Already following: New Balance
Recommended: KitchenAid, Columbia, Levi's, Shoebuy, Converse

