# GA Data Science 18 (DAT18) - Lab 16

### Recommendation Systems

### Today

1. Simple similarity based recommendation system
2. Recsys

## Similarity based Recommendation System: Beers


Let's build a recommendation system that recommends beers based on user reviews.

Usual imports (numpy, pandas)

In [3]:
import pandas as pd
import numpy as np

First of all let's get the data

In [None]:
! curl -O https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz

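Import the data into a pandas DataFrame called "allbeers". The download is a gzipped tar archive, so the compression keyword on its own may not be enough: gzip decompression yields the tar stream rather than the bare CSV. If that happens, extract the archive first (for example with `tar -xzf beer_reviews.tar.gz`).

In [None]:
# allbeers = pd.read_csv("beer_reviews.tar.gz", compression='gzip')
# If reading the compressed file fails, extract the archive and open the CSV inside it:
allbeers = pd.read_csv("beer_reviews/beer_reviews.csv")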

Let's look at the data

In [None]:
allbeers.head()

Let's restrict this to the top 250 beers. Use the value_counts() method to select the 250 most reviewed beers.
Assign the reviews of the selected beers to a DataFrame called df

In [None]:
n = 250
top_n = allbeers.beer_name.value_counts().index[:n]
df = allbeers[allbeers.beer_name.isin(top_n)]
df.head()
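
To see why this selects the most reviewed beers, here is a minimal sketch on a toy review log. The data and the names `reviews` and `subset` are made up for illustration; only `value_counts()` and `isin()` mirror the cell above.

```python
import pandas as pd

# Toy review log: one row per review (hypothetical data, not the beer dataset).
reviews = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C"],
    "review_overall": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

# value_counts() sorts by frequency, most reviewed first,
# so the first n index labels are the n most reviewed beers.
n = 2
top_n = reviews.beer_name.value_counts().index[:n]

# Keep only the reviews that belong to those beers.
subset = reviews[reviews.beer_name.isin(top_n)]

print(list(top_n))
print(len(subset))
```

Beer "C" has a single review, so it drops out and five of the six rows remain.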

How big is this dataset?

In [None]:
df.info()

### Pivot Table

Aggregate the data with the pivot_table method: compute the mean review_overall for each (beer_name, review_profilename) pair, so that a user who reviewed the same beer more than once contributes a single averaged rating.

In [None]:
df_pivot = pd.pivot_table(df, values=["review_overall"],
        columns=["beer_name", "review_profilename"],
        aggfunc=np.mean)

df_pivot.head().index


In [None]:
# pivot_table returns a Series with a MultiIndex.
# unstack(-1) moves the last index level (review_profilename) into the columns, giving a DataFrame.
df_wide = df_pivot.unstack(-1)
df_wide.head()
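
The shape we are after is a beers-by-users matrix of mean ratings. As a sketch on toy data (the frame `toy` and the reviewer names are hypothetical), the same shape can be produced in one `pivot_table` call by passing `index` and `columns`. This gives a plain index rather than the MultiIndex used in the cells below, so it is an illustration and not a drop-in replacement.

```python
import pandas as pd

# Hypothetical reviews; "ann" reviewed beer "B" twice.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "B", "B", "B"],
    "review_profilename": ["ann", "bob", "ann", "ann", "cat"],
    "review_overall": [4.0, 3.0, 2.0, 4.0, 5.0],
})

# Rows are beers, columns are users, cells are mean ratings.
wide = toy.pivot_table(index="beer_name", columns="review_profilename",
                       values="review_overall", aggfunc="mean")

print(wide)
```

The two reviews of "B" by "ann" are averaged to 3.0, and any (beer, user) pair without a review shows up as NaN.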

Display the head of the pivot table, but only for 5 users (columns are users)

In [None]:
df_wide.iloc[0:5, 0:5]

### Discussion: what do you notice in this table?

#### Data munging
Set NaNs to zero (note that this treats "not reviewed" the same as a rating of 0)

In [None]:
df_wide = df_wide.fillna(0)

Check that columns are users

In [None]:
df_wide.columns[:10]

Check that rows are beers

In [None]:
df_wide.index.levels[0]
beer_names = df_wide.index.levels[1]
beer_names

### Calculate distance between beers

We're going to use cosine_similarity from scikit-learn to compare all pairs of beers. Note that this is a similarity, not a distance: higher values mean two beers are rated more alike.

Imports

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import euclidean_distances

Apply cosine_similarity to df_wide to calculate pairwise similarities (we keep the variable name dists for the result)

In [None]:
dists = cosine_similarity(df_wide)
dists
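
For reference, cosine similarity between two rows is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on a small made-up matrix `X` (rows play the role of beers, columns of users).

```python
import numpy as np

# Hypothetical ratings: 3 beers (rows) rated by 3 users (columns), 0 = no review.
X = np.array([[5.0, 0.0, 3.0],
              [4.0, 0.0, 4.0],
              [0.0, 2.0, 0.0]])

# Length (Euclidean norm) of each row, kept as a column vector.
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Entry (i, j) is dot(row_i, row_j) / (|row_i| * |row_j|).
manual = np.dot(X, X.T) / np.dot(norms, norms.T)

print(np.round(manual, 3))
```

Rows 0 and 2 share no reviewers, so their similarity is exactly 0, while every row has similarity 1 with itself. A row of all zeros would cause a division by zero in this sketch.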

### Discussion: what type of object is dists?

Convert dists to a pandas DataFrame and use beer_names for both the index and the columns (the similarity matrix is square).

In [None]:
dists = pd.DataFrame(dists)
dists.columns = beer_names
dists.index = beer_names
dists.iloc[0:10, 0:10]

Select some beers and look at their similarity to other beers

In [None]:
dists.columns[1:50]

In [None]:
beers_i_like = ['Leffe Blonde', 'Westmalle Trappist Tripel', 'Pliny The Elder']
dists[beers_i_like].head()

Sum the similarities to my favourite beers by row, to get one score for each beer in the sample

In [None]:
beers_summed = dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)
#beers_summed = np.sum(dists[beers_i_like], axis=1)

In [None]:
beers_summed.head()

#### Performance

Optional: which one is faster? Use `%timeit` to check.

In [None]:
%timeit dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)

In [None]:
%timeit np.sum(dists[beers_i_like], axis=1)

#### Ranking

Sort summed beers from best to worst

In [None]:
beers_summed = beers_summed.sort_values(ascending=False)
beers_summed

Filter out the beers used as input and transform to list

In [None]:
ranked_beers = beers_summed.index[~beers_summed.index.isin(beers_i_like)]
ranked_beers = ranked_beers.tolist()
ranked_beers[:5]
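
The sum-then-rank steps are easier to follow with numbers you can check by eye. This is a sketch on a hand-written 4x4 similarity table; the names `sim`, `liked`, `scores` and `ranked` are illustrative only.

```python
import pandas as pd

names = ["A", "B", "C", "D"]
# Hypothetical symmetric similarity matrix with ones on the diagonal.
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.4],
     [0.3, 0.5, 0.4, 1.0]],
    index=names, columns=names)

liked = ["A", "B"]

# One score per beer: its total similarity to the liked beers.
scores = sim[liked].sum(axis=1).sort_values(ascending=False)

# Drop the liked beers themselves, keeping the ranking order.
ranked = scores.index[~scores.index.isin(liked)].tolist()

print(ranked)
```

"D" scores 0.3 + 0.5 = 0.8 and "C" scores 0.1 + 0.2 = 0.3, so "D" is recommended first.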

### Pair Programming!

Define a function that does what we just did for an arbitrary input list of beers. It should also accept the maximum number of beers to return, n, as an optional parameter.

In [None]:
# Named recommend_beers so that it does not overwrite the beers_i_like list defined above
def recommend_beers(beers, n=None):
    # Ignore beers that are not in the similarity matrix
    beers = [beer for beer in beers if beer in dists.columns]
    # Sum the similarities to the input beers and rank from most to least similar
    beers_summed = dists[beers].sum(axis=1).sort_values(ascending=False)
    # Drop the input beers themselves
    ranked_beers = beers_summed.index[~beers_summed.index.isin(beers)].tolist()
    # Slicing with n=None returns the whole list
    return ranked_beers[:n]

Test your function. Find the 10 beers most similar to "120 Minute IPA"

In [None]:
recommend_beers(['120 Minute IPA'], 10)

Try again with the 10 beers most similar to ["Coors Light", "Bud Light", "Amstel Light"]

In [None]:
recommend_beers(['Coors Light', 'Bud Light', 'Amstel Light'], 10)

Optional: register an account on yhat and deploy your model following the instructions [here](https://docs.yhathq.com/python/examples/beer-recommender) and [here](http://nbviewer.ipython.org/gist/glamp/20a18d52c539b87de2af)

## Recsys

python-recsys is a Python library for implementing a recommender system. It's worth exploring if you want an efficient way to get a recommendation engine off the ground. The example below uses SVD.

In [None]:
"""
##install python-recsys

### first install dependencies

pip install csc-pysparse networkx divisi2

### then install recsys
git clone https://github.com/python-recsys/python-recsys.git
cd python-recsys/

python setup.py install
"""

Import recsys.algorithm, set VERBOSE = True, and import the SVD class

In [1]:
import recsys.algorithm
recsys.algorithm.VERBOSE = True
from recsys.algorithm.factorize import SVD

Let's look at the files

In [2]:
! ls ../data/movielens/

README      movies.dat  ratings.dat users.dat


Import 'movies.dat' to a 'movies' pandas dataframe. Make sure you name the columns, use the correct separator and define the index.

In [4]:
movies = pd.read_table('../data/movielens/movies.dat', sep='::', engine='python',
                       names= ['ITEMID', 'Title', 'Genres'], index_col= 'ITEMID')
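
The `.dat` files have no header row and use a two-character `::` separator, which is why the call above passes `names` and `engine='python'`. Here is a small sketch of the same call on an in-memory sample (the two rows are copied from the output shown for `movies.head()`; `movies_demo` is a hypothetical name).

```python
import io
import pandas as pd

# Two lines in the same layout as movies.dat: ITEMID::Title::Genres
raw = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator needs the Python parsing engine.
movies_demo = pd.read_csv(raw, sep="::", engine="python",
                          names=["ITEMID", "Title", "Genres"],
                          index_col="ITEMID")

print(movies_demo)
```

With `index_col="ITEMID"`, a movie can then be looked up by id, for example `movies_demo.loc[1, "Title"]`.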

In [5]:
movies.head()

ITEMID,Title,Genres
1,Toy Story (1995),Animation|Children's|Comedy
2,Jumanji (1995),Adventure|Children's|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995),Comedy


Import 'ratings.dat' to a 'ratings' pandas dataframe. Make sure you name the columns, use the correct separator.

In [6]:
ratings = pd.read_table('../data/movielens/ratings.dat', sep='::', engine='python',
                        names= ['UserID','MovieID','Rating','Timestamp'])

In [7]:
ratings.head()

,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Initialize an SVD instance

In [8]:
svd = SVD()

Populate it with the data from the ratings dataset, using the built in load_data method

In [9]:
svd.load_data(filename='../data/movielens/ratings.dat', sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})

Loading ../data/movielens/ratings.dat
..........|


Compute SVD

$M=U \Sigma V^T$:

In [10]:
k = 100
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True)

Creating matrix (1000209 tuples)
Matrix density is: 4.4684%
Updating matrix: squish to at least 10 values
Computing svd k=100, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True
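
To make $M=U \Sigma V^T$ and the role of `k` concrete, here is a sketch using plain NumPy on a tiny made-up ratings matrix. It illustrates truncated SVD in general and is not the recsys implementation; the matrix `M` and the name `M_k` are hypothetical.

```python
import numpy as np

# Hypothetical 4x4 ratings matrix (rows = items, columns = users).
M = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 5.0]])

# Full decomposition: s holds the singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Using every singular value reproduces M (up to rounding).
full = np.dot(U * s, Vt)

# Keeping only the k largest gives the best rank-k approximation.
k = 2
M_k = np.dot(U[:, :k] * s[:k], Vt[:k, :])

print(np.round(s, 6))
print(np.linalg.norm(M - M_k))
```

For this matrix the singular values are 10, 8, 2 and 0, so dropping the last two leaves a reconstruction error (Frobenius norm) of 2.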


You can also save the computed SVD model (in a zip file) by passing the savefile argument

In [None]:
# svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/movielens')

Reload a saved model:

In [None]:
# svd2 = SVD(filename='/tmp/movielens')

Find the ITEMID number for "Toy Story (1995)"

In [11]:
movies[movies.Title == "Toy Story (1995)"]

ITEMID,Title,Genres
1,Toy Story (1995),Animation|Children's|Comedy


Find the ITEMID number for "Bug's Life, A (1998)"

In [12]:
movies[movies.Title == "Bug's Life, A (1998)"]

ITEMID,Title,Genres
2355,"Bug's Life, A (1998)",Animation|Children's|Comedy


Compute similarity between the two movies

In [13]:
ITEMID1 = 1    # Toy Story (1995)
ITEMID2 = 2355 # Bug's Life, A (1998)
print(svd.similarity(ITEMID1, ITEMID2))
# To check the reloaded model: print(svd2.similarity(ITEMID1, ITEMID2))

0.677069366773
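
One common way to get an item-to-item similarity out of an SVD is to take the cosine between the items' vectors in the reduced k-dimensional space. The sketch below shows that idea with NumPy on a toy matrix. Whether recsys computes `similarity` exactly this way is an assumption here, not something taken from the library source, and `item_factors` and `latent_cosine` are hypothetical names.

```python
import numpy as np

# Hypothetical ratings matrix (rows = items, columns = users).
M = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 5.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Each item becomes a k-dimensional vector: its row of U scaled by the singular values.
k = 2
item_factors = U[:, :k] * s[:k]

def latent_cosine(i, j):
    """Cosine similarity between items i and j in the reduced space."""
    a, b = item_factors[i], item_factors[j]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(latent_cosine(0, 1))
print(latent_cosine(0, 2))
```

Items 0 and 1 are liked by the same users and end up far more similar to each other than items 0 and 2.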


Get movies similar to Toy Story

In [14]:
svd.similar(ITEMID1)

[(1, 0.99999999999999978),
 (3114, 0.87060391051017305),
 (2355, 0.67706936677314977),
 (588, 0.58073514967544992),
 (595, 0.46031829709744226),
 (1907, 0.44589398718134982),
 (364, 0.42908159895577563),
 (2081, 0.42566581277822413),
 (3396, 0.42474056361934953),
 (2761, 0.40439361857576017)]

In [15]:
movies[movies.index == 3114]

ITEMID,Title,Genres
3114,Toy Story 2 (1999),Animation|Children's|Comedy


Predict rating for a given user and movie, $\hat{r}_{ui}$

In [17]:
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 2   # Jumanji (1995)
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING)

3.8188408403312475
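
A plausible reading of the MIN_RATING and MAX_RATING arguments is that the predicted value is read from the low-rank reconstruction and then clamped to the rating scale. The sketch below illustrates only that clamping step; it is an assumption about how the bounds are used, and `M_hat` and `predict_clipped` are hypothetical names rather than part of recsys.

```python
import numpy as np

# Hypothetical reconstructed ratings (rows = items, columns = users).
# Raw reconstructions can fall outside the rating scale.
M_hat = np.array([[5.4, 3.8],
                  [-0.2, 4.1]])

def predict_clipped(M_hat, item, user, min_rating=0.0, max_rating=5.0):
    """Read a reconstructed rating and clamp it to [min_rating, max_rating]."""
    return float(np.clip(M_hat[item, user], min_rating, max_rating))

print(predict_clipped(M_hat, 0, 0))  # 5.4 is clamped down to the maximum
print(predict_clipped(M_hat, 1, 0))  # -0.2 is clamped up to the minimum
print(predict_clipped(M_hat, 0, 1))  # 3.8 is already inside the range
```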

In [18]:
svd.get_matrix().value(ITEMID, USERID)

0.0

Recommend unrated movies to a user

In [19]:
svd.recommend(USERID, is_row=False)

[(2028, 5.4018452642332546),
 (527, 5.3498144196809516),
 (2905, 5.2133848204673132),
 (318, 5.2052108435955446),
 (1193, 5.1942189963876562),
 (3114, 5.1753939214583697),
 (1, 5.1714259073839521),
 (2019, 5.1037438278754719),
 (1178, 5.0962756861446641),
 (1207, 5.090305272922329)]

Which users should see Jumanji (ITEMID = 2, as set above)? (i.e. which users who have not rated it would give it a high rating?)

In [20]:
svd.recommend(ITEMID)

[(4086, 4.8232370652938465),
 (3902, 4.6910487993498418),
 (372, 4.6307149881008742),
 (2339, 4.6059530288892852),
 (4801, 4.5922229425826817),
 (283, 4.5271235154200138),
 (101, 4.50916560608227),
 (3324, 4.5013648440713689),
 (1670, 4.4753762897577483),
 (446, 4.4621920455148416)]

Find out more here: [https://github.com/ocelma/python-recsys](https://github.com/ocelma/python-recsys)