# Recommender Systems and SVD

Recommender systems have become ubiquitous in the modern data science landscape: companies like Google, Netflix, Pandora, and Facebook rely heavily on them to target content recommendations to their users and create a more enjoyable user experience.  In these exercises, we'll focus on ***collaborative filtering*** and build recommenders on 2 different datasets (beers and movies).  

[Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) relies on a ***ratings matrix*** over all items and users, and infers similarities between items (or between users) from similar rating patterns.  It is one of the 2 main approaches to recommendation; the other is [Content-Based Filtering](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering), which maps items and/or users into a shared feature space based on explicit user/item characteristics.  State-of-the-art recommenders often use hybrids of the 2, so it's important to understand the differences, strengths, and weaknesses of each.

### Datasets
- [Beer Ratings](https://github.com/pburkard88/DS_BOS_06/blob/master/Data/beer_reviews.tar.gz): A dataset of beer reviews
- [Movielens Data](https://github.com/pburkard88/DS_BOS_06/blob/master/Data/movielens): A dataset of movie ratings from the original [here](http://grouplens.org/datasets/movielens/)

### Learning Goals
- Perform collaborative filtering from ratings matrices using `pandas` and `sklearn` on the beers data
- Understand why this approach represents collaborative filtering
- Perform collaborative filtering using the [python-recsys](https://github.com/ocelma/python-recsys) library that provides some nice built-in recommender functionality
- Understand how SVDs or other matrix decompositions fit into a recommender algorithm

## Similarity based Recommendation System: Beers
The first dataset we'll work with is a list of many beer reviews by a variety of reviewers with accompanying beer metadata on every review.  We'll use this data to generate our reviewer/beer ratings matrix from which we can perform collaborative filtering and recommend beers based on user preferences.

### Beers: Get the Data
First perform the usual imports of `numpy` and `pandas` as `np` and `pd`.

In [1]:
import pandas as pd
import numpy as np

Now let's get the data.  If you don't already have it locally you can use curl to pull it down.

In [2]:
#! curl -O https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz

The next steps are optional: move the data to a location of your choice, then point your eventual call to `read_csv()` at that location.

In [3]:
! mv 'beer_reviews.tar.gz' ~/data/

mv: rename beer_reviews.tar.gz to /Users/jb/data/: No such file or directory


In [4]:
!ls ~/data

ls: /Users/jb/data: No such file or directory


Import the data into a `pandas` dataframe called `df` by calling `read_csv()` with the appropriate path and the parameter `compression='gzip'` (you don't need this if you already extracted your file, it's just nice to see that pandas can handle gzipped data).

In [5]:
# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False instead
df = pd.read_csv("./beer_reviews.tar.gz", compression='gzip', on_bad_lines='skip')
#df = pd.read_csv("~/data/beer_reviews/beer_reviews.csv")



### Explore the Data
Let's look at the data with `head()`

In [6]:
df.head()

Unnamed: 0,beer_reviews/,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234818000.0,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986.0
1,10325,Vecchio Birraio,1235915000.0,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213.0
2,10325,Vecchio Birraio,1235917000.0,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215.0
3,10325,Vecchio Birraio,1234725000.0,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969.0
4,1075,Caldera Brewing Company,1293735000.0,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883.0


Create a separate data frame `df_test` to investigate a little further by selecting only the **beer_name="Pale Ale"** reviews using the `isin([])` method.  Then sort the resulting table by **review_profilename** and examine the first 100 rows.  You should notice that the same reviewer can review multiple Pale Ales.

In [14]:
df_test = df[df.beer_name.isin(['Pale Ale'])].sort_values('review_profilename', axis=0)
df_test.head(100)

Unnamed: 0,beer_reviews/,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
912451,19402,Inland Empire Brewing Company,1.240528e+09,4.0,3.0,3.0,0110x011,American Pale Ale (APA),3.0,3.5,Pale Ale,5.50,49291.0
1406262,9824,Silverado Brewing Company,1.253299e+09,3.5,3.5,3.5,1759Girl,American Pale Ale (APA),2.5,4.0,Pale Ale,5.12,25427.0
563154,423,Boulevard Brewing Co.,1.305678e+09,3.5,2.0,4.0,1Adam12,American Pale Ale (APA),3.0,3.0,Pale Ale,5.40,2094.0
525342,2101,Blue Star Brewing Company,1.237656e+09,4.5,4.0,4.0,1fastz28,American Pale Ale (APA),4.0,4.0,Pale Ale,,5828.0
41264,13397,Mountaineer Brewing Co.,1.291941e+09,4.0,3.0,3.5,321jeff,American Pale Ale (APA),4.0,3.0,Pale Ale,5.59,28951.0
1385721,3725,Réservoir,1.120719e+09,3.5,3.0,3.0,3Vandoo,English Pale Ale,4.0,3.0,Pale Ale,5.00,24527.0
562967,423,Boulevard Brewing Co.,1.203782e+09,5.0,4.5,4.5,7thstreetbrewery,American Pale Ale (APA),5.0,4.5,Pale Ale,5.40,2094.0
563116,423,Boulevard Brewing Co.,1.058366e+09,4.0,3.5,4.0,ADR,American Pale Ale (APA),3.5,3.0,Pale Ale,5.40,2094.0
477535,16465,Croucher Brewing Co.,1.248702e+09,4.5,3.0,4.5,ADZA,American Pale Ale (APA),3.5,4.5,Pale Ale,5.00,40487.0
1429227,25252,Goodieson Brewery,1.304334e+09,3.0,3.0,3.5,ADZA,American Pale Ale (APA),3.0,3.0,Pale Ale,4.50,68580.0


Let's restrict this to the top 250 beers. Use the `value_counts()` method on **beer_name** to get the beers sorted by number of reviews, then take the first 250.  Overwrite `df` with this new data.

In [15]:
df.beer_name.value_counts()

90 Minute IPA                                          3290
India Pale Ale                                         3130
Old Rasputin Russian Imperial Stout                    3111
Sierra Nevada Celebration Ale                          3000
Two Hearted Ale                                        2728
Stone Ruination IPA                                    2704
Arrogant Bastard Ale                                   2704
Sierra Nevada Pale Ale                                 2587
Stone IPA (India Pale Ale)                             2575
Pliny The Elder                                        2527
Founders Breakfast Stout                               2502
Pale Ale                                               2500
Sierra Nevada Bigfoot Barleywine Style Ale             2492
La Fin Du Monde                                        2483
60 Minute IPA                                          2475
Storm King Stout                                       2452
Duvel                                   

In [16]:
n = 250
top_n = df.beer_name.value_counts().index[:n]
df = df[df.beer_name.isin(top_n)]
df.head()

TypeError: unorderable types: str() > float()
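
A minimal sketch of the same top-n filter on a small hypothetical frame (`toy` and `keep_top_n` are illustrative names, not part of the exercise). It drops rows with a missing `beer_name` before filtering; that is an assumption about what a more defensive version might do, not a confirmed diagnosis of the error shown above.

```python
import pandas as pd

# Hypothetical miniature of the reviews table; None marks a missing beer name.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C", None],
    "review_overall": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.0],
})


def keep_top_n(frame, column, n):
    """Keep only the rows whose `column` value is among the n most frequent."""
    cleaned = frame.dropna(subset=[column])
    # value_counts() sorts by descending frequency, so the first n index
    # labels are the n most common values.
    top = cleaned[column].value_counts().index[:n]
    return cleaned[cleaned[column].isin(top)]


toy_top = keep_top_n(toy, "beer_name", 2)
print(sorted(toy_top["beer_name"].unique()))
```

Passing the frame and column as arguments keeps the helper reusable for other columns, such as `review_profilename`.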

How big is this dataset?  Use `df.info()`

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586630 entries, 0 to 1586629
Data columns (total 13 columns):
beer_reviews/         1586629 non-null object
brewery_name          1586611 non-null object
review_time           1586614 non-null float64
review_overall        1586614 non-null float64
review_aroma          1586614 non-null float64
review_appearance     1586614 non-null float64
review_profilename    1586266 non-null object
beer_style            1586614 non-null object
review_palate         1586614 non-null float64
review_taste          1586614 non-null float64
beer_name             1586614 non-null object
beer_abv              1518829 non-null float64
beer_beerid           1586614 non-null float64
dtypes: float64(8), object(5)
memory usage: 157.4+ MB


Aggregate the data in a pivot table called `df_wide` using the `pivot_table` method. The table should hold the mean **review_overall** for each combination of **beer_name** and **review_profilename**, using the mean (`numpy.mean`) as the aggregator.  In other words, the `values` parameter should contain **review_overall** and the `index` parameter should contain **beer_name** and **review_profilename**.  Make sure to call `unstack()` at the end so that reviewers become columns.

In [21]:
df_wide = pd.pivot_table(df, values=["review_overall"],
        index=["beer_name", "review_profilename"],
        aggfunc=np.mean).unstack()
df_wide.shape

(56856, 33387)
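
To make the reshaping concrete, here is a small sketch on hypothetical reviews (`toy_reviews` is invented for illustration). It passes the reviewer as `columns=` instead of calling `unstack()`, which should give the same wide layout with flat column labels.

```python
import pandas as pd

# Hypothetical long-format reviews: "ann" reviewed beer "A" twice.
toy_reviews = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B"],
    "review_profilename": ["ann", "ann", "bob", "ann", "cat"],
    "review_overall": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Rows are beers, columns are reviewers, cells are mean ratings.
toy_wide = pd.pivot_table(
    toy_reviews,
    values="review_overall",
    index="beer_name",
    columns="review_profilename",
    aggfunc="mean",
)
print(toy_wide)
```

Repeated reviews by the same reviewer are averaged into one cell, and beer/reviewer pairs with no review show up as `NaN`.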

Display the head of the pivot table, but only for 5 users (columns are users)

In [22]:
df_wide.iloc[0:5, 0:5]


Unnamed: 0_level_0,review_overall,review_overall,review_overall,review_overall,review_overall
review_profilename,0110x011,01Ryan10,02maxima,03SVTCobra,04101Brewer
beer_name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
! (Old Ale),,,,,
"""100""",,,,,
"""100"" Pale Ale",,,,,
"""12"" Belgian Golden Strong Ale",,,,,
"""33"" Export",,,,,


### Discussion: what do you notice in this table?

Set NaNs to zero with the `fillna()` function.

In [23]:
df_wide = df_wide.fillna(0)
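
Filling with zeros treats "not rated" as a rating of 0, so it is worth knowing how much of the matrix is missing before doing it. A short NumPy sketch on hypothetical numbers (the array below is invented, not taken from the beer data):

```python
import numpy as np

# Hypothetical 2 beers x 3 reviewers matrix; NaN means "no review".
toy_ratings = np.array([
    [4.5, 3.0, np.nan],
    [2.0, np.nan, 4.0],
])

# Fraction of beer/reviewer pairs that have no rating at all.
missing_fraction = np.isnan(toy_ratings).mean()

# Equivalent of fillna(0): replace every NaN with 0.0.
toy_filled = np.where(np.isnan(toy_ratings), 0.0, toy_ratings)

print(missing_fraction)
print(toy_filled)
```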

Check that columns are users by examining the first few columns.

In [24]:
df_wide.columns[:10]

MultiIndex(levels=[['review_overall'], ['0110x011', '01Ryan10', '02maxima', '03SVTCobra', '04101Brewer', '05Harley', '0Naught0', '0beerguy0', '0runkp0s', '0to15', '0tt0', '0xFF', '1000Bottles', '1001111.0', '100floods', '100proof', '103stiga', '104bob', '1050Sudz', '108Dragons', '1099.0', '10bear', '10shb', '1100.0', '110toyourleft', '1121987.0', '11millsown113', '11osixBrew', '11soccer11', '11thFloorBrewing', '1229design', '12647summerfield', '12NattiBottles', '12ouncecurls', '12percent', '12puebloyankee', '12vUnion', '12vman', '130guy', '13aphomet', '13smurrf', '159beerrunner', '160Shillings', '16ozSampler', '17202826.0', '1759Girl', '1759dallas', '17Guinness59', '1844original', '184601.0', '187.0', '18alpha', '18todrink', '18tony', '196osh', '1993Heel', '1996StrokerKid', '1Adam12', '1BeerLeague', '1Mainebrew', '1MiltonWaddams', '1PA', '1Paradisebrew', '1after909', '1badcableguy', '1bigwoody', '1brbn1sctch1beer', '1fastz28', '1inamill', '1joeyjojo', '1lastcast', '1morebeer', '1noa', 

Check that rows are beers by examining the first few rows.

In [25]:
pd.Series(df_wide.index[:10])

0                               ! (Old Ale)
1                                     "100"
2                            "100" Pale Ale
3           "12"  Belgian Golden Strong Ale
4                               "33" Export
5                   "4" Horse Oatmeal Stout
6                                 "400" Ale
7             "50" Golden Anniversary Lager
8                      "76" Anniversary Ale
9    "76" Anniversary Ale With English Hops
Name: beer_name, dtype: object

### Calculate distance between beers

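This is the key step.  We now have our ratings matrix, and we'll use `cosine_similarity` from scikit-learn to compute the similarity between every pair of beers in this space.  Note that cosine similarity is a similarity rather than a distance: higher values mean more similar beers, and each beer has a similarity of 1 with itself.  We keep the variable name `dists` below for consistency with the code.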

In [None]:
# import distance methods
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import euclidean_distances

Apply `cosine_similarity()` to `df_wide` to calculate the pairwise similarities and store them in a variable called `dists`.

In [None]:
dists = cosine_similarity(df_wide)
dists
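
For two rating vectors $a$ and $b$, cosine similarity is $\frac{a \cdot b}{\|a\| \, \|b\|}$. The sketch below computes it by hand with NumPy on a hypothetical 3-beer matrix, as a way to see what the pairwise result contains; `cosine_similarity_rows` is an illustrative helper, not the scikit-learn implementation.

```python
import numpy as np


def cosine_similarity_rows(matrix):
    """Return the square matrix of cosine similarities between rows."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = matrix / norms
    return unit @ unit.T


# Hypothetical ratings: beers 0 and 1 share reviewers, beer 2 does not.
toy = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [0, 0, 3],
])
sims = cosine_similarity_rows(toy)
print(sims.round(3))
```

Beers rated by the same reviewers in similar ways score close to 1, while beers with no reviewers in common score 0 once missing ratings are filled with zeros.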

### Discussion: what type of object is dists?

Convert `dists` to a pandas DataFrame, using the index of `df_wide` for both the rows and the columns (the matrix is square).  This gives us a beers-by-beers matrix of the similarity between every pair of beers in the ratings space.  Check the first 10 or so rows and columns to make sure things look right (you should see 1s on the diagonal).

In [None]:
dists = pd.DataFrame(dists, columns=df_wide.index)

dists.index = dists.columns
dists.iloc[0:10, 0:10]

Select some beers and store them in `beers_i_like`, then look at their distances to other beers with `head()`.

In [None]:
beers_i_like = ['Sierra Nevada Pale Ale', '120 Minute IPA', 'Allagash White']
dists[beers_i_like].head()

Sum the distances of my favorite beers by row, to have one distance from each beer in the sample.  For instance if there are 3 beers in your `beers_i_like` then you will be summing 3 numbers for each row.  Store the results in `beers_summed`.  There are 2 ways you can do this:  
1. Calling `apply()` with a lambda function that contains `np.sum()` with `axis=1`
2. Calling `np.sum()` with `axis=1` on the entire dataframe (sliced by columns you like)

In [None]:
beers_summed = dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)
#beers_summed = np.sum(dists[beers_i_like], axis=1)

Optional: which one is faster? Use `%timeit` to check.

In [None]:
%timeit dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)

In [None]:
%timeit np.sum(dists[beers_i_like], axis=1)

Sort the summed beers from best to worst using `sort_values()`.

In [None]:
beers_summed = beers_summed.sort_values(ascending=False)
beers_summed

Filter out the beers used as input using `isin()` and store this in `ranked_beers`, then transform this to a list using `tolist()`.  Print out the first 5 elements.

In [None]:
ranked_beers = beers_summed.index[~beers_summed.index.isin(beers_i_like)]
ranked_beers = ranked_beers.tolist()
ranked_beers[:5]

Define a function that does what we just did for an arbitrary input list of beers. It should also accept the maximum number of beers requested, `n`, as an optional parameter.

In [None]:
def get_similar(beers, n=None):
    """
    Calculates which beers are most similar to the beers provided. Does not return
    the beers that were provided.
    
    Parameters
    ----------
    beers: list
        some beers!
    n: int, optional
        maximum number of beers to return (default: return all)
    
    Returns
    -------
    ranked_beers: list
        rank ordered beers
    """
    # ignore any input beers that are missing from the similarity matrix
    beers = [beer for beer in beers if beer in dists.columns]
    beers_summed = dists[beers].apply(lambda row: np.sum(row), axis=1)
    # Series.order() was removed from pandas; sort_values() replaces it
    beers_summed = beers_summed.sort_values(ascending=False)
    ranked_beers = beers_summed.index[~beers_summed.index.isin(beers)]
    ranked_beers = ranked_beers.tolist()
    if n is None:
        return ranked_beers
    else:
        return ranked_beers[:n]

Test your function. Find the 10 beers most similar to "120 Minute IPA"

In [None]:
for beer in get_similar(["120 Minute IPA"], 10):
    print(beer)

Cool, let's try again with the 10 beers most similar to ["Coors Light", "Bud Light", "Amstel Light"]

In [None]:
for i, beer in enumerate(get_similar(["Coors Light", "Bud Light", "Amstel Light"], 10)):
    print("%d) %s" % (i+1, beer))
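
The ranking logic can be tried without the full dataset. This is a self-contained sketch on a hypothetical 4-beer similarity table; `toy_sims` and `rank_similar` are illustrative names, and unlike the function defined earlier it takes the similarity table as an argument instead of reading a global.

```python
import pandas as pd

# Hypothetical similarity table: two lager-like and two dark beers.
names = ["Lager", "Pilsner", "Stout", "Porter"]
toy_sims = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]],
    index=names, columns=names,
)


def rank_similar(sims, liked, n=None):
    """Rank beers by summed similarity to `liked`, excluding `liked` itself."""
    # Silently skip names that are not in the table.
    liked = [name for name in liked if name in sims.columns]
    scores = sims[liked].sum(axis=1).drop(liked)
    ranked = scores.sort_values(ascending=False).index.tolist()
    return ranked if n is None else ranked[:n]


print(rank_similar(toy_sims, ["Stout"], 2))
```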

## Movie Recommendations with Recsys
[python-recsys](https://github.com/ocelma/python-recsys) is a nice Python library for implementing recommender systems.  We'll use it here to try to make movie recommendations from the [movielens dataset](http://grouplens.org/datasets/movielens/).  

### Install Recsys
First run something like the below code to install everything that you need for recsys.

1. Install the dependencies: `pip install csc-pysparse networkx divisi2`
2. Clone the repository: `git clone https://github.com/python-recsys/python-recsys.git`
3. Install recsys from the clone: `cd python-recsys/`, then `python setup.py install`
4. Restart the kernel.

Import `recsys.algorithm`, set `recsys.algorithm.VERBOSE = True` and import `recsys.algorithm.factorize.SVD` class

In [None]:
import recsys.algorithm
recsys.algorithm.VERBOSE = True
from recsys.algorithm.factorize import SVD

### Get the Data
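Download the movielens dataset [here](http://files.grouplens.org/datasets/movielens/ml-20m.zip).  Note that the cells below read `movies.dat` and `ratings.dat` with a `::` separator; if the release you downloaded ships CSV files instead, adjust the file names and `sep` accordingly.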

Let's look at the files, you can do this however you like.

In [None]:
! ls ~/data/movielens

Read the movies.dat data into a variable `movies` by using `pd.read_table` with `sep='::'`.  Set `names` to ITEMID, Title, and Genres to label the columns, and set `index_col` to ITEMID.  Also pass `engine='python'`, since the default C parser does not support multi-character separators.

In [None]:
movies = pd.read_table('~/data/movielens/movies.dat', sep='::', engine='python', names=['ITEMID', 'Title', 'Genres'], index_col='ITEMID')

### Explore the Data
Take a look at the movies data with `head()`.

In [None]:
movies.head()
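
If the files are not at hand, the parsing step can be tried on an in-memory sample. The two lines below are illustrative rows in the `ID::Title::Genres` layout (an assumption about the file format, based on the column names used here), read through `io.StringIO` instead of a path.

```python
import io

import pandas as pd

# Illustrative sample in the assumed "::"-separated layout.
sample = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

toy_movies = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    engine="python",  # multi-character separators need the Python parser
    names=["ITEMID", "Title", "Genres"],
    index_col="ITEMID",
)
print(toy_movies)
```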

Load the ratings.dat data into a `ratings` variable with the same separator and `engine='python'`, using the column names UserID, MovieID, Rating, Timestamp.

In [None]:
ratings = pd.read_table('~/data/movielens/ratings.dat', sep='::', engine='python', names=['UserID','MovieID','Rating','Timestamp'])

In [None]:
ratings.head()

Initialize an `SVD` instance called `svd`

In [None]:
svd = SVD()

Populate it with the data from the ratings dataset, using the built in `load_data()` method.  You should use `format={'col':0, 'row':1, 'value':2, 'ids': int}` and don't forget the `sep` parameter.

In [None]:
svd.load_data(filename='../Data/movielens/ratings.dat', sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})

Compute SVD with a call to `svd.compute()`.  
- Use `k=100`
- Use `min_values=10`
- Use `pre_normalize=None`
- Use `mean_center=True`
- Use `post_normalize=True`

The decomposition being computed is $M = U \Sigma V^T$, truncated to the $k$ largest singular values:

In [None]:
k = 100
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True)
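
To see what a rank-$k$ truncation of $M = U \Sigma V^T$ does, here is a NumPy-only sketch on a hypothetical 4x4 ratings matrix. It uses `np.linalg.svd` rather than recsys, and it does not reproduce options such as `min_values`, `mean_center`, or `post_normalize`.

```python
import numpy as np

# Hypothetical dense ratings: rows are movies, columns are users.
toy_M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Full decomposition: toy_M == U @ diag(s) @ Vt, with s sorted descending.
U, s, Vt = np.linalg.svd(toy_M, full_matrices=False)

# Keep only the k largest singular values for a low-rank approximation.
k = 2
toy_M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the energy in the discarded singular values.
error = np.linalg.norm(toy_M - toy_M_k)
print(s.round(3))
print(toy_M_k.round(2))
print(round(error, 3))
```

A small `k` keeps the dominant rating patterns and discards the rest, which is the sense in which the factorization generalizes from observed ratings.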

You can also save the output SVD model (in a zip file):

In [None]:
# svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/movielens')

Reload a saved model:

In [None]:
# svd2 = SVD(filename='/tmp/movielens')

### Computing Similarities and Making Recommendations
Let's compute the similarity between two movies.  First we need to use the movies table to look up the item IDs, since those IDs are what the ratings data (and therefore our SVD) uses.

Determine the movie ids of "Toy Story (1995)" and "Bug's Life, A (1998)".

In [None]:
movies[movies.Title == "Toy Story (1995)"]

In [None]:
movies[movies.Title == "Bug's Life, A (1998)"]

Print the similarity of these 2 movies by calling `svd.similarity()` with those 2 IDs.

In [None]:
ITEMID1 = 1    # Toy Story (1995)
ITEMID2 = 2355 # Bug's Life, A (1998)
print(svd.similarity(ITEMID1, ITEMID2))
# print(svd2.similarity(ITEMID1, ITEMID2)) to check

Use `svd.similar()` to get movies similar to Toy Story.

In [None]:
svd.similar(ITEMID1)
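
One common way to define item similarity after a factorization is the cosine between item vectors in the $k$-dimensional latent space. The sketch below illustrates that idea with NumPy on a hypothetical matrix; it is an assumption about the general technique, not a description of how `svd.similarity()` is implemented.

```python
import numpy as np

# Hypothetical ratings: rows are movies, columns are users.
toy_M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

k = 2
U, s, Vt = np.linalg.svd(toy_M, full_matrices=False)
# Each row is one movie expressed in the k-dimensional latent space.
item_factors = U[:, :k] * s[:k]


def latent_similarity(i, j):
    """Cosine similarity between movies i and j in the latent space."""
    a, b = item_factors[i], item_factors[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_same_block = latent_similarity(0, 1)   # rated alike by the same users
sim_other_block = latent_similarity(0, 2)  # liked by different users
print(sim_same_block, sim_other_block)
```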

Try using `svd.predict()` to predict ratings for a given user and movie, $\hat{r}_{ui}$

In [None]:
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING)

Look it up in the matrix...

In [None]:
svd.get_matrix().value(ITEMID, USERID)
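
A predicted rating $\hat{r}_{ui}$ can be read off the low-rank reconstruction and clipped to the valid rating range. This is a NumPy sketch of that idea on hypothetical data; `predict_rating` is an illustrative helper and may differ from what recsys does internally.

```python
import numpy as np

# Hypothetical ratings: rows are movies, columns are users.
toy_M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])


def predict_rating(matrix, item, user, k, min_rating, max_rating):
    """Entry (item, user) of the rank-k reconstruction, clipped to range."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    estimate = (U[item, :k] * s[:k]) @ Vt[:k, user]
    return float(np.clip(estimate, min_rating, max_rating))


print(predict_rating(toy_M, item=0, user=0, k=2, min_rating=0.0, max_rating=5.0))
```

With `k` equal to the full rank the reconstruction returns the original rating; smaller `k` gives a smoothed estimate.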

Try using `svd.recommend()` to recommend unrated movies to a user (`is_row=False`).

In [None]:
svd.recommend(USERID, is_row=False)

Which users should see Toy Story? (i.e. which users who have not rated Toy Story would give it a high rating?)

In [None]:
svd.recommend(ITEMID)
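
Recommending amounts to scoring the entries a user has not rated and sorting them. The NumPy sketch below does this on a hypothetical matrix, under the assumption that a stored 0 means "not rated"; `recommend_unrated` is an illustrative helper, not the recsys method.

```python
import numpy as np

# Hypothetical ratings: rows are movies, columns are users; 0 = not rated.
toy_M = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [0.0, 4.0, 1.0],
    [0.0, 1.0, 4.0],
])


def recommend_unrated(matrix, user, k, n=2):
    """Return up to n unrated movie indices, best predicted score first."""
    rated = matrix[:, user] > 0
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Predicted scores for every movie for this user (rank-k reconstruction).
    scores = (U[:, :k] * s[:k]) @ Vt[:k, user]
    candidates = np.where(~rated)[0]
    order = candidates[np.argsort(-scores[candidates])]
    return order[:n].tolist()


print(recommend_unrated(toy_M, user=0, k=2))
```

Swapping the roles of rows and columns answers the reverse question of which users to show a given movie.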

Find out more here: [https://github.com/ocelma/python-recsys](https://github.com/ocelma/python-recsys)