# Collaborative Filtering - Item CF Example
<img src="http://grouplens.org/site-content/uploads/ml-logo.png"/>


We are going to use the MovieLens dataset to build a recommendation system. The dataset contains 100,000 ratings from roughly 1,000 users on roughly 1,700 movies. The full data can be found [here](http://grouplens.org/datasets/movielens/100k/). Please download and unzip the data before continuing.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline


In [None]:
# Adjust the paths to "u.data" and "u.item" to match where you unzipped the dataset.

rating_headers = ['user_id', 'movie_id', 'rating', 'timestamp']
movie_ratings = pd.read_csv('u.data',sep="\t",names=rating_headers)

movie_headers = ['movie_id','title','release_date','video_release_date','imdb_url','unknown','action','adventure',
                 'animation','childrens','comedy','crime','docu','drama','fantasy','film-noir','horror','musical',
                 'mystery','romance','scifi','thriller','war','western']
# u.item is not UTF-8 encoded; latin-1 avoids a UnicodeDecodeError on Python 3
movie_titles = pd.read_csv('u.item',sep='|',names=movie_headers,encoding='latin-1')

In [None]:
movie_ratings.head()

In [None]:
# We don't need the timestamp, so let's drop it
movie_ratings.drop("timestamp",axis=1,inplace=True)
movie_ratings.head()

In [None]:
movie_titles.head()

In [None]:
# For now, all we really care about are the movie_id and the title (the first two columns)
movie_titles = movie_titles[['movie_id','title']]
movie_titles.head()

In [None]:
# Okay, let's merge the movie ratings and movie titles on movie_id
# This will give us a more easily readable dataset
df = movie_ratings.merge(movie_titles,on="movie_id")
df.head()

In [None]:
# What do these ratings look like?
df.rating.value_counts()

In [None]:
df.hist('rating',bins=5) 

# Gameplan


Okay, so we have a full dataset. We now want to perform item-based collaborative filtering. Before we can do that, we should center the ratings by subtracting the global mean rating, and then reshape the data into a user x movie matrix.

**Quiz:** Why would we not want to normalize ratings per movie?



In [None]:
df['rating_norm'] = df.rating - df.rating.mean()
df.head()

In [None]:
df.rating_norm.hist()

In [None]:
# Use pivot_table to create the user x movie matrix we want, where each entry is the centered rating.
# Missing ratings are filled with 0, which after centering corresponds to the global mean rating.
user_item_df = df.pivot_table(values="rating_norm",index="user_id",columns="title",fill_value=0)
user_item_df.head()

There are many similarity metrics we could use to create a similarity matrix. The similarity matrix is a square item x item matrix, where each entry measures how similar the two items are.

In this instance, we define similarity as the column-wise correlation between movies. Then, for a given movie, we can look up the movies with the highest correlation.

In [None]:
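# rowvar=False treats each column (movie) as a variable; the deprecated `bias` argument has no effect and is dropped
corr_mat = pd.DataFrame(np.corrcoef(user_item_df,rowvar=False),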
                        index=user_item_df.columns,
                        columns=user_item_df.columns)

In [None]:
# Series.order() was removed from pandas; sort_values() is its replacement
corr_mat.loc[:,"Toy Story (1995)"].sort_values(ascending=False)[:10]

Pretty good! "You've watched Toy Story. You might also like Beauty and the Beast and Aladdin (and maybe Apollo 13)."

# Exercise 1:

**Part 1**

If you create a similarity matrix using vector cosine similarity rather than correlation, how do the top 10 "similar" movies for "Toy Story (1995)" change? Do they make sense?

$$\text{similarity}(a, b) = \cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

Given two movies `a` and `b`, you could compute this quantity as:
```
a.dot(b)/( np.linalg.norm(a) * np.linalg.norm(b) )
```
Hint: Think about how you could make this into a matrix operation
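
To illustrate the hint without giving away the MovieLens answer, here is a minimal sketch on a made-up 3 x 3 matrix. The helper name `cosine_similarity_matrix` and the toy values are assumptions for illustration only; the idea is that one matrix product gives every pairwise dot product, and an outer product of the column norms gives every denominator.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(item_df):
    """Item x item cosine similarity for a user x item DataFrame (hypothetical helper)."""
    values = item_df.to_numpy(dtype=float)
    # All pairwise column dot products at once: (items x users) @ (users x items)
    dots = values.T @ values
    # Column norms; their outer product is |a| * |b| for every pair of items
    norms = np.linalg.norm(values, axis=0)
    denom = np.outer(norms, norms)
    # Items with zero norm (no ratings) get similarity 0 instead of NaN
    sim = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    return pd.DataFrame(sim, index=item_df.columns, columns=item_df.columns)

# Toy user x movie matrix (made-up numbers, not MovieLens data)
toy = pd.DataFrame(
    {"A": [1.0, 2.0, 0.0], "B": [2.0, 4.0, 0.0], "C": [0.0, 0.0, 3.0]},
    index=[1, 2, 3],
)
toy_cos = cosine_similarity_matrix(toy)
```

In this toy matrix, columns `A` and `B` point in the same direction and `C` is orthogonal to both. The same function could be applied to `user_item_df` and queried the same way as `corr_mat`.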

Look at the tenth movie in your list. You would definitely want a hybrid approach here!

**Part 2**

We are going to go back to using the correlation matrix now. Let's make some predictions for user 224. (You will want to use the pandas `loc` accessor to select the _label_ 224, not the positional _index_ 224. [More here](http://pandas.pydata.org/pandas-docs/stable/indexing.html).)


What are the top 10 movies that user 224 liked?
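
The label versus position distinction is easy to get wrong, so here is a small sketch on a made-up matrix whose index starts at 1, as `user_id` does in MovieLens. The frame `toy_ratings` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy user x movie matrix whose index labels start at 1 (made-up values)
toy_ratings = pd.DataFrame(
    {"Movie A": [0.5, -1.5, 1.5], "Movie B": [0.0, 1.5, -0.5], "Movie C": [1.5, 0.0, 0.5]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

by_label = toy_ratings.loc[2]      # the row whose user_id label is 2
by_position = toy_ratings.iloc[2]  # the third row, which belongs to user_id 3

# Highest-rated movies for the user with label 2
top_for_user_2 = by_label.sort_values(ascending=False).head(2)
```

The same pattern, `user_item_df.loc[224]` followed by `sort_values`, should carry over to the real matrix.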

**Part 3**

Now, we are going to recommend some movies to user 224! The simplest way to get a set of recommended movies is to take the dot product of user 224's rating vector with our correlation matrix.
```
user_224_recs = user_224_series.dot( corr_mat )
```

What are the top 10 movies we would recommend user 224?
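
To see what the dot product is doing, here is a minimal sketch on a made-up 4 x 3 matrix. All names and values are assumptions for illustration. Each movie's score is the sum of the user's ratings weighted by that movie's correlation with each rated movie, so movies correlated with what the user liked score highest. Filtering out already-rated movies is one possible design choice, not something the exercise requires.

```python
import numpy as np
import pandas as pd

# Toy mean-centered ratings (made-up values); 0 means "not rated"
toy = pd.DataFrame(
    {"A": [1.0, 1.0, -1.0, 1.0],
     "B": [1.0, 0.5, -1.0, 0.0],
     "C": [-1.0, -1.0, 1.0, 0.0]},
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)
toy_corr = pd.DataFrame(np.corrcoef(toy, rowvar=False),
                        index=toy.columns, columns=toy.columns)

user_series = toy.loc[4]                 # this user only rated movie A, and liked it
user_scores = user_series.dot(toy_corr)  # one score per movie

# Keep only movies the user has not rated yet, best first
unseen = user_scores[user_series == 0].sort_values(ascending=False)
```

Here movie `B` tends to be rated like `A` while `C` is rated the opposite way, so `B` ends up ranked above `C` for this user.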

**Part 4**

It is hard to tell at a glance whether our item-based collaborative filtering is working well. Let's use seaborn to draw a scatter plot of the ratings the user gave against our predictions for those items.
```
sb.regplot( x=user_224_series, y=user_224_recs )
```

Pretty good. We can ignore the actual numbers, but look at the trend. Roughly speaking, our predictions correlate well with the ratings the user actually gave. To measure accuracy properly, we would have to run cross-validation tests on held-out ratings.


# Exercise 2:

Recall the previous lecture's example on SVD. This time, use `TruncatedSVD` from the `sklearn.decomposition` module to decompose our user x movie data matrix (where each entry is the normalized rating).

* Use TruncatedSVD to fit_transform our data matrix into a reduced matrix.
* Plot the cumulative sum of explained\_variance\_ratio\_.
* Plot horizontal bar graphs of the first 8 components. What are the weights of the top 10 movies for each of these components?
* Reconstruct the matrix using the svd.inverse_transform function and pass in your reduced matrix. This is equivalent to the following operation: reduced\_matrix.dot( svd.components\_ ). What are the top 10 recommendations for user 224?
* Plot using sb.regplot user 224's ratings as well as the new recommendations for user 224.
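
As a warm-up for the steps above, here is a minimal sketch of the `TruncatedSVD` calls on a random stand-in matrix. The matrix, its shape, and the choice of 3 components are assumptions for illustration; the plots and the real `user_item_df` are left to the exercise.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Stand-in for the user x movie matrix: 20 users, 8 movies, random values
toy_matrix = rng.normal(size=(20, 8))

svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(toy_matrix)         # shape (20, 3)

# Cumulative share of variance captured by the first k components
cumulative_variance = np.cumsum(svd.explained_variance_ratio_)

# Two equivalent ways to map back to the original movie space
reconstructed = svd.inverse_transform(reduced)  # shape (20, 8)
manual = reduced.dot(svd.components_)
```

Each row of `svd.components_` holds one weight per movie, which is what the bar graphs in the exercise would display.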

# Exercise 3 (Optional):


The SVD method does not work as well when the matrix is very sparse (like now). Most recommendation systems instead try to model users and movies along some hidden feature space. The problem is having to solve for both some matrix X (user to feature) and some matrix Y (feature to movie). This can be solved using alternating least squares.

Rough procedure, for N iterations:
* Fix X. Solve for Y.
* Fix Y. Solve for X.
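
The two alternating steps can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the method from the blog post linked in this exercise: the function name `als`, the ridge penalty `reg`, the boolean `observed` mask, and the toy rank-2 data are all made up for this example. With one factor fixed, each column of Y (or row of X) is the solution of a small regularized least squares problem over the observed ratings only.

```python
import numpy as np

def als(ratings, observed, n_features=2, n_iters=20, reg=0.1, seed=0):
    """Minimal alternating least squares sketch (hypothetical helper).

    ratings:  users x movies array
    observed: boolean array of the same shape, True where a rating exists
    """
    rng = np.random.default_rng(seed)
    n_users, n_movies = ratings.shape
    X = rng.normal(scale=0.1, size=(n_users, n_features))   # user -> feature
    Y = rng.normal(scale=0.1, size=(n_features, n_movies))  # feature -> movie
    ridge = reg * np.eye(n_features)
    for _ in range(n_iters):
        # Fix X. Solve a small ridge regression for each movie's column of Y.
        for m in range(n_movies):
            Xo = X[observed[:, m]]
            Y[:, m] = np.linalg.solve(Xo.T @ Xo + ridge,
                                      Xo.T @ ratings[observed[:, m], m])
        # Fix Y. Solve a small ridge regression for each user's row of X.
        for u in range(n_users):
            Yo = Y[:, observed[u]]
            X[u] = np.linalg.solve(Yo @ Yo.T + ridge,
                                   Yo @ ratings[u, observed[u]])
    return X, Y

# Toy data: an exactly rank-2 matrix with roughly 20% of the entries hidden
rng = np.random.default_rng(1)
true_ratings = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
observed = rng.random(true_ratings.shape) < 0.8

X, Y = als(true_ratings, observed)
predictions = X @ Y
observed_rmse = np.sqrt(np.mean((predictions - true_ratings)[observed] ** 2))
baseline_rmse = np.sqrt(np.mean(true_ratings[observed] ** 2))  # always predicting 0
```

Entries of `predictions` where `observed` is `False` are the model's guesses for unrated movies, which is where recommendations would come from.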

The following blog post has an example of an alternating least squares solution. Try writing your own version that solves for the factors and makes recommendations, based on the post:

http://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/
