# Introduction (688 only)
In this question, you'll build a basic recommendation system using collaborative filtering and apply it to a standard recommendation dataset, the MovieLens dataset. The purpose of this question is to become familiar with the internals of recommendation systems: both how they train and how they form recommendations. 

### Grading 
Your submission will be scored in the following manner: 
* process - 10pts
* train - 15pts
* recommend - 10pts

## Collaborative Filtering by Matrix Factorization
In collaborative filtering we wish to factorize our ratings matrix into two smaller feature matrices whose product approximates the original ratings matrix on its observed entries. Specifically, given some partially filled ratings matrix $X\in \mathbb{R}^{m\times n}$, we want to find feature matrices $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$ such that $UV^T \approx X$. In the case of movie recommendation, each row of $U$ holds the features of a user and each row of $V$ holds the features of a movie, so $u_i^Tv_j$ is the predicted rating of user $i$ on movie $j$. This forms the basis of our hypothesis function for collaborative filtering: 

$$h_\theta(i,j) = u_i^T v_j$$

In general, $X$ is only partially filled (and usually quite sparse), so we can indicate the non-presence of a rating with a 0. Notationally, let $S$ be the set of $(i,j)$ such that $X_{i,j} \neq 0$, so $S$ is the set of all pairs for which we have a rating. The loss used for collaborative filtering is squared loss:

$$\ell(h_\theta(i,j),X_{i,j}) = (h_\theta(i,j) - X_{i,j})^2$$

The last ingredient of collaborative filtering is an $\ell_2$ weight penalty on the parameters, so our total loss is:

$$\sum_{(i,j)\in S}\ell(h_\theta(i,j),X_{i,j}) + \lambda_u \|U\|_2^2 + \lambda_v \|V\|_2^2$$

For this assignment, we'll let $\lambda_u = \lambda_v = \lambda$. 
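
As a rough illustration of the objective above, the following sketch evaluates the regularized loss on a tiny made-up ratings matrix. The function name `total_loss` and the toy values are assumptions for illustration only, and the penalty is assumed to be the sum of squared entries of $U$ and $V$.

```python
import numpy as np

def total_loss(X, U, V, lam):
    """Regularized squared loss over observed entries (hypothetical helper)."""
    W = X != 0                      # indicator of the set S of observed ratings
    residual = (U @ V.T - X) * W    # zero out unobserved entries
    return np.sum(residual ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

# Toy data: 2 users, 3 movies, k = 1 feature; 0 marks a missing rating.
X = np.array([[4.0, 0.0, 3.0],
              [0.0, 3.0, 0.0]])
U = np.array([[2.0], [1.0]])
V = np.array([[2.0], [3.0], [1.0]])

loss = total_loss(X, U, V, lam=0.5)
print(loss)
```

Only the three observed entries contribute to the data term; the unobserved zeros are ignored, which is what distinguishes this loss from an ordinary matrix approximation error.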

## MovieLens rating dataset
To start off, let's get the MovieLens dataset. The full dataset is quite large (24+ million ratings), so we will instead use the smaller subset of 100k ratings. You will need to download it from the GroupLens website. 

* You can download the archive containing their latest dataset release from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip (last updated October 2016). 
* For more details (contents and structure of archive), you can read the README at http://files.grouplens.org/datasets/movielens/ml-latest-README.html
* You can find general information from their website description located at http://grouplens.org/datasets/movielens/. 

For this assignment, we will only be looking at the ratings data. However, it is good to note that there is additional data available (e.g. movie data and user-made tags for movies) which could be used to improve the recommendation system. 

In [2]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
import scipy.linalg as la
import matplotlib
matplotlib.use("svg")
# AUTOLAB_IGNORE_START
%matplotlib inline
# AUTOLAB_IGNORE_STOP
import matplotlib.pyplot as plt
plt.style.use("ggplot")

In [3]:
# AUTOLAB_IGNORE_START
movies = pd.read_csv("ml-latest-small/movies.csv")
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
ratings = pd.read_csv("ml-latest-small/ratings.csv")
ratings.head()
# AUTOLAB_IGNORE_STOP

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


## Data preparation

Matrix factorization requires our ratings to be stored in a matrix with one row per user and one column per movie, so the first task is to convert the dataframe into this format. In general, these matrices are extremely large and sparse (especially if you want to process the 24 million ratings); for the purposes of this assignment, however, a dense representation of this smaller dataset should fit on any machine. 

### Specification
* Split the data by assigning the first $\mathrm{floor}(9n/10)$ permuted entries to be the training set, and the remaining to be the testing set. Use the order given by the permutation. 
* Each row of the ratings matrix corresponds to a user. The first row of the matrix should correspond to the first user (by user ID), and so on. This works because the set of user IDs already forms a consecutive range of numbers. 
* Each column of the ratings matrix corresponds to a movie. Because the set of movie IDs does not form a consecutive range of numbers, the order of the columns doesn't matter, so long as the resulting list of movie names matches the columns. 
* Each user and movie that exists in the **ratings** dataframe should be present in the ratings matrix, even if it doesn't have any entries. We will only use the movies dataframe to extract the names of the movies. 
* Any entry that does not have a rating should have a default value of 0. 
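
The specification above can be sketched on a toy set of ratings. This is only an illustration, not the reference solution: the arrays `user_ids`, `movie_ids`, `vals` and the fixed permutation `P` are made up, and `np.unique` is used here as one possible way to map non-consecutive movie IDs to column indices.

```python
import numpy as np

# Toy ratings as parallel arrays (made-up values). User IDs are 1..3,
# movie IDs are not consecutive.
user_ids = np.array([1, 1, 2, 3, 3])
movie_ids = np.array([10, 50, 10, 31, 50])
vals = np.array([4.0, 3.5, 2.0, 5.0, 1.0])

# Map each movie ID to a column index in [0, number of distinct movies).
unique_movies, cols = np.unique(movie_ids, return_inverse=True)
rows = user_ids - 1                  # user IDs form a consecutive range from 1

# A fixed permutation stands in for np.random.permutation(len(vals)).
P = np.array([3, 0, 4, 1, 2])
n_train = 9 * len(P) // 10           # floor(9n/10)
train_idx, test_idx = P[:n_train], P[n_train:]

# Both matrices share the same shape; missing ratings stay at 0.
shape = (user_ids.max(), len(unique_movies))
X_tr = np.zeros(shape)
X_te = np.zeros(shape)
X_tr[rows[train_idx], cols[train_idx]] = vals[train_idx]
X_te[rows[test_idx], cols[test_idx]] = vals[test_idx]

print(X_tr)
print(X_te)
```

With five ratings, the first four permuted entries land in the training matrix and the last one in the testing matrix, so every rating appears in exactly one of the two.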

In [7]:
import math
def process(ratings, movies, P):
    """ Given a dataframe of ratings and a random permutation, split the data into a training 
        and a testing set, in matrix form. 
        
        Args: 
            ratings (dataframe) : dataframe of MovieLens ratings
            movies (dataframe) : dataframe of MovieLens movies
            P (numpy 1D array) : random permutation vector
            
        Returns: 
            (X_tr, X_te, movie_names)  : training and testing splits of the ratings matrix (both 
                                         numpy 2D arrays), and a python list of movie names 
                                         corresponding to the columns of the ratings matrices. 
    """
    unique_users = ratings.userId.unique()
    unique_movies = ratings.movieId.unique()
    # Map each movie ID to a column index; movie IDs are not consecutive.
    index_map = {movie_id: col for col, movie_id in enumerate(unique_movies)}

    # Titles in column order, looked up by movie ID.
    titles = movies.set_index('movieId').title
    movie_list = [titles.loc[int(movie_id)] for movie_id in unique_movies]

    partition = math.floor(9*len(P)/10)
    train_index = P[:partition]
    test_index = P[partition: ]

    train_data = ratings.iloc[train_index,:]
    test_data = ratings.iloc[test_index,:]

    X_tr = np.zeros((len(unique_users), len(unique_movies)))
    X_te = np.zeros((len(unique_users), len(unique_movies)))
    
    for xi in train_data.itertuples():
        X_tr[xi.userId-1][index_map[xi.movieId]] = xi.rating
    
    for xi in test_data.itertuples():
        X_te[xi.userId-1][index_map[xi.movieId]] = xi.rating

    return (X_tr, X_te, movie_list)

# AUTOLAB_IGNORE_START
X_tr, X_te, movieNames = process(ratings, movies, np.random.permutation(len(ratings)))
print(X_tr.shape, X_te.shape, movieNames[:5])
# AUTOLAB_IGNORE_STOP

(671, 9066) (671, 9066) ['Dangerous Minds (1995)', 'Dumbo (1941)', 'Sleepers (1996)', 'Escape from New York (1981)', 'Cinema Paradiso (Nuovo cinema Paradiso) (1989)']


For example, running this on the small MovieLens dataset using a random permutation gives the following result: 
    
    (671L, 9066L) (671L, 9066L) ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)', 'Waiting to Exhale (1995)', 'Father of the Bride Part II (1995)']

Your actual titles may vary depending on the random permutation given. 

## Alternating Minimization for Collaborative Filtering
Now we build the collaborative filtering recommendation system. We will use a method known as alternating least squares. Essentially, we will alternate between optimizing over $U$ and $V$ by holding the other constant. By treating one matrix as a constant, we get exactly a weighted least squares problem, which has a well-known solution. More details can be found in the lecture notes. 
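
To make the least squares step concrete, here is a sketch of the update that follows from the loss defined earlier, assuming the penalty is the sum of squared entries of the feature matrices. The symbol $S_i$ is notation introduced here for illustration. Fix $V$ and let $S_i = \{j : (i,j) \in S\}$ be the set of movies rated by user $i$. The part of the total loss that depends on $u_i$ is

$$\sum_{j \in S_i} (u_i^T v_j - X_{i,j})^2 + \lambda \|u_i\|_2^2$$

Setting the gradient with respect to $u_i$ to zero gives

$$u_i = \left(\sum_{j \in S_i} v_j v_j^T + \lambda I\right)^{-1} \sum_{j \in S_i} X_{i,j}\, v_j$$

By symmetry, with $U$ fixed and $S_j = \{i : (i,j) \in S\}$ the set of users who rated movie $j$, the update for $v_j$ is

$$v_j = \left(\sum_{i \in S_j} u_i u_i^T + \lambda I\right)^{-1} \sum_{i \in S_j} X_{i,j}\, u_i$$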

### Specification
* Similar to the softmax regression on MNIST, there is a verbose parameter here that you can use to print your training error and test error. These should decrease (and converge). 
* You can assume a dense representation of all the inputs. 
* You may find it useful to have an indicator matrix W where $W_{ij} = 1$ if there is a rating in $X_{ij}$. 
* You can initialize U,V with random values. 
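
As a small illustration of one alternating step, the sketch below updates every user vector with $V$ held fixed on a made-up 3x3 example, then updates every movie vector with $U$ held fixed, and prints the regularized training loss after each step. The helper names `loss` and `update_users` are hypothetical, and this is not necessarily how the reference solution is written.

```python
import numpy as np

def loss(X, U, V, lam):
    """Squared error on observed entries plus the l2 penalty."""
    W = X > 0
    return np.sum(W * (X - U @ V.T) ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

def update_users(X, V, lam):
    """Solve the regularized least squares problem for each row of X with V fixed."""
    m, k = X.shape[0], V.shape[1]
    U = np.zeros((m, k))
    for i in range(m):
        rated = X[i] > 0                       # movies this user has rated
        V_i = V[rated]                         # features of those movies
        A = V_i.T @ V_i + lam * np.eye(k)      # k x k normal equations
        b = V_i.T @ X[i, rated]
        U[i] = np.linalg.solve(A, b)
    return U

rng = np.random.default_rng(0)
X = np.array([[5.0, 0.0, 3.0],
              [0.0, 4.0, 0.0],
              [1.0, 2.0, 0.0]])
k, lam = 2, 0.1
U = rng.random((3, k))
V = rng.random((3, k))

before = loss(X, U, V, lam)
U = update_users(X, V, lam)
after_users = loss(X, U, V, lam)
V = update_users(X.T, U, lam)   # by symmetry, the same routine updates movies
after_movies = loss(X, U, V, lam)
print(before, after_users, after_movies)
```

Each half-step exactly minimizes the loss over one block of parameters, so the regularized training loss should never increase from one step to the next.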

In [225]:
def error(X, U, V):
    """ Compute the mean squared error of the observed ratings in X and their estimated values. 
        Args: 
            X (numpy 2D array) : a ratings matrix as specified above
            U (numpy 2D array) : a matrix of features for each user
            V (numpy 2D array) : a matrix of features for each movie
        Returns: 
            (float) : the mean squared error of the observed ratings with their estimated values
    """
    # Indicator of observed ratings; unobserved entries are stored as 0.
    W = (X > 0).astype(np.float64)
    # Average the squared residuals over observed entries only.
    return np.sum(W * np.square(X - U.dot(V.T))) / np.count_nonzero(W)

    
def train(X, X_te, k, U, V, niters=51, lam=10, verbose=False):
    """ Train a collaborative filtering model. 
        Args: 
            X (numpy 2D array) : the training ratings matrix as specified above
            X_te (numpy 2D array) : the testing ratings matrix as specified above
            k (int) : the number of features use in the CF model
            U (numpy 2D array) : an initial matrix of features for each user
            V (numpy 2D array) : an initial matrix of features for each movie
            niters (int) : number of iterations to run
            lam (float) : regularization parameter
            verbose (boolean) : verbosity flag for printing useful messages
            
        Returns:
            (U,V) : A pair of the resulting learned matrix factorization
    """
    
    reg = lam * np.eye(k)
    # Indicator of observed ratings; unobserved entries are stored as 0.
    W = X > 0
    m, n = X.shape

    for iter_ in range(niters):
        # Update each user's features with V held fixed.
        for i in range(m):
            rated = W[i, :]
            V_i = V[rated, :]
            U[i, :] = la.solve(V_i.T.dot(V_i) + reg, V_i.T.dot(X[i, rated]))

        # Update each movie's features with U held fixed.
        for j in range(n):
            raters = W[:, j]
            U_j = U[raters, :]
            V[j, :] = la.solve(U_j.T.dot(U_j) + reg, U_j.T.dot(X[raters, j]))

        if verbose:
            if iter_ == 0:
                print('Iter |Train Err |Test Err')
            print(iter_, '|', error(X, U, V), '|', error(X_te, U, V))

    return (U, V)




Training the recommendation system with a random initialization of U,V with 5 features and $\lambda = 10$ results in the following output. Your results may vary depending on your random permutation.  

    Iter |Train Err |Test Err  
        0|    1.3854|    2.1635
        5|    0.7309|    1.5782
       10|    0.7029|    1.5078
       15|    0.6951|    1.4874
       20|    0.6910|    1.4746
       25|    0.6898|    1.4679
       30|    0.6894|    1.4648
       35|    0.6892|    1.4634
       40|    0.6891|    1.4631
       45|    0.6891|    1.4633
       50|    0.6891|    1.4636
    Wall time: 7min 32s

In [227]:
# AUTOLAB_IGNORE_START
k = 5
# Random initialization: one row of k features per user and per movie.
U0 = np.random.random((X_tr.shape[0], k))
V0 = np.random.random((X_tr.shape[1], k))
U, V = train(X_tr, X_te, k, U0, V0, verbose=True)
# AUTOLAB_IGNORE_STOP

Iter |Train Err |Test Err
0 | 0.936135983959 | 13.5778103259
1 | 0.811776377856 | 13.5779735165
2 | 0.742957939643 | 13.578098252
3 | 0.713093423828 | 13.5781388825
4 | 0.698560765099 | 13.5781683375
5 | 0.691472872344 | 13.5781897344
6 | 0.687529626672 | 13.578204259
7 | 0.685054177408 | 13.5782153496
8 | 0.683428017044 | 13.5782257884
9 | 0.682384961828 | 13.5782365508
10 | 0.68173576564 | 13.5782474455
11 | 0.681310935071 | 13.5782578922
12 | 0.680984184181 | 13.5782674141
13 | 0.680688324995 | 13.5782757917
14 | 0.680408386907 | 13.5782830189
15 | 0.680155851912 | 13.5782892087
16 | 0.679940227848 | 13.578294517
17 | 0.679758191477 | 13.578299095
18 | 0.679599727935 | 13.5783030682
19 | 0.679456235691 | 13.5783065316
20 | 0.679323308798 | 13.5783095575
21 | 0.6791998612 | 13.5783122041
22 | 0.679086443626 | 13.5783145225
23 | 0.678984023924 | 13.5783165584
24 | 0.678893381558 | 13.5783183517
25 | 0.678814894817 | 13.5783199349
26 | 0.678748507532 | 13.5783213321
27 | 0.678693766101

## Recommendations

Finally, we need to be able to make recommendations given a matrix factorization. We can do this by simply recommending the movie with the highest value in the estimated ratings matrix. 

### Specification
* For each user, recommend the movie with the highest predicted rating for that user that the user **hasn't** seen before. 
* Return the result in a list such that the ith element in this list is the recommendation for the user corresponding to the ith row of the ratings matrix. 
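
The rule above can be illustrated on a toy example. The predicted ratings `X_hat` and the titles below are made up, and `top_unseen` is a hypothetical helper name; masking seen movies with negative infinity is one simple way to exclude them from the argmax.

```python
import numpy as np

def top_unseen(X, X_hat, names):
    """Return, for each user, the unseen movie with the highest predicted rating."""
    scores = np.where(X > 0, -np.inf, X_hat)   # exclude already-rated movies
    return [names[j] for j in np.argmax(scores, axis=1)]

names = ["A", "B", "C"]
X = np.array([[5.0, 0.0, 0.0],      # user 0 has rated movie A
              [0.0, 0.0, 4.0]])     # user 1 has rated movie C
X_hat = np.array([[4.9, 2.0, 3.5],
                  [1.0, 3.0, 4.2]])

picks = top_unseen(X, X_hat, names)
print(picks)  # ['C', 'B']
```

Note that each user's overall highest prediction is for a movie they have already rated, so the mask changes the answer in both rows.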

In [206]:
def recommend(X, U, V, movieNames):
    """ Recommend a new movie for every user.
        Args: 
            X (numpy 2D array) : the training ratings matrix as specified above
            U (numpy 2D array) : a learned matrix of features for each user
            V (numpy 2D array) : a learned matrix of features for each movie
            movieNames : a list of movie names corresponding to the columns of the ratings matrix
        Returns:
            (list) : a list of movie names recommended for each user
    """
    X_hat = U.dot(V.T)
    # Exclude movies the user has already rated.
    X_hat[X > 0] = -np.inf
    # Pick the highest-scoring remaining movie for each user.
    return [movieNames[j] for j in np.argmax(X_hat, axis=1)]
    
# AUTOLAB_IGNORE_START
recommendations = recommend(X_tr, U, V, movieNames)
print(recommendations[:10])
# ['Princess Bride, The (1987)', 'True Lies (1994)', 'Star Wars: Episode IV - A New Hope (1977)', 
# 'Saving Private Ryan (1998)', 'Pulp Fiction (1994)', 'Star Wars: Episode V - The Empire Strikes Back (1980)', 
# 'Pulp Fiction (1994)', 'Shrek (2001)', 'Pulp Fiction (1994)', 'Star Wars: Episode IV - A New Hope (1977)']
# AUTOLAB_IGNORE_STOP

['Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Usual Suspects, The (1995)', 'Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', "Schindler's List (1993)", 'Fargo (1996)', 'Godfather, The (1972)', "Schindler's List (1993)"]


Our implementation gets the following results (they are all fairly popular, well-known movies). Again, your results will vary depending on the random permutation. 

    ['Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Shawshank Redemption, The (1994)', 'Godfather, The (1972)', 'Fargo (1996)', 'Godfather, The (1972)', "Schindler's List (1993)"]