## DATA 643 Project 3 | Matrix Factorization methods

#### By Dhananjay Kumar

The goal of this assignment is to practice matrix factorization techniques. The task is to implement a matrix factorization method, such as Singular Value Decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system.

For this assignment I used the MovieLens 100K dataset and referred to the following sources available on the internet:

1. http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
2. https://fenix.tecnico.ulisboa.pt/downloadFile/3779576344458/singular-value-decomposition-fast-track-tutorial.pdf
3. https://beckernick.github.io/matrix-factorization-recommender/


### Load Data 


In [1]:
import pandas as pd
import numpy as np


ratings_list = [i.strip().split("\t") for i in open('u.data', 'r').readlines()]
movies_list = [i.strip().split("|")[0:3] for i in open('u.item', 'r').readlines()]
ratings = np.array(ratings_list)
movies = np.array(movies_list)
ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)
movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'ReleaseDate'])
movies_df['MovieID'] = movies_df['MovieID'].apply(pd.to_numeric)
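
The same two files can also be read with `pandas.read_csv`, which closes the file handles and parses the numeric columns directly. This is a minimal sketch, assuming the standard MovieLens 100K layout (tab-separated `u.data`, pipe-separated `u.item` stored in Latin-1 encoding); the helper name `load_movielens` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def load_movielens(data_path="u.data", item_path="u.item"):
    """Read the ratings and the first three movie columns into DataFrames."""
    ratings_df = pd.read_csv(
        data_path,
        sep="\t",
        header=None,
        names=["UserID", "MovieID", "Rating", "Timestamp"],
    )
    movies_df = pd.read_csv(
        item_path,
        sep="|",
        header=None,
        usecols=[0, 1, 2],  # keep only id, title and release date
        names=["MovieID", "Title", "ReleaseDate"],
        encoding="latin-1",  # assumption: u.item is not UTF-8 encoded
    )
    return ratings_df, movies_df
```

Calling `ratings_df, movies_df = load_movielens()` should yield frames with the same columns as the cell above, with `UserID`, `MovieID`, `Rating` and `Timestamp` already numeric.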

### Movie Dataset

In [2]:

movies_df.head()

,MovieID,Title,ReleaseDate
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995


In [3]:
ratings_df.head()

,UserID,MovieID,Rating,Timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


To use matrix factorization, let's pivot the ratings into a new DataFrame in which each row represents a unique user ID and each column represents a unique movie ID. Movies a user has not rated are filled with 0.

In [4]:
R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df.head()

UserID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
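
To make the pivot step concrete, here is a minimal sketch on a made-up three-row ratings table (the `toy` data and the `density` name are illustrative assumptions, not part of the dataset). It also shows that a 0 in the pivoted matrix means "not rated" rather than a rating of zero.

```python
import pandas as pd

# Hypothetical ratings: user 1 rated movies 10 and 20, user 2 rated movie 20
toy = pd.DataFrame({
    "UserID": [1, 1, 2],
    "MovieID": [10, 20, 20],
    "Rating": [5, 3, 4],
})

# One row per user, one column per movie; missing ratings become 0
toy_matrix = toy.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)

# Share of user/movie cells that hold an actual rating
density = (toy_matrix.values != 0).mean()
```

Here three of the four cells are filled. Note that `pivot` expects each (UserID, MovieID) pair to appear only once.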


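Next, convert the DataFrame to a NumPy array and subtract each user's mean rating from that user's row.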

In [5]:
R = R_df.values  # ratings as a plain NumPy array
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
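
The de-meaning step can be checked on a small made-up matrix. This is a sketch with illustrative names (`toy_R`, `row_mean`); it mirrors the cell above, including the fact that the zeros standing in for unrated movies count toward each user's mean.

```python
import numpy as np

toy_R = np.array([[5.0, 3.0, 0.0, 0.0],
                  [0.0, 4.0, 4.0, 0.0]])

# Per-user mean over all columns (zeros for unrated movies are included)
row_mean = toy_R.mean(axis=1)

# reshape(-1, 1) turns the means into a column so they broadcast across each row
toy_demeaned = toy_R - row_mean.reshape(-1, 1)

# Adding the means back recovers the original matrix
restored = toy_demeaned + row_mean.reshape(-1, 1)
```

After this step every row of `toy_demeaned` sums to zero, which is the property the later "add the mean back" step relies on.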

## Singular Value Decomposition (SVD)

SVD is a matrix factorization technique in which a matrix X is factorized into three matrices, U, S and V^T, so that X = U S V^T.

Given an m x n matrix X of rank r:

    U is an (m x r) matrix with orthonormal columns
    S is an (r x r) diagonal matrix with non-negative real numbers on the diagonal
    V^T is an (r x n) matrix with orthonormal rows

The elements on the diagonal of S are known as the singular values of X.

In a recommender system, the U matrix represents the feature vectors corresponding to the users in the hidden feature space, and the V matrix represents the feature vectors corresponding to the items in the hidden feature space.
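
The properties of the factorization can be verified numerically. The following is a minimal sketch using `numpy.linalg.svd` on a small made-up matrix `X` (an assumption for illustration; the notebook itself uses SciPy's `svds` on the ratings matrix).

```python
import numpy as np

X = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])

# full_matrices=False returns the compact shapes: U is 4x3, Vt is 3x3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)  # singular values placed on the diagonal

# Multiplying the three factors gives X back (up to rounding error)
reconstructed = np.dot(np.dot(U, S), Vt)

# Columns of U and rows of Vt are orthonormal
UtU = np.dot(U.T, U)
VtV = np.dot(Vt, Vt.T)
```

Both `UtU` and `VtV` should be numerically equal to the 3 x 3 identity matrix, and the entries of `s` are non-negative and sorted from largest to smallest.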

Fortunately, the SciPy library has a truncated SVD function, svds, which I use as shown below with k = 50 latent features.


In [6]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
# Convert Sigma into a Diagonal Matrix
sigma = np.diag(sigma)
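
Because `svds` computes only `k` singular values and vectors, the product of its three outputs is a rank-`k` approximation of the input rather than an exact copy. The sketch below, on a random matrix `A` chosen purely for illustration, compares that product with the one built from the `k` largest singular values of a full SVD. It does not rely on the order in which `svds` returns the singular values, since the product is the same either way.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
A = rng.rand(20, 15)  # made-up dense matrix standing in for R_demeaned
k = 5

# Truncated SVD: only k singular triplets are computed
U_k, s_k, Vt_k = svds(A, k=k)
approx_svds = np.dot(U_k * s_k, Vt_k)  # U_k * s_k scales each column of U_k

# Reference: full SVD, keeping the k largest singular values
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
approx_full = np.dot(U_full[:, :k] * s_full[:k], Vt_full[:k, :])

# Relative size of what the rank-k approximation leaves out
rel_error = np.linalg.norm(A - approx_svds) / np.linalg.norm(A)
```

The two approximations should agree up to numerical tolerance, and `rel_error` shrinks as `k` grows.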

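As discussed above, multiplying U, S and V^T and adding the user means back gives an approximation of the original user rating matrix. Because only 50 singular values are kept, the result is a low-rank approximation rather than an exact copy, and its entries serve as the predicted ratings.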

In [7]:
predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(predicted_ratings, columns = R_df.columns)
preds_df.head()

MovieID,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
0,6.488436,2.959503,1.634987,3.024467,1.656526,1.659506,3.630469,0.240669,1.791518,3.347816,...,0.011976,-0.092017,-0.074553,-0.060985,0.009427,-0.035641,-0.039227,-0.037434,-0.025552,0.023513
1,2.347262,0.129689,-0.098917,0.328828,0.159517,0.481361,0.213002,0.097908,1.8921,0.671,...,0.003943,-0.026939,-0.03546,-0.029883,-0.027153,-0.015244,-0.008277,-0.01176,0.011639,-0.046924
2,0.291905,-0.26383,-0.151454,-0.179289,0.013462,-0.088309,-0.057624,0.568764,-0.018506,0.280742,...,-0.028964,-0.031622,0.045513,0.026089,-0.021705,0.002282,0.032363,0.017322,-0.006644,-0.00948
3,0.36641,-0.443535,0.041151,-0.007616,0.055373,-0.080352,0.299015,-0.010882,-0.160888,-0.118834,...,0.020069,0.015981,-0.000182,0.005593,0.026634,0.023562,0.036405,0.029984,0.015612,-0.008713
4,4.263488,1.937122,0.052529,1.04935,0.652765,0.002836,1.730461,0.870584,0.341027,0.569055,...,0.019973,-0.053521,-0.017242,-0.007137,-0.038987,0.010338,0.004869,0.007603,-0.020575,0.00333


In [8]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False) # this user's row, best first
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.UserID == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'MovieID', right_on = 'MovieID').
                     sort_values(['Rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['MovieID'].isin(user_full['MovieID'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'MovieID',
               right_on = 'MovieID').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations
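
The core of the function above is "drop what the user has already rated, then keep the highest predictions". A stripped-down sketch of that idea, using a made-up prediction Series and a hypothetical helper `top_n_unrated` (neither is part of the original notebook), might look like this:

```python
import pandas as pd


def top_n_unrated(user_predictions, rated_movie_ids, n=5):
    """Return the n highest-scoring MovieIDs the user has not rated yet."""
    # Keep only movies whose ID is not in the already-rated list
    unrated = user_predictions[~user_predictions.index.isin(rated_movie_ids)]
    return unrated.sort_values(ascending=False).head(n)


# Hypothetical predicted ratings for one user, indexed by MovieID
toy_predictions = pd.Series({1: 4.2, 2: 0.3, 3: 3.9, 4: 1.1, 5: 2.5})
toy_top = top_n_unrated(toy_predictions, rated_movie_ids=[1, 4], n=2)
```

With movies 1 and 4 already rated, the two remaining movies with the highest predictions are 3 and 5, in that order.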

Now let's find the recommendations for user 87.

In [9]:
already_rated, predictions = recommend_movies(preds_df, 87, movies_df, ratings_df, 10)

User 87 has already rated 211 movies.
Recommending highest 10 predicted ratings movies not already rated.


In [10]:
already_rated.head(10)

,UserID,MovieID,Rating,Timestamp,Title,ReleaseDate
105,87,38,5,879875940,"Net, The (1995)",01-Jan-1995
140,87,510,5,879875818,"Magnificent Seven, The (1954)",01-Jan-1954
126,87,172,5,879875737,"Empire Strikes Back, The (1980)",01-Jan-1980
129,87,181,5,879876194,Return of the Jedi (1983),14-Mar-1997
136,87,435,5,879875818,Butch Cassidy and the Sundance Kid (1969),01-Jan-1969
56,87,96,5,879875734,Terminator 2: Judgment Day (1991),01-Jan-1991
138,87,50,5,879876194,Star Wars (1977),01-Jan-1977
53,87,568,5,879875818,Speed (1994),01-Jan-1994
141,87,204,5,879876447,Back to the Future (1985),01-Jan-1985
165,87,496,5,879877709,It's a Wonderful Life (1946),01-Jan-1946


In [11]:
predictions

,MovieID,Title,ReleaseDate
109,168,Monty Python and the Holy Grail (1974),01-Jan-1974
509,663,Being There (1979),01-Jan-1979
143,226,Die Hard 2 (1990),01-Jan-1990
122,191,Amadeus (1984),01-Jan-1984
551,712,Tin Men (1987),01-Jan-1987
502,655,Stand by Me (1986),01-Jan-1986
119,187,"Godfather: Part II, The (1974)",01-Jan-1974
386,520,"Great Escape, The (1963)",01-Jan-1963
682,864,My Fellow Americans (1996),20-Dec-1996
75,117,"Rock, The (1996)",07-Jun-1996
