# Movie Recommender Modeling

author: Ben Sturm <br />
contact: bwsturm@gmail.com <br />
date: 6/18/2018

In this notebook, I'm going to implement Nick Becker's Matrix factorization method.  The major difference is that I'm going to try to implement it using sparse matrices.

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import scipy.sparse

First I'm going to load in the MovieLens 20M Dataset.

In [2]:
ratings_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/ml-20M/ratings.csv')
movies_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/ml-20M/movies.csv')
movies_df['movieId'] = movies_df['movieId'].apply(pd.to_numeric)

In [92]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [93]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


Now I want to calculate the mean rating for each user and subtract that value from each user's rating.  Then, I'll assign that to a new column.

In [41]:
def user_mean_normalization(df):
    mean_rating = ratings_df.groupby('userId')['rating'].mean()
    mean_rating_df = mean_rating.to_frame('rating_mean')
    df2 = pd.merge(df,mean_rating_df,left_on='userId',right_index=True)
    df2['rating_normalized'] = df2['rating']-df2['rating_mean']
    return df2

In [42]:
ratings_df2 = user_mean_normalization(ratings_df)

In [98]:
ratings_df2.tail(100)

Unnamed: 0,userId,movieId,rating,timestamp,rating_mean,rating_normalized
20000163,138493,7347,4.5,1255810758,4.172922,0.327078
20000164,138493,7361,5.0,1255807607,4.172922,0.827078
20000165,138493,7371,5.0,1256288607,4.172922,0.827078
20000166,138493,7416,4.5,1255817853,4.172922,0.327078
20000167,138493,7438,5.0,1255806506,4.172922,0.827078
20000168,138493,7577,4.0,1258134201,4.172922,-0.172922
20000169,138493,8254,5.0,1255805322,4.172922,0.827078
20000170,138493,8360,4.5,1256750489,4.172922,0.327078
20000171,138493,8371,1.0,1260209497,4.172922,-3.172922
20000172,138493,8529,4.5,1255817122,4.172922,0.327078


In [95]:
ratings_df2[ratings_df2['userId']==3].head()

Unnamed: 0,userId,movieId,rating,timestamp,rating_mean,rating_normalized
236,3,1,4.0,944919407,4.122995,-0.122995
237,3,24,3.0,945176048,4.122995,-1.122995
238,3,32,4.0,944918047,4.122995,-0.122995
239,3,50,5.0,944918018,4.122995,0.877005
240,3,160,3.0,945176048,4.122995,-1.122995


In [75]:
unique_rated_movie_ids = ratings_df2['movieId'].unique()
unique_rated_movie_ids.shape

(26744,)

In [100]:
unique_rated_movie_ids.sort()

array([     1,      2,      3, ..., 131258, 131260, 131262])

In [101]:
movie_mapping_df = pd.DataFrame(unique_rated_movie_ids,columns=['movieId'])
movie_mapping_df.head()

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5


In [105]:
movie_mapping_df.reset_index(inplace=True)
movie_mapping_df.head()

Unnamed: 0,index,movieId
0,0,1
1,1,2
2,2,3
3,3,4
4,4,5


In [107]:
movie_mapping_df.rename(columns={'index':'movie_idx'},inplace=True)

In [109]:
movie_mapping_df.tail()

Unnamed: 0,movie_idx,movieId
26739,26739,131254
26740,26740,131256
26741,26741,131258
26742,26742,131260
26743,26743,131262


In [None]:
pd.find()

Now I need to merge my movie_mapping_df with my ratings_df2 

In [103]:
#ratings_df2['Movie_idx'] = ratings_df2['movieId'].apply(lambda x: movie_mapping_df[movie_mapping_df['movieId']==x])

In [110]:
ratings_df2 = pd.merge(ratings_df2,movie_mapping_df,on='movieId')

In [129]:
ratings_df2.sort_values(by=['userId','movieId'],inplace=True)

In [131]:
ratings_df2.reset_index(drop=True,inplace=True)

In [132]:
ratings_df2.head()

Unnamed: 0,userId,movieId,rating,timestamp,rating_mean,rating_normalized,movie_idx
0,1,2,3.5,1112486027,3.742857,-0.242857,1
1,1,29,3.5,1112484676,3.742857,-0.242857,28
2,1,32,3.5,1112484819,3.742857,-0.242857,31
3,1,47,3.5,1112484727,3.742857,-0.242857,46
4,1,50,3.5,1112484580,3.742857,-0.242857,49


Now I want to get the ratings data into a sparse matrix

In [133]:
# Initialize sparse matrix of ratings
item_user_data = csr_matrix((ratings_df2['rating_normalized'].astype(np.double),
                       (ratings_df2['userId'], #row_id
                        ratings_df2['movie_idx']))) #column_id

#print(item_user_data)

In [134]:
print(item_user_data)

  (1, 1)	-0.24285714285714288
  (1, 28)	-0.24285714285714288
  (1, 31)	-0.24285714285714288
  (1, 46)	-0.24285714285714288
  (1, 49)	-0.24285714285714288
  (1, 110)	-0.24285714285714288
  (1, 149)	0.2571428571428571
  (1, 220)	0.2571428571428571
  (1, 250)	0.2571428571428571
  (1, 257)	0.2571428571428571
  (1, 290)	0.2571428571428571
  (1, 293)	0.2571428571428571
  (1, 315)	0.2571428571428571
  (1, 333)	-0.24285714285714288
  (1, 363)	-0.24285714285714288
  (1, 537)	0.2571428571428571
  (1, 583)	-0.24285714285714288
  (1, 587)	-0.24285714285714288
  (1, 645)	-0.7428571428571429
  (1, 902)	-0.24285714285714288
  (1, 907)	-0.24285714285714288
  (1, 990)	-0.24285714285714288
  (1, 1017)	0.2571428571428571
  (1, 1057)	0.2571428571428571
  (1, 1058)	-0.24285714285714288
  :	:
  (138493, 11848)	-0.1729222520107241
  (138493, 11862)	-0.1729222520107241
  (138493, 11863)	-1.172922252010724
  (138493, 11882)	-0.1729222520107241
  (138493, 11901)	-0.1729222520107241
  (138493, 11962)	0.327077747

### Singular Value Decomposition

Now I'm going to use scipy's SVD method.  What's great is that it can do this operation on a sparse matrix. 

In [135]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(item_user_data, k = 50)

In [136]:
sigma = np.diag(sigma)

In [137]:
print('The size of U is: {}'.format(U.shape))
print('The size of Vt is {}'.format(Vt.shape))

The size of U is: (138494, 50)
The size of Vt is (50, 26744)


In [138]:
def get_user_rating(userId):
    user_idx = userId-1  
    user_mean_rating =  ratings_df2.loc[ratings_df2['userId']==userId,'rating_mean'].get_values()[0]
    user_predicted_rating = np.dot(np.dot(U[user_idx],sigma),Vt) + user_mean_rating
    return user_predicted_rating

In [139]:
user_predictions = get_user_rating(2)
user_predictions.shape

(26744,)

In [140]:
np.argsort(user_predictions)

array([  31,   46,  587, ..., 7041, 5853, 4897])

In [None]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False) # UserID starts at 1
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.userId == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

In [155]:
def has_rated_movie(userId, movie_idx):
    mask = ratings_df['userId']==userId
    mask_rated = ratings_df2.loc[mask,'movie_idx'].isin([movie_idx])
    if sum(mask_rated)>0:
        return True
    else:
        return False 

In [187]:
def recommend_movies2(movies_df, ratings_df, userId, num_recommendations=5):
    
    # get user_predictions
    user_predictions = get_user_rating(userId)
    
    # Sort my predictions from highest to lowest
    pred_idxs_sorted = np.argsort(user_predictions)
    pred_idxs_sorted[:] = pred_idxs_sorted[::-1]
    
    nm = unique_rated_movie_ids.shape[0] #get num_movies
    print("Top recommendations for UserId: {}".format(userId))
    i=0; j=0
    while i < num_recommendations:
        movie_idx = pred_idxs_sorted[j]
        if not has_rated_movie(userId, movie_idx):
            movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==movie_idx,'movieId']
            movieTitle = movies_df.loc[movies_df['movieId']==movieId.values[0],'title']
            print('Predicting rating {0:.1f} for movie {1}'.format(\
                    user_predictions[pred_idxs_sorted[i]],movieTitle.values[0]))
            i=i+1
        j=j+1


In [191]:
pred_idxs_sorted = recommend_movies2(movies_df,ratings_df2,11,20)

Top recommendations for UserId: 11
Predicting rating 5.0 for movie Godfather: Part II, The (1974)
Predicting rating 4.9 for movie One Flew Over the Cuckoo's Nest (1975)
Predicting rating 4.7 for movie Casablanca (1942)
Predicting rating 4.2 for movie Apocalypse Now (1979)
Predicting rating 4.1 for movie Citizen Kane (1941)
Predicting rating 4.1 for movie Babe (1995)
Predicting rating 4.1 for movie Taxi Driver (1976)
Predicting rating 4.1 for movie Rocky (1976)
Predicting rating 4.1 for movie Raging Bull (1980)
Predicting rating 4.1 for movie Kill Bill: Vol. 2 (2004)
Predicting rating 4.1 for movie Annie Hall (1977)
Predicting rating 4.0 for movie Dead Man Walking (1995)
Predicting rating 4.0 for movie Wizard of Oz, The (1939)
Predicting rating 4.0 for movie To Kill a Mockingbird (1962)
Predicting rating 4.0 for movie Sense and Sensibility (1995)
Predicting rating 4.0 for movie Nutty Professor, The (1996)
Predicting rating 4.0 for movie L.A. Confidential (1997)
Predicting rating 4.0 for

In [None]:
print("Top recommendations for User1:")
for i in range(10):
    print('Predicting rating {0:.1f} for movie {1}.'.format(\
    my_predictions[pred_idxs_sorted[i]],movies.loc[pred_idxs_sorted[i],'title']))
    
print("\nOriginal ratings provided:")
for i in range(len(Y[:,1])):
    if Y[i,1] > 0:
        print('Rated {0:.1f} for movie {1}.'.format(Y[i][1],movies.loc[i,'title']))

In [145]:
#has_rated_movie(1,28)

In [153]:
mask = ratings_df['userId']==1
movie_rated = ratings_df2.loc[mask,'movie_idx'].isin([0])
sum(movie_rated)

0

In [170]:
movieTitle = movies_df.loc[movies_df['movieId']==5,'title']

In [182]:
movieTitle.values[0]

'Father of the Bride Part II (1995)'

In [174]:
movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==26742,'movieId']

In [179]:
movieId.values[0]

131260