# Movie Recommender Optimization

author: Ben Sturm <br />
contact: bwsturm@gmail.com <br />
date: 6/21/2018

This notebook is an extension of what I previously did in the Movie_Recommender_matrix_factorization.ipynb notebook.  However, my goal in this notebook is to try and optimize my recommendation model.  In order to do that optimization, I'm going to attempt to do a train/test split on my data, so that I can build me recommender on the train data and score it on the test data. 

### Reading in the data and doing the train/test split

In [1]:
#first step is to import some libraries
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import scipy.sparse

from sklearn.model_selection import train_test_split

Now I'm going to read in the data that I generated in the previous notebook.  This data includes the ratings of Ben (userId=138494), Ruth (userId=138495), and Rom-Com fan (userId=138496).


In [2]:
ratings_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/user_ratings/20Mratings_with_Ben_Ruth_RomcomFan.csv')
movies_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/ml-20m/movies.csv')
movie_mapping_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/user_ratings/movie_mapping.csv')

In [3]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486000.0
1,1,29,3.5,1112485000.0
2,1,32,3.5,1112485000.0
3,1,47,3.5,1112485000.0
4,1,50,3.5,1112485000.0


In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Now that the data has been read in, I'm going to use sklearn's train/test split method.

In [5]:
train, test = train_test_split(ratings_df, test_size=0.2, random_state=7)

In [6]:
train.head()

Unnamed: 0,userId,movieId,rating,timestamp
11278380,78041,2188,3.0,965249900.0
6512657,44818,555,4.0,841663500.0
1036722,7036,53129,4.5,1203873000.0
18009814,124776,1208,4.0,1013812000.0
18276911,126661,4079,2.0,1001536000.0


A few things I need to check however, is that every user in the test data is also in the train data.  Also, I need to check that every movie in the test data is also in the train data.

In [7]:
def find_test_users_not_in_train(train_df,test_df):
    all_train_users = train_df['userId'].unique()
    all_test_users = test_df['userId'].unique()
    
    missing_users = []
    for user in all_test_users:
        if user not in all_train_users:
            missing_users.append(user)
    
    if len(missing_users)>0:
        print('Some users in test set were not in train set.')
    else:
        print('All users accounted for.')
        
    return missing_users
    

In [8]:
missing_users = find_test_users_not_in_train(train,test)

All users accounted for.


In [9]:
def find_test_movies_not_in_train(train_df,test_df):
    all_train_movies = train_df['movieId'].unique()
    all_test_movies = test_df['movieId'].unique()
    
    missing_movies = []
    for movie in all_test_movies:
        if movie not in all_train_movies:
            missing_movies.append(movie)
    
    if len(missing_movies)>0:
        print('Some movies in test set were not in train set.')
    else:
        print('All movies accounted for.')
        
    return missing_movies

In [10]:
missing_movies = find_test_movies_not_in_train(train,test)

All movies accounted for.


Awesome, I don't have to do any special house cleaning to account for any movies or users missing in the training data.

Now I'm going to copy and paste some of the functions from the previous notebook.  A better method would be to put all of these functions in a '.py' file, but I'll have to save that task for later.

In [11]:
def movie_mean_normalization(df):
    mean_rating = df.groupby('movieId')['rating'].mean()
    mean_rating_df = mean_rating.to_frame('rating_movie_mean')
    df2 = pd.merge(df,mean_rating_df,left_on='movieId',right_index=True)
    df2['rating_normalized'] = df2['rating']-df2['rating_movie_mean']
    return df2

In [12]:
train2 = movie_mean_normalization(train)

In [13]:
train2.head()

Unnamed: 0,userId,movieId,rating,timestamp,rating_movie_mean,rating_normalized
11278380,78041,2188,3.0,965249900.0,2.797983,0.202017
6767671,46632,2188,2.0,906105300.0,2.797983,-0.797983
17899552,123981,2188,2.0,1230784000.0,2.797983,-0.797983
18729403,129910,2188,3.0,1023747000.0,2.797983,0.202017
5141332,35246,2188,3.0,1003047000.0,2.797983,0.202017


In [14]:
all_train_movie_ids = train2['movieId'].unique()
all_train_movie_ids.shape

(15451,)

Since this shape is the same as what I calculated in the previous notebook for the full ratings dataset, I'm confident all movies are accounted for.  This means I can use the movie_mapping_df DataFrame that I read in previously.

Now I can merge my train and test data with the movie_mapping_df.

In [15]:
train2 = pd.merge(train2,movie_mapping_df,on='movieId')

In [16]:
test2 = pd.merge(test,movie_mapping_df,on='movieId')

In [17]:
train2.head()

Unnamed: 0,userId,movieId,rating,timestamp,rating_movie_mean,rating_normalized,movie_idx
0,78041,2188,3.0,965249900.0,2.797983,0.202017,2098
1,46632,2188,2.0,906105300.0,2.797983,-0.797983,2098
2,123981,2188,2.0,1230784000.0,2.797983,-0.797983,2098
3,129910,2188,3.0,1023747000.0,2.797983,0.202017,2098
4,35246,2188,3.0,1003047000.0,2.797983,0.202017,2098


In [18]:
test2.head()

Unnamed: 0,userId,movieId,rating,timestamp,movie_idx
0,2155,410,4.0,845245100.0,406
1,58771,410,3.0,836600200.0,406
2,84042,410,4.0,837272000.0,406
3,28876,410,3.0,847178000.0,406
4,24914,410,4.0,1267855000.0,406


In [19]:
train2.sort_values(by=['userId','movieId'],inplace=True)
train2.reset_index(drop=True,inplace=True)

In [20]:
test2.sort_values(by=['userId','movieId'],inplace=True)
test2.reset_index(drop=True,inplace=True)

In [21]:
train2.tail()

Unnamed: 0,userId,movieId,rating,timestamp,rating_movie_mean,rating_normalized,movie_idx
15971910,138496,5299,5.0,,3.463,1.537,5146
15971911,138496,6155,5.0,,3.182139,1.817861,5980
15971912,138496,6942,5.0,,3.793812,1.206188,6723
15971913,138496,58559,1.5,,4.222062,-2.722062,11199
15971914,138496,69757,4.0,,3.791008,0.208992,12108


In [22]:
test2.tail()

Unnamed: 0,userId,movieId,rating,timestamp,movie_idx
3992974,138494,27706,1.0,,8936
3992975,138495,589,0.5,,583
3992976,138496,2571,2.0,,2476
3992977,138496,69406,4.5,,12064
3992978,138496,88163,4.5,,13670


Now I want to get the train2 data into a sparse matrix in order to feed into an SVD model.

In [23]:
# Initialize sparse matrix of ratings
item_user_data = csr_matrix((train2['rating_normalized'].astype(np.double),
                       (train2['userId'], #row_id
                        train2['movie_idx']))) #column_id

In [24]:
item_user_data.shape

(138497, 15451)

In [25]:
print(item_user_data)

  (1, 1)	0.28951354990742306
  (1, 28)	-0.4599616575726295
  (1, 46)	-0.5521891876197662
  (1, 49)	-0.83589866283861
  (1, 110)	0.08826004628655593
  (1, 220)	0.12683454756625467
  (1, 250)	0.503198149156233
  (1, 290)	-0.05423408584439482
  (1, 293)	-0.17691792170076504
  (1, 333)	-0.2575736423286332
  (1, 363)	0.3300751331242249
  (1, 537)	-0.136426819296811
  (1, 583)	-0.4307955170024136
  (1, 644)	-0.21562088635366372
  (1, 901)	-0.47645458908858185
  (1, 906)	-0.45618505690252364
  (1, 989)	0.3586749285033366
  (1, 1016)	0.06698348969642343
  (1, 1056)	0.15714013224821954
  (1, 1057)	-0.4797210497338962
  (1, 1067)	0.08012057750277624
  (1, 1074)	0.24658939768193067
  (1, 1112)	-0.6728927746868019
  (1, 1169)	0.3094321329639893
  (1, 1171)	0.2777247897338073
  :	:
  (138495, 1147)	0.8457345325701828
  (138495, 1592)	0.2321560713084425
  (138495, 1869)	-0.12279393487447177
  (138495, 1872)	-1.22204759766502
  (138495, 1985)	0.7095744680851066
  (138495, 2230)	0.8236225895316807
  (

### Singular Value Decomposition

In [26]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(item_user_data, k = 15)

In [27]:
sigma = np.diag(sigma)

In [28]:
print('The size of U is: {}'.format(U.shape))
print('The size of Vt is: {}'.format(Vt.shape))
print('The size of sigma is: {}'.format(sigma.shape))

The size of U is: (138497, 15)
The size of Vt is: (15, 15451)
The size of sigma is: (15, 15)


In [29]:
def get_mean_movie_rating(df):
    mean_rating = df.groupby('movieId')['rating'].mean()
    return mean_rating

In [30]:
def get_user_rating_normalized_by_movie(df,userId,mean_rating,U,sigma,Vt):
    user_idx = userId-1
    user_predicted_rating = np.dot(np.dot(U[user_idx],sigma),Vt) + mean_rating.values
    return user_predicted_rating

In [31]:
def has_rated_movie(userId, movie_idx):
    mask = ratings_df2['userId']==userId
    mask_rated = ratings_df2.loc[mask,'movie_idx'].isin([movie_idx])
    if sum(mask_rated)>0:
        return True
    else:
        return False 

In [32]:
def get_errors_test(train_df,test_df,U,sigma,Vt,userId_list=None,allusers=False):
    if allusers:
        userId_array = test['userId'].unique()
        userId_list = userId_array.tolist()
    error_list = []
    mean_rating = get_mean_movie_rating(train_df)
    for userId in userId_list:
        user_predicted_rating = get_user_rating_normalized_by_movie(train_df,userId,mean_rating,U,sigma,Vt)
        test_userid_df = test_df[test_df['userId']==userId]
        for index, row in test_userid_df.iterrows():
            movie_idx = int(row['movie_idx'])
            actual_rating = row['rating']
            predicted_rating = user_predicted_rating[movie_idx]
            error = np.abs(predicted_rating-actual_rating)
            error_list.append(error)
    error_array = np.array(error_list)
    num_samples = len(error_array)
    mae = (1/num_samples)*sum(error_array)
    rmse = np.sqrt((1/num_samples*sum(error_array**2)))
    return mae, rmse, num_samples

In [33]:
user_array = np.arange(1,10000)
user_list = user_array.tolist()
get_errors_test(train2,test2,U,sigma,Vt,user_list)

(0.737481311782031, 0.9462910472554306, 295016)

Cool!  Now I have a way to evaluate my model.  Unfortunately, it takes too long to do it for my complete test set.  I'm not sure where the bottleneck is, but I will wait to evaluate that for later.

My next step is to go through different values of k for my SVD and generate a score of my model for each value of k.

First starting with k=5.

In [None]:
def get_recommender_score(k):
    U, sigma, Vt = svds(item_user_data, k = k)
    sigma = np.diag(sigma)
    user_array = np.arange(1,10000)
    user_list = user_array.tolist()
    return get_errors_test(train2,test2,U,sigma,Vt,user_list)

In [None]:
(mae_k5,rmse_k5,num_samples_k5) = get_recommender_score(5)
print(mae_k5,rmse_k5,num_samples_k5)

In [None]:
#now let's try k=10
(mae_k10,rmse_k10,num_samples_k10) = get_recommender_score(10)
print(mae_k10,rmse_k10,num_samples_k10)

In [None]:
#now let's try k=20
(mae_k20,rmse_k20,num_samples_k20) = get_recommender_score(20)
print(mae_k20,rmse_k20,num_samples_k20)

In [None]:
#now let's try k=50
(mae_k50,rmse_k50,num_samples_k50) = get_recommender_score(50)
print(mae_k50,rmse_k50,num_samples_k50)

In [None]:
#now let's try k=100
(mae_k100,rmse_k100,num_samples_k100) = get_recommender_score(100)
print(mae_k100,rmse_k100,num_samples_k100)

So far we have tested k=5,10,20,50,&100.  The best mae and rmse scores were for k=5.  Now I'm going to try a few values below and above that value.

In [None]:
(mae_k3,rmse_k3,num_samples_k3) = get_recommender_score(3)
print(mae_k3,rmse_k3,num_samples_k3)

In [None]:
(mae_k1,rmse_k1,num_samples_k1) = get_recommender_score(1)
print(mae_k1,rmse_k1,num_samples_k1)

In [None]:
(mae_k2,rmse_k2,num_samples_k2) = get_recommender_score(2)
print(mae_k2,rmse_k2,num_samples_k2)

In [None]:
(mae_k15,rmse_k15,num_samples_k15) = get_recommender_score(15)
print(mae_k15,rmse_k15,num_samples_k15)

This is surprising to me, but the best result is with k=1.  The next thing I'm going to do the same analysis with the training data.

In [None]:
def get_errors_train(train_df,test_df,U,sigma,Vt,userId_list=None,allusers=False):
    if allusers:
        userId_array = test['userId'].unique()
        userId_list = userId_array.tolist()
    error_list = []
    mean_rating = get_mean_movie_rating(train_df)
    for userId in userId_list:
        user_predicted_rating = get_user_rating_normalized_by_movie(train_df,userId,mean_rating,U,sigma,Vt)
        train_userid_df = train_df[train_df['userId']==userId]
        for index, row in train_userid_df.iterrows():
            movie_idx = int(row['movie_idx'])
            actual_rating = row['rating']
            predicted_rating = user_predicted_rating[movie_idx]
            error = np.abs(predicted_rating-actual_rating)
            error_list.append(error)
    error_array = np.array(error_list)
    num_samples = len(error_array)
    mae = (1/num_samples)*sum(error_array)
    rmse = np.sqrt((1/num_samples*sum(error_array**2)))
    return mae, rmse, num_samples

In [None]:
def get_recommender_score_train(k):
    U, sigma, Vt = svds(item_user_data, k = k)
    sigma = np.diag(sigma)
    user_array = np.arange(1,2500)
    user_list = user_array.tolist()
    return get_errors_train(train2,test2,U,sigma,Vt,user_list)

In [None]:
(mae_k1_train,rmse_k1_train,num_samples_k1_train) = get_recommender_score_train(1)
print(mae_k1_train,rmse_k1_train,num_samples_k1_train)

In [None]:
(mae_k2_train,rmse_k2_train,num_samples_k2_train) = get_recommender_score_train(2)
print(mae_k2_train,rmse_k2_train,num_samples_k2_train)

In [None]:
(mae_k3_train,rmse_k3_train,num_samples_k3_train) = get_recommender_score_train(3)
print(mae_k3_train,rmse_k3_train,num_samples_k3_train)

In [None]:
(mae_k4_train,rmse_k4_train,num_samples_k4_train) = get_recommender_score_train(4)
print(mae_k4_train,rmse_k4_train,num_samples_k4_train)

In [None]:
(mae_k5_train,rmse_k5_train,num_samples_k5_train) = get_recommender_score_train(5)
print(mae_k5_train,rmse_k5_train,num_samples_k5_train)

In [None]:
(mae_k10_train,rmse_k10_train,num_samples_k10_train) = get_recommender_score_train(10)
print(mae_k10_train,rmse_k10_train,num_samples_k10_train)

In [None]:
(mae_k100_train,rmse_k100_train,num_samples_k100_train) = get_recommender_score_train(100)
print(mae_k100_train,rmse_k100_train,num_samples_k100_train)

In [None]:
(mae_k1000_train,rmse_k1000_train,num_samples_k1000_train) = get_recommender_score_train(1000)
print(mae_k1000_train,rmse_k1000_train,num_samples_k1000_train)

The above results are also surprising to me, because I would expect that the error would go down with more components (increasing k).  However, the opposite was true.  For now, I'm going to use k=15 as my optimal, which is based on some results I found in a paper I read.

In [34]:
def recommend_movies3(movies_df, ratings_df, userId, U, sigma, Vt, num_recommendations=5):
    
    mean_rating = get_mean_movie_rating(ratings_df)
    # get user_predictions
    user_predictions = get_user_rating_normalized_by_movie(ratings_df,userId,mean_rating, U, sigma, Vt)
    
    # Sort my predictions from highest to lowest
    pred_idxs_sorted = np.argsort(user_predictions)
    pred_idxs_sorted[:] = pred_idxs_sorted[::-1]
    
    unique_rated_movie_ids = ratings_df['movieId'].unique()
    nm = unique_rated_movie_ids.shape[0] #get num_movies
    print("Top recommendations for UserId: {}".format(userId))
    i=0; j=0
    while i < num_recommendations:
        movie_idx = pred_idxs_sorted[j]
        if not has_rated_movie(userId, movie_idx):
            movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==movie_idx,'movieId']
            movieTitle = movies_df.loc[movies_df['movieId']==movieId.values[0],'title']
            print('Predicting rating {0:.1f} for movie {1}'.format(\
                    user_predictions[pred_idxs_sorted[i]],movieTitle.values[0]))
            i=i+1
        j=j+1
        
    nm_rated = sum(ratings_df['userId'] == userId)
    num_to_return = min(20,nm_rated)
    movieId = ratings_df.loc[ratings_df['userId'] == userId,'movieId']
    movieId_array = movieId.sample(num_to_return).values
    user_ratings_df = ratings_df[ratings_df['userId']==userId]
    print("\nA subset of original ratings provided for UserId: {}".format(userId))
    for i in range(num_to_return):
        movieTitle = movies_df.loc[movies_df['movieId']==movieId_array[i],'title']
        rating = user_ratings_df.loc[user_ratings_df['movieId']==movieId_array[i],'rating']
        print('Rated {0:.1f} for movie {1}'.format(rating.values[0],movieTitle.values[0]))    

In [35]:
def get_recommended_movies(movies_df,ratings_df,userId):
    U, sigma, Vt = svds(item_user_data2, k = 15)
    sigma = np.diag(sigma)
    recommend_movies3(movies_df,ratings_df,userId,U,sigma,Vt,20)

In [36]:
ratings_df2 = movie_mean_normalization(ratings_df)

In [37]:
ratings_df2 = pd.merge(ratings_df2,movie_mapping_df,on='movieId')
ratings_df2.sort_values(by=['userId','movieId'],inplace=True)
ratings_df2.reset_index(drop=True,inplace=True)

In [38]:
ratings_df2.head()

Unnamed: 0,userId,movieId,rating,timestamp,rating_movie_mean,rating_normalized,movie_idx
0,1,2,3.5,1112486000.0,3.211977,0.288023,1
1,1,29,3.5,1112485000.0,3.95223,-0.45223,28
2,1,32,3.5,1112485000.0,3.898055,-0.398055,31
3,1,47,3.5,1112485000.0,4.05348,-0.55348,46
4,1,50,3.5,1112485000.0,4.334372,-0.834372,49


In [40]:
# Initialize sparse matrix of ratings
item_user_data2 = csr_matrix((ratings_df2['rating_normalized'].astype(np.double),
                       (ratings_df2['userId'], #row_id
                        ratings_df2['movie_idx']))) #column_id

In [41]:
get_recommended_movies(movies_df,ratings_df2,138495)

Top recommendations for UserId: 138495
Predicting rating 4.5 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.5 for movie Shawshank Redemption, The (1994)
Predicting rating 4.3 for movie Usual Suspects, The (1995)
Predicting rating 4.3 for movie Schindler's List (1993)
Predicting rating 4.3 for movie Death on the Staircase (Soupçons) (2004)
Predicting rating 4.3 for movie O Auto da Compadecida (Dog's Will, A) (2000)
Predicting rating 4.3 for movie Band of Brothers (2001)
Predicting rating 4.3 for movie Seven Samurai (Shichinin no samurai) (1954)
Predicting rating 4.3 for movie Fight Club (1999)
Predicting rating 4.3 for movie Godfather, The (1972)
Predicting rating 4.2 for movie The War (2007)
Predicting rating 4.2 for movie Rear Window (1954)
Predicting rating 4.2 for movie Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
Predicting rating 4.2 for movie Princess Bride, The (1987)
Predicting rating 4.2 for movie Matrix, The (1999)
Predicting rating 4.2 for movie

In [42]:
get_recommended_movies(movies_df,ratings_df2,138494)

Top recommendations for UserId: 138494
Predicting rating 5.1 for movie Forrest Gump (1994)
Predicting rating 4.9 for movie Fight Club (1999)
Predicting rating 4.6 for movie American Beauty (1999)
Predicting rating 4.6 for movie Shawshank Redemption, The (1994)
Predicting rating 4.6 for movie Seven (a.k.a. Se7en) (1995)
Predicting rating 4.6 for movie Matrix, The (1999)
Predicting rating 4.5 for movie Memento (2000)
Predicting rating 4.5 for movie Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
Predicting rating 4.5 for movie Pulp Fiction (1994)
Predicting rating 4.5 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.5 for movie Clockwork Orange, A (1971)
Predicting rating 4.5 for movie Eternal Sunshine of the Spotless Mind (2004)
Predicting rating 4.5 for movie American History X (1998)
Predicting rating 4.4 for movie Monty Python and the Holy Grail (1975)
Predicting rating 4.4 for movie Lion King, The (1994)
Predicting rating 4.4 for movie Donnie Darko (2001)
Predic

In [43]:
get_recommended_movies(movies_df,ratings_df2,138496)

Top recommendations for UserId: 138496
Predicting rating 4.5 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.4 for movie Shawshank Redemption, The (1994)
Predicting rating 4.4 for movie Usual Suspects, The (1995)
Predicting rating 4.3 for movie Schindler's List (1993)
Predicting rating 4.3 for movie Death on the Staircase (Soupçons) (2004)
Predicting rating 4.3 for movie Godfather, The (1972)
Predicting rating 4.3 for movie Rear Window (1954)
Predicting rating 4.3 for movie O Auto da Compadecida (Dog's Will, A) (2000)
Predicting rating 4.3 for movie American Beauty (1999)
Predicting rating 4.3 for movie Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Predicting rating 4.3 for movie Seven Samurai (Shichinin no samurai) (1954)
Predicting rating 4.3 for movie Band of Brothers (2001)
Predicting rating 4.3 for movie Casablanca (1942)
Predicting rating 4.3 for movie Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
Predicting rating 4.2 for movie The War (2007)
P

In [52]:
get_recommended_movies(movies_df,ratings_df2,9)

Top recommendations for UserId: 9
Predicting rating 5.0 for movie Shawshank Redemption, The (1994)
Predicting rating 5.0 for movie Silence of the Lambs, The (1991)
Predicting rating 4.9 for movie Pulp Fiction (1994)
Predicting rating 4.9 for movie Braveheart (1995)
Predicting rating 4.8 for movie Schindler's List (1993)
Predicting rating 4.7 for movie Usual Suspects, The (1995)
Predicting rating 4.6 for movie Seven (a.k.a. Se7en) (1995)
Predicting rating 4.6 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.5 for movie Fugitive, The (1993)
Predicting rating 4.5 for movie Terminator 2: Judgment Day (1991)
Predicting rating 4.5 for movie Godfather: Part II, The (1974)
Predicting rating 4.4 for movie Saving Private Ryan (1998)
Predicting rating 4.4 for movie Dances with Wolves (1990)
Predicting rating 4.4 for movie Matrix, The (1999)
Predicting rating 4.4 for movie American Beauty (1999)
Predicting rating 4.4 for movie One Flew Over the Cuckoo's Nest (1975)
Predi

In [53]:
U, sigma, Vt = svds(item_user_data, k = 15)
sigma = np.diag(sigma)

In [None]:
get_errors_test(train2,test2,U,sigma,Vt,[138495])

In [54]:
mean_rating = get_mean_movie_rating(train2)
user_predicted_rating = get_user_rating_normalized_by_movie(train2,138495,mean_rating,U,sigma,Vt)

In [55]:
user_predicted_rating[:10]

array([3.95337335, 3.21117353, 3.12863093, 2.87380463, 3.05554528,
       3.82488724, 3.38346138, 3.13445603, 2.99907695, 3.39686622])

In [56]:
test2.tail()

Unnamed: 0,userId,movieId,rating,timestamp,movie_idx
3992974,138494,27706,1.0,,8936
3992975,138495,589,0.5,,583
3992976,138496,2571,2.0,,2476
3992977,138496,69406,4.5,,12064
3992978,138496,88163,4.5,,13670


In [57]:
user_predicted_rating[57], user_predicted_rating[583], user_predicted_rating[1985]

(3.976037464735463, 3.9166653012669257, 3.789613867151849)

In [None]:
1/3*(np.abs(3.9858288202033534-4.5)+np.abs(3.9009701420966683-.5)+np.abs(3.7459672239379023-4.5))

In [58]:
def get_group_rating(user_predictions1, user_predictions2):
    '''
    Takes the predictions from two users and returns the average minus a penalty term based on the absolute value
    of the difference in the predicted score.  I divided this penalty term by 5, which was arbitrarily chosen.
    '''
    group_prediction = (user_predictions1+user_predictions2)/2 - np.abs(user_predictions1-user_predictions2)/5
    return group_prediction

In [59]:
def recommender_2users(movies_df, ratings_df, userId1, userId2, U, sigma, Vt, num_recommendations=5):
    
    mean_rating = get_mean_movie_rating(ratings_df)
    #get user predictions
    user_predictions1 = get_user_rating_normalized_by_movie(ratings_df, userId1, mean_rating, U, sigma, Vt)
    user_predictions2 = get_user_rating_normalized_by_movie(ratings_df, userId2, mean_rating, U, sigma, Vt)
    
    #get the weighted average prediction
    group_predictions = get_group_rating(user_predictions1, user_predictions2)
    
    # Sort my predictions from highest to lowest
    pred_idxs_sorted = np.argsort(group_predictions)
    pred_idxs_sorted[:] = pred_idxs_sorted[::-1]
    
    unique_rated_movie_ids = ratings_df['movieId'].unique()
    nm = unique_rated_movie_ids.shape[0] #get num_movies
    print("Top combinded recommendations for UserIds: {} and {}".format(userId1,userId2))
    i=0; j=0
    while i < num_recommendations:
        movie_idx = pred_idxs_sorted[j]
        if not (has_rated_movie(userId1, movie_idx) or has_rated_movie(userId2, movie_idx)):
            movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==movie_idx,'movieId']
            movieTitle = movies_df.loc[movies_df['movieId']==movieId.values[0],'title']
            print('Predicting rating {0:.1f} for movie {1}'.format(\
                    group_predictions[pred_idxs_sorted[i]],movieTitle.values[0]))
            i=i+1
        j=j+1

In [61]:
U, sigma, Vt = svds(item_user_data2, k = 15)
sigma = np.diag(sigma)
recommender_2users(movies_df, ratings_df2, 138494, 138495, U, sigma, Vt, 20)

Top combinded recommendations for UserIds: 138494 and 138495
Predicting rating 4.5 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.5 for movie Shawshank Redemption, The (1994)
Predicting rating 4.5 for movie Fight Club (1999)
Predicting rating 4.4 for movie Forrest Gump (1994)
Predicting rating 4.3 for movie Usual Suspects, The (1995)
Predicting rating 4.3 for movie Matrix, The (1999)
Predicting rating 4.3 for movie Schindler's List (1993)
Predicting rating 4.3 for movie American Beauty (1999)
Predicting rating 4.3 for movie Memento (2000)
Predicting rating 4.3 for movie Death on the Staircase (Soupçons) (2004)
Predicting rating 4.3 for movie City of God (Cidade de Deus) (2002)
Predicting rating 4.3 for movie Star Wars: Episode V - The Empire Strikes Back (1980)
Predicting rating 4.3 for movie Seven Samurai (Shichinin no samurai) (1954)
Predicting rating 4.3 for movie Spirited Away (Sen to Chihiro no kamikakushi) (2001)
Predicting rating 4.3 for movie Dark K

In [None]:
get_recommended_movies(movies_df,ratings_df2,9)

In [None]:
get_recommended_movies(movies_df,ratings_df2,138207)

In [None]:
#ratings_df2[ratings_df2['movieId']==5299]

Now I want to build a recommender that allows the user to filter by genre.

In [None]:
def recommend_movies4(movies_df, ratings_df, userId, U, sigma, Vt, genres_list=[], num_recommendations=5):
    
    mean_rating = get_mean_movie_rating(ratings_df)
    # get user_predictions
    user_predictions = get_user_rating_normalized_by_movie(ratings_df,userId,mean_rating, U, sigma, Vt)
    
    # Sort my predictions from highest to lowest
    pred_idxs_sorted = np.argsort(user_predictions)
    pred_idxs_sorted[:] = pred_idxs_sorted[::-1]
    
    unique_rated_movie_ids = ratings_df['movieId'].unique()
    nm = unique_rated_movie_ids.shape[0] #get num_movies
    print("Top recommendations for UserId: {}".format(userId))
    recommended_movies_df = pd.DataFrame(columns=['predicted rating','title','genres'])
    i=0; j=0
    while i < num_recommendations:
        movie_idx = pred_idxs_sorted[j]
        if not has_rated_movie(userId, movie_idx):
            movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==movie_idx,'movieId']
            movieTitle = movies_df.loc[movies_df['movieId']==movieId.values[0],'title']
            movieGenres = movies_df.loc[movies_df['movieId']==movieId.values[0],'genres'].values[0]
            if movie_contains_genre(movieGenres,genres_list):
                recommended_movies_df.loc[i] = [user_predictions[pred_idxs_sorted[i]], movieTitle.values[0], movieGenres] 
                i=i+1
        j=j+1
    print(recommended_movies_df)
        
    nm_rated = sum(ratings_df['userId'] == userId)
    num_to_return = min(20,nm_rated)
    movieId = ratings_df.loc[ratings_df['userId'] == userId,'movieId']
    movieId_array = movieId.sample(num_to_return).values
    user_ratings_df = ratings_df[ratings_df['userId']==userId]
    print("\nA subset of original ratings provided for UserId: {}".format(userId))
    for i in range(num_to_return):
        movieTitle = movies_df.loc[movies_df['movieId']==movieId_array[i],'title']
        rating = user_ratings_df.loc[user_ratings_df['movieId']==movieId_array[i],'rating']
        print('Rated {0:.1f} for movie {1}'.format(rating.values[0],movieTitle.values[0]))    

In [None]:
def movie_contains_genre(movieGenre,genres_list):
    movieGenre = movieGenre.lower()
    genres_list = [val.lower() for val in genres_list]
    if '(no genres listed)' in movieGenre:
        return True  # returning true here because we can't make any assumptions here
    else:
        movieGenre_list = movieGenre.split('|')
        for genre in movieGenre_list:
            if genre in genres_list:
                return True
        return False

In [None]:
def get_recommended_movies2(movies_df,ratings_df,userId,genres_list=[]):
    U, sigma, Vt = svds(item_user_data2, k = 15)
    sigma = np.diag(sigma)
    recommend_movies4(movies_df,ratings_df,userId,U,sigma,Vt,genres_list,20)

In [None]:
movie_contains_genre('(no genres listed)',['Drama','Action','fantasy'])

In [None]:
movieGenre = movies_df.loc[movies_df['movieId']==1,'genres']

In [None]:
movieGenre[0]

In [None]:
get_recommended_movies2(movies_df,ratings_df2,138494,genres_list=['comedy'])

In [None]:
movieGenres = movies_df.loc[movies_df['movieId']==125878,'genres'].values[0]

In [None]:
movieGenres

In [None]:
get_recommended_movies(movies_df,ratings_df2,9)

In [None]:
get_recommended_movies(movies_df,ratings_df2,138207)