# Movie Recommender Modeling

author: Ben Sturm <br />
contact: bwsturm@gmail.com <br />
date: 6/18/2018

In this notebook, I'm going to implement Nick Becker's Matrix factorization method.  The major difference is that I'm going to try to implement it using sparse matrices.

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import scipy.sparse

First I'm going to load in the MovieLens 20M Dataset.

In [2]:
ratings_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/ml-20M/ratings.csv')
movies_df = pd.read_csv('/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/ml-20M/movies.csv')
movies_df['movieId'] = movies_df['movieId'].apply(pd.to_numeric)

In [3]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [5]:
ratings_df.movieId.dtype

dtype('int64')

Now I'm going to write a function that allows me to append another user's ratings.

In [6]:
import csv

def read_rating_data(file_path):
    with open(file_path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        my_ratings_dict = {}
        firstline=True
        for row in reader:
            my_ratings_dict[str(row['movieId'])]=float(row['rating'])
            
    return my_ratings_dict

In [7]:
#Add Ben's ratings
file_path= '/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/user_ratings/Ben_movie_ratings.csv'
my_ratings_dict1 = read_rating_data(file_path)

In [8]:
#Add Ruth's ratings
file_path= '/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/user_ratings/Ruth_movie_ratings.csv'
my_ratings_dict2 = read_rating_data(file_path)

In [9]:
def add_user_rating(df,my_ratings_dict):
    last_user_id = max(df['userId'])
    new_user_id = last_user_id + 1
    for key, value in my_ratings_dict.items():
        df = df.append({'userId':new_user_id,'movieId':int(key),'rating':value},ignore_index=True)
    
    df = df.astype({'userId':int,'movieId':int})
    return df

In [10]:
#ratings_df['userId']
ratings_df2 = add_user_rating(ratings_df,my_ratings_dict1)
ratings_df2 = add_user_rating(ratings_df2,my_ratings_dict2)

In [11]:
ratings_df2.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486000.0
1,1,29,3.5,1112485000.0
2,1,32,3.5,1112485000.0
3,1,47,3.5,1112485000.0
4,1,50,3.5,1112485000.0


In [12]:
ratings_df2.index[-1]

20000306

In [13]:
ratings_df2.tail()

Unnamed: 0,userId,movieId,rating,timestamp
20000302,138495,6942,4.0,
20000303,138495,2075,4.5,
20000304,138495,3556,4.5,
20000305,138495,1172,5.0,
20000306,138495,2324,5.0,


Now I want to filter out movies that don't have many ratings.  If I don't do this, then I tend to recommend films that have only a few very high ratings.

In [14]:
# first I want to see what percentage of films have more than 5 ratings
num_ratings_S = ratings_df2.groupby('movieId')['rating'].count()
nmr = num_ratings_S.shape[0]  #nmr stands for number of movies rated
nmr_5 = sum(num_ratings_S >=5) #numr_5 stands for number of movies rated with 5 or more ratings
nmr_10 = sum(num_ratings_S >=10) #numr_10 stands for number of movies rated with 10 or more ratings
nmr_20 = sum(num_ratings_S >=20) #numr_20 stands for number of movies rated with 20 or more ratings

In [15]:
print('The fraction of movies rated with 5 or more ratings is {:.3f}'.format(nmr_5/nmr))
print('The fraction of movies rated with 10 or more ratings is {:.3f}'.format(nmr_10/nmr))
print('The fraction of movies rated with 20 or more ratings is {:.3f}'.format(nmr_20/nmr))

The fraction of movies rated with 5 or more ratings is 0.686
The fraction of movies rated with 10 or more ratings is 0.578
The fraction of movies rated with 20 or more ratings is 0.491


In [16]:
def filter_rated_movies(df, min_ratings=10):
    df2 = df.groupby('movieId').filter(lambda row: len(row) >= min_ratings)
    df2.reset_index(drop=True, inplace=True)
    return df2

In [17]:
ratings_df3 = filter_rated_movies(ratings_df2,10)

In [18]:
ratings_df3.shape

(19964877, 4)

In [19]:
ratings_df3.tail()

Unnamed: 0,userId,movieId,rating,timestamp
19964872,138495,6942,4.0,
19964873,138495,2075,4.5,
19964874,138495,3556,4.5,
19964875,138495,1172,5.0,
19964876,138495,2324,5.0,


Now I want to calculate the mean rating for each user and subtract that value from each user's rating.  Then, I'll assign that to a new column.

In [20]:
# This function does the mean normalization per user
def user_mean_normalization(df):
    mean_rating = df.groupby('userId')['rating'].mean()
    mean_rating_df = mean_rating.to_frame('rating_mean')
    df2 = pd.merge(df,mean_rating_df,left_on='userId',right_index=True)
    df2['rating_normalized'] = df2['rating']-df2['rating_mean']
    return df2

In [21]:
# This function does the mean normalization per movie
def movie_mean_normalization(df):
    mean_rating = df.groupby('movieId')['rating'].mean()
    mean_rating_df = mean_rating.to_frame('rating_movie_mean')
    df2 = pd.merge(df,mean_rating_df,left_on='movieId',right_index=True)
    df2['rating_normalized'] = df2['rating']-df2['rating_movie_mean']
    return df2

In [22]:
ratings_df3 = movie_mean_normalization(ratings_df3)

In [23]:
ratings_df3.head()

Unnamed: 0,userId,movieId,rating,timestamp,rating_movie_mean,rating_normalized
0,1,2,3.5,1112486000.0,3.211977,0.288023
451,5,2,3.0,851527600.0,3.211977,-0.211977
1500,13,2,3.0,849082700.0,3.211977,-0.211977
3325,29,2,3.0,835562200.0,3.211977,-0.211977
3903,34,2,3.0,846509400.0,3.211977,-0.211977


Next, I will find the number of unique movies after the filtering step I did previously.

In [24]:
unique_rated_movie_ids = ratings_df3['movieId'].unique()
unique_rated_movie_ids.shape

(15451,)

In [25]:
unique_rated_movie_ids.sort()

In [26]:
movie_mapping_df = pd.DataFrame(unique_rated_movie_ids,columns=['movieId'])
movie_mapping_df.head()

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5


Now, I'm going to map each movie to a unique index.

In [27]:
movie_mapping_df.reset_index(inplace=True)
movie_mapping_df.head()

Unnamed: 0,index,movieId
0,0,1
1,1,2
2,2,3
3,3,4
4,4,5


In [28]:
movie_mapping_df.rename(columns={'index':'movie_idx'},inplace=True)

In [29]:
movie_mapping_df.tail()

Unnamed: 0,movie_idx,movieId
15446,15446,129428
15447,15447,129937
15448,15448,130073
15449,15449,130075
15450,15450,130490


Now I need to merge my movie_mapping_df with my ratings_df3 

In [30]:
ratings_df3 = pd.merge(ratings_df3,movie_mapping_df,on='movieId')

In [31]:
ratings_df3.sort_values(by=['userId','movieId'],inplace=True)

In [32]:
ratings_df3.reset_index(drop=True,inplace=True)

In [33]:
ratings_df3.tail()

Unnamed: 0,userId,movieId,rating,timestamp,rating_movie_mean,rating_normalized,movie_idx
19964872,138495,2393,1.0,,3.202037,-2.202037,2299
19964873,138495,3556,4.5,,3.651373,0.848627,3433
19964874,138495,3844,4.0,,3.51098,0.48902,3712
19964875,138495,6942,4.0,,3.799759,0.200241,6723
19964876,138495,59615,1.0,,2.944857,-1.944857,11301


Now I want to get the ratings data into a sparse matrix

In [34]:
# Initialize sparse matrix of ratings
item_user_data = csr_matrix((ratings_df3['rating_normalized'].astype(np.double),
                       (ratings_df3['userId'], #row_id
                        ratings_df3['movie_idx']))) #column_id

#print(item_user_data)

In [35]:
print(item_user_data)

  (1, 1)	0.2880231983095807
  (1, 28)	-0.4522300469483569
  (1, 31)	-0.39805469097376633
  (1, 46)	-0.5534797687861275
  (1, 49)	-0.8343722078032592
  (1, 110)	0.08742640874684593
  (1, 149)	0.46096227013316415
  (1, 220)	0.13044946191179552
  (1, 250)	0.5033553395240857
  (1, 257)	-0.19067190194855232
  (1, 290)	-0.0505735544876762
  (1, 293)	-0.17423116921705528
  (1, 315)	-0.4469904996370291
  (1, 333)	-0.2558438653394477
  (1, 363)	0.3252821079571895
  (1, 537)	-0.13370569350717432
  (1, 583)	-0.4318882189683224
  (1, 587)	-0.6770565095815098
  (1, 644)	-0.21684751972942484
  (1, 901)	-0.4816805288974195
  (1, 906)	-0.4557478319407595
  (1, 989)	0.3623271889400921
  (1, 1016)	0.06615349189118058
  (1, 1056)	0.15037613093422797
  (1, 1057)	-0.4811256506299557
  :	:
  (138494, 6183)	1.1898846495119786
  (138494, 7934)	0.6721311475409837
  (138494, 8090)	-0.721558476038763
  (138494, 8936)	-2.3729281767955803
  (138494, 13098)	-0.8431149097815767
  (138495, 16)	0.031423871498379
  (13

### Singular Value Decomposition

Now I'm going to use scipy's SVD method.  What's great is that it can do this operation on a sparse matrix. 

In [36]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(item_user_data, k = 50)

In [37]:
sigma = np.diag(sigma)

In [38]:
print('The size of U is: {}'.format(U.shape))
print('The size of Vt is: {}'.format(Vt.shape))
print('The size of sigma is: {}'.format(sigma.shape))

The size of U is: (138496, 50)
The size of Vt is: (50, 15451)
The size of sigma is: (50, 50)


In [39]:
def get_user_rating_normalized_by_user(userId):
    user_idx = userId-1  
    user_mean_rating =  ratings_df3.loc[ratings_df3['userId']==userId,'rating_mean'].get_values()[0]
    user_predicted_rating = np.dot(np.dot(U[user_idx],sigma),Vt) + user_mean_rating
    return user_predicted_rating

In [40]:
def get_user_rating_normalized_by_movie(df,userId):
    user_idx = userId-1
    mean_rating = df.groupby('movieId')['rating'].mean()
    user_predicted_rating = np.dot(np.dot(U[user_idx],sigma),Vt) + mean_rating.values
    return user_predicted_rating

In [41]:
user_predictions = get_user_rating_normalized_by_movie(ratings_df3,138494)
user_predictions.shape

(15451,)

In [42]:
def has_rated_movie(userId, movie_idx):
    mask = ratings_df3['userId']==userId
    mask_rated = ratings_df3.loc[mask,'movie_idx'].isin([movie_idx])
    if sum(mask_rated)>0:
        return True
    else:
        return False 

In [43]:
def recommend_movies2(movies_df, ratings_df, userId, num_recommendations=5):
    
    # get user_predictions
    user_predictions = get_user_rating_normalized_by_movie(ratings_df,userId)
    
    # Sort my predictions from highest to lowest
    pred_idxs_sorted = np.argsort(user_predictions)
    pred_idxs_sorted[:] = pred_idxs_sorted[::-1]
    
    nm = unique_rated_movie_ids.shape[0] #get num_movies
    print("Top recommendations for UserId: {}".format(userId))
    i=0; j=0
    while i < num_recommendations:
        movie_idx = pred_idxs_sorted[j]
        if not has_rated_movie(userId, movie_idx):
            movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==movie_idx,'movieId']
            movieTitle = movies_df.loc[movies_df['movieId']==movieId.values[0],'title']
            print('Predicting rating {0:.1f} for movie {1}'.format(\
                    user_predictions[pred_idxs_sorted[i]],movieTitle.values[0]))
            i=i+1
        j=j+1
        
    nm_rated = sum(ratings_df['userId'] == userId)
    num_to_return = min(30,nm_rated)
    movieId = ratings_df.loc[ratings_df['userId'] == userId,'movieId']
    movieId_array = movieId.sample(num_to_return).values
    user_ratings_df = ratings_df[ratings_df['userId']==userId]
    print("\nA subset of original ratings provided for UserId: {}".format(userId))
    for i in range(num_to_return):
        movieTitle = movies_df.loc[movies_df['movieId']==movieId_array[i],'title']
        rating = user_ratings_df.loc[user_ratings_df['movieId']==movieId_array[i],'rating']
        print('Rated {0:.1f} for movie {1}'.format(rating.values[0],movieTitle.values[0]))    

In [44]:
recommend_movies2(movies_df,ratings_df3,138494,30)

Top recommendations for UserId: 138494
Predicting rating 5.0 for movie Forrest Gump (1994)
Predicting rating 5.0 for movie Natural Born Killers (1994)
Predicting rating 4.9 for movie Schindler's List (1993)
Predicting rating 4.9 for movie Fight Club (1999)
Predicting rating 4.8 for movie American Beauty (1999)
Predicting rating 4.8 for movie Dances with Wolves (1990)
Predicting rating 4.8 for movie Kill Bill: Vol. 1 (2003)
Predicting rating 4.8 for movie Clockwork Orange, A (1971)
Predicting rating 4.7 for movie Monty Python and the Holy Grail (1975)
Predicting rating 4.7 for movie Star Wars: Episode V - The Empire Strikes Back (1980)
Predicting rating 4.7 for movie Star Wars: Episode IV - A New Hope (1977)
Predicting rating 4.7 for movie Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Predicting rating 4.6 for movie Star Wars: Episode VI - Return of the Jedi (1983)
Predicting rating 4.6 for movie Indiana Jones and the Last Crusade (1989)
Predicting ratin

In [45]:
recommend_movies2(movies_df,ratings_df3,138495,30)

Top recommendations for UserId: 138495
Predicting rating 4.5 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.5 for movie Shawshank Redemption, The (1994)
Predicting rating 4.5 for movie Fight Club (1999)
Predicting rating 4.4 for movie Schindler's List (1993)
Predicting rating 4.4 for movie Godfather, The (1972)
Predicting rating 4.3 for movie Casablanca (1942)
Predicting rating 4.3 for movie Godfather: Part II, The (1974)
Predicting rating 4.3 for movie Princess Bride, The (1987)
Predicting rating 4.3 for movie Seven Samurai (Shichinin no samurai) (1954)
Predicting rating 4.3 for movie Death on the Staircase (Soupçons) (2004)
Predicting rating 4.3 for movie Monty Python and the Holy Grail (1975)
Predicting rating 4.3 for movie Rear Window (1954)
Predicting rating 4.3 for movie Usual Suspects, The (1995)
Predicting rating 4.3 for movie Band of Brothers (2001)
Predicting rating 4.3 for movie Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Predicting rati

In [46]:
def get_group_rating(user_predictions1, user_predictions2):
    '''
    Takes the predictions from two users and returns the average minus a penalty term based on the absolute value
    of the difference in the predicted score.  I divided this penalty term by 5, which was arbitrarily chosen.
    '''
    group_prediction = (user_predictions1+user_predictions2)/2 - np.abs(user_predictions1-user_predictions2)/5
    return group_prediction

In [47]:
def recommender_2users(movies_df, ratings_df, userId1, userId2, num_recommendations=5):
    
    #get user predictions
    user_predictions1 = get_user_rating_normalized_by_movie(ratings_df, userId1)
    user_predictions2 = get_user_rating_normalized_by_movie(ratings_df, userId2)
    
    #get the weighted average prediction
    group_predictions = get_group_rating(user_predictions1, user_predictions2)
    
    # Sort my predictions from highest to lowest
    pred_idxs_sorted = np.argsort(group_predictions)
    pred_idxs_sorted[:] = pred_idxs_sorted[::-1]
    
    nm = unique_rated_movie_ids.shape[0] #get num_movies
    print("Top combinded recommendations for UserIds: {} and {}".format(userId1,userId2))
    i=0; j=0
    while i < num_recommendations:
        movie_idx = pred_idxs_sorted[j]
        if not (has_rated_movie(userId1, movie_idx) or has_rated_movie(userId2, movie_idx)):
            movieId = movie_mapping_df.loc[movie_mapping_df['movie_idx']==movie_idx,'movieId']
            movieTitle = movies_df.loc[movies_df['movieId']==movieId.values[0],'title']
            print('Predicting rating {0:.1f} for movie {1}'.format(\
                    group_predictions[pred_idxs_sorted[i]],movieTitle.values[0]))
            i=i+1
        j=j+1

In [48]:
user_predictionsA = get_user_rating(138494)
user_predictionsB = get_user_rating(138493)

NameError: name 'get_user_rating' is not defined

In [49]:
recommender_2users(movies_df,ratings_df3,138494,138495,30)

Top combinded recommendations for UserIds: 138494 and 138495
Predicting rating 4.6 for movie Fight Club (1999)
Predicting rating 4.6 for movie Schindler's List (1993)
Predicting rating 4.5 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.4 for movie Monty Python and the Holy Grail (1975)
Predicting rating 4.4 for movie Godfather, The (1972)
Predicting rating 4.4 for movie Star Wars: Episode V - The Empire Strikes Back (1980)
Predicting rating 4.3 for movie Star Wars: Episode IV - A New Hope (1977)
Predicting rating 4.3 for movie Forrest Gump (1994)
Predicting rating 4.3 for movie Seven Samurai (Shichinin no samurai) (1954)
Predicting rating 4.3 for movie Spirited Away (Sen to Chihiro no kamikakushi) (2001)
Predicting rating 4.3 for movie Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Predicting rating 4.3 for movie City of God (Cidade de Deus) (2002)
Predicting rating 4.3 for movie Princess Bride, The (1987)
Predicting rating 4

In [50]:
recommend_movies2(movies_df,ratings_df3,138493,30)

Top recommendations for UserId: 138493
Predicting rating 4.9 for movie Godfather, The (1972)
Predicting rating 4.9 for movie Godfather: Part II, The (1974)
Predicting rating 4.8 for movie Usual Suspects, The (1995)
Predicting rating 4.6 for movie Zero Motivation (Efes beyahasei enosh) (2014)
Predicting rating 4.6 for movie Princess Bride, The (1987)
Predicting rating 4.5 for movie Memento (2000)
Predicting rating 4.5 for movie Shawshank Redemption, The (1994)
Predicting rating 4.5 for movie Goodfellas (1990)
Predicting rating 4.4 for movie Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
Predicting rating 4.4 for movie Blade Runner (1982)
Predicting rating 4.4 for movie Death on the Staircase (Soupçons) (2004)
Predicting rating 4.3 for movie City of God (Cidade de Deus) (2002)
Predicting rating 4.3 for movie Fargo (1996)
Predicting rating 4.3 for movie Seven Samurai (Shichinin no samurai) (1954)
Predicting rating 4.3 for movie O Auto da Compadecida (Dog's Wil

In [51]:
user_predictions_sorted = np.sort(user_predictions)

I'd like to save the DataFrame with Ruth's and my information so that I can do further processing on it.  In addition, I want to use the same filtering technique where I filter on movies that have been rated 10 or more times. 

In [None]:
ratings_df_to_save = filter_rated_movies(ratings_df2,10)

In [None]:
ratings_df_to_save.tail()

In [None]:
ratings_df_to_save.to_csv('ratings_with_Ben_Ruth.csv',index=False)

In [None]:
movie_mapping_df.to_csv('movie_mapping.csv',index=False)

In [None]:
#movie_mapping_df.index
movie_mapping_df.loc[0,:]

In [None]:
user_predictions = get_user_rating_normalized_by_movie(ratings_df3,138493)

In [None]:
user_predictions.shape

In [None]:
#recommend_movies2(movies_df,ratings_df3,12,30)

In [None]:
#Add Rom-com fan ratings
file_path= '/Users/bwsturm/ds/metis/metisbc/Week10/movie_recommender/data/user_ratings/Rom-com_fan_ratings.csv'
my_ratings_dict3 = read_rating_data(file_path)

In [None]:
my_ratings_dict3

In [None]:
ratings_df2.tail()

The romantic comedy fan is a profile I made up just to demonstrate my recommender.  The userId for this fan is 138496.

In [None]:
ratings_df2 = add_user_rating(ratings_df2,my_ratings_dict3)

In [None]:
ratings_df2.tail()

In [None]:
ratings_df_to_save2 = filter_rated_movies(ratings_df2,10)

In [None]:
ratings_df_to_save2.to_csv('ratings_with_Ben_Ruth_RomcomFan.csv',index=False)

How sparse is the item_user_data sparse matrix?

In [None]:
item_user_data.shape