# Netflix Recommendation
#### <span style="color: cornflowerblue">Team 01 | CSPB 4502 | 11/16/22</span>

The following is a baseline recommender built using only the Netflix dataset.

In [8]:
## LIBRARIES USED
import pandas as pd
import numpy as np
import pickle
import re
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import iqr
from surprise import Dataset, SVD, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

<span style="color: cornflowerblue">Import Dataset and Combine</span>

In [2]:
## IMPORT NETFLIX MOVIE NAMES
netflix_movie_titles = "../data/netflix/movie_titles.csv"

def manual_sep(old_split):
    new_split = old_split[0:2] + [",".join(old_split[2:])]
    return new_split
    
ntfx = pd.read_csv(netflix_movie_titles,
                   encoding = "ISO-8859-1",
                   header = None,
                   names = ['Movie_Id', 'Year', 'Name'],
                   on_bad_lines=manual_sep,
                   engine='python')
ntfx.dropna(subset='Year', inplace=True)
ntfx['Year'] = ntfx['Year'].astype("Int64")
print("Netflix Movie Names:")
print(f'{ntfx.shape = }')
print(ntfx.head().to_string())
print("Netflix Movie Ratings:")

Netflix Movie Names:
ntfx.shape = (17763, 3)
   Movie_Id  Year                          Name
0         1  2003               Dinosaur Planet
1         2  2004    Isle of Man TT 2004 Review
2         3  1997                     Character
3         4  1994  Paula Abdul's Get Up & Dance
4         5  2004      The Rise and Fall of ECW
Netflix Movie Ratings:


In [3]:
## IMPORT NETFLIX MOVIE RATINGS AND COMBINE
netflix_movie_ratings = [f'../data/netflix/combined_data_{i}.txt' for i in range(1, 5)]
stream = StringIO()
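movie_num = "1"   # current movie ID; each block in the file starts with a "<Movie_Id>:" header line
patrn = re.compile(r"^\d+:$")   # matches only the header lines
for path in netflix_movie_ratings:
    print(f'reading file {path}')
    with open(path, "r") as file:   # the with-block closes the file automatically
        for line in file:
            if patrn.match(line.strip()):
                movie_num = line.strip()[:-1]   # drop the trailing colon
            else:
                stream.write(movie_num+","+line)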
stream.seek(0)
print("reading done")
ratings = pd.read_csv(stream,
                      encoding = "ISO-8859-1",
                      names = ['Movie_Id', 'CustomerID', 'Rating', 'Date'],
                      engine='c')
stream.close()
del(stream)
print(f'{ratings.shape = }')
print(ratings.head().to_string())

reading file ../data/netflix/combined_data_1.txt
reading file ../data/netflix/combined_data_2.txt
reading file ../data/netflix/combined_data_3.txt
reading file ../data/netflix/combined_data_4.txt
reading done
ratings.shape = (100480507, 4)
   Movie_Id  CustomerID  Rating        Date
0         1     1488844       3  2005-09-06
1         1      822109       5  2005-05-13
2         1      885013       4  2005-10-19
3         1       30878       4  2005-12-26
4         1      823519       3  2004-05-03


In [4]:
print(f'Number of Movies: {len(ratings["Movie_Id"].unique())}')
print(f'Number of Customers: {len(ratings["CustomerID"].unique())}')
print(f'Total number of Reviews: {len(ratings)}')

Number of Movies: 17770
Number of Customers: 480189
Total number of Reviews: 100480507
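
These counts give a sense of how large and how sparse a full customer-by-movie matrix would be. The short calculation below is only illustrative arithmetic on the printed numbers; the one-byte-per-cell figure is an assumption (for example, ratings stored as `int8`).

```python
# Counts printed by the cell above
n_movies = 17_770
n_customers = 480_189
n_ratings = 100_480_507

cells = n_movies * n_customers   # size of a dense customer-by-movie matrix
density = n_ratings / cells      # fraction of cells that actually hold a rating
dense_gib = cells / 2**30        # memory in GiB, assuming 1 byte per cell

print(f"cells     = {cells:,}")
print(f"density   = {density:.2%}")
print(f"dense GiB = {dense_gib:.1f}")
```

Only a little over 1% of the cells would be filled, which is why a dense representation wastes most of its memory.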


<span style="color: cornflowerblue">Collaborative Filtering</span>

There are a few ways we can build a recommendation system. Unfortunately, because our dataset is so large, it is difficult to do anything substantial without a high cost in computation and time. For example, a very common step in recommendation systems is to build a user-item rating matrix, which is typically very sparse. Without any adjustments to our dataset, this matrix would need to hold information for 17,770 unique movies and 480,189 unique customers, with 100,480,507 ratings in total. That amounts to a matrix of 8,532,958,530 cells, which would take too much time and too many computational resources to process.  
One way around this is "on-the-fly" collaborative filtering. Collaborative filtering is a technique that uses only the rating profiles of different users or items.$^{[1]}$ It locates the users or items in the nearest neighborhood and generates recommendations from the information they provide.  
There are two major types of collaborative filtering: user-based and item-based. In user-based filtering, we look for similar users and base the recommendation on the choices made by that set of users. In item-based filtering, we look for items similar to those the customer has already chosen.  
Instead of calculating everything at once, we can calculate recommendations for a specific customer by comparing them only with similar customers. This greatly cuts down computation by eliminating data points that are not very similar. The drawback is that, because we compute recommendations for a single customer at a time and rely only on correlation values, it becomes difficult to produce an evaluation metric.
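
To make the idea concrete, here is a minimal sketch of on-the-fly, user-based filtering on a toy ratings table. It is an illustration only, not the project's actual filtering code: the function name `similar_users`, its parameters, and the toy data are assumptions. It keeps users who rated a minimum share of the target customer's movies and whose Pearson correlation with the target meets a threshold.

```python
import pandas as pd

def similar_users(ratings, target_id, movie_thresh=0.6, rho_thresh=0.4):
    """Hypothetical helper: users correlated with target_id on shared movies."""
    target = ratings.loc[ratings["CustomerID"] == target_id, ["Movie_Id", "Rating"]]
    # Keep only other users' ratings of movies the target customer has rated
    shared = ratings[ratings["Movie_Id"].isin(target["Movie_Id"])
                     & (ratings["CustomerID"] != target_id)]
    # Keep users who rated at least movie_thresh of the target's movies
    counts = shared.groupby("CustomerID")["Movie_Id"].nunique()
    keep = counts[counts >= movie_thresh * len(target)].index
    shared = shared[shared["CustomerID"].isin(keep)]
    # Pearson correlation between each remaining user and the target
    merged = shared.merge(target, on="Movie_Id", suffixes=("", "_target"))
    corr = merged.groupby("CustomerID")[["Rating", "Rating_target"]].apply(
        lambda g: g["Rating"].corr(g["Rating_target"]))
    return corr[corr >= rho_thresh].sort_values(ascending=False)

# Toy data: customer 1 is the target
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4],
    "Movie_Id":   [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 10],
    "Rating":     [5, 3, 1, 4, 4, 3, 2, 4, 1, 3, 5, 5],
})
result = similar_users(toy, target_id=1)
print(result)
```

In this toy example customer 4 is dropped for rating too few of the target's movies, customer 3 is dropped for a negative correlation, and only customer 2 is retained. Only the rows that share movies with the target are ever touched, which is what keeps the approach tractable on a large ratings table.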

For collaborative filtering, we have created a script that retrieves the most similar users based on specified thresholds:

In [5]:
import scripts.filtering as uf

In [6]:
## Define variables and create a similarity filtering object
movie_thresh = 0.6    # threshold proportion of random_ID's movies that a user must have rated to be retained
rho_thresh = 0.4   # threshold a user's correlation coefficient must meet to be retained

user_filter = uf.SimilarityFiltering(ratings,
                                     movie_thresh=movie_thresh,
                                     rho_thresh=rho_thresh,
                                     random_state=2)

In [7]:
## Getting user similarity based on chosen movies for customer 531050
user_filter.getUserSimilarity(531050, "Movie_Id")
user_filter.similar_users.sort_values('corr', ascending=False).head()

Unnamed: 0,Movie_Id,CustomerID,Rating,Date,corr
185471,11846,1180814,4,2004-08-20,0.589639
185698,17716,1180814,4,2005-10-21,0.589639
185696,17681,1180814,4,2005-01-10,0.589639
185695,17671,1180814,4,2005-04-12,0.589639
185694,17662,1180814,3,2005-02-14,0.589639


From this, we can compute a score for each movie ID by weighting each similar user's rating by that user's correlation.

In [8]:
## Create a score based on correlation and rating
df = user_filter.similar_users
df['score'] = df['corr']*df['Rating']
df.head()

Unnamed: 0,Movie_Id,CustomerID,Rating,Date,corr,score
0,3,728801,4,2004-04-19,0.405044,1.620175
1,30,728801,2,2004-04-28,0.405044,0.810088
2,83,728801,4,2003-03-10,0.405044,1.620175
3,97,728801,4,2003-03-15,0.405044,1.620175
4,108,728801,3,2004-06-20,0.405044,1.215131


In [9]:
# Get the cumulative score
cum_df = df.groupby('Movie_Id').sum()[['score', 'corr']]
cum_df.columns = ['cum_score', 'cum_corr']
cum_df.head()

Movie_Id,cum_score,cum_corr
3,14.309206,3.873034
5,1.702102,0.425525
8,38.394871,11.912866
9,0.8679,0.43395
13,2.139101,0.42782


In [10]:
## Create our recommendation: an estimated rating per movie
recommendation = pd.DataFrame(data=(cum_df['cum_score']/cum_df['cum_corr']).to_list(),
                              index = cum_df.index,
                              columns=['Recommendation_Score']).sort_values(by='Recommendation_Score',
                                                                                        ascending=False)
recommendation = recommendation.merge(right=ntfx, left_index=True, right_on='Movie_Id')
recommendation.head(10)

Unnamed: 0,Recommendation_Score,Movie_Id,Year,Name
13943,5.0,13944,1998,A Good Baby
9736,5.0,9737,2001,The Powerpuff Girls: Meet the Beat-Alls
6632,5.0,6633,2003,Foyle's War: Series 1
6175,5.0,6176,1983,Fraggle Rock: Season 1
15890,5.0,15891,1988,Inspector Morse 5: Last Seen Wearing
15632,5.0,15633,2000,The Thin Blue Lie
17234,5.0,17235,2003,Kal Ho Naa Ho: Tomorrow May Never Come
9481,5.0,9482,1992,Inspector Morse 23: The Death of the Self
3850,5.0,3851,1992,Inspector Morse 25: Cherubim & Seraphim
4578,5.0,4579,1977,I Never Promised You a Rose Garden
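
The cells above amount to a correlation-weighted average of ratings. As a sketch, write $S_m$ for the set of retained similar users who rated movie $m$, $\rho_u$ for user $u$'s correlation with the target customer, and $r_{u,m}$ for that user's rating (these symbols are introduced here for illustration only). The recommendation score is then:

$$\text{score}(m) = \frac{\sum_{u \in S_m} \rho_u \, r_{u,m}}{\sum_{u \in S_m} \rho_u}$$

Because every retained $\rho_u$ is positive, the score always lies between the lowest and highest rating given by the users in $S_m$. A movie rated by a single similar user simply inherits that user's rating, which may explain why several titles at the top of the list score exactly 5.0.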


<span style="color: cornflowerblue">Reducing Data</span>

While the above implementation is useful, it does not give us a metric for evaluating the effectiveness of our recommendations. An alternative is matrix factorization with the SVD algorithm, which was popularized by Simon Funk during the Netflix Prize competition.$^{[2]}$ This can be done with the surprise library.$^{[3]}$ Before doing so, however, we need to reduce the number of records in the data as much as possible.  
To do this, we first remove movies that are probably not very popular, meaning movies with a low number of reviews (here, below the 70th percentile of review counts). Since our goal is to recommend movies, we probably do not want to recommend one with low popularity. We then remove customers who are not very active reviewers. If a customer does not review often, their reviews are likely to be less trustworthy.
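
For reference, the SVD model in the surprise library predicts a rating as a global mean plus user and item biases plus a dot product of latent factor vectors. Following the notation of the library's documentation:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$$

Here $\mu$ is the mean rating in the training set, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the latent factor vectors of user $u$ and item $i$. The parameters are fit by stochastic gradient descent on a regularized squared error:

$$\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$

The sum runs over observed ratings only, so training cost grows with the number of ratings rather than with the size of the full customer-by-movie matrix.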

In [11]:
## Free memory from previous work since we are dealing with large data
del([recommendation, cum_df, df, user_filter])

In [12]:
## Remove movies with too few reviews (unpopular movies)
freq_df = ratings.groupby('Movie_Id')['Rating'].agg(['count'])
freq_df.head(20)
movie_thresh = round(freq_df['count'].quantile(0.7))
drop_movies = freq_df[freq_df['count'] < movie_thresh].index
print(f'Movies to drop: {len(drop_movies)}')

Movies to drop: 12438


In [13]:
## Remove customer with too few reviews (inactive customers)
freq_df = ratings.groupby('CustomerID')['Rating'].agg(['count'])
freq_df.head(20)
customer_thresh = round(freq_df['count'].quantile(0.7))
drop_customers = freq_df[freq_df['count'] < customer_thresh].index
print(f'Customers to drop: {len(drop_customers)}')

Customers to drop: 335809


In [15]:
## Contextual outlier removal: remove outliers in movie reviews
def q1(x):
    return np.quantile(x, .25)

def q3(x):
    return np.quantile(x, .75)

freq_df = ratings.groupby('Movie_Id')['Rating'].agg([iqr, q1, q3])
outlier_df = ratings.merge(right=freq_df,
                               left_on='Movie_Id',
                               right_index=True)
del(freq_df)
outliers = outlier_df[(outlier_df['Rating']<(outlier_df['q1'] - 1.5*outlier_df['iqr'])) |
                      (outlier_df['Rating']>(outlier_df['q3'] + 1.5*outlier_df['iqr']))].index
del(outlier_df)
print(f'Contextual outliers to drop: {len(outliers)}')

Contextual outliers to drop: 2914484
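
The per-movie IQR rule can be illustrated on a small toy table. This sketch uses `groupby(...).transform` instead of a merge, which is one possible way to get the per-movie quartiles aligned with each rating; the toy data and variable names are assumptions, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Movie_Id": [1] * 8 + [2] * 4,
    "Rating":   [4, 4, 4, 4, 4, 4, 4, 1, 1, 2, 4, 5],
})

grouped = toy.groupby("Movie_Id")["Rating"]
# transform broadcasts each movie's quartile back onto that movie's rows
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A rating is a contextual outlier if it falls outside its own movie's fences
is_outlier = (toy["Rating"] < q1 - 1.5 * iqr) | (toy["Rating"] > q3 + 1.5 * iqr)
print(toy[is_outlier])
```

For movie 1 almost every rating is 4, so the IQR is 0 and the single rating of 1 is flagged. For movie 2 the ratings are spread out, the fences are wide, and nothing is flagged. The same rating value can therefore be an outlier for one movie and not for another, which is what makes the removal contextual.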


In [16]:
new_ratings = ratings[~ratings['Movie_Id'].isin(drop_movies)]
new_ratings = new_ratings[~new_ratings['CustomerID'].isin(drop_customers)]
new_ratings = new_ratings[~new_ratings.index.isin(outliers)]
new_ratings.shape

(69868264, 4)

In [17]:
print(f'Total removed datapoints: {len(ratings) - len(new_ratings)} ({1-(len(new_ratings)/len(ratings)):.2%})')

Total removed datapoints: 30612243 (30.47%)


In [18]:
del(ratings, outliers, drop_movies, drop_customers)

In [28]:
reader = Reader()

data = Dataset.load_from_df(new_ratings[['CustomerID', 'Movie_Id', 'Rating']], reader)

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)

{'test_rmse': array([0.74498367, 0.74513559, 0.74513853, 0.74515964, 0.74485369]),
 'test_mae': array([0.58629553, 0.5865052 , 0.58654417, 0.58652367, 0.58630125]),
 'fit_time': (1243.6732385158539,
  1166.1720881462097,
  971.7347178459167,
  867.3207578659058,
  937.7179448604584),
 'test_time': (497.456750869751,
  245.21426105499268,
  193.13601326942444,
  214.1835584640503,
  210.58476948738098)}

In [11]:
filename = '../data/rating_corr.csv'
ratings = pd.read_csv(filename)
ratings.head()

Unnamed: 0,index,Movie_Id,CustomerID,Rating,Date,corr
0,84178754,15039,1333,3,2005-07-09,1.0
1,38942962,6923,1333,2,2004-03-07,1.0
2,50690952,9160,1333,2,2004-02-08,1.0
3,20913277,3936,1333,1,2005-07-09,1.0
4,95259246,16901,1333,1,2004-02-18,1.0


<span style="color: cornflowerblue">Using Integrated Data</span>

While the Netflix data alone is useful, we can integrate additional data to get more information about each movie.

In [5]:
## Getting integrated dataset from AWS
import boto3
import sys
import io
import re
import csv
import os
import pandas as pd
import numpy as np
from botocore import UNSIGNED
from botocore.client import Config
from io import StringIO  # Python 3 only; the notebook already relies on f-strings

# Get our client with the bucket region. Set config to an unsigned signature (required for public, unauthenticated access)
resource = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-west-1')
# Get the bucket name
bucket_name = 'moviereview.data'
object_key = "data/NetflixJoinedNLP.csv"
csv_obj_conn = resource.get_object(Bucket=bucket_name, Key=object_key)
# csv_obj_conn is a dict-like response with the data stream in 'Body'
body = csv_obj_conn['Body']
csv_read_conn = body.read()
conn_bytesio = io.BytesIO(csv_read_conn)
conn = pd.read_csv(conn_bytesio)
conn.head()

Unnamed: 0.1,Unnamed: 0,Movie_Id,Year,Name,idx,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,0,1,2003,Dinosaur Planet,tt0389605,tt0389605,tvMiniSeries,Dinosaur Planet,Dinosaur Planet,0,2003,,48,"Animation,Documentary,Family"
1,1,3,1997,Character,tt0119448,tt0119448,movie,Character,Karakter,0,1997,,122,"Crime,Drama,Mystery"
2,2,6,1997,Sick,tt0120126,tt0120126,movie,Sick,"Sick: The Life & Death of Bob Flanagan, Superm...",0,1997,,90,Documentary
3,3,7,1992,8 Man,tt0182668,tt0182668,movie,8 Man,Eitoman - Subete no sabishii yoru no tame ni,0,1992,,83,"Action,Sci-Fi"
4,4,8,2004,What the #$*! Do We Know!?,tt0399877,tt0399877,movie,What the #$*! Do We (K)now!?,What the #$*! Do We (K)now!?,0,2004,,109,"Comedy,Documentary,Drama"


In [6]:
## Integrating into Netflix ratings
ratings = ratings[ratings['Movie_Id'].isin(conn['Movie_Id'])]
ratings.shape

(87080526, 4)

In [7]:
## Remove movies with too few reviews (unpopular movies)
freq_df = ratings.groupby('Movie_Id')['Rating'].agg(['count'])
freq_df.head(20)
movie_thresh = round(freq_df['count'].quantile(0.85))
drop_movies = freq_df[freq_df['count'] < movie_thresh].index
print(f'Movies to drop: {len(drop_movies)}')

## Remove customer with too few reviews (inactive customers)
freq_df = ratings.groupby('CustomerID')['Rating'].agg(['count'])
freq_df.head(20)
customer_thresh = round(freq_df['count'].quantile(0.85))
drop_customers = freq_df[freq_df['count'] < customer_thresh].index
print(f'Customers to drop: {len(drop_customers)}')

Movies to drop: 9504
Customers to drop: 407669


In [8]:
## Contextual outlier removal: remove outliers in movie reviews
def q1(x):
    return np.quantile(x, .25)

def q3(x):
    return np.quantile(x, .75)

freq_df = ratings.groupby('Movie_Id')['Rating'].agg([iqr, q1, q3])
outlier_df = ratings.merge(right=freq_df,
                               left_on='Movie_Id',
                               right_index=True)
del(freq_df)
outliers = outlier_df[(outlier_df['Rating']<(outlier_df['q1'] - 1.5*outlier_df['iqr'])) |
                      (outlier_df['Rating']>(outlier_df['q3'] + 1.5*outlier_df['iqr']))].index
del(outlier_df)
print(f'Contextual outliers to drop: {len(outliers)}')

Contextual outliers to drop: 2389174


In [9]:
new_ratings = ratings[~ratings['Movie_Id'].isin(drop_movies)]
new_ratings = new_ratings[~new_ratings['CustomerID'].isin(drop_customers)]
new_ratings = new_ratings[~new_ratings.index.isin(outliers)]
new_ratings.shape

(35598871, 4)

In [10]:
print(f'Total removed datapoints: {len(ratings) - len(new_ratings)} ({1-(len(new_ratings)/len(ratings)):.2%})')

Total removed datapoints: 51481655 (59.12%)


In [11]:
del(ratings, outliers, drop_movies, drop_customers)

In [12]:
## Adding information
new_ratings = new_ratings.merge(right=conn, on='Movie_Id')[['Movie_Id', 'CustomerID', 'Rating', 'genres']]

In [13]:
new_ratings.head()

Unnamed: 0,Movie_Id,CustomerID,Rating,genres
0,8,785314,1,"Comedy,Documentary,Drama"
1,8,243963,3,"Comedy,Documentary,Drama"
2,8,1447783,4,"Comedy,Documentary,Drama"
3,8,1912665,1,"Comedy,Documentary,Drama"
4,8,1744889,1,"Comedy,Documentary,Drama"


In [14]:
new_ratings = new_ratings[~(new_ratings['genres'] == '\\N')]

In [15]:
## Checking freq of genres - we want to remove genres that aren't frequent:
freq_genre = {}
genres = conn['genres']
# Get frequency dict
for genre in genres:
    glist = genre.split(',')
    for g in glist:
        if freq_genre.get(g) is None:
            freq_genre[g] = 1
        else:
            freq_genre[g] += 1
conf = 0.01*len(conn) # minimum support: a genre must appear in at least 1% of movies
delete = []
for key, value in freq_genre.items():
    if value < conf:
        delete += [key]
print(f'Number of genres that are not frequent enough: {len(delete)}/{len(freq_genre.items())}')
print(delete)

Number of genres that are not frequent enough: 7/29
['Film-Noir', 'Adult', 'Talk-Show', '\\N', 'Game-Show', 'Reality-TV', 'News']
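
The same frequency count can be written more compactly with `collections.Counter`. The sketch below runs on a small made-up list standing in for `conn['genres']`, and the 30% threshold is chosen only so the toy example drops something; neither is taken from the notebook.

```python
from collections import Counter

# Toy stand-in for conn['genres']; '\\N' marks a missing genre as in the source data
genres = ["Comedy,Drama", "Drama", "Drama,Film-Noir", "\\N", "Comedy"]

# Count every genre across all comma-separated entries
freq_genre = Counter(g for entry in genres for g in entry.split(","))

min_count = 0.3 * len(genres)   # toy threshold: genre must appear in at least 30% of titles
delete = [g for g, n in freq_genre.items() if n < min_count]
print(delete)
```

`Counter` handles the "first time seen" case itself, so the explicit `get(...) is None` branch is not needed.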


In [16]:
## Explode the genre column
def explode_genre(row, exclude):
    rows = []
    genres = row[-1].split(',')
    for genre in genres:
        if genre not in exclude:
            new_row = row
            new_row = np.append(new_row, [genre])
            rows += [new_row]
    return rows

def explode(df, exclude):
    new_df = []
    print('[', end='')
    df = df.to_numpy()
    n = len(df)
    for i in range(n):
        row = df[i]
        new_df += explode_genre(row, exclude)
        if(i%(len(df)//10) == 0):
            print('|', end='')
    del(df)
    print(f'] Done\tnew_length={len(new_df)}')
    return new_df

exp_npy = explode(new_ratings, delete)
col = list(new_ratings.columns) + ['genre_exp']
del(new_ratings)
del(conn)
del(freq_genre)
new_ratings = pd.DataFrame(exp_npy, columns=col)
del(exp_npy)
new_ratings.head()

[|||||||||||] Done	new_length=89494247


Unnamed: 0,Movie_Id,CustomerID,Rating,genres,genre_exp
0,8,785314,1,"Comedy,Documentary,Drama",Comedy
1,8,785314,1,"Comedy,Documentary,Drama",Documentary
2,8,785314,1,"Comedy,Documentary,Drama",Drama
3,8,243963,3,"Comedy,Documentary,Drama",Comedy
4,8,243963,3,"Comedy,Documentary,Drama",Documentary
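
As a possible alternative to the row-by-row loop, pandas has a built-in `DataFrame.explode` that performs the same expansion in a vectorized way. This is a sketch on toy data, not a drop-in replacement tested on the full dataset; the third row (customer 111, movie 9) is hypothetical and added only to show the exclusion step.

```python
import pandas as pd

toy = pd.DataFrame({
    "Movie_Id": [8, 8, 9],
    "CustomerID": [785314, 243963, 111],
    "Rating": [1, 3, 5],
    "genres": ["Comedy,Documentary,Drama", "Comedy,Documentary,Drama", "Drama,News"],
})
exclude = ["News"]   # genres considered too infrequent

# Split the comma-separated string into a list, then emit one row per list element
exploded = (
    toy.assign(genre_exp=toy["genres"].str.split(","))
       .explode("genre_exp", ignore_index=True)
)
# Drop rows whose exploded genre is in the exclusion list
exploded = exploded[~exploded["genre_exp"].isin(exclude)].reset_index(drop=True)
print(exploded)
```

The three input rows become seven output rows: three genres for each of the first two ratings, and one for the third after `News` is excluded.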


In [17]:
new_ratings = new_ratings[['CustomerID', 'Movie_Id', 'genre_exp', 'Rating']]

In [2]:
## Run to save or reload the CSV below (lets us reset the kernel and free memory)
# new_ratings.to_csv('../data/ntfx_ratings_and_genres.csv')
new_ratings = pd.read_csv('../data/ntfx_ratings_and_genres.csv')
# new_ratings_train = new_ratings.sample(int(89494247*0.6))
# new_ratings_train.to_csv('../data/train_ntfx_ratings_and_genres.csv')
# new_ratings_test = new_ratings[~new_ratings.isin(new_ratings_train)]
# new_ratings_test = new_ratings_test[new_ratings_test['CustomerID'].isin(new_ratings_train['CustomerID'])]
# new_ratings_test.to_csv('../data/test_ntfx_ratings_and_genres.csv')
# new_ratings_train =pd.read_csv('../data/train_ntfx_ratings_and_genres.csv')
# new_ratings_test =pd.read_csv('../data/test_ntfx_ratings_and_genres.csv')

In [4]:
reader = Reader()

data = Dataset.load_from_df(new_ratings[['CustomerID', 'genre_exp', 'Rating']], reader)
del(new_ratings)
svd1 = SVD()

In [3]:
cross_validate(svd1, data, measures=['RMSE', 'MAE'], cv=5)

{'test_rmse': array([0.8878395 , 0.88770821, 0.88801972, 0.88752251, 0.8877624 ]),
 'test_mae': array([0.71907795, 0.71821666, 0.71824355, 0.7179755 , 0.717723  ]),
 'fit_time': (600.9166688919067,
  603.4310672283173,
  602.8181791305542,
  603.8718183040619,
  599.3855788707733),
 'test_time': (202.50324010849,
  137.48147082328796,
  128.2964267730713,
  203.58900117874146,
  120.58939504623413)}

In [5]:
trainset, testset = train_test_split(data, test_size=0.4, random_state=1)
del(data)
svd1.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fd52ef21e50>

In [6]:
pred = svd1.test(testset)
accuracy.rmse(pred)

RMSE: 0.8881


0.8880798306023862

In [3]:
import pickle

def saveModel(filename, model):
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

def loadModel(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

# test_svd = loadModel(filename)

In [13]:
filename = '../data/genre_svd_model'
# saveModel(filename, svd1)

In [14]:
filename = '../data/genre_testset'
# saveModel(filename, testset)

In [3]:
reader = Reader()
new_ratings = new_ratings[['CustomerID', 'Movie_Id', 'Rating']].drop_duplicates()
data = Dataset.load_from_df(new_ratings, reader)
del(new_ratings)
svd2 = SVD()

In [4]:
cross_validate(svd2, data, measures=['RMSE', 'MAE'], cv=5)

{'test_rmse': array([0.7376177 , 0.73766638, 0.73786179, 0.73790282, 0.7380553 ]),
 'test_mae': array([0.58279059, 0.58281425, 0.58297482, 0.5829652 , 0.5831612 ]),
 'fit_time': (406.6788065433502,
  428.40667033195496,
  427.3528952598572,
  426.79230999946594,
  420.9851379394531),
 'test_time': (117.49416637420654,
  80.32076692581177,
  100.83745884895325,
  80.6466212272644,
  86.78982901573181)}

In [6]:
trainset, testset = train_test_split(data, test_size=0.4, random_state=1)
del(data)
svd2.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f0fe3464670>

In [7]:
pred = svd2.test(testset)
accuracy.rmse(pred)

RMSE: 0.6882


0.6881834454835424

In [8]:
filename = '../data/movie_svd_model'
# saveModel(filename, svd2)
filename = '../data/movie_testset'
# saveModel(filename, testset)

In [4]:
filename = '../data/genre_svd_model'
# svd1 = loadModel(filename) # genre model
filename = '../data/genre_testset'
# testset1 = loadModel(filename) 
filename = '../data/movie_svd_model'
# svd2 = loadModel(filename) # movie model
filename = '../data/movie_testset'
# testset2 = loadModel(filename) # CustomerID, MovieID, Rating
filename = '../data/combined_testset'
comb_testset = loadModel(filename) # pandas dataset for combined testsets with rating results

In [5]:
comb_testset.head()

Unnamed: 0.1,CustomerID,Movie_Id,Rating_x,Unnamed: 0,genre_exp,Rating_y,movie_pred,genre_pred
0,2519865,7713,3.0,39079677,Comedy,3,3.890008,3.990738
1,2519865,7713,3.0,39079678,Horror,3,3.890008,3.960025
2,1315370,10451,5.0,51174840,Crime,5,3.704715,4.329966
3,1315370,10451,5.0,51174841,Sci-Fi,5,3.704715,4.157865
4,2015904,14978,3.0,74564131,Action,3,4.089476,3.403297


In [6]:
data_genre = comb_testset[['CustomerID', 'genre_exp']].to_numpy()

In [7]:
preds = []
for val in data_genre:
    pred = svd1.predict(*val)
    if not pred.details['was_impossible']:
        preds += [pred.est]
    else:
        preds += [-1]

In [8]:
len(preds)

96540075

In [9]:
comb_testset['genre_pred'] = preds

In [10]:
filename = '../data/combined_testset'
# saveModel(filename, comb_testset)

In [6]:
## Basic averaging of model results
comb_testset['pred_rating'] = (comb_testset['movie_pred'] + comb_testset['genre_pred'])/2
comb_testset.head()

Unnamed: 0.1,CustomerID,Movie_Id,Rating_x,Unnamed: 0,genre_exp,Rating_y,movie_pred,genre_pred,pred_rating
0,2519865,7713,3.0,39079677,Comedy,3,3.890008,3.990738,3.940373
1,2519865,7713,3.0,39079678,Horror,3,3.890008,3.960025,3.925017
2,1315370,10451,5.0,51174840,Crime,5,3.704715,4.329966,4.01734
3,1315370,10451,5.0,51174841,Sci-Fi,5,3.704715,4.157865,3.93129
4,2015904,14978,3.0,74564131,Action,3,4.089476,3.403297,3.746387


In [7]:
## Find RMSE between true rating and averaged predicted rating
rms = mean_squared_error(comb_testset['Rating_x'], comb_testset['pred_rating'], squared=False)
rms

0.7490367017465918

In [10]:
mae = mean_absolute_error(comb_testset['Rating_x'], comb_testset['pred_rating'])
mae

0.6030799041658691
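
One detail worth guarding against when averaging: the prediction loop earlier stores `-1` whenever a genre prediction was impossible, and averaging such a sentinel would pull the blended rating down. The sketch below shows one way to handle that on toy data, falling back to the movie prediction when the sentinel is present. The helper name `blend`, the weight `w_movie`, and the toy values are assumptions for illustration. The metrics are computed directly with NumPy, which avoids depending on the `squared` argument of `mean_squared_error`, whose availability varies across scikit-learn versions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Rating_x":   [3.0, 3.0, 5.0, 4.0],
    "movie_pred": [3.5, 3.5, 4.5, 4.0],
    "genre_pred": [4.0, 3.0, -1.0, 3.0],   # -1 marks an impossible prediction
})

def blend(df, w_movie=0.5):
    """Hypothetical helper: weighted average of the two predictions,
    falling back to movie_pred where genre_pred is the -1 sentinel."""
    mixed = w_movie * df["movie_pred"] + (1 - w_movie) * df["genre_pred"]
    return mixed.where(df["genre_pred"] != -1, df["movie_pred"])

toy["pred_rating"] = blend(toy)

err = toy["Rating_x"] - toy["pred_rating"]
rmse = np.sqrt(np.mean(err ** 2))   # root mean squared error
mae = np.mean(np.abs(err))          # mean absolute error
print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```

With `w_movie=0.5` this reduces to the simple average used above wherever both predictions are valid. Other weights could be tried on a held-out set, since the movie model alone scored a lower RMSE than the genre model.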

[1]   
[2] https://sifter.org/~simon/journal/20061211.html  
[3] https://surprise.readthedocs.io/en/stable/index.html
