<a href="https://colab.research.google.com/github/dean-sh/Movie-Ratings-Collaborating-Filltering/blob/master/Final%20Model%20-%20Funk%20SVD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MovieLens Recommendations - Final Model - Funk SVD
=============================================
## Dean Shabi, Dedi Kovatch, July 2019 
## Final Project for TCDS - Technion Data Science Specialization

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip).

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had fewer than 20 ratings or did not have complete demographic
information were removed from this data set.

Note that this description refers to the original 100K release; the model in this notebook is trained on the MovieLens 20M release (`ml-20m`), downloaded below.


# Funk SVD - Final Model

Based on the repository by **Geoffrey Bolmier**, Université Paul Sabatier,
which provides a fast and efficient implementation of Simon Funk's SVD algorithm.

https://github.com/gbolmier/funk-svd

This algorithm gave us very good recommendations, so it is the one we picked as our final model.

### **We also tested the model on 35 real users through a survey: each user received personal recommendations, and we got some good feedback.**
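To make the idea behind Funk SVD concrete, here is a minimal pure-Python sketch of biased matrix factorization trained with stochastic gradient descent. It is an illustration under our own assumptions, not the `funk-svd` library's implementation: the names `funk_svd_fit`, `funk_svd_predict_one` and `rmse`, the toy ratings, and the initialization scale are all hypothetical.

```python
import math
import random

def funk_svd_fit(ratings, n_factors=10, lr=0.01, reg=0.03, n_epochs=50, seed=0):
    """Fit biased matrix factorization with SGD.

    ratings: list of (user, item, rating) tuples.
    Returns (global_mean, user_bias, item_bias, user_factors, item_factors).
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi, pu, qi = {}, {}, {}, {}
    # Biases start at zero, latent factors at small random values.
    for u, i, _ in ratings:
        if u not in pu:
            bu[u] = 0.0
            pu[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in qi:
            bi[i] = 0.0
            qi[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - pred
            # Gradient step on the regularized squared error for this one rating.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * qif - reg * puf)
                qi[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, pu, qi

def funk_svd_predict_one(model, u, i, lo=0.5, hi=5.0):
    """Predict one rating, clipped to [lo, hi]; unknown users/items fall back to the mean."""
    mu, bu, bi, pu, qi = model
    pred = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in pu and i in qi:
        pred += sum(p * q for p, q in zip(pu[u], qi[i]))
    return min(hi, max(lo, pred))

def rmse(model, ratings):
    """Root mean squared error of the model on a list of (user, item, rating)."""
    sq = sum((r - funk_svd_predict_one(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))

# Toy data, made up for illustration only.
toy = [(0, 0, 5.0), (0, 1, 4.5), (1, 0, 4.0), (1, 2, 1.0),
       (2, 1, 5.0), (2, 2, 1.5), (3, 0, 2.0), (3, 2, 4.5)]
untrained = funk_svd_fit(toy, n_epochs=0)
trained = funk_svd_fit(toy, n_epochs=200)
print("RMSE before training:", rmse(untrained, toy))
print("RMSE after training: ", rmse(trained, toy))
```

The `learning_rate`, `regularization` and `n_factors` hyperparameters tuned later in this notebook play the same roles as `lr`, `reg` and `n_factors` in this sketch.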

## Download Data


https://github.com/gbolmier/funk-svd/blob/master/funk_svd/svd.py

In [0]:
!pip install git+https://github.com/gbolmier/funk-svd

Collecting git+https://github.com/gbolmier/funk-svd
  Cloning https://github.com/gbolmier/funk-svd to /tmp/pip-req-build-n3mik0ke
  Running command git clone -q https://github.com/gbolmier/funk-svd /tmp/pip-req-build-n3mik0ke
Building wheels for collected packages: funk-svd
  Building wheel for funk-svd (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-0p_6suex/wheels/66/f1/cb/e8147525b73388cc0bd5588c915e731ab65aba9a968e3ba455
Successfully built funk-svd
Installing collected packages: funk-svd
Successfully installed funk-svd-0.0.1.dev1


In [0]:
from funk_svd.dataset import fetch_ml20m_ratings
from funk_svd import SVD

from sklearn.metrics import mean_absolute_error

#20m dataset
import pandas as pd
import numpy as np
import zipfile
import urllib.request

Dataset20m = True
Download_new_dataset = True

if Download_new_dataset:
    if Dataset20m:
        print("Downloading 20-m movielens data...")

        urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-20m.zip", "movielens20m.zip")

        # Extract the archive; the context manager closes the file handle afterwards.
        with zipfile.ZipFile('movielens20m.zip', "r") as zip_ref:
            zip_ref.extractall()
        print("Downloaded the 20-m movielens!")

        # header=0 replaces the CSV's own header row with our column names,
        # so i_id is parsed as a number directly.
        movies_df = pd.read_csv('ml-20m/movies.csv', names=['i_id', 'title', 'genres'], header=0, sep=',', encoding='latin-1')

        # Load the ratings with funk_svd's helper, which fetches the 20M ratings itself.
        df = fetch_ml20m_ratings()
        movielens18 = df.copy()

    else:
        print("Downloading 100K latest movielens data...")

        urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-latest-small.zip", "movielens.zip")

        zip_ref = zipfile.ZipFile('movielens.zip', "r")
        zip_ref.extractall()

        ratings_df = pd.read_csv('ml-latest-small/ratings.csv', names=['u_id', 'i_id', 'rating', 'timestamp'], sep=',', encoding='latin-1', header = None)
        ratings_df.drop([0], inplace=True)
        ratings_df=ratings_df.apply(pd.to_numeric)


        movies_df = pd.read_csv('ml-latest-small/movies.csv',names= ['i_id', 'title', 'genres'], sep=',', encoding='latin-1')
        movies_df.drop([0], inplace=True)
        movies_df['i_id'] = movies_df['i_id'].apply(pd.to_numeric)

        # Create one merged DataFrame containing all the movielens data.
        movielens18 = ratings_df.merge(movies_df, on='i_id')
        movielens18.drop(axis = 1, columns = ['title', 'genres', 'timestamp'], inplace = True)
        movielens18.columns = ['u_id', 'i_id', 'rating']

Downloading 20-m movielens data...
Downloaded the 20-m movielens!
Downloading data...
Unzipping data...


## HyperParameter Search

### Random Search

We sample the learning rate and the regularization strength (lambda) from uniform distributions, about 100 times. The number of latent factors can be sampled the same way, but it is fixed at 300 in the run below.
In each iteration, the sampled parameters are saved in the results dataframe together with the test RMSE and MAE.
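The search loop itself is independent of the model. A minimal sketch of it, assuming a hypothetical `evaluate(lr, reg)` callable that trains a model and returns its RMSE (here replaced by a made-up stand-in objective so the example runs on its own):

```python
import random

def random_search(evaluate, n_iter=100, seed=0):
    """Sample (lr, reg) uniformly and return every trial, best RMSE first.

    evaluate: hypothetical callable (lr, reg) -> RMSE.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_iter):
        lr = rng.uniform(0.001, 0.1)
        reg = rng.uniform(0.001, 0.1)
        trials.append({"rmse": evaluate(lr, reg), "lr": lr, "reg": reg})
    return sorted(trials, key=lambda t: t["rmse"])

# Stand-in objective with an arbitrary minimum, for illustration only.
# In the notebook this would be "fit SVD, predict, compute RMSE".
def toy_objective(lr, reg):
    return 0.8 + (lr - 0.007) ** 2 + (reg - 0.03) ** 2

trials = random_search(toy_objective, n_iter=100)
best = trials[0]
print(best)
```

Fixing the seed makes the sampled configurations reproducible, which helps when comparing runs.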

In [0]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
# movielens18.drop(columns = 'timestamp', inplace = True)


train = movielens18.sample(frac=0.8)
val = movielens18.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
test = movielens18.drop(train.index.tolist()).drop(val.index.tolist())

iterations = 100

def sample_params():
    lr = np.random.uniform(low = 0.001, high = 0.1,  size = 1)[0]
    reg = np.random.uniform(low = 0.001, high = 0.1,  size = 1)[0]
#     factors = np.random.randint(low = 100, high = 500,  size = 1)[0]
    factors = 300
    return lr, reg, factors

In [0]:
results = []
for i in range(iterations):
    
    lr, reg, factors = sample_params()
    
    svd = SVD(learning_rate=lr, regularization=reg, n_epochs=5, n_factors=factors,
              min_rating=0.5, max_rating=5)
    
    svd.fit(X=train, X_val=val, early_stopping=True, shuffle=False)

    pred = svd.predict(test)
    mae = mean_absolute_error(test["rating"], pred)
    rmse = np.sqrt(mean_squared_error(test["rating"], pred))
    print("Test MAE:  {:.2f}".format(mae))
    print("Test RMSE: {:.2f}".format(rmse))
    print('{} factors, {} lr, {} reg'.format(factors, lr, reg))
    results.append([rmse, mae, lr, reg, factors])

Preprocessing data...

Epoch 1/5  | val_loss: 0.74 - val_rmse: 0.86 - val_mae: 0.66 - took 18.8 sec
Epoch 2/5  | val_loss: 0.71 - val_rmse: 0.85 - val_mae: 0.65 - took 17.2 sec
Epoch 3/5  | val_loss: 0.70 - val_rmse: 0.84 - val_mae: 0.64 - took 17.5 sec
Epoch 4/5  | val_loss: 0.69 - val_rmse: 0.83 - val_mae: 0.64 - took 17.7 sec
Epoch 5/5  | val_loss: 0.68 - val_rmse: 0.82 - val_mae: 0.63 - took 17.6 sec

Training took 1 min and 41 sec
Test MAE:  0.63
Test RMSE: 0.82
300 factors, 0.03726422643591702 lr, 0.07328935967245179 reg


In [0]:
results = pd.DataFrame(results)
results.columns = ['RMSE', 'MAE', 'learningRate', 'Lambda', 'NumOfFeatures']

print("Random search results:")
results.sort_values(by='RMSE').head()

Random search results:


Unnamed: 0,RMSE,MAE,learningRate,Lambda,NumOfFeatures
0,0.82485,0.632915,0.037264,0.073289,300


### Training the best model

In [0]:
lr, reg, factors = (0.007, 0.03, 90)

svd = SVD(learning_rate=lr, regularization=reg, n_epochs=200, n_factors=factors,
          min_rating=0.5, max_rating=5)

svd.fit(X=train, X_val=val, early_stopping=True, shuffle=False)

pred = svd.predict(test)
mae = mean_absolute_error(test["rating"], pred)
rmse = np.sqrt(mean_squared_error(test["rating"], pred))
print("Test MAE:  {:.2f}".format(mae))
print("Test RMSE: {:.2f}".format(rmse))
print('{} factors, {} lr, {} reg'.format(factors, lr, reg))

Preprocessing data...

Epoch 1/200  | val_loss: 0.77 - val_rmse: 0.87 - val_mae: 0.67 - took 6.9 sec
Epoch 2/200  | val_loss: 0.75 - val_rmse: 0.86 - val_mae: 0.67 - took 6.2 sec
Epoch 3/200  | val_loss: 0.73 - val_rmse: 0.85 - val_mae: 0.66 - took 6.2 sec
Epoch 4/200  | val_loss: 0.71 - val_rmse: 0.84 - val_mae: 0.65 - took 6.1 sec
Epoch 5/200  | val_loss: 0.70 - val_rmse: 0.83 - val_mae: 0.64 - took 6.1 sec
Epoch 6/200  | val_loss: 0.68 - val_rmse: 0.83 - val_mae: 0.63 - took 6.1 sec
Epoch 7/200  | val_loss: 0.67 - val_rmse: 0.82 - val_mae: 0.63 - took 6.1 sec
Epoch 8/200  | val_loss: 0.66 - val_rmse: 0.81 - val_mae: 0.62 - took 6.1 sec
Epoch 9/200  | val_loss: 0.65 - val_rmse: 0.81 - val_mae: 0.62 - took 6.1 sec
Epoch 10/200 | val_loss: 0.64 - val_rmse: 0.80 - val_mae: 0.61 - took 6.1 sec
Epoch 11/200 | val_loss: 0.64 - val_rmse: 0.80 - val_mae: 0.61 - took 6.1 sec
Epoch 12/200 | val_loss: 0.63 - val_rmse: 0.80 - val_mae: 0.61 - took 6.1 sec
Epoch 13/200 | val_loss: 0.63 - val_rmse:

## User Recommendations

### Random Search

In [0]:
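# Adding our own ratings

# my_ratings is indexed directly by MovieLens movie ID, so it must be as long as
# the largest ID + 1 (the number of distinct movies is smaller than the largest ID).
n_m = int(movielens18.i_id.max()) + 1

# Initialize my ratings (0 means "not rated")
my_ratings = np.zeros(n_m)


my_ratings[4993] = 5
my_ratings[1080] = 5
my_ratings[260] = 5
my_ratings[4896] = 5
my_ratings[1196] = 5
my_ratings[1210] = 5
my_ratings[2628] = 5
my_ratings[5378] = 5

print('User ratings:')
print('-----------------')

# The loop variable is named `rating` so that it does not overwrite the `val` validation set.
for i, rating in enumerate(my_ratings):
    if rating > 0:
        print('Rated %d stars: %s' % (rating, movies_df.loc[movies_df.i_id==i].title.values))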

User ratings:
-----------------
Rated 5 stars: ['Star Wars: Episode IV - A New Hope (1977)']
Rated 5 stars: ["Monty Python's Life of Brian (1979)"]
Rated 5 stars: ['Star Wars: Episode V - The Empire Strikes Back (1980)']
Rated 5 stars: ['Star Wars: Episode VI - Return of the Jedi (1983)']
Rated 5 stars: ['Star Wars: Episode I - The Phantom Menace (1999)']
Rated 5 stars: ["Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)"]
Rated 5 stars: ['Lord of the Rings: The Fellowship of the Ring, The (2001)']
Rated 5 stars: ['Star Wars: Episode II - Attack of the Clones (2002)']


In [0]:
print("Adding your recommendations!")
items_id = [item[0] for item in np.argwhere(my_ratings>0)]
ratings_list = my_ratings[np.where(my_ratings>0)]
user_id = np.asarray([0] * len(ratings_list))

user_ratings = pd.DataFrame(list(zip(user_id, items_id, ratings_list)), columns=['u_id', 'i_id', 'rating'])

Adding your recommendations!


In [0]:
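# Drop the timestamp column if it is still present.
movielens18 = movielens18.drop(columns=['timestamp'], errors='ignore')
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
data_with_user = pd.concat([movielens18, user_ratings], ignore_index=True)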

train_user = data_with_user.sample(frac=0.8)
val_user = data_with_user.drop(train_user.index.tolist()).sample(frac=0.5, random_state=8)
test_user = data_with_user.drop(train_user.index.tolist()).drop(val_user.index.tolist())

In [0]:
from itertools import product
def funk_svd_predict(userID, data_with_user, movies_df):
    userID = [userID]

    # all_users = data_with_user.u_id.unique()
    all_movies = data_with_user.i_id.unique()
    recommendations = pd.DataFrame(list(product(userID, all_movies)), columns=['u_id', 'i_id'])

    #Getting predictions for the selected userID
    pred_train = svd.predict(recommendations)
    recommendations['prediction'] = pred_train
    recommendations.head(10)

    sorted_user_predictions = recommendations.sort_values(by='prediction', ascending=False)

    user_ratings = data_with_user[data_with_user.u_id == userID[0]]
    user_ratings.columns = ['u_id',	'i_id', 'rating']
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = movies_df[~movies_df['i_id'].isin(user_ratings['i_id'])].\
        merge(pd.DataFrame(sorted_user_predictions).reset_index(drop=True), how = 'inner', left_on = 'i_id', right_on = 'i_id').\
        sort_values(by='prediction', ascending = False).drop(['i_id'],axis=1)

    rated_df = movies_df[movies_df['i_id'].isin(user_ratings['i_id'])].\
        merge(pd.DataFrame(data_with_user).reset_index(drop=True), how = 'inner', left_on = 'i_id', right_on = 'i_id')
    rated_df = rated_df.loc[rated_df.u_id==userID[0]].sort_values(by='rating', ascending = False)
    
    return recommendations, rated_df

In [0]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

iterations = 1

def sample_params():
    lr = np.random.uniform(low = 0.001, high = 0.01,  size = 1)[0]
    reg = np.random.uniform(low = 0.001, high = 0.1,  size = 1)[0]
    factors = np.random.randint(low = 30, high = 200,  size = 1)[0]
#     factors = 300
    return lr, reg, factors

In [0]:
#Checking Num Movies to see if this is the 20M data (should be 26K movies)
len(movielens18.i_id.unique())

26744

In [0]:
results = []
for i in range(iterations):
    
    lr, reg, factors = sample_params()
    
    svd = SVD(learning_rate=lr, regularization=reg, n_epochs=20, n_factors=factors,
              min_rating=0.5, max_rating=5)
    
    print('{} factors, {} lr, {} reg'.format(factors, lr, reg))

    svd.fit(X=train_user, X_val=val_user, early_stopping=True, shuffle=False)

    pred = svd.predict(test_user)
    mae = mean_absolute_error(test_user["rating"], pred)
    rmse = np.sqrt(mean_squared_error(test_user["rating"], pred))
    print("Test MAE:  {:.2f}".format(mae))
    print("Test RMSE: {:.2f}\n".format(rmse))
    results.append([rmse, mae, lr, reg, factors])

In [0]:
results = pd.DataFrame(results)
results.columns = ['RMSE', 'MAE', 'learningRate', 'Lambda', 'NumOfFeatures']

print("Random search results:")
results.sort_values(by='RMSE').head()

Random search results:


Unnamed: 0,RMSE,MAE,learningRate,Lambda,NumOfFeatures
0,0.833328,0.640036,0.00615,0.090192,30


### Training the best model

In [0]:
lr, reg, factors = (0.007, 0.03, 90)
epochs = 50

svd = SVD(learning_rate=lr, regularization=reg, n_epochs=epochs, n_factors=factors,
          min_rating=0.5, max_rating=5)

svd.fit(X=train_user, X_val=val_user, early_stopping=True, shuffle=False)

pred = svd.predict(test_user)
mae = mean_absolute_error(test_user["rating"], pred)
rmse = np.sqrt(mean_squared_error(test_user["rating"], pred))
print("Test MAE:  {:.2f}".format(mae))
print("Test RMSE: {:.2f}".format(rmse))
print('{} factors, {} lr, {} reg'.format(factors, lr, reg))

Preprocessing data...

Epoch 1/50  | val_loss: 0.76 - val_rmse: 0.87 - val_mae: 0.67 - took 6.1 sec
Epoch 2/50  | val_loss: 0.75 - val_rmse: 0.86 - val_mae: 0.66 - took 6.2 sec
Epoch 3/50  | val_loss: 0.73 - val_rmse: 0.85 - val_mae: 0.66 - took 6.2 sec
Epoch 4/50  | val_loss: 0.71 - val_rmse: 0.84 - val_mae: 0.65 - took 6.1 sec
Epoch 5/50  | val_loss: 0.69 - val_rmse: 0.83 - val_mae: 0.64 - took 6.1 sec
Epoch 6/50  | val_loss: 0.68 - val_rmse: 0.83 - val_mae: 0.63 - took 6.1 sec
Epoch 7/50  | val_loss: 0.67 - val_rmse: 0.82 - val_mae: 0.63 - took 6.1 sec
Epoch 8/50  | val_loss: 0.66 - val_rmse: 0.81 - val_mae: 0.62 - took 6.1 sec
Epoch 9/50  | val_loss: 0.65 - val_rmse: 0.81 - val_mae: 0.62 - took 6.1 sec
Epoch 10/50 | val_loss: 0.64 - val_rmse: 0.80 - val_mae: 0.61 - took 6.3 sec
Epoch 11/50 | val_loss: 0.64 - val_rmse: 0.80 - val_mae: 0.61 - took 6.3 sec
Epoch 12/50 | val_loss: 0.63 - val_rmse: 0.79 - val_mae: 0.61 - took 6.1 sec
Epoch 13/50 | val_loss: 0.63 - val_rmse: 0.79 - val_m

### Predictions:

In [0]:
from itertools import product
def funk_svd_predict(userID, data_with_user, movies_df):
    userID = [userID]

    # all_users = data_with_user.u_id.unique()
    all_movies = data_with_user.i_id.unique()
    recommendations = pd.DataFrame(list(product(userID, all_movies)), columns=['u_id', 'i_id'])

    #Getting predictions for the selected userID
    pred_train = svd.predict(recommendations)
    recommendations['prediction'] = pred_train
    recommendations.head(10)

    sorted_user_predictions = recommendations.sort_values(by='prediction', ascending=False)

    user_ratings = data_with_user[data_with_user.u_id == userID[0]]
    user_ratings.columns = ['u_id',	'i_id', 'rating']
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = movies_df[~movies_df['i_id'].isin(user_ratings['i_id'])].\
        merge(pd.DataFrame(sorted_user_predictions).reset_index(drop=True), how = 'inner', left_on = 'i_id', right_on = 'i_id').\
        sort_values(by='prediction', ascending = False)#.drop(['i_id'],axis=1)

    rated_df = movies_df[movies_df['i_id'].isin(user_ratings['i_id'])].\
        merge(pd.DataFrame(data_with_user).reset_index(drop=True), how = 'inner', left_on = 'i_id', right_on = 'i_id')
    rated_df = rated_df.loc[rated_df.u_id==userID[0]].sort_values(by='rating', ascending = False)
    
    return recommendations, rated_df

In [0]:
recommendations, rated_df = funk_svd_predict(0, data_with_user, movies_df)

In [0]:
rated_df

Unnamed: 0,i_id,title,genres,u_id,rating
54502,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,0,5.0
75060,1080,Monty Python's Life of Brian (1979),Comedy,0,5.0
120374,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,0,5.0
167214,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi,0,5.0
196789,2628,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Sci-Fi,0,5.0
214029,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy,0,5.0
251583,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,0,5.0
268009,5378,Star Wars: Episode II - Attack of the Clones (...,Action|Adventure|Sci-Fi|IMAX,0,5.0


In [0]:
recommendations.head(30)

Unnamed: 0,i_id,title,genres,u_id,prediction
5845,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,0,5.0
7348,7502,Band of Brothers (2001),Action|Drama|War,0,5.0
7033,7153,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,0,5.0
9784,31948,"Phone Box, The (Cabina, La) (1972)",Comedy|Drama|Mystery|Thriller,0,5.0
2482,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,0,5.0
314,318,"Shawshank Redemption, The (1994)",Crime|Drama,0,4.981206
8965,26674,Prime Suspect (1991),Crime|Drama|Mystery|Thriller,0,4.96879
20463,100553,Frozen Planet (2011),Documentary,0,4.967922
49,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,0,4.956233
1111,1136,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy,0,4.938441


## Survey real users to showcase our model

**We created an online survey to acquire real user ratings from the Technion Data Science class and from our friends.**

The survey is hosted on the Segmanta platform and can be accessed here:
https://surveys.segmanta.com/85fdg0

### Making A Survey

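**We picked 20 movies from the 150 most-rated titles in the 20M MovieLens data, drawing on the ones with the highest rating variation (standard deviation) and the ones with the highest average rating.**

This was important so that the survey ratings cover movies that people disagree about (high standard deviation in the data), and not only movies that everyone rates the same way.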
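The "popular but divisive" selection can be sketched without pandas as follows. This is an illustration only: `pick_survey_movies` and the toy ratings are hypothetical, and the sketch covers just the standard-deviation criterion.

```python
from collections import defaultdict
from statistics import mean, stdev

def pick_survey_movies(ratings, n_popular=150, n_pick=20):
    """Return the n_pick most divisive movie IDs among the n_popular most-rated ones.

    ratings: iterable of (user_id, movie_id, rating).
    """
    by_movie = defaultdict(list)
    for _, movie_id, rating in ratings:
        by_movie[movie_id].append(rating)

    stats = [
        {
            "movie_id": m,
            "count": len(rs),
            "mean": mean(rs),
            # Sample standard deviation needs at least two ratings.
            "std": stdev(rs) if len(rs) > 1 else 0.0,
        }
        for m, rs in by_movie.items()
    ]
    # First keep the most-rated movies, then rank those by rating spread.
    popular = sorted(stats, key=lambda s: s["count"], reverse=True)[:n_popular]
    divisive = sorted(popular, key=lambda s: s["std"], reverse=True)[:n_pick]
    return [s["movie_id"] for s in divisive]

# Toy data: movie 1 is popular and divisive, movie 2 is popular but uniform,
# movie 3 is divisive but has too few ratings to count as popular.
toy_ratings = [
    (1, 1, 5), (2, 1, 1), (3, 1, 5), (4, 1, 1),
    (1, 2, 4), (2, 2, 4), (3, 2, 4), (4, 2, 4),
    (1, 3, 5), (2, 3, 1),
]
picked = pick_survey_movies(toy_ratings, n_popular=2, n_pick=1)
print(picked)
```

Filtering by popularity first matters: without it, the ranking by standard deviation would be dominated by movies with only a handful of ratings.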


In [0]:
#setting up the list of movies for survey
number_of_ratings = pd.DataFrame(movielens18.groupby('i_id').count().loc[:,'rating'])
std_of_ratings = pd.DataFrame(movielens18.groupby('i_id').std().loc[:,'rating']) 
mean_of_ratings = pd.DataFrame(movielens18.groupby('i_id').mean().loc[:,'rating'])

x = number_of_ratings.merge(mean_of_ratings, how='inner', on = 'i_id').merge(std_of_ratings, how='inner', on = 'i_id')
x = x.merge(movies_df, how='inner', on = 'i_id')
x.drop('genres', axis = 1, inplace = True)
x.columns = ['MovieID', 'Number of Ratings', 'Avg Ratings', 'STD of ratings', 'Title']

In [0]:
movies_pick_std = x.sort_values(by = 'Number of Ratings', ascending = False).head(150).sort_values(by = 'STD of ratings', ascending = False).head(20)
movies_pick_mean = x.sort_values(by = 'Number of Ratings', ascending = False).head(150).sort_values(by = 'Avg Ratings', ascending = False).head(20)

frames = [movies_pick_std, movies_pick_mean]
picked_movies = pd.concat(frames)

picked_movies = picked_movies.sample(frac=1, random_state=0)

picked_movies = picked_movies.head(20)
picked_movies = picked_movies.reset_index(drop=True)
picked_movies.drop(['Number of Ratings', 'Avg Ratings', 'STD of ratings'], axis=1, inplace=True)

#picked_movies

**Adding new movies**

We decided to add 24 newer movies, based on a survey of real users.

The `new_survey_movies.csv` file contains the additional titles.

In [0]:
#Adding new movies - upload the "new_survey_movies.csv" here

from google.colab import files
uploaded = files.upload()

Saving new_survey_movies.csv to new_survey_movies.csv


In [0]:
new_movies = pd.read_csv('new_survey_movies.csv', header=None)

In [0]:
# Look up each new title in movies_df and collect its ID and title.
new_rows = []
for _, item in new_movies.iterrows():
    match = movies_df[movies_df.title == item[0]]
    new_rows.append({'MovieID': match.i_id.values[0], 'Title': match.title.values[0]})

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
picked_movies = pd.concat([picked_movies, pd.DataFrame(new_rows)], ignore_index=True)
picked_movies.MovieID = picked_movies.MovieID.astype(int)

In [0]:
picked_movies

Unnamed: 0,MovieID,Title
0,50,"Usual Suspects, The (1995)"
1,318,"Shawshank Redemption, The (1994)"
2,912,Casablanca (1942)
3,2628,Star Wars: Episode I - The Phantom Menace (1999)
4,1517,Austin Powers: International Man of Mystery (1...
5,208,Waterworld (1995)
6,2959,Fight Club (1999)
7,780,Independence Day (a.k.a. ID4) (1996)
8,736,Twister (1996)
9,1198,Raiders of the Lost Ark (Indiana Jones and the...


### Adding Survey results to the data

In [0]:
#Loading Survey Data

from google.colab import files
uploaded = files.upload()

Saving movielens.xlsx to movielens.xlsx


In [0]:
num_of_survey_ratings = picked_movies.shape[0]

survey_data = pd.read_excel('movielens.xlsx')
survey_data.drop(axis=0, index = 0, inplace=True)
survey_data = survey_data.iloc[:, 12:].transpose()
Names = survey_data.iloc[num_of_survey_ratings-1]
survey_data.drop(survey_data.tail(4).index,inplace=True)
survey_data.reset_index(inplace=True)
survey_data.rename(columns={"index": "Title"}, inplace=True)
survey_data['MovieID'] =  None
survey_data.loc[:num_of_survey_ratings,'MovieID'] = picked_movies['MovieID']
survey_data.MovieID.astype(int)

##################################

survey_data = survey_data.merge(picked_movies, how='left', on='MovieID')   
# survey_data.Title = survey_data.Title.str.strip()
survey_data.drop(['Title_y'], axis=1, inplace = True)
survey_data.rename(columns={"Title_x": "Title"}, inplace=True)

survey_data= pd.melt(survey_data, id_vars=['MovieID', 'Title'], value_vars=list(range(1,38)),
                    var_name='UserID', value_name='rating')

#dropping NAN ratings (where there was no rating in the survey)

survey_data = survey_data[pd.notnull(survey_data['rating'])]
survey_data = survey_data.reset_index(drop=True)

survey_data.head(10)

Unnamed: 0,MovieID,Title,UserID,rating
0,318,The Shawshank Redemption (1994),1,5
1,912,Casablanca (1942),1,3
2,2628,Star Wars: Episode I - The Phantom Menace (1999),1,2
3,1517,Austin Powers: International Man of Mystery (1...,1,3
4,208,Waterworld (1995),1,2
5,2959,Fight Club (1999),1,5
6,780,Independence Day (a.k.a. ID4) (1996),1,3
7,736,Twister (1996),1,3
8,1198,Raiders of the Lost Ark (Indiana Jones) (1981),1,4
9,750,Dr. Strangelove (1964),1,5


In [0]:
user_ratings = survey_data[['UserID','MovieID','rating']]
user_ratings.columns = ['u_id', 'i_id', 'rating']

Creating the dataset with addition of the survey users
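Because the survey users and the MovieLens users both use small integer IDs, one of the two ID ranges has to be shifted before the data sets are combined. A minimal sketch of that idea, using plain tuples and a hypothetical `merge_with_offset` helper (the notebook does the same thing on DataFrames):

```python
def merge_with_offset(survey_ratings, base_ratings):
    """Combine two rating lists without user ID collisions.

    Both arguments are lists of (user_id, movie_id, rating).
    Survey user IDs are assumed to start at 1.
    """
    n_survey_users = max(u for u, _, _ in survey_ratings)
    # Shift the base users so they start right after the last survey user.
    shifted = [(u + n_survey_users, m, r) for u, m, r in base_ratings]
    return survey_ratings + shifted

# Toy data, made up for illustration only.
survey = [(1, 318, 5), (2, 318, 4)]
base = [(1, 50, 3.5), (2, 50, 4.0)]
merged = merge_with_offset(survey, base)
print(merged)
```

The helper returns a new list instead of modifying its inputs, so calling it twice does not shift the IDs twice.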


In [0]:
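# Drop the timestamp column if it is still present.
movielens18 = movielens18.drop(columns=['timestamp'], errors='ignore')

# Shift the original u_id values up by the number of survey users, so the survey
# users keep IDs 1, 2, 3, ... and the original users start right after them.
# Note: re-running this cell shifts the IDs again.
n_survey_users = len(Names)
# user_ratings.u_id += 1
movielens18.u_id += n_survey_users

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
data_with_user = pd.concat([user_ratings, movielens18], ignore_index=True)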

train_user = data_with_user.sample(frac=0.9)
val_user = data_with_user.drop(train_user.index.tolist()).sample(frac=0.5, random_state=8)
test_user = data_with_user.drop(train_user.index.tolist()).drop(val_user.index.tolist())

In [0]:
data_with_user.groupby(by='u_id').count()[20:40]

In [0]:
# Creating a popularity dataframe (number of ratings per movie)

movie_popularity = data_with_user.groupby(by='i_id').count()
movie_popularity['i_id'] = data_with_user.groupby(by='i_id').count().index
movie_popularity.drop(['u_id'], axis=1, inplace=True)
movie_popularity.rename(columns={"rating": "popularity"}, inplace=True)
movie_popularity = movie_popularity.reset_index(drop=True)

# merging movies_df with the popularity, by i_id

movies_df_with_popularity = movies_df.merge(movie_popularity, how = 'left', left_on = 'i_id', right_on = 'i_id').reset_index(drop=True)

### Train + Getting predictions

In [0]:
from itertools import product
from sklearn.metrics import mean_squared_error, mean_absolute_error

def funk_svd_predict(userID, data_with_user, movies_df):
    """Return (recommendations, rated_df) for one user.

    Uses the global `svd` model, which must already be fitted.
    """
    userID = [userID]

    # Build one (user, movie) pair for every movie in the data.
    all_movies = data_with_user.i_id.unique()
    recommendations = pd.DataFrame(list(product(userID, all_movies)), columns=['u_id', 'i_id'])

    # Getting predictions for the selected userID
    recommendations['prediction'] = svd.predict(recommendations)

    # Ratings the user has already given (a copy, so renaming columns does not touch data_with_user)
    user_ratings = data_with_user[data_with_user.u_id == userID[0]].copy()
    user_ratings.columns = ['u_id', 'i_id', 'rating']

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = movies_df[~movies_df['i_id'].isin(user_ratings['i_id'])].\
        merge(recommendations, how='inner', on='i_id').\
        sort_values(by='prediction', ascending=False).drop(['i_id'], axis=1)
    recommendations = recommendations.reset_index(drop=True)

    # Movies the user rated, joined with their titles and genres.
    rated_df = movies_df[movies_df['i_id'].isin(user_ratings['i_id'])].\
        merge(data_with_user, how='inner', on='i_id')
    rated_df = rated_df.loc[rated_df.u_id == userID[0]].sort_values(by='rating', ascending=False)
    rated_df = rated_df.reset_index(drop=True)
    return recommendations, rated_df

These are the best hyperparameters from our random search:

In [0]:
lr, reg, factors = (0.007, 0.03, 90)
epochs = 50

svd = SVD(learning_rate=lr, regularization=reg, n_epochs=epochs, n_factors=factors,
          min_rating=0.5, max_rating=5)

svd.fit(X=train_user, X_val=val_user, early_stopping=True, shuffle=False)

pred = svd.predict(test_user)
mae = mean_absolute_error(test_user["rating"], pred)
rmse = np.sqrt(mean_squared_error(test_user["rating"], pred))
print("Test MAE:  {:.2f}".format(mae))
print("Test RMSE: {:.2f}".format(rmse))
print('{} factors, {} lr, {} reg'.format(factors, lr, reg))

Preprocessing data...

Epoch 1/50  | val_loss: 0.76 - val_rmse: 0.87 - val_mae: 0.67 - took 8.0 sec
Epoch 2/50  | val_loss: 0.74 - val_rmse: 0.86 - val_mae: 0.66 - took 7.6 sec
Epoch 3/50  | val_loss: 0.72 - val_rmse: 0.85 - val_mae: 0.65 - took 7.4 sec
Epoch 4/50  | val_loss: 0.70 - val_rmse: 0.84 - val_mae: 0.64 - took 7.2 sec
Epoch 5/50  | val_loss: 0.69 - val_rmse: 0.83 - val_mae: 0.63 - took 7.4 sec
Epoch 6/50  | val_loss: 0.67 - val_rmse: 0.82 - val_mae: 0.63 - took 7.3 sec
Epoch 7/50  | val_loss: 0.66 - val_rmse: 0.81 - val_mae: 0.62 - took 7.4 sec
Epoch 8/50  | val_loss: 0.65 - val_rmse: 0.81 - val_mae: 0.62 - took 7.5 sec
Epoch 9/50  | val_loss: 0.64 - val_rmse: 0.80 - val_mae: 0.61 - took 7.6 sec
Epoch 10/50 | val_loss: 0.64 - val_rmse: 0.80 - val_mae: 0.61 - took 7.6 sec
Epoch 11/50 | val_loss: 0.63 - val_rmse: 0.79 - val_mae: 0.61 - took 7.5 sec
Epoch 12/50 | val_loss: 0.62 - val_rmse: 0.79 - val_mae: 0.60 - took 7.3 sec
Epoch 13/50 | val_loss: 0.62 - val_rmse: 0.79 - val_m

In [0]:
#Survey users:


### Getting recommendations

A nice trick we added is restricting the recommendations to popular movies only.
In the beginning, we got obscure titles such as little-known Soviet documentaries, which hardly anyone wants to watch.

We believe this happens because some user in the 20M dataset has a rating pattern very similar to the survey user's, so the results reflect that one user's taste.

We decided to **recommend only popular movies (at least 100 ratings).** This reduces variety and hurts discovery a bit, but **improves the results substantially**!

**This is also done by the big players (Netflix, Amazon, YouTube).**
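The popularity filter amounts to counting ratings per movie and dropping predictions for movies below a threshold. A minimal sketch, assuming a hypothetical `recommend_popular` helper and made-up toy data (the notebook applies the same filter on a DataFrame column):

```python
from collections import Counter

def recommend_popular(predictions, ratings, min_ratings=100, top_n=10):
    """Return the top_n (movie_id, prediction) pairs among sufficiently rated movies.

    predictions: dict movie_id -> predicted rating for one user.
    ratings: iterable of (user_id, movie_id, rating) for the whole data set.
    """
    # Popularity is the number of ratings each movie received.
    popularity = Counter(movie_id for _, movie_id, _ in ratings)
    eligible = [(m, p) for m, p in predictions.items() if popularity[m] >= min_ratings]
    return sorted(eligible, key=lambda mp: mp[1], reverse=True)[:top_n]

# Toy data: movie 2 has the highest prediction but only one rating.
toy_ratings = [(1, 1, 4.0), (2, 1, 5.0), (3, 1, 3.5), (1, 2, 5.0)]
toy_predictions = {1: 4.2, 2: 4.9}
top = recommend_popular(toy_predictions, toy_ratings, min_ratings=2)
print(top)
```

The threshold is a trade-off: a higher `min_ratings` gives safer recommendations, a lower one leaves more room for discovering less-known movies.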


In [0]:
Selected_user = 36
print("predictions for {}".format(Names[Selected_user]))

recommendations, rated_df = funk_svd_predict(Selected_user, data_with_user, movies_df_with_popularity)

predictions for Dean Shabi


In [0]:
rated_df[['title', 'rating', 'genres']]

Unnamed: 0,title,rating,genres
0,Inception (2010),5,Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX
1,Iron Man (2008),5,Action|Adventure|Sci-Fi
2,"Lord of the Rings: The Return of the King, The...",5,Action|Adventure|Drama|Fantasy
3,Pirates of the Caribbean: The Curse of the Bla...,5,Action|Adventure|Comedy|Fantasy
4,Spider-Man (2002),5,Action|Adventure|Sci-Fi|Thriller
5,X-Men (2000),5,Action|Adventure|Sci-Fi
6,Scary Movie (2000),5,Comedy|Horror
7,"Lion King, The (1994)",5,Adventure|Animation|Children|Drama|Musical|IMAX
8,"Matrix, The (1999)",5,Action|Sci-Fi|Thriller
9,"Incredibles, The (2004)",5,Action|Adventure|Animation|Children|Comedy


In [0]:
recommendations[['title', 'prediction', 'genres', 'popularity']][recommendations.popularity>=100].head(40)

Unnamed: 0,title,prediction,genres,popularity
2,North & South (2004),4.843947,Drama|Romance,147.0
5,3 Idiots (2009),4.769312,Comedy|Drama|Romance,453.0
6,"Shawshank Redemption, The (1994)",4.767195,Crime|Drama,63395.0
7,"Sixth Sense, The (1999)",4.757356,Drama|Horror|Mystery,39028.0
8,Lifted (2006),4.748968,Animation|Comedy|Sci-Fi,127.0
10,Finding Nemo (2003),4.727982,Adventure|Animation|Children|Comedy,23569.0
11,Shrek (2001),4.717577,Adventure|Animation|Children|Comedy|Fantasy|Ro...,31972.0
13,Schindler's List (1993),4.709197,Drama|War,50054.0
14,Harry Potter and the Deathly Hallows: Part 2 (...,4.704267,Action|Adventure|Drama|Fantasy|Mystery|IMAX,3983.0
16,Eddie Izzard: Dress to Kill (1999),4.693521,Comedy,208.0


### Export to Excel

In [0]:
!pip install xlsxwriter
import xlsxwriter

# The context manager saves and closes the workbook
# (ExcelWriter.save() was removed in pandas 2.0).
with pd.ExcelWriter('predictions.xlsx', engine='xlsxwriter') as writer:
    # Survey user IDs start at 1, hence i+1.
    for i, user in enumerate(Names):
        print("recommendations for {}".format(Names[i+1]))
        recommendations, _ = funk_svd_predict(i+1, data_with_user, movies_df_with_popularity)
        sheetName = str(Names[i+1])+str(i)
        # One sheet per user: top 40 recommendations among popular movies.
        recommendations[['title', 'prediction', 'genres', 'popularity']][recommendations.popularity>=100].head(40).to_excel(writer, sheet_name=sheetName)

In [0]:
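# !pip install xlsxwriter
import xlsxwriter

# The context manager saves and closes the workbook
# (ExcelWriter.save() was removed in pandas 2.0).
with pd.ExcelWriter('user_ratings.xlsx', engine='xlsxwriter') as writer:
    # Survey user IDs start at 1, hence i+1.
    for i, user in enumerate(Names):
        print("ratings for {}".format(Names[i+1]))
        _ , rated_df = funk_svd_predict(i+1, data_with_user, movies_df_with_popularity)
        sheetName = str(Names[i+1])+str(i)
        # One sheet per user with the movies they rated in the survey.
        rated_df.to_excel(writer, sheet_name=sheetName)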
