<a href="https://colab.research.google.com/github/dean-sh/Movie-Ratings-Collaborating-Filltering/blob/master/Singular%20Value%20Decomposition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MovieLens Recommendations - Dean Shabi, Dedi Kovatch, July 2019
=============================================

## Final Project for TCDS - Technion Data Science Specialization. 

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:
	* 100,000 ratings (1-5) from 943 users on 1682 movies. 
	* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.


## Data Description




Here are brief descriptions of the data.

ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                gunzip ml-data.tar.gz
                tar xvf ml-data.tar
                mku.sh

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test       are 80%/20% splits of the u data into training and test data.
u2.base       Each of u1, ..., u5 have disjoint test sets; this if for
u2.test       5 fold cross validation (where you repeat your experiment
u3.base       with each training and test set and average the results).
u3.test       These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test       split the u data into a training set and a test set with
ub.base       exactly 10 ratings per user in the test set.  The sets
ub.test       ua.test and ub.test are disjoint.  These data sets can
              be generated from u.data by mku.sh.

allbut.pl  -- The script that generates training and test sets where
              all but n of a users ratings are in the training data.

mku.sh     -- A shell script to generate all the u data sets from u.data.

## Imports




In [0]:
import numpy as np
import pandas as pd
import collections
import seaborn as sns
%matplotlib inline
# from mpl_toolkits.mplot3d import Axes3D
from IPython import display
from matplotlib import pyplot as plt
import sklearn
import sklearn.manifold
# import tensorflow as tf
# tf.logging.set_verbosity(tf.logging.ERROR)

# # Add some convenience functions to Pandas DataFrame.
# pd.options.display.max_rows = 10
# pd.options.display.float_format = '{:.3f}'.format



# # Install Altair and activate its colab renderer.
# print("Installing Altair...")
# !pip install git+git://github.com/altair-viz/altair.git
# import altair as alt
# alt.data_transformers.enable('default', max_rows=None)
# alt.renderers.enable('colab')
# print("Done installing Altair.")

# # Install spreadsheets and import authentication module.
# USER_RATINGS = False
# !pip install --upgrade -q gspread
# from google.colab import auth
# import gspread
# from oauth2client.client import GoogleCredentials

## **Importing dataset, preprocessing**




In [0]:
# download the MovieLens Data, and create DataFrames containing movies, users, and ratings.

print("Downloading movielens data...")
import zipfile
import urllib.request

urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
zip_ref = zipfile.ZipFile('movielens.zip', "r")
zip_ref.extractall()
print("Done. Dataset contains:")
print(zip_ref.read('ml-100k/u.info'))

In [0]:
# Load each data set (users, movies, and ratings).
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(
    'ml-100k/u.user', sep='|', names=users_cols, encoding='latin-1')

ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# The movies file contains a binary feature for each genre.
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"] + genre_cols

movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')


Some Preproccessing

In [0]:
# Since the ids start at 1, we shift them to start at 0.
users["user_id"] = users["user_id"].apply(lambda x: str(x-1))
movies["movie_id"] = movies["movie_id"].apply(lambda x: str(x-1))
movies["year"] = movies['release_date'].apply(lambda x: str(x).split('-')[-1])
ratings["movie_id"] = ratings["movie_id"].apply(lambda x: str(x-1))
ratings["user_id"] = ratings["user_id"].apply(lambda x: str(x-1))
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))

# Compute the number of movies to which a genre is assigned.
genre_occurences = movies[genre_cols].sum().to_dict()

# Since some movies can belong to more than one genre, we create different
# 'genre' columns as follows:
# - all_genres: all the active genres of the movie.
# - genre: randomly sampled from the active genres.
def mark_genres(movies, genres):
    def get_random_genre(gs):
        active = [genre for genre, g in zip(genres, gs) if g==1]
        if len(active) == 0:
            return 'Other'
        return np.random.choice(active)
    def get_all_genres(gs):
        active = [genre for genre, g in zip(genres, gs) if g==1]
        if len(active) == 0:
            return 'Other'
        return '-'.join(active)
    movies['genre'] = [
        get_random_genre(gs) for gs in zip(*[movies[genre] for genre in genres])]
    movies['all_genres'] = [
        get_all_genres(gs) for gs in zip(*[movies[genre] for genre in genres])]

mark_genres(movies, genre_cols)

In [0]:
# Create one merged DataFrame containing all the movielens data.
movielens = ratings.merge(movies, on='movie_id').merge(users, on='user_id')

# Utility to split the data into training and test sets.
def split_dataframe(df, holdout_fraction=0.1):
    """Splits a DataFrame into training and test sets.
    Args:
    df: a dataframe.
    holdout_fraction: fraction of dataframe rows to use in the test set.
    Returns:
    train: dataframe for training
    test: dataframe for testing
    """
    test = df.sample(frac=holdout_fraction, replace=False)
    train = df[~df.index.isin(test.index)]
    return train, test

In [0]:
movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,year,genre,all_genres
0,0,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1995,Comedy,Animation-Children-Comedy
1,1,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1995,Adventure,Action-Adventure-Thriller
2,2,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1995,Thriller,Thriller
3,3,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1995,Comedy,Action-Comedy-Drama
4,4,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1995,Crime,Crime-Drama-Thriller


In [0]:
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,0,24,M,technician,85711
1,1,53,F,other,94043
2,2,23,M,writer,32067
3,3,24,M,technician,43537
4,4,33,F,other,15213


In [0]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,195,241,3.0,881250949
1,185,301,3.0,891717742
2,21,376,1.0,878887116
3,243,50,2.0,880606923
4,165,345,1.0,886397596


In [0]:
movielens.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,title,release_date,video_release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,year,genre,all_genres,age,sex,occupation,zip_code
0,195,241,3.0,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1997,Comedy,Comedy,49,M,writer,55105
1,195,256,2.0,881251577,Men in Black (1997),04-Jul-1997,,http://us.imdb.com/M/title-exact?Men+in+Black+...,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1997,Sci-Fi,Action-Adventure-Comedy-Sci-Fi,49,M,writer,55105
2,195,110,4.0,881251793,"Truth About Cats & Dogs, The (1996)",26-Apr-1996,,http://us.imdb.com/M/title-exact?Truth%20About...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1996,Comedy,Comedy-Romance,49,M,writer,55105
3,195,24,4.0,881251955,"Birdcage, The (1996)",08-Mar-1996,,"http://us.imdb.com/M/title-exact?Birdcage,%20T...",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1996,Comedy,Comedy,49,M,writer,55105
4,195,381,4.0,881251843,"Adventures of Priscilla, Queen of the Desert, ...",01-Jan-1994,,http://us.imdb.com/M/title-exact?Adventures%20...,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1994,Comedy,Comedy-Drama,49,M,writer,55105


# Matrix Factorization via Singular Value Decomposition

Based on great work done by **Nick Becker**, RAPIDS Team at NVIDIA

*   https://beckernick.github.io/matrix-factorization-recommender/
*   https://github.com/beckernick/matrix_factorization_recommenders

This gave us great recommendations, however the RMSE we got was pretty high (probably due to some normalization factor we didn't account for)


Matrix factorization is the breaking down of one matrix in a product of multiple matrices. It's extremely well studied in mathematics, and it's highly useful. There are many different ways to factor matrices, but singular value decomposition is particularly useful for making recommendations.

So what is singular value decomposition (SVD)? At a high level, SVD is an algorithm that decomposes a matrix $R$ into the best lower rank (i.e. smaller/simpler) approximation of the original matrix $R$. Mathematically, it decomposes R into a two unitary matrices and a diagonal matrix:

$$\begin{equation}
R = U\Sigma V^{T}
\end{equation}$$

where R is users's ratings matrix, $U$ is the user "features" matrix, $\Sigma$ is the diagonal matrix of singular values (essentially weights), and $V^{T}$ is the movie "features" matrix. $U$ and $V^{T}$ are orthogonal, and represent different things. $U$ represents how much users "like" each feature and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower rank approximation, we take these matrices and keep only the top $k$ features, which we think of as the underlying tastes and preferences vectors.


## Setting Up the Ratings Data

In [0]:
import pandas as pd
import numpy as np
import zipfile
import urllib.request

print("Downloading movielens data...")
    
urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-latest-small.zip", "movielens.zip")

zip_ref = zipfile.ZipFile('movielens.zip', "r")
zip_ref.extractall()


ratings_df = pd.read_csv('ml-latest-small/ratings.csv', names=['user_id', 'movie_id', 'rating', 'timestamp'], sep=',', encoding='latin-1', header = None)
ratings_df.drop([0], inplace=True)
ratings_df=ratings_df.apply(pd.to_numeric)
# ratings_df['UserID'] = ratings_df['UserID'].apply(pd.to_numeric)
# ratings_df['UserID'] = ratings_df['UserID'].apply(pd.to_numeric)


movies_df = pd.read_csv('ml-latest-small/movies.csv',names= ['movie_id', 'title', 'genres'], sep=',', encoding='latin-1')
movies_df.drop([0], inplace=True)
movies_df['movie_id'] = movies_df['movie_id'].apply(pd.to_numeric)
# movies_df.drop('Genres', axis = 1, inplace = True)

# Create one merged DataFrame containing all the movielens data.
movielens18 = ratings_df.merge(movies_df, on='movie_id')

Downloading movielens data...


I'll also take a look at the movies and ratings dataframes.

In [0]:
movies_df.shape

(9742, 3)

In [0]:
movielens18.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [0]:
R_df = ratings_df.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
R_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,36,38,39,40,41,42,43,...,185135,185435,185473,185585,186587,187031,187541,187593,187595,187717,188189,188301,188675,188751,188797,188833,189043,189111,189333,189381,189547,189713,190183,190207,190209,190213,190215,190219,190221,191005,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


de-mean the data (normalize by each users mean) and convert it from a dataframe to a numpy array.

In [0]:
R = R_df.as_matrix()

Z = R>0
m, n = R.shape
Ymean = np.zeros(m)
Ynorm = np.zeros(R.shape)

for i in range(m):
    idx = Z[i, :] == 1
    Ymean[i] = np.mean(R[i, idx])
    Ynorm[i, idx] = R[i, idx] - Ymean[i]    

  """Entry point for launching an IPython kernel.


In [0]:
R_demeaned = R - Ymean[:,np.newaxis]

## Singular Value Decomposition

Scipy and Numpy both have functions to do the singular value decomposition. I'm going to use the Scipy function `svds` because it let's me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it after).

In [0]:
from scipy.sparse.linalg import svds
K = 10
U, sigma, Vt = svds(R_demeaned, k = 10)

In [0]:
df.head()

Unnamed: 0,movie_id,title,genres
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,2,Jumanji (1995),Adventure|Children|Fantasy
3,3,Grumpier Old Men (1995),Comedy|Romance
4,4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,5,Father of the Bride Part II (1995),Comedy


Done. The function returns exactly what I detailed earlier in this post, except that the $\Sigma$ returned is just the values instead of a diagonal matrix. This is useful, but since I'm going to leverage matrix multiplication to get predictions I'll convert it to the diagonal matrix form.

In [0]:
sigma = np.diag(sigma)

## Making Predictions from the Decomposed Matrices

In [0]:
pd.DataFrame(np.dot(np.dot(U, sigma), Vt)).head() + Ymean[:,np.newaxis]

In [0]:
pd.DataFrame(np.dot(np.dot(U, sigma), Vt)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,9684,9685,9686,9687,9688,9689,9690,9691,9692,9693,9694,9695,9696,9697,9698,9699,9700,9701,9702,9703,9704,9705,9706,9707,9708,9709,9710,9711,9712,9713,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,2.731634,0.824226,0.863543,-0.128215,0.11766,1.620033,0.022565,-0.117652,0.050279,1.912932,0.552621,0.254022,0.005336,-0.102017,-0.180319,0.948455,0.061236,-0.096068,0.687209,0.151289,1.261252,0.181749,-0.01719,0.252383,0.495318,-0.188269,-0.199703,-0.273025,0.547097,-0.012557,-0.111906,2.252317,1.131754,0.199436,0.003295,0.714506,-0.106971,-0.00813,0.051073,-0.122895,...,-0.157027,-0.115065,-0.071877,-0.075409,-0.071877,-0.087704,-0.120529,-0.110677,-0.104763,-0.090134,-0.092262,-0.082204,-0.164606,-0.127641,-0.058698,-0.179804,-0.112345,-0.11253,-0.123898,-0.149409,-0.113312,-0.100775,-0.114916,-0.111973,-0.112902,-0.111787,-0.111973,-0.111787,-0.111787,-0.126525,-0.123168,-0.121489,-0.124847,-0.124847,-0.123168,-0.124847,-0.123168,-0.123168,-0.123168,-0.133069
1,0.178901,-0.038069,-0.037811,-0.006254,0.016541,0.076676,-0.073422,-0.002933,-0.018228,-0.082428,0.002699,-0.049389,-0.039679,-0.037711,-0.055149,0.211399,-0.07533,-0.09433,-0.072832,0.051227,-0.037693,-0.040162,-0.023834,-0.05959,0.045661,-0.070228,0.00569,-0.095775,-0.204458,-0.022554,0.016811,0.005587,-0.089603,0.000461,0.006093,-0.017727,-0.009136,-0.070792,-0.004064,-0.043874,...,0.031704,0.010337,0.000233,0.016881,0.000233,-0.002226,0.013059,0.091506,0.004994,0.000808,-4.3e-05,-0.044401,-0.005832,0.006365,0.002879,-0.005297,-0.004993,-0.004451,-0.001224,-0.006367,-0.008008,-0.003448,-0.004725,-0.006077,-0.003367,-0.00662,-0.006077,-0.00662,-0.00662,-0.000455,-0.002066,-0.002871,-0.00126,-0.00126,-0.002066,-0.00126,-0.002066,-0.002066,-0.002066,0.005347
2,0.023583,-0.001087,0.009023,-0.006277,-0.024101,0.065603,-0.024909,-0.00567,0.005685,0.055995,-0.022128,0.021467,0.000627,-0.007039,-0.006482,-0.006187,-0.03894,5e-05,0.002065,0.009784,0.016307,-0.008969,-0.002138,-0.000635,-0.007776,-0.011981,-0.015251,-0.016733,0.021421,0.003982,-0.023048,0.052589,-0.016008,-0.024877,0.003988,-0.047013,-0.001045,0.000345,0.00693,-0.005408,...,-0.001495,0.00043,-0.000285,0.000377,-0.000285,-0.000713,-0.002556,0.010296,0.00887,-0.000708,-0.000762,0.010046,-0.00289,-0.001361,3.4e-05,-0.003361,-0.001517,-0.001573,-0.000727,-0.002418,-0.000932,-0.000974,-0.001406,-0.001406,-0.001684,-0.001351,-0.001406,-0.001351,-0.001351,-0.001205,-0.001213,-0.001217,-0.001209,-0.001209,-0.001213,-0.001209,-0.001213,-0.001213,-0.001213,-0.00314
3,1.479939,0.196467,0.192636,-0.03512,0.104789,0.194374,0.267949,-0.133225,-0.115444,-0.010298,0.718759,-0.133585,-0.007747,0.036157,-0.00951,0.395341,1.088663,0.17245,0.192556,-0.160884,0.877455,-0.066031,-0.167028,0.099026,0.699756,0.005758,-0.080493,0.334972,0.899928,0.085349,-0.010672,1.062285,0.975442,0.721076,-0.077426,1.067818,-0.097435,0.034366,-0.016771,-0.036849,...,-0.128765,-0.109915,-0.073742,-0.075341,-0.073742,-0.077813,-0.088481,-0.180858,-0.180034,-0.044795,-0.049857,-0.138352,-0.06403,-0.088939,-0.066516,-0.055062,-0.096042,-0.096166,-0.099938,-0.072999,-0.09829,-0.070107,-0.075434,-0.095793,-0.096415,-0.095669,-0.095793,-0.095669,-0.095669,-0.096665,-0.096388,-0.09625,-0.096527,-0.096527,-0.096388,-0.096527,-0.096388,-0.096388,-0.096388,-0.105531
4,1.256434,0.974787,0.403596,0.106501,0.518697,0.736876,0.617943,0.101136,0.094213,1.135083,1.039142,-0.009352,0.057612,0.268524,0.224823,0.492558,0.722043,0.036372,0.588009,0.010699,1.037831,0.473947,0.15244,0.153748,0.740593,0.207954,0.10008,0.08736,0.101443,-0.024128,0.434146,1.249361,1.437849,0.846536,-0.031665,1.015884,-0.017989,0.179374,-0.014745,0.105071,...,-0.016617,-0.023214,-0.020868,-0.018384,-0.020868,-0.02056,-0.017447,-0.024693,-0.043651,-0.014485,-0.015134,-0.031037,-0.008812,-0.019438,-0.020831,-0.005335,-0.018894,-0.018477,-0.021794,-0.012288,-0.022236,-0.017732,-0.01776,-0.019728,-0.017642,-0.020145,-0.019728,-0.020145,-0.020145,-0.019929,-0.020162,-0.020279,-0.020046,-0.020046,-0.020162,-0.020046,-0.020162,-0.020162,-0.020162,-0.020979


In [0]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + preds_df

### Making Movie Recommendations

In [0]:
Ymean[:,np.newaxis].shape

(610, 1)

In [0]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
print(preds_df.shape)

(610, 9724)


In [0]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID -1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.user_id == userID]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False))

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movie_id'].isin(user_full['movie_id'])]. #all the movies not in the user_full recommendations
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

In [0]:
already_rated, predictions = recommend_movies(preds_df, 25, movies_df, ratings_df, 10)

In [0]:
already_rated.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres
13,25,68157,5.0,1535470515,Inglourious Basterds (2009),Action|Drama|War
12,25,60069,5.0,1535470523,WALLÂ·E (2008),Adventure|Animation|Children|Romance|Sci-Fi
23,25,180095,5.0,1535470476,Wonder (2017),Drama
22,25,177593,5.0,1535470532,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama
19,25,122912,5.0,1535470461,Avengers: Infinity War - Part I (2018),Action|Adventure|Sci-Fi
18,25,116797,5.0,1535470507,The Imitation Game (2014),Drama|Thriller|War
17,25,91529,5.0,1535470498,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX
16,25,79132,5.0,1535470428,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX
14,25,68954,5.0,1535470493,Up (2009),Adventure|Animation|Children|Drama
1,25,260,5.0,1535470429,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi


In [0]:
predictions

Unnamed: 0,movie_id,title,genres
275,318,"Shawshank Redemption, The (1994)",Crime|Drama
312,356,Forrest Gump (1994),Comedy|Drama|Romance|War
2220,2959,Fight Club (1999),Action|Crime|Drama|Thriller
895,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi
907,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
507,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
656,858,"Godfather, The (1972)",Crime|Drama
255,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
3187,4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...
8358,109487,Interstellar (2014),Sci-Fi|IMAX


###Train Test Split

In [0]:
movielens18.shape

(100836, 6)

In [0]:
train_df = movielens18.sample(frac=0.9, random_state=0)
test_df = movielens18.drop(train_df.index.tolist())
train_df.shape, test_df.shape

((90752, 6), (10084, 6))

In [0]:
from sklearn.metrics import mean_absolute_error

# df = fetch_ml20m_ratings()
# movielens18.drop(columns = ['title', 'genres'], inplace=True)
# movielens18.columns = df.columns

train = movielens18.sample(frac=0.8, random_state=7)
val = movielens18.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
test = movielens18.drop(train.index.tolist()).drop(val.index.tolist())

svd = SVD(learning_rate=0.01, regularization=0.1, n_epochs=100,
          n_factors=80, min_rating=1, max_rating=5)

svd.fit(X=train, X_val=val, early_stopping=True, shuffle=False)

pred = svd.predict(test)
mae = mean_absolute_error(test["rating"], pred)


In [0]:
min_user_clicks = 10
filter_users = train_df['user_id'].value_counts() > min_user_clicks
filter_users = filter_users[filter_users].index.tolist()

min_item_clicks = 10
filter_items = train_df['movie_id'].value_counts() > min_item_clicks
filter_items = filter_items[filter_items].index.tolist()

train_filtered = train_df[(train_df['user_id'].isin(filter_users)) & train_df['movie_id'].isin(filter_items)]
print('The original data frame shape:\t{}'.format(train_df.shape))
print('The new data frame shape:\t{}'.format(train_filtered.shape))


In [0]:
#Check that all users appears in both Train & Test

train_users_idx = list(train_filtered.user_id.unique())
test_df_idx = list(test_df.user_id.unique())
iters = list(set(train_users_idx) & set(test_df_idx))


train_filtered = train_filtered[train_filtered.user_id.isin(iters)]
test_df = test_df[test_df.user_id.isin(iters)]

In [0]:
#Check that all movies in Test appears in Train

train_movies_inx = list(train_filtered.movie_id.unique())
test_movies_inx = list(test_df.movie_id.unique())
uniques_movies_test = list(np.setdiff1d(test_movies_inx, train_movies_inx))
flt_lst = list(np.setdiff1d(test_movies_inx, uniques_movies_test))


test_df = test_df[test_df['movie_id'].isin(flt_lst)]

In [0]:
train_users = list(train_filtered.user_id.unique())
test_users = list(test_df.user_id.unique())
counter=0
for user in test_users:
    if user not in train_users:
        counter =+ 1
print("Number of non overlaps in users = {}".format(counter))

train_movies = list(train_filtered.movie_id.unique())
test_movies = list(test_df.movie_id.unique())
counter=0
for movie in test_movies:
    if movie not in train_movies:
        counter =counter + 1

print("Number movies in test that are not in train = {}".format(counter))

In [0]:
pivot_train = train_filtered.pivot_table(values = 'rating', index = 'movie_id', columns = 'user_id')
pivot_train.fillna(0, inplace = True)

pivot_test = test_df.pivot_table(values = 'rating', index = 'movie_id', columns = 'user_id')
pivot_test.fillna(0, inplace = True)

In [0]:
print("Train set contains {} ratings".format(np.sum(np.sum(pivot_train>0))))
print("Test set contains {} ratings".format(np.sum(np.sum(pivot_test>0))))

In [0]:
#Checking if movies in the test set are included in the train set

s_test = set(pivot_test.columns)
s_train = set(pivot_train.columns)

inter = s_test.issubset(s_train)
inter, len(s_train), len(s_test)

### Training the algorithm on the train set

In [0]:
R_df = pivot_train.copy()
R_df.head()

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,28,29,31,32,33,34,35,36,37,38,39,40,41,42,...,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,610
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,4.5,3.5,4.0,0.0,3.5,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,...,4.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,3.0,2.5,4.0,0.0,4.0,3.0,4.0,2.5,0.0,2.5,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.5,0.0,0.0,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,2.5,0.0,4.0,0.0,4.0,0.0,0.0,0.0,0.0,2.5,4.0,0.0,4.0,0.0,0.0,3.5,0.0,0.0,2.0,0.0
3,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,1.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,4.5,0.0,0.0,3.5,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,3.5,0.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,4.5,0.0,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,5.0


In [0]:
R = R_df.as_matrix()

Z = R>0
m, n = R.shape
Ymean = np.zeros(m)
Ynorm = np.zeros(R.shape)

for i in range(m):
    idx = Z[i, :] == 1
    Ymean[i] = np.mean(R[i, idx])
    Ynorm[i, idx] = R[i, idx] - Ymean[i]    

  """Entry point for launching an IPython kernel.


In [0]:
R_demeaned = R - Ymean[:,np.newaxis]

In [0]:
from scipy.sparse.linalg import svds

In [0]:
'''
The best way I've found to do this..
Looping over the test pivot and locating the corresponding cell in the train set,
by INDEX, where the test-pivot has a rating.
'''
def accuracy(pivot_test, preds_df, k):
    true_val = []
    prediction_val = []

    movie_intersect = list(np.setdiff1d(pivot_test.index, preds_df.index))
    user_intersect = list(np.setdiff1d(pivot_test.columns, preds_df.columns))

    for i in pivot_test.index: #movie_id
        for j in pivot_test.columns: #user_id
            if (pivot_test.loc[i,j]!=0) and (i not in movie_intersect):
                real_value = pivot_test.loc[i,j]
                prediction = preds_df.loc[i,j]
#                 print('movie_id = {},user_id = {}'.format(i,j))
#                 print("real value = {}, prediction = {}".format(real_value, prediction))
                prediction_val.append(preds_df.loc[i,j] + Ymean[:,np.newaxis])
                true_val.append(pivot_test.loc[i,j])

    true_val = np.array(true_val)
    validation_val = np.array(prediction_val)
    RMSE = np.round(np.sqrt(np.square(true_val - prediction_val).mean()),4)
    MAE = np.round(np.abs(true_val - prediction_val).mean(),4)

    print('RMSE: {}, MAE: {}, K: {}'.format(RMSE, MAE, k))
    return (RMSE, MAE)

In [0]:
results = pd.DataFrame(columns=['RMSE', 'MAE','K'])

In [0]:
pivot_train

In [0]:
preds_df.head()

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,28,29,31,32,33,34,35,36,37,38,39,40,41,42,...,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,610
1,3.451891,0.058611,0.007331,2.637868,1.099987,3.230932,2.198768,1.193644,0.458786,0.541504,1.229292,0.177793,0.473869,0.738335,1.67291,1.059495,1.665207,3.663445,3.391498,3.579316,2.991586,0.39903,0.528706,1.105328,0.41755,2.416811,0.737759,0.500821,2.068553,2.78212,1.729464,0.793354,0.362697,0.122071,0.626887,1.287499,2.11865,1.572042,-0.011347,1.154391,...,3.487177,-0.038712,1.532049,5.176255,0.576355,0.277724,0.045875,1.171996,0.056506,1.627237,2.119013,0.540326,0.587333,0.791465,1.663337,0.019029,2.753613,1.486589,0.874798,0.928583,3.031569,1.030969,1.573785,2.113194,2.120092,0.456212,3.90515,2.109267,0.163412,1.939983,2.508578,1.832005,2.196829,0.668947,1.430875,2.476204,-1.218595,1.408496,1.771635,1.495228
2,0.894914,-0.092731,-0.071086,-0.15337,0.607579,3.121587,0.844036,1.391994,-0.040986,0.064775,0.618743,0.037675,-0.331725,0.779259,0.774952,0.026932,0.314524,2.036401,2.183381,1.420304,1.639731,-0.117252,-0.253789,0.26345,0.381263,0.952063,0.144212,0.033835,0.410589,0.832148,0.306754,0.061445,0.333529,-0.116311,0.579957,1.402147,0.230994,1.575549,-0.115456,0.396768,...,1.31127,0.203133,0.310089,1.649999,0.720692,-0.225939,-0.149671,0.555731,-0.062473,0.610019,1.250978,0.258769,0.213432,0.328823,2.39724,-0.493491,1.055446,0.830552,0.747009,0.701016,2.81234,-0.184432,2.111701,0.304915,1.519398,-0.068936,1.571269,0.13226,-0.058371,2.277476,2.868166,0.359764,2.048927,-0.693193,1.512298,1.680663,0.940717,0.81751,2.109111,0.809867
3,0.835231,0.176799,0.135655,-0.103934,0.244541,2.452694,-0.046309,0.305062,0.118281,-0.349651,0.364915,0.166561,-0.138204,0.425359,-0.267285,-0.400672,-0.010952,0.602568,1.522893,0.051047,-0.466322,-0.073531,-0.250858,-0.004646,0.083104,0.35502,0.321876,0.010884,0.450327,1.284789,0.325924,-0.270504,0.249414,0.187452,0.237916,0.576842,-0.014429,0.662791,-0.14831,2.446762,...,0.28019,0.180126,0.34435,-0.196989,0.1121,0.036494,0.135883,0.440409,0.094003,-0.093739,0.27846,-0.085237,-0.240761,0.069985,0.562625,0.258966,0.128109,0.696193,0.370148,0.452293,0.896718,0.020287,0.741429,0.03503,0.439975,0.079905,-0.522108,0.628086,0.031336,2.302958,1.877882,-0.518013,1.039225,-0.445847,0.859887,0.517374,0.46573,0.2916,2.158124,-1.147367
4,-0.005933,0.14164,0.076914,0.209794,0.457974,2.790473,0.130251,0.405387,0.105704,0.138638,0.272641,0.223936,-0.114563,0.538392,-0.213365,-0.298493,0.024003,0.768882,0.185059,0.101214,0.353634,0.048443,0.105409,0.092231,0.171479,0.251265,0.184331,0.065643,0.679358,1.594514,0.368817,-0.269517,0.313365,0.017324,0.230821,0.685795,-0.081723,0.942186,0.075291,0.969324,...,0.160051,0.066241,0.259483,0.199997,0.164557,-0.005609,0.082995,0.126856,0.095371,-0.232842,-0.137198,-0.030546,0.034345,0.13373,0.810954,0.167301,0.465687,0.323077,0.467235,0.485481,0.63993,-0.129197,0.836729,0.168541,0.402371,0.096907,0.048056,0.007177,0.08454,1.284753,1.376774,-0.167675,1.334417,-0.529334,1.032968,0.651285,0.646068,-0.234083,0.370818,-0.21049
5,1.057611,-0.054667,0.036791,0.559091,0.60218,3.332881,1.133493,0.413266,0.024486,-0.378158,1.351953,-0.048798,-0.039464,0.733045,0.558456,0.23907,1.195875,3.026044,-0.143045,-1.669538,0.549019,-0.059388,1.428677,0.78096,0.134476,0.17039,3.51124,0.811889,1.423887,3.137265,1.2582,-0.118721,0.206144,-0.007862,0.14109,0.844985,0.805467,0.958085,0.449198,2.103651,...,1.85118,0.093355,1.2991,2.193886,-0.001891,-0.091884,-0.11647,1.492557,-0.071569,-0.42914,3.284227,-0.435802,-0.423406,-0.313958,0.543579,0.934421,0.854894,-0.280835,1.100163,0.832717,2.924837,0.034789,1.065256,0.661926,1.473646,0.182405,0.188328,1.601696,-0.144806,3.175701,-0.477687,-0.254032,2.508251,2.747775,1.319326,-0.241857,1.135724,0.94197,3.116881,2.827764


In [0]:
U, sigma, Vt = svds(R_demeaned, k = 50, maxiter= 500)
sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + Ymean[:,np.newaxis]
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df.index +=1
# RMSE, MAE = accuracy(pivot_test, preds_df, K)


In [0]:
# flattened_preds = pd.DataFrame(preds_df.to_records())
preds_df['movies'] = preds_df.index
preds_melted = pd.melt(preds_df, id_vars=['movies'], value_vars=preds_df.columns[:595])

In [0]:
preds_melted.head()
preds_melted.columns = ['movie_id', 'user_id', 'rating_predicted']

In [0]:
test_df = pd.merge(test_df, preds_melted,  how='inner', left_on=['user_id','movie_id'], right_on = ['user_id','movie_id'],)


In [0]:
test_df.shape

(3717, 9)

In [0]:
sum(test_df.rating_predicted.isna())

0

In [0]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(test_df.rating, test_df.rating_predicted)

3.1374667166646915

In [0]:
for K in range(1,2):
    U, sigma, Vt = svds(R_demeaned, k = K, maxiter= 500)
    sigma = np.diag(sigma)
    all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + Ymean[:,np.newaxis]
    preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
    RMSE, MAE = accuracy(pivot_test, preds_df, K)
    results.append([RMSE, MAE, K])

**The results are pretty bad (too bad, random model givs 1.5 RMSE)..**

We think there's a normalization factor we didn't account for, since the recomendations are okay.