# Movie Rating MF

In [1]:
!pip install torch
!pip install pandas
!pip install numpy


In [2]:
import pandas as pd
import numpy as np
import torch

## Loading the dataset from file
The MovieLens ratings are read from CSV files.

In [3]:
# Importing the rating data as DataFrame (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
dataset_path = "./data/ml-latest-small"
ratings_df = pd.read_csv(f"{dataset_path}/ratings.csv", delimiter=",")
ratings_df.columns = ["UserID", "MovieID", "Rating", "Timestamp"]
ratings_df = ratings_df.drop(columns=["Timestamp"]) # Timestamp column is not required to create the rating matrix

In [4]:
# Check the structure of the DataFrame
ratings_df.head()

,UserID,MovieID,Rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


## Creating a rating matrix
In the rating matrix, user ids are represented as rows and movie ids as columns.
The value of each cell states the rating a user has given to a movie.

**Do not change code in this section!**

In [5]:
rating_matrix = ratings_df.pivot(index="UserID", columns="MovieID", values="Rating")

In [6]:
# Checking the matrix structure
rating_matrix.head()

UserID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


As the output shows, the rating matrix is very sparse because most users have rated only a few movies.
The purpose of matrix factorization is to predict the missing ratings, based on the actual ratings.
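To make the notion of sparsity concrete, the following minimal sketch builds a tiny rating matrix with the same `pivot` call and measures the share of empty cells. The toy values and the names `toy_ratings`, `toy_matrix`, and `sparsity` are made up for illustration and are not part of the MovieLens data.

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (values are made up).
toy_ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 3],
    "MovieID": [10, 20, 10, 30],
    "Rating":  [4.0, 5.0, 3.0, 2.5],
})

# Same pivot as in the notebook: users as rows, movies as columns.
toy_matrix = toy_ratings.pivot(index="UserID", columns="MovieID", values="Rating")

n_observed = int(toy_matrix.notna().sum().sum())  # cells that hold a rating
n_cells = int(toy_matrix.size)                    # number of users * number of movies
sparsity = 1 - n_observed / n_cells               # share of cells without a rating

print(f"{n_observed} of {n_cells} cells are filled; sparsity = {sparsity:.2%}")
```

The same three lines at the end can be applied to `rating_matrix` to quantify how sparse the real data is.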

The following function looks up the rating a user has given to a specific movie.

In [7]:
# Example: Show the rating a user has given to a specific movie
def get_rating(rating_matrix, movie_id, user_id):
    return rating_matrix[movie_id][user_id]

get_rating(rating_matrix, movie_id=21, user_id=4)

3.0

## Splitting the rating matrix into training and test sets
The training set will be used to train the embeddings, while the test set is used to evaluate the performance.

**Do not change code in this section!**

In [8]:
# Determine the number of users and movies in the dataset
n_users, n_movies = rating_matrix.shape
print(f"The rating matrix contains {n_users} individual users and {n_movies} different movies.")

The rating matrix contains 610 individual users and 9724 different movies.


In [9]:
# Mask a subset of the rating matrix, that will be used as test set.
test_set_size = 0.1 # 10 % for testing
test_set_mask = rating_matrix.iloc[0:round(n_users * test_set_size),0:round(n_movies * test_set_size)].notna()

In [15]:
# Preparing the rating training set
train_ratings_df = rating_matrix.copy()
train_ratings_df[test_set_mask] = 0 # Hide test values in training matrix
train_ratings_df[train_ratings_df.isna()] = 0 # Replace NaN values with 0
train_ratings_df.head()

UserID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# Preparing the test set with ratings
test_ratings_df = rating_matrix[test_set_mask]
test_ratings_df[test_ratings_df.isna()] = 0
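The effect of the masking is easier to see on a small matrix. The sketch below repeats the same steps on made-up data; all `toy_*` names are hypothetical, and the mask is expanded to the full matrix shape explicitly, which is an assumption added here for clarity rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# A 3x3 toy rating matrix (rows: users, columns: movies); NaN means "not rated".
toy = pd.DataFrame(
    [[4.0, np.nan, 3.0],
     [np.nan, 5.0, np.nan],
     [2.0, np.nan, 1.0]],
    index=[1, 2, 3], columns=[10, 20, 30],
)

# Test block: first user, first two movies. Only observed cells become True.
toy_mask = toy.iloc[0:1, 0:2].notna()
# Expand the mask to the full shape; cells outside the block are False.
toy_mask = toy_mask.reindex(index=toy.index, columns=toy.columns, fill_value=False)

toy_train = toy.copy()
toy_train[toy_mask] = 0            # hide the test rating (user 1, movie 10)
toy_train = toy_train.fillna(0)    # unrated cells also become 0

toy_test = toy[toy_mask].fillna(0)  # keeps only the hidden rating

print(toy_train)
print(toy_test)
```

Note that in the training matrix a 0 can mean either "hidden for testing" or "never rated", so a model trained on it would typically treat 0 as missing rather than as a real rating.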

## Retrieving content description from TMDB
This section is optional. It provides a hands-on example of how additional content descriptions can be retrieved.

[API Documentation](https://developers.themoviedb.org/3/getting-started/introduction)
[API Client documentation](https://github.com/celiao/tmdbsimple)

In [13]:
!pip install tmdbsimple



In [14]:
import tmdbsimple as tmdb

# API key can be retrieved following the API documentation in https://developers.themoviedb.org/3/getting-started/introduction
tmdb.API_KEY = "INSERT_HERE"

# Reading the MovieLens links dataset
links_df = pd.read_csv(f"{dataset_path}/links.csv", delimiter=",")
links_df.columns = ["MovieID", "ImdbID", "TmdbID"]

# Getting the TMDB id for the MovieLens id
movie_id = 1
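# .iloc[0] selects by position, so the lookup also works when the row's index label is not 0
tmdb_id = int(links_df.loc[links_df.MovieID == movie_id, "TmdbID"].iloc[0])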

# Getting the information
movie = tmdb.Movies(tmdb_id).info()

# Genre
print(f"Genres: {movie['genres']}")

# Plot
print(f"Plot: {movie['overview']}")

Genres: [{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}, {'id': 35, 'name': 'Comedy'}]
Plot: Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.


## Model
This is where you should start implementing your model. Good luck and happy coding!
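As a possible starting point, here is a minimal sketch of matrix factorization trained with full-batch gradient descent in plain NumPy. It is not the intended solution for this notebook: the function name `factorize`, its parameters (`n_factors`, `n_epochs`, `lr`, `reg`), and the toy matrix are all assumptions made for illustration. It treats 0 as "no rating", matching how the training matrix is prepared above.

```python
import numpy as np

def factorize(ratings, n_factors=2, n_epochs=2000, lr=0.01, reg=0.01, seed=0):
    """Fit ratings ~ P @ Q.T on the non-zero cells with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    observed = ratings > 0                                  # 0 marks "no rating"
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))    # user embeddings
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))    # movie embeddings

    for _ in range(n_epochs):
        # Prediction error on observed cells only; unobserved cells contribute 0.
        error = np.where(observed, ratings - P @ Q.T, 0.0)
        # Gradients of 0.5 * squared error + 0.5 * reg * (|P|^2 + |Q|^2).
        grad_P = -error @ Q + reg * P
        grad_Q = -error.T @ P + reg * Q
        P -= lr * grad_P
        Q -= lr * grad_Q
    return P, Q

# Made-up 5x4 rating matrix; 0 means the user has not rated the movie.
toy = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 1.0, 5.0, 4.0],
])

P, Q = factorize(toy)
predictions = P @ Q.T                                       # dense matrix of predicted ratings
train_rmse = np.sqrt(np.mean((predictions - toy)[toy > 0] ** 2))
print(f"Training RMSE on observed cells: {train_rmse:.3f}")
```

For the real data, the same idea could be applied to `train_ratings_df.to_numpy()`, with the result wrapped back into a DataFrame that shares the index and columns of `rating_matrix`. A PyTorch version with embedding layers and an optimizer would follow the same structure.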

In [26]:
train_ratings_df
# Creating the embeddings, merging them, and building the model
# n_users, n_movies = train_ratings_df.shape  # the pivoted matrix has no UserID/MovieID columns
# n_latent_factors = 64  # hyperparameter to tune

# print(n_movies)
# print(n_users)

UserID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# Calculate rating predictions using your model. You should adapt this code ;)
prediction_matrix = rating_matrix.copy()

## Evaluation
In order to determine the accuracy of the model, the predicted ratings are compared with the actual ratings in the dataset.
For this evaluation, Root Mean Squared Error (RMSE) is used as the metric. It is calculated as follows:
$ RMSE = \sqrt{\frac{1}{|R|} \sum \limits _{(u,i) \in R} (\hat{r}_{ui} - r_{ui})^2} $

where $R$ is the set of user-item pairs $(u,i)$ with a known rating, $\hat{r}_{ui}$ is the predicted rating of user $u$ for item $i$, and $r_{ui}$ is the actual rating.
A lower RMSE indicates a more accurate model.
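A small worked example may help to check the formula by hand. The sketch below uses made-up values and the hypothetical names `actual`, `predicted`, and `known`; only cells with a known actual rating enter the sum.

```python
import numpy as np

# Made-up actual ratings (NaN = unknown) and predictions for 3 users and 2 movies.
actual    = np.array([[4.0, np.nan], [np.nan, 2.0], [5.0, 3.0]])
predicted = np.array([[3.5, 1.0],    [4.0, 2.0],    [4.0, 5.0]])

known = ~np.isnan(actual)                                  # the pairs (u, i) with a known rating
squared_errors = (predicted[known] - actual[known]) ** 2   # 0.25, 0.0, 1.0, 4.0
rmse = np.sqrt(squared_errors.mean())                      # sqrt(5.25 / 4)

print(f"RMSE over {known.sum()} known ratings: {rmse:.4f}")
```

The predictions for the two unknown cells do not affect the result, which is why the evaluation only makes sense on ratings that were held back from training.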

**Do not change code in this section!**

In [16]:
# Compute the RMSE
def RMSE(predictions, targets):
    differences = (predictions - targets)**2
    return np.sqrt(np.sum(np.sum(differences)) / np.sum(differences.count()))

# This method computes the RMSE on the test set. It is used to evaluate the accuracy of the models for this challenge.
def RMSE_Testset(predictions, targets):
    return RMSE(predictions[test_set_mask], targets[test_set_mask])

In [17]:
RMSE(prediction_matrix, rating_matrix)

0.0

In [18]:
RMSE_Testset(prediction_matrix, rating_matrix)

0.0

## Getting Recommendations
In this final section, we can infer the recommended movies for users.

In [19]:
# Reading the MovieLens movies dataset
movies_df = pd.read_csv(f"{dataset_path}/movies.csv", delimiter=",")
movies_df.columns = ["MovieID", "Title", "Genres"]

In [20]:
target_user = 101
k = 3 # Number of recommendations to show

sorted_by_rating = prediction_matrix.loc[target_user].sort_values(ascending=False)
recommendations = {}

for movie_id, rating in sorted_by_rating[0:k].items():  # Series.iteritems() was removed in pandas 2.0
    recommendations[movie_id] = {
        "Title": movies_df[movies_df.MovieID == movie_id].Title.values[0],
        "Predicted Rating": rating
    }

In [21]:
print(f"Top-{k} recommended movies for user {target_user}:")
pd.DataFrame(recommendations).transpose()

Top-3 recommended movies for user 101:


MovieID,Title,Predicted Rating
2712,Eyes Wide Shut (1999),5.0
1093,"Doors, The (1991)",5.0
2599,Election (1999),5.0
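Recommendations are usually most useful when they exclude movies the user has already rated. A minimal sketch of such a filter is shown below; the helper `top_k_unseen` and the toy Series are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_k_unseen(predicted_row, rated_row, k=3):
    """Return the k highest predicted ratings among movies the user has not rated."""
    unseen = rated_row.isna()   # NaN in the rating matrix means "not rated"
    return predicted_row[unseen].sort_values(ascending=False).head(k)

# Made-up data for one user: index = MovieID.
toy_rated = pd.Series([5.0, np.nan, np.nan, 3.0, np.nan], index=[1, 2, 3, 4, 5])
toy_pred  = pd.Series([4.9, 3.2, 4.5, 2.8, 4.1], index=[1, 2, 3, 4, 5])

top = top_k_unseen(toy_pred, toy_rated, k=2)
print(top)
```

With the notebook's variables, the call would presumably look like `top_k_unseen(prediction_matrix.loc[target_user], rating_matrix.loc[target_user], k)`, assuming `prediction_matrix` holds model predictions rather than a copy of `rating_matrix`.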
