## Matrix Factorization: Movie Recommender System

### About:
##### In this project, I will be using Matrix Factorization technique partnered with layer embedment to train movie recommendation system on an movie review dataset. Particularly with this model, stakeholders will be able to employ this recommender system on any movie or television streaming service such as Netflix to precisely recommend a movie based off of an end users watch history and the respected review they gave.

##### This repository includes a Jupyter Notebook that incorporates a matrix factorization technique for recommender systems, best described in the academic paper "Matrix Factorization Techniques for Recommender Systems" by Koren, Yehuda, et al. “Matrix Factorization Techniques for Recommender Systems.” Datajobs, IEEE Computer Society, 7 Aug. 2009, https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf. Accessed 7 July 2024. 

<img src='https://csdl-images.ieeecomputer.org/mags/co/2009/08/figures/mco20090800301.gif' width='800'>

##### In addition, this project utilizes embedding layers to represent users and movies as dense vectors. These embeddings capture latent factors from the user and item ID indices, best described in the academic paper "Neural Collaborative Filtering" by He, Xiangnan, et al. “Neural Collaborative Filtering.” arXiv, 16 Aug. 2017, https://arxiv.org/abs/1708.05031. Accessed 7 July 2024. 

<img src='https://miro.medium.com/v2/resize:fit:1400/1*aP-Mx266ExwoWZPSdHtYpA.png' width='800'>

In [10]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
import opendatasets as od

In [14]:
od.download("https://www.kaggle.com/datasets/vinothkumarj280204/movie-recommendation-system-dataset")

Dataset URL: https://www.kaggle.com/datasets/vinothkumarj280204/movie-recommendation-system-dataset
Downloading movie-recommendation-system-dataset.zip to .\movie-recommendation-system-dataset


100%|██████████| 165M/165M [00:04<00:00, 41.4MB/s] 





In [2]:
# Load data
movies = pd.read_csv('movie-recommendation-system-dataset/movies.csv') 
ratings = pd.read_csv('movie-recommendation-system-dataset/ratings.csv')

# Display the first few rows of each dataframe
print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510


In [3]:
# Encode userID and movieID as categorical data
ratings['userId'] = ratings['userId'].astype('category').cat.codes.values
ratings['movieId'] = ratings['movieId'].astype('category').cat.codes.values

In [5]:
# Split data into training and test sets
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

In [6]:
# Convert data to tensors
train_users = torch.tensor(train_data['userId'].values, dtype=torch.long)
train_movies = torch.tensor(train_data['movieId'].values, dtype=torch.long)
train_ratings = torch.tensor(train_data['rating'].values, dtype=torch.float32)

test_users = torch.tensor(test_data['userId'].values, dtype=torch.long)
test_movies = torch.tensor(test_data['movieId'].values, dtype=torch.long)
test_ratings = torch.tensor(test_data['rating'].values, dtype=torch.float32)

In [7]:
# Check if CUDA is available and move model to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [8]:
# Move tensors to GPU
train_users = train_users.to(device)
train_movies = train_movies.to(device)
train_ratings = train_ratings.to(device)

test_users = test_users.to(device)
test_movies = test_movies.to(device)
test_ratings = test_ratings.to(device)

In [9]:
# Dataset class
class RatingsDataset(Dataset):
    def __init__(self, users, movies, ratings):
        self.users = users
        self.movies = movies
        self.ratings = ratings
    
    def __len__(self):
        return len(self.ratings)
    
    def __getitem__(self, idx):
        try:
            return self.users[idx], self.movies[idx], self.ratings[idx]
        except Exception as e:
            print(f"Error at index {idx}: {e}")
            raise e

In [10]:
# Dataset class
class RatingsDataset(Dataset):
    def __init__(self, users, movies, ratings):
        self.users = users
        self.movies = movies
        self.ratings = ratings
    
    def __len__(self):
        return len(self.ratings)
    
    def __getitem__(self, idx):
        return self.users[idx], self.movies[idx], self.ratings[idx]

In [11]:
# DataLoader
train_dataset = RatingsDataset(train_users, train_movies, train_ratings)
test_dataset = RatingsDataset(test_users, test_movies, test_ratings)

In [12]:
train_loader = DataLoader(train_dataset, batch_size=4096, shuffle=True, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=4096, shuffle=False, num_workers=0)

In [13]:
class MatrixFactorization(nn.Module):
    def __init__(self, num_users, num_movies, embedding_dim):
        super(MatrixFactorization, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.movie_embedding = nn.Embedding(num_movies, embedding_dim)
    
    def forward(self, user, movie):
        user_emb = self.user_embedding(user)
        movie_emb = self.movie_embedding(movie)
        return (user_emb * movie_emb).sum(1)

# Get number of users and movies
num_users = ratings['userId'].nunique()
num_movies = ratings['movieId'].nunique()
embedding_dim = 50  # This can be adjusted

In [14]:
# Instantiate the model
model = MatrixFactorization(ratings['userId'].nunique(), ratings['movieId'].nunique(), 50).to(device)

In [15]:
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = Adam(model.parameters(), lr=0.005)

In [16]:
# Training with mixed precision
scaler = torch.cuda.amp.GradScaler()

In [17]:
# Training loop
num_epochs = 5  # Adjust as necessary

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for batch_users, batch_movies, batch_ratings in train_loader:
        batch_users, batch_movies, batch_ratings = batch_users.to(device), batch_movies.to(device), batch_ratings.to(device)
        
        optimizer.zero_grad()
        
        # Mixed precision
        with torch.cuda.amp.autocast():
            predictions = model(batch_users, batch_movies)
            loss = criterion(predictions, batch_ratings)
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        
        epoch_loss += loss.item()
    
    avg_train_loss = epoch_loss / len(train_loader)
    
    # Evaluate on test set
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for batch_users, batch_movies, batch_ratings in test_loader:
            batch_users, batch_movies, batch_ratings = batch_users.to(device), batch_movies.to(device), batch_ratings.to(device)
            
            with torch.cuda.amp.autocast():
                predictions = model(batch_users, batch_movies)
                loss = criterion(predictions, batch_ratings)
                test_loss += loss.item()
    
    avg_test_loss = test_loss / len(test_loader)
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.4f}, Test Loss: {avg_test_loss:.4f}')
    
# Save the model
torch.save(model.state_dict(), 'matrix_factorization.pth')

Epoch 1/5, Train Loss: 14.7898, Test Loss: 1.8375
Epoch 2/5, Train Loss: 1.1504, Test Loss: 1.2399
Epoch 3/5, Train Loss: 0.9439, Test Loss: 1.0735
Epoch 4/5, Train Loss: 0.8794, Test Loss: 0.9794
Epoch 5/5, Train Loss: 0.8086, Test Loss: 0.9215


In [21]:
# Function to compute RMSE
def rmse(predictions, targets):
    return torch.sqrt(((predictions - targets) ** 2).mean())

# Function to compute MAE
def mae(predictions, targets):
    return torch.abs(predictions - targets).mean()

In [22]:
# Evaluate the model
model.eval()
with torch.no_grad():
    # Move test data to device
    test_users = test_users.to(device)
    test_movies = test_movies.to(device)
    test_ratings = test_ratings.to(device)
    
    # Get predictions
    test_predictions = model(test_users, test_movies)
    
    # Compute RMSE and MAE
    test_rmse = rmse(test_predictions, test_ratings)
    test_mae = mae(test_predictions, test_ratings)

    print(f'Test RMSE: {test_rmse.item()}')
    print(f'Test MAE: {test_mae.item()}')

Test RMSE: 0.9599725604057312
Test MAE: 0.7120007872581482


In [23]:
def recommend_movies(user_id, num_recommendations=5):
    user_id_tensor = torch.tensor([user_id]).to(device)
    movie_ids = torch.arange(num_movies).to(device)
    
    model.eval()
    with torch.no_grad():
        predictions = model(user_id_tensor.repeat(num_movies), movie_ids)
    
    top_movies = predictions.argsort(descending=True)[:num_recommendations]
    recommended_movie_ids = movie_ids[top_movies].cpu().numpy()
    recommended_movies = movies.loc[movies['movieId'].isin(recommended_movie_ids)]
    
    return recommended_movies

In [24]:
# Recommend movies for a given user
user_id = 1  # Example user ID
recommended_movies = recommend_movies(user_id)
print(recommended_movies)

       movieId              title         genres
9725     32316  River, The (1951)  Drama|Romance
11217    49312   Snow Cake (2006)          Drama
