# Model Training

In this section, we train **three models** that cover two recommendation approaches:

1. **Collaborative Filtering** using **KNN** and **NCF**
2. **Content-based Filtering** using **TF-IDF**

---

### Overview of the Models

- **Content-based Filtering**:  
  The **TF-IDF** model recommends movies based on the textual content of movie descriptions (such as the **title** and **overview**). It uses these columns to compute similarity scores between movies and recommend similar ones.

- **Collaborative Filtering**:  
  This method predicts whether a movie should be recommended to a user based on **numerical** columns like `user_id`, `movie_id`, and **ratings**. Since our dataset lacks actual `user_id` and `ratings` columns, we generate **synthetic `user_id` values**. Instead of actual ratings, we use the **`weighted_rating`** column, which is calculated during **feature engineering**, to simulate user preferences.

---

### Steps Involved
1. **Loading the Dataset**  
   We load the dataset containing movie data and preprocess it for model training.

2. **Generating Synthetic Data**  
   Since the dataset does not contain **`user_id`** or **`ratings`**, we create synthetic **`user_id`** values. The **`weighted_rating`** is used in place of the missing ratings to simulate movie preferences.

3. **Training Models**  
   We train three models:
   - **KNN** for Collaborative Filtering
   - **NCF** (Neural Collaborative Filtering) for Collaborative Filtering
   - **TF-IDF** for Content-based Filtering

Each model is trained on the preprocessed data to learn the relationships and provide recommendations.
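To make the synthetic-data step concrete, here is a minimal sketch using only the Python standard library. The movie IDs and weighted ratings are made-up values for illustration, and the fixed seed is an assumption added for repeatability; the notebook itself uses `np.random.randint` on the real dataset.

```python
import random

# Hypothetical movie IDs and weighted ratings, for illustration only
movie_ids = [101, 102, 103, 104, 105]
weighted_ratings = [7.9, 6.4, 8.1, 5.5, 7.0]

rng = random.Random(42)  # fixed seed (an assumption) so the sketch is repeatable

# One synthetic interaction per movie: a random user ID in [0, 1000)
interactions = [
    {"user_id": rng.randrange(1000), "movie_id": movie, "weighted_rating": rating}
    for movie, rating in zip(movie_ids, weighted_ratings)
]

for row in interactions:
    print(row)
```

Because each movie receives exactly one user ID, every movie appears in a single synthetic interaction.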

---

By the end of this notebook, we evaluate the models based on their performance and determine the best recommendation system for this dataset.

In [53]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

In [73]:
# Load the dataset
df = pd.read_csv("MoviesData_Processed.csv")

### Content-Based Movie Recommendation System

In this section, we implement a **Content-Based Filtering** recommendation system using **TF-IDF (Term Frequency-Inverse Document Frequency)** and **Cosine Similarity** to suggest movies based on their **overview**.

1. **Preprocessing**:  
   The first step involves filling any missing values in the **`overview`** column with an empty string to ensure the **TF-IDF Vectorizer** works smoothly without errors from `NaN` values.

2. **TF-IDF Vectorization**:  
   A **TF-IDF Vectorizer** is initialized with `stop_words='english'` to remove common English words (such as "the" and "is") that contribute little to the similarity calculation. The vectorizer is then applied to the **`overview`** column to create a **TF-IDF matrix**, where each row corresponds to a movie's overview and each column represents a word weighted by its importance across the corpus.

3. **Cosine Similarity Calculation**:  
   Next, the **Cosine Similarity** between every pair of movie overviews is computed from the **TF-IDF matrix**. A score closer to 1 means two overviews share more of their distinctive vocabulary; a score of 0 means they share no terms.

4. **Reverse Mapping for Movie Titles**:  
   A reverse mapping of **movie titles** to their corresponding **index** in the dataset is created using a **pandas Series**. This allows us to quickly look up the index of a movie given its title.

5. **Recommendation Function**:  
   The `content_based_recommendations` function is designed to take a **movie title** as input, find its index in the dataset, and retrieve a list of **top_n most similar movies** based on the cosine similarity. If the provided title is not found in the dataset, it prints a message and returns an empty list. Otherwise, it sorts the movies by similarity score, excludes the input movie itself, and returns the **top_n** most similar movies along with their overviews.

6. **Example Usage**:  
   An example usage of the `content_based_recommendations` function is provided with the movie title **"Spider-Man 2"**. The function returns the top 10 movies most similar to "Spider-Man 2" based on their **overviews**.

This method leverages the **TF-IDF** technique to identify movies with similar descriptions and recommend them to the user.
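For intuition, the vectorize-then-compare idea can be sketched on a three-sentence toy corpus in plain Python. This is an illustrative approximation, not the notebook's implementation: the sentences are invented, tokenization is a simple whitespace split with no stop-word removal, and the smoothed IDF formula and L2 normalization are written out by hand to mirror what `TfidfVectorizer` does with its default settings.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

# Invented overviews, for illustration only
docs = [
    "spider bites teenager who gains powers",
    "teenager with spider powers fights villain",
    "wedding planner falls in love",
]
vecs = tfidf_vectors(docs)
sims = [cosine(vecs[0], v) for v in vecs]

# Rank the other documents by similarity to document 0, excluding itself
ranking = sorted(
    (i for i in range(len(docs)) if i != 0),
    key=lambda i: sims[i],
    reverse=True,
)
print(sims, ranking)
```

The second sentence shares three terms with the first and ranks ahead of the third, which shares none and scores exactly 0.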

In [75]:
df['overview'] = df['overview'].fillna('')

# Initializing TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(df['overview'])

# Computing the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

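# Constructing a reverse map from movie titles to row indices.
# Keep only the first row for each title so a lookup always returns a single index.
indices = pd.Series(df.index, index=df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]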

def content_based_recommendations(title, cosine_sim=cosine_sim, df=df, indices=indices, top_n=10):
    idx = indices.get(title)
    if idx is None:
        print("Title not found in the dataset.")
        return []
    
    # Getting the pairwise similarity scores for this movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sorting the movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the top_n most similar movies (excluding itself)
    sim_scores = sim_scores[1:top_n+1]
    movie_indices = [i[0] for i in sim_scores]
    

    return df[['original_title', 'overview']].iloc[movie_indices]

# Example usage:
print(content_based_recommendations("Spider-Man 2"))

                original_title  \
5                 Spider-Man 3   
159                 Spider-Man   
20      The Amazing Spider-Man   
38    The Amazing Spider-Man 2   
1540             Arachnophobia   
3080        The House of Mirth   
813                   Superman   
1566                27 Dresses   
2002       Someone Like You...   
3792        Psycho Beach Party   

                                               overview  
5     The seemingly invincible Spider-Man goes up ag...  
159   After being bitten by a genetically altered sp...  
20    Peter Parker is an outcast high schooler aband...  
38    For Peter Parker, life is busy. Between taking...  
1540  A large spider from the jungles of South Ameri...  
3080  A woman risks losing her chance of happiness w...  
813   Mild-mannered Clark Kent works as a reporter a...  
1566  Altruistic Jane finds herself facing her worst...  
2002  Jane Goodale has everything going for her. She...  
3792  Spoof of 1960's Beach Party/Gidget surf

### Neural Collaborative Filtering (NCF) with Movie Recommendations

In this section, we implement a **Neural Collaborative Filtering (NCF)** model to recommend movies based on synthetic user ratings and movie data. The model follows several key steps, including data preprocessing, model setup, training, and evaluation.

1. **Preprocessing & Data Setup**:  
   The data is first cleaned by dropping rows with missing **`vote_count`** or **`vote_average`** values. An overall mean **C** of the vote average is calculated, and **`m`**, the 90th percentile of the **`vote_count`**, is computed to filter out movies with fewer than **m** votes. We then compute a **weighted rating** for each movie, which considers both the movie's rating and its popularity. Synthetic **`user_id`** values are generated for each movie, and the dataset is normalized and split into **train** and **test** sets.

2. **PyTorch Dataset Class**:  
   A custom PyTorch `Dataset` class, `MovieDataset`, is defined to handle the conversion of the data into a format suitable for deep learning. This class takes in the user IDs, movie IDs, and their corresponding **weighted ratings**, and formats them into tensors for efficient loading during training and testing.

3. **NCF Model Architecture**:  
   The **Neural Collaborative Filtering (NCF)** model is implemented using **embedding layers** for both users and movies. The embeddings are concatenated and passed through a series of **fully connected layers** with dropout for regularization, producing a final predicted rating. This architecture is designed to learn complex interactions between users and movies through embeddings.

4. **Model Initialization**:  
   The model is initialized with the appropriate number of user and movie embeddings, along with the loss function (`L1Loss` for mean absolute error) and the optimizer (`AdamW`). The learning rate scheduler is set to decrease the learning rate every 5 epochs by a factor of 0.5 to help with convergence.

5. **Weight Initialization**:  
   The weights of the model’s **fully connected layers** are initialized using **Xavier uniform** initialization to improve training stability.

6. **Training Loop**:  
   The training loop iterates over the data for a specified number of epochs. During each epoch, the model predicts ratings, computes the loss, and performs backpropagation to update the model parameters. Gradient clipping clamps each gradient element to the range [-1, 1] to prevent exploding gradients. Early stopping halts training if the average training loss fails to improve for 3 consecutive epochs.

7. **Evaluation**:  
   After training, the model is evaluated on the **test set**. Because the criterion is `L1Loss`, the reported test loss (averaged over batches) and the **Mean Absolute Error (MAE)** (averaged over samples) both measure how far the model's predictions fall from the normalized weighted ratings.

8. **Run Training and Evaluation**:  
   Finally, the model is trained for 20 epochs and then evaluated on the test data, with both the loss and MAE being printed out.
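The weighted rating from step 1 can be written as $WR = \frac{v}{v+m} R + \frac{m}{v+m} C$, where $v$ is the movie's vote count and $R$ its vote average. The sketch below works through the arithmetic; the values of `m`, `C`, and the two movies are hypothetical and chosen only to show how the formula pulls sparsely voted movies toward the overall mean.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical threshold and global mean, for illustration only
m, C = 500, 6.0

few_votes = weighted_rating(v=50, R=9.0, m=m, C=C)     # mostly the global mean
many_votes = weighted_rating(v=5000, R=9.0, m=m, C=C)  # mostly the movie's own average

print(f"{few_votes:.4f} {many_votes:.4f}")  # 6.2727 8.7273
```

Both movies have the same raw average of 9.0, but the one with only 50 votes ends up close to `C`, while the one with 5000 votes stays close to its own average.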

This implementation demonstrates how neural collaborative filtering can be applied for movie recommendation tasks by leveraging deep learning techniques to model the interactions between users and movies.
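As a quick check on the scheduler settings in step 4, the step schedule can be reproduced in plain Python. The helper `step_lr` is a hypothetical name used here for illustration; it only mirrors the closed-form rule of a step schedule (`step_size=5`, `gamma=0.5`, starting from `lr=0.001`) and does not call PyTorch.

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.5):
    """Learning rate in effect during a zero-based epoch under a step schedule."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate used in each of the 20 training epochs
schedule = [step_lr(0.001, epoch) for epoch in range(20)]

for epoch in (0, 5, 10, 15):
    print(f"epoch {epoch + 1}: lr = {schedule[epoch]:.6f}")
```

Under these settings the learning rate is halved three times over 20 epochs, ending at one eighth of its starting value.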

In [89]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# ==========================
# Preprocessing & Data Setup
# ==========================

# Drop NaN values in vote counts and averages
df = df.dropna(subset=['vote_count', 'vote_average'])

# Calculate overall mean vote average (C)
C = df['vote_average'].mean()

# Define m as the 90th percentile of vote_count
m = df['vote_count'].quantile(0.90)

# We filter out movies that have a vote_count less than m
qualified = df[df['vote_count'] >= m].copy()

# Function to compute the weighted rating
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m)) * R + (m/(v+m)) * C

# We calculate weighted rating and create a new column
qualified['weighted_rating'] = qualified.apply(weighted_rating, axis=1)

# Sorting movies by weighted rating
qualified = qualified.sort_values('weighted_rating', ascending=False)

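# Generate synthetic user_id and select relevant columns.
# Note: `ratings` is built from `df`, so it relies on a `weighted_rating` column
# already present in `df` (computed during feature engineering); the filtered
# `qualified` frame above is not used below.
df['user_id'] = np.random.randint(0, 1000, df.shape[0])  # Assign random user IDs

ratings = df[['user_id', 'id', 'weighted_rating']].dropna()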
ratings.rename(columns={'id': 'movie_id'}, inplace=True)

# Normalize weighted_rating
ratings['weighted_rating'] = (ratings['weighted_rating'] - ratings['weighted_rating'].min()) / \
                             (ratings['weighted_rating'].max() - ratings['weighted_rating'].min())

# Clip negative values (if any)
ratings['weighted_rating'] = ratings['weighted_rating'].clip(lower=0)

# Encode users and movies
total_users = ratings['user_id'].nunique()
total_movies = ratings['movie_id'].nunique()

user2idx = {user: idx for idx, user in enumerate(ratings['user_id'].unique())}
movie2idx = {movie: idx for idx, movie in enumerate(ratings['movie_id'].unique())}

ratings.loc[:, 'user_id'] = ratings['user_id'].map(user2idx)
ratings.loc[:, 'movie_id'] = ratings['movie_id'].map(movie2idx)

# Train-test split
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# ==========================
# PyTorch Dataset Class
# ==========================

class MovieDataset(Dataset):
    def __init__(self, data):
        self.users = torch.tensor(data['user_id'].values, dtype=torch.long)
        self.movies = torch.tensor(data['movie_id'].values, dtype=torch.long)
        self.ratings = torch.tensor(data['weighted_rating'].values, dtype=torch.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.movies[idx], self.ratings[idx]

# DataLoaders
train_dataset = MovieDataset(train_data)
test_dataset = MovieDataset(test_data)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# ==========================
# Neural Collaborative Filtering (NCF) Model
# ==========================

class NCF(nn.Module):
    def __init__(self, num_users, num_movies, embed_size=64):
        super(NCF, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embed_size)
        self.movie_embedding = nn.Embedding(num_movies, embed_size)
        self.fc_layers = nn.Sequential(
            nn.Linear(embed_size * 2, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1)
        )
    
    def forward(self, user, movie):
        user_embedded = self.user_embedding(user)
        movie_embedded = self.movie_embedding(movie)
        interaction = torch.cat([user_embedded, movie_embedded], dim=-1)
        output = self.fc_layers(interaction)
        # Drop only the last dimension so a batch of one keeps shape [1]
        return output.squeeze(-1)

# Initialize model, loss, optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ncf_model = NCF(total_users, total_movies).to(device)
criterion = nn.L1Loss()  # Mean Absolute Error
optimizer = optim.AdamW(ncf_model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

# ==========================
# Model Weight Initialization
# ==========================

def weights_init(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

ncf_model.apply(weights_init)

# ==========================
# Training Loop
# ==========================

def train(model, train_loader, criterion, optimizer, scheduler, epochs=20):
    model.train()
    best_loss = float('inf')
    patience, counter = 3, 0
    for epoch in range(epochs):
        total_loss = 0
        for users, movies, ratings in train_loader:
            users, movies, ratings = users.to(device), movies.to(device), ratings.to(device)
            optimizer.zero_grad()
            predictions = model(users, movies)
            loss = criterion(predictions, ratings)
            loss.backward()

            # Gradient clipping: clamp every gradient element to [-1, 1]
            nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

            optimizer.step()
            total_loss += loss.item()
        scheduler.step()
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}, Loss: {avg_loss:.4f}")
        
        if avg_loss < best_loss:
            best_loss = avg_loss
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                print("Early stopping triggered")
                break

# ==========================
# Evaluation Function
# ==========================

def evaluate(model, test_loader, criterion):
    model.eval()
    total_loss = 0
    total_absolute_error = 0  # Variable to store the sum of absolute errors
    with torch.no_grad():
        for users, movies, ratings in test_loader:
            users, movies, ratings = users.to(device), movies.to(device), ratings.to(device)
            predictions = model(users, movies)
            
            # Compute the loss
            loss = criterion(predictions, ratings)
            total_loss += loss.item()
            
            # Calculate absolute errors for MAE
            absolute_error = torch.abs(predictions - ratings)
            total_absolute_error += absolute_error.sum().item()
    
    # Calculate MAE and average loss
    mae = total_absolute_error / len(test_loader.dataset)
    avg_loss = total_loss / len(test_loader)
    
    print(f"Test Loss: {avg_loss:.4f}, Test MAE: {mae:.4f}")

# ==========================
# Run Training and Evaluation
# ==========================

train(ncf_model, train_loader, criterion, optimizer, scheduler, epochs=20)
evaluate(ncf_model, test_loader, criterion)

Epoch 1, Loss: 0.7332
Epoch 2, Loss: 0.6341
Epoch 3, Loss: 0.5369
Epoch 4, Loss: 0.4552
Epoch 5, Loss: 0.4455
Epoch 6, Loss: 0.4136
Epoch 7, Loss: 0.3913
Epoch 8, Loss: 0.3676
Epoch 9, Loss: 0.3320
Epoch 10, Loss: 0.3380
Epoch 11, Loss: 0.3313
Epoch 12, Loss: 0.3302
Epoch 13, Loss: 0.3168
Epoch 14, Loss: 0.3165
Epoch 15, Loss: 0.2998
Epoch 16, Loss: 0.2962
Epoch 17, Loss: 0.2930
Epoch 18, Loss: 0.2961
Epoch 19, Loss: 0.2923
Epoch 20, Loss: 0.2877
Test Loss: 0.2443, Test MAE: 0.2405
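The test loss and test MAE above are both absolute errors, yet they differ slightly. The `evaluate` function averages per-batch mean losses for the first number and divides the summed absolute error by the number of samples for the second, and these two averages disagree whenever the final batch is smaller than the rest. The sketch below shows the effect with made-up error values; `batch_mean_vs_sample_mean` is a hypothetical helper, not part of the notebook.

```python
def batch_mean_vs_sample_mean(abs_errors, batch_size):
    """Return (mean of per-batch means, mean over all samples)."""
    batches = [
        abs_errors[i:i + batch_size]
        for i in range(0, len(abs_errors), batch_size)
    ]
    mean_of_batch_means = sum(sum(b) / len(b) for b in batches) / len(batches)
    per_sample_mean = sum(abs_errors) / len(abs_errors)
    return mean_of_batch_means, per_sample_mean

# Made-up absolute errors: one full batch of four and a final batch of one
errors = [0.1, 0.1, 0.1, 0.1, 0.5]
loss_like, mae_like = batch_mean_vs_sample_mean(errors, batch_size=4)

print(f"{loss_like:.2f} {mae_like:.2f}")  # 0.30 0.18
```

The per-sample figure is the one that corresponds to MAE over the whole test set, since it weights every prediction equally.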


### K-Nearest Neighbors (KNN) Regressor for Movie Rating Prediction

In this section, we implement a **K-Nearest Neighbors (KNN)** regressor that predicts a movie's normalized **weighted rating** from encoded user and movie IDs.

1. **Preprocessing**:  
   The dataset is loaded and rows with missing **`vote_count`** or **`vote_average`** values are dropped. The **mean vote average (C)** and the 90th percentile of **`vote_count`** (**m**) are calculated. Movies with a vote count below **m** are excluded, and a **weighted rating** is calculated for each remaining movie, considering both the movie's rating and its popularity.

2. **Simulating User-Item Interactions**:  
   A random **`user_id`** is assigned to each movie, and the **`user_id`** and **`movie_id`** columns are encoded with **LabelEncoder** for easier manipulation.

3. **Normalization and Split**:  
   The **`weighted_rating`** is normalized using **MinMaxScaler** so that the values lie between 0 and 1. The data is split into **training** and **test** sets, with **`user_id`** and **`movie_id`** as features and the **normalized ratings** as the target.

4. **Training and Evaluation**:  
   A **KNN Regressor** with 10 neighbors is trained on the training set, using **cosine distance** as the metric, and predictions are made on the test set. The model's performance is evaluated using the **Mean Absolute Error (MAE)**, which is printed at the end.
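The encoding and normalization steps described above can be sketched in plain Python. The helpers `label_encode` and `min_max_scale` are hypothetical stand-ins written for illustration; they follow the same rules as `LabelEncoder` (sorted unique values mapped to 0, 1, 2, ...) and `MinMaxScaler` with its default range, and the input values are made up.

```python
def label_encode(values):
    """Map each distinct value to its position in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {value: index for index, value in enumerate(classes)}
    return [mapping[value] for value in values], classes

def min_max_scale(values):
    """Rescale values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Made-up raw IDs and weighted ratings, for illustration only
encoded_users, user_classes = label_encode([42, 7, 42, 900])
scaled_ratings = min_max_scale([6.0, 7.0, 8.0])

print(encoded_users, user_classes)  # [1, 0, 1, 2] [7, 42, 900]
print(scaled_ratings)               # [0.0, 0.5, 1.0]
```

Note that `min_max_scale` assumes at least two distinct values; with a constant input the denominator would be zero.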

In [91]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

# Load the dataset
df = pd.read_csv("MoviesData_Processed.csv")

# Drop NaN values in vote counts and averages
df = df.dropna(subset=['vote_count', 'vote_average'])

# Calculate overall mean vote average (C)
C = df['vote_average'].mean()

# Define m as the 90th percentile of vote_count
m = df['vote_count'].quantile(0.90)

# Filter out movies that have a vote_count less than m
qualified = df[df['vote_count'] >= m].copy()

# Function to compute the weighted rating
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m)) * R + (m/(v+m)) * C

# Calculate weighted rating and create a new column
qualified['weighted_rating'] = qualified.apply(weighted_rating, axis=1)

# Sorting movies by weighted rating
qualified = qualified.sort_values('weighted_rating', ascending=False)

# Simulate user-item interactions
qualified.loc[:, 'user_id'] = np.random.randint(0, 1000, qualified.shape[0])  # Simulated users
ratings = qualified[['user_id', 'id', 'weighted_rating']].dropna()
ratings.rename(columns={'id': 'movie_id'}, inplace=True)

# Encode users and movies
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()
ratings.loc[:, 'user_id'] = user_encoder.fit_transform(ratings['user_id'])
ratings.loc[:, 'movie_id'] = movie_encoder.fit_transform(ratings['movie_id'])

# Normalize weighted ratings
scaler = MinMaxScaler()
ratings.loc[:, 'rating'] = scaler.fit_transform(ratings[['weighted_rating']])

# Train-test split
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# Prepare training and test sets
X_train = train_data[['user_id', 'movie_id']]
y_train = train_data['rating']
X_test = test_data[['user_id', 'movie_id']]
y_test = test_data['rating']

# Train KNN model
knn_model = KNeighborsRegressor(n_neighbors=10, metric='cosine')
knn_model.fit(X_train, y_train)

# Predict on test set
y_pred = knn_model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.4f}")

Mean Absolute Error: 0.1617
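When reading this result, it may help to recall what `metric='cosine'` compares when the features are the two encoded ID columns: the direction of each `(user_id, movie_id)` vector, not how close the IDs are numerically. The following sketch uses invented ID pairs and a hand-written `cosine_distance` helper (one minus cosine similarity) to illustrate the point.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Invented (user_id, movie_id) pairs, for illustration only
same_direction = cosine_distance((1, 2), (50, 100))   # very different IDs, same ratio
different_direction = cosine_distance((1, 2), (2, 1))  # nearby IDs, different ratio

print(f"{same_direction:.4f} {different_direction:.4f}")  # 0.0000 0.2000
```

Under this metric, two rows count as neighbors when their user-to-movie ID ratios are similar, which is worth keeping in mind when comparing the KNN result with the NCF result.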
