# BT4222 Hybrid Recommendation with LightFM

## Introduction
This notebook develops and evaluates a LightFM-based hybrid recommender: a model that combines collaborative filtering with content features using the LightFM library.

- The model transforms user-movie ratings into interaction matrices and incorporates movie metadata (e.g., genres, director, cast) as item features.
- It is trained using WARP (Weighted Approximate-Rank Pairwise) loss to optimize for ranking.
- Evaluation is performed using standard and temporal split strategies, and results are reported using top-K metrics (Precision, Recall, NDCG, MAP, Hit Rate, MRR) together with AUC.






## Mount Google Drive
This code cell mounts your Google Drive in the Colab environment, enabling access to datasets stored in your drive for subsequent processing.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Libraries and Load Raw Datasets
This section imports the required libraries and loads the primary datasets used in the project:

- `df_tmdb_final.csv`: Final movie metadata with embeddings

- `df_links_with_ratings.csv`: User ratings merged with TMDB movie IDs

These datasets are merged to form `df_merged`, a comprehensive dataset that links user preferences to rich movie features.

In [None]:
!pip install lightfm
import pandas as pd
import numpy as np
import spacy
import ast
import scipy.sparse as sp
import re
from scipy.sparse import csr_matrix, identity, coo_matrix
from ast import literal_eval
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD

from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import auc_score
from lightfm.evaluation import reciprocal_rank
from lightfm.cross_validation import random_train_test_split

import random
from tqdm import tqdm




In [None]:
# Load Datasets
df_tmdb_final = pd.read_csv("/content/drive/MyDrive/BT4222_Project/df_tmdb_final.csv")
df_links_with_ratings = pd.read_csv("/content/drive/MyDrive/BT4222_Project/df_links_with_ratings.csv")

# Step 1: Prepare Content Feature Dataset
We create a new dataframe df_content by selecting only the columns needed for content-based recommendation:

- `movie_id`: Unique identifier for each movie  
- `original_title`: Movie title  
- `genre`: Genre tags (typically as lists or encoded vectors)  
- `director_embedding`: Numeric embedding representing the movie’s director  
- `main_cast_embeddings`: Averaged or stacked embeddings of main cast members  
- `production_company_embedding`: Embedding of the associated production company  
- `original_language_embedding`: Embedding for the movie's original language  


### 1.1 String Cleaning through parse_embedding_from_array_wrapper
When the embeddings were extracted and saved during data cleaning, they were stored as strings in the CSV. These strings sometimes include extra wrappers, such as `array( ... )`, and have numbers separated by spaces instead of commas, so they are not valid Python literals. The `parse_embedding_from_array_wrapper` function recovers the numeric vector by:

- **Removing Array Wrappers**:
It uses a regular expression to detect and remove the `array(...)` wrapper so that only the inner content remains.

- **Stripping Brackets and Newlines**:
It removes newlines and the `[` and `]` characters, leaving a flat sequence of number tokens.

- **Extracting Numbers**:
It then uses another regular expression to pull out the floating-point tokens (including scientific notation) and pads with zeros or truncates so that each vector has exactly `dim` entries.
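
As a rough illustration of this kind of cleaning, the sketch below parses a printed NumPy array into a fixed-length vector. The helper name `parse_array_string`, the regular expression, and the sample string are assumptions made for this example; they are not the notebook's own function.

```python
import re
import numpy as np

# Hypothetical pattern: matches tokens such as "0.5", "-1.25", "2." or "1e-05".
# The lookbehind skips digits embedded in words such as "float32".
NUMBER = re.compile(r'(?<![\w.])-?\d+\.?\d*(?:[eE][+-]?\d+)?')

def parse_array_string(s, dim=4):
    """Extract the numbers from a printed NumPy array, then pad or truncate to `dim`."""
    values = [float(tok) for tok in NUMBER.findall(s)]
    vec = np.zeros(dim, dtype=np.float32)
    n = min(len(values), dim)
    vec[:n] = values[:n]
    return vec

sample = "array([ 0.5, -1.25,\n        2.  ], dtype=float32)"
vec = parse_array_string(sample)
print(vec)
```

The three numbers are recovered in order and the fourth slot is zero-padded, which is the same pad-or-truncate behaviour the notebook relies on to keep every embedding at a fixed length.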

### 1.2 Defining parse_embedding
When the embeddings were extracted and saved during the data cleaning process, they were converted to strings in the CSV file. This makes it necessary to parse these string representations back into numerical arrays before further processing. The parse_embedding function handles this conversion by:

- **Checking if the input is a string**:
It may be a string representation of a list (e.g., "[0.1, 0.2, 0.3]"), so we need to convert it back to a Python list and then into a NumPy array.

- **Handling errors gracefully**:
If the string cannot be converted (due to formatting issues), the function returns a zero vector with the specified dimension as a fallback.

- **Working with lists or arrays**:
If the embedding is already in list or array format, it converts (or reaffirms) it as a NumPy array.
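
The three cases above can be sketched as follows. The helper name `parse_list_string` and the small `dim` are chosen for illustration only; the notebook's own `parse_embedding` is defined in the code cell that follows.

```python
import ast
import numpy as np

def parse_list_string(x, dim=3):
    """Hypothetical helper: parse "[0.1, 0.2, 0.3]" into an array, or return zeros."""
    if isinstance(x, str):
        try:
            return np.asarray(ast.literal_eval(x.strip()), dtype=float)
        except (ValueError, SyntaxError):
            # Not a valid Python literal: fall back to a zero vector
            return np.zeros(dim)
    if isinstance(x, (list, np.ndarray)):
        return np.asarray(x, dtype=float)
    return np.zeros(dim)

good = parse_list_string("[0.1, 0.2, 0.3]")
bad = parse_list_string("[0.1 0.2 0.3]")    # space-separated, so not a valid literal
already = parse_list_string([1, 2, 3])      # lists pass straight through
```

A space-separated string fails `ast.literal_eval`, which is why such columns need the wrapper-based parser from section 1.1 instead.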


In [None]:
#step 1: Preparing the Dataset

# Select relevant content-based features
df_content = df_tmdb_final[[
    'movie_id',
    'original_title',
    'genre',
    'director_embedding',
    'main_cast_embeddings',
    'production_company_embedding',
    'original_language_embedding'
]].drop_duplicates(subset='movie_id').reset_index(drop=True)


def parse_embedding_from_array_wrapper(s, dim=300):
    """
    Convert messy string embeddings (with array(...) wrappers or irregular formatting)
    into fixed-length NumPy vectors.
    """
    # Missing or non-string values fall back to a zero vector
    if not isinstance(s, str):
        return np.zeros(dim, dtype=np.float32)

    # Clean wrappers and whitespace
    s = s.replace('\n', ' ')                     # Remove newlines first so the wrapper regex can span lines
    s = re.sub(r'array\((.*?)\)', r'\1', s)      # Remove 'array(...)'
    s = s.replace('[', '').replace(']', '')      # Strip brackets

    # Extract numbers. This also matches tokens such as '0.' or '1e-05', and the
    # lookbehind skips digits embedded in words such as 'float32'.
    numbers = re.findall(r'(?<![\w.])-?\d+\.?\d*(?:[eE][+-]?\d+)?', s)

    try:
        vec = np.array([float(n) for n in numbers], dtype=np.float32)
    except ValueError:
        return np.zeros(dim, dtype=np.float32)

    # Pad or truncate to target dimension
    if len(vec) < dim:
        vec = np.pad(vec, (0, dim - len(vec)), mode='constant')
    elif len(vec) > dim:
        vec = vec[:dim]

    return vec


def parse_embedding(x, dim=300):
    """
    Parses an embedding from either a valid string or an actual list/array.
    """
    if isinstance(x, str):
        x = x.strip()
        if x.startswith('[') and x.endswith(']') and 'array' not in x:
            try:
                return np.array(ast.literal_eval(x))
            except (ValueError, SyntaxError, TypeError):  # malformed literal
                return np.zeros(dim)
    elif isinstance(x, (list, np.ndarray)):
        return np.array(x)

    return np.zeros(dim)



embedding_cols = [
    'main_cast_embeddings',
    'director_embedding',
    'production_company_embedding',
    'original_language_embedding',
]

# Apply wrapper-based parsing to messy embedding strings
for col in embedding_cols:
    df_content[col] = df_content[col].apply(lambda v: parse_embedding_from_array_wrapper(v, dim=300))

# Apply standard parsing to clean genre embeddings
df_content['genre'] = df_content['genre'].apply(lambda v: parse_embedding(v, dim=300))



df_merged = pd.merge(df_links_with_ratings, df_content, left_on='tmdbId', right_on='movie_id', how='inner')
# df_content is kept in memory: the validity check and create_item_features() below still use it

In [None]:
embedding_cols = [
    'main_cast_embeddings',
    'director_embedding',
    'production_company_embedding',
    'original_language_embedding',
    'genre'
]

for col in embedding_cols:
    print(f"{col} — valid (non-zero) vectors:", df_content[col].apply(lambda x: np.sum(x) > 0).sum())

main_cast_embeddings — valid (non-zero) vectors: 268
director_embedding — valid (non-zero) vectors: 1216
production_company_embedding — valid (non-zero) vectors: 1341
original_language_embedding — valid (non-zero) vectors: 1418
genre — valid (non-zero) vectors: 1446


# Step 2: Building the User-Item Interaction Matrix for LightFM

In this step, we convert user-movie ratings into a sparse matrix format suitable for LightFM.  
Each user and movie is mapped to a unique matrix index. Ratings are binarized for implicit feedback  
(a rating of 4 or higher counts as a positive interaction), and the final result is a CSR matrix used for model training.

Outputs:
- `interaction_matrix`: Sparse matrix of shape *(num_users × num_movies)*
- `user_mapper`, `item_mapper`: Maps original IDs to matrix indices
- `user_inverse_mapper`, `item_inverse_mapper`: Maps indices back to original IDs
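
The mapping and binarization can be sketched on a handful of toy ratings. The user and movie IDs below are invented, and the call to `eliminate_zeros()` is an optional extra step shown only for comparison; it is not part of the notebook's function.

```python
import numpy as np
import scipy.sparse as sp

# Toy ratings as (user, movie, rating); the IDs are invented for illustration.
ratings = [("u1", 10, 5.0), ("u1", 20, 2.0), ("u2", 10, 4.0), ("u3", 30, 3.5)]

users = sorted({u for u, _, _ in ratings})
items = sorted({m for _, m, _ in ratings})
user_mapper = {u: i for i, u in enumerate(users)}
item_mapper = {m: i for i, m in enumerate(items)}

rows = [user_mapper[u] for u, _, _ in ratings]
cols = [item_mapper[m] for _, m, _ in ratings]
vals = [1 if r >= 4 else 0 for _, _, r in ratings]   # binarize at rating >= 4

matrix = sp.coo_matrix((vals, (rows, cols)), shape=(len(users), len(items))).tocsr()
stored = matrix.nnz            # counts explicitly stored zeros as well
matrix.eliminate_zeros()       # keep only the positive interactions
positives = matrix.nnz
print(stored, positives)
```

Because the low ratings are stored as explicit zeros, `nnz` reports every rating rather than only the positive ones. This is worth keeping in mind when reading an interaction count or a sparsity figure computed from `nnz`.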


In [None]:
# Step 2: Prepare user-item interaction matrix
def create_interaction_matrix(df, user_col, item_col, rating_col, threshold=4):
    """
    Create a user-item interaction sparse matrix

    Parameters:
    -----------
    df : pandas dataframe
    user_col : column name for user IDs
    item_col : column name for item IDs
    rating_col : column name for ratings
    threshold : minimum rating to consider as positive interaction (default 4)

    Returns:
    --------
    interaction_matrix : scipy sparse matrix
    user_mapper : dict, map user ID to matrix row index
    item_mapper : dict, map item ID to matrix column index
    user_inverse_mapper : dict, map row index to user ID
    item_inverse_mapper : dict, map column index to item ID
    """
    # Get unique users and items
    users = df[user_col].unique()
    items = df[item_col].unique()

    # Create mappers
    user_mapper = {user: i for i, user in enumerate(users)}
    item_mapper = {item: i for i, item in enumerate(items)}

    user_inverse_mapper = {i: user for user, i in user_mapper.items()}
    item_inverse_mapper = {i: item for item, i in item_mapper.items()}

    # Create matrix indices
    user_indices = [user_mapper[user] for user in df[user_col]]
    item_indices = [item_mapper[item] for item in df[item_col]]

    # Binarize ratings: 1 if rating >= threshold, else 0.
    # Note: the 0 entries are still stored explicitly, so .nnz counts every rating.
    ratings = [1 if r >= threshold else 0 for r in df[rating_col]]

    # Create sparse matrix
    interaction_matrix = sp.coo_matrix((ratings, (user_indices, item_indices)),
                                       shape=(len(users), len(items)))

    return interaction_matrix.tocsr(), user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper

# Create the interaction matrix
interaction_matrix, user_mapper, movie_mapper, user_inverse_mapper, movie_inverse_mapper = create_interaction_matrix(
    df_merged, 'userId', 'movie_id', 'rating'
)

# Print out the sparsity of the interaction matrix
num_users, num_items = interaction_matrix.shape
num_interactions = interaction_matrix.nnz
sparsity = 1 - (num_interactions / (num_users * num_items))

print(f"Users: {num_users}")
print(f"Items: {num_items}")
print(f"Interactions: {num_interactions}")
print(f"Sparsity: {sparsity:.4f}")



Users: 240981
Items: 1441
Interactions: 6605930
Sparsity: 0.9810


# Step 3: Creating the Item Features Matrix Using Embeddings and TruncatedSVD

In this step, we construct a rich content-based feature matrix for movies using precomputed embedding vectors derived from metadata fields such as:

- `director_embedding`
- `main_cast_embeddings`
- `production_company_embedding`
- `original_language_embedding`
- `genre`

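Each embedding column is stacked into a dense matrix whose rows follow the `movie_id` ordering used in the interaction matrix.

To reduce dimensionality and memory usage, we apply **TruncatedSVD** (an SVD-based reduction that, unlike PCA, does not center the data) to each feature separately, keeping at most `n_components` dimensions per feature. Features that already have `n_components` or fewer dimensions are left unchanged. The reduced blocks are then concatenated horizontally.

**Outputs:**
- `item_features`: A CSR matrix with one row per movie and one block of columns per feature, representing each movie's compressed content profile. With `n_components=100`, the five features yield 446 columns in total.
- `item_mapper`: Maps each `movie_id` to its row index in `item_features`.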
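
The code cell below reduces each feature block on its own and then concatenates the results. The sketch here shows the same idea on random toy data, using plain `np.linalg.svd` as a stand-in for scikit-learn's `TruncatedSVD`; the block sizes and the helper name `reduce_block` are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_movies, n_components = 8, 3

# Two made-up feature blocks: one wide (gets reduced), one narrow (kept as-is).
features = {
    "director_embedding": rng.normal(size=(n_movies, 6)),
    "genre": rng.normal(size=(n_movies, 2)),
}

def reduce_block(block, k):
    """Project onto the top-k right singular vectors, without centering the data."""
    if block.shape[1] <= k:
        return block
    _, _, vt = np.linalg.svd(block, full_matrices=False)
    return block @ vt[:k].T

blocks = [sp.csr_matrix(reduce_block(b, n_components)) for b in features.values()]
item_features = sp.hstack(blocks).tocsr()
print(item_features.shape)   # 3 reduced columns + 2 untouched columns
```

The final width is the sum of the per-feature widths, not `n_components` itself, because a block narrower than `n_components` passes through unreduced.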


In [None]:
def create_item_features(df, item_col, feature_cols, item_mapper_reference=None, n_components=None):
    """
    Create item features sparse matrix for LightFM using precomputed embeddings.
    Reduces memory usage by applying TruncatedSVD to each feature independently.

    Parameters:
    -----------
    df : pandas dataframe
    item_col : column name for item IDs
    feature_cols : list of column names with embedding features
    item_mapper_reference : optional dict mapping item_id -> index to align row order
    n_components : number of dimensions to reduce each feature to (per feature)

    Returns:
    --------
    item_features : scipy sparse matrix (CSR)
    item_mapper : dict mapping item_id -> row index
    """

    # Step 1: Align dataframe with interaction matrix
    if item_mapper_reference is not None:
        item_ids_ordered = list(item_mapper_reference.keys())
        df = df[df[item_col].isin(item_ids_ordered)].drop_duplicates(subset=item_col).copy()
        item_mapper = item_mapper_reference
    else:
        # No reference mapping: index items in order of first appearance
        df = df.drop_duplicates(subset=item_col).copy()
        item_mapper = {item: i for i, item in enumerate(df[item_col])}

    # Sort rows so that row i corresponds to the item mapped to index i
    df['index'] = df[item_col].map(item_mapper)
    df = df.sort_values('index').drop(columns=['index'])

    # Step 2: Process and reduce each feature separately
    reduced_matrices = []
    for feature in feature_cols:
        first_val = df[feature].iloc[0]
        if isinstance(first_val, (list, np.ndarray)):
            print(f"Processing feature: {feature}")
            embeddings = np.vstack(df[feature].values).astype(np.float32)

            # Apply TruncatedSVD per feature
            if n_components and embeddings.shape[1] > n_components:
                svd = TruncatedSVD(n_components=n_components, random_state=42)
                reduced = svd.fit_transform(embeddings)
                reduced_sparse = sp.csr_matrix(reduced)
            else:
                reduced_sparse = sp.csr_matrix(embeddings)

            reduced_matrices.append(reduced_sparse)

    # Step 3: Combine features
    if reduced_matrices:
        item_features = sp.hstack(reduced_matrices).tocsr().astype(np.float32)
    else:
        item_features = sp.eye(len(df), format='csr', dtype=np.float32)  # fallback

    return item_features, item_mapper


# List of embedding-based feature columns
embedding_features = [
    'director_embedding',
    'main_cast_embeddings',
    'production_company_embedding',
    'original_language_embedding',
    'genre'
]

# Create item features matrix
item_features, item_mapper = create_item_features(df_content, 'movie_id', embedding_features, item_mapper_reference=movie_mapper, n_components=100)
print(f"Item features shape: {item_features.shape}")

Processing feature: director_embedding
Processing feature: main_cast_embeddings
Processing feature: production_company_embedding
Processing feature: original_language_embedding
Processing feature: genre
Item features shape: (1441, 446)


Validation: Check for Item Mapper Alignment Between Interaction Matrix and Feature Matrix

In [None]:
print("Same keys:", set(movie_mapper.keys()) == set(item_mapper.keys()))

# Check for mismatched indices
for key in movie_mapper:
    if key in item_mapper and movie_mapper[key] != item_mapper[key]:
        print(f"Mismatch for {key}: {movie_mapper[key]} != {item_mapper[key]}")


Same keys: True


# Step 4: Splitting Interaction Data into Train and Test Sets

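To evaluate our recommendation model fairly, we split the user-item interaction matrix  
into 80% training and 20% testing sets using `random_train_test_split`.  

The split is made over individual interactions, and both output matrices keep the full *(num_users × num_movies)* shape, so every user index in the test set also exists in the training matrix. A user with very few ratings can still end up with no training interactions.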

Outputs:
- `train_interactions`: Sparse matrix used for model training
- `test_interactions`: Sparse matrix used for evaluation
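
To make the idea of an interaction-level split concrete, the sketch below re-implements it with NumPy and SciPy on a random toy matrix. It is an illustrative approximation, not LightFM's own code, and the function name `split_interactions` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def split_interactions(interactions, test_fraction=0.2, seed=42):
    """Randomly assign each stored interaction to either the train or the test matrix."""
    coo = interactions.tocoo()
    order = np.random.RandomState(seed).permutation(coo.nnz)
    cutoff = int((1.0 - test_fraction) * coo.nnz)
    train_idx, test_idx = order[:cutoff], order[cutoff:]

    def subset(idx):
        # Rebuild a matrix of the original shape from the selected entries
        return sp.coo_matrix(
            (coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape
        ).tocsr()

    return subset(train_idx), subset(test_idx)

toy = sp.random(20, 15, density=0.3, format="csr", random_state=0)
toy.data[:] = 1.0                      # treat every stored entry as a positive
train, test = split_interactions(toy)
overlap = train.multiply(test).nnz     # interactions present in both sets
```

Both outputs keep the shape of the input, and no interaction appears in both sets, which is the property the overlap check in the evaluation step depends on.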


In [None]:
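# Step 4: Split the data into train and test sets (80% train, 20% test)
# A fixed random state makes the split reproducible across runs
train_interactions, test_interactions = random_train_test_split(
    interaction_matrix, test_percentage=0.2, random_state=np.random.RandomState(42)
)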


# Step 5: Initialize and Train the LightFM Model

We define a function to initialize and train a LightFM recommendation model using both collaborative signals and item-level metadata.

The function supports several tunable hyperparameters:
- `num_components`: the dimensionality of the latent space
- `loss`: objective function (e.g., 'warp', 'bpr', etc.)
- `epochs`: number of training iterations
- `learning_rate`: step size for Adagrad optimization

To represent users, we use an identity matrix (one-hot encoded user features). The model is trained on a sparse interaction matrix, optionally incorporating high-dimensional item features (e.g., embeddings for cast, director, production company, language).

The model uses the Adagrad learning schedule and returns a trained instance of `LightFM`, suitable for evaluation and recommendation generation.
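
For intuition about how item features enter the model, the NumPy sketch below mimics the scoring rule of a LightFM-style model: each item is represented by the feature-weighted sum of feature embeddings, and a score is the dot product of user and item representations plus bias terms. All parameters here are random placeholders rather than values from a trained model, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_item_feats, n_components = 4, 5, 6, 3

# Random placeholders standing in for what a trained model would learn.
user_emb = rng.normal(size=(n_users, n_components))       # one row per user (identity user features)
feat_emb = rng.normal(size=(n_item_feats, n_components))  # one row per item feature
user_bias = rng.normal(size=n_users)
feat_bias = rng.normal(size=n_item_feats)

item_features = rng.random(size=(n_items, n_item_feats))  # dense toy feature matrix

# An item's representation is the feature-weighted sum of feature embeddings.
item_repr = item_features @ feat_emb
item_bias = item_features @ feat_bias

# Score = dot product of user and item representations plus both biases.
scores = user_emb @ item_repr.T + user_bias[:, None] + item_bias[None, :]
top2_for_user0 = np.argsort(-scores[0])[:2]
```

With identity user features, each user simply owns one embedding row. On the item side, two movies with similar feature vectors receive similar representations, which is what lets content information influence the ranking.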


In [None]:
# Step 5: Initialize and train the LightFM model
def train_lightfm_model(train_matrix, item_features=None, num_components=20,
                        loss='warp', epochs=15, learning_rate=0.05):
    """
    Train a LightFM recommendation model

    Parameters:
    -----------
    train_matrix : scipy sparse matrix of user-item interactions
    item_features : scipy sparse matrix of item features (optional)
    num_components : latent dimensionality of the model
    loss : the loss function ('logistic', 'bpr', 'warp', or 'warp-kos')
    epochs : number of training epochs
    learning_rate : learning rate

    Returns:
    --------
    model : trained LightFM model
    """
    # Initialize the model
    model = LightFM(no_components=num_components,
                   loss=loss,
                   learning_schedule='adagrad',
                   learning_rate=learning_rate)

    # Create identity user features
    user_features = sp.identity(train_matrix.shape[0], format='csr')

    # Train the model
    model.fit(train_matrix,
             user_features=user_features,
             item_features=item_features,
             epochs=epochs,
             verbose=True)

    return model


# Train the model
model = train_lightfm_model(train_interactions, item_features=item_features)

Epoch: 100%|██████████| 15/15 [1:48:47<00:00, 435.19s/it]


# Step 6: Extended Evaluation of the LightFM Model

In this step, we evaluate the trained LightFM recommendation model using a rich suite of ranking-based metrics to assess its effectiveness in top-K recommendation tasks.

The evaluation supports both in-sample (training set) and out-of-sample (test set) analysis. We use both built-in LightFM metrics and custom implementations to provide a holistic performance overview.

**Built-in Metrics (from LightFM):**
- `Precision@K`: Measures the proportion of relevant items among the top-K recommendations.
- `Recall@K`: Measures the proportion of all relevant items that were successfully recommended.
- `AUC`: Area Under the Curve — reflects the model’s ability to rank relevant items higher than irrelevant ones.

**Custom Metrics (manually computed):**
- `Hit Rate@K`: Checks whether at least one test item is present in the top-K recommendations.
- `MRR@K` (Mean Reciprocal Rank): Evaluates how early the first relevant item appears in the ranked list.
- `MAP@K` (Mean Average Precision): Averages the precision scores at each position where a relevant item appears.
- `NDCG@K` (Normalized Discounted Cumulative Gain): Considers the order of recommended items, rewarding relevant items that appear earlier.

Evaluation Results:
- Returned as a dictionary and converted to a summary DataFrame comparing performance on training and test sets.
- Enables diagnostic insights on overfitting, cold-start issues, and real-world ranking performance.
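
The custom metrics are easiest to follow on a single ranked list. The sketch below computes them for one hypothetical user with binary relevance; the function name `ranking_metrics` and the toy items are made up for illustration, while the formulas follow the definitions listed above.

```python
import math

def ranking_metrics(ranked_items, relevant_items, k=10):
    """Hit rate, reciprocal rank, average precision and NDCG for one user."""
    top_k = ranked_items[:k]
    relevant = set(relevant_items)
    hits = [1 if item in relevant else 0 for item in top_k]

    hit_rate = 1.0 if any(hits) else 0.0
    # Reciprocal rank of the first relevant item (0 if none is in the top k)
    rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)

    # Average precision: precision at each rank where a relevant item appears
    running, precisions = 0, []
    for i, h in enumerate(hits):
        if h:
            running += 1
            precisions.append(running / (i + 1))
    denom = min(len(relevant), k)
    ap = sum(precisions) / denom if denom else 0.0

    # NDCG: discounted gain divided by the gain of an ideal ordering
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(denom))
    ndcg = dcg / idcg if idcg else 0.0
    return {"hit_rate": hit_rate, "mrr": rr, "map": ap, "ndcg": ndcg}

# Items "b" and "d" are relevant; they appear at ranks 2 and 4.
result = ranking_metrics(["a", "b", "c", "d", "e"], ["b", "d"], k=5)
print(result)
```

Here the first relevant item sits at rank 2, so the reciprocal rank is 0.5. The precisions at ranks 2 and 4 are both 0.5, giving an average precision of 0.5. Averaging these per-user values over all users with test items yields the reported MRR, MAP, Hit Rate and NDCG.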


In [None]:
def evaluate_model(model, train, test, item_features=None, k=10, allow_overlap=False):
    """
    Evaluate a LightFM model using ranking metrics: Precision, Recall, AUC (using LightFM's built‑in functions)
    and custom metrics: Hit Rate, MRR, MAP and NDCG (computed manually).

    Parameters
    ----------
    model : LightFM model
        A fitted LightFM model.
    train : scipy sparse matrix of shape (n_users, n_items)
        Training interactions matrix.
    test : scipy sparse matrix of shape (n_users, n_items)
        Test interactions matrix.
    item_features : optional, default None
        Item features matrix.
    k : int, default 10
        Number of top items to consider.
    allow_overlap : bool, default False
        If True, bypasses the built‑in overlap check (useful for in‑sample evaluation).

    Returns
    -------
    metrics : dict
        Dictionary with the computed metrics:
          - precision: Precision@k (averaged over users)
          - recall: Recall@k (averaged over users)
          - auc: AUC score (averaged over users)
          - hit_rate: Fraction of users with at least one test item in top‑k predictions
          - mrr: Mean Reciprocal Rank over users
          - map: Mean Average Precision over users
          - ndcg: Normalized Discounted Cumulative Gain computed manually over users
    """
    # Convert matrices to CSR format for efficient row-wise access
    if not isinstance(train, csr_matrix):
        train = train.tocsr()
    if not isinstance(test, csr_matrix):
        test = test.tocsr()

    # If allowing overlap (i.e. in-sample evaluation), bypass the built-in check
    if allow_overlap:
        empty_train = coo_matrix(train.shape)
        eval_train = empty_train
    else:
        eval_train = train

    metrics = {}
    # Compute built-in metrics from LightFM (for precision, recall, and AUC)
    metrics["precision"] = precision_at_k(
        model, test, train_interactions=eval_train,
        item_features=item_features, k=k
    ).mean()
    metrics["recall"] = recall_at_k(
        model, test, train_interactions=eval_train,
        item_features=item_features, k=k
    ).mean()
    metrics["auc"] = auc_score(
        model, test, train_interactions=eval_train,
        item_features=item_features
    ).mean()

    # Initialize lists to collect custom metric values
    hit_rates = []
    mrrs = []
    average_precisions = []
    ndcgs = []

    n_users, n_items = test.shape

    for user in range(n_users):
        # Get the indices of items in the test set for this user
        # Use proper CSR matrix row slicing
        test_items = test[user, :].nonzero()[1]
        if len(test_items) == 0:
            # Skip users with no test items
            continue

        # Predict scores for all items for the user
        scores = model.predict(user, np.arange(n_items), item_features=item_features)

        # If not allowing overlap, mask (exclude) items seen in training
        if not allow_overlap:
            train_items = train[user, :].nonzero()[1]
            scores[train_items] = -np.inf

        # Get the top-k item indices (highest predicted scores)
        top_k_items = np.argsort(-scores)[:k]

        # --- Hit Rate ---
        # A "hit" is recorded if at least one test item is in the top-k recommendations
        hit = 1 if np.intersect1d(top_k_items, test_items).size > 0 else 0
        hit_rates.append(hit)

        # --- Mean Reciprocal Rank (MRR) ---
        rr = 0.0
        for rank, item in enumerate(top_k_items):
            if item in test_items:
                rr = 1.0 / (rank + 1)
                break
        mrrs.append(rr)

        # --- Mean Average Precision (MAP) ---
        num_relevant = 0
        sum_precisions = 0.0
        for rank, item in enumerate(top_k_items):
            if item in test_items:
                num_relevant += 1
                sum_precisions += num_relevant / (rank + 1)
        ap = sum_precisions / min(len(test_items), k)
        average_precisions.append(ap)

        # --- Normalized Discounted Cumulative Gain (NDCG) computed manually ---
        # Compute DCG: sum over ranked positions of (relevance / log2(rank+2))
        dcg = 0.0
        for rank, item in enumerate(top_k_items):
            relevance = 1 if item in test_items else 0
            dcg += relevance / np.log2(rank + 2)
        # Compute the Ideal DCG (IDCG): maximum possible DCG
        ideal_relevances = min(len(test_items), k)
        idcg = sum([1.0 / np.log2(i + 2) for i in range(ideal_relevances)])
        ndcg = dcg / idcg if idcg > 0 else 0.0
        ndcgs.append(ndcg)

    # Average the custom metrics over the users that had test items
    metrics["hit_rate"] = np.mean(hit_rates) if hit_rates else 0.0
    metrics["mrr"] = np.mean(mrrs) if mrrs else 0.0
    metrics["map"] = np.mean(average_precisions) if average_precisions else 0.0
    metrics["ndcg"] = np.mean(ndcgs) if ndcgs else 0.0

    return metrics


# To compute in-sample (training set) metrics without causing an overlap error,
# set allow_overlap=True so that an empty training matrix is passed to the built-in functions.
train_metrics = evaluate_model(model, train_interactions, train_interactions,
                               item_features=item_features, k=10, allow_overlap=True)

# For out-of-sample (test set) evaluation, make sure your train/test split does not overlap:
test_metrics = evaluate_model(model, train_interactions, test_interactions, item_features=item_features, k=10)

# Combine results into a DataFrame and print them
df_results = pd.DataFrame({
     "Metric": list(train_metrics.keys()),
     "Train Set": list(train_metrics.values()),
     "Test Set": list(test_metrics.values())
})
print(df_results.to_string(index=False, float_format="%.4f"))

   Metric  Train Set  Test Set
precision     0.3163    0.1309
   recall     0.3598    0.2274
      auc     0.9167    0.8781
 hit_rate     0.9481    0.5726
      mrr     0.6423    0.2873
      map     0.3964    0.1267
     ndcg     0.5309    0.2064


In [None]:
print(df_results.to_markdown(index=False, tablefmt="pipe", floatfmt=",.4f"))

| Metric    |   Train Set |   Test Set |
|:----------|------------:|-----------:|
| precision |      0.3163 |     0.1309 |
| recall    |      0.3598 |     0.2274 |
| auc       |      0.9167 |     0.8781 |
| hit_rate  |      0.9481 |     0.5726 |
| mrr       |      0.6423 |     0.2873 |
| map       |      0.3964 |     0.1267 |
| ndcg      |      0.5309 |     0.2064 |
