# Building an Advanced High-Accuracy Movie Recommender

Our goal is to construct a movie recommendation system with the highest possible accuracy. This involves several key stages:
1.  **Comprehensive Data Loading & Preprocessing:** Ensuring data quality and consistency across all available datasets (movies, ratings, tags, genome scores).
2.  **In-depth Feature Engineering:** Creating rich features for users and movies that can capture complex patterns and preferences.
3.  **Exploration of Advanced Models:**
    *   Advanced Collaborative Filtering (e.g., SVD++, Factorization Machines)
    *   Content-Based Filtering (leveraging movie metadata like genres, tags, and potentially descriptions)
    *   Knowledge-Based/Graph-Based approaches (using the MovieLens genome tag data)
    *   Potentially Deep Learning models (e.g., Neural Collaborative Filtering, Wide & Deep models)
4.  **Sophisticated Hybridization:** Combining the strengths of different models using techniques like stacking or feature-weighted blending.
5.  **Rigorous Evaluation:** Using appropriate metrics and cross-validation to assess and compare model performance.

This notebook will guide us through these stages.

In [1]:
import pandas as pd
import numpy as np
import os
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD, SVDpp, NMF, KNNBasic
from surprise.model_selection import cross_validate, train_test_split as surprise_train_test_split
from surprise import accuracy
from collections import defaultdict
import pickle

# Display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

print("Libraries imported.")

Libraries imported.


## 1. Comprehensive Data Loading & Initial Inspection

We will load all relevant Parquet files:
*   `movies.parquet`: Movie information (movieId, title, genres).
*   `ratings.parquet`: User ratings for movies (userId, movieId, rating, timestamp).
*   `tags.parquet`: User-applied tags for movies (userId, movieId, tag, timestamp).
*   `genome-tags.parquet`: Genome tag descriptions (tagId, tag).
*   `genome-scores.parquet`: Movie-tag relevance scores (movieId, tagId, relevance).

In [2]:
# Define base path for data
base_data_path = '../data/parquet/'

# Load datasets
try:
    movies_df = pd.read_parquet(os.path.join(base_data_path, 'movies.parquet'))
    ratings_df = pd.read_parquet(os.path.join(base_data_path, 'ratings.parquet'))
    tags_df = pd.read_parquet(os.path.join(base_data_path, 'tags.parquet'))
    genome_tags_df = pd.read_parquet(os.path.join(base_data_path, 'genome_tags.parquet'))
    genome_scores_df = pd.read_parquet(os.path.join(base_data_path, 'genome_scores.parquet'))

    print("All datasets loaded successfully.")
    print(f"Movies: {movies_df.shape}")
    print(f"Ratings: {ratings_df.shape}")
    print(f"Tags: {tags_df.shape}")
    print(f"Genome Tags: {genome_tags_df.shape}")
    print(f"Genome Scores: {genome_scores_df.shape}")
except FileNotFoundError as e:
    print(f"Error loading data: {e}. Please ensure all Parquet files are in {base_data_path}")

# Initial inspection
print("\\n--- Movies DataFrame Head ---")
print(movies_df.head())
print("\\n--- Ratings DataFrame Head ---")
print(ratings_df.head())
print("\\n--- Tags DataFrame Head ---")
print(tags_df.head())
print("\\n--- Genome Tags DataFrame Head ---")
print(genome_tags_df.head())
print("\\n--- Genome Scores DataFrame Head ---")
print(genome_scores_df.head())

All datasets loaded successfully.
Movies: (62423, 3)
Ratings: (25000095, 4)
Tags: (1093360, 4)
Genome Tags: (1128, 2)
Genome Scores: (15584448, 3)
\n--- Movies DataFrame Head ---
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
\n--- Ratings DataFrame Head ---
   userId  movieId  rating           timestamp
0       1      296     5.0 2006-05-17 15:34:04
1       1      306     3.5 2006-05-17 12:26:57
2       1      307     5.0 2006-05-17 12:27:08
3    

## 2. Advanced Data Preprocessing and Cleaning

This section will focus on cleaning the loaded data and preparing it for feature engineering.

In [3]:
# --- Movies DataFrame Preprocessing ---
print("\\n--- Processing movies_df ---")
# Extract year from title
movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
# Convert year to numeric, coercing errors. This will make non-years NaN
movies_df['year'] = pd.to_numeric(movies_df['year'], errors='coerce')

# Clean title (remove year and special characters)
def clean_title(title):
    if not isinstance(title, str):
        title = str(title)
    # Remove (YYYY) from end, handling potential spaces around it
    title_no_year = re.sub(r'\s*\(\d{4}\)\s*$', '', title).strip()
    # If the year was not at the very end, it might still be in the title_no_year
    # This regex is more general for removing (YYYY) if it's embedded or for titles that are just (YYYY)
    title_no_year = re.sub(r'\((\d{4})\)', '', title_no_year).strip()
    # Keep basic punctuation useful for titles, remove others
    # Using double quotes for the raw string and hyphen at the end of the character class.
    title_cleaned = re.sub(r"[^a-zA-Z0-9\\\\s:,&.'-]", '', title_no_year)
    title_cleaned = title_cleaned.strip()
    # If cleaning results in an empty string (e.g. title was only special chars or just '(YYYY)'),
    # revert to a minimally cleaned version of the original title to avoid empty titles.
    if not title_cleaned and title_no_year: # If cleaning removed everything but original had content
        title_cleaned = re.sub(r"[^a-zA-Z0-9\\\\s]", '', title_no_year).strip() # a more permissive cleaning
    if not title_cleaned: # If still empty, use original title but remove (YYYY)
        title_cleaned = re.sub(r'\s*\(\d{4}\)\s*$', '', title).strip()
        title_cleaned = re.sub(r'\((\d{4})\)', '', title_cleaned).strip()
        if not title_cleaned: # If original title was just (YYYY) or similar
             title_cleaned = title # Fallback to original if all else fails to produce a string
    return title_cleaned

movies_df['title_cleaned'] = movies_df['title'].apply(clean_title)

# Split genres string into a list of genres
movies_df['genres_list'] = movies_df['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else [])
print("Movies preprocessing done.")
print(movies_df[['title', 'year', 'title_cleaned', 'genres_list']].head())

# --- Ratings DataFrame Preprocessing ---
# Convert timestamp to datetime (optional, but useful for time-aware features)
ratings_df['timestamp_dt'] = pd.to_datetime(ratings_df['timestamp'], unit='s')
print("\\n--- Processing ratings_df ---")
print("Ratings DataFrame processed (timestamp converted to datetime).")
print(ratings_df.head())


# --- Tags DataFrame Preprocessing ---
# Convert timestamp to datetime
tags_df['timestamp_dt'] = pd.to_datetime(tags_df['timestamp'], unit='s')
# Lowercase and strip whitespace from tags
tags_df['tag_cleaned'] = tags_df['tag'].astype(str).str.lower().str.strip()
print("\\n--- Processing tags_df ---")
print("Tags DataFrame processed (timestamp converted, tags cleaned).")
print(tags_df[['tag', 'tag_cleaned']].head())

# --- Genome Tags DataFrame Preprocessing ---
# Lowercase and strip whitespace from genome tags
genome_tags_df['tag_cleaned'] = genome_tags_df['tag'].astype(str).str.lower().str.strip()
print("\\n--- Processing genome_tags_df ---")
print("Genome Tags DataFrame processed (tags cleaned).")
print(genome_tags_df.head())

# --- Genome Scores DataFrame Preprocessing ---
# No immediate cleaning needed for genome_scores, but we'll merge it later.
print("\\n--- Genome Scores DataFrame Head ---")
print(genome_scores_df.head())


# Check for missing values after initial processing
print("\\n--- Missing Values Post-Preprocessing ---")
print("Movies:", movies_df.isnull().sum())
print("\\nRatings:", ratings_df.isnull().sum())
print("\\nTags:", tags_df.isnull().sum()) # tag can have NaNs if original had them
print("\\nGenome Tags:", genome_tags_df.isnull().sum())
print("\\nGenome Scores:", genome_scores_df.isnull().sum())

print("\n--- Year column check after preprocessing --- ")
print(movies_df[['title', 'year', 'title_cleaned']].head())
print(f"Number of NaN years: {movies_df['year'].isnull().sum()}")
print(f"Movies with non-NaN years: {movies_df['year'].notnull().sum()}")
print(f"Total movies: {len(movies_df)}")
print("\nMovies DataFrame info after preprocessing:")
movies_df.info()

\n--- Processing movies_df ---
Movies preprocessing done.
                                title    year           title_cleaned  \
0                    Toy Story (1995)  1995.0                ToyStory   
1                      Jumanji (1995)  1995.0                 Jumanji   
2             Grumpier Old Men (1995)  1995.0          GrumpierOldMen   
3            Waiting to Exhale (1995)  1995.0         WaitingtoExhale   
4  Father of the Bride Part II (1995)  1995.0  FatheroftheBridePartII   

                                         genres_list  
0  [Adventure, Animation, Children, Comedy, Fantasy]  
1                     [Adventure, Children, Fantasy]  
2                                  [Comedy, Romance]  
3                           [Comedy, Drama, Romance]  
4                                           [Comedy]  
\n--- Processing ratings_df ---
Ratings DataFrame processed (timestamp converted to datetime).
   userId  movieId  rating           timestamp        timestamp_dt
0       1  

## 3. Feature Engineering

Now, we'll create features that will be used by our recommendation models.

In [4]:
# --- Movie Features ---
print("\\n--- Engineering Movie Features ---")
# Clean up columns from previous erroneous runs if they exist
# This is to prevent issues if the notebook is re-run in a state where these columns were already created incorrectly.
columns_to_drop = [
    'movie_avg_rating_x', 'movie_rating_count_x',
    'movie_avg_rating_y', 'movie_rating_count_y',
    'movie_avg_rating', 'movie_rating_count', # Drop these too, to ensure they are recreated correctly
    'genome_tag_profile' # Also drop and recreate this
]
for col in columns_to_drop:
    if col in movies_df.columns:
        movies_df.drop(columns=[col], inplace=True)

# Calculate average ratings and rating counts per movie
movie_avg_ratings_series = ratings_df.groupby('movieId')['rating'].mean().rename('movie_avg_rating')
movie_rating_counts_series = ratings_df.groupby('movieId')['rating'].count().rename('movie_rating_count')

# Convert series to dataframes before merging
movie_avg_ratings_df = movie_avg_ratings_series.to_frame() # movieId is index
movie_rating_counts_df = movie_rating_counts_series.to_frame() # movieId is index

# Merge these features back into movies_df
# movies_df has 'movieId' as a column. The new DFs have 'movieId' as index.
movies_df = movies_df.merge(movie_avg_ratings_df, left_on='movieId', right_index=True, how='left')
movies_df = movies_df.merge(movie_rating_counts_df, left_on='movieId', right_index=True, how='left')

# Fill NaN for movies with no ratings (if any)
movies_df['movie_avg_rating'] = movies_df['movie_avg_rating'].fillna(0) # Or use global mean rating
movies_df['movie_rating_count'] = movies_df['movie_rating_count'].fillna(0)

print("Movie features (avg_rating, rating_count) engineered.")
print(movies_df[['movieId', 'title_cleaned', 'movie_avg_rating', 'movie_rating_count']].head())

# --- User Features ---
print("\\n--- Engineering User Features ---")
user_avg_ratings = ratings_df.groupby('userId')['rating'].mean().rename('user_avg_rating')
user_rating_counts = ratings_df.groupby('userId')['rating'].count().rename('user_rating_count')

# Create a user_df. userId will be the index.
user_df = pd.DataFrame(index=ratings_df['userId'].unique())
user_df.index.name = 'userId'
user_df = user_df.merge(user_avg_ratings, left_index=True, right_index=True, how='left')
user_df = user_df.merge(user_rating_counts, left_index=True, right_index=True, how='left')

# Fill NaN for users (if any unusual cases, though groupby should cover all users in ratings_df)
user_df['user_avg_rating'] = user_df['user_avg_rating'].fillna(user_df['user_avg_rating'].mean()) # Or 0
user_df['user_rating_count'] = user_df['user_rating_count'].fillna(0)


print("User features (avg_rating, rating_count) engineered.")
print(user_df.head())
# Example: To add user features to ratings_df (if a model needs it in this flat format)
# ratings_df = ratings_df.merge(user_df, on='userId', how='left')


# --- Merging Genome Tags with Scores for Movie Tag Profiles ---
print("\\n--- Engineering Movie Tag Profiles (from Genome Scores) ---")
# Merge genome_scores with genome_tags to get tag names
# Ensure genome_scores_df and genome_tags_df are in their original state if this cell is re-run
# This merge should ideally happen once, or the DFs should be copies if modified in place.
# For simplicity, assuming they are loaded fresh or not modified in a way that affects this merge.
if 'tag' not in genome_scores_df.columns and 'tagId' in genome_scores_df.columns and 'tagId' in genome_tags_df.columns:
    genome_scores_df_merged = genome_scores_df.merge(genome_tags_df, on='tagId', how='left')
else: # if already merged or structure is different, assume it's ready or use a copy
    genome_scores_df_merged = genome_scores_df.copy() 

# Ensure 'tag_cleaned' is present or created
if 'tag_cleaned' not in genome_scores_df_merged.columns and 'tag' in genome_scores_df_merged.columns:
    genome_scores_df_merged['tag_cleaned'] = genome_scores_df_merged['tag'].apply(lambda x: re.sub(r'[^a-zA-Z0-9_\s]', '', str(x).lower()))
elif 'tag_cleaned' not in genome_scores_df_merged.columns: # Fallback if 'tag' is also missing
    genome_scores_df_merged['tag_cleaned'] = "" # Add empty column to prevent error


MIN_RELEVANCE = 0.8
# Check if 'movieId' and 'tag_cleaned' are in the dataframe to prevent KeyError
if 'movieId' in genome_scores_df_merged.columns and 'tag_cleaned' in genome_scores_df_merged.columns and 'relevance' in genome_scores_df_merged.columns:
    movie_genome_tag_profiles_series = genome_scores_df_merged[genome_scores_df_merged['relevance'] > MIN_RELEVANCE].groupby('movieId')['tag_cleaned'].apply(lambda tags: ' '.join(sorted(list(tags)))).rename('genome_tag_profile')
    # Convert series to dataframe before merging
    movie_genome_tag_profiles_df = movie_genome_tag_profiles_series.to_frame() # movieId is index
    movies_df = movies_df.merge(movie_genome_tag_profiles_df, left_on='movieId', right_index=True, how='left')
else:
    print("Skipping genome tag profile generation due to missing columns in genome_scores_df_merged.")
    # Ensure the column exists even if not populated, to prevent downstream errors
    if 'genome_tag_profile' not in movies_df.columns:
         movies_df['genome_tag_profile'] = '' 

movies_df['genome_tag_profile'] = movies_df['genome_tag_profile'].fillna('') # Fill NaN for movies with no significant genome tags

print("Movie genome tag profiles engineered.")
print(movies_df[['movieId', 'title_cleaned', 'genome_tag_profile']].head())

# Display some basic stats or info
print(f"\nMovies DataFrame shape: {movies_df.shape}")
print(f"User DataFrame shape: {user_df.shape}")
print(f"Ratings DataFrame shape: {ratings_df.shape}") # ratings_df is not modified here yet

# Check for NaNs created by merges, if any unexpected
print("\nNaNs in movies_df after feature engineering:")
print(movies_df.isnull().sum())
print("\nNaNs in user_df after feature engineering:")
print(user_df.isnull().sum())

\n--- Engineering Movie Features ---
Movie features (avg_rating, rating_count) engineered.
   movieId           title_cleaned  movie_avg_rating  movie_rating_count
0        1                ToyStory          3.893708             57309.0
1        2                 Jumanji          3.251527             24228.0
2        3          GrumpierOldMen          3.142028             11804.0
3        4         WaitingtoExhale          2.853547              2523.0
4        5  FatheroftheBridePartII          3.058434             11714.0
\n--- Engineering User Features ---
Movie features (avg_rating, rating_count) engineered.
   movieId           title_cleaned  movie_avg_rating  movie_rating_count
0        1                ToyStory          3.893708             57309.0
1        2                 Jumanji          3.251527             24228.0
2        3          GrumpierOldMen          3.142028             11804.0
3        4         WaitingtoExhale          2.853547              2523.0
4        5  Fath

## 4. Model Building - Part 1: Advanced Collaborative Filtering

We'll start with advanced matrix factorization techniques like SVD++ or NMF.
We will use the `surprise` library.

**Note on Parallelism:** The `surprise` library, while convenient, does not inherently parallelize the training of all its algorithms (like SVD, SVD++, NMF) across multiple CPU cores in the same way some other libraries (e.g., scikit-learn with `n_jobs`) do. For very large datasets or computationally intensive hyperparameter searches, training times might be significant. If performance becomes a major bottleneck for these CF models, alternatives like LightFM, Implicit, xLearn, or custom implementations in TensorFlow/PyTorch (which offer better control over parallelism) should be considered. For now, we will use `surprise` to establish baseline performance for these standard CF algorithms.

In [6]:
# Prepare data for Surprise
reader = Reader(rating_scale=(0.5, 5.0))
# IMPORTANT: The ML-25M dataset has 25 million ratings.
# Training Surprise models (especially SVD, SVD++, NMF) on the full dataset can be VERY time-consuming
# and memory-intensive. For initial development and iteration, consider sampling the ratings_df.
# Example: ratings_df_sample = ratings_df.sample(n=1000000, random_state=42)
# Then use ratings_df_sample in Dataset.load_from_df()
# For this run, we will proceed with the full dataset, be prepared for potential long runtimes.

# Define paths for saving/loading trainset and testset
models_dir = "../models/"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)
path_trainset_surprise = os.path.join(models_dir, "trainset_surprise.pkl")
path_testset_surprise = os.path.join(models_dir, "testset_surprise.pkl")

# Attempt to load trainset and testset from files
if os.path.exists(path_trainset_surprise) and os.path.exists(path_testset_surprise):
    print(f"Loading trainset_surprise from {path_trainset_surprise}...")
    with open(path_trainset_surprise, 'rb') as f:
        trainset_surprise = pickle.load(f)
    print(f"Loading testset_surprise from {path_testset_surprise}...")
    with open(path_testset_surprise, 'rb') as f:
        testset_surprise = pickle.load(f)
    print("Trainset and Testset loaded from files.")
    # Rebuild the full data_surprise object if needed for other operations,
    # or ensure downstream tasks only need trainset/testset.
    # For now, we assume trainset and testset are sufficient.
    # If data_surprise is needed elsewhere, it might need to be reconstructed or also saved/loaded.
else:
    print("Trainset/Testset pickle files not found. Creating and saving them...")
    data_surprise = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)
    trainset_surprise, testset_surprise = surprise_train_test_split(data_surprise, test_size=0.2, random_state=42)
    
    print(f"Saving trainset_surprise to {path_trainset_surprise}...")
    with open(path_trainset_surprise, 'wb') as f:
        pickle.dump(trainset_surprise, f)
    print(f"Saving testset_surprise to {path_testset_surprise}...")
    with open(path_testset_surprise, 'wb') as f:
        pickle.dump(testset_surprise, f)
    print("Trainset and Testset created and saved to files.")

print("\\n--- Training SVD ---")
algo_svd = SVD(n_factors=100, n_epochs=20, random_state=42, verbose=False) # Basic SVD
# cross_validate(algo_svd, data_surprise, measures=['RMSE', 'MAE'], cv=3, verbose=True) # CV is time-consuming
algo_svd.fit(trainset_surprise)
predictions_svd = algo_svd.test(testset_surprise)
rmse_svd = accuracy.rmse(predictions_svd)
print(f"SVD RMSE: {rmse_svd}")

print("\n--- Training SVD++ ---")
# SVD++ considers implicit feedback. It needs user-item interactions.
# The current ratings_df is explicit. For SVD++ to use its full potential,
# we might need to represent all rated items by a user as implicit feedback.
# However, Surprise's SVD++ can run on explicit ratings too.
algo_svdpp = SVDpp(n_factors=50, n_epochs=10, random_state=42, verbose=False, cache_ratings=True) # Fewer epochs due to time
# cross_validate(algo_svdpp, data_surprise, measures=['RMSE', 'MAE'], cv=3, verbose=True) # Very time-consuming
algo_svdpp.fit(trainset_surprise)
predictions_svdpp = algo_svdpp.test(testset_surprise)
rmse_svdpp = accuracy.rmse(predictions_svdpp)
print(f"SVD++ RMSE: {rmse_svdpp}")


print("\n--- Training NMF ---")
algo_nmf = NMF(n_factors=15, n_epochs=20, random_state=42, verbose=False) # NMF
# cross_validate(algo_nmf, data_surprise, measures=['RMSE', 'MAE'], cv=3, verbose=True)
algo_nmf.fit(trainset_surprise)
predictions_nmf = algo_nmf.test(testset_surprise)
rmse_nmf = accuracy.rmse(predictions_nmf)
print(f"NMF RMSE: {rmse_nmf}")

Trainset/Testset pickle files not found. Creating and saving them...


Saving trainset_surprise to ../models/trainset_surprise.pkl...
Saving testset_surprise to ../models/testset_surprise.pkl...
Saving testset_surprise to ../models/testset_surprise.pkl...
Trainset and Testset created and saved to files.
\n--- Training SVD ---
Trainset and Testset created and saved to files.
\n--- Training SVD ---
RMSE: 0.7773
SVD RMSE: 0.7772926722924584

--- Training SVD++ ---
RMSE: 0.7773
SVD RMSE: 0.7772926722924584

--- Training SVD++ ---
RMSE: 0.8263
SVD++ RMSE: 0.8262693807476258

--- Training NMF ---
RMSE: 0.8263
SVD++ RMSE: 0.8262693807476258

--- Training NMF ---
RMSE: 0.8865
NMF RMSE: 0.8864854144950411
RMSE: 0.8865
NMF RMSE: 0.8864854144950411


In [None]:
# Temporary cell to check if CF models are in memory
cf_models_in_memory = True
if 'algo_svd' not in locals() or 'algo_svdpp' not in locals() or 'algo_nmf' not in locals():
    cf_models_in_memory = False
print(f"Collaborative filtering models in memory: {cf_models_in_memory}")

if not cf_models_in_memory:
    print("Attempting to re-run CF training cell...")
    # This is a placeholder, actual re-run will be a separate call if needed
    pass

In [8]:
print("\n--- Saving Collaborative Filtering Models ---")
models_dir = "../models/"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)
    print(f"Created directory: {models_dir}")

path_svd_model = os.path.join(models_dir, "adv_algo_svd.pkl")
path_svdpp_model = os.path.join(models_dir, "adv_algo_svdpp.pkl")
path_nmf_model = os.path.join(models_dir, "adv_algo_nmf.pkl")

# Save Surprise models
if 'algo_svd' in locals():
    with open(path_svd_model, 'wb') as f:
        pickle.dump(algo_svd, f)
    print(f"Saved SVD model to {path_svd_model}")
else:
    print("SVD model (algo_svd) not found in memory, skipping save.")

if 'algo_svdpp' in locals():
    with open(path_svdpp_model, 'wb') as f:
        pickle.dump(algo_svdpp, f)
    print(f"Saved SVD++ model to {path_svdpp_model}")
else:
    print("SVD++ model (algo_svdpp) not found in memory, skipping save.")

if 'algo_nmf' in locals():
    with open(path_nmf_model, 'wb') as f:
        pickle.dump(algo_nmf, f)
    print(f"Saved NMF model to {path_nmf_model}")
else:
    print("NMF model (algo_nmf) not found in memory, skipping save.")
print("--- Collaborative Filtering Models processed for saving. ---")


--- Saving Collaborative Filtering Models ---
SVD model (algo_svd) not found in memory, skipping save.
SVD++ model (algo_svdpp) not found in memory, skipping save.
NMF model (algo_nmf) not found in memory, skipping save.
--- Collaborative Filtering Models processed for saving. ---


## 5. Model Building - Part 2: Content-Based Filtering

Here, we'll build a content-based recommender using movie genres and genome tag profiles.

In [7]:
print("\\\\n--- Building Content-Based Model ---")

# Combine genres and genome tag profile for a richer content representation
movies_df['content_features_str'] = movies_df['genres_list'].apply(lambda x: ' '.join(x)) + ' ' + movies_df['genome_tag_profile']

# Use TF-IDF to vectorize these content features
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=5) # min_df to ignore very rare terms
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_df['content_features_str'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

# Compute cosine similarity matrix from TF-IDF features
# Note: For a large number of movies (e.g., 62k in ML-25M), this matrix (num_movies x num_movies)
# can be very large (e.g., ~15GB for float32). Consider alternatives like approximate nearest neighbors
# (e.g., Annoy, Faiss, ScaNN) for TF-IDF vectors if memory or computation speed becomes an issue.
cosine_sim_content = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f"Content cosine similarity matrix shape: {cosine_sim_content.shape}")

# Create a lookup for movie titles to their index in the cosine_sim_content matrix
# This assumes titles are unique enough to serve as keys.
# If titles are not unique, this strategy needs revision (e.g., use movieId).
# The indices of cosine_sim_content correspond to the rows of movies_df at the time of TF-IDF fitting.
content_movie_indices_lookup = pd.Series(range(cosine_sim_content.shape[0]), index=movies_df['title'])
print("Content movie indices lookup created.")

# Function to get content-based recommendations
# This provides item-item similarity. To get user recommendations:
# 1. Find movies highly rated by the user.
# 2. For each of these movies, find the most similar movies using cosine_sim_content.
# 3. Aggregate these similar movies and rank them.

def get_content_recommendations(movie_id, num_recs=10):
    if movie_id not in movies_df['movieId'].values:
        print(f"MovieId {movie_id} not found.")
        return []
    
    movie_idx = movies_df[movies_df['movieId'] == movie_id].index[0]
    sim_scores = list(enumerate(cosine_sim_content[movie_idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:num_recs+1] # Exclude the movie itself
    
    recommended_movie_indices = [i[0] for i in sim_scores]
    recommended_movies = movies_df.iloc[recommended_movie_indices][['movieId', 'title_cleaned', 'genres_list']]
    recommended_movies['similarity_score'] = [s[1] for s in sim_scores]
    return recommended_movies

# Example: Get recommendations for a movie
if not movies_df.empty:
    example_movie_id_for_content = movies_df['movieId'].iloc[0]
    print(f"\\\\nExample Content-Based Recommendations for movie: {movies_df[movies_df['movieId'] == example_movie_id_for_content]['title_cleaned'].values[0]} (ID: {example_movie_id_for_content})")
    content_recs_example = get_content_recommendations(example_movie_id_for_content, 5)
    print(content_recs_example)
else:
    print("movies_df is empty, cannot run content-based recommendation example.")

# Note: Evaluating content-based models in terms of rating prediction RMSE is not direct.
# They are typically evaluated on precision/recall of recommended items or diversity.
# For a hybrid model, we might use the similarity scores or predicted ratings derived from similar items.

\\n--- Building Content-Based Model ---
TF-IDF matrix shape: (62423, 1072)
TF-IDF matrix shape: (62423, 1072)
Content cosine similarity matrix shape: (62423, 62423)
Content movie indices lookup created.
\\nExample Content-Based Recommendations for movie: ToyStory (ID: 1)
       movieId  title_cleaned  \
3021      3114      ToyStory2   
4780      4886  Monsters,Inc.   
2264      2355    Bug'sLife,A   
14813    78499      ToyStory3   
11361    50872    Ratatouille   

                                                   genres_list  \
3021         [Adventure, Animation, Children, Comedy, Fantasy]   
4780         [Adventure, Animation, Children, Comedy, Fantasy]   
2264                  [Adventure, Animation, Children, Comedy]   
14813  [Adventure, Animation, Children, Comedy, Fantasy, IMAX]   
11361                             [Animation, Children, Drama]   

       similarity_score  
3021           0.847621  
4780           0.808108  
2264           0.796815  
14813          0.778656  
11

In [None]:
print("\n--- Saving Content-Based Components ---")
models_dir = "../models/"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)
    print(f"Created directory: {models_dir}")

path_tfidf_vectorizer = os.path.join(models_dir, "adv_tfidf_vectorizer.pkl")
path_cosine_sim_content = os.path.join(models_dir, "adv_cosine_sim_content.npy")

# Save TF-IDF vectorizer
if 'tfidf_vectorizer' in locals():
    with open(path_tfidf_vectorizer, 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
    print(f"Saved TF-IDF vectorizer to {path_tfidf_vectorizer}")
else:
    print("TF-IDF vectorizer (tfidf_vectorizer) not found in memory, skipping save.")

# Save content similarity matrix (NumPy array)
if 'cosine_sim_content' in locals():
    np.save(path_cosine_sim_content, cosine_sim_content)
    print(f"Saved content cosine similarity matrix to {path_cosine_sim_content}")
else:
    print("Content cosine similarity matrix (cosine_sim_content) not found in memory, skipping save.")

# Save content_movie_indices_lookup
if 'content_movie_indices_lookup' in locals():
    path_content_lookup = os.path.join(models_dir, "content_movie_indices_lookup.pkl")
    with open(path_content_lookup, 'wb') as f:
        pickle.dump(content_movie_indices_lookup, f)
    print(f"Saved content movie indices lookup to {path_content_lookup}")
else:
    print("Content movie indices lookup (content_movie_indices_lookup) not found in memory, skipping save.")

print("--- Content-Based Components processed for saving. ---")


--- Saving Content-Based Components ---
Saved TF-IDF vectorizer to ../models/adv_tfidf_vectorizer.pkl


## 6. Model Building - Part 3: Knowledge-Based/Graph-Based (Using Genome Tags)

This section explores how to leverage the structured knowledge present in the MovieLens Genome dataset. The genome tags provide a rich, curated set of descriptors for movies, and their relevance scores indicate the strength of association. We can use this information in several ways:

1.  **Movie Embeddings from Genome Tags:** Create vector representations (embeddings) for movies directly from their genome tag relevance scores. Movies with similar tag relevance patterns will be closer in the embedding space. This is a form of knowledge-based content filtering.
2.  **Graph-Based Approaches:** Construct a graph where movies and tags are nodes, and edges represent relationships (e.g., a movie has a tag with a certain relevance). Graph embedding techniques (like Node2Vec, TransE) or graph traversal algorithms could then be used to find related movies or predict user preferences.
3.  **Feature Augmentation:** The genome tag vectors or derived embeddings can be used as additional features in more complex hybrid models (e.g., factorization machines, neural networks).

We'll start by creating movie embeddings from genome tag relevance scores and calculating similarity.

In [11]:
print("\\n--- Building Knowledge-Based Model (Genome Tags) ---")

# 1. Create Movie-Genome Tag Matrix
# Pivot genome_scores_df to have movies as rows, tags as columns, and relevance as values
if 'genome_scores_df' in locals() and 'genome_tags_df' in locals():
    movie_genome_tag_matrix = genome_scores_df.pivot_table(
        index='movieId',
        columns='tagId',
        values='relevance'
    ).fillna(0) # Fill movies/tags with no score with 0

    print(f"Movie-Genome Tag Matrix shape: {movie_genome_tag_matrix.shape}")

    # Ensure all movies from movies_df are in this matrix, even if they have no genome scores
    # (they will have all-zero vectors)
    movie_ids_from_movies_df = movies_df[['movieId']].set_index('movieId')
    movie_genome_tag_matrix = movie_ids_from_movies_df.join(movie_genome_tag_matrix, how='left').fillna(0)
    print(f"Aligned Movie-Genome Tag Matrix shape: {movie_genome_tag_matrix.shape}")

    # 2. Compute Cosine Similarity from Genome Tag Vectors
    # Note: For a large number of movies (e.g., 62k in ML-25M), this matrix (num_movies x num_movies)
    # can be large (e.g., ~15GB for float32). Consider alternatives like approximate nearest neighbors
    # if memory becomes an issue.
    cosine_sim_genome = cosine_similarity(movie_genome_tag_matrix)
    print(f"Genome-based cosine similarity matrix shape: {cosine_sim_genome.shape}")

    # Create a mapping from movieId to index in the movie_genome_tag_matrix for easy lookup
    genome_movie_ids = movie_genome_tag_matrix.index.tolist()
    genome_movie_indices = {movieId: i for i, movieId in enumerate(genome_movie_ids)}

    # 3. Function to get recommendations based on genome tag similarity
    def get_genome_recommendations(movie_id, num_recs=10):
        if movie_id not in genome_movie_indices:
            print(f"MovieId {movie_id} not found in genome matrix.")
            return pd.DataFrame()

        movie_idx = genome_movie_indices[movie_id]
        sim_scores = list(enumerate(cosine_sim_genome[movie_idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        sim_scores = sim_scores[1:num_recs+1] # Exclude the movie itself

        recommended_movie_indices_genome = [i[0] for i in sim_scores]
        recommended_scores_ordered = [s[1] for s in sim_scores] # Scores in order

        # Map these integer indices back to movieIds using genome_movie_ids list.
        # This requires genome_movie_ids to be available and in sync with the matrix.
        recommended_movie_ids = [genome_movie_ids[i] for i in recommended_movie_indices_genome if i < len(genome_movie_ids)]
        
        # Get movie details from movies_df, ensuring the order of recommended_movie_ids
        # and handling cases where a recommended movieId might not be in the main movies_df (if it was filtered out previously)
        # However, movie_genome_tag_matrix is aligned with movies_df's movieIds, so they should all be present.
        recommended_movies = movies_df[movies_df['movieId'].isin(recommended_movie_ids)].set_index('movieId').loc[recommended_movie_ids].reset_index()
        
        # Assign scores directly as they are already ordered
        recommended_movies['genome_similarity_score'] = recommended_scores_ordered
        
        return recommended_movies[['movieId', 'title_cleaned', 'genres_list', 'genome_similarity_score']]

    # 4. Example Usage
    if not movies_df.empty:
        # Ensure the example movie ID exists in genome_movie_indices to prevent errors
        example_movie_id_for_genome = None
        if len(movies_df) > 10 and movies_df['movieId'].iloc[10] in genome_movie_indices:
            example_movie_id_for_genome = movies_df['movieId'].iloc[10] 
        elif not genome_movie_indices.empty: # Fallback to the first movie in the genome index
             example_movie_id_for_genome = genome_movie_indices.index[0]

        if example_movie_id_for_genome:
            print(f"\\\\nExample Genome-Based Recommendations for movie: {movies_df[movies_df['movieId'] == example_movie_id_for_genome]['title_cleaned'].values[0]} (ID: {example_movie_id_for_genome})")
            genome_recs_example = get_genome_recommendations(example_movie_id_for_genome, 5)
            if not genome_recs_example.empty:
                print(genome_recs_example)
            else:
                print(f"Could not generate genome recommendations for movieId {example_movie_id_for_genome}.")
        else:
            print("Could not find a suitable example movie ID for genome recommendations.")
    else:
        print("movies_df is empty, cannot run genome recommendation example.")
else:
    print("Genome data (genome_scores_df or genome_tags_df) not loaded. Skipping genome-based model.")


\n--- Building Knowledge-Based Model (Genome Tags) ---
Movie-Genome Tag Matrix shape: (13816, 1128)
Movie-Genome Tag Matrix shape: (13816, 1128)


\n--- Building Knowledge-Based Model (Genome Tags) ---
Movie-Genome Tag Matrix shape: (13816, 1128)
Movie-Genome Tag Matrix shape: (13816, 1128)


  movie_genome_tag_matrix = movie_ids_from_movies_df.join(movie_genome_tag_matrix, how='left').fillna(0)


\n--- Building Knowledge-Based Model (Genome Tags) ---
Movie-Genome Tag Matrix shape: (13816, 1128)
Movie-Genome Tag Matrix shape: (13816, 1128)


  movie_genome_tag_matrix = movie_ids_from_movies_df.join(movie_genome_tag_matrix, how='left').fillna(0)


Aligned Movie-Genome Tag Matrix shape: (62423, 1128)


\n--- Building Knowledge-Based Model (Genome Tags) ---
Movie-Genome Tag Matrix shape: (13816, 1128)
Movie-Genome Tag Matrix shape: (13816, 1128)


  movie_genome_tag_matrix = movie_ids_from_movies_df.join(movie_genome_tag_matrix, how='left').fillna(0)


Aligned Movie-Genome Tag Matrix shape: (62423, 1128)


: 

In [None]:
print("\n--- Saving Knowledge-Based Components ---")
models_dir = "../models/"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)
    print(f"Created directory: {models_dir}")

path_movie_genome_matrix = os.path.join(models_dir, "adv_movie_genome_tag_matrix.parquet")
path_cosine_sim_genome = os.path.join(models_dir, "adv_cosine_sim_genome.npy")

# Save movie-genome tag matrix (Pandas DataFrame to Parquet)
if 'movie_genome_tag_matrix' in locals() and isinstance(movie_genome_tag_matrix, pd.DataFrame):
    try:
        movie_genome_tag_matrix.to_parquet(path_movie_genome_matrix)
        print(f"Saved movie-genome tag matrix to {path_movie_genome_matrix}")
    except Exception as e:
        print(f"Error saving movie-genome tag matrix: {e}")
elif 'movie_genome_tag_matrix' not in locals():
    print("Movie-genome tag matrix (movie_genome_tag_matrix) not found in memory, skipping save.")
else:
    print(f"movie_genome_tag_matrix is not a DataFrame (type: {type(movie_genome_tag_matrix)}), skipping save.")

# Save genome similarity matrix (NumPy array)
if 'cosine_sim_genome' in locals():
    np.save(path_cosine_sim_genome, cosine_sim_genome)
    print(f"Saved genome cosine similarity matrix to {path_cosine_sim_genome}")
else:
    print("Genome cosine similarity matrix (cosine_sim_genome) not found in memory, skipping save.")
print("--- Knowledge-Based Components processed for saving. ---")

## 7. Hybridization Strategy

After developing individual recommendation models (Collaborative Filtering, Content-Based, Knowledge-Based), the next step is to combine their strengths through hybridization. A hybrid recommender can often outperform individual models by leveraging different sources of information and mitigating the weaknesses of each approach.

We will explore **Weighted Blending**, a common hybridization technique.

### Weighted Blending

In this approach, we take the rating predictions from multiple models and combine them using a weighted average. For a user *u* and an item *i*, the hybrid prediction can be calculated as:

`HybridScore(u, i) = w_1 * Score_1(u, i) + w_2 * Score_2(u, i) + ... + w_n * Score_n(u, i)`

Where:
*   `Score_k(u, i)` is the prediction (e.g., rating) for user *u* on item *i* from model *k*.
*   `w_k` is the weight assigned to model *k*.
*   The sum of all weights (`w_1 + w_2 + ... + w_n`) typically equals 1.

The weights can be determined empirically, through domain knowledge, or optimized by evaluating the hybrid model's performance on a validation dataset.

Our goal will be to combine:
1.  **Collaborative Filtering (SVD) predictions.**
2.  **Content-Based predictions** (derived from item similarity based on TF-IDF of genres and genome tag profiles).
3.  **Knowledge-Based predictions** (derived from item similarity based on genome tag relevance vectors).

To use the content-based and knowledge-based models (which provide item-item similarity) for rating prediction, we can estimate a user's rating for a target item by looking at the ratings they've given to similar items. A common formula is:

`Pred(u, i) = sum_j ( sim(i, j) * rating(u, j) ) / sum_j ( sim(i, j) )`
where *j* represents items that user *u* has rated and are similar to item *i*, and `sim(i, j)` is the similarity between item *i* and item *j*. We usually consider only the top *k* most similar items *j* for this calculation.

Let's implement this.

In [16]:
# Section 7: Hybridization Strategy - Weighted Blending
# ----------------------------------------------------
# This section implements a hybrid recommender system by blending the predictions
# from the SVD (Collaborative Filtering), Content-Based, and Knowledge-Based models.

import os
import pickle
import numpy as np
import pandas as pd
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split as surprise_train_test_split
import gc # Import garbage collector

# Define file paths for data and models
data_path = "/workspaces/soe-python/data"
models_path = "/workspaces/soe-python/models"
parquet_path = os.path.join(data_path, "parquet")

# --- 1. Load necessary data and models sequentially ---
print("Loading data and models for hybridization...")
gc.collect()

# Load movies_df
if 'movies_df' not in globals() or movies_df.empty:
    print("Loading movies_df from Parquet...")
    movies_df = pd.read_parquet(os.path.join(parquet_path, 'movies.parquet'))
    # Basic preprocessing if loaded fresh
    movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
    movies_df['year'] = pd.to_numeric(movies_df['year'], errors='coerce')
    def clean_title_simple(title):
        if not isinstance(title, str):
            title = str(title)
        title_no_year = re.sub(r'\s*\(\d{4}\)\s*$', '', title).strip()
        return re.sub(r"[^a-zA-Z0-9\s:,&.'-]", '', title_no_year).strip()
    movies_df['title_cleaned'] = movies_df['title'].apply(clean_title_simple)
    movies_df['genres_list'] = movies_df['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else [])
    # Add movie_avg_rating if not present (e.g. from previous cells)
    if 'movie_avg_rating' not in movies_df.columns and 'ratings_df' in globals():
        movie_avg_ratings_series = ratings_df.groupby('movieId')['rating'].mean().rename('movie_avg_rating')
        movies_df = movies_df.merge(movie_avg_ratings_series, on='movieId', how='left')
        movies_df['movie_avg_rating'] = movies_df['movie_avg_rating'].fillna(0)

    print(f"movies_df loaded/processed. Shape: {movies_df.shape}. Memory usage: {movies_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    gc.collect()
else:
    print("movies_df already in memory.")

# Load ratings_df
if 'ratings_df' not in globals() or ratings_df.empty:
    print("Loading ratings_df from Parquet...")
    ratings_df = pd.read_parquet(os.path.join(parquet_path, 'ratings.parquet'))
    ratings_df['rating'] = pd.to_numeric(ratings_df['rating'], errors='coerce')
    ratings_df.dropna(subset=['rating'], inplace=True)
    print(f"ratings_df loaded. Shape: {ratings_df.shape}. Memory usage: {ratings_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    gc.collect()
else:
    print("ratings_df already in memory.")


# Collaborative Filtering (SVD)
svd_model_path = os.path.join(models_path, 'adv_algo_svd.pkl')
print(f"Loading SVD model from {svd_model_path}...")
with open(svd_model_path, 'rb') as f:
    algo_svd = pickle.load(f)
print("SVD model loaded.")
gc.collect()

# Recreate trainset_surprise if not in memory or if its components are missing
# Check for trainset_surprise and its attributes like rating_scale
recreate_trainset = False
if 'trainset_surprise' not in globals():
    recreate_trainset = True
else:
    try:
        _ = trainset_surprise.rating_scale # Check if it has necessary attributes
        _ = trainset_surprise.global_mean
        print("trainset_surprise already in memory and seems valid.")
    except AttributeError:
        print("trainset_surprise found in memory but seems incomplete or corrupted. Recreating...")
        recreate_trainset = True

if recreate_trainset:
    print("Recreating trainset_surprise for SVD...")
    if 'ratings_df' in globals() and not ratings_df.empty:
        # Ensure ratings_df has the required columns and data types for Surprise
        ratings_df_surprise = ratings_df[['userId', 'movieId', 'rating']].copy()
        ratings_df_surprise['rating'] = pd.to_numeric(ratings_df_surprise['rating'], errors='coerce')
        ratings_df_surprise.dropna(subset=['rating'], inplace=True)
        
        min_rating_val = ratings_df_surprise['rating'].min()
        max_rating_val = ratings_df_surprise['rating'].max()
        
        if pd.isna(min_rating_val) or pd.isna(max_rating_val):
            print("Error: Could not determine rating scale from ratings_df. Using default 0.5-5.0.")
            min_rating_val, max_rating_val = 0.5, 5.0 # Fallback
            
        reader = Reader(rating_scale=(min_rating_val, max_rating_val))
        data_surprise = Dataset.load_from_df(ratings_df_surprise, reader)
        trainset_surprise = data_surprise.build_full_trainset()
        print(f"trainset_surprise recreated. Users: {trainset_surprise.n_users}, Items: {trainset_surprise.n_items}, Ratings: {trainset_surprise.n_ratings}")
        print(f"Rating scale: {trainset_surprise.rating_scale}, Global mean: {trainset_surprise.global_mean}")
    else:
        print("ratings_df not available to recreate trainset_surprise. This might cause critical issues.")
    gc.collect()


# Content-Based Filtering
# tfidf_vectorizer_path = os.path.join(models_path, 'adv_tfidf_vectorizer.pkl') # Not directly passed
cosine_sim_content_path = os.path.join(models_path, 'adv_cosine_sim_content.npy')
content_lookup_path = os.path.join(models_path, 'content_movie_indices_lookup.pkl')

# print(f"Loading TF-IDF vectorizer from {tfidf_vectorizer_path}...") # Not strictly needed if not passed
# with open(tfidf_vectorizer_path, 'rb') as f:
#     tfidf_vectorizer = pickle.load(f)
# print("TF-IDF vectorizer loaded.")
gc.collect()

print(f"Loading content cosine similarity from {cosine_sim_content_path} using mmap_mode='r'...")
cosine_sim_content = np.load(cosine_sim_content_path, mmap_mode='r')
print(f"Content cosine similarity loaded. Shape: {cosine_sim_content.shape}.")
gc.collect()

print(f"Loading content movie indices lookup from {content_lookup_path}...")
with open(content_lookup_path, 'rb') as f:
    content_movie_indices_lookup = pickle.load(f)
print("Content movie indices lookup loaded.")
gc.collect()

print("Content-based components (cosine similarity, lookup) loaded.")


# Knowledge-Based Filtering (using MovieLens Tag Genome)
# movie_genome_matrix_path = os.path.join(models_path, 'adv_movie_genome_tag_matrix.parquet') # Not directly passed
cosine_sim_genome_path = os.path.join(models_path, 'adv_cosine_sim_genome.npy')

# print(f"Loading movie-genome tag matrix from {movie_genome_matrix_path}...") # Not strictly needed if not passed
# movie_genome_tag_matrix = pd.read_parquet(movie_genome_matrix_path)
# print(f"Movie-genome tag matrix loaded. Shape: {movie_genome_tag_matrix.shape}.")
gc.collect()

print(f"Loading genome cosine similarity from {cosine_sim_genome_path} using mmap_mode='r'...")
cosine_sim_genome = np.load(cosine_sim_genome_path, mmap_mode='r')
print(f"Genome cosine similarity loaded. Shape: {cosine_sim_genome.shape}.")
gc.collect()

# Create a lookup for movieId to index in the genome similarity matrix
# This requires movie_genome_tag_matrix to be loaded if genome_movie_ids is not saved separately
# For now, assuming genome_movie_indices is self-sufficient or created if needed.
# If movie_genome_tag_matrix was loaded, genome_movie_ids would be: movie_genome_tag_matrix.index.tolist()
# We need a robust way to get genome_movie_ids if it's not passed or saved.
# Let's try to load it or infer it if `adv_movie_genome_tag_matrix.parquet` exists
path_adv_movie_genome_tag_matrix = os.path.join(models_path, "adv_movie_genome_tag_matrix.parquet")
if os.path.exists(path_adv_movie_genome_tag_matrix):
    print(f"Loading movie_genome_tag_matrix from {path_adv_movie_genome_tag_matrix} to get genome_movie_ids...")
    temp_genome_matrix = pd.read_parquet(path_adv_movie_genome_tag_matrix)
    genome_movie_ids = temp_genome_matrix.index.tolist()
    del temp_genome_matrix # Free memory
    gc.collect()
    genome_movie_indices = {movieId: i for i, movieId in enumerate(genome_movie_ids)}
    print("Genome movie indices created from loaded matrix.")
elif 'genome_movie_indices' not in globals():
    print("Warning: genome_movie_indices not found and cannot be created as adv_movie_genome_tag_matrix.parquet is missing.")
    genome_movie_indices = {} # Fallback to empty dict
else:
    print("genome_movie_indices already in memory or assumed to be loaded correctly.")

print("Knowledge-based components (cosine similarity, lookup) loaded.")
gc.collect()

print("\n--- All data and models loaded successfully (or attempted). ---")


# --- 2. Define the Hybrid Prediction Function ---
def hybrid_prediction(userId, movieId, movies_df, ratings_df, algo_svd, trainset_surprise,
                      cosine_sim_content, content_movie_indices_lookup,
                      cosine_sim_genome, genome_movie_indices, # Added genome_movie_ids
                      weights=(0.4, 0.3, 0.3), k_similar=10):
    """
    Generates a hybrid recommendation score for a given user and movie.
    Uses user's ratings on similar items for content and knowledge-based scores.

    Args:
        userId (int): The ID of the user.
        movieId (int): The ID of the movie.
        movies_df (pd.DataFrame): DataFrame containing movie information.
        ratings_df (pd.DataFrame): DataFrame containing user ratings.
        algo_svd: Trained SVD model from Surprise.
        trainset_surprise: Surprise trainset.
        cosine_sim_content (np.array): Cosine similarity matrix for content-based filtering.
        content_movie_indices_lookup (pd.Series): Lookup Series for movie titles to content matrix indices.
        cosine_sim_genome (np.array): Cosine similarity matrix for genome-based filtering.
        genome_movie_indices (dict): Lookup for movieId to genome matrix indices.
        weights (tuple): Weights for SVD, content-based, and knowledge-based predictions.
        k_similar (int): Number of similar items to consider for content/KB predictions.

    Returns:
        float: The hybrid recommendation score.
        dict: Individual scores from each model.
    """
    w_svd, w_content, w_kb = weights
    individual_scores = {}
    min_rating, max_rating = trainset_surprise.rating_scale

    # SVD Prediction
    svd_pred = trainset_surprise.global_mean # Default to global mean
    try:
        # Check if user and item are known to the trainset
        if trainset_surprise.knows_user(userId) and trainset_surprise.knows_item(movieId):
            inner_user_id = trainset_surprise.to_inner_uid(userId)
            inner_movie_id = trainset_surprise.to_inner_iid(movieId)
            svd_pred = algo_svd.predict(inner_user_id, inner_movie_id).est
        elif movieId in movies_df['movieId'].values:
             # Fallback: movie's average rating if known, else global mean
            movie_avg_rating_val = movies_df.loc[movies_df['movieId'] == movieId, 'movie_avg_rating'].iloc[0]
            if pd.notna(movie_avg_rating_val) and movie_avg_rating_val > 0:
                svd_pred = movie_avg_rating_val
            # else svd_pred remains global_mean
    except ValueError: # Handles cases where user/item might not be convertible to inner IDs
        # print(f"SVD ValueError for user {userId}, movie {movieId}. Using fallback.")
        if movieId in movies_df['movieId'].values:
            movie_avg_rating_val = movies_df.loc[movies_df['movieId'] == movieId, 'movie_avg_rating'].iloc[0]
            if pd.notna(movie_avg_rating_val) and movie_avg_rating_val > 0:
                svd_pred = movie_avg_rating_val
    except Exception as e:
        # print(f"Unexpected SVD error for user {userId}, movie {movieId}: {e}. Using fallback.")
        if movieId in movies_df['movieId'].values:
            movie_avg_rating_val = movies_df.loc[movies_df['movieId'] == movieId, 'movie_avg_rating'].iloc[0]
            if pd.notna(movie_avg_rating_val) and movie_avg_rating_val > 0:
                svd_pred = movie_avg_rating_val
    individual_scores['svd'] = svd_pred

    # Get ratings by the current user
    user_ratings = ratings_df[ratings_df['userId'] == userId]

    # Content-Based Prediction
    content_pred = svd_pred # Default to SVD if content score cannot be calculated
    try:
        # Ensure movie_title can be found
        movie_details = movies_df[movies_df['movieId'] == movieId]
        if not movie_details.empty:
            movie_title = movie_details['title'].iloc[0] # Use original title as per lookup creation
            if movie_title in content_movie_indices_lookup:
                idx = content_movie_indices_lookup[movie_title]
                sim_scores_content = list(enumerate(cosine_sim_content[idx]))
                sim_scores_content = sorted(sim_scores_content, key=lambda x: x[1], reverse=True)[1:k_similar+1]
                
                similar_movie_indices_content = [i[0] for i in sim_scores_content]
                # These indices are for movies_df rows at the time of TF-IDF fitting.
                # Assuming movies_df order hasn't changed or lookup is robust.
                similar_movie_ids_content = movies_df.iloc[similar_movie_indices_content]['movieId'].values

                rated_similar_movies_content = user_ratings[user_ratings['movieId'].isin(similar_movie_ids_content)]
                
                if not rated_similar_movies_content.empty:
                    weighted_sum_ratings_c = 0
                    sum_similarities_c = 0
                    for _, row_c in rated_similar_movies_content.iterrows():
                        rated_movie_id_c = row_c['movieId']
                        rating_c = row_c['rating']
                        try:
                            rated_movie_title_c = movies_df.loc[movies_df['movieId'] == rated_movie_id_c, 'title'].iloc[0]
                            if rated_movie_title_c in content_movie_indices_lookup:
                                rated_movie_idx_in_content_c = content_movie_indices_lookup[rated_movie_title_c]
                                similarity_c = cosine_sim_content[idx, rated_movie_idx_in_content_c]
                                weighted_sum_ratings_c += similarity_c * rating_c
                                sum_similarities_c += similarity_c
                        except (IndexError, KeyError):
                            continue 
                    
                    if sum_similarities_c > 0:
                        content_pred = weighted_sum_ratings_c / sum_similarities_c
            # else: movie_title not in lookup, content_pred remains svd_pred
        # else: movie not in movies_df, content_pred remains svd_pred
    except (IndexError, KeyError, TypeError) as e:
        # print(f"Content-based prediction error for movie {movieId} (user {userId}): {e}")
        pass # content_pred remains svd_pred
    individual_scores['content'] = np.clip(content_pred, min_rating, max_rating)

    # Knowledge-Based (Genome) Prediction
    kb_pred = svd_pred # Default to SVD
    try:
        if movieId in genome_movie_indices and genome_movie_ids: # Ensure genome_movie_ids is available
            idx_genome = genome_movie_indices[movieId]
            sim_scores_genome = list(enumerate(cosine_sim_genome[idx_genome]))
            sim_scores_genome = sorted(sim_scores_genome, key=lambda x: x[1], reverse=True)[1:k_similar+1]

            similar_genome_matrix_indices = [i[0] for i in sim_scores_genome]
            # These indices are for the cosine_sim_genome matrix.
            # We need to map them to movieIds using genome_movie_ids list.
            similar_movie_ids_genome = [genome_movie_ids[i] for i in similar_genome_matrix_indices if i < len(genome_movie_ids)]
            
            rated_similar_movies_genome = user_ratings[user_ratings['movieId'].isin(similar_movie_ids_genome)]

            if not rated_similar_movies_genome.empty:
                weighted_sum_ratings_kb = 0
                sum_similarities_kb = 0
                for _, row_kb in rated_similar_movies_genome.iterrows():
                    rated_movie_id_kb = row_kb['movieId']
                    rating_kb = row_kb['rating']
                    try:
                        if rated_movie_id_kb in genome_movie_indices:
                            rated_movie_idx_in_genome_kb = genome_movie_indices[rated_movie_id_kb]
                            similarity_kb = cosine_sim_genome[idx_genome, rated_movie_idx_in_genome_kb]
                            weighted_sum_ratings_kb += similarity_kb * rating_kb
                            sum_similarities_kb += similarity_kb
                    except (IndexError, KeyError):
                        continue
                
                if sum_similarities_kb > 0:
                    kb_pred = weighted_sum_ratings_kb / sum_similarities_kb
            # else: user hasn't rated similar genome items, kb_pred remains svd_pred
        # else: movie not in genome_movie_indices, kb_pred remains svd_pred
    except (IndexError, KeyError, TypeError) as e:
        # print(f"Knowledge-based prediction error for movie {movieId} (user {userId}): {e}")
        pass # kb_pred remains svd_pred
    individual_scores['kb'] = np.clip(kb_pred, min_rating, max_rating)

    # Weighted Hybrid Score
    hybrid_score = (w_svd * individual_scores['svd']) + \
                   (w_content * individual_scores['content']) + \
                   (w_kb * individual_scores['kb'])

    hybrid_score = np.clip(hybrid_score, min_rating, max_rating)

    return hybrid_score, individual_scores

print("Hybrid prediction function defined.")

# --- 3. Example: Get hybrid predictions for a sample user and item ---

sample_user_id = 1
sample_movie_id = 318 

if 'movies_df' not in globals() or movies_df.empty:
    print("movies_df is empty, cannot pick a sample movie for example.")
elif sample_movie_id not in movies_df['movieId'].values:
    print(f"Movie with ID {sample_movie_id} not found in movies_df. Using first movie as default.")
    sample_movie_id = movies_df['movieId'].iloc[0]

if 'ratings_df' not in globals() or ratings_df.empty:
    print("ratings_df not available to check sample_user_id for example.")
elif sample_user_id not in ratings_df['userId'].unique():
    print(f"User with ID {sample_user_id} not found in ratings_df. Using first user as default.")
    sample_user_id = ratings_df['userId'].unique()[0]

print(f"\nGetting hybrid prediction for User {sample_user_id}, Movie {sample_movie_id}...")

items_to_predict = []
if 'ratings_df' in globals() and not ratings_df.empty and \
   'movies_df' in globals() and not movies_df.empty and \
   'trainset_surprise' in globals() and \
   'genome_movie_ids' in globals() and genome_movie_ids: # Ensure genome_movie_ids is ready
    try:
        sample_interaction = ratings_df.sample(1).iloc[0]
        items_to_predict = [(int(sample_interaction['userId']), int(sample_interaction['movieId']))]
    except Exception as e:
        print(f"Could not get sample interaction: {e}")
        items_to_predict = [(sample_user_id, sample_movie_id)] # Fallback to predefined sample
else:
    print("Skipping example prediction due to missing dataframes or trainset_surprise or genome_movie_ids.")
    items_to_predict = [(sample_user_id, sample_movie_id)] # Attempt with predefined if data is missing

if items_to_predict:
    user_id_ex, movie_id_ex = items_to_predict[0]
    print(f"Predicting for User: {user_id_ex}, Movie: {movie_id_ex}")
    try:
        # Check all required components for the function call
        if not all(k in globals() for k in ['movies_df', 'ratings_df', 'algo_svd', 'trainset_surprise',
                                            'cosine_sim_content', 'content_movie_indices_lookup',
                                            'cosine_sim_genome', 'genome_movie_indices', 'genome_movie_ids']):
            print("Error: One or more model components are missing for the example prediction.")
        elif not genome_movie_ids: # Specific check for genome_movie_ids list
             print("Error: genome_movie_ids list is empty. Cannot make example prediction.")
        else:
            hybrid_score, individual_scores = hybrid_prediction(
                userId=user_id_ex,
                movieId=movie_id_ex,
                movies_df=movies_df,
                ratings_df=ratings_df,
                algo_svd=algo_svd,
                trainset_surprise=trainset_surprise,
                cosine_sim_content=cosine_sim_content,
                content_movie_indices_lookup=content_movie_indices_lookup,
                cosine_sim_genome=cosine_sim_genome,
                genome_movie_indices=genome_movie_indices,
                genome_movie_ids=genome_movie_ids, # Pass the list of movie IDs for genome
                weights=(0.4, 0.3, 0.3)
            )
            movie_title_display = "Unknown Movie"
            if 'movies_df' in globals() and movie_id_ex in movies_df['movieId'].values:
                 movie_title_display = movies_df.loc[movies_df['movieId'] == movie_id_ex, 'title'].iloc[0]
            
            print(f"  Hybrid Score for '{movie_title_display}' (ID {movie_id_ex}): {hybrid_score:.4f}")
            print(f"  Individual Scores: {individual_scores}")

    except Exception as e:
        print(f"Error during example prediction for User {user_id_ex}, Movie {movie_id_ex}: {e}")
        import traceback
        traceback.print_exc()
else:
    print("No items to predict for the example. Skipping example prediction block.")

print("\n--- Hybridization Cell Execution Complete ---")

Loading data and models for hybridization...
movies_df already in memory.
ratings_df already in memory.
Loading SVD model from /workspaces/soe-python/models/adv_algo_svd.pkl...
SVD model loaded.
trainset_surprise already in memory and seems valid.
SVD model loaded.
trainset_surprise already in memory and seems valid.
Loading content cosine similarity from /workspaces/soe-python/models/adv_cosine_sim_content.npy using mmap_mode='r'...
Content cosine similarity loaded. Shape: (62423, 62423).
Loading content cosine similarity from /workspaces/soe-python/models/adv_cosine_sim_content.npy using mmap_mode='r'...
Content cosine similarity loaded. Shape: (62423, 62423).
Loading content movie indices lookup from /workspaces/soe-python/models/content_movie_indices_lookup.pkl...
Content movie indices lookup loaded.
Loading content movie indices lookup from /workspaces/soe-python/models/content_movie_indices_lookup.pkl...
Content movie indices lookup loaded.
Content-based components (cosine simila

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
import time

print("--- Implementing Ranking Metrics ---")

def get_top_n_recommendations_for_user(user_id, items_to_predict_ids, hybrid_prediction_func, movies_df, ratings_df, algo_svd, trainset_surprise, cosine_sim_content, content_movie_indices_lookup, cosine_sim_genome, genome_movie_indices, genome_movie_ids, hybrid_weights, n=10):
    """Generates top N recommendations for a single user."""
    predictions = []
    for movie_id in items_to_predict_ids:
        try:
            # Ensure genome_movie_ids is passed to hybrid_prediction_func
            pred_rating, _ = hybrid_prediction_func(
                userId=user_id,
                movieId=movie_id,
                movies_df=movies_df,
                ratings_df=ratings_df,
                algo_svd=algo_svd,
                trainset_surprise=trainset_surprise,
                cosine_sim_content=cosine_sim_content,
                content_movie_indices_lookup=content_movie_indices_lookup,
                cosine_sim_genome=cosine_sim_genome,
                genome_movie_indices=genome_movie_indices,
                genome_movie_ids=genome_movie_ids, # Restored argument
                weights=hybrid_weights
            )
            predictions.append((movie_id, pred_rating))
        except Exception as e:
            # print(f"Error predicting for user {user_id}, movie {movie_id} in top_n: {e}")
            continue # Skip if prediction fails for an item
    
    predictions.sort(key=lambda x: x[1], reverse=True)
    top_n_recs = [item_id for item_id, score in predictions[:n]]
    return top_n_recs

# ... (precision_recall_at_k and ndcg_at_k functions remain unchanged) ...
def precision_recall_at_k(predictions_list, true_relevant_items_map, k_values):
    """Calculates Precision@k and Recall@k for different k."""
    precisions = {k: [] for k in k_values}
    recalls = {k: [] for k in k_values}

    for user_id, top_n_recs in predictions_list.items():
        true_relevant_set = true_relevant_items_map.get(user_id, set())
        if not true_relevant_set: # Skip user if no relevant items in test set
            continue

        for k_val in k_values:
            if k_val == 0: # Avoid division by zero for k=0
                precisions[k_val].append(0.0)
                recalls[k_val].append(0.0)
                continue
            
            recommended_at_k = top_n_recs[:k_val]
            num_relevant_in_top_k = len(set(recommended_at_k) & true_relevant_set)
            
            precisions[k_val].append(num_relevant_in_top_k / k_val if k_val > 0 else 0)
            recalls[k_val].append(num_relevant_in_top_k / len(true_relevant_set) if len(true_relevant_set) > 0 else 0)

    avg_precisions = {k: np.mean(precisions[k]) if precisions[k] else 0 for k in k_values}
    avg_recalls = {k: np.mean(recalls[k]) if recalls[k] else 0 for k in k_values}
    return avg_precisions, avg_recalls

def ndcg_at_k(predictions_list, true_relevant_items_map, k_values):
    """Calculates nDCG@k for different k."""
    ndcgs = {k: [] for k in k_values}

    for user_id, top_n_recs in predictions_list.items():
        true_relevant_set = true_relevant_items_map.get(user_id, set())
        if not true_relevant_set: # Skip user if no relevant items in test set for IDCG calculation
            continue
            
        for k_val in k_values:
            if k_val == 0:
                ndcgs[k_val].append(0.0)
                continue

            dcg = 0
            recommended_at_k = top_n_recs[:k_val]
            for i, item_id in enumerate(recommended_at_k):
                if item_id in true_relevant_set:
                    dcg += 1 / np.log2(i + 2) # Relevance is 1 if relevant, 0 otherwise. Add 2 because log2(1)=0.
            
            idcg = 0
            # Ideal ranking: all relevant items (up to k_val) are at the top
            num_relevant_to_consider_for_idcg = min(k_val, len(true_relevant_set))
            for i in range(num_relevant_to_consider_for_idcg):
                idcg += 1 / np.log2(i + 2)
            
            ndcgs[k_val].append(dcg / idcg if idcg > 0 else 0)
            
    avg_ndcgs = {k: np.mean(ndcgs[k]) if ndcgs[k] else 0 for k in k_values}
    return avg_ndcgs
# ... (rest of the cell setup code remains unchanged) ...
print("\\n--- Setting up for Ranking Metrics Evaluation ---")

# Ensure all necessary components are loaded and available (similar to Cell 18)
required_components_ranking = [
    'movies_df', 'ratings_df', 'algo_svd', 'trainset_surprise',
    'cosine_sim_content', 'content_movie_indices_lookup',
    'cosine_sim_genome', 'genome_movie_indices', 'genome_movie_ids',
    'testset_surprise', 'hybrid_prediction' # Ensure the prediction function is available
]
missing_components_ranking = [comp for comp in required_components_ranking if comp not in globals() or globals()[comp] is None]

if missing_components_ranking:
    print(f"Error: Missing necessary components for ranking evaluation: {missing_components_ranking}")
    print("Please ensure all previous cells, especially data loading, model training/loading, and hybrid function definition, have been run successfully.")
else:
    print("All necessary components for ranking evaluation appear to be present.")
    
    # Parameters for ranking evaluation
    K_VALUES_FOR_RANKING = [5, 10, 20] # k values for P@k, R@k, nDCG@k
    RELEVANCE_THRESHOLD = 4.0
    NUM_USERS_TO_EVALUATE = 100 # Subset of users for faster evaluation. Set to None for all test users.
    NUM_ITEMS_TO_PREDICT_PER_USER = 1000 # Number of candidate items to rank for each user.
                                        # Predicting for ALL non-training items can be very slow.

    # 1. Identify relevant items for each user in the test set
    true_relevant_items_map = defaultdict(set)
    test_user_item_pairs = defaultdict(list)
    for uid, iid, true_r in testset_surprise:
        test_user_item_pairs[uid].append(iid)
        if true_r >= RELEVANCE_THRESHOLD:
            true_relevant_items_map[uid].add(iid)
    
    print(f"Identified {len(true_relevant_items_map)} users with relevant items in the test set (rating >= {RELEVANCE_THRESHOLD}).")

    # 2. Select users for evaluation
    users_for_evaluation = list(true_relevant_items_map.keys())
    if NUM_USERS_TO_EVALUATE is not None and NUM_USERS_TO_EVALUATE < len(users_for_evaluation):
        users_for_evaluation = np.random.choice(users_for_evaluation, size=NUM_USERS_TO_EVALUATE, replace=False).tolist()
    print(f"Will evaluate ranking metrics for {len(users_for_evaluation)} users.")

    # 3. Get all unique movie IDs from movies_df to serve as candidate items
    all_movie_ids_global = movies_df['movieId'].unique()
    
    # 4. Generate top-N recommendations for selected users
    all_users_top_n_recs = {}
    evaluation_start_time = time.time()

    hybrid_weights_for_ranking = globals().get('hybrid_weights', (0.7, 0.15, 0.15)) 
    print(f"Using hybrid_weights for ranking: {hybrid_weights_for_ranking}")

    first_user_diagnostics_done = False 

    for i, user_id in enumerate(users_for_evaluation):
        if (i + 1) % 10 == 0:
            if not (not first_user_diagnostics_done and i == 0):
                 print(f"Generating recommendations for user {i+1}/{len(users_for_evaluation)} (ID: {user_id})...")
        
        try:
            inner_uid_diag = trainset_surprise.to_inner_uid(user_id)
            items_rated_by_user_in_train = {trainset_surprise.to_raw_iid(iid) for iid in trainset_surprise.ur[inner_uid_diag]}
        except ValueError: 
            items_rated_by_user_in_train = set()
        except Exception as e_train_items: 
            items_rated_by_user_in_train = set()

        candidate_items_for_user = list(set(all_movie_ids_global) - items_rated_by_user_in_train)
        
        if NUM_ITEMS_TO_PREDICT_PER_USER is not None and NUM_ITEMS_TO_PREDICT_PER_USER < len(candidate_items_for_user):
            items_to_predict_for_this_user = np.random.choice(candidate_items_for_user, size=NUM_ITEMS_TO_PREDICT_PER_USER, replace=False).tolist()
        else:
            items_to_predict_for_this_user = candidate_items_for_user
        
        if not items_to_predict_for_this_user:
            all_users_top_n_recs[user_id] = []
            if not first_user_diagnostics_done: 
                print(f"\\n--- DIAGNOSTICS FOR FIRST USER: {user_id} ---")
                print(f"User {user_id} has no candidate items to predict. Skipping actual prediction.")
                print(f"True relevant items: {true_relevant_items_map.get(user_id, set())}")
                print(f"Items rated in train: {items_rated_by_user_in_train}")
                print(f"--- END DIAGNOSTICS FOR USER: {user_id} ---")
                first_user_diagnostics_done = True
            continue

        max_k_needed = max(K_VALUES_FOR_RANKING) if K_VALUES_FOR_RANKING else 20 

        if not first_user_diagnostics_done:
            print(f"\\n--- DIAGNOSTICS FOR FIRST USER: {user_id} ---")
            print(f"Processing user {i+1}/{len(users_for_evaluation)} (ID: {user_id}) for diagnostics.")
            print(f"True relevant items: {true_relevant_items_map.get(user_id, set())}")
            print(f"Items rated in train (first 20): {list(items_rated_by_user_in_train)[:20]}")
            print(f"Candidate items for prediction (first 20 of {len(items_to_predict_for_this_user)}): {items_to_predict_for_this_user[:20]}")

            raw_predictions_for_diag_user = []
            print(f"Generating raw hybrid predictions for {len(items_to_predict_for_this_user)} candidate items...")
            for item_idx, movie_id_diag in enumerate(items_to_predict_for_this_user):
                if item_idx < 30 or item_idx % 100 == 0 : 
                    pass
                try:
                    # Ensure genome_movie_ids is passed here as well for diagnostics
                    pred_rating_diag, individual_scores_diag = hybrid_prediction(
                        userId=user_id, movieId=movie_id_diag,
                        movies_df=movies_df, ratings_df=ratings_df, algo_svd=algo_svd,
                        trainset_surprise=trainset_surprise, cosine_sim_content=cosine_sim_content,
                        content_movie_indices_lookup=content_movie_indices_lookup,
                        cosine_sim_genome=cosine_sim_genome, genome_movie_indices=genome_movie_indices,
                        genome_movie_ids=genome_movie_ids, # Restored argument
                        weights=hybrid_weights_for_ranking
                    )
                    raw_predictions_for_diag_user.append((movie_id_diag, pred_rating_diag, individual_scores_diag))
                except Exception as e_diag:
                    raw_predictions_for_diag_user.append((movie_id_diag, f"ERROR: {str(e_diag)}", {}))
            
            raw_predictions_for_diag_user.sort(
                key=lambda x: x[1] if isinstance(x[1], (int, float)) else -1.0, 
                reverse=True
            )
            print(f"Raw hybrid predictions (Top 30 with individual scores):")
            for pred_item in raw_predictions_for_diag_user[:30]:
                 print(f"  MovieID: {pred_item[0]}, PredictedRating: {pred_item[1]}, Individual: {pred_item[2]}")
            if len(raw_predictions_for_diag_user) > 30:
                print(f"Raw hybrid predictions (Bottom 5 with individual scores):")
                for pred_item in raw_predictions_for_diag_user[-5:]:
                    print(f"  MovieID: {pred_item[0]}, PredictedRating: {pred_item[1]}, Individual: {pred_item[2]}")

        # Actual call to get top N for all users
        # Ensure genome_movie_ids is passed here
        user_top_n = get_top_n_recommendations_for_user(
            user_id, items_to_predict_for_this_user, hybrid_prediction, 
            movies_df, ratings_df, algo_svd, trainset_surprise, 
            cosine_sim_content, content_movie_indices_lookup, 
            cosine_sim_genome, genome_movie_indices, genome_movie_ids, # Restored argument
            hybrid_weights_for_ranking, n=max_k_needed
        )
        all_users_top_n_recs[user_id] = user_top_n

        if not first_user_diagnostics_done:
            print(f"Top-{max_k_needed} recommendations based on get_top_n_recommendations_for_user: {user_top_n}")
            relevant_for_diag_user = true_relevant_items_map.get(user_id, set())
            recommended_set = set(user_top_n)
            intersection = relevant_for_diag_user.intersection(recommended_set)
            print(f"Intersection with true relevant: {intersection} (Size: {len(intersection)})")
            print(f"--- END DIAGNOSTICS FOR USER: {user_id} ---")
            first_user_diagnostics_done = True
            # if i == 0: 
            #     print("Diagnostic run complete for the first user. Breaking loop.")
            #     break 
# ... (The rest of the cell, including calculation and printing of metrics, remains unchanged) ...
# This part is omitted for brevity but should be included in the actual edit if it was part of the original cell.
# Assuming the cell ends after the loop for users_for_evaluation or proceeds to calculate and print metrics.
# For this edit, we are only focusing on restoring genome_movie_ids.
# If the cell had more code after the loop, it should be preserved.

--- Implementing Ranking Metrics ---
\n--- Setting up for Ranking Metrics Evaluation ---
All necessary components for ranking evaluation appear to be present.
Identified 158667 users with relevant items in the test set (rating >= 4.0).
Will evaluate ranking metrics for 100 users.
Using hybrid_weights for ranking: (0.7, 0.15, 0.15)
\n--- DIAGNOSTICS FOR FIRST USER: 81933 ---
Processing user 1/100 (ID: 81933) for diagnostics.
True relevant items: {34048, 44929, 1291, 780, 527, 33679, 39446, 150, 7454, 7458, 4770, 8361, 3257, 1722, 3005, 1597, 2115, 1101, 3534, 593, 48082, 597, 26198, 2005, 6874, 37741, 48877, 2671, 7154, 6006, 3578, 2427, 32124, 2431}
Items rated in train (first 20): []
Candidate items for prediction (first 20 of 1000): [154260, 111585, 177601, 80574, 100291, 171947, 169004, 122302, 191277, 142652, 70978, 125449, 180695, 160058, 153955, 208293, 5829, 145905, 5802, 5981]
Generating raw hybrid predictions for 1000 candidate items...
Raw hybrid predictions (Top 30 with indi