
# Introduction
In this notebook, I continue from the exploratory data analysis (EDA) to build recommendation models based on the insights gained. The goal is to implement different recommendation algorithms, evaluate their performance, and understand which models perform best for our dataset.



### Import Libraries

First, we need to import the necessary libraries for modeling and data manipulation. Ensure that you have installed all required packages using pip if necessary.

In [134]:
# pip install pandas numpy matplotlib seaborn scikit-learn surprise

import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
from surprise import Dataset, Reader, SVD, KNNBasic, accuracy
from datetime import datetime
import ast
import warnings
warnings.filterwarnings('ignore')


### Load Cleaned Data
We load the cleaned versions of the titles and interactions datasets that were saved at the end of the EDA notebook.

In [135]:
titles = pd.read_csv('titles_cleaned.csv')
interactions = pd.read_csv('interactions_cleaned.csv')

## Feature Engineering

### Convert String Representation of Lists Back to Lists
The multivalued columns in the titles dataset are stored as strings. We'll convert them back to actual lists.

In [136]:
def str_to_list(x):
    try:
        return ast.literal_eval(x)
    except:
        return ['Unknown']

multivalued_columns = ['GENRE_TMDB', 'DIRECTOR', 'ACTOR', 'PRODUCER']
for col in multivalued_columns:
    titles[col] = titles[col].apply(str_to_list)


In [137]:
titles

Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,ORIGINAL_LANGUAGE,RELEASE_DURATION_DAYS,GENRE_TMDB,DIRECTOR,ACTOR,PRODUCER
0,tm1282307,L'ultima notte di Amore,it,484.0,"[drama, thriller]",[Andrea Di Stefano],"[Pierfrancesco Favino, Linda Caridi, Antonio G...","[Benedetto Habib, Daniel Campos Pavoncelli, Fa..."
1,tm1338500,Bird Box Barcelona,es,357.0,"[horror, scifi, thriller]","[David Pastor, Àlex Pastor]","[Mario Casas, Georgina Campbell, Diego Calva, ...","[Adrián Guerra, Chris Morgan, Dylan Clark, Núr..."
2,ts371824,Steeltown Murders,en,417.0,"[crime, drama, history, thriller]",[Marc Evans],"[Scott Arthur, Sion Alun Davies, Keith Allen, ...",[Hannah Thomas]
3,tm123363,Expend4bles,en,294.0,"[action, thriller, war]",[Scott Waugh],"[Jason Statham, Sylvester Stallone, 50 Cent, M...","[Jason Statham, Jeffrey Greenstein, Jonathan Y..."
4,tm1045025,65,en,491.0,"[action, drama, scifi, thriller]","[Bryan Woods, Scott Beck]","[Adam Driver, Ariana Greenblatt, Chloe Coleman...","[Bryan Woods, Deborah Liebling, Sam Raimi, Sco..."
...,...,...,...,...,...,...,...,...
20624,ts21325,Hunter,en,14535.0,"[action, crime, drama, thriller]","[David Soul, Tony Mordente, Gus Trikonis, Jame...","[Fred Dryer, Stepfanie Kramer, Charles Hallaha...","[Frank Lupo, Fred Dryer, George Geiger, Lawren..."
20625,tm1382322,Strange Darling,en,3720.0,"[horror, thriller]",[JT Mollner],"[Willa Fitzgerald, Kyle Gallner, Jason Patric,...","[Bill Block, Giovanni Ribisi, Roy Lee, Steven ..."
20626,ts21242,Mission: Impossible,en,21111.0,"[action, crime, drama, thriller]","[Tom Gries, Leonard J. Horn, Seymour Robbie, H...","[Peter Graves, Greg Morris, Peter Lupus, Bob J...","[Allan Balter, Barry Crane, Joseph Gantman, Ro..."
20627,ts37497,Popeye the Sailor,en,23401.0,"[animation, comedy, family, romance]","[Jack Kinney, Paul Fennell, Bob Bemiller, Tom ...","[Jack Mercer, Mae Questel, Jackson Beck]",[Al Brodax]


### Handle Less Frequent Categories
To reduce dimensionality and prevent the "curse of dimensionality," we'll group less frequent items as 'other'.

In [138]:
def get_top_items(column, min_count):
    all_items = titles.explode(column)[column]
    item_counts = all_items.value_counts()
    top_items = item_counts[item_counts >= min_count].index.tolist()
    return top_items

# Thresholds
director_min_count = 5
actor_min_count = 10
producer_min_count = 5

# Get top items
top_directors = get_top_items('DIRECTOR', director_min_count)
top_actors = get_top_items('ACTOR', actor_min_count)
top_producers = get_top_items('PRODUCER', producer_min_count)

# Replace less frequent items
def replace_less_frequent(items, top_items):
    return [item if item in top_items else 'other' for item in items]

titles['DIRECTOR'] = titles['DIRECTOR'].apply(lambda x: replace_less_frequent(x, top_directors))
titles['ACTOR'] = titles['ACTOR'].apply(lambda x: replace_less_frequent(x, top_actors))
titles['PRODUCER'] = titles['PRODUCER'].apply(lambda x: replace_less_frequent(x, top_producers))


#### Handle Release Dates
We'll process the RELEASE_DURATION_DAYS column to obtain the actual release dates.

In [139]:
# Ensure the 'reference_date' is a valid datetime
reference_date = pd.to_datetime(interactions['COLLECTOR_TSTAMP'].max(), errors='coerce', utc=True)

# Convert 'RELEASE_DURATION_DAYS' to numeric, drop NaNs, and convert to timedelta
titles['RELEASE_DURATION_DAYS'] = pd.to_numeric(titles['RELEASE_DURATION_DAYS'], errors='coerce')
titles = titles.dropna(subset=['RELEASE_DURATION_DAYS'])  # Drop rows with NaN after conversion

# Step 2: Identify extreme values (filter out values beyond 100 years)
threshold_days = pd.to_numeric(pd.Timedelta(days=36500).days)  # Convert 100 years to days and then to numeric
extreme_titles = titles[titles['RELEASE_DURATION_DAYS'] > threshold_days]
print(f"Number of titles with extreme RELEASE_DURATION_DAYS: {len(extreme_titles)}")

# Option A: Remove titles with extreme values
titles = titles[titles['RELEASE_DURATION_DAYS'] <= threshold_days]

# Convert 'RELEASE_DURATION_DAYS' to timedelta only after removing extreme values
titles['RELEASE_DURATION_DAYS'] = pd.to_timedelta(titles['RELEASE_DURATION_DAYS'], unit='D')

# Step 3: Calculate 'RELEASE_DATE' by subtracting 'RELEASE_DURATION_DAYS' from 'reference_date'
titles['RELEASE_DATE'] = reference_date - titles['RELEASE_DURATION_DAYS']

# Ensure 'RELEASE_DATE' is valid datetime
titles['RELEASE_DATE'] = pd.to_datetime(titles['RELEASE_DATE'], errors='coerce', utc=True)

# Step 4: Drop rows with invalid 'RELEASE_DATE' values (NaT values)
titles = titles.dropna(subset=['RELEASE_DATE'])

# Step 5: Reset index of titles to ensure indices are from 0 to N-1
titles = titles.reset_index(drop=True)

Number of titles with extreme RELEASE_DURATION_DAYS: 33


#### Create Mapping for Recommendations
We create mappings from TITLE_ID to indices and vice versa to facilitate quick lookups during recommendations.

In [140]:
# Create mapping from TITLE_ID to index
title_id_to_idx = pd.Series(titles.index, index=titles['TITLE_ID']).drop_duplicates()

# Create reverse mapping from index to TITLE_ID
idx_to_title_id = pd.Series(titles['TITLE_ID'].values, index=titles.index)

#### Create Combined Feature ('SOUP')

We'll create a combined text feature by concatenating important metadata. This will be used for content-based filtering.

In [141]:
titles

Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,ORIGINAL_LANGUAGE,RELEASE_DURATION_DAYS,GENRE_TMDB,DIRECTOR,ACTOR,PRODUCER,RELEASE_DATE
0,tm1282307,L'ultima notte di Amore,it,484 days,"[drama, thriller]",[other],"[Pierfrancesco Favino, other, other, other, ot...","[Benedetto Habib, Daniel Campos Pavoncelli, Fa...",2023-03-09 04:29:50.019000+00:00
1,tm1338500,Bird Box Barcelona,es,357 days,"[horror, scifi, thriller]","[other, other]","[Mario Casas, Georgina Campbell, other, other,...","[Adrián Guerra, Chris Morgan, Dylan Clark, Núr...",2023-07-14 04:29:50.019000+00:00
2,ts371824,Steeltown Murders,en,417 days,"[crime, drama, history, thriller]",[other],"[other, other, other, other, Aneurin Barnard, ...",[other],2023-05-15 04:29:50.019000+00:00
3,tm123363,Expend4bles,en,294 days,"[action, thriller, war]",[other],"[Jason Statham, Sylvester Stallone, 50 Cent, M...","[other, Jeffrey Greenstein, Jonathan Yunger, K...",2023-09-15 04:29:50.019000+00:00
4,tm1045025,65,en,491 days,"[action, drama, scifi, thriller]","[other, other]","[Adam Driver, other, other, other, other]","[other, other, Sam Raimi, other, other]",2023-03-02 04:29:50.019000+00:00
...,...,...,...,...,...,...,...,...,...
20591,ts21325,Hunter,en,14535 days,"[action, crime, drama, thriller]","[David Soul, Tony Mordente, Gus Trikonis, Jame...","[other, other, Charles Hallahan, other]","[other, other, other, other, other, Stephen J....",1984-09-18 04:29:50.019000+00:00
20592,tm1382322,Strange Darling,en,3720 days,"[horror, thriller]",[other],"[other, Kyle Gallner, Jason Patric, Giovanni R...","[Bill Block, other, Roy Lee, Steven Schneider]",2014-04-29 04:29:50.019000+00:00
20593,ts21242,Mission: Impossible,en,21111 days,"[action, crime, drama, thriller]","[Tom Gries, Leonard J. Horn, Seymour Robbie, H...","[other, other, other, other, other]","[other, other, other, other, other]",1966-09-17 04:29:50.019000+00:00
20594,ts37497,Popeye the Sailor,en,23401 days,"[animation, comedy, family, romance]","[Jack Kinney, other, other, other, other, other]","[other, other, other]",[other],1960-06-10 04:29:50.019000+00:00


In [142]:
titles['SOUP'] = titles['ORIGINAL_TITLE'] + ' ' + \
                 titles['GENRE_TMDB'].apply(lambda x: ' '.join(x)) + ' ' + \
                 titles['DIRECTOR'].apply(lambda x: ' '.join(x)) + ' ' + \
                 titles['ACTOR'].apply(lambda x: ' '.join(x)) + ' ' + \
                 titles['PRODUCER'].apply(lambda x: ' '.join(x))

## Content-Based Filtering using Soup Features
### Version 1: Optimized
We optimize content-based filtering by applying Truncated SVD to reduce dimensionality, which speeds up the computation of cosine similarities.

#### Approach 2: Separate Features Similarity Search removed due to low computing capabilities of my local pc

There will be no Version 2 because I needed to remove it. I implement feature based recommentations with custom weight based on feature importances but unfortunately my poor computer couldn't handle that well.. Let's continue with not really ideal SOUP approach.

In [143]:
from sklearn.decomposition import TruncatedSVD

# Limit the vocabulary size to 5000 most frequent words
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(titles['SOUP'])
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")


# Apply TruncatedSVD to reduce dimensions
svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_matrix_svd = svd.fit_transform(tfidf_matrix)
print(f"Reduced TF-IDF Matrix Shape: {tfidf_matrix_svd.shape}")

cosine_sim_svd = cosine_similarity(tfidf_matrix_svd, tfidf_matrix_svd)

TF-IDF Matrix Shape: (20596, 5000)
Reduced TF-IDF Matrix Shape: (20596, 100)


## Collaborative Filtering
We implement collaborative filtering using the Surprise library, ensuring a time-wise train-test split to prevent data leakage.

### Prepare Data
Assign Ratings Based on Interaction Types
We map different interaction types to implicit ratings, which will be used for training the collaborative filtering model.

In [146]:
# Map interaction types to ratings
interaction_weights = {
    'likelist_addition': 3,
    'seenlist_addition': 2,
    'watchlist_addition': 1,
    'clickout_provider': 0.5
}
# Why these ratings? I don't have any business proved proof, just feeling it.

interactions['RATING'] = interactions['INTERACTION_TYPE'].map(interaction_weights)
interactions.dropna(subset=['RATING'], inplace=True)
interactions['RATING'] = interactions['RATING'].astype(float)

interactions['COLLECTOR_TSTAMP'] = pd.to_datetime(interactions['COLLECTOR_TSTAMP'], errors='coerce', utc=True)


# Check for NaT values
num_nat = interactions['COLLECTOR_TSTAMP'].isna().sum()
print(f"Number of NaT values in 'COLLECTOR_TSTAMP': {num_nat}")

# Drop rows where 'COLLECTOR_TSTAMP' could not be parsed
interactions.dropna(subset=['COLLECTOR_TSTAMP'], inplace=True)


Number of NaT values in 'COLLECTOR_TSTAMP': 0


#### Time-wise Train-Test Split
We split the data based on a cutoff date to prevent future data from leaking into the training set. And also implementing sampling_percentage with direct request of my poor computer.

In [147]:
# Sort data by timestamp
interactions_sorted = interactions.sort_values('COLLECTOR_TSTAMP')

# Define cutoff date (Get last week for testing)
cutoff_date = pd.to_datetime('2024-06-30', utc=True)

# Split data
train_data_cf = interactions_sorted[interactions_sorted['COLLECTOR_TSTAMP'] < cutoff_date]
test_data_cf = interactions_sorted[interactions_sorted['COLLECTOR_TSTAMP'] >= cutoff_date]


# Define the sampling percentage for the train data (e.g., 50%)
sampling_percentage = 0.05

# Sample the specified percentage of the train data
train_data_sampled = train_data_cf.sample(frac=sampling_percentage, random_state=42)

# Check the size of the sampled train data
print(f"Original training data size: {len(train_data_cf)}")
print(f"Sampled training data size: {len(train_data_sampled)}")
print(f"Testing data size: {len(test_data_cf)}")


# Get users in training and test sets
train_users = set(train_data_sampled['BE_ID'].unique())
test_users = set(test_data_cf['BE_ID'].unique())

# Find users in test set not present in training set
cold_start_users = test_users - train_users
print(f"Number of cold-start users: {len(cold_start_users)}")


Original training data size: 10856962
Sampled training data size: 542848
Testing data size: 665878
Number of cold-start users: 38190


In [148]:
train_data_cf.head()

Unnamed: 0,BE_ID,TITLE_ID,COLLECTOR_TSTAMP,INTERACTION_TYPE,RATING
7006449,4680435c1c780f7b4a8bb356bdb600c4,ts227860,2024-04-06 00:00:01.260000+00:00,watchlist_addition,1.0
7390867,f9b2a3ef6033c2d2c8fc8a04bad91043,tm1254240,2024-04-06 00:00:01.361000+00:00,clickout_provider,0.5
3016312,f2e6dafa60376cbaa73945d309dd66aa,tm54285,2024-04-06 00:00:01.438000+00:00,watchlist_addition,1.0
6278080,1a56964286263d0f765adf7ddd854128,tm921693,2024-04-06 00:00:02.007000+00:00,clickout_provider,0.5
534195,fc1258faca488ed258357a1f5847b804,ts314105,2024-04-06 00:00:02.486000+00:00,seenlist_addition,2.0


In [149]:
test_data_cf.head()

Unnamed: 0,BE_ID,TITLE_ID,COLLECTOR_TSTAMP,INTERACTION_TYPE,RATING
9714315,e7fa4bdd19163153886d3b62d040855a,tm44919,2024-06-30 00:00:00.329000+00:00,seenlist_addition,2.0
665277,30a42291c9230949c343880f0f78a1b8,tm327233,2024-06-30 00:00:00.606000+00:00,seenlist_addition,2.0
5549411,ce42f7bd49c382707d8bfc15d7c23680,ts84270,2024-06-30 00:00:02.648000+00:00,seenlist_addition,2.0
6644055,c79f058b0bce03b89e58e39a25e1991f,tm441050,2024-06-30 00:00:02.788000+00:00,seenlist_addition,2.0
8962037,c79f058b0bce03b89e58e39a25e1991f,tm441050,2024-06-30 00:00:03.509000+00:00,likelist_addition,3.0


#### Prepare Data for Surprise Library

In [150]:
# Prepare the data
reader = Reader(rating_scale=(0.5, 3)) # I know it looks sad.
train_dataset = Dataset.load_from_df(train_data_sampled[['BE_ID', 'TITLE_ID', 'RATING']], reader)
trainset = train_dataset.build_full_trainset()

# Prepare testset
testset = list(zip(test_data_cf['BE_ID'], test_data_cf['TITLE_ID'], test_data_cf['RATING']))


### Train Collaborative Filtering Model (SVD)
We use Singular Value Decomposition (SVD) for collaborative filtering.

In [125]:
# Initialize and train the SVD algorithm
algo_svd = SVD(random_state=42)
algo_svd.fit(trainset)

# Test the algorithm
predictions = algo_svd.test(testset)
rmse = accuracy.rmse(predictions)
print(f"Test RMSE: {rmse:.4f}")


RMSE: 0.6691
Test RMSE: 0.6691


In [151]:
# KNN works slow as hell, couldn't see any ANN implementation for Surprise library.

'''
# Initialize and train the Item-based KNN algorithm
sim_options_item = {
    'name': 'cosine',
    'user_based': False  # Compute similarities between items
}

algo_knn_item = KNNBasic(sim_options=sim_options_item)
algo_knn_item.fit(trainset)


predictions_knn_item = algo_knn_item.test(testset)
rmse_knn_item = accuracy.rmse(predictions_knn_item)
print(f"Item-based KNN Test RMSE: {rmse_knn_item:.4f}")


# Initialize and train the User-based KNN algorithm
sim_options_user = {
    'name': 'cosine',
    'user_based': True  # Compute similarities between users
}

algo_knn_user = KNNBasic(sim_options=sim_options_user)
algo_knn_user.fit(trainset)


predictions_knn_user = algo_knn_user.test(testset)
rmse_knn_user = accuracy.rmse(predictions_knn_user)
print(f"User-based KNN Test RMSE: {rmse_knn_user:.4f}")

'''

'\n# Initialize and train the Item-based KNN algorithm\nsim_options_item = {\n    \'name\': \'cosine\',\n    \'user_based\': False  # Compute similarities between items\n}\n\nalgo_knn_item = KNNBasic(sim_options=sim_options_item)\nalgo_knn_item.fit(trainset)\n\n\npredictions_knn_item = algo_knn_item.test(testset)\nrmse_knn_item = accuracy.rmse(predictions_knn_item)\nprint(f"Item-based KNN Test RMSE: {rmse_knn_item:.4f}")\n\n\n# Initialize and train the User-based KNN algorithm\nsim_options_user = {\n    \'name\': \'cosine\',\n    \'user_based\': True  # Compute similarities between users\n}\n\nalgo_knn_user = KNNBasic(sim_options=sim_options_user)\nalgo_knn_user.fit(trainset)\n\n\npredictions_knn_user = algo_knn_user.test(testset)\nrmse_knn_user = accuracy.rmse(predictions_knn_user)\nprint(f"User-based KNN Test RMSE: {rmse_knn_user:.4f}")\n\n'

### Recommendation Functions
We define several functions to generate recommendations using different methods.

In [152]:
from tqdm import tqdm

# Get unique users and items
unique_users = interactions['BE_ID'].unique()
unique_items = interactions['TITLE_ID'].unique()

# Create reverse mapping for content-based recommendations
indices = pd.Series(titles.index, index=titles['TITLE_ID']).drop_duplicates()

def get_recommendations(title_id, cosine_sim):
    # Check if the title_id exists in the mapping
    if title_id not in title_id_to_idx:
        return []
    
    # Get the index of the title_id
    idx = title_id_to_idx[title_id]
    
    # Get similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the titles based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Exclude the first one (itself) and get top 10 similar titles
    sim_scores = sim_scores[1:11]
    
    # Get the indices of the most similar titles
    title_indices = [i[0] for i in sim_scores]
    
    # Map indices back to TITLE_IDs using idx_to_title_id
    recommended_titles = [idx_to_title_id[i] for i in title_indices if i in idx_to_title_id.index]
    
    return recommended_titles


def get_svd_recommendations(user_id, n=10):
    # Get items the user has interacted with in the training set
    interacted_items = set(train_data_cf[train_data_cf['BE_ID'] == user_id]['TITLE_ID'])

    # Predict ratings for all items not yet interacted with
    items_to_predict = [iid for iid in unique_items if iid not in interacted_items]

    predictions = [algo_svd.predict(user_id, iid) for iid in items_to_predict]

    # Sort predictions by estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)

    # Get top N recommendations
    top_n = predictions[:n]
    top_n_iids = [pred.iid for pred in top_n]

    # Return titles of recommended items
    recommended_titles = titles[titles['TITLE_ID'].isin(top_n_iids)]
    return recommended_titles[['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']]

def get_content_based_recommendations(user_id, n=10):
    # Get items the user has interacted with
    interacted_items = set(train_data_cf[train_data_cf['BE_ID'] == user_id]['TITLE_ID'])

    all_recs = []
    for item_id in interacted_items:
        recs = get_recommendations(item_id, cosine_sim_svd)
        all_recs.extend(recs)

    # Remove items the user has already interacted with
    all_recs = [item for item in all_recs if item not in interacted_items]

    # Count the frequency of each recommended item
    from collections import Counter
    rec_counter = Counter(all_recs)

    # Get the top N recommended items
    top_recs = rec_counter.most_common(n)
    top_n_iids = [item for item, count in top_recs]

    # Return titles of recommended items
    recommended_titles = titles[titles['TITLE_ID'].isin(top_n_iids)]
    
    return recommended_titles[['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']]


def get_hybrid_recommendations(user_id, n=10):
    # Get SVD recommendations
    svd_recs_df = get_svd_recommendations(user_id, n*2)
    svd_recs = svd_recs_df['TITLE_ID'].tolist()
    
    # Get content-based recommendations
    content_recs_df = get_content_based_recommendations(user_id, n*2)
    content_recs = content_recs_df['TITLE_ID'].tolist()
    
    # Combine recommendations
    combined_recs = svd_recs + content_recs
    
    # Remove duplicates and items already interacted with
    interacted_items = set(train_data_cf[train_data_cf['BE_ID'] == user_id]['TITLE_ID'])
    combined_recs = [item for item in combined_recs if item not in interacted_items]
    
    # Ensure that the recommended titles exist in the 'titles' DataFrame
    valid_recs = titles[titles['TITLE_ID'].isin(combined_recs)]
    
    # Count frequency (if needed)
    from collections import Counter
    rec_counter = Counter(valid_recs['TITLE_ID'])
    
    # Sort items by frequency and get top N
    top_recs = rec_counter.most_common(n)
    top_n_iids = [item for item, count in top_recs]
    
    # Return titles of recommended items
    recommended_titles = titles[titles['TITLE_ID'].isin(top_n_iids)]
    
    return recommended_titles[['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']]

def get_most_popular_recommendations(user_id, n=10):
    # Get items the user has interacted with
    interacted_items = set(train_data_cf[train_data_cf['BE_ID'] == user_id]['TITLE_ID'])

    # Get most popular items (by number of interactions)
    item_popularity = interactions['TITLE_ID'].value_counts()
    popular_items = item_popularity.index.tolist()

    # Remove items already interacted with
    recommendations = [item for item in popular_items if item not in interacted_items]

    # Get top N recommendations
    top_n_items = recommendations[:n]

    # Return titles of recommended items
    recommended_titles = titles[titles['TITLE_ID'].isin(top_n_items)]
    return recommended_titles[['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']]

def get_new_arrivals_recommendations(user_id, n=10):
    # Get items the user has interacted with
    interacted_items = set(
        train_data_cf[train_data_cf['BE_ID'] == user_id]['TITLE_ID']
    )

    # Get the latest interaction timestamp for the user
    user_interaction_date = interactions[
        interactions['BE_ID'] == user_id
    ]['COLLECTOR_TSTAMP'].max()
    
    if pd.isna(user_interaction_date):
        return pd.DataFrame(columns=['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB'])

    # Define the two-week window before the user's interaction date
    start_date = user_interaction_date - pd.Timedelta(days=14)

    # Filter titles that were released within the two-week window
    new_titles_window = titles[
        (titles['RELEASE_DATE'] >= start_date) & 
        (titles['RELEASE_DATE'] <= user_interaction_date)
    ]

    if new_titles_window.empty:
        return pd.DataFrame(columns=['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB'])

    # Get popularity of new titles based on interaction counts
    new_items_interaction_counts = interactions[
        interactions['TITLE_ID'].isin(new_titles_window['TITLE_ID'])
    ]['TITLE_ID'].value_counts()

    # Sort new items by popularity
    popular_new_items = new_items_interaction_counts.index.tolist()

    # Remove items already interacted with
    recommendations = [
        item for item in popular_new_items if item not in interacted_items
    ]

    # Get top N recommendations
    top_n_items = recommendations[:n]

    # Return titles of recommended items
    recommended_titles = titles[
        titles['TITLE_ID'].isin(top_n_items)
    ]
    return recommended_titles[['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']]

def get_trending_recommendations(user_id, n=10):
    # Get items the user has interacted with
    interacted_items = set(train_data_cf[train_data_cf['BE_ID'] == user_id]['TITLE_ID'])

    # Define recent period
    recent_period = interactions['COLLECTOR_TSTAMP'].max() - pd.Timedelta(days=30)

    # Get interactions in recent period
    recent_interactions = interactions[interactions['COLLECTOR_TSTAMP'] >= recent_period]

    # Get most interacted items in recent period
    trending_items = recent_interactions['TITLE_ID'].value_counts().index.tolist()

    # Remove items already interacted with
    recommendations = [item for item in trending_items if item not in interacted_items]

    # Get top N recommendations
    top_n_items = recommendations[:n]

    # Return titles of recommended items
    recommended_titles = titles[titles['TITLE_ID'].isin(top_n_items)]
    return recommended_titles[['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']]


## Evaluation Metrics

We define several evaluation metrics to assess the performance of our recommendation algorithms. 

### Precision@K
#### Definition:

Precision@K measures the proportion of recommended items in the top K that are relevant.

#### Interpretation:

High Precision@K: Indicates that the algorithm is good at placing relevant items at the top of the recommendation list.
Importance: Useful when we care about the quality of the top recommendations.


### Recall@K
#### Definition:

Recall@K measures the proportion of relevant items that are recommended in the top K.

#### Interpretation:

High Recall@K: Indicates that the algorithm is retrieving a large portion of all relevant items.
Importance: Useful when we want to ensure that users are exposed to as many relevant items as possible.


### F1-Score@K
#### Definition:

The F1-Score is the harmonic mean of Precision@K and Recall@K.
 
#### Interpretation:

Balanced Metric: Provides a balance between precision and recall.
Importance: Useful when both precision and recall are equally important.

### Mean Reciprocal Rank (MRR@K)
#### Definition:

MRR@K measures the reciprocal of the rank of the first relevant item in the recommended list up to K.

#### Interpretation:

High MRR@K: Indicates that the first relevant item appears early in the recommendation list.
Importance: Useful when early placement of relevant items is critical.

### Normalized Discounted Cumulative Gain (nDCG@K)
#### Definition:

nDCG@K evaluates the ranking quality of the recommendations by considering the position of relevant items and providing higher scores for items that appear earlier.

#### Interpretation:

High nDCG@K: Indicates that relevant items are ranked higher in the recommendation list.
Importance: Captures both the relevance and position of items, making it a comprehensive metric.

### Coverage
#### Definition:

Coverage measures the proportion of items that can be recommended by the system.

#### Interpretation:

High Coverage: Indicates that the algorithm can recommend a wide variety of items.
Importance: Ensures diversity and reduces over-concentration on popular items.


## Why Some Metrics Are Better in Certain Contexts

### Precision vs. Recall Trade-off:

Precision-Focused: If we care more about the relevance of the top recommendations (e.g., limited screen space), precision is more important.
Recall-Focused: If we aim to expose users to as many relevant items as possible, recall becomes crucial.

### F1-Score:

Provides a balance between precision and recall, which is useful when both are equally important.

### MRR and nDCG:

MRR: Emphasizes the rank of the first relevant item, suitable when the first relevant recommendation is critical.
nDCG: Considers the entire ranking and is useful for evaluating the overall quality of the recommendation list.

### Coverage:

Important for ensuring that the system can recommend a diverse set of items and is not limited to a small subset of popular items.

I selected metrics but implementations are chatgpt sponsored, I hope it did well.

In [153]:
# Evaluation Metrics
def precision_at_k(recommended_items, relevant_items, k):
    recommended_k = recommended_items[:k]
    relevant_set = set(relevant_items)
    recommended_set = set(recommended_k)
    intersection = recommended_set.intersection(relevant_set)
    precision = len(intersection) / k
    return precision

def recall_at_k(recommended_items, relevant_items, k):
    recommended_k = recommended_items[:k]
    relevant_set = set(relevant_items)
    recommended_set = set(recommended_k)
    intersection = recommended_set.intersection(relevant_set)
    recall = len(intersection) / len(relevant_set) if len(relevant_set) > 0 else 0
    return recall

def f1_at_k(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)

def mrr_at_k(recommended_items, relevant_items, k):
    for idx, item in enumerate(recommended_items[:k]):
        if item in relevant_items:
            return 1 / (idx + 1)
    return 0

def dcg_at_k(recommended_items, relevant_items, k):
    dcg = 0.0
    for i, item in enumerate(recommended_items[:k]):
        if item in relevant_items:
            dcg += 1 / np.log2(i + 2)
    return dcg

def idcg_at_k(relevant_items, k):
    idcg = 0.0
    n_relevant = min(len(relevant_items), k)
    for i in range(n_relevant):
        idcg += 1 / np.log2(i + 2)
    return idcg

def ndcg_at_k(recommended_items, relevant_items, k):
    dcg = dcg_at_k(recommended_items, relevant_items, k)
    idcg = idcg_at_k(relevant_items, k)
    if idcg == 0:
        return 0.0
    return dcg / idcg


def evaluate_algorithm_with_metrics(recommend_func, algorithm_name):
    k = 10
    precision_list = []
    recall_list = []
    f1_list = []
    mrr_list = []
    ndcg_list = []

    recommended_items_set = set()

    # Evaluate on a sample of 100 users
    sample_users = list(test_interactions.keys())[:100]  # Adjusted to 100 users
    total_items = len(unique_items)

    # Initialize tqdm progress bar over users
    with tqdm(total=len(sample_users), desc=f"Evaluating {algorithm_name}", ncols=100) as pbar:
        for user in sample_users:
            relevant_items = test_interactions[user]
            recommended_titles_df = recommend_func(user)
            recommended_items = recommended_titles_df['TITLE_ID'].tolist()
            recommended_items_set.update(recommended_items)

            if recommended_items:
                precision = precision_at_k(recommended_items, relevant_items, k)
                recall = recall_at_k(recommended_items, relevant_items, k)
                f1 = f1_at_k(precision, recall)
                mrr = mrr_at_k(recommended_items, relevant_items, k)
                ndcg = ndcg_at_k(recommended_items, relevant_items, k)
                precision_list.append(precision)
                recall_list.append(recall)
                f1_list.append(f1)
                mrr_list.append(mrr)
                ndcg_list.append(ndcg)
            else:
                # Handle case where no recommendations are made
                precision_list.append(0)
                recall_list.append(0)
                f1_list.append(0)
                mrr_list.append(0)
                ndcg_list.append(0)

            pbar.update(1)

    # Compute average metrics
    avg_precision = np.mean(precision_list)
    avg_recall = np.mean(recall_list)
    avg_f1 = np.mean(f1_list)
    avg_mrr = np.mean(mrr_list)
    avg_ndcg = np.mean(ndcg_list)
    coverage = len(recommended_items_set) / total_items

    print(f"\nEvaluation Results for {algorithm_name}:")
    print(f"Precision@{k}: {avg_precision:.4f}")
    print(f"Recall@{k}: {avg_recall:.4f}")
    print(f"F1-Score@{k}: {avg_f1:.4f}")
    print(f"MRR@{k}: {avg_mrr:.4f}")
    print(f"nDCG@{k}: {avg_ndcg:.4f}")
    print(f"Coverage: {coverage:.4f}")
    print("\n")

    return {
        'Algorithm': algorithm_name,
        'Precision@10': avg_precision,
        'Recall@10': avg_recall,
        'F1-Score@10': avg_f1,
        'MRR@10': avg_mrr,
        'nDCG@10': avg_ndcg,
        'Coverage': coverage
    }

### Prepare Data for Evaluation

We prepare the data needed for evaluating our recommendation algorithms.

In [155]:
indices = pd.Series(titles.index, index=titles['TITLE_ID']).drop_duplicates()

# Consider 'likelist_addition' and 'seenlist_addition' as positive interactions
positive_interactions = interactions[
    interactions['INTERACTION_TYPE'].isin(['likelist_addition', 'seenlist_addition'])
]

# Time-wise train-test split
positive_interactions = positive_interactions.sort_values('COLLECTOR_TSTAMP')

cutoff_date_eval = pd.to_datetime('2024-06-30', utc=True)

train_data_eval = positive_interactions[positive_interactions['COLLECTOR_TSTAMP'] < cutoff_date_eval]
test_data_eval = positive_interactions[positive_interactions['COLLECTOR_TSTAMP'] >= cutoff_date_eval]

# Create dictionaries for train and test interactions
train_interactions = train_data_eval.groupby('BE_ID')['TITLE_ID'].apply(set).to_dict()
test_interactions = test_data_eval.groupby('BE_ID')['TITLE_ID'].apply(set).to_dict()


### valuate Algorithms
We evaluate each recommendation algorithm using the defined metrics. Be careful, this part took around 19 minutes at my computer.

In [156]:
results = []

# Content-Based (Soup Features)
result = evaluate_algorithm_with_metrics(
    lambda user_id: get_content_based_recommendations(user_id, n=10),
    "Content-Based (Soup Features)"
)
results.append(result)

# Collaborative Filtering (SVD)
result = evaluate_algorithm_with_metrics(
    get_svd_recommendations,
    "Collaborative Filtering (SVD)"
)
results.append(result)

# Hybrid Model
result = evaluate_algorithm_with_metrics(
    get_hybrid_recommendations,
    "Hybrid (SVD + Content-Based)"
)
results.append(result)

# Most Popular
result = evaluate_algorithm_with_metrics(
    get_most_popular_recommendations,
    "Most Popular"
)
results.append(result)

# New Arrivals
result = evaluate_algorithm_with_metrics(
    get_new_arrivals_recommendations,
    "New Arrivals"
)
results.append(result)

# Trending
result = evaluate_algorithm_with_metrics(
    get_trending_recommendations,
    "Trending"
)
results.append(result)

# Display Results
results_df = pd.DataFrame(results)
results_df.sort_values('F1-Score@10', ascending=False, inplace=True)
results_df.reset_index(drop=True, inplace=True)
print("Ranking of Algorithms:")
display(results_df)


Evaluating Content-Based (Soup Features): 100%|███████████████████| 100/100 [02:47<00:00,  1.67s/it]



Evaluation Results for Content-Based (Soup Features):
Precision@10: 0.0000
Recall@10: 0.0000
F1-Score@10: 0.0000
MRR@10: 0.0000
nDCG@10: 0.0000
Coverage: 0.0474




Evaluating Collaborative Filtering (SVD): 100%|███████████████████| 100/100 [01:40<00:00,  1.00s/it]



Evaluation Results for Collaborative Filtering (SVD):
Precision@10: 0.0000
Recall@10: 0.0000
F1-Score@10: 0.0000
MRR@10: 0.0000
nDCG@10: 0.0000
Coverage: 0.0152




Evaluating Hybrid (SVD + Content-Based): 100%|████████████████████| 100/100 [05:59<00:00,  3.59s/it]



Evaluation Results for Hybrid (SVD + Content-Based):
Precision@10: 0.0010
Recall@10: 0.0100
F1-Score@10: 0.0018
MRR@10: 0.0033
nDCG@10: 0.0050
Coverage: 0.0286




Evaluating Most Popular: 100%|████████████████████████████████████| 100/100 [02:31<00:00,  1.51s/it]



Evaluation Results for Most Popular:
Precision@10: 0.0050
Recall@10: 0.0199
F1-Score@10: 0.0071
MRR@10: 0.0218
nDCG@10: 0.0136
Coverage: 0.0013




Evaluating New Arrivals: 100%|████████████████████████████████████| 100/100 [03:24<00:00,  2.04s/it]



Evaluation Results for New Arrivals:
Precision@10: 0.0090
Recall@10: 0.0315
F1-Score@10: 0.0126
MRR@10: 0.0338
nDCG@10: 0.0202
Coverage: 0.0009




Evaluating Trending: 100%|████████████████████████████████████████| 100/100 [02:39<00:00,  1.60s/it]


Evaluation Results for Trending:
Precision@10: 0.0090
Recall@10: 0.0380
F1-Score@10: 0.0129
MRR@10: 0.0244
nDCG@10: 0.0188
Coverage: 0.0012


Ranking of Algorithms:





Unnamed: 0,Algorithm,Precision@10,Recall@10,F1-Score@10,MRR@10,nDCG@10,Coverage
0,Trending,0.009,0.037985,0.012886,0.024429,0.018776,0.001233
1,New Arrivals,0.009,0.031464,0.012646,0.03375,0.020161,0.000925
2,Most Popular,0.005,0.019909,0.007104,0.021833,0.013567,0.001295
3,Hybrid (SVD + Content-Based),0.001,0.01,0.001818,0.003333,0.005,0.028605
4,Content-Based (Soup Features),0.0,0.0,0.0,0.0,0.0,0.047408
5,Collaborative Filtering (SVD),0.0,0.0,0.0,0.0,0.0,0.015166


For you to inspect old results while new results are being evaluated, I printed results again...

In [161]:
results_df

Unnamed: 0,Algorithm,Precision@10,Recall@10,F1-Score@10,MRR@10,nDCG@10,Coverage
0,Trending,0.009,0.037985,0.012886,0.024429,0.018776,0.001233
1,New Arrivals,0.009,0.031464,0.012646,0.03375,0.020161,0.000925
2,Most Popular,0.005,0.019909,0.007104,0.021833,0.013567,0.001295
3,Hybrid (SVD + Content-Based),0.001,0.01,0.001818,0.003333,0.005,0.028605
4,Content-Based (Soup Features),0.0,0.0,0.0,0.0,0.0,0.047408
5,Collaborative Filtering (SVD),0.0,0.0,0.0,0.0,0.0,0.015166


# Analysis of Results

Results seems like pretty bad. Idk if implemented metrics wrong, or handled test set pretty bad or incorrectly implement models or just because of sampling training data. But I also don't know industry standart for this metrics.

New Arrivals and Trending algorithms have the highest F1-Score@10 among all methods tested. These algorithms are straightforward to implement and require minimal computational resources. They adapt quickly to changes in user behavior and content updates. 
Simple algorithms achieve higher precision and recall compared to more sophisticated models.

For our complex models. The content-based model relies on textual (SOUP) features that may not capture the full spectrum of user preferences. SVD model also pretty bad but I believe this happens because of our sampled train set and only 100 user to evaluate. 
Even though these models are bad, hybrid model still scores a little bit better that would mean ensemble approach might worth to consider in ideal/better conditions. Combines both collaborative and content-based approaches but still underperforms compared to simpler algorithms.

### Coverage
Observations:

Content-Based and Hybrid models have higher coverage, indicating they can recommend a wider variety of items.
New Arrivals and Trending have lower coverage, focusing on a smaller set of recent or popular items.

High Coverage: Desirable for promoting diversity and exposing users to less-known items.
Low Coverage: May lead to over-recommendation of certain items, potentially causing user fatigue.

## Generate Recommendations for Specific Titles and Users

### Recommendations for Specific given Titles

In [132]:
# Generate Recommendations for Specific Titles and Users
title_ids = ['tm107473', 'tm50355', 'ts89259']
for title_id in title_ids:
    # Using Content-Based (Soup Features)
    recs = get_recommendations(title_id, cosine_sim_svd)
    print(f"Recommendations for Title ID {title_id} using Content-Based (Soup Features):")
    display(titles[titles['TITLE_ID'].isin(recs)][['TITLE_ID', 'ORIGINAL_TITLE', 'GENRE_TMDB']])
    print("\n")

Recommendations for Title ID tm107473 using Content-Based (Soup Features):


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,GENRE_TMDB
292,tm282652,The Little Hours,"[comedy, romance]"
1575,tm49694,Un 32 août sur terre,"[comedy, drama, romance]"
2990,tm13311,Don Juan DeMarco,"[comedy, drama, romance]"
3333,tm64871,Six Days Seven Nights,"[action, comedy, romance]"
5449,tm30996,Dementia 13,"[horror, thriller]"
6693,tm1334269,Anyone But You,"[comedy, romance]"
9327,tm11653,Powder,"[drama, fantasy, thriller]"
11518,tm134211,Lost in Translation,"[comedy, drama, european, romance]"
13190,tm245718,Paris Can Wait,"[comedy, drama, romance]"
18265,tm21452,Barfly,"[comedy, drama, romance]"




Recommendations for Title ID tm50355 using Content-Based (Soup Features):


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,GENRE_TMDB
2501,tm61246,The Horse Whisperer,"[action, drama, romance, western]"
2901,tm111966,The Great Gatsby,"[drama, romance]"
3102,tm413548,Waiting for the Barbarians,"[drama, european, history]"
3490,tm81358,The Assassination of Jesse James by the Coward...,"[crime, drama, history, western]"
7791,tm177420,Places in the Heart,[drama]
9981,tm109262,Marvin's Room,[drama]
15428,ts379313,The Gold,"[crime, drama, history]"
17584,tm68560,The Chase,"[crime, drama, thriller]"
18212,tm56098,Safe House,"[action, thriller]"
19346,tm186300,Scarecrow,[drama]




Recommendations for Title ID ts89259 using Content-Based (Soup Features):


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,GENRE_TMDB
3185,tm5637,You Can't Take It with You,"[comedy, drama, romance]"
4663,tm239721,Captain Fantastic,"[action, comedy, drama]"
7285,tm455899,Black and Blue,"[action, crime, drama, thriller]"
7927,tm2043,1492: Conquest of Paradise,"[action, drama, european, history]"
9312,tm139885,The Other Side of the Wind,"[drama, european]"
10033,tm1338306,Rather,[documentation]
11045,tm119380,Mr. Deeds Goes to Town,"[comedy, drama, romance]"
17979,tm1135158,One Day as a Lion,"[action, comedy, crime, thriller]"
18670,tm1401833,Lights Out,"[action, horror, thriller]"
20386,tm68950,The Bourne Legacy,"[action, thriller]"






## Generate Recommendations for random 3 Users

In [157]:
# Generate Recommendations for Random 3 Users
user_ids = test_data_cf['BE_ID'].unique()[:3]
for user_id in user_ids:
    # Using Hybrid Model
    try:
        recs_hybrid = get_hybrid_recommendations(user_id)
        if not recs_hybrid.empty:
            print(f"Recommendations for User ID {user_id} using Hybrid Model:")
            display(recs_hybrid)
        else:
            print(f"No recommendations for User ID {user_id} using Hybrid Model.")
    except Exception as e:
        print(f"Error generating recommendations for User ID {user_id}: {e}")
    print("\n")


Recommendations for User ID e7fa4bdd19163153886d3b62d040855a using Hybrid Model:


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,GENRE_TMDB
395,tm1325446,Accused,"[drama, thriller]"
555,tm2,The Empire Strikes Back,"[action, fantasy, scifi]"
671,ts36147,Lucifer,"[crime, drama, fantasy, scifi]"
707,ts343696,Queen Charlotte: A Bridgerton Story,"[drama, history, romance]"
1098,tm134620,The Girlfriend Experience,[drama]
1871,tm64499,Un cuento chino,"[comedy, drama]"
3999,tm1308443,Invitation to a Murder,[thriller]
4058,tm11948,Toy Story 4,"[action, animation, comedy, family, fantasy]"
4639,tm178323,Jack and Jill,[comedy]
4983,tm1271630,Polite Society,"[action, comedy, drama]"




Recommendations for User ID 30a42291c9230949c343880f0f78a1b8 using Hybrid Model:


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,GENRE_TMDB
121,tm73364,Taxi to the Dark Side,"[crime, documentation, history, war]"
555,tm2,The Empire Strikes Back,"[action, fantasy, scifi]"
1128,tm14765,Toy Story,"[action, animation, comedy, family, fantasy]"
1871,tm64499,Un cuento chino,"[comedy, drama]"
2090,ts426318,Together: Treble Winners,"[documentation, sport]"
2321,tm69334,How the Grinch Stole Christmas!,"[animation, comedy, family, fantasy]"
3157,ts20103,The Tonight Show Starring Jimmy Fallon,"[comedy, music]"
3539,ts89898,The Falcon and The Winter Soldier,"[action, drama, scifi]"
4133,ts101383,No Side Game,"[drama, sport]"
5156,tm435784,The Panama Papers,"[crime, documentation]"




Recommendations for User ID ce42f7bd49c382707d8bfc15d7c23680 using Hybrid Model:


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,GENRE_TMDB
707,ts343696,Queen Charlotte: A Bridgerton Story,"[drama, history, romance]"
1720,ts330402,Is It Cake?,[reality]
1840,tm318269,The Rider,"[drama, western]"
2507,tm129545,Ace in the Hole,[drama]
2736,tm1140265,Eureka,"[drama, western]"
2970,tm153755,Boyhood,[drama]
3157,ts20103,The Tonight Show Starring Jimmy Fallon,"[comedy, music]"
3185,tm5637,You Can't Take It with You,"[comedy, drama, romance]"
3198,tm124668,Love in the Afternoon,"[comedy, crime, drama, romance]"
6042,tm6,Star Wars: Episode III - Revenge of the Sith,"[action, fantasy, scifi]"






# Conclusion
In this modeling notebook, I've implemented several recommendation algorithms, including content-based filtering, collaborative filtering using SVD, and hybrid methods. I also introduced simpler strategies like recommending most popular, new arrivals, and trending items. The evaluation results indicate that simpler methods like recommending trending or new arrivals perform better in terms of precision and recall compared to more complex models in this context.

Due to computational limitations and the scope of this project, some models did not perform as expected. Further work could involve more advanced models, extensive hyperparameter tuning, and incorporating additional features to improve recommendation quality.