# Inspecting and Exploring Recommendations for Specific Titles

In this notebook, we'll delve into the performance of our recommendation system by focusing on three specific titles:

- **tm107473**: *Morning Glory*
- **tm50355**: *The Right Stuff*
- **ts89259**: *The Queen's Gambit*

Our objectives are to:

1. **Identify** users who interacted with these titles and have at least one additional interaction afterward.
2. **Define** what ideal recommendations should look like based on these interactions.
3. **Compare** these ideal recommendations with those generated by our current weighted recommender.
4. **Determine** the optimal value of `k` to improve recommendation quality.


In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
from surprise import Reader, Dataset
import matplotlib.pyplot as plt
from tqdm import tqdm

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)


### Loading Processed Data

We'll begin by loading the preprocessed `titles` and `interactions` datasets.


In [4]:
PROCESSED_TITLES_PATH = 'titles.csv.gz'
PROCESSED_INTERACTIONS_PATH = 'interactions_cleaned.csv' 

# Load the data
titles_df = pd.read_csv(PROCESSED_TITLES_PATH)
interactions_df = pd.read_csv(PROCESSED_INTERACTIONS_PATH)

# Display the first few rows of each DataFrame
print("Titles DataFrame:")
display(titles_df.head())

print("\nInteractions DataFrame:")
display(interactions_df.head())


Titles DataFrame:


Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,ORIGINAL_LANGUAGE,RELEASE_DURATION_DAYS,GENRE_TMDB,DIRECTOR,ACTOR,PRODUCER,WRITER
0,tm1282307,L'ultima notte di Amore,it,484,"[\n ""drama"",\n ""thriller""\n]","[\n ""Andrea Di Stefano""\n]","[\n ""Pierfrancesco Favino"",\n ""Linda Caridi""...","[\n ""Benedetto Habib"",\n ""Daniel Campos Pavo...",
1,tm1338500,Bird Box Barcelona,es,357,"[\n ""horror"",\n ""scifi"",\n ""thriller""\n]","[\n ""David Pastor"",\n ""Àlex Pastor""\n]","[\n ""Mario Casas"",\n ""Georgina Campbell"",\n ...","[\n ""Adrián Guerra"",\n ""Chris Morgan"",\n ""D...","[\n ""David Pastor"",\n ""Àlex Pastor""\n]"
2,ts371824,Steeltown Murders,en,417,"[\n ""crime"",\n ""drama"",\n ""history"",\n ""th...","[\n ""Marc Evans""\n]","[\n ""Scott Arthur"",\n ""Sion Alun Davies"",\n ...","[\n ""Hannah Thomas""\n]","[\n ""Ed Whitmore""\n]"
3,tm123363,Expend4bles,en,294,"[\n ""action"",\n ""thriller"",\n ""war""\n]","[\n ""Scott Waugh""\n]","[\n ""Jason Statham"",\n ""Sylvester Stallone"",...","[\n ""Jason Statham"",\n ""Jeffrey Greenstein"",...",
4,tm1045025,65,en,491,"[\n ""action"",\n ""drama"",\n ""scifi"",\n ""thr...","[\n ""Bryan Woods"",\n ""Scott Beck""\n]","[\n ""Adam Driver"",\n ""Ariana Greenblatt"",\n ...","[\n ""Bryan Woods"",\n ""Deborah Liebling"",\n ...","[\n ""Bryan Woods"",\n ""Scott Beck""\n]"



Interactions DataFrame:


Unnamed: 0,BE_ID,TITLE_ID,COLLECTOR_TSTAMP,INTERACTION_TYPE
0,89ce5486cfd135f81edd5f2cc4013e1e,tm122846,2024-04-06 21:37:50.666000+00:00,clickout_provider
1,5437456587e85d0b97070ea63f459e49,tm172163,2024-04-07 02:08:27.618000+00:00,seenlist_addition
2,98c21bf80a45fbfef9902508aba52cdc,ts22280,2024-04-14 05:23:09.304000+00:00,seenlist_addition
3,a4fe1d6790b12ef00f5a81631b69a437,ts416258,2024-04-07 07:34:54.506000+00:00,seenlist_addition
4,69cf67f4676a77d6ead478f60d84a493,ts15366,2024-04-08 08:29:22.511000+00:00,seenlist_addition


### Defining Ideal Recommendations

For each specified title, we'll identify users who:

1. **Interacted** with the title.
2. **Have at least one additional interaction** after that.

For each user, we gather all interactions that happen after the interaction with the given title.
For each user, we will treat all these subsequent interactions as relevant items.
We'll compute metrics like precision@k, recall@k, mean_average_precision, and mean_average_recall for each user, with k being the number of relevant items for that user.
We'll aggregate the results across all users.

#### Titles of Interest:

- **tm107473**: *Morning Glory*
- **tm50355**: *The Right Stuff*
- **ts89259**: *The Queen's Gambit*


In [7]:
import ast
def str_to_list(x):
    try:
        return ast.literal_eval(x)
    except:
        return ['Unknown']
    
titles= titles_df
interactions = interactions_df


multivalued_columns = ['GENRE_TMDB', 'DIRECTOR', 'ACTOR', 'PRODUCER']
for col in multivalued_columns:
    titles[col] = titles[col].apply(str_to_list)


def get_top_items(column, min_count):
    all_items = titles.explode(column)[column]
    item_counts = all_items.value_counts()
    top_items = item_counts[item_counts >= min_count].index.tolist()
    return top_items

# Thresholds
director_min_count = 5
actor_min_count = 10
producer_min_count = 5

# Get top items
top_directors = get_top_items('DIRECTOR', director_min_count)
top_actors = get_top_items('ACTOR', actor_min_count)
top_producers = get_top_items('PRODUCER', producer_min_count)

# Replace less frequent items
def replace_less_frequent(items, top_items):
    return [item if item in top_items else 'other' for item in items]

titles['DIRECTOR'] = titles['DIRECTOR'].apply(lambda x: replace_less_frequent(x, top_directors))
titles['ACTOR'] = titles['ACTOR'].apply(lambda x: replace_less_frequent(x, top_actors))
titles['PRODUCER'] = titles['PRODUCER'].apply(lambda x: replace_less_frequent(x, top_producers))


# Ensure the 'reference_date' is a valid datetime
reference_date = pd.to_datetime(interactions['COLLECTOR_TSTAMP'].max(), errors='coerce', utc=True)

# Convert 'RELEASE_DURATION_DAYS' to numeric, drop NaNs, and convert to timedelta
titles['RELEASE_DURATION_DAYS'] = pd.to_numeric(titles['RELEASE_DURATION_DAYS'], errors='coerce')
titles = titles.dropna(subset=['RELEASE_DURATION_DAYS'])  # Drop rows with NaN after conversion

# Step 2: Identify extreme values (filter out values beyond 100 years)
threshold_days = pd.to_numeric(pd.Timedelta(days=36500).days)  # Convert 100 years to days and then to numeric
extreme_titles = titles[titles['RELEASE_DURATION_DAYS'] > threshold_days]
print(f"Number of titles with extreme RELEASE_DURATION_DAYS: {len(extreme_titles)}")

# Option A: Remove titles with extreme values
titles = titles[titles['RELEASE_DURATION_DAYS'] <= threshold_days]

# Convert 'RELEASE_DURATION_DAYS' to timedelta only after removing extreme values
titles['RELEASE_DURATION_DAYS'] = pd.to_timedelta(titles['RELEASE_DURATION_DAYS'], unit='D')

# Step 3: Calculate 'RELEASE_DATE' by subtracting 'RELEASE_DURATION_DAYS' from 'reference_date'
titles['RELEASE_DATE'] = reference_date - titles['RELEASE_DURATION_DAYS']

# Ensure 'RELEASE_DATE' is valid datetime
titles['RELEASE_DATE'] = pd.to_datetime(titles['RELEASE_DATE'], errors='coerce', utc=True)

# Step 4: Drop rows with invalid 'RELEASE_DATE' values (NaT values)
titles = titles.dropna(subset=['RELEASE_DATE'])

# Step 5: Reset index of titles to ensure indices are from 0 to N-1
titles = titles.reset_index(drop=True)


# Create mapping from TITLE_ID to index
title_id_to_idx = pd.Series(titles.index, index=titles['TITLE_ID']).drop_duplicates()

# Create reverse mapping from index to TITLE_ID
idx_to_title_id = pd.Series(titles['TITLE_ID'].values, index=titles.index)


Number of titles with extreme RELEASE_DURATION_DAYS: 33


In [12]:
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack
from src.evaluation.metrics import precision_at_k, recall_at_k, f_score_at_k, ndcg_at_k, mean_average_precision, area_under_roc_curve

# Define the titles of interest
title_ids = ['tm107473', 'tm50355', 'ts89259']

# Define weights for each feature
weights = {
    'ORIGINAL_TITLE': 0.3,
    'GENRE_TMDB': 0.3,
    'DIRECTOR': 0.2,
    'ACTOR': 0.1,
    'PRODUCER': 0.1
}

# Initialize lists to store the TF-IDF matrices and their weights
tfidf_matrices = []
feature_weights = []

# Generate TF-IDF matrices for each feature and apply the weights
for feature, weight in weights.items():
    tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
    # Replace NaN values with an empty string
    tfidf_matrix = tfidf.fit_transform(titles[feature].fillna('').apply(lambda x: ' '.join(x) if isinstance(x, list) else x))
    tfidf_matrices.append(tfidf_matrix)
    feature_weights.append(weight)

# Combine the TF-IDF matrices into a single weighted matrix
weighted_tfidf_matrix = hstack([tfidf_matrix * weight for tfidf_matrix, weight in zip(tfidf_matrices, feature_weights)])

print(f"Weighted TF-IDF Matrix Shape: {weighted_tfidf_matrix.shape}")

# Apply TruncatedSVD to reduce dimensions
svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_matrix_svd = svd.fit_transform(weighted_tfidf_matrix)
print(f"Reduced TF-IDF Matrix Shape: {tfidf_matrix_svd.shape}")

# Generate cosine similarity matrix
cosine_sim_svd = cosine_similarity(tfidf_matrix_svd, tfidf_matrix_svd)

# Initialize a dictionary to hold ideal recommendations per title
ideal_recommendations = {title_id: {} for title_id in title_ids}

# Initialize metrics dictionaries to hold user-specific metrics
user_metrics = {title_id: [] for title_id in title_ids}

# Initialize list to store the final average metrics for each title
title_results = []

# Iterate over each title to identify ideal recommendations and calculate metrics
for title_id in tqdm(title_ids, desc="Processing Titles"):
    # Find all interactions with the title
    title_interactions = interactions_df[interactions_df['TITLE_ID'] == title_id]
    
    # Get user IDs who interacted with the title (limit to 100 users)
    users_who_interacted = title_interactions['BE_ID'].unique()[:20]  # Limit to 100 users

    # Progress bar for users
    for user_id in tqdm(users_who_interacted, desc=f"Processing users for title {title_id}", leave=False):
        # Get all interactions of the user sorted by timestamp
        user_interactions = interactions_df[interactions_df['BE_ID'] == user_id].sort_values('COLLECTOR_TSTAMP')
        
        # Find the interaction index with the current title
        title_indices = user_interactions[user_interactions['TITLE_ID'] == title_id].index
        
        # For each interaction with the title, find all subsequent interactions
        relevant_interactions = []
        for idx in title_indices:
            # Get the position of the title interaction
            pos = user_interactions.index.get_loc(idx)
            
            # Get all interactions after the title interaction
            if pos + 1 < len(user_interactions):
                relevant_interactions.extend(user_interactions.iloc[pos + 1:]['TITLE_ID'].tolist())

        # Calculate metrics if the user has relevant interactions
        if relevant_interactions:
            # Store all relevant interactions as ideal recommendations for the user
            ideal_recommendations[title_id][user_id] = relevant_interactions
            
            # Find the cosine similarity for the title and recommend similar titles
            title_idx = titles[titles['TITLE_ID'] == title_id].index[0]
            sim_scores = list(enumerate(cosine_sim_svd[title_idx]))
            sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
            
            # Get top K recommendations based on cosine similarity
            recommended_indices = [i[0] for i in sim_scores[1:]]  # Exclude the title itself
            predicted_recommendations = [titles.iloc[i]['TITLE_ID'] for i in recommended_indices[:min(len(relevant_interactions), 10)]]

            # Set k to the minimum of relevant_interactions count and predefined k=10
            k = min(len(relevant_interactions), 10)

            # Calculate metrics using functions from metrics.py with the adjusted k
            precision = precision_at_k(predicted_recommendations, relevant_interactions, k=k)
            recall = recall_at_k(predicted_recommendations, relevant_interactions, k=k)
            f1_score = f_score_at_k(precision, recall)
            ndcg = ndcg_at_k(predicted_recommendations, relevant_interactions, k=k)
            map_score = mean_average_precision(predicted_recommendations, relevant_interactions, k=k)
            
            # Append user-specific metrics
            user_metrics[title_id].append({
                'user_id': user_id,
                'precision@k': precision,
                'recall@k': recall,
                'f1_score@k': f1_score,
                'ndcg@k': ndcg,
                'mean_avg_precision': map_score,
                'relevant_interactions': relevant_interactions,
                'predicted_recommendations': predicted_recommendations
            })

    # Calculate average metrics for the title
    if user_metrics[title_id]:
        df_metrics = pd.DataFrame(user_metrics[title_id])
        avg_precision = df_metrics['precision@k'].mean()
        avg_recall = df_metrics['recall@k'].mean()
        avg_f1_score = df_metrics['f1_score@k'].mean()
        avg_ndcg = df_metrics['ndcg@k'].mean()
        avg_map = df_metrics['mean_avg_precision'].mean()
        avg_auc = df_metrics['auc'].mean()

        # Append title-level metrics to the final results
        title_results.append({
            'title_id': title_id,
            'avg_precision@k': avg_precision,
            'avg_recall@k': avg_recall,
            'avg_f1_score@k': avg_f1_score,
            'avg_ndcg@k': avg_ndcg,
            'avg_mean_avg_precision': avg_map,
            'avg_auc': avg_auc
        })

# Convert title results to DataFrame
results_df = pd.DataFrame(title_results)
print("Average Metrics per Title:")
display(results_df)

# Display ideal recommendations for each title
for title_id in title_ids:
    print(f"Ideal Recommendations for Title ID {title_id}:")
    display(pd.DataFrame.from_dict(ideal_recommendations[title_id], orient='index', columns=['Ideal Recommendations']).head())


Weighted TF-IDF Matrix Shape: (20596, 16880)
Reduced TF-IDF Matrix Shape: (20596, 100)


Processing Titles:   0%|          | 0/3 [00:00<?, ?it/s]
Processing users for title tm107473:   0%|          | 0/20 [00:00<?, ?it/s][A
Processing users for title tm107473:   5%|▌         | 1/20 [00:02<00:40,  2.14s/it][A
Processing users for title tm107473:  10%|█         | 2/20 [00:02<00:19,  1.10s/it][A
Processing users for title tm107473:  15%|█▌        | 3/20 [00:03<00:15,  1.12it/s][A
Processing users for title tm107473:  20%|██        | 4/20 [00:04<00:14,  1.13it/s][A
Processing users for title tm107473:  25%|██▌       | 5/20 [00:04<00:11,  1.26it/s][A
Processing users for title tm107473:  30%|███       | 6/20 [00:05<00:10,  1.34it/s][A
Processing users for title tm107473:  35%|███▌      | 7/20 [00:06<00:11,  1.13it/s][A
Processing users for title tm107473:  40%|████      | 8/20 [00:06<00:08,  1.34it/s][A
Processing users for title tm107473:  45%|████▌     | 9/20 [00:07<00:07,  1.50it/s][A
Processing users for title tm107473:  50%|█████     | 10/20 [00:07<00:05,  1.71it

Average Metrics per Title:





Unnamed: 0,title_id,avg_precision@k,avg_recall@k,avg_f1_score@k,avg_ndcg@k,avg_mean_avg_precision,avg_auc
0,tm107473,0.015,0.000658,0.001245,0.013305,0.000105,0.500116
1,tm50355,0.0,0.0,0.0,0.0,0.0,0.499785
2,ts89259,0.0,0.0,0.0,0.0,0.0,0.499756


Ideal Recommendations for Title ID tm107473:


ValueError: 1 columns passed, passed data had 863 columns

### Generating Current Recommendations

We'll now generate recommendations using our **Content-Based Weighted** recommender for the specified titles. We'll compare these recommendations against the ideal recommendations defined earlier.


In [6]:
import ast
def str_to_list(x):
    try:
        return ast.literal_eval(x)
    except:
        return ['Unknown']
    
titles= titles_df
interactions = interactions_df


multivalued_columns = ['GENRE_TMDB', 'DIRECTOR', 'ACTOR', 'PRODUCER']
for col in multivalued_columns:
    titles[col] = titles[col].apply(str_to_list)


def get_top_items(column, min_count):
    all_items = titles.explode(column)[column]
    item_counts = all_items.value_counts()
    top_items = item_counts[item_counts >= min_count].index.tolist()
    return top_items

# Thresholds
director_min_count = 5
actor_min_count = 10
producer_min_count = 5

# Get top items
top_directors = get_top_items('DIRECTOR', director_min_count)
top_actors = get_top_items('ACTOR', actor_min_count)
top_producers = get_top_items('PRODUCER', producer_min_count)

# Replace less frequent items
def replace_less_frequent(items, top_items):
    return [item if item in top_items else 'other' for item in items]

titles['DIRECTOR'] = titles['DIRECTOR'].apply(lambda x: replace_less_frequent(x, top_directors))
titles['ACTOR'] = titles['ACTOR'].apply(lambda x: replace_less_frequent(x, top_actors))
titles['PRODUCER'] = titles['PRODUCER'].apply(lambda x: replace_less_frequent(x, top_producers))


# Ensure the 'reference_date' is a valid datetime
reference_date = pd.to_datetime(interactions['COLLECTOR_TSTAMP'].max(), errors='coerce', utc=True)

# Convert 'RELEASE_DURATION_DAYS' to numeric, drop NaNs, and convert to timedelta
titles['RELEASE_DURATION_DAYS'] = pd.to_numeric(titles['RELEASE_DURATION_DAYS'], errors='coerce')
titles = titles.dropna(subset=['RELEASE_DURATION_DAYS'])  # Drop rows with NaN after conversion

# Step 2: Identify extreme values (filter out values beyond 100 years)
threshold_days = pd.to_numeric(pd.Timedelta(days=36500).days)  # Convert 100 years to days and then to numeric
extreme_titles = titles[titles['RELEASE_DURATION_DAYS'] > threshold_days]
print(f"Number of titles with extreme RELEASE_DURATION_DAYS: {len(extreme_titles)}")

# Option A: Remove titles with extreme values
titles = titles[titles['RELEASE_DURATION_DAYS'] <= threshold_days]

# Convert 'RELEASE_DURATION_DAYS' to timedelta only after removing extreme values
titles['RELEASE_DURATION_DAYS'] = pd.to_timedelta(titles['RELEASE_DURATION_DAYS'], unit='D')

# Step 3: Calculate 'RELEASE_DATE' by subtracting 'RELEASE_DURATION_DAYS' from 'reference_date'
titles['RELEASE_DATE'] = reference_date - titles['RELEASE_DURATION_DAYS']

# Ensure 'RELEASE_DATE' is valid datetime
titles['RELEASE_DATE'] = pd.to_datetime(titles['RELEASE_DATE'], errors='coerce', utc=True)

# Step 4: Drop rows with invalid 'RELEASE_DATE' values (NaT values)
titles = titles.dropna(subset=['RELEASE_DATE'])

# Step 5: Reset index of titles to ensure indices are from 0 to N-1
titles = titles.reset_index(drop=True)


# Create mapping from TITLE_ID to index
title_id_to_idx = pd.Series(titles.index, index=titles['TITLE_ID']).drop_duplicates()

# Create reverse mapping from index to TITLE_ID
idx_to_title_id = pd.Series(titles['TITLE_ID'].values, index=titles.index)


Number of titles with extreme RELEASE_DURATION_DAYS: 33


In [9]:
titles = titles.drop(columns=['WRITER'])

In [10]:
titles

Unnamed: 0,TITLE_ID,ORIGINAL_TITLE,ORIGINAL_LANGUAGE,RELEASE_DURATION_DAYS,GENRE_TMDB,DIRECTOR,ACTOR,PRODUCER,RELEASE_DATE
0,tm1282307,L'ultima notte di Amore,it,484 days,"[drama, thriller]",[other],"[Pierfrancesco Favino, other, other, other, ot...","[Benedetto Habib, Daniel Campos Pavoncelli, Fa...",2023-03-09 04:29:50.019000+00:00
1,tm1338500,Bird Box Barcelona,es,357 days,"[horror, scifi, thriller]","[other, other]","[Mario Casas, Georgina Campbell, other, other,...","[Adrián Guerra, Chris Morgan, Dylan Clark, Núr...",2023-07-14 04:29:50.019000+00:00
2,ts371824,Steeltown Murders,en,417 days,"[crime, drama, history, thriller]",[other],"[other, other, other, other, Aneurin Barnard, ...",[other],2023-05-15 04:29:50.019000+00:00
3,tm123363,Expend4bles,en,294 days,"[action, thriller, war]",[other],"[Jason Statham, Sylvester Stallone, 50 Cent, M...","[other, Jeffrey Greenstein, Jonathan Yunger, K...",2023-09-15 04:29:50.019000+00:00
4,tm1045025,65,en,491 days,"[action, drama, scifi, thriller]","[other, other]","[Adam Driver, other, other, other, other]","[other, other, Sam Raimi, other, other]",2023-03-02 04:29:50.019000+00:00
...,...,...,...,...,...,...,...,...,...
20591,ts21325,Hunter,en,14535 days,"[action, crime, drama, thriller]","[David Soul, Tony Mordente, Gus Trikonis, Jame...","[other, other, Charles Hallahan, other]","[other, other, other, other, other, Stephen J....",1984-09-18 04:29:50.019000+00:00
20592,tm1382322,Strange Darling,en,-49 days,"[horror, thriller]",[other],"[other, Kyle Gallner, Jason Patric, Giovanni R...","[Bill Block, other, Roy Lee, Steven Schneider]",2024-08-23 04:29:50.019000+00:00
20593,ts21242,Mission: Impossible,en,21111 days,"[action, crime, drama, thriller]","[Tom Gries, Leonard J. Horn, Seymour Robbie, H...","[other, other, other, other, other]","[other, other, other, other, other]",1966-09-17 04:29:50.019000+00:00
20594,ts37497,Popeye the Sailor,en,23401 days,"[animation, comedy, family, romance]","[Jack Kinney, other, other, other, other, other]","[other, other, other]",[other],1960-06-10 04:29:50.019000+00:00


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack

# Define weights for each feature
weights = {
    'ORIGINAL_TITLE': 0.3,
    'GENRE_TMDB': 0.3,
    'DIRECTOR': 0.2,
    'ACTOR': 0.1,
    'PRODUCER': 0.1
}

# Initialize lists to store the TF-IDF matrices and their weights
tfidf_matrices = []
feature_weights = []

# Generate TF-IDF matrices for each feature and apply the weights
for feature, weight in weights.items():
    tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
    # Replace NaN values with an empty string
    tfidf_matrix = tfidf.fit_transform(titles[feature].fillna('').apply(lambda x: ' '.join(x) if isinstance(x, list) else x))
    tfidf_matrices.append(tfidf_matrix)
    feature_weights.append(weight)

# Combine the TF-IDF matrices into a single weighted matrix
weighted_tfidf_matrix = hstack([tfidf_matrix * weight for tfidf_matrix, weight in zip(tfidf_matrices, feature_weights)])

print(f"Weighted TF-IDF Matrix Shape: {weighted_tfidf_matrix.shape}")

# Apply TruncatedSVD to reduce dimensions
svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_matrix_svd = svd.fit_transform(weighted_tfidf_matrix)
print(f"Reduced TF-IDF Matrix Shape: {tfidf_matrix_svd.shape}")

cosine_sim_svd = cosine_similarity(tfidf_matrix_svd, tfidf_matrix_svd)


Weighted TF-IDF Matrix Shape: (20596, 16880)
Reduced TF-IDF Matrix Shape: (20596, 100)


In [15]:
from src.models.content_based import WeightedContentBasedRecommender



weighted_content_recommender = WeightedContentBasedRecommender(titles, tfidf_matrix_svd, idx_to_title_id)
weighted_content_recommender.train()

In [16]:
def get_recommendations_for_title(recommender, title_id, top_k=10):
    """
    Generates recommendations for a given title using the specified recommender.
    
    Parameters:
    - recommender: The trained recommender object.
    - title_id: The title ID for which to generate recommendations.
    - top_k: Number of top recommendations to retrieve.
    
    Returns:
    - List of recommended title IDs.
    """
    recommendations = recommender.get_recommendations(title_id=title_id, top_k=top_k)
    return recommendations

# Create a dictionary to hold current recommendations per title
current_recommendations = {title_id: [] for title_id in title_ids}

# Generate recommendations for each title
for title_id in title_ids:
    recommendations = get_recommendations_for_title(weighted_content_recommender, title_id, top_k=10)
    current_recommendations[title_id] = recommendations
    print(f"Current Recommendations for Title ID {title_id}:")
    print(recommendations)
    print("\n")

Current Recommendations for Title ID tm107473:
['tm54451', 'tm39130', 'tm33545', 'tm1138197', 'tm1278917', 'tm177922', 'tm160841', 'tm144685', 'tm1388422', 'tm111589']


Current Recommendations for Title ID tm50355:
['ts294208', 'tm350717', 'ts106126', 'tm136061', 'tm49065', 'tm181695', 'tm139775', 'tm1365333', 'tm1196434', 'ts26763']


Current Recommendations for Title ID ts89259:
['tm44430', 'ts271532', 'ts22246', 'ts302830', 'ts262357', 'ts20248', 'tm104832', 'ts4647', 'ts388033', 'ts37490']




### Evaluating Recommendation Quality

We'll compare the **ideal recommendations** with the **current recommendations** to assess how well our system is performing.

For each title and each user associated with it:

- **Ideal Recommendation:** The next title the user interacted with.
- **Actual Recommendation:** The top-K titles recommended by the system.

We'll calculate **Precision@K** for each user and then compute the **average Precision@K** across all users for each title.


In [17]:
# Initialize a dictionary to hold precision scores per title
precision_scores_per_title = {title_id: [] for title_id in title_ids}

# Define the value of K
k = 5 

for title_id in title_ids:
    print(f"Evaluating Recommendations for Title ID {title_id}:")
    ideal_recs = ideal_recommendations[title_id]
    current_recs = current_recommendations[title_id]
    
    # Iterate over each user and their ideal recommendation
    for user_id, ideal_rec in ideal_recs.items():
        # Check if the ideal recommendation is in the top-K recommendations
        if ideal_rec in current_recs[:k]:
            precision_scores_per_title[title_id].append(1.0)  # Perfect precision for this user
        else:
            precision_scores_per_title[title_id].append(0.0)  # Missed recommendation
    
    # Calculate average Precision@K for the title
    if precision_scores_per_title[title_id]:
        average_precision = np.mean(precision_scores_per_title[title_id])
    else:
        average_precision = np.nan  # Handle cases with no evaluations
    
    print(f"Average Precision@{k} for Title ID {title_id}: {average_precision}\n")


Evaluating Recommendations for Title ID tm107473:
Average Precision@5 for Title ID tm107473: 0.0

Evaluating Recommendations for Title ID tm50355:
Average Precision@5 for Title ID tm50355: 0.0

Evaluating Recommendations for Title ID ts89259:
Average Precision@5 for Title ID ts89259: 0.0



In [19]:
len(ideal_recs)

1161