# Content-Based Filtering

## Import relevant libraries

Note: Run this notebook on a CUDA device (for example, the T4 GPU runtime in Google Colab); encoding the embeddings on a CPU takes considerably longer.

In [None]:
import pandas as pd
import numpy as np
from google.colab import drive
import gdown
import torch

## Loading preprocessed datasets into Colab

In [None]:
metadata_drive = '1vUxq_77g3r1jH1S3Zctmj-u5pArq7iZd'
gdown.download(f"https://drive.google.com/uc?id={metadata_drive}", "movies_metadata.csv", quiet=False)
df_movies_metadata = pd.read_csv("movies_metadata.csv")
df_movies_metadata.head()

Downloading...
From: https://drive.google.com/uc?id=1vUxq_77g3r1jH1S3Zctmj-u5pArq7iZd
To: /content/movies_metadata.csv
100%|██████████| 50.9M/50.9M [00:00<00:00, 165MB/s]


Unnamed: 0,adult,genres,id,original_language,original_title,overview,production_companies,production_countries,spoken_languages,release_year,runtime_category,vote_count_log,vote_average_norm,vote_count_norm,popularity_norm,years_since_release,keyword_values,textual_representation
0,False,"Animation, Comedy, Family",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",Pixar Animation Studios,United States of America,English,1995,Medium,8.597113,0.77,0.900011,0.040087,22,"jealousy, toy, boy, friendship, friends, rival...","This movie is titled Toy Story, produced in Un..."
1,False,"Adventure, Fantasy, Family",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,"TriStar Pictures, Teitler Film, Interscope Com...",United States of America,"English, Français",1995,Medium,7.78904,0.69,0.815416,0.031079,22,"board game, disappearance, based on children's...","This movie is titled Jumanji, produced in Unit..."
2,False,"Romance, Comedy",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,"Warner Bros., Lancaster Gate",United States of America,English,1995,Medium,4.532599,0.65,0.474507,0.021394,22,"fishing, best friend, duringcreditsstinger, ol...","This movie is titled Grumpier Old Men, produce..."
3,False,"Comedy, Drama, Romance",31357,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Twentieth Century Fox Film Corporation,United States of America,English,1995,Long,3.555348,0.61,0.372201,0.007049,22,"based on novel, interracial relationship, sing...","This movie is titled Waiting to Exhale, produc..."
4,False,Comedy,11862,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,"Sandollar Productions, Touchstone Pictures",United States of America,English,1995,Medium,5.159055,0.57,0.540089,0.01532,22,"baby, midlife crisis, confidence, aging, daugh...",This movie is titled Father of the Bride Part ...


## Sentence Transformer




A sentence transformer model (`all-MiniLM-L6-v2`) encodes each movie's `textual_representation` into an embedding. Cosine similarities between movies are then computed from these embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2', device = 'cuda' if torch.cuda.is_available() else 'cpu')
sentences = df_movies_metadata['textual_representation'].tolist()
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(45119, 384)

In [None]:
embeddings = torch.tensor(sentence_embeddings).to('cuda' if torch.cuda.is_available() else 'cpu')
embeddings = torch.nn.functional.normalize(embeddings, p = 2, dim = 1)
similarity_matrix = torch.matmul(embeddings, embeddings.T)
similarity_matrix = similarity_matrix.cpu().numpy()
similarity_df = pd.DataFrame(similarity_matrix, index=df_movies_metadata.index, columns=df_movies_metadata.index)
similarity_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45109,45110,45111,45112,45113,45114,45115,45116,45117,45118
0,1.000000,0.444346,0.276309,0.287794,0.332795,0.288308,0.343270,0.325479,0.290747,0.186446,...,0.213983,0.258825,0.295162,0.253269,0.243710,0.307533,0.311919,0.243195,0.324702,0.295970
1,0.444346,1.000000,0.278659,0.215315,0.229811,0.282642,0.291310,0.345102,0.288790,0.293119,...,0.279284,0.319815,0.310581,0.301155,0.280512,0.269136,0.343153,0.319654,0.370854,0.208622
2,0.276309,0.278659,1.000000,0.399507,0.388268,0.301451,0.426658,0.385983,0.239008,0.272493,...,0.369072,0.295549,0.226216,0.278041,0.250528,0.348682,0.269494,0.307120,0.436144,0.323274
3,0.287794,0.215315,0.399507,0.999999,0.379720,0.411267,0.371149,0.275089,0.299168,0.210230,...,0.396873,0.253119,0.291508,0.234463,0.455182,0.490260,0.343836,0.420715,0.397196,0.408826
4,0.332795,0.229811,0.388268,0.379720,1.000000,0.214479,0.312228,0.274216,0.320095,0.192143,...,0.239187,0.222098,0.275852,0.274831,0.257432,0.364041,0.261331,0.288168,0.328348,0.304491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45114,0.307533,0.269136,0.348682,0.490260,0.364041,0.409546,0.409465,0.261573,0.332385,0.420166,...,0.455177,0.399499,0.425109,0.451771,0.434965,1.000000,0.502815,0.472094,0.443321,0.426413
45115,0.311919,0.343153,0.269494,0.343836,0.261331,0.312794,0.284005,0.265019,0.228466,0.332203,...,0.577708,0.350398,0.331385,0.288601,0.307950,0.502815,1.000000,0.445175,0.441564,0.395804
45116,0.243195,0.319654,0.307120,0.420715,0.288168,0.555038,0.346787,0.421992,0.409802,0.473314,...,0.528146,0.490986,0.505975,0.380974,0.522281,0.472094,0.445175,1.000000,0.412122,0.351189
45117,0.324702,0.370854,0.436144,0.397196,0.328348,0.384644,0.446273,0.313721,0.248493,0.331247,...,0.464968,0.465603,0.422147,0.450802,0.341686,0.443321,0.441564,0.412122,1.000001,0.385195
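
The full similarity matrix above has 45,119 × 45,119 entries, which is roughly 8 GB as float32 and can strain Colab memory. The following is a minimal sketch, assuming only the top-k neighbours of each movie are needed, that computes cosine similarities in row chunks so the full matrix never has to be held at once. The names `l2_normalize`, `top_k_similar_chunked`, and `chunk_size` are illustrative and not part of the notebook, and NumPy is used instead of PyTorch only to keep the example self-contained.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so a dot product equals cosine similarity
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def top_k_similar_chunked(embeddings, k, chunk_size=1024):
    # Hypothetical helper: returns (indices, similarities) of the k nearest
    # neighbours per row, excluding the row itself. Requires k < number of rows.
    unit = l2_normalize(np.asarray(embeddings, dtype=np.float32))
    n = unit.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T              # shape: (chunk, n)
        rows = np.arange(stop - start)
        sims[rows, np.arange(start, stop)] = -np.inf  # mask self-similarity
        order = np.argsort(-sims, axis=1)[:, :k]      # best k columns per row
        top_idx[start:stop] = order
        top_sim[start:stop] = np.take_along_axis(sims, order, axis=1)
    return top_idx, top_sim

# Toy data standing in for the sentence embeddings
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8))
idx, sim = top_k_similar_chunked(toy, k=5, chunk_size=16)
print(idx.shape, sim.shape)
```

The trade-off is that only the k best neighbours are kept, so any later step that needs every pairwise score (such as a threshold-based recall) would have to be computed inside the same loop.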


In [None]:
# Example of getting recommendations

n = 10
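def get_recommendations(idx, n):
    # Pair each movie position with its similarity to the query movie
    sim_scores = list(enumerate(similarity_matrix[idx]))
    # Drop the query movie explicitly rather than assuming it ranks first
    sim_scores = [s for s in sim_scores if s[0] != idx]
    # Sort by similarity, highest first, and keep the top n
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse = True)
    sim_scores = sim_scores[:n]
    movie_indices = [i[0] for i in sim_scores]
    return df_movies_metadata.iloc[movie_indices]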

recommendations = get_recommendations(2000, n)
recommendations[['id', 'original_title', 'textual_representation']]

Unnamed: 0,id,original_title,textual_representation
4853,120,The Lord of the Rings: The Fellowship of the Ring,This movie is titled The Lord of the Rings: Th...
16329,1361,The Return of the King,"This movie is titled The Return of the King, p..."
6989,122,The Lord of the Rings: The Return of the King,This movie is titled The Lord of the Rings: Th...
5804,121,The Lord of the Rings: The Two Towers,This movie is titled The Lord of the Rings: Th...
14127,17632,The Hunt for Gollum,"This movie is titled The Hunt for Gollum, prod..."
42072,10248,The Ring Thing,"This movie is titled The Ring Thing, produced ..."
11578,71506,Ringers - Lord of the Fans,This movie is titled Ringers - Lord of the Fan...
21999,57158,The Hobbit: The Desolation of Smaug,This movie is titled The Hobbit: The Desolatio...
19927,49051,The Hobbit: An Unexpected Journey,This movie is titled The Hobbit: An Unexpected...
25313,122917,The Hobbit: The Battle of the Five Armies,This movie is titled The Hobbit: The Battle of...
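
Sorting all 45,119 scores for every query is more work than needed when only the top n are wanted. As a sketch of one alternative, assuming a single row of the similarity matrix is available as a NumPy array, `np.argpartition` can select the n best candidates before ordering just those. The helper name `top_n_indices` is hypothetical and not part of the notebook.

```python
import numpy as np

def top_n_indices(sim_row, query_idx, n):
    # Copy so the caller's similarity row is not modified
    scores = np.array(sim_row, dtype=np.float64)
    # Mask the query itself so it can never be recommended
    scores[query_idx] = -np.inf
    n = min(n, scores.size - 1)
    # Partial selection of the n largest scores (unordered)
    candidates = np.argpartition(-scores, n - 1)[:n]
    # Order only the selected candidates, highest similarity first
    return candidates[np.argsort(-scores[candidates])]

# Toy row in which another item ties with the query at similarity 1.0
row = np.array([0.30, 1.00, 1.00, 0.75, 0.10, 0.55])
print(top_n_indices(row, query_idx=2, n=3))
```

Masking by index rather than skipping the first sorted entry matters when another movie has an identical or near-identical embedding, since floating-point similarities on the diagonal are not always exactly the largest value in the row.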


In [None]:
df_movies_metadata['textual_representation'][2000]

"This movie is titled The Lord of the Rings, produced in United States of America with English as the main language. It falls under the genres of Fantasy, Drama, Animation, Adventure and explores themes of elves, dwarves, hobbit, mission. Overview: The Fellowship of the Ring embark on a journey to destroy the One Ring and end Sauron's reign over Middle-earth."

## Evaluation of Precision@k, Recall@k, NDCG and F1 Score
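
For reference, the metrics computed in the cell below can be written as follows. The notation is introduced here for illustration and does not appear in the notebook: $S_k(q)$ is the set of the $k$ most similar movies to query movie $q$, $R(q)$ is the set of movies whose similarity to $q$ is at least the threshold, $rel_i \in \{0, 1\}$ marks whether the movie at rank $i$ is in $R(q)$, and $\bar{P}$, $\bar{R}$ are the precision and recall averaged over all queries.

```latex
\mathrm{Precision@}k(q) = \frac{|S_k(q) \cap R(q)|}{k}
\qquad
\mathrm{Recall@}k(q) =
\begin{cases}
\dfrac{|S_k(q) \cap R(q)|}{|R(q)|} & \text{if } |R(q)| > 0 \\
0 & \text{otherwise}
\end{cases}

\mathrm{DCG@}k(q) = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
\qquad
\mathrm{NDCG@}k(q) = \frac{\mathrm{DCG@}k(q)}{\mathrm{IDCG@}k(q)}

F_1 = \frac{2\,\bar{P}\,\bar{R}}{\bar{P} + \bar{R}}
```

Here $\mathrm{IDCG@}k(q)$ is the DCG of the same $k$ relevance values sorted in descending order, which is how `ndcg_at_k` computes its normalizer.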

In [None]:
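def dcg_at_k(r, k):
    # Discounted cumulative gain over the first k relevance scores
    r = np.asarray(r)[:k]
    if r.size:
        # The score at 1-based rank i is discounted by log2(i + 1)
        return np.sum(r / np.log2(np.arange(2, r.size + 2)))
    return 0.0

def ndcg_at_k(r, k):
    # Normalize by the DCG of the same scores in ideal (descending) order
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    if not dcg_max:
        return 0.0
    return dcg_at_k(r, k) / dcg_max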

recall_scores = []
precision_scores = []
ndcg_scores = []
f1 = 0.0
k = 10
threshold = 0.60

for query_idx in range(len(df_movies_metadata)):
    # Get all similarities for this query movie (excluding self-similarity)
    all_similarities = similarity_df.loc[query_idx].drop(query_idx)

    recommended_indices = all_similarities.nlargest(k).index

    relevant_indices = all_similarities[all_similarities >= threshold].index

    hits = len(set(recommended_indices) & set(relevant_indices))

    recall = hits / len(relevant_indices) if len(relevant_indices) > 0 else 0.0
    precision = hits / k

    # For NDCG, create binary relevance vector for recommended items
    relevance_binary = [1 if idx in relevant_indices else 0 for idx in recommended_indices]
    ndcg = ndcg_at_k(relevance_binary, k)

    recall_scores.append(recall)
    precision_scores.append(precision)
    ndcg_scores.append(ndcg)

mean_recall = np.mean(recall_scores) if recall_scores else 0.0
mean_precision = np.mean(precision_scores) if precision_scores else 0.0
mean_ndcg = np.mean(ndcg_scores) if ndcg_scores else 0.0

# F1 is computed once from the mean precision and mean recall (not averaged per query)
if mean_precision + mean_recall > 0:
    f1 = 2 * (mean_precision * mean_recall) / (mean_precision + mean_recall)

print(f"Content-Based Filtering Evaluation ({threshold * 100:.2f}% Similarity Threshold):")
print(f"NDCG@{k}: {mean_ndcg:.4f}")
print(f"F1@{k}: {f1:.4f}")
print(f"Precision@{k}: {mean_precision:.4f}")
print(f"Recall@{k}: {mean_recall:.4f}")

Content-Based Filtering Evaluation (60.00% Similarity Threshold):
NDCG@10: 0.9469
F1@10: 0.5671
Precision@10: 0.8004
Recall@10: 0.4391


In this evaluation, the similarity score serves as the relevance criterion: for each query movie, every other movie whose cosine similarity to it is at least the threshold is treated as relevant (ground truth). A similarity threshold of 60% is used because it gives the best F1 score for capturing the semantic meaning of the sentences that represent the movies.
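
To make the threshold-based definition concrete, here is a small worked sketch for a single query, using made-up similarity values rather than data from the notebook. The function `evaluate_query` is a hypothetical helper that mirrors the logic of the evaluation loop for one row of scores with the query itself already removed.

```python
import numpy as np

def dcg_at_k(rel, k):
    # Discounted cumulative gain over the first k relevance values
    rel = np.asarray(rel, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

def evaluate_query(similarities, k, threshold):
    sims = np.asarray(similarities, dtype=float)
    top_k = np.argsort(-sims)[:k]          # recommended items
    relevant = sims >= threshold           # threshold-based ground truth
    hits = int(relevant[top_k].sum())
    precision = hits / k
    recall = hits / int(relevant.sum()) if relevant.any() else 0.0
    rel_at_k = relevant[top_k].astype(float)
    ideal = dcg_at_k(np.sort(rel_at_k)[::-1], k)
    ndcg = dcg_at_k(rel_at_k, k) / ideal if ideal > 0 else 0.0
    return precision, recall, ndcg

# Made-up similarities of one query movie to eight other movies
toy = [0.82, 0.75, 0.66, 0.64, 0.61, 0.58, 0.40, 0.35]
p, r, n = evaluate_query(toy, k=3, threshold=0.60)
print(f"precision={p:.2f} recall={r:.2f} ndcg={n:.2f}")
```

With five movies above the threshold and k = 3, all three recommendations are hits, but they cover only three of the five relevant movies, which is how precision can be high while recall stays lower. Note also that because relevance is derived from the same similarity scores that order the recommendations, the relevance values of the top k are always non-increasing, so under these definitions NDCG@k for a query is 1 whenever at least one recommendation passes the threshold and 0 otherwise. This is worth keeping in mind when reading the NDCG value.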

The sentence transformer model ranks relevant movies well, as shown by the relatively high NDCG and precision scores. The lower recall indicates that the top 10 recommendations capture only a portion of the movies that meet the relevance threshold, so many relevant movies in the dataset are not retrieved.

A notable limitation is that the model measures similarity only through the text in the `textual_representation` feature, so the recommended movies may not fully align with a user's preferences. Combining it with other techniques, such as collaborative filtering, would likely yield recommendations that reflect user interactions as well as semantic meaning.
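
One simple way such a combination could look is a weighted blend of content-based and collaborative scores. The sketch below is purely illustrative: the collaborative scores, the weight `alpha`, and the helpers `min_max` and `blend_scores` are assumptions, and nothing like this is implemented in the notebook.

```python
import numpy as np

def min_max(x):
    # Rescale scores to [0, 1] so the two sources are comparable
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def blend_scores(content_scores, collab_scores, alpha=0.5):
    # alpha weights the content-based signal; (1 - alpha) weights the collaborative one
    return alpha * min_max(content_scores) + (1 - alpha) * min_max(collab_scores)

# Hypothetical scores for four candidate movies
content = [0.90, 0.70, 0.65, 0.30]   # cosine similarities to the query movie
collab = [2.0, 4.5, 3.0, 5.0]        # predicted ratings from a collaborative model

hybrid = blend_scores(content, collab, alpha=0.6)
ranking = np.argsort(-hybrid)        # candidate positions, best first
print(ranking)
```

The weight would need to be tuned against real user interaction data, which this notebook does not use.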