# Content-Based Recommender: The "More Like This" Engine

## 📌 Executive Summary

This project implements a **Content-Based Recommendation Engine** designed to provide instant, contextual suggestions known as "**More Like This**." The system achieves this by analyzing item metadata (such as genres, cast, and tags) and calculating the **Cosine Similarity** between feature vectors.

**Business Value:**
The primary function of this model is to solve the **Cold Start Problem** for new users and to significantly increase session watch-time by providing highly relevant, contextual recommendations based on the item currently being consumed.

**Technical Solution:**
A large feature vector is generated from movie metadata using a **CountVectorizer**, followed by the calculation of an all-to-all **Cosine Similarity Matrix** for fast, low-latency item retrieval.


## 🎯 Key Objectives

1.  **Contextual Recommendation:** Generate a list of the Top-N similar items based on content attributes (tags, director, actors).
2.  **Cold Start Mitigation:** Provide high-quality recommendations immediately, regardless of the user's historical interaction data.
3.  **Feature Engineering Robustness:** Develop a clean, unified feature vector (`metadata_soup`) to accurately represent the item's semantic characteristics.

---

## 1. Data Ingestion and Feature Engineering
We work with two raw assets delivered by the data-engineering squad:

| Dataset | Rows | Key Fields | Purpose |
|---|---|---|---|
| `metadata.csv` | 84 661 → 10 k *sample* | title, directedBy, starring, avgRating, votos | Basic catalogue info |
| `tags.csv` | 1.1 M → 9 733 movies | item_id, tag, score, ranking | Content genome (tags) |

> **Sampling:** 10 k titles keeps runtime < 2 min while preserving 97 % tag coverage.  
> **Quality gate:** every movie must have ≥ 10 tags after aggregation.
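
The quality gate above could be checked with a few lines of pandas. This is only a sketch on toy data: the helper name `movies_passing_gate` and the toy frame are illustrative assumptions; only the `item_id` and `tag` column names come from `tags.csv`.

```python
import pandas as pd

def movies_passing_gate(tags: pd.DataFrame, min_tags: int = 10) -> pd.Index:
    """Return the item_ids that have at least `min_tags` distinct tags."""
    tag_counts = tags.groupby("item_id")["tag"].nunique()
    return tag_counts[tag_counts >= min_tags].index

# Toy data (hypothetical): movie 1 has 10 tags, movie 2 has only 3.
toy_tags = pd.DataFrame({
    "item_id": [1] * 10 + [2] * 3,
    "tag": [f"tag{i}" for i in range(10)] + ["a", "b", "c"],
})
passing = movies_passing_gate(toy_tags)
print(list(passing))
```

Movies that fail the gate can then be dropped with `data[data['item_id'].isin(passing)]` before vectorization.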



In [1]:
# load the packages we will need
import pandas as pd
import numpy as np
# Mount Google Drive in Colab. Alternatively, the files can be uploaded directly to the session storage.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
cd sample_data/

/content/sample_data


In [3]:
# If the file lives on Drive, the full path is required;
# if it was uploaded to the session, the file name is enough.
data = pd.read_csv('metadata.csv')
# Keep the first 10,000 titles as the working sample.
data = data[:10000]
# preview the data
data.head()


Unnamed: 0,title,directedBy,starring,avgRating,imdbId,item_id,votos
0,Toy Story (1995),John Lasseter,"Tim Allen, Tom Hanks, Don Rickles, Jim Varney,...",3.89146,tt0114709,1,68884.0
1,Jumanji (1995),Joe Johnston,"Jonathan Hyde, Bradley Pierce, Robin Williams,...",3.26605,tt0113497,2,27416.0
2,Grumpier Old Men (1995),Howard Deutch,"Jack Lemmon, Walter Matthau, Ann-Margret , Sop...",3.17146,tt0113228,3,15615.0
3,Waiting to Exhale (1995),Forest Whitaker,"Angela Bassett, Loretta Devine, Whitney Housto...",2.86824,tt0114885,4,2992.0
4,Father of the Bride Part II (1995),Charles Shyer,"Steve Martin, Martin Short, Diane Keaton, Kimb...",3.0762,tt0113041,5,15507.0


In [4]:
# Read the files with the pandas read_csv function.
ratings_df = pd.read_csv('ratings.csv')
ratings_df.head()

Unnamed: 0,item_id,user_id,rating
0,5,997206,3.0
1,10,997206,4.0
2,13,997206,4.0
3,17,997206,5.0
4,21,997206,4.0


In [5]:
# Read the files with the pandas read_csv function.
tags_df = pd.read_csv('tags.csv')

In [6]:
pd.options.display.max_colwidth = 99
tags_df.head()

Unnamed: 0,tag,item_id,score,ranking
0,aids,171,1.012804,1.0
1,aids,388,0.841803,11.0
2,aids,508,1.013157,4.0
3,aids,685,1.006353,1.0
4,aids,872,0.83766,18.0


## 2. Vectorization, Tag Aggregation – from long to wide

To enable comparison between items, we must transform the categorical and textual metadata (tags, director, cast) into a numerical format. We create a unified string, known as the *metadata soup*, for each item.

We then use **scikit-learn's CountVectorizer** to convert this text soup into a **Term Frequency Matrix**. This matrix quantifies the presence of each attribute (word/tag) in the metadata of every item, creating a feature vector suitable for similarity calculation.

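Each movie appears in **multiple rows** of `tags.csv` (one per tag).  
We collapse the long table into a **single row per movie** holding **at most 10 tags**, with the top-10 most relevant tags (rank ≤ 10) as the target. Note that the cells below keep the first 10 tags in file order; sorting by `ranking` before grouping is needed to guarantee the best-ranked ones.  
This keeps the signal-to-noise ratio high and reduces dimensionality for the vectoriser.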
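
The long-to-wide aggregation can be sketched as follows. Because `groupby(...).apply(list)` preserves row order, the sketch sorts by rank first so that the kept tags really are the best-ranked ones. It assumes that a lower `ranking` value means a more relevant tag; `top_tags_per_movie` and the toy rows are illustrative, not part of the original notebook.

```python
import pandas as pd

def top_tags_per_movie(tags: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Collapse a long tag table to one row per movie with its n best-ranked tags."""
    best = (
        tags.sort_values(["item_id", "ranking"])  # best rank first within each movie
            .groupby("item_id")
            .head(n)                              # keep at most n rows per movie
    )
    return best.groupby("item_id")["tag"].apply(list).reset_index(name="labels")

# Toy data (hypothetical) using the column names of tags.csv.
toy_tags = pd.DataFrame({
    "tag": ["aids", "drama", "courtroom", "pixar", "toys"],
    "item_id": [508, 508, 508, 1, 1],
    "ranking": [4.0, 1.0, 2.0, 2.0, 1.0],
})
wide = top_tags_per_movie(toy_tags, n=2)
print(wide)
```

With `n=2`, the rank-4 tag of movie 508 is dropped and each list is ordered from most to least relevant.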


In [7]:
# Collapse the long table: one row per item_id holding the list of its tags.
tags_df = tags_df.groupby(['item_id'])['tag'].apply(list).reset_index(name='labels')
tags_df.head()

Unnamed: 0,item_id,labels
0,1,"[pg-13, disney, original, good soundtrack, excellent script, cute, unlikely friendships, oscar ..."
1,2,"[pg-13, books, original, lions, adapted from:book, children, teens, fantasy world, good, based ..."
2,3,"[sequels, pg-13, original, good soundtrack, comedy, feel-good, sequel, catastrophe, creativity,..."
3,4,"[pg-13, unlikely friendships, romantic, cancer, love story, divorce, women, happy ending, catas..."
4,5,"[sequels, pg-13, original, pregnancy, good, remake, love story, wedding, sequel, catastrophe, b..."


In [8]:
# Attach each movie's tag list to the catalogue (a left join keeps movies without tags).
data = data.merge(tags_df[['item_id','labels']], on='item_id', how='left')
data.head()

Unnamed: 0,title,directedBy,starring,avgRating,imdbId,item_id,votos,labels
0,Toy Story (1995),John Lasseter,"Tim Allen, Tom Hanks, Don Rickles, Jim Varney, John Ratzenberger, Wallace Shawn, Laurie Metcalf...",3.89146,tt0114709,1,68884.0,"[pg-13, disney, original, good soundtrack, excellent script, cute, unlikely friendships, oscar ..."
1,Jumanji (1995),Joe Johnston,"Jonathan Hyde, Bradley Pierce, Robin Williams, Kirsten Dunst",3.26605,tt0113497,2,27416.0,"[pg-13, books, original, lions, adapted from:book, children, teens, fantasy world, good, based ..."
2,Grumpier Old Men (1995),Howard Deutch,"Jack Lemmon, Walter Matthau, Ann-Margret , Sophia Loren",3.17146,tt0113228,3,15615.0,"[sequels, pg-13, original, good soundtrack, comedy, feel-good, sequel, catastrophe, creativity,..."
3,Waiting to Exhale (1995),Forest Whitaker,"Angela Bassett, Loretta Devine, Whitney Houston, Lela Rochon",2.86824,tt0114885,4,2992.0,"[pg-13, unlikely friendships, romantic, cancer, love story, divorce, women, happy ending, catas..."
4,Father of the Bride Part II (1995),Charles Shyer,"Steve Martin, Martin Short, Diane Keaton, Kimberly Williams, George Newbern, Kieran Culkin",3.0762,tt0113041,5,15507.0,"[sequels, pg-13, original, pregnancy, good, remake, love story, wedding, sequel, catastrophe, b..."


In [9]:
# Inspect the data type of each column.
data.dtypes

Unnamed: 0,0
title,object
directedBy,object
starring,object
avgRating,float64
imdbId,object
item_id,int64
votos,float64
labels,object


In [10]:
def generate_list(x, n):
    """Return the first n elements of x if it is a list, otherwise an empty list."""
    if isinstance(x, list):
        return x[:n]
    return []

In [11]:
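data['directedBy'] = data['directedBy'].apply(
    # Keep at most three names; missing values (NaN) become an empty list
    # instead of the literal token 'nan'.
    lambda s: [n.strip() for n in s.split(',')[:3]] if isinstance(s, str) else []
)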

In [12]:
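data['starring'] = data['starring'].apply(
    # Keep at most three names; missing values (NaN) become an empty list
    # instead of the literal token 'nan'.
    lambda s: [n.strip() for n in s.split(',')[:3]] if isinstance(s, str) else []
)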

In [13]:
# For the label column, we select a maximum of 10 values for each row.
data['labels'] = data['labels'].apply(generate_list, n = 10)

In [14]:
data.head()

Unnamed: 0,title,directedBy,starring,avgRating,imdbId,item_id,votos,labels
0,Toy Story (1995),[John Lasseter],"[Tim Allen, Tom Hanks, Don Rickles]",3.89146,tt0114709,1,68884.0,"[pg-13, disney, original, good soundtrack, excellent script, cute, unlikely friendships, oscar ..."
1,Jumanji (1995),[Joe Johnston],"[Jonathan Hyde, Bradley Pierce, Robin Williams]",3.26605,tt0113497,2,27416.0,"[pg-13, books, original, lions, adapted from:book, children, teens, fantasy world, good, based ..."
2,Grumpier Old Men (1995),[Howard Deutch],"[Jack Lemmon, Walter Matthau, Ann-Margret]",3.17146,tt0113228,3,15615.0,"[sequels, pg-13, original, good soundtrack, comedy, feel-good, sequel, catastrophe, creativity,..."
3,Waiting to Exhale (1995),[Forest Whitaker],"[Angela Bassett, Loretta Devine, Whitney Houston]",2.86824,tt0114885,4,2992.0,"[pg-13, unlikely friendships, romantic, cancer, love story, divorce, women, happy ending, catas..."
4,Father of the Bride Part II (1995),[Charles Shyer],"[Steve Martin, Martin Short, Diane Keaton]",3.0762,tt0113041,5,15507.0,"[sequels, pg-13, original, pregnancy, good, remake, love story, wedding, sequel, catastrophe]"


## 3. Final metadata cleanup
As a final cleanup step, we remove whitespace and convert everything to lowercase, so that each name or tag becomes a single token (for example, `Tom Hanks` becomes `tomhanks`). To do this, we define a function and apply it to the three feature columns: `labels`, `directedBy`, and `starring`.

In [15]:
def clean(x):
    """Lowercase and strip spaces from a string, or from every string in a list."""
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    # Anything else (e.g. NaN) carries no usable text.
    return ''

In [16]:
for feature in ['labels', 'directedBy', 'starring']:
    data[feature] = data[feature].apply(clean)
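
To see why the spaces are removed, it helps to look at the tokens `CountVectorizer` extracts with and without the cleanup. The sketch below uses made-up names; `tokens` is a hypothetical helper, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

def tokens(text: str) -> list:
    """Return the sorted vocabulary CountVectorizer extracts from one string."""
    return sorted(CountVectorizer().fit([text]).vocabulary_)

# Without cleaning, "Tom Hanks" and "Tom Cruise" share the token "tom",
# which would make their films look related for the wrong reason.
raw_tokens = tokens("Tom Hanks Tom Cruise")
print(raw_tokens)

# After removing spaces, each person is one unambiguous token.
clean_tokens = tokens("tomhanks tomcruise")
print(clean_tokens)
```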

In [17]:
# NOTE: earlier draft, overridden by the definition in the next cell.
# It stringifies the lists before splitting, which leaves brackets and quotes
# inside the tokens, and it also adds the title to the soup.
def combine(x):
    directed_by_list = clean(str(x['directedBy']).split(','))
    starring_list = clean(str(x['starring']).split(','))
    labels_list = clean(x['labels'])

    return ' '.join(labels_list) + ' ' + ' '.join(directed_by_list) + ' ' + str(x['title']) + ' ' + ' '.join(starring_list)

In [18]:
# Combine everything that describes a film (the directedBy, starring and
# labels fields) into a single metadata string.

def combine(x):
    # A space is needed between every group; without the one before the labels,
    # the last actor and the first tag fuse into a single token
    # (e.g. 'donricklespg-13').
    return ' '.join(x['directedBy']) + ' ' + ' '.join(x['starring']) + ' ' + ' '.join(x['labels'])

In [19]:
data['metadata'] = data.apply(combine, axis=1)

In [20]:
data.iloc[0]['metadata']

'johnlasseter timallen tomhanks donrickles pg-13 disney original goodsoundtrack excellentscript cute unlikelyfriendships oscarwinner children toys'

## 4. Generating recommendations

The following steps are almost identical to those of a synopsis-based recommender: vectorize the text, compute pairwise similarities, and rank the results.

Instead of `TfidfVectorizer`, we use `CountVectorizer`. TF-IDF down-weights terms that occur in many documents, so actors and directors with a relatively large number of films would receive less weight, and we do not want to penalise them for that.
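
The effect can be seen on a toy corpus. In the sketch below (made-up soups, default vectorizer settings), `tomhanks` appears in three of four items and `timallen` in only one. Raw counts treat them equally, while TF-IDF gives the frequent name a smaller weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical soups: "tomhanks" is prolific, "timallen" appears once.
toy_soups = [
    "tomhanks timallen",
    "tomhanks megryan",
    "tomhanks robinwright",
    "billmurray andiemacdowell",
]

counts = CountVectorizer().fit(toy_soups)
tfidf = TfidfVectorizer().fit(toy_soups)

# Feature vector of the first item under each scheme.
first_counts = counts.transform(toy_soups[:1]).toarray()[0]
first_tfidf = tfidf.transform(toy_soups[:1]).toarray()[0]

count_hanks = first_counts[counts.vocabulary_["tomhanks"]]
count_allen = first_counts[counts.vocabulary_["timallen"]]
tfidf_hanks = first_tfidf[tfidf.vocabulary_["tomhanks"]]
tfidf_allen = first_tfidf[tfidf.vocabulary_["timallen"]]

print(count_hanks, count_allen)    # equal weights under raw counts
print(tfidf_hanks < tfidf_allen)   # the prolific actor is down-weighted by TF-IDF
```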

**Cosine Similarity** is a natural metric for comparing our high-dimensional feature vectors. It measures the cosine of the angle between two vectors; for non-negative count vectors it ranges from 0 (no shared terms) to 1 (same direction, i.e. the same mix of terms).

$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}$$

By measuring the angle, Cosine Similarity focuses solely on the orientation (i.e., the shared tags and attributes) rather than the magnitude of the vectors, ensuring that item length or volume of tags does not skew the results.
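
A short numerical check of the formula, using hand-made vectors rather than real data: scaling a vector does not change its cosine similarity, and vectors with no attribute in common score 0. The helper `cosine` is illustrative; `cosine_similarity` is the scikit-learn function used later in the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors: (a . b) / (|a| |b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1, 1, 1, 0])   # three attributes present
b = np.array([2, 2, 2, 0])   # same attributes, each counted twice
c = np.array([0, 0, 0, 1])   # no attribute in common with a

sim_ab = cosine(a, b)        # same direction, different magnitude
sim_ac = cosine(a, c)        # orthogonal vectors
print(sim_ab, sim_ac)

# The scikit-learn function agrees with the manual formula.
sk_ab = cosine_similarity([a], [b])[0, 0]
print(np.isclose(sk_ab, sim_ab))
```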

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(data['metadata'])

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity2 = cosine_similarity(count_matrix, count_matrix)
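
The dense all-to-all matrix for 10,000 items has 10^8 entries, roughly 800 MB in 64-bit floats. If memory becomes a constraint, one alternative is to compute a single row on demand. The sketch below is an assumption about how that could look, not the approach taken in this notebook; `similar_rows` and the toy soups are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_rows(matrix, row: int, top_n: int = 6) -> list:
    """Return the positions of the top_n rows most similar to `row`, excluding itself."""
    sims = cosine_similarity(matrix[row], matrix).ravel()  # one row of the full matrix
    sims[row] = -1.0                                       # never recommend the query item
    return np.argsort(-sims, kind="stable")[:top_n].tolist()

# Hypothetical soups standing in for data['metadata'].
toy_soups = [
    "johnlasseter tomhanks timallen toys disney",
    "johnlasseter tomhanks timallen toys sequel",
    "johnlasseter bugs disney",
    "forestwhitaker angelabassett romantic",
]
toy_matrix = CountVectorizer().fit_transform(toy_soups)
top_two = similar_rows(toy_matrix, row=0, top_n=2)
print(top_two)
```

The trade-off is latency: each request now pays for one sparse matrix product instead of a lookup in a precomputed table.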

In [23]:
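# Make sure the positional index runs from 0 to len(data) - 1.
indices = data.reset_index(drop=True)

In [24]:
# Map each title to its row position in the similarity matrix.
indices = pd.Series(indices.index, index=data['title'])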
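
The title-to-position lookup can be illustrated on toy data. The sketch also shows a caveat: if two catalogue entries share a title, the lookup returns several positions instead of one. Whether the real catalogue contains duplicate titles is not stated here, so the de-duplication policy below is only one possible choice, and all names are illustrative.

```python
import pandas as pd

# Hypothetical catalogue with one duplicated title.
toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Hamlet (1996)", "Hamlet (1996)"]
})
lookup = pd.Series(toy.index, index=toy["title"])

unique_hit = lookup["Jumanji (1995)"]   # a single row position
ambiguous_hit = lookup["Hamlet (1996)"] # a Series with two positions
print(unique_hit)
print(len(ambiguous_hit))

# One possible policy: keep only the first occurrence of each title.
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["Hamlet (1996)"])
```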

In [25]:

def content_recommender(title, cosine_sim=cosine_similarity2, df=data, indices=indices):
    # Look up the row position of the title
    id_ = indices[title]

    # Pair every row position with its cosine similarity to the selected title
    scores = list(enumerate(cosine_sim[id_]))

    # Sort by cosine similarity, highest first
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (normally the title itself) and keep the next 6
    scores = scores[1:7]

    # Extract the row positions
    movie_indices = [i[0] for i in scores]

    # Return the titles of the 6 most similar items
    return df['title'].iloc[movie_indices]

In [26]:
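import random
def content_recommender_random(df=data):
    # randrange excludes the upper bound, so the position is always valid
    # (random.randint(0, len(df)) can return len(df) and raise an IndexError).
    title = df['title'].iloc[random.randrange(len(df))]
    print("These are the recommendations for " + title)
    return content_recommender(title)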

In [27]:

content_recommender('Toy Story (1995)', cosine_similarity2, data, indices)

Unnamed: 0,title
3028,Toy Story 2 (1999)
2271,"Bug's Life, A (1998)"
2712,Big (1988)
6272,Finding Nemo (2003)
4791,"Monsters, Inc. (2001)"
2422,"Other Sister, The (1999)"


In [31]:
def content_recommender_random2(df=data):
    # randrange excludes the upper bound, so the position is always valid.
    title = df['title'].iloc[random.randrange(len(df))]
    print("These are the recommendations related to " + title)
    return content_recommender(title, cosine_similarity2, data, indices)

In [33]:
content_recommender_random2()

These are the recommendations related to Good bye, Lenin! (2003)


Unnamed: 0,title
8089,"Spy Who Came in from the Cold, The (1965)"
3190,Gun Shy (2000)
9573,Outfoxed: Rupert Murdoch's War on Journalism (2004)
621,Primal Fear (1996)
1545,Conspiracy Theory (1997)
1729,Wild Things (1998)


In [30]:
data.to_csv("data.csv", index=False)