# Content Embedding as a Solution to the Cold Start Problem

The cold start problem arises when a recommender system has little or no interaction data for a new item or user, so it cannot rely on ratings or viewing history. One effective way to address it is content embedding: by mapping the content of each movie to a vector, we can measure how similar two movies are from their content alone. This makes it possible to recommend new or little-known items that nobody has rated yet.

In the following sections, we explore three methods for embedding movie content: raw token counts, TF-IDF weights, and sentence embeddings stored in an embedding database. Each method has its own advantages and trade-offs, and comparing them on the same query shows how the choice of representation affects the recommendations.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

In [12]:
df = pd.read_csv('./data/movies.csv')
df

Unnamed: 0,movieId,title,cast_and_crew,year,poster_path,genre
0,42594,Scream and Scream Again,Gordon Hessler (director); Christopher Wicking...,1970,/49m9QJ2ubKuVtdDj9B7XqgCyriv.jpg,"Horror, Science Fiction"
1,280133,Jenny,George Bloomfield (director/screenplay); Marti...,1970,/iaJ0PyRh17KoMcUlXxTVu6Vi85s.jpg,"Drama, Romance"
2,225155,The Adventures of Gerard,Jerzy Skolimowski (director/screenplay); Arthu...,1970,/aoOV7vWvGiNjaUTHCSIEAeHSvOw.jpg,"Adventure, Comedy, Drama"
3,85255,...tick...tick...tick...,Ralph Nelson (director); James Lee Barrett (sc...,1970,/rBNgytHXZEXgOl805pYWKiBQe7s.jpg,"Drama, Action"
4,117999,Last of the Mobile Hot Shots,Sidney Lumet (director); Gore Vidal (screenpla...,1970,/xj5Rub6H5B9yMh6GL3oomOM1SLg.jpg,Drama
...,...,...,...,...,...,...
11500,979097,Memory,Michel Franco (director/screenplay); Jessica C...,2023,/cZgTA5ZOKOIRGyYClJMu02VUNcE.jpg,Drama
11501,558915,The Color Purple,"Blitz Bazawule (director), Marcus Gardley (scr...",2023,/3Jc93sCl0DqkePYjw47zHpqj7YS.jpg,Drama
11502,823452,The Boys in the Boat,"George Clooney (director), Mark L. Smith (scre...",2023,/ncJMztHprw3gLRAnDjNnnT23CIt.jpg,"Drama, History"
11503,365620,Ferrari,"Michael Mann (director), Troy Kennedy Martin (...",2023,/nNMoJMDCeF4Q5wpWvKuh5b8K2sX.jpg,"Drama, History"


In [15]:
# Keep only the columns needed for recommendations
dataset = df[['movieId', 'title', 'year']].copy()
# Combine title, year, cast/crew and genre into one text field per movie
dataset['content'] = df['title'] + ' (' + df['year'].astype(str) + ')' + '; ' + df['cast_and_crew'] + '; ' + df['genre']
# Keep the same field on the original frame for later use
df['content'] = dataset['content']

In [11]:
dataset

Unnamed: 0,movieId,title,year,content
0,42594,Scream and Scream Again,1970,Scream and Scream Again (1970); Gordon Hessler...
1,280133,Jenny,1970,Jenny (1970); George Bloomfield (director/scre...
2,225155,The Adventures of Gerard,1970,The Adventures of Gerard (1970); Jerzy Skolimo...
3,85255,...tick...tick...tick...,1970,...tick...tick...tick... (1970); Ralph Nelson ...
4,117999,Last of the Mobile Hot Shots,1970,Last of the Mobile Hot Shots (1970); Sidney Lu...
...,...,...,...,...
11500,979097,Memory,2023,Memory (2023); Michel Franco (director/screenp...
11501,558915,The Color Purple,2023,The Color Purple (2023); Blitz Bazawule (direc...
11502,823452,The Boys in the Boat,2023,The Boys in the Boat (2023); George Clooney (d...
11503,365620,Ferrari,2023,"Ferrari (2023); Michael Mann (director), Troy ..."


## Count Vectors with Cosine Similarity

Let's generate recommendations for a given movie based on content similarity. The function below builds a token-count matrix from the `content` column with `CountVectorizer` and computes the cosine similarity between every pair of movies. If the input title is found in the dataset, it returns the 19 most similar movies: the slice `lst[1:20]` skips the first entry, which is normally the movie itself.
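To make the idea concrete, here is a minimal standard-library sketch of count vectors and cosine similarity. The three content strings are hypothetical and much shorter than the real ones, and the tokenizer only approximates the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

def count_vector(text):
    """Lowercase the text and count tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical content strings in the same "title (year); crew; genre" layout
docs = {
    "A": "Shark Attack (1975); Jane Doe (director); Horror, Thriller",
    "B": "Shark Attack 2 (1978); John Roe (director); Horror, Thriller",
    "C": "Love Letters (1975); Jane Doe (director); Romance",
}
vectors = {name: count_vector(text) for name, text in docs.items()}

sim_ab = cosine(vectors["A"], vectors["B"])  # shares title words and genres
sim_ac = cosine(vectors["A"], vectors["C"])  # shares year, director name, "director"
print(round(sim_ab, 3), round(sim_ac, 3))
```

In this toy example the sequel-like pair A and B scores highest, but C still scores fairly high only because it shares the year, the director's name, and the generic token "director". With raw counts every shared token counts equally, whatever it means.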

In [13]:
def get_recommendations(movie):
    # Build a token-count matrix: one row per movie, one column per token
    cv = CountVectorizer()
    count_matrix = cv.fit_transform(dataset['content'])

    # Pairwise cosine similarity between all movies
    similarity = cosine_similarity(count_matrix)

    # Stays empty if the title is not in the dataset
    movie_list = []
    if movie in dataset['title'].unique():
        # Use the first movie whose title matches
        idx = dataset.loc[dataset['title'] == movie].index[0]
        lst = list(enumerate(similarity[idx]))
        lst = sorted(lst, key=lambda x: x[1], reverse=True)
        # Skip position 0 (normally the movie itself) and keep the next 19
        lst = lst[1:20]

        for idx_sim, _ in lst:
            movie_list.append(f"{dataset['title'][idx_sim]} ({dataset['year'][idx_sim]})")

    return movie_list

In [14]:
get_recommendations('Jaws')

['Jaws 2 (1978)',
 'The Deep (1977)',
 'Funny Bones (1995)',
 'Jaws: The Revenge (1987)',
 'The Friends of Eddie Coyle (1973)',
 'The Batman (2022)',
 'Highlander II: The Quickening (1991)',
 'The Lost World: Jurassic Park (1997)',
 'James and the Giant Peach (1996)',
 'Jaws 3-D (1983)',
 'The Jerk (1979)',
 'B.S. I Love You (1971)',
 'E.T. the Extra-Terrestrial (1982)',
 'Predator (1987)',
 'The Frighteners (1996)',
 'The Return of the Pink Panther (1975)',
 'Sex, Lies, and Videotape (1989)',
 'The Perfection (2019)',
 'The Losers (2010)']

For 'Jaws', the recommender correctly suggests the sequels and several movies of similar genres, but it also includes unrelated comedies such as 'Funny Bones' and 'B.S. I Love You'.

Let's see if we can improve our recommender system.

### Explore Using Stopwords

In [19]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

def get_recommendations(movie):
    # Build a token-count matrix, ignoring English stop words from NLTK
    stop_words = list(stopwords.words('english'))
    cv = CountVectorizer(stop_words=stop_words)
    count_matrix = cv.fit_transform(dataset['content'])

    # Pairwise cosine similarity between all movies
    similarity = cosine_similarity(count_matrix)

    # Stays empty if the title is not in the dataset
    movie_list = []
    if movie in dataset['title'].unique():
        # Use the first movie whose title matches
        idx = dataset.loc[dataset['title'] == movie].index[0]
        lst = list(enumerate(similarity[idx]))
        lst = sorted(lst, key=lambda x: x[1], reverse=True)
        # Skip position 0 (normally the movie itself) and keep the next 19
        lst = lst[1:20]

        for idx_sim, _ in lst:
            movie_list.append(f"{dataset['title'][idx_sim]} ({dataset['year'][idx_sim]})")

    return movie_list

[nltk_data] Downloading package stopwords to /home/adin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
get_recommendations('Jaws')

['Jaws 2 (1978)',
 'The Deep (1977)',
 'Funny Bones (1995)',
 'Jaws: The Revenge (1987)',
 'The Friends of Eddie Coyle (1973)',
 'The Batman (2022)',
 'Highlander II: The Quickening (1991)',
 'James and the Giant Peach (1996)',
 'The Lost World: Jurassic Park (1997)',
 'The Return of the Pink Panther (1975)',
 'B.S. I Love You (1971)',
 'Jaws 3-D (1983)',
 'The Jerk (1979)',
 'E.T. the Extra-Terrestrial (1982)',
 'Sex, Lies, and Videotape (1989)',
 'The Frighteners (1996)',
 'The Perfection (2019)',
 'Predator (1987)',
 'Orca (1977)']

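Removing stop words barely changes the result: the same titles appear in a slightly different order, and the only new entry is 'Orca', which replaces 'The Losers' at the end of the list. The unrelated comedies are still there.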

## Utilizing TF-IDF for Improved Recommendations

TF-IDF (Term Frequency-Inverse Document Frequency) can improve the recommendations by weighting each term by how often it occurs in a document (here, a movie's content string) and by how rare it is across the entire dataset. Terms that appear in many movies, such as "director" or a common genre, are down-weighted, while rarer and more informative terms, such as a specific person's name, are emphasized.
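The weighting can be sketched in a few lines of plain Python. This is an illustration on a hypothetical three-document corpus, using the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the default variant in scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each "document" is a list of tokens
docs = [
    ["spielberg", "director", "horror"],
    ["smith", "director", "horror"],
    ["jones", "director", "drama"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "spielberg" occurs in only one document and gets the largest weight, "horror" occurs in two and comes second, and "director" occurs everywhere and gets the smallest weight.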

In [21]:
def get_recommendations(movie):
    # Build a TF-IDF matrix, ignoring scikit-learn's built-in English stop words
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(dataset['content'])

    # Pairwise cosine similarity between all movies
    similarity = cosine_similarity(tfidf_matrix)

    # Stays empty if the title is not in the dataset
    movie_list = []
    if movie in dataset['title'].unique():
        # Use the first movie whose title matches
        idx = dataset.loc[dataset['title'] == movie].index[0]
        lst = list(enumerate(similarity[idx]))
        lst = sorted(lst, key=lambda x: x[1], reverse=True)
        # Skip position 0 (normally the movie itself) and keep the next 19
        lst = lst[1:20]

        for idx_sim, _ in lst:
            movie_list.append(f"{dataset['title'][idx_sim]} ({dataset['year'][idx_sim]})")

    return movie_list

In [22]:
get_recommendations('Jaws')

['Jaws 2 (1978)',
 'Jaws: The Revenge (1987)',
 'The Jerk (1979)',
 'The Deep (1977)',
 'Jaws 3-D (1983)',
 'The Island (1980)',
 'Indiana Jones and the Temple of Doom (1984)',
 'The Lost World: Jurassic Park (1997)',
 '1941 (1979)',
 'Close Encounters of the Third Kind (1977)',
 'Caveman (1981)',
 'Summer School (1987)',
 'Always (1989)',
 'The Fabelmans (2022)',
 'Doctor Detroit (1983)',
 'James and the Giant Peach (1996)',
 'Bridge of Spies (2015)',
 'Paul (2011)',
 'E.T. the Extra-Terrestrial (1982)']

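With TF-IDF the list changes noticeably. The three sequels now rank in the top five, and many of the other results are films directed by Steven Spielberg, such as 'Close Encounters of the Third Kind', 'Indiana Jones and the Temple of Doom', and 'E.T. the Extra-Terrestrial'. This suggests that rare tokens such as the names of directors and actors now carry more weight than common words. Whether that is an improvement depends on the goal: the results are more closely tied to the cast and crew of 'Jaws', but not necessarily to its genre.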

## Recommender System Using Embedding Databases

Embedding databases (also called vector databases) store embeddings: dense vectors that capture semantic relationships between items such as movies, products, or documents, based on their features or content. Given a query, the database returns the stored items whose embeddings are closest to it, which makes it a natural backend for content-based recommendations.

**Benefits of Using Embedding Databases**:
- **Rich Representation**: Embedding databases capture complex relationships between items, enabling a comprehensive understanding of their features or content.
- **Semantic Understanding**: By leveraging embeddings, recommender systems can interpret and capture the semantic meaning of items, leading to more accurate recommendations.
- **Flexibility**: Embedding databases offer flexibility in storing and accessing embeddings for diverse types of items, making them suitable for building recommender systems for various domains.
- **Scalability**: Embedding databases can handle large-scale datasets efficiently, allowing for the storage and retrieval of embeddings for millions or even billions of items.

In this section we use Chroma as the embedding database and the `all-MiniLM-L6-v2` sentence-transformer model to embed each movie's content string.
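Conceptually, querying an embedding database is a nearest-neighbour search over stored vectors. The sketch below shows the idea with a brute-force scan. The item names and 3-dimensional vectors are made up for illustration; real models produce vectors with hundreds of dimensions, and real databases typically use an approximate index such as HNSW instead of scanning every item.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity, so that smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Hypothetical store: item id -> toy embedding
store = {
    "shark thriller": [0.9, 0.1, 0.0],
    "sea adventure": [0.7, 0.3, 0.1],
    "romantic comedy": [0.0, 0.2, 0.9],
}

def query(vector, n_results=2):
    """Return the ids of the n_results stored items closest to vector."""
    ranked = sorted(store, key=lambda item_id: cosine_distance(store[item_id], vector))
    return ranked[:n_results]

nearest = query([1.0, 0.0, 0.0])
print(nearest)
```

The query vector points in almost the same direction as the first two items, so they are returned ahead of the unrelated third one.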

In [None]:
# Uncomment to install the required packages
# !python -m pip install -U sentence-transformers chromadb

In [9]:
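import pandas as pd

df = pd.read_csv("./movies/movies.csv")
# Same content string as before: "title (year); cast and crew; genre"
df['content'] = df['title'] + ' (' + df['year'].astype(str) + ')' + '; ' + df['cast_and_crew'] + '; ' + df['genre']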


In [2]:
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DATA_PATH = "./movies/chroma"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "movies"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

In [4]:
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)

collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
    get_or_create=True
)

In [3]:
# Uncomment to delete an existing collection and start from scratch
# client.delete_collection(COLLECTION_NAME)

In [16]:
# One document, one string id and one metadata dict per movie
documents = df['content'].tolist()
ids = df['movieId'].astype(str).tolist()
genres = [{"genre": g} for g in df['genre']]
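In the cell above, each movie's genres are stored as a single comma-separated string, for example "Horror, Science Fiction". An exact-match metadata filter on `genre` can therefore only select movies whose whole genre string equals the requested value. One possible alternative, sketched here as an assumption and not as part of the original notebook, is to store one boolean flag per genre as well. The `genre_` key scheme and the `genre_metadata` helper are hypothetical.

```python
def genre_metadata(genre_string):
    """Split 'Horror, Thriller' into one boolean flag per genre, keeping the raw string."""
    meta = {"genre": genre_string}
    for name in genre_string.split(","):
        name = name.strip()
        if name:
            # Hypothetical key scheme: "genre_horror", "genre_science_fiction", ...
            meta["genre_" + name.lower().replace(" ", "_")] = True
    return meta

example = genre_metadata("Horror, Science Fiction")
print(example)

# A filter that could then select horror or thriller movies (assumed usage, not run here)
where = {"$or": [{"genre_horror": True}, {"genre_thriller": True}]}
```

The metadata list would then be built with `[genre_metadata(g) for g in df['genre']]`, assuming every genre value is a string and that the installed Chroma version accepts boolean metadata values and the `$or` operator.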

In [17]:
len(df), len(documents), len(ids), len(genres)

(11505, 11505, 11505, 11505)

In [19]:
collection.add(
    ids=ids,
    documents=documents,
    metadatas=genres,
)
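Depending on the Chroma version and backend, a single `add` call with many thousands of records may exceed an internal batch limit. If that happens, the records can be added in smaller chunks. The helper below is a generic sketch, and the batch size of 1000 is an arbitrary assumption, not a documented limit.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical batch size; the real limit, if any, depends on the installation
BATCH_SIZE = 1000

# Stand-in for the 11505 ids used in this notebook
sizes = [len(chunk) for chunk in batched(list(range(11505)), BATCH_SIZE)]
print(len(sizes), sizes[-1])
```

In the notebook, the same slicing would be applied to `ids`, `documents`, and `genres` in parallel, with one `collection.add(...)` call per slice.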

In [20]:
# Retrieve movies
movie = 'Jaws'
query_results = collection.query(
    query_texts=[f"Find me some movies related to {movie}"],
    n_results=20,
)

query_results["documents"]

[['Jaws: The Revenge (1987); Joseph Sargent (director); Michael de Guzman (screenplay); Lorraine Gary, Lance Guest, Mario Van Peebles, Karen Young, Michael Caine, Judith Barsi, Lynn Whitfield, Mitchell Anderson, Melvin Van Peebles, Fritzi Jane Courtney, Lee Fierro, Roy Scheider, Cedric Scott, Charles Bowleg, Mary Smith, Edna Billotto, Cyprian R. Dube, William E. Marks, Diane Hetfield, Jay Mello; Adventure, Thriller, Horror',
  'Jaws (1975); Steven Spielberg (director); Peter Benchley, Carl Gottlieb (screenplay); Roy Scheider, Robert Shaw, Richard Dreyfuss, Lorraine Gary, Murray Hamilton, Carl Gottlieb, Jeffrey Kramer, Susan Backlinie, Lee Fierro, Peter Benchley, Fritzi Jane Courtney, Steven Spielberg; Horror, Thriller, Adventure',
  'Jaws 3-D (1983); Joe Alves (director); Richard Matheson, Carl Gottlieb (screenplay); Dennis Quaid, Bess Armstrong, Simon MacCorkindale, Louis Gossett Jr., John Putch, Lea Thompson, P.H. Moriarty, Kaye Stevens, Dan Blasko, Liz Morris, Harry Grant, Lisa Maur

In [25]:
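# Retrieve documents related to Jaws, restricted by genre.
# Note: the filter compares the whole metadata string, so it only matches movies
# whose genre is exactly "Horror" or exactly "Thriller", not "Horror, Thriller".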
query_results = collection.query(
    query_texts=[f"Find me some movies related to {movie}"],
    where={"genre": {"$in": ["Horror", "Thriller"]}},
    n_results=20,
)

query_results["documents"]

[['Triangle (2009); Christopher Smith (director); Melissa George, Michael Dorman, Rachael Carpani, Henry Nixon, Emma Lung, Liam Hemsworth; Horror',
  'Allegoria (2022); Spider One (director/screenplay); Krsy Fox, John Ennis, Bryce Johnson, Edward Hong, Adam Busch, Adam Marcinowski; Horror',
  'Cobweb (2023); Samuel Bodin (director); Chris Thomas Devlin (screenplay); Lizzy Caplan, Woody Norman, Cleopatra Coleman, Antony Starr; Horror',
  "The Boogens (1981); James L. Conway (director); David O'Malley, Bob Hunt (screenplay); Rebecca Balding, Fred McCarren, Anne-Marie Martin, Jeff Harlan, John Crawford, Med Flory, Jon Lormer, Scott Wilkinson; Horror",
  "Flashback (2021); Christopher MacBride (director/screenplay); Dylan O'Brien, Maika Monroe, Hannah Gross, Emory Cohen, Keir Gilchrist; Thriller",
  'Up from the Depths (1979); Charles B. Griffith (director); Anne Dyer, Alfred Sweeney (screenplay); Sam Bottoms, Susanne Reed, Virgil Frye, Kedric Wolfe, Charles Howerton, Denise Hayes, R. Lee 