Project Overview: The dataset is a modified MovieLens ml-latest-small dataset collected by GroupLens Research (https://grouplens.org/datasets/movielens/). Specifically, I used the movies file to build a simple content-based recommendation system. The system computes the cosine similarity between the TF-IDF (Term Frequency-Inverse Document Frequency) vectors of the movies and the TF-IDF vector of a processed query. The similarity scores are then ranked to return the top k recommendations.

Import the necessary libraries: pandas for data processing, scikit-learn for TF-IDF vectorization and cosine similarity, and sentence-transformers (built on PyTorch) for embedding the query.

In [77]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Load the MovieLens dataset and randomly select 500 movies.

In [78]:
# Load dataset
file_path = 'ml-latest-small/movies.csv'  
df = pd.read_csv(file_path)
# Randomly select a smaller subset; use a fixed seed to ensure reproducibility
df = df.sample(n=500, random_state=42)

# Inspect the dataset; no further processing is needed
df.info()
# df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, 6213 to 1593
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  500 non-null    int64 
 1   title    500 non-null    object
 2   genres   500 non-null    object
dtypes: int64(1), object(2)
memory usage: 15.6+ KB


The TF-IDF (Term Frequency-Inverse Document Frequency) vectors of the movies are computed from the genres column using scikit-learn.
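
For reference, here is the weighting that scikit-learn's TfidfVectorizer applies, assuming its default settings (smooth_idf=True, sublinear_tf=False, norm='l2'), which is what the call in this notebook uses apart from the stop-word list. The symbols are introduced here for illustration: n is the number of documents (movies), df(t) is the number of documents containing term t, and tf(t, d) is the raw count of t in document d.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

Each document vector is then scaled to unit Euclidean length. The cosine similarity between a query vector q and a document vector d is

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert\,\lVert d \rVert},
$$

which reduces to the dot product q · d when both vectors already have unit length.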


In [None]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Compute TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(df['genres'])
# Display the shape of the TF-IDF matrix
#print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

process_query rewrites the input query so that it matches the genre vocabulary more closely. The three genres most similar to the input query are selected, and those with a cosine similarity score greater than 0.3 form the output query.

A Sentence Transformer model captures the semantic meaning of the input query; for example, "love" is semantically close to "romance".


In [84]:
def process_query(query, df):
    # Extract unique genres from the DataFrame
    unique_genres = set()
    for genres in df['genres']:
        unique_genres.update(genres.split('|'))
    unique_genres = list(unique_genres) 

    # Load a pre-trained Sentence Transformer model
    model = SentenceTransformer('all-MiniLM-L6-v2') 
    
    # Encode the query and genres
    query_embedding = model.encode(query)
    genre_embeddings = model.encode(unique_genres)

    # Compute cosine similarity between the query and genres
    similarities = util.cos_sim(query_embedding, genre_embeddings)

    # Map genres to their similarity scores
    genre_similarity_scores = {genre: score.item() for genre, score in zip(unique_genres, similarities[0])}

    # Take the 3 most similar genres and keep those with a similarity score above 0.3
    top_genres = sorted(genre_similarity_scores.items(), key=lambda x: x[1], reverse=True)[:3]
    processed_query = " " + " ".join([genre for genre, score in top_genres if score > 0.3])
    print("Processed Query:", processed_query)
    return processed_query
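
The selection step at the end of process_query can be looked at in isolation. The sketch below is a minimal restatement of it, assuming a dictionary of genre scores is already available. The helper select_genres and the scores in toy_scores are hypothetical and were not produced by any model. Because the top 3 are taken first and the threshold is applied afterwards, fewer than three genres may come back.

```python
def select_genres(scores, top_k=3, threshold=0.3):
    """Take the top_k highest-scoring genres, then keep those above threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [genre for genre, score in ranked if score > threshold]

# Hypothetical similarity scores for illustration only
toy_scores = {"Romance": 0.52, "Drama": 0.31, "Horror": 0.12, "Comedy": 0.28}

print(select_genres(toy_scores))  # ['Romance', 'Drama']
```

If no genre clears the threshold, the list is empty. In that case the processed query contains no genre terms, its TF-IDF vector is all zeros, and every movie gets a similarity of 0.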

Here is an example you can use to test the process_query function:

In [None]:
query = "Movie with boyfriend-girlfriend relationship"
process_query(query, df)

Processed Query:  Romance Film-Noir Thriller


' Romance Film-Noir Thriller'

Main function for recommendation. 

The cosine similarity between the TF-IDF matrix of the movies and the TF-IDF vector of the processed query is computed, and the movies are ranked by that score. The top_n highest-ranked movies are returned as recommendations.

In [81]:
def recommend_items(query, df, tfidf_matrix, vectorizer, top_n=5):
    query = process_query(query, df)
    # Transform the query into a TF-IDF vector
    query_vec = vectorizer.transform([query])
    
    # Compute cosine similarity between the query and all items
    cosine_similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    
    # Add similarity scores to the dataframe
    recommendations = df.copy()
    recommendations['similarity'] = cosine_similarities
    
    # Sort by similarity and return the top N items
    recommendations = recommendations.sort_values(by='similarity', ascending=False).head(top_n)
    #print(recommendations)
    return recommendations[['title', 'similarity']]
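
The ranking step can be tried without downloading the sentence-transformer model. The sketch below is a toy version under stated assumptions: demo_df is a made-up three-movie catalogue, and the string "Sci-Fi Comedy" stands in for what process_query might return. All demo_ names are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue; titles and genres are made up
demo_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy|Sci-Fi", "Drama|Romance", "Comedy"],
})

demo_vectorizer = TfidfVectorizer(stop_words='english')
demo_matrix = demo_vectorizer.fit_transform(demo_df["genres"])

# Stand-in for a processed query; transformed with the fitted vectorizer
demo_query_vec = demo_vectorizer.transform(["Sci-Fi Comedy"])

# One similarity score per movie, in the same row order as demo_df
demo_scores = cosine_similarity(demo_query_vec, demo_matrix).flatten()

# Attach the scores and sort from most to least similar
demo_ranked = demo_df.assign(similarity=demo_scores).sort_values(
    by="similarity", ascending=False
)
print(demo_ranked[["title", "similarity"]])
```

Movie A contains exactly the query's terms, so it should rank first with a similarity of about 1. Movie C shares only one term and lands in between, and Movie B shares none and scores 0.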

Here is an example of using the recommendation system:

In [88]:
# Example query
query = "I love thrilling action movies set in space, with a comedic twist."
#query = "romance and action"

# Get recommendations
recommendations = recommend_items(query, df, tfidf_matrix, vectorizer)

# Display the results
print("Top Recommendations:")
print(recommendations)

Processed Query:  Sci-Fi Thriller Comedy
Top Recommendations:
                                     title  similarity
3617  Incredible Shrinking Man, The (1957)    0.933803
8923            Absolutely Anything (2015)    0.888347
7300           Hot Tub Time Machine (2010)    0.888347
3973     Man in the White Suit, The (1951)    0.888347
3352                  Short Circuit (1986)    0.888347
