# Content-Based Movie Recommendation Using Metadata and User Ratings in Python

This notebook was built as a basic data science project. It walks through how to build a movie recommendation system using two approaches: a simple popularity-based method and a more advanced content-based method that uses movie metadata and user ratings. We will cover data loading, cleaning, basic recommendation techniques, and the use of sentence embeddings to find movies with similar descriptions.

* As the name says, this is just a basic movie recommendation system. That's why I have used an almost clean dataset, which is available in the data folder of this repo. I did not run any EDA in this notebook.

**WARNING**:

* Don't run this notebook on a low-end computer: it runs an embedding model locally, which can make your PC very slow and can even cause crashes.

* You can use `Google Colab` in that case.

The data comes in two CSV files, `movies_metadata.csv` and `ratings.csv`.

`movies_metadata` contains the following columns:

- `movie_id`: Unique identifier of each movie.
- `title`: Title of the movie.
- `overview`: Short description of the movie.
- `vote_average`: Average score the movie got.
- `vote_count`: Total number of votes the movie got.

`ratings` contains the following columns:

- `user_id`: Unique identifier of the person who rated the movie.
- `movie_id`: Unique identifier of the movie.
- `rating`: Value between 0 and 10 indicating how much the person liked the movie.

## Importing basic Libraries

In [1]:
# For Data manipulation
import pandas as pd

# For visualization
import matplotlib.pyplot as plt

## Loading both datasets.

In [2]:
# Read the movies_metadata file
movies_df = pd.read_csv('movies_metadata.csv')

# Read the ratings file
ratings_df = pd.read_csv('ratings.csv')

## Take a look at both the datasets.

In [3]:
print("Movies Metadata:")
movies_df.head()

Movies Metadata:


Unnamed: 0,movie_id,title,overview,vote_average,vote_count
0,95765.0,Cinema Paradiso,"A filmmaker recalls his childhood, when he fel...",8.2,834.0
1,67116.0,The French Connection,Tough narcotics detective 'Popeye' Doyle is in...,7.4,435.0
2,80801.0,The Gods Must Be Crazy,Misery is brought to a small group of Sho in t...,7.1,251.0
3,96446.0,Willow,Fearful of a prophecy stating that a girl chil...,6.9,484.0
4,112697.0,Clueless,"Shallow, rich and socially successful Cher is ...",6.9,828.0


In [4]:
print("Ratings:")
ratings_df.head()

Ratings:


Unnamed: 0,user_id,movie_id,rating
0,2,113862.0,3.0
1,2,114898.0,3.0
2,2,109444.0,4.0
3,2,109830.0,3.0
4,2,111257.0,3.0


## Shape of both the datasets.

In [5]:
print("The shape of the movies metadata: ", movies_df.shape)
print("The shape of the ratings: ", ratings_df.shape)

The shape of the movies metadata:  (9010, 5)
The shape of the ratings:  (99793, 3)


## Check whether missing values are present in either dataset.

In [6]:
print("The missing values in the movies metadata: ",movies_df.isnull().sum().sum())
print("The missing values in the rating: ", ratings_df.isnull().sum().sum())

The missing values in the movies metadata:  12
The missing values in the rating:  0


* We have a total of 12 missing values, all in the movies metadata.
* Let's take a look at the column-wise distribution of these missing values.

In [7]:
movies_df.isnull().sum()

Unnamed: 0,0
movie_id,0
title,0
overview,12
vote_average,0
vote_count,0


* All the missing values are in a single column named `overview`, which contains a short description of each movie's story.

* We can't impute these values with the mode of the column, because each movie has its own unique overview and filling the gaps with the mode would assign one movie's description to other movies.

* So, we will simply replace the NaN values with an empty string.

In [8]:
movies_df['overview'] = movies_df['overview'].fillna('')
movies_df['overview'].isnull().sum()

np.int64(0)

* Now the `overview` column does not have any NaN values.

In [9]:
unique_movies_count = movies_df['movie_id'].nunique()
print(f"Number of unique movies: {unique_movies_count}")

unique_rated_movies_count = ratings_df['movie_id'].nunique()
print(f"Number of unique movies that have been rated: {unique_rated_movies_count}")

Number of unique movies: 9010
Number of unique movies that have been rated: 9010


* Both datasets contain the same number of unique `movie_id` values (9010), which suggests they cover the same movies and will be helpful when merging them.
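
Equal counts do not by themselves guarantee that the two files contain the same IDs. A minimal sketch of a stricter check, using small made-up frames (`toy_movies` and `toy_ratings` are illustrative stand-ins, not the real data), compares the ID sets directly:

```python
import pandas as pd

# Toy stand-ins for movies_df and ratings_df (hypothetical values)
toy_movies = pd.DataFrame({"movie_id": [1.0, 2.0, 3.0], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "user_id": [10, 10, 11],
    "movie_id": [1.0, 2.0, 4.0],
    "rating": [3.0, 4.0, 5.0],
})

movie_ids = set(toy_movies["movie_id"])
rated_ids = set(toy_ratings["movie_id"])

# IDs present in one frame but missing from the other
only_in_movies = movie_ids - rated_ids
only_in_ratings = rated_ids - movie_ids
same_ids = movie_ids == rated_ids

print("Same set of IDs:", same_ids)
print("Only in movies:", only_in_movies)
print("Only in ratings:", only_in_ratings)
```

On the real data, the same comparison could be run with `movies_df['movie_id']` and `ratings_df['movie_id']`.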

## A Simple recommender based on popularity or highest rating

* The cell below defines a `simple_recommender` function that takes a numerical column as the ranking criterion and the number of recommendations as input.

* It then sorts `movies_df` by the given column in descending order.

* Finally, it returns the titles of the top 10 movies (or any number specified), together with the criterion value, as recommendations.

In [10]:
def simple_recommender(criterion='vote_average', n_recommendations=10):

  if criterion not in ['vote_average', 'vote_count']:
    raise ValueError("Criterion must be 'vote_average' or 'vote_count'")

  # Sort the DataFrame by the chosen criterion in descending order
  recommended_movies = movies_df.sort_values(by=criterion, ascending=False)

  # Select the top N movies and the relevant columns
  return recommended_movies[['title', criterion]].head(n_recommendations)

* This is a simple example of the top 10 recommendations based on the `vote_average` column.

In [11]:
print("Top 10 movies by vote average:")
print(simple_recommender(criterion='vote_average', n_recommendations=10))

Top 10 movies by vote average:
                                         title  vote_average
8907                                  Reckless          10.0
8363    Carmen Miranda: Bananas Is My Business          10.0
7463    Common Threads: Stories from the Quilt          10.0
6603                   Chilly Scenes of Winter          10.0
873                      Dancer, Texas Pop. 81          10.0
430                        Survive and Advance          10.0
2882  The Haunted World of Edward D. Wood, Jr.          10.0
1910                             The Civil War           9.2
6136                                    Cosmos           9.1
4329                      Little Miss Broadway           9.0


* This is an example of the top 5 recommendations based on the `vote_count` column.

In [13]:
print("Top 5 movies by vote count:")
print(simple_recommender(criterion='vote_count', n_recommendations=5))

Top 5 movies by vote count:
                title  vote_count
67          Inception     14075.0
5577  The Dark Knight     12269.0
5761           Avatar     12114.0
8033     The Avengers     12000.0
6935         Deadpool     11444.0
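
The `vote_average` ranking above is led by several titles with a perfect 10.0, which may come from movies with very few votes (an assumption, since the vote counts are not printed in that output). A hedged sketch of one possible variation is shown below: it adds a hypothetical `min_votes` parameter that filters out movies with too few votes before sorting. The function name and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd

def simple_recommender_min_votes(movies, criterion="vote_average",
                                 n_recommendations=10, min_votes=0):
    """Rank movies by `criterion`, ignoring those with fewer than `min_votes` votes."""
    if criterion not in ("vote_average", "vote_count"):
        raise ValueError("Criterion must be 'vote_average' or 'vote_count'")

    # Keep only movies with enough votes (min_votes is a hypothetical knob)
    eligible = movies[movies["vote_count"] >= min_votes]

    # Sort by the chosen criterion in descending order
    ranked = eligible.sort_values(by=criterion, ascending=False)
    return ranked[["title", "vote_average", "vote_count"]].head(n_recommendations)

# Toy data: "A" has a perfect score but only 2 votes
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [10.0, 8.5, 7.9, 9.1],
    "vote_count": [2, 1200, 5400, 40],
})

top = simple_recommender_min_votes(toy_movies, min_votes=100, n_recommendations=2)
print(top)
```

The right threshold depends on the dataset; any value used here is a tuning choice rather than a recommendation.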


## Generate recommendations based on given movie_title and rating.

* This function takes a movie title as a reference and generates recommendations based on how similarly users rated other movies.

* The function uses `cosine_similarity` to compare the rating vector of the reference movie with the rating vector of every other movie.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

def movie_recommender(movie_title, ratings_df, movies_df):
  movie_ratings = pd.merge(ratings_df, movies_df[['movie_id', 'title']], on='movie_id')

  movie_matrix = movie_ratings.pivot_table(index='user_id', columns='title', values='rating')

  movie_matrix.fillna(0, inplace=True)

  if movie_title not in movie_matrix.columns:
    return f"Movie '{movie_title}' not found in the ratings data."

  target_movie_ratings = movie_matrix[movie_title]

  target_movie_ratings_2d = target_movie_ratings.values.reshape(1, -1)

  movie_matrix_transposed = movie_matrix.T

  cosine_sim = cosine_similarity(target_movie_ratings_2d, movie_matrix_transposed)

  similarity_scores = pd.DataFrame(cosine_sim.T, index=movie_matrix.columns, columns=['similarity'])

  similarity_scores = similarity_scores.sort_values(by='similarity', ascending=False)

  similarity_scores = similarity_scores.drop(movie_title)

  return similarity_scores

* This is just an example of getting recommendations from rating similarity, using `Toy Story` as the reference movie title.

In [15]:
# Find movies similar to 'Toy Story'
similar_movies = movie_recommender('Toy Story', ratings_df, movies_df)
print("\nMovies similar to 'Toy Story':")
similar_movies.head()


Movies similar to 'Toy Story':


title,similarity
Toy Story 2,0.59471
Star Wars,0.576188
Forrest Gump,0.564534
Independence Day,0.562946
Groundhog Day,0.548023
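
To make the similarity computation concrete, here is a minimal sketch on made-up ratings. It builds the same kind of user-by-title pivot table as `movie_recommender`, but computes the cosine similarity with plain NumPy (dot product of unit-length vectors) instead of scikit-learn, only to expose the formula. The toy data is hypothetical, and the sketch assumes every title has at least one rating so that no vector has zero length.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: users 1 and 2 like A and B; user 3 dislikes A and likes C
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "title":   ["A", "B", "A", "B", "A", "C"],
    "rating":  [5.0, 5.0, 4.0, 4.0, 1.0, 5.0],
})

# Rows are users, columns are titles; unrated entries are treated as 0
matrix = toy_ratings.pivot_table(index="user_id", columns="title", values="rating").fillna(0)

# One row per title, scaled to unit length
vectors = matrix.T.to_numpy(dtype=float)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine similarity between every pair of titles
item_sim = pd.DataFrame(unit @ unit.T, index=matrix.columns, columns=matrix.columns)

# Titles most similar to "A", excluding "A" itself
similar_to_a = item_sim["A"].drop("A").sort_values(ascending=False)
print(similar_to_a)
```

Note that filling unrated movies with 0 treats "not rated" the same as a rating of 0, which is a simplification that the notebook's function also makes.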


## Generate embeddings based on the movie descriptions

* This code cell installs the library used to run a local embedding model.

* It can take up to 5 minutes to finish, so just wait for it.

In [16]:
!pip install -U sentence-transformers


`sentence-transformers` is a Python library that makes it easy to compute dense vector embeddings for sentences, paragraphs, and even images. It's built on top of the popular transformers library by Hugging Face and provides a simple interface for using state-of-the-art pre-trained models to generate high-quality embeddings. These embeddings can then be used for various tasks, such as:

* Semantic Search: Finding text that is semantically similar to a query.

* Clustering: Grouping similar sentences or documents together.

* Recommendation Systems: Recommending items based on text descriptions, as demonstrated in this notebook.

* Text Classification: Categorizing text based on its content.

* This code cell can also take up to 3 minutes to run completely.

In [17]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')


* The cell above initializes a local Sentence Transformer model.

* The library offers several pretrained models. I am using 'all-MiniLM-L6-v2', a good general-purpose model that is relatively small, which makes it a reasonable choice for this recommendation system.

## Generating the embeddings of `overview` column.

This code cell uses the initialized SentenceTransformer model to generate embeddings for the `overview` column of the `movies_df` DataFrame.


* **Import tqdm**: It imports tqdm.auto for displaying a progress bar during the embedding generation process.

* **Register tqdm with pandas**: tqdm.pandas() integrates the progress bar with pandas operations.

* **Generate Embeddings**: `movies_df['overview'].progress_apply(lambda x: model.encode(x))` applies the model.encode() function to each overview in the 'overview' column. The model.encode() function converts the text description into a numerical vector (the embedding). The progress_apply shows the progress bar as it processes each row.

* **Store Embeddings**: The generated embeddings are stored in a new column called 'overview_embeddings' in the movies_df DataFrame.

* **Print Confirmation**: It prints "Embeddings generated successfully." to indicate that the process is complete.

* **Display DataFrame Head**: movies_df.head() displays the first few rows of the DataFrame, including the new 'overview_embeddings' column, so you can see the generated embeddings.

These embeddings are crucial for the next step, where you'll use them to calculate the similarity between different movie overviews and generate recommendations based on content.

* As described above, this code cell generates embeddings for all the descriptions in the dataset.

* It can take up to 10 minutes to run completely. You can follow the progress bar in the output.

In [18]:
from tqdm.auto import tqdm

tqdm.pandas()

movies_df['overview_embeddings'] = movies_df['overview'].progress_apply(lambda x: model.encode(x))

print("Embeddings generated successfully.")
movies_df.head()


Embeddings generated successfully.


Unnamed: 0,movie_id,title,overview,vote_average,vote_count,overview_embeddings
0,95765.0,Cinema Paradiso,"A filmmaker recalls his childhood, when he fel...",8.2,834.0,"[-0.05425616, -0.03165526, -0.021484615, 0.019..."
1,67116.0,The French Connection,Tough narcotics detective 'Popeye' Doyle is in...,7.4,435.0,"[-0.10362899, -0.08074167, -0.088221915, -0.02..."
2,80801.0,The Gods Must Be Crazy,Misery is brought to a small group of Sho in t...,7.1,251.0,"[-0.0051899557, 0.10656531, 0.00037663497, 0.0..."
3,96446.0,Willow,Fearful of a prophecy stating that a girl chil...,6.9,484.0,"[-0.012960977, 0.0008763148, -0.050599206, 0.0..."
4,112697.0,Clueless,"Shallow, rich and socially successful Cher is ...",6.9,828.0,"[0.009831264, 0.011321236, 0.07493013, 0.00525..."
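
Encoding one overview at a time works, but `SentenceTransformer.encode` also accepts a list of texts along with `batch_size` and `show_progress_bar` arguments, which is usually faster. The sketch below is one possible alternative rather than the notebook's method; the helper name `add_overview_embeddings` and the default batch size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def add_overview_embeddings(df, model, text_col="overview", batch_size=64):
    """Return a copy of `df` with an 'overview_embeddings' column.

    `model` is expected to behave like a SentenceTransformer: its encode()
    method takes a list of strings and returns one vector per string.
    """
    texts = df[text_col].fillna("").tolist()

    # Encode all texts in batches instead of row by row
    embeddings = np.asarray(
        model.encode(texts, batch_size=batch_size, show_progress_bar=True)
    )

    out = df.copy()
    # Store one 1-D array per row, matching the layout used in the notebook
    out["overview_embeddings"] = pd.Series(list(embeddings), index=out.index)
    return out

# Usage in this notebook (assumes `model` and `movies_df` from the cells above):
# movies_df = add_overview_embeddings(movies_df, model)
```

Keeping the embeddings as one array per row means the rest of the notebook can use the column unchanged.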


## Use embedding similarity to generate recommendations

* This code cell defines a function called `embedding_recommender`. This function takes a user's description of a movie, the movies DataFrame with embeddings, the Sentence Transformer model, and the desired number of recommendations as input. It then generates an embedding for the user's description, calculates the similarity between this embedding and the embeddings of all movie overviews, and returns a list of the top N movies with the highest similarity scores. Essentially, it finds movies whose descriptions are most similar in meaning to what the user described.

In [19]:
def embedding_recommender(user_description, movies_df, model, n_recommendations=10):
  # Generate embedding for the user description
  user_embedding = model.encode(user_description)

  # Calculate cosine similarity between user embedding and all movie embeddings
  # movies_df['overview_embeddings'] contains numpy arrays, convert to a list of arrays
  movie_embeddings_list = list(movies_df['overview_embeddings'].values)
  cosine_sim = cosine_similarity([user_embedding], movie_embeddings_list)

  # Get the similarity scores for all movies
  similarity_scores = pd.DataFrame({'title': movies_df['title'], 'similarity': cosine_sim[0]})

  # Sort the similarity scores in descending order
  similarity_scores = similarity_scores.sort_values(by='similarity', ascending=False)

  # Return the top N recommendations
  return similarity_scores.head(n_recommendations)

* This code cell generates a list of 10 movie recommendations for the user description `An adventure movie`.

In [20]:
# Example usage:
user_input = "An adventure movie"
top_movies = embedding_recommender(user_input, movies_df, model, n_recommendations=10)
print(f"\nTop 10 movies similar to: '{user_input}'")
top_movies


Top 10 movies similar to: 'An adventure movie'


Unnamed: 0,title,similarity
6869,King Kong,0.695269
4353,An Awfully Big Adventure,0.525858
673,Oceans,0.508059
2745,Toy Story of Terror!,0.505055
6552,The Hotel New Hampshire,0.504362
6970,The Poseidon Adventure,0.499834
6642,Lola Montès,0.496342
6053,Quest for Fire,0.495148
3805,"10,000 BC",0.491594
6253,The Adventures of Huck Finn,0.478801
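
The ranking step inside `embedding_recommender` can be illustrated without a model. The sketch below uses tiny hand-made 2-D vectors in place of real embeddings (the titles and numbers are hypothetical) and computes cosine similarity as a dot product of unit-length vectors. In the notebook, the embedding matrix could be built once with `np.vstack(movies_df['overview_embeddings'])`.

```python
import numpy as np
import pandas as pd

def top_n_by_cosine(query_vec, embedding_matrix, titles, n=3):
    """Return the n titles whose vectors are most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(embedding_matrix, dtype=float)

    # Scale everything to unit length so the dot product equals cosine similarity
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix_unit @ query_unit

    # Indices of the n highest scores, best first
    order = np.argsort(scores)[::-1][:n]
    return pd.DataFrame({"title": np.asarray(titles)[order],
                         "similarity": scores[order]})

# Toy "embeddings": one 2-D vector per title
toy_titles = ["Space Trip", "Jungle Quest", "Office Drama"]
toy_embeddings = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])

result = top_n_by_cosine([1.0, 0.1], toy_embeddings, toy_titles, n=2)
print(result)
```

The query vector points mostly along the first axis, so "Space Trip" ranks first and "Jungle Quest" second.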


* This cell takes a description of the kind of movie the user wants as input and uses the function above to list recommendations based on that description.

* You can use this code cell to get recommendations based on your own description.

* Remember that this is not a professional recommendation system; it was made as a beginner-friendly data science project. It can make mistakes and may recommend movies that are totally different from, or even the opposite of, what you described, so don't take it personally. It also uses an old dataset, so it does not include recent movies.

In [22]:
user_des = input("Give me a description of a movie you'd like to watch: ")
top_movies = embedding_recommender(user_des, movies_df, model, n_recommendations=10)
print(f"\nTop 10 movies similar to: '{user_des}'")
top_movies

Give me a description of a movie you'd like to watch: A Space travel movie

Top 10 movies similar to: 'A Space travel movie'


Unnamed: 0,title,similarity
3511,You Only Live Twice,0.557631
5057,Cube²: Hypercube,0.553914
6263,A Trip to the Moon,0.531262
6058,Babylon 5: Thirdspace,0.516667
2946,A Brief History of Time,0.511692
650,The Snow Walker,0.506029
7543,Mission to Mir,0.503633
6304,Earth Girls Are Easy,0.495405
278,Interstellar,0.494252
7497,Morons from Outer Space,0.489171


# Summary Report: Content-Based Movie Recommendation

This notebook demonstrates the process of building a content-based movie recommendation system using movie metadata and user ratings.

**Key Steps and Findings:**

1.  **Data Loading and Exploration:**
    *   The notebook loads two datasets: `movies_metadata.csv` and `ratings.csv`.
    *   Initial exploration shows the shape of the dataframes and identifies missing values in the `overview` column of the `movies_metadata` dataframe.
    *   Missing values in the `overview` column are replaced with empty strings.
    *   It's confirmed that both dataframes contain the same number of unique movie IDs, which is helpful for merging.

2.  **Simple Recommender (Popularity-Based):**
    *   A basic recommender function `simple_recommender` is created.
    *   This function recommends movies based on either `vote_average` or `vote_count`.
    *   Examples show the top movies based on these criteria.

3.  **Rating Similarity Recommender:**
    *   A function `movie_recommender` is defined to generate recommendations based on the similarity of user ratings.
    *   It uses a pivot table of user ratings and calculates cosine similarity between the rating vectors of different movies.
    *   An example demonstrates finding movies similar to 'Toy Story' based on this method.

4.  **Content-Based Recommender using Embeddings:**
    *   The `sentence-transformers` library is installed to generate text embeddings.
    *   A Sentence Transformer model ('all-MiniLM-L6-v2') is loaded.
    *   Embeddings are generated for the `overview` column of the `movies_metadata` dataframe. This process converts movie descriptions into numerical vectors.
    *   A function `embedding_recommender` is created. This function takes a user's description, generates an embedding for it, and calculates the cosine similarity between the user's embedding and the movie overview embeddings.
    *   It then recommends movies with the highest similarity scores to the user's description.
    *   Examples demonstrate generating recommendations based on user-provided descriptions, such as "An adventure movie" and "A Space travel movie".

**Conclusion:**

The notebook successfully implements both a rating-based and a content-based movie recommendation system. The content-based approach utilizes embeddings generated from movie overviews to find movies with similar descriptions, providing a powerful way to recommend movies based on their textual content. The notebook also includes basic data exploration and handling of missing values.

**Thank you for reviewing this notebook! Please provide any feedback or suggestions for improvement.**