<a href="https://colab.research.google.com/github/am2644/TMDB-5000-Movie-Dataset/blob/main/recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install -q kaggle

In [None]:
! mkdir -p ~/.kaggle


In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download -d tmdb/tmdb-movie-metadata

Downloading tmdb-movie-metadata.zip to /content
 79% 7.00M/8.89M [00:00<00:00, 25.0MB/s]
100% 8.89M/8.89M [00:00<00:00, 29.2MB/s]


In [None]:
! unzip tmdb-movie-metadata.zip

Archive:  tmdb-movie-metadata.zip
  inflating: tmdb_5000_credits.csv   
  inflating: tmdb_5000_movies.csv    


In [None]:
! pip install -q sentence-transformers

In [None]:
# Importing necessary libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
from sentence_transformers import SentenceTransformer  # For transforming sentences into embeddings
from sklearn.metrics.pairwise import cosine_similarity  # For computing cosine similarity between embeddings

In [None]:
# Reading the first CSV file into a pandas DataFrame
df1 = pd.read_csv('/content/tmdb_5000_credits.csv')

# Reading the second CSV file into another pandas DataFrame
df2 = pd.read_csv('/content/tmdb_5000_movies.csv')

In [None]:
# Renaming the 'movie_id' column to 'id' in the DataFrame df1
df1.rename(columns={'movie_id': 'id'}, inplace=True)

# Merging the two DataFrames (df1 and df2) on the 'id' column
df = df1.merge(df2, on='id')
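
The merge above is the reason later cells refer to a column called `title_x`. A minimal sketch with made-up toy frames (the values below are placeholders, not rows from the TMDB files): when both inputs carry a `title` column, pandas keeps both and appends its default `_x` and `_y` suffixes, and the default inner join drops ids that appear in only one frame.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values, for illustration only)
credits = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]})
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"], "overview": ["x", "y", "z"]})

# Same two steps as the notebook: align the key name, then merge on it
credits = credits.rename(columns={"movie_id": "id"})
merged = credits.merge(movies, on="id")  # how="inner" is the default

# Both 'title' columns survive, renamed with the default suffixes
print(merged.columns.tolist())  # ['id', 'title_x', 'cast', 'title_y', 'overview']
print(len(merged))              # 2, because id 3 has no match in credits
```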

In [None]:
# Initializing the model with 'bert-base-nli-mean-tokens' pre-trained model
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
# Manually setting the 'overview' text for three specific rows (selected by index label),
# presumably rows whose overview was missing and could not be encoded as-is
df.loc[2656, 'overview'] = 'A biopic of the rise of father Jorge Mario Bergoglio SJ from a teacher in a Jesuit High School in Argentina to archbishop and cardinal of Buenos Aires to Pope of the Roman Catholic Church. The story touches on his relation with his fellow Jesuits in Argentina and Europe, to his relation with laureate writer Jorge Luis Borges, Argentine dictator Jorge Rafael Videla, and archbishops Laghi (nuncio to Argentina) and Quarracino (cardinal of Buenos Aires), up to the moment where he is elected Pope in 2013.'
df.loc[4140, 'overview'] = 'The life of Frank Sinatra, as an actor and singer and the steps along the way that led him to become such an icon.'
df.loc[4431, 'overview'] = "There is so much interest in food these days yet there is almost no interest in the hands that pick that food. In the US, farm labor has always been one of the most difficult and poorly paid jobs and has relied on some of the nation's most vulnerable people. While the legal restrictions which kept people bound to farms, like slavery, have been abolished, exploitation still exists, ranging from wage theft to modern-day slavery. These days, this exploitation is perpetuated by the corporations at the top of the food chain: supermarkets. Their buying power has kept wages pitifully low and has created a scenario where desperately poor people are willing to put up with anything to keep their jobs."
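
The row labels patched above are hard-coded. One way such rows could be located is sketched below on a toy frame; the column names match the notebook, but the data and the `is_blank` name are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with one missing and one whitespace-only overview (hypothetical data)
toy = pd.DataFrame({
    "title_x": ["A", "B", "C"],
    "overview": ["A hero saves the city.", None, "   "],
})

# Treat NaN and whitespace-only strings alike as "no overview"
is_blank = toy["overview"].fillna("").str.strip().eq("")
missing_idx = toy.index[is_blank].tolist()

print(missing_idx)  # [1, 2]
```

Running the same two lines against `df` should list the index labels that need a manual overview before encoding.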


In [None]:
# Defining a function to vectorize text using the provided SentenceTransformer model
def vectorize_text(text):
    # Encoding the input text into a vector representation using the pre-trained model
    vector = model.encode(text)
    # Returning the vector representation of the input text
    return vector


In [None]:
# Applying the vectorize_text function to each element in the 'overview' column of the DataFrame df
# and creating a new column 'vectorized_column' to store the resulting vectors
df['vectorized_column'] = df['overview'].apply(vectorize_text)
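
Calling the model once per row works, but it can be slow for a few thousand overviews. A possible alternative is to encode the whole column in one call. The sketch below uses a hypothetical helper, `add_embeddings`, and a stand-in `fake_encode` so that it runs without downloading a model; it assumes the encoder accepts a list of strings and returns a 2D array, which is how `model.encode` behaves by default.

```python
import numpy as np
import pandas as pd

def add_embeddings(frame, encode, text_col="overview", out_col="vectorized_column"):
    """Encode a whole text column in one call and store one vector per row."""
    # Replace missing values with empty strings so the encoder only sees str
    texts = frame[text_col].fillna("").astype(str).tolist()
    vectors = np.asarray(encode(texts))  # expected shape: (n_rows, dim)
    out = frame.copy()
    out[out_col] = list(vectors)  # one 1D array per row
    return out

def fake_encode(texts):
    # Stand-in encoder for illustration only: [character count, space count]
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=float)

toy = pd.DataFrame({"overview": ["a hero saves the city", None, "robots invade"]})
toy = add_embeddings(toy, fake_encode)
print(toy["vectorized_column"].iloc[0])
```

In the notebook this would presumably be called as `df = add_embeddings(df, model.encode)`.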

In [None]:
def find_similar_movies(text, top_n=3):
    # Encode the free-text description into an embedding vector
    input_vector = model.encode(text)

    # Reshape to (1, dim), the 2D shape cosine_similarity expects
    input_vector = input_vector.reshape(1, -1)

    # Cosine similarity between the query and every stored overview embedding
    similarities = cosine_similarity(input_vector, df['vectorized_column'].tolist())

    # Positions of the top_n highest scores, best match first.
    # The query is free text rather than a row of df, so no result is skipped.
    top_indices = similarities.argsort()[0][-top_n:][::-1]

    # argsort returns positions, so select rows with iloc rather than loc
    similar_movies = df.iloc[top_indices][['title_x', 'overview']]

    return similar_movies.values.tolist()

# Example usage: the input is a free-text description, not a movie title
description = input("Describe the movie you like: ")
similar_movies = find_similar_movies(description)
if similar_movies:
    print(f"Top {len(similar_movies)} similar movies:")
    for idx, (title, overview) in enumerate(similar_movies, start=1):
        print(f"{idx}. {title} - Overview: {overview}")


Describe the movie you like: a superhero that fight against a villain
Top 3 similar movies:
1. Mystery Men - Overview: When Captain Amazing (Kinnear) is kidnapped by Casanova Frankenstein (Rush) a group of superheroes combine together to create a plan. But these aren't normal superheroes. Now, the group who include such heroes as Mr. Furious (Stiller), The Shoveller (Macy) and The Blue Raja (Azaria) must put all the powers together to save everyone they know and love.
2. The Matrix Revolutions - Overview: The human city of Zion defends itself against the massive invasion of the machines as Neo fights to end the war at another front while also opposing the rogue Agent Smith.
3. Superman IV: The Quest for Peace - Overview: With global superpowers engaged in an increasingly hostile arms race, Superman leads a crusade to rid the world of nuclear weapons. But Lex Luthor, recently sprung from jail, is declaring war on the Man of Steel and his quest to save the planet. Using a strand of Supe
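
For reference, the ranking step can be reproduced with NumPy alone. This is a minimal sketch, not the notebook's implementation: `top_n_by_cosine` is a hypothetical helper, the vectors are toy 2D values, and it assumes no vector has zero length. Cosine similarity is the dot product of length-normalized vectors, and the best matches are the largest scores.

```python
import numpy as np

def top_n_by_cosine(query, matrix, top_n=3):
    """Return (row positions, scores) of the top_n rows most similar to query."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    # Sort descending and keep the first top_n positions
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]

# Toy "embeddings": four rows in two dimensions
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, scores = top_n_by_cosine([1.0, 0.1], vectors, top_n=2)
print(idx.tolist())  # [0, 2]
```

Because the query here is not one of the rows, the highest-scoring row is kept rather than skipped.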