# Movie Recommendation System

This notebook builds a movie recommendation system using various movie attributes like keywords, genres, overviews, titles, spoken languages, and production countries.

## 1. Importing Libraries

First, we import the necessary libraries. `numpy` and `pandas` are for data manipulation. `TfidfVectorizer` and `CountVectorizer` from `sklearn.feature_extraction.text` are used for text feature extraction. `cosine_similarity` from `sklearn.metrics.pairwise` calculates the similarity between movies. `nltk` is for natural language processing tasks like removing stopwords and lemmatization. `literal_eval` from `ast` helps in safely evaluating string literals. `process` from `fuzzywuzzy` is used for finding the closest movie title match.

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from ast import literal_eval
from fuzzywuzzy import process

## 2. Downloading the Dataset

We download the movie dataset from a Google Drive link using the `gdown` library. The dataset is saved as `file.csv`.

In [None]:
import gdown
url = "https://drive.google.com/uc?id=15qdSFASWLhg9W8kBLQbxYBwu-OI2lica"
output = "file.csv"
gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=15qdSFASWLhg9W8kBLQbxYBwu-OI2lica
From (redirected): https://drive.google.com/uc?id=15qdSFASWLhg9W8kBLQbxYBwu-OI2lica&confirm=t&uuid=a74189e3-ce53-404a-ab2b-c83b902da105
To: /content/file.csv
100%|██████████| 556M/556M [00:09<00:00, 57.2MB/s]


'file.csv'

## 3. Loading and Exploring the Data

We load the downloaded CSV file into a pandas DataFrame and display the first few rows to understand the data structure. We also check the shape of the DataFrame (number of rows and columns) and list the column names.

In [None]:
df = pd.read_csv("file.csv")
df.head(3)

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."


In [None]:
print(f"DataFrame Shape: {df.shape}")
print("Columns:", df.columns.tolist())

DataFrame Shape: (1229959, 24)
Columns: ['id', 'title', 'vote_average', 'vote_count', 'status', 'release_date', 'revenue', 'runtime', 'adult', 'backdrop_path', 'budget', 'homepage', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'tagline', 'genres', 'production_companies', 'production_countries', 'spoken_languages', 'keywords']


## 4. Handling Missing Values

We check for missing values in the DataFrame using `df.isnull().sum()`. Then, we fill the missing values in the selected feature columns with empty strings to avoid errors during processing.

In [None]:
df.isnull().sum()

Unnamed: 0,0
id,0
title,13
vote_average,0
vote_count,0
status,0
release_date,228009
revenue,0
runtime,0
adult,0
backdrop_path,911015


In [None]:
features = ['keywords', 'genres', 'overview', 'title', 'spoken_languages', 'production_countries']
df[features] = df[features].fillna('')

## 5. Handling Duplicate Rows

We check for duplicate rows in the DataFrame using `df.duplicated().sum()`. If duplicates exist, we remove them using `df.drop_duplicates(inplace=True)`.

In [None]:
df.duplicated().sum()

np.int64(374)

In [None]:
df.drop_duplicates(inplace=True)

## 6. Downloading NLTK Resources

We download necessary resources from the NLTK library: 'stopwords' (common words like 'the', 'a', 'is' that are usually removed during text processing) and 'wordnet' (a lexical database used for lemmatization).

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 7. Defining Text Cleaning Functions

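We define two helper functions for text cleaning. They are not applied in the later cells, where the vectorizers rely on their own `stop_words='english'` setting instead.
- `remove_stopwords`: Takes text as input and removes English stopwords.
- `lemmatize_text`: Takes text as input and reduces each word to its dictionary form. `WordNetLemmatizer.lemmatize` treats words as nouns by default, so 'movies' becomes 'movie', while verb forms such as 'running' are left unchanged unless `pos='v'` is passed.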

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
def remove_stopwords(text):
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

def lemmatize_text(text):
    words = [lemmatizer.lemmatize(word) for word in text.split()]
    return " ".join(words)
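
To see what the whitespace-based filtering in `remove_stopwords` does and does not catch, here is a minimal sketch. It assumes a tiny hand-written stopword set (`STOP_WORDS`, a stand-in for the NLTK list) so that it runs without any downloads.

```python
# Tiny stand-in for NLTK's English stopword list (an assumption for this demo).
STOP_WORDS = {"a", "the", "is", "of", "who"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Split on whitespace and compare case-insensitively, as in the notebook.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

cleaned = remove_stopwords("The adventures of a group of explorers")
print(cleaned)

# Punctuation stays attached to the word, so "The," is not recognized.
with_punctuation = remove_stopwords("The, end")
print(with_punctuation)
```

Because the text is split on whitespace only, stripping punctuation first would be needed for stopwords that sit next to commas or periods.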

## 8. Defining Jaccard Similarity

The Jaccard similarity is a statistic used for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets. This is useful for comparing lists of items like genres or keywords.

In [None]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0
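
As a worked example, the genres shown earlier for Inception and Interstellar share two of four distinct values. The sketch below repeats the function so it runs on its own; the lowercase, space-free spellings are an assumption that mirrors the cleaning applied later in the notebook.

```python
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

# Genres taken from the first two rows of the dataset preview.
inception = {"action", "sciencefiction", "adventure"}
interstellar = {"adventure", "drama", "sciencefiction"}

score = jaccard_similarity(inception, interstellar)
print(score)  # 2 shared genres / 4 distinct genres = 0.5

# Two empty sets have an empty union, which the guard maps to 0.
print(jaccard_similarity(set(), set()))  # 0
```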

## 9. Processing List-like Features

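Features such as 'keywords' and 'genres' are stored as strings rather than lists. We use `safe_literal_eval` to convert them into Python lists: it first tries `literal_eval`, which handles values written as list literals, and if that fails it splits the string on commas. In the rows previewed above the values are comma-separated (for example "Action, Science Fiction, Adventure"), so the comma split is the path these rows take.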

In [None]:
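def safe_literal_eval(value):
    if isinstance(value, str):
        try:
            parsed = literal_eval(value)
            if isinstance(parsed, list):
                return parsed
        except (ValueError, SyntaxError):
            pass
        # Not a list literal (e.g. "Action, Drama" or "2001"): split on commas.
        return value.split(',') if value else []
    return value if isinstance(value, list) else []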

features = ['keywords', 'genres', 'spoken_languages', 'production_countries']
for feature in features:
    df[feature] = df[feature].apply(safe_literal_eval)
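
The two parsing paths can be seen on a few sample strings. This is a minimal sketch using a hypothetical helper, `parse_list_field`, which is not part of the notebook; unlike `safe_literal_eval` it also strips the whitespace left around each item by the comma split.

```python
from ast import literal_eval

def parse_list_field(value):
    """Hypothetical helper: turn a string field into a list of strings."""
    try:
        parsed = literal_eval(value)
    except (ValueError, SyntaxError):
        # Plain text such as "Action, Drama" is not a Python literal.
        return [item.strip() for item in value.split(",") if item.strip()]
    # A real list or tuple literal is used directly; any other literal is wrapped.
    return list(parsed) if isinstance(parsed, (list, tuple)) else [str(parsed)]

print(parse_list_field("['dream', 'paris']"))                   # list literal
print(parse_list_field("Action, Science Fiction, Adventure"))  # comma fallback
print(parse_list_field("English"))                              # single value
print(parse_list_field(""))                                     # empty string
```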

## 10. Cleaning and Preparing Features

We define a `clean_data` function to process the list-like features. It converts all items in the list to lowercase and removes spaces. This helps in standardizing the data for similarity calculations. We apply this function to the relevant features.

In [None]:
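def clean_data(x):
    # Lowercase each item and drop spaces so a multi-word value such as
    # "Science Fiction" survives as the single token "sciencefiction".
    if isinstance(x, list):
        return [str(i).replace(" ", "").lower() for i in x]
    return []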

features = ['keywords', 'genres', 'spoken_languages', 'production_countries']
for feature in features:
    df[feature] = df[feature].apply(clean_data)
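
The reason for removing spaces becomes clear once the items are joined into one string for a vectorizer: each multi-word keyword then counts as a single token. A minimal sketch, using a hypothetical `normalize_tokens` helper and keywords from the dataset preview:

```python
def normalize_tokens(items):
    """Hypothetical helper mirroring clean_data for a single list."""
    return [str(item).replace(" ", "").lower() for item in items]

keywords = [" race against time", "secret identity", "Dream"]
tokens = normalize_tokens(keywords)
print(tokens)

# Joined for a vectorizer, the three keywords remain three tokens.
joined = " ".join(tokens)
print(len(joined.split()))

# Without the cleaning step, whitespace splitting yields six separate words.
naive = " ".join(keywords).split()
print(len(naive))
```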

## 11. Feature Extraction and Similarity Calculation

We limit the DataFrame to the first 3000 rows to manage computation time. We use `TfidfVectorizer` to convert the 'overview' text into a matrix of TF-IDF features. TF-IDF (Term Frequency-Inverse Document Frequency) reflects how important a word is to a document in a collection.

We use `CountVectorizer` to convert the 'keywords' into a matrix of token counts. This counts the occurrences of each keyword.

Finally, we calculate the cosine similarity between movies based on their 'overview' and 'keywords' matrices. Cosine similarity measures the cosine of the angle between two vectors, indicating how similar they are.

In [None]:
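# Keep the first 3000 rows and reset the index so that row labels match the
# positional indices of the similarity matrices built below.
df = df[:3000].reset_index(drop=True)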

tfidf_overview = TfidfVectorizer(stop_words='english')
tfidf_overview_matrix = tfidf_overview.fit_transform(df['overview'])

keywords_text = [' '.join(kw) if isinstance(kw, list) else kw for kw in df['keywords']]
vector = CountVectorizer(stop_words='english')
vector_keywords_matrix = vector.fit_transform(keywords_text)

cosine_sim_overview = cosine_similarity(tfidf_overview_matrix)
cosine_sim_keywords = cosine_similarity(vector_keywords_matrix)
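
To make the two steps concrete, the sketch below computes TF-IDF weights and cosine similarity by hand for three short, made-up overviews. It is an illustration under simplifying assumptions, not scikit-learn's implementation: tokens are split on whitespace, no stopwords are removed, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up overviews for illustration only.
overviews = [
    "thief steals secrets through dreams",
    "explorers travel through a wormhole",
    "thief enters dreams to plant an idea",
]
vectors = tfidf_vectors(overviews)
sim_dreams = cosine(vectors[0], vectors[2])  # share "thief" and "dreams"
sim_space = cosine(vectors[0], vectors[1])   # share only "through"
print(sim_dreams > sim_space)
```

The first and third overviews share two terms and so score higher than the first and second, which share one.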

## 12. Extracting Movie Information

We extract lists of movie titles, genres, countries, and languages from the processed DataFrame. These lists will be used in the recommendation function.

In [None]:
movie_titles = df['title'].tolist()
movie_genres = df['genres'].tolist()
movie_countries = df['production_countries'].tolist()
movie_languages = df['spoken_languages'].tolist()

## 13. Defining Combined Similarity Function

This function calculates a combined similarity score between two movies based on a weighted sum of different similarity measures:
- Cosine similarity of overviews
- Cosine similarity of keywords
- Jaccard similarity of genres
- Jaccard similarity of production countries
- Jaccard similarity of spoken languages

The weights determine the importance of each feature in the overall similarity score.

In [None]:
def combined_similarity(idx1, idx2):
    weight_overview = 0.3
    weight_keywords = 0.3
    weight_genres = 0.2
    weight_countries = 0.1
    weight_languages = 0.1

    genre_sim = jaccard_similarity(set(movie_genres[idx1]), set(movie_genres[idx2]))
    country_sim = jaccard_similarity(set(movie_countries[idx1]), set(movie_countries[idx2]))
    language_sim = jaccard_similarity(set(movie_languages[idx1]), set(movie_languages[idx2]))

    return (
        weight_overview * cosine_sim_overview[idx1, idx2] +
        weight_keywords * cosine_sim_keywords[idx1, idx2] +
        weight_genres * genre_sim +
        weight_countries * country_sim +
        weight_languages * language_sim
    )
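
The weighted sum can be checked by hand. The sketch below uses the same weights with made-up component scores; `weighted_score` and the example values are hypothetical and serve only to show the arithmetic.

```python
# Same weights as combined_similarity; they sum to 1.
WEIGHTS = {
    "overview": 0.3,
    "keywords": 0.3,
    "genres": 0.2,
    "countries": 0.1,
    "languages": 0.1,
}

def weighted_score(components, weights=WEIGHTS):
    """Hypothetical helper: weighted sum of per-feature similarities."""
    return sum(weights[name] * components[name] for name in weights)

# Made-up component similarities for one pair of movies.
example = {
    "overview": 0.2,
    "keywords": 0.5,
    "genres": 0.5,
    "countries": 1.0,
    "languages": 1.0,
}
score = weighted_score(example)
# 0.3*0.2 + 0.3*0.5 + 0.2*0.5 + 0.1*1.0 + 0.1*1.0
print(round(score, 3))  # 0.51
```

Because every component lies between 0 and 1 and the weights sum to 1, the combined score also stays between 0 and 1.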

## 14. Getting Movie Recommendations

This is the main function to get movie recommendations.
- It takes a `movie_title` and the number of recommendations (`top_n`) as input, and returns an error message if the title is not a string.
- It uses fuzzy matching (`fuzzywuzzy.process.extractOne`) to find the closest matching movie title in the dataset, even if there are typos.
- It compares the match score (0 to 100) against a threshold of 80. Below the threshold it prints a "not found" message, but in both cases it continues with the closest match.
- It finds the index of the matched movie in the DataFrame.
- It calculates the `combined_similarity` between the matched movie and every other movie.
- It sorts the movies by similarity score in descending order.
- Finally, it prints the top `top_n` recommended movies along with their similarity scores.
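
The title-matching step can be sketched with the standard library alone. The example below uses `difflib.get_close_matches` as a stand-in for `fuzzywuzzy`, so the scoring differs from the notebook's: `difflib` ratios run from 0 to 1, whereas the threshold of 80 above is on a 0 to 100 scale. The helper name `closest_title` and the `cutoff` value are assumptions.

```python
from difflib import get_close_matches

def closest_title(query, titles, cutoff=0.6):
    """Return the best-matching title, or None if nothing clears the cutoff."""
    # Compare case-insensitively, then map the hit back to its original spelling.
    lowered = {title.lower(): title for title in titles}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

titles = ["The Terminator", "Tenet", "Hotel Artemis"]
print(closest_title("terminator", titles))
print(closest_title("zzzz", titles))
```

Returning `None` for a weak match lets the caller decide whether to stop or to fall back to the nearest title.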

In [None]:
def get_recommendations(movie_title, top_n=10):
    if not isinstance(movie_title, str):
        return "Error: Movie title must be a string."

    # Fuzzy-match the query against the titles of the rows kept in df.
    movie_titles = df['title'].tolist()
    match, score = process.extractOne(movie_title, movie_titles)

    similarity_threshold = 80
    if score < similarity_threshold:
        # Weak match: warn, but still recommend from the closest title.
        print(f"Error: '{movie_title}' not found in the dataset. Did you mean '{match}'?")
    else:
        print(f"Did you mean {match}?, Here")
    movie_title_for_recommendations = match

    # Use the position in the title list rather than the DataFrame label:
    # the similarity matrices and df.iloc are positional, while labels can
    # have gaps after drop_duplicates.
    movie_index = movie_titles.index(movie_title_for_recommendations)

    similarities = [(i, combined_similarity(movie_index, i)) for i in range(len(df)) if i != movie_index]
    similarities.sort(key=lambda x: x[1], reverse=True)

    print(f"\nTop {top_n} recommendations for '{movie_title_for_recommendations}':")
    for i, (movie_idx, sim) in enumerate(similarities[:top_n], 1):
        print(f"{i}. {df.iloc[movie_idx]['title']} (Similarity: {sim:.3f})")
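
When only the top few results are needed, sorting the full list is not strictly required. A minimal sketch of the ranking step using `heapq.nlargest`, with four scores copied from the notebook's output for 'The Terminator'; the helper name `top_matches` is an assumption.

```python
import heapq

# Scores copied from the notebook's output for 'The Terminator'.
scores = {
    "Tenet": 0.361,
    "Terminator 2: Judgment Day": 0.571,
    "Hotel Artemis": 0.429,
    "Terminator Salvation": 0.475,
}

def top_matches(scores, n):
    """Return the n highest-scoring (title, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

for rank, (title, score) in enumerate(top_matches(scores, 2), 1):
    print(f"{rank}. {title} (Similarity: {score:.3f})")
```

For 3000 movies a full sort is cheap, so this is a design alternative rather than a necessary change.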

## 15. Getting Recommendations for "Terminator"

Finally, we call the `get_recommendations` function with the movie title "Terminator" to get recommendations.

In [None]:
get_recommendations("Terminator")

Did you mean The Terminator?, Here

Top 10 recommendations for 'The Terminator':
1. Terminator 2: Judgment Day (Similarity: 0.571)
2. Terminator 3: Rise of the Machines (Similarity: 0.504)
3. Terminator Salvation (Similarity: 0.475)
4. Terminator: Dark Fate (Similarity: 0.438)
5. Hotel Artemis (Similarity: 0.429)
6. Terminator Genisys (Similarity: 0.388)
7. Déjà Vu (Similarity: 0.368)
8. Tenet (Similarity: 0.361)
9. The One (Similarity: 0.358)
10. The Book of Eli (Similarity: 0.352)
