## MOVIE RECOMMENDATION

### Project Introduction
The aim of this project is to build a content-based movie recommender system that suggests movies similar to a given movie. The system represents each movie by its metadata (genres, overview, and keywords) and recommends the titles whose representations are closest to the query movie's.

### Project Outline

**Data Collection and Preprocessing**

1. Datasets: Utilized Movies metadata, Keywords, and Credits datasets.
2. Merging: Combined datasets on the 'id' column.
3. Cleaning: Cleaned the 'id' column, dropped duplicates, and handled missing values.
4. Feature Selection: Selected relevant columns: title, overview, genres, cast, crew, keywords.

**Data Cleaning and Normalization**

1. Text Normalization: Converted text data to lowercase.
2. Genres and Keywords Conversion: Transformed genres and keywords into lists of names.
3. Feature Combination: Combined genres, overview, and keywords into a single text feature.

**Word2Vec Model Training**

1. Tokenization: Tokenized the combined features of each movie.
2. Model Training: Trained a Word2Vec model on the tokenized corpus to learn word embeddings.

**Feature Extraction**

1. Description Vectors: Computed the average Word2Vec vector for each movie's description.

**Similarity Computation**

1. Cosine Similarity: Calculated cosine similarity between movie embeddings to identify similar movies.

**Recommendation Function**

1. Top Recommendations: Created a function to fetch the top 10 most similar movie recommendations for a given movie.

In [1]:
# dependencies (only the libraries used in this notebook)
import pandas as pd
import numpy as np

In [2]:
# load datasets movies metadata,keywords,credits
movies=pd.read_csv(r"C:\Users\Hp\Downloads\movies_metadata.csv")
keywords=pd.read_csv(r"C:\Users\Hp\Downloads\keywords.csv")
credits=pd.read_csv(r"C:\Users\Hp\Downloads\credits.csv")


In [3]:
# clean the movie 'id' column: convert values to int, mark unparseable ones as NaN
def clean_movie_id(col):
    try:
        return int(col)
    except (ValueError, TypeError):
        return np.nan

movies['id']=movies['id'].apply(clean_movie_id)
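
As an alternative to the row-wise helper above, pandas can coerce a whole column in one call. The sketch below is only an illustration: `raw_ids` is a made-up Series, not data from the notebook, and it assumes malformed ids are non-numeric strings.

```python
import pandas as pd

# Hypothetical sample: two valid ids and one malformed, non-numeric value.
raw_ids = pd.Series(["862", "not-an-id", "8844"])

# errors="coerce" turns values that cannot be parsed as numbers into NaN.
clean_ids = pd.to_numeric(raw_ids, errors="coerce")

print(clean_ids.tolist())
```

Unlike `int()`, `pd.to_numeric` also accepts decimal strings such as "1.5", so the two approaches are not strictly equivalent.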

In [4]:
# combining the 3 dataframes
movies_keywords=pd.merge(movies,keywords, on='id')
df=pd.merge(movies_keywords,credits,on='id')
df.head()

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | spoken_languages | status | tagline | title | video | vote_average | vote_count | keywords | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862.0 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | [{'iso_639_1': 'en', 'name': 'English'}] | Released | | Toy Story | False | 7.7 | 5415.0 | [{'id': 931, 'name': 'jealousy'}, {'id': 4290,... | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... |
| 1 | False | | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | | 8844.0 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | [{'id': 10090, 'name': 'board game'}, {'id': 1... | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | | 15602.0 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 | [{'id': 1495, 'name': 'fishing'}, {'id': 12392... | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... |
| 3 | False | | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | | 31357.0 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 | [{'id': 818, 'name': 'based on novel'}, {'id':... | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... |
| 4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | | 11862.0 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 | [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n... | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... |
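
`pd.merge` performs an inner join by default, so a repeated 'id' on either side yields repeated rows in the result. This is one way fully duplicated rows can arise after merging; whether it explains the duplicates in this dataset is an assumption. A minimal sketch with made-up frames (`left`, `right`, and their values are hypothetical):

```python
import pandas as pd

# Hypothetical frames; id 2 appears twice on the right-hand side.
left = pd.DataFrame({"id": [1, 2, 3], "title": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 2], "keywords": ["k1", "k2", "k2"]})

# Default how="inner": id 3 has no match and is dropped; id 2 matches twice.
merged = pd.merge(left, right, on="id")

n_dupes = int(merged.duplicated().sum())  # identical rows after the first occurrence
deduped = merged.drop_duplicates()
print(len(merged), n_dupes, len(deduped))  # 3 1 2
```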


## Data Inspection

In [5]:
# shape
df.shape

(46628, 27)

In [6]:
# check for duplicates
df.duplicated().sum()

1166

In [7]:
# drop duplicate
df.drop_duplicates(inplace=True)

In [8]:
# check for null values
df.isnull().sum()

adult                        0
belongs_to_collection    40969
budget                       0
genres                       0
homepage                 37685
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   3
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      3
runtime                    260
spoken_languages             3
status                      84
tagline                  25050
title                        3
video                        3
vote_average                 3
vote_count                   3
keywords                     0
cast                         0
crew                         0
dtype: int64

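**For content-based filtering, we build the features from the following columns of the movie dataset:**

1. overview
2. genres
3. keywords

The 'title' column is kept to identify each movie. The 'cast' and 'crew' columns are also selected below but are not used in the combined feature.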


In [9]:
df=df[['title','overview','genres','cast','crew','keywords']]
df.head()

| | title | overview | genres | cast | crew | keywords |
|---|---|---|---|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | [{'id': 931, 'name': 'jealousy'}, {'id': 4290,... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | [{'id': 10090, 'name': 'board game'}, {'id': 1... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | [{'id': 1495, 'name': 'fishing'}, {'id': 12392... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | [{'id': 818, 'name': 'based on novel'}, {'id':... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... | [{'id': 35, 'name': 'Comedy'}] | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n... |


In [10]:
# check column types in our new df
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45462 entries, 0 to 46627
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     45459 non-null  object
 1   overview  44508 non-null  object
 2   genres    45462 non-null  object
 3   cast      45462 non-null  object
 4   crew      45462 non-null  object
 5   keywords  45462 non-null  object
dtypes: object(6)
memory usage: 2.4+ MB


## Data Cleaning and Preprocessing

### Cleaning Genres column

In [11]:
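import ast

# parse the stringified list of genre dicts and keep the lowercase genre names
# (ast.literal_eval only evaluates literals, unlike eval)
def clean_genre(col):
    col=ast.literal_eval(col)
    return [dicts['name'].lower() for dicts in col]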

In [12]:
df['genres']=df['genres'].apply(clean_genre)
df['genres'].head()

0     [animation, comedy, family]
1    [adventure, fantasy, family]
2               [romance, comedy]
3        [comedy, drama, romance]
4                        [comedy]
Name: genres, dtype: object
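
The raw 'genres' column stores each list of dicts as a string, so it has to be parsed before the names can be extracted. The sketch below shows one safe way to do that with the standard library's `ast.literal_eval`; the `raw` string is a made-up value in the same shape as the column, not a row copied from the dataset.

```python
import ast

# Hypothetical cell value in the same shape as the raw `genres` column.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval accepts only Python literals (lists, dicts, strings, numbers, ...).
parsed = ast.literal_eval(raw)
names = [entry["name"].lower() for entry in parsed]
print(names)  # ['animation', 'comedy']

# Anything that is not a literal, such as a function call, is rejected
# instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)  # True
```

This matters when the CSV comes from an untrusted source, since `eval` would run whatever expression the cell contains.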

### Cleaning Keywords column

In [13]:
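import ast

# parse the stringified list of keyword dicts and keep the lowercase keyword names
# (ast.literal_eval only evaluates literals, unlike eval)
def clean_keyword(col):
    col=ast.literal_eval(col)
    return [dicts['name'].lower() for dicts in col]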

In [14]:
df['keywords']=df['keywords'].apply(clean_keyword)
df['keywords'].head()

0    [jealousy, toy, boy, friendship, friends, riva...
1    [board game, disappearance, based on children'...
2    [fishing, best friend, duringcreditsstinger, o...
3    [based on novel, interracial relationship, sin...
4    [baby, midlife crisis, confidence, aging, daug...
Name: keywords, dtype: object

In [15]:
df['title']=df['title'].str.lower()
df['overview']=df['overview'].str.lower()
df.head()

| | title | overview | genres | cast | crew | keywords |
|---|---|---|---|---|---|---|
| 0 | toy story | led by woody, andy's toys live happily in his ... | [animation, comedy, family] | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | [jealousy, toy, boy, friendship, friends, riva... |
| 1 | jumanji | when siblings judy and peter discover an encha... | [adventure, fantasy, family] | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | [board game, disappearance, based on children'... |
| 2 | grumpier old men | a family wedding reignites the ancient feud be... | [romance, comedy] | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | [fishing, best friend, duringcreditsstinger, o... |
| 3 | waiting to exhale | cheated on, mistreated and stepped on, the wom... | [comedy, drama, romance] | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | [based on novel, interracial relationship, sin... |
| 4 | father of the bride part ii | just when george banks has recovered from his ... | [comedy] | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | [baby, midlife crisis, confidence, aging, daug... |


In [16]:
# split each overview into a list of words
df['overview']=df['overview'].str.split()

In [17]:
# drop null values
df=df.dropna()

In [18]:
# concatenate the three token lists into one list per movie
df['combined_feature']=df['genres']+df['overview']+df['keywords']

In [19]:
# create new df with only title and combined_feature
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
movies=df[['title','combined_feature']].copy()
movies["combined_feature"]=movies["combined_feature"].apply(" ".join)
movies.head()

| | title | combined_feature |
|---|---|---|
| 0 | toy story | animation comedy family led by woody, andy's t... |
| 1 | jumanji | adventure fantasy family when siblings judy an... |
| 2 | grumpier old men | romance comedy a family wedding reignites the ... |
| 3 | waiting to exhale | comedy drama romance cheated on, mistreated an... |
| 4 | father of the bride part ii | comedy just when george banks has recovered fr... |
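
Because the combined feature is joined with spaces and later split on whitespace, multi-word keywords become separate tokens and punctuation stays attached to words. The sketch below illustrates both effects on a made-up row (all values are hypothetical). The final punctuation-stripping step is an optional idea, not something the notebook does.

```python
import string

# Hypothetical row in the same shape as the notebook's list columns.
genres = ["adventure", "fantasy"]
overview = "when siblings judy and peter discover a game,".split()
keywords = ["board game", "disappearance"]

combined = " ".join(genres + overview + keywords)
tokens = combined.split()

print(tokens[-3:])        # ['board', 'game', 'disappearance']
print("game," in tokens)  # True: the comma stays attached to the word

# Optional (assumption): strip leading/trailing punctuation so that
# "game," and "game" map to the same token.
stripped = [t.strip(string.punctuation) for t in tokens]
stripped = [t for t in stripped if t]
print(stripped.count("game"))  # 2
```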


## Word2Vec Model Training

In [20]:
from gensim.models import Word2Vec

# Tokenize corpus: one list of whitespace-separated tokens per movie
tokenized_corpus = [doc.split() for doc in movies['combined_feature']]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=10, min_count=1, workers=4)

## Feature Extraction 

For each movie, compute the average Word2Vec vector of its description.

In [21]:
movies.set_index('title',inplace=True)

In [22]:
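def get_description_vector(title):
    description = movies.loc[title, 'combined_feature']
    # a title shared by several movies returns a Series; only the first match is used
    if isinstance(description, pd.Series):
        description = description.iloc[0]
    words = description.split()
    # keep only the words present in the Word2Vec vocabulary
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    # no known words: fall back to a zero vector
    return np.zeros(model.vector_size)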
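
The averaging step can be tried in isolation without a trained model. In the sketch below, `toy_vectors` is a hypothetical dictionary of 3-dimensional vectors standing in for `model.wv`, and `average_vector` is an illustrative helper rather than part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for model.wv.
toy_vectors = {
    "space": np.array([1.0, 0.0, 0.0]),
    "war": np.array([0.0, 1.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(average_vector(["space", "war", "unknown"], toy_vectors, 3).tolist())  # [0.5, 0.5, 0.0]
print(average_vector(["unknown"], toy_vectors, 3).tolist())                  # [0.0, 0.0, 0.0]
```

Averaging ignores word order, so two descriptions containing the same words in a different order receive the same vector.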

In [23]:
#embeddings vector
embeddings=[get_description_vector(title) for title in movies.index]

In [24]:
#create an embeddings df
df=pd.DataFrame({
    'title': movies.index,
    'embedding':embeddings
})

df.set_index('title',inplace=True)

## Similarity Computation

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

embedding_matrix=np.vstack(df['embedding'].values)

In [26]:
cosine_sim=cosine_similarity(embedding_matrix)
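
Cosine similarity is the dot product of two vectors after each is scaled to unit length. The full pairwise matrix has one entry per pair of movies, so its size grows with the square of the number of movies. Where memory is a concern, one option is to compute a single row on demand, as in this sketch. The function name `cosine_row` and the `toy` matrix are hypothetical; this is not the notebook's method.

```python
import numpy as np

def cosine_row(matrix, idx):
    """Cosine similarity between row `idx` and every row of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = matrix / norms[:, None]
    return unit @ unit[idx]

# Hypothetical embeddings: rows 0 and 1 point the same way, row 2 is
# orthogonal to them, and row 3 is an all-zero vector.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
print(cosine_row(toy, 0).tolist())  # [1.0, 1.0, 0.0, 0.0]
```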

In [27]:
def get_recommendations(title):
    #get the index of the movie that matches the title
    idx=df.index.get_loc(title)
    #get_loc returns a slice or boolean mask for a non-unique title; use the first match
    if not isinstance(idx,(int,np.integer)):
        idx=np.arange(len(df))[idx][0]
    #get the pairwise similarity scores of all movies
    sim_scores=list(enumerate(cosine_sim[idx]))
    #sort the movies based on the similarity scores
    sim_scores=sorted(sim_scores,key=lambda x: x[1],reverse=True)
    #get the scores of the 10 most similar movies, skipping the first entry (normally the movie itself)
    sim_scores=sim_scores[1:11]
    # get the movie indices
    movie_indices=[i[0] for i in sim_scores]
    return pd.Series(df.index[movie_indices].tolist())
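
The function above skips the first sorted entry on the assumption that it is the query movie itself. When another movie has an identical score, for example because it has the same embedding, the query may not sort first. A sketch of excluding the query by index instead follows; `top_k_indices` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def top_k_indices(scores, query_idx, k=10):
    """Indices of the k highest scores, excluding `query_idx` itself."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[query_idx] = -np.inf  # exclude the query by index, not by rank
    k = min(k, len(scores) - 1)
    order = np.argsort(-scores, kind="stable")  # descending, ties keep original order
    return order[:k].tolist()

# Movie 0 ties with the query (movie 2) at a score of 1.0.
print(top_k_indices([1.0, 0.2, 1.0, 0.9], query_idx=2, k=2))  # [0, 3]
```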

In [33]:
get_recommendations('rambo')

0                                      spy game
1    allan quatermain and the lost city of gold
2                            punisher: war zone
3                                    battleship
4                                 reign of fire
5                                         keoma
6                                largo winch ii
7                                       getaway
8                    golgo 13: the professional
9                                     moby dick
dtype: object

In [29]:
get_recommendations('spider-man')

0                     twin sitters
1                        evilspeak
2            descendant of the sun
3                       gridlocked
4                       blown away
5                       blown away
6         x-men origins: wolverine
7                 bong of the dead
8    hello mary lou: prom night ii
9                the toxic avenger
dtype: object

### Conclusions
1. Content-Based Recommendations: The project built a content-based recommendation system that suggests movies by the cosine similarity of their description embeddings. Relevance was judged only by inspecting sample outputs such as the two queries above; no quantitative evaluation was performed.
2. Preprocessing Pipeline: The data cleaning and preprocessing steps merged the three datasets, removed duplicates and missing values, and converted genres, overview, and keywords into lowercase tokens for model training.
3. Word2Vec Model Utilization: The Word2Vec model, trained on the combined movie features, provided the word embeddings that were averaged into one vector per movie and compared to compute movie similarities.