# **Abstract**

This project introduces a content-based Movie Recommender System that suggests movies similar to a user-selected title. The system leverages machine learning techniques, particularly similarity-based algorithms, to analyze movie metadata and identify closely related films. A 'tags' feature is utilized to enhance the recommendation accuracy. Developed using Python and Streamlit, the application offers an interactive interface, while the integration of the TMDb API enriches the user experience by fetching and displaying movie posters. This approach ensures efficient, accurate, and visually engaging recommendations for users seeking similar movie options.

In [5]:
import pandas as pd 
import numpy as np

In [6]:
credits=pd.read_csv('tmdb_5000_credits.csv')
movies=pd.read_csv('tmdb_5000_movies.csv')

In [7]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [8]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [None]:
movies=movies.merge(credits,on='title')

In [None]:
movies.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


# **Data Preprocessing**

- In this notebook, the datasets `tmdb_5000_credits.csv` and `tmdb_5000_movies.csv` were loaded and merged on the `'title'` column to consolidate movie-related information. Data preprocessing involved selecting relevant features (`'movie_id'`, `'title'`, `'overview'`, `'genres'`, `'keywords'`, `'cast'`, and `'crew'`), handling missing values by dropping nulls, and removing duplicates to ensure data consistency. Feature transformation was performed using `ast.literal_eval` to convert string-formatted lists/dictionaries into Python objects, followed by extracting key elements like genre names, main cast members, and crew details. These steps were essential to clean, structure, and prepare the data for building an effective movie recommendation system.

In [11]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

features we are going to keep

#generes
#id
#title
#overview
#cast
#crew

In [12]:
movies=movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [13]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [14]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [15]:
movies.dropna(inplace=True)

In [16]:
movies.duplicated().sum()

0

literal_eval is a safe Python method from the ast module that converts strings containing Python data (like lists, dicts, numbers) into actual Python objects.

In [17]:
import ast 


In [18]:
def convert(obj):
    l=[]
    for i in ast.literal_eval(obj):
        l.append(i['name'])
    return l


In [19]:
def convert3(obj):
    l=[]
    c=0
    for i in ast.literal_eval(obj):
        if c !=3:
          l.append(i['name'])
        else:
           break  
    return l

In [20]:
def fetch_director(obj):
    l=[]
    c=0
    for i in ast.literal_eval(obj):
        if i['job']=='Director':  
          l.append(i['name'])
    
    return l

In [21]:
movies['genres']=movies['genres'].apply(convert)

In [22]:
movies['keywords']=movies['keywords'].apply(convert)

In [23]:
movies['cast']=movies['cast'].apply(convert3)

In [24]:
movies['crew']=movies['crew'].apply(fetch_director)

In [25]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]


In [26]:
movies['overview']=movies['overview'].apply(lambda x:x.split())

In [27]:
movies.head(2) 

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]


In [28]:
column=movies.drop(columns=['movie_id','title'],axis=1).columns
for columns in column:
    movies[columns]=movies[columns].apply(lambda x:[i.replace(' ','') for i in x])

In [29]:
movies['tags']=movies['cast']+movies['crew']+movies['genres']+movies['keywords']+movies['overview']

In [30]:
new_df=movies[['movie_id','title','tags']]

In [31]:
new_df['tags']=new_df['tags'].apply(lambda x:' '.join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags']=new_df['tags'].apply(lambda x:' '.join(x))


In [76]:
new_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,samworthington zoesaldana sigourneyweav stephe...
1,285,Pirates of the Caribbean: At World's End,johnnydepp orlandobloom keiraknightley stellan...
2,206647,Spectre,danielcraig christophwaltz léaseydoux ralphfie...
3,49026,The Dark Knight Rises,christianbal michaelcain garyoldman annehathaw...
4,49529,John Carter,taylorkitsch lynncollin samanthamorton willemd...
...,...,...,...
4804,9367,El Mariachi,carlosgallardo jaimedehoyo petermarquardt rein...
4805,72766,Newlyweds,edwardburn kerrybishé marshadietlein caitlinfi...
4806,231617,"Signed, Sealed, Delivered",ericmabiu kristinbooth crystallow geoffgustafs...
4807,126186,Shanghai Calling,danielhenney elizacoup billpaxton alanruck zhu...


In [33]:
new_df['tags']=new_df['tags'].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags']=new_df['tags'].apply(lambda x:x.lower())


# **2. Feature Engineering**

In this phase, the textual data is transformed into a suitable format for machine learning. The 'tags' feature undergoes text preprocessing, including lowercasing, **`stemming`**, and removing stop words. The processed text is then converted into numerical representations using techniques like **`CountVectorizer`**, enabling the model to compute similarity scores effectively.

## Vectorization

**why use vectorization**

* In this case, vectorization with **`CountVectorizer`** means converting the "tags" text for each movie into a numerical format.

* Each unique tag becomes a feature, and the vector shows how often each tag appears for a movie. This helps the model compare movies based on their tags and recommend similar ones. 

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

**Purpose of parameters in count vector**

- **`CountVectorizer`** from **`sklearn`** converts a collection of text documents into a matrix of token counts.

- The parameter **`max_features=5000`** limits the vocabulary to the top 5,000 most frequent words.

- **`stop_words='english'`** removes common English stop words (like "the", "is", "and") to focus on meaningful terms.

In [37]:
vectors=cv.fit_transform(new_df['tags']).toarray()

In [39]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [50]:
print(cv.get_feature_names_out())

['000' '007' '10' ... 'zoo' 'zooeydeschanel' 'zoëkravitz']


## Stemming

Stemming is applied using **`PorterStemmer`** from the **`nltk`** library, a popular NLP (Natural Language Processing) library, to reduce words to their root forms, ensuring similar terms (like "acting" and "actor") are treated alike. This enhances the consistency of the 'tags' feature, improving the accuracy and efficiency of movie similarity calculations.

In [46]:
import nltk
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [48]:
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return ' '.join(y)

In [52]:
new_df['tags'][0]

'samworthington zoesaldana sigourneyweav stephenlang michellerodriguez giovanniribisi joeldavidmoor cchpounder wesstudi lazalonso dileeprao mattgerald seananthonymoran jasonwhyt scottlawr kellykilgour jamespatrickpitt seanpatrickmurphi peterdillon kevindorman kelsonhenderson davidvanhorn jacobtomuri michaelblain-rozgay joncurri lukehawk woodyschultz petermensah soniaye jahnelcurfman ilramchoi kylawarren lisaroumain debrawilson chrismala taylorkibbi jodielandau julielamm cullenb.madden josephbradymadden frankietorr austinwilson sarawilson tamicawashington-mil lucybri nathanmeist gerryblair matthewchamberlain paulyat wraywilson jamesgaylyn melvinlenoclarkiii carvonfutrel brandonjelk micahmoch hanniyahmuhammad christophernolen christaoliv aprilmariethoma bravitaa.threatt colinbleasdal mikebodnar mattclayton nicoledionn jamieharrison allanhenri anthonyingrub ashleyjefferi deanknowsley josephmika-hunt terrynotari kaipantano loganpithy stuartpollock raja garethruck rhiansheehan t.j.storm jod

In [49]:
new_df['tags']=new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags']=new_df['tags'].apply(stem)


# **3. Model Development**

A **`cosine_similarity`** algorithm is implemented to measure the similarity between movies based on the vectorized 'tags'. The model computes similarity scores for each movie pair, allowing the system to identify and recommend movies that are most similar to the user's selection.

In [55]:
from sklearn.metrics.pairwise import cosine_similarity
similarity=cosine_similarity(vectors)

In [57]:
similarity[0]

array([1.        , 0.07142857, 0.05143445, ..., 0.02326211, 0.02571722,
       0.        ])

## Building the Recommendation System

The core recommendation logic is developed by selecting a movie and retrieving the top similar movies based on the similarity matrix. The system ensures efficient retrieval and processing for real-time recommendations.

In [77]:
def recommend(movie):
    movies_index=new_df[new_df['title']== movie].index[0]
    distances=similarity[movies_index]
    movies_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:11]

    for i in movies_list:
        print(new_df.iloc[i[0]].title)
        
 
 

In [78]:
recommend('Batman Begins')

The Dark Knight
The Dark Knight Rises
Amidst the Devil's Wings
Batman
Batman & Robin
Batman
Mi America
Defendor
Dead Man Down
Batman Returns


In [74]:
import pickle

pickle.dump(new_df.to_dict(),open('movies.pkl','wb'))
pickle.dump(similarity,open('similarity.pkl','wb'))