### Business Objective
The goal of this project was to help users find movies they’re likely to enjoy, instead of manually browsing through thousands of titles. From a business perspective, such a system helps increase user satisfaction, time spent on the platform, and overall subscription retention — which directly impacts revenue for OTT platforms.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 
warnings.filterwarnings('ignore')

In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')


In [3]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [5]:
#lets merge both table for smooth work
movies = movies.merge(credits,on = 'title')

In [6]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [7]:
#lets remove those columns which will not be useful in our  analysis
#the columns which i will keep for analysis
#genres
#id
#keywords
#title
#overview
#cast
#crew

### Selection of Relevant Attributes for Recommendation System
I selected only the columns genres, id, keywords, title, overview, cast, and crew because these contain the most relevant information needed to understand the content and context of a movie — which are the core inputs for a recommendation engine.

The goal was to recommend similar movies based on storyline, theme, and people involved. Columns like genres, overview, and keywords describe what the movie is about, while cast and crew give information about the actors and directors, which often influence user preferences.

I excluded other columns such as budget, popularity, or runtime because they don’t directly affect user similarity or movie similarity in terms of content. Keeping only meaningful features helped reduce noise and improve both model interpretability and performance.

In [8]:
movies = movies[['genres','movie_id','keywords','title','overview','cast','crew']]

In [9]:
movies.head()

Unnamed: 0,genres,movie_id,keywords,title,overview,cast,crew
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [10]:
#checking missing values in our relevant columns
movies.isnull().sum()

genres      0
movie_id    0
keywords    0
title       0
overview    3
cast        0
crew        0
dtype: int64

In [11]:
#dropping those null values 
movies.dropna(inplace = True)

In [12]:
movies.isnull().sum()

genres      0
movie_id    0
keywords    0
title       0
overview    0
cast        0
crew        0
dtype: int64

movies.duplicated().sum()

In [14]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [15]:
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [16]:
def convert(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

### Extracting Genre Names from Nested JSON Data
I applied the convert() function to extract genre names from the complex JSON-like structure present in the genres column. In the raw dataset, each movie’s genres were stored as a list of dictionaries (e.g., [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]).

The goal was to simplify this nested data into a list of clean, readable genre names (like Action, Adventure, Fantasy, Science Fiction) that can be used effectively for content-based similarity and textual analysis.

By doing this transformation, I made the genres column machine-readable and suitable for vectorization techniques such as CountVectorizer or TF-IDF, which require text input.

In [17]:
#import ast
#ast.literal_eval() #this is to change string list into proper list format in genras

In [18]:
movies['genres'] = movies['genres'].apply(convert)

In [19]:
movies.head()

Unnamed: 0,genres,movie_id,keywords,title,overview,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[Adventure, Fantasy, Action]",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[Action, Adventure, Crime]",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[Action, Crime, Drama, Thriller]",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[Action, Adventure, Science Fiction]",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [20]:
#now lets do kinda same process for our keyword column
movies['keywords'] = movies['keywords'].apply(convert)

### Extracting Movie Keywords for Content Similarity
I applied the same convert() function to the keywords column because it also contained nested JSON-like data, where each movie had a list of keyword dictionaries (e.g., [{"id": 100, "name": "space travel"}, {"id": 200, "name": "alien"}]).

Extracting only the name values helped convert this unstructured data into a clean list of meaningful keywords (e.g., space travel, alien). These keywords represent the underlying themes or concepts of a movie and play an important role in identifying content similarity between movies.

This transformation made the keywords feature text-ready for vectorization and enhanced the recommendation system’s ability to capture semantic similarity between movies.

In [23]:
def convert2(obj):
    L = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter != 3:
            L.append(i['name'])
            counter += 1
        else:
            break
    return L

In [24]:
movies['cast'] = movies['cast'].apply(convert2)

### Extracting Top 3 Cast Members for Each Movie
The cast column had a long list of actors for each movie. I extracted only the top 3 cast members because lead actors have the most influence on user preferences. This reduced noise and helped my model focus on the most impactful features.

In [27]:
def fetch_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L

In [28]:
movies['crew'] = movies['crew'].apply(fetch_director)

### Extracting Director Information from Crew Data
The crew column had a list of all crew members in JSON format. I used the fetch_director() function to extract only the director’s name since a director’s work style often defines the movie’s theme and influences what viewers like

In [30]:
movies['overview'] = movies['overview'].apply(lambda x: x.split())

### Text Tokenization for Movie Overview
The overview column contains a text summary (description) of the movie. To process this data for our recommendation system, we need to convert it into a tokenized form — that is, a list of individual words.

In [32]:
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(' ','') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(' ','') for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(' ','') for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(' ','') for i in x])

### Removing Whitespaces from Text Data
Applied a lambda function to remove spaces from all text elements in the columns — genres, keywords, cast, and crew.

In [33]:
movies.head()

Unnamed: 0,genres,movie_id,keywords,title,overview,cast,crew
0,"[Action, Adventure, Fantasy, ScienceFiction]",19995,"[cultureclash, future, spacewar, spacecolony, ...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,"[Adventure, Fantasy, Action]",285,"[ocean, drugabuse, exoticisland, eastindiatrad...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,"[Action, Adventure, Crime]",206647,"[spy, basedonnovel, secretagent, sequel, mi6, ...",Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,"[Action, Crime, Drama, Thriller]",49026,"[dccomics, crimefighter, terrorist, secretiden...",The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,"[Action, Adventure, ScienceFiction]",49529,"[basedonnovel, mars, medallion, spacetravel, p...",John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [34]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

### Feature Engineering – Creating the **tags** Column
The idea behind a content-based recommendation system is to find similarities between movies based on their descriptive features. However, these features (overview, genres, keywords, cast, crew) exist in separate columns. To compute similarity effectively, we need to combine all relevant information into a single text-based representation

In [36]:
new_df = movies[['movie_id','title','tags']]

In [75]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [79]:
new_df['tags'] = new_df['tags'].apply(lambda x:' '.join(x))

### Converting List of Words into a Single String
After combining all the text features (overview, genres, keywords, cast, crew) into the tags column, each row contained a list of words.
However, most text vectorization techniques (like CountVectorizer or TF-IDF) require the input to be in string format, not lists

In [81]:
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [85]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

### Text Normalization – Converting to Lowercase
The text data in the tags column contained a mix of uppercase and lowercase words (e.g., “Action” and “action”).
Since text vectorization is case-sensitive, such inconsistencies can cause the same word to be treated as different tokens

In [93]:
#time to do NLP on tags 
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000,stop_words = 'english')

In [107]:
vectors = cv.fit_transform(new_df['tags']).toarray()

### Text Vectorization using CountVectorizer
Machine learning models and similarity algorithms cannot directly process text data. Therefore, it’s essential to convert the textual tags column into a numerical representation that captures the importance of each word across all movies

In [109]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [111]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [119]:
#lets apply stemming
import nltk

In [121]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [123]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return ' '.join(y)

In [125]:
new_df['tags'] = new_df['tags'].apply(stem)

### Text Normalization – Stemming
The words in the tags column may appear in different grammatical forms — for example, “love”, “loved”, and “loving”.
Although they convey the same meaning, they would be treated as separate tokens by the model, reducing similarity accuracy

In [131]:
from sklearn.metrics.pairwise import cosine_similarity

In [139]:
similarity = cosine_similarity(vectors)

### Computing Similarity using Cosine Similarity
After converting all movies into numerical feature vectors (using CountVectorizer), the next step is to measure how similar two movies are based on their content — i.e., genres, cast, keywords, and overview

In [143]:
similarity[0]

array([1.        , 0.08964215, 0.05976143, ..., 0.02519763, 0.02817181,
       0.        ])

In [169]:
def recommend(movie):
    movie_index = new_df[new_df['title'] == movie].index[0]
    distances = similarity[movie_index]
    movies_list = sorted(list(enumerate(distances)),reverse = True,key = lambda x:x[1])[1:6]
    for i in movies_list:
        print(new_df.iloc[i[0]].title)

### Building the Recommendation Function
After computing the cosine similarity matrix, the goal is to build a function that recommends movies similar to a given input movie.
- Step 1: Find the index of the selected movie in the dataset.
- Step 2: Retrieve the similarity scores (distances) for that movie from the precomputed similarity matrix.
- Step 3: Sort all movies in descending order of similarity scores.
- Step 4: Select the top 5 most similar movies (excluding the movie itself).
- Step 5: Display the recommended movie titles.

In [173]:
recommend('Batman Begins')

The Dark Knight
The Dark Knight Rises
Batman
Batman & Robin
Batman


In [167]:
new_df.iloc[1216].title

'Autumn in New York'

In [175]:
#lets convert this shit into website
import pickle

In [177]:
#pickle.dump(new_df,open('movies.pkl','wb'))

In [179]:
new_df['title'].values

array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
       ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
       'My Date with Drew'], dtype=object)

In [183]:
pickle.dump(new_df.to_dict(),open('movie_dict.pkl','wb'))

In [181]:
pickle.dump(similarity,open('similarity.pkl','wb'))