**Content-Based Recommendation System**

*This notebook builds a content-based recommendation system that suggests movies similar to a given title, using each movie's overview, genres, keywords, cast, and director.*

In [36]:
import pandas as pd
import numpy as np
import ast
import pickle
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


*We load three datasets: tmdb_5000_movies.csv, tmdb_5000_credits.csv, and a custom Bollywood dataset. The two TMDB files are merged on the movie ID to combine movie details with cast and crew information.*

In [37]:
tmdb_credits_url = "https://www.dropbox.com/scl/fi/o6h6ml24maqkkg5smkwyj/tmdb_5000_credits.csv?rlkey=q9gh85659aspuqb7i2aytjhuo&st=cf5lpy8a&dl=1"
tmdb_movies_url = "https://www.dropbox.com/scl/fi/s49mp8ssn1ziohhxy424b/tmdb_5000_movies.csv?rlkey=cwc84dofl7axm09f2zp0zbomc&st=z6k326x2&dl=1"
bollywood_url = "https://www.dropbox.com/scl/fi/wbn9stgrmbrpg1r3gld3x/bollywood_full.csv?rlkey=736m50a899kjkgs5swe9gfhbz&st=7bxvjfs8&dl=1"

In [38]:
tmdb_credits = pd.read_csv(tmdb_credits_url)
tmdb_movies = pd.read_csv(tmdb_movies_url)
bollywood = pd.read_csv(bollywood_url)

In [39]:
hollywood_df = tmdb_movies.merge(tmdb_credits, left_on='id', right_on='movie_id')
hollywood_df = hollywood_df[["id","title_x","overview","genres","keywords","cast","crew"]]
hollywood_df = hollywood_df.rename(columns={"id":"movie_id", "title_x": "title"})
hollywood_df['origin'] = 'Hollywood'
hollywood_df['poster_url'] = ''
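
*The merge step can be checked on a toy example. This is a minimal sketch with made-up rows, not the real TMDB data; the frames `movies` and `credits` are hypothetical stand-ins. It shows why the merged frame has a `title_x` column and how `validate` can guard against duplicated IDs.*

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (values are made up for illustration).
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "overview": ["x", "y"]})
credits = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]})

# validate="one_to_one" raises a MergeError if either key column contains duplicates.
merged = movies.merge(credits, left_on="id", right_on="movie_id", validate="one_to_one")

# Both inputs carry a "title" column, so pandas suffixes them as title_x / title_y.
merged_columns = list(merged.columns)
print(merged_columns)
```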

In [40]:
def extract_genres(genre_str):
    # Parse a stringified list of dicts and keep each genre name.
    try:
        return [genre['name'] for genre in ast.literal_eval(genre_str)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

bollywood['genres'] = bollywood['genres'].apply(extract_genres)
bollywood = bollywood.rename(columns={
    "title_x": "title", "story": "overview",
    "actors": "cast", "poster_path": "poster_url"
})

In [41]:
# Take an explicit copy so the column assignments below do not act on a slice of the original frame.
bollywood = bollywood[["title", "overview", "genres", "cast", "poster_url"]].dropna().copy()
bollywood["movie_id"] = range(100000, 100000 + len(bollywood))
bollywood['origin'] = 'Bollywood'
bollywood['crew'] = ''

In [42]:
common_cols = ["movie_id", "title", "overview", "genres", "keywords", "cast", "crew", "poster_url", "origin"]

In [43]:
combined = pd.concat([
    hollywood_df.reindex(columns=common_cols),
    bollywood.reindex(columns=common_cols)
], ignore_index=True)
# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# pandas warns that the chained inplace form stops working in pandas 3.0.
combined['keywords'] = combined['keywords'].fillna('[]')
combined['crew'] = combined['crew'].fillna('[]')


In [44]:
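def convert_json(obj):
    # Parse a stringified list of dicts (TMDB format) and keep each "name".
    try:
        return [i["name"] for i in ast.literal_eval(obj)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

def convert_list_or_string(obj):
    if isinstance(obj, list):
        return obj
    if isinstance(obj, str):
        # TMDB genres and cast are stringified lists of dicts, so parse those
        # instead of splitting on commas, which would break the entries apart.
        if obj.lstrip().startswith('['):
            return convert_json(obj)
        return [name.strip() for name in obj.split(',')]
    return []

def get_top_3(obj):
    return obj[:3]

def get_director(obj): # Only for Hollywood data
    try:
        for i in ast.literal_eval(obj):
            if i["job"] == "Director":
                return [i["name"]]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []
    return []

def collapse(L):
    return [i.replace(" ", "").lower() for i in L]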


In [45]:
combined['genres'] = combined['genres'].apply(convert_list_or_string)

In [46]:
combined['keywords'] = combined['keywords'].apply(convert_json)

In [47]:
combined['cast'] = combined['cast'].apply(convert_list_or_string)

In [48]:
combined['cast'] = combined['cast'].apply(get_top_3)

In [49]:
combined['crew'] = combined['crew'].apply(get_director)

In [50]:
combined["overview"] = combined["overview"].apply(lambda x: x.split() if isinstance(x, str) else [])

In [51]:
for feature in ['genres', 'keywords', 'cast', 'crew']:
    combined[feature] = combined[feature].apply(collapse)

In [52]:
combined["tags"] = combined["overview"] + combined["genres"] + combined["keywords"] + combined["cast"] + combined["crew"]
combined["tags"] = combined["tags"].apply(lambda x: " ".join(x).lower())
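
*To see what a finished tag string looks like, here is a small sketch on one hand-written row. The overview text is made up for illustration, and `collapse` is redefined so the snippet runs on its own. Joining multi-word names into one token keeps, for example, a first name shared by two actors from counting as a match.*

```python
def collapse(names):
    # Same idea as the notebook's helper: drop spaces and lowercase each entry.
    return [n.replace(" ", "").lower() for n in names]

# Illustrative inputs only; the overview sentence is not taken from the dataset.
overview = "A marine is sent to Pandora".split()
genres = collapse(["Science Fiction"])
cast = collapse(["Sam Worthington", "Zoe Saldana"])
crew = collapse(["James Cameron"])

# Concatenate the token lists and join them into one lowercase string.
tags = " ".join(overview + genres + cast + crew).lower()
print(tags)
```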

In [53]:
final_df = combined[["movie_id", "title", "tags", "overview", "cast", "crew", "poster_url", "origin"]].copy()

In [54]:
ps = PorterStemmer()

In [55]:
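from sklearn.feature_extraction.text import TfidfVectorizer

# Stem every token so that variants such as "loved" and "loving" share one feature.
final_df["tags"] = final_df["tags"].apply(lambda x: " ".join(ps.stem(i) for i in x.split()))

# TF-IDF restricted to the 5,000 most frequent terms, ignoring English stop words.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
vectors = tfidf.fit_transform(final_df["tags"])  # sparse matrix; cosine_similarity accepts it directly
similarity = cosine_similarity(vectors)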
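
*The vectorize-then-compare step is easier to follow on a tiny corpus. The sketch below uses three made-up, pre-stemmed tag strings (`toy_tags` is a hypothetical name) and default vectorizer settings, so it illustrates the idea rather than reproducing the notebook's exact configuration.*

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up tag strings: two space films and one romantic comedy.
toy_tags = [
    "space adventur alien action",
    "space alien invas action",
    "romanc comedi wed",
]

# Each row becomes a TF-IDF vector; cosine similarity compares every pair of rows.
toy_vectors = TfidfVectorizer().fit_transform(toy_tags)
toy_similarity = cosine_similarity(toy_vectors)

# Rows 0 and 1 share terms, so they score higher than rows 0 and 2, which share none.
print(toy_similarity.round(2))
```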

In [56]:
with open("movie_list.pkl", "wb") as f:
    pickle.dump(final_df.to_dict(), f)
with open("similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)
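
*One way the saved artifacts might be used is a lookup function like the one sketched here. The `recommend` helper, its parameters, and the toy data are assumptions for illustration and are not part of the notebook.*

```python
import numpy as np
import pandas as pd

def recommend(title, movies, similarity, top_n=5):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        return []  # unknown title
    idx = movies.index.get_loc(matches[0])
    # Sort positions by descending similarity and skip the query movie itself.
    order = np.argsort(similarity[idx])[::-1]
    picks = [i for i in order if i != idx][:top_n]
    return movies.iloc[picks]["title"].tolist()

# Toy data standing in for the movie table and the similarity matrix.
toy_movies = pd.DataFrame({"title": ["A", "B", "C"]})
toy_sim = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.3],
                    [0.1, 0.3, 1.0]])

result = recommend("A", toy_movies, toy_sim, top_n=2)
print(result)
```

*Since the movie table is pickled as a dictionary, it would presumably be restored with `pd.DataFrame(pickle.load(f))` before being passed to a function like this.*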