# Content Based Movie Recommendation

## What are we doing?

Recommendation systems help customers find content that they are likely to consume. This project builds a movie recommendation system using the TMDB 5000 Movie Dataset (https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata).

## What type of recommendation system are we building?

We are building a content-based recommendation system. This type of system suggests content similar to what you have previously consumed. For example, if you have been watching a lot of thriller movies, it will suggest movies whose content resembles those in your watch history.

## What are the other types of recommendation systems?

* Collaborative Filtering: These systems use other people's consumption behavior to decide what to suggest to a user. For example, if you have watched a lot of thriller movies recently, the system looks for other users who have also been watching thrillers and suggests the other content (including other genres) that those users watched. The hypothesis is that people with similar taste are more likely to enjoy each other's content.

* Hybrid: These types of recommendation systems combine both content based and collaborative filtering techniques to recommend content.
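
For contrast with the content-based approach used in this project, here is a minimal sketch of the collaborative filtering idea described above. The watch matrix, the movie titles, and the helper `most_similar_user` are all made up for illustration; real systems use far larger matrices and more careful scoring.

```python
import numpy as np

# Hypothetical watch matrix: rows are users, columns are movies (1 = watched).
movie_titles = ["Thriller 1", "Thriller 2", "Comedy 1", "Drama 1"]
watched = np.array([
    [1, 1, 0, 0],  # user 0: the user we recommend for
    [1, 1, 0, 1],  # user 1: similar taste, also watched "Drama 1"
    [0, 0, 1, 0],  # user 2: different taste
], dtype=float)

def most_similar_user(matrix, user):
    """Return the index of the other user with the highest cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ matrix[user] / (norms * norms[user])
    sims[user] = -1.0  # exclude the user themselves
    return int(np.argmax(sims))

neighbour = most_similar_user(watched, 0)

# Suggest what the neighbour watched that user 0 has not.
suggestions = [
    title
    for title, seen_by_neighbour, seen_by_user
    in zip(movie_titles, watched[neighbour], watched[0])
    if seen_by_neighbour and not seen_by_user
]
print(suggestions)  # ['Drama 1']
```

Note that nothing here looks at what the movies are about; only the overlap in viewing behavior matters.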

## What methodology are we using?

* We use each movie's overview, genres, keywords, top-billed cast, and director as source information.
* The source information from these columns is combined into a single column, `tags`. Think of `tags` as one piece of text that contains a movie's overview, its genres and keywords, its cast, and its director.
* Stemming is applied to the `tags` column, so that different forms of a word are reduced to a common stem.
* A bag-of-words representation is built from the words in the `tags` column across all movies.
* After applying bag of words, each movie is represented by a vector of word counts for its tags.
* The cosine similarity is then computed between the bag-of-words vectors of every pair of movies.
* Movies whose tag vectors have a high cosine similarity are expected to be similar in content.
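
The bag-of-words and cosine similarity steps above can be sketched by hand on a toy example. The three movies and their (already stemmed) tags below are invented for illustration, and the helpers `to_vector` and `cosine` are not part of the project code; the notebook itself uses scikit-learn for both steps.

```python
from collections import Counter

import numpy as np

# Hypothetical, already-stemmed tags for three made-up movies.
tags = {
    "Movie A": "space alien war soldier",
    "Movie B": "space alien planet explor",
    "Movie C": "love wed comedi famili",
}

# Vocabulary: every distinct word across all tags, in a fixed order.
vocabulary = sorted({word for text in tags.values() for word in text.split()})

def to_vector(text):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.split())
    return np.array([counts[word] for word in vocabulary], dtype=float)

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = {title: to_vector(text) for title, text in tags.items()}

print(cosine(vectors["Movie A"], vectors["Movie B"]))  # 0.5 (two of four words shared)
print(cosine(vectors["Movie A"], vectors["Movie C"]))  # 0.0 (no words shared)
```

Because cosine similarity depends only on the direction of the vectors, a long overview and a short one can still score as similar if they use the same words in similar proportions.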

In [1]:
import numpy as np
import pandas as pd

## Data Pre-processing

In [7]:
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')

In [8]:
movies = movies.merge(credits,on='title')
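
One thing to be aware of when merging on `title`: if two different films share a title, every matching row on one side is paired with every matching row on the other. The sketch below shows the effect on toy data. It assumes the credits table carries a `movie_id` column that matches `id` in the movies table; treat that as an assumption to verify against the actual CSV files.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; two different films share a title.
movies_toy = pd.DataFrame({"id": [1, 2], "title": ["Batman", "Batman"]})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Batman", "Batman"],
    "cast": ["cast of film 1", "cast of film 2"],
})

# Joining on title pairs every "Batman" row with every other "Batman" row.
by_title = movies_toy.merge(credits_toy, on="title")
print(len(by_title))  # 4

# Joining on the identifier keeps one row per film.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)
print(len(by_id))  # 2
```

Whether this matters in practice depends on how many titles repeat in the dataset.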

In [9]:
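def filter_required_columns(df, columns):
    # Return an independent copy so the later in-place edits do not trigger
    # pandas' SettingWithCopyWarning.
    return df[columns].copy()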

In [10]:
def handle_missing_data(movies):
  movies.dropna(inplace=True)
  return movies

In [11]:
def remove_duplicate_data(movies):
  movies.drop_duplicates(inplace=True)
  return movies

In [12]:
import ast

def extract_names_from_object_string(obj):
  l = []
  for i in ast.literal_eval(obj):
    l.append(i['name'])
  return l
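
To make the input format concrete, here is a small usage sketch. The function is re-declared as a list comprehension so that the block runs on its own, and the sample string is a hypothetical cell value written in the style of the `genres` column.

```python
import ast

def extract_names_from_object_string(obj):
    """Same behaviour as the function above, written as a list comprehension."""
    return [item["name"] for item in ast.literal_eval(obj)]

# Hypothetical cell value in the style of the `genres` column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names_from_object_string(raw))   # ['Action', 'Adventure']
print(extract_names_from_object_string("[]"))  # []
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing strings read from a file.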

In [13]:
def modify_genre(movies):
  movies['genres'] = movies['genres'].apply(extract_names_from_object_string)

In [14]:
def modify_keywords(movies):
  movies['keywords'] = movies['keywords'].apply(extract_names_from_object_string)

In [15]:
def extract_cast(obj):
  l = []
  for i in ast.literal_eval(obj):
    if len(l) < 3:
      l.append(i['name'])
    else:
      break
  return l

In [16]:
def modify_cast(movies):
  movies['cast'] = movies['cast'].apply(extract_cast)

In [17]:
def extract_director(obj):
  l = []
  for i in ast.literal_eval(obj):
    if i['job'] == 'Director':
      l.append(i['name'])
      break
  return l
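
A quick check of the director extraction on made-up data. The crew string and the names in it are placeholders, and the function is re-declared in a slightly shorter form so the block is self-contained.

```python
import ast

def extract_director(obj):
    """Return a one-element list with the first crew member whose job is 'Director'."""
    for member in ast.literal_eval(obj):
        if member["job"] == "Director":
            return [member["name"]]
    return []

# Hypothetical crew string; names are placeholders.
raw_crew = (
    '[{"job": "Editor", "name": "Person A"},'
    ' {"job": "Director", "name": "Person B"},'
    ' {"job": "Director", "name": "Person C"}]'
)
print(extract_director(raw_crew))  # ['Person B']
print(extract_director('[{"job": "Editor", "name": "Person A"}]'))  # []
```

Only the first director is kept, and a movie without a listed director yields an empty list rather than an error.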

In [18]:
def modify_crew(movies):
  movies['crew'] = movies['crew'].apply(extract_director)

In [19]:
def create_tag(movies):

  movies['overview'] = movies['overview'].apply(lambda x:x.split())
  movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
  movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
  movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
  movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])

  movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
  movies.drop(columns=['overview','genres','keywords','cast','crew'],inplace=True)
  movies['tags'] = movies['tags'].apply(lambda x:" ".join(x))
  movies['tags'] = movies['tags'].apply(lambda x:x.lower())
  return movies
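
To see what a finished tag looks like, the sketch below runs the same steps on a single invented row. The movie, the people, and the column values are all hypothetical.

```python
import pandas as pd

# One hypothetical row after the list-extraction steps have run.
row = pd.DataFrame({
    "title": ["Example Movie"],
    "overview": ["A marine is sent to a distant moon"],
    "genres": [["Science Fiction", "Action"]],
    "keywords": [["space war"]],
    "cast": [["Jane Doe"]],
    "crew": [["John Smith"]],
})

list_columns = ["genres", "keywords", "cast", "crew"]
for column in list_columns:
    # Remove inner spaces so "Jane Doe" becomes the single token "JaneDoe".
    row[column] = row[column].apply(lambda items: [item.replace(" ", "") for item in items])

# Start from the overview words and append each list column in turn.
words = row["overview"].apply(str.split)
for column in list_columns:
    words = words + row[column]

row["tags"] = words.apply(lambda tokens: " ".join(tokens).lower())
print(row["tags"].iloc[0])
# a marine is sent to a distant moon sciencefiction action spacewar janedoe johnsmith
```

Collapsing multi-word names into one token keeps, for example, two different actors named "Jane" from being counted as a shared word.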


In [20]:
def pre_process(movies):

  # column list
  column = ['genres', 'id', 'keywords', 'overview', 'title', 'cast', 'crew']

  movies = filter_required_columns(movies, column)
  movies = handle_missing_data(movies)
  movies = remove_duplicate_data(movies)

  modify_genre(movies)
  modify_keywords(movies)
  modify_cast(movies)
  modify_crew(movies)

  movies = create_tag(movies)

  return movies

In [21]:
movies = pre_process(movies)

## Generate Similarity

In [22]:
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
def apply_stemming(text):
  ps = PorterStemmer()
  y = []
  for i in text.split():
    y.append(ps.stem(i))
  return " ".join(y)
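
A small illustration of why stemming helps before counting words. This variant creates the `PorterStemmer` once and reuses it, which is a minor change from the function above made for this sketch.

```python
from nltk.stem.porter import PorterStemmer

# Create the stemmer once instead of once per call.
stemmer = PorterStemmer()

def apply_stemming(text):
    """Replace every whitespace-separated word in `text` with its Porter stem."""
    return " ".join(stemmer.stem(word) for word in text.split())

print(apply_stemming("loved loving loves"))  # love love love
```

Without stemming, "loved", "loving", and "loves" would occupy three separate positions in the bag-of-words vocabulary and would not count as a match between two movies.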

In [24]:
def stem_tags(movies):
  movies['tags'] = movies['tags'].apply(apply_stemming)

In [25]:
def apply_count_vectorizer_to_tags(movies):
  cv = CountVectorizer(max_features=5000,stop_words='english')
  vectors = cv.fit_transform(movies['tags']).toarray()
  return vectors
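
`fit_transform` returns a SciPy sparse matrix, and `.toarray()` converts it to a dense array. For a few thousand movies and 5000 features that is manageable, but `cosine_similarity` also accepts the sparse matrix directly. A minimal sketch on invented tags:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-stemmed tags for three made-up movies.
tags = [
    "space alien war soldier",
    "space alien planet explor",
    "love wed comedi famili",
]

cv = CountVectorizer(max_features=5000, stop_words="english")
sparse_vectors = cv.fit_transform(tags)  # SciPy sparse matrix, no .toarray()

similarity = cosine_similarity(sparse_vectors)  # dense (n_movies, n_movies) array
print(similarity.shape)  # (3, 3)
print(similarity[0, 1] > similarity[0, 2])  # True
```

Skipping the dense conversion mainly saves memory for the count vectors; the similarity matrix itself is still dense and grows with the square of the number of movies.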

In [26]:
def prepare_data(movies):

  stem_tags(movies)
  count_vector = apply_count_vectorizer_to_tags(movies)
  similarity = cosine_similarity(count_vector)

  return movies, count_vector, similarity

In [27]:
movies, count_vector, similarity = prepare_data(movies)

## Manage Models

In [31]:
def save_model(movies, count_vector, similarity):
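  # index=False keeps the row index out of the CSV, so no extra "Unnamed: 0" column appears on reload.
  movies.to_csv('data/model/movies.csv', index=False)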
  np.save('data/model/count_vector.npy', count_vector)
  np.save('data/model/similarity.npy', similarity)

In [32]:
def load_model():
  movies = pd.read_csv('data/model/movies.csv')
  # Plain numeric arrays load without pickle support, so allow_pickle is left at its default (False).
  count_vector = np.load('data/model/count_vector.npy')
  similarity = np.load('data/model/similarity.npy')
  return movies, count_vector, similarity
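
A round-trip check is a cheap way to confirm that what gets loaded matches what was saved. The sketch below uses toy data and a temporary directory instead of `data/model/`, and it writes the CSV with `index=False`, which is an assumption of this sketch.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-ins for the real artifacts.
movies_toy = pd.DataFrame({"id": [10, 20], "title": ["Movie A", "Movie B"]})
similarity_toy = np.array([[1.0, 0.25], [0.25, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    movies_toy.to_csv(model_dir / "movies.csv", index=False)
    np.save(model_dir / "similarity.npy", similarity_toy)

    loaded_movies = pd.read_csv(model_dir / "movies.csv")
    loaded_similarity = np.load(model_dir / "similarity.npy")

print(list(loaded_movies.columns))  # ['id', 'title']
print(np.array_equal(loaded_similarity, similarity_toy))  # True
```

The reloaded DataFrame gets a fresh 0-based index, so row positions line up with the rows and columns of the similarity matrix.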

In [34]:
save_model(movies, count_vector, similarity)

In [35]:
l_movies, l_count_vector, l_similarity = load_model()

## Recommendation Generation

In [39]:
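def recommend(movie_name):
  # Row index of the requested title; raises IndexError if the title is not in l_movies.
  movie_index = l_movies[l_movies['title'] == movie_name].index[0]
  # Despite the name, these are cosine similarities: higher means more alike.
  distances = l_similarity[movie_index]
  # Sort by similarity, highest first; skip the top entry (normally the movie itself) and keep the next five.
  movies_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]

  for i in movies_list:
    print(l_movies.iloc[i[0]].title)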
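
If the recommendations need to be consumed by other code, a variant that returns a list and tolerates unknown titles may be more convenient. This is only a sketch: `recommend_titles`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the loaded movies table and similarity matrix.
movies_toy = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C", "Movie D"]})
similarity_toy = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.0],
    [0.4, 0.3, 0.0, 1.0],
])

def recommend_titles(movie_name, movies, similarity, top_n=5):
    """Return up to `top_n` titles most similar to `movie_name`, or [] if it is unknown."""
    positions = np.flatnonzero(movies["title"].to_numpy() == movie_name)
    if positions.size == 0:
        return []
    position = positions[0]
    # Sort positions by descending similarity and drop the query movie itself.
    ranked = [i for i in np.argsort(-similarity[position]) if i != position]
    return movies["title"].iloc[ranked[:top_n]].tolist()

print(recommend_titles("Movie A", movies_toy, similarity_toy, top_n=2))  # ['Movie B', 'Movie D']
print(recommend_titles("Unknown", movies_toy, similarity_toy))  # []
```

Excluding the query movie by position, rather than by skipping the first sorted entry, avoids returning it when another movie ties with a similarity of 1.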

In [40]:
recommend('Avatar')

Aliens vs Predator: Requiem
Aliens
Falcon Rising
Independence Day
Titan A.E.
