# Movie Recommender System

This system recommends movies based on the content of a movie you have already watched.

Two datasets (Movies and Credits) from https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata are used to build the recommendations.

**1. MOVIE DATASET** 
* budget 
* genres 
* homepage
* id
* keywords
* original_language
* original_title
* overview	
* popularity	
* production_companies	
* production_countries	
* release_date	
* revenue	
* runtime	
* spoken_languages	
* status	
* tagline	
* title	
* vote_average	
* vote_count

**2. CREDIT DATASET**
* movie_id	
* title	
* cast	
* crew

**Importing relevant libraries**

In [3]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

**Loading Datasets**

In [4]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'tmdb_5000_movies.csv'

# Exploring data

In [None]:
movies.shape

In [None]:
credits.shape

In [None]:
movies.info()

In [None]:
movies.head(2)

In [None]:
credits.info()

In [None]:
credits.head(2)

**Merging both datasets on title**

In [None]:
df = movies.merge(credits, on = 'title')

In [None]:
df.shape

In [None]:
df.head(1)
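
Merging on `title` works cleanly only if titles are unique in both files. The sketch below uses made-up rows (not taken from the TMDB files) to show how a repeated title inflates the row count. Joining `id` to `movie_id` instead is shown as an alternative, on the assumption that those two columns identify the same film; check that against your copy of the data.

```python
import pandas as pd

# Toy frames with hypothetical values; two different films share a title.
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["Hero", "Hero", "Dune"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Hero", "Hero", "Dune"]})

# Joining on title pairs every "Hero" row with every other "Hero" row.
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the id columns keeps exactly one row per film.
by_id = movies_toy.merge(credits_toy, left_on="id", right_on="movie_id")

print(len(by_title), len(by_id))
```

Comparing `df.shape` with `movies.shape` after the real merge is a quick way to see whether this affects the dataset.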

# Selecting significant features for the recommendation system

In [None]:
features = ['movie_id','title','overview','genres','keywords','cast','crew']

In [None]:
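# .copy() gives an independent DataFrame, so the in-place edits below do not act on a slice of df
data = df[features].copy()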

In [None]:
data.head(2)

**Checking for missing values**

In [None]:
data.isnull().sum()

**Dropping rows with any missing value**

In [None]:
data.dropna(inplace = True)

In [None]:
#checking for duplicate rows
data.duplicated().sum()

# Data Pre-processing

**Extracting 'name' from the genres and keywords into a list**

In [None]:
import ast

def extract_(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L 

In [None]:
data['genres'] = data['genres'].apply(extract_)
data.head(2)

In [None]:
data['keywords'] = data['keywords'].apply(extract_)
data.head(2)
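
To make the parsing step concrete, here is a small sketch of what `ast.literal_eval` does to one cell. The string below is a hypothetical value written in the same shape as the `genres` column, not a row copied from the dataset.

```python
import ast

# Hypothetical cell value: a string that looks like a list of dicts.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)               # string -> list of dicts
names = [entry["name"] for entry in parsed]  # keep only the 'name' field
print(names)
```

`ast.literal_eval` only evaluates Python literals, so it does not execute arbitrary code the way `eval` would. If a cell held JSON-only values such as `null`, `json.loads` would be the parser to reach for instead.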

**Extracting 3 lead actors from the cast into a list**

In [None]:
def extract3(text):
    # Keep only the names of the first three cast entries
    return [i['name'] for i in ast.literal_eval(text)[:3]]

In [None]:
data['cast'] = data['cast'].apply(extract3)
data.head(2)

In [None]:
data['cast'] = data['cast'].apply(lambda x:x[0:3])

**Extracting only director's name from crew** 

In [None]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [None]:
data['crew'] = data['crew'].apply(fetch_director)

In [None]:
data.sample(5)

**Removing space in between words**

In [None]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1

In [None]:
data['cast'] = data['cast'].apply(collapse)
data['crew'] = data['crew'].apply(collapse)
data['keywords'] = data['keywords'].apply(collapse)
data['genres'] = data['genres'].apply(collapse)

In [None]:
data.head(3)
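
The notebook does not say why spaces are removed. A likely reason is that a full name then becomes a single token, so two different people who share a first name no longer look alike. The sketch below illustrates that with two example names and a compact version of `collapse`.

```python
def collapse(names):
    # Same idea as the collapse function above, written as a comprehension.
    return [name.replace(" ", "") for name in names]

cast_a = ["Sam Worthington"]
cast_b = ["Sam Mendes"]

# With spaces kept, splitting into words makes both lists share the token "Sam".
shared_before = set(" ".join(cast_a).split()) & set(" ".join(cast_b).split())

# After collapsing, each full name is one distinct token.
shared_after = set(collapse(cast_a)) & set(collapse(cast_b))

print(shared_before, shared_after)
```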

**Converting string into a list for 'overview'**

In [None]:
data['overview'] = data['overview'].apply(lambda x:x.split())

# CREATING TAGS
* Here, a movie's tags are the words from its overview, genres, keywords, cast and crew combined into a single list. These tags are the content on which movies are compared.

In [None]:
data['tags'] = data['overview'] + data['genres'] + data['keywords'] + data['cast'] + data['crew']

In [None]:
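#FINAL DATAFRAME
# reset_index makes row labels match row positions, so they line up with the similarity matrix built later
movies_df = data[['movie_id','title', 'tags']].reset_index(drop=True)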

In [None]:
# converting list of strings to a single string

movies_df['tags'] = movies_df['tags'].apply( lambda x: " ".join(x))

#changing to lower case letters

movies_df['tags'] = movies_df['tags'].apply(lambda x : x.lower())

In [None]:
movies_df.head(5)

In [None]:
movies_df['tags'][0]

# Vectorizing tags

**STEMMING all the tags**
* Stemming reduces a word to its stem (root form) by stripping affixes such as suffixes.
* For example, ['run', 'running', 'runs'] is converted to ['run', 'run', 'run'].

In [None]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
ps.stem('running')

In [None]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    
    return " ".join(y)

In [None]:
movies_df['tags'] = movies_df['tags'].apply(stem)

In [None]:
movies_df.head(1)

**Extracting the 5001 most frequently occurring words**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5001,stop_words='english')

In [None]:
vector = cv.fit_transform(movies_df['tags']).toarray()

In [None]:
vector.shape
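
As a rough picture of what the count vectors contain, the plain-Python sketch below builds them by hand for three made-up tag strings. It is not scikit-learn's implementation and skips stop-word removal and tokenization details; `max_features = 2` is a hypothetical cap standing in for the 5001 used above.

```python
from collections import Counter

# Made-up tag strings, one per movie.
docs = ["space hero saves space station", "hero loves space", "action hero"]

# Count every word across the whole corpus and keep the most frequent ones.
totals = Counter(word for doc in docs for word in doc.split())
max_features = 2  # hypothetical cap for this toy corpus
vocab = sorted(word for word, _ in totals.most_common(max_features))

# One row per document, one column per vocabulary word, each cell a word count.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
print(vectors)
```

Each row plays the role of one row of `vector`: a movie described only by how often each kept word appears in its tags.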

**Using COSINE similarity to recommend similar content**
* Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)
similarity
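
For intuition, cosine similarity between two count vectors can be computed by hand as the dot product divided by the product of the vector lengths. The sketch below uses small made-up vectors and assumes neither vector is all zeros (a zero vector would cause a division by zero here).

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1, 2, 0]
b = [2, 4, 0]  # same direction as a, so the score is close to 1
c = [0, 0, 3]  # shares no words with a, so the score is 0

print(cosine(a, b))
print(cosine(a, c))
```

Because word counts are never negative, scores for these vectors fall between 0 (no words in common) and 1 (same direction).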

**Defining our recommendation function**

In [None]:
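def recommend(movie):

    # Row position of the given movie; positions (not index labels) match the similarity matrix
    matches = np.flatnonzero((movies_df['title'] == movie).to_numpy())
    if len(matches) == 0:
        print(f"'{movie}' was not found in the dataset")
        return
    index = matches[0]

    # Pairing each position with its similarity score and sorting in descending order
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])

    for i in distances[1:6]:

        # Printing the top 5 similar movies (the first entry is normally the movie itself)
        print(movies_df.iloc[i[0]]['title'])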

In [None]:
recommend('Iron Man')
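
The ranking step inside `recommend` can be traced on a tiny example. The titles and scores below are made up; the row stands in for `similarity[index]` for the first movie.

```python
titles = ["A", "B", "C", "D"]

# Hypothetical similarity row for movie "A" (position 0).
row = [1.0, 0.2, 0.9, 0.5]

# Pair each position with its score, then sort by score, highest first.
ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)

# The first entry is "A" itself (similarity 1.0), so it is skipped.
top2 = [titles[position] for position, _ in ranked[1:3]]
print(top2)
```

Skipping the first entry assumes the queried movie ranks first. If another movie had identical tags, the two could tie and the query itself might show up in the results.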

**Saving the movie data and similarity matrix for deployment**

In [None]:
import pickle
import bz2  # standard library module; provides BZ2File on Python 3

def compressed_pickle(title, data):

    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        pickle.dump(data, f)

compressed_pickle('movies_dict', movies_df.to_dict())
compressed_pickle('similarity', similarity)
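
On the deployment side the files have to be read back. The sketch below shows one way to do that; `decompress_pickle` is a hypothetical helper name, and the round trip uses a small stand-in object in a temporary directory rather than the real `.pbz2` files.

```python
import bz2
import os
import pickle
import tempfile

def decompress_pickle(path):
    # Hypothetical counterpart of compressed_pickle: open the bz2 file and unpickle it.
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)

# Round trip on a small stand-in object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pbz2')
    with bz2.BZ2File(path, 'w') as f:
        pickle.dump({'title': ['Iron Man']}, f)
    restored = decompress_pickle(path)

print(restored)
```

Unpickling can run arbitrary code, so only load `.pbz2` files that you created yourself or otherwise trust.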