<a href="https://colab.research.google.com/github/HardeepSaggu/Movie-Recommendation-System-NLP/blob/master/Movie_Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Libraries and modules**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.metrics.pairwise import linear_kernel,cosine_similarity
from ast import literal_eval
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

**Dataset** 


In [None]:
df1 = pd.read_csv('datasets/tmdb_5000_credits.csv')
df2 = pd.read_csv('datasets/tmdb_5000_movies.csv')

In [None]:
df1.head(5)

In [None]:
df2.head(5)

**Merging both datasets into one main DataFrame on the 'id' column**


In [None]:
# Rename the credits columns positionally so the join key is called 'id', as in df2.
df1.columns = ['id','title_x','cast','crew']
df2 = df2.merge(df1,on = 'id')
df2.head(5)

In [None]:
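# m: minimum vote count needed to qualify (90th percentile of all vote counts)
# C: mean vote average across the whole dataset
m = df2['vote_count'].quantile(0.9)
C = df2['vote_average'].mean()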

In [None]:
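def weight_average(x):
    # Weighted rating: pull each movie's rating R toward the global mean C,
    # trusting R more as the vote count v grows relative to the threshold m.
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m))*R + (m/(v+m))*C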
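
The effect of the weighting is easiest to see on two hand-picked movies. The sketch below restates the same formula as a standalone function; the name `weighted_rating` and all numbers (`m`, `C`, votes, ratings) are illustrative assumptions, not values read from the TMDB files.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own rating R with the global mean C.

    v is the movie's vote count and m the vote-count threshold.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (assumed, not taken from the dataset).
m, C = 1000, 6.0

few_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)   # high rating, few votes
many_votes = weighted_rating(v=9000, R=8.0, m=m, C=C)  # lower rating, many votes
print(few_votes, many_votes)
```

With these numbers the movie rated 9.0 on few votes ends up below the movie rated 8.0 on many votes, which is the behaviour the chart is meant to have.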

**Filtering the DataFrame to keep movies with vote count >= m (the minimum number of votes required to appear on the chart)**

In [None]:
q_movies = df2.loc[df2['vote_count'] >= m].copy()
q_movies.shape

In [None]:
q_movies['score'] = q_movies.apply(weight_average , axis = 1)
q_movies.shape

**Top 10 movies by IMDB-style weighted rating**

In [None]:
q_movies = q_movies.sort_values('score' , ascending = False)
q_movies[['title','score','vote_average','vote_count']].head(10)

In [None]:
pop = df2.sort_values('popularity' , ascending = False)
plt.figure(figsize=(12,4))
plt.barh(pop['title'].head(6),pop['popularity'].head(6))
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")

**Content Based Filtering**

## Initializing a TF-IDF vectorizer object to generate the TF-IDF matrix of movie plot overviews

---

In [None]:
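# Drop English stop words ('the', 'a', ...) so they do not dominate the vectors.
tfidf = TfidfVectorizer(stop_words='english')
# Replace missing overviews with an empty string before vectorizing.
df2['overview'] = df2['overview'].fillna('')
tfidf_matrix = tfidf.fit_transform(df2['overview'])
tfidf_matrix.shape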
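
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on three made-up overviews. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization) and skips tokenization details and stop-word removal; the helper `tfidf_vector` is hypothetical and not part of the notebook.

```python
import math
from collections import Counter

# Toy overviews (made up for illustration), already lowercased and stop-word free.
docs = ["hero saves city", "villain attacks city", "friends travel abroad"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of overviews each term appears in.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(doc):
    """Return an L2-normalized {term: weight} mapping for one tokenized overview."""
    counts = Counter(doc)
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vectors[0])
```

In the first overview, "city" receives a lower weight than "hero" because it also appears in the second overview.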

In [None]:
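# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel already equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim.shape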
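
Using a linear kernel for cosine similarity works because the TF-IDF rows have unit length. A small sketch with two arbitrary vectors (the values and helper names are assumptions for illustration) shows that the dot product of the normalized vectors matches the cosine of the originals:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Arbitrary example vectors, not taken from the TF-IDF matrix.
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)
```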

In [None]:
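# Map each title to its row position. Keep only the first row for duplicated
# titles so that indices[title] always returns a single integer.
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]
    # Pair every movie position with its similarity to the requested movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the movie itself, then keep the 10 most similar titles.
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]
    movie_indices = [i[0] for i in sim_scores]
    return df2['title'].iloc[movie_indices]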
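
The ranking step can also be looked at in isolation. The sketch below is one possible variant, not the notebook's method: it uses `heapq.nlargest` to pick the top `k` entries of a single similarity row without sorting the whole row. The function name `top_k_similar` and the sample scores are hypothetical.

```python
import heapq

def top_k_similar(sim_row, idx, k=10):
    """Return positions of the k highest scores in sim_row, skipping idx itself."""
    candidates = ((score, i) for i, score in enumerate(sim_row) if i != idx)
    return [i for _, i in heapq.nlargest(k, candidates)]

# One made-up similarity row for the movie at position 1.
sim_row = [0.1, 1.0, 0.7, 0.0, 0.4]
print(top_k_similar(sim_row, idx=1, k=2))
```

For a few thousand movies a full sort is fast enough, so this matters mainly for much larger catalogues.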

**Testing the recommender system**

In [None]:
get_recommendations('The Dark Knight Rises')

In [None]:
get_recommendations('The Godfather')

In [None]:
# These columns are stored as stringified lists of dicts; parse them into Python objects.
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)
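
As a quick illustration of what `literal_eval` does here, the sketch below parses one string shaped like a `genres` cell. The sample string is made up for the example and is not copied from the CSV.

```python
from ast import literal_eval

# A made-up cell value in the same shape as the 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)  # safely evaluates the literal into a list of dicts
names = [g["name"] for g in parsed]
print(type(raw).__name__, type(parsed).__name__, names)
```

Unlike `eval`, `literal_eval` only accepts Python literals, so it cannot execute arbitrary code found in the file.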

In [None]:
def get_director(x):
    # Return the name of the first crew member whose job is 'Director', or NaN if none is listed.
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
def get_list(x):
    # Return up to the first three names; anything that is not a list yields [].
    if isinstance(x, list):
        return [i['name'] for i in x][:3]
    return []

In [None]:
df2['director'] = df2['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)

In [None]:
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(5)

In [None]:
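def clean_data(x):
    # Lowercase and strip spaces so that multi-word names become single tokens.
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    elif isinstance(x, str):
        return x.replace(" ", "").lower()
    else:
        # Missing values (e.g. a NaN director) become an empty string.
        return ''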

In [None]:
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(clean_data)
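
Removing the spaces matters because the vectorizer splits text on whitespace. In the sketch below, two placeholder names (assumed for illustration) share a first name: left as-is they would share a token, while the cleaned versions stay distinct. `clean_data` is restated so the example runs on its own.

```python
def clean_data(x):
    # Same logic as the notebook's helper: lowercase and remove spaces.
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    elif isinstance(x, str):
        return x.replace(" ", "").lower()
    return ''

# Placeholder names, not taken from the dataset.
names = ["Alex Stone", "Alex Rivers"]

raw_tokens = [set(n.lower().split()) for n in names]
shared_raw = raw_tokens[0] & raw_tokens[1]  # tokens both names have in common
cleaned = clean_data(names)
print(shared_raw, cleaned)
```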

In [None]:
def create_soup(x):
    # Join all metadata tokens into one space-separated string per movie.
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

df2['soup'] = df2.apply(create_soup, axis=1)
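
For a single row, the resulting "soup" looks like the output of the sketch below. The row is a plain dict with invented, already cleaned values; `create_soup` is restated so the example is self-contained.

```python
def create_soup(x):
    # Same concatenation as in the notebook cell above.
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# An invented row; a dict supports the same x['column'] lookups as a DataFrame row.
row = {
    "keywords": ["spacewar", "future"],
    "cast": ["janedoe", "johnsmith"],
    "director": "alexexample",
    "genres": ["sciencefiction"],
}

soup = create_soup(row)
print(soup)
```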

**Vectorizing**

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

**Finding the Cosine Similarity Scores**

In [None]:
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
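
On count vectors, cosine similarity rewards movies whose soups share many tokens. The following pure-Python sketch computes the same quantity for three invented soups using `collections.Counter`; the helper `cosine_from_soups` and the token names are assumptions for illustration.

```python
import math
from collections import Counter

def cosine_from_soups(a, b):
    """Cosine similarity between the token-count vectors of two soup strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(sum(c * c for c in cb.values()))
    return dot / norm if norm else 0.0

# Invented soups: the first two share four of their five tokens.
soup_a = "mafia crime actorone directorone drama"
soup_b = "mafia crime actortwo directorone drama"
soup_c = "animation family comedy"

sim_ab = cosine_from_soups(soup_a, soup_b)
sim_ac = cosine_from_soups(soup_a, soup_c)
print(sim_ab, sim_ac)
```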

In [None]:
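df2 = df2.reset_index()
# Rebuild the title -> row position lookup, keeping the first row per duplicated title.
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]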

**Testing the recommender system**

In [None]:
get_recommendations('The Godfather',cosine_sim2)

In [None]:
get_recommendations('Minions', cosine_sim2)