# MOVIE RECOMMENDER #
This notebook explores three recommender models for movies from the TMDB database. The models progress in sophistication and in the features they consider. I hope it's informative. This notebook is heavily inspired by the work of Ibtesam Ahmed on Kaggle: https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system

In [1]:
# Import and get the data
import pandas as pd
import numpy as np
df1=pd.read_csv('tmdb_5000_credits.csv')
df2=pd.read_csv('tmdb_5000_movies.csv')

df1.columns= ["id", "title","cast","crew"]
df2= df2.merge(df1,on="id")

In [3]:
# We want to look at the most popular movies first so we filter by a weighted rating (to avoid small sample reviews)
C= df2['vote_average'].mean() # average score in the dataset
m= df2['vote_count'].quantile(0.9) #cutoff value for count of votes - we want movies with more votes than 90% of the dataset
q_movies= df2.copy().loc[df2['vote_count'] >= m]

# Now that we've selected the movies with the most votes, we can create a weighted rating and sort by it
def weighted_rating(x, m=m, C=C):
    v= x['vote_count']
    R= x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

q_movies['score']= q_movies.apply(weighted_rating, axis=1)
q_movies = q_movies.sort_values(by='score', ascending=False)
    

## RECOMMENDER 1 - Sensible filtering ##
This is already our first recommender model: filtering out movies with too few votes and sorting by the weighted rating gives us a billboard of great films.
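
To see what the weighted rating does, here is a small standalone sketch of the same formula. The values of `m` and `C` below are made-up round numbers chosen for easy arithmetic, not the ones computed from the dataset, and the two example movies are hypothetical.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the dataset-wide average C.

    v is the movie's vote count and m is the vote-count cutoff:
    the more votes a movie has relative to m, the more its own average counts.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values for illustration only (not taken from the dataset)
m = 1000   # vote-count cutoff
C = 6.0    # mean rating across all movies

# A movie with exactly as many votes as the cutoff is pulled halfway towards C
few_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)

# A movie with far more votes than the cutoff stays close to its own average
many_votes = weighted_rating(v=9000, R=8.0, m=m, C=C)

print(few_votes, many_votes)
```

In this toy case the movie rated 9.0 ends up below the movie rated 8.0, because its rating rests on far fewer votes.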

In [4]:
# Strong top ten by taking .head(10)
q_movies[['title_x', 'vote_count', 'vote_average', 'score']].head(10)

| | title_x | vote_count | vote_average | score |
|---|---|---|---|---|
| 1881 | The Shawshank Redemption | 8205 | 8.5 | 8.059258 |
| 662 | Fight Club | 9413 | 8.3 | 7.939256 |
| 65 | The Dark Knight | 12002 | 8.2 | 7.92002 |
| 3232 | Pulp Fiction | 8428 | 8.3 | 7.904645 |
| 96 | Inception | 13752 | 8.1 | 7.863239 |
| 3337 | The Godfather | 5893 | 8.4 | 7.851236 |
| 95 | Interstellar | 10867 | 8.1 | 7.809479 |
| 809 | Forrest Gump | 7927 | 8.2 | 7.803188 |
| 329 | The Lord of the Rings: The Return of the King | 8064 | 8.1 | 7.727243 |
| 1990 | The Empire Strikes Back | 5879 | 8.2 | 7.697884 |


## RECOMMENDER 2 - Content Based ##
Users who like one movie will probably like similar films. We can use the TMDB database's movie descriptions to find movies that are alike.

To do this, we can run a similarity search on movie descriptions. First we prepare the descriptions for transformation, then we vectorize them and find the closest vectors using a cosine similarity score (effectively, for a given movie A, finding the movie B whose vector points in the most similar direction).

The cosine similarity score looks like this:

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Because `TfidfVectorizer` L2-normalizes each row by default, every vector already has length 1, so the denominator is 1 and we can just use the `linear_kernel` dot product.
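
As a small check of the cosine similarity idea, the sketch below uses plain Python lists as stand-in vectors (made-up numbers, not real TF-IDF output). It illustrates that once two vectors are scaled to unit length, their plain dot product gives the same value as the full cosine formula.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    """Scale a vector so that its length is 1."""
    n = norm(a)
    return [x / n for x in a]

# Toy term-weight vectors (hypothetical, for illustration only)
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

full_cosine = cosine(a, b)            # full formula on the raw vectors
unit_dot = dot(a_unit, b_unit)        # plain dot product on unit-length vectors

print(full_cosine, unit_dot)
```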

In [5]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['overview'])

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Now that we have a vector space of movie descriptions, we can write our recommendation function to query the space given a movie title.

In [6]:
# Map each title to its row index; if a title appears more than once, keep only the first row
indices = pd.Series(df2.index, index=df2['title_x'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df2['title_x'].iloc[movie_indices]

get_recommendations('Interstellar')

1709         Space Pirate Captain Harlock
300                     Starship Troopers
4353                    The Green Inferno
220                            Prometheus
2260                      All Good Things
268                         Stuart Little
1352                              Gattaca
4176    Battle for the Planet of the Apes
2648                       Winnie Mandela
634                            The Matrix
Name: title_x, dtype: object

Not bad, but the plot may not be what people enjoy most about a movie. A better system would also include other metadata such as actors and directors, so we'll do that next.

In [7]:
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)

# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan


# Return the first 3 elements of the list, or the entire list if it has 3 or fewer.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Define new director, cast, genres and keywords features that are in a suitable form.
df2['director'] = df2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)
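
To make the parsing step concrete, here is a minimal sketch on a single made-up `crew` value. The string below is hypothetical and only mimics the stringified list-of-dicts shape that the code above parses; the names in it are invented.

```python
from ast import literal_eval

# A made-up crew value: a string that *looks* like a list of dicts
raw_crew = '[{"job": "Producer", "name": "Jane Doe"}, {"job": "Director", "name": "John Smith"}]'

# literal_eval safely turns the string into a real Python list of dicts
crew = literal_eval(raw_crew)

# Same idea as get_director: first crew member whose job is 'Director', else None
director = next((member["name"] for member in crew if member["job"] == "Director"), None)

# Same idea as get_list: slicing with [:3] also works when fewer than 3 names exist
top_names = [member["name"] for member in crew][:3]

print(type(crew).__name__, director, top_names)
```

Note that this sketch returns `None` for a missing director, whereas `get_director` above returns `np.nan`; either works as long as the later cleaning step treats it as missing.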

In [8]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)
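
One way to see why the spaces are stripped: without it, two different people who share a first name would share a token. The sketch below uses simple whitespace splitting as a stand-in for the vectorizer's tokenizer, which is an assumption made for illustration.

```python
def clean(name):
    """Lower-case a name and remove spaces, like clean_data does for a single string."""
    return name.lower().replace(" ", "")

# Two different actors who share a first name
names = ["Chris Evans", "Chris Pratt"]

# Without cleaning, each name splits into two tokens
tokens_raw = [set(n.lower().split()) for n in names]

# With cleaning, each name becomes one unique token
tokens_clean = [{clean(n)} for n in names]

shared_raw = tokens_raw[0] & tokens_raw[1]        # tokens the two names have in common
shared_clean = tokens_clean[0] & tokens_clean[1]

print(shared_raw, shared_clean)
```

With the raw names, the shared token `chris` would make the two actors look partly alike; after cleaning they have nothing in common.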

In [9]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df2['soup'] = df2.apply(create_soup, axis=1)

# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

In [10]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

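# Reset index of our main DataFrame and rebuild the title-to-index mapping as before
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title_x'])
indices = indices[~indices.index.duplicated(keep='first')]  # keep the first row for repeated titles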
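
Raw count vectors are not scaled to unit length, which is presumably why this step uses `cosine_similarity` rather than a plain dot product. The following sketch computes the same kind of score by hand on three made-up soups. The helper `cosine_from_counts` and all tokens are hypothetical, and whitespace splitting stands in for the vectorizer's tokenizer.

```python
import math
from collections import Counter

def cosine_from_counts(soup_a, soup_b):
    """Cosine similarity between two whitespace-tokenized strings using raw term counts."""
    counts_a, counts_b = Counter(soup_a.split()), Counter(soup_b.split())
    # Dot product over the tokens both soups share
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    # Guard against empty soups to avoid dividing by zero
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up soups in the keywords + cast + director + genres layout
soup_1 = "space robot actorone directorx sciencefiction"
soup_2 = "space alien actortwo directorx sciencefiction"
soup_3 = "wedding actorthree directorz comedy"

sim_12 = cosine_from_counts(soup_1, soup_2)   # shares keyword, director and genre
sim_13 = cosine_from_counts(soup_1, soup_3)   # shares nothing

print(sim_12, sim_13)
```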

In [12]:
get_recommendations('TRON: Legacy', cosine_sim2)

68                          Iron Man
228                         Oblivion
4401             The Helix... Loaded
83                        The Lovers
193                      After Earth
4117              Six-String Samurai
91      Independence Day: Resurgence
101               X-Men: First Class
266                         I, Robot
466                 The Time Machine
Name: title_x, dtype: object