# Part 2: Content-based Filtering Recommender System

## Section A: Introduction

▪ In this practical session, we learn how to build a basic content-based recommender system using The Movies Dataset, which is publicly available on Kaggle.

▪ To achieve this, we will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

\>>> **(Full dataset can be downloaded here)** https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv

\>>> **(The reference of this practical)** https://www.datacamp.com/community/tutorials/recommender-systems-python

### Content-based Filtering Recommender Systems

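▪ Content-based recommender systems recommend items that are similar to an item the user already likes, where similarity is computed from the items' own attributes (in this practical, genres and plot descriptions).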

<img src="content.png" width="350">

## Section B: Data Exploration

### Loading Dataset into Dataframe

In [None]:
import pandas as pd

movies_data = pd.read_csv('movies_metadata.csv', low_memory = False)

In [None]:
movies_data.shape

In [None]:
movies_data.head(10)

### Retrieving All Columns' Names

In [None]:
movies_data.columns

### Identifying the Best Indicator of Similarity (Part 1) 

▪ Assumption: **If two movies fall under the same category, then they might be similar to a certain extent**.

In [None]:
movies_data.genres

### Understanding the Content of Genres

In [None]:
# Each movie can be categorized under more than one genre
movies_data.genres[0]

In [None]:
# Each genres value is stored as a string, not as a list
print(type(movies_data.genres[0]))
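
▪ Because each genres value is a string, substring tests work on it directly, but extracting the individual genre names requires parsing it first. The following is a minimal sketch, assuming the string is a Python-literal list of dictionaries with a `name` key. The sample value and the `genre_names` helper are illustrative and not part of the original practical.

```python
import ast

def genre_names(raw):
    """Parse a genres string into a list of genre names (hypothetical helper)."""
    # ast.literal_eval evaluates Python literals only, so no arbitrary code is run
    return [entry['name'] for entry in ast.literal_eval(raw)]

# Illustrative value, assumed to mirror the format of the genres column
sample = "[{'id': 878, 'name': 'Science Fiction'}, {'id': 18, 'name': 'Drama'}]"
print(genre_names(sample))
```

Once parsed, the names can be compared exactly instead of relying on substring matches.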

In [None]:
import re

size = len(movies_data.genres)
print(size)

In [None]:
movies_data_2 = [movies_data.genres[index] for index in range(size) if re.search('Science Fiction', movies_data.genres[index])]
movies_data_2

In [None]:
print(type(movies_data_2))

### DataFrame Slicing using str.contains()

▪ The loc property accesses a group of rows and columns by label(s) or by a boolean array; here, str.contains() supplies the boolean mask.

In [None]:
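# Filter first, then copy only the matching rows; na = False treats missing genres as non-matches
movies_data_2 = movies_data.loc[movies_data['genres'].str.contains('Science Fiction', na = False)].copy()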
movies_data_2.head()

### Recommending Movies Based on Genres

In [None]:
movies_data_3 = movies_data_2[['original_title', 'release_date', 'genres']]
movies_data_3.head()

### Question: Is genre a good indicator of Similarity? Are Powder and Screamers similar movies?

### Identifying the Best Indicator of Similarity (Part 2)

▪ Assumption: **If two movies have similar plots, then they might be similar to a certain extent**.

In [None]:
movies_data['overview'].head()

## Section C: Feature Extraction

### TfidfVectorizer

▪ Scikit-learn's built-in TfidfVectorizer class is used to produce the TF-IDF matrix:

\>>> Import TfidfVectorizer from scikit-learn.

\>>> Replace not-a-number values with a blank string.

\>>> Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic.

\>>> Finally, construct the TF-IDF matrix on the data.

In [None]:
# Count the movies whose overview is missing (NaN)
movies_data['overview'].isna().sum()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words = 'english')

# Replace NaN with an empty string
movies_data['overview'] = movies_data['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_data['overview'])

In [None]:
print(type(tfidf_matrix))

### Question: What do the two numbers printed by the shape property represent?

In [None]:
# Check the shape of tfidf_matrix
tfidf_matrix.shape

In [None]:
tfidf.vocabulary_

### Useless Features vs. Useful Features? Data Preprocessing?

In [None]:
tfidf.get_feature_names_out()

In [None]:
tfidf.get_feature_names_out()[0:500]

## Section D: Similarity Computation

▪ With the matrix, **cosine similarity** can be used to calculate a numeric quantity that denotes the similarity between two movies.

▪ The syntax is **cosine_similarity(X, Y=None, dense_output=True)**

\>>> X (either an ndarray or a sparse matrix) is the input data.

\>>> Y (either an ndarray or a sparse matrix) is an optional second set of samples. If None, the output will be the pairwise similarities between all samples in X.
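
▪ For two vectors a and b, cosine similarity is the dot product a · b divided by the product of their lengths. The sketch below checks this definition against cosine_similarity() on three small vectors; the values are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy vectors (illustrative values, not taken from the dataset)
X = np.array([[1.0, 0.0, 1.0],
              [2.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

# Manual computation: dot products divided by the product of the vector norms
norms = np.linalg.norm(X, axis=1)
manual = (X @ X.T) / np.outer(norms, norms)

# With Y omitted, every row of X is compared with every row of X
pairwise = cosine_similarity(X)

print(np.round(pairwise, 3))
print(np.allclose(manual, pairwise))
```

The first two vectors point in the same direction, so their similarity is 1 regardless of length, while the third is orthogonal to both and scores 0.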

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the pairwise cosine similarity between all movies
# Note: the result is a dense n x n array, so memory use grows quadratically with the number of movies
cosine_sim_1 = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim_1.shape

In [None]:
print(type(cosine_sim_1))

### Question: What is cosine_sim_1?

In [None]:
# Print the first 6 rows and 6 columns
for i in range(6):
    print(cosine_sim_1[i][:6])

### linear_kernel()

▪ TfidfVectorizer L2-normalizes each row by default (norm='l2'), so every movie vector has unit length and the dot product of two vectors is already their cosine similarity.

▪ Therefore, we can use sklearn's linear_kernel() instead of cosine_similarity(); it skips the normalization step and is typically faster.
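
▪ The equivalence can be checked on a small corpus. This is a minimal sketch with made-up documents, assuming the vectorizer's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up documents for illustration
docs = [
    'Astronauts explore a distant galaxy.',
    'Rebels fight an empire across the galaxy.',
    'A farmer raises cows in a quiet village.',
]

matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Each non-empty row has unit L2 norm under the default settings
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
print(np.allclose(row_norms, 1.0))

# For unit-length rows, the plain dot product equals the cosine similarity
print(np.allclose(linear_kernel(matrix, matrix), cosine_similarity(matrix, matrix)))
```

The shortcut relies on the rows being normalized: if the vectorizer were created with norm=None, linear_kernel() would return raw dot products rather than cosine similarities.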

In [None]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim_2 = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim_2.shape

In [None]:
print(type(cosine_sim_2))

In [None]:
# Print the first 6 rows and 6 columns
for i in range(6):
    print(cosine_sim_2[i][:6])

### cosine_similarity() vs. linear_kernel()

https://campus.datacamp.com/courses/feature-engineering-for-nlp-in-python/tf-idf-and-similarity-scores?ex=9

In [None]:
import time
from sklearn.metrics.pairwise import linear_kernel

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim_lk = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

In [None]:
import time
from sklearn.metrics.pairwise import cosine_similarity

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim_cs = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

## Section E: Recommending Movies

▪ Create a reverse mapping from movie titles to DataFrame indices, so that a title can be looked up to find its row position.

In [None]:
tempo = movies_data[['title']]
tempo

In [None]:
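# Create a pandas Series that maps each title (index) to its row position (value)
indices = pd.Series(movies_data.index, index = movies_data['title'])

# Keep the first entry for each duplicated title so that a lookup returns a single index
# (drop_duplicates() would compare the values, which are already unique, not the titles)
indices = indices[~indices.index.duplicated(keep = 'first')]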

# Check the first 10 indices
indices[:10]
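
▪ Looking up a title in such a Series returns a single integer only when the title is unique. The sketch below illustrates this on a toy frame with a repeated title; the data and the `lookup` names are made up for illustration.

```python
import pandas as pd

# Toy frame with a repeated title (illustrative data)
toy = pd.DataFrame({'title': ['Heat', 'Sabrina', 'Heat']})

lookup = pd.Series(toy.index, index=toy['title'])

# A repeated title maps to several rows, so the lookup returns a Series
print(type(lookup['Heat']))

# Keep only the first row for each title so every lookup returns one integer
unique_lookup = lookup[~lookup.index.duplicated(keep='first')]
print(unique_lookup['Heat'], unique_lookup['Sabrina'])
```

Keeping the first occurrence is one possible policy; a real application might instead disambiguate repeated titles by release year.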

In [None]:
indices.shape

In [None]:
print(type(indices))

### enumerate()

▪ The built-in enumerate() function adds a counter to an iterable and returns it as an enumerate object.

▪ This enumerate object can then be used directly in for loops or converted into a list of tuples using list().

https://www.geeksforgeeks.org/enumerate-in-python/

In [None]:
# Python program to illustrate enumerate function
list_1 = ["eat", "sleep", "repeat"]
  
# Creating enumerate objects
obj_1 = enumerate(list_1)
  
print("Return type:", type(obj_1))
print(list(enumerate(list_1)))

### get_recommendations()

▪ To build a content-based recommender, we define a function that takes a movie title as input and returns the 10 most similar movies.

▪ The steps are as follows:

\>>> Get the index of the movie given its title.

\>>> Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

\>>> Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

\>>> Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

\>>> Return the titles corresponding to the indices of the top elements.

In [None]:
def get_recommendations(title, cosine_sim = cosine_sim_1):

    # Get the index of the movie that matches the title
    index = indices[title]
    # print(index) 
    
    # Get the pairwise similarity scores of all movies with the selected movie
    sim_scores = list(enumerate(cosine_sim[index]))
    # print(sim_scores)
    
    # Sort the movies based on the similarity scores in descending order
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    # print(sim_scores)
    
    # Get the scores of the top 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]
    # print(sim_scores)

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # print(movie_indices)
    
    # Return the top 10 most similar movies
    return movies_data['title'].iloc[movie_indices]

In [None]:
get_recommendations('The Dark Knight Rises')
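
▪ The whole pipeline can also be run end to end on a toy dataset, which is useful when the full similarity matrix is too large to experiment with. This is a sketch only: the titles, the overviews, and the `recommend` function are made up for illustration, and it excludes the query movie explicitly instead of skipping the first sorted entry.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up movies for illustration
toy_movies = pd.DataFrame({
    'title': ['Space Quest', 'Galaxy War', 'Farm Life', 'Star Voyage'],
    'overview': [
        'Astronauts explore a distant galaxy in a spaceship.',
        'Rebels fight an empire across the galaxy in a spaceship.',
        'A farmer raises cows and chickens in a quiet village.',
        'Astronauts explore a distant planet.',
    ],
})

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_movies['overview'])
toy_sim = linear_kernel(toy_matrix, toy_matrix)
toy_indices = pd.Series(toy_movies.index, index=toy_movies['title'])

def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title (hypothetical helper)."""
    index = toy_indices[title]
    scores = toy_sim[index].copy()
    scores[index] = -1.0  # exclude the query movie explicitly
    order = np.argsort(scores)[::-1][:top_n]  # highest scores first
    return toy_movies['title'].iloc[order].tolist()

print(recommend('Space Quest'))
```

Excluding the query by index avoids relying on it being ranked first, which is not guaranteed when another movie has an identical overview or when the query's overview is empty.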

## Section F: Exercise

### Question 1: Credits, Genres, and Keywords Based Recommender

Build a recommender system based on the following metadata: the 3 top actors, the director, related genres, and the movie plot keywords.

Reference: https://www.datacamp.com/community/tutorials/recommender-systems-python

### Question 2: Popularity Filter

Build a recommender that would take the 30 most similar movies, calculate the weighted ratings (using the IMDB formula from above), sort movies based on this rating, and return the top 10 movies.