# Recommendation System (Content-Based Filtering)

Using the [TMDB Movies Dataset](https://www.kaggle.com/datasets/ahsanaseer/top-rated-tmdb-movies-10k?fbclid=IwAR2MpWrWpcw2QNCv_FZg2l0sjBh9xAvhrqtnZBO9K-QS6PHI1aHkdB6qLa0), I build a recommender system for movie selection.
This is a personal sample project for learning how to build a simple recommendation system.

The project implements a content-based recommendation system. Collaborative filtering is described below only for comparison.

### Content-Based Filtering:

Content-based filtering recommends items similar to those a user has liked or interacted with in the past, based on the characteristics of the items themselves (e.g., tags, genre, actors, directors, description).

Approach: 
 - It relies on analyzing the features of items (content) and creating user profiles based on their preferences. Recommendations are made by matching the content features of items with the user profile.

Example: 
- In a movie recommendation system, if a user has liked action movies in the past, the system will recommend other action movies with similar attributes (e.g., tags, genre, actors, directors, description).
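
The approach can be sketched in a few lines of plain Python. This is only a toy illustration, not the method used later in this notebook: the movie names, the genre vectors and the helper names (`build_profile`, `score`, `recommend`) are all made up for the example.

```python
# Hypothetical item features: genre flags in the order [action, comedy, drama]
items = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 0, 1],
    "Movie C": [0, 1, 0],
    "Movie D": [1, 1, 0],
    "Movie E": [0, 0, 1],
}

def build_profile(liked):
    """Average the feature vectors of the items the user liked."""
    vectors = [items[name] for name in liked]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def score(features, profile):
    """Dot product: how strongly an item's features overlap with the profile."""
    return sum(f * p for f, p in zip(features, profile))

def recommend(liked, top_n=2):
    """Rank the items the user has not seen by their match with the user profile."""
    profile = build_profile(liked)
    candidates = [name for name in items if name not in liked]
    return sorted(candidates, key=lambda name: score(items[name], profile), reverse=True)[:top_n]

recommendations = recommend(["Movie A", "Movie B"])
print(recommendations)
```

A user who liked two action movies gets a profile dominated by the action flag, so the remaining action movie ranks first.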

Strengths:
 - Doesn't require historical data from other users.
 - Can provide recommendations for new or unpopular items.
 - Can provide explanations for recommendations based on item features.

Weaknesses:
 - Limited to recommending items similar to those the user has interacted with before.
 - May suffer from the "filter bubble" problem, where recommendations are overly similar to the user's existing preferences.
 - Requires a good understanding of item features and user preferences, which can be challenging to obtain.

### Collaborative Filtering:

Collaborative filtering recommends items to a user based on the preferences of other users with similar tastes or behavior, for example by using their ratings.

Approach: 
 - It builds a user-item matrix representing the interactions between users and items (e.g., ratings, purchases) and identifies similarities between users or items.

Example: 
 - If two users have rated or interacted with similar items in the past, collaborative filtering will recommend items liked by one user to the other and vice versa.
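
As a rough sketch of the user-based variant, the snippet below finds the user whose ratings are closest to the target user's and suggests what that neighbour rated but the target has not. The ratings, user names and the distance measure (mean absolute difference on co-rated items) are assumptions chosen for brevity; real systems typically use more robust similarity measures.

```python
# Hypothetical user-item ratings (a missing key means "not rated")
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":   {"Movie A": 5, "Movie B": 5, "Movie D": 4},
    "carol": {"Movie C": 5, "Movie E": 4},
}

def rating_distance(u, v):
    """Mean absolute rating difference over items both users rated (inf if none)."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return float("inf")
    return sum(abs(ratings[u][m] - ratings[v][m]) for m in shared) / len(shared)

def recommend_for(user):
    """Suggest items rated by the closest other user that `user` has not rated."""
    others = [u for u in ratings if u != user]
    nearest = min(others, key=lambda u: rating_distance(user, u))
    unseen = {m: r for m, r in ratings[nearest].items() if m not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

suggestions = recommend_for("alice")
print(suggestions)
```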

Strengths:
 - Can capture complex user preferences without requiring explicit item features.
 - Can recommend items outside a user's current preferences, potentially leading to serendipitous discoveries.

Weaknesses:
 - Suffers from the "cold start" problem: it relies heavily on historical user-item interactions, so it struggles with new users and with new or niche items that have little data.
 - Vulnerable to the "popularity bias," where popular items are recommended more frequently, leading to a lack of diversity in recommendations.
 - May suffer from scalability issues with large datasets due to the computational complexity of calculating user or item similarities.

In [300]:
#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [301]:
df_ori = pd.read_csv('top10K-TMDB-movies.csv')
df_ori.head()

|   | id | title | genre | original_language | overview | popularity | release_date | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 278 | The Shawshank Redemption | Drama,Crime | en | Framed in the 1940s for the double murder of h... | 94.075 | 1994-09-23 | 8.7 | 21862 |
| 1 | 19404 | Dilwale Dulhania Le Jayenge | Comedy,Drama,Romance | hi | Raj is a rich, carefree, happy-go-lucky second... | 25.408 | 1995-10-19 | 8.7 | 3731 |
| 2 | 238 | The Godfather | Drama,Crime | en | Spanning the years 1945 to 1955, a chronicle o... | 90.585 | 1972-03-14 | 8.7 | 16280 |
| 3 | 424 | Schindler's List | Drama,History,War | en | The true story of how businessman Oskar Schind... | 44.761 | 1993-12-15 | 8.6 | 12959 |
| 4 | 240 | The Godfather: Part II | Drama,Crime | en | In the continuing saga of the Corleone crime f... | 57.749 | 1974-12-20 | 8.6 | 9811 |


## Step 1: Data Understanding

In [302]:
# Make a copy of the DataFrame to avoid modifying the original
df = df_ori.copy() 

In [303]:
df.shape

(10000, 9)

In [304]:
df.describe()

|   | id | popularity | vote_average | vote_count |
|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 161243.505 | 34.697267 | 6.62115 | 1547.3094 |
| std | 211422.046043 | 211.684175 | 0.766231 | 2648.295789 |
| min | 5.0 | 0.6 | 4.6 | 200.0 |
| 25% | 10127.75 | 9.15475 | 6.1 | 315.0 |
| 50% | 30002.5 | 13.6375 | 6.6 | 583.5 |
| 75% | 310133.5 | 25.65125 | 7.2 | 1460.0 |
| max | 934761.0 | 10436.917 | 8.7 | 31917.0 |


In [305]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 10000 non-null  int64  
 1   title              10000 non-null  object 
 2   genre              9997 non-null   object 
 3   original_language  10000 non-null  object 
 4   overview           9987 non-null   object 
 5   popularity         10000 non-null  float64
 6   release_date       10000 non-null  object 
 7   vote_average       10000 non-null  float64
 8   vote_count         10000 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 703.2+ KB


In [306]:
# check for duplicates
df.duplicated().sum()

0

In [307]:
# check for missing values or NaN
df.isnull().sum()

id                    0
title                 0
genre                 3
original_language     0
overview             13
popularity            0
release_date          0
vote_average          0
vote_count            0
dtype: int64

In [308]:
# check for unique value on each column
df.nunique()

id                   10000
title                 9661
genre                 2123
original_language       43
overview              9985
popularity            8511
release_date          6113
vote_average            42
vote_count            3191
dtype: int64

In [309]:
df.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

#### Observations:
 - A few values are missing (3 in genre, 13 in overview). Since they affect only a small fraction of the 10,000 rows, those rows can simply be dropped.

## Step 2: Data Processing


### Remove missing values

In [310]:
# Remove rows with any missing values
df = df.dropna()

In [311]:
df.isnull().sum()

id                   0
title                0
genre                0
original_language    0
overview             0
popularity           0
release_date         0
vote_average         0
vote_count           0
dtype: int64

In [312]:
df.shape

(9985, 9)

### Combine overview and genre columns into a tags column

In [313]:
# Combine the overview and genre columns into a single tags column

df['tags'] = df['overview'] + ', ' + df['genre']

In [314]:
# Drop the columns that were merged into tags
df = df.drop(['overview', 'genre'], axis=1)
df.head()

|   | id | title | original_language | popularity | release_date | vote_average | vote_count | tags |
|---|---|---|---|---|---|---|---|---|
| 0 | 278 | The Shawshank Redemption | en | 94.075 | 1994-09-23 | 8.7 | 21862 | Framed in the 1940s for the double murder of h... |
| 1 | 19404 | Dilwale Dulhania Le Jayenge | hi | 25.408 | 1995-10-19 | 8.7 | 3731 | Raj is a rich, carefree, happy-go-lucky second... |
| 2 | 238 | The Godfather | en | 90.585 | 1972-03-14 | 8.7 | 16280 | Spanning the years 1945 to 1955, a chronicle o... |
| 3 | 424 | Schindler's List | en | 44.761 | 1993-12-15 | 8.6 | 12959 | The true story of how businessman Oskar Schind... |
| 4 | 240 | The Godfather: Part II | en | 57.749 | 1974-12-20 | 8.6 | 9811 | In the continuing saga of the Corleone crime f... |


### Convert text into vectors.

There are multiple options for converting text into vectors:
1. Bag of words
2. TF-IDF
3. Word embeddings
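
To make the difference between the first two options concrete, here is a small pure-Python sketch on a made-up three-document corpus. It uses the plain textbook weighting `count * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the tags column
docs = [
    "crime family drama",
    "crime thriller",
    "romantic comedy drama",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Bag of words: raw term counts per document, one column per vocabulary word
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

# Document frequency: the number of documents each word appears in
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}

# TF-IDF: term count weighted by log(N / document frequency),
# so words that occur in fewer documents get a larger weight
n_docs = len(docs)
tfidf = [
    [count * math.log(n_docs / doc_freq[word]) for count, word in zip(row, vocabulary)]
    for row in bow
]

print(vocabulary)
print(bow[0])
print([round(weight, 3) for weight in tfidf[0]])
```

In the first document, "family" appears in only one document and therefore gets a higher TF-IDF weight than "crime" or "drama", even though all three have the same bag-of-words count.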

In [315]:
text_data = df['tags'].tolist()
# Cap the vocabulary size at the number of rows (an arbitrary limit)
max_features = df.shape[0]

In [316]:
# TF-IDF (Term Frequency-Inverse Document Frequency):

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_converter(text_data: list, max_features: int) -> np.ndarray:
    """Convert a list of documents into a dense TF-IDF matrix (one row per document)."""

    # Create a TfidfVectorizer that keeps the most frequent terms and ignores English stop words
    tfidf_vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english')

    # Fit the vectorizer to the data and transform the text into a TF-IDF vector representation
    tfidf_vectors = tfidf_vectorizer.fit_transform(text_data)

    # Convert the sparse matrix to a dense array so it can be printed;
    # this costs a lot of memory, and cosine_similarity also accepts the sparse matrix
    tfidf_vectors_array = tfidf_vectors.toarray()

    print(f"TFIDF vectors: \n {tfidf_vectors_array}")
    print(f"TFIDF shape: {tfidf_vectors_array.shape}")

    return tfidf_vectors_array

vector_text = tfidf_converter(text_data, max_features)

TFIDF vectors: 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
TFIDF shape: (9985, 9985)


In [317]:
# # Using BoW to convert text to vector for tags columns

# # BoW
# from sklearn.feature_extraction.text import CountVectorizer

# def bag_of_words_converter(text_data:list, max_features:int) -> np.array:

#     # Create a CountVectorizer object
#     vectorizer = CountVectorizer(max_features=max_features, stop_words='english')

#     # Fit the vectorizer to the data and transform the text into a bag-of-words vector representation
#     bow_vectors = vectorizer.fit_transform(text_data)

#     # Convert the sparse matrix to a dense array for visualization (optional)
#     bow_vectors_array = bow_vectors.toarray()

#     print(f"Bag-of-Words vectors: \n {bow_vectors_array}")
#     print(f"Bag-of-Words shape: {bow_vectors_array.shape}")

#     return bow_vectors_array

# vector_text = bag_of_words_converter(text_data, max_features)

#### Cosine Similarity

Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. It ranges from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality (perpendicularity). Because Bag-of-Words and TF-IDF vectors have no negative components, the scores in this notebook fall between 0 and 1.

In the context of text analysis, cosine similarity is often used to quantify the similarity between two documents represented as numerical vectors, such as Bag-of-Words vectors or TF-IDF vectors. It measures how similar the documents are in terms of their word frequencies or TF-IDF values.
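
For a single pair of vectors the definition is cos(θ) = (u · v) / (‖u‖ ‖v‖). The helper below (the name `cosine_similarity_pair` is mine, not a library function) spells that out in plain Python. The notebook itself relies on scikit-learn's `cosine_similarity`, which computes the same quantity for all pairs of rows at once.

```python
import math

def cosine_similarity_pair(u, v):
    """Return (u . v) / (||u|| * ||v||) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors: same direction, different length
same_direction = cosine_similarity_pair([1, 2, 0], [2, 4, 0])
# No shared non-zero components, e.g. two documents with no terms in common
orthogonal = cosine_similarity_pair([1, 0, 0], [0, 3, 0])
# Opposite directions need negative components, which TF-IDF vectors never have
opposite = cosine_similarity_pair([1, 1], [-1, -1])

print(same_direction, orthogonal, opposite)
```

Note that the result depends only on direction, not on vector length, which is why a long and a short overview about the same topic can still score as similar.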

In [318]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity within tags
similarities = cosine_similarity(vector_text)

similarities

array([[1.        , 0.01619478, 0.02738017, ..., 0.06024391, 0.0466822 ,
        0.03040782],
       [0.01619478, 1.        , 0.01189414, ..., 0.        , 0.00355157,
        0.        ],
       [0.02738017, 0.01189414, 1.        , ..., 0.0126931 , 0.02284379,
        0.02358997],
       ...,
       [0.06024391, 0.        , 0.0126931 , ..., 1.        , 0.00651959,
        0.00553894],
       [0.0466822 , 0.00355157, 0.02284379, ..., 0.00651959, 1.        ,
        0.0094703 ],
       [0.03040782, 0.        , 0.02358997, ..., 0.00553894, 0.0094703 ,
        1.        ]])

In [319]:
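# Find the row position of the selected movie. Use the position, not the index label:
# labels no longer match positions after dropna(), and similarities is positional.
index = int(np.flatnonzero(df['title'].to_numpy() == 'The Godfather')[0])
index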

2

In [320]:
sorted(enumerate(similarities[index]), reverse=True, key=lambda pair: pair[1])

[(2, 1.0),
 (4, 0.4118643362110689),
 (1611, 0.2111518667348452),
 (7416, 0.1676058050726227),
 (214, 0.13022896145441593),
 (9355, 0.12717772883852851),
 (656, 0.12217518682985025),
 (3126, 0.11149645404483577),
 (4444, 0.11054092899863752),
 (1816, 0.10884791944891335),
 (5359, 0.10691167872007531),
 (5515, 0.10642122976022807),
 (330, 0.10587579228714653),
 (495, 0.10309817458420588),
 (3551, 0.1016547641058298),
 (6585, 0.1011193367754344),
 (6962, 0.10068648316292227),
 (9816, 0.10036611666746298),
 (3054, 0.10013986356044799),
 (8564, 0.10009502470043402),
 (1120, 0.09882170885888697),
 (951, 0.09723216615192687),
 (1755, 0.09654372268445657),
 (9239, 0.09601624466206816),
 (1842, 0.09513295224205501),
 (6021, 0.0944537623910672),
 (7046, 0.0935132740817375),
 (5500, 0.09340018564860014),
 (194, 0.09334909130328087),
 (5952, 0.092990050867966),
 (250, 0.09264088422437648),
 (3349, 0.09137401902001635),
 (434, 0.0912178327974042),
 (3669, 0.09106463602318986),
 (153, 0.09094555068

In [321]:
# Pair each similarity score with its row position, then sort by score, highest first
distance = sorted(enumerate(similarities[index]), reverse=True, key=lambda pair: pair[1])
# Print the 10 most similar movies, skipping position 0 (the movie itself)
for i in distance[1:11]:
    print(df.iloc[i[0]].title)

# TODO: optionally re-rank the results, e.g. top 5 by vote_average or popularity


The Godfather: Part II
The Godfather: Part III
Blood Ties
The Best of Youth
Proud Mary
The Color Purple
Extremely Wicked, Shockingly Evil and Vile
Four Brothers
Road to Perdition
Joe


In [322]:
def get_recommendation(movie_name: str) -> list:
    """Return the titles of the movies most similar to movie_name (empty list if not found)."""
    # Match case-insensitively; str.title() would mangle titles such as "Part II"
    matches = np.flatnonzero(df['title'].str.lower().to_numpy() == movie_name.strip().lower())

    if len(matches) == 0:
        print("Error: Given movie name is invalid")
        return []

    # Use the row position, not the index label: similarities and iloc are positional
    index = int(matches[0])
    distance = sorted(enumerate(similarities[index]), reverse=True, key=lambda x: x[1])

    # Keep the 9 most similar movies, skipping position 0 (the movie itself)
    recommended_movies = [df['title'].iloc[i] for i, _ in distance[1:10]]

    # TODO: optionally re-rank the results, e.g. top 5 by vote_average or popularity
    return recommended_movies

In [323]:
get_recommendation('The Godfather')

['The Godfather: Part II',
 'The Godfather: Part III',
 'Blood Ties',
 'The Best of Youth',
 'Proud Mary',
 'The Color Purple',
 'Extremely Wicked, Shockingly Evil and Vile',
 'Four Brothers',
 'Road to Perdition']
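
The code comments above raise the idea of re-ranking the results by `vote_average` or `popularity`. One possible way to do that, sketched here on made-up candidates, is to shortlist by similarity first and only then sort the shortlist by rating. The function name, its parameters and the sample values are assumptions for illustration; in the notebook the candidates would come from `similarities` and `df`.

```python
def rerank_by_rating(candidates, shortlist_size=10, top_n=5):
    """Keep the `shortlist_size` most similar candidates, then order them by vote_average."""
    shortlist = sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:shortlist_size]
    best_rated = sorted(shortlist, key=lambda c: c["vote_average"], reverse=True)[:top_n]
    return [c["title"] for c in best_rated]

# Hypothetical candidates with made-up scores
candidates = [
    {"title": "Movie A", "similarity": 0.41, "vote_average": 8.6},
    {"title": "Movie B", "similarity": 0.21, "vote_average": 7.4},
    {"title": "Movie C", "similarity": 0.17, "vote_average": 6.1},
    {"title": "Movie D", "similarity": 0.05, "vote_average": 8.9},
]

reranked = rerank_by_rating(candidates, shortlist_size=3, top_n=2)
print(reranked)
```

Shortlisting first keeps highly rated but unrelated movies (such as "Movie D" here) out of the result.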

## Step 3: Save & Export Dataset and Similarities Files

In [324]:
import pickle

In [327]:
# save the movies DataFrame as a pickle file; `with` closes the file handle afterwards
with open('../model/movies_list.pkl', 'wb') as f:
    pickle.dump(df, f)

In [328]:
# save the similarity matrix as a pickle file; `with` closes the file handle afterwards
with open('../model/similarities.pkl', 'wb') as f:
    pickle.dump(similarities, f)
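
Before relying on the exported files elsewhere, it can be worth checking that an object survives the save/load round trip. The sketch below does this with a small stand-in object and a temporary directory; the payload and the file name are hypothetical and are not the notebook's real artifacts.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the objects saved above
payload = {"titles": ["Movie A", "Movie B"], "similarities": [[1.0, 0.3], [0.3, 1.0]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifacts.pkl")  # hypothetical file name

    # Write the object; the file is closed when the `with` block ends
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    # Read it back and compare with the original
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

Pickle files can execute arbitrary code when loaded, so only load files you created yourself or otherwise trust.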

In [329]:
# load the movies list back to check the export
with open('../model/movies_list.pkl', 'rb') as f:
    movies_list = pickle.load(f)
movies_list

|   | id | title | original_language | popularity | release_date | vote_average | vote_count | tags |
|---|---|---|---|---|---|---|---|---|
| 0 | 278 | The Shawshank Redemption | en | 94.075 | 1994-09-23 | 8.7 | 21862 | Framed in the 1940s for the double murder of h... |
| 1 | 19404 | Dilwale Dulhania Le Jayenge | hi | 25.408 | 1995-10-19 | 8.7 | 3731 | Raj is a rich, carefree, happy-go-lucky second... |
| 2 | 238 | The Godfather | en | 90.585 | 1972-03-14 | 8.7 | 16280 | Spanning the years 1945 to 1955, a chronicle o... |
| 3 | 424 | Schindler's List | en | 44.761 | 1993-12-15 | 8.6 | 12959 | The true story of how businessman Oskar Schind... |
| 4 | 240 | The Godfather: Part II | en | 57.749 | 1974-12-20 | 8.6 | 9811 | In the continuing saga of the Corleone crime f... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 10196 | The Last Airbender | en | 98.322 | 2010-06-30 | 4.7 | 3347 | The story follows the adventures of Aang, a yo... |
| 9996 | 331446 | Sharknado 3: Oh Hell No! | en | 12.490 | 2015-07-22 | 4.7 | 417 | The sharks take bite out of the East Coast whe... |
| 9997 | 13995 | Captain America | en | 18.333 | 1990-12-14 | 4.6 | 332 | During World War II, a brave, patriotic Americ... |
| 9998 | 2312 | In the Name of the King: A Dungeon Siege Tale | en | 15.159 | 2007-11-29 | 4.7 | 668 | A man named Farmer sets out to rescue his kidn... |
