### General
* RUNNING ON COLAB: Make a folder in your Google Drive called `datascience_data`, and upload all the .csv files there
* RUNNING LOCALLY: Make a folder called `datascience_data` in the same directory as this notebook, and put all the .csv files there

### Recommendation
* In order to get a recommendation, provide a **full title** of a movie that you like.


In [1]:
# Ignores warnings in output
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import json

# # Run on Colab

from google.colab import drive
drive.mount('/content/drive')
# Define the path to the file in Google Drive
drive_path = 'drive/My Drive/datascience_data'

# # Run on local (folder: data)
# drive_path = 'datascience_data'

# Define the file names that are needed for the code
file_names = ['keywords.csv','movies_metadata.csv']

# Load the data into Pandas DataFrames
keywords_df = pd.read_csv(drive_path + '/' + file_names[0])
movies_metadata_df = pd.read_csv(drive_path + '/' + file_names[1])

Mounted at /content/drive


### 1. Data Preprocessing
Let's begin with the inspection and cleaning of each dataset. We start by checking for missing values, duplicates, and data types to ensure that the data is consistent. After that, we merge the relevant information from each dataset into a single DataFrame for further processing.

In [3]:
# Data inspection and cleaning for each dataset

# Data cleaning for 'movies_metadata_df'
# Remove duplicates and ensure that 'id' is of integer type
movies_metadata_df = movies_metadata_df.drop_duplicates(subset='id')
movies_metadata_df = movies_metadata_df[movies_metadata_df['id'].apply(lambda x: str(x).isdigit())]
movies_metadata_df['id'] = movies_metadata_df['id'].astype(int)

# Data cleaning for 'keywords_df'
# Remove duplicates and ensure that 'id' is of integer type
keywords_df = keywords_df.drop_duplicates('id')
keywords_df['id'] = keywords_df['id'].astype(int)

# We will also extract the genres information and convert it from JSON string to list
def parse_genres(genres_string):
    try:
        return [genre['name'] for genre in json.loads(genres_string.replace("'", "\""))]
    except json.decoder.JSONDecodeError:
        return []  # In case of error, return an empty list

# Apply the genres parsing function to the 'genres' column
movies_metadata_df['genres'] = movies_metadata_df['genres'].apply(parse_genres)

# Merge the datasets on 'id' column including 'genres'
merged_df = pd.merge(movies_metadata_df[['id', 'title', 'overview', 'genres', 'popularity']],
                     keywords_df,
                     on='id',
                     how='left')

# Display the merged DataFrame structure and check for nulls
print(merged_df.head())
merged_df.isnull().sum()



      id                        title  \
0    862                    Toy Story   
1   8844                      Jumanji   
2  15602             Grumpier Old Men   
3  31357            Waiting to Exhale   
4  11862  Father of the Bride Part II   

                                            overview  \
0  Led by Woody, Andy's toys live happily in his ...   
1  When siblings Judy and Peter discover an encha...   
2  A family wedding reignites the ancient feud be...   
3  Cheated on, mistreated and stepped on, the wom...   
4  Just when George Banks has recovered from his ...   

                         genres popularity  \
0   [Animation, Comedy, Family]  21.946943   
1  [Adventure, Fantasy, Family]  17.015539   
2             [Romance, Comedy]    11.7129   
3      [Comedy, Drama, Romance]   3.859495   
4                      [Comedy]   8.387519   

                                            keywords  
0  [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...  
1  [{'id': 10090, 'name': 'bo

id              0
title           3
overview      954
genres          0
popularity      3
keywords        1
dtype: int64

Next, we handle the missing values in the `overview` and `keywords` columns by filling them with empty values.
We then preprocess the text data: the JSON-like strings in the `keywords` column are converted into lists of keyword names and combined with the overview and genres. Stop-word removal and punctuation handling are left to the TF-IDF vectorizer in a later step; no stemming or lemmatization is applied.
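
The `keywords` strings shown in the output above are Python-style list literals with single quotes. Swapping the quotes before calling `json.loads` can fail when a name itself contains an apostrophe, and the row then silently falls back to an empty list. A minimal sketch of an alternative parser based on `ast.literal_eval` follows. The helper name `parse_name_list` and the sample string are illustrative assumptions, not part of the notebook.

```python
import ast

def parse_name_list(raw):
    """Parse a stringified list of {'id': ..., 'name': ...} dicts into a list of names."""
    if not isinstance(raw, str):
        return []  # e.g. NaN for a missing value
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(items, list):
        return []
    return [d['name'] for d in items if isinstance(d, dict) and 'name' in d]

# Hypothetical sample: the second name contains an apostrophe
sample = "[{'id': 931, 'name': 'jealousy'}, {'id': 1, 'name': \"new year's eve\"}]"
print(parse_name_list(sample))        # ['jealousy', "new year's eve"]
print(parse_name_list(float('nan')))  # []
```

Swapping this parser in would likely change the vocabulary size reported later, since more keywords would survive parsing.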

In [4]:
# Handling missing values
# For the 'overview' column, if the overview is missing, we fill it with an empty string
# For the 'keywords' column, if the keywords are missing, we fill it with the string '[]', which later parses to an empty list
merged_df['overview'] = merged_df['overview'].fillna('')
merged_df['keywords'] = merged_df['keywords'].fillna('[]')

# Now we need to convert the 'keywords' column from a JSON object to an actual list of keywords
import json

# A function to parse the keywords correctly, handling any errors in the JSON decoding process
def parse_keywords(keyword_string):
    try:
        return json.loads(keyword_string.replace("'", "\""))
    except json.decoder.JSONDecodeError:
        return []  # In case of error, return an empty list

# Convert 'keywords' column from JSON to list of keywords
merged_df['keywords'] = merged_df['keywords'].apply(parse_keywords)

# Create a 'keyword_list' column from the 'keywords' column
merged_df['keyword_list'] = merged_df['keywords'].apply(lambda x: [d['name'] for d in x])

# Create a 'combined' column that includes the overview, keyword list, and genres. This is the column that we vectorize using TF-IDF
merged_df['combined'] = merged_df['overview'] + ' ' + merged_df['keyword_list'].apply(lambda x: ' '.join(x)) + ' ' + merged_df['genres'].apply(lambda x: ' '.join(x))


In [5]:
merged_df.isnull().sum()

id              0
title           3
overview        0
genres          0
popularity      3
keywords        0
keyword_list    0
combined        0
dtype: int64

In [6]:
merged_df[['title','combined']].head()

| | title | combined |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |


The missing values in the overview and keywords columns have been handled, and a new combined column has been created by merging the overview, the keyword names, and the genres. This column will be used for TF-IDF vectorization.

The final structure indicates that there are no missing values in the key columns used for TF-IDF. However, there are 3 entries with missing titles, which should not affect the recommendation system, as the recommendations are based on the content, not the title itself.



In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the 'combined' column to a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['combined'])

# Output the shape of the TF-IDF matrix
tfidf_matrix.shape


(45433, 77452)

The TF-IDF vectorization is complete. The resulting TF-IDF matrix has dimensions of 45433×77452, indicating that there are 45,433 movies and 77,452 unique words (after removing stop words) across the combined text.

With this matrix, we can now calculate the cosine similarity between movies and build the recommendation system.
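
To make the link between TF-IDF and cosine similarity concrete, here is a small sketch on three made-up one-line documents (illustrative only, not taken from the dataset). It relies on `TfidfVectorizer` applying L2 normalization to each row by default (`norm='l2'`), so the dot product of two rows already equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up mini "overviews" (illustrative only)
docs = [
    "toys come to life when the boy leaves",
    "a board game comes to life",
    "a masked hero protects the city",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# Each row has unit L2 norm because norm='l2' is the default
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(X.shape[0], row_norms)

# For unit-length rows, the dot product is the cosine similarity
sims = (X @ X.T).toarray()
print(sims.round(2))
```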

### Building a Recommendation System with Pairwise Cosine Similarity (Sklearn)

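Instead of precomputing a full movie-by-movie similarity matrix, we use `pairwise_distances` from `sklearn.metrics` with `metric='cosine'` to compute the scores for a single movie on the fly. Because it returns cosine distances, the similarity is obtained as `1 - distance`.
Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html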
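
A minimal sketch of the one-against-all pattern on a small dense toy matrix (the values are made up). It shows the shape returned by `pairwise_distances` when a single row is passed, and one way to rank the other items while dropping the query by index.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy feature matrix (made-up values): 4 items, 3 features
X = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

query = 0
# Slicing keeps the query 2-D, so the result has shape (1, n_items)
dist = pairwise_distances(X[query:query + 1], X, metric='cosine')
sims = 1 - dist[0]  # convert cosine distance to cosine similarity

# Rank all items from most to least similar, then drop the query by index
order = [int(i) for i in np.argsort(-sims, kind='stable') if i != query]
print(dist.shape, sims.round(2), order)
```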

In [8]:
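# Construct a reverse map from movie titles to row indices.
# Keep only the first row for each title so that a lookup returns a single index
indices = pd.Series(merged_df.index, index=merged_df['title'])
indices = indices[~indices.index.duplicated(keep='first')]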

In [9]:
from sklearn.metrics.pairwise import pairwise_distances

# Function to get recommendations based on cosine similarity, computed on-the-fly
def get_recommendations_pairwise_distances(title, tfidf_matrix=tfidf_matrix, indices=indices):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Compute the cosine similarity between this movie and all others in the dataset
    cosine_similarities = 1 - pairwise_distances(tfidf_matrix[idx], tfidf_matrix, metric='cosine')

    # Get the scores of all movies
    sim_scores = list(enumerate(cosine_similarities[0]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the query movie itself) and keep the next 19 matches
    sim_scores = sim_scores[1:20]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the most similar movies
    return merged_df['title'].iloc[movie_indices]

# Test the system with a movie
get_recommendations_pairwise_distances("Batman")


9228                    Batman Beyond: Return of the Joker
1490                                        Batman & Robin
41424    LEGO DC Comics Super Heroes: Justice League - ...
15505                           Batman: Under the Red Hood
10119                                        Batman Begins
1328                                        Batman Returns
18244                                The Dark Knight Rises
32096                     Batman Unlimited: Monster Mayhem
12477                                      The Dark Knight
21387                      Batman: Mystery of the Batwoman
29168                                     Batman vs. Robin
9167             The Batman Superman Movie: World's Finest
35994               The Flash 2 - Revenge of the Trickster
25249                                    Batman vs Dracula
19783              Batman: The Dark Knight Returns, Part 1
18027                                     Batman: Year One
150                                         Batman Forev

# Adding genre filtering
**Genre Filtering in Movie Recommendation System**

The genre filtering mechanism in the movie recommendation system enhances specificity by tailoring recommendations to a user-specified genre. This process involves several key steps:

**Genre Filtering Mechanism**

* **Function for recommendations:** The `get_recommendations_with_genre_filter` function is the centerpiece. It accepts a movie title, the TF-IDF matrix, the index mapping, the dataset, and an optional genre filter as inputs.
* **Duplicate title handling:** If there are duplicate titles in the dataset, the function selects the first occurrence to ensure uniqueness.
* **Cosine similarity retrieval:** For the specified movie, the function computes its cosine similarity with all other movies.
* **Applying the genre filter:** The genre filter refines the recommendations. If specified, the system filters the similarity scores to include only movies that match the selected genre. This step makes the recommendations more relevant to the user's preferences.
* **Ranking and selection:** The system ranks movies based on their similarity scores and selects the top ten recommendations, aligning them with the user's genre preference.
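
One subtlety when combining filtering with ranking: if the query movie is removed by position (skipping the first ranked entry) after the genre filter has run, and the query itself does not belong to that genre, the best match is dropped instead. The sketch below excludes the query by index. The function name `top_matches` and the toy scores and genres are assumptions made for illustration, not the notebook's implementation.

```python
def top_matches(query_idx, scores, genres, genre_filter=None, k=10):
    """Return up to k (index, score) pairs, best first, never including the query."""
    wanted = genre_filter.lower() if genre_filter else None
    candidates = []
    for idx, score in enumerate(scores):
        if idx == query_idx:
            continue  # exclude the query by index, not by its position in the ranking
        if wanted and not any(wanted == g.lower() for g in genres[idx]):
            continue  # keep only movies that carry the requested genre
        candidates.append((idx, score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Made-up similarity scores and genres for five movies; movie 0 is the query
scores = [1.00, 0.40, 0.35, 0.10, 0.05]
genres = [["Action"], ["Family", "Animation"], ["Action"], ["Family"], ["Drama"]]

print(top_matches(0, scores, genres, genre_filter="family", k=2))
# prints [(1, 0.4), (3, 0.1)]
```

Here the query (an Action movie) is not a Family movie, yet the best Family match at index 1 is still returned first.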

In [10]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the 'combined' column to a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['combined'])

## Computing Similarity without packages (Genres included)


In [11]:
# Function to calculate cosine similarity scores for a sparse matrix
def cosine_similarity_manual(tfidf_matrix, index):
    # Compute the cosine similarity between the index movie and all movies in the matrix
    # This is done by calculating the dot product (as the vectors are already L2-normalized)
    cosine_similarities = tfidf_matrix.dot(tfidf_matrix[index].T).toarray().ravel()
    return cosine_similarities

def get_recommendations_with_genre_filter(title, tfidf_matrix, merged_df, genre_filter=None):
    # Convert title to lowercase for case-insensitive matching
    title = title.lower()

    # Assuming the title is found in the DataFrame
    titles = merged_df['title'].str.lower()
    indices = pd.Series(merged_df.index, index=titles).drop_duplicates()

    # Get the index of the movie that matches the title
    idx_list = indices[title]
    if isinstance(idx_list, pd.Series):
        idx = idx_list.iloc[0]  # Take the first one if there are multiple
    else:
        idx = idx_list

    # Calculate cosine similarity scores using the manual method
    cosine_similarities = cosine_similarity_manual(tfidf_matrix, idx)

    # Get the scores of all movies
    sim_scores = list(enumerate(cosine_similarities))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # If a genre filter is applied, filter the recommended movies by the specified genre
    if genre_filter:
        genre_filter = genre_filter.lower()  # Convert genre filter to lowercase once
        genre_filtered_scores = []
        for movie_idx, score in sim_scores:
            # Check if the genre filter is in the list of genres (also converted to lowercase)
            if any(genre_filter in genre.lower() for genre in merged_df.iloc[movie_idx]['genres']):
                genre_filtered_scores.append((movie_idx, score))
        sim_scores = genre_filtered_scores

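    # Skip the first entry (assumed to be the query movie itself) and keep the next 10.
    # Caveat: if the genre filter already removed the query movie, this skips the best match instead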
    sim_scores = sim_scores[1:11]

    # Get the movie titles and scores
    similar_movies = [(merged_df['title'].iloc[index], score) for index, score in sim_scores]

    return similar_movies

# Test the updated function
recommendations = get_recommendations_with_genre_filter("Batman", tfidf_matrix, merged_df, genre_filter="Family")
for title, score in recommendations:
    print(f"{title}: {score:.2f}")

Batman Unlimited: Monster Mayhem: 0.36
Batman: Mystery of the Batwoman: 0.35
Batman: Mask of the Phantasm: 0.24
Batman Beyond: The Movie: 0.20
Batman Unlimited: Animal Instincts: 0.19
The Lego Batman Movie: 0.18
LEGO DC Comics Super Heroes: Batman: Be-Leaguered: 0.17
JLA Adventures: Trapped in Time: 0.14
Lego Batman: The Movie - DC Super Heroes Unite: 0.13
Batman: 0.12


## Using sklearn package

In [12]:
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer

# Construct a reverse map of indices and movie titles
indices = pd.Series(merged_df.index, index=merged_df['title']).drop_duplicates()

# Function to get recommendations based on cosine similarity, computed on-the-fly, with genre filtering
def get_recommendations_with_genre_filter(title, tfidf_matrix=tfidf_matrix, indices=indices, df=merged_df, genre_filter=None):
    # Handle duplicate titles
    idx_series = indices[title]
    if isinstance(idx_series, pd.Series):
        idx = idx_series.iloc[0]
    else:
        idx = idx_series

    # Compute the cosine similarity between this movie and all others in the dataset
    cosine_similarities = 1 - pairwise_distances(tfidf_matrix[idx:idx+1], tfidf_matrix, metric='cosine')

    # Get the scores of all movies
    sim_scores = list(enumerate(cosine_similarities[0]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # If a genre filter is applied, filter the recommended movies by the specified genre
    if genre_filter:
        genre_filtered_scores = []
        for i in sim_scores:
            movie_idx = i[0]
            if genre_filter in df.iloc[movie_idx]['genres']:
                genre_filtered_scores.append(i)
        sim_scores = genre_filtered_scores

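    # Skip the first entry (assumed to be the query movie itself) and keep the next 10.
    # Caveat: if the genre filter already removed the query movie, this skips the best match instead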
    sim_scores = sim_scores[1:11]

    # Get the movie indices and scores
    movie_indices_scores = [(i[0], i[1]) for i in sim_scores]

    # Return the top 10 most similar movies along with their similarity scores
    return [(df['title'].iloc[index], score) for index, score in movie_indices_scores]

# Prompt for a movie title and an optional genre. Both are case-sensitive. Our example is "Batman" with no genre.
user_input = input('Enter title: ')
user_genre_input = input('Enter genre: (blank input means no genre filter) ')

# Run the recommender with the entered title and the optional genre filter
recommendations = get_recommendations_with_genre_filter(user_input, genre_filter=user_genre_input)
for title, score in recommendations:
    print(f"{title}: {score:.2f}")

Enter title: Batman
Enter genre: (blank input means no genre filter) 
Batman Beyond: Return of the Joker: 0.51
Batman & Robin: 0.49
LEGO DC Comics Super Heroes: Justice League - Gotham City Breakout: 0.44
Batman: Under the Red Hood: 0.39
Batman Begins: 0.39
Batman Returns: 0.38
The Dark Knight Rises: 0.38
Batman Unlimited: Monster Mayhem: 0.36
The Dark Knight: 0.35
Batman: Mystery of the Batwoman: 0.35


# Self-chosen method: Word Embeddings

### Training the model

In [13]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense, LSTM, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import multilabel_confusion_matrix
from tensorflow.keras.callbacks import EarlyStopping

def extract_keyword_names(keyword_list):
    if isinstance(keyword_list, list):
        names = [keyword['name'] for keyword in keyword_list if isinstance(keyword, dict) and 'name' in keyword]
        return ' '.join(names)
    return ''

# Data preparation (work on a copy so that merged_df itself is not modified)
tsDf = merged_df.copy()
tsDf.dropna(subset=['genres'], inplace=True)
tsDf['keywords'] = tsDf['keywords'].apply(extract_keyword_names)
tsDf.fillna({'overview': '', 'keywords': ''}, inplace=True)
tsDf['combined_text'] = tsDf['overview'] + ' ' + tsDf['keywords']

# MultiLabel Binarizer
mlb = MultiLabelBinarizer()
encoded_labels = mlb.fit_transform(tsDf['genres'])

# Generate labels for genres
categorical_labels = encoded_labels

# Prepare tokenizer
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
tokenizer.fit_on_texts(tsDf['combined_text'])
sequences = tokenizer.texts_to_sequences(tsDf['combined_text'])
padded_sequences = pad_sequences(sequences, maxlen=60, padding='post')

# Implemented early stopping to counteract overfitting to the training data
early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=4,
    restore_best_weights=True
)

# Split the data for training and testing
X_train, X_test, y_train, y_test, titles_train, titles_test = train_test_split(
    padded_sequences, categorical_labels, tsDf['title'], test_size=0.2, random_state=42)

# Hyperparameters: these were the values that gave us the best results
learning_rate = 0.001
batch_size = 32
dropout_rate = 0.3
epochs = 25

# Building our prediction model
model = Sequential([
    Embedding(1000, 64, input_length=60),
    LSTM(32, return_sequences=True),
    GlobalAveragePooling1D(),
    Dropout(dropout_rate),
    Dense(32, activation='relu'),
    Dense(len(mlb.classes_), activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=learning_rate), metrics=['accuracy'])

# Fit the model to our data
history = model.fit(
    X_train,
    y_train,
    epochs = epochs,
    batch_size = batch_size,
    validation_data = (X_test, y_test),
    callbacks = [early_stopping],
    verbose = 2
)

Epoch 1/25
1136/1136 - 34s - loss: 0.2933 - accuracy: 0.2532 - val_loss: 0.2666 - val_accuracy: 0.2751 - 34s/epoch - 30ms/step
Epoch 2/25
1136/1136 - 9s - loss: 0.2582 - accuracy: 0.2952 - val_loss: 0.2463 - val_accuracy: 0.3179 - 9s/epoch - 8ms/step
Epoch 3/25
1136/1136 - 9s - loss: 0.2444 - accuracy: 0.3088 - val_loss: 0.2387 - val_accuracy: 0.3064 - 9s/epoch - 8ms/step
Epoch 4/25
1136/1136 - 8s - loss: 0.2344 - accuracy: 0.3267 - val_loss: 0.2276 - val_accuracy: 0.3462 - 8s/epoch - 7ms/step
Epoch 5/25
1136/1136 - 10s - loss: 0.2240 - accuracy: 0.3859 - val_loss: 0.2204 - val_accuracy: 0.4019 - 10s/epoch - 9ms/step
Epoch 6/25
1136/1136 - 7s - loss: 0.2184 - accuracy: 0.4123 - val_loss: 0.2177 - val_accuracy: 0.4112 - 7s/epoch - 6ms/step
Epoch 7/25
1136/1136 - 9s - loss: 0.2155 - accuracy: 0.4208 - val_loss: 0.2161 - val_accuracy: 0.4132 - 9s/epoch - 8ms/step
Epoch 8/25
1136/1136 - 7s - loss: 0.2133 - accuracy: 0.4251 - val_loss: 0.2147 - val_accuracy: 0.4219 - 7s/epoch - 6ms/step
Epo

### Testing the prediction

In [14]:
# Map each row of predicted probabilities to the genre names above the threshold
def genres_from_predictions(preds, threshold=0.5):
    return [mlb.classes_[pred > threshold] for pred in preds]

def predict_genre_from_overview(overview):
    sequence = tokenizer.texts_to_sequences([overview])
    padded_sequence = pad_sequences(sequence, maxlen=60, padding='post')

    prediction = model.predict(padded_sequence)
    predicted_genre_arrays = genres_from_predictions(prediction, threshold=0.5)
    predicted_genres = [genre for array in predicted_genre_arrays for genre in array]

    if not predicted_genres:
        return "Try something else"
    else:
        return ', '.join(predicted_genres)

# You can test these different examples of overviews
drama_example = "his best friend diego and with his dysfunctional mother he dreams on leaving his hometown to watch a bob show but seems to be hopeless stranded in the spot"
horror_scifi_example = "a team of men and women investigate the mysterious of two to a but world monster future creature movie"
comedy_example = "pilot for and an on this new show off the his first our for three years to more of take on love politics and the of life all with for any lover of comedy this is the must see show of stand up comedy"
# Example usage for drama
predicted_genres = predict_genre_from_overview("a small town in the in the south of a teenager that uses the mr man his time in at school with his best friend and with his mother he dreams on leaving his to a show but seems to be in the")
print("Predicted Genres:", predicted_genres)

Predicted Genres: Drama


### Checking the performance

In [27]:
def sequence_to_text(sequence):
    words = [tokenizer.index_word.get(idx, '<?>') for idx in sequence if idx != 0]
    return ' '.join(words)

actual_genres = [', '.join(mlb.classes_[y_test[i].astype(bool)]) for i in range(len(y_test))]


# Example of 5 different overviews from our testing dataset, starting at offset 2000
for n in range(5):
    print(f"Sample {n+1}:")
    i = n + 2000
    title = titles_test.iloc[i]
    overview = sequence_to_text(X_test[i])
    predicted_genres = predict_genre_from_overview(overview)
    actual_genres_str = actual_genres[i]

    print("Title:", title)
    print("Predicted Genres:", predicted_genres)
    print("Actual Genres:", actual_genres_str)
    print("Overview:", overview)
    print("------")


Sample 1:
Title: Shall We Dance?
Predicted Genres: Drama
Actual Genres: Comedy, Drama, Romance
Overview: upon first <OOV> of a beautiful <OOV> a <OOV> and <OOV> <OOV> lawyer <OOV> up for <OOV> <OOV> <OOV> <OOV> wife husband relationship <OOV> master
------
Sample 2:
Title: The Hunter's Prayer
Predicted Genres: Drama, Thriller
Actual Genres: Action, Thriller
Overview: an <OOV> helps a young woman <OOV> the death of her family <OOV> <OOV> murder <OOV> survival escape teenage girl drug <OOV> <OOV> daughter
------
Sample 3:
Title: The von Trapp Family: A Life of Music
Predicted Genres: Music
Actual Genres: Drama, Music
Overview: <OOV> <OOV> <OOV> the <OOV> daughter of a well known musical family <OOV> many <OOV> to <OOV> her musical career and move to the united states family
------
Sample 4:
Title: Shade
Predicted Genres: Action
Actual Genres: Action, Crime, Thriller
Overview: they figure a good way of <OOV> this is taking down <OOV> the <OOV> <OOV> a well known <OOV> in a <OOV> game howe
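
The spot checks above compare predicted and actual genres by eye. As a complement, the sketch below computes micro-averaged precision, recall, and F1 plus the exact-match ratio from multi-hot labels and predicted probabilities. The function name `multilabel_scores` and the toy arrays are assumptions for illustration; in the notebook the inputs would be `y_test` and `model.predict(X_test)`.

```python
import numpy as np

def multilabel_scores(y_true, y_prob, threshold=0.5):
    """Micro-averaged precision/recall/F1 and exact-match ratio for multi-hot labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) > threshold
    tp = int(np.sum(y_true & y_pred))    # genres predicted and present
    fp = int(np.sum(~y_true & y_pred))   # genres predicted but absent
    fn = int(np.sum(y_true & ~y_pred))   # genres present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    exact_match = float(np.mean(np.all(y_true == y_pred, axis=1)))
    return {"precision": precision, "recall": recall, "f1": f1, "exact_match": exact_match}

# Made-up example: 3 movies, 3 genres
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_prob = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.4, 0.6]]
result = multilabel_scores(y_true, y_prob)
print(result)
```

In this toy case every predicted genre is correct but one true genre is missed, similar to samples above where the predicted genres are a subset of the actual ones.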

### Recommending movies by genre predicted from an overview

In [16]:
def recommender_from_overview(overview, amount=5):
    predicted_genres = predict_genre_from_overview(overview)
    genres = predicted_genres.split(", ")
    print("Predicted Genres:", genres)

    merged_df['popularity'] = pd.to_numeric(merged_df['popularity'], errors='coerce')
    filtered_df = merged_df[merged_df['genres'].apply(lambda x: any(genre in x for genre in genres))]
    sorted_df = filtered_df.sort_values(by='popularity', ascending=False)
    return sorted_df.head(amount)['title']

# Example using an overview we know the model will predict to be a Comedy, showing the 10 most popular comedies in the dataset
out = recommender_from_overview(comedy_example,10)
print(out)

Predicted Genres: ['Comedy']
30677                                             Minions
24438                                          Big Hero 6
26546                                            Deadpool
26548                      Guardians of the Galaxy Vol. 2
26542    Pirates of the Caribbean: Dead Men Tell No Tales
43256            Captain Underpants: The First Epic Movie
351                                          Forrest Gump
30667                                               Ted 2
38823                                    Now You See Me 2
2210                                    Life Is Beautiful
Name: title, dtype: object
