### General
* RUNNING ON COLAB: Make a folder in your Google Drive called `datascience_data`, and upload all the .csv files there
* RUNNING LOCALLY: Make a folder called `datascience_data` in the same directory as this notebook, and put all the .csv files there

### Recommendation
* In order to get a recommendation, provide a **full title** of a movie that you like.


### TODO:
* We should use clustering, but on what? Genres? Tags? Both?
    * I tried doing clustering algorithm on the tdif_matrix, but it did not give great results compared to the current method

In [1]:
import pandas as pd
import numpy as np

# # Run on Colab

# from google.colab import drive
# drive.mount('/content/drive')
# Define the path to the file in Google Drive
# drive_path = 'drive/My Drive/datascience_data'

# Run on local (folder: data)
drive_path = 'datascience_data'

# Define the file names
file_names = ['credits.csv', 'keywords.csv', 'links.csv', 'links_small.csv', 'movies_metadata.csv', 'ratings.csv', 'ratings_small.csv']

# Load the data into Pandas DataFrames
# credits_df = pd.read_csv(drive_path + '/' + file_names[0])
keywords_df = pd.read_csv(drive_path + '/' + file_names[1])
# links_df = pd.read_csv(drive_path + '/' + file_names[2])
# links_small_df = pd.read_csv(drive_path + '/' + file_names[3])
movies_metadata_df = pd.read_csv(drive_path + '/' + file_names[4])
# ratings_df = pd.read_csv(drive_path + '/' + file_names[5])
# ratings_small_df = pd.read_csv(drive_path + '/' + file_names[6])


# Display the first few rows of each dataframe to understand their structure
dataframes = {
    "keywords": keywords_df.head(),
    # "links_small": links_small_df.head(),
    # "links": links_df.head(),
    "movies_metadata": movies_metadata_df.head(),
    # "ratings_small": ratings_small_df.head()
}

# dataframes

  movies_metadata_df = pd.read_csv(drive_path + '/' + file_names[4])


### 1. Data Preprocessing
Let's begin with the inspection and cleaning of each dataset. I'll start by checking for missing values, duplicates, and data types to ensure that the data is consistent and ready for merging. After that, I'll merge the relevant information from each dataset into a single DataFrame for further processing.

In [2]:
# Data inspection and cleaning for each dataset

# Inspect for duplicates, null values and data types in movies_metadata
movies_metadata_clean = movies_metadata_df.drop_duplicates(subset='id')
movies_metadata_clean = movies_metadata_clean[movies_metadata_clean['id'].apply(lambda x: str(x).isdigit())]
movies_metadata_clean['id'] = movies_metadata_clean['id'].astype(int)

# Check for null values and duplicates in the keywords dataset
keywords_clean = keywords_df.drop_duplicates('id')
keywords_clean['id'] = keywords_clean['id'].astype(int)

# Now, merge the datasets on 'id' column
# We only need the 'id', 'title', and 'overview' from movies_metadata
# And the 'keywords' from the keywords dataset
merged_df = pd.merge(movies_metadata_clean[['id', 'title', 'overview']],
                     keywords_clean,
                     on='id',
                     how='left')

# Display the merged DataFrame structure and check for nulls
merged_df.info(), merged_df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45433 entries, 0 to 45432
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        45433 non-null  int64 
 1   title     45430 non-null  object
 2   overview  44479 non-null  object
 3   keywords  45432 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.4+ MB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  keywords_clean['id'] = keywords_clean['id'].astype(int)


(None,
       id                        title  \
 0    862                    Toy Story   
 1   8844                      Jumanji   
 2  15602             Grumpier Old Men   
 3  31357            Waiting to Exhale   
 4  11862  Father of the Bride Part II   
 
                                             overview  \
 0  Led by Woody, Andy's toys live happily in his ...   
 1  When siblings Judy and Peter discover an encha...   
 2  A family wedding reignites the ancient feud be...   
 3  Cheated on, mistreated and stepped on, the wom...   
 4  Just when George Banks has recovered from his ...   
 
                                             keywords  
 0  [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...  
 1  [{'id': 10090, 'name': 'board game'}, {'id': 1...  
 2  [{'id': 1495, 'name': 'fishing'}, {'id': 12392...  
 3  [{'id': 818, 'name': 'based on novel'}, {'id':...  
 4  [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...  )

Handle the missing values in the overview and keywords columns, which may involve filling or removing the missing entries.
Preprocess the overview and keywords text data. This involves converting the JSON-like string in the keywords column into a workable format, tokenizing, and cleaning the text (e.g., removing stop words, punctuation, and stemming or lemmatization).

In [3]:
# Handling missing values
# For the 'overview' column, if the overview is missing, we can fill it with an empty string
# For the 'keywords' column, if the keywords are missing, we can also fill it with an empty string

merged_df['overview'] = merged_df['overview'].fillna('')
merged_df['keywords'] = merged_df['keywords'].fillna('[]')

# Now we need to convert the 'keywords' column from a JSON-like string to an actual list of keywords
import json

# A function to parse the keywords correctly, handling any errors in the JSON decoding process
def parse_keywords(keyword_string):
    try:
        return json.loads(keyword_string.replace("'", "\""))
    except json.decoder.JSONDecodeError:
        return []  # In case of error, return an empty list

# Apply the function to the 'keywords' column
merged_df['keywords'] = merged_df['keywords'].apply(parse_keywords)

# Extract just the names of the keywords to a new column 'keyword_list'
merged_df['keyword_list'] = merged_df['keywords'].apply(lambda x: [d['name'] for d in x])

# We will also create a 'combined' column that concatenates the overview and the keyword list into a single string
# This is what we will use for TF-IDF vectorization
merged_df['combined'] = merged_df['overview'] + ' ' + merged_df['keyword_list'].apply(lambda x: ' '.join(x))

# Display the final structure of the DataFrame and the first few rows of the 'combined' column
merged_df['combined'].head(), merged_df.isnull().sum()


(0    Led by Woody, Andy's toys live happily in his ...
 1    When siblings Judy and Peter discover an encha...
 2    A family wedding reignites the ancient feud be...
 3    Cheated on, mistreated and stepped on, the wom...
 4    Just when George Banks has recovered from his ...
 Name: combined, dtype: object,
 id              0
 title           3
 overview        0
 keywords        0
 keyword_list    0
 combined        0
 dtype: int64)

The missing values in the overview and keywords columns have been handled, and a new combined column has been created by concatenating the overview and the keyword names. This column will serve as the corpus for TF-IDF vectorization.

The final structure indicates that there are no missing values in the key columns used for TF-IDF. However, there are 3 entries with missing titles, which should not affect the recommendation system, as the recommendations are based on the content, not the title itself.



In [4]:
merged_df.head()

Unnamed: 0,id,title,overview,keywords,keyword_list,combined
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[jealousy, toy, boy, friendship, friends, riva...","Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,[],[],When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[fishing, best friend, duringcreditsstinger, o...",A family wedding reignites the ancient feud be...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[{'id': 818, 'name': 'based on novel'}, {'id':...","[based on novel, interracial relationship, sin...","Cheated on, mistreated and stepped on, the wom..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[baby, midlife crisis, confidence, aging, daug...",Just when George Banks has recovered from his ...


In [5]:
# merged_df.to_csv('movies_metadata_clean.csv', index=False)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the 'combined' column to a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['combined'])

# Output the shape of the TF-IDF matrix
tfidf_matrix.shape





  recip = np.true_divide(1., other)


The TF-IDF vectorization is complete. The resulting TF-IDF matrix has dimensions of 45433×77452, indicating that there are 45,433 movies and 77,452 unique words (after removing stop words) across the combined text corpus.

With this matrix, we can now calculate the cosine similarity between movies and build the recommendation system.

### Building a Recommendation System with Pairwise Cosine Similarity

Due to the high memory requirements of storing such a large dense matrix, we use an alternative approach: a more memory-efficient method such as pairwise_distances from sklearn.metrics with metric='cosine', which computes similarity scores on the fly and doesn't require storing a large matrix in memory.

In [7]:
from sklearn.metrics.pairwise import pairwise_distances

# Construct a reverse map of indices and movie titles
indices = pd.Series(merged_df.index, index=merged_df['title']).drop_duplicates()

# Instead of calculating the dense matrix all at once, we'll calculate the cosine similarity on-the-fly
# For memory efficiency, we'll use pairwise_distances with metric 'cosine' which is equivalent to 1 - cosine_similarity

# Function to get recommendations based on cosine similarity, computed on-the-fly
def get_recommendations_pairwise_distances(title, tfidf_matrix=tfidf_matrix, indices=indices):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Compute the cosine similarity between this movie and all others in the dataset
    cosine_similarities = 1 - pairwise_distances(tfidf_matrix[idx], tfidf_matrix, metric='cosine')
    
    # Get the scores of all movies
    sim_scores = list(enumerate(cosine_similarities[0]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:20]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return merged_df['title'].iloc[movie_indices]

# Test the system with a movie
get_recommendations_pairwise_distances("The Dark Knight")


18244                                The Dark Knight Rises
15505                           Batman: Under the Red Hood
1328                                        Batman Returns
10119                                        Batman Begins
585                                                 Batman
150                                         Batman Forever
20223              Batman: The Dark Knight Returns, Part 2
40944    LEGO DC Comics Super Heroes: Batman: Be-Leaguered
21181    Batman Unmasked: The Psychology of the Dark Kn...
9228                    Batman Beyond: Return of the Joker
41952    Batman Beyond Darwyn Cooke's Batman 75th Anniv...
32096                     Batman Unlimited: Monster Mayhem
41946                                The Lego Batman Movie
3094                          Batman: Mask of the Phantasm
19783              Batman: The Dark Knight Returns, Part 1
41424    LEGO DC Comics Super Heroes: Justice League - ...
39598                             Batman: The Killing Jo

### Testing Nearest Neighbors with Cosine Similarity

In [8]:
from sklearn.neighbors import NearestNeighbors

# Using the NearestNeighbors class to find the most similar items
# Initializing the NearestNeighbors model with cosine similarity
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

# Fitting the model on our TF-IDF matrix
model_knn.fit(tfidf_matrix)

# Function to get recommendations using Nearest Neighbors
def get_recommendations_knn(title, model_knn=model_knn, indices=indices):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Find the k-neighbors of a point
    distances, indices = model_knn.kneighbors(tfidf_matrix[idx], n_neighbors=11)

    # Get the indices of the nearest neighbors (excluding itself)
    nearest_indices = indices.flatten()[1:]

    # Return the top 10 most similar movies
    return merged_df['title'].iloc[nearest_indices]

# Test the system with a movie
get_recommendations_knn('The Dark Knight')


18244                                The Dark Knight Rises
15505                           Batman: Under the Red Hood
1328                                        Batman Returns
10119                                        Batman Begins
585                                                 Batman
150                                         Batman Forever
20223              Batman: The Dark Knight Returns, Part 2
40944    LEGO DC Comics Super Heroes: Batman: Be-Leaguered
21181    Batman Unmasked: The Psychology of the Dark Kn...
9228                    Batman Beyond: Return of the Joker
28681                                      The Dark Knight
2973                                             Kagemusha
34830                                     West Of Shanghai
32656                                            Turbo Kid
17995                       Prisoners of the Lost Universe
17103                                   The Storm Warriors
6937                                                  He

## Alternative Methods for Computing Similarity
There are indeed other methods to compute similarities between items in a recommendation system context, which can be more memory efficient. 
