# Modeling and Evaluation Report

## Model Choice
#### Now that we've analyzed our data, we move on to developing our initial recommendation system. The recommendation systems literature offers several techniques we could use: collaborative filtering, content-based filtering, and hybrid approaches would all work for this type of recommendation problem. Collaborative filtering relies on similarities between users or movies derived from interaction history, content-based filtering relies on the features of the movies, and hybrid approaches combine interaction data with content features.

#### In choosing the algorithm, we weigh the advantages of each technique. Collaborative filtering is effective at capturing user preferences from interactions, while content-based methods exploit our detailed content features to give good recommendations. Each also has disadvantages: collaborative filtering does not exploit our rich feature information, and content-based filtering has difficulty with cold-start scenarios such as new users with little or no history. A hybrid approach, such as an ensemble of collaborative and content-based models, can potentially address these shortcomings at the cost of additional model complexity.

#### For our approach, since we have rich feature information for each movie, we will use a content-based approach built on cosine similarity. We will engineer features from our dataset, evaluate a cosine-similarity-based model, and discuss how it could be improved in the future. When putting our solution into an A/B test, we would compare it against a baseline, such as recommending movies sorted by average rating, and only release our model if it outperforms that baseline.
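
#### As a minimal sketch of the average-rating baseline mentioned above, the snippet below ranks movies by mean rating on a tiny made-up ratings table. The function name `top_rated_baseline`, the `min_ratings` threshold, and the toy data are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy ratings table (hypothetical data, same column names as ratings.csv)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [4.0, 5.0, 5.0, 2.0, 3.0, 4.0, 3.0],
})

def top_rated_baseline(ratings, top_n=5, min_ratings=2):
    """Rank movies by average rating, ignoring movies with too few ratings."""
    stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
    eligible = stats[stats["count"] >= min_ratings]
    return eligible.sort_values("mean", ascending=False).head(top_n).index.tolist()

baseline_recs = top_rated_baseline(toy_ratings, top_n=2)
print(baseline_recs)
```

#### Every user receives the same list under this heuristic, which is what makes it a useful lower bar for a personalized model.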

#### We will now build the algorithm and use offline metrics such as precision and recall to evaluate its accuracy. We describe how the test set is created later. We will go through data preprocessing, model building, test set evaluation, and next steps.
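
#### To make the two metrics concrete, here is a small worked example for a single user. Precision@k is the share of the k recommended movies that appear in the user's test set, and recall@k is the share of the user's test-set movies that were recommended. The function name and the movie IDs are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k=5):
    """Return (precision@k, recall@k) for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 5 recommendations, 4 movies in the user's test set
recommended = [101, 102, 103, 104, 105]
relevant = [102, 105, 200, 300]
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

#### Two of the five recommendations are hits, so precision@5 is 2/5 and recall@5 is 2/4.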

## Data Preprocessing

##### We did some initial data processing during our data analysis, where we identified useful features and added tags to our movies dataset. Now we preprocess the data to limit it to potentially useful features. In the future we would want a more systematic feature and hyperparameter search to ensure our feature choices exploit the information found in the data.

##### Using the year, overview, popularity, average_rating, genres and tags should give us a good initial approach for creating our vector for our similarity search.

In [1]:
import pandas as pd

# Load datasets previously created
movies_combined_all_features = pd.read_csv("Camille G data/movies_combined_all_features.csv")
ratings_df = pd.read_csv("Camille G data/ratings.csv")

In [2]:
# Add genres in one-hot encoding
genres_df = movies_combined_all_features['genres'].str.get_dummies(sep='|')

In [3]:
genres_df

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9720,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9721,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
9722,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
# Combine genres with movies dataset
movies_combined_all_features_with_genre = pd.concat([movies_combined_all_features, genres_df.add_prefix("genre_")], axis=1)

In [5]:
# Drop the unnecessary unnamed index column(s) left over from the CSV export
unnamed_columns = movies_combined_all_features_with_genre.filter(regex="Unname").columns
movies_combined_all_features_with_genre = movies_combined_all_features_with_genre.drop(columns=unnamed_columns)

In [6]:
# current set of columns
print([x for x in list(movies_combined_all_features_with_genre.columns) if not x.startswith("tag_")])

['movieId', 'title', 'genres', 'year', 'imdbId', 'tmdbId', 'overview', 'popularity', 'original_title', 'runtime', 'release_date', 'vote_average', 'vote_count', 'status', 'tagline', 'spoken_languages', 'cast', 'id', 'filename', 'average_rating', 'rating_count', 'rating_std', 'genre_(no genres listed)', 'genre_Action', 'genre_Adventure', 'genre_Animation', 'genre_Children', 'genre_Comedy', 'genre_Crime', 'genre_Documentary', 'genre_Drama', 'genre_Fantasy', 'genre_Film-Noir', 'genre_Horror', 'genre_IMAX', 'genre_Musical', 'genre_Mystery', 'genre_Romance', 'genre_Sci-Fi', 'genre_Thriller', 'genre_War', 'genre_Western']


#### Fill Missing Values

In [7]:
# Checking for missing values in the relevant columns
missing_values = movies_combined_all_features_with_genre.isnull().sum()
missing_values[missing_values!=0]

year                 13
tmdbId                8
overview            129
popularity          120
original_title      120
                   ... 
tag_wry            8170
tag_younger men    8170
tag_zither         8170
tag_zoe kazan      8170
tag_zombies        8170
Length: 1605, dtype: int64

In [9]:
# Year, overview and popularity have missing values that need imputation
movies_combined_all_features_with_genre['popularity'] = movies_combined_all_features_with_genre['popularity'].fillna(movies_combined_all_features_with_genre['popularity'].median())

movies_combined_all_features_with_genre['overview'] = movies_combined_all_features_with_genre['overview'].fillna("No overview available")

median_year = movies_combined_all_features_with_genre['year'].median()
movies_combined_all_features_with_genre['year'] = movies_combined_all_features_with_genre['year'].fillna(median_year)

In [10]:
# Check if missing values are filled
missing_values = movies_combined_all_features_with_genre.isnull().sum()
missing_values[missing_values!=0]

tmdbId                8
original_title      120
runtime             120
release_date        120
vote_average        120
                   ... 
tag_wry            8170
tag_younger men    8170
tag_zither         8170
tag_zoe kazan      8170
tag_zombies        8170
Length: 1602, dtype: int64

### Normalization

#### Numerical features need to be normalized before they can be used in our vector representation. Here we scale 'year', 'popularity', and 'average_rating' to the [0, 1] range.
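
#### As a quick illustration of what min-max scaling does to a single column, the sketch below applies the formula by hand to a few made-up release years. The values are hypothetical and only meant to show the mapping.

```python
import pandas as pd

# Hypothetical release years
years = pd.Series([1995, 2000, 2010, 2018])

# Min-max scaling: x' = (x - min) / (max - min), mapping the column onto [0, 1]
scaled_years = (years - years.min()) / (years.max() - years.min())
print(scaled_years.tolist())
```

#### The oldest year maps to 0 and the newest to 1, so a single extreme value in a column can compress all other values into a narrow band.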

In [11]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
movies_combined_all_features_with_genre[['year', 'popularity', 'average_rating']] = scaler.fit_transform(movies_combined_all_features_with_genre[['year', 'popularity', 'average_rating']])

# Check the prepared dataset
movies_combined_all_features_with_genre.head()


Unnamed: 0,movieId,title,genres,year,imdbId,tmdbId,overview,popularity,original_title,runtime,...,genre_Film-Noir,genre_Horror,genre_IMAX,genre_Musical,genre_Mystery,genre_Romance,genre_Sci-Fi,genre_Thriller,genre_War,genre_Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.801724,114709,862.0,"Led by Woody, Andy's toys live happily in his ...",0.147815,Toy Story,81.0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0.801724,113497,8844.0,When siblings Judy and Peter discover an encha...,0.019709,Jumanji,104.0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.801724,113228,15602.0,A family wedding reignites the ancient feud be...,0.017802,Grumpier Old Men,101.0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0.801724,114885,31357.0,"Cheated on, mistreated and stepped on, the wom...",0.016711,Waiting to Exhale,127.0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0.801724,113041,11862.0,Just when George Banks has recovered from his ...,0.027924,Father of the Bride Part II,106.0,...,0,0,0,0,0,0,0,0,0,0


#### To use the overview we need to vectorize it. We use a simple TF-IDF representation here, but could use a richer embedding in the future.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorizing the 'overview' text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_combined_all_features_with_genre['overview'])

# Converting the TF-IDF matrix to a DataFrame for easier handling
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Displaying the TF-IDF DataFrame
tfidf_df.head()

Unnamed: 0,000,10,11,12,13,14,15,17,18,1930s,...,writing,written,wrong,year,years,york,young,younger,youth,zombies
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.120211,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Combine Features

In [13]:
# Combining the features we need
genre_columns = [col for col in movies_combined_all_features_with_genre.columns if 'genre_' in col]
genre_data = movies_combined_all_features_with_genre[genre_columns]

tags_columns = [col for col in movies_combined_all_features_with_genre.columns if 'tag_' in col]
tags_data = movies_combined_all_features_with_genre[tags_columns]

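# movieId and title are metadata only; the remaining columns are vector features.
# Note: tags_data is built above but is not included in this first version of the vector.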
combined_features = pd.concat([movies_combined_all_features_with_genre[['movieId', 'title', 'year', 'popularity', 'average_rating']], genre_data, tfidf_df], axis=1)

In [14]:
combined_features.iloc[:,2:]

Unnamed: 0,year,popularity,average_rating,genre_(no genres listed),genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,...,writing,written,wrong,year.1,years,york,young,younger,youth,zombies
0,0.801724,0.147815,0.760207,0,0,1,1,1,1,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
1,0.801724,0.019709,0.651515,0,0,1,0,1,0,0,...,0.0,0.0,0.0,0.0,0.120211,0.0,0.000000,0.0,0.0,0.0
2,0.801724,0.017802,0.613248,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
3,0.801724,0.016711,0.412698,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
4,0.801724,0.027924,0.571429,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,0.991379,0.014765,0.777778,0,1,0,1,0,1,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
9720,0.991379,0.028507,0.666667,0,0,0,1,0,1,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.066828,0.0,0.0,0.0
9721,0.991379,0.004282,0.666667,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
9722,1.000000,0.024278,0.666667,0,1,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
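
#### The `year.1` column in the output above indicates that the TF-IDF term `year` shares its name with the numeric `year` feature. The model still runs, but duplicate column names are easy to trip over later. One possible fix, sketched here on toy frames with a hypothetical `tfidf_` prefix, is to prefix the term columns before concatenating.

```python
import pandas as pd

# Toy stand-ins for the numeric features and the TF-IDF term matrix (hypothetical values)
numeric = pd.DataFrame({"year": [0.80, 0.99], "popularity": [0.15, 0.02]})
tfidf_terms = pd.DataFrame({"year": [0.0, 0.12], "young": [0.07, 0.0]})

# Without a prefix, the combined frame ends up with two columns named "year"
collided = pd.concat([numeric, tfidf_terms], axis=1)

# Prefixing the term columns keeps every feature name unique
combined = pd.concat([numeric, tfidf_terms.add_prefix("tfidf_")], axis=1)
print(combined.columns.tolist())
```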


In [15]:
len(combined_features.columns.tolist())

2025

#### We can also attempt dimensionality reduction to see if it improves performance, using a technique like PCA

In [16]:
from sklearn.decomposition import PCA

# Applying PCA for dimensionality reduction
# Choosing a number of components that explains a significant portion of the variance (here 99%)
pca = PCA(n_components=0.99)
pca_features = pca.fit_transform(combined_features.iloc[:,2:])

# Checking the shape of the reduced features to understand the new dimensionality
pca_features.shape

(9724, 1797)

#### Keeping 1797 components retains 99% of the variance in the data, down from the 2023 original feature columns.
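
#### To show how the component count relates to the variance threshold, the sketch below uses synthetic data (an assumption, not our movie features) and compares a manual cumulative-variance calculation with passing a float to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance reaches 99%
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)

# Passing a float to n_components asks scikit-learn to make the same choice
pca_99 = PCA(n_components=0.99).fit(X)
print(n_needed, pca_99.n_components_)
```

#### Plotting `cumulative` against the number of components is a common way to pick a lower threshold when 99% yields little compression.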

In [17]:
pca_features_df = pd.DataFrame(pca_features)

In [18]:
pca_features_df = pd.concat([combined_features.iloc[:,:2],pca_features_df], axis=1)

## Compute cosine similarity across vectors
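
#### Before computing the full matrix, here is a minimal sketch of the measure itself on made-up three-feature vectors. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine` and the vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-feature movie vectors
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 1.0])

print(cosine(a, b))  # a and b share one active feature
print(cosine(a, c))  # a and c share no active features
```

#### Vectors with no features in common score 0, and identical directions score 1 regardless of vector length.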

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculating cosine similarity
cosine_sim = cosine_similarity(combined_features.iloc[:,2:])
cosine_sim_pca = cosine_similarity(pca_features_df.iloc[:,2:])

# Converting the cosine similarity matrix to a DataFrame for easier handling
cosine_sim_df = pd.DataFrame(cosine_sim, index=combined_features['title'], columns=combined_features['title'])

# Converting the PCA-based cosine similarity matrix to a DataFrame as well
cosine_sim_pca_df = pd.DataFrame(cosine_sim_pca, index=combined_features['title'], columns=combined_features['title'])

In [20]:
# Displaying a portion of the cosine similarity DataFrame
cosine_sim_df.iloc[:5, :5]

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995)
Toy Story (1995),1.0,0.689321,0.391379,0.331786,0.44875
Jumanji (1995),0.689321,1.0,0.251795,0.184652,0.261782
Grumpier Old Men (1995),0.391379,0.251795,1.0,0.658459,0.589592
Waiting to Exhale (1995),0.331786,0.184652,0.658459,1.0,0.496972
Father of the Bride Part II (1995),0.44875,0.261782,0.589592,0.496972,1.0


In [21]:
# Displaying a portion of the cosine similarity DataFrame using PCA
cosine_sim_pca_df.iloc[:5, :5]

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995)
Toy Story (1995),1.0,0.571733,0.069287,-0.065502,0.13581
Jumanji (1995),0.571733,1.0,-0.072359,-0.245058,-0.058204
Grumpier Old Men (1995),0.069287,-0.072359,1.0,0.394858,0.309463
Waiting to Exhale (1995),-0.065502,-0.245058,0.394858,1.0,0.066066
Father of the Bride Part II (1995),0.13581,-0.058204,0.309463,0.066066,1.0


In [22]:
def get_recommendations(title, cosine_sim_matrix, movies, top_n=5):
    """
    Function to get movie recommendations based on cosine similarity.
    
    :param title: Title of the movie for which recommendations are to be made.
    :param cosine_sim_matrix: DataFrame containing cosine similarity scores.
    :param movies: DataFrame containing movie titles.
    :param top_n: Number of top recommendations to return.
    :return: List of top_n recommended movie titles.
    """
    # Check if the movie is in the dataset
    if title not in cosine_sim_matrix.index:
        return f"The movie '{title}' is not in the dataset."

    # Get the similarity scores for the specified movie
    sim_scores = cosine_sim_matrix[title]

    # Sort the movies based on similarity scores
    sorted_sim_scores = sim_scores.sort_values(ascending=False)

    # Get the titles of the top N similar movies
    top_movies = sorted_sim_scores.iloc[1:top_n+1].index

    return top_movies.tolist()

# Example: Getting recommendations for "Toy Story (1995)"
recommended_movies = get_recommendations("Toy Story (1995)", cosine_sim_df, movies_combined_all_features_with_genre, top_n=5)
recommended_movies

['Toy Story 2 (1999)',
 'Toy Story 3 (2010)',
 "Emperor's New Groove, The (2000)",
 'Wild, The (2006)',
 'Monsters, Inc. (2001)']

#### These recommendations, such as the Toy Story sequels and other animated family films, make sense for Toy Story (1995), which suggests our model is working as intended.

## Evaluation

#### For evaluation, we first split each user's ratings by time into 80% train and 20% test. We then build a vector for each user from the train data and measure precision@5 and recall@5 on the test set.

In [23]:
# Implementing the user-wise temporal train-test split
train_data = []
test_data = []

for user_id in ratings_df['userId'].unique():
    # Get all interactions for the user, sorted by time
    user_data = ratings_df[ratings_df['userId'] == user_id].sort_values(by='timestamp')
    
    # Calculate the index for splitting (80% for train, 20% for test)
    split_index = int(len(user_data) * 0.8)

    # Split the data
    user_train_data = user_data.iloc[:split_index]
    user_test_data = user_data.iloc[split_index:]

    # Append to the respective lists
    train_data.append(user_train_data)
    test_data.append(user_test_data)

# Concatenating all users' data into train and test DataFrames
train_df = pd.concat(train_data)
test_df = pd.concat(test_data)

# Checking the size of the train and test sets
train_df.shape, test_df.shape

((80419, 4), (20417, 4))
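
#### The per-user loop above is easy to follow but filters the ratings table once per user. As a sketch of an alternative under the same 80/20 rule, the split can be expressed with grouped operations. The toy data and variable names below are hypothetical.

```python
import pandas as pd

# Toy ratings (hypothetical): user 1 has 5 interactions, user 2 has 3
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 1, 1, 2, 2, 2],
    "movieId":   [10, 11, 12, 13, 14, 10, 15, 16],
    "timestamp": [5, 1, 3, 2, 4, 9, 7, 8],
})

toy = toy.sort_values(["userId", "timestamp"])
# Position of each rating within its user's timeline, and that user's total count
position = toy.groupby("userId").cumcount()
n_per_user = toy.groupby("userId")["movieId"].transform("size")
# Same rule as the loop: the first int(0.8 * n) rows per user go to train
is_train = position < (n_per_user * 0.8).astype(int)

toy_train, toy_test = toy[is_train], toy[~is_train]
print(len(toy_train), len(toy_test))
```

#### Each user's most recent ratings end up in the test set, so the model is always evaluated on interactions that happened after the ones it was built from.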

In [24]:
# Merge our train set with our feature vectors
train_user_movie_features = train_df.merge(combined_features, on='movieId')
train_user_movie_features_pca = train_df.merge(pca_features_df, on='movieId')
train_user_movie_features.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,year,popularity,average_rating,genre_(no genres listed),genre_Action,...,writing,written,wrong,year.1,years,york,young,younger,youth,zombies
0,1,1210,5.0,964980499,Star Wars: Episode VI - Return of the Jedi (1983),0.698276,0.048675,0.80839,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,804,4.0,964980499,She's the One (1996),0.810345,0.014566,0.611111,0,0,...,0.0,0.0,0.0,0.0,0.0,0.177965,0.0,0.0,0.0,0.0
2,1,2018,5.0,964980523,Bambi (1942),0.344828,0.070013,0.637427,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.163221,0.0,0.0,0.0
3,1,2628,4.0,964980523,Star Wars: Episode I - The Phantom Menace (1999),0.836207,0.056779,0.579365,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.187114,0.0,0.0,0.0
4,1,2826,4.0,964980523,"13th Warrior, The (1999)",0.836207,0.042706,0.534188,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
# Find the mean of the train vectors to represent the user
train_user_vectors = train_user_movie_features.drop(["movieId","rating","timestamp","title"], axis=1).groupby('userId').mean()
train_user_vectors_pca = train_user_movie_features_pca.drop(["movieId","rating","timestamp","title"], axis=1).groupby('userId').mean()
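
#### The user vectors above are unweighted means, so a movie rated 1 star influences the profile as much as one rated 5 stars. One possible refinement, which we have not evaluated, is to weight each movie vector by its rating. The sketch below uses made-up two-feature vectors to show the difference.

```python
import numpy as np

# Hypothetical feature vectors for three movies a user rated (rows) and their ratings
movie_vectors = np.array([
    [1.0, 0.0],   # e.g. an action movie
    [0.0, 1.0],   # e.g. a romance movie
    [1.0, 1.0],   # e.g. a movie with both genres
])
ratings = np.array([5.0, 1.0, 3.0])

# Unweighted mean, as in the cell above: every rated movie counts equally
mean_profile = movie_vectors.mean(axis=0)

# Rating-weighted mean: highly rated movies pull the profile toward their features
weighted_profile = np.average(movie_vectors, axis=0, weights=ratings)
print(mean_profile, weighted_profile)
```

#### In this toy case the unweighted profile treats both features equally, while the weighted profile leans toward the feature of the 5-star movie.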

In [26]:
# Display what sample vectors look like 
train_user_vectors

userId,year,popularity,average_rating,genre_(no genres listed),genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,...,writing,written,wrong,year,years,york,young,younger,youth,zombies
1,0.710485,0.046754,0.678465,0.0,0.432432,0.437838,0.156757,0.227027,0.372973,0.210811,...,0.000000,0.000000,0.004105,0.005837,0.005883,0.007662,0.014963,0.001166,0.001259,0.000000
2,0.915292,0.084173,0.774361,0.0,0.434783,0.086957,0.000000,0.000000,0.304348,0.347826,...,0.000000,0.000000,0.000000,0.005763,0.011685,0.009259,0.009151,0.000000,0.000000,0.016658
3,0.684650,0.031445,0.707552,0.0,0.419355,0.258065,0.096774,0.096774,0.161290,0.032258,...,0.000000,0.000000,0.008899,0.000000,0.003149,0.008213,0.014096,0.000000,0.000000,0.000000
4,0.730152,0.036004,0.731405,0.0,0.127907,0.139535,0.034884,0.052326,0.430233,0.145349,...,0.000000,0.000000,0.004082,0.008981,0.005488,0.008927,0.024536,0.000000,0.001272,0.000000
5,0.761330,0.069501,0.708735,0.0,0.228571,0.228571,0.171429,0.257143,0.400000,0.257143,...,0.000000,0.000000,0.000000,0.000000,0.008460,0.000000,0.017441,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.731067,0.033795,0.690421,0.0,0.125561,0.119955,0.020179,0.033632,0.391256,0.116592,...,0.001712,0.000314,0.001187,0.007866,0.009437,0.012405,0.021828,0.001651,0.001173,0.000000
607,0.753240,0.052883,0.657727,0.0,0.456376,0.268456,0.013423,0.073826,0.268456,0.147651,...,0.000000,0.000000,0.000000,0.008539,0.015353,0.009738,0.009272,0.001493,0.000000,0.000000
608,0.796687,0.045420,0.620721,0.0,0.323795,0.240964,0.066265,0.114458,0.453313,0.144578,...,0.001571,0.000000,0.002221,0.010626,0.008339,0.005941,0.014689,0.001303,0.000445,0.001081
609,0.793995,0.053697,0.675171,0.0,0.379310,0.310345,0.034483,0.034483,0.172414,0.206897,...,0.000000,0.009154,0.000000,0.010724,0.012233,0.000000,0.012215,0.000000,0.000000,0.000000


In [27]:
# Calculate the cosine similarity between user vectors and movie vectors
train_user_movie_similarity = cosine_similarity(train_user_vectors, combined_features.iloc[:,2:])
train_user_movie_similarity_pca = cosine_similarity(train_user_vectors_pca, pca_features_df.iloc[:,2:])

In [28]:
train_user_movie_similarity_df = pd.DataFrame(train_user_movie_similarity, 
                                              index=train_user_vectors.index, 
                                              columns=combined_features["movieId"])

train_user_movie_similarity_pca_df = pd.DataFrame(train_user_movie_similarity_pca,
                                              index=train_user_vectors_pca.index, 
                                              columns=pca_features_df["movieId"])

In [29]:
train_user_movie_similarity_df

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.690067,0.626899,0.541253,0.533918,0.571894,0.635557,0.533951,0.597205,0.600932,0.696484,...,0.657436,0.530238,0.643115,0.542558,0.458561,0.696243,0.609856,0.564418,0.617487,0.599193
2,0.417948,0.386630,0.495488,0.568071,0.558224,0.685757,0.487457,0.404046,0.611179,0.601896,...,0.577789,0.591173,0.720634,0.519321,0.525002,0.568591,0.488106,0.696066,0.579600,0.587537
3,0.506513,0.498269,0.464579,0.493703,0.484120,0.579115,0.456734,0.482783,0.597501,0.641185,...,0.671419,0.544823,0.610572,0.521018,0.459729,0.592317,0.495868,0.605283,0.596317,0.522932
4,0.528767,0.451174,0.653725,0.733792,0.628894,0.533474,0.648494,0.446118,0.499299,0.514113,...,0.542134,0.617224,0.796256,0.527087,0.494592,0.579927,0.573973,0.714464,0.494827,0.658247
5,0.651180,0.559473,0.600334,0.656585,0.588897,0.586197,0.594684,0.534206,0.518658,0.563545,...,0.581154,0.618307,0.730816,0.557097,0.470837,0.643027,0.615624,0.659670,0.556998,0.613060
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.493073,0.427083,0.651743,0.762583,0.601734,0.503820,0.646981,0.423435,0.488538,0.489012,...,0.522900,0.635352,0.802703,0.509248,0.480015,0.552421,0.545484,0.737970,0.479962,0.629368
607,0.511058,0.493652,0.527972,0.565523,0.540807,0.682503,0.521645,0.488963,0.625320,0.708548,...,0.617479,0.537328,0.663654,0.499957,0.472954,0.611924,0.508310,0.628783,0.588821,0.565436
608,0.607761,0.528968,0.612844,0.625275,0.642914,0.627892,0.608528,0.512625,0.588653,0.649181,...,0.670846,0.554614,0.728381,0.540487,0.495940,0.674348,0.617795,0.625388,0.583209,0.661674
609,0.476803,0.468588,0.478261,0.540886,0.499925,0.700909,0.472396,0.489904,0.590684,0.731827,...,0.557118,0.569835,0.647602,0.512520,0.503527,0.545475,0.459651,0.654620,0.566974,0.525196


In [32]:
# Save to cache for the Flask/Streamlit app
train_user_movie_similarity_df.to_parquet("cache/train_user_movie_similarity.parquet", engine="pyarrow")

#### Here we calculate precision and recall at k=5. We could also use metrics like NDCG; since we are limited to recommending 5 movies, we use k=5 for now.
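
#### For reference, here is a minimal sketch of NDCG@k with binary relevance, where a recommended movie counts as relevant if it appears in the user's test set. Unlike precision and recall, it rewards placing hits near the top of the list. The function name and the example lists are hypothetical, and this metric is not part of the results reported in this notebook.

```python
import math

def ndcg_at_k(recommended, relevant, k=5):
    """NDCG@k with binary relevance: 1 if a recommended item is in `relevant`, else 0."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical lists: the same two hits, placed early versus late
early = ndcg_at_k([1, 2, 9, 8, 7], relevant=[1, 2], k=5)
late = ndcg_at_k([9, 8, 7, 1, 2], relevant=[1, 2], k=5)
print(early, late)
```

#### Both lists have the same precision@5 and recall@5, but the list with the hits in the first two positions scores higher on NDCG.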

In [33]:
# Defining the function to get the top-k recommendations for a user
def get_recommendations(user_id, user_movie_similarity_df, ratings_df, k=5):
    recommendations = user_movie_similarity_df.loc[user_id]

    # Get the top-k movie IDs (ratings_df is currently unused)
    top_k_movies = recommendations.nlargest(k).index.tolist()
    return pd.Series(top_k_movies)
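
#### Note that the function above ranks every movie, including those in the user's training history. Assuming each user rates a movie at most once, those movies cannot appear in that user's test set, so they can never count as hits. A possible refinement, not applied to the results reported here, is to drop already-seen movies before taking the top k. The function name and toy data below are hypothetical.

```python
import pandas as pd

def recommend_unseen(user_id, similarity_df, train_ratings, k=5):
    """Top-k movie IDs for a user, skipping movies the user already rated in training."""
    scores = similarity_df.loc[user_id]
    seen = train_ratings.loc[train_ratings["userId"] == user_id, "movieId"].tolist()
    return scores.drop(index=seen, errors="ignore").nlargest(k).index.tolist()

# Toy similarity scores (hypothetical): one user, four movies
toy_similarity = pd.DataFrame([[0.9, 0.8, 0.7, 0.6]], index=[1], columns=[10, 20, 30, 40])
toy_train = pd.DataFrame({"userId": [1, 1], "movieId": [10, 30]})

recs = recommend_unseen(1, toy_similarity, toy_train, k=2)
print(recs)
```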

In [34]:
def calculate_precision_recall_at_k(predictions, actuals, k=5):
    # Calculate precision and recall at K
    precision = sum(predictions[:k].isin(actuals)) / k
    recall = sum(predictions[:k].isin(actuals)) / len(actuals)
    return precision, recall

def calculate_precision_recall(train_similarity_df):
    precisions = []
    recalls = []

    for user_id in test_df['userId'].unique():
        actual_movies = test_df[test_df['userId'] == user_id]['movieId']
        
        # Get the top-5 recommendations from the similarity matrix built on the train set
        recommended_movies = get_recommendations(user_id, train_similarity_df, test_df, k=5)
        precision, recall = calculate_precision_recall_at_k(recommended_movies, actual_movies, k=5)
        
        precisions.append(precision)
        recalls.append(recall)
    
    # Calculate average precision and recall
    average_precision = sum(precisions) / len(precisions)
    average_recall = sum(recalls) / len(recalls)
    return average_precision*100.0, average_recall*100.0

print("(Precision, Recall):", calculate_precision_recall(train_user_movie_similarity_df))
print("(Precision, Recall):", calculate_precision_recall(train_user_movie_similarity_pca_df))

(Precision, Recall): (0.6557377049180327, 0.1250496143056776)
(Precision, Recall): (1.4754098360655739, 0.3379090816812576)


## Conclusion

#### The raw-feature method achieved a precision@5 of 0.66% and a recall@5 of 0.13%, while the PCA method achieved a precision@5 of 1.48% and a recall@5 of 0.34%. Both are quite low, so our current content-based features and representation seem to have limited power to predict the next movies a user will like. In the future we could try collaborative filtering, richer vector representations of users and items, hyperparameter tuning, and more complex methods like regression and neural networks.

#### We should also take into account scalability, real-world applicability, speed and performance. We need additional metrics to capture qualities like diversity and serendipity, and a full A/B test with user feedback will show how well our system really works in real-world use cases. More personalization will allow us to better represent users and create a system that delights customers.
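
#### As one example of a diversity metric, the sketch below computes intra-list diversity: the average of (1 - cosine similarity) over all pairs of movies in a recommendation list. This is one common definition among several, the function name and vectors are hypothetical, and it assumes no feature vector is all zeros.

```python
import numpy as np

def intra_list_diversity(vectors):
    """Average (1 - cosine similarity) over all distinct pairs in a recommendation list."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(vectors), k=1)
    return float(np.mean(1.0 - sims[upper]))

# Hypothetical feature vectors for two recommendation lists
similar_list = [[1, 0], [1, 0], [1, 0]]
varied_list = [[1, 0], [0, 1], [1, 1]]
print(intra_list_diversity(similar_list), intra_list_diversity(varied_list))
```

#### A list of near-identical movies scores close to 0, while a list spanning different features scores higher, which makes the metric a useful complement to precision and recall.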