# Movie Mate: Find Your Next Cinematic Match

## Objective: Build and evaluate recommendation engines to predict user movie preferences.

### Introduction

Ditch the endless scrolling! Our data-powered engine finds your next cinematic soulmate based on your unique movie taste.

Our objective is to leverage the power of data science to build intelligent recommendation engines that can predict which superhero movies you'll enjoy the most. By analyzing your movie preferences and incorporating the "Batman vs. Superman" theme, we aim to create a personalized recommendation experience that caters to your specific tastes.

The cornerstone of our recommendation system is the MovieLens dataset. This renowned dataset provides a treasure trove of user ratings for countless movies across various genres. The richness of this data allows us to understand user preferences and identify patterns that can be harnessed for movie recommendations.

Get ready to embark on a thrilling journey as we explore the intersection of data science and superhero fandom!

### Data Preparation

#### Looking at the columns

In [3]:
import pandas as pd

folder_path = "./data"
file_names = ["tag", "rating", "movie", "link", "genome_tags", "genome_scores"]

print("Columns in each data source:")

for file_name in file_names:
    df = pd.read_csv(folder_path + "/" + file_name + ".csv")
    print("File name: " + file_name)
    print(df.head())

Columns in each data source:
File name: tag
   userId  movieId            tag            timestamp
0      18     4141    Mark Waters  2009-04-24 18:19:40
1      65      208      dark hero  2013-05-10 01:41:18
2      65      353      dark hero  2013-05-10 01:41:19
3      65      521  noir thriller  2013-05-10 01:39:43
4      65      592      dark hero  2013-05-10 01:41:18
File name: rating
   userId  movieId  rating            timestamp
0       1        2     3.5  2005-04-02 23:53:47
1       1       29     3.5  2005-04-02 23:31:16
2       1       32     3.5  2005-04-02 23:33:39
3       1       47     3.5  2005-04-02 23:32:07
4       1       50     3.5  2005-04-02 23:29:40
File name: movie
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

  

I will use the following data for the three approaches:

1. Collaborative (user-based): movie rating data
2. Collaborative (movie-based): movie rating data
3. Content-based: genome scores data

#### Checking for missing or incomplete data

##### Movie rating data

In [4]:
data_path = "./data/rating.csv"
rating_df = pd.read_csv(data_path)
print(rating_df.head())

   userId  movieId  rating            timestamp
0       1        2     3.5  2005-04-02 23:53:47
1       1       29     3.5  2005-04-02 23:31:16
2       1       32     3.5  2005-04-02 23:33:39
3       1       47     3.5  2005-04-02 23:32:07
4       1       50     3.5  2005-04-02 23:29:40


Questions I want to answer:
1. What is the total number of movies?
2. What is the total number of users?
3. What is the average number of movies rated by a user?
4. What is the average number of ratings per movie?

In [5]:
print(rating_df.nunique())

userId         138493
movieId         26744
rating             10
timestamp    15351121
dtype: int64


In [14]:
user_movie_count = rating_df.groupby('userId')['movieId'].nunique()
print(user_movie_count.describe())

count    138493.000000
mean        144.413530
std         230.267257
min          20.000000
25%          35.000000
50%          68.000000
75%         155.000000
max        9254.000000
Name: movieId, dtype: float64


In [15]:
movie_user_count = rating_df.groupby('movieId')['userId'].nunique()
print(movie_user_count.describe())

count    26744.000000
mean       747.841123
std       3085.818268
min          1.000000
25%          3.000000
50%         18.000000
75%        205.000000
max      67310.000000
Name: userId, dtype: float64
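
The summaries above count distinct values but do not directly test for missing entries. The following is a minimal sketch of such a check, run on a tiny made-up frame so that it is self-contained. `toy_ratings` and its values are illustrative assumptions, not the real `rating_df`; the same calls should apply to `rating_df` unchanged.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating_df (made-up values, one rating missing on purpose).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [3.5, 4.0, np.nan, 2.0, 5.0],
})

# Number of missing values in each column.
missing_per_column = toy_ratings.isna().sum()
print(missing_per_column)

# Duplicate (userId, movieId) pairs would mean a user rated a movie twice.
duplicate_pairs = int(toy_ratings.duplicated(subset=["userId", "movieId"]).sum())
print("duplicate pairs:", duplicate_pairs)

# Density: share of the user x movie grid that actually holds a rating.
n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()
n_ratings = int(toy_ratings["rating"].notna().sum())
density = n_ratings / (n_users * n_movies)
print(f"density: {density:.3f}")
```

A low density on the real data would support the overlap concern raised for collaborative filtering, since most user pairs would then share few rated movies.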


### Collaborative Filtering

##### User-based Collaborative Filtering (CF):

<b>Concept:</b> User-based CF identifies users with similar taste in movies (neighbors) and recommends movies that those neighbors enjoyed but the target user has not yet seen.

<b>Similarity Calculation:</b> We can calculate pairwise similarity between users using adjusted cosine similarity, that is, cosine similarity computed after subtracting each user's average rating. This measures how closely two users' rating vectors point in the same direction once individual rating habits are removed.

- We use adjusted cosine because some users tend to rate everything higher and others everything lower.
- It is likely that two users have little overlap in the movies they have rated. This is one of the major downsides of this approach.

  That said, a case can be made that the absence of shared ratings carries some signal in itself, depending on the case.

  For example, if two users only have a Netflix subscription, they are unlikely to have rated Hotstar exclusives. Moreover, it would not make sense to recommend Hotstar exclusives to them.

  Another example is language: viewers who watch only English-language movies would not have rated Japanese movies, and that shared gap suggests their tastes are similar in at least one respect.

  Ultimately, how much weight this deserves depends on <b>how the data was collected and from which users</b>.
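
The user-based approach described above can be sketched as follows. This is a minimal illustration on a made-up 3-user, 4-movie matrix, not the notebook's implementation: the names `centered_cosine` and `predict`, the rule of requiring at least two co-rated movies, and the choice to keep only neighbors with positive similarity are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy ratings (made-up). Rows are users, columns are movies, NaN = not rated.
ratings = pd.DataFrame(
    {
        "m1": [5.0, 4.0, 1.0],
        "m2": [4.0, 3.0, 2.0],
        "m3": [np.nan, 2.0, 5.0],
        "m4": [1.0, np.nan, 4.0],
    },
    index=["u1", "u2", "u3"],
)

def centered_cosine(a: pd.Series, b: pd.Series) -> float:
    """Cosine similarity of two users' mean-centered ratings over co-rated movies."""
    a_c = a - a.mean()  # subtract each user's own average to remove rating bias
    b_c = b - b.mean()
    both = a_c.notna() & b_c.notna()
    if both.sum() < 2:  # a single shared movie always gives +1 or -1, so skip it
        return 0.0
    x, y = a_c[both].to_numpy(), b_c[both].to_numpy()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def predict(user: str, movie: str) -> float:
    """User's mean rating plus a similarity-weighted average of neighbor deviations."""
    base = ratings.loc[user].mean()
    num, den = 0.0, 0.0
    for other in ratings.index:
        if other == user or pd.isna(ratings.loc[other, movie]):
            continue
        sim = centered_cosine(ratings.loc[user], ratings.loc[other])
        if sim <= 0:  # assumption: only like-minded neighbors contribute
            continue
        num += sim * (ratings.loc[other, movie] - ratings.loc[other].mean())
        den += sim
    return float(base + num / den) if den > 0 else float(base)

sim_u1_u2 = centered_cosine(ratings.loc["u1"], ratings.loc["u2"])
sim_u1_u3 = centered_cosine(ratings.loc["u1"], ratings.loc["u3"])
predicted = predict("u1", "m3")
print(sim_u1_u2, sim_u1_u3, predicted)
```

In this toy data `u1` and `u2` agree while `u3` has opposite taste, so only `u2` influences the prediction for `u1`. On the full rating data, a pairwise loop like this would be far too slow, and a sparse matrix formulation would likely be needed.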

##### Movie-based Collaborative Filtering (CF):

<b>Concept:</b> Movie-based CF focuses on finding similar movies rather than similar users. It recommends movies the user has not yet seen that are similar to ones they have enjoyed.

<b>Similarity Calculation:</b> As in user-based CF, we can calculate pairwise similarity with cosine similarity, this time between movies, treating each movie as a vector of the ratings users have given it.
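
A minimal sketch of the movie-to-movie similarity step, on a made-up 4-user, 3-movie matrix. Treating unrated entries as 0 is a simplifying assumption for the example, and `similar_movies` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy ratings (made-up). Rows are users, columns are movies, 0 = not rated.
ratings = pd.DataFrame(
    {"m1": [5, 4, 0, 1], "m2": [4, 5, 0, 1], "m3": [0, 1, 5, 4]},
    index=["u1", "u2", "u3", "u4"],
    dtype=float,
)

# One row per movie, scaled to unit length so a dot product equals the cosine.
item_vectors = ratings.to_numpy().T
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
unit = item_vectors / np.where(norms == 0, 1.0, norms)

item_sim = pd.DataFrame(unit @ unit.T, index=ratings.columns, columns=ratings.columns)

def similar_movies(movie: str, k: int = 2) -> pd.Series:
    """Return the k movies most similar to `movie`, excluding the movie itself."""
    return item_sim[movie].drop(movie).nlargest(k)

print(similar_movies("m1", k=2))
```

Here `m1` and `m2` are rated alike by the same users, so `m2` ranks as the closest neighbor of `m1`. Recommending for a user would then mean collecting the neighbors of movies they rated highly and dropping those they have already seen.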

### Content-Based Filtering (CBF)
- Leveraging movie genomes: explain the concept of movie genomes and their potential for CBF.
- Preprocessing movie genomes:
  - Extract relevant features from movie genomes (e.g., genres, actors, directors).
  - Convert textual features (genres) to numerical representations using techniques like one-hot encoding.
- Building a CBF model:
  - Choose a machine learning algorithm (e.g., K-Nearest Neighbors, Naive Bayes) suitable for CBF.
  - Train the model on movie features and corresponding ratings.
  - Define a function to recommend movies similar to a specific movie based on the trained model.
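
The last step of this plan, recommending movies similar to a given movie, can be sketched with a nearest-neighbor lookup over genome vectors. This is a hedged illustration on made-up data: the column names `movieId`, `tagId`, and `relevance` are assumed, since the header of the genome scores file is not printed above, and `recommend_similar` is a hypothetical helper. It uses plain cosine similarity rather than a trained model.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the genome scores data (made-up values).
# Assumed layout: one row per (movie, tag) pair with a relevance in [0, 1].
toy_scores = pd.DataFrame({
    "movieId":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "tagId":     [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "relevance": [0.9, 0.8, 0.1, 0.8, 0.9, 0.2, 0.1, 0.2, 0.9],
})

# Pivot to a movie x tag matrix: each movie becomes a numeric feature vector.
movie_tag = toy_scores.pivot(index="movieId", columns="tagId", values="relevance").fillna(0.0)

# Scale rows to unit length so that a dot product equals cosine similarity.
m = movie_tag.to_numpy()
norms = np.linalg.norm(m, axis=1, keepdims=True)
unit = m / np.where(norms == 0, 1.0, norms)
content_sim = pd.DataFrame(unit @ unit.T, index=movie_tag.index, columns=movie_tag.index)

def recommend_similar(movie_id: int, k: int = 5) -> pd.Series:
    """Return the k movies whose tag profiles are closest to `movie_id`."""
    return content_sim.loc[movie_id].drop(movie_id).nlargest(k)

print(recommend_similar(1, k=2))
```

Movies 1 and 2 share high relevance on the same tags in this toy data, so movie 2 is returned first for movie 1. Under the assumed layout the features are already numeric, so no one-hot encoding is needed for this particular step.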

### Evaluation
Because the task is to rank movies for each user, we assess the engines with commonly used ranking metrics, each of which captures a different aspect of recommendation quality:

Mean Reciprocal Rank (MRR): This metric considers the rank of the first relevant recommendation in the list. A higher MRR indicates that relevant recommendations appear closer to the top of the list. Why it's useful: MRR prioritizes the importance of getting the most relevant item at the top of the recommendation list.

Precision@k: This metric measures the proportion of the top k recommendations that are actually relevant to the user. A higher precision indicates a higher accuracy in recommending relevant movies. Why it's useful: Precision@k allows you to assess the overall quality of the top k recommendations. You might choose a different value of k depending on how many recommendations you typically show to users.

Normalized Discounted Cumulative Gain (NDCG): This metric considers both the relevance and the ranking of recommended items. A higher NDCG indicates that the most relevant items are ranked higher in the list.
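
The three metrics can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions: the function names are hypothetical, the item lists are made up, and NDCG uses the linear-gain form (some libraries use an exponential gain instead, which gives different values for graded relevance).

```python
import math

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(recommended_lists, relevant_sets):
    """Average reciprocal rank over several users."""
    pairs = list(zip(recommended_lists, relevant_sets))
    return sum(reciprocal_rank(rec, rel) for rec, rel in pairs) / len(pairs)

def precision_at_k(recommended, relevant, k):
    """Share of the top k recommendations that are relevant."""
    return sum(item in relevant for item in recommended[:k]) / k

def ndcg_at_k(recommended, relevance, k):
    """DCG of the list divided by the DCG of the ideal ordering.

    `relevance` maps item -> graded relevance; missing items count as 0.
    """
    gains = [relevance.get(item, 0.0) for item in recommended[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up example: items "b" and "d" are relevant, "b" more so than "d".
recommended = ["a", "b", "c", "d"]
relevance = {"b": 3.0, "d": 1.0}
relevant = set(relevance)

rr = reciprocal_rank(recommended, relevant)
p_at_2 = precision_at_k(recommended, relevant, 2)
ndcg = ndcg_at_k(recommended, relevance, 4)
print(rr, p_at_2, ndcg)
```

With ratings as the ground truth, one possible convention is to treat a held-out movie as relevant when the user rated it at or above some threshold; that threshold is a design choice that should be stated alongside the results.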

### The "Batman vs. Superman" Showdown
- User choice integration: create a mechanism (e.g., an input field) for users to choose their favorite superhero ("Batman" or "Superman").
- Personalized recommendations, based on the user's choice and their movie rating history:
  - Utilize all three recommendation engines (user-based CF, movie-based CF, CBF) to generate movie recommendations.
  - Optionally, prioritize recommendations within the chosen superhero universe.
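
One possible way to combine the three engines with the user's choice is sketched below. Everything in it is an assumption made for illustration: the per-engine scores are made up and presumed to be on a common 0 to 1 scale, `blend` and `boost` are hypothetical names, and matching the hero's name against the title is only a crude proxy for "within the chosen superhero universe".

```python
import pandas as pd

# Hypothetical per-engine scores for one user (made-up numbers on a 0-1 scale).
scores = pd.DataFrame(
    {
        "user_cf":  [0.9, 0.4, 0.7],
        "movie_cf": [0.8, 0.5, 0.6],
        "content":  [0.7, 0.6, 0.9],
    },
    index=["Batman Begins (2005)", "Toy Story (1995)", "Superman (1978)"],
)

def blend(scores: pd.DataFrame, hero: str, boost: float = 0.2) -> pd.Series:
    """Average the engines' scores, then add a bonus to titles that mention the hero."""
    if hero not in {"Batman", "Superman"}:
        raise ValueError("hero must be 'Batman' or 'Superman'")
    combined = scores.mean(axis=1)
    # Boolean array: True where the title contains the chosen hero's name.
    in_universe = scores.index.str.contains(hero, case=False, regex=False)
    return (combined + boost * in_universe).sort_values(ascending=False)

print(blend(scores, "Superman"))
```

A simple average keeps the sketch readable; weighting the engines by their evaluation results would be a natural refinement.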

### Conclusion
- Summarize findings: discuss the performance of each recommendation engine based on the evaluation metrics.
- Insights and limitations: highlight insights gained about user preferences and limitations of the system.
- Future directions: suggest potential improvements or extensions to the project (e.g., hybrid recommendation systems, incorporating additional datasets).