# **Build Your First RecSys**

In this notebook, you will build a simple collaborative filtering-based movie recommendation system.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

## Loading the data

We will use the [MovieLens Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) that consists of movies released on or before July 2017. Data points include metadata about the movies, as well as 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

For the purpose of this exercise, we will use a subset of 100,000 ratings from 700 users on 9,000 movies.


*movies_metadata.csv* contains information about the movies:

In [None]:
movies = pd.read_csv("https://raw.githubusercontent.com/evgeniyako-edu/build-your-first-recsys-workshop/main/data/movies_metadata.csv",
                     low_memory=False)
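# Drop three rows by index label before converting `id` to a numeric dtype
# (assumption: these rows are malformed and would make the conversion fail)
movies = movies.drop([19730, 29503, 35587])
movies['id'] = pd.to_numeric(movies['id'])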
movies.head()

*ratings_small.csv* contains the ratings. We can drop the $\texttt{timestamp}$ column as we will not use it in this exercise.

In [None]:
ratings = pd.read_csv("https://raw.githubusercontent.com/evgeniyako-edu/build-your-first-recsys-workshop/main/data/ratings_small.csv",
                     low_memory=False).drop(["timestamp"], axis=1)
ratings.head()

Let's merge the two tables so that the ratings table also contains the name of the movie:

In [None]:
ratings = (ratings.merge(movies[["id", "title"]],
                        right_on="id", 
                        left_on="movieId")).drop(["id"], axis=1).sort_values("userId")
ratings.head(10)
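
Note that `merge` performs an inner join by default, so any rating whose `movieId` has no matching `id` in the metadata is dropped silently. The following is a minimal sketch on made-up toy tables (`toy_ratings` and `toy_movies` are illustrative names, not part of the dataset) showing one way to count such rows with the `indicator` option:

```python
import pandas as pd

# Hypothetical toy data: movieId 99 has no entry in the metadata table
toy_ratings = pd.DataFrame({"userId": [1, 1, 2],
                            "movieId": [10, 20, 99],
                            "rating": [4.0, 3.5, 5.0]})
toy_movies = pd.DataFrame({"id": [10, 20, 30],
                           "title": ["A", "B", "C"]})

# Default (inner) join: the rating for movieId 99 disappears
inner = toy_ratings.merge(toy_movies, left_on="movieId", right_on="id")

# Left join with indicator=True keeps every rating and flags unmatched ones
checked = toy_ratings.merge(toy_movies, left_on="movieId", right_on="id",
                            how="left", indicator=True)
n_unmatched = int((checked["_merge"] == "left_only").sum())

print(len(toy_ratings), len(inner), n_unmatched)
```

Running the same check on the real tables would show how many ratings are lost in the merge.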

## Exploratory Data Analysis (EDA)

Let's explore the dataset in a bit more detail. 

Most users rate only a small number of movies. However, some dedicated users have rated more than 100:

In [None]:
counts_per_user = ratings.groupby("userId")[["movieId"]].count().reset_index()
counts_per_user.columns = ["userId", "num_movies_rated"]
plt.figure(figsize=(12,5))
plt.hist(counts_per_user["num_movies_rated"], bins=range(0,800,50))
plt.xticks(range(0,800,50))
_ = plt.xlabel("number of movies rated")
_ = plt.ylabel("number of users")

What are the most rated movies in the dataset? Display them together with their average rating.

In [None]:
# TODO
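
One possible approach to this exercise, sketched on a made-up toy table rather than the real data: group by movie, then compute the rating count and mean in a single aggregation. The helper name `most_rated` and the output column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as the merged `ratings` table
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "title":   ["A", "B", "A", "C", "A", "B", "C"],
    "rating":  [5.0, 3.0, 4.0, 2.0, 3.0, 4.0, 1.0],
})

def most_rated(ratings, n=10):
    """Return the n movies with the most ratings, with their average rating."""
    return (ratings.groupby(["movieId", "title"])["rating"]
            .agg(num_ratings="count", avg_rating="mean")
            # Break ties in the count by average rating so the order is deterministic
            .sort_values(["num_ratings", "avg_rating"], ascending=False)
            .head(n)
            .reset_index())

top = most_rated(toy, n=2)
print(top)
```

On the real data, the call would be `most_rated(ratings, n=10)`.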

## Collaborative Filtering

In this section, we will implement a simple collaborative filtering-based recommender that predicts the ratings a user would give to unseen movies. The candidates with the highest predicted ratings can then be suggested to the user.

Our system will predict the unknown rating as a linear combination of the ratings given to the movie by other users who have seen it. We will weight each such rating by how similar the user in question is to that other user. More specifically, the predicted rating of user $u$ for the unseen movie $i$ is

$$\hat{R}_{ui} = \bar{R}_u + \sum_{v} w_{uv}\left(R_{vi} - \bar{R}_{v} \right)$$

Here, the summation runs over all users $v$ who rated movie $i$, $w_{uv}$ is the similarity between users $u$ and $v$, and $\bar{R}_u$ is the average rating user $u$ has given across all the movies they have rated.

The similarity $w_{uv}$ between users $u$ and $v$ can be measured as the Pearson correlation between the rating vectors of these users, considering only the movies both of them rated:

$$w_{uv} = \frac{\sum_j (R_{uj} - \bar{R}_u )(R_{vj} - \bar{R}_v )}{\sqrt{{\sum_j (R_{uj} - \bar{R}_u )^2}{\sum_j (R_{vj} - \bar{R}_v )^2}}}$$

Here, the summation runs over all movies $j$ that were rated by both users $u$ and $v$.
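
As a rough illustration of the similarity formula, here is a minimal sketch on made-up toy data. The function name `pearson_similarity` is hypothetical. Two assumptions are made that the text does not settle: each user's mean is taken over all of their ratings (matching the definition of $\bar{R}$ given earlier), and the similarity is set to 0 when the users share no movies or the denominator is zero.

```python
import numpy as np
import pandas as pd

def pearson_similarity(user_u, user_v, ratings):
    """Similarity w_uv between two users over the movies both have rated."""
    # One rating per movie for each user (duplicates, if any, are averaged)
    r_u = ratings[ratings["userId"] == user_u].groupby("movieId")["rating"].mean()
    r_v = ratings[ratings["userId"] == user_v].groupby("movieId")["rating"].mean()

    common = r_u.index.intersection(r_v.index)
    if len(common) == 0:
        return 0.0  # assumption: no overlap means no evidence of similarity

    # Deviations from each user's mean over ALL their ratings (assumption)
    d_u = (r_u.loc[common] - r_u.mean()).to_numpy()
    d_v = (r_v.loc[common] - r_v.mean()).to_numpy()

    denom = np.sqrt((d_u ** 2).sum() * (d_v ** 2).sum())
    if denom == 0:
        return 0.0  # assumption: undefined correlation is treated as 0
    return float((d_u * d_v).sum() / denom)

# Hypothetical toy data: users 1 and 2 agree, users 1 and 3 disagree
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 3.0, 1.0, 4.0, 3.0, 2.0, 1.0, 3.0, 5.0],
})
w_12 = pearson_similarity(1, 2, toy)
w_13 = pearson_similarity(1, 3, toy)
print(w_12, w_13)
```

An alternative would be to compute the means over the co-rated movies only, which makes the value a textbook Pearson correlation (as `np.corrcoef` would return).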

Let's first implement some building blocks and then bring them together in a single collaborative filtering system. 

In [None]:
def predict_rating(user_id, movie_id, ratings):
    """ 
    Predict rating that a user user_id would give to movie movie_id 
    using collaborative filtering
    """
    
    # Step 1: Get all the ratings for the movie in question

    # Step 2: Compute similarity between the user in question, user_id,
    # and everyone who has rated the movie in question, movie_id
    

    # Step 3: Compute the predicted rating               
    
    pass
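
The three steps in the skeleton could be filled in along the following lines. This is a sketch under stated assumptions, not a reference solution: the names `user_similarity` and `predict_rating_sketch` are hypothetical, user means are taken over all of a user's ratings, and the optional `normalize` flag (dividing by the sum of absolute similarities, a common variant that keeps predictions closer to the rating scale) is an addition that does not appear in the text above.

```python
import numpy as np
import pandas as pd

def user_similarity(r_u, r_v):
    """Pearson-style similarity; r_u and r_v are Series indexed by movieId."""
    common = r_u.index.intersection(r_v.index)
    if len(common) == 0:
        return 0.0
    d_u = (r_u.loc[common] - r_u.mean()).to_numpy()
    d_v = (r_v.loc[common] - r_v.mean()).to_numpy()
    denom = np.sqrt((d_u ** 2).sum() * (d_v ** 2).sum())
    return float((d_u * d_v).sum() / denom) if denom > 0 else 0.0

def predict_rating_sketch(user_id, movie_id, ratings, normalize=False):
    """Predict the rating user_id would give to movie_id."""
    # Series indexed by (userId, movieId); duplicate ratings are averaged
    by_user = ratings.groupby(["userId", "movieId"])["rating"].mean()
    r_u = by_user.loc[user_id]  # this user's ratings, indexed by movieId

    # Step 1: find everyone else who rated the movie in question
    mask = (ratings["movieId"] == movie_id) & (ratings["userId"] != user_id)
    raters = ratings.loc[mask, "userId"].unique()

    # Step 2: similarity to each rater; Step 3: weighted sum of deviations
    weighted_sum, weight_total = 0.0, 0.0
    for v in raters:
        r_v = by_user.loc[v]
        w = user_similarity(r_u, r_v)
        weighted_sum += w * (r_v.loc[movie_id] - r_v.mean())
        weight_total += abs(w)

    if normalize and weight_total > 0:
        weighted_sum /= weight_total  # optional variant, not in the text
    return float(r_u.mean() + weighted_sum)

# Hypothetical toy data: user 1 has not seen movie 30
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 1.0, 4.0, 2.0, 5.0, 1.0, 5.0, 2.0],
})
pred = predict_rating_sketch(1, 30, toy)
pred_norm = predict_rating_sketch(1, 30, toy, normalize=True)
print(pred, pred_norm)
```

In the toy data, user 2 has similar taste to user 1 and liked movie 30, while user 3 has opposite taste and disliked it, so both predictions come out above user 1's average rating of 3.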

Let's take a look at the first user, who has rated only 6 movies. Which of the 50 most-rated movies can we recommend to them?

How do these recommendations compare with those for, say, user 2?

In [None]:
# TODO
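
One way to structure this exercise is sketched below: take the most-rated movies, keep those the user has not rated yet, predict a rating for each, and sort. The function `recommend` and its parameters are hypothetical. To keep the example self-contained, a stand-in predictor (the movie's plain average rating) is passed in where your own `predict_rating` would go.

```python
import pandas as pd

def recommend(user_id, ratings, predict_fn, n_popular=50, k=5):
    """Top-k predicted ratings among the n_popular most-rated unseen movies."""
    # Movies ordered by how many ratings they received
    popular = ratings["movieId"].value_counts().head(n_popular).index
    # Movies the user has already rated are not recommended again
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    candidates = [m for m in popular if m not in seen]
    preds = pd.Series({m: predict_fn(user_id, m, ratings) for m in candidates},
                      dtype=float)
    return preds.sort_values(ascending=False).head(k)

def mean_rating_predictor(user_id, movie_id, ratings):
    """Placeholder predictor (assumption): ignores the user entirely."""
    return ratings.loc[ratings["movieId"] == movie_id, "rating"].mean()

# Hypothetical toy data: user 1 has only rated movie 10
toy = pd.DataFrame({
    "userId":  [1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 4.0, 4.0, 2.0, 3.0, 5.0, 3.0],
})
recs = recommend(1, toy, mean_rating_predictor, n_popular=50, k=5)
print(recs)
```

With the real data, the call might look like `recommend(1, ratings, predict_rating)` and `recommend(2, ratings, predict_rating)`, after which the two result lists can be compared side by side.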