# DSCI 563 - Unsupervised Learning

# Lab 4: Movie Recommendations

## Submission instructions
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **Please add a link to your GitHub repository here: LINK TO YOUR REPO**
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Make at least three commits in your lab's GitHub repository.
- Push the final .ipynb file with your solutions to your GitHub repository for this lab.
- Upload the .ipynb file to Gradescope.
- If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a pdf or html in addition to the .ipynb. 
- Make sure that your plots/output are rendered properly in Gradescope.

> [Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

> As usual, do not push the data to the repository. 

**At this point in the program, even if it is not asked explicitly in the instructions, it's always expected that you provide a brief justification or explanation when you make some non-obvious choices (e.g., hyperparameter choices) or present a bunch of plots. If you don't do it, the reader doesn't know the rationale behind your decisions and what they are supposed to look for in your visualizations.**

<br><br><br><br>

## Imports
<hr>

In [None]:
import os

import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

<br><br><br><br>

## Exercise 1: Data and warm-up
<hr>

In this lab, you will build a variety of movie recommendation systems using the [MovieLens dataset](https://www.kaggle.com/prajitdatta/movielens-100k-dataset/data). The original source of the data is [here](https://grouplens.org/datasets/movielens/), and the structure of the data is described in the [README](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html) that comes with it. 

Run the code below, which reads the ratings data as a CSV, assuming that the file `u.data` is under the `data/ml-100k/` directory in your lab folder. Timestamps can be useful in recommendation systems, but we are going to ignore them in this assignment. 

In [None]:
r_cols = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_csv(
    os.path.join("data", "ml-100k", "u.data"),
    sep="\t",
    names=r_cols,
    encoding="latin-1",
)
ratings = ratings.drop(columns=["timestamp"])
ratings.head()

In [None]:
# We'll be using these keys later in the starter code
user_key = "user_id"
item_key = "movie_id"

### 1.1 Terminology
Here is some notation we will be using in this lab. 

**Constants**:

 - $N$: the number of users, indexed by $n$
 - $M$: the number of movies, indexed by $m$
 - $\mathcal{R}$: the set of indices $(n,m)$ where we have ratings in the utility matrix $Y$
    - Thus $|\mathcal{R}|$ is the total number of ratings
 - $k$: the number of latent dimensions we use in collaborative filtering
 
**The data**:

 - $Y$: the utility matrix containing ratings, with a lot of missing entries
 - `train_mat` and `valid_mat`: Utility matrices for train and validation sets, respectively
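
To make the notation concrete, here is a minimal sketch on a hypothetical toy matrix (the names `Y_toy`, `N_toy`, `M_toy`, `observed`, `num_ratings`, and `density` are illustrative and not part of the lab's starter code). It shows how $N$, $M$, and $|\mathcal{R}|$ relate to a utility matrix stored as a NumPy array with `np.nan` for missing entries.

```python
import numpy as np

# Toy utility matrix Y (hypothetical data): rows are users, columns are movies,
# and np.nan marks a missing rating.
Y_toy = np.array(
    [
        [5.0, np.nan, 3.0, np.nan],
        [np.nan, 4.0, np.nan, np.nan],
        [1.0, np.nan, np.nan, 2.0],
    ]
)

N_toy, M_toy = Y_toy.shape  # number of users, number of movies
observed = ~np.isnan(Y_toy)  # boolean mask of the index set R
num_ratings = int(observed.sum())  # |R|, the total number of ratings
density = num_ratings / (N_toy * M_toy)  # fraction of non-missing entries

print(N_toy, M_toy, num_ratings, density)
```

In the actual ratings data the same quantities come from the long-format `ratings` dataframe rather than from a dense matrix, so treat this only as an illustration of the symbols.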
  

#### 1.1.1 
rubric={accuracy:1}

**Your tasks:**    

What are the values of $N$ and $M$ in the movie ratings data?  

In [None]:
N = None
M = None

<div class="alert alert-warning">

Solution_1_1_1
    
</div>

#### 1.1.2
rubric={accuracy:1}

**Your tasks:**    

What would be the shape of the utility matrix $Y$? 

<div class="alert alert-warning">

Solution_1_1_2
    
</div>

#### 1.1.3
rubric={accuracy:1}

**Your tasks:**    

What is the fraction of non-missing ratings in the utility matrix $Y$? 

<div class="alert alert-warning">

Solution_1_1_3
    
</div>

#### 1.1.4
rubric={accuracy:2}

**Your tasks:**    

What is the average number of ratings per user and per movie? 

<div class="alert alert-warning">

Solution_1_1_4
    
</div>

<br><br>

### 1.2 Data splitting 
rubric={accuracy:2}

**Your tasks:**

1. Split the ratings data into train and validation splits with `test_size=0.2` and `random_state=42`. 

<div class="alert alert-warning">

Solution_1_2_1
    
</div>

<br><br>

### 1.3 Utility matrix 
rubric={accuracy:4,reasoning:2}

The code below creates user and item mappers which map user ids and item ids to indices. 

**Your tasks:**
1. Create utility matrices for train and validation sets and store them in `train_mat` and `valid_mat` variables, respectively. Show the shapes of these utility matrices.  
2. Briefly explain the difference between the two matrices. How many non-nan elements are there in each utility matrix? 

> You may use the code from lecture notes with appropriate attributions.  

> You wouldn't do this in real life, but since our dataset is not that big, create a dense utility matrix in this assignment. You are welcome to try a sparse matrix, but then you may have to change some of the starter code provided in the later exercises.  

In [None]:
user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(ratings[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(ratings[item_key])))
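
As an illustration of what the mappers are for, the sketch below builds a dense utility matrix from a hypothetical toy dataframe. The names `toy_ratings`, `toy_user_mapper`, `toy_item_mapper`, and `toy_utility_matrix` are assumptions made for this example and are not the lab's required implementation; the point is only that raw ids, which need not be contiguous, are translated into row and column indices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with non-contiguous ids.
toy_ratings = pd.DataFrame(
    {
        "user_id": [10, 10, 42, 99],
        "movie_id": [7, 300, 7, 15],
        "rating": [4, 2, 5, 3],
    }
)

toy_users = np.unique(toy_ratings["user_id"])
toy_items = np.unique(toy_ratings["movie_id"])
toy_user_mapper = dict(zip(toy_users, range(len(toy_users))))
toy_item_mapper = dict(zip(toy_items, range(len(toy_items))))


def toy_utility_matrix(df, n_users, n_items):
    """Fill a dense (n_users, n_items) matrix with ratings; NaN elsewhere."""
    mat = np.full((n_users, n_items), np.nan)
    rows = df["user_id"].map(toy_user_mapper).to_numpy()
    cols = df["movie_id"].map(toy_item_mapper).to_numpy()
    mat[rows, cols] = df["rating"].to_numpy()
    return mat


toy_mat = toy_utility_matrix(toy_ratings, len(toy_users), len(toy_items))
print(toy_mat)
```

Passing the full number of users and items, rather than the counts present in a given split, keeps the train and validation matrices the same shape.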

In [None]:
train_mat = None
valid_mat = None

<div class="alert alert-warning">

Solution_1_3_1
    
</div>

<div class="alert alert-warning">

Solution_1_3_2
    
</div>

<br><br>

### 1.4 Evaluation
rubric={reasoning:3}

You will be developing a number of models to complete the utility matrix in this assignment. To compare these models, you'll be evaluating them using the functions below. 
- Given two matrices, the `error` function below returns RMSE of non-nan elements.
- Given predictions and train and validation utility matrices, the `evaluate` function below prints train and validation RMSEs by calling the `error` function for each set. 

**Your task:**

1. Briefly explain what exactly we are comparing in order to evaluate recommender systems. Do we have ground truth in the context of recommender systems? Can we guarantee that the recommendations given by a system with low RMSE are going to be effective, in the sense that customers are likely to consume the recommended items? Briefly explain. 

In [None]:
def error(Y1, Y2):
    """
    Returns the root mean squared error (RMSE).
    """
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


def evaluate(pred_Y, train_mat, valid_mat, model_name="Global average"):
    print("%s train RMSE: %0.2f" % (model_name, error(pred_Y, train_mat)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_Y, valid_mat)))
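
To see how `error` ignores missing entries, here is a small sketch on hypothetical toy matrices (`toy_truth`, `toy_pred`, and `toy_rmse` are illustrative names). Subtracting a NaN yields a NaN, and `np.nanmean` skips those entries, so only the observed ratings contribute to the RMSE.

```python
import numpy as np


def error(Y1, Y2):
    """Root mean squared error over entries that are non-NaN in both inputs."""
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


# Hypothetical 2x3 utility matrix with two observed ratings.
toy_truth = np.array([[4.0, np.nan, np.nan], [np.nan, 2.0, np.nan]])
# A dense prediction matrix that predicts 3.0 everywhere.
toy_pred = np.full(toy_truth.shape, 3.0)

# Only the entries (0, 0) and (1, 1) contribute: the errors are -1 and +1.
toy_rmse = error(toy_pred, toy_truth)
print(toy_rmse)
```

This is why the same dense prediction matrix can be scored against both `train_mat` and `valid_mat`: each call only looks at the entries observed in that split.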

<div class="alert alert-warning">

Solution_1_4_1
    
</div>

<br><br><br><br>

## Exercise 2: Baselines
<hr>

In this exercise you'll implement a number of baseline models to fill in the missing entries of the utility matrix.

The code below implements the following two baselines. 

- global average rating
- per-user average rating 

**global average rating baseline**

In [None]:
avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

**per-user average rating baseline**

In [None]:
avg_n = np.nanmean(train_mat, axis=1)
avg_n[
    np.isnan(avg_n)
] = avg  # for the rows with all nan entries, where user has not rated any movies.
pred_n = np.tile(avg_n[:, None], (1, M))
evaluate(pred_n, train_mat, valid_mat, model_name="Per-user average")

<br><br>

### 2.1 Per-movie average baseline
rubric={accuracy:3}

**Your tasks:**

1. Implement per-movie average baseline and report train and validation RMSEs.

<div class="alert alert-warning">

Solution_2_1_1
    
</div>

In [None]:
avg_m = None

<br><br>

### 2.2 Average of per-user and per-movie average baselines
rubric={accuracy:3,reasoning:1}

**Your tasks:**

1. Implement average of per-movie and per-user averages baseline and report train and validation RMSEs.
2. Which baseline is performing the best in terms of RMSE? 

<div class="alert alert-warning">

Solution_2_2_1
    
</div>

<div class="alert alert-warning">

Solution_2_2_2
    
</div>

<br><br>

### (Optional) 2.3 $k$-nearest neighbours imputation
rubric={reasoning:1}

**Your tasks:**

1. Try [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) to fill in the missing entries. 
2. Report train and validation RMSEs and compare the results with previous baselines. 

<div class="alert alert-warning">

Solution_2_3_1
    
</div>

<div class="alert alert-warning">

Solution_2_3_2
    
</div>

<br><br><br><br>

## Exercise 3: Collaborative filtering

**Collaborative filtering** is one of the most popular approaches to building recommendation systems. It uses something similar to LSA or `TruncatedSVD` to fill in the missing entries of the utility matrix. 

### 3.1 `TruncatedSVD` by replacing missing values with zeros
rubric={accuracy:6,quality:2}

Utility matrices are usually huge with many missing entries. If we want to use scikit-learn's `TruncatedSVD`, we first need to impute the missing values with some numeric values. In this exercise, you'll first center the non-nan ratings, replace missing entries with zeros, and experiment with `TruncatedSVD` with different values of $k$ to fill in the missing entries in the utility matrix. 

**Your tasks:**

- Subtract the average of the per-user and per-movie average ratings from the non-nan ratings in the utility matrix. 
> Hint: You can subtract the following from the appropriate utility matrix assuming that `avg_n` has per user averages and `avg_m` has per movie averages: `0.5 * avg_n[:, None] + 0.5 * avg_m[None]`
- Replace missing values in the train utility matrix with zeros. 
> Hint: See help of [`np.nan_to_num`](https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html). 
- Train [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) and get reconstructions to fill in the missing entries in the utility matrix. Experiment with at least a few values of $k$ (`n_components` hyperparameter of `TruncatedSVD`). Report train and validation RMSEs in each case. 
> When you reconstruct the data, do not forget to add the following to the reconstructions: `0.5 * avg_n[:, None] + 0.5 * avg_m[None]`
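
The steps above can be sketched on a hypothetical toy matrix as follows. The names `toy_train`, `toy_avg_n`, `toy_avg_m`, `offset`, `centered`, and `toy_pred` are assumptions made for this illustration, and `n_components=2` is an arbitrary choice; this is a sketch of the mechanics, not a reference solution.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy utility matrix (3 users x 4 movies); NaN marks a missing rating.
toy_train = np.array(
    [
        [5.0, 4.0, np.nan, 1.0],
        [4.0, np.nan, 2.0, 1.0],
        [np.nan, 5.0, 1.0, 2.0],
    ]
)

toy_avg_n = np.nanmean(toy_train, axis=1)  # per-user averages, shape (3,)
toy_avg_m = np.nanmean(toy_train, axis=0)  # per-movie averages, shape (4,)

# Broadcasting (3, 1) against (1, 4) gives one offset per (user, movie) pair.
offset = 0.5 * toy_avg_n[:, None] + 0.5 * toy_avg_m[None]

# NaN - offset is still NaN, so only observed ratings are centered;
# nan_to_num then replaces the missing entries with zeros.
centered = np.nan_to_num(toy_train - offset)

svd = TruncatedSVD(n_components=2, random_state=42)
Z = svd.fit_transform(centered)  # user embeddings, shape (3, 2)
toy_pred = svd.inverse_transform(Z) + offset  # dense reconstruction, shape (3, 4)

print(toy_pred.round(2))
```

On the real data some users or movies may have no training ratings, in which case their averages are NaN and need a fallback value, as in the per-user baseline above.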


<div class="alert alert-warning">

Solution_3_1
    
</div>

<br><br>

### 3.2 Discussion 
rubric={reasoning:4}

1. What's wrong with the approach in 3.1? Why is it not common to use `scikit-learn`'s `TruncatedSVD` for collaborative filtering? Why do we need a separate package?

<div class="alert alert-warning">

Solution_3_2_1
    
</div>

<br><br>

### 3.3 Collaborative filtering with `surprise` package
rubric={accuracy:4,reasoning:1}

Use the [`surprise`](https://surprise.readthedocs.io/en/stable/) package, which implements dimensionality reduction with proper handling of missing values and is therefore suitable for recommendation systems. You can install it in your environment as follows. 

```
>> conda activate 563
>> conda install -c conda-forge scikit-surprise
or 
>> pip install scikit-surprise
```

**Your tasks:**

 
1. Carry out cross-validation using the [`surprise`](https://surprise.readthedocs.io/en/stable/) package and SVD algorithm. Report mean RMSEs. 
2. Briefly comment on the results. 

<div class="alert alert-warning">

Solution_3_3_1
    
</div>

<div class="alert alert-warning">

Solution_3_3_2
    
</div>

<br><br><br><br>

## (Optional) 4: `top_n` predictions
<hr>

rubric={reasoning:1}

**Your tasks:**
1. Fit the SVD model on the train set using the `surprise` package. Write a function which returns `top_n` movie rating predictions for a given user id. Movies with these top ratings could be recommended to the user. 

> You may adapt [this code](https://github.com/NicolasHug/Surprise/blob/master/examples/top_n_recommendations.py) from the developer of the surprise package. If you do so, provide proper attributions. 

<div class="alert alert-warning">

Solution_4_1
    
</div>

<br><br><br><br>

## Exercise 5: Content-based recommenders
<hr> 

Collaborative filtering is an unsupervised approach to fill in missing entries in the utility matrix. In this exercise, you'll explore content-based filtering, a supervised machine learning approach to recommendation systems. 

Collaborative filtering only uses the utility matrix. But usually there is information available about items (i.e., movies in our case) or users. Content-based filtering exploits this information along with ratings information to understand the taste of a user and predict their ratings for the items they have not consumed or rated before. 

The code below loads movie genre features from `ml-100k/u.item` and stores them in a variable called `W`. 

In [None]:
cols = [
    "movie_id",
    "movie title",
    "release date",
    "video release date",
    "IMDb URL",
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]

movies_data = pd.read_csv(
    os.path.join("data", "ml-100k", "u.item"),
    sep="|",
    names=cols,
    encoding="latin-1",
)
movies_data.head()

In [None]:
genres = [
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]
movie_genres = movies_data[genres]
movie_genres.head()

In [None]:
W = movie_genres.to_numpy()
W.shape

In [None]:
print("Average number of genres per movie: %.1f" % (W.sum() / M))

<br><br>

### 5.1 Create `X` and `y` per user 
rubric={accuracy:4}

In content-based filtering, we create a separate profile (`X` and `y`) for each user based on the items they have rated so far. Since the number of items rated differs from user to user, the size of `X` is going to be different for each user. The function `get_X_y_per_user` below creates `X` and `y` for each user with movie genre features.   

**Your tasks:**
1. Create `X` and `y` per user by calling the function `get_X_y_per_user` below on train and validation ratings. 


In [None]:
from collections import defaultdict


def get_X_y_per_user(ratings_df, d=W.shape[1]):
    """
    Returns X and y for each user.

    Parameters:
    ----------
    ratings_df : pandas.DataFrame
         ratings data as a dataframe

    d : int
        number of item features

    Return:
    ----------
        dictionaries containing X and y for all users
    """
    lr_y = defaultdict(list)
    lr_X = defaultdict(list)

    for index, val in ratings_df.iterrows():
        n = user_mapper[val[user_key]]
        m = item_mapper[val[item_key]]
        lr_X[n].append(W[m])
        lr_y[n].append(val["rating"])

    for n in lr_X:
        lr_X[n] = np.array(lr_X[n])
        lr_y[n] = np.array(lr_y[n])

    return lr_X, lr_y
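
To illustrate what the per-user profiles look like, here is a self-contained sketch of the same grouping idea on hypothetical toy data. `toy_W`, `toy_triples`, `toy_X`, and `toy_y` are illustrative names, and the triples already contain user and movie indices, so no mappers are needed here.

```python
from collections import defaultdict

import numpy as np

# Hypothetical genre features for 3 movies and 2 genres (rows are movie indices).
toy_W = np.array([[1, 0], [0, 1], [1, 1]])
# Hypothetical (user index, movie index, rating) triples.
toy_triples = [(0, 0, 5), (0, 2, 4), (0, 1, 1), (1, 1, 3)]

toy_X = defaultdict(list)
toy_y = defaultdict(list)
for n, m, rating in toy_triples:
    toy_X[n].append(toy_W[m])  # features of the rated movie
    toy_y[n].append(rating)  # the rating is the regression target

toy_X = {n: np.array(rows) for n, rows in toy_X.items()}
toy_y = {n: np.array(vals) for n, vals in toy_y.items()}

for n in toy_X:
    print(n, toy_X[n].shape, toy_y[n].shape)
```

User 0 rated three movies and user 1 rated one, so their `X` matrices have three rows and one row respectively, while the number of columns (genre features) is the same for everyone.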

<div class="alert alert-warning">

Solution_5_1_1
    
</div>

<br><br>

### 5.2 Number of examples for each user
rubric={accuracy:4,reasoning:2}


**Your tasks:**
1. Write code to extract user ids with minimum and maximum number of examples in their corresponding `X`. Display user ids and the number of ratings available for these users.  
2. Would the size of `X` have an impact on the recommendations given by a content-based recommender system for that user? Briefly explain how. 

<div class="alert alert-warning">

Solution_5_2_1
    
</div>

<div class="alert alert-warning">

Solution_5_2_2
    
</div>

<br><br>

### 5.3 Training regression models per user
rubric={accuracy:6}

**Your tasks:**
1. For each user, train regression models of your choice to predict missing ratings in the utility matrix. 
2. Report train and validation RMSEs.

<div class="alert alert-warning">

Solution_5_3_1
    
</div>

<div class="alert alert-warning">

Solution_5_3_2
    
</div>

<br><br>

### 5.4 Discussion 
rubric={reasoning:2}

**Your tasks:**
1. Compare the validation RMSE from 5.3 with the validation RMSE you got with collaborative filtering. 

<div class="alert alert-warning">

Solution_5_4_1
    
</div>

<br><br><br><br>

## (Optional) Exercise 6: Hybrid models
rubric={reasoning:1}

When you work as a data scientist, very often you will only be given the problem that needs to be solved, without any instructions or scaffolding. Our hope is that the courses in MDS have taught you some basic concepts so that you at least know the right keywords to search for when you work on a problem. One of the useful skills to learn as a data scientist is checking whether a suitable tool already exists for your task and, if so, examining how useful and reliable it is and how easy or difficult it is to get working.  

In class, we noted that there are hybrid approaches for recommendation systems which combine collaborative filtering as well as content-based filtering.

**Your tasks:**

Search on the internet and figure out whether there are any off-the-shelf tools or packages you can use to build hybrid recommendation systems. Try to get one of these packages working. Write a thoughtful paragraph on your experience with the package. 

<div class="alert alert-warning">

Solution_6
    
</div>

<br><br><br><br>

## (Optional) Exercise 7: Your takeaway from the course
<hr>
rubric={reasoning:1}


**Your tasks:**

What is your biggest takeaway from this course? Anything else you would like to share?

<div class="alert alert-warning">

Solution_7
    
</div>

<br><br><br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Push all your work to your GitHub lab repository. 
4. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

Congratulations on finishing the last lab of this course 🎉!! 

![](eva-congrats.png)