In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")

![](img/563_lab_banner.png)

# Lab 4: Movie Recommendations

## Imports
<hr>

In [None]:
import os

import numpy as np
import pandas as pd
from hashlib import sha1

from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## Submission instructions <a name="si"></a>
rubric={mechanics}

You will receive marks for correctly submitting this assignment by following the instructions below:
    
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- [Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.
- Make at least three commits in your lab's GitHub repository.    
- Push the final .ipynb file with your solutions to your GitHub repository for this lab.        
- Before submitting your lab, run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).     
- Make sure to enroll in Gradescope via [Canvas](https://canvas.ubc.ca/courses/106525).
- Upload the .ipynb file to Gradescope.
- Make sure that your plots/output are rendered properly in Gradescope.    
- If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a PDF (preferably WebPDF) or HTML export of the .ipynb file with your solutions so that TAs can view your submission on Gradescope. 
- The data you download for this lab <b>SHOULD NOT BE PUSHED TO YOUR REPOSITORY</b> (there is also a `.gitignore` in the repo to prevent this).
- Include a clickable link to your GitHub repo for the lab just below this cell.
</div>    

_Points:_ 2

YOUR REPO LINK GOES HERE

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 1: Data and warm-up
<hr>

In this lab, you will build a variety of movie recommendation systems using the [MovieLens dataset](https://www.kaggle.com/prajitdatta/movielens-100k-dataset/data). The original source of the data is [here](https://grouplens.org/datasets/movielens/), and the structure of the data is described in the [README](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html) that comes with it. 

Run the code below, which reads the ratings data from the tab-separated file `u.data`, assuming that it is located in the `data/ml-100k/` directory of your lab folder. Timestamps can be useful in recommendation systems, but we are going to ignore them in this assignment. 

In [None]:
r_cols = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_csv(
    os.path.join("data", "ml-100k", "u.data"),
    sep="\t",
    names=r_cols,
    encoding="latin-1",
)
ratings = ratings.drop(columns=["timestamp"])
ratings.head()

In [None]:
# We'll be using these keys later in the starter code
user_key = "user_id"
item_key = "movie_id"

### 1.1 Terminology
Here is some notation we will be using in this lab. 

**Constants**:

 - $N$: the number of users, indexed by $n$
 - $M$: the number of movies, indexed by $m$
 - $\mathcal{R}$: the set of indices $(n,m)$ where we have ratings in the utility matrix $Y$
    - Thus $|\mathcal{R}|$ is the total number of ratings
 - $k$: the number of latent dimensions we use in collaborative filtering
 
**The data**:

 - $Y$: the utility matrix containing ratings, with a lot of missing entries
 - `train_mat` and `valid_mat`: Utility matrices for train and validation sets, respectively
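
To make the notation concrete, here is a minimal sketch on a hand-made utility matrix. The names `Y_toy`, `N_toy`, `M_toy`, and `R_toy` are hypothetical and are not part of the lab data or the starter code.

```python
import numpy as np

# Hypothetical toy utility matrix: 3 users (rows) x 4 movies (columns).
# np.nan marks a rating we do not have.
Y_toy = np.array(
    [
        [5.0, np.nan, 3.0, np.nan],
        [np.nan, 4.0, np.nan, np.nan],
        [1.0, np.nan, np.nan, 2.0],
    ]
)

N_toy, M_toy = Y_toy.shape             # number of users, number of movies
R_toy = np.argwhere(~np.isnan(Y_toy))  # index pairs (n, m) that have a rating
n_ratings = len(R_toy)                 # |R|, the total number of ratings
density = n_ratings / (N_toy * M_toy)  # fraction of entries that are filled in

print(N_toy, M_toy, n_ratings, round(density, 3))
```

Real utility matrices are usually far sparser than this toy one.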
  

#### 1.1.1 
rubric={accuracy}

**Your tasks:**    

How many users and items are there in the movie ratings data?  

<div class="alert alert-warning">

Solution_1_1_1
    
</div>

_Points:_ 1

In [None]:
N = ...
M = ...

In [None]:
print(f"Number of users (N)  : {N}")
print(f"Number of movies (M) : {M}")

In [None]:
grader.check("q1.1.1")

<br><br>

#### 1.1.2
rubric={accuracy}

**Your tasks:**    

What would be the shape of the utility matrix $Y$? 

<div class="alert alert-warning">

Solution_1_1_2
    
</div>

_Points:_ 1

In [None]:
# How many rows in the utility matrix?
utility_n_rows = ...

# How many columns in the utility matrix?
utility_n_cols = ...

In [None]:
grader.check("q1.1.2")

<br><br>

#### 1.1.3
rubric={accuracy}

**Your tasks:**    

What is the percentage of non-nan ratings in the utility matrix $Y$? 

<div class="alert alert-warning">

Solution_1_1_3
    
</div>

_Points:_ 1

In [None]:
non_nan_ratings_percentage = ...

In [None]:
print(f"Non-nan ratings percentage: {np.round(non_nan_ratings_percentage,3)}")

In [None]:
grader.check("q1.1.3")

<br><br>

#### 1.1.4
rubric={accuracy}

**Your tasks:**    

What is the average number of ratings per user and per movie? 

<div class="alert alert-warning">

Solution_1_1_4
    
</div>

_Points:_ 2

In [None]:
avg_nratings_per_user = ...
avg_nratings_per_movie = ...

In [None]:
print(f"Average number of ratings per user : {avg_nratings_per_user}")
print(f"Average number of ratings per movie: {avg_nratings_per_movie}")

In [None]:
grader.check("q1.1.4")

<br><br>

### 1.2 Data splitting 
rubric={accuracy}

**Your tasks:**

1. Split the ratings data into train and validation splits with `test_size=0.2` and `random_state=42`. 

<div class="alert alert-warning">

Solution_1_2_1
    
</div>

_Points:_ 1

In [None]:
...

In [None]:
X_train, X_valid, y_train, y_valid = ...

In [None]:
grader.check("q1.2")

<br><br>

### 1.3 Utility matrix 
rubric={accuracy}

The code below creates user and item mappers which map user ids and item ids to indices. 

**Your tasks:**
1. Create utility matrices for train and validation sets and store them in `train_mat` and `valid_mat` variables, respectively. How many non-nan elements are there in each of these matrices? 

> You may use the code from lecture notes with appropriate attributions.  

> You wouldn't do this in real life, but since our dataset is not that big, create a dense utility matrix in this assignment. You are welcome to try a sparse matrix, but then you may have to change some of the starter code provided in the later exercises.  
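
As a rough illustration of the idea (not the lecture code), the sketch below turns a tiny long-format ratings table into a dense matrix with `np.nan` for missing entries. The table `toy_ratings` and the names `toy_user_mapper`, `toy_item_mapper`, and `toy_mat` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings table (not the lab data).
toy_ratings = pd.DataFrame(
    {
        "user_id": [10, 10, 20, 30],
        "movie_id": [7, 9, 7, 8],
        "rating": [4, 2, 5, 3],
    }
)

# Map raw ids to consecutive row/column indices.
toy_user_mapper = {uid: i for i, uid in enumerate(np.unique(toy_ratings["user_id"]))}
toy_item_mapper = {mid: j for j, mid in enumerate(np.unique(toy_ratings["movie_id"]))}

# Start from an all-NaN matrix and fill in the known ratings.
toy_mat = np.full((len(toy_user_mapper), len(toy_item_mapper)), np.nan)
for row in toy_ratings.itertuples(index=False):
    toy_mat[toy_user_mapper[row.user_id], toy_item_mapper[row.movie_id]] = row.rating

print(toy_mat)
print("non-NaN entries:", np.count_nonzero(~np.isnan(toy_mat)))
```

Each row of the long table fills exactly one cell, so the number of non-NaN entries matches the number of ratings as long as no (user, movie) pair is repeated.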

In [None]:
user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(ratings[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(ratings[item_key])))

In [None]:
train_mat = None
valid_mat = None

<div class="alert alert-warning">

Solution_1_3
    
</div>

_Points:_ 4

In [None]:
...

In [None]:
# What's the number of non-nan elements in train_mat (nnn_train_mat)?
nnn_train_mat = ...

# What's the number of non-nan elements in valid_mat (nnn_valid_mat)?
nnn_valid_mat = ...

In [None]:
print(f"Number of non-nan elements in train_mat: {nnn_train_mat}")
print(f"Number of non-nan elements in valid_mat: {nnn_valid_mat}")

In [None]:
grader.check("q1.3")

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 Evaluation
rubric={reasoning}

You will be developing a number of models to complete the utility matrix in this assignment. To compare these models, you'll be evaluating them using the functions below. 
- Given two matrices, the `error` function below returns RMSE of non-nan elements.
- Given predictions and train and validation utility matrices, the `evaluate` function below prints train and validation RMSEs by calling the `error` function for each set. 
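
The following small sketch shows why an RMSE computed with `np.nanmean` only counts the entries where a rating exists. The arrays `pred_toy` and `obs_toy` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical predictions and observed ratings; np.nan marks "no rating".
pred_toy = np.array([[3.0, 3.0], [3.0, 3.0]])
obs_toy = np.array([[5.0, np.nan], [np.nan, 2.0]])

# NaN entries propagate through the subtraction and are skipped by np.nanmean,
# so only the two observed cells contribute: sqrt(((3-5)**2 + (3-2)**2) / 2).
rmse_toy = np.sqrt(np.nanmean((pred_toy - obs_toy) ** 2))
print(round(rmse_toy, 4))
```

Predictions for cells without an observed rating do not affect the score at all, which is worth keeping in mind for the discussion question.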

**Your task:**

1. Discuss this evaluation metric in the context of recommender systems focussing on the following points:
    - Do we have ground truth in the context of recommender systems? 
    - What exactly are we comparing in order to evaluate recommender systems?
    - Can we guarantee that the recommendations given by a system with low RMSE are going to be effective recommendations in the sense that customers are likely to consume the recommended items? Briefly discuss. 

In [None]:
def error(Y1, Y2):
    """
    Given two matrices of the same shape, 
    returns the root mean squared error (RMSE).
    """
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


def evaluate(pred_Y, train_mat, valid_mat, model_name="Global average"):
    """
    Given predicted utility matrix and train and validation utility matrices 
    print train and validation RMSEs.
    """
    print("%s train RMSE: %0.2f" % (model_name, error(pred_Y, train_mat)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_Y, valid_mat)))

<div class="alert alert-warning">

Solution_1_4
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Baselines
<hr>

In this exercise you'll implement a number of baseline models to fill in the missing entries of the utility matrix.

As an example, the code below implements the global average rating baseline. 

**global average rating baseline**

In [None]:
avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

<br><br>

<!-- BEGIN QUESTION -->

### 2.1 Per-user average baseline
rubric={accuracy}

**Your tasks:**

1. Implement per-user average baseline and report train and validation RMSEs.

<div class="alert alert-warning">

Solution_2_1
    
</div>

_Points:_ 3

In [None]:
avg_n = None

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Per-movie average baseline
rubric={accuracy}

**Your tasks:**

1. Implement per-movie average baseline and report train and validation RMSEs.

<div class="alert alert-warning">

Solution_2_2
    
</div>

_Points:_ 3

In [None]:
avg_m = None

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.3 Average of per-user and per-movie average baselines
rubric={accuracy}

**Your tasks:**

1. Implement average of per-movie and per-user averages baseline and report train and validation RMSEs.

<div class="alert alert-warning">

Solution_2_3
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.4 $k$-nearest neighbours imputation
rubric={accuracy}

**Your tasks:**

1. Try [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) with at least 3 choices for the hyperparameter `n_neighbors` to fill in the missing entries. 
2. Report train and validation RMSEs. 
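
To get a feel for the `KNNImputer` interface before running it on the real utility matrix, here is a minimal sketch on a hypothetical 4 x 3 matrix. The values in `toy_mat` and the choices of `n_neighbors` are arbitrary.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy utility matrix: rows are users, columns are movies.
toy_mat = np.array(
    [
        [5.0, 4.0, np.nan],
        [4.0, np.nan, 2.0],
        [1.0, 2.0, 5.0],
        [np.nan, 1.0, 4.0],
    ]
)

for k in [1, 2]:
    # Each missing entry is filled using the k most similar rows (users)
    # that have a value in that column.
    imputer = KNNImputer(n_neighbors=k)
    filled = imputer.fit_transform(toy_mat)
    print(f"n_neighbors={k}\n{np.round(filled, 2)}")
```

Observed entries are left unchanged by the imputer; only the `np.nan` cells are replaced.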

<div class="alert alert-warning">

Solution_2_4
    
</div>

_Points:_ 4

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.5 Discussion
rubric={reasoning}

**Your tasks:**

Compare and discuss the results of all the baseline methods you tried in Exercise 2. 

<div class="alert alert-warning">

Solution_2_5
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 3: Collaborative filtering

**Collaborative filtering** is one of the most popular approaches to filling in the missing entries of the utility matrix. It is based on an idea similar to LSA or `TruncatedSVD`.  

<!-- BEGIN QUESTION -->

### 3.1 `TruncatedSVD` by replacing missing values with zeros
rubric={accuracy,quality}

Utility matrices are usually huge with many missing entries. If we want to use scikit-learn's `TruncatedSVD`, we first need to impute the missing values with some numeric values. In this exercise, you'll first center the non-nan ratings, replace missing entries with zeros, and experiment with `TruncatedSVD` with different values of $k$ (`n_components` hyperparameter of `TruncatedSVD`) to fill in the missing entries in the utility matrix. 

**Your tasks:**

1. Subtract the per-user and per-movie rating averages from the non-nan ratings in the utility matrix. 
2. Replace missing values in the train utility matrix with zeros. 
> Hint: See help of [`np.nan_to_num`](https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html). 
3. Train [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) and get reconstructions to fill in the missing entries in the utility matrix. Experiment with at least a few values of $k$ (`n_components` hyperparameter of `TruncatedSVD`). Report train and validation RMSEs in each case. 
> When you reconstruct the data, do not forget to add back the averages you subtracted in the first step.
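
The centre, zero-fill, factorize, and reconstruct pipeline can be sketched on a toy matrix as follows. This is only an illustration under a simplifying assumption: it centres with the global mean instead of the per-user and per-movie averages asked for above, and `toy_mat` is a made-up matrix.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy utility matrix with missing entries.
toy_mat = np.array(
    [
        [5.0, 4.0, np.nan, 1.0],
        [4.0, np.nan, 1.0, 1.0],
        [1.0, 1.0, 5.0, np.nan],
        [np.nan, 1.0, 4.0, 5.0],
    ]
)

# Simplification: centre with the global mean rather than the per-user and
# per-movie averages asked for in the task.
mu = np.nanmean(toy_mat)
centred = np.nan_to_num(toy_mat - mu)  # NaN -> 0, i.e. "average" after centring

svd = TruncatedSVD(n_components=2, random_state=42)
Z = svd.fit_transform(centred)         # user factors, shape (4, 2)
recon = svd.inverse_transform(Z) + mu  # reconstruct and add the mean back

print(np.round(recon, 2))
```

The reconstruction has a value in every cell, including those that were missing in `toy_mat`, which is how the missing entries get filled in.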

<div class="alert alert-warning">

Solution_3_1
    
</div>

_Points:_ 6

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.2 Discussion 
rubric={reasoning}

1. What's wrong with the approach in 3.1? Why is it not common to use `scikit-learn`'s `TruncatedSVD` for collaborative filtering? Why do we need a separate package?

<div class="alert alert-warning">

Solution_3_2
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.3 Collaborative filtering with `surprise` package
rubric={accuracy,reasoning}

Use the [`surprise`](https://surprise.readthedocs.io/en/stable/) package, which implements dimensionality reduction with proper handling of missing values and is therefore suitable for recommendation systems. You can install it in your environment as follows. 

```
>> conda activate 563
>> conda install -c conda-forge scikit-surprise
or 
>> pip install scikit-surprise
```

**Your tasks:**

 
1. Carry out cross-validation using the [`surprise`](https://surprise.readthedocs.io/en/stable/) package and SVD algorithm. Report mean RMSEs. 
2. Briefly comment on the results. 

<div class="alert alert-warning">

Solution_3_3
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 4: Content-based recommenders
<hr> 

Collaborative filtering is an unsupervised approach to fill in missing entries in the utility matrix. In this exercise, you'll explore content-based filtering, a supervised machine learning approach to recommendation systems. 

Collaborative filtering only uses the utility matrix. But usually there is information available about items (i.e., movies in our case) or users. Content-based filtering exploits this information along with ratings information to understand the taste of a user and predict their ratings for the items they have not consumed or rated yet. 
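
As a minimal sketch of this idea, the example below fits one regression model for a single hypothetical user on made-up genre features and predicts ratings for movies that user has not rated. `W_toy`, the ratings, and the choice of `Ridge` are all assumptions for illustration; any regressor could play the same role.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical genre features for 5 movies: columns = [Action, Comedy, Drama].
W_toy = np.array(
    [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 1, 0],
        [0, 1, 1],
    ]
)

# One hypothetical user rated movies 0, 1 and 2.
rated_idx = [0, 1, 2]
X_user = W_toy[rated_idx]           # features of the movies this user rated
y_user = np.array([5.0, 2.0, 3.0])  # this user's ratings for those movies

# Ridge is an arbitrary choice here; any regressor could be used instead.
model = Ridge(alpha=1.0)
model.fit(X_user, y_user)

# Predict ratings for the movies the user has not rated yet.
unrated_idx = [3, 4]
preds = model.predict(W_toy[unrated_idx])
print(np.round(preds, 2))
```

Because this user rated the Action movie highest, the unrated Action/Comedy movie gets a higher predicted rating than the Comedy/Drama one.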

The code below loads movie genre features from `data/ml-100k/u.item` and stores them in a variable called `W`. 

In [None]:
cols = [
    "movie_id",
    "movie title",
    "release date",
    "video release date",
    "IMDb URL",
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]

movies_data = pd.read_csv(
    os.path.join("data", "ml-100k", "u.item"),
    sep="|",
    names=cols,
    encoding="latin-1",
)
movies_data.head()

In [None]:
genres = [
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]
movie_genres = movies_data[genres]
movie_genres.head()

In [None]:
W = movie_genres.to_numpy()
W.shape

In [None]:
print(f"Average number of genres per movie: {(W.sum() / M)}")

<br><br>

<!-- BEGIN QUESTION -->

### 4.1 Create `X` and `y` per user 
rubric={accuracy}

In content-based filtering, we create a separate profile (`X` and `y`) for each user depending upon how many items they have rated so far. Since the number of items rated by each user is different, the size of `X` is going to be different for each user. The function `get_X_y_per_user` below creates `X` and `y` for each user with movie genre features.   

**Your tasks:**

1. Create `X` and `y` per user by calling the function `get_X_y_per_user` below on train and validation ratings. 

In [None]:
from collections import defaultdict


def get_X_y_per_user(ratings_df, d=W.shape[1]):
    """
    Returns X and y for each user.

    Parameters:
    ----------
    ratings_df : pandas.DataFrame
         ratings data as a dataframe

    d : int
        number of item features

    Return:
    ----------
        dictionaries containing X and y for all users
    """
    lr_y = defaultdict(list)
    lr_X = defaultdict(list)

    for index, val in ratings_df.iterrows():
        n = user_mapper[val[user_key]]
        m = item_mapper[val[item_key]]
        lr_X[n].append(W[m])
        lr_y[n].append(val["rating"])

    for n in lr_X:
        lr_X[n] = np.array(lr_X[n])
        lr_y[n] = np.array(lr_y[n])

    return lr_X, lr_y

<div class="alert alert-warning">

Solution_4_1
    
</div>

_Points:_ 3

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 4.2 Number of examples for each user
rubric={accuracy,reasoning}

**Your tasks:**
1. Write code to extract the user ids with the minimum and maximum number of examples in their corresponding `X`. Display the user ids and the number of ratings available for these users. If multiple users are tied for the minimum or maximum number of ratings, just show the id and the number of ratings for one of them. 
2. Would the size of `X` have an impact on the recommendations given by a content-based recommender system for that user? Why or why not? 

<div class="alert alert-warning">

Solution_4_2
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 4.3 Training regression models per user
rubric={accuracy}

**Your tasks:**
1. For each user, train regression models of your choice to predict missing ratings in the utility matrix. 
2. Report train and validation RMSEs.

<div class="alert alert-warning">

Solution_4_3
    
</div>

_Points:_ 6

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 4.4 Discussion 
rubric={reasoning}

**Your tasks:**
1. Compare the validation RMSE from 4.3 with the validation RMSE you got with collaborative filtering. 
2. Discuss advantages and disadvantages of content-based filtering over collaborative filtering. 

<div class="alert alert-warning">

Solution_4_4
    
</div>

_Points:_ 3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 5: Food for thought
<hr>

Each lab will have a few challenging questions. These are usually low-risk questions and will contribute at most 5% of the lab grade. The main purpose here is to challenge yourself, dig deeper into a particular area, and go beyond what we explicitly discussed in class. When you start working on labs, attempt all other questions before moving on to these challenging questions. If you are running out of time, please skip the challenging questions. 

![](img/eva-game-on.png)

<!-- BEGIN QUESTION -->

### (Challenging) 5.1 `top_n` predictions
<hr>

rubric={reasoning}

**Your tasks:**
1. Fit the SVD model on the train set using the `surprise` package. Write a function which returns `top_n` movie rating predictions for a given user id. Movies with these top ratings could be recommended to the user. 

> You may adapt [this code](https://github.com/NicolasHug/Surprise/blob/master/examples/top_n_recommendations.py) from the developer of the surprise package. If you do so, provide proper attributions. 
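
Independently of the `surprise` package, the ranking step itself can be sketched with plain NumPy. The function `top_n_for_user` and the toy arrays below are hypothetical; with `surprise` you would first have to collect the predicted ratings for a user into a similar structure.

```python
import numpy as np


def top_n_for_user(pred_row, already_rated, n=2):
    """Return indices of the n highest predicted ratings among unrated items."""
    scores = pred_row.astype(float).copy()
    scores[already_rated] = -np.inf  # never recommend items the user has seen
    order = np.argsort(-scores)      # indices sorted by descending score
    return [int(i) for i in order[:n]]


pred_row_toy = np.array([4.8, 3.1, 4.9, 2.0, 4.2])  # hypothetical predictions
already_rated_toy = np.array([True, False, False, False, False])

top_toy = top_n_for_user(pred_row_toy, already_rated_toy, n=2)
print(top_toy)
```

Masking already-rated items before sorting keeps the function from recommending movies the user has already seen.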

<div class="alert alert-warning">

Solution_5_1
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### (Challenging) 5.2 Hybrid models
rubric={reasoning}

When you work as a data scientist, it is likely that you will be given a problem that needs to be solved without explicit instructions or scaffolding. Our hope is that the courses in MDS have taught you some fundamentals so that you at least know the right keywords to search for when you come across something that we have not explicitly talked about during the program. One of the useful skills to learn as a data scientist is to check whether there is a suitable tool available for your task and, if there is, to examine how useful and reliable it is and how easy or difficult it is to get it working.  

In class, we noted that there are hybrid approaches for recommendation systems which combine collaborative filtering as well as content-based filtering.

**Your tasks:**

Search on the internet and figure out whether there are any off-the-shelf tools or packages you can use to build hybrid recommendation systems. Try to get one of these packages working. Write a thoughtful paragraph on your experience with the package. 

<div class="alert alert-warning">

Solution_5_2
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### (Challenging) 5.3 Your takeaway from the course
<hr>
rubric={reasoning:1}

**Your tasks:**

What is your biggest takeaway from this course? Anything else you would like to share?

> Detailed and thoughtful answers are appreciated. 

<div class="alert alert-warning">

Solution_5_3
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br><br><br>

Before submitting your assignment, please make sure you have followed all the instructions in the Submission Instructions section at the top. 

Congratulations on finishing the last lab of this course 🎉!! 

In [None]:
from IPython.display import Image
Image("img/eva-congrats.png")