# Lab 7

## Collaborative filtering and recommendations

### At the end of this lab, I should be able to
* Understand how item-item and user-user collaborative filtering produce recommendations
* Explain an experiment in which we compared item-item and user-user filtering

**Note:** Exercises can be autograded and count towards your lab and assignment score. Problems are graded for participation.

**Video Introduction:**
https://calpoly.zoom.us/rec/share/TNZApiQaUUNRyMrKIl8MAVEh1FCIFNUTCthC81lA1Cn-Vw2CVn3hWBd6Wtde2WXj.V_7lCBK8Yk1MXx_r?startTime=1646613203000

In [1]:
from pathlib import Path
home = str(Path.home()) # All other paths are relative to this directory; change it if your files live elsewhere

In [2]:
%load_ext autoreload
%autoreload 2

# make sure you run the cell above before running this
import Lab7_helper

## Real dataset: Movielens

https://grouplens.org/datasets/movielens/

> MovieLens is a collaborative filtering system for movies. A
user of MovieLens rates movies using 1 to 5 stars, where 1 is "Awful" and 5 is "Must
See". MovieLens then uses the ratings of the community to recommend other movies
that user might be interested in, predict what that user might rate a movie,
or perform other tasks. - "Collaborative Filtering Recommender Systems"

In [3]:
import pandas as pd
import numpy as np

ratings = pd.read_csv(f'{home}/csc-466-student/data/movielens-small/ratings.csv') # you might need to change this path
ratings = ratings.dropna()
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
movies = pd.read_csv(f'{home}/csc-466-student/data/movielens-small/movies.csv')
movies = movies.dropna()
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Joining the data together
We need to join the two source dataframes into a single one called ``data``. I do this by setting the index of each to ``movieId`` and specifying an ``inner`` join, which means a movie must exist on both sides of the join to appear in the result. Then I reset the index so that I can later set a multi-index of ``userId`` and ``movieId``. The results are displayed below. Pandas is awesome, but it takes some time to get used to how everything works.

In [5]:
data = movies.set_index('movieId').join(ratings.set_index('movieId'),how='inner').reset_index()
data = data.drop('timestamp',axis=1) # We won't need timestamp here
data.head()

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5
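
To make the effect of the ``inner`` join concrete, here is a minimal sketch on made-up data. The names ``toy_movies``, ``toy_ratings``, and ``toy_data``, and every value in them, are hypothetical and are not part of the MovieLens files.

```python
import pandas as pd

# Hypothetical toy tables standing in for movies.csv and ratings.csv.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 4], "rating": [4.0, 3.5, 2.0]}
)

# Same pattern as the cell above: index both sides by movieId, inner join, reset.
toy_data = (
    toy_movies.set_index("movieId")
    .join(toy_ratings.set_index("movieId"), how="inner")
    .reset_index()
)

# Movies 2 and 3 have no ratings and the rating for movie 4 has no movie row,
# so only the two ratings of movie 1 survive the inner join.
print(toy_data)
```

Movies without ratings and ratings without a matching movie are both dropped, which is why the joined frame can be shorter than either input.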


In [6]:
ratings = data.set_index(['userId','movieId'])['rating']
ratings # as Series

userId  movieId
1       1          4.0
5       1          4.0
7       1          4.5
15      1          2.5
17      1          4.5
                  ... 
184     193581     4.0
        193583     3.5
        193585     3.5
        193587     3.5
331     193609     4.0
Name: rating, Length: 100836, dtype: float64
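
The exercises below call ``ratings.unstack()`` on this Series. As a small sketch of what that does, the following pivots a made-up Series with the same ``(userId, movieId)`` multi-index into a user-by-movie matrix. The names ``toy``, ``toy_series``, and ``toy_matrix`` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (user, movie) pair.
toy = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 5.0, 3.0]}
)

# Series indexed by (userId, movieId), like `ratings` above.
toy_series = toy.set_index(["userId", "movieId"])["rating"]

# unstack() moves the inner index level (movieId) into the columns,
# giving one row per user; pairs that were never rated become NaN.
toy_matrix = toy_series.unstack()
print(toy_matrix)

# A single user's row, which is the shape of input the exercises pass per user.
print(toy_matrix.loc[1])
```

Most entries of the real matrix are NaN, since each user rates only a small share of the movies.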

#### Exercise 1
I provide a structure for predicting recommendations using user-user collaborative filtering. For this exercise, please complete the missing components.

``data_raw`` - your entire dataframe

``x_raw`` - the data from a single user

``N`` - neighborhood size

``frac`` - fraction for your test dataset
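
As a rough illustration of how these parameters interact, here is a dependency-free sketch of user-user filtering on made-up data. It is not the structure of ``Lab7_helper.predict_user_user``. The function names, the toy ratings, and the specific choices (mean-centering each user, cosine similarity with missing ratings treated as 0, a similarity-weighted average of neighbor deviations) are all assumptions made for the example.

```python
import math
import random

def mean_center(ratings):
    """Return (mean, {item: rating - mean}) for one user's ratings dict."""
    mean = sum(ratings.values()) / len(ratings)
    return mean, {item: r - mean for item, r in ratings.items()}

def cosine(a, b):
    """Cosine similarity of two sparse dict vectors; missing entries count as 0."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_user_user_sketch(others, known, item, N=2):
    """Predict the target user's rating of `item` from the N most similar users."""
    x_mean, x_centered = mean_center(known)
    candidates = []
    for ratings in others.values():
        if item in ratings:  # only users who rated the item can vote
            _, centered = mean_center(ratings)
            candidates.append((cosine(x_centered, centered), centered[item]))
    neighbors = sorted(candidates, reverse=True)[:N]  # neighborhood of size N
    weight = sum(abs(sim) for sim, _ in neighbors)
    if weight == 0:
        return x_mean  # no usable neighbor: fall back to the user's mean
    return x_mean + sum(sim * dev for sim, dev in neighbors) / weight

def holdout_mae(others, x, N=2, frac=0.5, seed=0):
    """Hide a fraction `frac` of the user's ratings, predict them, return the MAE."""
    if len(x) < 2:
        raise ValueError("need at least two ratings to hold one out")
    items = sorted(x)
    k = min(len(items) - 1, max(1, int(frac * len(items))))
    test_items = random.Random(seed).sample(items, k)
    known = {i: r for i, r in x.items() if i not in test_items}
    errors = [abs(predict_user_user_sketch(others, known, i, N) - x[i]) for i in test_items]
    return sum(errors) / len(errors)

# Hypothetical toy data: every user except the target.
toy_others = {
    "bob": {"m1": 4.0, "m2": 2.0, "m3": 5.0},
    "cat": {"m1": 1.0, "m2": 5.0, "m3": 2.0},
}
toy_known = {"m1": 5.0, "m2": 1.0}

print(predict_user_user_sketch(toy_others, toy_known, "m3", N=1))
print(holdout_mae(toy_others, {"m1": 5.0, "m2": 1.0, "m3": 4.0}, N=2, frac=0.34))
```

With ``N=1`` only the most similar user contributes, so the prediction is the target's mean plus that neighbor's deviation from their own mean. A larger ``N`` blends in more users, including dissimilar ones whose negative similarity flips the sign of their vote.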

In [7]:
mae = Lab7_helper.predict_user_user(ratings.unstack(),ratings.unstack().loc[1])
mae

169
61
10
86
250


0.8241596814667028

#### Exercise 2
I provide a structure for predicting recommendations using item-item collaborative filtering. For this exercise, please complete the missing components.
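
For contrast with the user-user approach, here is a dependency-free sketch of item-item filtering on made-up data. It is not the structure of ``Lab7_helper.predict_item_item``. The function names and toy ratings are hypothetical, and mean-centering by user is an assumption. The ``(cos + 1) / 2`` rescaling of cosine similarity to the range 0 to 1 follows the expression visible in the cell output below.

```python
import math

def centered_item_vectors(db):
    """Build {item: {user: rating - that user's mean}} from {user: {item: rating}}."""
    vectors = {}
    for user, ratings in db.items():
        mean = sum(ratings.values()) / len(ratings)
        for item, rating in ratings.items():
            vectors.setdefault(item, {})[user] = rating - mean
    return vectors

def scaled_cosine(a, b):
    """Cosine similarity of two sparse dict vectors, rescaled from [-1, 1] to [0, 1]."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    cos = dot / norm if norm else 0.0
    return (cos + 1) / 2

def predict_item_item_sketch(db, known, item, N=2):
    """Predict a rating for `item` from the user's own ratings of the N most similar items."""
    vectors = centered_item_vectors(db)
    fallback = sum(known.values()) / len(known)
    if item not in vectors:
        return fallback  # nobody in db rated the item
    scored = [
        (scaled_cosine(vectors[item], vectors[j]), r)
        for j, r in known.items()
        if j in vectors
    ]
    neighbors = sorted(scored, reverse=True)[:N]  # N most similar rated items
    weight = sum(sim for sim, _ in neighbors)
    if weight <= 0:
        return fallback
    return sum(sim * r for sim, r in neighbors) / weight

# Hypothetical toy data: m3 is rated like m1 and opposite to m2 by every user.
toy_db = {
    "bob": {"m1": 5.0, "m2": 1.0, "m3": 5.0},
    "cat": {"m1": 1.0, "m2": 5.0, "m3": 1.0},
    "dan": {"m1": 4.0, "m2": 2.0, "m3": 4.0},
}
toy_known = {"m1": 5.0, "m2": 2.0}

print(predict_item_item_sketch(toy_db, toy_known, "m3", N=2))
```

Here the neighbors are items rather than users, so the prediction is a weighted average of the target user's own ratings. With non-negative weights it stays within the range of ratings that user has already given.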

In [8]:
mae = Lab7_helper.predict_item_item(ratings.unstack(),ratings.unstack().loc[1])
mae

  sims = (db.drop(movie).loc[ix_raw].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)+1)/2


0.74

#### Problem 1
This is an open-ended question that requires you to code. I have provided my own ratings for some of the movies in the dataset. What would you recommend to me based on my ratings if you applied user-user filtering? Feel free to substitute your own ratings. I rated the five movies that have been rated by the most users.

##### Upload a copy of your code, output, and discussion here: https://canvas.calpoly.edu/courses/67334/assignments/477738 

In [9]:
data[['movieId','title']].value_counts()

movieId  title                                 
356      Forrest Gump (1994)                       329
318      Shawshank Redemption, The (1994)          317
296      Pulp Fiction (1994)                       307
593      Silence of the Lambs, The (1991)          279
2571     Matrix, The (1999)                        278
                                                  ... 
4093     Cop (1988)                                  1
4089     Born in East L.A. (1987)                    1
58351    City of Men (Cidade dos Homens) (2007)      1
4083     Best Seller (1987)                          1
193609   Andrew Dice Clay: Dice Rules (1991)         1
Length: 9724, dtype: int64

In [10]:
counts = data[['movieId','title']].value_counts().reset_index()

In [11]:
user_ratings = pd.DataFrame(index=['Dr. Anderson'],columns=counts['title'])
user_ratings.loc["Dr. Anderson","Forrest Gump (1994)"] = 4
user_ratings.loc["Dr. Anderson","Shawshank Redemption, The (1994)"] = 5
user_ratings.loc["Dr. Anderson","Pulp Fiction (1994)"] = 3
user_ratings.loc["Dr. Anderson","Silence of the Lambs, The (1991)"] = 2
user_ratings.loc["Dr. Anderson","Matrix, The (1999)"] = 5
user_ratings

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Doomsday (2008),Gardens of Stone (1987),"Fourth Protocol, The (1987)",Mongol (2007),War Dance (2007),Cop (1988),Born in East L.A. (1987),City of Men (Cidade dos Homens) (2007),Best Seller (1987),Andrew Dice Clay: Dice Rules (1991)
Dr. Anderson,4,5,3,2,5,,,,,,...,,,,,,,,,,


In [12]:
ratings_reordered = ratings.unstack().T.loc[counts['movieId']].T # reorder the ratings to be the same as above
ratings_reordered.columns = user_ratings.columns
ratings_reordered

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Doomsday (2008),Gardens of Stone (1987),"Fourth Protocol, The (1987)",Mongol (2007),War Dance (2007),Cop (1988),Born in East L.A. (1987),City of Men (Cidade dos Homens) (2007),Best Seller (1987),Andrew Dice Clay: Dice Rules (1991)
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,5.0,...,,,,,,,,,,
2,,3.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,0.5,...,,,,,,,,,,
4,,,1.0,5.0,1.0,5.0,,,,,...,,,,,,,,,,
5,,3.0,5.0,,,,,4.0,3.0,5.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,4.0,3.5,5.0,4.5,5.0,4.5,2.5,3.5,3.5,5.0,...,,,,,,,,,,
607,,5.0,3.0,5.0,5.0,3.0,4.0,5.0,4.0,5.0,...,,,,,,,,,,
608,3.0,4.5,5.0,4.0,5.0,3.5,3.0,4.0,3.0,4.0,...,,,,,,,,,,
609,4.0,4.0,4.0,,,,3.0,3.0,3.0,,...,,,,,,,,,,


In [15]:
### Start coding your solution here

#### Problem 2
Repeat Problem 1, but recommend movies using item-item filtering. Do the recommendations differ? Which approach do you think is more reasonable?

##### Your solution here: https://canvas.calpoly.edu/courses/67334/assignments/477739

In [14]:
# Good job!
# Don't forget to push with ./submit.sh

#### Having trouble with the test cases and the autograder?

You can always load up the answers for the autograder. The autograder runs your code and compares your answer to the expected answer. I manually review your code, so there is no need to hide this from you.

```python
import joblib
answers = joblib.load(f"{home}/csc-466-student/tests/answers_Lab7.joblib")
answers.keys()
```