# Collaborative filtering

In this notebook, we walk through the code detailed in your trains for creating collaborative filtering recommender functions.

**NOTE**: the functions and most of the code in this notebook are exactly the same as what appears in the trains. All we have done here is break the functions down into pieces, as in the webinars, to demonstrate how each piece fits together.

In [16]:
# Import our regular old heroes
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of NumPy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient sorting of large lists

# Imported for our sanity
import warnings
warnings.filterwarnings('ignore')

---

## Read in our data

Note here how we're only loading in one dataframe - in contrast to content-based filtering, we don't need all the information about the different books. We only need information about the user IDs, item IDs (and titles in this case, for ease of identification), and the user's ratings:

In [17]:
book_ratings = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/unsupervised_sprint/book_ratings.csv')
book_ratings.head()

Unnamed: 0,user_id,book_id,title,rating
0,314,1,Harry Potter and the Half-Blood Prince (Harry ...,5
1,439,1,Harry Potter and the Half-Blood Prince (Harry ...,3
2,588,1,Harry Potter and the Half-Blood Prince (Harry ...,5
3,1169,1,Harry Potter and the Half-Blood Prince (Harry ...,4
4,1185,1,Harry Potter and the Half-Blood Prince (Harry ...,4


We create a utility matrix by applying the `.pivot_table` method to our dataframe. We use `user_id` as the index, `title` as the columns and `rating` as the values in the cells. This utility matrix is what we will build our similarity matrix from.

*Note that we used `title` instead of `book_id` as our columns - this is for easier identification of the books.*

In [18]:
util_matrix = book_ratings.pivot_table(index=['user_id'],
                                       columns=['title'],
                                       values='rating')
util_matrix.shape

(28906, 812)
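
To see what `.pivot_table` is doing, here is a minimal sketch on a made-up ratings table. The `toy_ratings` data and its book titles are invented for illustration and are not part of `book_ratings`. Each user becomes a row, each title becomes a column, and any user-title pair without a rating shows up as NaN:

```python
import pandas as pd

# Hypothetical ratings in the same long format as book_ratings
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title": ["Dune", "Emma", "Dune", "Emma", "Ulysses"],
    "rating": [5, 3, 4, 2, 5],
})

# One row per user, one column per title, ratings in the cells
toy_util = toy_ratings.pivot_table(index="user_id", columns="title", values="rating")
print(toy_util)
```

Because `pivot_table` aggregates with the mean by default, a user who rated the same title more than once would end up with the average of those ratings in that cell.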

Our utility matrix will inherently be sparse - not every user will have rated every single item in our system, and not every item will have been rated. Hence, you will see a lot of NaNs in the utility matrix:

In [19]:
util_matrix.head()

title,'Salem's Lot,"'Tis (Frank McCourt, #2)",1421: The Year China Discovered America,1776,1984,A Bend in the River,A Bend in the Road,A Brief History of Time,A Briefer History of Time,A Case of Need,...,"Women in Love (Brangwen Family, #2)",World War Z: An Oral History of the Zombie War,"World Without End (The Kingsbridge Series, #2)",Wuthering Heights,"Xenocide (Ender's Saga, #3)",Year of Wonders,You Shall Know Our Velocity!,Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values,Zodiac,number9dream
user_id,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
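
One way to put a number on this sparsity is to measure the share of empty cells. The sketch below does this for a small made-up matrix (`toy_util` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 3-book utility matrix; NaN marks "not rated"
toy_util = pd.DataFrame(
    {"Dune": [5.0, 4.0, np.nan],
     "Emma": [3.0, np.nan, 2.0],
     "Ulysses": [np.nan, np.nan, 5.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# Fraction of user-book cells that hold no rating
sparsity = toy_util.isna().to_numpy().mean()
print(f"{sparsity:.1%} of cells are empty")
```

Applying the same expression to `util_matrix` should report how sparse the real matrix is.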


Next up, we normalise our utility matrix: for each user, we subtract their mean rating and divide by the range of their ratings (maximum minus minimum). This puts our ratings on an even playing field based on each individual's rating habits, and helps us construct our cosine similarity matrix.

We then fill all NaNs with 0s to assist in the cosine similarity calculation, transpose our matrix using the `.T` method, and drop any users whose normalised ratings are all 0. Transposing puts the users on the columns, which makes this easier to do. The dropped users are those whose ratings are all identical (including users with only a single rating): their normalised values work out to 0 divided by 0, which gives NaN and is then filled with 0. They will not help our process of finding similar users to our reference user, so dropping them saves time and space in our cosine similarity calculations.

We then save our utility matrix as a sparse matrix, using SciPy's compressed sparse row (CSR) format - this is to help us save space.

In [None]:
# Normalize each row (a given user's ratings) of the utility matrix
util_matrix_norm = util_matrix.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)

# Fill Nan values with 0's, transpose matrix, and drop users with no ratings
util_matrix_norm.fillna(0, inplace=True)
util_matrix_norm = util_matrix_norm.T
util_matrix_norm = util_matrix_norm.loc[:, (util_matrix_norm != 0).any(axis=0)]

# Save the utility matrix in scipy's sparse matrix format
util_matrix_sparse = sp.sparse.csr_matrix(util_matrix_norm.values)
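
The sketch below runs the same steps on a tiny made-up matrix so that each intermediate result can be checked by hand. The data and the helper name `normalise_row` are assumptions for illustration; the helper mirrors the lambda above using pandas methods. Note how user 2, who has rated only one book, ends up with an all-zero column and is dropped:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical utility matrix: user 2 has rated only one book
toy_util = pd.DataFrame(
    {"Dune": [5.0, 4.0, np.nan],
     "Emma": [3.0, np.nan, 2.0],
     "Ulysses": [np.nan, np.nan, 5.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

def normalise_row(x):
    # Mean-centre a user's ratings and scale by that user's rating range
    return (x - x.mean()) / (x.max() - x.min())

toy_norm = toy_util.apply(normalise_row, axis=1)

# Fill NaNs with 0, put users on the columns, and drop all-zero users
toy_norm = toy_norm.fillna(0).T
toy_norm = toy_norm.loc[:, (toy_norm != 0).any(axis=0)]

# Store only the non-zero entries
toy_sparse = sparse.csr_matrix(toy_norm.values)
print(toy_norm)
print(toy_sparse.shape, toy_sparse.nnz)
```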

Next up, we create our cosine similarity matrix using the `cosine_similarity` function, which computes the similarity between every pair of rows it is given. Our sparse matrix currently has books as rows and users as columns, so we pass in its transpose (`.T`) to get a user-user cosine similarity matrix.

We convert this to a DataFrame so we can check that our cosine similarity matrix has been properly created!

Remember, along the diagonal of our cosine similarity matrix we should be seeing 1s, as the diagonal is where each user meets themselves (1 indicates identical rating vectors, which should be the case for the same user! User 7 should have a similarity of 1 with themselves!). Because the ratings have been mean-centred, values off the diagonal can also be negative.

In [20]:
# Compute the similarity matrix using the cosine similarity metric
user_similarity = cosine_similarity(util_matrix_sparse.T)

# Save the matrix as a dataframe to allow for easier indexing
user_sim_df = pd.DataFrame(user_similarity,
                           index = util_matrix_norm.columns,
                           columns = util_matrix_norm.columns)

# Review a small portion of the constructed similarity matrix
user_sim_df[:5]

user_id,7,10,23,27,35,41,46,47,49,51,...,53364,53366,53372,53373,53378,53381,53393,53403,53406,53420
user_id,,,,,,,,,,,,,,,,,,,,,
7,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-0.177657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
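
As a quick sanity check of what `cosine_similarity` computes, the sketch below compares it against the definition (the dot product of two vectors divided by the product of their lengths). The three users and their normalised ratings are made up for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical normalised ratings: one row per user, one column per book
users = np.array([
    [0.5, -0.5, 0.0],   # user A
    [0.5, 0.0, -0.5],   # user B: partly agrees with A
    [-0.5, 0.5, 0.0],   # user C: the exact opposite of A
])

# Pairwise similarities between all rows
sim = cosine_similarity(users)

# The same value for users A and B, computed from the definition
manual_ab = users[0] @ users[1] / (np.linalg.norm(users[0]) * np.linalg.norm(users[1]))

print(np.round(sim, 2))
print(round(float(manual_ab), 2))
```

The diagonal comes out as 1 for every user, and users with opposite tastes get a negative similarity.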


---

## Top-N Recommendations


In order to produce a list of top-N recommendations for collaborative filtering, the following simple algorithm can be followed:

1. Select an initial reference user to generate recommendations for.
2. Extract all the similarity values between the reference user and each other user in the similarity matrix.
3. Sort the resulting similarity values in descending order, and select the 𝑘 most similar users based on these values.
4. For each selected user, collect their top-rated items.
5. Form a tally of which items are most popular across the 𝑘 similar users. Do this by counting how many times a top-rated item is common amongst the other users.
6. Sort the top-rated items according to the popularity tally. Return the top-N values as the result.


Here is the full function as detailed in the trains - but we will break it down into steps afterwards:

In [28]:
def collab_generate_top_N_recommendations(user, N=10, k=20):

    # Cold-start problem - no ratings given by the reference user.
    # With no further user data, we solve this by simply recommending
    # the top-N most popular books in the item catalog.
    if user not in user_sim_df.columns:
        return book_ratings.groupby('title').mean().sort_values(by='rating',
                                        ascending=False).index[:N].to_list()

    # Gather the k users which are most similar to the reference user
    sim_users = user_sim_df.sort_values(by=user, ascending=False).index[1:k+1]
    favorite_user_items = [] # <-- List of highest rated items gathered from the k users
    most_common_favorites = {} # <-- Dictionary of highest rated items in common for the k users

    for i in sim_users:
        # Maximum rating given by the current user to an item
        max_score = util_matrix_norm.loc[:, i].max()
        # Save the names of items maximally rated by the current user
        favorite_user_items.append(util_matrix_norm[util_matrix_norm.loc[:, i]==max_score].index.tolist())

    # Loop over each user's favorite items and tally which ones are
    # most popular overall.
    for item_collection in range(len(favorite_user_items)):
        for item in favorite_user_items[item_collection]:
            if item in most_common_favorites:
                most_common_favorites[item] += 1
            else:
                most_common_favorites[item] = 1
    # Sort the overall most popular items and return the top-N instances
    sorted_list = sorted(most_common_favorites.items(), key=operator.itemgetter(1), reverse=True)[:N+1]
    top_N = [x[0] for x in sorted_list]
    return top_N

The first part of the code accounts for the possibility of the cold-start problem. If our specified user is not in the columns of our similarity matrix, it means that we can't calculate their similarity to any other user.

The solution here is to simply find the top N (in this case, 10) books in our system based on each book's average rating:

In [None]:
    # Cold-start problem - no ratings given by the reference user.
    # With no further user data, we solve this by simply recommending
    # the top-N most popular books in the item catalog.
    if user not in user_sim_df.columns:
        return book_ratings.groupby('title').mean().sort_values(by='rating',
                                        ascending=False).index[:N].to_list()
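
One thing to be aware of with this fallback is that ranking by average rating alone can favour books with very few ratings. The sketch below shows the effect on made-up data, together with one possible refinement. The `min_count` cut-off is a hypothetical addition for illustration and is not part of the function in the trains:

```python
import pandas as pd

# Hypothetical ratings: "Emma" has a single 5-star rating
toy_ratings = pd.DataFrame({
    "title": ["Dune", "Dune", "Dune", "Emma", "Ulysses", "Ulysses"],
    "rating": [5, 4, 5, 5, 3, 4],
})

# Average rating and number of ratings per book
stats = toy_ratings.groupby("title")["rating"].agg(["mean", "count"])

# Ranking by mean rating alone, as the cold-start branch does
by_mean = stats.sort_values(by="mean", ascending=False).index[:2].to_list()

# Hypothetical refinement: only rank books with at least min_count ratings
min_count = 2
by_mean_filtered = (stats[stats["count"] >= min_count]
                    .sort_values(by="mean", ascending=False)
                    .index[:2].to_list())

print(by_mean)
print(by_mean_filtered)
```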

If, however, we do not face the above problem, we then find the k users most similar to our specified user. Here, we're using the top 20 users for user 314 - we sort our similarity matrix by our specified user's column in descending order. We take from index 1, because index 0 holds the ID of our specified user themselves, with a similarity value of 1:

In [21]:
sim_users = user_sim_df.sort_values(by=314, ascending=False).index[1:21]
sim_users

Int64Index([ 1964, 18957, 33119, 18361, 51838, 17663,   588, 10140, 32635,
            43985, 11854, 28509,  9722, 17189, 30879,   725,  6016, 40251,
            12024, 24326],
           dtype='int64', name='user_id')
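
The slicing trick above can be tried out on a small made-up similarity matrix. The sketch also shows an alternative that removes the reference user by label instead of by position, which may be safer if another user happens to have a similarity of exactly 1. The `toy_sim` values and the `nlargest` variant are assumptions for illustration, not the approach used in the trains:

```python
import pandas as pd

# Hypothetical 4-user similarity matrix (symmetric, 1s on the diagonal)
ids = pd.Index([1, 2, 3, 4], name="user_id")
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.9, -0.1],
     [0.2, 1.0, 0.0, 0.4],
     [0.9, 0.0, 1.0, 0.3],
     [-0.1, 0.4, 0.3, 1.0]],
    index=ids, columns=ids,
)
user, k = 1, 2

# The notebook's approach: sort by the user's column, skip position 0
by_sort = toy_sim.sort_values(by=user, ascending=False).index[1:k + 1]

# Alternative sketch: drop the reference user explicitly, then take the k largest
by_nlargest = toy_sim[user].drop(user).nlargest(k).index

print(list(by_sort), list(by_nlargest))
```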

Now that we have a list of the top 20 most similar users to our user, we can start collating their favourite items.

We create an empty list `favorite_user_items` where we're going to be collecting each of our 20 users' 'favourite' books based on their ratings. We also create an empty dictionary `most_common_favorites` that will come into use later.

We use a `for` loop to go through each of our 20 similar users. For each user, we find their `max_score` - this is the highest rating they have given any book in our system. We look this up in the normalised utility matrix, so it picks out each user's own favourites regardless of how generous they are: some may have given a highest raw rating of 5; some may be a bit more harsh and may have only ever given a highest rating of 2!

For each user, we then collect each book they have given their highest rating to, and append this to the `favorite_user_items` list.

In [22]:
favorite_user_items = [] # <-- List of highest rated items gathered from the k users
most_common_favorites = {} # <-- Dictionary of highest rated items in common for the k users

for i in sim_users:
        # Maximum rating given by the current user to an item
        max_score = util_matrix_norm.loc[:, i].max()
        # Save the names of items maximally rated by the current user
        favorite_user_items.append(util_matrix_norm[util_matrix_norm.loc[:, i]==max_score].index.tolist())

If we have a look at `favorite_user_items`, we can see that for each of our 20 similar users, there is a sublist that contains their highest-rated books:

In [23]:
favorite_user_items

[['Four to Score (Stephanie Plum, #4)',
  "Life, the Universe and Everything (Hitchhiker's Guide, #3)"],
 ['Harry Potter and the Goblet of Fire (Harry Potter, #4)'],
 ['Children of Dune (Dune Chronicles #3)',
  'Harry Potter and the Goblet of Fire (Harry Potter, #4)'],
 ['Harry Potter and the Goblet of Fire (Harry Potter, #4)',
  'The Broken Wings'],
 ['Harry Potter and the Half-Blood Prince (Harry Potter, #6)',
  "I'm a Stranger Here Myself: Notes on Returning to America after Twenty Years Away",
  'Neither Here nor There: Travels in Europe',
  'Slouching Towards Bethlehem',
  'The Lord of the Rings: Weapons and Warfare',
  'The Power of One (The Power of One, #1)',
  'Treasure Island'],
 ['Another Bullshit Night in Suck City',
  'Dune Messiah (Dune Chronicles #2)',
  'Gates of Fire: An Epic Novel of the Battle of Thermopylae',
  'Harry Potter and the Half-Blood Prince (Harry Potter, #6)',
  'Harry Potter and the Order of the Phoenix (Harry Potter, #5)',
  'Harry Potter and the Prison
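
To make the boolean-mask selection easier to follow, here is the same idea on a tiny made-up normalised matrix (books on the rows, users on the columns, as in `util_matrix_norm`). The data and the `favorites` dictionary are assumptions for illustration:

```python
import pandas as pd

# Hypothetical normalised ratings: rows are books, columns are user IDs
toy_norm = pd.DataFrame(
    {7: [0.5, -0.5, 0.5], 9: [0.0, 0.25, -0.75]},
    index=pd.Index(["Dune", "Emma", "Ulysses"], name="title"),
)

favorites = {}
for user_id in toy_norm.columns:
    col = toy_norm[user_id]
    # Keep every book that ties for this user's highest normalised rating
    favorites[user_id] = col[col == col.max()].index.tolist()

print(favorites)
```

A user can have more than one favourite when several books share their top rating, which is why each sublist above can hold several titles.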

Next up, we need to count how often each of the above books appears in the `favorite_user_items` list. We're essentially going through each book in each sublist, and for each book we are tallying how many times it appears and making a record of this in our dictionary `most_common_favorites`:

In [24]:
for item_collection in range(len(favorite_user_items)):
        for item in favorite_user_items[item_collection]:
            if item in most_common_favorites:
                most_common_favorites[item] += 1
            else:
                most_common_favorites[item] = 1

Now if we check out `most_common_favorites`, we can see how each book now has a count of how many times it appears in `favorite_user_items` associated with it. We can see some popular ones already - 'Harry Potter and the Goblet of Fire (Harry Potter, #4)' appeared 11 times!

In [25]:
most_common_favorites

{'Four to Score (Stephanie Plum, #4)': 1,
 "Life, the Universe and Everything (Hitchhiker's Guide, #3)": 1,
 'Harry Potter and the Goblet of Fire (Harry Potter, #4)': 11,
 'Children of Dune (Dune Chronicles #3)': 2,
 'The Broken Wings': 2,
 'Harry Potter and the Half-Blood Prince (Harry Potter, #6)': 8,
 "I'm a Stranger Here Myself: Notes on Returning to America after Twenty Years Away": 5,
 'Neither Here nor There: Travels in Europe': 2,
 'Slouching Towards Bethlehem': 1,
 'The Lord of the Rings: Weapons and Warfare': 6,
 'The Power of One (The Power of One, #1)': 2,
 'Treasure Island': 4,
 'Another Bullshit Night in Suck City': 1,
 'Dune Messiah (Dune Chronicles #2)': 1,
 'Gates of Fire: An Epic Novel of the Battle of Thermopylae': 1,
 'Harry Potter and the Order of the Phoenix (Harry Potter, #5)': 1,
 'Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)': 1,
 'The High Window (Philip Marlowe, #3)': 1,
 'The History of Sexuality, Volume 1: An Introduction': 1,
 "The Hitchhike
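
As an aside, the same tally can be expressed with `collections.Counter` from the standard library. This is an alternative sketch on made-up sublists, not the code used in the trains:

```python
from collections import Counter
from itertools import chain

# Hypothetical favourites for four similar users
favorite_user_items = [["Dune", "Emma"], ["Dune"], ["Ulysses", "Dune"], ["Emma"]]

# Flatten the sublists and count how often each title appears
most_common_favorites = Counter(chain.from_iterable(favorite_user_items))

# most_common(n) returns the n highest counts, already sorted
top_2 = [title for title, _ in most_common_favorites.most_common(2)]
print(most_common_favorites)
print(top_2)
```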

We then sort the above dictionary by the count of each item. We use the `key` argument of the `sorted` function to tell Python to sort by each book's count, rather than by the book title.

In this same step, we limit our resulting list using slicing. Note that the slice `[:11]` (`[:N+1]` in the function, with `N=10`) keeps 11 entries, one more than the top 10; slicing with `[:N]` would return exactly N:

In [26]:
sorted_list = sorted(most_common_favorites.items(), key=operator.itemgetter(1), reverse=True)[:11]
sorted_list

[('Harry Potter and the Goblet of Fire (Harry Potter, #4)', 11),
 ('Harry Potter and the Half-Blood Prince (Harry Potter, #6)', 8),
 ('The Lord of the Rings: Weapons and Warfare', 6),
 ("I'm a Stranger Here Myself: Notes on Returning to America after Twenty Years Away",
  5),
 ('Treasure Island', 4),
 ("The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
  3),
 ('Children of Dune (Dune Chronicles #3)', 2),
 ('The Broken Wings', 2),
 ('Neither Here nor There: Travels in Europe', 2),
 ('The Power of One (The Power of One, #1)', 2),
 ('A Short History of Nearly Everything', 2)]
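
The `heapq` module imported at the top of the notebook offers another way to take the top entries without sorting the whole dictionary. The sketch below, on a made-up tally, compares `heapq.nlargest` with the `sorted` approach and uses a slice of `[:N]` so that exactly N entries come back:

```python
import heapq
import operator

# Hypothetical popularity tally
most_common_favorites = {"Dune": 3, "Emma": 2, "Ulysses": 1, "Ivanhoe": 2}
N = 2

# Full sort, then keep the first N entries
via_sorted = sorted(most_common_favorites.items(),
                    key=operator.itemgetter(1), reverse=True)[:N]

# Same result without sorting every entry
via_heap = heapq.nlargest(N, most_common_favorites.items(),
                          key=operator.itemgetter(1))

print(via_sorted)
print(via_heap)
```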

Finally, all that is left to do is return only the names of the books, not their counts, to our user. We do this by using indexing: `[x[0] for x in sorted_list]` ensures we are only returning the first element of each tuple (i.e., the book title):

In [27]:
top_N = [x[0] for x in sorted_list]
top_N

['Harry Potter and the Goblet of Fire (Harry Potter, #4)',
 'Harry Potter and the Half-Blood Prince (Harry Potter, #6)',
 'The Lord of the Rings: Weapons and Warfare',
 "I'm a Stranger Here Myself: Notes on Returning to America after Twenty Years Away",
 'Treasure Island',
 "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
 'Children of Dune (Dune Chronicles #3)',
 'The Broken Wings',
 'Neither Here nor There: Travels in Europe',
 'The Power of One (The Power of One, #1)',
 'A Short History of Nearly Everything']

Now let's compare that output to what we actually get from the function:

In [29]:
# Our recommended list for user 314
collab_generate_top_N_recommendations(314)

['Harry Potter and the Goblet of Fire (Harry Potter, #4)',
 'Harry Potter and the Half-Blood Prince (Harry Potter, #6)',
 'The Lord of the Rings: Weapons and Warfare',
 "I'm a Stranger Here Myself: Notes on Returning to America after Twenty Years Away",
 'Treasure Island',
 "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
 'Children of Dune (Dune Chronicles #3)',
 'The Broken Wings',
 'Neither Here nor There: Travels in Europe',
 'The Power of One (The Power of One, #1)',
 'A Short History of Nearly Everything']

---

## Rating Predictions

We can generate user-item ratings for collaborative filtering using the following algorithmic steps:

1. Select a reference user from the database and a reference item (book) they have not rated.
2. For the reference user, gather the similarity values between them and each other user.
3. Sort the gathered similarity values in descending order.
4. Select the 𝑘 highest similarity values which are above a given threshold value, creating a collection 𝐾 of similar users.
5. For each user in collection 𝐾, get their rating of the reference item if it exists (other users may not have rated this item either).
6. Compute a weighted average rating from both the gathered rating values and user similarity values.
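
The weighted average in step 6 can be written out as follows. The notation here is an assumption made for this sketch rather than something taken from the trains: $\hat{r}_{u,i}$ is the predicted rating of reference user $u$ for item $i$, $K_i$ is the set of users in $K$ who have rated item $i$, $s_{u,v}$ is the cosine similarity between users $u$ and $v$, and $r_{v,i}$ is user $v$'s rating of item $i$.

$$\hat{r}_{u,i} = \frac{\sum_{v \in K_i} s_{u,v}\, r_{v,i}}{\sum_{v \in K_i} s_{u,v}}$$

Users who are more similar to the reference user therefore pull the prediction more strongly towards their own rating.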

Here is the full function as detailed in the trains - but we will break it down into steps afterwards:

In [36]:
def collab_generate_rating_estimate(book_title, user, k=20, threshold=0.0):

    # Gather the k users which are most similar to the reference user
    sim_users = user_sim_df.sort_values(by=user, ascending=False).index[1:k+1]

    # Store the corresponding user's similarity values
    user_values = user_sim_df.sort_values(by=user, ascending=False).loc[:,user].tolist()[1:k+1]

    rating_list = [] # <-- List of k user's ratings for the reference item
    weight_list = [] # <-- List of k user's similarities to the reference user

    # Create a weighted sum for each of the k users who have rated the
    # reference item (book).
    for sim_idx, user_id in enumerate(sim_users):
        # User's rating of the item
        rating = util_matrix.loc[user_id, book_title]
        # User's similarity to the reference user
        similarity = user_values[sim_idx]
        # Skip the user if they have not rated the item, or are too dissimilar to
        # the reference user
        if (np.isnan(rating)) or (similarity < threshold):
            continue
        elif not np.isnan(rating):
            rating_list.append(rating*similarity)
            weight_list.append(similarity)
    try:
        # Return the weighted sum as the predicted rating for the reference item
        predicted_rating = sum(rating_list)/sum(weight_list)
    except ZeroDivisionError:
        # If no ratings for the reference item can be collected, return the average
        # rating given by all users for the item.
        predicted_rating = np.mean(util_matrix[book_title])
    return predicted_rating

We start off by collecting the k-most similar users to our reference user, as with the previous function (in this case, 20 users):

In [30]:
# Gather the k users which are most similar to the reference user
sim_users = user_sim_df.sort_values(by=314, ascending=False).index[1:21]
sim_users

Int64Index([ 1964, 18957, 33119, 18361, 51838, 17663,   588, 10140, 32635,
            43985, 11854, 28509,  9722, 17189, 30879,   725,  6016, 40251,
            12024, 24326],
           dtype='int64', name='user_id')

Next, since we want to calculate a weighted average at the end of this, we also need the similarity values for each of these 20 users. We store them in a list, in the same descending order as `sim_users`:

In [31]:
# Store the corresponding user's similarity values
user_values = user_sim_df.sort_values(by=314, ascending=False).loc[:,314].tolist()[1:21]
user_values

[0.3903270358962388,
 0.38752188227650997,
 0.31257183082499185,
 0.3004831696289196,
 0.27714723480365766,
 0.2693671928551703,
 0.265541228697193,
 0.26147815138290853,
 0.25519784132951534,
 0.24972641435540344,
 0.23941870357811101,
 0.23902551768969965,
 0.23688317740355277,
 0.23667052904918376,
 0.23524078567485052,
 0.23475160221647717,
 0.23280735794485072,
 0.22983222854778815,
 0.22855210968295453,
 0.22236412140540754]

Next up, we want to go through our list of most similar users, and get their ratings for our chosen book (if they have rated it).

We do this using a `for` loop that uses `enumerate` to go through the `sim_users` variable we created earlier on. We use the original (un-normalised) utility matrix `util_matrix` to collect the user's rating of our reference book (saved as `rating`), and also the similarity value for this user from the `user_values` list we created before (saved as `similarity`).

Then we check for each of our similar users if either:
- they have not rated the book (`if (np.isnan(rating))`) or
- their similarity value is below our threshold (`similarity < threshold`)

If either of these is the case for a particular similar user, we will simply skip them (`if [conditions]: continue`).

If not, we go on to calculating our numerators (`rating*similarity`) and denominators (`similarity`) for our weighted average equation. We add these to our `rating_list` and `weight_list` lists to be summed later:

In [32]:
rating_list = [] # <-- List of k user's ratings for the reference item
weight_list = [] # <-- List of k user's similarities to the reference user

# Create a weighted sum for each of the k users who have rated the
# reference item (book).
for sim_idx, user_id in enumerate(sim_users):

  # User's rating of the item
  rating = util_matrix.loc[user_id, "The Lord of the Rings: Weapons and Warfare"]

  # User's similarity to the reference user
  similarity = user_values[sim_idx]

  # Skip the user if they have not rated the item, or are too dissimilar to
  # the reference user
  if (np.isnan(rating)) or (similarity < 0.0):
    continue
  elif not np.isnan(rating):
    rating_list.append(rating*similarity)
    weight_list.append(similarity)

In [33]:
rating_list

[1.2019326785156783,
 1.3857361740182883,
 1.3468359642758516,
 1.0459126055316341,
 1.2759892066475766,
 0.9989056574216137,
 0.9576748143124441]

In [34]:
weight_list

[0.3004831696289196,
 0.27714723480365766,
 0.2693671928551703,
 0.26147815138290853,
 0.25519784132951534,
 0.24972641435540344,
 0.23941870357811101]

Once done, we can go ahead to calculate the predicted rating, to be returned:

In [35]:
predicted_rating = sum(rating_list)/sum(weight_list)
predicted_rating

4.432698712267665
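
To double-check this calculation, the sketch below copies the `rating_list` and `weight_list` values shown above, recovers the raw ratings by dividing each weighted rating by its weight, and confirms that NumPy's weighted average gives the same prediction. The variable names `raw_ratings` and `via_numpy` are introduced here for illustration:

```python
import numpy as np

# Values copied from the rating_list and weight_list outputs above
rating_list = [1.2019326785156783, 1.3857361740182883, 1.3468359642758516,
               1.0459126055316341, 1.2759892066475766, 0.9989056574216137,
               0.9576748143124441]
weight_list = [0.3004831696289196, 0.27714723480365766, 0.2693671928551703,
               0.26147815138290853, 0.25519784132951534, 0.24972641435540344,
               0.23941870357811101]

# Dividing each weighted rating by its weight recovers the raw ratings
raw_ratings = [round(r / w) for r, w in zip(rating_list, weight_list)]

# Weighted average, computed two ways
predicted = sum(rating_list) / sum(weight_list)
via_numpy = np.average(raw_ratings, weights=weight_list)

print(raw_ratings)
print(predicted, via_numpy)
```

Only 7 of the 20 similar users had rated this book, so only those 7 contribute to the prediction.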

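Now let's compare that output to what we actually get from the function. User 314 has in fact already rated this book, so we can also compare the prediction with their actual rating: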

In [37]:
title = "The Lord of the Rings: Weapons and Warfare"
actual_rating = book_ratings[(book_ratings['user_id'] == 314) & (book_ratings['title'] == title)]['rating'].values[0]
pred_rating = collab_generate_rating_estimate(book_title = title, user = 314)
print (f"Title - {title}")
print ("---")
print (f"Actual rating: \t\t {actual_rating}")
print (f"Predicted rating: \t {pred_rating}")

Title - The Lord of the Rings: Weapons and Warfare
---
Actual rating: 		 4
Predicted rating: 	 4.432698712267665


---

## Scikit-surprise package

In this section, we introduce you to the scikit-surprise package - this is a really useful package that is specialised for recommender system problems, and contains **models** that you can use in a very similar way to how you have used sklearn models in previous sprints!

Make sure to read through the [surprise documentation](https://surprise.readthedocs.io/en/stable/) to discover more about how this package works.

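First we install the surprise package if we have not already. (If this does not work, try running `!pip install surprise` instead. The package is listed on PyPI under the name `scikit-surprise`, so `!pip install scikit-surprise` is another option.)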

In [38]:
pip install surprise



Next, we import some packages. We recommend reading up on the surprise package to fully understand what all of these are used for. There is a mix of model functions and data importing functions here.

In [39]:
# Packages for modeling
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise import KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.model_selection import train_test_split
from surprise import accuracy
import heapq
import operator

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time

To start off, we need to instantiate a [Reader](https://surprise.readthedocs.io/en/stable/reader.html) - this essentially tells surprise how our ratings are laid out. Because this package is specialised for use with recommender systems, it already expects to deal with ratings. With a Reader, we specify the scale we are working with for our ratings (here, from a minimum of 0.5 to a maximum of 5.0):

`reader = Reader(rating_scale = (0.5, 5.0))`

Next up, we load in our data. We use the `.load_from_df` method to do just that - load our data from a pandas dataframe we already have! We pass in three columns of our dataframe (`book_ratings[['user_id', 'book_id', 'rating']]`) and the `reader` we just created. The columns must be in this order: user IDs, item IDs, then ratings.

`data = Dataset.load_from_df(book_ratings[['user_id', 'book_id', 'rating']], reader)`

Check out some of the other ways to read in data [here](https://surprise.readthedocs.io/en/stable/dataset.html)



In [40]:
# instantiate our reader and load our data
reader = Reader(rating_scale = (0.5, 5.0))
data = Dataset.load_from_df(book_ratings[['user_id', 'book_id', 'rating']], reader)

Now that our data is ready to go, we can start thinking in a similar fashion to how we did with previous ML sprints. First things first, let's create train and test sets to use with our models.

Notice how we don't create `X_train`, `y_train`, `X_test` and `y_test` however! Because the surprise package knows we are using ratings data, it already knows we're trying to predict ratings - all we need to do is split into a `trainset` and `testset`.

We still use the `train_test_split` function to do this (note: in the imports section, we imported this function from surprise, not from sklearn!). Here we can also specify a `test_size`, much like with sklearn's version:

In [None]:
trainset, testset = train_test_split(data, test_size = 0.1)

Now we're getting to the good part - modelling! The modelling process is quite straightforward. We instantiate our model, fit it to our training data using the `.fit` method, and then make predictions. The difference here is that to make predictions for a whole test set we use the `.test` method; surprise's `.predict` method estimates the rating for a single user-item pair, rather than taking a matrix of features like sklearn's `.predict` does:

In [41]:
# instantiate model
svd = SVD()

#fit it to our training data
svd.fit(trainset)

# create our predictions
preds = svd.test(testset)

In [42]:
# instantiate our model
nmf = NMF()

# fit it to our training data
nmf.fit(trainset)

# create our predictions
preds_nmf = nmf.test(testset)

Lastly, we can use surprise's metrics to calculate RMSE in a pretty straightforward way, using `accuracy.rmse()` to give us an idea of how well each model performs. Remember that RMSE is an error metric, so the lower it is, the better! Here, SVD (0.9116) does better than NMF (1.0428) on this test set:

In [43]:
#evaluate using the RMSE
svd_rmse = accuracy.rmse(preds)
nmf_rmse = accuracy.rmse(preds_nmf)

RMSE: 0.9116
RMSE: 1.0428
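
For reference, RMSE is the square root of the average squared difference between actual and predicted ratings. The sketch below computes it by hand on four made-up ratings and cross-checks the result against sklearn's `mean_squared_error`, which was imported earlier. The numbers are invented for illustration and are unrelated to the model results above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted ratings for four user-book pairs
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

# Root of the mean of the squared errors
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# Cross-check using sklearn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(rmse, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.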
