<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./images/DLI_Header.png"></a></div>

# Content Based Filters with Real Data

To test our knowledge of content-based filters, we'll use a subset of the [Amazon Review Dataset](https://nijianmo.github.io/amazon/index.html), specifically the Electronics dataset. Working with real data brings a few extra challenges, such as data cleanliness, memory constraints, and verifying that the model is ready for real-world testing.

## Objectives
This notebook demonstrates:
* How to split data to train and test a recommender model
  * [1. Exploring the Data](#1.-Exploring-the-Data)
  * [2. Building the Datasets](#2.-Building-the-Datasets)
* How to build a Content-based filter for large datasets
  * [3. In-Place Content-Based Filters](#3.-In-Place-Content-Based-Filters)
    * [3.1 Finding the Total Number of Points Given per User](#3.1-Finding-the-Total-Number-of-Points-Given-per-User)
    * [3.2 Finding the Total Number of Points Given per User per Category](#3.2-Finding-the-Total-Number-of-Points-Given-per-User-per-Category)
    * [3.3 Finding the User's Percentage Preference for Each Category](#3.3-Finding-the-User's-Percentage-Preference-for-Each-Category)
  * [4. Making a Prediction](#4.-Making-a-Prediction)
  * [5. Validation Metrics](#5.-Validation-Metrics)
  * [6. Wrap Up](#6.-Wrap-Up)

## 1. Exploring the Data

Our data is located in a CSV file in the `data` directory. We've already merged the metadata with the reviews data so everything can be displayed in one DataFrame.

In [1]:
import cudf

ratings = cudf.read_csv("data/raw_data.csv")

ratings.columns

Index(['reviewerID', 'asin', 'overall', 'unixReviewTime', 'brand',
       'category_0_0', 'category_0_1', 'category_0_2', 'category_0_3',
       'category_1_0', 'category_1_1', 'category_1_2', 'category_1_3',
       'salesRank_Electronics', 'salesRank_Camera', 'salesRank_Computers',
       'salesRank_CellPhones', 'salesRank_CellPhones_NA',
       'salesRank_Electronics_NA', 'salesRank_Camera_NA',
       'salesRank_Computers_NA', 'price_filled'],
      dtype='object')

Hmm, it looks like our categories are stored as strings alongside the user-item ratings. Previously, we had a column for each category with a `1` if the item was in that category and `0` otherwise. Can we do the same here?

Let's start by getting a list of all the categories. We can [concatenate](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.concat.html) to stack all the category columns together, and then use [unique](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.Series.unique.html) to remove any duplicates.

In [2]:
category_columns = [
    'category_0_0', 'category_0_1', 'category_0_2', 'category_0_3',
    'category_1_0', 'category_1_1', 'category_1_2', 'category_1_3'
]

categories = cudf.Series(dtype="str")
for category in category_columns:
    categories = cudf.concat([categories, ratings[category]])
categories = categories.unique()

categories

0                       3D Glasses
1                      AC Adapters
2        AV Receivers & Amplifiers
3           Access-Control Keypads
4                      Accessories
                  ...             
581                       Xbox 360
582                       Xbox One
583                      Zip Discs
584                  eBook Readers
585    eBook Readers & Accessories
Length: 586, dtype: object

Wow! 586 categories! We could have a column for each category like before, but is that wise? How many rows does our data have?

In [3]:
count_ratings = len(ratings)

count_ratings

1689188

If we create a column for each category, we'll have roughly a billion cells of data (586 categories × 1,689,188 rows). Okay, new strategy: we'll build our filter by adding only a few extra columns to our dataset.
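
As a quick back-of-the-envelope check on that estimate, the sketch below multiplies the row count by the number of categories. The one-byte-per-cell figure is an assumption for illustration (a best case for a 0/1 flag), not a measurement of this dataset.

```python
# Rough size of a one-hot category table, using the counts shown above.
n_rows = 1_689_188   # len(ratings)
n_categories = 586   # number of unique categories

n_cells = n_rows * n_categories

# Assumption: one byte per cell, the smallest common boolean/integer dtype.
approx_gib = n_cells / 1024**3

print(f"{n_cells:,} cells, roughly {approx_gib:.2f} GiB at one byte per cell")
```

Wider dtypes would multiply that footprint, which is why the rest of the notebook avoids the one-column-per-category layout.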

## 2. Building the Datasets

Before we get much further, we should split our dataset into training and validation datasets. One thing to consider when doing this for recommendation systems is that our goal is to encourage future interaction by the users. Instead of doing a random split, we'll use the last review left by the user for validation.

The `unixReviewTime` column enables us to do that. We will start by finding the largest (newest) review time and assigning it a marker, `valid`. However, the `unixReviewTime` only goes down to the day, so if a user does all of their reviews on the same day, then all of their reviews would end up in the validation set. We can add a tie-breaker, or `timeBreaker`, by adding a scaled version of the row index to the time. 

Setting `as_index=False` in [groupby](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.groupby.html) gives the result a fresh numerical index instead of using `reviewerID` as the index.

In [4]:
scaled_index = ratings.index / count_ratings
ratings["timeBreaker"] = scaled_index + ratings["unixReviewTime"]

In [5]:
valid_ratings = ratings[["reviewerID", "timeBreaker"]].groupby(['reviewerID'], as_index=False).max()
valid_ratings["valid"] = True

valid_ratings.head()

Unnamed: 0,reviewerID,timeBreaker,valid
0,A000715434M800HLCENK9,1400458000.0,True
1,A00101847G3FJTWYGNQA,1385770000.0,True
2,A00166281YWM98A3SVD55,1370563000.0,True
3,A0046696382DWIPVIWO0K,1402877000.0,True
4,A00472881KT6WR48K907X,1402358000.0,True
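
To see why the tie-breaker matters, here is a minimal sketch on made-up data. It uses pandas as a stand-in for cuDF so it runs without a GPU, and the `toy` DataFrame and its values are hypothetical. It marks the latest review with a `transform("max")` comparison rather than the merge used in this notebook, but the effect of the tie-breaker is the same.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU

# Hypothetical toy data: user "u1" leaves two reviews on the same day.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2"],
    "unixReviewTime": [1400000000, 1400000000, 1300000000, 1390000000],
})

# Without a tie-breaker, both of u1's reviews share the maximum time.
latest = toy.groupby("reviewerID")["unixReviewTime"].transform("max")
n_without = int((toy["unixReviewTime"] == latest).sum())

# With the scaled row index added, each user has exactly one maximum.
toy["timeBreaker"] = toy["unixReviewTime"] + toy.index / len(toy)
latest_tb = toy.groupby("reviewerID")["timeBreaker"].transform("max")
toy["valid"] = toy["timeBreaker"] == latest_tb
n_with = int(toy["valid"].sum())

print("validation rows without tie-breaker:", n_without)
print("validation rows with tie-breaker:", n_with)
```

Because the scaled index is always less than 1 and review times are whole seconds, the tie-breaker only reorders reviews that share the same timestamp.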


Next, we [merge](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.merge.html) the validation marker back onto our full ratings dataset to make it easier to pull the training data.

In [6]:
data_split = ratings.merge(valid_ratings, how="left", on=['reviewerID', "timeBreaker"])

data_split.head()

Unnamed: 0,reviewerID,asin,overall,unixReviewTime,brand,category_0_0,category_0_1,category_0_2,category_0_3,category_1_0,...,salesRank_Camera,salesRank_Computers,salesRank_CellPhones,salesRank_CellPhones_NA,salesRank_Electronics_NA,salesRank_Camera_NA,salesRank_Computers_NA,price_filled,timeBreaker,valid
0,AEVNTIQFU2TQ6,B00004ZCB3,3.0,1398643200,Tiffen,Electronics,Camera & Photo,Accessories,Filters & Accessories,,...,18665.5,14369.5,254050.0,False,False,False,False,16.49,1398643000.0,
1,A1XI6NT41B6E6X,B00004ZCB3,5.0,1386288000,Tiffen,Electronics,Camera & Photo,Accessories,Filters & Accessories,,...,18665.5,14369.5,254050.0,False,False,False,False,16.49,1386288000.0,
2,A2PMIMM3U3THM7,B00004ZCB3,5.0,1241481600,Tiffen,Electronics,Camera & Photo,Accessories,Filters & Accessories,,...,18665.5,14369.5,254050.0,False,False,False,False,16.49,1241482000.0,
3,A1E9NFMLMVY61H,B00004ZCB4,5.0,1374624000,Tiffen,Electronics,Camera & Photo,Accessories,Filters & Accessories,,...,18665.5,14369.5,254050.0,False,False,False,False,16.49,1374624000.0,
4,AXABTEYS7A4A8,B00004ZCB4,5.0,1374710400,Tiffen,Electronics,Camera & Photo,Accessories,Filters & Accessories,,...,18665.5,14369.5,254050.0,False,False,False,False,16.49,1374710000.0,


To keep things clean, we'll replace the `null`s in the `valid` column with `False`.

In [7]:
# Assign the result back rather than relying on an in-place update of a column selection
data_split['valid'] = data_split['valid'].fillna(False)

We'll keep simplified tables of the user-item training and validation ratings to make it easier to check our work later.

In [8]:
clean_columns = ["reviewerID", "asin", "overall"]
train_overall = data_split.loc[~data_split['valid']][clean_columns]

train_overall.head()

Unnamed: 0,reviewerID,asin,overall
0,AEVNTIQFU2TQ6,B00004ZCB3,3.0
1,A1XI6NT41B6E6X,B00004ZCB3,5.0
2,A2PMIMM3U3THM7,B00004ZCB3,5.0
3,A1E9NFMLMVY61H,B00004ZCB4,5.0
4,AXABTEYS7A4A8,B00004ZCB4,5.0


In [9]:
valid_overall = data_split.loc[data_split['valid']][clean_columns]

valid_overall.head()

Unnamed: 0,reviewerID,asin,overall
10720,A2JMSJO1Z8C8ZQ,B00004SY4H,5.0
10721,A3MZSWSG6L8EN2,B00004SY4H,5.0
10722,A1F958T2GSWCEK,B00004SYKO,5.0
10752,AAVSIY28BPKGV,B000023VW2,3.0
10784,A1OZOJ8YVQQZCS,B00004TX77,5.0


While we're at it, let's save some of this work for future notebooks.

In [10]:
save_columns = clean_columns + ['valid']

data_split[save_columns].to_csv('data/ratings.csv', index=False)

Here's the overall data efficiency strategy. We'll stack all the categories into one column, and then use `groupby` operations to get our user ratings for each category.

We'll again use the [concat](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.concat.html) function, but if we want all the categories to fall under the same column, we'll need to use the [rename](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.Series.rename.html) function to give the category column in each stack section the same name.

A little more data cleanup... We'll also remove the `NA` category since it's an empty category, the `Electronics` category since every item has it, and any duplicate categories.

In [11]:
column_base = ["reviewerID", "asin", "overall", "valid"]

ratings_stack = cudf.DataFrame()
for category in category_columns:
    stack_section = data_split[column_base + [category]]
    stack_section = stack_section.rename(columns={category: "category"})
    ratings_stack = cudf.concat([ratings_stack, stack_section])

# Remove the NA and Electronics categories, then drop duplicate rows
ratings_stack = ratings_stack.loc[ratings_stack['category'] != "NA"]
ratings_stack = ratings_stack.loc[ratings_stack['category'] != "Electronics"]
ratings_stack = ratings_stack.drop_duplicates()

ratings_stack.head()

Unnamed: 0,reviewerID,asin,overall,valid,category
324139,A000715434M800HLCENK9,B000UYYZ0M,1.0,False,Accessories & Supplies
324139,A000715434M800HLCENK9,B000UYYZ0M,1.0,False,Office Electronics Accessories
324139,A000715434M800HLCENK9,B000UYYZ0M,1.0,False,Projection Screens
453648,A000715434M800HLCENK9,B001EHAI6Y,5.0,False,MP3 Player Accessories
453648,A000715434M800HLCENK9,B001EHAI6Y,5.0,False,MP3 Players & Accessories
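
The same wide-to-long reshaping can also be sketched with `melt`. The example below is an alternative to the loop above, not the approach this notebook uses. It runs on pandas as a stand-in for cuDF, and the `wide` DataFrame is hypothetical toy data with two category columns instead of eight.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU

# Hypothetical toy data with two category columns instead of eight.
wide = pd.DataFrame({
    "reviewerID": ["u1", "u2"],
    "asin": ["i1", "i2"],
    "overall": [5.0, 3.0],
    "category_0_0": ["Electronics", "Electronics"],
    "category_0_1": ["Mice", None],
})

# melt stacks the category columns into a single "category" column.
stacked = wide.melt(
    id_vars=["reviewerID", "asin", "overall"],
    value_vars=["category_0_0", "category_0_1"],
    value_name="category",
).drop(columns="variable")

# Same cleanup idea as above: drop missing categories, "Electronics", and duplicates.
stacked = stacked.dropna(subset=["category"])
stacked = stacked[stacked["category"] != "Electronics"].drop_duplicates()

print(stacked)
```

Either way, the result has one row per (review, category) pair, which is the shape the `groupby` steps in the next section rely on.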


Finally, we can use our `valid` marker and [loc](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.loc.html) to split our datasets.

In [12]:
train_data = ratings_stack.loc[~ratings_stack['valid']].drop("valid", axis=1)

train_data.head()

Unnamed: 0,reviewerID,asin,overall,category
324139,A000715434M800HLCENK9,B000UYYZ0M,1.0,Accessories & Supplies
324139,A000715434M800HLCENK9,B000UYYZ0M,1.0,Office Electronics Accessories
324139,A000715434M800HLCENK9,B000UYYZ0M,1.0,Projection Screens
453648,A000715434M800HLCENK9,B001EHAI6Y,5.0,MP3 Player Accessories
453648,A000715434M800HLCENK9,B001EHAI6Y,5.0,MP3 Players & Accessories


In [13]:
valid_data = ratings_stack.loc[ratings_stack['valid']].drop("valid", axis=1)

valid_data.head()

Unnamed: 0,reviewerID,asin,overall,category
1679159,A000715434M800HLCENK9,B00HMZG3YS,5.0,Bags & Cases
1679159,A000715434M800HLCENK9,B00HMZG3YS,5.0,Camera & Photo
1679159,A000715434M800HLCENK9,B00HMZG3YS,5.0,Camera Cases
1537376,A00101847G3FJTWYGNQA,B00B19L8LO,4.0,Computer Cases
1537376,A00101847G3FJTWYGNQA,B00B19L8LO,4.0,Computer Components


## 3. In-Place Content-Based Filters


Now that we have our user (`reviewerID`), item (`asin`), rating (`overall`), and category all lined up, we have the necessary ingredients to make a content-based filter.

The following is a puzzle. Please replace the `FIXME`s for each code block in the section. Each `FIXME` is a single string or method. Use the <a href="1-03_content_based_intro.ipynb">previous lab</a> as a hint, and if more help is needed, take a look at the solutions folder.

### 3.1 Finding the Total Number of Points Given per User


In [14]:
# FIXME: groupby
user_total = (
    train_data[["reviewerID", "overall"]].groupby(["reviewerID"], as_index=False).sum()
)

user_total.head()

Unnamed: 0,reviewerID,overall
0,A000715434M800HLCENK9,33.0
1,A00101847G3FJTWYGNQA,67.0
2,A00166281YWM98A3SVD55,58.0
3,A0046696382DWIPVIWO0K,43.0
4,A00472881KT6WR48K907X,84.0


### 3.2 Finding the Total Number of Points Given per User per Category

In [15]:
category_total = (
    train_data[["reviewerID", "overall", "category"]]
    .groupby(["reviewerID", "category"], as_index=False)
    .sum()
)

category_total.head()

Unnamed: 0,reviewerID,category,overall
0,A000715434M800HLCENK9,Accessories & Supplies,6.0
1,A000715434M800HLCENK9,Audio & Video Accessories,2.0
2,A000715434M800HLCENK9,Cables & Interconnects,2.0
3,A000715434M800HLCENK9,MP3 Player Accessories,5.0
4,A000715434M800HLCENK9,MP3 Players & Accessories,5.0


### 3.3 Finding the User's Percentage Preference for Each Category

In [16]:
# FIXME: merge dataframes
category_total = category_total.merge(user_total, how="left", on='reviewerID')

category_total.head()

Unnamed: 0,reviewerID,category,overall_x,overall_y
0,A10LWFKVC21F82,Cables & Interconnects,3.0,335.0
1,A10LWFKVC21F82,Cases & Sleeves,3.0,335.0
2,A10LWFKVC21F82,Cell Phones & Accessories,4.0,335.0
3,A10LWFKVC21F82,Computer Cable Adapters,4.0,335.0
4,A10LWFKVC21F82,Computer Components,13.0,335.0


In [17]:
category_total["ratio"] = category_total["overall_x"] / category_total["overall_y"]

category_total.head()

Unnamed: 0,reviewerID,category,overall_x,overall_y,ratio
0,A10LWFKVC21F82,Cables & Interconnects,3.0,335.0,0.008955
1,A10LWFKVC21F82,Cases & Sleeves,3.0,335.0,0.008955
2,A10LWFKVC21F82,Cell Phones & Accessories,4.0,335.0,0.01194
3,A10LWFKVC21F82,Computer Cable Adapters,4.0,335.0,0.01194
4,A10LWFKVC21F82,Computer Components,13.0,335.0,0.038806


In [18]:
# FIXME: find category's fraction of total points
category_total = category_total.drop(["overall_x", "overall_y"], axis=1)

category_total.head()

Unnamed: 0,reviewerID,category,ratio
0,A10LWFKVC21F82,Cables & Interconnects,0.008955
1,A10LWFKVC21F82,Cases & Sleeves,0.008955
2,A10LWFKVC21F82,Cell Phones & Accessories,0.01194
3,A10LWFKVC21F82,Computer Cable Adapters,0.01194
4,A10LWFKVC21F82,Computer Components,0.038806
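
To make the three steps concrete, here is a minimal sketch of the same profile calculation on made-up data. It uses pandas as a stand-in for cuDF, and the `train` DataFrame, its values, and the explicit merge suffixes are assumptions for illustration.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU

# Hypothetical stacked training data: one row per (review, category) pair.
train = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2"],
    "overall": [5.0, 3.0, 2.0, 4.0],
    "category": ["Mice", "Mice", "Keyboards", "Mice"],
})

# 3.1: total points given per user.
user_total = train.groupby("reviewerID", as_index=False)["overall"].sum()

# 3.2: total points given per user per category.
category_total = train.groupby(
    ["reviewerID", "category"], as_index=False
)["overall"].sum()

# 3.3: each category's share of the user's total points.
profile = category_total.merge(
    user_total, on="reviewerID", suffixes=("_category", "_user")
)
profile["ratio"] = profile["overall_category"] / profile["overall_user"]

print(profile[["reviewerID", "category", "ratio"]])
```

Since both totals are computed from the same stacked rows, each user's ratios add up to 1 across their categories.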


## 4. Making a Prediction

Congrats on making the users' profiles. Let's use them to make a prediction! We'll start by [merging](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.merge.html) the user profiles with our validation data. We'll use an `inner` merge so that we keep only the profile categories that match a category of the item being predicted.

We do not need the validation ratings just yet, but we do need the categories for each item in our validation set. This allows us to pull the relevant categories from the user profile to find how well their category tastes align with the categories of the item we're predicting for.

In [19]:
prediction = category_total.merge(
    valid_data, how="inner", on=["reviewerID", "category"]
)

prediction.head()

Unnamed: 0,reviewerID,category,ratio,asin,overall
0,A11KZN1K3UYV6S,Cables & Accessories,0.141304,B003XN24GY,3.0
1,A102F7EHNVW30Q,Camera & Photo,0.128713,B004V97MXE,5.0
2,A100UD67AHFODS,Computers & Accessories,0.130855,B00HHRP11C,5.0
3,A10NMELR4KX0J6,Mice,0.013766,B00B0TMWTW,4.0
4,A10ZFE6YE0UHW8,Computers & Accessories,0.16526,B00IWA1MNE,4.0


In [20]:
prediction = prediction.drop(["category", "overall"], axis=1).groupby(['reviewerID', 'asin'], as_index=False).sum()

prediction.head()

Unnamed: 0,reviewerID,asin,ratio
0,A00101847G3FJTWYGNQA,B00B19L8LO,0.432836
1,A00166281YWM98A3SVD55,B007B5S8BU,0.344828
2,A0046696382DWIPVIWO0K,B00HMXIKCS,0.27907
3,A00472881KT6WR48K907X,B00GYL9KK0,0.059524
4,A00473363TJ8YSZ3YAGG9,B003NR57BY,0.395833
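
The merge-then-sum step can be sketched on made-up data as follows. It uses pandas as a stand-in for cuDF, and the `profile` and `valid_item` DataFrames are hypothetical.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU

# Hypothetical user profile: u1's preference ratio for each category.
profile = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1"],
    "category": ["Mice", "Keyboards", "Cables"],
    "ratio": [0.5, 0.3, 0.2],
})

# Hypothetical validation item "i9", which belongs to two categories.
valid_item = pd.DataFrame({
    "reviewerID": ["u1", "u1"],
    "asin": ["i9", "i9"],
    "category": ["Mice", "Monitors"],
})

# The inner merge keeps only categories present in both tables ("Mice").
matched = profile.merge(valid_item, how="inner", on=["reviewerID", "category"])

# Summing the matched ratios gives the user's score for the item.
score = matched.groupby(["reviewerID", "asin"], as_index=False)["ratio"].sum()

print(score)
```

One consequence of the inner merge is that an item whose categories never appear in the user's profile drops out entirely, so it receives no prediction at all.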


## 5. Validation Metrics

The best way to test a recommender model is to do [A/B testing](https://en.wikipedia.org/wiki/A/B_testing), as the goal is to maximize user interaction with our systems. However, we do not want to risk turning test users away with a bad model, so we can sanity check our model against other metrics beforehand.

Here, we'll calculate [Root Mean Squared Error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) with our validation set. Let's merge our predicted ratings with the validation data's true rating.

In [21]:
prediction = prediction.merge(valid_overall, how="inner", on=['reviewerID', 'asin'])

prediction.head()

Unnamed: 0,reviewerID,asin,ratio,overall
0,A3CFH5XEOD42DR,B00006B82A,0.466667,5.0
1,A3RWO8892KE5U,B00007M1TZ,0.166667,4.0
2,A2GPSM71K9GL91,B00007M1TZ,0.068493,5.0
3,A38YMBLNIKM7X3,B00007M1TZ,0.090909,2.0
4,A24W5O9HXK2KLX,B00006B7SG,0.296296,1.0


As in our example problem [in the previous notebook](1-03_content_based_intro.ipynb), we'll need to scale up our predictions.

Please replace the `FIXME`s so that we have predicted ratings with the correct scale.

In [22]:
# FIXME: convert to 1 - 5 scale
min_rating = 1
max_rating = 5
prediction["predicted_rating"] = prediction["ratio"] * (max_rating - min_rating) + min_rating

prediction.head()

Unnamed: 0,reviewerID,asin,ratio,overall,predicted_rating
0,A3CFH5XEOD42DR,B00006B82A,0.466667,5.0,2.866667
1,A3RWO8892KE5U,B00007M1TZ,0.166667,4.0,1.666667
2,A2GPSM71K9GL91,B00007M1TZ,0.068493,5.0,1.273973
3,A38YMBLNIKM7X3,B00007M1TZ,0.090909,2.0,1.363636
4,A24W5O9HXK2KLX,B00006B7SG,0.296296,1.0,2.185185


Next, we'll calculate the squared error by subtracting the true value from the predicted value and squaring the result.

In [23]:
prediction["squared_error"] = (
    prediction["predicted_rating"] - prediction["overall"]
) ** 2

prediction.head()

Unnamed: 0,reviewerID,asin,ratio,overall,predicted_rating,squared_error
0,A3CFH5XEOD42DR,B00006B82A,0.466667,5.0,2.866667,4.551111
1,A3RWO8892KE5U,B00007M1TZ,0.166667,4.0,1.666667,5.444444
2,A2GPSM71K9GL91,B00007M1TZ,0.068493,5.0,1.273973,13.88328
3,A38YMBLNIKM7X3,B00007M1TZ,0.090909,2.0,1.363636,0.404959
4,A24W5O9HXK2KLX,B00006B7SG,0.296296,1.0,2.185185,1.404664
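
As a sanity check, the scaling and squared-error steps can be reproduced by hand on the five rows displayed above. The sketch uses pandas as a stand-in for cuDF, with the `ratio` and `overall` values copied from the output table; the `rows` name is just for illustration.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU

# The five rows shown in the output above.
rows = pd.DataFrame({
    "ratio": [0.466667, 0.166667, 0.068493, 0.090909, 0.296296],
    "overall": [5.0, 4.0, 5.0, 2.0, 1.0],
})

# Map a ratio in [0, 1] linearly onto the 1 - 5 rating scale.
min_rating, max_rating = 1, 5
rows["predicted_rating"] = rows["ratio"] * (max_rating - min_rating) + min_rating

# Squared difference between predicted and true rating.
rows["squared_error"] = (rows["predicted_rating"] - rows["overall"]) ** 2

print(rows)
```

A ratio of 0 maps to a predicted rating of 1 and a ratio of 1 maps to 5, so every prediction stays within the rating scale.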


Finally, we can find the Root Mean Squared Error (RMSE) by [averaging](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.mean.html) the `squared_error`s and taking the square root.

In [24]:
prediction["squared_error"].mean() ** .5

2.6019044951648507

So, on average, the predictions are off by roughly 2.6 points. Considering that the rating scale runs from 1 to 5, that's a pretty big error.

However, there are some things to note.
* Predicted ratings are relative to other items. Take the example where a user would rate item A 5 and item B 1. It doesn't matter so much whether our system predicts {A 4, B 3} or {A 3, B 2.5}. In both cases, item A is recommended to the user over B.
* In the next notebook, we'll go over collaborative filtering, which can produce better predictions with more data. There are some instances where content-based filtering can make predictions where collaborative filtering cannot. Try to keep track of the differences while going through the next notebook.
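
The first point can be illustrated with a small sketch using the numbers from that example. The `ranking` helper is hypothetical and simply orders items from highest to lowest score.

```python
# Ratings from the example above: the true ratings and two sets of predictions.
true_ratings = {"A": 5, "B": 1}
predictions_1 = {"A": 4, "B": 3}
predictions_2 = {"A": 3, "B": 2.5}


def ranking(scores):
    """Return item names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# All three orderings put item A ahead of item B.
print(ranking(true_ratings), ranking(predictions_1), ranking(predictions_2))
```

Both prediction sets have a nonzero squared error against the true ratings, yet they lead to the same recommendation order.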

## 6. Wrap Up

Congratulations on making it this far! Please run the cell below to shut down the kernel before moving on to the <a href="1-05_als.ipynb">next notebook</a>.

In [25]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}
