In [1]:
import os
os.chdir('..')

import warnings
warnings.filterwarnings("ignore")

# Data Overview

* The goal of this notebook is to provide an overview of input data required to train and evaluate a recommender algorithm.
* Before building a recommender, you must **format your data according to** the specifications in this notebook.
* For input data, we use the famous [MovieLens 100K dataset](https://grouplens.org/datasets/movielens/100k/) as a concrete application.

# Table of Contents

1. [Interaction Data](#Interaction-Data)
2. [User Features](#User-Features)
3. [Item Features](#Item-Features)
4. [Extended Data](#Extended-Data)
   

# Interaction Data

* To train a recommender, we need data about a set of users $U$, a set of items $I$, and their interactions $R$.
* The interactions can be *explicit* (e.g. ratings) or *implicit* (e.g. clicks). 
* Each interaction can be represented as a tuple ($u$, $i$, $r$) where $u$ is a **user_id**, $i$ is an **item_id** and $r$ is the observed **response**.
* The training and testing data are stored in `data_train.csv` and `data_test.csv`, respectively. 
* In the train/test data, the response for a MovieLens interaction is 1 if the user rated the item (movie) 5 on a 1-5 point scale, and 0 otherwise.
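
The binarization rule above can be sketched in pandas. This is a minimal illustration, not the preprocessing actually used to create `data_train.csv`: the `ratings_df` frame, its `rating` column, and the values in it are hypothetical.

```python
import pandas as pd

# Hypothetical raw ratings on a 1-5 scale (illustrative values, not taken from MovieLens).
ratings_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 5],
})

# Response is 1 only when the rating equals 5, and 0 otherwise.
ratings_df["response"] = (ratings_df["rating"] == 5).astype(int)

# Keep the (user_id, item_id, response) tuple layout described above.
interactions_df = ratings_df[["user_id", "item_id", "response"]]
print(interactions_df)
```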

In [2]:
import pandas as pd

train_df = pd.read_csv("data/data_train.csv")

train_df.head()

Unnamed: 0,user_id,item_id,response
0,843,427,0
1,144,173,1
2,601,250,0
3,751,751,0
4,201,275,0


* As with any machine learning approach, the best practice is to split your data into separate **training**, **validation**, and **testing** sets.
* The specific approach (e.g. random split, time-holdout, k-fold cross-validation) will depend on the nature of the data and the application.
* For demonstration purposes, we use a 10% time-holdout for the test data.
* Let's explore interaction statistics in training and test datasets.
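
Before looking at the statistics, here is one possible reading of a time-holdout split, sketched under assumptions: the most recent 10% of interactions (by timestamp) go to the test set. The `timestamp` column and all values are hypothetical, since the CSV files in this notebook do not include timestamps, and the split that produced `data_test.csv` may differ.

```python
import pandas as pd

# Hypothetical interactions with a timestamp column (not present in the files shown here).
df = pd.DataFrame({
    "user_id": list(range(1, 11)),
    "item_id": [5, 3, 5, 7, 3, 7, 5, 3, 7, 5],
    "response": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    "timestamp": [100, 40, 70, 10, 90, 20, 60, 30, 80, 50],
})

# Sort chronologically so the newest interactions come last.
df = df.sort_values("timestamp").reset_index(drop=True)

# Hold out the most recent 10% of rows (at least one row) for testing.
n_test = max(1, int(round(0.1 * len(df))))
train_part = df.iloc[:-n_test]
test_part = df.iloc[-n_test:]

print(len(train_part), len(test_part))
```

Splitting by time rather than at random keeps future interactions out of the training data, which is closer to how a deployed recommender is evaluated.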
 

In [3]:
from mab2rec.utils import print_interaction_stats

test_df = pd.read_csv("data/data_test.csv")

print("=== TRAIN ===")
print_interaction_stats(train_df)

print("=== TEST ===")
print_interaction_stats(test_df)

=== TRAIN ===
Number of rows: 34,543
Number of users: 896
Number of items: 201
Mean response rate: 0.2665

=== TEST ===
Number of rows: 4,797
Number of users: 112
Number of items: 194
Mean response rate: 0.2723



* As shown in the summary statistics, there are ~200 items (movies) in the train and test sets (201 and 194, respectively) with an average response rate of ~27%.
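
The same kind of summary can be computed directly with pandas. The helper below is a hypothetical stand-in written for illustration; it is not the implementation of `print_interaction_stats`. It is applied to the five training rows displayed earlier.

```python
import pandas as pd

def summarize_interactions(df: pd.DataFrame) -> dict:
    """Return row count, distinct users/items, and mean response (hypothetical helper)."""
    return {
        "rows": len(df),
        "users": df["user_id"].nunique(),
        "items": df["item_id"].nunique(),
        "mean_response": df["response"].mean(),
    }

# The five training rows shown in the output of train_df.head().
sample_df = pd.DataFrame({
    "user_id": [843, 144, 601, 751, 201],
    "item_id": [427, 173, 250, 751, 275],
    "response": [0, 1, 0, 0, 0],
})

stats = summarize_interactions(sample_df)
print(stats)
```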

# User Features

* It is possible to build recommendation algorithms based on interaction data only, e.g., via matrix factorization.   
* More advanced algorithms utilize other data in addition to the interactions. 
* In context-aware recommenders, user features (contexts) provide information about each user.
* Let's explore one-hot encoded user features from the MovieLens data. 

In [4]:
user_df = pd.read_csv("data/features_user.csv")

user_df.head()

Unnamed: 0,user_id,u0,u1,u2,u3,u4,u5,u6,u7,u8,...,u22,u23,u24,u25,u26,u27,u28,u29,u30,u31
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,2,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
2,3,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,5,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
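
One-hot features of this shape can be produced with `pd.get_dummies`. The sketch below assumes hypothetical raw attributes (`age_group`, `occupation`) and made-up values; the notebook does not state which attributes the columns `u0` to `u31` encode.

```python
import pandas as pd

# Hypothetical raw user attributes (illustrative values only).
raw_user_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age_group": ["18-24", "25-34", "18-24"],
    "occupation": ["student", "engineer", "artist"],
})

# Expand each categorical column into 0/1 indicator columns.
encoded_df = pd.get_dummies(raw_user_df, columns=["age_group", "occupation"], dtype=int)

# Rename the indicator columns to positional names u0, u1, ... to mirror the layout above.
feature_cols = [c for c in encoded_df.columns if c != "user_id"]
encoded_df = encoded_df.rename(columns={c: f"u{k}" for k, c in enumerate(feature_cols)})

print(encoded_df)
```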


# Item Features

* In content-aware recommenders, item features provide information about each item. This information can be used, for instance, to measure **item similarity**.
* In particular, this can be helpful for solving the [cold-start problem](https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)) when the test set contains items that were not part of training.
* In general, item features can form **latent representations** based on embeddings of unstructured text, images, video, or audio related to each item.
* Mab2Rec allows building **content- and context-aware** recommendation algorithms.
* Let's explore the one-hot encoded item features based on movie genre. 

In [5]:
item_df = pd.read_csv("data/features_item.csv")

item_df.head()

Unnamed: 0,item_id,i0,i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16,i17,i18
0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,4,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
2,7,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
3,8,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
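
As an illustration of item similarity, the five rows displayed above can be compared with cosine similarity. This is a minimal sketch using NumPy; cosine similarity is one common choice and is an assumption here, not necessarily the measure used elsewhere in Mab2Rec.

```python
import numpy as np

# Active (value 1) feature indices for the five items shown above, keyed by item_id.
active = {1: [3, 4, 5], 4: [1, 5, 8], 7: [8, 15], 8: [4, 5, 8], 9: [8]}
item_ids = list(active)

# Rebuild the 19-dimensional one-hot feature matrix (columns i0 to i18).
X = np.zeros((len(item_ids), 19))
for row, item_id in enumerate(item_ids):
    X[row, active[item_id]] = 1.0

# Cosine similarity: normalize each row to unit length, then take dot products.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_unit @ X_unit.T

print(np.round(sim, 3))
```

Items 4 and 8 share two of their three active features, so their similarity is 2/3, while items 1 and 9 share none and have similarity 0.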


# Extended Data

* Our data overview so far covers user features, item features, and their interactions in the train/test splits.
* This setup suffices for most recommenders.
* Still, practical considerations such as item eligibility, cold-start problems, and memory efficiency might emerge in the real world.
* The extended datasets provide additional information, and the advanced functionality notebook shows how to incorporate it when building recommenders.

## Eligibility 
* So far we assumed that all items are eligible for all users. 
* In practice, this might not be the case. 
* In the extended data, `data_eligibility.csv` provides the list of eligible items for each user. 
* See the advanced functionality notebook to incorporate this extended data and address eligibility criteria.
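
The idea of eligibility filtering can be sketched independently of any library. Everything below is hypothetical: the layout of `data_eligibility.csv` is not shown in this notebook, so the sketch assumes a simple mapping from each user to a list of eligible item ids, and the scores are made up.

```python
import pandas as pd

# Hypothetical eligibility mapping: user_id -> list of eligible item_ids.
eligibility = {1: [10, 20], 2: [20, 30, 40]}

# Hypothetical recommendation scores for every (user, item) pair considered.
scores_df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "item_id": [10, 20, 30, 10, 20, 30],
    "score": [0.9, 0.4, 0.8, 0.7, 0.2, 0.6],
})

# Flatten the mapping into one row per eligible (user, item) pair.
eligible_pairs = pd.DataFrame(
    [(user, item) for user, items in eligibility.items() for item in items],
    columns=["user_id", "item_id"],
)

# Keep only scores for eligible pairs, then pick the best remaining item per user.
eligible_scores = scores_df.merge(eligible_pairs, on=["user_id", "item_id"], how="inner")
best = eligible_scores.sort_values("score", ascending=False).groupby("user_id").head(1)
top_item = {int(u): int(i) for u, i in zip(best["user_id"], best["item_id"])}

print(top_item)
```

Note that user 2's highest-scoring item (10) is dropped because it is not eligible, so the next-best eligible item is chosen instead.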

## Feature Data Types
* In applications with a large user base, the user features file can be quite large, leading to out-of-memory issues.
* A practical approach is to store the data type of each user feature and then load the user features with those data types explicitly specified.
* This can bring substantial memory savings.
* In the extended data, `features_user_dtypes.json` provides an example of specifying data types.
* See the advanced functionality notebook to incorporate this extended data and address memory issues.
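
The memory effect of explicit data types can be seen with `pd.read_csv` and its `dtype` argument. This is a minimal sketch under assumptions: the JSON layout (a mapping from column name to dtype string) and the tiny in-memory CSV are hypothetical, and the actual contents of `features_user_dtypes.json` may differ.

```python
import io
import json

import pandas as pd

# Hypothetical dtype specification: column name -> pandas/NumPy dtype string.
dtypes_json = '{"user_id": "int32", "u0": "int8", "u1": "int8"}'
dtypes = json.loads(dtypes_json)

# Tiny in-memory CSV standing in for a large user features file.
csv_text = "user_id,u0,u1\n1,0,1\n2,1,0\n3,0,0\n"

# Without dtypes, integer columns default to int64 (8 bytes per value).
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtypes, one-hot columns fit in int8 (1 byte per value).
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

default_bytes = default_df.memory_usage(index=False).sum()
typed_bytes = typed_df.memory_usage(index=False).sum()
print(default_bytes, typed_bytes)
```

Since one-hot features only take the values 0 and 1, storing them as `int8` instead of the default `int64` reduces their footprint by a factor of eight.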
 