# Get Started

*by [Longqi@Cornell](http://www.cs.cornell.edu/~ylongqi/) licensed under [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)*

This tutorial walks through training and evaluating a recommendation algorithm with OpenRec (for implementing a new recommendation algorithm, see this [tutorial]()):
 * Prepare training and evaluation datasets.
 * Instantiate a recommender.
 * Instantiate a sampler.
 * Instantiate evaluators.
 * Instantiate a model trainer.
 * TRAIN AND EVALUATE!

### Prepare training and evaluation datasets

* Download your favorite dataset from the web. This tutorial uses [a relatively small citeulike dataset](http://www.wanghao.in/CDL.htm) for demonstration purposes (unpacking the data requires the `unrar` package).

In [1]:
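import rarfile

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# Download the citeulike archive and unpack it into the working directory.
urlretrieve('http://www.wanghao.in/data/ctrsr_datasets.rar', 'ctrsr_datasets.rar')
rar = rarfile.RarFile('ctrsr_datasets.rar')
rar.extractall()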

* Convert the raw data into a [numpy structured array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). The **ImplicitDataset** class requires two keys, `user_id` and `item_id`. Each row in the converted array represents one interaction. The array may contain additional keys, depending on the use case.

In [2]:
import numpy as np
import random

users_count = 0
interactions_count = 0
with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:
    for line in fin:
        interactions_count += int(line.split()[0])
        users_count += 1

# Randomly hold out one item per user for validation and one for testing.
val_structured_arr = np.zeros(users_count, dtype=[('user_id', np.int32), ('item_id', np.int32)])
test_structured_arr = np.zeros(users_count, dtype=[('user_id', np.int32), ('item_id', np.int32)])
# Two interactions per user are held out, so the remainder is used for training.
train_structured_arr = np.zeros(interactions_count - 2 * users_count, dtype=[('user_id', np.int32), ('item_id', np.int32)])

interaction_ind = 0
next_user_id = 0
next_item_id = 0
map_to_item_id = dict()  # Map raw item ids to contiguous ids in [0, len(items)-1]

with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:
    for line in fin:
        item_list = line.split()[1:]
        random.shuffle(item_list)
        for ind, item in enumerate(item_list):
            if item not in map_to_item_id:
                map_to_item_id[item] = next_item_id
                next_item_id += 1
            if ind == 0:
                val_structured_arr[next_user_id] = (next_user_id, map_to_item_id[item])
            elif ind == 1:
                test_structured_arr[next_user_id] = (next_user_id, map_to_item_id[item])
            else:
                train_structured_arr[interaction_ind] = (next_user_id, map_to_item_id[item])
                interaction_ind += 1
        next_user_id += 1
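
To make the target format concrete, here is a minimal sketch of a structured array with the same `user_id` and `item_id` fields. The toy interactions and the extra `timestamp` field are made up for illustration and are not part of the citeulike data.

```python
import numpy as np

# Same two-field dtype as the arrays built in the tutorial.
interaction_dtype = [('user_id', np.int32), ('item_id', np.int32)]

# Three toy interactions (made-up ids): user 0 with items 2 and 5, user 1 with item 2.
toy_interactions = np.array([(0, 2), (0, 5), (1, 2)], dtype=interaction_dtype)

# Fields are accessed by name and behave like ordinary 1-D arrays.
user_ids = toy_interactions['user_id']
item_ids = toy_interactions['item_id']

# A boolean mask selects the rows (interactions) of a single user.
items_of_user_0 = toy_interactions[user_ids == 0]['item_id']

# An additional key (hypothetical 'timestamp') is added by extending the dtype.
extended_dtype = interaction_dtype + [('timestamp', np.int64)]
toy_with_time = np.zeros(len(toy_interactions), dtype=extended_dtype)
toy_with_time['user_id'] = user_ids
toy_with_time['item_id'] = item_ids

print(toy_interactions.dtype.names)  # ('user_id', 'item_id')
print(items_of_user_0.tolist())      # [2, 5]
```

Each row is one interaction, so the array length equals the number of interactions rather than the number of users.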

* Instantiate the training, validation, and testing datasets. Because the data comes from users' implicit feedback, we use the **ImplicitDataset** class rather than the general **Dataset** class.

In [3]:
from openrec.utils import ImplicitDataset

train_dataset = ImplicitDataset(raw_data=train_structured_arr,
                                max_user=users_count,
                                max_item=len(map_to_item_id), name='Train')
val_dataset = ImplicitDataset(raw_data=val_structured_arr,
                              max_user=users_count,
                              max_item=len(map_to_item_id), name='Val')
test_dataset = ImplicitDataset(raw_data=test_structured_arr,
                               max_user=users_count,
                               max_item=len(map_to_item_id), name='Test')

### Instantiate a recommender

We use the [BPR recommender](http://openrec.readthedocs.io/en/latest/recommenders/openrec.recommenders.bpr.html), which implements the pure Bayesian Personalized Ranking (BPR) algorithm.

In [4]:
from openrec.recommenders import BPR

bpr_model = BPR(batch_size=1000, 
                max_user=train_dataset.max_user(), 
                max_item=train_dataset.max_item(), 
                dim_embed=20, 
                opt='Adam')
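
For readers new to BPR, the pairwise objective can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions (dot-product scores, no bias terms, no regularization). The function name `bpr_loss` and the toy embeddings are hypothetical and are not part of OpenRec's API.

```python
import numpy as np

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Sum over a batch of -log(sigmoid(score_pos - score_neg)).

    Hypothetical helper for illustration; not OpenRec's implementation.
    Each argument has shape (batch_size, dim_embed).
    """
    pos_scores = np.sum(user_emb * pos_item_emb, axis=1)
    neg_scores = np.sum(user_emb * neg_item_emb, axis=1)
    diff = pos_scores - neg_scores
    # -log(sigmoid(d)) equals log(1 + exp(-d)); logaddexp is numerically stable.
    return float(np.sum(np.logaddexp(0.0, -diff)))

# Toy embeddings (made up): one user, one liked item, one non-interacted item.
user = np.array([[1.0, 0.0]])
liked = np.array([[1.0, 0.0]])
other = np.array([[-1.0, 0.0]])

good_ranking_loss = bpr_loss(user, liked, other)  # liked item scored higher
bad_ranking_loss = bpr_loss(user, other, liked)   # liked item scored lower
print(good_ranking_loss < bad_ranking_loss)       # True
```

The loss shrinks as the interacted item is scored further above the non-interacted one, which is the ranking behavior that training is meant to encourage.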

### Instantiate a sampler

A basic [pairwise sampler](http://openrec.readthedocs.io/en/latest/utils/openrec.utils.samplers.html) is used, i.e., each instance contains a user, an item that the user interacted with, and an item that the user did NOT interact with.

In [5]:
from openrec.utils.samplers import PairwiseSampler

sampler = PairwiseSampler(batch_size=1000, dataset=train_dataset, num_process=1)
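
The idea behind pairwise sampling can be sketched in plain Python. This is a simplified illustration, not the code inside `PairwiseSampler`: the helper `sample_pairwise_batch`, its arguments, and the toy data are all hypothetical, and negatives are drawn by simple rejection sampling.

```python
import random

def sample_pairwise_batch(user_items, num_items, batch_size, rng):
    """Return a list of (user, positive_item, negative_item) triples.

    Hypothetical helper: `user_items` maps each user id to the set of item
    ids that the user interacted with. Assumes every user has at least one
    item they did NOT interact with; otherwise the inner loop never ends.
    """
    users = sorted(user_items)
    batch = []
    while len(batch) < batch_size:
        user = rng.choice(users)
        positives = user_items[user]
        pos_item = rng.choice(sorted(positives))
        # Rejection sampling: redraw until the item is not a positive.
        neg_item = rng.randrange(num_items)
        while neg_item in positives:
            neg_item = rng.randrange(num_items)
        batch.append((user, pos_item, neg_item))
    return batch

# Toy interactions (made up): 2 users, 6 items.
toy_user_items = {0: {2, 5}, 1: {2}}
batch = sample_pairwise_batch(toy_user_items, num_items=6,
                              batch_size=8, rng=random.Random(0))
for user, pos_item, neg_item in batch:
    print(user, pos_item, neg_item)
```

Rejection sampling is cheap when interactions are sparse, as they typically are in implicit feedback data, because most random draws are already valid negatives.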

### Instantiate evaluators

Define the evaluators that you plan to use. This tutorial evaluates the recommender using Area Under the Curve (AUC).

In [6]:
from openrec.utils.evaluators import AUC

auc_evaluator = AUC()
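
As a rough guide to what a per-user AUC measures, the sketch below computes the fraction of (positive, negative) item pairs that a model ranks correctly. The function `user_auc`, the toy scores, and the convention of counting ties as half are assumptions made for illustration; the `AUC` evaluator may differ in its details.

```python
import numpy as np

def user_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Hypothetical helper for illustration. Ties count as half a correct pair.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    correct = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(correct) / (pos.size * neg.size)

# One held-out item scored 0.9 against four non-interacted items (made-up scores).
auc = user_auc([0.9], [0.1, 0.5, 0.95, 0.3])
print(auc)  # 0.75: the held-out item outranks 3 of the 4 other items
```

With one held-out item per user, as in this tutorial's validation and test sets, the value is simply the share of other items ranked below the held-out item.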

### Instantiate a model trainer

The **implicit model trainer** drives the training and evaluation of the recommender using the *implicit feedback datasets*, sampler, model, and evaluators defined above.

In [7]:
from openrec import ImplicitModelTrainer

model_trainer = ImplicitModelTrainer(batch_size=1000,
                                     test_batch_size=100,
                                     train_dataset=train_dataset,
                                     model=bpr_model,
                                     sampler=sampler)

### TRAIN AND EVALUATE!

In [8]:
model_trainer.train(num_itr=10000, 
                    display_itr=1000, 
                    eval_datasets=[val_dataset, test_dataset],
                    evaluators=[auc_evaluator])

== Start training with FULL evaluation ==
[Itr 1000] loss: 574.840049
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:35<00:00,  1.58it/s]
..(dataset: Val) AUC 0.870014176351
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:35<00:00,  1.57it/s]
..(dataset: Test) AUC 0.867485912866
[Itr 2000] loss: 282.632813
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:34<00:00,  1.62it/s]
..(dataset: Val) AUC 0.91076701659
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:34<00:00,  1.60it/s]
..(dataset: Test) AUC 0.910444554201
[Itr 3000] loss: 174.990474
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:34<00:00,  1.62it/s]
..(dataset: Val) AUC 0.926863193579
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.67it/s]
..(dataset: Test) AUC 0.926612262175
[Itr 4000] loss: 127.946962
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.66it/s]
..(dataset: Val) AUC 0.935330491623
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:34<00:00,  1.62it/s]
..(dataset: Test) AUC 0.935395690621
[Itr 5000] loss: 101.319234
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.67it/s]
..(dataset: Val) AUC 0.940758071975
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.67it/s]
..(dataset: Test) AUC 0.940972548345
[Itr 6000] loss: 84.372617
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:34<00:00,  1.64it/s]
..(dataset: Val) AUC 0.944326776768
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.66it/s]
..(dataset: Test) AUC 0.944723649225
[Itr 7000] loss: 72.579934
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.65it/s]
..(dataset: Val) AUC 0.946881856049
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:34<00:00,  1.63it/s]
..(dataset: Test) AUC 0.947246023978
[Itr 8000] loss: 63.512150
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.68it/s]
..(dataset: Val) AUC 0.948353025894
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.65it/s]
..(dataset: Test) AUC 0.949117874329
[Itr 9000] loss: 56.601964
..(dataset: Val) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.67it/s]
..(dataset: Val) AUC 0.949670341536
..(dataset: Test) evaluation
100%|██████████| 56/56 [00:33<00:00,  1.66it/s]
..(dataset: Test) AUC 0.95068552518
