In [None]:
import os
import json
import io
import imp

The Cindicator contest simulates online training, so the interface is a bit more complicated than the usual train->predict procedure.

There are three methods that you have to implement: `train`, `predict`, `update`.

First of all, the `train` method of your model is called on historical data.

Then we call `predict` on a batch of samples.

Finally, the `update` method is called with the same batch of samples, but this time you are provided with the true answers for them.
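
The call order described above can be sketched with a toy model. This is only an illustration: `ToyModel` and its running-mean logic are hypothetical and are not the Dbrain baseline, and the real model receives CSV bytes rather than plain strings and numbers.

```python
class ToyModel:
    """Hypothetical stand-in for a contest model.

    It predicts the running mean of all answers seen so far.
    """

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def train(self, history):
        # Called once, on historical answers.
        for answer in history:
            self.total += answer
            self.count += 1

    def predict(self, batch):
        # Called on a batch of samples; the true answers are not known yet.
        mean = self.total / self.count if self.count else 0.5
        return [mean for _ in batch]

    def update(self, labeled_batch):
        # Called on the same batch as (sample, answer) pairs
        # once the true answers are revealed.
        for _sample, answer in labeled_batch:
            self.total += answer
            self.count += 1


toy = ToyModel()
toy.train([1, 0, 1, 1])            # 1. historical data
first = toy.predict(["sample"])    # 2. prediction before the answer is known
toy.update([("sample", 0)])        # 3. the true answer arrives
second = toy.predict(["sample"])   # the next prediction reflects the update
```

The second prediction differs from the first only because `update` was called in between, which is the behaviour the contest interface is built around.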

Let's take a look at an example of how a model is trained and evaluated on the Dbrain platform.

In [None]:
# some dirty hacks with imports to maintain nice directory layout
lr_baseline_module = imp.load_source("lr", "lr/ds_model.py")

In [None]:
assets_dir = ""
dump_dir = "dump_dir/"

The model is initialized with two directories.

- `assets_dir`: all the large assets that you uploaded through the web page will be there

- `dump_dir`: you can write anything to this directory; all the data that you store here will persist during training and evaluation
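
As a rough sketch of how `dump_dir` might be used, the snippet below pickles a model's state into that directory and restores it in a fresh object. The class `PersistentState`, the file name `state.pkl`, and the choice of `pickle` are assumptions made for illustration; the baseline in `lr/ds_model.py` may store its state differently.

```python
import os
import pickle
import tempfile


class PersistentState:
    """Hypothetical helper: keeps a dict of parameters and persists it in dump_dir."""

    def __init__(self, assets_dir, dump_dir):
        self.assets_dir = assets_dir  # uploaded assets live here
        self.dump_dir = dump_dir      # writable directory that persists
        self.params = {}

    def _path(self):
        # Assumed file name; any name inside dump_dir would do.
        return os.path.join(self.dump_dir, "state.pkl")

    def dump(self):
        os.makedirs(self.dump_dir, exist_ok=True)
        with open(self._path(), "wb") as f:
            pickle.dump(self.params, f)

    def load(self):
        with open(self._path(), "rb") as f:
            self.params = pickle.load(f)


# A temporary directory stands in for dump_dir in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    trained = PersistentState("", tmp)
    trained.params = {"bias": 0.25}   # made-up parameter
    trained.dump()

    restored = PersistentState("", tmp)
    restored.load()
    restored_params = restored.params
```

Creating a second object and calling `load` mirrors what happens later in this notebook, where the model is re-created before testing.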

In [None]:
model = lr_baseline_module.DSModel(assets_dir, dump_dir)

The model is trained on the training set, and then its `dump` method is called.

Here we use the preview dataset as our training data.

Let's download it and extract it to the `data/preview` directory:

In [None]:
%%script bash
mkdir -p data/preview
wget https://s3-eu-west-1.amazonaws.com/dbrain-datasets-ds/cindicator_09_08/preview.tar.gz -O data/preview.tar.gz
tar -xvf data/preview.tar.gz -C data/preview/

In [None]:
data_dir = "data/preview/"

Train model:

In [None]:
model.train(data_dir)
model.dump()

Time to test our model.

The model testing simulates online learning. At each step of testing we call `.predict` on a batch of samples and then `.update` on the same batch, but this time we provide you with the true answers.
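
A minimal sketch of this predict-then-update loop is shown below. It uses a hypothetical `LastAnswerModel` and made-up batches rather than the contest's CSV files; the helper name `evaluate_online` is also an assumption. The point is that each prediction is recorded before the model sees the corresponding answers.

```python
class LastAnswerModel:
    """Hypothetical model: predicts the most recent true answer it has seen."""

    def __init__(self):
        self.last = 0

    def predict(self, batch):
        return [self.last for _ in batch]

    def update(self, labeled_batch):
        # labeled_batch is a list of (sample, answer) pairs.
        if labeled_batch:
            self.last = labeled_batch[-1][1]


def evaluate_online(model, steps):
    """Run predict, then update, for every (batch, answers) step.

    Returns the predictions in the order they were made.
    """
    recorded = []
    for batch, answers in steps:
        recorded.extend(model.predict(batch))     # answers are still hidden
        model.update(list(zip(batch, answers)))   # answers are revealed
    return recorded


# Made-up evaluation stream: three steps with one or two samples each.
steps = [(["a"], [1]), (["b", "c"], [0, 1]), (["d"], [0])]
online_preds = evaluate_online(LastAnswerModel(), steps)
```

Because predictions are stored before `update` runs, the score computed from them never benefits from answers the model has not yet been shown.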

In [None]:
model = lr_baseline_module.DSModel(assets_dir, dump_dir)
model.load()

In [None]:
markup = "data/preview/preview_markup.json"
test_data = "data/preview/"

In [None]:
from collections import defaultdict
from typing import Dict

import pandas as pd
from tqdm import tqdm


def markup2bytes(markup: Dict[str, Dict[str, object]]) -> bytes:
    """Convert {field: {id: value}} markup into CSV bytes for `update`."""
    df = defaultdict(list)
    for field in markup:
        for idx in markup[field]:
            df['id'].append(idx)
            df[field].append(markup[field][idx])
    df = pd.DataFrame.from_dict(df)
    return df.to_csv().encode()



with open(markup) as f:
    data = json.load(f)

predictions = {}
for filename in tqdm(sorted(data.keys())):
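    path = os.path.join(test_data, filename)
    with open(path, 'rb') as f:
        csv = f.read()
    # predict first, while the true answers are still hidden
    p = model.predict([csv])[0]
    predictions[filename] = p
    # then reveal the true answers for the same sample
    y = markup2bytes(data[filename])
    model.update([(csv, y)])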
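
To make the format passed to `update` more concrete, the sketch below runs a copy of `markup2bytes` on a small hand-made markup dictionary and reads the result back. The question ids and answers are invented; only the `question_answer` field name comes from this notebook.

```python
import io
from collections import defaultdict

import pandas as pd


def markup2bytes(markup):
    # Same logic as the helper above: one row per (field, id) pair.
    df = defaultdict(list)
    for field in markup:
        for idx in markup[field]:
            df["id"].append(idx)
            df[field].append(markup[field][idx])
    return pd.DataFrame.from_dict(df).to_csv().encode()


# Made-up ids and answers for a single field.
sample_markup = {"question_answer": {"10": 1, "11": 0}}
encoded = markup2bytes(sample_markup)

# The bytes are an ordinary CSV, so they can be read back for inspection.
decoded = pd.read_csv(io.BytesIO(encoded), index_col=0)
```

Here `decoded` has one row per question, with an `id` column and a `question_answer` column.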

Now let's collect the predictions of our model on the preview set, aligned with the true answers.

In [None]:
ordered_preds = []
ordered_answers = []
for fname, answers in data.items():
    answers = answers["question_answer"]
    for q_id, answer in answers.items():
        ordered_answers.append(answer)
        ordered_preds.append(predictions[fname]["question_answer"][int(q_id)])

Let's compute the inverse log loss of our model. Samples whose true answer is 0 get weight 1, and all other samples get weight 1.5:

In [None]:
from sklearn.metrics import log_loss
sample_weights = [1 if y == 0 else 1.5 for y in ordered_answers]
1 / log_loss(ordered_answers, ordered_preds, sample_weight=sample_weights)
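
For intuition about the metric, here is a hand-rolled version of the weighted log loss on made-up numbers. It assumes the weighted loss is the weight-normalized average of the per-sample losses, which is how scikit-learn's `log_loss` behaves by default as far as we know (apart from probability clipping). The function name `weighted_log_loss` and all the numbers are hypothetical.

```python
import math


def weighted_log_loss(answers, probs, weights):
    """Weight-normalized binary cross-entropy; probs are P(answer == 1)."""
    total = 0.0
    for y, p, w in zip(answers, probs, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


answers = [0, 1, 1]        # made-up true answers
probs = [0.2, 0.7, 0.9]    # made-up predicted probabilities of answer 1
weights = [1 if y == 0 else 1.5 for y in answers]

loss = weighted_log_loss(answers, probs, weights)
inverse_loss = 1 / loss    # higher is better
```

With these weights, a poor prediction on a sample whose answer is 1 costs one and a half times as much as the same mistake on a sample whose answer is 0.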

On Dbrain, your model will go through almost the same process, except that it will be trained on the full training set and evaluated on the test set.