In [1]:
import os
import json
import io
import imp

The Cindicator contest simulates online training, so the interface is a bit more complicated than the usual train -> predict procedure.

There are three methods that you have to implement: `train`, `predict`, `update`.

First of all, the `train` method of your model is called on historical data.

Then we call `predict` on a batch of samples.

Finally, the `update` method is called with the same batch of samples, but this time you are provided with true answers for those.
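
To make the call order concrete, here is a minimal sketch of a class exposing the three methods. The constructor arguments mirror how `DSModel` is created in this notebook. Everything else is an assumption made for illustration: the class name `ToyModel`, the integer answers passed to `update`, and the running-average logic. It is not the baseline's actual implementation.

```python
from typing import List, Tuple


class ToyModel:
    """Hypothetical stand-in for a contest model; not the real DSModel."""

    def __init__(self, assets_dir: str, dump_dir: str) -> None:
        self.assets_dir = assets_dir
        self.dump_dir = dump_dir
        self.positives = 0  # number of positive answers seen so far
        self.total = 0      # number of answers seen so far

    def train(self, data_dir: str) -> None:
        # A real model would read the historical files from data_dir here.
        # This sketch starts from an uninformative prior instead.
        self.positives, self.total = 1, 2

    def predict(self, batch: List[bytes]) -> List[float]:
        # One probability per sample: the running share of positive answers.
        p = self.positives / self.total
        return [p for _ in batch]

    def update(self, batch: List[Tuple[bytes, int]]) -> None:
        # Same samples as in predict, now paired with their true answers.
        # (Assumption: answers are plain 0/1 integers in this sketch.)
        for _sample, answer in batch:
            self.positives += answer
            self.total += 1


model = ToyModel(assets_dir="", dump_dir="dump_dir/")
model.train("data/preview/")

# Online evaluation: predict first, then reveal the answers.
stream = [(b"sample-1", 1), (b"sample-2", 1), (b"sample-3", 0)]
history = []
for sample, answer in stream:
    history.append(model.predict([sample])[0])
    model.update([(sample, answer)])
```

Each prediction in `history` is made before the corresponding answer is seen, which is the property the online setup is meant to enforce.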

Let's take a look at an example of how a model is trained and evaluated on the Dbrain platform.

In [2]:
# some dirty hacks with imports to maintain nice directory layout
rf_baseline_module = imp.load_source("rf", "rf/ds_model.py")

In [3]:
assets_dir = ""
dump_dir = "dump_dir/"

The model is initialized with two directories.

- `assets_dir` - the large assets that you uploaded through the web page will be available here

- `dump_dir` - you can write anything to this directory; all the data that you store here will persist during training and evaluation
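
As an illustration of how state could be kept in `dump_dir`, the sketch below writes a dictionary with `pickle` and reads it back. The helper names `dump_state` and `load_state` and the file name `state.pkl` are hypothetical; the platform does not prescribe a storage format, and the baseline model may store its state differently.

```python
import os
import pickle
import tempfile


def dump_state(dump_dir: str, state: dict) -> str:
    """Write state to dump_dir and return the file path."""
    os.makedirs(dump_dir, exist_ok=True)
    path = os.path.join(dump_dir, "state.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def load_state(dump_dir: str) -> dict:
    """Read back what dump_state wrote."""
    with open(os.path.join(dump_dir, "state.pkl"), "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory standing in for dump_dir.
with tempfile.TemporaryDirectory() as tmp:
    dump_state(tmp, {"n_seen": 3, "weights": [0.1, 0.2]})
    restored = load_state(tmp)
```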

In [4]:
model = rf_baseline_module.DSModel(assets_dir, dump_dir)

The model is trained on the training set, and then its `dump` method is called.

Here we use the preview dataset as our training data.

Let's download it and extract it to the `data/preview` directory:

In [5]:
%%script bash
mkdir -p data/preview
wget https://s3-eu-west-1.amazonaws.com/dbrain-datasets-ds/cindicator_09_08/preview.tar.gz -O data/preview.tar.gz
tar -xvf data/preview.tar.gz -C data/preview/

./00000.csv
./00001.csv
./00002.csv
./00003.csv
./00004.csv
./00005.csv
./00006.csv
./00007.csv
./00008.csv
./00009.csv
./00010.csv
./00011.csv
./00012.csv
./00013.csv
./00014.csv
./00015.csv
./00016.csv
./00017.csv
./00018.csv
./00019.csv
./00020.csv
./00021.csv
./00022.csv
./00023.csv
./00024.csv
./00025.csv
./00026.csv
./00027.csv
./00028.csv
./00029.csv
./00030.csv
./00031.csv
./00032.csv
./00033.csv
./00034.csv
./00035.csv
./00036.csv
./00037.csv
./00038.csv
./00039.csv
./00040.csv
./00041.csv
./00042.csv
./00043.csv
./00044.csv
./00045.csv
./00046.csv
./00047.csv
./00048.csv
./00049.csv
./00050.csv
./00051.csv
./00052.csv
./00053.csv
./00054.csv
./00055.csv
./00056.csv
./00057.csv
./00058.csv
./00059.csv
./00060.csv
./00061.csv
./00062.csv
./00063.csv
./00064.csv
./00065.csv
./00066.csv
./00067.csv
./00068.csv
./00069.csv
./00070.csv
./00071.csv
./00072.csv
./00073.csv
./00074.csv
./00075.csv
./00076.csv
./00077.csv
./00078.csv
./00079.csv
./00080.csv
./00081.csv
./00082.csv
./00

--2018-08-31 19:44:35--  https://s3-eu-west-1.amazonaws.com/dbrain-datasets-ds/cindicator_09_08/preview.tar.gz
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.84.178
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.84.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26390400 (25M) [application/x-gzip]
Saving to: ‘data/preview.tar.gz’


In [6]:
data_dir = "data/preview/"

Train the model:

In [7]:
model.train(data_dir)
model.dump()



Time to test our model.

The model testing simulates online learning. At each step of testing we call `.predict` on a batch of samples and then `.update` on the same batch, but this time we provide you with the true answers.

In [8]:
model = rf_baseline_module.DSModel(assets_dir, dump_dir)
model.load()

In [9]:
markup = "data/preview/markup.json"
test_data = "data/preview/"

In [None]:
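from collections import defaultdict
from typing import Dict

import pandas as pd
from tqdm import tqdm


def markup2bytes(markup: Dict[str, Dict[str, object]]) -> bytes:
    """Serialize the answers for one file into CSV bytes for `model.update`."""
    # 'id' is appended once per (field, idx) pair, so the columns only line up
    # when the markup holds a single field.
    columns = defaultdict(list)
    for field in markup:
        for idx in markup[field]:
            columns['id'].append(idx)
            columns[field].append(markup[field][idx])
    df = pd.DataFrame.from_dict(columns)
    return df.to_csv().encode()


# markup.json maps each file name to the true answers for that file
with open(markup) as f:
    data = json.load(f)

predictions = {}
for filename in tqdm(sorted(data.keys())):
    csv_path = os.path.join(test_data, filename)
    with open(csv_path, 'rb') as f:
        csv_bytes = f.read()
    # Predict first, then reveal the true answers for the same sample
    p = model.predict([csv_bytes])[0]
    predictions[filename] = p
    y = markup2bytes(data[filename])
    model.update([(csv_bytes, y)])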

Now we have our model's predictions on the preview set. Let's line them up with the true answers:

In [12]:
ordered_preds = []
ordered_answers = []
for fname, answers in data.items():
    answers = answers["question_answer"]
    for q_id, answer in answers.items():
        ordered_answers.append(answer)
        ordered_preds.append(predictions[fname]["question_answer"][int(q_id)])

Let's compute the inverse log loss of our model. Samples whose answer is 0 get weight 1; all others get weight 1.5:

In [13]:
from sklearn.metrics import log_loss
sample_weights = [1 if y == 0 else 1.5 for y in ordered_answers]
1 / log_loss(ordered_answers, ordered_preds, sample_weight=sample_weights)

1.9779830224521984
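
To show what this score measures, here is a rough standard-library sketch of a weighted binary log loss and its inverse. It assumes the answers are 0 or 1 and the predictions are probabilities of the positive class. The function name `weighted_log_loss`, the toy numbers, and the clipping constant `eps` are illustrative choices, so results may differ slightly from `sklearn.metrics.log_loss` on extreme predictions.

```python
import math
from typing import Sequence


def weighted_log_loss(
    y_true: Sequence[int],
    y_pred: Sequence[float],
    weights: Sequence[float],
    eps: float = 1e-15,  # assumed clipping constant, keeps log() finite
) -> float:
    """Weighted average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p, w in zip(y_true, y_pred, weights):
        p = min(max(p, eps), 1 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


# Toy data, not taken from the preview set.
answers = [1, 0, 1, 0]
preds = [0.9, 0.2, 0.6, 0.4]
weights = [1 if y == 0 else 1.5 for y in answers]

loss = weighted_log_loss(answers, preds, weights)
score = 1 / loss  # higher is better, since a lower loss gives a larger inverse
```

Because positive answers carry weight 1.5, a poor prediction on a positive sample lowers the score more than an equally poor prediction on a negative one.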

On Dbrain, your model will go through almost the same process, except that it will be trained on the full training set and evaluated on the test set.