# Submission for MLTrack 2018

This notebook builds the submission file for the MLTrack 2018 competition.

In [3]:
!pip3 install deeppavlov

Import key components. 

In [5]:
import os.path
from pathlib import Path

In [4]:
import numpy as np

In [6]:
import deeppavlov
from deeppavlov import build_model
from deeppavlov.core.commands.utils import parse_config
from deeppavlov.core.commands.train import read_data_by_config, train_evaluate_model_from_config, get_iterator_from_config
from deeppavlov.download import deep_download

Load a config file for model training.

In [7]:
PATH_TO_CONFIG = '../deeppavlov/configs/ranking/mltrack_ranker.json'
config = parse_config(PATH_TO_CONFIG)

Download and decompress the required files, including the MLTrack dataset, the pretrained BERT weights, and the fine-tuned weights.

In [9]:
deep_download(config)

If you want to train the model yourself, run the following cell. Otherwise, skip this step.

In [22]:
train_evaluate_model_from_config(config)

If you skipped training in the previous step and just want to use the fine-tuned weights, change `load_path` in the config.

In [5]:
config['chainer']['pipe'][1]['load_path'] = '{DOWNLOADS_PATH}/fine_tuned_model/mltrack_model_fine_tuned/model'

To get class probabilities instead of labels for every answer in the test dataset, set `return_probas` in the config. Probabilities allow a more accurate ranking than hard labels.

In [7]:
config['chainer']['pipe'][1]['return_probas'] = True
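
The two cells above address the component by its position (`pipe[1]`), which breaks silently if the pipeline order changes. A minimal sketch of a more defensive lookup follows. `find_component` is a hypothetical helper, not part of DeepPavlov, and the toy config (including the component names) only mimics the shape of the real one, assuming the target component already defines `load_path`.

```python
def find_component(config, key):
    """Return the first pipe component that defines `key` (hypothetical helper)."""
    for component in config['chainer']['pipe']:
        if key in component:
            return component
    raise KeyError(f'no pipe component defines {key!r}')


# Toy config that only mimics the structure used in this notebook.
toy_config = {
    'chainer': {
        'pipe': [
            {'class_name': 'preprocessor'},
            {'class_name': 'ranker', 'load_path': 'old/path', 'return_probas': False},
        ]
    }
}

# The returned dict is the same object as the one inside the config,
# so editing it updates the config in place.
ranker = find_component(toy_config, 'load_path')
ranker['return_probas'] = True
```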

Build a model based on the prepared config.

In [12]:
model = build_model(config, download=False)

Read and parse the data files with the dataset reader.

In [9]:
data = read_data_by_config(config)

Generate test batches with the dataset iterator.

In [10]:
iterator = get_iterator_from_config(config, data)
batches = list(iterator.gen_batches(batch_size=36, data_type='test', shuffle=False))

Everything is now ready for prediction. Run the following cell to get the predictions.

In [11]:
predictions = []

for batch in batches:
    predictions.extend(model(batch[0]))

The list of predictions contains a probability distribution over classes for each answer. Next, prepare the context ids to write to the submission file. Splitting the ids by context gives the number of answers for every context (it is not always equal to 6), which is then used to remove padded answers.

In [12]:
DATA_FOR_SUBMIT_PATH = Path(config["dataset_reader"]["data_path"]).expanduser() / 'final.tsv'

with open(DATA_FOR_SUBMIT_PATH, "r") as f:
    # Each row is tab-separated; the first column holds the context id.
    data_for_ids = [line.strip('\n').split('\t') for line in f]
context_ids = [el[0] for el in data_for_ids]

In [13]:
def split_context_ids(context_id):
    """Group consecutive equal ids into sublists, one sublist per context."""
    splitted_context_id, cur_ids = [], []
    cur_id = context_id[0]
    for el in context_id:
        if el == cur_id:
            cur_ids.append(el)
        else:
            splitted_context_id.append(cur_ids)
            cur_id = el
            cur_ids = [cur_id]
    splitted_context_id.append(cur_ids)
    return splitted_context_id

In [14]:
splitted_context_ids = split_context_ids(context_ids)
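
For reference, the same grouping of consecutive equal ids can be sketched with `itertools.groupby` from the standard library. This is an alternative, not the code used above, and the sample ids are made up. Unlike `split_context_ids`, it returns an empty list for empty input instead of raising `IndexError`.

```python
from itertools import groupby


def split_context_ids_groupby(context_ids):
    """Group runs of consecutive equal ids, one group per context."""
    return [list(group) for _, group in groupby(context_ids)]


# Made-up ids: context 'a' has 3 answers, 'b' has 2, 'c' has 1.
sample_ids = ['a', 'a', 'a', 'b', 'b', 'c']
groups = split_context_ids_groupby(sample_ids)
print(groups)  # [['a', 'a', 'a'], ['b', 'b'], ['c']]
```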

Keep probabilities only for the real (non-padded) answers.

In [15]:
pred_for_real_data = [pred[:len(ids)] for pred, ids in zip(predictions, splitted_context_ids)]
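
To illustrate the trimming with made-up numbers: assuming each element of `predictions` holds one row of class probabilities per answer slot, padded up to a fixed number of slots, slicing by the number of real ids drops the padded rows.

```python
# Made-up predictions: 2 contexts, each padded to 3 answer slots,
# with 3 class probabilities per slot.
toy_predictions = [
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]],
    [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.3, 0.3, 0.4]],
]
# The first context has 3 real answers, the second only 1.
toy_ids = [['a', 'a', 'a'], ['b']]

# Keep as many rows per context as there are real ids.
trimmed = [pred[:len(ids)] for pred, ids in zip(toy_predictions, toy_ids)]
print([len(rows) for rows in trimmed])  # [3, 1]
```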

The final score of an answer is positive if the 'good' class has the highest probability, negative if the 'bad' class does, and zero if the 'neutral' class does. For 'good' and 'bad' answers, the magnitude of the score is the probability of the predicted class.

In [16]:
signs = [np.argmax(el, -1) - 1 for el in pred_for_real_data]
scores = [np.max(el, -1) for el in pred_for_real_data]
pred_for_real_data = [sign*score for sign, score in zip(signs, scores)]
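
A worked example of the signed score for a single answer, written in plain Python rather than NumPy so it runs without dependencies. The class order (index 0 = 'bad', 1 = 'neutral', 2 = 'good') is an assumption inferred from the `argmax - 1` step above, and the probabilities are made up.

```python
def signed_score(probas):
    """Return the max probability signed by the predicted class.

    Assumes probas = [p_bad, p_neutral, p_good].
    """
    best = max(range(len(probas)), key=lambda i: probas[i])  # argmax
    sign = best - 1  # -1 for 'bad', 0 for 'neutral', +1 for 'good'
    return sign * probas[best]


print(signed_score([0.1, 0.2, 0.7]))  # 0.7
print(signed_score([0.6, 0.3, 0.1]))  # -0.6
print(signed_score([0.2, 0.5, 0.3]))  # 0.0
```

Note that every 'neutral' answer gets a score of exactly 0, so neutral answers are not ordered among themselves by confidence.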

Use the obtained scores to rank the answers within each context.

In [17]:
ranking = [np.flip(np.argsort(el), -1) for el in pred_for_real_data] 
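
As a small illustration with made-up scores, the ranking lists answer indices from the highest score to the lowest. The sketch uses `sorted` in plain Python; it matches `np.flip(np.argsort(...))` when all scores are distinct, while ties may be ordered differently.

```python
toy_scores = [0.7, -0.6, 0.0, 0.9]

# Indices of the answers, best score first.
toy_ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(toy_ranking)  # [3, 0, 2, 1]
```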

The final step is to save the submission file.

In [18]:
submit = [i+' '+str(el) for ids, rank in zip(splitted_context_ids, ranking) for i, el in zip(ids, rank)]

In [19]:
with open(os.path.join(os.path.expanduser(config["metadata"]["variables"]["ROOT_PATH"]), "submit.txt"), "w") as f:
    f.write('\n'.join(submit))
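
The two cells above can be checked on toy data. The sketch below builds the same `<context id> <answer index>` lines and writes them to a temporary directory instead of `ROOT_PATH`; the ids and rankings are made up.

```python
import os
import tempfile

toy_ids = [['ctx1', 'ctx1', 'ctx1'], ['ctx2', 'ctx2']]
toy_ranking = [[2, 0, 1], [1, 0]]

# One line per answer: the context id followed by the ranked answer index.
lines = [f'{cid} {idx}'
         for ids, rank in zip(toy_ids, toy_ranking)
         for cid, idx in zip(ids, rank)]

# Write to a throwaway directory and read the file back to inspect it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'submit.txt')
    with open(path, 'w') as f:
        f.write('\n'.join(lines))
    with open(path) as f:
        written = f.read()

print(written.splitlines())
```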

You can find the submission file in the `~/.deeppavlov` directory.