Data format
===

The data is already pre-processed (`<s>` and `</s>` tags, `<unk>` tag, etc.). Punctuation is tokenized: each symbol is its own token, and no token merges a word with punctuation. There is exactly one space between every two adjacent tokens.

UPD: The data contains many occurrences of the combination "@@ ". This is because the data has been preprocessed with BPE, byte pair encoding (see: https://arxiv.org/abs/1508.07909 and https://github.com/rsennrich/subword-nmt), in order to reduce the vocabulary size.

To print a message properly, do the following in Python:

`[your string message here].replace('@@ ', '')`

NOTE: replace `@@` followed by a space with nothing. Don't forget the space!

Broadly speaking, BPE splits words into frequent subword units (character n-grams) to reduce the vocabulary size. The '@@ ' sequences you see mark the places where a word was split. Thus, to print the actual word, replace all occurrences of '@@ ' (including the trailing space) with nothing.
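
As a minimal sketch of the replacement described above: the helper name `undo_bpe` and the example sentence are made up for illustration and are not part of the data set or of subword-nmt.

```python
BPE_MARKER = "@@ "  # "@@" plus the trailing space


def undo_bpe(text):
    """Join BPE pieces back into whole words by deleting every '@@ ' marker."""
    return text.replace(BPE_MARKER, "")


# Hypothetical example sentence, not taken from the data set.
encoded = "<s> this is an un@@ believ@@ able ex@@ ample . </s>"

print(undo_bpe(encoded))
# <s> this is an unbelievable example . </s>

# Dropping only "@@" (without the space) leaves the pieces separated:
print(encoded.replace("@@", ""))
# <s> this is an un believ able ex ample . </s>
```

The second call shows why the trailing space matters: without it, the pieces of a split word stay apart.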


Data fields are separated by one tab character.

    context - context phrase(s) for the response, always human-generated
    response - one phrase or a few phrases, possibly from different speakers
    human-generated - flag indicating whether the response was generated by a human

Data header: 'id\tcontext\tresponse\thuman-generated\n'
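
A small sketch of loading data in this layout with pandas. The two sample rows are invented for illustration; passing `quoting=csv.QUOTE_NONE` is an assumption on my part (it makes pandas treat quote characters in the text as ordinary characters), not something the data description requires.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample in the documented layout (not real data).
sample = (
    "id\tcontext\tresponse\thuman-generated\n"
    "1\t<s> how are you ? </s>\t<s> fine , th@@ anks . </s>\t1\n"
    "2\t<s> hello </s>\t<s> <unk> \" hello \" </s>\t0\n"
)

# Fields are separated by a single tab; QUOTE_NONE keeps '"' as plain text.
frame = pd.read_csv(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)

print(frame.columns.tolist())
print(frame.loc[1, "response"])
```

For the real files, `io.StringIO(sample)` would be replaced by the file path.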

Submission format
===

The evaluation metric is the ROC AUC score.

The file should contain a header and have the following format:

id,human-generated

1,1

8,0

9,1

10,1

We expect the solution file to have 524,342 predictions.
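
Before uploading, it may be worth checking a file against the format above. The sketch below is one possible check; `check_submission` is a hypothetical helper, and the specific checks (header, row count, duplicate ids, missing values) are my assumptions about what a valid file needs.

```python
import io

import pandas as pd


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission (empty if none)."""
    sub = pd.read_csv(io.StringIO(csv_text))
    problems = []
    if list(sub.columns) != ["id", "human-generated"]:
        problems.append("header must be 'id,human-generated'")
        return problems  # the remaining checks need both columns
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(sub)}")
    if sub["id"].duplicated().any():
        problems.append("duplicate ids")
    if sub["human-generated"].isna().any():
        problems.append("missing predictions")
    return problems


# The four example rows from the format description above.
example = "id,human-generated\n1,1\n8,0\n9,1\n10,1\n"

print(check_submission(example, expected_rows=4))
print(check_submission(example, expected_rows=524342))
```

The first call reports no problems; the second reports the row-count mismatch, since the example holds 4 rows rather than 524,342.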

In [97]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

In [98]:
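def extract_features(frame):
    """Return one [character count, token count] feature list per row of `frame`."""
    frame_features = []

    for index, row in frame.iterrows():
        context = row['context']  # read but not used by any feature yet
        response = row['response']
        features = []

        # Feature 1: length of the response in characters
        features.append(float(len(response)))

        # Feature 2: number of space-separated tokens in the response
        tokens = response.split(' ')
        features.append(float(len(tokens)))

        frame_features.append(features)

    return frame_features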
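
Row-by-row iteration with `iterrows` can be slow on large frames. As a sketch of an alternative, the same two features can be computed with pandas string methods; `extract_features_vectorized` is a hypothetical name, and it returns a NumPy array rather than a list of lists.

```python
import numpy as np
import pandas as pd


def extract_features_vectorized(frame):
    """Compute [character count, token count] for every response at once."""
    response = frame["response"]
    char_count = response.str.len()                  # characters per response
    token_count = response.str.split(" ").str.len()  # space-separated tokens
    return np.column_stack([char_count, token_count]).astype(float)


# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    "context": ["hi", "x"],
    "response": ["hello world", "a b c"],
})

print(extract_features_vectorized(demo))
```

For the two demo rows this gives 11 characters and 2 tokens, then 5 characters and 3 tokens, matching what the loop-based function computes.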

## Training

In [47]:
chunksize = 20000
all_features = []
all_labels = []
for i, frame in enumerate(pd.read_csv('data/train.txt', chunksize = chunksize, delimiter = '\t')):
    frame = frame.replace({'@@ ': ''}, regex = True)
    short_frame = frame[['context', 'response']]
    
    features = extract_features(short_frame)
    labels = frame['human-generated'].values.tolist()
    
    all_features.extend(features) # actually more efficient than numpy append
    all_labels.extend(labels)
    
    if i > 2: break # stop after four chunks (i = 0..3), i.e. at most 80,000 rows, to keep the run short
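
Note that `if i > 2: break` runs at the end of the loop body, so four chunks (i = 0, 1, 2, 3) are processed. One way to make that limit explicit is `itertools.islice`; the sketch below uses an invented in-memory table and a hypothetical `max_chunks` name in place of the real training file.

```python
import io
from itertools import islice

import pandas as pd

# Hypothetical 10-row stand-in for the training file.
sample = "id\tvalue\n" + "".join(f"{n}\t{n * n}\n" for n in range(10))

max_chunks = 4  # same number of chunks as `if i > 2: break`
reader = pd.read_csv(io.StringIO(sample), chunksize=2, delimiter="\t")

rows_read = 0
for frame in islice(reader, max_chunks):
    rows_read += len(frame)

print(rows_read)  # 8: four chunks of two rows each
```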

In [59]:
np_features = np.array(all_features)
np_labels = np.array(all_labels)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

In [None]:
model = XGBClassifier()
model.fit(np_features, np_labels)

## Inference

In [67]:
chunksize = 20000
all_features = []
all_ids = []
for i, frame in enumerate(pd.read_csv('data/test.txt', chunksize = chunksize, delimiter = '\t')):
    frame = frame.replace({'@@ ': ''}, regex = True)
    short_frame = frame[['context', 'response']]
    
    features = extract_features(short_frame)
    ids = frame['id'].values.tolist()
    
    all_features.extend(features) # actually more efficient than numpy append
    all_ids.extend(ids)
    
    if i > 2: break # debugging shortcut: stops after four chunks; remove it to predict every test row

In [68]:
np_features = np.array(all_features)
np_ids = np.array(all_ids)

In [69]:
predicted_labels = model.predict(np_features)
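
Because ROC AUC measures how well the scores rank human responses above generated ones, hard 0/1 labels from `predict` can discard useful ordering. The sketch below illustrates this with a small pairwise AUC function; `roc_auc` and the toy numbers are made up for illustration. Whether the competition accepts real-valued scores such as `model.predict_proba(np_features)[:, 1]` in place of labels is an assumption to verify first.

```python
import numpy as np


def roc_auc(labels, scores):
    """Pairwise ROC AUC: the fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count as one half.
    Memory grows with n_pos * n_neg, so this is for small illustrations only."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


# Toy example: four responses with made-up model probabilities.
true_labels = [0, 0, 1, 1]
probabilities = [0.1, 0.6, 0.4, 0.9]
hard_labels = [int(p >= 0.5) for p in probabilities]  # [0, 1, 0, 1]

print(roc_auc(true_labels, probabilities))  # 0.75
print(roc_auc(true_labels, hard_labels))    # 0.5
```

In this toy case, thresholding at 0.5 lowers the score from 0.75 to 0.5 because ties between a positive and a negative example only earn half credit.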

In [95]:
output = pd.DataFrame({'id': np_ids, 'human-generated': predicted_labels}, columns = ['id', 'human-generated'])

In [96]:
output.to_csv('output.csv', index = False)