In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

## About Data Set, Model and Cross-Validation Setup

To compete in this task, I focused on training distilled transformers, such as DistilBERT and DistilRoBERTa, for fast iteration, and used bert-base-uncased to validate my results. This notebook analyzes the predictions produced on the validation and test sets and then applies post-processing.

The input sequences available to the models are "question_title", "question_body", and "answer". The maximum length can be configured separately for the question and the answer; this model uses 384 for the question and 512 for the answer because of the length difference found during data analysis. The key parameters are stored in configs so they are easy to review.

The model architecture uses a transformer embedding with shared weights to ingest "question_title" + "question_body" as the question input and "question_title" + "answer" as the answer input. A customized classification head is added on top of that.
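
As a rough illustration of how the two inputs described above could be assembled, here is a minimal sketch. It is not the notebook's actual preprocessing: `build_inputs` and `MAX_LEN` are hypothetical names, and a whitespace split stands in for the real subword tokenizer.

```python
from typing import Dict, List, Tuple

# maximum lengths quoted in the text; the dict itself is a hypothetical config
MAX_LEN = {"question": 384, "answer": 512}


def build_inputs(row: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (question_tokens, answer_tokens), each truncated to its own limit."""
    # whitespace split is only a placeholder for a real subword tokenizer
    title = row["question_title"].split()
    question = (title + row["question_body"].split())[: MAX_LEN["question"]]
    answer = (title + row["answer"].split())[: MAX_LEN["answer"]]
    return question, answer


# made-up row with overly long body and answer to show the truncation
row = {
    "question_title": "How to sort a list",
    "question_body": "word " * 1000,
    "answer": "token " * 1000,
}
q_tokens, a_tokens = build_inputs(row)
```

Both sequences start with the title tokens, so the answer branch still sees what was asked even though it never reads the question body.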

The training strategy consists of several parts:

1) freeze the pretrained embedding/transformer weights and tune the classification head first,

2) unfreeze the transformer weights and use warm-up scheduling to gradually increase the learning rate,

3) use a customized early-stopping callback that stops training when performance on the validation set stops improving, and

4) try out some common augmentation tricks, such as truncating the corpus, dropping out words, or softening labels.
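
Steps 2 and 3 can be sketched in a framework-independent way. The code below is an illustration only, not the callback used for training: `warmup_lr`, `EarlyStopping`, the patience value and the per-epoch scores are all assumptions made for the example.

```python
from dataclasses import dataclass


def warmup_lr(step: int, warmup_steps: int, base_lr: float) -> float:
    """Linearly ramp the learning rate from 0 to base_lr, then hold it."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


@dataclass
class EarlyStopping:
    """Signal a stop once the score has not improved for `patience` epochs."""
    patience: int = 2
    min_delta: float = 0.0
    best: float = float("-inf")
    bad_epochs: int = 0

    def should_stop(self, score: float) -> bool:
        if score > self.best + self.min_delta:
            # new best score: remember it and reset the counter
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# hypothetical validation scores, one per epoch
stopper = EarlyStopping(patience=2)
scores = [0.30, 0.35, 0.36, 0.355, 0.35, 0.34]
stopped_at = next(i for i, s in enumerate(scores) if stopper.should_stop(s))
```

With these made-up scores the best value is reached at epoch index 2 and the stop is triggered two epochs later, once the counter of non-improving epochs reaches the patience.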


For cross-validation, because the data contains many duplicated questions, a good strategy is to split it with `GroupKFold`. To refine the split further, `category` is also used when generating the groups that divide the data into training and validation sets. The first fold of the 5-fold cross-validation is used to train the model and validate its performance.
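
A minimal sketch of such a grouped split with scikit-learn's `GroupKFold` follows. The toy data is made up, and the group key (category joined with the question title) is an assumption, since the exact key used for training is not shown here.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# toy frame standing in for train.csv; all values are made up
toy = pd.DataFrame({
    "question_title": ["q1", "q1", "q2", "q3", "q3", "q4", "q5", "q6", "q7", "q8"],
    "category": ["SCIENCE", "SCIENCE", "CULTURE", "CULTURE", "CULTURE",
                 "TECHNOLOGY", "TECHNOLOGY", "LIFE_ARTS", "STACKOVERFLOW", "SCIENCE"],
})

# assumed group key: rows sharing a category and a title stay in the same fold
groups = toy["category"] + "|" + toy["question_title"]

splitter = GroupKFold(n_splits=5)
folds = list(splitter.split(toy, groups=groups))

# use only the first fold, as described in the text
train_idx, valid_idx = folds[0]
overlap = set(groups.iloc[train_idx]) & set(groups.iloc[valid_idx])
```

Because duplicated questions share a group, `overlap` is empty: no question appears on both sides of the split, which keeps the validation score from being inflated by leakage.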

In [2]:
# import essential modules
import os
import sys

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import pandas as pd
import numpy as np

In [3]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

In [4]:
result_dir: str = "../input/distilroberta-base_q384_a512"
result_stats_filename: str = "model_stats.hdf5"

## Diving into the Predictions

To maximize model performance on the evaluation metric, the predictions from the model are thresholded. This choice follows from the label distribution observed in the training set and from an understanding of the task.

In [5]:
from scipy.stats import spearmanr

def spearmanr_ignore_nan(trues: np.ndarray, preds: np.ndarray) -> float:
    """Mean column-wise Spearman correlation, skipping columns where it is NaN."""
    return np.nanmean(
        [spearmanr(ta, pa).correlation for ta, pa in
         zip(np.transpose(trues), np.transpose(np.nan_to_num(preds)) + 1e-7)])

In [6]:
# open the result file produced by model training
file_path = os.path.join(result_dir, result_stats_filename)
with pd.HDFStore(file_path, mode='r') as store:
    print(f"open {file_path} and found {len(store.keys())}:\n{store.keys()}")
    for k in store.keys():
        var_name = k.split('/')[-1]
        df = store.get(k)
        # expose each stored frame as a notebook-level variable named after its key
        vars()[var_name] = df
        print(f'read {k}: {df.shape}')

open ../input/distilroberta-base_q384_a512/model_stats.hdf5 and found 7:
['/test_preds', '/valid_breakdown_metrics', '/valid_group_score', '/valid_overall_metrics', '/valid_preds', '/valid_test_stats_diff', '/valid_trues']
read /test_preds: (476, 30)
read /valid_breakdown_metrics: (30, 5)
read /valid_group_score: (5, 1)
read /valid_overall_metrics: (8, 5)
read /valid_preds: (1216, 30)
read /valid_test_stats_diff: (30, 7)
read /valid_trues: (1216, 30)


In [7]:
valid_test_stats_diff

label,test_mean,valid_mean,mean_diff,test_std,valid_std,ks_stats,p_value
question_opinion_seeking,0.416924,0.448605,-0.031682,0.416924,0.448605,0.091414,0.006076
question_type_entity,0.09159,0.118006,-0.026416,0.09159,0.118006,0.057593,0.197096
question_body_critical,0.610354,0.627512,-0.017158,0.610354,0.627512,0.08921,0.007998
question_type_definition,0.022786,0.039467,-0.016682,0.022786,0.039467,0.100336,0.001866
question_conversational,0.028085,0.043803,-0.015718,0.028085,0.043803,0.1031,0.001265
question_well_written,0.808299,0.822846,-0.014548,0.808299,0.822846,0.100218,0.001896
question_interestingness_self,0.498091,0.51226,-0.014169,0.498091,0.51226,0.076149,0.035521
question_type_compare,0.031084,0.044118,-0.013034,0.031084,0.044118,0.114047,0.000245
question_multi_intent,0.233776,0.244053,-0.010277,0.233776,0.244053,0.037,0.719376
question_type_choice,0.253075,0.260613,-0.007538,0.253075,0.260613,0.040669,0.605361


In [8]:
valid_group_score  # the current model performs poorly on STACKOVERFLOW but well on LIFE_ARTS and SCIENCE

category,score
CULTURE,0.355161
LIFE_ARTS,0.416903
SCIENCE,0.412144
STACKOVERFLOW,0.229831
TECHNOLOGY,0.362469


In [9]:
valid_breakdown_metrics  # rows are pre-sorted by Spearman correlation, lowest first

label,bias,mae,mape,pearson,spearman
question_type_spelling,-7.7e-05,0.000622,2.270338,0.037887,0.045481
question_not_really_a_question,0.000245,0.008318,1.896549,0.079877,0.056634
answer_plausible,-0.005538,0.063712,0.066673,0.099059,0.103282
question_type_consequence,0.003327,0.016466,1.601828,0.159365,0.148936
answer_relevance,0.002329,0.050746,0.052423,0.167636,0.160318
answer_well_written,-0.012466,0.079821,0.087967,0.186088,0.18312
answer_helpful,-0.005366,0.088043,0.095494,0.228579,0.226554
question_expect_short_answer,-0.002903,0.282184,0.409226,0.269054,0.262577
answer_type_procedure,-0.017276,0.161247,1.390612,0.262567,0.272878
answer_satisfaction,0.008916,0.097911,0.114595,0.295653,0.279513


In [10]:
valid_trues.head()

qa_id,question_asker_intent_understanding,question_body_critical,question_conversational,question_expect_short_answer,question_fact_seeking,question_has_commonly_accepted_answer,question_interestingness_others,question_interestingness_self,question_multi_intent,question_not_really_a_question,question_opinion_seeking,question_type_choice,question_type_compare,question_type_consequence,question_type_definition,question_type_entity,question_type_instructions,question_type_procedure,question_type_reason_explanation,question_type_spelling,question_well_written,answer_helpful,answer_level_of_information,answer_plausible,answer_relevance,answer_satisfaction,answer_type_instructions,answer_type_procedure,answer_type_reason_explanation,answer_well_written
6,1.0,0.666667,0.0,0.5,1.0,1.0,0.444444,0.333333,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,1.0,0.5,0.0,0.0,0.833333,0.888889,0.666667,0.888889,1.0,0.733333,0.666667,0.666667,0.0,0.777778
11,1.0,0.333333,0.0,1.0,1.0,1.0,0.666667,0.555556,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.666667,0.0,0.333333,0.0,0.888889,0.666667,0.333333,0.666667,0.666667,0.266667,0.0,0.0,0.0,0.888889
17,0.888889,1.0,0.0,0.0,1.0,0.0,0.666667,0.333333,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,1.0,1.0,0.666667,1.0,1.0,1.0,0.0,0.0,1.0,1.0
24,0.777778,0.555556,0.0,1.0,0.666667,1.0,0.555556,0.333333,0.0,0.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.666667,0.0,0.888889,0.666667,0.666667,0.666667,0.888889,0.9,0.333333,0.333333,0.666667,1.0
41,0.888889,0.666667,0.0,0.333333,1.0,0.666667,0.555556,0.444444,1.0,0.0,0.0,0.333333,0.0,0.0,0.333333,0.333333,0.0,0.0,0.666667,0.0,1.0,0.888889,0.555556,1.0,1.0,0.8,0.0,0.0,0.333333,1.0


In [11]:
valid_preds.head()

qa_id,question_asker_intent_understanding,question_body_critical,question_conversational,question_expect_short_answer,question_fact_seeking,question_has_commonly_accepted_answer,question_interestingness_others,question_interestingness_self,question_multi_intent,question_not_really_a_question,question_opinion_seeking,question_type_choice,question_type_compare,question_type_consequence,question_type_definition,question_type_entity,question_type_instructions,question_type_procedure,question_type_reason_explanation,question_type_spelling,question_well_written,answer_helpful,answer_level_of_information,answer_plausible,answer_relevance,answer_satisfaction,answer_type_instructions,answer_type_procedure,answer_type_reason_explanation,answer_well_written
6,0.913903,0.574332,0.008799,0.683393,0.753354,0.738192,0.556214,0.423216,0.06657,0.002666,0.551079,0.051694,0.004357,0.000341,0.000446,0.082212,0.881789,0.266576,0.042201,5e-06,0.77911,0.917146,0.620512,0.963767,0.968732,0.816445,0.822706,0.133666,0.055822,0.911258
11,0.860276,0.545545,0.009622,0.755446,0.882033,0.882884,0.557784,0.447238,0.315605,0.004348,0.280138,0.387065,0.020527,0.01621,0.025293,0.189753,0.262477,0.110225,0.459036,0.00204,0.721503,0.865184,0.656405,0.919287,0.937834,0.749343,0.201835,0.124105,0.445069,0.852494
17,0.930392,0.746779,0.006125,0.309692,0.962183,0.885127,0.673433,0.552903,0.618383,0.000332,0.085606,0.02888,0.271529,0.009798,0.075362,0.030754,0.114819,0.108604,0.775594,7.1e-05,0.850926,0.947082,0.72057,0.979831,0.983407,0.914241,0.077692,0.136435,0.886666,0.932496
24,0.904079,0.513168,0.005836,0.790587,0.851329,0.865632,0.586556,0.417086,0.330288,0.00052,0.504178,0.819299,0.040932,0.005569,0.00114,0.040498,0.439039,0.112704,0.158191,2.1e-05,0.799131,0.97865,0.744484,0.989262,0.989302,0.913242,0.346648,0.146935,0.721115,0.950358
41,0.894373,0.590808,0.052658,0.45497,0.898931,0.671971,0.612816,0.574072,0.730754,0.003479,0.317559,0.279242,0.319637,0.02526,0.123637,0.055559,0.189252,0.154059,0.415576,0.002474,0.831692,0.868908,0.623571,0.940311,0.931832,0.798913,0.102962,0.127883,0.636078,0.890738


### Lift from post-processing

In [12]:
sys.path.append("../nlp_utils")

from nlp_utils import OptimalRounder

In [13]:
# fit the optimal rounder, using the label distribution of the training set as reference

df = pd.read_csv('../input/google-quest-challenge/train.csv')[valid_preds.columns]
opt = OptimalRounder(ref=df)
valid_preds_opt = opt.fit_transform(valid_trues, valid_preds)

fitting: question_asker_intent_understanding


  c /= stddev[:, None]
  c /= stddev[None, :]
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


fitting: question_body_critical
fitting: question_conversational
fitting: question_expect_short_answer
fitting: question_fact_seeking
fitting: question_has_commonly_accepted_answer
fitting: question_interestingness_others
fitting: question_interestingness_self
fitting: question_multi_intent
fitting: question_not_really_a_question
fitting: question_opinion_seeking
fitting: question_type_choice
fitting: question_type_compare
fitting: question_type_consequence
fitting: question_type_definition
fitting: question_type_entity
fitting: question_type_instructions
fitting: question_type_procedure
fitting: question_type_reason_explanation
fitting: question_type_spelling
fitting: question_well_written
fitting: answer_helpful
fitting: answer_level_of_information
fitting: answer_plausible
fitting: answer_relevance
fitting: answer_satisfaction
fitting: answer_type_instructions
fitting: answer_type_procedure
fitting: answer_type_reason_explanation
fitting: answer_well_written


The `OptimalRounder` uses random search to find the best thresholds for discretizing the continuous outputs of the model. Discretizing the model outputs as a post-processing step is the key to lifting the model's score on the evaluation metric. It is also written in a fast and effective Pythonic way with NumPy.
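
The implementation of `OptimalRounder` lives in `nlp_utils` and is not reproduced in this notebook. The sketch below only illustrates the general idea of random-search thresholding for a single label. The function names, the number of iterations and the synthetic data are all assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def discretize(preds, thresholds, levels):
    """Map each prediction to the level of the bin it falls in."""
    return levels[np.searchsorted(np.sort(thresholds), preds)]


def random_search_thresholds(trues, preds, levels, n_iter=500, seed=0):
    """Sample random cut points and keep the set with the best Spearman score."""
    rng = np.random.default_rng(seed)
    # start from the score of the raw, undiscretized predictions
    best_thr, best_score = None, spearmanr(trues, preds).correlation
    for _ in range(n_iter):
        thr = rng.uniform(preds.min(), preds.max(), size=len(levels) - 1)
        rounded = discretize(preds, thr, levels)
        if np.unique(rounded).size < 2:
            continue  # constant output: Spearman is undefined
        score = spearmanr(trues, rounded).correlation
        if score > best_score:
            best_thr, best_score = np.sort(thr), score
    return best_thr, best_score


# synthetic labels on a 3-level grid and noisy continuous predictions
rng = np.random.default_rng(42)
levels = np.array([0.0, 0.5, 1.0])
trues = rng.choice(levels, size=300, p=[0.7, 0.2, 0.1])
preds = np.clip(trues + rng.normal(0, 0.3, size=300), 0, 1)

baseline = spearmanr(trues, preds).correlation
thresholds, tuned = random_search_thresholds(trues, preds, levels)
```

In this sketch `thresholds` stays `None` when no sampled cut points beat the raw predictions, so `tuned` is never lower than `baseline`. The score is measured on the same data the thresholds were tuned on, so the lift it reports may be optimistic.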

In [14]:
valid_scores_orig = valid_trues.apply(lambda x: x.corr(valid_preds[x.name], method='spearman'))
valid_scores_orig

question_asker_intent_understanding      0.383365
question_body_critical                   0.602926
question_conversational                  0.432150
question_expect_short_answer             0.262577
question_fact_seeking                    0.350735
question_has_commonly_accepted_answer    0.400842
question_interestingness_others          0.333692
question_interestingness_self            0.470368
question_multi_intent                    0.522495
question_not_really_a_question           0.056634
question_opinion_seeking                 0.448686
question_type_choice                     0.709525
question_type_compare                    0.376313
question_type_consequence                0.148936
question_type_definition                 0.380167
question_type_entity                     0.472563
question_type_instructions               0.760115
question_type_procedure                  0.334975
question_type_reason_explanation         0.592825
question_type_spelling                   0.045481


In [15]:
# score after post processing on validation set
valid_scores_opt = valid_trues.apply(lambda x: x.corr(valid_preds_opt[x.name], method='spearman'))
valid_scores_opt

question_asker_intent_understanding      0.377789
question_body_critical                   0.603422
question_conversational                  0.535739
question_expect_short_answer             0.282926
question_fact_seeking                    0.369103
question_has_commonly_accepted_answer    0.432656
question_interestingness_others          0.344361
question_interestingness_self            0.484238
question_multi_intent                    0.522911
question_not_really_a_question                NaN
question_opinion_seeking                 0.454448
question_type_choice                     0.721485
question_type_compare                    0.580100
question_type_consequence                     NaN
question_type_definition                 0.604581
question_type_entity                     0.612240
question_type_instructions               0.781821
question_type_procedure                  0.342245
question_type_reason_explanation         0.592867
question_type_spelling                        NaN


In [16]:
valid_preds_opt.apply(lambda x: x.nunique())
# count the unique values in every column after post-processing; a column with only one unique value makes its score NaN

question_asker_intent_understanding      5
question_body_critical                   8
question_conversational                  3
question_expect_short_answer             5
question_fact_seeking                    5
question_has_commonly_accepted_answer    5
question_interestingness_others          7
question_interestingness_self            6
question_multi_intent                    5
question_not_really_a_question           1
question_opinion_seeking                 5
question_type_choice                     5
question_type_compare                    5
question_type_consequence                1
question_type_definition                 4
question_type_entity                     5
question_type_instructions               5
question_type_procedure                  3
question_type_reason_explanation         5
question_type_spelling                   1
question_well_written                    7
answer_helpful                           3
answer_level_of_information              6
answer_plau
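
The NaN scores can be reproduced on a small scale: Spearman correlation is undefined when one input is constant, and `np.nanmean` then skips that column when averaging. The arrays below are made up for illustration.

```python
import warnings

import numpy as np
from scipy.stats import spearmanr

trues = np.array([0.0, 0.0, 0.333333, 0.0, 0.666667])
varied = np.array([0.01, 0.02, 0.40, 0.03, 0.55])  # keeps some ordering information
constant = np.zeros_like(trues)                    # every prediction rounded to one level

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # SciPy warns about the constant input
    rho_varied = spearmanr(trues, varied).correlation
    rho_constant = spearmanr(trues, constant).correlation

# the NaN from the constant column is dropped from the average
mean_ignoring_nan = np.nanmean([rho_varied, rho_constant])
```

This is why labels that collapse to a single value after rounding are better left with their raw predictions: the raw scores still carry a ranking, however weak.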

In [17]:
# inspect the change in score for every label, largest improvement first
valid_scores_opt_diff = (valid_scores_opt - valid_scores_orig).sort_values(ascending=False)
valid_scores_opt_diff

question_type_definition                 0.224414
question_type_compare                    0.203787
question_type_entity                     0.139677
question_conversational                  0.103589
question_has_commonly_accepted_answer    0.031814
question_type_instructions               0.021706
question_expect_short_answer             0.020349
answer_type_procedure                    0.020136
question_fact_seeking                    0.018368
question_interestingness_self            0.013870
answer_satisfaction                      0.012058
question_type_choice                     0.011960
question_interestingness_others          0.010669
answer_relevance                         0.008373
question_type_procedure                  0.007271
answer_type_instructions                 0.006723
question_opinion_seeking                 0.005762
answer_level_of_information              0.005688
answer_plausible                         0.002710
question_well_written                    0.001274


In [18]:
# keep only the columns whose score is not NaN and did not drop by more than 0.001 after post-processing
use_cols = valid_scores_opt_diff.loc[valid_scores_opt_diff > -.0010].dropna().index.tolist()
print(f"select {len(use_cols)} labels getting improve: {use_cols}")

select 24 labels getting improve: ['question_type_definition', 'question_type_compare', 'question_type_entity', 'question_conversational', 'question_has_commonly_accepted_answer', 'question_type_instructions', 'question_expect_short_answer', 'answer_type_procedure', 'question_fact_seeking', 'question_interestingness_self', 'answer_satisfaction', 'question_type_choice', 'question_interestingness_others', 'answer_relevance', 'question_type_procedure', 'answer_type_instructions', 'question_opinion_seeking', 'answer_level_of_information', 'answer_plausible', 'question_well_written', 'answer_helpful', 'question_body_critical', 'question_multi_intent', 'question_type_reason_explanation']


In [19]:
# calculate the lift from post-processing, applied to the selected columns only
valid_preds_opt_final = valid_preds.copy()
valid_preds_opt_final[use_cols] = opt.transform(valid_preds[use_cols])

score_orig = spearmanr_ignore_nan(valid_trues.values, valid_preds.values)
score_opt = spearmanr_ignore_nan(valid_trues.values, valid_preds_opt_final.values)

print(f"orig score={score_orig:.3f}, optimized score={score_opt:.3f}, improve={score_opt-score_orig:.3f}")

orig score=0.386, optimized score=0.415, improve=0.029


In [20]:
# apply the same post-processing to the test predictions
test_preds[use_cols] = opt.transform(test_preds[use_cols])
test_preds.head().T

qa_id,39,46,70,132,200
question_asker_intent_understanding,0.947959,0.884822,0.932607,0.875872,0.937934
question_body_critical,0.833333,0.555556,0.888889,0.5,0.666667
question_conversational,0.333333,0.0,0.0,0.0,0.0
question_expect_short_answer,0.333333,0.666667,0.666667,0.5,0.666667
question_fact_seeking,0.333333,0.666667,0.666667,0.666667,0.666667
question_has_commonly_accepted_answer,0.333333,1.0,1.0,1.0,1.0
question_interestingness_others,0.888889,0.5,0.666667,0.5,0.555556
question_interestingness_self,0.666667,0.5,0.5,0.5,0.555556
question_multi_intent,0.666667,0.0,0.0,0.0,0.666667
question_not_really_a_question,0.001387,0.002109,0.001294,0.009392,0.001906
