# Testing BERT
In this notebook, I test my fine-tuned BERT-Large model on the HANS dataset. The model was fine-tuned on MNLI using a GCP TPU.

In [214]:
import csv

## HANS Dataset
We evaluate on the hans.tsv file, a converted version of heuristics_evaluation_set.txt, the evaluation set used in the research paper. The conversion is shown in the ___ Jupyter notebook. Because the input file differs from the one run_classifier.py expects, I made a modified copy of that script, run_classifier_hans.py, and BERT is tested with it.

In [189]:
!python bert/run_classifier_hans.py \
  --task_name=MNLI \
  --do_predict=true \
  --data_dir='' \
  --vocab_file='uncased_L-12_H-768_A-12/vocab.txt' \
  --bert_config_file='uncased_L-12_H-768_A-12/bert_config.json' \
  --init_checkpoint=gs://bert-results/MNLI-output/ \
  --max_seq_length=128 \
  --output_dir=gs://bert-results/MNLI-output/

2020-02-09 21:59:37.201959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0



W0209 21:59:38.749001 140145063781824 module_wrapper.py:139] From bert/run_classifier_hans.py:784: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0209 21:59:38.749380 140145063781824 module_wrapper.py:139] From bert/run_classifier_hans.py:784: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0209 21:59:38.750283 140145063781824 module_wrapper.py:139] From /home/jupyter/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0209 21:59:38.751029 140145063781824 module_wrapper.py:139] From bert/run_classifier_hans.py:808: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, plea

## Interpreting the results
HANS provides a script to assess a model's performance. We must first convert our output file to the format that script expects.
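
As a rough sketch of that conversion: the predictions file handed to the script is a CSV with a `pairID,gold_label` header, where each label is either `entailment` or `non-entailment`. The helper names below (`collapse_label`, `format_predictions`) and the pair IDs are hypothetical, and the argument order assumes the probabilities are written in the order returned by `get_labels()` for MNLI, i.e. contradiction, entailment, neutral.

```python
def collapse_label(contra, entail, neutral):
    """Map three-way MNLI probabilities to the two-way HANS label set."""
    # Assumption: entailment must be the strict maximum; anything else
    # (contradiction, neutral, or a tie) counts as non-entailment.
    if entail > contra and entail > neutral:
        return "entailment"
    return "non-entailment"


def format_predictions(pair_ids, prob_rows):
    """Return the lines of a predictions file: a header, then pairID,label."""
    lines = ["pairID,gold_label"]
    for pair_id, (contra, entail, neutral) in zip(pair_ids, prob_rows):
        label = collapse_label(contra, entail, neutral)
        lines.append("%s,%s" % (pair_id, label))
    return lines


# Made-up probabilities for two illustrative pair IDs.
example = format_predictions(["ex0", "ex1"], [(0.1, 0.7, 0.2), (0.5, 0.2, 0.3)])
print("\n".join(example))
```

Treating ties as non-entailment is a design choice made here so that every input row produces exactly one output line.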

In [5]:
import pandas as pd

In [29]:
hans_test_set = pd.read_csv("hans.tsv", sep="\t")

In [31]:
pair_ids = hans_test_set['pairID']

In [190]:
prediction = []
with open('output/test_results.tsv', "r") as f:
    for x in f:
        values = x.strip('\n').split("\t")
        # Columns follow get_labels(): ["contradiction", "entailment", "neutral"]
        contra = float(values[0])
        entail = float(values[1])
        neutral = float(values[2])

        # HANS only uses entailment and non-entailment, so contradiction and
        # neutral both map to non-entailment. Ties also map to non-entailment,
        # so every input line yields exactly one prediction.
        if entail > contra and entail > neutral:
            prediction.append('entailment')
        else:
            prediction.append('non-entailment')

In [191]:
assert len(prediction) == len(pair_ids)

In [192]:
# Write one "pairID,label" line per prediction, after the header row
with open('predictions_hans.txt', 'w') as f:
    f.write('pairID,gold_label\n')
    for pair_id, label in zip(pair_ids, prediction):
        f.write('%s,%s\n' % (pair_id, label))

In [193]:
cor = 0  # correct predictions
fsl = 0  # incorrect predictions
with open('heuristics_evaluation_set.txt', "r") as f:
    next(f)  # skip the header row
    for i, x in enumerate(f):
        values = x.split("\t")
        # The first column holds the gold label
        if values[0] == prediction[i]:
            cor += 1
        else:
            fsl += 1

In [194]:
print('HANS dataset performance:', cor/(cor+fsl))

HANS dataset performance: 0.4715


In [195]:
pair_ids = hans_test_set['pairID']

### Now we'll use their script to assess the model

In [196]:
!python evaluate_heur_output.py 'predictions_hans.txt'

Heuristic entailed results:
lexical_overlap: 0.1308
subsequence: 0.0162
constituent: 0.0042

Heuristic non-entailed results:
lexical_overlap: 0.74
subsequence: 0.9526
constituent: 0.9852

Subcase results:
ln_subject/object_swap: 0.241
ln_preposition: 0.972
ln_relative_clause: 0.982
ln_passive: 0.598
ln_conjunction: 0.907
le_relative_clause: 0.005
le_around_prepositional_phrase: 0.026
le_around_relative_clause: 0.047
le_conjunction: 0.011
le_passive: 0.565
sn_NP/S: 1.0
sn_PP_on_subject: 0.865
sn_relative_clause_on_subject: 0.937
sn_past_participle: 0.963
sn_NP/Z: 0.998
se_conjunction: 0.074
se_adjective: 0.002
se_understood_object: 0.001
se_relative_clause_on_obj: 0.003
se_PP_on_obj: 0.001
cn_embedded_under_if: 0.998
cn_after_if_clause: 0.999
cn_embedded_under_verb: 0.966
cn_disjunction: 0.989
cn_adverb: 0.974
ce_embedded_under_since: 0.001
ce_after_since_clause: 0.001
ce_embedded_under_verb: 0.0
ce_conjunction: 0.009
ce_adverb: 0.01

Template results:
temp1: 0.241
temp5: 1.0
temp7: 1.0
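
One way to read these numbers is to check them against the overall accuracy. The sketch below copies the six per-heuristic accuracies from the output above and assumes that each of the six heuristic/label groups contains the same number of examples, so that the overall accuracy is the plain mean of the group accuracies. The variable names are illustrative only.

```python
# Per-heuristic accuracies copied from the evaluation script output.
entailed = {"lexical_overlap": 0.1308, "subsequence": 0.0162, "constituent": 0.0042}
non_entailed = {"lexical_overlap": 0.74, "subsequence": 0.9526, "constituent": 0.9852}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


entailed_acc = mean(entailed.values())
non_entailed_acc = mean(non_entailed.values())

# Assumption: equally sized groups, so the overall accuracy is the mean of the two.
overall_acc = (entailed_acc + non_entailed_acc) / 2

# Share of all examples labelled non-entailment: the errors on entailed
# examples plus the correct answers on non-entailed examples.
predicted_non_entailment = ((1 - entailed_acc) + non_entailed_acc) / 2

print(round(overall_acc, 4), round(predicted_non_entailment, 4))
```

Under that assumption the mean reproduces the 0.4715 accuracy computed earlier, and the split between the two groups implies that the model labels roughly 92% of the examples as non-entailment.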

### Concatenate HANS train and MNLI train to train BERT

In [222]:
lines = []
with open('heuristics_train_set.txt', "r") as f:
    next(f)  # skip the header row
    for i, x in enumerate(f, start=1):
        values = x.strip('\n').split("\t")
        # Reorder the HANS columns into the MNLI train.tsv column order
        new_values = [
            str(i + 391163),   # index
            values[7],         # promptID (reuses the HANS pairID)
            values[7] + 'h',   # pairID
            values[8],         # genre is the HANS heuristic
            values[1],         # sentence columns, in their original order
            values[2],
            values[3],
            values[4],
            values[5],
            values[6],
            values[0],         # gold label, written to both label columns
            values[0],
        ]
        lines.append(new_values)
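
The column reordering above can also be isolated in a small function, which makes it easier to check on a single row. This is only a sketch: `hans_to_mnli_row` is a hypothetical name, the sample row is made up, and the layout assumes the HANS row has the gold label in column 0, the six sentence columns in columns 1 to 6, the pair ID in column 7, and the heuristic in column 8, as in the cell above.

```python
def hans_to_mnli_row(values, index):
    """Reorder one HANS row (a list of fields) into the 12-column MNLI layout."""
    return [
        str(index),        # index
        values[7],         # promptID (HANS pairID reused)
        values[7] + "h",   # pairID
        values[8],         # genre (HANS heuristic)
        *values[1:7],      # six sentence columns, order unchanged
        values[0],         # gold label, written to both label columns
        values[0],
    ]


# A made-up HANS row with placeholder sentence fields.
hans_row = ["non-entailment", "bp1", "bp2", "p1", "p2", "s1", "s2",
            "ex0", "lexical_overlap", "ln_subject/object_swap", "temp1"]
mnli_row = hans_to_mnli_row(hans_row, 391164)
print(len(mnli_row), mnli_row)
```

Note that the label columns carry the two-way HANS labels, so the training script has to accept `non-entailment` in addition to the three MNLI labels; whether it does depends on the modified script.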
    

### Writing to MNLI training file

In [223]:
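# newline='' lets the csv module control line endings; lineterminator='\n'
# replaces the csv default of '\r\n' for the appended rows.
with open('glue_data/MNLI/train.tsv', 'a', newline='') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
    tsv_writer.writerows(lines)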
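
After appending, it may be worth confirming that every row in the combined file has the same number of tab-separated fields. The following is a minimal sketch of such a check, not part of the original workflow: `rows_with_wrong_width` is a hypothetical helper, the expected width of 12 matches the twelve values written per row above, and the demo runs on a temporary file rather than the real `train.tsv`.

```python
import os
import tempfile


def rows_with_wrong_width(path, expected=12):
    """Return 1-based line numbers whose tab-separated field count differs from expected."""
    bad = []
    with open(path, "r") as tsv_file:
        for line_number, line in enumerate(tsv_file, start=1):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != expected:
                bad.append(line_number)
    return bad


# Demo on a throwaway file: two well-formed rows and one row that is too short.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train_demo.tsv")
    with open(demo_path, "w") as demo_file:
        demo_file.write("\t".join(["x"] * 12) + "\n")
        demo_file.write("\t".join(["y"] * 12) + "\n")
        demo_file.write("\t".join(["z"] * 11) + "\n")
    demo_bad_rows = rows_with_wrong_width(demo_path)

print(demo_bad_rows)
```

An empty list for the real file would suggest the appended rows line up with the existing columns.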