#### FEVER dataset processing

<h5>Process the claims in the fever dataset</h5>

In this notebook, we will prepare the training dataset and build a baseline model that sets us up for the NLI tasks.

We use the following repos for reference code:

- [fever-baselines](https://github.com/klimzaporojets/fever-baselines.git)
- [fever-allennlp-reader](https://github.com/j6mes/fever-allennlp-reader)
- [fever-allennlp](https://github.com/j6mes/fever-allennlp)

Note, AllenNLP here is used only for the NLI training, using models such as Decomposable Attention, Elmo + ESIM, ESIM etc. We will not use any of it here.
In this notebook, we will first focus on extracting the data from the pre-processed Wiki corpus provided by [fever.ai](https://fever.ai/dataset/fever.html).

The data is available in a [docker image](https://hub.docker.com/r/feverai/common), 21GB in size. The container is created and the volume /local/ from it is mounted and made available to our [container](https://github.com/dmayukh/fakenews/Dockerfile) 


We will install a few dependencies such as:
- numpy>=1.15
- regex
- allennlp==2.5.0
- fever-scorer==2.0.39
- fever-drqa==1.0.13

The following packages are installed by the above dependencies
- torchvision-0.9.1
- google_cloud_storage-1.38.0
- overrides==3.1.0
- transformers-4.6.1
- spacy-3.0.6
- sentencepiece-0.1.96
- torch-1.8.1
- wandb-0.10.33
- lmdb-1.2.1
- jsonnet-0.17.0

We do not really need allennlp or fever-scorer yet; we only need DrQA. We would prefer to use DrQA from the official GitHub repository, but for now we will go with the version prepackaged by [j6mes](https://pypi.org/project/fever-drqa/).


In [1]:
import argparse
import json
from multiprocessing.pool import ThreadPool

<h4>Pre-parsed FEVER Datasets</h4>
Create the database from the DB file that contains the preprocessed Wiki pages. This DB was made available to us by FEVER.

FeverDocDB is a simple wrapper that opens a SQLite3 connection to the database and provides methods to execute simple select queries to fetch ids for documents and to fetch lines given a document.
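To make the description above concrete, here is a minimal sketch of such a wrapper. It is not the FeverDocDB source: the class name `DocDBSketch`, the table name `documents`, and the columns `id` and `lines` are assumptions made for illustration, and the example runs against a tiny in-memory database rather than the real `fever.db`.

```python
import sqlite3


class DocDBSketch:
    """Hypothetical stand-in for a document DB wrapper (names are assumptions)."""

    def __init__(self, db_path):
        # Open a SQLite3 connection to the given database file.
        self.connection = sqlite3.connect(db_path)

    def get_doc_ids(self):
        # Fetch the ids of all documents.
        cursor = self.connection.execute("SELECT id FROM documents")
        return [row[0] for row in cursor.fetchall()]

    def get_doc_lines(self, doc_id):
        # Fetch the stored lines for one document, or None if it is missing.
        cursor = self.connection.execute(
            "SELECT lines FROM documents WHERE id = ?", (doc_id,))
        row = cursor.fetchone()
        return None if row is None else row[0]

    def close(self):
        self.connection.close()


# Build a tiny in-memory database to exercise the wrapper.
db = DocDBSketch(":memory:")
db.connection.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, lines TEXT)")
db.connection.execute(
    "INSERT INTO documents VALUES (?, ?)",
    ("Example_Page", "0\tFirst sentence .\n1\tSecond sentence ."))
db.connection.commit()

print(db.get_doc_ids())
print(db.get_doc_lines("Example_Page").split("\n")[1])
```

The parameterized query (`?` placeholder) avoids building SQL by string concatenation, which matters because page ids contain characters such as quotes and parentheses.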

We will not require this in the first pass of our work here, since we are only interested in finding the documents closest to a claim text.

The function to fetch lines per document is what uses the connection to the database. In order to find the closest documents for a given claim, we use the ranker that uses a <b>pre-created TFIDF index</b> which can locate the document ids given a claim text.

The pre-created index is available in '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'
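The index file name encodes its settings: n-grams up to length 2, 16777216 (that is, 2**24) hash buckets, and a "simple" tokenizer. The sketch below illustrates the general idea of hashing unigrams and bigrams into a fixed number of buckets. The tokenizer rule and the use of MD5 are stand-ins chosen so the example runs with the standard library only; they are assumptions and will not reproduce the bucket ids stored in the real index.

```python
import hashlib
import re

NUM_BUCKETS = 16777216  # 2**24, taken from the index file name


def simple_tokenize(text):
    # Assumption: lowercase alphanumeric runs; the real tokenizer may differ.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n=2):
    # All contiguous n-grams of length 1..n, joined with a space.
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams


def bucket(gram, num_buckets=NUM_BUCKETS):
    # Stand-in hash: first 4 bytes of MD5, reduced modulo the bucket count.
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


claim = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."
grams = ngrams(simple_tokenize(claim), n=2)
buckets = [bucket(g) for g in grams]
print(grams[:3], len(grams))
```

Hashing keeps the feature space at a fixed size regardless of how many distinct n-grams the corpus contains, at the cost of occasional collisions between unrelated n-grams.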


Sample data from training file:

> {"id": 75397, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", "evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7], [92206, 104971, "Fox_Broadcasting_Company", 0]]]}

A closer look at the evidence:

> [92206, 104971, "Nikolaj_Coster-Waldau", 7]

92206 is the annotation id and 104971 is the evidence id, "Nikolaj_Coster-Waldau" is the evidence page, and 7 is the line number within that page.
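As an illustration of this structure, the sketch below parses the sample record and groups the (page, line) pairs by evidence set, where each inner list of the `evidence` field is treated as one set. The function name `evidence_sets` is hypothetical, and the two leading ids of each item are simply skipped.

```python
import json

sample = (
    '{"id": 75397, "verifiable": "VERIFIABLE", "label": "SUPPORTS", '
    '"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", '
    '"evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7], '
    '[92206, 104971, "Fox_Broadcasting_Company", 0]]]}'
)


def evidence_sets(record):
    """Return one list of (page, line) pairs per evidence set."""
    sets = []
    for evidence_set in record["evidence"]:
        pairs = []
        for _first_id, _second_id, page, line_no in evidence_set:
            # Records without evidence carry null page and line values.
            if page is not None:
                pairs.append((page, line_no))
        sets.append(pairs)
    return sets


record = json.loads(sample)
print(evidence_sets(record))
```

Here the single evidence set contains two pages, which suggests that both lines are needed together to support the claim.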


#### Formatting the input text

The training of the model is done on the evidence provided by the human annotators, therefore we use the 'evidence' to run our training.

After formatting, each training example is written as shown below and then used to train the MLP:

> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'}

The baseline model is a simple MLP that uses the count vectorizer to vectorize the claim text and the evidence page texts. It also uses an additional feature which is the cosine similarity between the vectorized claim text and the vectorized combined texts from all the evidences.

The vectorizers are saved to the filesystem so that they can be used later for transforming incoming sentences.

The trained model is used to run eval on the dev dataset of the same format.


<h5>Retrieval of the evidence</h5>

We also attempt to extract the evidence from the corresponding pages

First, using the tfidf doc ranker, we extract the top 5 pages that are similar to the claim text


> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .', 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)], 'label': 0, 'label_text': 'SUPPORTS', 'predicted_pages': [('Coster', 498.82682448841246), ('Nikolaj', 348.42021460316823), ('The_Other_Woman_-LRB-2014_film-RRB-', 316.8405030379064), ('Nikolaj_Coster-Waldau', 316.8405030379064), ('Nukaaka_Coster-Waldau', 292.47605893902585)]}

For each of the pages, we extract the lines from the page text and use 'online tfidf ranker' to fetch the closest matching lines from the text.
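The line-ranking step can be sketched as follows. This is not the 'online tfidf ranker' used in the notebook: the function `rank_lines`, the tokenizer, the smoothed IDF formula, and the toy sentences are all assumptions chosen to show the idea of fitting TF-IDF weights on the candidate lines at query time and scoring each line against the claim.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_lines(claim, lines, k=5):
    """Return the k (line_index, score) pairs that best match the claim."""
    docs = [Counter(tokenize(line)) for line in lines]
    n = len(docs)
    # Document frequency: in how many lines each term occurs.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    # Smoothed inverse document frequency, fitted on these lines only.
    idf = {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
    query = Counter(tokenize(claim))
    scored = []
    for index, doc in enumerate(docs):
        # Dot product of the TF-IDF vectors of the claim and the line.
        score = sum(query[t] * doc[t] * idf[t] ** 2 for t in query if t in doc)
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy lines, standing in for the lines of the predicted pages.
lines = [
    "The bridge was built in stone .",
    "The bridge crosses the river near the mill .",
    "A market is held every week .",
]
top = rank_lines("The bridge crosses the river .", lines, k=2)
print(top)
```

Because the weights are fitted only on the lines of the predicted pages, terms that are common across those pages contribute less than terms that single out one line.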

The examples are then formatted as shown below and used to run evaluation on the MLP model:


> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
 'label': 0,
 'label_text': 'SUPPORTS',
 'predicted_pages': [('Coster', 498.82682448841246),
  ('Nikolaj', 348.42021460316823),
  ('The_Other_Woman_-LRB-2014_film-RRB-', 316.8405030379064),
  ('Nikolaj_Coster-Waldau', 316.8405030379064),
  ('Nukaaka_Coster-Waldau', 292.47605893902585)],
 'predicted_sentences': [('Nikolaj', 7),
  ('The_Other_Woman_-LRB-2014_film-RRB-', 1),
  ('Nukaaka_Coster-Waldau', 1),
  ('Coster', 63),
  ('Nikolaj_Coster-Waldau', 0)]}
  

In [2]:
dataset_root = 'data/data'
working_dir = 'working/data'

In [3]:
!tail -2 data/data/fever-data/train.jsonl

{"id": 13114, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "J. R. R. Tolkien created Gimli.", "evidence": [[[28359, 34669, "Gimli_-LRB-Middle-earth-RRB-", 0]], [[28359, 34670, "Gimli_-LRB-Middle-earth-RRB-", 1]]]}
{"id": 152180, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Susan Sarandon is an award winner.", "evidence": [[[176133, 189101, "Susan_Sarandon", 1]], [[176133, 189102, "Susan_Sarandon", 2]], [[176133, 189103, "Susan_Sarandon", 8]]]}


In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
!head -2 data/data/fever-data/paper_test.jsonl

{"id": 113501, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Grease had bad reviews.", "evidence": [[[133128, null, null, null]]]}
{"id": 163803, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Ukrainian Soviet Socialist Republic was a founding participant of the UN.", "evidence": [[[296950, 288668, "Ukrainian_Soviet_Socialist_Republic", 7]], [[298602, 290067, "Ukrainian_Soviet_Socialist_Republic", 7], [298602, 290067, "United_Nations", 0]], [[300696, 291816, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344347, 327887, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344994, 328433, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344997, 328435, "Ukrainian_Soviet_Socialist_Republic", 7]]]}


#### Create the training dataset

The training examples have three (3) classes:
- SUPPORTS
- REFUTES
- NOT ENOUGH INFO

For the 'NOT ENOUGH INFO' class, the evidences are set to None. This would cause problems with training since we would still like to generate features for the samples which have been put in this class.

Next, we will loop over the records in the training dataset to create the training records. Specifically, we would be generating evidences for the samples in the 'NOT ENOUGH INFO' class so that the None values now have some page information.

Our strategy for dealing with missing evidences for the 'NOT ENOUGH INFO' class is to find the pages that are closest to the claims based on the tfidf similarity. The tfidf similarity of the documents in the fever DB is already precomputed and made available to us via the index file:

> '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'
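The strategy above can be sketched as a small function that replaces the empty evidence of a 'NOT ENOUGH INFO' record with the closest pages returned by a ranker. Everything here is illustrative: `fill_nei_evidence` and `toy_ranker` are hypothetical names, the page names and scores are made up, and the line index is left as `None` because how a line is chosen is not specified at this point.

```python
def fill_nei_evidence(record, closest_pages, k=5):
    """Return a copy of `record` with the k closest pages attached as evidence."""
    if record["label"] != "NOT ENOUGH INFO":
        # Annotated evidence already exists; leave it untouched.
        return dict(record)
    pages = closest_pages(record["claim"], k)
    annotation_id = record["evidence"][0][0][0]
    # Assumption: keep the annotation id, store the page, leave the line unset.
    evidence = [[[annotation_id, None, page, None] for page, _score in pages]]
    return {**record, "evidence": evidence}


def toy_ranker(claim, k):
    # Hypothetical ranker: fixed page names with made-up scores.
    ranked = [("Page_A", 3.0), ("Page_B", 2.0), ("Page_C", 1.0)]
    return ranked[:k]


nei_record = {
    "id": 113501,
    "label": "NOT ENOUGH INFO",
    "claim": "Grease had bad reviews.",
    "evidence": [[[133128, None, None, None]]],
}
filled = fill_nei_evidence(nei_record, toy_ranker, k=2)
print(filled["evidence"])
```

Passing the ranker in as a callable keeps the sketch independent of any particular index; in the notebook this role is played by the TF-IDF ranker backed by the precomputed index.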

Create the directory where we will save our prepared datasets

The raw training data is available at 

> /local/fever-common/data/fever-data/train.jsonl

The raw dev data from the FEVER paper is available at 

> /local/fever-common/data/fever-data/paper_dev.jsonl

We will generate the training dataset by sampling NEI examples based on the closest document match against each claim.

In [6]:
!mkdir -p working/data/training/baseline

In [7]:
from mda.src.dataset.DatasetGenerator import DatasetGenerator

In [8]:
!wc -l working/data/training/train.ns.pages.p5.jsonl

145449 working/data/training/train.ns.pages.p5.jsonl


##### Prepare the training dataset

This takes a while. If we have already run this step in the past, we can jump straight to the <b>Building the feature sets</b> step and use the existing file at working_dir + 'train.ns.pages.p5.jsonl'.

In [3]:
ds_generator = DatasetGenerator(dataset_root='data/data/',out_dir='working/data/training/baseline/', database_path='data/data/fever/fever.db')
ds_generator.generate_nei_evidences('train', 5)

  0%|          | 0/145449 [00:00<?, ?it/s]

Writing data to working/data/training/baseline//train.ns.pages.p5.jsonl


100%|██████████| 145449/145449 [25:35<00:00, 94.70it/s] 


##### Prepare the dev dataset

In [4]:
ds_generator = DatasetGenerator(dataset_root='data/data/',out_dir='working/data/training/baseline/', database_path='data/data/fever/fever.db')
ds_generator.generate_nei_evidences('paper_dev', 5)

  0%|          | 2/9999 [00:00<09:23, 17.74it/s]

Writing data to working/data/training/baseline//paper_dev.ns.pages.p5.jsonl


100%|██████████| 9999/9999 [02:23<00:00, 69.78it/s] 


In [6]:
!wc -l  working/data/training/baseline/*

    9999 working/data/training/baseline/paper_dev.ns.pages.p5.jsonl
  145449 working/data/training/baseline/train.ns.pages.p5.jsonl
  155448 total


#### Building the feature sets

Using the training data and dev data we generated, we will create the vectorizers and save them to local files

The training and dev data is available at 

> working/data/training/baseline/train.ns.pages.p5.jsonl 

> working/data/training/baseline/paper_dev.ns.pages.p5.jsonl

The key information we need from the training samples are the claim text and the texts from the evidence pages

For each training example, generate:
- a tokenized claim, 
- the label id, 
- the label text, 
- list of wiki pages that were provided as evidence.

This is done using a custom formatter, `training_line_formatter`, that we write ourselves.
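A formatter of this kind might look like the sketch below. It is not the notebook's `training_line_formatter`: the function name `format_training_line_sketch`, the regex tokenizer, and the label ids for 'REFUTES' and 'NOT ENOUGH INFO' are assumptions (only `SUPPORTS -> 0` appears in the sample shown earlier).

```python
import json
import re

# Assumption: SUPPORTS maps to 0 as in the earlier sample; the other ids are guesses.
LABEL_IDS = {"SUPPORTS": 0, "REFUTES": 1, "NOT ENOUGH INFO": 2}


def tokenize_claim(claim):
    # Separate punctuation from words, so "Company." becomes "Company ."
    return " ".join(re.findall(r"[\w-]+|[^\w\s]", claim))


def format_training_line_sketch(line):
    """Turn one JSONL line into claim, evidence pages, label id and label text."""
    record = json.loads(line)
    evidence = [
        (page, line_no)
        for evidence_set in record["evidence"]
        for _, _, page, line_no in evidence_set
        if page is not None
    ]
    return {
        "claim": tokenize_claim(record["claim"]),
        "evidence": evidence,
        "label": LABEL_IDS[record["label"]],
        "label_text": record["label"],
    }


raw_line = (
    '{"id": 75397, "verifiable": "VERIFIABLE", "label": "SUPPORTS", '
    '"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", '
    '"evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7], '
    '[92206, 104971, "Fox_Broadcasting_Company", 0]]]}'
)
formatted = format_training_line_sketch(raw_line)
print(formatted)
```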

In [9]:
from mda.src.dataset.DatasetReader import DatasetReader

In [10]:
infile = 'working/data/training/baseline/train.ns.pages.p5.jsonl'
dsreader = DatasetReader(in_file=infile,label_checkpoint_file=None, database_path='data/data/fever/fever.db')
raw, data = dsreader.read()
ds_train = dsreader.get_dataset()
print(ds_train.element_spec)

100%|██████████| 145449/145449 [00:01<00:00, 80259.74it/s] 
100%|██████████| 145449/145449 [00:01<00:00, 142429.20it/s]


(TensorSpec(shape=(2,), dtype=tf.string, name=None), TensorSpec(shape=(3,), dtype=tf.int32, name=None))


Save the label encoder from training; we will need it for the dev dataset preparation.

In [11]:
import pickle
label_checkpoint_file = 'working/data/training/baseline/label_encoder_train.pkl'
with open(label_checkpoint_file, 'wb') as f:
    pickle.dump(dsreader.labelencoder, f)

In [12]:
infile = 'working/data/training/baseline/paper_dev.ns.pages.p5.jsonl'
label_checkpoint_file = 'working/data/training/baseline/label_encoder_train.pkl'
#note, use type = 'train' since formatting would be like the train examples
dsreader = DatasetReader(in_file=infile,label_checkpoint_file=label_checkpoint_file, database_path='data/data/fever/fever.db', type='train')
raw_dev, data_dev = dsreader.read()
ds_dev = dsreader.get_dataset()

100%|██████████| 9999/9999 [00:00<00:00, 182615.14it/s]
100%|██████████| 9999/9999 [00:00<00:00, 204362.41it/s]


#### Build the vectorizers

We will build a <b>term frequency vectorizer</b> and a TF-IDF vectorizer and save them to a file.

The vocabulary will be limited to 5000. For each of the claim and the body text, we would produce the vectors which would be of dimension 5000.

We will also add the cosine similarity between the claim vector and the body text vector and use it as an additional feature.

The dimension of our feature would be then 5000 + 5000 + 1 = 10001
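The feature layout can be illustrated on a toy vocabulary. The sketch below uses plain Python and raw counts for both the vectors and the cosine similarity, which is a simplification: the helper names and the 4-term vocabulary are assumptions, and with a vocabulary of size V the resulting vector has 2V + 1 entries (10001 for V = 5000).

```python
import math
from collections import Counter


def count_vector(text, vocabulary):
    # Term-frequency vector restricted to the given vocabulary.
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_features(claim, body, vocabulary):
    claim_vec = count_vector(claim, vocabulary)
    body_vec = count_vector(body, vocabulary)
    # Body counts, then claim counts, then the single cosine feature.
    return body_vec + claim_vec + [cosine(claim_vec, body_vec)]


vocabulary = ["the", "flash", "aired", "series"]  # toy vocabulary of size 4
features = build_features("the flash aired", "the flash is a series", vocabulary)
print(len(features))  # 2 * 4 + 1
```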

We will be using the contents of both the training and dev set to build the vectorizers. 

We will need to read the dataset into memory from the tf.data dataset readers, since CountVectorizers cannot operate in batches.

In [14]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [15]:
max_features = 5000
max_len = 4  # Sequence length to pad the outputs to.
bow_vectorizer = TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len)
freq_vectorizer = TextVectorization(
 max_tokens=max_features,
 output_mode='count')
tfidf_vectorizer = TextVectorization(
 max_tokens=max_features,
 output_mode='tf-idf')

In [16]:
ds = ds_train.map(lambda x, y: x[0] + ' ' + x[1])
bow_vectorizer.adapt(ds.batch(64))

In [17]:
bow_vectorizer.vocabulary_size()

5000

In [18]:
freq_vectorizer.adapt(ds.batch(64))

In [19]:
tfidf_vectorizer.adapt(ds.batch(64))

In [43]:
### Save the vectorizers
# import os
# path = 'working/data/training/baseline/'
# tf.io.write_file(
#     path + 'bow_vectorizer.pkl', bow_vectorizer, name=None
# )

ValueError: Attempt to convert a value (<tensorflow.python.keras.layers.preprocessing.text_vectorization.TextVectorization object at 0x7fcebb5394d0>) with an unsupported type (<class 'tensorflow.python.keras.layers.preprocessing.text_vectorization.TextVectorization'>) to a Tensor.

In [None]:
# path = 'working/data/training/baseline/'
# with open(os.path.join(path + 'freq_vectorizer.pkl'), "wb+") as f:
#     pickle.dump(freq_vectorizer, f)
# path = 'working/data/training/baseline/'
# with open(os.path.join(path + 'tfidf_vectorizer.pkl'), "wb+") as f:
#     pickle.dump(tfidf_vectorizer, f)
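The `ValueError` shown earlier arises because `tf.io.write_file` expects string contents, not a layer object. One possible workaround, sketched here under assumptions, is to persist only what `get_vocabulary()` returns (a plain list of strings) as JSON and rebuild a vectorizer from it later, for example via `set_vocabulary`. The helper names and file name below are hypothetical, the exact restore call should be checked against the installed TensorFlow version, and the TF-IDF vectorizer would additionally need its IDF weights, which this sketch does not cover.

```python
import json
import os
import tempfile


def save_vocabulary(vocabulary, path):
    # Store the vocabulary as a JSON array, preserving token order.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(vocabulary), f, ensure_ascii=False)


def load_vocabulary(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in for the list returned by a vectorizer's get_vocabulary().
vocabulary = ["[UNK]", "the", "and", "in", "of"]
path = os.path.join(tempfile.mkdtemp(), "freq_vectorizer_vocab.json")
save_vocabulary(vocabulary, path)
restored = load_vocabulary(path)
print(restored == vocabulary)
```

Token order must be preserved on disk because the position of a token in the vocabulary determines its index in the output vector.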

In [20]:
tfidf_vectorizer.get_vocabulary()

['[UNK]',
 'the',
 'and',
 'in',
 'of',
 'a',
 'is',
 'rrb',
 'lrb',
 'end',
 'start',
 'by',
 'was',
 'for',
 'as',
 'to',
 'film',
 'on',
 'an',
 's',
 'american',
 'with',
 'he',
 'his',
 'born',
 'from',
 'has',
 'series',
 'her',
 'award',
 'best',
 'which',
 'it',
 'at',
 'she',
 'television',
 'known',
 'first',
 'lsb',
 'rsb',
 'also',
 'actor',
 'one',
 'directed',
 'that',
 'album',
 'united',
 'released',
 'world',
 'or',
 'actress',
 'who',
 'films',
 'states',
 'won',
 'drama',
 'its',
 'awards',
 'most',
 'written',
 'role',
 'two',
 'are',
 'after',
 'new',
 'including',
 'academy',
 'city',
 'band',
 'comedy',
 'producer',
 '2016',
 'stars',
 'singer',
 'received',
 '2012',
 'music',
 'based',
 'may',
 'john',
 'name',
 'been',
 'debut',
 '2015',
 'second',
 'produced',
 'starring',
 '2014',
 'million',
 'roles',
 'their',
 '2013',
 'rock',
 '2011',
 'three',
 'had',
 'english',
 'british',
 'studio',
 'all',
 'director',
 'other',
 'such',
 'time',
 '2010',
 'career',


In [23]:
BATCH_SIZE = 32
MAX_SEQ_LEN = 60
BUFFER_SIZE = 32000

hypothesis = ds_train.map(lambda x, y: x[0])
evidence = ds_train.map(lambda x, y: x[1])
labels = ds_train.map(lambda x, y: y)
# print(data)
# print(labels)
features = tf.data.Dataset.zip((hypothesis,evidence))
d = tf.data.Dataset.zip((features,labels))
dataset_train = d.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset_train)
print(dataset_train.element_spec)

<BatchDataset shapes: (((32,), (32,)), (32, 3)), types: ((tf.string, tf.string), tf.int32)>
((TensorSpec(shape=(32,), dtype=tf.string, name=None), TensorSpec(shape=(32,), dtype=tf.string, name=None)), TensorSpec(shape=(32, 3), dtype=tf.int32, name=None))


In [24]:
BATCH_SIZE = 32
MAX_SEQ_LEN = 60
BUFFER_SIZE = 32000

hypothesis = ds_dev.map(lambda x, y: x[0])
evidence = ds_dev.map(lambda x, y: x[1])
labels = ds_dev.map(lambda x, y: y)
# print(data)
# print(labels)
features = tf.data.Dataset.zip((hypothesis,evidence))
d = tf.data.Dataset.zip((features,labels))
dataset_dev = d.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset_dev)
print(dataset_dev.element_spec)

<BatchDataset shapes: (((32,), (32,)), (32, 3)), types: ((tf.string, tf.string), tf.int32)>
((TensorSpec(shape=(32,), dtype=tf.string, name=None), TensorSpec(shape=(32,), dtype=tf.string, name=None)), TensorSpec(shape=(32, 3), dtype=tf.int32, name=None))


In [75]:
for d in dataset_train.take(1):
    print(d[0])

tf.Tensor(
[[b'[START] the flash aired in the nineties . [END]'
  b'The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS .']
 [b'[START] winter passing had mixed reviews . [END]'
  b'The film premiered in 2005 to mixed reviews , and was not released in the United Kingdom until 2013 , when it was released under the new title Happy Endings .']
 [b'[START] m . s . reddy produced ramayanam . [END]'
  b'Ramayana

In [25]:
from tensorflow import keras

inp1 = keras.Input(shape=(None, ), dtype=tf.string, name = "hypothesis")
inp2 = keras.Input(shape=(None, ), dtype=tf.string, name = "evidence")

lr = 0.001

claim_tfs = freq_vectorizer(inp1)
body_tfs = freq_vectorizer(inp2)
claim_tfidf = tfidf_vectorizer(inp1)
body_tfidf = tfidf_vectorizer(inp2)

cosine_layer = keras.layers.Dot((1,1), normalize=True)
cosine_similarity = cosine_layer((claim_tfidf, body_tfidf))

w = keras.layers.concatenate([body_tfs, claim_tfs, cosine_similarity], axis = 1)

x1 = keras.layers.Dense(100, activation='relu')(w)
x2 = keras.layers.Dropout(0.4)(x1)
x3 = keras.layers.Dense(3, activation='softmax')(x2)
model = keras.Model([inp1, inp2], x3)
model.compile(loss='categorical_crossentropy',
          optimizer=tf.keras.optimizers.Adam(learning_rate=lr), 
          metrics=['accuracy'])
model.summary()

checkpoint_filepath = 'working/data/training/baseline/checkpoint_mlp'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4)

# Train. Do not specify batch size because the dataset takes care of that.
# Note: model_checkpoint_callback is defined above but is not passed to fit() here.
history = model.fit(dataset_train, epochs=10, callbacks=[stop_early], validation_data=dataset_dev)


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
evidence (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
hypothesis (InputLayer)         [(None, None)]       0                                            
__________________________________________________________________________________________________
text_vectorization_2 (TextVecto (None, 5000)         0           hypothesis[0][0]                 
                                                                 evidence[0][0]                   
__________________________________________________________________________________________________
text_vectorization_1 (TextVecto (None, 5000)         0           hypothesis[0][0]             