#### FEVER dataset processing

<h5>Process the claims in the fever dataset</h5>

In this notebook, we will prepare the training dataset and buid a baseline model that would set us up for the NLI tasks

We use the following repos for reference code:

- [fever-baselines](https://github.com/klimzaporojets/fever-baselines.git)
- [fever-allennlp-reader](https://github.com/j6mes/fever-allennlp-reader)
- [fever-allennlp](https://github.com/j6mes/fever-allennlp)

Note, AllenNLP here is used only for the NLI training, using models such as Decomposable Attention, Elmo + ESIM, ESIM etc. We will not use any of it here.
In this notebook, we will first focus on extracting the data from the pre-processed Wiki corpus provided by [fever.ai](https://fever.ai/dataset/fever.html).

The data is available in a [docker image](https://hub.docker.com/r/feverai/common), 21GB in size. The container is created and the volume /local/ from it is mounted and made available to our [container](https://github.com/dmayukh/fakenews/Dockerfile) 


We will install a few dependencies such as:
- numpy>=1.15
- regex
- allennlp==2.5.0
- fever-scorer==2.0.39
- fever-drqa==1.0.13

The following packages are installed by the above dependencies
- torchvision-0.9.1
- google_cloud_storage-1.38.0
- overrides==3.1.0
- transformers-4.6.1
- spacy-3.0.6
- sentencepiece-0.1.96
- torch-1.8.1
- wandb-0.10.33
- lmdb-1.2.1
- jsonnet-0.17.0

We do not really need allennlp or fever-scorer as of yet, we would only need DrQA. We would prefer to use the DrQA from the official github, but for now we will go with what was prepackaged by the [j6mes](https://pypi.org/project/fever-drqa/) 


In [1]:
import argparse
import json
from multiprocessing.pool import ThreadPool

<h4>Pre-parsed FEVER Datasets</h4>
Create the database from the DB file that contains the preprocessed Wiki pages. This DB was made available to us by FEVER.

FeverDocDB is a simple wrapper that opens a SQLlite3 connection to the database and provides methods to execute simple select queries to fetch ids for documents and to fetch lines given a document.

We will not require this in the first pass of our work here, since we are only interested in findings the documents closest to a claim text.

The function to fetch lines per document is what uses the connection to the database. In order to find the closest documents for a given claim, we use the ranker that uses a <b>pre-created TFIDF index</b> which can locate the document ids given a claim text.

The pre-created index is available in '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'


Sample data from training file:

> {"id": 75397, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", "evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7], [92206, 104971, "Fox_Broadcasting_Company", 0]]]}

A closer look at the evidence:

> [[92206, 104971, "Nikolaj_Coster-Waldau", 7]

92206 and 104971 are the annotation ids, while the "Nikolaj_Coster-Waldau" is the evidence page and the line number is 7.


#### Formatting the input text

The training of the model is done on the evidence provided by the human annotators, therefore we use the 'evidence' to run our training.

After formatting, the training examples are written as below that is then used to train the MLP

> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'}

The baseline model is a simple MLP that uses the count vectorizer to vectorize the claim text and the evidence page texts. It also uses an additional feature which is the cosine similarity between the vectorized claim text and the vectorized combined texts from all the evidences.

The vectorizers are saved to the filesystem that can be used later for transorming the incoming sentences.

The trained model is used to run eval on the dev dataset of the same format.


<h5>Retrieval of the evidence</h5>

We also attempt to extract the evidence from the corresponding pages

First, using the tfidf doc ranker, we extract the top 5 pages that are similar to the claim text


> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .', 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)], 'label': 0, 'label_text': 'SUPPORTS', 'predicted_pages': [('Coster', 498.82682448841246), ('Nikolaj', 348.42021460316823), ('The_Other_Woman_-LRB-2014_film-RRB-', 316.8405030379064), ('Nikolaj_Coster-Waldau', 316.8405030379064), ('Nukaaka_Coster-Waldau', 292.47605893902585)]}

For each of the pages, we extract the lines from the page text and use 'online tfidf ranker' to fetch the closest matching lines from the text.

The training examples are then formatted as below which is then used to run EVAL on the MLP model


> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
 'label': 0,
 'label_text': 'SUPPORTS',
 'predicted_pages': [('Coster', 498.82682448841246),
  ('Nikolaj', 348.42021460316823),
  ('The_Other_Woman_-LRB-2014_film-RRB-', 316.8405030379064),
  ('Nikolaj_Coster-Waldau', 316.8405030379064),
  ('Nukaaka_Coster-Waldau', 292.47605893902585)],
 'predicted_sentences': [('Nikolaj', 7),
  ('The_Other_Woman_-LRB-2014_film-RRB-', 1),
  ('Nukaaka_Coster-Waldau', 1),
  ('Coster', 63),
  ('Nikolaj_Coster-Waldau', 0)]}
  

In [2]:
dataset_root = 'data/data'
working_dir = 'working/data'

In [4]:
!tail -2 data/data/fever-data/train.jsonl

{"id": 13114, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "J. R. R. Tolkien created Gimli.", "evidence": [[[28359, 34669, "Gimli_-LRB-Middle-earth-RRB-", 0]], [[28359, 34670, "Gimli_-LRB-Middle-earth-RRB-", 1]]]}
{"id": 152180, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Susan Sarandon is an award winner.", "evidence": [[[176133, 189101, "Susan_Sarandon", 1]], [[176133, 189102, "Susan_Sarandon", 2]], [[176133, 189103, "Susan_Sarandon", 8]]]}


In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
!head -2 data/data/fever-data/paper_test.jsonl

{"id": 113501, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Grease had bad reviews.", "evidence": [[[133128, null, null, null]]]}
{"id": 163803, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Ukrainian Soviet Socialist Republic was a founding participant of the UN.", "evidence": [[[296950, 288668, "Ukrainian_Soviet_Socialist_Republic", 7]], [[298602, 290067, "Ukrainian_Soviet_Socialist_Republic", 7], [298602, 290067, "United_Nations", 0]], [[300696, 291816, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344347, 327887, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344994, 328433, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344997, 328435, "Ukrainian_Soviet_Socialist_Republic", 7]]]}


#### Create the training dataset

The training examples have three (3) classes:
- SUPPORTS
- REFUTES
- NOT ENOUGH INFO

For the 'NOT ENOUGH INFO' class, the evidences are set to None. This would cause problems with training since we would still like to generate features for the samples which have been put in this class.

Next, we will loop over the records in the training dataset to create the training records. Specifically, we would be generating evidences for the samples in the 'NOT ENOUGH INFO' class so that the None values now have some page information.

Our strategy for dealing with missing evidences for the 'NOT ENOUGH INFO' class is to find the pages that are closest to the claims based on the tfidf similarity. The tfidf similarity of the documents in the fever DB is already precomputed and make available to us via the index file:

> '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'

Create the directory where we will save our prepared datasets

The raw training data is available at 

> /local/fever-common/data/fever-data/train.jsonl

The raw dev data from the FEVER paper is available at 

> /local/fever-common/data/fever-data/paper_dev.jsonl

We wil generate the training dataset by sampling for NEI examples based on closest document match against our claim.

In [5]:
!mkdir -p working/data/training/baseline

In [1]:
from mda.src.dataset.DatasetGenerator import DatasetGenerator

In [4]:
!wc -l working/data/training/train.ns.pages.p5.jsonl

145449 working/data/training/train.ns.pages.p5.jsonl


##### Prepare the training dataset

In [3]:
ds_generator = DatasetGenerator(dataset_root='data/data/',out_dir='working/data/training/baseline/', database_path='data/data/fever/fever.db')
ds_generator.generate_nei_evidences('train', 5)

  0%|          | 0/145449 [00:00<?, ?it/s]

Writing data to working/data/training/baseline//train.ns.pages.p5.jsonl


100%|██████████| 145449/145449 [25:35<00:00, 94.70it/s] 


##### Prepare the dev dataset

In [4]:
ds_generator = DatasetGenerator(dataset_root='data/data/',out_dir='working/data/training/baseline/', database_path='data/data/fever/fever.db')
ds_generator.generate_nei_evidences('paper_dev', 5)

  0%|          | 2/9999 [00:00<09:23, 17.74it/s]

Writing data to working/data/training/baseline//paper_dev.ns.pages.p5.jsonl


100%|██████████| 9999/9999 [02:23<00:00, 69.78it/s] 


In [6]:
!wc -l  working/data/training/baseline/*

    9999 working/data/training/baseline/paper_dev.ns.pages.p5.jsonl
  145449 working/data/training/baseline/train.ns.pages.p5.jsonl
  155448 total


#### Building the feature sets

Using the training data and dev data we generated, we will create the vectorizers and save them to local files

The training and dev data is available at 

> working/data/training/baseline/train.ns.pages.p5.jsonl 

> working/data/training/baseline/paper_dev.ns.pages.p5.jsonl

The key information we need from the training samples are the claim text and the texts from the evidence pages

For each training example, generate:
- a tokenized claim, 
- the label id, 
- the label text, 
- list of wiki pages that were provided as evidence.

This is done using a custom formatter `training_line_formatter` we would write.

In [3]:
from mda.src.dataset.DatasetReader import DatasetReader

In [4]:
infile = 'working/data/training/baseline/train.ns.pages.p5.jsonl'
dsreader = DatasetReader(in_file=infile,label_checkpoint_file=None, database_path='data/data/fever/fever.db')
raw, data = dsreader.read()
ds_train = dsreader.get_dataset()
print(ds_train.element_spec)

100%|██████████| 145449/145449 [00:01<00:00, 77541.19it/s] 
100%|██████████| 145449/145449 [00:01<00:00, 139881.75it/s]


(TensorSpec(shape=(2,), dtype=tf.string, name=None), TensorSpec(shape=(3,), dtype=tf.int32, name=None))


Save the label encoder from training, we will need them for the dev dataset preparation

In [5]:
import pickle
label_checkpoint_file = 'working/data/training/baseline/label_encoder_train.pkl'
with open(label_checkpoint_file, 'wb') as f:
    pickle.dump(dsreader.labelencoder, f)

In [6]:
infile = 'working/data/training/baseline/paper_dev.ns.pages.p5.jsonl'
label_checkpoint_file = 'working/data/training/baseline/label_encoder_train.pkl'
#note, use type = 'train' since formatting would be like the train examples
dsreader = DatasetReader(in_file=infile,label_checkpoint_file=label_checkpoint_file, database_path='data/data/fever/fever.db', type='train')
raw_dev, data_dev = dsreader.read()
ds_dev = dsreader.get_dataset()

100%|██████████| 9999/9999 [00:00<00:00, 175393.42it/s]
100%|██████████| 9999/9999 [00:00<00:00, 196706.67it/s]


#### Build feature set 

We will build a <b>term frequency vectorizer</b> and a TDIDF vectorizer and save them to a file.

The vocabulary will be limited to 5000. For each of the claim and the body text, we would produce the vectors which would be of dimension 5000.

We will also add the cosine similarity between the claim vector and the body text vector and use it as an additional feature.

The dimension of our feature would be then 5000 + 5000 + 1 = 10001

##### Create the vectorizers
We will be using the contents of both the training and dev set to build the vectorizers. 

We will need to read the dataset into memory from the td.dataset readers, since CountVectorizers cannot operate in batches.

In [12]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [36]:
max_features = 5000
max_len = 4  # Sequence length to pad the outputs to.
bow_vectorizer = TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len)
freq_vectorizer = TextVectorization(
 max_tokens=max_features,
 output_mode='count')
tfidf_vectorizer = TextVectorization(
 max_tokens=max_features,
 output_mode='tf-idf')

In [37]:
ds = ds_train.map(lambda x, y: x[0] + ' ' + x[1])
bow_vectorizer.adapt(ds.batch(64))

In [45]:
bow_vectorizer.vocabulary_size()

5000

In [39]:
freq_vectorizer.adapt(ds.batch(64))

In [40]:
tfidf_vectorizer.adapt(ds.batch(64))

In [43]:
### Save the vectorizers
import os
path = 'working/data/training/baseline/'
tf.io.write_file(
    path + 'bow_vectorizer.pkl', bow_vectorizer, name=None
)

ValueError: Attempt to convert a value (<tensorflow.python.keras.layers.preprocessing.text_vectorization.TextVectorization object at 0x7fcebb5394d0>) with an unsupported type (<class 'tensorflow.python.keras.layers.preprocessing.text_vectorization.TextVectorization'>) to a Tensor.

In [None]:
# path = 'working/data/training/baseline/'
# with open(os.path.join(path + 'freq_vectorizer.pkl'), "wb+") as f:
#     pickle.dump(freq_vectorizer, f)
# path = 'working/data/training/baseline/'
# with open(os.path.join(path + 'tfidf_vectorizer.pkl'), "wb+") as f:
#     pickle.dump(tfidf_vectorizer, f)

In [44]:
tfidf_vectorizer.get_vocabulary()

['[UNK]',
 'the',
 'and',
 'in',
 'of',
 'a',
 'is',
 'rrb',
 'lrb',
 'end',
 'start',
 'by',
 'was',
 'for',
 'as',
 'to',
 'film',
 'on',
 'an',
 's',
 'american',
 'with',
 'he',
 'his',
 'born',
 'from',
 'has',
 'series',
 'her',
 'award',
 'best',
 'which',
 'it',
 'at',
 'she',
 'known',
 'television',
 'first',
 'lsb',
 'rsb',
 'also',
 'actor',
 'one',
 'directed',
 'that',
 'album',
 'united',
 'released',
 'world',
 'or',
 'actress',
 'who',
 'films',
 'states',
 'won',
 'drama',
 'its',
 'awards',
 'most',
 'written',
 'role',
 'two',
 'after',
 'are',
 'new',
 'including',
 'academy',
 'city',
 'comedy',
 'band',
 'producer',
 '2016',
 'stars',
 'singer',
 '2012',
 'received',
 'music',
 'based',
 'john',
 'may',
 'name',
 '2015',
 'debut',
 'been',
 'second',
 'produced',
 '2014',
 'million',
 'starring',
 'roles',
 '2013',
 'their',
 '2011',
 'rock',
 'three',
 'had',
 'studio',
 'english',
 'all',
 'british',
 'director',
 'other',
 'such',
 'time',
 'show',
 '2010',
 '

In [80]:
BATCH_SIZE = 32
MAX_SEQ_LEN = 60
BUFFER_SIZE = 32000

hypothesis = ds_train.map(lambda x, y: x[0])
evidence = ds_train.map(lambda x, y: x[1])
labels = ds_train.map(lambda x, y: y)
print(data)
print(labels)
features = tf.data.Dataset.zip((hypothesis,evidence))
d = tf.data.Dataset.zip((features,labels))
dataset_train = d.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset_train)
print(dataset_train.element_spec)

<MapDataset shapes: (2,), types: tf.string>
<MapDataset shapes: (3,), types: tf.int32>
<BatchDataset shapes: (((32,), (32,)), (32, 3)), types: ((tf.string, tf.string), tf.int32)>
((TensorSpec(shape=(32,), dtype=tf.string, name=None), TensorSpec(shape=(32,), dtype=tf.string, name=None)), TensorSpec(shape=(32, 3), dtype=tf.int32, name=None))


In [81]:
BATCH_SIZE = 32
MAX_SEQ_LEN = 60
BUFFER_SIZE = 32000

hypothesis = ds_dev.map(lambda x, y: x[0])
evidence = ds_dev.map(lambda x, y: x[1])
labels = ds_dev.map(lambda x, y: y)
print(data)
print(labels)
features = tf.data.Dataset.zip((hypothesis,evidence))
d = tf.data.Dataset.zip((features,labels))
dataset_dev = d.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset_dev)
print(dataset_dev.element_spec)

<MapDataset shapes: (2,), types: tf.string>
<MapDataset shapes: (3,), types: tf.int32>
<BatchDataset shapes: (((32,), (32,)), (32, 3)), types: ((tf.string, tf.string), tf.int32)>
((TensorSpec(shape=(32,), dtype=tf.string, name=None), TensorSpec(shape=(32,), dtype=tf.string, name=None)), TensorSpec(shape=(32, 3), dtype=tf.int32, name=None))


In [75]:
for d in dataset_train.take(1):
    print(d[0])

tf.Tensor(
[[b'[START] the flash aired in the nineties . [END]'
  b'The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS . The Flash is a 1990 American television series developed by the writing team of Danny Bilson and Paul De Meo that aired on CBS .']
 [b'[START] winter passing had mixed reviews . [END]'
  b'The film premiered in 2005 to mixed reviews , and was not released in the United Kingdom until 2013 , when it was released under the new title Happy Endings .']
 [b'[START] m . s . reddy produced ramayanam . [END]'
  b'Ramayana

In [None]:
from tensorflow import keras

inp1 = keras.Input(shape=(None, ), dtype=tf.string, name = "hypothesis")
inp2 = keras.Input(shape=(None, ), dtype=tf.string, name = "evidence")

lr = 0.001

claim_tfs = freq_vectorizer(inp1)
body_tfs = freq_vectorizer(inp2)
claim_tfidf = tfidf_vectorizer(inp1)
body_tfidf = tfidf_vectorizer(inp2)

cosine_layer = keras.layers.Dot((1,1), normalize=True)
cosine_similarity = cosine_layer((claim_tfidf, body_tfidf))

w = keras.layers.concatenate([body_tfs, claim_tfs, cosine_similarity], axis = 1)

x1 = keras.layers.Dense(100, activation='relu')(w)
x2 = keras.layers.Dropout(0.4)(x1)
x3 = keras.layers.Dense(3, activation='softmax')(x2)
model = keras.Model([inp1, inp2], x3)
model.compile(loss='categorical_crossentropy',
          optimizer=tf.keras.optimizers.Adam(lr=lr), 
          metrics=['accuracy'])
model.summary()

checkpoint_filepath = 'working/data/training/baseline/checkpoint_mlp'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=8)

# Train. Do not specify batch size because the dataset takes care of that.
history = model.fit(dataset_train, epochs=10, callbacks=[stop_early], validation_data=dataset_dev)


Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
evidence (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
hypothesis (InputLayer)         [(None, None)]       0                                            
__________________________________________________________________________________________________
text_vectorization_12 (TextVect (None, 5000)         0           hypothesis[0][0]                 
                                                                 evidence[0][0]                   
__________________________________________________________________________________________________
text_vectorization_11 (TextVect (None, 5000)         0           hypothesis[0][0]           

#### Get the dataset generator to use for training

In [47]:
BATCH_SIZE = 64
MAX_SEQ_LEN = 60
BUFFER_SIZE = 32000

# claim_bow = bow_vectorizer.transform(train_claims)
# claim_tfs = tfreq_vectorizer.transform(claim_bow)
# claim_tfidf = tfidf_vectorizer.transform(train_claims)

# #get the text from the bodies of all the n-closest docs for the claim
# #body_texts = texts(data)
# body_bow = bow_vectorizer.transform(train_bodies)
# body_tfs = tfreq_vectorizer.transform(body_bow)
# body_tfidf = tfidf_vectorizer.transform(train_bodies)

# cosines = np.array([cosine_similarity(c, b)[0] for c,b in zip(claim_tfidf,body_tfidf)])

# return hstack([body_tfs,claim_tfs,cosines])

def transform(text):
    claim_bow = bow_vectorizer.transform(text[0])
    claim_tfs = freq_vectorizer.transform(text[0])
    claim_tfidf = tfidf_vectorizer.transform(text[0])
    body_bow = bow_vectorizer.transform(text[1])
    body_tfs = freq_vectorizer.transform(text[1])
    body_tfidf = tfidf_vectorizer.transform(text[1])
    cosines = np.array([cosine_similarity(c, b)[0] for c,b in zip(claim_tfidf,body_tfidf)])
    
    return hstack([body_tfs,claim_tfs,cosines])

data = ds_train.map(lambda x, y: transform(x))
labels = ds_train.map(lambda x, y: y)
print(data)
print(labels)
d = tf.data.Dataset.zip((data,labels))
dataset_train = d.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset_train)
print(dataset_train.element_spec)

AttributeError: in user code:

    <ipython-input-46-6aca85572743>:30 None  *
        lambda x, y: transform(x))
    <ipython-input-47-254a20fa8d48>:20 transform  *
        claim_bow = bow_vectorizer.transform(text[0])

    AttributeError: 'TextVectorization' object has no attribute 'transform'


In [49]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
claims = train_claims
bodies = train_bodies
dev_claims = dev_claims
dev_bodies = dev_bodies

lim_unigram = 5000
stop_words = [
        "a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along",
        "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
        "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be",
        "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
        "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "co",
        "con", "could", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight",
        "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
        "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for",
        "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had",
        "has", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself",
        "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "interest",
        "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made",
        "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much",
        "must", "my", "myself", "name", "namely", "neither", "nevertheless", "next", "nine", "nobody", "now", "nowhere",
        "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours",
        "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see",
        "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some",
        "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take",
        "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby",
        "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though",
        "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve",
        "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what",
        "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon",
        "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will",
        "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
        ]
bow_vectorizer = CountVectorizer(max_features=lim_unigram,
                                         stop_words=stop_words)
bow = bow_vectorizer.fit_transform(claims + bodies)
tfreq_vectorizer = TfidfTransformer(use_idf=False).fit(bow)
tfidf_vectorizer = TfidfVectorizer(max_features=lim_unigram,
                                           stop_words=stop_words).fit(claims + bodies + dev_claims + dev_bodies)

The vectorizers will be saved in a folder in the directory 'ns_nn_sent' so that it can be looked up later.

In [13]:
!mkdir -p working/models

In [24]:
ls working/data/training/baseline

label_encoder_train.pkl  paper_dev.ns.pages.p5.jsonl  train.ns.pages.p5.jsonl


In [25]:
with open(os.path.join('working/data/training/baseline/train_labels.pkl'), "wb+") as f:
    pickle.dump(train_labels, f)

In [26]:
with open(os.path.join('working/data/training/baseline/dev_labels.pkl'), "wb+") as f:
    pickle.dump(dev_labels, f)

In [27]:
train_labels_file = 'working/data/training/baseline/train_labels.pkl'
if os.path.exists(train_labels_file):
    with open(os.path.join(train_labels_file), "rb") as f:
                train_labels = pickle.load(f)
else:
    print("Saved file not found, processing again...")
    train_labels = []
    for d, l in ds_train.batch(1):
        train_labels.append(l.numpy()[0])

In [28]:
dev_labels_file = 'working/data/training/baseline/dev_labels.pkl'
if os.path.exists(dev_labels_file):
    with open(os.path.join(dev_labels_file), "rb") as f:
                dev_labels = pickle.load(f)
else:
    print("Saved file not found, processing again...")
    dev_labels = []
    for d, l in ds_dev.batch(1):
        dev_labels.append(l.numpy()[0])

In [53]:
train_feats[0].tocsr()

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.46857671]])

Transform the claims and the body texts using the vectorizers.

In [14]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack
def process_train():
    claim_bow = bow_vectorizer.transform(train_claims)
    claim_tfs = tfreq_vectorizer.transform(claim_bow)
    claim_tfidf = tfidf_vectorizer.transform(train_claims)

    #get the text from the bodies of all the n-closest docs for the claim
    #body_texts = texts(data)
    body_bow = bow_vectorizer.transform(train_bodies)
    body_tfs = tfreq_vectorizer.transform(body_bow)
    body_tfidf = tfidf_vectorizer.transform(train_bodies)

    cosines = np.array([cosine_similarity(c, b)[0] for c,b in zip(claim_tfidf,body_tfidf)])

    return hstack([body_tfs,claim_tfs,cosines])

def process_dev():
    claim_bow = bow_vectorizer.transform(dev_claims)
    claim_tfs = tfreq_vectorizer.transform(claim_bow)
    claim_tfidf = tfidf_vectorizer.transform(dev_claims)

    #get the text from the bodies of all the n-closest docs for the claim
    #body_texts = texts(data)
    body_bow = bow_vectorizer.transform(dev_bodies)
    body_tfs = tfreq_vectorizer.transform(body_bow)
    body_tfidf = tfidf_vectorizer.transform(dev_bodies)

    cosines = np.array([cosine_similarity(c, b)[0] for c,b in zip(claim_tfidf,body_tfidf)])

    return hstack([body_tfs,claim_tfs,cosines])

In [15]:
import os
import pickle
model_name = 'ns_nn_sent'
base_path = 'working/models/'

def load_features(name):
    features = list()
    ffpath = os.path.join(base_path, model_name)
    if not os.path.exists(ffpath):
        os.mkdir(ffpath)
    if (not os.path.exists(os.path.join(ffpath, name + ".pkl"))):
        print("Saved features do not exist, creating data...")
        if name == 'train':
            features = process_train()
        else:
            features = process_dev()
        with open(os.path.join(ffpath, name + ".pkl"), "wb+") as f:
            pickle.dump(features, f)
    else:
        print("Loading saved feature from {}".format(os.path.join(ffpath, name + ".pkl")))
        with open(os.path.join(ffpath, name + ".pkl"), "rb") as f:
            features = pickle.load(f)
    return features

Create the labels for the features

In [16]:
def out(features,labels):
    if features is not None:
        return np.hstack(features) if len(features) > 1 else features[0], labels
    return [[]],[]

In [17]:
# label_name = "label"
# def labels(data):
#     return [datum[label_name] for datum in data]
# def out(features,ds):
#     if ds is not None:
#         return np.hstack(features) if len(features) > 1 else features[0], labels(ds)
#     return [[]],[]

This needs to be performed once per dataset. Therefore, we would save the transformed vectors in a file to reuse for each modelling excercise.

Check if the saved vectors exist, if not, create them by using the vectorizers and applying a transform on the 
- claim
- lines from the body of the evidence pages

In [16]:
from scipy import sparse

In [29]:
train_fs = []
train_features = load_features("train")
train_fs.append(train_features)
#train_features = sparse.csr_matrix(train_features)
train_feats = out(train_fs, train_labels)

Loading saved feature from working/models/ns_nn_sent/train.pkl


In [30]:
input_shape = train_feats[0].shape[1]
print("input_shape =", input_shape)

input_shape = 10001


In [31]:
dev_fs = []
dev_features = load_features("dev")
dev_fs.append(dev_features)
#dev_features = sparse.csr_matrix(dev_features)
dev_feats = out(dev_fs, dev_labels)

Loading saved feature from working/models/ns_nn_sent/dev.pkl


In [33]:
dev_feats[:2]

(<9999x10001 sparse matrix of type '<class 'numpy.float64'>'
 	with 207568 stored elements in COOrdinate format>,
 [array([1, 0, 0], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([1, 0, 0], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([0, 0, 1], dtype=int32),
  array([0, 1, 0], dtype=int32),
  array([0, 1, 0], dtype=int

In [32]:
input_shape = dev_feats[0].shape[1]
print("input_shape =", input_shape)

input_shape = 10001


### Training (using PyTorch)
It's now time to build the model. We will build a Simple Multi layer perceptron.

In [30]:
from torch import nn

class SimpleMLP(nn.Module):
    def __init__(self,input_dim,hidden_dim,output_dim,keep_p=.6):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_dim,hidden_dim)
        self.fc2 = nn.Linear(hidden_dim,output_dim)

        self.do = nn.Dropout(1-keep_p)
        self.relu = nn.ReLU()

    def forward(self,x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.do(x)

        x = self.fc2(x)
        x = self.do(x)
        return x

In [31]:
model = SimpleMLP(input_shape,100,3)
model

SimpleMLP(
  (fc1): Linear(in_features=10001, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=3, bias=True)
  (do): Dropout(p=0.4, inplace=False)
  (relu): ReLU()
)

Clean up any saved models

In [32]:
#rm -rf working/models/ns_nn_sent/ns_nn_sent.best.save

Define the logger, the one that will be used to monitor the model training progress

The best model will be saved at 
> working/models/ns_nn_sent/

In [33]:
import logging
class LogHelper():
    handler = None
    @staticmethod
    def setup():
        FORMAT = '[%(levelname)s] %(asctime)s - %(name)s - %(message)s'
        LogHelper.handler = logging.StreamHandler()
        LogHelper.handler.setLevel(logging.DEBUG)
        LogHelper.handler.setFormatter(logging.Formatter(FORMAT))

        LogHelper.get_logger(LogHelper.__name__).info("Log Helper set up")

    @staticmethod
    def get_logger(name,level=logging.DEBUG):
        ##note: once a logger is created, repeated calls using the same name will give you the same logger object
        l = logging.getLogger(name)
        sh = logging.StreamHandler()
        l.setLevel(level)
        l.addHandler(sh)
        return l
    
class EarlyStopping():
    def __init__(self,name,patience=8):
        self.patience = patience
        self.best_model = None
        self.best_score = None

        self.best_epoch = 0
        self.epoch = 0
        #print("name is ", EarlyStopping.__name__)
        self.name = name
        #self.logger = LogHelper.get_logger(EarlyStopping.__name__)
        self.logger = LogHelper.get_logger(name)

    def __call__(self, model, acc):
        self.epoch += 1

        if self.best_score is None:
            self.best_score = acc

        if acc >= self.best_score:
            torch.save(model.state_dict(),"working/models/ns_nn_sent/{0}.best.save".format(self.name))
            self.best_score = acc
            self.best_epoch = self.epoch
            self.logger.info("Saving best weights from round {0}".format(self.epoch))
            return False

        elif self.epoch > self.best_epoch+self.patience:
            self.logger.info("Early stopping: Terminate")
            return True

        self.logger.info("Early stopping: Worse Round")
        return False

    def set_best_state(self,model):
        self.logger.info("Loading weights from round {0}".format(self.best_epoch))
        model.load_state_dict(torch.load("working/models/ns_nn_sent/{0}.best.save".format(self.name)))

#### Dataset reader

We will need to handle the batching of inputs to our model

We will define a batcher that deals with the sparse matrix

In [34]:
from scipy.sparse import coo_matrix
from torch.autograd import Variable
import torch
def is_gpu():
    return os.getenv("GPU","no").lower() in ["1",1,"yes","true","t"]

def gpu():
    if is_gpu():
        torch.cuda.set_device(int(os.getenv("CUDA_DEVICE", 0)))
        return True
    return False

class Batcher():
    def __init__(self,data,size):
        self.data = data
        self.size = size
        self.pointer = 0

        if isinstance(self.data,coo_matrix):
            self.data = self.data.tocsr()

    def __next__(self):
        if self.pointer == splen(self.data):
            self.pointer = 0
            raise StopIteration
        next = min(splen(self.data),self.pointer+self.size)
        to_return = self.data[self.pointer : next]
        start,end = self.pointer,next
        self.pointer = next
        return to_return, splen(to_return), start, end

    def __iter__(self):
        return self

def splen(data):
    try:
        return data.shape[0]
    except:
        return len(data)

def prepare_with_labels(data,labels):
    data = data.todense()
    v = torch.FloatTensor(np.array(data))
    if gpu():
        return Variable(v.cuda()), Variable(torch.LongTensor(labels).cuda())
    return Variable(v), Variable(torch.LongTensor(labels))


def prepare(data):
    data = data.todense()
    v = torch.FloatTensor(np.array(data))
    if gpu():
        return Variable(v.cuda())
    return Variable(v)

In [35]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.utils import shuffle
import torch.nn.functional as F

def evaluate(model,data,labels,batch_size):
    predicted = predict(model,data,batch_size)
    return accuracy_score(labels,predicted.data.numpy().reshape(-1))

def predict(model, data, batch_size):
    batcher = Batcher(data, batch_size)

    predicted = []
    for batch, size, start, end in batcher:
        d = prepare(batch)
        model.eval()
        logits = model(d).cpu()

        predicted.extend(torch.max(logits, 1)[1])
    return torch.stack(predicted)

def train(model, fs, batch_size, lr, epochs,dev=None, clip=None, early_stopping=None,name=None):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)

    data, labels = fs
    if dev is not None:
        dev_data,dev_labels = dev

    for epoch in tqdm(range(epochs)):
        epoch_loss = 0
        epoch_data = 0

        shuffle(data,labels)

        batcher = Batcher(data, batch_size)

        for batch, size, start, end in batcher:
            d,gold = prepare_with_labels(batch,labels[start:end])

            model.train()
            optimizer.zero_grad()
            logits = model(d)

            loss = F.cross_entropy(logits, gold)
            loss.backward()

            epoch_loss += loss.cpu()
            epoch_data += size

            if clip is not None:
                torch.nn.utils.clip_grad_norm(model.parameters(), clip)
            optimizer.step()

        print("Average epoch loss: {0}".format((epoch_loss/epoch_data).data.numpy()))

        #print("Epoch Train Accuracy {0}".format(evaluate(model, data, labels, batch_size)))
        if dev is not None:
            acc = evaluate(model,dev_data,dev_labels,batch_size)
            print("Epoch Dev Accuracy {0}".format(acc))

            if early_stopping is not None and early_stopping(model,acc):
                break

    if dev is not None and early_stopping is not None:
        early_stopping.set_best_state(model)

In [36]:
mname = 'ns_nn_sent'
final_model = train(model, train_feats, 500, 1e-2, 90, dev_feats, early_stopping=EarlyStopping(mname))

  0%|          | 0/90 [00:00<?, ?it/s]

Average epoch loss: 0.0016196609940379858


Saving best weights from round 1
  1%|          | 1/90 [00:10<15:24, 10.39s/it]

Epoch Dev Accuracy 0.6278627862786279
Average epoch loss: 0.001501814927905798


Saving best weights from round 2
  2%|▏         | 2/90 [00:20<15:02, 10.26s/it]

Epoch Dev Accuracy 0.6342634263426342
Average epoch loss: 0.0014742235653102398


Saving best weights from round 3
  3%|▎         | 3/90 [00:38<18:25, 12.71s/it]

Epoch Dev Accuracy 0.6388638863886389
Average epoch loss: 0.0014582787407562137


Early stopping: Worse Round
  4%|▍         | 4/90 [00:48<16:57, 11.83s/it]

Epoch Dev Accuracy 0.6374637463746374
Average epoch loss: 0.0014515924267470837


Early stopping: Worse Round
  6%|▌         | 5/90 [01:08<20:14, 14.28s/it]

Epoch Dev Accuracy 0.6297629762976298
Average epoch loss: 0.001442103530280292


Saving best weights from round 6
  7%|▋         | 6/90 [01:29<22:49, 16.30s/it]

Epoch Dev Accuracy 0.6465646564656465
Average epoch loss: 0.0014398741768673062


Early stopping: Worse Round
  8%|▊         | 7/90 [01:48<23:28, 16.97s/it]

Epoch Dev Accuracy 0.6443644364436444
Average epoch loss: 0.001430192613042891


Early stopping: Worse Round
  9%|▉         | 8/90 [02:08<24:43, 18.09s/it]

Epoch Dev Accuracy 0.6458645864586459
Average epoch loss: 0.0014347969554364681


Early stopping: Worse Round
 10%|█         | 9/90 [02:18<21:03, 15.60s/it]

Epoch Dev Accuracy 0.6417641764176417
Average epoch loss: 0.0014273609267547727


Early stopping: Worse Round
 11%|█         | 10/90 [02:37<22:19, 16.74s/it]

Epoch Dev Accuracy 0.6418641864186418
Average epoch loss: 0.0014319585170596838


Early stopping: Worse Round
 12%|█▏        | 11/90 [02:57<23:14, 17.66s/it]

Epoch Dev Accuracy 0.6402640264026402
Average epoch loss: 0.0014225579798221588


Early stopping: Worse Round
 13%|█▎        | 12/90 [03:07<19:53, 15.31s/it]

Epoch Dev Accuracy 0.6407640764076408
Average epoch loss: 0.0014181816950440407


Early stopping: Worse Round
 14%|█▍        | 13/90 [03:27<21:24, 16.68s/it]

Epoch Dev Accuracy 0.6378637863786378
Average epoch loss: 0.001419287407770753


Early stopping: Worse Round
 16%|█▌        | 14/90 [03:37<18:38, 14.71s/it]

Epoch Dev Accuracy 0.6383638363836384
Average epoch loss: 0.0014201359590515494


Early stopping: Terminate
 16%|█▌        | 14/90 [03:47<20:35, 16.25s/it]
Loading weights from round 6


Epoch Dev Accuracy 0.634963496349635


<h4> We achieve a dev set performance of 64% </h4>

#### Training using tensorflow keras

In [30]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

In [31]:
from tensorflow.keras.layers import Dense,Dropout,Input

We will have to build a layer similar to the ones we built using torch.

`
SimpleMLP(
  (fc1): Linear(in_features=10001, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=3, bias=True)
  (do): Dropout(p=0.4, inplace=False)
  (relu): ReLU()
)
`

In [32]:
train_x, train_y = train_feats
print("Shape of the training dataset =", train_x.shape)

Shape of the training dataset = (145449, 10001)


In [43]:
for d in ds_train.map(lambda x, y: x).take(1):
    print(d[1])

tf.Tensor(b'He then played Detective John Amsterdam in the short-lived Fox television series New Amsterdam -LRB- 2008 -RRB- , as well as appearing as Frank Pike in the 2009 Fox television film Virtuality , originally intended as a pilot . The Fox Broadcasting Company -LRB- often shortened to Fox and stylized as FOX -RRB- is an American English language commercial broadcast television network that is owned by the Fox Entertainment Group subsidiary of 21st Century Fox .', shape=(), dtype=string)


In [67]:
ds_train.element_spec

(TensorSpec(shape=(2,), dtype=tf.string, name=None),
 TensorSpec(shape=(3,), dtype=tf.int32, name=None))

In [66]:
### (TensorSpec(shape=(2,), dtype=tf.string, name=None), TensorSpec(shape=(3,), dtype=tf.int32, name=None))

In [54]:
train_features, train_labels = train_feats

In [57]:
type(train_features.tocsr())
train_features = train_features.tocsr().todense()

In [92]:
train_features = sparse.csr_matrix(train_features).todense()

In [None]:
dataset_train = tf.data.Dataset.from_tensor_slices((train_features, train_labels))

In [79]:
BATCH_SIZE = 64
MAX_SEQ_LEN = 60
BUFFER_SIZE = 32000



print(h)
print(e)
f = tf.data.Dataset.zip((h,e))
d = tf.data.Dataset.zip((f,l))
dataset_train = d.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset_train)
print(dataset_train.element_spec)

ValueError: in user code:

    <ipython-input-76-b1b99179c462>:25 None  *
        lambda x, y: process_train(x))
    <ipython-input-75-407de16f6ff3>:11 process_train  *
        claim_bow = bow_vectorizer.transform(text[0])
    /home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:1255 transform  *
        _, X = self._count_vocab(raw_documents, fixed_vocab=True)
    /home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:1113 _count_vocab  *
        for doc in raw_documents:
    /home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/autograph/operators/control_flow.py:419 for_stmt
        iter_, extra_test, body, get_state, set_state, symbol_names, opts)
    /home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/autograph/operators/control_flow.py:484 _known_len_tf_for_stmt
        n = py_builtins.len_(iter_)
    /home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/autograph/operators/py_builtins.py:249 len_
        return _tf_tensor_len(s)
    /home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/autograph/operators/py_builtins.py:277 _tf_tensor_len
        'len requires a non-scalar tensor, got one of shape {}'.format(shape))

    ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)


In [64]:
for d in dataset_train.take(1):
    print(d)

InvalidArgumentError: TypeError: Cannot iterate over a scalar tensor.
Traceback (most recent call last):

  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 247, in __call__
    return func(device, token, args)

  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 135, in __call__
    ret = self._func(*args)

  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "<ipython-input-61-3d3327444ca7>", line 11, in process_data
    claim_bow = bow_vectorizer.transform(text[0])

  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1255, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)

  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1113, in _count_vocab
    for doc in raw_documents:

  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 526, in __iter__
    raise TypeError("Cannot iterate over a scalar tensor.")

TypeError: Cannot iterate over a scalar tensor.


	 [[{{node EagerPyFunc}}]] [Op:IteratorGetNext]

### Build the model
Build the model using keras functional API

We will need to reshape the labels array so that they have the approriate dimensions for the training

In [None]:
#train_x is the concatenation of the tf vectors for the claim
x = train_x
labels_3 = train_labels

dim = train_x.shape[1]
num_examples = train_x.shape[0]
lr = 0.001
                
# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE

def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        num_batches = num_examples/n_samples
        counter = 0
        if counter == 0:
            idx = np.arange(num_examples)
            np.random.shuffle(idx)
        
        while counter < num_batches:
            index_batch = idx[n_samples*counter:n_samples*(counter+1)]
            counter += 1
            rec = x[index_batch, :].todense()
            if len(rec) == n_samples:
                yield rec, labels_3[index_batch]
        counter = 0

    return generator

samples = 500
#we are handling the batching with the samples, set the batch_size to 1, don't let dataset do any batching, the generator already does
batch_size = 1
epochs = 10

# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen, 
    output_types=(np.float32, np.int32), 
    output_shapes=((samples, dim), (samples, 3))
)

#we are handling the batching with the samples, set the batch_size to 1, don't let dataset do any batching, the generator already does
dataset = dataset.batch(batch_size, drop_remainder=True)

# Prepare model.

inp = keras.Input(shape=(None, dim), sparse=False)
x1 = Dense(100, activation='relu')(inp)
x2 = Dropout(0.4)(x1)
x3 = keras.layers.Dense(3, activation='softmax')(x2)
model = keras.Model(inp, x3)
model.compile(loss='categorical_crossentropy',
          optimizer=tf.keras.optimizers.Adam(lr=lr), 
          metrics=['accuracy'])
model.summary()


stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=8)

# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs, callbacks=[stop_early], validation_data=(dev_x, dev_labels))

In [33]:
train_labels = np.zeros(shape=(len(train_y),3))
for idx, val in enumerate(train_y):
    train_labels[idx][val]=1
print("A peek a the reshaped labels:")
print(train_labels[:5])
print("The datatypes of the training dataset, features={}, labels={}".format(type(train_x), type(train_labels)))

A peek a the reshaped labels:
[[1. 1. 0.]
 [1. 1. 0.]
 [1. 1. 0.]
 [1. 1. 0.]
 [1. 1. 0.]]
The datatypes of the training dataset, features=<class 'scipy.sparse.coo.coo_matrix'>, labels=<class 'numpy.ndarray'>


##### Dealing with sparse matrix in keras
Some useful code is [here](https://stackoverflow.com/questions/37609892/keras-sparse-matrix-issue)

The type of sparse matrix we have created is `scipy.sparse.coo.coo_matrix`, we will need to convert it to `scipy.sparse.csr.csr_matrix`

In [34]:
from scipy import sparse
train_x=sparse.csr_matrix(train_x)
print("The datatypes of the training dataset, features={}, labels={}".format(type(train_x), type(train_labels)))

The datatypes of the training dataset, features=<class 'scipy.sparse.csr.csr_matrix'>, labels=<class 'numpy.ndarray'>


In [35]:
dev_x, dev_y = dev_feats
dev_x=sparse.csr_matrix(dev_x)
dev_labels = np.zeros(shape=(len(dev_y),3))
for idx, val in enumerate(dev_y):
    dev_labels[idx][val]=1
print("A peek a the reshaped dev labels:")
dev_labels[:5]

A peek a the reshaped dev labels:


array([[1., 1., 0.],
       [1., 1., 0.],
       [1., 1., 0.],
       [1., 1., 0.],
       [1., 1., 0.]])

#### Save these sparse matrices 
As npz files so that we can continue this training on other system without having to pull in the expensive large fever datasets




In [36]:
ls working/data

dev_labels.npz   embedding_mappings_300d.npz  test_y_tests.npz
dev_x.npz        fever_vocab.txt              train_labels.npz
dev_y_preds.npz  [0m[01;34mout[0m/                         [01;34mtraining[0m/
dev_y_tests.npz  test_y_preds.npz


In [234]:
train_x_file = "working/data/train_x.npz"
np.savez(train_x_file, train_x)
train_lbl_file = "working/data/train_labels.npz"
np.savez(train_lbl_file, train_labels)


dev_x_file = "working/data/dev_x.npz"
np.savez(dev_x_file, dev_x)
dev_lbl_file = "working/data/dev_labels.npz"
np.savez(dev_lbl_file, dev_labels)

In [235]:
ls -lth working/data

total 45M
-rw-r--r-- 1 root root 235K Jul 12 12:49 dev_labels.npz
-rw-r--r-- 1 root root 2.3M Jul 12 12:49 dev_x.npz
-rw-r--r-- 1 root root 3.4M Jul 12 12:49 train_labels.npz
-rw-r--r-- 1 root root  39M Jul 12 12:49 train_x.npz
-rw-r--r-- 1 root root    0 Jul  5 07:26 matching_page_sentences.jsonl
drwxr-xr-x 5 root root  160 Jul  5 07:26 [0m[01;34mtraining[0m/
-rw-r--r-- 1 root root    0 Jul  5 07:26 claim_texts.jsonl


#### Use batching via data generators

We did not use data generators for training. Let's use data generators that will feed data to out training.

We will need to write a custom generator here since we are using scipy spare matrix and not tensors.

The generator will be called repeatedly by the trainer (model.fit) and each time it is called, we will need to return it a set of data from the dataset.

The batch size will be controller by the caller, i.e. the trainer, but we will need to keep track of the records we are sending back so that we know when to reset and loop over.

The generator must be iterable and would keep a track of the number of batches we will need to create and track the records we are sending.

If the number of records is not perfectly divisible by batch_size, we will run into issues with the generator. For now, we will deal with it by dropping the last set of records if there are fewer than batch_size records in the last batch.

Since we are dealing with scipy sparse matrix, we would not be able to send in the data as argument to the generator. We will therefore hardcode the values in the generator.

In [37]:
import numpy as np
import tensorflow as tf

#train_x is the concatenation of the tf vectors for the claim
x = train_x
labels_3 = train_labels

dim = train_x.shape[1]
num_examples = train_x.shape[0]
lr = 0.001
                
# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE

def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        num_batches = num_examples/n_samples
        counter = 0
        if counter == 0:
            idx = np.arange(num_examples)
            np.random.shuffle(idx)
        
        while counter < num_batches:
            index_batch = idx[n_samples*counter:n_samples*(counter+1)]
            counter += 1
            rec = x[index_batch, :].todense()
            if len(rec) == n_samples:
                yield rec, labels_3[index_batch]
        counter = 0

    return generator

samples = 500
#we are handling the batching with the samples, set the batch_size to 1, don't let dataset do any batching, the generator already does
batch_size = 1
epochs = 10

# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen, 
    output_types=(np.float32, np.int32), 
    output_shapes=((samples, dim), (samples, 3))
)

#we are handling the batching with the samples, set the batch_size to 1, don't let dataset do any batching, the generator already does
dataset = dataset.batch(batch_size, drop_remainder=True)

# Prepare model.

inp = keras.Input(shape=(None, dim), sparse=False)
x1 = Dense(100, activation='relu')(inp)
x2 = Dropout(0.4)(x1)
x3 = keras.layers.Dense(3, activation='softmax')(x2)
model = keras.Model(inp, x3)
model.compile(loss='categorical_crossentropy',
          optimizer=tf.keras.optimizers.Adam(lr=lr), 
          metrics=['accuracy'])
model.summary()


stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=8)

# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs, callbacks=[stop_early], validation_data=(dev_x, dev_labels))

  "The `lr` argument is deprecated, use `learning_rate` instead.")


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None, 10001)]     0         
_________________________________________________________________
dense (Dense)                (None, None, 100)         1000200   
_________________________________________________________________
dropout (Dropout)            (None, None, 100)         0         
_________________________________________________________________
dense_1 (Dense)              (None, None, 3)           303       
Total params: 1,000,503
Trainable params: 1,000,503
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10

KeyboardInterrupt: 

In [28]:
_, dev_acc = model.evaluate(dev_x, dev_labels, verbose=0)
dev_acc

0.6488648653030396

### OBSOLETE: do not refer
#### NLI model

Now that we have a baseline model, we will try an NLI model to see if we can improve on the benchmark we have just set.

First, we would have to build our data generator.

In [236]:
data_formatted[:5]

[{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'Roman Atwood is a content creator .',
  'evidence': [('Roman_Atwood', 1), ('Roman_Atwood', 3)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'History of art includes architecture , dance , sculpture , music , painting , poetry literature , theatre , narrative , film , photography and graphic arts .',
  'evidence': [('History_of_art', 2)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'Adrienne Bailon is an accountant .',
  'evidence': [('Adrienne_Bailon', 0)],
  'label': 1,
  'label_text': 'REFUTES'},
 {'claim': 'System of a Down briefly disbanded in limbo .',
  'evidence': [('In_Limbo', -1)],
  'label': 2,
  'label_text': 'NOT ENOUGH INFO'}]

In [30]:
x.shape

(145449, 10001)

In [32]:
labels_3.shape

(145449, 3)

In [223]:
def get_data_generator():
    for data, lbl in zip(x, labels_3):
        d = data.todense()
        #note:  d is a matrix, this cannot be sent in as a feature to our model to train, we need to reshape this into an array
        d = np.asarray(d).reshape(-1)
        yield d, lbl

In [227]:
def get_dataset():
    generator = lambda: get_data_generator()
    return tf.data.Dataset.from_generator(
            generator, output_signature=(
            tf.TensorSpec(shape=(10001, ), dtype=tf.int32),
            tf.TensorSpec(shape=(3, ), dtype=tf.int32,)))

In [228]:
for d in get_dataset().take(5):
    print(d[0])

tf.Tensor([0 0 0 ... 0 0 0], shape=(10001,), dtype=int32)
tf.Tensor([0 0 0 ... 0 0 0], shape=(10001,), dtype=int32)
tf.Tensor([0 0 0 ... 0 0 0], shape=(10001,), dtype=int32)
tf.Tensor([0 0 0 ... 0 0 0], shape=(10001,), dtype=int32)
tf.Tensor([0 0 0 ... 0 0 0], shape=(10001,), dtype=int32)


In [229]:
BUFFER_SIZE = 3200
BATCH_SIZE = 16
ds_train = get_dataset()
ds_train = ds_train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

#### Build the network

In [230]:
for d in ds_train.take(1):
    print(d[0].shape)

(16, 10001)


In [231]:

dim = 50
vocab_size = 8000
inp = keras.Input(shape=(None, ))

embedding_layer = tf.keras.layers.Embedding(
        input_dim=vocab_size+1,
        output_dim=dim)

x1 = embedding_layer(inp)

lstm_layer1 = tf.keras.layers.Bidirectional(tf.keras.layers.RNN(tf.keras.layers.LSTMCell(dim)))(x1)

x2 = Dense(100, activation='relu')(lstm_layer1)
x3 = Dropout(0.1)(x2)
output = keras.layers.Dense(3, activation='softmax')(x3)
model = keras.Model(inputs=inp, outputs=output)
model.compile(loss='categorical_crossentropy',
          optimizer=tf.keras.optimizers.Adam(lr=lr), 
          metrics=['accuracy'])
model.summary()
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=8)

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        [(None, None)]            0         
_________________________________________________________________
embedding_9 (Embedding)      (None, None, 50)          400050    
_________________________________________________________________
bidirectional_8 (Bidirection (None, 100)               40400     
_________________________________________________________________
dense_12 (Dense)             (None, 100)               10100     
_________________________________________________________________
dropout_6 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 3)                 303       
Total params: 450,853
Trainable params: 450,853
Non-trainable params: 0
_____________________________________________________

In [232]:
model.fit(ds_train, epochs=epochs, callbacks=[stop_early], validation_data=(dev_x, dev_labels))

Epoch 1/10
      3/Unknown - 46s 14s/step - loss: 1.0982 - accuracy: 0.3333

KeyboardInterrupt: 