#### FEVER dataset processing

<h5>Process the claims in the fever dataset</h5>

The code available in the FEVER repos no longer seems to work, and most of the repos do not provide steps for preparing the training dataset for the NLI task.

In this notebook, we prepare the training dataset that will be the input to the NLI task.

We use the following repos for reference code:

- [fever-baselines](https://github.com/klimzaporojets/fever-baselines.git)
- [fever-allennlp-reader](https://github.com/j6mes/fever-allennlp-reader)
- [fever-allennlp](https://github.com/j6mes/fever-allennlp)

Note that AllenNLP is used here only for the NLI training, with models such as Decomposable Attention, ELMo + ESIM, and ESIM. In this notebook, we first focus on extracting the data from the pre-processed Wiki corpus provided by [fever.ai](https://fever.ai/dataset/fever.html).

The data is available in a [docker image](https://hub.docker.com/r/feverai/common), 21 GB in size. A container is created from that image, and its `/local/` volume is mounted and made available to our [container](https://github.com/dmayukh/2021-summer-main/blob/master/project/fever-allennlp/Dockerfile).

We have made some modifications to the code from the repos we pulled and the changes are available [here](https://github.com/dmayukh/2021-summer-main/tree/master/project/fever-allennlp) 

We will install a few dependencies such as:
- numpy>=1.15
- regex
- allennlp
- fever-scorer
- fever-drqa

We do not really need allennlp or fever-scorer yet; for now we only need DrQA. I would prefer to use DrQA from the official GitHub repository, but for now we will go with the version prepackaged by [j6mes](https://pypi.org/project/fever-drqa/).

First, we configure the paths so that we are able to find the correct modules that were changed

In [2]:
import argparse
import json
from multiprocessing.pool import ThreadPool

Create the database from the DB file that contains the preprocessed Wiki pages. This DB was made available to us by FEVER.

FeverDocDB is a simple wrapper that opens a SQLite3 connection to the database and provides methods that run simple SELECT queries to fetch document ids and to fetch the lines of a given document.

We will not require this in the first pass of our work here, since we are only interested in finding the documents closest to a claim text.

The function that fetches the lines of a document is what uses the database connection. To find the closest documents for a given claim, we use the ranker, which relies on a pre-created TF-IDF index to locate document ids given a claim text.

The pre-created index is available in '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'


Sample data from training file:

> {"id": 75397, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", "evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7], [92206, 104971, "Fox_Broadcasting_Company", 0]]]}

A closer look at a single evidence entry:

> [92206, 104971, "Nikolaj_Coster-Waldau", 7]

92206 and 104971 are the annotation ids, "Nikolaj_Coster-Waldau" is the evidence page, and 7 is the line number within that page.
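As a minimal sketch of how this nesting can be unpacked, the snippet below flattens the evidence groups of the sample record into (page, line) pairs. The helper name `evidence_pairs` is hypothetical and is not part of the FEVER code.

```python
import json

# The sample training record quoted above, as a single JSON line.
raw = (
    '{"id": 75397, "verifiable": "VERIFIABLE", "label": "SUPPORTS", '
    '"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", '
    '"evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7], '
    '[92206, 104971, "Fox_Broadcasting_Company", 0]]]}'
)

def evidence_pairs(record):
    """Hypothetical helper: flatten evidence groups into (page, line) pairs."""
    return [(ev[2], ev[3]) for group in record["evidence"] for ev in group]

record = json.loads(raw)
pairs = evidence_pairs(record)
print(pairs)
```

The outer list holds evidence groups, and each group holds one or more `[annotation_id, evidence_id, page, line]` entries.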


#### Formatting the input text

The training of the model is done on the evidence provided by the human annotators, therefore we use the 'evidence' to run our training.

After formatting, the training examples look as below and are then used to train the MLP:

> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'}

The baseline model is a simple MLP that uses the count vectorizer to vectorize the claim text and the evidence page texts. It also uses an additional feature which is the cosine similarity between the vectorized claim text and the vectorized combined texts from all the evidences.

The vectorizers are saved to the filesystem so that they can be used later for transforming incoming sentences.

TODO: not sure why the specific evidence lines are not used for the training.

The trained model is used to run eval on the dev dataset of the same format.

TODO: inference is not explicitly done in this code. We will have to do inference which is most likely going to be done as follows:

Given a claim, use the ranker to fetch the 5 closest pages from the DB. Create the features by using the saved vectorizers.

Predict the label of the example, i.e. which class it belongs to: 'SUPPORTS', 'REFUTES' or 'NOT ENOUGH INFO'.
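The inference steps described here could be wired together roughly as follows. This is only a sketch: `predict_label` and the three `fake_*` functions are hypothetical stand-ins for the TF-IDF ranker, the saved vectorizers and the trained MLP, so that the flow can be run without any of them.

```python
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

def predict_label(claim, closest_pages, featurize, score, k=5):
    """Hypothetical inference flow: retrieve pages, build features, pick a label.

    closest_pages(claim, k) -> list of page ids    (stand-in for the ranker)
    featurize(claim, pages) -> feature vector      (stand-in for the vectorizers)
    score(features)         -> one score per label (stand-in for the MLP)
    """
    pages = closest_pages(claim, k)
    features = featurize(claim, pages)
    scores = score(features)
    best = max(range(len(LABELS)), key=lambda i: scores[i])
    return LABELS[best], pages

# Toy stand-ins so the sketch runs without the index, vectorizers or model.
def fake_ranker(claim, k):
    return ["Page_%d" % i for i in range(k)]

def fake_featurize(claim, pages):
    return [len(claim), len(pages)]

def fake_score(features):
    return [0.2, 0.7, 0.1]

label, pages = predict_label("Some claim.", fake_ranker, fake_featurize, fake_score)
print(label, pages)
```

Passing the three components in as callables keeps the flow testable; in the notebook they would be replaced by the real ranker, vectorizers and model.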


<h5>Retrieval of the evidence</h5>

We also attempt to extract the evidence from the corresponding pages

First, using the tfidf doc ranker, we extract the top 5 pages that are similar to the claim text


> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .', 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)], 'label': 0, 'label_text': 'SUPPORTS', 'predicted_pages': [('Coster', 498.82682448841246), ('Nikolaj', 348.42021460316823), ('The_Other_Woman_-LRB-2014_film-RRB-', 316.8405030379064), ('Nikolaj_Coster-Waldau', 316.8405030379064), ('Nukaaka_Coster-Waldau', 292.47605893902585)]}

For each of the pages, we extract the lines from the page text and use 'online tfidf ranker' to fetch the closest matching lines from the text.

The examples are then formatted as below and used to run EVAL on the MLP model:


> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
 'label': 0,
 'label_text': 'SUPPORTS',
 'predicted_pages': [('Coster', 498.82682448841246),
  ('Nikolaj', 348.42021460316823),
  ('The_Other_Woman_-LRB-2014_film-RRB-', 316.8405030379064),
  ('Nikolaj_Coster-Waldau', 316.8405030379064),
  ('Nukaaka_Coster-Waldau', 292.47605893902585)],
 'predicted_sentences': [('Nikolaj', 7),
  ('The_Other_Woman_-LRB-2014_film-RRB-', 1),
  ('Nukaaka_Coster-Waldau', 1),
  ('Coster', 63),
  ('Nikolaj_Coster-Waldau', 0)]}
  
Scoring the evidence predictor is not straightforward; we use the fever-scorer to score the predictions.
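One reason scoring is not straightforward is that, as far as I understand the FEVER convention, a prediction counts as fully correct only when the label matches and the predicted evidence covers at least one complete gold evidence group. The sketch below illustrates that idea under this assumption. `strict_correct` is a hypothetical helper, not the fever-scorer implementation, which handles further details.

```python
def strict_correct(pred_label, pred_evidence, gold_label, gold_groups):
    """Simplified strict check for one claim (illustration only)."""
    if pred_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True  # no evidence is required for this class
    predicted = set(pred_evidence)
    # At least one complete gold evidence group must be covered.
    return any(set(group) <= predicted for group in gold_groups)

gold_groups = [[("Nikolaj_Coster-Waldau", 7), ("Fox_Broadcasting_Company", 0)]]
full = [("Nikolaj_Coster-Waldau", 7), ("Fox_Broadcasting_Company", 0), ("Coster", 63)]
partial = [("Nikolaj_Coster-Waldau", 7)]

full_ok = strict_correct("SUPPORTS", full, "SUPPORTS", gold_groups)
partial_ok = strict_correct("SUPPORTS", partial, "SUPPORTS", gold_groups)
print(full_ok, partial_ok)
```

In this toy case the partial prediction has the right label but misses the second sentence of the only gold group, so it is not counted as correct.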

In [5]:
!tail /local/fever-common/data/fever-data/train.jsonl

{"id": 28978, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Kenneth Branagh is a follower.", "evidence": [[[45096, null, null, null]]]}
{"id": 225357, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Absolute Beginners starred David Bowie.", "evidence": [[[268538, 265101, "Absolute_Beginners_-LRB-film-RRB-", 1]]]}
{"id": 116046, "verifiable": "VERIFIABLE", "label": "REFUTES", "claim": "Neil Young is not a singer-songwriter.", "evidence": [[[282523, 276718, "Neil_Young", 0]], [[284488, 278309, "Neil_Young", 0]], [[285346, 278997, "Neil_Young", 0]], [[330685, 317288, "Neil_Young", 0]], [[330685, 317289, "Neil_Young", 7]], [[331800, 318260, "Neil_Young", 0]]]}
{"id": 180974, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Jeff Goldblum starred in a film.", "evidence": [[[210069, 217835, "Jeff_Goldblum", 0]], [[210069, 217836, "Jeff_Goldblum", 3]], [[210069, 217837, "Jeff_Goldblum", 6], [210069, 217837, "Adam_Resurrected", 4], [210069, 217837, "

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
!head /local/fever-common/data/fever-data/paper_test.jsonl

{"id": 113501, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Grease had bad reviews.", "evidence": [[[133128, null, null, null]]]}
{"id": 163803, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Ukrainian Soviet Socialist Republic was a founding participant of the UN.", "evidence": [[[296950, 288668, "Ukrainian_Soviet_Socialist_Republic", 7]], [[298602, 290067, "Ukrainian_Soviet_Socialist_Republic", 7], [298602, 290067, "United_Nations", 0]], [[300696, 291816, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344347, 327887, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344994, 328433, "Ukrainian_Soviet_Socialist_Republic", 7]], [[344997, 328435, "Ukrainian_Soviet_Socialist_Republic", 7]]]}
{"id": 70041, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "2 Hearts is a musical composition by Minogue.", "evidence": [[[225394, 230056, "2_Hearts_-LRB-Kylie_Minogue_song-RRB-", 0]], [[317953, 306972, "2_Hearts_-LRB-Kylie_Minogue_song-RRB-", 0]], [[319638

#### Create the training dataset

The training examples have three (3) classes:
- SUPPORTS
- REFUTES
- NOT ENOUGH INFO

For the 'NOT ENOUGH INFO' class, the evidences are set to None. This would cause problems with training since we would still like to generate features for the samples which have been put in this class.

Next, we will loop over the records in the training dataset to create the training records. Specifically, we would be generating evidences for the samples in the 'NOT ENOUGH INFO' class so that the None values now have some page information.

Our strategy for dealing with missing evidence for the 'NOT ENOUGH INFO' class is to find the pages that are closest to the claims based on TF-IDF similarity. The TF-IDF index of the documents in the FEVER DB is already precomputed and made available to us via the index file:

> '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'

Let's load the index file and create the ranker

In [8]:
from drqa import retriever
tdidf_npz_file = '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'
ranker = retriever.get_class('tfidf')(tfidf_path=tdidf_npz_file)

Create the directory where we will save our prepared datasets

The raw training data is available at 

> /local/fever-common/data/fever-data/train.jsonl

The raw dev data from the FEVER paper is available at 

> /local/fever-common/data/fever-data/paper_dev.jsonl

In [9]:
!mkdir -p working/data/training

In [10]:
import json
from tqdm import tqdm

def prepare_dataset(split, k=5):
    fever_root = '/local/fever-common/'
    working_dir = 'working/data/'
    print("Saving prepared dataset to {}".format("training/{0}.ns.pages.p{1}.jsonl".format(split,k)))
    with open(fever_root + "data/fever-data/{0}.jsonl".format(split),"r") as f_in:
        with open(working_dir + "training/{0}.ns.pages.p{1}.jsonl".format(split,k),"w+") as f_out:
            for line in tqdm(f_in.readlines()):
                line = json.loads(line)
                if line["label"] == "NOT ENOUGH INFO":
                        doc_names, doc_scores = ranker.closest_docs(line['claim'], k)
                        pp = list(doc_names)

                        for idx,evidence_group in enumerate(line['evidence']):
                            for evidence in evidence_group:
                                if idx<len(pp):
                                    evidence[2] = pp[idx]
                                    evidence[3] = -1
                
                f_out.write(json.dumps(line) + "\n")
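To see what the loop above does to a single 'NOT ENOUGH INFO' record, here is a sketch that mirrors the loop body with the ranker passed in as a callable. `fill_nei_evidence` and `fake_closest_docs` are hypothetical names, and the page names returned by the fake ranker are made up for illustration.

```python
def fill_nei_evidence(record, closest_docs, k=5):
    """Mirror of the loop body above, with the ranker injected as a callable."""
    if record["label"] == "NOT ENOUGH INFO":
        doc_names, _scores = closest_docs(record["claim"], k)
        pages = list(doc_names)
        for idx, group in enumerate(record["evidence"]):
            for evidence in group:
                if idx < len(pages):
                    evidence[2] = pages[idx]  # page id taken from the ranker
                    evidence[3] = -1          # -1 marks "no specific line"
    return record

def fake_closest_docs(claim, k):
    # Hypothetical stand-in for ranker.closest_docs; returns (names, scores).
    names = ["Kenneth_Branagh", "Follower"][:k]
    return names, [2.0, 1.0][:k]

record = {
    "label": "NOT ENOUGH INFO",
    "claim": "Kenneth Branagh is a follower.",
    "evidence": [[[45096, None, None, None]]],
}
filled = fill_nei_evidence(record, fake_closest_docs)
print(filled["evidence"])
```

Note that `idx` indexes evidence groups, not ranked pages, so a record with a single evidence group receives only the top-ranked page even when `k` is 5.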

##### Prepare the training dataset

In [None]:
!rm -rf working/data/training/train.ns.pages.p5.jsonl
prepare_dataset('train', 5)

##### Prepare the dev dataset

In [13]:
!rm -rf working/data/training/paper_dev.ns.pages.p5.jsonl
prepare_dataset('paper_dev', 5)

  0%|          | 0/9999 [00:00<?, ?it/s]

Saving prepared dataset to training/paper_dev.ns.pages.p5.jsonl


100%|██████████| 9999/9999 [02:58<00:00, 56.11it/s]


In [14]:
!wc -l  working/data/training/*

    9999 working/data/training/paper_dev.ns.pages.p5.jsonl
  145449 working/data/training/train.ns.pages.p5.jsonl
    2193 working/data/training/train.pages.p5.jsonl
  157641 total


#### Building the feature sets

Using the training data and dev data we generated, we will create the vectorizers and save them to local files

The training and dev data is available at 

> working/data/training/train.ns.pages.p5.jsonl 

> working/data/training/paper_dev.ns.pages.p5.jsonl

The key information we need from the training samples is the claim text and the text of the evidence pages.

For each training example, generate:
- a tokenized claim, 
- the label id, 
- the label text, 
- list of wiki pages that were provided as evidence.

This is done using a custom formatter, `training_line_formatter`, which we write below.

In [11]:
from nltk import word_tokenize

class LabelSchema:
    def __init__(self,labels):
        self.labels = {self.preprocess(val):idx for idx,val in enumerate(labels)}
        self.idx = {idx:self.preprocess(val) for idx,val in enumerate(labels)}

    def get_id(self,label):
        if self.preprocess(label) in self.labels:
            return self.labels[self.preprocess(label)]
        return None

    def preprocess(self,item):
        return item.lower()

class FEVERLabelSchema(LabelSchema):
    def __init__(self):
        super().__init__(["supports", "refutes", "not enough info"])

def nltk_tokenizer(text):
    return " ".join(word_tokenize(text))

class training_line_formatter():
    def __init__(self):
        self.tokenize = nltk_tokenizer
        
    def format(self, lines):
        formatted = []
        for line in tqdm(lines):
            fl = self.format_line(line)
            if fl is not None:
                if isinstance(fl,list):
                    formatted.extend(fl)
                else:
                    formatted.append(fl)
        return formatted

    def format_line(self, line):
        label_schema = FEVERLabelSchema()
        # get the label, i.e. SUPPORTS etc.
        annotation = line["label"]
        if annotation is None:
            annotation = line["verifiable"]
        pages = []

        # did we get the closest sentences to the claim text? is this the sentence or the line number from the doc text?
        if 'predicted_sentences' in line:
            pages.extend([(ev[0], ev[1]) for ev in line["predicted_sentences"]])
        elif 'predicted_pages' in line:
            pages.extend([(ev[0], -1) for ev in line["predicted_pages"]])
        else:
            # these are the human annotated evidence available in the original training file
            for evidence_group in line["evidence"]:
                pages.extend([(ev[2], ev[3]) for ev in evidence_group])

    #     if self.filtering is not None:
    #         for page, _ in pages:
    #             if self.filtering({"id": page}) is None:
    #                 return None

        return {"claim": self.tokenize(line["claim"]), "evidence": pages, "label": label_schema.get_id(annotation),
                "label_text": annotation}

In [8]:
class Reader:
    def __init__(self,encoding="utf-8"):
        self.enc = encoding

    def read(self,file):
        with open(file,"r",encoding = self.enc) as f:
            return self.process(f)

    def process(self,f):
        pass

class JSONLineReader(Reader):
    def process(self,fp):
        data = []
        for line in tqdm(fp.readlines()):
            data.append(json.loads(line.strip()))
        return data
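The reader can be exercised without touching the filesystem. The sketch below applies the same line-by-line parsing to an in-memory buffer; `read_jsonl` is a hypothetical standalone function and the two records are toy data.

```python
import io
import json

def read_jsonl(fp):
    """Parse one JSON object per non-empty line of a file-like object."""
    return [json.loads(line) for line in fp if line.strip()]

# Toy two-line JSONL payload standing in for a prepared dataset file.
sample = io.StringIO(
    '{"id": 1, "label": "SUPPORTS"}\n'
    '{"id": 2, "label": "REFUTES"}\n'
)
rows = read_jsonl(sample)
print(len(rows), rows[1]["label"])
```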

Use k=5, which is the number of closest documents we used to prepare our datasets. It is used here only to load the correct input file for the dataset formatting.

In [9]:
import json
from tqdm import tqdm
jlr = JSONLineReader()
split = 'train'
working_dir = 'working/data/'
k = 5
training_data_file = working_dir + "training/{0}.ns.pages.p{1}.jsonl".format(split, k)
data = jlr.read(training_data_file)

100%|██████████| 145449/145449 [00:02<00:00, 56289.51it/s]


We will need to format the training data so that we can extract the claim and the body text

In [12]:
formatter = training_line_formatter()
formatted_train_data = formatter.format(data)

100%|██████████| 145449/145449 [00:22<00:00, 6488.27it/s]


In [13]:
formatted_train_data[:1]

[{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'}]

Each formatted training example now looks like:
    
> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'}
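The evidence in this record comes from the human annotations because the line has no predicted fields. `format_line` checks the possible sources in a fixed order, which the following sketch isolates; `select_pages` is a hypothetical helper and the score value is a toy number.

```python
def select_pages(line):
    """Same precedence as format_line: sentences, then pages, then gold evidence."""
    if "predicted_sentences" in line:
        return [(ev[0], ev[1]) for ev in line["predicted_sentences"]]
    if "predicted_pages" in line:
        return [(ev[0], -1) for ev in line["predicted_pages"]]
    return [(ev[2], ev[3]) for group in line["evidence"] for ev in group]

gold_only = {"evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7]]]}
with_pages = dict(gold_only, predicted_pages=[("Coster", 498.8)])
with_sents = dict(with_pages, predicted_sentences=[("Nikolaj", 7)])

from_gold = select_pages(gold_only)
from_pages = select_pages(with_pages)
from_sents = select_pages(with_sents)
print(from_gold, from_pages, from_sents)
```

Predicted pages carry no line number, so they are paired with -1, the same marker used for the generated 'NOT ENOUGH INFO' evidence.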

In [14]:
data_formatted = []
data_formatted.extend(filter(lambda record: record is not None, formatted_train_data))
data_formatted

[{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
  'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'Roman Atwood is a content creator .',
  'evidence': [('Roman_Atwood', 1), ('Roman_Atwood', 3)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'History of art includes architecture , dance , sculpture , music , painting , poetry literature , theatre , narrative , film , photography and graphic arts .',
  'evidence': [('History_of_art', 2)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'Adrienne Bailon is an accountant .',
  'evidence': [('Adrienne_Bailon', 0)],
  'label': 1,
  'label_text': 'REFUTES'},
 {'claim': 'System of a Down briefly disbanded in limbo .',
  'evidence': [('In_Limbo', -1)],
  'label': 2,
  'label_text': 'NOT ENOUGH INFO'},
 {'claim': 'Homeland is an American television spy thriller based on the Israeli television series Prisoners of War .',

In [15]:
len(data_formatted)

145449

We do the same for the dev dataset as well

In [16]:
import json
from tqdm import tqdm
jlr = JSONLineReader()
split = 'paper_dev'
working_dir = 'working/data/'
k = 5
dev_data_file = working_dir + "training/{0}.ns.pages.p{1}.jsonl".format(split, k)
dev_data = jlr.read(dev_data_file)

formatter = training_line_formatter()
formatted_dev_data = formatter.format(dev_data)

dev_data_formatted = []
dev_data_formatted.extend(filter(lambda record: record is not None, formatted_dev_data))
dev_data_formatted

100%|██████████| 9999/9999 [00:00<00:00, 16821.47it/s]
100%|██████████| 9999/9999 [00:02<00:00, 4739.65it/s]


[{'claim': 'Colin Kaepernick became a starting quarterback during the 49ers 63rd season in the National Football League .',
  'evidence': [('Colin_Kaepernick', -1)],
  'label': 2,
  'label_text': 'NOT ENOUGH INFO'},
 {'claim': 'Tilda Swinton is a vegan .',
  'evidence': [('Swinton_-LRB-surname-RRB-', -1)],
  'label': 2,
  'label_text': 'NOT ENOUGH INFO'},
 {'claim': 'Fox 2000 Pictures released the film Soul Food .',
  'evidence': [('Soul_Food_-LRB-film-RRB-', 0),
   ('Soul_Food_-LRB-film-RRB-', 0),
   ('Soul_Food_-LRB-film-RRB-', 0),
   ('Soul_Food_-LRB-film-RRB-', 0),
   ('Soul_Food_-LRB-film-RRB-', 0)],
  'label': 0,
  'label_text': 'SUPPORTS'},
 {'claim': 'Anne Rice was born in New Jersey .',
  'evidence': [('List_of_Ace_titles_in_numeric_series', -1),
   ('List_of_Ace_titles_in_numeric_series', -1)],
  'label': 2,
  'label_text': 'NOT ENOUGH INFO'},
 {'claim': 'Telemundo is a English-language television network .',
  'evidence': [('Telemundo', 0),
   ('Telemundo', 1),
   ('Telemund

In [17]:
len(dev_data_formatted)

9999

#### Building the feature set
We will use the formatted training and dev data now to generate the features for our training

We only have the body ids, so we need to extract the body text for each id. We will use the provided database for that.

First create a class to handle interactions with the database

In [18]:
from drqa.retriever import DocDB, utils
class FeverDocDB(DocDB):

    def __init__(self,path=None):
        super().__init__(path)

    def get_doc_lines(self, doc_id):
        """Fetch the raw text of the doc for 'doc_id'."""
        cursor = self.connection.cursor()
        cursor.execute(
            "SELECT lines FROM documents WHERE id = ?",
            (utils.normalize(doc_id),)
        )
        result = cursor.fetchone()
        cursor.close()
        return result if result is None else result[0]

    def get_non_empty_doc_ids(self):
        """Fetch all ids of docs stored in the db."""
        cursor = self.connection.cursor()
        cursor.execute("SELECT id FROM documents WHERE length(trim(text)) > 0")
        results = [r[0] for r in cursor.fetchall()]
        cursor.close()
        return results

In [19]:
database_path = '/local/fever-common/data/fever/fever.db'
database = FeverDocDB(database_path)
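To make the query in `get_doc_lines` concrete without the 21 GB image, the sketch below runs the same SELECT against an in-memory SQLite database. The table layout (`id`, `text`, `lines`) is an assumption inferred from the queries above, the row is toy data, and the `utils.normalize` step is skipped.

```python
import sqlite3

# In-memory stand-in for fever.db; the schema is assumed from the queries above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT, lines TEXT)")
conn.execute(
    "INSERT INTO documents VALUES (?, ?, ?)",
    (
        "Toy_Page",
        "First sentence . Second sentence .",
        "0\tFirst sentence .\n1\tSecond sentence .",
    ),
)

def get_doc_lines(connection, doc_id):
    """Return the raw 'lines' column for doc_id, or None if the page is unknown."""
    cursor = connection.cursor()
    cursor.execute("SELECT lines FROM documents WHERE id = ?", (doc_id,))
    row = cursor.fetchone()
    cursor.close()
    return None if row is None else row[0]

lines = get_doc_lines(conn, "Toy_Page")
missing = get_doc_lines(conn, "No_Such_Page")
print(repr(lines), missing)
```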

Our formatted data looks like this

> {'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company .',
 'evidence': [('Nikolaj_Coster-Waldau', 7), ('Fox_Broadcasting_Company', 0)],
 'label': 0,
 'label_text': 'SUPPORTS'}
 
We will use the evidence fields to extract the supporting texts. First define some routines to extract the relevant information from the formatted lines.

In [20]:
import os
import random

import numpy as np
import torch

class SimpleRandom():
    instance = None

    def __init__(self,seed):
        self.seed = seed
        self.random = random.Random(seed)

    def next_rand(self,a,b):
        return self.random.randint(a,b)

    @staticmethod
    def get_instance():
        if SimpleRandom.instance is None:
            SimpleRandom.instance = SimpleRandom(SimpleRandom.get_seed())
        return SimpleRandom.instance

    @staticmethod
    def get_seed():
        return int(os.getenv("RANDOM_SEED", 12459))

    @staticmethod
    def set_seeds():

        torch.manual_seed(SimpleRandom.get_seed())
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(SimpleRandom.get_seed())
        np.random.seed(SimpleRandom.get_seed())
        random.seed(SimpleRandom.get_seed())

In [21]:
ename = "evidence"
def claims(data):
    return [datum["claim"] for datum in data]
def body_ids(data):
    return [[d[0] for d in datum[ename] ] for datum in data]
def flatten(l):
    return [item for sublist in l for item in sublist]
def bodies(data):
    #data = [d for d in flatten(body_ids(data)) if d]
    return [database.get_doc_text(id) for id in set(flatten(body_ids(data)))]

def texts(data):
    return [" ".join(set(instance)) for instance in body_lines(data)]

def body_lines(data):
    return [[get_doc_line(d[0],d[1]) for d in datum[ename] ] for datum in data]

def get_doc_line(doc,line):
    lines = database.get_doc_lines(doc)

#     if os.getenv("PERMISSIVE_EVIDENCE","n").lower() in ["y","yes","true","t","1"]:
#         if lines is None:
#             return ""

    if line > -1:
        return lines.split("\n")[line].split("\t")[1]
    else:
        non_empty_lines = [line.split("\t")[1] for line in lines.split("\n") if len(line.split("\t"))>1 and len(line.split("\t")[1].strip())]
        return non_empty_lines[SimpleRandom.get_instance().next_rand(0,len(non_empty_lines)-1)]    
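The two branches of `get_doc_line` can be tried on a toy `lines` string. This sketch assumes, as the code above does, that each row is `line_number<TAB>text`; `pick_line` is a hypothetical helper that takes the random generator as an argument instead of using `SimpleRandom`.

```python
import random

def pick_line(lines, line, rng):
    """Return the requested line, or a random non-empty one when line is -1."""
    rows = [row.split("\t") for row in lines.split("\n")]
    if line > -1:
        return rows[line][1]
    non_empty = [r[1] for r in rows if len(r) > 1 and r[1].strip()]
    return non_empty[rng.randint(0, len(non_empty) - 1)]

# Toy page with an empty middle line, similar to blank lines in the corpus.
lines = "0\tFirst sentence .\n1\t\n2\tThird sentence ."
rng = random.Random(12459)  # same default seed as SimpleRandom.get_seed()

exact = pick_line(lines, 2, rng)
sampled = pick_line(lines, -1, rng)
print(exact, "|", sampled)
```

The -1 branch is what turns the generated 'NOT ENOUGH INFO' evidence into an arbitrary sentence from the retrieved page.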

We will build a term-frequency vectorizer and a TF-IDF vectorizer and save them to a file.

The vocabulary will be limited to 5000 terms. For each claim and each body text, we produce a vector of dimension 5000.

We will also add the cosine similarity between the claim vector and the body text vector and use it as an additional feature.

The dimension of our feature would be then 5000 + 5000 + 1 = 10001
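The layout of the feature vector can be illustrated with a four-word vocabulary instead of 5000 terms. This is a simplified sketch using only the standard library: it uses raw counts for both the vectors and the cosine, whereas the notebook uses normalized term frequencies and computes the cosine on TF-IDF vectors. The vocabulary and sentences are toy data.

```python
import math

def count_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["fox", "company", "actor", "worked"]  # toy vocabulary
claim = "He worked with the Fox company"
body = "The actor worked with Fox"

claim_vec = count_vector(claim, vocab)
body_vec = count_vector(body, vocab)

# Same ordering as the notebook: body vector, claim vector, then the cosine.
features = body_vec + claim_vec + [cosine(claim_vec, body_vec)]
print(len(features), features)
```

With a vocabulary of size V the feature vector has 2V + 1 entries, which gives 10001 for V = 5000.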

Clean up any pre-generated feature vectors if we are going to re-run the vectorizers.

In [28]:
# !rm -rf working/models/ns_nn_sent/dev.pkl
# !rm -rf working/models/ns_nn_sent/train.pkl

##### Create the vectorizers
The bag-of-words vectorizer is fit on the training set only, while the TF-IDF vectorizer is fit on the contents of both the training and dev sets.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
CLAIMS = claims(data_formatted)
BODIES = bodies(data_formatted)
dev_claims = claims(dev_data_formatted)
dev_bodies = bodies(dev_data_formatted)
lim_unigram = 5000
stop_words = [
        "a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along",
        "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
        "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be",
        "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
        "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "co",
        "con", "could", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight",
        "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
        "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for",
        "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had",
        "has", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself",
        "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "interest",
        "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made",
        "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much",
        "must", "my", "myself", "name", "namely", "neither", "nevertheless", "next", "nine", "nobody", "now", "nowhere",
        "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours",
        "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see",
        "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some",
        "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take",
        "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby",
        "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though",
        "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve",
        "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what",
        "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon",
        "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will",
        "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
        ]
bow_vectorizer = CountVectorizer(max_features=lim_unigram,
                                         stop_words=stop_words)
bow = bow_vectorizer.fit_transform(CLAIMS + BODIES)
tfreq_vectorizer = TfidfTransformer(use_idf=False).fit(bow)
tfidf_vectorizer = TfidfVectorizer(max_features=lim_unigram,
                                           stop_words=stop_words).fit(CLAIMS + BODIES + dev_claims + dev_bodies)

The vectorizers will be saved in the 'ns_nn_sent' folder under 'working/models' so that they can be looked up later.

In [30]:
!mkdir -p working/models

Transform the claims and the body texts using the vectorizers.

In [22]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack
def process(data):
    claim_bow = bow_vectorizer.transform(claims(data))
    claim_tfs = tfreq_vectorizer.transform(claim_bow)
    claim_tfidf = tfidf_vectorizer.transform(claims(data))

    #get the text from the bodies of all the n-closest docs for the claim
    body_texts = texts(data)
    body_bow = bow_vectorizer.transform(body_texts)
    body_tfs = tfreq_vectorizer.transform(body_bow)
    body_tfidf = tfidf_vectorizer.transform(body_texts)

    cosines = np.array([cosine_similarity(c, b)[0] for c,b in zip(claim_tfidf,body_tfidf)])

    return hstack([body_tfs,claim_tfs,cosines])

In [23]:
len(tfidf_vectorizer.vocabulary_)

NameError: name 'tfidf_vectorizer' is not defined

In [24]:
import os
import pickle
model_name = 'ns_nn_sent'
base_path = 'working/models/'

def load_features(name, data):
    features = list()
    ffpath = os.path.join(base_path, model_name)
    if not os.path.exists(ffpath):
        os.mkdir(ffpath)
    if (not os.path.exists(os.path.join(ffpath, name + ".pkl"))):
        print("Saved features do not exist, creating data...")
        features = process(data)
        with open(os.path.join(ffpath, name + ".pkl"), "wb+") as f:
            pickle.dump(features, f)
    else:
        print("Loading saved feature from {}".format(os.path.join(ffpath, name + ".pkl")))
        with open(os.path.join(ffpath, name + ".pkl"), "rb") as f:
            features = pickle.load(f)
    return features
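`load_features` follows a compute-once, load-afterwards pattern. The sketch below shows the same pattern in isolation, using a temporary directory and a toy "expensive" function; `load_or_build` and `expensive` are hypothetical names.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load a pickled value if the file exists, otherwise build and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = build()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value

calls = []

def expensive():
    calls.append(1)  # record how often the features are actually computed
    return [1.0, 2.0, 3.0]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "train.pkl")
    first = load_or_build(cache, expensive)
    second = load_or_build(cache, expensive)

print(first, second, len(calls))
```

The trade-off is that a stale pickle is reused silently when the vectorizers change, which is why the cleanup cell for `train.pkl` and `dev.pkl` exists.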

Create the labels for the features

In [25]:
label_name = "label"
def labels(data):
    return [datum[label_name] for datum in data]
def out(features,ds):
    if ds is not None:
        return np.hstack(features) if len(features) > 1 else features[0], labels(ds)
    return [[]],[]

This needs to be performed only once per dataset, so we save the transformed vectors to a file and reuse them for each modelling exercise.

Check whether the saved vectors exist; if not, create them by applying the vectorizers to the:
- claim
- lines from the body of the evidence pages

In [26]:
train_fs = []
features = load_features("train", data_formatted)
train_fs.append(features)
train_feats = out(train_fs, data_formatted)

Loading saved feature from working/models/ns_nn_sent/train.pkl


In [27]:
input_shape = train_feats[0].shape[1]
print("input_shape =", input_shape)

input_shape = 10001


In [28]:
dev_fs = []
features = load_features("dev", dev_data_formatted)
dev_fs.append(features)
dev_feats = out(dev_fs, dev_data_formatted)

Loading saved feature from working/models/ns_nn_sent/dev.pkl


In [29]:
dev_feats

(<9999x10001 sparse matrix of type '<class 'numpy.float64'>'
 	with 191370 stored elements in COOrdinate format>,
 [2, 2, 0, 2, 1, 1, 0, 1, 2, 1, ...

#### Training
It's now time to build the model. We will build a simple multi-layer perceptron (MLP).

In [30]:
from torch import nn

class SimpleMLP(nn.Module):
    def __init__(self,input_dim,hidden_dim,output_dim,keep_p=.6):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_dim,hidden_dim)
        self.fc2 = nn.Linear(hidden_dim,output_dim)

        self.do = nn.Dropout(1-keep_p)
        self.relu = nn.ReLU()

    def forward(self,x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.do(x)

        x = self.fc2(x)
        x = self.do(x)
        return x

In [31]:
model = SimpleMLP(input_shape,100,3)
model

SimpleMLP(
  (fc1): Linear(in_features=10001, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=3, bias=True)
  (do): Dropout(p=0.4, inplace=False)
  (relu): ReLU()
)

Clean up any saved models

In [32]:
#rm -rf working/models/ns_nn_sent/ns_nn_sent.best.save

Define the logger, the one that will be used to monitor the model training progress

The best model will be saved at 
> working/models/ns_nn_sent/

In [33]:
import logging

import torch
class LogHelper():
    handler = None
    @staticmethod
    def setup():
        FORMAT = '[%(levelname)s] %(asctime)s - %(name)s - %(message)s'
        LogHelper.handler = logging.StreamHandler()
        LogHelper.handler.setLevel(logging.DEBUG)
        LogHelper.handler.setFormatter(logging.Formatter(FORMAT))

        LogHelper.get_logger(LogHelper.__name__).info("Log Helper set up")

    @staticmethod
    def get_logger(name,level=logging.DEBUG):
        # Note: logging.getLogger returns the same logger object for repeated calls with the same name,
        # so attach a handler only once; otherwise every message would be printed multiple times
        l = logging.getLogger(name)
        l.setLevel(level)
        if not l.handlers:
            l.addHandler(logging.StreamHandler())
        return l
    
class EarlyStopping():
    def __init__(self,name,patience=8):
        self.patience = patience
        self.best_model = None
        self.best_score = None

        self.best_epoch = 0
        self.epoch = 0
        #print("name is ", EarlyStopping.__name__)
        self.name = name
        #self.logger = LogHelper.get_logger(EarlyStopping.__name__)
        self.logger = LogHelper.get_logger(name)

    def __call__(self, model, acc):
        self.epoch += 1

        if self.best_score is None:
            self.best_score = acc

        if acc >= self.best_score:
            torch.save(model.state_dict(),"working/models/ns_nn_sent/{0}.best.save".format(self.name))
            self.best_score = acc
            self.best_epoch = self.epoch
            self.logger.info("Saving best weights from round {0}".format(self.epoch))
            return False

        elif self.epoch > self.best_epoch+self.patience:
            self.logger.info("Early stopping: Terminate")
            return True

        self.logger.info("Early stopping: Worse Round")
        return False

    def set_best_state(self,model):
        self.logger.info("Loading weights from round {0}".format(self.best_epoch))
        model.load_state_dict(torch.load("working/models/ns_nn_sent/{0}.best.save".format(self.name)))
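
To make the patience rule in `EarlyStopping.__call__` easier to follow, here is a minimal sketch that replays the same decision logic on a list of dev accuracies, without saving any model. The function name `early_stopping_trace` and the accuracy values are hypothetical and used only for illustration.

```python
def early_stopping_trace(accuracies, patience=8):
    """Replay the save / worse / stop decisions for a sequence of dev accuracies."""
    best_score = None
    best_epoch = 0
    events = []
    for epoch, acc in enumerate(accuracies, start=1):
        if best_score is None:
            best_score = acc
        if acc >= best_score:
            # New best (ties count as an improvement): weights would be saved here
            best_score, best_epoch = acc, epoch
            events.append("save")
        elif epoch > best_epoch + patience:
            # More than `patience` epochs have passed since the best epoch
            events.append("stop")
            break
        else:
            events.append("worse")
    return events, best_epoch

# Hypothetical accuracies, with a small patience so the stop is visible
events, best_epoch = early_stopping_trace([0.60, 0.62, 0.61, 0.61, 0.60, 0.61], patience=2)
print(events, best_epoch)
```

Note that with this rule training stops on the first worse round that is more than `patience` epochs after the best epoch, so `patience` worse rounds are tolerated before termination.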

#### Dataset reader

We need to batch the inputs to our model.

We define a batcher that slices the sparse feature matrix row-wise; each batch is converted to a dense tensor only when it is fed to the model.

In [34]:
import os

import numpy as np
import torch
from scipy.sparse import coo_matrix
from torch.autograd import Variable

def is_gpu():
    # os.getenv returns a string, so compare against string values only
    return os.getenv("GPU","no").lower() in ["1","yes","true","t"]

def gpu():
    if is_gpu():
        torch.cuda.set_device(int(os.getenv("CUDA_DEVICE", 0)))
        return True
    return False

class Batcher():
    def __init__(self,data,size):
        self.data = data
        self.size = size
        self.pointer = 0

        if isinstance(self.data,coo_matrix):
            self.data = self.data.tocsr()

    def __next__(self):
        if self.pointer == splen(self.data):
            # Reset so the batcher can be iterated again, then end this pass
            self.pointer = 0
            raise StopIteration
        start = self.pointer
        end = min(splen(self.data), start + self.size)  # the last batch may be smaller
        to_return = self.data[start:end]
        self.pointer = end
        return to_return, splen(to_return), start, end

    def __iter__(self):
        return self

def splen(data):
    # Sparse matrices and arrays expose .shape; plain lists do not
    try:
        return data.shape[0]
    except AttributeError:
        return len(data)

def prepare_with_labels(data,labels):
    data = data.todense()
    v = torch.FloatTensor(np.array(data))
    if gpu():
        return Variable(v.cuda()), Variable(torch.LongTensor(labels).cuda())
    return Variable(v), Variable(torch.LongTensor(labels))


def prepare(data):
    data = data.todense()
    v = torch.FloatTensor(np.array(data))
    if gpu():
        return Variable(v.cuda())
    return Variable(v)
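
The slicing done by `Batcher` can be illustrated without any sparse matrix. The sketch below yields only the `(start, end)` row bounds of each batch; `batch_bounds` is a hypothetical helper name, not part of the notebook.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) row indices for consecutive batches."""
    start = 0
    while start < n_rows:
        # The final batch is truncated when n_rows is not a multiple of batch_size
        end = min(n_rows, start + batch_size)
        yield start, end
        start = end

bounds = list(batch_bounds(10, 4))
print(bounds)
```

The same bounds are what `train` uses to pick the matching labels with `labels[start:end]`.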

In [35]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.utils import shuffle
import torch.nn.functional as F
from tqdm import tqdm

def evaluate(model,data,labels,batch_size):
    predicted = predict(model,data,batch_size)
    return accuracy_score(labels,predicted.data.numpy().reshape(-1))

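def predict(model, data, batch_size):
    batcher = Batcher(data, batch_size)

    model.eval()  # disable dropout for inference
    predicted = []
    with torch.no_grad():  # gradients are not needed for prediction
        for batch, size, start, end in batcher:
            d = prepare(batch)
            logits = model(d).cpu()

            # The index of the largest logit is the predicted class
            predicted.extend(torch.max(logits, 1)[1])
    return torch.stack(predicted)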

def train(model, fs, batch_size, lr, epochs,dev=None, clip=None, early_stopping=None,name=None):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)

    data, labels = fs
    if dev is not None:
        dev_data,dev_labels = dev

    for epoch in tqdm(range(epochs)):
        epoch_loss = 0
        epoch_data = 0

        # sklearn's shuffle returns shuffled copies; it does not shuffle in place
        data, labels = shuffle(data, labels)

        batcher = Batcher(data, batch_size)

        for batch, size, start, end in batcher:
            d,gold = prepare_with_labels(batch,labels[start:end])

            model.train()
            optimizer.zero_grad()
            logits = model(d)

            loss = F.cross_entropy(logits, gold)
            loss.backward()

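            # .item() extracts a Python float and avoids holding on to the autograd graph
            epoch_loss += loss.item()
            epoch_data += size

            if clip is not None:
                # clip_grad_norm_ is the in-place replacement for the deprecated clip_grad_norm
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

        print("Average epoch loss: {0}".format(epoch_loss/epoch_data))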

        #print("Epoch Train Accuracy {0}".format(evaluate(model, data, labels, batch_size)))
        if dev is not None:
            acc = evaluate(model,dev_data,dev_labels,batch_size)
            print("Epoch Dev Accuracy {0}".format(acc))

            if early_stopping is not None and early_stopping(model,acc):
                break

    if dev is not None and early_stopping is not None:
        early_stopping.set_best_state(model)

    # Return the trained model so that the caller receives it instead of None
    return model
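
A note on how to read the printed "Average epoch loss": `F.cross_entropy` averages over the batch by default, while `epoch_data` counts individual examples. The printed value is therefore the sum of per-batch mean losses divided by the number of examples, which for equal-sized batches is the per-example mean loss divided by the batch size. The sketch below shows this arithmetic with made-up loss values (the numbers are hypothetical, not taken from a real run).

```python
batch_size = 500
batch_mean_losses = [0.82, 0.80, 0.78]   # hypothetical per-batch mean losses
batch_sizes = [batch_size] * len(batch_mean_losses)

# What the training loop prints: sum of batch means / number of examples
printed = sum(batch_mean_losses) / sum(batch_sizes)

# The per-example mean loss: weight each batch mean by its batch size
per_example = sum(l * s for l, s in zip(batch_mean_losses, batch_sizes)) / sum(batch_sizes)

print(printed, per_example)
```

Under these assumptions, multiplying the printed value by the batch size recovers the per-example mean loss.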

In [None]:
mname = 'ns_nn_sent'
final_model = train(model, train_feats, 500, 1e-2, 90, dev_feats, early_stopping=EarlyStopping(mname))

  0%|          | 0/90 [00:00<?, ?it/s]

Average epoch loss: 0.0016196609940379858


Saving best weights from round 1
  1%|          | 1/90 [00:10<15:24, 10.39s/it]

Epoch Dev Accuracy 0.6278627862786279
Average epoch loss: 0.001501814927905798


Saving best weights from round 2
  2%|▏         | 2/90 [00:20<15:02, 10.26s/it]

Epoch Dev Accuracy 0.6342634263426342
Average epoch loss: 0.0014742235653102398


Saving best weights from round 3
  3%|▎         | 3/90 [00:38<18:25, 12.71s/it]

Epoch Dev Accuracy 0.6388638863886389
Average epoch loss: 0.0014582787407562137


Early stopping: Worse Round
  4%|▍         | 4/90 [00:48<16:57, 11.83s/it]

Epoch Dev Accuracy 0.6374637463746374
Average epoch loss: 0.0014515924267470837


Early stopping: Worse Round
  6%|▌         | 5/90 [01:08<20:14, 14.28s/it]

Epoch Dev Accuracy 0.6297629762976298
Average epoch loss: 0.001442103530280292


Saving best weights from round 6
  7%|▋         | 6/90 [01:29<22:49, 16.30s/it]

Epoch Dev Accuracy 0.6465646564656465
Average epoch loss: 0.0014398741768673062


Early stopping: Worse Round
  8%|▊         | 7/90 [01:48<23:28, 16.97s/it]

Epoch Dev Accuracy 0.6443644364436444
Average epoch loss: 0.001430192613042891


Early stopping: Worse Round
  9%|▉         | 8/90 [02:08<24:43, 18.09s/it]

Epoch Dev Accuracy 0.6458645864586459
Average epoch loss: 0.0014347969554364681


Early stopping: Worse Round
 10%|█         | 9/90 [02:18<21:03, 15.60s/it]

Epoch Dev Accuracy 0.6417641764176417


<h4> We achieve a dev set accuracy of about 64% (best logged value: 0.6466 at round 6) </h4>
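
Accuracy alone does not show which of the three classes are confused with one another. The notebook imports `confusion_matrix` from scikit-learn but does not call it in this section; the sketch below shows the same idea in plain Python. The function name `confusion_counts` and the gold and predicted labels are hypothetical and used only for illustration.

```python
def confusion_counts(gold, predicted, n_classes):
    """Return a matrix where entry [g][p] counts examples with gold label g predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for g, p in zip(gold, predicted):
        matrix[g][p] += 1
    return matrix

# Hypothetical labels for a 3-class problem
gold = [0, 1, 2, 1, 0, 2]
predicted = [0, 1, 1, 1, 2, 2]

matrix = confusion_counts(gold, predicted, 3)
# Correct predictions lie on the diagonal
accuracy = sum(matrix[i][i] for i in range(3)) / len(gold)
print(matrix, accuracy)
```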