#### FEVER dataset processing

Process the claims in the FEVER dataset.

The code in the FEVER repos no longer seems to work, and most of them do not explain how to prepare the training dataset for the NLI task.

In this notebook, we prepare the training dataset that serves as input to the NLI task.

We use the following repos for reference code:

- [fever-baselines](https://github.com/klimzaporojets/fever-baselines.git)
- [fever-allennlp-reader](https://github.com/j6mes/fever-allennlp-reader)
- [fever-allennlp](https://github.com/j6mes/fever-allennlp)

Note that AllenNLP is used here only for NLI training, with models such as Decomposable Attention, ELMo + ESIM, and ESIM. In this notebook, we first focus on extracting the data from the pre-processed Wiki corpus provided by [fever.ai](https://fever.ai/dataset/fever.html).

The data is available as a Docker image, 21 GB in size.

In [None]:
import sys  
sys.path.insert(0, 'src/fever/reader')
sys.path.insert(1, 'src/fever/evidence/retrieval_methods')

In [7]:
#!pip install -r requirements.txt

Collecting allennlp
  Downloading allennlp-2.5.0-py3-none-any.whl (681 kB)
Collecting fever-scorer
  Downloading fever-scorer-2.0.39.tar.gz (3.9 kB)
Collecting fever-drqa
  Downloading fever-drqa-1.0.13.tar.gz (43 kB)
Collecting wandb<0.11.0,>=0.10.0
  Downloading wandb-0.10.33-py2.py3-none-any.whl (1.8 MB)
Collecting boto3<2.0,>=1.14
  Downloading boto3-1.17.102-py2.py3-none-any.whl (131 kB)
Collecting checklist==0.0.11
  Downloading checklist-0.0.11.tar.gz (12.1 MB)
Collecting transformers<4.7,>=4.1
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)

In [2]:
import argparse
import json
from multiprocessing.pool import ThreadPool
from document_database import FEVERDocumentDatabase
from tqdm import tqdm

In [3]:
from top_docs import TopNDocsTopNSents

In [4]:
database_path = '/local/fever-common/data/fever/fever.db'
database = FEVERDocumentDatabase(database_path)

In [5]:
from drqa import retriever

In [6]:
index = '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'
ranker = retriever.get_class('tfidf')(tfidf_path=index)
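
The index file name appears to carry the settings the index was built with (`ngram`, `hash`, `tokenizer`). Reading the name this way is our assumption from the naming pattern, not documented behaviour. A minimal sketch with a hypothetical helper, `index_params`, that pulls these values out so they can be compared with the settings used later for sentence ranking:

```python
import re

# Path copied from the cell above; only the file name is inspected here.
index = '/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'

def index_params(path):
    """Hypothetical helper: read key=value settings out of the index file name."""
    name = path.rsplit('/', 1)[-1]
    # Values end at the next '-' or at the file extension.
    return dict(re.findall(r'(\w+)=([^-.]+)', name))

params = index_params(index)
print(params)
# The hash size in the name equals 2 ** 24.
print(int(params['hash']) == 2 ** 24)
```
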

In [9]:
ls /local/fever-common/data/fever-data/

paper_dev.jsonl   shared_task_dev.jsonl   train.jsonl
paper_test.jsonl  shared_task_test.jsonl


In [9]:
claim_text = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."
n_docs = 5
doc_names, doc_scores = ranker.closest_docs(claim_text, n_docs)
doc_names, doc_scores 

(['Coster',
  'Nikolaj',
  'The_Other_Woman_-LRB-2014_film-RRB-',
  'Nikolaj_Coster-Waldau',
  'Nukaaka_Coster-Waldau'],
 array([498.82682449, 348.4202146 , 316.84050304, 316.84050304,
        292.47605894]))
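
The page names returned by the ranker use underscores for spaces and tokens such as `-LRB-` and `-RRB-` for brackets. A small sketch for turning them into readable titles when inspecting results. The helper name `readable_title` is hypothetical, and the tokens other than `-LRB-` and `-RRB-` are assumed to follow the same convention because they do not appear in the output above:

```python
# Assumed escape tokens; only -LRB- and -RRB- are visible in the output above.
ESCAPES = {
    '-LRB-': '(',
    '-RRB-': ')',
    '-LSB-': '[',
    '-RSB-': ']',
    '-COLON-': ':',
}

def readable_title(page_name):
    """Hypothetical helper: turn a FEVER page name into a display title."""
    title = page_name.replace('_', ' ')
    for token, char in ESCAPES.items():
        title = title.replace(token, char)
    return title

print(readable_title('The_Other_Woman_-LRB-2014_film-RRB-'))
```

The original page name should still be used for database lookups; the readable form is only for display.
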

In [41]:
claim_text = "Binary cross entropy is used for classification"
n_docs = 5
doc_names, doc_scores = ranker.closest_docs(claim_text, n_docs)
doc_names, doc_scores 

(['Cross_entropy',
  'Index_of_information_theory_articles',
  'Entropy',
  'Binary_entropy_function',
  'Entropy_-LRB-information_theory-RRB-'],
 array([229.243706  , 201.72437336, 197.26505297, 184.24463662,
        169.90698121]))

In [14]:
pages = zip(doc_names, doc_scores)
sorted_pages = list(sorted(pages, reverse=True, key=lambda elem: elem[1]))
pgs = [p[0] for p in sorted_pages[:5]]
pgs

['Coster',
 'Nikolaj',
 'The_Other_Woman_-LRB-2014_film-RRB-',
 'Nikolaj_Coster-Waldau',
 'Nukaaka_Coster-Waldau']

In [15]:
lines = database.get_doc_lines(pgs[0])
lines

['0\tCoster is a Dutch occupational surname .\tDutch\tDutch-language\toccupational surname\toccupational surname',
 '1\tNotable people with the surname include :',
 '2\t',
 '3\t',
 '4\tAnne Vallayer-Coster -LRB- 1744-1818 -RRB- , French painter\tAnne Vallayer-Coster\tAnne Vallayer-Coster',
 '5\t',
 '6\tArnold Coster -LRB- born 1976 -RRB- , Dutch mountaineers\tDutch\tDutch-language\tArnold Coster\tArnold Coster',
 '7\t',
 '8\tCharles Coster -LRB- 1837-1888 -RRB- , American soldier and public official\tCharles Coster\tCharles Coster',
 '9\t',
 '10\tCharles De Coster -LRB- 1827-1879 -RRB- , Belgian novelist\tCharles De Coster\tCharles De Coster',
 '11\t',
 '12\tDick Coster -LRB- born 1946 -RRB- , Dutch sailor , father of Sven and Kalle Coster\tDutch\tDutch-language\tDick Coster\tDick Coster\tKalle Coster\tKalle Coster',
 '13\t',
 '14\tDirk Coster -LRB- 1889 -- 1950 -RRB- , Dutch physicist\tDutch\tDutch-language\tDirk Coster\tDirk Coster',
 '15\t',
 '16\tCoster -- Kronig transition named a

In [49]:
lines[0]

"0\tIn information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .\tinformation theory\tinformation theory\tprobability distribution\tprobability distribution\tentropy\tinformation entropy"

In [51]:
lines[0].split('\t')[1]

"In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution ."

In [57]:
lines[6].split('\t')

['6',
 'where is the entropy of , and is the Kullback -- Leibler divergence of from -LRB- also known as the relative entropy of p with respect to q -- note the reversal of emphasis -RRB- .',
 'entropy',
 'information entropy']
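
Each raw line is tab-separated: the line number, the sentence, and then what look like pairs of link text and linked page. The pairing of the trailing fields is our reading of the output above, not something the dataset documentation is quoted on here. A minimal sketch with a hypothetical helper, `parse_doc_line`:

```python
def parse_doc_line(raw):
    """Hypothetical helper: split one raw line into (line_no, sentence, links)."""
    fields = raw.split('\t')
    line_no = int(fields[0])
    sentence = fields[1] if len(fields) > 1 else ''
    rest = fields[2:]
    # Assumption: the remaining fields alternate link text / linked page.
    links = list(zip(rest[0::2], rest[1::2]))
    return line_no, sentence, links

raw = ('0\tCoster is a Dutch occupational surname .'
       '\tDutch\tDutch-language\toccupational surname\toccupational surname')
line_no, sentence, links = parse_doc_line(raw)
print(line_no, sentence)
print(links)
```
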

In [17]:
claim_text = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."
n_docs = 5
doc_names, doc_scores = ranker.closest_docs(claim_text, n_docs)

In [32]:
pages = zip(doc_names, doc_scores)
sorted_pages = list(sorted(pages, reverse=True, key=lambda elem: elem[1]))
pgs = [p[0] for p in sorted_pages[:5]]
pgs

['Coster',
 'Nikolaj',
 'The_Other_Woman_-LRB-2014_film-RRB-',
 'Nikolaj_Coster-Waldau',
 'Nukaaka_Coster-Waldau']

In [69]:
lines = database.get_doc_lines(pgs[0])

In [40]:
lines

['0\tCoster is a Dutch occupational surname .\tDutch\tDutch-language\toccupational surname\toccupational surname',
 '1\tNotable people with the surname include :',
 '2\t',
 '3\t',
 '4\tAnne Vallayer-Coster -LRB- 1744-1818 -RRB- , French painter\tAnne Vallayer-Coster\tAnne Vallayer-Coster',
 '5\t',
 '6\tArnold Coster -LRB- born 1976 -RRB- , Dutch mountaineers\tDutch\tDutch-language\tArnold Coster\tArnold Coster',
 '7\t',
 '8\tCharles Coster -LRB- 1837-1888 -RRB- , American soldier and public official\tCharles Coster\tCharles Coster',
 '9\t',
 '10\tCharles De Coster -LRB- 1827-1879 -RRB- , Belgian novelist\tCharles De Coster\tCharles De Coster',
 '11\t',
 '12\tDick Coster -LRB- born 1946 -RRB- , Dutch sailor , father of Sven and Kalle Coster\tDutch\tDutch-language\tDick Coster\tDick Coster\tKalle Coster\tKalle Coster',
 '13\t',
 '14\tDirk Coster -LRB- 1889 -- 1950 -RRB- , Dutch physicist\tDutch\tDutch-language\tDirk Coster\tDirk Coster',
 '15\t',
 '16\tCoster -- Kronig transition named a

In [60]:
pgs[0]

'Cross_entropy'

In [73]:
print(lines[0])
lns = [line.split("\t")[1] if len(line.split("\t")[1]) > 1 else "" for line in
                     lines]

print("Length =", len(lns))
lns

0	In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .	information theory	information theory	probability distribution	probability distribution	entropy	information entropy
Length = 20


["In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .",
 '',
 '',
 'The cross entropy for the distributions and over a given set is defined as follows :',
 '',
 '',
 'where is the entropy of , and is the Kullback -- Leibler divergence of from -LRB- also known as the relative entropy of p with respect to q -- note the reversal of emphasis -RRB- .',
 '',
 '',
 'For discrete and this means',
 '',
 '',
 'The situation for continuous distributions is analogous .',
 'We have to assume that and are absolutely continuous with respect to some reference measure -LRB- usually is a Lebesgue measure on a Borel σ-algebra -RRB- .',
 'Let and be probability density functions of and with respect to .',
 'Then',
 ''

In [74]:
p_lines = []
p_lines.extend(zip(lns, [pgs[0]] * len(lns), range(len(lns))))
p_lines

[("In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .",
  'Cross_entropy',
  0),
 ('', 'Cross_entropy', 1),
 ('', 'Cross_entropy', 2),
 ('The cross entropy for the distributions and over a given set is defined as follows :',
  'Cross_entropy',
  3),
 ('', 'Cross_entropy', 4),
 ('', 'Cross_entropy', 5),
 ('where is the entropy of , and is the Kullback -- Leibler divergence of from -LRB- also known as the relative entropy of p with respect to q -- note the reversal of emphasis -RRB- .',
  'Cross_entropy',
  6),
 ('', 'Cross_entropy', 7),
 ('', 'Cross_entropy', 8),
 ('For discrete and this means', 'Cross_entropy', 9),
 ('', 'Cross_entropy', 10),
 ('', 'Cross_entropy', 11),
 ('The situation for contin

#### Format of the extracted text
Extract the lines in the format 'sentence, page name, line number within the page'.

We need to rank the lines of each page by their similarity to the claim text, so the prepared lines are used as input to the TF-IDF ranker.

Loading the document ranker with the following line consumed too much memory:

`ranker = retriever.get_class('tfidf')(tfidf_path=index)`

So run the document retrieval first, save the lines per claim to a text file, and only then use the TF-IDF ranker.

Return the predicted page and the line numbers from the page, ranked by score.
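
One way to save the lines per claim and pick them up later is a JSONL file with one record per claim. This is only a sketch: the function names, the record layout and the file name are our assumptions, and the candidate lines are toy data in the layout described above.

```python
import json
import os
import tempfile

def save_candidates(path, claim_id, claim, lines):
    """Append one claim and its candidate lines as a single JSON record."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps({'id': claim_id, 'claim': claim, 'lines': lines}) + '\n')

def load_candidates(path):
    """Yield saved records one at a time, so the whole file is never held in memory."""
    with open(path, encoding='utf-8') as f:
        for row in f:
            yield json.loads(row)

# Toy candidate lines: sentence, page name, line number within the page.
candidates = [
    {'sentence': 'For discrete and this means', 'page': 'Cross_entropy', 'line_on_page': 9},
    {'sentence': 'Then', 'page': 'Cross_entropy', 'line_on_page': 15},
]
path = os.path.join(tempfile.mkdtemp(), 'candidates.jsonl')  # hypothetical file name
save_candidates(path, 1, 'Binary cross entropy is used for classification', candidates)
records = list(load_candidates(path))
print(len(records), records[0]['lines'][1]['line_on_page'])
```
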

##### Processing the data from the training file

Open the `train.jsonl` file and pass each claim through page prediction and sentence extraction.

Since we cannot hold the document ranker and the sentence-level TF-IDF ranker in memory together, we load the document ranker first and use it to generate the matching docs and all the sentences in them.
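
To keep memory use low while reading the claims, the file can be streamed one record at a time instead of being loaded whole. A minimal sketch, where `iter_claims` is a hypothetical helper and the in-memory sample stands in for the real file (fields trimmed):

```python
import io
import itertools
import json

def iter_claims(handle, limit=None):
    """Return an iterator over parsed claim records, stopping after `limit` records."""
    rows = (json.loads(row) for row in handle if row.strip())
    return itertools.islice(rows, limit)

# Stand-in for the claims file: two records with only the fields used here.
sample = io.StringIO(
    '{"id": 194462, "label": "NOT ENOUGH INFO", "claim": "Tilda Swinton is a vegan."}\n'
    '{"id": 137334, "label": "SUPPORTS", "claim": "Fox 2000 Pictures released the film Soul Food."}\n'
)
claims = [rec['claim'] for rec in iter_claims(sample, limit=1)]
print(claims)
```

With a real file, the same function can be called on the handle returned by `open(...)`.
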




In [16]:
# use a hardcoded list of lines to test the TF-IDF ranker
claim_text = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."
p_lines = [("In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .",
  'Cross_entropy',
  0),
 ('', 'Cross_entropy', 1),
 ('', 'Cross_entropy', 2),
 ('The cross entropy for the distributions and over a given set is defined as follows :',
  'Cross_entropy',
  3),
 ('', 'Cross_entropy', 4),
 ('', 'Cross_entropy', 5),
 ('where is the entropy of , and is the Kullback -- Leibler divergence of from -LRB- also known as the relative entropy of p with respect to q -- note the reversal of emphasis -RRB- .',
  'Cross_entropy',
  6),
 ('', 'Cross_entropy', 7),
 ('', 'Cross_entropy', 8),
 ('For discrete and this means', 'Cross_entropy', 9),
 ('', 'Cross_entropy', 10),
 ('', 'Cross_entropy', 11),
 ('The situation for continuous distributions is analogous .',
  'Cross_entropy',
  12),
 ('We have to assume that and are absolutely continuous with respect to some reference measure -LRB- usually is a Lebesgue measure on a Borel σ-algebra -RRB- .',
  'Cross_entropy',
  13),
 ('Let and be probability density functions of and with respect to .',
  'Cross_entropy',
  14),
 ('Then', 'Cross_entropy', 15),
 ('', 'Cross_entropy', 16),
 ('', 'Cross_entropy', 17),
 ('NB : The notation is also used for a different concept , the joint entropy of and .',
  'Cross_entropy',
  18),
 ('', 'Cross_entropy', 19)]

In [17]:
lines = []
for p_line in p_lines:
    lines.append({
        "sentence": p_line[0],
        "page": p_line[1],
        "line_on_page": p_line[2]
    })
lines

[{'sentence': "In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .",
  'page': 'Cross_entropy',
  'line_on_page': 0},
 {'sentence': '', 'page': 'Cross_entropy', 'line_on_page': 1},
 {'sentence': '', 'page': 'Cross_entropy', 'line_on_page': 2},
 {'sentence': 'The cross entropy for the distributions and over a given set is defined as follows :',
  'page': 'Cross_entropy',
  'line_on_page': 3},
 {'sentence': '', 'page': 'Cross_entropy', 'line_on_page': 4},
 {'sentence': '', 'page': 'Cross_entropy', 'line_on_page': 5},
 {'sentence': 'where is the entropy of , and is the Kullback -- Leibler divergence of from -LRB- also known as the relative entropy of p with respect to q -- note the reversal of emphasi

In [18]:
from drqascripts.retriever.build_tfidf_lines import OnlineTfidfDocRanker

In [19]:
import math
class RankArgs:
    def __init__(self):
        self.ngram = 2
        self.hash_size = int(math.pow(2,24))
        self.tokenizer = "simple"
        self.num_workers = None
freqs = None
tfidf = OnlineTfidfDocRanker(RankArgs(), [line["sentence"] for line in lines], freqs)

In [20]:
tfidf

<drqascripts.retriever.build_tfidf_lines.OnlineTfidfDocRanker at 0x7fb8c5339c10>

In [21]:
claim_text = "Binary cross entropy is used for classification"
n_sents = 5
line_ids, scores = tfidf.closest_docs(claim_text, n_sents)
ret_lines = []
for idx, line in enumerate(line_ids):
    ret_lines.append(lines[line])
    ret_lines[-1]["score"] = scores[idx]

In [22]:
line_ids, scores

([0, 3, 18, 6], array([6.58504233, 4.66038494, 2.73572755, 1.28551579]))
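
Only four lines come back although five were requested. That is consistent with the ranker returning only lines with a non-zero score, which is our reading of the output rather than a documented guarantee. The sketch below illustrates the idea with a simplified word-level TF-IDF in pure Python. It is a stand-in for illustration, not the hashed bigram scheme that `OnlineTfidfDocRanker` uses, and its scores are not comparable to the ones above.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and drop punctuation-only tokens."""
    return [t for t in text.lower().split() if t.isalnum()]

def rank_sentences(claim, sentences, k=5):
    """Return (indices, scores) of up to k sentences with a non-zero score."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for doc in docs for term in doc)
    claim_terms = set(tokenize(claim))
    scored = []
    for i, doc in enumerate(docs):
        # Term count times a smoothed inverse document frequency.
        score = sum(doc[t] * (math.log((1 + n) / (1 + df[t])) + 1)
                    for t in claim_terms if t in doc)
        if score > 0:
            scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top = scored[:k]
    return [i for _, i in top], [s for s, _ in top]

sentences = [
    'The cross entropy for the distributions and over a given set is defined as follows :',
    '',
    'For discrete and this means',
    'The situation for continuous distributions is analogous .',
]
ids, scores = rank_sentences('Binary cross entropy is used for classification', sentences)
print(ids)
```

The empty line never appears in the result, so fewer than `k` indices are returned here as well.
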

In [18]:
ret_lines

[{'sentence': "In information theory , the cross entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set , if a coding scheme is used that is optimized for an `` unnatural '' probability distribution , rather than the `` true '' distribution .",
  'page': 'Cross_entropy',
  'line_on_page': 0,
  'score': 6.585042331106434},
 {'sentence': 'The cross entropy for the distributions and over a given set is defined as follows :',
  'page': 'Cross_entropy',
  'line_on_page': 3,
  'score': 4.6603849408028335},
 {'sentence': 'NB : The notation is also used for a different concept , the joint entropy of and .',
  'page': 'Cross_entropy',
  'line_on_page': 18,
  'score': 2.735727550499232},
 {'sentence': 'where is the entropy of , and is the Kullback -- Leibler divergence of from -LRB- also known as the relative entropy of p with respect to q -- note the reversal of emphasis -RRB- .'

In [20]:
sents = [(s["page"], s["line_on_page"]) for s in ret_lines]
sents

[('Cross_entropy', 0),
 ('Cross_entropy', 3),
 ('Cross_entropy', 18),
 ('Cross_entropy', 6)]

In [21]:
pages = list(set(map(lambda sent:sent[0],sents)))
pages

['Cross_entropy']
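
`list(set(...))` does not guarantee any particular order. If the pages should stay in the order of their best-scoring sentence, `dict.fromkeys` is one option. A small sketch, where the `'Entropy'` entry and the line numbers are made up for illustration:

```python
# Ranked (page, line) pairs; the 'Entropy' entry is illustrative only.
sents = [('Cross_entropy', 0), ('Entropy', 4), ('Cross_entropy', 3)]

# dict keys keep insertion order, so each page appears once, at its first rank.
pages = list(dict.fromkeys(page for page, _ in sents))
print(pages)
```
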

In [12]:
import json
in_path = '/local/fever-common/data/fever-data/shared_task_dev.jsonl'
out_path = 'working/data/claim_texts.jsonl'
# Separate names for paths and file handles, so the path strings are not overwritten
with open(in_path, "r") as in_file, open(out_path, "w+") as out_file:
    lines = []
    for line in in_file:
        lines.append(json.loads(line))

KeyboardInterrupt: 

In [29]:
import math
class RankArgs:
    def __init__(self):
        self.ngram = 2
        self.hash_size = int(math.pow(2,24))
        self.tokenizer = "simple"
        self.num_workers = None
def tf_idf_sim(claim, lines, freqs=None, n_sents=5):
    # Build a TF-IDF index over the candidate sentences and rank them against the claim
    tfidf = OnlineTfidfDocRanker(RankArgs(), [line["sentence"] for line in lines], freqs)
    line_ids, scores = tfidf.closest_docs(claim, n_sents)
    ret_lines = []
    for idx, line in enumerate(line_ids):
        # Attach the score to the matching line record
        ret_lines.append(lines[line])
        ret_lines[-1]["score"] = scores[idx]
    return ret_lines

In [42]:
!head working/data/claim_texts.jsonl

In [None]:
in_path = '/local/fever-common/data/fever-data/shared_task_dev.jsonl'
out_path = 'working/data/matching_page_sentences.jsonl'
sample_size = 10
cnt = 0
n_docs = 5
with open(in_path, "r") as in_file, open(out_path, "w+") as out_file:
    for row in in_file:
        print(row)
        ln = json.loads(row)
        claim_text = ln['claim']
        cnt += 1
        doc_names, doc_scores = ranker.closest_docs(claim_text, n_docs)
        ## sort the docs by score and keep only the top page
        page_scores = zip(doc_names, doc_scores)
        sorted_pages = sorted(page_scores, reverse=True, key=lambda elem: elem[1])
        pages = [p[0] for p in sorted_pages[:1]]
        print('Pages...')
        print(pages)
        ## get the lines from the pages
        for page in pages:
            doc_lines = database.get_doc_lines(page)
            # keep only the sentence field of each line; build the list once per page
            lns = [l.split("\t")[1] if len(l.split("\t")[1]) > 1 else "" for l in doc_lines]
            lines = [{"sentence": s, "page": page, "line_on_page": i} for i, s in enumerate(lns)]
            # rank the sentences of the page against the claim
            scores = tf_idf_sim(claim_text, lines)
            sents = [(s["page"], s["line_on_page"]) for s in scores]
            pgs = list(set(map(lambda sent: sent[0], sents)))
            ln["predicted_pages"] = pgs
            ln["predicted_sentences"] = sents
            out_file.write(json.dumps(ln) + "\n")
        # stop after sample_size claims
        if cnt >= sample_size:
            break

{"id": 91198, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Colin Kaepernick became a starting quarterback during the 49ers 63rd season in the National Football League.", "evidence": [[[108548, null, null, null]]]}

Pages...
['Colin_Kaepernick']
{"id": 194462, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Tilda Swinton is a vegan.", "evidence": [[[227768, null, null, null]]]}

Pages...
['Swinton_-LRB-surname-RRB-']
{"id": 137334, "verifiable": "VERIFIABLE", "label": "SUPPORTS", "claim": "Fox 2000 Pictures released the film Soul Food.", "evidence": [[[289914, 283015, "Soul_Food_-LRB-film-RRB-", 0]], [[291259, 284217, "Soul_Food_-LRB-film-RRB-", 0]], [[293412, 285960, "Soul_Food_-LRB-film-RRB-", 0]], [[337212, 322620, "Soul_Food_-LRB-film-RRB-", 0]], [[337214, 322622, "Soul_Food_-LRB-film-RRB-", 0]]]}

Pages...
['Soul_Food']
{"id": 166626, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Anne Rice was born in New Jersey
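
The output above already hints at a retrieval miss: for the claim about Soul Food, the gold evidence sits on `Soul_Food_-LRB-film-RRB-`, while the top page is `Soul_Food`. A minimal sketch for checking each written record this way. The helper names are hypothetical, the evidence layout is read off the printed records, and the predicted line number in the example is illustrative:

```python
def gold_sentences(record):
    """Collect (page, line) pairs from the nested 'evidence' field, skipping nulls."""
    gold = set()
    for evidence_set in record.get('evidence', []):
        for _annotation_id, _evidence_id, page, line in evidence_set:
            if page is not None:
                gold.add((page, line))
    return gold

def retrieval_hit(record):
    """True if any predicted sentence is also a gold evidence sentence."""
    predicted = {tuple(s) for s in record.get('predicted_sentences', [])}
    return bool(predicted & gold_sentences(record))

record = {
    'id': 137334,
    'claim': 'Fox 2000 Pictures released the film Soul Food.',
    'evidence': [[[289914, 283015, 'Soul_Food_-LRB-film-RRB-', 0]]],
    # Hypothetical prediction: page taken from the output above, line number illustrative.
    'predicted_sentences': [['Soul_Food', 0]],
}
print(retrieval_hit(record))
```

Records labelled `NOT ENOUGH INFO` have no gold page, so they never count as a hit under this check.
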

In [25]:
mkdir -p working/data

In [3]:
ls working/data/

matching_page_sentences.jsonl


In [4]:
!head working/data/matching_page_sentences.jsonl

In [8]:
sample_page = 'List_of_Ace_titles_in_numeric_series'
lines = database.get_doc_lines(sample_page)
len(lines)

1841

In [10]:
# keep only the sentence field of each line, paired with the page name and line index
lns = [line.split("\t")[1] if len(line.split("\t")[1]) > 1 else "" for line in lines]
p_lines = list(zip(lns, [sample_page] * len(lns), range(len(lns))))

In [11]:
p_lines

[('In January 1969 , Ace Books switched from a letter-series code for its books to a numeric series .',
  'List_of_Ace_titles_in_numeric_series',
  0),
 ('The number does not indicate sequence of publication , unlike the number in the letter series codes ; instead it identifies the alphabetic position of the title .',
  'List_of_Ace_titles_in_numeric_series',
  1),
 ('It was assigned by dividing the range 00001-99999 into 26 sections , one for each letter of the alphabet , and then assigning the code depending on the first letters of the title .',
  'List_of_Ace_titles_in_numeric_series',
  2),
 ('As can be seen from the list below , this approach was evidently not followed in every case , but it accounts for the great majority of the codes .',
  'List_of_Ace_titles_in_numeric_series',
  3),
 ('', 'List_of_Ace_titles_in_numeric_series', 4),
 ('', 'List_of_Ace_titles_in_numeric_series', 5),
 ("The number is also part of the ISBN , for the later titles ; the ISBN for a book -LRB- if it h