### Introduction

This story lays the groundwork for a fact-checking API on Writey McWriteface. For v1, we aim to fact-check against Wikipedia.
The fact-checking system gives the user one more type of feedback: when a user-written claim is contradicted, we point out the exact text that contradicts the claim and the source of that text.

Verifying the veracity of any claim (i.e., 'fact-checking') can be broken down into two distinct problems.

1. Identifying evidence that either supports or refutes the claim (i.e., 'Relevant Fact Extraction (ReFE)')
2. Identifying the truth value of the claim itself, given some evidence (i.e., 'Fact Verification (FaVer)')

We will be dealing with each of these two problems independently. The network architecture used in both is approximately the same:

1. The statement of claim and evidence are both put through BERT, either separately (ReFE) or as a single string (FaVer). 
2. Claim and evidence are combined in a layer of bi-directional attention flow (biDAF). 
3. The output of biDAF is put through inner-attention, which produces a single vector that can be used for classification with a perceptron. 
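
To make the last step concrete, here is a minimal sketch of attention pooling that collapses a sequence of token vectors into a single vector. It is an illustration under stated assumptions, not the network used in this story: the scoring function (a dot product with a learned vector, here called `query`) and the function names are hypothetical, and the real layer may score tokens differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inner_attention_pool(token_vectors, query):
    """Collapse a sequence of token vectors into one fixed-size vector.

    token_vectors: list of equal-length lists, one per token.
    query: hypothetical learned weight vector, same length as a token vector.
    """
    # One scalar relevance score per token (assumed: dot product with query).
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(query)
    # The attention-weighted sum of the token vectors is the pooled vector.
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, token_vectors))
        for i in range(dim)
    ]
    return pooled, weights
```

The pooled vector has the same size regardless of sequence length, which is what lets a simple perceptron sit on top of it for classification.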

The Fact Extraction and Verification (FEVer) dataset provides the source corpus for this project.

In order to 'productize' these experiments, we will have to build a pre-processing framework that does all of the following:

1. Extracts relevant documents from Wikipedia that can be put through ReFE. This effort extends across the whole question-answering domain.
2. Rephrases the claims (which operate at the sentence level) so that they are unambiguous. This is known as 'coreference resolution', i.e., the process of resolving pronoun references using the larger context. Coreference resolution will probably be done through a neural network as well, and the GAP-Coreference dataset (https://github.com/google-research-datasets/gap-coreference) will be used to train it.
3. Potentially, decides whether a sentence should be sent to claim verification at all.
4. Finally, productization will mean making all of these parts work together cohesively. This will be a longer process and will NOT be covered under this story. We will link to that story once productization commences.

Further, this project also aims at understanding the use of GCP resources (specifically, mixed-precision training on V100 GPUs). We will eventually be adding performance comparisons on all training. 

### Relevant Fact Extraction

This ticket covers the ReFE network and describes it at the training level. 

#### Training 

Training is conducted in the following way:
1. For any "verifiable" claim, the dataset provides context texts, each consisting of multiple sentences. The dataset also annotates which sentences the claim rests on. These sentences are used as "Positive" samples; the remaining sentences are used as "Negative" samples. Training and Validation sets are constructed from this data. The Validation set is also used for testing (since the amount of data isn't actually very large).

2. For each training batch, a balanced dataset is presented to the network, with a roughly equal number of Positive and Negative samples.

3. The number of Negative samples far outweighs the number of Positive samples.

4. Hence, the total number of batches is chosen so that the Positive slots across all batches add up to 3 times the number of Positive samples. However, samples are chosen randomly, so there is no guarantee that the network will see each Positive sample thrice.

5. During testing, each batch is drawn entirely at random, so the Positive samples in a batch number far fewer than the Negative samples. This is roughly how we would expect real-life data to look.

Note that in the above sequence, we are not resolving co-references, or doing any other significant pre-processing. Hence, this network *must* sit after a pre-processing step. 
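
The batching scheme in the steps above could be sketched roughly as follows. This is a sketch under assumptions rather than the training code itself: the function names, the even half-and-half split, and sampling with replacement are all hypothetical choices made for illustration.

```python
import math
import random


def num_batches(num_positive, batch_size, passes=3):
    """Batches needed so positive slots add up to `passes` x the positive set."""
    positives_per_batch = batch_size // 2
    return math.ceil(passes * num_positive / positives_per_batch)


def balanced_batch(positives, negatives, batch_size, rng):
    """Training batch: half Positive, half Negative, drawn at random.

    Sampling is with replacement (an assumption), so nothing guarantees
    that every Positive sample is seen a fixed number of times.
    """
    half = batch_size // 2
    batch = [(sample, 1) for sample in rng.choices(positives, k=half)]
    batch += [(sample, 0) for sample in rng.choices(negatives, k=half)]
    rng.shuffle(batch)
    return batch


def evaluation_batch(labelled_samples, batch_size, rng):
    """Test batch: drawn uniformly, so it keeps the natural class imbalance."""
    k = min(batch_size, len(labelled_samples))
    return rng.sample(labelled_samples, k)
```

For example, with 100 Positive samples and a batch size of 20, `num_batches(100, 20)` gives 30 batches, each holding 10 Positive and 10 Negative samples.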

#### Evaluation

For this network alone, it is possible to look at the performance of the network in several ways.
1. Sending negative samples to the next network is not as bad as NOT sending positive samples.
2. Hence, the network is evaluated against 'BestValidationLoss' and 'BestValidationEvidentiaryAccuracy'.
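
As a rough illustration of the two views of performance, the sketch below computes overall accuracy and accuracy restricted to true evidence sentences. It assumes that "evidentiary accuracy" means accuracy on Positive samples only (equivalently, recall on the Positive class); the function names are hypothetical and labels are assumed to be 1 for evidence and 0 otherwise.

```python
def overall_accuracy(labels, predictions):
    """Fraction of all samples classified correctly."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def evidentiary_accuracy(labels, predictions):
    """Fraction of true evidence sentences (label 1) that were kept.

    Assumed definition: accuracy over Positive samples only.
    Returns None when there are no Positive samples to score.
    """
    on_evidence = [p for y, p in zip(labels, predictions) if y == 1]
    if not on_evidence:
        return None
    return sum(1 for p in on_evidence if p == 1) / len(on_evidence)
```

Because missed evidence is costlier than extra sentences, a model can be preferable on the second metric even when its overall accuracy is lower.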

#### Results

Model 'BestValidationEvidentiaryAccuracy':

Overall accuracy: ~82%; 
Accuracy for evidences only: ~88%

Model 'BestValidationOverallAccuracy':

Overall accuracy: ~87%; 
Accuracy for evidences only: ~85%

#### Further work

1. Repurpose SQuAD and other datasets to provide more data to train this network. 


### Fact Verification

This ticket covers the FaVer network and describes it at the training level.

#### Training
Training is conducted in the following way:

1. The dataset provides truth values for all verifiable claims. 
2. Each claim is combined with all of its evidence into a single string, in the usual BERT way. True claims are thought of as "Positive" samples and False claims are thought of as "Negative" samples. 
3. One-tenth of the dataset is kept apart for Validation. The Validation dataset is also used for testing. 
4. The nature of the FEVer dataset is such that, for verifiable claims, Positive samples far outweigh Negative samples. The aim during training, however, is always to present the network with a balanced set of samples. 
5. Hence, the total number of batches is chosen to cover the entire negative sample training space 3 times. Having said that, each batch is chosen completely at random (equal numbers of random positive and negative samples), so there is no guarantee that the network will see all negative samples. 
6. During testing, each batch is chosen entirely at random, with no specific weighting applied to the numbers of positive and negative samples. 
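
The input construction and the validation split from the steps above might look like the sketch below. It assumes the standard BERT sentence-pair layout (`[CLS] A [SEP] B [SEP]`) and assumes evidence sentences are simply joined with spaces; the function names are hypothetical. In practice a BERT tokenizer typically inserts the special tokens itself when given a pair of texts, so the explicit string here is for illustration only.

```python
import random


def build_faver_input(claim, evidence_sentences):
    """Join a claim and all of its evidence as one BERT-style pair string."""
    # Assumption: evidence sentences are concatenated with single spaces.
    evidence = " ".join(s.strip() for s in evidence_sentences)
    return f"[CLS] {claim.strip()} [SEP] {evidence} [SEP]"


def split_validation(samples, rng, fraction=0.1):
    """Hold out `fraction` of the samples (one-tenth by default)."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    # Returns (training set, validation set).
    return shuffled[cut:], shuffled[:cut]
```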

#### Evaluation

1. The evaluation metric used in this network is Accuracy. Note that the FEVer task itself requires an F1 score that combines the location of the evidence with the accuracy of the judgement. However, since we have structured the task differently, accuracy suffices for now. 
2. Note that this network depends on the quality of its inputs, so when placed after ReFE, its accuracy will probably be limited by the accuracy of the ReFE network itself. 


#### Results

Model 'BestValidationAccuracy': 

Overall Accuracy: ~93%

#### Further Work

1. Noisy sampling: Add sentences to the evidence that do not contribute evidence to the claim. The aim is to achieve a similar or higher result. This should take care of some of the noise that will come out of the ReFE network. 
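
One possible shape for noisy sampling is sketched below. Nothing here has been implemented yet; the function name, the `distractor_pool` argument, and the choice to shuffle distractors in among the real evidence are all assumptions made for illustration.

```python
import random


def add_noise(evidence, distractor_pool, num_distractors, rng):
    """Mix non-evidence sentences into a claim's evidence list.

    evidence: sentences that really bear on the claim.
    distractor_pool: hypothetical pool of unrelated sentences to draw from.
    """
    # Never pick a sentence that is already genuine evidence.
    candidates = [s for s in distractor_pool if s not in evidence]
    k = min(num_distractors, len(candidates))
    noisy = list(evidence) + rng.sample(candidates, k)
    # Shuffle so the network cannot rely on evidence coming first.
    rng.shuffle(noisy)
    return noisy
```

The noisy list could then be joined with the claim in the same way as clean evidence, leaving the label unchanged.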
