## Comparison Approach
This notebook loads the trained models from the best runs of both the BERT-based and CNN-based approaches. For each model it shows the `model.summary()` output and architecture diagram, then runs a performance test by inferring results for the texts in the ClaimBuster dataset's crowdsourced.csv file, which contains 22501 sentences. We use sentences per second as the performance metric and the on-disk size of each model as the complexity metric.
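The two metrics described above could be measured roughly as follows. This is a minimal sketch, not the notebook's actual benchmarking code: the helper names `sentences_per_second` and `on_disk_size_bytes`, the `batch_size` default, and the idea of passing the model's prediction call in as `predict_fn` are all assumptions made for illustration.

```python
import os
import time


def sentences_per_second(predict_fn, inputs, batch_size=256):
    """Time predict_fn over all inputs and return throughput in sentences/second.

    predict_fn is any callable that accepts one batch (hypothetical stand-in
    for something like model.predict).
    """
    start = time.perf_counter()
    for i in range(0, len(inputs), batch_size):
        predict_fn(inputs[i:i + batch_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers
    return len(inputs) / elapsed if elapsed > 0 else float("inf")


def on_disk_size_bytes(path):
    """Return the size of a single file, or the total size of a directory tree."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Walking the directory matters because some saved-model formats are stored as a folder of files rather than a single file. For a fair comparison, a warm-up call before timing is usually worthwhile, since the first prediction often includes one-off setup cost.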

In [1]:
## Usual Imports
import numpy as np
import pandas as pd

from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

import string

import json

# to fix the CUDA issues for CUDA 11.2 to allow use of the GPU
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

In [2]:
# load and parse the crowdsourced.csv file

cs = pd.read_csv("../data/crowdsourced.csv", delimiter=',', quotechar = '"', index_col='Sentence_id')

Unlike the curated JSON dataset we used for training, the "Verdict" column in this file takes three values:

| Verdict | Description |
| :---: | :--- |
| +1 | Checkable Fact Statements, e.g. "Inflation is down 2%" |
| 0 | Uncheckable Fact Statements, e.g. "Jack likes fish" |
| -1 | Non Fact Statements, e.g. "Drink the water" |

For the purposes of this paper, we are only interested in checkable fact statements, so we set any -1 verdicts to equal zero before tokenizing.
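As an illustration of this relabelling, here is a small sketch on a toy frame built from the three example sentences in the table. The helper name `binarize_verdicts` is hypothetical; it works on a copy so the original frame is left untouched.

```python
import pandas as pd

# Toy frame: one row per verdict class, using the example sentences from the table
toy = pd.DataFrame({
    "Text": ["Inflation is down 2%", "Jack likes fish", "Drink the water"],
    "Verdict": [1, 0, -1],
})


def binarize_verdicts(df):
    """Return a copy in which every -1 verdict has been replaced by 0."""
    out = df.copy()
    # A single .loc call with (row mask, column) assigns in place on `out`
    out.loc[out["Verdict"] == -1, "Verdict"] = 0
    return out


binary = binarize_verdicts(toy)
print(binary["Verdict"].tolist())
```

After this step only the checkable fact statement keeps a label of 1, and both other classes share the label 0.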

In [3]:
cs.columns

Index(['Text', 'Speaker', 'Speaker_title', 'Speaker_party', 'File_id',
       'Length', 'Line_number', 'Sentiment', 'Verdict'],
      dtype='object')

In [12]:
# Change -1 verdicts to 0. This must be an assignment through a single .loc
# call; the chained comparison used previously left the data unchanged.
# 14685 rows have Verdict == -1 before this step.
cs.loc[cs["Verdict"] == -1, "Verdict"] = 0

In [17]:
# we will need different lengths of tokenization depending on the model

def tokenize_from_json(sentences, max_len=100):

    # Load the tokenizer from the stored json file created from
    # the original training data
    with open('./tokenizer.json') as f:
        data = json.load(f)
        t = keras.preprocessing.text.tokenizer_from_json(data)
    
    # Convert to word IDs and pad or truncate each sentence to max_len
    tokens = pad_sequences(t.texts_to_sequences(sentences),
                           max_len,
                           padding='post',
                           truncating='post')
    
    vocab_list = list(t.word_index.keys())
           
    return tokens, vocab_list
    

## Now that we have set up the tokenization function, based on the tokenizer and vocabulary generated from the original training sets, we can apply it to the crowdsourced sentences.

In [21]:
tokens, vocab_list = tokenize_from_json(cs["Text"] )

In [26]:
tokens.shape


(22501, 100)
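The second dimension of 100 comes from padding and truncating every sentence to `max_len`. The sketch below mimics that behaviour in plain Python for the `padding='post'`, `truncating='post'` case with a pad value of 0. The function `pad_post` is a hypothetical stand-in written for illustration and is not part of Keras.

```python
def pad_post(sequences, max_len, pad_value=0):
    """Truncate each sequence to max_len from the end, then pad at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]  # 'post' truncation keeps the first max_len IDs
        padded.append(seq + [pad_value] * (max_len - len(seq)))
    return padded


# A short sentence is padded with zeros; a long one loses its trailing IDs
example = pad_post([[5, 3], [7, 1, 4, 9, 2, 8]], max_len=4)
print(example)  # [[5, 3, 0, 0], [7, 1, 4, 9]]
```

Every row therefore has the same length, which is why all 22501 sentences fit in one rectangular array.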

### Citations
@inproceedings{arslan2020claimbuster,
    title={{A Benchmark Dataset of Check-worthy Factual Claims}},
    author={Arslan, Fatma and Hassan, Naeemul and Li, Chengkai and Tremayne, Mark},
    booktitle={14th International AAAI Conference on Web and Social Media},
    year={2020},
    organization={AAAI}
}

@article{meng2020gradient,
    title={Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims},
    author={Meng, Kevin and Jimenez, Damian and Arslan, Fatma and Devasier, Jacob Daniel and Obembe, Daniel and Li, Chengkai},
    journal={arXiv preprint arXiv:2002.07725},
    year={2020}
}
