## Comparison Approach
This notebook loads each trained model from the best runs of both the BERT and CNN-based approaches. For each model it shows the `model.summary()` output and a diagram, then runs a performance test by inferring results for the texts in the ClaimBuster dataset's crowdsourced.csv file, which contains 22,501 sentences. We use sentences per second as the performance metric and the on-disk size of each model as the complexity metric.
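
The on-disk size used as the complexity metric can be measured with a short helper. The sketch below is only an illustration, not the notebook's own code: `dir_size_bytes` and `format_mib` are hypothetical names, and it assumes each model is saved either as a directory (as `keras.models.load_model` expects for the paths under `../best_models/`) or as a single file.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of a file, or of all files under a directory."""
    # A model saved as a single file (assumption: e.g. an HDF5 file) is measured directly.
    if os.path.isfile(path):
        return os.path.getsize(path)

    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symbolic links so linked files are not counted.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def format_mib(num_bytes):
    """Format a byte count as mebibytes with two decimals."""
    return f"{num_bytes / 2**20:.2f} MiB"
```

For example, `format_mib(dir_size_bytes('../best_models/210409-235328'))` would report the size of the first CNN model, provided that directory exists on the machine running the notebook.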

In [1]:
## Usual Imports
import numpy as np
import pandas as pd

from tensorflow import keras
import tensorflow.keras.backend as backend
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.config.experimental import list_physical_devices, set_visible_devices

import string

import json
import pickle

import matplotlib.pyplot as plt

import datetime


import sys
sys.path.insert(0, '../python')
import debug
from jbyrne_utils import tokenize_sentences


# to fix the CUDA issues for CUDA 11.2 to allow use of the GPU
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

### Load the crowdsourced test data set

In [2]:
# load and parse the crowdsourced.csv file

cs = pd.read_csv("../data/crowdsourced.csv", delimiter=',', quotechar = '"', index_col='Sentence_id')

Unlike the curated JSON dataset we used for training, the `Verdict` column here takes three values:

| Verdict | Description |
| :---: | :--- |
| +1 | Checkable Fact Statements, e.g. "Inflation is down 2%" |
| 0 | Uncheckable Fact Statements, e.g. "Jack likes fish" |
| -1 | Non Fact Statements, e.g. "Drink the water" |

For the purposes of this paper, we are only interested in checkable fact statements, so we set any -1 verdicts to equal zero before tokenizing.

In [3]:
len(cs.loc[cs["Verdict"] == -1]["Verdict"])


14685

In [4]:
# Change -1 verdicts (non claim sentences) to be 0.
print(f"Before: {len(cs.loc[cs['Verdict'] == -1])} -1 labels.")

cs.loc[cs["Verdict"] == -1, "Verdict"] = 0

print(f"After:  {len(cs.loc[cs['Verdict'] == -1])} -1 labels.")

Before: 14685 -1 labels.
After:  0 -1 labels.


### Tokenizing the new dataset
Provided this notebook is run after the other tests, tokenizer.pkl and embed_matrix.pkl should already exist, created from the training dataset. We need to encode the new text using the same vocabulary and ID mapping, because it will be fed into the pre-trained embedding layer of each model.
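
To illustrate why the vocabulary must be reused, here is a minimal, library-free sketch of encoding a sentence against a fixed word-to-ID mapping. It is not the `tokenize_sentences` helper used below, whose implementation is not shown here. The `encode` function, the sample `word_index`, and the choice of 0 for padding and 1 for out-of-vocabulary words are all assumptions made for the example. Note also that Keras's `pad_sequences` pads and truncates at the front by default, whereas this sketch does both at the end for readability.

```python
def encode(sentence, word_index, max_len, oov_id=1, pad_id=0):
    """Map a sentence to a fixed-length list of integer IDs using an existing vocabulary."""
    words = sentence.lower().split()
    # Words unseen at training time all map to the same out-of-vocabulary ID.
    ids = [word_index.get(word, oov_id) for word in words]
    # Truncate long sentences (keeping the start), then right-pad short ones.
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


# Hypothetical vocabulary, standing in for the mapping stored in tokenizer.pkl.
word_index = {"inflation": 2, "is": 3, "down": 4}

encoded = encode("Inflation is down 2%", word_index, max_len=6)
print(encoded)  # [2, 3, 4, 1, 0, 0]
```

Because the embedding layer looks rows up by ID, building a fresh vocabulary from the new text would assign different IDs to the same words and silently feed the model the wrong vectors.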

## Initialize the output dataset

In [5]:
output = pd.DataFrame(columns = ["Type",
                                 "Model",
                                 "Hardware",
                                 "Max Length",
                                 "Filters",
                                 "Dense Layers",
                                 "Parameter Count",
                                 "Val Accuracy",
                                 "Test Accuracy",
                                 "Inf. Rate/s",
                                 ])

### GPU vs CPU performance
As one objective is to run claim detection at the edge, we will be doing performance testing on both GPU and CPU hardware.

All work on this project has been done using the following software and hardware:

* Anaconda distribution of Python 3.8.2
* TensorFlow 2.4.1
* AMD Ryzen TR 3970X 32-Core Processor with Hyperthreading (64 threads)
* NVIDIA RTX 2080 Super GPU

First, display the TensorFlow IDs for the CPU and GPU:

In [6]:
list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [7]:
# Disable the GPU for these runs
set_visible_devices([], 'GPU')

## Test the CNN Best models for each max_len
We ran a series of grid searches to find the best performing models for each of the tested sentence lengths:

| max_len | Description |
| :---: | :--- |
| 100 | Only 14 of 11056 sentences are truncated, so 99.87% of sentences are processed in full |
| 50 | Half the length still processes 95.57% of sentences in full |
| 21 | Equal to the rounded average length, processes 61.48% of sentences in full |
| 17 | Equal to the median length, processes half the sentences in full |
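
The percentages in the table are the share of sentences whose length does not exceed `max_len`. As a hedged sketch of that calculation (the function name `full_coverage` and the sample lengths are invented for illustration, and it assumes length is counted in tokens):

```python
def full_coverage(lengths, max_len):
    """Return the fraction of sentences that fit within max_len without truncation."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    fits = sum(1 for n in lengths if n <= max_len)
    return fits / len(lengths)


# Made-up token counts for five sentences.
sample_lengths = [5, 17, 21, 30, 120]
print(full_coverage(sample_lengths, max_len=21))  # 0.6

# The max_len = 100 row: 14 of 11056 sentences are truncated.
print(f"{100 * (1 - 14 / 11056):.2f}%")  # 99.87%
```

The same function applied to the real training sentence lengths should reproduce the other rows of the table.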

In [8]:
with open('./best_models.pkl', 'rb') as f:
    best_models = pickle.load(f)
best_models = best_models.sort_values('val_accuracy_best', ascending=False)
best_models

| | timestamp | max_len | batch_size | embed_dim | num_filters | kernel_sizes | dense_layer_dims | dropout_rate | val_accuracy_best | val_accuracy_best_epoch |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 210409-235328 | 50 | 50 | 50 | [96, 96, 96] | [8, 16, 32] | [8] | 0.2 | 0.976492 | 6 |
| 0 | 210409-210515 | 100 | 50 | 50 | [64, 64, 64] | [4, 8, 16] | [8] | 0.2 | 0.975588 | 8 |
| 3 | 210409-225707 | 17 | 50 | 50 | [96, 96, 96] | [8, 12, 16] | [32] | 0.2 | 0.970615 | 19 |
| 2 | 210410-004542 | 21 | 50 | 50 | [64, 64, 64] | [8, 12, 16] | [8] | 0.2 | 0.969259 | 14 |


In [9]:
best_models["timestamp"]

1    210409-235328
0    210409-210515
3    210409-225707
2    210410-004542
Name: timestamp, dtype: object

In [10]:
cycles = 100


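def run_perftest(tokens, labels, model, cycles):
    """Evaluate the model `cycles` times; return (elapsed seconds, last metrics)."""
    print(f"Inferring {cycles} iterations of test data")
    start_time = datetime.datetime.now()
    for i in range(cycles):
        # evaluate() returns the loss followed by the compiled metrics
        metrics = model.evaluate(tokens, labels, batch_size=128, verbose=0)
        # progress display: a counter every 10 cycles, a dot otherwise
        if i % 10 == 0:
            print(f"\n{i:03d}", end="")
        else:
            print(".", end="")
    # stop the clock before the final print so it is not included in the timing
    end_time = datetime.datetime.now()
    print('\n\nCOMPLETED\n')

    difference = end_time - start_time
    return (difference.total_seconds(), metrics)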

keras.backend.clear_session()
for index,row in best_models.iterrows():
    timestamp = row["timestamp"]
    
    model = keras.models.load_model(f'../best_models/{timestamp}')
    print(f"\n\n\nCNN Model from timestamp {timestamp}")
    
    # Display summary and diagram of the model
    model.summary()
    keras.utils.plot_model(model, f'{timestamp}.png', show_shapes=True, show_dtype=True, rankdir="TB")
    
    tokens, _ = tokenize_sentences(cs["Text"], max_len=row["max_len"] )
    labels = cs["Verdict"]
    
    # Run Performance Test on the crowdsourced test data set
    
    difference, history = run_perftest(tokens, labels, model, cycles)
    
    print(f"Time taken for {cycles * len(labels):,} inferences = {difference:.3f} s.")
    print(f"Rate is {cycles * len(labels) / difference:,.3f} inferences per second")
    
    # Add the results to the output
    
    filter_string = f"Sizes: {row['kernel_sizes']} Counts: {row['num_filters']}"
    parameter_count = np.sum([backend.count_params(w) for w in model.trainable_weights]) + \
                      np.sum([backend.count_params(w) for w in model.non_trainable_weights])

    
    record = pd.DataFrame( {"Type": "CNN",
                            "Model": timestamp,
                            "Hardware": "GPU",  # hard-coded label: set to "CPU" when the GPU is hidden via set_visible_devices
                            "Max Length": row["max_len"],
                            "Filters": filter_string,
                            "Dense Layers": row["dense_layer_dims"],
                            "Parameter Count": parameter_count ,
                            "Val Accuracy": row["val_accuracy_best"],
                            "Test Accuracy": history[1],
                            "Inf. Rate/s": cycles * len(labels) / difference
                           },
                           index = [1]) # timestamp
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    output = pd.concat([output, record])
    




CNN Model from timestamp 210409-235328
Model: "model_84"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_85 (InputLayer)           [(None, 50)]         0                                            
__________________________________________________________________________________________________
embedding_84 (Embedding)        (None, 50, 50)       409800      input_85[0][0]                   
__________________________________________________________________________________________________
conv1d_228 (Conv1D)             (None, 43, 96)       38496       embedding_84[0][0]               
__________________________________________________________________________________________________
conv1d_229 (Conv1D)             (None, 35, 96)       76896       embedding_84[0][0]               
_________________________________________________

Loading previously created Tokenizer
Inferring 100 iterations of test data

000.........
010.........
020.........
030.........
040.........
050.........
060.........
070.........
080.........
090.........

COMPLETED

Time taken for 2,250,100 inferences = 24.028 s.
Rate is 93,644.918 inferences per second



CNN Model from timestamp 210410-004542
Model: "model_52"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_53 (InputLayer)           [(None, 21)]         0                                            
__________________________________________________________________________________________________
embedding_52 (Embedding)        (None, 21, 50)       409800      input_53[0][0]                   
__________________________________________________________________________________________________
conv1d_144 (Conv1D)             (None

In [11]:
output

| | Type | Model | Hardware | Max Length | Filters | Dense Layers | Parameter Count | Val Accuracy | Test Accuracy | Inf. Rate/s |
| :---: | :---: | :---: | :---: | :---: | :--- | :---: | :---: | :---: | :---: | :---: |
| 1 | CNN | 210409-235328 | GPU | 50 | Sizes: [8, 16, 32] Counts: [96, 96, 96] | [8] | 681209.0 | 0.976492 | 0.886361 | 41294.983218 |
| 1 | CNN | 210409-210515 | GPU | 100 | Sizes: [4, 8, 16] Counts: [64, 64, 64] | [8] | 501145.0 | 0.975588 | 0.889072 | 44209.782797 |
| 1 | CNN | 210409-225707 | GPU | 17 | Sizes: [8, 12, 16] Counts: [96, 96, 96] | [32] | 592169.0 | 0.970615 | 0.885072 | 93644.918164 |
| 1 | CNN | 210410-004542 | GPU | 21 | Sizes: [8, 12, 16] Counts: [64, 64, 64] | [8] | 526745.0 | 0.969259 | 0.886494 | 84235.316209 |
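
The inference rates reported above are the total number of sentences evaluated divided by the elapsed wall-clock time: 22,501 sentences × 100 cycles = 2,250,100 inferences per model. The sketch below shows one way that measurement could be made more robust. It is a suggestion rather than the method used in this notebook: `measure_rate` is a hypothetical helper, it uses `time.perf_counter` instead of `datetime.datetime.now`, and the warm-up call (so that one-off start-up cost is excluded) is an assumption about what is worth excluding.

```python
import time


def measure_rate(predict_fn, n_items, cycles=100, warmup=1):
    """Return items processed per second over `cycles` timed calls of predict_fn."""
    # Untimed warm-up calls, so one-off initialization is not counted.
    for _ in range(warmup):
        predict_fn()

    start = time.perf_counter()
    for _ in range(cycles):
        predict_fn()
    elapsed = time.perf_counter() - start

    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return cycles * n_items / elapsed


def rate_from_totals(n_items, cycles, seconds):
    """Rate from already-recorded totals, as in the table's last column."""
    return cycles * n_items / seconds
```

With the figures printed for one run, `rate_from_totals(22501, 100, 24.028)` gives roughly 93,645 inferences per second, matching the reported rate up to the rounding of the elapsed time. In the notebook, `predict_fn` would be something like `lambda: model.evaluate(tokens, labels, batch_size=128, verbose=0)`.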


__210409-235328: max_len = 50__
!["210409-235328"](./210409-235328.png)

__210409-210515: max_len = 100__
!["210409-210515"](./210409-210515.png)

__210409-225707: max_len = 17__
!["210409-225707"](./210409-225707.png)

__210410-004542: max_len = 21__
!["210410-004542"](./210410-004542.png)

### Citations
@inproceedings{arslan2020claimbuster,
    title={{A Benchmark Dataset of Check-worthy Factual Claims}},
    author={Arslan, Fatma and Hassan, Naeemul and Li, Chengkai and Tremayne, Mark},
    booktitle={14th International AAAI Conference on Web and Social Media},
    year={2020},
    organization={AAAI}
}

@article{meng2020gradient,
    title={Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims},
    author={Meng, Kevin and Jimenez, Damian and Arslan, Fatma and Devasier, Jacob Daniel and Obembe, Daniel and Li, Chengkai},
    journal={arXiv preprint arXiv:2002.07725},
    year={2020}
}
