# Detect claims to fact check in political debates

In this project you will implement a BERT model to detect which sentences in political debates should be fact-checked.
Dataset from ClaimBuster: https://zenodo.org/record/3609356

The classifier is evaluated using the same metrics as http://ranger.uta.edu/~cli/pubs/2017/claimbuster-kdd17-hassan.pdf (Table 2).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import json
import time 

# Load the preprocessed data

In [2]:
df = pd.read_csv("../data_preprocessing/data.csv")
df['date'] = pd.to_datetime(df['date'])
df.dropna(inplace=True)
df.reset_index(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23462 entries, 0 to 23461
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   index       23462 non-null  int64         
 1   date        23462 non-null  datetime64[ns]
 2   Text        23462 non-null  object        
 3   Clean_text  23462 non-null  object        
 4   Verdict     23462 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 916.6+ KB


# Train-test split


In [3]:
mask = df["date"].dt.year < 2012

X_train = df.loc[mask, "Clean_text"].values
y_train = df.loc[mask, "Verdict"].values

X_test = df.loc[~mask, "Clean_text"].values
y_test = df.loc[~mask, "Verdict"].values

In [4]:
X_test.shape


(5344,)

### Transformer Implementation

In [5]:
np.unique(y_test)

array([-1,  0,  1], dtype=int64)

In [6]:
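# One-hot encode the y_train data. The labels -1, 0 and 1 are used directly as
# column indices, so -1 selects the last column: 0 -> column 0, 1 -> column 1, -1 -> column 2.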

labels = np.zeros((y_train.size, len(np.unique(y_train))))
labels[np.arange(y_train.size), y_train] = 1

y_encoded_train = labels
y_encoded_train

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [8]:
# One-hot encode the y_test data (same column order: 0 -> column 0, 1 -> column 1, -1 -> column 2)

labels_test = np.zeros((y_test.size, len(np.unique(y_test))))
labels_test[np.arange(y_test.size), y_test] = 1

y_encoded_test = labels_test
y_encoded_test

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [9]:
y_encoded_test[30]

array([0., 1., 0.])
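
The encoding cells above depend on NumPy's negative indexing to place the label -1 in the last column. As a minimal sketch, the same column order can be written out explicitly. The names `LABEL_TO_COLUMN` and `one_hot` are hypothetical helpers and not part of the original notebook.

```python
import numpy as np

# Hypothetical explicit mapping. It reproduces the column order that the
# notebook obtains through negative indexing (-1 selects the last column).
LABEL_TO_COLUMN = {0: 0, 1: 1, -1: 2}

def one_hot(y):
    """One-hot encode labels from {-1, 0, 1} into an (n, 3) float array."""
    columns = np.array([LABEL_TO_COLUMN[int(label)] for label in y])
    encoded = np.zeros((len(columns), len(LABEL_TO_COLUMN)))
    encoded[np.arange(len(columns)), columns] = 1
    return encoded

sample_labels = np.array([-1, 1, 0])
sample_encoded = one_hot(sample_labels)
print(sample_encoded)
```

An explicit mapping fails with a `KeyError` on an unexpected label instead of silently writing to the wrong column.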

### Implement a Transformer block as a layer

##### Load BERT Model and Tokenizer from the transformers package 

- The **tokenizer** converts the input text into tokens.

- The **AutoTokenizer** class loads the tokenizer that matches a given pre-trained checkpoint.

- **TFBertModel** is the pre-trained **BERT** model for TensorFlow.

- The **bert-base-cased** model is used in this project.

In [11]:
from transformers import AutoTokenizer,TFBertModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert = TFBertModel.from_pretrained('bert-base-cased')

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [12]:
Token = tokenizer.tokenize(X_train[0])
Token

['standing', 'still']

### Input Data Modeling

- Before training the deep learning model, the input text data are converted into BERT’s input data format using the tokenizer.

In [13]:
# Tokenize the input data using tokenizer (bert-base-cased)

x_train = tokenizer(
    text=X_train.tolist(),
    add_special_tokens=True,
    max_length=50,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)


x_test = tokenizer(
    text=X_test.tolist(),
    add_special_tokens=True,
    max_length=50,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)
    

In [14]:
x_test

{'input_ids': <tf.Tensor: shape=(5344, 50), dtype=int32, numpy=
array([[  101,  2215,  1143, ...,     0,     0,     0],
       [  101,  2905,   102, ...,     0,     0,     0],
       [  101, 17847,  3136, ...,     0,     0,     0],
       ...,
       [  101,  1440,  1159, ...,     0,     0,     0],
       [  101,  2037, 14644, ...,     0,     0,     0],
       [  101,  1250,  1662, ...,     0,     0,     0]])>, 'attention_mask': <tf.Tensor: shape=(5344, 50), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])>}

The **tokenizer** takes the parameters above and returns tensors in the format BERT accepts.
- **return_token_type_ids** is set to **False** because **token_type_ids** is not required for our task.
- **return_attention_mask=True** means the attention_mask is included in the output.
- **return_tensors='tf'** means the outputs are returned as TensorFlow tensors.
- **max_length=50** with **truncation=True** caps each tokenized sentence at 50 tokens, special tokens included, so longer sentences are truncated. With **padding=True**, shorter sentences are padded to the length of the longest sentence in the batch, which is 50 here.
- **add_special_tokens=True** means the **CLS** and **SEP** tokens are added during tokenization.

As output, the **tokenizer** returns a dictionary-like object (x_train) with the keys **'input_ids'** and **'attention_mask'** for their respective data.
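
To make the effect of these parameters concrete, here is a rough pure-Python sketch of truncation, special tokens, padding and the attention mask. It is an illustration only, not the tokenizer's actual implementation. The function `encode_batch` is hypothetical, and the ids 101, 102 and 0 are the CLS, SEP and padding ids that appear in the outputs above.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # ids visible in the notebook outputs

def encode_batch(token_id_lists, max_length=50):
    """Mimic truncation, special tokens, padding and attention masks."""
    # Reserve two positions for CLS and SEP, then truncate the rest.
    with_special = [
        [CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in token_id_lists
    ]
    # padding=True pads to the longest sequence in the batch.
    longest = max(len(ids) for ids in with_special)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in with_special]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in with_special
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sentences given as token ids; max_length=5 keeps the example small.
batch = encode_batch([[2288, 1253], [1210, 2648, 5835, 3682]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second sentence loses its last token to truncation, while the first is padded with one 0 and gets a matching 0 in its attention mask.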


In [15]:
input_ids = x_train['input_ids']
attention_mask = x_train['attention_mask']
attention_mask

<tf.Tensor: shape=(18118, 50), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])>

In [16]:
input_ids

<tf.Tensor: shape=(18118, 50), dtype=int32, numpy=
array([[  101,  2288,  1253, ...,     0,     0,     0],
       [  101,  1210,  2648, ...,     0,     0,     0],
       [  101,  5835,  3682, ...,     0,     0,     0],
       ...,
       [  101,  1341,  3254, ...,     0,     0,     0],
       [  101, 21744,  1618, ...,     0,     0,     0],
       [  101, 25092,  1710, ...,     0,     0,     0]])>

In [17]:
x_test['input_ids']

<tf.Tensor: shape=(5344, 50), dtype=int32, numpy=
array([[  101,  2215,  1143, ...,     0,     0,     0],
       [  101,  2905,   102, ...,     0,     0,     0],
       [  101, 17847,  3136, ...,     0,     0,     0],
       ...,
       [  101,  1440,  1159, ...,     0,     0,     0],
       [  101,  2037, 14644, ...,     0,     0,     0],
       [  101,  1250,  1662, ...,     0,     0,     0]])>

In [18]:
x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']} 


In [19]:
x['input_ids']

<tf.Tensor: shape=(18118, 50), dtype=int32, numpy=
array([[  101,  2288,  1253, ...,     0,     0,     0],
       [  101,  1210,  2648, ...,     0,     0,     0],
       [  101,  5835,  3682, ...,     0,     0,     0],
       ...,
       [  101,  1341,  3254, ...,     0,     0,     0],
       [  101, 21744,  1618, ...,     0,     0,     0],
       [  101, 25092,  1710, ...,     0,     0,     0]])>

### Model Building

Importing necessary libraries.


In [20]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense
from sklearn.metrics import classification_report

##### The Keras functional API is used to design the deep learning model.

In [21]:
max_len = 50
input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
embeddings = bert(input_ids,attention_mask = input_mask)[0] 
out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(128, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(32,activation = 'relu')(out)
y = Dense(3,activation = 'softmax')(out)

model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)

# layers[2] is the BERT layer: True fine-tunes its weights, False freezes them
model.layers[2].trainable = True
#model.layers[2].trainable = False

- **BERT layers** accept three input arrays: **input_ids**, **attention_mask**, and **token_type_ids**

- **input_ids** contains the token ids of the input text.

- **attention_mask** marks real tokens with 1 and padding positions with 0.

- **token_type_ids** distinguishes the two segments in sentence-pair tasks such as question answering, so it is not used here.

- For the BERT layer, **input_ids** and **attention_mask** are used as two input layers.
- **embeddings** contains the last hidden states of the BERT layer.
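
The model above reduces the hidden states to one vector per sentence with `GlobalMaxPool1D` before the dense layers. The NumPy sketch below shows the shape change only. The random array is a stand-in for the BERT output, and 768 is the hidden size of the BERT base models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last hidden state: (batch, sequence length, hidden size)
hidden = rng.normal(size=(2, 50, 768))

# Global max pooling takes the maximum over the sequence axis,
# leaving one 768-dimensional vector per sentence.
pooled = hidden.max(axis=1)
print(pooled.shape)  # (2, 768)
```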


In [22]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 50)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 50)]         0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 50,                                            

### Model Compilation

**Define learning parameters and compile the model.**

In [23]:
optimizer = Adam(learning_rate=5e-05, epsilon=1e-08,decay=0.01,clipnorm=1.0)

model.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics = ['accuracy'])


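**learning_rate = 5e-05**: the learning rate is significantly lower than the Adam default of 0.001, because the pre-trained BERT weights are being fine-tuned rather than trained from scratch.

**Loss = CategoricalCrossentropy**: the target has 3 unique integer values (-1, 0, 1) representing three distinct categories, one-hot encoded for the multi-class classification task.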
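
As a worked example of the loss, categorical cross-entropy for one sample is the negative log of the probability assigned to the true class. The sketch below uses only the standard library. The probability vector is made up for illustration, and `categorical_crossentropy` here is a hypothetical helper, not the Keras function.

```python
import math

def categorical_crossentropy(one_hot_target, probabilities):
    """Cross-entropy for one sample: -sum(target_i * log(probability_i))."""
    return -sum(
        t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t > 0
    )

target = [0.0, 1.0, 0.0]            # one-hot vector with the true class in column 1
confident = [0.05, 0.83, 0.12]      # illustrative softmax output, mostly correct
uncertain = [0.40, 0.30, 0.30]      # illustrative softmax output, mostly wrong

loss_confident = categorical_crossentropy(target, confident)
loss_uncertain = categorical_crossentropy(target, uncertain)
print(loss_confident, loss_uncertain)
```

The confident prediction gives a loss of about 0.19 and the uncertain one about 1.20, so the loss grows as less probability is placed on the true class.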


### Model Training

Train the model on the training tensors; the test tensors are used for validation.

Fine-tuning the BERT model takes a while.

In [24]:
train_history = model.fit(
    x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']} ,
    y = y_encoded_train,
    validation_data = (
    {'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']}, y_encoded_test
    ),
    epochs=1,
    batch_size=36
)




### Model Evaluation

Testing the model on the test data

In [25]:
predicted_raw = model.predict({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']})

In [26]:
predicted_raw

array([[0.10026944, 0.05610476, 0.8436258 ],
       [0.36755702, 0.28952134, 0.34292173],
       [0.27910215, 0.08186944, 0.63902843],
       ...,
       [0.06776952, 0.03152383, 0.90070665],
       [0.1288251 , 0.06344972, 0.8077252 ],
       [0.04614429, 0.02721756, 0.9266382 ]], dtype=float32)

In [27]:
predicted_raw[30]

array([0.04799846, 0.83002764, 0.12197381], dtype=float32)

In [28]:
y_true = df.loc[~mask, "Verdict"].values

In [29]:
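# Convert the prediction probabilities to the target classes.
# The column order follows the one-hot encoding of the labels.

y_predicted = []

for probabilities in predicted_raw:
    predicted_column = np.argmax(probabilities)
    label = 0
    if predicted_column == 2:      # column 2: NFS (label -1)
        label = -1
    elif predicted_column == 1:    # column 1: CFS (label 1)
        label = 1
    elif predicted_column == 0:    # column 0: UFS (label 0)
        label = 0
    y_predicted.append(label)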

In [30]:
yp = y_predicted[30]
yt = y_true[30]
yr = predicted_raw[30]
yc =np.argmax(predicted_raw[30])

print('yt:', yt, 'yp:',yp,'yr:',yr,'yc:',yc)


yt: 1 yp: 1 yr: [0.04799846 0.83002764 0.12197381] yc: 1


In [31]:
np.unique(np.array(y_predicted))

array([-1,  0,  1])
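
The conversion from probabilities to labels can also be done without a Python loop. This is a minimal sketch, assuming the same column order as the one-hot encoding (column 0 is UFS, column 1 is CFS, column 2 is NFS). The names `COLUMN_TO_LABEL` and `probabilities_to_labels` are hypothetical.

```python
import numpy as np

# Position i holds the label for output column i: 0 -> 0 (UFS), 1 -> 1 (CFS), 2 -> -1 (NFS)
COLUMN_TO_LABEL = np.array([0, 1, -1])

def probabilities_to_labels(probabilities):
    """Pick the most probable column per row and map it back to a label."""
    return COLUMN_TO_LABEL[np.argmax(probabilities, axis=1)]

# Two rows copied from the prediction output shown above
example = np.array([
    [0.10026944, 0.05610476, 0.8436258],
    [0.04799846, 0.83002764, 0.12197381],
])
example_labels = probabilities_to_labels(example)
print(example_labels)  # [-1  1]
```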

In [32]:
report = classification_report(y_true, y_predicted, target_names= ["NFS", "UFS", "CFS"])

print(report)

              precision    recall  f1-score   support

         NFS       0.78      0.91      0.84      3296
         UFS       0.47      0.41      0.44       623
         CFS       0.76      0.52      0.61      1425

    accuracy                           0.75      5344
   macro avg       0.67      0.61      0.63      5344
weighted avg       0.74      0.75      0.74      5344
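
For reference, the per-class numbers in the report can be computed from counts of true positives, false positives and false negatives. The sketch below uses made-up toy labels, not the notebook's predictions, and `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only: -1 = NFS, 0 = UFS, 1 = CFS
y_true_toy = [1, 1, 0, -1, -1, -1]
y_pred_toy = [1, -1, 0, -1, -1, 1]

cfs_scores = per_class_scores(y_true_toy, y_pred_toy, label=1)
print(cfs_scores)  # (0.5, 0.5, 0.5)
```

In the toy data one of the two CFS sentences is found and one of the two CFS predictions is correct, so precision, recall and F1 are all 0.5 for that class.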

