**Authors:** 
- Bruna Atamanczuk (254205) 
- John Emeka Udegbunam (207951) 
- Kurt Arve Skipenes Karadas (890802)

# Detect claims to fact check in political debates - Deep learning using word embedding

In this project we implement various classifiers using neural networks to detect which sentences in political debates should be fact checked.

The following models are implemented: 

- Bidirectional LSTM
- Stacked Bi-LSTM
- CNN
- CNN + LSTM

Dataset from ClaimBuster: https://zenodo.org/record/3609356 
The classifiers are evaluated using the same metrics as http://ranger.uta.edu/~cli/pubs/2017/claimbuster-kdd17-hassan.pdf (Table 2)


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report

# Loading the data

In [2]:
df = pd.read_csv("../data_preprocessing/data.csv")
df['date'] = pd.to_datetime(df['date'])
df.dropna(inplace=True)
df.reset_index(inplace=True)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23462 entries, 0 to 23461
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   index       23462 non-null  int64         
 1   date        23462 non-null  datetime64[ns]
 2   Text        23462 non-null  object        
 3   Clean_text  23462 non-null  object        
 4   Verdict     23462 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 916.6+ KB


# Train-test split

In [3]:
mask = df["date"].dt.year < 2012

X_train = df.loc[mask, "Clean_text"].values
y_train = df.loc[mask, "Verdict"].values

X_test = df.loc[~mask, "Clean_text"].values
y_test = df.loc[~mask, "Verdict"].values

# Data Preprocessing

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

In [5]:
# Build the vocabulary: map each whitespace-separated term to a unique integer id
vocabulary = {}
for sentence in X_train:
    for term in sentence.split():
        vocabulary.setdefault(term, len(vocabulary))

In [6]:
# Vocabulary size = number of unique terms in the training data
vocabulary_size = len(vocabulary)

print(f"vocabulary is composed of {vocabulary_size} unique words")

vocabulary is composed of 10205 unique words


## Integer encoding representation

To train our models, the text data needs to be converted into integers. The text is encoded using the Keras `Tokenizer`, which returns sequences of integers in which each number is the index of a word in the tokenizer's dictionary (`word_index`).

Encoding train data

In [7]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_encoded = tokenizer.texts_to_sequences(X_train)

Encoding test data

In [8]:
X_test_encoded = tokenizer.texts_to_sequences(X_test)
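
To illustrate what the encoding step does, here is a minimal sketch in plain Python that mimics the behaviour of the Keras `Tokenizer` as we understand it: words are indexed from 1 in order of decreasing frequency, and words never seen during fitting are skipped at encoding time (when no `oov_token` is set). The helpers `build_word_index` and `texts_to_sequences` and the toy sentences are hypothetical and only serve as an example; the real `Tokenizer` also strips punctuation, which is omitted here.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so it can later serve as the padding value.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Encode texts as integer lists, silently skipping unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

train = ["we cut taxes", "we created jobs"]   # toy sentences, not from the dataset
test = ["we raised taxes"]

word_index = build_word_index(train)
print(word_index)
print(texts_to_sequences(test, word_index))   # "raised" is unknown and is dropped
```

Because the index is built from the training sentences only, any word that appears solely in the test debates carries no information for the models.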

## Padding sequences

In [9]:
# Find the length of the longest encoded training sentence
max_length = max(len(sequence) for sequence in X_train_encoded)


In [10]:
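# Despite the name, these are zero-padded integer sequences; the embedding itself is learned inside each model
X_train_embedded = pad_sequences(X_train_encoded, padding='post', maxlen=max_length)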
print(X_train_embedded)

[[  783   148     0 ...     0     0     0]
 [  130   110   771 ...     0     0     0]
 [  462  2841    30 ...     0     0     0]
 ...
 [    2  6525    43 ...     0     0     0]
 [ 1245    49   566 ...     0     0     0]
 [10205   264     1 ...     0     0     0]]


### For training

In [11]:
X_train_embedded.shape

(18118, 65)

### For testing

In [12]:
X_test_embedded = pad_sequences(X_test_encoded, padding='post', maxlen=max_length)
print(X_test_embedded.shape)

(5344, 65)
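
The padding step can be sketched in plain Python as follows. This is only an illustration of what `pad_sequences(..., padding='post', maxlen=...)` does under its default `truncating='pre'` setting; `pad_post` is a hypothetical helper and the input lists are invented. It assumes `maxlen` is a positive integer.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value`; longer sequences keep only their last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # mirrors the default truncating='pre'
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[5, 2], [7, 8, 9, 4, 1]], maxlen=4))
# [[5, 2, 0, 0], [8, 9, 4, 1]]
```

Since `max_length` is computed on the training data only, a test sentence longer than 65 tokens would lose its first tokens in this setup.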


### For the labels

In [13]:
from sklearn.preprocessing import OneHotEncoder
# Dense output is needed for Keras; scikit-learn >= 1.2 names this argument `sparse_output`
one_hot_encoder = OneHotEncoder(sparse=False)
one_hot_encoder.fit(y_train.reshape(-1, 1))
y_encoded = one_hot_encoder.transform(y_train.reshape(-1, 1))

y_encoded.shape

(18118, 3)

In [14]:
y_encoded_test = one_hot_encoder.transform(y_test.reshape(-1,1))
y_encoded_test.shape

(5344, 3)

# Creating the models

While creating these models we monitor the training loss and the validation loss. If the training loss keeps decreasing while the validation loss keeps increasing, the model is overfitting and will not generalize well to new data.
One way of avoiding overfitting in deep learning models is to reduce the number of epochs or to set up early stopping (the Keras `EarlyStopping` callback). We use this notebook to check how the models respond; if necessary, early stopping will be defined when we perform word embedding using the GloVe model.
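
The idea behind early stopping can be sketched in plain Python. This is not the Keras implementation, only an illustration of the patience rule: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function `early_stopping_epoch` is hypothetical and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # never triggered: train for all epochs

made_up_val_losses = [0.80, 0.72, 0.74, 0.79, 0.85]  # invented values
print(early_stopping_epoch(made_up_val_losses, patience=2))  # 4
```

In Keras the same behaviour would typically be obtained by passing `EarlyStopping(monitor='val_loss', patience=2)` in the `callbacks` argument of `fit`.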

In [15]:
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.backend import clear_session

## Bidirectional LSTM 

In [16]:
model_bi = Sequential()
model_bi.add(Embedding(vocabulary_size+1, 97, input_length=max_length))
model_bi.add(Bidirectional(LSTM(100)))
model_bi.add(Dropout(0.5))
model_bi.add(Dense(97, activation = "relu"))
model_bi.add(Dense(3, activation='softmax'))
model_bi.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_bi.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 65, 97)            989982    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              158400    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 97)                19497     
                                                                 
 dense_1 (Dense)             (None, 3)                 294       
                                                                 
Total params: 1,168,173
Trainable params: 1,168,173
Non-trainable params: 0
______________________________________________
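
As a sanity check, the parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper functions are hypothetical names introduced for this example.

```python
def embedding_params(vocab_rows, dim):
    # One vector of size `dim` per row of the embedding matrix
    return vocab_rows * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(10205 + 1, 97),   # vocabulary_size + 1 rows
    "bidirectional": 2 * lstm_params(97, 100),      # forward and backward LSTM
    "dense": dense_params(200, 97),                 # 200 = 2 x 100 LSTM units
    "dense_1": dense_params(97, 3),
}
for name, n in counts.items():
    print(f"{name:>14}: {n:,}")
print(f"{'total':>14}: {sum(counts.values()):,}")
```

About 85% of the parameters sit in the embedding matrix, which is learned from scratch on roughly 18,000 training sentences.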

In [17]:
model_bi.fit(X_train_embedded,y_encoded, validation_split=0.2, epochs = 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x2c3ed3cee50>

In [18]:
predictions = model_bi.predict(X_test_embedded)
preds = one_hot_encoder.inverse_transform(predictions).reshape(-1,)
print(classification_report(y_test, preds, target_names=["NFS", "UFS", "CFS"]))


              precision    recall  f1-score   support

         NFS       0.73      0.90      0.81      3296
         UFS       0.37      0.20      0.26       623
         CFS       0.64      0.43      0.51      1425

    accuracy                           0.69      5344
   macro avg       0.58      0.51      0.53      5344
weighted avg       0.66      0.69      0.66      5344
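
The prediction cell above passes softmax probabilities to `one_hot_encoder.inverse_transform`. As far as we can tell, this amounts to taking the most probable class per row and mapping it back to the original label. A minimal sketch in plain Python, assuming the `Verdict` codes are -1, 0 and 1 for NFS, UFS and CFS (an assumption, since the codes are not shown in this notebook); `decode_predictions` is a hypothetical helper and the probabilities are invented.

```python
def decode_predictions(probabilities, categories):
    """Pick the category with the highest predicted probability per row."""
    labels = []
    for row in probabilities:
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(categories[best_column])
    return labels

categories = [-1, 0, 1]                              # assumed codes for NFS, UFS, CFS
softmax_rows = [[0.7, 0.1, 0.2], [0.2, 0.3, 0.5]]    # invented values
print(decode_predictions(softmax_rows, categories))  # [-1, 1]
```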



## Stacked Bi-LSTM

In [19]:
model_bi = Sequential()
model_bi.add(Embedding(vocabulary_size+1, 200, input_length=max_length))
model_bi.add(Dropout(0.2))
model_bi.add(Bidirectional(LSTM(100, return_sequences=True)))
model_bi.add(Bidirectional(LSTM(100)))
model_bi.add(Dropout(0.2))
model_bi.add(Dense(97, activation = "relu"))
model_bi.add(Dense(3, activation='softmax'))
model_bi.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_bi.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 65, 200)           2041200   
                                                                 
 dropout_1 (Dropout)         (None, 65, 200)           0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 65, 200)          240800    
 nal)                                                            
                                                                 
 bidirectional_2 (Bidirectio  (None, 200)              240800    
 nal)                                                            
                                                                 
 dropout_2 (Dropout)         (None, 200)               0         
                                                                 
 dense_2 (Dense)             (None, 97)               

In [20]:
model_bi.fit(X_train_embedded,y_encoded, validation_split=0.2, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2c3f7f42850>

In [21]:
predictions = model_bi.predict(X_test_embedded)
preds = one_hot_encoder.inverse_transform(predictions).reshape(-1,)
print(classification_report(y_test, preds, target_names=["NFS", "UFS", "CFS"]))


              precision    recall  f1-score   support

         NFS       0.75      0.85      0.80      3296
         UFS       0.33      0.28      0.30       623
         CFS       0.60      0.47      0.53      1425

    accuracy                           0.68      5344
   macro avg       0.56      0.53      0.54      5344
weighted avg       0.66      0.68      0.67      5344



## Convolutional Neural Network

In [22]:
# Conv1D and GlobalMaxPooling1D were already imported above; reset the Keras state before building the next model
clear_session()


In [23]:
embedding_dim = 100

model = Sequential()
model.add(Embedding(vocabulary_size+1, embedding_dim, input_length=max_length))
model.add(Conv1D(128, 10, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 65, 100)           1020600   
                                                                 
 conv1d (Conv1D)             (None, 56, 128)           128128    
                                                                 
 global_max_pooling1d (Globa  (None, 128)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dense_1 (Dense)             (None, 3)                 99        
                                                                 
Total params: 1,152,955
Trainable params: 1,152,955
Non-trainable params: 0
______________________________________________
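
The `conv1d` row of the summary above can be checked by hand. The sketch below assumes the Keras defaults for `Conv1D` (no padding, stride 1); both helper functions are hypothetical names used only for this example.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    # 'valid' padding: the kernel must fit entirely inside the sequence
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) plus a bias per filter
    return kernel_size * in_channels * filters + filters

print(conv1d_output_length(65, 10))   # 56 time steps
print(conv1d_params(100, 128, 10))    # 128128 parameters
```

Each of the 128 filters therefore looks at windows of 10 consecutive words, and `GlobalMaxPooling1D` keeps only the strongest response of each filter across the 56 positions.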

In [24]:
model.fit(X_train_embedded,y_encoded, validation_split=0.2, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2c406bb1670>

In [25]:
predictions = model.predict(X_test_embedded)
preds = one_hot_encoder.inverse_transform(predictions).reshape(-1,)
print(classification_report(y_test, preds, target_names=["NFS", "UFS", "CFS"]))

              precision    recall  f1-score   support

         NFS       0.74      0.87      0.80      3296
         UFS       0.33      0.25      0.28       623
         CFS       0.60      0.43      0.50      1425

    accuracy                           0.68      5344
   macro avg       0.56      0.52      0.53      5344
weighted avg       0.66      0.68      0.66      5344



## Convolutional Neural network + LSTM

In [34]:
model_conv = Sequential()
model_conv.add(Embedding(vocabulary_size+1, 100, input_length=max_length))
model_conv.add(Dropout(0.2))
model_conv.add(Conv1D(100, 8, activation='relu'))
model_conv.add(MaxPooling1D(pool_size=10))
model_conv.add(LSTM(100))
model_conv.add(Dense(32, activation = "relu"))
model_conv.add(Dense(3, activation='softmax'))
model_conv.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
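
The pooling layer in the model above shortens the sequence considerably before it reaches the LSTM: the convolution with kernel size 8 leaves 65 - 8 + 1 = 58 time steps, and pooling with `pool_size=10` reduces these to 5. A minimal sketch in plain Python, assuming the Keras defaults for `MaxPooling1D` (stride equal to `pool_size`, no padding); `max_pool_1d` is a hypothetical helper and the input is a stand-in for one feature channel.

```python
def max_pool_1d(values, pool_size):
    """Non-overlapping max pooling; a trailing partial window is discarded."""
    n_windows = len(values) // pool_size
    return [max(values[i * pool_size:(i + 1) * pool_size])
            for i in range(n_windows)]

conv_steps = list(range(58))            # stand-in for the 58 steps after Conv1D(100, 8)
pooled = max_pool_1d(conv_steps, 10)
print(len(pooled), pooled)              # 5 [9, 19, 29, 39, 49]
```

Under these defaults the last 8 of the 58 time steps do not fill a complete window and are dropped, so the LSTM only sees 5 steps per sentence.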

In [35]:
model_conv.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 65, 100)           1020600   
                                                                 
 dropout_1 (Dropout)         (None, 65, 100)           0         
                                                                 
 conv1d_2 (Conv1D)           (None, 58, 100)           80100     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 5, 100)           0         
 1D)                                                             
                                                                 
 lstm_1 (LSTM)               (None, 100)               80400     
                                                                 
 dense_4 (Dense)             (None, 32)                3232      
                                                      

In [36]:
model_conv.fit(X_train_embedded,y_encoded, validation_split=0.2, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2c407481760>

In [37]:
predictions = model_conv.predict(X_test_embedded)
preds = one_hot_encoder.inverse_transform(predictions).reshape(-1,)
print(classification_report(y_test, preds, target_names=["NFS", "UFS", "CFS"]))


              precision    recall  f1-score   support

         NFS       0.77      0.84      0.80      3296
         UFS       0.29      0.21      0.25       623
         CFS       0.56      0.51      0.54      1425

    accuracy                           0.68      5344
   macro avg       0.54      0.52      0.53      5344
weighted avg       0.66      0.68      0.67      5344



As we can see, the models performed similarly to or worse than our baseline model. Among the options, both bidirectional models seemed to perform slightly better than the convolutional models. However, the improvement was not significant enough to justify using such models over a simpler, less expensive model such as the SVM.
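
To put the four models side by side, the F1 averages from the classification reports above can be collected in a small table. The numbers below are copied from those reports; the baseline scores are not part of this notebook and are therefore not included.

```python
# Macro and weighted F1 scores copied from the classification reports above
reported_f1 = {
    "Bidirectional LSTM": {"macro": 0.53, "weighted": 0.66},
    "Stacked Bi-LSTM":    {"macro": 0.54, "weighted": 0.67},
    "CNN":                {"macro": 0.53, "weighted": 0.66},
    "CNN + LSTM":         {"macro": 0.53, "weighted": 0.67},
}

spreads = {}
for average in ("macro", "weighted"):
    scores = [row[average] for row in reported_f1.values()]
    spreads[average] = round(max(scores) - min(scores), 2)
    print(f"{average:>8} F1: min={min(scores):.2f} "
          f"max={max(scores):.2f} spread={spreads[average]:.2f}")
```

On both averages the four architectures differ by at most 0.01, which is likely within the run-to-run variation of training.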