# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Model Training

In [1]:
import pandas as pd
import numpy as np
import altair as alt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from tensorflow import keras
from keras.models import Model
from keras.layers import Embedding, Flatten, Dense, LSTM, Conv1D, Flatten, MaxPooling1D, Dropout, Bidirectional, Input, Concatenate

import pickle
from gensim.models import KeyedVectors

NUM_WORDS=3000
MAX_SEQUENCE_LEN = 57
NUM_CAT = 20

In [2]:
def prepareFeatures(X):
    '''
    This function gets the features and modifies the features in a way to be able to train neural network
    using encoded categorical data and tokenized text.
    Returns numpy arrays.
    '''
    X_token = np.array(X["token"].apply(np.asarray))
    X_token = np.array([arr for arr in X_token])

    X_enc = np.array(X.drop(["token"], axis=1).apply(np.array))

    return X_token, X_enc

def prepareTarget(y):
    '''
    This function returns the target data as a numpy array.
    '''
    return np.array(y)

def visualizeHistory(history):
    '''
    This function gets keras.History object and plots loss, validation loss, precision and validation precision.
    Returns altair hconcat object containing the charts.
    '''

    l, p, v_l, v_p = history.history.keys()

    data = pd.DataFrame({"epoch": history.epoch,
            "loss": history.history[l],
            "val_loss": history.history[v_l],
            "precision": history.history[p],
            "val_precision": history.history[v_p]})
    
    loss_min = min(data["loss"].min(), data["val_loss"].min())
    loss_max = max(data["loss"].max(), data["val_loss"].max())

    precision_min = min(data["precision"].min(), data["val_precision"].min())
    precision_max = max(data["precision"].max(), data["val_precision"].max())

    data_melted = data.melt('epoch', value_vars=['loss', 'val_loss', 'precision', 'val_precision'], var_name='type', value_name='value')
    
    data_loss = data_melted[data_melted["type"].isin(["loss", "val_loss"])]
    loss = alt.Chart(data_loss).mark_line().encode(
        x = "epoch",
        y = alt.Y("value", scale = alt.Scale(domain=[loss_min, loss_max])),
        color = alt.Color("type", legend=alt.Legend(orient="right"))
    ).properties(
        title = "Training and Validation Loss over epochs"
    )

    data_precision = data_melted[data_melted["type"].isin(["precision", "val_precision"])]
    precision = alt.Chart(data_precision).mark_line().encode(
        x = "epoch",
        y = alt.Y("value", scale = alt.Scale(domain=[precision_min, precision_max])),
        color = alt.Color("type", legend=alt.Legend(orient="right"))
    ).properties(
        title = "Training and Validation Precision over epochs"
    )

    return alt.hconcat(loss, precision).resolve_scale(color="independent")


def performanceReport(model, X_train, y_train, X_val, y_val):
    '''
    This function gets a model, training and validation data.
    It predicts training and validation data, compares it to the true data and prints out classification report for both, training and validation.
    '''
    y_pred_train = (model.predict(X_train) > 0.5).astype(int)
    y_pred_val = (model.predict(X_val) > 0.5).astype(int)

    print("\nClassifcation Report of Performance on Training data")
    print(classification_report(y_train, y_pred_train))
    
    print("\n")
    print("* "*10)

    print("\nClassifcation Report of Performance on Validation data")
    print(classification_report(y_val, y_pred_val))

This section contains the model training. Different types of models should be trained and then compared to find out which model fits the challenge, to determine whether a political statement was fake-news or true, best. There are four types of models to be compared: a feedforward neural network, a Long Short-Term Memory, a bidirectional Long Short-Term Memory and Convolutional Neural Network. Those models will vary in terms of layers and hyperparameters still trying to keep them rather simple. All models are trained using the encoded categorical data of 'channel' and the tokenized statements including stop words. All models are then trained using 20 epochs and a batch size of 128. 

The best model is evaluated based on training and validation performance. At the end, the best two models are chosen and will be optimized in the next section.

### Prepare data for training and validation

As the training and validation data is the same for every model, the preparation of the preprocessed data resulting in a structure able to train different kinds of neural networks, is only needed to be done once.

In [3]:
data = pd.read_json("data/processed.json", orient="records", lines=True)

In [4]:
data.columns

Index(['statement', 'channel_Instagram', 'channel_Other', 'channel_TV',
       'channel_TikTok', 'channel_X', 'channel_ad', 'channel_article',
       'channel_blog', 'channel_campaign', 'channel_debate',
       'channel_interview', 'channel_lecture', 'channel_mail',
       'channel_podcast', 'channel_presentation', 'channel_press',
       'channel_social media', 'channel_speech', 'channel_talk',
       'channel_video', 'truth', 'token', 'statement_stop', 'token_stop'],
      dtype='object')

Before starting defining the different models, the data is prepared for the training process. The neural network to be trained only takes numpy arrays as input. Thus, the data currently saved as a pandas DataFrame is converted in to a numpy array. In this conversion process, only "token", the encoded channel columns and "truth" are kept.

In [5]:
X = data.drop(["statement", "statement_stop", "token_stop", "truth"], axis=1)
y = data["truth"]

In [6]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=42)

After splitting the data into features and target, the features still have to preprared for training by splitting the encoded categorical data from the tokenized and padded statements. The statement data has to be taken care of using an Embedding Layer while a Dense layer is sufficient to handle the encoded categorical data.

In [7]:
X_train_token, X_train_enc = prepareFeatures(X_train)
X_val_token, X_val_enc = prepareFeatures(X_val)
y_train = prepareTarget(y_train)
y_val = prepareTarget(y_val)

### Prepare infrastructure

Still, the data despite being prepared to fit the structure of neural networks, is not ready for training yet. The tokenized statements saved in "X_train_token" and "X_val_token" need to be transformed into a embedding matrix which assigns every word / token a vector. This vector represents the word in a multi-dimensional vector space and models the relationship between different words.

A pre-trained word embedding from FastText which already contains the vectors for each word is used.

In [8]:
word2vec = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

Also, to remember the words which are placed behind each token, the trained 'tokenizer'-object of section [data pre-processing](03_data-understanding.ipynb) is imported.

In [9]:
with open("tokenizer/tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

The following code creates an embedding matrix which assigns every word their respective vector as saved in the pre-trained word embedding of FastText.

In [10]:
embedding_dim = 300  
word_index = tokenizer.word_index 
num_words = min(len(word_index) + 1, NUM_WORDS)  

embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words:
        if word in word2vec.key_to_index:
            embedding_vector = word2vec[word]
            embedding_matrix[i] = embedding_vector

For every kind of model which is going to be trained in the following, two input layers are defined. Those two input layers are the same used for every kind of model as the input data doesn't vary between types of models.

There is one input layer for tokenized text data and another input layer for encoded categorical data.

In [11]:
text_input = Input(shape=(MAX_SEQUENCE_LEN,), name="text_input")
categorical_input = Input(shape=(NUM_CAT,), name="categorical_input")

Using the input layer and embedding matrix, an Embedding layer can be set up.

In [12]:
emb = Embedding(NUM_WORDS, embedding_dim, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LEN, trainable=False)(text_input)

To handle the encoded categorical data, a dense layer is sufficient.

In [13]:
cat = Dense(32, activation="relu")(categorical_input)

## Model training

When looking at fake-news detection, it is decided which metric should be optimized. Classification provide a lots of useful metrics which have to be chosen for each project individually. The most common and known metrics are accuracy, precision, recall and f1-score.

Most often, accuracy is not a good metric as it doesn't take into account the cost of predicting errors. That's why either precision or recall should be used.

The worst case at fake-news is when a fake-news statement is not identified as fake-news but as true. Whereas the other way, a true statement being classified as fake-news statement does not harm in the same way. Translating this into the terms of this project means a false positive ("a statement which is 'fake' (0) gets classified as 'true' (1)") is worse than a false negative ("a statement which is 'true' (1) gets classified as 'false' (0)"). The metrics focusing on optimizing the false positives is precision. Therefore, precision is used when trying to chose and optimize a fake-news classification model.

In the following, four different types models are trained and evaluated. Based on those evaluations, the best model is chosen. 
The next section [optimization](07_evaluation-optimization.ipynb) handles feature extraction and hyperparameter tuning of the chosen model.

### Feedforward Neural Nerwork

The first model to be trained is a simple feedforward neural network. The feedforward neural network consists of Dense and Dropout layers which make the architecture quite easy and not too complex.

In [14]:
ff_flatten_text = Flatten()(emb)

ff_combined = Concatenate()([ff_flatten_text, cat])
ff_dense1 = Dense(128, activation="relu")(ff_combined)
ff_drop = Dropout(0.4)(ff_dense1)
ff_dense2 = Dense(32, activation="relu")(ff_drop)
ff_output = Dense(1, activation="sigmoid")(ff_dense2)

In [15]:
ff = Model(inputs=[categorical_input, text_input], outputs=ff_output)
ff.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text_input (InputLayer)        [(None, 57)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 57, 300)      900000      ['text_input[0][0]']             
                                                                                                  
 categorical_input (InputLayer)  [(None, 20)]        0           []                               
                                                                                                  
 flatten (Flatten)              (None, 17100)        0           ['embedding[0][0]']              
                                                                                              

In [16]:
ff.compile(optimizer="sgd", loss="binary_crossentropy", metrics=[keras.metrics.Precision()])

In [17]:
ff_hist = ff.fit([X_train_enc, X_train_token], y_train, epochs=20, batch_size=128, validation_data=([X_val_enc, X_val_token], y_val))

Epoch 1/20


2025-01-18 16:29:46.467188: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


After training the feedforward neural network, there are some things standing out.

As seen in the visualizatons below, both the training and validation loss decline per epoch. Same goes for precision which inclines per epoch. Both metrics show signs of overfitting as the training metrics perform significantly better compared to the validation metrics. Overfitting should be avoided as the performance on new, unseen data is significantly worse than the performance on training data.

In [18]:
visualizeHistory(ff_hist)

When comparing the results of the classification report, the signs of overfitting still remain. Although the model performs already quite good with a weighted precision on the training data of 0.89, it definitely needs to be optimized as the weighted precision on the validation data is at only 0.81. Due to the simple architecture of the model, the feedforward neural networks is considered when evaluating the best model.

In [19]:
performanceReport(ff, [X_train_enc, X_train_token], y_train, [X_val_enc, X_val_token], y_val)


Classifcation Report of Performance on Training data
              precision    recall  f1-score   support

           0       0.80      0.76      0.78      7000
           1       0.77      0.81      0.79      7109

    accuracy                           0.78     14109
   macro avg       0.79      0.78      0.78     14109
weighted avg       0.79      0.78      0.78     14109



* * * * * * * * * * 

Classifcation Report of Performance on Validation data
              precision    recall  f1-score   support

           0       0.76      0.72      0.74      3078
           1       0.72      0.77      0.75      2969

    accuracy                           0.74      6047
   macro avg       0.74      0.74      0.74      6047
weighted avg       0.74      0.74      0.74      6047



### LSTM

Next type is a Long Short-Term Memory Neural Network. This type of neural network ...

In [20]:
lstm_ = LSTM(128)(emb)

In [21]:
lstm_combined = Concatenate()([lstm_, cat])

In [22]:
lstm_dense1 = Dense(128, activation='relu')(lstm_combined)
lstm_drop1 = Dropout(0.4)(lstm_dense1)
lstm_dense2 = Dense(64, activation='relu')(lstm_drop1)
lstm_drop2 = Dropout(0.4)(lstm_dense2)
lstm_output = Dense(1, activation='sigmoid')(lstm_drop2)

In [23]:
lstm = Model(inputs=[categorical_input, text_input], outputs=lstm_output)
lstm.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text_input (InputLayer)        [(None, 57)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 57, 300)      900000      ['text_input[0][0]']             
                                                                                                  
 categorical_input (InputLayer)  [(None, 20)]        0           []                               
                                                                                                  
 lstm (LSTM)                    (None, 64)           93440       ['embedding[0][0]']              
                                                                                            

In [24]:
lstm.compile(optimizer="sgd", loss="binary_crossentropy", metrics=[keras.metrics.Precision()])

In [25]:
lstm_hist = lstm.fit([X_train_enc, X_train_token], y_train, batch_size=128, epochs=20, validation_data=([X_val_enc, X_val_token], y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [26]:
visualizeHistory(lstm_hist)

In [27]:
performanceReport(lstm, [X_train_enc, X_train_token], y_train, [X_val_enc, X_val_token], y_val)


Classifcation Report of Performance on Training data
              precision    recall  f1-score   support

           0       0.74      0.51      0.60      7000
           1       0.63      0.83      0.72      7109

    accuracy                           0.67     14109
   macro avg       0.69      0.67      0.66     14109
weighted avg       0.69      0.67      0.66     14109



* * * * * * * * * * 

Classifcation Report of Performance on Validation data
              precision    recall  f1-score   support

           0       0.75      0.52      0.61      3078
           1       0.62      0.82      0.71      2969

    accuracy                           0.67      6047
   macro avg       0.69      0.67      0.66      6047
weighted avg       0.69      0.67      0.66      6047



### Bi-directional LSTM

In [28]:
blstm_ = Bidirectional(LSTM(128))(emb)

In [29]:
blstm_combined = Concatenate()([blstm_, cat])

In [30]:
blstm_dense1 = Dense(128, activation='relu')(blstm_combined)
blstm_drop1 = Dropout(0.4)(blstm_dense1)
blstm_dense2 = Dense(32, activation='relu')(blstm_drop1)
blstm_drop2 = Dropout(0.4)(blstm_dense2)
blstm_output = Dense(1, activation='sigmoid')(blstm_drop2)

In [31]:
blstm = Model(inputs=[categorical_input, text_input], outputs=blstm_output)
blstm.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text_input (InputLayer)        [(None, 57)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 57, 300)      900000      ['text_input[0][0]']             
                                                                                                  
 categorical_input (InputLayer)  [(None, 20)]        0           []                               
                                                                                                  
 bidirectional (Bidirectional)  (None, 256)          439296      ['embedding[0][0]']              
                                                                                            

In [32]:
blstm.compile(optimizer="sgd", loss="binary_crossentropy", metrics=[keras.metrics.Precision()])

In [33]:
blstm_hist = blstm.fit([X_train_enc, X_train_token], y_train, batch_size=128, epochs=20, validation_data=([X_val_enc, X_val_token], y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [34]:
visualizeHistory(blstm_hist)

In [35]:
performanceReport(blstm, [X_train_enc, X_train_token], y_train, [X_val_enc, X_val_token], y_val)


Classifcation Report of Performance on Training data
              precision    recall  f1-score   support

           0       0.74      0.54      0.62      7000
           1       0.64      0.81      0.72      7109

    accuracy                           0.68     14109
   macro avg       0.69      0.68      0.67     14109
weighted avg       0.69      0.68      0.67     14109



* * * * * * * * * * 

Classifcation Report of Performance on Validation data
              precision    recall  f1-score   support

           0       0.74      0.54      0.62      3078
           1       0.63      0.80      0.70      2969

    accuracy                           0.67      6047
   macro avg       0.68      0.67      0.66      6047
weighted avg       0.68      0.67      0.66      6047



### Convolutional Neural Network

In [36]:
cnn_ = Conv1D(filters=128, kernel_size=5, activation='relu')(emb)
cnn_maxpool = MaxPooling1D(pool_size=5)(cnn_)

In [37]:
cnn_flatten_text = Flatten()(cnn_)

cnn_combined = Concatenate()([cnn_flatten_text, cat])
cnn_flatten = Flatten()(cnn_maxpool)
cnn_dense1 = Dense(128, activation="relu")(cnn_flatten)
cnn_drop = Dropout(0.4)(cnn_dense1)
cnn_dense2 = Dense(32, activation="relu")(cnn_flatten)
cnn_drop2 = Dropout(0.4)(cnn_dense2)
cnn_output = Dense(1, activation="sigmoid")(cnn_drop2)

In [38]:
cnn = Model(inputs=[categorical_input, text_input], outputs=cnn_output)
cnn.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text_input (InputLayer)        [(None, 57)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 57, 300)      900000      ['text_input[0][0]']             
                                                                                                  
 conv1d (Conv1D)                (None, 53, 128)      192128      ['embedding[0][0]']              
                                                                                                  
 max_pooling1d (MaxPooling1D)   (None, 10, 128)      0           ['conv1d[0][0]']                 
                                                                                            

In [39]:
cnn.compile(optimizer="sgd", loss="binary_crossentropy", metrics=[keras.metrics.Precision()])

In [40]:
cnn_hist = cnn.fit([X_train_enc, X_train_token], y_train, batch_size=128, epochs=20, validation_data=([X_val_enc, X_val_token], y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [41]:
visualizeHistory(cnn_hist)

In [42]:
performanceReport(cnn, [X_train_enc, X_train_token], y_train, [X_val_enc, X_val_token], y_val)


Classifcation Report of Performance on Training data
              precision    recall  f1-score   support

           0       0.62      0.63      0.63      7000
           1       0.63      0.62      0.62      7109

    accuracy                           0.62     14109
   macro avg       0.63      0.63      0.62     14109
weighted avg       0.63      0.62      0.62     14109



* * * * * * * * * * 

Classifcation Report of Performance on Validation data
              precision    recall  f1-score   support

           0       0.62      0.61      0.61      3078
           1       0.60      0.61      0.61      2969

    accuracy                           0.61      6047
   macro avg       0.61      0.61      0.61      6047
weighted avg       0.61      0.61      0.61      6047



## Evaluation and Optimization

This following section contains the optimization and further evaluation of chosen models from the previous  section. The model(s) are optimized in hyperparameters and feature extraction. 
Currently, the models were trained using the tokenized data including stop words and the encoded categorical channel columns. Feature extraction is going to decide which features are needed to achieve the best results. 

Focus in optimization and feature extraction are to have less complicity combined with only using features contributing to improving a models overall performance.
At the end, the best found model should be able to master the task of fake-news classification based on the test set which consist of the famous LIAR dataset.

### Import and prepare test data