# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Baseline model - Logistic Regression

To-Do:
- Create baseline model using neural network

In [13]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow import keras
import pickle

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

To be able to compare the results of the machine learning model that will be trained later, a baseline model is implemented. The baseline acts as a reference and is implemented without further exploration, discussion, or optimization.

As the task to solve is a classification task, a logistic regression is trained and evaluated.
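As a reminder of what a logistic regression computes for a single sample, the following minimal sketch evaluates the sigmoid of a weighted sum and thresholds it at 0.5. The feature vector, weights, and bias are made-up illustration values, and `predict_proba` is a hypothetical helper, not scikit-learn's method of the same name.

```python
import math


def predict_proba(features, weights, bias):
    """Return P(y = 1 | x) for one sample under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical one-hot encoded features and weights, for illustration only.
features = [0, 1, 0]
weights = [0.4, -1.2, 0.3]
bias = 0.2

probability = predict_proba(features, weights, bias)
label = int(probability >= 0.5)  # 1 = predicted "true", 0 = predicted "false"
print(f"probability={probability:.3f}, label={label}")
```

With one-hot encoded inputs, as used in this notebook, the weighted sum reduces to the bias plus the weights of the active categories.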

### Prepare data for model training

First, the data that was preprocessed in the previous notebook is imported. It still needs to be adjusted before a Logistic Regression model can be trained on it.

In [2]:
data = pd.read_csv("data/processed.csv", sep=";", index_col=0)

In [3]:
data.head()

| Unnamed: 0 | statement | issue_2018-california-governors-race | issue_2024-senate-elections | issue_Alcohol | issue_abc-news-week | issue_abortion | issue_ad-watch | issue_afghanistan | issue_after-the-fact | issue_agriculture | ... | channel_mail | channel_podcast | channel_presentation | channel_press | channel_social media | channel_speech | channel_talk | channel_video | truth | token |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | says sen bob casey dpa is trying to change the... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | says the election results are suspicious becau... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | a ballot dump around 4 am in milwaukee shows t... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | kari lake is threatening social security and m... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | republican senate candidate sam brown wants to... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


Some columns cannot be used for a Logistic Regression model.
Thus, the columns "statement" and "token" have to be dropped from the dataset. Even though the final machine learning model will mostly rely on the statements to determine whether they are true or false, the baseline model should not only serve as a reference but also evaluate the impact of "channel" and "issue" on "truth".

In [4]:
data_dropped = data.drop(["statement", "token"], axis=1)

What is left are the encoded columns of the original 'channel' and 'issue' as well as the target variable 'truth'. Using only this information, a Logistic Regression model is trained with 70% of the data as training data.

In [26]:
X_log = data_dropped.drop(["truth"], axis=1)
y_log = data_dropped["truth"]

In [27]:
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, train_size=0.7, random_state=42)

### Train model

After separating the data into X (features) and y (target), it was split into training and test sets. Now the model is initialized and trained using only the training data. As the features carry little information, the model is expected to reach an accuracy of only about 55 to 60%.

In [28]:
log = LogisticRegression()
log.fit(X_train_log, y_train_log)

### Evaluate model

The model is evaluated first on the training set and then on the test set. A classification report provides insight into the performance of the model, which will serve as the reference for the neural network.

In [29]:
results_train = pd.DataFrame(y_train_log.values, columns=["true"])
results_train["predicted"] = log.predict(X_train_log)

results_train["correct"] = results_train["true"] == results_train["predicted"]

In [30]:
results_train[["correct"]].value_counts()

correct
True       9887
False      4216
dtype: int64

In [31]:
print("Classification Report of training data:")
print(classification_report(results_train["true"], results_train["predicted"]))

Classification Report of training data:
              precision    recall  f1-score   support

           0       0.75      0.60      0.67      7023
           1       0.67      0.80      0.73      7080

    accuracy                           0.70     14103
   macro avg       0.71      0.70      0.70     14103
weighted avg       0.71      0.70      0.70     14103



In [32]:
results_test = pd.DataFrame(y_test_log.values, columns=["true"])
results_test["predicted"] = log.predict(X_test_log)

results_test["correct"] = results_test["true"] == results_test["predicted"]

In [33]:
results_test[["correct"]].value_counts()

correct
True       4226
False      1819
dtype: int64

In [34]:
print("Classification Report of test data:")
print(classification_report(results_test["true"], results_test["predicted"]))

Classification Report of test data:
              precision    recall  f1-score   support

           0       0.75      0.61      0.67      3051
           1       0.66      0.79      0.72      2994

    accuracy                           0.70      6045
   macro avg       0.71      0.70      0.70      6045
weighted avg       0.71      0.70      0.70      6045
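To put the reported accuracy into context, the counts from the `value_counts()` outputs and the class supports from the two classification reports can be combined by hand. The sketch below only re-uses numbers printed above; `summarize` is a hypothetical helper, and the "majority-class rate" is the accuracy a model would reach by always predicting the more frequent class of that split.

```python
# Counts copied from the value_counts() outputs and classification reports above.
counts = {
    "train": {"correct": 9887, "incorrect": 4216, "majority_support": 7080},
    "test": {"correct": 4226, "incorrect": 1819, "majority_support": 3051},
}


def summarize(split):
    """Return sample count, accuracy, and majority-class rate for one split."""
    total = split["correct"] + split["incorrect"]
    accuracy = split["correct"] / total
    majority_rate = split["majority_support"] / total
    return total, accuracy, majority_rate


for name, split in counts.items():
    total, accuracy, majority_rate = summarize(split)
    print(f"{name}: n={total}, accuracy={accuracy:.3f}, "
          f"majority-class rate={majority_rate:.3f}")
```

Both splits come out at roughly 0.70 accuracy, compared with roughly 0.50 for always predicting the majority class, and the small gap between training and test accuracy gives no sign of strong overfitting.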



## Baseline model - Neural Network (MLP)

As a first baseline, a Logistic Regression model was trained only on the categorical data, not on the statements themselves. Therefore, a second baseline model, a multi-layer perceptron, is initialized and trained using the tokenized and padded statements.

Before this can be done, the data has to be transformed into a useful data structure.

### Prepare and save data for training

The data is prepared for training by converting the tokenized statements into a NumPy array. Only "token" and "truth" are considered in this conversion; the encoded channel and issue columns are dropped for this baseline model.

In [35]:
X_mlp = np.array(data["token"].apply(np.array).to_list())
y_mlp = np.array(data["truth"])
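One caveat when loading list-valued columns from CSV: the lists are stored as text, so after `pd.read_csv` the `token` column most likely contains strings such as `"[0, 0, ...]"` rather than lists of integers. The `dtype=string` in the traceback at the end of this notebook is consistent with that. The following is a minimal sketch of how such a column could be parsed, assuming every statement was padded to the same length; `parse_token_column` is a hypothetical helper and the toy strings stand in for `data["token"]`.

```python
import ast

import numpy as np


def parse_token_column(values):
    """Convert string-encoded integer lists into a 2-D integer array."""
    rows = [ast.literal_eval(v) if isinstance(v, str) else list(v) for v in values]
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"expected equally padded sequences, got lengths {sorted(lengths)}")
    return np.asarray(rows, dtype=np.int64)


# Toy stand-in for data["token"] as it would look after pd.read_csv.
raw = ["[0, 0, 12, 7]", "[0, 3, 5, 9]", "[0, 0, 0, 4]"]
tokens = parse_token_column(raw)
print(tokens.shape, tokens.dtype)
```

Applied to the notebook's data, this would be called as `parse_token_column(data["token"])`. The result has the shape (number of statements, sequence length), which is the layout a Keras `Embedding` layer takes as input, so under this assumption no additional reshape would be needed.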

The data is also split into training and test data.

In [36]:
X_train_mlp, X_test_mlp, y_train_mlp, y_test_mlp = train_test_split(X_mlp, y_mlp, train_size=0.7, random_state=42)

In [54]:
X_train_mlp = X_train_mlp.reshape(-1, 1)
X_test_mlp = X_test_mlp.reshape(-1, 1)

Now, the structure of the MLP is defined. So far, all statements have been tokenized, which means every word is mapped to a number, and the resulting array of numbers represents the statement. The relationship between those numbers is still unknown. This is why an Embedding layer is needed, which maps each number representing a word to a multidimensional vector.
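Conceptually, an embedding layer is a lookup table: each token id selects one row of a matrix. The sketch below illustrates this with plain NumPy indexing. The sizes and random values are toy assumptions and unrelated to the notebook's data, and treating index 0 as padding is likewise an assumption made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4           # toy sizes, not the notebook's values
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
embedding_table[0] = 0.0                   # index 0 is treated as padding here

statement = np.array([0, 0, 3, 5])         # one padded, tokenized statement
embedded = embedding_table[statement]      # one row lookup per token id
print(embedded.shape)
```

Each of the four token ids is replaced by a vector of length `embedding_dim`, so a single statement becomes a matrix with one row per token.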

A pre-trained embedding from GloVe, a widely used collection of pre-trained word vectors, is used.

In [None]:
glove_file = "glove.840B.300d.txt"
glove2word2vec(glove_file, "word2vec.txt")



In [None]:
word2vec = KeyedVectors.load_word2vec_format("word2vec.txt")

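Next, the tokenizer is loaded and the embedding matrix is built: row i holds the GloVe vector of the word with tokenizer index i, and words without a pre-trained vector keep an all-zero row.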

In [None]:
with open("tokenizer/tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

In [None]:
embedding_dim = 300  
word_index = tokenizer.word_index 
num_words = min(len(word_index) + 1, 3000)  

embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words:
        if word in word2vec.key_to_index:
            embedding_vector = word2vec[word]
            embedding_matrix[i] = embedding_vector
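Because words without a pre-trained vector silently keep an all-zero row, it can be useful to check how much of the vocabulary is actually covered. The sketch below repeats the same loop on toy data and additionally collects the missing words. `build_embedding_matrix` is a hypothetical helper, and the plain dictionary `pretrained` is only a stand-in for the loaded `word2vec` object, for which the notebook's membership test `word in word2vec.key_to_index` would be used instead.

```python
import numpy as np


def build_embedding_matrix(word_index, pretrained, num_words, embedding_dim):
    """Return the embedding matrix and the in-range words without a vector."""
    matrix = np.zeros((num_words, embedding_dim))
    missing = []
    for word, i in word_index.items():
        if i >= num_words:
            continue  # word lies outside the vocabulary limit
        vector = pretrained.get(word)
        if vector is None:
            missing.append(word)
        else:
            matrix[i] = vector
    return matrix, missing


# Toy vocabulary and vectors, for illustration only.
word_index = {"election": 1, "senate": 2, "qwertzu": 3, "ballot": 4}
pretrained = {
    "election": np.array([0.1, 0.2, 0.3]),
    "senate": np.array([0.4, 0.5, 0.6]),
    "ballot": np.array([0.7, 0.8, 0.9]),
}
num_words = 4
matrix, missing = build_embedding_matrix(word_index, pretrained, num_words, embedding_dim=3)
in_range = sum(1 for i in word_index.values() if i < num_words)
coverage = 1 - len(missing) / in_range
print(f"coverage={coverage:.2f}, missing={missing}")
```

A low coverage value would suggest that many statements are represented mostly by zero vectors, which limits what any model on top of the embedding can learn.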

---

In [55]:
X_train_mlp.shape

(14103, 1)

In [56]:
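model = keras.Sequential()
# Frozen embedding layer initialized with the pre-trained GloVe vectors
model.add(keras.layers.Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=57, trainable=False))
model.add(keras.layers.LSTM(64))
# Single sigmoid unit for the binary target "truth"
model.add(keras.layers.Dense(1, activation='sigmoid'))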

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_8 (Dense)             (None, 64)                902656    
                                                                 
 dense_9 (Dense)             (None, 1)                 65        
                                                                 
Total params: 902,721
Trainable params: 902,721
Non-trainable params: 0
_________________________________________________________________


In [57]:
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

In [58]:
model.fit(X_train_mlp, y_train_mlp, epochs=20, batch_size=128, validation_data=(X_test_mlp, y_test_mlp))

Epoch 1/20


ValueError: in user code:

    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1146, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1135, in run_step  **
        outputs = model.train_step(data)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 993, in train_step
        y_pred = self(x, training=True)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/input_spec.py", line 277, in assert_input_compatibility
        raise ValueError(

    ValueError: Exception encountered when calling layer "sequential_6" "                 f"(type Sequential).
    
    Input 0 of layer "dense_8" is incompatible with the layer: expected axis -1 of input shape to have value 14103, but received input with shape (None, 1)
    
    Call arguments received by layer "sequential_6" "                 f"(type Sequential):
      • inputs=tf.Tensor(shape=(None, 1), dtype=string)
      • training=True
      • mask=None
