# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Baseline model - Logistic Regression

In [20]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow import keras
from tensorflow.keras.layers import Embedding, Dense, LSTM
import pickle

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

NUM_WORDS=3000
MAX_SEQUENCE_LEN = 57

To be able to assess the results of the machine learning model trained later, a baseline model is implemented first. The baseline acts as a reference and is built without further exploration, discussion, or optimization.

As the task to solve is a classification task, a logistic regression is trained and evaluated.
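To make this choice concrete, the following is a minimal sketch of how a logistic regression turns a feature vector into a class label. The weights, bias, and feature values are hypothetical and only illustrate the mechanism; scikit-learn's `LogisticRegression` learns these parameters from the training data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= threshold)


# Hypothetical parameters and input, chosen only for illustration.
example_weights = [0.5, -1.0]
example_bias = 0.0
probability, label = predict([1.0, 2.0], example_weights, example_bias)
print(f"P(class 1) = {probability:.3f}, predicted label = {label}")
```

The linear score is squashed into a probability, and the label follows from comparing it with a threshold of 0.5.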

### Prepare data for model training

First, the data preprocessed in the previous notebook is imported into this notebook. It still needs to be adjusted before a Logistic Regression model can be trained on it.

In [64]:
data = pd.read_json("data/processed.json", orient="records", lines=True)

In [95]:
X = np.array(data["token"].apply(np.array).to_list())
y = np.array(data["truth"])

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

### Train model

After separating the data into X (features) and y (target), it was split into training and test sets. Now the model is initialized and trained using only the training data. As the features carry little information, the expected accuracy is roughly 55-60%.

In [105]:
log = LogisticRegression(solver="newton-cholesky")
log.fit(X_train, y_train)

### Evaluate model

The model is evaluated first on the training set and then on the test set. A classification report provides insight into the performance of the model, which serves as the reference for the neural network.

In [106]:
results_train = pd.DataFrame(y_train, columns=["true"])
results_train["predicted"] = log.predict(X_train)

results_train["correct"] = results_train["true"] == results_train["predicted"]

In [107]:
results_train[["correct"]].value_counts()

correct
True       8156
False      5953
dtype: int64

In [108]:
print("Classification Report of training data:")
print(classification_report(results_train["true"], results_train["predicted"]))

Classification Report of training data:
              precision    recall  f1-score   support

           0       0.57      0.61      0.59      7000
           1       0.59      0.55      0.57      7109

    accuracy                           0.58     14109
   macro avg       0.58      0.58      0.58     14109
weighted avg       0.58      0.58      0.58     14109
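For context, it can help to compare this accuracy with a trivial reference: always predicting the most frequent class. The sketch below computes that value from the class supports printed in the training report; the helper name `majority_baseline_accuracy` is made up for this illustration.

```python
# Class supports copied from the training classification report above.
train_support = {0: 7000, 1: 7109}


def majority_baseline_accuracy(support):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(support.values())
    return max(support.values()) / total


baseline = majority_baseline_accuracy(train_support)
print(f"Majority-class accuracy on the training set: {baseline:.3f}")
```

Because the two classes are almost balanced, this trivial reference sits close to 50%, so the reported 0.58 is a modest improvement over it.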



In [109]:
results_test = pd.DataFrame(y_test, columns=["true"])
results_test["predicted"] = log.predict(X_test)

results_test["correct"] = results_test["true"] == results_test["predicted"]

In [110]:
results_test[["correct"]].value_counts()

correct
True       3396
False      2651
dtype: int64

In [111]:
print("Classification Report of test data:")
print(classification_report(results_test["true"], results_test["predicted"]))

Classification Report of test data:
              precision    recall  f1-score   support

           0       0.57      0.59      0.58      3078
           1       0.56      0.53      0.54      2969

    accuracy                           0.56      6047
   macro avg       0.56      0.56      0.56      6047
weighted avg       0.56      0.56      0.56      6047
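The accuracy rows in both reports follow directly from the `correct` value counts shown earlier. As a small worked check, the sketch below recomputes both accuracies and the gap between them from those counts.

```python
# (correct, incorrect) counts copied from the value_counts() outputs above.
counts = {
    "train": (8156, 5953),
    "test": (3396, 2651),
}


def accuracy(correct, incorrect):
    """Share of correctly classified samples."""
    return correct / (correct + incorrect)


train_acc = accuracy(*counts["train"])
test_acc = accuracy(*counts["test"])
gap = train_acc - test_acc

print(f"train accuracy: {train_acc:.4f}")
print(f"test accuracy:  {test_acc:.4f}")
print(f"train-test gap: {gap:.4f}")
```

A small gap between training and test accuracy suggests that the baseline is not overfitting noticeably; it is simply limited by its features.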



## Baseline model - Neural Network (LSTM)

As a first baseline, a Logistic Regression model was trained on only the categorical data, not on the statements themselves. That is why a second baseline model, an LSTM network, is initialized and trained using the tokenized and padded statements.

Before this can be done, the data has to be transformed into a useful data structure.

### Prepare and save data for training

The data is prepared for the training process by converting the tokenized statements into a numpy array. In this conversion, only "token" and "truth" are considered; the encoded channel issue columns are dropped for this baseline model.

In addition, the data is split into training and test data.
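The conversion into a single numpy array only works when every statement has the same length. The padding itself happened in the previous notebook, so the sketch below is an assumption about what that step looks like: it pads with zeros at the end and truncates overly long sequences, which may differ from the strategy actually used. The helper `pad_to_length` and the toy sequences are hypothetical.

```python
import numpy as np

MAX_SEQUENCE_LEN = 57  # same constant as defined at the top of this notebook


def pad_to_length(sequences, max_len=MAX_SEQUENCE_LEN, pad_value=0):
    """Pad (or truncate) lists of token ids to a fixed length.

    Assumption: zeros are appended at the end ("post" padding).
    """
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, sequence in enumerate(sequences):
        trimmed = sequence[:max_len]
        padded[row, : len(trimmed)] = trimmed
    return padded


# Toy token sequences of different lengths, for illustration only.
toy_sequences = [[12, 7, 301], [5], list(range(1, 80))]
X_toy = pad_to_length(toy_sequences)
print(X_toy.shape)
```

After this step every row has exactly `MAX_SEQUENCE_LEN` entries, so the rows stack into a rectangular array that both scikit-learn and Keras accept.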

Now the structure of the LSTM network is defined. So far, all statements have been tokenized, which means every word is mapped to an integer. The resulting array of integers represents the statement. At this point, the relationship between those integers is unknown. This is why an Embedding layer is needed: it maps each integer representing a word to a multidimensional vector.
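Conceptually, an embedding layer is a lookup table indexed by token id. The following sketch mimics that lookup with plain numpy; the table is filled with random numbers and the sizes are tiny, both purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the notebook uses 3000 words and 300 dimensions.
vocab_size, embedding_dim = 6, 4
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
embedding_table[0] = 0.0  # assumption: index 0 is the padding token

# One padded statement: three real tokens followed by padding.
token_ids = np.array([3, 1, 5, 0, 0])

# The lookup replaces every integer with its row in the table.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (sequence length, embedding dimension)
```

Words with similar meaning end up with similar rows when the table comes from a pre-trained model, which is the information the raw integers lack.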

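A pre-trained embedding is used. [GloVe](https://nlp.stanford.edu/projects/glove/) is a well-known source of pre-trained word embeddings; note, however, that the file loaded in the next cell, `wiki-news-300d-1M.vec`, carries the name of the fastText English vectors, which ship in the word2vec text format that `KeyedVectors.load_word2vec_format` reads directly.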

In [69]:
word2vec = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

Embedding matrix

In [112]:
with open("tokenizer/tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

In [113]:
embedding_dim = 300  
word_index = tokenizer.word_index 
num_words = min(len(word_index) + 1, NUM_WORDS)  

embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words:
        if word in word2vec.key_to_index:
            embedding_vector = word2vec[word]
            embedding_matrix[i] = embedding_vector
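Words that are missing from the pre-trained vectors keep an all-zero row in the matrix, so it can be worth checking how many of them there are. The sketch below repeats the loop from the cell above on toy data and additionally collects the missing words. The function `build_embedding_matrix`, the toy vocabulary, and the plain dict standing in for the `KeyedVectors` object are all assumptions made for this illustration.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, num_words, embedding_dim):
    """Fill one row per word id and report words without a vector."""
    matrix = np.zeros((num_words, embedding_dim))
    missing = []
    for word, i in word_index.items():
        if i >= num_words:
            continue  # word is outside the vocabulary limit
        if word in vectors:
            matrix[i] = vectors[word]
        else:
            missing.append(word)  # row stays all zeros
    return matrix, missing


# Toy stand-ins for tokenizer.word_index and the pre-trained vectors.
toy_word_index = {"the": 1, "tax": 2, "obamacare": 3, "rare": 4}
toy_vectors = {"the": np.ones(3), "tax": np.full(3, 2.0)}

toy_matrix, toy_missing = build_embedding_matrix(
    toy_word_index, toy_vectors, num_words=4, embedding_dim=3
)
# Index 0 is reserved for padding, so num_words - 1 real words are considered.
coverage = 1 - len(toy_missing) / (4 - 1)
print(f"coverage: {coverage:.2f}, missing: {toy_missing}")
```

A low coverage would mean that many statements are fed to the LSTM as mostly zero vectors, which limits what the frozen embedding can contribute.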

In [114]:
model = keras.Sequential()
model.add(Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LEN, trainable=False))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))

In [115]:
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

In [116]:
history = model.fit(X_train, y_train, epochs=20, batch_size=128, validation_data=(X_test, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x3a15743d0>
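The returned `History` object stores the per-epoch metrics in its `history` attribute, a dict of lists keyed by metric name (with the compile settings above: `loss`, `accuracy`, `val_loss`, `val_accuracy`). The following sketch shows one way to find the best epoch from such a dict. The numbers in `placeholder_history` are invented placeholders, not results from this notebook, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch number, value) of the highest metric value."""
    values = history_dict[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Placeholder values only; in the notebook this would be history.history.
placeholder_history = {
    "accuracy": [0.51, 0.53, 0.55, 0.56],
    "val_accuracy": [0.50, 0.54, 0.53, 0.52],
}

epoch, value = best_epoch(placeholder_history)
print(f"best val_accuracy {value:.2f} reached in epoch {epoch}")
```

Applied to the real training run, `best_epoch(history.history)` would allow a direct comparison with the 0.56 test accuracy of the Logistic Regression baseline.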