# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Baseline model - Logistic Regression

To-Do:
- Create baseline model using neural network

In [20]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow import keras
from keras.layers import Embedding, Dense, LSTM
import pickle

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

NUM_WORDS=3000
MAX_SEQUENCE_LEN = 57

To being able to compare the results of the machine learning model to be trained, a baseline model will be implemented. The baseline acts as a reference and will be implemented without further exploration, discussion and / or optimization.

As the task to solve is a classification task, a logistic regression is trained and evaluated.

### Prepare data for model training

At first, the data which was preprocessed in the previous notebook is imported into notebook. Still, it needs to be adjusted to train a Logistic Regression model.

In [21]:
data = pd.read_csv("data/processed.csv", sep=";", index_col=0)

In [22]:
data.head()

Unnamed: 0,statement,channel_Instagram,channel_Other,channel_TV,channel_TikTok,channel_X,channel_ad,channel_article,channel_blog,channel_campaign,...,channel_presentation,channel_press,channel_social media,channel_speech,channel_talk,channel_video,truth,token,statement_stop,token_stop
0,says sen bob casey dpa is trying to change the...,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",says sen bob casey dpa trying change outcome e...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,says the election results are suspicious becau...,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",says election results suspicious opponent us s...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,a ballot dump around 4 am in milwaukee shows t...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",ballot dump around 4 milwaukee shows wisconsin...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,kari lake is threatening social security and m...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",kari lake threatening social security medicare,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,republican senate candidate sam brown wants to...,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",republican senate candidate sam brown wants cu...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


There are some colums which can not be used for a Logistic Regression model.
Thus, the columns "statement" and "token" have to be dropped from the dataset. Even though the to-be-trained machine learning will mostly focus on the statements to determine whether it was true or false, the baseline model should not only be used as a reference but at the same time evaluate the impact of "channel" and "issue" on "truth".

In [23]:
data_dropped = data.drop(["statement", "token"], axis=1)

What is left are the encoded columns of the original 'channel' and 'issue' as well as the target variable 'truth'. By only using those information, a Logistic Regression model is trained using 70% of the data as training data.

In [24]:
X_log = data_dropped.drop(["truth"], axis=1)
y_log = data_dropped["truth"]

In [25]:
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, train_size=0.7, random_state=42)

### Train model

After splitting the data into X (features) and y (target), the data was split into training and test sets. Now, the model is initialized and then trained using only the training data. As there is not much information in the data, the expectations of the model in terms of accuracy are estimated between 55 - 60%.

In [26]:
log = LogisticRegression()
log.fit(X_train_log, y_train_log)

ValueError: could not convert string to float: '95 overthecounter pharmaceuticals us come china'

### Evaluate model

Some evaluations are done using first the training and then the test set. A classification report should provide some insights into the  performance of the model which will be the reference for the neural network.

In [29]:
results_train = pd.DataFrame(y_train_log.values, columns=["true"])
results_train["predicted"] = log.predict(X_train_log)

results_train["correct"] = results_train["true"] == results_train["predicted"]

In [30]:
results_train[["correct"]].value_counts()

correct
True       9887
False      4216
dtype: int64

In [31]:
print(f"Classification Report of training data:")
print(classification_report(results_train["true"], results_train["predicted"]))

Classification Report of training data:
              precision    recall  f1-score   support

           0       0.75      0.60      0.67      7023
           1       0.67      0.80      0.73      7080

    accuracy                           0.70     14103
   macro avg       0.71      0.70      0.70     14103
weighted avg       0.71      0.70      0.70     14103



In [32]:
results_test = pd.DataFrame(y_test_log.values, columns=["true"])
results_test["predicted"] = log.predict(X_test_log)

results_test["correct"] = results_test["true"] == results_test["predicted"]

In [33]:
results_test[["correct"]].value_counts()

correct
True       4226
False      1819
dtype: int64

In [34]:
print(f"Classification Report of test data:")
print(classification_report(results_test["true"], results_test["predicted"]))

Classification Report of test data:
              precision    recall  f1-score   support

           0       0.75      0.61      0.67      3051
           1       0.66      0.79      0.72      2994

    accuracy                           0.70      6045
   macro avg       0.71      0.70      0.70      6045
weighted avg       0.71      0.70      0.70      6045



## Baseline model - Neural Network (MLP)

As a first baseline model, a Logistic Regression model was trained on only the categorical data but not the statements itself. That's why a second baseline model, a multi-layer-perceptron, is inititalized and trained using the tokenized and padded statements.

Before this can be done, the data has to be transformed into a useful data structure.

### Prepare and save data for training

The data is prepared for the training process by converting the tokenized statements into a numpy array. In this conversion process, only "token" and "truth" are considered, the encoded channel issue columns are dropped from this baseline model.

In [27]:
X_mlp = np.array(data["token"].apply(np.array).to_list())
y_mlp = np.array(data["truth"])

Also, the data is splitted into training and test data.

In [28]:
X_train_mlp, X_test_mlp, y_train_mlp, y_test_mlp = train_test_split(X_mlp, y_mlp, train_size=0.7, random_state=42)

In [29]:
X_train_mlp = X_train_mlp.reshape(-1, 1)
X_test_mlp = X_test_mlp.reshape(-1, 1)

Now, the structure of the MLP is defined. Until now, all statements are tokenized which means every word is assigned to a number. This array of numbers represents the statement. Currently, the relationship between those numbers is unknown. This is why an Embedding layer is needed which maps each number representing a word to a multidimensional vector.

A pre-trained Embedding is used from [GloVe](https://nlp.stanford.edu/projects/glove/) which is famous library word embeddings.

In [30]:
word2vec = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

Embedding matrix

In [31]:
with open("tokenizer/tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

In [32]:
embedding_dim = 300  
word_index = tokenizer.word_index 
num_words = min(len(word_index) + 1, NUM_WORDS)  

embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words:
        if word in word2vec.key_to_index:
            embedding_vector = word2vec[word]
            embedding_matrix[i] = embedding_vector

In [33]:
X_train_mlp.shape

(14109, 1)

In [34]:
model = keras.Sequential()
model.add(Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LEN, trainable=False))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))

In [37]:
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

In [38]:
model.fit(X_train_mlp, y_train_mlp, epochs=20, batch_size=128, validation_data=(X_test_mlp, y_test_mlp))

Epoch 1/20


2025-01-16 08:26:50.107168: W tensorflow/core/framework/op_kernel.cc:1757] OP_REQUIRES failed at cast_op.cc:121 : UNIMPLEMENTED: Cast string to float is not supported


UnimplementedError: Graph execution error:

Detected at node 'sequential_4/Cast' defined at (most recent call last):
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/traitlets/config/application.py", line 992, in launch_instance
      app.start()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 736, in start
      self.io_loop.start()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start
      self.asyncio_loop.run_forever()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
      self._run_once()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
      handle._run()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/asyncio/events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
      await self.process_one()
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 505, in process_one
      await dispatch(*args)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
      await result
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
      reply_content = await reply_content
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
      res = shell.run_cell(
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 546, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3024, in run_cell
      result = self._run_cell(
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3079, in _run_cell
      result = runner(coro)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3284, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3466, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/var/folders/wd/t40ff8kx75d1p_blvmjy6b6m0000gn/T/ipykernel_12012/355114051.py", line 1, in <module>
      model.fit(X_train_mlp, y_train_mlp, epochs=20, batch_size=128, validation_data=(X_test_mlp, y_test_mlp))
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1564, in fit
      tmp_logs = self.train_function(iterator)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1160, in train_function
      return step_function(self, iterator)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1146, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 1135, in run_step
      outputs = model.train_step(data)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 993, in train_step
      y_pred = self(x, training=True)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/sequential.py", line 410, in call
      return super().call(inputs, training=training, mask=mask)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/functional.py", line 510, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/functional.py", line 649, in _run_internal_graph
      y = self._conform_to_reference_input(y, ref_input=x)
    File "/Users/felix/anaconda3/envs/dsmml/lib/python3.10/site-packages/keras/engine/functional.py", line 761, in _conform_to_reference_input
      tensor = tf.cast(tensor, dtype=ref_input.dtype)
Node: 'sequential_4/Cast'
Cast string to float is not supported
	 [[{{node sequential_4/Cast}}]] [Op:__inference_train_function_4178]