# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Data Preprocessing

In [85]:
import pandas as pd
import numpy as np
from langdetect import detect

The dataset explored in [data understanding](03_data-understanding.ipynb) cannot yet be used to train a machine learning model. Some preprocessing steps have to be carried out first, such as:
- Binarize and encode the 'truth' column
- Convert the text (strings) to tokens
- Pad all sequences to the same length

In [65]:
data = pd.read_csv("data/scraped.csv", sep=";", index_col=0)

In [66]:
data.head()

Unnamed: 0,statement,issue,truth
0,"Says Sen. Bob Casey, D-Pa., “is trying to chan...",2024-senate-elections,false
1,Says the election results are suspicious becau...,2024-senate-elections,false
2,A “ballot dump” around 4 a.m. in Milwaukee sho...,2024-senate-elections,pants-fire
3,“Kari Lake is threatening Social Security and ...,2024-senate-elections,half-true
4,Republican Senate candidate Sam Brown “wants t...,2024-senate-elections,half-true


#### Binarization of 'truth'

In [67]:
data["truth"].unique()

array(['false', 'pants-fire', 'half-true', 'barely-true', 'mostly-true',
       'true', 'half-flip', 'full-flop', 'no-flip'], dtype=object)

In [68]:
true = ["true", "mostly-true", "half-true"]
false = ["barely-true", "false", "pants-fire"]
flip = ["full-flop", "half-flip", "no-flip"]

In [69]:
data_dropped = data[~data["truth"].isin(flip)]

In [70]:
data_dropped["truth"].unique()

array(['false', 'pants-fire', 'half-true', 'barely-true', 'mostly-true',
       'true'], dtype=object)

In [71]:
# Work on a copy and map the six remaining ratings to binary labels (1 = true, 0 = false)
data_binary = data_dropped.copy()
data_binary["truth"] = data_binary["truth"].replace(true, 1)
data_binary["truth"] = data_binary["truth"].replace(false, 0)
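
The two `replace` calls can also be written as one explicit mapping, which makes the label assignment easier to audit. The following is a minimal sketch on a hypothetical toy Series (`ratings` and `label_map` are illustrative names, not part of the notebook). It assumes the same grouping as above, with 'half-true' counted as true and 'barely-true' counted as false.

```python
import pandas as pd

# Same grouping as in the notebook cell above
true = ["true", "mostly-true", "half-true"]
false = ["barely-true", "false", "pants-fire"]
flip = ["full-flop", "half-flip", "no-flip"]

# Hypothetical toy data standing in for data["truth"]
ratings = pd.Series(["false", "half-true", "no-flip", "pants-fire", "true"])

# Build one dictionary: rating -> binary label
label_map = {**{r: 1 for r in true}, **{r: 0 for r in false}}

# Drop the flip-flop ratings, then map the remaining ratings to 0/1
kept = ratings[~ratings.isin(flip)]
labels = kept.map(label_map).astype("int64")

print(labels.to_list())
```

A rating that is missing from the dictionary would become `NaN` and make the integer cast fail, so an unexpected category is noticed immediately instead of silently staying a string.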

In [72]:
data_binary["truth"].value_counts()

0    10632
1     6134
Name: truth, dtype: int64

In [73]:
data_binary["truth"].dtype

dtype('int64')

#### Tokenization of 'statement'

...

In [74]:
# Lowercase all statements and remove every character except letters, digits and spaces
data_binary["statement"] = data_binary["statement"].str.lower().str.replace(r'[^a-zA-Z0-9 ]',"", regex=True).astype("str")
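
To show what this cleaning step does to a single statement, here is a small sketch using the standard `re` module instead of the pandas string methods. The helper `clean` is a hypothetical name for illustration, and the sample strings are shortened prefixes of statements from the table above.

```python
import re

def clean(text: str) -> str:
    """Lowercase the text and drop every character that is not a letter, digit or space."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

samples = [
    "Says Sen. Bob Casey, D-Pa., “is trying",
    "A “ballot dump” around 4 a.m. in Milwaukee",
]
cleaned = [clean(s) for s in samples]

for line in cleaned:
    print(line)
```

Because hyphens and periods are deleted rather than replaced by a space, abbreviations are merged into a single word ("D-Pa." becomes "dpa", "a.m." becomes "am"). Letters outside a-z, such as accented characters, are removed as well.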

In [75]:
data_binary

Unnamed: 0,statement,issue,truth
0,says sen bob casey dpa is trying to change the...,2024-senate-elections,0
1,says the election results are suspicious becau...,2024-senate-elections,0
2,a ballot dump around 4 am in milwaukee shows t...,2024-senate-elections,0
3,kari lake is threatening social security and m...,2024-senate-elections,1
4,republican senate candidate sam brown wants to...,2024-senate-elections,1
...,...,...,...
16921,missouri is the state with the lowest paid wor...,workers,1
16922,in 2009 hillary clinton was at the state depa...,workers,1
16923,says bernie sanders fundamentally changed the ...,workers,1
16924,we work longer hours than any people in the in...,workers,0


In [76]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Before tokenizing and padding the statements, each statement is checked for its language, and all non-English statements are dropped from the DataFrame. The library 'langdetect' is used for this check: its `detect` function returns the language code of a string.

In [95]:
data_binary["lang"] = data_binary["statement"].apply(detect)
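
`detect` raises an exception when a string contains nothing it can analyse (for example an empty statement), and the langdetect README notes that results are not deterministic unless a seed is set. The following is a hedged sketch of a guard around the call. `safe_detect`, `fallback` and `stub_detector` are hypothetical names, and the detector is passed in as an argument so that the sketch runs without langdetect installed.

```python
from typing import Callable

def safe_detect(text: str, detector: Callable[[str], str], fallback: str = "unknown") -> str:
    """Return detector(text), or `fallback` if the text contains no letters at all."""
    # Strings without letters (empty, digits only) give a detector nothing to work with
    if not any(ch.isalpha() for ch in text):
        return fallback
    return detector(text)

# Stub detector for demonstration only; in the notebook this would be langdetect's detect
def stub_detector(text: str) -> str:
    return "en"

results = [
    safe_detect("says the election results are suspicious", stub_detector),
    safe_detect("2024", stub_detector),
    safe_detect("", stub_detector),
]
print(results)
```

In the notebook, this could be applied as `data_binary["statement"].apply(lambda s: safe_detect(s, detect))`, optionally after setting `DetectorFactory.seed = 0` (with `DetectorFactory` imported from langdetect) to make the detected languages reproducible between runs.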

In [96]:
data_binary["lang"].value_counts()

en    16074
es      489
da       30
af       25
fr       23
it       21
ca       20
nl       18
et       15
no       13
sv       12
tl        6
ro        5
cy        3
pt        3
id        2
sk        2
fi        2
hu        1
so        1
pl        1
Name: lang, dtype: int64

All rows detected as non-English are removed from the DataFrame without further checking. As the large majority (16,074 of 16,766, about 96%) of all statements are classified as English, not much data is lost.

In [99]:
data_eng = data_binary[data_binary["lang"] == "en"]

In [101]:
data_eng.shape[0] == sum(data_binary["lang"] == "en")

True

From here on, it is assumed that only English statements are left in the data. The statements can therefore be tokenized using the Keras `Tokenizer`, which transforms each statement into a list of integers.

In [77]:
# Vocabulary limit for the Tokenizer: only the most frequent words are kept
NUM_WORDS = 3000

In [102]:
data_token = data_eng.copy()

token = Tokenizer(num_words=NUM_WORDS)
statements = data_eng["statement"].to_list()
token.fit_on_texts(statements)
data_token["token"] = token.texts_to_sequences(statements)
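
To illustrate what the tokenizer does, here is a simplified pure-Python sketch of frequency-ranked word indexing. It is not the Keras implementation: `build_index` and `to_sequences` are hypothetical helpers that only mirror the documented behaviour that index 0 is reserved, that more frequent words receive lower indices, and that words whose index is not below `num_words` are dropped from the sequences.

```python
from collections import Counter

def build_index(texts):
    """Assign indices 1, 2, 3, ... to words in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, index, num_words):
    """Replace each word by its index, dropping words with index >= num_words."""
    return [
        [index[w] for w in text.split() if w in index and index[w] < num_words]
        for text in texts
    ]

texts = ["the cat sat", "the cat ran", "the dog"]
index = build_index(texts)
sequences = to_sequences(texts, index, num_words=3)

print(index)
print(sequences)
```

With `NUM_WORDS = 3000` and no out-of-vocabulary token configured, rarer words simply disappear from the sequences, which is why some tokenized statements are shorter than their word count.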

In [104]:
data_token.head()

Unnamed: 0,statement,issue,truth,token,lang
0,says sen bob casey dpa is trying to change the...,2024-senate-elections,0,"[9, 177, 1632, 2444, 8, 471, 2, 259, 1, 4, 1, ...",en
1,says the election results are suspicious becau...,2024-senate-elections,0,"[9, 1, 184, 2055, 12, 63, 670, 27, 177, 681, 7...",en
2,a ballot dump around 4 am in milwaukee shows t...,2024-senate-elections,0,"[5, 799, 2608, 420, 398, 1248, 3, 254, 47, 1, ...",en
3,kari lake is threatening social security and m...,2024-senate-elections,1,"[1966, 1175, 8, 2609, 103, 96, 6, 113]",en
4,republican senate candidate sam brown wants to...,2024-senate-elections,1,"[155, 167, 274, 512, 142, 2, 112, 247, 6, 103,...",en


In [105]:
max(data_token["token"].apply(len))

57

#### Padding of sequences

In [106]:
padded = pad_sequences(data_token["token"].to_list())
data_token["token"] = padded.tolist()
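
Called without further arguments, `pad_sequences` pads at the front with zeros up to the length of the longest sequence (57 here), which explains the leading zeros in the output below. The following NumPy sketch mimics that default behaviour for illustration. `pad_pre` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if seq:  # rows for empty sequences stay fully padded
            padded[row, -len(seq):] = seq
    return padded

sequences = [[9, 177, 1632], [5, 799], []]
padded = pad_pre(sequences)
print(padded)
```

Padding at the front keeps the actual tokens at the end of each row, so every statement has the same length regardless of how many words it contains.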

In [107]:
data_token.head()

Unnamed: 0,statement,issue,truth,token,lang
0,says sen bob casey dpa is trying to change the...,2024-senate-elections,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",en
1,says the election results are suspicious becau...,2024-senate-elections,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",en
2,a ballot dump around 4 am in milwaukee shows t...,2024-senate-elections,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",en
3,kari lake is threatening social security and m...,2024-senate-elections,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",en
4,republican senate candidate sam brown wants to...,2024-senate-elections,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",en


In [108]:
data_token.to_csv("data/processed.csv", sep=";")

#### Prepare data for training

As the last preprocessing step, the data is prepared for the training process. The neural network to be trained only takes NumPy arrays as input. Thus, the data currently stored in a pandas DataFrame is converted into NumPy arrays. Only the columns "token" and "truth" are considered in this conversion.

In [109]:
X = np.array(data_token["token"].apply(np.array).to_list())
y = np.array(data_token["truth"])

The prepared data consists of X, which contains all features, and y, which contains the corresponding targets. Both arrays are saved as CSV files.

In [116]:
np.savetxt("data/X.csv", X, delimiter=";")
np.savetxt("data/y.csv", y, delimiter=";")
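
By default, `np.savetxt` writes every value in scientific notation (format `'%.18e'`), so the integer tokens and labels are stored as floating-point text. As a possible alternative, the following minimal sketch shows how passing `fmt="%d"` keeps them as integers and how the files can be read back with `np.loadtxt`. The toy arrays and the temporary directory are assumptions for illustration and do not touch the notebook's files.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays standing in for X (padded tokens) and y (labels)
X = np.array([[0, 0, 9, 177], [0, 5, 799, 2608]])
y = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X.csv")
    y_path = os.path.join(tmp, "y.csv")

    # fmt="%d" writes plain integers instead of the default scientific notation
    np.savetxt(x_path, X, delimiter=";", fmt="%d")
    np.savetxt(y_path, y, delimiter=";", fmt="%d")

    # Read the files back with the same delimiter and an integer dtype
    X_loaded = np.loadtxt(x_path, delimiter=";", dtype=int)
    y_loaded = np.loadtxt(y_path, delimiter=";", dtype=int)

print(X_loaded.shape, y_loaded.shape)
```

Whichever format is chosen, the delimiter used for loading has to match the one used for saving (`";"` in this notebook).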