# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from langdetect import detect
#import openai
from imblearn.over_sampling import RandomOverSampler

In [2]:
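def replaceChannel(data, channel, alternative=None):
    """Collapse every 'channel' cell that contains a keyword to that keyword.

    Note: the 'channel' column of `data` is modified in place and also returned.
    """
    # Any cell containing `channel` anywhere is replaced by `channel` itself.
    data["channel"] = data["channel"].replace(fr'.*{channel}.*', channel, regex=True)

    # Optionally map cells containing an alternative spelling to the same keyword.
    if alternative:
        data["channel"] = data["channel"].replace(fr'.*{alternative}.*', channel, regex=True)

    return data["channel"]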

""" def searchParty(people):
    for p in people.keys():
        
        prompt = f"Classify the political party of {p}. Respond only with 'd' for Democrat, 'r' for Republican. If unknown, respond with 'None'."
        
        try:
            response = openai.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": prompt
                    }
                ],
                model="gpt-3.5-turbo"
            )
            party = response.choices[0].text.strip()
            
            # Validate the response and assign to the dictionary
            if party in ['r', 'r', 'None']:
                people[p] = party
            else:
                people[p] = 'None'
                
        except Exception as e:
            print(f"Error classifying {p}: {e}")
            people[p] = 'None'
    
    return people """

' def searchParty(people):\n    for p in people.keys():\n        \n        prompt = f"Classify the political party of {p}. Respond only with \'d\' for Democrat, \'r\' for Republican. If unknown, respond with \'None\'."\n        \n        try:\n            response = openai.chat.completions.create(\n                messages=[\n                    {\n                        "role": "user",\n                        "content": prompt\n                    }\n                ],\n                model="gpt-3.5-turbo"\n            )\n            party = response.choices[0].text.strip()\n            \n            # Validate the response and assign to the dictionary\n            if party in [\'r\', \'r\', \'None\']:\n                people[p] = party\n            else:\n                people[p] = \'None\'\n                \n        except Exception as e:\n            print(f"Error classifying {p}: {e}")\n            people[p] = \'None\'\n    \n    return people '

The dataset analysed in [data understanding](03_data-understanding.ipynb) cannot yet be used to train a machine learning model. Several preprocessing steps, which were already named there, have to be carried out first:

- Remove 'half-flip', 'full-flop' and 'no-flip' from the target variable
- Check the language of all statements and keep only English statements
- Handle missing values in column 'channel'
- Encode 'issue'
- Summarize channels, e.g. 'in a X post' and 'in a post on X' should both become 'X'
- Clean column 'person', as it contains many invalid values
- Binarize and encode the truth column
- Balance the dataset
- Convert text (strings) to tokens
- Pad all sequences to the same length

All of these operations also have to be applied to the validation data (LIAR.csv), using only the insights gained from the scraped data.

First, import the raw data by using 'read_csv'.

In [3]:
data = pd.read_csv("data/scraped.csv", sep=";", index_col=0)

In [4]:
data.head()

| | statement | issue | person | channel | truth |
|---|---|---|---|---|---|
| 0 | Says Sen. Bob Casey, D-Pa., “is trying to chan... | 2024-senate-elections | Elon Musk | in an X post | false |
| 1 | Says the election results are suspicious becau... | 2024-senate-elections | Eric Hovde | in X, formerly Twitter | false |
| 2 | A “ballot dump” around 4 a.m. in Milwaukee sho... | 2024-senate-elections | Instagram posts | in an Instagram post | pants-fire |
| 3 | “Kari Lake is threatening Social Security and ... | 2024-senate-elections | WinSenate | in a Facebook ad | half-true |
| 4 | Republican Senate candidate Sam Brown “wants t... | 2024-senate-elections | Make the Road Nevada | in an X post | half-true |


In [5]:
test = pd.read_csv("data/LIAR.csv", sep=";", header=None)

In [6]:
test.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2635.json | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer |
| 1 | 1 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |
| 2 | 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |
| 3 | 3 | 1123.json | false | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 4 | 4 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |


### Fit test data to train data structure

Even though the test data should be treated as new, previously unseen data, it needs the same structure as the training data to be usable for testing. The current test data does not match that structure, so it is adjusted before the training data is preprocessed.

In [7]:
test_dropped = test.drop([0, 1, 6, 7, 8, 9, 10, 11, 12, 13], axis=1)

In [8]:
test_dropped.columns = ["truth", "statement", "issue", "person", "channel"]

In [9]:
test_dropped.head()

| | truth | statement | issue | person | channel |
|---|---|---|---|---|---|
| 0 | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | a mailer |
| 1 | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | a floor speech. |
| 2 | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | Denver |
| 3 | false | Health care reform legislation is likely to ma... | health-care | blog-posting | a news release |
| 4 | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | an interview on CNN |


### Remove invalid truth classifications

In [10]:
data.shape

(16926, 5)

In [11]:
data["truth"].unique()

array(['false', 'pants-fire', 'half-true', 'barely-true', 'mostly-true',
       'true', 'half-flip', 'full-flop', 'no-flip'], dtype=object)

In [12]:
flip = ["half-flip", "full-flop", "no-flip"]

In [13]:
data_removedflip = data[~data["truth"].isin(flip)]

In [14]:
data_removedflip.shape

(16766, 5)

In [15]:
print(f"Removed rows: {data.shape[0]} - {data_removedflip.shape[0]} = {data.shape[0] - data_removedflip.shape[0]}")

Removed rows: 16926 - 16766 = 160


### Check language of statements

Before the statements are tokenized and padded, it must be checked whether all of them are in English. All non-English statements are dropped from the DataFrame. The language of a statement is determined with the library 'langdetect', whose `detect` function returns the language code of a string.
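The language check can be guarded against failures, as in the minimal sketch below. It assumes that the detector may raise an exception for inputs without usable characters (for example empty strings or pure numbers). `safe_detect`, `fallback` and the stub `toy_detector` are hypothetical names introduced only for illustration; with langdetect, one would pass `detect` as the detector. langdetect's documentation also notes that its results are not deterministic by default and suggests setting `DetectorFactory.seed = 0` before the first call to make them reproducible.

```python
def safe_detect(text, detector, fallback="unknown"):
    """Return the detected language code, or `fallback` if detection fails."""
    try:
        return detector(text)
    except Exception:
        # A detector may raise when a text contains no usable features.
        return fallback


def toy_detector(text):
    """Hypothetical stand-in for langdetect.detect, used only for this demo."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "es" if " el " in f" {text.lower()} " else "en"


statements = ["Says the election results are suspicious", "12345", "El senador dijo eso"]
languages = [safe_detect(s, toy_detector) for s in statements]
print(languages)
```

Rows that end up with the fallback value can then be inspected or dropped explicitly instead of aborting the whole `apply` call.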

In [16]:
data_lang = data_removedflip.copy()

data_lang["lang"] = data_removedflip["statement"].apply(detect)

In [17]:
data_lang["lang"].value_counts()

en    16169
es      489
da       17
fr       13
af       12
ca       12
nl       11
de       10
no        7
it        7
sv        5
et        5
pt        3
tl        3
id        1
hu        1
cy        1
Name: lang, dtype: int64

All rows that were detected as non-English are removed from the DataFrame without further checking. As the vast majority (about 96%) of all statements are classified as English, not much data is lost.
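The share of English statements follows directly from the counts in the output above. A short check of the arithmetic, using only numbers already shown in this notebook:

```python
english = 16169  # statements detected as 'en'
total = 16766    # statements left after removing the flip labels

share = english / total
dropped = total - english

print(f"English share: {share:.1%}")
print(f"Rows dropped: {dropped}")
```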

In [18]:
data_eng = data_lang[data_lang["lang"] == "en"]

In [19]:
data_eng.shape[0] == sum(data_lang["lang"] == "en")

True

As we do not need the column 'lang' for further processing, we can drop it and move on with a simplified dataset.

In [20]:
data_cleanedlang = data_eng.drop(["lang"], axis=1)

### Handle missing values

The info() method prints some basic information about the DataFrame. As seen in the output below, column 'channel' is missing 61 values (16108 of 16169 entries are non-null). There are several ways to deal with missing values.

In [21]:
data_cleanedlang.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16169 entries, 0 to 16925
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  16169 non-null  object
 1   issue      16169 non-null  object
 2   person     16169 non-null  object
 3   channel    16108 non-null  object
 4   truth      16169 non-null  object
dtypes: object(5)
memory usage: 757.9+ KB


Two options are considered: either deleting the affected rows, which reduces the dataset, or filling the missing values with 'Other' and keeping the rows. Neither option adds information, but the second one at least does not lose any, so it is preferred.
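A minimal sketch of the second option on a toy DataFrame (the rows are made up for illustration). Unlike a DataFrame-wide `fillna`, it fills only the column that is known to contain gaps, which is an assumption about what is intended here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "statement": ["a", "b", "c"],
    "channel": ["in an X post", np.nan, "in a Facebook ad"],
})

# Fill only 'channel', so that an unexpected NaN in another column
# is not silently turned into the string "Other".
filled = df.assign(channel=df["channel"].fillna("Other"))
print(filled["channel"].tolist())
```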

In [22]:
data_mv = data_cleanedlang.fillna("Other")

In [23]:
data_mv.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16169 entries, 0 to 16925
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  16169 non-null  object
 1   issue      16169 non-null  object
 2   person     16169 non-null  object
 3   channel    16169 non-null  object
 4   truth      16169 non-null  object
dtypes: object(5)
memory usage: 757.9+ KB


As the output of info() shows, there are now no more missing values in the dataset.

### Encode issue

'issue' contains the overarching topic of the statement. As the dataset covers many different topics, it has to be decided how to deal with this information. Since as much information as possible should be used for model training, 'issue' is kept and encoded to make it usable for a machine learning model.

In [24]:
data_issue = pd.get_dummies(data_mv, columns=["issue"], prefix="issue", drop_first=True)

After encoding 'issue' there are many more columns added to the DataFrame.

In [25]:
print(f"Columns (original structure): {data_mv.shape[1]}")
print(f"Columns (encoded structure): {data_issue.shape[1]}")

Columns (original structure): 5
Columns (encoded structure): 159
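Since the validation data has to end up with exactly the same columns, one possible way to reuse the training encoding is sketched below. The toy frames and the use of `reindex` are assumptions for illustration, not part of the notebook's pipeline:

```python
import pandas as pd

train = pd.DataFrame({"issue": ["abortion", "economy", "health-care"]})
test = pd.DataFrame({"issue": ["economy", "taxes"]})  # 'taxes' is unseen in train

train_enc = pd.get_dummies(train, columns=["issue"], prefix="issue", drop_first=True)

# Encode the test data without drop_first, then keep exactly the training columns:
# categories unseen in training are discarded, missing ones are filled with 0.
test_enc = pd.get_dummies(test, columns=["issue"], prefix="issue")
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0).astype(int)

print(test_enc)
```

This way only categories learned from the training data define the feature space, in line with the requirement to use only insights from the scraped data.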


### Summarize 'channel'

As already seen in data understanding, the statements were made on many different channels. Unfortunately, those channels are not standardized in any way.

In [26]:
len(data_issue["channel"].unique())

3136

As the previous output shows, there are currently 3136 different channels. This number is quite high and should be reduced to a reasonable number of channels. Based on earlier manual inspection, the most important channels can already be defined.

In [27]:
channels = ["Facebook", "Instagram", "TikTok", "Threads", "X", "interview", "speech", "ad", "debate", "blog", "press", "campaign", 
            "TV", "radio", "video", "social media", "article", "lecture", "talk", "presentation", "mail", "podcast"]

A function was implemented that scans the column 'channel' and replaces the content of every cell containing a certain keyword with the keyword of the respective channel. For example, every cell containing the word 'Facebook' in any variation is replaced by the single word 'Facebook'. This is done for all channels defined above. All remaining cells are replaced by "Other", as their channel is too specific.
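A pattern of the form `.*keyword.*` matches the keyword anywhere in the string, so short keywords also match inside longer words (for example 'ad' inside 'radio' or 'Threads'). The following is a hedged alternative sketch, not the function used in this notebook: it matches whole words only, case-insensitively. `KEYWORDS`, `PATTERNS` and `normalize_channel` are hypothetical names, and the keyword table is shortened for illustration.

```python
import re

# Hypothetical keyword table: canonical channel -> words that indicate it.
KEYWORDS = {
    "X": ["X", "Twitter", "tweet"],
    "Facebook": ["Facebook"],
    "radio": ["radio"],
    "ad": ["ad"],
}

# One compiled pattern per channel; \b restricts matches to whole words.
PATTERNS = {
    channel: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for channel, words in KEYWORDS.items()
}


def normalize_channel(raw, default="Other"):
    """Return the first canonical channel whose keyword occurs as a whole word."""
    for channel, pattern in PATTERNS.items():
        if pattern.search(raw):
            return channel
    return default


examples = ["in an X post", "in X, formerly Twitter", "in a Facebook ad",
            "in a radio interview", "Denver"]
normalized = [normalize_channel(e) for e in examples]
print(normalized)
```

The order of the table decides which channel wins when several keywords occur in one cell, and plural forms such as 'tweets' would have to be listed explicitly.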

In [28]:
data_channel = data_issue.copy()

for ch in channels:
    data_channel["channel"] = replaceChannel(data_channel, ch)

    if ch == "X":
        data_channel["channel"] = replaceChannel(data_channel, ch, "Twitter")
        data_channel["channel"] = replaceChannel(data_channel, ch, "twitter")
        data_channel["channel"] = replaceChannel(data_channel, ch, "Tweet")
        data_channel["channel"] = replaceChannel(data_channel, ch, "tweet")
        data_channel["channel"] = replaceChannel(data_channel, ch, "x")
    if ch == "social media":
        data_channel["channel"] = replaceChannel(data_channel, ch, "post")
    if ch == "TV":
        data_channel["channel"] = replaceChannel(data_channel, ch, "CNN")
        data_channel["channel"] = replaceChannel(data_channel, ch, "Fox")
    if ch == "campaign":
        data_channel["channel"] = replaceChannel(data_channel, ch, "rally")
        data_channel["channel"] = replaceChannel(data_channel, ch, "Campaign")
    if ch == "speech":
        data_channel["channel"] = replaceChannel(data_channel, ch, "Speech")
    if ch == "press":
        data_channel["channel"] = replaceChannel(data_channel, ch, "newsletter")

data_channel.loc[~data_channel["channel"].isin(channels), "channel"] = "Other"


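Now only the previously defined channels and 'Other' are left in column 'channel'. 'Threads' and 'radio' do not appear in the result: both contain the substring 'ad' and are therefore replaced by the keyword 'ad'.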

In [29]:
data_channel["channel"].unique().tolist()

['X',
 'Instagram',
 'Facebook',
 'debate',
 'press',
 'ad',
 'Other',
 'interview',
 'campaign',
 'blog',
 'speech',
 'social media',
 'video',
 'TikTok',
 'TV',
 'mail',
 'article',
 'podcast',
 'talk',
 'lecture',
 'presentation']

In [30]:
data_channel[["channel"]].groupby("channel").size()

channel
Facebook        2366
Instagram        986
Other           2591
TV               241
TikTok           113
X               2025
ad              1864
article          134
blog             170
campaign         484
debate           659
interview       1731
lecture            2
mail             204
podcast           21
presentation      14
press            518
social media     660
speech          1090
talk              14
video            282
dtype: int64

Currently, the column 'channel' contains the channels listed in the output above. They are represented by strings, which a machine learning model cannot use directly, so they have to be converted into a numerical representation. This is done by one-hot encoding, which creates one column per category of 'channel' and assigns the value '1' only to the column of the channel the statement was made on. All other channel columns receive the value '0'. Because `drop_first=True` is set, the column of the first category is omitted; that category is represented by zeros in all remaining channel columns.
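The effect of one-hot encoding and of `drop_first=True` can be seen on a toy column (the four rows are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"channel": ["X", "Facebook", "ad", "X"]})

full = pd.get_dummies(df, columns=["channel"], prefix="channel").astype(int)
reduced = pd.get_dummies(df, columns=["channel"], prefix="channel", drop_first=True).astype(int)

print(full.columns.tolist())     # one column per channel
print(reduced.columns.tolist())  # the first category in sorted order is dropped
print(reduced)
```

In the reduced encoding, a 'Facebook' row has zeros in both remaining columns, so no information is lost while one redundant column is saved.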

In [31]:
data_channel_enc = pd.get_dummies(data_channel, columns=["channel"], prefix="channel", drop_first=True)

As a result of one-hot-encoding there are many more columns compared to the original DataFrame structure.

In [32]:
print(f"Columns (original structure): {data_channel.shape[1]}")
print(f"Columns (encoded structure): {data_channel_enc.shape[1]}")

Columns (original structure): 159
Columns (encoded structure): 178


In [33]:
data_channel_enc.columns.to_list()

['statement',
 'person',
 'truth',
 'issue_2018-california-governors-race',
 'issue_2024-senate-elections',
 'issue_Alcohol',
 'issue_abc-news-week',
 'issue_abortion',
 'issue_ad-watch',
 'issue_afghanistan',
 'issue_after-the-fact',
 'issue_agriculture',
 'issue_animals',
 'issue_artificial-intelligence',
 'issue_ask-politifact',
 'issue_autism',
 'issue_bankruptcy',
 'issue_baseball',
 'issue_bipartisanship',
 'issue_border-security',
 'issue_bush-administration',
 'issue_campaign-advertising',
 'issue_campaign-finance',
 'issue_candidates-biography',
 'issue_cap-and-trade',
 'issue_census',
 'issue_children',
 'issue_china',
 'issue_city-budget',
 'issue_city-government',
 'issue_civil-rights',
 'issue_climate-change',
 'issue_congress',
 'issue_constitutional-amendments',
 'issue_consumer-safety',
 'issue_coronavirus',
 'issue_corporations',
 'issue_corrections-and-updates',
 'issue_county-budget',
 'issue_county-government',
 'issue_crime',
 'issue_criminal-justice',
 'issue_deat

### Clean 'person'

In this section it is evaluated whether the column 'person' can be preprocessed in any way.

The dataset contains 2766 different persons, which is a bit less than the number of channels before preprocessing but still too many to extract useful information.

In [34]:
len(data_channel_enc["person"].unique())

2766

One way of 'cleaning' this column would be to replace the name of each person with the party they are associated with. Even though the sample size in the previous notebook (data understanding) was quite small, there was a correlation between party membership and the probability of true or false statements. Members of the Republican party tended to have a higher probability of making false statements, while members of the Democratic party tended to have a higher probability of making true statements.
For preprocessing, this would mean that e.g. the name 'Donald Trump' is replaced with 'Republican' and the name 'Barack Obama' with 'Democrat'.

However, 2766 persons are too many to assign to a party manually. As reliable sources are needed to assign every person to a party, the column 'person' is not considered as a feature and is therefore dropped from the dataset.

In [35]:
data_person = data_channel_enc.drop(["person"], axis=1)

In [36]:
#api_key = ""

#with open("api-key.txt", "r") as f:
#    api_key = f.read()

#openai.api_key = api_key
#genai.configure(api_key=api_key)

In [37]:
#persons_dict = dict()
#for p in data_channel["person"].unique():
#    persons_dict[p] = None

In [38]:
#persons_dict = searchParty(persons_dict)

### Binarize 'truth'

Now, the target variable 'truth' is converted to a binary value. As described in data understanding, 'truth' can take one of the values 'pants-fire', 'false', 'barely-true', 'half-true', 'mostly-true' and 'true'. The machine learning model to be trained should only predict whether a statement is rather true or rather false, so the target variable has to be binarized.

The following allocation will be considered:

In [39]:
true = ["true", "mostly-true", "half-true"]
false = ["barely-true", "false", "pants-fire"]

To verify that only the previously named truth labels are left in the data, the unique values are printed.

In [40]:
data_person["truth"].unique()

array(['false', 'pants-fire', 'half-true', 'barely-true', 'mostly-true',
       'true'], dtype=object)

The target variable 'truth' is now replaced based on the lists defined for true and false values. Statements in the 'true' range receive the value 1 as target; statements in the 'false' range receive the value 0.
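The same allocation can also be expressed as a single dictionary. The sketch below is an alternative formulation on a toy Series, with `LABEL_TO_BINARY` as a hypothetical name:

```python
import pandas as pd

LABEL_TO_BINARY = {
    "true": 1, "mostly-true": 1, "half-true": 1,
    "barely-true": 0, "false": 0, "pants-fire": 0,
}

truth = pd.Series(["false", "pants-fire", "half-true", "mostly-true"])
binary = truth.map(LABEL_TO_BINARY)

# map() yields NaN for any label missing from the dictionary, so an
# unexpected rating is caught here instead of passing through unchanged.
assert binary.notna().all(), "unmapped truth label found"
binary = binary.astype(int)
print(binary.tolist())
```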

In [41]:
data_binary = data_person.copy()
data_binary["truth"] = data_binary["truth"].replace(true, 1)
data_binary["truth"] = data_binary["truth"].replace(false, 0)

As seen below, the values are not evenly distributed. This is addressed in the next step, as the dataset has to be balanced to prevent bias in the model.

In [42]:
data_binary["truth"].value_counts()

0    10074
1     6095
Name: truth, dtype: int64

### Balance dataset

There are two popular ways to balance out a dataset to achieve an even distribution between two classes.

The first option is to oversample. Oversampling means copying random entries from the class with fewer occurrences until both classes contain the same number of samples. The advantage of oversampling is that no data is lost. Unfortunately, biases can be reproduced and even reinforced by duplicating samples.

The second option is to undersample. Undersampling is the opposite of oversampling: instead of duplicating samples of the smaller class, samples from the class with more occurrences are deleted until both classes are balanced. This leads to a loss of data and thus possibly of relevant information, which makes it the less attractive option.

Therefore, oversampling is used to balance the data.
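The idea behind random oversampling can be illustrated by hand on a toy DataFrame. This is a simplified sketch of the principle, not the implementation used by imblearn's `RandomOverSampler`:

```python
import pandas as pd

df = pd.DataFrame({
    "statement": ["a", "b", "c", "d", "e"],
    "truth": [0, 0, 0, 1, 1],
})

counts = df["truth"].value_counts()
minority = counts.idxmin()
n_missing = int(counts.max() - counts.min())

# Draw rows from the minority class with replacement and append them.
extra = df[df["truth"] == minority].sample(n=n_missing, replace=True, random_state=42)
balanced = pd.concat([df, extra], ignore_index=True)

balanced_counts = balanced["truth"].value_counts().to_dict()
print(balanced_counts)
```

Every appended row is an exact copy of an existing minority-class row, which is why no information is lost but existing patterns are repeated.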

In [43]:
ros = RandomOverSampler(random_state=42)

Before resampling, the data has to be split into X and y: X contains all features, y contains only the target variable 'truth'.

In [44]:
X = data_binary.drop(["truth"], axis=1)
y = data_binary["truth"]

Currently, there are 6095 samples of class '1' (in this context: true statements) and 10074 samples of class '0' (false statements).

In [45]:
y.value_counts()

0    10074
1     6095
Name: truth, dtype: int64

Now, the data is resampled randomly with replacement. This means that random samples of the minority class '1' are chosen and copied until both classes have the same number of samples.

In [46]:
X_resampled, y_resampled = ros.fit_resample(X, y)

After oversampling, the dataset is balanced and can be further prepared for training.

In [47]:
y_resampled.value_counts()

0    10074
1    10074
Name: truth, dtype: int64

In [48]:
data_balanced = X_resampled.copy()

data_balanced["truth"] = y_resampled

### Tokenization of 'statement'

To make the statements usable for a machine learning model, the strings have to be converted into a numerical representation.

... -> TODO: insert content from the lecture.

In [49]:
data_balanced["statement"] = data_balanced["statement"].str.lower().str.replace(r'[^a-zA-Z0-9 ]',"", regex=True).astype("str")

In [50]:
data_balanced[["statement", "truth"]].head()

| | statement | truth |
|---|---|---|
| 0 | says sen bob casey dpa is trying to change the... | 0 |
| 1 | says the election results are suspicious becau... | 0 |
| 2 | a ballot dump around 4 am in milwaukee shows t... | 0 |
| 3 | kari lake is threatening social security and m... | 1 |
| 4 | republican senate candidate sam brown wants to... | 1 |


In [51]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

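It is now assumed that only English statements are left in the data. Thus, the statements can be tokenized using the Keras `Tokenizer`, which transforms each statement into a list of integers. The argument `num_words` limits the vocabulary to the most frequent words (Keras keeps the `num_words - 1` most common ones); all other words are left out of the sequences.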
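What this transformation does can be sketched in plain Python. The following is a simplified illustration, not the Keras implementation: it ignores text filters and out-of-vocabulary tokens and only mirrors the documented behaviour that indices are assigned by descending word frequency starting at 1 and that only indices below `num_words` are kept. `fit_word_index` and `texts_to_sequences` are hypothetical helper names.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index and drop words with index >= num_words."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


texts = ["the election was stolen", "the election was fair", "the senate voted"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
print(word_index)
print(sequences)
```

With `num_words=4`, only the three most frequent words survive, so rare words such as 'stolen' or 'fair' disappear from the sequences.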

In [52]:
NUM_WORDS=3000

In [53]:
data_token = data_balanced.copy()

token = Tokenizer(num_words=NUM_WORDS)
statements = data_balanced["statement"].to_list()
token.fit_on_texts(statements)
data_token["token"] = token.texts_to_sequences(statements)

In [54]:
data_token[["statement", "token", "truth"]].head()

| | statement | token | truth |
|---|---|---|---|
| 0 | says sen bob casey dpa is trying to change the... | [9, 185, 1682, 2948, 8, 503, 3, 263, 1, 4, 1, ... | 0 |
| 1 | says the election results are suspicious becau... | [9, 1, 197, 2369, 12, 70, 645, 28, 185, 693, 7... | 0 |
| 2 | a ballot dump around 4 am in milwaukee shows t... | [5, 928, 2258, 414, 388, 1181, 2, 253, 64, 1, ... | 0 |
| 3 | kari lake is threatening social security and m... | [2061, 1124, 8, 2646, 109, 102, 6, 115] | 1 |
| 4 | republican senate candidate sam brown wants to... | [141, 135, 246, 507, 152, 3, 104, 220, 6, 109,... | 1 |


As seen in the previous output, not all token lists have the same length. To bring them to the same length, the tokenized statements have to be padded with '0' until they all reach the same length.

In [55]:
data_token["token"].apply(len).describe()

count    20148.000000
mean        15.713073
std          7.254187
min          1.000000
25%         10.000000
50%         14.000000
75%         20.000000
max         57.000000
Name: token, dtype: float64

### Padding of sequences

The length of the longest tokenized statement is given by the maximum in the previous output, which shows the descriptive statistics of the lengths of all tokenized but not yet padded statements. The maximum is 57, which means all statements have to be padded to a uniform length of 57.

In [56]:
padded = pad_sequences(data_token["token"].to_list())

data_padded = data_token.copy()
data_padded["token"] = padded.tolist()

After padding the tokenized statements, there are many zeros at the beginning. This is called pre-padding.
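Pre-padding can be reproduced in a few lines of plain Python. This is a simplified sketch of the principle, with `pad_pre` as a hypothetical helper and three shortened token lists as input:

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad every sequence with `value` up to `maxlen` (default: longest sequence)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # if too long, keep the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


tokens = [[9, 185, 1682], [5, 928], [2061]]
padded_tokens = pad_pre(tokens)
print(padded_tokens)
```

The zeros are placed in front of the tokens, so the actual content of each statement always sits at the end of the sequence.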

In [57]:
data_padded[["statement", "token", "truth"]].head()

| | statement | token | truth |
|---|---|---|---|
| 0 | says sen bob casey dpa is trying to change the... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 1 | says the election results are suspicious becau... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 2 | a ballot dump around 4 am in milwaukee shows t... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 3 | kari lake is threatening social security and m... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 |
| 4 | republican senate candidate sam brown wants to... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 |


In [58]:
data_padded.to_csv("data/processed.csv", sep=";")