# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Data Preprocessing

In [60]:
import pandas as pd
import numpy as np
from langdetect import detect
#import openai
from imblearn.over_sampling import RandomOverSampler
from gensim.models import KeyedVectors

In [1]:
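def replaceChannel(data, channel, alternative=None):
    """Collapse every 'channel' entry containing `channel` (or `alternative`) to `channel`."""
    # Substring match: any cell that contains the keyword is replaced entirely
    data["channel"] = data["channel"].replace(fr'.*{channel}.*', channel, regex=True)

    if alternative:
        # Map a second spelling or synonym (e.g. 'Twitter') to the same keyword
        data["channel"] = data["channel"].replace(fr'.*{alternative}.*', channel, regex=True)

    return data["channel"]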

Currently, the dataset analysed in [data understanding](03_data-understanding.ipynb) cannot be used to train a machine learning model. Several preprocessing steps, which have already been named there, have to be carried out first:

- Remove 'half-flip', 'full-flop', 'no-flip' from target variable
- Control language of all statements and only keep English statements
- Handle missing values in column 'channel'
- Encode 'issue'
- Summarize channels, e.g. 'in an X post' and 'in a post on X' should both become 'X'
- Clean column 'person' as there are many invalid values
- Binarize and encode truth-column
- Balance dataset
- Convert text (string) to tokens
- Pad all sequences to the same length

All these operations also have to be applied to the validation data (LIAR.csv), using only the insights gained from the scraped data.

First, import the raw data by using 'read_csv'.

In [3]:
data = pd.read_csv("data/scraped.csv", sep=";", index_col=0)

In [4]:
data.head()

Unnamed: 0,statement,issue,person,channel,truth
0,"Says Sen. Bob Casey, D-Pa., “is trying to chan...",2024-senate-elections,Elon Musk,in an X post,false
1,Says the election results are suspicious becau...,2024-senate-elections,Eric Hovde,"in X, formerly Twitter",false
2,A “ballot dump” around 4 a.m. in Milwaukee sho...,2024-senate-elections,Instagram posts,in an Instagram post,pants-fire
3,“Kari Lake is threatening Social Security and ...,2024-senate-elections,WinSenate,in a Facebook ad,half-true
4,Republican Senate candidate Sam Brown “wants t...,2024-senate-elections,Make the Road Nevada,in an X post,half-true


In [5]:
test = pd.read_csv("data/LIAR.csv", sep=";", header=None)

In [6]:
test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


### Remove invalid truth classifications

In [7]:
data.shape

(16926, 5)

In [8]:
data["truth"].unique()

array(['false', 'pants-fire', 'half-true', 'barely-true', 'mostly-true',
       'true', 'half-flip', 'full-flop', 'no-flip'], dtype=object)

In [9]:
flip = ["half-flip", "full-flop", "no-flip"]

In [10]:
data_removedflip = data[~data["truth"].isin(flip)]

In [11]:
data_removedflip.shape

(16766, 5)

In [12]:
print(f"Removed rows: {data.shape[0]} - {data_removedflip.shape[0]} = {data.shape[0] - data_removedflip.shape[0]}")

Removed rows: 16926 - 16766 = 160


### Check language of statements

Before tokenizing and padding the statements, it has to be checked whether all statements are in English. All non-English statements are dropped from the DataFrame. The library 'langdetect' is used for this check: its function `detect` returns the language code of a string.

In [13]:
data_lang = data_removedflip.copy()

data_lang["lang"] = data_removedflip["statement"].apply(detect)
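
`detect` raises an exception when a string contains no detectable features (for example an empty string or digits only), and the langdetect README notes that results can vary between runs unless `DetectorFactory.seed` is fixed. The following is a minimal sketch of a defensive wrapper, not part of the original notebook. The names `safe_detect` and `toy_detector` are hypothetical, and the detector is passed in as a callable so that `langdetect.detect` could be plugged in instead of the stand-in.

```python
import pandas as pd


def safe_detect(text, detector, fallback="unknown"):
    """Return detector(text), or `fallback` if the input is unusable."""
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detector(text)
    except Exception:  # e.g. langdetect's LangDetectException
        return fallback


def toy_detector(text):
    """Stand-in for langdetect.detect, only used to keep this sketch self-contained."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "es" if " el " in f" {text.lower()} " else "en"


statements = pd.Series(["The tax was cut.", "El impuesto fue reducido.", "12345", None])
langs = statements.apply(safe_detect, detector=toy_detector)
print(langs.tolist())
```

Rows labelled with the fallback value would then be removed together with all other non-English rows.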

In [14]:
data_lang["lang"].value_counts()

en    16177
es      484
da       15
nl       13
fr       12
ca       12
af        9
de        7
no        7
sv        6
et        6
it        5
tl        4
pt        3
id        2
hu        1
ro        1
sk        1
cy        1
Name: lang, dtype: int64

All rows detected as non-English are removed from the DataFrame without further checking. As the vast majority (about 96%) of all statements are classified as English, not much data is lost.
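
The share of English statements can be checked with a short calculation. The numbers are copied from the outputs above; the variable names are only illustrative.

```python
english_rows = 16177   # 'en' count from value_counts()
total_rows = 16766     # rows left after removing the flip/flop labels

english_share = english_rows / total_rows
dropped = total_rows - english_rows
print(f"English share: {english_share:.1%}, rows dropped: {dropped}")
```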

In [15]:
data_eng = data_lang[data_lang["lang"] == "en"]

In [16]:
data_eng.shape[0] == sum(data_lang["lang"] == "en")

True

As we do not need the column 'lang' for further processing, we can drop it and move on with a simplified dataset.

In [17]:
data_cleanedlang = data_eng.drop(["lang"], axis=1)

### Handle missing values

The info() method outputs some basic information about the DataFrame. As seen in the output below, the column 'channel' has some missing values. There are several ways to deal with missing values.

In [18]:
data_cleanedlang.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16177 entries, 0 to 16925
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  16177 non-null  object
 1   issue      16177 non-null  object
 2   person     16177 non-null  object
 3   channel    16116 non-null  object
 4   truth      16177 non-null  object
dtypes: object(5)
memory usage: 758.3+ KB


Two options are considered: either deleting the affected rows and thereby reducing the dataset, or filling the missing values with 'Other' and keeping the rows. Neither option adds information, but at least the second one does not discard the other columns of those rows. Therefore, the second option is preferred.

In [19]:
data_mv = data_cleanedlang.fillna("Other")
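
To illustrate the difference between the two options, here is a small sketch on a toy DataFrame (the data is made up). Note that `fillna("Other")` without a column restriction fills missing values in every column; passing a dictionary limits the fill to 'channel'.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "statement": ["a", "b", "c"],
    "channel": ["in an X post", np.nan, "in a speech"],
})

dropped = toy.dropna(subset=["channel"])     # option 1: the row is lost
filled = toy.fillna({"channel": "Other"})    # option 2: the row is kept and marked as 'Other'

print(len(dropped), len(filled))
print(filled["channel"].tolist())
```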

In [20]:
data_mv.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16177 entries, 0 to 16925
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  16177 non-null  object
 1   issue      16177 non-null  object
 2   person     16177 non-null  object
 3   channel    16177 non-null  object
 4   truth      16177 non-null  object
dtypes: object(5)
memory usage: 758.3+ KB


As seen in the output of info(), there are no more missing values in the dataset.

### Summarize 'channel'

As already seen in data understanding, the statements were made on many different channels. Unfortunately, these channels are not standardized in any way.

In [21]:
len(data_mv["channel"].unique())

3137

As seen in the previous output, there are currently 3137 different channels. This number is quite high and should be reduced to a reasonable number of channels. Based on previous manual inspection, the most important channels can already be defined.

In [22]:
channels = ["Facebook", "Instagram", "TikTok", "Threads", "X", "interview", "speech", "ad", "debate", "blog", "press", "campaign", 
            "TV", "radio", "video", "social media", "article", "lecture", "talk", "presentation", "mail", "podcast"]

A function was implemented to scan through the column 'channel' and replace the content of every cell containing a certain keyword with the keyword for the respective channel. For example, every cell containing the word 'Facebook' in any variation is replaced by the single word 'Facebook'. This is done for all channels defined above. All remaining cells are set to "Other", as their channel might be too specific.

In [23]:
data_channel = data_mv.copy()

for ch in channels:
    data_channel["channel"] = replaceChannel(data_channel, ch)

    if ch == "X":
        data_channel["channel"] = replaceChannel(data_channel, ch, "Twitter")
        data_channel["channel"] = replaceChannel(data_channel, ch, "twitter")
        data_channel["channel"] = replaceChannel(data_channel, ch, "Tweet")
        data_channel["channel"] = replaceChannel(data_channel, ch, "tweet")
        data_channel["channel"] = replaceChannel(data_channel, ch, "x")
    if ch == "social media":
        data_channel["channel"] = replaceChannel(data_channel, ch, "post")
    if ch == "TV":
        data_channel["channel"] = replaceChannel(data_channel, ch, "CNN")
        data_channel["channel"] = replaceChannel(data_channel, ch, "Fox")
    if ch == "campaign":
        data_channel["channel"] = replaceChannel(data_channel, ch, "rally")
        data_channel["channel"] = replaceChannel(data_channel, ch, "Campaign")
    if ch == "speech":
        data_channel["channel"] = replaceChannel(data_channel, ch, "Speech")
    if ch == "press":
        data_channel["channel"] = replaceChannel(data_channel, ch, "newsletter")

data_channel.loc[~data_channel["channel"].isin(channels), "channel"] = "Other"


Now, only the channels that have been defined earlier are left in column 'channel'.

In [24]:
data_channel["channel"].unique().tolist()

['X',
 'Instagram',
 'Facebook',
 'debate',
 'press',
 'ad',
 'Other',
 'interview',
 'campaign',
 'blog',
 'speech',
 'social media',
 'video',
 'TikTok',
 'TV',
 'mail',
 'article',
 'podcast',
 'talk',
 'lecture',
 'presentation']

In [25]:
data_channel[["channel"]].groupby("channel").size()

channel
Facebook        2368
Instagram        985
Other           2592
TV               241
TikTok           113
X               2029
ad              1863
article          134
blog             172
campaign         484
debate           659
interview       1733
lecture            2
mail             204
podcast           21
presentation      14
press            518
social media     660
speech          1089
talk              14
video            282
dtype: int64
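
The keyword matching used here is substring-based: a pattern such as `.*ad.*` also matches 'radio' and 'Threads', and `.*x.*` matches 'Fox'. This may be why 'radio' and 'Threads' do not appear in the output above. A possible alternative, shown only as a sketch, is to match keywords at word boundaries. The function name `summarize_channel` and the toy data are hypothetical.

```python
import re
import pandas as pd


def summarize_channel(raw, keywords):
    """Return the first keyword that occurs as a whole word in `raw`, else 'Other'."""
    if not isinstance(raw, str):
        return "Other"
    for kw in keywords:
        # \b anchors the keyword at word boundaries, so 'ad' no longer matches 'radio'
        if re.search(rf"\b{re.escape(kw)}\b", raw, flags=re.IGNORECASE):
            return kw
    return "Other"


toy_keywords = ["Threads", "X", "ad", "radio"]
toy_channels = pd.Series([
    "in a post on Threads",
    "a radio interview",
    "in an X post",
    "a Facebook ad",
    "in Texas",
    None,
])
summarized = toy_channels.apply(summarize_channel, keywords=toy_keywords)
print(summarized.tolist())
```

Whole-word matching is stricter: plural forms such as 'ads' or 'tweets' would need their own keywords or a slightly relaxed pattern.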

Currently, the column 'channel' contains the channels listed in the output above. The channels are represented by strings, which are not usable in a machine learning model, so the strings have to be converted into a numerical representation. This can be done by one-hot encoding, which creates a column for every category of 'channel' and assigns the value '1' only to the column of the channel the statement was made on. All other channel columns receive the value '0'. Because `drop_first=True` is set, the first category ('Facebook') does not get its own column and is represented by zeros in all channel columns.

In [26]:
data_channel_enc = pd.get_dummies(data_channel, columns=["channel"], prefix="channel", drop_first=True)

As a result of one-hot-encoding there are many more columns compared to the original DataFrame structure.

In [27]:
print(f"Columns (original structure): {data_channel.shape[1]}")
print(f"Columns (encoded structure): {data_channel_enc.shape[1]}")

Columns (original structure): 5
Columns (encoded structure): 24


In [28]:
data_channel_enc.columns.to_list()

['statement',
 'issue',
 'person',
 'truth',
 'channel_Instagram',
 'channel_Other',
 'channel_TV',
 'channel_TikTok',
 'channel_X',
 'channel_ad',
 'channel_article',
 'channel_blog',
 'channel_campaign',
 'channel_debate',
 'channel_interview',
 'channel_lecture',
 'channel_mail',
 'channel_podcast',
 'channel_presentation',
 'channel_press',
 'channel_social media',
 'channel_speech',
 'channel_talk',
 'channel_video']
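
The effect of `drop_first=True` can be seen on a toy example (made-up data): the category that sorts first, here 'Facebook', gets no column of its own.

```python
import pandas as pd

toy = pd.DataFrame({"channel": ["X", "Facebook", "ad", "X"]})

full = pd.get_dummies(toy, columns=["channel"], prefix="channel")
reduced = pd.get_dummies(toy, columns=["channel"], prefix="channel", drop_first=True)

print(full.columns.tolist())      # one column per category
print(reduced.columns.tolist())   # the first category in sorted order is dropped
print(reduced.astype(int).values.tolist())
```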

### Handle 'person' and 'issue'

In this section it is evaluated whether the column 'person' can be preprocessed in any way.

The dataset contains 2766 different persons, which is a bit less than the number of channels before preprocessing but still too many to extract useful information.

In [29]:
len(data_channel_enc["person"].unique())

2766

One way of 'cleaning' this column would be to replace the name of a person with the party they are associated with. Even though the sample size was quite small in the previous notebook (data understanding), there was a correlation between party membership and the probability of true/false statements. Members of the Republican party tend to have a higher probability of making false statements, while members of the Democratic party tend to have a higher probability of making true statements.
Coming back to preprocessing, this would mean that, for example, the name 'Donald Trump' would be replaced with 'Republican' and the name 'Barack Obama' with 'Democrat'.
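
Such a replacement could be sketched with a lookup table as follows. The dictionary `party_of` is hypothetical and only contains the two examples named above; every person without an entry falls back to 'Unknown'.

```python
import pandas as pd

# Hypothetical lookup table; only the two examples from the text are filled in
party_of = {"Donald Trump": "Republican", "Barack Obama": "Democrat"}

persons = pd.Series(["Donald Trump", "Barack Obama", "Instagram posts"])
party = persons.map(party_of).fillna("Unknown")
print(party.tolist())
```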

Still, 2766 persons are too many to assign to a party manually. As reliable sources would be needed to assign every person to a party, the column 'person' is not considered as a feature and is therefore dropped from the dataset. The column 'issue' is dropped in the same step.

In [30]:
data_dropped = data_channel_enc.drop(["person", "issue"], axis=1)

### Binarize 'truth'

Now, the target variable 'truth' is converted to a binary value. As already stated in data understanding, 'truth' can take on one of the values 'pants-fire', 'false', 'barely-true', 'half-true', 'mostly-true' and 'true'. The machine learning model to be trained should only predict whether a statement is rather true or rather false, so the target variable has to be binarized.

The following allocation will be considered:

In [31]:
true = ["true", "mostly-true", "half-true"]
false = ["barely-true", "false", "pants-fire"]

To check whether only the previously named truth characteristics are valid, the unique values should be printed.

In [32]:
data_dropped["truth"].unique()

array(['false', 'pants-fire', 'half-true', 'barely-true', 'mostly-true',
       'true'], dtype=object)

The target variable 'truth' is now replaced based on the lists defined for true and false values. Statements in the 'true' range receive the value 1 as target, statements in the 'false' range receive the value 0.

In [33]:
data_binary = data_dropped.copy()
data_binary["truth"] = data_binary["truth"].replace(true, 1)
data_binary["truth"] = data_binary["truth"].replace(false, 0)
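
As an alternative to two consecutive `replace()` calls, the allocation can be expressed as a single lookup table. This is only a sketch with hypothetical names; one difference is that `map()` turns labels missing from the table into NaN, which makes unexpected values easy to spot.

```python
import pandas as pd

true_labels = ["true", "mostly-true", "half-true"]
false_labels = ["barely-true", "false", "pants-fire"]

# One lookup table: label -> binary target
label_to_int = {**{lbl: 1 for lbl in true_labels}, **{lbl: 0 for lbl in false_labels}}

truth = pd.Series(["false", "pants-fire", "half-true", "full-flop"])
binary = truth.map(label_to_int)   # 'full-flop' is not in the table and becomes NaN
print(binary.tolist())
```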

As seen below, the values are not distributed evenly. This issue is addressed in the next step, as the dataset has to be balanced to prevent bias in the model.

In [34]:
data_binary["truth"].value_counts()

0    10081
1     6096
Name: truth, dtype: int64

### Balance dataset

There are two popular ways to balance out a dataset to achieve an even distribution between two classes.

The first option is to oversample. Oversampling means copying random entries from the class with fewer occurrences until both classes contain the same number of samples. The advantage of oversampling is that no data is lost. Unfortunately, biases can be reproduced and even reinforced by duplicating samples.

The second option is to undersample. Undersampling is the opposite of oversampling: instead of duplicating samples of the smaller class, samples from the class with more occurrences are deleted until both classes are balanced. This leads to a loss of data, which is not a good option as relevant information might be lost.

Therefore, oversampling is used to balance the data.

In [35]:
ros = RandomOverSampler(random_state=42)

Before resampling, the data has to be split into X and y: X contains all features, y contains only the target variable 'truth'.

In [36]:
X = data_binary.drop(["truth"], axis=1)
y = data_binary["truth"]

Currently, there are 6096 samples of class '1' (in this context: true statements), while there are 10081 samples of class '0' (false statements).

In [37]:
y.value_counts()

0    10081
1     6096
Name: truth, dtype: int64

Now, the data is resampled randomly with replacement. That means that random samples from the minority class '1' are chosen and copied until both classes have the same number of samples.

In [38]:
X_resampled, y_resampled = ros.fit_resample(X, y)
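
The idea behind random oversampling can be illustrated with plain pandas on a toy DataFrame. This is a simplified sketch of the principle, not the implementation used by imblearn; all names and data are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "statement": ["s1", "s2", "s3", "s4", "s5"],
    "truth": [0, 0, 0, 1, 1],
})

counts = toy["truth"].value_counts()
minority = counts.idxmin()
n_missing = int(counts.max() - counts.min())

# Draw the missing rows from the minority class with replacement
extra = toy[toy["truth"] == minority].sample(n=n_missing, replace=True, random_state=42)
balanced = pd.concat([toy, extra], ignore_index=True)

print(balanced["truth"].value_counts().to_dict())
```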

After oversampling, the dataset is balanced and can be further prepared for training.

In [39]:
y_resampled.value_counts()

0    10081
1    10081
Name: truth, dtype: int64

In [40]:
data_balanced = X_resampled.copy()

data_balanced["truth"] = y_resampled

### Tokenization of 'statement'

To make the statements usable for a machine learning model, the strings have to be converted into a numerical representation.

... -> Insert content from the lecture!

In [41]:
data_balanced["statement"] = data_balanced["statement"].str.lower().str.replace(r'[^a-zA-Z0-9 ]',"", regex=True).astype("str")

In [42]:
data_balanced[["statement", "truth"]].head()

Unnamed: 0,statement,truth
0,says sen bob casey dpa is trying to change the...,0
1,says the election results are suspicious becau...,0
2,a ballot dump around 4 am in milwaukee shows t...,0
3,kari lake is threatening social security and m...,1
4,republican senate candidate sam brown wants to...,1


In [43]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Now, it is assumed that there are only English statements left in the data. Thus, the statements can now be tokenized using the Keras `Tokenizer`, which transforms each statement into a list of integers.

In [44]:
NUM_WORDS=3000

In [45]:
data_token = data_balanced.copy()

token = Tokenizer(num_words=NUM_WORDS)
statements = data_balanced["statement"].to_list()
token.fit_on_texts(statements)
data_token["token"] = token.texts_to_sequences(statements)
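
To make the idea of the tokenizer more tangible, the following pure-Python sketch builds a frequency-ranked vocabulary and converts a text into integers. It is a simplified illustration under the assumption that the most frequent word receives index 1 and that only indices below `num_words` are kept; it is not the Keras implementation. The function names are hypothetical.

```python
from collections import Counter


def fit_vocab(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, vocab, num_words):
    """Map words to indices, skipping unknown words and indices >= num_words."""
    return [vocab[w] for w in text.split() if w in vocab and vocab[w] < num_words]


train_texts = ["the senate vote", "the vote was close", "the senate"]
vocab = fit_vocab(train_texts)
sequence = to_sequence("the senate vote was rigged", vocab, num_words=4)
print(vocab)
print(sequence)
```

In this sketch 'was' is dropped because its index is not below `num_words`, and 'rigged' is dropped because it never occurred in the texts the vocabulary was fitted on.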

In [46]:
data_token[["statement", "token", "truth"]].head()

Unnamed: 0,statement,token,truth
0,says sen bob casey dpa is trying to change the...,"[9, 185, 1896, 2942, 8, 488, 3, 271, 1, 4, 1, ...",0
1,says the election results are suspicious becau...,"[9, 1, 192, 2498, 12, 67, 637, 28, 185, 691, 7...",0
2,a ballot dump around 4 am in milwaukee shows t...,"[5, 861, 2755, 404, 398, 1144, 2, 263, 64, 1, ...",0
3,kari lake is threatening social security and m...,"[1897, 1177, 8, 2637, 108, 99, 6, 115]",1
4,republican senate candidate sam brown wants to...,"[150, 147, 260, 504, 151, 3, 109, 213, 6, 108,...",1


As seen in the previous output, not all lists of tokens have the same length. To give all of them the same length, the tokenized statements have to be padded with '0' until they all reach the same length.

In [47]:
data_token["token"].apply(len).describe()

count    20162.000000
mean        15.693185
std          7.245064
min          1.000000
25%         10.000000
50%         14.000000
75%         20.000000
max         57.000000
Name: token, dtype: float64

In [48]:
MAX_TOKEN = 57

### Padding of sequences

The longest tokenized statement is given by the maximum value of the previous output, which shows the descriptive statistics of the lengths of all tokenized, but not yet padded, statements. The maximum is 57, which means all statements have to be padded to a uniform length of 57.

In [49]:
padded = pad_sequences(data_token["token"].to_list())

data_padded = data_token.copy()
data_padded["token"] = padded.tolist()
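
What pre-padding does can be shown with a small pure-Python sketch. It mirrors the default `padding='pre'` and `truncating='pre'` behaviour of `pad_sequences` in a simplified form; the function name `pre_pad` is hypothetical.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; longer ones keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                         # cut surplus items from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


result = pre_pad([[9, 185], [5, 861, 2755, 404]], maxlen=3)
print(result)
```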

After padding the tokenized statements, there are many zeros at the beginning. This is called pre-padding.

In [50]:
data_padded[["statement", "token", "truth"]].head()

Unnamed: 0,statement,token,truth
0,says sen bob casey dpa is trying to change the...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
1,says the election results are suspicious becau...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
2,a ballot dump around 4 am in milwaukee shows t...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
3,kari lake is threatening social security and m...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
4,republican senate candidate sam brown wants to...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1


### Save processed training data

In [51]:
data_padded.to_csv("data/processed.csv", sep=";")

### Preprocessing of validation & test data

The validation and test data is not processed step by step in the way the training data was, as it is handled like unseen data. Currently, the validation and test data does not comply with the structure of the training data. However, to be usable by the model, it has to be in the same format and structure. In the following, the validation and test data, which is taken from the 'LIAR' dataset, is prepared for use in the model to be created.

All processing steps are based only on the knowledge gained while preparing the training data. This prevents introducing a bias from knowledge about the validation and test data, which cannot and should not be known at this point.

In [52]:
test = pd.read_csv("data/LIAR.csv", sep=";", index_col=0, header=None)

In [53]:
test.head()

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


First, all columns which do not contain similar information as in the training set are dropped.

In [54]:
test_dropped = test.drop([1, 6, 7, 8, 9, 10, 11, 12, 13], axis=1)

To get a better overview of the data, the columns are renamed to match the structure of the training data.

In [55]:
test_dropped.columns = ["truth", "statement", "issue", "person", "channel"]

After renaming, the data looks familiar. The column names match, and the content of the columns resembles the training data before processing.

In [56]:
test_dropped.head()

0,truth,statement,issue,person,channel
0,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,a mailer
1,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,a floor speech.
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,Denver
3,false,Health care reform legislation is likely to ma...,health-care,blog-posting,a news release
4,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,an interview on CNN


The following steps have to be done to achieve the same structure and format as the training data:
- Drop non-English rows
- Drop 'person' and 'issue'
- Summarize and encode 'channel'
- Binarize 'truth'
- Tokenize and pad 'statement'

As already said, all these steps use the same methods as for the training data.
To automate this process, a preprocessing pipeline should be implemented. Due to time restrictions and the different focus of this project, a full-fledged data preprocessing pipeline is dispensed with; instead, the steps are bundled in the simple helper class below.

In [57]:
class preprocessData():
    def __init__(self, data):
        self.__data = data

    def orchestrate(self):
        print("Check Basics ...", sep=' ', end='', flush=True) 
        data_b = self.__checkBasics(self.__data)
        print(" Finished Basics!")
        print("Check Language ...", sep=' ', end='', flush=True) 
        data_l = self.__checkLanguage(data_b)
        print(" Finished Language!")
        print("Check 'person' ...", sep=' ', end='', flush=True) 
        data_p = self.__checkPerson(data_l)
        print(" Finished 'person'!")
        print("Handle 'channel' ...", sep=' ', end='', flush=True) 
        data_c = self.__handleChannel(data_p)
        print(" Finished handling of 'channel'!")
        print("Binarize 'truth' ...", sep=' ', end='', flush=True) 
        data_t = self.__binarizeTruth(data_c)
        print(" Finished binarizing of 'truth'!")
        print("Tokenize 'statement' ...", sep=' ', end='', flush=True) 
        data_result = self.__tokenizeStatement(data_t)
        print(" Finished tokenizing of 'statement'!")

        print("Finished preprocessing of data!")
        return data_result

    def __checkBasics(self, data):
        # dropna() returns a new DataFrame, so the result has to be returned
        return data.dropna(axis=0, subset=["truth", "statement"])

    def __checkLanguage(self, data):
        data["lang"] = data["statement"].apply(detect)
        return data[data["lang"]=="en"].drop(["lang"], axis=1)

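    def __checkPerson(self, data):
        # Drop both columns that were removed from the training data;
        # errors="ignore" skips columns that are not present.
        return data.drop(columns=["person", "issue"], errors="ignore")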

    def __handleChannel(self, data):
        for ch in channels:
            data["channel"] = replaceChannel(data, ch)

            if ch == "X":
                data["channel"] = replaceChannel(data, ch, "Twitter")
                data["channel"] = replaceChannel(data, ch, "twitter")
                data["channel"] = replaceChannel(data, ch, "Tweet")
                data["channel"] = replaceChannel(data, ch, "tweet")
                data["channel"] = replaceChannel(data, ch, "x")
            if ch == "social media":
                data["channel"] = replaceChannel(data, ch, "post")
            if ch == "TV":
                data["channel"] = replaceChannel(data, ch, "CNN")
                data["channel"] = replaceChannel(data, ch, "Fox")
            if ch == "campaign":
                data["channel"] = replaceChannel(data, ch, "rally")
                data["channel"] = replaceChannel(data, ch, "Campaign")
            if ch == "speech":
                data["channel"] = replaceChannel(data, ch, "Speech")
            if ch == "press":
                data["channel"] = replaceChannel(data, ch, "newsletter")

        data.loc[~data["channel"].isin(channels), "channel"] = "Other"

        return pd.get_dummies(data, columns=["channel"], prefix="channel", drop_first=True)

    def __binarizeTruth(self, data):
        data["truth"] = data["truth"].replace(true, 1)
        data["truth"] = data["truth"].replace(false, 0)
        return data

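    def __tokenizeStatement(self, data):
        # Apply the same text normalization as for the training statements
        data["statement"] = data["statement"].str.lower().str.replace(r'[^a-zA-Z0-9 ]', "", regex=True).astype("str")
        statements = data["statement"].to_list()
        # Reuse the Tokenizer fitted on the training data (no refitting)
        data["token"] = token.texts_to_sequences(statements)

        padded = pad_sequences(data["token"].to_list(), maxlen=MAX_TOKEN)

        data["token"] = padded.tolist()

        return data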

In [58]:
pipeline = preprocessData(test_dropped)
test_processed = pipeline.orchestrate()

Check Basics ... Finished Basics!
Check Language ... Finished Language!
Check 'person' ... Finished 'person'!
Handle 'channel' ... Finished handling of 'channel'!
Binarize 'truth' ... Finished binarizing of 'truth'!
Tokenize 'statement' ... Finished tokenizing of 'statement'!
Finished preprocessing of data!
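
`pd.get_dummies` derives its columns from the categories present in the DataFrame it is given, so the encoded LIAR data is not guaranteed to have the same channel columns as the training data, and `drop_first=True` drops whichever category sorts first in the respective DataFrame. One possible way to align the two, sketched here with made-up column layouts, is to reindex the encoded data to the training columns.

```python
import pandas as pd

# Hypothetical column layouts; in the notebook they would come from the processed DataFrames
train_columns = ["statement", "truth", "channel_X", "channel_ad", "channel_speech"]
test_encoded = pd.DataFrame({
    "statement": ["s1", "s2"],
    "truth": [0, 1],
    "channel_ad": [1, 0],
    "channel_mailer": [0, 1],   # category never seen in training
})

# Keep exactly the training columns: missing ones are added as 0, extra ones are dropped
aligned = test_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned.columns.tolist())
print(aligned[["channel_X", "channel_ad", "channel_speech"]].values.tolist())
```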


In [59]:
test_processed.to_csv("data/LIAR_processed.csv", sep=";")