# Fake News Detection: Which NLP tools can help us tackle this issue?

## Introduction

The rise of fake news and propaganda has sparked numerous research efforts to address the problem using machine learning (ML). Some researchers concentrate on Deep Learning techniques, while others opt for a more hands-on approach. This study seeks to merge these two approaches by undertaking feature engineering. Our goal is to create features that can be leveraged for training neural networks, effectively bridging the gap between manual methods and automated Deep Learning techniques.

NLP project plan:

1. Data Acquisition
2. Text Cleaning
3. Text Processing
   - Sentence segmentation
   - Tokenization
   - Linguistic analysis:
        * Morphology (lemmatization, stemming)
        * Syntax (grammars, parse trees, syntactic parsing)
4. Feature Engineering
   - TF-IDF with ensemble method
   - NER
   - Embeddings (one hot encoding)
   - N-grams
   - Word2Vec
   - Not SVM / LDA
5. Feature Extraction
   - Chi2
6. Further work (model building)
7. Conclusions

Before jumping to the code, let us import all the tools necessary for this project.

In [33]:
import pandas as pd
import numpy as np
import matplotlib as plot

import nltk
from nltk.tokenize import word_tokenize
from string import punctuation

import re
import csv

In [None]:
nltk.download('punkt')

## 1. Data Acquisition

In this project we will be using the [LIAR](https://paperswithcode.com/dataset/liar) dataset. We have downloaded the resources into the `dataset` folder, so now we can load the data into a `pandas` data frame.

In [25]:
liar_df = pd.read_csv("dataset/train.tsv", sep='\t', header=None)

According to the documentation, the dataset has the following columns:

    Column 1: the ID of the statement ([ID].json).
    Column 2: the label.
    Column 3: the statement.
    Column 4: the subject(s).
    Column 5: the speaker.
    Column 6: the speaker's job title.
    Column 7: the state info.
    Column 8: the party affiliation.
    Columns 9-13: the total credit history count, including the current statement.
        9: barely true counts.
        10: false counts.
        11: half true counts.
        12: mostly true counts.
        13: pants on fire counts.
    Column 14: the context (venue / location of the speech or statement).

We can define a `column_names` variable to store them and introduce them as headers into our data frame.

In [47]:
column_names = ['Statement ID', 'Label', 'Statement', 'Subject', 'Speaker', 'Speaker\'s job title', 'State', 'Party affiliation', 'Barely true', 'False', 'Half true', 'Mostly true', 'Pants on fire', 'Statement context']
liar_df.columns = column_names

liar_df.head()

Unnamed: 0,Statement ID,Label,Statement,Subject,Speaker,Speaker's job title,State,Party affiliation,Barely true,False,Half true,Mostly true,Pants on fire,Statement context
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


In [56]:
liar_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10240 entries, 0 to 10239
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Statement ID         10240 non-null  object 
 1   Label                10240 non-null  object 
 2   Statement            10240 non-null  object 
 3   Subject              10238 non-null  object 
 4   Speaker              10238 non-null  object 
 5   Speaker's job title  7342 non-null   object 
 6   State                8030 non-null   object 
 7   Party affiliation    10238 non-null  object 
 8   Barely true          10238 non-null  float64
 9   False                10238 non-null  float64
 10  Half true            10238 non-null  float64
 11  Mostly true          10238 non-null  float64
 12  Pants on fire        10238 non-null  float64
 13  Statement context    10138 non-null  object 
dtypes: float64(5), object(9)
memory usage: 1.1+ MB


Before manipulating the data, let us make sure that no values that matter to us, such as statements and scores, are missing. As seen in the output above, some values are missing in the `Speaker's job title`, `State` and `Party affiliation` columns, but those columns are less important for our analysis.

In [30]:
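def check_column_values(df, columns_to_check):
    # Flag, per column, whether any value is NaN, an empty string, or None.
    nan_check = df[columns_to_check].isna().any()
    empty_string_check = (df[columns_to_check] == '').any()
    # pandas treats None as missing too, so this repeats the NaN check.
    none_check = pd.isna(df[columns_to_check]).any()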
    
    result = {
        'NaN Check': nan_check,
        'Empty String Check': empty_string_check,
        'None Check': none_check
    }

    return result

In [48]:
check_column_values(liar_df, ['Label', 'Statement', 'Subject', 'Barely true', 'False', 'Half true', 'Mostly true', 'Pants on fire'])

{'NaN Check': Label            False
 Statement        False
 Subject           True
 Barely true       True
 False             True
 Half true         True
 Mostly true       True
 Pants on fire     True
 dtype: bool,
 'Empty String Check': Label            False
 Statement        False
 Subject          False
 Barely true      False
 False            False
 Half true        False
 Mostly true      False
 Pants on fire    False
 dtype: bool,
 'None Check': Label            False
 Statement        False
 Subject           True
 Barely true       True
 False             True
 Half true         True
 Mostly true       True
 Pants on fire     True
 dtype: bool}
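The boolean checks above only say whether a column has gaps, not how many rows are affected. As a minimal sketch on a made-up data frame (`toy_df` and its values are illustrative assumptions, not LIAR data), `isna().sum()` counts the missing cells per column, and `isna().any(axis=1)` selects the rows that contain them. Since `isna()` flags both `None` and `NaN`, this also shows why the `NaN Check` and `None Check` outputs are identical.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; the column names mimic the LIAR headers.
toy_df = pd.DataFrame({
    "Statement": ["claim one", "claim two", "claim three"],
    "Subject": ["economy", None, "health-care"],
    "False": [1.0, np.nan, np.nan],
})

# Number of missing cells in each column (None and NaN both count as missing).
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Rows that have at least one missing value.
rows_with_gaps = toy_df[toy_df.isna().any(axis=1)]
print(rows_with_gaps.index.tolist())
```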

It looks like we do have missing values in several columns. Let us explore them further.

In [49]:
subj_na = liar_df[liar_df['Subject'].isna()]
subj_na.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 2142 to 9375
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Statement ID         2 non-null      object 
 1   Label                2 non-null      object 
 2   Statement            2 non-null      object 
 3   Subject              0 non-null      object 
 4   Speaker              0 non-null      object 
 5   Speaker's job title  0 non-null      object 
 6   State                0 non-null      object 
 7   Party affiliation    0 non-null      object 
 8   Barely true          0 non-null      float64
 9   False                0 non-null      float64
 10  Half true            0 non-null      float64
 11  Mostly true          0 non-null      float64
 12  Pants on fire        0 non-null      float64
 13  Statement context    0 non-null      object 
dtypes: float64(5), object(9)
memory usage: 240.0+ bytes


In [51]:
subj_na.loc[2142, 'Statement']

'The fact is that although we have had a president who is opposed to abortion over the last eight years, abortions have not gone down.\'\'\tabortion\tbarack-obama\tPresident\tIllinois\tdemocrat\t70\t71\t160\t163\t9\ta TV interview with megachurch pastor Rick Warren in Lake Forest, Calif.\n2724.json\ttrue\tMost of the jobs that we lost were lost before the economic policies we put in place had any effect.\teconomy,job-accomplishments,jobs,stimulus\tbarack-obama\tPresident\tIllinois\tdemocrat\t70\t71\t160\t163\t9\tan interview on The Daily Show with Jon Stewart"'

It looks like we do have all the information, but it wasn't parsed correctly. Here is what this particular row looks like in the raw dataset:

`638.json	false	"The fact is that although we have had a president who is opposed to abortion over the last eight years, abortions have not gone down.''	abortion	barack-obama	President	Illinois	democrat	70	71	160	163	9	a TV interview with megachurch pastor Rick Warren in Lake Forest, Calif.`

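The culprit is mismatched quotation marks: the statement opens with a double quote (`"`) but closes with two single quotes (`''`), so `pandas` treats everything up to the next double quote as one quoted field. Let us try to fix that.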

In [39]:
result_list = re.split(r'(?<!\')\t', subj_na.loc[2142, 'Statement'])

# Remove any remaining single quotes around the first and last elements
result_list[0] = result_list[0].lstrip('\'')
result_list[-1] = result_list[-1].rstrip('\'')

print(result_list)

["The fact is that although we have had a president who is opposed to abortion over the last eight years, abortions have not gone down.''\tabortion", 'barack-obama', 'President', 'Illinois', 'democrat', '70', '71', '160', '163', '9', 'a TV interview with megachurch pastor Rick Warren in Lake Forest, Calif.\n2724.json', 'true', 'Most of the jobs that we lost were lost before the economic policies we put in place had any effect.', 'economy,job-accomplishments,jobs,stimulus', 'barack-obama', 'President', 'Illinois', 'democrat', '70', '71', '160', '163', '9', 'an interview on The Daily Show with Jon Stewart"']
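A different route, not the one taken in this notebook, is to tell `pandas` to stop interpreting quote characters altogether. The sketch below uses the `quoting=csv.QUOTE_NONE` option of `read_csv` on two made-up rows that imitate the malformed record. The rows and the names `raw`, `default_df` and `literal_df` are illustrative assumptions, not part of the dataset code.

```python
import csv
import io

import pandas as pd

# Two toy rows (assumed for illustration): the first statement opens with "
# but closes with '' and the only other " sits at the end of the second row.
raw = (
    "638.json\tfalse\t\"Abortions have not gone down.''\ta TV interview\n"
    "2724.json\ttrue\tMost of the jobs were lost earlier.\tan interview\"\n"
)

# Default parsing: the opening " starts a quoted field that swallows the
# following tabs and the newline until the next " is found.
default_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# QUOTE_NONE: quote characters are treated as ordinary text.
literal_df = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print(default_df.shape)
print(literal_df.shape)
```

With quoting disabled, the quote characters stay in the text, so they would still have to be stripped in a later cleaning step.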


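The split is not clean (the first element still holds the statement and the subject joined by a tab), and repairing rows one by one does not scale. Instead, we can remove all the punctuation from the file before feeding it to `pandas`. We will define two functions for that.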

In [22]:
def remove_punctuation(text):
    words = word_tokenize(text)
    words_without_punct = [word for word in words if word not in punctuation]
    clean_text = ' '.join(words_without_punct)
    
    return clean_text

def process_tsv(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f_in:
        with open(output_file, 'w', encoding='utf-8') as f_out:
            for line in f_in:
                fields = line.strip().split('\t')
                clean_fields = [remove_punctuation(field) for field in fields]
                f_out.write('\t'.join(clean_fields) + '\n')

Now let us try to process our train.tsv file.

In [23]:
input_file = "dataset/train.tsv"
output_file = "dataset/train-cleaned.tsv"

process_tsv(input_file, output_file)

We can re-load the dataset into the data frame and see if it's any better after cleaning.

In [26]:
liar_df = pd.read_csv("dataset/train-cleaned.tsv", sep='\t', header=None)
liar_df.columns = column_names
liar_df.head()

Unnamed: 0,Statement ID,Label,Statement,Subject,Speaker,Speaker's job title,State,Party affiliation,Barely true,False,Half true,Mostly true,Pants on fire,Statement context
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0,1,0,0,0,a mailer
1,10540.json,half-true,When did the decline of coal start It started ...,energy history job-accomplishments,scott-surovell,State delegate,Virginia,democrat,0,0,1,1,0,a floor speech
2,324.json,mostly-true,Hillary Clinton agrees with John McCain `` by ...,foreign-policy,barack-obama,President,Illinois,democrat,70,71,160,163,9,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7,19,3,5,44,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,economy jobs,charlie-crist,,Florida,democrat,15,9,20,19,2,an interview on CNN


Let us check column values and the problematic row we had before.

In [28]:
liar_df.loc[2154]

Statement ID                                                    638.json
Label                                                              false
Statement              `` The fact is that although we have had a pre...
Subject                                                         abortion
Speaker                                                     barack-obama
Speaker's job title                                            President
State                                                           Illinois
Party affiliation                                               democrat
Barely true                                                           70
False                                                                 71
Half true                                                            160
Mostly true                                                          163
Pants on fire                                                          9
Statement context      a TV interview with megachur

In [31]:
check_column_values(liar_df, ['Label', 'Statement', 'Subject', 'Barely true', 'False', 'Half true', 'Mostly true', 'Pants on fire'])

{'NaN Check': Label            False
 Statement        False
 Subject          False
 Barely true      False
 False            False
 Half true        False
 Mostly true      False
 Pants on fire    False
 dtype: bool,
 'Empty String Check': Label            False
 Statement        False
 Subject          False
 Barely true      False
 False            False
 Half true        False
 Mostly true      False
 Pants on fire    False
 dtype: bool,
 'None Check': Label            False
 Statement        False
 Subject          False
 Barely true      False
 False            False
 Half true        False
 Mostly true      False
 Pants on fire    False
 dtype: bool}

In [32]:
subj_na = liar_df[liar_df['Subject'].isna()]
subj_na.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Statement ID         0 non-null      object
 1   Label                0 non-null      object
 2   Statement            0 non-null      object
 3   Subject              0 non-null      object
 4   Speaker              0 non-null      object
 5   Speaker's job title  0 non-null      object
 6   State                0 non-null      object
 7   Party affiliation    0 non-null      object
 8   Barely true          0 non-null      int64 
 9   False                0 non-null      int64 
 10  Half true            0 non-null      int64 
 11  Mostly true          0 non-null      int64 
 12  Pants on fire        0 non-null      int64 
 13  Statement context    0 non-null      object
dtypes: int64(5), object(9)
memory usage: 0.0+ bytes


It looks like we have all the data we need. However, small artifacts remain: row 2154, for example, still contains redundant `` marks, which is how NLTK's `word_tokenize` rewrites an opening double quote. We will address this issue in the Text Processing section.
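One likely reason these marks survive is that `word not in punctuation` is a substring test against the `string.punctuation` string, and a two-character token such as `` does not occur in it. The sketch below illustrates this on a hand-written token list standing in for `word_tokenize` output (the list and the names `kept_substring` and `kept_charwise` are assumptions for illustration), and compares it with a character-wise check.

```python
from string import punctuation

# Hand-written tokens imitating what a Treebank-style tokenizer produces
# for a quoted sentence (an assumption, not actual notebook output).
tokens = ["``", "The", "fact", ",", "''", "."]

# Substring test, as in remove_punctuation: "``" and "''" are not
# substrings of string.punctuation, so they are kept.
kept_substring = [tok for tok in tokens if tok not in punctuation]

# Character-wise test: drop a token if every character is punctuation.
punct_chars = set(punctuation)
kept_charwise = [
    tok for tok in tokens if not all(ch in punct_chars for ch in tok)
]

print(kept_substring)
print(kept_charwise)
```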

## 2. Text Processing

Let us explore what labels we have.

In [51]:
labels = liar_df['Label'].unique()
print(labels)

['false' 'half-true' 'mostly-true' 'true' 'barely-true' 'pants-fire']


Five of these six labels (all except `true`) correspond to the count columns that we have in our dataset. Let us now split the data frame into six, one per label, and save each part to its own CSV file.

In [54]:
for label in liar_df['Label'].unique():
    label_df = liar_df[liar_df['Label'] == label].copy()
    
    file_name = f"dataset/{label}_data.csv"
    
    label_df.to_csv(file_name, index=False)

We want our processing pipeline to look like this:

In [None]:
def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data)
    data = remove_apostrophe(data)
    data = remove_single_characters(data)
    data = convert_numbers(data)
    data = remove_stop_words(data)
    data = stemming(data)
    data = remove_punctuation(data)
    data = convert_numbers(data)
    return data

As suggested in [this article](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089), some steps are repeated on purpose, as this helps clean the data more thoroughly.
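Most of the helpers named in the pipeline are not defined yet. As a rough sketch of what a few of them could look like, using only the standard library, consider the block below. The bodies, the tiny `STOP_WORDS` set, and the names `strip_punctuation` and `preprocess_partial` are assumptions made for illustration. Stemming and number conversion are left out because they depend on third-party resources.

```python
import string

# Tiny illustrative stop-word set (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "that", "we"}

# Map every ASCII punctuation mark except the apostrophe to a space.
_PUNCT_TO_SPACE = str.maketrans(
    {ch: " " for ch in string.punctuation if ch != "'"}
)

def convert_lower_case(text):
    return text.lower()

def strip_punctuation(text):
    # Hypothetical stand-in for remove_punctuation; also collapses whitespace.
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

def remove_apostrophe(text):
    return text.replace("'", "")

def remove_single_characters(text):
    return " ".join(tok for tok in text.split() if len(tok) > 1)

def remove_stop_words(text):
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

def preprocess_partial(text):
    # Apply the steps in the same relative order as the pipeline above.
    steps = (
        convert_lower_case,
        strip_punctuation,
        remove_apostrophe,
        remove_single_characters,
        remove_stop_words,
    )
    for step in steps:
        text = step(text)
    return text

cleaned = preprocess_partial("The economy didn't grow, a fact that we can't ignore!")
print(cleaned)
```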

## 4. Feature Engineering

### 4.1 TF-IDF
TF-IDF is useful for determining the importance of a word in a document relative to a collection of documents. It helps identify significant terms in a document, which can then be used for classification or similarity scoring. Fake news often contains certain distinctive terms or phrases that can be detected using TF-IDF.
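
To make the idea concrete, here is a minimal sketch of the plain textbook formulation, where the score of a term `t` in a document `d` is `tf(t, d) * log(N / df(t))`: `tf` is the term's relative frequency in the document, `N` is the number of documents, and `df` is the number of documents containing the term. The three toy documents and all names in the block are assumptions for illustration. Libraries such as scikit-learn use a smoothed and normalized variant, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "taxes will go up",
    "taxes will go down",
    "the moon landing was staged",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

scores = [tf_idf(tokens) for tokens in tokenized]

# "up" occurs in one document only, so it outweighs "taxes", which occurs in two.
print(scores[0])
```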