# Natural Language Processing - Fake News Detection with the LIAR dataset

Dataset [source](https://paperswithcode.com/dataset/liar)

## Import necessary libraries

- `pandas` for reading and manipulating the dataset
- `nltk` for natural language processing (stop-word removal and lemmatization)
- `re` for applying regular expressions (RegEx)
- `string` for the list of punctuation characters
- `sklearn` (scikit-learn) for some algorithms and for checking accuracy through its accuracy score

In [1]:
import re
import string

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amoghshakya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amoghshakya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Reading the dataset

The LIAR dataset is distributed in `tsv` (tab-separated values) format, so we pass `sep="\t"` to `pandas.read_csv()` to split the fields on tabs.

In [2]:
# read dataset
dataset_folder = "./liar_dataset"

# these are tsv files, so put separator as '\t'
train_df = pd.read_csv(f"{dataset_folder}/train.tsv", sep="\t")
test_df = pd.read_csv(f"{dataset_folder}/test.tsv", sep="\t")
valid_df = pd.read_csv(f"{dataset_folder}/valid.tsv", sep="\t")

In [3]:
train_df

| | 2635.json | false | Says the Annies List political group supports third-trimester abortions on demand. | abortion | dwayne-bohac | State representative | Texas | republican | 0 | 1 | 0.1 | 0.2 | 0.3 | a mailer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |
| 1 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |
| 2 | 1123.json | false | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 3 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |
| 4 | 12465.json | true | The Chicago Bears have had more starting quart... | education | robin-vos | Wisconsin Assembly speaker | Wisconsin | republican | 0.0 | 3.0 | 2.0 | 5.0 | 1.0 | a an online opinion-piece |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10234 | 5473.json | mostly-true | There are a larger number of shark attacks in ... | animals,elections | aclu-florida | | Florida | none | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | interview on "The Colbert Report" |
| 10235 | 3408.json | mostly-true | Democrats have now become the party of the [At... | elections | alan-powell | | Georgia | republican | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | an interview |
| 10236 | 3959.json | half-true | Says an alternative to Social Security that op... | retirement,social-security | herman-cain | | Georgia | republican | 4.0 | 11.0 | 5.0 | 3.0 | 3.0 | a Republican presidential debate |
| 10237 | 2253.json | false | On lifting the U.S. Cuban embargo and allowing... | florida,foreign-policy | jeff-greene | | Florida | democrat | 3.0 | 1.0 | 3.0 | 0.0 | 0.0 | a televised debate on Miami's WPLG-10 against ... |


## Preprocessing

The files have no header row (bummer), so pandas took the first record of each file as the column names. We can assign the proper column names listed in the README file instead.

In [4]:
# define columns 

columns = ["id", "label", "statement", "subject", "speaker", 
           "speaker_job", "state_info", "party_affiliation", 
           "barely_true_counts", "false_counts", "half_true_counts", 
           "mostly_true_counts", "pants_on_fire_counts", "context"]

train_df.columns = columns
test_df.columns = columns
valid_df.columns = columns

In [5]:
print(train_df.columns)
print(test_df.columns)
print(valid_df.columns)

Index(['id', 'label', 'statement', 'subject', 'speaker', 'speaker_job',
       'state_info', 'party_affiliation', 'barely_true_counts', 'false_counts',
       'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts',
       'context'],
      dtype='object')
Index(['id', 'label', 'statement', 'subject', 'speaker', 'speaker_job',
       'state_info', 'party_affiliation', 'barely_true_counts', 'false_counts',
       'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts',
       'context'],
      dtype='object')
Index(['id', 'label', 'statement', 'subject', 'speaker', 'speaker_job',
       'state_info', 'party_affiliation', 'barely_true_counts', 'false_counts',
       'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts',
       'context'],
      dtype='object')
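
Assigning `columns` after loading replaces the header that `read_csv` inferred from the first record, so that record never makes it into the data. A minimal sketch of one way to keep it, assuming the files really are headerless: pass `header=None` together with `names=`. The in-memory sample and the names `sample_tsv`, `sample_columns`, `lossy_df` and `full_df` are made up for illustration and are not part of the notebook.

```python
import io

import pandas as pd

# Two made-up rows standing in for a headerless LIAR-style TSV file
# (only three of the fourteen columns, to keep the sketch short).
sample_tsv = (
    "1.json\thalf-true\tfirst statement\n"
    "2.json\tmostly-true\tsecond statement\n"
)
sample_columns = ["id", "label", "statement"]

# Default behaviour: the first record is consumed as the header row.
lossy_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# header=None keeps every record as data; names= supplies the column labels.
full_df = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", header=None, names=sample_columns
)

print(len(lossy_df), len(full_df))  # 1 2
print(full_df["id"].tolist())       # ['1.json', '2.json']
```

With the real files, the same two arguments could be added to each `pd.read_csv(...)` call, which would also make the later `train_df.columns = columns` assignments unnecessary.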


### Define preprocessing functions

Here, we check which labels occur in the dataset and then define the functions that we are going to apply to it.

> I created a class for it because I like to organize things. (*there are better ways to do this*)

In [6]:
train_df['label'].unique()

array(['half-true', 'mostly-true', 'false', 'true', 'barely-true',
       'pants-fire'], dtype=object)

In [7]:
class Preprocessor:

    @staticmethod
    def preprocess_text(text: str) -> str:
        lemmatizer = WordNetLemmatizer()
        stop_words = set(stopwords.words("english"))
        text = text.lower()
        # TL;DR: translate() is just a fancier replace method.
        # str.maketrans("", "", chars) builds a mapping table whose third argument
        # lists the characters to delete; translate() applies that table to the
        # string, which here strips every punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = re.sub(r'\d+', '', text)
        tokens = text.split()
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
        processed_text = " ".join(tokens)
        return processed_text
    
    @staticmethod
    def numero_labelo(label: str) -> int:
        # binary labeling because it's not fake if it's half true
        return 1 if label in ["half-true", "mostly-true", "true"] else 0
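
The `translate()`/`maketrans()` step can be tried in isolation. The sketch below uses only the standard library; the sample sentence and the hyphen-to-space variant are illustrative assumptions, not part of the notebook's pipeline.

```python
import string

# str.maketrans("", "", chars) builds a table that maps every character
# in `chars` to None, so translate() deletes those characters.
strip_punctuation = str.maketrans("", "", string.punctuation)

sample = "Bloody-mindedness: it's analog, not digital."
cleaned = sample.lower().translate(strip_punctuation)
print(cleaned)  # bloodymindedness its analog not digital

# Hypothetical variant: turn hyphens into spaces first so that
# hyphenated compounds stay as two separate tokens.
spaced = sample.lower().replace("-", " ").translate(strip_punctuation)
print(spaced)  # bloody mindedness its analog not digital
```

Deleting punctuation outright glues hyphenated words and contractions together (`bloodymindedness`, `its`), which may or may not be what you want for a bag-of-words model.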

In [8]:
preprocessor = Preprocessor()

# example
print(Preprocessor.preprocess_text(text="Bloody-mindedness, and the understanding that it's analog, not digital. \
                        You won't go from unable to able; you'll go from unable to more able, progressively, \
                        for as long as you remain bloody-minded about continually working on it."))

# Quote by Kevin Malone (written out the way he intended it)
print(preprocessor.preprocess_text("Why waste time saying a lot of words when few words can do the trick?"))

# Convert Label example
print(preprocessor.numero_labelo("half-true"))
print(preprocessor.numero_labelo("barely-true"))

bloodymindedness understanding analog digital wont go unable able youll go unable able progressively long remain bloodyminded continually working
waste time saying lot word word trick
1
0
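
To label a whole column rather than a single string, the helper can be passed to `Series.map`. This is a rough sketch on a made-up frame: `to_binary_label` is a stand-in that mirrors `Preprocessor.numero_labelo`, and the `binary_label` column name is an assumption.

```python
import pandas as pd

TRUTHFUL_LABELS = {"half-true", "mostly-true", "true"}


def to_binary_label(label: str) -> int:
    """Stand-in mirroring Preprocessor.numero_labelo: 1 for truthful labels, else 0."""
    return 1 if label in TRUTHFUL_LABELS else 0


# Tiny made-up frame holding the six label values seen in train_df["label"].
demo_df = pd.DataFrame(
    {"label": ["half-true", "mostly-true", "false", "true", "barely-true", "pants-fire"]}
)

# Series.map calls the function once per row and returns a new Series.
demo_df["binary_label"] = demo_df["label"].map(to_binary_label)
print(demo_df["binary_label"].tolist())  # [1, 1, 0, 1, 0, 0]
```

The same pattern should work for the text column, for example mapping `Preprocessor.preprocess_text` over `train_df["statement"]`.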
