# Exploratory Data Analysis

* Dataset taken from https://github.com/Tariq60/LIAR-PLUS

## 1. Import Libraries and Define Constants

In [19]:
import numpy as np
import pandas as pd

TRAIN_PATH = "../data/raw/dataset/tsv/train2.tsv"
VAL_PATH = "../data/raw/dataset/tsv/val2.tsv"
TEST_PATH = "../data/raw/dataset/tsv/test2.tsv"

columns = ["id", "statement_json", "label", "statement", "subject", "speaker", "speaker_title", "state_info",
           "party_affiliation", "barely_true_count", "false_count", "half_true_count", "mostly_true_count",
           "pants_fire_count", "context", "justification"]



## 2. Read the dataset

In [13]:
train_df = pd.read_csv(TRAIN_PATH, sep="\t", names=columns)
val_df = pd.read_csv(VAL_PATH, sep="\t", names=columns)
test_df = pd.read_csv(TEST_PATH, sep="\t", names=columns)
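
The `names=columns` argument matters because the TSV files appear to have no header row. The following is a minimal sketch of that behavior, assuming a tiny in-memory TSV (`raw`) and a shortened, hypothetical column list (`toy_columns`) in place of the real files:

```python
import io

import pandas as pd

# Hypothetical two-row, header-less TSV standing in for a LIAR-PLUS split
# (only three of the sixteen columns, for brevity).
raw = "0\t2635.json\tfalse\n1\t10540.json\thalf-true\n"
toy_columns = ["id", "statement_json", "label"]

# Passing names= makes pandas treat the first line as data, not as a header.
toy_df = pd.read_csv(io.StringIO(raw), sep="\t", names=toy_columns)

# Without names=, the first record is consumed as the header and one row is lost.
wrong_df = pd.read_csv(io.StringIO(raw), sep="\t")

print(len(toy_df), len(wrong_df))
```

If a split ever came back one row short, a missing `names=` argument would be one plausible cause to rule out.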

In [14]:
print(f"Length of train set: {len(train_df)}")
print(f"Length of validation set: {len(val_df)}")
print(f"Length of test set: {len(test_df)}")

Length of train set: 10242
Length of validation set: 1284
Length of test set: 1267


In [45]:
train_df.head()

| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2635.json | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer | That's a premise that he fails to back up. Ann... |
| 1 | 1.0 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. | Surovell said the decline of coal "started whe... |
| 2 | 2.0 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver | Obama said he would have voted against the ame... |
| 3 | 3.0 | 1123.json | false | Health care reform legislation is likely to ma... | health-care | blog-posting | NaN | NaN | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release | The release may have a point that Mikulskis co... |
| 4 | 4.0 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | NaN | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN | Crist said that the economic "turnaround start... |


## 3. Data Cleaning

* The most important columns for this analysis are `label` and `statement`.
* Before going further, we check whether they contain null values, starting with `label`.
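
A check like this can cover both key columns at once. The sketch below is illustrative only: it uses a small hypothetical frame (`toy_df`) in place of `train_df`, and it also counts empty strings, which `isna()` does not flag.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df.
toy_df = pd.DataFrame({
    "label": ["false", np.nan, "half-true"],
    "statement": ["Statement A", np.nan, ""],
})

key_columns = ["label", "statement"]

# isna() marks missing cells (NaN/None); sum() counts them per column.
missing_per_column = toy_df[key_columns].isna().sum()

# Empty strings are valid values to pandas, so they are counted separately.
empty_per_column = (toy_df[key_columns] == "").sum()

print(missing_per_column)
print(empty_per_column)
```

Keeping the two counts separate makes it clear whether a gap comes from a truly missing value or from a blank string.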

In [33]:
print("Do we have missing values in `label`?")
pd.isna(train_df["label"]).value_counts()

Do we have missing values in `label`?


False    10240
True         2
Name: label, dtype: int64

* Two entries have no label.
* Which entries are they, and what do they contain?

In [36]:
train_df.loc[pd.isna(train_df["label"]), :].index

Int64Index([2143, 9377], dtype='int64')

In [42]:
train_df.loc[[2143]]


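| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2143 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |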


In [43]:
train_df.loc[[9377]]

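| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9377 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |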


* All the columns of those 2 entries are blank.
* Since they carry no information, we drop them.

In [44]:
train_df.dropna(subset=["label"], inplace=True)
len(train_df)

10240
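
The observation that both rows are entirely blank can also be verified programmatically rather than by inspecting each row. The following is a minimal sketch on a hypothetical toy frame (`toy_df`, not the real data); it also contrasts `dropna(how="all")` with the `subset=["label"]` variant used above, and shows an optional `reset_index` step that is an assumption here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is fully blank, row 3 only lacks a label.
toy_df = pd.DataFrame({
    "label": ["false", np.nan, "half-true", np.nan],
    "statement": ["Statement A", np.nan, "Statement B", "Statement C"],
    "speaker": ["speaker-a", np.nan, np.nan, "speaker-c"],
})

# True only for rows in which every column is missing.
fully_blank = toy_df.isna().all(axis=1)
blank_index = toy_df.index[fully_blank].tolist()

# how="all" drops only fully blank rows; reset_index renumbers the survivors.
cleaned = toy_df.dropna(how="all").reset_index(drop=True)

# subset=["label"] drops every unlabeled row, blank or not.
by_label = toy_df.dropna(subset=["label"])

print(blank_index, len(cleaned), len(by_label))
```

On the real training set the two approaches should remove the same two rows if, as observed, every unlabeled row is entirely blank; `subset=["label"]` is the stricter choice when a label is required downstream.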