In [1]:
# Import any needed packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Read in training dataset
train = pd.read_csv("../../Data/RawDataCsvFormat/train.csv",index_col=0)
# Read in testing dataset
test = pd.read_csv("../../Data/RawDataCsvFormat/test.csv",index_col=0)
# Test has an extra index column read in from the file. Need to remove it
test = test.iloc[:,1:]
# Read in development set
dev = pd.read_csv("../../Data/RawDataCsvFormat/dev.csv",index_col=0)

In [3]:
# Look at the proportion of rows in each dataset that contain at least one NA/Null value
# (the mean of a boolean Series is the fraction of True entries)
train_na_prop = train.isnull().any(axis=1).mean()
dev_na_prop = dev.isnull().any(axis=1).mean()
test_na_prop = test.isnull().any(axis=1).mean()

In [4]:
print("Proportion of rows with NA/Null feature value in Training Set:")
print(train_na_prop)
print("Proportion of rows with NA/Null feature value in Development Set:")
print(dev_na_prop)
print("Proportion of rows with NA/Null feature value in Testing Set:")
print(test_na_prop)

Proportion of rows with NA/Null feature value in Training Set:
0.1990439381611066
Proportion of rows with NA/Null feature value in Development Set:
0.20475020475020475
Proportion of rows with NA/Null feature value in Testing Set:
0.18461538461538463


In [5]:
# Look at proportions of missing values in each column of each of the datasets
train_col_propNA = train.isnull().sum()/len(train)
dev_col_propNA = dev.isnull().sum()/len(dev)
test_col_propNA = test.isnull().sum()/len(test)

In [6]:
print("Proportion of values in each column that are NA/Null for the Training Set:")
print(train_col_propNA)
print("Proportion of values in each column that are NA/Null for the Development Set:")
print(dev_col_propNA)
print("Proportion of values in each column that are NA/Null for the Testing Set:")
print(test_col_propNA)

Proportion of values in each column that are NA/Null for the Training Set:
claim_id          0.000000
claim             0.000814
date_published    0.197010
explanation       0.000814
fact_checkers     0.001119
main_text         0.002644
sources           0.002848
label             0.002746
subjects          0.002848
dtype: float64
Proportion of values in each column that are NA/Null for the Development Set:
claim_id          0.000000
claim             0.001638
date_published    0.200655
explanation       0.001638
fact_checkers     0.003276
main_text         0.003276
sources           0.004095
label             0.004914
subjects          0.005733
dtype: float64
Proportion of values in each column that are NA/Null for the Testing Set:
claim_id          0.000000
claim             0.000000
date_published    0.182996
explanation       0.000000
fact_checkers     0.000000
main_text         0.000000
sources           0.000000
label             0.001619
subjects          0.001619
dtype: float64

'date_published' is the only feature with a high share of Null values (roughly 18% to 20%), and that share is consistent across the three datasets. It also accounts for nearly all of the rows flagged above as containing a Null value, since every other column is missing in well under 1% of rows. I will keep this in mind later, when I begin testing and tuning models, because a large number of missing values can cause confusion when evaluating and interpreting results.
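One way to check how much of the row-level missingness comes from 'date_published' alone is to recompute the row proportion while ignoring that column. The sketch below is illustrative only: `row_na_proportion` is a hypothetical helper, and `toy` is a made-up stand-in for the real train/dev/test frames.

```python
import pandas as pd


def row_na_proportion(df: pd.DataFrame, ignore: tuple = ()) -> float:
    """Share of rows with at least one null, optionally ignoring some columns."""
    cols = [c for c in df.columns if c not in ignore]
    return float(df[cols].isnull().any(axis=1).mean())


# Toy stand-in for train/dev/test (hypothetical values, not the real data)
toy = pd.DataFrame({
    "claim": ["a", "b", "c", "d", None],
    "date_published": ["2020-01-01", None, None, "2019-05-02", None],
    "label": ["true", "false", "mixture", "true", None],
})

# Proportion of rows with any null, with and without the date column
all_cols = row_na_proportion(toy)
without_date = row_na_proportion(toy, ignore=("date_published",))
print(all_cols, without_date)
```

On the real data the same call would be `row_na_proportion(train, ignore=("date_published",))`. A large drop between the two numbers would confirm that the date column drives most of the flagged rows.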

In [16]:
# Quick look at the label distributions for each of these three data sets
train_labels = train.loc[:,"label"]
dev_labels = dev.loc[:,"label"]
test_labels = test.loc[:,"label"]
# Look at the unique values in each of the datasets
train_lab_names = train_labels.unique()
dev_lab_names = dev_labels.unique()
test_lab_names = test_labels.unique()

In [21]:
# Know from dataset description, labels are {true,false,mixture,unproven} and possibly nan
print(train_lab_names)
print(dev_lab_names)
print(test_lab_names)

['false' 'mixture' 'true' 'unproven' nan 'snopes']
['unproven' 'true' 'false' 'mixture' nan
 'National, Candidate Biography, Donald Trump, ']
['false' 'true' 'unproven' 'mixture' nan]


The test label names align with the dataset description. The one thing left to check is why some label values are read as 'nan', to confirm that this is not a mistake (or misalignment) in how the data was read in or stored. Sacrificing labeled data that could be recovered can be a huge mistake: labeled data is scarce in general, and generative models almost always perform better when given more valid data to learn from.
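A possible first step for examining the 'nan' labels is to pull out those rows and count how many of their other fields are also empty. This is a minimal sketch under assumptions: `null_label_report` and the `n_other_null` column are hypothetical names, and `toy` is invented data rather than the real test set.

```python
import pandas as pd


def null_label_report(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return rows whose label is null, with a count of other null fields per row."""
    missing = df.loc[df[label_col].isnull()].copy()
    other_cols = [c for c in df.columns if c != label_col]
    missing["n_other_null"] = missing[other_cols].isnull().sum(axis=1)
    return missing


# Toy stand-in (hypothetical values, not the real data)
toy = pd.DataFrame({
    "claim": ["a", None, "c"],
    "sources": ["x", None, "y"],
    "label": ["true", None, None],
})

report = null_label_report(toy)
print(report)
```

A row where nearly every other field is also null suggests a blank or broken record, while a row that is otherwise complete suggests the label alone was lost or shifted into another column.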

The training set label names have 'snopes' in them for some reason. This is likely a mistake with the tabs in the .tsv file that put the value in the wrong column, which should be an easy fix. snopes.com is one of the sources for this data, so the value probably got shifted into this column because of some whitespace mix-up when reading and writing files multiple times. This is another indicator that it is important to examine the nan values in the label columns. These little problems arise all the time and are worth thinking about with every dataset.

The development set seems to have the same kind of problem, but the value 'National, Candidate Biography, Donald Trump, ' looks more like it belongs in the subjects column.
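To look at the offending rows directly, one option is to filter for labels that are present but fall outside the four documented values. The sketch below assumes the label set from the dataset description; `unexpected_label_rows` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

# Label set taken from the dataset description
EXPECTED_LABELS = {"true", "false", "mixture", "unproven"}


def unexpected_label_rows(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return rows whose label is non-null but not one of the expected values."""
    labels = df[label_col]
    mask = labels.notnull() & ~labels.isin(EXPECTED_LABELS)
    return df.loc[mask]


# Toy stand-in (hypothetical values, not the real data)
toy = pd.DataFrame({
    "sources": ["s1", None, "s3", "s4", "s5"],
    "label": ["false", "snopes", None,
              "National, Candidate Biography, Donald Trump, ", "true"],
    "subjects": ["Health", None, None, None, "Health"],
})

bad = unexpected_label_rows(toy)
print(bad)
```

Viewing the full rows, rather than only the label, should make it easier to judge whether neighbouring columns were shifted as well.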