# Predicting False Statements

Cristian Nuno | May 7, 2019

## Question

Can we use the content of a statement to predict how likely it is to be true?

## Dataset
Today we will be using the [LIAR dataset](https://www.cs.ucsb.edu/~william/data/liar_dataset.zip) from 2017, which includes 12.8K human-labeled short statements from [PolitiFact's](https://www.politifact.com/truth-o-meter/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/) [API](https://s3.amazonaws.com/static.politifact.com/api/doc.html).

Each statement was evaluated by a PolitiFact editor for its truthfulness. The table below lists the labels assigned to statements in the LIAR dataset:


| **Label** | **Description** |
| :-------: | :-------------: |
| True (`true`) | The statement is accurate and there’s nothing significant missing. |
| Mostly True (`mostly-true`) | The statement is accurate but needs clarification or additional information. |
| Half True (`half-true`) | The statement is partially accurate but leaves out important details or takes things out of context. |
| Barely True* (`barely-true`) | The statement contains an element of truth but ignores critical facts that would give a different impression. |
| False (`false`) | The statement is not accurate. |
| Pants on Fire (`pants-fire`) | The statement is not accurate and makes a ridiculous claim. |

*\*PolitiFact labels these statements 'Mostly False', but the creators of the LIAR dataset relabeled them as 'Barely True'.*

## Example Statements

### True

> 'The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.' - Robin Vos, R-WI State Assembly Speaker

### Mostly True

> 'Youth unemployment in minority communities is about 40 to 45 percent.' - Peter Kinder, Fmr. Lt. Governor, R-MO

### Half True

> 'Says that voter identification laws keep poor people from voting, minorities from voting, the elderly from voting, students from voting.' - Marcia Fudge, D-OH 11th District Representative

### Barely True

> 'The jobs bill includes President Obamas tax on soup kitchens' - Eric Cantor, Fmr. R-VA 7th District Representative

### False

> 'I dont know who (Jonathan Gruber) is.' - Nancy Pelosi, Speaker of the House, D-CA 12th District Representative

### Pants on Fire

> 'In the case of a catastrophic event, the Atlanta-area offices of the Centers for Disease Control and Prevention will self-destruct.' - The Walking Dead


In [53]:
# load necessary modules
import numpy as np
import pandas as pd
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # improve image quality of inline graphics
import matplotlib.pyplot as plt

In [2]:
# store the column names
# note: see the README for more details
column_names = ["id", "label", "statement",
                "subject", "speaker", "speaker_position",
                "state", "party", "barely_true_counts", 
                "false_counts", "half_true_counts", "mostly_true_counts",
                "pants_on_fire_counts", "context"]

# import data
news_df = pd.read_csv("raw_data/train.tsv", 
                      sep="\t", 
                      header=None,
                      names=column_names)

In [149]:
news_df['label'].unique()

array(['false', 'half-true', 'mostly-true', 'true', 'barely-true',
       'pants-fire'], dtype=object)

## Let's create a flag to separate true statements from false ones

To do this, we'll write our own function, inspired by the [pd.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, to transform the categorical variable `label` into a single indicator variable. Statements labeled `true` are encoded as 1, statements labeled `false` or `pants-fire` are encoded as 0, and statements with any other label are encoded as missing and dropped.

The name 'dummies' refers to the information contained in each newly created indicator variable: a 0 or a 1. The value 0 or 1 indicates the absence or presence of some categorical effect that may be expected to influence the outcome (here, whether a statement is true rather than false or extremely false).
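
For comparison, here is a minimal sketch of what `pd.get_dummies()` itself produces, next to a single hand-built flag. The four labels are toy values chosen for illustration, not rows from the LIAR dataset, and the names `labels`, `dummies` and `is_pants_fire` are assumptions of this sketch.

```python
import pandas as pd

# Toy labels for illustration only; these are not rows from the LIAR dataset
labels = pd.Series(["true", "false", "pants-fire", "half-true"], name="label")

# pd.get_dummies() creates one 0/1 indicator column per distinct label value
dummies = pd.get_dummies(labels, dtype=int)
print(dummies)

# A single flag for one category of interest can be built with a comparison
is_pants_fire = (labels == "pants-fire").astype(int)
print(is_pants_fire.tolist())
```

A single flag is enough when the outcome is binary, which is why the notebook writes its own small encoder instead of keeping every indicator column.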

In [150]:
def encode_label(label):
    """Encode true as 1, pants-on-fire as 0, everything else as None"""
    if label == "true":
        return 1
    elif label == "pants-fire" or label == "false":
        return 0
    else:
        return None

In [151]:
news_df["label_numeric"] = news_df["label"].apply(encode_label)

In [152]:
data = news_df[['statement', 'label_numeric']].dropna()

In [153]:
data.head()

| | **statement** | **label_numeric** |
| :-: | :-----------: | :---------------: |
| 0 | Says the Annies List political group supports ... | 0.0 |
| 3 | Health care reform legislation is likely to ma... | 0.0 |
| 5 | The Chicago Bears have had more starting quart... | 1.0 |
| 12 | When Mitt Romney was governor of Massachusetts... | 0.0 |
| 16 | McCain opposed a requirement that the governme... | 1.0 |


## Now that we have our data, let's build our model using two methods

### TFIDF
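Term frequency–inverse document frequency (TFIDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Both pipelines below use it to turn each statement into a numeric vector; the first (`tfidf_pipe`) feeds those vectors to a multinomial naive Bayes classifier.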
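
To make the weighting concrete, here is a small sketch on a made-up three-sentence corpus (the sentences are hypothetical, not taken from LIAR). It uses scikit-learn's `TfidfVectorizer` with default settings and compares the weight of a word that appears in only one document with the weight of a word shared by two.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "taxes" appears in two documents, "up" in only one
corpus = [
    "taxes will go up",
    "taxes will go down",
    "the moon landing",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

vocab = vectorizer.vocabulary_            # maps each word to its column index
first_doc = tfidf[0].toarray().ravel()    # TFIDF weights for the first sentence

# Within the first sentence, the rarer word receives the larger weight
print("up:   ", first_doc[vocab["up"]])
print("taxes:", first_doc[vocab["taxes"]])
```

Both words occur once in the first sentence, so the difference in weight comes entirely from how many documents contain each word.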

### Random Forest
Random Forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
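
The sketch below shows how the individual trees are combined, using synthetic data from `make_classification` rather than the LIAR statements. One caveat worth stating: scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities instead of taking a hard majority vote, so that is what the check demonstrates. The variable names are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect each tree's class probabilities for the first five samples
tree_probs = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])

# Average across trees and compare with the forest's own output
mean_probs = tree_probs.mean(axis=0)
matches = np.allclose(mean_probs, forest.predict_proba(X[:5]))
print(matches)
```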

In [179]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# TFIDF features fed to a multinomial naive Bayes classifier
tfidf_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                       ('model', MultinomialNB())])

# TFIDF features fed to a random forest of 100 trees
rf_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('model', RandomForestClassifier(n_estimators=100))])

# create training and test data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['statement'],
                                                    data['label_numeric'],
                                                    test_size=0.2, 
                                                    random_state=2019)

In [186]:
tfidf_pipe.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...e,
        vocabulary=None)), ('model', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [187]:
rf_pipe.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [188]:
# Some false statements (sampled with replacement, so repeats are possible)
false_statements = np.random.choice(X_test[y_test==0], size=5)
false_statements

array(['In three days last week, (Gov. Rick) Perry flew to five cities at taxpayers expense, holding press conferences, delivering $2,325,000 in checks.',
       'What is the proper collective noun for a group of baboons? Believe it or not . . . a Congress!',
       'When we took office, let me remind you, there was virtually no international pressure on Iran.',
       'Says Saddam Hussein had a 10-year relationship with al-Qaida.',
       'The most likely triggering cause of (microcephaly) is the DTaP shot, a vaccine that had been recently mandated by the Brazilian government to be injected into pregnant women.'],
      dtype=object)

In [189]:
tfidf_pipe.predict_proba(false_statements)

array([[0.64261609, 0.35738391],
       [0.69152126, 0.30847874],
       [0.75915142, 0.24084858],
       [0.59077478, 0.40922522],
       [0.67149034, 0.32850966]])

In [190]:
rf_pipe.predict_proba(false_statements)

array([[0.58, 0.42],
       [0.48, 0.52],
       [0.58, 0.42],
       [0.76, 0.24],
       [0.83, 0.17]])

In [191]:
# Some true statements (sampled with replacement, so repeats are possible)
true_statements = np.random.choice(X_test[y_test==1], size=5)
true_statements

array(['The very first meal on the surface of the moon was the Holy Communion.',
       'Under legislation that has cleared the Georgia House, some children who are legal refugees could obtain state scholarships to attend private schools.',
       'We have one of the most expensive General Assemblies, per capita, in the entire country.',
       'Over the past twenty years, the number of homicides committed with a firearm in the United States has decreased by nearly 40 percent. The number of other crimes involving the use of a firearm has also plummeted, declining by nearly 70 percent.',
       'We have one of the most expensive General Assemblies, per capita, in the entire country.'],
      dtype=object)

In [192]:
tfidf_pipe.predict_proba(true_statements)

array([[0.70321654, 0.29678346],
       [0.62922799, 0.37077201],
       [0.53058161, 0.46941839],
       [0.42896536, 0.57103464],
       [0.53058161, 0.46941839]])

In [193]:
rf_pipe.predict_proba(true_statements)

array([[0.66, 0.34],
       [0.42, 0.58],
       [0.57, 0.43],
       [0.43, 0.57],
       [0.57, 0.43]])
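
When reading the arrays above, it helps to know which column belongs to which class. The sketch below, on hypothetical toy statements rather than the LIAR data, looks up the column order from the fitted classifier's `classes_` attribute and summarises the predictions with `accuracy_score`. All names and sentences in it are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy statements: 1 = true, 0 = false
train_text = ["taxes went up", "taxes went down", "crime went up", "crime went down"]
train_y = [1, 0, 1, 0]
test_text = ["jobs went up", "jobs went down"]
test_y = [1, 0]

pipe = Pipeline([("vectorizer", TfidfVectorizer()), ("model", MultinomialNB())])
pipe.fit(train_text, train_y)

# Columns of predict_proba follow the order of the classifier's classes_
classes = list(pipe.named_steps["model"].classes_)
probs = pipe.predict_proba(test_text)
p_true = probs[:, classes.index(1)]  # probability that each statement is true

# Share of held-out statements whose predicted label matches the real one
acc = accuracy_score(test_y, pipe.predict(test_text))
print(classes, p_true, acc)
```

With labels 0 and 1, the first column is the probability of a false statement and the second the probability of a true one, which is how the outputs from `tfidf_pipe` and `rf_pipe` should be read.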