# Predicting False Statements

Cristian Nuno | May 7, 2019

## Question

Can we use the content of a statement to predict how likely it is to be true?

## Dataset
Today we will be using the [LIAR dataset](https://www.cs.ucsb.edu/~william/data/liar_dataset.zip) from 2017, which includes 12.8K human-labeled short statements from [PolitiFact's](https://www.politifact.com/truth-o-meter/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/) [API](https://s3.amazonaws.com/static.politifact.com/api/doc.html).

Each statement was evaluated by a PolitiFact editor for its truthfulness. The table below lists the labels assigned to statements within the LIAR dataset:


| **Label** | **Description** |
| :-------: | :-------------: |
| True (`true`) | The statement is accurate and there’s nothing significant missing. |
| Mostly True (`mostly-true`) | The statement is accurate but needs clarification or additional information. |
| Half True (`half-true`) | The statement is partially accurate but leaves out important details or takes things out of context. |
| Barely True* (`barely-true`) | The statement contains an element of truth but ignores critical facts that would give a different impression. |
| False (`false`) | The statement is not accurate. |
| Pants on Fire (`pants-fire`) | The statement is not accurate and makes a ridiculous claim. |

\* PolitiFact labels such statements 'Mostly False', but the creators of the LIAR dataset relabeled them 'Barely True'.

## Example Statements

### True

> 'The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.' - Robin Vos, R-WI State Assembly Speaker

### Mostly True

> 'Youth unemployment in minority communities is about 40 to 45 percent.' - Peter Kinder, Fmr. Lt. Governor, R-MO

### Half True

> 'Says that voter identification laws keep poor people from voting, minorities from voting, the elderly from voting, students from voting.' - Marcia Fudge, D-OH 11th District Representative

### Barely True

> 'The jobs bill includes President Obamas tax on soup kitchens' - Eric Cantor, Fmr. R-VA 7th District Representative

### False

> 'I dont know who (Jonathan Gruber) is.' - Nancy Pelosi, Speaker of the House, D-CA 12th District Representative

### Pants on Fire

> 'In the case of a catastrophic event, the Atlanta-area offices of the Centers for Disease Control and Prevention will self-destruct.' - The Walking Dead


In [1]:
# load necessary modules
import numpy as np
import pandas as pd
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # improve image quality of inline graphics
import matplotlib.pyplot as plt

Download the `.zip` file and unzip it. By default, it will download into your working directory.

In [2]:
!wget https://www.cs.ucsb.edu/~william/data/liar_dataset.zip

--2019-05-14 15:47:13--  https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
Resolving www.cs.ucsb.edu (www.cs.ucsb.edu)... 23.185.0.3
Connecting to www.cs.ucsb.edu (www.cs.ucsb.edu)|23.185.0.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://sites.cs.ucsb.edu/~william/data/liar_dataset.zip [following]
--2019-05-14 15:47:13--  https://sites.cs.ucsb.edu/~william/data/liar_dataset.zip
Resolving sites.cs.ucsb.edu (sites.cs.ucsb.edu)... 128.111.27.13
Connecting to sites.cs.ucsb.edu (sites.cs.ucsb.edu)|128.111.27.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1013571 (990K) [application/zip]
Saving to: ‘liar_dataset.zip’


2019-05-14 15:47:13 (1.58 MB/s) - ‘liar_dataset.zip’ saved [1013571/1013571]



In [3]:
!unzip liar_dataset.zip

Archive:  liar_dataset.zip
  inflating: README                  
  inflating: test.tsv                
  inflating: train.tsv               
  inflating: valid.tsv               


In [4]:
# store the column names
# note: see the README for more details
column_names = ["id", "label", "statement",
                "subject", "speaker", "speaker_position",
                "state", "party", "barely_true_counts", 
                "false_counts", "half_true_counts", "mostly_true_counts",
                "pants_on_fire_counts", "context"]

# import data
news_df = pd.read_csv("train.tsv", 
                      sep="\t", 
                      header=None,
                      names=column_names)

In [None]:
news_df['label'].unique()

array(['false', 'half-true', 'mostly-true', 'true', 'barely-true',
       'pants-fire'], dtype=object)

In [5]:
news_df.head()

|  | id | label | statement | subject | speaker | speaker_position | state | party | barely_true_counts | false_counts | half_true_counts | mostly_true_counts | pants_on_fire_counts | context |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2635.json | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer |
| 1 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |
| 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |
| 3 | 1123.json | false | Health care reform legislation is likely to ma... | health-care | blog-posting |  |  | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 4 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist |  | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |


## Let's create a flag to separate "true" statements from "false" and "pants-fire" statements

To do this, we'll create our own function, inspired by the [pd.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, to transform the categorical variable `label` into an indicator variable.

The name 'dummies' refers to the information contained in each newly created indicator variable: a 0 or a 1. The value indicates the absence or presence of some categorical effect that may be expected to influence the outcome. Here, 1 marks a statement rated `true` and 0 marks a statement rated `false` or `pants-fire`; statements with any other label are left unflagged so they can be dropped.

In [6]:
def encode_label(label):
    """Encode true as 1, false and pants-fire as 0, everything else as None"""
    if label == "true":
        return 1
    elif label == "pants-fire" or label == "false":
        return 0
    else:
        return None

In [7]:
news_df["label_numeric"] = news_df["label"].apply(encode_label)
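For comparison, here is a minimal sketch of what `pd.get_dummies()` itself would produce, next to a single hand-written flag. The `toy` frame and the `is_true` column are made up for illustration and are not part of the LIAR data. Unlike `encode_label()`, this flag assigns 0 to every label other than `true` instead of leaving it empty.

```python
import pandas as pd

# Toy labels for illustration only (not rows from the LIAR dataset)
toy = pd.DataFrame({"label": ["true", "false", "pants-fire", "half-true"]})

# pd.get_dummies creates one 0/1 column per category, ordered by category name
dummies = pd.get_dummies(toy["label"], dtype=int)
print(dummies)

# A single hand-written flag keeps only the one distinction we care about
toy["is_true"] = (toy["label"] == "true").astype(int)
print(toy)
```

`get_dummies()` yields one column per label, which is more than we need when the goal is a single binary target.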

In [8]:
news_df.head()

|  | id | label | statement | subject | speaker | speaker_position | state | party | barely_true_counts | false_counts | half_true_counts | mostly_true_counts | pants_on_fire_counts | context | label_numeric |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2635.json | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer | 0.0 |
| 1 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |  |
| 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |  |
| 3 | 1123.json | false | Health care reform legislation is likely to ma... | health-care | blog-posting |  |  | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release | 0.0 |
| 4 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist |  | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |  |


In [None]:
data = news_df[['statement', 'label_numeric']].dropna()

In [None]:
data.head()

|  | statement | label_numeric |
| --- | --- | --- |
| 0 | Says the Annies List political group supports ... | 0.0 |
| 3 | Health care reform legislation is likely to ma... | 0.0 |
| 5 | The Chicago Bears have had more starting quart... | 1.0 |
| 12 | When Mitt Romney was governor of Massachusetts... | 0.0 |
| 16 | McCain opposed a requirement that the governme... | 1.0 |
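Before modeling, it can help to check how many rows survive `dropna()` and how the two classes are balanced. The sketch below uses a made-up `toy` frame rather than `news_df`, so the printed shares say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for news_df: None marks labels that encode_label() leaves unflagged
toy = pd.DataFrame({
    "statement": ["a", "b", "c", "d", "e"],
    "label_numeric": [1.0, 0.0, None, 0.0, None],
})

# Same selection step as in the notebook: keep two columns, drop unflagged rows
kept = toy[["statement", "label_numeric"]].dropna()

# Share of each class among the remaining rows
share = kept["label_numeric"].value_counts(normalize=True)
print(len(toy), "->", len(kept), "rows")
print(share)
```

Running the same two lines on `data` would show whether one class dominates, which matters when reading predicted probabilities later on.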


## Now that we have our data, let's build our model using two methods

### TFIDF
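Term frequency–inverse document frequency (TFIDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. In the code below, both pipelines use TFIDF to turn each statement into a numeric feature vector; `tfidf_pipe` passes those features to a Multinomial Naive Bayes classifier.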
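To make the idea concrete, here is a small sketch that fits scikit-learn's `TfidfVectorizer` on three made-up sentences (they are not LIAR statements) and prints the inverse document frequency of each term. A word that appears in every sentence gets the lowest weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "documents" for illustration only
docs = ["taxes went up", "taxes went down", "crime went up"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Map each term to its inverse document frequency
idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}

# Print terms from most common (lowest idf) to rarest (highest idf)
for term in sorted(idf, key=idf.get):
    print(f"{term:>6}  idf={idf[term]:.3f}")
```

Here "went" occurs in all three sentences and receives the smallest idf, while "crime" and "down" occur once each and receive the largest.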

### Random Forest
Random Forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
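One practical detail: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by taking a strict majority vote. The sketch below checks this on synthetic data from `make_classification`; the data and the choice of 25 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect each individual tree's class probabilities for the first 5 rows
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own predict_proba output
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X[:5])))
```

This is why a forest's probabilities often look like multiples of `1 / n_estimators`.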

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# TFIDF features fed to a Multinomial Naive Bayes classifier
tfidf_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                       ('model', MultinomialNB())])

# TFIDF features fed to a Random Forest with 100 trees
rf_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('model', RandomForestClassifier(n_estimators=100))])

# create training and test data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(data['statement'],
                                                    data['label_numeric'],
                                                    test_size=0.2,
                                                    random_state=2019)

In [None]:
tfidf_pipe.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...e,
        vocabulary=None)), ('model', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [None]:
rf_pipe.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [13]:
# Some false statements
false_statements = np.random.choice(X_test[y_test==0], size=5)
false_statements

array(['Says that you cant destroy the guns purchased in a (gun) buyback program as the city (of Memphis) wants to do.',
       'John McCain refuses to support a new bipartisan energy bill "because it would take away tax breaks from oil companies like Exxon Mobil."',
       'I have never voted for a tax increase.',
       'Were only inches away from no longer being a free economy.',
       'Did you know ObamaCare will cost nearly twice as much as initially expected -- $1.8 TRILLION?'],
      dtype=object)

`predict_proba()` outputs a probability matrix of dimension (N, 2). The first column is the probability that a statement belongs to class 0 (i.e. a false statement), and the second column is the probability that it belongs to class 1 (i.e. a true statement).
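The column order follows the fitted model's `classes_` attribute, so it can be checked directly. The sketch below trains a small pipeline on made-up sentences and labels (assumptions for illustration, not LIAR data) and inspects the output shape.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy statements and labels for illustration only (0 = false, 1 = true)
texts = ["taxes went up", "crime went down", "taxes doubled overnight", "jobs went up"]
labels = [0, 1, 0, 1]

pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("model", MultinomialNB())]).fit(texts, labels)

proba = pipe.predict_proba(["taxes went up", "jobs went down"])

# classes_ gives the label that each probability column refers to
print(pipe.named_steps["model"].classes_)
print(proba.shape)        # one row per statement, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```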

In [14]:
tfidf_pipe.predict_proba(false_statements)

array([[0.79564463, 0.20435537],
       [0.75105964, 0.24894036],
       [0.78441905, 0.21558095],
       [0.74259511, 0.25740489],
       [0.79177517, 0.20822483]])

So even though we're only feeding `tfidf_pipe.predict_proba()` 5 false statements, our model isn't perfectly confident about them: it gives each one a probability of roughly 0.74 to 0.80 of being false.

Even when we use the Random Forest classifier, the model performance doesn't improve much.

In [15]:
rf_pipe.predict_proba(false_statements)

array([[0.69      , 0.31      ],
       [0.71      , 0.29      ],
       [0.76666667, 0.23333333],
       [0.72      , 0.28      ],
       [0.71      , 0.29      ]])
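Five hand-picked statements give only a rough impression. A fuller comparison would score both pipelines on the whole held-out set. The sketch below shows one way to do that with `accuracy_score`; the helper `score_pipelines()` and the toy sentences are hypothetical and only illustrate the pattern, so the printed numbers are not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def score_pipelines(pipelines, X_train, y_train, X_test, y_test):
    """Fit each pipeline and return its accuracy on the held-out data."""
    scores = {}
    for name, pipe in pipelines.items():
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


# Toy data for illustration only (0 = false, 1 = true)
train_texts = ["taxes went up", "crime went down", "taxes doubled overnight",
               "jobs went up", "crime doubled overnight", "jobs went down"]
train_labels = [0, 1, 0, 1, 0, 1]
test_texts = ["taxes went down", "jobs doubled"]
test_labels = [0, 1]

pipelines = {
    "naive_bayes": Pipeline([("vectorizer", TfidfVectorizer()),
                             ("model", MultinomialNB())]),
    "random_forest": Pipeline([("vectorizer", TfidfVectorizer()),
                               ("model", RandomForestClassifier(n_estimators=100,
                                                                random_state=2019))]),
}

scores = score_pipelines(pipelines, train_texts, train_labels, test_texts, test_labels)
print(scores)
```

With the notebook's own objects, the same helper could be called with `X_train`, `y_train`, `X_test`, and `y_test`.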

In [None]:
# Some true statements
true_statements = np.random.choice(X_test[y_test==1], size=5)
true_statements

array(['After the shootings of Dallas policemen, nearly 500 people applied in just 12 days.',
       'A congressional laundry closed due to the partial government shutdown.',
       'Says the state budget includes spending on commercials for Fortune 500 companies.',
       'Tells President Barack Obama that he also asked former President George W. Bush about how he felt about Americans hating him.',
       'Hillary Clinton "agreed with (John McCain) on voting for the war in Iraq."'],
      dtype=object)

What becomes clear is that words alone are not enough to predict whether a statement is true. Even for statements that PolitiFact marked as true, the predicted probability of being true is low: in the TFIDF output below, all five fall under 0.5, so the model still leans toward the false class.

In [17]:
tfidf_pipe.predict_proba(true_statements)

array([[0.67753335, 0.32246665],
       [0.5992584 , 0.4007416 ],
       [0.78660939, 0.21339061],
       [0.78675914, 0.21324086],
       [0.75352311, 0.24647689]])

In [None]:
rf_pipe.predict_proba(true_statements)

array([[0.67, 0.33],
       [0.74, 0.26],
       [0.75, 0.25],
       [0.75, 0.25],
       [0.48, 0.52]])

## Summary

Classifying fake news requires more than words. In the future, we should expand the model to include characteristics of the speaker, where the statement was made, and other metadata that would help our model understand the context of the words in each statement.
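One possible way to add that context, sketched here under assumptions, is scikit-learn's `ColumnTransformer`: it applies TFIDF to the `statement` column and one-hot encoding to categorical columns such as `speaker` and `party`, then feeds the combined features to a single classifier. The `toy` frame and the name `context_pipe` are made up for illustration. With the real data, missing values in columns like `speaker_position` or `state` would need handling first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows for illustration only (not taken from the LIAR dataset)
toy = pd.DataFrame({
    "statement": ["taxes went up", "crime went down",
                  "taxes doubled overnight", "jobs went up"],
    "speaker": ["speaker-a", "speaker-b", "speaker-a", "speaker-c"],
    "party": ["republican", "democrat", "republican", "none"],
    "label_numeric": [0, 1, 0, 1],
})

features = ColumnTransformer([
    # A single column name (not a list) passes 1-D text to the vectorizer
    ("text", TfidfVectorizer(), "statement"),
    # Unseen speakers or parties at prediction time are encoded as all zeros
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["speaker", "party"]),
])

context_pipe = Pipeline([
    ("features", features),
    ("model", RandomForestClassifier(n_estimators=100, random_state=2019)),
])

X_toy = toy.drop(columns="label_numeric")
context_pipe.fit(X_toy, toy["label_numeric"])
proba = context_pipe.predict_proba(X_toy.head(2))
print(proba)
```

Whether these extra columns actually improve predictions on LIAR would have to be checked on held-out data.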

In [190]:
rf_pipe.predict_proba(false_statements)

array([[0.58, 0.42],
       [0.48, 0.52],
       [0.58, 0.42],
       [0.76, 0.24],
       [0.83, 0.17]])

In [191]:
# Some true statements
true_statements = np.random.choice(X_test[y_test==1], size=5)
true_statements

array(['The very first meal on the surface of the moon was the Holy Communion.',
       'Under legislation that has cleared the Georgia House, some children who are legal refugees could obtain state scholarships to attend private schools.',
       'We have one of the most expensive General Assemblies, per capita, in the entire country.',
       'Over the past twenty years, the number of homicides committed with a firearm in the United States has decreased by nearly 40 percent. The number of other crimes involving the use of a firearm has also plummeted, declining by nearly 70 percent.',
       'We have one of the most expensive General Assemblies, per capita, in the entire country.'],
      dtype=object)
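The same statement appears twice in the sample above because `np.random.choice()` draws with replacement by default. A minimal sketch of drawing distinct items instead, using a seeded generator and a made-up `pool` array (both are assumptions for illustration):

```python
import numpy as np

# Seeded generator so the draw is repeatable
rng = np.random.default_rng(2019)

# Toy pool standing in for X_test[y_test == 1]
pool = np.array(["statement one", "statement two", "statement three",
                 "statement four", "statement five", "statement six"])

# replace=False guarantees that no item is drawn twice
sample = rng.choice(pool, size=5, replace=False)
print(sample)
```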

In [192]:
tfidf_pipe.predict_proba(true_statements)

array([[0.70321654, 0.29678346],
       [0.62922799, 0.37077201],
       [0.53058161, 0.46941839],
       [0.42896536, 0.57103464],
       [0.53058161, 0.46941839]])

In [193]:
rf_pipe.predict_proba(true_statements)

array([[0.66, 0.34],
       [0.42, 0.58],
       [0.57, 0.43],
       [0.43, 0.57],
       [0.57, 0.43]])