# LIAR DETECTION GROUP PROJECT - Baseline Models  


### CONTENTS  

Imports  
Load ISOT data  
Pre-process ISOT data  
Train/Dev/Test split ISOT data  

##### Baselines (Naive Bayes):  
- ISOT full "text" field  (using CountVectorizer)  
- Verification test by assigning random 0's and 1's to the Dev Labels and re-running  
- Train with ISOT "text"; predict ISOT "title"  
- Read and setup LIAR dataset  
- Using the ISOT "text" model, predict the liar_dev_labels and score the predictions  
- Using the LIAR model, predict the ISOT "text" and score the predictions 
- ISOT "text" field using TfidfVectorizer  
- ISOT "text" field after removing "Reuters" and location from real news  
- ISOT "title" field; print top misclassifications between predicted dev classes and dev labels  
- Using the ISOT "title" model, predict the ISOT "text" classes  
- Using the ISOT "title" model, predict the liar_dev_labels and score the predictions; print top misclassifications  
- Using the LIAR model, predict the ISOT "title" and score the predictions  
- Divide LIAR data into train/dev/test, train a model, and see how well it predicts on its own data type  



    

## This uses the file Abhishek created; dates need the format 2016-01-01 for comparisons, as noted below. Train with 2016 data.

From Abhishek:  
The 'df_alldata3_dates.pkl' file has the parsed dates. It's the unembedded but tokenized ISOT data file, with a column called 'date_parsed'. I left the null dates in there since there are only 10 of them. If you need to, you can assign a date to them like this:
all_data.loc[all_data['date_parsed'].isnull(),'date_parsed'] = max(all_data['date_parsed'])+ pd.DateOffset(days=10)

The code above assigns a date that is 10 days later than the max date in the ISOT dataset.
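
A minimal sketch of that null-date fill on a toy frame. The column name `date_parsed` comes from the note above; the frame `toy` and its values are made up for illustration. It uses `Series.max()`, which skips missing values, rather than the built-in `max`.

```python
import pandas as pd

# Toy stand-in for the ISOT frame (hypothetical values); one date is missing.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date_parsed": pd.to_datetime(["2016-03-20", None, "2017-12-31"]),
})

# Series.max() skips NaT, so this is the latest real date plus 10 days.
fill_value = toy["date_parsed"].max() + pd.DateOffset(days=10)
toy.loc[toy["date_parsed"].isnull(), "date_parsed"] = fill_value

print(toy)
```

Filling with a date past the maximum keeps the affected rows at the end of any date-sorted view, which seems to be the intent of the note.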

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML
from sklearn.utils import shuffle
# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
#import tensorflow as tf

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz
#from ark-tweet-nlp-0.3.2 import 


In [2]:
#### MAY NEED TO RUN THIS CELL TWICE

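def get_data(filename, sep=',', header=0, names=None):
    '''Read a CSV file under DATAPATH into a pandas DataFrame.

    DATAPATH is a module-level string that must be defined before this is called.
    '''
    filepath = DATAPATH + filename
    return pd.read_csv(filepath, header=header, sep=sep, names=names, quotechar='"')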

In [3]:
##
# from sklearn.naive_bayes import BernoulliNB  #requires all features be binary
from sklearn.naive_bayes import MultinomialNB  #appropriate for word count features from CountVectorizer
# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *
#from sklearn.grid_search import GridSearchCV   # THIS HAS BEEN DEPRECATED
from sklearn.model_selection import GridSearchCV
# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

### Load data
Loading the "Fake News" dataset from the Information Security and Object Technology (ISOT) Research Lab at the University of Victoria School of Engineering.

The ISOT Fake News Dataset is a compilation of several thousand fake news and truthful articles, obtained from different legitimate news sites and from sites flagged as unreliable by politifact.com.

In [4]:
isot_data = pd.read_pickle('parsed_data/df_alldata3_dates.pkl')
isot_data.info(memory_usage='deep', verbose=True)
isot_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 10 columns):
title           44898 non-null object
text            44898 non-null object
subject         44898 non-null object
date            44898 non-null object
target          44898 non-null object
title_tokcan    44898 non-null object
title_POS       44898 non-null object
text_tokcan     44898 non-null object
text_POS        44898 non-null object
date_parsed     44888 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(9)
memory usage: 528.1 MB


Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS,date_parsed
0,BRAINIAC Gets Rejected After Trying To Buy BMW...,Does anyone else out there see a future BMW ca...,Government News,"Mar 20, 2016",0,"[brainiac<allcaps>, gets, rejected, after, try...","[N, V, V, P, V, P, V, ^, P, ^, ^, ,, O, V, A, ...","[does, anyone, else, out, there, see, a, futur...","[V, N, R, P, R, V, D, A, ^, N, N, P, D, N, ,, ...",2016-03-20
1,Windows 10 is Stealing Your Bandwidth (You Mig...,21st Century Wire says We ve heard a lot of no...,US_News,"April 7, 2016",0,"[windows, <number>, is, stealing, your, bandwi...","[^, $, V, V, D, N, ,, O, V, V, P, V, O, ,]","[<number>st, century, wire, says, we, ve, hear...","[A, N, ^, V, O, V, V, D, N, P, R, R, A, N, P, ...",2016-04-07
2,STUNNING STORY The Media And Democrats Hid Fro...,"In an email sent on April 15, 2011, our upstan...",left-news,"Mar 2, 2017",0,"[stunning<allcaps>, story<allcaps>, the, media...","[A, N, D, N, &, N, V, P, ^, ,, R, Z, ^, ^, N, ...","[in, an, email, sent, on, april, <number>, ,, ...","[P, D, N, V, P, ^, $, ,, $, ,, D, A, N, A, ^, ...",2017-03-02
3,North Korea's Kim Jong Un fetes nuclear scient...,SEOUL (Reuters) - North Korean leader Kim Jong...,worldnews,"September 10, 2017",1,"[north, korea's, kim, jong, un, fetes, nuclear...","[^, Z, ^, ^, ^, V, A, N, ,, V, N, N]","[seoul<allcaps>, (, reuters, ), -, north, kore...","[^, ,, ^, ,, ,, ^, ^, N, ^, ^, ^, V, D, A, N, ...",2017-09-10
4,White House developing comprehensive biosecuri...,"ASPEN, Colorado (Reuters) - The Trump administ...",politicsNews,"July 20, 2017",1,"[white, house, developing, comprehensive, bios...","[A, N, V, A, N, N, ,, A]","[aspen<allcaps>, ,, colorado, (, reuters, ), -...","[^, ,, ^, ,, ^, ,, ,, D, ^, N, V, V, D, A, A, ...",2017-07-20


In [5]:
print(isot_data.columns)

Index(['title', 'text', 'subject', 'date', 'target', 'title_tokcan',
       'title_POS', 'text_tokcan', 'text_POS', 'date_parsed'],
      dtype='object')


## Sort ISOT data by date

In [6]:
print(isot_data.date_parsed.unique().shape)  #1011
#all_data.date.to_csv('isot_raw_dates.csv', sep=',')

(1011,)


In [7]:
#bad_indices = [68,5399,9496,10428,16193,20207,25095,25969,32881,39580]


In [8]:
#all_data.drop(all_data.index[[68,5399,9496,10428,16193,20207,25095,25969,32881,39580]], inplace=True)
#all_data2 = all_data.drop(all_data.index[[68,5399,9496,10428,16193,20207,25095,25969,32881,39580]])

#all_data2 = all_data.drop([all_data.index[39580]])

'''
all_data2.drop([all_data2.index[32881]], inplace=True)
all_data2.drop([all_data2.index[25969]], inplace=True)
all_data2.drop([all_data2.index[25095]], inplace=True)
all_data2.drop([all_data2.index[20207]], inplace=True)
all_data2.drop([all_data2.index[16193]], inplace=True)
all_data2.drop([all_data2.index[10428]], inplace=True)
all_data2.drop([all_data2.index[9496]], inplace=True)
all_data2.drop([all_data2.index[5399]], inplace=True)
all_data2.drop([all_data2.index[68]], inplace=True)
'''

#all_data2 = all_data

'''
all_data2.drop(39580,  inplace=True)
all_data2.drop(32881, inplace=True)
all_data2.drop(25969, inplace=True)
all_data2.drop(25095, inplace=True)
all_data2.drop(20207, inplace=True)
all_data2.drop(16193, inplace=True)
all_data2.drop(10428, inplace=True)
all_data2.drop(9496, inplace=True)
all_data2.drop(5399, inplace=True)
all_data2.drop(68, inplace=True)
'''

# Delete row at index position 0 & 1
#modDfObj = dfObj.drop([dfObj.index[0] , dfObj.index[1]])



### Here is the correct way to select dates from the date_parsed column.


In [7]:
#print(isot_data.date_parsed)
#print(all_data_sorted[all_data_sorted['date'].str.contains('2015')].shape)
#print(isot_data.date_parsed[isot_data['date_parsed'].str.contains('2016')])

#print(isot_data.date_parsed[isot_data['date_parsed'] > '2016-01-01'])
print(isot_data.date_parsed[isot_data['date_parsed'] < '2016-01-01'])


42      2015-10-31
61      2015-07-25
83      2015-10-20
97      2015-12-20
123     2015-09-26
161     2015-07-14
232     2015-08-07
235     2015-09-19
267     2015-07-24
289     2015-07-27
363     2015-11-16
364     2015-08-22
390     2015-06-12
414     2015-10-10
417     2015-09-16
428     2015-07-17
442     2015-10-07
481     2015-07-12
493     2015-12-08
500     2015-10-28
531     2015-12-23
534     2015-07-07
546     2015-12-08
619     2015-12-23
647     2015-04-04
657     2015-05-21
681     2015-05-21
715     2015-06-08
732     2015-08-18
736     2015-04-15
           ...    
44381   2015-12-03
44405   2015-11-30
44412   2015-10-01
44470   2015-09-18
44492   2015-10-19
44496   2015-12-20
44529   2015-05-30
44531   2015-12-03
44549   2015-12-08
44584   2015-05-30
44591   2015-11-08
44592   2015-10-25
44618   2015-09-12
44638   2015-12-07
44662   2015-06-25
44665   2015-04-16
44668   2015-09-29
44689   2015-10-23
44723   2015-07-23
44724   2015-08-11
44752   2015-08-11
44760   2015
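
A small sketch of the same kind of selection on a toy series, assuming a `datetime64` column like `date_parsed`. The values and the names `dates` and `cutoff` are hypothetical. Comparisons against a missing date (`NaT`) evaluate to False, so rows with null dates fall out of both the "before" and the "on or after" selections and need `isnull()` to be found.

```python
import pandas as pd

# Toy dates (hypothetical); the None entry becomes NaT.
dates = pd.Series(pd.to_datetime(["2015-10-31", "2016-03-20", None, "2017-09-10"]))
cutoff = pd.Timestamp("2016-01-01")

before = dates[dates < cutoff]          # strictly earlier than the cutoff
on_or_after = dates[dates >= cutoff]    # cutoff day and later
missing = dates[dates.isnull()]         # NaT rows match neither mask above

print(len(before), len(on_or_after), len(missing))
```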

In [8]:
all_data2 = isot_data

In [9]:
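# NOTE: the raw 'date' strings use mixed formats; with errors='ignore' the column comes back
# unchanged (still strings, as the output below shows). Later cells rely on .str.contains on it.
all_data2['date'] = pd.to_datetime(all_data2.date, infer_datetime_format = True, errors='ignore')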
all_data2.head(10)

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS,date_parsed
0,BRAINIAC Gets Rejected After Trying To Buy BMW...,Does anyone else out there see a future BMW ca...,Government News,"Mar 20, 2016",0,"[brainiac<allcaps>, gets, rejected, after, try...","[N, V, V, P, V, P, V, ^, P, ^, ^, ,, O, V, A, ...","[does, anyone, else, out, there, see, a, futur...","[V, N, R, P, R, V, D, A, ^, N, N, P, D, N, ,, ...",2016-03-20
1,Windows 10 is Stealing Your Bandwidth (You Mig...,21st Century Wire says We ve heard a lot of no...,US_News,"April 7, 2016",0,"[windows, <number>, is, stealing, your, bandwi...","[^, $, V, V, D, N, ,, O, V, V, P, V, O, ,]","[<number>st, century, wire, says, we, ve, hear...","[A, N, ^, V, O, V, V, D, N, P, R, R, A, N, P, ...",2016-04-07
2,STUNNING STORY The Media And Democrats Hid Fro...,"In an email sent on April 15, 2011, our upstan...",left-news,"Mar 2, 2017",0,"[stunning<allcaps>, story<allcaps>, the, media...","[A, N, D, N, &, N, V, P, ^, ,, R, Z, ^, ^, N, ...","[in, an, email, sent, on, april, <number>, ,, ...","[P, D, N, V, P, ^, $, ,, $, ,, D, A, N, A, ^, ...",2017-03-02
3,North Korea's Kim Jong Un fetes nuclear scient...,SEOUL (Reuters) - North Korean leader Kim Jong...,worldnews,"September 10, 2017",1,"[north, korea's, kim, jong, un, fetes, nuclear...","[^, Z, ^, ^, ^, V, A, N, ,, V, N, N]","[seoul<allcaps>, (, reuters, ), -, north, kore...","[^, ,, ^, ,, ,, ^, ^, N, ^, ^, ^, V, D, A, N, ...",2017-09-10
4,White House developing comprehensive biosecuri...,"ASPEN, Colorado (Reuters) - The Trump administ...",politicsNews,"July 20, 2017",1,"[white, house, developing, comprehensive, bios...","[A, N, V, A, N, N, ,, A]","[aspen<allcaps>, ,, colorado, (, reuters, ), -...","[^, ,, ^, ,, ^, ,, ,, D, ^, N, V, V, D, A, A, ...",2017-07-20
5,LOL! GEORGE LOPEZ Booed Off Stage At Children’...,George Lopez was hired to be the emcee for the...,politics,"Oct 14, 2017",0,"[lol<allcaps>, !, george<allcaps>, lopez<allca...","[!, ,, ^, ^, V, P, N, P, ^, ^, N, P, N, V, D, ...","[george, lopez, was, hired, to, be, the, emcee...","[^, ^, V, V, P, V, D, N, P, D, ^, G, N, N, G, ...",2017-10-14
6,HILLARY CLINTON CRONYISM VIOLATES FEDERAL RULE...,Former Secretary of State Hillary Clinton soug...,politics,"Oct 6, 2016",0,"[hillary<allcaps>, clinton<allcaps>, cronyism<...","[^, ^, N, V, A, N, ,, Z, ,, A, N, ,, V, N, P, ...","[former, secretary, of, state, hillary, clinto...","[A, N, P, ^, ^, ^, V, P, V, ^, &, ^, ^, N, N, ...",2016-10-06
7,Republican Senator Alexander to consult on bip...,WASHINGTON (Reuters) - U.S. Republican Senator...,politicsNews,"September 26, 2017",1,"[republican, senator, alexander, to, consult, ...","[A, N, ^, P, V, P, A, N, N]","[washington<allcaps>, (, reuters, ), -, u.s., ...","[^, ,, ^, ,, ,, ^, ^, ^, ^, ^, V, ^, P, O, V, ...",2017-09-26
8,Kellyanne Conway Announces Trump’s HUGE ‘Than...,Kellyanne Conway accidentally announced exactl...,News,"January 9, 2017",0,"[kellyanne, conway, announces, trump’s, huge<a...","[^, ^, V, Z, A, ,, V, O, ,, N, P, ^, ,, &, L, ...","[kellyanne, conway, accidentally, announced, e...","[^, ^, R, V, R, R, ^, ^, V, P, V, ^, ^, P, D, ...",2017-01-09
9,"Zimbabwe's army seizes power, Mugabe confined ...",HARARE (Reuters) - Zimbabwe s military seized ...,worldnews,"November 15, 2017",1,"["", zimbabwe's, army, seizes, power, ,, mugabe...","[,, Z, N, N, N, ,, ^, V, &, ,, A, ,]","[harare<allcaps>, (, reuters, ), -, zimbabwe, ...","[^, ,, ^, ,, ,, ^, G, A, A, N, P, ^, V, O, V, ...",2017-11-15


In [10]:
#all_data_sorted = all_data.sort_values(by=['date'], ascending=True)
#all_data_sorted.head(10)

all_data_sorted = all_data2.sort_values(by=['date_parsed'], ascending=True)
all_data_sorted.head(10)

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS,date_parsed
32109,WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY...,"In case you missed it Sen. Harry Reid (R-NV), ...",politics,"Mar 31, 2015",0,"[watch<allcaps>, dirty<allcaps>, harry<allcaps...","[V, A, ^, ^, P, D, V, P, Z, N, ,, ,, O, V, V, ...","[in, case, you, missed, it, sen., harry, reid,...","[P, N, O, V, O, ^, ^, ^, ,, V, ,, ,, O, V, A, ...",2015-03-31
1756,APPLE’S CEO SAYS RELIGIOUS FREEDOM LAWS ARE ‘D...,The gay mafia has a new corporate Don. This i...,politics,"Mar 31, 2015",0,"[apple<allcaps>’s, ceo<allcaps>, says<allcaps>...","[Z, N, V, A, N, N, V, ,, A, ,, P, ^, &, V, N, ...","[the, gay, mafia, has, a, new, corporate, don,...","[D, A, N, V, D, A, N, ^, ,, O, V, D, A, N, O, ...",2015-03-31
16251,HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...,The irony here isn t lost on us. Hillary is be...,politics,"Mar 31, 2015",0,"[hillary<allcaps>, rodham<allcaps>, nixon<allc...","[^, ^, ^, ,, D, N, P, A, N, P, D, ^, ^]","[the, irony, here, isn, t, lost, on, us, ., hi...","[D, N, R, G, G, V, P, O, ,, ^, V, V, V, P, D, ...",2015-03-31
29683,FLASHBACK: KING OBAMA COMMUTES SENTENCES OF 22...,Just making room for Hillary President Obama t...,politics,"Mar 31, 2015",0,"[flashback<allcaps>, :, king<allcaps>, obama<a...","[N, ,, ^, ^, V, N, P, $, N, N]","[just, making, room, for, hillary, president, ...","[R, V, N, P, ^, ^, ^, N, V, D, N, P, V, D, N, ...",2015-03-31
6704,HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...,The irony here isn t lost on us. Hillary is be...,left-news,"Mar 31, 2015",0,"[hillary<allcaps>, rodham<allcaps>, nixon<allc...","[^, ^, ^, ,, D, N, P, A, N, P, D, ^, ^]","[the, irony, here, isn, t, lost, on, us, ., hi...","[D, N, R, G, G, V, P, O, ,, ^, V, V, V, P, D, ...",2015-03-31
3475,BENGHAZI PANEL CALLS HILLARY TO TESTIFY UNDER ...,Does anyone really think Hillary Clinton will ...,politics,"Mar 31, 2015",0,"[benghazi<allcaps>, panel<allcaps>, calls<allc...","[^, N, V, ^, P, V, P, N, P, A, N, V, P, A, N, ...","[does, anyone, really, think, hillary, clinton...","[V, N, R, V, ^, ^, V, V, V, P, N, ,, O, V, O, ...",2015-03-31
17824,OH NO! GUESS WHO FUNDED THE SHRINE TO TED KENNEDY,Nothing like political cronyism to make your s...,politics,"Mar 31, 2015",0,"[oh<allcaps>, no<allcaps>, !, guess<allcaps>, ...","[!, !, ,, V, O, V, D, N, P, ^, ^]","[nothing, like, political, cronyism, to, make,...","[N, P, A, N, P, V, D, N, V, ,, R, P, O, V, P, ...",2015-03-31
21447,WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY...,"In case you missed it Sen. Harry Reid (R-NV), ...",left-news,"Mar 31, 2015",0,"[watch<allcaps>, dirty<allcaps>, harry<allcaps...","[V, A, ^, ^, P, D, V, P, Z, N, ,, ,, O, V, V, ...","[in, case, you, missed, it, sen., harry, reid,...","[P, N, O, V, O, ^, ^, ^, ,, V, ,, ,, O, V, A, ...",2015-03-31
23391,“Non-violence hasn’t worked”…Reverend Sam Most...,Yeah that whole taking up arms thing seems t...,left-news,"Apr 1, 2015",0,"[“, non-violence, hasn’t, worked, ”, …, revere...","[,, N, V, V, ,, ,, ^, ^, ^, ,, N, P, N, V, P, ...","[yeah, that, whole, taking, up, arms, thing, s...","[!, D, A, V, T, N, N, V, P, V, V, R, P, ^, D, ...",2015-04-01
12524,MUSLIM WOMAN ARRESTED FOR SPITTING ON HER FELL...,This woman s having trouble entering the Walma...,politics,"Apr 1, 2015",0,"[muslim<allcaps>, woman<allcaps>, arrested<all...","[^, N, V, P, V, P, D, N, A, ^, N, &, O, V, R, ...","[this, woman, s, having, trouble, entering, th...","[D, N, G, V, N, V, D, ^, P, N, V, P, V, O, O, ...",2015-04-01


In [11]:
#print(all_data_sorted[all_data_sorted.date < 2016])

#all_data_sorted['date'] = all_data_sorted['date'].strftime('%Y-%m-%d')


print(all_data_sorted[all_data_sorted['date'].str.contains('2015')].shape) # (2485, 10)
print(all_data_sorted[all_data_sorted['date'].str.contains('2016')].shape) # (16470, 10)
print(all_data_sorted[all_data_sorted['date'].str.contains('2017')].shape) # (25904, 10)
print(all_data_sorted[all_data_sorted['date'].str.contains('2018')].shape) # (0, 10)
print(all_data_sorted[all_data_sorted['date'].str.contains('-18')].shape) # (35, 10)


print(all_data_sorted[all_data_sorted['date_parsed'] < '2017-01-01'].shape) # (18949, 10)
print(all_data_sorted[all_data_sorted['date_parsed'] >= '2017-01-01'].shape) # (25939, 10)


(2485, 10)
(16470, 10)
(25904, 10)
(0, 10)
(35, 10)
(18949, 10)
(25939, 10)
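
The substring counts above depend on how the raw `date` string is written: the `'2018'` pattern matches nothing, while `'-18'` matches 35 rows. One possible alternative, sketched here on a toy frame, is to read the year from `date_parsed` instead. The rows below, including the `19-Feb-18` style value, are assumptions for illustration and not taken from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical) mixing long-form dates with a two-digit-year format.
toy = pd.DataFrame({
    "date": ["Mar 31, 2015", "January 1, 2016", "Oct 6, 2016", "19-Feb-18"],
    "date_parsed": pd.to_datetime(["2015-03-31", "2016-01-01", "2016-10-06", "2018-02-19"]),
})

# Substring matching misses the two-digit-year row.
by_string = toy["date"].str.contains("2018").sum()

# The parsed column gives the year regardless of the raw format.
by_year = toy["date_parsed"].dt.year.value_counts().sort_index()

print("rows matching '2018' in the raw string:", by_string)
print(by_year)
```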


In [12]:
#print(all_data_sorted.date[all_data_sorted['date'].str.contains('2016')])
#print(all_data_sorted.date[all_data_sorted['date'].str.contains('2017')])

In [13]:
all_data_2016 = all_data_sorted[all_data_sorted['date'].str.contains('2016')]
all_data_2017 = all_data_sorted[all_data_sorted['date'].str.contains('2017')]

print(all_data_2016.shape)
print(all_data_2017.shape)

all_data_2017.head()

(16470, 10)
(25904, 10)


Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS,date_parsed
6934,"Trump leaves open possible Taiwan meet, questi...","PALM BEACH, Fla. (Reuters) - U.S. President-el...",politicsNews,"January 1, 2017",1,"[trump, leaves, open, possible, taiwan, meet, ...","[^, V, V, A, ^, V, ,, N, ^, V]","[palm<allcaps>, beach<allcaps>, ,, fla, ., (, ...","[N, N, ,, ^, ,, ,, ^, ,, ,, ^, ^, ^, ^, P, ^, ...",2017-01-01
10566,CHECK OUT TRUMP’S HILARIOUS New Years Eve Twee...,"Whether they like it or not, Trump continues t...",left-news,"Jan 1, 2017",0,"[check<allcaps>, out<allcaps>, trump<allcaps>’...","[V, T, Z, A, A, N, N, V, P, D, ,, A, N, ,]","[whether, they, like, it, or, not, ,, trump, c...","[P, O, V, O, &, R, ,, ^, V, P, V, D, N, P, O, ...",2017-01-01
23501,ARE ANGRY LEFTISTS Planning Violent Communist ...,Author Ed Klein told Pete Hegseth on Fox and F...,left-news,"Jan 1, 2017",0,"[are<allcaps>, angry<allcaps>, leftists<allcap...","[V, A, N, V, A, N, N, ,, ,, O, V, D, N, P, ,, ...","[author, ed, klein, told, pete, hegseth, on, f...","[N, ^, ^, V, ^, ^, P, ^, &, N, N, P, ^, ^, V, ...",2017-01-01
37283,RUSSIA’S STRANGE 45-ACRE MARYLAND COMPOUND Clo...,WASHINGTON It sits on the Corsica River alon...,Government News,"Jan 1, 2017",0,"[russia<allcaps>’s, strange<allcaps>, <number>...","[Z, A, $, ^, N, V, R, P, V, N]","[washington<allcaps>, it, sits, on, the, corsi...","[^, O, V, P, D, ^, ^, P, D, ^, N, ,, D, A, ^, ...",2017-01-01
33664,DISGUSTING! USA TODAY Video Suggests “Trump Er...,https://www.youtube.com/watch?v=8dsDdBqF828,left-news,"Jan 1, 2017",0,"[disgusting<allcaps>, !, usa<allcaps>, today<a...","[A, ,, ^, N, N, V, ,, ^, ^, ,, V, V, V, A, ,, ...",[<url>],[U],2017-01-01


## Train / Dev split of ISOT data based on dates

In [14]:
#train/dev/test split
#train_dev_split = 0.8

'''
train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

if (train_fract+dev_fract+test_fract) == 1.0:
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

#train_data = all_data[:int(len(all_data)*train_dev_split)].reset_index(drop=True)
#dev_data = all_data[int(len(all_data)*train_dev_split):].reset_index(drop=True)
'''

train_set = all_data_2016.reset_index(drop=True)
dev_set = all_data_2017.reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)


training set:  (16470, 10)
dev set:  (25904, 10)
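
A hedged sketch of a date-based split like the one above, written as a helper that reads the year from `date_parsed` rather than from the raw `date` string. The function `split_by_year` and the toy frame are hypothetical and not part of the notebook; the notebook itself selects rows with `str.contains`.

```python
import pandas as pd

def split_by_year(df, train_year, dev_year, date_col="date_parsed"):
    """Return (train, dev) frames for the given years.

    Rows whose date is missing match neither year and are left out of both frames.
    """
    years = df[date_col].dt.year
    train = df[years == train_year].reset_index(drop=True)
    dev = df[years == dev_year].reset_index(drop=True)
    return train, dev

# Toy frame (hypothetical values); the last row has no date.
toy = pd.DataFrame({
    "text": ["t1", "t2", "t3", "t4"],
    "target": ["0", "1", "0", "1"],
    "date_parsed": pd.to_datetime(["2016-01-01", "2016-10-06", "2017-01-01", None]),
})

train_toy, dev_toy = split_by_year(toy, 2016, 2017)
print("training set:", train_toy.shape)
print("dev set:", dev_toy.shape)
```

Training on the earlier year and evaluating on the later one keeps the dev set strictly in the future relative to the training data.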


In [15]:
train_set.head(5)

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS,date_parsed
0,Here’s What The Fracking Industry Gave To Okl...,The fracking industry has given the state of O...,News,"January 1, 2016",0,"[here’s, what, the, fracking, industry, gave, ...","[L, O, D, A, N, V, P, ^, P, $, ,, N, ,]","[the, fracking, industry, has, given, the, sta...","[D, A, N, V, V, D, N, P, ^, D, A, N, P, N, P, ...",2016-01-01
1,IS A REVOLUTION COMING? Former Congressman Thr...,Spoken like a true American Former Congressman...,left-news,"Jan 1, 2016",0,"[is<allcaps>, a, revolution<allcaps>, coming<a...","[V, D, N, V, ,, A, N, V, A, N, P, Z, N, N, V, ...","[spoken, like, a, true, american, former, cong...","[V, P, D, A, ^, ^, ^, ^, ^, V, P, ^, ,, P, ^, ...",2016-01-01
2,SHAMELESS! Hillary Clinton Throws Benghazi Fam...,Hillary Clinton s pointing the finger at the v...,politics,"Jan 1, 2016",0,"[shameless<allcaps>, !, hillary, clinton, thro...","[^, ,, ^, ^, V, ^, N, P, D, N]","[hillary, clinton, s, pointing, the, finger, a...","[^, ^, G, V, D, N, P, D, N, P, D, ^, V, D, N, ...",2016-01-01
3,No Kidding! Senator John McCain And Hillary We...,Who even admits to doing this? I have to say t...,politics,"Jan 1, 2016",0,"[no, kidding, !, senator, john, mccain, and, h...","[D, V, ,, ^, ^, ^, &, ^, V, N, P, N, P, D, N, ...","["", who, even, admits, to, doing, this, ?, i, ...","[,, O, R, V, P, V, O, ,, O, V, P, V, P, O, R, ...",2016-01-01
4,Ben Carson Campaign In Shambles After Top Aid...,It s not looking like much of a Happy New for ...,News,"January 1, 2016",0,"[ben, carson, campaign, in, shambles, after, t...","[^, ^, N, P, N, P, A, N, ,, $, N, V]","[it, s, not, looking, like, much, of, a, happy...","[O, V, R, V, R, A, P, D, A, A, P, A, A, N, ^, ...",2016-01-01


In [16]:
dev_set.head(5)

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS,date_parsed
0,"Trump leaves open possible Taiwan meet, questi...","PALM BEACH, Fla. (Reuters) - U.S. President-el...",politicsNews,"January 1, 2017",1,"[trump, leaves, open, possible, taiwan, meet, ...","[^, V, V, A, ^, V, ,, N, ^, V]","[palm<allcaps>, beach<allcaps>, ,, fla, ., (, ...","[N, N, ,, ^, ,, ,, ^, ,, ,, ^, ^, ^, ^, P, ^, ...",2017-01-01
1,CHECK OUT TRUMP’S HILARIOUS New Years Eve Twee...,"Whether they like it or not, Trump continues t...",left-news,"Jan 1, 2017",0,"[check<allcaps>, out<allcaps>, trump<allcaps>’...","[V, T, Z, A, A, N, N, V, P, D, ,, A, N, ,]","[whether, they, like, it, or, not, ,, trump, c...","[P, O, V, O, &, R, ,, ^, V, P, V, D, N, P, O, ...",2017-01-01
2,ARE ANGRY LEFTISTS Planning Violent Communist ...,Author Ed Klein told Pete Hegseth on Fox and F...,left-news,"Jan 1, 2017",0,"[are<allcaps>, angry<allcaps>, leftists<allcap...","[V, A, N, V, A, N, N, ,, ,, O, V, D, N, P, ,, ...","[author, ed, klein, told, pete, hegseth, on, f...","[N, ^, ^, V, ^, ^, P, ^, &, N, N, P, ^, ^, V, ...",2017-01-01
3,RUSSIA’S STRANGE 45-ACRE MARYLAND COMPOUND Clo...,WASHINGTON It sits on the Corsica River alon...,Government News,"Jan 1, 2017",0,"[russia<allcaps>’s, strange<allcaps>, <number>...","[Z, A, $, ^, N, V, R, P, V, N]","[washington<allcaps>, it, sits, on, the, corsi...","[^, O, V, P, D, ^, ^, P, D, ^, N, ,, D, A, ^, ...",2017-01-01
4,DISGUSTING! USA TODAY Video Suggests “Trump Er...,https://www.youtube.com/watch?v=8dsDdBqF828,left-news,"Jan 1, 2017",0,"[disgusting<allcaps>, !, usa<allcaps>, today<a...","[A, ,, ^, N, N, V, ,, ^, ^, ,, V, V, V, A, ,, ...",[<url>],[U],2017-01-01


In [26]:
# print out ISOT dev set
#dev_set.to_csv('isot_dev_set.csv', sep=',')

## Baseline Model: Naive Bayes Classifier

### Classify full text

In [17]:
##
# from sklearn.naive_bayes import BernoulliNB  #requires all features be binary
from sklearn.naive_bayes import MultinomialNB  #appropriate for word count features from CountVectorizer
# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *
#from sklearn.grid_search import GridSearchCV   # THIS HAS BEEN DEPRECATED
from sklearn.model_selection import GridSearchCV
# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score


train_data, train_labels = train_set.text.values, train_set.target.values
dev_data, dev_labels = dev_set.text.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)
print(type(train_labels[0]))
#train_labels.head()
#dev_data.head()
#dev_labels.head()


train_data shape: (16470,)

train_labels shape: (16470,)
[0 0 0 ... 0 0 0]
<class 'numpy.int64'>


In [18]:
print('ISOT train target=real:', len(train_labels[train_labels == 1]))
print('ISOT train target=fake:', len(train_labels[train_labels == 0]))

ISOT train target=real: 4716
ISOT train target=fake: 11754
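
The training set is imbalanced, so it helps to know what share the majority class holds before reading any accuracy figure. The sketch below uses the two counts printed above; everything else (variable names, the use of `Counter`) is illustrative.

```python
from collections import Counter

# Class counts printed by the cell above (0 = fake, 1 = real).
train_counts = Counter({0: 11754, 1: 4716})

total = sum(train_counts.values())
majority_label, majority_count = train_counts.most_common(1)[0]
majority_share = majority_count / total

print("majority label:", majority_label)
print("majority share of training set: %.3f" % majority_share)
```

Fake articles make up roughly 71% of the 2016 training set. The class mix of the 2017 dev set is not shown here, so this share should not be read as the dev-set baseline.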


In [19]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X

<16470x70178 sparse matrix of type '<class 'numpy.int64'>'
	with 3549231 stored elements in Compressed Sparse Row format>

In [20]:
#print(X[0])

In [21]:
print('X.shape:', X.shape)  # (documents, features): one row per document, one column per unique vocabulary word
print('Vocabulary size (number of features or columns):', X.shape[1])
print('Non-zero elements in matrix (X.nnz):', X.nnz)  # number of stored non-zero entries
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / documents
print('Fraction of non-zero elements in matrix: %.4f' %( X.nnz/(X.shape[0] * X.shape[1])) )  # X.nnz / (rows * columns)


# What are the 0th and last feature strings (in alphabetical order)?
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
feature_names = vectorizer.get_feature_names()
print('0th feature string:', feature_names[0])
print('last feature string:', feature_names[-1])

X.shape: (16470, 70178)
Vocabulary size (number of features or columns): 70178
Non-zero elements in matrix (X.nnz): 3549231
Average number of non-zero features per example (per document): 215.497
Fraction of non-zero elements in matrix: 0.0031
0th feature string: 00
last feature string: zzzzzzzz
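
To make the sparsity figures above concrete, here is the same arithmetic on a three-document toy corpus. The documents are made up; the calls (`CountVectorizer`, `fit_transform`, `nnz`, `vocabulary_`) are the same ones the notebook relies on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical). The default tokenizer keeps tokens of 2+ word characters.
docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)

n_docs, n_features = X_toy.shape
print("shape:", X_toy.shape)
print("vocabulary:", sorted(vec.vocabulary_))
print("non-zero entries:", X_toy.nnz)
print("avg non-zero features per doc: %.3f" % (X_toy.nnz / n_docs))
print("fraction non-zero: %.4f" % (X_toy.nnz / (n_docs * n_features)))
```

A repeated word ("the" in the second document) raises the stored count but still occupies a single non-zero cell, which is why `nnz` counts distinct words per document rather than tokens.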


In [24]:
# Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? 
vectorizer_dev = CountVectorizer()
X_dev = vectorizer_dev.fit_transform(dev_data)  # Independently build a vocabulary using dev_data.
print('Vocabulary using train data:', X.shape[1])
print('X_dev.shape:', X_dev.shape)# 
print('Vocabulary using dev data:', X_dev.shape[1])                # 

# Feed dev_data into the vectorizer fit using training data.
X_dev_transformed = vectorizer.transform(dev_data)
print('X_dev_transformed shape:', X_dev_transformed.shape)
# Expect X_dev_transformed.shape[1] to equal the number of training features (70178 here).
#print('non-zero indices in X_dev_transformed:', X_dev_transformed.nonzero())  # could also use this to check which features missing...

''' This is way too slow; use set intersection instead!!
# Look at each feature (vocabulary word) in X_dev and see if it is a feature in X.
count = 0
for i in range(X_dev.shape[1]):
    if vectorizer_dev.get_feature_names()[i] in vectorizer.get_feature_names():
        count += 1
print('Count of words (features) in X_dev also in X:', count)   
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - count)/X_dev.shape[1]) )
count of words (features) in X_dev also in X: 12219
Fraction of words in dev data missing from training vocabulary: 0.248
'''

set1 = set(vectorizer_dev.get_feature_names())   # dev vocabulary
set2 = set(vectorizer.get_feature_names())       # train vocabulary
n_shared = len(set1.intersection(set2))          # compute the intersection once
print('Count of words (features) in X_dev also in X:', n_shared)
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - n_shared)/X_dev.shape[1]) )

Vocabulary using train data: 70178
X_dev.shape: (25904, 90229)
Vocabulary using dev data: 90229
X_dev_transformed shape: (25904, 70178)
Count of words (features) in X_dev also in X: 42089
Fraction of words in dev data missing from training vocabulary: 0.534
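
The missing-vocabulary fraction above is a set computation over word types, which can be sketched without scikit-learn. The regex below has the same shape as `CountVectorizer`'s default token pattern; the helper `vocab` and the example sentences are hypothetical.

```python
import re

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def vocab(texts):
    """Return the set of lowercased tokens found in an iterable of strings."""
    return {tok for text in texts for tok in TOKEN.findall(text.lower())}

train_vocab = vocab(["Senate passes budget bill", "Budget talks stall"])
dev_vocab = vocab(["Senate debates tariff bill", "Tariff talks resume"])

shared = dev_vocab & train_vocab      # dev words the training vocabulary knows
missing = dev_vocab - train_vocab     # dev words it has never seen
missing_fraction = len(missing) / len(dev_vocab)

print("shared:", sorted(shared))
print("missing:", sorted(missing))
print("fraction of dev vocabulary missing from train: %.3f" % missing_fraction)
```

This measures distinct words, not occurrences: a rare word seen once in the dev set counts as much as a frequent one, so the fraction can look large even when most dev tokens are covered.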


In [23]:
# MultinomialNB
print('\nMultinomialNB')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB
accuracy: 0.93


In [24]:

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))

y_pred = clf.predict(X_dev_transformed)

acc = accuracy_score(dev_labels, y_pred)
print("Accuracy on dev set: {:.02%}".format(acc))


accuracy: 0.93
Accuracy on dev set: 93.33%
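
Because the classes are imbalanced, a single accuracy number can hide uneven performance between fake and real articles. The sketch below shows, on made-up labels, how the `confusion_matrix` already imported in this notebook breaks accuracy down per class. The arrays `y_true` and `y_pred` are hypothetical and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical): 0 = fake, 1 = real.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)               # rows: true class, columns: predicted class
recall_per_class = cm.diagonal() / cm.sum(axis=1)   # fraction of each true class predicted correctly
acc = accuracy_score(y_true, y_pred)

print(cm)
print("accuracy: %.3f" % acc)
print("recall (fake, real):", recall_per_class)
```

The same two calls could be applied to `dev_labels` and `y_pred` from the cell above to see whether the 93% figure is shared evenly across both classes.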


#### ONE OFF: Use CountVectorizer to determine ISOT "text" vocabulary sizes for 2016 and 2017 sets, for both True and Fake news

In [35]:
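# 2016: Use train_set
# Note: 'target' is an object (string) column in this frame, hence the comparison with '1' and '0'.
train_set_2016_true = train_set[train_set.target=='1']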
#train_set_2016_true.head()
vectorizer_2016_true = CountVectorizer()
X_2016_true = vectorizer_2016_true.fit_transform(train_set_2016_true.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2016 true text:', X_2016_true.shape[1])

train_set_2016_false = train_set[train_set.target=='0']
#train_set_2016_false.head()
vectorizer_2016_false = CountVectorizer()
X_2016_false = vectorizer_2016_false.fit_transform(train_set_2016_false.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2016 false text:', X_2016_false.shape[1])


# 2017: Use dev_set
dev_set_2017_true = dev_set[dev_set.target=='1']
#dev_set_2017_true.head()
vectorizer_2017_true = CountVectorizer()
X_2017_true = vectorizer_2017_true.fit_transform(dev_set_2017_true.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2017 true text:', X_2017_true.shape[1])

dev_set_2017_false = dev_set[dev_set.target=='0']
#dev_set_2017_false.head()
vectorizer_2017_false = CountVectorizer()
X_2017_false = vectorizer_2017_false.fit_transform(dev_set_2017_false.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2017 false text:', X_2017_false.shape[1])


In [36]:
vectorizer_2016_true = CountVectorizer()
X_2016_true = vectorizer_2016_true.fit_transform(train_set_2016_true.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2016 true text:', X_2016_true.shape[1])

Vocabulary using ISOT 2016 true text: 33431


In [40]:
vectorizer_2016_false = CountVectorizer()
X_2016_false = vectorizer_2016_false.fit_transform(train_set_2016_false.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2016 false text:', X_2016_false.shape[1])

Vocabulary using ISOT 2016 false text: 62384


In [41]:
vectorizer_2017_true = CountVectorizer()
X_2017_true = vectorizer_2017_true.fit_transform(dev_set_2017_true.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2017 true text:', X_2017_true.shape[1])

Vocabulary using ISOT 2017 true text: 59575


In [42]:
vectorizer_2017_false = CountVectorizer()
X_2017_false = vectorizer_2017_false.fit_transform(dev_set_2017_false.text.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2017 false text:', X_2017_false.shape[1])

Vocabulary using ISOT 2017 false text: 60498
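
The four vocabulary sizes above are built independently, so they do not show how many words the subsets share. Below is a minimal sketch of one way to compare them, assuming the keys of each fitted vectorizer's `vocabulary_` are converted to a set. The toy vocabularies and the `jaccard` helper are hypothetical stand-ins, not notebook output.

```python
def jaccard(vocab_a, vocab_b):
    """Share of words present in both vocabularies, relative to their union."""
    a, b = set(vocab_a), set(vocab_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy stand-ins for e.g. set(vectorizer_2016_true.vocabulary_) and
# set(vectorizer_2016_false.vocabulary_); the words are made up for illustration.
vocab_true = {"senate", "court", "reuters", "tuesday", "trump"}
vocab_false = {"trump", "you", "video", "court", "hillary"}

shared = vocab_true & vocab_false
only_true = vocab_true - vocab_false
overlap = jaccard(vocab_true, vocab_false)

print("shared:", sorted(shared))
print("only in true:", sorted(only_true))
print("jaccard: %.3f" % overlap)
```

Words that appear in only one class (or only one year) are candidates for leakage, such as source tags, so they may be worth inspecting before trusting a high accuracy.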


In [25]:
print('predict proba:', clf.predict_proba(X).shape)
print('predict proba example:', clf.predict_proba(X[0]))

predict proba: (16470, 2)
predict proba example: [[0.0232333 0.9767667]]


In [26]:
print('feature_log_prob_ shape:', clf.feature_log_prob_.shape)
print('feature_log_prob_ example:', clf.feature_log_prob_[0][0])

feature_log_prob_ shape: (2, 70178)
feature_log_prob_ example: -10.01371842130451


In [27]:
feature_names = vectorizer.get_feature_names()
print(feature_names[:20])

['00', '000', '0000', '00000017', '00004', '000048', '0009', '000938', '000a', '000after', '000although', '000american', '000california', '000cylvia', '000dillon000', '000florida', '000georgia', '000have', '000illegal', '000illinois']


In [39]:
## Print out most probable words

cols = ['Label', 'WORD', 'feature_log_prob_ when fake', 'feature_log_prob_ when true']
row_list = []
stoppish_words = ['his', 'about', 'we', 'on', 'one', 'they', 'be', 'he', 'an', 'who', 'as', 'but', 'are', 'with', 'have', 'not', 'from', 'of', 'the', 'to', 'by', 'for', 'has', 'that', 'in', 'it', 'this', 'is', 'was', 'at', 'would', 'and', 'said']

feature_names = vectorizer.get_feature_names()   # look up once, not on every iteration
n = 100                                          # top n words per class

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake prob', 'True prob'))
    top_indices = np.argsort(clf.feature_log_prob_[i,:])[::-1][:n]   # highest log-probabilities first
    for feature_index in top_indices:
        word = feature_names[feature_index]
        fake_lp = clf.feature_log_prob_[0,feature_index]
        true_lp = clf.feature_log_prob_[1,feature_index]
        print('%19s %12.3f %12.3f' %(word, fake_lp, true_lp))

        if word not in stoppish_words:  # exclude stoppish words
            row_list.append({'Label': i, 'WORD': word, 'feature_log_prob_ when fake': fake_lp, 'feature_log_prob_ when true': true_lp})
    print()

feature_log_prob_df = pd.DataFrame(row_list, columns=cols)



MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD    Fake prob    True prob
                the       -2.916       -2.888
                 to       -3.516       -3.516
                 of       -3.727       -3.713
                and       -3.762       -3.855
                 in       -4.047       -3.833
               that       -4.149       -4.493
                 is       -4.455       -4.972
                for       -4.650       -4.548
                 it       -4.749       -5.186
                 he       -4.788       -4.830
                 on       -4.809       -4.389
              trump       -4.851       -4.591
                was       -5.035       -5.207
               with       -5.064       -5.066
                 as       -5.101       -5.167
               this       -5.116       -5.931
                his       -5.128       -5.142
                 be       -5.235       -5.465
               they       -5.308       -5.950
                are       -5.312 

              white       -6.736       -6.551
                out       -5.969       -6.564
               told       -7.089       -6.567
              court       -7.453       -6.578
               last       -7.143       -6.580
             senate       -8.096       -6.610
               when       -6.073       -6.613
        republicans       -7.027       -6.659
                 no       -6.225       -6.669
                him       -6.276       -6.686
                two       -7.029       -6.687
               york       -7.798       -6.691
              there       -6.163       -6.704
         government       -6.987       -6.713
          candidate       -7.343       -6.730
           national       -7.328       -6.748
            against       -6.797       -6.753
            senator       -8.121       -6.754
            tuesday       -8.345       -6.763
              first       -6.944       -6.775
                law       -7.124       -6.779
               what       -5.934  

In [40]:
print(feature_log_prob_df.size)
print(len(feature_log_prob_df))
feature_log_prob_df.head()

536
134


Unnamed: 0,Label,WORD,feature_log_prob_ when fake,feature_log_prob_ when true
0,0,trump,-4.850555,-4.591216
1,0,you,-5.456274,-6.907848
2,0,their,-5.753203,-6.238394
3,0,people,-5.836045,-6.414343
4,0,her,-5.847883,-6.241646


In [41]:
feature_log_prob_df.to_csv('nb_isot_text_2016_feature_log_prob_df.csv', sep=',')

In [42]:
prob_diff = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]

for j in range(50):   # 50 largest (real - fake) log-probability differences
    index = -1 - j
    feat_index = np.argsort(prob_diff[:])[index]
    #print('%19s %12.3f %12.3f' %(vectorizer.get_feature_names()[feat_index], clf.feature_log_prob_[0,feat_index], clf.feature_log_prob_[1,feat_index]))
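
The cell above computes the per-word difference in log-probability between the two classes but does not print anything. Below is a minimal sketch of how that difference can be used to rank words, written in plain Python so it runs without the fitted model. The word list and the log-probabilities are hypothetical toy values, not taken from `clf`.

```python
# Hypothetical toy values standing in for vectorizer feature names and for
# clf.feature_log_prob_[0] (fake) and clf.feature_log_prob_[1] (real).
feature_names = ["senate", "you", "the", "tuesday", "video"]
log_prob_fake = [-8.1, -5.5, -2.9, -8.3, -6.0]
log_prob_real = [-6.6, -6.9, -2.9, -6.7, -9.0]

# Positive difference: the word is relatively more likely under "real".
# Negative difference: the word is relatively more likely under "fake".
log_ratio = [real - fake for fake, real in zip(log_prob_fake, log_prob_real)]

# Sort once (ascending), then read both ends of the ranking.
ranked = sorted(zip(log_ratio, feature_names))
top_n = 2
most_fake = [word for _, word in ranked[:top_n]]
most_real = [word for _, word in ranked[::-1][:top_n]]

print("most indicative of fake:", most_fake)
print("most indicative of real:", most_real)
```

Unlike the "most common words" tables, this ranking is not dominated by stop words, because a word such as "the" has nearly the same log-probability in both classes and lands near zero.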


In [43]:
print('clf.feature_count_ :', clf.feature_count_.shape)

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake count', 'Real count'))
    for j in range(100):   # top x most frequent words for each class
        index = -1 - j
        feature_index = np.argsort(clf.feature_count_[i,:])[index]
        print('%19s %12d %12d' %(vectorizer.get_feature_names()[feature_index], clf.feature_count_[0,feature_index], clf.feature_count_[1,feature_index]))
    print()
    
    
#print('Real_News clf.feature_count_ :', np.sort(clf.feature_count_[1,:]))
#print('Real_News clf.feature_count indices :', np.argsort(clf.feature_count_[1,:]))
##print('Real_News clf.feature_count words :', vectorizer.get_feature_names()[np.argsort(clf.feature_count_[1,:])])
print()
#print('Fake_News clf.feature_count_ :', clf.feature_count_[0,:])

clf.feature_count_ : (2, 70178)

MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD   Fake count   Real count
                the       259974       114115
                 to       142621        60938
                 of       115524        50049
                and       111599        43402
                 in        83879        44366
               that        75741        22930
                 is        55777        14208
                for        45893        21705
                 it        41593        11464
                 he        40006        16370
                 on        39154        25460
              trump        37563        20789
                was        31243        11226
               with        30340        12937
                 as        29246        11689
               this        28809         5442
                his        28476        11984
                 be        25574         8676
               they        23772         5341
 

              white         5699         2928
                out        12271         2891
               told         4006         2883
              court         2783         2851
               last         3794         2844
             senate         1462         2760
               when        11065         2751
        republicans         4262         2628
                 no         9499         2603
                him         9032         2559
                two         4250         2555
               york         1970         2545
              there        10106         2512
         government         4433         2490
          candidate         3105         2448
           national         3152         2404
            against         5360         2393
            senator         1426         2390
            tuesday         1140         2368
              first         4627         2341
                law         3865         2332
               what        12706  

#### Use TfidfVectorizer and compare to CountVectorizer

In [58]:
'''
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little 
meaningful information about the actual contents of the document. If we were to feed the direct count data directly to 
a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very 
common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: 
tf-idf(t,d) = tf(t,d) * idf(t).
'''

t_vectorizer = TfidfVectorizer()
t_X = t_vectorizer.fit_transform(train_data)   
#print(t_X.shape)
t_X_dev = t_vectorizer.transform(dev_data)
#print(t_X_dev.shape)


# MultinomialNB
#The multinomial distribution normally requires integer feature counts. 
#However, in practice, fractional counts such as tf-idf may also work.
print('\nMultinomialNB with TfidfVectorizer')
alpha = 1.0
t_clf = MultinomialNB(alpha=alpha)
t_clf.fit(t_X, train_labels)

print('accuracy: %3.3f' %t_clf.score(t_X_dev, dev_labels))

t_dev_predicted_labels = t_clf.predict(t_X_dev)  # "predict" and report accuracy using dev set
#print(t_dev_predicted_labels.shape)

print('\nf1 score of dev predicted labels:', metrics.f1_score(dev_labels, t_dev_predicted_labels, average='weighted'))
print('classification report of dev predicted labels: \n', classification_report(dev_labels, t_dev_predicted_labels))
print()



MultinomialNB with TfidfVectorizer
accuracy: 0.642

f1 score of dev predicted labels: 0.6339414983696479
classification report of dev predicted labels: 
               precision    recall  f1-score   support

           0       0.50      1.00      0.66      9203
           1       1.00      0.45      0.62     16701

   micro avg       0.64      0.64      0.64     25904
   macro avg       0.75      0.72      0.64     25904
weighted avg       0.82      0.64      0.63     25904
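
To make the tf-idf weighting described in the docstring concrete, here is a small worked sketch in plain Python. It uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default in scikit-learn's `TfidfVectorizer` (`smooth_idf=True`). The three toy documents and the whitespace tokenizer are assumptions for illustration; the real vectorizer also applies its own token pattern and L2-normalizes each row by default, which this sketch omits.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "the senate passed the bill",
    "the president said the bill failed",
    "you will not believe this video",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    """Unnormalized tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items()):
    print("%8s  df=%d  tf-idf=%.3f" % (term, df[term], weight))
```

A term that occurs in fewer documents ("senate") gets a larger idf than one that occurs in more ("the"), although a high raw count can still give the common term a large overall weight.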




In [47]:
print('stop words:', t_vectorizer.get_stop_words())

stop words: None


In [52]:
print('stop words:', t_vectorizer.stop_words_)
print('features', t_vectorizer.get_feature_names())
print('idf:', t_vectorizer.idf_)

stop words: set()
idf: [ 5.93023304  3.08816137  8.91759707 ... 10.01620936  9.32306218
 10.01620936]


In [55]:
#print('vocabulary:', t_vectorizer.vocabulary_)

#### Repeat TfidfVectorizer with min_df and max_df 

In [78]:

t_vectorizer2 = TfidfVectorizer(min_df=0.0, max_df=0.5)
t_X2 = t_vectorizer2.fit_transform(train_data)   
#print(t_X.shape)
t_X_dev2 = t_vectorizer2.transform(dev_data)
#print(t_X_dev.shape)


# MultinomialNB
#The multinomial distribution normally requires integer feature counts. 
#However, in practice, fractional counts such as tf-idf may also work.
print('\nMultinomialNB with TfidfVectorizer')
alpha = 1.0
t_clf2 = MultinomialNB(alpha=alpha)
t_clf2.fit(t_X2, train_labels)

print('accuracy: %3.3f' %t_clf2.score(t_X_dev2, dev_labels))

t_dev_predicted_labels2 = t_clf2.predict(t_X_dev2)  # "predict" and report accuracy using dev set
#print(t_dev_predicted_labels.shape)

print('\nf1 score of dev predicted labels:', metrics.f1_score(dev_labels, t_dev_predicted_labels2, average='weighted'))
print('classification report of dev predicted labels: \n', classification_report(dev_labels, t_dev_predicted_labels2))
print()



MultinomialNB with TfidfVectorizer
accuracy: 0.697

f1 score of dev predicted labels: 0.696146112796466
classification report of dev predicted labels: 
               precision    recall  f1-score   support

           0       0.54      0.99      0.70      9203
           1       0.99      0.53      0.69     16701

   micro avg       0.70      0.70      0.70     25904
   macro avg       0.77      0.76      0.70     25904
weighted avg       0.83      0.70      0.70     25904




In [79]:
print('stop words:', t_vectorizer2.stop_words_)

stop words: {'his', 'about', 'we', 'on', 'one', 'they', 'be', 'he', 'an', 'who', 'as', 'but', 'are', 'with', 'have', 'not', 'from', 'of', 'the', 'to', 'by', 'for', 'has', 'that', 'in', 'it', 'this', 'is', 'was', 'at', 'trump', 'would', 'and', 'said'}
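
The `stop_words_` set above lists the terms that `max_df=0.5` removed, that is, terms appearing in more than half of the training documents (note that "trump" is among them). Below is a minimal sketch of that filtering rule in plain Python; the `high_df_terms` helper, the whitespace tokenizer, and the toy documents are hypothetical.

```python
from collections import Counter

def high_df_terms(docs, max_df=0.5):
    """Return terms whose document frequency is strictly above max_df * n_docs."""
    tokenized = [set(d.lower().split()) for d in docs]
    df = Counter(term for tokens in tokenized for term in tokens)
    limit = max_df * len(docs)
    return {term for term, count in df.items() if count > limit}

# Toy corpus, made up for illustration.
docs = [
    "trump said the senate would vote",
    "the court said no",
    "you will not believe the video",
    "trump told the court",
]

dropped = high_df_terms(docs, max_df=0.5)
print("dropped:", sorted(dropped))
```

In this toy corpus only "the" is dropped: "trump", "said" and "court" each appear in exactly half of the documents, and the threshold is strict.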


#### Do a verification test by assigning random 0's and 1's to the dev labels and re-running.

In [35]:
sample = np.random.binomial(1, 0.5, size=dev_labels.shape[0])
print(sample.mean())

0.5020044543429845


In [36]:
print('accuracy: %3.4f' %clf.score(X_dev_transformed, sample))

accuracy: 0.4953


In [37]:
sample2 = np.random.binomial(1, 0.2, size=dev_labels.shape[0])
print(sample2.mean())

0.19510022271714922


In [38]:
print('accuracy: %3.4f' %clf.score(X_dev_transformed, sample2))

accuracy: 0.5145


#### This result is as expected; the model's accuracy drops to ~50% once we randomize the dev labels in either fashion.
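
A short sketch of why both random-label runs land near 50%. If the labels are drawn independently of the predictions, the expected accuracy is q*p + (1-q)*(1-p), where q is the fraction of documents predicted as 1 and p is the fraction of labels that are 1. The value q = 0.48 below is an assumption chosen for illustration, not a number measured in this notebook.

```python
import random

def expected_accuracy(pred_pos_rate, label_pos_rate):
    """Accuracy when predictions and labels are independent Bernoulli draws."""
    q, p = pred_pos_rate, label_pos_rate
    return q * p + (1 - q) * (1 - p)

rng = random.Random(0)
n = 20000

# Hypothetical fixed predictions: about 48% of documents predicted as real (1).
preds = [1 if rng.random() < 0.48 else 0 for _ in range(n)]

def simulated_accuracy(label_pos_rate):
    """Score the fixed predictions against freshly drawn random labels."""
    labels = [1 if rng.random() < label_pos_rate else 0 for _ in range(n)]
    return sum(p == y for p, y in zip(preds, labels)) / n

for p in (0.5, 0.2):
    print("p=%.1f  expected=%.3f  simulated=%.3f"
          % (p, expected_accuracy(0.48, p), simulated_accuracy(p)))
```

With p = 0.5 the expected accuracy is exactly 0.5 for any q. With p = 0.2 it stays near 0.5 only because the predictions are roughly balanced; a model that predicted almost everything as 0 would score close to 0.8 against such labels.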

### Train with ISOT "text"; predict ISOT "title"

In [44]:
train_data, train_labels = train_set.text.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.title.str.lower().values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)

print('dev_data shape:', dev_data.shape)

train_data shape: (16470,)
['the fracking industry has given the state of oklahoma a new claim to fame that no one in oklahoma wanted. in 2015, oklahoma had more earthquakes than the entire continental united states combined. there were 857 earthquakes in oklahoma with a magnitude of 3.0 or higher. the total number of earthquakes for the continental united states with a 3.0 magnitude or higher was 1,556.nearly two dozen peer reviewed, scientific papers have been published that show that there is a likely link between waste water injection wells used by the fracking industry and the increased number of quakes that have rocked the state.oklahoma has seen a dramatic increase in the number of earthquakes over the past few years, that coincide with the state s fracking boom. in 2014, there were 585 quakes, a record for the state. in 2013, there were a comparatively stable 106.here is a chart that shows the rising number of earthquakes. it is important to note that there were 84 less quakes 

In [45]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

print('X.shape:', X.shape)
print('X_dev_transformed.shape:', X_dev_transformed.shape)

# MultinomialNB
print('\nMultinomialNB trained using ISOT "text" field; predict ISOT "title" classes')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

X.shape: (16470, 70178)
X_dev_transformed.shape: (25904, 70178)

MultinomialNB trained using ISOT "text" field; predict ISOT "title" classes
accuracy: 0.798


### Apply model to LIAR dataset text to predict results and compute score ('title' field contains the statement).

In [43]:
#### MAY NEED TO RUN THIS CELL TWICE

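def get_data(filename, sep=',', header=0, names=None):
    '''Read a delimited text file (CSV or TSV) from DATAPATH into a pandas dataframe'''

    filepath = DATAPATH + filename   # DATAPATH must be defined before this function is called
    return pd.read_csv(filepath, header=header, sep=sep, names=names, quotechar='"')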

In [44]:
# define each downloaded file
LIAR_TEST_FILENAME = 'test.tsv'
LIAR_TRAIN_FILENAME = 'train.tsv'
LIAR_DEV_FILENAME = 'valid.tsv'

# define the downloaded file path 
DATAPATH = './datasets/LIAR/'

## title =statement, target = politifact rating

h_names= ['id', 'target', 'title', 'subject', 'speaker', 'speaker_job_title', 'state', 'party',
          'barely_true_count', 'false_count', 'half_true_count', 'mostly_true_count','pantsonfire_count',
          'context']

liar_test_data = get_data(LIAR_TEST_FILENAME, sep ='\t', header =None)
liar_train_data = get_data(LIAR_TRAIN_FILENAME, '\t', header =None)
liar_dev_data = get_data(LIAR_DEV_FILENAME, '\t', header =None)
print("LIAR training dataset: ", liar_train_data.shape)
print("LIAR test dataset: ", liar_test_data.shape)
print("LIAR dev dataset: ", liar_dev_data.shape)

liar_test_data.columns = h_names
liar_train_data.columns = h_names
liar_dev_data.columns = h_names
# ## add a label column to the data with the target values
# #fake_data.loc[:,'target'] = '0'
# #true_data['target'] = '1'

# #append the datasets and shuffle them
# all_data = true_data.append(fake_data, ignore_index=True)
# all_data = all_data.sample(frac=1).reset_index(drop=True)

## NOTE: if trouble loading, re-run get_data function.

LIAR training dataset:  (10240, 14)
LIAR test dataset:  (1267, 14)
LIAR dev dataset:  (1284, 14)


In [45]:
# combine all the liar data
# pd.concat replaces chained DataFrame.append calls (append was removed in pandas 2.0)
liar_data = pd.concat([liar_train_data, liar_test_data, liar_dev_data], ignore_index=True)
liar_data = liar_data.sample(frac=1).reset_index(drop=True)
print("Complete LIAR dataset: ",liar_data.shape)
liar_data.head()

Complete LIAR dataset:  (12791, 14)


Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context
0,6164.json,half-true,The Mack Penny Plan for the federal budget wou...,"deficit,federal-budget",connie-mack,U.S. representative from Fort Myers,Florida,republican,3.0,3.0,1.0,3.0,1.0,a campaign mailer
1,5970.json,mostly-true,Says Williamson County Attorney Jana Duty has ...,candidates-biography,john-bradley,Williamson County district attorney,Texas,republican,0.0,0.0,0.0,1.0,0.0,an op-ed in the Austin American-Statesman
2,12434.json,pants-fire,It is Hillary Clintons agenda to release the v...,"crime,criminal-justice,legal-issues",donald-trump,President-Elect,New York,republican,63.0,114.0,51.0,37.0,61.0,a speech to the National Rifle Association
3,7576.json,half-true,Very few men outlive their own fertility.,"children,families,gays-and-lesbians,human-righ...",charles-cooper,,,newsmaker,0.0,0.0,1.0,0.0,0.0,arguments before the U.S. Supreme Court
4,12891.json,true,Overdosing is now the number one accidental ki...,drugs,josh-shapiro,,Pennsylvania,democrat,0.0,0.0,0.0,0.0,0.0,a platform on his campaign website


In [49]:
print(liar_data.title[4])
print(liar_data.target.unique())

Of all the illegals in America, more than half come through Arizona.
['mostly-true' 'false' 'true' 'barely-true' 'half-true' 'pants-fire']


In [50]:
targets = liar_data.target.unique()
print(targets)

print('target,  number of examples')
for target in targets:
    print(target, len(liar_data[liar_data.target==target]))
    
print('\ntotal examples', len(liar_data))

['mostly-true' 'false' 'true' 'barely-true' 'half-true' 'pants-fire']
target,  number of examples
mostly-true 2454
false 2507
true 2053
barely-true 2103
half-true 2627
pants-fire 1047

total examples 12791


In [51]:
liar_data['binary_target'] = -1

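'''  # this does not work: ('pants-fire' or 'false' or 'barely-true') evaluates to 'pants-fire', so only the first rating is ever compared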
for i in range(liar_data.shape[0]):   
    if liar_data.target.iloc[i] == ('pants-fire' or 'false' or 'barely-true') :
        liar_data.binary_target.iloc[i] = 0  # fake news
    elif liar_data.target.iloc[i] == ('true' or 'mostly-true'):
        liar_data.binary_target.iloc[i] = 1  # real news
'''

''' these do not work
#liar_data.binary_target[((liar_data.target=='pants-fire') | (liar_data.target=='false') | (liar_data.target=='barely-true')] = 0
#liar_data['binary_target'] = np.where( ( (liar_data.target=='pants-fire') | (liar_data.target=='false') | (liar_data.target=='barely-true')] = 0                                        
## example:df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
#liar_data['binary_target'] = np.where( (liar_data.target.isin (['pants-fire','false','barely-true']), 0,1))
'''

# Map each rating to a binary label with a lookup function.
def binary_seq_target(rating):
    ## 0 = fake, 1 = real, -1 = half-true (discarded later); if no rating provided assume the statement to be true
    map_r = {'pants-fire':0, 'false':0, 'barely-true':0, 'half-true':-1, 'mostly-true':1, 'true':1}
    return map_r.get(rating, 1)
    
## set binary_target to 0 (fake news), 1 (real news), or -1 (half-true)
liar_data.loc[:,'binary_target'] = liar_data['target'].apply(binary_seq_target)

'''
# these give a warning: 'A value is trying to be set on a copy of a slice from a DataFrame'
liar_data.binary_target[liar_data.target=='pants-fire'] = 0  # fake news
liar_data.binary_target[liar_data.target=='false'] = 0
liar_data.binary_target[liar_data.target=='barely-true'] = 0
liar_data.binary_target[liar_data.target=='true'] = 1        # real news
liar_data.binary_target[liar_data.target=='mostly-true'] = 1
'''

liar_data.head(10)


Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context,binary_target
0,2653.json,mostly-true,Rick Perry has become a millionaire on the pub...,"candidates-biography,message-machine",bill-white,Former mayor of Houston,Texas,democrat,2.0,3.0,5.0,7.0,3.0,a TV ad,1
1,8220.json,false,Theres only one candidate under investigation ...,ethics,ken-cuccinelli,Attorney General,Virginia,republican,1.0,10.0,3.0,2.0,1.0,a TV ad.,0
2,1467.json,true,The law is very clear! 'The monies recouped fr...,economy,judd-gregg,U.S. Senator,,republican,0.0,0.0,0.0,1.0,0.0,a Senate budget hearing,1
3,746.json,true,"John McCain wants to ""give oil companies anoth...",taxes,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,"Oxford, Miss.",1
4,2100.json,barely-true,"Of all the illegals in America, more than half...","immigration,message-machine",john-mccain,U.S. senator,Arizona,republican,31.0,39.0,31.0,37.0,8.0,a campaign ad,0
5,7357.json,true,Says Texas ranks first in executions among the...,"criminal-justice,pundits",sacramento-bee-editorial-board,,,none,1.0,0.0,1.0,1.0,0.0,an editorial.,1
6,13402.json,barely-true,Says Russ Feingold voted to raise taxes on Soc...,"federal-budget,immigration,income,social-secur...",ron-johnson,,Wisconsin,republican,14.0,6.0,10.0,10.0,1.0,a radio ad,0
7,7232.json,true,A far different picture from the prior eight y...,taxes,chris-christie,Governor of New Jersey,New Jersey,republican,10.0,17.0,27.0,19.0,8.0,a speech,1
8,5420.json,true,Our small staff of 51 is still fewer than we h...,"criminal-justice,state-finances",carol-hunstein,"Chief Justice, Georgia Supreme Court",Georgia,none,0.0,0.0,0.0,0.0,0.0,a speech,1
9,2013.json,true,Crime is down in Arizona.,"crime,immigration,abc-news-week",al-hunt,Executive editor for Washington for Bloomberg ...,,none,0.0,1.0,0.0,0.0,0.0,"an appearance on ABC's ""This Week.""",1
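
The first commented-out attempt in the cell above fails for a reason that is easy to reproduce: an `or` chain of non-empty strings evaluates to its first operand before `==` is applied. The sketch below shows this, together with a set-membership version of the mapping. The `to_binary` helper is a hypothetical alternative to `binary_seq_target`; unlike that function, it maps unknown ratings to -1 rather than 1.

```python
FAKE = {"pants-fire", "false", "barely-true"}
REAL = {"true", "mostly-true"}

def to_binary(rating):
    """0 = fake, 1 = real, -1 = anything else (e.g. half-true)."""
    if rating in FAKE:
        return 0
    if rating in REAL:
        return 1
    return -1

# The `or` chain collapses to its first truthy operand.
collapsed = ("pants-fire" or "false" or "barely-true")

ratings = ["pants-fire", "false", "half-true", "mostly-true"]
naive_matches = [r == collapsed for r in ratings]   # only 'pants-fire' matches
labels = [to_binary(r) for r in ratings]

print("collapsed:", collapsed)
print("naive matches:", naive_matches)
print("labels:", labels)
```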


In [52]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[ 1  0 -1]

binary_target,  number of examples
1 4507
0 5657
-1 2627


#### Must discard label = -1 (half-true)  

In [53]:
liar_dev_labels = liar_data.binary_target[liar_data.binary_target >= 0].values  ## discard "half-true"!!!!
print('liar_dev_labels:\n', liar_dev_labels[:10])

liar_dev_labels:
 [1 0 1 1 0 1 0 1 1 1]


In [54]:
print(liar_data.title)


0        Rick Perry has become a millionaire on the pub...
1        Theres only one candidate under investigation ...
2        The law is very clear! 'The monies recouped fr...
3        John McCain wants to "give oil companies anoth...
4        Of all the illegals in America, more than half...
5        Says Texas ranks first in executions among the...
6        Says Russ Feingold voted to raise taxes on Soc...
7        A far different picture from the prior eight y...
8        Our small staff of 51 is still fewer than we h...
9                                Crime is down in Arizona.
10       Florida has issued more than 3 million conceal...
11       Recent reports state that the U.S. Customs and...
12       The race will tighten, just because that's wha...
13       Says Abraham Lincoln supported an agreement th...
14       Race to the Top grants require participating s...
15       Twitter, Google and Facebook are burying the F...
16       Florida has the most errors and exonerations f.

### Using the ISOT "text" model, predict the liar_dev_labels and score the predictions. (lower case them first...)

In [55]:
train_data, train_labels = train_set.text.str.lower().values, train_set.target.values   # original ISOT data

#dev_data = liar_data.title[liar_data.binary_target != -1].values  
dev_data = liar_data.title[liar_data.binary_target != -1].str.lower().values      # full LIAR data (half-true removed)
dev_labels = liar_data.binary_target[liar_data.binary_target != -1].values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (16470,)
train_labels shape: (16470,)
[0 0 0 ... 0 0 0]

dev_data shape: (10164,)
['rick perry has become a millionaire on the public payroll.']
dev_labels shape: (10164,)
[1 0 1 ... 1 1 1]


In [56]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # original ISOT training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "text" field: Fit using ISOT data; Predict on LIAR data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "text" field: Fit using ISOT data; Predict on LIAR data:
accuracy: 0.534


In [57]:
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total count of non-zero entries in the sparse matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X.nnz): 3549231
Average number of non-zero features per example (per document): 215.497


In [58]:
print('Non-zero elements in matrix (X_dev_transformed.nnz):', X_dev_transformed.nnz)   # total count of non-zero entries in the sparse matrix
print('Average number of non-zero features per example (per document): %.3f' %(X_dev_transformed.nnz/X_dev_transformed.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X_dev_transformed.nnz): 160668
Average number of non-zero features per example (per document): 15.808
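
The LIAR statements average about 16 non-zero features each, against about 215 for the ISOT articles, so each prediction rests on far fewer words. In addition, `CountVectorizer.transform` silently ignores any word that was not seen during `fit`. Below is a minimal sketch for measuring how much of the new text falls outside the training vocabulary. The `oov_rate` helper, the whitespace tokenizer, and the toy sentences are hypothetical.

```python
def oov_rate(train_docs, new_docs):
    """Fraction of tokens in new_docs that never occur in train_docs."""
    vocab = {tok for doc in train_docs for tok in doc.lower().split()}
    tokens = [tok for doc in new_docs for tok in doc.lower().split()]
    if not tokens:
        return 0.0
    missing = sum(tok not in vocab for tok in tokens)
    return missing / len(tokens)

# Toy stand-ins for the ISOT training text and the LIAR statements.
train_docs = ["the senate passed the bill"]
new_docs = ["the senate rejected it"]

rate = oov_rate(train_docs, new_docs)
print("out-of-vocabulary rate: %.2f" % rate)
```

A high out-of-vocabulary rate on LIAR would be one possible explanation for the weak cross-dataset accuracy, alongside the difference in document length and writing style.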


In [59]:
print('LIAR true:', len(liar_data.binary_target[liar_data.binary_target == 1]))
#print('LIAR true:', liar_data.binary_target[liar_data.binary_target == 1].shape)
print('LIAR false:', len(liar_data.binary_target[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657
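
These class counts give a useful reference point for the cross-dataset score. The sketch below computes the accuracy of a trivial classifier that always predicts the majority class on the binarized LIAR data and compares it with the 0.534 reported above; only the variable names are introduced here.

```python
n_true, n_false = 4507, 5657          # counts printed above
total = n_true + n_false

# Accuracy of always predicting the larger class ("fake").
majority_baseline = max(n_true, n_false) / total

model_accuracy = 0.534                # ISOT "text" model scored on LIAR, from the cell above

print("total examples: %d" % total)
print("majority-class baseline: %.3f" % majority_baseline)
print("model beats baseline:", model_accuracy > majority_baseline)
```

The baseline comes out at roughly 0.557, so the ISOT-trained model does slightly worse on LIAR than always answering "fake".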


### REVERSE: Using LIAR model, predict the ISOT "text" and score the predictions. (lower case them first...)

In [60]:
train_data = liar_data.title[liar_data.binary_target != -1].str.lower().values   # full LIAR data
train_labels = liar_data.binary_target[liar_data.binary_target != -1].values 
 
dev_data = train_set.text.str.lower().values                                      # ISOT data
dev_labels = train_set.target.values 


train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
#print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (10164,)
train_labels shape: (10164,)
[1 0 1 ... 1 1 1]

dev_data shape: (16470,)
dev_labels shape: (16470,)
[0 0 0 ... 0 0 0]


In [61]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # LIAR training data
print(X.shape)
X_dev_transformed = vectorizer.transform(dev_data) # ISOT data

# MultinomialNB
print('\nMultinomialNB:  Fit using LIAR data; predict on ISOT "text" data')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

(10164, 12231)

MultinomialNB:  Fit using LIAR data; predict on ISOT "text" data
accuracy: 0.689


In [62]:
print('LIAR true:', len(liar_data.binary_target[liar_data.binary_target == 1]))
#print('LIAR true:', liar_data.binary_target[liar_data.binary_target == 1].shape)
print('LIAR false:', len(liar_data.binary_target[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657


NOTE: VERY DIFFERENT results between the default CountVectorizer and TfidfVectorizer!!  

Using the full text means nearly all TRUE news contains the word "Reuters", which gives the model an unfair advantage.  Will try to remove those and run again, expecting lower accuracy.  
Should also account for text starting with: "The following statements\xa0were posted to the verified Twitter accounts of U.S. President Donald Trump, @realDonaldTrump and @POTUS.  The opinions expressed are his own.\xa0Reuters has not edited the statements or confirmed their accuracy."

### Repeat Naive Bayes on text field after removing first chunk of text, including "Reuters"

In [71]:
true_data.head()

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [72]:
true_data.iloc[0,1][22:]

' The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific resear

In [73]:
true_data.iloc[:,1]
#true_data.iloc[13,1]

0        WASHINGTON (Reuters) - The head of a conservat...
1        WASHINGTON (Reuters) - Transgender people will...
2        WASHINGTON (Reuters) - The special counsel inv...
3        WASHINGTON (Reuters) - Trump campaign adviser ...
4        SEATTLE/WASHINGTON (Reuters) - President Donal...
5        WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T...
6        WEST PALM BEACH, Fla (Reuters) - President Don...
7        The following statements were posted to the ve...
8        The following statements were posted to the ve...
9        WASHINGTON (Reuters) - Alabama Secretary of St...
10       (Reuters) - Alabama officials on Thursday cert...
11       NEW YORK/WASHINGTON (Reuters) - The new U.S. t...
12       The following statements were posted to the ve...
13       The following statements were posted to the ve...
14        (In Dec. 25 story, in second paragraph, corre...
15       (Reuters) - A lottery drawing to settle a tied...
16       WASHINGTON (Reuters) - A Georgian-American bus.

In [74]:
# How many of the TRUE NEWS docs contain "Reuters"?  
# How many of the TRUE NEWS docs start with "The following statements"?  

reuters_counter=0
statements_counter=0

for i in range(true_data.shape[0]):
    text = true_data.iloc[i,1]
    if "Reuters" in text:   # `find(...) > 0` would miss a match at position 0
        reuters_counter += 1
    # Loose check: counts docs that contain both words, not only docs that start with the phrase.
    if ("following" in text) and ("statements" in text):
        statements_counter += 1

print('reuters_counter:', reuters_counter)
print('statement_counter:', statements_counter)
print('total true docs:', true_data.shape[0])



reuters_counter: 21378
statement_counter: 156
total true docs: 21417


#### Need to remove "Reuters" from True News

In [75]:
re.sub(r"^.?([r,R]euters) - ", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")  


'WASHINGTON (Reuters) - my name is reuters, Reuters is the code'

In [76]:
re.sub(r"[r,R]euters", "", "my name is reuters, Reuters is the code")

'my name is ,  is the code'

In [77]:
re.sub(r"[\w+\s+]+[r,R]euters", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")

'WASHINGTON (Reuters) -, is the code'

In [78]:
re.sub(r"[\w+\s+]+([r,R]euters) - ", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")

'WASHINGTON (Reuters) - my name is reuters, Reuters is the code'

In [79]:
re.sub(r"\w.*[r,R]euters\W*", "","ASPEN, Colorado (Reuters) - The Trump administ")  ### This is the one we want...

'The Trump administ'

In [80]:
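def remove_reuters(text):
    # Anchor at the start and match lazily so that only the leading dateline
    # (e.g. "WASHINGTON (Reuters) - ") is removed.  The unanchored, greedy
    # r"\w.*[r,R]euters\W*" deletes everything up to the LAST "Reuters" and is slow on long articles.
    # Docs that begin with "The following statements..." are left unchanged.
    return re.sub(r"^.*?\([Rr]euters\)\s*-\s*", "", text, count=1)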

true_data['text2'] = true_data['text'].apply(remove_reuters)
true_data.head()

KeyboardInterrupt: 
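
The pattern `\w.*[r,R]euters\W*` is unanchored and greedy: `.*` runs to the last "Reuters" in an article and everything before it is deleted. On long articles it also retries from every word character, which may explain why the `apply` call above had to be interrupted. The sketch below contrasts it with an anchored, lazy pattern on a made-up sentence. The helper name `strip_dateline` and the sample text are assumptions for illustration, not part of the original notebook.

```python
import re

# Anchored at the start of the text and lazy: stops at the FIRST "(Reuters) - ".
DATELINE = re.compile(r"^.*?\([Rr]euters\)\s*-\s*")

# The greedy pattern tried above (with the stray comma removed from the character class).
GREEDY = re.compile(r"\w.*[rR]euters\W*")

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' dateline if present; otherwise return text unchanged."""
    return DATELINE.sub("", text, count=1)

# Hypothetical sample: "Reuters" appears once in the dateline and once in the body.
sample = "ASPEN, Colorado (Reuters) - The agency told Reuters it would comply."

print(strip_dateline(sample))   # only the dateline is removed
print(GREEDY.sub("", sample))   # text up to the last "Reuters" is removed
```

Fake-news rows have no dateline, so `strip_dateline` returns them unchanged and the same function could be applied to both frames.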

In [None]:
true_data['text2'] = true_data['text'].apply(remove_reuters)

In [None]:
true_data.head()

In [None]:
fake_data['text2'] = fake_data['text']

In [None]:
#append the datasets and shuffle them
#all_data2 = true_data.append(fake_data, ignore_index=True)
#all_data2 = all_data2.sample(frac=1).reset_index(drop=True)

#all_data2.describe()

In [56]:
#all_data2.head()

Unnamed: 0,title,text,subject,date,target,text2
0,U.S. to review energy royalty rates on federal...,WASHINGTON (Reuters) - The U.S. Interior Depar...,politicsNews,"March 29, 2017",1,The U.S. Interior Department said on Wednesday...
1,Illinois Senate votes for $454 million higher-...,CHICAGO (Reuters) - For the second time in two...,politicsNews,"May 5, 2016",1,"For the second time in two weeks, the Illinois..."
2,New Jersey 'Bridgegate' defendant says he was ...,"NEWARK, N.J. (Reuters) - After days of complai...",politicsNews,"October 17, 2016",1,After days of complaints about traffic jams at...
3,Spain's Rajoy calls on Catalonia leaders to ca...,MADRID (Reuters) - Spanish Prime Minister Mari...,worldnews,"September 20, 2017",1,Spanish Prime Minister Mariano Rajoy on Wednes...
4,"BREAKING: CROOKED VA GOVERNOR, Close Hillary F...",How much more criminal activity are American v...,left-news,"Oct 23, 2016",0,How much more criminal activity are American v...


In [57]:
## Re-define train/dev/test:

'''train_set = all_data2[ :int(len(all_data2)*train_fract)].reset_index(drop=True)
dev_set = all_data2[int(len(all_data2)*(train_fract)) : int(len(all_data2)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = all_data2[int(len(all_data2)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)


train_data, train_labels = train_set.text2.values, train_set.target.values
dev_data, dev_labels = dev_set.text2.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

print('\ntrain_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print(train_labels)
'''

training set:  (31428, 6)
dev set:  (6735, 6)
test set:  (6735, 6)

train_data shape: (31428,)
train_labels shape: (31428,)
[1 1 1 ... 0 0 0]


In [60]:
'''vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB with "Reuters" removed from text field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))
'''


MultinomialNB with "Reuters" removed from text field
accuracy: 0.947


In [None]:
#print(train_data)

### Run Naive Bayes on the Title field

In [63]:
#fake_data.title

#print(fake_data.title.iloc[:15])

In [54]:
train_data, train_labels = train_set.title.values, train_set.target.values
dev_data, dev_labels = dev_set.title.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)

print('dev_data shape:', dev_data.shape)


train_data shape: (16470,)
[' Here’s What The Fracking Industry Gave To Oklahoma In 2015 (IMAGES)']

train_labels shape: (16470,)
[0 0 0 ... 0 0 0]
dev_data shape: (25904,)


In [55]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)


print('X.shape:', X.shape)  # (documents, features): one row per title, one column per vocabulary word
print('Vocabulary size (number of features or columns):', X.shape[1])
print('Non-zero elements in matrix (X.nnz):', X.nnz)
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # X.nnz / documents
print('Fraction of non-zero elements in matrix: %.4f' %( X.nnz/(X.shape[0] * X.shape[1])) )   # X.nnz / (rows * columns)


X.shape: (16470, 13548)
Vocabulary size (number of features or columns): 13548
Non-zero elements in matrix (X.nnz): 207035
Average number of non-zero features per example (per document): 12.570
Fraction of non-zero elements in matrix: 0.0009


In [56]:
# Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? 
vectorizer_dev = CountVectorizer()
X_dev = vectorizer_dev.fit_transform(dev_data)  # Independently build a vocabulary using dev_data.
print('X_dev.shape:', X_dev.shape)
print('Vocabulary using train data:', X.shape[1])
print('Vocabulary using dev data:', X_dev.shape[1])

# Feed dev_data into the vectorizer fit using training data.
X_dev_transformed = vectorizer.transform(dev_data)
print('X_dev_transformed shape:', X_dev_transformed.shape)
#print('X_dev_transformed.shape', X_dev_transformed.shape)  # ()  EXPECT .shape[1] equal to original number of features
#print('non-zero indices in X_dev_transformed:', X_dev_transformed.nonzero())  # could also use this to check which features missing...

''' This is way too slow; use set intersection instead!!
# Look at each feature (vocabulary word) in X_dev and see if it is a feature in X.
count = 0
for i in range(X_dev.shape[1]):
    if vectorizer_dev.get_feature_names()[i] in vectorizer.get_feature_names():
        count += 1
print('Count of words (features) in X_dev also in X:', count)   
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - count)/X_dev.shape[1]) )
count of words (features) in X_dev also in X: 12219
Fraction of words in dev data missing from training vocabulary: 0.248
'''

# vocabulary_ maps each word to its column index, so its keys are the feature names.
# This works across scikit-learn versions (get_feature_names() was removed in 1.2).
set1 = set(vectorizer_dev.vocabulary_)
set2 = set(vectorizer.vocabulary_)
overlap = len(set1 & set2)
print('Count of words (features) in X_dev also in X:', overlap)
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - overlap)/X_dev.shape[1]) )

X_dev.shape: (25904, 16289)
Vocabulary using train data: 13548
Vocabulary using dev data: 16289
X_dev_transformed shape: (25904, 13548)
Count of words (features) in X_dev also in X: 9498
Fraction of words in dev data missing from training vocabulary: 0.417
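
The missing fraction is (dev vocabulary size minus overlap) / dev vocabulary size, here (16289 - 9498) / 16289 ≈ 0.417. Below is a minimal pure-Python sketch of the same set computation. It assumes a tokenizer that mimics CountVectorizer's defaults (lowercasing, tokens of two or more word characters). The helper names `vocabulary` and `oov_fraction` and the toy titles are hypothetical.

```python
import re

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # same shape as CountVectorizer's default token pattern

def vocabulary(docs):
    """Set of lowercased tokens across all documents."""
    return {tok for doc in docs for tok in TOKEN.findall(doc.lower())}

def oov_fraction(train_docs, dev_docs):
    """Fraction of the dev vocabulary that is absent from the training vocabulary."""
    train_vocab = vocabulary(train_docs)
    dev_vocab = vocabulary(dev_docs)
    return len(dev_vocab - train_vocab) / len(dev_vocab)

# Toy titles, for illustration only.
train_titles = ["Trump wants Postal Service to charge more", "Senate votes on budget"]
dev_titles = ["Trump leaves open possible Taiwan meet", "Senate budget fight looms"]

print('%.3f' % oov_fraction(train_titles, dev_titles))
```

Set difference and set intersection each take time roughly linear in the vocabulary size, which is why this is much faster than the nested lookup in the disabled loop above.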


In [57]:
# MultinomialNB
print('\nMultinomialNB using "title" field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field
accuracy: 0.888


In [50]:
#print('title, target label\n', train_set.title, train_set.target)
print('title, target label\n', train_set.title[4], train_set.target[4])

title, target label
  Ben Carson Campaign In Shambles After Top Aides, 20 Staffers Quit 0


In [69]:
type(train_set.target)

pandas.core.series.Series

In [70]:
type(train_labels)

numpy.ndarray

#### ONE OFF: Use CountVectorizer to determine vocab sizes for ISOT "title" for 2016 and 2017, both for fake and real titles

In [51]:
# 2016: Use train_set
train_set_2016_true = train_set[train_set.target=='1']
#train_set_2016_true.head()
vectorizer_2016_true = CountVectorizer()
X_2016_true = vectorizer_2016_true.fit_transform(train_set_2016_true.title.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2016 true title:', X_2016_true.shape[1])

train_set_2016_false = train_set[train_set.target=='0']
#train_set_2016_false.head()
vectorizer_2016_false = CountVectorizer()
X_2016_false = vectorizer_2016_false.fit_transform(train_set_2016_false.title.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2016 false title:', X_2016_false.shape[1])


# 2017: Use dev_set
dev_set_2017_true = dev_set[dev_set.target=='1']
#dev_set_2017_true.head()
vectorizer_2017_true = CountVectorizer()
X_2017_true = vectorizer_2017_true.fit_transform(dev_set_2017_true.title.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2017 true title:', X_2017_true.shape[1])

dev_set_2017_false = dev_set[dev_set.target=='0']
#dev_set_2017_false.head()
vectorizer_2017_false = CountVectorizer()
X_2017_false = vectorizer_2017_false.fit_transform(dev_set_2017_false.title.values)  # Independently build a vocabulary.
print('Vocabulary using ISOT 2017 false title:', X_2017_false.shape[1])

Vocabulary using ISOT 2016 true title: 6363
Vocabulary using ISOT 2016 false title: 11360
Vocabulary using ISOT 2017 true title: 11884
Vocabulary using ISOT 2017 false title: 10123


In [53]:
# Check overlap in 2016 data.  Transform X_2016_false using X_2016_true vocab

#X_2016_false_transformed = vectorizer_2016_true.transform(train_set_2016_false.title.values)
#print('X_dev_transformed to true vocab shape:', X_2016_false_transformed.shape)

set1 = set(vectorizer_2016_true.get_feature_names())
set2 = set(vectorizer_2016_false.get_feature_names())
print('Count of words (features) in dev_set_2016_true also in dev_set_2016_false:', len(set1.intersection(set2)))

Count of words (features) in dev_set_2016_true also in dev_set_2016_false: 4175


In [52]:
# Check overlap in 2017 data.  Transform X_2017_false using X_2017_true vocab

#X_2017_false_transformed = vectorizer_2017_true.transform(dev_set_2017_false.title.values)
#print('X_dev_transformed to true vocab shape:', X_2017_false_transformed.shape)

set1 = set(vectorizer_2017_true.get_feature_names())
set2 = set(vectorizer_2017_false.get_feature_names())
print('Count of words (features) in dev_set_2017_true also in dev_set_2017_false:', len(set1.intersection(set2)))

X_dev_transformed to true vocab shape: (9203, 11884)
Count of words (features) in dev_set_2017_true also in dev_set_2017_false: 5718


#### Re-do training and dev eval using only LOWER CASE text.  (This should not change the result: CountVectorizer lowercases by default, `lowercase=True`.)

In [71]:
print(train_set.title.values)

[' Here’s What The Fracking Industry Gave To Oklahoma In 2015 (IMAGES)'
 'IS A REVOLUTION COMING? Former Congressman Throws Down Gauntlet On Obama’s Executive Gun Grab: “It’s War, Defy His Executive Actions”'
 'SHAMELESS! Hillary Clinton Throws Benghazi Families Under The Bus' ...
 ' Trump’s Russia Connection Says US Troops Lives Are In Danger'
 'LIBERALS SEE THE LIGHT! HuffPo Columnist LETS IT RIP On The Obama “Destruction” Of The Democrats [Video]'
 'BREAKING: Federal Judge STOPS Obamacare Transgender, Abortion Related Protections']


In [72]:
print(all_data['title'].str.lower().values)

NameError: name 'all_data' is not defined

In [73]:
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.title.str.lower().values, dev_set.target.values

In [74]:
print(train_data)
print()
print(dev_data)
print(train_data.shape, dev_data.shape)

[' here’s what the fracking industry gave to oklahoma in 2015 (images)'
 'is a revolution coming? former congressman throws down gauntlet on obama’s executive gun grab: “it’s war, defy his executive actions”'
 'shameless! hillary clinton throws benghazi families under the bus' ...
 ' trump’s russia connection says us troops lives are in danger'
 'liberals see the light! huffpo columnist lets it rip on the obama “destruction” of the democrats [video]'
 'breaking: federal judge stops obamacare transgender, abortion related protections']

['trump leaves open possible taiwan meet, questions russia hacking'
 'check out trump’s hilarious new years eve tweet to his “many enemies”'
 'are angry leftists planning violent communist revolution?…“it is their goal to “block, obstruct, disrupt” trump’s inauguration'
 ...
 'barbra streisand gives up on dream of impeaching trump over fake trump-russian collusion…tweets hilarious new reason trump should be impeached'
 'watch: senator lindsey graham drop

In [75]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field
accuracy: 0.888
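
The accuracy matches the earlier title run (0.888) because CountVectorizer lowercases its input by default, so lowercasing beforehand leaves the features unchanged. Here is a small sketch of that idea in plain Python, with a hypothetical `tokenize` helper standing in for the vectorizer's default preprocessing:

```python
import re
from collections import Counter

def tokenize(doc):
    """Lowercase, then keep tokens of 2+ word characters (mimics CountVectorizer defaults)."""
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

title = "SHAMELESS! Hillary Clinton Throws Benghazi Families Under The Bus"

counts_raw = Counter(tokenize(title))              # original casing passed in
counts_lowered = Counter(tokenize(title.lower()))  # lowercased before tokenizing

print(counts_raw == counts_lowered)
```

To test whether capitalization carries signal (fake titles here use many all-caps words), the vectorizer would have to be built with `lowercase=False` instead.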


In [58]:
## Print out most probable words

print('clf.feature_count_ :', clf.feature_count_.shape)

cols = ['Label', 'WORD', 'feature_log_prob_ when fake', 'feature_log_prob_ when true']
row_list = []
stoppish_words = ['his', 'about', 'we', 'on', 'one', 'they', 'be', 'he', 'an', 'who', 'as', 'but', 'are', 'with', 'have', 'not', 'from', 'of', 'the', 'to', 'by', 'for', 'has', 'that', 'in', 'it', 'this', 'is', 'was', 'at', 'would', 'and', 'said']

feature_names = vectorizer.get_feature_names()  # look up once; get_feature_names_out() in scikit-learn >= 1.0
n = 300  # top n words to print for each class

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake prob', 'True prob'))
    top_indices = np.argsort(clf.feature_log_prob_[i,:])[::-1][:n]  # highest log probability first
    for feature_index in top_indices:
        word = feature_names[feature_index]
        fake_log_prob = clf.feature_log_prob_[0,feature_index]
        true_log_prob = clf.feature_log_prob_[1,feature_index]
        print('%19s %12.3f %12.3f' %(word, fake_log_prob, true_log_prob))

        if word not in stoppish_words:  # exclude stoppish words
            row_list.append(dict( [('Label', i), ('WORD', word), ('feature_log_prob_ when fake', fake_log_prob), ('feature_log_prob_ when true', true_log_prob) ]  ))
    print()

feature_log_prob_df = pd.DataFrame(row_list, columns=cols)



clf.feature_count_ : (2, 13548)

MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD    Fake prob    True prob
                 to       -3.696       -3.561
              trump       -3.712       -3.495
              video       -3.723       -8.143
                the       -4.037       -5.933
                 of       -4.322       -4.579
                for       -4.341       -4.403
                 in       -4.393       -4.209
            hillary       -4.637       -7.393
                and       -4.638       -6.124
                 on       -4.673       -4.427
              obama       -4.843       -4.631
                 is       -4.861       -6.499
               with       -5.041       -5.224
              watch       -5.159       -8.897
                his       -5.379       -6.522
              about       -5.405       -6.534
            clinton       -5.430       -4.563
                 it       -5.439       -7.064
               this       -5.474       -7.932
 

            illegal       -7.338       -8.779
               only       -7.338       -9.367
           response       -7.338       -8.143
                day       -7.338       -8.086
               live       -7.347       -9.185
              truth       -7.356       -9.590
               stop       -7.356       -7.644
               ever       -7.364       -9.878
                pro       -7.364       -8.578
             before       -7.373       -7.511
              wants       -7.373       -7.644
              money       -7.382       -7.681
                 ll       -7.382       -9.878
          candidate       -7.400       -6.987
          wikileaks       -7.400       -8.897
         terrorists       -7.400       -9.878
               race       -7.400       -6.351
              again       -7.409       -8.491
          hilarious       -7.409      -10.976
           students       -7.409       -9.185
             emails       -7.409       -7.064
                see       -7.418  

               cruz       -6.434       -6.465
               ryan       -7.718       -6.465
              calls       -6.864       -6.465
               plan       -7.579       -6.477
                 is       -4.861       -6.499
                 no       -6.671       -6.522
                his       -5.379       -6.522
             russia       -7.568       -6.534
              about       -5.405       -6.534
            senator       -7.990       -6.570
               meet       -8.169       -6.582
            against       -6.689       -6.594
              north       -8.490       -6.594
           security       -7.718       -6.607
               more       -6.802       -6.620
              party       -7.145       -6.646
            speaker       -8.922       -6.646
     administration       -9.056       -6.672
              could       -7.195       -6.672
                fbi       -6.792       -6.672
                may       -7.568       -6.686
              judge       -7.296  

           attorney       -8.463       -7.542
                one       -6.380       -7.542
          obamacare       -8.463       -7.575
             battle       -9.056       -7.575
           abortion       -7.973       -7.575
           spending       -9.536       -7.575
                 an       -6.827       -7.575
              texas       -7.694       -7.575
           approves      -10.714       -7.575
         nomination       -8.635       -7.575
               amid       -9.798       -7.575
                fox       -6.620       -7.575
              syria       -7.852       -7.575
               that       -5.882       -7.575
              leads      -10.491       -7.609
               bush       -7.810       -7.609
             police       -6.698       -7.609
           business       -8.188       -7.609
            release       -8.111       -7.609
              takes       -7.601       -7.609
               take       -7.579       -7.609
              block       -9.056  

In [77]:
print(feature_log_prob_df.size)
print(len(feature_log_prob_df))
feature_log_prob_df.head()

604
151


Unnamed: 0,Label,WORD,feature_log_prob_ when fake,feature_log_prob_ when true
0,0,trump,-3.711562,-3.494799
1,0,video,-3.723226,-8.143141
2,0,hillary,-4.636531,-7.392836
3,0,obama,-4.843465,-4.630718
4,0,watch,-5.159278,-8.896913


In [78]:
feature_log_prob_df.to_csv('nb_isot_title_2016_feature_log_prob_df.csv', sep=',')

#### Many words in the Title show an imbalance between Fake News and Real News.  The values in the table above are natural-log probabilities, so the Fake:Real ratio for a word is exp(fake - true).  On that reading, "video" is favored by ~ 83:1, "watch" by ~ 42:1, and "hillary" by ~ 16:1 in Fake vs. Real news, while "trump" is roughly balanced (~ 1.2:1 toward Real).  Words such as "he", "his", "she", "her", "him", "it", "they", "them", "we", "us", "like", "here", "donald", "gop", "liberal", "media", "america", "muslim", "racist", "breaking" are also heavily favored in Fake news in this dataset.
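
`feature_log_prob_` holds natural-log conditional probabilities log P(word | class), so the Fake:Real ratio for a word is exp(log P(word | fake) - log P(word | real)). A minimal sketch, using values copied from the `feature_log_prob_df` rows printed above. The helper name `fake_to_real_ratio` is hypothetical.

```python
import math

# (log P(word | fake), log P(word | real)), copied from the printed feature_log_prob_df rows
log_probs = {
    "trump":   (-3.711562, -3.494799),
    "video":   (-3.723226, -8.143141),
    "hillary": (-4.636531, -7.392836),
    "watch":   (-5.159278, -8.896913),
}

def fake_to_real_ratio(log_p_fake, log_p_real):
    """Ratio P(word | fake) / P(word | real) from natural-log probabilities."""
    return math.exp(log_p_fake - log_p_real)

ratios = {word: fake_to_real_ratio(*lp) for word, lp in log_probs.items()}
for word, ratio in ratios.items():
    print('%10s %8.2f' % (word, ratio))
```

These are ratios of per-class word probabilities. They do not include the class prior, and a ratio below 1 means the word is relatively more probable in Real titles.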

In [45]:
## Print out top MISCLASSIFICATIONS

# Make predictions on the dev data and show the top n documents where the ratio R is largest, where R is:
# R = maximum predicted probability / predicted probability of the correct label

n=2000
print('X_dev_transformed.shape:', X_dev_transformed.shape)
print()

# Compute probabilities and predictions once for the whole dev set (not once per example).
pred_probs = clf.predict_proba(X_dev_transformed)   # shape: (number of dev examples, 2)
pred_classes = clf.predict(X_dev_transformed)
correct_labels = dev_labels.astype(int)

# One R value per dev example.
max_pred_prob = pred_probs.max(axis=1)
pred_prob_correct_label = pred_probs[np.arange(len(correct_labels)), correct_labels]
r_array = max_pred_prob / pred_prob_correct_label

print('max R:', np.max(r_array)) 
#print('mean R:', np.mean(r_array))
print()

cols = ['text', 'label', 'pred_class', 'R']
row_list = []

top_indices = np.argsort(r_array)[::-1][:n]   # dev examples with the largest R first
for label_index in top_indices:
    print('R:', r_array[label_index])    
    print('\nDev Title',str(label_index)+':\n', dev_data[label_index])
    print('\nLABEL =', dev_labels[label_index])
    print('predicted class =', pred_classes[label_index],'\n' )
    print(70*'-')
    row_list.append(dict( [('text',dev_data[label_index]), ('label', dev_labels[label_index]), 
                         ('pred_class', pred_classes[label_index]), ('R', r_array[label_index])]  ))


error_df = pd.DataFrame(row_list, columns=cols)
#error_df.loc[i] = [dev_data[label_index], dev_labels[label_index], clf.predict(X_dev_transformed)[label_index], r_array[label_index]]

X_dev_transformed.shape: (6735, 18834)

max R: 579477906.9052335

R: 579477906.9052335

Dev Title 933:
 syria ceasefire? lavrov, kerry agree to fight al-nusra, no strikes on ‘rebels,’ aleppo relief

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 327208.9999931936

Dev Title 6652:
 obama tells students at town hall about how failures have shaped him

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 300213.10668885405

Dev Title 1023:
 trump tweets mock video of himself tackling, punching cnn logo

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 141965.07454670782

Dev Title 447:
 n. korea’s latest missile launch aimed at testing carrying “large scale heavy nuclear warhead”

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 106148.57283973221

Dev Title 1339:
 r

predicted class = 0 

----------------------------------------------------------------------
R: 231.19177947561602

Dev Title 6387:
 democrats push for ban and restrictions on online ammo sales

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 190.71984083689634

Dev Title 1145:
 an obama, not the president, brings down the house at democratic convention

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 185.70483822611732

Dev Title 952:
 trump says: 'most likely i won't be doing the debate' on fox news

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 162.6167528668266

Dev Title 3920:
  republicans push bill to legalize voter intimidation to help trump in pennsylvania

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 162.51516877242474

Dev Title 474:
 watch: 

predicted class = 1 

----------------------------------------------------------------------
R: 49.98347218323574

Dev Title 2625:
 as trump’s popularity soars abroad…village in india renames itself “trump” [video]

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 48.46316688387616

Dev Title 5709:
 wikileaks releases hacked democratic national committee audio files

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 46.71099679194184

Dev Title 6644:
 the pope pushes climate justice : “ambitious action” needed…add global warming agenda to “works of mercy”

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 46.62875511754213

Dev Title 2343:
  reince priebus facing misconduct complaint for interfering with fbi’s russia investigation

LABEL = 0
predicted class = 1 

---------------------------------------------------------

 trump asks congress to investigate former obama administration

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 13.840000188538374

Dev Title 364:
 a picture and its story: tear gas in nairobi

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 13.613610773928388

Dev Title 1566:
 new report: obamaphone program stashed $9 billion in private bank accounts…exposes massive windfall for phone companies

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 13.235076059660049

Dev Title 3566:
  supreme court to hear major wisconsin gerrymandering case

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 12.400327925923996

Dev Title 1476:
 trump to make day trip to washington during his vacation: white house

LABEL = 1
predicted class = 0 

----------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 6.002623399666059

Dev Title 5983:
 orlando shooter traveled to saudi arabia in 2011, 2012: msnbc

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5.991215880242218

Dev Title 4158:
 republican money class fears stigma of becoming trump donors

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5.972221572411643

Dev Title 2878:
  want to ride on air force one as a senator? then be prepared to vote for trump’s health care bill

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 5.9531938192545795

Dev Title 3644:
 u.s. judge finds texas voter id law was intended to discriminate

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5.592172087583151

Dev Title 5592:
 pro-trum

R: 2.9167799765572116

Dev Title 1061:
 two-thirds of us navy strike fighter jets grounded: navy claims no money to fix them

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 2.8606695769562016

Dev Title 677:
 kushner told flynn to contact russians last year: nbc news

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2.854233455688451

Dev Title 3238:
 trump jr. emails suggest he welcomed russian help against clinton

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2.8207074497903193

Dev Title 3472:
 clinton calls trump too unsteady to be president

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2.7866286025836717

Dev Title 797:
 bill clinton portrays hillary as 'change-maker' in speech to democrats

LABEL = 1
predicted class = 0 

------------------------

R: 1.6575665129408053

Dev Title 6112:
  beatles drummer ringo starr will not perform in bigoted state of north carolina

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.57909812179985

Dev Title 1195:
 senator paul suffers five broken ribs after assault: reports

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.5699818184890793

Dev Title 1663:
 marco rubio emerges as champion of battered republican establishment

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.550021188288333

Dev Title 1774:
 judge halts ohio law that blocked funds for planned parenthood

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.5293640388912033

Dev Title 2137:
 indicted texas mayor arrested for disrupting meeting seeking his ouster

LABEL = 1
predicted class = 0 

--------

predicted class = 1 

----------------------------------------------------------------------
R: 1.0052104600233587

Dev Title 3296:
  trump’s legal team in uproar after he forces them to attack robert mueller

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.0044546500531961

Dev Title 5260:
 will tim cook's privacy stance win or lose customers for apple? 

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.0016774907978108

Dev Title 5344:
 trump: special counsel appointment 'hurts our country terribly' - tv reports

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2159:
 flashback: hillary received $500k in jewelry from king of barbaric nation who brutally oppresses women

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2220:
 

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2207:
 france and qatar sign deals worth 12 billion euros: macron

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2209:
 trump isn’t going to invade venezuela, but what us is planning could be much worse

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2165:
 boiler room #66 – globo-terror & the pokego-pocalypse

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2173:
 britain's hammond says brexit deal is a boost to the economy

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2166:
 spanish prosecutor asks for catalan police chief to be held in custody

LABEL = 1
predicted class =

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2373:
 sentencing reform could help u.s. economy: white house panel

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2374:
 house speaker urges trump not to scrap 'dreamers' immigration policy

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2376:
  pastor demands bernie sanders convert to christianity during trump rally (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2355:
 latest fire in chinese capital kills five despite safety blitz

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2354:
 updated video: is this america? conservatives and their families experience shocking abuse 

R: 1.0

Dev Title 2381:
  trump embarrasses u.s., brags about election win in joint press conference with trudeau (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2401:
 u.s. targets iraqis for deportation in wake of travel ban deal

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2382:
 cnn cancels popular dr drew show after he tells viewers he’s “gravely concerned ” about hillary’s health [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2383:
 watch trump crowd erupt: “we have no choice! complete shut down of muslims entering u.s.”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2384:
 dutch government: 2 dead, 43 wounded on saint martin

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------

R: 1.0

Dev Title 2237:
 u.n. experts urge aung san suu kyi to meet persecuted rohingya

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2238:
  the fbi just announced the results of their investigation into hillary clinton

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2239:
  is gabby giffords being sued by her shooter? (images)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2240:
 papal official denies report sanders invited himself to vatican

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2241:
  trump jr. also met with ex-soviet intelligence officer to get dirt on hillary clinton

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2293:
 for trump's defenders, white house turmoil is politics as usual

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2294:
  this texas pro-cop facebook page shared this racist image and it went viral (images)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2295:
 obama fights to keep radical agenda alive: asks crooked ag loretta lynch to find way to challenge supreme court decision that blocked his executive order amnesty scheme

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2296:
 woman introducing hillary refuses to say “one nation under god”…hillary laughs [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2297:
 trump: cerberus ceo offered services, hopes he not ne

R: 1.0

Dev Title 1853:
 german trains collide near duesseldorf, several people injured

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1854:
 you wouldn’t allow someone to abuse your child…so why do we allow climate change radicals to target them?

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1855:
 trump outlines plans for first day in office, meets with cabinet hopefuls

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1858:
  gop congressman defies trump with blistering attack on his voter fraud investigation (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1859:
 attack on trump: mitt romney just ‘awoke a sleeping giant’

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

Dev Title 1909:
  united just took a disgusting step to smear the victim at the heart of viral pr disaster

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1910:
  white supremacist conservatives cheer ‘hail trump, hail victory,’ perform nazi salute

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1911:
 trump wants moderator free debate with hillary

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1912:
 tunisia parliament approves controversial amnesty for ben ali-era corruption

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1913:
 house leader: 'very difficult' to speed up end to medicaid expansion

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1749:
  american psychoanalytic association gives members permission to diagnose trump’s mental health

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1750:
 will vile leftists turn democrats away?…watch angry leftists openly bully americans engaged in peaceful prayer [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1751:
 obama administration reverses course on atlantic oil drilling

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1752:
 poll: trump in lead at 40.6 percent

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1753:
 german court rules “sharia police” patrolling city st

R: 1.0

Dev Title 1803:
 moscow invites 33 syrian groups to november 18 peace congress in russia

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1804:
 surreal: lone venezuelan man plays violin amidst tear gas and molotov cocktails during riot [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1805:
 no 'bespoke' brexit, transition means 'status quo': barnier

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1806:
 rachel maddow tries to embarrass trump by exposing 2005 tax returns…backfires big-time..gets destroyed on twitter!

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1807:
 two kosovo men plea guilty of plotting to attack israeli soccer team

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2070:
 davos elites struggle for answers as trump era dawns

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2097:
  fox news is imploding as greta van susteren leaves network in wake of settlement with gretchen carlson

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2098:
 oops! media forgot ted kennedy asked russia to intervene in election, help defeat ronald reagan

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2099:
  conservatives desperately turn to social media to blame refugees for orlando shooting (tweets)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2100:
 obama to grant wo

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2049:
 britain's boris johnson tells eu: put a tiger in the tank of brexit talks

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2048:
 trump to keep obama rule curbing corporate tax inversion deals

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2047:
  dem senator calls out mike pence for telling a huge lie about trumpcare

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1997:
 new senate republicans healthcare bill already in trouble

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1973:
 great pick! kt mcfarland to join team trump [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1998:
  tn lawmakers got big money to approve 279 percent interest rate loans (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2046:
 ex-cia director reveals what triggered obama to spy on trump…you won’t believe what it is!

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1999:
  nate silver: hillary clinton’s wins most resemble the democratic party

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2024:
 'for the party and the motherland!': north korea's kim heralds missile test after setbacks

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2025:
 tillerson says never considered r

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2011:
 putin, after meeting south korean leader, calls for talks on north korea crisis

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2012:
 trump aide kushner scraps plan for canada visit: canada official

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2013:
 professors head to courts, threaten strike after estacio layoffs in brazil

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2014:
 have the us, trump really abandoned ‘regime change’ in syria?

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2015:
 in georgia, costliest u.s. house race hits ugly note as election looms

LABEL = 1
pr

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2967:
 trump’s doj makes announcement on anti-gun obama-era “operation choke point”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2969:
 feel good story of the day: globalist billionaire george soros melt down…calls trump “con artist”…says he threatens “open society model”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2971:
 china firmly opposes taiwan's leadership engaging with u.s. officials

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2972:
  fox news host calls for american muslims with links to isis to be executed without trial (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

D

 pakistanis worry that president trump may favor rival india

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3017:
 lest we forget: ‘independent’ mueller is part of establishment that helped sell iraq war

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3018:
 maradona backs venezuela's maduro, signs for world cup coverage

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3019:
  the washington post just asked a major question of trump’s administration, and he’ll be livid

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3020:
 police fire teargas at kenyan vote protesters

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3021:
 

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2862:
  president obama wants a woman to be president, and that’s not all (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2863:
  gop rep. just achieved the impossible by outdoing trump’s hypocrisy on february jobs report

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2864:
 lower taxes, big gains: the stocks poised to win from tax cuts

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2865:
 north korean soldier, shot and wounded, defects to south

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2866:
 in liberal utopia of chicago, where americans are dying left & right…rahm ema

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2919:
 why this blue-collar democrat stronghold county is still fighting for trump: “he was the hope we were all waiting on, the guy riding up on the white horse” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2920:
 after sowing doubts, trump backs nato mutual defense under charter

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2921:
  watch: texas pastor cheers orlando mass shooting, prays god will finish off those in icu

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2922:
 watch: gun nut believes trump’s ‘second amendment’ threat is a call to arms

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

R: 1.0

Dev Title 3188:
 clinton warns against complacency, trump warns of world war three

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3189:
 senate intel panel to seek testimony from trump jr.: senate source

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3190:
 u.s. special envoy says kurdish referendum has 'a lot of risks'

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3191:
 hulk hogan is kicked to curb by wwe for speculation over racist comments

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3192:
 canadian tech companies ask ottawa to issue visas after u.s. ban

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 31

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3240:
 pro-hillary new york daily news writer destroys hillary, tells her to, “shut the f**k up and go away”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3241:
 trump administration releases rules on disclosing cyber flaws

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3242:
 watch: unhinged clinton supporter knocks elderly man to ground after he tries to stop him from burning u.s. flag [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3243:
 did hillary just lose her “get out of jail free” card?…senior trump advisor: “trump has not ruled out criminal probe” against hillary

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3071:
 china enshrines 'xi jinping thought', key xi ally to step down

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3072:
 house could pass tax reform if senate adds health mandate repeal: ryan

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3073:
  republicans in this state just sent a huge ‘f**k you!’ to workers and democracy with one terrible move

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3074:
 saudi foreign minister says iran main sponsor of global terror

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3075:
 russia tells two u.s. news outlets they may be affected by new 'f

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3130:
  paul ryan gets called out at town hall and gives the dumbest reason for supporting trump (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3132:
 ‘america is under israeli occupation’ by dahlia wasfi

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3133:
 in war-torn darfur, new u.s. aid chief stresses need for humanitarian access

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3134:
 america’s national security compromised by several key obama cabinet members

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3135:
 epa chief unconvinced on co2 link to global warmin

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2542:
 [video] 16 yr old arrested for violent gang beating in mcdonalds…15 yr old victim brags about new found fame

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2543:
 trump asked putin if allegations of russian meddling were true: ria

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2544:
 #berkeley irony alert! anarchists loot starbucks…destroy store windows [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2545:
 obama stares down american sniper widow taya kyle, as cnn gives her time to confront him about gun control on live tv [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

D

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2600:
  newt gingrich brutally reminds fox & friends that they created trump (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2601:
 u.s. federal employee 'gag orders' may be illegal, lawmakers warn trump

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2602:
 shocker! bratty kid who said “screw our president!” is drew carey’s son! [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2603:
 cia claims of russian intervention in us election fall flat

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2604:
 wow! alex jones releases secretly recorded interview with megyn k

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2440:
 satellite signals not helpful to argentine submarine search: navy official

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2441:
 who needs nancy pelosi when congress has paul ryan: “it’s [obamatrade deal] declassified and made public once it’s agreed to”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2442:
 it’s a movement! trump releases great new ad…with a little jab at hillary [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2443:
 the truth about alicia machado blows up…backfires big-time on hillary’s dirty campaign! [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev 

 watch: patriots fans boo players who disrespect our national anthem…shout “stand up!” at players taking a knee

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2496:

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2497:
 factbox: state-by-state poll closing times for u.s. election

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2498:
 uk counter-terrorism police charge 14-year-old boys with murder plot

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2499:
  one of hillary’s opponents could drop out and endorse her (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2500:
 unhinged radical leftists try to storm trump’s 

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2762:
  florida school overreacts to prank by pressing felony charges

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2763:
 high court emissions ruling won't deter clean energy drive 

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2764:
 [video] german mayor blames victims of mass rape, sexual assault by muslim migrants for not defending themselves

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2765:
  obama’s doing something big to raise pay for millions of americans (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2766:
 last flight departs as iraq imposes ban for kurdish in

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2813:
  watch: dumb*ss trump supporters bewilder cnn host with delusions of illegal voters

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2814:
 california judge questions trump's sanctuary city order

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2815:
 liberal rags like usa today working overtime to destroy trump…here’s proof americans aren’t listening

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2816:
 virtual tie raises doubts: can hillary clinton close the deal?

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2817:
  tn school officials told to falsify student disciplinary cod

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2657:
 outrage! student threatens violence against trump in high school yearbook quote

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2658:
 philippine president declares marawi liberated as battle goes on

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2659:
 romania expels pro-russian serb for photographing military radar

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2660:
 hillary’s campaign manager deflects and dodges questions on clinton e-mail scandal and sanctuary cities [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2661:
 “stop blaming white people for trump’s win l

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2715:
 mexico's pena nieto says disaster funds limited, need to rework budget

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2716:
  conservatives can’t believe canadian pm would understand science, accuse him of faking it

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2717:
 hillary makes first appearance since election…wow america! you dodged a bullet!

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2718:
 open-border liberals put entire nation on high alert: german spy chief warns 1,000+ radical islamists ready to attack…over 100 isis members among refugees

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

R: 1.0

Dev Title 549:
 refs walk off in protest after high school players kneel during anthem

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 550:
 this year: let’s make christmas great again…

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 551:
 oh boy! obamacare architect ripped to shreds by maria bartiromo…who’s stupid now? [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 552:
  cher does not hold back on twitter as she calls for death of michigan governor (tweets)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 553:
  support for banning assault rifles skyrockets after orlando shooting

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 596:
 u.s. gives the united nations billions of your tax dollars for what?

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 597:
 tillerson says working to bring stability to russia ties

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 598:
 julian assange reveals john podesta’s hilarious email password…”a 14 year old kid could’ve hacked podesta” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 599:
 portuguese protest over deadly forest fires, government pledges aid

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 600:
 factbox: behind trump's bid to revive travel ban at the u.s. s

R: 1.0

Dev Title 451:
 pilgrims return to mecca as haj winds down without incident

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 427:
  heavily-armed man claiming to be jesus arrested after plotting to kidnap obama’s dog

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 428:
 hillary’s new america: uniformed police officers not allowed on dnc floor [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 429:
 exclusive: former top brazil prosecutor says successor, police chief slowing graft probes

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 430:
 why isn’t this news? three black men are taken alive after shooting up school bus with children inside

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------

R: 1.0

Dev Title 509:
  greta van susteren slams fox news for not dealing with roger ailes sooner

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
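
The `R` value printed with each dev title is not defined in this part of the notebook. One reading that fits the output (R is 1.0 whenever the predicted class equals the label, and greater than 1 otherwise) is the ratio of the highest predicted class probability to the probability assigned to the true label. The sketch below assumes that definition. The helper names `r_ratio` and `rank_errors` and the sample probabilities are hypothetical and are not taken from the notebook.

```python
def r_ratio(class_probs, true_label):
    """Max predicted probability divided by the probability of the true label.

    Assumed definition: equals 1.0 when the true label is also the top prediction.
    """
    return max(class_probs) / class_probs[true_label]


def rank_errors(prob_rows, labels):
    """Return (R, index, predicted_class) tuples sorted by R, largest first."""
    scored = []
    for i, (probs, label) in enumerate(zip(prob_rows, labels)):
        # The predicted class is the index of the largest probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        scored.append((r_ratio(probs, label), i, predicted))
    return sorted(scored, key=lambda t: t[0], reverse=True)


# Hypothetical [P(class 0), P(class 1)] rows for three documents.
probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
labels = [0, 0, 1]

ranked = rank_errors(probs, labels)
for r, idx, pred in ranked:
    print(f"R: {r:.2f}  doc {idx}  LABEL = {labels[idx]}  predicted class = {pred}")
```

With a real classifier, the probability rows would come from something like `clf.predict_proba(...)`. Sorting by R puts the most confidently wrong predictions first.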


In [46]:
print(error_df.size)   # total number of cells (rows * columns)
print(len(error_df))   # number of rows
error_df.head()

8000
2000


Unnamed: 0,text,label,pred_class,R
0,"syria ceasefire? lavrov, kerry agree to fight ...",0,1,579477900.0
1,obama tells students at town hall about how fa...,1,0,327209.0
2,"trump tweets mock video of himself tackling, p...",1,0,300213.1
3,n. korea’s latest missile launch aimed at test...,0,1,141965.1
4,russian lawmaker warns: north korea ready to l...,0,1,106148.6


In [61]:
error_df.to_csv('nb_isot_text_errors.csv', sep=',')

### Using the ISOT "title" model, predict the ISOT "text" classes 

In [79]:
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.text.str.lower().values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)

print('dev_data shape:', dev_data.shape)


train_data shape: (16470,)
[' here’s what the fracking industry gave to oklahoma in 2015 (images)']

train_labels shape: (16470,)
[0 0 0 ... 0 0 0]
dev_data shape: (25904,)


In [80]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

print('X.shape:', X.shape)
print('X_dev_transformed.shape:', X_dev_transformed.shape)

# MultinomialNB
print('\nMultinomialNB trained using ISOT "title" field; predict ISOT "text" classes')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

X.shape: (16470, 13548)
X_dev_transformed.shape: (25904, 13548)

MultinomialNB trained using ISOT "title" field; predict ISOT "text" classes
accuracy: 0.547
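
An accuracy of 0.547 alone does not show whether the title-trained model is mostly predicting a single class for the much longer "text" documents. A confusion matrix would make that visible. The sketch below uses made-up labels and predictions (assumptions, not results from this notebook); in the notebook the inputs would be `dev_labels` and `clf.predict(X_dev_transformed)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
toy_true = np.array([0, 0, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(cm)

# Per-class recall: fraction of each true class that was predicted correctly
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print('recall per class:', recall_per_class)
```

A large gap between the two per-class recall values would suggest the model is leaning toward one class.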


### Using the ISOT "title" model, predict the liar_dev_labels and score the predictions (both sets lowercased first)

In [81]:
#train_data, train_labels = train_set.title.values, train_set.target.values   # original ISOT data
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values   # original ISOT data

#dev_data = liar_data.title[liar_data.binary_target != -1].values  
dev_data = liar_data.title[liar_data.binary_target != -1].str.lower().values                                            # full LIAR data
dev_labels = liar_data.binary_target[liar_data.binary_target != -1].values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (16470,)
[' here’s what the fracking industry gave to oklahoma in 2015 (images)']
train_labels shape: (16470,)
[0 0 0 ... 0 0 0]

dev_data shape: (10164,)
['rick perry has become a millionaire on the public payroll.']
dev_labels shape: (10164,)
[1 0 1 ... 1 1 1]


In [82]:

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # original ISOT training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field: Fit using ISOT data; Predict on LIAR data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field: Fit using ISOT data; Predict on LIAR data:
accuracy: 0.536
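
To judge an accuracy of 0.536, it helps to compare it with the score from always predicting the most common class. The LIAR class balance is not printed in this section, so the sketch below uses made-up labels. The helper name `majority_baseline_accuracy` is hypothetical; in the notebook it would be called on `dev_labels`.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size

# Hypothetical labels, for illustration only
toy_labels = np.array([1, 0, 1, 1, 0, 1, 1, 1])
baseline = majority_baseline_accuracy(toy_labels)
print('majority-class baseline: %3.3f' % baseline)
```

If the classifier's accuracy is at or below this baseline, the ISOT-trained model is probably not transferring useful signal to the LIAR statements.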


In [83]:
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total number of non-zero entries in the sparse training matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X.nnz): 207035
Average number of non-zero features per example (per document): 12.570


In [84]:
print('Non-zero elements in matrix (X_dev_transformed.nnz):', X_dev_transformed.nnz)   # total number of non-zero entries in the sparse dev matrix
print('Average number of non-zero features per example (per document): %.3f' %(X_dev_transformed.nnz/X_dev_transformed.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X_dev_transformed.nnz): 150700
Average number of non-zero features per example (per document): 14.827
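
The non-zero counts above only cover dev words that already exist in the training vocabulary, because `vectorizer.transform` silently drops unseen words. A rough sketch of how the out-of-vocabulary rate could be measured follows. The helper name `oov_rate` and the toy titles are hypothetical; `build_analyzer()` and `vocabulary_` are standard `CountVectorizer` members.

```python
from sklearn.feature_extraction.text import CountVectorizer

def oov_rate(vectorizer, docs):
    """Fraction of token occurrences in docs that are absent from the fitted vocabulary."""
    analyze = vectorizer.build_analyzer()   # same tokenization the vectorizer uses
    vocab = vectorizer.vocabulary_
    total = 0
    missing = 0
    for doc in docs:
        for tok in analyze(doc):
            total += 1
            if tok not in vocab:
                missing += 1
    return missing / total if total else 0.0

# Toy data (hypothetical), for illustration only
toy_train = ["trump tweets video", "obama town hall video"]
toy_dev = ["rick perry town hall", "obama video"]

toy_vec = CountVectorizer().fit(toy_train)
rate = oov_rate(toy_vec, toy_dev)
print('out-of-vocabulary rate: %.3f' % rate)
```

A high rate on the LIAR statements would mean much of each statement is invisible to the ISOT-trained model.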


In [85]:
#print(X_dev_transformed)
print(train_data)

[' here’s what the fracking industry gave to oklahoma in 2015 (images)'
 'is a revolution coming? former congressman throws down gauntlet on obama’s executive gun grab: “it’s war, defy his executive actions”'
 'shameless! hillary clinton throws benghazi families under the bus' ...
 ' trump’s russia connection says us troops lives are in danger'
 'liberals see the light! huffpo columnist lets it rip on the obama “destruction” of the democrats [video]'
 'breaking: federal judge stops obamacare transgender, abortion related protections']


In [86]:
print(dev_data)

['rick perry has become a millionaire on the public payroll.'
 'theres only one candidate under investigation -- terry mcauliffe.'
 "the law is very clear! 'the monies recouped from the tarp shall be paid into the general fund of the treasury for the reduction of the public debt.'"
 ... 'on running for the presidency in 2012'
 'since 1995, the top 400 wealthiest families have seen their incomes go up 400 percent and their tax rates go down 40 percent.'
 'says u.s. senate rival tammy baldwin wants a completely government-controlled health care system that goes far beyond obamacare and is a medicare system for all.']


In [56]:
## Print out top MISCLASSIFICATIONS

# Make predictions on the dev data and show the top n documents where the ratio R is largest, where R is:
# R = maximum predicted probability / predicted probability of the correct label

n=2000
print('X_dev_transformed.shape:', X_dev_transformed.shape)  # (number of dev examples, vocabulary size)
print()

# Compute the probabilities and predicted classes once, rather than re-running the model on every loop iteration
pred_probs = clf.predict_proba(X_dev_transformed)
pred_classes = clf.predict(X_dev_transformed)

r_array = np.zeros(X_dev_transformed.shape[0])  # one array element for each dev example
for i in range(X_dev_transformed.shape[0]):
    max_pred_prob = np.max(pred_probs[i,:])
    #print(max_pred_prob)
    correct_label = int(dev_labels[i])   # labels are 0/1, so they index the probability columns directly
    pred_prob_correct_label = pred_probs[i,correct_label]
    R = max_pred_prob / pred_prob_correct_label
    r_array[i] = R

print('max R:', np.max(r_array)) 
#print('mean R:', np.mean(r_array))
print()

cols = ['text', 'label', 'pred_class', 'R']
row_list = []

sorted_index = np.argsort(r_array)   # ascending order, sorted once; read from the end for the largest R
for i in range(n):
    label_index = sorted_index[-1 - i]
    print('R:', r_array[label_index])    
    print('\nDev Title',str(label_index)+':\n', dev_data[label_index])
    print('\nLABEL =', dev_labels[label_index])
    print('predicted class =', pred_classes[label_index],'\n' )
    print(70*'-')
    row_list.append(dict( [('text',dev_data[label_index]), ('label', dev_labels[label_index]), 
                         ('pred_class', pred_classes[label_index]), ('R', r_array[label_index])]  ))


error_df2 = pd.DataFrame(row_list, columns=cols)


X_dev_transformed.shape: (10164, 18834)

max R: 2.3083086426130772e+39

R: 2.3083086426130772e+39

Dev Title 2594:
 hospitals, doctors, mris, surgeries and so forth are more extensively used and far more expensive in this country than they are in many other countries.''	health-care	mitt-romney	former governor	massachusetts	republican	34	32	58	33	19	a fox news sunday interview
9874.json	barely-true	obamacare cuts seniors medicare.	health-care,medicare	ed-gillespie	republican strategist	washington, d.c.	republican	2	3	2	2	1	a campaign email.
3072.json	mostly-true	the refusal of many federal employees to fly coach costs taxpayers $146 million annually.	government-efficiency,transparency	newsmax	magazine and website	florida	none	0	0	0	1	0	an e-mail solicitation
2436.json	mostly-true	florida spends more than $300 million a year just on children repeating pre-k through 3rd grade.	education	alex-sink		florida	democrat	1	2	2	4	0	figures cites on campaign website
9721.json	true	milwaukee county

predicted class = 0 

----------------------------------------------------------------------
R: 1824298908270.6704

Dev Title 3971:
 the oddest thing is he doesn't want to do for america what he did for massachusetts. he did mandate health care for massachusetts, which is hillarycare, and he doesn't want to do that for america.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1321003981987.1548

Dev Title 2156:
 the first time he ever voted as a democrat was here in florida in 2008. he only voted four times in his life, and he's asking floridians to come out and vote for him.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1255236288960.444

Dev Title 6147:
 (deborah) ross defends those who want to burn the american flag, and even called efforts to ban flag-burning ridiculous, yet refused to help a disabled veteran fly the flag.

LABEL = 1
predicted class = 0 

---------

predicted class = 0 

----------------------------------------------------------------------
R: 30104761329.794304

Dev Title 8332:
 the dream act was written by members of both parties. when it came up for a vote a year and a half ago, republicans in congress blocked it. the bill hadnt changed. ... the only thing that had changed was politics.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 28643667704.36464

Dev Title 3519:
 what happens is people like warren buffett and he says this himself...pay 15 percent on the millions of dollars that they earn from wealth income... while their secretary is paying a higher rate on her work income. it's not right.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 28537739889.66962

Dev Title 4404:
 says president barack obama came into office very concerned about wiretappings but then he became president of the united states, he got

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4908918036.714075

Dev Title 3298:
 women in oregon are paid 79 cents for every dollar paid to men. if the wage gap was eliminated, a working woman in oregon would have enough money for 2,877 gallons of gas, 72 more weeks of food for her family or nearly 12 more months of rent.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4537913427.70513

Dev Title 734:
 people have actually broken down the transcripts for oral arguments and (antonin scalia) told more jokes and got more laughs than any of the other justices.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4305676498.812546

Dev Title 5420:
 when students leave our high schools and they go to the community college, 70-75 percent of them have to pay to take remedial math.

LABEL = 1
predicted class = 0 

---------

 florida has 1,200 golf courses. i think 58 million rounds played a year in florida. weve got 44 percent of all travel golf in the country here. 5 million people come here just for golf.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 718815962.6766292

Dev Title 6070:
 gov. palin ... is somebody who actually doesn't believe that climate change is man-made.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 713259676.0114233

Dev Title 2359:
 this is the most generous country in the world when it comes to immigration. there are a million people a year who legally immigrate to the united states.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 692204992.8925891

Dev Title 8680:
 says a study shows that children who live with a biological parent and the parents boyfriend or girlfriend have a 20 times greater chance of

R: 149413235.42444047

Dev Title 6750:
 american hustleshows the fbi making real-life bribes to washington politicians. i know, because as your u.s. senator, i turned them down.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 146526709.04539287

Dev Title 6952:
 a national study of 2,500 charter schools shows that maybe 20 percent do better than the community public schools, 40 percent or so do worse and the rest are not having any significant difference.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 146271337.75690976

Dev Title 1252:
 when these same republicans - including mr. boehner - were in charge, the number of earmarks and pet projects went up, not down.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 142066711.55615392

Dev Title 1227:
 even our attorney general who is a strong democrat, she has said

R: 56824137.421627216

Dev Title 1465:
 with those first principles, it allowed a fellow like me to get in his truck and go from one end of the state to the other; started 20 points down and wound up 20 points ahead on election night.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 55950498.42614751

Dev Title 6219:
 last year, korea sold nearly half a million cars in our country. ... the united states, you know how much we sell to them? six thousand. what kind of deal is that?

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 55060489.92560798

Dev Title 2450:
 says that when president obama visited el paso, he pronounced that the border with mexico and the united states was safer than it ever was in history.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 54169090.11006554

Dev Title 5087:
 61 percent of non-tea

predicted class = 0 

----------------------------------------------------------------------
R: 27582384.447091848

Dev Title 2735:
 one out of every four students fails to earn a high school diploma. in our major cities across america, half of our kids dont graduate.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 27504720.42511426

Dev Title 2819:
 when you have 8,000 veterans a year committing suicide, then you have a serious problem.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 26090295.18430614

Dev Title 1994:
 over the past 10 years, the number of people living in lower manhattan has nearly doubled. in fact, lower manhattan has added more people over the last 10 years than atlanta, dallas and philadelphia combined.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 26077236.34916437

Dev Title 5544:
 when

R: 16529478.37420926

Dev Title 9948:
 never once did they (house republicans) actually cut spending or reduce the state budget. even when they cut taxes in 2005, they increased spending. they never paid for their tax cuts.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 16387251.495297706

Dev Title 8939:
 says barack obama said when he was running for office four years ago that he would halve the annual deficit by the end of his first term. that simply has not happened.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 16201836.983399617

Dev Title 6825:
 switzerland and the netherlands . . . cover all their citizens using private insurers, and they do so for much less cost.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 15736618.122870173

Dev Title 5732:
 there currently are 825,000 student stations sitting e

predicted class = 0 

----------------------------------------------------------------------
R: 7488261.89532145

Dev Title 1649:
 the r.i. turnpike and bridge authority was supposed to exist only until the bonds used to build the newport bridge were paid off through tolls. once the bonds were paid, the newport bridge was to be transferred to the state of rhode island and become toll-free.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7486891.130851949

Dev Title 5951:
 members of the public are being charged $50 to hear gov. scott walker and a dozen members of his administration talk about jobs and the economy at lambeau field.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7301456.692649285

Dev Title 6235:
 in every committee when the health care bill was considered, democrats voted against an amendment that would require members of congress and their staff to tak

predicted class = 0 

----------------------------------------------------------------------
R: 4113662.866962374

Dev Title 1715:
 says paul ryan voted for two wars that were unpaid for, voted for the bush tax cuts that were unpaid for, voted for the prescription drug bill that cost as much as my health care bill -- but wasnt paid for.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4057971.942213712

Dev Title 1200:
 we just dont want to get to be like louisiana, where you have drive-up daiquiri shops.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4009157.084098642

Dev Title 7610:
 in other states (where illegal immigrants have been allowed to get drivers licenses) their insurance premiums for everybody have gone down.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3976676.0929697524

Dev Title 5691:
 the 

predicted class = 0 

----------------------------------------------------------------------
R: 2527487.0005962034

Dev Title 1968:
 the epa was asked about an environmental citation for the city landfill in nashua, n.h. but didnt know why it was cited.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2525693.830137111

Dev Title 7788:
 john mccain "voted against the tax cuts of 2001 and 2003, wrongly claiming they helped only the rich."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2519339.1351626073

Dev Title 9801:
 says his views on reparations for slavery are the same as barack obamas and hillary clintons.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2497954.8832776677

Dev Title 2350:
 i think with the exception of the last year or maybe the last two years, we were at 100 percent when it came to contri

R: 1632749.4117190544

Dev Title 4185:
 huckabee "was one of the highest taxing governors that we had in this country and rivaling bill clinton in terms of the cato ratings."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1615724.6231822239

Dev Title 9923:
 almost every country on earth sees america as stronger and more respected today than they did eight years ago when i took office.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1587605.6374438428

Dev Title 7003:
 florida has the most concealed weapon permits in the nation, nearly double that of the second state, which is texas.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1551718.4750612339

Dev Title 4744:
 our reserves are now in much better shape than they were just a few years ago.

LABEL = 1
predicted class = 0 

----------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1085310.0246910404

Dev Title 6225:
 every single month since 1985 has been warmer than the historic average. all 12 of the warmest years on record have come in the last 15 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1079918.571071568

Dev Title 8760:
 these 60 acres (the atlanta braves want to build a stadium on) have produced zero splost money for parks and recreation, have produced zero money for education.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1075029.3605399993

Dev Title 394:
 right now we are spending at an all-time high, close to 25 percent of our gdp [is] being spent on the federal government. but our revenues are at an almost all-time low of about 15 percent [of gdp].

LABEL = 1
predicted class = 0 

---------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 727130.0633366465

Dev Title 1957:
 once people become citizens under the dream act and turn 21, they can sponsor their illegal immigrant parents for legalization.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 726939.8227531625

Dev Title 5953:
 atlanta has issued an increasing number of citations - and collected an increasing amount of revenue - since mayor kasim reed took office in 2010.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 713701.5841846956

Dev Title 1680:
 in the 1950s and 1960s, the minimum wage was such that it would lift you out of poverty.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 704271.4304418837

Dev Title 7264:
 says mitt romney told university students in ohio that to start a business,

predicted class = 0 

----------------------------------------------------------------------
R: 519946.8091856711

Dev Title 4975:
 many dont know that bill young was once the minority leader in the florida senate...because he was the only republican senator.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 509514.8224841226

Dev Title 3753:
 rhode island could tell you who has a camper, but we couldnt figure out who has a gun.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 506851.99607547815

Dev Title 3185:
 for every extra year that a girl stays in secondary school, her chance of getting infected with hiv/aids decreases by half.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 492844.237003661

Dev Title 6038:
 i never gave up custody of my children. i never lost custody of my children.

LABEL = 1
predicted cl

R: 348365.3900798529

Dev Title 5864:
 long-term federal investment in u.s. airports is urgent because there was a recent survey of the top airports in the world, and there was not a single u.s. airport that came in the top 25.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 347625.38455065235

Dev Title 3823:
 under the arizona immigration law, police are required to check immigration status if someone's "lawn is overgrown" or if a dog is "barking too loudly."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 347459.0087881945

Dev Title 4263:
 he admits he still doesn't know how to use a computer, can't send an e-mail.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 343014.5027196352

Dev Title 6904:
 says raising the state income tax rate on millionaires to offset property taxes for other residents is not a tax

R: 245772.92379661332

Dev Title 2965:
 says hillary clintons campaign hasnt been clear about when she wiped herserverof her work emails.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 240829.06908955247

Dev Title 1049:
 georgia is now the eighth most populous state in the nation, moving from the number 10 position in just four years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 234811.41735986358

Dev Title 9674:
 the merger of georgia state university and georgia perimeter college will make gsu one of the largest universities in the nation, with more than 54,000 students.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 233579.24544724976

Dev Title 6995:
 in the past year, floridians, not government, created almost 135,000 new private sector jobs. we netted more than 120,000 total jobs in the first 11 mon

R: 183898.5329266492

Dev Title 3839:
 well, you know, the teamsters wanted to drill in alaska. i voted against drilling in alaska. so it's not like i'm a slam dunk on every issue.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 183387.91322879127

Dev Title 2552:
 president barack obama said over 20 times he did not have the legal authority to act as he did on immigration.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 179420.4898862853

Dev Title 695:
 planned parenthood is an organization that funnels millions of dollars in political contributions to pro-abortion candidates.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 177952.9051337039

Dev Title 6683:
 two years after the worst recession most of us have ever known ... corporate profits are up.

LABEL = 1
predicted class = 0 

---------------------------

R: 142451.08109532186

Dev Title 3925:
 new mexico was 46th in teacher pay (when he was elected), now we're 29th.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 139767.9014429509

Dev Title 3879:
 greg abbott has benefitted from payday lenders who have given him $300,000 and then received a ruling from him that they can operate in a loophole in the law that allows them to charge unlimited rates and fees.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 139589.32719972671

Dev Title 4159:
 the atlanta braves are the oldest continuously operating professional sports franchise in america.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 138078.3389468027

Dev Title 1407:
 as florida's cfo, i shut down krakow's scam and refunded more than $1.2 million to josephine and other victims of this con man.

LABEL = 1
predict

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 106835.56385597168

Dev Title 2780:
 the claim ... that we plan to set up panels of bureaucrats with the power to kill off senior citizens ... is a lie, plain and simple.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 106045.52175268663

Dev Title 6475:
 more americans have died from guns in the united states since 1968 than on battlefields of all the wars in american history.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 105806.28021778421

Dev Title 152:
 to this day, (the cuban government) is a regime that provides safe harbor to terrorists and fugitives.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 105159.81860930251

Dev Title 300:
 91% of suspected terrorists who attempted to buy guns in america 

R: 82063.1058919746

Dev Title 4258:
 most of the newspapers that endorsed alex sink also endorsed barack obama.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 81026.8196534511

Dev Title 9115:
 in 1978, a student who worked a minimum-wage summer job could afford to pay a years full tuition at the 4-year public university of their choice.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 80457.11691391785

Dev Title 3234:
 a black male baby born today, if we do not change the system, stands a one-in-three chance (of) ending up in jail.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 79384.07482208834

Dev Title 2211:
 this march, for the first time in human history, the monthly average carbon dioxide in our atmosphere exceeded 400 parts per million. the range had been 170-300 parts per million for hundreds of tho

R: 65306.75485609134

Dev Title 6212:
 medicare and medicaid are the single biggest drivers of the federal deficit and the federal debt by a huge margin.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 65254.5711008529

Dev Title 4231:
 how many concerts would taylor swift have to perform to pay off one day of interest on our national debt? she would have to perform every day for three years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 64863.72499190085

Dev Title 2432:
 members of congress did not have three days to read the bill when the stimulus was rushed into law.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 64553.21537973654

Dev Title 940:
 theres a big chunk of the country that thinks that i have been too soft on wall street.

LABEL = 1
predicted class = 0 

---------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 52319.99391218967

Dev Title 6641:
 over the last 10 years, incomes for the top 1 percent have grown. meanwhile, the bottom half of the country, theyve seen their wages stagnate.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 52217.90455178417

Dev Title 8937:
 a proposed revenue smart cap gives floridians a voice, requiring a 60 percent vote by citizens in order to impose a new tax, fee, license, fine, charge or assessment.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 52024.00290734343

Dev Title 7791:
 florida democrats put my social security number and my wifes employment identification number in a mail piece.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 51436.87617189747

Dev Title 51:
 since being elected 

R: 41897.63513104064

Dev Title 2380:
 republican ideas on health care dont give people an option to even enroll in something that they can afford.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 41444.08265090869

Dev Title 2493:
 obamas secretary of energy, dr. steven chu, has said publicly he wants us to pay european levels (for gasoline), and that would be $9 or $10 a gallon.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 41332.315389850264

Dev Title 3773:
 every republican nominee since richard nixon, who at one time was under an audit, has released their tax returns.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 40949.21046107449

Dev Title 7647:
 we have 10,000 baby boomers retiring every day.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 40

R: 33032.2132050393

Dev Title 1104:
 says madison mayor paul soglins stated intent when proposing that city contractors disclose private political donations was to discourage contributions to organizations with which he disagrees.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 32964.68197521093

Dev Title 3436:
 amazing fact: senate has already voted on more amendments in 2015 than reid allowed all year last year.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 32824.356392951064

Dev Title 8275:
 tennessee students now cover about 67 percent of the cost of their education at public universities, and some 60 percent at community colleges.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 32629.104748528775

Dev Title 7260:
 the obama administration is using as its legal justification for these airstrikes (on the

predicted class = 0 

----------------------------------------------------------------------
R: 26486.158096200965

Dev Title 6176:
 my debt to gdp was the lowest or one of the lowest of modern presidents. my taxes to gdp was the lowest and my spending to gdp was too.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 26159.781305162694

Dev Title 6314:
 according to the most recent report, wisconsins high school graduation rates are also up again -- to third best in the country.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 26105.72764345869

Dev Title 4251:
 says secretary of state john kerry, when he was a senator, flew to managua and met with a communist dictator there, daniel ortega, and accused the reagan administration of engaging in terrorism.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 26028.90158760

predicted class = 0 

----------------------------------------------------------------------
R: 22207.476885691915

Dev Title 4502:
 we spend less than 2 percent more every year. that is the lowest increase in spending since they have been keeping numbers.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 22101.563490367753

Dev Title 10022:
 every month since 9/11, there have been as many suicide attacks against the united states and its allies as there were in all the years leading up to 9/11.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 21948.965103624727

Dev Title 6907:
 when we had a conservative republican president we were losing 750,000 jobs a month.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 21943.092197951177

Dev Title 3020:
 under the federal controlled substance act, marijuana is listed in th

R: 18283.83736793964

Dev Title 319:
 currently it costs more than a penny for the u.s. mint to make a one cent coin and more than a nickel to make the five cent piece.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 18270.508369930132

Dev Title 5345:
 funding the federal health care law without a tax hike will require the state to cut nearly a quarter of its annual budget.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 17963.392127232746

Dev Title 2402:
 says nfl commissioner roger goodell interviewed domestic abuse victim janay rice with ray rice present and for every domestic violence agency, every law enforcement agency, thats a no-no.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 17953.24601908468

Dev Title 9242:
 ten years ago, john mccain offered a bill that said he would ban a candidate from paying

R: 14623.461148250575

Dev Title 9684:
 says $38 billion in spending cuts in federal budget compromise is less than $1 billion in real cuts.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 14604.544055204868

Dev Title 2723:
 black children constitute 18 percent of the nations public school population but 40 percent of the children who are suspended or expelled.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 14591.356965966239

Dev Title 1604:
 says joanne kloppenburg has told us she thinks its her job to promote a more equal society.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 14546.035399508766

Dev Title 4589:
 says a rather extraordinary amount of non-classroom employees were added by texas school districts over the last decade.

LABEL = 1
predicted class = 0 

------------------------------------------

Dev Title 7653:
 many types of fish and shellfish from waters across the state are labeled unsafe to eat.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 12251.692168361233

Dev Title 3649:
 roughly 500,000 georgians -- or about 5 percent of the states residents -- have gone through a background check to legally obtain a georgia weapons carry license.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 12128.022344070274

Dev Title 4641:
 the gop platform that seeks further limits on abortion and is silent on an exception for rape has been there for more than 30 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 12069.171289284868

Dev Title 3535:
 our graduation rate is the highest it's ever been.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 12011.79

predicted class = 0 

----------------------------------------------------------------------
R: 10096.20068882405

Dev Title 8315:
 as dane county executive, kathleen falk raised property taxes by millions of dollars every year and approved the second highest increase in the state in 2010.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 10084.172375766653

Dev Title 7193:
 says there have been some job gains in the mcmansion state since mr. christie took office, but they have lagged gains both in the nation as a whole and in new york and connecticut, the obvious points of comparison.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 10053.491398846603

Dev Title 7194:
 every day, 34 americans are murdered with guns.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 10013.437372769557

Dev Title 1658:
 nearly 6 out o

predicted class = 0 

----------------------------------------------------------------------
R: 8652.352281246789

Dev Title 3973:
 in last weeks debate, bernie questioned hillarys commitment to fighting climate change because a whopping 0.2 percent of the money given to our campaign has come from employees of oil and gas companies. not even 2 percent, mind you: 0.2 percent.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8649.408579267234

Dev Title 9679:
 ted cruz was the longest-serving solicitor general in the history of texas.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8619.49621747625

Dev Title 5437:
 (obamas) entire national security team, including his secretary of state, said we want to arm and train and equip (syrian rebel forces), and he made the unilateral decision to turn them down.

LABEL = 1
predicted class = 0 

------------------------------------

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7569.733909347404

Dev Title 7779:
 says if we raise the number of third-graders who read at a third-grade level, we affect everything, from graduation rates to incarceration rates.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7507.483692719188

Dev Title 9077:
 before the republican wave in 2010, democrats had an advantage on the generic ballot in congress. even in 1994 with the gingrich revolution ... democrats had that advantage.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7488.659705948351

Dev Title 157:
 federal officials declared that grant funds could be used only for milwaukees streetcar project, meaning it isnt possible to redirect the money to other modes of public transportation or to our public schools.

LABEL = 1
predicted class = 0 

----------

R: 6739.000004305607

Dev Title 4875:
 two-thirds of the people who receive the minimum wage are female.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 6646.042255626682

Dev Title 5968:
 the highest paid employee of the state of rhode island is a basketball coach.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 6592.738506809064

Dev Title 2827:
 the extra point is almost automatic. (the nfl) had five missed extra points this year out of 1,200 some odd attempts.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 6586.676012969461

Dev Title 5136:
 i am going to be on the ballot in all 50 states. there is no other third-party candidate thats going to come close to achieving that.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 6549.9075017223195

Dev Title

predicted class = 0 

----------------------------------------------------------------------
R: 5633.370161430319

Dev Title 6837:
 new jersey loses net, that is minus those who come into the state, 30,000 students a year.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5583.034836006587

Dev Title 8997:
 it is a greater crime to have an untagged alligator than to host an open house party (for kids).

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5557.241863714718

Dev Title 5116:
 says virginia economic development officials decided they didnt want to bid on his companys electric automobile plant.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 5541.667796607344

Dev Title 6214:
 one third of our age group (millennials) have moved back in with their parents.

LABEL = 1
predicted class = 0 

-------------------

R: 4917.273483766707

Dev Title 8812:
 unlike virtually every other campaign, we dont have a super pac.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4912.970280568632

Dev Title 161:
 during the reagan era, while productivity increased, "wages for working people remained frozen."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4911.848305117481

Dev Title 8912:
 of the 98 top oxycodone-dispensing doctors who used to live in florida, today, there are none.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4911.562723016746

Dev Title 7247:
 as ceo of wwe, linda mcmahon was caught tipping off a ringside physician about a federal investigation into illegally distributing steroids to wrestlers.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4901.2852321580

predicted class = 0 

----------------------------------------------------------------------
R: 4282.357246087313

Dev Title 1547:
 our businesses have created jobs every single month since (obamacare) became law.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4257.822068916563

Dev Title 6698:
 trey radel does not even qualify to drive a lee county school bus at this point, yet he occupies a seat in congress.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4208.65995819857

Dev Title 5645:
 hillary clinton agrees with john mccain "by voting to give george bush the benefit of the doubt on iran."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4151.971087077038

Dev Title 6371:
 our children's safety is potentially at risk because nearly half of the apple juice consumed by our children comes from apples grown in

predicted class = 0 

----------------------------------------------------------------------
R: 3690.7238501033817

Dev Title 3979:
 by the time i left the state department, economic growth was up and opium production was down in afghanistan, while infant mortality declined and school enrollment rose by more than sevenfold.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3669.649845962513

Dev Title 7162:
 according to the centers for disease control and prevention (cdc), about 120 americans on average die from a drug overdose every day. overall, drug overdose deaths now outnumber deaths from firearms.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3657.8322028903785

Dev Title 257:
 four percent of american citizens are black males, but they are 35 percent of murder victims.

LABEL = 1
predicted class = 0 

-------------------------------------------------------------

R: 3119.1186847794393

Dev Title 5783:
 the reason we have a national debt is not because of defense spending. what is driving our long-term debt are medicare and social security programs.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3114.5235369342067

Dev Title 9648:
 millions of dollars are spent by planned parenthood to elect democrats to the house of representatives and the senate.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3114.473156940081

Dev Title 7798:
 we now actually import more oil than we did before 9/11.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3104.905953916894

Dev Title 6339:
 at bain capital, we helped start an early childhood learning company called bright horizons that first lady michelle obama rightly praised.

LABEL = 1
predicted class = 0 

--------------------------------

R: 2812.362114289692

Dev Title 3996:
 congress can tell [the supreme court] which cases they ought to hear. we have that authority.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2810.0283774409245

Dev Title 3337:
 guantanamo detainees get taxpayer-paid-for prayer rugsthey had honey-glazed chicken and rice pilaf.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2809.27368239853

Dev Title 9069:
 saysbernie sanders voted for what we call the charleston loophole.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2786.376864226691

Dev Title 2379:
 federal spending is all discretionary, other than interest on the national debt. social security is discretionary. we have the discretion to change the law. same is true with medicare and medicaid.

LABEL = 1
predicted class = 0 

----------------------------------------

R: 2373.887592652843

Dev Title 4681:
 the republican from georgia [u.s. rep. jack kingston], he hasnt even been to a nascar race.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2353.3927419621064

Dev Title 312:
 a republican-sponsored wisconsin mining bill will take at least seven years to create jobs.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 2351.8286598306886

Dev Title 247:
 says u.n. arms treaty will mandate a new international gun registry.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 2343.200858575316

Dev Title 7873:
 wisconsin republicans repealed a statewide fair pay law that made sure women are treated fairly on the job.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2340.9702663312764

Dev Title 10112:
 bernie sanders opposes the

predicted class = 1 

----------------------------------------------------------------------
R: 1894.6250059995043

Dev Title 3169:
 democrats have lost more than 900 state legislators since barack obama has been president.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1893.9276935548621

Dev Title 8944:
 texas is home to millions of latinas, but the state has never elected a latina to congress.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1875.2089940491242

Dev Title 9400:
 the current debate over authorizing military action against the islamic state would be the first time congress would place limits on the commander-in-chiefs ability to be commander-in-chief.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1874.682926465972

Dev Title 2769:
 today, the united states has, sadly, one of the lowest voter t

predicted class = 0 

----------------------------------------------------------------------
R: 1623.0550517014597

Dev Title 1770:
 the proposed mine in northern wisconsin would be built without any government oversight, and will be nine miles long.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1622.0151473976453

Dev Title 2997:
 chinese tire imports threatened 1,000 american jobs, so president obama stood up to china and protected american workers. mitt romney attacked obamas decision.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1612.6175277888287

Dev Title 4876:
 the united states death rate is two-and-a-half times higher for those who do not have a high school education.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1610.8917414969976

Dev Title 4202:
 passing a federal firearms background check th

R: 1403.8193223172045

Dev Title 8494:
 kasim reed has kept every promise he made as a candidate.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1401.3822041840046

Dev Title 5535:
 most of, if not all of, the [dekalb school construction] projects always came in on or were under budget.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1371.742997039115

Dev Title 8028:
 at the beginning of world war ii, we had a relatively small army, smaller than portugals.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1356.3381292683146

Dev Title 4342:
 you can buy lobster with food stamps.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1354.0795239950617

Dev Title 1888:
 gun violence is by far the leading cause of death for young african american men, outstrippin

R: 1135.3564093654418

Dev Title 5874:
 george bush ... used a signing statement (on a fema bill) to say, 'i don't have to follow that, unless i choose to.' 

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1135.3205065840118

Dev Title 3630:
 on average, college students are taking six years to get a four-year degree.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1121.1914298275492

Dev Title 3209:
 there are countries in africa where they have higher vaccination rates than here in the united states.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1120.1345516770107

Dev Title 1009:
 on support for trade promotion authority, calledfast-track

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1115.4435313767458

Dev Title 6883:
 90 percent of babies with d

predicted class = 0 

----------------------------------------------------------------------
R: 903.3382099154194

Dev Title 813:
 from 1947 to 1979, family incomes for rich, middle-income and poor americans grew about the same rate. but since 1979, incomes for rich families have grown much faster.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 899.8759799108651

Dev Title 2914:
 when president bush took office in 2001, he inherited a $236 billion budget surplus, with a projected 10-year surplus of $5.6 trillion. when he ended his term, he left a $1.3 trillion deficit and a projected 10-year shortfall of $8 trillion.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 895.4947205857611

Dev Title 5037:
 newt gingrich was fined $300,000 for ethics violations.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 892.54211

R: 792.7735483818047

Dev Title 9808:
 federal spending is the highest its been as a share of our economy in 60 years (and) revenue is the lowest its been as a share of our economy in 60 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 790.5592825803794

Dev Title 6922:
 mitt romney drove to canada with the family dog seamus strapped to the roof of the car.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 789.8027996906985

Dev Title 3397:
 we spent $3-million of your money to study the dna of bears.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 783.3763140869502

Dev Title 5684:
 says there are ohio turnpike workers getting paid $66,000 a year to collect tolls that machines might collect.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 775.850437

R: 688.716202139837

Dev Title 1997:
 says time magazine called him "one of america's best governors."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 684.9982024433659

Dev Title 113:
 says that before health care reform, one of every three health care dollars spent -- more than $800 billion a year -- didnt go for health care.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 684.9543525041544

Dev Title 5276:
 a bill to aid state and local governments is fully paid-for by closing costly corporate tax loopholes.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 682.8046715533666

Dev Title 4789:
 says texas state funds were spent on a tv series on spouses cheating on their wives, kind of glorifying the act of cheating.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 620.6937106289352

Dev Title 4298:
 says sen. mitch mcconnell is the no. 1 recipient of contributions from lobbyists this cycle.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 619.9307037231373

Dev Title 1481:
 china holds 26 percent of the u.s. debt.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 614.7488454943849

Dev Title 8743:
 a bipartisan background check amendment outlawed any (gun) registry. plain and simple, right there in the text.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 613.1273536795973

Dev Title 6705:
 says president barack obama gives students the right to repay (federal) loans as a clear, fixed, low percentage of their income for up to 20 years.

LABEL = 1
predicted class = 0 

------------

----------------------------------------------------------------------
R: 577.799667122053

Dev Title 3617:
 hillary clinton advocates "a freeze on foreclosures. barack obama said no."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 577.0951506456847

Dev Title 6939:
 the president referred to the syrian opposition just a few months ago as pharmacists and doctors, and so on.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 576.4018311485905

Dev Title 6386:
 jim renacci cheated on his income taxes and is a deadbeat citizen.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 575.5865302427522

Dev Title 6643:
 the state covered a smaller percentage of the cost of k-12 education in 2013 than it did in 2002.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 575.5

predicted class = 1 

----------------------------------------------------------------------
R: 514.8190505564496

Dev Title 9847:
 my campaign alone has created more jobs in the state of rhode island than narragansett beer.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 513.7403532260187

Dev Title 9585:
 says barack obama is the first president in modern history not to have a single year of 3 percent growth.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 513.6307135277632

Dev Title 1510:
 says democrats cut $35 million from the state budget for information technology improvements.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 513.1201699280001

Dev Title 6927:
 the u.s. is borrowing approximately $2.52 for every $1 of economic growth so far in 2012.

LABEL = 1
predicted class = 0 

-----------------------

R: 465.4439207156054

Dev Title 1721:
 marco rubio spent $400k of your tax dollars remodeling offices, and building a members-only lounge.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 465.3397427507184

Dev Title 3870:
 under this tax cut, middle-class families dont save enough for a weeks worth of groceries, while millionaires save enough to go on an exotic vacation.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 464.1924266251449

Dev Title 5324:
 says mitt romney supports cap and trade.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 463.02955963267806

Dev Title 10139:
 bob barr has changed his position on the defense of marriage act over the years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 462.7674373483122

Dev Title 1570:
 in 1999, the n

predicted class = 0 

----------------------------------------------------------------------
R: 412.843423095906

Dev Title 5562:
 the u.s. has now spent more on reconstructing afghanistan than was spent on the marshall plan and the reconstruction of europe.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 411.91556420257666

Dev Title 2753:
 pasco county has the second highest population of homeless in all of florida.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 411.79249490059897

Dev Title 5263:
 the majority of (the american people) voted for a democratic house.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 411.19290805208277

Dev Title 6067:
 texas is growing twice as fast as the rest of the country.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------


In [87]:
print(error_df2.size)
print(len(error_df2))
error_df2.head()

NameError: name 'error_df2' is not defined

In [60]:
error_df2.to_csv('nb_isot_text_to_liar_errors.csv', sep=',')

### REVERSE: Using the LIAR model, predict the ISOT "title" field and score the predictions (both are lowercased first)

In [88]:
train_data = liar_data.title[liar_data.binary_target != -1].str.lower().values   # full LIAR data
train_labels = liar_data.binary_target[liar_data.binary_target != -1].values 
 
dev_data = train_set.title.str.lower().values                                      # ISOT data
dev_labels = train_set.target.values 


train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
#print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (10164,)
train_labels shape: (10164,)
[1 0 1 ... 1 1 1]

dev_data shape: (16470,)
dev_labels shape: (16470,)
[0 0 0 ... 0 0 0]


In [89]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # LIAR training data (vocabulary is built from LIAR only)
X_dev_transformed = vectorizer.transform(dev_data) # ISOT titles mapped onto the LIAR vocabulary

# MultinomialNB
print('\nMultinomialNB using "title" field: Fit using LIAR data; Predict on ISOT data:')
alpha = 1.0   # additive (Laplace) smoothing
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field: Fit using LIAR data; Predict on ISOT data:
accuracy: 0.671
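
A single accuracy figure does not show which class the errors fall on. The sketch below counts (label, prediction) pairs in plain Python so it runs without the notebook's state. `confusion_counts` and the toy arrays are hypothetical and are not results from this notebook; in the notebook the inputs would presumably be `dev_labels` and `clf.predict(X_dev_transformed)`.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; hypothetical helper for illustration."""
    return Counter(zip(true_labels, predicted_labels))

# Toy data only; these are not results from the notebook.
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [0, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(toy_true, toy_pred)
for (true_label, pred_label), n in sorted(counts.items()):
    print('LABEL = %d, predicted class = %d: %d' % (true_label, pred_label, n))

# Accuracy is the share of pairs where label and prediction agree.
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(toy_true)
print('accuracy: %3.3f' % accuracy)
```

Reading the off-diagonal counts next to the accuracy makes it easier to see whether a model is mostly predicting one class.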


### Split LIAR data into Train/Dev/Test, train model, and evaluate on its own type 

In [90]:
#liar_data.head(5)
liar_data = liar_data[liar_data.binary_target != -1]
liar_data.head(5)

Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context,binary_target
0,2653.json,mostly-true,Rick Perry has become a millionaire on the pub...,"candidates-biography,message-machine",bill-white,Former mayor of Houston,Texas,democrat,2.0,3.0,5.0,7.0,3.0,a TV ad,1
1,8220.json,false,Theres only one candidate under investigation ...,ethics,ken-cuccinelli,Attorney General,Virginia,republican,1.0,10.0,3.0,2.0,1.0,a TV ad.,0
2,1467.json,true,The law is very clear! 'The monies recouped fr...,economy,judd-gregg,U.S. Senator,,republican,0.0,0.0,0.0,1.0,0.0,a Senate budget hearing,1
3,746.json,true,"John McCain wants to ""give oil companies anoth...",taxes,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,"Oxford, Miss.",1
4,2100.json,barely-true,"Of all the illegals in America, more than half...","immigration,message-machine",john-mccain,U.S. senator,Arizona,republican,31.0,39.0,31.0,37.0,8.0,a campaign ad,0


In [91]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[1 0]

binary_target,  number of examples
1 4507
0 5657
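
The per-class loop above can also be written with pandas' built-in `value_counts`, which additionally gives class fractions. This is a minimal sketch on a toy frame; `toy` is a hypothetical stand-in for `liar_data`, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for liar_data (assumption, not the real data).
toy = pd.DataFrame({"binary_target": [1, 0, 0, 1, 0, -1]})
toy = toy[toy.binary_target != -1]          # same filter as in the notebook

# Absolute counts per class, most frequent class first.
counts = toy.binary_target.value_counts()

# Relative frequencies; the largest one is the majority-class share.
fractions = toy.binary_target.value_counts(normalize=True)

print(counts)
print(fractions.round(3))
```

Applied to the counts printed above (4507 and 5657 out of 10164), the majority class 0 makes up roughly 0.557 of the filtered data.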


In [92]:
# train/dev/test split
#train_dev_split = 0.8

train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

if abs(train_fract + dev_fract + test_fract - 1.0) < 1e-9:   # tolerate floating-point rounding
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

train_set = liar_data[ :int(len(liar_data)*train_fract)].reset_index(drop=True)
dev_set = liar_data[int(len(liar_data)*(train_fract)) : int(len(liar_data)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = liar_data[int(len(liar_data)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)

Split fractions add up to 1.0
training set:  (7114, 15)
dev set:  (1525, 15)
test set:  (1525, 15)
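
The split above slices the rows in file order, so the three sets are only representative if the file is already shuffled. As a hedged alternative, not what this notebook ran, scikit-learn's `train_test_split` can shuffle and stratify on the label. The sketch below uses a hypothetical toy frame and a fixed `random_state`; swapping it into the notebook would change the reported accuracies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for liar_data (assumption).
toy = pd.DataFrame({
    "title": ["statement %d" % i for i in range(100)],
    "binary_target": [0] * 60 + [1] * 40,
})

# Step 1: carve off 70% for training, keeping the class balance.
train_set, rest = train_test_split(
    toy, train_size=0.70, stratify=toy.binary_target, random_state=0)

# Step 2: split the remaining 30% in half to get 15% dev and 15% test.
dev_set, test_set = train_test_split(
    rest, test_size=0.50, stratify=rest.binary_target, random_state=0)

print('training set: ', train_set.shape)
print('dev set: ', dev_set.shape)
print('test set: ', test_set.shape)
```

Stratifying keeps the share of each label about the same in all three sets, which makes dev and test accuracy easier to compare.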


In [93]:
# write the LIAR dev set to a CSV file
dev_set.to_csv('liar_dev_set.csv', sep=',')

In [94]:
train_data = train_set.title[train_set.binary_target != -1].str.lower().values   # LIAR training split
train_labels = train_set.binary_target[train_set.binary_target != -1].values

dev_data = dev_set.title.str.lower().values                                      # LIAR dev split
dev_labels = dev_set.binary_target[dev_set.binary_target != -1].values

print('train_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print('dev_data shape:', dev_data.shape)
print('dev_labels shape:', dev_labels.shape)

train_data shape: (7114,)
train_labels shape: (7114,)
dev_data shape: (1525,)
dev_labels shape: (1525,)


In [95]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)      
print(X.shape)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using LIAR training data; Predict on LIAR dev data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

(7114, 10391)

MultinomialNB using LIAR training data; Predict on LIAR dev data:
accuracy: 0.630
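
An accuracy of 0.630 is easier to judge next to a majority-class baseline. From the counts printed earlier, always predicting class 0 would score roughly 0.557 on the full filtered data; the dev split's own balance may differ. One way to measure this directly is scikit-learn's `DummyClassifier`. The sketch below uses hypothetical toy labels, not the LIAR labels; in the notebook, `train_labels` and `dev_labels` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels (assumption); replace with train_labels / dev_labels.
toy_train_labels = np.array([0, 0, 0, 1, 1])
toy_dev_labels = np.array([0, 1, 0, 0])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_placeholder = np.zeros((len(toy_train_labels), 1))
X_dev_placeholder = np.zeros((len(toy_dev_labels), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_placeholder, toy_train_labels)

baseline_acc = baseline.score(X_dev_placeholder, toy_dev_labels)
print('majority-class accuracy: %3.3f' % baseline_acc)
```

If the Naive Bayes score is only a few points above this baseline, the bag-of-words features are carrying limited signal for the task.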
