# LIAR DETECTION GROUP PROJECT - Baseline Models  


### CONTENTS  

Imports  
Load ISOT data  
Pre-process ISOT data  
Train/Dev/Test split ISOT data  

##### Baselines (Naive Bayes):  
- ISOT full "text" field  (using CountVectorizer)  
- Verification test by assigning random 0's and 1's to the Dev Labels and re-running  
- Train with ISOT "text"; predict ISOT "title"  
- Read and setup LIAR dataset  
- Using the ISOT "text" model, predict the liar_dev_labels and score the predictions  
- Using the LIAR model, predict the ISOT "text" and score the predictions 
- ISOT "text" field using TfidfVectorizer  
- ISOT "text" field after removing "Reuters" and location from real news  
- ISOT "title" field; print top misclassifications between predicted dev classes and dev labels  
- Using the ISOT "title" model, predict the ISOT "text" classes  
- Using the ISOT "title" model, predict the liar_dev_labels and score the predictions; print top misclassifications  
- Using the LIAR model, predict the ISOT "title" and score the predictions  
- Divide LIAR data into train/dev/test, train a model, and see how well it predicts on its own data type  



    

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML
from sklearn.utils import shuffle
# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
#import tensorflow as tf

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz
#from ark-tweet-nlp-0.3.2 import 


In [2]:
#### MAY NEED TO RUN THIS CELL TWICE

def get_data(filename, sep=',', header=0, names = None):
    '''Read CSV file into a pandas dataframe'''
      
    filepath = DATAPATH + filename
    return pd.read_csv(filepath, header=header, sep=sep, quotechar='"')

In [3]:
##
# from sklearn.naive_bayes import BernoulliNB  #requires all features be binary
from sklearn.naive_bayes import MultinomialNB  #appropriate for word count features from CountVectorizer
# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *
#from sklearn.grid_search import GridSearchCV   # THIS HAS BEEN DEPRECATED
from sklearn.model_selection import GridSearchCV
# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

### Load data
Loading the "Fake News" dataset from the Information Security and Object Technology (ISOT) Research Lab at the University of Victoria School of Engineering.

The ISOT Fake News Dataset is a compilation of several thousand fake news and truthful articles, obtained from different legitimate news sites and from sites flagged as unreliable by politifact.com.

In [4]:
# define each downloaded file
FAKE_FILENAME = 'Fake.csv'
TRUE_FILENAME = 'True.csv'

# define the downloaded file path 
DATAPATH = './datasets/ISOT_FakeNews/'

def get_data(filename):
    '''Read CSV file into a pandas dataframe'''
      
    filepath = DATAPATH + filename
    return pd.read_csv(filepath, header=0, sep=',', quotechar='"')


fake_data = get_data(FAKE_FILENAME)
true_data = get_data(TRUE_FILENAME)



# add a label column to the data with the target values
fake_data.loc[:,'target'] = '0'
true_data['target'] = '1'

#append the datasets and shuffle them
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
all_data = pd.concat([true_data, fake_data], ignore_index=True)
all_data = all_data.sample(frac=1).reset_index(drop=True)

all_data.describe()

Unnamed: 0,title,text,subject,date,target
count,44898,44898.0,44898,44898,44898
unique,38729,38646.0,8,2397,2
top,Factbox: Trump fills top jobs for his administ...,,politicsNews,"December 20, 2017",0
freq,14,627.0,11272,182,23481


In [5]:
#fake_data.head(15)
#true_data.head(16)
all_data.head(15)

Unnamed: 0,title,text,subject,date,target
0,Labor Dept. looks into delaying fiduciary rule...,WASHINGTON (Reuters) - The U.S. Labor Departme...,politicsNews,"February 3, 2017",1
1,Senior Republicans signal issues in Congress f...,WASHINGTON (Reuters) - Representative Mac Thor...,politicsNews,"July 6, 2016",1
2,"Nearly 30,000 Kurds displaced from city near K...","BAGHDAD (Reuters) - Nearly 30,000 Kurds have b...",worldnews,"October 25, 2017",1
3,WTF: Top Trump Advisor Tells Trump To Kill Li...,As if Donald Trump isn t paranoid and delusion...,News,"March 6, 2017",0
4,EPIC RESPONSE AFTER THE BOSTON GLOBE Runs Fake...,THE BOSTON GLOBE ran a Sunday edition with a f...,politics,"Apr 11, 2016",0
5,"In message to Russia, Western powers demand U....",PARIS (Reuters) - Major Western powers appeare...,worldnews,"November 8, 2017",1
6,GERMAN RESIDENTS FIGHT BACK: Anti-Islamic Song...,Apparently these Germans are not interested in...,left-news,"Jan 3, 2016",0
7,"Passenger train derails in Spain, 21 hurt",MADRID (Reuters) - A passenger train derailed ...,worldnews,"November 29, 2017",1
8,JUDGE JEANINE IS FURIOUS! RINO’s Are Plotting ...,Fox News Channel s Jeanine Pirro went after th...,politics,"Jun 18, 2017",0
9,New judge assigned to U.S. lawsuit against AT&...,WASHINGTON (Reuters) - The U.S. Department of ...,politicsNews,"November 21, 2017",1


In [6]:

print(fake_data.title[0])
print('\n', fake_data.text[0])
print('\nTRUE DATA: ', true_data.text[3000])

 Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing

 Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this

### Cleanup
Check for NA values.

We may not want the dataset to contain the 'subject' column, since all the true news data comes from Reuters and the subject values may therefore give away the label.

In [7]:
all_data.isna().sum()

title      0
text       0
subject    0
date       0
target     0
dtype: int64
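
The concern about the 'subject' column can be checked with a cross-tabulation of subject against target. The sketch below uses a small made-up frame in place of `all_data`; the rows and the `leaky_subjects` name are illustrative assumptions, not results from the real dataset.

```python
import pandas as pd

# Toy frame standing in for all_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "News", "politics", "left-news", "worldnews"],
    "target":  ["1", "1", "0", "0", "0", "1"],
})

# Rows: subject values; columns: target labels; cells: article counts.
subject_by_target = pd.crosstab(toy["subject"], toy["target"])

# A subject gives away the label if all of its articles fall in a single class.
single_class = (subject_by_target > 0).sum(axis=1) == 1
leaky_subjects = subject_by_target[single_class].index.tolist()

print(subject_by_target)
print("subjects that appear in only one class:", leaky_subjects)
```

If every subject on the real data turns out to belong to one class only, the column should be dropped before training any model that could see it.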

In [8]:
all_data.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
title      44898 non-null object
text       44898 non-null object
subject    44898 non-null object
date       44898 non-null object
target     44898 non-null object
dtypes: object(5)
memory usage: 151.9 MB


### Tokenize and Canonicalize Text

We still need to work on tokenizing and canonicalizing the text. Words like "Obama's" need to be corrected. Do we need to mark off sentences within a text? We might want to use some regex code from camron.
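
One possible way to handle the two open questions (possessives and sentence boundaries) is sketched below. The helper names `split_possessives` and `split_sentences` are hypothetical, and the rules are deliberately naive: this is a starting point under those assumptions, not the project's tokenizer.

```python
import re

def split_possessives(text):
    """Separate the possessive clitic: "Obama's" becomes "Obama 's"."""
    return re.sub(r"(\w)('s)\b", r"\1 \2", text)

def split_sentences(text):
    """Naive rule: break after . ! or ? when followed by whitespace and a capital letter."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]

sample = "My name is Abhi. Obama's nephew. ok"
sentences = split_sentences(sample)
cleaned = [split_possessives(s) for s in sentences]
print(cleaned)
```

The sentence rule will split wrongly after abbreviations such as "U.S." when a capitalized word follows, so a real pipeline would likely need an abbreviation list or a trained sentence splitter.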

In [9]:
"""
Source:  https://gist.github.com/tokestermw/cb87a97113da12acb388
"""

FLAGS = re.MULTILINE | re.DOTALL

## hashtag code does not work, needs tweaking
def hashtag(text):
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        result = " {} ".format(hashtag_body.lower())
    else:
        result = " ".join(["<hashtag>"] + re.split(r"(?=[A-Z])", hashtag_body, flags=FLAGS))
    return result

def allcaps(text):
    text = text.group()
    return text.lower() + " <allcaps>"


def tokenize(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`\-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"@\w+", "<user>")
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    text = re_sub(r"/"," / ")
    text = re_sub(r"<3","<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    #text = re_sub(r"#\S+", hashtag)
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")
    text = re_sub(r"([A-Z]){2,}", allcaps)

       
    output = text.lower().split()
    #output = list(itertools.chain(*[re.split(r'([^\w<>])', x) for x in output]))  #Splits punctuation, keeping < and >
    return [item for item in output if item != '']  #Removes blank strings from list

teststring = "My name is ABHI :). Learning the back-portion :(. Obama's nephew. @random. http://www.abc.com"
tokenize(teststring)

['my',
 'name',
 'is',
 'abhi',
 '<allcaps>',
 '<smile>.',
 'learning',
 'the',
 'back-portion',
 '<sadface>.',
 "obama's",
 'nephew.',
 '<user>.',
 '<url>']
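
The output above still has punctuation glued to tokens (`'<smile>.'`, `'nephew.'`, `'<user>.'`). A minimal post-processing sketch that peels trailing punctuation into its own token follows; `split_trailing_punct` is a hypothetical helper, not part of the gist the tokenizer came from.

```python
import re

# Trailing run of sentence punctuation, with everything before it captured lazily.
TRAILING_PUNCT = re.compile(r"^(.*?)([.,!?;:]+)$")

def split_trailing_punct(tokens):
    """Split 'nephew.' into 'nephew' and '.', leaving other tokens untouched."""
    out = []
    for tok in tokens:
        match = TRAILING_PUNCT.match(tok)
        if match and match.group(1):
            out.extend([match.group(1), match.group(2)])
        else:
            # No trailing punctuation, or the token is punctuation only.
            out.append(tok)
    return out

tokens = ["<smile>.", "nephew.", "back-portion", ".", "<url>"]
print(split_trailing_punct(tokens))
```

This keeps hyphenated words and the `<...>` tags intact, but it also strips the final period from abbreviations such as "u.s.", which may or may not be desirable.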

In [10]:
def CNG_tokenizer(text):
    '''tokenizer, and part-of-speech tagger from Carnegie Mellon
    created by Olutobi Owoputi, Brendan O'Connor, Kevin Gimpel, Nathan Schneider, Chris Dyer, Dipanjan Das, Daniel Mills, 
    Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah Smith
    RunTagger [options] [ExamplesFilename]
      runs the CMU ARK Twitter tagger on tweets from ExamplesFilename, 
      writing taggings to standard output. Listens on stdin if no input filename.

    Options:
      --model <Filename>        Specify model filename. (Else use built-in.)
      --just-tokenize           Only run the tokenizer; no POS tags.
      --quiet                   Quiet: no output
      --input-format <Format>   Default: auto
                                Options: json, text, conll
      --output-format <Format>  Default: automatically decide from input format.
                                Options: pretsv, conll
      --input-field NUM         Default: 1
                                Which tab-separated field contains the input
                                (1-indexed, like unix 'cut')
                                Only for {json, text} input formats.
      --word-clusters <File>    Alternate word clusters file (see FeatureExtractor)
      --no-confidence           Don't output confidence probabilities
      --decoder <Decoder>       Change the decoding algorithm (default: greedy)

    Tweet-per-line input formats:
       json: Every input line has a JSON object containing the tweet,
             as per the Streaming API. (The 'text' field is used.)
       text: Every input line has the text for one tweet.
    We actually assume input lines are TSV and the tweet data is one field.
    (Therefore tab characters are not allowed in tweets.
    Twitter's own JSON formats guarantee this;
    if you extract the text yourself, you must remove tabs and newlines.)
    Tweet-per-line output format is
       pretsv: Prepend the tokenization and tagging as new TSV fields, 
               so the output includes a complete copy of the input.
    By default, three TSV fields are prepended:
       Tokenization \t POSTags \t Confidences \t (original data...)
    The tokenization and tags are parallel space-separated lists.
    The 'conll' format is token-per-line, blank spaces separating tweets.'''

    # write the input to a temporary file for the tagger; the context manager closes it
    with open("teststring.txt", "w") as file:
        file.write(text)

#! ./ark-tweet-nlp-0.3.2/runTagger.sh ./ark-tweet-nlp-0.3.2/examples/example_tweets.txt
#! ./ark-tweet-nlp-0.3.2/twokenize.sh --output-format pretsv ./ark-tweet-nlp-0.3.2/examples/casual.txt
    tokens = ! ./ark-tweet-nlp-0.3.2/runTagger.sh --output-format conll teststring.txt
    tokens_list = list([re.split(r'([\t])',x) for x in tokens])
    tokens_list = [[ item for item in word if item != '\t' ] for word in tokens_list]
    
    #pandas frame for the tokens and POS
    
    pd_tokens = pd.DataFrame(tokens_list[1:-2], columns = ['word','tag','confidence'] )
    print(pd_tokens)
    word_list = [tokenize(word) for word in pd_tokens['word'].tolist()]
    
    word_tokens = ["".join(word) for word in word_list] #concats the allcaps text at the end of string
    pos_tokens = pd_tokens['tag'].tolist()
    conf_tokens = pd_tokens['confidence'].tolist()
    return word_tokens, pos_tokens, conf_tokens

In [11]:
CNG_tokenizer(teststring)

                  word tag confidence
0                   My   D     0.9984
1                 name   N     0.9996
2                   is   V     0.9953
3                 ABHI   ^     0.6305
4                   :)   E     0.9775
5                    .   ,     0.9951
6             Learning   V     0.9957
7                  the   D     0.9960
8         back-portion   N     0.8512
9                   :(   E     0.9162
10                   .   ,     0.9876
11             Obama's   Z     0.8890
12              nephew   N     0.9575
13                   .   ,     0.9980
14             @random   @     0.9945
15                   .   ,     0.9953
16  http://www.abc.com   U     0.9871


(['my',
  'name',
  'is',
  'abhi<allcaps>',
  '<smile>',
  '.',
  'learning',
  'the',
  'back-portion',
  '<sadface>',
  '.',
  "obama's",
  'nephew',
  '.',
  '<user>',
  '.',
  '<url>'],
 ['D',
  'N',
  'V',
  '^',
  'E',
  ',',
  'V',
  'D',
  'N',
  'E',
  ',',
  'Z',
  'N',
  ',',
  '@',
  ',',
  'U'],
 ['0.9984',
  '0.9996',
  '0.9953',
  '0.6305',
  '0.9775',
  '0.9951',
  '0.9957',
  '0.9960',
  '0.8512',
  '0.9162',
  '0.9876',
  '0.8890',
  '0.9575',
  '0.9980',
  '0.9945',
  '0.9953',
  '0.9871'])

In [12]:

#Make new column with tokenized, canonicalized text
all_data['text_tokcan'] = all_data['text'].apply(tokenize)
all_data.tail(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
44893,President Obama Says He Believes The Affordab...,The president says he doesn t think Obamacare ...,News,"January 8, 2017",0,"[the, president, says, he, doesn, t, think, ob..."
44894,A Top Republican Really Just Blamed Obama For...,Rep. Steve King (R-IA) has been in the news co...,News,"June 15, 2017",0,"[rep., steve, king, (r-ia, <allcaps>), has, be..."
44895,LOL! MOTHER OF TWO Just Got A BIG SURPRISE Fro...,"The picture, snapped by a White House photogra...",left-news,"Nov 6, 2017",0,"[the, picture,, snapped, by, a, white, house, ..."
44896,FLASHBACK: KEY DEMOCRATS Call for Violence in ...,And we wonder why violence like today s shooti...,Government News,"Jun 14, 2017",0,"[and, we, wonder, why, violence, like, today, ..."
44897,DEPLORABLE! HILLARY’S Campaign Is In PANIC Mod...,What happens when Hillary s poll numbers take ...,left-news,"Sep 16, 2016",0,"[what, happens, when, hillary, s, poll, number..."


In [13]:
#padded_sentences = ([u"<s>", u"<s>"] + s + [u"</s>"] for s in sents)

In [14]:

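def build_vocab(corpus, V=None, **kw):
    '''Build a Vocabulary from a list of tokens.'''
    # fail early instead of hitting an undefined `vocab` further down
    if not isinstance(corpus, list):
        raise TypeError("corpus must be a list of tokens")
    token_feed = (utils.canonicalize_word(w) for w in corpus)
    vocab = vocabulary.Vocabulary(token_feed, size=V, **kw)
    print("Vocabulary: {:,} types".format(vocab.size))
    return vocab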


#utils.canonicalize_word(teststring.split())
vocab=build_vocab(tokenize(teststring))
print("{:,} words".format(vocab.size))
print("wordset: ",vocab.ordered_words())



Vocabulary: 17 types
17 words
wordset:  ['<s>', '</s>', '<unk>', 'my', 'name', 'is', 'abhi', '<allcaps>', '<smile>.', 'learning', 'the', 'back-portion', '<sadface>.', "obama's", 'nephew.', '<user>.', '<url>']


In [15]:
print('ISOT ALL target=real:', len(all_data.target[all_data.target == '1']))
print('ISOT ALL target=fake:', len(all_data.target[all_data.target == '0']))

ISOT ALL target=real: 21417
ISOT ALL target=fake: 23481
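
These class counts give a floor for every model that follows: a classifier that always predicts the majority class (fake) is right whenever the article is fake. The short calculation below uses only the two counts printed above.

```python
# Class counts printed by the cell above.
real_count = 21417
fake_count = 23481

total = real_count + fake_count

# Accuracy of always predicting the larger class.
majority_baseline = max(real_count, fake_count) / total

print("total articles:", total)
print("majority-class accuracy: {:.3f}".format(majority_baseline))
```

Any baseline accuracy should be read against this value rather than against 0.5.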


### Train / Dev / Test Split ISOT data

In [16]:
#train/dev/test split
#train_dev_split = 0.8

train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

# compare with a tolerance: exact == on floats can fail for other fraction choices
if np.isclose(train_fract + dev_fract + test_fract, 1.0):
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

#train_data = all_data[:int(len(all_data)*train_dev_split)].reset_index(drop=True)
#dev_data = all_data[int(len(all_data)*train_dev_split):].reset_index(drop=True)

train_set = all_data[ :int(len(all_data)*train_fract)].reset_index(drop=True)
dev_set = all_data[int(len(all_data)*(train_fract)) : int(len(all_data)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = all_data[int(len(all_data)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)

Split fractions add up to 1.0
training set:  (31428, 6)
dev set:  (6735, 6)
test set:  (6735, 6)
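
The split above slices an already shuffled frame, so the class ratio in each part is only approximately preserved. An alternative, sketched here on synthetic data, is a two-step stratified split with scikit-learn's `train_test_split`. The arrays `docs` and `labels` are made-up stand-ins, and this is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the article texts and their 0/1 labels.
rng = np.random.RandomState(0)
docs = np.array(["doc %d" % i for i in range(200)])
labels = rng.randint(0, 2, size=200)

# Step 1: hold out 30% of the data, keeping the class ratio (stratify).
train_docs, rest_docs, train_y, rest_y = train_test_split(
    docs, labels, test_size=0.30, stratify=labels, random_state=0)

# Step 2: cut the held-out part in half to get dev and test (15% each).
dev_docs, test_docs, dev_y, test_y = train_test_split(
    rest_docs, rest_y, test_size=0.50, stratify=rest_y, random_state=0)

print(len(train_docs), len(dev_docs), len(test_docs))
print("share of label 1:", labels.mean(), train_y.mean(), dev_y.mean(), test_y.mean())
```

Fixing `random_state` also makes the split reproducible across notebook restarts, which the `sample(frac=1)` shuffle above does not guarantee.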


In [17]:
train_set.head(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
0,Labor Dept. looks into delaying fiduciary rule...,WASHINGTON (Reuters) - The U.S. Labor Departme...,politicsNews,"February 3, 2017",1,"[washington, <allcaps>, (reuters), -, the, u.s..."
1,Senior Republicans signal issues in Congress f...,WASHINGTON (Reuters) - Representative Mac Thor...,politicsNews,"July 6, 2016",1,"[washington, <allcaps>, (reuters), -, represen..."
2,"Nearly 30,000 Kurds displaced from city near K...","BAGHDAD (Reuters) - Nearly 30,000 Kurds have b...",worldnews,"October 25, 2017",1,"[baghdad, <allcaps>, (reuters), -, nearly, <nu..."
3,WTF: Top Trump Advisor Tells Trump To Kill Li...,As if Donald Trump isn t paranoid and delusion...,News,"March 6, 2017",0,"[as, if, donald, trump, isn, t, paranoid, and,..."
4,EPIC RESPONSE AFTER THE BOSTON GLOBE Runs Fake...,THE BOSTON GLOBE ran a Sunday edition with a f...,politics,"Apr 11, 2016",0,"[the, <allcaps>, boston, <allcaps>, globe, <al..."


In [18]:
dev_set.head(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
0,Kurds abandon territory in the face of Iraq go...,BAGHDAD/KIRKUK (Reuters) - The Baghdad governm...,worldnews,"October 17, 2017",1,"[baghdad, <allcaps>, /, kirkuk, <allcaps>, (re..."
1,"Trump meets with U.S. community bankers, pledg...",WASHINGTON (Reuters) - President Donald Trump ...,politicsNews,"March 9, 2017",1,"[washington, <allcaps>, (reuters), -, presiden..."
2,‘IT’S ALL ABOUT THE KIDS’…JUST ASK THE UNIONS:...,If the Chicago Public Schools were a business ...,left-news,"Jul 3, 2015",0,"[if, the, chicago, public, schools, were, a, b..."
3,Instant View: Comey accuses Trump administrati...,(Reuters) - Former FBI Director James Comey on...,politicsNews,"June 8, 2017",1,"[(reuters), -, former, fbi, <allcaps>, directo..."
4,There’s Something Hokey About Ted,"21st Century Wire says At some point, the poli...",Middle-east,"March 21, 2016",0,"[<number>st, century, wire, says, at, some, po..."


In [19]:
# print out ISOT dev set
#dev_set.to_csv('isot_dev_set.csv', sep=',')

In [20]:
test_set.head(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
0,"Trump 'not thrilled' with debate dates, Clinto...",WASHINGTON (Reuters) - Republican candidate Do...,politicsNews,"July 31, 2016",1,"[washington, <allcaps>, (reuters), -, republic..."
1,Trump Just Had A Massive MELTDOWN Over Sally ...,After former Acting Attorney General Sally Yat...,News,"May 8, 2017",0,"[after, former, acting, attorney, general, sal..."
2,Hackers Reveal Trump Linked The Most Powerful...,Donald Trump s administration is already provi...,News,"January 26, 2017",0,"[donald, trump, s, administration, is, already..."
3,House Republicans Begin Process To Withdraw A...,"Since 1945, the United States has been a cruci...",News,"January 22, 2017",0,"[since, <number>, the, united, states, has, be..."
4,VIDEO SURFACES OF DISGRACED Alleged Pedophile ...,"Earlier today, it was reported by TMZ that an ...",left-news,"Oct 30, 2017",0,"[earlier, today,, it, was, reported, by, tmz, ..."


## Baseline Model: Naive Bayes Classifier

### Classify full text

In [21]:
train_data, train_labels = train_set.text.values, train_set.target.values
dev_data, dev_labels = dev_set.text.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)
print(type(train_labels[0]))
#train_labels.head()
#dev_data.head()
#dev_labels.head()


train_data shape: (31428,)

train_labels shape: (31428,)
[1 1 1 ... 0 1 0]
<class 'numpy.int64'>


In [22]:
print('ISOT train target=real:', len(train_labels[train_labels == 1]))
print('ISOT train target=fake:', len(train_labels[train_labels == 0]))

ISOT train target=real: 15036
ISOT train target=fake: 16392


In [23]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X

<31428x105902 sparse matrix of type '<class 'numpy.int64'>'
	with 6598223 stored elements in Compressed Sparse Row format>

In [24]:
#print(X[0])

In [25]:
print('X.shape:', X.shape)  # (documents, features): one row per document, one column per vocabulary word
print('Vocabulary size (number of features or columns):', X.shape[1])
print('Non-zero elements in matrix (X.nnz):', X.nnz)
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # X.nnz / number of documents
print('Fraction of non-zero elements in matrix: %.4f' %( X.nnz/(X.shape[0] * X.shape[1])) )   # X.nnz / (rows * columns)


# What are the 0th and last feature strings (in alphabetical order)?
print('0th feature string:', vectorizer.get_feature_names()[0])
print('last feature string:', vectorizer.get_feature_names()[X.shape[1]-1])

X.shape: (31428, 105902)
Vocabulary size (number of features or columns): 105902
Non-zero elements in matrix (X.nnz): 6598223
Average number of non-zero features per example (per document): 209.947
Fraction of non-zero elements in matrix: 0.0020
0th feature string: 00
last feature string: zzzzzzzzzzzzz


In [26]:
# Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? 
vectorizer_dev = CountVectorizer()
X_dev = vectorizer_dev.fit_transform(dev_data)  # Independently build a vocabulary using dev_data.
print('X_dev.shape:', X_dev.shape)
print('Vocabulary using train data:', X.shape[1])
print('Vocabulary using dev data:', X_dev.shape[1])

# Feed dev_data into the vectorizer fit using training data.
X_dev_transformed = vectorizer.transform(dev_data)
print('X_dev_transformed shape:', X_dev_transformed.shape)  # expect .shape[1] to equal the training vocabulary size
#print('non-zero indices in X_dev_transformed:', X_dev_transformed.nonzero())  # could also use this to check which features missing...

''' This is way too slow; use set intersection instead!!
# Look at each feature (vocabulary word) in X_dev and see if it is a feature in X.
count = 0
for i in range(X_dev.shape[1]):
    if vectorizer_dev.get_feature_names()[i] in vectorizer.get_feature_names():
        count += 1
print('Count of words (features) in X_dev also in X:', count)   
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - count)/X_dev.shape[1]) )
count of words (features) in X_dev also in X: 12219
Fraction of words in dev data missing from training vocabulary: 0.248
'''

set1 = set(vectorizer_dev.get_feature_names())
set2 = set(vectorizer.get_feature_names())
print('Count of words (features) in X_dev also in X:', len(set1.intersection(set2)))
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - len(set1.intersection(set2)))/X_dev.shape[1]) )

X_dev.shape: (6735, 53384)
Vocabulary using train data: 105902
Vocabulary using dev data: 53384
X_dev_transformed shape: (6735, 105902)
Count of words (features) in X_dev also in X: 44900
Fraction of words in dev data missing from training vocabulary: 0.159
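
The out-of-vocabulary figure above is a set difference divided by the dev vocabulary size. The sketch below shows the same computation on two tiny made-up vocabularies and then re-derives the reported fraction from the two numbers printed above.

```python
# Toy vocabularies; the words are made up for illustration.
train_vocab = {"the", "senate", "court", "trump", "said"}
dev_vocab = {"the", "court", "said", "rohingya"}

shared = train_vocab & dev_vocab        # words the trained vectorizer can represent
unseen = dev_vocab - train_vocab        # words it silently drops at transform time
toy_oov_fraction = len(unseen) / len(dev_vocab)
print(sorted(shared), sorted(unseen), toy_oov_fraction)

# Same formula applied to the numbers printed above.
dev_vocab_size = 53384
shared_size = 44900
reported_oov_fraction = (dev_vocab_size - shared_size) / dev_vocab_size
print("fraction of dev vocabulary missing from training: %.3f" % reported_oov_fraction)
```

Note that this is a fraction of distinct word types, not of running tokens; the unseen words are mostly rare, so the share of dev tokens that are dropped is likely much smaller.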


In [27]:
# MultinomialNB
print('\nMultinomialNB')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB
accuracy: 0.95


In [28]:

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))

y_pred = clf.predict(X_dev_transformed)

acc = accuracy_score(dev_labels, y_pred)
print("Accuracy on dev set: {:.02%}".format(acc))


accuracy: 0.95
Accuracy on dev set: 95.43%
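
Accuracy alone does not show which class the errors fall in. The sketch below trains the same kind of model (CountVectorizer plus MultinomialNB) on a tiny made-up corpus and prints a confusion matrix; the documents and labels are invented for illustration and say nothing about the ISOT results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: label 1 = real, label 0 = fake.
toy_train_docs = [
    "reuters said the senate voted",
    "reuters said the court ruled",
    "you won t believe this shocking video",
    "shocking video shows what they hide",
]
toy_train_labels = [1, 1, 0, 0]
toy_dev_docs = ["the court said", "shocking video"]
toy_dev_labels = [1, 0]

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB(alpha=1.0)
toy_clf.fit(toy_vectorizer.fit_transform(toy_train_docs), toy_train_labels)

toy_pred = toy_clf.predict(toy_vectorizer.transform(toy_dev_docs))

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
cm = confusion_matrix(toy_dev_labels, toy_pred, labels=[0, 1])
print(cm)
```

On the real data the equivalent call would be `confusion_matrix(dev_labels, y_pred)`, using the names already defined in the cell above.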


In [29]:
print('predict proba:', clf.predict_proba(X).shape)
print('predict proba example:', clf.predict_proba(X[0]))

predict proba: (31428, 2)
predict proba example: [[3.50355978e-15 1.00000000e+00]]


In [30]:
print('feature_log_prob_ shape:', clf.feature_log_prob_.shape)
print('feature_log_prob_ example:', clf.feature_log_prob_[0][0])

feature_log_prob_ shape: (2, 105902)
feature_log_prob_ example: -9.682948061308743


In [31]:
feature_names = vectorizer.get_feature_names()
print(feature_names[:20])

['00', '000', '0000', '00004', '000048', '00007', '00042', '0005', '0009', '000938', '000a', '000after', '000although', '000california', '000cases', '000cylvia', '000dillon000', '000ecuador', '000florida', '000georgia']
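
A portability note: `get_feature_names()` was deprecated in scikit-learn 1.0 and removed in 1.2 in favor of `get_feature_names_out()`. If the notebook has to run on both old and new versions, a small wrapper like the one below is one option. `feature_names_compat` and the stub class are hypothetical names introduced here for illustration.

```python
def feature_names_compat(vectorizer):
    """Return the vocabulary as a list, whichever accessor the installed version has."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())

# Stand-in for an old-style vectorizer so the example runs without scikit-learn.
class LegacyVectorizerStub:
    def get_feature_names(self):
        return ["00", "000", "0000"]

print(feature_names_compat(LegacyVectorizerStub()))
```

Calling the wrapper once and reusing the list also avoids rebuilding the vocabulary list on every loop iteration.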


In [32]:
for i in range(2):   # 2 category labels
    order = np.argsort(clf.feature_log_prob_[i,:])   # ascending; sort once per class
    for j in range(10):   # top 10 weights for each class
        feature_index = order[-1 - j]
        print('%19s %12.3f %12.3f' %(feature_names[feature_index], clf.feature_log_prob_[0,feature_index], clf.feature_log_prob_[1,feature_index]))
    print()


                the       -2.899       -2.825
                 to       -3.525       -3.502
                 of       -3.729       -3.678
                and       -3.772       -3.796
                 in       -4.052       -3.798
               that       -4.176       -4.525
                 is       -4.481       -4.986
                for       -4.664       -4.629
                 it       -4.768       -5.102
                 on       -4.777       -4.320

                the       -2.899       -2.825
                 to       -3.525       -3.502
                 of       -3.729       -3.678
                and       -3.772       -3.796
                 in       -4.052       -3.798
                 on       -4.777       -4.320
               said       -5.678       -4.407
               that       -4.176       -4.525
                for       -4.664       -4.629
                 is       -4.481       -4.986



In [33]:
prob_diff = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]

order = np.argsort(prob_diff)   # ascending; largest (real minus fake) differences come last
for j in range(50):   # top 50 words most indicative of real news
    feat_index = order[-1 - j]
    print('%19s %12.3f %12.3f' %(feature_names[feat_index], clf.feature_log_prob_[0,feat_index], clf.feature_log_prob_[1,feat_index]))


            myanmar      -15.056       -7.935
           rohingya      -15.056       -8.238
            rakhine      -15.749       -9.052
         puigdemont      -15.749       -9.424
               zuma      -15.749       -9.473
                suu      -15.749       -9.724
                kyi      -15.749       -9.724
             odinga      -15.749       -9.865
              rajoy      -15.749       -9.896
                fdp      -15.749       -9.906
          kuczynski      -15.749       -9.913
          mnangagwa      -15.749       -9.953
                anc      -15.749       -9.990
            juncker      -15.749      -10.045
             hariri      -14.650       -8.990
             tmsnrt      -15.749      -10.115
            barnier      -15.749      -10.164
               aung      -15.749      -10.168
             harare      -15.749      -10.259
             marawi      -15.749      -10.305
               kurz      -15.749      -10.342
          ramaphosa      -15.749  
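
The repeated value -15.749 in the fake column is consistent with words that never occur in fake articles: with `alpha=1.0` they receive only the smoothing count, and the two other values that appear (-15.056 and -14.650) differ from it by ln 2 and ln 3, that is, one and two occurrences. The sketch below reproduces the smoothing formula used by MultinomialNB and the log-odds ranking on made-up counts; the words and numbers are illustrative only.

```python
import numpy as np

toy_words = np.array(["the", "said", "myanmar", "hillary"])

# Made-up word counts. Row 0: fake class, row 1: real class.
counts = np.array([[100.0,  5.0, 0.0, 30.0],
                   [ 90.0, 40.0, 8.0,  3.0]])
alpha = 1.0

# Additive (Laplace) smoothing: log((count + alpha) / (class total + alpha * vocab size)).
smoothed = counts + alpha
log_prob = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))

# Positive log-odds: the word is more typical of real news; negative: of fake news.
log_odds = log_prob[1] - log_prob[0]
order = np.argsort(log_odds)[::-1]          # most "real" first

top_real = toy_words[order[:2]].tolist()
top_fake = toy_words[order[::-1][:2]].tolist()
print("most indicative of real:", top_real)
print("most indicative of fake:", top_fake)
```

Sorting the same difference in the opposite direction gives the words most indicative of fake news, which the cell above does not print.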

In [34]:
print('clf.feature_count_ :', clf.feature_count_.shape)

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake count', 'Real count'))
    for j in range(100):   # top x most frequent words for each class
        index = -1 - j
        feature_index = np.argsort(clf.feature_count_[i,:])[index]
        print('%19s %12d %12d' %(vectorizer.get_feature_names()[feature_index], clf.feature_count_[0,feature_index], clf.feature_count_[1,feature_index]))
    print()
    
    
#print('Real_News clf.feature_count_ :', np.sort(clf.feature_count_[1,:]))
#print('Real_News clf.feature_count indices :', np.argsort(clf.feature_count_[1,:]))
##print('Real_News clf.feature_count words :', vectorizer.get_feature_names()[np.argsort(clf.feature_count_[1,:])])
print()
#print('Fake_News clf.feature_count_ :', clf.feature_count_[0,:])

clf.feature_count_ : (2, 105902)

MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD   Fake count   Real count
                the       380761       339942
                 to       203640       172797
                 of       166010       144867
                and       159092       128832
                 in       120166       128519
               that       106220        62130
                 is        78301        39178
                for        65180        56000
                 it        58738        34888
                 on        58221        76259
              trump        55622        38572
                 he        55511        38373
                was        47281        34043
               with        44306        38196
                his        41016        27148
               this        40484        14904
                 as        40460        33368
                 be        34351        24049
                 by        33450        33830


                out        16517         7617
               when        15500         7451
             former         5076         7431
                her        18340         7407
             donald        12434         7373
           security         4145         7195
              north         1856         7077
            percent         2970         7065
               into         9456         6845
              court         3685         6802
              white         9417         6788
            clinton        13438         6763
              obama        13125         6713
                all        18094         6612
             senate         2564         6559
                any         8054         6469
            country         6454         6364
              first         7454         6135
              china          891         6124
           minister          694         6082
          officials         2695         6035
               week         3432  

#### Use TfidfVectorizer and compare to CountVectorizer
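
Before running the comparison, it may help to see the TF-IDF weighting worked through by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed idf computed as ln((1 + n) / (1 + df)) + 1, then L2 normalization of each document). The three toy documents are made up.

```python
import math
from collections import Counter

# Toy corpus, already tokenized.
toy_docs = [
    ["the", "senate", "voted"],
    ["the", "court", "ruled"],
    ["the", "the", "video"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

def tfidf(doc):
    """Term count times idf, scaled so the document vector has unit L2 norm."""
    term_counts = Counter(doc)
    weights = {t: term_counts[t] * idf(t) for t in term_counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first_doc = tfidf(toy_docs[0])
print(first_doc)
```

A word that appears in every document, such as "the" here, gets the minimum idf of 1.0, so rarer words end up with larger weights within the same document.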

In [35]:
'''
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little 
meaningful information about the actual contents of the document. If we were to feed the direct count data directly to 
a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very 
common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: 
tf-idf(t,d) = tf(t,d) * idf(t).
'''

t_vectorizer = TfidfVectorizer()
t_X = t_vectorizer.fit_transform(train_data)   
#print(t_X.shape)
t_X_dev = t_vectorizer.transform(dev_data)
#print(t_X_dev.shape)


# MultinomialNB
#The multinomial distribution normally requires integer feature counts. 
#However, in practice, fractional counts such as tf-idf may also work.
print('\nMultinomialNB with TfidfVectorizer')
alpha = 1.0
t_clf = MultinomialNB(alpha=alpha)
t_clf.fit(t_X, train_labels)

print('accuracy: %3.3f' %t_clf.score(t_X_dev, dev_labels))

t_dev_predicted_labels = t_clf.predict(t_X_dev)  # "predict" and report accuracy using dev set
#print(t_dev_predicted_labels.shape)

print('\nf1 score of dev predicted labels:', metrics.f1_score(dev_labels, t_dev_predicted_labels, average='weighted'))
print('classification report of dev predicted labels: \n', classification_report(dev_labels, t_dev_predicted_labels))
print()


MultinomialNB with TfidfVectorizer
accuracy: 0.938

f1 score of dev predicted labels: 0.9382177743796193
classification report of dev predicted labels: 
               precision    recall  f1-score   support

           0       0.94      0.95      0.94      3527
           1       0.94      0.93      0.93      3208

   micro avg       0.94      0.94      0.94      6735
   macro avg       0.94      0.94      0.94      6735
weighted avg       0.94      0.94      0.94      6735
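
To make the re-weighting concrete, here is a minimal sketch that computes tf-idf by hand for a toy corpus. It assumes the smoothed idf that scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`), idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it skips the final L2 normalization. The corpus and the helper name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return (idf, weights): idf per term and one {term: tf-idf} dict per document.
    No L2 normalization is applied (assumption: kept simple for illustration)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return idf, weights

toy_docs = ["the senate passed the bill",
            "the court blocked the order",
            "the minister resigned"]
idf, weights = tfidf_weights(toy_docs)
print('idf("the")    = %.3f' % idf["the"])
print('idf("senate") = %.3f' % idf["senate"])
```

Here "the" occurs in every document, so its idf is 1.0, while a word found in only one of the three documents gets ln(2) + 1, roughly 1.69.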




#### Verification test: assign random 0's and 1's to the dev labels and re-run.

In [35]:
sample = np.random.binomial(1, 0.5, size=dev_labels.shape[0])
print(sample.mean())

0.5020044543429845


In [36]:
print('accuracy: %3.4f' %clf.score(X_dev_transformed, sample))

accuracy: 0.4953


In [37]:
sample2 = np.random.binomial(1, 0.2, size=dev_labels.shape[0])
print(sample2.mean())

0.19510022271714922


In [38]:
print('accuracy: %3.4f' %clf.score(X_dev_transformed, sample2))

accuracy: 0.5145


#### This result is as expected: once the dev labels are randomized, the model's accuracy drops to about 50% for both random labelings.
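
The ~50% figure can be checked with a short calculation. If the random labels are 1 with probability q, independently of the text, and the model predicts class 1 for a fraction r of the dev documents, the expected accuracy is r·q + (1 − r)·(1 − q). The sketch below evaluates this. The value r = 0.48 is an assumption for illustration (roughly the share of real news in the dev set), not a number reported by the notebook.

```python
def expected_accuracy(pred_pos_rate, label_pos_rate):
    """Expected accuracy when labels are drawn independently of the predictions."""
    r, q = pred_pos_rate, label_pos_rate
    return r * q + (1 - r) * (1 - q)

r = 0.48  # ASSUMED fraction of dev documents predicted as class 1
for q in (0.5, 0.2):
    print('q = %.1f -> expected accuracy %.3f' % (q, expected_accuracy(r, q)))
```

With q = 0.5 the result is 0.5 for any r. With q = 0.2 it stays near 0.5 only because the predictions themselves are close to balanced.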

### Train with ISOT "text"; predict ISOT "title"

In [37]:
train_data, train_labels = train_set.text.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.title.str.lower().values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)

print('dev_data shape:', dev_data.shape)

train_data shape: (31428,)
['washington (reuters) - the u.s. labor department is looking into delaying the implementation date of its new fiduciary rule governing the advice that brokers can give about retirement investments, it said on friday, after president donald trump called for a review that could ultimately lead to scrapping it. “the department of labor will now consider its legal options to delay the applicability of the date as we comply with the president’s memorandum,” acting u.s. secretary of labor ed hugler said in a statement. ']

train_labels shape: (31428,)
[1 1 1 ... 0 1 0]
dev_data shape: (6735,)


In [38]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

print('X.shape:', X.shape)
print('X_dev_transformed.shape:', X_dev_transformed.shape)

# MultinomialNB
print('\nMultinomialNB trained using ISOT "text" field; predict ISOT "title" classes')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

X.shape: (31428, 105902)
X_dev_transformed.shape: (6735, 105902)

MultinomialNB trained using ISOT "text" field; predict ISOT "title" classes
accuracy: 0.912


### Apply model to LIAR dataset text to predict results and compute score ('title' field contains the statement).

In [39]:
#### MAY NEED TO RUN THIS CELL TWICE

def get_data(filename, sep=',', header=0, names = None):
    '''Read CSV file into a pandas dataframe'''
      
    filepath = DATAPATH + filename
    return pd.read_csv(filepath, header=header, sep=sep, quotechar='"')

In [40]:
# define each downloaded file
LIAR_TEST_FILENAME = 'test.tsv'
LIAR_TRAIN_FILENAME = 'train.tsv'
LIAR_DEV_FILENAME = 'valid.tsv'

# define the downloaded file path 
DATAPATH = './datasets/LIAR/'

## title =statement, target = politifact rating

h_names= ['id', 'target', 'title', 'subject', 'speaker', 'speaker_job_title', 'state', 'party',
          'barely_true_count', 'false_count', 'half_true_count', 'mostly_true_count','pantsonfire_count',
          'context']

liar_test_data = get_data(LIAR_TEST_FILENAME, sep ='\t', header =None)
liar_train_data = get_data(LIAR_TRAIN_FILENAME, '\t', header =None)
liar_dev_data = get_data(LIAR_DEV_FILENAME, '\t', header =None)
print("LIAR training dataset: ", liar_train_data.shape)
print("LIAR test dataset: ", liar_test_data.shape)
print("LIAR dev dataset: ", liar_dev_data.shape)

liar_test_data.columns = h_names
liar_train_data.columns = h_names
liar_dev_data.columns = h_names
# ## add a label column to the data with the target values
# #fake_data.loc[:,'target'] = '0'
# #true_data['target'] = '1'

# #append the datasets and shuffle them
# all_data = true_data.append(fake_data, ignore_index=True)
# all_data = all_data.sample(frac=1).reset_index(drop=True)

## NOTE: if trouble loading, re-run get_data function.

LIAR training dataset:  (10240, 14)
LIAR test dataset:  (1267, 14)
LIAR dev dataset:  (1284, 14)


In [41]:
# combine all the liar data
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
liar_data = pd.concat([liar_train_data, liar_test_data, liar_dev_data], ignore_index=True)
liar_data = liar_data.sample(frac=1).reset_index(drop=True)
print("Complete LIAR dataset: ",liar_data.shape)
liar_data.head()

Complete LIAR dataset:  (12791, 14)


Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context
0,6193.json,false,(Environmentalists) said Were only going to st...,"energy,environment",ron-ramsey,Speaker of the Tennessee Senate,Tennessee,republican,0.0,2.0,0.0,0.0,1.0,rally for coal miners
1,12371.json,true,"For the first time in over 40 years, Republica...","elections,history",ed-gillespie,Republican strategist,"Washington, D.C.",republican,2.0,3.0,2.0,2.0,1.0,a speech
2,12409.json,true,You can import as many hemp products into this...,"drugs,legal-issues,marijuana",william-devereaux,lawyer,Rhode Island,none,0.0,0.0,0.0,0.0,0.0,a legislative hearing
3,1770.json,true,We can prevent terror suspects from boarding a...,"civil-rights,guns,terrorism",michael-bloomberg,,New York,independent,0.0,2.0,2.0,3.0,0.0,an op ed article
4,10700.json,pants-fire,"Says Ted Cruz said, There is no place for gays...","candidates-biography,gays-and-lesbians,marriag...",facebook-posts,Social media posting,,none,14.0,18.0,15.0,11.0,36.0,an online meme


In [42]:
print(liar_data.title[4])
print(liar_data.target.unique())

Says Ted Cruz said, There is no place for gays or atheists in my America. None. Our Constitution makes that clear.
['false' 'true' 'pants-fire' 'half-true' 'barely-true' 'mostly-true']


In [43]:
targets = liar_data.target.unique()
print(targets)

print('target,  number of examples')
for target in targets:
    print(target, len(liar_data[liar_data.target==target]))
    
print('\ntotal examples', len(liar_data))

['false' 'true' 'pants-fire' 'half-true' 'barely-true' 'mostly-true']
target,  number of examples
false 2507
true 2053
pants-fire 1047
half-true 2627
barely-true 2103
mostly-true 2454

total examples 12791


In [44]:
liar_data['binary_target'] = -1

'''  # this does not work
for i in range(liar_data.shape[0]):   
    if liar_data.target.iloc[i] == ('pants-fire' or 'false' or 'barely-true') :
        liar_data.binary_target.iloc[i] = 0  # fake news
    elif liar_data.target.iloc[i] == ('true' or 'mostly-true'):
        liar_data.binary_target.iloc[i] = 1  # real news
'''

''' these do not work
#liar_data.binary_target[((liar_data.target=='pants-fire') | (liar_data.target=='false') | (liar_data.target=='barely-true')] = 0
#liar_data['binary_target'] = np.where( ( (liar_data.target=='pants-fire') | (liar_data.target=='false') | (liar_data.target=='barely-true')] = 0                                        
## example:df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
#liar_data['binary_target'] = np.where( (liar_data.target.isin (['pants-fire','false','barely-true']), 0,1))
'''

# This might work better!
#'''
def binary_seq_target(rating):
    ## if no rating provided assume the statement to be true
    map_r = {'pants-fire':0, 'false':0, 'barely-true':0, 'half-true':-1, 'mostly-true':1, 'true':1}
    return map_r.get(rating, 1)
    
##change the target labels to 0(false), 1(true news)
#liar_data2.loc[:,'target'] = pd.Series(liar_data2['target'].apply(seq_target), index = liar_data2.index)
liar_data.loc[:,'binary_target'] = pd.Series(liar_data['target'].apply(binary_seq_target), index = liar_data.index)
liar_data.head(10)    
#'''

'''
# these give a warning: 'A value is trying to be set on a copy of a slice from a DataFrame'
liar_data.binary_target[liar_data.target=='pants-fire'] = 0  # fake news
liar_data.binary_target[liar_data.target=='false'] = 0
liar_data.binary_target[liar_data.target=='barely-true'] = 0
liar_data.binary_target[liar_data.target=='true'] = 1        # real news
liar_data.binary_target[liar_data.target=='mostly-true'] = 1
'''

liar_data.head(10)


Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context,binary_target
0,6193.json,false,(Environmentalists) said Were only going to st...,"energy,environment",ron-ramsey,Speaker of the Tennessee Senate,Tennessee,republican,0.0,2.0,0.0,0.0,1.0,rally for coal miners,0
1,12371.json,true,"For the first time in over 40 years, Republica...","elections,history",ed-gillespie,Republican strategist,"Washington, D.C.",republican,2.0,3.0,2.0,2.0,1.0,a speech,1
2,12409.json,true,You can import as many hemp products into this...,"drugs,legal-issues,marijuana",william-devereaux,lawyer,Rhode Island,none,0.0,0.0,0.0,0.0,0.0,a legislative hearing,1
3,1770.json,true,We can prevent terror suspects from boarding a...,"civil-rights,guns,terrorism",michael-bloomberg,,New York,independent,0.0,2.0,2.0,3.0,0.0,an op ed article,1
4,10700.json,pants-fire,"Says Ted Cruz said, There is no place for gays...","candidates-biography,gays-and-lesbians,marriag...",facebook-posts,Social media posting,,none,14.0,18.0,15.0,11.0,36.0,an online meme,0
5,1330.json,false,"In the House health care bill, ""Something like...",health-care,ron-wyden,U.S. Senator,Oregon,democrat,0.0,1.0,0.0,3.0,0.0,MSNBC's 'Morning Meeting With Dylan Ratigan',0
6,5727.json,half-true,Says oil productions down where Obamas in charge.,"corrections-and-updates,energy,message-machine...",crossroads-gps,Conservative advocacy group,,republican,9.0,1.0,4.0,1.0,2.0,a Web ad,-1
7,1717.json,barely-true,Obama did not open new lands to offshore drill...,"economy,energy,environment,florida,transportation",house-natural-resources-committee-republicans,,,republican,1.0,0.0,0.0,0.0,0.0,a news release posted on his website.,0
8,4102.json,half-true,Numerous studies have shown that these so-call...,"jobs,unions",sheila-oliver,Assemblywoman,New Jersey,democrat,0.0,1.0,1.0,3.0,0.0,a press release from Assembly Speaker Sheila O...,-1
9,11629.json,mostly-true,"In 2008, Maggie Hassan voted against legislati...","immigration,voting-record",jennifer-horn,"Chairman, New Hampshire Republican Party",New Hampshire,republican,2.0,0.0,1.0,1.0,0.0,a memo,1
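
The first commented-out attempt in the cell above fails because `('pants-fire' or 'false' or 'barely-true')` is evaluated before the comparison: `or` returns its first truthy operand, so the test only ever compares against `'pants-fire'`. Below is a minimal sketch of the problem and of a membership test that does what was intended. It uses plain Python, and the set names and `to_binary` are hypothetical.

```python
FAKE_RATINGS = {'pants-fire', 'false', 'barely-true'}   # hypothetical helper sets
REAL_RATINGS = {'true', 'mostly-true'}

# `or` short-circuits to the first truthy value, so only one rating is compared.
print('pants-fire' or 'false' or 'barely-true')        # pants-fire
print('false' == ('pants-fire' or 'false'))             # False

def to_binary(rating):
    """0 = fake, 1 = real, -1 = neither (e.g. 'half-true')."""
    if rating in FAKE_RATINGS:
        return 0
    if rating in REAL_RATINGS:
        return 1
    return -1

print([to_binary(r) for r in ['false', 'true', 'half-true']])   # [0, 1, -1]
```

Note that, unlike `binary_seq_target`, this sketch maps an unknown rating to -1 rather than 1. In pandas, the same membership test can be written with `Series.isin`.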


In [45]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[ 0  1 -1]

binary_target,  number of examples
0 5657
1 4507
-1 2627
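
As a consistency check, the binary counts can be reproduced from the six-way counts printed earlier by applying the same mapping as `binary_seq_target`. The sketch below hard-codes those printed counts rather than reading the data.

```python
from collections import Counter

# Six-way label counts as printed by the notebook above.
rating_counts = {'false': 2507, 'true': 2053, 'pants-fire': 1047,
                 'half-true': 2627, 'barely-true': 2103, 'mostly-true': 2454}
# Same mapping as binary_seq_target.
map_r = {'pants-fire': 0, 'false': 0, 'barely-true': 0,
         'half-true': -1, 'mostly-true': 1, 'true': 1}

binary_counts = Counter()
for rating, count in rating_counts.items():
    binary_counts[map_r[rating]] += count

print(dict(binary_counts))                  # {0: 5657, 1: 4507, -1: 2627}
print(binary_counts[0] + binary_counts[1])  # 10164 rows remain after dropping -1
```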


#### Must discard rows with label = -1 ("half-true")  

In [46]:
liar_dev_labels = liar_data.binary_target[liar_data.binary_target >= 0].values  ## discard "half-true"!!!!
print('liar_dev_labels:\n', liar_dev_labels[:10])

liar_dev_labels:
 [0 1 1 1 0 0 0 1 0 0]


In [47]:
print(liar_data.title)


0        (Environmentalists) said Were only going to st...
1        For the first time in over 40 years, Republica...
2        You can import as many hemp products into this...
3        We can prevent terror suspects from boarding a...
4        Says Ted Cruz said, There is no place for gays...
5        In the House health care bill, "Something like...
6        Says oil productions down where Obamas in charge.
7        Obama did not open new lands to offshore drill...
8        Numerous studies have shown that these so-call...
9        In 2008, Maggie Hassan voted against legislati...
10       Hispanic students in Florida perform the best ...
11       Says Milwaukee County Executive Chris Abele el...
12       Limiting labor negotiations to only wages is h...
13       The governor must inform the Senate president ...
14       Says he was known as Veto Corleone for cutting...
15       Compact fluorescent light bulbs are toxic and ...
16       Sen. Joe Biden, the ranking member of the Fore.

### Using the ISOT "text" model, predict the liar_dev_labels and score the predictions. (lower case them first...)

In [48]:
train_data, train_labels = train_set.text.str.lower().values, train_set.target.values   # original ISOT data

#dev_data = liar_data.title[liar_data.binary_target != -1].values  
dev_data = liar_data.title[liar_data.binary_target != -1].str.lower().values      # full LIAR data (lower-cased)
dev_labels = liar_data.binary_target[liar_data.binary_target != -1].values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (31428,)
train_labels shape: (31428,)
[1 1 1 ... 0 1 0]

dev_data shape: (10164,)
['(environmentalists) said were only going to stop coal mining above 2,000 feet. ... well guess where all the coal in the state of tennessee is? above 2,000 feet.']
dev_labels shape: (10164,)
[0 1 1 ... 1 1 0]


In [49]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # original ISOT training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "text" field: Fit using ISOT data; Predict on LIAR data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "text" field: Fit using ISOT data; Predict on LIAR data:
accuracy: 0.541


In [50]:
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total number of non-zero entries in the sparse matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X.nnz): 6598223
Average number of non-zero features per example (per document): 209.947


In [51]:
print('Non-zero elements in matrix (X_dev_transformed.nnz):', X_dev_transformed.nnz)   # total number of non-zero entries in the sparse matrix
print('Average number of non-zero features per example (per document): %.3f' %(X_dev_transformed.nnz/X_dev_transformed.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X_dev_transformed.nnz): 161155
Average number of non-zero features per example (per document): 15.855


In [52]:
print('LIAR true:', len(liar_data.binary_target[liar_data.binary_target == 1]))
#print('LIAR true:', liar_data.binary_target[liar_data.binary_target == 1].shape)
print('LIAR false:', len(liar_data.binary_target[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657
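
A useful reference point for the 0.541 accuracy is the majority-class baseline, i.e. the accuracy of always predicting the more frequent class. A quick calculation from the counts printed above:

```python
liar_true, liar_false = 4507, 5657     # counts printed above
total = liar_true + liar_false

# Accuracy obtained by always predicting the more frequent class.
majority_baseline = max(liar_true, liar_false) / total
print('majority-class baseline: %.3f' % majority_baseline)   # 0.557
```

Always predicting "fake" would score about 0.557 on this LIAR subset, so the ISOT-trained model's 0.541 is slightly below that baseline.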


### REVERSE: Using LIAR model, predict the ISOT "text" and score the predictions. (lower case them first...)

In [58]:
train_data = liar_data.title[liar_data.binary_target != -1].str.lower().values   # full LIAR data
train_labels = liar_data.binary_target[liar_data.binary_target != -1].values 
 
dev_data = train_set.text.str.lower().values                                      # ISOT data
dev_labels = train_set.target.values 


train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
#print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (10164,)
train_labels shape: (10164,)
[1 0 0 ... 1 0 0]

dev_data shape: (31428,)
dev_labels shape: (31428,)
[1 0 1 ... 0 1 0]


In [59]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # LIAR training data
print(X.shape)
X_dev_transformed = vectorizer.transform(dev_data) # ISOT

# MultinomialNB
print('\nMultinomialNB:  Fit using LIAR data; predict on ISOT "text" data')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

(10164, 12231)

MultinomialNB:  Fit using LIAR data; predict on ISOT "text" data
accuracy: 0.568


In [60]:
print('LIAR true:', len(liar_data.binary_target[liar_data.binary_target == 1]))
#print('LIAR true:', liar_data.binary_target[liar_data.binary_target == 1].shape)
print('LIAR false:', len(liar_data.binary_target[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657


NOTE: Results are nearly the same (accuracy ~0.94) for the default CountVectorizer and TfidfVectorizer. Because the full text is used, nearly all TRUE news contains the word "Reuters", which gives the classifier an unfair advantage. The next step is to remove it and run again, expecting lower accuracy.  
We should also account for text starting with: "The following statements were posted to the verified Twitter accounts of U.S. President Donald Trump, @realDonaldTrump and @POTUS. The opinions expressed are his own. Reuters has not edited the statements or confirmed their accuracy."

### Repeat Naive Bayes on text field after removing first chunk of text, including "Reuters"

In [None]:
true_data.head()

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [None]:
true_data.iloc[0,1][22:]

' The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific resear

In [None]:
true_data.iloc[:,1]
#true_data.iloc[13,1]

0        WASHINGTON (Reuters) - The head of a conservat...
1        WASHINGTON (Reuters) - Transgender people will...
2        WASHINGTON (Reuters) - The special counsel inv...
3        WASHINGTON (Reuters) - Trump campaign adviser ...
4        SEATTLE/WASHINGTON (Reuters) - President Donal...
5        WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T...
6        WEST PALM BEACH, Fla (Reuters) - President Don...
7        The following statements were posted to the ve...
8        The following statements were posted to the ve...
9        WASHINGTON (Reuters) - Alabama Secretary of St...
10       (Reuters) - Alabama officials on Thursday cert...
11       NEW YORK/WASHINGTON (Reuters) - The new U.S. t...
12       The following statements were posted to the ve...
13       The following statements were posted to the ve...
14        (In Dec. 25 story, in second paragraph, corre...
15       (Reuters) - A lottery drawing to settle a tied...
16       WASHINGTON (Reuters) - A Georgian-American bus.

In [None]:
# How many of the TRUE NEWS docs contain "Reuters"?  
# How many of the TRUE NEWS docs start with "The following statements"?  

reuters_counter=0
statements_counter=0

for i in range(true_data.shape[0]):
    text = true_data.iloc[i,1]
    # NOTE: str.find() returns 0 for a match at the very start, so "> 0" skips such matches; use ">= 0" to count them too.
    if text.find("Reuters") > 0:
        reuters_counter += 1
    if (text.find("following") > 0) and (text.find("statements") > 0):
        statements_counter += 1

print('reuters_counter:', reuters_counter)
print('statement_counter:', statements_counter)
print('total true docs:', true_data.shape[0])



reuters_counter: 21378
statement_counter: 156
total true docs: 21417
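
A more direct way to express the two questions is with the `in` operator and `str.startswith`, which also avoids the pitfall that `str.find` returns 0 (not a positive number) for a match at the very start of the string. Below is a minimal sketch on a few made-up strings; the list `docs` is hypothetical. On the real data the counts could differ from those printed above, because the loop above tests for "following" and "statements" anywhere in the text.

```python
docs = [
    "WASHINGTON (Reuters) - The head of a conservative faction ...",
    "The following statements were posted to the verified Twitter accounts ...",
    "Reuters has not edited the statements.",
]

contains_reuters = sum("Reuters" in d for d in docs)
starts_with_statements = sum(d.startswith("The following statements") for d in docs)
found_after_start = sum(d.find("Reuters") > 0 for d in docs)   # misses a match at index 0

print(contains_reuters, starts_with_statements, found_after_start)   # 2 1 1
```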


#### Need to remove "Reuters" from True News

In [31]:
re.sub(r"^.?([r,R]euters) - ", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")  


'WASHINGTON (Reuters) - my name is reuters, Reuters is the code'

In [32]:
re.sub(r"[r,R]euters", "", "my name is reuters, Reuters is the code")

'my name is ,  is the code'

In [39]:
re.sub(r"[\w+\s+]+[r,R]euters", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")

'WASHINGTON (Reuters) -, is the code'

In [38]:
re.sub(r"[\w+\s+]+([r,R]euters) - ", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")

'WASHINGTON (Reuters) - my name is reuters, Reuters is the code'

In [40]:
re.sub(r"\w.*[r,R]euters\W*", "","ASPEN, Colorado (Reuters) - The Trump administ")  ### This is the one we want...

'The Trump administ'

In [51]:
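def remove_reuters(text):
    # Greedy ".*" removes everything up to the LAST "Reuters" on a line, not just the dateline
    # (see row 3 of the output below). Inside [...] a comma is a literal, so [rR] is the intended class.
    return re.sub(r"\w.*[rR]euters\W*", "", text)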

true_data['text2'] = true_data['text'].apply(remove_reuters)
true_data.head()

Unnamed: 0,title,text,subject,date,target,text2
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1,The head of a conservative Republican faction ...
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1,Transgender people will be allowed for the fir...
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1,The special counsel investigation of links bet...
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1,for comment. Mueller’s office declined to comm...
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1,President Donald Trump called on the U.S. Post...
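
Row 3 above shows a side effect of the greedy `.*`: it consumes everything up to the last "Reuters" on the line, so text is lost whenever the article mentions Reuters again. One possible alternative, sketched below under the assumption that the dateline always has the form `PLACE (Reuters) - `, anchors the pattern at the start and matches lazily. The function name `strip_dateline` and the sample sentence are made up for illustration.

```python
import re

# Lazy ".*?" stops at the FIRST "(Reuters) - " instead of the last "Reuters".
DATELINE = re.compile(r"^.*?\(Reuters\)\s*-\s*")

def strip_dateline(text):
    """Remove a leading 'PLACE (Reuters) - ' dateline, if present."""
    return DATELINE.sub("", text, count=1)

sample = "WASHINGTON (Reuters) - Trump aide spoke. He told Reuters he declined."
print(re.sub(r"\w.*[r,R]euters\W*", "", sample))   # he declined.
print(strip_dateline(sample))                      # Trump aide spoke. He told Reuters he declined.
print(strip_dateline("No dateline here."))         # No dateline here.
```

This sketch would still strip too much if the first "(Reuters) - " appeared in the middle of an article, so it is worth spot-checking the output on real rows.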


In [54]:
fake_data['text2'] = fake_data['text']

In [55]:
#append the datasets and shuffle them
all_data2 = pd.concat([true_data, fake_data], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
all_data2 = all_data2.sample(frac=1).reset_index(drop=True)

all_data2.describe()

Unnamed: 0,title,text,subject,date,target,text2
count,44898,44898.0,44898,44898,44898,44898.0
unique,38729,38646.0,8,2397,2,38228.0
top,Factbox: Trump fills top jobs for his administ...,,politicsNews,"December 20, 2017",0,
freq,14,627.0,11272,182,23481,627.0


In [56]:
all_data2.head()

Unnamed: 0,title,text,subject,date,target,text2
0,U.S. to review energy royalty rates on federal...,WASHINGTON (Reuters) - The U.S. Interior Depar...,politicsNews,"March 29, 2017",1,The U.S. Interior Department said on Wednesday...
1,Illinois Senate votes for $454 million higher-...,CHICAGO (Reuters) - For the second time in two...,politicsNews,"May 5, 2016",1,"For the second time in two weeks, the Illinois..."
2,New Jersey 'Bridgegate' defendant says he was ...,"NEWARK, N.J. (Reuters) - After days of complai...",politicsNews,"October 17, 2016",1,After days of complaints about traffic jams at...
3,Spain's Rajoy calls on Catalonia leaders to ca...,MADRID (Reuters) - Spanish Prime Minister Mari...,worldnews,"September 20, 2017",1,Spanish Prime Minister Mariano Rajoy on Wednes...
4,"BREAKING: CROOKED VA GOVERNOR, Close Hillary F...",How much more criminal activity are American v...,left-news,"Oct 23, 2016",0,How much more criminal activity are American v...


In [57]:
## Re-define train/dev/test:

train_set = all_data2[ :int(len(all_data2)*train_fract)].reset_index(drop=True)
dev_set = all_data2[int(len(all_data2)*(train_fract)) : int(len(all_data2)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = all_data2[int(len(all_data2)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)


train_data, train_labels = train_set.text2.values, train_set.target.values
dev_data, dev_labels = dev_set.text2.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

print('\ntrain_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print(train_labels)

training set:  (31428, 6)
dev set:  (6735, 6)
test set:  (6735, 6)

train_data shape: (31428,)
train_labels shape: (31428,)
[1 1 1 ... 0 0 0]
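
The slicing above can be wrapped in a small helper. The sketch below only computes the slice boundaries. The fractions 0.7 and 0.15 are an assumption inferred from the printed set sizes, since `train_fract` and `dev_fract` are defined earlier in the notebook, and `split_bounds` is a hypothetical name.

```python
def split_bounds(n_rows, train_fract, dev_fract):
    """Return (train_end, dev_end): rows [0, train_end) are train,
    [train_end, dev_end) are dev, and [dev_end, n_rows) are test."""
    train_end = int(n_rows * train_fract)
    dev_end = int(n_rows * (train_fract + dev_fract))
    return train_end, dev_end

n_rows = 44898                      # rows in all_data2, from describe() above
train_end, dev_end = split_bounds(n_rows, 0.7, 0.15)   # ASSUMED fractions
print(train_end, dev_end - train_end, n_rows - dev_end)   # 31428 6735 6735
```

Computing both boundaries from the same row count avoids mixing lengths from two different DataFrames.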


In [60]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB with "Reuters" removed from text field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB with "Reuters" removed from text field
accuracy: 0.947


In [59]:
#print(train_data)

['The U.S. Interior Department said on Wednesday that it would form a new committee to review royalty rates collected from oil and gas drilling, coal mining and renewable energy production on federal lands to ensure taxpayers receive their full value. Interior Secretary Ryan Zinke said the committee would advise him on whether the government is getting a fair price from companies that lease public land for energy and natural resource development.  The committee will replace the process put in place by former Interior Secretary Sally Jewell to review and overhaul the federal coal leasing program. “The programmatic review put in place (by Jewell) was costly and unnecessary,” Zinke told reporters on Wednesday. “I have established a royalty policy committee to provide advice to me about how we value collections across the board,” he said. In January 2016, the administration of former Democratic President Barack Obama began a multiyear review of the federal coal leasing program after govern

### Run Naive Bayes on the Title field

In [29]:
fake_data.title

#print(fake_data.title.iloc[:15])

0         Donald Trump Sends Out Embarrassing New Year’...
1         Drunk Bragging Trump Staffer Started Russian ...
2         Sheriff David Clarke Becomes An Internet Joke...
3         Trump Is So Obsessed He Even Has Obama’s Name...
4         Pope Francis Just Called Out Donald Trump Dur...
5         Racist Alabama Cops Brutalize Black Boy While...
6         Fresh Off The Golf Course, Trump Lashes Out A...
7         Trump Said Some INSANELY Racist Stuff Inside ...
8         Former CIA Director Slams Trump Over UN Bully...
9         WATCH: Brand-New Pro-Trump Ad Features So Muc...
10        Papa John’s Founder Retires, Figures Out Raci...
11        WATCH: Paul Ryan Just Told Us He Doesn’t Care...
12        Bad News For Trump — Mitch McConnell Says No ...
13        WATCH: Lindsey Graham Trashes Media For Portr...
14        Heiress To Disney Empire Knows GOP Scammed Us...
15        Tone Deaf Trump: Congrats Rep. Scalise On Los...
16        The Internet Brutally Mocks Disney’s New Trum.

In [53]:
train_data, train_labels = train_set.title.values, train_set.target.values
dev_data, dev_labels = dev_set.title.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)

print('dev_data shape:', dev_data.shape)


train_data shape: (31428,)
["Labor Dept. looks into delaying fiduciary rule after Trump's order"]

train_labels shape: (31428,)
[1 1 1 ... 0 1 0]
dev_data shape: (6735,)


In [54]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)


print('X.shape:', X.shape) # (documents, features): one row per document, one column per vocabulary word
print('Vocabulary size (number of features or columns):', X.shape[1])
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total number of non-zero entries in the sparse matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents
print('Fraction of non-zero elements in matrix: %.4f' %( X.nnz/(X.shape[0] * X.shape[1])) )   # X.nnz / (rows * columns)


X.shape: (31428, 18801)
Vocabulary size (number of features or columns): 18801
Non-zero elements in matrix (X.nnz): 382390
Average number of non-zero features per example (per document): 12.167
Fraction of non-zero elements in matrix: 0.0006
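
The sparsity figures above can be checked by hand on a tiny corpus. This is a minimal sketch: the three toy titles are made up for illustration and are not from the ISOT data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles (assumption: not taken from the ISOT dataset).
toy_titles = [
    "senate passes tax bill",
    "watch trump video",
    "senate votes on trump nominee",
]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_titles)

n_docs, n_features = toy_X.shape
avg_nonzero = toy_X.nnz / n_docs               # non-zero features per title
density = toy_X.nnz / (n_docs * n_features)    # fraction of non-zero cells

print('shape:', toy_X.shape)
print('nnz:', toy_X.nnz)
print('average non-zero features per title: %.3f' % avg_nonzero)
print('fraction of non-zero elements: %.4f' % density)
```

Each title only touches the few columns for the words it contains, which is why the real title matrix is so sparse (about 12 non-zero columns out of 18801 per title).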


In [55]:
# Using the standard CountVectorizer, what fraction of the unique words (vocabulary) in the dev data is missing from the training vocabulary?
vectorizer_dev = CountVectorizer()
X_dev = vectorizer_dev.fit_transform(dev_data)  # Independently build a vocabulary using dev_data.
print('X_dev.shape:', X_dev.shape)
print('Vocabulary using train data:', X.shape[1])  # 
print('Vocabulary using dev data:', X_dev.shape[1])                # 

# Feed dev_data into the vectorizer fit using training data.
X_dev_transformed = vectorizer.transform(dev_data)
print('X_dev_transformed shape:', X_dev_transformed.shape)
#print('X_dev_transformed.shape', X_dev_transformed.shape)  # ()  EXPECT .shape[1] equal to original number of features
#print('non-zero indices in X_dev_transformed:', X_dev_transformed.nonzero())  # could also use this to check which features missing...

''' This is way too slow; use set intersection instead!!
# Look at each feature (vocabulary word) in X_dev and see if it is a feature in X.
count = 0
for i in range(X_dev.shape[1]):
    if vectorizer_dev.get_feature_names()[i] in vectorizer.get_feature_names():
        count += 1
print('Count of words (features) in X_dev also in X:', count)   
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - count)/X_dev.shape[1]) )
count of words (features) in X_dev also in X: 12219
Fraction of words in dev data missing from training vocabulary: 0.248
'''

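# vocabulary_ maps each word to its column index, so its keys are the vocabulary.
# It is available across scikit-learn versions, unlike get_feature_names(), which newer releases removed.
set1 = set(vectorizer_dev.vocabulary_)
set2 = set(vectorizer.vocabulary_)
shared = set1.intersection(set2)   # compute the intersection once
print('Count of words (features) in X_dev also in X:', len(shared))
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - len(shared))/X_dev.shape[1]) )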

X_dev.shape: (6735, 10418)
Vocabulary using train data: 18801
Vocabulary using dev data: 10418
X_dev_transformed shape: (6735, 18801)
Count of words (features) in X_dev also in X: 9275
Fraction of words in dev data missing from training vocabulary: 0.110
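
As a sanity check on the missing-vocabulary calculation, here is a minimal sketch on made-up titles (the toy data and variable names are assumptions for illustration). It also shows that `transform` silently drops words the training vectorizer never saw.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data (assumption: not from ISOT).
toy_train = ["senate passes tax bill", "watch trump video"]
toy_dev = ["senate blocks tax plan", "trump video leaked"]

train_vec = CountVectorizer().fit(toy_train)
dev_vec = CountVectorizer().fit(toy_dev)

train_vocab = set(train_vec.vocabulary_)
dev_vocab = set(dev_vec.vocabulary_)

# Dev words that the training vocabulary has never seen.
missing = dev_vocab - train_vocab
frac_missing = len(missing) / len(dev_vocab)
print('missing words:', sorted(missing))
print('fraction of dev vocabulary missing: %.3f' % frac_missing)

# Transforming dev titles with the training vectorizer ignores unseen words,
# so only the shared words contribute counts.
kept_tokens = int(train_vec.transform(toy_dev).sum())
print('dev tokens kept after transform:', kept_tokens)
```

Note that this fraction is over unique words (vocabulary entries), not over word occurrences, which matches how the cell above computes it.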


In [56]:
# MultinomialNB
print('\nMultinomialNB using "title" field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field
accuracy: 0.949


In [57]:
#print('title, target label\n', train_set.title, train_set.target)
print('title, target label\n', train_set.title[4], train_set.target[4])

title, target label
 EPIC RESPONSE AFTER THE BOSTON GLOBE Runs Fake Cover Bashing Trump…THIS IS GREAT! 0


In [66]:
type(train_set.target)

pandas.core.series.Series

In [67]:
type(train_labels)

numpy.ndarray

#### ONE OFF: Human Baseline Examples: read in ISOT examples previously scored by humans in the project group.   Add the predicted class to the file.  

In [60]:
isot_human = pd.read_csv('human_isot_examples.csv')
isot_human.head()

Unnamed: 0,title
0,Turkey condemns U.S. move on Jerusalem as 'irr...
1,UK finance minister's future questioned by PM ...
2,Canada government facing resistance from Senat...
3,Tillerson says would support maintaining Russi...
4,DEPLORABLE! HILLARY’S Campaign Is In PANIC Mod...


In [64]:
X_dev_transformed = vectorizer.transform(isot_human.title)

In [69]:
cols = ['title', 'pred_class']
row_list = []

# Predict once for all titles instead of calling clf.predict inside the loop.
isot_human_preds = clf.predict(X_dev_transformed)

for i in range(len(isot_human.title)):  
    print('\nDev Title: ', isot_human.title.iloc[i])
    print('predicted class =', isot_human_preds[i],'\n' )
    row_list.append(dict( [('title',isot_human.title.iloc[i]), ('pred_class', isot_human_preds[i])]  ))


isot_human_df = pd.DataFrame(row_list, columns=cols)


Dev Title:  Turkey condemns U.S. move on Jerusalem as 'irresponsible'
predicted class = 1 


Dev Title:  UK finance minister's future questioned by PM May's allies as budget nears
predicted class = 1 


Dev Title:  Canada government facing resistance from Senate over pot law
predicted class = 1 


Dev Title:  Tillerson says would support maintaining Russia sanctions for now
predicted class = 1 


Dev Title:  DEPLORABLE! HILLARY’S Campaign Is In PANIC Mode…Their Latest “RACIST FROG” Story Proves It [VIDEO]
predicted class = 0 


Dev Title:   Kentucky Woman Brutally Beaten By Man For Looking Too ‘Masculine’ While Onlookers Do NOTHING
predicted class = 0 


Dev Title:   ALL Of The GOP Candidates Would WRECK Our Environment – Here’s How
predicted class = 0 


Dev Title:  Canadian judge suspends Quebec niqab ban
predicted class = 1 


Dev Title:  THE HIGHEST TAXED PLACES TO LIVE Also Happen To Be Democrat Controlled Cesspools Of Corruption
predicted class = 0 


Dev Title:  BREAKING: BLIND

In [70]:
isot_human_df.head()

Unnamed: 0,title,pred_class
0,Turkey condemns U.S. move on Jerusalem as 'irr...,1
1,UK finance minister's future questioned by PM ...,1
2,Canada government facing resistance from Senat...,1
3,Tillerson says would support maintaining Russi...,1
4,DEPLORABLE! HILLARY’S Campaign Is In PANIC Mod...,0


In [71]:
isot_human_df.to_csv('human_isot_examples.csv', sep=',')

#### Re-do training and dev eval using only lower-case text.  CountVectorizer already lowercases by default (`lowercase=True`), so this should not change the features.

In [68]:
print(train_set.title.values)

['Senate panel sets hearing for Trump tax nominee'
 '“WOODY” KAINE One Of Six ARRESTED After Peaceful Pro-Trump Supporters Were Attacked By VIOLENT RIOTERS At MN State Capitol [VIDEO]'
 'Armed faction takes over protection of Libyan oil and gas complex, fresh concern over migrants'
 ...
 'Authorities Allowed NYC Chain Migration Terrorist To Stop Interrogation Several Times To Pray After Admitting He Was Triggered By CHRISTMAS Posters'
 'Unpopularity of Clinton, Trump puts spotlight on potential running mates'
 'TRUMP FINANCIAL ADVISOR Has Great Tax News For Job Creators: “I think they need to get that done quickly”…MAGA! [VIDEO]']


In [69]:
print(all_data['title'].str.lower().values)

['senate panel sets hearing for trump tax nominee'
 '“woody” kaine one of six arrested after peaceful pro-trump supporters were attacked by violent rioters at mn state capitol [video]'
 'armed faction takes over protection of libyan oil and gas complex, fresh concern over migrants'
 ... 'tillerson underlines cooperation with japan, seoul on north korea'
 'austrian coalition talks set to begin, far right likely partner'
 ' usa swimming metes out punishment that stanford rapist really deserves']


In [70]:
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.title.str.lower().values, dev_set.target.values

In [71]:
print(train_data)
print()
print(dev_data)
print(train_data.shape, dev_data.shape)

['senate panel sets hearing for trump tax nominee'
 '“woody” kaine one of six arrested after peaceful pro-trump supporters were attacked by violent rioters at mn state capitol [video]'
 'armed faction takes over protection of libyan oil and gas complex, fresh concern over migrants'
 ...
 'authorities allowed nyc chain migration terrorist to stop interrogation several times to pray after admitting he was triggered by christmas posters'
 'unpopularity of clinton, trump puts spotlight on potential running mates'
 'trump financial advisor has great tax news for job creators: “i think they need to get that done quickly”…maga! [video]']

['proposed budget deal is much worse on u.s. border security than we originally thought'
 "australian, american, malaysian arrested in indonesia's bali for drugs"
 ' muslim woman who turned in paris terrorist shares important thought with the world'
 ...
 'lol! social media users respond to hillary’s birthday tweet to herself…and they’re hilarious!'
 'white 

In [76]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field
accuracy: 0.947
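
A quick way to confirm that manual lower-casing is redundant is to compare the two vectorizations directly. The sketch below uses two made-up titles (an assumption for illustration). The small differences between this run and the earlier one (accuracy 0.949 vs. 0.947, vocabulary 18801 vs. 18779) are therefore unlikely to come from lower-casing; the first training title also differs between the two runs, which suggests the data were re-split in between, though the notebook does not state this.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mixed-case titles (assumption: not from ISOT).
toy_titles = ["WATCH: Trump Video", "Senate passes TAX bill"]

# Default settings: lowercase=True.
vec_raw = CountVectorizer()
X_raw = vec_raw.fit_transform(toy_titles)

# Same default vectorizer, but fed manually lower-cased text.
vec_lower = CountVectorizer()
X_lower = vec_lower.fit_transform([t.lower() for t in toy_titles])

same_vocab = vec_raw.vocabulary_ == vec_lower.vocabulary_
same_counts = bool((X_raw.toarray() == X_lower.toarray()).all())
print('same vocabulary:', same_vocab)
print('same count matrix:', same_counts)

# Case only matters if lower-casing is switched off explicitly.
vec_cased = CountVectorizer(lowercase=False).fit(toy_titles)
print(sorted(vec_cased.vocabulary_))
```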


In [80]:
# PRINT ISOT DEV TITLES WITH LABELS AND PREDICTED VALUES

cols = ['ISOT title', 'Label', 'Predicted Class']
row_list = []

dev_preds = clf.predict(X_dev_transformed)   # predict once for the whole dev set

for i in range(dev_data.shape[0]):
    #print(dev_data[i], dev_labels[i], dev_preds[i])
    row_list.append(dict( [('ISOT title', dev_data[i]), ('Label', dev_labels[i]), ('Predicted Class', dev_preds[i])]  ))

dev_set_df = pd.DataFrame(row_list, columns=cols)

In [81]:
print(dev_set_df.size)
print(len(dev_set_df))
dev_set_df.head()

20205
6735


Unnamed: 0,ISOT title,Label,Predicted Class
0,proposed budget deal is much worse on u.s. bor...,0,0
1,"australian, american, malaysian arrested in in...",1,1
2,muslim woman who turned in paris terrorist sh...,0,0
3,unreal! house moves to ban sale or display of ...,0,0
4,democrat accused of sexual harassment just thr...,0,0


In [82]:
#dev_set_df.to_csv('isot_dev_set.csv', sep=',')

In [83]:
print('clf.feature_count_ :', clf.feature_count_.shape)

# Build the column index -> word list once; vocabulary_ maps word -> column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake count', 'Real count'))
    top_indices = np.argsort(clf.feature_count_[i,:])[::-1][:100]   # top 100 most frequent words for this class
    for feature_index in top_indices:
        print('%19s %12d %12d' %(feature_names[feature_index], clf.feature_count_[0,feature_index], clf.feature_count_[1,feature_index]))
    print()

clf.feature_count_ : (2, 18779)

MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD   Fake count   Real count
                 to         6587         5401
              trump         6494         3920
              video         5952           18
                the         4417          396
                 of         3570         2133
                for         3329         1991
                 in         3259         3293
                and         2597          433
                 on         2540         2328
                 is         2031          304
              obama         1820          455
               with         1707         1047
            hillary         1617           32
              watch         1354           18
              about         1197          182
                his         1140          151
                 it         1136          173
                 he         1095          244
              after         1088          701
 

          democrats          268          202
           campaign          310          202
          lawmakers           16          200
              urges            4          197
            myanmar            0          196
              trade           11          192
              putin          107          192
            sources           13          191
       presidential           87          191
            senator          183          190
         opposition            4          190
            foreign          104          190
         healthcare           70          189
             former          226          188
            islamic           64          188
             mexico           67          185
              about         1197          182
             attack          209          182
             border           84          182
             syrian           52          178
             german           35          177
            britain           21  

#### Many words in the Title show an imbalance between Fake News and Real News.  In the counts above, "trump" appears about 1.7 times as often in Fake as in Real titles (6494 vs. 3920), "hillary" about 50 times as often (1617 vs. 32), "watch" about 75 times as often (1354 vs. 18), and "video" about 330 times as often (5952 vs. 18).  Words such as "he", "his", "she", "her", "him", "it", "they", "them", "we", "us", "like", "here", "donald", "gop", "liberal", "media", "america", "muslim", "racist", "breaking" are also heavily favored in Fake news in this dataset.
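
The imbalance can be read straight off the per-class counts. A minimal sketch, using four rows copied from the "MOST COMMON WORDS" table printed above (the array names are illustrative):

```python
import numpy as np

# Counts copied from the table above (lower-cased titles): Fake = class 0, Real = class 1.
words = ["trump", "video", "hillary", "watch"]
fake_counts = np.array([6494, 5952, 1617, 1354])
real_counts = np.array([3920, 18, 32, 18])

# Fake-to-Real ratio of raw title counts for each word.
ratios = fake_counts / real_counts
for word, ratio in zip(words, ratios):
    print('%10s  fake:real = %.1f : 1' % (word, ratio))
```

On the fitted model, the same arrays would presumably come from `clf.feature_count_[0]` and `clf.feature_count_[1]`. Words with a zero count in one class (for example "myanmar" in Fake titles) would need a small constant added to the denominator before taking the ratio.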

In [84]:
## Print out top MISCLASSIFICATIONS

# Make predictions on the dev data and show the top n documents where the ratio R is largest, where R is:
# R = maximum predicted probability / predicted probability of the correct label

n=2000
print('X_dev_transformed.shape:', X_dev_transformed.shape)  # (6735, 18779)
print()

# Compute probabilities and predictions once for the whole dev set instead of once per example.
pred_probs = clf.predict_proba(X_dev_transformed)   # shape: (number of dev examples, 2)
dev_preds = clf.predict(X_dev_transformed)
correct_labels = dev_labels.astype(int)
max_pred_prob = pred_probs.max(axis=1)
pred_prob_correct_label = pred_probs[np.arange(len(correct_labels)), correct_labels]
r_array = max_pred_prob / pred_prob_correct_label   # R = 1.0 whenever the prediction is correct

print('max R:', np.max(r_array)) 
#print('mean R:', np.mean(r_array))
sorted_r = -np.sort(-r_array)
print()

cols = ['text', 'label', 'pred_class', 'R']
row_list = []

sorted_indices = np.argsort(r_array)[::-1]   # dev indices ordered from largest to smallest R
for label_index in sorted_indices[:n]:
    print('R:', r_array[label_index])    
    print('\nDev Title',str(label_index)+':\n', dev_data[label_index])
    print('\nLABEL =', dev_labels[label_index])
    print('predicted class =', dev_preds[label_index],'\n' )
    print(70*'-')
    row_list.append(dict( [('text',dev_data[label_index]), ('label', dev_labels[label_index]), 
                         ('pred_class', dev_preds[label_index]), ('R', r_array[label_index])]  ))


error_df = pd.DataFrame(row_list, columns=cols)
#error_df.loc[i] = [dev_data[label_index], dev_labels[label_index], clf.predict(X_dev_transformed)[label_index], r_array[label_index]]
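
To make the ratio R concrete, here is a minimal sketch on made-up probabilities (the toy arrays are assumptions, not model output). R equals 1.0 when the predicted class matches the label and grows as the model becomes more confident in the wrong class.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 dev examples;
# columns are P(fake = 0) and P(real = 1).
toy_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.75, 0.25],
    [0.40, 0.60],
])
toy_labels = np.array([0, 0, 1, 1])   # hypothetical true labels

max_prob = toy_probs.max(axis=1)
prob_of_label = toy_probs[np.arange(len(toy_labels)), toy_labels]
toy_R = max_prob / prob_of_label

# Largest R first: the most confident mistakes come out on top.
order = np.argsort(toy_R)[::-1]
for idx in order:
    print('example %d: R = %.2f, label = %d, predicted = %d'
          % (idx, toy_R[idx], toy_labels[idx], toy_probs[idx].argmax()))
```

Examples 1 and 2 are the two misclassified ones here, so they rank first; the correctly classified examples tie at R = 1.0, which is why the long tail of the output below is all `R: 1.0`.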

X_dev_transformed.shape: (6735, 18779)

max R: 4999260.399758549

R: 4999260.399758549

Dev Title 4655:
 stockholm study: us & europe top arms trade globally – saudi arabia’s weapons imports skyrocket over 200 percent

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 4999260.399758549

Dev Title 6658:
 stockholm study: us & europe top arms trade globally – saudi arabia’s weapons imports skyrocket over 200 percent

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 385986.37565659604

Dev Title 4432:
 russian lawmaker warns: north korea ready to launch missile capable of hitting u.s.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 192188.46410006026

Dev Title 6270:
 n. korea’s latest missile launch aimed at testing carrying “large scale heavy nuclear warhead”

LABEL = 0
predicted class = 1 

--------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 162.23182876673704

Dev Title 3868:
 hillary clinton makes surprise appearance at new york film panel

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 158.03087690546639

Dev Title 114:
 obama’s epa pushes for tougher mileage standards for trucks

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 134.6350165378123

Dev Title 5999:
 trump allies falsely link reuters to claim detroit video feed was cut short

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 127.01834381539528

Dev Title 4439:
 top us spy agency refuses to endorse cia’s ‘russian hacking’ assessment due to “lack of evidence”

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 124.85594777524324

Dev Title 2

predicted class = 1 

----------------------------------------------------------------------
R: 26.14111798306174

Dev Title 6544:
 waving german flag, far-right and anti-islam groups rally together before vote

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 25.023334957016857

Dev Title 6059:
 clinton attacks trump's outreach to black voters in new ad

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 24.953894231467242

Dev Title 2116:
 u.s. state department tweets, then deletes congratulations for iran oscar win

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 24.599802438453995

Dev Title 619:
 clinton's it aide to plead the fifth in email lawsuit: the hill

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 22.955570315306826

Dev Title 651:
 cpac says trum

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 10.52189027265216

Dev Title 6608:
 trump tells state department to make cut more than 50% of funding to u.n.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 10.198961202017005

Dev Title 5783:
  senate gives trump jr. ultimatum: respond by friday or face subpoena

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 10.129908188171132

Dev Title 277:
 bill clinton, tim kaine cancel iowa event after police shooting

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 9.94694774806254

Dev Title 4377:
 tiffany & co. takes big risk…sides against trump on very controversial issue

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 9.755222672868836

Dev Title 5452:
 i

R: 5.15719812460325

Dev Title 1229:
 trump takes populist message to u.s. heartland in 'thank you' tour

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5.134641765959553

Dev Title 1899:
 trading algorithm shows how mass shootings, politics boost gun shares

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5.087541493042257

Dev Title 3003:
  dem. senator blasts mitch mcconnell for excluding women from panel drafting new senate health plan

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 5.034618804185088

Dev Title 1120:
  senator asks doj to step in after white supremacists vow to engage in poll watching

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 5.024745768964768

Dev Title 227:
 new york times stands by trump story, rebuts claim of libel

LABEL = 

 obama brags about hijacking 1.35 million acres in utah, nevada…despite opposition…posts wrong picture of land

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 2.8793276149295255

Dev Title 5107:
 clinton ad blitz outpaces trump as his super pacs bow out

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2.8764486434758196

Dev Title 5279:
 obama regime agrees to cut deal with iran that further threatens our national security

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 2.782930628917775

Dev Title 1573:
 fired from 'apprentice,' omarosa may get trump white house job

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2.7585654850476424

Dev Title 3248:
 new report: obamaphone program stashed $9 billion in private bank accounts…exposes massive windfall for ph

R: 1.999619434590249

Dev Title 726:
 trump campaigns in california, denounces protesters at rally as 'thugs'

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.949742795820203

Dev Title 1552:
  marcobot malfunction: new data shows rubio’s campaign in crisis (photos)

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.931627089742608

Dev Title 1213:
  republican senator prays for god to kill obama as soon as possible: ‘let his children be fatherless’

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.922559711480483

Dev Title 3478:
  top dems take action to keep documents on russia investigation safe from trump

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.8972040693970387

Dev Title 4953:
 u.s. vote authorities warned to be alert to russian hacks fak

predicted class = 1 

----------------------------------------------------------------------
R: 1.167364752904823

Dev Title 5084:
 u.s. judge bars colorado from enforcing law banning voter 'selfies'

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.1450982491473556

Dev Title 4764:
 former intelligence officials say trump is being manipulated by putin

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1.141434004277621

Dev Title 6630:
 21wire.tv members newsletter – sept 9, 2016

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.1279235478363356

Dev Title 531:
 u.s. spy chief james clapper: u.s. must be prepared for a ‘large armageddon-scale’ cyber attack

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1.1103057704872648

Dev Title 5101:
  clinton’s lates

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2251:
 lol! this one picture sums up trump’s brutal smackdown of mainstream media at today’s press conference

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1752:
 u.s. spy chief to resign as trump takes office

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1753:
 walmart removes controversial t-shirt but black lives matter tees remain

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1754:
  evan mcmullin: ‘never in his life’ has trump had to watch his mouth (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1760:
  jeb bush blames the pope for making him lose the gop nomination (

R: 1.0

Dev Title 2235:
 insults fly during obama’s town hall in laos…except the insults were directed at americans! [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2234:
 breaking: trump just made a huge announcement…proving he’s the only candidate who truly believes #blacklivesmatter

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2202:
 storm maria brings fear, pain and shock to puerto ricans

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1768:
 handpicked successor and son of disgraced lawmaker john conyers reportedly body-slammed, spit upon and slit his girlfriend with knife

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2233:
 obama expected to sign iran sanctions act extension i

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2342:
  on their anniversary, bill writes hillary a love note; the right flips out

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2343:
 china adds new protections for graft suspects amid detention system overhaul

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2344:
  this homeless veteran receives a gift from a child he’ll never forget (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1737:
  watch: trevor noah takes white grievance queen tomi lahren to the woodshed

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2346:
 late uk pm heath had questions to answer over child sex ab

 u.s. senate republicans want to speed trump nominee approvals

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2387:
 senator elizabeth warren to meet leandra english on monday: aide

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2389:
  donald trump jr tweets unintentionally humiliating photo, blocks people who mock him

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2399:
 u.s. farm lobby turns up heat on trump team as nafta talks near

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2391:
 guatemala congress withdraws bill that cut anti-graft penalties

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2392:
 uk certain ir

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2270:
 in big win for trump, senate approves his conservative court pick

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2271:
  obama visits wounded warriors after trump uses soldiers as campaign props

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2272:
 british pm may's voice repeatedly fails in keynote speech

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2273:
 tax policy expert eyed for senior white house economic post: cnbc

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2274:
 they fought and died to protect total strangers from socialism…why young voters are embraci

Dev Title 2313:
  scholastic yanks children’s book for its portrayal of happy slaves (image)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2314:
 czech far-right party says will not support new government

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2315:
 kerry may visit cuba soon for human rights dialogue

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2316:
 florida lawmakers: couples can move in without saying 'i do'

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2317:
 a young father explains socialism to his 10 year old son…a must read for every american

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2318:
 no

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1960:
 republican asia experts say trump presidency would be 'ruinous'

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1961:
  donald trump’s negatives hit record highs after orlando shooting

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1962:
 liberia's supreme court halts election preparation over fraud accusations

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1983:
 'big question' is whether rohingya can go home: u.n. refugee chief

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1985:
 trump talks trade with eu, varied differences remain

LABEL = 1
predicted class = 1 

---------

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1861:
  on the anniversary of the berlin wall coming down, trump is on his way towards building his own

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1862:
  confused looking trump totally humiliates himself when he gets lost on stage in poland

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1863:
  trump’s relationship with this woman could lead to huge scandal for gop

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1864:
  michael jordan ‘can no longer stay silent’ on police shootings and is taking action

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1866:
 harvey weinstein rape a

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1879:
 watch: bikers for trump ready to take a stand against antifa thugs: “twinkle toes and butter cups” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1880:
  watch: cnn anchor takes hannity to the woodshed for letting trump call election ‘rigged’

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1900:
 u.n. special envoy urges poland to open up debate on judicial reform

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1903:
 u.s. congress tangles with facebook, other social media firms over russia probe

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1904:
 murdered north korea

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1789:
  he ‘belongs in an institution’: james comey’s republican father rips trump

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2110:
 childish cnn host refuses to call trump her president: “he’s your president” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1788:
 senate intelligence panel chief: comey firing may slow russia probe

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2111:
 kremlin says hopes comey firing will not hurt russia-u.s. ties

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2112:
  this 19-year-old from flint destroys trump, calls for him to end campaign

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2151:
 wow! even cnn’s reporting on mueller’s new russian investigation hires who made major contributions to hillary, barack, schumer campaigns

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2152:
  watch: megyn kelly takes a shot at hannity by wondering when trump will talk to a real journalist

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2153:
 thai authorities close in on yingluck's escape accomplices

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2154:
 democrats admit plan to commit mass voter fraud [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2155:
 halloween fire

R: 1.0

Dev Title 2035:
 revealed: loretta lynch given talking points for secret clinton ‘tarmac meeting’

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2036:
 russia moves to shelter banks funding arms makers from sanctions

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2037:
 ‘color rev’ agit prop: george soros moveon agitators march on america – as billionaire instigator sued

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2039:
  donald trump makes another truly pathetic attempt at making hispanics like him (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2040:
 nato sees growing russia, china challenge; higher risk of war

LABEL = 1
predicted class = 1 

---------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2081:
 south sudan rebel groups clash, at least three dead

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2082:
 pm may is driving britain to cliff-edge brexit: labour leader

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 1795:
 trump backtracks on cyber unit with russia after harsh criticism

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2083:
 “very fake news” cnn president like oz behind the curtain…feeds anti-trump questions to anchors while interviewing guests on live tv [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 1794:
 despite showman reputation, trump inauguration

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2995:
 (video)sec of state kerry goes full-on delusional in comments on the fall of ramadi

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2996:
 more jobs! dow company ceo at mi trump rally: announces new mi plant: “we’re not a ‘red tape’ company, but a ‘red carpet’ country for american businesses”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2997:
 “meathead” rob reiner calls for ‘all out war’ against trump…twitter user warns “hollywood folks” to be careful…they might just get a war

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2998:
  watch trump’s former campaign manager totally forget that john kerry lost in 2004

LABEL = 0
predicted class = 0 

---

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3056:
  trevor noah and the daily show correspondents collapse with the giggles over this gop nightmare (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3032:
 britain says minister's remarks offer no basis for action against jailed aid worker in iran

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3031:
 raw video: a shocking tour of the detroit ghetto…ruled by democrats for decades! [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3030:
 a legend in his own mind? joe biden has regrets about not running and you won’t believe why

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev T

R: 1.0

Dev Title 2851:
 boom! mother of black son murdered by blacks: “i don’t preach black lives matter, because black lives only matters when law enforcement is involved” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2852:
 al sharpton blames racism for trump victory, then says he’s not gonna blame anyone [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2854:
 irony alert! dc’s day without women literally led by a man…event turns into anti-trump rally: “he is wrong. we have to stop him.” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2855:
 hungary prosecutors probe accounts of opposition jobbik party

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2856:
  watch: cnn

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2903:
 talk of shifting funds away from trump premature: republican official

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2904:
 north korea to make announcement at 0330 gmt: yonhap

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2905:
 trump-backed candidate for senate heads to alabama run-off

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2906:
 in afghan review, trump's frustration carries echoes of obama years

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2907:
 repeat deceit: how us tries to link iran to al qaeda

LABEL = 0
predicted class = 0 

------------------------------

R: 1.0

Dev Title 3168:
 instant view: republicans pull obamacare repeal bill

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3169:
  canadian hilariously humiliates three trump supporters for insulting his country

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3171:
 dancer who hid peru rebel chief freed from prison after 25 years

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3172:
 black trump supporter delivers powerful message: “#blacklivesmatter is making black people look like fools” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3173:
 trump vows 'insurance for everybody' in replacing obamacare

LABEL = 1
predicted class = 1 

--------------------------------------------------

R: 1.0

Dev Title 3271:
 republicans retreat from plan to curb some press camera access in u.s. capitol

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3247:
 ramadan abdullah set free on bail after police make shocking discovery of weapons in storage locker destined for secretive islamic compound in upstate ny

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3246:
  gop rep. wants a $30k a year housing allowance; twitter rips him a new one

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3245:
 trump talks pardons amid probes of russia role in u.s. election

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3244:
 trump consults republican senators on fed chief candidates

LABEL = 1
predicted class 

R: 1.0

Dev Title 3104:
 transcript and video of president trump’s inauguration speech: “january 20th 2017, will be remembered as the day the people became the rulers of this nation again”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3105:
  hillary used campaign event with obama to perfectly roast trump’s birther past (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3106:
  racist mississippi town gives mlk day a new name, and twitter annihilates them for it (tweets)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3107:
 eu foreign policy chief expects strong eu backing for iran deal

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3108:
  red cross raises tens of millions and you won’

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3151:
 highlights: the trump presidency on march 14 at 9:25 p.m. et

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3153:
  watch: black driver bravely shames cop who drew his gun over a turn signal infraction

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3154:
 more californians register to vote but fewer are republicans, state says

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 3155:
 italian catholics told to “pray silently” so as not to “offend” muslim refugees living in church [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 3156:
 chronology: it only took seven decades: 

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2561:
 egypt's sisi congratulates trump, looks forward to new era of closer ties

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2562:
 redux 1963? the deep state vs donald trump

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2563:
 defense chief say he has power to set afghan troop levels

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2564:
  medicaid directors of all 50 states issue joint statement slamming gop health bill

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2565:
 syria fighting worst since aleppo, air strikes deadly: aid agencies

LABEL = 1
predicted class = 

R: 1.0

Dev Title 2581:
 negative tone of white house race sours young voters

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2635:
 just in: close friend of hillary clinton who planned to “step away from politics” found dead in apparent suicide

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2620:
 michigan senator urges congress to retain electric car tax credit

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2609:
 tennessee bill would allow counselors to deny service based on religion

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2610:
 yemen humanitarian situation likely to worsen with saleh death: mattis

LABEL = 1
predicted class = 1 

---------------------------------------------------

Dev Title 2600:
  trump vows that his administration will ‘hire american’ and twitter rips him apart

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2601:
  leaked footage exposes astonishing and criminal antics of catholic priest in his own church (video)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2602:
 russia warns of serious consequences from u.s. strike in syria

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2603:
 new poll shows philippine president still hugely popular

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2604:
  watch: al sharpton catches john mccain acting exactly like donald trump toward immigrants

LABEL = 0
predicted class = 0 

--------------------------------------

R: 1.0

Dev Title 2435:
 threatened u.s. pullout might help, not hobble, global climate pact

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2436:
 inspired by 'blasphemy killer', new pakistani party eyes 2018 vote

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2437:
 trump congratulates china's xi on 'extraordinary elevation'

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2439:
 'disastrous' conditions for migrants displaced by libya clashes, official says

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2440:
 u.s. voters say yes to big bond issues, mixed message on taxes

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2488:
  dad of murdered reporter hits trump, gop beautifully over ‘assault on 2nd amendment nonsense’

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2489:
 a young father explains socialism to his 10 year old son…a must read for every american

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2490:
 muslims silent after terror attacks…but blame trump after witnesses give description of “tall hispanic” who killed ny imam and assistant [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2492:
  scaramucci tv appearance goes off the rails as show brings special guest to mock him (video)

LABEL = 0
predicted class = 0 

----------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2755:
 o’reilly nails it! obama, trump and the true meaning of fairness and equality [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2756:
 congress probes ny fed's handling of bangladesh bank heist: letter

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2757:
 merkel gets strong backing from her party after talks fail

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2758:
 kazakhstan to re-examine 2004 banker's death, may target nazarbayev critic

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2759:
 u.s. house passes $1.2 trillion measure to fund government

LABEL = 1
predicted

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2814:
  jason chaffetz prepares to run for the hills as the russia probe heats up (details)

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2815:
 anti-assad nations say no to syria reconstruction until political process on track

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2816:
 this everyday american’s video is guaranteed to make you laugh: “funny, i never trusted donald trump until the media told me not to.”

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2817:
 u.s. diplomat engaging in back-channel diplomacy with north korea: ap

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2649:
 trial against guatemalan president's brother, son begins

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2650:
 u.s. says holds myanmar military leaders accountable in rohingya crisis

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2651:
 senate confirms huntsman as ambassador to russia

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2652:
 open-border liberals put entire nation on high alert: german spy chief warns 1,000+ radical islamists ready to attack…over 100 isis members among refugees

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2653:
 the single chart to share that te

----------------------------------------------------------------------
R: 1.0

Dev Title 2694:
 watch: only 6 people show up to see hillary at tx airport…and she ignored all 6 of them

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2695:
 sanders lashes out at clinton in contentious democratic debate

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 2696:
  history channel classification of the 2016 election makes trump supporters lose their sh*t

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2697:
  trump admin. tosses out another obama rule – his sons can be even worse douchebags now

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 2698:
 ag jeff sessions warns leakers…taking steps to stop the l

predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 575:
 u.s., nordic nations call on russia's military to comply with obligations 

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 573:
 (video) water wars: obama’s epa power grab to regulate puddles, ponds, and just about anything that’s wet

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 437:
 u.s. policy on iran won't harm its oil industry: minister

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 559:
 u.s. house russian probe leaders promise to protect fbi investigation

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 549:

LABEL = 1
predicted class = 1 

------------------------------

R: 1.0

Dev Title 650:
 ronald reagan: remembering our heroes [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 652:
 factbox: international reaction to arrest of reuters reporters in myanmar

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 653:
 syrian muslim man whose family perished on trip so he could get free dental care has new spokesperson role

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 654:
 mockingbird redux? cnn’s role in peddling fake ‘nothing burger’ russia-gate news revealed

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 630:
 espn senior writer says cops, soldiers singing at ballparks is racist show of power…says 9/11 police officers aren’t heroes [video]

LABEL = 0
pred

predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 482:
 nation of islam joins #blacklivesmatter terrorists to shut down chicago’s popular “miracle mile” on busiest shopping day of year [videos and photos]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 483:
 awesome! tucker carlson and stephen cohen: “something’s going on…i’ve never seen anything like this” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 484:
 trump would undo obama's cuba moves unless religious freedom allowed

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 485:
 trump decides against russia 'war room' in the white house: official

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------

R: 1.0

Dev Title 534:
  report: trump tried to pull his own saturday night massacre, wh aides said hell no

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 493:
 giuliani pulls out of consideration to serve in administration: trump

LABEL = 1
predicted class = 1 

----------------------------------------------------------------------
R: 1.0

Dev Title 535:
  trump comes unglued, continues miss universe feud with insane early morning twitter rant

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 536:
 wow! trump reveals embarrassing story about anti-trump msnbc hacks “crazy mika” and “psycho joe” [video]

LABEL = 0
predicted class = 0 

----------------------------------------------------------------------
R: 1.0

Dev Title 537:
 malaysia's dissent on myanmar statement reveals cracks in asean facade

LABEL = 1
predicted class = 1 

-----------

In [85]:
print(error_df.size)   # total number of cells (rows x columns)
print(len(error_df))   # number of rows
error_df.head()

8000
2000


   text                                               label  pred_class  R
0  stockholm study: us & europe top arms trade gl...  0      1           4999260.0
1  stockholm study: us & europe top arms trade gl...  0      1           4999260.0
2  russian lawmaker warns: north korea ready to l...  0      1           385986.4
3  n. korea’s latest missile launch aimed at test...  0      1           192188.5
4  trump’s plan to rein in wall street? tax cuts...   0      1           30778.05


In [86]:
error_df.to_csv('nb_isot_title_errors.csv', sep=',')
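
The R column saved above is the ratio used throughout the misclassification cells: the maximum predicted probability divided by the predicted probability of the correct label. The following is a minimal sketch of computing that ratio for all dev examples at once, working from log-probabilities (the kind of array `MultinomialNB.predict_log_proba` returns) so that very small probabilities are subtracted rather than divided. The function name `misclassification_ratio` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def misclassification_ratio(log_proba, labels):
    """Return R = max class probability / probability of the true label, per row.

    log_proba: (n_samples, n_classes) array of log-probabilities
    labels:    (n_samples,) integer class indices of the true labels
    """
    log_proba = np.asarray(log_proba, dtype=float)
    labels = np.asarray(labels, dtype=int)
    rows = np.arange(log_proba.shape[0])
    # log(max_prob / prob_correct) = max(log_proba) - log_proba[correct]
    log_r = log_proba.max(axis=1) - log_proba[rows, labels]
    return np.exp(log_r)

# Toy stand-in for clf.predict_log_proba(X_dev_transformed) (hypothetical values).
toy_log_proba = np.log(np.array([[0.9, 0.1],
                                 [0.2, 0.8],
                                 [0.5, 0.5]]))
toy_labels = np.array([0, 0, 1])

r = misclassification_ratio(toy_log_proba, toy_labels)
order = np.argsort(-r)  # indices from largest R to smallest
```

Whenever the predicted class matches the label, the two probabilities are the same and R is exactly 1.0. That is why the listing above shows long runs of `R: 1.0` with matching LABEL and predicted class once the misclassified titles are exhausted.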

### Using the ISOT "title" model, predict the ISOT "text" classes 

In [87]:
# Train on lowercased ISOT titles; evaluate on lowercased ISOT article text
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.text.str.lower().values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)

print('dev_data shape:', dev_data.shape)


train_data shape: (31428,)
['senate panel sets hearing for trump tax nominee']

train_labels shape: (31428,)
[1 0 1 ... 0 1 0]
dev_data shape: (6735,)


In [88]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

print('X.shape:', X.shape)
print('X_dev_transformed.shape:', X_dev_transformed.shape)

# MultinomialNB
print('\nMultinomialNB trained using ISOT "title" field; predict ISOT "text" classes')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

X.shape: (31428, 18779)
X_dev_transformed.shape: (6735, 18779)

MultinomialNB trained using ISOT "title" field; predict ISOT "text" classes
accuracy: 0.771
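
The fit-on-one-field, score-on-another pattern used in the cell above can also be expressed as a single scikit-learn pipeline, which keeps the vectorizer and classifier together and avoids accidentally re-fitting the vocabulary on the dev field. This is a sketch only: the toy titles, texts, and labels are hypothetical stand-ins for `train_set.title`, `dev_set.text`, and their targets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the title/text fields and labels (hypothetical data).
toy_titles = ["senate passes budget bill", "wow watch this shocking video",
              "court rules on trade case", "you won't believe this video"]
toy_title_labels = [1, 0, 1, 0]
toy_texts = ["the senate passed the budget bill on tuesday",
             "shocking video shows what you won't believe"]
toy_text_labels = [1, 0]

# The vocabulary is learned from the titles only; tokens that appear
# only in the texts are ignored when the texts are scored.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(toy_titles, toy_title_labels)
acc = model.score(toy_texts, toy_text_labels)
print('accuracy: %3.3f' % acc)
```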


### Using the ISOT "title" model, predict the liar_dev_labels and score the predictions (both inputs lowercased first)

In [89]:
# Train on lowercased ISOT titles (original ISOT data)
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values

# Evaluate on the full LIAR data, lowercased, keeping only rows where binary_target != -1
liar_mask = liar_data.binary_target != -1
dev_data = liar_data.title[liar_mask].str.lower().values
dev_labels = liar_data.binary_target[liar_mask].values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (31428,)
['senate panel sets hearing for trump tax nominee']
train_labels shape: (31428,)
[1 0 1 ... 0 1 0]

dev_data shape: (10164,)
['we have a $2 trillion to $3 trillion infrastructure deficit. weve got work that needs to get done. weve got people that want to go to work.']
dev_labels shape: (10164,)
[1 0 0 ... 1 0 0]


In [90]:

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # original ISOT training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field: Fit using ISOT data; Predict on LIAR data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field: Fit using ISOT data; Predict on LIAR data:
accuracy: 0.535
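
An accuracy figure on a binary task is easier to interpret next to the score of always predicting the most frequent class. The sketch below computes that majority-class rate from a label array. The helper name `majority_class_accuracy` and the toy labels are assumptions for illustration; in the notebook the input would be `dev_labels`.

```python
import numpy as np

def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels)   # occurrences of each class index
    return counts.max() / labels.size

# Hypothetical labels; replace with dev_labels in the notebook.
toy_dev_labels = np.array([1, 0, 0, 1, 0, 0])
baseline = majority_class_accuracy(toy_dev_labels)
print('majority-class accuracy: %3.3f' % baseline)
```

If the 0.535 accuracy above turns out to be close to the majority-class rate of the LIAR labels, that would suggest the ISOT title model transfers little to LIAR statements.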


In [91]:
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total count of non-zero entries in the sparse training matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X.nnz): 381492
Average number of non-zero features per example (per document): 12.139


In [92]:
print('Non-zero elements in matrix (X_dev_transformed.nnz):', X_dev_transformed.nnz)   # total count of non-zero entries in the sparse dev matrix
print('Average number of non-zero features per example (per document): %.3f' %(X_dev_transformed.nnz/X_dev_transformed.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X_dev_transformed.nnz): 154189
Average number of non-zero features per example (per document): 15.170
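
The non-zero counts above only cover tokens that exist in the training vocabulary. A complementary check is the share of dev tokens that fall outside that vocabulary. The sketch below estimates it in plain Python, using a regular expression that mirrors `CountVectorizer`'s default `token_pattern` and lowercasing. The helper names `tokenize` and `oov_rate` and the toy documents are illustrative assumptions.

```python
import re

# Same pattern as CountVectorizer's default token_pattern:
# tokens of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_RE.findall(doc.lower())

def oov_rate(train_docs, dev_docs):
    """Fraction of dev tokens that never occur in the training documents."""
    vocab = {tok for doc in train_docs for tok in tokenize(doc)}
    dev_tokens = [tok for doc in dev_docs for tok in tokenize(doc)]
    if not dev_tokens:
        return 0.0
    unseen = sum(tok not in vocab for tok in dev_tokens)
    return unseen / len(dev_tokens)

# Hypothetical documents; in the notebook these would be train_data and dev_data.
toy_train = ["senate panel sets hearing", "trump tax nominee"]
toy_dev = ["senate sets tax deficit"]
rate = oov_rate(toy_train, toy_dev)
print('out-of-vocabulary rate: %.3f' % rate)
```

Tokens outside the fitted vocabulary are ignored by `vectorizer.transform`, so a high rate would mean that much of each LIAR statement contributes nothing to the prediction.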


In [93]:
#print(X_dev_transformed)
print(train_data)

['senate panel sets hearing for trump tax nominee'
 '“woody” kaine one of six arrested after peaceful pro-trump supporters were attacked by violent rioters at mn state capitol [video]'
 'armed faction takes over protection of libyan oil and gas complex, fresh concern over migrants'
 ...
 'authorities allowed nyc chain migration terrorist to stop interrogation several times to pray after admitting he was triggered by christmas posters'
 'unpopularity of clinton, trump puts spotlight on potential running mates'
 'trump financial advisor has great tax news for job creators: “i think they need to get that done quickly”…maga! [video]']


In [94]:
print(dev_data)

['we have a $2 trillion to $3 trillion infrastructure deficit. weve got work that needs to get done. weve got people that want to go to work.'
 'says bag litter increased after san francisco banned single-use shopping bags.'
 'human activity is not causing these dramatic changes to our climate.'
 ...
 'says that during his eight years as florida governor, we created 1.3 million net new jobs -- more jobs created than texas.'
 'says chris koster fell silentas attorney general in pursuing a case against a free classified service accused of promoting prostitution after accepting over $12,000 in campaign contributions from people affiliated with the service.'
 'the auditor [for the city of providence] was not locked out of access to the citys finances.']


In [95]:
## Print out top MISCLASSIFICATIONS

# Make predictions on the dev data and show the top n documents where the ratio R is largest, where R is:
# R = maximum predicted probability / predicted probability of the correct label

n = 2000
print('X_dev_transformed.shape:', X_dev_transformed.shape)  # (10164, 18779)
print()

# Compute probabilities and predictions once, rather than once per dev example
pred_probs = clf.predict_proba(X_dev_transformed)
pred_classes = clf.predict(X_dev_transformed)

# One R value per dev example (10164)
max_pred_prob = pred_probs.max(axis=1)
pred_prob_correct_label = pred_probs[np.arange(pred_probs.shape[0]), dev_labels]
r_array = max_pred_prob / pred_prob_correct_label

print('max R:', np.max(r_array))
print()

cols = ['text', 'label', 'pred_class', 'R']
row_list = []

# Indices of the n largest R values, largest first
top_indices = np.argsort(r_array)[::-1][:n]
for label_index in top_indices:
    print('R:', r_array[label_index])
    print('\nDev Title', str(label_index) + ':\n', dev_data[label_index])
    print('\nLABEL =', dev_labels[label_index])
    print('predicted class =', pred_classes[label_index], '\n')
    print(70 * '-')
    row_list.append({'text': dev_data[label_index], 'label': dev_labels[label_index],
                     'pred_class': pred_classes[label_index], 'R': r_array[label_index]})

error_df2 = pd.DataFrame(row_list, columns=cols)


X_dev_transformed.shape: (10164, 18779)

max R: 1.0068274792602476e+39

R: 1.0068274792602476e+39

Dev Title 8003:
 hospitals, doctors, mris, surgeries and so forth are more extensively used and far more expensive in this country than they are in many other countries.''	health-care	mitt-romney	former governor	massachusetts	republican	34	32	58	33	19	a fox news sunday interview
9874.json	barely-true	obamacare cuts seniors medicare.	health-care,medicare	ed-gillespie	republican strategist	washington, d.c.	republican	2	3	2	2	1	a campaign email.
3072.json	mostly-true	the refusal of many federal employees to fly coach costs taxpayers $146 million annually.	government-efficiency,transparency	newsmax	magazine and website	florida	none	0	0	0	1	0	an e-mail solicitation
2436.json	mostly-true	florida spends more than $300 million a year just on children repeating pre-k through 3rd grade.	education	alex-sink		florida	democrat	1	2	2	4	0	figures cites on campaign website
9721.json	true	milwaukee county

predicted class = 0 

----------------------------------------------------------------------
R: 1381139272427.7778

Dev Title 7475:
 (deborah) ross defends those who want to burn the american flag, and even called efforts to ban flag-burning ridiculous, yet refused to help a disabled veteran fly the flag.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1152286933898.4456

Dev Title 4708:
 when we passed (the stand your ground law), we said it portends horrific events when peoples lives were put into these situations.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 996640276431.7708

Dev Title 3284:
 the iranians are now saying that what were saying the deal is and what they understand it to be are two different things.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 771393772384.4912

Dev Title 873:
 women have 

R: 25035119109.156372

Dev Title 3562:
 a small majority of americans dont think they like the affordable care act, but a large majority of americans dont want to do away with the protections that are in the affordable care act.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 24714793682.116577

Dev Title 8089:
 just about everyone everywhere is spending more hours on the job, less time with their families, bringing home smaller and smaller paychecks, while theyre paying more and more at the gas pump and the grocery stores.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 23964946992.1742

Dev Title 6547:
 the words subhuman mongrel, which ted nugent called president barack obama, were used by the nazis to justify the genocide of the jewish community.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 19537635606.42

R: 3582660652.72882

Dev Title 7624:
 the president promised to close the space gap, but he now seems intent on repeating the events that created the space gap in the first place -- putting in place a new rocket design and then trying to underfund the effort...

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3484559600.3223386

Dev Title 9061:
 during the bush administration, you actually had a prominent liberal write a book about how bush was preparing for a fascist takeover of this country.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3148196742.2602096

Dev Title 8265:
 the marijuana that kids are smoking today is not the same as the marijuana that jeb bush smoked 40 years ago.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3077476665.457691

Dev Title 5126:
 i took on the worst road system in the country

R: 643816316.8559713

Dev Title 5213:
 the typical white male worker in this country is making in real terms what he was making in 1973 and the average worker is making what they were making in 1996.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 642594360.3074553

Dev Title 1247:
 this is the most generous country in the world when it comes to immigration. there are a million people a year who legally immigrate to the united states.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 638285146.8798432

Dev Title 2478:
 this (schip) is socialized medicine. it is going to go to families that make $60,000 a year. those aren't poor children.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 619440569.7840726

Dev Title 6317:
 gov. palin ... is somebody who actually doesn't believe that climate change is man-made.

LABEL

predicted class = 0 

----------------------------------------------------------------------
R: 166649777.94548717

Dev Title 2122:
 the irs doesnt have to prove something against you ... youve got the burden of proof.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 164852955.15444598

Dev Title 303:
 while fat-cat bureaucrats at the department of education are getting paid an average salary of $102,000 a year, teachers in georgia are getting paid half of that.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 152675375.80071

Dev Title 1521:
 on the night of the iowa caucuses, obama promised the nation that he would do health care reform focused on cost containment, he opposed an individual mandate, and he said he was going to do it with republicans.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 150634614.26416

R: 83761784.09796436

Dev Title 9251:
 if you take into account all the people who are struggling for work, or have just stopped looking, the real unemployment rate is over 15 percent.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 82893125.26225387

Dev Title 3205:
 every 10th dollar spent by the social security administration on its program for the poor is waste, or fraud, they cant validate that the people should have gotten it, totaling about $8 billion a year.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 79074770.37252755

Dev Title 9634:
 the hyde amendmentlanguage was in the (human trafficking) bill. the democratic sponsor admits it was in the bill, and she voted for it.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 75285767.93986852

Dev Title 5768:
 its actually easier for the united states to get

predicted class = 0 

----------------------------------------------------------------------
R: 30463884.017173767

Dev Title 6219:
 the governor won this state with 49 percent. we had some of the closest races in the house in history. so youre not dealing with this 70-30, like they want to make it.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 30096638.94193928

Dev Title 2633:
 there are corporations in this nation, some of the biggest corporations in this nation, who do not pay taxes.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 29296204.536345076

Dev Title 6386:
 says opponent kathie tovo believes that austin invests too much in the cops, firefighters and paramedics that protect our families and neighborhoods.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 29037922.049516007

Dev Title 2309:
 currentl

R: 15518202.048925404

Dev Title 9068:
 when you look at the number of crashes before the cameras were installed compared to after, theyre virtually the same.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 15392754.313048052

Dev Title 3532:
 twenty-five percent of our kids in foster care are there because their parents are involved in drugs.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 15334455.37340457

Dev Title 4557:
 since 2009, georgias public schools have lost nearly 9,000 classroom teachers while the number of students has gone up.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 14841297.796738738

Dev Title 3923:
 the average family (is) now bringing home $4,000 less than they did just five years ago.

LABEL = 1
predicted class = 0 

------------------------------------------------------------------

R: 8760166.955418846

Dev Title 2187:
 if you make the average amount of people in wisconsin, $50,000, you got $1.60 less a week in taxes under the state income-tax cut, but it didnt show up in your paycheck.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8722375.1437597

Dev Title 5782:
 john mccain "voted against the tax cuts of 2001 and 2003, wrongly claiming they helped only the rich."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8321471.188541626

Dev Title 1627:
 ronald reagan met with gorbachev; kennedy met with khruschev; and nixon met with mao and these were folks who have done horrendous damage not only to their own countries but to other countries.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8022727.734729394

Dev Title 8686:
 says mccain once said that on "the most important issues of our day

R: 5005918.492883661

Dev Title 1991:
  ... following world war ii war crime trials were convened. the japanese were tried and convicted and hung for war crimes committed against american pows. among those charges for which they were convicted was waterboarding.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4950224.801198866

Dev Title 5691:
 members of the public are being charged $50 to hear gov. scott walker and a dozen members of his administration talk about jobs and the economy at lambeau field.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4783170.140950097

Dev Title 3690:
 when rick scott was deposed in lawsuits about his company, he took the fifth 75 times.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4740172.95005173

Dev Title 400:
 barack obama got more campaign contributions from fannie mae 

 the no. 1 issue that the american people care about is getting america back to work.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2963366.2542037345

Dev Title 522:
 when social security first started, there was 16 workers for every retiree. today there are three workers for every retiree and soon there will be only two for every retiree.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2869875.003196389

Dev Title 1232:
 today, the five largest financial institutions are 38 percent bigger than they were back in 2008, when they were too big to fail.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2860549.1222728346

Dev Title 1685:
 when deborah took office, the (sellwood bridge) project had languished for years, with only $11 million in funds. with her leadership the remaining funding was secured that got th

R: 1996374.6901774001

Dev Title 1691:
 we've won twice as many states. we've won a greater share of the popular vote.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1969897.494912056

Dev Title 3368:
 its legal to sign a recall petition (against gov. scott walker) even if you have already signed another recall petition, but only one signature counts

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1950666.0886476883

Dev Title 1285:
 by one leading measure, what business owners pay out in wages and salaries is now finally growing faster than what they spend on health insurance for the first time in 17 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1904333.3406927139

Dev Title 1931:
 the fed created $1.2 trillion out of nothing, gave it to banks, and some of them foreign banks, so that they could stabil

R: 1306742.435733137

Dev Title 8783:
 pregnant women trying to buy health insurance on their own are barred from maternity coverage because they have a pre-existing condition.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1280144.2693117687

Dev Title 6568:
 thanks to you, in under one week we have the largest gop (facebook) page in the governor's race!

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1279110.5720530003

Dev Title 1434:
 if you want to see the jobs that ive saved and created in this storm he helped create, go anywhere in ohio.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1272771.2374393705

Dev Title 518:
 the dallas cowboys cant put a sticker on their helmets for the 5 police officers who were killed.

LABEL = 1
predicted class = 0 

--------------------------------------------------------

R: 893010.2596025355

Dev Title 6931:
 he admits he still doesn't know how to use a computer, can't send an e-mail.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 892008.5177870733

Dev Title 6022:
 gov. lawton chiles said "if i were to become ... become speaker of the house it would be (his) worst nightmare."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 886453.9521356217

Dev Title 9714:
 under republican-backed state budget, the state education agency estimates expansion of wisconsins school voucher program could cost nearly $2 billion annually

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 846818.3349166585

Dev Title 63:
 there are more words in the irs code than there are in the bible.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 840581.7639

----------------------------------------------------------------------
R: 550831.0044628215

Dev Title 5231:
 of the 10,000 rhode island children in charter school lotteries for the fall, more than 70 percent are poor .... in addition, a majority of them are students of color.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 549083.6415836208

Dev Title 1954:
 we also have, in a park thats not far from here, an ability to build a reservoir that can hold a 30-day water supply for the city of atlanta.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 539405.043078575

Dev Title 1828:
 when the united states invaded iraq, saddam hussein wanted to acquire weapons of mass destruction, and "he said so himself after his capture."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 538024.831153344

Dev Title 2630:
 on immigra

 there is more oil produced at home than we buy from the rest of the world the first time thats happened in nearly 20 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 389495.5369416421

Dev Title 4476:
 the providence economic development partnership . . .which you [cicilline] chaired, loaned $103,000 in taxpayer funds to one of your campaign workers. the worker never paid back the loan.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 379395.7202587039

Dev Title 737:
 manufacturing wages today in america on a per-hour basis are actually a bit lower than average wages in the economy as a whole.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 377830.60106725624

Dev Title 2366:
 says king street patriots held a fundraiser featuring an author who believes that registering the poor to vote is un-american.

LA

predicted class = 0 

----------------------------------------------------------------------
R: 312514.4835689631

Dev Title 634:
 the fbi is involved in an investigation about people in the rio grande valley who are using cocaine to buy votes.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 307913.88043090695

Dev Title 6850:
 i think with the exception of the last year or maybe the last two years, we were at 100 percent when it came to contributing to the providence pension fund.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 307121.59264473105

Dev Title 3032:
 says mitt romney is committed to overturning roe vs. wade, and he supports such amendments that define a life as beginning at the moment of conception.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 300957.50481705624

Dev Title 259:
 i never gave up

R: 250224.72718810552

Dev Title 7433:
 we have only one person on the (tva) board, to my knowledge, who even has any corporate board experience.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 249267.72203765262

Dev Title 10079:
 we just had the best year for the auto industry in america in history.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 248565.81508392905

Dev Title 7404:
 a trade war is something very different (than curbing new trade agreements). we went down that road in the 1930s. it made the great depression longer and more painful.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 241460.77779437287

Dev Title 10007:
 a 50-50 public-private split for paying for a new milwaukee bucks arena would be much better -- in terms of the portion of the public financing -- than most of the other arena proje

predicted class = 0 

----------------------------------------------------------------------
R: 203258.10986341178

Dev Title 7552:
 atlanta has issued an increasing number of citations - and collected an increasing amount of revenue - since mayor kasim reed took office in 2010.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 199367.05088330692

Dev Title 5597:
 my son had to resign his job because of federal regulations that washington has put on us.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 197351.26365096736

Dev Title 5090:
 legally, it doesnt make any difference which state district you live in when running for congress.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 196116.7957738954

Dev Title 4491:
 mitt romney boasts that he is "proud to be the only major candidate for president to sign the tax p

R: 151588.52625106144

Dev Title 4380:
 ed schultz said alan grayson is what its all about.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 151426.54131841176

Dev Title 5670:
 planned parenthood is an organization that funnels millions of dollars in political contributions to pro-abortion candidates.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 150782.17167174176

Dev Title 7623:
 rhode island could tell you who has a camper, but we couldnt figure out who has a gun.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 149305.69850469966

Dev Title 7193:
 under ted cruzs tax plan, businesses will now have to pay 16 percent on the money they make. they will also have to pay 16 percent on the money they pay their employees.

LABEL = 1
predicted class = 0 

------------------------------------------------------------

R: 106519.61684468387

Dev Title 8220:
 says president barack obama has launched twice as many strikes (on)countries that are predominantly muslim than president george w. bush.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 106210.93637680553

Dev Title 8713:
 the university of texas is starting the first medical school at a major tier one university in the last 50 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 105783.45338188348

Dev Title 2997:
 we created 800,000 new jobs, we cut the unemployment rate almost in half and today new york state has more private sector jobs than it has ever had in its history

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 105775.5296469128

Dev Title 3717:
 there are 321,092 public school teachers in texas. and there are 313,850 non-teachers in our public schools.

LABE

R: 71890.77672933342

Dev Title 251:
 when mitt romney was governor of massachusetts, he stood in front of a coal plant and pointed at it and said, this plant kills.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 70259.18496814909

Dev Title 338:
 about two-thirds of all consumption is services. . . it was just the opposite when the sales tax was enacted in the 30s.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 70024.76981846633

Dev Title 8674:
 making public colleges and universities tuition-free, that exists in countries all over the world, used to exist in the united states.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 69549.03849914465

Dev Title 5873:
 had the online change of address been in place in 2008 an estimated 130,000 voters who cast provisional ballots could have changed their address onlin

R: 52721.45537898637

Dev Title 9579:
 new mexico was 46th in teacher pay (when he was elected), now we're 29th.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 51616.39224174272

Dev Title 4411:
 sen. harry reid voted against declaring english our national language, twice.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 51611.82755681177

Dev Title 8606:
 at nearly 19 million people, the population of florida is larger than all the earlier primary and caucus states combined.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 51394.01871798779

Dev Title 8789:
 91% of suspected terrorists who attempted to buy guns in america walked away with the weapon they wanted.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 51386.3073863088

Dev Title 2940:
 we are the

R: 41594.85355901372

Dev Title 1789:
 romney used to favor gun control, and said he didn't want to go back to reagan-bush.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 41575.27949285359

Dev Title 3900:
 one in 19 americans today get ssdi or ssi. thats one in 19 americans (who) are disabled.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 41526.370915964326

Dev Title 6974:
 the lgbt community is more often the victims of hate crimes than any other recognized group.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 41419.359183801476

Dev Title 7958:
 felony crimes in the city of atlanta are the lowest they have been since 1969.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 41300.50323325797

Dev Title 7467:
 says hillary clinton was completely exone

predicted class = 0 

----------------------------------------------------------------------
R: 33519.90086865307

Dev Title 2440:
 just days ago, irans supreme leader (ali) khamenei, who will oversee implementation of this agreement, was calling israel a rabid dog and accusing the united states of war crimes.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 33313.991771583635

Dev Title 2523:
 hillary clinton "actually differed with (john mccain) by arguing for exceptions for torture before changing positions."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 32646.523687232944

Dev Title 878:
 president obamas recent plan to cut $100 million of waste within his administration wont actually save money because hes going to spend it elsewhere.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 32343.738968990772

Dev 

R: 26346.84302084669

Dev Title 6785:
 under president george w. bush, the u.s. had 52 months of ... uninterrupted job creation and revenues were at an all-time high in 2007.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 26067.626055578086

Dev Title 7701:
 you can go to georgia and make about $6,000 more a year as a teacher.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 25852.855155954883

Dev Title 2312:
 says mary burkes madison school district will be the only school district left in the state to ignore the (act 10) law in the 2015-16 school year.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 25727.07332598117

Dev Title 658:
 says president barack obama released a statement on the death of brother robin williams before ... a statement on brother michael brown.

LABEL = 1
predicted class = 0 

--------

R: 20555.29862610516

Dev Title 3338:
 here in california, donald trump has given $12,000 to jerry brown, gavin newsom and kamala harris.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 20520.487900886026

Dev Title 671:
 there are justice department policies against fbi director james comey discussing details of a federal investigation so close to an election.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 20473.246370127443

Dev Title 4231:
 says george lemieux was one of two republicans who voted for president barack obamas jobs bill.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 20308.355388928976

Dev Title 7967:
 the cost for renovating the headquarters of the u.n. has doubled from the original estimate.

LABEL = 1
predicted class = 0 

-------------------------------------------------------------------

R: 17596.29359359503

Dev Title 7282:
 i went to the olympics that was out of balance, and we got it on balance.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 17491.72488681233

Dev Title 783:
 george bush never suggested that we eliminate funding for planned parenthood.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 17472.43640246261

Dev Title 8112:
 i joined the gang of 14, seven republicans, seven democrats, so that we wouldn't blow up the united states senate. sen. obama had the opportunity to join that group. he chose not to.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 17408.304487561556

Dev Title 4408:
 under donald trumps tax plan, the top 0.1 percent of taxpayers -- people earning multiple millions of dollars a year, on average -- would get more tax relief than the bottom 60 percent of taxpayers

predicted class = 0 

----------------------------------------------------------------------
R: 14435.160206354094

Dev Title 5068:
 in 1978, a student who worked a minimum-wage summer job could afford to pay a years full tuition at the 4-year public university of their choice.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 14381.24144034129

Dev Title 1229:
 saysbernie sanders voted for what we call the charleston loophole.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 14305.484819882946

Dev Title 2267:
 sarah palin was repeating "abraham lincoln's words" in discussing the war in iraq.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 14159.52025732991

Dev Title 8492:
 almost 400 arrests in the city last year for panhandling-related offenses involved just 78 suspects, an indication that the same people are p

predicted class = 0 

----------------------------------------------------------------------
R: 11905.024800523079

Dev Title 553:
 based on the august 2012 national jobs numbers, for every person who got a job, nearly four people stopped looking for a job.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 11820.518884894816

Dev Title 7291:
 we spend more on health care than any other country, but we're ranked 47th in life expectancy and 43rd in child mortality.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 11791.54678218565

Dev Title 1130:
 the national debt "has gone up by $1,729,000,000 during the isner v. mahut match."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 11659.09635493372

Dev Title 5425:
 says newt gingrich said spanish is the language of the ghetto.

LABEL = 1
predicted class = 0 

----------

 there are a larger number of shark attacks in florida than there are cases of voter fraud.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 9887.020599065396

Dev Title 421:
 out of 150,000 tenured teachers in the last 10 years, only 17 have been dismissed for incompetence.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 9838.884445826918

Dev Title 1002:
 when i go on letterman the ratings go up.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 9810.480036714185

Dev Title 3526:
 we're spending $1.6 billion for all of latin america in terms of aid and assistance, a fraction of what we're spending in iraq, the $500 billion we've spent there

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 9785.643785050257

Dev Title 2353:
 milwaukee is the most segregated

predicted class = 0 

----------------------------------------------------------------------
R: 8377.728040610735

Dev Title 6798:
 more than 125 jobs have been created at the milwaukee job corps center.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8332.421357269053

Dev Title 2401:
 more people were killed by terrorists in 2015 than in any other year ever, after an 80 percentincrease from 2014.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8232.723823286435

Dev Title 4989:
 the average time someone used to hold a share of stock back in the 60s was eight years. now, the average time is four months.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 8213.91076353596

Dev Title 9831:
 the world food demand is going to double sometime between now and 2070.

LABEL = 1
predicted class = 0 

-----------------------

R: 7180.477829212331

Dev Title 2798:
 says stefani carter repeatedly used her campaign contributors donations to rent margarita machines for our state house office.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7137.40788804908

Dev Title 3727:
 john mccain said last year he didn't know of a solution to the mortgage crisis.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7124.186879102938

Dev Title 1716:
 his biography says he was the top rotc officer in the nation.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 7067.929960714097

Dev Title 2996:
 we borrow a million dollars every minute.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 6990.628178805419

Dev Title 4105:
 i'm the product of a mixed marriage that would have been illegal in 12 states w

predicted class = 0 

----------------------------------------------------------------------
R: 5893.157097216204

Dev Title 9625:
 says hillary clinton was literally present when we pressed the reset button with russia just a few months after russia had invaded georgia.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5890.150565425575

Dev Title 9170:
 a million people a year come into the u.s. legally. no other country even comes close to that figure.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5830.577034656717

Dev Title 8067:
 im the only member of the house of representatives who raised most of his campaign funds in the last election from small contributions of less than $200.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5827.471946036238

Dev Title 7379:
 nearly 60 percent of women who use birth co

R: 5212.08434533518

Dev Title 4914:
 says marco rubio has the worst voting record there is today.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5180.027696556184

Dev Title 6591:
 we doubled the size of the company (hewlett-packard).

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5172.3031810580205

Dev Title 9780:
 obamas secretary of energy, dr. steven chu, has said publicly he wants us to pay european levels (for gasoline), and that would be $9 or $10 a gallon.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 5131.065245679291

Dev Title 2288:
 in the past two years in congress, ive written more bills, passed more amendments on the floor of the house and enacted more of my bills into law than any other member of the house.

LABEL = 1
predicted class = 0 

---------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 4489.702760706121

Dev Title 2614:
 says u.n. arms treaty will mandate a new international gun registry.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 4488.486260566986

Dev Title 3231:
 we pay among the highest tolls in the nation for the privilege of crossing that bridge.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4462.787917530538

Dev Title 1334:
 a majority of the candidates on this stage have supported amnesty ... i have never supported amnesty.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 4448.6093620904885

Dev Title 3114:
 says rick perry supported a guest worker program to help people who would otherwise be illegal aliens.

LABEL = 1
predicted class = 0 

-------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 3479.17904903111

Dev Title 9172:
 the health care plan for members of congress "is no better than the janitor who cleans their offices."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3479.0344427926425

Dev Title 8061:
 before the republican wave in 2010, democrats had an advantage on the generic ballot in congress. even in 1994 with the gingrich revolution ... democrats had that advantage.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3451.429593485052

Dev Title 7952:
 barack obama won't even use the term 'war on terrorism.'

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3446.7159998950206

Dev Title 2762:
 says if you try to hide marijuana in a hemp field, it becomes worthless. the thc goes away.

LABEL = 1
p

R: 3103.662277745462

Dev Title 9402:
 police officers in this state have that right, to check the immigration status of people they arrest.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3102.360668839848

Dev Title 2676:
 the extra point is almost automatic. (the nfl) had five missed extra points this year out of 1,200 some odd attempts.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3065.5453727451563

Dev Title 3074:
 virginia was named best managed state, best state for business and best state toraise a child while i was governor.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3057.141891040102

Dev Title 1615:
 says the multnomah county library system is the second busiest in the nation.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 3055.82138

R: 2683.724207767592

Dev Title 667:
 of the 98 top oxycodone-dispensing doctors who used to live in florida, today, there are none.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2667.4734440135294

Dev Title 236:
 dan patrick called for increasing the gas tax and the state sales tax.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2667.1586176827545

Dev Title 8712:
 says the number of americans living at or below the poverty line is at its highest level since 1964.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2659.087374805335

Dev Title 5778:
 says more than $3.5 billion in state revenue that is supposed to be dedicated to basic needs and functions is being diverted to make the books look balanced.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 

R: 2310.9157004554563

Dev Title 4823:
 the number of african-american men in prison has increased fivefold since he left office.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2297.026236058803

Dev Title 7500:
 says state rep. jim keffer, a gop lieutenant to house speaker joe straus, "did mail pieces for democrat mark strama to help him defeat" a republican.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2292.4252131317885

Dev Title 8517:
 before our last three presidents, you have to go back to the 1800s, early 1800s to find three presidents in a row being consecutively re-elected.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2292.1234302880066

Dev Title 3246:
 john mccain said...in december he was surprised there was a subprime mortgage problem.

LABEL = 1
predicted class = 0 

-----------------------

R: 2031.590741073033

Dev Title 193:
 alabamas crimson tide will be the underdog in saturdays game against the georgia bulldogs, the first time in 72 consecutive games that alabama has not been favored by oddsmakers to win.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2023.4101632006088

Dev Title 9039:
 about 40 percent of workers dont ... have a single paid sick day.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2021.4074540853305

Dev Title 1322:
 the u.s. department of homeland security warned that the save database is not a foolproof means of verifying (citizenship on) the voter rolls.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 2009.4909031959667

Dev Title 6566:
 among the developed nations, we are the least economically and socially mobile country in the world.

LABEL = 1
predicted class = 0 

-

R: 1790.1680971949663

Dev Title 1522:
 there are close to 900,000 unemployed veterans in america right now.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1786.8883486298582

Dev Title 7225:
 roughly 500,000 georgians -- or about 5 percent of the states residents -- have gone through a background check to legally obtain a georgia weapons carry license.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1783.7803804304729

Dev Title 2605:
 says rick scott changed his promise from700,000 jobs created on top of what normal growth would be to just 700,000 jobs.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1782.8200561463182

Dev Title 2981:
 the arizona state board of educations failure to report teachers whose certifications have been revoked or suspended ... resulted in the death of a student.

LABEL = 1
predict

R: 1448.589887216249

Dev Title 6418:
 says federal health care overhaul will cost texas state government upwards of $30 billion over the next 10 years.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1433.0389688413418

Dev Title 4085:
 (paul ryan) has actually proposed three total, three bills that have become law in his entire career dating back to 1999.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1432.6880419752788

Dev Title 8694:
 the very first meal on the surface of the moon was the holy communion.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1431.3560883071943

Dev Title 3536:
 says madison mayor paul soglins stated intent when proposing that city contractors disclose private political donations was to discourage contributions to organizations with which he disagrees.

LABEL = 1
predicted class =

R: 1170.6720940322045

Dev Title 5880:
 in rural virginia, sen. warner ran 8-10 points ahead of a traditional democrat -- ahead of senator kaine, ahead of governor mcauliffe.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 1170.514440615496

Dev Title 8083:
 texas is home to millions of latinas, but the state has never elected a latina to congress.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1168.3093107620803

Dev Title 4969:
 says british voters under 50, especially millennials, overwhelmingly voted to stay, in the european union. it was older voters who voted to leave.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1155.374879512349

Dev Title 9074:
 under scott walker, right now, were 46th in the country in terms of new businesses started.

LABEL = 1
predicted class = 0 

-------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 1042.1733089369363

Dev Title 8128:
 women in florida make 83 cents for every dollar a man makes.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1039.2836317594683

Dev Title 7383:
 in america, radical speech is not a crime.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1037.1263575623052

Dev Title 2374:
 says gov. rick scott returned $1 million in federal funding that would have helped the state cover the cost of overseeing insurance rates under the new health care law.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1035.9641884157234

Dev Title 3155:
 hillary clinton "starts off with 47 percent of the country against her."

LABEL = 1
predicted class = 0 

--------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 928.5405943509214

Dev Title 7521:
 john mccain has said the economy is "not his strong suit."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 927.7215533350085

Dev Title 4416:
 as a senator, barack obama supported "an amendment that basically gutted the legal temporary worker program."

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 926.246915658554

Dev Title 4314:
 sometimes i was the only no vote on the entire board.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 907.0009687046321

Dev Title 4744:
 in wisconsin, only half of all the adults with serious psychological distress received mental health treatment or medication.

LABEL = 1
predicted class = 0 

----------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 750.3929215365233

Dev Title 1862:
 household incomes are down more than $4,000 since the year 2000.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 737.7942320400978

Dev Title 9997:
 snitker has been virtually ignored by the major media.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 737.7834364011019

Dev Title 2221:
 says scott walker cut back early voting and signed legislation that would make it harder for college students to vote.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 737.2446013863703

Dev Title 7478:
 jim renacci cheated on his income taxes and is a deadbeat citizen.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 737.1910925375148

Dev Tit

R: 607.9830626091743

Dev Title 2453:
 says she balanced a $10 billion budget shortfall without raising taxes.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 607.1826550485167

Dev Title 5195:
 quarterbacks won the (super bowl) mvp more than 50 percent of the time.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 606.230133597983

Dev Title 2179:
 in africa, a child dies every minute because of (malaria).

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 605.444588923628

Dev Title 6372:
 this census is also the shortest and least intrusive count in modern history.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 604.0952683517969

Dev Title 1698:
 u.s. sen. johnny isakson has voted for $7 trillion of our national debt!

LABEL = 1
predicted class = 0 

----

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 537.9617695526508

Dev Title 1732:
 have the suburbs been inundated with former residents of atlanta housing projects? absolutely not.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 537.4695951687644

Dev Title 7716:
 says over 50 percent of u.s. job growth in june came from wisconsin.

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 536.0167652707937

Dev Title 7065:
 on lee fishers watch, almost nine out of 10 jobs that ohio lost were lost to other states, not to other countries.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 535.177451044567

Dev Title 1675:
 a child born in america today will inherit $1.5 million in debt the moment theyre placed in their mothers arms.

LABEL = 1
predicted class = 0 

--

predicted class = 0 

----------------------------------------------------------------------
R: 489.61833028164216

Dev Title 7488:
 as ceo of wwe, linda mcmahon was caught tipping off a ringside physician about a federal investigation into illegally distributing steroids to wrestlers.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 489.3029270511515

Dev Title 4298:
 we have seen hate crimes skyrocket in the wake of the immigration debate.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 488.4247577913371

Dev Title 7768:
 texas has the highest rate of uninsured in the nation. ... and there are more uninsured children in texas than in any other state.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 487.92664400758247

Dev Title 6652:
 the federal government owns about half of the west, yet it continues to acquir

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 433.49053826239356

Dev Title 8512:
 barry smitherman doesnt have enough legal experience to apply for most of the jobs at the attorney generals office.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 432.80570616140125

Dev Title 9261:
 we have an increase in murder within our cities, the biggest in 45 years.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 430.3328436619045

Dev Title 1715:
 women receive only 77 cents for every dollar a man earns.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 429.3148411293322

Dev Title 7069:
 in the 1950s, "a lot of people got rich and they had to pay a top tax rate of 90 percent."

LABEL = 1
predicted class = 0 

-------------------------------------------------------

predicted class = 0 

----------------------------------------------------------------------
R: 360.5490397658864

Dev Title 1485:
 over the last 40 years, this countrys prison population has grown by 500 percent.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 359.8999068511732

Dev Title 1942:
 a bipartisan background check amendment outlawed any (gun) registry. plain and simple, right there in the text.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 359.6673491039705

Dev Title 8900:
 women make 77 cents for every dollar a man earns.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 359.5050984561923

Dev Title 5179:
 the black unemployment rate (has) increased since the recovery has begun.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 358.0081897691

In [96]:
print(error_df2.size)
print(len(error_df2))
error_df2.head()

8000
2000


Unnamed: 0,text,label,pred_class,R
0,"hospitals, doctors, mris, surgeries and so for...",1,0,1.0068269999999999e+39
1,georgia has the most restrictive ballot access...,1,0,4.538944e+31
2,let's pay attention to kids who are not going ...,1,0,5.77479e+27
3,fifty-six percent decline in overall crime. a ...,1,0,9.824992e+22
4,"what difference, at this point, does it make? ...",1,0,2.865032e+22


In [97]:
error_df2.to_csv('nb_isot_title_to_liar_errors.csv', sep=',')
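
The `R` column is not defined in this part of the notebook. The sketch below assumes one common definition for ranking misclassifications: the highest predicted class probability divided by the probability the model assigned to the true label, so a large `R` marks a confident mistake. The name `confidence_ratio` and the toy probabilities are illustrative assumptions; with the notebook's classifier the probabilities would come from `clf.predict_proba`.

```python
def confidence_ratio(probs, labels):
    """Return max(P(class)) / P(true label) for each example.

    probs  : list of per-class probability lists, one row per example
    labels : list of integer true labels (index into each row)
    A value of 1.0 means the true label was the top prediction.
    """
    return [max(row) / row[label] for row, label in zip(probs, labels)]


# Toy probabilities (assumed, not taken from the notebook's model).
toy_probs = [[0.9, 0.1],   # model favors class 0
             [0.2, 0.8]]   # model favors class 1
toy_labels = [1, 1]        # both examples are truly class 1

ratios = confidence_ratio(toy_probs, toy_labels)
# The first example is a confident error (ratio about 9); the second is correct (ratio 1).
ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
print(ratios, ranked)
```

Naive Bayes probabilities can be extremely small, which is consistent with the very large `R` values in the table above. In that regime it is usually safer to subtract log-probabilities (for example from `clf.predict_log_proba`) than to divide raw probabilities.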

### REVERSE: Using the LIAR model, predict the ISOT "title" field and score the predictions (titles are lowercased first)

In [98]:
train_data = liar_data.title[liar_data.binary_target != -1].str.lower().values   # full LIAR data
train_labels = liar_data.binary_target[liar_data.binary_target != -1].values 
 
dev_data = train_set.title.str.lower().values                                      # ISOT training split, used here as the evaluation set
dev_labels = train_set.target.values 


train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
#print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (10164,)
train_labels shape: (10164,)
[1 0 0 ... 1 0 0]

dev_data shape: (31428,)
dev_labels shape: (31428,)
[1 0 1 ... 0 1 0]


In [99]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # LIAR training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field: Fit using LIAR data; Predict on ISOT data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field: Fit using LIAR data; Predict on ISOT data:
accuracy: 0.566
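
To judge whether an accuracy such as 0.566 is meaningfully better than chance, it helps to compare it with a classifier that always predicts the most frequent label. A minimal sketch follows; `majority_baseline` is a hypothetical helper, and the example uses the LIAR class counts printed in the LIAR split section of this notebook (5657 zeros, 4507 ones), not the ISOT title labels scored above.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# LIAR binary class counts printed in this notebook: 5657 zeros, 4507 ones.
liar_labels = [0] * 5657 + [1] * 4507
baseline = majority_baseline(liar_labels)
print('majority-class baseline: %3.3f' % baseline)
```

For the cell above, the same helper would be called on `dev_labels` to get the reference point for the ISOT titles.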


### Split LIAR data into Train/Dev/Test, train model, and evaluate on its own type 

In [72]:
#liar_data.head(5)
liar_data = liar_data[liar_data.binary_target != -1]
liar_data.head(5)

Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context,binary_target
0,6193.json,false,(Environmentalists) said Were only going to st...,"energy,environment",ron-ramsey,Speaker of the Tennessee Senate,Tennessee,republican,0.0,2.0,0.0,0.0,1.0,rally for coal miners,0
1,12371.json,true,"For the first time in over 40 years, Republica...","elections,history",ed-gillespie,Republican strategist,"Washington, D.C.",republican,2.0,3.0,2.0,2.0,1.0,a speech,1
2,12409.json,true,You can import as many hemp products into this...,"drugs,legal-issues,marijuana",william-devereaux,lawyer,Rhode Island,none,0.0,0.0,0.0,0.0,0.0,a legislative hearing,1
3,1770.json,true,We can prevent terror suspects from boarding a...,"civil-rights,guns,terrorism",michael-bloomberg,,New York,independent,0.0,2.0,2.0,3.0,0.0,an op ed article,1
4,10700.json,pants-fire,"Says Ted Cruz said, There is no place for gays...","candidates-biography,gays-and-lesbians,marriag...",facebook-posts,Social media posting,,none,14.0,18.0,15.0,11.0,36.0,an online meme,0


In [73]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[0 1]

binary_target,  number of examples
0 5657
1 4507


In [75]:
# train/dev/test split
#train_dev_split = 0.8

train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

if np.isclose(train_fract + dev_fract + test_fract, 1.0):   # tolerant float comparison
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

train_set = liar_data[ :int(len(liar_data)*train_fract)].reset_index(drop=True)
dev_set = liar_data[int(len(liar_data)*(train_fract)) : int(len(liar_data)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = liar_data[int(len(liar_data)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)

Split fractions add up to 1.0
training set:  (7114, 15)
dev set:  (1525, 15)
test set:  (1525, 15)
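
The slices above take rows in file order, so the split sizes follow directly from truncating `len(liar_data) * fraction` to an integer. The sketch below reproduces that arithmetic for the 10,164 labeled LIAR rows. As an optional extra that is not part of the original cell, it can also shuffle the row indices with a fixed seed before slicing. `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n, train_fract, dev_fract, seed=None):
    """Return (train, dev, test) index lists using the same int() cutoffs as the cell above.

    If seed is given, indices are shuffled first (an assumption, not the notebook's behavior).
    """
    indices = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    train_end = int(n * train_fract)
    dev_end = int(n * (train_fract + dev_fract))
    return indices[:train_end], indices[train_end:dev_end], indices[dev_end:]

# Sizes come out as 7114, 1525, 1525, matching the shapes printed above.
train_idx, dev_idx, test_idx = split_indices(10164, 0.70, 0.15, seed=0)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With a DataFrame, the index lists could be applied as `liar_data.iloc[train_idx]` and so on. Whether shuffling is needed depends on whether `liar_data` was already shuffled earlier in the notebook.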


In [76]:
# print out LIAR dev set
#dev_set.to_csv('liar_dev_set.csv', sep=',')

In [77]:
train_data = train_set.title[train_set.binary_target != -1].str.lower().values   # LIAR training split
train_labels = train_set.binary_target[train_set.binary_target != -1].values 
 
dev_data = dev_set.title.str.lower().values                                      # LIAR dev split
dev_labels = dev_set.binary_target[dev_set.binary_target != -1].values 

print('train_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print('dev_data shape:', dev_data.shape)
print('dev_labels shape:', dev_labels.shape)

train_data shape: (7114,)
train_labels shape: (7114,)
dev_data shape: (1525,)
dev_labels shape: (1525,)


In [78]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)      
print(X.shape)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using LIAR training data; Predict on LIAR dev data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))

(7114, 10384)

MultinomialNB using LIAR training data; Predict on LIAR dev data:
accuracy: 0.627
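
Accuracy alone does not show how the errors split between the two classes. A minimal sketch of a 2x2 confusion count follows; `confusion_counts` is a hypothetical helper and the toy labels are made up. In the notebook it would be called with `dev_labels` and `clf.predict(X_dev_transformed)`.

```python
from collections import Counter

def confusion_counts(labels, preds):
    """Count (true label, predicted label) pairs for binary labels 0/1."""
    counts = Counter(zip(labels, preds))
    return {
        'true_0_pred_0': counts[(0, 0)],
        'true_0_pred_1': counts[(0, 1)],
        'true_1_pred_0': counts[(1, 0)],
        'true_1_pred_1': counts[(1, 1)],
    }

# Toy data (assumed for illustration only).
toy_labels = [1, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 0, 1]

cm = confusion_counts(toy_labels, toy_preds)
accuracy = (cm['true_0_pred_0'] + cm['true_1_pred_1']) / len(toy_labels)
recall_1 = cm['true_1_pred_1'] / (cm['true_1_pred_0'] + cm['true_1_pred_1'])
print(cm, accuracy, recall_1)
```

scikit-learn's `sklearn.metrics.confusion_matrix` returns the same counts as an array; the hand-rolled version is shown only to make the bookkeeping explicit.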


In [106]:
# PRINT LIAR DEV TITLES WITH LABELS AND PREDICTED VALUES

cols = ['LIAR title', 'Label', 'Predicted Class']
row_list = []

dev_preds = clf.predict(X_dev_transformed)   # predict once instead of once per row

for i in range(dev_data.shape[0]):
    row_list.append({'LIAR title': dev_data[i], 'Label': dev_labels[i], 'Predicted Class': dev_preds[i]})

dev_set_df = pd.DataFrame(row_list, columns=cols)

In [107]:
print(dev_set_df.size)
print(len(dev_set_df))
dev_set_df.head()

4575
1525


Unnamed: 0,LIAR title,Label,Predicted Class
0,we did not even have a federal income tax in t...,1,1
1,"in texas, there are 668 democratic hispanic el...",1,1
2,a test last month at the max station at 162nd ...,1,1
3,says president barack obama initially said the...,1,0
4,the next state budget will begin with a surplu...,0,0


In [108]:
#dev_set_df.to_csv('liar_dev_set.csv', sep=',')

#### ONE OFF: Human Baseline Examples: read in LIAR examples previously scored by humans in the project group.   Add the predicted class to the file.  

In [79]:
liar_human = pd.read_csv('human_liar_examples.csv')
liar_human.head()

Unnamed: 0,title
0,A new Colorado law literally allows residents ...
1,Sixty percent of New Jersey doctors do not acc...
2,Austin is a city that has basically doubled in...
3,Says U.S. Rep. Stephen Fincher breaks earmark ...
4,Of all cities in the United States with more t...


In [80]:
X_dev_transformed = vectorizer.transform(liar_human.title)

In [81]:
cols = ['title', 'pred_class']
row_list = []

human_preds = clf.predict(X_dev_transformed)   # predict once for all titles

for i in range(len(liar_human.title)):  
    print('\nDev Title: ', liar_human.title.iloc[i])
    print('predicted class =', human_preds[i], '\n')
    row_list.append({'title': liar_human.title.iloc[i], 'pred_class': human_preds[i]})


liar_human_df = pd.DataFrame(row_list, columns=cols)


Dev Title:  A new Colorado law literally allows residents to print ballots from their home computers, then encourages them to turn ballots over to collectors.
predicted class = 0 


Dev Title:  Sixty percent of New Jersey doctors do not accept Medicaid patients.
predicted class = 1 


Dev Title:  Austin is a city that has basically doubled in size every 25 years or so since it was founded.
predicted class = 1 


Dev Title:  Says U.S. Rep. Stephen Fincher breaks earmark pledge.
predicted class = 0 


Dev Title:  Of all cities in the United States with more than 100,000 people, Providence is the 183rd safest.
predicted class = 1 


Dev Title:  When the salmonella source was finally identified, FDA officials had to wait for industry approval before they could go live with the [peanut] recall.
predicted class = 1 


Dev Title:  Mr. President, multiple times from your administration there have come statements that Republicans have no ideas and no solutions on health care.
predicted class =

In [82]:
liar_human_df.head()

Unnamed: 0,title,pred_class
0,A new Colorado law literally allows residents ...,0
1,Sixty percent of New Jersey doctors do not acc...,1
2,Austin is a city that has basically doubled in...,1
3,Says U.S. Rep. Stephen Fincher breaks earmark ...,0
4,Of all cities in the United States with more t...,1


In [83]:
liar_human_df.to_csv('human_liar_examples.csv', sep=',')