# LIAR DETECTION GROUP PROJECT - Baseline Models  


### CONTENTS  

Imports  
Load ISOT data  
Pre-process ISOT data  
Train/Dev/Test split ISOT data  

##### Baselines (Naive Bayes):  
- ISOT full "text" field (using CountVectorizer)  
    - Verification test: assign random 0's and 1's to the dev labels and re-run  
- Read and setup LIAR dataset  
- Using the ISOT "text" model, predict the liar_dev_labels and score the predictions  
- Using the LIAR model, predict the ISOT "text" and score the predictions 
- ISOT "text" field using TfidfVectorizer  
- ISOT "text" field after removing "Reuters" and location from real news  
- ISOT "title" field; print top misclassifications between predicted dev classes and dev labels  
- Using the ISOT "title" model, predict the liar_dev_labels and score the predictions; print top misclassifications  
- Using the LIAR model, predict the ISOT "title" and score the predictions  
- Divide LIAR data into train/dev/test, train a model, and see how well it predicts on its own data type  



    

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML
from sklearn.utils import shuffle
# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
#import tensorflow as tf

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz
#from ark-tweet-nlp-0.3.2 import 


In [2]:
#### MAY NEED TO RUN THIS CELL TWICE

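def get_data(filename, sep=',', header=0, names=None):
    '''Read a delimited text file under DATAPATH into a pandas dataframe'''

    filepath = DATAPATH + filename
    # pass `names` through so callers can supply column names for header-less files
    return pd.read_csv(filepath, header=header, sep=sep, names=names, quotechar='"')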

In [3]:
##
# from sklearn.naive_bayes import BernoulliNB  #requires all features be binary
from sklearn.naive_bayes import MultinomialNB  #appropriate for word count features from CountVectorizer
# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *
#from sklearn.grid_search import GridSearchCV   # THIS HAS BEEN DEPRECATED
from sklearn.model_selection import GridSearchCV
# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

### Load data
Loading the "Fake News" dataset from the Information Security and Object Technology (ISOT) Research Lab at the University of Victoria School of Engineering.

The ISOT Fake News Dataset is a compilation of several thousand fake and truthful news articles, obtained from legitimate news sites and from sites flagged as unreliable by politifact.com.

In [4]:
# define each downloaded file
FAKE_FILENAME = 'Fake.csv'
TRUE_FILENAME = 'True.csv'

# define the downloaded file path 
DATAPATH = './datasets/ISOT_FakeNews/'

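def get_data(filename, sep=',', header=0, names=None):
    '''Read a delimited text file under DATAPATH into a pandas dataframe'''

    # keep the sep/header arguments so the later tab-separated LIAR files load with this definition too
    filepath = DATAPATH + filename
    return pd.read_csv(filepath, header=header, sep=sep, names=names, quotechar='"')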


fake_data = get_data(FAKE_FILENAME)
true_data = get_data(TRUE_FILENAME)



# add a label column to the data with the target values
fake_data.loc[:,'target'] = '0'
true_data['target'] = '1'

#concatenate the datasets and shuffle them
#(DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
all_data = pd.concat([true_data, fake_data], ignore_index=True)
all_data = all_data.sample(frac=1).reset_index(drop=True)

all_data.describe()

Unnamed: 0,title,text,subject,date,target
count,44898,44898.0,44898,44898,44898
unique,38729,38646.0,8,2397,2
top,Factbox: Trump fills top jobs for his administ...,,politicsNews,"December 20, 2017",0
freq,14,627.0,11272,182,23481


In [5]:
#fake_data.head(15)
#true_data.head(16)
all_data.head(15)

Unnamed: 0,title,text,subject,date,target
0,FARMER FINED A WHOPPING $2.8 MILLION Asks Pres...,A California farmer fined $2.8 million for plo...,Government News,"Jul 29, 2017",0
1,Kremlin 'deeply concerned' by rising tension o...,MOSCOW (Reuters) - Moscow is deeply concerned ...,politicsNews,"September 22, 2017",1
2,Turkey's Erdogan takes legal action after lawm...,ISTANBUL (Reuters) - President Tayyip Erdogan ...,worldnews,"October 31, 2017",1
3,Democrats want a law to stop Trump from bombin...,WASHINGTON (Reuters) - Democratic U.S. senator...,politicsNews,"October 31, 2017",1
4,Sander’s Campaign Manager Literally Stole An ...,In what seems to be another misspeak moment fo...,News,"April 6, 2016",0
5,Malta court hears blogger bomb probably trigge...,VALLETTA (Reuters) - The bomb used to kill Mal...,worldnews,"December 19, 2017",1
6,North Korea's Kim Jong Un fetes nuclear scient...,SEOUL (Reuters) - North Korean leader Kim Jong...,worldnews,"September 10, 2017",1
7,U.S. tech titans lead legal brief against Trum...,"(Reuters) - More than 100 companies, including...",politicsNews,"February 6, 2017",1
8,WATCH: CNN Panelist HUMILIATES ‘Snowflake’ Tr...,Donald Trump is so thin-skinned that he would ...,News,"February 26, 2017",0
9,MarkLevin is Freaking Awesome: Obama negotiate...,Let s get real with some awesome truth from Ma...,left-news,"Apr 5, 2015",0


In [6]:

print(fake_data.title[0])
print('\n', fake_data.text[0])
print('\nTRUE DATA: ', true_data.text[3000])

 Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing

 Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this

### Cleanup
Check for NA values.

We may want to drop the 'subject' column, since all of the true news data comes from "Reuters".

In [7]:
all_data.isna().sum()

title      0
text       0
subject    0
date       0
target     0
dtype: int64
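
`isna()` only counts missing values, so blank or whitespace-only strings pass this check. The `describe()` output earlier appears to list a blank string as the most frequent `text` value (627 rows), which suggests that some articles have no body. Below is a minimal sketch of an extra check in plain Python. The list is a made-up stand-in for the real column, and `count_blank` and `sample_texts` are hypothetical names, not part of the notebook.

```python
def count_blank(texts):
    """Count entries that are empty or contain only whitespace."""
    return sum(1 for t in texts if not str(t).strip())

# Made-up stand-in for all_data['text']; not taken from the dataset.
sample_texts = ["WASHINGTON (Reuters) - ...", " ", "", "Some article body", "\n\t"]

print(count_blank(sample_texts))  # 3
```

On the DataFrame itself, an equivalent check could be `all_data['text'].str.strip().eq('').sum()`.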

In [8]:
all_data.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
title      44898 non-null object
text       44898 non-null object
subject    44898 non-null object
date       44898 non-null object
target     44898 non-null object
dtypes: object(5)
memory usage: 151.9 MB


### Tokenize and Canonicalize Text

The text still needs better tokenization and canonicalization. Words like "Obama's" need to be corrected. Do we need to mark off sentences within a text? We might want to use some regex code from camron.

In [9]:
"""
Source:  https://gist.github.com/tokestermw/cb87a97113da12acb388
"""

FLAGS = re.MULTILINE | re.DOTALL

## hashtag code does not work, needs tweaking
def hashtag(text):
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        result = " {} ".format(hashtag_body.lower())
    else:
        result = " ".join(["<hashtag>"] + re.split(r"(?=[A-Z])", hashtag_body, flags=FLAGS))
    return result

def allcaps(text):
    text = text.group()
    return text.lower() + " <allcaps>"


def tokenize(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`\-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"@\w+", "<user>")
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    text = re_sub(r"/"," / ")
    text = re_sub(r"<3","<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    #text = re_sub(r"#\S+", hashtag)
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")
    text = re_sub(r"([A-Z]){2,}", allcaps)

       
    output = text.lower().split()
    #output = list(itertools.chain(*[re.split(r'([^\w<>])', x) for x in output]))  #Splits punctuation, keeping < and >
    return [item for item in output if item != '']  #Removes blank strings from list

teststring = "My name is ABHI :). Learning the back-portion :(. Obama's nephew. @random. http://www.abc.com"
tokenize(teststring)

['my',
 'name',
 'is',
 'abhi',
 '<allcaps>',
 '<smile>.',
 'learning',
 'the',
 'back-portion',
 '<sadface>.',
 "obama's",
 'nephew.',
 '<user>.',
 '<url>']
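
In the output above, punctuation stays glued to neighbouring tokens (`'<smile>.'`, `'nephew.'`) and the possessive in `"obama's"` is left attached. One possible post-processing step is sketched below. It assumes that placeholder tags such as `<url>` should stay whole, that a possessive `'s` becomes its own token, and that every other punctuation mark is split off. `TOKEN_RE` and `split_punct` are hypothetical names, and this is only one way to handle it, not necessarily what the project ends up using.

```python
import re

# Order matters: keep <tags> whole, then a possessive 's, then words
# (letters, digits, underscore, hyphen), then any single punctuation mark.
TOKEN_RE = re.compile(r"<\w+>|'s\b|[\w-]+|[^\w\s]")

def split_punct(tokens):
    """Split punctuation and possessives off whitespace-delimited tokens."""
    out = []
    for tok in tokens:
        out.extend(TOKEN_RE.findall(tok))
    return out

tokens = ['<smile>.', "obama's", 'nephew.', '<user>.', 'back-portion']
print(split_punct(tokens))
```

A known limitation of this sketch is that abbreviations such as `u.s.` are broken into separate letters and periods, so they would need a dedicated rule.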

In [10]:
def CNG_tokenizer(text):
    '''tokenizer, and part-of-speech tagger from Carnegie Mellon
    created by Olutobi Owoputi, Brendan O'Connor, Kevin Gimpel, Nathan Schneider, Chris Dyer, Dipanjan Das, Daniel Mills, 
    Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah Smith
    RunTagger [options] [ExamplesFilename]
      runs the CMU ARK Twitter tagger on tweets from ExamplesFilename, 
      writing taggings to standard output. Listens on stdin if no input filename.

    Options:
      --model <Filename>        Specify model filename. (Else use built-in.)
      --just-tokenize           Only run the tokenizer; no POS tags.
      --quiet                   Quiet: no output
      --input-format <Format>   Default: auto
                                Options: json, text, conll
      --output-format <Format>  Default: automatically decide from input format.
                                Options: pretsv, conll
      --input-field NUM         Default: 1
                                Which tab-separated field contains the input
                                (1-indexed, like unix 'cut')
                                Only for {json, text} input formats.
      --word-clusters <File>    Alternate word clusters file (see FeatureExtractor)
      --no-confidence           Don't output confidence probabilities
      --decoder <Decoder>       Change the decoding algorithm (default: greedy)

    Tweet-per-line input formats:
       json: Every input line has a JSON object containing the tweet,
             as per the Streaming API. (The 'text' field is used.)
       text: Every input line has the text for one tweet.
    We actually assume input lines are TSV and the tweet data is one field.
    (Therefore tab characters are not allowed in tweets.
    Twitter's own JSON formats guarantee this;
    if you extract the text yourself, you must remove tabs and newlines.)
    Tweet-per-line output format is
       pretsv: Prepend the tokenization and tagging as new TSV fields, 
               so the output includes a complete copy of the input.
    By default, three TSV fields are prepended:
       Tokenization \t POSTags \t Confidences \t (original data...)
    The tokenization and tags are parallel space-separated lists.
    The 'conll' format is token-per-line, blank spaces separating tweets.'''

    file = open("teststring.txt", "w") 
    file.write(text) 
    file.close() 

#! ./ark-tweet-nlp-0.3.2/runTagger.sh ./ark-tweet-nlp-0.3.2/examples/example_tweets.txt
#! ./ark-tweet-nlp-0.3.2/twokenize.sh --output-format pretsv ./ark-tweet-nlp-0.3.2/examples/casual.txt
    tokens = ! ./ark-tweet-nlp-0.3.2/runTagger.sh --output-format conll teststring.txt
    tokens_list = list([re.split(r'([\t])',x) for x in tokens])
    tokens_list = [[ item for item in word if item != '\t' ] for word in tokens_list]
    
    #pandas frame for the tokens and POS
    
    pd_tokens = pd.DataFrame(tokens_list[1:-2], columns = ['word','tag','confidence'] )
    print(pd_tokens)
    word_list = [tokenize(word) for word in pd_tokens['word'].tolist()]
    
    word_tokens = ["".join(word) for word in word_list] #concats the allcaps text at the end of string
    pos_tokens = pd_tokens['tag'].tolist()
    conf_tokens = pd_tokens['confidence'].tolist()
    return word_tokens, pos_tokens, conf_tokens

In [11]:
CNG_tokenizer(teststring)

                  word tag confidence
0                   My   D     0.9984
1                 name   N     0.9996
2                   is   V     0.9953
3                 ABHI   ^     0.6305
4                   :)   E     0.9775
5                    .   ,     0.9951
6             Learning   V     0.9957
7                  the   D     0.9960
8         back-portion   N     0.8512
9                   :(   E     0.9162
10                   .   ,     0.9876
11             Obama's   Z     0.8890
12              nephew   N     0.9575
13                   .   ,     0.9980
14             @random   @     0.9945
15                   .   ,     0.9953
16  http://www.abc.com   U     0.9871


(['my',
  'name',
  'is',
  'abhi<allcaps>',
  '<smile>',
  '.',
  'learning',
  'the',
  'back-portion',
  '<sadface>',
  '.',
  "obama's",
  'nephew',
  '.',
  '<user>',
  '.',
  '<url>'],
 ['D',
  'N',
  'V',
  '^',
  'E',
  ',',
  'V',
  'D',
  'N',
  'E',
  ',',
  'Z',
  'N',
  ',',
  '@',
  ',',
  'U'],
 ['0.9984',
  '0.9996',
  '0.9953',
  '0.6305',
  '0.9775',
  '0.9951',
  '0.9957',
  '0.9960',
  '0.8512',
  '0.9162',
  '0.9876',
  '0.8890',
  '0.9575',
  '0.9980',
  '0.9945',
  '0.9953',
  '0.9871'])

In [12]:

#Make new column with tokenized, canonicalized text
all_data['text_tokcan'] = all_data['text'].apply(tokenize)
all_data.tail(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
44893,U.S. says air strikes in Somalia kill six al S...,WASHINGTON (Reuters) - The U.S. military said ...,worldnews,"September 13, 2017",1,"[washington, <allcaps>, (reuters), -, the, u.s..."
44894,INFLUENTIAL HOLLYWOOD LEFTIST Looks Forward To...,How very progressive Not since the Civil Right...,left-news,"Feb 26, 2016",0,"[how, very, progressive, not, since, the, civi..."
44895,Tillerson says U.S. committed to NATO in first...,BRUSSELS (Reuters) - U.S. Secretary of State R...,politicsNews,"March 31, 2017",1,"[brussels, <allcaps>, (reuters), -, u.s., secr..."
44896,"Without Trump, Republican debate has second lo...",WASHINGTON (Reuters) - The Fox News debate wit...,politicsNews,"January 29, 2016",1,"[washington, <allcaps>, (reuters), -, the, fox..."
44897,House Speaker Ryan expects tax plan this fall:...,WASHINGTON (Reuters) - U.S. House of Represent...,politicsNews,"September 7, 2017",1,"[washington, <allcaps>, (reuters), -, u.s., ho..."


In [13]:
#padded_sentences = ([u"<s>", u"<s>"] + s + [u"</s>"] for s in sents)

In [14]:

def build_vocab(corpus, V=None, **kw):
    if isinstance(corpus, list):
        token_feed = (utils.canonicalize_word(w) for w in corpus)
        vocab = vocabulary.Vocabulary(token_feed, size=V, **kw)
    print("Vocabulary: {:,} types".format(vocab.size))
    return vocab


#utils.canonicalize_word(teststring.split())
vocab=build_vocab(tokenize(teststring))
print("{:,} words".format(vocab.size))
print("wordset: ",vocab.ordered_words())



Vocabulary: 17 types
17 words
wordset:  ['<s>', '</s>', '<unk>', 'my', 'name', 'is', 'abhi', '<allcaps>', '<smile>.', 'learning', 'the', 'back-portion', '<sadface>.', "obama's", 'nephew.', '<user>.', '<url>']


In [15]:
print('ISOT ALL target=real:', len(all_data.target[all_data.target == '1']))
print('ISOT ALL target=fake:', len(all_data.target[all_data.target == '0']))

ISOT ALL target=real: 21417
ISOT ALL target=fake: 23481
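
These counts also give a floor for judging any classifier: always predicting the larger class is already right a bit more than half the time. A small sketch of that majority-class baseline follows. `majority_baseline` is a hypothetical helper, and the label list is rebuilt from the two counts printed above rather than read from `all_data`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Class counts printed above: 21,417 real ('1') and 23,481 fake ('0').
labels = ['1'] * 21417 + ['0'] * 23481
label, acc = majority_baseline(labels)
print(label, round(acc, 3))  # 0 0.523
```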


### Train / Dev / Test Split ISOT data

In [16]:
#train/dev/test split
#train_dev_split = 0.8

train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

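# compare with a tolerance: an exact == on floats can fail for some fraction combinations
if abs((train_fract+dev_fract+test_fract) - 1.0) < 1e-9: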
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

#train_data = all_data[:int(len(all_data)*train_dev_split)].reset_index(drop=True)
#dev_data = all_data[int(len(all_data)*train_dev_split):].reset_index(drop=True)

train_set = all_data[ :int(len(all_data)*train_fract)].reset_index(drop=True)
dev_set = all_data[int(len(all_data)*(train_fract)) : int(len(all_data)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = all_data[int(len(all_data)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)

Split fractions add up to 1.0
training set:  (31428, 6)
dev set:  (6735, 6)
test set:  (6735, 6)
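
The slice boundaries above can be checked without loading the data. The sketch below reproduces the same `int()` arithmetic and compares the fraction sum with a tolerance instead of an exact `==`. `split_sizes` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
import math

def split_sizes(n, train_fract, dev_fract, test_fract):
    """Return (train, dev, test) sizes using the same int() boundaries as the cell above."""
    total = train_fract + dev_fract + test_fract
    if not math.isclose(total, 1.0):
        raise ValueError("split fractions sum to %r, expected 1.0" % total)
    train_end = int(n * train_fract)
    dev_end = int(n * (train_fract + dev_fract))
    # The test set takes whatever remains, so the three sizes always sum to n.
    return train_end, dev_end - train_end, n - dev_end

print(split_sizes(44898, 0.70, 0.15, 0.15))  # (31428, 6735, 6735)
```

Because `all_data` was shuffled before slicing, contiguous slices like these act as a random split, but the result changes on every run unless `sample()` is given a fixed `random_state`.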


In [17]:
train_set.head(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
0,FARMER FINED A WHOPPING $2.8 MILLION Asks Pres...,A California farmer fined $2.8 million for plo...,Government News,"Jul 29, 2017",0,"[a, california, farmer, fined, $<number>, mill..."
1,Kremlin 'deeply concerned' by rising tension o...,MOSCOW (Reuters) - Moscow is deeply concerned ...,politicsNews,"September 22, 2017",1,"[moscow, <allcaps>, (reuters), -, moscow, is, ..."
2,Turkey's Erdogan takes legal action after lawm...,ISTANBUL (Reuters) - President Tayyip Erdogan ...,worldnews,"October 31, 2017",1,"[istanbul, <allcaps>, (reuters), -, president,..."
3,Democrats want a law to stop Trump from bombin...,WASHINGTON (Reuters) - Democratic U.S. senator...,politicsNews,"October 31, 2017",1,"[washington, <allcaps>, (reuters), -, democrat..."
4,Sander’s Campaign Manager Literally Stole An ...,In what seems to be another misspeak moment fo...,News,"April 6, 2016",0,"[in, what, seems, to, be, another, misspeak, m..."


In [18]:
dev_set.head(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
0,Iran provided capability for missile attacks f...,DUBAI (Reuters) - Iran has provided the capabi...,worldnews,"November 10, 2017",1,"[dubai, <allcaps>, (reuters), -, iran, has, pr..."
1,OBAMA THREATENS TO SURFACE FROM LEFTIST BUNKER...,Contrary to what the media would like us to be...,politics,"Sep 4, 2017",0,"[contrary, to, what, the, media, would, like, ..."
2,JUST IN: Criminal Hackers Who Stole Data From ...,"Just two years ago, the Obama White House welc...",politics,"Mar 15, 2017",0,"[just, two, years, ago,, the, obama, white, ho..."
3,DOJ Lawyers LITERALLY Argued That Trump Is Ab...,You know how white supremacists have been call...,News,"February 7, 2017",0,"[you, know, how, white, supremacists, have, be..."
4,Brazil Supreme Court sends new Temer graft cha...,BRASILIA (Reuters) - A majority of the judges ...,worldnews,"September 20, 2017",1,"[brasilia, <allcaps>, (reuters), -, a, majorit..."


In [None]:
# print out ISOT dev set
#dev_set.to_csv('isot_dev_set.csv', sep=',')

In [19]:
test_set.head(5)

Unnamed: 0,title,text,subject,date,target,text_tokcan
0,STATE DEPARTMENT COVERUP: Reporter Questions M...,The deception was really a Glitch Sure!,Government News,"May 11, 2016",0,"[the, deception, was, really, a, glitch, sure!]"
1,DNC MEGA-DONOR Ditches Dems in Scathing Messag...,Wow! This is huge! The Democrats and their lea...,politics,"Nov 25, 2017",0,"[wow!, this, is, huge!, the, democrats, and, t..."
2,Russia says regrets over U.S. moves on consula...,MOSCOW (Reuters) - Russian Foreign Minister Se...,politicsNews,"August 31, 2017",1,"[moscow, <allcaps>, (reuters), -, russian, for..."
3,Nearly half of Americans still oppose Republic...,WASHINGTON/NEW YORK (Reuters) - As Republicans...,politicsNews,"December 11, 2017",1,"[washington, <allcaps>, /, new, <allcaps>, yor..."
4,DEMOCRAT CONGRESSWOMAN Who Served Two Tours In...,Congresswoman Gabbard also criticizes Hillary ...,politics,"Nov 23, 2015",0,"[congresswoman, gabbard, also, criticizes, hil..."


## Baseline Model: Naive Bayes Classifier

### Classify full text

In [20]:
##
# from sklearn.naive_bayes import BernoulliNB  #requires all features be binary
from sklearn.naive_bayes import MultinomialNB  #appropriate for word count features from CountVectorizer
# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *
#from sklearn.grid_search import GridSearchCV   # THIS HAS BEEN DEPRECATED
from sklearn.model_selection import GridSearchCV
# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score


train_data, train_labels = train_set.text.values, train_set.target.values
dev_data, dev_labels = dev_set.text.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)
print(type(train_labels[0]))
#train_labels.head()
#dev_data.head()
#dev_labels.head()


train_data shape: (31428,)

train_labels shape: (31428,)
[0 1 1 ... 1 1 0]
<class 'numpy.int64'>


In [21]:
print('ISOT train target=real:', len(train_labels[train_labels == 1]))
print('ISOT train target=fake:', len(train_labels[train_labels == 0]))

ISOT train target=real: 14922
ISOT train target=fake: 16506


In [22]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X

<31428x105168 sparse matrix of type '<class 'numpy.int64'>'
	with 6550150 stored elements in Compressed Sparse Row format>

In [23]:
#print(X[0])

In [24]:
print('X.shape:', X.shape) # (). There are x documents (rows) in the corpus, with y features (unique words = vocabulary)
print('Vocabulary size (number of features or columns):', X.shape[1])  # 
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # This indicates that there are z non-zero elements in the matrix.
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero elements in matrix / documents = xxx
print('Fraction of non-zero elements in matrix: %.4f' %( X.nnz/(X.shape[0] * X.shape[1])) )   # Fraction of entries in the matrix that are non-zero = X.nnz/(rows*columns) = 0.xxx 


# What are the 0th and last feature strings (in alphabetical order)?
print('0th feature string:', vectorizer.get_feature_names()[0])   # 
print('last feature string:', vectorizer.get_feature_names()[X.shape[1]-1])    # 

X.shape: (31428, 105168)
Vocabulary size (number of features or columns): 105168
Non-zero elements in matrix (X.nnz): 6550150
Average number of non-zero features per example (per document): 208.418
Fraction of non-zero elements in matrix: 0.0020
0th feature string: 00
last feature string: zzzzzzzzzzzzz


In [25]:
# Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? 
vectorizer_dev = CountVectorizer()
X_dev = vectorizer_dev.fit_transform(dev_data)  # Independently build a vocabulary using dev_data.
print('X_dev.shape:', X_dev.shape)
print('Vocabulary using train data:', X.shape[1])  # 
print('Vocabulary using dev data:', X_dev.shape[1])                # 

# Feed dev_data into the vectorizer fit using training data.
X_dev_transformed = vectorizer.transform(dev_data)
print('X_dev_transformed shape:', X_dev_transformed.shape)
#print('X_dev_transformed.shape', X_dev_transformed.shape)  # (676, 26,879)  EXPECT .shape[1] equal to original number of features
#print('non-zero indices in X_dev_transformed:', X_dev_transformed.nonzero())  # could also use this to check which features missing...

''' This is way too slow; use set intersection instead!!
# Look at each feature (vocabulary word) in X_dev and see if it is a feature in X.
count = 0
for i in range(X_dev.shape[1]):
    if vectorizer_dev.get_feature_names()[i] in vectorizer.get_feature_names():
        count += 1
print('Count of words (features) in X_dev also in X:', count)   
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - count)/X_dev.shape[1]) )
count of words (features) in X_dev also in X: 12219
Fraction of words in dev data missing from training vocabulary: 0.248
'''

set1 = set(vectorizer_dev.get_feature_names())
set2 = set(vectorizer.get_feature_names())
print('Count of words (features) in X_dev also in X:', len(set1.intersection(set2)))
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - len(set1.intersection(set2)))/X_dev.shape[1]) )

X_dev.shape: (6735, 53603)
Vocabulary using train data: 105168
Vocabulary using dev data: 53603
X_dev_transformed shape: (6735, 105168)
Count of words (features) in X_dev also in X: 44811
Fraction of words in dev data missing from training vocabulary: 0.164
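
The 0.164 figure is a type-level rate: it counts each distinct dev word once, however rarely it occurs. A token-level rate, which weights each word by how often it appears, is usually lower because unseen words tend to be rare. The sketch below computes both on toy data. `oov_rates` is a hypothetical helper and the two sentences are made up, not drawn from ISOT.

```python
from collections import Counter

def oov_rates(train_tokens, dev_tokens):
    """Return (type-level, token-level) fraction of dev words unseen in training."""
    train_vocab = set(train_tokens)
    dev_counts = Counter(dev_tokens)
    unseen = [w for w in dev_counts if w not in train_vocab]
    type_rate = len(unseen) / len(dev_counts)
    token_rate = sum(dev_counts[w] for w in unseen) / sum(dev_counts.values())
    return type_rate, token_rate

# Toy data, not from ISOT.
train_tokens = "the senate said the bill passed".split()
dev_tokens = "the senate said the the zuma bill".split()
type_rate, token_rate = oov_rates(train_tokens, dev_tokens)
print(round(type_rate, 3), round(token_rate, 3))
```

Here "zuma" is one of five distinct dev words but only one of seven dev tokens, so the token-level rate is the smaller of the two.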


In [26]:
# MultinomialNB
print('\nMultinomialNB')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB
accuracy: 0.96


In [27]:

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))

y_pred = clf.predict(X_dev_transformed)

acc = accuracy_score(dev_labels, y_pred)
print("Accuracy on dev set: {:.02%}".format(acc))


accuracy: 0.96
Accuracy on dev set: 95.80%


In [28]:
print('predict proba:', clf.predict_proba(X).shape)
print('predict proba example:', clf.predict_proba(X[0]))

predict proba: (31428, 2)
predict proba example: [[0.31792561 0.68207439]]


In [29]:
print('feature_log_prob_ shape:', clf.feature_log_prob_.shape)
print('feature_log_prob_ example:', clf.feature_log_prob_[0][0])

feature_log_prob_ shape: (2, 105168)
feature_log_prob_ example: -9.69227039235917


In [30]:
feature_names = vectorizer.get_feature_names()
print(feature_names[:20])

['00', '000', '0000', '00000017', '00004', '000048', '000063', '00007', '00042', '0009', '000938', '000a', '000after', '000although', '000american', '000california', '000cases', '000cylvia', '000dillon000', '000ecuador']


In [31]:
for i in range(2):   # 2 category labels
    for j in range(10):   # top 10 weights for each class
        index = -1 - j
        feature_index = np.argsort(clf.feature_log_prob_[i,:])[index]
        #print(feature_index, vectorizer.get_feature_names()[feature_index], clf.feature_log_prob_[0,feature_index], lr2.coef_[1,feature_index], lr2.coef_[2,feature_index], lr2.coef_[3,feature_index])
        print('%19s %12.3f %12.3f' %(vectorizer.get_feature_names()[feature_index], clf.feature_log_prob_[0,feature_index], clf.feature_log_prob_[1,feature_index]))
    print()


                the       -2.901       -2.827
                 to       -3.524       -3.502
                 of       -3.733       -3.682
                and       -3.776       -3.798
                 in       -4.054       -3.802
               that       -4.175       -4.521
                 is       -4.488       -4.986
                for       -4.660       -4.626
                 on       -4.773       -4.313
                 it       -4.775       -5.097

                the       -2.901       -2.827
                 to       -3.524       -3.502
                 of       -3.733       -3.682
                and       -3.776       -3.798
                 in       -4.054       -3.802
                 on       -4.773       -4.313
               said       -5.677       -4.407
               that       -4.175       -4.521
                for       -4.660       -4.626
                 is       -4.488       -4.986



In [32]:
prob_diff = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]

for j in range(50):   # top 50 words by log-probability difference (real minus fake)
    index = -1 - j
    feat_index = np.argsort(prob_diff[:])[index]
    print('%19s %12.3f %12.3f' %(vectorizer.get_feature_names()[feat_index], clf.feature_log_prob_[0,feat_index], clf.feature_log_prob_[1,feat_index]))


            rakhine      -15.744       -9.126
            myanmar      -14.358       -7.988
               zuma      -15.744       -9.376
           rohingya      -14.646       -8.307
         puigdemont      -15.744       -9.482
                fdp      -15.744       -9.629
                suu      -15.744       -9.808
                kyi      -15.744       -9.815
                anc      -15.744       -9.888
             odinga      -15.744       -9.895
          mnangagwa      -15.744       -9.899
              rajoy      -15.744       -9.942
             hariri      -14.646       -8.873
             tmsnrt      -15.744      -10.080
           kenyatta      -15.744      -10.110
          kuczynski      -15.744      -10.159
               kurz      -15.744      -10.163
            barnier      -15.744      -10.206
            barzani      -15.744      -10.260
               aung      -15.744      -10.265
             kirkuk      -15.051       -9.572
             marawi      -15.744  
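
Most of the words at the top of this ranking sit at the same fake-class value of -15.744, which is what smoothing assigns to a word that never appears in that class. Ranking by the difference alone therefore favours names that occur in only one class. One option is to require a minimum count in both classes before ranking, as sketched below. `rank_by_log_ratio` is a hypothetical helper, the `min_count` threshold is an arbitrary assumption, and the ratio uses raw counts without normalising for class size.

```python
import math

def rank_by_log_ratio(counts, min_count=100, alpha=1.0):
    """Sort words by log((real + alpha) / (fake + alpha)), keeping only
    words seen at least min_count times in each class."""
    kept = {w: math.log((real + alpha) / (fake + alpha))
            for w, (fake, real) in counts.items()
            if fake >= min_count and real >= min_count}
    return sorted(kept, key=kept.get, reverse=True)

# (fake count, real count). The first four rows are copied from the
# word-frequency table printed later in this notebook; 'rakhine' is a
# made-up zero-count case added to show the filter.
counts = {
    "the": (378260, 334713),
    "this": (40320, 14668),
    "china": (893, 6055),
    "minister": (677, 5954),
    "rakhine": (0, 500),
}
print(rank_by_log_ratio(counts))  # ['minister', 'china', 'the', 'this']
```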

In [33]:
print('clf.feature_count_ :', clf.feature_count_.shape)

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake count', 'Real count'))
    for j in range(100):   # top 100 most frequent words for each class
        index = -1 - j
        feature_index = np.argsort(clf.feature_count_[i,:])[index]
        print('%19s %12d %12d' %(vectorizer.get_feature_names()[feature_index], clf.feature_count_[0,feature_index], clf.feature_count_[1,feature_index]))
    print()
    
    
#print('Real_News clf.feature_count_ :', np.sort(clf.feature_count_[1,:]))
#print('Real_News clf.feature_count indices :', np.argsort(clf.feature_count_[1,:]))
##print('Real_News clf.feature_count words :', vectorizer.get_feature_names()[np.argsort(clf.feature_count_[1,:])])
print()
#print('Fake_News clf.feature_count_ :', clf.feature_count_[0,:])

clf.feature_count_ : (2, 105168)

MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD   Fake count   Real count
                the       378260       334713
                 to       202916       170383
                 of       164632       142334
                and       157655       126704
                 in       119436       126219
               that       105775        61488
                 is        77401        38652
                for        65124        55402
                 on        58168        75697
                 it        58083        34583
              trump        55459        37957
                 he        54925        38037
                was        47303        33346
               with        44025        37877
                his        40597        26565
                 as        40368        32760
               this        40320        14668
                 be        34260        23938
                 by        33339        33241


             former         5122         7501
               when        15522         7411
                her        18163         7373
           campaign         7918         7340
             donald        12383         7256
           security         4108         7200
            percent         3120         7077
              north         1915         6994
               into         9360         6774
              obama        13241         6755
              court         3880         6711
            clinton        13553         6668
              white         9013         6618
                all        17723         6540
             senate         2617         6428
                any         8018         6332
            country         6399         6192
              first         7396         6114
              china          893         6055
          officials         2775         5965
           minister          677         5954
               week         3482  

#### Do a verification test by assigning random 0's and 1's to the dev labels and re-running.

In [35]:
sample = np.random.binomial(1, 0.5, size=dev_labels.shape[0])
print(sample.mean())

0.5020044543429845


In [36]:
print('accuracy: %3.4f' %clf.score(X_dev_transformed, sample))

accuracy: 0.4953


In [37]:
sample2 = np.random.binomial(1, 0.2, size=dev_labels.shape[0])
print(sample2.mean())

0.19510022271714922


In [38]:
print('accuracy: %3.4f' %clf.score(X_dev_transformed, sample2))

accuracy: 0.5145


#### This result is as expected: once the dev labels are randomized, accuracy drops to about 50% in both cases, i.e. the model does no better than a coin flip.
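
Why both runs land near 50% can be checked with a little arithmetic. If the classifier predicts class 1 for a fraction q of the dev documents, and the random labels are 1 with probability p independently of the text, the expected accuracy is q*p + (1-q)*(1-p). The sketch below is illustrative only: `expected_accuracy` is a hypothetical helper, not part of the notebook, and q ≈ 0.476 is back-solved from the 0.5145 score above rather than measured.

```python
def expected_accuracy(p, q):
    """Expected accuracy when labels ~ Bernoulli(p) are independent of the
    predictions, and a fraction q of the predictions are class 1."""
    return q * p + (1 - q) * (1 - p)

# With p = 0.5 the prediction mix does not matter: accuracy is always 0.5.
for q in (0.1, 0.5, 0.9):
    print(q, expected_accuracy(0.5, q))

# With p = 0.2 the score depends on q. Solving 0.8 - 0.6*q = 0.5145 gives the
# share of class-1 predictions implied by the second run (an inferred value).
q_implied = (0.8 - 0.5145) / 0.6
print(round(q_implied, 3), round(expected_accuracy(0.2, q_implied), 4))
```

So the second run stays near 50% only because the model predicts the two classes in roughly equal proportion; a model that predicted mostly class 0 would score closer to 80% against `sample2`.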

### Apply model to LIAR dataset text to predict results and compute score ('title' field contains the statement).

In [4]:
# define each downloaded file
LIAR_TEST_FILENAME = 'test.tsv'
LIAR_TRAIN_FILENAME = 'train.tsv'
LIAR_DEV_FILENAME = 'valid.tsv'

# define the downloaded file path 
DATAPATH = './datasets/LIAR/'

## title =statement, target = politifact rating

h_names= ['id', 'target', 'title', 'subject', 'speaker', 'speaker_job_title', 'state', 'party',
          'barely_true_count', 'false_count', 'half_true_count', 'mostly_true_count','pantsonfire_count',
          'context']

liar_test_data = get_data(LIAR_TEST_FILENAME, sep ='\t', header =None)
liar_train_data = get_data(LIAR_TRAIN_FILENAME, '\t', header =None)
liar_dev_data = get_data(LIAR_DEV_FILENAME, '\t', header =None)
print("LIAR training dataset: ", liar_train_data.shape)
print("LIAR test dataset: ", liar_test_data.shape)
print("LIAR dev dataset: ", liar_dev_data.shape)

liar_test_data.columns = h_names
liar_train_data.columns = h_names
liar_dev_data.columns = h_names
# ## add a label column to the data with the target values
# #fake_data.loc[:,'target'] = '0'
# #true_data['target'] = '1'

# #append the datasets and shuffle them
# all_data = true_data.append(fake_data, ignore_index=True)
# all_data = all_data.sample(frac=1).reset_index(drop=True)

## NOTE: if trouble loading, re-run get_data function.

LIAR training dataset:  (10240, 14)
LIAR test dataset:  (1267, 14)
LIAR dev dataset:  (1284, 14)


In [5]:
# combine all the liar data
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
liar_data = pd.concat([liar_train_data, liar_test_data, liar_dev_data], ignore_index=True)
liar_data = liar_data.sample(frac=1).reset_index(drop=True)
print("Complete LIAR dataset: ",liar_data.shape)
liar_data.head()

Complete LIAR dataset:  (12791, 14)


Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context
0,472.json,mostly-true,"For what we spend in just one week in Iraq, 80...",health-care,florida-consumer-action-network,,Florida,none,0.0,0.0,0.0,1.0,0.0,a press conference in Tampa
1,11871.json,barely-true,Tell me what Madeleine Albrights position was ...,"foreign-policy,iraq",bernie-s,U.S. Senator,Vermont,independent,18.0,12.0,22.0,41.0,0.0,comments on Meet the Press
2,13078.json,false,McGinty previously told a local community news...,population,pat-toomey,Candidate for U.S. Senate,Pennsylvania,republican,3.0,2.0,2.0,1.0,0.0,In a press release
3,6901.json,false,Says when I voted against [an increase in the ...,"candidates-biography,income,voting-record",joseph-kyrillos,State Senator,New Jersey,republican,3.0,3.0,2.0,2.0,1.0,a debate on New Jersey 101.5-FM
4,9486.json,half-true,Gov. Rick Scott signed into law a bill that gi...,"elections,transparency",bill-nelson,,Florida,democrat,3.0,1.0,8.0,10.0,0.0,his campaign website


In [39]:
print(liar_data.title[4])
print(liar_data.target.unique())

Alexi Giannoulias top aide was a longtime BP lobbyist.
['mostly-true' 'barely-true' 'true' 'false' 'half-true' 'pants-fire']


In [6]:
targets = liar_data.target.unique()
print(targets)

print('target,  number of examples')
for target in targets:
    print(target, len(liar_data[liar_data.target==target]))
    
print('\ntotal examples', len(liar_data))

['mostly-true' 'barely-true' 'false' 'half-true' 'true' 'pants-fire']
target,  number of examples
mostly-true 2454
barely-true 2103
false 2507
half-true 2627
true 2053
pants-fire 1047

total examples 12791


In [7]:
liar_data['binary_target'] = -1

'''  # this does not work: ('pants-fire' or 'false' or 'barely-true') evaluates to just 'pants-fire'
for i in range(liar_data.shape[0]):   
    if liar_data.target.iloc[i] == ('pants-fire' or 'false' or 'barely-true') :
        liar_data.binary_target.iloc[i] = 0  # fake news
    elif liar_data.target.iloc[i] == ('true' or 'mostly-true'):
        liar_data.binary_target.iloc[i] = 1  # real news
'''

''' these do not work
#liar_data.binary_target[((liar_data.target=='pants-fire') | (liar_data.target=='false') | (liar_data.target=='barely-true')] = 0
#liar_data['binary_target'] = np.where( ( (liar_data.target=='pants-fire') | (liar_data.target=='false') | (liar_data.target=='barely-true')] = 0                                        
## example:df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
#liar_data['binary_target'] = np.where( (liar_data.target.isin (['pants-fire','false','barely-true']), 0,1))
'''

# This might work better!
def binary_seq_target(rating):
    ## 0 = fake, 1 = real, -1 = "half-true" (discarded later); if no rating provided assume the statement to be true
    map_r = {'pants-fire':0, 'false':0, 'barely-true':0, 'half-true':-1, 'mostly-true':1, 'true':1}
    return map_r.get(rating, 1)
    
## change the target labels to 0 (fake news), 1 (true news), -1 (half-true)
liar_data['binary_target'] = liar_data['target'].apply(binary_seq_target)

'''
# these give a warning: 'A value is trying to be set on a copy of a slice from a DataFrame'
liar_data.binary_target[liar_data.target=='pants-fire'] = 0  # fake news
liar_data.binary_target[liar_data.target=='false'] = 0
liar_data.binary_target[liar_data.target=='barely-true'] = 0
liar_data.binary_target[liar_data.target=='true'] = 1        # real news
liar_data.binary_target[liar_data.target=='mostly-true'] = 1
'''

liar_data.head(10)


Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context,binary_target
0,472.json,mostly-true,"For what we spend in just one week in Iraq, 80...",health-care,florida-consumer-action-network,,Florida,none,0.0,0.0,0.0,1.0,0.0,a press conference in Tampa,1
1,11871.json,barely-true,Tell me what Madeleine Albrights position was ...,"foreign-policy,iraq",bernie-s,U.S. Senator,Vermont,independent,18.0,12.0,22.0,41.0,0.0,comments on Meet the Press,0
2,13078.json,false,McGinty previously told a local community news...,population,pat-toomey,Candidate for U.S. Senate,Pennsylvania,republican,3.0,2.0,2.0,1.0,0.0,In a press release,0
3,6901.json,false,Says when I voted against [an increase in the ...,"candidates-biography,income,voting-record",joseph-kyrillos,State Senator,New Jersey,republican,3.0,3.0,2.0,2.0,1.0,a debate on New Jersey 101.5-FM,0
4,9486.json,half-true,Gov. Rick Scott signed into law a bill that gi...,"elections,transparency",bill-nelson,,Florida,democrat,3.0,1.0,8.0,10.0,0.0,his campaign website,-1
5,8288.json,barely-true,The Capitol Police force is going so far as to...,criminal-justice,chris-larson,Wisconsin Senate Minority Leader,Wisconsin,democrat,6.0,5.0,0.0,1.0,1.0,an interview,0
6,11731.json,mostly-true,Marco Rubio voted against authorizing Presiden...,"candidates-biography,congress,foreign-policy,h...",hillary-clinton,Presidential candidate,New York,democrat,40.0,29.0,69.0,76.0,7.0,a posting on the Clinton website,1
7,3721.json,mostly-true,Says businesses already pay most of the taxes.,"state-budget,state-finances,taxes",steve-ogden,oil and gas producer,Texas,republican,0.0,0.0,1.0,1.0,0.0,comments to reporters on Texas Senate floor,1
8,764.json,false,"John McCain accused Barack Obama ""of letting i...",abortion,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,a TV ad,0
9,6802.json,barely-true,"Brendan Doherty wants to repeal Obamacare, inc...","drugs,health-care,medicare,message-machine-201...",david-cicilline,mayor of Providence,Rhode Island,democrat,7.0,4.0,5.0,4.0,1.0,a campaign commercial,0
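
The first commented-out attempt in the cell above fails for a reason worth spelling out: Python's `or` returns its first truthy operand, so the parenthesised expression collapses to a single string before the comparison runs. The sketch below demonstrates this and shows a set-membership alternative. `to_binary` is a hypothetical helper for illustration; unlike the notebook's `binary_seq_target`, it maps unknown ratings to -1 rather than 1.

```python
# `or` returns its first truthy operand, so the parenthesised expression
# collapses to a single string before `==` is evaluated.
fake_ratings_expr = ('pants-fire' or 'false' or 'barely-true')
print(fake_ratings_expr)              # pants-fire
print('false' == fake_ratings_expr)   # False

# Membership in a set tests against every rating.
FAKE = {'pants-fire', 'false', 'barely-true'}
REAL = {'true', 'mostly-true'}

def to_binary(rating):
    """Hypothetical helper: 0 = fake, 1 = real, -1 = anything else (e.g. 'half-true')."""
    if rating in FAKE:
        return 0
    if rating in REAL:
        return 1
    return -1

print([to_binary(r) for r in ['false', 'mostly-true', 'half-true']])  # [0, 1, -1]
```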


In [8]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[ 1  0 -1]

binary_target,  number of examples
1 4507
0 5657
-1 2627


#### Must discard label = -1 (the "half-true" statements)  

In [10]:
liar_dev_labels = liar_data.binary_target[liar_data.binary_target >= 0].values  ## discard "half-true"!!!!
print('liar_dev_labels:\n', liar_dev_labels[:10])

liar_dev_labels:
 [1 0 0 0 0 1 1 0 0 1]


In [33]:
print(liar_data.title)


0        Last month, 44 of the 50 states saw an increas...
1         Says Russ Feingold cut Medicare by $523 billion.
2        Research performed by economists has shown no ...
3        An effort to repeal voting-reform legislation ...
4        Says recall organizers started their website l...
5        Says the state of Texas rates as unacceptable ...
6        People can use food stamps for anything, inclu...
7        President Barack Obama wants to take in 250,00...
8        Says Hillary Clinton aide Huma Abedin has ties...
9        Medicare only has about 50 percent of it paid ...
10       Common Core is not from the federal government...
11       Says President Franklin Delano Roosevelt sent ...
12       SaysMichael Bennet wants to close Guantanamo B...
13       About 95 percent of (Ohios) electricity comes ...
14       Governor Palin is the most popular governor in...
15       The national economic recovery has led to high...
16       In Atlanta, since 1994 when the Seven Deadly S.

### Using the ISOT "text" model, predict the liar_dev_labels and score the predictions. (lower case them first...)

In [50]:
train_data, train_labels = train_set.text.str.lower().values, train_set.target.values   # original ISOT data

#dev_data = liar_data.title[liar_data.binary_target != -1].values  
dev_data = liar_data.title[liar_data.binary_target != -1].str.lower().values      # full LIAR data, minus "half-true"
dev_labels = liar_data.binary_target[liar_data.binary_target != -1].values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (31428,)
train_labels shape: (31428,)
[0 0 0 ... 0 0 0]

dev_data shape: (10164,)
['in writing his book, gov. perry pointed out that by any measure social security has been a failure.']
dev_labels shape: (10164,)
[1 1 1 ... 0 1 0]


In [51]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # original ISOT training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "text" field: Fit using ISOT data; Predict on LIAR data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "text" field: Fit using ISOT data; Predict on LIAR data:
accuracy: 0.541


In [52]:
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total number of non-zero entries in the sparse document-term matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X.nnz): 6577318
Average number of non-zero features per example (per document): 209.282


In [53]:
print('Non-zero elements in matrix (X_dev_transformed.nnz):', X_dev_transformed.nnz)   # total number of non-zero entries in the sparse document-term matrix
print('Average number of non-zero features per example (per document): %.3f' %(X_dev_transformed.nnz/X_dev_transformed.shape[0]))  # non-zero entries / number of documents

Non-zero elements in matrix (X_dev_transformed.nnz): 161111
Average number of non-zero features per example (per document): 15.851


In [54]:
print('LIAR true:', len(liar_data.binary_target[liar_data.binary_target == 1]))
#print('LIAR true:', liar_data.binary_target[liar_data.binary_target == 1].shape)
print('LIAR false:', len(liar_data.binary_target[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657
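
These class counts give a useful reference point for the cross-dataset score. A classifier that always predicts the majority class (fake, label 0) would already be right on 5657 of the 10164 LIAR statements. The arithmetic below uses only the counts and the accuracy printed above; `majority_baseline` is a hypothetical helper added for illustration.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

liar_counts = {1: 4507, 0: 5657}        # counts printed by the cell above
baseline = majority_baseline(liar_counts)
isot_to_liar_accuracy = 0.541           # score reported for the ISOT-trained model

print('majority-class baseline: %.3f' % baseline)
print('model minus baseline:    %+.3f' % (isot_to_liar_accuracy - baseline))
```

On these numbers the ISOT-trained model (0.541) falls slightly below the always-fake baseline (about 0.557), which suggests that very little of what it learned on ISOT carries over to LIAR statements.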


### REVERSE: Using LIAR model, predict the ISOT "text" and score the predictions. (lower case them first...)

In [71]:
train_data = liar_data.title[liar_data.binary_target != -1].str.lower().values   # full LIAR data
train_labels = liar_data.binary_target[liar_data.binary_target != -1].values 
 
dev_data = train_set.text.str.lower().values                                      # ISOT data
dev_labels = train_set.target.values 


train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
#print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (10164,)
train_labels shape: (10164,)
[1 1 0 ... 1 1 1]

dev_data shape: (31428,)
dev_labels shape: (31428,)
[1 0 0 ... 1 0 0]


In [70]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # LIAR training data
X_dev_transformed = vectorizer.transform(dev_data) # ISOT

# MultinomialNB
print('\nMultinomialNB:  Fit using LIAR data; predict on ISOT "text" data')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB:  Fit using LIAR data; predict on ISOT "text" data
accuracy: 0.57


In [79]:
print('LIAR true:', len(liar_data.binary_target[liar_data.binary_target == 1]))
#print('LIAR true:', liar_data.binary_target[liar_data.binary_target == 1].shape)
print('LIAR false:', len(liar_data.binary_target[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657


In [None]:
'''
# Fit a Multinomial Naive Bayes model and find the optimal value for alpha

cv_params = {'alpha': [1E-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
mnb = GridSearchCV(estimator=MultinomialNB(), param_grid=cv_params, scoring='f1_weighted', cv=10, n_jobs=-1)
mnb.fit(X, train_labels)  

print('\nMultinomial Naive Bayes best GridSearchCV results:')
print('Best params:', mnb.best_params_)
print('Best score: %.3f' %(mnb.best_score_))
#print('Best estimator: \n', mnb.best_estimator_)

mnb_dev_predicted_labels = mnb.predict(X_dev_transformed)  # "predict" and report accuracy using dev set
print('f1 score of dev predicted labels using Multinomial Naive Bayes: %.3f' %(metrics.f1_score(dev_labels, mnb_dev_predicted_labels, average='weighted')))
#print('classification report of dev predicted labels: \n', classification_report(dev_labels, mnb_dev_predicted_labels))
print()
'''


In [None]:
'''
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little 
meaningful information about the actual contents of the document. If we were to feed the direct count data directly to 
a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very 
common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: 
tf-idf(t, d) = tf(t, d) * idf(t).
'''

t_vectorizer = TfidfVectorizer()
t_X = t_vectorizer.fit_transform(train_data)   
#print(t_X.shape)
t_X_dev = t_vectorizer.transform(dev_data)
#print(t_X_dev.shape)


# MultinomialNB
#The multinomial distribution normally requires integer feature counts. 
#However, in practice, fractional counts such as tf-idf may also work.
print('\nMultinomialNB with TfidfVectorizer')
alpha = 1.0
t_clf = MultinomialNB(alpha=alpha)
t_clf.fit(t_X, train_labels)

print('accuracy: %3.2f' %t_clf.score(t_X_dev, dev_labels))

t_dev_predicted_labels = t_clf.predict(t_X_dev)  # "predict" and report accuracy using dev set
#print(t_dev_predicted_labels.shape)

print('\nf1 score of dev predicted labels:', metrics.f1_score(dev_labels, t_dev_predicted_labels, average='weighted'))
print('classification report of dev predicted labels: \n', classification_report(dev_labels, t_dev_predicted_labels))
print()


MultinomialNB with TfidfVectorizer
accuracy: 0.94

f1 score of dev predicted labels: 0.9373187230659072
classification report of dev predicted labels: 
               precision    recall  f1-score   support

           0       0.93      0.95      0.94      3485
           1       0.94      0.93      0.93      3250

   micro avg       0.94      0.94      0.94      6735
   macro avg       0.94      0.94      0.94      6735
weighted avg       0.94      0.94      0.94      6735
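
To make the tf-idf weighting described in the cell above concrete, here is a minimal pure-Python sketch. It assumes the smoothed idf that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it skips the final L2 row normalization. The three toy documents and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the senate passed the bill",
    "the president signed the bill",
    "the video went viral",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                      # raw term counts
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print('%10s %.3f' % (term, w))
```

A word that appears in every document, such as "the", gets the minimum idf of 1, while rarer words are scaled up. This narrows the gap between very frequent words and the rarer, more informative ones.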




NOTE: Nearly the same results (accuracy ~0.94) for both the default CountVectorizer and TfidfVectorizer.  Using the full text means that nearly all TRUE news documents contain the word "Reuters" (21378 of 21417, as counted below), which gives the model an unfair advantage.  Will try to remove those and run again, expecting lower accuracy.  
Should also account for text starting with: "The following statements were posted to the verified Twitter accounts of U.S. President Donald Trump, @realDonaldTrump and @POTUS. The opinions expressed are his own. Reuters has not edited the statements or confirmed their accuracy."

### Repeat Naive Bayes on text field after removing first chunk of text, including "Reuters"

In [None]:
true_data.head()

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [None]:
true_data.iloc[0,1][22:]

' The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific resear

In [None]:
true_data.iloc[:,1]
#true_data.iloc[13,1]

0        WASHINGTON (Reuters) - The head of a conservat...
1        WASHINGTON (Reuters) - Transgender people will...
2        WASHINGTON (Reuters) - The special counsel inv...
3        WASHINGTON (Reuters) - Trump campaign adviser ...
4        SEATTLE/WASHINGTON (Reuters) - President Donal...
5        WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T...
6        WEST PALM BEACH, Fla (Reuters) - President Don...
7        The following statements were posted to the ve...
8        The following statements were posted to the ve...
9        WASHINGTON (Reuters) - Alabama Secretary of St...
10       (Reuters) - Alabama officials on Thursday cert...
11       NEW YORK/WASHINGTON (Reuters) - The new U.S. t...
12       The following statements were posted to the ve...
13       The following statements were posted to the ve...
14        (In Dec. 25 story, in second paragraph, corre...
15       (Reuters) - A lottery drawing to settle a tied...
16       WASHINGTON (Reuters) - A Georgian-American bus.

In [None]:
# How many of the TRUE NEWS docs contain "Reuters"?  
# How many of the TRUE NEWS docs contain both "following" and "statements" (a proxy for the "The following statements..." preamble)?  

reuters_counter=0
statements_counter=0

for i in range(true_data.shape[0]):
    text = true_data.iloc[i,1]
    # use `in` rather than find() > 0, which would miss a match at position 0
    if "Reuters" in text:
        reuters_counter += 1
    if ("following" in text) and ("statements" in text):
        statements_counter += 1

print('reuters_counter:', reuters_counter)
print('statement_counter:', statements_counter)
print('total true docs:', true_data.shape[0])



reuters_counter: 21378
statement_counter: 156
total true docs: 21417


#### Need to remove "Reuters" from True News

In [31]:
re.sub(r"^.?([r,R]euters) - ", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")  


'WASHINGTON (Reuters) - my name is reuters, Reuters is the code'

In [32]:
re.sub(r"[r,R]euters", "", "my name is reuters, Reuters is the code")

'my name is ,  is the code'

In [39]:
re.sub(r"[\w+\s+]+[r,R]euters", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")

'WASHINGTON (Reuters) -, is the code'

In [38]:
re.sub(r"[\w+\s+]+([r,R]euters) - ", "", "WASHINGTON (Reuters) - my name is reuters, Reuters is the code")

'WASHINGTON (Reuters) - my name is reuters, Reuters is the code'

In [40]:
re.sub(r"\w.*[r,R]euters\W*", "","ASPEN, Colorado (Reuters) - The Trump administ")  ### This is the one we want...

'The Trump administ'
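
One caveat about the chosen pattern: `.*` is greedy, so on a text that mentions Reuters again later on the same line it deletes everything up to the last mention, and `[r,R]` also matches a literal comma. The sketch below shows that behavior on the test sentence used above and contrasts it with a stricter alternative that only strips a leading dateline of the form `CITY (Reuters) - `. The `DATELINE` pattern, its 80-character limit, and `strip_dateline` are assumptions for illustration, not the notebook's method.

```python
import re

sample = "WASHINGTON (Reuters) - my name is reuters, Reuters is the code"

# Greedy pattern from the cell above: removes through the LAST "Reuters".
greedy = re.sub(r"\w.*[r,R]euters\W*", "", sample)
print(repr(greedy))

# Stricter alternative: only a short prefix without parentheses, followed by
# "(Reuters)" and a dash, anchored at the start of the text.
DATELINE = re.compile(r"^[^()]{0,80}\(Reuters\)\s*-\s*")

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' dateline, if present."""
    return DATELINE.sub("", text, count=1)

print(repr(strip_dateline(sample)))
print(repr(strip_dateline("ASPEN, Colorado (Reuters) - The Trump administ")))
print(repr(strip_dateline("He told Reuters on Sunday that nothing changed.")))
```

The trade-off is that the stricter pattern leaves articles without a standard dateline untouched, for example the "The following statements..." preamble, so those would still need separate handling.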

In [None]:
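def remove_reuters(text):
    # Lazy match, applied once from the start of the text: strip everything up to and including
    # the FIRST "Reuters"/"reuters" plus trailing punctuation. A greedy ".*" would strip up to the
    # last mention on the line, and "[r,R]" would also match a literal comma.
    return re.sub(r"^.*?[rR]euters\W*", "", text, count=1)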

true_data['text2'] = true_data['text'].apply(remove_reuters)
true_data.head()

In [None]:
fake_data['text2'] = fake_data['text']

In [None]:
#append the datasets and shuffle them
all_data2 = pd.concat([true_data, fake_data], ignore_index=True)   # pd.concat replaces DataFrame.append (removed in pandas 2.0)
all_data2 = all_data2.sample(frac=1).reset_index(drop=True)

all_data2.describe()

In [None]:
all_data2.head()

In [None]:
## Re-define train/dev/test:

train_set = all_data2[ :int(len(all_data2)*train_fract)].reset_index(drop=True)
dev_set = all_data2[int(len(all_data2)*(train_fract)) : int(len(all_data2)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = all_data2[int(len(all_data2)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)


train_data, train_labels = train_set.text2.values, train_set.target.values
dev_data, dev_labels = dev_set.text2.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

print('\ntrain_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print(train_labels)
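
The slicing above turns two fractions into row boundaries. As a sanity check, the sketch below reproduces the 31428 training rows and 6735 dev rows reported elsewhere in this notebook. The row total (44898) and the 0.70/0.15 fractions are assumptions chosen because they match those sizes; `train_fract` and `dev_fract` are defined earlier in the notebook and may differ. `split_bounds` is a hypothetical helper.

```python
def split_bounds(n_rows, train_fract, dev_fract):
    """Return the (train_end, dev_end) row indices used for slicing."""
    train_end = int(n_rows * train_fract)
    dev_end = int(n_rows * (train_fract + dev_fract))
    return train_end, dev_end

# Assumed values: 44898 rows with a 0.70 / 0.15 / 0.15 split.
n_rows, train_fract, dev_fract = 44898, 0.70, 0.15
train_end, dev_end = split_bounds(n_rows, train_fract, dev_fract)

print('train rows:', train_end)             # rows [0, train_end)
print('dev rows:  ', dev_end - train_end)   # rows [train_end, dev_end)
print('test rows: ', n_rows - dev_end)      # rows [dev_end, n_rows)
```

Because every boundary is derived from the same row count, all three slices should use the length of the same DataFrame.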

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB with "Reuters" removed from text field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.2f' %clf.score(X_dev_transformed, dev_labels))

In [None]:
print(train_data)

### Run Naive Bayes on the Title field

In [29]:
fake_data.title

#print(fake_data.title.iloc[:15])

0         Donald Trump Sends Out Embarrassing New Year’...
1         Drunk Bragging Trump Staffer Started Russian ...
2         Sheriff David Clarke Becomes An Internet Joke...
3         Trump Is So Obsessed He Even Has Obama’s Name...
4         Pope Francis Just Called Out Donald Trump Dur...
5         Racist Alabama Cops Brutalize Black Boy While...
6         Fresh Off The Golf Course, Trump Lashes Out A...
7         Trump Said Some INSANELY Racist Stuff Inside ...
8         Former CIA Director Slams Trump Over UN Bully...
9         WATCH: Brand-New Pro-Trump Ad Features So Muc...
10        Papa John’s Founder Retires, Figures Out Raci...
11        WATCH: Paul Ryan Just Told Us He Doesn’t Care...
12        Bad News For Trump — Mitch McConnell Says No ...
13        WATCH: Lindsey Graham Trashes Media For Portr...
14        Heiress To Disney Empire Knows GOP Scammed Us...
15        Tone Deaf Trump: Congrats Rep. Scalise On Los...
16        The Internet Brutally Mocks Disney’s New Trum.

In [53]:
train_data, train_labels = train_set.title.values, train_set.target.values
dev_data, dev_labels = dev_set.title.values, dev_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('\ntrain_labels shape:', train_labels.shape)
print(train_labels)


train_data shape: (31428,)
['FARMER FINED A WHOPPING $2.8 MILLION Asks President Trump For Help']

train_labels shape: (31428,)
[0 1 1 ... 1 1 0]


In [54]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)


print('X.shape:', X.shape) # (documents, features): one row per document, one column per unique word in the vocabulary
print('Vocabulary size (number of features or columns):', X.shape[1])
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total number of non-zero entries in the sparse matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / number of documents
print('Fraction of non-zero elements in matrix: %.4f' %( X.nnz/(X.shape[0] * X.shape[1])) )   # non-zero entries / (rows * columns)


X.shape: (31428, 18698)
Vocabulary size (number of features or columns): 18698
Non-zero elements in matrix (X.nnz): 382799
Average number of non-zero features per example (per document): 12.180
Fraction of non-zero elements in matrix: 0.0007


In [55]:
# Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? 
vectorizer_dev = CountVectorizer()
X_dev = vectorizer_dev.fit_transform(dev_data)  # Independently build a vocabulary using dev_data.
print('X_dev.shape:', X_dev.shape)
print('Vocabulary using train data:', X.shape[1])  # 
print('Vocabulary using dev data:', X_dev.shape[1])                # 

# Feed dev_data into the vectorizer fit using training data.
X_dev_transformed = vectorizer.transform(dev_data)
print('X_dev_transformed shape:', X_dev_transformed.shape)
#print('X_dev_transformed.shape', X_dev_transformed.shape)  # ()  EXPECT .shape[1] equal to original number of features
#print('non-zero indices in X_dev_transformed:', X_dev_transformed.nonzero())  # could also use this to check which features missing...

''' This is way too slow; use set intersection instead!!
# Look at each feature (vocabulary word) in X_dev and see if it is a feature in X.
count = 0
for i in range(X_dev.shape[1]):
    if vectorizer_dev.get_feature_names()[i] in vectorizer.get_feature_names():
        count += 1
print('Count of words (features) in X_dev also in X:', count)   
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - count)/X_dev.shape[1]) )
count of words (features) in X_dev also in X: 12219
Fraction of words in dev data missing from training vocabulary: 0.248
'''

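# Compare the two vocabularies as sets. vocabulary_ maps each word to its column index,
# so its keys are the feature names (this avoids get_feature_names(), which newer scikit-learn releases dropped).
set1 = set(vectorizer_dev.vocabulary_)
set2 = set(vectorizer.vocabulary_)
shared = set1 & set2
print('Count of words (features) in X_dev also in X:', len(shared))
print('Fraction of words in dev data missing from training vocabulary: %.3f' %((X_dev.shape[1] - len(shared))/X_dev.shape[1]) )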

X_dev.shape: (6735, 10493)
Vocabulary using train data: 18698
Vocabulary using dev data: 10493
X_dev_transformed shape: (6735, 18698)
Count of words (features) in X_dev also in X: 9319
Fraction of words in dev data missing from training vocabulary: 0.112
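
The set-based comparison is fast because membership tests on a set take roughly constant time, whereas the commented-out loop rebuilt and scanned a full feature list for every dev word. A minimal sketch of the same idea on toy vocabularies (the words are made up), followed by the arithmetic behind the 0.112 figure using the sizes printed above:

```python
train_vocab = {'trump', 'senate', 'video', 'court', 'obama'}
dev_vocab = {'trump', 'video', 'duterte', 'quad'}

shared = dev_vocab & train_vocab          # words seen in both vocabularies
missing = dev_vocab - train_vocab         # dev words the model has never seen
print(sorted(missing), len(missing) / len(dev_vocab))

# Same arithmetic with the sizes printed above.
dev_vocab_size, shared_size = 10493, 9319
print('%.3f' % ((dev_vocab_size - shared_size) / dev_vocab_size))
```

Words in `missing` map to no column when the training vectorizer transforms the dev data, so they contribute nothing to the prediction.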


In [56]:
# MultinomialNB
print('\nMultinomialNB using "title" field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field
accuracy: 0.952


In [57]:
#print('title, target label\n', train_set.title, train_set.target)
print('title, target label\n', train_set.title[4], train_set.target[4])

title, target label
  Sander’s Campaign Manager Literally Stole An ‘Onion’ Story To Throw Clinton Under The Bus (VIDEO) 0


In [58]:
type(train_set.target)

pandas.core.series.Series

In [59]:
type(train_labels)

numpy.ndarray

#### Re-do training and dev eval using only LOWER CASE text.  (Should make no difference: CountVectorizer lowercases its input by default, `lowercase=True`.)

In [60]:
print(train_set.title.values)

['FARMER FINED A WHOPPING $2.8 MILLION Asks President Trump For Help'
 "Kremlin 'deeply concerned' by rising tension on Korean peninsula"
 "Turkey's Erdogan takes legal action after lawmaker calls him 'fascist dictator'"
 ... 'Trump officials to unveil plan to cut factory rules this week'
 'Ukraine to ramp up health spending after anti-corruption push'
 'BREAKING: Putin Tramples Obama’s Imaginary Red Line With Airstrikes In Syria']


In [61]:
print(all_data['title'].str.lower().values)

['farmer fined a whopping $2.8 million asks president trump for help'
 "kremlin 'deeply concerned' by rising tension on korean peninsula"
 "turkey's erdogan takes legal action after lawmaker calls him 'fascist dictator'"
 ... 'tillerson says u.s. committed to nato in first alliance meeting'
 'without trump, republican debate has second lowest rating'
 'house speaker ryan expects tax plan this fall: nyt interview']


In [62]:
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values
dev_data, dev_labels = dev_set.title.str.lower().values, dev_set.target.values

In [63]:
print(train_data)
print()
print(dev_data)
print(train_data.shape, dev_data.shape)

['farmer fined a whopping $2.8 million asks president trump for help'
 "kremlin 'deeply concerned' by rising tension on korean peninsula"
 "turkey's erdogan takes legal action after lawmaker calls him 'fascist dictator'"
 ... 'trump officials to unveil plan to cut factory rules this week'
 'ukraine to ramp up health spending after anti-corruption push'
 'breaking: putin tramples obama’s imaginary red line with airstrikes in syria']

['iran provided capability for missile attacks from yemen: u.s. air force'
 'obama threatens to surface from leftist bunker to speak out against trump’s plan to end daca'
 'just in: criminal hackers who stole data from 500 million yahoo email users were russian agents obama regime invited to u.s…courted by fbi'
 ...
 'phony hillary pulls the woman card at jay-z/beyonce gig: “we have a glass ceiling to crack…” [video]'
 "u.s. seeks meeting soon to revive asia-pacific 'quad' security forum"
 'u.n. hopes trump will preach human rights to duterte']
(31428,) (67

In [64]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field
accuracy: 0.952
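
For reference, `alpha = 1.0` is add-one (Laplace) smoothing: MultinomialNB estimates P(word | class) as (count + alpha) / (total count in class + alpha * vocabulary size). The sketch below reproduces that arithmetic by hand on made-up toy counts (the words and numbers are illustrative, not taken from the ISOT data):

```python
import math

alpha = 1.0
# Toy per-class word counts (hypothetical numbers, for illustration only)
counts = {
    'fake': {'video': 3, 'trump': 2, 'trade': 0},
    'real': {'video': 0, 'trump': 1, 'trade': 3},
}
vocab_size = 3

smoothed = {}
for label, word_counts in counts.items():
    total = sum(word_counts.values())
    # Add alpha to every count so that unseen words keep a non-zero probability
    smoothed[label] = {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }

for label, probs in smoothed.items():
    for word, p in probs.items():
        print('%5s %6s  P=%.3f  log P=%.3f' % (label, word, p, math.log(p)))
```

The point of the smoothing is visible in the zero-count entries: "trade" in the fake class and "video" in the real class still receive a small positive probability instead of zeroing out the whole product.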


In [65]:
print('clf.feature_count_ :', clf.feature_count_.shape)

# NOTE: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
feature_names = vectorizer.get_feature_names()   # fetch once instead of once per printed row

for i in range(2):   # 2 category labels
    print('\nMOST COMMON WORDS IN CLASS', i, '(Fake=0; Real=1)')
    print('%19s %12s %12s' %('WORD', 'Fake count', 'Real count'))
    # sort the class counts once, most frequent first, and keep the top 100
    top_indices = np.argsort(clf.feature_count_[i,:])[::-1][:100]
    for feature_index in top_indices:
        print('%19s %12d %12d' %(feature_names[feature_index], clf.feature_count_[0,feature_index], clf.feature_count_[1,feature_index]))
    print()

clf.feature_count_ : (2, 18698)

MOST COMMON WORDS IN CLASS 0 (Fake=0; Real=1)
               WORD   Fake count   Real count
                 to         6635         5424
              trump         6608         3896
              video         6044           20
                the         4386          389
                 of         3564         2065
                for         3391         1918
                 in         3326         3249
                 on         2657         2349
                and         2575          435
                 is         2018          305
              obama         1853          491
               with         1746         1040
            hillary         1682           31
              watch         1395           23
              about         1166          193
                his         1148          146
              after         1113          703
                 he         1109          247
                 it         1106          177
 

              about         1166          193
             attack          240          193
         opposition            5          192
          democrats          260          191
             former          222          189
         healthcare           64          187
       presidential           86          187
              trade            9          186
                say          141          186
            foreign          107          185
             mexico           72          181
             german           32          179
            britain           18          179
          tillerson           14          178
                two          148          177
               meet           57          177



#### Many words in the titles show an imbalance between Fake News and Real News.  For example, "trump" is favored by nearly a 2:1 ratio in Fake vs. Real news (6608 vs. 3896).  "hillary" is favored by ~ 54:1 (1682 vs. 31), "watch" by ~ 61:1 (1395 vs. 23), and "video" by ~ 300:1 (6044 vs. 20) in Fake vs. Real news.  Words such as "he", "his", "she", "her", "him", "it", "they", "them", "we", "us", "like", "here", "donald", "gop", "liberal", "media", "america", "muslim", "racist", and "breaking" are also heavily favored in Fake news in this dataset.
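
The Fake:Real ratios can be checked directly against the counts printed in the table. A small sketch, with the counts copied by hand from the output of the previous cell (the dictionary name is illustrative):

```python
# (fake_count, real_count) copied from the table printed above
title_counts = {
    'trump':   (6608, 3896),
    'video':   (6044, 20),
    'hillary': (1682, 31),
    'watch':   (1395, 23),
    'obama':   (1853, 491),
}

ratios = {word: fake / real for word, (fake, real) in title_counts.items()}

# Print the most lopsided words first
for word, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print('%10s  %7.1f : 1' % (word, ratio))
```

These are raw count ratios; they are not normalized for the number of Fake and Real titles in the training set.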

In [66]:
## Print out top MISCLASSIFICATIONS

# Make predictions on the dev data and show the top n documents where the ratio R is largest, where R is:
# R = maximum predicted probability / predicted probability of the correct label
# (R == 1 for a correct prediction; R > 1 means the model preferred the wrong class)

n=200
print('X_dev_transformed.shape:', X_dev_transformed.shape)  # (6735, 18698)
print()

# Call predict_proba and predict once, rather than once (or twice) per dev example inside the loop
pred_probs = clf.predict_proba(X_dev_transformed)    # shape: (number of dev examples, 2)
pred_classes = clf.predict(X_dev_transformed)
correct_labels = dev_labels.astype(int)              # label value doubles as the column index (classes are 0 and 1)

# one R value for each dev example (6735)
r_array = pred_probs.max(axis=1) / pred_probs[np.arange(len(correct_labels)), correct_labels]

print('max R:', np.max(r_array)) 
#print('mean R:', np.mean(r_array))
print()

# Sort once, largest R first, and print the top n
for label_index in np.argsort(r_array)[::-1][:n]:
    print('R:', r_array[label_index])    
    print('\nDev Title',str(label_index)+':\n', dev_data[label_index])
    print('\nLABEL =', dev_labels[label_index])
    print('predicted class =', pred_classes[label_index],'\n' )
    print(70*'-')


X_dev_transformed.shape: (6735, 18698)

max R: 79336.15609396338

R: 79336.15609396338

Dev Title 1093:
  white house officials push for kushner to step aside amid russia scandal

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 52394.02733777778

Dev Title 6294:
 forsaken sultan: erdogan isolated ahead trump meeting in washington

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 15922.715357921137

Dev Title 2404:
 “rise from your knees!” poland’s prime minister tells eu no more migrants for poland

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 8965.02369023231

Dev Title 4810:
 philippines: 2016 washington’s fury as philippine’s elections threaten us anti-china policy

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 8628.269321446822

Dev Title 3620:
 clin


----------------------------------------------------------------------
R: 57.773614452614666

Dev Title 2431:
 indian police ask interfaith couples: is it love or terror?

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 57.31230514925986

Dev Title 5838:
 russian military: us coalition predator drone spotted at time & place of syria un aid convoy attack

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 55.238748076853646

Dev Title 2618:
 mohammed dahlan speaks about palestinian unity and his back-room role

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 50.16278549505422

Dev Title 3095:
 teen jackie evancho first singer confirmed for trump inauguration

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 50.11125778774602

Dev Title 2634:
 lake oroville dam s

R: 10.478674373869522

Dev Title 4339:
 trump promised to repeal obamacare. now what?

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 10.455522631588375

Dev Title 5658:
  trump tells corporate ceos he will slash 75 percent of regulations that protect workers and environment

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 10.326991916955192

Dev Title 6463:
 ‘anti-russia’ escalation? plans for new us marine base in norway

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 10.301628881798175

Dev Title 5356:
  victory: supreme court saves affirmative action

LABEL = 0
predicted class = 1 

----------------------------------------------------------------------
R: 10.072698532659977

Dev Title 3700:
 clinton campaign: no evidence computer systems were compromised

LABEL = 1
predicted class = 0 

----------------------

R: 3.7664209601747283

Dev Title 5634:
 would-be reagan assassin released from psychiatric hospital

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
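
As a reading aid for the R values above: for a two-class model, R is 1 whenever the prediction is correct, and otherwise equals the odds the model gave to the wrong class over the right one. A minimal sketch with made-up probabilities (hypothetical values, not model output; the helper name is illustrative):

```python
def misclassification_ratio(probs, correct_label):
    """R = max predicted probability / predicted probability of the correct label."""
    return max(probs) / probs[correct_label]

# Each entry: ([P(fake), P(real)], true label); values are hypothetical
examples = [
    ([0.90, 0.10], 0),      # confident and correct
    ([0.20, 0.80], 0),      # wrong, mildly confident
    ([0.0001, 0.9999], 0),  # wrong, very confident
]

r_values = [misclassification_ratio(p, y) for p, y in examples]
for (probs, label), r in zip(examples, r_values):
    print('probs=%s label=%d R=%.1f' % (probs, label, r))
```

By the same arithmetic, the largest R above (about 79,336) corresponds to the true class receiving a probability of roughly 1 in 79,337.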


### Using the ISOT "title" model, predict the liar_dev_labels and score the predictions. (lower case them first...)

In [44]:
#train_data, train_labels = train_set.title.values, train_set.target.values   # original ISOT data
train_data, train_labels = train_set.title.str.lower().values, train_set.target.values   # original ISOT data

#dev_data = liar_data.title[liar_data.binary_target != -1].values  
dev_data = liar_data.title[liar_data.binary_target != -1].str.lower().values                                            # full LIAR data
dev_labels = liar_data.binary_target[liar_data.binary_target != -1].values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (31428,)
['farmer fined a whopping $2.8 million asks president trump for help']
train_labels shape: (31428,)
[0 1 1 ... 1 1 0]

dev_data shape: (10164,)
['a gallon (of gasoline) delivered to the front lines for our troops in afghanistan cost more than $400.']
dev_labels shape: (10164,)
[1 0 1 ... 1 0 0]


In [45]:

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # original ISOT training data
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using "title" field: Fit using ISOT data; Predict on LIAR data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field: Fit using ISOT data; Predict on LIAR data:
accuracy: 0.533


In [46]:
print('Non-zero elements in matrix (X.nnz):', X.nnz)   # total number of non-zero entries in the sparse ISOT training matrix
print('Average number of non-zero features per example (per document): %.3f' %(X.nnz/X.shape[0]))  # non-zero entries / documents = distinct vocabulary words per title

Non-zero elements in matrix (X.nnz): 382799
Average number of non-zero features per example (per document): 12.180


In [47]:
print('Non-zero elements in matrix (X_dev_transformed.nnz):', X_dev_transformed.nnz)   # non-zero entries in the LIAR matrix; only words in the ISOT vocabulary are counted
print('Average number of non-zero features per example (per document): %.3f' %(X_dev_transformed.nnz/X_dev_transformed.shape[0]))  # non-zero entries / documents = distinct in-vocabulary words per statement

Non-zero elements in matrix (X_dev_transformed.nnz): 154134
Average number of non-zero features per example (per document): 15.165
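
`X_dev_transformed.nnz` only counts LIAR words that already exist in the ISOT vocabulary, so it says nothing about how much of each statement the model cannot see. One way to estimate that gap is an out-of-vocabulary (OOV) rate. The sketch below is an illustration under stated assumptions: it uses two ISOT titles and one LIAR statement from this notebook's output rather than the full data, and a regular expression that mirrors CountVectorizer's default token pattern. The helper names are hypothetical.

```python
import re

# Same pattern CountVectorizer uses by default: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

train_titles = [
    'farmer fined a whopping $2.8 million asks president trump for help',
    'ukraine to ramp up health spending after anti-corruption push',
]
dev_statements = [
    'a gallon (of gasoline) delivered to the front lines for our troops in afghanistan cost more than $400.',
]

train_vocab = {tok for title in train_titles for tok in tokenize(title)}
dev_tokens = [tok for text in dev_statements for tok in tokenize(text)]
oov_tokens = [tok for tok in dev_tokens if tok not in train_vocab]

oov_rate = len(oov_tokens) / len(dev_tokens)
print('dev tokens: %d, OOV tokens: %d, OOV rate: %.2f' % (len(dev_tokens), len(oov_tokens), oov_rate))
```

On the real data, the same calculation would run over `vectorizer.vocabulary_` and the full `dev_data` array.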


In [48]:
#print(X_dev_transformed)
print(train_data)

['farmer fined a whopping $2.8 million asks president trump for help'
 "kremlin 'deeply concerned' by rising tension on korean peninsula"
 "turkey's erdogan takes legal action after lawmaker calls him 'fascist dictator'"
 ... 'trump officials to unveil plan to cut factory rules this week'
 'ukraine to ramp up health spending after anti-corruption push'
 'breaking: putin tramples obama’s imaginary red line with airstrikes in syria']


In [49]:
print(dev_data)

['a gallon (of gasoline) delivered to the front lines for our troops in afghanistan cost more than $400.'
 'the insurance risk corridors arent going broke like republicans predicted.'
 'americans spend more than $160 billion and 6 billion hours per year complying with the tax code.'
 ...
 'after my first year as governor, i was one of the most unpopular governors, maybe the most unpopular governor in the country. ... it changed.'
 'the immigration bill includes free obamacars, motorcycles or scooters.'
 'every (personhood) bill ive ever support has either had language that says were conforming to the constitutional rulings of the supreme court or something to that effect.']


In [51]:
## Print out top MISCLASSIFICATIONS

# Make predictions on the dev data and show the top n documents where the ratio R is largest, where R is:
# R = maximum predicted probability / predicted probability of the correct label
# (R == 1 for a correct prediction; R > 1 means the model preferred the wrong class)

n=200
print('X_dev_transformed.shape:', X_dev_transformed.shape)  # (10164, 18698)
print()

# Call predict_proba and predict once, rather than once (or twice) per dev example inside the loop
pred_probs = clf.predict_proba(X_dev_transformed)    # shape: (number of dev examples, 2)
pred_classes = clf.predict(X_dev_transformed)
correct_labels = dev_labels.astype(int)              # label value doubles as the column index (classes are 0 and 1)

# one R value for each LIAR example (10164)
r_array = pred_probs.max(axis=1) / pred_probs[np.arange(len(correct_labels)), correct_labels]

print('max R:', np.max(r_array)) 
#print('mean R:', np.mean(r_array))
print()

# Sort once, largest R first, and print the top n
for label_index in np.argsort(r_array)[::-1][:n]:
    print('R:', r_array[label_index])    
    print('\nDev Title',str(label_index)+':\n', dev_data[label_index])
    print('\nLABEL =', dev_labels[label_index])
    print('predicted class =', pred_classes[label_index],'\n' )
    print(70*'-')


X_dev_transformed.shape: (10164, 18698)

max R: 7.737871858182432e+35

R: 7.737871858182432e+35

Dev Title 3077:
 hospitals, doctors, mris, surgeries and so forth are more extensively used and far more expensive in this country than they are in many other countries.''	health-care	mitt-romney	former governor	massachusetts	republican	34	32	58	33	19	a fox news sunday interview
9874.json	barely-true	obamacare cuts seniors medicare.	health-care,medicare	ed-gillespie	republican strategist	washington, d.c.	republican	2	3	2	2	1	a campaign email.
3072.json	mostly-true	the refusal of many federal employees to fly coach costs taxpayers $146 million annually.	government-efficiency,transparency	newsmax	magazine and website	florida	none	0	0	0	1	0	an e-mail solicitation
2436.json	mostly-true	florida spends more than $300 million a year just on children repeating pre-k through 3rd grade.	education	alex-sink		florida	democrat	1	2	2	4	0	figures cites on campaign website
9721.json	true	milwaukee county s

----------------------------------------------------------------------
R: 37902156141.858665

Dev Title 6467:
 you cant give a child an aspirin in school without permission. you cant do any kind of medication, but we can secretly take the child off and have an abortion.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 31572437600.72912

Dev Title 794:
 opponent holly benson said that just because youre poor doesnt mean youre unhealthy, it just means you have a lot more time to go running.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 28872002398.34081

Dev Title 10073:
 even members of the nra, when they were polled recently, were under the impression that everyone has a criminal background check.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 28091440913.504726

Dev Title 6366:
 says hillary clinton tripled t

R: 1181211989.1606631

Dev Title 8594:
 the truth of the matter is that during my administration, the fbi's crime statistics show that violent crime was reduced in massachusetts by 7 percent.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1163646487.6627328

Dev Title 2263:
 big soda has a lot of money.they make a lot of profit off their product and they market in neighborhoods that suffer from these very things that were trying to cure.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1122947206.7903893

Dev Title 4403:
 when mccain was questioned about hiring lobbyists to his campaign staff, "his top lobbyist actually had the nerve to say, 'the american people won't care about this.' "

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 1065672390.0930934

Dev Title 8836:
 there was a recent report out that the pr

predicted class = 0 

----------------------------------------------------------------------
R: 148612411.9522563

Dev Title 8477:
 on the night of the iowa caucuses, obama promised the nation that he would do health care reform focused on cost containment, he opposed an individual mandate, and he said he was going to do it with republicans.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 145531418.82216495

Dev Title 1833:
 we had a no child left behind a similar piece of legislation in our state a number of years ago, well before the federal law. and it's had a big impact here. it's improved schools.

LABEL = 1
predicted class = 0 

----------------------------------------------------------------------
R: 140378567.7250198

Dev Title 9169:
 wisconsin is called the badger state because our ancestors came here with the hopes of living the american dream by mining.

LABEL = 1
predicted class = 0 

-------------------------------
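
The top entry above (Dev Title 3077) contains several LIAR rows fused into a single "title". One plausible cause, stated here as an assumption because the LIAR loading cell is not shown in this section, is quote handling when the tab-separated file is parsed: a statement that opens with a double quote and never closes it makes the parser treat the following tabs and newlines as part of the same field. The sketch below reproduces that effect with Python's `csv` module on two hypothetical rows:

```python
import csv
import io

# Two hypothetical tab-separated rows; the first statement opens with a double
# quote that is never closed, similar in shape to the fused entry shown above.
raw = (
    'a.json\ttrue\t"statement that opens a quote but never closes it\'\'\tcontext a\n'
    'b.json\tfalse\tplain statement\tcontext b\n'
)

# Default quoting: the stray quote starts a quoted field that swallows tabs and newlines
fused = list(csv.reader(io.StringIO(raw), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text
clean = list(csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE))

print(len(fused), 'row(s) with default quoting')
print(len(clean), 'row(s) with csv.QUOTE_NONE')
```

pandas' `read_csv` accepts the same constant through its `quoting` parameter, so `quoting=csv.QUOTE_NONE` would be the corresponding thing to try when loading the LIAR file. Whether this is the actual cause would need to be confirmed against the raw data.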

### REVERSE: Using LIAR model, predict the ISOT "title" and score the predictions. (lower case them first...)

In [70]:
train_data = liar_data.title[liar_data.binary_target != -1].str.lower().values   # full LIAR data
train_labels = liar_data.binary_target[liar_data.binary_target != -1].values 
 
dev_data = train_set.title.str.lower().values                                      # ISOT data
dev_labels = train_set.target.values 


train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
#print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)

print('\ndev_data shape:', dev_data.shape)
#print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)

train_data shape: (10164,)
train_labels shape: (10164,)
[1 1 1 ... 0 1 0]

dev_data shape: (31428,)
dev_labels shape: (31428,)
[0 0 0 ... 0 0 0]


In [72]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           # LIAR training data
X_dev_transformed = vectorizer.transform(dev_data) # ISOT titles

# MultinomialNB
print('\nMultinomialNB using "title" field: Fit using LIAR data; Predict on ISOT data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using "title" field: Fit using LIAR data; Predict on ISOT data:
accuracy: 0.571


### Split LIAR data into Train/Dev/Test, train model, and evaluate on its own type 

In [11]:
#liar_data.head(5)
liar_data = liar_data[liar_data.binary_target != -1]
liar_data.head(5)

Unnamed: 0,id,target,title,subject,speaker,speaker_job_title,state,party,barely_true_count,false_count,half_true_count,mostly_true_count,pantsonfire_count,context,binary_target
0,472.json,mostly-true,"For what we spend in just one week in Iraq, 80...",health-care,florida-consumer-action-network,,Florida,none,0.0,0.0,0.0,1.0,0.0,a press conference in Tampa,1
1,11871.json,barely-true,Tell me what Madeleine Albrights position was ...,"foreign-policy,iraq",bernie-s,U.S. Senator,Vermont,independent,18.0,12.0,22.0,41.0,0.0,comments on Meet the Press,0
2,13078.json,false,McGinty previously told a local community news...,population,pat-toomey,Candidate for U.S. Senate,Pennsylvania,republican,3.0,2.0,2.0,1.0,0.0,In a press release,0
3,6901.json,false,Says when I voted against [an increase in the ...,"candidates-biography,income,voting-record",joseph-kyrillos,State Senator,New Jersey,republican,3.0,3.0,2.0,2.0,1.0,a debate on New Jersey 101.5-FM,0
5,8288.json,barely-true,The Capitol Police force is going so far as to...,criminal-justice,chris-larson,Wisconsin Senate Minority Leader,Wisconsin,democrat,6.0,5.0,0.0,1.0,1.0,an interview,0


In [12]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[1 0]

binary_target,  number of examples
1 4507
0 5657
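
These counts give a useful yardstick for the cross-dataset accuracies: a classifier that always predicts the majority class (0) on the full LIAR set would already score about 0.557. A quick check of that arithmetic, with the counts copied from the output above (variable names are illustrative):

```python
# Class counts copied from the output above
liar_counts = {1: 4507, 0: 5657}

total = sum(liar_counts.values())
majority_label = max(liar_counts, key=liar_counts.get)
majority_baseline = liar_counts[majority_label] / total

print('total examples: %d' % total)
print('majority label: %d' % majority_label)
print('majority-class baseline accuracy: %.3f' % majority_baseline)
```

By that measure, the 0.533 obtained when the ISOT "title" model predicts LIAR labels is below the majority-class baseline.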


In [13]:
#train/dev/test split
#train_dev_split = 0.8

train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

# Compare with a tolerance: exact == on summed floats can fail from rounding error
if np.isclose(train_fract+dev_fract+test_fract, 1.0):
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

train_set = liar_data[ :int(len(liar_data)*train_fract)].reset_index(drop=True)
dev_set = liar_data[int(len(liar_data)*(train_fract)) : int(len(liar_data)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = liar_data[int(len(liar_data)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)

Split fractions add up to 1.0
training set:  (7114, 15)
dev set:  (1525, 15)
test set:  (1525, 15)
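
The split above slices the rows in their existing order. The sketch below reproduces the same boundary arithmetic on row indices and adds a seeded shuffle first, which is one common way to guard against any ordering in the source file. Whether `liar_data` was already shuffled upstream is not shown in this section, so treat the shuffle (and the seed value) as an optional assumption.

```python
import random

n_rows = 10164                      # len(liar_data) after filtering, per the output above
train_fract, dev_fract = 0.70, 0.15

indices = list(range(n_rows))
random.Random(42).shuffle(indices)  # fixed seed (arbitrary choice) for reproducibility

# Same boundary arithmetic as the slicing in the cell above
train_end = int(n_rows * train_fract)
dev_end = int(n_rows * (train_fract + dev_fract))

train_idx = indices[:train_end]
dev_idx = indices[train_end:dev_end]
test_idx = indices[dev_end:]

print(len(train_idx), len(dev_idx), len(test_idx))
```

With a DataFrame, each index list could be passed to `liar_data.iloc[...]` to build the corresponding split.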


In [14]:
# write the LIAR dev set to a CSV file
dev_set.to_csv('liar_dev_set.csv', sep=',')

In [42]:
train_data = train_set.title[train_set.binary_target != -1].str.lower().values   # LIAR training split
train_labels = train_set.binary_target[train_set.binary_target != -1].values 
 
dev_data = dev_set.title[dev_set.binary_target != -1].str.lower().values         # LIAR dev split (same filter as dev_labels)
dev_labels = dev_set.binary_target[dev_set.binary_target != -1].values 

print('train_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print('dev_data shape:', dev_data.shape)
print('dev_labels shape:', dev_labels.shape)

train_data shape: (7114,)
train_labels shape: (7114,)
dev_data shape: (1525,)
dev_labels shape: (1525,)


In [43]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data)           
X_dev_transformed = vectorizer.transform(dev_data)

# MultinomialNB
print('\nMultinomialNB using LIAR training data; Predict on LIAR dev data:')
alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(X, train_labels)

print('accuracy: %3.3f' %clf.score(X_dev_transformed, dev_labels))


MultinomialNB using LIAR training data; Predict on LIAR dev data:
accuracy: 0.622
