# LIAR DETECTION GROUP PROJECT - Neural BOW Models  


### CONTENTS  

Imports  
Load ISOT data from appropriate pickle file  
Load ISOT vocabulary from pickle file  (note: vocab contains both "title" and "text" words)  
Train/Dev/Test split ISOT data  
Load LIWC data for custom features  
Load LIAR data (for evaluating models)  

#### Neural BOW Models:
- Model_1: Initial run replicating settings from Assignment 2, but with ISOT "title" data.  
- Model_2: Use GloVe word embeddings rather than initializing embeddings with uniform random numbers.  
- Model_3: Random word embeddings, but custom LIWC features concatenated into the model. 
- Model_4: Incorporate GloVe embeddings as well as LIWC features. Still training with ISOT "title" data.  
- Model_4a: Model_4 (GloVe embeddings plus LIWC features, ISOT "title" data). TUNE: # fully-connected layers = 2; hidden_dims=50.
- Model_4b: Model_4 (GloVe embeddings plus LIWC features, ISOT "title" data). TUNE: # fully-connected layers = 2; hidden_dims=50; dropout_rate=0.8.
- Model_4c: Model_4 (GloVe embeddings plus LIWC features, ISOT "title" data). TUNE: # fully-connected layers = 2; hidden_dims=50; dropout_rate=0.3.
- Model_4d: Model_4 (GloVe embeddings plus LIWC features, ISOT "title" data). TUNE: # fully-connected layers = 1; hidden_dims=25; dropout_rate=0.5; regularization strength (beta) = 0.001.
- Model_4e: Model_4 (GloVe embeddings plus LIWC features, ISOT "title" data). TUNE: # fully-connected layers = 1; hidden_dims=25; dropout_rate=0.5; regularization strength (beta) = 0.1.
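
To make the Model_4 family concrete, here is a minimal NumPy sketch of a neural bag-of-words forward pass with extra features concatenated in front of the hidden layer. The function name `nbow_forward`, the parameter shapes, and the point where the LIWC features join the network are assumptions for illustration; this is not the project's TensorFlow graph. Dropout and the L2 penalty (tuned in Model_4b to Model_4e) are left out because they only act during training.

```python
import numpy as np

def nbow_forward(ids, ns, embed, liwc_feats, W_h, b_h, w_out, b_out):
    """Forward pass of a one-hidden-layer neural bag-of-words classifier.

    ids:        [batch, max_len] padded integer token ids
    ns:         [batch] true sequence lengths
    embed:      [V, d] embedding matrix (random or GloVe-initialized)
    liwc_feats: [batch, k] per-example feature vector (assumed layout)
    """
    max_len = ids.shape[1]
    # Mask padding positions so they do not contribute to the sum.
    mask = (np.arange(max_len)[None, :] < ns[:, None]).astype(embed.dtype)
    x = (embed[ids] * mask[:, :, None]).sum(axis=1)   # [batch, d]
    h_in = np.concatenate([x, liwc_feats], axis=1)    # [batch, d + k]
    h = np.maximum(0.0, h_in @ W_h + b_h)             # ReLU hidden layer
    logits = h @ w_out + b_out                        # [batch]
    return 1.0 / (1.0 + np.exp(-logits))              # P(label == 1)

# Toy sizes: 50-d embeddings, 74 LIWC columns, 25 hidden units.
rng = np.random.RandomState(0)
V, d, k, hidden = 20, 50, 74, 25
embed = rng.uniform(-1.0, 1.0, size=(V, d))
W_h = rng.normal(scale=0.1, size=(d + k, hidden))
b_h = np.zeros(hidden)
w_out = rng.normal(scale=0.1, size=hidden)
b_out = 0.0

ids = np.array([[3, 4, 5, 0], [6, 7, 0, 0]])
ns = np.array([3, 2])
liwc_feats = rng.randint(0, 2, size=(2, k)).astype(float)
probs = nbow_forward(ids, ns, embed, liwc_feats, W_h, b_h, w_out, b_out)
print(probs.shape)
```

Because of the length mask, whatever id is used for padding has no effect on the prediction.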





    

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import datetime, json, os, re, shutil, sys, time
from importlib import reload
import itertools, collections
from functools import reduce
import unittest
from IPython.display import display, HTML
from sklearn.utils import shuffle
# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
import tensorflow as tf
#assert(tf.__version__.startswith("1.8"))

import pickle
import dill
# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz
from w266_common import patched_numpy_io
import timeit  #For timing


In [2]:
# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [3]:
print('TensorFlow version:', tf.VERSION)

TensorFlow version: 1.10.1


### Load ISOT data and vocabulary from pickle files  
Load the dataset from the Information Security and Object Technology (ISOT) Research Lab at the University of Victoria School of Engineering.

The ISOT Fake News Dataset is a compilation of truthful and fake news articles, obtained from legitimate news sites and from sites flagged as unreliable by politifact.com. The copy used in this notebook contains 44,898 articles.

In [4]:
# Read ISOT data from pickle file.
all_data = pd.read_pickle('parsed_data/df_alldata2.pkl')  # ISOT data (CMU) tokenized and POS tags added

# NOTE: for models 2+, read in the pickle file that includes the GloVe embeddings
# all_data = pd.read_pickle('parsed_data/df_alldata_embed.pkl')  # GloVe embeddings for ISOT title and text tokens

#all_data.info(memory_usage='deep', verbose=True)

In [5]:
all_data.head()

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS
0,BRAINIAC Gets Rejected After Trying To Buy BMW...,Does anyone else out there see a future BMW ca...,Government News,"Mar 20, 2016",0,"[brainiac<allcaps>, gets, rejected, after, try...","[N, V, V, P, V, P, V, ^, P, ^, ^, ,, O, V, A, ...","[does, anyone, else, out, there, see, a, futur...","[V, N, R, P, R, V, D, A, ^, N, N, P, D, N, ,, ..."
1,Windows 10 is Stealing Your Bandwidth (You Mig...,21st Century Wire says We ve heard a lot of no...,US_News,"April 7, 2016",0,"[windows, <number>, is, stealing, your, bandwi...","[^, $, V, V, D, N, ,, O, V, V, P, V, O, ,]","[<number>st, century, wire, says, we, ve, hear...","[A, N, ^, V, O, V, V, D, N, P, R, R, A, N, P, ..."
2,STUNNING STORY The Media And Democrats Hid Fro...,"In an email sent on April 15, 2011, our upstan...",left-news,"Mar 2, 2017",0,"[stunning<allcaps>, story<allcaps>, the, media...","[A, N, D, N, &, N, V, P, ^, ,, R, Z, ^, ^, N, ...","[in, an, email, sent, on, april, <number>, ,, ...","[P, D, N, V, P, ^, $, ,, $, ,, D, A, N, A, ^, ..."
3,North Korea's Kim Jong Un fetes nuclear scient...,SEOUL (Reuters) - North Korean leader Kim Jong...,worldnews,"September 10, 2017",1,"[north, korea's, kim, jong, un, fetes, nuclear...","[^, Z, ^, ^, ^, V, A, N, ,, V, N, N]","[seoul<allcaps>, (, reuters, ), -, north, kore...","[^, ,, ^, ,, ,, ^, ^, N, ^, ^, ^, V, D, A, N, ..."
4,White House developing comprehensive biosecuri...,"ASPEN, Colorado (Reuters) - The Trump administ...",politicsNews,"July 20, 2017",1,"[white, house, developing, comprehensive, bios...","[A, N, V, A, N, N, ,, A]","[aspen<allcaps>, ,, colorado, (, reuters, ), -...","[^, ,, ^, ,, ^, ,, ,, D, ^, N, V, V, D, A, A, ..."


In [6]:
all_data.title[0]

'BRAINIAC Gets Rejected After Trying To Buy BMW With EBT Card…What Happens Next Is HYSTERICAL!'

In [7]:
all_data.title_tokcan[0]

['brainiac<allcaps>',
 'gets',
 'rejected',
 'after',
 'trying',
 'to',
 'buy',
 'bmw<allcaps>',
 'with',
 'ebt<allcaps>',
 'card',
 '…',
 'what',
 'happens',
 'next',
 'is',
 'hysterical<allcaps>',
 '!']
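
The tokens above carry markers such as `<allcaps>` and `<number>`. The notebook does not show the tokenizer that produced them, so the following is only a rough sketch of rules that reproduce the examples visible here. The function name `annotate_token` and the exact rules (digit runs masked first, all-caps tagging for words of two or more letters) are assumptions.

```python
import re

def annotate_token(tok):
    """Lowercase a token, masking digit runs and tagging all-caps words."""
    # Replace every run of digits with a placeholder: "21st" -> "<number>st".
    masked = re.sub(r"\d+", "<number>", tok)
    if masked != tok:
        return masked.lower()
    # Tag words written entirely in capitals (two or more letters).
    if len(tok) > 1 and tok.isalpha() and tok.isupper():
        return tok.lower() + "<allcaps>"
    return tok.lower()

print([annotate_token(t) for t in ["BRAINIAC", "Gets", "Rejected", "10", "21st"]])
```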

In [8]:
'''
print('length of all_data:', len(all_data))  # 44898
print('shape of all_data:', all_data.shape)  # (44898, 11)
print('length of embedded title:', len(all_data.embedded_title))  # 44898
print('shape of embedded title:', all_data.embedded_title.shape)  # (44898,)
print('dimension of single embedded title:', len(all_data.embedded_title[0]))  # 18
print('dimension of single embedded title:', len(all_data.embedded_title[1]))  # 14
print('dimension of single embedded title:', len(all_data.embedded_title[0][0]))  # 50
print('example of a title embedding:\n', all_data.embedded_title[0])
#print('dimension of single embedded text:', len(all_data.embedded_text[0][0]))  # 50
'''

In [9]:
'''# Look at file for GloVe embedding of titles
isot_title_embed = all_data.embedded_title.values

print('isot_title_embed shape:', isot_title_embed.shape)  #(44898,)
print('isot_title_embed shape of first element:', isot_title_embed[0].shape)  # (18, 50)
print('isot_title_embed example:\n', isot_title_embed[0])
'''

In [10]:
'''
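# Stack the per-title (n_tokens, 50) arrays into one (total_tokens, 50) array;
# np.reshape cannot do this on an object array of variable-length items.
test = np.vstack(isot_title_embed)
print(test.shape)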
'''


In [11]:
# Read ISOT vocab from pickle file.

vocab = pd.read_pickle('parsed_data/vocab.pkl')  # ISOT vocabulary built from "title" and "text" tokens

In [12]:
print("{:,} words".format(vocab.size))  # Note: this combines words from ISOT "title" AND "text" fields!
print("wordset: ",vocab.ordered_words()[:30])
print(vocab)

152,182 words
wordset:  ['<s>', '</s>', '<unk>', 'the', ',', '.', 'to', 'of', 'a', 'and', 'in', 'that', 'on', '<number>', 'for', 's', 'is', 'he', 'said', 'trump', 'it', 'with', 'was', 'as', 'his', 'by', 'has', 'be', 'have', 'not']
<w266_common.vocabulary.Vocabulary object at 0x7fb0ba2812e8>
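
The vocabulary maps each word to an integer id, with the three special tokens first. The sketch below shows the idea with a toy class. `ToyVocabulary` is a hypothetical stand-in, not the `w266_common.vocabulary.Vocabulary` implementation; in particular, mapping unseen words to `<unk>` is an assumption about how `words_to_ids` behaves.

```python
class ToyVocabulary:
    """Hypothetical stand-in for the course vocabulary class."""
    SPECIALS = ["<s>", "</s>", "<unk>"]

    def __init__(self, words):
        # Special tokens take ids 0..2; other words follow in first-seen order.
        self.word_to_id = {w: i for i, w in enumerate(self.SPECIALS)}
        for w in words:
            self.word_to_id.setdefault(w, len(self.word_to_id))
        self.unk_id = self.word_to_id["<unk>"]

    @property
    def size(self):
        return len(self.word_to_id)

    def words_to_ids(self, words):
        # Words missing from the vocabulary fall back to the <unk> id.
        return [self.word_to_id.get(w, self.unk_id) for w in words]

toy_vocab = ToyVocabulary(["the", ",", ".", "to"])
print(toy_vocab.size, toy_vocab.words_to_ids(["to", "the", "zebra"]))
```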


In [13]:
print('ISOT ALL target=real:', len(all_data.target[all_data.target == '1']))
print('ISOT ALL target=fake:', len(all_data.target[all_data.target == '0']))

ISOT ALL target=real: 21417
ISOT ALL target=fake: 23481
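
These counts give a reference point for the accuracies reported later: a classifier that always predicts the majority class already gets a bit over half of the full dataset right. A small sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts copied from the output above.
counts = {"real": 21417, "fake": 23481}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_acc = counts[majority_label] / total

print(total, majority_label, round(baseline_acc, 4))
```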


### Load ISOT LIWC features from pickle file

In [14]:
liwc_isot = pd.read_pickle('parsed_data/liwc_isot2.pkl')

In [15]:
liwc_isot.head()

Unnamed: 0,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,...,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,Unnamed: 74
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
print(liwc_isot.values)
print(liwc_isot.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(152182, 74)
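
The LIWC matrix has 152,182 rows, the same as the vocabulary size, which suggests one row of 74 category indicators per vocabulary id. Under that assumption, features for a whole title can be built by looking up the row for each token id and summing. The sketch below uses a toy matrix; `title_liwc_features` is a hypothetical helper, and summing (rather than averaging) is an illustrative choice.

```python
import numpy as np

def title_liwc_features(token_ids, liwc_matrix):
    """Sum per-word LIWC indicator rows over the tokens of one title."""
    token_ids = np.asarray(token_ids, dtype=np.int64)
    if token_ids.size == 0:
        return np.zeros(liwc_matrix.shape[1], dtype=np.float32)
    # Fancy indexing repeats a row when a word occurs more than once.
    return liwc_matrix[token_ids].sum(axis=0).astype(np.float32)

# Toy matrix: 6 vocabulary ids, 4 LIWC categories.
toy_liwc = np.zeros((6, 4), dtype=np.float32)
toy_liwc[3, 0] = 1.0
toy_liwc[3, 2] = 1.0
toy_liwc[5, 2] = 1.0

feats = title_liwc_features([3, 5, 3], toy_liwc)
print(feats)
```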


In [17]:
print(liwc_isot.values[:10,:])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [18]:
#liwc = tf.to_float(liwc_isot.values)
liwc = liwc_isot.astype('float32')
print(liwc.values)


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [19]:
print(np.array(liwc))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Train / Dev / Test Split ISOT data

In [20]:
# train/dev/test split
#train_dev_split = 0.8

train_fract = 0.70
dev_fract = 0.15
test_fract = 0.15

# Compare with a tolerance: exact float equality can fail for other fraction choices.
if np.isclose(train_fract + dev_fract + test_fract, 1.0):
    print('Split fractions add up to 1.0')
else:
    print('SPLIT FRACTIONS DO NOT ADD UP TO 1.0; PLEASE TRY AGAIN.............')

#train_data = all_data[:int(len(all_data)*train_dev_split)].reset_index(drop=True)
#dev_data = all_data[int(len(all_data)*train_dev_split):].reset_index(drop=True)

train_set = all_data[ :int(len(all_data)*train_fract)].reset_index(drop=True)
dev_set = all_data[int(len(all_data)*(train_fract)) : int(len(all_data)*(train_fract+dev_fract))].reset_index(drop=True)
test_set = all_data[int(len(all_data)*(train_fract+dev_fract)) : ].reset_index(drop=True)

print('training set: ',train_set.shape)
print('dev set: ',dev_set.shape)
print('test set: ',test_set.shape)

Split fractions add up to 1.0
training set:  (31428, 9)
dev set:  (6735, 9)
test set:  (6735, 9)
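
The split above slices the frame in its stored order, so it relies on the pickled rows already being shuffled. The sketch below reproduces the same boundary arithmetic on plain indices and adds an optional seeded shuffle. `split_indices` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def split_indices(n, train_fract=0.70, dev_fract=0.15, seed=None):
    """Return (train, dev, test) index lists using the notebook's boundaries."""
    idx = list(range(n))
    if seed is not None:
        # Optional: shuffle reproducibly before slicing.
        random.Random(seed).shuffle(idx)
    train_end = int(n * train_fract)
    dev_end = int(n * (train_fract + dev_fract))
    return idx[:train_end], idx[train_end:dev_end], idx[dev_end:]

train_idx, dev_idx, test_idx = split_indices(44898)
print(len(train_idx), len(dev_idx), len(test_idx))
```

The resulting index lists could be passed to `DataFrame.iloc` to select rows.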


In [21]:
train_set.head()

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS
0,BRAINIAC Gets Rejected After Trying To Buy BMW...,Does anyone else out there see a future BMW ca...,Government News,"Mar 20, 2016",0,"[brainiac<allcaps>, gets, rejected, after, try...","[N, V, V, P, V, P, V, ^, P, ^, ^, ,, O, V, A, ...","[does, anyone, else, out, there, see, a, futur...","[V, N, R, P, R, V, D, A, ^, N, N, P, D, N, ,, ..."
1,Windows 10 is Stealing Your Bandwidth (You Mig...,21st Century Wire says We ve heard a lot of no...,US_News,"April 7, 2016",0,"[windows, <number>, is, stealing, your, bandwi...","[^, $, V, V, D, N, ,, O, V, V, P, V, O, ,]","[<number>st, century, wire, says, we, ve, hear...","[A, N, ^, V, O, V, V, D, N, P, R, R, A, N, P, ..."
2,STUNNING STORY The Media And Democrats Hid Fro...,"In an email sent on April 15, 2011, our upstan...",left-news,"Mar 2, 2017",0,"[stunning<allcaps>, story<allcaps>, the, media...","[A, N, D, N, &, N, V, P, ^, ,, R, Z, ^, ^, N, ...","[in, an, email, sent, on, april, <number>, ,, ...","[P, D, N, V, P, ^, $, ,, $, ,, D, A, N, A, ^, ..."
3,North Korea's Kim Jong Un fetes nuclear scient...,SEOUL (Reuters) - North Korean leader Kim Jong...,worldnews,"September 10, 2017",1,"[north, korea's, kim, jong, un, fetes, nuclear...","[^, Z, ^, ^, ^, V, A, N, ,, V, N, N]","[seoul<allcaps>, (, reuters, ), -, north, kore...","[^, ,, ^, ,, ,, ^, ^, N, ^, ^, ^, V, D, A, N, ..."
4,White House developing comprehensive biosecuri...,"ASPEN, Colorado (Reuters) - The Trump administ...",politicsNews,"July 20, 2017",1,"[white, house, developing, comprehensive, bios...","[A, N, V, A, N, N, ,, A]","[aspen<allcaps>, ,, colorado, (, reuters, ), -...","[^, ,, ^, ,, ^, ,, ,, D, ^, N, V, V, D, A, A, ..."


In [22]:
dev_set.head()

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS
0,Turkey condemns U.S. move on Jerusalem as 'irr...,ISTANBUL (Reuters) - Turkey s foreign ministry...,worldnews,"December 6, 2017",1,"[turkey, condemns, u.s., move, on, jerusalem, ...","[N, V, ^, N, P, ^, P, ,, A, ,]","[istanbul<allcaps>, (, reuters, ), -, turkey, ...","[^, ,, ^, ,, ,, N, G, A, N, P, ^, V, D, N, P, ..."
1,UK finance minister's future questioned by PM ...,LONDON (Reuters) - Britain s finance minister ...,worldnews,"October 14, 2017",1,"[uk<allcaps>, finance, minister's, future, que...","[^, N, S, N, V, P, ^, Z, N, P, N, V]","[london<allcaps>, (, reuters, ), -, britain, s...","[^, ,, ^, ,, ,, ^, G, N, N, ^, ^, P, ^, V, V, ..."
2,Canada government facing resistance from Senat...,OTTAWA (Reuters) - The Canadian government s p...,worldnews,"November 3, 2017",1,"[canada, government, facing, resistance, from,...","[^, N, V, N, P, ^, P, N, N]","[ottawa<allcaps>, (, reuters, ), -, the, canad...","[^, ,, ^, ,, ,, D, ^, ^, G, V, P, V, A, N, P, ..."
3,Tillerson says would support maintaining Russi...,WASHINGTON (Reuters) - President-elect Donald ...,politicsNews,"January 11, 2017",1,"[tillerson, says, would, support, maintaining,...","[^, V, V, V, V, ^, N, P, R]","[washington<allcaps>, (, reuters, ), -, presid...","[^, ,, ^, ,, ,, ^, ^, ^, N, P, ^, N, P, ^, ,, ..."
4,DEPLORABLE! HILLARY’S Campaign Is In PANIC Mod...,What happens when Hillary s poll numbers take ...,politics,"Sep 16, 2016",0,"[deplorable<allcaps>, !, hillary<allcaps>’s, c...","[A, ,, Z, N, V, P, N, N, ,, D, A, ,, A, N, ,, ...","[what, happens, when, hillary, s, poll, number...","[O, V, R, ^, G, N, N, V, D, N, P, O, V, V, V, ..."


In [23]:
# print out ISOT dev set
#dev_set.to_csv('isot_dev_set.csv', sep=',')

In [24]:
test_set.head()

Unnamed: 0,title,text,subject,date,target,title_tokcan,title_POS,text_tokcan,text_POS
0,NEW LAW WILL PUNISH MUSLIM Migrants…Assimilate...,Is this common sense law even practical given ...,left-news,"Apr 23, 2016",0,"[new<allcaps>, law<allcaps>, will<allcaps>, pu...","[A, N, V, V, A, N, ,, V, &, V, T, ,]","[is, this, common, sense, law, even, practical...","[V, D, N, N, N, R, A, V, R, A, N, N, P, N, P, ..."
1,STUNNING: Hillary’s Own Numbers Show Her Tax H...,"Wow! If I were Hillary, I d stop sending peopl...",left-news,"Oct 9, 2016",0,"[stunning<allcaps>, :, hillary’s, own, numbers...","[A, ,, Z, A, N, V, D, N, N, N, V, V, A, N, A, ...","[wow, !, if, i, were, hillary, ,, i, d, stop, ...","[!, ,, P, O, V, ^, ,, O, V, V, V, N, P, D, N, ..."
2,How Is Panama’s “Migrant Crisis” Giving A FREE...,Let this sink in The word has been out for som...,Government News,"Aug 10, 2016",0,"[how, is, panama’s, “, migrant, crisis, ”, giv...","[R, V, Z, ,, A, N, ,, V, D, A, N, P, D, ^, P, ...","[let, this, sink, in, the, word, has, been, ou...","[V, O, V, P, D, N, V, V, T, P, D, N, R, P, D, ..."
3,THE LIST OF OBAMA’S HISTORIC FIRSTS AKA HOW CH...,Wow! What a list of accomplishments! The probl...,left-news,"Apr 14, 2015",0,"[the<allcaps>, list<allcaps>, of<allcaps>, oba...","[D, N, P, Z, A, N, G, R, ^, N, V, ^, R, A]","[wow, !, what, a, list, of, accomplishments, !...","[!, ,, O, D, N, P, N, ,, D, N, V, P, D, N, V, ..."
4,Top tech executives to attend Trump summit on ...,NEW YORK (Reuters) - Top executives from Alpha...,politicsNews,"December 11, 2016",1,"[top, tech, executives, to, attend, trump, sum...","[A, N, N, P, V, ^, N, P, ^, ,, ^]","[new<allcaps>, york<allcaps>, (, reuters, ), -...","[A, ^, ,, ^, ,, ,, A, N, P, ^, ^, ,, ^, ^, &, ..."


### Select ISOT features and labels for training model 

In [25]:
train_data, train_labels = train_set.title_tokcan.values, train_set.target.values
dev_data, dev_labels = dev_set.title_tokcan.values, dev_set.target.values
test_data, test_labels = test_set.title_tokcan.values, test_set.target.values

train_labels = train_labels.astype(int)
dev_labels = dev_labels.astype(int)
test_labels = test_labels.astype(int)

#train_data.head()
print('train_data shape:', train_data.shape)
#print(train_data[0].shape)
print(train_data[:1])
print('train_labels shape:', train_labels.shape)
print(train_labels)
print()
print('dev_data shape:', dev_data.shape)
print(dev_data[:1])
print('dev_labels shape:', dev_labels.shape)
print(dev_labels)
print()
print('test_data shape:', test_data.shape)
print(test_data[:1])
print('test_labels shape:', test_labels.shape)
print(test_labels)


train_data shape: (31428,)
[list(['brainiac<allcaps>', 'gets', 'rejected', 'after', 'trying', 'to', 'buy', 'bmw<allcaps>', 'with', 'ebt<allcaps>', 'card', '…', 'what', 'happens', 'next', 'is', 'hysterical<allcaps>', '!'])]
train_labels shape: (31428,)
[0 0 0 ... 0 1 0]

dev_data shape: (6735,)
[list(['turkey', 'condemns', 'u.s.', 'move', 'on', 'jerusalem', 'as', "'", 'irresponsible', "'"])]
dev_labels shape: (6735,)
[1 1 1 ... 0 0 0]

test_data shape: (6735,)
[list(['new<allcaps>', 'law<allcaps>', 'will<allcaps>', 'punish<allcaps>', 'muslim<allcaps>', 'migrants', '…', 'assimilate', 'or', 'get', 'out', '!'])]
test_labels shape: (6735,)
[0 0 0 ... 0 0 1]


In [26]:
# characterize length of documents in train_data

lengths = [len(tokens) for tokens in train_data]

a = np.array(lengths)
p = np.percentile(a, 95) # return 95th percentile
print('95th percentile:', p)

95th percentile: 25.0
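
A length percentile like this is a common way to pick the padding length: sequences longer than the cutoff are truncated, shorter ones are padded. The sketch below shows that choice on toy id lists. `choose_max_len` and `pad_ids` are illustrative names, and whether the models below actually use the 95th percentile as `max_len` is an assumption.

```python
import numpy as np

def choose_max_len(lengths, pct=95):
    """Smallest integer length covering `pct` percent of examples."""
    return int(np.ceil(np.percentile(lengths, pct)))

def pad_ids(id_lists, max_len, pad_id=0):
    """Truncate or pad each id list to max_len; also return true lengths."""
    arr = np.full((len(id_lists), max_len), pad_id, dtype=np.int32)
    ns = np.zeros(len(id_lists), dtype=np.int32)
    for i, ids in enumerate(id_lists):
        n = min(len(ids), max_len)
        arr[i, :n] = ids[:n]
        ns[i] = n
    return arr, ns

toy_ids = [[5, 6, 7], [8], [9, 10, 11, 12, 13]]
toy_lengths = [len(ids) for ids in toy_ids]
padded, ns = pad_ids(toy_ids, max_len=4)
frac_truncated = float(np.mean(np.array(toy_lengths) > 4))
print(choose_max_len(toy_lengths), padded.shape, ns, frac_truncated)
```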


In [27]:
# Bokeh for plotting.
import bokeh.plotting as bp
from bokeh.models import HoverTool
bp.output_notebook()

# Helper code for plotting histograms
def plot_length_histogram(lengths, x_range=(0, 100), bins=40, normed=True):
    # np.histogram's `normed` argument is deprecated; `density` is the supported equivalent.
    hist, bin_edges = np.histogram(a=lengths, bins=bins, density=normed, range=x_range)
    bin_centers = (bin_edges[1:] + bin_edges[:-1])/2
    bin_widths =  (bin_edges[1:] - bin_edges[:-1])

    hover = HoverTool(tooltips=[("bucket", "@x"), ("count", "@top")], mode="vline")
    fig = bp.figure(plot_width=800, plot_height=400, tools=[hover])
    fig.vbar(x=bin_centers, width=bin_widths, top=hist, hover_fill_color="firebrick")
    fig.y_range.start = 0
    fig.x_range.start = 0
    fig.xaxis.axis_label = "Example length (number of tokens)"
    fig.yaxis.axis_label = "Frequency"
    bp.show(fig)

In [28]:
plot_length_histogram(lengths)

  


### Load LIAR data to evaluate various models below.  

In [29]:
# Do NOT read the LIAR data from this older pickle file:
#liar_data = pd.read_pickle('parsed_data/df_liardata2.pkl')  # data (CMU) tokenized and POS tags added
# Heads up on df_liardata2.pkl: it looks like "mostly-false" was used in place of
# "barely-true" during processing. I believe this means the "barely-true" items
# were omitted from that pickle file.

# Use this one when available:
liar_data = pd.read_pickle('parsed_data/df_liardata2binary.pkl')  # data (CMU) tokenized and POS tags added



liar_data.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12791 entries, 0 to 12790
Data columns (total 5 columns):
target           12791 non-null object
title            12791 non-null object
title_tokcan     12791 non-null object
title_POS        12791 non-null object
binary_target    12791 non-null int64
dtypes: int64(1), object(4)
memory usage: 9.5 MB


In [30]:
liar_data.head(10)

Unnamed: 0,target,title,title_tokcan,title_POS,binary_target
0,mostly-true,Says 31 percent of Texas physicians accept all...,"[says, <number>, percent, of, texas, physician...","[V, $, N, P, ^, N, V, D, A, ^, N, ,, R, P, $, ...",1
1,half-true,''Both Democrats and Republicans are advocatin...,"['', both, democrats, and, republicans, are, a...","[,, D, N, &, N, V, V, P, D, N, P, N, N, V, P, ...",-1
2,true,A Republican-led softening of firearms trainin...,"[a, republican-led, softening, of, firearms, t...","[D, A, N, P, N, N, N, V, D, A, N, V, V, V, P, ...",1
3,pants-fire,The first tweet was sent from Austin.,"[the, first, tweet, was, sent, from, austin, .]","[D, A, N, V, V, P, ^, ,]",0
4,half-true,Georgia has the countrys second highest number...,"[georgia, has, the, countrys, second, highest,...","[^, V, D, N, A, A, N, P, A, N, N, N, ,]",-1
5,half-true,"Because of Gov. Scott Walkers budgeting, a gre...","[because, of, gov, ., scott, walkers, budgetin...","[P, P, ^, ,, ^, ^, N, ,, D, A, N, P, A, N, N, ...",-1
6,mostly-true,Florida has reduced its carbon emissions by 20...,"[florida, has, reduced, its, carbon, emissions...","[^, V, V, L, N, N, P, $, N, P, $, ,]",1
7,half-true,Louisianas film incentives program is so big t...,"[louisianas, film, incentives, program, is, so...","[^, N, N, N, V, R, A, O, R, N, D, A, N, R, V, ...",-1
8,half-true,Under the Obama economy ... utility bills are ...,"[under, the, obama, economy, .<repeat>, utilit...","[P, D, ^, N, ,, N, N, V, A, ,]",-1
9,mostly-true,Mt. Hood Community College is No. 1 on average...,"[mt, ., hood, community, college, is, no, ., <...","[^, ,, N, N, N, V, !, ,, $, P, A, &, A, N, N, ...",1


In [31]:
binary_targets = liar_data.binary_target.unique()
print(binary_targets)

print('\nbinary_target,  number of examples')
for binary_target in binary_targets:
    print(binary_target, len(liar_data[liar_data.binary_target==binary_target]))

[ 1 -1  0]

binary_target,  number of examples
1 4507
-1 2627
0 5657
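
The rows shown above imply how the six LIAR labels were collapsed: `true` and `mostly-true` become 1, `barely-true`, `false`, and `pants-fire` become 0, and `half-true` is marked -1 so it can be dropped. The sketch below writes that mapping out. It is inferred from the displayed rows; the code that built `df_liardata2binary.pkl` is not shown in this notebook, and the names `LIAR_TO_BINARY` and `binarize` are hypothetical.

```python
# Mapping inferred from the (target, binary_target) pairs displayed above.
LIAR_TO_BINARY = {
    "true": 1,
    "mostly-true": 1,
    "half-true": -1,   # ambiguous; excluded from binary evaluation
    "barely-true": 0,
    "false": 0,
    "pants-fire": 0,
}

def binarize(labels):
    """Map six-way LIAR labels to 1 / 0, dropping half-true items."""
    mapped = [LIAR_TO_BINARY[label] for label in labels]
    return [b for b in mapped if b >= 0]

sample = ["mostly-true", "half-true", "true", "pants-fire", "barely-true"]
print(binarize(sample))
```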


In [32]:
liar_data_binary = liar_data[liar_data.binary_target >= 0]  # discard "half-true" rows (binary_target == -1)
liar_data_binary.head(10)

Unnamed: 0,target,title,title_tokcan,title_POS,binary_target
0,mostly-true,Says 31 percent of Texas physicians accept all...,"[says, <number>, percent, of, texas, physician...","[V, $, N, P, ^, N, V, D, A, ^, N, ,, R, P, $, ...",1
2,true,A Republican-led softening of firearms trainin...,"[a, republican-led, softening, of, firearms, t...","[D, A, N, P, N, N, N, V, D, A, N, V, V, V, P, ...",1
3,pants-fire,The first tweet was sent from Austin.,"[the, first, tweet, was, sent, from, austin, .]","[D, A, N, V, V, P, ^, ,]",0
6,mostly-true,Florida has reduced its carbon emissions by 20...,"[florida, has, reduced, its, carbon, emissions...","[^, V, V, L, N, N, P, $, N, P, $, ,]",1
9,mostly-true,Mt. Hood Community College is No. 1 on average...,"[mt, ., hood, community, college, is, no, ., <...","[^, ,, N, N, N, V, !, ,, $, P, A, &, A, N, N, ...",1
10,false,Latinos now make up the majority population in...,"[latinos, now, make, up, the, majority, popula...","[N, R, V, T, D, N, N, P, ^, ,]",0
11,barely-true,"They were going to build the wall a while ago,...","[they, were, going, to, build, the, wall, a, w...","[O, V, V, P, V, D, N, N, N, R, ,, R, R, A, R, ...",0
12,false,Says Texas has been waiting for two years for ...,"["", says, texas, has, been, waiting, for, two,...","[,, V, ^, V, V, V, P, $, N, P, D, A, N, P, V, ...",0
13,pants-fire,Barack Hussein Obama will ... force courts to ...,"[barack, hussein, obama, will, .<repeat>, forc...","[^, ^, ^, V, ,, V, N, P, V, ^, ^, N, P, A, N, ,]",0
14,mostly-true,Austin has more lobbyists working for it than ...,"[austin, has, more, lobbyists, working, for, i...","[^, V, A, N, V, P, O, P, D, A, N, P, ^, ,]",1


In [33]:
liar_data_binary = liar_data_binary.reset_index(drop=True)
liar_data_binary.head(10)

Unnamed: 0,target,title,title_tokcan,title_POS,binary_target
0,mostly-true,Says 31 percent of Texas physicians accept all...,"[says, <number>, percent, of, texas, physician...","[V, $, N, P, ^, N, V, D, A, ^, N, ,, R, P, $, ...",1
1,true,A Republican-led softening of firearms trainin...,"[a, republican-led, softening, of, firearms, t...","[D, A, N, P, N, N, N, V, D, A, N, V, V, V, P, ...",1
2,pants-fire,The first tweet was sent from Austin.,"[the, first, tweet, was, sent, from, austin, .]","[D, A, N, V, V, P, ^, ,]",0
3,mostly-true,Florida has reduced its carbon emissions by 20...,"[florida, has, reduced, its, carbon, emissions...","[^, V, V, L, N, N, P, $, N, P, $, ,]",1
4,mostly-true,Mt. Hood Community College is No. 1 on average...,"[mt, ., hood, community, college, is, no, ., <...","[^, ,, N, N, N, V, !, ,, $, P, A, &, A, N, N, ...",1
5,false,Latinos now make up the majority population in...,"[latinos, now, make, up, the, majority, popula...","[N, R, V, T, D, N, N, P, ^, ,]",0
6,barely-true,"They were going to build the wall a while ago,...","[they, were, going, to, build, the, wall, a, w...","[O, V, V, P, V, D, N, N, N, R, ,, R, R, A, R, ...",0
7,false,Says Texas has been waiting for two years for ...,"["", says, texas, has, been, waiting, for, two,...","[,, V, ^, V, V, V, P, $, N, P, D, A, N, P, V, ...",0
8,pants-fire,Barack Hussein Obama will ... force courts to ...,"[barack, hussein, obama, will, .<repeat>, forc...","[^, ^, ^, V, ,, V, N, P, V, ^, ^, N, P, A, N, ,]",0
9,mostly-true,Austin has more lobbyists working for it than ...,"[austin, has, more, lobbyists, working, for, i...","[^, V, A, N, V, P, O, P, D, A, N, P, ^, ,]",1


In [34]:
print('LIAR true:', len(liar_data[liar_data.binary_target == 1]))
print('LIAR false:', len(liar_data[liar_data.binary_target == 0]))

LIAR true: 4507
LIAR false: 5657


In [35]:
liar_title_tokans = liar_data_binary.title_tokcan.values
liar_labels = liar_data_binary.binary_target.values

print('liar titles:', liar_title_tokans)
print('liar labels:', liar_labels)

liar titles: [list(['says', '<number>', 'percent', 'of', 'texas', 'physicians', 'accept', 'all', 'new', 'medicaid', 'patients', ',', 'down', 'from', '<number>', 'percent', 'in', '<number>', '.'])
 list(['a', 'republican-led', 'softening', 'of', 'firearms', 'training', 'rules', 'means', 'that', 'untrained', 'individuals', 'would', 'be', 'allowed', 'to', 'carry', 'guns', 'with', 'a', 'state', 'permit', '.'])
 list(['the', 'first', 'tweet', 'was', 'sent', 'from', 'austin', '.']) ...
 list(['in', 'september', ',', 'the', 'department', 'of', 'business', 'and', 'consumer', 'services', 'enacted', 'a', 'sudden', '<number>%', 'workers', 'compensation', 'premium', 'assessment', 'by', 'rule', 'with', 'little', 'notice', 'to', 'the', 'public', '.'])
 list(['(', 'mary', ')', 'burke', 'was', 'a', 'senior', 'member', 'of', 'the', 'doyle', 'administration', 'that', 'left', 'wisconsin', 'with', '<number>', 'fewer', 'jobs', '.'])
 list(['as', 'a', 'senator', ',', 'hillary', 'clinton', 'actually', 'paid'

## Neural BOW Model 4: ISOT "title" data WITH GloVe embeddings AND with LIWC features

#### Include reference functions for viewing convenience

In [36]:
# May need this info (from utils.py)
'''
def build_vocab(corpus, V=10000, **kw):
    from . import vocabulary
    if isinstance(corpus, list):
        token_feed = (canonicalize_word(w) for w in corpus)
        vocab = vocabulary.Vocabulary(token_feed, size=V, **kw)
    else:
        token_feed = (canonicalize_word(w) for w in corpus.words())
        vocab = vocabulary.Vocabulary(token_feed, size=V, **kw)

    print("Vocabulary: {:,} types".format(vocab.size))
    return vocab

# Window and batch functions
def pad_np_array(example_ids, max_len=250, pad_id=0):
    """Pad a list of lists of ids into a rectangular NumPy array.

    Longer sequences will be truncated to max_len ids, while shorter ones will
    be padded with pad_id.

    Args:
        example_ids: list(list(int)), sequence of ids for each example
        max_len: maximum sequence length
        pad_id: id to pad shorter sequences with

    Returns: (x, ns)
        x: [num_examples, max_len] NumPy array of integer ids
        ns: [num_examples] NumPy array of sequence lengths (<= max_len)
    """
    arr = np.full([len(example_ids), max_len], pad_id, dtype=np.int32)
    ns = np.zeros([len(example_ids)], dtype=np.int32)
    for i, ids in enumerate(example_ids):
        cpy_len = min(len(ids), max_len)
        arr[i,:cpy_len] = ids[:cpy_len]
        ns[i] = cpy_len
    return arr, ns

def id_lists_to_sparse_bow(id_lists, vocab_size):
    """Convert a list-of-lists-of-ids to a sparse bag-of-words matrix.

    Args:
        id_lists: (list(list(int))) list of lists of word ids
        vocab_size: (int) vocab size; must be greater than the largest word id
            in id_lists.

    Returns:
        (scipy.sparse.csr_matrix) where each row is a sparse vector of word
        counts for the corresponding example.
    """
    from scipy import sparse
    ii = []  # row indices (example ids)
    jj = []  # column indices (token ids)
    for row_id, ids in enumerate(id_lists):
        ii.extend([row_id]*len(ids))
        jj.extend(ids)
    x = sparse.csr_matrix((np.ones_like(ii), (ii, jj)),
                          shape=[len(id_lists), vocab_size])
    return x
'''

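
For small inputs, the bag-of-words conversion in `id_lists_to_sparse_bow` can be checked against a dense equivalent. The sketch below builds the same count matrix with plain NumPy. `id_lists_to_dense_bow` is a hypothetical helper for illustration only; at the real vocabulary size a dense matrix would be far too large, which is why the reference code uses a sparse matrix.

```python
import numpy as np

def id_lists_to_dense_bow(id_lists, vocab_size):
    """Dense word-count matrix: one row per example, one column per word id."""
    x = np.zeros((len(id_lists), vocab_size), dtype=np.int64)
    for row_id, ids in enumerate(id_lists):
        ids = np.asarray(ids, dtype=np.int64)
        # np.add.at accumulates repeated ids, so each cell holds a count.
        np.add.at(x[row_id], ids, 1)
    return x

bow = id_lists_to_dense_bow([[3, 5, 3], [4], []], vocab_size=6)
print(bow)
```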

In [37]:
# These are methods from the "SSTDataset" class in sst.py from A2 (hence the `self` argument)
'''
def get_filtered_split(self, split='train', df_idxs=None, root_only=False):
    if not hasattr(self, split):
        raise ValueError("Invalid split name '%s'" % split)
    df = getattr(self, split)
    if df_idxs is not None:
        df = df.loc[df_idxs]
    #if root_only:          # Should not need in Final Project.
        #df = df[df.is_root]
    return df

def as_padded_array(self, split='train', max_len=40, pad_id=0,
                    root_only=False, df_idxs=None):
    """Return the dataset as a (padded) NumPy array.
    Longer sequences will be truncated to max_len ids, while shorter ones
    will be padded with pad_id.
    Args:
      split: 'train' or 'test'
      max_len: maximum sequence length
      pad_id: id to pad shorter sequences with
      root_only: if true, will only export root phrases
      df_idxs: (optional) custom list of indices to export
    Returns: (x, ns, y)
      x: [num_examples, max_len] NumPy array of integer ids
      ns: [num_examples] NumPy array of sequence lengths (<= max_len)
      y: [num_examples] NumPy array of target ids
    """
    df = self.get_filtered_split(split, df_idxs, root_only)
    x, ns = utils.pad_np_array(df.ids, max_len=max_len, pad_id=pad_id)
    return x, ns, np.array(df.label, dtype=np.int32)

def as_sparse_bow(self, split='train', root_only=False, df_idxs=None):
    from scipy import sparse
    df = self.get_filtered_split(split, df_idxs, root_only)
    x = utils.id_lists_to_sparse_bow(df['ids'], self.vocab.size)
    y = np.array(df.label, dtype=np.int32)
    return x, y
'''


#### Construct train, dev, test data arrays  

In [38]:
## Training data

all_train_ids=[]
for i, tokens in enumerate(train_data):  # here, tokens are the words in a single sentence
    sent_ids = vocab.words_to_ids(tokens)
    all_train_ids.append(sent_ids)
print(all_train_ids[:5])

max_len = 40   # Retain this setting, since it fits the ISOT "title" length distribution quite well.
train_x, train_ns = utils.pad_np_array(all_train_ids, max_len=max_len)
print()
print(train_x[:2])
print()
print(train_ns[:2])

train_y = train_labels
print(train_y[:2])

[[61446, 990, 1530, 58, 375, 6, 1413, 15748, 21, 15579, 2669, 546, 67, 1814, 244, 16, 13711, 114], [5462, 13, 16, 5198, 217, 27116, 32, 51, 358, 199, 6, 8238, 20, 33], [11865, 11780, 3, 132, 9, 214, 11625, 30, 184, 35, 115, 5688, 4405, 1667, 3300, 284, 21360, 900, 6, 5508, 628, 8277, 101, 727, 8878, 4, 183, 173, 137, 331], [164, 10696, 1423, 3056, 3626, 53446, 298, 3394, 4, 2526, 6052, 6849], [97, 79, 3249, 3501, 71854, 1249, 35, 242]]

[[61446   990  1530    58   375     6  1413 15748    21 15579  2669   546
     67  1814   244    16 13711   114     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [ 5462    13    16  5198   217 27116    32    51   358   199     6  8238
     20    33     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]]

[18 14]
[0 0]
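
The padding step used above can be sketched as follows. This is a minimal stand-in written for illustration, not the actual `utils.pad_np_array` from A2; the name `pad_ids` is an assumption. It truncates long id lists to `max_len`, right-pads short ones with `pad_id`, and returns the clipped lengths alongside the array.

```python
import numpy as np

def pad_ids(id_lists, max_len=40, pad_id=0):
    """Pad or truncate each id list to max_len; return (x, ns)."""
    x = np.full((len(id_lists), max_len), pad_id, dtype=np.int32)
    ns = np.zeros(len(id_lists), dtype=np.int32)
    for row, ids in enumerate(id_lists):
        n = min(len(ids), max_len)   # sequences longer than max_len are truncated
        x[row, :n] = ids[:n]
        ns[row] = n                  # length after truncation (<= max_len)
    return x, ns

# One short title and one title longer than max_len.
x, ns = pad_ids([[5462, 13, 16], list(range(1, 51))], max_len=40)
print(x.shape, ns)
```

The lengths in `ns` are what let the model ignore the padded positions later.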


In [39]:
## Dev data

all_dev_ids=[]
for i, tokens in enumerate(dev_data):  # here, tokens are the words in a single sentence
    sent_ids = vocab.words_to_ids(tokens)
    all_dev_ids.append(sent_ids)
print(all_dev_ids[:5])

max_len = 40   # Retain this setting, since it fits the ISOT "title" length distribution quite well.
dev_x, dev_ns = utils.pad_np_array(all_dev_ids, max_len=max_len)
print()
print(dev_x[:2])
print()
print(dev_ns[:2])

dev_y = dev_labels
print(dev_y[:2])

[[667, 7856, 48, 349, 12, 1217, 23, 324, 5344, 324], [1506, 1320, 35206, 550, 2038, 25, 1714, 15232, 718, 23, 500, 12153], [1471, 84, 1479, 2948, 30, 162, 76, 6403, 140], [862, 159, 45, 167, 4700, 161, 434, 14, 109], [25681, 114, 8122, 105, 16, 10, 48747, 6434, 546, 50, 671, 42, 10579, 53956, 44, 310, 3888, 20, 209, 414, 221]]

[[  667  7856    48   349    12  1217    23   324  5344   324     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [ 1506  1320 35206   550  2038    25  1714 15232   718    23   500 12153
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]]

[10 12]
[1 1]


In [40]:
## Test data

all_test_ids=[]
for i, tokens in enumerate(test_data):  # here, tokens are the words in a single sentence
    sent_ids = vocab.words_to_ids(tokens)
    all_test_ids.append(sent_ids)
print(all_test_ids[:5])

max_len = 40   # Retain this setting, since it fits the ISOT "title" length distribution quite well.
test_x, test_ns = utils.pad_np_array(all_test_ids, max_len=max_len)
print()
print(test_x[:2])
print()
print(test_ns[:2])

test_y = test_labels
print(test_y[:2])

[[1736, 11781, 6755, 54015, 5717, 1292, 546, 12633, 57, 141, 65, 114], [11865, 35, 5577, 231, 1110, 255, 62, 203, 6309, 2067, 49, 974, 148, 638, 1288, 149, 22317], [115, 16, 73985, 42, 3272, 530, 44, 966, 8, 8622, 37059, 1381, 734, 48, 6, 519, 30, 1621, 4, 1334, 4, 1531, 4, 4781, 546, 68, 128, 43, 449, 3, 4440, 9160, 1116, 78], [734, 11868, 1495, 5688, 32164, 62246, 23932, 5539, 7055, 22765, 62247, 322, 15433, 1464], [287, 3117, 2389, 6, 1836, 19, 1386, 12, 219, 35, 26587]]

[[ 1736 11781  6755 54015  5717  1292   546 12633    57   141    65   114
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [11865    35  5577   231  1110   255    62   203  6309  2067    49   974
    148   638  1288   149 22317     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]]

[12 17]
[0 0]


In [41]:
print("Examples:\n", train_x[:3])
print("Original sequence lengths: ", train_ns[:3])
print("Target labels: ", train_y[:3])
print("")
print("Padded:\n", " ".join(vocab.ids_to_words(train_x[0])))
print("Un-padded:\n", " ".join(vocab.ids_to_words(train_x[0,:train_ns[0]])))

Examples:
 [[61446   990  1530    58   375     6  1413 15748    21 15579  2669   546
     67  1814   244    16 13711   114     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [ 5462    13    16  5198   217 27116    32    51   358   199     6  8238
     20    33     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [11865 11780     3   132     9   214 11625    30   184    35   115  5688
   4405  1667  3300   284 21360   900     6  5508   628  8277   101   727
   8878     4   183   173   137   331     0     0     0     0     0     0
      0     0     0     0]]
Original sequence lengths:  [18 14 30]
Target labels:  [0 0 0]

Padded:
 brainiac<allcaps> gets rejected after trying to buy bmw<allcaps> with ebt<allcaps> card … what happens next is hysterical<allcaps> ! <s> <s> <s> <s> <s> <s> <s> <s> <s>

### Use the tf.estimator.Estimator API along with nbow_model_x.py 

#### Things to consider:  
- Start with 2 epochs (the original setting was 20).  
- Consider using dropout in the fully-connected layers.  
- Consider embed_dim = 300 rather than 50.  
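
To make the dropout and embed_dim considerations concrete, here is a rough NumPy sketch of a neural bag-of-words forward pass with one hidden layer. It is an illustration under assumptions, not the code in nbow_model_x.py: the function name, the choice to sum (rather than average) the embeddings, the ReLU activation, and the toy sizes are all hypothetical.

```python
import numpy as np

def nbow_forward(ids, ns, W_embed, W_h, b_h, W_out, b_out,
                 dropout_rate=0.0, rng=None):
    """Bag-of-words encoder + one hidden layer; returns [batch, num_classes] logits."""
    max_len = ids.shape[1]
    # mask[i, j] is 1.0 for real tokens (j < ns[i]) and 0.0 for padding.
    mask = (np.arange(max_len)[None, :] < ns[:, None]).astype(W_embed.dtype)
    emb = W_embed[ids]                              # [batch, max_len, embed_dim]
    bow = (emb * mask[:, :, None]).sum(axis=1)      # [batch, embed_dim]
    h = np.maximum(0.0, bow @ W_h + b_h)            # ReLU hidden layer
    if rng is not None and dropout_rate > 0.0:
        # Inverted dropout: zero units with probability dropout_rate, rescale the rest.
        keep = rng.random(h.shape) >= dropout_rate
        h = h * keep / (1.0 - dropout_rate)
    return h @ W_out + b_out

rng = np.random.default_rng(0)
V, embed_dim, hidden_dim, num_classes = 100, 50, 25, 2   # toy V for illustration
W_embed = rng.uniform(-1.0, 1.0, size=(V, embed_dim))
W_h = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_out = rng.normal(scale=0.1, size=(hidden_dim, num_classes))
b_out = np.zeros(num_classes)

ids = np.array([[5, 7, 9, 0, 0], [3, 4, 0, 0, 0]])
ns = np.array([3, 2])
logits = nbow_forward(ids, ns, W_embed, W_h, b_h, W_out, b_out)  # inference: no dropout
print(logits.shape)
```

Dropout is applied only when an `rng` is passed (training); at evaluation time the hidden activations are used unchanged.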


In [42]:
print('vocab size:', vocab.size)

vocab size: 152182
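
With a vocabulary this large, the embedding table is likely the dominant cost of raising embed_dim. A quick back-of-the-envelope count (the helper name is an assumption; only the embedding matrix is counted, not the other layers):

```python
def embedding_params(vocab_size, embed_dim):
    """Number of weights in a [vocab_size, embed_dim] embedding matrix."""
    return vocab_size * embed_dim

vocab_size = 152182  # value printed by the cell above
for embed_dim in (50, 300):
    print("embed_dim={:d}: {:,} embedding parameters".format(
        embed_dim, embedding_params(vocab_size, embed_dim)))
```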


In [43]:
## Setup model framework
## (Must specify correct nbow_model_x name in this cell to use the correct nbow_model_x.py file.)

import nbow_model_4; reload(nbow_model_4)

# Specify model hyperparameters as used by model_fn.  Use embed_dim2=74 for all LIWC
### ADD NEW PARAMETER: liwc_dim???

model_params = dict(V=vocab.size, embed_dim=50, hidden_dims=[25], num_classes=2,
                    encoder_type='bow',
                    lr=0.1, optimizer='adagrad', beta=0.1)  # can set optimizer to 'adagrad' or 'adam', which is slower here

checkpoint_dir = "/tmp/tf_nbow_" + datetime.datetime.now().strftime("%Y%m%d-%H%M")
if os.path.isdir(checkpoint_dir):
    shutil.rmtree(checkpoint_dir)
# Write vocabulary to file, so TensorBoard can label embeddings.
# creates checkpoint_dir/projector_config.pbtxt and checkpoint_dir/metadata.tsv
#ds.vocab.write_projector_config(checkpoint_dir, "Encoder/Embedding_Layer/W_embed")
vocab.write_projector_config(checkpoint_dir, "Encoder/Embedding_Layer/W_embed")

model = tf.estimator.Estimator(model_fn=nbow_model_4.classifier_model_fn, 
                               params=model_params,
                               model_dir=checkpoint_dir)
print("")
print("To view training (once it starts), run:\n")
print("    tensorboard --logdir='{:s}' --port 6006".format(checkpoint_dir))
print("\nThen in your browser, open: http://localhost:6006")

Vocabulary (152,182 words) written to '/tmp/tf_nbow_20181204-0539/metadata.tsv'
Projector config written to /tmp/tf_nbow_20181204-0539/projector_config.pbtxt
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tf_nbow_20181204-0539', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb0af5bb780>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To view training (once it starts), run:

    tensorboard --logdir='/tmp/tf_nbow_20181204-0539' --port 6006

Then in your browser, open: http://localhos

In [44]:
## Train model and Evaluate on Dev data

# Training params, just used in this cell for the input_fn-s
train_params = dict(batch_size=32, total_epochs=10, eval_every=1)  # 10 epochs (originally 20); evaluate on dev after every epoch (was every 2)
assert train_params['total_epochs'] % train_params['eval_every'] == 0

# Construct and train the model, saving checkpoints to the directory above.
# Input function for training set batches
# Do 'eval_every' epochs at once, followed by evaluating on the dev set.
# NOTE: use patched_numpy_io.numpy_input_fn instead of tf.estimator.inputs.numpy_input_fn
train_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=train_params['batch_size'], 
                    num_epochs=train_params['eval_every'], shuffle=True, seed=42
                 )

# Input function for dev set batches. As above, but:
# - Don't randomize order
# - Iterate exactly once (one epoch)
dev_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": dev_x, "ns": dev_ns}, y=dev_y,
                    batch_size=128, num_epochs=1, shuffle=False
                )

for _ in range(train_params['total_epochs'] // train_params['eval_every']):
    # Train for a few epochs, then evaluate on dev
    model.train(input_fn=train_input_fn)
    eval_metrics = model.evaluate(input_fn=dev_input_fn, name="dev")

INFO:tensorflow:Calling model_fn.

LIWC type: isot 

dropout rate: 0.5
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tf_nbow_20181204-0539/model.ckpt.
INFO:tensorflow:loss = 599.45416, step = 1
INFO:tensorflow:global_step/sec: 227.728
INFO:tensorflow:loss = 251.22923, step = 101 (0.441 sec)
INFO:tensorflow:global_step/sec: 320.02
INFO:tensorflow:loss = 187.57425, step = 201 (0.313 sec)
INFO:tensorflow:global_step/sec: 329.359
INFO:tensorflow:loss = 178.33722, step = 301 (0.303 sec)
INFO:tensorflow:global_step/sec: 319.43
INFO:tensorflow:loss = 152.62575, step = 401 (0.313 sec)
INFO:tensorflow:global_step/sec: 343.398
INFO:tensorflow:loss = 142.14127, step = 501 (0.291 sec)
INFO:tensorflow:global_step/sec: 355.428
INFO:tensorflow:loss = 141.3009, step = 601 (0.282 sec)
INFO:tensorfl

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_nbow_20181204-0539/model.ckpt-3932
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-04-05:41:38
INFO:tensorflow:Saving dict for global step 3932: accuracy = 0.88641423, cross_entropy_loss = 0.3064123, global_step = 3932, loss = 158.18713
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 3932: /tmp/tf_nbow_20181204-0539/model.ckpt-3932
INFO:tensorflow:Calling model_fn.

LIWC type: isot 

dropout rate: 0.5
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_nbow_20181204-0539/model.ckpt-3932
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 3932 into /tmp/tf_nbow_20181204-0539/model.ckpt.
INFO:tensorflow:loss = 33.015076,

INFO:tensorflow:global_step/sec: 324.575
INFO:tensorflow:loss = 18.461359, step = 7582 (0.308 sec)
INFO:tensorflow:global_step/sec: 328.849
INFO:tensorflow:loss = 23.147413, step = 7682 (0.304 sec)
INFO:tensorflow:global_step/sec: 302.884
INFO:tensorflow:loss = 21.23738, step = 7782 (0.330 sec)
INFO:tensorflow:Saving checkpoints for 7864 into /tmp/tf_nbow_20181204-0539/model.ckpt.
INFO:tensorflow:Loss for final step: 3.755464.
INFO:tensorflow:Calling model_fn.

LIWC type: isot 

dropout rate: 0.5
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-04-05:43:43
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_nbow_20181204-0539/model.ckpt-7864
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-04-05:43:44
INFO:tensorflow:Saving dict for global step 7864: accuracy = 0.8976986, cross_entropy_loss = 0.28798774, global_step = 7864, loss = 101.

In [45]:
## Evaluate model on (ISOT) Test data

test_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": test_x, "ns": test_ns}, y=test_y,
                    batch_size=128, num_epochs=1, shuffle=False
                )

eval_metrics = model.evaluate(input_fn=test_input_fn, name="test")

print("Accuracy on test set: {:.02%}".format(eval_metrics['accuracy']))
eval_metrics

INFO:tensorflow:Calling model_fn.

LIWC type: isot 

dropout rate: 0.5
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-04-05:44:47
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_nbow_20181204-0539/model.ckpt-9830
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-04-05:44:47
INFO:tensorflow:Saving dict for global step 9830: accuracy = 0.90898293, cross_entropy_loss = 0.27436695, global_step = 9830, loss = 88.609215
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 9830: /tmp/tf_nbow_20181204-0539/model.ckpt-9830
Accuracy on test set: 90.90%


{'accuracy': 0.90898293,
 'cross_entropy_loss': 0.27436695,
 'loss': 88.609215,
 'global_step': 9830}
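
The global_step values visible in the logs (3932, 7864, 9830) are all multiples of 983, which is consistent with 10 training epochs of 983 steps each. The sketch below recovers the steps per epoch and a range for the training-set size. It assumes every epoch runs the same number of steps and that the last batch of an epoch may be partial; the source does not confirm how patched_numpy_io handles the final batch.

```python
batch_size = 32
total_epochs = 10
final_step = 9830            # global_step reported at test time

steps_per_epoch, remainder = divmod(final_step, total_epochs)
assert remainder == 0        # steps divide evenly across epochs

# If the final batch of an epoch may hold between 1 and batch_size examples:
min_examples = (steps_per_epoch - 1) * batch_size + 1
max_examples = steps_per_epoch * batch_size

print("steps per epoch:", steps_per_epoch)
print("training examples: between {:,} and {:,}".format(min_examples, max_examples))
```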

In [46]:
## We can also evaluate the old-fashioned way, by calling model.predict(...) and working with the predicted labels directly:

from sklearn.metrics import accuracy_score
predictions = list(model.predict(test_input_fn))  # list of dicts
y_pred = [p['max'] for p in predictions]
acc = accuracy_score(test_y, y_pred)  # sklearn's signature is (y_true, y_pred)
print("Accuracy on test set: {:.02%}".format(acc))

INFO:tensorflow:Calling model_fn.

LIWC type: isot 

dropout rate: 0.5
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_nbow_20181204-0539/model.ckpt-9830
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Accuracy on test set: 90.90%
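
Working with the predicted labels directly also makes it easy to look beyond a single accuracy number. The helper below is a small NumPy sketch on toy labels (the function name and the toy data are assumptions, not values from this notebook); it returns accuracy together with a confusion matrix.

```python
import numpy as np

def accuracy_and_confusion(y_true, y_pred, num_classes=2):
    """Return (accuracy, confusion matrix) for integer class labels."""
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    acc = float(np.mean(y_true == y_pred))
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # rows: true label, columns: predicted label
    return acc, cm

# Toy labels for illustration only.
acc, cm = accuracy_and_confusion([0, 0, 1, 1], [0, 1, 1, 1])
print("accuracy: {:.02%}".format(acc))
print(cm)
```

In the notebook, `test_y` and `y_pred` from the cell above could be passed in place of the toy lists.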


##### Accuracy of 90.9% is lower than that obtained using Model_1.  

### Create padded LIAR data and apply prediction function to LIAR data.  



In [47]:
## LIAR data padding

all_liar_ids=[]
for i, tokens in enumerate(liar_title_tokans):  # here, tokens are the words in a single sentence
    sent_ids = vocab.words_to_ids(tokens)
    all_liar_ids.append(sent_ids)
print(all_liar_ids[:5])

max_len = 40   # Retain this setting, since it fits the ISOT "title" length distribution quite well.
liar_x, liar_ns = utils.pad_np_array(all_liar_ids, max_len=max_len)
print()
print(liar_x[:2])
print()
print(liar_ns[:2])

liar_y = liar_labels
print(liar_y[:2])

[[159, 13, 150, 7, 575, 10817, 1395, 66, 71, 1957, 3758, 4, 173, 30, 13, 150, 10, 13, 5], [8, 4931, 11790, 7, 3566, 1713, 599, 678, 11, 67186, 1117, 45, 27, 751, 6, 1574, 1294, 21, 8, 69, 4147, 5], [3, 117, 788, 22, 624, 30, 6938, 5], [645, 26, 3845, 73, 3218, 2700, 25, 13, 150, 147, 13, 5], [31885, 5, 8522, 464, 882, 16, 77, 5, 13, 12, 1679, 9, 287, 6247, 472, 9, 1543, 5]]

[[  159    13   150     7   575 10817  1395    66    71  1957  3758     4
    173    30    13   150    10    13     5     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [    8  4931 11790     7  3566  1713   599   678    11 67186  1117    45
     27   751     6  1574  1294    21     8    69  4147     5     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]]

[19 22]
[1 1]


### NOTE: STOP. The next step requires a manual edit to the liwc_features function in nbow_model_4.py to set the correct LIWC type (liar).


##############################
# PROCEED ONLY AFTER UPDATING nbow_model_4.py
##############################
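
One way to avoid the manual edit would be to carry the LIWC type in the params dict, so the feature lookup can read it at call time. The sketch below only illustrates the idea in plain Python: the `liwc_type` key, the `select_liwc_table` helper, and the toy tables are hypothetical, and the real liwc_features function in nbow_model_4.py is not shown in this notebook, so it may need a different interface.

```python
def select_liwc_table(params, liwc_tables):
    """Return the LIWC feature table named by params['liwc_type'] (default: 'isot')."""
    liwc_type = params.get("liwc_type", "isot")
    if liwc_type not in liwc_tables:
        raise ValueError("Unknown LIWC type '%s'" % liwc_type)
    return liwc_tables[liwc_type]

# Shared hyperparameters, plus one variant per dataset (hypothetical key).
base_params = dict(embed_dim=50, hidden_dims=[25], num_classes=2)
isot_params = dict(base_params, liwc_type="isot")
liar_params = dict(base_params, liwc_type="liar")

# Toy stand-ins for the real LIWC feature matrices.
liwc_tables = {"isot": [[0.1, 0.2]], "liar": [[0.3, 0.4]]}
print(select_liwc_table(liar_params, liwc_tables))
```

Since an Estimator passes its `params` to model_fn, a second Estimator built with the LIAR variant and the same `model_dir` should be able to restore the trained checkpoint; treat that as an untested suggestion for this project.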

In [48]:
## Evaluate model on LIAR data

# STOP: before running this cell, select the LIWC file to use inside the
# liwc_features function of nbow_model_4.py (here: the LIAR features).

##############################
# PROCEED ONLY AFTER UPDATING nbow_model_4.py
##############################


reload(nbow_model_4)   # pick up the edited liwc_features

test_input_fn_liar = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": liar_x, "ns": liar_ns}, y=liar_y,
                    batch_size=128, num_epochs=1, shuffle=False
                )

eval_metrics = model.evaluate(input_fn=test_input_fn_liar, name="test")

print("Accuracy on LIAR set: {:.02%}".format(eval_metrics['accuracy']))
eval_metrics

INFO:tensorflow:Calling model_fn.

LIWC type: liar 

dropout rate: 0.5
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-04-05:45:25
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_nbow_20181204-0539/model.ckpt-9830
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-04-05:45:26
INFO:tensorflow:Saving dict for global step 9830: accuracy = 0.5267611, cross_entropy_loss = 1.1509721, global_step = 9830, loss = 214.12784
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 9830: /tmp/tf_nbow_20181204-0539/model.ckpt-9830
Accuracy on LIAR set: 52.68%


{'accuracy': 0.5267611,
 'cross_entropy_loss': 1.1509721,
 'loss': 214.12784,
 'global_step': 9830}

#### Prediction accuracy of 52.7% for LIAR data is better than that obtained with Model_1.  
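
To put a binary accuracy like this in context, it helps to compare it against always predicting the most common label. The sketch below computes that majority-class baseline; the helper name and the toy labels are assumptions, and the actual label distribution of `liar_y` is not shown in this notebook.

```python
import collections

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = collections.Counter(int(y) for y in labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only; in the notebook, pass liar_y instead.
toy_labels = [1, 1, 0, 1, 0]
label, baseline_acc = majority_baseline(toy_labels)
print("majority label: {:d}, baseline accuracy: {:.02%}".format(label, baseline_acc))
```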