# Description

This notebook contains.
- description if the structure of the dataset
- implementation and training of Binary classification using Bi-LSTM described in the paper for condition S
- implementation and training of 6 way classification using Bi-LSTM described in the paper for condition S
- implementation and training of Binary classification using a deeper Bi-LSTM nerual network for condition S
- implementation and training of 6 way classification using a deeper Bi-LSTM neural network for condition S

## Model description
### Binary - model
- tokenize the text files of the statements
- embedding layer using glove embeddings
- one bidirectional lstm(32 units)
- one dense softmax layer with 2 output units

### 6 way - model_6
- tokenize the text files of the statements
- embedding layer using glove embeddings
- one bidirectional lstm(32 units)
- one dense softmax layer with 6 output units

### Deeper Binary - model_d
- tokenize the text files of the statements
- embedding layer using glove embeddings
- four bidirectional lstm(two 64 untis and two 32 units)
- three dense layers ending with a softmax activation with 2 output units

### Deeper 6 way - model_d_6
- tokenize the text files of the statements
- embedding layer using glove embeddings
- four bidirectional lstm(four with 64 units)
- two dense layers ending with a softmax activation with 6 output units

## Result
### Binary - model
- Val accuracy = 61.68%
- Test accuracy = 57.93%

### 6 way - model_6
- Val accuracy = 25.16%
- Test accuracy = 26.75%

### Deeper Binary - model_d
- Val accuracy = 62.38%
- Test accuracy = 58.405%

### Deeper 6 way - model_d_6
- Val accuracy = 23.44%
- Test accuracy = 24.78%

## Weights file
### Binary - model
- model_weights_1.h5

### 6 way - model_6
- model_6_weights_1.h5

### Deeper Binary - model_d
- model_d_weights_2.h5

### Deeper 6 way - model_d_6
- model_d_6_weights_1.h5


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import os
import re
import pandas as pd

In [2]:
dataset_dir = "dataset"

train_data_file = os.path.join(dataset_dir, "train2.tsv")
test_data_file = os.path.join(dataset_dir, "test2.tsv")
val_data_file = os.path.join(dataset_dir, "val2.tsv")

In [3]:
# column names are taken from the readme.md of the LIAR-PLUS github repo -
# link to repo - https://github.com/Tariq60/LIAR-PLUS

col_names = ["id", "label", "statement", "subject", "speaker", "speaker_job", "state_info", "party", \
             "barely_true", "false", "half_true", "mostly_true", "pants_on_fire", "context", "ex_just"]

In [4]:
train_data = pd.read_csv(train_data_file, sep = '\t', header = None, names = col_names,)# na_values = ["NaN"], na_filter = True)
test_data = pd.read_csv(test_data_file, sep = '\t', header = None, names = col_names)
val_data = pd.read_csv(val_data_file, sep = '\t', header = None, names = col_names)

In [5]:
train_data.head()

Unnamed: 0,id,label,statement,subject,speaker,speaker_job,state_info,party,barely_true,false,half_true,mostly_true,pants_on_fire,context,ex_just
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer,That's a premise that he fails to back up. Ann...
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.,"Surovell said the decline of coal ""started whe..."
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver,Obama said he would have voted against the ame...
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release,The release may have a point that Mikulskis co...
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN,"Crist said that the economic ""turnaround start..."


In [6]:
test_data.head()

Unnamed: 0,id,label,statement,subject,speaker,speaker_job,state_info,party,barely_true,false,half_true,mostly_true,pants_on_fire,context,ex_just
0,11972.json,true,Building a wall on the U.S.-Mexico border will...,immigration,rick-perry,Governor,Texas,republican,30,30,42,23,18,Radio interview,"Meantime, engineering experts agree the wall w..."
1,11685.json,false,Wisconsin is on pace to double the number of l...,jobs,katrina-shankland,State representative,Wisconsin,democrat,2,1,0,0,0,a news conference,She cited layoff notices received by the state...
2,11096.json,false,Says John McCain has done nothing to help the ...,"military,veterans,voting-record",donald-trump,President-Elect,New York,republican,63,114,51,37,61,comments on ABC's This Week.,"Trump said that McCain ""has done nothing to he..."
3,5209.json,half-true,Suzanne Bonamici supports a plan that will cut...,"medicare,message-machine-2012,campaign-adverti...",rob-cornilles,consultant,Oregon,republican,1,1,3,1,1,a radio show,"But spending still goes up. In addition, many ..."
4,9524.json,pants-fire,When asked by a reporter whether hes at the ce...,"campaign-finance,legal-issues,campaign-adverti...",state-democratic-party-wisconsin,,Wisconsin,democrat,5,7,2,2,7,a web video,Our rating A Democratic Party web video making...


In [7]:
val_data.head()

Unnamed: 0,id,label,statement,subject,speaker,speaker_job,state_info,party,barely_true,false,half_true,mostly_true,pants_on_fire,context,ex_just
0,12134.json,barely-true,We have less Americans working now than in the...,"economy,jobs",vicky-hartzler,U.S. Representative,Missouri,republican,1,0,1,0,0,an interview with ABC17 News,"However, Hartzler was talking about the entire..."
1,238.json,pants-fire,"When Obama was sworn into office, he DID NOT u...","obama-birth-certificate,religion",chain-email,,,none,11,43,8,5,105,,Ellison used a Koran that once belonged to Tho...
2,7891.json,false,Says Having organizations parading as being so...,"campaign-finance,congress,taxes",earl-blumenauer,U.S. representative,Oregon,democrat,0,1,1,1,0,a U.S. Ways and Means hearing,"However, we have two professors who say the la..."
3,8169.json,half-true,Says nearly half of Oregons children are poor.,poverty,jim-francesconi,Member of the State Board of Higher Education,Oregon,none,0,1,1,1,0,an opinion article,"In fact, if you use federal definitions for po..."
4,929.json,half-true,On attacks by Republicans that various program...,"economy,stimulus",barack-obama,President,Illinois,democrat,70,71,160,163,9,interview with CBS News,Obama's point is that some perspective is in o...


In [8]:
print(train_data.info())
# print(train_data.describe())
print(train_data.shape)
print(test_data.info())
print(test_data.shape)
print(val_data.info())
print(val_data.shape)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10240 entries, 0 to 10239
Data columns (total 15 columns):
id               10240 non-null object
label            10240 non-null object
statement        10240 non-null object
subject          10238 non-null object
speaker          10238 non-null object
speaker_job      7343 non-null object
state_info       8032 non-null object
party            10238 non-null object
barely_true      10238 non-null float64
false            10238 non-null float64
half_true        10238 non-null float64
mostly_true      10238 non-null float64
pants_on_fire    10238 non-null float64
context          10138 non-null object
ex_just          10156 non-null object
dtypes: float64(5), object(10)
memory usage: 1.2+ MB
None
(10240, 15)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1267 entries, 0 to 1266
Data columns (total 15 columns):
id               1267 non-null object
label            1267 non-null object
statement        1267 non-null object
subject     

In [9]:
train_data["statement"].head()

0    Says the Annies List political group supports ...
1    When did the decline of coal start? It started...
2    Hillary Clinton agrees with John McCain "by vo...
3    Health care reform legislation is likely to ma...
4    The economic turnaround started at the end of ...
Name: statement, dtype: object

In [10]:
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, Embedding, Input, Bidirectional
from keras.initializers import Constant
from keras.utils import to_categorical

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


In [11]:
# using glove embeddings, as mentioned in the paper. Reference taken from keras documentation on using pretrained word embeddings
# link to reference - https://keras.io/examples/pretrained_word_embeddings/

glove_file = os.path.join("glove", "glove.6B.100d.txt")
max_no_of_words = 20000
embeddings_dim = 100
max_len_seq = 1000

In [12]:
embeddings_index = {}
with open(glove_file) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
        
# print(embeddings_index)

In [13]:
# tokenizing the statements

tokenizer = Tokenizer(num_words=max_no_of_words)

tokenizer.fit_on_texts(list(train_data["statement"]))

train_sequences = tokenizer.texts_to_sequences(list(train_data["statement"]))
val_sequences = tokenizer.texts_to_sequences(list(val_data["statement"]))
test_sequences = tokenizer.texts_to_sequences(list(test_data["statement"]))

train_seq = np.array(pad_sequences(train_sequences, maxlen = max_len_seq))
val_seq = np.array(pad_sequences(val_sequences, maxlen = max_len_seq))
test_seq = np.array(pad_sequences(test_sequences, maxlen = max_len_seq))

In [14]:
num_words = min(max_no_of_words, len(tokenizer.word_index)) + 1 # add 1 as 0 is not indexed by the tokenizer
embedding_matrix = np.zeros((num_words, embeddings_dim)) 
for word, i in tokenizer.word_index.items():
    if i > num_words-1:
        continue
    embedding_vector = embeddings_index.get(word) # to avoid KeyError exception
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros as .get will return None
        embedding_matrix[i] = embedding_vector    

In [15]:
tf_val = {"pants-fire":0, "false":0, "barely-true":0, "half-true":1, "mostly-true":1, "true":1}
train_tf = np.array(list(map(lambda l: tf_val[l], list(train_data["label"]))))
train_cat_tf = to_categorical(train_tf)

val_tf = np.array(list(map(lambda l: tf_val[l], list(val_data["label"]))))
val_cat_tf = to_categorical(val_tf)

test_tf = np.array(list(map(lambda l: tf_val[l], list(test_data["label"]))))
test_cat_tf = to_categorical(test_tf) 

In [16]:
sequence_inp  = Input(shape = (max_len_seq, ), dtype = 'int32')
x = Embedding(num_words, embeddings_dim, embeddings_initializer = Constant(embedding_matrix), 
                        input_length = max_len_seq,
                        trainable=False)(sequence_inp)
x = Bidirectional(LSTM(32))(x)
c = Dense(2, activation = "softmax")(x)

model = Model(sequence_inp, c)                  
                  
model.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

Instructions for updating:
Colocations handled automatically by placer.


In [17]:
model.fit(train_seq, train_cat_tf, batch_size = 32, epochs = 10, verbose = 1, validation_data = (val_seq, val_cat_tf))

Instructions for updating:
Use tf.cast instead.
Train on 10240 samples, validate on 1284 samples
Epoch 1/10
   32/10240 [..............................] - ETA: 16:02 - loss: 0.7573 - acc: 0.4062

KeyboardInterrupt: 

In [None]:
print("test accuracy = {}".format(model.evaluate(test_seq, test_cat_tf)[1]))

pred_prob = model.predict(test_seq)
print(np.argmax(pred_prob[:10], axis = 1))
print(test_data["label"].head(10))

In [18]:
six_val = {"pants-fire":0, "false":1, "barely-true":2, "half-true":3, "mostly-true":4, "true":5}
rev_six_val = dict(map(reversed, six_val.items()))

train_6 = np.array(list(map(lambda l: six_val[l], list(train_data["label"]))))
train_cat_6 = to_categorical(train_6)

val_6 = np.array(list(map(lambda l: six_val[l], list(val_data["label"]))))
val_cat_6 = to_categorical(val_6)

test_6 = np.array(list(map(lambda l: six_val[l], list(test_data["label"]))))
test_cat_6 = to_categorical(test_6)

In [19]:
# input is same to both the networks, using sequence_inp again for the embedding layer
x_6 = Embedding(num_words, embeddings_dim, embeddings_initializer = Constant(embedding_matrix), 
                        input_length = max_len_seq, trainable=False)(sequence_inp)
x_6 = Bidirectional(LSTM(32))(x_6)
c_6 = Dense(6, activation = "softmax")(x_6)

model_6 = Model(sequence_inp, c_6)                  
                  
model_6.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

In [None]:
model_6.fit(train_seq, train_cat_6, batch_size = 32, epochs = 10, verbose = 1, validation_data = (val_seq, val_cat_6))

In [None]:
print("test accuracy = {}".format(model_6.evaluate(test_seq, test_cat_6)[1]))

pred_prob_6 = model_6.predict(test_seq)
print(list(map(lambda r: rev_six_val[r], list(np.argmax(pred_prob_6[:10], axis = 1)))))
print(test_data["label"].head(10))

In [20]:
x_d = Embedding(num_words, embeddings_dim, embeddings_initializer = Constant(embedding_matrix),
                    input_length = max_len_seq, trainable = False)(sequence_inp)
x_d = Bidirectional(LSTM(64, return_sequences = True))(x_d)
x_d = Bidirectional(LSTM(64, return_sequences = True))(x_d)
x_d = Bidirectional(LSTM(32, return_sequences = True))(x_d) # remove for model_d_weights_1.h5
x_d = Bidirectional(LSTM(32))(x_d) # remove for model_d_weights_1.h5
x_d = Dense(16, activation = "tanh")(x_d)
x_d = Dense(4, activation = "tanh")(x_d)
c_d = Dense(2, activation = "softmax")(x_d)

model_d = Model(sequence_inp, c_d)

model_d.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

In [None]:
model_d.fit(train_seq, train_cat_tf, batch_size = 32, epochs = 7, verbose = 1, validation_data = (val_seq, val_cat_tf))

In [None]:
print("test accuracy = {}".format(model_d.evaluate(test_seq, test_cat_tf)[1]))

pred_prob_d = model_d.predict(test_seq)
print(pred_prob_d[:10])
print(np.argmax(pred_prob_d[:10], axis = 1))
print(test_data["label"].head(10))

In [22]:
model.save_weights("model_weights_1.h5")
model_6.save_weights("model_6_weights_1.h5")
model_d.save_weights("model_d_weights_2.h5")

In [23]:
x_d_6 = Embedding(num_words, embeddings_dim, embeddings_initializer = Constant(embedding_matrix),
                    input_length = max_len_seq, trainable = False)(sequence_inp)
x_d_6 = Bidirectional(LSTM(64, return_sequences = True))(x_d_6)
x_d_6 = Bidirectional(LSTM(64, return_sequences = True))(x_d_6)
x_d_6 = Bidirectional(LSTM(64, return_sequences = True))(x_d_6) # remove for model_d_weights_1.h5
x_d_6 = Bidirectional(LSTM(64))(x_d_6) # remove for model_d_weights_1.h5
x_d_6 = Dense(16, activation = "tanh")(x_d)
c_d_6 = Dense(6, activation = "softmax")(x_d)

model_d_6 = Model(sequence_inp, c_d_6)

model_d_6.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

In [None]:
model_d_6.fit(train_seq, train_cat_6, batch_size = 32, epochs = 7, verbose = 1, validation_data = (val_seq, val_cat_6))

In [None]:
print("test accuracy = {}".format(model_d_6.evaluate(test_seq, test_cat_6)[1]))

pred_prob_d_6 = model_d_6.predict(test_seq)
print(list(map(lambda r: rev_six_val[r], list(np.argmax(pred_prob_d_6[:10], axis = 1)))))
print(test_data["label"].head(10))

In [None]:
model_d_6.save_weights("model_d_6_weights_1.h5")