## First attempt at a CNN categorizer

I have borrowed heavily from CNN.ipynb in assignment 4 for the code shell here. I have not used the vocabulary objects, however, because I wanted to control how the vocab is structured.

Rather than training embeddings from scratch, I initialize the embedding layer with one of the pretrained vectorizers (GloVe); the `train_embeds` flag controls whether those vectors may be updated during training.



In [1]:
## Usual Imports
import numpy as np
# import pandas as pd
import matplotlib.pyplot as plt
import re
import json

import string


from scipy import spatial

from nltk.tokenize import sent_tokenize, word_tokenize
## Requires nltk.download('punkt') if not already in the environment.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import SGD, RMSprop

from sklearn.manifold import TSNE

# Helper libraries
from w266_common import utils, vocabulary

## SET HYPERPARAMETERS

    max_len: maximum number of tokens processed per sentence (longer sentences are truncated)
    embed_dim: number of dimensions in the embedding - used to set up the model, and must match the GloVe file that is loaded
    

In [2]:
%%time
# Specify model hyperparameters.

# maximum number of tokens to look at.
max_len = 100


epochs = 50
batch_size = 32

embed_dim = 50
num_filters = [2, 2, 2]
kernel_sizes = [2, 3, 4]
dense_layer_dims = [8,8,8]
dropout_rate = 0.5
num_classes = 2



train_embeds = True  # Whether we allow the embeddings to be changed

# Select model optimization approach
# opt = 'adam'
opt = SGD(learning_rate=0.1)  # `lr` is a deprecated alias for `learning_rate`
# opt = RMSprop(learning_rate=0.001)



CPU times: user 750 µs, sys: 74 µs, total: 824 µs
Wall time: 466 µs


## Read in the data set of sentences and store

In [3]:
%%time
# Load Data file into numpy array

with open("../data/3xNCS.json") as f:
    d = np.asarray(json.load(f))

# shuffle the array to randomize the data sets used
# essential as the file has all the true labels before
# the false ones
np.random.shuffle(d)

CPU times: user 17.4 ms, sys: 0 ns, total: 17.4 ms
Wall time: 17.2 ms


In [4]:
d

array([{'sentence_id': 28694, 'label': 0, 'text': 'I think that afterschool money ought to be available for faith-based programs and charitable programs that exist because somebody has heard the call to love a neighbor like you would like to be loved yourself.'},
       {'sentence_id': 23643, 'label': 0, 'text': 'I think it is a worthy goal in America to have every child protected by law and welcomed in life.'},
       {'sentence_id': 27990, 'label': 0, 'text': 'I know what it takes to get this country back, and will work with good Democrats and good Republicans to do that.'},
       ...,
       {'sentence_id': 16380, 'label': 0, 'text': 'Lloyd Bentsen and I are optimists, and so are the American people.'},
       {'sentence_id': 9520, 'label': 0, 'text': 'Opening up more trade, particularly in Latin America.'},
       {'sentence_id': 19915, 'label': 0, 'text': 'I want people in my administration and will have people in my administration who understand America.'}],
      dtype=object)

## Split into train, validation and test sets

In [5]:
%%time 

train_len = int(0.6 * len(d))
val_len = int(0.2 * len(d))

train, val, test = np.split(d, [train_len, train_len + val_len])

CPU times: user 24 µs, sys: 0 ns, total: 24 µs
Wall time: 25.5 µs
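The shuffle in cell 3 is unseeded, so each run produces a different train/validation/test partition. A minimal sketch of a seeded 60/20/20 split follows. The function name `shuffle_and_split`, the seed value, and the toy records are illustrative assumptions and not part of the notebook.

```python
import random

def shuffle_and_split(records, train_frac=0.6, val_frac=0.2, seed=42):
    """Return (train, val, test) lists from a seeded shuffle of records."""
    items = list(records)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(items)     # private RNG: same seed, same order
    train_end = int(train_frac * len(items))
    val_end = train_end + int(val_frac * len(items))
    return items[:train_end], items[train_end:val_end], items[val_end:]

# Toy stand-in for the loaded JSON: positive labels first, as in the source file.
toy = [{"sentence_id": i, "label": int(i < 5), "text": f"sentence {i}"}
       for i in range(20)]
train_part, val_part, test_part = shuffle_and_split(toy)
print(len(train_part), len(val_part), len(test_part))
```

With NumPy, the equivalent would be to seed the generator before `np.random.shuffle(d)`; the slicing logic is the same as the `np.split` call above.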


In [6]:
train

array([{'sentence_id': 28694, 'label': 0, 'text': 'I think that afterschool money ought to be available for faith-based programs and charitable programs that exist because somebody has heard the call to love a neighbor like you would like to be loved yourself.'},
       {'sentence_id': 23643, 'label': 0, 'text': 'I think it is a worthy goal in America to have every child protected by law and welcomed in life.'},
       {'sentence_id': 27990, 'label': 0, 'text': 'I know what it takes to get this country back, and will work with good Democrats and good Republicans to do that.'},
       ...,
       {'sentence_id': 23171, 'label': 0, 'text': "You're right, what does matter is a plan."},
       {'sentence_id': 19259, 'label': 1, 'text': 'I vetoed that bill because it cut more Medicare and and basically ran the risk of breaking up the system.'},
       {'sentence_id': 24604, 'label': 1, 'text': 'My request for this year was $4.9 billion for AIDS -- ten times as much per AIDS victim as per ca

In [7]:
for item in train:
    print( item['label'], item["text"][:50])

0 I think that afterschool money ought to be availab
0 I think it is a worthy goal in America to have eve
0 I know what it takes to get this country back, and
0 And the reason we've gotten it is because they see
0 We set up a simple structure of government that co
1 90 percent of our kids are out there in those publ
1 We paid more military aid, more aid into Laos po- 
1 Middle-class people, defined as people with income
0 That's how we make sure that everybody is willing 
0 And by the way, if the government can be as effici
0 Now, this is the most dangerous technology that we
0 Because that's one of the most critical issues her
0 Let's just talk about the CAFE standards -- that's
0 And I'm going to put in place a better homeland se
1 I fought for, and the Congress finally went along 
1 And a recent book that just came out by perhaps th
0 Ultimately, we're going to have to make some decis
0 First of all, the sequester is not something that 
0 All schools, all school districts, students 

0 I was called the sheriff, by the -- one of the sen
0 I will go very quickly.
1 Well, you know, once again, Donald is implying tha
0 Because you are the ones who have the hardest time
0 We know that the path we're taking is not working.
0 So any -- any language to the contrary is simply n
0 What service are you in?
0 I see mothers holding their babies up, so that the
1 They didn't have the weapons, they didn't have the
0 I'm not satisfied until every American enjoys his 
0 We have a lot of differences in our country, and s
0 This is a contest between a Democrat in the mainst
0 But it is also essential that he not allow a dolla
1 For example, we have reduced the size of the Feder
0 We've got a great economic package which I hope we
0 What I care about is children.
1 the air pollution standard laws that were passed i
0 The good news is (inaudible).
1 More progress in the past eight years than in the 
0 Let me assure you that we made every possible over
0 I want to be the commander in ch

1 Four, in the votes at the U.N., particularly the v
1 The President told the terrorists he was going to 
1 We have dramatically increased control and enforce
0 I think that abortion is a very difficult issue an
1 I indicated under oath that the initiative that I 
1 In fact, the Trump Hotel right here in Las Vegas w
0 I am a son of a Methodist minister.
1 In Lebanon, this President exercised American powe
1 This -- we have weakened our economy.
0 And when people vote, I think you're going to see 
0 I do, and I'm standing up to the issue with an ans
0 We just have a big difference on where this countr
0 Now, like uh - any hardworking uh - person seeking
0 And the way to do better is not to tax and spend b
1 We're giving them lots of money, lots of everythin
0 But if you take it in percentages, then you find o
0 But there are limits to what we can spend.
0 I'd go crazy sitting up there slow dancing that on
1 Not only has an innocent person been executed, but
0 They ought to be engaged in

1 As a matter of fact, the year before I went out of
0 So when you look at the penalty that he paid, and 
1 We've shipped millions of jobs overseas and we hav
0 But let me -- let me come back -- we can come back
0 Let me tell you, Putin has outsmarted her and Obam
0 You go to the heart of America's worries tonight.
1 Under the windfall profits tax, we expect to spend
0 I'll get people back to work with 12 million new j
0 We have to change the culture of America.
1 China and India are graduating more graduates in t
0 But I don't see where it is inflationary to have p
0 So I say this is very simple.
0 But that's the way I, George Bush, feel about it.
0 And I saw the cocaine scene treated with humor, as
0 I never, ever conceived of such a thing.
0 We are going to have to make the Iraqi government 
0 I think I'm a much wiser and more experienced man 
0 There -- there are people in Iran who have the sam
1 There are 168 federal programs that allow quotas.
0 If he wants to start, he can start

In [8]:
# split into x and y sets for use in keras runs
x_train, y_train = np.array( [ i["text"] for i in train ] ), np.array( [ i["label"] for i in train ] )
x_val,   y_val   = np.array( [ i["text"] for i in val ] ),   np.array( [ i["label"] for i in val ] )
x_test,  y_test  = np.array( [ i["text"] for i in test ] ),  np.array( [ i["label"] for i in test ] )

## Tokenize the training data set and build a vocabulary from it
We do this to limit the vocab to just what is in the training set, for performance.

Using defaults: strip all punctuation except apostrophes, convert to lower case and split on spaces. No restriction on the size of the vocabulary.

In [9]:
# Tokenize the training data set and build a vocabulary from it.
# We do this to limit the vocab to just what is in
# the training set, for performance.

# Using defaults: strip all punctuation except apostrophes,
# convert to lower case and split on spaces. No restriction on
# the size of the vocabulary.



t = keras.preprocessing.text.Tokenizer()

# extract just the text from the training set
x_train = [x["text"] for x in train]
t.fit_on_texts(x_train)

vocab_size = len(t.word_index) + 1

# Convert the sentences in each split into arrays of index numbers
# from the vocabulary that represent each word, and then
# pad or truncate to max_len tokens

x_train_ids = pad_sequences(t.texts_to_sequences(x_train),
                            max_len,
                            padding='post',
                            truncating = 'post')

x_val_ids = pad_sequences(t.texts_to_sequences(x_val),
                          max_len, 
                          padding='post', 
                          truncating = 'post')
x_test_ids = pad_sequences(t.texts_to_sequences(x_test),
                           max_len, 
                           padding='post', 
                           truncating = 'post')

# Get list of words in the vocab to filter what we load from the GloVe file
vocab_list = list(t.word_index.keys())

In [10]:
len(vocab_list)

7230

In [11]:
x_val[521]

"These crooks -- now, Willie Sutton would have gone to own a savings and loan rather than rob banks, because he robbed banks because that's where the money is; owning a savings and loan is where the money was."

In [12]:
x_val_ids[521]

array([  88,   36,   41,   10,  368,    2,  187,    7,  995,    3, 1064,
        642,   82, 1782,   47,   29, 1782,   47,   55,  114,    1,  146,
         11, 5434,    7,  995,    3, 1064,   11,  114,    1,  146,   27,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0], dtype=int32)

In [13]:
t.sequences_to_texts([x_val_ids[521]])

["these now would have gone to own a savings and loan rather than banks because he banks because that's where the money is owning a savings and loan is where the money was"]
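The round trip above loses "crooks", "Willie", "Sutton", "rob" and "robbed". The tokenizer was fit on the training split only and no `oov_token` was set, so `texts_to_sequences` silently drops any word that is not in `word_index`. The sketch below is a pure-Python approximation of that behavior, not the Keras implementation: the helper names and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

# Default punctuation filter of keras.preprocessing.text.Tokenizer
# (note that the apostrophe is not in it).
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in KERAS_DEFAULT_FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Most frequent word gets index 1; index 0 is reserved for padding."""
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

def to_ids(text, word_index, max_len):
    """Map known words to ids, drop unknown ones, then post-truncate and post-pad."""
    ids = [word_index[tok] for tok in simple_tokenize(text) if tok in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

toy_train = ["The money is here.", "That's where the money was."]
toy_index = build_word_index(toy_train)
print(to_ids("These crooks robbed banks where the money is", toy_index, 10))
```

If dropped words matter for the classifier, passing an `oov_token` when constructing the Keras tokenizer maps them to a dedicated index instead of removing them.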

In [14]:
y_val[521]

0

In [15]:
np.count_nonzero(y_train == 1)

1674

In [16]:
np.count_nonzero(y_train == 0)

4959

In [17]:
np.count_nonzero(y_train == 1)/len(y_train)

0.25237449118046135
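With about a quarter of the training labels positive, a classifier that always predicts 0 already reaches roughly 75% accuracy, which is the figure any trained model has to beat. The sketch below works through that arithmetic from the counts printed above, and also shows one common way to derive per-class weights for an imbalanced set. The function names are hypothetical, and whether class weighting helps on this data is untested here.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

train_counts = {0: 4959, 1: 1674}   # from the two count_nonzero cells above
print(f"majority baseline accuracy: {majority_baseline(train_counts):.4f}")
print(balanced_class_weights(train_counts))
```

A dictionary of this shape can be passed as the `class_weight` argument of `model.fit` if the imbalance turns out to hurt recall on the positive class.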

## Load in GloVe Embeddings

Using glove.6B.zip for testing; it was trained on Wikipedia 2014 and Gigaword 5. The data set is 822 MB, so I am putting it into /mnt/export/NLPData on the file server, in a partition with 7 TB of free space. It contains 50-, 100-, 200- and 300-dimension vectors, which are worth comparing for training cost vs. accuracy.

First experimentation will be with the 50-dimension version.

We will then create an embedding matrix to load into the Embedding layer of the model

In [18]:
%%time

# Load embeddings dictionary, but filter out words not in the training set
embed_loc = "/mnt/export/NLPData"
embed_file = "glove.6B.50d.txt"

embed_dict = {}

with open(embed_loc + "/" + embed_file, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        if word in vocab_list:
            vector = np.asarray(values[1:], "float32")   #Keeps data smaller instead of defaulting to float64
            embed_dict[word] = vector

print(len(embed_dict))


7037
CPU times: user 26.9 s, sys: 130 ms, total: 27.1 s
Wall time: 26.8 s
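In the cell above, `word in vocab_list` scans a Python list for every line of a file with roughly 400,000 entries, which probably accounts for much of the 27 seconds. A set (or the `t.word_index` dict itself) gives constant-time membership tests. The sketch below shows the idea on a made-up three-line file: `load_filtered_vectors` is a hypothetical helper, and it stores plain float lists where the notebook uses `np.asarray(..., "float32")`.

```python
import io

def load_filtered_vectors(lines, vocab):
    """Parse GloVe-format lines, keeping only words present in vocab."""
    wanted = set(vocab)                     # O(1) membership instead of a list scan
    vectors = {}
    for line in lines:
        values = line.split()
        if not values:                      # skip blank lines
            continue
        word = values[0]
        if word in wanted:
            vectors[word] = [float(v) for v in values[1:]]
    return vectors

# Toy 3-dimensional file contents; the numbers are made up for illustration.
toy_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "money 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)
toy_vocab = ["the", "money", "that's"]
toy_vectors = load_filtered_vectors(toy_file, toy_vocab)
coverage = len(toy_vectors) / len(toy_vocab)
print(sorted(toy_vectors), f"coverage={coverage:.2f}")
```

The same coverage ratio applied to the real run is 7037 / 7230, or about 97.3% of the training vocabulary found in GloVe.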


In [19]:
# we can almost certainly wrap this in the loop as we read in the embeddings, but just for clarity:

embedding_matrix = np.zeros((vocab_size, embed_dim))
for word, i in t.word_index.items():
    embedding_vector = embed_dict.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        print(f"Could not find '{word}' in the GloVe vocab")
        
### NEED TO SEE HOW TO HANDLE ALL THE CONTRACTION WORDS SO WE CAN ENCODE AS SOMETHING OTHER THAN UNK
        

Could not find 'it's' in the GloVe vocab
Could not find 'don't' in the GloVe vocab
Could not find 'that's' in the GloVe vocab
Could not find 'we're' in the GloVe vocab
Could not find 'i'm' in the GloVe vocab
Could not find 'we've' in the GloVe vocab
Could not find 'i've' in the GloVe vocab
Could not find 'you're' in the GloVe vocab
Could not find 'they're' in the GloVe vocab
Could not find 'can't' in the GloVe vocab
Could not find 'he's' in the GloVe vocab
Could not find 'didn't' in the GloVe vocab
Could not find 'there's' in the GloVe vocab
Could not find 'let's' in the GloVe vocab
Could not find 'i'll' in the GloVe vocab
Could not find 'you've' in the GloVe vocab
Could not find 'we'll' in the GloVe vocab
Could not find 'doesn't' in the GloVe vocab
Could not find 'i'd' in the GloVe vocab
Could not find 'isn't' in the GloVe vocab
Could not find 'what's' in the GloVe vocab
Could not find 'won't' in the GloVe vocab
Could not find 'president's' in the GloVe vocab
Could not find 'haven't' 
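Every word reported above keeps an all-zero row in `embedding_matrix`, the same as the padding index. One possible way to give contractions a real vector is to split them into Penn-Treebank-style pieces ("do" + "n't", "it" + "'s") and average the vectors of the pieces. This is a sketch under two assumptions that should be checked first: that the GloVe file actually contains pieces such as `n't` and `'s`, and that averaging is an acceptable stand-in. The helper names and toy vectors are hypothetical.

```python
# Suffixes tried in order; "n't" is checked before the apostrophe forms.
_SUFFIXES = ("n't", "'re", "'ve", "'ll", "'s", "'d", "'m")

def split_contraction(word):
    """Split 'don't' into ['do', "n't"]; return [word] if no suffix matches."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]

def mean_vector(pieces, vectors):
    """Average the vectors of the pieces that exist; None if none are found."""
    found = [vectors[p] for p in pieces if p in vectors]
    if not found:
        return None
    return [sum(column) / len(found) for column in zip(*found)]

# Toy 2-dimensional vectors, made up for illustration.
toy_vectors = {"do": [1.0, 0.0], "n't": [0.0, 1.0], "it": [2.0, 2.0]}
print(split_contraction("don't"), mean_vector(split_contraction("don't"), toy_vectors))
```

In the notebook this would slot into the `else` branch of the embedding-matrix loop, falling back to the zero row only when `mean_vector` returns `None`.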

In [20]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
        -0.11514   , -0.78580999],
       [ 0.68046999, -0.039263  ,  0.30186   , ..., -0.073297  ,
        -0.064699  , -0.26043999],
       ...,
       [-0.021461  , -0.5122    ,  1.05599999, ..., -0.25196001,
         0.20523   , -0.38449001],
       [ 0.29561001, -0.82200998,  1.00129998, ..., -0.026888  ,
         0.14869   , -0.54429001],
       [ 0.73167998, -0.11411   ,  0.70012999, ..., -0.15169001,
         0.61448997, -1.32319999]])

In [21]:
%%time
# Construct the convolutional neural network.
# The form of each keras layer function is as follows:
#    result = keras.layers.LayerType(arguments for the layer)(layer(s) it should use as input)
# concretely,
#    this_layer_output = keras.layers.Dense(100, activation='relu')(prev_layer_vector)
# performs this_layer_output = relu(prev_layer_vector x W + b) where W has 100 columns.



# set up input layer (receives word IDs) and embedding layer that turns them into GloVe embeddings
word_ids = keras.layers.Input(shape=(max_len,))
h=keras.layers.Embedding(vocab_size,
                         embed_dim,
                         weights=[embedding_matrix],
                         trainable = train_embeds)(word_ids)


# Add convolutional layers and pooling layers based on number of filters and kernel size(s)
conv_layers_for_all_kernel_sizes = []
for kernel_size, filters in zip(kernel_sizes, num_filters):
    print(f"Adding Convolution: Kernel Size: {kernel_size}, Filter Count: {filters}")
    # note that all convolution layers take the same input "h" the output from the embedding layer
    conv_layer = keras.layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu')(h)
    conv_layer = keras.layers.GlobalMaxPooling1D()(conv_layer)

    # Keep track of all the kernels so we can add the concatenation next
    conv_layers_for_all_kernel_sizes.append(conv_layer)


# Concat the feature maps from each different size.
h = keras.layers.concatenate(conv_layers_for_all_kernel_sizes, axis=1)


# Dropout can help with overfitting (improve generalization) by randomly 0-ing different subsets of values
# in the vector.
h = keras.layers.Dropout(rate=dropout_rate)(h)



# Add the fully connected feed forward layers for categorization
# Add a fully connected layer for each dense layer dimension in dense_layer_dims.
for dim in dense_layer_dims:
    h = keras.layers.Dense(dim, activation='relu')(h)
    print(dim,h)


# Add the output layer for the classifier: a softmax over num_classes outputs
prediction = keras.layers.Dense(num_classes, activation='softmax')(h)


# Create and compile the model
model = keras.Model(inputs=word_ids, outputs=prediction)
model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',   # integer labels (0/1) against a num_classes-way softmax
              metrics=['accuracy'])                    # What metric to output as we train.

Adding Convolution: Kernel Size: 2, Filter Count: 2
Adding Convolution: Kernel Size: 3, Filter Count: 2
Adding Convolution: Kernel Size: 4, Filter Count: 2
8 Tensor("dense/Relu:0", shape=(None, 8), dtype=float32)
8 Tensor("dense_1/Relu:0", shape=(None, 8), dtype=float32)
8 Tensor("dense_2/Relu:0", shape=(None, 8), dtype=float32)
CPU times: user 189 ms, sys: 4.41 ms, total: 194 ms
Wall time: 188 ms
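With a two-unit softmax output and integer labels such as `y_train`, the per-example loss that lines up is the negative log of the probability assigned to the true class, which Keras exposes as `sparse_categorical_crossentropy`. The following is a minimal pure-Python sketch of that computation for one example. The function names are illustrative, and this is not how Keras implements it internally.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probs, label):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(probs[label])

probs = softmax([0.0, 0.0])                     # an untrained, undecided model
print(probs, sparse_cross_entropy(probs, 1))
```

An undecided two-class model therefore starts near a loss of ln 2 (about 0.693), which is a handy reference when reading the first epochs of training.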


In [29]:
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 100)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 100, 50)      361550      input_1[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 99, 2)        202         embedding[0][0]                  
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 98, 2)        302         embedding[0][0]                  
_______________________________________________________________________________________
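The shapes and parameter counts in the summary can be checked by hand. A `Conv1D` layer with the default "valid" padding and stride 1 shortens the sequence by `kernel_size - 1`, and its parameter count is `kernel_size * input_channels * filters` weights plus one bias per filter. The sketch below reproduces the rows visible above using the hyperparameters from cell 2; the function names are illustrative.

```python
def conv1d_output_len(seq_len, kernel_size):
    """Output length for 'valid' padding with stride 1."""
    return seq_len - kernel_size + 1

def conv1d_param_count(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def embedding_param_count(vocab_size, embed_dim):
    """One embed_dim-sized row per vocabulary index."""
    return vocab_size * embed_dim

max_len, embed_dim = 100, 50
vocab_size = 7230 + 1                      # 7230 words plus the padding index 0
for kernel_size, filters in zip([2, 3, 4], [2, 2, 2]):
    print(kernel_size,
          conv1d_output_len(max_len, kernel_size),
          conv1d_param_count(kernel_size, embed_dim, filters))
print(embedding_param_count(vocab_size, embed_dim))
```

Nearly all of the model's parameters sit in the embedding layer, so the `train_embeds` setting largely decides how many weights are actually being fitted.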

In [23]:
x_train_ids

array([[   6,   24,    5, ...,    0,    0,    0],
       [   6,   24,   12, ...,    0,    0,    0],
       [   6,   65,   30, ...,    0,    0,    0],
       ...,
       [ 158,  110,   30, ...,    0,    0,    0],
       [   6,  677,    5, ...,    0,    0,    0],
       [  49, 2378,   14, ...,    0,    0,    0]], dtype=int32)

In [24]:
y_train

array([0, 0, 0, ..., 0, 1, 1])

In [25]:
x_val_ids

array([[5032,    9,  781, ...,    0,    0,    0],
       [   3,  287, 4165, ...,    0,    0,    0],
       [   3,   58,    8, ...,    0,    0,    0],
       ...,
       [   3,    6,   61, ...,    0,    0,    0],
       [  22,  128,   60, ...,    0,    0,    0],
       [   1,  105,   97, ...,    0,    0,    0]], dtype=int32)

In [26]:
for item in y_val:
    print(item)

1
0
0
0
0
0
0
0
0
1
...
1


In [27]:
print(f"x length: {len(x_train_ids)}")
print(f"y length: {len(y_train)}")

x length: 6633
y length: 6633


In [28]:
%%time
model.reset_states()
model.fit(x_train_ids, y_train, epochs=epochs, batch_size = batch_size, validation_data = (x_val_ids, y_val) )

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
CPU times: user 6min 35s, sys: 53.8 s, total: 7min 29s
Wall time: 30.4 s


<tensorflow.python.keras.callbacks.History at 0x7f55ec1ea280>
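`model.fit` returns a `History` object whose `.history` attribute is a dictionary of per-epoch lists; in the cell above it is only echoed, not kept. Assigning it (for example `history = model.fit(...)`) makes it possible to find the epoch where validation performance peaked. The sketch below works on a plain dictionary of that shape. The key names match what Keras reports for `metrics=['accuracy']` with validation data, but the numbers are made up and are not results from this run.

```python
def best_epoch(history_dict, key="val_loss", mode="min"):
    """Return (1-based epoch, value) where the chosen metric is best."""
    values = history_dict[key]
    pick = min if mode == "min" else max
    best_index = pick(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Made-up numbers shaped like `History.history`; not results from this run.
toy_history = {
    "loss": [0.60, 0.52, 0.47, 0.43],
    "val_loss": [0.58, 0.55, 0.56, 0.59],
    "accuracy": [0.74, 0.76, 0.78, 0.80],
    "val_accuracy": [0.75, 0.76, 0.77, 0.76],
}
print(best_epoch(toy_history))
print(best_epoch(toy_history, key="val_accuracy", mode="max"))
```

A training loss that keeps falling while validation loss turns upward, as in the toy numbers, is the usual sign that 50 epochs is more than the model needs.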

### Comments

1. 





### Dataset Citations

@inproceedings{arslan2020claimbuster,
    title={{A Benchmark Dataset of Check-worthy Factual Claims}},
    author={Arslan, Fatma and Hassan, Naeemul and Li, Chengkai and Tremayne, Mark },
    booktitle={14th International AAAI Conference on Web and Social Media},
    year={2020},
    organization={AAAI}
}

@article{meng2020gradient,
  title={Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims},
  author={Meng, Kevin and Jimenez, Damian and Arslan, Fatma and Devasier, Jacob Daniel and Obembe, Daniel and Li, Chengkai},
  journal={arXiv preprint arXiv:2002.07725},
  year={2020}
}

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.


### Useful resources in building this
Using pre-trained word embeddings: https://keras.io/examples/nlp/pretrained_word_embeddings/