## CNN sentence classifier 

I have borrowed heavily from CNN.ipynb in assignment 4 for the code shell here.  I have not used the vocabulary objects, however, because I wanted to control how the vocab is structured.

The model is parameterized so we can run multiple configurations and pick the best results.


In [1]:
## Enable Tensorboard
%load_ext tensorboard

## Usual Imports
import numpy as np
# import pandas as pd
import matplotlib.pyplot as plt
import re
import json
import datetime
import string


from scipy import spatial

from nltk.tokenize import sent_tokenize, word_tokenize
## Requires nltk.download('punkt') if not already in the environment.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import SGD, RMSprop

from sklearn.manifold import TSNE

# Helper libraries
from w266_common import utils, vocabulary



# to fix the CUDA issues for CUDA 11.2
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)


In [2]:
# maximum number of tokens to look at.
max_len = 100
embed_dim = 50

## Read in the data set of sentences and store

In [3]:
%%time
# Load Data file into numpy array

with open("../data/3xNCS.json") as f:
    d = np.asarray(json.load(f))

# shuffle the array to randomize the data sets used
# essential as the file has all the true labels before
# the false ones
np.random.seed(42)
np.random.shuffle(d)

len(d)

CPU times: user 13.5 ms, sys: 276 µs, total: 13.7 ms
Wall time: 13 ms


11056

## Split into train, validation and test sets

In [4]:
%%time 

train_len = int(0.6 * len(d))
val_len = int(0.2 * len(d))

train, val, test = np.split(d, [train_len, train_len + val_len])

# split into x and y sets for use in keras runs
x_train, y_train = np.array( [ i["text"] for i in train ] ), np.array( [ i["label"] for i in train ] )
x_val,   y_val   = np.array( [ i["text"] for i in val ] ),   np.array( [ i["label"] for i in val ]   )
x_test,  y_test  = np.array( [ i["text"] for i in test ] ),  np.array( [ i["label"] for i in test ]  )

## Can't seem to get the binary output to work, so have to one-hot the labels.
## Separating the code so we can pull it out if I get the binary output working.

# y_train = [ [abs(a-1),a] for a in y_train]
# y_val   = [ [abs(a-1),a] for a in y_val  ]
# y_test  = [ [abs(a-1),a] for a in y_test ]

CPU times: user 10.2 ms, sys: 3.99 ms, total: 14.2 ms
Wall time: 13.3 ms
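As a quick check on the split boundaries, the sketch below reproduces the index arithmetic with plain Python slicing instead of np.split. The helper name split_bounds is hypothetical; 11056 is the dataset length printed above. Because both int() calls round down, the test set picks up the remainder.

```python
def split_bounds(n, train_frac=0.6, val_frac=0.2):
    """Return (train_len, val_len, test_len) for a dataset of n rows."""
    train_len = int(train_frac * n)
    val_len = int(val_frac * n)
    test_len = n - train_len - val_len   # whatever is left goes to test
    return train_len, val_len, test_len

n = 11056
train_len, val_len, test_len = split_bounds(n)

# Same boundaries as np.split(d, [train_len, train_len + val_len])
rows = list(range(n))
train = rows[:train_len]
val = rows[train_len:train_len + val_len]
test = rows[train_len + val_len:]

print(len(train), len(val), len(test))   # 6633 2211 2212
```

The training length of 6633 agrees with the label counts shown further down (1672 + 4961).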


## Tokenize the training data set and build a vocabulary from it
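We do this to restrict the vocab to just what is in the training set, for performance.

Tokenizer settings: lower-case the text and split on spaces, with no restriction on the size of the vocabulary. The default filters strip all punctuation except apostrophes, so the code below overrides them with string.punctuation to strip apostrophes as well.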

In [5]:
# Tokenize the training data set and build a vocabulary from it
# Want to do this to minimize the vocab to just what is in
# the training set for performance.

# Defaults of the tokenizer:
# set to lower case and split on spaces.
# No restriction on length of vocabulary
#
# Overriding the default filters with string.punctuation
# as the default does not strip apostrophes, which is
# expected in the GloVe tokenizer.



t = keras.preprocessing.text.Tokenizer(filters=string.punctuation)

t.fit_on_texts(x_train)



# Convert the text statements into arrays of index numbers
# from the vocabulary that represent each word, and then
# pad or truncate to max_len tokens

x_train_ids = pad_sequences(t.texts_to_sequences(x_train),
                            max_len,
                            padding='post',
                            truncating = 'post')

x_val_ids = pad_sequences(t.texts_to_sequences(x_val),
                          max_len, 
                          padding='post', 
                          truncating = 'post')
x_test_ids = pad_sequences(t.texts_to_sequences(x_test),
                           max_len, 
                           padding='post', 
                           truncating = 'post')

# Get list of words in the vocab to filter what we load from the GloVe file
vocab_list = list(t.word_index.keys())
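To make the tokenize-then-pad step concrete, here is a simplified pure-Python stand-in for what the cell above does. It is a sketch, not the Keras implementation: the helper names (simple_tokenize, fit_vocab, to_padded_ids) are hypothetical, and it assumes the behavior used above, namely that filtered characters become spaces, ids are assigned by descending frequency starting at 1, unknown words are dropped, and 0 is the padding id.

```python
import string
from collections import Counter

def simple_tokenize(text, filters=string.punctuation):
    """Lower-case, replace every filtered character with a space, split on whitespace."""
    table = str.maketrans({c: " " for c in filters})
    return text.lower().translate(table).split()

def fit_vocab(texts):
    """Map each word to an id, most frequent word first; id 0 is left for padding."""
    counts = Counter(w for text in texts for w in simple_tokenize(text))
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_padded_ids(text, word_index, max_len):
    """Convert text to ids, dropping unknown words, then truncate/pad at the end ('post')."""
    ids = [word_index[w] for w in simple_tokenize(text) if w in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

word_index = fit_vocab(["It's a tax cut.", "A tax on a cut"])
ids = to_padded_ids("Cut the tax!", word_index, max_len=5)
print(word_index)
print(ids)
```

Note that "It's" becomes the two tokens "it" and "s" once the apostrophe is filtered, and that "the" disappears from the second call because it was never seen when fitting the vocab.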

In [6]:
len(vocab_list)

7017

In [7]:
vocab_size = len(vocab_list) +1

In [8]:
np.count_nonzero(y_train == 1)

1672

In [9]:
np.count_nonzero(y_train == 0)

4961

In [10]:
# Check for % of statements that are checkable claims to 
# ensure we have some level of balance that will avoid
# high accuracy for single value guessing.

print(f"fraction of checkable statements in training data  : {np.count_nonzero(y_train == 1)/len(y_train)}")
print(f"fraction of checkable statements in validation data: {np.count_nonzero(y_val == 1)/len(y_val)}")
print(f"fraction of checkable statements in test data      : {np.count_nonzero(y_test == 1)/len(y_test)}")

fraction of checkable statements in training data  : 0.25207296849087896
fraction of checkable statements in validation data: 0.250565355042967
fraction of checkable statements in test data      : 0.24321880650994576
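Since only about a quarter of the statements are checkable, a model that always predicts "not checkable" already scores well on accuracy. The sketch below computes that majority-class baseline from the training counts printed above (1672 positive, 4961 negative); the function name is hypothetical.

```python
def majority_baseline(n_pos, n_neg):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)

# Training-set label counts from the cells above
baseline = majority_baseline(n_pos=1672, n_neg=4961)
print(f"majority-class baseline accuracy: {baseline:.4f}")
```

This comes out at roughly 0.748, so any accuracy reported during training should be read against that figure rather than against 0.5.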


## Function to create embeddings dictionary for the network

Defaults to using glove.6B.zip for testing, which was trained on Wikipedia 2014 and Gigaword 5.  The data set is 822 MB, so it lives in /mnt/export/NLPData on the file server, in a partition with 7 TB of free space.  It provides 50-, 100-, 200- and 300-dimension vectors, which are worth comparing for training time vs. accuracy.

This notebook uses the 50-dimension embedding.

The function returns the embedding matrix needed for the keras.layers.Embedding() layer.
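The key point is that row i of the matrix must hold the vector for the word whose tokenizer id is i, with row 0 left as zeros for padding. A minimal sketch of that alignment, using made-up 3-dimensional vectors in place of the real GloVe file (the helper names load_vectors and build_matrix and the toy word_index are assumptions for illustration):

```python
import io

def load_vectors(lines, wanted):
    """Parse GloVe-style lines ("word v1 v2 ...") for the wanted words only."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        if word in wanted:
            vectors[word] = [float(v) for v in values]
    return vectors

def build_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with id i; row 0 (padding) and
    words missing from the vectors stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix

# Made-up 3-dimensional vectors standing in for the GloVe file
fake_glove = io.StringIO("the 0.1 0.2 0.3\ntax 0.4 0.5 0.6\nzebra 0.7 0.8 0.9\n")
word_index = {"tax": 1, "midgetman": 2}   # hypothetical tokenizer output

vectors = load_vectors(fake_glove, set(word_index))
matrix = build_matrix(word_index, vectors, dim=3)
print(matrix)
```

Words that GloVe does not know, such as "midgetman" here, keep an all-zero row, which is the same fallback the notebook's function uses.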

In [11]:
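def embed_matrix(vocab_list, embed_loc = "/mnt/export/NLPData", embed_file = "glove.6B.50d.txt"):
    """Build the weight matrix for keras.layers.Embedding() from a GloVe text file.

    Relies on the notebook globals t (the fitted Tokenizer) and embed_dim.
    """

    # Set membership is O(1); testing against the list would rescan it for every GloVe line
    vocab_set = set(vocab_list)
    embed_dict = {}

    with open(embed_loc + "/" + embed_file, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in vocab_set:
                vector = np.asarray(values[1:], "float32")   # Keeps data smaller instead of defaulting to float64
                embed_dict[word] = vector

    print(len(embed_dict))

    # Index 0 is reserved for padding, so we need one more row than there are words
    vocab_size = len(t.word_index) + 1

    # Words that are not in GloVe keep an all-zero row
    embedding_matrix = np.zeros((vocab_size, embed_dim))
    for word, i in t.word_index.items():
        embedding_vector = embed_dict.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            print(f"Could not find {word} in the GloVe vocab")

    return embedding_matrix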

In [12]:
embedding_matrix = embed_matrix(vocab_list)

6972
Could not find hadn in the GloVe vocab
Could not find ã‚ in the GloVe vocab
Could not find itãâ‚¬s in the GloVe vocab
Could not find weã¢â‚¬â„¢ll in the GloVe vocab
Could not find midgetman in the GloVe vocab
Could not find itã¢â‚¬â„¢s in the GloVe vocab
Could not find snipings in the GloVe vocab
Could not find recallable in the GloVe vocab
Could not find vietnese in the GloVe vocab
Could not find namese in the GloVe vocab
Could not find insur in the GloVe vocab
Could not find ãƒâ¢ã¢â€šâ¬ã¢â‚¬å“ in the GloVe vocab
Could not find l981 in the GloVe vocab
Could not find 30ãâ‚¬s in the GloVe vocab
Could not find 40ãâ‚¬s in the GloVe vocab
Could not find insolvable in the GloVe vocab
Could not find arterially in the GloVe vocab
Could not find braggadocios in the GloVe vocab
Could not find aayuh in the GloVe vocab
Could not find appropri in the GloVe vocab
Could not find l4s in the GloVe vocab
Could not find lorranna in the GloVe vocab
Could not find shinsheki in the GloVe vocab
Coul

In [13]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
        -0.11514   , -0.78580999],
       [ 0.68046999, -0.039263  ,  0.30186   , ..., -0.073297  ,
        -0.064699  , -0.26043999],
       ...,
       [-0.70185   , -0.48587   , -1.1372    , ..., -0.70876002,
        -0.73249   , -0.17163   ],
       [-0.051174  , -0.36392999, -0.058348  , ...,  0.60711002,
        -1.63810003, -0.95243001],
       [ 0.68660998, -0.078468  ,  0.20039   , ...,  0.64521003,
         0.068449  ,  0.30531999]])

In [19]:
%%time

### run_model() - a function to build and run the model based on the passed parameters (if any)

# Construct the convolutional neural network.
# The form of each keras layer function is as follows:
#    result = keras.layers.LayerType(arguments for the layer)(layer(s) it should use as input)
# concretely,
#    this_layer_output = keras.layers.Dense(100, activation='relu')(prev_layer_vector)
# performs this_layer_output = relu(prev_layer_vector x W + b) where W has 100 columns.
#
# Defaults are defined at the top of the notebook



def run_model(max_len = max_len,        # set at top of notebook
              epochs = 30,
              batch_size = 50,
              embed_dim = embed_dim,    # set at top of notebook
              num_filters = [2, 2, 2],
              kernel_sizes = [2, 3, 4],
              dense_layer_dims = [8],
              dropout_rate = 0.1,
              train_embeds = False,  # Whether we allow the embeddings to be changed
              opt = 'adam', 
              log_run_results = False):

    num_classes = 1

    # set up input layer (receives word IDs) and embedding that turns that into GloVe embeddings
    word_ids = keras.layers.Input(shape=(max_len,))
    h=keras.layers.Embedding(vocab_size,
                             embed_dim,
                             weights=[embedding_matrix],
                             trainable = train_embeds)(word_ids)

    # Add convolutional layers and pooling layers based on number of filters and kernel size(s)
    conv_layers_for_all_kernel_sizes = []
    for kernel_size, filters in zip(kernel_sizes, num_filters):
        print(f"Adding Convolution: Kernel Size: {kernel_size}, Filter Count: {filters}")
        # note that all convolution layers take the same input "h" the output from the embedding layer
        conv_layer = keras.layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu')(h)
        conv_layer = keras.layers.GlobalMaxPooling1D()(conv_layer)
        conv_layers_for_all_kernel_sizes.append(conv_layer)

    # Concat the feature maps from each different size.
    h = keras.layers.concatenate(conv_layers_for_all_kernel_sizes, axis=1)

    # Dropout can help with overfitting
    h = keras.layers.Dropout(rate=dropout_rate)(h)

    # Add the fully connected feed forward layers for categorization
    # Add a fully connected layer for each dense layer dimension in dense_layer_dims.
    for dim in dense_layer_dims:
        h = keras.layers.Dense(dim, activation='relu')(h)

    # Add the output layer for classifier - in this case, there is only one output
    prediction = keras.layers.Dense(num_classes, activation='sigmoid')(h)

    # Create and compile the model
    model = keras.Model(inputs=word_ids, outputs=prediction)
    model.compile(optimizer=opt,
                  loss='binary_crossentropy',               # as we only have a single output class
                  # metrics=['binary_accuracy'])                    # What metric to output as we train.
                  metrics=['accuracy'])                    # What metric to output as we train.
    
    model.summary()    # prints the summary itself; wrapping it in print() also prints "None"

    model.reset_states()

    tag  = datetime.datetime.now().strftime("%y%m%d-%H%M%S")
    # e.g. kernel_sizes [2, 3, 4] gives "tblogs-234/<tag>"; reuse tag so the
    # TensorBoard directory and the runs.log entry share one timestamp
    log_dir = f"tblogs-{''.join(str(k) for k in kernel_sizes)}/{tag}"
    
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

    model.fit(x_train_ids, y_train,
              epochs=epochs,
              batch_size = batch_size,
              validation_data = (x_val_ids, y_val),
              callbacks=[tensorboard_callback]   )

    hist = model.history.history
    
    if log_run_results:
        # One pipe-delimited line per run: the parameters, then each metric name
        # followed by its per-epoch values, then END.
        with open("runs.log", 'a') as f:
            f.write(f"{tag}|{max_len}|{epochs}|{batch_size}|{embed_dim}|{num_filters}|{kernel_sizes}|{dense_layer_dims}|{dropout_rate}|{num_classes}|{train_embeds}|")
            for metric in list(hist.keys()):
                print(metric)
                f.write(f"{metric}|")
                for i in range(epochs):
                    f.write(f"{hist[metric][i]}|")
            f.write("END\n")
        # no explicit close needed: the with block closes the file

    return hist

CPU times: user 0 ns, sys: 9 µs, total: 9 µs
Wall time: 14.8 µs
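To show what each Conv1D plus GlobalMaxPooling1D branch computes, here is a pure-Python sketch of a single filter sliding over a short sequence. It is an illustration under simplifying assumptions, not the Keras code: one filter per kernel size (the model above uses num_filters of them), made-up weights, and tiny 2-dimensional embeddings. The function names are hypothetical.

```python
def conv1d_relu(seq, kernel, bias=0.0):
    """One Conv1D filter with 'valid' padding followed by ReLU.

    seq    : list of token vectors, shape (length, embed_dim)
    kernel : list of weight vectors, shape (kernel_size, embed_dim)
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        total = bias + sum(x * w
                           for vec, wvec in zip(window, kernel)
                           for x, w in zip(vec, wvec))
        out.append(max(0.0, total))   # ReLU
    return out

def global_max_pool(feature_map):
    """Keep only the strongest response of a filter across the sentence."""
    return max(feature_map)

# Toy input: 5 tokens with 2-dimensional embeddings (made-up numbers)
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]

# Two hypothetical filters with different kernel sizes
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],                # kernel_size = 2
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],    # kernel_size = 3
]

# One pooled value per filter, concatenated into the feature vector
features = [global_max_pool(conv1d_relu(seq, k)) for k in kernels]
print(features)   # [2.0, 3.0]
```

Because the pooling keeps a single value per filter, the concatenated feature vector has the same length regardless of sentence length or kernel size, which is what lets branches with different kernel sizes feed one dense layer.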


In [20]:
run_model(log_run_results  = False)

Adding Convolution: Kernel Size: 2, Filter Count: 2
Adding Convolution: Kernel Size: 3, Filter Count: 2
Adding Convolution: Kernel Size: 4, Filter Count: 2
Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 100)]        0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 100, 50)      350900      input_3[0][0]                    
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 99, 2)        202         embedding_2[0][0]                
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (No

{'loss': [0.5604069232940674,
  0.4382692575454712,
  0.3883560597896576,
  0.37111109495162964,
  0.35807913541793823,
  0.35679253935813904,
  0.347476601600647,
  0.34749364852905273,
  0.3408835828304291,
  0.3352910578250885,
  0.32710227370262146,
  0.3251993656158447,
  0.32075414061546326,
  0.31486231088638306,
  0.31799766421318054,
  0.31718748807907104,
  0.3072964549064636,
  0.31077519059181213,
  0.3092276155948639,
  0.30367082357406616,
  0.3053053319454193,
  0.29876306653022766,
  0.30660584568977356,
  0.3016033172607422,
  0.2967414855957031,
  0.28846511244773865,
  0.29470938444137573,
  0.29081088304519653,
  0.2859611213207245,
  0.29291263222694397],
 'accuracy': [0.742348849773407,
  0.8101914525032043,
  0.8358209133148193,
  0.8444142937660217,
  0.8492386341094971,
  0.8492386341094971,
  0.8524046540260315,
  0.8542137742042542,
  0.8590381145477295,
  0.8600934743881226,
  0.864616334438324,
  0.8662747144699097,
  0.868536114692688,
  0.8695914149284363
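The output shapes and parameter counts in the model summary can be reproduced by hand, which is a useful sanity check when changing kernel_sizes or num_filters. The sketch below assumes the Keras defaults used above ('valid' padding, stride 1, a bias per filter); the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized row per vocabulary index
    return vocab_size * embed_dim

def conv1d_output_len(seq_len, kernel_size):
    # 'valid' padding with stride 1
    return seq_len - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    # one weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

vocab_size, embed_dim, max_len = 7017 + 1, 50, 100

print(embedding_params(vocab_size, embed_dim))   # 350900
for kernel_size, filters in zip([2, 3, 4], [2, 2, 2]):
    print(kernel_size,
          conv1d_output_len(max_len, kernel_size),
          conv1d_params(kernel_size, embed_dim, filters))
# 2 99 202
# 3 98 302
# 4 97 402
```

The embedding count and the first convolution line match the summary above (350900 parameters, and shape (None, 99, 2) with 202 parameters); the other two rows come from the same formulas.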

## Output of the tests has been deleted because the notebook will not load in Chrome with it
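The per-run results are still available in runs.log when log_run_results is True. Below is a sketch of how one of those pipe-delimited lines could be read back. The field names follow the header line written in the grid-search cell below; parse_run_line and the sample line are hypothetical and only illustrate the layout (11 parameters, then each metric name followed by one value per epoch, then END).

```python
PARAM_NAMES = ["timestamp", "max_len", "epochs", "batch_size", "embed_dim",
               "num_filters", "kernel_sizes", "dense_layer_dims",
               "dropout_rate", "num_classes", "train_embeds"]

def parse_run_line(line):
    """Return (params, history) for a run line, or None for a header or malformed line."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < len(PARAM_NAMES) + 1 or fields[-1] != "END":
        return None
    if fields[0] == "timestamp":          # header line
        return None
    params = dict(zip(PARAM_NAMES, fields[:len(PARAM_NAMES)]))
    epochs = int(params["epochs"])
    rest = fields[len(PARAM_NAMES):-1]    # metric name, then `epochs` values, repeated
    history = {}
    step = epochs + 1
    for start in range(0, len(rest), step):
        history[rest[start]] = [float(v) for v in rest[start + 1:start + step]]
    return params, history

# Made-up two-epoch run, in the same layout run_model() writes
sample = ("210101-000000|100|2|50|50|[2, 2, 2]|[2, 3, 4]|[8]|0.1|1|False|"
          "loss|0.5|0.4|val_accuracy|0.7|0.8|END\n")
params, history = parse_run_line(sample)
print(params["kernel_sizes"], max(history["val_accuracy"]))
```

The parameter values come back as strings; converting them (for example with ast.literal_eval for the lists) is left out to keep the sketch short.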

In [None]:
# this is very quick, so we can use grid search on these models.

epochs = 30


# Write the header line
with open("runs.log",'a') as f:
    f.write("timestamp|max_len|epochs|batch_size|embed_dim|num_filters|kernel_sizes|dense_layer_dims|dropout_rate|num_classes|train_embeds|")
    for metric in [ 'loss', 'accuracy', 'val_loss', 'val_accuracy']:
        f.write(f"{metric}|")
        for i in range(epochs):
            f.write(f"{metric}-{i}|")
    f.write("END\n")

    
# run the models

for k_sizes in [ [2,4], [4,8], [8,16], [2,3,4], [2,4,6], [2,6,8], [4,6,8], [4,8,12], [8,12,16] ]:

    # Set filter counts according to the number of kernels in use.
    if len(k_sizes) == 2:
        filter_counts = [ [2,2], [4,4], [8,8], [2,4], [4,8], [8,16], [16,32] ]
    elif len(k_sizes) == 3:
        filter_counts = [ [2,2,2], [4,4,4], [8,8,8], [2,3,4], [3,4,5], [4,5,6], [8,10,16], [8,16,32] ]
    
    
    for n_fil in filter_counts:
        for dld in [ [4], [8], [16], [32], [64], [4,4], [8,8], [16,16], [32,32], [4,4,4], [8,8,8], [16,16,16] ]:
            for dropout in np.arange(0,0.9,0.2):
                for tr in [True, False]:
                    hist = run_model(epochs           = epochs,
                                     kernel_sizes     = k_sizes,
                                     num_filters      = n_fil,
                                     dense_layer_dims = dld,
                                     dropout_rate     = dropout,
                                     train_embeds     = tr,
                                     log_run_results  = True)
                    print("\n\n\n##########################################################################\n\n\n")
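Before launching the sweep it helps to know how many runs the nested loops generate. The sketch below counts them with itertools.product, using the same value lists as the cell above; the dropout list is written out by hand as the five steps that np.arange(0, 0.9, 0.2) produces (up to floating-point rounding), and the variable names are hypothetical.

```python
from itertools import product

kernel_size_sets = [[2, 4], [4, 8], [8, 16],
                    [2, 3, 4], [2, 4, 6], [2, 6, 8], [4, 6, 8], [4, 8, 12], [8, 12, 16]]
filter_counts_by_len = {
    2: [[2, 2], [4, 4], [8, 8], [2, 4], [4, 8], [8, 16], [16, 32]],
    3: [[2, 2, 2], [4, 4, 4], [8, 8, 8], [2, 3, 4], [3, 4, 5], [4, 5, 6], [8, 10, 16], [8, 16, 32]],
}
dense_dims = [[4], [8], [16], [32], [64], [4, 4], [8, 8], [16, 16], [32, 32],
              [4, 4, 4], [8, 8, 8], [16, 16, 16]]
dropouts = [0.0, 0.2, 0.4, 0.6, 0.8]
train_embeds = [True, False]

# Filter counts depend on how many kernels are in use, so count per kernel set
n_runs = sum(
    1
    for k_sizes in kernel_size_sets
    for _ in product(filter_counts_by_len[len(k_sizes)], dense_dims, dropouts, train_embeds)
)
print(n_runs)   # 8280
```

That is (3 x 7 + 6 x 8) x 12 x 5 x 2 = 8280 models at 30 epochs each, so it is worth timing a single run first to estimate the total.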



### Comments

1. max_len = 100

epochs = 30

batch_size = 10

embed_dim = 50

num_filters = [2, 2, 2]

kernel_sizes = [2, 3, 4]

dense_layer_dims = [8]

dropout_rate = 0.1

train_embeds = False  # Whether we allow the embeddings to be changed

opt = 'adam'

opt = SGD(learning_rate=0.1)

opt = RMSprop(learning_rate=0.001)






### Dataset Citations

@inproceedings{arslan2020claimbuster,
    title={{A Benchmark Dataset of Check-worthy Factual Claims}},
    author={Arslan, Fatma and Hassan, Naeemul and Li, Chengkai and Tremayne, Mark },
    booktitle={14th International AAAI Conference on Web and Social Media},
    year={2020},
    organization={AAAI}
}

@article{meng2020gradient,
  title={Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims},
  author={Meng, Kevin and Jimenez, Damian and Arslan, Fatma and Devasier, Jacob Daniel and Obembe, Daniel and Li, Chengkai},
  journal={arXiv preprint arXiv:2002.07725},
  year={2020}
}

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.


### Useful resources in building this
Using pre-trained word embeddings: https://keras.io/examples/nlp/pretrained_word_embeddings/