# Overview

We will build a recurrent neural network (a bidirectional LSTM) for text classification, with softmax as the activation of the output layer. The steps are:

- Instantiate required Python components.
- Set hyperparameters.
- Read the CSV data.
- Remove unused fields.
- Keep only the message in the JSON.
- Define two lists: messages and labels.
- Split data between training and validation sets.
- Tokenize words.
- Pad sequences so they are the same size.
- Build the LSTM model.
- Train for several epochs.
- Plot loss and accuracy to view the model's performance.
- Make predictions.


# Instantiate required Python components.

Our project will use TensorFlow to develop the model. We'll also need pandas and NumPy to work with the CSV data, and NLTK for its list of English stopwords.

In [1]:
import pandas as pd
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

2023-01-08 03:06:59.095245: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-08 03:06:59.208791: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-01-08 03:06:59.208819: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-08 03:06:59.790162: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory

# Set Hyperparameters

This section collects the key parameters for tokenization, the embedding size, padding, and the train/validation split in one place.

In [2]:
vocab_size = 5000
embedding_dim = 64
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

# The file that contains the data.
SAVE_DIRECTORY = "../artifacts/data/sources"
FILE_MESSAGES = f"{SAVE_DIRECTORY}/20221220-message-incidents.csv"

# Read the CSV data

Read the CSV contents into a pandas DataFrame. We will narrow it down to specific fields in the next section.

In [3]:
# Open file and save to dataframe.
df = pd.read_csv(FILE_MESSAGES)

# print(df.columns)

# Preprocess Data

As part of the machine learning process, we will remove fields that are not required, fix missing values, remove noisy data, and take any additional steps needed to prepare the data for training.

## Keep Labels and Messages

We will keep only the columns that are important to the model.

In [4]:
# Keep specific columns.
df = df[["reason", "messages"]]

print(df.columns)

Index(['reason', 'messages'], dtype='object')


## Remove Empty Messages Data

Let's remove any row whose messages array is empty.

In [5]:
# Create a boolean mask that is True for rows whose 'messages' value is an empty list
removeEmptyMessages = df['messages'].apply(lambda x: x == '[]')

# Use the mask to drop the rows with empty lists
df = df.drop(index=df[removeEmptyMessages].index)

print(f'Total number of rows after removing empty lists: {len(df)}')

Total number of rows after removing empty lists: 4042
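
The same filter can also be written as a single boolean-mask selection. Below is a minimal sketch on a toy frame; the sample rows and the names `toy`, `is_empty`, and `filtered` are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real CSV; the rows are made up for illustration.
toy = pd.DataFrame({
    "reason": ["language", "Small caps", "language please"],
    "messages": ['[{"message": "hello"}]', "[]", '[{"message": "NICE SAW IT"}]'],
})

# True where the serialized list is empty.
is_empty = toy["messages"] == "[]"

# Keep only the rows whose messages list is not empty.
filtered = toy[~is_empty]

print(len(filtered))
```

Selecting with the inverted mask avoids the intermediate index lookup, but both forms should keep the same rows.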


## Remove JSON and Keep Message Field

Each value in the 'messages' column is a JSON array of message objects stored as a string. We will parse it and keep only the 'message' field of the first entry.

In [6]:
import json

# Define a function to extract the message field from the JSON
def extract_message(messageString):
    # Convert from String to JSON
    messageToJson = json.loads(messageString)
    
    return messageToJson[0]['message']

# Apply the function to the 'messages' column and create a new 'singleMessage' column with the 1st message only.
df['singleMessage'] = df['messages'].apply(extract_message)
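
The function above assumes every row holds valid JSON with at least one entry that has a 'message' key. If that cannot be guaranteed, a more defensive variant might look like the sketch below. The name `extract_first_message` is hypothetical, and returning None for bad rows is just one possible choice.

```python
import json

def extract_first_message(message_string):
    """Return the 'message' field of the first entry, or None if unavailable."""
    try:
        entries = json.loads(message_string)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON.
        return None
    if not isinstance(entries, list) or not entries:
        # Empty array or an unexpected top-level type.
        return None
    first = entries[0]
    if not isinstance(first, dict):
        return None
    return first.get("message")

print(extract_first_message('[{"message": "NICE SAW IT"}]'))
print(extract_first_message("[]"))
print(extract_first_message("not json"))
```

Rows that come back as None could then be dropped before training.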


## Remove unused fields.

In [7]:
# Remove the 'messages' column. drop() returns a new dataframe, so assign the result back.
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df = df.drop(['messages'], axis=1)
df


Unnamed: 0,reason,singleMessage
0,Account number visible. Please remove from con...,a.b.c.warriortrading.com
1,Inappropriate comment.,mammkd. sdkkf
2,Caps for tickers only.,wattior
3,Caps for tickers only.,wattior
4,Inappropriate comment.,wattior
...,...,...
4115,"Per our Chat Room rules, we ask that capital l...",NICE SAW IT
4116,Your message was deleted as it was deemed to b...,"Jorge is killing it right now, y'all XD"
4117,This post is best for the Lounge where we enco...,I guarantee you Jorge just made my salary toda...
4118,Your message was deleted as it was deemed to b...,gm Mark. We've been passing around the [$XBI]...


# ▶️ Remove Stop Words

We'll remove common English stopwords (such as "the" and "is"), which carry little signal for the classifier.

In [8]:
def removeStopwords(text):
    # Split the text into words
    words = text.split()
    
    # Use a list comprehension to remove the stopwords
    filtered_words = [word for word in words if word.lower() not in STOPWORDS]
    
    # Join the filtered words back into a single string
    filtered_text = ' '.join(filtered_words)
    
    return filtered_text

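# Remove stopwords from every value in the 'singleMessage' column.
# Assigning the result back to the column guarantees the change is stored;
# writing to the rows yielded by iterrows() is not guaranteed to modify the dataframe.
df['singleMessage'] = df['singleMessage'].apply(removeStopwords)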


labels = df['reason']
messages = df['singleMessage']

# Make sure both labels and messages have the same length.
print(f'Labels: {len(labels)}')
print(f'Messages: {len(messages)}')

# print(df)

Labels: 4042
Messages: 4042
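
To see what the filter does and does not catch, here is a small self-contained sketch. It uses a tiny hand-picked stopword set (`SAMPLE_STOPWORDS`, an assumption for illustration) instead of the NLTK list, so it runs without any download.

```python
# A tiny stand-in for NLTK's English stopword list (assumption for illustration).
SAMPLE_STOPWORDS = {"the", "is", "a", "to", "what"}

def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    # Split on whitespace, drop stopwords case-insensitively, and rejoin.
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

print(remove_stopwords("The market is a casino"))
print(remove_stopwords("want taxes....to improve"))
```

Because the text is split on whitespace only, a stopword glued to punctuation (as in "taxes....to") is treated as part of a longer token and survives the filter.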


## Removing Punctuations and Cleaning Special Characters

# Split into Training and Validation Set

In [9]:
train_size = int(len(messages) * training_portion)

train_messages = messages[0: train_size]
train_labels = labels[0: train_size]

validation_messages = messages[train_size:]
validation_labels = labels[train_size:]

print(train_size)
print(len(train_messages))
print(len(train_labels))
print(len(validation_messages))
print(len(validation_labels))

3233
3233
3233
809
809
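
The slices above take the first 80% of the rows in file order. If the CSV happens to be grouped by reason, the validation set may end up with labels that rarely or never appear in training. One possible alternative, sketched here with NumPy, is to shuffle before splitting. The helper name `shuffled_split`, the seed, and the toy data are assumptions for illustration.

```python
import numpy as np

def shuffled_split(items, labels, train_fraction=0.8, seed=42):
    """Shuffle items and labels together, then split by position."""
    items = np.asarray(items, dtype=object)
    labels = np.asarray(labels, dtype=object)
    rng = np.random.default_rng(seed)
    # One shared permutation keeps every message paired with its own label.
    order = rng.permutation(len(items))
    cut = int(len(items) * train_fraction)
    train_idx, val_idx = order[:cut], order[cut:]
    return items[train_idx], labels[train_idx], items[val_idx], labels[val_idx]

msgs = [f"message {i}" for i in range(10)]
lbls = [f"label {i % 3}" for i in range(10)]
tr_m, tr_l, va_m, va_l = shuffled_split(msgs, lbls)
print(len(tr_m), len(va_m))
```

Fixing the seed keeps the split reproducible between runs.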


## Words Not in Index as OOV

In [10]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_messages)
word_index = tokenizer.word_index

In [11]:
dict(list(word_index.items())[0:10])

{'<OOV>': 1,
 'lol': 2,
 'com': 3,
 'https': 4,
 'warriortrading': 5,
 'ross': 6,
 'like': 7,
 'long': 8,
 'quote': 9,
 'trading': 10}

## Turn Tokens to Lists of Sequence

In [12]:
train_sequences = tokenizer.texts_to_sequences(train_messages)

In [13]:
print(len(train_sequences[20]))

8


## Train Sequences Padding

When training a neural network, the input sequences must all be the same size. To achieve that, we'll pad the sequences with zeros so they all have the same length.

In [14]:
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [15]:
print(len(train_sequences[0]))
print(len(train_padded[0]))

print(len(train_sequences[11]))
print(len(train_padded[11]))

print(len(train_sequences[41]))
print(len(train_padded[41]))

5
200
1
200
16
200
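
To make the effect of padding='post' and truncating='post' concrete, here is a small NumPy sketch of the same idea. It is an illustration only, not the Keras implementation, and the helper name `pad_post` is made up.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Zero-pad (or cut) each sequence at the end so all have length maxlen."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # 'post' truncation drops the tail
        padded[row, :len(kept)] = kept   # 'post' padding leaves zeros at the end
    return padded

print(pad_post([[5, 3], [7, 8, 9, 4, 2]], maxlen=4))
```

The short sequence gains trailing zeros and the long one loses its last element, so every row ends up with the same length.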


In [16]:
print(train_sequences[41])

[131, 292, 20, 93, 2307, 237, 46, 7, 21, 56, 91, 2308, 76, 2309, 556, 2310]


In [17]:
print(train_padded[41])

[ 131  292   20   93 2307  237   46    7   21   56   91 2308   76 2309
  556 2310    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

## Validation Sequence Padding

We do the same for the validation messages, reusing the tokenizer that was fitted on the training messages.

In [18]:
validation_sequences = tokenizer.texts_to_sequences(validation_messages)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(len(validation_sequences))
print(validation_padded.shape)

809
(809, 200)


# Tokenize Labels

Since the model works with numbers rather than text, we'll tokenize the labels as well.

In [19]:
# Current labels
print(set(labels))

{'Please Email in any direct any feedback about the chat room. We posted info about Ross and the streams in the announcements earlier, thank you!', 'Putting on watch for their political commentary, see deleted messages', 'This post may be interpreted as financial advice or encouraging others to buy, sell, or hold. It has therefore been removed in order to comply with market rules and regulations. Learn more on our Chat Room Rules page.', '-------', 'please reach out to broker', 'language', 'please no buy alerts', 'language please', "You talk about alcohol too much in this room. That can be difficult for people who have an issue with it.  Trading is a stressful environment and adding too much talk about alcohol isn't needed.   Besides, Warrior strives to make this a family friendly room.  Thank you, Katya for being more mindful about this topic.  ", 'Small caps', 'P/L claims are limited to 3 per day. Please reach out to us at team@warriortrading.com or https://warrior.app/contact for mo

In [20]:
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

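# TODO: Investigate this part. It's not working with TensorFlow: each label becomes a
# variable-length list of word IDs, so the result is an object array of lists.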
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels), dtype=object)
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels), dtype=object)

In [21]:
print(label_tokenizer.texts_to_sequences(train_labels))

[[140, 195, 210, 29, 203, 157, 89, 221, 222], [51, 56], [87, 1, 20, 34], [87, 1, 20, 34], [51, 56], [51, 56], [94, 93, 95, 89, 33, 50], [51, 56], [51, 56], [51, 56], [166, 59], [166, 59], [166, 59], [], [166, 59], [94, 93, 95, 89, 33, 50], [130, 33, 50, 131, 25, 132, 4, 2, 26], [130, 33, 50, 131, 25, 132, 4, 2, 26], [130, 33, 50, 131, 25, 132, 4, 2, 26], [130, 33, 50, 131, 25, 132, 4, 2, 26], [130, 33, 50, 131, 25, 132, 4, 2, 26], [130, 33, 50, 131, 25, 132, 4, 2, 26], [51, 56], [51, 56], [96, 27, 90, 59, 33, 50, 5, 3], [96, 27, 90, 59, 33, 50, 5, 3], [], [], [131, 211], [], [14, 15, 29, 53, 4, 2, 18], [96, 27, 90, 59, 33, 50, 5, 3], [166, 59], [16, 11, 153, 403, 4, 275, 35, 110], [166, 59, 27, 153, 161], [339, 12, 263, 511], [339, 12, 263, 51, 360, 29, 361, 157, 151, 362, 25, 363, 110], [339, 12, 263, 51, 360, 29, 361, 157, 151, 362, 25, 363, 110], [339, 12, 263, 51, 360, 29, 361, 157, 151, 362, 25, 363, 110], [], [130, 33, 50, 131, 25, 132, 4, 2, 26], [96, 27, 90, 59, 33, 50, 5, 3], 

In [22]:
# Analyze the tokenized labels.
print(training_label_seq[0])
print(training_label_seq[1])
print(training_label_seq[2])
print(training_label_seq.shape)

print(validation_label_seq[0])
print(validation_label_seq[1])
print(validation_label_seq[2])
print(validation_label_seq.shape)

[140, 195, 210, 29, 203, 157, 89, 221, 222]
[51, 56]
[87, 1, 20, 34]
(3233,)
[35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 23, 19, 22, 1, 7, 8, 6, 44, 26, 39, 16, 11, 1, 17]
[35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 23, 19, 22, 1, 7, 8, 6, 44, 26, 39, 16, 11, 1, 17]
[35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 23, 19, 22, 1, 7, 8, 6, 44, 26, 39, 16, 11, 1, 17]
(809,)


## Explore Message Original vs After Padding

In [23]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_message(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_message(train_padded[21]))
print('---')
print(train_messages[21])

thats biden want raise taxes to improve platform ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
---
Thats Biden want raise taxes....to improve platform
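
The long run of '?' comes from the padding: index 0 is reserved for padding and never appears in word_index, so the reverse lookup falls back to '?'. A sketch of a decoder that skips the padding is shown below. The toy index and the name `decode_without_padding` are assumptions for illustration.

```python
# Toy index for illustration; the real one comes from tokenizer.word_index.
toy_word_index = {"<OOV>": 1, "lol": 2, "like": 3}
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode_without_padding(sequence, reverse_index):
    # Index 0 is reserved for padding, so skip it instead of printing '?'.
    return " ".join(reverse_index.get(i, "?") for i in sequence if i != 0)

print(decode_without_padding([2, 3, 1, 0, 0, 0], toy_reverse_index))
```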


# Model Training

## Instantiate Keras

The output layer uses softmax as its activation, which turns the scores for the candidate classes into a probability distribution.
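
As a quick illustration of what softmax does, here is a minimal NumPy sketch. It is separate from the Keras layer, and the input scores are made up.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(3), probs.sum())
```

The outputs are all positive and sum to 1, and the largest score receives the largest probability.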

In [24]:
model = tf.keras.Sequential([
    # Embedding layer: input vocabulary of vocab_size (5000) tokens, output vectors of
    # size embedding_dim (64), both set in the hyperparameters at the top.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    
    # Fully connected layer with ReLU activation (used here in place of tanh).
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    
    # Output layer with 6 units and softmax activation, which converts the outputs into a
    # probability distribution. The number of units must match the number of label classes.
    tf.keras.layers.Dense(6, activation='softmax')
])
model.summary()

2023-01-08 03:07:01.810250: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-01-08 03:07:01.810307: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
2023-01-08 03:07:01.810326: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ba1c25b27f4f): /proc/driver/nvidia/version does not exist
2023-01-08 03:07:01.810538: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          320000    
                                                                 
 bidirectional (Bidirectiona  (None, 128)              66048     
 l)                                                              
                                                                 
 dense (Dense)               (None, 64)                8256      
                                                                 
 dense_1 (Dense)             (None, 6)                 390       
                                                                 
Total params: 394,694
Trainable params: 394,694
Non-trainable params: 0
_________________________________________________________________
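
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
vocab_size, embedding_dim, lstm_units, num_classes = 5000, 64, 64, 6

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
one_direction = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)
bidirectional = 2 * one_direction

# Dense layers: weights plus one bias per unit. The first Dense layer receives the
# concatenated forward and backward LSTM outputs (2 * lstm_units values).
dense = (2 * lstm_units) * embedding_dim + embedding_dim
output = embedding_dim * num_classes + num_classes

total = embedding + bidirectional + dense + output
print(embedding, bidirectional, dense, output, total)
```

These values line up with the Param # column and the total reported by model.summary().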


In [25]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
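
The sparse_categorical_crossentropy loss expects one integer class ID per sample and compares it with the softmax probabilities. The NumPy sketch below illustrates the idea. It is not the Keras implementation, and the helper name `sparse_cce` and the sample numbers are made up.

```python
import numpy as np

def sparse_cce(class_ids, probabilities):
    """Mean negative log-probability assigned to the true class of each sample."""
    rows = np.arange(len(class_ids))
    picked = probabilities[rows, class_ids]  # probability of the true class per row
    return float(-np.log(picked).mean())

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
true_ids = np.array([0, 1])
loss = sparse_cce(true_ids, probs)
print(round(loss, 4))
```

The loss shrinks toward 0 as the probability of the correct class approaches 1.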

In [26]:
# print(training_label_seq_list)

# print(len(train_padded))
# print(len(training_label_seq))

# print(training_label_seq)

# print(training_label_seq)
print(validation_label_seq)

[list([35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 23, 19, 22, 1, 7, 8, 6, 44, 26, 39, 16, 11, 1, 17])
 list([35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 23, 19, 22, 1, 7, 8, 6, 44, 26, 39, 16, 11, 1, 17])
 list([35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 23, 19, 22, 1, 7, 8, 6, 44, 26, 39, 16, 11, 1, 17])
 list([209, 34, 29]) list([209, 34, 29])
 list([58, 24, 3, 13, 30, 86, 88, 62, 63, 3, 13, 30, 10, 65, 52, 72, 73, 5, 74, 38, 22, 1, 32, 20, 34, 4, 70, 40, 61, 75, 1, 55, 4, 76, 7, 8, 77, 2, 3, 68, 69, 2, 78, 79, 5, 80, 25, 32, 20, 60, 64, 81, 71, 5, 2, 3, 13, 7, 8, 66, 36, 82, 10, 19, 49, 9, 57, 6, 67, 83, 25, 84, 7, 8, 12, 85, 5, 24, 3, 23, 16, 11, 1, 21, 17])
 list([21, 102, 54, 91, 36, 40, 54, 117, 4, 38, 51, 27, 33, 5, 118, 99, 2, 103, 104, 119, 120, 1, 2, 3, 16, 11, 1, 21, 17])
 list([35, 28, 12, 41, 1, 2, 18, 46, 10, 47, 31, 37, 9, 48, 6, 43, 45, 14, 15, 42, 2, 9, 

In [27]:
# TODO: Convert training_label_seq to a Tensor.
## Convert to Tensor
# training_label_seq_tensor = tf.constant(training_label_seq)

# Cast the Tensor to dtype float32
# training_label_seq_tensor = tf.cast(training_label_seq_tensor, dtype=tf.float32)

num_epochs = 10
history = model.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)


ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
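
The error arises because the label arrays hold variable-length lists of word IDs (one list per label), while the loss expects a single integer class ID per sample. One possible fix, sketched here under the assumption that each distinct reason string is treated as its own class and that no label is missing, is to map every label to an integer. The helper name `encode_labels` and the sample labels are hypothetical.

```python
import numpy as np

def encode_labels(train_labels, validation_labels):
    """Map each distinct label string to an integer class ID."""
    # Build the mapping from every label seen in either split, in sorted order,
    # so that the IDs are reproducible.
    classes = sorted(set(train_labels) | set(validation_labels))
    class_to_id = {label: idx for idx, label in enumerate(classes)}
    train_ids = np.array([class_to_id[l] for l in train_labels], dtype=np.int32)
    validation_ids = np.array([class_to_id[l] for l in validation_labels], dtype=np.int32)
    return train_ids, validation_ids, class_to_id

train_ids, validation_ids, class_to_id = encode_labels(
    ["language", "Small caps", "language"],
    ["Small caps", "language please"],
)
print(train_ids, validation_ids, len(class_to_id))
```

With labels in this form, the arrays are plain integer vectors that TensorFlow can convert, and the final Dense layer would need len(class_to_id) units rather than a fixed 6.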