# Overview

We will be doing the following to create a Deep Neural Network using RNN and Softmax as the activation output layer:

- Instantiate required Python components.
- Set Hyperparameters
- Read the CSV data
- Remove unused fields.
- Keep only the message in the JSON.


# Instantiate required Python components.

Our project will use TensorFlow for developing our model.  We'll also need several other Python libraries to work with our CSV.

In [1]:
import re
import pandas as pd
import csv
import numpy as np
import json
# import tensorflow as tf
# from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

# Used for Troubleshooting
from IPython.display import display

# Read config.json
with open('../config.json', 'r') as config_file:
    config = json.load(config_file)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Set Hyperparameters

This handy section will control all the important parameters for our model.

In [2]:
# The file that contains the data.
FILE_MESSAGES = f"{config['save_directory']}/data/sources/20240722-message-incidents.csv"

# Read the CSV data

Read the CSV contents and keep only specific fields.

In [3]:
# Open file and save to dataframe.
df = pd.read_csv(FILE_MESSAGES)

#print(df.columns)
display(df)

Unnamed: 0,id,userId,roomId,operatorId,undoUserId,userRole,userName,operatorName,undoUserName,undoAt,reason,actionType,unitCount,messages,createdAt,updatedAt,expiredAt,uuid,originalAction
0,1,130504,100001,103674,,3,quotetester01,Chaofan,,,Account number visible. Please remove from con...,1,1,"[{""id"":""3866400"",""message"":""a.b.c.warriortradi...",4/23/21 4:23 AM,4/23/21 4:23 AM,,,0
1,2,130504,100001,103674,,3,quotetester01,Chaofan,,,Inappropriate comment.,1,1,"[{""id"":""3968143"",""message"":""mammkd. sdkkf"",""cr...",4/23/21 4:58 AM,4/23/21 4:58 AM,,,0
2,3,130504,100001,103674,,3,quotetester01,Chaofan,,,Caps for tickers only.,2,1,"[{""id"":""3968144"",""message"":""wattior"",""createdA...",4/23/21 5:22 AM,4/23/21 5:22 AM,,,0
3,4,130504,100001,103674,,3,quotetester01,Chaofan,,,Caps for tickers only.,2,1,"[{""id"":""3968144"",""message"":""wattior"",""createdA...",4/23/21 7:53 AM,4/23/21 7:53 AM,,,0
4,5,130504,100001,103674,,3,quotetester01,Chaofan,,,Inappropriate comment.,2,1,"[{""id"":""3968144"",""message"":""wattior"",""createdA...",4/23/21 7:54 AM,4/23/21 7:54 AM,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6195,6196,151406,100003,101755,,3,ali al,Arsh,,,This comment is more appropriate for the Suppo...,5,1,"[{""id"":""8962376"",""message"":""top gapper scanner...",7/17/24 3:39 PM,7/17/24 3:39 PM,,eeb7088b-fb06-49dd-a478-48d9cac6860d,5
6196,6197,151426,100003,101755,,3,Todd Mar,Arsh,,,Your message was deleted because it is incompr...,1,1,"[{""id"":""8962871"",""message"":""333333333333333333...",7/17/24 5:37 PM,7/17/24 5:37 PM,,ca174b5c-222f-42e1-8b58-0cda6a9d04db,1
6197,6198,151694,100008,101755,,3,Stephen Had,Arsh,,,This comment is more appropriate for the Suppo...,5,1,"[{""id"":""8963052"",""message"":""Hi Folks, can any ...",7/17/24 6:08 PM,7/17/24 6:08 PM,,bdd15c05-3b06-48be-ae87-cbb6bd510fba,5
6198,6199,150332,100008,101755,,3,Avi Bar,Arsh,,,This comment is more appropriate for the Suppo...,5,1,"[{""id"":""8963249"",""message"":""does anyone know w...",7/17/24 7:11 PM,7/17/24 7:11 PM,,c06cb1db-5829-4b33-a940-0404c490eea7,5


# Preprocess Data

As part of the Machine Learning process, we will remove fields not required, fix missing values, remove noisy data, and any additional steps to prepare for the ML training process.

## Keep Labels and Messages

We will keep only specific columns that is important to the model.

In [4]:
# Keep specific columns.
df = df[["reason", "messages"]]

print(df.columns)
print(f'Total number of rows: {len(df)}')

Index(['reason', 'messages'], dtype='object')
Total number of rows: 6200


## Remove Empty Messages Data

Let's remove any message column if the array is empty.

In [5]:
# Create a boolean mask to select columns with only empty lists
removeEmptyMessages = df['messages'].apply(lambda x: x == '[]')

# Use the mask to drop the columns with only empty lists
df = df.drop(index=df[removeEmptyMessages].index)

print(f'Total number of rows after removing empty lists: {len(df)}')

Total number of rows after removing empty lists: 6104


## Remove JSON and Keep Message Field

We will remove the JSON formatting and keep the message field.

In [6]:
import json

# Define a function to extract the message field from the JSON
def extract_message(messageString):
    # Convert from String to JSON
    messageToJson = json.loads(messageString)
    
    return messageToJson[0]['message']

# Apply the function to the 'json' column and create a new 'message' column with the 1st message only.
df['singleMessage'] = df['messages'].apply(extract_message)

# Let's see our progress.
display(df)


Unnamed: 0,reason,messages,singleMessage
0,Account number visible. Please remove from con...,"[{""id"":""3866400"",""message"":""a.b.c.warriortradi...",a.b.c.warriortrading.com
1,Inappropriate comment.,"[{""id"":""3968143"",""message"":""mammkd. sdkkf"",""cr...",mammkd. sdkkf
2,Caps for tickers only.,"[{""id"":""3968144"",""message"":""wattior"",""createdA...",wattior
3,Caps for tickers only.,"[{""id"":""3968144"",""message"":""wattior"",""createdA...",wattior
4,Inappropriate comment.,"[{""id"":""3968144"",""message"":""wattior"",""createdA...",wattior
...,...,...,...
6195,This comment is more appropriate for the Suppo...,"[{""id"":""8962376"",""message"":""top gapper scanner...",top gapper scanners only shows between 9 and 9...
6196,Your message was deleted because it is incompr...,"[{""id"":""8962871"",""message"":""333333333333333333...",33333333333333333333333333333333333334yuuuuuuu...
6197,This comment is more appropriate for the Suppo...,"[{""id"":""8963052"",""message"":""Hi Folks, can any ...","Hi Folks, can any tell me how you get the Star..."
6198,This comment is more appropriate for the Suppo...,"[{""id"":""8963249"",""message"":""does anyone know w...",does anyone know what happens if you hold a po...


## Remove unused fields.

In [7]:
# Remove 'messages'
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df.drop(['messages'], axis=1, inplace=True)

# Let's see our progress.
display(df)

Unnamed: 0,reason,singleMessage
0,Account number visible. Please remove from con...,a.b.c.warriortrading.com
1,Inappropriate comment.,mammkd. sdkkf
2,Caps for tickers only.,wattior
3,Caps for tickers only.,wattior
4,Inappropriate comment.,wattior
...,...,...
6195,This comment is more appropriate for the Suppo...,top gapper scanners only shows between 9 and 9...
6196,Your message was deleted because it is incompr...,33333333333333333333333333333333333334yuuuuuuu...
6197,This comment is more appropriate for the Suppo...,"Hi Folks, can any tell me how you get the Star..."
6198,This comment is more appropriate for the Suppo...,does anyone know what happens if you hold a po...


## Remove Stop Words

We'll remove words not needed for the training.

In [8]:
def removeStopwords(text):
    # Split the text into words
    words = text.split()
    
    # Use a list comprehension to remove the stopwords
    filtered_words = [word for word in words if word.lower() not in STOPWORDS]
    
    # Join the filtered words back into a single string
    filtered_text = ' '.join(filtered_words)
    
    return filtered_text

# Iterate through the rows of the dataframe
for index, row in df.iterrows():
    # Remove stopwords from the 'text' column
    row['singleMessage'] = removeStopwords(row['singleMessage'])

# labels = df['reason']
# messages = df['singleMessage']

# Make sure both labels and messages have the same length.
print(f'Labels: {len(df["reason"])}')
print(f'Messages: {len(df["singleMessage"])}')

# Let's see our progress.
display(df)

Labels: 6104
Messages: 6104


Unnamed: 0,reason,singleMessage
0,Account number visible. Please remove from con...,a.b.c.warriortrading.com
1,Inappropriate comment.,mammkd. sdkkf
2,Caps for tickers only.,wattior
3,Caps for tickers only.,wattior
4,Inappropriate comment.,wattior
...,...,...
6195,This comment is more appropriate for the Suppo...,top gapper scanners only shows between 9 and 9...
6196,Your message was deleted because it is incompr...,33333333333333333333333333333333333334yuuuuuuu...
6197,This comment is more appropriate for the Suppo...,"Hi Folks, can any tell me how you get the Star..."
6198,This comment is more appropriate for the Suppo...,does anyone know what happens if you hold a po...


# Clean Data

## Replace Stock symbols with STOCKSYMBOL

We will replace stock symbols with "STOCKSYMBOL".  We're making this friendly for word2vec to process.

In [9]:
def replaceStockSymbol(text):
    return re.sub('\[\$.+\]\(.+\)', 'STOCKSYMBOL', text)

## Removing Punctuations and Cleaning Special Characters

Word2Vec doesn't do so well with punctuations and special characters.  We're going to remove them.

In [10]:
def clean_text(x):
    pattern = r'[^a-zA-z0-9\s]'
    text = re.sub(pattern, '', x)
    return x

## Replace Numbers with #s

In [11]:
def clean_numbers(x):
    if bool(re.search(r'\d', x)):
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]**{4}**', '####', x)
        x = re.sub('[0-9]**{3}**', '###', x)
        x = re.sub('[0-9]**{2}**', '##', x)
    return x

## Remove chatroom-resource links.

Remove any row that contains chatroom-resource.warriortrading.com because that'll confuse the training process.

In [12]:
# remove any row containing 'https://chatroom-resource.website.com'
df = df[~df['singleMessage'].str.contains('https://chatroom-resource')]


## Off-topic Reasons

Lots of off-topic reasons that appear to be the same.  We will change all the off-topic reasons to just be 'Off-topic'.

In [13]:
# TODO: Convert all related off-topic to only: Off-topic
df.loc[df['reason'].str.contains('off topic', case=False), 'reason'] = 'Off-topic'
df.loc[df['reason'].str.contains('lounge', case=False), 'reason'] = 'Off-topic'
df.loc[df['reason'].str.contains('Loung', case=False), 'reason'] = 'Off-topic'


## Shorten Long Reasons

Several reasons are too long.  Shorten them so they are easier for ML.

In [14]:
reasons_tooLong = [
    
]

## Remove Contractions

Remove words such as:
- ain't
- can't

## Keep Specific Reasons

We will be keeping specific reasons because there are test reasons that shouldn't be part of the Machine Learning process.

In [15]:
keepStrings = [
    'Off-topic',
    'Account number visible. Please remove from content before reposting.',
    'Inappropriate comment.',
    'Caps for tickers only.',
    'Third-party links / content not allowed.',
    'False information.',
    'Politics not allowed outside of references to the market.',
    'Personal or sensitive information not allowed in chat.',
    'outside link',
    'False information or no source.',
    'Perv is an inappropriate term please refrain from these kinds of discussions here',
    'competitor ',
    'Inappropriate comment.',
    'Loung please ',
    'Language',
    'language please',
    'Bullying a member or moderator.',
    'Support Room would be more appropriate for this inquiry.',
    'Reviewed by admin internally; not necessary to post to public chat.',
    'comments about market manipulation not allowed even if joking',
    'talking about drugs ',
    'no advice',
    'CMEG complaint',
    'gray area',
    'possible password',
    'Might be searching for offline contact',
    'Stop spamming and no caps'
    'private info',
    'password',
    'professional ',
    'Discord is not an allowed word in the room. Please don\'t bypass the filter ',
    'File type not allowed',
    '"Any discussion related in any way to market manipulation is strictly prohibited, as is advising others on whether to buy, sell, or hold."',
    'English only please!',
    'No reference to 3rd party contact',
    'Not sure what this is',
    'Please provide more information when making comments like these. For example "AFRM is being shorted every candle, so I think it\'s manipulated" ',
    'False or misleading information, or no source.',
    'Bypassing the chat filters is not allowed.'

]

In [16]:
# Test the results
#display(df)
display(df[df['reason'].isin(keepStrings)])

Unnamed: 0,reason,singleMessage
0,Account number visible. Please remove from con...,a.b.c.warriortrading.com
1,Inappropriate comment.,mammkd. sdkkf
2,Caps for tickers only.,wattior
3,Caps for tickers only.,wattior
4,Inappropriate comment.,wattior
...,...,...
6165,Off-topic,do you know what tradezeros is mike
6166,Off-topic,what is the minimum you can fund an Interactiv...
6167,Off-topic,"i have 3.000 to open account, what broker woul..."
6180,Off-topic,I am new to Warrior and am curious as to which...


In [17]:

def remove_contractions(text):
    contractions_dict = { 
        "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        "can't've": "cannot have",
        "'cause": "because",
        "could've": "could have",
        "couldn't": "could not",
        "couldn't've": "could not have",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hadn't've": "had not have",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'd've": "he would have",
        "he'll": "he will",
        "he'll've": "he will have",
        "he's": "he is",
        "how'd": "how did",
        "how'd'y": "how do you",
        "how'll": "how will",
        "how's": "how is",
        "i'd": "I would",
        "i'd've": "I would have",
        "i'll": "I will",
        "i'll've": "I will have",
        "i'm": "I am",
        "i've": "I have",
        "isn't": "is not",
        "it'd": "it would",
        "it'd've": "it would have",
        "it'll": "it will",
        "it'll've": "it will have",
        "it's": "it is",
        "let's": "let us",
        "ma'am": "madam",
        "mayn't": "may not",
        "might've": "might have",
        "mightn't": "might not",
        "mightn't've": "might not have",
        "must've": "must have",
        "mustn't": "must not",
        "mustn't've": "must not have",
        "needn't": "need not",
        "needn't've": "need not have",
        "o'clock": "of the clock",
        "oughtn't": "ought not",
        "oughtn't've": "ought not have",
        "shan't": "shall not",
        "sha'n't": "shall not",
        "shan't've": "shall not have",
        "she'd": "she would",
        "she'd've": "she would have",
        "she'll": "she will",
        "she'll've": "she will have",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "shouldn't've": "should not have",
        "so've": "so have",
        "so's": "so is",
        "that'd": "that would",
        "that'd've": "that would have",
        "that's": "that is",
        "there'd": "there would",
        "there'd've": "there would have",
        "there's": "there is",
        "they'd": "they would",
        "they'd've": "they would have",
        "they'll": "they will",
        "they'll've": "they will have",
        "they're": "they are",
        "they've": "they have",
        "to've": "to have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'd've": "we would have",
        "we'll": "we will",
        "we'll've": "we will have",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what'll've": "what will have",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "when's": "when is",
        "when've": "when have",
        "where'd": "where did",
        "where's": "where is",
        "where've": "where have",
        "who'll": "who will",
        "who'll've": "who will have",
        "who's": "who is",
        "who've": "who have",
        "why's": "why is",
        "why've": "why have",
        "will've": "will have",
        "won't": "will not",
        "won't've": "will not have",
        "would've": "would have",
        "wouldn't": "would not",
        "wouldn't've": "would not have",
        "y'all": "you all",
        "y'all'd": "you all would",
        "y'all'd've": "you all would have",
        "y'all're": "you all are",
        "y'all've": "you all have",
        "you'd": "you would",
        "you'd've": "you would have",
        "you'll": "you will",
        "you'll've": "you will have",
        "you're": "you are",
        "you've": "you have"
    }
    
    contracts_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
    text = contracts_re.sub(lambda x: contractions_dict[x.group()], text)
    return text

In [18]:
# Let's test our remove_contractions(text) function.
text = "I ain't goin to the store, I can't find my keys"
text = remove_contractions(text)
print(text)

I is not goin to the store, I cannot find my keys


## Remove Index Column

For some reason there is a column for index.  Let's remove that.

## ▶️ Apply Functions to Data

We're going to take all the functions we have created and clean our dataframe.

In [19]:
# Apply the functions to the 'singleMessage' column

# replaceStockSymbol(text)
df['singleMessage'] = df['singleMessage'].apply(replaceStockSymbol)

# clean_text(text)
df['singleMessage'] = df['singleMessage'].apply(clean_text)

# remove_contractions()
df['singleMessage'] = df['singleMessage'].apply(remove_contractions)

# TODO: Keep Specific Reasons
df = df[df['reason'].isin(keepStrings)]

# TODO: clean_numbers()
# df['singleMessage'] = df['singleMessage'].apply(clean_numbers)

In [20]:
# Check out the latest dataframe update.
display(df)

Unnamed: 0,reason,singleMessage
0,Account number visible. Please remove from con...,a.b.c.warriortrading.com
1,Inappropriate comment.,mammkd. sdkkf
2,Caps for tickers only.,wattior
3,Caps for tickers only.,wattior
4,Inappropriate comment.,wattior
...,...,...
6165,Off-topic,do you know what tradezeros is mike
6166,Off-topic,what is the minimum you can fund an Interactiv...
6167,Off-topic,"i have 3.000 to open account, what broker woul..."
6180,Off-topic,I am new to Warrior and am curious as to which...


# 🚧 Save Data to Disk

Let's save all our hard work formatting the dataframe to a CSV for future reference.

- [Pandas DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

In [21]:
df.to_csv(f"{config['save_directory']}/data/output/1-preprocessed-data.csv", index=False)