# Overview

We will be doing the following to create a Deep Neural Network using RNN and Softmax as the activation output layer:

- Instantiate required Python components.
- Set Hyperparameters
- Read the CSV data
- Remove unused fields.
- Keep only the message in the JSON.


# Instantiate required Python components.

Our project will use TensorFlow for developing our model.  We'll also need several other Python libraries to work with our CSV.

In [1]:
import pandas as pd
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

2023-01-17 05:58:03.175763: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-17 05:58:03.285230: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-01-17 05:58:03.285252: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-17 05:58:03.843785: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-

# Set Hyperparameters

This handy section will control all the important parameters for our model.

In [2]:
# The file that contains the data.
FILE_MESSAGES = "./data/20221220-message-incidents.csv"

# Read the CSV data

Read the CSV contents and keep only specific fields.

In [3]:
# Open file and save to dataframe.
df = pd.read_csv(FILE_MESSAGES)

# print(df.columns)

# Preprocess Data

As part of the Machine Learning process, we will remove fields not required, fix missing values, remove noisy data, and any additional steps to prepare for the ML training process.

## Keep Labels and Messages

We will keep only specific columns that is important to the model.

In [4]:
# Keep specific columns.
df = df[["reason", "messages"]]

print(df.columns)

Index(['reason', 'messages'], dtype='object')


## Remove Empty Messages Data

Let's remove any message column if the array is empty.

In [5]:
# Create a boolean mask to select columns with only empty lists
removeEmptyMessages = df['messages'].apply(lambda x: x == '[]')

# Use the mask to drop the columns with only empty lists
df = df.drop(index=df[removeEmptyMessages].index)

print(f'Total number of rows after removing empty lists: {len(df)}')

Total number of rows after removing empty lists: 4042


## Remove JSON and Keep Message Field

We will remove the JSON formatting and keep the message field.

In [6]:
import json

# Define a function to extract the message field from the JSON
def extract_message(messageString):
    # Convert from String to JSON
    messageToJson = json.loads(messageString)
    
    return messageToJson[0]['message']

# Apply the function to the 'json' column and create a new 'message' column with the 1st message only.
df['singleMessage'] = df['messages'].apply(extract_message)


## Remove unused fields.

In [7]:
# Remove 'messages'
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df.drop(['messages'], axis=1)


Unnamed: 0,reason,singleMessage
0,Account number visible. Please remove from con...,a.b.c.warriortrading.com
1,Inappropriate comment.,mammkd. sdkkf
2,Caps for tickers only.,wattior
3,Caps for tickers only.,wattior
4,Inappropriate comment.,wattior
...,...,...
4115,"Per our Chat Room rules, we ask that capital l...",NICE SAW IT
4116,Your message was deleted as it was deemed to b...,"Jorge is killing it right now, y'all XD"
4117,This post is best for the Lounge where we enco...,I guarantee you Jorge just made my salary toda...
4118,Your message was deleted as it was deemed to b...,gm Mark. We've been passing around the [$XBI]...


## Remove Stop Words

We'll remove words not needed for the training.

In [8]:
def removeStopwords(text):
    # Split the text into words
    words = text.split()
    
    # Use a list comprehension to remove the stopwords
    filtered_words = [word for word in words if word.lower() not in STOPWORDS]
    
    # Join the filtered words back into a single string
    filtered_text = ' '.join(filtered_words)
    
    return filtered_text

# Iterate through the rows of the dataframe
for index, row in df.iterrows():
    # Remove stopwords from the 'text' column
    row['singleMessage'] = removeStopwords(row['singleMessage'])

# labels = df['reason']
# messages = df['singleMessage']

# Make sure both labels and messages have the same length.
print(f'Labels: {len(df["reason"])}')
print(f'Messages: {len(df["singleMessage"])}')


Labels: 4042
Messages: 4042


# Remove Unused Data

In [9]:
df = df.drop(['messages'], axis=1)

# Clean Data

## Removing Punctuations and Cleaning Special Characters

Word2Vec doesn't do so well with punctuations and special characters.  We're going to remove them.

In [10]:
import re

def clean_text(x):
    pattern = r'[^a-zA-z0-9\s]'
    text = re.sub(pattern, '', x)
    return x

## Replace Numbers with #s

In [11]:
def clean_numbers(x):
    if bool(re.search(r'\d', x)):
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]**{4}**', '####', x)
        x = re.sub('[0-9]**{3}**', '###', x)
        x = re.sub('[0-9]**{2}**', '##', x)
    return x

## Remove Contractions

Remove words such as:
- ain't
- can't

In [12]:

def remove_contractions(text):
    contractions_dict = { 
        "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        "can't've": "cannot have",
        "'cause": "because",
        "could've": "could have",
        "couldn't": "could not",
        "couldn't've": "could not have",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hadn't've": "had not have",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'd've": "he would have",
        "he'll": "he will",
        "he'll've": "he will have",
        "he's": "he is",
        "how'd": "how did",
        "how'd'y": "how do you",
        "how'll": "how will",
        "how's": "how is",
        "i'd": "I would",
        "i'd've": "I would have",
        "i'll": "I will",
        "i'll've": "I will have",
        "i'm": "I am",
        "i've": "I have",
        "isn't": "is not",
        "it'd": "it would",
        "it'd've": "it would have",
        "it'll": "it will",
        "it'll've": "it will have",
        "it's": "it is",
        "let's": "let us",
        "ma'am": "madam",
        "mayn't": "may not",
        "might've": "might have",
        "mightn't": "might not",
        "mightn't've": "might not have",
        "must've": "must have",
        "mustn't": "must not",
        "mustn't've": "must not have",
        "needn't": "need not",
        "needn't've": "need not have",
        "o'clock": "of the clock",
        "oughtn't": "ought not",
        "oughtn't've": "ought not have",
        "shan't": "shall not",
        "sha'n't": "shall not",
        "shan't've": "shall not have",
        "she'd": "she would",
        "she'd've": "she would have",
        "she'll": "she will",
        "she'll've": "she will have",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "shouldn't've": "should not have",
        "so've": "so have",
        "so's": "so is",
        "that'd": "that would",
        "that'd've": "that would have",
        "that's": "that is",
        "there'd": "there would",
        "there'd've": "there would have",
        "there's": "there is",
        "they'd": "they would",
        "they'd've": "they would have",
        "they'll": "they will",
        "they'll've": "they will have",
        "they're": "they are",
        "they've": "they have",
        "to've": "to have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'd've": "we would have",
        "we'll": "we will",
        "we'll've": "we will have",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what'll've": "what will have",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "when's": "when is",
        "when've": "when have",
        "where'd": "where did",
        "where's": "where is",
        "where've": "where have",
        "who'll": "who will",
        "who'll've": "who will have",
        "who's": "who is",
        "who've": "who have",
        "why's": "why is",
        "why've": "why have",
        "will've": "will have",
        "won't": "will not",
        "won't've": "will not have",
        "would've": "would have",
        "wouldn't": "would not",
        "wouldn't've": "would not have",
        "y'all": "you all",
        "y'all'd": "you all would",
        "y'all'd've": "you all would have",
        "y'all're": "you all are",
        "y'all've": "you all have",
        "you'd": "you would",
        "you'd've": "you would have",
        "you'll": "you will",
        "you'll've": "you will have",
        "you're": "you are",
        "you've": "you have"
    }
    
    contracts_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
    text = contracts_re.sub(lambda x: contractions_dict[x.group()], text)
    return text

In [13]:
text = "I ain't goin to the store, I can't find my keys"
text = remove_contractions(text)
print(text)

I is not goin to the store, I cannot find my keys


## ▶️ Apply Functions to Data

We're going to take all the functions we have created and clean our dataframe.

In [14]:
# Apply the functions to the 'singleMessage' column

# TODO: clean_text()
df['singleMessage'] = df['singleMessage'].apply(clean_text)

# TODO: clean_numbers()

# TODO: remove_contractions()

In [15]:
print(df)

                                                 reason  \
0     Account number visible. Please remove from con...   
1                                Inappropriate comment.   
2                                Caps for tickers only.   
3                                Caps for tickers only.   
4                                Inappropriate comment.   
...                                                 ...   
4115  Per our Chat Room rules, we ask that capital l...   
4116  Your message was deleted as it was deemed to b...   
4117  This post is best for the Lounge where we enco...   
4118  Your message was deleted as it was deemed to b...   
4119  Your message was deleted as it was deemed to b...   

                                          singleMessage  
0                              a.b.c.warriortrading.com  
1                                         mammkd. sdkkf  
2                                               wattior  
3                                               wattior  
4

# 🚧 Save Data to Disk