# Overview

We will use a recurrent neural network (stacked LSTM layers) with a softmax output layer to produce a list of label probabilities for each message.
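To make the softmax step concrete, here is a minimal pure-Python sketch of how raw scores become probabilities. The `softmax` helper and the three example scores are illustrative assumptions; the notebook itself relies on the Keras `softmax` activation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three labels (illustrative values only).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print([round(p, 3) for p in probs])
```

The highest score receives the highest probability, and the values can be read as a ranked list of labels.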

Requirements:

- Output file from 1-preprocess-data.ipynb

# Install Dependencies

The environment needs the ML packages imported below.

## PIP Packages (Optional)

In [1]:
pip install tensorflow numpy pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


## Required Packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical


# Hyperparameters

In [3]:
# Load the CSV
file_path = 'data/output/3-merge-data.csv'
df = pd.read_csv(file_path)

# Tokenization settings (used when the text is tokenized and padded)
max_len = 100  # Maximum length of input sequences, in tokens
vocab_size = 10000  # Vocabulary size

# Training Settings
epochsCount = 4
epochsShuffleData = True
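As a rough illustration of what `max_len` controls, the sketch below mimics the default behavior of Keras `pad_sequences`, which pads with zeros at the front and truncates from the front. The `pad_or_truncate` helper is a hypothetical stand-in written for this example, not part of the notebook.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Mimic the default 'pre' padding and 'pre' truncation of pad_sequences."""
    if len(seq) >= max_len:
        # Keep only the last max_len tokens.
        return seq[len(seq) - max_len:]
    # Left-pad with pad_value up to max_len.
    return [pad_value] * (max_len - len(seq)) + seq

short = pad_or_truncate([5, 9], max_len=4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)
print(short)  # [0, 0, 5, 9]
print(long)   # [3, 4, 5, 6]
```

With `max_len = 100`, any message longer than 100 tokens would lose its leading tokens under these defaults.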

# Split Train and Test Data

In [4]:
# Handle NaN values
df = df.dropna(subset=['singleMessage'])

# Extract features and target
X = df['singleMessage']
y = df['reason']

# Split the dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the tokenizer on the training split only
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)

# Convert each message to a sequence of integer token IDs
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad or truncate every sequence to exactly max_len tokens
X_train_padded = pad_sequences(X_train_seq, maxlen=max_len)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len)

# Encode the target labels as integers
label_encoder = LabelEncoder()
label_encoder.fit(y)  # Fit on the entire dataset so labels seen only in the test split are known

y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

num_classes = len(label_encoder.classes_)

# One-hot encode the integer labels for categorical cross-entropy
y_train_categorical = to_categorical(y_train_encoded, num_classes=num_classes)
y_test_categorical = to_categorical(y_test_encoded, num_classes=num_classes)
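The label handling above can be sketched in plain Python to show what the two steps produce. The `encode_labels` and `one_hot` helpers are hypothetical stand-ins for `LabelEncoder` and `to_categorical`, and the three sample labels are only an illustration.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, using sorted order like LabelEncoder."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]

def one_hot(indices, num_classes):
    """Turn integer class IDs into one-hot rows, similar to to_categorical."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)]
            for idx in indices]

labels = ["Off-topic", "Clean", "Off-topic"]
classes, encoded = encode_labels(labels)
rows = one_hot(encoded, num_classes=len(classes))
print(classes)  # ['Clean', 'Off-topic']
print(encoded)  # [1, 0, 1]
print(rows)     # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Each one-hot row has the same width as the softmax output, which is what categorical cross-entropy expects.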


# Train

The training data is reshuffled before each epoch. Because we want a probability for every label, the output layer uses softmax activation.
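Per-epoch shuffling can be pictured as drawing a fresh ordering of the sample indices before every pass. The sketch below is a simplified illustration under that assumption; the `epoch_orders` helper and its `seed` parameter are hypothetical, and Keras handles this internally when `shuffle=True` is passed to `fit`.

```python
import random

def epoch_orders(num_samples, epochs, seed=0):
    """Return one shuffled index order per epoch."""
    rng = random.Random(seed)
    orders = []
    for _ in range(epochs):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # a new permutation for this epoch
        orders.append(indices)
    return orders

orders = epoch_orders(num_samples=6, epochs=4)
for epoch, order in enumerate(orders, start=1):
    print(f"Epoch {epoch}: {order}")
```

Every epoch still sees each sample exactly once; only the order changes.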

In [5]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Create the RNN model
embedding_dim = 128

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_len),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax')
])

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model; the test split doubles as validation data
model.fit(
    X_train_padded,
    y_train_categorical,
    epochs=epochsCount,
    validation_data=(X_test_padded, y_test_categorical),
    shuffle=epochsShuffleData,
)





Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f080ef26370>
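For a sense of model size, the trainable parameter counts of the layers defined above can be worked out by hand. This sketch uses the standard formulas for embedding, LSTM, and dense layers; the helper names are illustrative. The final softmax layer is left out because its size depends on `num_classes`, which is not stated in this notebook.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size, embedding_dim = 10000, 128
counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),  # 1,280,000
    "lstm_64": lstm_params(embedding_dim, 64),                 # 49,408
    "lstm_32": lstm_params(64, 32),                            # 12,416
    "dense_32": dense_params(32, 32),                          # 1,056
}
for name, count in counts.items():
    print(f"{name}: {count:,}")
```

The embedding layer dominates the total, so `vocab_size` and `embedding_dim` are the settings that most affect model size.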

# Prediction

Get a ranked list of label probabilities for a message we provide.
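When only the most likely labels matter, the ranking can be cut to the top few entries. The sketch below shows this without the trained model; the `top_k` helper and the probability values are illustrative assumptions.

```python
def top_k(labels, probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative labels and probabilities, not real model output.
labels = ["Clean", "Off-topic", "Inappropriate comment."]
probs = [0.7, 0.1, 0.2]
best = top_k(labels, probs, k=2)
for reason, probability in best:
    print(f"{reason}: {probability:.4f}")
```

With real predictions, `labels` would be `label_encoder.classes_` and `probs` would be one row of the array returned by `model.predict`.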

In [10]:
# Example message to predict
new_message = ["fuck you"]

# Preprocess the new message
new_message_seq = tokenizer.texts_to_sequences(new_message)
new_message_padded = pad_sequences(new_message_seq, maxlen=max_len)

# Predict the class probabilities
class_probabilities = model.predict(new_message_padded)

# Sort the class probabilities in descending order along with their corresponding classes
sorted_probabilities = sorted(zip(label_encoder.classes_, class_probabilities[0]), key=lambda x: x[1], reverse=True)

# Print the sorted class probabilities
for reason, probability in sorted_probabilities:
    print(f"{reason}: {probability:.4f}")


Politics not allowed outside of references to the market.: 0.9955
Clean: 0.0041
Off-topic: 0.0002
Inappropriate comment.: 0.0001
Personal or sensitive information not allowed in chat.: 0.0001
Third-party links / content not allowed.: 0.0000
Caps for tickers only.: 0.0000
password: 0.0000
Bypassing the chat filters is not allowed.: 0.0000
Not sure what this is: 0.0000
False information or no source.: 0.0000
language please: 0.0000
outside link: 0.0000
False or misleading information, or no source.: 0.0000
False information.: 0.0000
Reviewed by admin internally; not necessary to post to public chat.: 0.0000
Might be searching for offline contact: 0.0000
Account number visible. Please remove from content before reposting.: 0.0000
Bullying a member or moderator.: 0.0000
Please provide more information when making comments like these. For example "AFRM is being shorted every candle, so I think it's manipulated" : 0.0000
Support Room would be more appropriate for this inquiry.: 0.0000
CMEG c