Credit to https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17 for main inspiration and code  
Credit to David Lee for assistance on certain aspects of this notebook

## **Notebook Contents**

- [Import Libraries](#importlibrarieml)  
- [Import Dataframes](#importdataframeml)
- [Word Cleaning](#wordcleaninml)
- [Preprocess Data](#preprocessml)
- [Modeling](#modelingml)
- [Scores](#scoresml)
- [Citations](#citesml)



<a name="importlibrarieml"></a>
## **Import Libraries**

In [35]:
# Standard Imports
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score

# NLP Imports
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk import word_tokenize
STOPWORDS = set(stopwords.words('english'))

# Keras Imports
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
import tensorflow as tf

# io wraps the bytes uploaded through Google Colab so pandas can read them
import io

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<a name="importdataframeml"></a>
## **Import Dataframes**


In [4]:
from google.colab import files
uploaded = files.upload()

Saving data_ai.csv to data_ai.csv
Saving data_ml.csv to data_ml.csv


In [36]:
data_ai = pd.read_csv(io.BytesIO(uploaded['data_ai.csv']))
data_ml= pd.read_csv(io.BytesIO(uploaded['data_ml.csv']))

In [37]:
data_ai.head()

Unnamed: 0,subreddit,title,selftext
0,artificial,Could AI ethics draw on non-Western philosophi...,
1,artificial,Realistic simulation of tearing meat and peeli...,
2,artificial,[R] Using Deep RL to Model Human Locomotion Co...,In the new paper [*Deep Reinforcement Learning...
3,artificial,Artificial Intelligence Easily Beats Human Fig...,
4,artificial,Foiling illicit cryptocurrency mining with art...,


In [38]:
data_ml.head()

Unnamed: 0,subreddit,title,selftext
0,MachineLearning,[R] Taming pretrained transformers for eXtreme...,New X-Transformer model from Amazon Research\n...
1,MachineLearning,[R] Taming pretrained transformers for eXtreme...,
2,MachineLearning,[D] Why can't I find papers from CVRP '20 / Be...,I am looking for a few of the winning papers f...
3,MachineLearning,[D] Help with bone semantic segmentation,"Hi, I'm Anibal and I'm a software developer.\n..."
4,MachineLearning,help with bone semantic segmentation,[removed]


In [39]:
data_ai.shape

(31299, 3)

In [40]:
data_ml.shape

(31299, 3)

In [41]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df = pd.concat([data_ai, data_ml]).reset_index()

In [42]:
df.drop(columns='index',inplace=True)

In [43]:
df

Unnamed: 0,subreddit,title,selftext
0,artificial,Could AI ethics draw on non-Western philosophi...,
1,artificial,Realistic simulation of tearing meat and peeli...,
2,artificial,[R] Using Deep RL to Model Human Locomotion Co...,In the new paper [*Deep Reinforcement Learning...
3,artificial,Artificial Intelligence Easily Beats Human Fig...,
4,artificial,Foiling illicit cryptocurrency mining with art...,
...,...,...,...
62593,MachineLearning,What are some things that you wish you knew be...,[removed]
62594,MachineLearning,[D] Does anyone created a formal database for ...,I'm looking for a database that has sufficient...
62595,MachineLearning,"[P] Demo of ""Arbitrary Style Transfer with Sty...",Hi MachineLearning\n\nI'll introduce awsome st...
62596,MachineLearning,[R] Triplet loss for image retrieval,"Hi, there!\n\n \nThis is an example of image ..."


<a name="wordcleaninml"></a>
## **Word Cleaning**

In [48]:
# Text-cleaning function applied to every post title from both subreddits

# Characters matched here are replaced with a space ' '
symbol_replace_space = re.compile(r'[/(){}\[\]\|@,;]')

# Any character NOT in this set is removed by the function below
bad_symbols = re.compile(r'[^0-9a-z #+_]')

# English stopwords to drop from each title
STOPWORDS = set(stopwords.words('english'))


# Function to clean our texts
def clean_text(text):

    # Make all of the text lowercase
    text = text.lower()

    # Replace every character matched by symbol_replace_space with a space
    text = symbol_replace_space.sub(' ', text)

    # Remove every character that is not a digit, lowercase letter, space, '#', '+' or '_'
    text = bad_symbols.sub('', text)

    # Remove all digits
    text = re.sub(r'\d+', '', text)

    # Remove every 'x' character; note this also strips 'x' from inside
    # ordinary words (e.g. 'extreme' becomes 'etreme')
    text = text.replace('x', '')

    # Remove stopwords from the text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)

    return text

# Applying the clean_text function above to every title in df['title']
df['title'] = df['title'].apply(clean_text)
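
To see what `clean_text` does to a single title, the sketch below repeats the same steps using only the standard library. The small `DEMO_STOPWORDS` set and the sample title are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list.

```python
import re

# Same patterns as the cleaning cell above
symbol_replace_space = re.compile(r'[/(){}\[\]\|@,;]')
bad_symbols = re.compile(r'[^0-9a-z #+_]')

# Hypothetical stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"for", "the", "a", "of", "in", "with"}

def clean_text_demo(text, stopwords=DEMO_STOPWORDS):
    text = text.lower()                          # lowercase everything
    text = symbol_replace_space.sub(' ', text)   # brackets, commas, etc. become spaces
    text = bad_symbols.sub('', text)             # drop any other disallowed character
    text = re.sub(r'\d+', '', text)              # drop digits
    text = text.replace('x', '')                 # drop every 'x', as the notebook does
    return ' '.join(w for w in text.split() if w not in stopwords)

sample = "[R] Taming pretrained transformers for eXtreme classification (2020)"
cleaned = clean_text_demo(sample)
print(cleaned)  # r taming pretrained transformers etreme classification
```

The output shows two side effects worth knowing about: the `[R]` tag survives as a lone `r` token, and the `replace('x', '')` step removes the letter from inside ordinary words.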

<a name="preprocessml"></a>
## **Preprocessing Data**

In [49]:
df.head()

Unnamed: 0,subreddit,title,selftext
0,artificial,could ai ethics draw nonwestern philosophies h...,
1,artificial,realistic simulation tearing meat peeling chee...,
2,artificial,r using deep rl model human locomotion control...,In the new paper [*Deep Reinforcement Learning...
3,artificial,artificial intelligence easily beats human fig...,
4,artificial,foiling illicit cryptocurrency mining artifici...,


In [50]:
# Vocabulary size: only the most frequent words are kept by the tokenizer
max_words = 1_000

# Every title is padded (or truncated) to this many tokens
max_sequence_length = 500

# Size of each word vector; this is the second argument of the Embedding layer
embedding_dimensions = 100

# The Keras Tokenizer turns each text in the corpus into a sequence of integers
# Instantiate the Tokenizer
tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)

# Fit the tokenizer on every title in our corpus
tokenizer.fit_on_texts(df['title'].values)

# word_index maps each word to its integer index (1 = most frequent word)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 30041 unique tokens.


In [51]:
# Convert each title in df['title'] into a sequence of word indices
X = tokenizer.texts_to_sequences(df['title'].values)

# Pad (or truncate) every sequence to max_sequence_length so all rows have the same shape
X = pad_sequences(X, maxlen=max_sequence_length)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (62598, 500)
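
The two steps above (word indexing and padding) can be sketched in plain Python. This is a simplified illustration, not Keras's implementation: the helper names are hypothetical, and it assumes the Keras defaults of keeping only words whose index is below `num_words` and of padding and truncating at the front of each sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so ties keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words with indices, dropping words outside the vocabulary."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

titles = ["deep learning model", "deep rl model human", "learning deep"]
word_index = fit_word_index(titles)
sequences = to_sequences(titles, word_index, num_words=4)
padded = [pad_pre(s, maxlen=4) for s in sequences]
print(word_index)
print(padded)  # [[0, 1, 2, 3], [0, 0, 1, 3], [0, 0, 2, 1]]
```

In the toy corpus, `rl` and `human` fall outside the vocabulary and are dropped, which mirrors what happens in the notebook to every word ranked beyond `max_words`.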


In [52]:
# One-hot encode the subreddit labels (one column per class)
y = pd.get_dummies(df['subreddit']).values
print('Shape of label tensor:', y.shape)

Shape of label tensor: (62598, 2)


In [53]:
y

array([[0, 1],
       [0, 1],
       [0, 1],
       ...,
       [1, 0],
       [1, 0],
       [1, 0]], dtype=uint8)
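
The column order of the one-hot array matters later when predictions are mapped back to subreddit names. A minimal sketch, assuming the columns follow sorted label order (which agrees with the array printed above, where the `artificial` rows at the top are `[0, 1]`); the `one_hot` helper is hypothetical and stands in for `pd.get_dummies`:

```python
def one_hot(labels):
    """Return the sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))  # uppercase 'M' sorts before lowercase 'a'
    column = {name: i for i, name in enumerate(classes)}
    rows = [[1 if column[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

classes, rows = one_hot(["artificial", "artificial", "MachineLearning"])
print(classes)  # ['MachineLearning', 'artificial']
print(rows)     # [[0, 1], [0, 1], [1, 0]]
```

Column 0 therefore corresponds to `MachineLearning` and column 1 to `artificial`.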

In [54]:
# Split our data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.10, random_state = 42)
print('Train')
print(X_train.shape,y_train.shape) # training data
print('='*40)
print('Test')
print(X_test.shape,y_test.shape) # testing data

Train
(56338, 500) (56338, 2)
Test
(6260, 500) (6260, 2)
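
The split sizes printed above follow from simple arithmetic. A minimal sketch, assuming scikit-learn rounds the test portion up to a whole row and gives the remainder to training, which is consistent with the shapes shown:

```python
import math

n_samples = 62598   # rows in X
test_size = 0.10    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # 6259.8 rounds up
n_train = n_samples - n_test

print(n_train, n_test)  # 56338 6260
```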


In [55]:
len(X)

62598

<a name="modelingml"></a>
## **Modeling**

In [56]:

model = Sequential()  # Instantiate the Sequential model

# Embedding layer first: maps each word index to a dense vector
model.add(Embedding(max_words, embedding_dimensions, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.10))
model.add(LSTM(100, dropout=0.10, recurrent_dropout=0.10))
model.add(Dense(2, activation='softmax'))  # One output per subreddit
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 5
batch_size = 128

# Hold out 20% of the training data for validation and stop early if val_loss stops improving
history = model.fit(
    X_train, y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.2,
    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.001)],
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
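
As a rough check on model size, the trainable parameter count of the architecture above can be worked out by hand. This sketch assumes the standard Keras LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; `SpatialDropout1D` adds no parameters. Calling `model.summary()` in the notebook would be the authoritative check.

```python
max_words = 1_000            # vocabulary size
embedding_dimensions = 100   # size of each word vector
lstm_units = 100
n_classes = 2

# Embedding: one vector per vocabulary entry
embedding_params = max_words * embedding_dimensions

# LSTM: 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (lstm_units * (embedding_dimensions + lstm_units) + lstm_units)

# Dense softmax layer: weights plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 100000 80400 202 180602
```

Most of the parameters sit in the embedding table, so `max_words` and `embedding_dimensions` are the main levers on model size.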


In [58]:
# Evaluating our model on the Testing Data
accr = model.evaluate(X_test,y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.413
  Accuracy: 0.815


In [59]:
new_post = ["Decision Trees: Understanding the Basis of Ensemble Methods"]
seq = tokenizer.texts_to_sequences(new_post)
padded = pad_sequences(seq, maxlen=max_sequence_length)
pred = model.predict(padded)
labels = ['MachineLearning', 'artificial']  # same column order as the one-hot labels in y
print(pred, labels[np.argmax(pred)])

[[0.76751316 0.23248681]] MachineLearning
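
The last step of the cell above, turning a softmax row into a subreddit name, can be written as a small helper. This is an illustrative sketch in plain Python; `decode_prediction` is a hypothetical name, and the probabilities are the ones printed above.

```python
def decode_prediction(probs, labels):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ['MachineLearning', 'artificial']  # must match the one-hot column order
label, confidence = decode_prediction([0.76751316, 0.23248681], labels)
print(label, round(confidence, 3))  # MachineLearning 0.768
```

For consistency with training, a new title would normally be passed through `clean_text` before `texts_to_sequences`; the prediction cell above skips that step, so stopwords in the new title are tokenized even though they were removed from the training titles.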


<a name="scoresml"></a>
## **Scores**


**Production Model:**
- Dropout: 0.10
- Batch size: 128
- Epochs: 5
- LSTM units: 100
- Epoch 5/5 Accuracy: 0.899
- Test Set Accuracy: 0.811

I tried several different hyperparameter settings before settling on this model. In the future, I believe adding the 'selftext' column as an input would greatly increase the accuracy scores.

<a name="citesml"></a>
## **Citations**

Embedding:
- https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
- https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work

Keras Sequential Model:
- https://keras.io/guides/sequential_model/

Dropout:
- https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
- https://machinelearningmastery.com/use-dropout-lstm-networks-time-series-forecasting/

Softmax:
- https://medium.com/analytics-vidhya/softmax-classifier-using-tensorflow-on-mnist-dataset-with-sample-code-6538d0783b84
- https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d

LSTM:
- https://towardsdatascience.com/choosing-the-right-hyperparameters-for-a-simple-lstm-using-keras-f8e9ed76f046
- https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17 
- https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

Extras:
- https://stackoverflow.com/questions/30315035/strip-numbers-from-string-in-python