# Reddit Comment Classification
---
An attempt to classify Reddit comments from a public dataset as desirable or undesirable, defined by a clearly positive score (greater than 1) or a negative score (less than 0), respectively.

## Hypothesis
0. The model will not be able to predict a positive or negative score given tokenized word sequences generated from the body of the comment.
1. The model will be able to predict a positive or negative score given tokenized word sequences generated from the body of the comment.

## Initial Data Exploration

In [None]:
-- Get a list of all subreddits
SELECT DISTINCT subreddit FROM May2015; -- ~50k active in just one month

-- Count how many comments were made in each subreddit; we'll use the top 10
SELECT `subreddit`,
       COUNT(`subreddit`) AS `subreddit_occurrence`
FROM `May2015`
GROUP BY `subreddit`
ORDER BY `subreddit_occurrence` DESC
LIMIT 10;

-- Get a baseline for the ratio of undesirable to desirable comments (accuracy baseline)
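
One way to get that baseline, sketched in Python with the standard `sqlite3` module. It assumes the same `May2015` table and the score thresholds used in the classification query further down (`score > 1` for desirable, `score < 0` for undesirable). `class_balance` is a hypothetical helper name, not part of the original notebook, and it counts over the whole table rather than the 90,000-row sample used for training.

```python
import sqlite3


def class_balance(conn):
    """Return (n_desirable, n_undesirable, majority_share) for May2015.

    Assumption: desirable means score > 1 and undesirable means score < 0,
    matching the filter in the classification query.
    """
    row = conn.execute(
        """
        SELECT SUM(CASE WHEN score > 1 THEN 1 ELSE 0 END),
               SUM(CASE WHEN score < 0 THEN 1 ELSE 0 END)
        FROM May2015
        WHERE score > 1 OR score < 0
        """
    ).fetchone()
    # SUM() yields NULL (None) when no rows match, so fall back to zero.
    good, bad = (row[0] or 0), (row[1] or 0)
    total = good + bad
    # A classifier that always predicts the larger class reaches this accuracy.
    majority_share = max(good, bad) / total if total else 0.0
    return good, bad, majority_share
```

Against the real file this could be called as `class_balance(sqlite3.connect('./input/database.sqlite'))`. Any model worth keeping has to beat the returned `majority_share`.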


## First run at classification

In [15]:
from __future__ import division
import math
import sqlite3

import numpy as np

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

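def cleanDataSet(data):
    """Split query rows into comment bodies and one-hot labels.

    Each row is (subreddit, body, score). Labels are [0, 1] for a
    positive score (desirable) and [1, 0] otherwise (undesirable).
    """
    corpus = []
    values = []

    for post in data:
        if post[2] > 0:
            label = [0, 1]
        else:
            label = [1, 0]
        values.append(label)
        corpus.append(post[1])

    return corpus, values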

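# Parameters
train_lmt = 90000        # maximum number of comments pulled from the database
max_features = 5000      # vocabulary size kept by the tokenizer
max_length = 500         # sequences are padded or truncated to this many tokens
max_layer_density = 128  # units in the first dense layer
vector_length = 64       # embedding dimension
epochs = 5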

print('Querying DB...\n')
sql_conn = sqlite3.connect('./input/database.sqlite')

train_data = sql_conn.execute(
    """
    SELECT subreddit, body, score FROM May2015
    WHERE (score > 1 OR score < 0)
    AND body != 'deleted'
    AND subreddit IN
    ('AskReddit', 'leagueoflegends', 'nba', 'funny', 'pics', 'nfl',
     'pcmasterrace', 'videos', 'news', 'todayilearned')
    LIMIT ?
    """,
    (train_lmt,))

print('Building Corpus...\n')
x, y = cleanDataSet(train_data)

y = list(y)
good_ratio = len([it for it in y if it[1] == 1]) / len(x)
print('The ratio of good comments to bad in this data set is: {}% ' \
      .format(good_ratio *100))
print('If the network never gains better accuracy than this, it is stuck in a local minima.\n')
y = np.array(y)

tokenizer = Tokenizer(num_words=max_features, split=' ', 
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',)
tokenizer.fit_on_texts(x)
x = tokenizer.texts_to_sequences(x)
x = pad_sequences(x, maxlen=max_length)

print('Building model...\n')
model = Sequential()
model.add(Embedding(max_features, vector_length, input_length=x.shape[1]))
model.add(LSTM(256))
model.add(Dense(max_layer_density, activation="relu"))
model.add(Dense(max_layer_density // 2, activation="relu"))  # integer unit count
model.add(Dense(2, activation="softmax"))

model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

model.fit(x, y,
          batch_size=128,
          epochs=epochs,
          verbose=1,
          validation_split=0.2)


Querying DB...

Building Corpus...

The ratio of good comments to bad in this data set is: 94.33222222222221% 
If the network never gains better accuracy than this, it is stuck in a local minima.

Building model...

Train on 72000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f00228ecfd0>

## Interpretation of Results
The LSTM above shows no sign of improvement after several epochs, and running it for another ten epochs produces the same training and validation results. This isn't surprising given the difficulty of the task and the simplicity of the input representation. Several hyperparameters and optimizers were tried; the run above is shown as a representative summary.
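
With roughly 94% of the sampled comments labeled desirable, a plateau like this is what a model that always predicts the majority class would produce. A quick way to check is to compare predicted classes with the true ones. The sketch below is not part of the original run: `prediction_breakdown` is a hypothetical helper, and it assumes one-hot labels in the `[undesirable, desirable]` order used by `cleanDataSet` and class probabilities such as those returned by `model.predict(x)`.

```python
import numpy as np


def prediction_breakdown(y_true_onehot, y_prob):
    """Return a 2x2 confusion matrix: rows are true class, columns predicted.

    Index 0 is undesirable and index 1 is desirable (assumed label order).
    """
    true_cls = np.argmax(np.asarray(y_true_onehot), axis=1)
    pred_cls = np.argmax(np.asarray(y_prob), axis=1)
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(true_cls, pred_cls):
        matrix[t, p] += 1
    return matrix


# Illustrative values only, not results from the experiment.
y_true = np.array([[0, 1], [0, 1], [1, 0], [0, 1]])
y_prob = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
cm = prediction_breakdown(y_true, y_prob)
print(cm)
# An all-zero first column means the model never predicts "undesirable".
print("collapsed to majority class:", cm[:, 0].sum() == 0)
```

If the first column is empty on the validation set, the reported accuracy carries no information beyond the class balance.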

The hypothesis assumed that simple word sequences contain all the information needed to determine a comment's score, at least at a binary level. This experiment does not conclusively show that assumption to be false, but it does fail to reject the null hypothesis.

## Further Experiments
Expanding on this experiment could involve tweaking hyperparameters, but success may be more likely to come from embedding more context into the dataset. If one could compile a dataset that contained not only the comment being scored but also the parent comment it replies to, the results might improve.
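
As a rough sketch of that idea, a self-join can pair each comment with its parent. This assumes the table exposes a `name` column (the comment's full ID) and a `parent_id` column holding the parent's `name`, as in Reddit's comment schema. Neither column is used elsewhere in this notebook, so the actual schema should be checked first. `fetch_with_parent` is a hypothetical helper. Top-level comments, whose parent is a submission rather than a comment, drop out of the inner join.

```python
import sqlite3


def fetch_with_parent(conn, limit):
    """Return (parent_body, child_body, child_score) rows.

    Assumption: May2015 has `name` and `parent_id` columns, and a reply's
    `parent_id` equals the `name` of the comment it answers.
    """
    query = """
        SELECT parent.body, child.body, child.score
        FROM May2015 AS child
        JOIN May2015 AS parent ON child.parent_id = parent.name
        WHERE child.score > 1 OR child.score < 0
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()
```

The two bodies could then be joined with a separator token before tokenization, or fed to the network through separate input branches. Which works better is an open question for a follow-up experiment.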