# DeepFlair - A deep learning-based system to autoflair Reddit submissions

Reddit's Automoderator has gone insane and has fully taken over r/Science. You are mandated by the all-powerful robot to build a new module allowing it to automatically (and correctly) flair all new submissions to the subreddit.

On reddit, a flair is the tag given to a post. For instance, 'medicine' and 'biology' are possible flairs you will find on r/science. You can visit http://reddit.com/r/science for a better example, each post's flair being found to the left of the link to its comments.

The specifications are as follows:
* All code must be in python
* Extract at least 2,000 recent r/Science posts and use them to fit your classifier.

***WARNING: data collection will be a bit tough because Reddit’s official API no longer allows extracting posts from subs. You’ll have to be creative. Please send us your thoughts by email as soon as you have a solution so we can guide you if necessary. Don’t spend more than 20 minutes on finding a way to collect the data.***


As a second test, run your classifier on a set of at least 500 posts from the month of January to prove its real-life usefulness.
* Write a short (~2 pages) report detailing how you gathered data, your choice of tools, classification algorithm, hyper-parameters and your evaluation techniques. Present and discuss your results and give possible explanations for them. Also suggest possible improvements.
* Quickly answer the following questions:
    * How would you make your classifier available as a web service?
    * Would it be possible to improve the results using the content of a submission's comments? (bear in mind r/Science is highly moderated and non-topical discussions are removed) If so, how?
* Deliver the following:
    * The data used
    * All the code used for the task
    * ~2 pages report + answers to the questions


## Problem Statement

Here I explain my understanding of the assignment and give a clear definition of the problem I will try to solve.

### Problem Definition

Reddit is an American social news aggregation, web content rating, and discussion website. Users can submit content to the site such as links, text posts, and images, which are then voted up or down by other members. This content can be organized in "subreddits" which can be seen as channels allowing discussions on a specific topic. In this assignment, we will focus on the r/science subreddit.

Each submission on Reddit can be tagged, i.e. flaired, with a keyword. 'medicine' and 'biology' are examples given of possible flairs that can be found under the r/science subreddit.

The problem is the following:

**Given the content and metadata of a submission under the r/science subreddit, design an automated system capable of tagging the specific post with a single relevant keyword.**

### Example

Given the following post (taken from today's r/science posts):

*"Scientists have devised a "double Trojan horse" drug that fools antibiotic-resistant bacteria into committing suicide. The drug appears to be a nutrient, but it contains two antibiotics. When the bacterium destroys the first antibiotic, it unleashes the second antibiotic, killing it."*

Our system should be able to automatically assign the tag **'medicine'**.

## Observations and Solving Strategy

### Observations

Natural Language Processing (NLP) is the field dedicated to such tasks. It should allow us to extract the gist of each post and infer its topic.

Deep learning approaches to NLP usually require a lot of training data to achieve good performance. Thus, we will need to collect a good number of Reddit submissions in order to train our model. Here, the assignment specifies that we need to collect at least **2,000 posts** to fit our classifier.

We will need a finite number of classes for our classifier. This means we need to identify exhaustively the different tags/flairs that can be attributed to a post.

Our data will need to be clearly labeled with ground-truth tags. It seems that **each post can be assigned only one tag**, i.e. one class.

In this assignment, we don't really need to understand the meaning of each post; we just need to assign it a topic label. For this reason, my first thought is to try a simple dense neural net, without modeling 'context' through more powerful recurrent neural nets such as GRUs or LSTMs.

However, we would like our system to handle words it has not seen before. For example, if our training set contains posts with the word 'cancer' that are tagged 'medicine', we would like our system to recognize an unseen post containing the word 'tumor' as a 'medicine' post as well. We will use **word embeddings** to address this.
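To make this intuition concrete, here is a minimal sketch using made-up 3-dimensional vectors. The numbers are invented for illustration and are not real GloVe values; the point is only that related words are expected to have a higher cosine similarity than unrelated ones.

```python
import math

# Toy 3-d vectors, invented for illustration only (not real GloVe values).
toy_embeddings = {
    "cancer": [0.9, 0.8, 0.1],
    "tumor": [0.85, 0.75, 0.2],
    "galaxy": [0.1, 0.2, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_medical = cosine_similarity(toy_embeddings["cancer"], toy_embeddings["tumor"])
sim_unrelated = cosine_similarity(toy_embeddings["cancer"], toy_embeddings["galaxy"])

# With these toy values, the two medical terms are much closer to each other.
print(sim_medical > sim_unrelated)
```

With real pre-trained embeddings, the same comparison is what should let a model trained on 'cancer' generalize to 'tumor'.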

### Solving Strategy

It is always better to start with a simple model and then iterate to improve performance. This gives us a baseline to work with. Here are the steps we will implement:

1. Data collection   
2. Create a simple model
3. Train it
4. Evaluate the model

Let's start!


## Data Collection

The assignment tells us that the Reddit API no longer allows extracting posts from subreddits, so we will use the pushshift.io API instead (https://github.com/pushshift/api).

We will start by tinkering around to see what kind of information we can get. This will help us decide which metadata to use to train our model:

In [83]:
import json
import requests

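# No trailing slash here: the path in get_posts already starts with one
url_base = "https://api.pushshift.io"

def get_posts(subreddit, before=0, size=500):
    """Fetch up to `size` submissions posted at least `before` days ago."""
    request_url = "{}/reddit/search/submission/?subreddit={}&before={}d&size={}".format(url_base, subreddit, before, size)
    response = requests.get(request_url)
    if response.status_code == 200:
        return json.loads(response.content.decode('utf-8'))
    # Any other status code is treated as a failed request
    return None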
    
    
last_post = get_posts("science", size=1)

if last_post:
    print(json.dumps(last_post, sort_keys=True, indent=2))
else:
    print('[!] Request Failed')
    

{
  "data": [
    {
      "author": "jalovisko",
      "author_flair_css_class": null,
      "author_flair_richtext": [],
      "author_flair_text": null,
      "author_flair_type": "text",
      "can_mod_post": false,
      "contest_mode": false,
      "created_utc": 1526159182,
      "domain": "skoltech.ru",
      "full_link": "https://www.reddit.com/r/science/comments/8iyxub/neural_network_trained_to_assess_fire_effects/",
      "id": "8iyxub",
      "is_crosspostable": true,
      "is_original_content": false,
      "is_reddit_media_domain": false,
      "is_self": false,
      "is_video": false,
      "link_flair_background_color": "#d982cb",
      "link_flair_css_class": "compsci",
      "link_flair_richtext": [
        {
          "e": "text",
          "t": "Computer Science"
        }
      ],
      "link_flair_template_id": "6462d546-889b-11e3-9380-12313b0ce8a6",
      "link_flair_text": "Computer Science",
      "link_flair_text_color": "light",
      "link_flair_type": "ric

Lots of information here!

After some investigation, we realize that the flair information is contained in the *link_flair_text* key. Bad news, this field is not always present so we will have to take care of this case. Good news? We found our labels. This is what we will try to predict with our model.
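As a small sketch of how the missing-flair case could be handled, the helper below keeps only posts with a usable label. Whether the API omits the key or returns a null value is an assumption here, so the hypothetical `has_flair` helper covers both; the sample posts are made up.

```python
def has_flair(post):
    """Return True when the post carries a non-empty link_flair_text."""
    flair = post.get("link_flair_text")
    return isinstance(flair, str) and flair.strip() != ""


# Made-up posts covering the three cases: labeled, key missing, null value.
sample_posts = [
    {"title": "Neural network trained to assess fire effects.",
     "link_flair_text": "Computer Science"},
    {"title": "A post without any flair key"},
    {"title": "A post with a null flair", "link_flair_text": None},
]

labeled_posts = [post for post in sample_posts if has_flair(post)]
print(len(labeled_posts))
```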

Let's see now which information could help us train a model to perform this task.

The most useful field for this appears to be the title (*title* key). It is the most promising because it acts as a summary of each submission and contains words that can be linked to the post's topic.

The URL might also contain keywords that could help classify a post, but it seems less promising: the URL is not necessarily connected to the post content, and it will contain a lot of noise anyway.

We will keep focused on the information contained in the title to train our model.

### Collecting a lot of submissions

We observe that the pushshift API does not return more than 500 results per request. Remember, we want to collect at least 2,000 posts. Let's aim for 10,000 to have a decent number of training examples.

We will collect those examples from different time periods. That way, the requests return distinct posts, and our classifier is less likely to overfit a specific period of time.

Let's extract 500 posts from each month over the last three years: 500*36 = 18,000 posts. I initially tried with two years of data, but many posts are not labeled so I just took more data to get close to the 10,000 examples we aim for.

We will preprocess the data to keep only the title and flair. The collected data will be dumped into a csv file.
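The month-by-month plan can be sketched as a list of request URLs, one per 30-day step back in time. This only builds the URLs and performs no network call; the `subreddit`, `before` and `size` parameters are the ones used in this notebook, while the `monthly_request_urls` helper is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def monthly_request_urls(subreddit, months, size=500, days_per_month=30):
    """Build one request URL per month, each stepping 30 days further back."""
    urls = []
    for i in range(months):
        params = {
            "subreddit": subreddit,
            "before": "{}d".format(i * days_per_month),
            "size": size,
        }
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = monthly_request_urls("science", 36)
max_posts = len(urls) * 500

print(len(urls), "requests, at most", max_posts, "posts")
print(urls[1])
```

The upper bound of 18,000 posts is only reached if every request returns a full page; unlabeled posts are dropped afterwards.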



In [85]:
import csv

# Key names we want to keep
fieldnames = ["title", "link_flair_text"]

def format_posts(posts):
    res = []
    for post in posts:
        res.append({key: post[key] for key in fieldnames if key in post})
    return res


def remove_posts_without_label(posts):
    return [post for post in posts if fieldnames[1] in post]


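def get_data(months):
    data = []
    for i in range(months):
        # One request per 30-day step back in time, up to 500 posts each
        response = get_posts("science", before=i*30, size=500)
        if response is None:
            # get_posts returns None on failure: skip this time window
            continue
        posts = format_posts(response["data"])
        posts = remove_posts_without_label(posts)
        data.extend(posts)
    return data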


def get_classes(posts):
    return set([post[fieldnames[1]] for post in posts])


def change_flair_to_label(posts, flair_to_label):
    for post in posts:
        post[fieldnames[1]] = flair_to_label[post[fieldnames[1]]]


# Get, format and clean the data
posts = get_data(36)

# Determine the classes
classes = get_classes(posts)
num_classes = len(classes)

# Define two dictionaries mapping label to flair and flair to label
flair_to_label = {c:i for (i,c) in enumerate(classes)}
label_to_flair = {i:c for (i,c) in enumerate(classes)}

# Now we have our mapping, we can change the flair to its associated label
change_flair_to_label(posts, flair_to_label)

# Some nice information about our dataset
print("Number of training examples: {}".format(len(posts)))
print("Number of classes: {}".format(num_classes))
print("Classes: {}".format(classes))

# We save the data in a csv file
with open("data.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(posts)


Number of training examples: 9742
Number of classes: 128
Classes: {'Flu AMA', 'Planetary Science AMA', 'Monsanto AMA', 'Autism AMA', 'Honey Bee Genome AMA', 'Chemistry AMA', 'Stress and the Brain AMA', 'Science Discussion', 'Science Writing AMA', 'Weatherman AMA', 'Social Sciences', 'Open Access AMA', 'NOAA AMA', 'Hurricane Prediction AMA', 'Clean Water AMA', 'Computer Science', 'Suicide Prevention AMA', "People's Climate March AMA", 'Computer Sci', 'Cholera AMA', 'DNA Day Series | National Society of Genetic Counselors', 'Health AMA', 'Virtual Reality AMA', 'Open Science AMA', 'Forensic Chemistry AMA', 'Plasma Physics AMA', 'Climate Change AMA', 'Fetal Tissue Research AMA', 'Self-treating ALS AMA', 'Zealandia expedition AMA', 'Chronic Pain AMA', 'Unlock Your Genome AMA', 'Subreddit News', 'In Mice', 'Economics', 'Psychology', 'Human Genome AMA', 'Snow and Ice AMA', 'Physics', 'Climate Science AMA', 'Misleading Title', 'Biosimilars AMA', 'GMO AMA', 'DNA Day Series | The Cancer Genome A

At the end of this step, we have collected ~10,000 labeled examples from the past three years, spread over ~130 different classes (flairs). We now have a good dataset to work with and train a classifier.
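The class list above contains both 'Computer Science' and 'Computer Sci', as well as many one-off AMA flairs, so it may be worth inspecting the class distribution before training. The sketch below is one possible way to do it on made-up data; the alias table and the `min_examples` threshold are assumptions, not something this notebook applies.

```python
from collections import Counter

# Made-up flairs standing in for the link_flair_text column.
sample_flairs = [
    "Medicine", "Medicine", "Computer Science", "Computer Sci",
    "Flu AMA", "Medicine", "Physics",
]

# Assumed alias table: merge obvious spelling variants of the same flair.
aliases = {"Computer Sci": "Computer Science"}


def normalize(flair):
    """Map a flair to its canonical spelling, if an alias is known."""
    return aliases.get(flair, flair)


counts = Counter(normalize(flair) for flair in sample_flairs)

# Keep only the classes with enough examples to learn from.
min_examples = 2
kept_classes = sorted(flair for flair, n in counts.items() if n >= min_examples)

print(counts.most_common())
print(kept_classes)
```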

## Model Definition

As noted earlier, we probably don't need RNNs such as LSTM or GRU networks right away. Context does not seem critical here, since we only want to assign a general topic to each post.

If we had to handle things like negation, for sentiment analysis for example, context would matter: given "I am not happy", an averaging model would miss the negation, whereas an LSTM could take it into account. For topic labeling, such cases should be rare.

On the other hand, we would like our system to correctly classify posts that use vocabulary unseen in the training set. If the model was trained on specific medical terminology, word embeddings should let it associate new medical terms with the ones it saw during training.

Below is an image showcasing our first design idea for our model:

![Model Design](./20180512_133744.jpg)

We'll take the posts' titles as input and transform them into their vector representations. For this, we will use pre-trained word embeddings to avoid training overhead. This will be a good enough start.

We'll then average these word embeddings and feed the resulting vector to a dense neural net. The output of this network will be a vector of class probabilities of shape (num_classes, 1). We will have to convert our labels to their one-hot representations to match the network output.

Finally, we'll predict the tag of an unseen post by taking the class with the highest probability, i.e. the argmax of the probability vector.
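The pipeline described above (average the word vectors, score each class, apply a softmax, take the argmax) can be sketched in plain Python. This is a simplified illustration: a single linear layer stands in for the dense network, and the vectors, weights and biases are made-up values.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(word_vectors, weights, biases):
    """Average the word vectors, score each class, return (argmax, probabilities)."""
    avg = average_vectors(word_vectors)
    scores = [sum(w * x for w, x in zip(row, avg)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Made-up 2-d word vectors for a three-word title, and three classes.
title_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0, 2.0], [-1.0, 0.0], [0.0, -1.0]]
biases = [0.0, 0.0, 0.0]

label, probs = predict(title_vectors, weights, biases)
print(label, probs)
```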

## Data pre-processing

There is still some processing to be done to prepare our data. Let's load everything in a pandas dataframe first and take a look at what we got.

In [86]:
import numpy as np
import pandas as pd

df = pd.read_csv("data.csv", quotechar='"', skipinitialspace=True)

print(df[:10])

                                               title  link_flair_text
0     Neural network trained to assess fire effects.               15
1  So you can never out run your cat and your dog...              118
2  Effects of N-acetylcysteine on marijuana depen...               76
3      Omeprazole increases the risk of heart attack               76
4  Stephen Hawking service: Possibility of time t...               48
5  Teachers who antagonize their students by beli...               35
6  It's tobacco and alcohol use - not illegal dru...               75
7  Fitness apps found to make almost no differenc...               75
8  Google DeepMind's AI learns navigation skills ...               15
9  People using brain-computer interface are more...               61


### Text pre-processing

We need to prepare and clean the title text using the Keras pre-processing tools. We are going to map words to integer indexes and pad the sequences to the length of the longest title.

In [87]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# We extract the text data from the title column of our dataframe
texts = df[fieldnames[0]].values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found {} unique tokens.'.format(len(word_index)))

# Word count of the title with the most characters, used as the padded sequence length
maxlen = len(max(texts, key=len).split())

data = pad_sequences(sequences, maxlen=maxlen)

Found 19441 unique tokens.
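To show what the tokenizing and padding steps do, here is a simplified pure-Python sketch of the same idea. It is not the Keras implementation (Keras also strips punctuation, for instance), but it mirrors two Keras defaults: word indexes start at 1 in order of decreasing frequency, and zeros are padded at the front of each sequence.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer index, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_pre(sequences, maxlen):
    """Left-pad with zeros (and keep the last maxlen items) to a fixed length."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


titles = ["new cancer drug", "cancer study"]
toy_word_index = build_word_index(titles)
toy_sequences = texts_to_sequences(titles, toy_word_index)
toy_padded = pad_pre(toy_sequences, maxlen=3)

print(toy_word_index)
print(toy_padded)
```

Index 0 is never assigned to a word, which is why the embedding matrix later needs `len(word_index) + 1` rows.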


### Labels pre-processing

The text is now properly cleaned and ready to be fed to the neural network. Let's now transform our labels to their one-hot encoding:

In [88]:
from keras.utils import to_categorical

labels = df[fieldnames[1]]
labels = to_categorical(labels, num_classes)
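For reference, one-hot encoding simply turns an integer label into a vector with a single 1 at the label's position. The sketch below illustrates the idea in plain Python with hypothetical helper names; it is not how `to_categorical` is implemented.

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


def decode(vector):
    """Recover the integer label as the position of the largest value."""
    return max(range(len(vector)), key=vector.__getitem__)


encoded = one_hot(2, 4)
print(encoded)
print(decode(encoded))
```

The same `decode` step is what turns the network's probability vector back into a label.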

Let's take a look at the shape of our dataset:

In [89]:
print('Shape of data tensor: {}'.format(data.shape))
print('Shape of label tensor: {}'.format(labels.shape))

Shape of data tensor: (9742, 44)
Shape of label tensor: (9742, 128)


Finally, we split our data into a training set and a validation set:

In [90]:
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(0.2 * data.shape[0])

X_train = data[:-nb_validation_samples]
Y_train = labels[:-nb_validation_samples]
X_val = data[-nb_validation_samples:]
Y_val = labels[-nb_validation_samples:]

## Model Implementation

It is now time to implement our model. We will first prepare the embedding layer by loading pre-trained 50-dimensional GloVe word embeddings.

In [91]:
from keras import backend as K
from keras.layers import Dense, Dropout, Embedding, Input, Lambda
from keras.models import Model
from keras.optimizers import Adam

# Compute an index mapping words to known embeddings, by parsing the data dump of pre-trained
# 50-dimensional GloVe embeddings.
embeddings_index = {}
with open('data/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found {} word vectors.'.format(len(embeddings_index)))

# Compute the embedding matrix (row 0 is reserved for the padding index)
embedding_matrix = np.zeros((len(word_index) + 1, 50))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embedding index keep their all-zeros row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
# Prepare the embedding_layer
embedding_layer = Embedding(len(word_index) + 1,
                            50,
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=False)

Found 400000 word vectors.
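Since words missing from GloVe end up as all-zero rows, it can be useful to check how much of the vocabulary is actually covered. Below is a minimal sketch on made-up data; the two "GloVe" lines are invented (same `word v1 v2 ...` layout as the real file), and `coverage` is a hypothetical helper.

```python
def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index


def coverage(vocabulary, embeddings_index):
    """Fraction of vocabulary words that have a pre-trained vector."""
    if not vocabulary:
        return 0.0
    found = sum(1 for word in vocabulary if word in embeddings_index)
    return found / len(vocabulary)


# Invented 2-d vectors in the GloVe text layout, for illustration only.
fake_glove = ["cancer 0.1 0.2", "tumor 0.3 0.4"]
toy_vocabulary = ["cancer", "tumor", "n-acetylcysteine", "deepmind's"]

toy_index = parse_glove_lines(fake_glove)
print(coverage(toy_vocabulary, toy_index))
```

A low coverage on the real vocabulary would mean that many title words contribute nothing to the averaged vector.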


We are finally ready to define the model and train it:

In [92]:
def build_model(maxlen, num_classes):
    # Text input
    sequence_input = Input(shape=(maxlen,), dtype='int32')
    # Embedding layer
    embedded_sequences = embedding_layer(sequence_input)
    # Average layer
    X = Lambda(lambda x: K.mean(x, axis=1))(embedded_sequences)
    # Dense layer
    X = Dense(128, activation='relu')(X)
    # Dropout
    X = Dropout(0.5)(X)
    # Softmax layer
    output = Dense(num_classes, activation='softmax')(X)

    return Model(sequence_input, output)

# Create the network (the builder has its own name so it is not shadowed)
model = build_model(maxlen, num_classes)

# Choose an Adam optimizer
opt = Adam(lr=0.1, beta_1=0.9, beta_2=0.999, decay=0.01)

model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=100, batch_size=512)

Train on 7794 samples, validate on 1948 samples
Epoch 1/100
...
Epoch 100/100

<keras.callbacks.History at 0x7f14e3eab940>

We achieve poor performance on the training set: less than 50% accuracy. Unfortunately, I spent too much time figuring out how to collect and format the data for the neural net, so I have no time left to improve the model.

The performance on the training and validation sets is fairly close, which suggests we are not overfitting.

The learning starts well but then stagnates very quickly. We clearly have a bias problem, which could potentially be addressed in three ways:

* Gather more data: the training set may not be big enough for this task, especially with ~130 classes, many of which (e.g. the one-off AMA flairs) likely have only a handful of examples.
* Train a bigger network, while being careful not to overfit the training set. A much deeper network would also expose us to vanishing/exploding gradients.
* Try a different NN architecture, maybe RNNs after all? In that case, a bidirectional LSTM would be a good first choice, since LSTMs tend to perform well on text and mitigate the vanishing gradient problem to some extent.

One thing I wonder is whether we could focus our analysis on more specific words. For now, we don't perform any filtering, and words like "the", "and", etc. are used so often that they may hurt our performance. On the other hand, specific terminology, like the name of a disease, may give much more confidence that an article is actually about medicine.
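A minimal sketch of such a filter is shown below. The stop-word list is a short hand-picked assumption, not an established list, and the filter was not applied in this notebook.

```python
# Short hand-picked list, for illustration only.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "that", "for", "on", "with"}


def remove_stop_words(title, stop_words=STOP_WORDS):
    """Drop very common words so that topical terms weigh more in the average."""
    return " ".join(word for word in title.split() if word.lower() not in stop_words)


filtered_title = remove_stop_words("Omeprazole increases the risk of heart attack")
print(filtered_title)
```

Applied before tokenization, this would leave only the more topical words in each averaged title vector.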

**How would you make your classifier available as a web service?**
I have no experience making such a system available as a web service. However, I know that TensorFlow is a stable framework for production use. We could also keep training the model on newly posted submissions by changing the optimizer to stochastic gradient descent.
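As a rough sketch of what such a service could look like, the example below exposes a prediction function over HTTP using only the Python standard library. Everything in it is an assumption made for illustration: `predict_flair` is a placeholder rule, where a real version would tokenize and pad the title and call the trained model, and the JSON request format is invented.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_flair(title):
    """Placeholder: a real version would tokenize, pad and call the trained model."""
    return "Medicine" if "cancer" in title.lower() else "Unknown"


def handle_payload(raw_body):
    """Parse a JSON request body and return (HTTP status, response dict)."""
    try:
        payload = json.loads(raw_body.decode("utf-8"))
        title = payload["title"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON body with a 'title' field"}
    if not isinstance(title, str):
        return 400, {"error": "'title' must be a string"}
    return 200, {"title": title, "flair": predict_flair(title)}


class FlairHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, run the prediction, and reply in JSON
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("", port), FlairHandler).serve_forever()
```

Calling `serve()` would start the server, after which a client could POST `{"title": "..."}` and receive the predicted flair. Keeping the parsing logic in `handle_payload` makes it testable without opening a socket.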

**Would it be possible to improve the results using the content of a submission's comments? (bear in mind r/Science is highly moderated and non-topical discussions are removed) If so, how?**
If the comments on a post are highly moderated and relevant to the post's topic, then yes, I think it would be possible to use them to improve our system. One thing we could do is pick a fixed number of comments to consider, for example 5. To avoid any input size problem, we could concatenate these comments with the post title. As long as the comments are relevant to the post, they should reinforce the topic signal.
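The concatenation idea can be sketched as follows. The `build_input_text` helper and the sample comments are hypothetical; the resulting string would go through the same tokenizing and padding steps as the titles, with a larger maximum length.

```python
def build_input_text(title, comments, max_comments=5):
    """Concatenate the title with the first few non-empty comments."""
    selected = [c.strip() for c in comments[:max_comments] if c.strip()]
    return " ".join([title.strip()] + selected)


# Made-up comments for illustration.
sample_comments = [
    "Antibiotic resistance is a growing problem.",
    "",
    "How soon could this reach clinical trials?",
]

combined = build_input_text("Double Trojan horse drug fools bacteria", sample_comments, max_comments=2)
print(combined)
```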