## Twitter Content Classification and Author Identification

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
from nltk.corpus import stopwords
import re
import collections
import random
import math

import warnings
warnings.filterwarnings('ignore')

## 1. Load and preprocess data

In [2]:
train_input = pd.read_csv('train.csv')
test_input = pd.read_csv('test.csv')
print "train_input shape is: ", np.shape(train_input)
print "test_input shape is: ", np.shape(test_input)

train_input shape is:  (4743, 2)
test_input shape is:  (1701, 2)


In [3]:
def label_tweet(input_set):
    handle = input_set['handle']
    # put it into an array named label,
    # where 0 represents HillaryClinton,
    # 1 represents realDonaldTrump
    label = []
    for i in range(len(handle)):
        if handle[i] == "HillaryClinton":
            label.append(0)
        if handle[i] == "realDonaldTrump":
            label.append(1)
    label = np.asarray(label)
    return label

train_label = label_tweet(train_input)

In [4]:
# .values returns the underlying NumPy array (as_matrix() is deprecated in newer pandas)
train_corpus = train_input['tweet'].values
test_corpus = test_input['tweet'].values

## 2. Tokenize

Besides the tokenizer used later in this section, we also tried NLTK's TweetTokenizer, shown in the commented-out code below:

In [5]:
# from nltk.tokenize.casual import TweetTokenizer
# def tokenization(tknzr, tweet):
#     tweet = filter(lambda word: not re.match('[https://]', word), tweet)
#     tokens = tknzr.tokenize(tweet)
#     return tokens   
# tknzer = TweetTokenizer()
# train_corpus_tokenized = []
# for i in range(len(train_corpus)):
#     train_corpus_tokenized.append(tokenization(tknzer, train_corpus[i]))

However, this tokenizer did not produce clean tokens.
<p>We also tried character-level tokenization.</p>

In [6]:
# def tokenization(text):
#     return [i for i in text]

<p>[['T', 'h', 'e', 'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'i', 'n', 't', 'h', 'i', 's', 'e', 'l', 'e', 'c', 't', 'i', 'o', 'n', ':', 'W', 'h', 'o', 'c', 'a', 'n', 'p', 'u', 't', 't', 'h', 'e', 'p', 'l', 'a', 'n', 's', 'i', 'n', 't', 'o', 'a', 'c', 't', 'i', 'o', 'n', 't', 'h', 'a', 't', 'w', 'i', 'l', 'l', 'm', 'a', 'k', 'e', 'y', 'o', 'u', 'r', 'l', 'i', 'f', 'e', 'b', 'e', 't', 't', 'e', 'r', '?', 'h', 't', 't', 'p', 's', ':', '/', '/', 't', '.', 'c', 'o', '/', 'X', 'r', 'e', 'E', 'Y', '9', 'O', 'i', 'c', 'G'], ['L', 'a', 's', 't', 'n', 'i', 'g', 'h', 't', ',', 'D', 'o', 'n', 'a', 'l', 'd', 'T', 'r', 'u', 'm', 'p', 's', 'a', 'i', 'd', 'n', 'o', 't', 'p', 'a', 'y', 'i', 'n', 'g', 't', 'a', 'x', 'e', 's', 'w', 'a', 's', '"', 's', 'm', 'a', 'r', 't', '.', '"', 'Y', 'o', 'u', 'k', 'n', 'o', 'w', 'w', 'h', 'a', 't', 'I', 'c', 'a', 'l', 'l', 'i', 't', '?', 'U', 'n', 'p', 'a', 't', 'r', 'i', 'o', 't', 'i', 'c', '.', 'h', 't', 't', 'p', 's', ':', '/', '/', 't', '.', 'c', 'o', '/', 't', '0', 'x', 'm', 'B', 'f', 'j', '7', 'z', 'F'], ['I', 'f', 'w', 'e', 's', 't', 'a', 'n', 'd', 't', 'o', 'g', 'e', 't', 'h', 'e', 'r', ',', 't', 'h', 'e', 'r', 'e', "'", 's', 'n', 'o', 't', 'h', 'i', 'n', 'g', 'w', 'e', 'c', 'a', 'n', "'", 't', 'd', 'o', '.', 'M', 'a', 'k', 'e', 's', 'u', 'r', 'e', 'y', 'o', 'u', "'", 'r', 'e', 'r', 'e', 'a', 'd', 'y', 't', 'o', 'v', 'o', 't', 'e', ':', 'h', 't', 't', 'p', 's', ':', '/', '/', 't', '.', 'c', 'o', '/', 't', 'T', 'g', 'e', 'q', 'x', 'N', 'q', 'Y', 'm', 'h', 't', 't', 'p', 's', ':', '/', '/', 't', '.', 'c', 'o', '/', 'Q', '3', 'Y', 'm', 'b', 'b', '7', 'U', 'N', 'y'], ['B', 'o', 't', 'h', 'c', 'a', 'n', 'd', 'i', 'd', 'a', 't', 'e', 's', 'w', 'e', 'r', 'e', 'a', 's', 'k', 'e', 'd', 'a', 'b', 'o', 'u', 't', 'h', 'o', 'w', 't', 'h', 'e', 'y', "'", 'd', 'c', 'o', 'n', 'f', 'r', 'o', 'n', 't', 'r', 'a', 'c', 'i', 'a', 'l', 'i', 'n', 'j', 'u', 's', 't', 'i', 'c', 'e', '.', 'O', 'n', 'l', 'y', 'o', 'n', 'e', 'h', 'a', 'd', 'a', 'r', 'e', 'a', 'l', 
'a', 'n', 's', 'w', 'e', 'r', '.', 'h', 't', 't', 'p', 's', ':', '/', '/', 't', '.', 'c', 'o', '/', 's', 'j', 'n', 'E', 'o', 'k', 'c', 'k', 'i', 's'], ['J', 'o', 'i', 'n', 'm', 'e', 'f', 'o', 'r', 'a', '3', 'p', 'm', 'r', 'a', 'l', 'l', 'y', '-', 't', 'o', 'm', 'o', 'r', 'r', 'o', 'w', 'a', 't', 't', 'h', 'e', 'M', 'i', 'd', '-', 'A', 'm', 'e', 'r', 'i', 'c', 'a', 'C', 'e', 'n', 't', 'e', 'r', 'i', 'n', 'C', 'o', 'u', 'n', 'c', 'i', 'l', 'B', 'l', 'u', 'f', 'f', 's', ',', 'I', 'o', 'w', 'a', '!', 'T', 'i', 'c', 'k', 'e', 't', 's', ':', '\xe2', '\x80', '\xa6', 'h', 't', 't', 'p', 's', ':', '/', '/', 't', '.', 'c', 'o', '/', 'd', 'f', 'z', 's', 'b', 'I', 'C', 'i', 'X', 'c']]</p>

Above is the output of character-level tokenization, which makes the input vectors very long. The model trained on these tokens reached 90% accuracy, so this method does not necessarily benefit a classification model.
<p>Instead, we wrote a straightforward word-level tokenizer. It discards common stop words as well as URLs starting with 'https://', neither of which helps in identifying the author of a tweet.</p>
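To illustrate why character-level tokens inflate the input length, here is a minimal sketch comparing the two granularities on one sentence. The sample text is hypothetical, and plain `str.split` stands in for the NLTK tokenizer used in this notebook.

```python
# Hypothetical sample text; not taken from the dataset.
sample = "Make sure you are ready to vote"

# Character-level: one token per non-space character.
char_tokens = [ch for ch in sample if not ch.isspace()]

# Word-level: one token per whitespace-separated word.
word_tokens = sample.lower().split()

print(len(char_tokens))
print(len(word_tokens))
```

The character sequence is several times longer than the word sequence, so the RNN has to carry information across many more steps for the same tweet.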

In [7]:
# Load the stopwords
stop_words = stopwords.words('english')
# 'https' seems useless, so I add it to stop_words
stop_words.append(u'https')

In [8]:
def tokenization(text):
    tokens=[]
    for word in nltk.word_tokenize(text.decode('utf-8')):
        # skip URL fragments (tokens starting with '/'), tokens without any
        # letters (punctuation, pure digits) and stop words
        if not re.match('[//]', word) and re.search('[a-zA-Z]', word) and word.lower() not in stop_words:
            tokens.append(word.lower())
    return tokens

# Tokenize training set
train_corpus_tokenized = []
for i in train_corpus:
    train_corpus_tokenized.append(' '.join(tokenization(i)))

# Tokenize testing set
test_corpus_tokenized = []
for i in test_corpus:
    test_corpus_tokenized.append(' '.join(tokenization(i)))

print "After tokenization, training set and testing set look like:"
print(train_corpus_tokenized[:5])
print(test_corpus_tokenized[:5])

After tokenization, training set and testing set look like:
[u'question election put plans action make life better', u'last night donald trump said paying taxes smart know call unpatriotic', u"stand together 's nothing ca n't make sure 're ready vote", u"candidates asked 'd confront racial injustice one real answer", u'join 3pm rally tomorrow mid-america center council bluffs iowa tickets']
[u"could n't proud hillaryclinton vision command last night 's debate showed 's ready next potus", u"election important sit go make sure 're registered nationalvoterregistrationday -h", u'government people join movement today', u"national voterregistrationday make sure 're registered vote makeamericagreatagain\u2026", u'great afternoon little havana hispanic community leaders thank support imwithyou']


The tokenized tweets look good, but each tweet is still a single string. We need each tweet as a list of words so that every word can be counted and stored in a word dictionary.

The purpose of this step is to assign every word an ID so that each tweet can later be transformed into a vector of word IDs.

In [9]:
train_tokenized_word = []
for i in range(len(train_corpus_tokenized)):
    train_tokenized_word.append(tf.compat.as_str(train_corpus_tokenized[i]).split())

test_tokenized_word = []
for i in range(len(test_corpus_tokenized)):
    test_tokenized_word.append(tf.compat.as_str(test_corpus_tokenized[i]).split())
    
print "After tf.compat.as_str, training set and testing set look like:"
print(train_tokenized_word[:5])
print(test_tokenized_word[:5])

After tf.compat.as_str, training set and testing set look like:
[['question', 'election', 'put', 'plans', 'action', 'make', 'life', 'better'], ['last', 'night', 'donald', 'trump', 'said', 'paying', 'taxes', 'smart', 'know', 'call', 'unpatriotic'], ['stand', 'together', "'s", 'nothing', 'ca', "n't", 'make', 'sure', "'re", 'ready', 'vote'], ['candidates', 'asked', "'d", 'confront', 'racial', 'injustice', 'one', 'real', 'answer'], ['join', '3pm', 'rally', 'tomorrow', 'mid-america', 'center', 'council', 'bluffs', 'iowa', 'tickets']]
[['could', "n't", 'proud', 'hillaryclinton', 'vision', 'command', 'last', 'night', "'s", 'debate', 'showed', "'s", 'ready', 'next', 'potus'], ['election', 'important', 'sit', 'go', 'make', 'sure', "'re", 'registered', 'nationalvoterregistrationday', '-h'], ['government', 'people', 'join', 'movement', 'today'], ['national', 'voterregistrationday', 'make', 'sure', "'re", 'registered', 'vote', 'makeamericagreatagain\xe2\x80\xa6'], ['great', 'afternoon', 'little', 

## 3. Build dataset which assigns ids to each word and vectorize tweets

In [10]:
cnt = collections.Counter()
for i in range(len(train_tokenized_word)):
    for word in train_tokenized_word[i]:
        cnt[word] += 1

print 'Altogether there are: ' + str((len(cnt))) + ' words'

vocabulary_size = 10000

Altogether there are: 8507 words


Since vocabulary_size (10000) is larger than the 8507 distinct words, every word is kept. Word IDs therefore range from 1 to 8507, with 0 reserved for unknown words ('UNK'), and the more often a word appears, the smaller its ID. Next we build a function that assigns an ID to each word.

We used https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/examples/tutorials/word2vec/word2vec_basic.py as a reference while building this function.
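As a minimal sketch of the ID scheme described above, the following toy example ranks words by frequency and reserves ID 0 for unknown words. The corpus and the names `toy_tweets`, `toy_dictionary` and `encode` are hypothetical and only serve the illustration.

```python
import collections

# Hypothetical toy corpus of already-tokenized tweets.
toy_tweets = [["make", "america", "great"], ["make", "sure", "vote"], ["vote", "make"]]

counts = collections.Counter(word for tweet in toy_tweets for word in tweet)

# ID 0 is reserved for unknown words; the other IDs follow descending frequency.
toy_dictionary = {"UNK": 0}
for word, _ in counts.most_common():
    toy_dictionary[word] = len(toy_dictionary)

def encode(tweet, dictionary):
    """Map each word to its ID, falling back to 0 for unseen words."""
    return [dictionary.get(word, 0) for word in tweet]

encoded = encode(["make", "vote", "today"], toy_dictionary)
print(encoded)
```

Here "today" never occurs in the toy corpus, so it maps to 0, which is how words that appear only in the test set are handled.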

In [11]:
def build_dataset(cnt, words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(cnt.most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = []
    unk_count = 0
    
    for i in range(len(words)):
        inner_data = []
        for word in words[i]:
            index = dictionary.get(word, 0)
            if index == 0:  # dictionary['UNK']
                unk_count += 1
            inner_data.append(index)
        data.append(inner_data)
        
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

train_x, count, dictionary, reverse_dictionary = build_dataset(cnt, train_tokenized_word, vocabulary_size)

In [12]:
# Process testing data
test = []
for sentence in test_tokenized_word:
    cur = []
    for word in sentence:
        if(word in dictionary):
            cur.append(dictionary[word])
        else:
            cur.append(0)
    # to store corresponding label
    test.append([cur,[0, 0]])

test_length = len(test)
print "Testing data set size is: " + str((len(test))) + ", such as:"
print(test[:5])

Testing data set size is: 1701, such as:
[[[92, 6, 123, 56, 1097, 0, 44, 83, 2, 101, 1224, 2, 200, 160, 32], [0, 0]], [[108, 271, 2004, 50, 12, 148, 76, 616, 1765, 81], [0, 0]], [[1276, 9, 43, 212, 27], [0, 0]], [[118, 0, 12, 148, 76, 616, 22, 0], [0, 0]], [[5, 933, 244, 0, 2910, 606, 517, 4, 66, 168], [0, 0]]]


We shuffled the training dataset and split it 80/20 into an actual training set and a validation set.

In [13]:
# Process training data
train_all = [[train_x[i], [train_label[i], 1-train_label[i]]] for i in range(0, len(train_x))]

# shuffle the training set in which to pick training and validation sets
r_index = list(range(len(train_all)))
random.shuffle(r_index)
train = [train_all[i] for i in r_index[:int(len(r_index)*0.8)]]
valid = [train_all[i] for i in r_index[int(len(r_index)*0.8):]]

print "Training data set size is: " + str((len(train))) + ", such as:"
print(train[:5])
print "Validation data set size is: " + str((len(valid))) + ", such as:"
print(valid[:5])

Training data set size is: 3794, such as:
[[[102, 2192, 16, 213, 2371], [0, 1]], [[411, 142, 2093, 300, 1430, 907, 100, 9, 65, 846, 150, 1392, 8219, 1], [0, 1]], [[27, 2, 1593, 56, 271, 1314, 1015, 8268, 224, 1170, 224, 138, 626, 100, 46], [0, 1]], [[59, 7, 1, 69, 2303, 137, 31, 2, 69, 380, 1721, 31], [0, 1]], [[38, 11, 2, 216, 509, 3645, 46, 17, 2958, 955, 1958], [0, 1]]]
Validation data set size is: 949, such as:
[[[4, 6592, 1626, 187, 181, 197, 1719, 1881, 2, 653], [1, 0]], [[20, 53, 51, 354, 21, 42], [1, 0]], [[1, 2, 3317, 977, 3468, 772, 909, 121, 94, 1198, 7728, 59], [0, 1]], [[4157, 3058, 499, 3008, 431, 7328, 3417, 305, 4273, 7973, 1862], [0, 1]], [[103, 184, 6502, 128, 667], [0, 1]]]
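The split above is not seeded, so the set sizes stay the same between runs but the membership changes. A minimal sketch of a reproducible variant follows; the function name `split_indices` and the seed value are assumptions, not part of the original notebook.

```python
import random

def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle indices with a fixed seed and cut them into train/validation parts."""
    rng = random.Random(seed)          # private generator, global state untouched
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]

# 4743 is the number of training tweets reported earlier in this notebook.
train_idx, valid_idx = split_indices(4743)
print(len(train_idx))
print(len(valid_idx))
```

Using a private `random.Random` instance keeps the split stable even if other cells call `random.shuffle` in a different order.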


Tweets differ in length, but the vectorized tweets in a batch passed to the RNN must all have the same length. We therefore pad each batch, appending 0's to the shorter vectors until they match the longest one in the batch.
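The padding step on its own can be sketched in a few lines. The helper `pad_batch` and the sample IDs are hypothetical; unlike the iterator in this notebook, this sketch also returns each sequence's true length.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

# Hypothetical batch of word-ID vectors with different lengths.
batch = [[9, 45, 7453], [3, 15], [60]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```

Note that the pad value 0 is also the ID used for unknown words in this notebook, so the model cannot tell padding from an unknown word unless the true lengths are supplied.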

In [14]:
class SimpleDataIterator():
    def __init__(self, df):
        self.df = df
        self.size = len(self.df)
        self.epochs = 0
        self.shuffle()

    def shuffle(self):
        random.shuffle(self.df)
        self.cursor = 0

    def next_batch(self, n):
        if self.cursor + n > self.size:
            self.epochs += 1
            print("SimpleDataIterator epoch : ", self.epochs)
            self.shuffle()
        res = self.df[self.cursor : self.cursor + n]
        self.cursor += n
        return res

class PaddedDataIterator(SimpleDataIterator):
    def next_batch(self, n):
        if self.cursor + n > self.size:
            self.epochs += 1
            self.shuffle()
        res = self.df[self.cursor : self.cursor + n]
        self.cursor += n

        # Pad sequences with 0s so they are all the same length
        max_len = 0
        for row in res:
            if len(row[0]) > max_len:
                max_len = len(row[0])
        seqlen = np.array([max_len for i in range(len(res))])
        ret = []
        label = []
        for row in res:
            ret += [row[0] + [0] * (max_len - len(row[0]))]
            label.append(row[1][0])
        x = np.array(ret)
        y = np.array(label)

        return x, y, seqlen

Next, a quick check that the classes defined above work as intended.

In [15]:
data = SimpleDataIterator(valid)
d = data.next_batch(500)
print 'Input sequences look like this:' 
print d[:5]

Input sequences look like this:
[[[6, 130, 225, 5, 1009, 39, 1138, 4943, 16, 920], [0, 1]], [[7, 1, 90, 86, 20, 201, 1193, 1063, 13, 863, 49, 26, 20, 201, 87, 62, 6, 560], [0, 1]], [[1449, 8, 38, 292, 8264, 382, 334, 955, 2748, 13, 454, 2841], [1, 0]], [[196, 672, 776, 91, 1115, 661, 637, 431], [0, 1]], [[3968, 8, 916, 725, 194, 1, 9, 5], [1, 0]]]


In [16]:
data = PaddedDataIterator(train)
d = data.next_batch(3)
print 'Input sequences in one random batch is now like:'
print d[0]
print 'with shape of: '
print d[0].shape
print 'Corresponding labels are: '
print d[1]
print 'where 0 stands for Hillary and 1 for Trump.'

Input sequences in one random batch is now like:
[[   9   45 7453    3  104   19 7306 8189  193    0    0    0]
 [   3   15  470    7    1    2  219  188 2354 4097 2179 5113]
 [  60   85  125  110   20 6916   32 1594 1978  221 1943    0]]
with shape of: 
(3, 12)
Corresponding labels are: 
[0 0 0]
where 0 stands for Hillary and 1 for Trump.


The above batch is just an example. Every time we run the code snippet above, it provides a different batch example, but they all show that after padding, each batch of tweets has identical length. 

Now do the padding for the testing dataset.

In [17]:
def align(data):
    max_len = 0
    for row in data:
        if len(row[0]) > max_len:
            max_len = len(row[0])
    ret = []
    label = []
    for row in data:
        ret += [row[0] + [0]*(max_len - len(row[0]))]
        label.append(row[1][0])
    x = np.array(ret)
    y = np.array(label)
    seq_len = np.array([max_len for i in data])
    
    return x, y, seq_len

batch_size = 256
print 'test set length = %d' % test_length
print 'batch size = %d' % batch_size
print 'so there should be %d batches' % (test_length / batch_size + 1)
print ' '

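test_list = []
# The graph expects full batches, so extend the test set with a copy of its
# first batch_size rows; predictions beyond test_length are ignored later.
# Note: test_addlen is the same list object as test, so test itself grows.
test_addlen = test
test_addlen.extend(test[0:batch_size])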
for i in range(test_length / batch_size + 1):
    x, y, seq_len = align(test[i * batch_size : (i+1) * batch_size])
    print 'testing batch ' + str(i + 1) + ' complete, with the vector length equals ' + str(x.shape[1])
    test_list.append([x, y, seq_len])

test set length = 1701
batch size = 256
so there should be 7 batches
 
testing batch 1 complete, with the vector length equals 19
testing batch 2 complete, with the vector length equals 18
testing batch 3 complete, with the vector length equals 23
testing batch 4 complete, with the vector length equals 20
testing batch 5 complete, with the vector length equals 20
testing batch 6 complete, with the vector length equals 18
testing batch 7 complete, with the vector length equals 19


## 4. Build RNN model with a GRU cell

In this part we build an RNN model based on a GRU cell, a simpler gated unit that is commonly used as an alternative to LSTM.
<img src="LSTM3-var-GRU.png">
This image is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
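For readers who prefer code to diagrams, here is a minimal NumPy sketch of a single GRU update in the formulation of the linked post. The bias terms, the parameter names and the random toy inputs are assumptions for illustration; TensorFlow's `GRUCell` may arrange its gates and weights differently.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    """One GRU update for a single example (biases added as an assumption)."""
    concat = np.concatenate([h_prev, x])
    z = sigmoid(params["W_z"] @ concat + params["b_z"])   # update gate
    r = sigmoid(params["W_r"] @ concat + params["b_r"])   # reset gate
    concat_reset = np.concatenate([r * h_prev, x])
    h_tilde = np.tanh(params["W_h"] @ concat_reset + params["b_h"])  # candidate state
    # Blend the old state and the candidate according to the update gate.
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
input_size, state_size = 4, 3
params = {
    name: rng.normal(scale=0.1, size=(state_size, state_size + input_size))
    for name in ("W_z", "W_r", "W_h")
}
params.update({name: np.zeros(state_size) for name in ("b_z", "b_r", "b_h")})

# Run the cell over a toy sequence of 5 random input vectors.
h = np.zeros(state_size)
for x in rng.normal(size=(5, input_size)):
    h = gru_step(x, h, params)
print(h.shape)
```

Compared with an LSTM, the cell keeps a single state vector and two gates, which is why it has fewer parameters for the same state size.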

In [18]:
def reset_graph():
    if 'sess' in globals() and sess:
        sess.close()
    tf.reset_default_graph()

The model takes 4 parameters: vocabulary size, hidden state size, batch size and number of classes. Two of them, hidden state size and batch size, are hyperparameters that can be tuned.
<p>If the hidden state size is too small, the model cannot fit the data well; if it is too large, the model tends to overfit.</p>
<p>Batch size mainly affects running time: if it is too small, training takes longer.</p>
<p>We use AdamOptimizer as the optimizer. We also tried GradientDescentOptimizer, but the results were much worse. The learning rate of AdamOptimizer can be tuned as well. The default, 0.001, gave a Kaggle score of 2.18; we lowered it to 0.0001 to get a better result.</p>

In [19]:
# Build RNN model
def build_graph(
    vocab_size = len(dictionary),
    state_size = 60,
    batch_size = 256,
    num_classes = 2):

    reset_graph()

    # Placeholders
    x = tf.placeholder(tf.int32, [batch_size, None]) # [batch_size, num_steps]
    seqlen = tf.placeholder(tf.int32, [batch_size])
    y = tf.placeholder(tf.int32, [batch_size])

    # Embedding layer
    embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size])
    rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

    # RNN
    cell = tf.nn.rnn_cell.GRUCell(state_size)
    init_state = tf.get_variable('init_state', [1, state_size],
                                 initializer=tf.constant_initializer(0.0))
    init_state = tf.tile(init_state, [batch_size, 1])
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length=seqlen,
                                                 initial_state=init_state)

    # Obtain the last relevant output
    idx = tf.range(batch_size) * tf.shape(rnn_outputs)[1] + (seqlen - 1)
    last_rnn_output = tf.gather(tf.reshape(rnn_outputs, [-1, state_size]), idx)

    # finally use a Softmax layer to output a probability
    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
    logits = tf.matmul(last_rnn_output, W) + b
    preds = tf.nn.softmax(logits)
    
    # evaluate the model
    correct = tf.equal(tf.cast(tf.argmax(preds,1),tf.int32), y)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
   
    # optimizer, which could be tuned
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

    return {
        'x': x,
        'seqlen': seqlen,
        'y': y,
        'loss': loss,
        'ts': train_step,
        'preds': preds,
        'accuracy': accuracy
    }
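The "last relevant output" step in the graph flattens the RNN outputs and gathers one row per example. A minimal NumPy sketch of the same index arithmetic follows; the shapes and the `seqlen` values are hypothetical.

```python
import numpy as np

batch_size, num_steps, state_size = 3, 4, 2

# Stand-in for rnn_outputs with shape [batch_size, num_steps, state_size].
rnn_outputs = np.arange(batch_size * num_steps * state_size).reshape(
    batch_size, num_steps, state_size)
seqlen = np.array([2, 4, 1])  # hypothetical true lengths per example

# Flatten to [batch_size * num_steps, state_size]; row i * num_steps + t
# holds the output of example i at step t.
flat = rnn_outputs.reshape(-1, state_size)
idx = np.arange(batch_size) * num_steps + (seqlen - 1)
last = flat[idx]

# The same selection with direct fancy indexing, for comparison.
expected = rnn_outputs[np.arange(batch_size), seqlen - 1]
print(idx)
print(np.array_equal(last, expected))
```

As written, the iterators in this notebook appear to fill `seqlen` with the batch maximum for every row, so the gathered output is the one at the last padded step rather than at each tweet's true last word.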

## 5. Train the RNN model and make predictions

Train the RNN model until the accuracy converges. The number of epochs needed depends on the hyperparameters, such as batch size, optimizer learning rate and hidden state size.
<p>Note that the number of epochs does not have to be very large. If the training accuracy climbs as high as 0.99, the model is probably overfitting the training set and will fit the validation and test sets less well.</p>
<p>During training, the accuracies on the training set and the validation set are printed after each epoch.</p>
<p>After training, we use the model to make predictions on the test set.</p>

In [20]:
def train_graph(g, batch_size = 256, num_epochs = 15, iterator = PaddedDataIterator):
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tr = iterator(train)
        tv = iterator(valid)

        step, accuracy = 0, 0
        # despite their names, these lists store per-epoch accuracies, not losses
        tr_losses, tv_losses = [], []
        current_epoch = 0
        while current_epoch < num_epochs:
            step += 1
            batch = tr.next_batch(batch_size)
            feed = {g['x']: batch[0], g['y']: batch[1], g['seqlen']: batch[2]}
            accuracy_, _ = sess.run([g['accuracy'], g['ts']], feed_dict=feed)
            accuracy += accuracy_

            if tr.epochs > current_epoch:
                current_epoch += 1
                tr_losses.append(accuracy / step)
                step, accuracy = 0, 0

                # evaluate validation set
                tv_epoch = tv.epochs
                while tv.epochs == tv_epoch:
                    step += 1
                    batch = tv.next_batch(batch_size)
                    feed = {g['x']: batch[0], g['y']: batch[1], g['seqlen']: batch[2]}
                    accuracy_ = sess.run([g['accuracy']], feed_dict=feed)[0]
                    accuracy += accuracy_

                tv_losses.append(accuracy / step)
                step, accuracy = 0,0
                print 'Accuracy after epoch %d is: ' % current_epoch 
                print 'training: %f, validation: %f' % (tr_losses[-1], tv_losses[-1])
        print '---------- RNN training done! ----------'
            
        # make predictions with the current model
        # (fetch only 'preds'; running the train step here would update the
        # weights using the placeholder test labels)
        predictions = []
        for te in test_list:
            feed = {g['x']: te[0], g['y']: te[1], g['seqlen']: te[2]}
            preds_ = sess.run(g['preds'], feed_dict=feed)
            predictions.extend(preds_)

    return tr_losses, tv_losses, predictions

In [21]:
g = build_graph()
tr_losses, tv_losses, predictions = train_graph(g)

Accuracy after epoch 1 is: 
training: 0.507031, validation: 0.514648
Accuracy after epoch 2 is: 
training: 0.507812, validation: 0.519531
Accuracy after epoch 3 is: 
training: 0.525112, validation: 0.554688
Accuracy after epoch 4 is: 
training: 0.761998, validation: 0.786458
Accuracy after epoch 5 is: 
training: 0.839844, validation: 0.808594
Accuracy after epoch 6 is: 
training: 0.875558, validation: 0.845052
Accuracy after epoch 7 is: 
training: 0.873047, validation: 0.846354
Accuracy after epoch 8 is: 
training: 0.932199, validation: 0.888021
Accuracy after epoch 9 is: 
training: 0.940011, validation: 0.901042
Accuracy after epoch 10 is: 
training: 0.943080, validation: 0.889323
Accuracy after epoch 11 is: 
training: 0.944196, validation: 0.911458
Accuracy after epoch 12 is: 
training: 0.953125, validation: 0.895833
Accuracy after epoch 13 is: 
training: 0.954520, validation: 0.903646
Accuracy after epoch 14 is: 
training: 0.963170, validation: 0.901042
Accuracy after epoch 15 is: 


In [22]:
print 'Some predictions are like: '
print(predictions[:10])

Some predictions are like: 
[array([ 0.9805249 ,  0.01947507], dtype=float32), array([ 0.99372512,  0.00627485], dtype=float32), array([ 0.03518222,  0.96481776], dtype=float32), array([ 0.93461794,  0.06538212], dtype=float32), array([ 0.00934731,  0.99065268], dtype=float32), array([ 0.98951858,  0.01048141], dtype=float32), array([ 0.9609549 ,  0.03904507], dtype=float32), array([ 0.95169759,  0.0483024 ], dtype=float32), array([ 0.990026  ,  0.00997404], dtype=float32), array([ 0.60694057,  0.39305946], dtype=float32)]


## 7. Output the prediction

In [23]:
import csv
csvfile = open('csvtest.csv', 'wb')  # open() is preferred over the Python 2-only file()
writer = csv.writer(csvfile)
writer.writerow(['id', 'realDonaldTrump', 'HillaryClinton'])
data = []
for i in range(test_length):
    data.append((i, predictions[i][1], predictions[i][0]))

writer.writerows(data)
csvfile.close()
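For readers running Python 3, the same output step might be sketched as below. The function name `write_submission`, the toy probabilities and the temporary output path are assumptions; the column order follows the cell above.

```python
import csv
import os
import tempfile

def write_submission(path, probs):
    """Write (id, P(Trump), P(Clinton)) rows; probs[i] is [P(label 0), P(label 1)]."""
    # Python 3 form: text mode with newline=""; Python 2 would use mode "wb".
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "realDonaldTrump", "HillaryClinton"])
        for i, (p_clinton, p_trump) in enumerate(probs):
            writer.writerow([i, p_trump, p_clinton])

# Hypothetical probabilities for two tweets.
toy_probs = [[0.9, 0.1], [0.2, 0.8]]
out_path = os.path.join(tempfile.mkdtemp(), "toy_submission.csv")
write_submission(out_path, toy_probs)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

The `with` block closes the file even if a row fails to write, which the explicit `close()` call does not guarantee.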

## 8. Compare with logistic regression

In [24]:
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
#define vectorizer parameters
vectorizer = CountVectorizer(decode_error = 'ignore')
transformer = TfidfTransformer(norm = 'l2', use_idf = True)

# Apply TF-IDF to the training set
# Use fit_transform() to learn the matrix model
tfidf_matrix = transformer.fit_transform(vectorizer.fit_transform(train_corpus_tokenized))
X_train = tfidf_matrix.toarray()
print 'After TF-IDF, the input matrix is of size: ' + str(X_train.shape)

After TF-IDF, the input matrix is of size: (4743, 7656)


In [26]:
# Apply TF-IDF to the testing set
# Instead use transform() to apply the matrix model to the testing set
X_test = transformer.transform(vectorizer.transform(test_corpus_tokenized))
test_length = np.shape(X_test)[0]
print 'After TF-IDF, the input matrix is of size: ' + str(X_test.shape)

After TF-IDF, the input matrix is of size: (1701, 7656)
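To make the fit/transform distinction concrete, here is a small pure-Python sketch of TF-IDF weighting. It follows the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 with L2 normalization, but it skips CountVectorizer's own tokenization rules. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn IDF weights from the training documents only."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc.split()))
    return {word: math.log((1 + n) / (1 + d)) + 1 for word, d in df.items()}

def transform(doc, idf):
    """Weight a document with the learned IDF; words unseen in training are dropped."""
    tf = Counter(w for w in doc.split() if w in idf)
    weights = {w: c * idf[w] for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}

train_docs = ["make america great", "make sure vote"]
idf = fit_idf(train_docs)

# "today" does not occur in train_docs, so it gets no column.
vec = transform("vote today make", idf)
print(sorted(vec))
```

This is why the test matrix above has the same 7656 columns as the training matrix: the vocabulary and IDF weights come from the training set, and the test set is only projected onto them.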


In [27]:
# Apply logistic regression model to the training set
LR_model = LogisticRegression()
LR_model.fit(X_train, train_label)

# Use the model to make prediction, output as probabilities
predictions = LR_model.predict_proba(X_test)
print predictions

[[ 0.89305199  0.10694801]
 [ 0.90574438  0.09425562]
 [ 0.23665331  0.76334669]
 ..., 
 [ 0.32748839  0.67251161]
 [ 0.32560385  0.67439615]
 [ 0.45170627  0.54829373]]


In [28]:
# Output the data just like before
csvfile = open('csvtestLR.csv', 'wb')  # open() is preferred over the Python 2-only file()
writer = csv.writer(csvfile)
writer.writerow(['id', 'realDonaldTrump', 'HillaryClinton'])
data = []
for i in range(test_length):
    data.append((i, predictions[i][1], predictions[i][0]))

writer.writerows(data)
csvfile.close()

## 9. Reference and summary

During this project we consulted the following online resources:

- TensorFlow word2vec tutorial: https://www.tensorflow.org/tutorials/word2vec
- Andrej Karpathy's blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- Colah's blog, Understanding LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- r2rt blog: <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html">Recurrent Neural Networks in Tensorflow III - Variable Length Sequences</a>

We tuned the hyperparameters and found that the values specified in this notebook give our best result so far, a Kaggle score of 0.244.
<p>Lastly, to compare the model with 'traditional' machine learning models, we selected logistic regression as a representative. Its output also looks reasonable, and its Kaggle score is 0.327, somewhat worse than our RNN model. With a larger dataset, say more than 10,000 tweets, the RNN might outperform logistic regression by a wider margin.</p>