# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Keep in mind that the benchmark to beat is random chance, i.e. an accuracy better than 50%.

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predict whether a stock's return will be positive or negative on the same day as a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting was done. This is largely irrelevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.
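As a rough illustration of the location clean-up mentioned above, the sketch below strips a leading all-caps dateline. The pattern and the sample sentence are assumptions made for illustration, not taken from `news_reuters.csv`; inspect the real data before settling on a rule.

```python
import re

# Hypothetical dateline pattern: an all-caps location, optionally followed
# by "(Reuters)", then a hyphen. Adjust after inspecting the real data.
DATELINE = re.compile(r"^[A-Z][A-Z\s.,/]*(?:\(Reuters\)\s*)?-\s*")

def strip_dateline(sentence):
    """Remove a leading location marker if one is present."""
    return DATELINE.sub("", sentence, count=1)

# Invented sample sentence
sample = "NEW YORK (Reuters) - Apple Inc said on Monday it would buy back shares."
print(strip_dateline(sample))
```

Sentences that do not start with an all-caps location followed by a hyphen pass through unchanged.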

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared to the S&P and then judged on whether it is outperforming (positive value) or underperforming (negative value) the S&P. Each value is reported at a daily interval from 2004 to now.

Below is a diagram of the data in the JSON file. Note that there are three types of data: short (1-day return), mid (7-day return), and long (28-day return).

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  
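Once a term is chosen, the nested term, ticker, date layout can be flattened into one row per ticker and date. The sketch below uses a toy dictionary; the tickers, date strings, and values are invented, and the date format in `stockReturns.json` may differ.

```python
# Toy stand-in for the term -> ticker -> date layout; all values are invented.
returns = {
    "short": {
        "AAA": {"20100104": 0.012, "20100105": -0.004},
        "BBB": {"20100104": -0.020},
    },
}

def flatten_term(data, term):
    """Return sorted (ticker, date, value) rows for one term."""
    return [
        (ticker, date, value)
        for ticker, by_date in sorted(data[term].items())
        for date, value in sorted(by_date.items())
    ]

rows = flatten_term(returns, "short")
for row in rows:
    print(row)
```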

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs to be joined on the date and ticker: for each date of news publication, we want to join the corresponding corporation's news with its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.
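A minimal sketch of that join, written with plain dictionaries so it runs without the data files. The rows and the function name `join_news_with_labels` are hypothetical; with pandas the same idea is a merge on the two key columns.

```python
# Invented sample rows; the real data comes from the CSV and JSON files.
news = [
    {"ticker": "AAA", "pub_date": "20100104", "headline": "AAA wins contract"},
    {"ticker": "AAA", "pub_date": "20100106", "headline": "AAA names new CFO"},
    {"ticker": "BBB", "pub_date": "20100104", "headline": "BBB recalls product"},
]
returns = {("AAA", "20100104"): 0.012, ("BBB", "20100104"): -0.020}

def join_news_with_labels(news_rows, return_lookup):
    """Join on (ticker, pub_date) and attach the binary label y."""
    joined = []
    for row in news_rows:
        key = (row["ticker"], row["pub_date"])
        if key not in return_lookup:
            continue  # no return recorded for that ticker and day
        value = return_lookup[key]
        joined.append({**row, "return": value, "y": 1 if value >= 0 else 0})
    return joined

joined = join_news_with_labels(news, returns)
for row in joined:
    print(row["ticker"], row["pub_date"], row["y"])
```

News items without a matching return are dropped here, which is one possible way to handle days with no trading data.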


# Your models - RNN, CNN, and RNN+CNN

Your RNN model needs to take word inputs, embed them, encode them with an RNN layer, and finish with a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.
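To make the character-based input concrete, here is a small sketch of one-hot encoding a string character by character. The helper names and the convention of reserving index 0 for padding and 1 for unknown characters are assumptions for illustration; the class demonstration may do this differently.

```python
def build_char_index(texts):
    """Map each character to an integer; 0 is padding, 1 is unknown."""
    chars = sorted(set("".join(texts)))
    index = {c: i + 2 for i, c in enumerate(chars)}
    index["<UNK>"] = 1
    return index

def one_hot_chars(text, index, maxlen):
    """Return maxlen one-hot rows, right-padded with all-zero rows."""
    size = len(index) + 1
    rows = []
    for ch in text[:maxlen]:
        row = [0] * size
        row[index.get(ch, index["<UNK>"])] = 1
        rows.append(row)
    rows += [[0] * size for _ in range(maxlen - len(rows))]
    return rows

index = build_char_index(["ab"])
encoded = one_hot_chars("abz", index, maxlen=4)
for row in encoded:
    print(row)
```

The resulting matrix of shape (maxlen, vocabulary size) is the kind of input a 1D convolution slides over.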

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your Python notebook
3. Report the performance according to some metric
4. Compare the performance of all of these models in a table (precision and recall)
5. Look at your labeling and print out the underlying data compared to the labels. For each model, print out 2-3 examples of a good classification and a bad classification. Make an assertion about why your model does well or poorly on those outputs.
6. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.
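The last step can be read in more than one way. The sketch below takes one possible reading: pick the three test rows with the highest predicted probability of a positive return and average their actual returns with equal weight. The numbers and the function name `top_k_positive` are invented.

```python
def top_k_positive(prob_positive, actual_returns, k=3):
    """Pick the k rows with the highest P(positive) and average their returns."""
    order = sorted(range(len(prob_positive)),
                   key=lambda i: prob_positive[i], reverse=True)
    picks = order[:k]
    mean_return = sum(actual_returns[i] for i in picks) / len(picks)
    return picks, mean_return

# Invented model probabilities and actual relative returns
probs = [0.51, 0.48, 0.55, 0.50, 0.53]
actual = [0.010, -0.020, 0.030, 0.000, -0.005]

picks, mean_return = top_k_positive(probs, actual)
print(picks, mean_return)
```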

### Good luck!

### Load the required Libraries

In [30]:
# Utility libraries

import os
import pickle
import numpy as np
import pandas as pd
import re
import calendar
import warnings

# Preprocessing libraries

from sklearn.model_selection import train_test_split

# Keras imports

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential, Model
from keras.layers import (Bidirectional, Embedding, LSTM, Dense, Conv1D,
                          GlobalMaxPool1D, MaxPool1D, Dropout, Activation,
                          Flatten, Input, concatenate)
from keras.callbacks import ModelCheckpoint
from keras import backend as K

### Load the Data

In [3]:
reutersFile = 'news_reuters.csv'
stockFile = 'stockReturns.json'

rawX = pd.read_csv(reutersFile, header=None,
                   names=['ticker', 'company', 'pub_date', 'headline', 'first_sent', 'category'])
rawY = pd.read_json(stockFile)

### Reformat and Merge Data

In [4]:
def reformat_y_data(data, tickerType='mid'):
    """Convert stock data into binary positive/negative labels"""
    tmp = data[tickerType].apply(pd.Series)
    # inplace is not needed inside a method chain
    tmp = tmp.stack().rename('price').reset_index()
    tmp['y'] = np.where(tmp['price'] >= 0, 1, 0)
    tmp.rename(columns={'level_0': 'ticker', 'level_1': 'pub_date'}, inplace=True)
    return tmp

def clean_and_merge_data(X, Y):
    """Keep only the tickers in X that have stock data, then merge X with Y"""
    y_tickers = set(Y['ticker'])
    X = X.loc[X['ticker'].isin(y_tickers)]
    # Make sure data types are the same for merge
    # (use the X argument rather than the global rawX)
    Y['pub_date'] = Y['pub_date'].astype(X['pub_date'].dtype)
    Y['ticker'] = Y['ticker'].astype(X['ticker'].dtype)
    return X.merge(Y, on=['ticker', 'pub_date'], how='left')

In [5]:
cleanY = reformat_y_data(rawY, 'short')
merged = clean_and_merge_data(rawX, cleanY)

### Clean up text columns and tokenize data

In [6]:
def clean_text(sent):
    """Clean up text data by:
    
    1. Collapse repeated spaces into a single space
    2. Replace U.S. with United States so the U won't get deleted by
       the next replacement
    3. Remove all capitalized words at the beginning of the 
       sentence, since those are mostly places (e.g. NEW YORK)
    4. Remove unnecessary punctuation (hyphens and asterisks)
    5. Remove dates
    """
    monthStrings = list(calendar.month_name)[1:] + list(calendar.month_abbr)[1:]
    monthPattern = '|'.join(monthStrings)
    
    sent = re.sub(r' +', ' ', sent)
    sent = re.sub(r'U\.S\.', 'United States', sent)  # escape the dots so only the literal "U.S." matches
    sent = re.sub(r'^(\W?[A-Z\s\d]+\b-?)', '', sent)
    sent = re.sub(r'^ ?\W ', '', sent)
    sent = re.sub(r'({}) \d+'.format(monthPattern), '', sent)
    
    # replace double spaces one more time after previous cleaning 
    sent = re.sub(r' +', ' ', sent)
    return sent 
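A side note on the abbreviation step: in a regular expression an unescaped `.` matches any character, so the pattern for "U.S." is only literal when the dots are escaped. The short check below uses invented strings to show the difference.

```python
import re

loose = r"U.S."      # "." matches any single character
strict = r"U\.S\."   # "\." matches only a literal period

print(re.sub(loose, "United States", "USSR era bonds"))
print(re.sub(strict, "United States", "USSR era bonds"))
print(re.sub(strict, "United States", "U.S. stocks rally"))
```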

### Tokenize the data

In [7]:
def tokenize_sent(col):
    """Tokenize string into a sequence of words"""
    return [text_to_word_sequence(text, lower=False) for text in col]

def filt_to_one(x, random_state=10):
    """Filter dataset so that there is only one observation per day.
    
    If there is more than one record, will use the topStory record
    if one exists.  If one doesn't or there are 2 topStory records
    then it will randomly select one of the observations.
    """
    if x.shape[0] > 1:
        if 'topStory' in x['category'].unique():
            x = x.loc[x['category'] == 'topStory']
        if x.shape[0] > 1:
            x = x.sample(n=1, random_state=random_state)
    return x

### Clean up the Data

In [8]:
# Clean up text
merged['headline'] = merged.headline.apply(clean_text)
merged['first_sent'] = merged.first_sent.apply(clean_text)

# Turn sentences into tokens
merged['headline_token'] = tokenize_sent(merged.headline)
merged['first_sent_token'] = tokenize_sent(merged.first_sent)


# Get one record per company/day
finalData = merged.groupby(by=['ticker', 'pub_date']).apply(filt_to_one)

# Combine Headline and First Sentence into one text 
finalData['final_text'] = finalData['headline_token'] + finalData.first_sent_token

# Remove observations with missing stock price
finalData.dropna(inplace=True)

In [9]:
new_cols = ['ticker2', 'company', 'pub_date2', 'headline', 'first_sent', 'category', 'price', 'y', 'headline_token', 'first_sent_token', 'final_text']
finalData.columns = new_cols
finalData.reset_index(inplace=True)

In [10]:
X = finalData['headline'].values
y = finalData['y'].values

# 1. Create a train and test set, retaining the same test set for every model

In [11]:
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y)

In [14]:
MAX_NUM_WORDS = 200        # vocabulary cap passed to the Tokenizer
MAX_SEQUENCE_LENGTH = 100  # max number of words in a headline to use

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
train_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
word_index = tokenizer.word_index
y_train_labels = to_categorical(np.asarray(y_train))
print('Shape of data tensor:', train_data.shape)
print('Shape of label tensor:', y_train_labels.shape)

Shape of data tensor: (8542, 100)
Shape of label tensor: (8542, 2)
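With its default settings, `pad_sequences` pads at the front with zeros and, for sequences longer than `maxlen`, keeps the last `maxlen` items. The pure-Python helper below (`pad_pre` is a hypothetical name) mimics that default behaviour for a single sequence so the effect is easy to see.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 9, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=5)
print(short)  # [0, 0, 5, 9, 2]
print(long)   # [2, 3, 4, 5, 6]
```

Since headlines are much shorter than 100 words, most rows of the padded matrix are mostly zeros.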


In [15]:
# Reuse the tokenizer fitted on the training set so that word indices
# mean the same thing in both splits and match the embedding matrix
sequences_test = tokenizer.texts_to_sequences(X_test)
test_data = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
y_test_labels = to_categorical(np.asarray(y_test))
print('Shape of data tensor:', test_data.shape)
print('Shape of label tensor:', y_test_labels.shape)

Shape of data tensor: (2848, 100)
Shape of label tensor: (2848, 2)


### Load word embeddings

In [16]:
GLOVE_DIR = ''

embeddings_index = {}
# GloVe files are UTF-8 text: one word per line followed by its vector
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))


Found 400000 word vectors.


### Build Embedding Matrix

In [17]:
EMBEDDING_DIM = 100 # how big is each word vector

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in embedding index will be all-zeros.

In [18]:
from keras.layers import Embedding
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

## Model 1: RNN

In [19]:
def create_rnn_model(seq_input_len, embed_matrix, 
                     n_RNN_nodes, n_dense_nodes, 
                     recurrent_dropout=0.2, 
                     drop_out=.2, n_out=2):
    
    word_input = Input(shape=(seq_input_len,), name='word_input_layer')
        
    word_embeddings = Embedding(input_dim=embed_matrix.shape[0],
                                output_dim=embed_matrix.shape[1],
                                weights=[embed_matrix], 
                                mask_zero=True, 
                                name='word_embedding_layer')(word_input) 

    hidden_layer1 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=True, 
                                      recurrent_dropout=recurrent_dropout, 
                                      dropout=drop_out, name='hidden_layer1'))(word_embeddings)
    
    hidden_layer2 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=False, 
                                      recurrent_dropout=recurrent_dropout,
                                      dropout=drop_out, name='hidden_layer2'))(hidden_layer1)

    dense_layer = Dense(units=n_dense_nodes, activation='relu', name='dense_layer')(hidden_layer2)

    drop_out3 = Dropout(drop_out)(dense_layer)

    output_layer = Dense(units=n_out, activation='softmax',
                         name='output_layer')(drop_out3)

    model = Model(inputs=[word_input], outputs=output_layer)
    model.compile(loss='categorical_crossentropy', optimizer="adam", 
                  metrics=['accuracy', recall, precision])

    return model 

In [20]:
n_out = 2
nb_epoch = 1

### Define functions to calculate precision and recall

In [21]:
def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision


def recall(y_true, y_pred):
    """Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall
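The two Keras metrics above sum over both columns of the one-hot labels, so for a two-class softmax output they tend to land on the same number as accuracy. As a point of comparison, the sketch below computes precision and recall for the positive class only, from integer labels such as those obtained with an argmax over the predictions. The label lists are invented.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, from integer class labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# Invented labels: three true positives, one false positive, no false negatives
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

precision_value, recall_value = precision_recall(y_true, y_pred)
accuracy_value = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision_value, recall_value, accuracy_value)
```

Here the three numbers differ (0.75, 1.0, and 5/6), which is what makes per-class precision and recall informative alongside accuracy.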

In [23]:
rnn_model = create_rnn_model(seq_input_len=train_data.shape[-1],
                             embed_matrix=embedding_matrix, 
                             recurrent_dropout=.4, drop_out=.5,
                             n_RNN_nodes=500, n_dense_nodes=500, n_out=n_out)

In [26]:
def train_and_test_model(model, x_train, y_train, x_test, y_test, 
                         modelSaveName, modelSavePath='',
                         batch_size=128, epochs=2, validation_split=.1):

    """Train model, save weights, and evaluate on the test data"""
    
    model.summary()  # summary() prints directly and returns None
    
    filepath = os.path.join(modelSavePath, modelSaveName + '.hdf5')
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1)
    callbacks_list = [checkpoint]
    model.fit(x=x_train, y=y_train, batch_size=batch_size, 
              epochs=epochs, validation_split=validation_split, 
              callbacks=callbacks_list)
    
    score, acc, rec, prec = model.evaluate(x_test, y_test, batch_size=batch_size)
    return (model, acc, rec, prec)    

## 2.1 Show the architecture for model (RNN)

## 3.1 Report the performance according to some metric (RNN)

In [27]:
rnn_res = train_and_test_model(rnn_model, train_data, y_train_labels, test_data, y_test_labels, 'rnn_model',
                               epochs=nb_epoch)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
word_input_layer (InputLayer (None, 100)               0         
_________________________________________________________________
word_embedding_layer (Embedd (None, 100, 100)          1009700   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 1000)         2404000   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 1000)              6004000   
_________________________________________________________________
dense_layer (Dense)          (None, 500)               500500    
_________________________________________________________________
dropout_1 (Dropout)          (None, 500)               0         
_________________________________________________________________
output_layer (Dense)         (None, 2)                 1002      
Total para

## Model 2: CNN

## 2.2 Show the architecture for model (CNN)

## 3.2 Report the performance according to some metric (CNN)

In [31]:
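def vectorize_sentences(data, lexicon, maxlen=200):
    """One-hot encode each string character by character and pad to maxlen"""
    X = []
    # Use the lexicon argument rather than the global char_indices;
    # row i of one_hot is the one-hot vector for index i
    one_hot = np.eye(len(lexicon) + 1)
    for sentences in data:
        x = [lexicon[token] if token in lexicon else lexicon['<UNK>']
             for token in sentences]
        X.append(one_hot[x])
    return pad_sequences(X, maxlen=maxlen)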

def create_cnn_model(char_maxlen, vocab_size,
                     nb_filter=100, filter_kernels = [4] * 4,
                     pool_size=3, n_dense_nodes=100,
                     drop_out=.2, n_out=2):

    inputs = Input(shape=(char_maxlen, vocab_size), name='char_input_layer')

    conv1 = Conv1D(nb_filter, kernel_size=filter_kernels[0],
                  padding='valid', activation='relu',
                  input_shape=(char_maxlen, vocab_size))(inputs)
    
    maxpool1 = MaxPool1D(pool_size=pool_size)(conv1)

    conv2 = Conv1D(nb_filter, kernel_size=filter_kernels[1],
                          padding='valid', activation='relu')(maxpool1)
    maxpool2 = MaxPool1D(pool_size=pool_size)(conv2)

    conv3 = Conv1D(nb_filter, kernel_size=filter_kernels[2],
                          padding='valid', activation='relu')(maxpool2)

    conv4 = Conv1D(nb_filter, kernel_size=filter_kernels[3],
                          padding='valid', activation='relu')(conv3)

    maxpool3 = MaxPool1D(pool_size=pool_size)(conv4)
    flatten = Flatten()(maxpool3)

    dense_layer = Dense(n_dense_nodes, activation='relu')(flatten)
    dropout = Dropout(drop_out)(dense_layer)

    output_layer = Dense(n_out, activation='softmax', name='output')(dropout)

    model = Model(inputs=inputs, outputs=output_layer)

    model.compile(loss='categorical_crossentropy', optimizer="adam", 
                  metrics=['accuracy', recall, precision])    
    return model 

char_maxlen = 1024 
nb_filter = 128
dense_outputs = 1024
filter_kernels = [7, 5, 5, 3]
pool_size = 5

# Turn all tokens into one string and then all obs 
# into one overall string
trainTokensAsString = X_train
testTokensAsString = X_test
oneTxt = ' '.join(trainTokensAsString)

# Get info about characters
chars = sorted(set(oneTxt))  # sorted so the character-to-index mapping is the same on every run
vocab_size = len(chars) + 1
print('total chars:', vocab_size)
char_indices = dict((c, i + 2) for i, c in enumerate(chars))
indices_char = dict((i + 2, c) for i, c in enumerate(chars))

char_indices['<UNK>'] = 1
indices_char[1] = '<UNK>'

trainCharData = vectorize_sentences(trainTokensAsString, char_indices, char_maxlen)
testCharData = vectorize_sentences(testTokensAsString, char_indices, char_maxlen)

trainCharData.shape

testCharData.shape

char_maxlen

cnn_model = create_cnn_model(char_maxlen=char_maxlen, 
                             vocab_size=vocab_size,
                             nb_filter=nb_filter, 
                             filter_kernels=filter_kernels,
                             pool_size=pool_size, 
                             n_dense_nodes=dense_outputs,
                             drop_out=.5, 
                             n_out=n_out)

cnn_res = train_and_test_model(cnn_model, trainCharData[:, :, 1:],
                               y_train_labels, 
                               testCharData[:, :, 1:], 
                               y_test_labels, 
                               'cnn_model',
                               epochs=nb_epoch)

total chars: 92
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
char_input_layer (InputLayer (None, 1024, 92)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1018, 128)         82560     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 203, 128)          0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 199, 128)          82048     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 39, 128)           0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 35, 128)           82048     
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 33, 128)           49280
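The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes stride-1 convolutions with 'valid' padding and pooling with a stride equal to the pool size, which matches the arguments passed to `create_cnn_model`; the helper names are made up for this example.

```python
def conv_len(length, kernel):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel + 1

def pool_len(length, pool):
    """Output length of MaxPool1D when the stride equals the pool size."""
    return length // pool

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * in_channels * filters + filters

# (layer type, size) in the order used by the character CNN, up to conv4
layers = [("conv", 7), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5), ("conv", 3)]

length = 1024
lengths = []
for kind, size in layers:
    length = conv_len(length, size) if kind == "conv" else pool_len(length, size)
    lengths.append(length)

print(lengths)                   # [1018, 203, 199, 39, 35, 33]
print(conv_params(7, 92, 128))   # 82560
print(conv_params(5, 128, 128))  # 82048
print(conv_params(3, 128, 128))  # 49280
```

These values agree with the sequence lengths and parameter counts printed for the first four convolution layers.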

## Model 3: RNN+CNN

In [32]:
def create_cnn_rnn_model(rnn_input_len, char_maxlen, vocab_size,
                         embed_matrix, n_RNN_nodes, 
                         nb_filter=100, filter_kernels = [4] * 4,
                         pool_size=3, n_dense_nodes=100,
                         recurrent_dropout=0.2, 
                         drop_out=.2, n_out=2):
    
    word_input = Input(shape=(rnn_input_len,), name='word_input_layer')
    char_input = Input(shape=(char_maxlen, vocab_size), name='char_input_layer')
    
    word_embeddings = Embedding(input_dim=embed_matrix.shape[0],
                                output_dim=embed_matrix.shape[1],
                                weights=[embed_matrix], 
                                mask_zero=True, 
                                name='word_embedding_layer')(word_input) 

    rnn_output1 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=True, 
                                      recurrent_dropout=recurrent_dropout, 
                                      dropout=drop_out, name='hidden_layer1'))(word_embeddings)
    
    rnn_output2 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=False, 
                                      recurrent_dropout=recurrent_dropout,
                                      dropout=drop_out, name='hidden_layer2'))(rnn_output1)
            
    conv1 = Conv1D(nb_filter, kernel_size=filter_kernels[0],
                  padding='valid', activation='relu',
                  input_shape=(char_maxlen, vocab_size))(char_input)

    maxpool1 = MaxPool1D(pool_size=pool_size)(conv1)

    conv2 = Conv1D(nb_filter, kernel_size=filter_kernels[1],
                          padding='valid', activation='relu')(maxpool1)
    maxpool2 = MaxPool1D(pool_size=pool_size)(conv2)

    conv3 = Conv1D(nb_filter, kernel_size=filter_kernels[2],
                          padding='valid', activation='relu')(maxpool2)

    conv4 = Conv1D(nb_filter, kernel_size=filter_kernels[3],
                          padding='valid', activation='relu')(conv3)

    maxpool3 = MaxPool1D(pool_size=pool_size)(conv4)
    cnn_output = Flatten()(maxpool3)

    merged_layer = concatenate([cnn_output, rnn_output2])
    
    dense_layer1 = Dense(n_dense_nodes, activation='relu', name='dense_layer')(merged_layer)
    drop_out1 = Dropout(drop_out)(dense_layer1)
    dense_layer2 = Dense(n_dense_nodes, activation='relu')(drop_out1)
    drop_out2 = Dropout(drop_out)(dense_layer2)
    
    main_output = Dense(n_out, activation='softmax', name='output_layer')(drop_out2)

    model = Model(inputs=[word_input, char_input], outputs=[main_output])

    model.compile(loss='categorical_crossentropy', optimizer="adam", 
                  metrics=['accuracy', recall, precision])    

    return model 

In [33]:
cnn_rnn_model = create_cnn_rnn_model(rnn_input_len=train_data.shape[-1], 
                                     char_maxlen=char_maxlen, 
                                     vocab_size=vocab_size,
                                     embed_matrix=embedding_matrix, 
                                     n_RNN_nodes=500,
                                     nb_filter=nb_filter, 
                                     filter_kernels=filter_kernels,
                                     pool_size=pool_size, 
                                     n_dense_nodes=400,
                                     recurrent_dropout=0.4, 
                                     drop_out=.5, 
                                     n_out=n_out)

## 2.3 Show the architecture for model (RNN+CNN)

## 3.3 Report the performance according to some metric (RNN+CNN)

In [34]:
cnn_rnn_res = train_and_test_model(cnn_rnn_model, 
                               [train_data, trainCharData[:, :, 1:]],
                               y_train_labels, 
                               [test_data, testCharData[:, :, 1:]],
                               y_test_labels, 
                               'cnn_rnn_model',
                               epochs=nb_epoch)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input_layer (InputLayer)   (None, 1024, 92)     0                                            
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 1018, 128)    82560       char_input_layer[0][0]           
__________________________________________________________________________________________________
max_pooling1d_4 (MaxPooling1D)  (None, 203, 128)     0           conv1d_6[0][0]                   
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 199, 128)     82048       max_pooling1d_4[0][0]            
__________________________________________________________________________________________________
max_poolin

## 4. Compare the performance of all of the models in a table (precision and recall)

In [35]:
pd.DataFrame.from_records([cnn_res[1:4], rnn_res[1:4], cnn_rnn_res[1:4]], 
                          columns=['accuracy', 'recall', 'precision'], 
                         index=['cnn_mod', 'rnn_mod', 'cnn_rnn_mod'])

| | accuracy | recall | precision |
|---|---|---|---|
| cnn_mod | 0.5 | 0.5 | 0.5 |
| rnn_mod | 0.500351 | 0.500351 | 0.500351 |
| cnn_rnn_mod | 0.502809 | 0.502809 | 0.502809 |


## 5. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.

In [36]:
def print_classifications(classifications, classType, test_y, test_text):
    texts = [''.join(sent) for sent in test_text[classifications]]
    stock_movements = np.where(test_y[classifications], 'positive', 'negative')
    
    print('Examples of {} predictions:\n'.format(classType))
    for i in range(len(texts)):
        print('Stock movement was {}'.format(stock_movements[i]))
        print('News info:\n{}'.format(texts[i]))
        print('')

In [37]:
def predict_and_print_samples(model, modelName, test_x, test_y=y_test, test_text=X_test):
    """Print out predictions of the model"""
    print('Stats for {} model'.format(modelName))
    
    res = model.predict(test_x)
    class_res = np.argmax(res, axis=1)

    comparisons = class_res == test_y
    comparisons = pd.DataFrame(comparisons)
    
    good_class = comparisons.loc[comparisons[0] == True].index[0:3]
    bad_class = comparisons.loc[comparisons[0] == False].index[0:3]

    print_classifications(good_class, 'correct', test_y, test_text)
    print_classifications(bad_class, 'INcorrect', test_y, test_text)
    
    # Use the test_y argument rather than the global y_test
    y_test_df = pd.DataFrame(test_y)
    
    top3MostProbPosArg = np.argsort(res[:, 1])[-3:]
    top3Y = y_test_df.iloc[top3MostProbPosArg]
    top3Probs = pd.Series(res[top3MostProbPosArg, 1], index=top3Y.index)
    top3Data = pd.concat([top3Y, top3Probs], axis=1)
    top3Data.columns = ['Actual', 'PositiveProb']
    print('')
    print('Top 3 Most Positive Probability:')
    print(top3Data)

In [38]:
predict_and_print_samples(rnn_res[0], 'RNN', test_data)

predict_and_print_samples(cnn_res[0], 'CNN', testCharData[:, :, 1:])

predict_and_print_samples(cnn_rnn_res[0], 'CNN_RNN', [test_data, testCharData[:, :, 1:]])

Stats for RNN model
Examples of correct predictions:

Stock movement was negative
News info:
Mediobanca CEO says is serene despite probe 

Stock movement was negative
News info:
regulators tell two insurers and GE Capital to upgrade resolution plans 

Stock movement was negative
News info:
Lloyds eyes return to private hands in next 12 months 

Examples of INcorrect predictions:

Stock movement was positive
News info:
Deals of the day- Mergers and acquisitions 

Stock movement was positive
News info:
Einhorn sues Apple marks biggest investor challenge in years 

Stock movement was positive
News info:
Qatar regulator reprimands RBS for insufficient staff training 


Top 3 Most Positive Probability:
      Actual  PositiveProb
1144     1.0      0.500726
176      0.0      0.501730
1874     1.0      0.502338
Stats for CNN model
Examples of correct predictions:

Stock movement was negative
News info:
Mediobanca CEO says is serene despite probe 

Stock movement was negative
News info:
regulat

### ASSERTION for RNN

#### Examples of correct predictions:

**Stock movement was negative.** News info: "Mediobanca CEO says is serene despite probe"

A probe is under way against Mediobanca, and the model predicted a negative stock movement, which makes sense.

**Stock movement was negative.** News info: "regulators tell two insurers and GE Capital to upgrade resolution plans"

Regulators are pressing GE Capital, and the model predicted a negative stock movement, which makes sense.

**Stock movement was negative.** News info: "Lloyds eyes return to private hands in next 12 months"

Lloyds returning to private hands may allude to existing problems, and the model predicted a negative stock movement, which makes sense.

#### Examples of INcorrect predictions:

**Stock movement was positive.** News info: "Deals of the day- Mergers and acquisitions"

Not every merger or acquisition can be considered positive. The model predicted a negative movement, but the stock actually rose.

**Stock movement was positive.** News info: "Einhorn sues Apple marks biggest investor challenge in years"

Einhorn is suing Apple, which reads as bad news, so the model predicted a negative movement. The stock actually rose, so the prediction is incorrect.

**Stock movement was positive.** News info: "Qatar regulator reprimands RBS for insufficient staff training"

A regulator is reprimanding RBS, which reads as bad news, so the model predicted a negative movement. The stock actually rose, so the prediction is incorrect.

### ASSERTION for CNN

#### Examples of correct predictions:

**Stock movement was negative.** News info: "Mediobanca CEO says is serene despite probe"

A probe is under way against Mediobanca, and the model predicted a negative stock movement, which makes sense.

**Stock movement was negative.** News info: "regulators tell two insurers and GE Capital to upgrade resolution plans"

Regulators are pressing GE Capital, and the model predicted a negative stock movement, which makes sense.

**Stock movement was negative.** News info: "Lloyds eyes return to private hands in next 12 months"

Lloyds returning to private hands may allude to existing problems, and the model predicted a negative stock movement, which makes sense.

#### Examples of INcorrect predictions:

**Stock movement was positive.** News info: "Deals of the day- Mergers and acquisitions"

Not every merger or acquisition can be considered positive. The model predicted a negative movement, but the stock actually rose.

**Stock movement was positive.** News info: "Einhorn sues Apple marks biggest investor challenge in years"

Einhorn is suing Apple, which reads as bad news, so the model predicted a negative movement. The stock actually rose, so the prediction is incorrect.

**Stock movement was positive.** News info: "Qatar regulator reprimands RBS for insufficient staff training"

A regulator is reprimanding RBS, which reads as bad news, so the model predicted a negative movement. The stock actually rose, so the prediction is incorrect.

### ASSERTION for CNN_RNN

#### Examples of correct predictions:

**Stock movement was negative.** News info: "Mediobanca CEO says is serene despite probe"

A probe is under way against Mediobanca, and the model predicted a negative stock movement, which makes sense.

**Stock movement was negative.** News info: "regulators tell two insurers and GE Capital to upgrade resolution plans"

Regulators are pressing GE Capital, and the model predicted a negative stock movement, which makes sense.

**Stock movement was negative.** News info: "Lloyds eyes return to private hands in next 12 months"

Lloyds returning to private hands may allude to existing problems, and the model predicted a negative stock movement, which makes sense.

#### Examples of INcorrect predictions:

**Stock movement was positive.** News info: "Deals of the day- Mergers and acquisitions"

Not every merger or acquisition can be considered positive. The model predicted a negative movement, but the stock actually rose.

**Stock movement was positive.** News info: "Einhorn sues Apple marks biggest investor challenge in years"

Einhorn is suing Apple, which reads as bad news, so the model predicted a negative movement. The stock actually rose, so the prediction is incorrect.

**Stock movement was positive.** News info: "Qatar regulator reprimands RBS for insufficient staff training"

A regulator is reprimanding RBS, which reads as bad news, so the model predicted a negative movement. The stock actually rose, so the prediction is incorrect.