# Natural Language Processing for Signal Generation on News Data

In the following code, we use pretrained word embeddings from both GloVe and FastText Skipgram models to preprocess text datasets for an LSTM network.

### Load Packages and Initialize the Environment

In [41]:
import sklearn
import datetime
import pydot, graphviz
import numpy as np
import pandas as pd
from tqdm import tqdm
from datetime import date
from numpy.random import seed
from IPython.display import Image
from sklearn.model_selection import train_test_split, StratifiedKFold

In [42]:
import keras
import tensorflow as tf
from tensorflow import set_random_seed
from tensorflow.python import pywrap_tensorflow
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

In [43]:
from keras import layers
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import LSTM, concatenate, Bidirectional
from keras.layers import PReLU, ELU, LeakyReLU, GRU, SimpleRNN
from keras.layers import Input, Dense, Dropout, Activation, Embedding, BatchNormalization
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from keras.utils.vis_utils import plot_model

In [44]:
import nltk
nltk.download('stopwords')
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/marketlab/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [45]:
from tradingcore.utils import embedding_matrix

Fix the random state for reproducibility and define constants for later use. The seed calls are commented out in the cell below; uncomment them to make runs repeatable.

In [62]:
#seed(42)
#set_random_seed(42)
MAX_SEQUENCE_LENGTH = 256
EMBEDDING_DIM = 300

print("Keras version:",keras.__version__)
print("Tensorflow version:",tf.__version__)
print("Sklearn version:",sklearn.__version__)

Keras version: 2.2.4
Tensorflow version: 1.13.1
Sklearn version: 0.20.1


### Load Data - Financial News Dataset 
Contributors to this dataset viewed a news article headline and a short, bolded excerpt of a sentence or two from the accompanying article. They then decided whether the sentence in question gave an indication of the U.S. economy’s health, and rated that indication on a scale of 1-9, with 1 being negative and 9 being positive.
Source: https://www.figure-eight.com/data-for-everyone/


In [63]:
df = pd.read_csv("../data/us-economic-newspaper.csv",encoding = "ISO-8859-1")
df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,positivity,positivity:confidence,relevance,relevance:confidence,orig__golden,articleid,date,headline,lineid,next_sentence,positivity_gold,previous_sentence,relevance_gold,text
0,729487630,False,finalized,3,6/8/15 14:26,,0.0,not sure,0.3469,,109092213,2/23/93,Nasdaq Index Falls 1.7% But Dow Stocks Are Up:...,109092213_01,"The Nasdaq composite index, home of technology...",,,,The stock market accelerated its screeching sw...
1,729487631,False,finalized,3,6/11/15 10:58,6.0,0.3675,yes,1.0,,109092213,2/23/93,Nasdaq Index Falls 1.7% But Dow Stocks Are Up:...,109092213_02,"The bond market continued to rally, propelling...",,The stock market accelerated its screeching sw...,,"The Nasdaq composite index, home of technology..."
2,729487632,False,finalized,3,6/6/15 0:15,5.0,0.3416,yes,0.6771,,109092213,2/23/93,Nasdaq Index Falls 1.7% But Dow Stocks Are Up:...,109092213_03,The Nasdaq market was stricken by the collapse...,,"The Nasdaq composite index, home of technology...",,"The bond market continued to rally, propelling..."
3,729487633,False,finalized,3,6/14/15 20:27,4.0,0.6756,yes,1.0,,109092213,2/23/93,Nasdaq Index Falls 1.7% But Dow Stocks Are Up:...,109092213_04,uring the marketÛªs split ÛÓ was 4.26 percen...,,"The bond market continued to rally, propelling...",,The Nasdaq market was stricken by the collapse...
4,729487634,False,finalized,3,6/6/15 13:22,3.0,0.6509,yes,1.0,,109092213,2/23/93,Nasdaq Index Falls 1.7% But Dow Stocks Are Up:...,109092213_05,". ""It's a nervous market,Û said Lawrence R. ...",,The Nasdaq market was stricken by the collapse...,,uring the marketÛªs split ÛÓ was 4.26 percen...


**Split the dataset into testing and training datasets for machine learning**
* X and y respectively correspond to data features (i.e. input) and data labels (i.e. output)
  * The training set is the data used to "learn" the parameters of our model with a supervised learning method. It usually takes up the majority of the original dataset.
  * The testing set is the data used to evaluate the effectiveness of our model, often by producing numerical metrics (e.g. accuracy rate)

In [64]:
X = df.text
# Missing positivity scores are treated as neutral (5)
y = df.positivity.fillna(5)
# Bucket the 1-9 score into three classes:
# 0 = negative (< 5), 1 = neutral (== 5), 2 = positive (> 5)
y = pd.Series(np.select([y > 5, y == 5], [2, 1], default=0),
              index=y.index, dtype='int32')
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify = y,
                                                    test_size=0.10,
                                                    random_state=42)
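
The label bucketing in the cell above can be sanity-checked on individual scores. The sketch below is a standalone restatement of that rule; the function name `bucket_positivity` is hypothetical and is not part of the notebook.

```python
import math

def bucket_positivity(score, neutral=5):
    """Map a 1-9 positivity score to a class id: 0 = negative, 1 = neutral, 2 = positive."""
    # Missing scores (None or NaN) are treated as neutral, mirroring y.fillna(5).
    if score is None or (isinstance(score, float) and math.isnan(score)):
        score = neutral
    if score > neutral:
        return 2
    if score == neutral:
        return 1
    return 0

print([bucket_positivity(s) for s in [1, 4, 5, 6, 9, float("nan")]])
# → [0, 0, 1, 2, 2, 1]
```

Note that filling missing scores with 5 places every unrated sentence in the neutral class, which affects the class balance seen by the stratified split.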

### Preprocess Data
<img src="../imgs/preprocess_data.png">

**Tokenize training set**

* The Keras Tokenizer builds a vocabulary index from the training set, ordered by word frequency.
  * The tokenizer also removes special characters for us (the characters listed in word_filter)
  * Stop word removal could also be performed at this step
* Every unique word is assigned a unique integer value

In [65]:
word_filter = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

tokenizer = Tokenizer(num_words = None,
                      filters = word_filter,
                      lower = True,
                      split = " ",
                      char_level = False)

tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
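
To make the idea of a frequency-ordered vocabulary index concrete, here is a simplified pure-Python stand-in. It is a sketch of the behaviour described above, not the Keras implementation; the helper names `split_words`, `build_word_index` and `texts_to_sequences` are hypothetical.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text, filters=FILTERS):
    """Lowercase, replace filtered characters with spaces, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Assign integer ids starting at 1, most frequent word first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in split_words(text))
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its id; words outside the vocabulary are dropped."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]

train = ["The market fell.", "The market rose, sharply!"]
index = build_word_index(train)
print(index)
# → {'the': 1, 'market': 2, 'fell': 3, 'rose': 4, 'sharply': 5}
print(texts_to_sequences(["Stocks fell; the market rose"], index))
# → [[3, 1, 2, 4]]
```

Because the index is built from the training set only, words that appear solely in the test set (such as "stocks" here) have no id and are skipped.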

**Convert all datasets to numerical values**
* Apply the vocabulary index to X_train and X_test
  * The datasets are converted from texts to sequences of integers based on previously created vocabulary index
  * The sequences are padded with zeros and truncated to MAX_SEQUENCE_LENGTH so that they all have the same fixed length
* Convert y_train and y_test to one-hot encoded vectors

In [66]:
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train),
                        maxlen = MAX_SEQUENCE_LENGTH,
                        value = 0.0)

X_test = pad_sequences(tokenizer.texts_to_sequences(X_test),
                       maxlen = MAX_SEQUENCE_LENGTH,
                       value = 0.0)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
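
The padding and one-hot steps can be illustrated without Keras. The sketch below assumes the Keras defaults for `pad_sequences` (padding and truncating both at the front of the sequence); the helpers `pad_pre` and `one_hot` are hypothetical names used only for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` (maxlen must be positive)."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

def one_hot(label, num_classes):
    """Return a vector with 1.0 at position `label` and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

print(pad_pre([3, 1, 2], maxlen=5))               # → [0, 0, 3, 1, 2]
print(pad_pre([7, 8, 9, 10, 11, 12], maxlen=5))   # → [8, 9, 10, 11, 12]
print(one_hot(2, num_classes=3))                  # → [0.0, 0.0, 1.0]
```

Padding with 0 is safe because the vocabulary index starts at 1, so 0 never collides with a real word.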

### Build Embedding Matrix

In this project, we use pretrained word embeddings that are stored in files, so we have to build the embedding matrices from these files before we can use them. The vocabulary index created earlier with the tokenizer is used to build the matrices. **embedding_matrix** is a function defined in utils.py (imported above from tradingcore.utils) that builds the word embeddings from a file. It has the following parameters:
* path_to_embedding: path to text file of word embeddings
* embedding_dim: dimension of word embeddings
* word_index: dictionary mapping words to indices

The output is a numpy matrix containing the embeddings.

In [67]:
glove_embedding_matrix = embedding_matrix("../data/news_data/glove/glove.840B.300d.txt",
                                          EMBEDDING_DIM,
                                          tokenizer.word_index)

fasttext_embedding_matrix = embedding_matrix("../data/news_data/fasttext/wiki-news-300d-1M.vec",
                                             EMBEDDING_DIM,
                                             tokenizer.word_index)
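
The source of embedding_matrix is not shown in this notebook. As a rough idea of what such a helper may do, here is a minimal sketch under stated assumptions: each line of the file holds a word followed by its vector components separated by spaces, row 0 is reserved for padding, and words without a pretrained vector keep an all-zero row. The name `build_embedding_matrix_sketch` is hypothetical, and it takes an iterable of lines rather than a path so that it can be tried without the large embedding files.

```python
import numpy as np

def build_embedding_matrix_sketch(lines, embedding_dim, word_index):
    """Hypothetical stand-in for embedding_matrix; not the tradingcore implementation."""
    # One row per vocabulary id, plus row 0 for the padding value.
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
    for line in lines:
        # Split from the right so that tokens containing spaces stay intact.
        parts = line.rstrip().rsplit(" ", embedding_dim)
        if len(parts) != embedding_dim + 1:
            continue  # e.g. the "count dim" header line found in .vec files
        word, values = parts[0], parts[1:]
        row = word_index.get(word)
        if row is not None:
            matrix[row] = [float(v) for v in values]
    return matrix

toy_lines = ["2 3", "market 0.1 0.2 0.3", "unused 1.0 1.0 1.0"]
toy_index = {"market": 1, "fell": 2}
toy_matrix = build_embedding_matrix_sketch(toy_lines, 3, toy_index)
print(toy_matrix.shape)   # → (3, 3)
print(toy_matrix[1])
```

With a real file, the lines could come from an open file handle, for example `open(path, encoding="utf-8")`.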

### Neural Networks Model
Recall from the first notebook that a neural network can be treated as a composite function, with each layer being one of the component functions. To build our own neural network in Keras, all we have to do is pick the layers we want to use and compose (that is, chain) them. For example, the following network:
<img src="../imgs/layer_example.png">
will be encoded as:
<img src="../imgs/layer_code.png">
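
The composite-function view can also be shown with plain Python functions. This is a toy sketch with hand-picked weights, not Keras code; `dense`, `relu` and `compose` are hypothetical helpers.

```python
from functools import reduce

def dense(weights, bias):
    """Return a layer function computing W.x + b for a list-based vector x."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return layer

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

def compose(*layers):
    """Chain layers left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)

network = compose(
    dense([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # 2 inputs -> 2 hidden units
    relu,
    dense([[2.0, 1.0]], [0.5]),                     # 2 hidden units -> 1 output
)
print(network([3.0, 1.0]))   # → [5.5]
```

Each call in the Keras functional API plays the same role as one function in this chain: a layer object is applied to the output of the previous layer.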


**Network for processing sequential information**
<img src="../imgs/sequence_subnet.png">
The subnetwork_channel function defined below builds the subnet for processing sequential information. The sequential information in this project is the sentences, each represented as a sequence of embedded word vectors. Keras already provides LSTM, GRU and SimpleRNN classes that we can use as black boxes to represent the entire LSTM/GRU/RNN network.

Things you may want to know:
* Differences between LSTM/GRU/RNN (how to choose which one to use). These are typical tendencies, not guarantees:
    * Model performance: LSTM > GRU > RNN
    * Memory requirement: LSTM > GRU > RNN
    * Speed of training and predicting: RNN > GRU > LSTM
    * Rule of thumb: pick LSTM if you want better model performance. Pick RNN if you care more about speed or don't have enough computing resources.
* Overfitting: the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
* Dropout layer: randomly zeroes out some of the input values during training, which reduces overfitting.
* Batch Normalization layer: outputs a normalized version of its input, which prevents extreme values and reduces overfitting.
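
The two regularization layers described above can be sketched in NumPy. This is a simplified training-time view under stated assumptions: dropout rescales the surviving values by 1/(1 - rate), and batch normalization is shown without its learned scale and shift or its moving averages. The function names are hypothetical.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero a random fraction `rate` of the entries and rescale the rest by 1/(1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

def batch_norm_train(x, eps=1e-3):
    """Normalize each feature (column) to roughly zero mean and unit variance over the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(1024, 4))

dropped = dropout_train(x, rate=0.4, rng=rng)
normed = batch_norm_train(x)

print("fraction zeroed:", float((dropped == 0).mean()))
print("column means after batch norm:", normed.mean(axis=0).round(3))
print("column stds after batch norm:", normed.std(axis=0).round(3))
```

The rescaling in dropout keeps the expected value of each input unchanged, so nothing needs to be adjusted at prediction time when dropout is switched off.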

In [68]:
## Subchannel network for encoding sequential information
def subnetwork_channel(input_layer : layers,
                       RNN_architecture : str,
                       units : int,
                       dropout_rate : float) -> layers:
    """
    This function creates a sub network for encoding sequences.
    
    Inputs:
    input_layer - The input keras layer into the subnetwork
    RNN_architecture - Name of the RNN type to use
    units - Number of units in the RNN
    dropout_rate - dropout rate
    
    Outputs:
    batch - Batch Normalized output layer
    
    """
    assert RNN_architecture in ["LSTM", "GRU", "RNN"]
    
    dropout1 = Dropout(rate = dropout_rate)(input_layer)
    
    if RNN_architecture == "LSTM":
        rnn_layer = Bidirectional(LSTM(units = units, return_sequences = False))(dropout1)
    elif RNN_architecture == "GRU":
        rnn_layer = Bidirectional(GRU(units = units, return_sequences = False))(dropout1)
    elif RNN_architecture == "RNN":
        rnn_layer = Bidirectional(SimpleRNN(units = units, return_sequences = False))(dropout1)
    
    dropout2 = Dropout(rate = dropout_rate)(rnn_layer)
    batch = BatchNormalization()(dropout2)
    return batch

### Network for Generating Signal
This sub-network takes in the feature vector of a piece of text and predicts the probability of each of the three sentiment classes defined earlier (0 = negative, 1 = neutral, 2 = positive).

In [69]:
## Output layer network
def output_channel(input_layer : layers,
                   activation : str,
                   units : int,
                   dropout_rate : float) -> layers:
    """
    This function creates a sub network for outputting classification probabilities.
    
    Inputs:
    input_layer - The input keras layer into the subnetwork
    activation  - Name of the activation type to use
    units - Number of units in the Dense network
    dropout_rate - dropout rate
    
    Outputs:
    output - Softmax output layer
    
    """
    assert activation in ["ReLU","PReLU", "ELU", "LeakyReLU"]
    
    dense = Dense(units)(input_layer)
    
    if activation == "PReLU":
        act = PReLU()(dense)
    elif activation == "ELU":
        act = ELU()(dense)
    elif activation == "LeakyReLU":
        act = LeakyReLU()(dense)
    elif activation == "ReLU":
        # Reuse the Dense layer built above instead of creating a second, separate one
        act = Activation('relu')(dense)
    
        
    dropout = Dropout(rate = dropout_rate)(act)
    batch = BatchNormalization()(dropout)
    output = Dense(3,activation='softmax', name = "Output")(batch)
    
    return output
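
The final Dense layer uses a softmax activation, which turns three raw scores into probabilities that sum to 1. A minimal sketch with made-up scores follows; the class names mirror the label bucketing used earlier, and `softmax` here is a plain-Python illustration rather than the Keras function.

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max keeps exp() numerically stable."""
    highest = max(logits)
    exps = [math.exp(v - highest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

CLASS_NAMES = ["negative", "neutral", "positive"]

probs = softmax([0.2, -1.0, 1.5])   # illustrative raw scores, not model output
predicted = CLASS_NAMES[probs.index(max(probs))]
print([round(p, 3) for p in probs], predicted)
```

The predicted class is simply the index with the highest probability.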

### Put All Parts Together

In [90]:
## Define full model.
def define_model(RNN_architecture : str = "LSTM",
                 rnn_units : int = 256,
                 dense_units : int = 128,
                 dense_activation : str = "PReLU",
                 dropout_rate : float = 0.4) -> Model:
    """
    This function defines and compiles a Multichannel RNN for Sentiment Classification.
    
    Inputs:
    RNN_architecture - Name of the RNN type to use
    rnn_units - Number of units in the RNN
    dense_units - Number of units in the Dense network
    dense_activation  - Name of the activation type to use
    dropout_rate - dropout rate
    
    Outputs:
    model - A Keras model
    
    """
    # Input Layer
    shape = (MAX_SEQUENCE_LENGTH,)
    input1 = Input(shape = shape, name = "Main_input")
    
    # Channel 1 - GLoVe
    embedding1 = Embedding(len(word_index) + 1,
              EMBEDDING_DIM,
              weights=[glove_embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=False,
              input_shape=X_train.shape[1:], name = "GLoVe_Embedding")(input1)

    net1 = subnetwork_channel(embedding1,
                              RNN_architecture = RNN_architecture,
                              units = rnn_units,
                              dropout_rate = dropout_rate)
    
    # Channel 2 - Fast Text
    embedding2 = Embedding(len(word_index) + 1,
              EMBEDDING_DIM,
              weights=[fasttext_embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=False,
              input_shape=shape, name = "FastText_Embedding")(input1)

    net2 = subnetwork_channel(embedding2,
                              RNN_architecture = RNN_architecture,
                              units = rnn_units, 
                              dropout_rate = dropout_rate)
    
    # Merge
    merged = concatenate([net1,net2], name ="Merge")
    # Output channel
    output = output_channel(merged,
                            activation = dense_activation,
                            units = dense_units,
                            dropout_rate = dropout_rate)
    
    # Compile 
    model = Model(inputs = input1, outputs = output)
    model.compile(loss = 'categorical_crossentropy', optimizer = Adam(0.002), metrics = ['categorical_accuracy'])
    
    return model

### Initialize the Model
You can change the parameters here to try to get better performance.

In [94]:

model = define_model(RNN_architecture = "LSTM",
                     rnn_units= 256,
                     dense_units = 128,
                     dense_activation = "ReLU",
                     dropout_rate = 0.4)

# print out a summary of the model
model.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Main_input (InputLayer)         (None, 256)          0                                            
__________________________________________________________________________________________________
GLoVe_Embedding (Embedding)     (None, 256, 300)     5474700     Main_input[0][0]                 
__________________________________________________________________________________________________
FastText_Embedding (Embedding)  (None, 256, 300)     5474700     Main_input[0][0]                 
__________________________________________________________________________________________________
dropout_51 (Dropout)            (None, 256, 300)     0           GLoVe_Embedding[0][0]            
__________________________________________________________________________________________________
dropout_53

<img src="../imgs/multichannel-bidirectionalLSTM.png">
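
The parameter count reported for each Embedding layer in the summary above can be cross-checked by hand: an embedding layer stores one vector of EMBEDDING_DIM values per row, and the model allocates len(word_index) + 1 rows. The short calculation below recovers the vocabulary size from the reported figure.

```python
EMBEDDING_DIM = 300
embedding_params = 5474700   # "Param #" shown for each Embedding layer in the summary

rows = embedding_params // EMBEDDING_DIM
assert rows * EMBEDDING_DIM == embedding_params   # the count divides evenly

print("rows in each embedding matrix:", rows)   # → 18249
print("words in word_index:", rows - 1)         # → 18248
```

Since both embedding layers are created with trainable=False, these parameters are not updated during training.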

### Training
Fit the model to the training data

In [95]:
## Train
tensorboard = TensorBoard(log_dir='tasks/tensorboard/logs/2/')
model.fit(X_train, y_train, epochs = 30, batch_size = 1024, callbacks = [tensorboard])

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7fd7f39e6c50>

### Click [here](/tensorboard/) to start TensorBoard.

### Testing
Test the model on data that is not in the training set.

In [96]:
## Find the testing accuracy
val_loss, val_categorical_accuracy = model.evaluate(X_test, y_test)
print("Validation Accuracy: {:.1f}".format(val_categorical_accuracy * 100))

Validation Accuracy: 57.4


Our model achieved about 57% accuracy on the test set. According to research on sentiment analysis and classification, human raters may agree with each other only about 80% of the time. Sentiment is subjective: the conclusion a reader reaches depends on how they interpret the words, tone and phrasing of the text. Thus, even a model that matched its reference labels perfectly could still disagree with an individual human rater about 20% of the time.
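
For reference, the categorical accuracy metric compares the class with the highest predicted probability against the class marked in the one-hot label. The sketch below uses made-up predictions; `categorical_accuracy` is a plain-Python illustration, not the Keras metric itself.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def categorical_accuracy(y_true_one_hot, y_pred_probs):
    """Fraction of samples whose predicted class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_pred_probs))
    return hits / len(y_true_one_hot)

y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
y_pred = [[0.7, 0.2, 0.1],   # correct: class 0
          [0.1, 0.6, 0.3],   # correct: class 1
          [0.2, 0.5, 0.3],   # wrong: predicts class 1, label is class 2
          [0.1, 0.1, 0.8]]   # correct: class 2
print(categorical_accuracy(y_true, y_pred))   # → 0.75
```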

### Exercise: Re-tune Neural Network Parameters
Try experimenting with different parameters in the neural network.
In the function 'define_model':
* 'RNN_architecture' can be one of: "RNN", "GRU", "LSTM"
* 'rnn_units' is the number of units in the RNN
* 'dense_units' is the number of units in the dense network
* 'dense_activation' can be one of: "PReLU", "LeakyReLU", "ELU", "ReLU"
* 'dropout_rate' is the rate of dropout used throughout the network
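
One way to organize the experiments is to enumerate the combinations up front. The sketch below only builds the list of parameter dictionaries; the candidate values are illustrative assumptions, and `parameter_grid` is a hypothetical helper.

```python
from itertools import product

# Candidate values are examples only; adjust them to your compute budget.
search_space = {
    "RNN_architecture": ["RNN", "GRU", "LSTM"],
    "rnn_units": [128, 256],
    "dense_activation": ["PReLU", "ReLU"],
    "dropout_rate": [0.2, 0.4],
}

def parameter_grid(space):
    """Return one dict per combination of the candidate values."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

grid = parameter_grid(search_space)
print(len(grid))   # → 24
print(grid[0])
# → {'RNN_architecture': 'RNN', 'rnn_units': 128, 'dense_activation': 'PReLU', 'dropout_rate': 0.2}
```

Each dictionary can then be passed to the notebook's function as define_model(**params), followed by the same fit and evaluate steps used above.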