# Natural Language Processing for Signal Generation on News Data

In the following code, we use pretrained embeddings from the GloVe and FastText skip-gram models to preprocess text datasets for an LSTM network.

### Load Packages and Initialize the Environment

In [1]:
import sklearn
import numpy as np
import pandas as pd
from tqdm import tqdm
from datetime import date
from numpy.random import seed
from IPython.display import Image
from sklearn.model_selection import train_test_split, StratifiedKFold

In [2]:
import keras
import tensorflow as tf
from tensorflow import set_random_seed
from tensorflow.python import pywrap_tensorflow
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

Using TensorFlow backend.


In [3]:
import nltk
nltk.download('stopwords')
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/marketlab/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Set the random seeds so that runs are reproducible, and define the constants used later on.

In [4]:
seed(42)
set_random_seed(42)
MAX_SEQUENCE_LENGTH = 32
EMBEDDING_DIM = 300

print("Keras version:",keras.__version__)
print("Tensorflow version:",tf.__version__)
print("Sklearn version:",sklearn.__version__)

Keras version: 2.2.4
Tensorflow version: 1.13.1
Sklearn version: 0.20.1
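
To illustrate why the seeds are fixed, here is a minimal sketch using only Python's standard `random` module (not NumPy or TensorFlow, which the notebook actually seeds). The helper name `draw` is hypothetical. Generators started from the same seed produce the same sequence of numbers.

```python
import random


def draw(seed_value, n=3):
    """Return n pseudo-random floats from a generator started at seed_value."""
    rng = random.Random(seed_value)
    return [rng.random() for _ in range(n)]


# The same seed reproduces the same draws.
print(draw(42) == draw(42))  # True
# A different seed gives a different sequence.
print(draw(42) == draw(7))
```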


### Load Data - Financial News Dataset 

We use open-source news data from Bloomberg and Reuters published between 2006 and 2012.


In [5]:
df = pd.read_csv("../data/news_data/news_data_labelled.csv")
df.head()

Unnamed: 0,headline,timestamp,url,tldr,Class
0,Exxon Mobil offers plan to end Alaska dispute,2006-10-20 06:15:00,http://www.reuters.com/article/2006/10/20/busi...,In a proposal sent earlier this week to the Al...,2
1,"Hey buddy, can you spare $600 for a Google share?",2006-10-20 04:25:00,http://www.reuters.com/article/2006/10/20/busi...,SAN FRANCISCO/NEW YORK (Reuters) - Wall Stree...,1
2,Ford posts biggest loss in 14 years,2006-10-23 06:42:00,http://www.reuters.com/article/2006/10/23/us-a...,Ford also said it was considering raising new ...,1
3,Shell looks to buy out Canada unit for C$7.7 b...,2006-10-23 04:34:00,http://www.reuters.com/article/2006/10/23/us-e...,"In July, Shell Canada rattled the industry and...",1
4,"U.S. venture investors betting on energy, Web 2.0",2006-10-23 08:36:00,http://www.reuters.com/article/2006/10/23/us-f...,SAN FRANCISCO (Reuters) - U.S. venture capita...,1


In [15]:
#print("Starting timestamp: {}".format(df.timestamp.min()))
#print("Ending timestamp: {}".format(df.timestamp.max()))
# Count the rows in each class (the dataset has 136,735 labelled rows).
# value_counts() must be called on the Series, inside print().
print(df["Class"].value_counts())

**Split the dataset into training and testing sets for machine learning**
* X and y correspond to the data features (i.e. input) and the data labels (i.e. output), respectively
* The training set is the data used to "learn" the model parameters with a supervised learning method. It usually takes up the majority of the original dataset (here 90%).
* The testing set is the data used to evaluate the model, typically by computing numerical metrics (e.g. accuracy). Here it is the remaining 10%, and `stratify = y` keeps the class proportions the same in both sets.

In [7]:
X = df.tldr
y = df.Class
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify = y,
                                                    test_size=0.10,
                                                    random_state=42)
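
The idea behind a stratified split can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions, not scikit-learn's implementation. The function name `stratified_split` and the toy class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.10, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


toy_labels = [0] * 20 + [1] * 50 + [2] * 30  # hypothetical class counts
train_idx, test_idx = stratified_split(toy_labels)
print(Counter(toy_labels[i] for i in test_idx))
```

Each class contributes 10% of its rows to the test set, so a rare class cannot end up missing from either part.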

### Preprocess Data
<img src="../imgs/preprocess_data.png">

**Tokenize training set**

* The Keras Tokenizer builds a vocabulary index from the training set, ranked by word frequency.
  * The Tokenizer also lowercases the text and removes the special characters listed in `filters`
  * Stop word removal could also be performed at this step
* Every unique word is assigned a unique integer value

In [9]:
word_filter = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

tokenizer = Tokenizer(num_words = None,
                      filters = word_filter,
                      lower = True,
                      split = " ",
                      char_level = False)

tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
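
What a frequency-ranked vocabulary index looks like can be sketched in plain Python. This mimics the behaviour described above and is not the Keras implementation. The helper names `simple_tokenize` and `build_word_index` and the toy sentences are hypothetical.

```python
from collections import Counter

# Same filter characters as the notebook's `word_filter`.
WORD_FILTER = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text, filters=WORD_FILTER):
    """Lowercase the text, replace filtered characters with spaces, split on spaces."""
    table = str.maketrans({ch: " " for ch in filters})
    return [tok for tok in text.lower().translate(table).split(" ") if tok]


def build_word_index(texts):
    """Rank words by descending frequency; index 0 is left unused for padding."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # most_common() sorts by count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


toy_texts = ["Ford posts loss.", "Ford said it posts a profit; Ford gains!"]
toy_index = build_word_index(toy_texts)
print(toy_index)
```

The most frequent word ("ford") receives index 1, and no word is mapped to 0, which is why the embedding matrix built later has `len(word_index) + 1` rows.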

**Convert all datasets to numerical values**
* Apply the vocabulary index to X_train and X_test
  * The datasets are converted from texts to sequences of integers based on previously created vocabulary index
  * The sequences are padded with zeros or truncated so that each has a fixed length of MAX_SEQUENCE_LENGTH
* Convert y_train and y_test to one-hot encoded vectors

In [10]:
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train),
                        maxlen = MAX_SEQUENCE_LENGTH,
                        value = 0.0)

X_test = pad_sequences(tokenizer.texts_to_sequences(X_test),
                       maxlen = MAX_SEQUENCE_LENGTH,
                       value = 0.0)

y_train = to_categorical(y_train)
print(y_train)
y_test = to_categorical(y_test)
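
With the default arguments used above, `pad_sequences` pads and truncates at the start of each sequence, and `to_categorical` infers the number of classes from the largest label. The plain-Python sketch below reproduces that behaviour on toy inputs. The helpers `pad_pre` and `one_hot` are hypothetical stand-ins, not the Keras functions.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` entries if too long."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


def one_hot(labels, num_classes=None):
    """Turn integer labels into one-hot rows; by default num_classes = max label + 1."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in labels]


print(pad_pre([5, 2], maxlen=4))           # [0, 0, 5, 2]
print(pad_pre([7, 1, 4, 9, 3], maxlen=4))  # [1, 4, 9, 3]
print(one_hot([2, 1, 0]))
```

Because truncation happens at the start, a summary longer than MAX_SEQUENCE_LENGTH tokens keeps its last 32 tokens rather than its first 32.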

### Build Embedding Matrix

* In this project, we use pretrained word embeddings that are stored in text files. We have to build the embedding matrices from these files before we can use them
* Vocabulary index created earlier with the tokenizer is used to create the embedding matrices

In [11]:
def embedding_matrix(path_to_embedding: str, embedding_dim: int, word_index: dict) -> np.ndarray:
    """
    This function creates an embedding matrix.
    
    Inputs:
    path_to_embedding - path to text file of word embeddings
    embedding_dim - dimension of word embeddings
    word_index - dictionary mapping words to indices
    
    Outputs:
    embedding_matrix - numpy matrix containing the embeddings
    
    """
    embeddings_index = {}
    # The context manager closes the file even if parsing fails midway.
    with open(path_to_embedding, encoding='utf-8') as f:
        for line in f:
            # Split from the right so that tokens containing spaces stay intact.
            values = line.rstrip().rsplit(' ', embedding_dim)
            if len(values) != embedding_dim + 1:
                # Skip header lines (e.g. the first line of a .vec file) and malformed rows.
                continue
            try:
                embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
            except ValueError:
                # Skip rows whose coefficients are not numeric.
                continue

    # Row 0 is reserved for padding; words not found in the embedding index stay all-zeros.
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    found = 0
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            found += 1
            embedding_matrix[i] = embedding_vector

    print("Found pretrained vectors for {} of {} words".format(found, len(word_index)))
    return embedding_matrix
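
To make the matrix layout concrete, the sketch below runs the same lookup logic on a tiny in-memory file, using plain lists instead of NumPy. The 3-dimensional vectors, the words, and the variable names are all hypothetical toy data, not values from GloVe or FastText.

```python
import io

# Hypothetical 3-dimensional embedding file: one word per line, followed by its vector.
toy_file = io.StringIO(
    "ford 0.1 0.2 0.3\n"
    "loss -0.5 0.0 0.5\n"
    "shell 0.9 0.8 0.7\n"
)
toy_word_index = {"ford": 1, "posts": 2, "loss": 3}  # shaped like tokenizer.word_index
dim = 3

# Parse the file into {word: vector}.
vectors = {}
for line in toy_file:
    parts = line.split()
    vectors[parts[0]] = [float(x) for x in parts[1:]]

# Row i holds the vector of the word with index i; row 0 and unknown words stay zero.
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    if word in vectors:
        matrix[i] = vectors[word]

for row in matrix:
    print(row)
```

Here "posts" has no pretrained vector, so its row stays all-zeros, and "shell" is ignored because it is not in the vocabulary index.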

In [12]:
# Assumption: the embedding files sit under the same ../data directory as the CSV loaded earlier.
# The original paths ("data/news_data/...") raised FileNotFoundError; adjust if your layout differs.
glove_embedding_matrix = embedding_matrix("../data/news_data/glove/glove.840B.300d.txt",
                                          EMBEDDING_DIM,
                                          tokenizer.word_index)

fasttext_embedding_matrix = embedding_matrix("../data/news_data/fasttext/wiki-news-300d-1M.vec",
                                             EMBEDDING_DIM,
                                             tokenizer.word_index)