<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> AC295: Advanced Practical Data Science </h1>

## Project: News Analytics for Stock Return Prediction

**Harvard University, Fall 2020**  
**Instructors**: Pavlos Protopapas  

### **Team: $\alpha\beta normal$ $Distri\beta ution$**
#### **Roht Beri, Eduardo Peynetti, Jessica Wijaya, Stuart Neilson**

# Creating TFRecords Datasets & BERT Pipeline for Tiingo News Dataset

This notebook details how we preprocess the news dataset (source: Tiingo) stored in MongoDB Atlas. The preprocessing entails the following steps:
1. Extract the data from MongoDB based on the selected tags (S&P 1500 stocks and 12 sectors).
2. Generate BERT tokens using a pretrained BERT tokenizer and save the preprocessed data in TFRecords format.
3. Build a pipeline that feeds the TFRecords to the pretrained BERT sequence classification model and the pretrained BERT model.
4. Generate the BERT hidden layers and BERT sentiment scores using these pretrained models.
5. Update MongoDB with the sentiment score and hidden layer vector.
6. Label each article with a trade date. Tiingo news data is timestamped with both date and time, but we analyze news on a 24-hour window basis, running from 9AM EST today to 9AM EST tomorrow. We generate a new field, tradeDate, to label each article with the relevant date stamp.
7. Aggregate the data. There are often multiple news articles per day for a given stock, so we aggregate by date and by ticker/sector and compute averages.
8. Finally, build a tool to extract this summarized information from MongoDB.
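
Steps 6 and 7 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's code. It assumes a fixed UTC-5 offset for EST (ignoring daylight saving), assumes that the window ending at 9AM on day D is labeled D (the opposite convention, labeling by the window's start, would be equally easy to write), and uses hypothetical field names (`ticker`, `publishedDate`) and made-up sample values.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean

# Assumption: EST modeled as a fixed UTC-5 offset (no daylight saving).
EST = timezone(timedelta(hours=-5), name="EST")
CUTOFF_HOUR = 9  # the 24-hour window runs from 9AM to 9AM


def trade_date(published):
    """Map a timezone-aware timestamp to its tradeDate.

    Assumption: the window ending at 9AM on day D is labeled D, so an
    article published at or after 9AM rolls over to the next day.
    """
    local = published.astimezone(EST)
    # Shifting by 15 hours moves 9AM onto midnight of the following day.
    return (local + timedelta(hours=24 - CUTOFF_HOUR)).date()


def daily_average(articles, field):
    """Average `field` per (ticker, tradeDate) pair."""
    groups = defaultdict(list)
    for article in articles:
        key = (article["ticker"], trade_date(article["publishedDate"]))
        groups[key].append(article[field])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical sample articles (values are made up for illustration).
utc = timezone.utc
articles = [
    {"ticker": "aapl", "publishedDate": datetime(2020, 11, 2, 13, 30, tzinfo=utc), "fin_pos": 0.2},
    {"ticker": "aapl", "publishedDate": datetime(2020, 11, 2, 15, 0, tzinfo=utc), "fin_pos": 0.6},
    {"ticker": "aapl", "publishedDate": datetime(2020, 11, 3, 1, 0, tzinfo=utc), "fin_pos": 0.8},
]
averages = daily_average(articles, "fin_pos")
for (ticker, day), value in sorted(averages.items()):
    print(ticker, day, round(value, 3))
```

The first article falls before the 9AM cutoff and keeps its calendar date, while the other two land in the following window and are averaged together.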

## Disks

### Connect Google Drive

In [12]:
from google.colab import drive

#drive.flush_and_unmount()

drive.mount('/content/drive', force_remount=True)

Drive not mounted, so nothing to flush and unmount.
Mounted at /content/drive


## Libraries

### Install Packages

In [13]:
!pip3 install transformers
!pip3 install --upgrade pymongo[srv]==3.10.1
!pip3 install dask[dataframe]

Requirement already up-to-date: pymongo[srv]==3.10.1 in /usr/local/lib/python3.6/dist-packages (3.10.1)


### Imports

In [14]:
import os
import ast
import requests
import tarfile
import tempfile
import zipfile
import shutil
import csv
import json
import time
import sys
import subprocess
import logging
import pickle

import pymongo
import bson
import dns

import numpy as np
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_hub as hub

from tensorflow import keras
from tensorflow.python.keras import backend as K
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import layers

from collections import Counter
from glob import glob
from threading import Thread

from pytz import timezone
from bson import BSON, ObjectId
from pymongo import MongoClient
from tabulate import tabulate
from tqdm.notebook import tqdm, trange
from datetime import datetime, timedelta, date, time
from dateutil.parser import isoparser

from transformers import BertTokenizer, TFBertForSequenceClassification, TFBertModel

%matplotlib inline

## Variables

### Useful Constants and Variables

In [15]:
# Set google drive path for pipeline storage
PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/headlines/'
TICKERS_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/components/sp1500.csv'
IND_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/components/industries.csv'
IND_GROUP_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/components/industry_groups.csv'
SECTOR_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/components/sectors.csv'
BERT_HIDDEN_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/headlines/bert_hs/'

#Helpful Constants
START = datetime(2000,1,1)
END = datetime.now()
WINDOW = 365
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Mongo Atlas keys & host name
PASSWORD = '47PXdQpbJKFTLGTJ'
DBNAME = 'abnormalDistribution'
COLLECTION = 'tiingo'
HOST = f'mongodb+srv://abnormal-distribution:{PASSWORD}@cluster0.friwl.mongodb.net/{DBNAME}?retryWrites=true&w=majority'
print(HOST)

# Pipeline variables
batch_size = 64
prefetch = AUTOTUNE

# BERT Model
bert_model = 'ipuneetrathore/bert-base-cased-finetuned-finBERT' #'bert-base-uncased'

mongodb+srv://abnormal-distribution:47PXdQpbJKFTLGTJ@cluster0.friwl.mongodb.net/abnormalDistribution?retryWrites=true&w=majority
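
The cell above embeds the password in the notebook and prints it as part of the connection string. A possible alternative, sketched here under assumptions, is to read the password from an environment variable and to redact it before printing. The variable name `MONGO_PASSWORD`, the helper names, and the example host are hypothetical and not part of the project.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, cluster_host, dbname, password_env="MONGO_PASSWORD"):
    """Build a mongodb+srv URI, reading the password from the environment.

    `MONGO_PASSWORD` is a hypothetical variable name chosen for this sketch.
    """
    password = os.environ.get(password_env)
    if password is None:
        raise RuntimeError(f"Set the {password_env} environment variable first")
    # Percent-encode credentials so characters such as '@' or ':' stay valid.
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{cluster_host}/{dbname}?retryWrites=true&w=majority"
    )


def redact(uri):
    """Hide the password before printing or logging a URI."""
    scheme, rest = uri.split("://", 1)
    credentials, host_part = rest.split("@", 1)
    user = credentials.split(":", 1)[0]
    return f"{scheme}://{user}:***@{host_part}"


# Placeholder value for the demo only; a real run would set this outside the notebook.
os.environ.setdefault("MONGO_PASSWORD", "example-p@ss")
uri = build_mongo_uri("abnormal-distribution", "cluster0.example.mongodb.net", "abnormalDistribution")
print(redact(uri))
```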


In [16]:
#MASTER_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/lean/master.p'
TRAIN_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/lean/train.p'
VALID_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/lean/valid.p'
TEST_PATH = '/content/drive/MyDrive/abnormal-distribution-project-data/lean/test.p'

### Get Tickers and Sectors

In [17]:
tickers = pd.read_csv(TICKERS_PATH).values.flatten()
sectors = pd.read_csv(SECTOR_PATH).sector.values

#industry = pd.read_csv(IND_PATH).industry.values
#ind_group = pd.read_csv(IND_GROUP_PATH).industry_group.values

tickers.sort()
sectors.sort()

## TFRecords

### Utils for Creating TFRecords

In [None]:
# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.
# Credit: https://www.tensorflow.org/tutorials/load_data/tfrecord

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [None]:
# Function to create TFRecord Sample for writing
def get_tf_record(item, tokenizer):

    id = item['id']
    t_open = item['target_open']
    t_close = item['target_close']

    description = item['text']
    description_token = tokenizer.encode_plus(
        description, 
        add_special_tokens = True, # add [CLS], [SEP]
        max_length = 512, # max length of the text that can go to BERT (<=512)
        padding='max_length',
        return_attention_mask = True, # add attention mask to not focus on pad tokens
        truncation='longest_first',
        return_tensors="tf"
    )
    # Serialize each int32 tensor to raw bytes (tobytes replaces the deprecated tostring)
    description_input = description_token['input_ids'].numpy().tobytes()
    description_type = description_token['token_type_ids'].numpy().tobytes()
    description_attention = description_token['attention_mask'].numpy().tobytes()

    # Create tf.train.Example
    feature={
        '_id' : _int64_feature(id),
        'input_ids': _bytes_feature(description_input),
        'token_type_ids': _bytes_feature(description_type),
        'attention_mask': _bytes_feature(description_attention), 
        'return_open': _int64_feature(t_open),
        'return_close': _int64_feature(t_close),
    }
    features=tf.train.Features(feature=feature)
    example = tf.train.Example(features=features)

    return example
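
The token arrays are stored as raw bytes, so whoever reads them back has to decode with the same integer width that was used when writing. The sketch below mimics that round trip with the standard `struct` module instead of NumPy and TensorFlow. It assumes little-endian 32-bit integers, and the helper names `encode_int32` and `decode_int32` are hypothetical.

```python
import struct


def encode_int32(values):
    """Pack integers as little-endian 32-bit values (4 bytes each)."""
    return struct.pack(f"<{len(values)}i", *values)


def decode_int32(raw):
    """Inverse of encode_int32; the byte count must be a multiple of 4."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw))


# A short, zero-padded list of token ids (the notebook pads to 512).
input_ids = [101, 13876, 6778, 17244, 0, 0, 0, 0]
raw = encode_int32(input_ids)

print(len(raw))           # 4 bytes per id
print(decode_int32(raw))  # the round trip restores the ids

# Reading the same bytes as 64-bit integers halves the count and garbles the values.
wrong = list(struct.unpack(f"<{len(raw) // 8}q", raw))
print(wrong)
```

This is why the reading side has to decode with `tf.int32`: decoding with a different width would silently return a tensor of the wrong length and content.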

In [None]:
# Creates shard for the given ticker
def write_tfrecord(df, shard_path, tokenizer):
    
    # TFRecord writer initialized
    with tf.io.TFRecordWriter(shard_path) as writer:
        
        # Tokenize the data and parse the TF Record
        examples = df.apply(get_tf_record, axis=1, args=(tokenizer,))
        
        # Write to TFRecords
        for example in tqdm(examples):
            
            # Write the TFRecord
            writer.write(example.SerializeToString())

In [None]:
# Function to create TFRecords for the dataset
def create_TFRecords(
    train_path=TRAIN_PATH, valid_path=VALID_PATH, 
    test_path=TEST_PATH, path=PATH
):
    # Setup folders
    if not os.path.exists(path + 'tf_records_lean'):
        os.mkdir(path + 'tf_records_lean')


    # Setup paths
    done_path = path + 'tf_records_lean/*.records'
    shard_path = path + "tf_records_lean/{}_{:02d}.records"

    # Files already done
    records_done = []
    records_done.extend(glob(done_path))

    # Initialize BERT Tokenizer
    tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=False)

    # List of data paths
    path_set = [train_path, valid_path, test_path]
    path_type = ['train', 'valid', 'test']

    # Max number of records in a shard
    max_num_records = 100000

    # Run loop over three paths
    for i, data_path in tqdm(enumerate(path_set)):
        
        # Read the data into pandas
        with open(data_path,'rb') as pkl_file:
            df = pickle.load(pkl_file)

        num_records = df.shape[0]
        num_shards = num_records//max_num_records
        num_shards += min(1, num_records % max_num_records)

        print('Processing {} data'.format(path_type[i]))
        for j in trange(num_shards):
            # Path for the shard
            temp_path = shard_path.format(path_type[i], j)
            
            # Check if the shard already exists
            if os.path.exists(temp_path):
                continue
            
            # Start and the end of the shard
            start = j * max_num_records
            end = min(num_records, (j+1)*max_num_records)

            # Dataframe slice
            temp_df = df.iloc[start : end]

            # Create the TFRecord
            write_tfrecord(temp_df, temp_path, tokenizer)
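
The shard arithmetic above is a ceiling division followed by clipped slice bounds. A minimal standalone sketch follows; the helper name `shard_bounds` and the record count are hypothetical, while the file-name pattern is the one used in `create_TFRecords`.

```python
def shard_bounds(num_records, max_per_shard):
    """Return (start, end) row ranges, end exclusive, covering all records."""
    # Ceiling division: one extra shard when there is a remainder.
    num_shards = -(-num_records // max_per_shard)
    return [
        (j * max_per_shard, min(num_records, (j + 1) * max_per_shard))
        for j in range(num_shards)
    ]


# Made-up record count, used only to show the resulting shard names.
for j, (start, end) in enumerate(shard_bounds(250_000, 100_000)):
    print("{}_{:02d}.records".format("valid", j), start, end)
```

The last shard is shorter than the others whenever the record count is not a multiple of the shard size.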

### Create TFRecords Dataset

In [None]:
#shutil.rmtree('/content/drive/MyDrive/abnormal-distribution-project-data/headlines/tf_records_monthly_fin')

In [None]:
# Create TFRecords for Stocks
create_TFRecords()

Processing train data
Processing valid data
Processing test data

### Read TFRecords Dataset

In [None]:
# Check data integrity
tick_path = PATH + 'tf_records_lean/*.records'

files = glob(tick_path)
i = 0
for file in tqdm(files):
    raw_dataset = tf.data.TFRecordDataset(file)
    try:
        for raw_record in raw_dataset.take(10):
            example = tf.train.Example()
            example.ParseFromString(raw_record.numpy())
            example = example.features.feature['_id']
            
        del raw_dataset
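    except Exception:
        # Deletion is disabled; uncomment the next line to remove unreadable shards
        #os.remove(file)
        print('problem reading file: ', file)
        i += 1

if i:
    print(f"{i} files could not be read; delete them and rerun Create TFRecords")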
else:
    print("TFRecords are in good shape")

TFRecords are in good shape


In [None]:
filenames = ['/content/drive/MyDrive/abnormal-distribution-project-data/headlines/tf_records_lean/train_02.records']
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

In [None]:
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

features {
  feature {
    key: "_id"
    value {
      int64_list {
        value: 3903159
      }
    }
  }
  feature {
    key: "attention_mask"
    value {
      bytes_list {
        value: "\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\000\000\000\001\

In [None]:
# Create a dictionary describing the features.
features={
        '_id': tf.io.FixedLenFeature([], tf.int64), 
        'input_ids': tf.io.FixedLenFeature([], tf.string),
        'token_type_ids': tf.io.FixedLenFeature([], tf.string),
        'attention_mask': tf.io.FixedLenFeature([], tf.string), 
        'return_open': tf.io.FixedLenFeature([], tf.int64),
        'return_close': tf.io.FixedLenFeature([], tf.int64),
    }

def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, features)

parsed_image_dataset = raw_dataset.map(_parse_image_function)
parsed_image_dataset

<MapDataset shapes: {_id: (), attention_mask: (), input_ids: (), return_close: (), return_open: (), token_type_ids: ()}, types: {_id: tf.int64, attention_mask: tf.string, input_ids: tf.string, return_close: tf.int64, return_open: tf.int64, token_type_ids: tf.string}>

In [None]:
i = 0
for data in parsed_image_dataset:
    input_ids = tf.io.decode_raw(data['input_ids'], tf.int32)
    print(data['return_open'].numpy())
    print(data['return_close'].numpy())
    print(data['_id'])
    print(input_ids.numpy()[:10])
    print(input_ids.numpy().shape)
    break

0
0
tf.Tensor(3903159, shape=(), dtype=int64)
[  101 13876  6778 17244 14719  3561   119   113  5883 12649]
(512,)


## Pipeline

### Utils for BERT Pipeline

In [18]:
# Function to parse data features
def _parse_features_function(example):
    # Parse the input tf.train.Example proto using the dictionary above.
    tf_records_features = {
        '_id': tf.io.FixedLenFeature([], tf.int64), 
        'input_ids': tf.io.FixedLenFeature([], tf.string),
        'token_type_ids': tf.io.FixedLenFeature([], tf.string),
        'attention_mask': tf.io.FixedLenFeature([], tf.string), 
        'return_open': tf.io.FixedLenFeature([], tf.int64),
        'return_close': tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(example, tf_records_features)


# Structure the data for training returns
def structure_data(data):
    id = data['_id']
    input_ids = tf.io.decode_raw(data['input_ids'], tf.int32)
    attention_mask = tf.io.decode_raw(data['attention_mask'], tf.int32)
    token_type_ids = tf.io.decode_raw(data['token_type_ids'], tf.int32)
    #open = data['return_open'] + 1
    #close = data['return_close'] + 1
    
    return ((input_ids, token_type_ids, attention_mask), id)

In [19]:
def generate_multiple_pipelines(
    train = True, test = False, open=True, path=PATH, batch_size=batch_size
):
    print("Generating Pipeline....")

    func = structure_data

    if train:
        tfrecords_pattern_path = path + "tf_records_lean/train_*.records"
    elif test:
        tfrecords_pattern_path = path + "tf_records_lean/test_*.records"
    else:
        tfrecords_pattern_path = path + "tf_records_lean/valid_*.records"

    options = tf.data.Options()
    options.experimental_deterministic = True
    
    train_files = tf.io.matching_files(tfrecords_pattern_path)
    
    train_shards = tf.data.Dataset.from_tensor_slices(train_files)

    train = train_shards.interleave(tf.data.TFRecordDataset, cycle_length=12)
    train = train.with_options(options)
    train = train.map(_parse_features_function, num_parallel_calls=AUTOTUNE)
    train = train.map(func, num_parallel_calls=AUTOTUNE)
    train = train.batch(batch_size)
    #train = train.shuffle(buffer_size=512)
    #train = train.cache().prefetch(prefetch)

    return train
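
With `cycle_length=12` and the default block length of 1, `interleave` reads from up to 12 shard files at a time and takes one record from each in turn, so neighboring records in a batch usually come from different shards. The pure-Python generator below is only an approximation of that ordering, written to make the behavior visible without TensorFlow. The function name and the toy inputs are hypothetical.

```python
from itertools import islice


def interleave(sources, cycle_length, block_length=1):
    """Round-robin over `cycle_length` sources, `block_length` items at a time.

    When a source runs dry, its slot is refilled with the next unopened
    source; the slot is dropped once no sources are left.
    """
    pending = iter(sources)
    active = [iter(s) for s in islice(pending, cycle_length)]
    while active:
        i = 0
        while i < len(active):
            block = list(islice(active[i], block_length))
            yield from block
            if len(block) < block_length:  # this source is exhausted
                replacement = next(pending, None)
                if replacement is None:
                    active.pop(i)
                    continue
                active[i] = iter(replacement)
            i += 1


# Three toy "shards" read two at a time.
shards = [[1, 2], [10, 20], [100, 200]]
print(list(interleave(shards, cycle_length=2)))
```

Because the pipeline sets the deterministic option, the same set of shard files should yield the same record order on every run, which is what makes the batch number a usable key for resuming work later.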

### Test Pipeline

In [20]:
tickers_data = generate_multiple_pipelines()
tickers_data

Generating Pipeline....


<BatchDataset shapes: (((None, None), (None, None), (None, None)), (None,)), types: ((tf.int32, tf.int32, tf.int32), tf.int64)>

In [21]:
for item in tickers_data.take(1):
    print('answer: ', item[1][0].numpy())
    print()
    print('input_ids: ',item[0][0][0].numpy().shape)

answer:  2007543

input_ids:  (512,)


In [22]:
for item in tickers_data.take(1):
    print(item)

((<tf.Tensor: shape=(64, 512), dtype=int32, numpy=
array([[  101,  2750,  4980, ...,     0,     0,     0],
       [  101,   164,  1120, ...,     0,     0,     0],
       [  101, 13876,  6778, ...,     0,     0,     0],
       ...,
       [  101, 19443,  1942, ...,     0,     0,     0],
       [  101, 16409, 16891, ...,     0,     0,     0],
       [  101,  3957,  2801, ...,     0,     0,     0]], dtype=int32)>, <tf.Tensor: shape=(64, 512), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, <tf.Tensor: shape=(64, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>), <tf.Tensor: shape=(64,), dtype=int64, numpy=
a

## BERT Model

### Utils to build model to get BERT Sentiment with finetuning

In [23]:
# Function for threading model
def generate_model_prediction(
    model_hs, model_sentiment, input, id, path_results):

    pred_y_hs = model_hs(input)[1]
    pred_y_sent = model_sentiment.predict(input).logits
    pred_y_sent = tf.nn.softmax(pred_y_sent)
    
    temp_df = pd.DataFrame(
        {'_id': id, 
         'fin_features': list(pred_y_hs),
         'fin_neg': list(pred_y_sent[:,0]),
         'fin_neu': list(pred_y_sent[:,1]),
         'fin_pos': list(pred_y_sent[:,2]),
         }
    )
    
    temp_df.to_csv(path_results, index=False)
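
The sentiment columns are the softmax of the three classifier logits, saved in the order negative, neutral, positive. The following sketch shows that conversion for a single article using only the standard library. The logits are made up, and the helper is an illustration rather than the TensorFlow call used above.

```python
import math

LABELS = ("fin_neg", "fin_neu", "fin_pos")  # column order used when saving the CSV


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 0.3, 2.1]  # made-up logits for one article
probs = dict(zip(LABELS, softmax(logits)))
print({name: round(value, 3) for name, value in probs.items()})
```

The three values sum to one, so each can be read as the model's probability for that sentiment class.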

In [24]:
# Function to generate fin-BERT hidden states and sentiment predictions (inference only)
def get_BERT_HS(train=False, valid=False, test=False, 
                start=0, end = 30000, path=BERT_HIDDEN_PATH):
    
    # Clear Backend
    K.clear_session()

    if not os.path.exists(path+'lean_fin_BERT_HS'):
        os.mkdir(path+'lean_fin_BERT_HS')

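    # Look for earlier outputs in the folder where the prediction files are written
    path_1 = path + 'lean_fin_BERT_HS/train_*.csv'
    path_2 = path + 'lean_fin_BERT_HS/valid_*.csv'
    path_3 = path + 'lean_fin_BERT_HS/test_*.csv'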
    files_1 = []
    files_2 = []
    files_3 = []
    
    # BERT classification model
    bert_sentiment = TFBertForSequenceClassification.from_pretrained(
        bert_model,
        return_dict=True,
        from_pt = True,
    )
    
    # BERT model
    bert_HS = TFBertModel.from_pretrained(
        bert_model,
        return_dict=True,
        from_pt = True,
    )

    # Data sets and types
    data_type = []
    data_set = []
    file_type = []

    # Training set
    if train:
        train = generate_multiple_pipelines()
        data_type.append('train')
        data_set.append(train)
        files_1 = glob(path_1)
        print("Number of files already processed: ", len(files_1))
        file_type.append(files_1)
   
    # Validation set
    if valid:
        valid = generate_multiple_pipelines(train=False)
        data_type.append('valid')
        data_set.append(valid)
        files_2 = glob(path_2)
        print("Number of files already processed: ", len(files_2))
        file_type.append(files_2)

    # Test set
    if test:
        test = generate_multiple_pipelines(train=False, test=True)
        data_type.append('test')
        data_set.append(test)
        files_3 = glob(path_3)
        print("Number of files already processed: ", len(files_3))
        file_type.append(files_3)


    #save the outputs
    shard_path = path + 'lean_fin_BERT_HS/{}_{}.csv' 

    # Generate predictions over three pipelines
    for i, data in tqdm(enumerate(data_set)):
        
        print("Processing {} data...". format(data_type[i]))
        
        # Set batch number
        batch_num = 0

        for input, id in tqdm(data):

            if batch_num < start:
                batch_num += 1
                continue

            if batch_num > end:
                break  # past the requested range, so stop reading batches

            # set path
            set_path = shard_path.format(data_type[i], batch_num)

            if set_path in file_type[i]:
                batch_num += 1
                continue

            # Generate Features
            if not os.path.exists(set_path):
                generate_model_prediction(
                    bert_HS, bert_sentiment, input, id, set_path
                )
            
            batch_num += 1
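
The loop above is a resumable job: it skips batches before `start`, ignores batches after `end`, and never recomputes an output file that already exists. The same pattern is shown below in isolation, with a trivial stand-in for the model call. The function name, the `worker` argument, and the toy batches are hypothetical.

```python
import os
import tempfile


def process_batches(batches, out_dir, prefix, worker, start=0, end=30000):
    """Write one output file per batch, skipping work that is already done.

    Batches before `start` are skipped, iteration stops once `end` is
    passed, and an existing output file is never recomputed.
    """
    written = []
    for batch_num, batch in enumerate(batches):
        if batch_num < start:
            continue
        if batch_num > end:
            break
        out_path = os.path.join(out_dir, f"{prefix}_{batch_num}.csv")
        if os.path.exists(out_path):
            continue
        with open(out_path, "w") as handle:
            handle.write(worker(batch))
        written.append(batch_num)
    return written


def to_csv_line(batch):
    """Stand-in for the expensive model call."""
    return ",".join(str(value) for value in batch)


with tempfile.TemporaryDirectory() as tmp:
    batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
    first = process_batches(batches, tmp, "valid", to_csv_line, start=0, end=1)
    second = process_batches(batches, tmp, "valid", to_csv_line, start=0, end=3)
    print(first, second)
```

The second call only writes the batches the first call left out, which is the property that lets a long run be split across several Colab sessions.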

### Get fin-BERT Predictions

In [26]:
# Generate fin-BERT Predictions
get_BERT_HS(valid=True)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorc

Generating Pipeline....
Number of files already processed:  0


Processing valid data...

## Scrap

In [None]:
# Connect to MongoDB
client = pymongo.MongoClient(HOST)
db = client[DBNAME]
news_collection = db[COLLECTION]
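
With a collection handle like the one above, step 5 of the overview (writing sentiment scores and hidden vectors back to MongoDB) could look roughly like the sketch below. It only builds the filter and update documents, so it runs without a database. It assumes the documents are keyed by the same integer `_id` stored in the TFRecords, and the helper name and sample row are hypothetical.

```python
def build_update(row):
    """Return the (filter, update) pair for one prediction row."""
    query = {"_id": row["_id"]}
    update = {
        "$set": {
            # Plain Python floats keep the document BSON-serializable.
            "fin_features": [float(x) for x in row["fin_features"]],
            "fin_neg": float(row["fin_neg"]),
            "fin_neu": float(row["fin_neu"]),
            "fin_pos": float(row["fin_pos"]),
        }
    }
    return query, update


# Made-up prediction row; a real hidden vector would be much longer.
row = {
    "_id": 3903159,
    "fin_features": [0.12, -0.4, 0.05],
    "fin_neg": 0.1,
    "fin_neu": 0.7,
    "fin_pos": 0.2,
}
query, update = build_update(row)
print(query)
print(update)

# With a live connection, the pair could then be applied with, for example:
# news_collection.update_one(query, update)
```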

In [None]:
# Space for testing/experimentation

0

In [None]:
# Terminate the MongoDB Session
client.close()