# Byte Level Language Model Evaluation

## Model Objective:
___
* P(Price Direction = 0 or 1 | NewsHeadlines): the probability that a news headline moves the stock price down (0) or up (1) after a given time horizon.

Please read `Language Model Report.pdf` for the detailed report.


## Files Directory:
____
__training_checkpoints_CharWeights__: Folder holding the trained weights (parameters) that the language model loads before making predictions.

__Language Model.ipynb__: The notebook that drives this automation.

__new_headlines.csv__: CSV file with the news headlines per stock ticker for the provided dates.

__functions.py__: Utility functions used throughout the notebook.

__CharLangModel.h5__: Deep learning character-level language model that learns dense representations of financial news headlines by learning to regenerate the headline passed to it.

__daily.h5__: Deep learning model that predicts the asset price direction (0: Down, 1: Up): given a news headline as input at T-1, it returns a prediction for T.

__environment.yml__: File listing all environment dependencies for running the notebook (instructions below).

## Environment Setup and Instructions:
____
1. To create a Python virtual environment (preferably with Anaconda, which manages virtual environments efficiently), `cd` in the Anaconda terminal into the folder that contains the `environment.yml` file and run:

`conda env create -f environment.yml`

2. Make sure you are in the environment you just created, or in one that has the packages needed to run the notebook.

3. Unzip the attached automation folder `requirements.zip`, put all files in one folder, and use the file directory above to familiarize yourself with the automation. Make sure the Jupyter notebook (`Language _Model.ipynb`) is in the same folder as the files you just unzipped.

## How to Use the Resource
_________________________
This notebook is a proof of concept of state-of-the-art techniques, applied by top trading firms, that use deep learning to process natural language at a fine-grained character (byte) level. This representation can be more compact than word vectors (embeddings) and closer to a generalized contextual representation of financial news data.

Feel free to check my GitHub for more details.


###### __Disclaimer__:
- The models are not to be used for any sort of investment advice; the notebook is a proof of concept of how to leverage Natural Language Processing with state-of-the-art deep learning (LSTM architecture).
- The accuracy of the model is only justified within the scope of this proof of concept.

##### Package Imports

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow.keras.backend as K
from tqdm import tqdm
import datetime
from pathlib import Path
import os
import time
import pandas as pd
import numpy as np
import psutil
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
%matplotlib inline

In [2]:
# Make sure you have these package versions
print(tf.__version__)
print(keras.__version__)
print(np.__version__)

2.2.0
2.3.0-tf
1.19.5


###### Pull in my hardcoded helper functions from the `functions.py` script; make sure this script is in the same directory as this Jupyter notebook

In [3]:
# Pull in my helper hardcoded functions from functions.py script 
from functions import clean_text,encode2bytes,split_X_y, model_complile, generate_text

To check where your notebook's location/path/directory is, type this in code cell: `pwd`

In [None]:
pwd #RUN THIS CODE CELL

In [4]:
# Set the processor that runs computations in the backend
print(tf.config.list_physical_devices(device_type=None))
tf.config.optimizer.set_jit(True)
# XLA (Accelerated Linear Algebra) is TensorFlow's compiler; most laptops expose it as an XLA_CPU device.
# Note: `gpus` holds XLA_CPU devices here, not physical GPUs.
gpus = tf.config.list_physical_devices('XLA_CPU')
if gpus:
    # Restrict TensorFlow to only use the XLA_CPU devices
    try:
        tf.config.experimental.set_visible_devices(gpus[:], 'XLA_CPU')
        logical_gpus = tf.config.experimental.list_logical_devices('XLA_CPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
        print('used: {}% free: {:.2f}GB'.format(psutil.virtual_memory().percent, float(psutil.virtual_memory().free)/1024**3))
    except RuntimeError as e:
        # Visible devices must be set at program startup
        print(e)


[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')]
1 Physical GPU, 1 Logical GPUs
used: 72.5% free: 4.34GB


##### Make sure you set the right paths for the headlines, CharLangModel and daily model files: `new_headlines.csv`, `CharLangModel.h5`, `daily.h5`

In [6]:
# Fix the random seeds so my results are reproducible, given the stochastic starting point
K.clear_session()
tf.keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

# Get data: MAKE SURE THE FILES YOU ARE LOADING ARE IN THE SAME FOLDER AS YOUR JUPYTER NOTEBOOK SO THE CODE BELOW WORKS
# ../ : to run from terminal -- ./ : to run from jupyter
NEWS_STORE = Path(".","new_headlines.csv") 
CharLangModel = Path('.','CharLangModel.h5')
daily_model = Path('.','daily.h5')

### IDEA FORMULATION

In [8]:
def make_bitseq(s: str) -> str:
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(i):08b}" for i in s)
make_bitseq('Hello')

'01001000 01100101 01101100 01101100 01101111'
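To sanity-check the bit representation, here is a minimal sketch of the inverse mapping. `read_bitseq` is a hypothetical helper added only for illustration (it is not part of `functions.py`); `make_bitseq` is repeated so the block runs on its own.

```python
def make_bitseq(s: str) -> str:
    """Encode an ASCII string as space-separated 8-bit groups."""
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(ch):08b}" for ch in s)


def read_bitseq(bits: str) -> str:
    """Inverse of make_bitseq (hypothetical helper): bit groups back to text."""
    return "".join(chr(int(group, 2)) for group in bits.split())


encoded = make_bitseq("Hello")
decoded = read_bitseq(encoded)
print(encoded)
print(decoded)
```

Every group starts with a `0` bit, which is why 7 bits are enough for plain ASCII text.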

In [9]:
def n_possible_values(nbits: int) -> int:
    return 2 ** nbits
print('6 Bits :', n_possible_values(6))
print('7 Bits :', n_possible_values(7))
print('8 Bits :', n_possible_values(8))

6 Bits : 64
7 Bits : 128
8 Bits : 256


In [10]:
def convert(binary_number):
    """Print the decimal value of an integer written with binary digits, e.g. 1010."""
    binary = binary_number
    i = 0
    decimal_number = 0
    while binary_number != 0:
        c = binary_number % 10  # lowest remaining binary digit
        decimal_number = decimal_number + c * (2 ** i)
        i += 1
        binary_number = binary_number // 10  # integer division drops that digit
    print('Binary number: %d' % binary)
    print('Decimal number: %d' % decimal_number)
    return 0
convert(1010)

Binary number: 1010
Decimal number: 10


0
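The same positional expansion can be written over a string of bits, which avoids treating binary digits as a decimal integer. This is only a sketch: `binary_to_decimal` is a hypothetical name, and the built-in `int(bits, 2)` is used as a cross-check.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum bit * 2**position, with positions counted from the right."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total


for bits in ("1010", "1001", "1111111"):
    # Cross-check against Python's built-in base-2 parser.
    assert binary_to_decimal(bits) == int(bits, 2)
    print(bits, "->", binary_to_decimal(bits))
```

The last example, seven 1-bits, is the largest value that fits in 7 bits.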

In [11]:
1 * 2**3 + 0 * 2**2 + 0 * 2**1 + 1 * 2**0

9

__Bottom line:__ The aim is to represent each character with 1 byte, using only 7 bit slots (not the 8 bits we are used to). Seven bits cover code points 0 to 127, which is enough for the English alphabet, digits and symbols. Code point 0 is reserved for padding; whenever the model sees a 0 it ignores it.

### GET DATA

In [7]:
news= pd.read_csv(NEWS_STORE)
news = news.set_index('time')
news.index = pd.to_datetime(news.index)
news.index = news.index.strftime('%Y-%m-%d %H:%M:%S')
# news = news.drop(['index', 'time'], axis = 1)
# news.index.name = 'time'
news.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1536339 entries, 2020-09-01 21:20:54 to 2021-07-20 03:53:56
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1536339 non-null  object
 1   ticker  1536339 non-null  object
dtypes: object(2)
memory usage: 35.2+ MB


#### DOWNSIZE FOR MEMORY EFFICIENCY:
If you want to downsize `news`, run this code: `news = news.iloc[0:int(0.25*len(news))]`

In [14]:
news.info()

<class 'pandas.core.frame.DataFrame'>
Index: 384084 entries, 2020-09-01 21:20:54 to 2020-10-22 21:00:00
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    384084 non-null  object
 1   ticker  384084 non-null  object
dtypes: object(2)
memory usage: 8.8+ MB


### PARSE STOP-END TOKENS & ENCODE(1-BYTE)

In [15]:
# Count unique characters
txt = ''.join(news.text)  # one pass instead of character-by-character concatenation
chars = sorted(set(txt))
print(chars)
print(len(chars))

['\t', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\xa0', '¡', '£', '¥', '§', '\xad', '®', '²', '´', '¹', '½', '¿', 'Á', 'Ã', 'Ä', 'Å', 'Ç', 'È', 'É', 'Ê', 'Í', 'Ï', 'Ñ', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ú', 'Ü', 'à', 'á', 'â', 'ã', 'ä', 'å', 'ç', 'è', 'é', 'ê', 'ë', 'í', 'ï', 'ñ', 'ó', 'ô', 'õ', 'ö', 'ø', 'ú', 'ü', 'ý', 'ć', 'Ē', 'İ', 'Ł', 'ł', 'ń', 'ō', 'Ś', 'Ş', 'ş', 'š', 'ū', 'ż', 'Ž', 'ʰ', '̧', 'Β', 'Μ', 'ა', 'ბ', 'გ', 'დ', 'ე', 'ვ', 'ზ', 'თ', 'ი', 'კ', 'ლ', 'მ', 'ნ', 'ო', 'რ', 'ს', 'ტ', 'უ', 'შ', 'ც', 'ძ', 'ხ', 'ᵗ', '\u200a', '\u200b', '\u200d', '‐', '‑', '‒', '–', '

In [16]:
text = clean_text(news, 'text')  #---> <s> headline <\s> and clean
b_text = encode2bytes(text) #----->ordinal encoding
max_sentence_len = max(map(len,b_text))
#max([len(sentence) for sentence in b_text])

In [65]:
uniqye_characters = set(x for l in b_text for x in l)
len(uniqye_characters)

95
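As a rough picture of what the ordinal encoding step does, here is a minimal sketch under my own assumptions. `encode_7bit` and `decode_7bit` are hypothetical stand-ins, not the actual `encode2bytes` from `functions.py`, and I assume that characters outside the 7-bit range are simply dropped.

```python
PAD = 0  # value reserved for padding


def encode_7bit(text):
    """Keep only characters whose code point fits in 7 bits (1..127)."""
    return [ord(ch) for ch in text if 0 < ord(ch) < 128]


def decode_7bit(codes):
    """Turn code points back into text, skipping padding zeros."""
    return "".join(chr(c) for c in codes if c != PAD)


# "\u00e9" (e with acute accent) is outside the 7-bit range and gets dropped.
codes = encode_7bit("<s>Caf\u00e9 $5<\\s>")
restored = decode_7bit(codes + [PAD, PAD])
print(codes)
print(restored)
```

Filtering like this would explain why far fewer unique values remain after encoding than in the raw character set, although the real cleaning rules live in `functions.py`.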

### SPLIT INPUT / TARGET

In [17]:
X, y = split_X_y(b_text)
num = np.random.randint(0, len(X))
print('This is an example of the training sequence encoded as bytes:\n')
print(X[num])
print(text[num])
print(y[num])

This is an example of the training sequence encoded as bytes:

[60, 115, 62, 65, 109, 97, 122, 111, 110, 32, 65, 110, 110, 111, 117, 110, 99, 101, 115, 32, 70, 105, 114, 115, 116, 32, 82, 111, 98, 111, 116, 105, 99, 115, 32, 70, 117, 108, 102, 105, 108, 108, 109, 101, 110, 116, 32, 67, 101, 110, 116, 101, 114, 32, 105, 110, 32, 76, 111, 117, 105, 115, 105, 97, 110, 97, 60, 92, 115]
<s>Amazon Announces First Robotics Fulfillment Center in Louisiana<\s>
[115, 62, 65, 109, 97, 122, 111, 110, 32, 65, 110, 110, 111, 117, 110, 99, 101, 115, 32, 70, 105, 114, 115, 116, 32, 82, 111, 98, 111, 116, 105, 99, 115, 32, 70, 117, 108, 102, 105, 108, 108, 109, 101, 110, 116, 32, 67, 101, 110, 116, 101, 114, 32, 105, 110, 32, 76, 111, 117, 105, 115, 105, 97, 110, 97, 60, 92, 115, 62]
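The input/target pair above is the same sequence shifted by one position, so at every step the model is asked for the next byte. A minimal sketch of that idea follows; `shift_split` is a hypothetical name and may differ from what `split_X_y` in `functions.py` actually does.

```python
def shift_split(sequence):
    """Return (inputs, targets) where targets are the inputs shifted left by one."""
    return sequence[:-1], sequence[1:]


tokens = [ord(ch) for ch in "<s>Hi<\\s>"]
x, y = shift_split(tokens)
print(x)
print(y)
print("".join(map(chr, x)), "->", "".join(map(chr, y)))
```

Each `y[i]` is the byte that follows `x[i]` in the original headline.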


### PADDING 
Masking is a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data.

Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.

In [18]:
X = pad_sequences(X, maxlen = max_sentence_len, padding = 'post')
y = pad_sequences(y, maxlen = max_sentence_len, padding = 'post')
print(X.shape, y.shape)

(384084, 516) (384084, 516)
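To make post-padding and the resulting mask concrete, here is a plain-Python sketch. `pad_post` and `mask_of` are hypothetical helpers, not Keras functions; note that this sketch truncates from the end, whereas Keras `pad_sequences` truncates from the start by default (`truncating='pre'`).

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad (or right-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


def mask_of(row, value=0):
    """True where a real token is present, False where the row is padding."""
    return [item != value for item in row]


batch = [[60, 115, 62], [72, 105], [1, 2, 3, 4, 5]]
padded = pad_post(batch, maxlen=4)
for row in padded:
    print(row, mask_of(row))
```

The mask is what lets a sequence layer skip the zero timesteps instead of learning from them.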


### TEST & VALIDATION SETS

In [19]:
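length_train = X.shape[0]
train_size = length_train * 90//100  # 90/10 split point

# Validation set: the last 10% of the rows (slicing past the end is harmless)
validation_seq_data = tf.data.Dataset.from_tensor_slices((X[train_size:length_train + 1],y[train_size:length_train + 1]))
# NOTE: this slice starts past the last row, so the test dataset is empty as written
test_seq_data = tf.data.Dataset.from_tensor_slices((X[length_train + 1: ],y[length_train + 1:]))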
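A slice that starts at `length_train + 1` selects no rows, so a non-empty test set needs its own index range. Below is a minimal sketch of an order-preserving three-way split; `split_indices` and the 80/10/10 percentages are my own assumptions, not the split used to train the saved models.

```python
def split_indices(n_rows, train_pct=80, val_pct=10):
    """Return slices for train, validation and test, in row order."""
    train_end = n_rows * train_pct // 100
    val_end = train_end + n_rows * val_pct // 100
    return slice(0, train_end), slice(train_end, val_end), slice(val_end, n_rows)


rows = list(range(1000))
train_s, val_s, test_s = split_indices(len(rows))
print(len(rows[train_s]), len(rows[val_s]), len(rows[test_s]))

# Slicing from one past the last index yields nothing.
empty = rows[len(rows) + 1:]
print(empty)
```

The same slices can be applied to both `X` and `y` so inputs and targets stay aligned.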

In [20]:
print('Check VALIDATION set:')
for input_txt, target_txt in  validation_seq_data.take(1):
    print('--------------------------------Headline--------------------------------')
    print(input_txt.numpy())
    print("".join(map(chr, input_txt.numpy())))
#     print(''.join(index2char[input_txt.numpy()]))
    print('\n')
    print(target_txt.numpy())
    print("".join(map(chr, target_txt.numpy())))

Check VALIDATION set:
--------------------------------Headline--------------------------------
[ 60 115  62  66  82  73  69  70  45  67  97 114 110 105 118  97 108  32
  67 111 114 112 111 114  97 116 105 111 110  32  38  32  80  76  67  32
  85 112 100  97 116 101  32  79 110  32  67 121  98 101 114  32  69 118
 101 110 116  60  92 115   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0 

In [21]:
#Mini-Batching/Subsequencing
batch_size = 256


validation_seq_data = validation_seq_data.batch(batch_size, drop_remainder=True)
test_seq_data = test_seq_data.batch(batch_size, drop_remainder=True)
print('Validation Set Shape: ', validation_seq_data, '\nTest Set Shape: ', test_seq_data)

Validation Set Shape:  <BatchDataset shapes: ((256, 516), (256, 516)), types: (tf.int32, tf.int32)> 
Test Set Shape:  <BatchDataset shapes: ((256, 516), (256, 516)), types: (tf.int32, tf.int32)>


In [22]:
# Memory Buffer for data prefetching 
AUTOTUNE = tf.data.experimental.AUTOTUNE

def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

validation_seq_data = configure_dataset(validation_seq_data)
test_seq_data = configure_dataset(test_seq_data)

### LOAD LANGUAGE MODEL
* If you want to compile the model: compile first, then load the weights, in that order only.

In [23]:
model = tf.keras.models.load_model(CharLangModel, compile=False)
# If you get an error in this step, make sure your numpy version is '1.21.1' (numpy.__version__)
model.build(tf.TensorShape([256,None]))
#compile below

In [24]:
model_complile(model)

1

##### Make sure you put the right path for the trained weights folder :`training_checkpoints_CharWeights`

In [25]:
trained_weights = './training_checkpoints_CharWeights'
model.load_weights(tf.train.latest_checkpoint(trained_weights))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x24104180f40>

In [66]:
tf.train.latest_checkpoint(trained_weights) #take a look to what we are fetching

'./training_checkpoints_CharWeights\\ckpt_20'

### Validation Set Evaluation

###### This is a memory-consuming step, feel free to skip it

In [26]:
score = model.evaluate(validation_seq_data, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.01008651778101921
Test accuracy: 0.15295174717903137


* Note: the input is sparse and some categories are missing, since we only kept characters used in the English language.

In [None]:
# prepare model again to use on inference data
model_complile(model, sparse_acc = False)
trained_weights = './training_checkpoints_CharWeights'
model.load_weights(tf.train.latest_checkpoint(trained_weights))

## Inference 

In [32]:
# How to use the models on future data inputs
headline = "<s>ZOETIS INC <ZTS.N>: CREDIT SUISSE RAISES PRICE TARGET TO $192 FROM $182 ZOETIS INC <ZTS.N>: BOFA GLOBAL RESEARCH RAISES PRICE OBJECTIVE TO $175 FROM $170 NYSE ORDER IMBALANCE <ZTS.N> 77562.0 SHARES ON SELL SIDE<\s>"
# Encode at the UTF-8 ordinal level
headline = encode2bytes(headline)
# Split into input text and output target (the target is the input shifted by one character)
X, y_true = headline[:-1], headline[1:]
# Add a batch dimension at axis = 0
X = tf.expand_dims(X, 0)
# Predict output
prediction = model.predict(X.numpy())
prediction.shape

(1, 216, 127)

#### Logits (as is)

In [33]:
print("Input:      ", "".join(map(chr, np.squeeze(X))))
print("Prediction: " ,"".join(map(chr,np.argmax(prediction, axis = -1).squeeze())))
print("Actual:     " ,"".join(map(chr,np.squeeze(y_true))))

Input:       <s>ZOETIS INC <ZTS.N>: CREDIT SUISSE RAISES PRICE TARGET TO $192 FROM $182 ZOETIS INC <ZTS.N>: BOFA GLOBAL RESEARCH RAISES PRICE OBJECTIVE TO $175 FROM $170 NYSE ORDER IMBALANCE <ZTS.N> 77562.0 SHARES ON SELL SIDE<\s
Prediction:  s>JOETIS INC <XTS.N>: CREDIT SUISSE RAISES PRICE TARGET TO $192 FROM $182 QOETIS INC <XTS.N>: BOFA GLOBAL RESEARCH RAISES PRICE OB,ECTIVE TO $175 FROM $170 NYSE ORDER IMBALANCE <ZTS.N> J7562.0 SHARES ON SELL SIDE<\s>
Actual:      s>ZOETIS INC <ZTS.N>: CREDIT SUISSE RAISES PRICE TARGET TO $192 FROM $182 ZOETIS INC <ZTS.N>: BOFA GLOBAL RESEARCH RAISES PRICE OBJECTIVE TO $175 FROM $170 NYSE ORDER IMBALANCE <ZTS.N> 77562.0 SHARES ON SELL SIDE<\s>


#### Softmax

In [34]:
prediction = prediction[-1,:,:]
p_i = np.zeros((prediction.shape))
for i in range(0, len(headline[:-1])):
    p = np.exp(prediction[i])/np.sum(np.exp(prediction[i])) #softmax
    p_i[i] = p

In [35]:
print(np.argmax(p_i, axis = 1))
''.join(map(chr,np.argmax(p_i, axis = 1)))

[115  62  74  79  69  84  73  83  32  73  78  67  32  60  88  84  83  46
  78  62  58  32  67  82  69  68  73  84  32  83  85  73  83  83  69  32
  82  65  73  83  69  83  32  80  82  73  67  69  32  84  65  82  71  69
  84  32  84  79  32  36  49  57  50  32  70  82  79  77  32  36  49  56
  50  32  81  79  69  84  73  83  32  73  78  67  32  60  88  84  83  46
  78  62  58  32  66  79  70  65  32  71  76  79  66  65  76  32  82  69
  83  69  65  82  67  72  32  82  65  73  83  69  83  32  80  82  73  67
  69  32  79  66  44  69  67  84  73  86  69  32  84  79  32  36  49  55
  53  32  70  82  79  77  32  36  49  55  48  32  78  89  83  69  32  79
  82  68  69  82  32  73  77  66  65  76  65  78  67  69  32  60  90  84
  83  46  78  62  32  74  55  53  54  50  46  48  32  83  72  65  82  69
  83  32  79  78  32  83  69  76  76  32  83  73  68  69  60  92 115  62]


's>JOETIS INC <XTS.N>: CREDIT SUISSE RAISES PRICE TARGET TO $192 FROM $182 QOETIS INC <XTS.N>: BOFA GLOBAL RESEARCH RAISES PRICE OB,ECTIVE TO $175 FROM $170 NYSE ORDER IMBALANCE <ZTS.N> J7562.0 SHARES ON SELL SIDE<\\s>'
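The softmax above exponentiates the raw logits directly, which can overflow for large values. A common remedy is to subtract the maximum logit first; the result is mathematically identical. Here is a minimal plain-Python sketch (the function name and sample logits are illustrative only):

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1001.0])  # math.exp(1000.0) alone would overflow
print(probs, sum(probs))
print(big)
```

Softmax is monotonic, so the `argmax` over probabilities picks the same character as the `argmax` over logits, which is why the two decoded strings match.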

#### Random Sampling

In [36]:
sampled_indices = tf.random.categorical(prediction, num_samples=1) 
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
"".join(map(chr,sampled_indices))
# wont use sampling in my case

's>XOETIS INC <ZTS.N>: CVEDIT SUISSE RAISES PRICE TARGET TO $192 FROM $132 6OETIS INC <ZTS.N>: BOFA GLOBAL RESEARCH RAISES PRICE OB,ECTIXE TO $17Y 3ROM $120 NYSE ORDER IMUALANCE UUTS.N> 37592.0 SHARES ON SELL SIDE<\\s>'
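For comparison with greedy `argmax` decoding, here is a plain-Python sketch of sampling from logits. `sample_index` and its `temperature` parameter are hypothetical illustrations, not part of the notebook or of `tf.random.categorical`.

```python
import math
import random


def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index with probability proportional to exp(logit / temperature)."""
    scaled = [v / temperature for v in logits]
    shift = max(scaled)  # shift for numerical stability
    weights = [math.exp(v - shift) for v in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(42)  # seeded so the run is repeatable
logits = [0.0, 0.0, 5.0]
draws = [sample_index(logits, rng=rng) for _ in range(1000)]
greedy = max(range(len(logits)), key=logits.__getitem__)
print("greedy:", greedy)
print("share of draws equal to greedy:", draws.count(greedy) / len(draws))
```

Sampling occasionally picks a less likely index, which is where the extra character errors in the sampled headline come from.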

* Use the `generate_text` function to test whether the model can generate new text

In [37]:
generate_text(model, '<s> Today', 10, 1)
#Concluded model is stateless and only learned how to represent and regenerate passed text but not generate new text!

'<s> Today<>>>>>>>>>'

__Concluded__: the model is stateless; it only learned how to represent and regenerate the text passed to it, not how to generate new text based on the previous term/word!

### EMBEDDINGS LAYER

In [38]:
trained_embeddings = model.get_layer('EmbedLayer').get_weights()[0]

In [39]:
trained_embeddings.shape

(127, 256)

In [40]:
input_seq = '<s>AMZN wont be paying any taxes for 2019~<\s'
input_seq = tf.squeeze(encode2bytes(input_seq)).numpy()
input_seq

array([ 60, 115,  62,  65,  77,  90,  78,  32, 119, 111, 110, 116,  32,
        98, 101,  32, 112,  97, 121, 105, 110, 103,  32,  97, 110, 121,
        32, 116,  97, 120, 101, 115,  32, 102, 111, 114,  32,  50,  48,
        49,  57, 126,  60,  92, 115])

In [41]:
# Process of representing each of our features/character/byte/input
lookup_table = tf.nn.embedding_lookup(trained_embeddings, input_seq)
lookup_table

<tf.Tensor: shape=(45, 256), dtype=float32, numpy=
array([[-0.13882878, -0.30200124,  0.08077791, ...,  0.16514868,
         0.18841438,  0.09872594],
       [-0.11984932, -0.17366889,  0.0651598 , ...,  0.12940577,
        -0.2054549 , -0.05256342],
       [ 0.04611618,  0.08065684,  0.1510432 , ...,  0.01680804,
        -0.09226788, -0.03016171],
       ...,
       [-0.13882878, -0.30200124,  0.08077791, ...,  0.16514868,
         0.18841438,  0.09872594],
       [-0.10369939, -0.25992334,  0.02147431, ...,  0.1398279 ,
         0.11564761,  0.15265961],
       [-0.11984932, -0.17366889,  0.0651598 , ...,  0.12940577,
        -0.2054549 , -0.05256342]], dtype=float32)>

In [42]:
# Make sure the looked-up embedding equals the matching row of the language model embeddings (byte 60 is '<')
np.array_equal(trained_embeddings[60], lookup_table[0].numpy())

True

__In a nutshell:__

For each character/byte, the model looks up the embedding, runs the LSTM one timestep with the embedding as input, and applies the dense layer to generate logits, i.e. unnormalized log-probabilities, for the next character/byte. The distribution for each predicted character/byte is defined by these logits over the characters (i.e. byte values 1-126 in decimal; 0 is reserved for padding).
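The lookup step is nothing more than row indexing into the embedding matrix. Here is a toy sketch in plain Python with a random table; the embedding width of 4 is chosen only to keep the demo small (the trained matrix is 127 x 256), and `embedding_lookup` is a hypothetical stand-in for `tf.nn.embedding_lookup`.

```python
import random

rng = random.Random(0)
VOCAB_SIZE, EMBED_DIM = 127, 4  # toy width; the notebook's matrix is (127, 256)

# One row of EMBED_DIM numbers per byte value.
table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def embedding_lookup(table, ids):
    """Return the table row for each id, in order."""
    return [table[i] for i in ids]


ids = [ord(ch) for ch in "<s>AMZN"]
vectors = embedding_lookup(table, ids)
print(ids)
print(len(vectors), "vectors of width", len(vectors[0]))
```

The same byte always maps to the same row, which is why the first and third-to-last rows of the lookup output above are identical (both are `<`).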

#### Save Embeddings Representations and Visualize in 3D

In [None]:
import csv

print(trained_embeddings.shape)  # shape: (characters/bytes, embedding_dim) --> (127, 256)

# Save one embedding vector per line (vecs.tsv) and its character label (meta.tsv)
with open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
        open('meta.tsv', 'w', encoding='utf-8', newline='') as out_m:
    tsv_writer = csv.writer(out_m, delimiter='\t', lineterminator='\n')
    for i in range(1, 127):  # skip 0, it's padding
        vec = trained_embeddings[i]
        tsv_writer.writerow([chr(i)])  # pass a list so the label is written as one field
        out_v.write('\t'.join(str(x) for x in vec) + "\n")

Open the [Embeddings Projector](https://projector.tensorflow.org/) to visualize the embeddings in 3D.
Upload the `vecs.tsv` and `meta.tsv` files created by the code above to the matching upload fields on the website to view the character-level dense vectors in a compressed 3-D representation.

For ready-made pretrained models, see [TensorFlow Hub](https://tfhub.dev/).

I changed the encoding scheme to cover UTF-8 encoded characters; the implementation of how I encoded characters to decimal code points (byte-level encoding) is in the `functions.py` script.

## Transfer Learning
Now we transfer the learning from the language model, which was able to draw connections between characters in the context of financial news headlines, to stabilize the training of the second model, the ultimate goal, which predicts the price direction of an asset from news headlines.

In [8]:
daily = tf.keras.models.load_model(daily_model, compile=False) #make sure you have the right path for daily_model
daily.summary()

Model: "RNNStocks"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
EmbedLayer (Embedding)       (None, None, 256)         32512     
_________________________________________________________________
BiLSTM (Bidirectional)       (None, 2048)              10493952  
_________________________________________________________________
BatchNormal (BatchNormalizat (None, 2048)              8192      
_________________________________________________________________
FullConnected (Dense)        (None, 512)               1049088   
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 512)               0         
_________________________________________________________________
BatchNormal2 (BatchNormaliza (None, 512)               2048      
_________________________________________________________________
Output (Dense)               (None, 1)                 51

###### Make sure `new_headlines.csv` is in the same folder as the notebook

In [9]:
final_df= pd.read_csv("new_headlines.csv")
final_df = final_df.set_index('time')
final_df.index = pd.to_datetime(final_df.index)
final_df.index = final_df.index.strftime('%Y-%m-%d %H:%M:%S')
final_df.index = pd.to_datetime(final_df.index)
max_date = final_df.index.max().strftime('%Y-%m-%d')

idx = pd.IndexSlice
final_df = final_df.reset_index().set_index(['time', 'ticker']).sort_index(level = 0, sort_remaining = 0).loc[idx[max_date,:], :]

In [10]:
final_df['n_Characters'] = final_df['text'].str.len()
final_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| n_Characters | 2982.0 | 70.261234 | 33.093554 | 11.0 | 47.0 | 67.0 | 84.0 | 265.0 |


In [12]:
X = encode2bytes(final_df.text.apply(lambda x: '<s>' + x + '<\s'))
X = pad_sequences(X, maxlen =  max(map(len, X)), padding = 'post', truncating='post')

predictions = daily.predict(X)

In [55]:
final_df['Predictions'] = np.squeeze(predictions)
final_df['Prediction Date'] = (datetime.datetime.now()).strftime('%Y-%m-%d')
final_df['BUY(20% Threshold)'] = (np.squeeze(predictions) > 0.2)
final_df['BUY(40% Threshold)'] = (np.squeeze(predictions) > 0.4)
final_df['BUY(60% Threshold)'] = (np.squeeze(predictions) > 0.6)

In [56]:
final_df

| time | ticker | text | n_Characters | Predictions | Prediction Date | BUY(20% Threshold) | BUY(40% Threshold) | BUY(60% Threshold) |
|---|---|---|---|---|---|---|---|---|
| 2021-07-20 00:02:21 | JNJ | US: US states to unveil $35b opioid settlement | 46 | 0.237652 | 2021-08-08 | True | False | False |
| 2021-07-20 00:04:22 | RSG | REPUBLIC SERVICES, INC. SEC Filings files Form... | 51 | 0.132664 | 2021-08-08 | False | False | False |
| 2021-07-20 00:05:07 | CDNS | CADENCE DESIGN SYSTEMS INC SEC Filings files F... | 54 | 0.054491 | 2021-08-08 | False | False | False |
| 2021-07-20 00:05:43 | PPG | (EN) PPG INDUSTRIES, INC. Monthly Presentation... | 57 | 0.249839 | 2021-08-08 | True | False | False |
| 2021-07-20 00:06:53 | RSG | REPUBLIC SERVICES, INC. SEC Filings files Form... | 51 | 0.132664 | 2021-08-08 | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-07-20 21:47:13 | GL | PDF 1: TORCHMARK CORPORATION (Gl-globe Life In... | 108 | 0.112298 | 2021-08-08 | False | False | False |
| 2021-07-20 21:47:13 | GL | TORCHMARK CORPORATION (Gl-globe Life Inc- Anno... | 101 | 0.084044 | 2021-08-08 | False | False | False |
| 2021-07-20 21:48:33 | MRVL | Marvell Technology, Inc. SEC Filings files For... | 52 | 0.072060 | 2021-08-08 | False | False | False |
| 2021-07-20 21:54:52 | IR | SPX Flow Denies to Sell Itself to Ingersoll Rand | 48 | 0.114485 | 2021-08-08 | False | False | False |
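The BUY columns are simply the predicted probability compared against a threshold. Here is a small plain-Python sketch of that step; `buy_flags` is a hypothetical helper, and the sample probabilities are the first three rows of the table above.

```python
def buy_flags(probabilities, thresholds=(0.2, 0.4, 0.6)):
    """For each probability, report whether it exceeds each threshold."""
    return [{t: p > t for t in thresholds} for p in probabilities]


preds = [0.237652, 0.132664, 0.054491]
flags = buy_flags(preds)
for p, row in zip(preds, flags):
    print(p, row)
```

Raising the threshold trades fewer BUY signals for higher required confidence; which level is appropriate is a judgment call this proof of concept does not settle.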


###### Make sure you put the right path to where you would like to save this output file:

In [15]:
final_df[final_df['text'].str.contains('NYSE ORDER IMBALANCE')][['text', 'Predictions', 'Prediction Date']].to_csv('ORDER_IMBALANCES.csv')

Key Words:
*  blank check company
* SPAC
* 13F,G OR D
* Clinical Trials

###### Make sure you put the right path to where you would like to save these output files:

In [None]:
# Save your predictions to the path of your choice
final_df.to_csv('Todays_Prediction.csv')

#### Pass in News Headline or Random Words to model

In [60]:
sample = "Amazon's Sales Growth Costs a Fortune in Shipping and Fulfillment" + " Jeff Bezos, Bill Gates and other tech luminaries react to Biden's victory" + " Amazon rolls out rewards program that makes it easier for drivers to get work" + " TECH Alibaba cloud growth outpaces Amazon and Microsoft as Chinese tech giant pushes for profitability"
# sample = "Joe Biden" 
# sample = "Donald Trump" 
sample = '<s>' + sample + '<\s' 
print(sample)
sample = encode2bytes(sample)
print(sample)
# sample = tf.ragged.constant(sample)
sample = tf.squeeze(sample, )
sample = tf.expand_dims(sample, 0).numpy()
print(sample)
sample.shape

<s>Amazon's Sales Growth Costs a Fortune in Shipping and Fulfillment Jeff Bezos, Bill Gates and other tech luminaries react to Biden's victory Amazon rolls out rewards program that makes it easier for drivers to get work TECH Alibaba cloud growth outpaces Amazon and Microsoft as Chinese tech giant pushes for profitability<\s
[[60], [115], [62], [65], [109], [97], [122], [111], [110], [39], [115], [32], [83], [97], [108], [101], [115], [32], [71], [114], [111], [119], [116], [104], [32], [67], [111], [115], [116], [115], [32], [97], [32], [70], [111], [114], [116], [117], [110], [101], [32], [105], [110], [32], [83], [104], [105], [112], [112], [105], [110], [103], [32], [97], [110], [100], [32], [70], [117], [108], [102], [105], [108], [108], [109], [101], [110], [116], [32], [74], [101], [102], [102], [32], [66], [101], [122], [111], [115], [44], [32], [66], [105], [108], [108], [32], [71], [97], [116], [101], [115], [32], [97], [110], [100], [32], [111], [116], [104], [101], [114], [

(1, 326)

In [61]:
predict = daily(sample).numpy()[0][0]
print("Probability from Headlines: %f" % predict)

Probability from Headlines: 0.233987


### Distribute Computations on Devices:
Some tricks for pinning certain computations to a specific processor

In [62]:
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')]

In [63]:
with tf.device('/CPU'):
    # Note: these are NumPy operations, which run on the CPU regardless of tf.device;
    # the context manager only places TensorFlow ops.
    xs = np.zeros((len(uniqye_characters),1))
    h_prev = np.zeros((10,1))
    Wxh = np.random.randn(10, len(uniqye_characters))*0.01 # input to hidden
    Whh = np.random.randn(10, 10)*0.01 # hidden to hidden
    Why = np.random.randn(len(uniqye_characters), 10)*0.01 # hidden to output
    bh = np.zeros((10, 1)) # hidden bias
    by = np.zeros((len(uniqye_characters), 1)) # output bias
    hs = np.tanh(np.dot(Wxh, xs) + np.dot(Whh, h_prev) + bh) # hidden state
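The last line of the cell above is one step of a vanilla RNN: the hidden state is `tanh(Wxh·x + Whh·h_prev + bh)`. Here is a plain-Python sketch of that recurrence carried over a short sequence; the sizes, token ids and helper names are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, Wxh, Whh, bh):
    """One vanilla RNN step: h = tanh(Wxh @ x + Whh @ h_prev + bh)."""
    from_input = matvec(Wxh, x)
    from_state = matvec(Whh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_state, bh)]


rng = random.Random(0)
vocab_size, hidden_size = 5, 3  # toy sizes
Wxh = [[rng.gauss(0, 0.01) for _ in range(vocab_size)] for _ in range(hidden_size)]
Whh = [[rng.gauss(0, 0.01) for _ in range(hidden_size)] for _ in range(hidden_size)]
bh = [0.0] * hidden_size

# All-zero input and state give an all-zero hidden state, as in the cell above.
h_zero = rnn_step([0.0] * vocab_size, [0.0] * hidden_size, Wxh, Whh, bh)

# Feed one-hot encoded tokens one at a time, carrying the state forward.
h = [0.0] * hidden_size
for token in [1, 3, 2]:
    x = [1.0 if i == token else 0.0 for i in range(vocab_size)]
    h = rnn_step(x, h, Wxh, Whh, bh)
print(h_zero)
print(h)
```

An LSTM layer replaces this single `tanh` update with gated updates, but the idea of carrying a state from one character to the next is the same.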

In [64]:
#Saved packages to run this notebook
#!conda env export > environment.yml

If interested, please refer to my GitHub: [My Github!](https://github.com/firobeid)