<font color='#E27271'>

# *Unveiling Complex Interconnections Among Companies through Learned Embeddings*</font>

-----------------------
<font color='#E27271'>

Ethan Moody, Eugene Oon, and Sam Shinde</font>

<font color='#E27271'>

August 2023</font>

-----------------------
<font color='#00AED3'>

# **BERT MLM** </font>
-----------------------

BERT MLM refers to the masked language model (MLM) objective, one of the two pre-training objectives used to train the Bidirectional Encoder Representations from Transformers (BERT) model. During pre-training, a random subset of the input tokens is masked, and the model learns to predict each masked token from the surrounding tokens. Because each prediction draws on the context to both the left and the right of the mask, BERT learns bidirectional representations. A model pre-trained this way can then be fine-tuned for downstream tasks such as text classification, named entity recognition, and question answering, where BERT achieved state-of-the-art results at the time of its release.

In this experiment, we continue training the pre-trained BERT model with the MLM objective on the 10-K corpus, and then use the resulting model for our text classification task.
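
To make the objective concrete, here is a minimal sketch of masking at the word level. It is an illustration only: the helper name `mask_tokens`, the sample sentence, and the seed are assumptions, and the notebook itself masks wordpiece IDs rather than whole words.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a toy masked-language-model example.

    targets maps each masked position to the original token that the
    model would be asked to recover.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[pos] = tok
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical sentence; a higher mask_prob is used so a short input gets masked.
tokens = "the company designs and sells electric vehicles in north america".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(" ".join(masked))
print(targets)
```

The model sees the masked sequence as input and is scored on how well it recovers the tokens stored in `targets`.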

## [1] Installs, Imports, and Setup

### [1.1] Complete Initial Installs

In [None]:
# Installs
!pip install transformers --quiet


### [1.2] Import Packages

In [None]:
# Import general packages
import os, sys
import pandas as pd
import json
import re
from datetime import datetime

# Import JAX (used to count available GPUs)
import jax
from jax import numpy as jnp

# Import Transformers
from transformers import BertTokenizer, TFBertForMaskedLM
import tensorflow as tf

# Setup
path = '/content/gdrive/My Drive/project'

### [1.3] Mount Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## [2] Helper Functions

### [2.1] Clean Input Text

In [None]:
def clean(rawtext):
  """Remove text that might hurt model performance:
      - special characters
      - consecutive whitespace
      - new line characters
      - table content
      - every character except letters (a-z, A-Z), spaces, and dots (.)
  """

  # Remove specific (non-breaking space) character sequence
  rawtext = rawtext.replace('\\xa0','')

  # Remove New Line (escape the backslash)
  rawtext = rawtext.replace('\\n','')

  # Collapse runs of two or more whitespace characters into a single space
  rawtext = re.sub(r'\s\s+',' ',rawtext)

  # Replace new line with Space
  rawtext = re.sub('\n',' ',rawtext)

  # Remove table content (case-insensitive, spanning multiple lines)
  rawtext = re.sub(r'(?is)<table[^>]*>.*?</table>', '', rawtext)

  # Remove every character that is not a letter (a-z, A-Z), a space, or a dot (.)
  rawtext = re.sub(r'[^A-Za-z .]+', '', rawtext)
  # rawtext = re.sub(r'[^A-Za-z0-9 .]+', '', rawtext)
  # rawtext = re.sub('[^a-zA-Z\s]','',rawtext)

  # pattern that matches one or more consecutive digits
  # rawtext = re.sub(r'\d+', '', rawtext)

  rawtext = re.sub('I tem','',rawtext)
  rawtext = re.sub('TABLEEND','',rawtext)
  rawtext = re.sub('TABLESTART','',rawtext)

  # matches one or more consecutive spaces
  rawtext = re.sub(' +', ' ', rawtext)

  return rawtext
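
The effect of the main cleaning steps can be seen on a short string. This is a standalone sketch that mirrors the order used in `clean` (whitespace, tables, character whitelist, spaces); the sample text is made up for illustration.

```python
import re

# Hypothetical snippet resembling the start of a 10-K business section.
sample = ("Item 1. Business\n\nRevenue grew 12% in 2022. "
          "<table><tr><td>42</td></tr></table>We sell cars.")

# Turn newlines and other whitespace runs into single spaces.
text = re.sub(r"\s+", " ", sample)

# Drop everything between <table ...> and </table>.
text = re.sub(r"(?is)<table[^>]*>.*?</table>", "", text)

# Keep only letters, spaces, and dots; digits and '%' are removed.
text = re.sub(r"[^A-Za-z .]+", "", text)

# Collapse the runs of spaces left behind by the removals.
result = re.sub(r" +", " ", text)
print(result)  # Item . Business Revenue grew in . We sell cars.
```

Because dots are kept while digits are removed, numbers such as "1." leave a stray dot behind, and the chunking step in [2.2] treats each of those dots as a sentence boundary.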

### [2.2] Chunk Section into Paragraphs

In [None]:
def chunk_section(section, length=512, pcnt=1):
  """Split a section of text into a list of paragraphs of a desired word count.

  section : input text of any length
  length  : desired word count per paragraph (e.g., 512)
  pcnt    : fraction of `length` used as the actual word budget
            (e.g., 512 * 0.8 = 409), which leaves a buffer if needed.
            Use 1 for no buffer.
  """

  # Empty output list
  lines = []
  line = ''
  # Word budget for each output paragraph
  final_len = int(length * pcnt)

  # Loop through each sentence in the section; text after the last '.' is skipped
  for sentence in (s.strip() + '.' for s in section.split('.')[:-1]):

    # Adding this sentence would reach the budget: close the paragraph, start a new one
    if len(line.split()) + len(sentence.split()) + 1 >= final_len:
      lines.append(line)
      line = sentence.strip()

    # Otherwise add a space and then this sentence
    else:
      line += ' ' + sentence.strip()

  # Note: the paragraph still being built when the loop ends is not returned
  return lines
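
The greedy packing idea can be tried on a few sentences with a small budget. The sketch below is a simplified variant, not the function above: `pack_sentences` is a hypothetical name, it compares `count + n >= budget` without the extra `+ 1`, and it keeps the final partially filled chunk, whereas `chunk_section` as written returns only chunks that a later sentence has closed.

```python
def pack_sentences(sentences, budget):
    """Greedily pack sentences into chunks that stay under `budget` words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk when this sentence would reach the budget.
        if current and count + n >= budget:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:  # keep the trailing, partially filled chunk
        chunks.append(" ".join(current))
    return chunks

budget = int(512 * 0.8)  # word budget used in this notebook: 409
sentences = ["We build cars.", "We sell them online.",
             "Service is mobile.", "Demand is strong."]
chunks = pack_sentences(sentences, budget=8)  # tiny budget for the demo
print(budget)  # 409
print(chunks)  # ['We build cars. We sell them online.', 'Service is mobile. Demand is strong.']
```

In this variant, a single sentence longer than the budget still becomes its own chunk, so the budget is a target rather than a hard limit.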

### [2.3] Number of GPUs

In [None]:
def num_gpus():
    """Get the number of available GPUs."""
    try:
        return jax.device_count('gpu')
    except RuntimeError:
        return 0  # No GPU backend found

num_gpus()

1

## [3] Modeling Data Preparation

### [3.1] Load Training data

In [None]:
# Load Non S&P Data json file
data_df = pd.read_json(path + '/data/10K/nsp500_final.json')
data_df.head()

Unnamed: 0,ticker,cik,formType,filedAt,linkToTxt,linkToHtml,periodOfReport,year,ind,name,sector,industry,industry_group,business_cnt,business
0,MBLY,1910139,10-K,2023-03-09T16:15:44-05:00,https://www.sec.gov/Archives/edgar/data/191013...,https://www.sec.gov/Archives/edgar/data/191013...,2022-12-31,2022,NASDAQ,Mobileye Global Inc,Consumer Discretionary,Automobile Components,Automobiles & Components,133257,Item 1. Business \n\nIn this Annual Report on...
1,RIVN,1874178,10-K,2023-02-28T17:15:26-05:00,https://www.sec.gov/Archives/edgar/data/187417...,https://www.sec.gov/Archives/edgar/data/187417...,2022-12-31,2022,NASDAQ,Rivian Automotive Inc,Consumer Discretionary,Automobiles,Automobiles & Components,42199,Item 1. Business \n\nOverview \n\nRivian exis...
2,LCID,1811210,10-K,2023-02-28T16:09:35-05:00,https://www.sec.gov/Archives/edgar/data/181121...,https://www.sec.gov/Archives/edgar/data/181121...,2022-12-31,2022,NASDAQ,Lucid Group Inc,Consumer Discretionary,Automobiles,Automobiles & Components,82184,Item 1. Business. \n\nOVERVIEW \n\nMission \n...
3,LEA,842162,10-K,2023-02-09T16:59:45-05:00,https://www.sec.gov/Archives/edgar/data/842162...,https://www.sec.gov/Archives/edgar/data/842162...,2022-12-31,2022,NYSE,Lear Corp,Consumer Discretionary,Automobile Components,Automobiles & Components,88376,ITEM 1 &#8211; BUSINESS \n\nIn this Annual Re...
4,ALV,1034670,10-K,2023-02-16T09:41:48-05:00,https://www.sec.gov/Archives/edgar/data/103467...,https://www.sec.gov/Archives/edgar/data/103467...,2022-12-31,2022,NYSE,Autoliv Inc,Consumer Discretionary,Automobile Components,Automobiles & Components,38394,Item 1. Business \n\n&#160; \n\nGeneral \n\nA...


### [3.2] Chunk Training Data

In [None]:
stime = datetime.now()
data_input = []
para_list = []
for index, row in data_df[data_df['year']==2022].iterrows():
  # Skip rows whose business section failed to download or is empty
  if row['business'][:5] != 'Error' and len(row['business']) != 0:
    para_list = chunk_section(clean(row['business']),512, 0.8)
    data_input.extend(para_list)
    para_list = []

etime = datetime.now()
print(f'Total time taken is: {((etime-stime).total_seconds())/60}')

Total time taken is: 0.67274415


### [3.3] Train Validation Split

In [None]:
from sklearn.model_selection import train_test_split

train_input, val_input = train_test_split(data_input,
                                          test_size=0.15,
                                          random_state=42)

print(f'Train Input Count      : {len(train_input)}')
print(f'Train Validation Count : {len(val_input)}')

Train Input Count      : 89522
Train Validation Count : 15799
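
The two counts are consistent with how scikit-learn sizes a fractional split: the test size is rounded up and the remainder goes to training. A quick check of the arithmetic, using only the numbers printed above:

```python
import math

n_total = 89522 + 15799            # chunks reported above
n_val = math.ceil(0.15 * n_total)  # fractional test size is rounded up
n_train = n_total - n_val          # the remainder is used for training
print(n_total, n_train, n_val)     # 105321 89522 15799
```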


## [4] Building and Training MLM Model

### [4.1] Load Model and Tokenizer

In [None]:
# Tokenizer and pre-trained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForMaskedLM.from_pretrained('bert-base-uncased')

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


### [4.2] Tokenize Training and Validation Data

In [None]:
stime = datetime.now()
train_inputs = tokenizer(train_input,return_tensors='tf', max_length=512,
                   truncation=True, padding='max_length')

etime = datetime.now()
print(f'Total time taken is: {((etime-stime).total_seconds())/60}')

Total time taken is: 12.235143983333334


In [None]:
stime = datetime.now()
val_inputs = tokenizer(val_input,return_tensors='tf', max_length=512,
                   truncation=True, padding='max_length')

etime = datetime.now()
print(f'Total time taken is: {((etime-stime).total_seconds())/60}')

Total time taken is: 2.1895686833333334
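
As a simplified picture of what `padding='max_length'` and `truncation=True` do to each chunk, the sketch below pads or truncates a list of token IDs to a fixed length. The helper `pad_or_truncate` is hypothetical and the input IDs are placeholders; the real tokenizer also performs wordpiece splitting and returns `token_type_ids`.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token IDs in bert-base-uncased

def pad_or_truncate(token_ids, max_length=512):
    """Wrap IDs in [CLS]/[SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding

# Short input: padded with zeros, and the attention mask marks the padding.
ids, mask = pad_or_truncate([2023, 2003, 1037], max_length=8)
print(ids)   # [101, 2023, 2003, 1037, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Since a word can split into several wordpieces, a chunk of roughly 409 words can still exceed 512 tokens, in which case the tail of the chunk is cut off.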


### [4.3] Create labels

In [None]:
train_inputs['labels'] = train_inputs['input_ids']
val_inputs['labels'] = val_inputs['input_ids']

### [4.4] Create Tensor of Uniform Random values for Training and Validation

In [None]:
train_rand = tf.random.uniform(train_inputs.input_ids.shape)
print(train_rand.shape)
val_rand = tf.random.uniform(val_inputs.input_ids.shape)
print(val_rand.shape)

(89522, 512)
(15799, 512)


### [4.5] Create Mask for 15%

In [None]:
train_mask_array = (train_rand < 0.15) & (train_inputs.input_ids != 101) & (train_inputs.input_ids != 0) & (train_inputs.input_ids != 102)
val_mask_array = (val_rand < 0.15) & (val_inputs.input_ids != 101) & (val_inputs.input_ids != 0) & (val_inputs.input_ids != 102)

In [None]:
train_selection = []
for i in range(train_mask_array.shape[0]):
  train_selection.append(train_mask_array[i].numpy().nonzero()[0].tolist())

In [None]:
val_selection = []
for i in range(val_mask_array.shape[0]):
  val_selection.append(val_mask_array[i].numpy().nonzero()[0].tolist())

### [4.6] Apply Mask on Training and Validation Data

In [None]:
train_input_ids_np = train_inputs.input_ids.numpy()
for i in range(train_mask_array.shape[0]):
  train_input_ids_np[i, train_selection[i]] = 103
val_input_ids_np = val_inputs.input_ids.numpy()
for i in range(val_mask_array.shape[0]):
  val_input_ids_np[i, val_selection[i]] = 103

In [None]:
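# Assign by key so that both train_inputs['input_ids'] and train_inputs.input_ids return the masked IDs
train_inputs['input_ids'] = tf.convert_to_tensor(train_input_ids_np)
val_inputs['input_ids'] = tf.convert_to_tensor(val_input_ids_np)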
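
The labels created in [4.3] hold the original token ID at every position, so a loss with no ignore value also scores unmasked and padding positions. A common alternative, shown here only as a sketch, is to keep the original ID at masked positions and set every other label to -100, the value that Hugging Face loss computations skip by convention. The helper `mask_and_label` and the sample IDs are hypothetical, and the `SparseCategoricalCrossentropy` loss compiled in [4.7] is not configured to skip -100, so labels like these would need a loss that does.

```python
MASK_ID = 103        # [MASK] in bert-base-uncased
IGNORE_INDEX = -100  # label value skipped by Hugging Face loss computations

def mask_and_label(original_ids, masked_positions):
    """Return (masked_input_ids, labels) for one sequence.

    Labels keep the original ID only at masked positions; every other
    position is set to IGNORE_INDEX.
    """
    chosen = set(masked_positions)
    masked_input = [MASK_ID if i in chosen else tok
                    for i, tok in enumerate(original_ids)]
    labels = [tok if i in chosen else IGNORE_INDEX
              for i, tok in enumerate(original_ids)]
    return masked_input, labels

# Placeholder IDs: 101 = [CLS], 102 = [SEP], 0 = [PAD]; the rest are arbitrary.
ids = [101, 2057, 5271, 3765, 102, 0, 0]
masked_input, labels = mask_and_label(ids, masked_positions=[2])
print(masked_input)  # [101, 2057, 103, 3765, 102, 0, 0]
print(labels)        # [-100, -100, 5271, -100, -100, -100, -100]
```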

### [4.7] Compile and Fit Model

In [None]:
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=4, min_delta=0.02, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=1e-6, patience=2, verbose=0, mode='auto', min_delta=0.001, cooldown=0, min_lr=1e-6)
]
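
How `ReduceLROnPlateau` would act with these settings can be checked by hand: when it triggers, Keras multiplies the current learning rate by `factor` and then clamps the result at `min_lr`. The helper name `reduced_lr` is hypothetical, and the starting rate of 1e-5 is the one passed to Adam in [4.7].

```python
def reduced_lr(current_lr, factor, min_lr):
    """Learning rate after one ReduceLROnPlateau trigger: scale, then clamp."""
    return max(current_lr * factor, min_lr)

new_lr = reduced_lr(1e-5, factor=1e-6, min_lr=1e-6)
print(new_lr)  # 1e-06
```

With `factor=1e-6`, the scaled rate falls far below `min_lr`, so the first trigger drops the rate straight to the floor of 1e-6. Note also that with `epochs=3` in [4.7], `EarlyStopping` with `patience=4` has no opportunity to stop the run early.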

In [None]:
stime = datetime.now()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics='accuracy')

history = model.fit(
    [train_inputs.input_ids, train_inputs.token_type_ids, train_inputs.attention_mask],
    train_inputs.labels,
    validation_data=(
        [val_inputs.input_ids, val_inputs.token_type_ids, val_inputs.attention_mask],
        val_inputs.labels),
    batch_size=8, epochs=3,
    callbacks=callbacks)


Epoch 1/3
Epoch 2/3
Epoch 3/3
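
For a sense of scale, the number of batches per epoch follows from the dataset sizes in [3.3] and the batch size of 8. This is plain arithmetic on values already shown in the notebook:

```python
import math

batch_size, epochs = 8, 3
train_steps = math.ceil(89522 / batch_size)  # batches per training epoch
val_steps = math.ceil(15799 / batch_size)    # batches per validation pass
total_train_steps = epochs * train_steps
print(train_steps, val_steps, total_train_steps)  # 11191 1975 33573
```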


### [4.8] Save Model

In [None]:
model.save_pretrained('/content/gdrive/My Drive/project/models/10k_bert_model_final')

In [None]:
etime = datetime.now()
print(f'Total time taken is: {((etime-stime).total_seconds())/60}')

Total time taken is: 732.5279683333333
