# Generating Document Vectors for SEC Filings

Many tasks require embeddings of domain-specific vocabulary that models pretrained on a generic corpus may not be able to capture. Standard **Word2Vec** models are not able to assign vectors to out-of-vocabulary words and instead use a default vector that reduces their predictive value. For example, when working with industry-specific documents, the vocabulary or its usage may change over time as new technologies or products emerge. As a result, the embeddings need to evolve as well. In addition, documents like corporate earnings releases use nuanced language that GloVe vectors pretrained on Wikipedia articles are unlikely to properly reflect.

**Doc2Vec** is a model that represents each document as a vector, an approach that usually outperforms simple averaging of Word2Vec vectors. Given the specific content of each section of SEC filings, training document embeddings instead of single word embeddings can increase the predictive content of each section in the downstream tasks.

The document embedding model produces embeddings directly for pieces of text such as a paragraph or a product review. Similar to Word2Vec, Doc2Vec comes in two flavors:
- **The distributed memory (DM)** model corresponds to the Word2Vec CBOW model. The doc vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both the context word vectors and the full document's doc vector.
- **The distributed bag of words (DBOW)** model corresponds to the Word2Vec skip-gram architecture. The doc vectors result from training a neural net to predict a target word using the full document's doc vector.
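
The difference between the two flavors can be illustrated with a minimal sketch of the input each one feeds to the prediction layer. The toy vocabulary, the vector size of 4, and the helper names `dm_input` and `dbow_input` are hypothetical and only serve the illustration; this is not how Gensim implements training internally.

```python
import random

random.seed(0)
DIM = 4  # toy embedding size; the notebook itself uses 384


def random_vector(dim=DIM):
    """Small random initial vector, standing in for a learned embedding."""
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]


# Hypothetical toy parameters: one vector per word and one per document tag
context = ["revenue", "increased", "to", "demand"]
word_vectors = {w: random_vector() for w in context + ["due"]}
doc_vectors = {0: random_vector()}


def dm_input(doc_tag, context_words):
    """DM: average the doc vector together with the context word vectors."""
    vectors = [doc_vectors[doc_tag]] + [word_vectors[w] for w in context_words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dbow_input(doc_tag):
    """DBOW: the doc vector alone is used to predict each target word."""
    return doc_vectors[doc_tag]


# Both inputs would be used to predict the center word "due"
hidden_dm = dm_input(0, context)
hidden_dbow = dbow_input(0)
```

In both cases the doc vector is a trainable parameter that is updated alongside the network weights, which is how the document embeddings emerge as a by-product of the word-prediction task.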

Gensim's [Doc2Vec class](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html) implements both of the above algorithms. We use the [Gensim](https://radimrehurek.com/gensim/intro.html) library to train embeddings that represent documents as vectors. Gensim is a free, open-source Python library for representing documents as semantic vectors, designed to process raw, unstructured text using unsupervised machine learning algorithms.

## Imports & Settings

In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
from pathlib import Path
from collections import Counter
import logging
import json

import numpy as np
import pandas as pd

import datasets

## Preprocessing Data
We will use a [Hugging Face dataset](https://huggingface.co/datasets/c3po-ai/edgar-corpus) that contains the **10-K** annual reports that public companies filed with SEC EDGAR between 1993 and 2020. We extract the most informative sections, namely:
- Sections 1 and 1A: Business and Risk Factors
- Sections 7 and 7A: Management's Discussion and Disclosures about Market Risks

For about 3,000 companies, we have stock prices for 2013-2016 that we can use to label the data for predictive modeling. Therefore, we use the years 2010-2015 for training document embeddings and the 2016 data for testing. In addition, for downstream tasks such as predicting a company's returns, we select a subset of the companies (large, medium, and small caps) to reduce computation time.
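
The year-based split described above can be sketched in plain Python as follows. The `filings` records and the `split_by_year` helper are hypothetical; the only property taken from the dataset is that `year` is stored as a string, which is why it is cast to `int` before comparing.

```python
START_YEAR, END_YEAR = 2010, 2016

# Hypothetical filing records; 'year' is a string, as in the dataset
filings = [
    {"cik": "1001", "year": "2009"},
    {"cik": "1001", "year": "2010"},
    {"cik": "1002", "year": "2015"},
    {"cik": "1002", "year": "2016"},
    {"cik": "1003", "year": "2017"},
]


def split_by_year(rows, start_year, end_year):
    """Train on [start_year, end_year), test on end_year, drop everything else."""
    train_rows = [r for r in rows if start_year <= int(r["year"]) < end_year]
    test_rows = [r for r in rows if int(r["year"]) == end_year]
    return train_rows, test_rows


train_rows, test_rows = split_by_year(filings, START_YEAR, END_YEAR)
```

Splitting by time rather than at random keeps every test filing strictly later than the filings used to train the embeddings.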

---

In [2]:
NUM_PROC = 4

SECTIONS = ["cik", "year", "section_1", "section_1A", "section_7", "section_7A"]

FROM_DISK = True

START_YEAR = 2010
END_YEAR = 2016

# The sentence transformer language model: all-MiniLM-L6-v2 has embedding size of 384
# Set the same for Gensim embedding size for consistent comparisons
EMBEDDING_SIZE = 384

results_path = Path('sec-edgar-10k')

model_path = results_path / 'models'
data_path = results_path / 'data'
parsed_data_path = results_path / 'parsed-data'
companies_data_path = results_path / 'subset-companies'
log_path = results_path / 'logs'

for path in [model_path, data_path, parsed_data_path, companies_data_path, log_path]:
    if not path.exists():
        path.mkdir(parents=True)

logging.basicConfig(
    filename=log_path / 'doc2vec.log',
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)

In [3]:
if not FROM_DISK:
    years = {}
    for year in range(START_YEAR, END_YEAR + 1):
        years[year] = datasets.load_dataset("eloukas/edgar-corpus", f"year_{year}", split="all")
    
    train = datasets.concatenate_datasets([years[year] for year in range(START_YEAR, END_YEAR)], axis=0)
    test = years[END_YEAR]
    train = train.remove_columns(list(set(train.column_names) - set(SECTIONS)))
    test = test.remove_columns(list(set(test.column_names) - set(SECTIONS)))
    
    dataset = train.train_test_split(train_size=0.8)
    dataset["validation"] = dataset.pop("test")
    dataset["test"] = test
    
    dataset.save_to_disk(data_path.as_posix(), num_proc=NUM_PROC)
else:
    dataset = datasets.load_from_disk(data_path.as_posix())

## Document Embeddings: Gensim

We use **spaCy** to preprocess and filter the paragraphs, e.g., by removing stop words, digits, and punctuation. The model is trained on all four sections of the SEC filings corpus using Gensim's implementation of Doc2Vec's **distributed memory** algorithm, which is analogous to Word2Vec CBOW.
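
To give a feel for what this kind of filtering does to a sentence, here is a dependency-free sketch. It is only a stand-in: the regular-expression tokenizer and the tiny `STOP_WORDS` set are assumptions made for illustration, whereas the notebook relies on spaCy's tokenizer, stop-word list, and part-of-speech tags.

```python
import re

# A tiny hypothetical stop-word list; spaCy ships a much larger one
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "by", "our", "we"}


def clean_text(text):
    """Keep purely alphabetic tokens that are not stop words."""
    # Split into runs of word characters and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
    return " ".join(kept)


cleaned = clean_text("Revenue increased by 12% in 2015, driven by the cloud segment.")
print(cleaned)  # Revenue increased driven cloud segment
```

As in the notebook, a text that contains no usable tokens comes back as an empty string, which makes such rows easy to filter out later.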

---

In [4]:
import spacy

from gensim.models import KeyedVectors, Doc2Vec, doc2vec
from gensim.utils import simple_preprocess

In [5]:
nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.max_length = 10_000_000

SECTIONS_TO_PARSE = ['section_1', 'section_1A', 'section_7', 'section_7A']

In [6]:
train = datasets.concatenate_datasets([dataset["train"], dataset["validation"]], axis=0)
test = dataset["test"]
if int(sorted(set(train['year']))[-1]) >= END_YEAR:
    test = train.filter(lambda row: int(row['year']) == END_YEAR)
    train = train.filter(lambda row: int(row['year']) >= START_YEAR and int(row['year']) < END_YEAR)

In [7]:
def parse_sections(examples, col_names):
    sections = {}

    for col in col_names:
        docs = []
        for text in examples[col]:
            doc = nlp(text)
            clean_sentence = []
            for sentence in doc.sents:
                if sentence is not None:
                    for token in sentence:
                        if not any([token.is_stop,
                                    token.is_digit,
                                    not token.is_alpha,
                                    token.is_punct,
                                    token.is_space,
                                    token.lemma_ == '-PRON-',
                                    token.pos_ in ['PUNCT', 'SYM', 'X']]):
                            clean_sentence.append(token.text)
            if len(clean_sentence) > 0:
                docs.append(' '.join(clean_sentence))
            else:
                docs.append("")
        sections[col] = docs

    res = {"cik": examples["cik"], "year": examples["year"]}
    for section in sections:
        res[f"parsed_{section}"] = sections[section]

    return res

In [8]:
if not FROM_DISK:
    train_parsed = train.map(parse_sections, batched=True, fn_kwargs={"col_names": SECTIONS_TO_PARSE}, remove_columns=train.column_names, num_proc=NUM_PROC)
    test_parsed = test.map(parse_sections, batched=True, fn_kwargs={"col_names": SECTIONS_TO_PARSE}, remove_columns=test.column_names, num_proc=NUM_PROC)
    
    dataset = datasets.DatasetDict()
    dataset['train'] = train_parsed
    dataset['test'] = test_parsed
    
    dataset.save_to_disk(parsed_data_path.as_posix(), num_proc=NUM_PROC)
else:
    dataset = datasets.load_from_disk(parsed_data_path.as_posix())
    train_parsed = dataset['train']
    test_parsed = dataset['test']
    if int(sorted(set(train_parsed['year']))[-1]) >= END_YEAR:
        test_parsed = train_parsed.filter(lambda row: int(row['year']) == END_YEAR)
        train_parsed = train_parsed.filter(lambda row: int(row['year']) >= START_YEAR and int(row['year']) < END_YEAR)

In [9]:
PARSED_SECTIONS = set(train_parsed.column_names) - set(['cik', 'year'])

In [10]:
def wrap(data, col_names, new_name):
    wrapped = []
    for col in col_names:
        temp = data.select_columns(['cik', 'year', col])
        temp = temp.filter(lambda example: example[col] != '')
        temp = temp.add_column(name='section',  column=[col.split('_')[-1]] * len(temp))
        temp = temp.rename_column(col, new_name)
        wrapped.append(temp)

    return wrapped

In [11]:
# Concatenate all parsed sections of SEC filings in one column to be fed to the gensim model for training
wrapped = wrap(train_parsed, PARSED_SECTIONS, new_name='parsed_sections')
train_parsed = datasets.concatenate_datasets(wrapped, axis=0)

wrapped = wrap(test_parsed, PARSED_SECTIONS, new_name='parsed_sections')
test_parsed = datasets.concatenate_datasets(wrapped, axis=0)

In [12]:
print(f"Number of training examples: {train_parsed.num_rows}")
print(f"Number of test examples: {test_parsed.num_rows}")

Number of training examples: 182218
Number of test examples: 27178


In [13]:
class Corpus:
    def __init__(self, data, col_name):
        self.data = data
        self.data.set_format('pandas')
        self.df = self.data[col_name]

    def __iter__(self):
        """Yield one TaggedDocument (a token list plus an integer tag) per row."""
        for i, line in enumerate(self.df):
            # lowercase and tokenize
            tokens = simple_preprocess(line)
            # tag each document with its row index so it can be looked up in model.dv
            yield doc2vec.TaggedDocument(tokens, [i])

    def __len__(self):
        return len(self.df)

In [14]:
train_corpus = Corpus(train_parsed, col_name='parsed_sections')

### Train and save the model or load from checkpoint

In [15]:
if os.path.exists((model_path / 'doc2vec_0.model').as_posix()):
    gensim_model = Doc2Vec.load((model_path / 'doc2vec_0.model').as_posix())
    wv = KeyedVectors.load((model_path / 'word_vectors_0.bin').as_posix())
else:
    gensim_model = Doc2Vec(
        documents=train_corpus,
        dm=1,              # 1=dist. memory, 0=dist. BOW
        vector_size=EMBEDDING_SIZE,
        window=8,          # max distance between target and context
        min_count=2,       # ignore tokens w. lower frequency
        epochs=15,
        alpha=0.05,        # initial learning rate
        min_alpha=0.001,   # final learning rate
        hs=0,              # 1=hierarchical softmax, 0=negative sampling
        negative=14,       # negative training (noise) samples, only needed for negative sampling
        dm_concat=0,       # 1=concatenate vectors, 0=sum
        dbow_words=0,      # 1=train word vectors as well (in skip-gram fashion), 0=only doc. vectors
        workers=4
    )
    
    gensim_model.save((model_path / 'doc2vec_0.model').as_posix())
    gensim_model.wv.save((model_path / 'word_vectors_0.bin').as_posix())

### Evaluate model

In [16]:
def eval_model(test, train, model, topn=10):
    top = {}
    sim_docs = {}
    
    # Materialize the corpus and the metadata once instead of on every iteration
    test_docs = list(test)
    test_meta = test.data[:]
    train_meta = train.data[:][['cik', 'year', 'section']]

    for doc_id, doc in enumerate(test_docs):
        inferred_vector = model.infer_vector(doc.words)
        sims = model.dv.most_similar(positive=[inferred_vector], topn=topn)
        docids = [docid for docid, sim in sims]
        cik = test_meta['cik'].iloc[doc_id]
        year = test_meta['year'].iloc[doc_id]
        section = test_meta['section'].iloc[doc_id]
        top_n = train_meta.loc[docids].values
        top[(cik, year, section)] = (*list(top_n[0]), sims[0][1])
        sim_docs[cik] = dict(Counter(top_n[:, 0]))

    accuracy = 0
    for k in top:
        c, y, s = k
        if top[k][0] == c and top[k][1] == y and top[k][2] == s:
            accuracy += 1
    
    print(f"Model accuracy in selecting the exact same document: {accuracy / len(top) * 100}%")

    accuracy = 0
    for k in top:
        c, y, s = k
        if top[k][0] == c:
            accuracy += 1

    print(f"Model accuracy in selecting documents from the same company: {accuracy / len(top) * 100}%")

    return top, sim_docs

#### Sanity check using a small validation set of documents from train corpus

To assess the model, we first infer new vectors for a random sample of documents from the training corpus, compare the inferred vectors with the vectors learned for the training corpus, and then check whether each document's most similar match is the document itself. In other words, we treat part of the training corpus as if it were new, unseen data and see how it compares with the trained model. The expectation is that we have likely overfit our model, so we should be able to find the matching documents very easily.
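
The idea behind this self-similarity check can be sketched with a few toy vectors. The two-dimensional `doc_vectors`, the `inferred` vector, and the helper names are hypothetical; in the notebook the vectors come from the trained Doc2Vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def self_similarity_rank(inferred, doc_vectors, true_id):
    """0-based rank of true_id when documents are sorted by similarity to inferred."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine(inferred, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked.index(true_id)


# Hypothetical trained doc vectors and a re-inferred vector for document 1
doc_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
inferred = [0.1, 0.9]

rank = self_similarity_rank(inferred, doc_vectors, true_id=1)
print(rank)  # 0, i.e. the document is its own nearest neighbour
```

A rank of 0 counts as a hit; the accuracy figures reported for the sanity check are the share of sampled documents for which this holds.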

In [17]:
doc_ids = np.random.randint(0, len(train_parsed), size=50)

In [18]:
valid_corpus = Corpus(train_parsed.select(doc_ids), col_name='parsed_sections')

In [19]:
top, sim_docs = eval_model(valid_corpus, train_corpus, gensim_model)

Model accuracy in selecting the exact same document: 66.0%
Model accuracy in selecting documents from the same company: 84.0%


### Continue training if needed

If the sanity-check accuracy is low, we may need to continue training so that the model becomes better at retrieving the exact document it was trained on.

In [20]:
if os.path.exists((model_path / 'doc2vec_1.model').as_posix()):
    gensim_model = Doc2Vec.load((model_path / 'doc2vec_1.model').as_posix())
    wv = KeyedVectors.load((model_path / 'word_vectors_1.bin').as_posix())
else:
    gensim_model.train(train_corpus, total_examples=gensim_model.corpus_count, epochs=gensim_model.epochs)
    
    gensim_model.save((model_path / 'doc2vec_1.model').as_posix())
    gensim_model.wv.save((model_path / 'word_vectors_1.bin').as_posix())

In [21]:
top, sim_docs = eval_model(valid_corpus, train_corpus, gensim_model)

Model accuracy in selecting the exact same document: 68.0%
Model accuracy in selecting documents from the same company: 84.0%


## Document Embeddings Using Sentence Transformers

Doc2Vec embeddings allow only a single fixed-length representation of each token that does not differentiate between context-specific usages. To address problems such as multiple meanings for the same word, called *polysemy*, several new models have emerged that build on the **attention** mechanism designed to learn more contextualized word embeddings. The key characteristics of these models are as follows:
- The use of bidirectional language models that process text both left-to-right and right-to-left for a richer context representation.
- The use of semi-supervised pretraining on a large generic corpus to learn universal language aspects in the form of embeddings and network weights that can be used and fine-tuned for specific tasks.

To this end, we use **SentenceTransformers**, a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084). The model we use maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The fine-tuned model is taken from the [HuggingFace model repository](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). In short, the pretrained `nreimers/MiniLM-L6-H384-uncased` model is fine-tuned on a dataset of 1B sentence pairs. A self-supervised contrastive learning objective is used to fine-tune the **BERT encoder**: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset.
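
A contrastive objective of this kind can be sketched as a cross-entropy loss over similarity scores. The sketch below makes two assumptions that are not taken from the text above: the other sentences in the same batch play the role of the randomly sampled candidates, and the cosine similarities are multiplied by a hypothetical `scale` of 20 before the softmax. The toy two-dimensional embeddings are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchors[i] should pick positives[i] among all positives."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        log_normalizer = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_normalizer - scores[i])  # negative log-softmax of the true pair
    return sum(losses) / len(losses)


# Hypothetical sentence embeddings: each anchor is close to its own partner
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(anchors, aligned)
loss_swapped = contrastive_loss(anchors, swapped)
```

The loss is small when every sentence is most similar to its true partner and large when the pairs are mismatched, which is what pushes the encoder to place paired sentences close together.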

### Intended Use
It is important to note that this model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks. The output embeddings are tuned for similarity and clustering tasks and are therefore not fully tailored to our task of predicting returns from SEC filings.

Nevertheless, as a pedagogical example, it will be interesting to show how a model like this can be adopted in our use case even though it is not specifically fine-tuned using financial text data.
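
Because the encoder targets short inputs, a long filing section has to be split into chunks of at most the maximum sequence length, and the chunk outputs then need to be combined into one vector per section. One possible way to do this, sketched here in plain Python, is a masked mean over all real (non-padding) tokens across all chunks of a section. The nested-list layout and the `masked_mean_pool` helper are hypothetical simplifications of the tensor-based pooling used in this notebook.

```python
def masked_mean_pool(chunk_embeddings, chunk_masks):
    """Average token vectors over all chunks, skipping padded positions.

    chunk_embeddings: list of chunks, each a list of token vectors
    chunk_masks: list of chunks, each a list of 1 (real token) or 0 (padding)
    """
    dim = len(chunk_embeddings[0][0])
    total = [0.0] * dim
    count = 0
    for tokens, mask in zip(chunk_embeddings, chunk_masks):
        for vector, keep in zip(tokens, mask):
            if keep:
                total = [t + x for t, x in zip(total, vector)]
                count += 1
    # Guard against division by zero for an all-padding input
    return [t / max(count, 1) for t in total]


# Hypothetical section split into two chunks; the last position is padding
chunk_embeddings = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [9.0, 9.0]],
]
chunk_masks = [[1, 1], [1, 0]]

section_vector = masked_mean_pool(chunk_embeddings, chunk_masks)
print(section_vector)  # [3.0, 4.0]
```

Pooling over tokens rather than over chunk averages means that a short final chunk does not get the same weight as a full one.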

---

In [22]:
import tensorflow as tf

In [23]:
from transformers import AutoTokenizer, TFAutoModel

# this model is fine tuned on nreimers/MiniLM-L6-H384-uncased pre-trained model
model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt, use_fast=True)
sbert_model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

MAX_LENGTH = tokenizer.max_len_single_sentence
BATCH_SIZE = 16

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [24]:
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask, sample_map):
    batch_result = []
    token_embeddings = model_output[0] # First element of model_output contains all token embeddings
    input_mask_expanded = tf.expand_dims(tf.cast(attention_mask, tf.float32), -1)
    # Perform mean pooling over chunks of each sequence
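    sample_map = np.asarray(sample_map)
    # Loop over the samples actually present; the last batch may hold fewer than BATCH_SIZE
    for i in range(int(sample_map.max()) + 1):
        chunk_idx = np.where(sample_map == i)[0]
        s_idx, e_idx = chunk_idx[0], chunk_idx[-1] + 1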
        sum_embeddings = tf.reduce_sum(token_embeddings[s_idx:e_idx] * input_mask_expanded[s_idx:e_idx], axis=[0, 1])
        sum_mask = tf.clip_by_value(tf.reduce_sum(input_mask_expanded[s_idx:e_idx]), clip_value_min=1e-7, clip_value_max=np.inf)
        batch_result += [sum_embeddings / sum_mask]
        
    return tf.stack(batch_result)

def get_embeddings(data, col_name):
    encoded_input = tokenizer(
        data[col_name],
        padding=True,
        truncation=True,
        max_length=MAX_LENGTH,
        return_overflowing_tokens=True,  # Large sequences are chunked into arrays of max_length
        return_tensors="tf"
    )

    # Identifies what sequence chunk belongs to which sample
    sample_map = encoded_input.pop("overflow_to_sample_mapping")
    attention_mask = encoded_input["attention_mask"]
    model_output = sbert_model(**encoded_input)
    
    return {
        "cik": data["cik"],
        "year": data["year"],
        "section": data["section"],
        f"{col_name}_embedding": mean_pooling(model_output, attention_mask, sample_map)
    }

In [25]:
# Concatenate all sections of raw SEC filings in one column to be fed to the transformer model
wrapped = wrap(train, SECTIONS_TO_PARSE, new_name='raw_sections')
train = datasets.concatenate_datasets(wrapped, axis=0)

wrapped = wrap(test, SECTIONS_TO_PARSE, new_name='raw_sections')
test = datasets.concatenate_datasets(wrapped, axis=0)

In [29]:
print(f"Generate embeddings for SEC filings data using a sentence transformer for {sorted(set(train['year']))} as train and {set(test['year'])} as test data.")

Generate embeddings for SEC filings data using a sentence transformer for ['2010', '2011', '2012', '2013', '2014', '2015'] as train and {'2016'} as test data.


Computing embeddings with the sentence transformer on a local machine would take a long time for the entire SEC dataset. We postpone the performance analysis of the sentence transformer to the 3rd notebook, where we select a subset of 750 companies to train and test a booster model.