## SEC Sentiment Analysis

In this notebook, we'll show how to use the [`unstructured` core library](https://unstructured-io.github.io/unstructured/) and the [SEC pipelines API](https://github.com/Unstructured-IO/pipeline-sec-filings) to train a sentiment analysis model using content from the risk factors section of 10-K filings. To train and use the sentiment analysis model, we'll perform the following steps:

1. [Grab 10-K filings from EDGAR](#get-filings)
1. [Extract the Risk Factors section using the SEC pipelines API](#extract-narrative)
1. [Use a staging brick to stage the data for a labeling task in LabelStudio](#stage-label-studio)
1. [Train a sentiment analysis model with Hugging Face](#train)
1. [Use a staging brick to chunk input for the attention window of the sentiment analysis model](#chunk)

### Grab 10-K filings from EDGAR <a id="get-filings"></a>

The first step is to pull documents from EDGAR, the SEC's filing system. To do this, we'll make a few API calls based on the ticker symbols of publicly traded companies and save the files locally in a directory called `xbrl-forms`. Filings in EDGAR are in XML format and use a standard called [XBRL](https://www.xbrl.org/the-standard/what/ixbrl/).

In [None]:
import os
from fetch import get_form_by_ticker

In [None]:
tickers = [
    'ehc', 'mrk', 'nke', 'msex', 'v', 'cvs', 'doc', 'smtc', 'cl',
    'ava', 'bc', 'f', 'lmt', 'cri', 'aig', 'rgld', 'apld', 'omcl',
    'mmm', 'bgs', 'dis', 'wetg', 'bj',
]

In [None]:
cwd = os.getcwd()
data_directory = os.path.join(cwd, "xbrl-forms")
# Create the output directory if it does not already exist
os.makedirs(data_directory, exist_ok=True)

If you're running this notebook yourself, update the `company` and `email` arguments so that you correctly identify yourself to the SEC API.

In [None]:
forms = []
for ticker in tickers: 
    form_text = get_form_by_ticker(
        ticker=ticker,
        form_type="10-K",
        company="Unstructured Technologies",
        email="support@unstructured.io"
    )
    
    filename = os.path.join(data_directory, f"{ticker}-10k.xbrl")
    with open(filename, "w") as f:
        f.write(form_text)
    print(".", end="")

.......................
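The `company` and `email` arguments matter because the SEC asks automated clients to identify themselves in the `User-Agent` header of each request. The sketch below shows one way such a header could be built. `build_sec_headers` is a hypothetical helper written for illustration; it is not part of the `fetch` module used above, and the `"<company> <email>"` format is an assumption based on the SEC's guidance for automated access.

```python
def build_sec_headers(company: str, email: str) -> dict:
    """Build request headers that identify the caller to EDGAR.

    Hypothetical helper: assumes a "<company> <email>" User-Agent string.
    """
    if not company.strip() or "@" not in email:
        raise ValueError("Provide a company name and a valid contact email.")
    return {"User-Agent": f"{company.strip()} {email.strip()}"}


# Example values only; replace them with your own organization and contact.
headers = build_sec_headers("Example Corp", "data-team@example.com")
print(headers["User-Agent"])
```

Validating the inputs up front gives a clear error locally instead of a rejected request later.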

### Extract the Risk Factors Narrative <a id="extract-narrative"></a>

Next, we'll extract the risk factors section by submitting the documents to the Unstructured SEC pipelines API. The SEC pipelines API accepts documents in XBRL format, finds the requested section, and returns its contents as JSON. You can learn more about the SEC pipelines API [here](https://github.com/Unstructured-IO/pipeline-sec-filings).

In [None]:
import requests
import time
from fetch import get_version

In [None]:
version = get_version()
url = f"https://api.unstructured.io/sec-filings/v{version}/section"
print(url)

https://api.unstructured.io/sec-filings/v0.2.0/section


In [None]:
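risk_factors = dict()
for ticker in tickers:
    filename = os.path.join(data_directory, f"{ticker}-10k.xbrl")
    # Use a context manager so the file handle is closed after the upload
    with open(filename, "rb") as f:
        response = requests.post(
            url,
            files={"text_files": f},
            data={"section": ["RISK_FACTORS"]},
        )
    response.raise_for_status()
    risk_factors[ticker] = response.json()["RISK_FACTORS"]
    time.sleep(1)  # pause between requests to go easy on the API
    print(".", end="")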

.......................

### Stage for LabelStudio <a id="stage-label-studio"></a>

The next step is to label our data for the sentiment analysis model. To do that, we'll use [LabelStudio](https://labelstud.io/). The `unstructured` core library lets us easily prepare data for upload to LabelStudio using the [`stage_for_label_studio`](https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-studio) staging brick. In this section, we'll format the data for upload to LabelStudio, and also use an off-the-shelf sentiment analysis model to pre-annotate the data. 

In [None]:
from unstructured.staging.base import isd_to_elements

In [None]:
elements = []
for sections in risk_factors.values():
    elements.extend(isd_to_elements(sections))
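Before staging, it can help to check how much text came back for each company. The sketch below assumes each value in `risk_factors` is a list of dictionaries with a `"text"` key, which is the shape `isd_to_elements` consumes. The sample data and the `summarize_sections` helper are hypothetical and only illustrate that shape.

```python
def summarize_sections(sections_by_ticker):
    """Return {ticker: (paragraph_count, total_characters)}."""
    summary = {}
    for ticker, paragraphs in sections_by_ticker.items():
        texts = [paragraph.get("text", "") for paragraph in paragraphs]
        summary[ticker] = (len(texts), sum(len(text) for text in texts))
    return summary


# Hypothetical stand-in for the `risk_factors` dictionary built above.
sample_risk_factors = {
    "aaa": [
        {"type": "NarrativeText", "text": "Demand may fall."},
        {"type": "NarrativeText", "text": "Costs may rise."},
    ],
    "bbb": [],  # an empty list would mean no section text was returned
}

summary = summarize_sections(sample_risk_factors)
for ticker, (count, characters) in summary.items():
    print(f"{ticker}: {count} paragraphs, {characters} characters")
```

A ticker with zero paragraphs is worth inspecting by hand before it silently drops out of the training set.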

In [None]:
from transformers import pipeline

In [None]:
model = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_pipeline = pipeline(model=model)

In [None]:
from unstructured.staging.label_studio import (
    stage_for_label_studio,
    LabelStudioAnnotation,
    LabelStudioResult,
)

In this step, we apply the off-the-shelf sentiment analysis model to pre-annotate our data. Once the data is uploaded to LabelStudio, you'll see the model outputs applied as the default labels. Feel free to update the labels as appropriate in the LabelStudio UI.

In [None]:
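annotations = []
for i, element in enumerate(elements):
    # truncation=True keeps long paragraphs within the model's maximum input length
    inference = sentiment_pipeline(element.text, truncation=True)
    # The model returns "POSITIVE"/"NEGATIVE"; .title() yields "Positive"/"Negative"
    result = [
        LabelStudioResult(
            type="choices",
            value={"choices": [inference[0]["label"].title()]},
            from_name="sentiment",
            to_name="text",
        )
    ]
    annotations.append([LabelStudioAnnotation(result=result)])
    # Print a progress dot once every 40 elements
    if i % 40 == 1:
        print(".", end="")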

...............................................................

The `stage_for_label_studio` function formats the outputs for upload to LabelStudio. After we save the results as JSON, we can create a new project in LabelStudio and upload the training examples.

In [None]:
label_studio_data = stage_for_label_studio(
    elements=elements,
    annotations=annotations,
)

In [None]:
import json

In [None]:
with open("sec-sentiment-analysis.json", "w") as f:
    json.dump(label_studio_data, f, indent=4)
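Before uploading, a quick look at the distribution of pre-annotated labels can show whether the off-the-shelf model leans heavily toward one class. The sketch below is a minimal example that assumes the tasks follow LabelStudio's import format (a `data` entry plus a list of `annotations`, each with a `result` list). The sample tasks and the `count_preannotations` helper are hypothetical and are not output from this notebook.

```python
import json
from collections import Counter


def make_task(text, choice):
    """Build one hypothetical task in the assumed LabelStudio import format."""
    return {
        "data": {"text": text},
        "annotations": [{"result": [{
            "type": "choices",
            "value": {"choices": [choice]},
            "from_name": "sentiment",
            "to_name": "text",
        }]}],
    }


def count_preannotations(tasks):
    """Count how often each choice appears across all pre-annotations."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                counts.update(result["value"]["choices"])
    return counts


sample_tasks = [
    make_task("Our revenue may decline.", "Negative"),
    make_task("We may face litigation.", "Negative"),
    make_task("We have a strong balance sheet.", "Positive"),
]

# Round-trip through JSON, as the notebook does when it writes the file.
tasks = json.loads(json.dumps(sample_tasks))
label_counts = count_preannotations(tasks)
print(label_counts)
```

To run the same check on the real file, load `sec-sentiment-analysis.json` with `json.load` and pass the result to `count_preannotations`.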

#### NOTE: Transition to LabelStudio

At this point you should upload your data set to LabelStudio using the instructions in the [LabelStudio docs](https://labelstud.io/guide/tasks.html#Import-data-from-the-Label-Studio-UI). For the sentiment analysis model, choose the "Text Classification" template for your project. The JSON from this notebook will include annotations already, but you can improve the model by doing some additional labeling yourself. In the next step, we'll export the labeled data for model training.

### Train a Sentiment Model <a id="train"></a>

After labeling the data, we're ready to train the sentiment analysis model using the Hugging Face `transformers` library. Check out the [Hugging Face documentation](https://huggingface.co/blog/sentiment-analysis-python) for more information on how to train models in `transformers`.

The first step is to export the labeled data from LabelStudio. When you export the data, select the JSON-Min format. Once that's done, we'll convert the exported records to a Hugging Face `Dataset` object so that they can be used in the model training pipeline.

In [None]:
from datasets import Dataset

In [None]:
with open("sec-sentiment-analysis-labeled.json", "r") as f:
    training_data = json.load(f)

In [None]:
datasets_data = Dataset.from_dict({
    "text": [item["text"] for item in training_data],
    "label": [0 if item["sentiment"] == "Negative" else 1 
             for item in training_data]
})
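Note that the comprehension above maps every value other than `"Negative"` to 1, so a misspelled or unexpected label would silently be counted as positive. The sketch below shows a stricter alternative that fails loudly instead. `encode_labels` and the sample records are hypothetical; the `"text"` and `"sentiment"` keys follow the export format used above.

```python
LABEL_TO_ID = {"Negative": 0, "Positive": 1}


def encode_labels(records):
    """Convert exported records to columns, rejecting unknown labels."""
    texts, labels = [], []
    for index, record in enumerate(records):
        sentiment = record.get("sentiment")
        if sentiment not in LABEL_TO_ID:
            raise ValueError(
                f"Record {index} has unexpected sentiment: {sentiment!r}"
            )
        texts.append(record["text"])
        labels.append(LABEL_TO_ID[sentiment])
    return {"text": texts, "label": labels}


# Hypothetical records standing in for the LabelStudio export.
sample_records = [
    {"text": "We may be unable to refinance our debt.", "sentiment": "Negative"},
    {"text": "We maintain strong liquidity.", "sentiment": "Positive"},
]

encoded = encode_labels(sample_records)
print(encoded["label"])  # prints [0, 1]
```

The returned dictionary has the same two columns as the one passed to `Dataset.from_dict` above, so it could be used in its place.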

Next, we'll read in our base model and tokenizer. For this example, we'll fine-tune the `distilbert-base-uncased` model using our labels from LabelStudio. Check out [this list](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) of models from Hugging Face if you want to try fine-tuning a different base model.

In [None]:
model_name = "distilbert-base-uncased"

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2
)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_train = datasets_data.map(preprocess_function, batched=True)

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
from transformers import Trainer

In [None]:
trainer = Trainer(
    model=model,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2499
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 939
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.2334


Saving model checkpoint to tmp_trainer/checkpoint-500
Configuration saved in tmp_trainer/checkpoint-500/config.json
Model weights saved in tmp_trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in tmp_trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in tmp_trainer/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=939, training_loss=0.1685810637550232, metrics={'train_runtime': 1125.0908, 'train_samples_per_second': 6.663, 'train_steps_per_second': 0.835, 'total_flos': 489859524447516.0, 'train_loss': 0.1685810637550232, 'epoch': 3.0})
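The step counts in the training log follow directly from the dataset size and batch size. The short calculation below reproduces them from the numbers printed above. The `save_steps` value of 500 is an assumption (the default checkpoint interval when no `TrainingArguments` are passed), which would explain why only `checkpoint-500` was written.

```python
import math

num_examples = 2499  # "Num examples" in the log above
batch_size = 8       # "Total train batch size" in the log above
num_epochs = 3       # "Num Epochs" in the log above
save_steps = 500     # assumed default checkpoint interval

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Checkpoints are written at every multiple of save_steps up to the final step.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoint_steps)  # prints 313 939 [500]
```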

Now that the model is trained, we'll save it locally so we can use it for inference. Hugging Face users can also upload the model to a remote model repository.

In [None]:
trainer.save_model("sec-sentiment-model")

Saving model checkpoint to sec-sentiment-model
Configuration saved in sec-sentiment-model/config.json
Model weights saved in sec-sentiment-model/pytorch_model.bin
tokenizer config file saved in sec-sentiment-model/tokenizer_config.json
Special tokens file saved in sec-sentiment-model/special_tokens_map.json


In [None]:
sec_sentiment_model = pipeline(
    task="sentiment-analysis",
    model="./sec-sentiment-model",
)

loading configuration file ./sec-sentiment-model/config.json
Model config DistilBertConfig {
  "_name_or_path": "./sec-sentiment-model",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

In [None]:
elements[0].text

'Our business, operations, and financial position are subject to various risks. Some of these risks are described below, and the reader should take such risks into account in evaluating Encompass Health or any investment decision involving Encompass Health. This section does not describe all risks that may be applicable to us, our industry, or our business, and it is intended only as a summary of material risk factors. More detailed information concerning other risks and uncertainties as well as those described below is contained in other sections of this annual report. Still other risks and uncertainties we have not or cannot foresee as material to us may also adversely affect us in the future. If any of the risks below or other risks or uncertainties discussed elsewhere in this annual report are actually realized, our business and financial condition, results of operations, and cash flows could be adversely affected. In the event the impact is materially adverse, the trading price of

In [None]:
sec_sentiment_model(elements[0].text)

[{'label': 'LABEL_0', 'score': 0.9983240962028503}]
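The model reports `LABEL_0` because no label names were stored in its configuration. Since the training data encoded `"Negative"` as 0 and everything else as 1, `LABEL_0` corresponds to negative sentiment. The sketch below is one simple way to translate the generic names after inference; `ID_TO_SENTIMENT` and `to_readable` are hypothetical names introduced for illustration.

```python
# Mirrors the encoding used when building the training dataset above.
ID_TO_SENTIMENT = {"LABEL_0": "Negative", "LABEL_1": "Positive"}


def to_readable(predictions):
    """Replace generic label names with sentiment names, keeping scores."""
    return [
        {"label": ID_TO_SENTIMENT[prediction["label"]], "score": prediction["score"]}
        for prediction in predictions
    ]


# The prediction shown above for the first element.
predictions = [{"label": "LABEL_0", "score": 0.9983240962028503}]
readable = to_readable(predictions)
print(readable)  # prints [{'label': 'Negative', 'score': 0.9983240962028503}]
```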

### Stage for Transformers <a id="chunk"></a>

Finally, we're ready to use our trained sentiment analysis model. The model can only attend to a limited number of tokens at once (the config above lists `max_position_embeddings` as 512), so longer passages need to be split up first. To help, we'll apply our [`stage_for_transformers`](https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers) brick, which chunks text based on the size of the attention window. In this case, we'll take the first ten paragraphs we received back from the SEC API and chunk them into two text snippets that fit into the attention window for the sentiment analysis model.

In [None]:
from unstructured.staging.huggingface import stage_for_transformers

In [None]:
chunked_text = stage_for_transformers(elements[:10], tokenizer)

In [None]:
results = sec_sentiment_model(chunked_text)

Disabling tokenizer parallelism, we're using DataLoader multithreading already


In [None]:
results

[{'label': 'LABEL_0', 'score': 0.9982876181602478},
 {'label': 'LABEL_0', 'score': 0.9985221028327942}]
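To make the idea of chunking concrete, here is a minimal sketch of greedy chunking under a token budget. It is not the implementation of `stage_for_transformers`: `chunk_by_token_budget` is a hypothetical function, and counting whitespace-separated words is a stand-in for a real tokenizer.

```python
def chunk_by_token_budget(paragraphs, max_tokens, count_tokens=None):
    """Greedily pack paragraphs into chunks of at most max_tokens tokens."""
    if count_tokens is None:
        # Stand-in for a real tokenizer: count whitespace-separated words.
        count_tokens = lambda text: len(text.split())

    chunks, current, current_length = [], [], 0
    for paragraph in paragraphs:
        length = count_tokens(paragraph)
        if length > max_tokens:
            raise ValueError("A single paragraph exceeds the token budget.")
        # Start a new chunk when adding this paragraph would overflow the budget.
        if current and current_length + length > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_length = [], 0
        current.append(paragraph)
        current_length += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks


paragraphs = ["one two three", "four five", "six seven eight nine", "ten"]
chunks = chunk_by_token_budget(paragraphs, max_tokens=5)
print(len(chunks))  # prints 2
```

Packing whole paragraphs keeps each chunk readable, at the cost of rejecting any single paragraph that is longer than the budget.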