# SEC Filing Section Pipeline

This notebook defines the pipeline for extracting narrative text sections
from 10-K, 10-Q, and S-1 filings. It contains both exploration code and
the code that defines the API. Code cells marked with `# pipeline-api`
are included in the API definition.
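
The marker convention can be illustrated with a minimal sketch. This is not the tooling that actually builds the API; it only assumes that a notebook is stored as JSON with a `cells` list and that a cell is selected when its first line is the marker comment. The function name `collect_pipeline_cells` and the example notebook content are hypothetical.

```python
import json
from typing import List

PIPELINE_MARKER = "# pipeline-api"  # marker comment used by the code cells in this notebook


def collect_pipeline_cells(notebook_json: str) -> List[str]:
    """Return the source of every code cell whose first line is the marker."""
    notebook = json.loads(notebook_json)
    selected = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores a cell's source as one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        lines = source.strip().splitlines()
        if lines and lines[0].strip() == PIPELINE_MARKER:
            selected.append(source)
    return selected


# Tiny in-memory notebook (hypothetical content) to exercise the function.
example_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["# pipeline-api\n", "import re"]},
        {"cell_type": "code", "source": ["print('exploration only')"]},
    ]
})

pipeline_cells = collect_pipeline_cells(example_notebook)
print(len(pipeline_cells))
```

Only the second cell is selected here; the exploration cell and the markdown cell are skipped.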

To demonstrate how off-the-shelf Unstructured Bricks extract
meaningful data from complex source documents, we will apply
a series of Bricks with explanations before defining the API.

#### Table of Contents

1. [Pulling in Raw Documents](#raw)
1. [Reading the Document](#reading)
1. [Custom Partitioning Bricks](#custom)
1. [Cleaning Bricks](#cleaning)
1. [Staging Bricks](#staging)
1. [Define the Pipeline API](#pipeline)

## Section 1: Pulling in Raw Documents <a id="raw"></a>

In [None]:
from prepline_sec_filings.fetch import (
    get_form_by_ticker, open_form_by_ticker
)

text = get_form_by_ticker(
    'rgld', 
    '10-K', 
    company='Unstructured Technologies', 
    email='support@unstructured.io'
)

In [None]:
print(text[1375:3284])

```json
[
  {
    "text": "You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD&A.",
    "type": "NarrativeText"
  },
  {
    "text": "Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.",
    "type": "NarrativeText"
  },
  {
    "text": "Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, forward sales by metal producers, and political, trade, economic, or banking conditions.",
    "type": "NarrativeText"
  },
```

## Section 2: Reading the Document <a id="reading"></a>

In [None]:
from unstructured.documents.html import HTMLDocument

html_document = HTMLDocument.from_string(text).doc_after_cleaners(
    skip_headers_and_footers=True,
    skip_table_text=True,
)

In [None]:
for element in html_document.pages[0].elements[71:75]:
    print(element)
    print("\n")

In [None]:
html_document.pages[0].elements[71:75]

In [None]:
from unstructured.nlp.partition import is_possible_title

is_possible_title("Regulation")

In [None]:
is_possible_title("""Operators of the mines that are subject to our 
stream and royalty interests must comply with numerous environmental, 
mine safety, land use, waste disposal, remediation and public health 
laws and regulations promulgated by federal, state, provincial and 
local governments in the United States, Canada, Chile, the Dominican 
Republic, Ghana, Mexico, Botswana, Australia and other countries where 
we hold interests. Although we, as a stream or royalty interest owner, 
are not""")

In [None]:
from unstructured.nlp.partition import is_possible_narrative_text

is_possible_narrative_text("Regulation")

In [None]:
is_possible_narrative_text("""Operators of the mines that are subject to our 
stream and royalty interests must comply with numerous environmental, 
mine safety, land use, waste disposal, remediation and public health 
laws and regulations promulgated by federal, state, provincial and 
local governments in the United States, Canada, Chile, the Dominican 
Republic, Ghana, Mexico, Botswana, Australia and other countries where 
we hold interests. Although we, as a stream or royalty interest owner, 
are not""")

## Section 3: Custom Partitioning Bricks <a id="custom"></a>

In [None]:
import re
from unstructured.documents.elements import Title

In [None]:
ITEM_TITLE_RE = re.compile(
    r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?"
)

In [None]:
def is_10k_item_title(title: str) -> bool:
    """Determines if a title corresponds to a 10-K item heading."""
    return ITEM_TITLE_RE.match(title) is not None
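
To see what the pattern accepts without loading a filing, the check can be run on a few strings. This is a small self-contained sketch that repeats the regex and function from the cells above; the sample headings are made up for illustration and are not taken from the filing.

```python
import re

# Same pattern as above: "item", a 1-3 digit number, an optional letter
# suffix such as "A" or "(a)", then an optional period and an optional colon.
ITEM_TITLE_RE = re.compile(
    r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?"
)


def is_10k_item_title(title: str) -> bool:
    """Determines if a title corresponds to a 10-K item heading."""
    return ITEM_TITLE_RE.match(title) is not None


# Illustrative sample strings (hypothetical).
samples = [
    "Item 1A. Risk Factors",
    "ITEM 7: Management's Discussion and Analysis",
    "Item 1(a)",
    "See Item 1A for details",
    "Items 1 and 2",
    "Regulation",
]

results = {sample: is_10k_item_title(sample) for sample in samples}
for sample, matched in results.items():
    print(f"{matched!s:5} {sample}")
```

Because `re.match` anchors only at the start of the string, any text that merely begins with "Item 1A" also matches, even a full sentence. That is one reason the loop below also requires the element to be a `Title`.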

In [None]:
for element in html_document.elements:
    if isinstance(element, Title) and is_10k_item_title(element.text):
        print(element)

## Section 4: Cleaning Bricks <a id="cleaning"></a>

In [None]:
from unstructured.cleaners.core import clean


def clean_sec_docs(text: str) -> str:
    """Normalizes whitespace, dashes, and trailing punctuation in filing text."""
    return clean(text, extra_whitespace=True, dashes=True, trailing_punctuation=True)
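
For readers without the library installed, the effect of the three flags can be approximated with the standard library. The sketch below is a rough stand-in, not the `unstructured` implementation: the order of the steps and the exact character sets are assumptions, and `approximate_clean` is a hypothetical name.

```python
import re


def approximate_clean(text: str) -> str:
    """Rough stand-in for cleaning with whitespace, dash, and punctuation flags."""
    # Strip surrounding whitespace, then trailing punctuation (assumed set: . , : ;).
    cleaned = text.strip().rstrip(".,:;")
    # Replace hyphens and en dashes with spaces (assumed behavior).
    cleaned = re.sub(r"[-\u2013]", " ", cleaned)
    # Treat non-breaking spaces and newlines as spaces, then collapse runs of spaces.
    cleaned = cleaned.replace("\xa0", " ").replace("\n", " ")
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.strip()


# Hypothetical heading with a non-breaking space, an en dash, and a trailing colon.
raw_title = "ITEM 1A.\xa0  RISK FACTORS \u2013 SUMMARY:"
cleaned_title = approximate_clean(raw_title)
print(repr(cleaned_title))
```

Under these assumptions, dash replacement also affects tokens such as "10-K", so cleaning is best applied to display text rather than to identifiers.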

In [None]:
for element in html_document.elements:
    element.text = clean_sec_docs(element.text)
    if isinstance(element, Title) and is_10k_item_title(element.text):
        print(element)

In [None]:
# pipeline-api
from prepline_sec_filings.sections import (
    section_string_to_enum, validate_section_names, SECSection
)
from prepline_sec_filings.sec_document import (
    SECDocument, REPORT_TYPES, VALID_FILING_TYPES
)

In [None]:
sec_document = SECDocument.from_string(text)
risk_narrative = sec_document.get_section_narrative(SECSection.RISK_FACTORS)

In [None]:
for element in risk_narrative[:3]:
    print(element)
    print("\n")

## Section 5: Staging Bricks <a id="staging"></a>

In [None]:
from unstructured.staging.label_studio import stage_for_label_studio

In [None]:
label_studio_data = stage_for_label_studio(risk_narrative)
label_studio_data[:5]

## Section 6: Define the Pipeline API <a id="pipeline"></a>

In [None]:
# pipeline-api
from unstructured.staging.base import convert_to_isd
from prepline_sec_filings.sections import ALL_SECTIONS, SECTIONS_10K, SECTIONS_S1

In [None]:
# pipeline-api
def pipeline_api(text, m_section=[]):
    """Extracts the narrative text for each requested section, such as
    RISK_FACTORS or MANAGEMENT_DISCUSSION. Pass ["_ALL"] to extract every
    section supported for the document's filing type."""
    validate_section_names(m_section)
    
    sec_document = SECDocument.from_string(text)
    if sec_document.filing_type not in VALID_FILING_TYPES:
        raise ValueError(
            f"SEC document filing type {sec_document.filing_type} is not supported, "
            f"must be one of {','.join(VALID_FILING_TYPES)}"
        )
    results = {}
    if m_section == [ALL_SECTIONS]:
        if sec_document.filing_type in REPORT_TYPES:
            m_section = [enum.name for enum in SECTIONS_10K]
        else:
            m_section = [enum.name for enum in SECTIONS_S1]
    for section in m_section:
        results[section] = sec_document.get_section_narrative(
            section_string_to_enum[section])
    return {
        section: convert_to_isd(section_narrative)
        for section, section_narrative in results.items()
    }
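
The section-selection branch of `pipeline_api` can be exercised on its own. The sketch below mirrors that branch under stated assumptions: the `"_ALL"` sentinel is taken from the call later in this notebook, while `resolve_sections`, the shortened section lists, and the contents of `REPORT_TYPES` are hypothetical stand-ins for the values imported from `prepline_sec_filings`.

```python
from typing import List

ALL_SECTIONS = "_ALL"  # sentinel passed as ["_ALL"] later in this notebook

# Hypothetical, shortened stand-ins for the imported constants.
REPORT_TYPES = ["10-K", "10-Q"]
SECTION_NAMES_10K = ["BUSINESS", "RISK_FACTORS", "MANAGEMENT_DISCUSSION"]
SECTION_NAMES_S1 = ["PROSPECTUS_SUMMARY", "RISK_FACTORS", "USE_OF_PROCEEDS"]


def resolve_sections(requested: List[str], filing_type: str) -> List[str]:
    """Expand the ["_ALL"] sentinel into the section names for the filing type."""
    if requested == [ALL_SECTIONS]:
        if filing_type in REPORT_TYPES:
            return list(SECTION_NAMES_10K)
        return list(SECTION_NAMES_S1)
    # Any other request is passed through unchanged.
    return list(requested)


print(resolve_sections(["_ALL"], "10-K"))
print(resolve_sections(["_ALL"], "S-1"))
print(resolve_sections(["RISK_FACTORS"], "10-K"))
```

As in `pipeline_api`, the sentinel is expanded only when it is the sole entry in the list; mixing it with other names leaves the request as-is.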

In [None]:
risk_narrative = pipeline_api(text, ["RISK_FACTORS"])["RISK_FACTORS"]
risk_narrative[:5]

In [None]:
all_narratives = pipeline_api(text, ["_ALL"])
for section, elems in all_narratives.items():
    print(section)
    print(elems[:4])
    print("---------------")