# Developer Guide: Comprehensive Overview

Welcome to the developer guide for `sec-parser`. It gives an in-depth view of the project, whether you are a new developer looking to contribute or an experienced one looking to use its capabilities. The guide walks through the codebase, explains the key components and how they interact, and provides examples to help you get started.

This guide is interactive: you can run and modify every code example shown here by cloning the repository and opening [developer_guide.ipynb](https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/developer_guide.ipynb) in a Jupyter notebook.

Alternatively, you can run the notebook directly in your browser using a cloud-based Jupyter environment:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/developer_guide.ipynb)
[![My Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/alphanome-ai/sec-parser/main?filepath=docs/source/notebooks/developer_guide.ipynb)

Let's dive in!

## Utilizing BeautifulSoup for Parsing
Many SEC EDGAR filings are available as HTML documents. To make these documents easier to read programmatically, we use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ("bs4") library to parse an HTML document into a tree of HTML tags (`bs4.Tag`).

## Understanding the Role of HtmlTag
Instead of interacting directly with `bs4.Tag`, the SEC EDGAR HTML Parser uses `HtmlTag`, a wrapper around `bs4.Tag`.

In [2]:
from sec_parser.processing_engine import HtmlTag, HtmlTagParser

print(HtmlTag.__doc__)
print(HtmlTagParser.__doc__)


    The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

    It serves three main purposes:

    1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we
       can isolate our application logic from the library specifics. This
       makes it easier to modify or even replace the HTML parsing library in
       the future without extensive codebase changes.

    2. Usability: The HtmlTag class provides a convenient location to add
       extension methods or additional properties not offered by the native
       BeautifulSoup4 Tag class. This enhances the usability of the class.

    3. Caching: The HtmlTag class also caches processing results, improving
       performance by avoiding unnecessary re-computation.
    

    The HtmlTagParser parses an HTML document using BeautifulSoup4.
    It then wraps the parsed bs4.Tag objects into HtmlTag objects.
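
The three purposes listed in the `HtmlTag` docstring can be illustrated with a minimal sketch. It does not use the real `HtmlTag` or BeautifulSoup: `FakeTag`, `WrappedTag`, and `contains_tag` are hypothetical names, and the sketch assumes only that the wrapped object exposes a name and a list of children.

```python
from dataclasses import dataclass, field
from functools import cached_property


@dataclass
class FakeTag:
    """Hypothetical stand-in for a third-party tag object such as bs4.Tag."""

    name: str
    children: list["FakeTag"] = field(default_factory=list)


class WrappedTag:
    """Hypothetical wrapper; not the real sec_parser HtmlTag."""

    def __init__(self, tag: FakeTag) -> None:
        # Decoupling: callers only ever see WrappedTag, never FakeTag.
        self._tag = tag

    @property
    def name(self) -> str:
        return self._tag.name

    # Usability: an extension method the underlying tag type does not offer.
    def contains_tag(self, name: str) -> bool:
        return any(t.name == name for t in self._descendants)

    # Caching: the descendant walk runs once per wrapper and is then reused.
    @cached_property
    def _descendants(self) -> list[FakeTag]:
        found, stack = [], list(self._tag.children)
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(node.children)
        return found


div = WrappedTag(FakeTag("div", children=[FakeTag("p", children=[FakeTag("b")])]))
has_bold = div.contains_tag("b")
has_table = div.contains_tag("table")
```

Because callers depend only on `WrappedTag`, the stand-in `FakeTag` could be swapped for another tag type without touching the calling code.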
    


## Defining Semantic Elements

In [3]:
from sec_parser.semantic_elements import AbstractSemanticElement as SemanticElement

print(SemanticElement.__doc__)


    In the domain of HTML parsing, especially in the context of SEC EDGAR documents,
    a semantic element refers to a meaningful unit within the document that serves a
    specific purpose. For example, a paragraph or a table might be considered a
    semantic element. Unlike syntactic elements, which merely exist to structure the
    HTML, semantic elements carry information that is vital to the understanding of the
    document's content.

    This class serves as a foundational representation of such semantic elements,
    containing an HtmlTag object that stores the raw HTML tag information. Subclasses
    will implement additional behaviors based on the type of the semantic element.
    


A few examples of Semantic Elements:

In [4]:
from sec_parser.semantic_elements import (
    TextElement,
    TableElement,
    TitleElement,
    TopLevelSectionStartMarker,
    UndeterminedElement,
)

print(TextElement.__doc__)
print(TableElement.__doc__)
print(TitleElement.__doc__)
print(TopLevelSectionStartMarker.__doc__)
print(UndeterminedElement.__doc__)

The TextElement class represents a standard text paragraph within a document.
The TableElement class represents a standard table within a document.

    The TitleElement class represents the title of a paragraph or other content object.
    It serves as a semantic marker, providing context and structure to the document.
    

    The TopLevelSectionStartMarker class represents the beginning of a top-level
    section of a document. For instance, in SEC 10-Q reports, a
    top-level section could be "Part I, Item 3. Quantitative and Qualitative
    Disclosures About Market Risk.".
    

    The UndeterminedElement class represents an element whose type
    has not yet been determined. The parsing process aims to
    transform all instances of this class into more specific
    subclasses of AbstractSemanticElement.
    


To summarize, the purpose of parsing is to produce an ordered list of Semantic Elements from a tree of HTML Tags.

## The Relationship Between SemanticElement and HtmlTag

For simplicity, each `SemanticElement` corresponds to exactly one `HtmlTag`. To create a `SemanticElement` object, you pass a single `HtmlTag` object when creating it.
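
As a rough illustration of the one-tag-per-element idea, the sketch below turns the top-level tags of an HTML string into an ordered list of elements. It is not how `sec-parser` does it: it uses Python's built-in `html.parser` instead of BeautifulSoup, and `Tag`, `Element`, and `TopLevelTagCollector` are hypothetical names.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for HtmlTag: only stores the tag name."""

    name: str


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a semantic element holding exactly one tag."""

    html_tag: Tag


class TopLevelTagCollector(HTMLParser):
    """Collects tags that open at nesting depth zero."""

    VOID = {"img", "br", "hr"}  # small subset of tags that never close

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.top_level: list[Tag] = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.top_level.append(Tag(tag))
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1


collector = TopLevelTagCollector()
collector.feed("<p>Revenue <b>grew</b>.</p><table><tr><td>1</td></tr></table><img>")

# One element per top-level tag, in document order.
elements = [Element(tag) for tag in collector.top_level]
names = [e.html_tag.name for e in elements]
```

Nested tags such as `<b>` and `<td>` stay inside their top-level parent and do not produce elements of their own.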

## Understanding the Parsing Process

In [5]:
from sec_parser.processing_engine import AbstractSemanticElementParser

print(AbstractSemanticElementParser.__doc__)


    Responsible for parsing semantic elements from HTML documents.
    It takes raw HTML and returns a list of objects representing semantic elements.

    At a High Level:
    1. Extract top-level HTML tags from the document.
    2. Convert them into a list of more specific semantic elements step-by-step.

    Why Focus on Top-Level Tags?
    SEC filings typically have a flat HTML structure. This simplifies the
    parsing process, as each top-level HTML tag often directly corresponds
    to a single semantic element. This is different from many websites,
    where HTML tags are usually nested deeply, requiring more complex parsing.

    For Advanced Users:
    The parsing process is implemented as a sequence of steps (Pipeline Pattern)
    and allows customization of each step (Strategy Pattern).

    - Pipeline Pattern: Raw HTML tags are processed in a sequential, step-by-step manner
    - Strategy Pattern: You can either replace, remove, or extend any of the existing
      steps w

**Example 1:** Using the default parsing pipeline:

In [6]:
from sec_parser import Edgar10QParser

parser = Edgar10QParser()
# parser.parse(html)

**Example 2:** A trivial example showing that a parser without processing steps returns only the "starting state": each `HtmlTag` object wrapped in an `UndeterminedElement` object.

In [7]:
def get_steps():
    return []


parser = Edgar10QParser(get_steps)
parser.parse("<img><img><img>")

[UndeterminedElement<img>, UndeterminedElement<img>, UndeterminedElement<img>]
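
The behaviour in Example 2 follows from how a sequential pipeline is typically structured. The sketch below is a generic illustration, not the library's implementation: `Step`, `UppercaseStep`, `DropImagesStep`, and `run_pipeline` are hypothetical, and plain strings stand in for semantic elements.

```python
from typing import Callable


class Step:
    """Hypothetical processing step: transforms a list of elements."""

    def process(self, elements: list[str]) -> list[str]:
        raise NotImplementedError


class UppercaseStep(Step):
    def process(self, elements: list[str]) -> list[str]:
        return [e.upper() for e in elements]


class DropImagesStep(Step):
    def process(self, elements: list[str]) -> list[str]:
        return [e for e in elements if e.lower() != "img"]


def run_pipeline(
    elements: list[str], get_steps: Callable[[], list[Step]]
) -> list[str]:
    # Each step receives the output of the previous one.
    for step in get_steps():
        elements = step.process(elements)
    return elements


start = ["img", "p", "img"]
unchanged = run_pipeline(start, lambda: [])
processed = run_pipeline(start, lambda: [UppercaseStep(), DropImagesStep()])
```

With an empty list of steps the loop body never runs, so the starting state comes back untouched.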

**Example 3:** Advanced customization of the pipeline. Suppose `TableParsingStep` is a bottleneck for performance. In that case, you can easily remove it from the pipeline, or swap it out for a custom or inherited alternative. You can even write your own processing steps to have a completely custom parsing pipeline.

In [8]:
from sec_parser.processing_steps import TableParsingStep


def get_steps():
    steps = Edgar10QParser.get_default_steps()
    return [s for s in steps if not isinstance(s, TableParsingStep)]


parser = Edgar10QParser(get_steps)
# parser.parse(html)
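
Swapping a step for an inherited alternative, rather than removing it, could look roughly like the following. This is a self-contained sketch with hypothetical classes (`TitleStep`, `TableStep`, `FastTableStep`, `default_steps`); it does not use the real `sec_parser` steps, and strings stand in for elements.

```python
class TitleStep:
    """Hypothetical stand-in for one default step."""

    def process(self, elements: list[str]) -> list[str]:
        return [f"title:{e}" if e == "h1" else e for e in elements]


class TableStep:
    """Hypothetical stand-in for an expensive default step."""

    def process(self, elements: list[str]) -> list[str]:
        return [f"table:{e}" if e == "table" else e for e in elements]


class FastTableStep(TableStep):
    """Inherited alternative that skips the expensive work."""

    def process(self, elements: list[str]) -> list[str]:
        return list(elements)


def default_steps() -> list:
    return [TitleStep(), TableStep()]


def get_steps() -> list:
    # Replace each TableStep with the cheaper subclass and keep the order.
    return [FastTableStep() if type(s) is TableStep else s for s in default_steps()]


elements = ["h1", "table"]
for step in get_steps():
    elements = step.process(elements)

step_names = [type(s).__name__ for s in get_steps()]
```

Keeping the replacement at the same position preserves the order that later steps may rely on.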

## Handling Multiple Semantic Elements in a Single HTML Tag

If a single HTML tag contains multiple Semantic Elements, a processing step splits them and places them in a `CompositeSemanticElement`.

In [9]:
from sec_parser.semantic_elements import CompositeSemanticElement

print(CompositeSemanticElement.__doc__)


    CompositeSemanticElement acts as a container that can encapsulate other
    semantic elements.

    This is used for handling special cases where a single HTML root
    tag wraps multiple semantic elements. This maintains structural integrity
    and allows for seamless reconstitution of the original HTML document.

    Why is this useful:
    1. Some semantic elements, like XBRL tags (<ix>), may wrap multiple semantic
    elements. The container ensures that these relationships are not broken
    during parsing.
    2. Enables the parser to fully reconstruct the original HTML document, which
    opens up possibilities for features like semantic segmentation visualization
    (e.g. recreate the original document but put semi-transparent colored boxes
    on top, based on semantic meaning), serialization of parsed documents into
    an augmented HTML, and debugging by comparing to the original document.
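
The container idea can be sketched in a few lines. The names below (`Element`, `Composite`, `split_wrapper`, `flatten`) are hypothetical and only illustrate how a wrapper can keep its original markup while exposing the elements inside it; the real `CompositeSemanticElement` may work differently.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element; html is its source markup."""

    html: str


@dataclass
class Composite(Element):
    """Hypothetical container: the wrapper's markup plus its inner elements."""

    inner: list[Element] = field(default_factory=list)


def split_wrapper(open_tag: str, close_tag: str, parts: list[str]) -> Composite:
    # Keep the full wrapper markup so the document can be rebuilt verbatim.
    inner = [Element(p) for p in parts]
    return Composite(html=open_tag + "".join(parts) + close_tag, inner=inner)


def flatten(elements: list[Element]) -> list[Element]:
    # Expand containers recursively into their innermost elements.
    flat: list[Element] = []
    for e in elements:
        if isinstance(e, Composite):
            flat.extend(flatten(e.inner))
        else:
            flat.append(e)
    return flat


doc = [
    Element("<p>Intro</p>"),
    split_wrapper("<ix>", "</ix>", ["<p>A</p>", "<table></table>"]),
]
reconstructed = "".join(e.html for e in doc)
leaves = [e.html for e in flatten(doc)]
```

The top-level list still reproduces the original HTML, while `flatten` gives downstream code the individual elements.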
    


**Example 1:** This is an oversimplified

In [10]:
from sec_parser.processing_steps import AbstractProcessingStep
from sec_parser.semantic_elements import AbstractSemanticElement


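class SplitterProcessingStep(AbstractProcessingStep):
    def process(
        self, elements: list[AbstractSemanticElement]
    ) -> list[AbstractSemanticElement]:
        new_elements = []
        for element in elements:
            # Placeholder: the splitting logic is not shown in this guide,
            # so every element is passed through unchanged.
            new_elements.append(element)
        return new_elements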

## Designing the ProcessingStep

Each `ProcessingStep` instance is created from scratch for each parsed document. This means that each `ProcessingStep` instance can have its own state, and can be used to store information about the document being parsed. This is useful for processing steps that need to keep track of information across multiple `HtmlTag` objects.

**Example 1:** Counting the number of images in a document

In [11]:
from sec_parser.processing_steps import AbstractProcessingStep


class CounterProcessingStep(AbstractProcessingStep):
    pass


parser = Edgar10QParser(get_steps=lambda: [])
parser.parse("<img><img><img>")

[UndeterminedElement<img>, UndeterminedElement<img>, UndeterminedElement<img>]
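
To make the per-document state concrete, here is a minimal, self-contained sketch of a counting step. It does not use the `sec_parser` API: `ImageCounterStep` and `parse` are hypothetical, and tag names (plain strings) stand in for `HtmlTag` objects.

```python
class ImageCounterStep:
    """Hypothetical step that keeps per-document state."""

    def __init__(self) -> None:
        self.image_count = 0

    def process(self, tag_names: list[str]) -> list[str]:
        # The counter accumulates over all tags of the current document.
        self.image_count += sum(1 for name in tag_names if name == "img")
        return tag_names


def parse(tag_names: list[str], get_steps):
    steps = get_steps()  # fresh step instances for every document
    for step in steps:
        tag_names = step.process(tag_names)
    return tag_names, steps


_, first_steps = parse(["img", "img", "img"], lambda: [ImageCounterStep()])
_, second_steps = parse(["p", "img"], lambda: [ImageCounterStep()])
counts = (first_steps[0].image_count, second_steps[0].image_count)
```

Because `get_steps` builds new instances on each call, the count from the first document does not leak into the second.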

## Introduction to Semantic Trees

In [12]:
from sec_parser.semantic_tree import TreeBuilder

print(TreeBuilder.__doc__)


    Builds a semantic tree from a list of semantic elements.

    Why Use a Tree Structure?
    Using a tree data structure allows for easier and more robust filtering of sections.
    With a tree, you can select specific branches to filter, making it straightforward
    to identify section boundaries. This approach is more maintainable and robust
    compared to attempting the same operations on a flat list of elements.

    Overview:
    1. Takes a list of semantic elements.
    2. Applies nesting rules to these elements.

    Customization:
    The nesting process is customizable through a list of rules. These rules determine
    how new elements should be nested under existing ones.

    Advanced Customization:
    You can supply your own set of rules by providing a callable to `get_rules`, which
    should return a list of `AbstractNestingRule` instances.
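
The rule-based nesting described above can be sketched as follows. This is a simplified, single-level illustration with hypothetical names (`Node`, `nest_under_title`, `build_tree`); it is not the real `TreeBuilder` or `AbstractNestingRule` interface.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # for example "title" or "text"
    text: str
    children: list["Node"] = field(default_factory=list)


def nest_under_title(parent: Node, child: Node) -> bool:
    """Hypothetical nesting rule: non-title elements nest under a title."""
    return parent.kind == "title" and child.kind != "title"


def build_tree(elements, rules) -> list[Node]:
    roots: list[Node] = []
    for kind, text in elements:
        node = Node(kind, text)
        parent = roots[-1] if roots else None
        # Nest under the most recent root if any rule allows it.
        if parent is not None and any(rule(parent, node) for rule in rules):
            parent.children.append(node)
        else:
            roots.append(node)
    return roots


flat = [
    ("title", "Risk Factors"),
    ("text", "A"),
    ("text", "B"),
    ("title", "Controls"),
    ("text", "C"),
]
tree = build_tree(flat, rules=[nest_under_title])
shape = [(n.text, [c.text for c in n.children]) for n in tree]
```

Selecting a section then amounts to picking one root and its children, instead of scanning the flat list for boundaries.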
    
