# Comprehensive Developer Guide

Welcome to the Comprehensive Developer Guide for `sec-parser`. It gives an in-depth view of the project, both for new developers who want to contribute and for experienced ones who want to build on its capabilities. We walk through the codebase, explain the key components and how they interact, and provide examples to help you get started.

This guide is interactive: you can run and modify every code example shown here by cloning the repository and opening [comprehensive_developer_guide.ipynb](https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/comprehensive_developer_guide.ipynb) in Jupyter.

Alternatively, you can run the notebook directly in your browser using Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/comprehensive_developer_guide.ipynb)

Let's dive in!

#### We're using BeautifulSoup
Most human-readable SEC EDGAR reports are formatted as HTML documents. To make these documents easier to work with, we use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ("bs4") library to parse an HTML document into a tree of HTML tags (`bs4.Tag`).

#### `HtmlTag` wraps `bs4.Tag`
Instead of interacting directly with `bs4.Tag`, `sec-parser` uses `HtmlTag`, a wrapper around `bs4.Tag`.


In [1]:
from sec_parser.processing_engine import HtmlTag, HtmlTagParser

print(HtmlTag.__doc__)
print(HtmlTagParser.__doc__)


    The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

    It serves three main purposes:

    1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we
       can isolate our application logic from the library specifics. This
       makes it easier to modify or even replace the HTML parsing library in
       the future without extensive codebase changes.

    2. Usability: The HtmlTag class provides a convenient location to add
       extension methods or additional properties not offered by the native
       BeautifulSoup4 Tag class. This enhances the usability of the class.

    3. Caching: The HtmlTag class also caches processing results, improving
       performance by avoiding unnecessary re-computation.
    

    The HtmlTagParser parses an HTML document using BeautifulSoup4.
    It then wraps the parsed bs4.Tag objects into HtmlTag objects.
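
The three purposes listed in the `HtmlTag` docstring can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: `WrappedTag` and its members are hypothetical names, not the `sec-parser` API, and the standard library's `xml.etree.ElementTree` stands in for BeautifulSoup so the example runs without extra packages.

```python
from functools import cached_property
import xml.etree.ElementTree as ET


class WrappedTag:
    """Hypothetical wrapper for illustration; NOT sec-parser's HtmlTag."""

    def __init__(self, element: ET.Element) -> None:
        # Decoupling: callers only see WrappedTag, never the wrapped object,
        # so the underlying parsing library could be swapped out later.
        self._element = element

    @property
    def name(self) -> str:
        return self._element.tag

    def contains_tag(self, name: str) -> bool:
        # Usability: a convenience method the wrapped class does not offer.
        return self._element.find(f".//{name}") is not None

    @cached_property
    def text(self) -> str:
        # Caching: computed on first access, then stored on the instance.
        raw = "".join(self._element.itertext())
        return " ".join(raw.split())  # collapse runs of whitespace


root = ET.fromstring("<div><p>Total   revenue: </p><b>42</b></div>")
tag = WrappedTag(root)
print(tag.name)                 # div
print(tag.contains_tag("b"))    # True
print(tag.text)                 # Total revenue: 42
```

The design point is that the rest of the code depends only on the wrapper's small surface (`name`, `contains_tag`, `text` in this sketch), which is what makes replacing the underlying library feasible.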
    


#### What is a Semantic Element?

In [10]:
from sec_parser.semantic_elements import AbstractSemanticElement as SemanticElement

print(SemanticElement.__doc__)


    In the domain of HTML parsing, especially in the context of SEC EDGAR documents,
    a semantic element refers to a meaningful unit within the document that serves a
    specific purpose. For example, a paragraph or a table might be considered a
    semantic element. Unlike syntactic elements, which merely exist to structure the
    HTML, semantic elements carry information that is vital to the understanding of the
    document's content.

    This class serves as a foundational representation of such semantic elements,
    containing an HtmlTag object that stores the raw HTML tag information. Subclasses
    will implement additional behaviors based on the type of the semantic element.
    


A few examples:

In [8]:
from sec_parser.semantic_elements import TextElement, TableElement

print(TextElement.__doc__)
print(TableElement.__doc__)

The TextElement class represents a standard text paragraph within a document.
The TableElement class represents a standard table within a document.


To summarize, the purpose of parsing is to produce an ordered list of Semantic Elements from a tree of HTML Tags.
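
As a rough illustration of that idea, the sketch below flattens a small tag tree into an ordered list of elements. It is a simplified stand-in under stated assumptions, not the `sec-parser` pipeline: the `Sketch*` classes and `to_semantic_elements` are hypothetical names, the classification rules are invented for the example, and `xml.etree.ElementTree` replaces BeautifulSoup to keep the code dependency-free.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class SketchElement:
    """Hypothetical base class: keeps the tag the element was built from."""
    tag: ET.Element


@dataclass
class SketchTextElement(SketchElement):
    """Hypothetical counterpart of a text paragraph."""


@dataclass
class SketchTableElement(SketchElement):
    """Hypothetical counterpart of a table."""


def to_semantic_elements(root: ET.Element) -> list:
    """Walk the tree in document order and emit a flat list of elements."""
    elements = []
    for node in root:
        if node.tag == "table":
            elements.append(SketchTableElement(node))
        elif node.tag == "div":
            # Assumed rule: a <div> is purely structural, so descend into it.
            elements.extend(to_semantic_elements(node))
        elif "".join(node.itertext()).strip():
            # Any other tag that carries visible text becomes a text element.
            elements.append(SketchTextElement(node))
    return elements


html = (
    "<body>"
    "<div><p>Item 1. Business</p></div>"
    "<table><tr><td>Revenue</td><td>42</td></tr></table>"
    "<div><p>See notes.</p><p> </p></div>"
    "</body>"
)
elements = to_semantic_elements(ET.fromstring(html))
for element in elements:
    print(type(element).__name__)
```

The structural `<div>` wrappers and the empty paragraph produce no elements of their own, so the result is one text element, one table element, and one more text element, in document order.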

#### `SemanticElement` wraps `HtmlTag`