## Task 2: Content Summarization

Extract and summarize useful information from a 10-K report. 

Use the 10-K report of NextGen Healthcare as an example: https://www.sec.gov/Archives/edgar/data/708818/000095017023023499/nxgn-20230331.htm

#### 1. Extract the “Business” section and the “Risk Factors” section from the report

In [60]:
# Check whether the ratelimit and unstructured packages are available in the
# virtual environment; install them if they are missing
try:
    import ratelimit, unstructured
except ImportError:
    %pip install ratelimit unstructured

# Check whether the 'pipeline-sec-filings' directory is available locally;
# if not, clone it from the GitHub repository
import os
import re

# If the external package is not available already
if not os.path.exists(os.path.join(os.getcwd(), 'pipeline-sec-filings')):
    !git clone https://github.com/Unstructured-IO/pipeline-sec-filings.git --depth=1


# Move to pipeline-sec-filings directory
%cd pipeline-sec-filings

# Download RAW documents from SEC website
from prepline_sec_filings.fetch import (
    get_form_by_ticker
)

# Parse the raw text & extract title
from unstructured.documents.elements import Title
from unstructured.documents.html import HTMLDocument
from unstructured.cleaners.core import (
    clean_extra_whitespace,
    clean_ordered_bullets,
    clean_bullets,
    replace_unicode_quotes,
)

/home/joshi/ai_challenge/src/pipeline-sec-filings


In [61]:
""" Custom definitions to solve the task  """

def get_raw_10k_reports(
        form_type:str='10-K',
        company_symbol:str='nxgn',
        company='Akshay Tech', 
        email='contact@akshayjoshi.tech'
    ) -> str:
    """ Use pipeline-sec-filings to download filings of the desired company """
    return get_form_by_ticker( 
                        ticker=company_symbol,
                        form_type=form_type,
                        company=company,
                        email=email
            )

def convert_rawtext_to_html(
        raw_text:str,
        skip_table_text:bool=True
    ) -> HTMLDocument:
    """ Convert raw text to HTML object """
    return HTMLDocument.from_string(
                    raw_text
                ).doc_after_cleaners(
                    skip_headers_and_footers=True, 
                    skip_table_text=skip_table_text
            )

def is_10k_item_title(title:str) -> bool:
    """Determines if a title corresponds to a 10-K item heading."""
    ITEM_TITLE_RE = re.compile(
                    r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?"
                )
    
    return ITEM_TITLE_RE.match(title) is not None

def get_chapter_titles(
        html_document:HTMLDocument
    ) -> tuple[list, list]:
    """ Get the 10-K item titles and their Title elements from the HTML document """
    
    titles = []
    title_ids = []
    for element in html_document.elements:
        element.text = clean_extra_whitespace(element.text)
        if isinstance(element, Title) and is_10k_item_title(element.text):
            titles.append(element.text)
            title_ids.append(element)

    # print(f"Found {len(titles)} chapter titles")
    # print(f"Sample titles: {titles[:5]}")
    return titles, title_ids


def clean(text:str) -> str:
    """ Clean the text before passing it to the model """
    text = text.lower()
    text = clean_extra_whitespace(text)
    text = clean_ordered_bullets(text)
    text = clean_bullets(text)
    text = replace_unicode_quotes(text)

    # Remove entire hyperlinks first: the punctuation pass below strips
    # ( ) : / and would otherwise leave this pattern nothing to match
    url_pattern = re.compile(r'\(https?://[^\s]+?\)')
    text = url_pattern.sub(' ', text)

    # Remove punctuation except . (such as , : ; etc.) from the text
    pattern = re.compile(r'[,\/#!$%\^&\*;:{}=\-_`~()]')
    text = pattern.sub('', text)

    # Remove "table of contents" string from the text
    pattern = re.compile(r'table of contents')
    text = pattern.sub('', text)

    return text

def get_chapter_text(
        html_document:HTMLDocument,
        chapters:list[str]=["Business", "Risk Factors"]
    ) -> dict:
    """ Get the chapter text from the HTML document """
    titles, title_ids = get_chapter_titles(html_document)
    elements = html_document.elements

    # Dict of chapter title & text
    chapter_text_dict = {}

    for title_idx, title in enumerate(titles):
        # Find the requested chapter (if any) named in this item title
        chapter = next(
            (c for c in chapters if c.lower() in title.lower()), None
        )
        if chapter is None:
            continue

        # Position of this chapter's title element in the document
        start = next(i for i, el in enumerate(elements)
                     if el.id == title_ids[title_idx].id)

        # The chapter ends at the next item title, or at the end of the
        # document when this is the last item
        end = len(elements)
        if title_idx + 1 < len(title_ids):
            next_id = title_ids[title_idx + 1].id
            end = next(
                (i for i in range(start + 1, len(elements))
                 if elements[i].id == next_id),
                end
            )

        # Skip the title itself and clean every element up to the next title
        chapter_text_dict[chapter] = " ".join(
            clean(el.text) for el in elements[start + 1:end]
        )

    return chapter_text_dict
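
To see which headings the pattern in `is_10k_item_title` accepts, here is a small standalone check. The sample headings are made up for illustration and are not taken from the NextGen filing. Because `re.match` anchors at the start of the string, an "Item" reference in the middle of a sentence is rejected.

```python
import re

# Same pattern as in is_10k_item_title
ITEM_TITLE_RE = re.compile(
    r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?"
)

# Hypothetical headings, chosen only to exercise the pattern
samples = [
    "Item 1. Business",
    "ITEM 1A. RISK FACTORS",
    "Item 7(a): Market Risk",
    "Business",
    "See Item 1A for details",
    "Item  1. Business",  # double space, as before whitespace cleaning
]

results = {s: ITEM_TITLE_RE.match(s) is not None for s in samples}
for heading, matched in results.items():
    print(f"{heading!r:30} -> {matched}")
```

The last sample fails because the pattern expects exactly one space after "item", which is why `get_chapter_titles` runs `clean_extra_whitespace` before testing each title.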

In [62]:
# Load reports for a company
raw_text = get_raw_10k_reports()

# Convert raw text to HTML object
html_document = convert_rawtext_to_html(raw_text)

# Get the chapter text from the HTML document
chapter_text_dict = get_chapter_text(
                        html_document,
                        chapters=["Business", "Risk Factors"]
                    )

print(chapter_text_dict)

{'Business': 'company overview nextgen healthcare is a leading provider of innovative cloudbased healthcare technology solutions that empower ambulatory healthcare providers to manage the risk and complexity of delivering care in the united states healthcare system. our combination of technological breadth depth and domain expertise positions us as a preferred solution provider and trusted advisor for our clients. in addition to highly configurable core clinical and financial capabilities our portfolio includes tightly integrated solutions that deliver on ambulatory healthcare imperatives including consumerism digitization risk allocation regulatory influence and integrated care and health equity. we serve clients across all 50 states. over 100000 providers use nextgen healthcare solutions to deliver care in nearly every medical specialty in a wide variety of practice models including accountable care organizations “acos” independent physician associations “ipas” managed service organi

#### 2. Summarize each of the two sections using the Flan-T5 Large model

The issue with the vanilla T5 and Flan-T5 models is that the maximum length of the input sequence is 512 tokens. This is a problem because the text of each section in a 10-K report is much longer than the model's max_seq_length. Because Flan-T5 uses relative position embeddings, the maximum input length can be extended to 2048 tokens. However, this is still not enough to cover the entire text of a 10-K section, and it may also cause out-of-memory errors because of the **quadratic space and time complexity** of the self-attention mechanism.
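
As a rough illustration of the quadratic growth, the sketch below computes the size of a single attention score matrix for the two sequence lengths mentioned above. It assumes 4-byte (float32) values and counts only one head of one layer for one example, so real memory use is a large multiple of these figures; the function name is hypothetical.

```python
def attention_scores_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one seq_len x seq_len attention score matrix
    (a single head, a single layer, a single example)."""
    return seq_len * seq_len * bytes_per_value

for n in (512, 2048):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>5} tokens -> {mib:.0f} MiB per head")

# Quadratic scaling: 4x the tokens needs 16x the memory
ratio = attention_scores_bytes(2048) // attention_scores_bytes(512)
print(f"4x longer input -> {ratio}x larger score matrix")
```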

Technically, we can split the text into multiple chunks of 512 tokens (with some heuristic), feed them into the model one by one, and then perform **hierarchical summarization** over the smaller summaries generated from the chunks. However, this is a **costly** and **slow** process!
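
A minimal sketch of that hierarchical scheme is shown below. It is not the code used in this notebook: `hierarchical_summarize`, `summarize_fn`, and `group_size` are hypothetical names, and `summarize_fn` stands in for a real model call. It also assumes `group_size` is small enough that the merged summaries fit in the model's input window.

```python
from typing import Callable, List

def hierarchical_summarize(
        chunks: List[str],
        summarize_fn: Callable[[str], str],
        group_size: int = 4
    ) -> str:
    """Summarize each chunk, then repeatedly merge groups of summaries and
    summarize them again until a single summary remains."""
    if not chunks:
        return ""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    # Level 0: one summary per chunk
    summaries = [summarize_fn(chunk) for chunk in chunks]

    # Higher levels: merge neighbouring summaries and summarize again
    while len(summaries) > 1:
        merged = [" ".join(summaries[i:i + group_size])
                  for i in range(0, len(summaries), group_size)]
        summaries = [summarize_fn(text) for text in merged]

    return summaries[0]

# Toy stand-in for a model call: keep only the first sentence
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."

chunks = ["a b. c d.", "e f. g h.", "i j. k l."]
print(hierarchical_summarize(chunks, first_sentence, group_size=2))
```

Every level calls the summarizer once per item, so the number of model calls exceeds the number of chunks, which is where the cost comes from.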

I am using some of the ideas from the divide-and-conquer strategy described in [8] in the reference papers section.

In [63]:
# We are done with using the pipeline-sec-filings package, so move back to the parent directory
%cd ..

/home/joshi/ai_challenge/src


In [68]:
# Import the custom chunking function to create text chunks from a long text
from utils.preprocess_utils import create_text_chunks


# Create text chunks for both the chapters
chapter_chunks = {}

# Use pretrained Flan-T5 Large model as instructed in the task
tokenizer = "google/flan-t5-large"

for key, value in chapter_text_dict.items():
    chapter_chunks[key] = create_text_chunks(
                                long_text=value,
                                tokenizer_name_or_path=tokenizer)

# Print number of chunks for each chapter
for key, value in chapter_chunks.items():
    print(f"Number of chunks of 512 tokens each for {key} chapter: {len(value)}")


Max seq. length of the model: 512
Max seq. length of the model: 512
Number of chunks of 512 tokens each for Business chapter: 22
Number of chunks of 512 tokens each for Risk Factors chapter: 38
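
The implementation of `create_text_chunks` is not shown here. As a rough idea of what such a helper might do, the sketch below greedily packs whole sentences into chunks under a token budget. Everything in it is an assumption: `chunk_by_sentences` and `count_tokens` are hypothetical names, and the default token counter is a whitespace word count that only approximates a subword tokenizer.

```python
from typing import Callable, List

def chunk_by_sentences(
        long_text: str,
        max_tokens: int = 512,
        count_tokens: Callable[[str], int] = lambda s: len(s.split())
    ) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    count_tokens defaults to a whitespace word count, which only
    approximates a real subword tokenizer.
    """
    # clean() keeps periods, so they can serve as sentence boundaries
    sentences = [s.strip() + "." for s in long_text.split(".") if s.strip()]

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would overflow
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "one two three. four five. six seven eight nine. ten."
print(chunk_by_sentences(text, max_tokens=5))
```

Two limitations are worth noting: a single sentence longer than `max_tokens` still becomes its own oversized chunk, and splitting on periods also splits decimal numbers. To count real tokens, `count_tokens` could wrap a Hugging Face tokenizer, for example `lambda s: len(tokenizer.encode(s))`.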
