# Extracting 10-K Items from the Document

This notebook focuses on extracting key sections from a 10-K filing, such as *Business*, *Risk Factors*, and other mandated sections. Our objective is to segment the document and classify these sections according to the standard 10-K items, enabling us to later organize the content and build a RAG (Retrieval-Augmented Generation) agent capable of reliably answering questions from the document.

### Approach:

We will extract the sections of the document and their corresponding pages using **two methods**:

1. **Using PDF Bookmarks**: Bookmarks, if available, often represent the main sections of the 10-K. We will extract them to identify the section names and their start pages.
2. **Using Internal Hyperlinks from the Table of Contents (TOC)**: In cases where bookmarks are missing, we will rely on internal hyperlinks found in the 10-K's TOC to identify the sections and their start pages.

Once we have extracted the sections and their page ranges, we will use a Large Language Model (LLM) to determine:
- **The name of each section** (e.g., "Risk Factors").
- **The corresponding 10-K item** (e.g., "Item 1A: Risk Factors").
- **The start page** of the section.
- **The end page** of the section.

The LLM will classify each section by mapping it to a corresponding 10-K item, guided by a predefined prompt containing descriptions of the standard 10-K items.

### Relevant Links:

- [Investopedia: SEC Filings - Forms You Need to Know](https://www.investopedia.com/articles/fundamental-analysis/08/sec-forms.asp)
- [SEC Investor Bulletin: How to Read a 10-K](https://www.sec.gov/files/reada10k.pdf)
- [SEC General Instructions for Form 10-K](https://www.sec.gov/files/form10-k.pdf)


In [166]:
import requests
import fitz  # PyMuPDF
import yaml
import os
import json
from typing import Dict, Union, Any
from pathlib import Path

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_aws import ChatBedrock
from langchain_openai import ChatOpenAI

from aily_py_commons.io.env_vars import AilySettings, COREPRODUCT_PROD
from langfuse.callback import CallbackHandler
from aily_ai_brain.modules.llms import get_llm
from aily_ai_brain.common.enums import BedrockModelID, OpenAIModelID
from aily_ai_brain.common.langfuse_handler import get_langfuse_handler

from pydantic_models.item_10k import Item10KList

In [145]:
def download_pdf(url: str, save_path: str) -> None:
    """
    Downloads a PDF from a given URL and saves it to the specified file path.

    Args:
        url (str): The URL of the PDF to download.
        save_path (str): The local file path where the PDF will be saved.
    """
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, 'wb') as f:
            f.write(response.content)
        print(f"PDF downloaded successfully and saved as '{save_path}'")
    else:
        raise Exception(f"Failed to download PDF. Status code: {response.status_code}")

def load_pdf(pdf_path: str) -> fitz.Document:
    """
    Loads a PDF document from a given file path.

    Args:
        pdf_path (str): The local file path of the PDF to load.

    Returns:
        fitz.Document: A PyMuPDF document object if the file is successfully opened.

    Raises:
        Exception: If the file cannot be opened.
    """
    try:
        document = fitz.open(pdf_path)
        print(f"Successfully loaded PDF from '{pdf_path}'")
        return document
    except Exception as e:
        raise Exception(f"Error opening PDF: {e}")

In [163]:
AilySettings(COREPRODUCT_PROD)

langfuse_tags = [
    "team: genai",
    "environment: dev",
    "project: 10k_file_processing_test",
]

langfuse_project_name = "scanner"

# APPL 10-K 2023 URL
appl_url = "https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf"
appl_pdf_path = "appl_2023.pdf"

# JNJ 10-K 2023 URL
jnj_url = "https://d18rn0p25nwr6d.cloudfront.net/CIK-0000200406/2d8bead4-a89a-4802-8c63-1266ad78e6a2.pdf"
jnj_pdf_path = "jnj_2023.pdf"

company_name = "appl_2023"
url = appl_url
pdf_path = appl_pdf_path

 [32maily-logging:[2024-10-08 17:37:54 CEST+0200] [INFO] ------ AilySettings ------ [0m
 [32maily-logging:[2024-10-08 17:37:55 CEST+0200] [INFO] AWS Role    : DataArchitect [0m
 [32maily-logging:[2024-10-08 17:37:55 CEST+0200] [INFO] AWS Profile : aws-infrastructure [0m
 [32maily-logging:[2024-10-08 17:37:55 CEST+0200] [INFO] Aily Env    : prod [0m
 [32maily-logging:[2024-10-08 17:37:55 CEST+0200] [INFO] -------------------------- [0m


# 1 - Extracting PDF Bookmarks

In [147]:
def extract_bookmarks(pdf_path: str) -> list[tuple[str, int]]:
    """
    Extracts bookmarks (Table of Contents) from a PDF file and returns them as a list of tuples.

    Args:
        pdf_path (str): The local file path of the PDF to open and extract bookmarks from.

    Returns:
        list[tuple[str, int]]: A list of tuples, where each tuple contains the bookmark title and page number.
    """
    try:
        document = load_pdf(pdf_path)
    except Exception as e:
        print(f"Error loading PDF: {e}")
        return []
    
    # Extracting bookmarks (Table of Contents)
    bookmarks = document.get_toc()

    # Close the PDF document
    document.close()

    # Return list of tuples with title and page number
    return [(title, page) for level, title, page in bookmarks] if bookmarks else []


def format_bookmarks_string(bookmarks: list[tuple[str, int]]) -> str:
    """
    Returns a string representation of bookmarks and their corresponding pages.

    Args:
        bookmarks (list[tuple[str, int]]): List of tuples where each tuple contains the bookmark title and page number.

    Returns:
        str: A string containing the bookmarks and pages.
    """
    if bookmarks:
        result = "Bookmarks and their corresponding pages:\n"
        for title, page in bookmarks:
            result += f"'{title}' -> Page {page}\n"
    else:
        result = "No bookmarks found in the PDF.\n"
    
    return result

In [148]:
# Download the APPL 10-K PDF
download_pdf(url, pdf_path)

# Extract bookmarks
bookmarks = extract_bookmarks(pdf_path)

# Format bookmarks as a raw text
bookmarks_formatted = format_bookmarks_string(bookmarks)

# Print bookmarks
print(bookmarks_formatted)

PDF downloaded successfully and saved as 'appl_2023.pdf'
Successfully loaded PDF from 'appl_2023.pdf'
Bookmarks and their corresponding pages:
'Cover Page' -> Page 1
'TABLE OF CONTENTS' -> Page 3
'PART I' -> Page 4
'Item 1. Business' -> Page 4
'Company Background' -> Page 4
'Products' -> Page 4
'Services' -> Page 5
'Segments' -> Page 5
'Markets and Distribution' -> Page 5
'Competition' -> Page 5
'Supply of Components' -> Page 6
'Research and Development' -> Page 6
'Intellectual Property' -> Page 6
'Business Seasonality and Product Introductions' -> Page 7
'Human Capital' -> Page 7
'Available Information' -> Page 7
'Item 1A. Risk Factors' -> Page 8
'Macroeconomic and Industry Risks' -> Page 8
'Business Risks' -> Page 10
'Legal and Regulatory Compliance Risks' -> Page 15
'Financial Risks' -> Page 18
'General Risks' -> Page 19
'Item 1B. Unresolved Staff Comments' -> Page 19
'Item 1C. Cybersecurity' -> Page 19
'Item 2. Properties' -> Page 20
'Item 3. Legal Proceedings' -> Page 20
'Item 4. 

# 2 - Extracting links from TOC

In [149]:
def extract_internal_hyperlinks_with_context(pdf_file_path, page_number):
    """
    Extracts internal hyperlinks and their surrounding text from a specific page of a PDF.

    Args:
        pdf_file_path (str): Path to the PDF file.
        page_number (int): Page number to extract internal links from (1-based index).

    Returns:
        list: A list of tuples containing the destination page number and the associated link text.
    """
    hyperlinks = []

    # Open the PDF file
    document = fitz.open(pdf_file_path)

    # Check if the specified page exists
    if page_number - 1 < len(document):
        page = document[page_number - 1]

        # Get the links on the page
        for link in page.get_links():
            # Check if 'page' and 'from' fields are present in the link dictionary
            if 'page' in link and 'from' in link:
                dest_page_number = int(link['page']) + 1  # Convert to 1-based index
                rect = link['from']  # Get the rectangle coordinates of the link

                # Extract text in the rectangle area around the link
                text_within_rect = page.get_text("text", clip=rect)

                # Append the destination page and the extracted text to the list
                hyperlinks.append((
                    f"Page {dest_page_number}",
                    text_within_rect.strip()
                ))

    document.close()  # Close the document
    return hyperlinks


def format_internal_hyperlinks_string(hyperlinks: list) -> str:
    """
    Returns a string representation of extracted internal hyperlinks and their corresponding text.

    Args:
        hyperlinks (list): A list of tuples containing the destination page number and link text.

    Returns:
        str: A string containing the internal hyperlinks and their text.
    """
    if not hyperlinks:
        result = "No internal hyperlinks found.\n"
    else:
        result = ""
        for link, link_text in hyperlinks:
            result += f'Link: {link}\nLink Text: {link_text}\n'
    
    return result

In [150]:
toc_page_number = 3  # Change to the desired page number

# Download the APPL 10-K PDF
download_pdf(appl_url, appl_pdf_path)

# Extract and print internal hyperlinks from the specified page
internal_hyperlinks = extract_internal_hyperlinks_with_context(appl_pdf_path, toc_page_number)

# Format hyperlinks as a raw text
hyperlinks_formatted = format_bookmarks_string(bookmarks)

# Print hyperlinks
print(hyperlinks_formatted)

PDF downloaded successfully and saved as 'appl_2023.pdf'
Bookmarks and their corresponding pages:
'Cover Page' -> Page 1
'TABLE OF CONTENTS' -> Page 3
'PART I' -> Page 4
'Item 1. Business' -> Page 4
'Company Background' -> Page 4
'Products' -> Page 4
'Services' -> Page 5
'Segments' -> Page 5
'Markets and Distribution' -> Page 5
'Competition' -> Page 5
'Supply of Components' -> Page 6
'Research and Development' -> Page 6
'Intellectual Property' -> Page 6
'Business Seasonality and Product Introductions' -> Page 7
'Human Capital' -> Page 7
'Available Information' -> Page 7
'Item 1A. Risk Factors' -> Page 8
'Macroeconomic and Industry Risks' -> Page 8
'Business Risks' -> Page 10
'Legal and Regulatory Compliance Risks' -> Page 15
'Financial Risks' -> Page 18
'General Risks' -> Page 19
'Item 1B. Unresolved Staff Comments' -> Page 19
'Item 1C. Cybersecurity' -> Page 19
'Item 2. Properties' -> Page 20
'Item 3. Legal Proceedings' -> Page 20
'Item 4. Mine Safety Disclosures' -> Page 20
'PART II'

# 3 - Use an LLM to identify sections (i.e., 10-K items)

In [151]:
def load_prompts():
    """Load prompts from a YAML file in the current working directory."""
    current_dir = os.getcwd()  # Get the current working directory in Jupyter
    prompts_file = os.path.join(current_dir, "prompts.yaml")
    with open(prompts_file) as file:
        return yaml.safe_load(file)

## 3.1 - Load 10K item descriptions

In [152]:
def generate_sec_10k_items_description_from_file(file_path: str) -> str:
    """
    Reads the YAML file and translates its content describing 10-K items into a structured textual format
    suitable for use in a prompt.

    Args:
        file_path (str): The file path of the YAML file containing the 10-K item descriptions.

    Returns:
        str: A string containing a structured description of all the 10-K items.
    """
    try:
        # Open and read the YAML file
        with open(file_path, 'r') as file:
            data = yaml.safe_load(file)

        # Initialize an empty list to hold the textual output
        item_descriptions = []

        # Iterate through each part and its items
        for part in data:
            part_name = part.get('part', '')
            items = part.get('items', [])
            
            # Process each item in the part
            for item in items:
                item_id = item.get('item_id', '')
                name = item.get('name', '')
                description = item.get('description', '')
                
                # Format the item as a structured text
                item_text = f"**Item {item_id}: {name}**\n{description}\n"
                item_descriptions.append(item_text)

        # Join all item descriptions into a single string
        return "\n".join(item_descriptions)
    
    except FileNotFoundError:
        print(f"Error: The file {file_path} was not found.")
        return ""
    except yaml.YAMLError as e:
        print(f"Error parsing YAML: {e}")
        return ""
    
file_path = './sec_10k_item_descriptions.yaml'
sec_10k_item_descriptions = generate_sec_10k_items_description_from_file(file_path)
print(sec_10k_item_descriptions)

**Item 1: Business**
Requires a description of the company’s business, including its main products and services, what subsidiaries it owns, and what markets it operates in. This section may also include information about recent events, competition the company faces, regulations that apply to it, labor issues, special operating costs, or seasonal factors. This is a good place to start to understand how the company operates.

**Item 1A: Risk Factors**
Includes information about the most significant risks that apply to the company or to its securities. Companies generally list the risk factors in order of their importance. In practice, this section focuses on the risks themselves, not how the company addresses those risks. Some risks may be true for the entire economy, some may apply only to the company’s industry sector or geographic region, and some may be unique to the company.

**Item 1B: Unresolved Staff Comments**
Requires the company to explain certain comments it has received from

### 3.2 - Load prompts

In [153]:
prompts = load_prompts()

### 3.3 - General chain code

**Note:** We pass the langfuse_handler and LLM since we want all steps to be within the same Langfuse trace

In [154]:
def run_chain(
    llm: Union[ChatOpenAI, ChatBedrock],
    prompt: str,
    input_data: Dict[str, Any],
    output_parser: Union[StrOutputParser, JsonOutputParser],
    handler: CallbackHandler,
):
    """
    Executes a processing chain that combines a language model, a prompt 
    template, and an output parser to generate and process completions 
    based on input data.

    Args:
        llm (Union[ChatOpenAI, ChatBedrock]): 
            The language model to use for generating responses, either 
            ChatOpenAI or ChatBedrock.

        prompt (str): 
            A string representing the prompt template to format, guiding 
            the language model's responses.

        input_data (Dict[str, Any]): 
            A dictionary containing input data to format into the prompt. 
            Keys must match variable placeholders in the prompt template.

        output_parser (Union[StrOutputParser, JsonOutputParser]): 
            An instance of a parser used to parse the output from the 
            language model. Should be either StrOutputParser or 
            JsonOutputParser.

        handler (CallbackHandler): 
            An instance of a callback handler for managing tracing and 
            callback events during processing.

    Returns:
        Any: 
            The processed output from the language model after applying 
            the specified output parser. The return type depends on the 
            parser used (string or JSON).
    """
    # Load prompt template
    prompt_template = PromptTemplate.from_template(template=prompt)

    # Create the processing chain
    chain = prompt_template | llm | output_parser

    # Invoke the chain with the input and handler (for callbacks)
    completion = chain.invoke(
        input=input_data,
        config={"callbacks": [handler]}
    )

    return completion


## 3.4 - Initialize LLMs

We want to have separate traces for the bookmark case and for the TOC case, so we create a method for initialization

In [155]:
def initialize_models_and_handler():
    langfuse_handler = get_langfuse_handler(
        langfuse_tags=langfuse_tags,
        trace_name=f"{company_name}_10k_items",
        project_name=langfuse_project_name,
    )

    # Load LLM
    gpt_4o = get_llm(
        langfuse_handler=langfuse_handler,
        sensitive_data=False,
        model_id=OpenAIModelID.GPT4O,
    )
    
    gpt_4o_mini = get_llm(
        langfuse_handler=langfuse_handler,
        sensitive_data=False,
        model_id=OpenAIModelID.GPT4O_mini,
    )
    
    return gpt_4o, gpt_4o_mini, langfuse_handler

In [156]:
gpt_4o_bookmarks, gpt_4o_mini_bookmarks, handler_bookmarks = initialize_models_and_handler()
gpt_4o_hyperlinks, gpt_4o_mini_hyperlinks, handler_hyperlinks = initialize_models_and_handler()

insights, genai, core, rnd, fin, m_and_s, pro, ebi, qa, supply, ppl, spend, gra, gtm
Please contact the genai team if we have forgotten you. [0m
insights, genai, core, rnd, fin, m_and_s, pro, ebi, qa, supply, ppl, spend, gra, gtm
Please contact the genai team if we have forgotten you. [0m


## 3.4 - Identify 10-K items

In [157]:
# Using PDF Bookmarks
identified_10k_items_bookmarks = run_chain(
    llm=gpt_4o_bookmarks,
    prompt=prompts["identify_10k_items"],
    input_data={
        "sec_10k_item_descriptions": sec_10k_item_descriptions,
        "raw_toc": bookmarks_formatted
    },
    output_parser=StrOutputParser(),
    handler=handler_bookmarks,
)

# Using Hyperlinks
identified_10k_items_hyperlinks = run_chain(
    llm=gpt_4o_bookmarks,
    prompt=prompts["identify_10k_items"],
    input_data={
        "sec_10k_item_descriptions": sec_10k_item_descriptions,
        "raw_toc": hyperlinks_formatted
    },
    output_parser=StrOutputParser(),
    handler=handler_hyperlinks,
)

## 3.5 - Identify page ranges

We use GPT-4o-mini because the task is quite simple

In [158]:
# Using Bookmarks
identified_10k_items_with_page_ranges_bookmarks = run_chain(
    llm=gpt_4o_mini_bookmarks,
    prompt=prompts["page_ranges_10k_items"],
    input_data={
        "identified_10k_items": identified_10k_items_bookmarks,
    },
    output_parser=StrOutputParser(),
    handler=handler_bookmarks,
)

# Using Hyperlinks
identified_10k_items_with_page_ranges_hyperlinks = run_chain(
    llm=gpt_4o_mini_hyperlinks,
    prompt=prompts["page_ranges_10k_items"],
    input_data={
        "identified_10k_items": identified_10k_items_hyperlinks,
    },
    output_parser=StrOutputParser(),
    handler=handler_hyperlinks,
)

## 3.6 - Format answer as a Pydantic model

We do this to reduce structured outputs issues. For more info: https://www.llmwatch.com/p/the-downsides-of-structured-outputs

We use GPT-4o-mini because the task is quite simple

In [159]:
json_parser = JsonOutputParser(pydantic_object=Item10KList)

# Using Bookmarks
identified_10k_items_with_page_ranges_formatted_bookmarks = run_chain(
    llm=gpt_4o_mini_bookmarks,
    prompt=prompts["format_10k_items_with_page_ranges"],
    input_data={
        "identified_10k_items_with_page_ranges": identified_10k_items_with_page_ranges_bookmarks,
        "formatting_instructions": json_parser.get_format_instructions()
    },
    output_parser=json_parser,
    handler=handler_bookmarks,
)

# Using Hyperlinks
identified_10k_items_with_page_ranges_formatted_hyperlinks = run_chain(
    llm=gpt_4o_mini_hyperlinks,
    prompt=prompts["format_10k_items_with_page_ranges"],
    input_data={
        "identified_10k_items_with_page_ranges": identified_10k_items_with_page_ranges_hyperlinks,
        "formatting_instructions": json_parser.get_format_instructions()
    },
    output_parser=json_parser,
    handler=handler_hyperlinks,
)

In [160]:
identified_10k_items_with_page_ranges_formatted_bookmarks

{'items': [{'item_id': '1',
   'name': 'Business',
   'page_start': 4,
   'page_end': 7},
  {'item_id': '1A', 'name': 'Risk Factors', 'page_start': 8, 'page_end': 18},
  {'item_id': '1B',
   'name': 'Unresolved Staff Comments',
   'page_start': 19,
   'page_end': 19},
  {'item_id': '2', 'name': 'Properties', 'page_start': 20, 'page_end': 20},
  {'item_id': '3',
   'name': 'Legal Proceedings',
   'page_start': 20,
   'page_end': 20},
  {'item_id': '5',
   'name': 'Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities',
   'page_start': 21,
   'page_end': 22},
  {'item_id': '7',
   'name': 'Management’s Discussion and Analysis of Financial Condition and Results of Operations',
   'page_start': 23,
   'page_end': 28},
  {'item_id': '7A',
   'name': 'Quantitative and Qualitative Disclosures about Market Risk',
   'page_start': 29,
   'page_end': 29},
  {'item_id': '8',
   'name': 'Financial Statements and Supplementary Data',
   'page_

In [161]:
identified_10k_items_with_page_ranges_formatted_hyperlinks

{'items': [{'item_id': '1',
   'name': 'Business',
   'page_start': 4,
   'page_end': 7},
  {'item_id': '1A', 'name': 'Risk Factors', 'page_start': 8, 'page_end': 18},
  {'item_id': '1B',
   'name': 'Unresolved Staff Comments',
   'page_start': 19,
   'page_end': 19},
  {'item_id': '2', 'name': 'Properties', 'page_start': 20, 'page_end': 20},
  {'item_id': '3',
   'name': 'Legal Proceedings',
   'page_start': 20,
   'page_end': 20},
  {'item_id': '5',
   'name': 'Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities',
   'page_start': 21,
   'page_end': 22},
  {'item_id': '7',
   'name': 'Management’s Discussion and Analysis of Financial Condition and Results of Operations',
   'page_start': 23,
   'page_end': 28},
  {'item_id': '7A',
   'name': 'Quantitative and Qualitative Disclosures about Market Risk',
   'page_start': 29,
   'page_end': 29},
  {'item_id': '8',
   'name': 'Financial Statements and Supplementary Data',
   'page_

In [162]:
identified_10k_items_with_page_ranges_formatted_hyperlinks == identified_10k_items_with_page_ranges_formatted_bookmarks

True

# 4 - Divide PDF and assign the corresponding name

In [172]:
def split_pdf_by_sections(json_data, input_pdf_path, output_directory):
    """
    Splits the input PDF into sections based on the provided JSON data and saves them in the specified directory.

    Parameters:
    - json_data: JSON formatted string or dictionary containing section details.
    - input_pdf_path: Path to the input PDF file to be split.
    - output_directory: Directory where the split PDFs will be saved.

    Returns:
    - None
    """

    # Load the JSON data if it's a string
    if isinstance(json_data, str):
        data = json.loads(json_data)
    else:
        data = json_data

    # Create a Path object for the output directory
    output_dir_path = Path(output_directory)

    # Check if the directory already exists
    if output_dir_path.exists():
        print(f"The directory '{output_directory}' already exists. No PDFs will be created.")
        return  # Exit the function if the directory already exists

    # Create the output directory if it doesn't exist
    output_dir_path.mkdir(parents=True, exist_ok=True)

    # Open the input PDF
    with fitz.open(input_pdf_path) as pdf:
        for item in data['items']:
            # Extract section details
            section_name = f"item_{item['item_id']}"
            start_page = item['page_start'] - 1  # Convert to 0-based index
            end_page = item['page_end'] - 1      # Convert to 0-based index

            # Create a new PDF for the section
            section_pdf = fitz.open()

            # Add pages from the original PDF to the new section PDF
            for page_num in range(start_page, end_page + 1):
                section_pdf.insert_pdf(pdf, from_page=page_num, to_page=page_num)

            # Create the full path for the section PDF
            section_pdf_filename = f"{section_name.replace(' ', '_')}.pdf"
            section_pdf_path = output_dir_path / section_pdf_filename

            # Save the section PDF
            section_pdf.save(section_pdf_path)
            section_pdf.close()
            print(f"Created: {section_pdf_path}")

In [174]:
split_pdf_by_sections(
    json_data=identified_10k_items_with_page_ranges_formatted_hyperlinks, 
    input_pdf_path=pdf_path, 
    output_directory=company_name
)

Created: appl_2023/item_1.pdf
Created: appl_2023/item_1A.pdf
Created: appl_2023/item_1B.pdf
Created: appl_2023/item_2.pdf
Created: appl_2023/item_3.pdf
Created: appl_2023/item_5.pdf
Created: appl_2023/item_7.pdf
Created: appl_2023/item_7A.pdf
Created: appl_2023/item_8.pdf
Created: appl_2023/item_9.pdf
Created: appl_2023/item_9A.pdf
Created: appl_2023/item_9B.pdf
Created: appl_2023/item_10.pdf
Created: appl_2023/item_11.pdf
Created: appl_2023/item_12.pdf
Created: appl_2023/item_13.pdf
Created: appl_2023/item_14.pdf
Created: appl_2023/item_15.pdf
Created: appl_2023/item_16.pdf


# 5 - Transform PDF into text and clean it

#### Extract text using PyMUPDF

In [9]:
import fitz  # PyMuPDF

# Load the PDF
pdf_path = "./appl_2023/item_5.pdf"
doc = fitz.open(pdf_path)

# Access the first page
page = doc.load_page(0)  # Page numbering starts at 0

# Extract the text from the page
text = page.get_text()

# Print the text
print(text)

# Close the PDF document
doc.close()


PART II
Item 5. 
Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity 
Securities
The Company’s common stock is traded on The Nasdaq Stock Market LLC under the symbol AAPL.
Holders
As of October 20, 2023, there were 23,763 shareholders of record.
Purchases of Equity Securities by the Issuer and Affiliated Purchasers
Share repurchase activity during the three months ended September 30, 2023 was as follows (in millions, except number of 
shares, which are reflected in thousands, and per-share amounts):
Periods
Total Number
of Shares 
Purchased
Average 
Price
Paid Per 
Share
Total Number 
of Shares
Purchased as 
Part of Publicly
Announced 
Plans or 
Programs
Approximate 
Dollar Value of
Shares That May 
Yet Be Purchased
Under the Plans 
or Programs (1)
July 2, 2023 to August 5, 2023:
Open market and privately negotiated purchases
 
33,864 
$ 
191.62 
 
33,864 
August 6, 2023 to September 2, 2023:
August 2023 ASRs
 
22,085 (2)
(2)
 
22,085 (2)
O

#### Extract text using PYMUDF4LLM

In [None]:
import pymupdf4llm
output = pymupdf4llm.to_markdown("/Users/fernando/Documents/GitHub/10k-processing/appl_2023/item_5.pdf", write_images=True)

print(output)

#### Transform page into image for TableTransformer

In [11]:
import fitz  # PyMuPDF

# Load the PDF document
pdf_path = "./appl_2023/item_5.pdf"
doc = fitz.open(pdf_path)

# Select the page you want to convert (e.g., the first page)
page = doc.load_page(0)  # Page numbering starts from 0

# Render the page into a pixmap (image)
pix = page.get_pixmap()

# Save the image as a PNG file
pix.save("item_5_pg_1.png")

print("PDF page has been converted to an image!")


PDF page has been converted to an image!


Use Table Transformer to extract table (doesnt work well)

https://huggingface.co/spaces/whn09/Table-Structure-Recognition-Demo

# 6 - Filter information not related to the item (before or after)

Processing /Users/fernando/Documents/GitHub/10k-processing/appl_2023/item_5.pdf...


# Item 5. Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities

 The Company’s common stock is traded on The Nasdaq Stock Market LLC under the symbol AAPL.

 Holders

 As of October 20, 2023, there were 23,763 shareholders of record.

 Purchases of Equity Securities by the Issuer and Affiliated Purchasers

 Share repurchase activity during the three months ended September 30, 2023 was as follows (in millions, except number of shares, which are reflected in thousands, and per-share amounts):

**Total Number**

**of Shares** **Approximate**

**Purchased as** **Dollar Value of**

**Average** **Part of Publicly** **Shares That May**

**Total Number** **Price** **Announced** **Yet Be Purchased**

**of Shares** **Paid Per** **Plans or** **Under the Plans**

**Periods** **Purchased** **Share** **Programs** **or Programs [(1)]**

July 2, 2023 to August 5, 2023:

Open market and privately negotiated purchases 33,864 $ 191.62 33,864

August

# 7 - Export result as a series of TXT files