<a href="https://colab.research.google.com/github/asantos2000/master-degree-santos-anderson/blob/main/code/src/chap_6_semantic_annotation_elements_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Annotation - Elements extraction

Extract and identify elements.

Chapter 6. Support tools
- Section 6.2 Implementation of the main components
  - Section 6.2.3 Semantic annotations
    - Section Algorithm "extract / classify elements"

## Google Colab

In [83]:
%load_ext autoreload
%autoreload 2

import sys
import os

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  from google.colab import drive
  drive.mount('/content/drive')
  !rm -rf cfr2sbvr configuration checkpoint
  !git clone https://github.com/asantos2000/master-degree-santos-anderson.git cfr2sbvr
  %pip install -r cfr2sbvr/code/requirements.txt
  !cp -r cfr2sbvr/code/src/configuration .
  !cp -r cfr2sbvr/code/src/checkpoint .
  !cp -r cfr2sbvr/code/config.colab.yaml config.yaml
  DEFAULT_CONFIG_FILE="config.yaml"
else:
  DEFAULT_CONFIG_FILE="../config.yaml"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Imports

In [84]:
# Standard library imports
import json

# Third-party libraries
import pandas as pd
from pydantic import BaseModel, Field
from typing import List, Optional, Tuple, Set

# Local application/library-specific imports
import checkpoint.main as checkpoint
from checkpoint.main import restore_checkpoint, save_checkpoint, Document, DocumentProcessor
import configuration.main as configuration
import logging_setup.main as logging_setup
import llm_query.main as llm_query
from llm_query.main import query_instruct_llm
import token_estimator.main as token_estimator
from token_estimator.main import estimate_tokens

DEV_MODE = True

if DEV_MODE:
    # Development mode
    import importlib
    importlib.reload(configuration)
    importlib.reload(logging_setup)
    importlib.reload(checkpoint)
    importlib.reload(llm_query)
    importlib.reload(token_estimator)

## Settings

Default settings; check them before running the notebook.

### Get configuration

In [85]:
# load config
config = configuration.load_config(DEFAULT_CONFIG_FILE)

config["LLM"]["MODEL"]="o1-preview"

Generated files for analysis in this run

In [86]:
print(config["DEFAULT_CHECKPOINT_FILE"],
config["DEFAULT_EXTRACTION_REPORT_FILE"],
config["DEFAULT_EXCEL_FILE"])

../data/checkpoints/documents-2024-12-05-3.json ../outputs/extraction_report-2024-12-05-1.html ../outputs/compare_items_metrics.xlsx


### Logging configuration

In [87]:
logger = logging_setup.setting_logging(config["DEFAULT_LOG_DIR"], config["LOG_LEVEL"])

2024-12-05 20:55:38 - INFO - Logging is set up with daily rotation.


## Checkpoints

Documents, annotated datasets, statistics, and metrics about the execution of the notebook are stored by the checkpoint module.

Checkpoints are stored in and retrieved from the file named by `DEFAULT_CHECKPOINT_FILE` in the configuration file.

Each section of the notebook restores the checkpoint at its beginning and saves it at its end, so the notebook can be run and restored several times. If a run fails, restore the closest checkpoint and continue from there.
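
The restore-at-start, save-at-end pattern can be sketched with plain JSON files. This is only an illustration: `load_state` and `save_state` are hypothetical stand-ins, not the API of the `checkpoint` module, and the file name is made up. Like the log message in the next cell, the sketch treats a missing or empty file as a fresh checkpoint.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or an empty state if the file is missing or empty."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path, state):
    """Write the state to disk so a later run can resume from it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False, indent=2)


# Hypothetical checkpoint location inside a temporary directory
checkpoint_path = os.path.join(tempfile.mkdtemp(), "documents.json")

# Start of a section: restore (a first run begins with an empty state)
state = load_state(checkpoint_path)
state["§ 275.0-2"] = {"type": "section", "content": "section text"}

# End of the section: persist
save_state(checkpoint_path, state)

# A rerun, or a later section, resumes from what was saved
restored = load_state(checkpoint_path)
```
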

### Restore the checkpoint

In [88]:
# Restore the checkpoint
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-12-05 20:55:38 - ERROR - Checkpoint file '../data/checkpoints/documents-2024-12-05-3.json' not found or is empty, initializing new checkpoint.


## General functions and data structures

In [89]:
def basic_text_stats(text: str) -> Tuple[int, int, int]:
    """
    Computes basic text statistics: number of lines, words, and average words per line.

    Args:
        text (str): The text to analyze.

    Returns:
        Tuple[int, int, int]: A tuple containing the number of lines, total words, and average words per line.
    """
    lines=len(text.split("\n"))
    words=len(text.split(" "))
    avg_words_per_line=round(words/lines)
    return lines, words, avg_words_per_line
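
As a small worked example of these statistics, the snippet below repeats the same three computations on a two-line string and compares the word count with whitespace-aware splitting. The sample text is made up for illustration. Because the function splits on single spaces only, two words separated by a newline are counted as one.

```python
from typing import Tuple


def basic_text_stats(text: str) -> Tuple[int, int, int]:
    # Same computation as in the cell above, repeated so this snippet runs on its own
    lines = len(text.split("\n"))
    words = len(text.split(" "))
    return lines, words, round(words / lines)


sample = "serve process\non the Commission"  # made-up two-line text

lines, words, avg = basic_text_stats(sample)
# split(" ") yields ["serve", "process\non", "the", "Commission"]
whitespace_words = len(sample.split())  # splits on any whitespace, newlines included

print(lines, words, avg, whitespace_words)
```

The word counts logged for the sections are therefore space-separated token counts, which run slightly below a whitespace-based word count for multi-line text.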

In [90]:
def calculate_content_quantities_p1(doc_id, content_data, filename):
    elements = content_data.get("elements", [])
    logger.debug(elements)

    # Collect statistics
    num_elements = len(elements)
    fact_count = 0
    fact_type_count = 0
    rule_count = 0
    verb_count = 0
    term_count = 0

    # Process each element within the document
    for element in elements:
        classification = element.get("classification", "Unknown")
        if classification == "Fact":
            fact_count += 1
        elif classification == "Fact Type":
            fact_type_count += 1
        elif classification == "Rule":
            rule_count += 1
        verb_count += len(element.get("verb_symbols", []))
        term_count += len(element.get("terms", []))

    return {
        "document_id": doc_id,
        "quantity_of_elements": num_elements,
        "quantity_of_facts": fact_count,
        "quantity_of_fact_types": fact_type_count,
        "quantity_of_rules": rule_count,
        "quantity_of_verbs": verb_count,
        "quantity_of_terms": term_count,
        "filename": filename,
    }
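
To make the expected input shape concrete, here is a minimal sketch of the same tally over a hand-written `content_data` dictionary. The sample elements are invented for illustration, and `collections.Counter` replaces the explicit counters, so this is an equivalent formulation rather than the notebook's own code.

```python
from collections import Counter

# Invented sample in the shape the function above reads: a dict with an "elements" list
content_data = {
    "elements": [
        {
            "classification": "Fact Type",
            "terms": [{"term": "Person"}, {"term": "Commission"}],
            "verb_symbols": ["serves"],
        },
        {
            "classification": "Rule",
            "terms": [{"term": "Secretary"}],
            "verb_symbols": ["will forward", "to"],
        },
        {"terms": []},  # no classification: counted as an element, but in no class
    ]
}

elements = content_data.get("elements", [])
by_class = Counter(e.get("classification", "Unknown") for e in elements)

summary = {
    "quantity_of_elements": len(elements),
    "quantity_of_facts": by_class["Fact"],
    "quantity_of_fact_types": by_class["Fact Type"],
    "quantity_of_rules": by_class["Rule"],
    "quantity_of_verbs": sum(len(e.get("verb_symbols", [])) for e in elements),
    "quantity_of_terms": sum(len(e.get("terms", [])) for e in elements),
}
print(summary)
```

Only the exact labels `Fact`, `Fact Type`, and `Rule` are tallied. Any other label, such as `Operative Rule`, adds to the element total but to none of the per-class counts.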

In [91]:
def process_documents_p1(file_path, file_name, doc_ids):
    # Rows for the P1 quantities table
    table_data = []

    with open(file_path, 'r') as file:
        content = json.load(file)

    # Iterate over each document in the file
    for doc_id, content_data in content.items():
        # Use a format string so that both values appear in the log record
        logger.debug("%s: %s", doc_id, content_data)
        # Process only the requested documents that have a 'content' entry
        if doc_id in doc_ids and 'content' in content_data:
            table_data.append(calculate_content_quantities_p1(doc_id, content_data['content'], file_name))

    return table_data


In [92]:
def calculate_content_quantities_p2(doc_id, content_data, filename):
    terms_relationship = content_data['content'].get('terms_relationship', [])
    logger.debug(f"terms_relationship: {terms_relationship}")
    terms = content_data['content']['terms']
    logger.debug(f"terms: {terms}")

    # Count terms with and without definitions
    total_terms = len(terms)
    terms_with_definition = sum(1 for term in terms if term.get('definition'))
    terms_without_definition = total_terms - terms_with_definition

    # Check for term relationships and count them
    terms_relationship_count = len(terms_relationship)

    # Add data to table
    return {
        "document_id": doc_id,
        "count_of_terms": total_terms,
        "terms_with_definition": terms_with_definition,
        "terms_without_definition": terms_without_definition,
        "terms_relationship_count": terms_relationship_count,
        "filename": filename
    }
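
A short sketch of the same counts on an invented P2 document may help. Note that `term.get('definition')` is a truthiness test: `None` or an empty string counts as no definition, while the literal string `"missing"` (the placeholder that some prompts in this notebook ask for) would count as a definition.

```python
# Invented sample in the shape the function above reads
content_data = {
    "content": {
        "terms": [
            {"term": "Managing agent", "definition": "A person who directs or manages an organization."},
            {"term": "Commission", "definition": None},
            {"term": "Secretary", "definition": ""},
            {"term": "Person", "definition": "missing"},
        ],
        "terms_relationship": [
            {"term_1": "Secretary", "term_2": "Secretary of the Commission", "relation": "Synonym"},
        ],
    }
}

terms = content_data["content"]["terms"]
relationships = content_data["content"].get("terms_relationship", [])

# Truthiness test: None and "" are treated as "no definition"
with_definition = sum(1 for term in terms if term.get("definition"))
without_definition = len(terms) - with_definition

print(len(terms), with_definition, without_definition, len(relationships))
```
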

Pydantic models describing the LLM response for elements extraction (P1).

In [93]:
class Item(BaseModel):
    term: str = Field(..., description="The term is a word or a group of words that represents a specific concept, entity, or subject in a particular context")
    classification: str = Field(..., description="The classification of the term, either 'Common Noun' or 'Proper Noun'.")
    confidence: float = Field(..., description="The confidence score of the classification.")
    reason: Optional[str] = Field(None, description="The reason for the confidence score.")
    extracted_confidence: float = Field(..., description="The confidence scores of the terms extracted from the statement.")
    extracted_reason: Optional[str] = Field(None, description="The reasons for the confidence scores of the terms extracted from the statement.")

class Element(BaseModel):
    id: int = Field(..., description="A unique numeric identifier for each fact, fact type, or rule.")
    title: str = Field(..., description="The title for statement.")
    statement: str = Field(..., description="The full statement or phrase representing the fact, fact type, or rule.")
    terms: List[Item] = Field(..., description="A list of terms involved in the fact, fact type, or rule.")
    verb_symbols: List[str] = Field(..., description="A list of verbs, verb phrases or prepositions connecting the terms.")
    verb_symbols_extracted_confidence: List[float] = Field(..., description="The confidence scores of the verb symbols extracted from the statement.")
    verb_symbols_extracted_reason: List[str] = Field(..., description="The reasons for the confidence scores of the verb symbols extracted from the statement.")
    classification: str = Field(..., description="Indicates whether the statement is classified as 'Fact', 'Fact Type', or 'Operative Rule'.")
    confidence: float = Field(..., description="The confidence score of the classification.")
    reason: Optional[str] = Field(None, description="The reason for the confidence score.")
    sources: List[str] = Field(..., description="The paragraph ID of the document where the fact, fact type, or rule is located (e.g., '(a)', '(b)(2)').")

class ElementsDocumentModel(BaseModel):
    section: str = Field(..., description="The section ID of the document.")
    summary: str = Field(..., description="The summary of the document.")
    elements: List[Element] = Field(..., description="A list of facts, fact types, and rules extracted from the document.")

Pydantic models describing the LLM response for terms extraction (P2).

In [94]:
class Term(BaseModel):
    term: str = Field(..., description="The term is a word or a group of words that represents a specific concept, entity, or subject in a particular context")
    definition: Optional[str] = Field(None, description="Definition is an explanation or description of the meaning of the term.")
    confidence: float = Field(..., description="The confidence score of the definition extracted from the document.")
    reason: Optional[str] = Field(None, description="The reason for the confidence score.")
    isLocalScope: bool = Field(..., description="Indicates whether the statement is a local scope or not.")
    local_scope_confidence: float = Field(..., description="The confidence score of the local scope.")
    local_scope_reason: Optional[str] = Field(None, description="The reason for the local scope confidence score.")

class TermsRelationship(BaseModel):
    term_1: str = Field(..., description="First term in the relationship.")
    term_2: str = Field(..., description="Second term in the relationship.")
    relation: str = Field(..., description="The type of relationship between the terms.")
    confidence: float = Field(..., description="The confidence score of the relationship extracted from the document.")
    reason: Optional[str] = Field(None, description="The reason for the confidence score.")

class TermsDocumentModel(BaseModel):
    terms: List[Term] = Field(..., description="A list of terms.")
    terms_relationship: List[TermsRelationship] = Field(..., description="A list of relationships between terms.")

In [95]:
def extract_unique_terms(document: ElementsDocumentModel) -> List[str]:
    """
    Extracts unique terms from the 'terms' attribute of elements within an ElementsDocumentModel instance.

    Args:
        document (ElementsDocumentModel): The document containing elements, each with a list of terms.

    Returns:
        List[str]: A list of unique terms found across all elements in the document.

    This function iterates through each element of the document, accesses the terms list in each element, and collects
    the unique terms. It uses a set to ensure that the terms are unique before converting it back to a list for the output.
    """

    # Initialize a set to store unique terms
    unique_terms: Set[str] = set()

    # Loop through each element in the 'elements' list of the document
    for element in document.elements:
        # Loop through the 'terms' list in each element
        for term_info in element.terms:
            # Add the term to the set
            unique_terms.add(term_info.term)

    # Convert the set to a list and return it
    return list(unique_terms)
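
One property of this function is worth illustrating: a Python `set` does not preserve insertion order, so the returned list can come back in a different order from one run to the next. If a stable order matters, for example when comparing runs, `dict.fromkeys` deduplicates while keeping first-seen order. The sketch below is a hypothetical variant, not the notebook's function; it uses `SimpleNamespace` objects as stand-ins for the Pydantic models, and the sample terms are invented.

```python
from types import SimpleNamespace
from typing import List


def extract_unique_terms_ordered(document) -> List[str]:
    """Collect unique terms in first-seen order (hypothetical variant of the function above)."""
    terms = (term_info.term for element in document.elements for term_info in element.terms)
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(terms))


# Stand-in objects exposing only the attributes the function reads
document = SimpleNamespace(
    elements=[
        SimpleNamespace(terms=[SimpleNamespace(term="Person"), SimpleNamespace(term="Commission")]),
        SimpleNamespace(terms=[SimpleNamespace(term="Commission"), SimpleNamespace(term="Secretary")]),
    ]
)

unique_terms = extract_unique_terms_ordered(document)
print(unique_terms)
```
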

## Datasets

Datasets used in the notebook. They are divided into sections and true tables. The sections are the documents from the CFR, and the true tables are annotated, or "golden", datasets.

### Get section from KG CFR
Due to mistakes in the original dataset, we need to correct it. This function will not be used in the final version; instead, we will use variables (document_02, document_05, document_07) from the original dataset.

Texts to extract the elements: CFR Sections 275.0-2, 275.0-5, 275.0-7

#### Section 275.0-2

In [96]:
manager.add_document(
    Document(
        id="§ 275.0-2",
        type="section",
        content="""
§ 275.0-2 General procedures for serving non-residents.
(a) General procedures for serving process, pleadings, or other papers on non-resident investment advisers, general partners and managing agents.  Under Forms ADV and ADV-NR [17 CFR 279.1 and 279.4], a person may serve process, pleadings, or other papers on a non-resident investment adviser, or on a non-resident general partner or non-resident managing agent of an investment adviser by serving any or all of its appointed agents:
  (1) A person may serve a non-resident investment adviser, non-resident general partner, or non-resident managing agent by furnishing the Commission with one copy of the process, pleadings, or papers, for each named party, and one additional copy for the Commission's records.
  (2) If process, pleadings, or other papers are served on the Commission as described in this section, the Secretary of the Commission (Secretary) will promptly forward a copy to each named party by registered or certified mail at that party's last address filed with the Commission.
  (3) If the Secretary certifies that the Commission was served with process, pleadings, or other papers pursuant to paragraph (a)(1) of this section and forwarded these documents to a named party pursuant to paragraph (a)(2) of this section, this certification constitutes evidence of service upon that party.
(b) Definitions.  For purposes of this section:
  (1) Managing agent  means any person, including a trustee, who directs or manages, or who participates in directing or managing, the affairs of any unincorporated organization or association other than a partnership.
  (2) Non-resident  means:
    (i) An individual who resides in any place not subject to the jurisdiction of the United States;
    (ii) A corporation that is incorporated in or that has its principal office and place of business in any place not subject to the jurisdiction of the United States; and
    (iii) A partnership or other unincorporated organization or association that has its principal office and place of business in any place not subject to the jurisdiction of the United States.
  (3) Principal office and place of business  has the same meaning as in § 275.203A-3(c) of this chapter.
"""
    )
)

In [97]:
docs = manager.list_document_ids(doc_type="section")

for doc in docs:
    text = manager.retrieve_document(doc, "section").content
    logger.info(f"Document ID: {doc}")
    paragraphs, words, avg_word_per_paragraph = basic_text_stats(text)
    tokens = estimate_tokens(text)
    logger.info(f"Section paragraphs: {paragraphs}, words: {words}, avg_word_per_paragraph: {avg_word_per_paragraph}, tokens: {tokens}")


2024-12-05 20:55:38 - INFO - Document ID: § 275.0-2
2024-12-05 20:55:38 - INFO - Section paragraphs: 14, words: 362, avg_word_per_paragraph: 26, tokens: 481


#### Section 275.0-5

In [98]:
manager.add_document(
    Document(
        id="§ 275.0-5",
        type="section",
        content="""
§ 275.0-5 Procedure with respect to applications and other matters.
The procedure hereinbelow set forth will be followed with respect to any proceeding initiated by the filing of an application, or upon the Commission's own motion, pursuant to any section of the Act or any rule or regulation thereunder, unless in the particular case a different procedure is provided:
(a) Notice of the initiation of the proceeding will be published in the Federal Register and will indicate the earliest date upon which an order disposing of the matter may be entered. The notice will also provide that any interested person may, within the period of time specified therein, submit to the Commission in writing any facts bearing upon the desirability of a hearing on the matter and may request that a hearing be held, stating his reasons therefor and the nature of his interest in the matter.
(b) An order disposing of the matter will be issued as of course following the expiration of the period of time referred to in paragraph (a) of this section, unless the Commission thereafter orders a hearing on the matter.
(c) The Commission will order a hearing on the matter, if it appears that a hearing is necessary or appropriate in the public interest or for the protection of investors,
  (1) upon the request of any interested person or
  (2) upon its own motion.
(d) Definition of application. For purposes of this rule, an “application” means any application for an order of the Commission under the Act other than an application for registration as an investment adviser.
"""
    )
)

#### Section 275.0-7

In [99]:
manager.add_document(
    Document(
        id="§ 275.0-7",
        type="section",
        content="""
§ 275.0-7 Small entities under the Investment Advisers Act for purposes of the Regulatory Flexibility Act.
(a) For purposes of Commission rulemaking in accordance with the provisions of Chapter Six of the Administrative Procedure Act (5 U.S.C. 601 et seq.) and unless otherwise defined for purposes of a particular rulemaking proceeding, the term small business or small organization for purposes of the Investment Advisers Act of 1940 shall mean an investment adviser that:
  (1) Has assets under management, as defined under Section 203A(a)(3) of the Act (15 U.S.C. 80b-3a(a)(2)) and reported on its annual updating amendment to Form ADV (17 CFR 279.1), of less than $25 million, or such higher amount as the Commission may by rule deem appropriate under Section 203A(a)(1)(A) of the Act (15 U.S.C. 80b-3a(a)(1)(A));
  (2) Did not have total assets of $5 million or more on the last day of the most recent fiscal year; and
  (3) Does not control, is not controlled by, and is not under common control with another investment adviser that has assets under management of $25 million or more (or such higher amount as the Commission may deem appropriate), or any person (other than a natural person) that had total assets of $5 million or more on the last day of the most recent fiscal year.
(b) For purposes of this section:
  (1) Control  means the power, directly or indirectly, to direct the management or policies of a person, whether through ownership of securities, by contract, or otherwise.
    (i) A person is presumed to control a corporation if the person:
      (A) Directly or indirectly has the right to vote 25 percent or more of a class of the corporation's voting securities; or
      (B) Has the power to sell or direct the sale of 25 percent or more of a class of the corporation's voting securities.
    (ii) A person is presumed to control a partnership if the person has the right to receive upon dissolution, or has contributed, 25 percent or more of the capital of the partnership.
    (iii) A person is presumed to control a limited liability company (LLC) if the person:
      (A) Directly or indirectly has the right to vote 25 percent or more of a class of the interests of the LLC;
      (B) Has the right to receive upon dissolution, or has contributed, 25 percent or more of the capital of the LLC; or
      (C) Is an elected manager of the LLC.
    (iv) A person is presumed to control a trust if the person is a trustee or managing agent of the trust.
  (2) Total assets  means the total assets as shown on the balance sheet of the investment adviser or other person described above under paragraph (a)(3) of this section, or the balance sheet of the investment adviser or such other person with its subsidiaries consolidated, whichever is larger.
"""
    )
)

### True tables

True tables are annotated or "golden" datasets in which entities have been manually identified and labeled within the original source data.

True tables for sections 275.0-2, 275.0-5, and 275.0-7.

Load the true tables for P1 (element extraction and classification, terms, and verb symbols) and for P2 (term definitions and synonyms).

In [100]:
with open(f"{config['DEFAULT_DATA_DIR']}/documents_true_table.json", 'r') as file:
    data = json.load(file)

    # Register the P1 true tables first, then the P2 true tables, for each section
    for process in ("P1", "P2"):
        for section in ("§ 275.0-2", "§ 275.0-5", "§ 275.0-7"):
            manager.add_document(
                Document.model_validate(data[f"{section}_{process}|true_table"])
            )

### Save checkpoint

In [101]:
# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-12-05 20:55:39 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-12-05-3.json
2024-12-05 20:55:39 - INFO - Checkpoint saved.


## Prompt engineering

The prompt structure is based on [1]. It is a zero-shot prompt that follows the chain-of-thought concept.

The following approaches were taken.

### 1. Facts and fact types
Try to extract all facts and fact types from a given document.

This approach was successful. It focuses on extracting the elements and achieves the best results, similar to approach 3.

In [102]:
system_prompt_facts = """

You are tasked with extracting **facts**, **fact types**, and their **relationships** from a given document. Follow these steps carefully:

#### Steps to Perform:

1. **Identify Facts and Fact Types**:
   - A **fact** is a specific instance or statement that describes an event or condition.
   - A **fact type** is a general template or relationship that defines how entities interact.
   - For each fact or fact type:
     - Extract the **statement** that represents the fact or fact type.
     - List the **terms** (Nouns or Proper nouns) involved in the fact or fact type.
     - Identify the **fact symbols** (verbs, verb phrases, or prepositions) connecting the terms.
     - Classify the statement as either a **Fact** or **Fact Type**.
     - Note the section or paragraph where the fact or fact type appears as the **source**.

2. **Classify Terms**:
   - For each fact or fact type, classify all **terms**:
     - Label each term as either a **Noun** or **Proper Noun**.
   - Ensure that the terms are extracted accurately and classified correctly.

3. **Define term**:
   - For each term look in the document for the term definition. If the term definition is not found, use "missing".:

4. **Identify Fact Symbols**:
   - Extract the verbs or prepositions that define the relationships between the terms. These are referred to as **fact symbols**.
   - Each fact or fact type should have a list of fact symbols.

5. **Source Information**:
   - Record the paragraph or section of the document where each fact or fact type is found as **source** information (e.g., “(a)(1)”, “(b)”).

6. **Recognize Term Relationships**:
   - Identify relationships between terms:
     - **Synonyms**: Terms that can be used interchangeably without changing the meaning.
     - **Hypernym-Hyponym**: A broader term (hypernym) that includes a more specific term (hyponym).
   - For each pair of terms:
     - Identify the relationship (either "Synonym" or "Hypernym-Hyponym").
     - Ensure that both terms involved in the relationship are valid terms from the document.

7. **Structure the Output in JSON Format**:
   - Create a JSON object with the following structure:
     - **facts_and_fact_types**: A list of dictionaries, where each dictionary contains:
       - **id**: A unique identifier for the fact or fact type.
       - **statement**: The extracted fact or fact type.
       - **terms**: A list of dictionaries, where each dictionary has a term and its classification (either "Noun" or "Proper Noun").
       - **fact_symbols**: A list of verb phrases or prepositions connecting the terms.
       - **classification**: Either "Fact" or "Fact Type".
       - **source**: The section or paragraph where the fact or fact type appears.
     - **terms_relationship**: A list of dictionaries, where each dictionary contains:
       - **terms**: A list of two related terms.
       - **relation**: Either "Synonym" or "Hypernym-Hyponym".

#### Example Output:

```json
{
  "facts_and_fact_types": [
    {
      "id": 1,
      "statement": "A person serves a non-resident investment adviser by furnishing the Commission with process, pleadings, or papers.",
      "terms": [
        {"Person": "Noun"},
        {"Non-resident investment adviser": "Noun"},
        {"Commission": "Proper Noun"},
        {"Process": "Noun"},
        {"Pleadings": "Noun"},
        {"Papers": "Noun"}
      ],
      "fact_symbols": ["serves", "by furnishing", "with"],
      "classification": "Fact Type",
      "sources": ["(a)"]
    }
  ],
  "terms_relationship": [
    {
      "terms": [
        "Principal office",
        "Place of business"
      ],
      "relation": "Synonym"
    }
  ]
}
```

#### Guidelines:
- Be precise in identifying **terms** and **fact symbols**.
- Classify the relationships between terms accurately as **Synonym** or **Hypernym-Hyponym**.
- Ensure the final output adheres to the specified JSON structure.

#### Start of the document
"""

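Step 7 of this prompt fixes the shape of the response, so a reply can be checked mechanically before it is used. The function below is a minimal sketch of such a check and is not part of the notebook's pipeline: `check_facts_response` is a hypothetical helper, and it accepts both `source` and `sources` because the step list and the example output in the prompt use different names for that field. The sample response is invented.

```python
import json

ALLOWED_CLASSIFICATIONS = {"Fact", "Fact Type"}
ALLOWED_RELATIONS = {"Synonym", "Hypernym-Hyponym"}


def check_facts_response(raw: str) -> list:
    """Return a list of problems found in a response; an empty list means it looks well formed."""
    data = json.loads(raw)
    problems = []
    for item in data.get("facts_and_fact_types", []):
        label = f"item {item.get('id')}"
        for key in ("statement", "terms", "fact_symbols", "classification"):
            if key not in item:
                problems.append(f"{label}: missing '{key}'")
        if "source" not in item and "sources" not in item:
            problems.append(f"{label}: missing source")
        if item.get("classification") not in ALLOWED_CLASSIFICATIONS:
            problems.append(f"{label}: unexpected classification {item.get('classification')!r}")
    for rel in data.get("terms_relationship", []):
        if rel.get("relation") not in ALLOWED_RELATIONS:
            problems.append(f"unexpected relation {rel.get('relation')!r}")
        if len(rel.get("terms", [])) != 2:
            problems.append("a relationship must name exactly two terms")
    return problems


# Invented response with one deliberate error: "Rule" is not a valid class in approach 1
response = json.dumps({
    "facts_and_fact_types": [{
        "id": 1,
        "statement": "A person serves a non-resident investment adviser.",
        "terms": [{"Person": "Noun"}],
        "fact_symbols": ["serves"],
        "classification": "Rule",
        "sources": ["(a)"],
    }],
    "terms_relationship": [
        {"terms": ["Principal office", "Place of business"], "relation": "Synonym"}
    ],
})

problems = check_facts_response(response)
print(problems)
```
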
### 2. Facts, fact types, rules, and terms with definitions

Try to extract all facts, fact types, rules, and terms with definitions from a given document. Try to extract the relationships for each term as well.

**Results**

The results are fairly consistent, but the prompt failed to extract term definitions, even when the definition was clear in the text, as in document 275.0-7 in the fragment "... the **term** small business or small organization for purposes of the Investment Advisers Act of 1940 shall **mean** an investment adviser that: ...". The prompt failed to define small business and small organization, which is the main purpose of the document. It also failed to recognize that small business and small organization are synonyms.

In [103]:
system_prompt_v1 = """
You are tasked with extracting **facts**, **fact types**, **rules**, and their **relationships** from a given document. Follow these steps carefully:

<steps>

1. Summarize the document. Use the summary to verify if all important facts, fact types, and rules are present.

2. **Identify Facts, Fact Types, and Rules**:
   - A **fact** is a specific instance or statement that describes an event or condition. Facts are statements of truth without any directive element. They are often associated with relationships between terms or entities. e.g., "John works for X Inc.".
   - A **fact type** is a general, abstract template that describes the potential relationships between terms or entities. It serves as a model for generating specific facts. e.g., "Person works for Company".
   - A **rule** is generally defined as a statement that governs or constrains some aspect of the business. It specifies what must be done or what is not allowed, often guiding actions, decisions, and behaviors within an organization. Rules enforce compliance, limit possibilities, or prescribe specific behaviors in response to business situations. e.g., "A customer must provide identification before opening an account.".
   - For each fact, fact type, or rule:
     - Extract the **statement** that represents the fact, fact type, or rule.
     - List the **terms** involved in the fact, fact type, or rule.
     - Identify the **verb symbols** (verbs, verb phrases, or prepositions) connecting the terms.
     - Classify the statement as either a **Fact**, **Fact Type**, or **Rule**.
     - Note the section or paragraph where the fact, fact type, or rule appears as the **source**.
     - For each term look in the document for the term definition. If the term definition is not found, use "missing".:

3. Classify Terms:
   - For each fact, fact type, or rule classify all **terms**:
     - Label each term as either a **Common Noun** or **Proper Noun**.
   - Ensure that the terms are extracted accurately and classified correctly.

4. Define term:
   - For each term look in the document for the term definition, explaining, or meaning. If the term definition is not found, use "missing".:

4. Identify Verb Symbols:
   - Extract the verbs or prepositions that define the relationships between the terms. These are referred to as **verb symbols**.
   - Each fact, fact type, or rule should have a list of verb symbols.

5. Source Information:
   - Record the paragraph or section of the document where each fact, fact type, or rule is found as **source** information (e.g., "(a)(1)", "(b)").

6. Recognize term relationships:
   - Identify relationships between terms:
     - **Synonyms**: Terms that can be used interchangeably without changing the meaning.
     - **Hypernym-Hyponym**: A broader term (hypernym) that includes a more specific term (hyponym).
   - For each pair of terms:
     - Identify the relationship (either "Synonym" or "Hypernym-Hyponym").
     - Ensure that both terms involved in the relationship are valid terms from the document.

7. Answer only with the output example structure in JSON format. All the values are optional.

<output_example>

```json
{
  "section": "§ 123.4-5",
  "elements": [
    {
      "id": 1,
      "statement": "A person serves a non-resident investment adviser by furnishing the Commission with process, pleadings, or papers.",
      "terms": [
        {
            "term": "Person",
            "classification": "Noun",
            "definition": "missing"
        },
      ...
      ],
      "verb_symbols": ["serves", "by furnishing", "with"],
      "classification": "Fact Type",
      "sources": ["(a)"]
    }
  ],
  "terms_relationship": [
    {
      "terms": [
        "Principal office",
        "Place of business"
      ],
      "relation": "Synonym"
    }
  ]
},
...
```
</output_example>

</steps>
"""

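Because the output example in this prompt is wrapped in a Markdown `json` fence, a model that follows it closely may return its answer fenced as well. The snippet below is a minimal sketch of a tolerant parser, assuming the reply is either bare JSON or a single fenced block; `parse_llm_json` is a hypothetical helper, not a function from the `llm_query` module, and the sample reply is invented. Placeholders such as `...` in the output example are not valid JSON, so a reply that copies them literally would still fail in `json.loads`.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to avoid a literal fence inside this snippet
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(reply: str) -> dict:
    """Parse a reply that is either bare JSON or JSON inside one fenced block."""
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload)


# Invented reply in the shape of the output example, wrapped in a json fence
reply = FENCE + 'json\n{"section": "§ 275.0-2", "elements": [], "terms_relationship": []}\n' + FENCE

parsed = parse_llm_json(reply)
bare = parse_llm_json('{"section": "§ 275.0-5"}')
print(parsed["section"], bare["section"])
```
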
v2 is a variation of v1 with a more concise description of the steps and a different organization of the text. The results are the same, but some statements were misclassified.

In [104]:
system_prompt_v2 = """
Extract facts, fact types, and their relationships from a given document, and structure the output in a specified JSON format.

Follow the steps to identify and classify statements, using document details to find definitions and source information.

# Steps

1. **Summarize the Document:**
   - Provide a summary to ensure the completeness of identified facts, fact types, and rules.

2. **Identify Facts, Fact Types, and Rules:**
   - Define and extract each:
     - **Fact:** Instance or statement of event/condition, e.g., "John works for X Inc."
     - **Fact Type:** Template for relationships, e.g., "Person works for Company."
     - **Rule:** Governing statement, e.g., "A customer must provide identification before opening an account."
   - For each, document:
     - **Statement**
     - **Terms** involved
     - **Verb Symbols** connecting the terms
     - **Classification** as Fact, Fact Type, or Rule
     - **Source** paragraph or section in the document

3. **Classify Terms:**
   - Classify each term as **Common Noun** or **Proper Noun**.

4. **Define Term:**
   - Locate definitions for terms in the document, or mark as "missing."

5. **Identify Verb Symbols:**
   - Extract verbs or prepositions (verb symbols) that define term relationships.

6. **Source Information:**
   - Note the document source (section/paragraph) for each statement.

7. **Recognize Term Relationships:**
   - Identify pairs of terms with relationships:
     - **Synonyms:** interchangeable terms.
     - **Hypernym-Hyponym:** broader (hypernym) includes more specific (hyponym).
   - Ensure relationship validity using document terms.

# Output Format

Produce a structured JSON format based on the specified template. Ensure all necessary fields are populated accurately, even if some fields are optional or marked as "missing".

# Examples

**Example JSON Structure:**

```json
{
  "section": "§ 123.4-5",
  "elements": [
    {
      "id": 1,
      "statement": "A person serves a non-resident investment adviser by furnishing the Commission with process, pleadings, or papers.",
      "terms": [
        {
            "term": "Person",
            "classification": "Noun",
            "definition": "missing"
        },
        // Additional terms...
      ],
      "verb_symbols": ["serves", "by furnishing", "with"],
      "classification": "Fact Type",
      "sources": ["(a)"]
    }
  ],
  "terms_relationship": [
    {
      "terms": [
        "Principal office",
        "Place of business"
      ],
      "relation": "Synonym"
    }
  ]
}
```

# Notes

- Ensure extracted statements are fully detailed and clearly classified.
- Pay careful attention to identifying and classifying terms accurately.
- Follow the precise JSON format for all outputs, populating fields as required.
"""


v3 returns to the v1 wording and changes only the organization of the text.

**Results**

The results are the same as for v1 and v2: 5 elements and 16 terms were extracted, with 2 definitions.

In [105]:
system_prompt_v3 = """
You are tasked with extracting elements and **relationships** from a given legal document. Please follow these steps carefully and ensure all instructions are adhered to:

**Steps**:

1. **Summarize the document**:
   - Summarize the document to understand its purpose and use it to verify if all important terms, term definitions, facts, fact types, and rules are identified in subsequent steps.

2. **Identify Facts, Fact Types, and Rules**:
   - **Definitions**:
     - **Fact**: A specific instance or statement that describes an event or condition without any directive element. Facts often involve relationships between terms or entities. Example: "John works for X Inc."
     - **Fact Type**: A general, abstract template that describes potential relationships between terms or entities, serving as a model for generating specific facts. Example: "Person works for Company."
     - **Rule**: A statement that governs or constrains some aspect of the business, specifying what must be done or what is not allowed. Rules enforce compliance, limit possibilities, or prescribe specific behaviors in response to business situations. Example: "A customer must provide identification before opening an account."
   - **For each fact, fact type, or rule**:
     - **Extract the statement**: Identify the exact statement or phrase from the document representing the fact, fact type, or rule.
     - **Extract Terms**: List all the terms involved in the statement.
     - **Extract Verb Symbols**: Identify verbs, verb phrases, or prepositions that connect the terms in the statement.
     - **Classification**: Classify the statement as either a **Fact**, **Fact Type**, or **Rule**.
     - **Source**: Note the specific paragraph or section of the document where the statement is found (e.g., "(a)(1)", "(b)").

3. **Classify Terms**:
   - For each term extracted classify it as either a **Common Noun** or a **Proper Noun**.

4. **Define Terms**:
   - For each term:
     - Search the entire document for the term's definition, explanation, or meaning. Also, look in the document summary.
     - If the definition is found, include it.
     - If the definition is not found in the document, use **None**.

5. **Identify Relationships Between Terms**:
   - **Types of Relationships**:
     - **Synonym**: Terms that can be used interchangeably without changing the meaning.
     - **Hypernym-Hyponym**: A broader term (hypernym) that includes a more specific term (hyponym).
   - **For each pair of terms in the document**:
     - Identify if a relationship exists as either "Synonym" or "Hypernym-Hyponym".
     - Only include relationships where both terms are present in the document.

6. **Provide JSON Output**:
   - Format your answer as per the output example below.
   - **All values are optional**: Include as much information as is available based on the document.
   - **Do not include any additional text or explanation outside the JSON structure**.

**Output Example**:

```json
{
  "section": "§ 123.4-5",
  "elements": [
    {
      "id": 1,
      "statement": "A person serves a non-resident investment adviser by furnishing the Commission with process, pleadings, or papers.",
      "terms": [
        {
          "term": "Person",
          "classification": "Common Noun",
          "definition": "An individual or legal entity."
        },
        {
          "term": "Non-resident investment adviser",
          "classification": "Common Noun",
          "definition": null
        },
        ...
      ],
      "verb_symbols": ["serves", "by furnishing", "with"],
      "classification": "Fact Type",
      "sources": ["(a)"]
    },
    ...
  ],
  "terms_relationship": [
    {
      "terms": [
        "Principal office",
        "Place of business"
      ],
      "relation": "Synonym"
    },
    {
      "terms": [
        "Person",
        "Individual"
      ],
      "relation": "Synonym"
    },
    ...
  ]
}
```
"""

In [106]:
response_prompt_v3 = {
  "section": "§ 275.0-7",
  "elements": [
    {
      "id": 1,
      "statement": "An investment adviser that has assets under management of less than $25 million is considered a small business for the purposes of the Investment Advisers Act of 1940.",
      "terms": [
        {
          "term": "Investment adviser",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Assets under management",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "$25 million",
          "classification": "Proper Noun",
          "definition": None
        },
        {
          "term": "Small business",
          "classification": "Common Noun",
          "definition": None
        }
      ],
      "verb_symbols": ["has", "is considered"],
      "classification": "Fact Type",
      "sources": ["(a)(1)"]
    },
    {
      "id": 2,
      "statement": "An investment adviser is considered a small organization if it did not have total assets of $5 million or more on the last day of the most recent fiscal year.",
      "terms": [
        {
          "term": "Investment adviser",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Total assets",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "$5 million",
          "classification": "Proper Noun",
          "definition": None
        },
        {
          "term": "Small organization",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Fiscal year",
          "classification": "Common Noun",
          "definition": None
        }
      ],
      "verb_symbols": ["did not have", "is considered"],
      "classification": "Fact Type",
      "sources": ["(a)(2)"]
    },
    {
      "id": 3,
      "statement": "An investment adviser is not considered a small business if it controls, is controlled by, or is under common control with another investment adviser that has assets under management of $25 million or more.",
      "terms": [
        {
          "term": "Investment adviser",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Control",
          "classification": "Common Noun",
          "definition": "The power, directly or indirectly, to direct the management or policies of a person."
        },
        {
          "term": "Common control",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "$25 million",
          "classification": "Proper Noun",
          "definition": None
        },
        {
          "term": "Small business",
          "classification": "Common Noun",
          "definition": None
        }
      ],
      "verb_symbols": ["controls", "is controlled by", "is under"],
      "classification": "Rule",
      "sources": ["(a)(3)"]
    },
    {
      "id": 4,
      "statement": "Control means the power, directly or indirectly, to direct the management or policies of a person, whether through ownership of securities, by contract, or otherwise.",
      "terms": [
        {
          "term": "Control",
          "classification": "Common Noun",
          "definition": "The power, directly or indirectly, to direct the management or policies of a person."
        },
        {
          "term": "Person",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Securities",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Contract",
          "classification": "Common Noun",
          "definition": None
        }
      ],
      "verb_symbols": ["means", "to direct", "whether through"],
      "classification": "Fact",
      "sources": ["(b)(1)"]
    },
    {
      "id": 5,
      "statement": "A person is presumed to control a corporation if the person has the right to vote 25 percent or more of a class of the corporation's voting securities.",
      "terms": [
        {
          "term": "Person",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Corporation",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "Voting securities",
          "classification": "Common Noun",
          "definition": None
        },
        {
          "term": "25 percent",
          "classification": "Proper Noun",
          "definition": None
        }
      ],
      "verb_symbols": ["is presumed", "to control", "has the right to vote"],
      "classification": "Fact Type",
      "sources": ["(b)(1)(i)(A)"]
    }
  ],
  "terms_relationship": [
    {
      "terms": [
        "Investment adviser",
        "Small business"
      ],
      "relation": "Hypernym-Hyponym"
    },
    {
      "terms": [
        "Investment adviser",
        "Small organization"
      ],
      "relation": "Hypernym-Hyponym"
    }
  ]
}


In [107]:
# Restore checkpoint
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-12-05 20:55:39 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-12-05-3.json
2024-12-05 20:55:39 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-12-05-3.json.


In [108]:
len(response_prompt_v3["elements"]), len(response_prompt_v3["terms_relationship"])

(5, 2)
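The term and definition counts quoted in the results can be checked directly against a response dictionary. The following is a minimal sketch, assuming the response follows the v3 output layout; `count_terms` and `sample` are illustrative names that do not exist in the notebook.

```python
from typing import Tuple


def count_terms(response: dict) -> Tuple[int, int]:
    """Return (number of unique terms, number of term entries with a definition)."""
    unique_terms = set()
    definition_entries = 0
    for element in response.get("elements", []):
        for term in element.get("terms", []):
            unique_terms.add(term.get("term"))
            # A definition counts only when it is not None (null in JSON)
            if term.get("definition") is not None:
                definition_entries += 1
    return len(unique_terms), definition_entries


# Illustrative data in the same layout as the response above (not real output)
sample = {
    "elements": [
        {
            "terms": [
                {"term": "Control", "definition": "The power to direct."},
                {"term": "Person", "definition": None},
            ]
        },
        {"terms": [{"term": "Control", "definition": "The power to direct."}]},
    ]
}

print(count_terms(sample))
```

The same function could be applied to `response_prompt_v3`. Terms repeated across elements are counted once, while definitions are counted per entry.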

### 3. facts, fact types, rules, and terms

Extract all facts, fact types, rules, and terms from a given document, without definitions and without the relationships between terms.

This approach is very similar to the previous one, but it focuses on extracting the elements. It is divided into two parts:
- Extract the elements
- Extract the definitions and relationships

**Results**

The results are consistent: 7 elements and 21 terms with definitions are extracted. In contrast, the previous approach extracted 5 elements and 16 terms with 2 definitions. That is an improvement of 40% in extracted facts and rules, 31% in extracted terms, and 950% in extracted definitions (21 instead of 2, or 10.5 times as many).

The elements are extracted in the first part. For the second part, the results are much better than with the previous approach: more definitions and relationships are extracted.
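The improvement figures follow from the raw counts. A minimal sketch of the arithmetic, using a hypothetical helper `percent_increase`; note that a ratio of 10.5 times corresponds to a relative increase of 950%.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Counts reported above: (previous approach, two-part approach)
counts = {
    "elements": (5, 7),
    "terms": (16, 21),
    "definitions": (2, 21),
}

for name, (before, after) in counts.items():
    increase = percent_increase(before, after)
    ratio = after / before
    print(f"{name}: {increase:.2f}% increase ({ratio:.2f} times as many)")
```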

The prompt for the first part is similar to the previous one, but without steps 4 and 5. The definition and relationship fields are removed from the output JSON.

> The summary of the document was added to the output JSON.

In [109]:
system_prompt_v4_1 = """
You are tasked with extracting elements from a given legal document. Please follow these steps carefully and ensure all instructions are adhered to:

# Steps

1. **Summarize the document** to understand its purpose and use it to verify if all important terms, facts, fact types, and rules are identified in subsequent steps.

2. **Identify elements**:
   - **About the elements**:
     - **Fact**: A specific instance or statement that describes an event or condition without any directive element. Facts often involve relationships between terms or entities. Example: "John works for X Inc."
     - **Fact Type**: A general, abstract template that describes potential relationships between terms or entities, serving as a model for generating specific facts. Example: "Person works for Company."
     - **Operative Rule**: A statement that governs or constrains some aspect of the business, specifying what must be done or what is not allowed. Rules enforce compliance, limit possibilities, or prescribe specific behaviors in response to business situations. Operative rules (otherwise known as normative rules or prescriptive rules) state what must or must not happen in particular circumstances. Operative rules can be contravened: required information may be omitted, inappropriate information supplied, or an attempt may be made to perform a process that is prohibited. Example: "A customer must provide identification before opening an account."
     - **Term**: A word or a group of words that represents a specific concept, entity, or subject in a particular context.
     - Terms, Fact, Fact Type, and Operative Rule are statements that should allow only full compliance or full contravention; partial compliance is not possible. The presence of "or" or "and" often suggests the need to separate a statement into two.
   - **For each fact, fact type, or rule**:
     - **Extract the statement**: Identify the exact statement or phrase from the document representing the fact, fact type, or rule.
     - **Give a unique title to the statement**.
     - **Extract and classify Terms**:
       - **Extract all the terms involved in the statement**. Record the level of confidence in the extraction, ranging from 0 to 1, and provide a brief reason for the confidence score.
       - **Classify each term** as either **Common Noun** or **Proper Noun**.
       - If a Term contains nouns separated by "and," ",", or "or," split it into two or more terms. For example, "Principal office and place of business" should be split into "Principal office" and "Place of business".
     - **Extract Verb Symbols**: Identify verbs, verb phrases, or prepositions that connect the terms in the statement. Record the level of confidence in the extraction, ranging from 0 to 1, and provide a brief reason for the confidence score.
     - **Classification**: Classify the statement as either a **Fact**, **Fact Type**, or **Operative Rule**.
     - **Confidence**: Record the level of confidence in the classification, ranging from 0 to 1.
     - **Reason**: Provide a brief reason for the classification score.
     - **Source**: Note the specific paragraph or section of the document where the statement is found (e.g., "(a)(1)", "(b)").

3. **Provide JSON Output**:
   - Format your answer as per the output example below.
   - **All values are optional**: Include as much information as is available based on the document.
   - **Do not include any additional text or explanation outside the JSON structure**.

**Output Example**:

```json
{
  "section": "§ 123.4-5",
  "elements": [
    {
      "id": 1,
      "title": "some title",
      "statement": "A person serves a non-resident investment adviser by furnishing the Commission with process, pleadings, or papers.",
      "terms": [
        {
          "term": "Person",
          "classification": "Common Noun",
          "confidence": 0.9,
          "reason": "The term is ...",
          "extract_confidence": 0.9,
          "extract_reason": "The term is ..."
        },
        {
          "term": "Non-resident investment adviser",
          "classification": "Common Noun",
          "confidence": 0.8,
          "reason": "The term is ...",
          "extract_confidence": 0.8,
          "extract_reason": "The term is ..."
        },
        ...
      ],
      "verb_symbols": ["serves", "by furnishing", "with"],
      "verb_symbols_extracted_confidence": [0.9, 0.8, 0.7],
      "verb_symbols_extracted_reason": ["The verb is ...", "The verb is ...", "The verb is ..."],
      "classification": "Fact Type",
      "confidence": 0.8,
      "reason": "The statement is ...",
      "sources": ["(a)"]
    },
    ...
  ]
}
```

# Notes
1. Level of Granularity: Extract and analyze every potential statement from the document;
2. Contextual Interpretation: Extract explicitly stated facts, fact types, and rules;
3. Scope of Terms: Classify every noun phrase as a term, even if it is peripheral to the main statement;
4. Verb Symbols: Include prepositions and auxiliary phrases;
5. Classification Nuances: Record the level of confidence range from 0 to 1;
6. Section Handling: Strictly tie every element to its section (e.g., § 275.0-7(a)(1));
7. Order of Presentation: Follow the sequence of the document strictly;
8. Edge Cases: Classify it with a lower level of confidence;
9. Output Preferences: Add timestamp and processing notes;
10. Formatting Precision: Ensure the JSON adheres strictly to a specific schema (e.g., for use in a system).
"""

In [110]:
response_prompt_v4_1 = {
  "section": "§ 275.0-7",
  "summary": "The definition of small entities under the Investment Advisers Act for the purposes of the Regulatory Flexibility Act. It details criteria for qualifying as a small business or organization and provides definitions for 'control' and 'total assets' within this context.",
  "elements": [
    {
      "id": 1,
      "statement": "The term small business or small organization for purposes of the Investment Advisers Act of 1940 shall mean an investment adviser that has assets under management of less than $25 million.",
      "terms": [
        {
          "term": "Small business",
          "classification": "Common Noun"
        },
        {
          "term": "Small organization",
          "classification": "Common Noun"
        },
        {
          "term": "Investment adviser",
          "classification": "Common Noun"
        },
        {
          "term": "Assets under management",
          "classification": "Common Noun"
        },
        {
          "term": "$25 million",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["mean", "has"],
      "classification": "Fact",
      "sources": ["(a)(1)"]
    },
    {
      "id": 2,
      "statement": "An investment adviser did not have total assets of $5 million or more on the last day of the most recent fiscal year.",
      "terms": [
        {
          "term": "Investment adviser",
          "classification": "Common Noun"
        },
        {
          "term": "Total assets",
          "classification": "Common Noun"
        },
        {
          "term": "$5 million",
          "classification": "Common Noun"
        },
        {
          "term": "Fiscal year",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["did not have"],
      "classification": "Fact",
      "sources": ["(a)(2)"]
    },
    {
      "id": 3,
      "statement": "An investment adviser does not control, is not controlled by, and is not under common control with another investment adviser that has assets under management of $25 million or more.",
      "terms": [
        {
          "term": "Investment adviser",
          "classification": "Common Noun"
        },
        {
          "term": "Control",
          "classification": "Common Noun"
        },
        {
          "term": "$25 million",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["does not control", "is not controlled by", "is not under common control with"],
      "classification": "Fact",
      "sources": ["(a)(3)"]
    },
    {
      "id": 4,
      "statement": "Control means the power, directly or indirectly, to direct the management or policies of a person, whether through ownership of securities, by contract, or otherwise.",
      "terms": [
        {
          "term": "Control",
          "classification": "Common Noun"
        },
        {
          "term": "Power",
          "classification": "Common Noun"
        },
        {
          "term": "Management",
          "classification": "Common Noun"
        },
        {
          "term": "Policies",
          "classification": "Common Noun"
        },
        {
          "term": "Person",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["means", "to direct"],
      "classification": "Fact Type",
      "sources": ["(b)(1)"]
    },
    {
      "id": 5,
      "statement": "A person is presumed to control a corporation if the person directly or indirectly has the right to vote 25 percent or more of a class of the corporation's voting securities.",
      "terms": [
        {
          "term": "Person",
          "classification": "Common Noun"
        },
        {
          "term": "Corporation",
          "classification": "Common Noun"
        },
        {
          "term": "Voting securities",
          "classification": "Common Noun"
        },
        {
          "term": "25 percent",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["is presumed to control", "has the right to vote"],
      "classification": "Operative Rule",
      "sources": ["(b)(1)(i)(A)"]
    },
    {
      "id": 6,
      "statement": "A person is presumed to control a partnership if the person has the right to receive upon dissolution, or has contributed, 25 percent or more of the capital of the partnership.",
      "terms": [
        {
          "term": "Person",
          "classification": "Common Noun"
        },
        {
          "term": "Partnership",
          "classification": "Common Noun"
        },
        {
          "term": "Dissolution",
          "classification": "Common Noun"
        },
        {
          "term": "Capital",
          "classification": "Common Noun"
        },
        {
          "term": "25 percent",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["is presumed to control", "has the right to receive", "has contributed"],
      "classification": "Operative Rule",
      "sources": ["(b)(1)(ii)"]
    },
    {
      "id": 7,
      "statement": "Total assets means the total assets as shown on the balance sheet of the investment adviser or other person with its subsidiaries consolidated, whichever is larger.",
      "terms": [
        {
          "term": "Total assets",
          "classification": "Common Noun"
        },
        {
          "term": "Balance sheet",
          "classification": "Common Noun"
        },
        {
          "term": "Investment adviser",
          "classification": "Common Noun"
        },
        {
          "term": "Subsidiaries",
          "classification": "Common Noun"
        }
      ],
      "verb_symbols": ["means", "shown on"],
      "classification": "Fact Type",
      "sources": ["(b)(2)"]
    }
  ]
}

In [111]:
len(response_prompt_v4_1["elements"])

7
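The first-part prompt asks for a classification, a confidence between 0 and 1, and a source for every element. A quick sanity check on a response could look like the sketch below. `check_element` and `ALLOWED_CLASSIFICATIONS` are hypothetical names, and the allowed set is an assumption taken from the element definitions in the prompt.

```python
# Assumption: the three element kinds defined in the first-part prompt
ALLOWED_CLASSIFICATIONS = {"Fact", "Fact Type", "Operative Rule"}


def check_element(element: dict) -> list:
    """Return a list of human-readable problems found in one extracted element."""
    problems = []
    classification = element.get("classification")
    if classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unexpected classification: {classification!r}")
    # Confidence is optional, but when present it must lie in [0, 1]
    confidence = element.get("confidence")
    if confidence is not None and not 0 <= confidence <= 1:
        problems.append(f"confidence out of range: {confidence}")
    if not element.get("sources"):
        problems.append("missing sources")
    return problems


# Illustrative elements (not real model output)
good = {"classification": "Fact Type", "confidence": 0.8, "sources": ["(a)"]}
bad = {"classification": "Rule", "confidence": 1.2, "sources": []}

print(check_element(good))
print(check_element(bad))
```

Running such a check over `response_prompt_v4_1["elements"]` would flag elements whose labels drift from the names used in the prompt.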

Steps 4 and 5 are adapted from the previous approach. The system prompt for the second part is:

In [112]:
system_prompt_v4_2 = """
You are tasked with extracting definitions and **relationships** of terms in the terms list searching a given legal document. Please follow these steps carefully and ensure all instructions are adhered to:

# Steps

1. **Summarize the document** to understand its purpose and use it to verify if all important terms, term definitions, facts, fact types, and rules are identified in subsequent steps.

2. **Define terms**:
  - For each term:
    - Search the entire document for the term's definition, explanation, or meaning. Also, look in the document summary.
    - If the definition is found, include it.
    - If the definition is not found in the document, use null.
    - Record the level of confidence in the definition, ranging from 0 to 1.
    - Explain the reason for the confidence level.

3. **isLocalScope**: Is there an indication that the definition is exclusive to this section? Example: "For purposes of this section...", "as described in this section", "as defined in this section". If yes answer only with true. Otherwise, the answer is false.

4. **Identify synonym relationships between terms**:
  - For each term in the terms list:
    - Compare it against other terms in the text to find synonyms.
    - Ensure both terms exist within the same document context.
  - List all valid synonym pairs identified.
  - Record the level of confidence in the synonym relationship, ranging from 0 to 1.
  - Explain the reason for the confidence level.

5. **Provide JSON Output**:
  - Format your answer as per the output example below.
  - **All values are optional**: Include as much information as is available based on the document.
  - **Do not include any additional text or explanation outside the JSON structure**.

**Output Example**:

```json
{
  "terms": [
    {
      "term": "Person",
      "definition": "A person is a person.",
      "confidence": 0.9,
      "reason": "The definition was ...",
      "isLocalScope": true,
      "local_scope_confidence": 0.9,
      "local_scope_reason": "The scope is ..."
    },
    {
      "term": "Capital",
      "definition": "The total assets of a person.",
      "confidence": 0.8,
      "reason": "The definition is ...",
      "isLocalScope": false,
      "local_scope_confidence": 0.9,
      "local_scope_reason": "The scope is ..."
    },
    ...
  ],
  "relationships": [
    {
      "term_1": "Person",
      "term_2": "Capital",
      "relationship": "Synonym",
      "confidence": 0.8,
      "reason": "The relationship is ..."
    },
    {
      "term_1": "Capital",
      "term_2": "Person",
      "relationship": "Synonym",
      "confidence": 0.5,
      "reason": "The relationship is ..."
    },
    ...
  ]
}
```
"""

The user prompt contains the document together with a list of the unique terms from the result of the first part. The drawback of this approach is that the document must be provided again, which consumes more tokens.
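Building the second-part user prompt can be sketched as follows. This is an illustration under assumptions: the helper names `unique_terms` and `build_part_2_user_prompt` and the `# Terms` heading are hypothetical, and only the `# Document` heading mirrors the user prompt used later in the notebook.

```python
import json


def unique_terms(part_1_response: dict) -> list:
    """Collect term names from all elements, dropping duplicates in first-seen order."""
    seen = {}
    for element in part_1_response.get("elements", []):
        for term in element.get("terms", []):
            seen.setdefault(term["term"], None)
    return list(seen)


def build_part_2_user_prompt(document_text: str, terms: list) -> str:
    """Assumed layout: the document first, then the term list serialized as JSON."""
    terms_json = json.dumps(terms, indent=2)
    return f"# Document\n\n{document_text}\n\n# Terms\n\n{terms_json}\n"


# Illustrative first-part response (not real model output)
part_1_sample = {
    "elements": [
        {"terms": [{"term": "Person"}, {"term": "Control"}]},
        {"terms": [{"term": "Control"}, {"term": "Capital"}]},
    ]
}

terms = unique_terms(part_1_sample)
print(build_part_2_user_prompt("Control means the power ...", terms))
```

Because the document text is embedded a second time, the token cost of this call grows with the document length as well as with the number of terms.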

As noted above, the output is better than with the previous approach: 21 terms are extracted with definitions, and 6 relationships are identified. More importantly, the terms "small business" and "small organization" are extracted.

In [113]:
response_prompt_v4_2 = {
  "terms": [
    {
      "term": "$5 million",
      "definition": "An amount referenced as a threshold for total assets of an investment adviser or other entity on the last day of the most recent fiscal year."
    },
    {
      "term": "Control",
      "definition": "The power, directly or indirectly, to direct the management or policies of a person, whether through ownership of securities, by contract, or otherwise."
    },
    {
      "term": "Capital",
      "definition": "The amount of financial contribution or investment in a partnership or LLC, particularly relevant to the right to receive upon dissolution or contribution of 25 percent or more."
    },
    {
      "term": "Dissolution",
      "definition": "The act of formally ending a partnership or LLC, at which point capital contributions may be distributed."
    },
    {
      "term": "25 percent",
      "definition": "A threshold used to presume control over a corporation, partnership, or LLC, based on ownership, voting rights, or capital contribution."
    },
    {
      "term": "Subsidiaries",
      "definition": "Companies that are controlled by another company, typically through ownership of more than 50% of the subsidiary’s voting stock."
    },
    {
      "term": "Management",
      "definition": "The act of overseeing and controlling the policies or operations of an entity."
    },
    {
      "term": "Corporation",
      "definition": "A legal entity that is presumed to be controlled if a person has the right to vote or sell 25 percent or more of its voting securities."
    },
    {
      "term": "Balance sheet",
      "definition": "A financial statement that reports total assets, used to determine control and asset thresholds for investment advisers."
    },
    {
      "term": "Assets under management",
      "definition": "The total market value of investments that an investment adviser manages on behalf of clients."
    },
    {
      "term": "$25 million",
      "definition": "An amount referenced as a threshold for assets under management to determine whether an entity qualifies as a small business or small organization under the Investment Advisers Act."
    },
    {
      "term": "Fiscal year",
      "definition": "A one-year period used for accounting purposes and preparing financial statements, relevant to determining total assets."
    },
    {
      "term": "Voting securities",
      "definition": "Securities that give the holder the right to vote on matters of corporate policy or management, used to determine control."
    },
    {
      "term": "Power",
      "definition": "The ability to influence or direct the management or policies of a person or entity, often associated with control."
    },
    {
      "term": "Total assets",
      "definition": "The total value of all assets as shown on an entity's balance sheet, including those of subsidiaries, used to assess financial thresholds."
    },
    {
      "term": "Investment adviser",
      "definition": "An individual or firm that manages the investments of clients, subject to regulations under the Investment Advisers Act of 1940."
    },
    {
      "term": "Person",
      "definition": "An individual, corporation, partnership, LLC, trust, or other entity, potentially subject to control rules under the Investment Advisers Act."
    },
    {
      "term": "Small business",
      "definition": "An investment adviser with less than $25 million in assets under management and less than $5 million in total assets, or as otherwise defined by the Commission."
    },
    {
      "term": "Partnership",
      "definition": "A business structure where control is presumed if a person owns or contributes 25 percent or more of the partnership's capital."
    },
    {
      "term": "Small organization",
      "definition": "An entity, such as an investment adviser, that qualifies as a small business under the Investment Advisers Act by meeting specific asset thresholds."
    },
    {
      "term": "Policies",
      "definition": "The principles or rules governing the management and control of an entity, relevant to determining control under the Investment Advisers Act."
    }
  ],
  "relationships": [
    {
      "term_1": "Small business",
      "term_2": "Small organization",
      "relationship": "Synonym"
    },
    {
      "term_1": "$5 million",
      "term_2": "Total assets",
      "relationship": "Hypernym-Hyponym"
    },
    {
      "term_1": "$25 million",
      "term_2": "Assets under management",
      "relationship": "Hypernym-Hyponym"
    },
    {
      "term_1": "Person",
      "term_2": "Corporation",
      "relationship": "Hypernym-Hyponym"
    },
    {
      "term_1": "Person",
      "term_2": "Partnership",
      "relationship": "Hypernym-Hyponym"
    },
    {
      "term_1": "Person",
      "term_2": "Investment adviser",
      "relationship": "Hypernym-Hyponym"
    }
  ]
}


In [114]:
len(response_prompt_v4_2["terms"]), len(response_prompt_v4_2["relationships"])

(21, 6)
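
Counting terms and relationships says nothing about whether each relationship points at terms that were actually defined. The sketch below is one way to check that. The helper name `find_dangling_relationships` is hypothetical, and the data is a trimmed excerpt of the response above, so the flagged entry reflects only the excerpt and not the full list of 21 terms.

```python
# Trimmed excerpt of the P2 response shown above (NOT the full 21 terms).
response_excerpt = {
    "terms": [
        {"term": "Investment adviser"},
        {"term": "Person"},
        {"term": "Small business"},
        {"term": "Partnership"},
        {"term": "Small organization"},
        {"term": "Policies"},
    ],
    "relationships": [
        {"term_1": "Small business", "term_2": "Small organization", "relationship": "Synonym"},
        {"term_1": "$5 million", "term_2": "Total assets", "relationship": "Hypernym-Hyponym"},
        {"term_1": "Person", "term_2": "Partnership", "relationship": "Hypernym-Hyponym"},
    ],
}


def find_dangling_relationships(response: dict) -> list:
    """Return relationships whose term_1 or term_2 is missing from the term list.

    Hypothetical helper: comparison is case-insensitive on the "term" field.
    """
    known = {t["term"].casefold() for t in response["terms"]}
    return [
        r
        for r in response["relationships"]
        if r["term_1"].casefold() not in known or r["term_2"].casefold() not in known
    ]


dangling = find_dangling_relationships(response_excerpt)
for r in dangling:
    print(f'{r["term_1"]} -> {r["term_2"]} ({r["relationship"]})')
```

Run against the full `response_prompt_v4_2`, a check like this would show whether value-like entries such as "$5 million" were extracted as terms or appear only inside relationships.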

### Save checkpoint

Define which prompts will be used in the experiment.

In [115]:
# TODO: Refactor name to system_prompt_extract_P1 and use the function above
system_prompt_extract_part_1 = system_prompt_v4_1
system_prompt_extract_part_2 = system_prompt_v4_2

manager.add_document(
    Document(
        id="prompt-extract_P1",
        type="prompt",
        content=f"""
{system_prompt_extract_part_1}
        """,
    )
)

manager.add_document(
    Document(
        id="prompt-extract_P2",
        type="prompt",
        content=f"""
{system_prompt_extract_part_2}
        """,
    )
)

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-12-05 20:55:39 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-12-05-3.json
2024-12-05 20:55:39 - INFO - Checkpoint saved.
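
To illustrate what persisting documents keyed by id and type could look like, here is a minimal round-trip sketch. It is not the notebook's `checkpoint` module: `save_state`, `load_state`, and the record layout are assumptions made only for this example.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, documents: dict) -> None:
    """Write {(id, type): content} as a JSON list of records (assumed layout)."""
    records = [
        {"id": doc_id, "type": doc_type, "content": content}
        for (doc_id, doc_type), content in documents.items()
    ]
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_state(path: Path) -> dict:
    """Rebuild the {(id, type): content} mapping from the JSON records."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return {(r["id"], r["type"]): r["content"] for r in records}


# Placeholder contents; the real documents hold the full system prompts.
documents = {
    ("prompt-extract_P1", "prompt"): "system prompt for part 1",
    ("prompt-extract_P2", "prompt"): "system prompt for part 2",
}

with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = Path(tmp) / "documents.json"
    save_state(checkpoint_file, documents)
    restored = load_state(checkpoint_file)

print(restored == documents)
```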


## Execution

Restore checkpoint

In [116]:
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-12-05 20:55:39 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-12-05-3.json
2024-12-05 20:55:39 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-12-05-3.json.


### Extract and classify elements

- Classify statements in the document;
- Extract terms and verb symbols;
- Classify terms.

In [117]:
for doc in manager.list_document_ids(doc_type="section"):
    logger.info(f"Processing document: {doc}")
    retrieved_doc = manager.retrieve_document(doc_id=doc, doc_type="section")

    # Part 1 - Extraction of elements
    # TODO: Refactor to a function. Put it in the prompt engineering section
    user_prompt = f"""
# Document

{retrieved_doc.content}
    """

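    # The first argument must be a format string; extra positional args are %-format arguments.
    logger.debug("%s\n%s", system_prompt_extract_part_1, user_prompt)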

    logger.info("P1. Extracting elements...")
    response_part_1, completion_1, elapse_time_1 = query_instruct_llm(
        system_prompt=system_prompt_extract_part_1,
        user_prompt=user_prompt,
        document_model=ElementsDocumentModel,
        llm_model=config["LLM"]["MODEL"],
        temperature=config["LLM"]["TEMPERATURE"],
        max_tokens=config["LLM"]["MAX_TOKENS"],
    )

    logger.debug(response_part_1)

    doc_1 = Document(
        id=f"{doc}_P1",
        type="llm_response",
        content=response_part_1,
        elapsed_times=[elapse_time_1],
        completions=[completion_1.dict()],
    )
    manager.add_document(doc_1)

    # Part 2 - Definition of terms and relationships

    terms_list_part_1 = extract_unique_terms(response_part_1)

    user_prompt = f"""
# Terms list

{terms_list_part_1}

# Document
{retrieved_doc.content}
    """

    logger.info("P2. Extracting terms and relationships...")

    response_part_2, completion_2, elapse_time_2 = query_instruct_llm(
        system_prompt=system_prompt_extract_part_2,
        user_prompt=user_prompt,
        document_model=TermsDocumentModel,
        llm_model=config["LLM"]["MODEL"],
        temperature=config["LLM"]["TEMPERATURE"],
        max_tokens=config["LLM"]["MAX_TOKENS"],
    )

    logger.debug(response_part_2)

    doc_2 = Document(
        id=f"{doc}_P2",
        type="llm_response",
        content=response_part_2,
        elapsed_times=[elapse_time_2],
        completions=[completion_2.dict()],
    )
    manager.add_document(doc_2)

    logger.info("Saving llm_response to checkpoint...")

    # Checkpoint after every document so that paid-for LLM responses survive a later failure.
    save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

logger.info("Finished processing documents.")

2024-12-05 20:55:39 - INFO - Processing document: § 275.0-2
2024-12-05 20:55:39 - INFO - P1. Extracting elements...
2024-12-05 20:55:40 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
2024-12-05 20:55:40 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
2024-12-05 20:55:40 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"


InstructorRetryException: Error code: 400 - {'error': {'message': "Unsupported parameter: 'tool_choice' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'tool_choice', 'code': 'unsupported_parameter'}}
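
Because the loop saves a checkpoint after each section, a rerun after a failure like the one above could skip responses that are already stored instead of paying for them again. The sketch below shows that idea under stated assumptions: a plain dict stands in for the document manager, `fake_llm` stands in for `query_instruct_llm`, and `process_sections` is a hypothetical name. It does not fix the 400 error itself, which the message attributes to the model rejecting the `tool_choice` parameter.

```python
def process_sections(section_ids, store, call_llm):
    """Run P1 and P2 for each section, skipping responses already in `store`.

    Returns a dict of {response_id: error message} for calls that failed.
    """
    failures = {}
    for section_id in section_ids:
        for part in ("P1", "P2"):
            key = f"{section_id}_{part}"
            if key in store:
                continue  # already stored by an earlier run; do not call again
            try:
                store[key] = call_llm(section_id, part)
            except Exception as exc:
                failures[key] = str(exc)
                break  # P2 builds on P1, so skip the rest of this section
    return failures


calls = []


def fake_llm(section_id, part):
    """Stand-in for the real LLM call; fails for one section on purpose."""
    calls.append((section_id, part))
    if section_id == "§ 275.0-5":
        raise RuntimeError("simulated API error")
    return {"section": section_id, "part": part}


# Pretend an earlier run already stored P1 for the first section.
store = {"§ 275.0-2_P1": {"section": "§ 275.0-2", "part": "P1"}}
failures = process_sections(["§ 275.0-2", "§ 275.0-5", "§ 275.0-7"], store, fake_llm)

print(calls)
print(failures)
```

Only the missing responses trigger a call, and the failed section is reported without aborting the remaining ones.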

In [None]:
raise Exception("Stop here for now.")

Average execution time: 50 s per prompt.
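
As a rough planning aid, the average can be turned into a total-time estimate. This sketch assumes two prompts per section (P1 and P2, as in the loop above) and the three sections present in the checkpoint; the constant and function names are illustrative.

```python
SECONDS_PER_PROMPT = 50  # average reported in this notebook
PROMPTS_PER_SECTION = 2  # P1 (elements) and P2 (terms and relationships)


def estimate_total_seconds(n_sections: int) -> int:
    """Estimate the wall-clock time for a sequential run over all sections."""
    return n_sections * PROMPTS_PER_SECTION * SECONDS_PER_PROMPT


total = estimate_total_seconds(3)
print(f"{total} s (~{total / 60:.0f} min)")  # 300 s (~5 min)
```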

Restore checkpoint

In [None]:
# Restore checkpoint
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-12-02 20:29:10 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-12-02-9.json
2024-12-02 20:29:10 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-12-02-9.json.


### Check the content of datasets

In [None]:
processor = DocumentProcessor(manager, merge=True)

pred_operative_rules = processor.get_rules()
pred_facts = processor.get_facts()
pred_terms = processor.get_terms()
pred_names = processor.get_names()
pred_terms_with_definitions = processor.get_terms(definition_filter="non_null")
pred_names_with_definitions = processor.get_names(definition_filter="non_null")

logger.debug(f"Rules: {pred_operative_rules}")
logger.debug(f"Facts: {pred_facts}")
logger.debug(f"Terms: {pred_terms}")
logger.debug(f"Names: {pred_names}")
logger.info(f"Rules to evaluate: {len(pred_operative_rules)}")
logger.info(f"Facts to evaluate: {len(pred_facts)}")
logger.info(f"Terms to evaluate: {len(pred_terms)}")
logger.info(f"Names to evaluate: {len(pred_names)}")
logger.info(f"Terms with definitions: {len(pred_terms_with_definitions)}")
logger.info(f"Names with definitions: {len(pred_names_with_definitions)}")

2024-12-05 20:30:10 - INFO - Document did not have facts classifications to process: 'NoneType' object has no attribute 'content'


2024-12-05 20:30:10 - INFO - Rules to evaluate: 0
2024-12-05 20:30:10 - INFO - Facts to evaluate: 0
2024-12-05 20:30:10 - INFO - Terms to evaluate: 0
2024-12-05 20:30:10 - INFO - Names to evaluate: 0
2024-12-05 20:30:10 - INFO - Terms with definitions: 0
2024-12-05 20:30:10 - INFO - Names with definitions: 0


In [None]:
logger.info("SECTIONS:")
# List all document ids | type
logger.info(f"section docs: {manager.list_document_ids(doc_type='section')}")

# Retrieve a document by id | type
for doc in manager.list_document_ids(doc_type="section"):
    retrieved_doc = manager.retrieve_document(doc_id=doc, doc_type="section")
    logger.debug(retrieved_doc)
    lines, words, avg_words_per_line = basic_text_stats(retrieved_doc.content)
    logger.info(
        f"{doc}: Total number of lines: {lines}, total number of words: {words}, and average words per line: {avg_words_per_line}"
    )

retrieved_true_table_p1 = []
retrieved_true_table_p2 = []

for doc in manager.list_document_ids(doc_type="true_table"):
    logger.info(f"Processing document: {doc} ...")
    # Docs type true_table P1
    if doc.endswith("_P1"):
        retrieved_true_table_p1.append(
            calculate_content_quantities_p1(
                doc,
                manager.retrieve_document(
                    doc_id=doc, doc_type="true_table"
                ).model_dump()["content"],
                filename="p1_true_table.json",
            )
        )
        logger.info("retrieve P1")
    # Docs type true_table P2
    elif doc.endswith("_P2"):
        retrieved_true_table_p2.append(
            calculate_content_quantities_p2(
                doc,
                manager.retrieve_document(
                    doc_id=doc, doc_type="true_table"
                ).model_dump(),
                filename="p2_true_table.json",
            )
        )
        logger.info("retrieve P2")

# Convert collected data to a DataFrame
table_true_df_p1 = pd.DataFrame(retrieved_true_table_p1)
table_true_df_p2 = pd.DataFrame(retrieved_true_table_p2)

# Save the summary DataFrames as Excel files
table_true_df_p1.to_excel(f"{config['DEFAULT_OUTPUT_DIR']}/P1_summary_true_table.xlsx", index=False)
table_true_df_p2.to_excel(f"{config['DEFAULT_OUTPUT_DIR']}/P2_summary_true_table.xlsx", index=False)

2024-12-05 20:30:25 - INFO - SECTIONS:
2024-12-05 20:30:25 - INFO - section docs: ['§ 275.0-2', '§ 275.0-5', '§ 275.0-7']
2024-12-05 20:30:25 - INFO - § 275.0-2: Total number of lines: 14, total number of words: 362, and average words per line: 26
2024-12-05 20:30:25 - INFO - § 275.0-5: Total number of lines: 10, total number of words: 260, and average words per line: 26
2024-12-05 20:30:25 - INFO - § 275.0-7: Total number of lines: 19, total number of words: 513, and average words per line: 27
2024-12-05 20:30:25 - INFO - Processing document: § 275.0-2_P1 ...
2024-12-05 20:30:25 - INFO - retrieve P1
2024-12-05 20:30:25 - INFO - Processing document: § 275.0-5_P1 ...
2024-12-05 20:30:25 - INFO - retrieve P1
2024-12-05 20:30:25 - INFO - Processing document: § 275.0-7_P1 ...
2024-12-05 20:30:25 - INFO - retrieve P1
2024-12-05 20:30:25 - INFO - Processing document: § 275.0-2_P2 ...
2024-12-05 20:30:25 - INFO - retrieve P2
2024-12-05 20:30:25 - INFO - Processing document: § 275.0-5_P2 ...
2