# Lab 5 - nlp2sbvr

Lab 5 - Initial version.

Chapter 6. Ferramentas de suporte
- Section 6.2 Implementação dos principais componentes
  - Section 6.2.4 Transformação para SBVR
    - Section Algoritmo "nlp2sbvr"
    - Section Algoritmo "elements association and creation"
    - Section Algoritmo "define vocabular namespace"
    - Section Algoritmo "similarity search"

## Google colab

In [1]:
%load_ext autoreload
%autoreload 2

import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  from google.colab import drive
  drive.mount('/content/drive')
  !rm -rf cfr2sbvr configuration checkpoint
  !git clone https://github.com/asantos2000/master-degree-santos-anderson.git cfr2sbvr
  %pip install -r cfr2sbvr/code/requirements.txt
  !cp -r cfr2sbvr/code/src/configuration .
  !cp -r cfr2sbvr/code/src/checkpoint .
  !cp -r cfr2sbvr/code/config.colab.yaml config.yaml
  DEFAULT_CONFIG_FILE="config.yaml"
else:
  DEFAULT_CONFIG_FILE="../config.yaml"

## Imports

In [13]:
# only for labs
import sys
sys.path.append(r'../src')

In [14]:
# Standard library imports
from collections import defaultdict
import json
import re
from pathlib import Path

# Third-party libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pydantic import BaseModel, Field
from sklearn.metrics import confusion_matrix, classification_report
from typing import List, Dict, Optional, Any, Tuple, Set

# Local application/library-specific imports
import checkpoint.main as checkpoint
from checkpoint.main import (
    normalize_str,
    save_checkpoint,
    restore_checkpoint,
    get_all_checkpoints,
    get_elements_from_checkpoints,
    get_elements_from_true_tables,
    get_true_table_files,
    DocumentProcessor,
    Document,
)
import configuration.main as configuration
import logging_setup.main as logging_setup
import token_estimator.main as token_estimator
from token_estimator.main import estimate_tokens
import rules_taxonomy_provider.main as rules_taxonomy_provider
from rules_taxonomy_provider.main import RuleInformationProvider, RulesTemplateProvider
import llm_query.main as llm_query
from llm_query.main import query_instruct_llm

DEV_MODE = True

if DEV_MODE:
    # Development mode
    import importlib

    importlib.reload(configuration)
    importlib.reload(logging_setup)
    importlib.reload(checkpoint)
    importlib.reload(token_estimator)
    importlib.reload(rules_taxonomy_provider)
    importlib.reload(llm_query)

## Settings

Default settings, check them before run the notebook.

### Get configuration

In [11]:
# load config
DEFAULT_CONFIG_FILE = "../config.yaml"
config = configuration.load_config(DEFAULT_CONFIG_FILE)

Generated files for analysis in this run

In [8]:
print(config["DEFAULT_CHECKPOINT_FILE"],
config["DEFAULT_EXTRACTION_REPORT_FILE"],
config["DEFAULT_EXCEL_FILE"])

../data/checkpoints/documents-2024-11-10-1.json ../outputs/extraction_report-2024-11-10-1.html ../outputs/compare_items_metrics.xlsx


### Logging configuration

In [9]:
logger = logging_setup.setting_logging(config["DEFAULT_LOG_DIR"], config["LOG_LEVEL"])

2024-11-10 17:42:57 - INFO - Logging is set up with daily rotation.


## Checkpoints

Documents, annoted datasets, statistics and metrics about the execution of the notebook are stored by checkpoint module.

Checkpoints are stored / retrieved at the directory `DEFAULT_CHECKPOINT_FILE` in the configuration file.

During the execution, it will restore the checkpoint at the beginning of the section and saved at the end. We can run and restore the checkpoint several times. If the run fails, check the closest checkpoint and restore it.

### Restore the checkpoint

In [23]:
# Restore the checkpoint

# To run after classification
last_checkpoint = configuration.get_last_filename(config["DEFAULT_CHECKPOINT_DIR"], "documents", "json")

logger.info(f"{last_checkpoint=}")

config["DEFAULT_CHECKPOINT_FILE"] = last_checkpoint

manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-11-10 17:54:14 - INFO - last_checkpoint='../data/checkpoints/documents-2024-11-01-3.json'
2024-11-10 17:54:14 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 17:54:14 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-11-01-3.json.


## Datasets

Datasets used in the notebook. They are divided into sections and true tables. The sections are the documents from CFR and true tables are annoted  or "golden" datasets.

### General functions and data structures

In [8]:
def get_section_from_kg(conn: Any, section_num: str) -> str:
    """
    Retrieves a section from the Knowledge Graph based on the section number.

    Args:
        conn: The connection object to the Knowledge Graph.
        section_num (str): The section number to query.

    Returns:
        str: The retrieved section content as a string.

    Raises:
        Exception: If there is an error executing the query.
    """
    # Query section number from KG
    query = """
    PREFIX fro-cfr: <http://finregont.com/fro/cfr/Code_Federal_Regulations.ttl#>
    PREFIX fro-leg-ref: <http://finregont.com/fro/ref/LegalReference.ttl#>

    SELECT ?section ?section_seq ?section_num ?section_subject ?section_citation ?section_notes ?divide ?divide_seq ?paragraph_enum ?paragraph_text
    WHERE {
      ?section a fro-cfr:CFR_Section ;
        fro-leg-ref:hasSequenceNumber ?section_seq ;
        fro-cfr:hasSectionNumber ?section_num ;
        fro-cfr:hasSectionSubject ?section_subject .
      OPTIONAL {?section fro-leg-ref:refers_toNote ?section_notes} .
      OPTIONAL {?section fro-cfr:hasSectionCitation ?section_citation} .

      ?divide fro-leg-ref:divides ?section ; # rdf:type fro-cfr:CFR_Parapraph
        fro-leg-ref:hasSequenceNumber ?divide_seq ;
        fro-cfr:hasParagraphText ?paragraph_text ;
        fro-leg-ref:hasSequenceNumber ?paragraph_seq .
      OPTIONAL {?divide fro-cfr:hasParagraphEnumText ?paragraph_enum} .
    """ + f"""
      FILTER("{section_num}" = ?section_num)
    """ + """
    }
    ORDER BY ?section_num ?section ?divide_seq
    """
    tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, query)
    result = tuple_query.evaluate()

    logger.debug(f"result.metadata: {result.metadata}")
    logger.debug(f"result.variable_names: {result.variable_names}")

    body_text = ""
    previous_section = None
    previous_paragraph_id = None
    with result:
      for binding_set in result:
          section = binding_set.getValue("section")
          section_seq = str(binding_set.getValue("section_seq")).replace('"', '')
          section_num = str(binding_set.getValue("section_num")).replace('"', '')
          section_subject = str(binding_set.getValue("section_subject")).replace('"', '')
          section_citation = str(binding_set.getValue("section_citation")).replace('"', '')
          section_notes = str(binding_set.getValue("section_notes")).replace('"', '')
          divide = binding_set.getValue("divide")
          divide_seq = str(binding_set.getValue("divide_seq")).replace('"', '')
          paragraph_enum = str(binding_set.getValue("paragraph_enum")).replace('"', '')
          paragraph_text = str(binding_set.getValue("paragraph_text")).replace('"', '')
          # Header
          if previous_section != section:
            previous_section = section
            header = f"""
    section_number: {section_num}
    section_subject: {section_subject}
    section_id: {section}
    citations: {section_citation}
    notes: {section_notes}
            """
          # Body
          if paragraph_enum != "None":
            body_text += f"""
    paragraph_enumeration: {paragraph_enum}
    paragraph_text: {paragraph_text}
    """
          else:
            body_text += f"""
    paragraph_text: {paragraph_text}
    """

    return header + body_text


In [21]:
def save_prompts_samples(system_prompts, user_prompts, element_name, manager):
    manager.add_document(
        Document(
            id=f"prompt-system-transform_rules_{element_name.replace(' ', '_')}",
            type="prompt",
            content=system_prompts[0],
        )
    )

    manager.add_document(
        Document(
            id=f"prompt-user-transform_rules_{element_name.replace(' ', '_')}",
            type="prompt",
            content=user_prompts[0],
        )
    )

    logger.info(f"System prompts for {element_name}s: {len(system_prompts)}")
    logger.info(f"User prompts for {element_name}s: {len(user_prompts)}")

    save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

LLM response model

In [60]:
class TransformedStatement(BaseModel):
    doc_id: str = Field(..., description="Document ID associated with the statement.")
    statement_id: str = Field(..., description="A provided string that identifies the statement. e.g., '1', 'Person'.")
    statement: str = Field(..., description="The statement to be transformed.")
    statement_source: str = Field(..., description="Source of the statement.")
    templates_ids: List[str] = Field(..., description="List of template IDs.")
    transformed: str = Field(..., description="The transformed statement.")

In [10]:
def transform_statement(element_name, user_prompts, system_prompts, manager):
    # Initialize an empty list to accumulate all responses
    all_responses = []

    # Loop through each pair of user and system prompts with a counter
    for index, (user_prompt, system_prompt) in enumerate(
        zip(user_prompts, system_prompts), start=1
    ):
        logger.info(f"Processing transformation prompt {index} for {element_name}.")
        logger.debug(system_prompt)
        logger.debug(user_prompt)

        # Query the language model
        response = query_instruct_llm(
            system_prompt=system_prompt,
            user_prompt=user_prompt,
            document_model=List[TransformedStatement],
            llm_model=config["LLM"]["MODEL"],
            temperature=config["LLM"]["TEMPERATURE"],
            max_tokens=config["LLM"]["MAX_TOKENS"],
        )

        logger.debug(response)

        # Accumulate the responses in the list
        all_responses.extend(response)

        logger.info(f"Finished processing classification and templates prompt {index}.")

    # After the loop, create a single Document with all the accumulated responses
    doc = Document(
        id=f"transform_{element_name.replace(' ', '_')}s",
        type="llm_response_transform",
        content=all_responses,
    )
    manager.add_document(doc)

    logger.info(f"{element_name}s: {len(all_responses)}")

    return all_responses

In [11]:

def get_prompts_for_rule(rules, rule_template_formulation, data_dir):
    rule_template_provider = RulesTemplateProvider(data_dir)

    system_prompts = []
    user_prompts = []

    for rule in rules:
        element_name = rule.get("element_name")

        if element_name == ["Term", "Name"]:
            statement_key = "definition"
            statement_id_key = "signifier"
        else:
            statement_key = "statement"
            statement_id_key = "statement_id"

        input_rule = {
            "doc_id": rule["doc_id"],
            f"{statement_id_key}": rule["statement_id"],
            "source": rule["source"],
            f"{statement_key}": rule.get("statement", rule.get("definition")),
            "templates_ids": rule["templates_ids"],
        }
        user_prompt = get_user_prompt_transform(element_name, input_rule)
        user_prompts.append(user_prompt)
        rule_templates_subtemplates = rule_template_provider.get_rules_template(rule["templates_ids"])
        system_prompt = get_system_prompt_transform(element_name,rule_template_formulation, rule_templates_subtemplates)
        system_prompts.append(system_prompt)
        logger.debug(system_prompt)
        logger.debug(user_prompt)

    logger.info(f"System prompts for {element_name}s: {len(system_prompts)}")
    logger.info(f"User prompts for {element_name}s: {len(user_prompts)}")

    return system_prompts, user_prompts, element_name


### True tables

True tables are annotated or "golden" datasets in which entities have been manually identified and labeled within the original source data.

True tables for sectiona 275.0-2, 275.0-5 and 275.0-7

Classification true tables

In [12]:
true_table_files = get_true_table_files(config["DEFAULT_DATA_DIR"])

for item in true_table_files:
    with open(item["path"], 'r') as file:
        data = json.load(file)

        logger.debug(data[item["id"]])
        logger.info(f"Adding {item['id']} true table to the manager")

        # manager.add_document(
        #     Document.model_validate(data[item["id"]])
        # )

# # Persist the state to a file
# save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-10 14:36:42 - INFO - Adding classify_P1|true_table true table to the manager
2024-11-10 14:36:42 - INFO - Adding classify_P2_Definitional_facts|true_table true table to the manager
2024-11-10 14:36:42 - INFO - Adding classify_P2_Definitional_names|true_table true table to the manager
2024-11-10 14:36:42 - INFO - Adding classify_P2_Definitional_terms|true_table true table to the manager
2024-11-10 14:36:42 - INFO - Adding classify_P2_Operative_rules|true_table true table to the manager


### Elements to transform

Get expressions to transform

In [None]:
processor = DocumentProcessor(manager)

pred_operative_rules = processor.get_rules()
pred_facts = processor.get_facts()
pred_terms = processor.get_terms(definition_filter="non_null")
pred_names = processor.get_names(definition_filter="non_null")

logger.debug(f"Rules: {pred_operative_rules}")
logger.debug(f"Facts: {pred_facts}")
logger.debug(f"Terms: {pred_terms}")
logger.debug(f"Names: {pred_names}")
logger.info(f"Rules to evaluate: {len(pred_operative_rules)}")
logger.info(f"Facts to evaluate: {len(pred_facts)}")
logger.info(f"Terms to evaluate: {len(pred_terms)}")
logger.info(f"Names to evaluate: {len(pred_names)}")

2024-11-10 17:33:27 - INFO - Rules to evaluate: 6
2024-11-10 17:33:27 - INFO - Facts to evaluate: 17
2024-11-10 17:33:27 - INFO - Terms to evaluate: 91
2024-11-10 17:33:27 - INFO - Names to evaluate: 15


Look for duplications.

In [113]:
# Track seen items and duplicates
seen_terms = set()
duplicates = []

# Check for duplicates by specified fields
for term in pred_terms:
    # Create a unique key from the specified fields
    key = (term['doc_id'], term['statement_id'], term['definition'], term['source'])
    if key in seen_terms:
        duplicates.append(term)
    else:
        seen_terms.add(key)

# Output results
if duplicates:
    print("Duplicate entries found in true_terms based on specified fields:")
    for dup in duplicates:
        print(dup)
else:
    print("No duplicate entries found in true_terms based on specified fields.")


Duplicate entries found in true_terms based on specified fields:
{'doc_id': '§ 275.0-5', 'statement_id': 'Matter', 'definition': "The subject of the proceeding initiated by the application or Commission's motion.", 'source': '(a)', 'element_name': 'Term', 'transformed': "A Matter is by definition the subject of the proceeding that is initiated by the application or Commission's motion.", 'type': 'Definitional', 'subtype': 'Formal intensional definitions', 'confidence': 0.9, 'explanation': "The statement defines 'matter' by specifying it as the subject of a proceeding, which fits the formal intensional definition structure by using a hypernym and a distinguishing characteristic.", 'templates_ids': ['T7']}


## Prompt engeneering

### nlp2sbvr

#### System prompt

Formulation is expressed using a template (WITT, 2012, p. 162).

In [16]:
rule_template_formulation = """
# How to interpret the templates and subtemplates

Each formulation is expressed using a template, in which the various symbols have the following meanings:

1. Each item enclosed in "angle brackets" ("<" and ">") is a placeholder, in place of which any suitable text may be substituted. For example, any of the following may be substituted in place of <operative rule statement subject> (subtemplate):
    a. a term: for example, "flight booking request",
    b. a term followed by a qualifying clause: for example, "flight booking request for a one-way journey",
    c. a reference to a combination of items: for example, "combination of enrollment date and graduation date", with or without a qualifying clause,
    d. a reference to a set of items: for example, "set of passengers", with or without a qualifying clause.
2. Each pair of braces ("{" and "}") encloses a set of options (separated from each other by the bar symbol: "|"), one of which is included in the rule statement. For example,
3. If a pair of braces includes a bar symbol immediately before the closing brace, the null option is allowed: that is, you can, if necessary, include none of the options at that point in the rule statement.
4. Sets of options may be nested. For example, in each of the templates above
    a. a conditional clause may be included or omitted,
    b. if included, the conditional clause should be preceded by either "if" or "unless".
5. A further notation, introduced later in this section, uses square brackets to indicate that a syntactic element may be repeated indefinitely.
6. Any text not enclosed in either "angle brackets" or braces (i.e., "must", "not", "may", and "only") is included in every rule statement conforming to the relevant template.
"""

In [17]:
def get_system_prompt_transform(element_name, rule_template_formulation, rule_templates_subtemplates):
    return f"""
Transform each given {element_name} {"definition" if element_name in ["Term", "Name"] else "statement"} into a structured format by matching it to the specified templates and subtemplates.

# Steps

1. **Use Template**:
   - For given expression, use the templates and subtemplates provided for transformation.
   - Determine the appropriate template or subtemplate based on the structure of the expression.
   
2. **Replace Placeholders**:
   - Substitute placeholders, such as `<term>`, `<verb phrase>`, `<conditional clause>`, etc., with suitable values as per the expression.
   - For terms and names, the statement_id is the term defined by the statement.
   
3. **Include Qualifying Details**:
   - Where placeholders, such as `<qualifying clause>`, require additional details (e.g., attributes or qualifiers to distinguish meaning), ensure that these are included appropriately as per the respective subtemplate.

4. **Transform into Structured Format**:
   - Once the transformation is complete, ensure it's in the correct template format.

5. **Output as Structured JSON**:
   - For every transformed expression generate a JSON object as per the specified output format.

{rule_template_formulation}

# Provided templates and subtemplates for transformation

{rule_templates_subtemplates}

# Output Format

[
    {{
      "doc_id": <doc_id>,
      "statement_id": <statement_id or signifier>,
      "source": <source>,
      "statement": <statement or definition>,
      "templates_ids": [<templates_id>],
      "transformed": <transformed_statement>,
    }},
    ...
]

- **`doc_id`**: A original document identifier.
- **`statement_id or signifier`**: The original statement or definition identifier. e.g., '1', 'Person'".
- **`source`**: The original source of the expression.
- **`statement or definition`**: The original text of the given expression.
- **`templates_ids`**: The exact template(s) to be used for transformation (e.g., T1, T2, etc.)
- **`transformed`**: The transformation result, formatted according to the matching template and subtemplates.

# Notes
- Use only the provided templates and subtemplates for transformation.
- If a placeholder within an expression is not applicable or optional, consider whether it should be omitted or replaced by a suitable value.
- Each expression may involve nested levels of substitution as indicated by the subtemplate hierarchy (e.g., a qualifying clause that contains sub-elements).
- Ensure accuracy in grammar and compliance with logical constructs when performing substitutions.
"""

#### User prompt

In [18]:
def get_user_prompt_transform(element_name, rule):

    return f"""
# Here's the {element_name} {"definition" if element_name in ["Term", "Name"] else "statement"} you need to transform using template {rule.get("templates_ids")}.

{json.dumps(rule, indent=2)}
"""

#### Save checkpoint

In [19]:
# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-10 14:37:37 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 14:37:37 - INFO - Checkpoint saved.


## Execution

### nlp2sbvr

#### Restore checkpoint

In [114]:
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-11-10 17:33:47 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 17:33:47 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-11-01-3.json.


#### Operative rules

Get prompts for operative rules

In [115]:
system_prompts_operative_rules, user_prompts_operative_rules, element_name = (
    get_prompts_for_rule(
        rules=pred_operative_rules,
        rule_template_formulation=rule_template_formulation,
        data_dir=config["DEFAULT_DATA_DIR"],
    )
)

2024-11-10 17:33:52 - INFO - System prompts for Operative Rules: 6
2024-11-10 17:33:52 - INFO - User prompts for Operative Rules: 6


Save a sample of the system prompt and user prompt.

In [23]:
save_prompts_samples(
    system_prompts_operative_rules, user_prompts_operative_rules, element_name, manager
)

2024-11-10 14:37:53 - INFO - System prompts for Operative Rules: 6
2024-11-10 14:37:53 - INFO - User prompts for Operative Rules: 6
2024-11-10 14:37:53 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 14:37:53 - INFO - Checkpoint saved.


Call LLM to transform operative rules

In [None]:
responses_operative_rules = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_operative_rules,
    system_prompts=system_prompts_operative_rules,
    manager=manager,
)

logger.debug(f"{responses_operative_rules=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

Average execution time 5s per prompt.

#### Facts

Get prompts for facts.

In [24]:
system_prompts_facts, user_prompts_facts, element_name = get_prompts_for_rule(
    rules=pred_facts,
    rule_template_formulation=rule_template_formulation,
    data_dir=config["DEFAULT_DATA_DIR"],
)

2024-11-10 14:38:05 - INFO - System prompts for Fact Types: 17
2024-11-10 14:38:05 - INFO - User prompts for Fact Types: 17


Save a sample of the system prompt and user prompt.

In [25]:
save_prompts_samples(
    system_prompts_operative_rules, user_prompts_operative_rules, element_name, manager
)

2024-11-10 14:38:07 - INFO - System prompts for Fact Types: 6
2024-11-10 14:38:07 - INFO - User prompts for Fact Types: 6
2024-11-10 14:38:07 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 14:38:07 - INFO - Checkpoint saved.


Call LLM to transform facts

In [None]:
responses_facts = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_facts,
    system_prompts=system_prompts_facts,
    manager=manager,
)

logger.debug(f"{responses_facts=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

Average execution time 5s per prompt.

#### Terms

Get prompts for facts.

In [26]:
system_prompts_terms, user_prompts_terms, element_name = get_prompts_for_rule(
    rules=pred_terms,
    rule_template_formulation=rule_template_formulation,
    data_dir=config["DEFAULT_DATA_DIR"],
)

2024-11-10 14:39:29 - INFO - System prompts for Terms: 91
2024-11-10 14:39:29 - INFO - User prompts for Terms: 91


Save a sample of the system prompt and user prompt.

In [27]:
save_prompts_samples(
    system_prompts_operative_rules, user_prompts_operative_rules, element_name, manager
)

2024-11-10 14:39:35 - INFO - System prompts for Terms: 6
2024-11-10 14:39:35 - INFO - User prompts for Terms: 6
2024-11-10 14:39:35 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 14:39:35 - INFO - Checkpoint saved.


Call LLM to transform facts

In [None]:
responses_terms = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_terms,
    system_prompts=system_prompts_terms,
    manager=manager,
)

logger.debug(f"{responses_terms=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

Average execution time 4s per prompt.

#### Names

Get prompts for names.

In [28]:
system_prompts_names, user_prompts_names, element_name = get_prompts_for_rule(
    rules=pred_names,
    rule_template_formulation=rule_template_formulation,
    data_dir=config["DEFAULT_DATA_DIR"],
)

2024-11-10 14:39:45 - INFO - System prompts for Names: 15
2024-11-10 14:39:45 - INFO - User prompts for Names: 15


Save a sample of the system prompt and user prompt.

In [29]:
save_prompts_samples(
    system_prompts_operative_rules, user_prompts_operative_rules, element_name, manager
)

2024-11-10 14:39:48 - INFO - System prompts for Names: 6
2024-11-10 14:39:48 - INFO - User prompts for Names: 6
2024-11-10 14:39:48 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 14:39:48 - INFO - Checkpoint saved.


Call LLM to transform facts

In [None]:
responses_names = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_names,
    system_prompts=system_prompts_names,
    manager=manager,
)

logger.debug(f"{responses_names=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

Average execution time 5s per prompt.

#### Check missing transformations

Restore checkpoint

In [25]:
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-11-10 18:01:44 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-11-01-3.json
2024-11-10 18:01:44 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-11-01-3.json.


Evaluate elements for missing transformations

In [30]:
processor = DocumentProcessor(manager)

pred_operative_rules = processor.get_rules()
pred_facts = processor.get_facts()
pred_terms = processor.get_terms(definition_filter="non_null")
pred_names = processor.get_names(definition_filter="non_null")

logger.debug(f"Rules: {pred_operative_rules}")
logger.debug(f"Facts: {pred_facts}")
logger.debug(f"Terms: {pred_terms}")
logger.debug(f"Names: {pred_names}")

data = [pred_facts, pred_terms, pred_names, pred_operative_rules]
data_names = ["pred_facts", "pred_terms", "pred_names", "pred_operative_rules"]

for element_list, element_name in zip(data, data_names):
    empty_transformed_elements = [element for element in element_list if not element['transformed']]
    logger.info(f"Empty transformed {element_name}: {len(empty_transformed_elements)}/{len(element_list)}")


2024-11-10 18:04:06 - INFO - Empty transformed pred_facts: 0/17
2024-11-10 18:04:06 - INFO - Empty transformed pred_terms: 0/91
2024-11-10 18:04:06 - INFO - Empty transformed pred_names: 0/15
2024-11-10 18:04:06 - INFO - Empty transformed pred_operative_rules: 0/6


#### Discussion

For the the first parte (prompt_classify_p1), the assigned confidence levels reflect a calibrated approach to expressions involving multiple classifications where a dominant rule type is not explicitly evident. For instance, when an expression primarily constrains data (Data rule) but also includes specific parties (Party rule), a high confidence level is attributed to Data while a moderate confidence level is applied to Party, acknowledging its secondary relevance. Similarly, expressions referencing roles such as “Secretary” or “interested person” without explicit party restrictions are assigned moderate confidence for Party classification due to interpretive ambiguity. Procedural elements that impact data handling, such as document forwarding, receive high confidence for Data rules; however, a moderate confidence level is assigned for Activity rules when procedural references are indirect. This methodology prioritizes primary rule types while accounting for the interpretive limits of secondary classifications.

### elements association and creation

In [None]:
# TODO: Implement
...

### define vocabular namespace

In [None]:
# TODO: Implement
...

### similarity search

In [None]:
# TODO: Implement
...