<a href="https://colab.research.google.com/github/asantos2000/master-degree-santos-anderson/blob/main/code/src/chap_6_nlp2sbvr_transform.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# nlp2sbvr - Transformation to SBVR

Chapter 6. Support tools
- Section 6.2 Implementation of the main components
  - Section 6.2.4 nlp2sbvr
    - Section Algorithm "Transformation to SBVR"

## Google Colab

In [38]:
%load_ext autoreload
%autoreload 2

import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  from google.colab import drive
  drive.mount('/content/drive')
  !rm -rf cfr2sbvr configuration checkpoint
  !git clone https://github.com/asantos2000/master-degree-santos-anderson.git cfr2sbvr
  %pip install -r cfr2sbvr/code/requirements.txt
  !cp -r cfr2sbvr/code/src/configuration .
  !cp -r cfr2sbvr/code/src/checkpoint .
  !cp -r cfr2sbvr/code/config.colab.yaml config.yaml
  DEFAULT_CONFIG_FILE="config.yaml"
else:
  DEFAULT_CONFIG_FILE="../config.yaml"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Imports

In [39]:
# Standard library imports
import json
from typing import List, Any

# Third-party libraries
from pydantic import BaseModel, Field

# Local application/library-specific imports
import checkpoint.main as checkpoint
from checkpoint.main import (
    save_checkpoint,
    restore_checkpoint,
    DocumentProcessor,
    Document,
)
import configuration.main as configuration
import logging_setup.main as logging_setup
import rules_taxonomy_provider.main as rules_taxonomy_provider
from rules_taxonomy_provider.main import RulesTemplateProvider
import llm_query.main as llm_query
from llm_query.main import query_instruct_llm

DEV_MODE = True

if DEV_MODE:
    # Development mode
    import importlib

    importlib.reload(configuration)
    importlib.reload(logging_setup)
    importlib.reload(checkpoint)
    importlib.reload(rules_taxonomy_provider)
    importlib.reload(llm_query)

## Settings

Default settings; check them before running the notebook.

### Get configuration

In [3]:
# load config
config = configuration.load_config(DEFAULT_CONFIG_FILE)

Generated files for analysis in this run

In [4]:
print(config["DEFAULT_CHECKPOINT_FILE"],
config["DEFAULT_EXTRACTION_REPORT_FILE"],
config["DEFAULT_EXCEL_FILE"])

../data/checkpoints/documents-2024-11-30-1.json ../outputs/extraction_report-2024-11-30-1.html ../outputs/compare_items_metrics.xlsx


### Logging configuration

In [5]:
logger = logging_setup.setting_logging(config["DEFAULT_LOG_DIR"], config["LOG_LEVEL"])

2024-11-30 01:52:37 - INFO - Logging is set up with daily rotation.


## Checkpoints

Documents, annotated datasets, statistics, and metrics about the execution of the notebook are stored by the checkpoint module.

Checkpoints are stored in and retrieved from the file named by `DEFAULT_CHECKPOINT_FILE`, which the notebook sets to the most recent checkpoint file found in `DEFAULT_CHECKPOINT_DIR`.

During execution, each section restores the checkpoint at its beginning and saves it at its end, so a section can be run and restored several times. If a run fails, restore the closest checkpoint and resume from there.
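
The restore-at-start, save-at-end pattern can be sketched with the standard library alone. This is only an illustration under stated assumptions: `save_state`, `restore_state`, and `run_section` are hypothetical stand-ins for the notebook's `save_checkpoint` and `restore_checkpoint`, and a plain dictionary stands in for the document manager.

```python
import json
import tempfile
from pathlib import Path


def save_state(filename: str, state: dict) -> None:
    """Persist the whole state as JSON (hypothetical stand-in for save_checkpoint)."""
    Path(filename).write_text(json.dumps(state, indent=2), encoding="utf-8")


def restore_state(filename: str) -> dict:
    """Load the state back; start empty when no checkpoint exists yet."""
    path = Path(filename)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def run_section(filename: str, doc_id: str, content: str) -> dict:
    """One notebook section: restore, do the work, save."""
    state = restore_state(filename)  # beginning of the section
    state[doc_id] = content          # the section's work
    save_state(filename, state)      # end of the section
    return state


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = str(Path(tmp) / "documents.json")
    run_section(checkpoint_file, "transform_Terms", "first run")
    run_section(checkpoint_file, "transform_Names", "first run")
    # Re-running a section overwrites its own entry and keeps the others.
    final_state = run_section(checkpoint_file, "transform_Terms", "second run")
```

Because every section reads the file before writing it, a section that is re-run replaces only its own entry, which is what makes repeated runs safe in this sketch.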

Restore the checkpoint

In [6]:
# To run after classification
last_checkpoint = configuration.get_last_filename(config["DEFAULT_CHECKPOINT_DIR"], "documents", "json")

logger.info(f"{last_checkpoint=}")

config["DEFAULT_CHECKPOINT_FILE"] = last_checkpoint

manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-11-30 01:52:37 - INFO - last_checkpoint='../data/checkpoints/documents-2024-11-29-4.json'
2024-11-30 01:52:37 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:52:37 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-11-29-4.json.


## General functions and data structures

In [7]:
def save_prompts_samples(system_prompts, user_prompts, element_name, manager):
    manager.add_document(
        Document(
            id=f"prompt-system-transform_rules_{element_name.replace(' ', '_')}",
            type="prompt",
            content=system_prompts[0],
        )
    )

    manager.add_document(
        Document(
            id=f"prompt-user-transform_rules_{element_name.replace(' ', '_')}",
            type="prompt",
            content=user_prompts[0],
        )
    )

    logger.info(f"System prompts for {element_name}s: {len(system_prompts)}")
    logger.info(f"User prompts for {element_name}s: {len(user_prompts)}")

    save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

LLM response model

In [8]:
class TransformedStatement(BaseModel):
    doc_id: str = Field(..., description="Document ID associated with the statement.")
    statement_id: str = Field(..., description="A provided string that identifies the statement. e.g., '1', 'Person'.")
    statement_title: str = Field(..., description="Title of the statement.") 
    statement: str = Field(..., description="The statement to be transformed.")
    statement_sources: List[str] = Field(..., description="Sources of the statement.")
    templates_ids: List[str] = Field(..., description="List of template IDs.")
    transformed: str = Field(..., description="The transformed statement.")
    confidence: float = Field(..., description="Confidence of the transformation.")
    reason: str = Field(..., description="Reason for confidence score of the transformation.")

class TransformedStatements(BaseModel):
    TransformedStatements: List[TransformedStatement] = Field(..., description="List of transformed statements.")
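
To illustrate what the response model asks of each item, the sketch below checks a parsed JSON object by hand using only the standard library. It is not how pydantic validates internally: `REQUIRED_FIELDS` and `check_transformed_statement` are hypothetical names, the sample values are illustrative, and the range check on `confidence` mirrors the prompt's instruction rather than the model, which declares only `float`.

```python
# Field names and types copied from TransformedStatement.
REQUIRED_FIELDS = {
    "doc_id": str,
    "statement_id": str,
    "statement_title": str,
    "statement": str,
    "statement_sources": list,
    "templates_ids": list,
    "transformed": str,
    "confidence": float,
    "reason": str,
}


def check_transformed_statement(item: dict) -> list:
    """Return a list of human-readable problems; an empty list means the item passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in item:
            problems.append(f"missing field: {name}")
        elif not isinstance(item[name], expected_type):
            problems.append(f"wrong type for {name}: {type(item[name]).__name__}")
    confidence = item.get("confidence")
    if isinstance(confidence, float) and not 0.0 <= confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    return problems


# Illustrative item; the template ID and transformed text are placeholders.
good_item = {
    "doc_id": "§ 275.0-2",
    "statement_id": "1",
    "statement_title": "Service of process on non-resident entities",
    "statement": "A person may serve process ... (shortened)",
    "statement_sources": ["(a)"],
    "templates_ids": ["T1"],
    "transformed": "(illustrative placeholder)",
    "confidence": 0.9,
    "reason": "(illustrative placeholder)",
}

# Same item with an out-of-range confidence and no reason.
bad_item = {k: v for k, v in good_item.items() if k != "reason"}
bad_item["confidence"] = 1.5

good_problems = check_transformed_statement(good_item)
bad_problems = check_transformed_statement(bad_item)
```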

In [9]:
def transform_statement(element_name, user_prompts, system_prompts, manager):
    # Initialize an empty list to accumulate all responses
    all_responses = []
    elapse_times = []
    completions = []

    # Loop through each pair of user and system prompts with a counter
    for index, (user_prompt, system_prompt) in enumerate(
        zip(user_prompts, system_prompts), start=1
    ):
        logger.info(f"Processing transformation prompt {index} for {element_name}.")
        logger.debug(system_prompt)
        logger.debug(user_prompt)

        # Query the language model
        response, completion, elapse_time = query_instruct_llm(
            system_prompt=system_prompt,
            user_prompt=user_prompt,
            document_model=TransformedStatements,
            llm_model=config["LLM"]["MODEL"],
            temperature=config["LLM"]["TEMPERATURE"],
            max_tokens=config["LLM"]["MAX_TOKENS"],
        )

        logger.debug(response)

        # Accumulate the responses in the list
        all_responses.extend(response.TransformedStatements)
        elapse_times.append(elapse_time)
        completions.append(completion.dict())

        logger.info(f"Finished processing classification and templates prompt {index}.")

    # After the loop, create a single Document with all the accumulated responses
    doc = Document(
        id=f"transform_{element_name.replace(' ', '_')}s",
        type="llm_response_transform",
        content=all_responses,
        elapsed_times=elapse_times,
        completions=completions,
    )
    manager.add_document(doc)

    logger.info(f"{element_name}s: {len(all_responses)}")

    return all_responses
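
The accumulation pattern in `transform_statement` can be exercised without calling an LLM. The sketch below is an assumption-laden stand-in: `run_prompt_pairs` and `fake_query` are hypothetical, and the stub returns a list of strings and a fixed duration instead of a parsed model and a completion. It also adds an explicit length check, because `zip` silently stops at the shorter list.

```python
def run_prompt_pairs(user_prompts, system_prompts, query):
    """Call `query` once per prompt pair and flatten the results."""
    if len(user_prompts) != len(system_prompts):
        raise ValueError("user and system prompt lists differ in length")

    responses = []
    elapsed_times = []
    for user_prompt, system_prompt in zip(user_prompts, system_prompts):
        items, seconds = query(system_prompt, user_prompt)
        responses.extend(items)        # one call may return several items
        elapsed_times.append(seconds)  # exactly one timing per call
    return responses, elapsed_times


def fake_query(system_prompt, user_prompt):
    """Stub used instead of a real LLM call."""
    return [f"{system_prompt}:{user_prompt}"], 0.5


responses, elapsed_times = run_prompt_pairs(["u1", "u2"], ["s1", "s2"], fake_query)
```

Passing the query function as a parameter keeps the loop testable; in the notebook the same role is played by `query_instruct_llm`.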

In [10]:

def get_prompts_for_rule(rules, rule_template_formulation, data_dir):
    rule_template_provider = RulesTemplateProvider(data_dir)

    system_prompts = []
    user_prompts = []

    for rule in rules:
        element_name = rule.get("element_name")

        # Membership test: a string never compares equal to a list
        if element_name in ["Term", "Name"]:
            statement_key = "definition"
            statement_id_key = "signifier"
        else:
            statement_key = "statement"
            statement_id_key = "statement_id"

        # # Return templates and examples for fact types or all
        # if element_name == "Fact Type":
        #     return_forms = "fact_type"
        # else:
        #     return_forms = "rule"
        # logger.info(f"Processing {element_name} with return forms {return_forms}.")

        input_rule = {
            "doc_id": rule["doc_id"],
            f"{statement_id_key}": rule["statement_id"],
            "sources": rule["sources"],
            f"{statement_key}": rule.get("statement", rule.get("definition")),
            "templates_ids": rule["templates_ids"],
        }
        user_prompt = get_user_prompt_transform(element_name, input_rule)
        user_prompts.append(user_prompt)
        rule_templates_subtemplates = rule_template_provider.get_rules_template(rule["templates_ids"])
        system_prompt = get_system_prompt_transform(element_name,rule_template_formulation, rule_templates_subtemplates)
        system_prompts.append(system_prompt)
        logger.debug(system_prompt)
        logger.debug(user_prompt)

    logger.info(f"System prompts for {element_name}s: {len(system_prompts)}")
    logger.info(f"User prompts for {element_name}s: {len(user_prompts)}")

    return system_prompts, user_prompts, element_name
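
The way the input payload is keyed per element type can be shown in isolation. The sketch below is a simplified, hypothetical `build_input_rule` helper, not the notebook's function: it assumes that terms and names should be keyed by `signifier` and `definition`, and everything else by `statement_id` and `statement`. The template IDs and the shortened texts are placeholders.

```python
def build_input_rule(rule: dict) -> dict:
    """Build the payload sent in the user prompt for one element."""
    # A string never equals a list, so membership is the test that
    # separates terms and names from rules and fact types.
    is_definition = rule.get("element_name") in ("Term", "Name")
    statement_key = "definition" if is_definition else "statement"
    statement_id_key = "signifier" if is_definition else "statement_id"

    return {
        "doc_id": rule["doc_id"],
        statement_id_key: rule["statement_id"],
        "sources": rule["sources"],
        statement_key: rule.get("statement", rule.get("definition")),
        "templates_ids": rule["templates_ids"],
    }


term = {
    "element_name": "Term",
    "doc_id": "§ 275.0-2",
    "statement_id": "Managing agent",
    "definition": "Any person, including a trustee, who directs ... (shortened)",
    "sources": ["(b)(1)"],
    "templates_ids": ["T7"],  # placeholder ID
}
operative_rule = {
    "element_name": "Operative Rule",
    "doc_id": "§ 275.0-2",
    "statement_id": 1,
    "statement": "A person may serve process ... (shortened)",
    "sources": ["(a)"],
    "templates_ids": ["T1"],  # placeholder ID
}

term_payload = build_input_rule(term)
rule_payload = build_input_rule(operative_rule)
```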


## Datasets

Datasets used in the notebook.

### True tables

True tables are annotated or "golden" datasets in which entities have been manually identified and labeled within the original source data.

True tables for sections 275.0-2, 275.0-5, and 275.0-7.

Load the true tables for the transformation of operative rules, fact types, and terms.

In [11]:
with open(f"{config['DEFAULT_DATA_DIR']}/documents_true_table.json", 'r') as file:
    data = json.load(file)

    manager.add_document(
        Document.model_validate(data["transform_Operative_Rules|true_table"])
    )

    manager.add_document(
        Document.model_validate(data["transform_Fact_Types|true_table"])
    )

    manager.add_document(
        Document.model_validate(data["transform_Terms|true_table"])
    )

    # manager.add_document(
    #     Document.model_validate(data["transform_Names|true_table"])
    # )

Save the checkpoint

In [12]:
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-30 01:52:38 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:52:38 - INFO - Checkpoint saved.


### Elements to transform

Get expressions to transform

In [13]:
processor = DocumentProcessor(manager, merge=True)

pred_operative_rules = processor.get_rules()
pred_facts = processor.get_facts()
pred_terms = processor.get_terms(definition_filter="non_null")
pred_names = processor.get_names(definition_filter="non_null")

logger.debug(f"Rules: {pred_operative_rules}")
logger.debug(f"Facts: {pred_facts}")
logger.debug(f"Terms: {pred_terms}")
logger.debug(f"Names: {pred_names}")
logger.info(f"Rules to evaluate: {len(pred_operative_rules)}")
logger.info(f"Facts to evaluate: {len(pred_facts)}")
logger.info(f"Terms to evaluate: {len(pred_terms)}")
logger.info(f"Names to evaluate: {len(pred_names)}")

2024-11-30 01:52:38 - INFO - Rules to evaluate: 6
2024-11-30 01:52:38 - INFO - Facts to evaluate: 16
2024-11-30 01:52:38 - INFO - Terms to evaluate: 23
2024-11-30 01:52:38 - INFO - Names to evaluate: 5


element_dict={'doc_id': '§ 275.0-2', 'statement_id': 1, 'statement_title': 'Service of process on non-resident entities', 'statement': 'A person may serve process, pleadings, or other papers on a non-resident investment adviser, or on a non-resident general partner or non-resident managing agent of an investment adviser by serving any or all of its appointed agents.', 'sources': ['(a)'], 'terms': [{'term': 'Person', 'classification': 'Common Noun', 'confidence': 0.9, 'reason': 'The term is a general reference to an individual or entity.', 'extracted_confidence': 0.9, 'extracted_reason': 'The term is explicitly mentioned as the subject performing the action.'}, {'term': 'Non-resident investment adviser', 'classification': 'Common Noun', 'confidence': 0.9, 'reason': 'The term refers to a specific type of entity involved in the process.', 'extracted_confidence': 0.9, 'extracted_reason': 'The term is explicitly mentioned as the recipient of the action.'}, {'term': 'Non-resident general par

In [14]:
pred_terms

[{'doc_id': '§ 275.0-2',
  'statement_id': 'Managing agent',
  'definition': 'Any person, including a trustee, who directs or manages, or who participates in directing or managing, the affairs of any unincorporated organization or association other than a partnership.',
  'isLocalScope': True,
  'sources': ['(b)(1)'],
  'element_name': 'Term',
  'transformed': 'A managing agent is by definition a person who directs or manages, or who participates in directing or managing, the affairs of any unincorporated organization or association other than a partnership.',
  'type': 'Definitional',
  'subtype': 'Formal intensional definitions',
  'confidence': 0.9,
  'explanation': "The statement defines 'Managing agent' by specifying it as a type of 'person' with the characteristic of directing or managing the affairs of an unincorporated organization or association. The hypernym is 'person', and the qualifying clause is 'who directs or manages, or who participates in directing or managing, the af

## Prompt engineering

### System prompt

Each formulation is expressed using a template (WITT, 2012, p. 162).
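
The template notation used in this section (placeholders in angle brackets, option sets in braces separated by a bar, and a trailing bar for the null option) can be illustrated with a small expander. This is a sketch only: it handles non-nested option sets, the helper names are hypothetical, and the example template is made up in the style of the notation rather than taken from Witt's catalogue.

```python
import itertools
import re
from typing import Dict, List

OPTION_SET = re.compile(r"\{([^{}]*)\}")
PLACEHOLDER = re.compile(r"<([^<>]+)>")


def expand_options(template: str) -> List[str]:
    """List every statement shape allowed by non-nested {a|b|} option sets."""
    parts = OPTION_SET.split(template)
    literals = parts[0::2]                             # text outside braces
    choices = [body.split("|") for body in parts[1::2]]  # "not|" -> ["not", ""]

    variants = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for choice, literal in zip(combo, literals[1:]):
            pieces.extend([choice, literal])
        # Collapse the double space left behind by a null option.
        variants.append(" ".join("".join(pieces).split()))
    return variants


def fill_placeholders(text: str, values: Dict[str, str]) -> str:
    """Substitute each <placeholder> with the value supplied for it."""
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], text)


template = "Each <subject> {must|may} {not|} <verb phrase>"
variants = expand_options(template)
statement = fill_placeholders(
    variants[1],
    {"subject": "flight booking request", "verb phrase": "specify a departure date"},
)
```

Text outside brackets and braces ("Each" in this example) survives in every variant, which matches the rule that such text is included in every conforming statement.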

In [15]:
rule_template_formulation = """
# How to interpret the templates and subtemplates

Each formulation is expressed using a template, in which the various symbols have the following meanings:

1. Each item enclosed in "angle brackets" ("<" and ">") is a placeholder, in place of which any suitable text may be substituted. For example, any of the following may be substituted in place of <operative rule statement subject> (subtemplate):
    a. a term: for example, "flight booking request",
    b. a term followed by a qualifying clause: for example, "flight booking request for a one-way journey",
    c. a reference to a combination of items: for example, "combination of enrollment date and graduation date", with or without a qualifying clause,
    d. a reference to a set of items: for example, "set of passengers", with or without a qualifying clause.
2. Each pair of braces ("{" and "}") encloses a set of options (separated from each other by the bar symbol: "|"), one of which is included in the rule statement. For example,
3. If a pair of braces includes a bar symbol immediately before the closing brace, the null option is allowed: that is, you can, if necessary, include none of the options at that point in the rule statement.
4. Sets of options may be nested. For example, in each of the templates above
    a. a conditional clause may be included or omitted,
    b. if included, the conditional clause should be preceded by either "if" or "unless".
5. A further notation, introduced later in this section, uses square brackets to indicate that a syntactic element may be repeated indefinitely.
6. Any text not enclosed in either "angle brackets" or braces (i.e., "must", "not", "may", and "only") is included in every rule statement conforming to the relevant template.
"""

In [16]:
def get_system_prompt_transform(element_name, rule_template_formulation, rule_templates_subtemplates):
    statement_name = "definition" if element_name in ["Term", "Name"] else "statement"
    return f"""
Transform each given {element_name} {statement_name} into a structured format by matching it to the specified templates and subtemplates.

# Steps

1. **Summarize {statement_name}**: Summarize the given {element_name} {statement_name} to understand its structure and content.

2. **Use Template**:
   - For given expression, use the templates and subtemplates ({"Fact Type Form" if element_name in ["Fact Type", "Fact"] else "Rule Form"}) provided for transformation.
   - Determine the appropriate template or subtemplate based on the structure of the expression.
   
3. **Replace Placeholders**:
   - Substitute placeholders, such as `<term>`, `<verb phrase>`, `<conditional clause>`, etc., with suitable values as per the expression.
   - For terms and names, the statement_id is the term defined by the statement.
   
4. **Include Qualifying Details**:
   - Where placeholders, such as `<qualifying clause>`, require additional details (e.g., attributes or qualifiers to distinguish meaning), ensure that these are included appropriately as per the respective subtemplate.

5. **Transform into Structured Format**:
   - Once the transformation is complete, ensure it's in the correct template format.

6. **Output as Structured JSON**:
   - For every transformed expression generate a JSON object as per the specified output format.

7. **Review and Validate**:
   - Ensure accuracy in grammar and compliance with logical constructs when performing substitutions.
   - Ensure the generated JSON is in the correct template format.

8. **Assess the Transformation**:
   - Record the confidence level and reason for the confidence score in the JSON object.

{rule_template_formulation}

# Provided templates and subtemplates for transformation

{rule_templates_subtemplates}

# Output Format

[
    {{
      "doc_id": <doc_id>,
      "statement_id": <statement_id or signifier>,
      "statement_title": <statement_title>,
      "sources": [<source>],
      "statement": <statement or definition>,
      "templates_ids": [<templates_id>],
      "transformed": <transformed_statement>,
      "confidence": <confidence_level>,
      "reason": <reason_for_confidence>
    }},
    ...
]

- **`doc_id`**: The original identifier of the given document.
- **`statement_id or signifier`**: The original identifier of the given {statement_name}, e.g., '1', 'Person'.
- **`statement_title`**: The title of the given {statement_name}.
- **`sources`**: The original sources of the given {statement_name}.
- **`statement or definition`**: The original text of the given {statement_name}.
- **`templates_ids`**: The template(s) used for the transformation (e.g., T1, T2, etc.)
- **`transformed`**: The transformed statement according to template.
- **`confidence`**: The confidence level of the transformation, ranging from 0 to 1.
- **`reason`**: The reason for the confidence score.

# Notes
- Use only the provided templates and subtemplates for transformation.
- If a placeholder within an expression is not applicable or optional, consider whether it should be omitted or replaced by a suitable value.
- Each expression may involve nested levels of substitution as indicated by the subtemplate hierarchy (e.g., a qualifying clause that contains sub-elements).
- Ensure accuracy in grammar and compliance with logical constructs when performing substitutions.
"""

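The output format in the system prompt is written with doubled braces because the prompt is an f-string: `{{` and `}}` are emitted as literal braces, while single braces are interpolated. A minimal sketch, using a hypothetical `render_output_format` helper and a shortened format:

```python
def render_output_format(statement_name: str) -> str:
    """Render a shortened output-format block for the prompt."""
    # Doubled braces become literal "{" and "}"; {statement_name} is interpolated.
    return f"""[
    {{
      "statement": <{statement_name}>,
      "confidence": <confidence_level>
    }}
]"""


rendered = render_output_format("definition")
print(rendered)
```
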
### User prompt

In [17]:
def get_user_prompt_transform(element_name, rule):

    return f"""
# Here's the {element_name} {"definition" if element_name in ["Term", "Name"] else "statement"} you need to transform using template {rule.get("templates_ids")}.

{json.dumps(rule, indent=2)}
"""

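One detail of embedding the rule with `json.dumps` is that non-ASCII characters are escaped by default, so a `doc_id` such as `§ 275.0-2` reaches the prompt as `\u00a7 275.0-2`. The sketch below only shows the difference between the two settings; whether it affects the model's output was not measured here, and the sample rule is abbreviated.

```python
import json

rule = {"doc_id": "§ 275.0-2", "statement_id": "1", "sources": ["(a)"]}

# Default behaviour (ensure_ascii=True): "§" is written as an escape sequence.
escaped = json.dumps(rule, indent=2)

# ensure_ascii=False keeps the character exactly as typed.
readable = json.dumps(rule, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings parse back to the same dictionary, so the choice only changes how the text reads inside the prompt.
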
Save checkpoint

In [18]:
# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-30 01:52:38 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:52:38 - INFO - Checkpoint saved.


## Execution

Restore checkpoint

In [19]:
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-11-30 01:52:38 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:52:38 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-11-29-4.json.


### Operative rules

Get prompts for operative rules

In [20]:
system_prompts_operative_rules, user_prompts_operative_rules, element_name = (
    get_prompts_for_rule(
        rules=pred_operative_rules,
        rule_template_formulation=rule_template_formulation,
        data_dir=config["DEFAULT_DATA_DIR"],
    )
)

2024-11-30 01:52:38 - INFO - System prompts for Operative Rules: 6
2024-11-30 01:52:38 - INFO - User prompts for Operative Rules: 6


Save a sample of the system prompt and user prompt.

In [21]:
save_prompts_samples(
    system_prompts_operative_rules, user_prompts_operative_rules, element_name, manager
)

2024-11-30 01:52:38 - INFO - System prompts for Operative Rules: 6
2024-11-30 01:52:38 - INFO - User prompts for Operative Rules: 6
2024-11-30 01:52:38 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:52:38 - INFO - Checkpoint saved.


Call LLM to transform operative rules

In [22]:
responses_operative_rules = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_operative_rules,
    system_prompts=system_prompts_operative_rules,
    manager=manager,
)

logger.debug(f"{responses_operative_rules=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-30 01:52:38 - INFO - Processing transformation prompt 1 for Operative Rule.
2024-11-30 01:52:45 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:52:45 - INFO - Tokes used: CompletionUsage(completion_tokens=193, prompt_tokens=5022, total_tokens=5215, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:52:45 - INFO - Execution time for query_instruct_llm: 3.09 seconds
2024-11-30 01:52:45 - INFO - Finished processing classification and templates prompt 1.
2024-11-30 01:52:45 - INFO - Processing transformation prompt 2 for Operative Rule.
2024-11-30 01:52:48 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:52:48 - INFO - Tokes used: CompletionUsage(completion_tokens=223, prompt_tokens=5049, total_tokens=5272, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:52:48 - INFO - Execution time for query_instruct_llm: 3.25 seconds
2024-1

Average execution time 5s per prompt.

### Fact Types

Get prompts for facts.

In [23]:
system_prompts_facts, user_prompts_facts, element_name = get_prompts_for_rule(
    rules=pred_facts,
    rule_template_formulation=rule_template_formulation,
    data_dir=config["DEFAULT_DATA_DIR"],
)

2024-11-30 01:53:00 - INFO - System prompts for Fact Types: 16
2024-11-30 01:53:00 - INFO - User prompts for Fact Types: 16


Save a sample of the system prompt and user prompt.

In [24]:
save_prompts_samples(
    system_prompts_facts, user_prompts_facts, element_name, manager
)

2024-11-30 01:53:00 - INFO - System prompts for Fact Types: 16
2024-11-30 01:53:00 - INFO - User prompts for Fact Types: 16
2024-11-30 01:53:00 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:53:00 - INFO - Checkpoint saved.


Call LLM to transform facts

In [25]:
responses_facts = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_facts,
    system_prompts=system_prompts_facts,
    manager=manager,
)

logger.debug(f"{responses_facts=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-30 01:53:00 - INFO - Processing transformation prompt 1 for Fact Type.
2024-11-30 01:53:02 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:53:02 - INFO - Tokes used: CompletionUsage(completion_tokens=206, prompt_tokens=1837, total_tokens=2043, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:53:02 - INFO - Execution time for query_instruct_llm: 2.45 seconds
2024-11-30 01:53:02 - INFO - Finished processing classification and templates prompt 1.
2024-11-30 01:53:02 - INFO - Processing transformation prompt 2 for Fact Type.
2024-11-30 01:53:05 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:53:05 - INFO - Tokes used: CompletionUsage(completion_tokens=206, prompt_tokens=1846, total_tokens=2052, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:53:05 - INFO - Execution time for query_instruct_llm: 3.26 seconds
2024-11-30 01:53

Average execution time 5s per prompt.

### Terms

Get prompts for terms.

In [26]:
system_prompts_terms, user_prompts_terms, element_name = get_prompts_for_rule(
    rules=pred_terms,
    rule_template_formulation=rule_template_formulation,
    data_dir=config["DEFAULT_DATA_DIR"],
)

2024-11-30 01:53:50 - INFO - System prompts for Terms: 23
2024-11-30 01:53:50 - INFO - User prompts for Terms: 23


Save a sample of the system prompt and user prompt.

In [27]:
save_prompts_samples(
    system_prompts_terms, user_prompts_terms, element_name, manager
)

2024-11-30 01:53:50 - INFO - System prompts for Terms: 23
2024-11-30 01:53:50 - INFO - User prompts for Terms: 23
2024-11-30 01:53:50 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:53:50 - INFO - Checkpoint saved.


Call LLM to transform terms

In [28]:
responses_terms = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_terms,
    system_prompts=system_prompts_terms,
    manager=manager,
)

logger.debug(f"{responses_terms=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-30 01:53:50 - INFO - Processing transformation prompt 1 for Term.
2024-11-30 01:53:52 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:53:52 - INFO - Tokes used: CompletionUsage(completion_tokens=164, prompt_tokens=5079, total_tokens=5243, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:53:52 - INFO - Execution time for query_instruct_llm: 2.16 seconds
2024-11-30 01:53:52 - INFO - Finished processing classification and templates prompt 1.
2024-11-30 01:53:52 - INFO - Processing transformation prompt 2 for Term.
2024-11-30 01:53:58 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:53:58 - INFO - Tokes used: CompletionUsage(completion_tokens=185, prompt_tokens=1829, total_tokens=2014, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:53:58 - INFO - Execution time for query_instruct_llm: 2.11 seconds
2024-11-30 01:53:58 - INFO

Average execution time 4s per prompt.

### Names

Get prompts for names.

In [29]:
system_prompts_names, user_prompts_names, element_name = get_prompts_for_rule(
    rules=pred_names,
    rule_template_formulation=rule_template_formulation,
    data_dir=config["DEFAULT_DATA_DIR"],
)

2024-11-30 01:54:47 - INFO - System prompts for Names: 5
2024-11-30 01:54:47 - INFO - User prompts for Names: 5


Save a sample of the system prompt and user prompt.

In [30]:
save_prompts_samples(
    system_prompts_names, user_prompts_names, element_name, manager
)

2024-11-30 01:54:47 - INFO - System prompts for Names: 5
2024-11-30 01:54:47 - INFO - User prompts for Names: 5
2024-11-30 01:54:47 - INFO - DocumentManager state persisted to file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:54:47 - INFO - Checkpoint saved.


Call LLM to transform names

In [31]:
responses_names = transform_statement(
    element_name=element_name,
    user_prompts=user_prompts_names,
    system_prompts=system_prompts_names,
    manager=manager,
)

logger.debug(f"{responses_names=}")

# Persist the state to a file
save_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"], manager=manager)

2024-11-30 01:54:47 - INFO - Processing transformation prompt 1 for Name.
2024-11-30 01:54:50 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:54:50 - INFO - Tokes used: CompletionUsage(completion_tokens=170, prompt_tokens=5074, total_tokens=5244, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:54:50 - INFO - Execution time for query_instruct_llm: 2.78 seconds
2024-11-30 01:54:50 - INFO - Finished processing classification and templates prompt 1.
2024-11-30 01:54:50 - INFO - Processing transformation prompt 2 for Name.
2024-11-30 01:54:52 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-30 01:54:52 - INFO - Tokes used: CompletionUsage(completion_tokens=167, prompt_tokens=5071, total_tokens=5238, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-30 01:54:52 - INFO - Execution time for query_instruct_llm: 2.62 seconds
2024-11-30 01:54:52 - INFO

Average execution time 5s per prompt.

### Check missing transformations

Restore checkpoint

In [32]:
manager = restore_checkpoint(filename=config["DEFAULT_CHECKPOINT_FILE"])

2024-11-30 01:54:59 - INFO - DocumentManager restored from file: ../data/checkpoints/documents-2024-11-29-4.json
2024-11-30 01:54:59 - INFO - Checkpoint restored from ../data/checkpoints/documents-2024-11-29-4.json.


Evaluate elements for missing transformations

In [40]:
processor = DocumentProcessor(manager, merge=True)

pred_operative_rules = processor.get_rules()
pred_facts = processor.get_facts()
pred_terms = processor.get_terms(definition_filter="non_null")
pred_names = processor.get_names(definition_filter="non_null")

logger.debug(f"Rules: {pred_operative_rules}")
logger.debug(f"Facts: {pred_facts}")
logger.debug(f"Terms: {pred_terms}")
logger.debug(f"Names: {pred_names}")

data = [pred_facts, pred_terms, pred_names, pred_operative_rules]
data_names = ["pred_facts", "pred_terms", "pred_names", "pred_operative_rules"]

for element_list, element_name in zip(data, data_names):
    empty_transformed_elements = []
    for element in element_list:
        if not element.get("transformed"):
            logger.debug(f"{element_name} - {element.get('statement_id')}: {element.get('transformed')}")
            empty_transformed_elements.append(element)

    logger.info(f"Empty transformed {element_name}: {len(empty_transformed_elements)}/{len(element_list)}")


2024-11-30 01:59:03 - INFO - Empty transformed pred_facts: 0/16
2024-11-30 01:59:03 - INFO - Empty transformed pred_terms: 0/23
2024-11-30 01:59:03 - INFO - Empty transformed pred_names: 0/5
2024-11-30 01:59:03 - INFO - Empty transformed pred_operative_rules: 0/6


## Discussion

For the first part (prompt_classify_p1), the assigned confidence levels reflect a calibrated approach to expressions that involve multiple classifications and have no explicitly dominant rule type. For instance, when an expression primarily constrains data (Data rule) but also names specific parties (Party rule), Data receives a high confidence level and Party a moderate one, acknowledging its secondary relevance.

Similarly, expressions referencing roles such as “Secretary” or “interested person” without explicit party restrictions are assigned moderate confidence for Party classification due to interpretive ambiguity. Procedural elements that impact data handling, such as document forwarding, receive high confidence for Data rules; however, a moderate confidence level is assigned for Activity rules when procedural references are indirect. This methodology prioritizes primary rule types while accounting for the interpretive limits of secondary classifications.