# Entity Extraction with Structured Output

Databricks doc: https://docs.databricks.com/en/machine-learning/model-serving/structured-outputs.html

OpenAI doc: https://platform.openai.com/docs/guides/function-calling

In [0]:
%pip install mlflow openai --quiet

dbutils.library.restartPython()

In [0]:
%run ../rag_app_sample_code/A_POC_app/pdf_uc_volume/00_config

## Preparation

In [0]:
# get Databricks credentials
import openai
from mlflow.utils.databricks_utils import get_databricks_host_creds

creds = get_databricks_host_creds()

In [0]:
# use openai's client to make calls to Databricks Foundation Model API
client = openai.OpenAI(
    api_key=creds.token,
    base_url=creds.host + '/serving-endpoints',
)

model = "databricks-meta-llama-3-1-70b-instruct"

In [0]:
parsed_docs = spark.table(destination_tables_config["parsed_docs_table_name"])

In [0]:
# display(parsed_docs)

In [0]:
import pyspark.sql.functions as F
_path = "dbfs:/Volumes/felixflory/ey_dbs_workshop_2024_10/raw_data/project_churches/Project Churches - FINAL Red Flag Report 181121.pdf"
F.lit(_path)

In [0]:
# filter on the document path first, then keep only the parsed contents
sample = (
  parsed_docs
  .where(F.col("path") == F.lit(_path))
  .select("doc_parsed_contents")
  .take(1)[0].doc_parsed_contents['parsed_content'])

In [0]:
# print(sample[:100])

## `Structured output` with Llama 3.1 70b

* To extract information from a piece of text as structured output, the `response_format` parameter is a good fit.
* It constrains the output of the LLM to valid JSON that follows the schema you provide.
* At the time of writing, out of all the models that we offer in our Foundation Model API, only `databricks-meta-llama-3-1-70b-instruct` and the 405B version have `function-calling` enabled.
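
To make the idea of "constraining the output" concrete, here is a minimal sketch of a helper that wraps a set of JSON Schema properties in the `response_format` envelope. The helper name `build_response_format`, the schema name `company_kpis`, and the two example fields are hypothetical. `required` and `additionalProperties` are standard JSON Schema keywords, but whether a given serving endpoint honours them, or the `strict` flag, is an assumption you should check against the documentation linked at the top.

```python
from typing import Any, Dict


def build_response_format(
    name: str, properties: Dict[str, Dict[str, Any]], strict: bool = True
) -> Dict[str, Any]:
    """Wrap JSON Schema properties in a `response_format` dictionary."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "schema": {
                "type": "object",
                "properties": properties,
                # Assumption: every field is mandatory and no extra keys are allowed.
                "required": list(properties),
                "additionalProperties": False,
            },
            # Assumption: the endpoint accepts a `strict` flag.
            "strict": strict,
        },
    }


# Hypothetical schema, for illustration only.
example_format = build_response_format(
    "company_kpis",
    {
        "company_name": {"type": "string", "description": "Legal name of the company"},
        "fiscal_year": {"type": "integer", "description": "Fiscal year of the figures"},
    },
)
```

The resulting dictionary has the same shape as the one passed to `response_format` in `client.chat.completions.create`.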

In [0]:
# define the JSON schema that the response must follow
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "get_EBITDA_kpi_1",
        # "description": "Get EBITDA kpi values",
        "schema": {
            "type": "object",
            "properties": {
                "reported_EBITDA": {
                    "type": "string",
                    "description": "Reported EBITDA for the year 2021",
                },
                "adjusted_EBITDA": {
                    "type": "string",
                    "description": "Adjusted EBITDA for the year 2021. Source: Management information & EY analysis",
                },
                "min_adjusted_NWC": {
                    "type": "array", "items": {"type": "integer"},
                    "description": "4 values of the 'Min. adjusted NWC'. Source: Management information & EY analysis",
                },
                "max_adjusted_NWC": {
                    "type": "array", "items": {"type": "integer"},
                    "description": "4 values of the 'Max. adjusted NWC'. Source: Management information & EY analysis",
                },
            },
        },
    },
}

Note that only a JSON schema is defined here, not a function. The call below does not invoke anything: the model returns a JSON object that follows the schema, and the fields of that object are the extracted entities.
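
Because the schema above does not list any `required` fields, a conforming response may still omit some of them. The following is a minimal sketch of a plain-Python sanity check on the returned JSON string. The helper name `check_kpi_payload` and the sample payload are hypothetical, and the values in the payload are made-up placeholders, not figures from any report.

```python
import json

# Expected top-level fields and their Python types, mirroring the schema above.
EXPECTED_FIELDS = {
    "reported_EBITDA": str,
    "adjusted_EBITDA": str,
    "min_adjusted_NWC": list,
    "max_adjusted_NWC": list,
}


def check_kpi_payload(raw_json: str) -> list:
    """Return a list of problems found in the payload; empty means none found."""
    payload = json.loads(raw_json)
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
        elif expected_type is list and not all(
            # bool is a subclass of int in Python, so exclude it explicitly
            isinstance(v, int) and not isinstance(v, bool)
            for v in payload[field]
        ):
            problems.append(f"non-integer item in {field}")
    return problems


# Hypothetical payload with placeholder values; one field is deliberately left out.
fake_content = json.dumps({
    "reported_EBITDA": "1.0",
    "adjusted_EBITDA": "2.0",
    "min_adjusted_NWC": [1, 2, 3, 4],
})
problems = check_kpi_payload(fake_content)
```

For stricter validation, a full JSON Schema validator could replace this hand-written check; the sketch only covers the four fields defined in this notebook.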

In [0]:
# model = "databricks-meta-llama-3-1-70b-instruct"
model = "Yash_GPT_4o"

In [0]:
messages = [{
        "role": "system",
        "content": "You are an expert at structured data extraction. You will be given unstructured text from a document and should convert it into the given structure."
      },
      {
        "role": "user",
        "content": f"Given the following information, please provide kpi information: \n{sample}\n"
      }]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.0,
    response_format=response_format,
    # tools=[t],
    # tool_choice="required",
)

In [0]:
# import json  # needed before uncommenting the line below
# print(json.dumps(json.loads(response.choices[0].message.model_dump()['content']), indent=2))

In [0]:
# response.to_dict()
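
The commented-out cells above inspect the raw response. As a rough sketch of how the extracted entities could be pulled out defensively, the helper below reads `response.choices[0].message.content` and parses it as JSON, returning `None` when the content is empty or not valid JSON. The helper name `extract_entities` is hypothetical, and the `SimpleNamespace` object only imitates the shape of a chat completion response so the example runs without calling an endpoint.

```python
import json
from types import SimpleNamespace


def extract_entities(response):
    """Parse the first choice's message content as JSON, or return None."""
    content = response.choices[0].message.content
    if not content:
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


# Stand-in object with the same attribute path as a chat completion response.
# The content is a made-up placeholder, not real model output.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content='{"reported_EBITDA": "n/a"}')
        )
    ]
)

entities = extract_entities(fake_response)
if entities is not None:
    print(json.dumps(entities, indent=2))
```

In the notebook, the same helper would be called as `extract_entities(response)` on the object returned by `client.chat.completions.create`.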