In [1]:
%run base.ipynb

# Project Demo
In this notebook, we verify the total annual revenue reported for Volkswagen in 2021. I have uploaded an intentionally incorrect dataset to the test instance of Dataland: https://test.dataland.com/companies/6b507b7d-33ca-41fe-b587-b9defa227468/frameworks/eutaxonomy-non-financials/e45a1f20-17f1-41e8-99b2-8fb641e08c88


## 1. Load the data from Dataland
To verify or reject the claim, we first need to load the data from Dataland using the Dataland API. Every dataset has a unique identifier, which can be found, for example, in its URL. Dataland URLs follow the pattern https://dataland.com/companies/{company_id}/frameworks/{framework_id}/{data_id}. In our case, the data_id is "e45a1f20-17f1-41e8-99b2-8fb641e08c88".
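
As a minimal sketch, the identifiers can be pulled out of such a URL with the standard library. The helper name `parse_dataland_url` is hypothetical and not part of the Dataland client; it only assumes the URL pattern described above.

```python
from urllib.parse import urlparse


def parse_dataland_url(url: str) -> dict:
    """Split a Dataland dataset URL into its identifiers (hypothetical helper)."""
    # Expected path layout: /companies/{company_id}/frameworks/{framework_id}/{data_id}
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 5 or parts[0] != "companies" or parts[2] != "frameworks":
        raise ValueError(f"Unexpected Dataland URL layout: {url}")
    return {"company_id": parts[1], "framework_id": parts[3], "data_id": parts[4]}


ids = parse_dataland_url(
    "https://test.dataland.com/companies/6b507b7d-33ca-41fe-b587-b9defa227468"
    "/frameworks/eutaxonomy-non-financials/e45a1f20-17f1-41e8-99b2-8fb641e08c88"
)
print(ids["data_id"])
```

Parsing the path instead of copying the identifier by hand reduces the risk of pasting the wrong ID into the API call.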

In [1]:
from dataland_qa_lab.utils import config

conf = config.get_config()
dataland_client = conf.dataland_client

data_id = "e45a1f20-17f1-41e8-99b2-8fb641e08c88"

dataset = dataland_client.eu_taxonomy_nf_api.get_company_associated_eutaxonomy_non_financials_data(data_id=data_id)

A dataset may contain hundreds of datapoints. For simplicity, we removed all datapoints from this one except the total revenue.

In [None]:
revenue_datapoint = dataset.data.revenue.total_amount
revenue_datapoint.model_dump()

To verify the datapoint, we need to check whether the specified revenue matches the revenue stated in the underlying data source. On Dataland, data sources (i.e., PDFs) are identified by their SHA-256 hash. We can use the Dataland API to download the file.

## 2. Load the data source from Dataland and convert it to text

In [3]:
document_bytes = dataland_client.documents_api.get_document(revenue_datapoint.data_source.file_reference)
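
Because documents are identified by their SHA-256 hash, a download can be sanity-checked locally. The sketch below assumes that the file reference is the hexadecimal SHA-256 digest of the file contents; the helper name and the sample bytes are illustrative only.

```python
import hashlib


def matches_sha256(content: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of `content` equals `expected_hex`."""
    return hashlib.sha256(content).hexdigest() == expected_hex.strip().lower()


# Stand-in bytes; in the notebook this would be `document_bytes`,
# compared against `revenue_datapoint.data_source.file_reference`.
sample = b"stand-in for the PDF bytes"
sample_hash = hashlib.sha256(sample).hexdigest()

print(matches_sha256(sample, sample_hash))
print(matches_sha256(sample + b" (modified)", sample_hash))
```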

The raw PDF is not of much use to us. We need to extract the text from the PDF to process it further (although you are welcome to experiment with using vision-enabled LLMs instead). Extracting text from PDFs is very challenging. Dealing with tables is especially troublesome. Take a look at https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard if you are curious. Due to these challenges, we'll use Azure Document Intelligence to extract the text from the PDF. 

The Document Intelligence API charges per page. The entire document is ~400 pages long. To save costs we will only analyze the page containing the revenue.

In [4]:
import io

import pypdf

full_document_byte_stream = io.BytesIO(document_bytes)
full_pdf = pypdf.PdfReader(full_document_byte_stream)

partial_document_byte_stream = io.BytesIO()
partial_pdf = pypdf.PdfWriter()

partial_pdf.add_page(full_pdf.get_page(int(revenue_datapoint.data_source.page) - 1))  # Correct for 0 offset
partial_pdf.write(partial_document_byte_stream)
partial_document_byte_stream.seek(0)
None
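
The off-by-one correction above is easy to get wrong, so it can help to isolate it. The following is a small sketch of a hypothetical helper that converts a 1-based page reference (stored as text) into a 0-based index and rejects out-of-range values; it assumes the page reference is a single page number.

```python
def to_zero_based_page(page: str, page_count: int) -> int:
    """Convert a 1-based page reference into a 0-based page index."""
    page_number = int(page)
    if not 1 <= page_number <= page_count:
        raise ValueError(f"Page {page_number} is outside the document (1..{page_count})")
    return page_number - 1


# Illustrative values: page "12" of a 400-page document.
print(to_zero_based_page("12", page_count=400))
```

With pypdf, the page count would come from `len(full_pdf.pages)`.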

Now we can use the Azure Document Intelligence API to extract the text from the PDF.

In [5]:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, ContentFormat
from azure.core.credentials import AzureKeyCredential

docintel_cred = AzureKeyCredential(conf.azure_docintel_api_key)
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=conf.azure_docintel_endpoint, credential=docintel_cred
)

poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=partial_document_byte_stream,
    content_type="application/octet-stream",
    output_content_format=ContentFormat.MARKDOWN,
)
result: AnalyzeResult = poller.result()

The result is a markdown document. We can display it directly in the notebook.

In [None]:
from IPython.display import Markdown, display

display(Markdown(result.content))

From the table, we can see that the total revenue is €250,200 million. Even this simple example shows why extracting such data is challenging: for instance, you have to notice that all values in the table are given in € million.
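
To make the unit handling explicit, the table value can be scaled to base units in code. This is a sketch under two assumptions: the comma is a thousands separator, and the unit label has already been read from the table header. The helper and the `UNIT_FACTORS` mapping are hypothetical.

```python
from decimal import Decimal

# Scaling factors for unit labels that commonly appear in table headers.
UNIT_FACTORS = {"million": Decimal(10) ** 6, "billion": Decimal(10) ** 9}


def to_base_units(table_value: str, unit: str) -> int:
    """Scale a value such as '250,200' (in the given unit) to whole currency units."""
    cleaned = table_value.replace(",", "")  # assumes ',' is a thousands separator
    return int(Decimal(cleaned) * UNIT_FACTORS[unit])


print(to_base_units("250,200", "million"))  # 250200000000
```

`Decimal` is used so that the scaling does not introduce floating-point rounding.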

## 3. Verify the claim using GPT-4o
To verify the claim we will use the GPT-4o model. We provide the model with the text extracted from the PDF and ask it to extract the total revenue. Afterward, we can compare the extracted value to the claimed value. To achieve this, we need to build a prompt that the model can understand.

In [None]:
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=conf.azure_openai_api_key, api_version="2024-07-01-preview", azure_endpoint=conf.azure_openai_endpoint
)

deployment_name = "gpt-4o"

prompt = f"""
You are an AI research Agent. As the agent, you answer questions briefly, succinctly, and factually.
Always justify your answer.

# Safety
- You **should always** reference factual statements to search results based on [relevant documents]
- Search results based on [relevant documents] may be incomplete or irrelevant. You do not make assumptions
  on the search results beyond strictly what's returned.
- If the search results based on [relevant documents] do not contain sufficient information to answer user
  message completely, you respond using the tool 'cannot_answer_question'
- Your responses should avoid being vague, controversial or off-topic.

# Task
Given the information from the [relevant documents], what is the total revenue of Volkswagen in 2021?

# Relevant Documents
{result.content}
"""

initial_openai_response = client.chat.completions.create(
    model=deployment_name,
    temperature=0,
    messages=[
        {"role": "system", "content": prompt},
    ],
)
initial_openai_response.choices[0].message.content

We can see that the model has answered the question correctly. However, the response is not given in a structured way. We can use the tool calling feature of OpenAI to force the model to provide a structured response.

In [None]:
updated_openai_response = client.chat.completions.create(
    model=deployment_name,
    temperature=0,
    messages=[
        {"role": "system", "content": prompt},
    ],
    tool_choice="required",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "requested_information_precisely_found_in_relevant_documents",
                "description": "Submit the requested information. "
                "Use this function when the information is precisely stated in the relevant documents. ",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "answer_value": {
                            "type": "number",
                            "description": "The precise answer to the imposed question "
                            "without any thousand separators.",
                        },
                        "answer_currency": {
                            "type": "string",
                            "description": "The currency of the answer (e.g., EUR, USD)",
                        },
                        "justification": {"type": "string", "description": "The justification for the answer"},
                    },
                    "required": ["answer_value", "answer_currency", "justification"],
                },
            },
        }
    ],
)
tool_call = updated_openai_response.choices[0].message.tool_calls[0].function
tool_call
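
Note that the prompt tells the model to use a tool named `cannot_answer_question`, while the request above registers only one tool. A second tool definition could look like the sketch below. The `reason` parameter and the `interpret_tool_call` helper are assumptions for illustration, not part of the original notebook.

```python
import json
from typing import Optional

# Hypothetical second tool, to be appended to the `tools` list of the request.
cannot_answer_tool = {
    "type": "function",
    "function": {
        "name": "cannot_answer_question",
        "description": "Use this function when the relevant documents "
        "do not contain the requested information.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Why the question cannot be answered"},
            },
            "required": ["reason"],
        },
    },
}


def interpret_tool_call(name: str, arguments: str) -> Optional[dict]:
    """Return the parsed answer, or None if the model declined to answer."""
    if name == cannot_answer_tool["function"]["name"]:
        return None
    return json.loads(arguments)


# Made-up argument strings, not actual model output.
print(interpret_tool_call("cannot_answer_question", '{"reason": "No revenue table found."}'))
print(interpret_tool_call("requested_information_precisely_found_in_relevant_documents", '{"answer_value": 1}'))
```

With both tools registered and `tool_choice="required"`, the model must pick one of them, so the calling code should check `tool_call.name` before reading the arguments.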

In [None]:
import json

parsed_tool_arguments = json.loads(tool_call.arguments)
extracted_revenue = parsed_tool_arguments["answer_value"]
extracted_revenue
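
Tool arguments arrive as a JSON string generated by the model, so it is worth validating them before use. The following is a minimal sketch; the helper name is hypothetical and the example payload is made up for illustration rather than actual model output.

```python
import json

REQUIRED_KEYS = ("answer_value", "answer_currency", "justification")


def parse_answer(arguments: str) -> dict:
    """Parse tool-call arguments and check that the expected fields are present."""
    parsed = json.loads(arguments)
    missing = [key for key in REQUIRED_KEYS if key not in parsed]
    if missing:
        raise ValueError(f"Tool call is missing fields: {missing}")
    value = parsed["answer_value"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("answer_value must be a number")
    return parsed


# Illustrative payload only.
example_arguments = (
    '{"answer_value": 250200000000, "answer_currency": "EUR", '
    '"justification": "Stated in the income statement table."}'
)
print(parse_answer(example_arguments)["answer_value"])
```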

This looks very promising and works in this very simple scenario. Be aware, however, that GPT-4o is unreliable at arithmetic, which is a problem you will likely need to tackle later ;).
GPT-4o can also make many other kinds of mistakes, even though it worked well in this case.

In [None]:
print(f"Original Value: \t{revenue_datapoint.value}")
print(f"Extracted Value: \t{extracted_revenue}")

We can see directly that the values do not match. Therefore, the claim is incorrect. We report this back to the data provider by creating a so-called QA Report.
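
Rather than comparing the two values by eye, the check can be done in code. This is a sketch with hypothetical helper names; the numbers in the usage example are placeholders, not the values from the dataset.

```python
import math


def values_match(reported: float, extracted: float, rel_tol: float = 1e-9) -> bool:
    """Compare two amounts, tolerating tiny floating-point differences."""
    return math.isclose(reported, extracted, rel_tol=rel_tol)


def build_comment(reported: float, extracted: float, currency: str) -> str:
    """Produce a short human-readable comment for a QA report."""
    if values_match(reported, extracted):
        return "The reported value matches the source document."
    return (
        f"The reported value {reported} {currency} does not match the source document, "
        f"which states {extracted} {currency}."
    )


# Placeholder numbers for illustration only.
print(values_match(1000.0, 1250.0))
print(build_comment(1000.0, 1250.0, "EUR"))
```

Generating the comment from the extracted value keeps the report text consistent with the corrected data.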

## 4. Creating and submitting a QA Report

In [None]:
from dataland_qa.models.currency_data_point import CurrencyDataPoint
from dataland_qa.models.eutaxonomy_non_financials_data import EutaxonomyNonFinancialsData
from dataland_qa.models.eutaxonomy_non_financials_revenue import EutaxonomyNonFinancialsRevenue
from dataland_qa.models.extended_document_reference import ExtendedDocumentReference
from dataland_qa.models.qa_report_data_point_currency_data_point import QaReportDataPointCurrencyDataPoint
from dataland_qa.models.qa_report_data_point_verdict import QaReportDataPointVerdict

selected_qa_report = EutaxonomyNonFinancialsData(
    revenue=EutaxonomyNonFinancialsRevenue(
        totalAmount=QaReportDataPointCurrencyDataPoint(
            comment="The total revenue is incorrect. The correct value is 250200000000 €",
            verdict=QaReportDataPointVerdict.QAREJECTED,
            correctedData=CurrencyDataPoint(
                value=extracted_revenue,
                quality="Reported",
                comment=parsed_tool_arguments["justification"],
                currency=parsed_tool_arguments["answer_currency"],
                dataSource=ExtendedDocumentReference.from_dict(revenue_datapoint.data_source.model_dump(by_alias=True)),
            ),
        )
    )
)
dataland_client.eu_taxonomy_nf_qa_api.post_eutaxonomy_non_financials_data_qa_report(data_id, selected_qa_report)

The created QA Report is now available to the data provider. They can review the report and adapt their data if necessary.