# Project Demo
In this notebook we want to verify the total annual revenue of Volkswagen in 2021. I have uploaded an (intentionally wrong) dataset to the test instance of Dataland: https://test.dataland.com/companies/bc8171a7-a25f-435e-9ea8-a4ab299ae63d/frameworks/eutaxonomy-non-financials/16c10aa8-689c-4773-919e-bb493b700db5


## 1. Load the data from Dataland
To verify or reject the claim, we first need to load the data from Dataland using the Dataland API. Each dataset carries a unique identifier, which can be found, for example, in the dataset's URL. Dataland URLs follow the pattern https://dataland.com/companies/{company_id}/frameworks/{framework_id}/{data_id}. In our case, the data_id is "16c10aa8-689c-4773-919e-bb493b700db5".
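If you prefer not to copy the identifier by hand, the URL pattern described above can be split programmatically. The following is a minimal sketch; the helper name `extract_dataland_ids` is hypothetical and not part of the Dataland client, and it assumes the URL follows the pattern exactly.

```python
from urllib.parse import urlparse


def extract_dataland_ids(url: str) -> dict[str, str]:
    """Split a Dataland dataset URL into company_id, framework_id and data_id.

    Hypothetical helper: assumes the path looks like
    /companies/{company_id}/frameworks/{framework_id}/{data_id}.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 5 or parts[0] != "companies" or parts[2] != "frameworks":
        raise ValueError(f"Unexpected Dataland URL layout: {url}")
    return {"company_id": parts[1], "framework_id": parts[3], "data_id": parts[4]}


ids = extract_dataland_ids(
    "https://test.dataland.com/companies/bc8171a7-a25f-435e-9ea8-a4ab299ae63d"
    "/frameworks/eutaxonomy-non-financials/16c10aa8-689c-4773-919e-bb493b700db5"
)
print(ids["data_id"])  # 16c10aa8-689c-4773-919e-bb493b700db5
```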

In [1]:
from dataland_qa_lab.utils import config

conf = config.get_config()
dataland_client = conf.dataland_client

data_id = "16c10aa8-689c-4773-919e-bb493b700db5"

dataset = dataland_client.eu_taxonomy_nf_api.get_company_associated_eutaxonomy_non_financials_data(data_id=data_id)

A dataset may contain hundreds of datapoints. For simplicity, we removed all datapoints from this one except the total revenue.

In [2]:
revenue_datapoint = dataset.data.revenue.total_amount
revenue_datapoint.model_dump()

{'value': 230000000000.0,
 'quality': <QualityOptions.INCOMPLETE: 'Incomplete'>,
 'comment': '',
 'data_source': {'page': '174',
  'tag_name': None,
  'file_name': 'VW Annual Report',
  'file_reference': '64715c594210a1e87c54fb9254221e2a33d9a6b36af62bf34de608ba00c7ee1f'},
 'currency': 'EUR'}

To verify the datapoint, we need to check whether the specified revenue matches the revenue stated in the underlying data source. On Dataland, data sources (i.e., PDFs) are identified by their SHA-256 hash. We can use the Dataland API to download the file.

## 2. Load the data source from Dataland and convert it to text

In [3]:
document_bytes = dataland_client.documents_api.get_document(revenue_datapoint.data_source.file_reference)
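Since documents are identified by their SHA-256 hash, the download can be sanity-checked by hashing the received bytes. This is a sketch under the assumption that `file_reference` is the hexadecimal SHA-256 digest of the raw file content; the helper name `matches_file_reference` is hypothetical, and stand-in bytes are used here so the example runs on its own.

```python
import hashlib


def matches_file_reference(document_bytes: bytes, file_reference: str) -> bool:
    """Return True if the SHA-256 hex digest of the bytes equals the reference.

    Assumption: the reference is a hex digest computed over the raw file bytes.
    """
    return hashlib.sha256(document_bytes).hexdigest() == file_reference.lower()


# Stand-in bytes for illustration; in the notebook this would be `document_bytes`
# and `revenue_datapoint.data_source.file_reference`.
sample_bytes = b"example document"
sample_reference = hashlib.sha256(sample_bytes).hexdigest()

print(matches_file_reference(sample_bytes, sample_reference))  # True
print(matches_file_reference(b"tampered", sample_reference))   # False
```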

The raw PDF is not of much use to us. We need to extract its text to process it further (although you are welcome to experiment with vision-enabled LLMs instead). Extracting text from PDFs is hard, and tables are especially troublesome; see https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard if you are curious. Because of these challenges, we'll use Azure Document Intelligence to extract the text from the PDF.

The Document Intelligence API charges per page. The entire document is ~400 pages long. To save costs we will only analyze the page containing the revenue.

In [4]:
import io

import pypdf

full_document_byte_stream = io.BytesIO(document_bytes)
full_pdf = pypdf.PdfReader(full_document_byte_stream)

partial_document_byte_stream = io.BytesIO()
partial_pdf = pypdf.PdfWriter()

partial_pdf.add_page(full_pdf.get_page(int(revenue_datapoint.data_source.page) - 1))  # Correct for 0 offset
partial_pdf.write(partial_document_byte_stream)
partial_document_byte_stream.seek(0)
None
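The page reference on the datapoint is stored as a string and is 1-based, while pypdf addresses pages by 0-based index. A small guard can make that conversion explicit and catch out-of-range references before any paid API call is made. The helper `page_number_to_index` below is a hypothetical sketch that only handles a single page number, not ranges. Note also that the page position in the PDF may differ from the page number printed on the page (the extracted page in this notebook reports `PageNumber="172"` for page reference '174').

```python
def page_number_to_index(page: str, total_pages: int) -> int:
    """Convert a 1-based page reference (as a string) to a 0-based page index.

    Hypothetical helper: raises ValueError if the reference is not a single
    integer or lies outside the document.
    """
    index = int(page) - 1  # int() itself raises ValueError for non-numeric input
    if not 0 <= index < total_pages:
        raise ValueError(f"Page {page} is outside the document (1..{total_pages})")
    return index


# In the notebook, total_pages would be len(full_pdf.pages).
print(page_number_to_index("174", total_pages=400))  # 173
```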

Now we can use the Azure Document Intelligence API to extract the text from the PDF.

In [5]:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, ContentFormat
from azure.core.credentials import AzureKeyCredential

docintel_cred = AzureKeyCredential(conf.azure_docintel_api_key)
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=conf.azure_docintel_endpoint, credential=docintel_cred
)

poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=partial_document_byte_stream,
    content_type="application/octet-stream",
    output_content_format=ContentFormat.MARKDOWN,
)
result: AnalyzeResult = poller.result()

The result is a markdown document. We can display it directly in the notebook.

In [6]:
from IPython.display import Markdown, display

display(Markdown(result.content))

<!-- PageNumber="172" -->
<!-- PageHeader="EU Taxonomy" -->
<!-- PageHeader="Group Management Report" -->


# Sales revenue

The definition of turnover in the EU taxonomy corresponds
to sales revenue as reported in the IFRS consolidated financial
statements, which amounted to €250.2 billion in fiscal year
2021 (see also note 1 "Sales revenue" in the notes to the con-
solidated financial statements).

Of this total, €225.4 billion, or 90.1% of Group sales, was
attributable to economic activity 3.3 Manufacture of low-
carbon technologies for transport and classified as taxon-
omy-eligible. This includes sales revenue after sales allow-
ances from new and used vehicles, including motorcycles,
from genuine parts, from the rental and lease business, and
from interest and similar income, as well as sales revenue
directly related to vehicles, such as workshop and other ser-
vices.

Of the taxonomy-eligible sales revenue, €21.3 billion
meet the screening criteria used to measure the substantial
contribution to climate change mitigation. This includes all
of our all-electric vehicles, the majority of the plug-in hybrids,
and the buses meeting the EURO VI standard (Stage E).

Taking into account the DNSH criteria and minimum
safeguards, sales revenue of €21.1 billion attributable to our

passenger cars and light commercial vehicles, accounting for
8.5% of consolidated sales revenue, was taxonomy-aligned. Of
this amount, €14.6 billion, or 5.8% of consolidated sales
revenue, was attributable to our all-electric models (BEVs).

In the Power Engineering Business Area, the majority of
our taxonomy-eligible sales revenue was attributable to
economic activity 3.6 Manufacture of other low-carbon tech-
nologies (€2.4 billion). A further €13 million was contributed
by economic activity 9.1 Close to market research, develop-
ment and innovation. Our activities that fall under economic
activity 3.2 Manufacture of equipment for the production
and use of hydrogen recorded taxonomy-aligned sales reve-
nue of €5 million, taking into account the DNSH criteria and
minimum safeguards.

Of the Volkswagen Group's total sales revenue in fiscal year
2021,

\> €227.8 billion, or 91.0%, was taxonomy-eligible sales reve-
nue and

\> €21.2 billion, or 8.5%, was taxonomy-aligned sales revenue.


## SALES REVENUE


<table>
<tr>
<th rowspan="2">Economic activities</th>
<th colspan="2">SALES REVENUE</th>
<th colspan="2">SUBSTANTIAL CONTRIBUTION TO CLIMATE CHANGE MITIGATION</th>
<th>COMPLIANCE WITH DNSH CRITERIA</th>
<th>COMPLIANCE WITH MINIMUM SAFEGUARDS</th>
<th colspan="2">TAXONOMY-ALIGNED SALES REVENUE</th>
</tr>
<tr>
<th>€ million</th>
<th>%1</th>
<th>€ million</th>
<th>%1</th>
<th>Y/N</th>
<th>Y/N</th>
<th>€ million</th>
<th>%1</th>
</tr>
<tr>
<td>A. Taxonomy-eligible activities</td>
<td>227,787</td>
<td>91.0</td>
<td>21,268</td>
<td>8.5</td>
<td>Y/N</td>
<td>Y</td>
<td>21,152</td>
<td>8.5</td>
</tr>
<tr>
<td>Vehicle-related business</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3.3 Manufacture of low-carbon technologies for transport</td>
<td>225,380</td>
<td>90.1</td>
<td>21,264</td>
<td>8.5</td>
<td>Y/N</td>
<td>Y</td>
<td>21,147</td>
<td>8.5</td>
</tr>
<tr>
<td>of which taxonomy-aligned BEVs (passenger cars and light commercial vehicles)</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>Y</td>
<td>Y</td>
<td>14,579</td>
<td>5.8</td>
</tr>
<tr>
<td>Power Engineering</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3.2 Manufacture of equipment for the production and use of hydrogen</td>
<td>5</td>
<td>0.0</td>
<td>5</td>
<td>0.0</td>
<td>Y</td>
<td>Y</td>
<td>5</td>
<td>0.0</td>
</tr>
<tr>
<td>3.6 Manufacture of other low-carbon technologies</td>
<td>2,390</td>
<td>1.0</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
</tr>
<tr>
<td>9.1 Close to market research, development and innovation</td>
<td>13</td>
<td>0.0</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
<td>–</td>
</tr>
<tr>
<td>B. Taxonomy-non-eligible activities</td>
<td>22,413</td>
<td>9.0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total (A + B)</td>
<td>250,200</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</table>

1 All percentages relate to the Group's total sales revenue.


From the table, we can see that the total revenue is €250,200 million, i.e., €250.2 billion. Even this small example shows why extracting such data is challenging: you have to notice, for example, that all values in the table are given in € million.
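To make the unit issue concrete, here is a minimal sketch that converts a table cell or a figure from the running text into an absolute amount. The helper `to_absolute_amount` is hypothetical, and it assumes English number formatting (comma as thousands separator, dot as decimal point). `Decimal` is used instead of `float` to avoid rounding surprises.

```python
from decimal import Decimal

# Multipliers for the units that appear on the extracted page.
UNIT_MULTIPLIERS = {"million": Decimal(10) ** 6, "billion": Decimal(10) ** 9}


def to_absolute_amount(text: str, unit: str) -> int:
    """Convert a formatted number such as '250,200' (in the given unit) to an integer amount."""
    cleaned = text.replace(",", "").strip()
    return int(Decimal(cleaned) * UNIT_MULTIPLIERS[unit])


print(to_absolute_amount("250,200", "million"))  # 250200000000
print(to_absolute_amount("250.2", "billion"))    # 250200000000
```

Both the table value and the value from the running text resolve to the same absolute amount, which is the form in which the datapoint is stored on Dataland.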

## 3. Verify the claim using GPT-4o
To verify the claim we will use the GPT-4o model. We provide the model with the text extracted from the PDF and ask it to extract the total revenue. Afterward, we can compare the extracted value to the claimed value. To achieve this, we need to build a prompt that the model can understand.

In [7]:
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=conf.azure_openai_api_key, api_version="2024-07-01-preview", azure_endpoint=conf.azure_openai_endpoint
)

deployment_name = "dataland-prototyping-gpt4o"

prompt = f"""
You are an AI research Agent. As the agent, you answer questions briefly, succinctly, and factually.
Always justify your answer.

# Safety
- You **should always** reference factual statements to search results based on [relevant documents]
- Search results based on [relevant documents] may be incomplete or irrelevant. You do not make assumptions
  on the search results beyond strictly what's returned.
- If the search results based on [relevant documents] do not contain sufficient information to answer user
  message completely, you respond using the tool 'cannot_answer_question'
- Your responses should avoid being vague, controversial or off-topic.

# Task
Given the information from the [relevant documents], what is the total revenue of Volkswagen in 2021?

# Relevant Documents
{result.content}
"""

initial_openai_response = client.chat.completions.create(
    model=deployment_name,
    temperature=0,
    messages=[
        {"role": "system", "content": prompt},
    ],
)
initial_openai_response.choices[0].message.content

'The total revenue of Volkswagen in 2021 was €250.2 billion. This information is directly stated in the relevant document, which specifies that the sales revenue as reported in the IFRS consolidated financial statements amounted to €250.2 billion for the fiscal year 2021.'

We can see that the model answered the question correctly. However, the response is free text rather than structured data. We can use the tool-calling feature of the OpenAI API to force the model to return a structured response.

In [8]:
updated_openai_response = client.chat.completions.create(
    model=deployment_name,
    temperature=0,
    messages=[
        {"role": "system", "content": prompt},
    ],
    tool_choice="required",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "requested_information_precisely_found_in_relevant_documents",
                "description": "Submit the requested information. "
                "Use this function when the information is precisely stated in the relevant documents. ",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "answer_value": {
                            "type": "number",
                            "description": "The precise answer to the imposed question "
                            "without any thousand separators.",
                        },
                        "answer_currency": {
                            "type": "string",
                            "description": "The currency of the answer (e.g., EUR, USD)",
                        },
                        "justification": {"type": "string", "description": "The justification for the answer"},
                    },
                    "required": ["answer_value", "answer_currency", "justification"],
                },
            },
        }
    ],
)
tool_call = updated_openai_response.choices[0].message.tool_calls[0].function
tool_call

Function(arguments='{"answer_value":250200000000,"answer_currency":"EUR","justification":"The total revenue of Volkswagen in 2021 was €250.2 billion, as stated in the Group Management Report."}', name='requested_information_precisely_found_in_relevant_documents')

In [9]:
import json

parsed_tool_arguments = json.loads(tool_call.arguments)
extracted_revenue = parsed_tool_arguments["answer_value"]
extracted_revenue

250200000000
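The model's tool-call arguments arrive as a JSON string, so it can be worth checking them before trusting the values. The sketch below is one possible validation step; the helper `parse_tool_arguments` is hypothetical, and the field names are taken from the tool schema defined above.

```python
import json

# Expected fields and their Python types, mirroring the tool's JSON schema.
REQUIRED_FIELDS = {
    "answer_value": (int, float),
    "answer_currency": str,
    "justification": str,
}


def parse_tool_arguments(raw_arguments: str) -> dict:
    """Parse the JSON arguments of a tool call and check required fields and types."""
    parsed = json.loads(raw_arguments)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in parsed:
            raise ValueError(f"Missing field: {field}")
        value = parsed[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"Field {field} has unexpected type {type(value).__name__}")
    return parsed


raw = '{"answer_value":250200000000,"answer_currency":"EUR","justification":"Stated in the report."}'
checked = parse_tool_arguments(raw)
print(checked["answer_value"], checked["answer_currency"])  # 250200000000 EUR
```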

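This looks promising and works in this simple scenario. However, be aware that GPT-4o is unreliable at performing calculations, a problem you will likely need to tackle later ;).
GPT-4o can also make many other mistakes, even though it worked well in this case. Note also that the prompt refers to a tool named 'cannot_answer_question', but only one tool is registered above; with tool_choice="required" the model therefore cannot decline to answer.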
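One way to sidestep the calculation weakness is to let the model only extract numbers and do the arithmetic in Python. As an illustration, the sketch below cross-checks the totals in the table extracted earlier (all values in € million). The helper `sums_to_total` is hypothetical; the tolerance parameter is there because reported figures are rounded.

```python
def sums_to_total(parts: list[int], total: int, tolerance: int = 0) -> bool:
    """Check whether the parts add up to the total, allowing a small rounding tolerance."""
    return abs(sum(parts) - total) <= tolerance


# A (taxonomy-eligible) + B (non-eligible) = Total (A + B)
print(sums_to_total([227_787, 22_413], 250_200))  # True

# Activities 3.3, 3.2, 3.6 and 9.1 versus the reported eligible total:
# the rounded rows add up to 227,788, one more than the reported 227,787.
print(sums_to_total([225_380, 5, 2_390, 13], 227_787))               # False
print(sums_to_total([225_380, 5, 2_390, 13], 227_787, tolerance=1))  # True
```

The second check shows why an exact comparison can be too strict for rounded report figures.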

In [10]:
print(f"Original Value: \t{revenue_datapoint.value}")
print(f"Extracted Value: \t{extracted_revenue}")

Original Value: 	230000000000.0
Extracted Value: 	250200000000
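Comparing the two printed numbers by eye works for a single datapoint, but a programmatic check scales better. The following sketch uses `math.isclose` so that a float and an integer representing the same amount compare as equal; the helper names and the default tolerance are assumptions, not something Dataland prescribes.

```python
import math


def values_match(claimed: float, extracted: float, rel_tol: float = 1e-9) -> bool:
    """Return True if the claimed and extracted values agree within a relative tolerance."""
    return math.isclose(claimed, extracted, rel_tol=rel_tol)


def relative_deviation(claimed: float, extracted: float) -> float:
    """Deviation of the claimed value from the extracted value, as a fraction of the latter."""
    return abs(claimed - extracted) / abs(extracted)


print(values_match(230000000000.0, 250200000000))  # False
print(values_match(250200000000.0, 250200000000))  # True
print(f"{relative_deviation(230000000000.0, 250200000000):.1%}")  # 8.1%
```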


We can directly see that the values do not match, so the claim is incorrect. We report this back to the data provider by creating a so-called QA Report.

## 4. Creating and submitting a QA Report

In [11]:
from dataland_qa.models.eutaxonomy_non_financials_data import EutaxonomyNonFinancialsData
from dataland_qa.models.qa_report_data_point_verdict import QaReportDataPointVerdict

selected_qa_report = EutaxonomyNonFinancialsData.from_dict(
    {
        "revenue": {
            "totalAmount": {
                "comment": "The total revenue is incorrect. The correct value is 250200000000 €",
                "verdict": QaReportDataPointVerdict.QAREJECTED,
                "correctedData": {
                    "value": extracted_revenue,
                    "quality": "Reported",
                    "comment": parsed_tool_arguments["justification"],
                    "currency": parsed_tool_arguments["answer_currency"],
                    "dataSource": revenue_datapoint.data_source.model_dump(by_alias=True),
                },
            }
        }
    }
)
dataland_client.eu_taxonomy_nf_qa_api.post_eutaxonomy_non_financials_data_qa_report(data_id, selected_qa_report)

QaReportMetaInformation(data_id='16c10aa8-689c-4773-919e-bb493b700db5', data_type='eutaxonomy-non-financials', qa_report_id='ac544179-f0de-4c1b-99c1-69992d4da295', reporter_user_id='9697edcb-dab7-41f1-aa69-97338e7300fe', upload_time=1728994823481, active=True)

The created QA Report is now available to the data provider. They can review the report and adapt their data if necessary.