<a href="https://colab.research.google.com/github/dataforgoodfr/12_taxobservatory/blob/main/notebooks/llamaindex_table_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-end table extraction using LlamaParse and LlamaIndex

This notebook implements the full table extraction workflow (PDF parsing, indexing, retrieval, and querying) using LlamaParse and LlamaIndex:
* Complex queries such as "pull all the metrics for all the countries" work reasonably well on fairly complex PDF files with many pages and tables, although this needs more testing.
* Simple queries, interestingly, seem harder. For example, querying the 2018 ENI report for the profit (or loss) before income tax in France yields an incorrect result (total revenues are returned instead of profit).
* To do: pull the currency, etc.

⚠ This notebook uses OpenAI models and therefore requires an OpenAI API key, plus a LlamaCloud API key for LlamaParse. Note that LlamaIndex also supports other model providers and local models.

In [None]:
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse
!pip install huggingface_hub

# Selecting the PDF file

In [2]:
#!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10q/uber_10q_march_2022.pdf' -O './uber_10q_march_2022.pdf'

#filename = 'Acciona_2020_CbCR_1.pdf'
#filename = 'Adecco_2021_CbCR_21.pdf'
#filename = 'AXA_2018_CbCR_15.pdf'
filename = 'ENI_2018_CbCR_12-13.pdf'

from huggingface_hub import hf_hub_download
document_path = hf_hub_download(repo_id="DataForGood/taxobservatory_data", filename=filename, repo_type="dataset")

ENI_2018_CbCR_12-13.pdf:   0%|          | 0.00/2.17M [00:00<?, ?B/s]

# Preparing...

In [3]:
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()

import os
from google.colab import userdata
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = userdata.get("LLAMA_CLOUD_API_KEY")

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = userdata.get("OPEN_AI")

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# basicConfig above already attaches a stdout handler to the root logger;
# adding a second StreamHandler here would print every log message twice.
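
The cell above reads both keys from Colab's `userdata` store, so it only runs on Colab. Below is a minimal sketch of a pre-flight check for other environments, assuming the keys are exported as environment variables under the same names. `missing_keys` is a hypothetical helper, not part of any library:

```python
import os

# Key names taken from the cell above; the helper itself is illustrative.
REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required key names that are absent or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


absent = missing_keys()
if absent:
    print("Missing API keys:", ", ".join(absent))
else:
    print("All API keys are set.")
```

Running this before the parsing step surfaces a missing key immediately, rather than as an authentication error in the middle of a LlamaParse job.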

In [4]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model=OpenAIEmbedding(model="text-embedding-3-small", embed_batch_size=1)
llm = OpenAI(model="gpt-3.5-turbo-0125")

Settings.llm = llm
Settings.embed_model = embed_model

# Parsing the PDF file and chunking into nodes

In [5]:
from llama_parse import LlamaParse
documents = LlamaParse(result_type="markdown").load_data(document_path)

print(documents[0].text[:100] + '...')

Started parsing the file under job_id fae3d8cb-46b6-4be3-a431-cb5eceb786b1
NO_CONTENT_HERE
---
Mission

We are an energy company.

access to energy for all.

on our unique str...


In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)

In [7]:
from IPython.display import Markdown, display

#for i, node in enumerate(nodes):
#  display(Markdown(f'## {i}. {node.id_} ({type(node).__name__})\n'))
#  for relationship in node.relationships:
#    print(node.relationships[relationship].node_id + ': ' +
#          str(relationship) + ', ' +
#          str(node.relationships[relationship].node_type))
#  #print(node.relationships.keys())
#  display(Markdown(node.text))
#  print()
#  #print(node.text + "\n")

In [8]:
#objects are index nodes
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

print(len(base_nodes))
print(len(objects))

32
63


# Building the index over the nodes (= file chunks)

In [9]:
#raw_index = VectorStoreIndex.from_documents(documents)
recursive_index = VectorStoreIndex(nodes=base_nodes+objects)
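
The index is built over both the plain text chunks and the index nodes. The idea behind the "recursive" name, as far as this notebook relies on it, is that a short table summary is matched against the query and then resolved to the full table it describes. The toy sketch below illustrates that two-step lookup in plain Python. It is not how LlamaIndex is implemented: the classes, the word-overlap score, and the summaries are all assumptions made for illustration (the two table values are copied from outputs later in this notebook).

```python
from dataclasses import dataclass


@dataclass
class TableNode:
    node_id: str
    markdown: str  # full table content


@dataclass
class SummaryNode:
    summary: str    # short description matched against the query
    target_id: str  # id of the table this summary points to


def score(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_table(query, summaries, tables):
    """Match the query against the summaries, then follow the link to the table."""
    best = max(summaries, key=lambda s: score(query, s.summary))
    return tables[best.target_id]


tables = {
    "t1": TableNode("t1", "|Country|France|\n|---|---|\n|Number of Employees|739|"),
    "t2": TableNode("t2", "|Bermuda|Total revenues|\n|---|---|\n|(€ thousand)|196|"),
}
summaries = [
    SummaryNode("financial information for a company in france", "t1"),
    SummaryNode("financial performance summary for bermuda", "t2"),
]

hit = retrieve_table("number of employees in france", summaries, tables)
print(hit.node_id)  # t1
```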

# Simple query: one metric, one country

In [40]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

query_engine = recursive_index.as_query_engine(
    similarity_top_k=2, # is default
)

#query="Pull for France the total revenues, profit or loss before income tax, income tax paid, stated capital, accumulated earnings, number of employees, and tangible assets."
#query="Pull the total revenues in France."
query="Pull the profit (or loss) before income tax in France."

response = query_engine.query(query)

In [41]:
from llama_index.core.response.notebook_utils import display_response
display_response(response, source_length = 2000, show_source=True)

**`Final Response:`** €4,392,137

---

**`Source Node 1/2`**

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_67_table<br>**Similarity:** 0.5924011527752708<br>**Text:** Financial information for a company in France including profit, income tax details, capital, earnings, and other key metrics.,
with the following table title:
Financial Information for a Company in France,
with the following columns:
- Country: None
- Profit (loss) before Income Tax: None
- Income Tax Paid: None
- Income Tax (on Cash Basis): None
- Income Tax Accrued - current year: None
- Stated capital: None
- Accumulated Earnings: None
- Number of Employees: None
- Tangible Assets other than Cash and Cash Equivalents: None

|Country|France|
|---|---|
|Profit (loss) before Income Tax|4,392,137|
|Income Tax Paid|19,826|
|Income Tax (on Cash Basis)|14,442|
|Income Tax Accrued - current year|8,750|
|Stated capital|268,663|
|Accumulated Earnings|20,971|
|Number of Employees|739|
|Tangible Assets other than Cash and Cash Equivalents|323,235|<br>

---

**`Source Node 2/2`**

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_163_table<br>**Similarity:** 0.4675175504919969<br>**Text:** This table provides information on Profit (loss) before Income tax, Total revenues, Revenues from related party transactions, and Revenues from non-related party transactions in thousands of Euros.,
with the following table title:
Financial Performance Summary,
with the following columns:
- Profit (loss) before Income tax: Shows the profit or loss before income tax in thousands of Euros.
- Total revenues: Total revenues amounting to 196 thousand Euros.
- Revenues - related party transaction: Revenues generated from related party transactions.
- Revenues - non-related party transaction: Revenues generated from non-related party transactions.

|Bermuda| | .1| .2| .3| .4| .5| .6| .7| .8|
|---|---|---|---|---|---|---|---|---|---|
|Profit (loss) before Income tax|Total revenues| | | | | | | | |
|(€ thousand)|196|142| |94,481|320,635| | | | |
|Revenues - related party transaction|196| | | | | | | | |
|Revenues - non-related party transaction| | | | | | | | | |<br>
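
The answer above is wrong because of how this table chunk was parsed, not because of the query wording. In Source Node 1, the value labelled "Profit (loss) before Income Tax" (4,392,137) is the figure that the full table retrieved later in this notebook reports as France's total revenues, and 19,826, labelled "Income Tax Paid", is the actual profit before tax: the labels are shifted by one column. Below is a small sketch of that cross-check, using only values printed in this notebook. `find_label_for_value` is an illustrative helper:

```python
def to_int(text):
    """Convert a thousands-separated string such as '4,392,137' to an int."""
    return int(text.replace(",", ""))


# France, as labelled in Source Node 1 of the simple query.
small_table = {
    "Profit (loss) before Income Tax": "4,392,137",
    "Income Tax Paid": "19,826",
    "Income Tax (on Cash Basis)": "14,442",
    "Income Tax Accrued - current year": "8,750",
}

# France, as reported in the full-country table of the complex query.
full_table = {
    "Total revenues": "4,392,137",
    "Profit (loss) before Income tax": "19,826",
    "Income Tax Paid": "14,442",
    "Income Tax Accrued": "8,750",
}


def find_label_for_value(value, table):
    """Return the labels in `table` whose value equals `value`."""
    return [label for label, raw in table.items() if to_int(raw) == to_int(value)]


for label, raw in small_table.items():
    matches = find_label_for_value(raw, full_table)
    print(f"{label!r} = {raw} -> full table labels: {matches}")
```

Every label in the small table maps to the previous column of the full table, which suggests the header row was misaligned when the page was converted to markdown.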

# Complex query: all metrics, all countries

In [11]:
# First retrieve the most relevant nodes, which should include the relevant table(s), then rerank them

from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_index.core import QueryBundle
from llama_index.core.response.notebook_utils import display_source_node

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

retriever = recursive_index.as_retriever(
    similarity_top_k=15,
)

query_str = """The table provides financial data for various countries and regions.
Columns include the tax jurisdiction (or country or region name), total revenues, profit or loss before income tax, income tax paid, stated capital, accumulated earnings, number of employees, and tangible assets
Each row represent a region or country, such as Afghanistan, Albania, Algeria, Andorra, Angola, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia and Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina Faso, Burundi, Cabo Verde, Cambodia, Cameroon, Canada, Central African Republic, Chad, Chile, China, Colombia, Comoros, Congo, Democratic Republic of the, Congo, Republic of the, Costa Rica, Cote d'Ivoire, Croatia, Cuba, Cyprus, Czech Republic, Denmark, Djibouti, Dominica, Dominican Republic, East Timor (Timor-Leste), Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Eswatini (formerly Swaziland), Ethiopia, Fiji, Finland, France, Gabon, Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Kiribati, Korea, North, Korea, South, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mozambique, Myanmar (Burma), Namibia, Nauru, Nepal, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, North Macedonia (formerly Macedonia), Norway, Oman, Pakistan, Palau, Palestine, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russia, Rwanda, Saint Kitts and Nevis, Saint Lucia, Saint Vincent and the Grenadines, Samoa, San Marino, Sao Tome and Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, South Sudan, Spain, Sri Lanka, Sudan, Suriname, Sweden, Switzerland, Syria, Taiwan, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad and 
Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe.
"""

nodes = retriever.retrieve(query_str)
# Call the public postprocess_nodes method rather than the private _postprocess_nodes
new_nodes = reranker.postprocess_nodes(nodes, query_bundle=QueryBundle(query_str))

for n in new_nodes:
  display_source_node(n, show_source_metadata=True)

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_57_table<br>**Similarity:** 9.147665977478027<br>**Text:** This table provides financial data for various countries and regions, including total revenues, p...<br>**Metadata:** {'table_df': "{'Country': {0: 'Kenya', 1: 'Libya', 2: 'Morocco', 3: 'Mozambique', 4: 'Nigeria', 5: 'South Africa', 6: 'Tunisia', 7: 'ASIA AND OCEANIA', 8: 'Australia', 9: 'China', 10: 'India', 11: 'Indonesia', 12: 'Iran', 13: 'Iraq', 14: 'Kazakhstan', 15: 'Lebanon', 16: 'Myanmar', 17: 'Oman', 18: 'Pakistan', 19: 'Russia', 20: 'Saudi Arabia', 21: 'Singapore', 22: 'Timor Leste', 23: 'Turkmenistan', 24: 'United Arab Emirates', 25: 'Vietnam', 26: 'AMERICAS', 27: 'Argentina', 28: 'Bahamas', 29: 'Bermuda', 30: 'Brazil', 31: 'British Virgin Islands', 32: 'Canada', 33: 'Cayman Islands', 34: 'Ecuador', 35: 'Mexico', 36: 'United States', 37: 'Venezuela', 38: 'Eni Group'}, 'Total revenues': {0: '197', 1: '4,294,558', 2: '3,973', 3: '5,267', 4: '1,285,152', 5: '115', 6: '193,424', 7: '6,815,945', 8: '238,849', 9: '183,862', 10: '417', 11: '710,564', 12: '6,663', 13: '556,807', 14: '1,930,690', 15: '8', 16: '188', 17: '69', 18: '140,065', 19: '129,204', 20: '249,267', 21: '1,709,564', 22: '494', 23: '120,030', 24: '838,760', 25: '444', 26: '4,271,096', 27: '7,991', 28: ' ', 29: '196', 30: ' ', 31: ' ', 32: ' ', 33: ' ', 34: '243,307', 35: '1,314', 36: '3,968,057', 37: '50,231', 38: '117,850,810'}, 'Profit (loss) before Income tax': {0: '(3,833)', 1: '2,823,360', 2: '(33,551)', 3: '(4,060)', 4: '157,495', 5: '(6,040)', 6: '26,291', 7: '1,980,951', 8: '51,296', 9: '1,857', 10: '(3,261)', 11: '277,470', 12: '(1,675)', 13: '153,657', 14: '950,015', 15: '(10,822)', 16: '(12,222)', 17: '(6,164)', 18: '51,610', 19: '38,440', 20: ' ', 21: '(6,154)', 22: '(532)', 23: '22,084', 24: '517,323', 25: '(41,971)', 26: '(663,784)', 27: '6,768', 28: ' ', 29: '142', 30: '(34)', 31: ' ', 32: ' ', 33: ' ', 34: '46,648', 35: '(90,396)', 36: '(393,094)', 37: '(233,818)', 38: 
'4,783,311'}, 'Income Tax Paid (on Cash Basis)': {0: ' ', 1: '1,885,035', 2: ' ', 3: ' ', 4: '157,560', 5: ' ', 6: '132,977', 7: '765,527', 8: '30,580', 9: '(26)', 10: '12', 11: '7,323', 12: '25', 13: '27,187', 14: '211,832', 15: ' ', 16: ' ', 17: ' ', 18: '211', 19: '194', 20: '8,633', 21: '(189)', 22: '(11,419)', 23: '11,280', 24: '479,884', 25: ' ', 26: '68,812', 27: ' ', 28: ' ', 29: ' ', 30: ' ', 31: ' ', 32: ' ', 33: ' ', 34: '63,949', 35: ' ', 36: '4,330', 37: '533', 38: '5,080,890'}, 'Income Tax Accrued - current year': {0: ' ', 1: '1,866,084', 2: ' ', 3: ' ', 4: '204,988', 5: ' ', 6: '136,062', 7: '768,071', 8: '25,039', 9: '11', 10: '163', 11: '46,647', 12: '25', 13: '29,212', 14: '203,985', 15: ' ', 16: ' ', 17: ' ', 18: '2,169', 19: '726', 20: ' ', 21: '739', 22: ' ', 23: '9,528', 24: '449,827', 25: ' ', 26: '65,942', 27: '2,710', 28: ' ', 29: ' ', 30: ' ', 31: ' ', 32: ' ', 33: ' ', 34: '28,755', 35: ' ', 36: '9,936', 37: '24,541', 38: '5,142,394'}, 'Stated capital': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: '2,458', 5: ' ', 6: '96', 7: '197,796', 8: '123,487', 9: '39,045', 10: '230', 11: ' ', 12: ' ', 13: ' ', 14: '3,609', 15: ' ', 16: ' ', 17: ' ', 18: ' ', 19: '31,374', 20: ' ', 21: '51', 22: ' ', 23: ' ', 24: ' ', 25: ' ', 26: '6,253,432', 27: '735', 28: ' ', 29: '94,481', 30: '370,029', 31: '42', 32: '2,246,171', 33: ' ', 34: '193', 35: ' ', 36: '3,499,746', 37: '42,033', 38: '80,909,709'}, 'Accumulated Earnings': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: '2,555,666', 5: ' ', 6: '9', 7: '(208,967)', 8: '(124,415)', 9: '(57,266)', 10: '(273)', 11: ' ', 12: ' ', 13: ' ', 14: '(1,992)', 15: ' ', 16: ' ', 17: ' ', 18: ' ', 19: '(24,843)', 20: ' ', 21: '(178)', 22: ' ', 23: ' ', 24: ' ', 25: ' ', 26: '(3,382,733)', 27: '(7,226)', 28: ' ', 29: '320,635', 30: '(370,064)', 31: '482,030', 32: '(1,249,559)', 33: ' ', 34: '33,120', 35: '(96,100)', 36: '(2,507,343)', 37: '11,774', 38: '43,634,353'}, 'Number of Employees (number)': {0: '15', 1: '3,709', 2: '6', 3: '115', 
4: '1,164', 5: '6', 6: '524', 7: '4,245', 8: '90', 9: '69', 10: '5', 11: '328', 12: ' ', 13: '483', 14: '1,750', 15: ' ', 16: '25', 17: '10', 18: '339', 19: '71', 20: ' ', 21: '27', 22: '3', 23: '983', 24: '25', 25: '37', 26: '1,412', 27: ' ', 28: ' ', 29: ' ', 30: ' ', 31: ' ', 32: ' ', 33: ' ', 34: '770', 35: '125', 36: '378', 37: '139', 38: '40,415'}, 'Tangible Assets other than Cash and Cash Equivalents': {0: '722', 1: '3,432,224', 2: '69', 3: '670', 4: '2,125,179', 5: '342', 6: '224,696', 7: '15,330,053', 8: '848,679', 9: '43,475', 10: '202', 11: '1,443,152', 12: ' ', 13: '656,506', 14: '10,438,477', 15: ' ', 16: '194', 17: '625', 18: '113,661', 19: '297,558', 20: ' ', 21: '69', 22: '51', 23: '465,537', 24: '1,017,082', 25: '4,785', 26: '3,487,273', 27: ' ', 28: ' ', 29: ' ', 30: ' ', 31: ' ', 32: ' ', 33: ' ', 34: '212,821', 35: '380,249', 36: '2,831,464', 37: '62,739', 38: '63,723,700'}}", 'table_summary': 'This table provides financial data for various countries and regions, including total revenues, profit (loss) before income tax, income tax paid, stated capital, accumulated earnings, number of employees, and tangible assets. The table should be kept.,\nwith the following columns:\n'}<br>

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_55_table<br>**Similarity:** 7.8682026863098145<br>**Text:** This table provides financial data for various countries, including total revenues, profit before...<br>**Metadata:** {'table_df': "{'Country': {0: 'Austria', 1: 'Belgium', 2: 'Cyprus', 3: 'Czech Republic', 4: 'Denmark', 5: 'France', 6: 'Germany', 7: 'Greece', 8: 'Greenland', 9: 'Hungary', 10: 'Ireland', 11: 'Italy', 12: 'Jersey', 13: 'Malta', 14: 'Montenegro', 15: 'Netherlands', 16: 'Poland', 17: 'Portugal', 18: 'Romania', 19: 'Slovakia', 20: 'Slovenia', 21: 'Spain', 22: 'Sweden', 23: 'Switzerland', 24: 'Turkey', 25: 'Ukraine', 26: 'United Kingdom', 27: 'Algeria', 28: 'Angola', 29: 'Congo', 30: 'Democratic Republic of the Congo', 31: 'Egypt', 32: 'Gabon', 33: 'Ghana', 34: 'Ivory Coast'}, 'Total revenues (€ thousand)': {0: '1,229,847', 1: '2,425,316', 2: '4,689', 3: '5,579', 4: '587', 5: '4,392,137', 6: '2,716,743', 7: '192,116', 8: '149', 9: '210,364', 10: '238,354', 11: '43,372,971', 12: '20,935', 13: ' ', 14: '152', 15: '1,612,911', 16: '4,470', 17: '1,224', 18: '2,714', 19: '1,180', 20: '42,813', 21: '373,115', 22: '2,558', 23: '806,778', 24: '15,722', 25: '311', 26: '33,763,635', 27: '1,473,905', 28: '2,540,302', 29: '1,471,937', 30: '19', 31: '3,679,297', 32: '6,018', 33: '372,273', 34: ' '}, 'Profit (loss) before Income tax (€ thousand)': {0: '49,441', 1: '293,390', 2: '-48,174', 3: '-275', 4: '172', 5: '19,826', 6: '119,618', 7: '12,239', 8: '-5,836', 9: '-5,854', 10: '51,773', 11: '-1,856,038', 12: '7,077', 13: ' ', 14: '-5,560', 15: '-298,018', 16: '1,064', 17: '-17,417', 18: '184', 19: '-271', 20: '4,338', 21: '3,996', 22: '383', 23: '23,815', 24: '9,635', 25: '-957', 26: '-654,766', 27: '786,585', 28: '684,095', 29: '-431,660', 30: '-444', 31: '1,779,870', 32: '-11,206', 33: '12,541', 34: '-17,084'}, 'Income Tax Paid (€ thousand)': {0: '8,617', 1: '85,434', 2: ' ', 3: '92', 4: '34', 5: '14,442', 6: '48,300', 7: '184', 8: ' ', 9: '78', 10: 
'3,000', 11: '116,508', 12: ' ', 13: ' ', 14: ' ', 15: '6,948', 16: '163', 17: ' ', 18: ' ', 19: '2', 20: '659', 21: '652', 22: '86', 23: '4,874', 24: '260', 25: '13', 26: '145,546', 27: '630,079', 28: '315,805', 29: '249,361', 30: ' ', 31: '439,842', 32: ' ', 33: ' ', 34: ' '}, 'Income Tax Accrued (€ thousand)': {0: '9,131', 1: '63,089', 2: ' ', 3: '89', 4: '38', 5: '8,750', 6: '40,798', 7: '3,892', 8: ' ', 9: '67', 10: '7,536', 11: '64,106', 12: ' ', 13: ' ', 14: ' ', 15: '7,890', 16: '194', 17: ' ', 18: '51', 19: '-27', 20: '675', 21: '983', 22: '84', 23: '5,573', 24: '191', 25: ' ', 26: '222,659', 27: '639,796', 28: '337,928', 29: '247,809', 30: ' ', 31: '439,945', 32: ' ', 33: ' ', 34: ' '}, 'Stated capital (€ thousand)': {0: '132,278', 1: '2,419,725', 2: '568,002', 3: ' ', 4: ' ', 5: '268,663', 6: '94,145', 7: '19,182', 8: ' ', 9: '25,380', 10: '500,000', 11: '20,115,619', 12: '25,546', 13: ' ', 14: ' ', 15: '44,485,843', 16: ' ', 17: ' ', 18: ' ', 19: ' ', 20: '12,957', 21: '17,299', 22: ' ', 23: '95,982', 24: '4', 25: '1,978', 26: '5,255,942', 27: '12', 28: ' ', 29: '390,099', 30: '805', 31: '14', 32: '13,710', 33: '12,742', 34: ' '}, 'Accumulated Earnings (€ thousand)': {0: '84,312', 1: '123,514', 2: '-298,465', 3: ' ', 4: ' ', 5: '20,971', 6: '78,231', 7: '6,749', 8: ' ', 9: '-10,709', 10: '67,310', 11: '26,999,365', 12: '7,077', 13: ' ', 14: ' ', 15: '15,886,423', 16: ' ', 17: ' ', 18: ' ', 19: ' ', 20: '14,265', 21: '2,072', 22: ' ', 23: '23,947', 24: '627', 25: '-1,547', 26: '2,596,604', 27: ' ', 28: ' ', 29: '-656,043', 30: '-1,279', 31: '1,239', 32: '-61,721', 33: '-212,564', 34: ' '}, 'Number of Employees (number)': {0: '131', 1: '233', 2: '30', 3: '16', 4: '2', 5: '739', 6: '537', 7: '97', 8: ' ', 9: '159', 10: '12', 11: '21,066', 12: ' ', 13: ' ', 14: '6', 15: '69', 16: '6', 17: '8', 18: '13', 19: '6', 20: '36', 21: '95', 22: '1', 23: '84', 24: '10', 25: '12', 26: '993', 27: '776', 28: '354', 29: '577', 30: ' ', 31: '2,902', 32: '16', 33: '232', 
34: '1'}, 'Tangible Assets other than Cash and Cash Equivalents (€ thousand)': {0: '124,428', 1: '3,523', 2: '11,869', 3: '852', 4: ' ', 5: '323,235', 6: '375,949', 7: '1,495', 8: ' ', 9: '33,750', 10: ' ', 11: '13,549,180', 12: '8,664', 13: ' ', 14: '653', 15: '29,327', 16: '286', 17: '3,311', 18: '554', 19: '7', 20: '1,821', 21: '16,966', 22: '404', 23: '84,832', 24: '25,273', 25: '11', 26: '1,898,851', 27: '3,052,456', 28: '4,742,261', 29: '5,435,220', 30: ' ', 31: '7,391,415', 32: '12,551', 33: '1,993,328', 34: ' '}}", 'table_summary': 'This table provides financial data for various countries, including total revenues, profit before income tax, income tax paid and accrued, stated capital, accumulated earnings, number of employees, and tangible assets other than cash and cash equivalents.,\nwith the following table title:\nFinancial Data for Countries,\nwith the following columns:\n- Country: None\n- Total revenues (€ thousand): None\n- Profit (loss) before Income tax (€ thousand): None\n- Income Tax Paid (€ thousand): None\n- Income Tax Accrued (€ thousand): None\n- Stated capital (€ thousand): None\n- Accumulated Earnings (€ thousand): None\n- Number of Employees (number): None\n- Tangible Assets other than Cash and Cash Equivalents (€ thousand): None\n'}<br>

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_145_table<br>**Similarity:** 7.80562162399292<br>**Text:** This table provides financial data for different countries including profit (loss) before income ...<br>**Metadata:** {'table_df': "{'Country': {0: 'Lebanon', 1: 'Myanmar', 2: 'Oman', 3: 'Pakistan', 4: 'Russia'}, 'Profit (loss) before Income Tax': {0: '8', 1: '188', 2: '69', 3: '140,065', 4: '129,204'}, 'Income Tax Paid (on Cash Basis)': {0: '(10,822)', 1: '(12,222)', 2: '(6,164)', 3: '51,610', 4: '38,440'}, 'Income Tax Accrued - current year': {0: ' ', 1: ' ', 2: ' ', 3: '211', 4: '194'}, 'Stated capital': {0: ' ', 1: ' ', 2: ' ', 3: '2,169', 4: '726'}, 'Accumulated Earnings': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: '31,374'}, 'Number of Employees': {0: ' ', 1: '25', 2: '10', 3: '339', 4: '71'}, 'Tangible Assets other than Cash and Cash Equivalents': {0: ' ', 1: '194', 2: '625', 3: '113,661', 4: '297,558'}}", 'table_summary': 'This table provides financial data for different countries including profit (loss) before income tax, income tax paid, income tax accrued, stated capital, accumulated earnings, number of employees, and tangible assets other than cash and cash equivalents.,\nwith the following table title:\nFinancial Data for Different Countries,\nwith the following columns:\n- Country: None\n- Profit (loss) before Income Tax: None\n- Income Tax Paid (on Cash Basis): None\n- Income Tax Accrued - current year: None\n- Stated capital: None\n- Accumulated Earnings: None\n- Number of Employees: None\n- Tangible Assets other than Cash and Cash Equivalents: None\n'}<br>

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_91_table<br>**Similarity:** 7.764071941375732<br>**Text:** This table provides financial data for different countries including profit (loss) before income ...<br>**Metadata:** {'table_df': "{'Country': {0: 'Portugal', 1: 'Romania', 2: 'Slovakia', 3: 'Slovenia', 4: 'Spain'}, 'Profit (loss) before Income Tax': {0: '1,224', 1: '2,714', 2: '1,180', 3: '42,813', 4: '373,115'}, 'Income Tax Paid': {0: '(17,417)', 1: '184', 2: '(271)', 3: '4,338', 4: '3,996'}, 'Income Tax Accrued': {0: ' ', 1: '51', 2: '2', 3: '659', 4: '652'}, 'Stated Capital': {0: ' ', 1: ' ', 2: '(27)', 3: '675', 4: '983'}, 'Accumulated Earnings': {0: ' ', 1: ' ', 2: ' ', 3: '12,957', 4: '17,299'}, 'Number of Employees': {0: 8, 1: 13, 2: 6, 3: 36, 4: 95}, 'Tangible Assets other than Cash and Cash Equivalents': {0: '3,311', 1: '554', 2: '7', 3: '1,821', 4: '16,966'}}", 'table_summary': 'This table provides financial data for different countries including profit (loss) before income tax, income tax paid, income tax accrued, stated capital, accumulated earnings, number of employees, and tangible assets other than cash and cash equivalents.,\nwith the following table title:\nFinancial Data for Different Countries,\nwith the following columns:\n- Country: None\n- Profit (loss) before Income Tax: None\n- Income Tax Paid: None\n- Income Tax Accrued: None\n- Stated Capital: None\n- Accumulated Earnings: None\n- Number of Employees: None\n- Tangible Assets other than Cash and Cash Equivalents: None\n'}<br>

**Node ID:** id_612cf24c-12e7-47fd-bce0-da7776cbd59d_63_table<br>**Similarity:** 7.708759307861328<br>**Text:** This table provides financial data for different countries including profit (loss) before income ...<br>**Metadata:** {'table_df': "{'Country': {0: 'Austria', 1: 'Belgium', 2: 'Cyprus', 3: 'Czech Republic', 4: 'Denmark'}, 'Profit (loss) before Income Tax': {0: '1,229,847', 1: '2,425,316', 2: '4,689', 3: '5,579', 4: '587'}, 'Income Tax Paid (on Cash Basis)': {0: '49,441', 1: '293,390', 2: '(48,174)', 3: '(275)', 4: '172'}, 'Income Tax Accrued - current year': {0: '8,617', 1: '85,434', 2: ' ', 3: '92', 4: '34'}, 'Income Tax Stated': {0: '9,131', 1: '63,089', 2: '568,002', 3: '89', 4: '38'}, 'Accumulated capital': {0: '132,278', 1: '2,419,725', 2: '(298,465)', 3: ' ', 4: ' '}, 'Number of Employees': {0: '84,312', 1: '123,514', 2: '30', 3: '16', 4: '2'}, 'Tangible Assets other than Cash and Cash Equivalents': {0: '124,428', 1: '3,523', 2: '11,869', 3: '852', 4: ' '}}", 'table_summary': 'This table provides financial data for different countries including profit (loss) before income tax, income tax paid, income tax accrued, income tax stated, accumulated capital, number of employees, and tangible assets other than cash and cash equivalents.,\nwith the following table title:\nFinancial Data for Different Countries,\nwith the following columns:\n- Country: None\n- Profit (loss) before Income Tax: None\n- Income Tax Paid (on Cash Basis): None\n- Income Tax Accrued - current year: None\n- Income Tax Stated: None\n- Accumulated capital: None\n- Number of Employees: None\n- Tangible Assets other than Cash and Cash Equivalents: None\n'}<br>
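
The `table_df` metadata above stores nearly every figure as a string and mixes two notations for negative values: parentheses, as in `'(3,833)'`, and a leading minus sign, as in `'-48,174'`. Empty cells appear as `' '`, and one table stores the employee counts as plain integers. These need to be normalized before any numeric comparison across tables. Below is a minimal sketch, where `parse_amount` is a hypothetical helper and mapping blank cells to `None` is an assumption:

```python
def parse_amount(cell):
    """Parse a figure such as '4,392,137', '(3,833)' or '-48,174'.

    Returns an int, or None for a blank cell.
    """
    text = str(cell).strip()
    if not text:
        return None
    # Accounting notation: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    value = int(text.replace(",", ""))
    return -value if negative else value


samples = ["4,392,137", "(3,833)", "-48,174", " ", 739]
print([parse_amount(s) for s in samples])
# [4392137, -3833, -48174, None, 739]
```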

In [12]:
# Now the query

from llama_index.core.response_synthesizers import ResponseMode
from llama_index.core import get_response_synthesizer

query_str_alt = """Pull the profit (or loss) before income tax, the income tax accrued, the income tax paid, the number of employees,
the stated capital, the accumulated earnings, the tangible assets other than cash and cash equivalent, and the total revenues for all the countries.
Format the output as a table with the following columns jur_name. Sort the country names by alphabetical order."""

query_str = """Pull the following financial data for all the countries and format the results in a table:
- Country name: only include real countries (column: jur_name)
- Total revenue (column: total_revenues)
- Profit (or loss) before tax (column: profit_before_tax)
- Income tax paid (column: tax_paid)
- Accrued tax (column: tax_accrued)
- Number of employees (column: employees)
- Unrelated revenues (column: unrelated_revenues)
- Related revenues (column: related_revenues)
- Stated capital (column: stated_capital)
- Accumulated earnings (column: accumulated_earnings)
- Tangible assets other than cash and cash equivalent (column: tangible_assets)

Sort the jur_name column by alphabetical order."""

#Countries: Afghanistan, Albania, Algeria, Andorra, Angola, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia and Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina Faso, Burundi, Cabo Verde, Cambodia, Cameroon, Canada, Central African Republic, Chad, Chile, China, Colombia, Comoros, Congo, Democratic Republic of the, Congo, Republic of the, Costa Rica, Cote d'Ivoire, Croatia, Cuba, Cyprus, Czech Republic, Denmark, Djibouti, Dominica, Dominican Republic, East Timor (Timor-Leste), Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Eswatini (formerly Swaziland), Ethiopia, Fiji, Finland, France, Gabon, Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Kiribati, Korea, North, Korea, South, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mozambique, Myanmar (Burma), Namibia, Nauru, Nepal, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, North Macedonia (formerly Macedonia), Norway, Oman, Pakistan, Palau, Palestine, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russia, Rwanda, Saint Kitts and Nevis, Saint Lucia, Saint Vincent and the Grenadines, Samoa, San Marino, Sao Tome and Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, South Sudan, Spain, Sri Lanka, Sudan, Suriname, Sweden, Switzerland, Syria, Taiwan, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad and Tobago, Tunisia, Turkey, Turkmenistan, 
# Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe

response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.COMPACT
)

response = response_synthesizer.synthesize(
    query_str, nodes=new_nodes)

#display_response(response, show_source=True)
display(Markdown(response.response))

| jur_name       | total_revenues | profit_before_tax | tax_paid | tax_accrued | employees | unrelated_revenues | related_revenues | stated_capital | accumulated_earnings | tangible_assets |
|----------------|----------------|-------------------|----------|-------------|-----------|-------------------|------------------|----------------|----------------------|-----------------|
| Algeria        | 1,473,905      | 786,585           | 630,079  | 639,796     | 776       |                   |                  | 12             |                      | 3,052,456       |
| Angola         | 2,540,302      | 684,095           | 315,805  | 337,928     | 354       |                   |                  |                |                      | 4,742,261       |
| Argentina      | 7,991          | 6,768             |          | 2,710       |           |                   |                  | 735            | (7,226)              |                 |
| Australia      | 238,849        | 51,296            | 30,580   | 25,039      | 90        |                   |                  | 123,487        | (124,415)            | 848,679         |
| Austria        | 1,229,847      | 49,441            | 8,617    | 9,131       | 131       |                   |                  | 132,278        | 84,312               | 124,428         |
| Bahamas        |                |                   |          |             |           |                   |                  |                |                      |                 |
| Belgium        | 2,425,316      | 293,390           | 85,434   | 63,089      | 233       |                   |                  | 2,419,725      | 123,514              | 3,523           |
| Bermuda        | 196            | 142               |          |             |           |                   |                  | 94,481         | 320,635              |                 |
| Brazil         |                | (34)              |          |             |           |                   |                  | 370,029        | (370,064)            |                 |
| Canada         |                |                   |          |             |           |                   |                  | 2,246,171      | (1,249,559)          |                 |
| China          | 183,862        | 1,857             | (26)     | 11         | 69        |                   |                  | 39,045         | (57,266)             | 43,475          |
| Congo          | 1,471,937      | (431,660)         | 249,361  | 247,809     | 577       |                   |                  | 390,099        | (656,043)            | 5,435,220       |
| Czech Republic | 5,579          | (275)             | 92       | 89         | 16        |                   |                  |                |                      | 852             |
| Denmark        | 587            | 172               | 34       | 38         | 2         |                   |                  |                |                      |                 |
| Ecuador        | 243,307        | 46,648            | 63,949   | 28,755      | 770       |                   |                  | 193            | 33,120               | 212,821         |
| Egypt          | 3,679,297      | 1,779,870         | 439,842  | 439,945     | 2,902     |                   |                  | 14             | 1,239                | 7,391,415       |
| France         | 4,392,137      | 19,826            | 14,442   | 8,750       | 739       |                   |                  | 268,663        | 20,971               | 323,235         |
| Gabon          | 6,018          | (11,206)          |          |             | 16        |                   |                  | 13,710         | (61,721)             | 12,551          |
| Germany        | 2,716,743      | 119,618           | 48,300   | 40,798      | 537       |                   |                  | 94,145         | 78,231               | 375,949         |
| Ghana          | 372,273        | 12,541            |          |             | 232       |                   |                  | 12,742         | (212,564)            | 1,993,328       |
| Greece         | 192,116        | 12,239            | 184      | 3,892       | 97        |                   |                  | 19,182         | 6,749                | 1,495           |
| Greenland      | 149            | (5,836)           |          |             |           |                   |                  |                |                      |                 |
| Hungary        | 210,364        | (5,854)           | 78       | 67         | 159       |                   |                  | 25,380         | (10,709)             | 33,750          |
| India          | 417            | (3,261)           | 12       | 163        | 5         |                   |                  | 230            | (273)                | 202             |
| Indonesia      | 710,564        | 277,470           | 7,323    | 46,647      | 328       |                   |                  |                |                      | 1,443,152       |
| Iran           | 6,663          | (1,675)           | 25       | 25         |           |                   |                  |                |                      |                 |
| Iraq           | 556,807        | 153,657           | 27,187   | 29,212      | 483       |                   |                  |                |                      | 656,506         |
| Ireland        | 238,354        | 51,773            | 3,000    | 7,536       | 12        |                   |                  | 500,000        | 67,310               |                 |
| Italy          | 43,372,971     | (1,856,038)       | 116,508  | 64,106      | 21,066    |                   |                  | 20,115,619     | 26,999,365          | 13,549,180      |
| Jersey         | 20,935         | 7,077             |          |             |           |                   |                  | 25,546         | 7,077                | 8,664           |
| Kazakhstan     | 1,930,690      | 950,015           | 211,832  | 203,985     | 1,750     |                   |                  | 3,609          | (1,992)              | 10,438,477      |
| Kenya          | 197            | (3,833)           |          |             | 15        |                   |                  |                |                      | 722             |
| Lebanon        | 8              | (10,822)          |          |             |           |                   |                  |                |                      |                 |
| Libya          | 4,294,558      | 2,823,360         | 1,885,035 | 1,866,084   | 3,709     |                   |                  |                |                      | 3,432,224       |
| Malta          |                |                   |          |             |           |                   |                  |                |                      |                 |
| Mexico         | 1,314          | (90,396)          |          |             | 125       |                   |                  |                | (96,100)            | 380,249         |
| Montenegro     | 152            | (5,560)           |          |             | 6         |                   |                  |                |                      | 653             |
| Morocco        | 3,973          | (33,551)          |          |             | 6         |                   |                  |                |                      | 69              |
| Mozambique     | 5,267          | (4,060)           |          |             | 115       |                   |                  |                |                      | 670             |
| Myanmar        | 188            | (12,222)          |          |             | 25        |                   |                  |                |                      | 194             |
| Netherlands    | 1,612,911      | (298,018)         | 6,948    | 7,890       | 69        |                   |                  | 44,485,843     | 15,886,423          | 29,327          |
| Nigeria        | 1,285,152      | 157,495           | 157,560  | 204,988     | 1,164     |                   |                  | 2,458          | 2,555,666           | 2,125,179       |
| Oman           | 69             | (6,164)           |          |             | 10        |                   |                  |                |                      | 625             |
| Pakistan       | 140,065        | 51,610            | 211      | 2,169       | 339       |                   |                  |                |                      | 113,661         |
| Poland         | 4,470          | 1,064             | 163      | 194        | 6         |                   |                  |                |                      | 286             |
| Portugal       | 1,224          | (17,417)          |          |             | 8         |                   |                  |                |                      | 3,311           |
| Romania        | 2,714          | 184               |          | 51         | 13        |                   |                  |                |                      | 554             |
| Russia         | 129,204        | 38,440            | 194      | 726        | 71        |                   |                  | 31,374         | (24,843)            | 297,558         |
| Saudi Arabia   | 249,267        |                   | 8,633    |             |           |                   |                  |                |                      |                 |
| Singapore      | 1,709,564      | (6,154)           | (189)    | 739        | 27        |                   |                  | 51             | (178)               | 69              |
| Slovakia       | 1,180          | (271)             | 2        | (27)       | 6         |                   |                  |                |                      | 7               |
| Slovenia       | 42,813         | 4,338             | 659      | 675        | 36        |                   |                  | 12,957         | 14,265               | 1,821           |
| South Africa   | 115            | (6,040)           |          |             | 6         |                   |                  |                |                      | 342             |
| Spain          | 373,115        | 3,996             | 652      | 983        | 95        |                   |                  | 17,299         | 2,072                | 16,966          |
| Sweden         | 2,558          | 383               | 86       | 84         | 1         |                   |                  |                |                      | 404             |
| Switzerland    | 806,778        | 23,815            | 4,874    | 5,573      | 84        |                   |                  | 95,982         | 23,947               | 84,832          |
| Timor Leste    | 494            | (532)             | (11,419) |             | 3         |                   |                  |                |                      | 51              |
| Turkey         | 15,722         | 9,635             | 260      | 191        | 10        |                   |                  | 4              | 627                  | 25,273          |
| Turkmenistan   | 120,030        | 22,084            | 11,280   | 9,528      | 983       |                   |                  |                |                      | 465,537         |
| Ukraine        | 311            | (957)             | 13       |             | 12        |                   |                  | 1,978          | (1,547)              | 11              |
| United Arab Emirates | 838,760  | 517,323           | 479,884  | 449,827    | 25        |                   |                  |                |                      | 1,017,082       |
| United Kingdom | 33,763,635    | (654,766)         | 145,546  | 222,659    | 993       |                   |                  | 5,255,942      | 2,596,604           | 1,898,851       |
| United States  | 3,968,057      | (393,094)         | 4,330    | 9,936      | 378       |                   |                  | 3,499,746      | (2,507,343)         | 2,831,464       |
| Venezuela      | 50,231         | (233,818)         | 533      | 24,541     | 139       |                   |                  | 42,033         | 11,774               | 62,739          |
| Vietnam        | 444            | (41,971)          |          |             | 37        |                   |                  |                |                      | 4,785           |

# Testing the new JSON parsing feature

In [13]:
# Parse the PDF to JSON; the result is a list, and its first entry holds the pages of our document
json_objs = LlamaParse(verbose=True).get_json_result(document_path)
pages = json_objs[0]["pages"]

Started parsing the file under job_id a016f873-6af3-40ed-be56-dae4e35141ed
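
Before pulling the tables, it can help to check which item types each page contains. The sketch below assumes only the structure this notebook already relies on (`json_objs[0]["pages"]`, with each page carrying `page` and `items`, and each item carrying `type`). The helper name `count_item_types` and the `sample` data are hypothetical stand-ins for a real parser result.

```python
from collections import Counter


def count_item_types(json_objs):
    """Map each page number to a Counter of the item types found on that page."""
    summary = {}
    for page in json_objs[0]["pages"]:
        summary[page["page"]] = Counter(
            item["type"] for item in page.get("items", [])
        )
    return summary


# Hypothetical stand-in for the parser output, for illustration only
sample = [{"pages": [
    {"page": 1, "items": [{"type": "heading"}, {"type": "table"}]},
    {"page": 2, "items": [{"type": "table"}, {"type": "table"}]},
]}]

for page_number, counts in count_item_types(sample).items():
    print(page_number, dict(counts))
```

On a real result, calling `count_item_types(json_objs)` should show at a glance which pages hold tables.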


In [None]:
# Pulling tables from json
from IPython.display import Markdown, display

tables = []
for page in json_objs[0]["pages"]:
    for item in page["items"]:
        # Keep only table items, with their page number and markdown rendering
        if item["type"] == "table":
            tables.append({"page": page["page"], "table": item["md"]})
            display(Markdown(f'## Page {page["page"]}'))
            display(Markdown(item["md"]))
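
The tables collected above are still markdown strings, with thousands separators and with negative amounts written in parentheses, such as `(7,226)`. A minimal sketch of one way to turn such a string into typed rows follows. It uses only the standard library and assumes a plain pipe-delimited table with a single header row. The helper names (`parse_number`, `split_row`, `markdown_table_to_rows`) and the column names in the sample are hypothetical and not taken from the report.

```python
def parse_number(cell):
    """Convert '7,991' to 7991 and '(7,226)' to -7226; blank cells become None."""
    text = cell.strip()
    if not text:
        return None
    negative = text.startswith("(") and text.endswith(")")
    digits = text.strip("()").replace(",", "")
    try:
        value = int(digits)
    except ValueError:
        return text  # non-numeric cells (e.g. country names) are kept as text
    return -value if negative else value


def split_row(line):
    """Split one markdown table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def markdown_table_to_rows(md):
    """Turn a pipe-delimited markdown table into a list of dicts keyed by header."""
    lines = [line for line in md.strip().splitlines() if line.strip()]
    header = split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_row(line)
        # Skip the |---|---| separator row
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append({name: parse_number(cell) for name, cell in zip(header, cells)})
    return rows


# Illustrative input: placeholder column names, cell values in the notebook's format
sample_md = """
| Jurisdiction | A | B |
|---|---|---|
| Argentina | 7,991 | 6,768 |
| Brazil | | (34) |
"""
print(markdown_table_to_rows(sample_md))
```

With the list built in the cell above, this could be applied as `markdown_table_to_rows(tables[0]["table"])`. Tables with merged or multi-line headers would need extra handling.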