The year 2024 marks the second year of the generative AI era. Many AI engineers recognize that the biggest challenge in RAG (Retrieval-Augmented Generation) patterns is maintaining high accuracy in generated answers, and the quality of pre-processing in the document ingestion pipeline is crucial to that accuracy. One major hurdle is table recognition: extracting data relationships across multiple columns and sub-columns. Financial documents and global PSG reports, for example, often use non-standard table structures, and these complexities make improved pre-processing techniques essential for accurate answers.
Intuitively, we might think that the GPT model can read the document directly. However, this is not the case. To extract text-based data from documents, we need an OCR service such as Azure Document Intelligence to read formats like PDF or DOCX and transform the content into JSON for pre-processing. However, there are countless document styles in the world, and getting OCR to recognize complicated tables and artistic layouts is an ongoing battle that no single approach or model can win. We must therefore focus on pre-processing to refine the OCR results adaptively, especially for data represented in non-standard table layouts.
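For illustration, here is a minimal sketch of that extraction step, assuming the azure-ai-documentintelligence Python SDK and the prebuilt-layout model. The endpoint, key, and file names are placeholders, and parameter names vary slightly between the preview and GA versions of the SDK.

```python
# Sketch: extract a document's layout (text + tables) with Azure Document Intelligence,
# then keep only the pieces the pre-processing pipeline needs (analyzeResult.content and .tables).
import json
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

client = DocumentIntelligenceClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

with open("financial_report.pdf", "rb") as f:                          # placeholder file name
    poller = client.begin_analyze_document(
        "prebuilt-layout",                       # layout model: returns content, tables, and structure
        analyze_request=f,
        content_type="application/octet-stream",
        output_content_format="markdown",        # markdown output (2024-02-29 preview and later)
    )
result = poller.result()

slim = {
    "analyzeResult": {
        "content": result.content,
        # as_dict() serializes the SDK table models; adjust to your SDK version if needed
        "tables": [t.as_dict() for t in (result.tables or [])],
    }
}
with open("di_output_slim.json", "w", encoding="utf-8") as out:
    json.dump(slim, out, ensure_ascii=False, indent=2)
```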
Since 2022, I've been working on pre-processing document extraction pipelines, focusing on non-standard tables. Here are common patterns:
- Combined columns in one row
- Combined rows in one column
- Sub-column in one row
- Headline context in the first and last row
- Bilingual context in one cell
- Mixture of the above
You might wonder: if every cell can be located by (x, y) coordinates, doesn't that make it a standard 2D table? Not quite. A standard 2D table has a fixed number of rows and columns, whereas a non-standard table has a dynamic number of rows and columns. For example, a 3x4 table with exactly 12 cells is standard; if you observe more or fewer cells, because cells have been merged or split, the table becomes non-standard.
In the past, many approaches to tackling non-standard table extraction were based on programming. These methods worked, but you had to constantly maintain the codebase to support new non-standard table layouts that your existing code couldn't recognize. Now, by leveraging the power of GPT-4o (2024-05-13) and the 2024-02-29 preview of Azure Document Intelligence, your non-standard table recognition mechanism can evolve to the next level: a NO-CODE approach.
Although LLMs are now capable of reading data written in markdown and JSON formats and understanding the correlations within it, this is not the best approach, since the data relationships are still stored implicitly. For example, if a child asks their mother what time it is, the mother can give two answers: the first is "7 o'clock," and the second is "(24-3)/3." Which answer is the most straightforward and least confusing? Clearly "7 o'clock," because it is expressed directly in language, with no math involved. Therefore, if we can convert data relationships into clear English statements, LLMs can handle them with minimal effort, perform faster, and deliver the highest accuracy. The diagram below illustrates this refactoring:
Although we mentioned no code is required, it doesn't mean this can be achieved through configuration alone. We still need to write a smart prompt to instruct GPT-4o to perform the tasks that coding would typically handle.
- you are ai assistant help to read given data from user in markdown format and answer the user query. if the answer going to respond to user is based on the data in table format, please refer to the markdown sample below for how the tabular data look like.\n
---tabular data in markdown format\n
*{{place a few shot example to let openai follow the example as reference}}*
---\n
The quality of your few-shot examples significantly impacts the final result when GPT-4o flattens the markdown or JSON.
flattening this table but retain any chinese words which correlated to the english words as reference, given this
---tabular data in markdown format\n
*{{markdown context extracted from document intelligence}}*
---\n
- temperature = 0.4, top_p = 0.4: for flattening markdown into row-based single sentences in a precise manner
- temperature = 0.6, top_p = 0.5: for converting JSON to markdown and auto-filling missing cell values in a creative manner
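As a rough illustration, the "precise" settings above map onto an Azure OpenAI call like the sketch below, assuming the openai Python SDK (v1) and a GPT-4o deployment named "gpt-4o"; the endpoint, key, and file names mirror the Sample A assets but are otherwise placeholders.

```python
# Sketch: one chat completion call with the "precise" settings used for flattening markdown.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                               # placeholder
    api_version="2024-02-01",
)

with open("system_message.txt", encoding="utf-8") as f:
    system_message = f.read()   # the flattening system message shown above
with open("prompt.txt", encoding="utf-8") as f:
    prompt = f.read()           # the user prompt with the markdown context already pasted in

response = client.chat.completions.create(
    model="gpt-4o",             # your GPT-4o (2024-05-13) deployment name
    temperature=0.4,            # "precise" settings for flattening markdown into row-based sentences
    top_p=0.4,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)
```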
Using the prompt and system message template above, GPT-4o can read content in markdown format and flatten data into row-based statements. The trick we use here is to ask GPT-4o to automatically refill any missing cell values based on its understanding of the markdown of non-standard 2D table structures while de-normalizing all the rows against their columns. Furthermore, the prompt instructs GPT-4o to mark any generated cell value with a watermark "{auto-fill}". This watermark allows us to recognize that this cell value is artificially generated, notifying us to double-check this value in our evaluation pipeline.
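Because the watermark is plain text, flagging auto-filled values for review can be a simple scan over the flattened statements. A minimal sketch, assuming the flattened output has been saved as completion.txt:

```python
# Sketch: flag any flattened statements that contain values GPT-4o generated itself.
# "{auto-fill}" is the watermark our prompt asks GPT-4o to attach to filled-in cells.
with open("completion.txt", encoding="utf-8") as f:
    statements = [line.strip() for line in f if line.strip()]

flagged = [s for s in statements if "{auto-fill}" in s]
for s in flagged:
    print("REVIEW:", s)   # route these to the evaluation pipeline for a double-check
```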
After GPT-4o flattens the data, we obtain a set of single statements that encompass all related information, including the table name and row item context in relation to its columns.
Property and Equipment, Net: At September 30, 2009, land is $3,190\n
Property and Equipment, Net: At September 30, 2010, buildings and leasehold improvements is $19,987\n
This row-based format helps LLM models understand the semantic relationships between all data segments (row items against all columns). Additionally, consolidating all related data into one statement enhances the confidence score in AI search results, as the indexing algorithm is based on similarity.
- Open Azure AI Studio (Chat playground), select gpt-4o 2024-05-13 and configure the model parameters: temperature = 0.4, top_p = 0.4
- Copy the system_message for sample A and set it as system message in Chat playground
- Open di_output_slim.json, copy the whole string of analyzeResult.content
- Open prompt.txt, paste the whole string into line 2, just above ---\n
- Copy all the content and paste this aggregated prompt into Chat playground and run it
- You will get a result similar to the completion of Sample A
- Copy the content of completion.txt and set it as the system message in Chat playground
- Test out different questions, for example "what is the claims for Anaesthetist visit if I got Deluxe plan"; a programmatic sketch of these last two steps follows below
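Here is that sketch, again assuming the openai Python SDK (v1) and an Azure GPT-4o deployment; the endpoint, key, and deployment name are placeholders.

```python
# Sketch: use the flattened statements (completion.txt) as the grounding system message and ask a question.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                               # placeholder
    api_version="2024-02-01",
)

with open("completion.txt", encoding="utf-8") as f:
    flattened_statements = f.read()

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": flattened_statements},
        {"role": "user", "content": "what is the claims for Anaesthetist visit if I got Deluxe plan"},
    ],
)
print(answer.choices[0].message.content)
```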
This table looks more complicated, so we need to ask GPT-4o twice:
- 1st call: send analyzeResult.tables to GPT-4o, convert it to markdown, and auto-fill the missing cell values
- 2nd call: send the revised markdown to GPT-4o and flatten it as we did in Sample A
After the 1st call, the missing cell values are filled in intelligently by GPT-4o and marked with the {auto-fill} watermark. This watermark allows us to recognize that a cell value is artificially generated, notifying us to double-check it in our evaluation pipeline.
- Open Azure AI Studio (Chat playground), select gpt-4o 2024-05-13 and configure the model parameters: max_tokens=4096, temperature = 0.6, top_p = 0.5
- Copy the 1st system_message for sample B and set it as system message in Chat playground
- Open di_output_slim.json, copy the whole tables array from analyzeResult.tables
- Paste the whole tables array into Chat playground and run it; no additional prompt is needed, just like prompt_1stcall
- Copy the result and save it as a markdown file named "completion_1st.md", just like the provided completion_1st.md
- Open and copy the 2nd system_message for Sample B and set it as the system message in Chat playground; also configure the model parameters: temperature = 0.4, top_p = 0.4
- Open prompt_2ndcall.txt and paste the content of "completion_1st.md" from step 5 into line 2, just above ---\n
- Copy all the content and paste this aggregated prompt into Chat playground and run it
- You will get a result similar to the 2nd completion of Sample B
- Copy the content of completion_2nd.txt and set it as the system message in Chat playground
- Test out different questions, for example "what is the claims for Chiropractor if non-network?"; a sketch of the full two-call flow follows below
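Here is that sketch of the two-call flow outside the playground, assuming the openai Python SDK (v1); the file names mirror the Sample B assets above, and the placement of the repaired markdown inside prompt_2ndcall.txt follows step 7.

```python
# Sketch of Sample B: (1) tables JSON -> repaired markdown with auto-filled cells,
# (2) repaired markdown -> flattened row-based statements.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                               # placeholder
    api_version="2024-02-01",
)

def read(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

with open("di_output_slim.json", encoding="utf-8") as f:
    tables = json.load(f)["analyzeResult"]["tables"]

# 1st call: "creative" settings rebuild the markdown and auto-fill missing cells (watermarked "{auto-fill}").
first = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=4096,
    temperature=0.6,
    top_p=0.5,
    messages=[
        {"role": "system", "content": read("system_message_1stcall.txt")},
        {"role": "user", "content": json.dumps(tables, ensure_ascii=False)},
    ],
)
markdown = first.choices[0].message.content        # equivalent of completion_1st.md

# 2nd call: "precise" settings flatten the repaired markdown into row-based statements.
prompt_lines = read("prompt_2ndcall.txt").splitlines()
prompt_lines.insert(1, markdown)                   # paste into line 2, just above the closing ---
second = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.4,
    top_p=0.4,
    messages=[
        {"role": "system", "content": read("system_message_2ndcall.txt")},
        {"role": "user", "content": "\n".join(prompt_lines)},
    ],
)
print(second.choices[0].message.content)           # equivalent of completion_2nd.txt
```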
Sample A: The input tokens include both system_message.txt and prompt.txt with the markdown context from analyzeResult.content of di_output_slim.json. The output tokens refer to completion.txt
Sample B: The input tokens include system_message_1stcall.txt, system_message_2ndcall.txt, prompt_1stcall.txt and prompt_2ndcall.txt plus the markdown context from completion_1st.md. The output tokens include both completion_1st.md and completion_2nd.txt
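If you want to estimate these token counts yourself, a rough sketch with tiktoken is shown below, assuming GPT-4o uses the o200k_base encoding; note that actual billed input tokens also include chat-format overhead.

```python
# Rough token estimate for the files that make up Sample A's input and output.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # assumed GPT-4o encoding

def count(path):
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

sample_a_in = count("system_message.txt") + count("prompt.txt")
sample_a_out = count("completion.txt")
print(f"Sample A: ~{sample_a_in} input tokens, ~{sample_a_out} output tokens")
```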
Reflecting on the 1980s and 1990s, we focused on data normalization to enhance retrieval speed and optimize storage capacity. Over the past three decades, the approach to data retrieval has evolved significantly. Instead of reading normalized documents, users now interact with chatbots, posing questions directly. With the advent of NoSQL databases, data denormalization has become a common practice in data engineering. Today, the need for machine learning models to read data efficiently has further propelled the trend towards denormalization.
In summary, recognizing non-standard tables is a big challenge in AI, but with the right tools and techniques, we can make significant progress.