The year 2024 marks the second year of the generative AI era. Many AI engineers recognize that the biggest challenge in RAG (Retrieval-Augmented Generation) patterns is maintaining high accuracy in generated answers, and the quality of pre-processing in the document ingestion pipeline is crucial to that accuracy. One major hurdle is table recognition: extracting data relationships across multiple columns and sub-columns. Financial documents and global PSG reports, for example, often use non-standard table structures, and these complexities make improved pre-processing techniques essential for accurate answers.
Intuitively, we might think that the GPT model can read the document directly. However, this is not the case. To extract text-based data from documents, we need an OCR service such as Azure Document Intelligence to read formats like PDF or DOCX and transform the content into JSON for pre-processing. However, there are countless document styles in the world, and getting OCR to recognize complicated tables and artistic layouts is an ongoing battle that no single approach or model can win. We must therefore focus on pre-processing to refine the OCR results adaptively, especially for data represented in non-standard table layouts.
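For illustration, here is a minimal sketch of that extraction step, assuming the azure-ai-documentintelligence Python SDK and the prebuilt-layout model. The endpoint, key, and file names are placeholders, and parameter names vary slightly between the preview and GA versions of the SDK.

```python
# Sketch: extract a document's layout (text + tables) with Azure Document Intelligence,
# then keep only the pieces the pre-processing pipeline needs (analyzeResult.content and .tables).
import json
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

client = DocumentIntelligenceClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

with open("financial_report.pdf", "rb") as f:                          # placeholder file name
    poller = client.begin_analyze_document(
        "prebuilt-layout",                       # layout model: returns content, tables, and structure
        analyze_request=f,
        content_type="application/octet-stream",
        output_content_format="markdown",        # markdown output (2024-02-29 preview and later)
    )
result = poller.result()

slim = {
    "analyzeResult": {
        "content": result.content,
        # as_dict() serializes the SDK table models; adjust to your SDK version if needed
        "tables": [t.as_dict() for t in (result.tables or [])],
    }
}
with open("di_output_slim.json", "w", encoding="utf-8") as out:
    json.dump(slim, out, ensure_ascii=False, indent=2)
```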
Since 2022, I've been working on pre-processing document extraction pipelines, focusing on non-standard tables. Here are common patterns:
- Combined columns in one row
- Combined rows in one column
- Sub-column in one row
- Headline context in the first and last row
- Bilingual context in one cell
- Mixture of the above
You might wonder: if every cell can be located by (x, y) coordinates, doesn't that make it a standard 2D table? Not quite. A standard 2D table has a fixed number of rows and columns, whereas a non-standard table has a dynamic number of rows and columns. For example, a 3x4 table with exactly 12 cells is standard; if you observe more or fewer cells, because cells have been merged or split, the table becomes non-standard.
In the past, many approaches to tackling non-standard table extraction were based on programming. These methods worked, but you had to constantly maintain the codebase to support new non-standard table layouts that your existing code couldn't recognize. Now, by leveraging the power of GPT-4o (2024-05-13) and the 2024-02-29 preview of Azure Document Intelligence, your non-standard table recognition mechanism can evolve to the next level: a NO-CODE approach.
Although LLMs are now capable of reading data written in markdown and JSON formats and understanding the correlations within it, this is not the best approach, since the data relationships are still stored implicitly. For example, if a child asks their mother what time it is, the mother can give two answers: the first is "7 o'clock," and the second is "(24-3)/3." Which answer is the most straightforward and least confusing? Clearly "7 o'clock," because it is expressed directly in language, with no math involved. Therefore, if we can convert data relationships into clear English statements, LLMs can handle them with minimal effort, perform faster, and deliver the highest accuracy. The diagram below illustrates this refactoring:
Although we mentioned no code is required, it doesn't mean this can be achieved through configuration alone. We still need to write a smart prompt to instruct GPT-4o to perform the tasks that coding would typically handle.
- you are ai assistant help to read given data from user in markdown format and answer the user query. if the answer going to respond to user is based on the data in table format, please refer to the markdown sample below for how the tabular data look like.\n
---tabular data in markdown format\n
*{{place a few shot example to let openai follow the example as reference}}*
---\n
The quality of your few-shot examples significantly impacts the final result when GPT-4o flattens the markdown or JSON.
flattening this table but retain any chinese words which correlated to the english words as reference, given this
---tabular data in markdown format\n
*{{markdown context extracted from document intelligence}}*
---\n
- temperature = 0.4, top_p = 0.4: for flattening markdown into row-based single sentences in a precise manner
- temperature = 0.6, top_p = 0.5: for converting JSON to markdown and auto-filling missing cell values in a creative manner
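As a rough illustration, the "precise" settings above map onto an Azure OpenAI call like the sketch below, assuming the openai Python SDK (v1) and a GPT-4o deployment named "gpt-4o"; the endpoint, key, and file names mirror the Sample A assets but are otherwise placeholders.

```python
# Sketch: one chat completion call with the "precise" settings used for flattening markdown.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                               # placeholder
    api_version="2024-02-01",
)

with open("system_message.txt", encoding="utf-8") as f:
    system_message = f.read()   # the flattening system message shown above
with open("prompt.txt", encoding="utf-8") as f:
    prompt = f.read()           # the user prompt with the markdown context already pasted in

response = client.chat.completions.create(
    model="gpt-4o",             # your GPT-4o (2024-05-13) deployment name
    temperature=0.4,            # "precise" settings for flattening markdown into row-based sentences
    top_p=0.4,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)
```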
Using the prompt and system message template above, GPT-4o can read content in markdown format and flatten data into row-based statements. The trick we use here is to ask GPT-4o to automatically refill any missing cell values based on its understanding of the markdown of non-standard 2D table structures while de-normalizing all the rows against their columns. Furthermore, the prompt instructs GPT-4o to mark any generated cell value with a watermark "{auto-fill}". This watermark allows us to recognize that this cell value is artificially generated, notifying us to double-check this value in our evaluation pipeline.
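Because the watermark is plain text, flagging auto-filled values for review can be a simple scan over the flattened statements. A minimal sketch, assuming the flattened output has been saved as completion.txt:

```python
# Sketch: flag any flattened statements that contain values GPT-4o generated itself.
# "{auto-fill}" is the watermark our prompt asks GPT-4o to attach to filled-in cells.
with open("completion.txt", encoding="utf-8") as f:
    statements = [line.strip() for line in f if line.strip()]

flagged = [s for s in statements if "{auto-fill}" in s]
for s in flagged:
    print("REVIEW:", s)   # route these to the evaluation pipeline for a double-check
```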
After GPT-4o flattens the data, we obtain a set of single statements that encompass all related information, including the table name and row item context in relation to its columns.
Property and Equipment, Net: At September 30, 2009, land is $3,190\n
Property and Equipment, Net: At September 30, 2010, buildings and leasehold improvements is $19,987\n
This row-based format helps LLM models understand the semantic relationships between all data segments (row items against all columns). Additionally, consolidating all related data into one statement enhances the confidence score in AI search results, as the indexing algorithm is based on similarity.
- Open Azure AI Studio (Chat playground), select gpt-4o 2024-05-13 and configure the model parameters: temperature = 0.4, top_p = 0.4
- Copy the system_message for sample A and set it as system message in Chat playground
- Open di_output_slim.json, copy the whole string of analyzeResult.content
- Open prompt.txt, paste the whole string into line 2, just above ---\n
- Copy all the content and paste this aggregated prompt into Chat playground and run it
- You will get a result similar to the completion of Sample A
- Copy the content of completion.txt and set it as the system message in Chat playground
- Test out different questions, for example "what is the claims for Anaesthetist visit if I got Deluxe plan"; a programmatic sketch of these last two steps follows below
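Here is that sketch, again assuming the openai Python SDK (v1) and an Azure GPT-4o deployment; the endpoint, key, and deployment name are placeholders.

```python
# Sketch: use the flattened statements (completion.txt) as the grounding system message and ask a question.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                               # placeholder
    api_version="2024-02-01",
)

with open("completion.txt", encoding="utf-8") as f:
    flattened_statements = f.read()

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": flattened_statements},
        {"role": "user", "content": "what is the claims for Anaesthetist visit if I got Deluxe plan"},
    ],
)
print(answer.choices[0].message.content)
```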
This table looks more complicated, so we need to ask GPT-4o twice:
- 1st call: send analyzeResult.tables to GPT-4o, convert it to markdown, and auto-fill the missing cell values
- 2nd call: send the revised markdown to GPT-4o and flatten it as we did in Sample A
After the 1st call, the missing cell values are filled in intelligently by GPT-4o and marked with the {auto-fill} watermark. This watermark allows us to recognize that a cell value is artificially generated, notifying us to double-check it in our evaluation pipeline.
- Open Azure AI Studio (Chat playground), select gpt-4o 2024-05-13 and configure the model parameters: max_tokens=4096, temperature = 0.6, top_p = 0.5
- Copy the 1st system_message for sample B and set it as system message in Chat playground
- Open di_output_slim.json, copy the whole tables array from analyzeResult.tables
- Paste the whole tables array into Chat playground and run it; no additional prompt is needed, just like prompt_1stcall
- Copy the result and save it as a markdown file named "completion_1st.md", just like the provided completion_1st.md
- Open and copy the 2nd system_message for Sample B and set it as the system message in Chat playground; also configure the model parameters: temperature = 0.4, top_p = 0.4
- Open prompt_2ndcall.txt and paste the content of "completion_1st.md" from step 5 into line 2, just above ---\n
- Copy all the content and paste this aggregated prompt into Chat playground and run it
- You will get a result similar to the 2nd completion of Sample B
- Copy the content of completion_2nd.txt and set it as the system message in Chat playground
- Test out different questions, for example "what is the claims for Chiropractor if non-network?"; a sketch of the full two-call flow follows below
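Here is that sketch of the two-call flow outside the playground, assuming the openai Python SDK (v1); the file names mirror the Sample B assets above, and the placement of the repaired markdown inside prompt_2ndcall.txt follows step 7.

```python
# Sketch of Sample B: (1) tables JSON -> repaired markdown with auto-filled cells,
# (2) repaired markdown -> flattened row-based statements.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                               # placeholder
    api_version="2024-02-01",
)

def read(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

with open("di_output_slim.json", encoding="utf-8") as f:
    tables = json.load(f)["analyzeResult"]["tables"]

# 1st call: "creative" settings rebuild the markdown and auto-fill missing cells (watermarked "{auto-fill}").
first = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=4096,
    temperature=0.6,
    top_p=0.5,
    messages=[
        {"role": "system", "content": read("system_message_1stcall.txt")},
        {"role": "user", "content": json.dumps(tables, ensure_ascii=False)},
    ],
)
markdown = first.choices[0].message.content        # equivalent of completion_1st.md

# 2nd call: "precise" settings flatten the repaired markdown into row-based statements.
prompt_lines = read("prompt_2ndcall.txt").splitlines()
prompt_lines.insert(1, markdown)                   # paste into line 2, just above the closing ---
second = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.4,
    top_p=0.4,
    messages=[
        {"role": "system", "content": read("system_message_2ndcall.txt")},
        {"role": "user", "content": "\n".join(prompt_lines)},
    ],
)
print(second.choices[0].message.content)           # equivalent of completion_2nd.txt
```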
Sample A: The input tokens include both system_message.txt and prompt.txt with the markdown context from analyzeResult.content of di_output_slim.json. The output tokens refer to completion.txt
Sample B: The input tokens include system_message_1stcall.txt, system_message_2ndcall.txt, prompt_1stcall.txt and prompt_2ndcall.txt plus the markdown context from completion_1st.md. The output tokens include both completion_1st.md and completion_2nd.txt
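If you want to estimate these token counts yourself, a rough sketch with tiktoken is shown below, assuming GPT-4o uses the o200k_base encoding; note that actual billed input tokens also include chat-format overhead.

```python
# Rough token estimate for the files that make up Sample A's input and output.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # assumed GPT-4o encoding

def count(path):
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

sample_a_in = count("system_message.txt") + count("prompt.txt")
sample_a_out = count("completion.txt")
print(f"Sample A: ~{sample_a_in} input tokens, ~{sample_a_out} output tokens")
```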
Reflecting on the 1980s and 1990s, we focused on data normalization to enhance retrieval speed and optimize storage capacity. Over the past three decades, the approach to data retrieval has evolved significantly. Instead of reading normalized documents, users now interact with chatbots, posing questions directly. With the advent of NoSQL databases, data denormalization has become a common practice in data engineering. Today, the need for machine learning models to read data efficiently has further propelled the trend towards denormalization.
In summary, recognizing non-standard tables is a big challenge in AI, but with the right tools and techniques, we can make significant progress.