Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: MIT-0

# Handling Multi Page Tables in Textract

## Background

In this notebook, we will cover how to detect and merge single tables that span multiple pages. 


## Setup
_This Notebook was created on ml.t2.medium notebook instances._

Let's start by installing and importing all necessary libraries:

In [1]:
!pip install amazon-textract-response-parser
!pip install amazon-textract-prettyprinter
!pip install amazon-textract-helper



In [2]:
import os
import json
from trp.t_pipeline import pipeline_merge_tables
import trp.trp2 as t2
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string, get_tables_string, Pretty_Print_Table_Format
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_tables import MergeOptions, HeaderFooterType
import boto3
textract_client = boto3.client('textract', region_name='us-east-2')

## Call Textract Command-line Tool
amazon-textract-helper provides a collection of ready-to-use functions and sample implementations to speed up evaluation and development for any project using Amazon Textract. It installs a command-line tool called amazon-textract.
Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). You can replace the S3 URI of the PDF document with your own.

In [4]:
s3_uri_of_documents = "s3://amazon-textract-public-content/multi-page-table/MPT_sample01-multi_page_table.pdf"
textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)

## Pretty print the output (pre-table merge)
Pretty printing outputs nicely formatted information for words, lines, forms, or tables. The helper function below loads the Textract JSON response into a trp Document and displays each table as a pandas DataFrame. As you can see, the function prints two separate tables.

In [5]:
import pandas as pd
from trp import Document
from textractprettyprinter.t_pretty_print import convert_table_to_list
from IPython.display import display

def PrettyPrintTables(textract_json):
    """Display every table in a Textract response as a pandas DataFrame."""
    tdoc = Document(textract_json)
    for page in tdoc.pages:
        for table in page.tables:
            # Convert the trp table into a list of rows, then into a DataFrame.
            df = pd.DataFrame(convert_table_to_list(trp_table=table))
            print('Table id:', table.id, 'Row count:', len(df.index))
            display(df)

In [6]:
PrettyPrintTables(textract_json)

Table id: a4e6561d-0843-4551-9345-a41a73d80a49 Row count: 35


Unnamed: 0,0,1,2
0,Date,Description,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 480dd5ba-b789-4fb0-83fd-f5d0792f26a3 Row count: 34


Unnamed: 0,0,1,2
0,7/28/2020,Transport - Uber - 5004776995,-150
1,6/7/2020,Retail Purchase - Loblaws - 5476998456,-294
2,7/23/2020,Retail Purchase - Sobeys - 5505969927,-269
3,3/21/2020,Transport - Lyft - 5688740948,-52
4,8/27/2020,Food Purchase - McDonalds - 5798336406,-222
5,11/25/2020,Food Purchase - Starbucks - 5822058649,-169
6,9/19/2020,Retail Purchase - Sobeys - 5948096947,-250
7,2/25/2020,Retail Purchase - Sobeys - 6030150884,-269
8,7/15/2020,Food Purchase - McDonalds - 6031636977,-287
9,9/2/2020,Food Purchase - Tim Hortons - 6052214096,-268


## Merge tables across pages
Sometimes a table starts on one page and continues on the next page or pages. This component detects that case based on the number of columns and on whether a header is present on the subsequent table, and it can modify the output Textract JSON schema for downstream processing. You can also develop custom logic for specific use cases.
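As a rough illustration of the detection rule described above (same number of columns, no header on the continuation table), here is a minimal sketch on plain lists of rows. It is not the library's implementation: the function name `looks_like_continuation` and the idea of detecting a header by comparing first rows are assumptions made only for this example.

```python
from typing import List

Row = List[str]


def looks_like_continuation(first: List[Row], second: List[Row]) -> bool:
    """Hypothetical heuristic: `second` continues `first` when both have the
    same number of columns and `second` does not repeat the header row."""
    if not first or not second:
        return False
    same_columns = len(first[0]) == len(second[0])
    repeats_header = first[0] == second[0]
    return same_columns and not repeats_header


# Rows taken from the sample output shown earlier in this notebook.
page1 = [["Date", "Description", "Withdrawals/Deposits"],
         ["2/6/2020", "Food Purchase - McDonalds - 1194089245", "-171"]]
page2 = [["7/28/2020", "Transport - Uber - 5004776995", "-150"]]

merged = page1 + page2 if looks_like_continuation(page1, page2) else page1
print(len(merged))
```

A real check would also need to look at geometry and at what lies between the two tables, which this sketch deliberately ignores.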

MergeOptions.MERGE combines the tables and makes them appear as one for post-processing, with the drawback that the geometry information is no longer accurate. Overlaying bounding boxes on the merged table will therefore not be accurate either.

MergeOptions.LINK maintains the geometric structure and enriches the table information with links between the table elements. A custom['previous_table'] and a custom['next_table'] attribute are added to the TABLE blocks in the Textract JSON schema.

In [7]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)   

#### Pretty print the output (post-table merge)
As you can see, both tables are merged into one table.

In [8]:
PrettyPrintTables(json_data)

Table id: a4e6561d-0843-4551-9345-a41a73d80a49 Row count: 69


Unnamed: 0,0,1,2
0,Date,Description,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
...,...,...,...
64,3/31/2020,Food Purchase - Starbucks - 9006979910,-162
65,12/5/2020,Service Charge - Bank - 9195881165,-10
66,6/23/2020,Food Purchase - McDonalds - 9705740969,-183
67,9/7/2020,Retail Purchase - Sobeys - 9867812469,-32


## Link tables across pages
MergeOptions.LINK maintains the geometric structure and enriches the table information with links between the table elements. A custom['previous_table'] and a custom['next_table'] attribute are added to the TABLE blocks in the Textract JSON schema.

In [9]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.LINK, None, HeaderFooterType.NONE)  

In [10]:
for b in t_document.blocks:
    if b.block_type == t2.TextractBlockTypes.TABLE.name:
        print('---------------')
        print('Table id: ' + b.id)
        print(b.custom)
        

---------------
Table id: a4e6561d-0843-4551-9345-a41a73d80a49
{'next_table': '480dd5ba-b789-4fb0-83fd-f5d0792f26a3'}
---------------
Table id: 480dd5ba-b789-4fb0-83fd-f5d0792f26a3
{'previous_table': 'a4e6561d-0843-4551-9345-a41a73d80a49'}
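
The links printed above can be followed to rebuild the logical table without touching the geometry. The sketch below works on a plain dictionary that maps each table id to its `custom` value, mirroring the printed output; `table_chains` is a hypothetical helper written for this example, not part of the library.

```python
from typing import Dict, List, Optional


def table_chains(custom_by_id: Dict[str, Optional[dict]]) -> List[List[str]]:
    """Group table ids into chains by following 'next_table' links."""
    chains = []
    for table_id, custom in custom_by_id.items():
        # A chain starts at a table that has no 'previous_table' link.
        if custom and "previous_table" in custom:
            continue
        chain = [table_id]
        seen = {table_id}
        next_id = (custom or {}).get("next_table")
        # Guard against cycles with `seen` while walking the links.
        while next_id and next_id not in seen:
            chain.append(next_id)
            seen.add(next_id)
            next_id = (custom_by_id.get(next_id) or {}).get("next_table")
        chains.append(chain)
    return chains


# Values copied from the output above.
links = {
    "a4e6561d-0843-4551-9345-a41a73d80a49": {
        "next_table": "480dd5ba-b789-4fb0-83fd-f5d0792f26a3"},
    "480dd5ba-b789-4fb0-83fd-f5d0792f26a3": {
        "previous_table": "a4e6561d-0843-4551-9345-a41a73d80a49"},
}
print(table_chains(links))
```

With the linked `t_document` from the cell above, the input dictionary could be built from the same attributes the loop prints, for example `{b.id: b.custom for b in t_document.blocks if b.block_type == t2.TextractBlockTypes.TABLE.name}`.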


## Additional Examples: The tool identifies and merges tables across the document
In this example, the document contains multiple tables. Two pairs of tables need to be merged.

In [11]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample02-multi_tables.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

Table id: 5f46b780-4e0e-4175-9acf-5c144d7fb8f7 Row count: 10


Unnamed: 0,0,1,2
0,Date,Description 1,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 66f91244-0e3a-46bb-bdaf-d480f49cc069 Row count: 5


Unnamed: 0,0,1,2
0,Date,Description 2,Withdrawals / Deposits
1,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
2,5/12/2020,Transport - Lyft - 4027406850,-243
3,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
4,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174


Table id: b43d8737-3481-443a-b5ef-faa3a01c57c5 Row count: 8


Unnamed: 0,0,1,2
0,Date,Description 4,Withdrawals/Deposits
1,9/14/2020,Retail Purchase - Loblaws - 3631254120,-258
2,11/15/2020,Food Purchase - Tim Hortons - 3666530197,-297
3,4/7/2020,Food Purchase - Tim Hortons - 3746491660,-162
4,7/24/2020,Transport - Lyft - 3750786000,-161
5,12/15/2020,Food Purchase - Starbucks - 3882328066,-178
6,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
7,5/12/2020,Transport - Lyft - 4027406850,-243


Table id: 87122c45-6ca1-4c23-89e3-e1a256bba6a5 Row count: 6


Unnamed: 0,0,1,2
0,Date,Description 3,Withdrawals/ Deposits
1,5/12/2020,Transport - Lyft - 4027402346850,-243
2,1/26/2020,Retail Purchase - Loblaws - 4335346234753,-173
3,4/9/2020,Retail Purchase - Sobeys - 4505102343520,-174
4,4/10/2020,Transport - Uber - 4608033234455,-183
5,6/15/2020,Service Charge - Bank - 471155209600,-7


Table id: 905f9f19-91ea-4ab1-a8ea-d180992b408f Row count: 3


Unnamed: 0,0,1,2
0,4/10/2020,Transport - Uber - 4608033455,-183
1,6/15/2020,Service Charge - Bank - 4711509600,-7
2,11/10/2020,Transport - Lyft - 4740375574,-141


Table id: f0991c49-d7be-497f-92a3-0c68edad2180 Row count: 4


Unnamed: 0,0,1,2
0,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
1,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
2,4/10/2020,Transport - Uber - 4608033455,-183
3,6/15/2020,Service Charge - Bank - 4711509600,-7


#### Merge tables with 95% dimension tolerance
We use a custom accuracy of 95% to calculate table similarity. By default, the component uses 99%.
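
To give a feel for what a percentage tolerance means when comparing table dimensions, here is a small sketch. The comparison rule (the smaller value must reach a given percentage of the larger one) and the name `within_tolerance` are assumptions for illustration; the library may compute similarity differently.

```python
def within_tolerance(a: float, b: float,
                     accuracy_percentage: float = 99.0) -> bool:
    """Hypothetical rule: True when the smaller value is at least
    `accuracy_percentage` percent of the larger one."""
    larger = max(abs(a), abs(b))
    if larger == 0:
        return True  # both values are zero, so they are identical
    return min(abs(a), abs(b)) / larger * 100 >= accuracy_percentage


# Two table widths as fractions of the page width (made-up values).
width_page_1, width_page_2 = 0.80, 0.77

print(within_tolerance(width_page_1, width_page_2))        # strict 99% check
print(within_tolerance(width_page_1, width_page_2, 95.0))  # relaxed 95% check
```

Under this rule the two widths differ by a little under 4%, so they fail the 99% check and pass the 95% one, which is the kind of case where lowering the accuracy lets slightly misaligned tables merge.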

In [12]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE, 95)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

Table id: 5f46b780-4e0e-4175-9acf-5c144d7fb8f7 Row count: 10


Unnamed: 0,0,1,2
0,Date,Description 1,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 66f91244-0e3a-46bb-bdaf-d480f49cc069 Row count: 8


Unnamed: 0,0,1,2
0,Date,Description 2,Withdrawals / Deposits
1,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
2,5/12/2020,Transport - Lyft - 4027406850,-243
3,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
4,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
5,4/10/2020,Transport - Uber - 4608033455,-183
6,6/15/2020,Service Charge - Bank - 4711509600,-7
7,11/10/2020,Transport - Lyft - 4740375574,-141


Table id: 87122c45-6ca1-4c23-89e3-e1a256bba6a5 Row count: 6


Unnamed: 0,0,1,2
0,Date,Description 3,Withdrawals/ Deposits
1,5/12/2020,Transport - Lyft - 4027402346850,-243
2,1/26/2020,Retail Purchase - Loblaws - 4335346234753,-173
3,4/9/2020,Retail Purchase - Sobeys - 4505102343520,-174
4,4/10/2020,Transport - Uber - 4608033234455,-183
5,6/15/2020,Service Charge - Bank - 471155209600,-7


Table id: b43d8737-3481-443a-b5ef-faa3a01c57c5 Row count: 8


Unnamed: 0,0,1,2
0,Date,Description 4,Withdrawals/Deposits
1,9/14/2020,Retail Purchase - Loblaws - 3631254120,-258
2,11/15/2020,Food Purchase - Tim Hortons - 3666530197,-297
3,4/7/2020,Food Purchase - Tim Hortons - 3746491660,-162
4,7/24/2020,Transport - Lyft - 3750786000,-161
5,12/15/2020,Food Purchase - Starbucks - 3882328066,-178
6,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
7,5/12/2020,Transport - Lyft - 4027406850,-243


Table id: f0991c49-d7be-497f-92a3-0c68edad2180 Row count: 4


Unnamed: 0,0,1,2
0,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
1,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
2,4/10/2020,Transport - Uber - 4608033455,-183
3,6/15/2020,Service Charge - Bank - 4711509600,-7


## Additional Examples: Merging a table that extends across pages
This example has a table that extends across pages 1, 2, and 3 and needs to be merged.

In [13]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample03-long_multi_page_table.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

Table id: 60a259b9-ee2d-4a4a-8588-fee6783c3b29 Row count: 10


Unnamed: 0,0,1,2
0,Date,Description 1,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 9e0006c6-c14c-483a-ac2e-5fdefc3327a6 Row count: 7


Unnamed: 0,0,1,2
0,Date,Description 2,Withdrawals / Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249


Table id: 171b2155-fefd-4b96-9108-a356a1649031 Row count: 41


Unnamed: 0,0,1,2
0,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
1,8/1/2020,Transport - Uber - 2135962828,-65
2,5/31/2020,Online Retail - Amazon.com - 2257980180,-156
3,11/10/2020,Service Charge - Bank - 2270088418,-2
4,8/11/2020,Food Purchase - McDonalds - 2350678683,-219
5,9/19/2020,Food Purchase - Starbucks - 2558819681,-229
6,8/17/2020,Food Purchase - McDonalds - 2591297145,-127
7,11/21/2020,Retail Purchase - Sobeys - 2687789993,-84
8,7/3/2020,Transport - Lyft - 2705182570,-157
9,4/1/2020,Transport - Uber - 3020941883,-238


Table id: d4ce7a0c-e7ca-4df0-8b29-c7284bcda530 Row count: 11


Unnamed: 0,0,1,2
0,7/23/2020,Food Purchase - Starbucks - 6528599341,-21
1,8/1/2020,Service Charge - Bank - 6556484014,-5
2,3/5/2020,Online Retail - Amazon.com - 6597837385,-215
3,2/2/2020,Food Purchase - Tim Hortons - 6694508417,-82
4,12/30/2020,Online Retail - Amazon.com - 7544350060,-112
5,9/27/2020,Online Retail - Amazon.com - 7673426647,-4
6,8/22/2020,Transport - Uber - 7686595684,-57
7,10/9/2020,Food Purchase - McDonalds - 7779261930,-64
8,8/31/2020,Retail Purchase - Loblaws - 8158060902,-95
9,4/6/2020,Retail Purchase - Sobeys - 8403287221,-242


In [14]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

Table id: 60a259b9-ee2d-4a4a-8588-fee6783c3b29 Row count: 10


Unnamed: 0,0,1,2
0,Date,Description 1,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 9e0006c6-c14c-483a-ac2e-5fdefc3327a6 Row count: 59


Unnamed: 0,0,1,2
0,Date,Description 2,Withdrawals / Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


## Additional Examples: Merging tables when the pages have headers and footers
The document contains header and footer values that can be ignored while assessing tables to be merged. This example has both a header and a footer.
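
One way to picture "ignoring" a footer is to treat the bottom band of the page as empty when checking whether anything follows the table. The sketch below uses page-relative coordinates in the style of Textract bounding boxes (0.0 at the top of the page, 1.0 at the bottom). The helper name `nothing_below_table` and the margin values are hypothetical and do not reflect the sizes the library associates with each HeaderFooterType.

```python
from typing import List


def nothing_below_table(table_bottom: float, line_tops: List[float],
                        footer_margin: float = 0.1) -> bool:
    """True when no text line sits between the table's bottom edge and the
    start of the footer band."""
    footer_start = 1.0 - footer_margin
    return not any(table_bottom < top < footer_start for top in line_tops)


# Made-up layout: the table ends at 0.85 and a page-number line sits at 0.95.
line_tops = [0.95]

print(nothing_below_table(0.85, line_tops, footer_margin=0.0))
print(nothing_below_table(0.85, line_tops, footer_margin=0.1))
```

Without a footer margin the page-number line counts as content after the table, so the table would not look like it runs to the end of the page. With the margin, the line falls inside the ignored band and the table becomes a merge candidate. The same idea applies to a header band at the top of the following page.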

In [15]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample04-header_footer_table.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

Table id: 2a4cee5b-4eb0-4d31-8178-15691778e9c4 Row count: 4


Unnamed: 0,0,1,2
0,Date,Description,Withdrawals / Deposits
1,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
2,5/12/2020,Transport - Lyft - 4027406850,-243
3,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173


Table id: cca5059a-4647-4922-a46c-fb0e36e060c4 Row count: 4


Unnamed: 0,0,1,2
0,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
1,4/10/2020,Transport - Uber - 4608033455,-183
2,6/15/2020,Service Charge - Bank - 4711509600,-7
3,11/10/2020,Transport - Lyft - 4740375574,-141


In [16]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NORMAL)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

Table id: 2a4cee5b-4eb0-4d31-8178-15691778e9c4 Row count: 8


Unnamed: 0,0,1,2
0,Date,Description,Withdrawals / Deposits
1,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
2,5/12/2020,Transport - Lyft - 4027406850,-243
3,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
4,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
5,4/10/2020,Transport - Uber - 4608033455,-183
6,6/15/2020,Service Charge - Bank - 4711509600,-7
7,11/10/2020,Transport - Lyft - 4740375574,-141


## Creating a Custom Table Detection Function
The component lets you use your own table detection logic by passing a function to pipeline_merge_tables.
In the example below, we use a sample custom function that merges successive tables together.
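
Judging by the sample function in this section, a detection function returns a list of lists, where each inner list holds the ids of tables to merge. The pairing logic on its own can be sketched with plain strings; `pair_successive` and the ids `t1` to `t5` are placeholders invented for this example.

```python
from typing import List


def pair_successive(table_ids: List[str]) -> List[List[str]]:
    """Group ids two at a time; a trailing unpaired id is left out."""
    return [table_ids[i:i + 2] for i in range(0, len(table_ids) - 1, 2)]


print(pair_successive(["t1", "t2", "t3", "t4", "t5"]))
# [['t1', 't2'], ['t3', 't4']]
```

Because the pairing depends only on order, the tables must be visited in reading order first, which is why the sample function orders blocks geometrically before collecting ids.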

In [17]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample02-multi_tables.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

Table id: 89acb7ac-d265-46b5-8c87-b97447834e22 Row count: 10


Unnamed: 0,0,1,2
0,Date,Description 1,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 18a0ab18-4c9f-4dfc-9985-bd8f5d6aee73 Row count: 5


Unnamed: 0,0,1,2
0,Date,Description 2,Withdrawals / Deposits
1,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
2,5/12/2020,Transport - Lyft - 4027406850,-243
3,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
4,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174


Table id: 7ddac09b-d3ef-40d9-bae6-1ff194229ea4 Row count: 8


Unnamed: 0,0,1,2
0,Date,Description 4,Withdrawals/Deposits
1,9/14/2020,Retail Purchase - Loblaws - 3631254120,-258
2,11/15/2020,Food Purchase - Tim Hortons - 3666530197,-297
3,4/7/2020,Food Purchase - Tim Hortons - 3746491660,-162
4,7/24/2020,Transport - Lyft - 3750786000,-161
5,12/15/2020,Food Purchase - Starbucks - 3882328066,-178
6,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
7,5/12/2020,Transport - Lyft - 4027406850,-243


Table id: 11c505be-a2e2-442a-965e-69a74d8de89d Row count: 6


Unnamed: 0,0,1,2
0,Date,Description 3,Withdrawals/ Deposits
1,5/12/2020,Transport - Lyft - 4027402346850,-243
2,1/26/2020,Retail Purchase - Loblaws - 4335346234753,-173
3,4/9/2020,Retail Purchase - Sobeys - 4505102343520,-174
4,4/10/2020,Transport - Uber - 4608033234455,-183
5,6/15/2020,Service Charge - Bank - 471155209600,-7


Table id: 56618885-f7d4-4ed5-96fa-0524d29fdd98 Row count: 3


Unnamed: 0,0,1,2
0,4/10/2020,Transport - Uber - 4608033455,-183
1,6/15/2020,Service Charge - Bank - 4711509600,-7
2,11/10/2020,Transport - Lyft - 4740375574,-141


Table id: 04d054ac-b104-4ecd-be1b-84779f00e578 Row count: 4


Unnamed: 0,0,1,2
0,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
1,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
2,4/10/2020,Transport - Uber - 4608033455,-183
3,6/15/2020,Service Charge - Bank - 4711509600,-7


In [18]:
from trp.t_pipeline import order_blocks_by_geo

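def CustomTableDetectionFunction(t_document):
    """Sample detection logic: pair up successive tables in reading order."""
    table_ids_merge_list = []
    table_id_pairs = []
    # Order blocks geometrically so tables are visited in reading order.
    ordered_doc = order_blocks_by_geo(t_document)
    trp_doc = Document(TDocumentSchema().dump(ordered_doc))
    for current_page in trp_doc.pages:
        # Stop scanning at the first page that contains no tables.
        if len(current_page.tables) == 0:
            break
        for table in current_page.tables:
            table_id_pairs.append(table.id)
            # Once two ids are collected, record the pair and start a new one.
            if len(table_id_pairs) > 1:
                table_ids_merge_list.append(table_id_pairs.copy())
                table_id_pairs.clear()
    return table_ids_merge_list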


t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, CustomTableDetectionFunction, HeaderFooterType.NORMAL)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

Table id: 89acb7ac-d265-46b5-8c87-b97447834e22 Row count: 15


Unnamed: 0,0,1,2
0,Date,Description 1,Withdrawals/Deposits
1,2/6/2020,Food Purchase - McDonalds - 1194089245,-171
2,8/22/2020,Online Retail - Amazon.com - 1232495036,-44
3,11/1/2020,Food Purchase - Tim Hortons - 1509173654,-224
4,5/26/2020,Retail Purchase - Sobeys - 1896933493,-244
5,8/8/2020,Food Purchase - Tim Hortons - 1966610947,-116
6,7/6/2020,Transport - Lyft - 2039726014,-249
7,2/13/2020,Food Purchase - McDonalds - 2130609679,-127
8,8/1/2020,Transport - Uber - 2135962828,-65
9,5/31/2020,Online Retail - Amazon.com - 2257980180,-156


Table id: 56618885-f7d4-4ed5-96fa-0524d29fdd98 Row count: 9


Unnamed: 0,0,1,2
0,4/10/2020,Transport - Uber - 4608033455,-183
1,6/15/2020,Service Charge - Bank - 4711509600,-7
2,11/10/2020,Transport - Lyft - 4740375574,-141
3,Date,Description 3,Withdrawals/ Deposits
4,5/12/2020,Transport - Lyft - 4027402346850,-243
5,1/26/2020,Retail Purchase - Loblaws - 4335346234753,-173
6,4/9/2020,Retail Purchase - Sobeys - 4505102343520,-174
7,4/10/2020,Transport - Uber - 4608033234455,-183
8,6/15/2020,Service Charge - Bank - 471155209600,-7


Table id: 7ddac09b-d3ef-40d9-bae6-1ff194229ea4 Row count: 12


Unnamed: 0,0,1,2
0,Date,Description 4,Withdrawals/Deposits
1,9/14/2020,Retail Purchase - Loblaws - 3631254120,-258
2,11/15/2020,Food Purchase - Tim Hortons - 3666530197,-297
3,4/7/2020,Food Purchase - Tim Hortons - 3746491660,-162
4,7/24/2020,Transport - Lyft - 3750786000,-161
5,12/15/2020,Food Purchase - Starbucks - 3882328066,-178
6,10/6/2020,Food Purchase - Tim Hortons - 3993716869,-194
7,5/12/2020,Transport - Lyft - 4027406850,-243
8,1/26/2020,Retail Purchase - Loblaws - 4335346753,-173
9,4/9/2020,Retail Purchase - Sobeys - 4505103520,-174
