# Mortgage Document Extraction

---

At this point we have identified the documents, so we can start extracting information from them. The information we extract from each document depends on the document type, as depicted in the figure below.

<p align="center">
  <img src="./images/extraction.png" alt="cfn1" width="800px"/>
</p>

We will be extracting information from the following documents:

- [Uniform Residential Loan Application (URLA-1003) form](#step1)
- [Paystub](#step2)
- [W2 form](#step3)
- [Bank statement](#step4)
- [Credit card statement](#step5)
- [Mortgage Note](#step6)
- [Passport](#step7)
- [1099 INT form](#step8)
- [1099 DIV form](#step9)
- [1099 MISC form](#step10)
- [1099 R form](#step11)
- [Employment verification letter](#step12)
- [Mortgage Statement](#step13)

---

## Setup Notebook

We will use the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) to parse the Textract response, the data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's now install and import them.

In [None]:
!python -m pip install -q amazon-textract-response-parser --upgrade --force-reinstall
!python -m pip install -q amazon-textract-caller --upgrade --force-reinstall
!python -m pip install -q amazon-textract-prettyprinter --upgrade --force-reinstall

In [None]:
import boto3
import botocore
import sagemaker
import os
import io
import datetime
import json
import pandas as pd
from PIL import Image as PImage, ImageDraw
from pathlib import Path
import multiprocessing as mp
from IPython.display import Image, display, HTML, JSON, IFrame
from trp import Document
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string, Pretty_Print_Table_Format
from trp.trp2 import TDocument

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)

---
# Upload all documents to S3

We will upload all of our mortgage-related documents (present in the `docs` directory) to an Amazon S3 bucket. Note that the documents are uploaded to SageMaker's default S3 bucket. If you wish to use a different bucket, update the bucket name in the `data_bucket` variable and make sure that SageMaker has permissions to that bucket.

In [None]:
# Upload images to S3 bucket:
!aws s3 cp docs s3://{data_bucket}/idp-mortgage/textract --recursive --only-show-errors

Verify that the documents have been uploaded to S3.

In [None]:
!aws s3 ls s3://{data_bucket}/idp-mortgage/textract/

---
# Extracting information from documents

We will extract information from each of the mortgage documents that are available in the S3 bucket. We will use Amazon Textract's [Analyze Document](https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-text.html) API to extract FORMS, TABLES, and QUERIES from the documents. In some cases (for example, the passport document) we will instead use a specialized Amazon Textract API called [AnalyzeID](https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-identity.html).

---
## 1. URLA-1003 (Uniform Residential Loan Application) <a id="step1"></a>


In [None]:
documentName = "docs/URLA-1003.pdf"
display(IFrame(documentName, 500, 600));

In [None]:
response_urla_1003 = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/URLA-1003.pdf', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_urla_1003 = Document(response_urla_1003)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_urla_1003.pages:    
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_urla_1003.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))
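
The cell above produces a list of single-entry dictionaries, which is convenient for display but awkward for lookups. The following is a minimal sketch, not part of the original notebook, of one way to turn such a list into a searchable lookup. The helper names `flatten_form_fields` and `find_values` and the sample values are hypothetical, and the key normalization (collapsing whitespace, dropping a trailing colon, lowercasing) is an assumption about what is useful rather than anything Amazon Textract requires.

```python
from typing import Dict, List


def flatten_form_fields(forms: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Merge single-entry dicts into a mapping of normalized key -> list of values.

    A list of values is kept per key because the same label can appear
    more than once in a document.
    """
    lookup: Dict[str, List[str]] = {}
    for entry in forms:
        for key, value in entry.items():
            # Collapse whitespace, drop a trailing colon, and lowercase the key
            normalized = " ".join(key.split()).rstrip(":").strip().lower()
            lookup.setdefault(normalized, []).append(value)
    return lookup


def find_values(lookup: Dict[str, List[str]], needle: str) -> List[str]:
    """Return all values whose normalized key contains the search text."""
    needle = needle.lower()
    return [value for key, values in lookup.items() if needle in key for value in values]


# Hypothetical sample in the same shape as the `forms` list built above
sample_forms = [
    {"Borrower Name:": "Jane Doe"},
    {"Loan  Amount": "$ 250,000"},
    {"Co-Borrower Name": "None"},
]

lookup = flatten_form_fields(sample_forms)
print(find_values(lookup, "loan amount"))
print(find_values(lookup, "borrower name"))
```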

In [None]:
num_tables=1
for page in doc_urla_1003.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print("-------------------")
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n────────────────────────────\n')

There are no tables detected in this document.

---
## 2. Paystub <a id="step2"></a>


In [None]:
documentName = "docs/Paystub.jpg"
display(Image(filename=documentName, width=500))

In [None]:
response_paystub = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/Paystub.jpg', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_paystub = Document(response_paystub)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_paystub.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_paystub.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

### Get Table info from the response

In [None]:
num_tables=1
for page in doc_paystub.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_paystub, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---
## 3. W2 Form <a id="step3"></a>

In [None]:
documentName = "docs/W2.jpg"
display(Image(filename=documentName, width=500))

In [None]:
response_w2 = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/W2.jpg', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_w2 = Document(response_w2)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_w2.pages:    
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_w2.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

---
## 4. Bank Statement <a id="step4"></a>

In [None]:
documentName = "docs/Bank-Statement.jpg"
display(Image(filename=documentName, width=500))

In [None]:
response_bank_stmt = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/Bank-Statement.jpg', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_bank_stmt = Document(response_bank_stmt)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_bank_stmt.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_bank_stmt.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

In [None]:
num_tables=1
for page in doc_bank_stmt.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_bank_stmt, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---

## 5. Credit Card Statement <a id="step5"></a>

In [None]:
# Document
documentName = "docs/credit-card-stmt.pdf"
display(IFrame(documentName, 500, 600));

In [None]:
response_cc = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/credit-card-stmt.pdf', 
                            features=[Textract_Features.TABLES, Textract_Features.FORMS])

Let's look at the raw JSON response returned by Amazon Textract.

In [None]:
print(json.dumps(response_cc, indent=4))

You can parse this raw JSON using simple logic. However, to make it easier to get information out of the JSON response, we will use the Textract response parser library. The library parses the JSON and provides programming-language-specific constructs for working with different parts of the document. For more details, please refer to the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python). We will use its `Document` wrapper, which makes it easy to write logic against the response.

In [None]:
doc_cc = Document(response_cc)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
doc_cc = Document(response_cc)

# Iterate over elements in the document
for page in doc_cc.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_cc.pages:
    # Collect fields
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

### Get Tables from the response

We will now get all the table data that appears in the document. We will first see what the raw data in each cell of every table looks like.

In [None]:
num_tables=1
for page in doc_cc.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print("-------------------")
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

### Tables with Confidence Score

In [None]:
# Print every cell with its confidence score, across all pages
num_tables=1
for page in doc_cc.pages:
    for table in page.tables:
        print(f"Table {num_tables}")
        print("-------------------")
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] :  Text: {cell.text} ,  Confidence: {round(cell.confidence,2)}%")
        print('\n')

### Printing the Table using Prettyprinter library

Using the amazon-textract-textractor library's [prettyprinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) tool, we can get the table data in various formats that are easy to read and consume, such as CSV and grid.

In [None]:
print(get_string(textract_json=response_cc, 
                 table_format=Pretty_Print_Table_Format.grid, 
                 output_type=[Textract_Pretty_Print.TABLES]))

The code above displays all the tables that exist in that document. However, we are only interested in one particular table, i.e. the table that contains all the credit card transactions. Here's a way to specifically pick that table and display it in a Pandas dataframe format. We will use a utility function called `convert_table_to_list` for this.

In [None]:
from textractprettyprinter.t_pretty_print import convert_table_to_list

dfs = list()

for page in doc_cc.pages:
    for table in page.tables:
        dfs.append(pd.DataFrame(convert_table_to_list(trp_table=table)))
        
# Only the third table i.e. the table at index 2 considering index starts at 0
cc_transactions = dfs[2]
# Make the first row as column headers for the dataframe
cc_transactions.columns = cc_transactions.iloc[0]
#drop the first row since it's the column header
cc_transactions = cc_transactions.drop(cc_transactions.index[0]) 

#display the dataframe
cc_transactions
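
Picking the table by a fixed index works for this sample, but it breaks if Amazon Textract detects a different number of tables in another statement. The following is a minimal sketch of an alternative that selects a table by its header row. It assumes each table has already been converted to a list of rows of strings (the shape the cell above passes to `pd.DataFrame`). The helper name `find_table_by_headers`, the header names, and the sample rows are hypothetical.

```python
from typing import List, Optional, Sequence

Table = List[List[str]]


def find_table_by_headers(tables: Sequence[Table], required: Sequence[str]) -> Optional[Table]:
    """Return the first table whose first row contains every required header."""
    wanted = {header.strip().lower() for header in required}
    for rows in tables:
        if not rows:
            continue
        # Compare case-insensitively and ignore surrounding whitespace
        header_row = {cell.strip().lower() for cell in rows[0]}
        if wanted <= header_row:
            return rows
    return None


# Hypothetical tables: a summary table followed by a transactions table
sample_tables = [
    [["Previous Balance", "$1,200.00"], ["Payments", "$200.00"]],
    [["Date ", "Description ", "Amount "],
     ["01/02 ", "Coffee Shop ", "4.50 "],
     ["01/03 ", "Grocery ", "82.10 "]],
]

transactions = find_table_by_headers(sample_tables, ["date", "description", "amount"])

# Promote the first row to column names and turn the rest into records
headers = [cell.strip() for cell in transactions[0]]
records = [dict(zip(headers, (cell.strip() for cell in row))) for row in transactions[1:]]
print(records)
```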

---
## 6. Mortgage Note - _Amazon Textract Queries example_ <a id="step6"></a>

In [None]:
documentName = "docs/Mortgage-Note.pdf"
display(IFrame(documentName, 500, 600));

As we can see, the mortgage note is a document containing dense text. In this case, we are interested in only a few key pieces of information from the entire document. Instead of extracting all the text and then applying logic (for example, regular expressions) to find that information, we will use the Amazon Textract Queries feature to retrieve it directly from the document.

Specifically, we are looking for:

1. The lender name
2. The principal amount the borrower has to pay
3. The yearly interest rate
4. The monthly payment amount

We will write the questions in plain English and pass them in the API call. The queries are:

1. Who is the Lender?
2. What is the principal amount borrower has to pay?
3. What is the yearly interest rate?
4. What is the monthly payment amount?

All of this information appears on the first page of this multi-page document, so Amazon Textract does not need to search every page. We will pass the page number when making the API call. Note: if the page number is not known, the `pages` parameter can be set to the wildcard value `"*"`, in which case Amazon Textract looks for answers on all pages of the document. Page ranges are also supported: `"3-*"` searches from page 3 to the last page, and `"3-6"` searches pages 3 through 6.
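
For readers who call the Amazon Textract API through Boto3 directly instead of the caller library, the same queries can be expressed as the `QueriesConfig` request structure. The following is a minimal sketch that only builds that structure and makes no API call. The helper name `build_queries_config` is hypothetical, and the result is assumed to be passed as the `QueriesConfig` argument together with `FeatureTypes=["QUERIES"]`.

```python
import json
from typing import Dict, List, Optional


def build_queries_config(questions: Dict[str, str], pages: Optional[List[str]] = None) -> dict:
    """Build a QueriesConfig structure from a mapping of alias -> question text."""
    queries = []
    for alias, text in questions.items():
        query = {"Text": text, "Alias": alias}
        if pages:
            # Page expressions are strings, for example "1", "*", "3-*" or "3-6"
            query["Pages"] = pages
        queries.append(query)
    return {"Queries": queries}


config = build_queries_config(
    {
        "LENDER_NAME": "Who is the Lender?",
        "INTEREST_RATE": "What is the yearly interest rate?",
    },
    pages=["1"],
)
print(json.dumps(config, indent=2))
```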

In [None]:
from textractcaller import QueriesConfig, Query

# Setup the queries
query1 = Query(text="Who is the Lender?" , alias="LENDER_NAME", pages=["1"])
query2 = Query(text="What is the principal amount borrower has to pay?", alias="PRINCIPAL_AMOUNT", pages=["1"])
query3 = Query(text="What is the yearly interest rate?", alias="INTEREST_RATE", pages=["1"])
query4 = Query(text="What is the monthly payment amount?", alias="MONTHLY_AMOUNT", pages=["1"])

#Setup the query config with the above queries
queries_config = QueriesConfig(queries=[query1, query2, query3, query4])

response_mortgage_note = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/Mortgage-Note.pdf',
                          features=[Textract_Features.QUERIES],
                          queries_config=queries_config)
doc_mortgage_note = Document(response_mortgage_note)

In [None]:
import trp.trp2 as t2
# TDocumentSchema().load() returns a TDocument
doc_mortgage_note: t2.TDocument = t2.TDocumentSchema().load(response_mortgage_note)

entities = {}
for page in doc_mortgage_note.pages:
    query_answers = doc_mortgage_note.get_query_answers(page=page)
    if query_answers:
        # Each answer is a list: [query text, alias, answer text]
        for answer in query_answers:
            entities[answer[1]] = answer[2]
            
display(JSON(entities, root='Query Answers'))
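
The parser library returns only the answer text. If the confidence score of each answer is also needed, it can be read from the raw response, where each `QUERY` block points to its `QUERY_RESULT` block through an `ANSWER` relationship. The following is a minimal sketch of that walk over a list of blocks. The helper name `query_answers_with_confidence` and the sample blocks (including the lender name and score) are hypothetical.

```python
from typing import Dict, List


def query_answers_with_confidence(blocks: List[dict]) -> Dict[str, dict]:
    """Map each query alias to its answer text and confidence score."""
    by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, dict] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        # Fall back to the question text when no alias was supplied
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = by_id.get(answer_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answers[alias] = {
                        "text": result.get("Text", ""),
                        "confidence": result.get("Confidence"),
                    }
    return answers


# Hypothetical blocks: one answered query and one query without an answer
sample_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "Who is the Lender?", "Alias": "LENDER_NAME"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "Example Bank", "Confidence": 97.0},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the yearly interest rate?", "Alias": "INTEREST_RATE"}},
]

answers = query_answers_with_confidence(sample_blocks)
print(answers)
```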

---
## 7. Passport (ID Document) <a id="step7"></a>

A passport is a special kind of document: an identity document. To extract information from US passports and driver's licenses, Amazon Textract's [AnalyzeID](https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-identity.html) API can be used.

In [None]:
documentName = "docs/Passport.pdf"
display(IFrame(documentName, 500, 600));

We will use the `call_textract_analyzeid` tool from the amazon-textract-textractor library.

In [None]:
from textractcaller import call_textract_analyzeid
import trp.trp2_analyzeid as t2id

response_passport = call_textract_analyzeid(document_pages=[f's3://{data_bucket}/idp-mortgage/textract/Passport.pdf'])
doc_passport: t2id.TAnalyzeIdDocument = t2id.TAnalyzeIdDocumentSchema().load(response_passport)

In [None]:
for id_docs in response_passport['IdentityDocuments']:
    id_doc_kvs={}
    for field in id_docs['IdentityDocumentFields']:
        id_doc_kvs[field['Type']['Text']] = field['ValueDetection']['Text']

display(JSON(id_doc_kvs, root='ID Document Key-values', expanded=True))
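
Each field in the AnalyzeID response also carries a confidence score under `ValueDetection`, which the cell above does not use. The following is a minimal sketch of splitting fields into accepted values and values that may need manual review. The helper name `split_id_fields`, the 90.0 threshold, and the sample response values are assumptions for illustration only.

```python
from typing import Dict, Tuple


def split_id_fields(response: dict, min_confidence: float = 90.0) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split identity document fields by the confidence of their detected value."""
    accepted: Dict[str, str] = {}
    needs_review: Dict[str, str] = {}
    for id_doc in response.get("IdentityDocuments", []):
        for field in id_doc.get("IdentityDocumentFields", []):
            name = field["Type"]["Text"]
            detection = field["ValueDetection"]
            # Treat a missing score as zero so the field is sent to review
            confident = detection.get("Confidence", 0.0) >= min_confidence
            target = accepted if confident else needs_review
            target[name] = detection.get("Text", "")
    return accepted, needs_review


# Hypothetical response fragment in the AnalyzeID response shape
sample_response = {
    "IdentityDocuments": [
        {
            "DocumentIndex": 1,
            "IdentityDocumentFields": [
                {"Type": {"Text": "FIRST_NAME"},
                 "ValueDetection": {"Text": "JANE", "Confidence": 99.1}},
                {"Type": {"Text": "DATE_OF_BIRTH"},
                 "ValueDetection": {"Text": "01/01/1980", "Confidence": 72.4}},
            ],
        }
    ]
}

accepted, needs_review = split_id_fields(sample_response)
print("Accepted:", accepted)
print("Needs review:", needs_review)
```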

---

## 8. Extracting 1099-INT Form <a id="step8"></a>

Let's now extract data from the 1099-INT form. Form 1099-INT is a tax form issued by interest-paying entities, such as banks, investment firms, and other financial institutions, to taxpayers who receive interest income of $10 or more. The information recorded on the form must be reported to the IRS. Property buyers and mortgage applicants often include the 1099-INT form in a mortgage application packet to validate the sources of income reflected in their bank statements.

In [None]:
documentName = "docs/1099-INT-2018.pdf"
display(IFrame(documentName, 500, 600));

In [None]:
response_1099_int = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/1099-INT-2018.pdf', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_1099_int = Document(response_1099_int)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_1099_int.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
            
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_1099_int.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

In [None]:
num_tables=1
for page in doc_1099_int.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print("-------------------")
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_1099_int, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---
## 9. Extracting 1099-DIV Form <a id="step9"></a>

In [None]:
documentName = "docs/1099-DIV.jpg"
display(Image(filename=documentName, width=500))

In [None]:
response_1099_div = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/1099-DIV.jpg', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_1099_div = Document(response_1099_div)

In [None]:
# Iterate over elements in the document
for page in doc_1099_div.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_1099_div.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

In [None]:
num_tables=1
for page in doc_1099_div.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────────────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_1099_div, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---
## 10. Extracting 1099-MISC Form <a id="step10"></a>

In [None]:
documentName = "docs/1099-MISC-2021.pdf"
display(IFrame(documentName, 500, 600));

In [None]:
response_1099_misc = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/1099-MISC-2021.pdf', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_1099_misc = Document(response_1099_misc)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_1099_misc.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_1099_misc.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

In [None]:
num_tables=1
for page in doc_1099_misc.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_1099_misc, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---
## 11. Extracting 1099-R Form <a id="step11"></a>

In [None]:
documentName = "docs/1099-R.jpg"
display(Image(filename=documentName, width=500))

In [None]:
response_1099_r = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/1099-R.jpg', 
                                 features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_1099_r = Document(response_1099_r)

In [None]:
# Iterate over elements in the document
for page in doc_1099_r.pages:
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_1099_r.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

In [None]:
num_tables=1
for page in doc_1099_r.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

There are no tables detected in this document.

---
## 12. Employment Verification Form <a id="step12"></a>

In [None]:
documentName = "docs/Employment_Verification.png"
display(Image(filename=documentName, width=500))

In [None]:
response_emp_ver = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/Employment_Verification.png', 
                            features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_emp_ver = Document(response_emp_ver)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_emp_ver.pages:    
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_emp_ver.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

### Get Table info from the response

In [None]:
num_tables=1
for page in doc_emp_ver.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_emp_ver, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---

## 13. Mortgage Statement <a id="step13"></a>

In [None]:
# Document
documentName = "docs/Mortgage_Statement.pdf"
display(IFrame(documentName, 500, 600));

In [None]:
response_mtgg_stmt = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/Mortgage_Statement.pdf', 
                            features=[Textract_Features.TABLES, Textract_Features.FORMS])
doc_mtgg_stmt = Document(response_mtgg_stmt)

### Printing the content of the doc in Line and Word Format

In this section, we will extract the lines and words that appear in the document.

In [None]:
# Iterate over elements in the document
for page in doc_mtgg_stmt.pages:    
    # Print lines and words
    for line in page.lines:
        print(f"Line├── {line.text}")
        for word in line.words:
            print(f"\tWord└── {word.text}")
        print('\n────────────────────────────\n')

### Get form info (key-value) pairs from the response

Note that the output in the previous section is plain text and does not indicate whether the text appears in a form or a table. In this section, we will get the form data from the document as key-value pairs.

In [None]:
# Initialize once, outside the loop, so fields from every page are kept
forms=[]
for page in doc_mtgg_stmt.pages:
    for field in page.form.fields:
        obj={}
        obj[f'{field.key}']=f'{field.value}'
        forms.append(obj)

display(JSON(forms, root='Form Key-values', expanded=True))

### Get Table info from the response

In [None]:
num_tables=1
for page in doc_mtgg_stmt.pages:
     # Print tables
    for table in page.tables:
        print(f"Table {num_tables}")
        print('────────────────────────────')
        num_tables=num_tables+1
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Cell[{r}][{c}] = {cell.text}")
        print('\n')

In [None]:
print(get_string(textract_json=response_mtgg_stmt, table_format=Pretty_Print_Table_Format.grid, output_type=[Textract_Pretty_Print.TABLES]))

---
# Conclusion

In this notebook, we saw how to extract FORMS, TABLES, and TEXT lines from various documents that may be present in a mortgage packet. We also used the Amazon Textract AnalyzeID API to extract information from a passport. Finally, we used Queries to retrieve specific information from a document consisting of dense text and got accurate responses back from the API. In the next notebook, we will perform enrichment on one of the documents.