# Rule-based document validation using RAG

This notebook was tested on a SageMaker Studio Notebook with the `Data Science 3.0` kernel and an `ml.t3.xlarge` instance.


This lab assumes you have completed the previous lab, in which an Amazon Bedrock knowledge base was created and data was ingested.

One common task in intelligent document processing (IDP) is validating that a document meets certain rules, such as:

* Does the document contain all of the information that should be present?

* Are the values presented in the document logical, and do they match a specific format?

* Does the extracted data make sense in the current document and its context?



In [None]:
%store -r

In [30]:
import boto3
import pprint
import os
from botocore.client import Config
from bedrockhelper import interactive_sleep



pp = pprint.PrettyPrinter(indent=2)

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')

boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

model_id = "anthropic.claude-3-haiku-20240307-v1:0" # Try both Claude 3 Haiku and Claude 3 Sonnet. For Claude 3 Sonnet use "anthropic.claude-3-sonnet-20240229-v1:0"
region_id = region_name # Replace with the Region where your SageMaker notebook is running

#### Upload rules information into the RAG

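First, we upload all of our rules to the S3 bucket that was configured as the data source for the knowledge base, so that they can be ingested into the RAG.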

In [20]:
data_root = '../samples/rules/'

In [31]:
# Upload data to S3, to the bucket that was configured as a data source for the knowledge base
s3_client = boto3.client("s3")

def uploadDirectory(path, bucket_name):
    # Walk the directory tree and upload every file under the 'idp-test/rules/' prefix
    for root, dirs, files in os.walk(path):
        for file in files:
            s3_client.upload_file(os.path.join(root, file), bucket_name, 'idp-test/rules/' + file)

# bucket_name is restored from the previous lab by %store -r
uploadDirectory(data_root, bucket_name)
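
Note that the upload above uses only the file name as the S3 key, so two rule files with the same name in different sub-directories would overwrite each other. The following is a minimal sketch of one way to keep the sub-directory structure in the key. The helper name `build_upload_plan` is hypothetical and not part of the lab code; it only computes the keys and does not call S3.

```python
import os

def build_upload_plan(path, prefix="idp-test/rules/"):
    """Return (local_path, s3_key) pairs that keep sub-directories in the key.

    Hypothetical helper for illustration; it performs no upload itself.
    """
    plan = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            # Path relative to the walked root, with forward slashes for S3
            rel = os.path.relpath(local_path, path).replace(os.sep, "/")
            plan.append((local_path, prefix + rel))
    return plan
```

Each pair could then be passed to `s3_client.upload_file(local_path, bucket_name, s3_key)`.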

In [None]:
# Start an ingestion job
interactive_sleep(30)
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb_id, dataSourceId = ds["dataSourceId"])

job = start_job_response["ingestionJob"]
pp.pprint(job)

In [None]:
# Poll the ingestion job until it reaches a terminal state
while job['status'] not in ('COMPLETE', 'FAILED', 'STOPPED'):
    interactive_sleep(30)
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=ds["dataSourceId"],
        ingestionJobId=job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]

pp.pprint(job)
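
Polling for the ingestion job can also be wrapped in a small helper that stops on any terminal status and gives up after a bounded number of attempts. This is a sketch under assumptions: the function name `wait_for_ingestion`, the `max_polls` limit, and the injectable `sleep` argument are illustrative and not part of the lab code. The status values are assumed to include `COMPLETE`, `FAILED`, and `STOPPED` as terminal states.

```python
import time

# Statuses after which polling should stop (assumed terminal states)
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}

def wait_for_ingestion(client, kb_id, ds_id, job_id,
                       poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll get_ingestion_job until the job finishes; raise if it does not complete."""
    for _ in range(max_polls):
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATES:
            if job["status"] != "COMPLETE":
                raise RuntimeError(f"Ingestion job ended with status {job['status']}")
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Ingestion job {job_id} did not finish after {max_polls} polls")
```

In this notebook it could be called as `wait_for_ingestion(bedrock_agent_client, kb_id, ds["dataSourceId"], job["ingestionJobId"])`.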

### Classification

Before we do any rule-based matching, let's first classify the document into a specific class so we can decide which rules apply to it.

In [57]:
from rhubarb import DocAnalysis, SystemPrompts, LanguageModels

da = DocAnalysis(file_path="../samples/bank_statement.jpg", 
                 boto3_session=boto3_session,
                 pages = [1],
                 system_prompt=SystemPrompts().ClassificationSysPrompt)
resp = da.run(message="""Given the document, classify the pages into the following classes
                        <classes>
                        DRIVERS_LICENSE  # a driver's license
                        INSURANCE_ID     # a medical insurance ID card
                        RECEIPT          # a store receipt
                        BANK_STATEMENT   # a bank statement
                        W2               # a W2 tax document
                        MOM              # a minutes of meeting or meeting notes
                        </classes>""")
resp['output']

[{'page': 1, 'class': 'BANK_STATEMENT'}]
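
Because the class label drives which rules are retrieved later, it can help to check that the model returned one of the expected labels before continuing. The sketch below assumes the output shape shown above (a list of `{'page': ..., 'class': ...}` dictionaries). The helper name `pick_document_class` is hypothetical.

```python
# The labels offered to the model in the classification prompt
ALLOWED_CLASSES = {
    "DRIVERS_LICENSE", "INSURANCE_ID", "RECEIPT",
    "BANK_STATEMENT", "W2", "MOM",
}

def pick_document_class(output, page=1):
    """Return the class label for the given page, or raise if it is missing or unexpected."""
    for entry in output:
        if entry.get("page") == page:
            label = entry.get("class")
            if label not in ALLOWED_CLASSES:
                raise ValueError(f"Unexpected class label: {label!r}")
            return label
    raise ValueError(f"No classification found for page {page}")
```

With the output above, `pick_document_class(resp['output'])` would return `'BANK_STATEMENT'`.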

Because we need to apply custom logic and prompts on top of the RAG, we use the **retrieve** API, which returns only the relevant chunks without calling an LLM to generate a response.

In [50]:
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config)

def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_runtime.retrieve(
        retrievalQuery= {
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults,
                'overrideSearchType': "HYBRID", # optional
            }
        }
    )

In [51]:
doc_cls = resp['output'][0]['class']
query = f"What are the rules for this document type: {doc_cls} "
query 

'What are the rules for this document type: BANK_STATEMENT '

In [None]:
response = retrieve(query, kb_id, 5)

retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)
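
Each entry in `retrievalResults` carries the chunk text under `content.text` along with a relevance `score` and location metadata. If you prefer to pass only the chunk text to the LLM instead of the raw list, a formatting step like the following could be used. This is a sketch: the helper name `format_rule_chunks`, the `min_score` threshold, and the `<rule_chunk>` tag are illustrative choices, not part of the lab code.

```python
def format_rule_chunks(retrieval_results, min_score=0.0):
    """Join the text of retrieved chunks into one string, dropping low-scoring chunks."""
    blocks = []
    for result in retrieval_results:
        score = result.get("score", 0.0)
        if score < min_score:
            continue  # skip chunks below the (assumed) relevance threshold
        text = result["content"]["text"].strip()
        index = len(blocks) + 1
        blocks.append(
            f'<rule_chunk index="{index}" score="{score:.2f}">\n{text}\n</rule_chunk>'
        )
    return "\n\n".join(blocks)
```

The returned string could then be interpolated into a prompt in place of the raw result list.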

This prompt template combines the document class and the retrieved chunks into a single prompt. The document itself is supplied separately, through the `file_path` passed to `DocAnalysis`.

In [70]:
prompt_template = f"""


Human: I need you to analyze a document for compliance with specific rules. Here's the context:

Document Type: {doc_cls}

Retrieved Rules:

{retrievalResults}

Based on the list of rules, pick out the rules that apply to the specific document type and only use those

Document Content:


Please perform the following tasks:
1. Analyze each rule and how it applies to the document.
2. For each rule, determine if the document complies or not.
3. Provide a brief explanation for each compliance or non-compliance finding.
4. Summarize the overall compliance status of the document.
5. If there are any areas of ambiguity or where the rules might be interpreted in multiple ways, highlight these.

Present your analysis in a structured format, using markdown for clarity where appropriate.

1. Compliance Determination

2. Explanations for Findings

3. Overall Compliance Summary

4. Areas of Ambiguity

"""

Using the schema, we force the LLM to return the results of the compliance check as JSON with a fixed structure, which downstream systems can then use to further process this document.

In [75]:
schema = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "Rule": {
        "type": "string"
      },
      "Compliance": {
        "type": "boolean"
      }
    },
    "required": ["Rule", "Compliance"]
  }
}
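
Before handing the result to a downstream system, it can be worth confirming that the response really has the shape the schema describes. The following is a minimal standard-library sketch that checks only this particular schema (a list of objects with a string `Rule` and a boolean `Compliance`). The function name `conforms_to_rule_schema` is hypothetical, and a general-purpose JSON Schema validator would be the more complete choice.

```python
def conforms_to_rule_schema(output):
    """Return True if output is a list of dicts with a str 'Rule' and a bool 'Compliance'."""
    if not isinstance(output, list):
        return False
    for item in output:
        if not isinstance(item, dict):
            return False
        # Both keys are required; .get() yields None when a key is missing
        if not isinstance(item.get("Rule"), str):
            return False
        if not isinstance(item.get("Compliance"), bool):
            return False
    return True
```

Extra keys are accepted, since the schema does not forbid additional properties.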

In [78]:

da = DocAnalysis(file_path="../samples/bank_statement.jpg",
                 boto3_session=boto3_session,
                 modelId=LanguageModels.CLAUDE_HAIKU_V1,
                 max_tokens=4000)
resp = da.run(message=prompt_template, output_schema=schema)


In [81]:
resp['output']

[{'Rule': 'Document Structure and Formatting', 'Compliance': True},
 {'Rule': 'Personal Information Validation', 'Compliance': True},
 {'Rule': 'Date and Period Validation', 'Compliance': True},
 {'Rule': 'Account Balance Validation', 'Compliance': True},
 {'Rule': 'Account Valuation Table Validation', 'Compliance': False},
 {'Rule': 'Insurance Details Validation', 'Compliance': False},
 {'Rule': 'Numerical and Currency Formatting', 'Compliance': True},
 {'Rule': 'Logical Consistency Checks', 'Compliance': True},
 {'Rule': 'Completeness Checks', 'Compliance': True},
 {'Rule': 'Security and Privacy', 'Compliance': False},
 {'Rule': 'Additional Information', 'Compliance': False}]
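
Because the output follows a fixed structure, a downstream step can reduce it to an overall verdict. The following sketch assumes the list-of-dictionaries shape shown above. The helper name `summarize_compliance` and the keys of the returned dictionary are illustrative.

```python
def summarize_compliance(results):
    """Collect the failed rules and derive an overall pass/fail flag."""
    failed = [r["Rule"] for r in results if not r["Compliance"]]
    return {
        "compliant": not failed,              # True only when no rule failed
        "passed": len(results) - len(failed),
        "failed_rules": failed,
    }
```

For the output above, `summarize_compliance(resp['output'])` would report 7 passed rules and 4 failed rules, so the document would be flagged as non-compliant overall.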