# Module 4 - Table Self-query Q&A
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the "Data Science 3.0" image.
</div>

In this notebook we will walk through how to perform _"self-querying"_ on table data found in documents. First, we will extract the tables from a document with Amazon Textract's `AnalyzeDocument` API. Next, we will transform the table data and store it in a vector database in a very specific structure. Finally, we will run self-queries on the table data with an Anthropic Claude model via Amazon Bedrock to get precise answers from the model. We will use the open-source [ChromaDB](https://github.com/chroma-core/chroma) as our in-memory vector database.

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You can ignore any WARNINGS during the `pip installs`.
</div>

In [None]:
!pip install -U chromadb lark

In [None]:
import json
import os
import sys
import sagemaker
import boto3

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
bedrock = boto3.client('bedrock-runtime')
br = boto3.client('bedrock')
s3 = boto3.client("s3")
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}")

## Extract table data from the document using Amazon Textract
---

For this module, we will use a sample bank statement document (`./sample-docs/bank_statement.pdf`) that contains table data. We will use the `call_textract` helper from the `textractcaller` package to call the `AnalyzeDocument` API with the `TABLES` feature, and the `textractprettyprinter` package to print the extracted tables. Once the tables are extracted, we will parse out the rows and cells of the first table.

In [None]:
from IPython.display import IFrame
IFrame("./sample-docs/bank_statement.pdf", width=600, height=800)

In [None]:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string

textract_json = call_textract(input_document=f"s3://{data_bucket}/input/bank_statement.pdf", features=[Textract_Features.TABLES])

print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.tsv,
               output_type=[Textract_Pretty_Print.TABLES]))

Textract has extracted two distinct tables from the document. In this walkthrough we will take the first table and run Q&A on it with LangChain's `SelfQueryRetriever`, which is well suited to questions over table data. As of this writing, FAISS is not supported for self-querying with LangChain, so we will use ChromaDB. For more information, refer to the [Self-query LangChain documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/).

## Transform the extracted table data
---

Self-querying needs the table data to be formatted in a very specific way using LangChain's `Document` model. For example, the table below maps to the list of `Document` objects that follows it.

Table
<table>
    <tr>
        <th>year</th>
        <th>director</th>
        <th>rating</th>
        <th>movie</th>
        <th>actor</th>
    </tr>
    <tr>
        <td>2010</td>
        <td>Christopher Nolan</td>
        <td>8.2</td>
        <td>Inception</td>
        <td>Leo DiCaprio</td>
    </tr>
    <tr>
        <td>2006</td>
        <td>Satoshi Kon</td>
        <td>8.6</td>
        <td>Paprika</td>
        <td>Megumi Hayashibara</td>
    </tr>
</table>

```python
docs = [
    Document(
        page_content="2010, Christopher Nolan, 8.2, Inception, Leo DiCaprio",
        metadata={"year": 2010, 
                  "director": "Christopher Nolan", 
                  "rating": 8.2, 
                  "movie": "Inception", 
                  "actor": "Leo DiCaprio"},
    ),
    Document(
        page_content="2006, Satoshi Kon, 8.6, Paprika, Megumi Hayashibara",
        metadata={"year": 2006, 
                  "director": "Satoshi Kon", 
                  "rating": 8.6,
                  "movie": "Paprika",
                  "actor": "Megumi Hayashibara"},
    )
    ...
]
```

In this structure, each table row is stored as a CSV string in the `page_content` field of a `Document`. The `metadata` field holds key-value pairs that map each table header to the cell value in that row. The two documents shown correspond to the two rows of the example table at the top of this section.
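
The mapping from a table row to those two fields can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual transformation: plain dicts stand in for LangChain's `Document`, and `row_to_record` is a hypothetical helper name.

```python
import csv
from io import StringIO

def row_to_record(headers, row):
    """Build a dict with the same two fields a LangChain Document carries."""
    buffer = StringIO()
    csv.writer(buffer).writerow(row)
    return {
        # The whole row as one CSV line
        "page_content": buffer.getvalue().strip(),
        # Header -> cell value, keeping the original Python types
        "metadata": dict(zip(headers, row)),
    }

headers = ["year", "director", "rating", "movie", "actor"]
table_rows = [
    [2010, "Christopher Nolan", 8.2, "Inception", "Leo DiCaprio"],
    [2006, "Satoshi Kon", 8.6, "Paprika", "Megumi Hayashibara"],
]

records = [row_to_record(headers, row) for row in table_rows]
print(records[0]["page_content"])
print(records[0]["metadata"])
```

Note that `csv.writer` emits no space after each comma, so the string differs slightly from the hand-written `page_content` values shown earlier.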

We will transform the first table in the document using the same schema. We will do this by accessing the individual row/col data available in the Textract output using Textract response parser utility tool. Note that our table contains numbers and as such for self-query to work we need to convert numbers into int or float type appropriately as well.

In [None]:
import csv
from io import StringIO
from trp import Document as TDoc
from langchain.schema import Document

doc = TDoc(textract_json)
rows = []

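def detect_type(s):
    """Return s as an int or float when it parses as one, else as a string with commas removed."""
    if s is None:
        return s
    elif not isinstance(s, str):
        s = str(s)
    # Drop thousands separators so values like "1,234.56" parse as numbers
    s = s.replace(',', '')
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return s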

# Extract the first table data
for page in doc.pages:
    if page.tables:
        for row in page.tables[0].rows:
            cells = [detect_type(cell.text.strip()) for cell in row.cells]
            rows.append(cells)

headers = rows[0]
headers = [ f"Transaction_{x}" for x in headers ]
full_table = []

for row in rows[1:]:
    output = StringIO()
    csv_writer = csv.writer(output)
    csv_writer.writerow(row)
    csv_string = output.getvalue()
    row_meta = {headers[i].replace('($)','').strip(): detect_type(cell) for i, cell in enumerate(row)}
    full_table.append(Document(page_content=csv_string.strip(), metadata=row_meta))

full_table
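
The reason for converting cell text to numbers can be seen in a small standalone sketch. The amount below is a made-up value, and `parse_amount` is a hypothetical helper that mirrors the idea behind `detect_type`. Comparing the raw string against a threshold compares characters, not magnitudes.

```python
def parse_amount(text):
    """Strip thousands separators and parse as int, falling back to float."""
    cleaned = text.replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)

raw_withdrawal = "1,200.00"  # hypothetical cell text

# Lexicographic comparison: "1" sorts before "3", so this is False
string_comparison = raw_withdrawal > "300"

# Numeric comparison after conversion gives the intended answer
numeric_comparison = parse_amount(raw_withdrawal) > 300

print(string_comparison, numeric_comparison)
```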



## Store the table in Vector DB
---

We will now store this into our vector database by first generating embeddings. 

In [None]:
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import Chroma

# Ensure that you have enabled amazon.titan-embed-text-v1 model in Amazon Bedrock console
embeddings = BedrockEmbeddings(client=bedrock,model_id="amazon.titan-embed-text-v1")
vector_db = Chroma.from_documents(documents=full_table,embedding=embeddings)
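
As a rough intuition for what the embeddings are used for, the sketch below ranks two stored items against a query by cosine similarity. The three-dimensional vectors are toy values written by hand, not output from the Titan model, and cosine similarity is only one common similarity measure, not necessarily the one this ChromaDB collection is configured to use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "index": text -> hand-made vector (assumption, for illustration only)
toy_index = {
    "row about rent payment": [0.9, 0.1, 0.0],
    "row about salary deposit": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]

best_match = max(
    toy_index,
    key=lambda text: cosine_similarity(toy_index[text], query_vector),
)
print(best_match)
```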

In [None]:
# CAUTION: If you execute this cell then the ChromaDB collection created in previous step will be deleted

# vector_db.delete_collection()

## Self Query Retriever with Amazon Bedrock and Anthropic Claude
---

We will now create a self-query retriever, much like the retriever we used in the _"In-context QA"_ notebook. This time, however, the retriever needs more than the vector database. In addition to the special structure we built from the table data in the previous code cell (`full_table`), we need to describe each column using LangChain's `AttributeInfo` model. This helps the LLM understand what each column header means.

In [None]:
from langchain.llms import Bedrock
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(
        name="Transaction_Date",
        description="Date of the bank transaction",
        type="string",
    ),
    AttributeInfo(
        name="Transaction_Description",
        description="Description of the bank transaction",
        type="string",
    ),
    AttributeInfo(
        name="Transaction_Deposits",
        description="The dollar amount deposited into the bank account",
        type="integer",
    ),
    AttributeInfo(
        name="Transaction_Withdrawals",
        description="The dollar amount withdrawn from the bank account",
        type="integer",
    ),
    AttributeInfo(
        name="Transaction_Amount",
        description="The total dollar amount balance in the bank account",
        type="integer",
    )
]
document_content_description = "A transaction in a bank statement"

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-instant-v1", model_kwargs={'temperature':0})
retriever = SelfQueryRetriever.from_llm(
    bedrock_llm, vector_db, document_content_description, metadata_field_info, verbose=True
)
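
Because the generated filter refers to metadata keys by name, the names declared in `metadata_field_info` need to match the keys stored with each document. A minimal sanity check might look like the sketch below. It uses plain lists and dicts with made-up values, and `compare_names` is a hypothetical helper, not part of LangChain.

```python
def compare_names(declared_names, metadata_rows):
    """Return (declared but never stored, stored but never declared), both sorted."""
    stored_keys = set()
    for metadata in metadata_rows:
        stored_keys.update(metadata)
    declared = set(declared_names)
    return sorted(declared - stored_keys), sorted(stored_keys - declared)

# Hypothetical example in which one header does not match its declared name
declared_names = ["Transaction_Date", "Transaction_Description", "Transaction_Amount"]
metadata_rows = [
    {
        "Transaction_Date": "01/02",
        "Transaction_Description": "Rent",
        "Transaction_Balance": 5200.0,
    },
]

missing, undeclared = compare_names(declared_names, metadata_rows)
print("Declared but not stored:", missing)
print("Stored but not declared:", undeclared)
```

With the objects from this notebook, the inputs would presumably be `[a.name for a in metadata_field_info]` and `[d.metadata for d in full_table]`.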

We will ask the retriever two questions: which transactions had a withdrawal greater than 300, and which had a balance greater than 5000. The sections below first walk through what the retriever does internally, then run both questions.

### Constructing the prompt

In this step we will build the prompt that will help transform the user's question (query) to a `StructuredQuery`. 

In [None]:
from langchain.chains.query_constructor.base import get_query_constructor_prompt

prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)

Let's print the prompt to see what it looks like. Notice that it is a "few-shot" prompt, which means the LLM is shown a few examples of how to create a query using a given schema.

In [None]:
print(prompt.format(query="dummy question"))

### Constructing the StructuredQuery using the JSON query

In the previous step, we built a prompt that instructs the model to generate a JSON-formatted query using a given schema. In this section we will chain it together with an `output_parser` that parses that JSON and creates a `StructuredQuery`. This `StructuredQuery` will then be used to query the table data that we loaded into the vector database earlier.

In [None]:
from langchain.chains.query_constructor.base import StructuredQueryOutputParser

output_parser = StructuredQueryOutputParser.from_components()

query_constructor = prompt | bedrock_llm | output_parser
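
To make the parsing step more concrete, here is a standalone sketch of the kind of JSON the prompt asks for: a `query` string plus a `filter` expression. The model output below is written by hand for illustration, and the regular expression is a toy stand-in that handles a single numeric comparison only. LangChain's real parser is grammar-based and supports much more.

```python
import json
import re

# Hypothetical model output, written by hand (real output depends on the model)
llm_output = r'{"query": "transactions", "filter": "gt(\"Transaction_Withdrawals\", 300)"}'

parsed = json.loads(llm_output)
filter_expression = parsed["filter"]

# Toy pattern: comparator("attribute", number)
pattern = re.compile(r'(\w+)\("(\w+)",\s*(-?\d+(?:\.\d+)?)\)')
match = pattern.fullmatch(filter_expression)
if match is None:
    raise ValueError(f"Unsupported filter expression: {filter_expression}")

comparator, attribute, raw_value = match.groups()
value = float(raw_value) if "." in raw_value else int(raw_value)

print(comparator, attribute, value)
```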

Let's see what the model generates as a `StructuredQuery` for a few questions.

In [None]:
print(query_constructor.invoke(
    {
        "query": "What are the transactions with withdrawals greater than 300?"
    }
))

In [None]:
print(query_constructor.invoke(
    {
        "query": "What are the transactions with balance greater than 5000?"
    }
))

### Querying the table data using the structured query

In the previous step we got a `StructuredQuery` from our plain-language question. The last step is to hook it up to the vector database so that this `StructuredQuery` can be executed and the relevant data extracted from the table. We will use a `SelfQueryRetriever` along with the ChromaDB instance. Notice that we only need to pass our question, since `SelfQueryRetriever` constructs the `StructuredQuery` internally (i.e. the step we saw above) and then executes the query against the vector database that contains the table data.

In [None]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.retrievers.self_query.chroma import ChromaTranslator

retriever = SelfQueryRetriever(
    query_constructor=query_constructor,
    vectorstore=vector_db,
    structured_query_translator=ChromaTranslator()
)
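
The translation step can be pictured as turning a comparison into a Chroma-style `where` filter such as `{"Transaction_Withdrawals": {"$gt": 300}}` and applying it to each document's metadata. The sketch below imitates that in plain Python under simplifying assumptions: `to_where` and `matches` are hypothetical helpers, the rows are made-up values, and only single-attribute comparisons are handled.

```python
import operator

# Chroma-style operator names mapped to Python comparisons
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
    "$ne": operator.ne,
}

def to_where(comparator, attribute, value):
    """Express one comparison as a Chroma-style where filter."""
    return {attribute: {f"${comparator}": value}}

def matches(metadata, where):
    """True if the metadata satisfies every condition in the filter."""
    for attribute, condition in where.items():
        actual = metadata.get(attribute)
        for op_name, expected in condition.items():
            try:
                if not OPERATORS[op_name](actual, expected):
                    return False
            except TypeError:
                # Empty or missing cells cannot be ordered against a number
                return False
    return True

# Hypothetical rows, not taken from the sample bank statement
rows = [
    {"Transaction_Description": "Rent", "Transaction_Withdrawals": 1200.0},
    {"Transaction_Description": "Coffee", "Transaction_Withdrawals": 4.5},
    {"Transaction_Description": "Payroll", "Transaction_Withdrawals": ""},
]

where = to_where("gt", "Transaction_Withdrawals", 300)
hits = [row["Transaction_Description"] for row in rows if matches(row, where)]
print(where)
print(hits)
```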

Let's invoke the `SelfQueryRetriever` with both our questions.

In [None]:
results = retriever.invoke(
    "What are the transactions with withdrawals greater than 300?"
)

for res in results:
    print(res.page_content)

In [None]:
results = retriever.invoke(
    "What are the transactions with balance greater than 5000?"
)

for res in results:
    print(res.page_content)