### Install the Indexify Extractor SDK and the Indexify Client

In [2]:
%%capture
!pip install indexify-extractor-sdk indexify

We have several PDF and invoice extractors. The one that worked really well for pulling various fields from my HOA receipt was LayoutLMDocumentQA. It can't extract all the values in one shot, but it can answer one question at a time.

First, try the extractor locally to get a feel for it.

Download the extractor:
```bash
indexify-extractor download hub://pdf/layoutlm_document_qa
```

In [8]:
from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("layoutlm_document_qa.document_qa:LayoutLMDocumentQA")
content = Content.from_file("/Users/diptanuc/Downloads/Statement_HOA.pdf")


In [9]:
config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result

[Feature(feature_type='metadata', name='metadata', value={'query': "What's the due date?", 'answer': '5/1/2024', 'page': 0, 'score': 0.9999791383743286}, comment=None)]
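
Since the extractor answers one question per call, pulling several fields locally means looping over the questions. The following is a minimal sketch of that loop, not part of the SDK: `ask_all`, `StubFeature` and `stub_ask` are hypothetical names, and the stub stands in for the real extractor so the block runs without downloading the model. It assumes each call returns a list of objects whose `.value` dict has `answer` and `score` keys, as in the output above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StubFeature:
    """Stand-in for the Feature objects shown above (hypothetical, for illustration)."""
    value: dict


def ask_all(ask: Callable[[str], List[StubFeature]], questions: List[str]) -> Dict[str, str]:
    """Call `ask` once per question and keep the highest-scoring answer for each."""
    answers = {}
    for question in questions:
        features = ask(question)
        if not features:
            continue  # nothing was returned for this question
        best = max(features, key=lambda f: f.value["score"])
        answers[question] = best.value["answer"]
    return answers


def stub_ask(question: str) -> List[StubFeature]:
    """Fake extractor that replays the answer shown in the output above."""
    canned = {
        "What's the due date?": [
            StubFeature({"answer": "5/1/2024", "score": 0.9999791383743286})
        ],
    }
    return canned.get(question, [])


print(ask_all(stub_ask, ["What's the due date?", "Who is the payee?"]))
```

With the real extractor loaded as above, something like `ask_all(lambda q: extractor.extract(content, config_cls(query=q)), questions)` should slot in, assuming the returned features keep the shape shown in the output.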

### Start the Indexify Server

To make this extractor run continuously on new documents:
1. Download the Indexify Server
2. Start it in development mode on your laptop
3. Create extraction policies with questions that extract the fields from the PDF
4. Finally, get all the extracted values for a document by making an API call

##### Download the Server
```bash
curl https://tensorlake.ai | sh
```

In [None]:
!./indexify server -d

### Create the Extraction Policies


In [2]:
from indexify import IndexifyClient
client = IndexifyClient()

In [3]:
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-due-date", input_params={"query": "Whats the due date?"})
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-outstanding", input_params={"query": "Whats the outstanding amount?"})
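
Each field needs its own policy, so with more than a couple of fields it can be easier to keep the policy names and questions in one dict. This is a rough sketch: `policy_kwargs` is a hypothetical helper, and it only builds the keyword arguments, using the same `extractor`, `name` and `input_params` parameters as the calls above, so it runs without a server.

```python
from typing import Dict, List

EXTRACTOR = "tensorlake/layoutlm-document-qa-extractor"


def policy_kwargs(questions: Dict[str, str], extractor: str = EXTRACTOR) -> List[dict]:
    """Turn {policy_name: question} into keyword arguments for add_extraction_policy."""
    return [
        {"extractor": extractor, "name": name, "input_params": {"query": query}}
        for name, query in questions.items()
    ]


# Same two policies as in the cell above.
policies = policy_kwargs({
    "hoa-fees-due-date": "Whats the due date?",
    "hoa-fees-outstanding": "Whats the outstanding amount?",
})
for kwargs in policies:
    print(kwargs)
```

Each entry can then be passed on as `client.add_extraction_policy(**kwargs)`.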

### Upload Files

In [4]:
content_id = client.upload_file("/Users/diptanuc/Downloads/Statement_HOA.pdf")

In [6]:
client.get_structured_data(content_id)

[{'id': '3Ie8VXVxfNTPAL5L',
  'content_id': 'efcf0931508836d3',
  'metadata': {'answer': '5/1/2024',
   'page': 0,
   'query': 'Whats the due date?',
   'score': 0.9999799728393556},
  'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'},
 {'id': 'VmCTqMFR-m7IG0nn',
  'content_id': 'efcf0931508836d3',
  'metadata': {'answer': '$603.03',
   'page': 1,
   'query': 'Whats the outstanding amount?',
   'score': 0.9992976188659668},
  'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'}]
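
The response is one record per policy, so a small amount of post-processing turns it into a single field lookup. A minimal sketch, assuming the record shape shown above; `fields_by_query` is a hypothetical helper and the `min_score` default of 0.5 is an arbitrary cutoff chosen for illustration.

```python
from typing import Dict, List


def fields_by_query(records: List[dict], min_score: float = 0.5) -> Dict[str, str]:
    """Map each query to its answer, skipping records scored below `min_score`."""
    fields = {}
    for record in records:
        meta = record["metadata"]
        if meta["score"] >= min_score:
            fields[meta["query"]] = meta["answer"]
    return fields


# Records copied from the output above (only the keys the helper reads).
records = [
    {"metadata": {"answer": "5/1/2024", "page": 0,
                  "query": "Whats the due date?", "score": 0.9999799728393556}},
    {"metadata": {"answer": "$603.03", "page": 1,
                  "query": "Whats the outstanding amount?", "score": 0.9992976188659668}},
]

print(fields_by_query(records))
```

Passing the result of `client.get_structured_data(content_id)` instead of the copied `records` should work the same way, as long as the response keeps this shape.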

In [36]:
client.sql_query("select * from ingestion;")

SqlQueryResult(result=[{'answer': '$603.03', 'content_id': 'd8ec685dd9cc3505', 'page': 1, 'query': 'Whats the outstanding amount?', 'score': 0.9992976188659668}, {'answer': '5/1/2024', 'content_id': 'd8ec685dd9cc3505', 'page': 0, 'query': 'Whats the due date?', 'score': 0.9999799728393556}])
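
The answers come back as strings, so downstream code will usually want typed values. Here is a small sketch using only the standard library; `parse_due_date` and `parse_amount` are hypothetical helpers, and the date parser assumes month/day/year order, which the document itself does not confirm.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_due_date(text: str) -> date:
    """Parse a date like '5/1/2024', assuming month/day/year order."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def parse_amount(text: str) -> Decimal:
    """Parse a dollar amount like '$603.03' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


# Answers copied from the query result above.
due = parse_due_date("5/1/2024")
outstanding = parse_amount("$603.03")
print(due.isoformat(), outstanding)
```

`Decimal` is used rather than `float` so the amount keeps its exact cents.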