In [1]:
%pip install -q vectara-skunk-client==0.4.11

Note: you may need to restart the kernel to use updated packages.


# Document Data Structuring
Converting files into a structured data format preserves relationships between pieces of data, retains the meaning of specific data types, and lets users query the data with filters.

Let's use this National Institutes of Health (NIH) PDF as an example:

[www.techtransfer.nih.gov_tech_tab-3843.pdf](https://docs.vectara.com/assets/files/www.techtransfer.nih.gov_tech_tab-3843-db3371f8a405d760356376da51ce9a53.pdf)

Vectara offers a structured document format. A PDF like this one can be converted into a structure like the following:

In [2]:
from pathlib import Path
import json

json_file = Path("./resources/03_dds/document_example.json")

with open(json_file, 'r', encoding='utf-8') as f:
    json_content = f.read()
print(json_content)

{
  "documentId": "TAB‑3843",
  "title": "Engineered Cell‑Penetrating Monoclonal Antibody for Universal Influenza Immunotherapy",
  "description": "Home » Tech » Engineered Cell‑Penetrating Monoclonal Antibody for Universal Influenza Immunotherapy",
  "metadataJson": "{'developmentStatus': 'Pre‑Clinical', 'isAntibodiesProduct': true, 'date': '2023‑05‑17', 'patentSeriesCode' :63, 'patentApplicationNumber' :365841}",
  "section": [{
    "title": "body",
    "text": "Influenza remains a burden on public health..."
  }, {
    "title": "Clinical treatment",
    
    "text": "Clinical Treatment: CPP‑mAbs against influenza NP may...",
    "metadataJson": "{'clinicalTreatment': 'CPP‑mAbs against influenza NP may...'}"
  }, {
    "text": "Current vaccines remain effective for a short time period..."
  }]
}
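
Note that the `metadataJson` values in the example above use single quotes, which Python's `json.loads` rejects. The sketch below shows one way to build the same kind of document so that the nested metadata is a valid JSON string. It assumes the field should hold standard JSON (not confirmed by this notebook), and the `doc_metadata` and `document` names are illustrative only.

```python
import json

# Metadata values keep their native types (str, bool, int) until serialized.
doc_metadata = {
    "developmentStatus": "Pre-Clinical",
    "isAntibodiesProduct": True,
    "date": "2023-05-17",
    "patentSeriesCode": 63,
    "patentApplicationNumber": 365841,
}

document = {
    "documentId": "TAB-3843",
    "title": "Engineered Cell-Penetrating Monoclonal Antibody for Universal Influenza Immunotherapy",
    # metadataJson is a string field, so the dict is serialized on its own.
    "metadataJson": json.dumps(doc_metadata),
    "section": [
        {"title": "body", "text": "Influenza remains a burden on public health..."},
    ],
}

# The nested string parses back into the original typed values.
round_tripped = json.loads(document["metadataJson"])
print(json.dumps(document, indent=2))
```

Serializing the metadata with `json.dumps` rather than writing the string by hand keeps booleans and numbers typed, which matters if they are later used in filters.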


## Setup Exercise - Create Corpus
We'll now reuse the code you've seen in earlier labs to create a new corpus for this exercise. That code now lives in our module "lab_setup". Please review it if you're curious about the steps it performs.

In [3]:
from lab_setup import create_lab_corpus

corpus_id = create_lab_corpus("03_document_data_structuring")

08:55:35 +1100 lab_setup            INFO:User prefix for lab: david
08:55:35 +1100 lab_setup            INFO:Setting up lab corpus with name [david-03_document_data_structuring]
08:55:35 +1100 Factory              INFO:initializing builder
08:55:35 +1100 Factory              INFO:Factory will load configuration from home directory
08:55:35 +1100 HomeConfigLoader     INFO:Loading configuration from users home directory [C:\Users\david]
08:55:35 +1100 HomeConfigLoader     INFO:Loading default configuration [default]
08:55:35 +1100 HomeConfigLoader     INFO:Parsing config
08:55:35 +1100 root                 INFO:We are processing authentication type [OAuth2]
08:55:35 +1100 root                 INFO:initializing Client
08:55:37 +1100 RequestUtil          INFO:URL for operation list-corpora is: https://api.vectara.io/v1/list-corpora
08:55:38 +1100 root                 INFO:No existing corpus with the name david-03_document_data_structuring
08:55:38 +1100 RequestUtil          INFO:URL for op

## Download PDF and upload to Vectara
We'll now download the PDF with Python and use the vectara-skunk-client to upload it to our new Vectara corpus.

In [4]:
from urllib import request
from vectara.client.core import Factory

# Download the example PDF into the working directory.
pdf_url = "https://docs.vectara.com/assets/files/www.techtransfer.nih.gov_tech_tab-3843-db3371f8a405d760356376da51ce9a53.pdf"
local_file = "www.techtransfer.nih.gov_tech_tab-3843.pdf"
request.urlretrieve(pdf_url, local_file)

# Upload the file to the lab corpus and keep the extracted document in the response.
indexer_service = Factory().build().indexer_service
result = indexer_service.upload(corpus_id, local_file, return_extracted=True)

08:55:50 +1100 Factory              INFO:initializing builder
08:55:50 +1100 Factory              INFO:Factory will load configuration from home directory
08:55:50 +1100 HomeConfigLoader     INFO:Loading configuration from users home directory [C:\Users\david]
08:55:50 +1100 HomeConfigLoader     INFO:Loading default configuration [default]
08:55:50 +1100 HomeConfigLoader     INFO:Parsing config
08:55:50 +1100 root                 INFO:We are processing authentication type [OAuth2]
08:55:50 +1100 root                 INFO:initializing Client
08:55:50 +1100 IndexerService       INFO:Headers: {"c": "1623270172", "o": "244"}
www.techtransfer.nih.gov_tech_tab-3843.pdf: 157kB [00:03, 42.2kB/s]                                                  
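
Re-running the cell above fetches the PDF again every time. As a minimal sketch, the download step could be wrapped so that an existing local copy is reused. The helper name `download_if_missing` is hypothetical and not part of the vectara-skunk-client.

```python
from pathlib import Path
from urllib import request


def download_if_missing(url: str, target: Path) -> Path:
    """Download url to target unless a non-empty file is already there."""
    if target.exists() and target.stat().st_size > 0:
        # Reuse the local copy and skip the network round trip.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    request.urlretrieve(url, str(target))
    return target
```

The returned path can then be passed to the upload call in place of the hard-coded file name.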


## Inspect the Result
We'll now have a look at the response object returned.
* Notice that it is a domain class (rather than a dict), so it is more strongly typed and its fields are easier to access.
* We can view the converted body, or retain it for upload into another corpus for environment-based promotion.
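
To illustrate the two points above, here is a rough sketch of turning a dataclass-based document into a compact dict that could be saved for later reuse. The `Section` and `Document` classes are hypothetical stand-ins, not the client's real domain classes, and `drop_none` is an illustrative helper.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List, Optional


# Hypothetical stand-ins for the client's domain classes, for illustration only.
@dataclass
class Section:
    text: str
    title: Optional[str] = None
    metadataJson: Optional[str] = None


@dataclass
class Document:
    documentId: str
    title: Optional[str] = None
    section: Optional[List[Section]] = None


def drop_none(value: Any) -> Any:
    """Recursively remove None-valued keys from dicts nested in dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


doc = Document(
    documentId="example.pdf",
    title="Example",
    section=[Section(text="First paragraph"), Section(text="Second", title="Heading")],
)

# asdict converts nested dataclasses too; drop_none then strips the empty fields.
compact = drop_none(asdict(doc))
print(json.dumps(compact, indent=4))
```

Dropping the `None` fields keeps a saved copy much shorter than the raw `asdict` output, where most fields are null.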

In [5]:
type(result)

vectara.client.domain.UploadDocumentResponse

In [6]:
import logging
from dataclasses import asdict
response_info_json = json.dumps(asdict(result.response), indent=4)
logging.info(f"Let's look at the status:\n{response_info_json}")

08:55:55 +1100 root                 INFO:Let's look at the status:
{
    "status": null,
    "quotaConsumed": {
        "numChars": "3494",
        "numMetadataChars": "1849"
    }
}
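
The quota counters in the status above are returned as strings. A small sketch of converting them to integers, for example to compare uploads, follows. The JSON literal is copied from the logged output, and the variable names are illustrative.

```python
import json

# Shape copied from the logged response; the counters arrive as strings.
response_info = json.loads(
    '{"status": null, "quotaConsumed": {"numChars": "3494", "numMetadataChars": "1849"}}'
)

# Convert each counter to an int so the values can be summed or compared.
quota = {name: int(value) for name, value in response_info["quotaConsumed"].items()}
total_chars = sum(quota.values())
print(f"text={quota['numChars']} metadata={quota['numMetadataChars']} total={total_chars}")
```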


In [7]:
document_json = json.dumps(asdict(result.document), indent=4)
logging.info(f"Let's look at the start of the document:{document_json[:800]}")

08:55:58 +1100 root                 INFO:Let's look at the start of the document:{
    "documentId": "www.techtransfer.nih.gov_tech_tab-3843.pdf",
    "title": "Bookmarks     Register     Login",
    "description": null,
    "metadata_json": null,
    "customDims": null,
    "section": [
        {
            "id": null,
            "title": null,
            "text": "Bookmarks     Register     Login",
            "metadataJson": null,
            "customDims": null,
            "section": null
        },
        {
            "id": 1,
            "title": null,
            "text": "\uf0c9",
            "metadataJson": null,
            "customDims": null,
            "section": null
        },
        {
            "id": 2,
            "title": null,
            "text": "Search Site...",
            "metadataJson": null,
            "customDims": null,
            "se
