# Create Synthetic Data
In order to create a good Agentic Demo, we need datasets that are related. We have chosen to
build out a chatbot that uses PII data - so for a demo this needs to be Synthetic.


> If you wish to make an apple pie from scratch you must first invent the universe
>       Carl Sagan

<img src="images/carl-apple-galaxies.png" width="600" />


# Domain Model
We will build a domain model as follows:

1. **Patients**: This will be entirely synthetic data (random patient ID, random names, DOB etc)
2. **Medications**: From our Consumer Medicine Information downloaded from the TGA, we will load these and use RAG to answer various questions:
    1. What is the medicine name
    2. What ailments is the medicine prescribed to treat
3. **Patient-Ailments**: For each patient, we will create a link between the patient and the ailment
        

In [1]:
from faker import Faker
from datetime import datetime
import numpy as np

class PersonGenerator:
    
    def __init__(self):
        Faker.seed(101)
        np.random.seed(101)
        self.fake = Faker()
    
    def create_person(self):
        result = {}
        sex = np.random.choice(["M", "F"], p=[0.5,0.5])
        
        result["first_name"] = self.fake.first_name_male() if sex=="M" else self.fake.first_name_female()
        result["last_name"] = self.fake.last_name()
        result["sex"] = str(sex)
        result["dob"] = datetime.strftime(self.fake.date_of_birth(), "%Y-%M-%d")
        result["address"] = self.fake.address()
        
        return result


person_gen = PersonGenerator()

patients = []

for i in range(0,10):
    person = person_gen.create_person()
    print(f"Created [{person}]")
    patients.append(person)

Created [{'first_name': 'Lauren', 'last_name': 'Lee', 'sex': 'F', 'dob': '1982-00-01', 'address': 'PSC 7083, Box 9347\nAPO AP 28337'}]
Created [{'first_name': 'Erin', 'last_name': 'Kim', 'sex': 'F', 'dob': '1943-00-06', 'address': '8537 Allen Turnpike\nHendersonland, NE 61805'}]
Created [{'first_name': 'Eric', 'last_name': 'Rodriguez', 'sex': 'M', 'dob': '2021-00-02', 'address': '466 Rodney Lodge Suite 073\nWest Laurastad, VT 47636'}]
Created [{'first_name': 'Adam', 'last_name': 'Elliott', 'sex': 'M', 'dob': '1929-00-24', 'address': 'Unit 6241 Box 2659\nDPO AA 22302'}]
Created [{'first_name': 'Kathleen', 'last_name': 'Liu', 'sex': 'F', 'dob': '1951-00-23', 'address': '801 Antonio Key\nRobersonville, TX 93001'}]
Created [{'first_name': 'Lisa', 'last_name': 'Jacobs', 'sex': 'F', 'dob': '1971-00-19', 'address': '238 Hester Dam\nEast Nicolemouth, MS 09355'}]
Created [{'first_name': 'Anthony', 'last_name': 'Pierce', 'sex': 'M', 'dob': '1956-00-17', 'address': '5142 Owens Manors\nNew Tinamou

# Load Consumer Medicine Information
In order to build a robust set to data for our chatbot, of PII "synthetic" data, we'll use the TGA Consumer Medicine Information.

Before we can do anything else, we create our "Corpus". A corpus is where you might put a set of related documents that will help you answer queries. More information about a Vectara Corpus can be found in the [documentation](https://docs.vectara.com/docs/api-reference/admin-apis/admin).



In [2]:
from vectara.factory import Factory
from vectara.managers import CreateCorpusRequest 
import logging

logging.basicConfig(format='%(asctime)s:%(name)-35s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S %z')
logger = logging.getLogger("cell")


client = Factory().build()
create_corpus_request = CreateCorpusRequest().model_validate({
    "name": "Consumer Medicine Information",
    "description": "Information provided by the Australian TGA for prescribed medicines",
    "filter_attributes": [
        {
            "name": "name",
            "type": "text",
            "level": "document",
            "indexed": True
        }
    ]})
corpus = client.lab_helper.create_lab_corpus(create_corpus_request)
corpus_key = corpus.key



10:04:40 +1100:Factory                             INFO:initializing builder
10:04:40 +1100:Factory                             INFO:Factory will load configuration from home directory
10:04:40 +1100:HomeConfigLoader                    INFO:Loading configuration from users home directory [C:\Users\david]
10:04:40 +1100:HomeConfigLoader                    INFO:Loading specified profile [default]
10:04:40 +1100:root                                INFO:We are processing authentication type [OAuth2]
10:04:40 +1100:LabHelper                           INFO:Found username from environment [david]
10:04:40 +1100:LabHelper                           INFO:Converted username is [david]
10:04:40 +1100:LabHelper                           INFO:Lab corpus name will be [david - Consumer Medicine Information]
10:04:40 +1100:LabHelper                           INFO:Lab corpus key will be [david_consumer_medicine_information]
10:04:40 +1100:LabHelper                           INFO:Creating lab corpus
10:0

## Ingest TGA Data to Vectara Corpus

Now that we have our corpus with modelling applied, we can load this data into Vectara. Vectara automatically:

1. Parses the PDFs and extracts tables
2. Chunks the text into appropriate semantic chunks
3. Converts the chunks into embeddings
4. Stores the embedding in our Vector Index along with index metadata (filter attributes) for efficient querying.

In [4]:
from pathlib import Path

medicine_names = []

for medicine_path in Path("../resources/tga").glob("*.pdf"):
    # Make sure to "lower" the medicine name as metadata attributes are case sensitive.
    medicine_name = medicine_path.name[0:-4].lower()
    medicine_names.append(medicine_name)
    client.upload_manager.upload(corpus_key, medicine_path, metadata={"name": medicine_name})

10:05:00 +1100:httpx                               INFO:HTTP Request: POST https://api.vectara.io/v2/corpora/david_consumer_medicine_information/upload_file "HTTP/1.1 201 Created"
10:05:06 +1100:httpx                               INFO:HTTP Request: POST https://api.vectara.io/v2/corpora/david_consumer_medicine_information/upload_file "HTTP/1.1 201 Created"
10:05:11 +1100:httpx                               INFO:HTTP Request: POST https://api.vectara.io/v2/corpora/david_consumer_medicine_information/upload_file "HTTP/1.1 201 Created"


In [5]:
from vectara.corpora.types import SearchCorpusParameters
from vectara.types import GenerationParameters 

import json
import re

prompt_text = """
[
  {"role": "system", "content": "You are a doctor who is providing advice to patients at the end of their clinical visits."},
  #foreach ($qResult in $vectaraQueryResults)
     {"role": "user", "content": "Give me the $vectaraIdxWord[$foreach.index] search result."},
     {"role": "assistant", "content": "${qResult.getText()}" },
  #end
  {"role": "user", "content": "Generate an answer for the query '${vectaraQuery}' based on the above results in JSON an attribute 'ailments' that has a list of ailment names."}
]
    """

generation = GenerationParameters.parse_obj({
    "generation_preset_name": "vectara-summary-ext-v1.3.0",
    "max_used_search_results": 5,
    "max_response_characters": 300,
    "response_language": "auto",
    "prompt_text": prompt_text
})

search_corpus = SearchCorpusParameters.parse_obj({
    "lexical_interpolation": 0.025,
    "semantics": "default",
    "offset": 0,
    "limit": 10,
    "reranker": {
        "type": "customer_reranker",
        "reranker_id": "rnk_272725719"
    },
    "context_configuration": {
        "characters_before": 30,
        "characters_after": 30,
        "start_tag": "<b>",
        "end_tag": "</b>"
    },
})

medicines = []
ailment_medicine_map = {}

for medicine_name in medicine_names:
    query = f"List the ailments is the medicine {medicine_name} used to treat?"
    query_response = client.corpora.query(corpus_key, query=query, search=search_corpus, generation=generation)
    #print(f"For {medicine_name}, we found the following ailments:\n{query_response.summary}")
    
    ailments = json.loads(query_response.summary)["ailments"]
    medicine = {"name": medicine_name, "ailments": ailments}
    medicines.append(medicine)
    
    # Ensure we have a link to medicines which can treat different ailments
    for ailment in ailments:
        
        # Remove any acronyms in brackets, make lower case
        ailment = re.sub(r"\([^\)]+\)","", ailment).lower()
        ailment = re.sub(r" +"," ", ailment).strip()
        
        if ailment in ailment_medicine_map:
                ailment_medicine = ailment_medicine_map[ailment]
        else:
                ailment_medicine = {"name": ailment, "medicines": []}
                ailment_medicine_map[ailment] = ailment_medicine 
        ailment_medicine["medicines"].append(medicine_name)
        

for ailment in ailment_medicine_map.keys():
    ailment_medicine = ailment_medicine_map[ailment]
    
    available_medicines = ailment_medicine["medicines"]
    print(f"For ailment [{ailment}], you can take [{available_medicines}]")
    
    

C:\Users\david\AppData\Local\Temp\ipykernel_13560\1771166572.py:18: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  generation = GenerationParameters.parse_obj({
C:\Users\david\AppData\Local\Temp\ipykernel_13560\1771166572.py:26: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  search_corpus = SearchCorpusParameters.parse_obj({
10:05:18 +1100:httpx                               INFO:HTTP Request: POST https://api.vectara.io/v2/corpora/david_consumer_medicine_information/query "HTTP/1.1 200 OK"
10:05:25 +1100:httpx                               INFO:HTTP Request: POST https://api.vectara.io/v2/corpora/david_consumer_medicine_information

For ailment [high cholesterol levels], you can take [['atorvachol', 'lipitor']]
For ailment [high blood pressure], you can take [['atorvachol', 'lipitor', 'mersyndol']]
For ailment [coronary heart disease], you can take [['atorvachol', 'lipitor']]
For ailment [risk of heart attack], you can take [['atorvachol', 'lipitor']]
For ailment [risk of stroke], you can take [['atorvachol', 'lipitor']]
For ailment [high triglycerides], you can take [['atorvachol']]
For ailment [high triglyceride levels], you can take [['lipitor']]
For ailment [moderate pain], you can take [['mersyndol']]
For ailment [fever], you can take [['mersyndol']]
For ailment [anxiety], you can take [['mersyndol']]
For ailment [nerves], you can take [['mersyndol']]
For ailment [depression], you can take [['mersyndol']]
For ailment [epilepsy], you can take [['mersyndol']]
For ailment [nausea and vomiting], you can take [['mersyndol']]
For ailment [stomach ulcers], you can take [['mersyndol']]
For ailment [ear infections], y

In [6]:
from pathlib import Path
import sqlite3

output_dir = Path("../output")
output_dir.mkdir(exist_ok=True)

db_file = output_dir / "patient.db"
db_file.unlink()
con = sqlite3.connect(db_file)

cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS patient")
con.commit()
cur.execute("CREATE TABLE patient(id, first_name, last_name, sex, dob, address)")
con.commit()
cur.execute("CREATE TABLE patient_ailments(patient_id, ailment, medicine)")
con.commit()


In [7]:
for idx, patient in enumerate(patients):
    cur.execute(f"INSERT INTO patient VALUES ({idx}, '{patient["first_name"]}', '{patient["last_name"]}', '{patient["sex"]}', '{patient["dob"]}', '{patient["address"]}')")



# Populate Ailments with Prescription 
Now for each patient, we give them an ailment and prescription.

In [8]:
np.random.seed(101)

ailment_names = list(ailment_medicine_map.keys())

for idx, patient in enumerate(patients):
    num_ailments = np.random.randint(4) + 1
    
    patient_ailments = []
    patient_medications = []
    
    for ailment_idx in range(num_ailments):
        ailment_name = ailment_names[np.random.randint(len(ailment_names))]
        
        # Check if patient already has that ailment, if so skip
        if ailment_name in patient_ailments:
            continue
        
        patient_ailments.append(ailment_name)
        
        ailment_medicine = ailment_medicine_map[ailment_name]
        medicine_options = ailment_medicine["medicines"]
        
        medicine = medicine_options[np.random.randint(len(medicine_options))]
        
        print(f"For patient {patient["first_name"]}, for ailment [{ailment_name}] we have prescribed them [{medicine}]")
        cur.execute(f"INSERT INTO patient_ailments VALUES ({idx}, '{ailment_name}', '{medicine}')")

con.commit()

For patient Lauren, for ailment [depression] we have prescribed them [mersyndol]
For patient Lauren, for ailment [mental illnesses] we have prescribed them [mersyndol]
For patient Lauren, for ailment [high triglyceride levels] we have prescribed them [lipitor]
For patient Erin, for ailment [anxiety] we have prescribed them [mersyndol]
For patient Erin, for ailment [nausea and vomiting] we have prescribed them [mersyndol]
For patient Erin, for ailment [fever] we have prescribed them [mersyndol]
For patient Erin, for ailment [risk of stroke] we have prescribed them [lipitor]
For patient Eric, for ailment [high cholesterol levels] we have prescribed them [atorvachol]
For patient Adam, for ailment [epilepsy] we have prescribed them [mersyndol]
For patient Adam, for ailment [fever] we have prescribed them [mersyndol]
For patient Kathleen, for ailment [opioid dependence] we have prescribed them [mersyndol]
For patient Kathleen, for ailment [ear infections] we have prescribed them [mersyndol]

In [9]:
cur.close()
con.close()