In [None]:
# Automatically reload imported modules before each cell runs
%load_ext autoreload
%autoreload 2

In [1]:
# Load model directly
from jsonformer import Jsonformer
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AdaptLLM/medicine-chat")
model = AutoModelForCausalLM.from_pretrained("AdaptLLM/medicine-chat", device_map="auto", load_in_4bit=True)

In [2]:
user_input = '''Please extract information from the text between !!! by using only categorical words or choosing from options Yes or No. Do not provide any additional text or information. The output must be in the following format:

{
    "Sex": "",
    "Age": "",
    "Treatment": "",
    "Patient had ECG done?": "",
    "Patient had palpitations?": "",
    "Patient had leg operation?": "",
    "Rehabilitation time": ""
    "Patient finished with treatment?": "",
    "Patient died": "",
}

!!!
A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.
Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).
The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).
Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).
The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).
His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle.
The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.
!!!

'''

# Apply the prompt template and system prompt of LLaMA-2-Chat demo for chat models (NOTE: NO prompt template is required for base models!)
our_system_prompt = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n" # Please do NOT change this

prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{user_input} [/INST]"

# # NOTE:
# # If you want to apply your own system prompt, please integrate it into the instruction part following our system prompt like this:
# your_system_prompt = "Please, answer this question faithfully."
# prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{your_system_prompt}\n{user_input} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')



### User Input:
Please extract information from the text between !!! by using only categorical words or choosing from options Yes or No. Do not provide any additional text or information. The output must be in the following format:

{
    "Sex": "",
    "Age": "",
    "Treatment": "",
    "Patient had ECG done?": "",
    "Patient had palpitations?": "",
    "Patient had leg operation?": "",
    "Rehabilitation time": ""
    "Patient finished with treatment?": "",
    "Patient died": "",
}

!!!
A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation p
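
The prompt above asks for a JSON object, but a chat model may surround it with extra text. A minimal sketch of pulling the first parseable object out of a reply, using only the standard library; the helper name `extract_first_json_object` and the example reply are hypothetical and not taken from an actual model run:

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


# Hypothetical reply, invented for illustration only.
example_reply = 'Sure! {"Sex": "Male", "Age": "28", "Patient died": "No"} Hope this helps.'
parsed = extract_first_json_object(example_reply)
print(parsed)
```

Returning `None` instead of raising keeps the calling cell simple when the model produces no valid JSON at all.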

In [None]:
user_input = '''Please extract information from the text between !!! by using only categorical words or choosing from options Yes or No. Do not provide any additional text or information. Please provide the response in the form of a Python list. It should begin with "[" and end with "]":

[
    "Sex": "",
    "Age": "",
    "Treatment": "",
    "Patient had ECG done?": "",
    "Patient had palpitations?": "",
    "Patient had leg operation?": "",
    "Rehabilitation time": ""
    "Patient finished with treatment?": "",
    "Patient died": "",
]

!!!
A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.
Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).
The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).
Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).
The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).
His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle.
The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.
!!!

'''

# Apply the prompt template and system prompt of LLaMA-2-Chat demo for chat models (NOTE: NO prompt template is required for base models!)
our_system_prompt = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n" # Please do NOT change this

prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{user_input} [/INST]"

# # NOTE:
# # If you want to apply your own system prompt, please integrate it into the instruction part following our system prompt like this:
# your_system_prompt = "Please, answer this question faithfully."
# prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{your_system_prompt}\n{user_input} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')
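
The same template string is rebuilt in every cell. A small helper could keep it in one place; this is a sketch, and the function name `build_llama2_chat_prompt` and its parameters are assumptions rather than part of the model's own tooling:

```python
def build_llama2_chat_prompt(user_input, system_prompt, extra_instruction=None):
    """Wrap a user message in the LLaMA-2-Chat template used in the cells above."""
    if extra_instruction:
        # Mirrors the commented-out variant: a custom instruction goes
        # into the instruction part, after the fixed system prompt.
        user_input = f"{extra_instruction}\n{user_input}"
    return f"<s>[INST] <<SYS>>{system_prompt}<</SYS>>\n\n{user_input} [/INST]"


# Short placeholder strings, used only to show the resulting layout.
demo_prompt = build_llama2_chat_prompt("Extract the age.", "\nBe concise.\n")
print(demo_prompt)
```

Because the tokenizer is called with `add_special_tokens=False`, the literal `<s>` in the template is the only beginning-of-sequence marker, so the helper keeps it.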

In [None]:

user_input = '''Please extract information from the text between !!! by using only categorical words or choosing from options Yes or No. Do not provide any additional text or information.

!!!
A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.
Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).
The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).
Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).
The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).
His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle.
The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.
!!!

Format the information using the following schema: 
'''
#prompt = "Generate a person's information based on the following schema:"
json_schema = {
    "type": "object",
    "properties": {
        "sex": {"type": "string"},
        "age": {"type": "string"},  # "type": "number" did not work with the medicine-chat model; cause not yet investigated
        "Treatment": {"type": "string"},
        "Patient had ECG done": {"type": "boolean"},
        "Patient died": {"type": "boolean"},
    }
}

# Apply the prompt template and system prompt of LLaMA-2-Chat demo for chat models (NOTE: NO prompt template is required for base models!)
our_system_prompt = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n" # Please do NOT change this

prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{user_input} [/INST]"


jsonformer = Jsonformer(model, tokenizer, json_schema, user_input)
generated_data = jsonformer()

print(generated_data)

# # NOTE:
# # If you want to apply your own system prompt, please integrate it into the instruction part following our system prompt like this:
# your_system_prompt = "Please, answer this question faithfully."
# prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{your_system_prompt}\n{user_input} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')

In [None]:
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

json_schema = {
    "type": "object",
    "properties": {
        "sex": {"type": "string"},
        "age": {"type": "number"},
        "Treatment": {"type": "string"},
        "Patient had ECG done": {"type": "boolean"},
        "Patient died": {"type": "boolean"},
    }
}

prompt = """Please extract information from the text between !!! by using only categorical words or choosing from options Yes or No. Do not provide any additional text or information. Please generate a person's information based on the following schema:


!!!
A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.
Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).
The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).
Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).
The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).
His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle.
The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.
!!!
"""
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)
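
Before using a generated dictionary downstream, it can help to confirm that it actually matches the schema. The sketch below is a deliberately small checker that covers only the three types used in this notebook; `find_schema_mismatches` and the sample dictionary are hypothetical, and this is not a full JSON Schema validator:

```python
JSON_TO_PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
}


def find_schema_mismatches(data, schema):
    """List keys that are missing from `data` or hold a value of the wrong type."""
    problems = []
    for key, spec in schema["properties"].items():
        if key not in data:
            problems.append(f"missing: {key}")
            continue
        value = data[key]
        expected = JSON_TO_PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for "number".
        is_bool_as_number = spec["type"] == "number" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f"wrong type: {key}")
    return problems


demo_schema = {
    "type": "object",
    "properties": {
        "sex": {"type": "string"},
        "age": {"type": "number"},
        "Patient died": {"type": "boolean"},
    },
}
# Invented output, for illustration only.
demo_output = {"sex": "male", "age": "28"}
print(find_schema_mismatches(demo_output, demo_schema))
```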

In [8]:
import os
import re

In [9]:
path = os.getcwd()

parent_directory_path = os.path.dirname(path)

data_directory = "data/maccrobat/MACCROBAT2020"

path = os.path.join(parent_directory_path, data_directory)
file_list = os.listdir(path)

# Retrieve txt and ann files
txt_files = [file for file in file_list if file.endswith('.txt')]
ann_files = [file for file in file_list if file.endswith('.ann')]

# Pair each txt file with the ann file that shares its name
file_tuples = [(txt, txt[:-4] + '.ann') for txt in txt_files if txt[:-4] + '.ann' in ann_files]

In [10]:
file_tuples

[('26444414.txt', '26444414.ann'),
 ('28079821.txt', '28079821.ann'),
 ('23033875.txt', '23033875.ann'),
 ('28767567.txt', '28767567.ann'),
 ('27741115.txt', '27741115.ann'),
 ('25572898.txt', '25572898.ann'),
 ('25246819.txt', '25246819.ann'),
 ('28353604.txt', '28353604.ann'),
 ('25410034.txt', '25410034.ann'),
 ('26530965.txt', '26530965.ann'),
 ('20146086.txt', '20146086.ann'),
 ('21308977.txt', '21308977.ann'),
 ('22520024.txt', '22520024.ann'),
 ('28353613.txt', '28353613.ann'),
 ('27218632.txt', '27218632.ann'),
 ('18666334.txt', '18666334.ann'),
 ('21477357.txt', '21477357.ann'),
 ('24526194.txt', '24526194.ann'),
 ('26285706.txt', '26285706.ann'),
 ('27928148.txt', '27928148.ann'),
 ('22791498.txt', '22791498.ann'),
 ('21527041.txt', '21527041.ann'),
 ('28193213.txt', '28193213.ann'),
 ('24518095.txt', '24518095.ann'),
 ('19860925.txt', '19860925.ann'),
 ('26405496.txt', '26405496.ann'),
 ('24957905.txt', '24957905.ann'),
 ('25210224.txt', '25210224.ann'),
 ('26469535.txt', '2
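
The pairing step can also be written as a function over plain file names, which makes it easy to check without the dataset on disk. This is a sketch; `pair_txt_and_ann` and the demo file names are made up for illustration:

```python
import os


def pair_txt_and_ann(file_names):
    """Return (txt, ann) name pairs for every .txt file that has a matching .ann file."""
    names = set(file_names)  # set lookup avoids rescanning a list for each file
    pairs = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        if ext == ".txt" and stem + ".ann" in names:
            pairs.append((name, stem + ".ann"))
    return pairs


demo_pairs = pair_txt_and_ann(["2.txt", "1.ann", "1.txt", "3.ann", "notes.md"])
print(demo_pairs)  # [('1.txt', '1.ann')]
```

Sorting the names makes the order reproducible, unlike `os.listdir`, whose order is arbitrary.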

In [7]:
label_types = []


for Annfile in ann_files:
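    pathToAnnFile = os.path.join(path, Annfile)
    # Keep only text-bound annotations (IDs starting with "T"); the context manager closes the file.
    with open(pathToAnnFile, "r") as annHandle:
        allAnnLines = [re.split(r'\t+', tag.rstrip('\t')) for tag in annHandle if tag.startswith('T')]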

    for annLine in allAnnLines:
        label_types.append(annLine[1].split()[0])

label_types = set(label_types)
print(len(label_types))
print(label_types)

41
{'Area', 'Personal_background', 'Subject', 'Quantitative_concept', 'Qualitative_concept', 'Medication', 'Duration', 'Coreference', 'Dosage', 'History', 'Outcome', 'Distance', 'Severity', 'Mass', 'Height', 'Frequency', 'Biological_structure', 'Date', 'Activity', 'Color', 'Detailed_description', 'Nonbiological_location', 'Sign_symptom', 'Disease_disorder', 'Weight', 'Shape', 'Age', 'Texture', 'Administration', 'Sex', 'Clinical_event', 'Time', 'Family_history', 'Other_event', 'Lab_value', 'Other_entity', 'Occupation', 'Diagnostic_procedure', 'Biological_attribute', 'Therapeutic_procedure', 'Volume'}
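
Each `T` line holds an ID, a label with character offsets, and the covered text, separated by tabs. A sketch of turning one such line into a typed record follows; the `Entity` class and `parse_entity_line` are hypothetical names, and the handling of `;`-separated offsets assumes the brat standoff convention for discontinuous spans:

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_entity_line(line: str) -> Optional[Entity]:
    """Parse one text-bound annotation line; return None for other line types."""
    if not line.startswith("T"):
        return None
    ann_id, span_info, text = line.rstrip("\n").split("\t", 2)
    label, offsets = span_info.split(" ", 1)
    # Discontinuous spans look like "0 5;10 15"; keep the outermost offsets.
    fragments = [fragment.split() for fragment in offsets.split(";")]
    start = int(fragments[0][0])
    end = int(fragments[-1][1])
    return Entity(ann_id, label, start, end, text)


entity = parse_entity_line("T1\tAge 2 13\t58-year-old\n")
print(entity)
```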


In [None]:
mySet = {
    'Diagnostic_procedure',
    'Sign_symptom',
    'Biological_structure',
    'Detailed_description',
    'Age',
    'Lab_value'
}

In [21]:
for txtAnnPair in file_tuples[:1]:
    pathToTxtFile = os.path.join(path, txtAnnPair[0])
    pathToAnnFile = os.path.join(path, txtAnnPair[1])

    # Read the whole report text; the context manager closes the file.
    with open(pathToTxtFile, "r") as txtHandle:
        Txtfile = txtHandle.read()

    # Keep only text-bound annotations (IDs starting with "T").
    with open(pathToAnnFile, "r") as annHandle:
        allAnnLines = [re.split(r'\t+', tag.rstrip('\t')) for tag in annHandle if tag.startswith('T')]
    print(len(allAnnLines))


    removed = []
    previous = -10
    for annLine in allAnnLines.copy():
        currentStart = int((annLine[1].split()[1]))
        if currentStart == previous:
            removed.append(annLine)
            allAnnLines.remove(annLine)
        previous = currentStart

    print(allAnnLines)
    print(Txtfile)

53
[['T1', 'Age 2 13', '58-year-old\n'], ['T2', 'Sex 14 17', 'man\n'], ['T3', 'Sign_symptom 42 57', 'general fatigue\n'], ['T4', 'Sign_symptom 69 75', 'anemia\n'], ['T5', 'Severity 62 68', 'severe\n'], ['T6', 'Duration 80 94', 'several months\n'], ['T7', 'Diagnostic_procedure 100 117', 'hemoglobin levels\n'], ['T8', 'Lab_value 123 131', '6.6 g/dl\n'], ['T9', 'History 167 185', 'no medical history\n'], ['T10', 'History 190 215', 'did not take any medicine\n'], ['T11', 'Diagnostic_procedure 217 243', 'Esophagogastroduodenoscopy\n'], ['T12', 'Diagnostic_procedure 248 259', 'colonoscopy\n'], ['T13', 'Sign_symptom 291 299', 'bleeding\n'], ['T14', 'Diagnostic_procedure 311 330', 'computer tomography\n'], ['T15', 'Biological_structure 301 310', 'Abdominal\n'], ['T16', 'Sign_symptom 361 366', 'tumor\n'], ['T17', 'Biological_structure 374 389', 'small intestine\n'], ['T18', 'Detailed_description 347 360', 'hypervascular\n'], ['T19', 'Distance 342 346', '2-cm\n'], ['T20', 'Diagnostic_procedure 4
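
To compare model output with the annotations, the parsed lines can be grouped by label and restricted to a set of labels of interest, such as `mySet`. A sketch, with `group_texts_by_label` as a hypothetical helper and the sample rows copied from the output above:

```python
from collections import defaultdict


def group_texts_by_label(ann_lines, wanted_labels):
    """Map each wanted label to the list of annotated texts carrying it."""
    grouped = defaultdict(list)
    for _ann_id, span_info, text in ann_lines:
        label = span_info.split()[0]
        if label in wanted_labels:
            grouped[label].append(text.strip())  # drop the trailing newline
    return dict(grouped)


sample_lines = [
    ['T1', 'Age 2 13', '58-year-old\n'],
    ['T2', 'Sex 14 17', 'man\n'],
    ['T3', 'Sign_symptom 42 57', 'general fatigue\n'],
    ['T4', 'Sign_symptom 69 75', 'anemia\n'],
]
gold = group_texts_by_label(sample_lines, {'Age', 'Sign_symptom'})
print(gold)
```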

# IDEAS for better results

Perform lemmatization on both the generated words and the words in the dataset before comparing them!
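
A sketch of how normalized comparison could work. The `crude_normalize` function below is only a rough stand-in that lowercases and strips a plural "s"; it is not a real lemmatizer, and a proper one from an NLP library would replace it. The function names and the word lists are hypothetical:

```python
def crude_normalize(word):
    """Rough stand-in for lemmatization: lowercase and drop a trailing plural 's'."""
    word = word.lower().strip(".,;:!?")
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def precision_recall(predicted, gold, normalize=crude_normalize):
    """Compare two word lists after normalizing both sides the same way."""
    pred_set = {normalize(word) for word in predicted}
    gold_set = {normalize(word) for word in gold}
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


# Invented word lists, for illustration only.
predicted_words = ["Tumors", "anemia", "fever"]
gold_words = ["tumor", "anemia", "bleeding", "fatigue"]
precision, recall = precision_recall(predicted_words, gold_words)
print(precision, recall)
```

Passing the normalizer as a parameter means a real lemmatizer can be swapped in without touching the scoring code.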

In [1]:
#from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("AdaptLLM/medicine-chat", device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("AdaptLLM/medicine-chat")

In [25]:
user_input = """Identify sign symptoms mentioned in text between !!!. For every detected sign symptom extract only one word and do not provide additional information.

!!!
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
He had no medical history and did not take any medicine.
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1).
Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2).
We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions.
Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted segmental resection of the jejunum with the dissection of lymph nodes.
Examination of the resected tumor showed that it measured 19 × 16 mm in diameter (Fig.3).
Histology revealed the proliferation of blood capillaries and granulation tissue, which was consistent with PG (Fig.4).
The patient was discharged on postoperative day 9 without complication and his anemia improved gradually without the need for oral iron after surgery.
!!!
"""
our_system_prompt = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n" # Please do NOT change this

prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{user_input} [/INST]"

# # NOTE:
# # If you want to apply your own system prompt, please integrate it into the instruction part following our system prompt like this:
# your_system_prompt = "Please, answer this question faithfully."
# prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{your_system_prompt}\n{user_input} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096, max_new_tokens=150)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')

Both `max_new_tokens` (=150) and `max_length`(=4096) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


### User Input:
Identify sign symptoms mentioned in text between !!!. For every detected sign symptom extract only one word and do not provide additional information.

!!!
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
He had no medical history and did not take any medicine.
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1).
Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2).
We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions.
Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted 

In [20]:
user_input = """Identify and sign symptoms mentioned in text between !!!. For every detected sign symptom extract only one word and do not provide additional information.


!!!
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
He had no medical history and did not take any medicine.
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1).
Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2).
We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions.
Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted segmental resection of the jejunum with the dissection of lymph nodes.
Examination of the resected tumor showed that it measured 19 × 16 mm in diameter (Fig.3).
Histology revealed the proliferation of blood capillaries and granulation tissue, which was consistent with PG (Fig.4).
The patient was discharged on postoperative day 9 without complication and his anemia improved gradually without the need for oral iron after surgery.
!!!
"""
our_system_prompt = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n" # Please do NOT change this

prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{user_input} [/INST]"

# # NOTE:
# # If you want to apply your own system prompt, please integrate it into the instruction part following our system prompt like this:
# your_system_prompt = "Please, answer this question faithfully."
# prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{your_system_prompt}\n{user_input} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')

### User Input:
Identify and sign symptoms mentioned in text between !!!. For every detected sign symptom extract only one word and do not provide additional information.


!!!
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
He had no medical history and did not take any medicine.
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1).
Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2).
We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions.
Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assi

In [None]:
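# Imports this cell relies on. The LangChain paths match older releases and may differ in newer ones.
import transformers
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

generator = transformers.pipeline(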
    model = model,
    tokenizer=tokenizer,
    return_full_text = True, # langchain expects full text
    task='text-generation',
    #stopping_criteria=stopping_criteria, # without this model rambles during chat
    temperature=0.1, # 'randomness' of outputs; lower values make the output more deterministic
    max_new_tokens=512, # max number of tokens to generate in the output
    repetition_penalty=1.1 # without this output begins repeating
)

llm = HuggingFacePipeline(pipeline=generator)
# creating prompt for large language model
pre_prompt = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\nGenerate the next agent response by answering the question. Answer it as succinctly as possible. You are provided several documents with titles. If the answer comes from different documents please mention all possibilities in your answer and use the titles to separate between topics or domains. If you cannot answer the question from the given documents, please state that you do not have an answer.\n"""
prompt = pre_prompt + "CONTEXT:\n\n{context}\n" + "Question : {question}" + "[/INST]"
llama_prompt = PromptTemplate(template=prompt, input_variables=["context", "question"])
# integrate prompt with LLM (loaded_vectorstore must already be defined before this cell runs)
chain = ConversationalRetrievalChain.from_llm(llm, loaded_vectorstore.as_retriever(), combine_docs_chain_kwargs={"prompt": llama_prompt}, return_source_documents=True)

In [2]:
#from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("AdaptLLM/medicine-chat", device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("AdaptLLM/medicine-chat")

In [None]:
user_input = """Identify sign symptoms mentioned in text between !!!. For every detected sign symptom extract only one word and do not provide additional information.

!!!
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
He had no medical history and did not take any medicine.
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1).
Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2).
We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions.
Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted segmental resection of the jejunum with the dissection of lymph nodes.
Examination of the resected tumor showed that it measured 19 × 16 mm in diameter (Fig.3).
Histology revealed the proliferation of blood capillaries and granulation tissue, which was consistent with PG (Fig.4).
The patient was discharged on postoperative day 9 without complication and his anemia improved gradually without the need for oral iron after surgery.
!!!
"""
our_system_prompt = "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n" # Please do NOT change this

prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{user_input} [/INST]"

# # NOTE:
# # If you want to apply your own system prompt, please integrate it into the instruction part following our system prompt like this:
# your_system_prompt = "Please, answer this question faithfully."
# prompt = f"<s>[INST] <<SYS>>{our_system_prompt}<</SYS>>\n\n{your_system_prompt}\n{user_input} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
# Only max_new_tokens is set: when both limits are given, max_new_tokens takes precedence anyway
outputs = model.generate(input_ids=inputs, max_new_tokens=150)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')
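
The prompt above is assembled by hand each time, including the optional extra instruction described in the NOTE. A minimal sketch of a helper that builds the same layout in one place; `build_prompt` and its parameter names are illustrative and not part of the AdaptLLM code:

```python
def build_prompt(system_prompt: str, user_input: str, extra_instruction: str = "") -> str:
    """Assemble the <s>[INST] <<SYS>>...<</SYS>> layout used in the cell above."""
    # An optional extra instruction goes into the instruction part, after the system prompt
    body = f"{extra_instruction}\n{user_input}" if extra_instruction else user_input
    return f"<s>[INST] <<SYS>>{system_prompt}<</SYS>>\n\n{body} [/INST]"


demo = build_prompt("\nBe concise.\n", "List the symptoms.", "Please, answer this question faithfully.")
print(demo)
```

Keeping the layout in a single function makes it harder for the training-time and inference-time prompts to drift apart.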

In [41]:
prompt = """<s> [INST] <<SYS>>
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.


```TypeScript

patient: { // A patient is an individual admitted to or seeking medical attention at a hospital or healthcare facility. 
 age: Array<string> // Patients' age.
 sex: Array<string> // Patients' sex.
 sign symptom: Array<string> // Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.
 biological structure: Array<string> // Biological structures refer to the anatomical components and organs mentioned in the input.
 diagnostic procedure: Array<string> // A diagnostic procedure is a medical test or examination.
 lab value: Array<string> // Laboratory values refer to the results of tests.
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.
Information is specifically extracted from the Input before your output. Do not use blood as a biological structure; it is not mentioned in the text I gave you.



Input: A 32-year-old man
Output: <json>{"patient": {"age": ["32-year-old"]}}</json>
Input: A 32-year-old man.
Output: <json>{"patient": {"sex": ["Male"]}}</json>
Input: A 50-year-old woman.
Output: <json>{"patient": {"sex": ["Female"]}}</json>
Input: The patient has a fever temperature of 38.
Output: <json>{"patient": {"sign symptom": ["fever"]}}</json>
Input: The patient complains of abdominal pain.
Output: <json>{"patient": {"sign symptom": ["abdominal pain"]}}</json>
Input: <</SYS>> A 58-year-old man had been suffering from general fatigue and severe anemia for several months. His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl). He had no medical history and did not take any medicine. Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding. Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1). Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2). We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions. Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted segmental resection of the jejunum with the dissection of lymph nodes. Examination of the resected tumor showed that it measured 19 × 16 mm in diameter (Fig.3). Histology revealed the proliferation of blood capillaries and granulation tissue, which was consistent with PG (Fig.4). The patient was discharged on postoperative day 9 without complication and his anemia improved gradually without the need for oral iron after surgery. [/INST]
Output: 
"""

In [48]:
prompt = """[INST]
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.


```TypeScript

patient: { // A patient is an individual admitted to or seeking medical attention at a hospital or healthcare facility. 
 age: Array<string> // Patients' age.
 sex: Array<string> // Patients' sex.
 sign symptom: Array<string> // Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.
 biological structure: Array<string> // Biological structures refer to the anatomical components and organs mentioned in the input.
 diagnostic procedure: Array<string> // A diagnostic procedure is a medical test or examination.
 lab value: Array<string> // Laboratory values refer to the results of tests.
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.
Information is specifically extracted from the Input before your output. Do not use blood as a biological structure; it is not mentioned in the text I gave you.



Input: His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl). He had no medical history and did not take any medicine. Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding. Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1). Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2). We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions. Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted segmental resection of the jejunum with the dissection of lymph nodes. Examination of the resected tumor showed that it measured 19 × 16 mm in diameter (Fig.3). Histology revealed the proliferation of blood capillaries and granulation tissue, which was consistent with PG (Fig.4). The patient was discharged on postoperative day 9 without complication and his anemia improved gradually without the need for oral iron after surgery. 
Output: [/INST]
"""

In [49]:
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096, max_new_tokens=200)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{prompt}\n\n### Assistant Output:\n{pred}')

Both `max_new_tokens` (=200) and `max_length`(=4096) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


### User Input:
[INST]
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.


```TypeScript

patient: { // A patient is an individual admitted to or seeking medical attention at a hospital or healthcare facility. 
 age: Array<string> // Patients' age.
 sex: Array<string> // Patients' sex.
 sign symptom: Array<string> // Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.
 biological structure: Array<string> // Biological structures refer to the anatomical components and organs mentioned in the input.
 diagnostic procedure: Array<string> // A diagnostic procedure is a medical test or examination.
 lab value: Array<string> // Laboratory values refer to the results of tests.
}
```


Please output t
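
The prompt asks the model to wrap its answer in `<json>` tags, so the decoded text still has to be parsed before it can be used. A minimal sketch of that step, assuming the model follows the tag convention at least once in its output; `parse_tagged_json` is a hypothetical helper name:

```python
import json
import re
from typing import Optional

JSON_TAG = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def parse_tagged_json(text: str) -> Optional[dict]:
    """Return the first parsable <json>...</json> object in text, or None."""
    for match in JSON_TAG.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed payload: try the next tagged span
        if isinstance(payload, dict):
            return payload
    return None


sample = 'Output: <json>{"patient": {"age": ["58-year-old"]}}</json> trailing text'
print(parse_tagged_json(sample))
```

Returning `None` instead of raising lets the caller decide whether to retry generation or skip the sentence.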

# TESTING KOR

In [3]:
import os

from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
from kor import create_extraction_chain, Object, Text

#model_id = 'AdaptLLM/medicine-chat'# go for a smaller model if you don't have the VRAM
model_id = 'epfl-llm/meditron-7b'
# Read the Hugging Face access token from the environment instead of hard-coding it in the notebook
access_token = os.environ.get("HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True, token=access_token)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Original example

In [1]:
pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=4096,
    temperature=0.1,
    max_new_tokens=150
)

local_llm = HuggingFacePipeline(pipeline=pipe)


schema = Object(
    id="player",
    description=(
        "User is controlling a music player to select songs, pause or start them or play"
        " music by a particular artist."
    ),
    attributes=[
        Text(
            id="song",
            description="User wants to play this song",
            examples=[],
            many=True,
        ),
        Text(
            id="album",
            description="User wants to play this album",
            examples=[],
            many=True,
        ),
        Text(
            id="artist",
            description="Music by the given artist",
            examples=[("Songs by paul simon", "paul simon")],
            many=True,
        ),
        Text(
            id="action",
            description="Action to take one of: `play`, `stop`, `next`, `previous`.",
            examples=[
                ("Please stop the music", "stop"),
                ("play something", "play"),
                ("play a song", "play"),
                ("next song", "next"),
            ],
        ),
    ],
    many=False,
)

chain = create_extraction_chain(local_llm, schema, encoder_or_encoder_class='json')
chain.run("play songs by paul simon and led zeppelin and the doors")['data']

NameError: name 'pipeline' is not defined

Our problem

In [None]:
mySet = {
    'Diagnostic_procedure',
    'Sign_symptom',
    'Biological_structure',
    'Detailed_description',
    'Age',
    'Lab_value'
}
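
The label names in `mySet` differ in spelling from the schema ids used for extraction further down (`Sign_symptom` versus `sign symptom`). A small sketch of lining the two up, assuming `mySet` holds the annotation labels of the dataset; `label_to_field` and the two constant names are hypothetical:

```python
# Labels copied from mySet above (assumed to be the annotation entity types)
ANNOTATION_LABELS = {
    "Diagnostic_procedure",
    "Sign_symptom",
    "Biological_structure",
    "Detailed_description",
    "Age",
    "Lab_value",
}

# Attribute ids of the kor schema defined later in this notebook
SCHEMA_FIELDS = {
    "age",
    "sex",
    "sign symptom",
    "biological structure",
    "diagnostic procedure",
    "lab value",
}


def label_to_field(label: str) -> str:
    """Turn an annotation label such as 'Sign_symptom' into a schema id such as 'sign symptom'."""
    return label.replace("_", " ").lower()


mapped = {label: label_to_field(label) for label in ANNOTATION_LABELS}
labels_without_field = sorted(l for l, f in mapped.items() if f not in SCHEMA_FIELDS)
fields_without_label = sorted(SCHEMA_FIELDS - set(mapped.values()))
print(labels_without_field)
print(fields_without_label)
```

The two printed lists show where labels and schema do not overlap, which matters if extracted values are later compared against the annotations.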

In [19]:
from langchain.prompts import PromptTemplate

pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=2048,
    #do_sample=True,
    #temperature=0.0001,
    #max_new_tokens=300
)

local_llm = HuggingFacePipeline(pipeline=pipe)

#[INST] <<SYS>>\n<</SYS>> [/INST]

instruction_template = PromptTemplate(
    input_variables=["format_instructions", "type_description"],
    template=(
        "[INST]\n Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.\n\n"
        "{type_description}\n\n" # Can comment out
        "{format_instructions}\n\n"
        #"Here are some examples:\n\n"
    ),
)


schema = Object(
    id="patient",
    description=(
        "A patient is an individual admitted to or seeking medical attention"
        " at a hospital or healthcare facility."
    ),
    attributes=[
        Text(
            id="age",
            description="Patients' age.",
            examples=[
                ("A 32-year-old man", "32-year-old")
            ],
            many=True,
        ),
        Text(
            id="sex",
            description="Patients' sex.",
            examples=[
                ("A 32-year-old man.", "Male"),
                ("A 50-year-old woman.", "Female")
            ],
            many=True,
        ),
        Text(
            id="sign symptom",
            description="Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.",
            examples=[
                  ("The patient has a fever temperature of 38.", "fever"),
                  ("The patient complains of abdominal pain.", "abdominal pain")
            ],
              many=True,
          ),
         Text(
             id="biological structure",
             description="Biological structures refer to the anatomical components and organs.",
             examples=[
                 ("The patient lost strength in the left arm.", "arm"),
                 ("Computed tomography (CT) of the chest.", "chest"),
                 ("Discomfort in the left upper quadrant of the abdomen.", "abdomen"),
             ],
             many=True,
         ),
        Text(
            id="diagnostic procedure",
            description="A diagnostic procedure is a medical test or examination.",
            examples=[
                ("An electrocardiogram revealed normal sinus rhythm", "electrocardiogram"),
                ("Physical examination yielded unremarkable findings.", "physical examination"),
            ],
            many=True,
        ),
        Text(
            id="lab value",
            description="Laboratory values refer to the results of tests.",
            examples=[
                ("Surgical margins were tumor-free.", "tumor-free"),
                ("His temperature was 38.4 °C.", "38.4 °C"),
                ("Heart rate 130/minute", "130/minute"),
                ("Blood levels of lactate and pyruvate were 1.6 and 0.096 mmol/l", "0.096 mmol/l"), 
                ("Blood levels of lactate and pyruvate were 1.6 and 0.096 mmol/l", "1.6"), 
            ],
            many=True,
        ),
    ],
    many=False,
)

user_input = """\n
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
He had no medical history and did not take any medicine.
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1). [/INST]"""

prompt = f"{user_input}"

#prompt = user_input

chain = create_extraction_chain(local_llm, schema, instruction_template=instruction_template, encoder_or_encoder_class='json')
chain.run(prompt)['data']

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'patient': {'age': ['58-year-old'],
  'sex': ['Male'],
  'sign symptom': ['general fatigue'],
  'biological structure': ['small intestine'],
  'diagnostic procedure': ['esophagogastroduodenoscopy'],
  'lab value': ['6.6 g/dl']}}

In [16]:
print(chain.prompt.format_prompt("[user_input]").to_string())

[INST]
 Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

patient: { // A patient is an individual admitted to or seeking medical attention at a hospital or healthcare facility.
 age: Array<string> // Patients' age.
 sex: Array<string> // Patients' sex.
 sign symptom: Array<string> // Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.
 biological structure: Array<string> // Biological structures refer to the anatomical components and organs.
 diagnostic procedure: Array<string> // A diagnostic procedure is a medical test or examination.
 lab value: Array<string> // Laboratory values refer to the results of tests.
}
```


Please output the extracted information in JSON format.

# TESTING WITH FOR LOOP

In [38]:
from langchain.prompts import PromptTemplate

for txtAnnPair in file_tuples[:1]:
    pathToTxtFile = os.path.join(path, txtAnnPair[0])
    pathToAnnFile = os.path.join(path, txtAnnPair[1])

    # Read the whole file and close it again when done
    with open(pathToTxtFile, "r") as f:
        Txtfile = f.read()
    
    # removed = []
    # previous = -10
    # for annLine in allAnnLines.copy():
    #     currentStart = int((annLine[1].split()[1]))
    #     if currentStart == previous:
    #         removed.append(annLine)
    #         allAnnLines.remove(annLine)
    #     previous = currentStart

    for sentence in Txtfile.split("\n"):
        print(sentence)
        pipe = pipeline(
            "text-generation",
            model=model, 
            tokenizer=tokenizer, 
            max_length=4096,
        )

        local_llm = HuggingFacePipeline(pipeline=pipe)

        instruction_template = PromptTemplate(
            input_variables=["format_instructions", "type_description"],
            template=(
                "[INST] <<SYS>>\nYour goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.\n<</SYS>>\n\n"
                "{type_description}\n\n" # Can comment out
                "{format_instructions}\n\n"
                #"Here are some examples:\n\n"
            ),
        )



        schema = Object(
            id="patient",
            description=(
                "A patient is an individual admitted to or seeking medical attention"
                " at a hospital or healthcare facility."
            ),
            attributes=[
                Text(
                    id="age",
                    description="Patients' age.",
                    examples=[
                        #("A 32-year-old", "32-year-old"),
                    ],
                    many=True,
                ),
                Text(
                    id="sex",
                    description="Patients' sex.",
                    examples=[
                        #("A 32-year-old man.", "Male"),
                        #("A 50-year-old woman.", "Female")
                    ],
                    many=True,
                ),
                Text(
                    id="sign symptom",
                    description="Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.",
                    examples=[
                        #("The patient has a fever temperature of 38.", "fever"),
                        #("The patient complains of abdominal pain.", "abdominal pain")
                    ],
                    many=True,
                ),
                Text(
                    id="biological structure",
                    description="Biological structures refer to the anatomical body components.",
                    examples=[
                        #("The patient lost strength in the left arm.", "arm"),
                        #("Computed tomography (CT) of the chest.", "chest"),
                        #("Discomfort in the left upper quadrant of the abdomen.", "abdomen"),
                    ],
                    many=True,
                ),
                Text(
                    id="diagnostic procedure",
                    description="A diagnostic procedure is a medical test or examination.",
                    examples=[
                        # ("An electrocardiogram revealed normal sinus rhythm", "electrocardiogram"),
                        # ("Physical examination yielded unremarkable findings.", "physical examination"),
                    ],
                    many=True,
                ),
                Text(
                    id="lab value",
                    description="Laboratory values refer to the results of tests or diagnostic procedures.",
                    examples=[
                        # ("Surgical margins were tumor-free.", "tumor-free"),
                         ("His temperature was 38.4 °C.", "38.4 °C"),
                        # ("Heart rate 130/minute", "130/minute"),
                        # ("His hemoglobin levels were 6.6 g/dl", "6.6 g/dl")
                        #("Blood levels of lactate and pyruvate were 1.6 and 0.096 mmol/l", "0.096 mmol/l"), 
                        #("Blood levels of lactate and pyruvate were 1.6 and 0.096 mmol/l", "1.6"), 
                    ],
                    many=True,
                ),
            ],
            many=False,
        )

        #user_input = """A 58-year-old man had been suffering from general fatigue and severe anemia for several months.[/INST]"""

        prompt = f"{sentence}[/INST]"

        #prompt = user_input

        chain = create_extraction_chain(local_llm, schema, instruction_template=instruction_template, encoder_or_encoder_class='json')
        print(chain.run(prompt)['data'])
        print("---------------")

A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
{'patient': {'age': ['58'], 'sex': ['male'], 'sign symptom': ['general fatigue', 'severe anemia'], 'biological structure': ['blood'], 'diagnostic procedure': ['blood test'], 'lab value': ['anemia', '38.4 °C']}}
---------------
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
{'patient': {'lab value': ['6.6 g/dl']}}
---------------
He had no medical history and did not take any medicine.
{'patient': {'age': ['0'], 'sex': ['M'], 'sign symptom': ['none'], 'biological structure': ['none'], 'diagnostic procedure': ['none'], 'lab value': ['38.4 °C']}}
---------------
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
{'patient': {'diagnostic procedure': ['Esophagogastroduodenoscopy', 'colonoscopy'], 'lab value': ['No significant bleeding found']}}
---------------
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine
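
Several values in the output above do not occur in the sentence they were extracted from: `'blood'`, `'none'`, and `'38.4 °C'`, which is copied from the schema example. A minimal post-filter sketch that keeps only values found verbatim (case-insensitively) in the source sentence; `keep_grounded` and its `exempt` parameter are hypothetical, and fields whose values are normalised labels such as `sex` are exempted by assumption:

```python
def keep_grounded(extraction: dict, sentence: str, exempt=("sex",)) -> dict:
    """Drop extracted values that do not occur verbatim in the sentence."""
    text = sentence.lower()
    cleaned = {}
    for field, values in extraction.get("patient", {}).items():
        if field in exempt:
            kept = list(values)  # normalised labels are not expected to appear verbatim
        else:
            kept = [value for value in values if value.lower() in text]
        if kept:
            cleaned[field] = kept
    return {"patient": cleaned}


sentence = "He had no medical history and did not take any medicine."
raw = {"patient": {"age": ["0"], "sex": ["M"], "sign symptom": ["none"], "lab value": ["38.4 °C"]}}
print(keep_grounded(raw, sentence))
```

Plain substring matching is deliberately crude: it cannot catch a hallucinated value in an exempt field, and it would wrongly drop a correct but paraphrased value.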

In [16]:
for txtAnnPair in file_tuples[:3]:
    pathToTxtFile = os.path.join(path, txtAnnPair[0])
    pathToAnnFile = os.path.join(path, txtAnnPair[1])

    # Read the whole file and close it again when done
    with open(pathToTxtFile, "r") as f:
        Txtfile = f.read()
    
    # removed = []
    # previous = -10
    # for annLine in allAnnLines.copy():
    #     currentStart = int((annLine[1].split()[1]))
    #     if currentStart == previous:
    #         removed.append(annLine)
    #         allAnnLines.remove(annLine)
    #     previous = currentStart

    for sentence in Txtfile.split("\n"):
        print(sentence)
        print("----")

A 58-year-old man had been suffering from general fatigue and severe anemia for several months.
----
His hemoglobin levels were 6.6 g/dl (normal range: 12–16 g/dl).
----
He had no medical history and did not take any medicine.
----
Esophagogastroduodenoscopy and colonoscopy did not reveal any significant bleeding.
----
Abdominal computer tomography revealed a 2-cm hypervascular tumor in the small intestine (Fig.1).
----
Oral DBE detected a 2-cm-diameter reddish, submucosal tumor-like lesion with surface ulceration in the jejunum, approximately 20 cm away from the Treitz ligament (Fig.2).
----
We did not perform biopsy because it can be difficult to stop bleeding in the case of hypervascular lesions.
----
Under the diagnosis of a small bowel tumor, gastrointestinal stromal tumor (GIST), malignant lymphoma, or cancer, we performed laparoscopic-assisted segmental resection of the jejunum with the dissection of lymph nodes.
----
Examination of the resected tumor showed that it measured 19 

# / TESTING WITH FOR LOOP

# FINE-TUNING with PEFT

In [1]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_id = 'AdaptLLM/medicine-chat'  # go for a smaller model if you don't have the VRAM
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [3]:
from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,618,112 || trainable%: 0.06220586618327525


In [8]:
def format_prompt(sample):
    return f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{sample["instruction"]}

### Input:
{sample["input"]}

### Response:
{sample["output"]}
"""

In [33]:
training_args = TrainingArguments(
    output_dir="../models/AdaptLLM/medicine-chat-Erik",
    learning_rate=1e-3,
    per_device_train_batch_size=1,
    #per_device_eval_batch_size=32,
    num_train_epochs=30,
    #weight_decay=0.01,
    #evaluation_strategy="epoch",
    #save_strategy="epoch",
    #load_best_model_at_end=True,
)

In [13]:
sample = {
    "instruction": "Your goal is to extract structured information from the user's input.",
    "input": "A 58-year-old man had been suffering from general fatigue and severe anemia for several months.",
    "output": """{'age': ['58'], 
            'sex': ['male'], 
            'sign symptom': ['general fatigue', 'severe anemia']
    }
    """
}

In [9]:
import pandas as pd
df = pd.DataFrame(columns=["instruction", "input", "output"])
df.loc[0] = ["Your goal is to extract structured information from the user's input.", "A 58-year-old man had been suffering from general fatigue and severe anemia for several months.", """{'age': ['58'], 
            'sex': ['male'], 
            'sign symptom': ['general fatigue', 'severe anemia']
    }
    """]

In [10]:
df

Unnamed: 0,instruction,input,output
0,Your goal is to extract structured information...,A 58-year-old man had been suffering from gene...,"{'age': ['58'], \n 'sex': ['male'],..."


In [38]:
df["instruction"]

0    Your goal is to extract structured information...
Name: instruction, dtype: object

In [35]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=df,                  #tokenized_datasets["train"],
    #eval_dataset=tokenized_datasets["test"],
    formatting_func=format_prompt,
    tokenizer=tokenizer,
    #data_collator=data_collator,
    #compute_metrics=compute_metrics,
    #add metrics
)

AttributeError: 'DataFrame' object has no attribute 'column_names'

In [20]:
df=[["Your goal is to extract structured information from the user's input.", "A 58-year-old man had been suffering from general fatigue and severe anemia for several months.", """{'age': ['58'], 
            'sex': ['male'], 
            'sign symptom': ['general fatigue', 'severe anemia']
    }
    """]
df[1]

['s', 'b', 'c']

In [1]:
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'instruction': ["Your goal is to extract structured information from the user's input."], 
                   'input': ["A 58-year-old man had been suffering from general fatigue and severe anemia for several months."],
                   'output' : ["{'age': ['58'],\n'sex': ['male'],\n'sign symptom': ['general fatigue', 'severe anemia']}"]
                   })
dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())

# Convert to a Hugging Face dataset
hg_dataset = Dataset.from_pandas(df)

In [2]:
hg_dataset[0]

{'instruction': "Your goal is to extract structured information from the user's input.",
 'input': 'A 58-year-old man had been suffering from general fatigue and severe anemia for several months.',
 'output': "{'age': ['58'],\n'sex': ['male'],\n'sign symptom': ['general fatigue', 'severe anemia']}"}

In [1]:
from datasets import load_dataset

dataset = load_dataset("medalpaca/medical_meadow_medqa", split="train")

print(f"Dataset Size: {len(dataset)}")
#print(dataset[randrange(len(dataset))])

Dataset Size: 10178


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


model_name = 'AdaptLLM/medicine-chat'
use_flash_attention = False

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    use_flash_attention_2=use_flash_attention,
    device_map="auto",
    torch_dtype=torch.float16
)

model.config.pretraining_tp = 1

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [11]:
def format_prompt(sample):
    # The template is written flush-left so that no source indentation leaks into the prompt text
    return [f"""<s>[INST] <<SYS>>
Below is an instruction that describes a task.
<</SYS>>
### Instruction:
{sample["instruction"]}

### Input:
{sample["input"]}

### Response:
{sample["output"]}
[/INST]
"""]

In [5]:
format_prompt(hg_dataset[0])

"<s>[INST] <<SYS>>\nBelow is an instruction that describes a task.\n<</SYS>>\n### Instruction:\nYour goal is to extract structured information from the user's input.\n\n### Input:\nA 58-year-old man had been suffering from general fatigue and severe anemia for several months.\n\n### Response:\n{'age': ['58'],\n'sex': ['male'],\n'sign symptom': ['general fatigue', 'severe anemia']}\n[/INST]\n"

In [6]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
)
# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

In [18]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="../models/AdaptLLM/medicine-chat-Erik",
    num_train_epochs=30,
    #per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    #optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=False
)

In [19]:
from trl import SFTTrainer

max_seq_length = 100 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=hg_dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=False,
    formatting_func=format_prompt,
    args=args,
)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
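
The `formatting_func` passed to `SFTTrainer` above returns a one-element list, which only lines up with a dataset of a single row. A sketch of a variant that accepts either one row (dict of strings) or a batch (dict of lists) and returns one prompt per example. That the trainer hands the function a batch when `packing=False` is an assumption about the installed `trl` version; `TEMPLATE` and `format_prompts` are hypothetical names:

```python
TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "Below is an instruction that describes a task.\n"
    "<</SYS>>\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}\n"
    "[/INST]\n"
)


def format_prompts(batch: dict) -> list:
    """Format one row (dict of str) or a batch (dict of lists) into a list of prompts."""
    if isinstance(batch["instruction"], str):
        # Promote a single row to a batch of one
        batch = {key: [value] for key, value in batch.items()}
    rows = zip(batch["instruction"], batch["input"], batch["output"])
    return [TEMPLATE.format(instruction=i, input=x, output=o) for i, x, o in rows]


demo_batch = {
    "instruction": ["Extract the age.", "Extract the sex."],
    "input": ["A 58-year-old man.", "A 50-year-old woman."],
    "output": ["{'age': ['58']}", "{'sex': ['female']}"],
}
print(len(format_prompts(demo_batch)))
```

The template keeps the response before `[/INST]`, as in the notebook's own `format_prompt`; whether that placement suits the chat format is a separate question this sketch does not settle.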


In [13]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.8101
20,0.3906
30,0.2227


Checkpoint destination directory ../models/AdaptLLM/medicine-chat-Erik/checkpoint-1 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ../models/AdaptLLM/medicine-chat-Erik/checkpoint-2 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ../models/AdaptLLM/medicine-chat-Erik/checkpoint-3 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ../models/AdaptLLM/medicine-chat-Erik/checkpoint-4 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ../models/AdaptLLM/medicine-chat-Erik/checkpoint-5 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ../models/AdaptLLM/medicine-chat-Erik/checkpoint-6 already exists and is non-empty.Saving will proceed but saved resu

TrainOutput(global_step=30, training_loss=0.807794459660848, metrics={'train_runtime': 51.815, 'train_samples_per_second': 0.579, 'train_steps_per_second': 0.579, 'total_flos': 119083253760000.0, 'train_loss': 0.807794459660848, 'epoch': 30.0})
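
The `global_step=30` reported above is consistent with one optimizer step per epoch: the dataset holds a single example, so each of the 30 epochs yields exactly one batch. The arithmetic can be sketched roughly as below. The helper name `total_optimizer_steps` is hypothetical, the sketch assumes a single device, and it uses 8 as the batch size because that is the `TrainingArguments` default when `per_device_train_batch_size` is left unset.

```python
import math


def total_optimizer_steps(num_examples, epochs, per_device_batch_size=8,
                          grad_accum_steps=1, num_devices=1):
    """Rough estimate of optimizer steps for a run (hypothetical helper)."""
    # Number of batches the dataloader yields per epoch (last batch may be partial).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Each optimizer step consumes `grad_accum_steps` batches; at least one step per epoch.
    steps_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    return steps_per_epoch * epochs


print(total_optimizer_steps(num_examples=1, epochs=30))
print(total_optimizer_steps(num_examples=180, epochs=10, per_device_batch_size=2))
```

Under the same assumptions, a run over 180 training examples with a batch size of 2 for 10 epochs would need 900 steps if left to finish.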

In [16]:
prompt = """<s>[INST] <<SYS>>
Below is an instruction that describes a task.
<</SYS>>
### Instruction:
Your goal is to extract structured information from the user's input.

### Input:
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.

### Response:
[/INST]
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = model.generate(input_ids=input_ids, max_new_tokens=100)



In [None]:
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]



In [75]:
answer_start = int(input_ids.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
print(pred)




In [15]:
print(f'### User Input:\n{prompt}\n\n### Assistant Output:\n{pred}')

NameError: name 'pred' is not defined

In [17]:
print(f"Generated Response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}\n")

Generated Response:
# Below is the structured response.

Structured Response:

Structured Response:

{'age': ['58'],
 'sex': ['Male'],
 'symptoms': ['General fatigue'],
 'severity': ['Severe'],
 'duration': ['Several months'],
 'diagnosis': ['Anemia'],
 'treatment': ['Your doctor may recommend a blood transfusion to treat your



In [50]:
print(outputs)

tensor([[    1, 29871,    13, 21140,   340,   338,   385, 15278,   393, 16612,
           263,  3414, 29892,  3300,  2859,   411,   385,  1881,   393,  8128,
          4340,  3030, 29889, 14350,   263,  2933,   393,  7128,  2486,  1614,
          2167,   278,  2009, 29889,    13,    13,  2277, 29937,  2799,  4080,
         29901,    13, 10858,  7306,   338,   304,  6597,  2281,  2955,  2472,
           515,   278,  1404, 29915, 29879,  1881, 29889,    13,    13,  2277,
         29937, 10567, 29901,    13, 29909, 29871, 29945, 29947, 29899,  6360,
         29899,  1025,   767,   750,  1063, 23164,   515,  2498,  9950, 12137,
           322, 22261,   385, 29747,   363,  3196,  7378, 29889,    13,    13,
          2277, 29937, 13291, 29901,    13, 10998,   482,  2396,  6024, 29945,
         29947,  7464,  6024, 14167,  2396,  6024, 19202,  2033,  1118, 11117,
           273, 29747,  2396,  6024,   344,  9359,  2033,  1118, 11117,  2484,
           271,   358,  2396,   518, 10998,   262,  

In [None]:
df = [["Your goal is to extract structured information from the user's input.", "A 58-year-old man had been suffering from general fatigue and severe anemia for several months.", """{'age': ['58'], 
            'sex': ['male'], 
            'sign symptom': ['general fatigue', 'severe anemia']
    }
    """]]

Dataset Size: 10178


In [53]:
type(dataset)

datasets.arrow_dataset.Dataset

# Train on completions only
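
Training on completions only means the loss is computed on the tokens after the response template (`### Output:` in the cells below), while the instruction and input tokens are masked out. A minimal, library-free sketch of that masking idea follows. The function name `mask_prompt_labels` and the toy token IDs are illustrative assumptions, not the internals of `DataCollatorForCompletionOnlyLM`; the only fact relied on is that `-100` is the label value the cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, response_template_ids):
    """Copy input_ids into labels and hide everything up to and including
    the first occurrence of the response template (hypothetical helper)."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: in this sketch the whole example is ignored.
    return [IGNORE_INDEX] * len(input_ids)


# Toy IDs: [prompt..., template (90, 91), completion..., eos]
toy_ids = [5, 6, 7, 90, 91, 42, 43, 2]
print(mask_prompt_labels(toy_ids, [90, 91]))
```

One practical caveat, stated as an assumption to verify: a template string can tokenize differently depending on the characters before it, so it is worth checking that the token IDs of `### Output:` really occur inside the tokenized training text.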

In [1]:
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# TODO: check the LoRA config parameters: what they mean and what they affect
peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=32, lora_dropout=0.1)

model_id = 'AdaptLLM/medicine-chat'  # go for a smaller model if you don't have the VRAM
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True) # , peft_config=peft_config
tokenizer = AutoTokenizer.from_pretrained(model_id)

#from peft import get_peft_model

#model = get_peft_model(model, peft_config)
#model.print_trainable_parameters()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [1]:
#Dataset
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'input': ["A 58-year-old man had been suffering from general fatigue and severe anemia for several months."],
                   'output' : ["{'age': ['58'],\n'sex': ['male'],\n'sign symptom': ['general fatigue', 'severe anemia']}"],
                   })
dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())

### convert to Huggingface dataset
hg_dataset = Dataset(pa.Table.from_pandas(df))
print(hg_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 1
})


In [2]:
from datasets import load_from_disk

hg_dataset = load_from_disk("../data/dataset")

In [3]:
hg_dataset = hg_dataset.train_test_split(train_size=0.9)

In [4]:
hg_dataset["train"]

Dataset({
    features: ['input', 'output'],
    num_rows: 180
})

In [5]:
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['input'])):
        text = f"### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not match: Age, Sex, Biological_structure, Sign_symptom, Diagnostic_procedure, Detailed_description or Lab_value.\n### Input: {example['input'][i]}\n ### Output: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = "### Output:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# TODO: play around with the training arguments, specifically the optimizer ones (learning rate, batch size, epochs)
# Try training on only one example at a time (batch size of 1)
args = TrainingArguments(
    output_dir="../models/AdaptLLM/medicine-chat-Erik",
    learning_rate=2e-4,
    num_train_epochs=10,
    logging_steps=10,
    gradient_checkpointing=True,
    per_device_train_batch_size = 2,
    #save_strategy="epoch",
)

# TODO: check the SFTTrainer parameters and how to train the model
trainer = SFTTrainer(
    model,
    train_dataset=hg_dataset["train"],
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=2048,
    args=args,
    peft_config=peft_config,
)

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [6]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,0.7466
20,0.6065
30,0.4993
40,0.4838
50,0.4559
60,0.4371
70,0.4335


KeyboardInterrupt: 

In [None]:
model.save_pretrained("../models/AdaptLLM/medicine-chat-Erik")

In [1]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
import tqdm

device = "cuda"

base_model_name = "AdaptLLM/medicine-chat"  # path to your model, or its name on the Hub
adapter_model_name = "../models/AdaptLLM/medicine-chat-Erik-shuffled"

model = AutoModelForCausalLM.from_pretrained(base_model_name, load_in_4bit=True)
model = PeftModel.from_pretrained(model, adapter_model_name).to(device)
#model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


In [6]:
prompt = """### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not match: Age, Sex, Biological_structure, Sign_symptom, Diagnostic_procedure, Detailed_description or Lab_value.
### Input: Emily Johnson, a 42 year old woman, visited the clinic reporting persistent abdominal pain in the upper right quadrant, escalating over the past month. Alongside the discomfort, she experienced sporadic nausea and vomiting. During the examination, a diagnostic ultrasound revealed gallstones in the gallbladder, indicative of cholecystitis.

### Output:
"""

In [10]:
from jsonformer import Jsonformer

json_schema = {
    "type": "object",
    "properties": {
        "age": {"type": "number"},
        "sex": {"type": "string"},
        #"Sign_symptom": "array",
        #"Sign_symptom": {
        #    "type": "array",
        #    "items": {"type": "string"}
        #}
    }
}

#prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)

TypeError: generate() takes 1 positional argument but 2 were given

In [1]:
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)

config.json:   0%|          | 0.00/818 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/23.8G [00:00<?, ?B/s]

KeyboardInterrupt: 

In [3]:
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=1000)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
print(pred)




1. Age: 42
2. Sex: Female
3. Biological_structure: Gallbladder
4. Sign_symptom: Abdominal pain
5. Diagnostic_procedure: Ultrasound
6. Detailed_description: Gallstones in the gallbladder
7. Lab_value: Not provided

### Hints:

1. Age: The patient's age is 42 years old.
2. Sex: The patient is a female.
3. Biological_structure: The patient's abdominal pain is caused by gallstones in the gallbladder.
4. Sign_symptom: The patient is experiencing abdominal pain and nausea.
5. Diagnostic_procedure: An ultrasound was performed to diagnose the patient's condition.
6. Detailed_description: The patient has gallstones in the gallbladder.
7. Lab_value: Not provided in the input.


In [10]:
pred = pred.replace("\'", "\"")
sep = "}"
pred = pred.split(sep, 1)[0] + sep
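
Replacing every single quote and cutting at the first `}` works only when the model emits one flat dictionary and no value contains an apostrophe. A slightly more defensive sketch is shown below: it scans for the first balanced `{...}` span and tries `json.loads` first, then `ast.literal_eval` for Python-style single-quoted dictionaries. The helper names are hypothetical, and braces inside string values are not handled.

```python
import ast
import json


def _parse_dict(candidate):
    """Try JSON first, then a Python literal; return None if neither works."""
    for parser in (json.loads, ast.literal_eval):
        try:
            return parser(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue
    return None


def first_dict_in(text):
    """Return the first balanced {...} span in `text` parsed as a dict, else None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for end in range(start, len(text)):
            if text[end] == "{":
                depth += 1
            elif text[end] == "}":
                depth -= 1
                if depth == 0:
                    parsed = _parse_dict(text[start:end + 1])
                    if isinstance(parsed, dict):
                        return parsed
                    break
        # This opening brace did not lead to a parseable dict; try the next one.
        start = text.find("{", start + 1)
    return None


print(first_dict_in("noise {'age': ['58'], 'sex': ['male']} trailing text"))
print(first_dict_in("1. Age: 42\n2. Sex: Female"))
```

Returning `None` instead of raising makes it easy to count how often the model produced no parseable dictionary at all, as with the numbered-list output above.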

In [6]:
pred

"\n1. Age: 42\n2. Sex: Female\n3. Biological_structure: Gallbladder\n4. Sign_symptom: Abdominal pain\n5. Diagnostic_procedure: Ultrasound\n6. Detailed_description: Gallstones in the gallbladder\n7. Lab_value: Not provided\n\n### Hints:\n\n1. Age: The patient's age is 42 years old.\n2. Sex: The patient is a female.\n3. Biological_structure: The patient's abdominal pain is caused by gallstones in the gallbladder.\n4. Sign_symptom: The patient is experiencing abdominal pain and nausea.\n5. Diagnostic_procedure: An ultrasound was performed to diagnose the patient's condition.\n6. Detailed_description: The patient has gallstones in the gallbladder.\n7. Lab_value: Not provided in the input."

In [21]:
import json

#employee_string = '{"name": "erik"}'
json_pred = json.loads(pred)

In [22]:
json_pred["Biological_structure"]

['abdominal', 'gallbladder', 'upper right quadrant']

NameError: name 'json' is not defined

In [2]:
from datasets import load_from_disk

hg_dataset = load_from_disk("../data/datasetTestShuffledEntireAllEntities")
#hg_dataset = hg_dataset.train_test_split(train_size=0.9)

#test_dataset=hg_dataset["test"]
test_dataset = hg_dataset

In [4]:
test_dataset

Dataset({
    features: ['input', 'output', 'keys'],
    num_rows: 20
})

In [41]:
# This gives the expected output to compare with the generated output: for each expected item, check whether it appears in the generated one.
#test_dataset[0]["output"]

In [51]:
def formatting_input_func(example):
    return f"### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not match: Age, Sex, Biological_structure, Sign_symptom, Diagnostic_procedure, Detailed_description, Lab_value.\n### Input: {example['input']}\n ### Output: "

In [110]:
def formatting_input_func(example):
    return f"### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entities: Lab_value. Do not add any extra entities.\n\n### Input:\n{example['input']}\n\n### Output: "

In [16]:
formatting_input_func(test_dataset[0])

"### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not match: Age, Sex, Biological_structure, Sign_symptom, Diagnostic_procedure, Detailed_description or Lab_value.\n### Input: Our patient was a 7-year-old Italian boy born after an uneventful gestation of normal duration.\nAt the age of 16 months, he presented with a clinically evident enlarged abdomen and was referred for oncological examination.\nInitial tests revealed anemia, thrombocytopenia, and splenomegaly.\nA bone marrow biopsy revealed the presence of foam cells, which led to suspicion of lysosomal storage disease.\nBiochemical testing revealed elevated level of acid phosphatase (47.8 IU/L [normal range 5–7 IU/L]) and chitotriosidase activity (508 nmol/mg protein [normal range 5.9–41.0 nmol/mg protein]), as well as reduced beta-glucosidase activity (2 nmol/mg/p

In [143]:
test_dataset['input'][0]

"A 68-year-old man was referred by his optometrist to HES with suspected LTG due to repeatedly irregular visual field test results, advanced optic disc cupping, normal intraocular pressures (IOPs) and a family history of glaucoma.\nThe patient subjectively felt that vision in his ‘good’ left eye (LE), which normally had a visual acuity of 6/6 N5, started to deteriorate 6\u2005months earlier; at the point of referral it was best corrected to 6/7.5 N6.\nHis right eye (RE) was known to be amblyopic with a visual acuity of 6/18 N12.\nHis medical history included considerable risk factors for systemic vasculopathy, such as hypertension, hypercholesterolaemia, 50 pack-years of smoking and type 2 diabetes with no diabetic retinopathy.\nDespite detailed questioning, he denied any new systemic symptoms apart from experiencing increased lethargy.\nClinical examination at HES revealed advanced bilateral cupped optic discs with a cup-to-disc ratio right 0.9 (90%) and left 0.8 (80%; figure 1).\nFur

In [1]:
import sys
sys.path.append("..")

from scripts.prompts import prompts

prompt = prompts.age("testing!")
prompt

"### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entity: Age. Do not add any extra entities. Example: {'age': ['32-year-old']}\n\n### Input:\ntesting!\n\n### Output:"

In [4]:
import sys
sys.path.append("..")

from scripts.generate import generate

gene = generate.gene(model, tokenizer, prompt)
json_pred = generate.json_format(gene)
print(json_pred)

{'Age': ['68-year-old']}


In [3]:
%load_ext autoreload
%autoreload

In [10]:
["None"]

['None']

In [12]:
import sys
import importlib
sys.path.append("..")

from scripts.generate import generate

compare = ""
#entities = ["Age", "Sex", "Sign_symptom", "Lab_value"]
#entities = ["Age", "Sex"]
#entities = ["Age", "Biological_structure", "Sex"]
#entities = ["Sign_symptom"]
entities = ["Age", "Sex", "Sign_symptom", "Lab_value", "Biological_structure", "Diagnostic_procedure"]

text = test_dataset['input'][12]
extract = generate.extract(model, tokenizer, entities, text)

for x in test_dataset['output'][12]:
        print(x)
        if x not in extract:
            print("----------")
            continue


        if test_dataset['output'][12][x] is None:
            real = ["None"]
        else:
            real = list(set(test_dataset['output'][12][x]))

        gene = list(set(extract[x]))

        match = list(set([r for r in real if r in gene]))

        print(real)
        print(gene)
        print(match)

        lenmatch = len(match)
        lenreal = len(real)
        lengene = len(gene)
            
        text = f"{x} \nReal: {test_dataset['output'][12][x]} \nGene: {extract[x]} \nMatch: {match} \n\n"
        compare = compare + text

        # Guard against empty lists to avoid ZeroDivisionError
        precision = lenmatch/lengene if lengene else 0
        recall = lenmatch/lenreal if lenreal else 0
        if precision == 0 and recall == 0:
            f1 = 0
        else:
            f1 = 2*(precision*recall) / (precision+recall)
        print(precision, recall, f1)
        print("----------------------")
    #print(precision, recall, f1)

    #F1score[x].append(f1) 

Something went wrong with json.loads(pred)

{"Sign_symptom": ["CD20+ B lymphocyte infiltrate", "arteriolar thrombus", "ascites", "biologic manifestation", "biologic sign of hemolysis", "deposit", "diarrhea", "diarrhea", "edema", "enlargement", "follicles", "hematuria", "hemiplegia", "histological lesions", "histological lesions", "hyalinization", "hyaline-vascular multicentric Castleman disease", "infiltrate", "infection", "ischemic brain lesions", "ischemic lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions", "lesions",
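
The output above shows the failure mode behind the parse error: the model falls into a repetition loop and generation stops before the list and the object are closed. One way to keep the usable part of such an output is sketched below. The helper `salvage_string_list` is hypothetical and assumes the single-key shape `{"Key": ["item", ...` seen here; it collects every fully quoted string, treats the first as the key, and removes duplicates while preserving order.

```python
import re

# A double-quoted string, allowing backslash escapes inside it.
STRING_ITEM = re.compile(r'"((?:[^"\\]|\\.)*)"')


def salvage_string_list(raw):
    """Recover {key: unique complete items} from a truncated single-key JSON object."""
    strings = [m.group(1) for m in STRING_ITEM.finditer(raw)]
    if not strings:
        return {}
    key, items = strings[0], strings[1:]
    # dict.fromkeys keeps first occurrences in order, dropping repeats.
    # Note: escape sequences inside items are kept as raw text, not decoded.
    return {key: list(dict.fromkeys(items))}


truncated = '{"Sign_symptom": ["ascites", "lesions", "lesions", "lesions", "les'
print(salvage_string_list(truncated))
```

This only repairs the symptom. Generation options such as `repetition_penalty` or `no_repeat_ngram_size` in `model.generate` may be worth trying to reduce the looping itself.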

In [12]:
import sys
import importlib
sys.path.append("..")

from scripts.generate import generate

compare = ""
#entities = ["Age", "Sex", "Sign_symptom", "Lab_value"]
#entities = ["Age", "Sex"]
#entities = ["Age", "Biological_structure", "Sex"]
#entities = ["Sign_symptom"]
entities = ["Age", "Sex", "Sign_symptom", "Lab_value", "Biological_structure", "Diagnostic_procedure"]

text = test_dataset['input'][13]
extract = generate.extract(model, tokenizer, entities, text)

for x in json.loads(test_dataset['output'][13]):
        print(x)
        if x not in extract:
            print("----------")
            continue
           

        real = list(set(json.loads(test_dataset['output'][13])[x]))
        gene = list(set(extract[x]))

        match = list(set([r for r in real if r in gene]))

        print(real)
        print(gene)
        print(match)

        lenmatch = len(match)
        lenreal = len(real)
        lengene = len(gene)
            
        text = f"{x} \nReal: {json.loads(test_dataset['output'][13])[x]} \nGene: {extract[x]} \nMatch: {match} \n\n"
        compare = compare + text

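        # Guard against empty lists and a zero denominator
        precision = lenmatch/lengene if lengene else 0
        recall = lenmatch/lenreal if lenreal else 0
        f1 = 2*(precision*recall) / (precision+recall) if (precision+recall) else 0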
        print(precision, recall, f1)
        print("----------------------")
    #print(precision, recall, f1)

    #F1score[x].append(f1) 

Biological_structure
['chest', 'brain', 'skin', 'quadriceps muscle', 'cardiac', 'muscle', 'Head', 'Chest', 'subcortical', 'ventricles', 'Eye', 'cortical', 'head', 'white matter cortical', 'occipital region', 'Brain', 'pulmonary']
['chest', 'ventricular', 'cardiac', 'Chest', 'ventricles', 'occipital region', 'Brain']
['chest', 'cardiac', 'Chest', 'ventricles', 'occipital region', 'Brain']
0.8571428571428571 0.35294117647058826 0.5
----------------------
Detailed_description
----------
Diagnostic_procedure
['Birth weight', 'head circumference', 'lactate', 'sequencing', 'ultrasound', 'Physical exam', 'alanine', 'cytochrome c oxidase stain', 'US', 'Echo', 'organic acids', 'CT', 'ABR', 'biopsy', 'Hearing test', 'Metabolic work up', 'Pathology', 'acylcarnitines', 'ammonia', 'X-ray', 'Apgar', 'examination']
['head circumference', 'lactate', 'Physical exam', 'alanine', 'cytochrome c oxidase stain', 'lactic acidosis', 'Echo', 'CT', 'ABR', 'biopsy', 'Hearing test', 'Metabolic work up', 'Patholog

In [9]:
extract

{'Age': ['15 weeks',
  '2 years',
  '2 years',
  '30 months',
  '4 months',
  '4 months',
  'age 10 weeks',
  'age 2 years',
  'age 35 days',
  'age 4 months',
  'age 5 days',
  'first months',
  'first years',
  'fourth child'],
 'Sex': ['male'],
 'Sign_symptom': ['cardiomegaly',
  'dilatation',
  'dilatation',
  'febrile illness',
  'febres',
  'hypoxemia',
  'hypertrophic',
  'hypertrophy',
  'hypertrophy',
  'hypertrophy',
  'hypotonia',
  'infections',
  'left ventricular hypertrophy',
  'left ventricular hypertrophy',
  'metabolic acidosis',
  'murmur',
  'pulmonary edema',
  'pulmonary hypertension',
  'pulmonary hypertension',
  'shunt malfunctioning',
  'subcortial and white matter cortical hemorrhage'],
 'Lab_value': ['10',
  '13 centimeters',
  '277\u2009μm',
  '9',
  'grew rapidly',
  'normal',
  'normal',
  'normal',
  'rapidly',
  'subcortial and white matter'],
 'Biological_structure': ['Brain',
  'Chest',
  'cardiac',
  'chest',
  'occipital region',
  'ventricles',
  '

In [23]:
output = {}
output.update({'Age': ['16-year-old'], 'Sex': ['female']})
output.update({})
output

{'Age': ['16-year-old'], 'Sex': ['female']}

In [8]:
test_dataset['input'][15]

'A 16-year-old female suffered from an abdominal pain at right upper quadrant, lasting for more than 10 days.\nShortly ahead of her medical consultation, 4 ascaris-like worms were vomited out, individually with a length of about 10\u200acm.\nShe had a previous onset of ascariasis, during which low-grade fever (38.4°C) occurred without apparent jaundice, diarrhea, or anemia.\nPhysical examination revealed tenderness at right upper quadrant.\nHer total leukocyte count was 11.2\u200aG/L consisting of 5.2% eosinophils.\nSerum and urine amylase were 386 and 928\u200aU/L, respectively.\nIn terms of liver functionality, the level of total bilirubin rose up to 23.2\u200aμm/L, meanwhile the hepatic enzymes were similarly elevated (alanine aminotransferase 163\u200aU/L; aspartate aminotransferase 96\u200aU/L).\nAbdominal ultrasound described the enlargement of the gallbladder, upper segment of common bile duct (1.5\u200acm in diameter) and intrahepatic bile duct (1.3\u200acm in diameter).\nFurth

In [4]:
extract

{'Age': ['68-year-old'], 'Sex': ['man']}

In [13]:
import sys
sys.path.append("..")

from scripts.generate import generate
from collections import defaultdict

F1score = defaultdict(list)

compare = ""
#entities = ["Age", "Sex"]
#entities = ["Age", "Sex", "Sign_symptom", "Lab_value"]
entities = ["Age", "Sex", "Sign_symptom", "Lab_value", "Biological_structure", "Diagnostic_procedure"]

for instance in range(len(test_dataset['input'])):
    text = test_dataset['input'][instance]
    extract = generate.extract(model, tokenizer, entities, text, max_length=2500)

    compare = compare + test_dataset[instance]["input"] + "\n\n"
    
    for x in json.loads(test_dataset['output'][instance]):

        if x not in extract:
            continue

        real = list(set(json.loads(test_dataset['output'][instance])[x]))
        gene = list(set(extract[x]))

        match = list(set([r for r in real if r in gene]))

        lenmatch = len(match)
        lenreal = len(real)
        lengene = len(gene)
            
        text = f"{x} \nReal: {json.loads(test_dataset['output'][instance])[x]} \nGene: {extract[x]} \nMatch: {match} \n\n"
        compare = compare + text

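        # Guard against empty lists and a zero denominator
        precision = lenmatch/lengene if lengene else 0
        recall = lenmatch/lenreal if lenreal else 0
        f1 = 2*(precision*recall) / (precision+recall) if (precision+recall) else 0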
        #print(precision, recall, f1)

        F1score[x].append(f1) 
            
            
    compare = compare + "\n\n\n"

Something went wrong with json.loads(pred)

{"Biological_structure": ["LAD", "LV", "LV", "Left ITA", "RA", "RA", "RITA", "RITA", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "LV", "

In [14]:
F1score

defaultdict(list,
            {'Age': [0.6666666666666666,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              0.6666666666666666,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0,
              1.0],
             'Biological_structure': [0.631578947368421,
              0.4615384615384615,
              0.9166666666666666,
              0.5384615384615384,
              0.7272727272727272,
              0.4615384615384615,
              0.8,
              0.6000000000000001,
              0.6666666666666666,
              0.6666666666666667,
              0.4,
              0.4444444444444444,
              0.5,
              0.5384615384615384,
              0.6,
              0.6153846153846153,
              0.5,
              0.6666666666666667,
              0.7586206

In [15]:
def averageF1Dict(di):
    return {k:sum(v)/len(v) for k,v in di.items()}

averageF1Dict(F1score)

{'Age': 0.9649122807017543,
 'Biological_structure': 0.6049456889890551,
 'Diagnostic_procedure': 0.6023019411022081,
 'Lab_value': 0.6147385868218805,
 'Sex': 1.0,
 'Sign_symptom': 0.5178172434787496}
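
The scores above are exact-match, set-based precision, recall, and F1 per entity type, averaged over test instances. A compact sketch of the same computation, with guards for empty lists and for the case where nothing matches, is given below. The function names `set_prf` and `mean_per_key` are hypothetical.

```python
def set_prf(real, gene):
    """Exact-match precision, recall and F1 between two lists, compared as sets."""
    real_set, gene_set = set(real), set(gene)
    matches = len(real_set & gene_set)
    precision = matches / len(gene_set) if gene_set else 0.0
    recall = matches / len(real_set) if real_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_per_key(scores):
    """Average each list of scores, skipping keys that have no scores."""
    return {key: sum(vals) / len(vals) for key, vals in scores.items() if vals}


# 17 reference items, 7 generated items, 6 of them in common.
real = [f"item{i}" for i in range(17)]
gene = real[:6] + ["extra"]
print(set_prf(real, gene))
print(mean_per_key({"Age": [1.0, 0.5], "Sex": []}))
```

With 6 matches out of 7 generated and 17 reference items, the F1 is 2·6/(7+17) = 0.5, which matches the `Biological_structure` line printed earlier in the notebook.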

In [None]:
# Use a context manager so the file is closed even if the write fails
with open("../results/comparePerEntity2.txt", "w") as f:
    f.write(compare)

In [16]:
from scripts.prompts import prompts

In [61]:
entities = ["Age", "Sex", "Sign_symptom", "Lab_value", "Biological_structure", "Diagnostic_procedure"]

prompt = prompts.lab_value(text)
ggenerated = generate.gene(model, tokenizer, prompt, max_length=2500)
print(ggenerated)


{"Lab_value": ["1.3 cm in diameter", "1.5 cm in diameter", "10 cm", "11.2 G/L", "163 U/L", "23.2 μm/L", "38.4°C", "386", "928 U/L", "5.2% eosinophils", "96 U/L", "rose up", "similarly elevated"]}


In [62]:
generate.json_format(ggenerated)

{'Lab_value': ['1.3\u200acm in diameter',
  '1.5\u200acm in diameter',
  '10\u200acm',
  '11.2\u200aG/L',
  '163\u200aU/L',
  '23.2\u200aμm/L',
  '38.4°C',
  '386',
  '928\u200aU/L',
  '5.2% eosinophils',
  '96\u200aU/L',
  'rose up',
  'similarly elevated']}
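
The parsed values above contain hair spaces (`\u200a`) copied from the source text, and earlier outputs mix `Chest` with `chest`. Because the evaluation uses exact string matching, such differences can count as misses. A possible normalization step is sketched below; `normalize_entity` is a hypothetical helper, and lowercasing is an assumption that changes what counts as a match, so it should be applied to both the reference and the generated lists or not at all.

```python
import unicodedata


def normalize_entity(value):
    """Normalize an extracted string before exact-match comparison."""
    # NFKC maps compatibility characters (thin/hair spaces, micro sign) to plain forms.
    text = unicodedata.normalize("NFKC", value)
    # Collapse any run of whitespace into one space and trim the ends.
    text = " ".join(text.split())
    # Case-insensitive comparison; drop this line to keep case distinctions.
    return text.casefold()


print(normalize_entity("10\u200acm") == normalize_entity("10 cm"))
print(normalize_entity("Chest") == normalize_entity("chest"))
```

Applied inside the comparison loop, this would be used as `set(map(normalize_entity, real))` and `set(map(normalize_entity, gene))`.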

In [59]:
print(text)

A 16-year-old female suffered from an abdominal pain at right upper quadrant, lasting for more than 10 days.
Shortly ahead of her medical consultation, 4 ascaris-like worms were vomited out, individually with a length of about 10 cm.
She had a previous onset of ascariasis, during which low-grade fever (38.4°C) occurred without apparent jaundice, diarrhea, or anemia.
Physical examination revealed tenderness at right upper quadrant.
Her total leukocyte count was 11.2 G/L consisting of 5.2% eosinophils.
Serum and urine amylase were 386 and 928 U/L, respectively.
In terms of liver functionality, the level of total bilirubin rose up to 23.2 μm/L, meanwhile the hepatic enzymes were similarly elevated (alanine aminotransferase 163 U/L; aspartate aminotransferase 96 U/L).
Abdominal ultrasound described the enlargement of the gallbladder, upper segment of common bile duct (1.5 cm in diameter) and intrahepatic bile duct (1.3 cm in diameter).
Furthermore, the intrahepatic bile duct was also disco

In [4]:
print(compare)

A 68-year-old man was referred by his optometrist to HES with suspected LTG due to repeatedly irregular visual field test results, advanced optic disc cupping, normal intraocular pressures (IOPs) and a family history of glaucoma.
The patient subjectively felt that vision in his ‘good’ left eye (LE), which normally had a visual acuity of 6/6 N5, started to deteriorate 6 months earlier; at the point of referral it was best corrected to 6/7.5 N6.
His right eye (RE) was known to be amblyopic with a visual acuity of 6/18 N12.
His medical history included considerable risk factors for systemic vasculopathy, such as hypertension, hypercholesterolaemia, 50 pack-years of smoking and type 2 diabetes with no diabetic retinopathy.
Despite detailed questioning, he denied any new systemic symptoms apart from experiencing increased lethargy.
Clinical examination at HES revealed advanced bilateral cupped optic discs with a cup-to-disc ratio right 0.9 (90%) and left 0.8 (80%; figure 1).
Furthermore, th

In [52]:
text

'A 16-year-old female suffered from an abdominal pain at right upper quadrant, lasting for more than 10 days.\nShortly ahead of her medical consultation, 4 ascaris-like worms were vomited out, individually with a length of about 10\u200acm.\nShe had a previous onset of ascariasis, during which low-grade fever (38.4°C) occurred without apparent jaundice, diarrhea, or anemia.\nPhysical examination revealed tenderness at right upper quadrant.\nHer total leukocyte count was 11.2\u200aG/L consisting of 5.2% eosinophils.\nSerum and urine amylase were 386 and 928\u200aU/L, respectively.\nIn terms of liver functionality, the level of total bilirubin rose up to 23.2\u200aμm/L, meanwhile the hepatic enzymes were similarly elevated (alanine aminotransferase 163\u200aU/L; aspartate aminotransferase 96\u200aU/L).\nAbdominal ultrasound described the enlargement of the gallbladder, upper segment of common bile duct (1.5\u200acm in diameter) and intrahepatic bile duct (1.3\u200acm in diameter).\nFurth

In [111]:
from collections import defaultdict

F1score = defaultdict(list)
compare = ""

for instance in range(len(test_dataset['input'][:1])):
    #for x in json.loads(test_dataset['output'][instance]):
    #    print(x)
    prompt = formatting_input_func(test_dataset[instance])
    compare = compare + test_dataset[instance]["input"] + "\n\n"
    #print(prompt)

    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    outputs = model.generate(input_ids=inputs, max_length=2000)[0]

    answer_start = int(inputs.shape[-1])
    pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

    pred = pred.replace("\'", "\"")
    sep = "}"
    pred = pred.split(sep, 1)[0] + sep
    pred = pred.replace("\\"," ")
 
    try:
        json_pred = json.loads(pred)
    except json.JSONDecodeError:
        print("Something went wrong with json.loads(pred)")
    else:
        
        precision = 0
        recall = 0
        # build a dict of lists holding the F1 score for each attribute separately
        for x in json.loads(test_dataset['output'][instance]):

            if x not in json_pred:
                continue

            real = json.loads(test_dataset['output'][instance])[x]
            gene = json_pred[x]

            match = [r for r in real if r in gene]


            lenmatch = len(match)
            lenreal = len(real)
            lengene = len(gene)
            
            text = f"{x} \nReal: {json.loads(test_dataset['output'][instance])[x]} \nGene: {json_pred[x]} \nMatch: {match} \n\n"
            compare = compare + text

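            # guard against empty lists and zero overlap to avoid ZeroDivisionError
            precision = lenmatch/lengene if lengene else 0
            recall = lenmatch/lenreal if lenreal else 0
            f1 = 2*(precision*recall) / (precision+recall) if (precision+recall) else 0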
            #print(precision, recall, f1)

            F1score[x].append(f1) 
            
            
        compare = compare + "\n\n\n"
    #print(pred)
    #print(test_dataset['output'][instance])

In [78]:
from collections import defaultdict

F1score = defaultdict(list)
compare = ""

for instance in range(len(test_dataset['input'][:1])):
    prompt = formatting_input_func(test_dataset[instance])
    compare = compare + test_dataset[instance]["input"] + "\n\n"
    #print(prompt)

    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    outputs = model.generate(input_ids=inputs, max_length=2000)[0]

    answer_start = int(inputs.shape[-1])
    pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

    print(pred)
    pred = pred.replace("\'", "\"")
    sep = "}"
    pred = pred.split(sep, 1)[0] + sep
    pred = pred.replace("\\"," ")
 
    try:
        json_pred = json.loads(pred)
    except json.JSONDecodeError:
        print("Something went wrong with json.loads(pred)")
    else:
        
        precision = 0
        recall = 0
        # build a dict of lists holding the F1 score for each attribute separately
        for x in test_dataset['output'][instance]:
            print(x)
        
            real = test_dataset['output'][instance][x]
            gene = json_pred[x]

            match = [r for r in real if r in gene]

            lenmatch = len(match)
            lenreal = len(real)
            lengene = len(gene)
            
            text = f"{x} \nReal: {test_dataset['output'][instance][x]} \nGene: {json_pred[x]} \nMatch: {match} \n\n"
            compare = compare + text

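            # guard against empty lists and zero overlap to avoid ZeroDivisionError
            precision = lenmatch/lengene if lengene else 0
            recall = lenmatch/lenreal if lenreal else 0
            f1 = 2*(precision*recall) / (precision+recall) if (precision+recall) else 0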
            #print(precision, recall, f1)

            F1score[x].append(f1) 
            
            
        compare = compare + "\n\n\n"
    #print(pred)
    #print(test_dataset['output'][instance])

KeyboardInterrupt: 

In [6]:
#F1score
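
The per-entity scoring done inside the loops above can be sketched as two small helpers. This is only an illustration under stated assumptions: `overlap_f1` and `mean_f1` are hypothetical names that do not appear in the notebook, and the overlap is counted as a multiset intersection (via `Counter`), which differs slightly from the list comprehension used in the cells above when the same string is repeated.

```python
from collections import Counter, defaultdict


def overlap_f1(real, gene):
    """Return (precision, recall, f1) for two lists of extracted strings.

    Assumption: matches are counted as a multiset intersection, so a string
    repeated in one list only matches as often as it occurs in the other.
    """
    matched = sum((Counter(real) & Counter(gene)).values())
    precision = matched / len(gene) if gene else 0.0
    recall = matched / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mean_f1(scores):
    """Average the F1 values collected per entity (skips empty lists)."""
    return {entity: sum(vals) / len(vals) for entity, vals in scores.items() if vals}


# Toy values for illustration only, not results from the model.
scores = defaultdict(list)
scores["Sex"].append(overlap_f1(["man"], ["man"])[2])
scores["Sign_symptom"].append(overlap_f1(["lethargy", "cupping"], ["lethargy", "mass"])[2])
print(mean_f1(scores))
```

Passing the notebook's `F1score` dictionary to `mean_f1` would give one averaged value per entity type, assuming it has been filled by the loop above.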

In [65]:
print(pred)

{"Diagnostic_procedure": ["Computerised perimetry", "IOPs", "MRI", "T1 and T2 signals", "Visual acuities", "acuities", "acuities", "blood analysis", "colour vision", "eye movements", "false-negative error rates", "field test", "inflammatory markers", "intraocular pressures", "perimetry", "pupils", "scanning", "visual acuity", "visual acuity", "visual acuity", "visual field examination", "visual field test", "visual fields", "visual fields"]}


In [92]:
json_pred

{'Diagnostic_procedure': ['Computerised perimetry',
  'IOPs',
  'MRI',
  'T1 and T2 signals',
  'Visual acuities',
  'acuities',
  'acuities',
  'blood analysis',
  'colour vision',
  'eye movements',
  'false-negative error rates',
  'field test',
  'inflammatory markers',
  'intraocular pressures',
  'perimetry',
  'pupils',
  'scanning',
  'visual acuity',
  'visual acuity',
  'visual acuity',
  'visual field examination',
  'visual field test',
  'visual fields',
  'visual fields']}
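
Cutting the prediction at the first `}` works for a flat object like the one above, but it would break if a value contained `}` or if objects were nested. A minimal sketch of an alternative, assuming the standard-library `json.JSONDecoder.raw_decode` is acceptable here; `extract_first_json` is a hypothetical helper name, not part of the notebook.

```python
import json


def extract_first_json(text):
    """Return the first JSON object that can be decoded from text, or None.

    Tries every '{' as a possible start and lets the JSON decoder find the
    matching end, so '}' inside strings or nested objects is handled.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This '{' did not start valid JSON; try the next one.
            start = text.find("{", start + 1)
    return None


sample = 'noise {"Diagnostic_procedure": ["MRI", "field test"]} trailing text'
print(extract_first_json(sample))
```

The helper still expects double-quoted JSON, so the quote replacement applied to `pred` in the cells above would need to happen first.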

In [49]:
sep = "}"
pred.split(sep, 1)[0] + sep
print("{" + pred.split(sep, 1)[1] + "}")

#haha = json.loads(pred)
#haha["Age"]

{ "Sex": ["man"], "Biological_structure": ["both optic nerves", "head", "infrasellar sphenoid sinuses", "pituitary fossa", "right eye", "suprasellar cistern"], "Sign_symptom": ["cupped", "cupping", "displaced", "deteriorate", "deterioration", "fluid levels", "heterogeneous", "hypertension", "hypercholesterolaemia", "increased", "lethargy", "mass", "mass", "multiple", "optic disc cupping", "prolactin levels", "recovery", "risk factors", "symptoms", "vasculopathy"], "Diagnostic_procedure": ["Clinical examination", "Computerised perimetry", "MRI", "T1 and T2 signals", "Visual acuities", "acuities", "acuities", "blood analysis", "colour vision", "eye movements", "false-negative error rates", "field test", "inflammatory markers", "intraocular pressures", "pupils", "visual acuity", "visual acuity", "visual acuity", "visual field examination", "visual field test", "visual fields", "visual fields"], "Detailed_description": ["advanced", "amblyopic", "bilateral", "compressed", "considerably rais

In [112]:
print(compare)

A 68-year-old man was referred by his optometrist to HES with suspected LTG due to repeatedly irregular visual field test results, advanced optic disc cupping, normal intraocular pressures (IOPs) and a family history of glaucoma.
The patient subjectively felt that vision in his ‘good’ left eye (LE), which normally had a visual acuity of 6/6 N5, started to deteriorate 6 months earlier; at the point of referral it was best corrected to 6/7.5 N6.
His right eye (RE) was known to be amblyopic with a visual acuity of 6/18 N12.
His medical history included considerable risk factors for systemic vasculopathy, such as hypertension, hypercholesterolaemia, 50 pack-years of smoking and type 2 diabetes with no diabetic retinopathy.
Despite detailed questioning, he denied any new systemic symptoms apart from experiencing increased lethargy.
Clinical examination at HES revealed advanced bilateral cupped optic discs with a cup-to-disc ratio right 0.9 (90%) and left 0.8 (80%; figure 1).
Furthermore, th

In [5]:
with open("../results/comparePerEntity2.txt", "w") as f:
    f.write(compare)

In [64]:
for x in test_dataset['output'][0]:
    print(x)

Age
Biological_structure
Detailed_description
Diagnostic_procedure
Lab_value
Sex
Sign_symptom


In [65]:
test_dataset['output'][0]["Age"]

['36-yr-old']

In [61]:
test_dataset['output'][0].replace("\'", "\"")

AttributeError: 'dict' object has no attribute 'replace'
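
The error above arises because the entry is already a `dict`, so there are no quotes to replace. If a double-quoted JSON string is what is needed, a minimal sketch is to serialize the dictionary with `json.dumps`. The `record` below is a hypothetical stand-in for `test_dataset['output'][0]`; only the `Age` value is taken from the output shown earlier.

```python
import json

# Hypothetical stand-in for one dataset entry (the "Sex" value is made up).
record = {"Age": ["36-yr-old"], "Sex": ["female"]}

# json.dumps always emits double-quoted keys and strings.
as_json = json.dumps(record, ensure_ascii=False)
print(as_json)

# Round trip: parsing the string gives back the same dictionary.
restored = json.loads(as_json)
```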

In [4]:
prompt = """
### Question:
A 58-year-old man had been suffering from general fatigue and severe anemia for several months.

### Answer:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = model.generate(input_ids=input_ids, max_new_tokens=100)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [22]:
print(f"Generated Response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=False)[0][len(prompt):]}\n")

Generated Response:




In [7]:
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
from kor import create_extraction_chain, Object, Text

In [14]:
from langchain.prompts import PromptTemplate

pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=2048,
    #do_sample=True,
    #temperature=0.0001,
    #max_new_tokens=300
)

local_llm = HuggingFacePipeline(pipeline=pipe)

#[INST] <<SYS>>\n<</SYS>> [/INST]

instruction_template = PromptTemplate(
    input_variables=["format_instructions", "type_description"],
    template=(
        "[INST]\n Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.\n\n"
        "{type_description}\n\n" # Can comment out
        "{format_instructions}\n\n"
        #"Here are some examples:\n\n"
    ),
)


schema = Object(
    id="patient",
    description=(
        "A patient is an individual admitted to or seeking medical attention"
        " at a hospital or healthcare facility."
    ),
    attributes=[
        Text(
            id="age",
            description="Patients' age.",
            examples=[
                ("A 32-year-old man", "32-year-old")
            ],
            many=True,
        ),
        Text(
            id="sex",
            description="Patients' sex.",
            examples=[
                ("A 32-year-old man.", "Male"),
                ("A 50-year-old woman.", "Female")
            ],
            many=True,
        ),
        Text(
            id="sign symptom",
            description="Sign symptom are crucial in the diagnostic process as they provide valuable information for understanding a patient's health concerns.",
            examples=[
                ("The patient has a fever temperature of 38.", "fever"),
                ("The patient complains of abdominal pain.", "abdominal pain")
            ],
            many=True,
        ),
        Text(
            id="biological structure",
            description="Biological structures refer to the anatomical components and organs.",
            examples=[
                ("The patient lost strength in the left arm.", "arm"),
                ("Computed tomography (CT) of the chest.", "chest"),
                ("Discomfort in the left upper quadrant of the abdomen.", "abdomen"),
            ],
            many=True,
        ),
        Text(
            id="diagnostic procedure",
            description="A diagnostic procedure is a medical test or examination.",
            examples=[
                ("An electrocardiogram revealed normal sinus rhythm", "electrocardiogram"),
                ("Physical examination yielded unremarkable findings.", "physical examination"),
            ],
            many=True,
        ),
        Text(
            id="lab value",
            description="Laboratory values refer to the results of tests.",
            examples=[
                ("Surgical margins were tumor-free.", "tumor-free"),
                ("His temperature was 38.4 °C.", "38.4 °C"),
                ("Heart rate 130/minute", "130/minute"),
                ("Blood levels of lactate and pyruvate were 1.6 and 0.096 mmol/l", "0.096 mmol/l"), 
                ("Blood levels of lactate and pyruvate were 1.6 and 0.096 mmol/l", "1.6"), 
            ],
            many=True,
        ),
    ],
    many=False,
)

user_input = """\n
Emily Johnson, a 42-year-old woman, visited the clinic reporting persistent abdominal pain in the upper right quadrant, escalating over the past month. 
Alongside the discomfort, she experienced sporadic nausea and vomiting. 
During the examination, a diagnostic ultrasound revealed gallstones in the gallbladder, indicative of cholecystitis. 
[/INST]"""

prompt = f"{user_input}"

#prompt = user_input

chain = create_extraction_chain(local_llm, schema, instruction_template=instruction_template, encoder_or_encoder_class='json')
chain.run(prompt)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonF

{'data': {},
 'raw': " {'patient': ['A 32-year-old man', 'Emily Johnson', '42-year-old woman', 'a 32-year-old man', 'a 50-year-old woman','man', 'woman'], 'age': ['32-year-old', 'escalating', 'persistent'], 'biological structure': ['abdomen','abdomen','abdominal','abdominal','chest','chest','gallbladder','gallbladder','left arm','left upper quadrant','upper right quadrant'], 'diagnostic procedure': ['CT','Blood levels of lactate and pyruvate','Discomfort','Heart rate','Physical examination','Blood levels of lactate and pyruvate','Computed tomography','diagnostic ultrasound','electrocardiogram','lactate','pyruvate','physical examination','sinus rhythm','ultrasound'], 'lab value': ['0.096\\u2009mmol/l','0.096\\u2009mmol/l','1.6','1.6','130/minute','38.4 °C','38.4 °C','unremarkable findings'],'sign symptom': ['discomfort','discomfort','escalating','gallstones','lost strength','nausea','pain','pain','persistent','sporadic','tumor-free','vomiting']}",
 'errors': [],
 'validated_data': {}}
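
In the result above, `data` is empty while `raw` holds a dictionary written with single quotes, which is a Python literal rather than valid JSON. One possible way to recover it, sketched here under the assumption that the raw string is a plain literal, is to fall back to `ast.literal_eval`. `parse_loose` is a hypothetical helper and is not part of kor.

```python
import ast
import json


def parse_loose(raw):
    """Parse model output that is either JSON or a Python-literal dict.

    Returns an empty dict when neither parser yields a dictionary.
    ast.literal_eval only evaluates literals, so no code is executed.
    """
    raw = raw.strip()
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(raw)
        except (ValueError, SyntaxError):
            continue
        if isinstance(value, dict):
            return value
    return {}


# Shortened, hypothetical example in the style of the 'raw' field above.
raw = " {'age': ['42-year-old'], 'sign symptom': ['nausea', 'vomiting']}"
print(parse_loose(raw))
```

Recovering the dictionary does not make its contents correct: the raw output above also mixes in strings from the schema examples, so the values would still need checking.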

# Guidance

In [1]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
import tqdm
from guidance import models

device = "cuda"

base_model_name = "AdaptLLM/medicine-chat" #path/to/your/model/or/name/on/hub"
adapter_model_name = "../models/AdaptLLM/medicine-chat-Erik-shuffled"

model = AutoModelForCausalLM.from_pretrained(base_model_name, load_in_4bit=True)
model = PeftModel.from_pretrained(model, adapter_model_name).to(device)
#model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


In [1]:
from guidance import models, gen, select

base_model_name = "AdaptLLM/medicine-chat" #path/to/your/model/or/name/on/hub"
adapter_model_name = "../models/AdaptLLM/medicine-chat-Erik-shuffled"
#medicine = models.Transformers("AdaptLLM/medicine-chat")
medicine = models.Transformers(model=base_model_name, adapter=adapter_model_name, device="cuda")

HELOOOOOOOOO


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


In [3]:
user_input = """\n
Emily Johnson, a 42-year-old woman, visited the clinic reporting persistent abdominal pain in the upper right quadrant, escalating over the past month. Alongside the discomfort, she experienced sporadic nausea and vomiting. During the examination, a diagnostic ultrasound revealed gallstones in the gallbladder, indicative of cholecystitis. 
"""

lm = medicine + 'Question: Luke has ten balls. He gives three to his brother.\n'
lm += 'How many balls does he have left?\n'
lm += 'Answer: ' + gen(stop='\n')

KeyboardInterrupt: 

In [2]:
lm = medicine + 'Question: Emily Johnson, a 42-year-old woman, visited the clinic reporting persistent abdominal pain in the upper right quadrant, escalating over the past month. Alongside the discomfort, she experienced sporadic nausea and vomiting. During the examination, a diagnostic ultrasound revealed gallstones in the gallbladder, indicative of cholecystitis.\n'
lm += "Your goal is to extract structured information from the question. Please extract Age.\n" #Sex, Diagnostic Procedure, Symptoms.
lm += 'Answer: ' + gen(stop='\n')

In [None]:
A 28-year-old previously healthy man presented with a 6-week history of palpitations. The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea. Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings. An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway. Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2). The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead). Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D). The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter). His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle. The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.

In [45]:
lm = medicine + "### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entities: Age, Sex, Symptom. Do not add any extra entities.\n\n### Input: Erik, 53. Visited the clinic reporting persistent abdominal pain in the upper right quadrant, escalating over the past month.\n\n ### Output:"
lm += "\n " + gen("output", max_tokens=250, stop="\n")

In [48]:
type(lm["output"])

str

In [66]:
import pandas as pd
pd.read_json(lm["output"])

ValueError: Invalid file path or buffer object type: <class 'NoneType'>

In [44]:
lm = medicine + "### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entities: Age, Sex, Symptom. Do not add any extra entities.\n\n### Input: Erik, 53. Visited the clinic reporting persistent abdominal pain in the upper right quadrant, escalating over the past month.\n\n ### Output:"

lm += gen(max_tokens=250)
#lm += "\n " + gen(max_tokens=250, stop="\n")
#print(str(lm))

In [35]:
lm = medicine + "### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entities: Sex, Age. Do not add any extra entities.\n\n### Input: A 28-year-old previously healthy man presented with a 6-week history of palpitations. The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea. Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings. An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway. Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2). The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead). Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).\n\n ### Output:"
lm += "\n " + gen(max_tokens=250, stop="\n")

In [65]:
lm = medicine + """
### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entity: Diagnostic_procedure. Do not add any extra entities. 

### Input: A 36-year-old female presented at the emergency department with aggravating right upper abdominal pain for 2 hours. The patient was diagnosed hepatitis B virus (HBV) carrier for several years and non-alcoholics. No other specific personal and familial medical history was noted. Initial blood pressure was 100/60 mmHg, pulse rate 70/min, respiration rate 20/min, body temperature 37.

### Output:
"""

lm += "\n " + gen(max_tokens=2500, stop="\n")

In [8]:
lm = medicine + """
### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entity: Sign_symptom. Do not add any extra entities.

### Input: Patient Name: John Doe Age: 45 Gender: Male Date of Admission: 2023-03-01 Diagnosis: Right-sided ischemic stroke with resulting hemiparesis and aphasia. Medical History: Hypertension, Type 2 Diabetes Mellitus. Rehabilitation Goals: Improve motor function in right arm and leg, enhance speech and communication abilities, increase independence in daily activities.

### Output:
"""

lm += "\n " + gen(max_tokens=150, stop="\n", temperature=0.4)

In [74]:
#@guidance
def character_maker(lm):
    lm += f"""\
    "### Your goal is to extract structured information from the user's input that matches json format. When extracting information please make sure it matches the type information exactly. Extract the following entities: Sex, Age. Do not add any extra entities.\n\n### Input: A previously healthy man presented with a 6-week history of palpitations. The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea. Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings. An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway. Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2). The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead). Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).\n\n ### Output:"
    ```json
    {{
        
        "age": "{select(options=['Not available', '55', '34', 'None'], name='age')}",
        "sex": "{select(options=['male', 'female', ''], name='sex')}",
        "Biological_structure": "{gen(stop='"')}",
        "Sign_symptom": "{gen('Sign_symptom', stop='"')}",
        "Diagnostic_procedure": "{gen('Diagnostic_procedure', stop='"')}",
        "Lab_value": "{gen('Lab_value', stop='"')}",
    }}```"""
    return lm
#a = time.time()
lm = lm + character_maker(medicine)
#time.time() - a
#"age": {gen('age', regex='[0-9]+', stop=',')},

AttributeError: 'generator' object has no attribute '_inplace_append'

In [None]:
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=2000)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
print(pred)

In [84]:
from datasets import load_from_disk

hg_dataset = load_from_disk("../data/datasetTestShuffledEntire")

In [72]:
hg_dataset["input"][0]

'A 18-year-old man presented with a history of a palpable lump in the presternal area, which was found incidentally after weight reduction.\nThe patient had no other relevant medical or trauma history.\nOn physical examination, the lesion was found to be a nonmovable, firm mass with no tenderness or associated skin changes, detected at the midline position over the sternum (at the manubrium level).\nThere was no visible fistulous opening or discharge from the lesion.\nOn ultrasonography, we detected a well-circumscribed, oval, anechoic mass, with posterior acoustic enhancement, that measured about 3.3\u200a×\u200a1.7\u200a×\u200a3.1\u200acm, and was located in the subcutaneous fat layer over the sternum.\nIn the dependent portion of the mass was an internal, well-circumscribed, heterogeneously hypoechoic, egg-shaped lesion (Fig.1A and B) showing a movement according to patient movement.\nThe mass could be compressed using the linear transducer (Fig.2A and B).\nA color Doppler study sho
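
The record above contains hair spaces (`\u200a`) between numbers and units. A minimal sketch for normalizing such characters before prompting or scoring, assuming they should become ordinary spaces rather than be removed; `normalize_spaces` and the chosen character set are assumptions, not something the notebook defines.

```python
import re

# Thin space, hair space, no-break space and narrow no-break space.
_SPECIAL_SPACES = re.compile("[\u2009\u200a\u00a0\u202f]")


def normalize_spaces(text):
    """Replace special Unicode spaces with a plain space and collapse runs of spaces.

    Newlines are left untouched so the sentence-per-line layout survives.
    """
    text = _SPECIAL_SPACES.sub(" ", text)
    return re.sub(r" {2,}", " ", text)


sample = "measured about 3.3\u200a×\u200a1.7\u200a×\u200a3.1\u200acm"
print(normalize_spaces(sample))
```

Replacing rather than deleting keeps values such as `10 cm` readable as two tokens; whether that matches the gold annotations in the dataset would need to be checked.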

In [85]:
gr_examples = [[example["input"], example["output"]] for example in hg_dataset]

In [93]:
print(gr_examples[0][0].replace(" ", ""))

A 68-year-old man was referred by his optometrist to HES with suspected LTG due to repeatedly irregular visual field test results, advanced optic disc cupping, normal intraocular pressures (IOPs) and a family history of glaucoma.
The patient subjectively felt that vision in his ‘good’ left eye (LE), which normally had a visual acuity of 6/6 N5, started to deteriorate 6months earlier; at the point of referral it was best corrected to 6/7.5 N6.
His right eye (RE) was known to be amblyopic with a visual acuity of 6/18 N12.
His medical history included considerable risk factors for systemic vasculopathy, such as hypertension, hypercholesterolaemia, 50 pack-years of smoking and type 2 diabetes with no diabetic retinopathy.
Despite detailed questioning, he denied any new systemic symptoms apart from experiencing increased lethargy.
Clinical examination at HES revealed advanced bilateral cupped optic discs with a cup-to-disc ratio right 0.9 (90%) and left 0.8 (80%; figure 1).
Furthermore, the