# GPT-4 Summarization

## Creating a Prompt

To create a prompt, I will give you 10 training examples of the original text-summary pairs and 10 validation examples.
I will also provide code below to check performance on the 10 validation examples.
You have to enter the GPT-4 outputs for these by hand.
Not all experiments use exactly the same data as the original text-summary pairs (see below), but I think these examples are good enough to get a sense of the performance and to create one prompt for all experiments.
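
As a rough illustration of how the training examples could be turned into a few-shot prompt, here is a minimal sketch. The `text` and `summary` keys match the data files used below; the function name `build_few_shot_prompt`, the instruction wording, and the `Text:` / `Summary:` template are assumptions for illustration only, not the prompt that was actually used.

```python
def build_few_shot_prompt(examples, new_text, n_shots=3):
    """Assemble a few-shot summarization prompt (hypothetical template)."""
    # Assumed instruction wording; replace with the real prompt text.
    instruction = "Summarize the following text for the patient."
    parts = [instruction]
    # Each in-context example contributes one text-summary demonstration.
    for ex in examples[:n_shots]:
        parts.append(f"Text: {ex['text']}\nSummary: {ex['summary']}")
    # The text to summarize goes last, with an empty summary slot for the model.
    parts.append(f"Text: {new_text}\nSummary:")
    return "\n\n".join(parts)


# Toy records with the same 'text' / 'summary' keys as the data files.
toy_train = [
    {"text": "Patient admitted with pneumonia ...", "summary": "You were treated for a lung infection."},
    {"text": "Patient admitted with UTI ...", "summary": "You were treated for a urinary infection."},
]
prompt = build_few_shot_prompt(toy_train, "Patient admitted with cellulitis ...", n_shots=2)
print(prompt)
```

The resulting string can be pasted into GPT-4 by hand, and the returned summary copied back into the prediction list in the evaluation cell.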

## Experiments To Run

All other experiments come with their own 10 in-context examples.

### For quantitative performance estimates

1. Summarization of 100 original text-summary pairs
2. Summarization of 100 original text-summary pairs with short text (<4000 chars) and long summaries (>600 chars)
    * I did not mention this to you, but we also have to get the performance on this data.
    * This is the 20% subset of the data that I had to work with to make human annotation feasible. Texts that were too long were impossible to annotate.
    * Basically, I just want to show that this subselection makes no difference in performance.
3. Not high priority, but could be useful: Summarization of 100 _cleaned and improved_ text-summary pairs when using 10 cleaned and improved in-context examples (10 validation _cleaned and improved data_)
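
The length filter described for experiment 2 could be sketched as follows. The thresholds (text under 4000 characters, summary over 600 characters) come from the list above; the helper names and the toy records are hypothetical, and the actual subset may have been built differently.

```python
MAX_TEXT_CHARS = 4000      # texts must be shorter than this
MIN_SUMMARY_CHARS = 600    # summaries must be longer than this


def is_short_text_long_summary(example):
    """Return True if the example has a short text and a long summary."""
    return (
        len(example["text"]) < MAX_TEXT_CHARS
        and len(example["summary"]) > MIN_SUMMARY_CHARS
    )


def select_subset(examples):
    """Keep only the examples that pass the length filter."""
    return [ex for ex in examples if is_short_text_long_summary(ex)]


# Toy records (hypothetical) with the same keys as the data files.
toy = [
    {"text": "a" * 3000, "summary": "b" * 700},  # kept
    {"text": "a" * 5000, "summary": "b" * 700},  # dropped: text too long
    {"text": "a" * 3000, "summary": "b" * 100},  # dropped: summary too short
]
subset = select_subset(toy)
print(len(subset))
```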

### For annotating hallucinations and determining hallucination rates

4. Summarization of 25 examples when using in-context examples with unsupported facts (10 validation _original data_)
    * I will give you 50 test examples to have some for debugging
5. Summarization of 25 examples when using in-context examples with unsupported facts removed (10 validation _cleaned data_)
    * I will give you 50 test examples to have some for debugging

### For qualitative results with human annotation

6. Summarization of 25 examples when using in-context examples with unsupported facts removed and with improved text, e.g. deidentification artifacts removed (10 validation _cleaned and improved data_)
    * I will give you 50 test examples to have some for debugging

In [9]:
# Imports
import json
import random
import numpy as np
from collections import defaultdict
import evaluate
from rouge_score import rouge_scorer

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install bert_score


In [11]:
# Read all files
def read_jsonl(file_name):
    with open(file_name, "r") as f:
        return [json.loads(line) for line in f]

prompt_train = read_jsonl('/content/drive/MyDrive/summarization_data/prompt_train.json')
prompt_valid = read_jsonl('/content/drive/MyDrive/summarization_data/prompt_valid.json')

exp_1_in_context = read_jsonl('/content/drive/MyDrive/summarization_data/exp_1_in-context.json')
exp_1_test = read_jsonl('/content/drive/MyDrive/summarization_data/exp_1_test.json')
exp_2_in_context = read_jsonl('/content/drive/MyDrive/summarization_data/exp_2_in-context.json')
exp_2_test = read_jsonl('/content/drive/MyDrive/summarization_data/exp_2_test.json')
exp_3_in_context = read_jsonl('/content/drive/MyDrive/summarization_data/exp_3_in-context.json')
exp_3_test = read_jsonl('/content/drive/MyDrive/summarization_data/exp_3_test.json')

exp_4_in_context = read_jsonl('/content/drive/MyDrive/summarization_data/exp_4_in-context.json')
exp_4_test = read_jsonl('/content/drive/MyDrive/summarization_data/exp_4_test.json')
exp_5_in_context = read_jsonl('/content/drive/MyDrive/summarization_data/exp_5_in-context.json')
exp_5_test = read_jsonl('/content/drive/MyDrive/summarization_data/exp_5_test.json')

exp_6_in_context = read_jsonl('/content/drive/MyDrive/summarization_data/exp_6_in-context.json')
exp_6_test = read_jsonl('/content/drive/MyDrive/summarization_data/exp_6_test.json')

assert len(prompt_train) == 10
assert len(prompt_valid) == 10
# Assert length of in-context always 10
assert len(exp_1_in_context) == 10
assert len(exp_2_in_context) == 10
assert len(exp_3_in_context) == 10
assert len(exp_4_in_context) == 10
assert len(exp_5_in_context) == 10
assert len(exp_6_in_context) == 10
# Assert length of test
assert len(exp_1_test) == 100
assert len(exp_2_test) == 100
assert len(exp_3_test) == 100
assert len(exp_4_test) == 50
assert len(exp_5_test) == 50
assert len(exp_6_test) == 50

In [12]:
# Use a custom ROUGE function to obtain ROUGE-3/4, which are not available in the Hugging Face metric
def get_rouge_score(gold, pred):
    rouge_scores = ['rouge1', 'rouge2', 'rouge3', 'rouge4', 'rougeL']
    scorer = rouge_scorer.RougeScorer(rouge_scores, use_stemmer=True)
    scores = scorer.score(gold, pred)
    return {k: scores[k].fmeasure * 100 for k in rouge_scores}

def compute_custom_metrics(srcs, golds, preds, device):
    scores = defaultdict(list)
    bertscore = evaluate.load("bertscore")
    sari = evaluate.load("sari")

    # For rouge and length go over examples one by one and determine mean
    for gold, pred in zip(golds, preds):
        for k, v in get_rouge_score(gold, pred).items():
            scores[k].append(v)
        scores['words'].append(len(pred.split(' ')))
    for k, v in scores.items():
        scores[k] = np.mean(v)

    # This is the default call using model_type="roberta-large"
    # This matches the paper "Generation of Patient After-Visit Summaries to Support Physicians" (AVS_gen/eval_summarization.py), which uses the SummerTime library
    scores['bert_score'] = np.mean((bertscore.compute(predictions=preds, references=golds, lang="en", device=device))['f1']) * 100
    # BERTScore authors recommend "microsoft/deberta-large-mnli" (https://github.com/Tiiiger/bert_score)
    scores['bert_score_deberta-large'] = np.mean((bertscore.compute(predictions=preds, references=golds, device=device, model_type="microsoft/deberta-large-mnli"))['f1']) * 100
    scores['sari'] = sari.compute(sources=srcs, predictions=preds, references=[[g] for g in golds])['sari']
    # scores['sari'] = scores['sari'][0]
    # Importing readability for the Dale-Chall score is not working: https://pypi.org/project/py-readability-metrics/

    return {k: round(v, 2) for k, v in scores.items()}

In [13]:
# Creating prompt

# To obtain the validation performance on the 10 validation examples
# 1-shot GPT-3.5 Turbo
#prompt_valid_gpt_predicitions = []
#prompt_valid_gpt_predicitions.append("You were admitted for E. Coli bacteremia and treated with antibiotics. You also had symptoms of diarrhea, vomiting, lightheadedness, and chest pain. Your symptoms improved with treatment. You will continue antibiotics and follow up for further testing. Your blood pressure and blood sugar remained well controlled. You are scheduled for an outpatient exercise stress test.")
#prompt_valid_gpt_predicitions.append("You were admitted for a UTI related to your recent kidney stone procedure. Your symptoms improved with medication. You were also found to have anemia due to chronic inflammation, which needs further evaluation. You were discharged on antibiotics and will follow up with urology as an outpatient for stent-related symptoms.")
#prompt_valid_gpt_predicitions.append("You were admitted for worsening anemia in the setting of NASH cirrhosis and other complex medical issues. The cause of your abdominal pain was unclear but improved without intervention. Your anemia improved before discharge. Your liver transplant evaluation continues, and your MELD score is 26. Please follow up with hepatology for ongoing monitoring and management.")
#prompt_valid_gpt_predicitions.append("You were admitted for acute kidney injury likely due to NSAID use, intravascular depletion from cirrhosis, and recent paracentesis. Your kidney function improved with albumin infusions. Your sodium levels and diplopia were managed during the admission. You have metastatic HCC and will follow up with your oncologists. Your cirrhosis is stable with minimal ascites. Please avoid NSAIDs, adjust gabapentin dose, and ensure kidney function is improving before any contrast studies.")
#prompt_valid_gpt_predicitions.append("You were admitted to the Neurologic ICU for a stroke likely caused by a basilar artery issue. Your blood pressure was managed with medications, and you were started on aspirin and atorvastatin. You were transferred to the neurology floor after stabilization. You developed pneumonia but responded well to antibiotics. You had a Speech and Swallow evaluation and were cleared for certain foods. You were kept on midodrine to maintain blood pressure for CNS perfusion. You will transition back to your home lisinopril as tolerated.")
#prompt_valid_gpt_predicitions.append("You were admitted for throat pain and difficulty swallowing after an ENT procedure. Pain was managed with roxicet. You had a negative swallow evaluation and were discharged on a soft diet. Your cough and increased sputum production were likely due to penumonitis, not pneumonia. Dehydration was treated with IV fluids. Your chronic issues of asthma, GERD, and depression were stable during this admission.")
#prompt_valid_gpt_predicitions.append("During your hospital stay, you were treated for acute on chronic diastolic heart failure exacerbation, likely due to fluid overload. You also received treatment for suspected healthcare-associated pneumonia. Your medications were adjusted to manage your heart failure, hypertension, and hypophosphatemia. You were also evaluated for acute delirium and had your medications adjusted accordingly. Your chronic conditions, such as end-stage renal disease, atrial fibrillation, and CAD, were managed as well. Please follow up with your outpatient cardiologist for further care.")
#prompt_valid_gpt_predicitions.append("You were admitted for left thigh cellulitis, which was treated with IV antibiotics and then transitioned to oral antibiotics. An MRI ruled out abscess. You also had a mild headache, likely due to poor sleep and infection. Your diabetes was managed with an insulin pump adjustment. Your thyroid medication was decreased. Your blood pressure was stable. You were discharged with pain medication and instructions to continue antibiotics for 14 days. Your husband is your healthcare proxy and your code status is full.")
#prompt_valid_gpt_predicitions.append("You were admitted for a right groin hematoma and dehydration, leading to a presyncopal episode. Your kidney injury resolved with IV fluids and medication adjustments. Your atrial fibrillation was treated with ablation, and your medications were adjusted for your heart and kidney conditions. Please continue your medications as prescribed and follow up for further evaluation.")
#prompt_valid_gpt_predicitions.append("You were admitted for a flare of ulcerative colitis, confirmed by imaging and tests. You were treated with steroids and other medications. No evidence of perforation or major complications. You were also found to have macrocytosis and thrombocytosis, likely related to inflammation. Discharged on prednisone with tapering instructions, continue mesalamine enemas, and follow up with your doctor.")

# 3-shot GPT-4
prompt_valid_gpt_predicitions = []
prompt_valid_gpt_predicitions.append("You were admitted to the hospital due to diarrhea, vomiting, and feeling lightheaded. We suspected that your symptoms might be due to a bacterial infection in your gut, possibly caused by recent antibiotic use. We started you on a medication called metronidazole, which helped improve your symptoms. You also had an episode of chest pain, but tests showed no signs of a heart attack. We found bacteria in your blood, but repeated tests showed no growth, suggesting the infection is under control. Your blood pressure and blood sugar levels remained stable throughout your stay. You will continue your current medications and finish your course of antibiotics during your dialysis sessions.")
prompt_valid_gpt_predicitions.append("You were admitted to the hospital due to symptoms of a urinary tract infection (UTI), likely related to a stent placed in your kidney. Your symptoms improved significantly with medication. You were treated with antibiotics, which you should continue to take after discharge. You also have anemia, likely due to chronic inflammation, which should be further evaluated after you leave the hospital. Your high blood pressure and high cholesterol were managed with your usual medications. You have a follow-up appointment with Urology to further evaluate your symptoms.")
prompt_valid_gpt_predicitions.append("You were admitted due to abdominal pain and a decrease in your red blood cell count (anemia). The cause of your abdominal pain was not clear, but it improved on its own. Your anemia was likely due to multiple factors, including possible bone marrow suppression and dilution from fluids given in the emergency department. There was no evidence of gastrointestinal bleeding. Your red blood cell count improved before you were discharged. Your liver disease did not worsen during your stay. Your medications for heart rhythm and blood thinning were continued.")
prompt_valid_gpt_predicitions.append("You were admitted to the hospital due to worsening kidney function, likely caused by a combination of factors including medication use, fluid imbalance, and your underlying liver disease. We treated this with albumin infusions and your kidney function improved. We also managed your low sodium levels by restricting your fluid and sodium intake. Your double vision remained stable, and you have a scheduled MRI to investigate this further. Your liver cancer is being managed with medication and will be followed up with your doctor. We adjusted your pain medication dosage based on your kidney function. It's important to avoid all NSAIDs due to your kidney function and to check your kidney function before getting any procedures with contrast.")
prompt_valid_gpt_predicitions.append("You were admitted to the ICU due to a stroke, likely caused by a condition in your basilar artery. We managed your blood pressure and position to maximize blood flow to your brain. Your condition improved, allowing us to reduce your blood pressure medication and increase your mobility. We also started you on aspirin and a cholesterol medication. You developed pneumonia, which was treated with antibiotics. Once stable, you were moved to a regular room where your diet was adjusted to thin liquids and soft solids. Your blood pressure was managed to ensure adequate blood flow to your brain.")
prompt_valid_gpt_predicitions.append("You were admitted to the hospital due to pain and difficulty swallowing, likely related to your recent ENT procedure. There was no sign of infection. Your swallowing was evaluated and found to be normal. You were also coughing more and producing more sputum, likely due to inflammation in your lungs. There was concern that you were dehydrated due to not eating or drinking much, but after receiving fluids, your labs returned to normal. You were able to tolerate a soft diet with thin liquids upon discharge. Your chronic conditions of asthma, GERD, and depression were stable during your stay.")
prompt_valid_gpt_predicitions.append("During your hospital stay, you were treated for symptoms of heart failure, which included difficulty breathing and a productive cough. Your heart function was evaluated and it was found that you had some fluid overload, which was managed with dialysis. You also received antibiotics for a possible lung infection. Your blood pressure was high, so you were started on a medication called amlodipine. You experienced some confusion during your stay, which could be due to a number of factors, including your heart condition and lack of sleep. Your kidney function was monitored and your medication for high phosphate levels was stopped due to low phosphate levels in your blood. Your other chronic conditions, including atrial fibrillation, anemia, coronary artery disease, depression, and hypothyroidism were managed with your usual medications.")
prompt_valid_gpt_predicitions.append("During your hospital stay, you were treated for a severe skin infection in your left thigh. You were initially given IV antibiotics, which were later switched to oral medications. Your infection improved, and you were discharged with a plan to continue oral antibiotics for an additional 14 days. You also had a headache, which was likely due to poor sleep and the infection. Your blood sugar levels were initially high, but were well controlled after adjusting your insulin pump. Your thyroid medication was decreased due to low TSH levels. Your blood pressure was normal during your stay with your current medication. You were also continued on your current medication for gastroparesis.")
prompt_valid_gpt_predicitions.append("During your hospital stay, you were treated for a hematoma in your right thigh, which occurred due to a high INR after your atrial fibrillation ablation. Your kidney function was temporarily affected, likely due to dehydration, but it improved after receiving IV fluids and adjusting your medication. You also experienced a drop in blood pressure when standing, which was managed by adjusting your medications and providing hydration. Your heart rhythm has been stable since your ablation procedure. Upon discharge, your medications were adjusted: your Coumadin dose was decreased, your aspirin was discontinued, and your Torsemide dose was lowered. It's important to monitor your INR levels and schedule an outpatient sleep study to check for sleep apnea.")
prompt_valid_gpt_predicitions.append("You were admitted to the hospital due to a flare-up of your Ulcerative Colitis, which was causing bloody stools and abdominal pain. We ruled out any infections that could have caused these symptoms. You were treated with steroids and your usual medication for Ulcerative Colitis was continued. We also noted that your blood cells were larger than normal and your platelet count was high, which we will continue to monitor. You were discharged with a plan to gradually reduce your steroid dosage over several weeks. You should follow up with your doctor in a week and continue your usual medications.")


srcs = []
golds = []
preds = []
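# Predictions are paired by index with exp_1_test here; empty strings are skipped.
# Use prompt_valid instead when scoring the prompt validation examples.
for i, pred in enumerate(prompt_valid_gpt_predicitions):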
    if pred != "":
        srcs.append(exp_1_test[i]['text'])
        golds.append(exp_1_test[i]['summary'])
        preds.append(pred)

print(f"Evaluate on {len(srcs)} validation examples.")
compute_custom_metrics(srcs, golds, preds, "cuda")

# Model                                    & R-1 & R-2 & R-3 & R-L & BERTScore & Deberta & SARI & Words \\ \midrule
# Llama 2 70B (100 training ex.)           & 43  & 15  & 6   & 25  & 87        & 62      & 44.24 & 125  \\

Evaluate on 10 validation examples.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'rouge1': np.float64(42.89),
 'rouge2': np.float64(13.42),
 'rouge3': np.float64(5.24),
 'rouge4': np.float64(1.74),
 'rougeL': np.float64(25.04),
 'words': np.float64(105.9),
 'bert_score': np.float64(87.49),
 'bert_score_deberta-large': np.float64(62.58),
 'sari': 44.11}

In [None]:
!pip install sacrebleu sacremoses
