# Introduction to Bedrock - Text Summarization Sample - Health Care and Life Sciences 

--- 

In this demo notebook, we demonstrate how to use the Bedrock Python SDK for text summarization, question answering (Q&A), and text generation on health care and life sciences examples. We show how to prompt Bedrock's foundation models to summarize a passage and to answer questions after reading it.

---

## 1. Set Up and API walkthrough

---
Before executing the notebook for the first time, execute this cell to add the Bedrock extensions to the Python boto3 SDK.

---

In [2]:
import boto3
import botocore
import pandas as pd

try:
    from langchain.chains import LLMChain
    from langchain.llms.bedrock import Bedrock
    from langchain.prompts import PromptTemplate
    from langchain.embeddings import BedrockEmbeddings
except ImportError as e:
    print(e)
    print("Please install langchain version 0.0.190")

In [3]:
assert boto3.__version__ == "1.26.142"
assert botocore.__version__ == "1.29.142"

#### Uncomment these to run from your local environment outside of AWS

In [4]:
import sys
import os
from pprint import pprint

#### Now let's set up our connection to the Amazon Bedrock SDK using Boto3

In [5]:
import boto3
import json

bedrock = boto3.client(
    service_name="bedrock",
    region_name="us-east-1",
    endpoint_url="https://bedrock.us-east-1.amazonaws.com",
)

#### We can validate our connection by testing out the _list_foundation_models()_ method, which will tell us all the models available for us to use 

In [6]:
bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': '99f81aba-b73d-43df-b5d8-baa3c5b84758',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Thu, 08 Jun 2023 14:15:33 GMT',
   'content-type': 'application/json',
   'content-length': '861',
   'connection': 'keep-alive',
   'x-amzn-requestid': '99f81aba-b73d-43df-b5d8-baa3c5b84758'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-e1t-medium',
   'modelId': 'amazon.titan-e1t-medium'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/stability.stable-diffusion-xl',
   'modelId': 'stability.stable-diffusion-xl'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/ai21.j2-grande-instruct',
   'modelId': 'ai21.j2-grande-instruct'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/ai21.j2-jumbo-instruct',
   'modelId': 'ai21.j2-jumbo-in
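The response is a plain dictionary, so the model IDs can be pulled out of `modelSummaries` directly. The sketch below is a minimal illustration that groups the IDs by provider prefix. The helper name `model_ids_by_provider` is hypothetical, and the `response` value is a stub containing only some of the entries printed above, so that the example runs without AWS access.

```python
from collections import defaultdict

# Stub shaped like the response printed above (only the field used here).
# In the notebook this would come from bedrock.list_foundation_models().
response = {
    "modelSummaries": [
        {"modelId": "amazon.titan-tg1-large"},
        {"modelId": "amazon.titan-e1t-medium"},
        {"modelId": "stability.stable-diffusion-xl"},
        {"modelId": "ai21.j2-grande-instruct"},
    ]
}


def model_ids_by_provider(response):
    """Group model IDs by the provider prefix before the first dot."""
    grouped = defaultdict(list)
    for summary in response.get("modelSummaries", []):
        model_id = summary["modelId"]
        provider = model_id.split(".", 1)[0]
        grouped[provider].append(model_id)
    return dict(grouped)


print(model_ids_by_provider(response))
```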

#### In this Notebook we will be using the invoke_model() method of Amazon Bedrock. This will be the primary method we use for most of our Text Generation and Processing tasks. 

##### The mandatory parameters for this method are _modelId_, which identifies the Amazon Bedrock model, and _body_, which carries the prompt for our task. The format of _body_ depends on the foundation model provider selected. We walk through this in detail below.

```
response = bedrock.invoke_model(
    modelId=model_id,
    contentType="application/json",
    accept="application/json",
    body=body,
)

```

## 2. Reading the data
We take data from the [PubMed subset of the scientific papers dataset on Hugging Face](https://huggingface.co/datasets/scientific_papers/viewer/pubmed/train?row=0).


In [7]:
pubmed_dataset = pd.read_csv("./huggingface_medpub_dataset_short.csv")

In [8]:
pubmed_dataset

Unnamed: 0,article,abstract,section_names
0,anxiety affects quality of life in those livin...,research on the implications of anxiety in pa...,1. Introduction\n2. Methods\n3. Results\n4. Di...
1,small non - coding rnas are transcribed into m...,"small non - coding rnas include sirna , mirna...",Introduction\nAberrant Expression of miRNA in ...
2,ohss is a serious complication of ovulation in...,objective : to evaluate the efficacy and safe...,Introduction\nMaterials and Methods\nResults\n...
3,congenital adrenal hyperplasia ( cah ) refers ...,congenital adrenal hyperplasia is a group of ...,I\nM\nR\nD
4,type 1 diabetes ( t1d ) results from the destr...,objective(s):pentoxifylline is an immunomodul...,Introduction\nMaterials and Methods\nDrug and ...


In [95]:
pubmed_dataset["article_length_chars"] = pubmed_dataset.apply(lambda x: len(x['article']), axis=1)
pubmed_dataset["abstract_length_chars"] = pubmed_dataset.apply(lambda x: len(x['abstract']), axis=1)
pubmed_dataset["article_length_words"] = pubmed_dataset["article"].apply(lambda x: len(x.split(" ")))
pubmed_dataset["abstract_length_words"] = pubmed_dataset["abstract"].apply(lambda x: len(x.split(" ")))
pubmed_dataset.head()

Unnamed: 0,article,abstract,section_names,article_length_chars,abstract_length_chars,article_length_words,abstract_length_words
0,anxiety affects quality of life in those livin...,research on the implications of anxiety in pa...,1. Introduction\n2. Methods\n3. Results\n4. Di...,17759,1240,3072,221
1,small non - coding rnas are transcribed into m...,"small non - coding rnas include sirna , mirna...",Introduction\nAberrant Expression of miRNA in ...,15247,602,2442,105
2,ohss is a serious complication of ovulation in...,objective : to evaluate the efficacy and safe...,Introduction\nMaterials and Methods\nResults\n...,21979,1531,3785,279
3,congenital adrenal hyperplasia ( cah ) refers ...,congenital adrenal hyperplasia is a group of ...,I\nM\nR\nD,5389,822,911,146
4,type 1 diabetes ( t1d ) results from the destr...,objective(s):pentoxifylline is an immunomodul...,Introduction\nMaterials and Methods\nDrug and ...,17938,1411,3217,226


#### Let's now try out the Amazon Bedrock models and have them summarize our sample

## First try: all models, no specialised instruction set for the models

In [96]:
models = {"titan":"amazon.titan-tg1-large", "j2":"ai21.j2-jumbo-instruct", "claude":"anthropic.claude-v1"}
example_text = pubmed_dataset["article"].iloc[3]

In [97]:
import boto3
import json
import csv
from datetime import datetime

def call_bedrock(modelId, prompt_data):
    if 'amazon' in modelId:
        body = json.dumps({
            "inputText": prompt_data,
            "textGenerationConfig":
            {
                "maxTokenCount":4096,
                "stopSequences":[],
                "temperature":0,
                "topP":0.9
            }
        })
        #modelId = 'amazon.titan-tg1-large'
    elif 'anthropic' in modelId:
        body = json.dumps({"prompt": prompt_data, "max_tokens_to_sample": 512})
        #modelId = 'anthropic.claude-instant-v1'
    elif 'ai21' in modelId:
        body = json.dumps({"prompt": prompt_data,
                           "maxTokens":4096})
        #modelId = 'ai21.j2-grande-instruct'
    elif 'stability' in modelId:
        body = json.dumps({"text_prompts":[{"text":prompt_data}]}) 
        #modelId = 'stability.stable-diffusion-xl'
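    else:
        # Fail loudly: returning None here would break `response, latency = call_bedrock(...)`
        raise ValueError(f"Unsupported modelId {modelId!r}: expected an amazon, anthropic, ai21, or stability model")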
    accept = 'application/json'
    contentType = 'application/json'

    before = datetime.now()
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    latency = (datetime.now() - before)
    response_body = json.loads(response.get('body').read())

    if 'amazon' in modelId:
        response = response_body.get('results')[0].get('outputText')
    elif 'anthropic' in modelId:
        response = response_body.get('completion')
    elif 'ai21' in modelId:
        response = response_body.get('completions')[0].get('data').get('text')

    #Add interaction to the local CSV file...
    #column_name = ["timestamp", "modelId", "prompt", "response", "latency"] #The name of the columns
    data = [datetime.now(), modelId, prompt_data, response, latency] #the data
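    os.makedirs('./prompt-data', exist_ok=True)  # make sure the log directory exists (os is imported above)
    with open('./prompt-data/prompt-data.csv', 'a', newline='') as f: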
        writer = csv.writer(f)
        #writer.writerow(column_name)
        writer.writerow(data)
    
    return response, latency
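
`call_bedrock` measures latency by subtracting two `datetime.now()` values, which follow the wall clock and can jump if the system time is adjusted during a call. A possible alternative, sketched here under the assumption of a hypothetical helper `timed_call` (not part of the notebook), is to time the call with the monotonic `time.perf_counter()`. The `fake_model` function is only a stand-in so the sketch runs without AWS access.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for a model call; a real use would wrap bedrock.invoke_model.
def fake_model(prompt):
    return prompt.upper()


result, seconds = timed_call(fake_model, "summarize this")
print(result, f"({seconds:.6f} s)")
```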

In [98]:
example_text

'congenital adrenal hyperplasia ( cah ) refers to a group of autosomal recessive disorders caused by an enzyme deficiency which leads to defects in biosynthesis of steroid precursors .\ndepending on the severity and degree of 21 hydroxylase deficiency , the clinical spectrum may vary from mild form of non classical cah to classic cah .\nhowever , the non classical cah variant is more common with a prevalence rate of 1 in 1000 .\nit also helps in maintaining normal levels of precursors by suppressing adreno cortico trophic hormone ( acth ) . during childhood\n, the management is largely focused on achieving normal growth and attaining appropriate final adult height .\njohns medical college hospital , bangalore by the department of endocrinology on patients diagnosed to have cah and seen in the outpatient clinic between january 2012 and october 2012 . during this period\ndata regarding demography , clinical presentation at time of diagnosis , treatment details , height sds and bmi were c

In [99]:
prompt_data =f"""Summarize the following text in 100 words or less:
{example_text}
"""

In [100]:
response, latency = call_bedrock(models["claude"], prompt_data)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

Summary (101 words):
A study analyzed 29 patients with congenital adrenal hyperplasia ( CAH ) at a hospital in India. The majority (76%) were female. About 40% of females were identified at birth due to genital ambiguity. However, some females sought treatment only in adulthood due to virilization. Nearly all males presented as infants with adrenal crisis. Patients commonly had short stature (35%) and were treated with hydrocortisone or dexamethasone. About 1/3 had over-suppressed hormone levels, potentially impacting height. Adult height was below average for classic CAH patients at 142cm for females and 157cm for non-classic CAH. The study highlights the need for early diagnosis and optimized treatment of CAH.    

 Inference time: 0:00:06.124923 

 Number of words: 112


In [101]:
response, latency = call_bedrock(models["j2"], prompt_data)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

johns medical college hospital ( jmh ) , bangalore by the department of endocrinology on patients diagnosed to have congenital adrenal hyperplasia 

 Inference time: 0:00:01.172991 

 Number of words: 22


In [102]:
response, latency = call_bedrock(models["titan"], prompt_data)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

A genetic disorder that results in abnormalities in the biosynthesis of steroid precursors is called congenital adrenal hyperplasia ( CAH). The clinical range varies from the mild, non-classical kind to the severe, classical kind, and is brought on by an enzyme shortage. 29 people were enrolled in the research, including 22 women and 7 men. 11 of them were adults, and 18 of them were children. While 9 infants were discovered at birth owing to genital ambiguity, 1 presented with adrenal crisis symptoms at four weeks of age, 4 patients presented in the pre-pubertal period due to early adrenarche, 5 patients presented in the late adolescent period with marked virilization, and 3 patients presented in the late adolescent period with features of polycystic ovarian disease, 22 women were identified as having classic cah. The mean adult height of those with nccah was higher than that of those with classic cah, although none of the nccah patients received glucocorticoids and all children were 

In [103]:
# how long was our example text?
len(prompt_data.split(" "))

919
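
The word counts printed above use `split(" ")`, which does not split on newlines and counts an empty string for every extra space. This may partly explain why Claude reports 101 words while the cell prints 112. A minimal sketch of a more robust count, using a hypothetical helper `count_words`:

```python
def count_words(text):
    """Count whitespace-separated tokens.

    str.split() with no argument splits on any run of whitespace
    (spaces, tabs, newlines) and drops empty strings.
    """
    return len(text.split())


sample = "Summary (3 words):\nfirst  second third \n"
# Compare the naive count used in the notebook with the robust one.
print(len(sample.split(" ")), count_words(sample))
```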

# Summarization
We now refine the prompt: the summary should consist of 5 bullet points, and the style should be appropriate for a high-school student.

### Anthropic Claude Summarisation Deep Dive
For Anthropic Claude, the prompt should roughly follow this order:
- Start with "\n\nHuman:"
- Give the task context
- Give a detailed task description, including rules and exceptions. Explain it as you would to a new employee with no context.
- Give examples of inputs and outputs (optional, but improves formatting and accuracy)
- Demonstrate the output format you want; this also curbs Claude's chattiness
- End with "\n\nAssistant:"

Optionally pre-fill the start of the answer after "Assistant:", for example: Assistant: My Answer is (
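
The ordering above can be sketched as a small prompt builder. This is only an illustration: the function name `build_claude_prompt`, its parameters, and the choice to wrap the instruction in an `<instruction>` tag are assumptions, not a fixed API.

```python
def build_claude_prompt(instruction, text, context=None, prefill="My Answer is ("):
    """Assemble a Human/Assistant prompt with XML-tagged sections.

    Order: optional context, instruction, input text, then the
    Assistant turn with an optional pre-filled start of the answer.
    """
    parts = []
    if context:
        parts.append(f"<context>{context}</context>")
    parts.append(f"<instruction>{instruction}</instruction>")
    parts.append(f"<text>{text}</text>")
    body = "\n".join(parts)
    return f"\n\nHuman: {body}\n\nAssistant: {prefill}"


prompt = build_claude_prompt(
    instruction="Summarize the text in 5 sentences as bullet points.",
    text="congenital adrenal hyperplasia ( cah ) refers to ...",
    context="Medical document.",
)
print(prompt)
```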


In [106]:
claude_prompt = f"""\n\nHuman: 
Summarize the following text in 5 sentences as bullet points. Style should be appropriate for a high-school student. 
<text>{example_text}</text>

Assistant: My Answer is (
"""

In [105]:
response, latency = call_bedrock(models["claude"], claude_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

• cah refers to enzyme deficiency leading to disorders in steroid production. 
• Variants range from milder non-classical type to severe classic type. 
• Study looked at 29 cah patients, mostly children and women, in India.
• Common presentations were virilization in women and adrenal crisis in men.
• Short stature was found in 1/3 of patients; adult height lower than average.    
) 

 Inference time: 0:00:04.298292 

 Number of words: 64


#### Claude: Adding \<context\> and  Switching instruction order 

In [108]:
claude_prompt = f"""\n\nHuman: 
<context>Medical document, describing genetics defect and its symptoms.</context>
<text>{example_text}</text>

<instruction>Summarize the following text in 5 sentences as bullet points. Style should be appropriate for a high-school student.</instruction>
Assistant: My Answer is 
"""
response, latency = call_bedrock(models["claude"], claude_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))


•There are types of congenital adrenal hyperplasia (CAH) resulting from enzyme deficiencies that affect steroid production. 
•The most common form is 21-hydroxylase deficiency, ranging from mild (non-classical) to severe (classic) CAH.  
• Patients (especially infants) with classic CAH may experience adrenal crisis without treatment.
• Treatment focuses on hormone replacement and growth promotion. 
•The study found that female classic CAH patients often presented late, average adult height was below normal, and some patients received excess treatment. 

 Inference time: 0:00:04.937051 

 Number of words: 76


In [109]:
claude_prompt = f"""\n\nHuman: 
<instruction>Summarize the following text in 5 sentences as bullet points. Style should be appropriate for a high-school student.</instruction>
<context>Medical document, describing genetics defect and its symptoms.</context>
<text>{example_text}</text>
Assistant: My Answer is
"""
response, latency = call_bedrock(models["claude"], claude_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))


• A genetic defect causes an enzyme deficiency leading to impaired steroid production.

• The severity varies from mild to severe, causing different symptoms.

• The study analyzes 29 patients, mostly females, with the condition.

• Infants often present with adrenal crisis; girls may have genital ambiguity. 

• Treatment and outcomes vary; short stature and suppressed hormone levels are common. 

 Inference time: 0:00:04.228537 

 Number of words: 57


### Learnings Claude
- Start with \n\nHuman:
- Wrap context and input in XML tags.
- Context helps the model match the requested level of depth; without it, the summary stays too close to the medical jargon of the source.
- Reversing the order of instruction and text does not change the result much. This may only hold for shorter inputs.
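
Because the first Claude prompt pre-fills the answer with an opening parenthesis, the completion can end with a stray `)`, as in the first output above. A minimal post-processing sketch, assuming a hypothetical helper `parse_bullets` and that bullets start with the `•` character seen in the outputs:

```python
def parse_bullets(response):
    """Return the bullet texts from a completion.

    Lines that do not start with a bullet character, such as a
    stray closing ')' left over from the pre-filled answer, are ignored.
    """
    bullets = []
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("•"):
            bullets.append(line.lstrip("•").strip())
    return bullets


sample = "• first point. \n•second point.    \n) "
print(parse_bullets(sample))
```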


## Titan
Now we look at the Titan model.

In [110]:
titan_prompt = f"""Summarize the text below in 5 sentences as bullet points. Style should be appropriate for a high-school student.
Text: {example_text}

The text is about: 
"""
response, latency = call_bedrock(models["titan"], titan_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

1. Congenital adrenal hyperplasia ( CAH ) is a group of autosomal recessive disorders caused by an enzyme deficiency.
2. The clinical spectrum may vary from mild to severe, depending on the degree of 21-hydroxylase deficiency.
3. The non-classical form of CAH is more common, with a prevalence rate of 1 in 1000.
4. Management during childhood is focused on achieving normal growth and attaining appropriate final adult height.
5. 17-hydroxyprogesterone (17 OHp) levels are used to assess therapy, and values between 1 ng/mL and 12 ng/mL are considered appropriate. 

 Inference time: 0:00:05.352086 

 Number of words: 86


In [111]:
titan_prompt = f"""
{example_text}

Instruction: Summarize the text above in 5 sentences as bullet points. Style should be appropriate for a high-school student.
The text is about: 
"""
response, latency = call_bedrock(models["titan"], titan_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

1. Congenital adrenal hyperplasia is a group of autosomal recessive disorders.
2. It is caused by an enzyme deficiency, which leads to defects in the biosynthesis of steroid precursors.
3. Depending on the severity and degree of 21-hydroxylase deficiency, the clinical spectrum may vary from mild to classic.
4. The non-classical form is more common and affects 1 in 1000 people.
5. Management in childhood is focused on achieving normal growth and final adult height. 

 Inference time: 0:00:04.786722 

 Number of words: 71


### Learning Titan:
#### regular order: 
Start directly with the instruction, without a leading new line. Introducing the text with a "Text:" label helps. Close the prompt with an output indicator that refers back to the task.
#### reversed order: 
Titan also responds well to the reversed prompt. Adding a new line at the top helps it separate the example text from the instruction.

An output indicator must be given, ideally one that references the task. It should directly follow the instruction when the order is \<text\>\<instruction\>\<output indicator\>.
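
The two Titan layouts can be captured in one small builder. This is a sketch only: `build_titan_prompt` and its parameters are hypothetical names, and the string layouts simply mirror the two Titan prompts used above.

```python
def build_titan_prompt(text, instruction,
                       output_indicator="The text is about:",
                       reversed_order=False):
    """Build a Titan prompt in regular or reversed order."""
    if reversed_order:
        # <text><instruction><output indicator>, with a leading new line
        # that separates the text from the instruction.
        return f"\n{text}\n\nInstruction: {instruction}\n{output_indicator} \n"
    # <instruction><text><output indicator>, text introduced by "Text:".
    return f"{instruction}\nText: {text}\n\n{output_indicator} \n"


print(build_titan_prompt("some article text", "Summarize the text below."))
print(build_titan_prompt("some article text", "Summarize the text above.",
                         reversed_order=True))
```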

### AI21 

In [112]:
j2_prompt = f"""Instruction:
Summarize the text below in 5 sentences as bullet points. Style should be appropriate for a high-school student.
{example_text}

The text is about:
"""
response, latency = call_bedrock(models["j2"], j2_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

Congenital adrenal hyperplasia (CAH) refers to a group of autosomal recessive disorders caused by an enzyme deficiency which leads to defects in biosynthesis of steroid precursors. Depending on the severity and degree of 21 hydroxylase deficiency, the clinical spectrum may vary from mild form of non classical CAH to classic CAH. However, the non classical CAH variant is more common with a prevalence rate of 1 in 1000. It helps maintain normal levels of precursors by suppressing adreno corticotrophic hormone (ACTH). The management is focused on achieving normal growth and attaining appropriate final adult height. 

 Inference time: 0:00:03.026570 

 Number of words: 95


In [113]:
j2_prompt = f"""{example_text}
Instruction:
Summarize the text above in 5 sentences as bullet points. Style should be appropriate for a high-school student.

The text is about:
"""
response, latency = call_bedrock(models["j2"], j2_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))


*A group of disorders caused by an enzyme deficiency that leads to defects in biosynthesis of steroid precursors.
*The severity of 21 hydroxylase deficiency determines the clinical spectrum, which varies from mild to classic.
*The management is mostly focused on achieving normal growth and final adult height during childhood.
*A study was conducted on 29 patients diagnosed with CAH and seen between 2012 and 2012, and data was collected.
*The study reported that most classical CAH patients had short stature and suppressed 17 OHP levels, while most NCCAH patients had adequate 17 OHP levels. 

 Inference time: 0:00:03.497687 

 Number of words: 90


### Learning J2:
Adding an "Instruction:" tag helps the model follow what you ask of it. 
A "Context:" tag does not seem to help. 
#### regular order: 
Does not work consistently (the output above is a paragraph rather than bullet points), so we abandon it.
#### reversed order: 
Start directly with the text to summarize, with no leading new line and no indicator that it is the text. The order is: 
\<text\>\<instruction\>\<output indicator\> 
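
Whether a completion actually follows the requested bullet format can be checked mechanically, which makes comparisons like the one above less subjective. The sketch below assumes hypothetical helpers `count_bullets` and `follows_format`, and treats lines starting with `*`, `•`, `-`, or a number followed by a dot as bullets.

```python
import re

# A bullet line starts with *, a bullet character, a dash, or "N." and has some text.
BULLET = re.compile(r"^\s*(?:[*•-]|\d+\.)\s*\S")


def count_bullets(response):
    """Count the lines of a completion that look like bullet points."""
    return sum(1 for line in response.splitlines() if BULLET.match(line))


def follows_format(response, expected=5):
    """True if the completion has exactly the expected number of bullets."""
    return count_bullets(response) == expected


paragraph = "CAH refers to a group of disorders. The spectrum may vary."
bulleted = "*A group\n*The severity\n*The management\n*A study\n*The study"
print(follows_format(paragraph), follows_format(bulleted))
```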

# Question Answering
For Q&A on the medical dataset we are going to use the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset. We will pick different Q&A pairs to evaluate the evolution of our prompt. 

In [9]:
# Read the leaflet text; the context manager closes the file afterwards
with open("./painkiller_leaflet_short.txt", "r") as f:
    instapainrelief_text = f.read()


### Claude model


In [115]:
claude_prompt = f"""\n\nHuman: 
Given the text below, answer the following questions as found in the text. Answer like you would be talking to a five year old. 
<text>{instapainrelief_text}</text>

<question>Can I drink Alcohol while taking the medicine?</question>

Assistant: My Answer is (
"""

response, latency = call_bedrock(models["claude"], claude_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))


No, you should avoid drinking alcohol while taking InstaPainRelief tablets. 

 Inference time: 0:00:03.021411 

 Number of words: 10


In [116]:
claude_prompt = f"""\n\nHuman: 
Given the text below, answer the following questions as found in the text. Answer like you would be talking to a five year old. If you do not know the answer, respond with "I don't know".
<text>{instapainrelief_text}</text>

<question>In what package size is InstaPainRelief available?</question>

Assistant: My Answer is (
"""

response, latency = call_bedrock(models["claude"], claude_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))


I don't know

) the package size of InstaPainRelief. This information is not provided in the given text. 

 Inference time: 0:00:03.437955 

 Number of words: 17
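
In the output above the model starts with the fallback phrase but then keeps writing. If downstream code needs to know whether the question was answerable, a simple prefix check is one option. This is a minimal sketch with a hypothetical helper `is_fallback`; it assumes the fallback phrase appears at the start of the completion.

```python
def is_fallback(response, fallback="I don't know"):
    """True if the completion begins with the fallback phrase.

    Leading whitespace and a leading '(' (from a pre-filled answer)
    are ignored, and the comparison is case-insensitive.
    """
    normalized = response.strip().lstrip("(").strip().lower()
    return normalized.startswith(fallback.lower())


sample = "\nI don't know\n\n) the package size of InstaPainRelief."
print(is_fallback(sample))
print(is_fallback("No, you should avoid drinking alcohol."))
```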


### Titan model

In [117]:
titan_prompt = f"""{instapainrelief_text}

Given the text above, answer the following questions as found in the text. Answer like you would be talking to a five year old. 

Question: Can I drink Alcohol while taking the medicine?
Answer: 
"""

response, latency = call_bedrock(models["titan"], titan_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

No, do not drink Alcohol while taking the medicine, it may make you feel sick. 

 Inference time: 0:00:04.007388 

 Number of words: 15


### J2 model

In [118]:
j2_prompt = f"""{instapainrelief_text}
Given the text above, answer the following questions as found in the text. Answer like you would be talking to a five year old. 

Question: Can I drink Alcohol while taking the medicine?
Answer: 
"""
response, latency = call_bedrock(models["j2"], j2_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

It is not recommended to take Alcohol while taking this medicine as it may intensify side effects. 

 Inference time: 0:00:01.010567 

 Number of words: 17


# Text Generation - Chemical reaction 

### Claude 

In [119]:
claude_prompt = f"""\n\nHuman: 
Sulfuric acid reacts with sodium chloride, and gives <chemical1>_____</chemical1> and <chemical2>_____</chemical2>:

Assistant: the chemical1 and chemical 2 are:
"""

response, latency = call_bedrock(models["claude"], claude_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))


Sulfuric acid reacts with sodium chloride, and gives <chemical1>sodium sulfate</chemical1> and <chemical2>hydrogen chloride</chemical2>: 

 Inference time: 0:00:01.216493 

 Number of words: 13
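
Asking Claude to fill in XML tags makes the answer easy to extract programmatically. The sketch below is one way to do this with a regular expression; `extract_tags` is a hypothetical helper, and it only handles simple, non-nested tags like the ones in the output above.

```python
import re


def extract_tags(response):
    """Return {tag: content} for every well-formed <tag>content</tag> pair."""
    pattern = re.compile(r"<(\w+)>(.*?)</\1>", flags=re.DOTALL)
    return {m.group(1): m.group(2).strip() for m in pattern.finditer(response)}


sample = (
    "Sulfuric acid reacts with sodium chloride, and gives "
    "<chemical1>sodium sulfate</chemical1> and "
    "<chemical2>hydrogen chloride</chemical2>:"
)
print(extract_tags(sample))
```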


### Titan

In [120]:
titan_prompt =f"""Question:
Sulfuric acid reacts with sodium chloride, and gives _____ and _____:

Answer: the chemical compounds are:
"""

response, latency = call_bedrock(models["titan"], titan_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

sodium sulfate and sulfur dioxide 

 Inference time: 0:00:01.022524 

 Number of words: 5


### J2 model

In [123]:
j2_prompt = f"""Question:
Sulfuric acid reacts with sodium chloride, and gives _____ and _____:

Answer: the chemical compounds are:
"""
response, latency = call_bedrock(models["j2"], j2_prompt)
print(response, "\n\n", "Inference time:", latency, "\n\n", "Number of words:", len(response.split(" ")))

sodium sulfate and hydrogen chloride. 

 Inference time: 0:00:00.562379 

 Number of words: 5
