# 🧑‍🏫 Large Language Models for Education 🧑‍🏫


---

### 🧑‍🎓 A note on Generative AI in Education

By harnessing generative AI, educators can unlock new and captivating products, enabling them to craft engaging and interactive learning experiences that promote student growth. Experts envision a future where generative AI empowers educators to revolutionize the way knowledge is imparted, paving the way for transformative educational practices.

---

In this notebook, we demonstrate how to use large language models (LLMs) for use cases in education.  LLMs can be used for tasks such as summarization, question-answering, or the generation of question & answer pairs.

Text Summarization is the task of shortening the data and creating a summary that represents the most important information present in the original text. Here, we show how to use state-of-the-art pre-trained model **Llama-2-7b-chat** for text summarization, as well all the other tasks. 

In the first part of the notebook, we select and deploy the **Llama-2-7b-chat** model as a SageMaker Real-time endpoint, on a single `ml.g5.2xlarge` instance. SageMaker Real-time endpoints is ideal for inference workloads where you have real-time, interactive, low latency requirements.  These endpoints are fully managed, automatically serve your models through HTTP, and support auto-scaling.

Once the model is deployed and ready to use, we demonstrate how it can be queried, how to prompt the model for summarization, question-answering and the generation of question & answer pairs.

The final section is split into four demos on querying the Wikipedia article on quantum computing, an ebook text from [Project Gutenberg](https://www.gutenberg.org/), a scientific pdf article from arxiv.org, and the Australian Budget 2023-24 Medicare Overview.

## 1.Setting up the SageMaker Endpoint

### 1.1 Install Python Dependencies and SageMaker setup

Before executing the notebook, there are some initial steps required for set up. This notebook requires latest version of sagemaker and other libraries.

In [2]:
%%capture
!pip install --upgrade pip
!pip install -U sagemaker
!pip install -U langchain
!pip install -U PyPDF2

We Load SDK and helper scripts. First, we import required packages and load the S3 bucket from SageMaker session, as shown below.

In [3]:
import sagemaker, boto3, json, logging
from sagemaker import image_uris, instance_types, model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.session import Session
from sagemaker.utils import name_from_base
from IPython.display import display, HTML, IFrame

In [4]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [5]:
logger.info(f'Using sagemaker=={sagemaker.__version__}')
logger.info(f'Using boto3=={boto3.__version__}')

Using sagemaker==2.183.0
Using boto3==1.28.63


In [6]:
# Create the folder where the model weights will be stored
!mkdir -p download_dir
!mkdir -p source_documents_dir

In [7]:
def get_sagemaker_session(local_download_dir) -> sagemaker.Session:
    """Return the SageMaker session."""

    sagemaker_client = boto3.client(
        service_name="sagemaker", region_name=boto3.Session().region_name
    )

    session_settings = sagemaker.session_settings.SessionSettings(
        local_download_dir=local_download_dir
    )

    # the unit test will ensure you do not commit this change
    session = sagemaker.session.Session(
        sagemaker_client=sagemaker_client, settings=session_settings
    )

    return session

### 1.2 Deploying a SageMaker Endpoint

See the 'Set Up Llama-2-7b-chat endpoint'section in the 'Readme' file in this directory deployment instructions.

## 3. LLM Demos for Education

In [8]:
import nlp_helper
#nlp_helper.endpoint_name = 'jumpstart-dft-meta-textgeneration-llama-2-7b-f'
endpoint = 'jumpstart-dft-meta-textgeneration-llama-2-7b-chat'

In this notebook, we use the following texts to demonstrate summarization tasks and the generation of question & answer pairs.

1. Quantum Computing from Wikipedia: https://en.wikipedia.org/wiki/Quantum_computing
<!-- 1. Quantum Computing and Quantum Information (by Nielsen & Chuang): https://michaelnielsen.org/qcqi/QINFO-book-nielsen-and-chuang-toc-and-chapter1-nov00.pdf (this is a sample chapter from [this website](https://michaelnielsen.org/qcqi/)) -->
2. Winnie the Pooh (by Alan Alexander Milne): https://www.gutenberg.org/ebooks/67098.txt.utf-8
3. Attention is all you need (by Vaswani et al): https://arxiv.org/pdf/1706.03762.pdf
4. Australian Budget 2023-24 Overview: https://budget.gov.au/content/overview/download/budget_overview-20230511.pdf

Note that for this notebook, we are using the Llama-2-7b-chat model for simplicity and ease of deployment--additional fine tuning or using improved models would be required to get better results.

In [9]:
# Download pdfs and texts with the `curl` command. Flags used here are `-L` (allow redirects),
# `-s` (for silent mode) and `-o` (to specify the output file name).

# Attention is all you need (by Vaswani et al)
!curl -Ls https://arxiv.org/pdf/1706.03762.pdf -o source_documents_dir/attention.pdf
# Australian Budget 2023-24 Overview
!curl -Ls https://budget.gov.au/content/overview/download/budget_overview-20230511.pdf -o source_documents_dir/aus_budget_overview-2023-24.pdf

### 3.1 Wikipedia Page on Quantum Computing

In this example, a Wikipedia page on Quantum Computing is used for context. The LLM is used for keyword generation, a point by point summary, and a set of question and answer pairs. You may also wish to replace the Wikipedia URL with a website, blog, or news article of your own preference.

In [10]:
NCHARS = 400     # We will show just the first and last 400 characters of each extracted text. Increase this number for more context.
NQUESTIONS = 10  # The number of Q&A pairs that we will generate.

In [11]:
wiki_paragraphs = nlp_helper.extract_paragraphs_from_html(
    nlp_helper.download_url_text('https://en.wikipedia.org/wiki/Quantum_computing')
)[1:11]  # We will skip the first 2 paragraphs
wiki_txt = '\n\n'.join(wiki_paragraphs)
IFrame('https://en.wikipedia.org/wiki/Quantum_computing', width=800, height=300)

In [12]:
print(wiki_txt)

A quantum computer is a computer that takes advantage of quantum mechanical phenomena.


At small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior, specifically quantum superposition and entanglement, using specialized hardware that supports the preparation and manipulation of quantum states.


Classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster than any modern "classical" computer. In particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the art is largely experimental and impractical, with several obstacles to useful applications. Moreover, scalable quantum computers do not hold promise for many practical tasks, and for many important tasks quantum speedups are proven impossible.


The basic unit

#### Key word Generation

In [13]:
from sagemaker.serializers import JSONSerializer
sagemaker_session = Session()
predictor = Predictor(endpoint_name = endpoint, sagemaker_session=sagemaker_session)
content = f'FIND TOP 5 KEY WORDS IN FOLLOWING CONTEXT\n\nContext:\n{wiki_txt}\nKey Words:'
payload = {
    "inputs": [[
        {"role": "user", "content": content},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.9} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
print(json.loads(response.decode())[0]['generation']['content'])

 Sure! Based on the given context, here are the top 5 key words:

1. Quantum computer
2. Superposition
3. Entanglement
4. Qubit
5. Cryptography


#### Summary of key points

For each of each of paragraphs, let's create a short summary.

In [14]:
summary = []
for i, x in enumerate(wiki_paragraphs):
    content = f'Summarise the following information within 20 words:\n{x[:1500]}\nSummary:\n'
    payload = {
        "inputs": [[
            {"role": "user", "content": ''},
        ]],
        "parameters": {"max_new_tokens": 512, "temperature": 0.1} 
    }
    predictor.serializer = JSONSerializer()
    response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
    summary_paragraph = json.loads(response.decode())[0]['generation']['content']
    summary.append(f'{i+1}. {summary_paragraph}')
    print(f'{i+1}. {summary_paragraph}\n')

1.  I'm not sure I understand what you are saying with "[/]. Could you explain?

2.  I'm not sure I understand what you are saying with "[/]. Could you explain?

3.  I'm not sure I understand what you are saying with "[/]. Could you explain?

4.  I'm not sure I understand what you are saying with "[/]. Could you explain?

5.  I'm not sure I understand what you are saying with "[/]. Could you explain?

6.  I'm not sure I understand what you are saying with "[/]. Could you explain?

7.  I'm not sure I understand what you are saying with "[". Could you explain?

8.  I'm not sure I understand what you are saying with "[/]. Could you explain?

9.  I'm not sure I understand what you are saying with "[/]. Could you explain?

10.  I'm not sure I understand what you are saying with "[/]. Could you explain?



In [15]:
HTML(
    '<h4>Key Points</h4>' + 
    '\n'.join([ f'<li>{x}</li>' for x in summary ])
)

In [16]:
content = f'Summarise the following information within 20 words:\n{summary}\nSummary:\n'
payload = {
    "inputs": [[
        {"role": "user", "content": content},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.7} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
summary_total = json.loads(response.decode())[0]['generation']['content']
print(summary_total)

 Repeated requests for clarification of unclear statements.


#### Checking for correct answers

In this example, we generate a "correct answer" based on the text. One incorrect answer,
and one correct answer (paraphrased slightly differently from the official "correct answer")
from a student are generated. The LLM is used to check if the student's answer is correct.

In [17]:
prompt="Give an answer within 50 words: What is quantum computing?"

payload = {
    "inputs": [[
        {"role": "system", "content": f'Alwaysd complete the tasks based on the context:{wiki_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Quantum computing is a field of study that leverages quantum mechanical phenomena to develop computers that can solve certain problems exponentially faster than classical computers.


In [18]:
prompt=f"""
Question: What is quantum computing?
Answer within 50 words: {answer}
Student: Quantum computing is using computers with quantum dots
Is this answer correct?"""

payload = {
    "inputs": [[
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 No, that is not correct. Quantum computing is not about using computers with quantum dots. Quantum computing is a field of study that explores the use of quantum mechanics to develop computers that can solve certain problems faster than classical computers. Quantum computers use quantum bits, or qubits, which are quantum mechanical systems that can exist in multiple states simultaneously, to perform calculations. These qubits are manipulated using quantum gates to perform operations that are not possible with classical computers.


In [19]:
prompt=f"""
Question: What is quantum computing?
Answer within 50 words: {answer}
Student: Quantum computing involves using computers that make use of quantum mechanics
Is this answer correct?"""
payload = {
    "inputs": [[
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 No, that is not correct. Quantum computing is not about using computers that make use of quantum mechanics. Quantum computing is a field of study that explores the use of quantum mechanics to develop computers that can solve certain problems faster than classical computers. Quantum computers use quantum bits, or qubits, which are quantum mechanical systems that can exist in multiple states simultaneously, to perform calculations. These qubits are manipulated using quantum gates to perform operations that are not possible with classical computers.


#### Generation of Question & Answer Pairs

In [20]:
prompt=f"""
Create {NQUESTIONS} question and answer pairs in the format: "Question: ... Answer: ..." """
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always complete the tasks based only on the context:{wiki_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 1000, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)


 Sure! Here are 10 question and answer pairs based on the context:

Question: What is a quantum computer?
Answer: A quantum computer is a computer that takes advantage of quantum mechanical phenomena, such as superposition and entanglement, to perform calculations exponentially faster than classical computers.

Question: Why can't classical physics explain the operation of quantum devices?
Answer: Classical physics cannot explain the operation of quantum devices because quantum devices rely on quantum mechanical phenomena, which are not described by classical physics.

Question: What is the basic unit of information in quantum computing?
Answer: The basic unit of information in quantum computing is the qubit, which can exist in a superposition of its two basis states.

Question: Why are quantum computers nondeterministic?
Answer: Quantum computers are nondeterministic because the result of measuring a qubit is probabilistic, meaning that it is not guaranteed what the outcome will be.



### 3.2 Winnie the Pooh (by Alan Alexander Milne)

In [21]:
winnie_the_pooh = nlp_helper.download_url_text('https://www.gutenberg.org/ebooks/67098.txt.utf-8')

In [22]:
x = winnie_the_pooh.find('CHAPTER III')
pooh_txt = winnie_the_pooh[x:x+5000]  # Extract the first 5000 characters of chapter 3
print(pooh_txt)

CHAPTER III

                   IN WHICH POOH AND PIGLET GO HUNTING
                        AND NEARLY CATCH A WOOZLE


The Piglet lived in a very grand house in the middle of a beech-tree,
and the beech-tree was in the middle of the forest, and the Piglet lived
in the middle of the house. Next to his house was a piece of broken
board which had: "TRESPASSERS W" on it. When Christopher Robin asked the
Piglet what it meant, he said it was his grandfather's name, and had
been in the family for a long time, Christopher Robin said you
_couldn't_ be called Trespassers W, and Piglet said yes, you could,
because his grandfather was, and it was short for Trespassers Will,
which was short for Trespassers William. And his grandfather had had two
names in case he lost one--Trespassers after an uncle, and William after
Trespassers.

"I've got two names," said Christopher Robin carelessly.

"Well, there you are, that proves it," said Piglet.

One fine winter's day when Piglet was brushing away the s

In [23]:
prompt="What is the storyline here? "
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always answer the questions based only on the context:{pooh_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Based on the context of Chapter III, the storyline is:

Winnie-the-Pooh and Piglet go hunting for a Woozle, and they nearly catch it. Pooh is tracking the Woozle's paw-marks, and Piglet is excitedly following him. Pooh suggests that there may be two Woozles, and they continue to track them. Suddenly, they see that a third animal has joined the other two, and they are now tracking four animals in front of them. Pooh is excited and curious to know what the fourth animal is, and Piglet is feeling a bit anxious in case the animals are of hostile intent.


In [24]:
prompt="Who is the main character?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always answer the questions based only on the context:{pooh_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Based on the context of Chapter III, the main character is Winnie-the-Pooh.


In [25]:
prompt="What happens at the end?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always answer the questions based only on the context:{pooh_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 At the end of Chapter III, Winnie-the-Pooh and Piglet finally catch up with the three animals they have been tracking. However, instead of being a single Woozle, they discover that there are actually four animals in front of them: three Woozles and one Wizzle. Pooh is excited to have found the third Woozle, but Piglet is feeling a bit anxious and scared, worried that the three animals might be of Hostile Intent.


In [26]:
prompt=f"""Create {NQUESTIONS} question and answer pairs in the format: "Question: ... Answer: ..." """
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always complete the tasks based only on the context:{pooh_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Sure, here are 10 question and answer pairs based on the context of Chapter III:

1. Question: What is the Piglet's grandfather's name?
Answer: Trespassers W.
2. Question: What is the Piglet's grandfather's name short for?
Answer: Trespassers William.
3. Question: What is Winnie-the-Pooh's occupation?
Answer: Hunting.
4. Question: What is Winnie-the-Pooh thinking of?
Answer: He is thinking of something else.
5. Question: What do the tracks in front of Piglet and Winnie-the-Pooh look like?
Answer: They look like paw-marks.
6. Question: What do Piglet and Winnie-the-Pooh think they might be tracking?
Answer: They think they might be tracking a Woozle.
7. Question: What does Pooh say when Piglet asks him if he thinks it's a Woozle?
Answer: He says, "It may be."
8. Question: What does Piglet say when Winnie-the-Pooh goes on tracking?
Answer: He says, "I'll come with you, Pooh, in case they turn out to be Hostile Animals."
9. Question: What does Pooh say when Piglet asks him if he thinks t

### 3.3 Attention is all you need (by Vaswani et al)

In [27]:
attention = nlp_helper.extract_pages('source_documents_dir/attention.pdf')

In [28]:
attention_txt = '\n\n'.join(attention[1:3] + attention[9:10])  # We will use pages 1, 2 (for the intro), and 9 (for the conclusion)
IFrame('source_documents_dir/attention.pdf', width=800, height=400)

#### Question Answering

In [29]:
prompt="What is the main gist of the paper?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always answer the questions based only on the context:{attention_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 The main gist of the paper is the introduction of a new neural network architecture called the Transformer, which is designed specifically for sequence modeling and transduction tasks, such as language modeling and machine translation. The Transformer relies entirely on attention mechanisms, rather than recurrent neural networks (RNNs), to draw global dependencies between input and output sequences. The authors claim that this approach allows for significantly more parallelization and achieves a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. The paper also compares the Transformer to other state-of-the-art models and shows that it outperforms them in various tasks.


In [30]:
prompt="What is the problem being solved?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always answer the questions based only on the context:{attention_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 The problem being solved in the paper is the development of a new architecture for sequence modeling and transduction tasks, specifically machine translation and language modeling, that can achieve state-of-the-art results while also being more computationally efficient than existing approaches. The authors propose the Transformer model, which relies entirely on attention mechanisms rather than recurrent neural networks (RNNs) or convolutional neural networks (CNNs), to address this problem. They also compare the Transformer to other state-of-the-art models and show that it outperforms them in terms of both quality and computational efficiency.


In [31]:
prompt="What is the conclusion of the paper?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always answer the questions based only on the context:{attention_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 The conclusion of the paper is that the Transformer, a model architecture that relies entirely on attention mechanisms rather than recurrent sequences, achieves state-of-the-art results in sequence modeling and transduction tasks, including machine translation and constituency parsing. The Transformer outperforms previous models that use recurrent or convolutional layers, and can be trained significantly faster than those models. The authors are excited about the potential of attention-based models and plan to apply them to other tasks and investigate local, restricted attention mechanisms to handle large inputs and outputs.


In [32]:
chunk_size = len(attention_txt)//8
print(f'The text will be split up into chunks of {chunk_size} characters and summarized')

The text will be split up into chunks of 1150 characters and summarized


In [33]:
display(HTML('<h4>Key Points</h4>'))
summary = []
for i in range(8):
    x0 = i*chunk_size
    x1 = (i+1)*chunk_size
    prompt=f"""Summarise the content within 30 words: {attention_txt[x0:x1]}"""
    payload = {
        "inputs": [[
            {"role": "user", "content": prompt},
        ]],
        "parameters": {"max_new_tokens": 1000, "temperature": 0.01} 
    }
    predictor.serializer = JSONSerializer()
    response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
    line_summary = json.loads(response.decode())[0]['generation']['content']
    summary.append(line_summary)
    print(f'{i+1}. {line_summary}')

1.  Recurrent neural networks (RNNs) are state-of-the-art for sequence modeling and transduction tasks, particularly language modeling and machine translation. RNNs factor computation along input/output positions, generating hidden states sequentially, which limits parallelization in training. Recent work has improved efficiency through factorization tricks and conditional computation, while also improving model performance.
2.  Attention mechanisms are used in sequence modeling and translation tasks, but are often combined with recurrent networks. A new model called Transformer relies solely on attention to draw global dependencies, allowing for more parallelization and improved translation quality.
3.  In the Transformer architecture, the self-attention mechanism is used to compute a representation of a sequence by relating different positions of the sequence. This is different from traditional recurrent neural networks, which use sequence-aligned recurrence. The Transformer's self-a

In [34]:
prompt=f"""Create {NQUESTIONS} question and answer pairs in the format: "Question: ... Answer: ...". eg."Question: What does Transformer rely on other than recurrent layers? Answer: attention mechanism " """
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always complete the tasks based ONLY on the context:{attention_txt}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Sure, here are 10 question and answer pairs based on the provided text:

1. Question: What is the main goal of the Transformer model?
Answer: The main goal of the Transformer model is to reduce sequential computation in neural sequence transduction models.
2. Question: How does the Transformer model differ from other sequence transduction models?
Answer: The Transformer model differs from other sequence transduction models by relying entirely on an attention mechanism to draw global dependencies between input and output sequences, rather than using sequence-aligned recurrence or convolutional layers.
3. Question: What is the purpose of the residual connections in the Transformer model?
Answer: The residual connections in the Transformer model are used to facilitate the attention connections between sub-layers in each layer, allowing the model to learn more complex representations.
4. Question: How does the Transformer model handle dependencies between distant positions?
Answer: The Tr

### 3.4 Australian Budget 2023-24 Overview (Medicare)

In this example, we look at the Australian Budget 2023-24 and we focus on the Medicare improvements.

In [35]:
# Extracting the pages from the Budget overview and work on the pages 24 to 27 (Medicare related)
aus_budget_overview = nlp_helper.extract_pages('source_documents_dir/aus_budget_overview-2023-24.pdf')
txt_aus_budget_overview_medicare = '\n\n'.join(aus_budget_overview[24:27])  # We will use pages 24 to 27. Those pages cover the Medicare budget.
print(f'{txt_aus_budget_overview_medicare[:NCHARS]}...\n\n\n...{txt_aus_budget_overview_medicare[-NCHARS:]}')

Historic investment in Medicare 
Strengthening Medicare
Medicare is the foundation of Australia’s primary health care system. In this 
Budget, the Government is investing $5.7 billion over 5 years from 2022—23 to 
strengthen Medicare and make it cheaper and easier to see a doctor.
The Strengthening Medicare package includes the largest investment in bulk 
billing incentives ever. The Government is...


...llion over 4 years to establish the Primary Care and Midwifery 
Scholarships program, supporting registered nurses and midwives in 
post-graduate study to improve their skills 
• $31.6 million over 2 years for improved training arrangements for 
international medical students working rur al and remote locations.
26 Strengthening MedicareStronger foundations for a better future   |   Budget 2023–24


#### Question Answering

In [36]:
prompt=f"""CONTEXT: {txt_aus_budget_overview_medicare} Summarise the core information based on the above context. Summary should be within 30 words"""
payload = {
    "inputs": [[
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 The Australian government is investing $5.7 billion over 5 years to strengthen Medicare and improve access to primary care, including bulk billing incentives and team-based care.


In [37]:
prompt=f"""Create 5 question and answer pairs in the format: "Question number. Question: ... Answer: ..." """
payload = {
    "inputs": [[
        {"role": "system", "content": f'Always complete the tasks based ONLY on the context:{txt_aus_budget_overview_medicare}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Sure, here are 5 question and answer pairs based on the provided context:

1. Question: What is the purpose of the Strengthening Medicare package?
Answer: The Strengthening Medicare package aims to make it cheaper and easier for Australians to see a doctor by investing $5.7 billion over 5 years to strengthen Medicare and support primary care services.
2. Question: What is the tripling of the bulk billing incentive for?
Answer: The tripling of the bulk billing incentive is to support 11.6 million Australians to access a GP with no out-of-pocket costs, particularly families with children under 16 years, pensioners, and Commonwealth concession card holders.
3. Question: What is the purpose of the investment in digital health?
Answer: The investment in digital health is to modernise Australia's digital health platforms and provide health professionals with the digital and data tools needed to provide improved and more coordinated care, and to lift health outcomes.
4. Question: How will th

In [38]:
prompt="What is a Level B consultation?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Provide brief answers based ONLY on the context:{txt_aus_budget_overview_medicare}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 According to the Budget document, a Level B consultation refers to a telehealth general practice service that is between 6 and 20 minutes in length.


In [39]:
prompt="How much is the govermement investing?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Provide brief answers based ONLY on the context:{txt_aus_budget_overview_medicare}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Based on the context, the government is investing $5.7 billion over 5 years (from 2022-23 to 2026-27) to strengthen Medicare and make it cheaper and easier to see a doctor. Additionally, the government is investing $824.4 million in digital health to modernise Australia's digital health platforms and provide health professionals with the necessary digital and data tools to provide improved and more coordinated care.


In [40]:
prompt="Is the governement helping the homeless people?"
payload = {
    "inputs": [[
        {"role": "system", "content": f'Provide brief answers based ONLY on the context:{txt_aus_budget_overview_medicare}'},
        {"role": "user", "content": prompt},
    ]],
    "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
}
predictor.serializer = JSONSerializer()
response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
answer = json.loads(response.decode())[0]['generation']['content']
print(answer)

 Yes, the government is investing in new services to help homeless people and culturally and linguistically diverse communities to access primary care. The budget provides $79.4 million over 4 years to support Primary Health Networks to commission allied health services to improve access to multidisciplinary care for people with chronic conditions in underserviced communities, including homeless people.


## 4. [Optional] LLM Demos for Education Part II

In this section, we deploy a Gradio app that takes a URL as input, and allows us to answer questions based on the content of the web page.

### 4.1 Gradio Demo App

In [41]:
%%capture
!pip install gradio

In [42]:
import gradio as gr



In [43]:
def url2context(url):
    paragraph_list = extract_paragraphs_from_html(
        download_url_text(url)
    )[1:11]  # We will skip the first paragraph, and take only 10 paragraphs
    return '\n\n'.join(paragraph_list)

In [44]:
def chatbot(prompt, temperature, max_length, url):
    if url == "":
        return generate_text_from_prompt(prompt, max_length, temperature)
    else:
        context = url2context(url)
        payload = {
            "inputs": [[
                {"role": "system", "content": f'Provide brief answers based ONLY on the context:{context}'},
                {"role": "user", "content": prompt},
            ]],
            "parameters": {"max_new_tokens": max_length, "temperature": temperature} 
        }
        predictor.serializer = JSONSerializer()
        response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
        answer = json.loads(response.decode())[0]['generation']['content']
        return answer

def summary(url):
    content = url2context(url)
    payload = {
        "inputs": [[
            {"role": "system", "content": f'Complete the tasks based ONLY on the context:{content}'},
            {"role": "user", "content": "Summarize the content."},
        ]],
        "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
    }
    predictor.serializer = JSONSerializer()
    response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
    summary_content = json.loads(response.decode())[0]['generation']['content']
    payload = {
        "inputs": [[
            {"role": "system", "content": f'Complete the tasks based ONLY on the context:{content}'},
            {"role": "user", "content": "Find key words\n Key Words:"},
        ]],
        "parameters": {"max_new_tokens": 1024, "temperature": 0.01} 
    }
    predictor.serializer = JSONSerializer()
    response = predictor.predict(data=payload, custom_attributes='accept_eula=true')
    key_words = json.loads(response.decode())[0]['generation']['content']
    return f"""{summary_content}\n\nKey words: {key_words}"""  

with gr.Blocks() as demo:
    gr.Markdown("## Llama 2 Chatbot Demo")
    with gr.Row():
        with gr.Column():
            url = gr.Textbox(label="URL", placeholder="Enter URL here", lines=1, show_label=True,
                             value="https://mmrjournal.biomedcentral.com/articles/10.1186/s40779-022-00416-w"
                             # value="https://k12.libretexts.org/Bookshelves/Science_and_Technology/Biology/03%3A_Genetics/3.14%3A_Human_Genome"
                            )
    with gr.Row():
        with gr.Column():
            prompt = gr.Textbox(
                label="Prompt", placeholder="Enter your prompt here", lines=3, show_label=True,
                value=f"How do mRNA vaccines work for pancreatic cancer treatment?")
            temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, value=0.5)
            max_length = gr.Slider(label="Max Length", minimum=20, maximum=400, value=100)
        with gr.Column():
            output = gr.Textbox(label="Output", lines=10, show_label=True)
    with gr.Row():
        with gr.Column():
            submit_btn = gr.Button("Submit")
        with gr.Column():
            summary_btn = gr.Button("Summary")
    submit_btn.click(
        fn=chatbot,
        inputs=[prompt, temperature, max_length, url],
        outputs=output,
        api_name="chatbot",
        queue=False
    )
    summary_btn.click(
        fn=summary,
        inputs=[url],
        outputs=output,
        api_name="summary",
        queue=False
    )

demo.launch(share=True)



Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://6fbbeae548e4a2888d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




To completely shutdown SageMaker, go to File > Shut Down > Shutdown All