# Question Answering with Extracted Documents

There are three ways to run Large Language Models (LLMs) on AWS:

(1) AI services: Amazon Bedrock  
(2) Amazon SageMaker and Amazon SageMaker JumpStart  
(3) ML Infrastructure  

In this notebook, we focus on (2) Amazon SageMaker JumpStart. 

Foundation models are ML models that are capable of understanding and generating human-like language at a massive scale. These LLMs are trained on vast amounts of text data, often on the order of billions of words or more, using techniques such as unsupervised learning and self-supervised learning. Amazon SageMaker JumpStart offers state-of-the-art, [**built-in foundation models**](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-choose.html) which can be used to build your own generative AI solutions and integrate custom solutions with additional SageMaker features. 

Amazon SageMaker JumpStart provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning with SageMaker. You can access the pre-trained models, solution templates, and examples through the JumpStart landing page in Amazon SageMaker Studio or by using the SageMaker Python SDK. To find out more, read the SageMaker JumpStart documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html).

You can now also fine-tune models with your own custom dataset to improve performance in specific domains. For example, this [blog](https://aws.amazon.com/blogs/machine-learning/domain-adaptation-fine-tuning-of-foundation-models-in-amazon-sagemaker-jumpstart-on-financial-data/) describes how to use domain adaptation to fine-tune a GPT-J 6B model on publicly available financial data so that the model can generate more relevant text for financial services use cases.

Within Amazon SageMaker JumpStart, there are models that can perform text-to-text (text2text) generation tasks, such as [**BloomZ 7B1**](https://huggingface.co/bigscience/bloomz-7b1), [**Flan T5 XXL**](https://huggingface.co/google/flan-t5-xxl), and [**Flan UL2**](https://huggingface.co/google/flan-ul2). These models can respond to user questions with generated text answers.

In this notebook, we will demonstrate:

* (1) How to deploy an LLM in SageMaker JumpStart.
* (2) Common use cases of LLMs.
* (3) How to ask an LLM a question with and without providing the context.

**This notebook serves as a template so that you can easily replace the sample dataset with your own to build a custom question answering application.**

## Step 1. Deploy an LLM in SageMaker JumpStart

To better illustrate the idea, let's first deploy the model that is required for the demo. You will need to (1) install the required Python packages, (2) authenticate the use of AWS services by using an AWS role, (3) select a model, and (4) deploy the model.

When you deploy a model from JumpStart, SageMaker hosts the model and deploys an endpoint that you can use for inference. In this notebook, we focus on the deployment of Flan T5 and demo with the Flan T5 SageMaker endpoint. 

(Optional) You may extend the notebook by uncommenting entries in the `_MODEL_CONFIG_` Python dictionary defined below to compare the Flan T5, BloomZ 7B1, and Flan UL2 models.

### 1. Install Libraries

Before executing the notebook, some initial setup is required. This notebook requires the latest version of sagemaker and a pinned version of ipywidgets.

The ipywidgets library allows you to use interactive browser controls for Jupyter notebooks. Examples include basic form controls like sliders, checkboxes and text inputs, and advanced controls like 2D and 3D visualizations.

The Amazon SageMaker Python SDK, shown below as sagemaker, is an open source library for training and deploying machine learning models on Amazon SageMaker.

In [None]:
!pip install --upgrade pip
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
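
After the install cell has run, it can be useful to confirm which versions the kernel actually picked up. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper name, not part of the SageMaker SDK.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two libraries this notebook relies on.
for name in ("sagemaker", "ipywidgets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a freshly upgraded package still reports an old version, restarting the kernel usually resolves it.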

### 2. Initiate SageMaker Session, and AWS Role

To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access.

In [None]:
import json
import time

import boto3
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.session import Session
from sagemaker.utils import name_from_base

# One session object is enough; `sess` is kept as an alias for it.
sagemaker_session = Session()
sess = sagemaker_session

# IAM role and region that this notebook is running with.
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name

# "*" selects the latest available version of each JumpStart model.
model_version = "*"

print(aws_role)

### 3. Select an LLM to Deploy

As mentioned previously, Amazon SageMaker JumpStart provides access to hundreds of built-in algorithms with pretrained models from popular model hubs. You can check [the available models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html) on Amazon SageMaker JumpStart to get the full list.

You can choose the EC2 instance type that best fits your ML use case. The Amazon SageMaker documentation suggests a few different instance types [here](https://docs.aws.amazon.com/sagemaker/latest/dg/cmn-info-instance-types.html), where you can also view their associated costs. You can also use [Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html) to help you select the best instance type and configuration for your ML models and workloads.

By default, the following cell deploys the `huggingface-text2text-flan-t5-small` model on an `ml.g5.xlarge` instance.

(Optional) Feel free to uncomment entries in the `_MODEL_CONFIG_` Python dictionary to compare the Flan T5, BloomZ 7B1, and Flan UL2 models.

In [None]:
_MODEL_CONFIG_ = {
    "huggingface-text2text-flan-t5-small": { 
        "model_predictor": "predic-flan-t5",
        "instance type": "ml.g5.xlarge",
        "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
    },
    # "huggingface-textgeneration1-bloomz-7b1-fp16": {
    #     "model_predictor": "predic-bloomz-7b",
    #     "instance type": "ml.g5.xlarge",
    #     "prompt": """question: \"{question}"\\n\nContext: \"{context}"\\n\nAnswer:""",
    # },
    # "huggingface-text2text-flan-ul2-bf16": {
    #     "model_predictor": "predic-flan-ul2",
    #     "instance type": "ml.g5.24xlarge",
    #     "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
    # }
}
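
Each entry in `_MODEL_CONFIG_` needs an instance type and a prompt template containing the `{context}` and `{question}` placeholders that later cells fill in. A small check like the one below can catch a typo before you spend several minutes on a deployment. This is only a sketch: `check_model_config` is a hypothetical helper, and the sample dictionary mirrors the structure above rather than being read from it.

```python
REQUIRED_KEYS = ("instance type", "prompt")
REQUIRED_PLACEHOLDERS = ("{context}", "{question}")


def check_model_config(model_config):
    """Return a list of human-readable problems found in a model config dict."""
    problems = []
    for model_id, entry in model_config.items():
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{model_id}: missing key '{key}'")
        prompt = entry.get("prompt", "")
        for placeholder in REQUIRED_PLACEHOLDERS:
            if placeholder not in prompt:
                problems.append(f"{model_id}: prompt lacks {placeholder}")
    return problems


# Hypothetical sample mirroring the notebook's dictionary layout.
sample_config = {
    "huggingface-text2text-flan-t5-small": {
        "instance type": "ml.g5.xlarge",
        "prompt": "Answer based on context:\n\n{context}\n\n{question}",
    },
}
print(check_model_config(sample_config))
```

In the notebook itself you would call `check_model_config(_MODEL_CONFIG_)` before running the deployment cell; an empty list means no problems were found.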

### 4. Deploy a SageMaker Endpoint

To deploy a SageMaker endpoint, you need to instantiate a SageMaker [Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) that can be deployed to an endpoint. The mandatory parameters are: (1) deploy_image_uri, (2) model_uri, (3) model_inference, (4) model_predictor_inference.

**deploy_image_uri**: the URI of the inference container image to be deployed  
**model_id**: the model ID for the model to be deployed  
**endpoint_name**: the name of the endpoint that will be created for the model  
**inference_instance_type**: the instance type of the model to be deployed  
**model_uri**: the URI of the model to be deployed  
**model_inference**: the object containing all of the model's attributes  
**model_predictor_inference**: the object that will be used to deploy the model  

It's also possible to deploy a SageMaker endpoint with a low-code approach using the `JumpStartModel` class: you use the model ID to define your model as a JumpStart model, and the `deploy` method to automatically deploy it for inference. This is the method used in this notebook.

The following cell can take around 5-10 minutes to process as we are deploying the model endpoint here.

Please note if you decide to deploy multiple models, this cell will take longer to execute.


In [None]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

for model_id in _MODEL_CONFIG_:
    endpoint_name = name_from_base(f"jumpstart-example-{model_id}")
    inference_instance_type = _MODEL_CONFIG_[model_id]["instance type"]
    
    model_inference = JumpStartModel(
        model_id=model_id, 
        model_version=model_version
    )
    
    _MODEL_CONFIG_[model_id]["model_predictor"] = model_inference.deploy(
        initial_instance_count=1,
        instance_type=inference_instance_type,
        endpoint_name=endpoint_name
    )
    
    _MODEL_CONFIG_[model_id]["endpoint_name"] = endpoint_name
    
    print(f"{bold}Model {model_id} has been deployed successfully.{unbold}{newline}")
    

Hopefully you will have received a message that looks similar to this: "-------!Model huggingface-text2text-flan-t5-small has been deployed successfully."

If you're unable to deploy multiple models, make sure you check your **[AWS service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html)** to see how you can request an instance limit increase.
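
The `deploy` call normally waits for the endpoint to come up, but if you return to the notebook later you may want to check that an endpoint is still in service before sending requests. The sketch below polls the `describe_endpoint` API of the boto3 SageMaker client. `wait_until_in_service` and its polling defaults are illustrative assumptions, and the client is passed in as an argument so the function can be exercised without an AWS account.

```python
import time


def wait_until_in_service(sagemaker_client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll an endpoint until it is InService; raise if it fails or never gets there."""
    for _ in range(max_polls):
        response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = response.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        # Still creating or updating: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not InService after {max_polls} polls")


# Example usage inside the notebook (requires AWS credentials):
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```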

In [None]:
# Run this cell so we can reference the endpoint_name variable in lab 2.
%store endpoint_name



Now that we've deployed our model, we can start using our newly created SageMaker endpoint for inference!


## Step 2. Common use cases of LLMs

[T5](https://huggingface.co/docs/transformers/model_doc/t5) (Text-to-Text Transfer Transformer) is a popular encoder-decoder model that was subsequently fine-tuned via the Flan method to produce the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) family of models. Flan-T5 is an instruction-tuned model and is therefore capable of performing various zero-shot Natural Language Processing (NLP) tasks, as well as few-shot in-context learning tasks.

You can directly use the [FLAN-T5 model](https://huggingface.co/google/flan-t5-base) for many NLP tasks, without fine-tuning the model. Examples of these tasks include:


* Text summarization
* Common sense reasoning / natural language inference
* Question answering
* Sentence / sentiment classification
* Translation
* Pronoun resolution

The code below focuses on question answering, but feel free to add your own use cases. Here are some sample queries you can test out: [Zero-shot prompting for the Flan-T5 foundation model in Amazon SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/).

In [None]:
question = "Briefly summarize this paragraph: Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases. You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition. All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input. Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend’s Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages."
#question = "translate: My name is Mia. to german"
#question = "Review:\nThis movie is so great and once again dazzles and delights us\nIs this movie review sentence negative or positive?\nOPTIONS:\n-positive \n-negative"

You should supply this question within a JSON payload when invoking the endpoint. This JSON payload can include any desired inference parameters that help control the model output, such as maximum sequence length and number of return sequences. To see the full list of payload parameters that can be used here, check out the [transformers library](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig).


* **inputs**: The sequence used as a prompt for the generation or as model inputs to the encoder.
* **max_length** (`int`, *optional*, defaults to 20): The model generates text until the output length (which includes the input context length) reaches max_length. If specified, it must be a positive integer.
* **num_return_sequences** (`int`, *optional*, defaults to 1): The number of output sequences returned. If specified, it must be a positive integer.
* **top_k** (`int`, *optional*, defaults to 50): In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.
* **top_p** (`float`, *optional*, defaults to 1.0): In each step of text generation, sample from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0 and 1.
* **do_sample** (`bool`, *optional*, defaults to `False`): If True, sample the next word as per the likelihood. If specified, it must be Boolean.
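
To make `top_k` and `top_p` more concrete, here is a toy illustration in plain Python of how the two filters narrow the candidate set before a word is sampled. It is a sketch of the idea only, not the transformers implementation: `filter_candidates` and the example probabilities are made up for illustration.

```python
def filter_candidates(probs, top_k=50, top_p=1.0):
    """Apply top-k, then top-p, to a dict of word -> probability and renormalize."""
    # Top-k: keep only the k most likely words.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(word, p / total) for word, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


# Made-up next-word distribution for illustration.
next_word_probs = {"writer": 0.5, "teacher": 0.3, "doctor": 0.15, "pilot": 0.05}
candidates = filter_candidates(next_word_probs, top_k=3, top_p=0.7)
print(sorted(candidates))
```

With `top_k=3` the least likely word is dropped first; `top_p=0.7` then keeps only the two most likely of the remaining words, and sampling (when `do_sample` is True) would choose between those.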

In [None]:
payload = {
    "inputs": question,
    "parameters": {
        "max_length": 100, 
        "num_return_sequences": 1, 
        "top_k": 10,
        "top_p": 0.95, 
        "do_sample": True
    },
}


for model_id in _MODEL_CONFIG_:
    query_response = _MODEL_CONFIG_[model_id]["model_predictor"].predict(payload)
    print(f"For model: {model_id}, the generated output is: {query_response[0]['generated_text']}\n")
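
Because the same payload structure is reused throughout the notebook, it can help to wrap it in a small function that enforces the constraints listed earlier (positive integers, `top_p` between 0 and 1, Boolean `do_sample`). The helper below is a hypothetical sketch, not part of the SageMaker SDK; it only builds the dictionary and does not call the endpoint.

```python
def build_payload(text, max_length=100, num_return_sequences=1,
                  top_k=50, top_p=1.0, do_sample=False):
    """Build an inference payload, validating the generation parameters."""
    for name, value in (("max_length", max_length),
                        ("num_return_sequences", num_return_sequences),
                        ("top_k", top_k)):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if isinstance(top_p, bool) or not isinstance(top_p, (int, float)) or not 0 <= top_p <= 1:
        raise ValueError(f"top_p must be a float between 0 and 1, got {top_p!r}")
    if not isinstance(do_sample, bool):
        raise ValueError(f"do_sample must be a Boolean, got {do_sample!r}")
    return {
        "inputs": text,
        "parameters": {
            "max_length": max_length,
            "num_return_sequences": num_return_sequences,
            "top_k": top_k,
            "top_p": top_p,
            "do_sample": do_sample,
        },
    }


example_payload = build_payload("translate: My name is Mia. to german", top_k=10, top_p=0.95, do_sample=True)
print(example_payload["parameters"])
```

The returned dictionary can be passed to `predict` in the same way as the hand-written `payload` above.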

## Step 3. Ask an LLM a question with and without providing the context

There are a few limitations of using off-the-shelf pre-trained LLMs:

- They’re usually trained offline, leaving the model unaware of the latest information (for example, a chatbot trained on data from 2011–2018 has no information about COVID-19).

- They make predictions by only looking up information stored in their parameters, leading to inferior interpretability.
    
- They’re mostly trained on general domain corpora, making them less effective on domain-specific tasks. There are scenarios when you want models to generate text based on specific data rather than generic data. 



To better illustrate this, let's ask the model a question without providing the context and see how it responds.


### 1. Ask a question without providing the context

In [None]:
question2 = "What is Susanne's job?"

In [None]:
payload = {
    "inputs": question2,
    "parameters": {
        "max_length": 100, 
        "num_return_sequences": 1, 
        "top_k": 10,
        "top_p": 0.95, 
        "do_sample": True
    },
}

for model_id in _MODEL_CONFIG_:
    query_response = _MODEL_CONFIG_[model_id]["model_predictor"].predict(payload)
    print(f"For model: {model_id}, the generated output is: {query_response[0]['generated_text']}\n")

You can see the generated answer is wrong or doesn't make much sense. 

### 2. Improve the answer to the same question using **prompt engineering** with insightful context

Currently, there are two popular ways to reference specific data in LLMs:

- Insert data as context in the model prompt as a way to provide the information that the model can use while creating the result

- Fine-tune the model by providing a file with prompt and completion pairs



We will provide extra contextual information, insert it in a prompt, and send it to the model together with the question. Below is an example.
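
The prompt assembly described above can be sketched as a small function. This assumes the `{context}` and `{question}` placeholder convention used by the templates in `_MODEL_CONFIG_`; `build_prompt` itself is a hypothetical helper. It substitutes both placeholders in a single pass, so braces that happen to appear inside the context or question text are left untouched.

```python
import re

PROMPT_TEMPLATE = "Answer based on context:\n\n{context}\n\n{question}"


def build_prompt(template, context, question):
    """Fill the {context} and {question} placeholders of a prompt template."""
    values = {"context": context, "question": question}
    # A single regex pass avoids re-substituting text that was just inserted.
    return re.sub(r"\{(context|question)\}", lambda match: values[match.group(1)], template)


prompt = build_prompt(
    PROMPT_TEMPLATE,
    context="Peter is a professional footballer. Susanne, his partner, is a writer.",
    question="What is Susanne's job?",
)
print(prompt)
```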

In [None]:
context = "Peter is a professional footballer. Susanne, his partner, is a writer."

In [None]:
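parameters = {
    "max_length": 200,
    # Only the first returned sequence is printed below, and greedy decoding
    # (do_sample=False) yields a single deterministic sequence, so request one.
    "num_return_sequences": 1,
    # top_k and top_p only take effect when do_sample is True.
    "top_k": 250,
    "top_p": 0.95,
    "do_sample": False,
    # "temperature": 1,
}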


for model_id in _MODEL_CONFIG_:

    prompt = _MODEL_CONFIG_[model_id]["prompt"]

    text_input = prompt.replace("{context}", context)
    text_input = text_input.replace("{question}", question2)
    payload = {"inputs": text_input, "parameters": parameters}
    
    query_response = _MODEL_CONFIG_[model_id]["model_predictor"].predict(payload)
    print(f"For model: {model_id}, the generated output is: {query_response[0]['generated_text']}\n")

As you can see, the model is now able to answer the question correctly using the extra information provided as context! Feel free to ask your own questions and play around with the model you've deployed.

### Cleanup

Run the following to terminate your SageMaker endpoint. 


In [None]:
sagemaker_client = boto3.client('sagemaker')

for model_id in _MODEL_CONFIG_:
    sagemaker_client.delete_endpoint(EndpointName=_MODEL_CONFIG_[model_id]["endpoint_name"])
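
The cell above deletes the endpoint itself. The endpoint configuration and the model object created during deployment are separate SageMaker resources, and you may want to remove them as well. The sketch below looks them up through `describe_endpoint` and `describe_endpoint_config` before deleting all three. `delete_endpoint_resources` is a hypothetical helper, and the client is passed in so the logic can be tried without touching a real account.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and model(s)."""
    # Look up the dependent resources before the endpoint disappears.
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Example usage inside the notebook (requires AWS credentials), run in place of
# the cleanup cell above because the lookups need the endpoint to still exist:
# for model_id in _MODEL_CONFIG_:
#     delete_endpoint_resources(boto3.client("sagemaker"), _MODEL_CONFIG_[model_id]["endpoint_name"])
```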