# End-to-End Deployment of a Fine-Tuned Mistral Model on AWS with SageMaker, Bedrock, and Neuron Integration

This notebook provides a comprehensive, hands-on guide to deploying a fine-tuned Mistral model for the insurance domain on AWS using Amazon SageMaker, Amazon Bedrock, and Hugging Face. The [Mistral-7B-Insurance](https://huggingface.co/bitext/Mistral-7B-Insurance) model is optimized to perform well in customer support scenarios, particularly for answering insurance-related queries. Also, it is a fine-tuned version of the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

By following the steps in this notebook, you will learn how to optimize, deploy, and interact with a Mistral fine-tuned LLM using AWS’s robust infrastructure alongside Hugging Face tools. This workflow demonstrates how to leverage Hugging Face’s model repository for storage and distribution, AWS Inferentia instances on SageMaker for cost-effective inference, and Amazon Bedrock for flexible and scalable model serving. Finally, you will build a Streamlit application to interact with the model in real time, showcasing how AWS, Hugging Face, and Mistral services can support real-world customer support applications.

## Goals

The primary goals of this notebook are:
1. **Model Preparation with Hugging Face**: Use Hugging Face’s tools to prepare, convert, and upload the model, enabling streamlined storage and versioning of large models.
2. **Model Optimization for AWS Neuron**: Convert the fine-tuned Mistral model to a Neuron-compatible format using Hugging Face’s `optimum-cli`, optimizing it for performance on Inferentia-based SageMaker instances.
3. **Deploying on Amazon SageMaker**: Set up a cost-effective, high-performance deployment on Amazon SageMaker using `ml.inf2.xlarge` instances, enabling fast inference at lower costs.
4. **Model Import to Amazon Bedrock**: Convert the model from Hugging Face’s storage to `safetensors` format for Amazon Bedrock compatibility. Import it into Amazon Bedrock, allowing for seamless integration with other AWS services and enabling the use of the Converse API for inference.
5. **Real-Time Interaction via Streamlit**: Build an interactive Streamlit application to test and compare model performance on both SageMaker and Bedrock, providing a hands-on experience for user interaction with deployed models.

## Expected Outcomes

By completing this notebook, you will achieve the following outcomes:
- **Experience with Hugging Face Model Handling**: Learn how to use Hugging Face’s model repository for storing, versioning, and retrieving large models, simplifying the model management process.
- **Understanding Model Compilation and Optimization**: Gain hands-on experience with model compilation for AWS Neuron, optimizing the model for deployment on Inferentia instances, and learn how to convert models to the Hugging Face `safetensors` format for efficient storage and compatibility.
- **Proficiency in Multi-Platform Deployment**: Deploy a model on both Amazon SageMaker and Amazon Bedrock, gaining flexibility in model serving options, and understanding the integration between Hugging Face and AWS services.
- **Real-Time Streaming Inference**: Use Amazon Bedrock’s Converse Streaming API to interact with the model in real-time, receiving streamed responses that enhance user interaction.
- **End-to-End Application Development**: Build a Streamlit application with a user-friendly interface that allows side-by-side testing of models deployed on both SageMaker and Bedrock, demonstrating the ease of integrating any LLM with AWS Generative AI services.




---

## Section 1: Import Required Libraries

Import necessary libraries to facilitate model conversion, deployment, and API interactions across AWS services.

In [None]:
!pip install -U transformers \
                sagemaker \
                boto3 \
                tiktoken \
                torch \
                blobfile \
                sentencepiece

In [2]:
import os
import boto3
import json
from datetime import datetime
from transformers import AutoModelForCausalLM, AutoTokenizer
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import sagemaker


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


## Section 2: Export Variables for Mistral-7B-Insurance Deployment

Define essential variables for model deployment, including model details, batch size, sequence length, and AWS-specific configurations.

### Key Variables in This Notebook

1. **`MODEL_ID`**: Hugging Face ID for the fine-tuned Mistral model in the insurance domain, used throughout the compilation and deployment process.

2. **`BATCH_SIZE`**: Batch size for processing inputs simultaneously, optimizing performance on Inferentia.

3. **`SEQUENCE_LENGTH`**: Maximum input length for each sequence, balancing memory and model capacity.

4. **`MAX_TOTAL_TOKENS`**: Upper limit for tokens in each response, managing output length during inference.

5. **`NUM_CORES`**: Number of Neuron cores used in parallel for efficient processing on Inferentia instances.

6. **`HF_MODEL_ID_TO_PUSH`**: The Hugging Face model name(e.g., "aboavent/Mistral-7B-Insurance-neuron") under which the compiled model will be saved.

7. **`HF_TOKEN`**: Hugging Face authentication token for model access and upload. Check out [Creating User Access Tokens](https://huggingface.co/docs/hub/en/security-tokens) for details on how to create one for you Hugging Face account.

8. **`PRECISION`**: Model precision (`fp16`), reducing memory while maintaining inference speed.

9. **`MODEL_OUTPUT_NAME`**: Designated name for the compiled model, aiding in file tracking and reference.

10. **`COMPILED_MODEL_OUTPUT_PATH`**: Directory path for saving the compiled model after export, used in later deployment steps.

11. **`sagemaker_region`**: The region for SageMaker where the model will be deployed.

12. **`bedrock_region`**: The region for Amazon Bedrock where the model will be imported later.



In [3]:
MODEL_ID = "bitext/Mistral-7B-Insurance"
BATCH_SIZE = 4
SEQUENCE_LENGTH = 2048
MAX_TOTAL_TOKENS = 4096  # Set independently for the total token limit
NUM_CORES = 2
HF_MODEL_ID_TO_PUSH = "aboavent/Mistral-7B-Insurance-neuron" # Set your HF model name
HF_TOKEN = "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"
PRECISION = "fp16"
MODEL_OUTPUT_NAME = "Mistral-7B-Insurance-neuron"
COMPILED_MODEL_OUTPUT_PATH = f"./{MODEL_OUTPUT_NAME}" 
sagemaker_region = "us-east-2"  # Region for SageMaker endpoint
bedrock_region = "us-west-2"    # Region for Bedrock model

os.environ["AWS_REGION"] = sagemaker_region 
os.environ["AWS_DEFAULT_REGION"] = sagemaker_region

## Section 3: Model Compilation with Optimum CLI (Optional)

#### Important Notes before you get started

> This section is *optional* since the compiled model was already made available at [Hugging Face - Mistral-7b-Insurance-Neuron](https://huggingface.co/aboavent/Mistral-7B-Insurance-neuron), however I strongly recommend you to read through this section as it'll provide additional insights on how to compile an existing model to the [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) architecture.

> ⏰ Note: If you want to compile this model from scratch, please note that this step can take about 30-45 minutes to complete.

In this section, we compile the Mistral model using Hugging Face’s `optimum-cli` tool, which prepares it for efficient inference on AWS Inferentia instances with AWS Neuron. Model compilation is a critical step in optimizing performance, as it converts the model into a format compatible with AWS Neuron, enabling accelerated inference on hardware optimized for deep learning. As we are working with a Mistral LLM that requires full loading into compute memory for compilation, we recommend using a high-memory instance, such as **`inf2.24xlarge`**, **`inf2.48xlarge`**, or **`trn1.32xlarge`**. These instances offer substantial memory and Neuron cores, providing the resources necessary for efficient model processing and compiling tokenizers, `config.json`, and other essential components. You can perform this compilation step on an EC2 instance provisioned with [AWS Deep Learning AMI Neuron](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html). After logging in to the instance, activate the pre-built transformers environment for Inf2 and Trn* by running: 
```bash
source /opt/aws_neuronx_venv_transformers_neuronx/bin/activate
```

#### Steps for compiling the model into the AWS Neuron architecture

1. **Configure Compilation Parameters**:
   - To guide the model compilation process, we define several key parameters: `MODEL_ID` to specify the pretrained Mistral model from Hugging Face, `BATCH_SIZE` to determine the number of inputs processed per batch, and `SEQUENCE_LENGTH` to set the maximum input length, impacting both memory and processing time. We set `NUM_CORES` to define the number of Neuron cores used for parallelization, and `PRECISION` to `fp16` to balance memory usage and performance. Finally, `COMPILED_MODEL_OUTPUT_PATH` specifies where the Neuron-compatible compiled model will be saved.

2. **Set Up Environment and Install Neuron Requirements**:
   - Set up `pip` to use the Neuron repository and install `optimum` with NeuronX support to ensure compatibility with Inferentia hardware.

3. **Log in to Hugging Face**:
   - Use the Hugging Face CLI to authenticate and enable access to the model repository.

4. **Run Optimum CLI for Model Compilation**:
   - We use the Hugging Face `optimum-cli` tool to compile the model for AWS Inferentia hardware. The command includes the model ID, batch size, sequence length, number of cores, precision level, and output path.

5. **Create a New Repository on Hugging Face and Upload the Compiled Model**:
   - After compilation, create a new repository on Hugging Face to store the compiled model and upload the model files.

```bash
# Set Neuron repository for pip
pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Install optimum with NeuronX support and update dependencies
pip install --upgrade-strategy eager optimum[neuronx]

# Log in to Hugging Face
huggingface-cli login --token $HF_TOKEN

# Export model for Neuron
optimum-cli export neuron \
    -m $MODEL_ID \
    --batch_size $BATCH_SIZE \
    --sequence_length $SEQUENCE_LENGTH \
    --num_cores $NUM_CORES \
    --auto_cast_type $PRECISION \
    --trust-remote-code \
    $COMPILED_MODEL_OUTPUT_PATH

# Create a new repository on Hugging Face and upload the compiled model 
huggingface-cli repo create $MODEL_OUTPUT_NAME
huggingface-cli upload $HF_MODEL_ID_TO_PUSH $COMPILED_MODEL_OUTPUT_PATH ./

```

#### Why Compile the Model?

- **Optimized for Inferentia**: Compilation enables the model to leverage the specialized capabilities of AWS Inferentia, reducing inference latency and improving cost-efficiency.
- **Memory Efficiency**: Using `fp16` precision and Neuron-compatible optimizations helps reduce memory usage, allowing for higher throughput.
- **Scalability**: A compiled model is more scalable and cost-effective in production, making it ideal for customer support applications.

For more details on Hugging Face Optimum CLI and its features, refer to the official [Optimum Neuron CLI documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model). Also, check out the [Optimum Neuron official documentation](https://huggingface.co/docs/optimum-neuron/index) and the [Optmimum Neuron github repo]( https://github.com/huggingface/optimum-neuron/tree/main) with additional samples and tutorials for your reference. Some additional examples can also be found at [Optimum Neuron Sample Notebooks](https://github.com/huggingface/optimum-neuron/tree/main/notebooks).

## Section 4: Deploy Model to SageMaker Inference



In this section, we set up the necessary configurations for deploying our fine-tuned Mistral model on Amazon SageMaker. This involves defining the IAM role for SageMaker and setting up key model parameters for optimal performance on Inferentia instances with AWS Neuron. 


### Set Model Deployment Configuration:
   - Configure the model’s environment variables using a dictionary (`hub`). These parameters optimize the model's performance on the `ml.inf2.xlarge` instance type and control important aspects of inference to ensure that the model is optimized for Inferentia instances, balancing speed, cost-efficiency, and accuracy. Key settings include:
      - **`HF_MODEL_ID`**: The Hugging Face model ID of the compiled Mistral model.
      - **`HF_NUM_CORES`**: Number of Neuron cores to allocate, providing parallel processing for efficient inference.
      - **`HF_SEQUENCE_LENGTH`**: Maximum sequence length for inputs, which impacts memory usage and latency.
      - **`HF_AUTO_CAST_TYPE`**: Precision setting, typically `fp16` to balance performance and memory efficiency.
      - **`MAX_BATCH_SIZE`**: Defines the maximum number of inputs processed in a single batch, balancing latency and throughput.
      - **`MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS`**: Specifies the maximum tokens allowed for input and total tokens per request, ensuring responses fit within the configured limits.
      - **`HF_TOKEN`**: Authentication token for Hugging Face, enabling access to model files stored on Hugging Face.
      - **`MESSAGES_API_ENABLED`**: Enables conversational interactions by allowing message-based API requests.

 



In [4]:
#role = sagemaker.get_execution_role()
boto_session = boto3.Session(region_name=sagemaker_region)
sagemaker_session = sagemaker.Session(boto_session=boto_session)
role = sagemaker.get_execution_role(sagemaker_session=sagemaker_session)

hub = {
    "HF_MODEL_ID": HF_MODEL_ID_TO_PUSH,
    "HF_NUM_CORES": str(NUM_CORES),
    "HF_SEQUENCE_LENGTH": str(SEQUENCE_LENGTH),
    "HF_AUTO_CAST_TYPE": PRECISION,
    "MAX_BATCH_SIZE": str(BATCH_SIZE),
    "MAX_INPUT_TOKENS": "1800",
    "MAX_TOTAL_TOKENS": str(MAX_TOTAL_TOKENS),
    "HF_TOKEN": HF_TOKEN,
    "MESSAGES_API_ENABLED": "true"
}


### Deploy the compiled model on a `ml.inf2.xlarge` instance in SageMaker.

In [6]:
%%time
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", 
                                            version="0.0.24",
                                            session=sagemaker_session,
                                            region=sagemaker_region),
    env=hub,
    role=role
)

# Set this flag to indicate that the model is precompiled
huggingface_model._is_compiled_model = True

# Deploy the model and get the predictor
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=2400,
    volume_size=512
)

sagemaker_endpoint_name = predictor.endpoint_name


------------------!CPU times: user 515 ms, sys: 44.8 ms, total: 560 ms
Wall time: 9min 33s


## Section 5: Test SageMaker Endpoint with both the HuggingFace's and SageMaker's APIs

Send a sample request to the SageMaker endpoint to verify that the model is deployed and functioning correctly using the **HuggingFace API**.


In [7]:
def create_sample_request(system_prompt, user_query):
    """
    Creates a sample request structure for the predictor based on the given system prompt and user query.

    Parameters:
        system_prompt (str): The initial system prompt to set the model's role.
        user_query (str): The user's query for the insurance model.

    Returns:
        dict: A structured request for the SageMaker predictor.
    """
    return {
        "model": HF_MODEL_ID_TO_PUSH,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query}
        ],
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }

# List of different system prompts and user queries to test various scenarios
system_user_queries = [
    ("You are an expert in health insurance policies.", "What benefits do I get with my current health plan?"),
    ("You are an insurance advisor.", "How can I reduce my monthly insurance premium?"),
    ("You are an expert in auto insurance policies.", "What happens if my car is totaled?"),
    ("You are an expert in life insurance.", "Can you explain the difference between term and whole life insurance?"),
    ("You are an insurance claims specialist.", "What documents are needed to file a claim for home insurance?"),
    ("You are a customer service representative for health insurance.", "Can I add my spouse to my health insurance policy?"),
    ("You are an expert in travel insurance policies.", "What coverage do I have if my flight is canceled?"),
    ("You are a specialist in pet insurance.", "Does my policy cover emergency vet visits?"),
    ("You are an insurance fraud investigator.", "What are some common signs of insurance fraud?"),
    ("You are an advisor on property insurance.", "How do I increase the coverage for natural disasters?")
]

# Loop through each system prompt and user query, create a request, and get a response from the predictor
for i, (system_prompt, user_query) in enumerate(system_user_queries, start=1):
    print(f"--- Sample Request {i} ---")
    request = create_sample_request(system_prompt, user_query)
    response = predictor.predict(request)
    print("System Prompt:", system_prompt)
    print("User Query:", user_query)
    print("Model Response:", response['choices'][0]['message']['content'])
    print("\n")

--- Sample Request 1 ---
System Prompt: You are an expert in health insurance policies.
User Query: What benefits do I get with my current health plan?
Model Response:  To obtain a detailed understanding of the benefits covered under your health plan, please adhere to the following procedure:

1. Access the {{WEBSITE_URL}}.
2. Input your login information in the designated fields.
3. Locate the {{BRIEF_SUMMARY_SECTION}} within the site.
4. Select your specific health insurance policy from the options available.
5. Examine the comprehensive list of benefits provided in your plan.


--- Sample Request 2 ---
System Prompt: You are an insurance advisor.
User Query: How can I reduce my monthly insurance premium?
Model Response:  To assist you in obtaining a reduction in your monthly insurance premium, please follow the steps outlined below:

1. Examine the specifics of your existing policy, such as your coverage limits and deductibles, to ensure you have a comprehensive understanding of you

### Send a sample request to the model using the SageMaker API


In [8]:
from botocore.exceptions import ClientError

sagemaker_client = boto3.client("sagemaker-runtime", 
                                region_name=sagemaker_region)

# Function to query the model on SageMaker
def query_sagemaker_model(endpoint_name, query):
    payload = {
        "model": HF_MODEL_ID_TO_PUSH,  # Updated model name
        "messages": [
            {"role": "system", "content": "You are an expert in customer support for Insurance."},
            {"role": "user", "content": query}  # Send the user query as a string
        ],
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 4096,
            "temperature": 0.5,
            "top_k": 50,
            "top_p": 0.90,
            "max_length": 4096,
            "stop": None
        }
    }
    
    try:
        # Send the request to SageMaker endpoint
        response = sagemaker_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload)
        )
        
        # Parse the response
        result = json.loads(response['Body'].read())
        print(result)
        return result['choices'][0]['message']['content']
    
    except ClientError as e:
        print(f"An error occurred with SageMaker: {e.response['Error']['Message']}")
        return None
    
if sagemaker_endpoint_name is None:
    raise ValueError("sagemaker_endpoint_name is not set. Make sure to provide an endpoint name so you can query the model.")
    
model_response = query_sagemaker_model(sagemaker_endpoint_name, 
                                       "How can I reduce my monthly insurance premium?")
print(model_response)   

{'object': 'chat.completion', 'id': '', 'created': 1731451647, 'model': 'aboavent/Mistral-7B-Insurance-neuron', 'system_fingerprint': '2.1.1-native', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' There are several methods to help you reduce your monthly insurance premium:\n\n1. Shop around: Compare prices and features from various insurance providers to identify the most cost-effective option.\n2. Customize your policy: Adjust your coverage limits to align with your needs and reduce unnecessary coverage that may add to your premium.\n3. Bundle your policies: If you have multiple insurance needs, try acquiring several insurance products from the same provider to benefit from bundled discounts.'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 0, 'completion_tokens': 100, 'total_tokens': 100}}
 There are several methods to help you reduce your monthly insurance premium:

1. Shop around: Compare prices and features from various insurance

## Section 6: Convert Model to Safetensors Format for Bedrock

In this section, we convert the compiled Mistral model into the `safetensors` format, which is a lightweight and efficient file format designed for securely storing large model weights. This conversion is necessary to make the model compatible with Amazon Bedrock, allowing us to easily import and use the model within the Bedrock environment.

#### Important Note
> For this conversion, it is recommended to use a larger instance  with **at least 128 GB of memory**. This ensures sufficient resources for loading and processing the entire model during the conversion process. Attempting this step on smaller instances may result in memory-related errors.

#### Steps in This Section

1. **Define Conversion Function**:
   - We define a `convert_to_safetensors` function that loads the model and its tokenizer, then saves them in `safetensors` format.
   - The function uses the Hugging Face Transformers library to load the model and save it in the `safetensors` format for compatibility with Amazon Bedrock.

2. **Run Conversion**:
   - Call the `convert_to_safetensors` function, specifying the model name (`MODEL_ID`) and the directory (`save_directory`) where the converted model files will be saved.
   - This process generates two outputs:
      - The model in `safetensors` format.
      - The tokenizer, saved alongside the model.

3. **Verify Conversion**:
   - After conversion, we list the contents of the target directory (`save_directory`) to confirm that the model files have been saved correctly in the desired format.



In [10]:
%%time

def convert_to_safetensors(model_name, save_directory, max_shard_size="3GB"):
    """
    Convert a Hugging Face model to safetensors format for Amazon Bedrock compatibility, with sharded saving.
    
    Parameters:
        model_name (str): Name of the model to convert.
        save_directory (str): Directory to save the converted model and tokenizer.
        max_shard_size (str): Maximum size of each shard (e.g., "3GB").
    """
    os.makedirs(save_directory, exist_ok=True)
    print(f"Loading model {model_name}...")
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

    print(f"Converting and saving model to {save_directory} in safetensors format with max shard size of {max_shard_size}...")
    model.save_pretrained(save_directory, safe=True, max_shard_size=max_shard_size)
    tokenizer.save_pretrained(save_directory)
    print("Conversion complete!")

# Specify the directory and model name
save_directory = os.path.expanduser("~/Mistral-7B-Insurance")
os.makedirs(save_directory, exist_ok=True,)
convert_to_safetensors(MODEL_ID, save_directory, "2GB")

# List the contents of the save directory to verify the conversion
print(os.listdir(save_directory))

Loading model bitext/Mistral-7B-Insurance...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Converting and saving model to /root/Mistral-7B-Insurance2 in safetensors format with max shard size of 2GB...
Conversion complete!
['model-00011-of-00015.safetensors', 'model-00002-of-00015.safetensors', 'tokenizer_config.json', 'model-00015-of-00015.safetensors', 'model-00006-of-00015.safetensors', 'model-00013-of-00015.safetensors', 'model-00004-of-00015.safetensors', 'tokenizer.model', 'model.safetensors.index.json', 'model-00008-of-00015.safetensors', 'config.json', 'model-00003-of-00015.safetensors', 'model-00012-of-00015.safetensors', 'generation_config.json', 'special_tokens_map.json', 'model-00007-of-00015.safetensors', 'model-00005-of-00015.safetensors', 'model-00014-of-00015.safetensors', 'model-00009-of-00015.safetensors', 'model-00010-of-00015.safetensors', 'model-00001-of-00015.safetensors']
CPU times: user 22.4 s, sys: 1min 30s, total: 1min 52s
Wall time: 5min 2s


## Section 7: Upload Converted Model to S3

Upload the `safetensors` formatted model files to an S3 bucket, making them accessible to Amazon Bedrock.


In [11]:
%%time

from botocore.exceptions import ClientError
from tqdm import tqdm  # Progress bar
from datetime import datetime

# Define S3 and local directory configurations
s3_client = boto3.client("s3", region_name=bedrock_region)

base_bucket_name = "mistral-7b-insurance-bedrock-import"
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
s3_bucket_name = f"{base_bucket_name}-{timestamp}"
s3_model_directory = "safetensors"

local_model_directory = save_directory  # Use save_directory from previous step

print(f"Unique S3 bucket name for this execution: {s3_bucket_name}")


def create_bucket_if_not_exists(bucket_name, region="us-west-2"):
    """
    Creates the S3 bucket if it does not exist.
    
    Parameters:
        bucket_name (str): The name of the bucket to create.
        region (str): The AWS region for the bucket.
    """
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket '{bucket_name}' already exists.")
    except ClientError as e:
        error_code = e.response['Error']['Code']
        if error_code == '404':
            print(f"Bucket '{bucket_name}' does not exist. Creating bucket...")
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={'LocationConstraint': region}
            )
            print(f"Bucket '{bucket_name}' created successfully.")
        else:
            print(f"Unexpected error: {e}")
            raise

# Create the bucket if it doesn't exist
create_bucket_if_not_exists(s3_bucket_name, bedrock_region)

def upload_to_s3(local_directory, bucket, s3_directory):
    """
    Uploads all files from a local directory to the specified S3 bucket and directory.

    Parameters:
        local_directory (str): Path to the local directory containing files to upload.
        bucket (str): Name of the S3 bucket.
        s3_directory (str): Directory path within the S3 bucket to store the files.
    """
    files = [f for f in os.listdir(local_directory) if os.path.isfile(os.path.join(local_directory, f))]
    
    # Progress bar for uploads
    for filename in tqdm(files, desc="Uploading files to S3"):
        file_path = os.path.join(local_directory, filename)
        s3_path = f"{s3_directory}/{filename}"
        print(f"Uploading {filename} to s3://{bucket}/{s3_path}...")
        s3_client.upload_file(file_path, bucket, s3_path)
        print(f"{filename} uploaded successfully.")

# Run the upload function
upload_to_s3(local_model_directory, s3_bucket_name, s3_model_directory)

Unique S3 bucket name for this execution: mistral-7b-insurance-bedrock-import-20241112231130
Bucket 'mistral-7b-insurance-bedrock-import-20241112231130' does not exist. Creating bucket...
Bucket 'mistral-7b-insurance-bedrock-import-20241112231130' created successfully.


Uploading files to S3:   0%|          | 0/21 [00:00<?, ?it/s]

Uploading model-00011-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00011-of-00015.safetensors...


Uploading files to S3:   5%|▍         | 1/21 [00:12<04:08, 12.44s/it]

model-00011-of-00015.safetensors uploaded successfully.
Uploading model-00002-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00002-of-00015.safetensors...


Uploading files to S3:  14%|█▍        | 3/21 [00:24<01:57,  6.53s/it]

model-00002-of-00015.safetensors uploaded successfully.
Uploading tokenizer_config.json to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/tokenizer_config.json...
tokenizer_config.json uploaded successfully.
Uploading model-00015-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00015-of-00015.safetensors...


Uploading files to S3:  19%|█▉        | 4/21 [00:34<02:16,  8.02s/it]

model-00015-of-00015.safetensors uploaded successfully.
Uploading model-00006-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00006-of-00015.safetensors...


Uploading files to S3:  24%|██▍       | 5/21 [00:48<02:46, 10.38s/it]

model-00006-of-00015.safetensors uploaded successfully.
Uploading model-00013-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00013-of-00015.safetensors...


Uploading files to S3:  29%|██▊       | 6/21 [01:01<02:47, 11.17s/it]

model-00013-of-00015.safetensors uploaded successfully.
Uploading model-00004-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00004-of-00015.safetensors...


Uploading files to S3:  33%|███▎      | 7/21 [01:14<02:44, 11.74s/it]

model-00004-of-00015.safetensors uploaded successfully.
Uploading tokenizer.model to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/tokenizer.model...


Uploading files to S3:  43%|████▎     | 9/21 [01:15<01:07,  5.66s/it]

tokenizer.model uploaded successfully.
Uploading model.safetensors.index.json to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model.safetensors.index.json...
model.safetensors.index.json uploaded successfully.
Uploading model-00008-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00008-of-00015.safetensors...


Uploading files to S3:  48%|████▊     | 10/21 [01:26<01:22,  7.46s/it]

model-00008-of-00015.safetensors uploaded successfully.
Uploading config.json to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/config.json...


Uploading files to S3:  52%|█████▏    | 11/21 [01:26<00:52,  5.24s/it]

config.json uploaded successfully.
Uploading model-00003-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00003-of-00015.safetensors...


Uploading files to S3:  57%|█████▋    | 12/21 [01:39<01:05,  7.33s/it]

model-00003-of-00015.safetensors uploaded successfully.
Uploading model-00012-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00012-of-00015.safetensors...


Uploading files to S3:  67%|██████▋   | 14/21 [01:50<00:42,  6.03s/it]

model-00012-of-00015.safetensors uploaded successfully.
Uploading generation_config.json to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/generation_config.json...
generation_config.json uploaded successfully.
Uploading special_tokens_map.json to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/special_tokens_map.json...


Uploading files to S3:  71%|███████▏  | 15/21 [01:50<00:25,  4.25s/it]

special_tokens_map.json uploaded successfully.
Uploading model-00007-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00007-of-00015.safetensors...


Uploading files to S3:  76%|███████▌  | 16/21 [02:02<00:32,  6.49s/it]

model-00007-of-00015.safetensors uploaded successfully.
Uploading model-00005-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00005-of-00015.safetensors...


Uploading files to S3:  81%|████████  | 17/21 [02:13<00:31,  7.99s/it]

model-00005-of-00015.safetensors uploaded successfully.
Uploading model-00014-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00014-of-00015.safetensors...


Uploading files to S3:  86%|████████▌ | 18/21 [02:26<00:28,  9.39s/it]

model-00014-of-00015.safetensors uploaded successfully.
Uploading model-00009-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00009-of-00015.safetensors...


Uploading files to S3:  90%|█████████ | 19/21 [02:39<00:20, 10.41s/it]

model-00009-of-00015.safetensors uploaded successfully.
Uploading model-00010-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00010-of-00015.safetensors...


Uploading files to S3:  95%|█████████▌| 20/21 [02:51<00:10, 11.00s/it]

model-00010-of-00015.safetensors uploaded successfully.
Uploading model-00001-of-00015.safetensors to s3://mistral-7b-insurance-bedrock-import-20241112231130/safetensors/model-00001-of-00015.safetensors...


Uploading files to S3: 100%|██████████| 21/21 [03:05<00:00,  8.81s/it]

model-00001-of-00015.safetensors uploaded successfully.
CPU times: user 1min 12s, sys: 41.5 s, total: 1min 53s
Wall time: 3min 5s





## Section 8: Import Model into Amazon Bedrock

Create an IAM Execution Role for Bedrock with parameters to be used by
a model import job in Amazon Bedrock using the Mistral 7b Insurance model files uploaded to S3.


In [12]:
import boto3
from botocore.exceptions import ClientError
import json

# Retrieve the current account ID dynamically
sts_client = boto3.client("sts")
source_account = sts_client.get_caller_identity()["Account"]
print("Source account: " + source_account)

# IAM client and role/policy details
iam_client = boto3.client('iam')
role_name = "BedrockModelImportExecutionRole"
policy_name = "BedrockModelImportPolicy"

# Define the trust policy to allow Bedrock to assume this role with specific conditions
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": source_account  
                },
                "ArnEquals": {
                    "aws:SourceArn": f"arn:aws:bedrock:{bedrock_region}:{source_account}:model-import-job/*"  
                }
            }
        }
    ]
}

# Define the permissions policy for S3 and Bedrock access
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                f"arn:aws:s3:::{s3_bucket_name}",
                f"arn:aws:s3:::{s3_bucket_name}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModel",
                "bedrock:GetModel",
                "bedrock:ListModels",
                "bedrock:CreateModelImportJob",
                "bedrock:GetModelImportJob"
            ],
            "Resource": "*"
        }
    ]
}

# Create the IAM role
try:
    print("Creating IAM Role...")
    role_response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description="Role for Amazon Bedrock model import job with S3 access"
    )
    role_arn = role_response['Role']['Arn']
    print(f"IAM Role created with ARN: {role_arn}")
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print(f"Role '{role_name}' already exists.")
        role_arn = iam_client.get_role(RoleName=role_name)['Role']['Arn']
    else:
        raise

# Attach the permissions policy to the role
try:
    print("Attaching policy to IAM Role...")
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(permissions_policy)
    )
    print("Policy attached successfully.")
except ClientError as e:
    print(f"Error attaching policy: {e}")
    raise

# The role ARN will be used in the next cell to create the model import job


Source account: 603555443475
Creating IAM Role...
Role 'BedrockModelImportExecutionRole' already exists.
Attaching policy to IAM Role...
Policy attached successfully.


## Section 9: Create and Submit Model Import Job to Amazon Bedrock

In this section, we set up and submit a model import job to Amazon Bedrock, which imports our converted model files from Amazon S3 into Bedrock for deployment and inference. This process imports the model to Bedrock, preparing it for deployment and usage with the Converse API.


1. **Initialize Variables**: 
   - **`bedrock_client`**: Establishes a Bedrock client connection in the specified `bedrock_region`.
   - **`s3_model_uri`**: Constructs the S3 URI path to the directory where our model files are stored.
   - **`imported_model_name`**: Provides a user-friendly name for the model once it’s imported to Amazon Bedrock.
   - **`job_name`**: Generates a unique name for the model import job by appending a timestamp, ensuring that each job has a unique identifier.

2. **Submit the Model Import Job**:
   - We use the `create_model_import_job` method of the `bedrock_client` to initiate the import job.
   - This method requires:
      - **`jobName`**: The unique job identifier (`job_name`).
      - **`importedModelName`**: The name under which the model will be imported to Bedrock.
      - **`roleArn`**: The ARN of the IAM role with permissions to access the S3 bucket and perform import operations.
      - **`modelDataSource`**: Specifies the S3 URI containing the model files in `safetensors` format.
      
   - After initiating the job, the response from Bedrock includes details about the job status and configuration.


In [15]:
from datetime import datetime

bedrock_client = boto3.client('bedrock', region_name=bedrock_region)  
s3_model_uri = f"s3://{s3_bucket_name}/{s3_model_directory}/"  
imported_model_name = f"Mistral-7B-Insurance-Model-{datetime.now().strftime('%Y%m%d%H%M%S')}"

# Use the IAM role ARN created in the previous cell
job_name = f"mistral-7b-insurance-import-job-{datetime.now().strftime('%Y%m%d%H%M%S')}"

# Create the model import job
response = bedrock_client.create_model_import_job(
    jobName=job_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,  # Use the ARN from the IAM role created in the previous cell
    modelDataSource={'s3DataSource': {'s3Uri': s3_model_uri}}
)

print("Model import job created:", response)
print(json.dumps(response, indent=4))


Model import job created: {'ResponseMetadata': {'RequestId': '64e75bfd-de78-40ae-a19e-8e582cf74d75', 'HTTPStatusCode': 201, 'HTTPHeaders': {'date': 'Tue, 12 Nov 2024 23:22:23 GMT', 'content-type': 'application/json', 'content-length': '81', 'connection': 'keep-alive', 'x-amzn-requestid': '64e75bfd-de78-40ae-a19e-8e582cf74d75'}, 'RetryAttempts': 0}, 'jobArn': 'arn:aws:bedrock:us-west-2:603555443475:model-import-job/9jh3bfmmr6wo'}
{
    "ResponseMetadata": {
        "RequestId": "64e75bfd-de78-40ae-a19e-8e582cf74d75",
        "HTTPStatusCode": 201,
        "HTTPHeaders": {
            "date": "Tue, 12 Nov 2024 23:22:23 GMT",
            "content-type": "application/json",
            "content-length": "81",
            "connection": "keep-alive",
            "x-amzn-requestid": "64e75bfd-de78-40ae-a19e-8e582cf74d75"
        },
        "RetryAttempts": 0
    },
    "jobArn": "arn:aws:bedrock:us-west-2:603555443475:model-import-job/9jh3bfmmr6wo"
}


## Section 10: Monitor Bedrock Model Import Job Status

In this section, we track the status of our model import job in Amazon Bedrock to ensure that the model is successfully imported and ready for deployment. Since model import jobs can take time, a periodic status check allows us to monitor the progress and handle any errors that may arise.

1. **Set Polling Interval**:
   - Define `polling_interval` to specify the time (in seconds) between each status check. Here, we set it to 30 seconds.

2. **Define `check_job_status` Function**:
   - The `check_job_status` function queries the current status of the model import job using the `get_model_import_job` API.
   - Parameters:
      - **`job_name`**: The unique identifier for the model import job, which we generated in the previous step.
   - Returns:
      - A dictionary with the job's current status (`Completed` or `Failed`), any failure message if applicable, and the `importedModelArn` if the job is successful.

3. **Periodic Status Check Loop**:
   - We initialize `imported_model_arn` as `None`.
   - In an infinite loop, we call `check_job_status` every `polling_interval` seconds to retrieve the latest job status.
   - The current status and any failure message are printed to provide real-time feedback.

4. **Job Completion or Failure Handling**:
   - The loop exits if the job reaches a final state (`Completed` or `Failed`).
      - If `Completed`, the `importedModelArn` is stored for further use, and a success message is printed.
      - If `Failed`, an error message is printed along with any specific failure message.

5. **Set `imported_model_id` for Further Use**:
   - Once the job completes successfully, `imported_model_id` is assigned the value of `importedModelArn`, which will be used to interact with the model in subsequent steps.

This monitoring process allows us to seamlessly track the job’s progress and handle any issues, ensuring that the model is ready for deployment as soon as the import is complete.

In [16]:
%%time

import time
from botocore.exceptions import ClientError

# Use the job name from the response of create_model_import_job to track the job
polling_interval = 30  # Time in seconds between each status check

def check_job_status(job_name):
    """
    Checks the status of the model import job and returns the current status, failure message, and imported model ARN if available.

    Parameters:
        job_name (str): The name of the model import job to check.

    Returns:
        dict: Contains the status, failure message, and imported model ARN if the job is completed.
    """
    try:
        status_response = bedrock_client.get_model_import_job(jobIdentifier=job_name)
        return {
            "status": status_response["status"],
            "failureMessage": status_response.get("failureMessage", ""),
            "importedModelArn": status_response.get("importedModelArn", None)
        }
    except ClientError as e:
        print(f"An error occurred: {e}")
        return None

# Loop to check the job status periodically
print(f"Checking status for job {job_name} every {polling_interval} seconds...")
imported_model_arn = None
while True:
    result = check_job_status(job_name)
    if result is None:
        print("Unable to retrieve job status. Exiting.")
        break

    status = result["status"]
    failure_message = result["failureMessage"]
    imported_model_arn = result["importedModelArn"]
    print(f"Current status: {status}")

    # Check if the job has reached a final state
    if status in ["Completed", "Failed"]:
        if status == "Failed" and failure_message:
            print(f"Job failed with message: {failure_message}")
            imported_model_arn = None  # Clear the ARN if the job failed
        else:
            print(f"Job {job_name} finished with status: {status}")
            print(f"Imported Model ARN: {imported_model_arn}")
        break

    # Wait before the next status check
    time.sleep(polling_interval)

# Set the model ID to the imported model ARN if the job was successful
if imported_model_arn:
    imported_model_id = imported_model_arn  # Assign the model ARN to model_id for further use
else:
    print("Model import job did not complete successfully.")

Checking status for job mistral-7b-insurance-import-job-20241112232223 every 30 seconds...
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: InProgress
Current status: Completed
Job mistral-7b-insurance-import-job-20241112232223 finished with status: Completed
Imported Model ARN: arn:aws:bedrock:us-west-2:603555443475:imported-model/7qig1rjkxppp
CPU times: user 137 ms, sys: 40.4 ms, 

## Section 11: Call Imported Model Using Amazon Bedrock Converse API

In this section, we send a test request to our imported model on Amazon Bedrock using the Converse API. Since models may take some time to become ready after import, we implement a retry mechanism with exponential backoff to handle cases where the model is temporarily unavailable.


In [17]:
import time
import json
from botocore.exceptions import ClientError

# Initialize the Bedrock runtime client
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name=bedrock_region)

# Ensure imported_model_id is set from the previous section where the model import job completed
if not imported_model_id:
    raise ValueError("Model ID (importedModelArn) is not set. Ensure the model import job completed successfully.")

# Define the conversation messages, with user role correctly structured
messages = [
    {
        "role": "user",
        "content": [
            {"text": "You are an expert in customer support for insurance. Please help me understand my health insurance benefits."}
        ]
    }
]

# Define the converse function with retry mechanism
def converse_with_retry(messages, max_retries=10, initial_wait=30):
    """
    Calls the Bedrock Converse API with retry logic for the 'Model is not ready' error.

    Parameters:
        messages (list): List of conversation messages.
        max_retries (int): Maximum number of retry attempts.
        initial_wait (int): Initial wait time (in seconds) between retries, doubled after each attempt.

    Returns:
        dict: The API response from Bedrock if successful, None if all retries fail.
    """
    retry_attempt = 0
    wait_time = initial_wait

    while retry_attempt < max_retries:
        # Configure the conversation payload
        converse_config = {
            "modelId": imported_model_id,  # Use the imported model ARN as the model ID
            "messages": messages,
            "inferenceConfig": {
                "temperature": 0.5
            }
        }
        
        print(f"\nAttempt {retry_attempt + 1} of {max_retries}: Sending conversation request...")

        try:
            response = bedrock_runtime_client.converse(**converse_config)
            return response  # Return response if successful
        except ClientError as e:
            error_message = e.response['Error']['Message']
            if "Model is not ready for inference" in error_message:
                print(f"Error: {error_message}. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)  # Wait before retrying
                retry_attempt += 1
                wait_time *= 2  # Exponential backoff
            else:
                print(f"An error occurred: {error_message}")
                return None  # Exit if error is not 'Model not ready'

    print("Max retries reached. Model is still not ready.")
    return None

# Run the conversation with retry logic
response = converse_with_retry(messages)

# Function to print the response if received
def print_converse_response(response):
    if response:
        print(f"Response: {response['output']['message']['content'][0]['text']}")
    else:
        print("No response received.")

# Print the response
print_converse_response(response)



Attempt 1 of 10: Sending conversation request...
Error: Model is not ready for inference. Wait and try your request again. Retrying in 30 seconds...

Attempt 2 of 10: Sending conversation request...
Error: Model is not ready for inference. Wait and try your request again. Retrying in 60 seconds...

Attempt 3 of 10: Sending conversation request...
Response: To effectively understand your health insurance benefits, please adhere to the following steps:

1. Access our website at {{WEBSITE_URL}}.
2. Enter your login credentials to access your account.
3. Proceed to the {{HEALTH_INSURANCE_SECTION}} section of your account.
4. Click on the {{VIEW_DETAILS_TAB}} tab to review your health insurance details.

Should you require additional support, please reach out to our customer service team via our helpline.


## Section 12: Call Imported Model Using Amazon Bedrock Converse Streaming API

In this section, we demonstrate how to use Amazon Bedrock's Converse Streaming API to interact with our imported model in real-time. The Converse Streaming API allows us to receive responses from the model as they are generated, providing a more interactive experience.

#### Steps in This Section

1. **Define Sample Messages**:
   - We create a list of sample messages to simulate different customer support queries related to insurance. Each message serves as a prompt for the model to respond to.

2. **Inference Configuration**:
   - Set parameters such as `temperature` and `top_k` in `inference_config` and `additional_model_fields` to control the model's response style and randomness.

3. **Define the `stream_conversation` Function**:
   - The `stream_conversation` function uses the `converse_stream` method to send messages to the model and receive responses in a streamed format.
   - Parameters:
      - **`bedrock_client`**: The Bedrock runtime client initialized for the specified region.
      - **`model_id`**: The ARN of the imported model, used to identify the model for inference.
      - **`messages`**: The list of conversation messages to send.
      - **`inference_config` and `additional_model_fields`**: Configurations that determine the response style and sampling behavior.
   - As the response stream is received, the function processes different event types:
      - **`messageStart`**: Indicates the start of a new message from the model.
      - **`contentBlockDelta`**: Contains partial content from the model’s response, which is printed in real-time.
      - **`messageStop`**: Marks the end of a message, including information on the stop reason.
      - **`metadata`**: Provides metrics on token usage and latency.

4. **Run the Streaming API for Multiple Test Cases**:
   - We loop through each sample message in `sample_messages`, calling `stream_conversation` for each to receive and display responses from the model.
   - This approach allows us to simulate different customer queries and observe how the model responds in real-time to each prompt.


In [18]:
import boto3
import logging
from botocore.exceptions import ClientError

# Initialize logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

# Initialize the Bedrock runtime client
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name="us-west-2")  # Replace with your region

# Ensure imported_model_id is set from the previous section where the model import job completed
if not imported_model_id:
    raise ValueError("Model ID (importedModelArn) is not set. Ensure the model import job completed successfully.")

# Define multiple conversation samples without system prompts
sample_messages = [
    [
        {
            "role": "user",
            "content": [{"text": "Can you help me understand my health insurance benefits?"}]
        }
    ],
    [
        {
            "role": "user",
            "content": [{"text": "What does my policy cover if I need to see a specialist?"}]
        }
    ],
    [
        {
            "role": "user",
            "content": [{"text": "Are dental treatments covered in my current insurance plan?"}]
        }
    ],
    [
        {
            "role": "user",
            "content": [{"text": "How do I file a claim for a recent doctor visit?"}]
        }
    ],
    [
        {
            "role": "user",
            "content": [{"text": "Can you explain what deductible means in my policy?"}]
        }
    ]
]

# Inference parameters
inference_config = {"temperature": 0.5}
additional_model_fields = {"top_k": 200}

# Define the streaming converse function
def stream_conversation(bedrock_client, model_id, messages, inference_config, additional_model_fields):
    """
    Calls the Bedrock converse_stream API and handles streaming response.

    Parameters:
        bedrock_client: The Boto3 Bedrock runtime client.
        model_id (str): The model ID to use.
        messages (list): The messages to send.
        inference_config (dict): The inference configuration to use.
        additional_model_fields (dict): Additional model fields to use.
    """
    logger.info("Streaming messages with model %s", model_id)

    response = bedrock_client.converse_stream(
        modelId=model_id,
        messages=messages,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )

    stream = response.get('stream')
    if stream:
        for event in stream:
            if 'messageStart' in event:
                print(f"\nRole: {event['messageStart']['role']}")

            if 'contentBlockDelta' in event:
                print(event['contentBlockDelta']['delta']['text'], end="")

            if 'messageStop' in event:
                print(f"\nStop reason: {event['messageStop']['stopReason']}")

            if 'metadata' in event:
                metadata = event['metadata']
                if 'usage' in metadata:
                    print("\nToken usage")
                    print(f"Input tokens: {metadata['usage']['inputTokens']}")
                    print(f"Output tokens: {metadata['usage']['outputTokens']}")
                    print(f"Total tokens: {metadata['usage']['totalTokens']}")
                if 'metrics' in metadata:
                    print(f"Latency: {metadata['metrics']['latencyMs']} milliseconds")

# Example usage of streaming for multiple test cases
try:
    for i, messages in enumerate(sample_messages, 1):
        print("\n" + "="*50)  # Line separator for clarity
        print(f"\nStarting streaming response for sample #{i}: {messages[0]['content'][0]['text']}")
        stream_conversation(
            bedrock_runtime_client,
            imported_model_id,  # Use the imported model ARN as model ID
            messages,
            inference_config,
            additional_model_fields
        )
        print(f"\nFinished streaming response for sample #{i}: {messages[0]['content'][0]['text']}")
except ClientError as err:
    error_message = err.response['Error']['Message']
    logger.error("A client error occurred: %s", error_message)
    print("A client error occurred: " + format(error_message))
else:
    print(f"\nFinished streaming all test cases with model {imported_model_id}.")


INFO: Streaming messages with model arn:aws:bedrock:us-west-2:603555443475:imported-model/7qig1rjkxppp




Starting streaming response for sample #1: Can you help me understand my health insurance benefits?

Role: assistant
To effectively understand your health insurance benefits, please adhere to the following guidelines:

1. Access the {{WEBSITE_URL}} and sign in to your account.
2. Proceed to the {{HEALTH_INSURANCE_SECTION}} section of the website.
3. Click on the {{VIEW_DETAILS_TAB}} tab to view your health insurance details.
4. Analyze the available information regarding your coverage, deductibles, and any applicable limits.

Should you require additional support, please reach out to our customer service team via the {{HELP

INFO: Streaming messages with model arn:aws:bedrock:us-west-2:603555443475:imported-model/7qig1rjkxppp


_SECTION}} section on our website.
Stop reason: end_turn

Token usage
Input tokens: 19
Output tokens: 138
Total tokens: 157
Latency: 2109 milliseconds

Finished streaming response for sample #1: Can you help me understand my health insurance benefits?


Starting streaming response for sample #2: What does my policy cover if I need to see a specialist?

Role: assistant
 To determine the extent of coverage for visiting a specialist under your insurance policy, please adhere to the following guidelines:

1. Access your account at {{WEBSITE_URL}}.
2. Proceed to the {{COVERAGE_SECTION}} section of the website.
3. Identify the specific insurance policy you are interested in.
4. Select the policy to view its details and locate the section titled {{SPECIALIST_COVERAGE}}.
5. Analyze the information provided to ascertain your coverage for specialist visits.

Should you require any additional support, please reach out

INFO: Streaming messages with model arn:aws:bedrock:us-west-2:603555443475:imported-model/7qig1rjkxppp


 to our customer service team by calling the designated support number.
Stop reason: end_turn

Token usage
Input tokens: 22
Output tokens: 141
Total tokens: 163
Latency: 2209 milliseconds

Finished streaming response for sample #2: What does my policy cover if I need to see a specialist?


Starting streaming response for sample #3: Are dental treatments covered in my current insurance plan?

Role: assistant
 To verify whether your dental treatments are covered by your insurance plan, please adhere to the following guidelines:

1. Access your account at {{WEBSITE_URL}}.
2. Proceed to the {{COVERAGE_SECTION}} section of the website.
3. Choose your dental insurance policy from the displayed options.
4. Select the {{DENTAL_COVERAGE_INFORMATION}} link to review your coverage details.

Should you require additional support, do not hesitate to reach out to our

INFO: Streaming messages with model arn:aws:bedrock:us-west-2:603555443475:imported-model/7qig1rjkxppp


 customer service team by dialing our support number.
Stop reason: end_turn

Token usage
Input tokens: 19
Output tokens: 123
Total tokens: 142
Latency: 1986 milliseconds

Finished streaming response for sample #3: Are dental treatments covered in my current insurance plan?


Starting streaming response for sample #4: How do I file a claim for a recent doctor visit?

Role: assistant
 To initiate the claim process for your recent doctor visit, please adhere to the following guidelines:

1. Access our website at {{WEBSITE_URL}}.
2. Sign in to your account with your registered credentials.
3. Proceed to the {{CLAIM_SECTION}} section of the site.
4. Click on the {{FILE_CLAIM_OPTION}} to begin.
5. Complete the claim form, ensuring all mandatory information is provided, along with any necessary documentation such as invoices or receipts.
6. Double-check all entered details for correctness.
7. Finalize your submission by selecting the {{SUBMIT_BUTTON}}.

Upon receipt of your

INFO: Streaming messages with model arn:aws:bedrock:us-west-2:603555443475:imported-model/7qig1rjkxppp


 submission, our claims department will assess your application and respond promptly.
Stop reason: end_turn

Token usage
Input tokens: 21
Output tokens: 171
Total tokens: 192
Latency: 2638 milliseconds

Finished streaming response for sample #4: How do I file a claim for a recent doctor visit?


Starting streaming response for sample #5: Can you explain what deductible means in my policy?

Role: assistant
 A deductible is the amount you are required to pay out of pocket before your insurance coverage begins. For example, if your health insurance policy has a deductible of $1,000, you would be responsible for paying the first $1,000 of your healthcare expenses before your insurance provider starts covering the remaining costs. This arrangement helps to reduce the overall cost of insurance by encouraging policyholders to be more mindful of their healthcare expenditures.
Stop reason: end_turn

Token usage
Input tokens: 21
Output tokens: 97
Total tokens: 118
Latency: 1724 milliseconds

Fin

## Section 13: Running the Streamlit Application for Model Interaction

In this section, we’ll generate a configuration file for our Streamlit application and then run the app to interact with our deployed models on Amazon SageMaker and Amazon Bedrock.

### Step 1: Generate `app-config.json`

The Streamlit application (`app.py`) requires certain configuration details, such as the SageMaker region, Bedrock region, SageMaker endpoint name, Bedrock model ID, and SageMaker model ID (`sagemaker_model_id`). These parameters are essential for connecting the app to our deployed models.

Run the following code to create the configuration file `app-config.json` with the necessary details.


In [19]:
# Define configuration for Streamlit app
config_data = {
    "sagemaker_region": sagemaker_region,
    "bedrock_region": bedrock_region,
    "sagemaker_endpoint_name": sagemaker_endpoint_name,
    "bedrock_model_id": imported_model_id,  # Use the imported model ARN as the Bedrock model ID
    "sagemaker_model_id": HF_MODEL_ID_TO_PUSH  # Set the SageMaker model ID from the notebook variable
}

# Write configuration to app-config.json
with open("app-config.json", "w") as config_file:
    json.dump(config_data, config_file, indent=4)

print("Configuration saved to app-config.json")

Configuration saved to app-config.json


### Step 2: Start the Streamlit Application

Once `app-config.json` has been generated, you can launch the Streamlit app by running the following command in your terminal:

>**streamlit run app.py --server.port 8501 --server.headless true**

### Step 3: Access the Application

Once Streamlit is running, you can access the application in your browser.

- **If running locally**: Open your browser and navigate to **[http://localhost:8501](http://localhost:8501)**.

- **If running on Amazon SageMaker Studio**:
    1. Take the base URL of your SageMaker Studio environment. This typically looks like:
       ```
       https://<your-host-name>/jupyter/default/lab
       ```
    2. Replace `/lab` at the end of the URL with `/proxy/8501/`, so the final URL becomes:
       ```
       https://<your-host-name>/jupyter/default/proxy/8501/
       ```
       This URL will direct you to the Streamlit application within SageMaker Studio.

- **For other cloud or remote environments**: You may need to configure port forwarding to access the application. Check the environment’s documentation for instructions on setting up port forwarding or consult the Streamlit URL provided by the specific platform.

![Streamlit Chatbot App](streamlit-chatbot-video.gif "Streamlit Chatbot App")

---

## Conclusion

In this notebook, we walked through the end-to-end process of deploying a fine-tuned Mistral model for insurance-related customer support on two different AWS platforms: Amazon SageMaker and Amazon Bedrock. This exercise provided a comprehensive look at how to optimize, deploy, and interact with a Mistral model using AWS’s specialized infrastructure.

### Key Takeaways

By completing this notebook, you should now have an understanding of:
- **Model Optimization and Compilation**: How to prepare a model for efficient deployment on AWS Inferentia instances using Neuron, including the compilation of a Hugging Face model to a Neuron-compatible format with `optimum-cli`.
- **Deploying Models on Amazon SageMaker**: How to set up and deploy the model on a `ml.inf2.xlarge` instance, leveraging SageMaker’s managed inference capabilities for low-latency, cost-effective serving.
- **Converting Models for Amazon Bedrock**: How to convert the model to Hugging Face’s `safetensors` format and upload it to Amazon S3 for import into Amazon Bedrock.
- **Using Amazon Bedrock's Converse API**: How to use the Bedrock Converse API to query the model, including real-time streaming of responses for enhanced interactivity.
- **Building an Interactive Application with Streamlit**: How to connect both deployed models (SageMaker and Bedrock) to a custom chatbot interface, providing a hands-on experience for interacting with the model through a simple UI.

### Value Proposition

The techniques demonstrated in this notebook highlight the versatility and scalability of AWS’s machine learning and AI infrastructure:
- **Cost-Effective Inference**: With Inferentia-based SageMaker deployment, we can run high-performance inferences while minimizing costs.
- **Flexibility Across Platforms**: Amazon Bedrock provides a flexible environment for model hosting and deployment, enabling broader access to generative AI.
- **Seamless User Interaction**: By connecting the models to a Streamlit app, we created an accessible interface that makes it easy for end-users to interact with the model and get real-time responses.

This workflow demonstrated a full deployment lifecycle, from model preparation and compilation to deployment, interaction, and user application, empowering you to build scalable AI solutions on AWS. With these skills, you’re now well-equipped to deploy similar language models across different AWS environments to meet diverse operational needs.
