# Deploy a model on SageMaker Endpoint with LMI container v18

This notebook demonstrates two new features of the Large Model Inference (LMI) container:
1. **Example 1**: use input/output [formatters](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/input_formatter_schema.html) for Inference Components deployed for LoRA adapters
2. **Example 2**: use [LMCache](https://lmcache.ai/)

Both examples use the AWS SDK for Python (boto3).

In [107]:
import time
import re
import json
import os
import tarfile
import boto3
from huggingface_hub import snapshot_download
from pathlib import Path
from IPython.display import display, Markdown, clear_output

boto_session = boto3.Session()
region = boto_session.region_name

sm = boto3.client("sagemaker")  # client to interact with SageMaker
sm_runtime = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints
s3 = boto3.client("s3")

In [None]:
#
# Helper functions to remove dependency on SageMaker Python SDK
#
def get_sagemaker_role():
    sts = boto3.client('sts')
    response = sts.get_caller_identity()
    assumed_role = response['Arn']
    role = re.sub(r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$", r"\1iam::\2:role/\3", assumed_role)
    return role


def wait_for_endpoint(endpoint_name: str, sleep_time: int = 60):
    ind = "."
    progress = f"Waiting for '{endpoint_name}': "
    print(progress)

    status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]

    while status == "Creating":
        time.sleep(sleep_time)

        status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]

        clear_output(wait=True)
        progress += ind
        print(progress)

    print(f"Endpoint: '{endpoint_name}', Status: '{status}'")

def wait_for_ic(ic_name: str, sleep_time: int = 60):
    ind = "."
    progress = f"Waiting for '{ic_name}': "
    print(progress)

    status = sm.describe_inference_component(InferenceComponentName = ic_name)["InferenceComponentStatus"]

    while status == "Creating":
        time.sleep(sleep_time)

        status = sm.describe_inference_component(InferenceComponentName = ic_name)["InferenceComponentStatus"]

        clear_output(wait=True)
        progress += ind
        print(progress)

    print(f"IC: '{ic_name}', Status: '{status}'")
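
The regular expression in `get_sagemaker_role` rewrites the STS assumed-role ARN returned by `get_caller_identity` into an IAM role ARN. A minimal offline sketch of that substitution follows; the account ID and role name are made up, and `assumed_role_to_iam_role` is a hypothetical name used only for illustration:

```python
import re

# Same pattern as in get_sagemaker_role above.
_PATTERN = r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$"


def assumed_role_to_iam_role(arn: str) -> str:
    """Rewrite an STS assumed-role ARN into the matching IAM role ARN."""
    return re.sub(_PATTERN, r"\1iam::\2:role/\3", arn)


# Made-up ARNs for illustration only.
session_arn = "arn:aws:sts::123456789012:assumed-role/MySageMakerRole/SageMaker"
user_arn = "arn:aws:iam::123456789012:user/alice"

print(assumed_role_to_iam_role(session_arn))
print(assumed_role_to_iam_role(user_arn))  # no match, returned unchanged
```

Two caveats follow from the pattern. An ARN that is not an assumed-role ARN (for example an IAM user) comes back unchanged, and an assumed-role ARN carries no role path, so a role created under a path such as `service-role/` is reconstructed without it. In either case, set `role` explicitly instead of relying on the helper.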

In [None]:
#
# Overwrite with your role ARN if you are running this notebook outside of SageMaker Studio
#
role = None

if role is None:
    role = get_sagemaker_role()

bucket = "<YOUR_BUCKET>"

## Example 1. Input/Output formatters

### Configuration

See the [list of available LMI container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) for the current image tags.

In [None]:
CONTAINER_VERSION = "0.36.0-lmi19.0.0-cu128"
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"

instance = {"type": "ml.g6e.4xlarge", "num_gpu": 1}

model_id = "microsoft/Phi-4-mini-instruct"

model_name = f"lmiv18-{time.strftime('%y%m%d-%H%M%S')}"
endpoint_config_name, endpoint_name = model_name, model_name
timeout = 600

variant_name = "main"

common_env = {
    "HF_MODEL_ID": model_id,
}
lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_TENSOR_PARALLEL_DEGREE": json.dumps(instance["num_gpu"]),
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "4",
    "OPTION_MAX_CPU_LORAS": "8",
    "OPTION_MAX_LORA_RANK": "64",
}
env = common_env | lmi_env
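
The container environment is passed to `create_model` as a map of strings, which is why the numeric settings above are quoted and `num_gpu` goes through `json.dumps`. A small sketch of a pre-flight check is shown below; the helper name `non_string_env_values` is hypothetical, and the dictionary merge with `|` assumes Python 3.9 or later:

```python
import json


def non_string_env_values(env: dict) -> list:
    """Return the keys whose values are not plain strings."""
    return [key for key, value in env.items() if not isinstance(value, str)]


instance = {"type": "ml.g6e.4xlarge", "num_gpu": 1}

good_env = {"HF_MODEL_ID": "microsoft/Phi-4-mini-instruct"} | {
    "OPTION_TENSOR_PARALLEL_DEGREE": json.dumps(instance["num_gpu"]),  # "1"
    "OPTION_MAX_LORAS": "4",
}
bad_env = good_env | {"OPTION_MAX_LORAS": 4}  # int instead of str

print(non_string_env_values(good_env))
print(non_string_env_values(bad_env))
```

Running a check like this on `env` before calling `create_model` surfaces a stray integer or boolean locally rather than as a parameter validation error from boto3.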

### Deployment
#### Deploying endpoint

In [65]:
model_res = sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image,
        "Environment": env,
    },
)

In [None]:
config_res = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance["type"],
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
    ],
)

endpoint_res = sm.create_endpoint(EndpointName=endpoint_name,
                                  EndpointConfigName=endpoint_config_name)

_ = wait_for_endpoint(endpoint_name)
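
`wait_for_endpoint` loops only while the status is `Creating` and then prints whatever status comes next, so a `Failed` endpoint does not stop the notebook. A hedged sketch of a stricter poller follows. `wait_for_status` and its parameters are hypothetical, and the status getter is injected so the logic can be exercised without AWS:

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "InService",
    pending: tuple = ("Creating", "Updating"),
    sleep_time: float = 60,
    max_wait: float = 3600,
) -> str:
    """Poll get_status() until it returns `target`.

    Raises RuntimeError on any other non-pending status and
    TimeoutError once max_wait seconds have elapsed.
    """
    deadline = time.monotonic() + max_wait
    while True:
        status = get_status()
        if status == target:
            return status
        if status not in pending:
            raise RuntimeError(f"Unexpected status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {max_wait}s")
        time.sleep(sleep_time)


# Offline demonstration with a scripted sequence of statuses.
statuses = iter(["Creating", "Creating", "InService"])
print(wait_for_status(lambda: next(statuses), sleep_time=0))
```

Against a real endpoint the getter could be `lambda: sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`. boto3 also ships an `endpoint_in_service` waiter (`sm.get_waiter("endpoint_in_service")`) that covers the endpoint case.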

#### Create Inference Component for the base model

In [None]:
base_ic_name = f"base-{model_name}"

min_memory_required_in_mb = 4096
number_of_accelerator_devices_required = 1

ic_res = sm.create_inference_component(
    InferenceComponentName=base_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": 1,
    },
)
_ = wait_for_ic(base_ic_name)

#### Test deployed base model

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is Amazon SageMaker?"}
    ],
}

start_time = time.time()
res = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                 InferenceComponentName=base_ic_name,
                                 Body=json.dumps(payload),
                                 ContentType="application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

usage = response["usage"]
print(f'-----------------------\n{usage}')

✅ Response time: 6.53s



Amazon SageMaker is a fully managed service provided by Amazon Web Services (AWS) designed to enable developers and data scientists to build, train, and deploy machine learning models at scale. It is a comprehensive platform that includes a range of features to support end-to-end machine learning workflows, which makes it easier for users to go from concept to production. SageMaker provides the tools to develop models faster and more efficiently, as well as the opportunity to engage with the broader ML community through pre-built frameworks and tools supplied by AWS.

Key features and components of Amazon SageMaker include:

1. Pre-built Algorithms: SageMaker comes with hundreds of pre-built models and algorithms that you can use to train and deploy machine learning applications quickly.
2. Automated Model Tuning: SageMaker includes hyperparameter tuning and optimization in the cloud, enabling you to train machine learning models to produce more accurate results through experimentation.
3. Data Processing: SageMaker supports data processing and data preparation at different stages of the machine learning pipeline, including data preprocessing, data transformation, and data augmentation.
4. Estimator APIs: You can use SageMaker Estimators in your notebooks, which are high-level APIs that dramatically simplify the training of machine learning models.
5. Machine Learning Pipeline (ML Pipeline): SageMaker Pipelines feature a CRUD-based operation model on machine learning pipelines, enabling you to integrate SageMaker with other AWS services for end-to-end training and deployment of machine learning models in the cloud.
6. Model Monitoring: SageMaker Models can track actual training job metrics, which enables you to make predictions on live data, automatically update the model training jobs, and separate test and production versions of the model.
7. Edge: SageMaker Edge Manager provides capabilities to build, train, and deploy machine learning models directly to your edge devices.

SageMaker offers a user-friendly interface and scalable resources, allowing data scientists to focus more on building and deploying their models and less on managing infrastructure resources. It also provides an integration with other AWS services for more advanced capabilities, such as deep learning, reinforcement learning, and natural language generation. It supports various programming languages, including Python, Scala, R, and Jupyter Notebook. It also provides built-in support for container (including Docker), which broadens its ability to deploy machine learning models. Overall, SageMaker is a powerful and comprehensive platform that helps leverage AWS computing power for machine learning.

-----------------------
{'prompt_tokens': 9, 'total_tokens': 490, 'completion_tokens': 481, 'prompt_tokens_details': None}


#### Preparing adapter file
##### Download an adapter from Hugging Face, create a compressed tar archive, and upload it to an S3 bucket

In [None]:
adapter_id = "grounded-ai/phi4-mini-judge"
local_model_path = Path("./data")
local_model_path.mkdir(exist_ok=True)

_ = snapshot_download(repo_id=adapter_id, local_dir=local_model_path)

!rm -rf data/runs/

In [74]:
adapter_1 = "ic1.tar.gz"

with tarfile.open(adapter_1, 'w:gz') as tar:
    for filename in os.listdir(local_model_path):
        file_path = os.path.join(local_model_path, filename)
        if os.path.isfile(file_path):
            tar.add(file_path, arcname=f"./{filename}")
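
The archive is built with every file at its root (`arcname=f"./{filename}"`), and subdirectories are skipped by the `isfile` check. One way to confirm the layout before uploading is to list the members. The sketch below repeats the same packing step on a throwaway directory; the file names are placeholders, not the adapter's actual contents:

```python
import os
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data"
    src.mkdir()
    # Placeholder files standing in for the downloaded adapter.
    (src / "adapter_config.json").write_text("{}")
    (src / "adapter_model.safetensors").write_bytes(b"\x00")
    (src / "runs").mkdir()  # directories are skipped by the isfile() check

    archive = Path(tmp) / "ic1.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for filename in sorted(os.listdir(src)):
            file_path = os.path.join(src, filename)
            if os.path.isfile(file_path):
                tar.add(file_path, arcname=f"./{filename}")

    # Read the archive back and collect the member names.
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getnames()

print(members)
```

The same `tarfile.open(adapter_1, "r:gz").getnames()` call can be run on the real archive to check that the adapter files sit at the top level.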

In [None]:
s3_adapter_key = "adapters-test/lmi_v18"
key = f"{s3_adapter_key}/{adapter_1}"

s3.upload_file(adapter_1, bucket, key)

adapter1_s3_uri = f"s3://{bucket}/{s3_adapter_key}/{adapter_1}"
print(adapter1_s3_uri)

#### Create Inference Component for the first LoRA adapter

In [None]:
adapter1_ic_name = f"adapter1-{model_name}"

adapter_res = sm.create_inference_component(
    InferenceComponentName=adapter1_ic_name,
    EndpointName=endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_ic_name,
        "Container": {
            "ArtifactUrl": adapter1_s3_uri
        },
    },
)

_ = wait_for_ic(adapter1_ic_name)

#### Test IC with adapter

Note that the output should be very close to that of the base model, because this prompt does not use any of the adapter's special abilities.

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is Amazon SageMaker?"}
    ],
}

start_time = time.time()
res = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                 InferenceComponentName=adapter1_ic_name,
                                 Body=json.dumps(payload),
                                 ContentType="application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

usage = response["usage"]
print(f'-----------------------\n{usage}')

✅ Response time: 6.24s



Amazon SageMaker is a fully-managed machine learning service provided by Amazon Web Services (AWS) that enables data scientists and developers to build, train, and deploy machine learning models at scale. SageMaker offers a comprehensive set of tools and features to streamline the entire machine learning (ML) workflow, including data labeling, data preprocessing, model training, model tuning, and deployment. It simplifies the process of bringing machine learning models to production, allowing users to focus on innovation rather than the underlying infrastructure.

Key features of Amazon SageMaker include:

1. **Data labeling and pre-processing**: SageMaker offers a range of built-in algorithms for data preprocessing and labeling, making it easier to clean, transform, and prepare data for model training.
2. **Model training and hyperparameter tuning**: Users can quickly train custom models using various built-in algorithms or bring their own models. SageMaker's automatic model tuning feature helps optimize hyperparameters for better performance.
3. **Model deployment**: SageMaker provides various deployment options, such as model hosting directly in the cloud or using a hybrid approach where models are deployed on edge devices. It also ensures secure integration with existing IT systems.
4. **Model monitoring and lifecycle management**: Users can track performance, monitor models, and manage the lifecycle using a unified interface. SageMaker provides detailed metrics and insights to improve model performance over time.
5. **Machine learning pipelines**: SageMaker Pipelines enable the automated execution of machine learning workflows, streamlining the entire process from data preparation to model deployment.
6. **AWS integration**: SageMaker seamlessly integrates with other AWS services, such as Amazon S3 for data storage, Amazon EC2 for compute, and AWS Lambda for serverless computing.

SageMaker supports a wide range of use cases, from explainable AI to machine learning model management across different industries. It also offers features to democratize ML, enabling users with varying levels of expertise to create and deploy models effectively.

In summary, Amazon SageMaker is a powerful and user-friendly service that accelerates the machine learning lifecycle, offering comprehensive tools for data preparation, model training, model tuning, and deployment while providing seamless integration with other AWS services.

-----------------------
{'prompt_tokens': 9, 'total_tokens': 449, 'completion_tokens': 440, 'prompt_tokens_details': None}


#### Prepare custom input formatter

To illustrate custom input/output formatters, we add an input formatter that appends an extra instruction to the prompt: ***"Speak like a pirate"***

In [None]:
input_formatter = """# adapters/my_adapter/model.py
from djl_python.input_parser import input_formatter

@input_formatter
def custom_input_formatter(decoded_payload: dict, tokenizer=None, **kwargs) -> dict:
    print(f"PAYLOAD: {decoded_payload}")
    if "messages" in decoded_payload:
        messages = decoded_payload["messages"]
        messages.append({'role': 'user', 'content': 'Speak like a pirate'})
        decoded_payload.update({"messages": messages})
    print(f"PAYLOAD: {decoded_payload}")

    return decoded_payload
"""
model_file_name = "./model.py"
with open(model_file_name, 'w') as f:
    f.write(input_formatter)
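
Before packaging, the message-rewriting logic can be checked locally without the container. The sketch below copies the body of `custom_input_formatter` into a plain function (`add_pirate_instruction`, a hypothetical name) and leaves out the `@input_formatter` decorator, which comes from `djl_python` and is assumed to be importable only inside the LMI image:

```python
import copy


def add_pirate_instruction(decoded_payload: dict) -> dict:
    """Same logic as custom_input_formatter, minus the decorator and prints."""
    if "messages" in decoded_payload:
        decoded_payload["messages"].append(
            {"role": "user", "content": "Speak like a pirate"}
        )
    return decoded_payload


request = {"messages": [{"role": "user", "content": "What is Amazon SageMaker?"}]}
# Work on a copy so the original request stays unchanged for comparison.
formatted = add_pirate_instruction(copy.deepcopy(request))

for message in formatted["messages"]:
    print(f"{message['role']}: {message['content']}")
```

The formatter appends a second `user` turn rather than a `system` message, so the request gains one extra message. Payloads without a `messages` key pass through untouched.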

#### Update the archive with the custom input formatter and upload it to S3

In [None]:
# Create new archive
adapter_2 = "ic2.tar.gz"

with tarfile.open(adapter_2, 'w:gz') as new_tar:
    # Copy all members from original archive
    with tarfile.open(adapter_1, 'r:gz') as original_tar:
        for member in original_tar.getmembers():
            if member.isfile():
                # Copy file data
                file_obj = original_tar.extractfile(member)
                new_tar.addfile(member, file_obj)
            else:
                # Copy directories and other types
                new_tar.addfile(member)

    # Add the new file
    new_tar.add(model_file_name, arcname=model_file_name)

In [None]:
s3_adapter_key = "adapters-test/lmi_v18"
key = f"{s3_adapter_key}/{adapter_2}"

s3.upload_file(adapter_2, bucket, key)

adapter2_s3_uri = f"s3://{bucket}/{s3_adapter_key}/{adapter_2}"
print(adapter2_s3_uri)

#### Create Inference Component for the second LoRA adapter with custom input formatter

In [None]:
adapter2_ic_name = f"adapter2-{model_name}"

adapter_res = sm.create_inference_component(
    InferenceComponentName=adapter2_ic_name,
    EndpointName=endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_ic_name,
        "Container": {
            "ArtifactUrl": adapter2_s3_uri
        },
    },
)

_ = wait_for_ic(adapter2_ic_name)

#### Test inference

Note that the output should now be in "pirate speak", because the custom input formatter adds this instruction to the input messages.

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is Amazon SageMaker?"}
    ],
}

start_time = time.time()
res = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                 InferenceComponentName=adapter2_ic_name,
                                 Body=json.dumps(payload),
                                 ContentType="application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

usage = response["usage"]
print(f'-----------------------\n{usage}')

✅ Response time: 4.39s



Arrr, ye be seekin' the knowledge of Amazon SageMaker, be ye? Very well, listen up, matey!

Amazon SageMaker be a magical tool in the cove of machine learnin', crafted by the great and powerful crew at Amazon themselves. It's like a vast treasure map for any brigand or buccaneer havin' a hankerin' to delve into the wilds of artificial intelligence and machine learnin'. This here platform be equipped with all sorts of enchanted artifacts and mystical capabilities designed to help ye build, train, and deploy machine learnin' models with ease, much like navigatin' through the Seven Seas.

With SageMaker, ye can:

1. Sail across the vast oceans of data, preprocessin' and explorin' it with ease.
2. Harness the powers of various machine learnin' frameworks, like TensorFlow and PyTorch, to craft yer models.
3. Use the mighty GPUs and TPUs for training yer models with the strength of a hundred men.
4. Deploy yer models directly into production, like shovin' a cannon into the night sky.
5. Continu'ally improve yer models with the help of Amazon's vast resources and wisdom.

So, if ye be curio'ous or if ye be in need of assistance with machine learnin', cast yer rags at Amazon SageMaker and watch as it transform yer wildest ideas into reality, just like a real pirate's dream come true! Yarrrrr!

-----------------------
{'prompt_tokens': 15, 'total_tokens': 316, 'completion_tokens': 301, 'prompt_tokens_details': None}


## Cleanup for example 1

In [90]:
_ = sm.delete_inference_component(InferenceComponentName=adapter1_ic_name)
_ = sm.delete_inference_component(InferenceComponentName=adapter2_ic_name)

In [91]:
_ = sm.delete_inference_component(InferenceComponentName=base_ic_name)

In [92]:
_ = sm.delete_endpoint(EndpointName=endpoint_name)
_ = sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
_ = sm.delete_model(ModelName=model_name)

## Example 2. Use LMCache on Amazon SageMaker AI

In [None]:
CONTAINER_VERSION = "0.36.0-lmi19.0.0-cu128"
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"

instance = {"type": "ml.g6e.2xlarge", "num_gpu": 1}

model_id = "Qwen/Qwen3-8B"

model_name = f"lmcache-{time.strftime('%y%m%d-%H%M%S')}"
endpoint_config_name, endpoint_name = model_name, model_name
timeout = 600

variant_name = "main"

common_env = {
    "HF_MODEL_ID": model_id,
}
lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_TENSOR_PARALLEL_DEGREE": json.dumps(instance["num_gpu"]),
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_LMCACHE_AUTO_CONFIG": "true",
    "PYTHONHASHSEED": "0"
}
env = common_env | lmi_env

In [118]:
model_resp = sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image,
        "Environment": env,
    },
)

In [None]:
config_resp = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "ModelName": model_name,
            "InstanceType": instance["type"],
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout,
        },
    ],
)

endpoint_resp = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

_ = wait_for_endpoint(endpoint_name)

### Test inference

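You should see a line similar to `LMCache INFO: Reqid: chatcmpl-a8455c0f9b1b93f9, Total tokens 16, LMCache hit tokens: 0, need to load: 0 ...` in the CloudWatch logs. On this first request no tokens are served from the cache (`LMCache hit tokens: 0`).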
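
To check cache behaviour programmatically rather than by eye, the counters can be pulled out of such a log line with a regular expression. This sketch assumes the message keeps the format quoted above, which may differ between LMCache versions; `parse_lmcache_line` is a hypothetical helper:

```python
import re
from typing import Optional

# Pattern derived from the log line quoted above; the format is an assumption.
LMCACHE_RE = re.compile(
    r"Reqid: (?P<reqid>[^,\s]+), Total tokens (?P<total>\d+), "
    r"LMCache hit tokens: (?P<hit>\d+), need to load: (?P<load>-?\d+)"
)


def parse_lmcache_line(line: str) -> Optional[dict]:
    """Extract request id and token counters, or None if the line does not match."""
    match = LMCACHE_RE.search(line)
    if match is None:
        return None
    return {
        "reqid": match["reqid"],
        "total_tokens": int(match["total"]),
        "hit_tokens": int(match["hit"]),
        "need_to_load": int(match["load"]),
    }


line = (
    "LMCache INFO: Reqid: chatcmpl-a8455c0f9b1b93f9, Total tokens 16, "
    "LMCache hit tokens: 0, need to load: 0"
)
print(parse_lmcache_line(line))
```

The lines themselves can be fetched with the CloudWatch Logs `filter_log_events` API. SageMaker endpoints write container logs to the `/aws/sagemaker/Endpoints/<endpoint_name>` log group.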

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                 Body=json.dumps(payload),
                                 ContentType="application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

usage = response["usage"]
print(f'-----------------------\n{usage}')

✅ Response time: 36.53s



<think>
Okay, the user is asking for popular places to visit in London. Let me start by recalling the main attractions. The Tower of London is a must, it's a historic site with the Crown Jewels. Then there's Big Ben and the Houses of Parliament, which are iconic. The British Museum is another key spot, especially for those interested in history and art. 

Wait, maybe I should include the London Eye as well. It's a famous landmark and offers great views. The Buckingham Palace is another one, especially if you're in the area during the summer when the Changing of the Guard happens. 

Oh, and the Westminster Abbey is nearby, which is a significant religious and historical site. The Camden Market is a great place for shopping and food, especially for a more local vibe. 

I should also mention the National Gallery and the Tate Modern for art lovers. The Shard is a modern landmark with an observation deck. The Royal Botanic Gardens at Kew are quite far but worth mentioning. 

Don't forget the West End for theater, and maybe the London Zoo. Wait, is the London Zoo still a popular spot? I think it's more of a family attraction. Also, the Harry Potter Studio Tour is a newer addition, though it's in Warner Bros. Studio in Leavesden, which is outside London but still a popular day trip. 

I need to make sure I cover a range of areas: historical, cultural, shopping, and maybe some more unique spots. Let me check if I missed any major ones. The British Library? Maybe not as touristy. The Tate Modern is definitely a top art museum. 

Also, the Covent Garden area has the market and the Royal Opera House. The South Bank is another area with the Globe Theatre and the Thames Path. Maybe include the National Theatre as well. 

Wait, the user might be looking for a variety, so I should categorize them a bit. Let me structure the answer with categories like Historic Sites, Cultural Attractions, Shopping and Markets, Parks and Gardens, and maybe some unique experiences like the Harry Potter tour. 

I should also mention if some places require tickets or have specific hours. For example, the Tower of London has timed entries. The London Eye has different viewing options. Also, note that some places are free, like the British Museum. 

I need to ensure the information is up-to-date. For instance, the recent changes in the London Eye's operations or any closures. But I think the main attractions are still open. 

Let me list them out with brief descriptions and maybe a note on what to expect. That way, the user gets a clear overview without too much detail. Also, include tips like the best times to visit or nearby amenities. 

Wait, the user might be planning a trip and wants a quick list. So keeping it concise but informative. Maybe 15-20 places would be good. Let me check my list again: Tower of London, Big Ben, Houses of Parliament, British Museum, London Eye, Buckingham Palace, Westminster Abbey, National Gallery, Tate Modern, The Shard, Kew Gardens, Covent Garden, South Bank, Camden Market, Harry Potter Studio Tour, and maybe the London Zoo. 

I think that's a solid list. Now, organize them into categories and add a bit of context for each. Make sure to mention that some are free and others require tickets. Also, highlight the unique experiences like the Changing of the Guard or the art collections. 

I should avoid listing too many similar places. For example, the National Gallery and Tate Modern are both art museums but different. Also, include the River Thames for a scenic view. Maybe mention the London Eye as a must-see for the view. 

Okay, I think that covers the main points. Now, structure the answer with each category and bullet points for clarity. Make sure it's easy to read and helpful for someone planning a visit.
</think>

London is a city rich in history, culture, and iconic landmarks. Here’s a curated list of popular places to visit, categorized for ease of exploration:

---

### **Historic & Iconic Sites**  
1. **Tower of London**  
   - A historic fortress and home to the Crown Jewels. Explore medieval history and the Tower Ravens.  
   - **Tip:** Book tickets in advance for timed entry.  

2. **Big Ben & Houses of Parliament**  
   - The iconic clock tower and seat of the UK government. Great for photos and views of the Thames.  

3. **Westminster Abbey**  
   - A Gothic masterpiece and the coronation site of British monarchs. Visit the Poets’ Corner and the tomb of Shakespeare.  

4. **Buckingham Palace**  
   - The official residence of the monarch. Don’t miss the Changing of the Guard ceremony (summer months).  

---

### **Cultural & Artistic Attractions**  
5. **British Museum**  
   - Free admission! Explore global art and artifacts, including the Rosetta Stone and Egyptian mummies.  

6. **National Gallery**  
   - A world-class art museum with works by Van Gogh, Da Vinci, and Turner.  

7. **Tate Modern**  
   - A modern art museum housed in a former power station. Features contemporary art and installations.  

8. **The British Library**  
   - A treasure trove of manuscripts, books, and historical documents.  

---

### **Parks & Gardens**  
9. **Hyde Park**  
   - A green oasis in the heart of London. Enjoy the Serpentine Lake, Speakers’ Corner, and the Royal Albert Hall.  

10. **Kew Gardens**  
   - A UNESCO World Heritage Site with botanical wonders, glasshouses, and the Royal Botanic Gardens.  

11. **Regent’s Park**  
   - Home to the London Zoo, Open Air Theatre, and the Royal Albert Hall.  

---

### **Shopping & Markets**  
12. **Camden Market**  
   - A vibrant hub for vintage, street food, and eclectic shops.  

13. **Oxford Street & Bond Street**  
   - For luxury shopping, flagship stores, and high-end boutiques.  

14. **Covent Garden**  
   - A mix of markets, theaters, and historic buildings. Don’t miss the street performers and the Royal Opera House.  

---

### **Unique Experiences**  
15. **London Eye**  
   - A giant Ferris wheel with panoramic views of the city. Perfect for sunset views.  

16. **South Bank & Thames Walk**  
   - Explore the Thames Path, the Globe Theatre, and the National Theatre.  

17. **Harry Potter Studio Tour**  
   - Located in Leavesden (near London), this is a must for fans of the series.  

18. **The Shard**  
   - A modern skyscraper with an observation deck offering stunning city views.  

---

### **Other Highlights**  
19. **St Paul’s Cathedral**  
   - A stunning Gothic cathedral with a famous dome. Visit the Whispering Gallery.  

20. **West End**  
   - The heart of London’s theater scene. Catch a show at the Royal Albert Hall or the Lyceum Theatre.  

---

### **Tips**  
- **Free Attractions:** British Museum, National Gallery, and many parks.  
- **Tickets:** Book online for popular sites (e.g., Tower of London, London Eye) to avoid queues.  
- **Transport:** Use the London Underground (Tube) or buses for easy navigation.  

London offers something for everyone, whether you’re a history buff, art lover, or foodie! 🗼✨

-----------------------
{'prompt_tokens': 16, 'total_tokens': 1609, 'completion_tokens': 1593, 'prompt_tokens_details': None}
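
Qwen3 returns its reasoning inside a `<think>...</think>` block in the same `content` string as the answer, as the output above shows. If only the final answer should be displayed, the two parts can be separated before rendering. A minimal sketch, assuming at most one leading think block (`split_reasoning` is a hypothetical helper):

```python
import re

# Matches a single think block at the start of the text, plus trailing whitespace.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(content: str) -> tuple:
    """Return (reasoning, answer); reasoning is '' when no think block leads the text."""
    match = THINK_RE.match(content)
    if match is None:
        return "", content
    return match.group(1).strip(), content[match.end():]


sample = "<think>\nList the main attractions.\n</think>\n\nLondon is a city rich in history."
reasoning, answer = split_reasoning(sample)
print(answer)
```

In the notebook, the answer part could then be rendered with `display(Markdown(split_reasoning(response["choices"][0]["message"]["content"])[1]))`.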


### Test Inference (LMCache)

You should see an LMCache message in the CloudWatch logs indicating that the KV cache was reused.

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                 Body=json.dumps(payload),
                                 ContentType="application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s\n")
display(Markdown(response["choices"][0]["message"]["content"]))

usage = response["usage"]
print(f'-----------------------\n{usage}')

✅ Response time: 27.00s



<think>
Okay, the user is asking for popular places to visit in London. Let me start by recalling the main attractions. First, the British Museum is a must because it's free and has a vast collection. Then there's the Tower of London, which is historic and has the Crown Jewels. Big Ben and the Houses of Parliament are iconic, even though Big Ben is actually the clock. The London Eye is a modern landmark with great views. Buckingham Palace is another key spot, especially for the Changing of the Guard. Westminster Abbey is a historic church with royal connections. The Shard is a modern skyscraper with an observation deck. The Victoria and Albert Museum is great for art and design. Covent Garden is a lively area with markets and theaters. The National Gallery is famous for its art, and the Natural History Museum is a favorite for its exhibits. Hyde Park is a big green space, and the Thames River cruises offer scenic views. I should also mention some less obvious places like the Science Museum or the British Library. Wait, did I miss any? Maybe the Tower Bridge? Oh, right, it's part of the Tower of London area. Also, the London Underground is a unique experience. I need to make sure the list is comprehensive but not too long. Let me organize them by categories like historical, cultural, natural, and modern. Check if all the places are accurate. For example, the British Museum is free, which is a good point. The London Eye is a fun activity. Also, maybe include some neighborhoods like Notting Hill or Camden for a different vibe. But the user might be looking for the top spots. I should prioritize the most famous ones first. Double-check the names and details to avoid mistakes. Alright, that should cover the essentials.
</think>

London is a vibrant city with a rich history and diverse attractions. Here are some of the most popular places to visit, categorized for easy navigation:

### **Historical & Cultural Landmarks**
1. **British Museum**  
   - Free entry, home to iconic artifacts like the Rosetta Stone and the Parthenon Marbles.  
2. **Tower of London**  
   - Historic fortress with the Crown Jewels, Tower Bridge, and the White Tower.  
3. **Houses of Parliament & Big Ben**  
   - Iconic landmarks with stunning architecture and the famous clock tower.  
4. **Westminster Abbey**  
   - Historic church where British monarchs are crowned and laid to rest.  
5. **Buckingham Palace**  
   - Royal residence with the Changing of the Guard ceremony (daily at 11:30 AM).  

### **Modern & Iconic Attractions**
6. **London Eye**  
   - Giant Ferris wheel with panoramic views of the city.  
7. **The Shard**  
   - Skyscraper with an observation deck offering 360° views.  
8. **St Paul’s Cathedral**  
   - Historic church with a breathtaking dome and the "Dome of the Church of England."  

### **Art & Museums**
9. **National Gallery**  
   - World-class art collection, including works by Van Gogh, Rembrandt, and Turner.  
10. **Victoria and Albert Museum (V&A)**  
    - Focuses on art, design, and fashion with a vast collection.  
11. **Science Museum**  
    - Interactive exhibits on science, technology, and innovation.  
12. **Natural History Museum**  
    - Stunning architecture and exhibits like the dinosaur skeletons and the Hintze Hall.  

### **Parks & Green Spaces**
13. **Hyde Park**  
    - One of London’s largest parks, ideal for walking, picnics, or the Serpentine Lake.  
14. **Regent’s Park**  
    - Home to the London Zoo, the Royal Albert Hall, and open-air concerts.  
15. **Kensington Gardens**  
    - Adjacent to Buckingham Palace, with the Serpentine and the Peter Pan statue.  

### **Shopping & Markets**
16. **Covent Garden**  
    - Historic market with street performers, shops, and the Royal Opera House.  
17. **Oxford Street**  
    - One of the world’s busiest shopping streets, with flagship stores.  
18. **Camden Market**  
    - Quirky market with vintage, alternative, and global goods.  

### **Neighborhoods & Unique Experiences**
19. **Notting Hill**  
    - Colorful streets, the Portobello Road market, and the famous "Bend It Like Beckham" location.  
20. **South Bank**  
    - Vibrant area with the Thames, the Globe Theatre, and the Southbank Centre.  
21. **Shoreditch**  
    - Trendy area with street art, cafes, and the famous "Banksy" murals.  

### **Other Highlights**
22. **Tower Bridge**  
    - Iconic bridge with a lift to the upper levels for views of the river.  
23. **The Thames River Cruises**  
    - Scenic boat rides with views of landmarks like the Houses of Parliament.  
24. **The British Library**  
    - World-class library with exhibitions and rare manuscripts.  

### **Tips**
- **Transport**: Use the London Underground (Tube) or buses for easy access.  
- **Tickets**: Some attractions (e.g., Tower of London, London Eye) require advance booking.  
- **Seasonal Events**: Check for festivals like the London Marathon, Notting Hill Carnival, or Christmas markets.  

Whether you’re interested in history, art, nature, or modern culture, London offers something for everyone! 🗼✨

-----------------------
{'prompt_tokens': 16, 'total_tokens': 1215, 'completion_tokens': 1199, 'prompt_tokens_details': None}
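
The second call finished about 9.5 s sooner, but it also generated fewer tokens, so wall-clock time alone does not isolate the cache effect. Dividing completion tokens by elapsed time gives a rough comparison. The numbers below are copied from the two outputs in this notebook. The elapsed time includes network and prefill time, so this only approximates decode throughput:

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Rough end-to-end generation rate for one request."""
    return completion_tokens / elapsed_s


# (completion_tokens, response time in seconds) from the two runs above.
first_run = (1593, 36.53)
second_run = (1199, 27.00)

for label, (tokens, seconds) in (("first", first_run), ("second", second_run)):
    print(f"{label}: {tokens_per_second(tokens, seconds):.1f} tokens/s")
```

Both runs come out at roughly 44 tokens/s, which suggests that most of the difference in response time comes from the shorter completion. With a 16-token prompt there is little prefill work for the cache to save, so a longer shared prefix would likely make the effect easier to observe.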


## Cleanup for example 2

In [123]:
_ = sm.delete_endpoint(EndpointName=endpoint_name)
_ = sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
_ = sm.delete_model(ModelName=model_name)