# HyperPod Inference with KV Cache and Intelligent Routing Admin Notebook

HyperPod's KV Caching and Intelligent Routing features address key challenges in LLM inference, including high latency for long-context prompts, inefficient resource utilization in multi-turn conversations, redundant computation across similar requests, and poor scaling for concurrent users. Our solution combines advanced caching mechanisms with intelligent request routing to deliver an optimized inference experience.

## 1.0 Choose Your Deployment Path

This notebook supports two deployment scenarios. **Choose the path that matches your situation:**

### Path 1: Console Cluster Creation with Inference Operator
**Use this if:** You are creating a new HyperPod cluster
**What it does:** Automatically installs inference operator during cluster creation
**Skip to:** [Section 10.0 - KV Cache and Intelligent Routing Deployment](##2dce9a4c)

### Path 2: Install Inference Operator on Existing Cluster
**Use this if:** You have an existing HyperPod cluster without inference operator
**What it does:** Manually installs all required components
**Continue with:** Section 2.0 below

---

## PATH 1: Console Cluster Creation

### 1.1 Create Cluster via AWS Console

This is the simplest path. The AWS Console automatically configures and installs the inference operator and enables tiered storage, allocating 20% of memory by default.

1. Navigate to the SageMaker AI console
2. Choose HyperPod Cluster from the left navigation pane and select Cluster Management
3. Click Create HyperPod cluster and select Orchestrated by EKS
4. Select Quick Setup

![image (15).png](<attachment:image (15).png>)

5. In the Storage Configuration section:
   - Enable tiered storage: Check this box
   - Memory allocation percentage: 20% (default, recommended for most workloads)
   - This allocates 20% of instance CPU memory for MemBrain (L2 cache)
   - Adjust based on your caching needs (10-40% range)

![image (16).png](<attachment:image (16).png>)

6. Complete cluster creation
7. Once cluster is ready, **skip to [Section 10.0 - KV Cache and Intelligent Routing Deployment](#endpoint-deployment)**
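
The memory allocation in step 5 is simple arithmetic. The sketch below is not part of the console workflow; it only estimates how much CPU memory the tiered storage setting would reserve. The function name `tiered_storage_gib` and the 768 GiB instance size are illustrative assumptions, so substitute the memory of your own instance type.

```python
def tiered_storage_gib(instance_memory_gib: float, percent: int = 20) -> float:
    """Estimate the CPU memory (GiB) reserved for tiered storage (L2 cache)."""
    # The console step describes a 10-40% range, so reject values outside it.
    if not 10 <= percent <= 40:
        raise ValueError("percent must be between 10 and 40")
    return instance_memory_gib * percent / 100


# Hypothetical instance with 768 GiB of CPU memory.
for pct in (10, 20, 40):
    print(f"{pct}% -> {tiered_storage_gib(768, pct)} GiB")
```

A larger percentage leaves less CPU memory for everything else on the node, so it is worth checking the result against the memory requests of your inference pods.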

---

## PATH 2: Install Inference Operator on Existing Cluster

### Prerequisites

Run this notebook with a role that has Administrator access, lists the SageMaker service (`sagemaker.amazonaws.com`) in its trust policy, and has cluster admin access to the EKS cluster.

For access to EKS cluster:

- Go to EKS console and select the cluster you are using
- Look in the "Access" tab and select "IAM Access Entries"
- If there is not an entry for your execution role:
  - Select "Create Access Entry"
  - Select the desired execution role and associate the `AmazonEKSClusterAdminPolicy` with the role

**Important:** Ensure the role running this notebook has Admin Access
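
The console steps above can also be expressed with the AWS CLI. The sketch below only builds the two command lines (`aws eks create-access-entry` and `aws eks associate-access-policy`) as argument lists; it does not run them. The helper name `access_entry_commands`, the cluster name, and the role ARN are placeholders assumed for illustration.

```python
import shlex


def access_entry_commands(cluster_name: str, principal_arn: str, region: str) -> list:
    """Build AWS CLI argument lists that grant a role cluster admin access."""
    # EKS access policies live under this fixed ARN prefix.
    policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
    common = ["--cluster-name", cluster_name, "--principal-arn", principal_arn,
              "--region", region]
    create = ["aws", "eks", "create-access-entry"] + common
    associate = ["aws", "eks", "associate-access-policy"] + common + [
        "--policy-arn", policy_arn,
        "--access-scope", "type=cluster",
    ]
    return [create, associate]


# Placeholder values; replace with your cluster name, role ARN, and Region.
for cmd in access_entry_commands(
    "my-eks-cluster", "arn:aws:iam::111122223333:role/MyExecutionRole", "us-west-2"
):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the placeholder values are replaced.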

## 2.0 Set up Environment 

### 2.1 Set up environment variables

In [None]:
# Name of your EKS cluster
EKS_CLUSTER_NAME=""

# Region
REGION=""

# Account Id
ACCOUNT_ID=""

# Name of Hyperpod cluster
HP_CLUSTER_NAME=""

# ARN of Hyperpod cluster
HP_CLUSTER_ARN=""

# ID of HyperPod cluster (last 12 characters of ARN)
HP_CLUSTER_ID=""

# S3 bucket where tls certificates will be uploaded
TLS_BUCKET_NAME=""

# VPC Id
VPC_ID=""
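
Several of the variables above can be derived from the HyperPod cluster ARN rather than typed by hand. The following is a minimal sketch, assuming the ARN has the usual `arn:<partition>:sagemaker:<region>:<account-id>:cluster/<id>` shape and, as noted above, that the cluster ID is the last 12 characters of the ARN. The helper name `parse_hyperpod_cluster_arn` and the sample ARN are hypothetical.

```python
def parse_hyperpod_cluster_arn(arn: str) -> dict:
    """Split a HyperPod cluster ARN into region, account ID, and cluster ID."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sagemaker" or not parts[5].startswith("cluster/"):
        raise ValueError(f"not a SageMaker cluster ARN: {arn!r}")
    return {
        "region": parts[3],
        "account_id": parts[4],
        # Per the comment in the cell above: last 12 characters of the ARN.
        "cluster_id": arn[-12:],
    }


# Made-up ARN for illustration only.
info = parse_hyperpod_cluster_arn(
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcdefghijkl"
)
print(info)
```

With a real ARN you could then assign `REGION = info["region"]`, `ACCOUNT_ID = info["account_id"]`, and `HP_CLUSTER_ID = info["cluster_id"]`.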

In [None]:
LB_CONTROLLER_POLICY_NAME = "LBControllerPolicy-" + HP_CLUSTER_ID
S3_MOUNT_ACCESS_POLICY_NAME = "S3MountpointAccessPolicy-" + HP_CLUSTER_ID
S3_CSI_ROLE_NAME = "S3CSIRole-" + HP_CLUSTER_ID
KEDA_OPERATOR_POLICY_NAME = "KedaOperatorPolicy-" + HP_CLUSTER_ID
KEDA_OPERATOR_ROLE_NAME = "KedaOperatorRole-" + HP_CLUSTER_ID
PRESIGNED_URL_ACCESS_POLICY_NAME = "PresignedUrlAccessPolicy-" + HP_CLUSTER_ID
HYPERPOD_INFERENCE_ACCESS_POLICY_NAME = "HyperpodInferenceAccessPolicy-" + HP_CLUSTER_ID
HYPERPOD_INFERENCE_ROLE_NAME = "HyperpodInferenceRole-" + HP_CLUSTER_ID
HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller"
HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system"
JUMPSTART_GATED_ROLE_NAME = "JumpstartGatedRole-" + HP_CLUSTER_ID
FSX_CSI_ROLE_NAME = "FSxCSIDriverFullAccess-" + HP_CLUSTER_ID

### 2.2 Install dependencies 

The following script downloads and installs Helm, kubectl, and eksctl if they are not already installed.

In [None]:
!git clone https://github.com/aws/sagemaker-hyperpod-cli.git

In [None]:
!chmod +x sagemaker-hyperpod-cli/helm_chart/install_dependencies.sh
!./sagemaker-hyperpod-cli/helm_chart/install_dependencies.sh

## 3.0 Connect to your EKS Cluster (via kubeconfig)

In [None]:
!aws eks update-kubeconfig --name "$EKS_CLUSTER_NAME" --region "$REGION" --output json

### 3.1 Confirm Successful Connection to Cluster

The commands below should run without errors.

In [None]:
!aws sts get-caller-identity

#### Create an access entry

To give the role shown above access to the EKS cluster:

- Go to the EKS console and select the cluster you are using.
  Note: The corresponding EKS cluster name is shown in the output of the above cell.
- Look in the "Access" tab and select "IAM Access Entries"
- If there is not an entry for your execution role:
  - Select "Create Access Entry"
  - Select the desired execution role and associate the `AmazonEKSClusterAdminPolicy` with the role

In [None]:
!kubectl config current-context
!kubectl get svc

### 3.2 Associate IAM OIDC Provider with EKS Cluster

In [None]:
!eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --region $REGION --approve

## 4.0 Load Balancer Controller Installation

### 4.1 Create IAM Policy for AWSLoadBalancer

In [None]:
!curl -o /tmp/AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
!aws iam create-policy --policy-name $LB_CONTROLLER_POLICY_NAME --policy-document file:///tmp/AWSLoadBalancerControllerIAMPolicy.json

### 4.2 Create the AWSLoadBalancer Service Account

In [None]:
query = "'Policies[?PolicyName==`" + LB_CONTROLLER_POLICY_NAME + "`].Arn'"
policy_arn_list=!(aws iam list-policies --query $query --output text)

policy_arn=policy_arn_list[0]

!eksctl create iamserviceaccount \
    --approve \
    --override-existing-serviceaccounts \
    --name=aws-load-balancer-controller \
    --namespace=kube-system \
    --cluster=$EKS_CLUSTER_NAME \
    --attach-policy-arn=$policy_arn \
    --region $REGION

### 4.3 Apply Tags to all Subnets in your EKS Cluster (public and private)

In [None]:
filters = "'[{\"Name\":\"vpc-id\", \"Values\": [\"" + VPC_ID + "\"]}, {\"Name\":\"map-public-ip-on-launch\", \"Values\":[\"true\"]}]'"
subnets = !(aws ec2 describe-subnets --filters $filters --output json | jq '.Subnets[]' | jq --raw-output '.SubnetId')
!echo "Applying the kubernetes.io/role/elb tag to the following subnets:"
!echo $subnets
for subnet in subnets:
    !(aws ec2 create-tags --resources $subnet --tags Key=kubernetes.io/role/elb,Value=1)

## 5.0 Namespace Creation


### 5.1 KEDA Controller Namespace Creation

In [None]:
!kubectl create namespace keda

### 5.2 Cert Manager Namespace Creation

In [None]:
!kubectl create namespace cert-manager

### 5.3 Endpoint creation

In [None]:
from IPython.utils.capture import capture_output

with capture_output() as c:
    !(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].Associations[].RouteTableId' | jq 'unique' | jq -r 'join (" ")')
ROUTE_TABLE_IDS = c.stdout.strip()

SERVICE_NAME="com.amazonaws." + REGION + ".s3"

In [None]:
# If an S3 VPC endpoint was already created (for example, via CloudFormation), remove it
import time
import subprocess
from IPython.utils.capture import capture_output

# Check for an existing endpoint that has not already been deleted
with capture_output() as c:
    !(aws ec2 describe-vpc-endpoints \
        --region "$REGION" \
        --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=$SERVICE_NAME" \
        --query 'VpcEndpoints[?State!=`deleted`].[VpcEndpointId]' \
        --output text)
EXISTING_ENDPOINT = c.stdout.strip()

# Delete the endpoint only if one was found and it is the S3 VPC endpoint
if EXISTING_ENDPOINT and EXISTING_ENDPOINT != "None" and "s3" in SERVICE_NAME.lower():
    print("Found existing endpoint, showing details:")
    !(aws ec2 describe-vpc-endpoints --vpc-endpoint-ids "$EXISTING_ENDPOINT" --region "$REGION")

    # Delete the VPC endpoint
    !aws ec2 delete-vpc-endpoints --vpc-endpoint-ids {EXISTING_ENDPOINT} --region {REGION}

    # Wait until the endpoint is fully deleted
    print("Waiting for VPC endpoint to be deleted...")
    SLEEP_INTERVAL = 5
    while True:
        result = subprocess.run(
            ["aws", "ec2", "describe-vpc-endpoints", "--vpc-endpoint-ids", EXISTING_ENDPOINT, "--region", REGION],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            break
        print(".", end="", flush=True)
        time.sleep(SLEEP_INTERVAL)
    print(f"VPC endpoint '{EXISTING_ENDPOINT}' has been successfully deleted.")
else:
    print("No existing endpoint found.")

In [None]:
!(aws ec2 create-vpc-endpoint --vpc-id $VPC_ID --vpc-endpoint-type "Gateway" --service-name $SERVICE_NAME --route-table-ids $ROUTE_TABLE_IDS --region $REGION)

## 6.0 S3 CSI Driver Installation

Set up the role for Mountpoint for Amazon S3.

In [None]:
%%bash -s "$S3_MOUNT_ACCESS_POLICY_NAME"

cat <<EOF> /tmp/s3-role-policy.json
{
   "Version": "2012-10-17",
   "Statement": [
        {
            "Sid": "MountpointFullBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                    "arn:aws:s3:::*",
                    "arn:aws:s3:::*/*"
            ]
        },
        {
            "Sid": "MountpointFullObjectAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:DeleteObject"
            ],
            "Resource": [
                    "arn:aws:s3:::*",
                    "arn:aws:s3:::*/*"
            ]
        }
   ]
}
EOF

aws iam create-policy --policy-name $1 --policy-document file:///tmp/s3-role-policy.json


In [None]:
query = "'Policies[?PolicyName==`" + S3_MOUNT_ACCESS_POLICY_NAME + "`].Arn'"
# Request text output so the ARN is the first line, with no JSON brackets or quotes
S3_MOUNT_POLICY_ARN_LIST=!(aws iam list-policies --query $query --output text)
S3_MOUNT_POLICY_ARN = S3_MOUNT_POLICY_ARN_LIST[0].strip()

print(S3_MOUNT_POLICY_ARN)

In [None]:
!eksctl create iamserviceaccount \
    --name s3-csi-driver-sa \
    --override-existing-serviceaccounts \
    --namespace kube-system \
    --cluster $EKS_CLUSTER_NAME \
    --attach-policy-arn $S3_MOUNT_POLICY_ARN \
    --approve \
    --role-name $S3_CSI_ROLE_NAME \
    --region $REGION

In [None]:
!kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite

### 6.0.1 KEDA Operator Role Creation

In [None]:
OIDC_ID = !aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5
OIDC_ID = OIDC_ID[0]
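
The `cut -d '/' -f 5` step above takes the fifth `/`-separated field of the issuer URL. The same extraction in plain Python is sketched below; the helper name `oidc_id_from_issuer` and the sample issuer URL are illustrative assumptions.

```python
def oidc_id_from_issuer(issuer_url: str) -> str:
    """Return the OIDC provider ID, i.e. the fifth '/'-separated field."""
    # "https://host/id/<ID>" splits into ["https:", "", "host", "id", "<ID>"].
    fields = issuer_url.split("/")
    if len(fields) < 5 or not fields[4]:
        raise ValueError(f"unexpected issuer URL: {issuer_url!r}")
    return fields[4]


# Made-up issuer URL for illustration.
print(oidc_id_from_issuer(
    "https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
))
```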

In [None]:
%%bash -s "$REGION" "$OIDC_ID" "$ACCOUNT_ID" "$KEDA_OPERATOR_POLICY_NAME" "$KEDA_OPERATOR_ROLE_NAME"

REGION=$1
OIDC_ID=$2
ACCOUNT_ID=$3
KEDA_OPERATOR_POLICY_NAME=$4
KEDA_OPERATOR_ROLE_NAME=$5

# Create trust policy
cat <<EOF > /tmp/keda-trust-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator",
                    "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}
EOF
 
# Create permissions policy
cat <<EOF > /tmp/keda-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "aps:QueryMetrics",
                "aps:GetLabels",
                "aps:GetSeries",
                "aps:GetMetricMetadata"
            ],
            "Resource": "*"
        }
    ]
}
EOF
 
# Create the role
aws iam create-role \
    --role-name $KEDA_OPERATOR_ROLE_NAME \
    --assume-role-policy-document file:///tmp/keda-trust-policy.json
 
# Create the policy
POLICY_ARN=$(aws iam create-policy \
    --policy-name $KEDA_OPERATOR_POLICY_NAME \
    --policy-document file:///tmp/keda-policy.json \
    --query 'Policy.Arn' \
    --output text)
 
# Attach the policy to the role
aws iam attach-role-policy \
    --role-name $KEDA_OPERATOR_ROLE_NAME \
    --policy-arn $POLICY_ARN
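
The trust policy written by the heredoc above follows the standard IAM Roles for Service Accounts pattern. As a sketch, the same document can be generated in Python, which avoids shell quoting problems and makes the values easy to inspect. The function name `irsa_trust_policy` and the sample account, Region, and OIDC ID are assumptions for illustration.

```python
import json


def irsa_trust_policy(account_id: str, region: str, oidc_id: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy that lets one service account assume the role."""
    provider = f"oidc.eks.{region}.amazonaws.com/id/{oidc_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringLike": {
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Sample values; the namespace and service account match the cell above.
policy = irsa_trust_policy("111122223333", "us-west-2", "EXAMPLEID",
                           "kube-system", "keda-operator")
print(json.dumps(policy, indent=4))
```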

In [None]:
%%bash -s $REGION $PRESIGNED_URL_ACCESS_POLICY_NAME

cat <<EOF> /tmp/presignedurl-policy.json
{
   "Version": "2012-10-17",
   "Statement": [
        {
            "Sid": "CreatePresignedUrlAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateHubContentPresignedUrls"
            ],
            "Resource": [
                "arn:aws:sagemaker:$1:aws:hub/SageMakerPublicHub", 
                "arn:aws:sagemaker:$1:aws:hub-content/SageMakerPublicHub/*/*" 
            ]
        }
   ]
}
EOF

aws iam create-policy --policy-name "$2" --policy-document file:///tmp/presignedurl-policy.json

## 7.0 Create Execution Role

#### 7.1.2 Import the Hyperpod Inference Access Policy to IAM

In [None]:
!aws iam create-policy --policy-name $HYPERPOD_INFERENCE_ACCESS_POLICY_NAME --policy-document file://inference-resources/hyperpod_inference_operator_policy.json

#### 7.1.3 Create an IAM role with the Hyperpod Inference Access Policy

In [None]:
query = "'Policies[?PolicyName==`" + HYPERPOD_INFERENCE_ACCESS_POLICY_NAME + "`].Arn'"
policy_arn_list=!(aws iam list-policies --query $query --output text)
policy_arn=policy_arn_list[0]

# Create the IAM role
!eksctl create iamserviceaccount --approve --role-only --name=$HYPERPOD_INFERENCE_SA_NAME --namespace=$HYPERPOD_INFERENCE_SA_NAMESPACE --cluster=$EKS_CLUSTER_NAME --attach-policy-arn=$policy_arn --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --region=$REGION

### 7.2 Presigned URL support

#### 7.2.1 Create role for JumpStart gated models (optional; only needed for JumpStart gated models)

In [None]:
JUMPSTART_GATED_ROLE_NAME=f"JumpstartGatedRole-{REGION}-{HP_CLUSTER_NAME}"

In [None]:
%%bash -s "$REGION" "$OIDC_ID" "$ACCOUNT_ID" "$JUMPSTART_GATED_ROLE_NAME" "$PRESIGNED_URL_ACCESS_POLICY_NAME"

REGION=$1
OIDC_ID=$2
ACCOUNT_ID=$3
JUMPSTART_GATED_ROLE_NAME=$4
PRESIGNED_URL_ACCESS_POLICY_NAME=$5

# Create trust policy
cat <<EOF > /tmp/trust-policy.json
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Principal": {
				"Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
			},
			"Action": "sts:AssumeRoleWithWebIdentity",
			"Condition": {
				"StringLike": {
					"oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account",
					"oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
				}
			}
		}
	]
}
EOF

# Create the role using existing trust policy
aws iam create-role \
    --role-name $JUMPSTART_GATED_ROLE_NAME \
    --assume-role-policy-document file:///tmp/trust-policy.json

# Attach the existing PresignedUrlAccessPolicy to the role
aws iam attach-role-policy \
    --role-name $JUMPSTART_GATED_ROLE_NAME \
    --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/$PRESIGNED_URL_ACCESS_POLICY_NAME

In [None]:
JUMPSTART_GATED_ROLE_ARN_LIST= !aws iam get-role --role-name=$JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text
JUMPSTART_GATED_ROLE_ARN = JUMPSTART_GATED_ROLE_ARN_LIST[0]
!echo $JUMPSTART_GATED_ROLE_ARN

#### 7.2.2 Update the Trust Policy

In [None]:
%%bash -s "$HYPERPOD_INFERENCE_ROLE_NAME" "$HYPERPOD_INFERENCE_SA_NAMESPACE" "$HYPERPOD_INFERENCE_SA_NAME"

cat <<EOF> hyperpod_inference_additional_trust_statement.txt
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
EOF

aws iam get-role --role-name=$1 --query "Role"."AssumeRolePolicyDocument" > hyperpod_inference_trust_policy.json
if [[ $(diff hyperpod_inference_additional_trust_statement.txt hyperpod_inference_trust_policy.json | grep '^<' -) ]]; then
    echo Trust policy missing some permissions. Updating with the permissions above.
    { head -n3 hyperpod_inference_trust_policy.json && cat hyperpod_inference_additional_trust_statement.txt && tail -n+4 hyperpod_inference_trust_policy.json; } > /tmp/updated_hyperpod_inference_trust_policy.json
    # Also allow any service account to AssumeRoleWithWebIdentity
    sed -i '0,/"StringEquals": {/{s//"StringLike": {/}' /tmp/updated_hyperpod_inference_trust_policy.json
    sed -i "0,/:sub\": \"system:serviceaccount:${2}:${3}\"/{s//:sub\": \"system:serviceaccount:*:*\"/}" /tmp/updated_hyperpod_inference_trust_policy.json
    aws iam update-assume-role-policy --role-name=$1 --policy-document=file:///tmp/updated_hyperpod_inference_trust_policy.json
else
    echo Trust policy contains all necessary permissions.
fi
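
The cell above edits the trust policy with `head`, `tail`, and `sed`, which depends on the exact line layout of the JSON. A less layout-sensitive alternative is to edit the parsed document. The sketch below is not the notebook's method: it applies the same two changes (adding the SageMaker service statement, and relaxing the service account condition to `StringLike` with a wildcard subject) to a policy loaded as a dict. Unlike the `sed` commands, it rewrites every `StringEquals` condition it finds rather than only the first. The function name `update_trust_policy` is hypothetical.

```python
import copy

SAGEMAKER_STATEMENT = {
    "Effect": "Allow",
    "Principal": {"Service": "sagemaker.amazonaws.com"},
    "Action": "sts:AssumeRole",
}


def update_trust_policy(policy: dict) -> dict:
    """Return a copy of the trust policy with the two changes applied."""
    updated = copy.deepcopy(policy)
    statements = updated["Statement"]
    # Add the SageMaker service principal only if it is not already present.
    if SAGEMAKER_STATEMENT not in statements:
        statements.insert(0, copy.deepcopy(SAGEMAKER_STATEMENT))
    for stmt in statements:
        cond = stmt.get("Condition", {})
        if "StringEquals" in cond:
            string_like = cond.setdefault("StringLike", {})
            for key, value in cond.pop("StringEquals").items():
                # Allow any service account in any namespace to assume the role.
                string_like[key] = (
                    "system:serviceaccount:*:*" if key.endswith(":sub") else value
                )
    return updated
```

The result could be written out with `json.dump` and passed to `aws iam update-assume-role-policy`, as in the cell above.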

### 7.4 Get role ARNs to be used during installation of the operator

In [None]:
HYPERPOD_INFERENCE_ROLE_ARN_list= !(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
HYPERPOD_INFERENCE_ROLE_ARN = HYPERPOD_INFERENCE_ROLE_ARN_list[0]
!echo $HYPERPOD_INFERENCE_ROLE_ARN

In [None]:
S3_CSI_ROLE_ARN_LIST= !aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text
S3_CSI_ROLE_ARN = S3_CSI_ROLE_ARN_LIST[0]
!echo $S3_CSI_ROLE_ARN

## 8.0 Install Operator

In [None]:
HELM_CMD="helm install hyperpod-inference-operator ./sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator \
     -n kube-system \
     --set region=" + REGION + " \
     --set eksClusterName=" + EKS_CLUSTER_NAME + " \
     --set hyperpodClusterArn=" + HP_CLUSTER_ARN + " \
     --set executionRoleArn=" + HYPERPOD_INFERENCE_ROLE_ARN + " \
     --set s3.serviceAccountRoleArn=" + S3_CSI_ROLE_ARN + " \
     --set s3.node.serviceAccount.create=false \
     --set keda.podIdentity.aws.irsa.roleArn=\"arn:aws:iam::" + ACCOUNT_ID + ":role/" + KEDA_OPERATOR_ROLE_NAME + "\" \
     --set tlsCertificateS3Bucket=" + TLS_BUCKET_NAME + " \
     --set alb.region=" + REGION + " \
     --set alb.clusterName=" + EKS_CLUSTER_NAME + " \
     --set alb.vpcId=" + VPC_ID + " \
     --set jumpstartGatedModelDownloadRoleArn=" + JUMPSTART_GATED_ROLE_ARN


In [None]:
!echo $HELM_CMD

In [None]:
!helm dependencies update sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator

In [None]:
!eval $HELM_CMD
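
Building the Helm command by string concatenation and running it through `eval` is sensitive to quoting. One alternative, sketched below, is to assemble the command as an argument list from a dictionary of `--set` values. The helper name `helm_install_args` is hypothetical, and the sample values are placeholders that cover only a subset of the settings used above.

```python
import shlex


def helm_install_args(release: str, chart_path: str, namespace: str, values: dict) -> list:
    """Build a `helm install` argument list with one --set flag per value."""
    args = ["helm", "install", release, chart_path, "-n", namespace]
    for key, value in values.items():
        args += ["--set", f"{key}={value}"]
    return args


# Placeholder values for illustration; use the notebook variables in practice.
args = helm_install_args(
    "hyperpod-inference-operator",
    "./sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator",
    "kube-system",
    {
        "region": "us-west-2",
        "eksClusterName": "my-eks-cluster",
        "s3.node.serviceAccount.create": "false",
    },
)
print(shlex.join(args))
```

The list can be executed with `subprocess.run(args, check=True)`, which bypasses the shell entirely.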

## 9.0 FSx CSI Service Account Update

In [None]:
!eksctl create iamserviceaccount \
    --name fsx-csi-controller-sa \
    --override-existing-serviceaccounts \
    --namespace kube-system \
    --cluster $EKS_CLUSTER_NAME \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
    --role-name $FSX_CSI_ROLE_NAME \
    --approve \
    --region $REGION

## 10.0 KV Cache and Intelligent Routing Deployment

This section provides a complete example of deploying an inference endpoint with KV caching and intelligent routing capabilities.

### 10.1 Prerequisites For Endpoint Deployment

Before proceeding with the deployment, ensure your environment meets the following requirements:

- **AWS CLI**: Configured with appropriate IAM permissions for EKS, S3, and SageMaker operations
- **kubectl**: Properly configured to communicate with your target EKS cluster
- **Helm**: Helm installed for managing Kubernetes applications
- **Repository Access**: Clone the SageMaker HyperPod CLI repository from GitHub
- **Model Access**: Valid Hugging Face token for accessing gated models 

Verify your cluster has sufficient instances.

a) Begin by cloning the repository, which contains the inference operator Helm chart.

In [None]:
!git clone https://github.com/aws/sagemaker-hyperpod-cli.git

b) The following command upgrades the Helm chart with existing values while preserving your current configuration.

In [None]:
!helm upgrade hyperpod-inference-operator ./sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator --reuse-values --namespace kube-system

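c) Delete the existing inference operator pod so that Kubernetes starts a new one with the updated Helm chart. `kubectl` does not expand `*` in pod names, so the command below finds the pod by its name prefix.

In [None]:
!kubectl get pods -n hyperpod-inference-system -o name | grep hyperpod-inference-operator-controller-manager | xargs kubectl delete -n hyperpod-inference-system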

### 10.2 FSx Model Upload

This example uses Llama-3.1-8B-Instruct. Obtain the model from Hugging Face (authentication may be required for gated models)
and ensure the model directory structure includes all necessary files.


#### 10.2.1 Create FSx volume [Optional if you already have an FSx file system with the same VPC, security group, and subnet ID as the cluster and want to use it]

##### 10.2.1.1 Initialize the subnet ID and security group for FSx. These should be the same as those of the HyperPod cluster.

In [None]:
SUBNET_ID = "subnet-0c2b951752718e794"      # replace with your subnet ID

SECURITY_GROUP_ID = "sg-08485795c0105f0d4"  # replace with your security group ID

In [None]:
# Configuration
CONFIG = {
    'SUBNET_ID': SUBNET_ID,
    'SECURITY_GROUP_ID': SECURITY_GROUP_ID,
    'STORAGE_CAPACITY': 1200,
    'DEPLOYMENT_TYPE': 'PERSISTENT_2',
    'THROUGHPUT': 250,
    'COMPRESSION_TYPE': 'LZ4',
    'LUSTRE_VERSION': '2.15'
}

In [None]:
import time

import boto3

# Create FSx client
fsx = boto3.client('fsx')

# Create FSx for Lustre file system
response = fsx.create_file_system(
    FileSystemType='LUSTRE',
    FileSystemTypeVersion=CONFIG['LUSTRE_VERSION'],
    StorageCapacity=CONFIG['STORAGE_CAPACITY'],
    SubnetIds=[CONFIG['SUBNET_ID']],
    SecurityGroupIds=[CONFIG['SECURITY_GROUP_ID']],
    LustreConfiguration={
        'DeploymentType': CONFIG['DEPLOYMENT_TYPE'],
        'PerUnitStorageThroughput': CONFIG['THROUGHPUT'],
        'DataCompressionType': CONFIG['COMPRESSION_TYPE'],
    }
)

# Get the file system ID
file_system_id = response['FileSystem']['FileSystemId']

print(f"Creating FSx filesystem with ID: {file_system_id}")
print(f"In subnet: {CONFIG['SUBNET_ID']}")
print(f"With security group: {CONFIG['SECURITY_GROUP_ID']}")

# Wait for the file system to become available
while True:
    response = fsx.describe_file_systems(FileSystemIds=[file_system_id])
    status = response['FileSystems'][0]['Lifecycle']
    if status == 'AVAILABLE':
        break
    print(f"Waiting for file system to become available... Current status: {status}")
    time.sleep(30)

dns_name = response['FileSystems'][0]['DNSName']
mount_name = response['FileSystems'][0]['LustreConfiguration']['MountName']

# Print the file system details
print("\nFile System Details:")
print(f"File System ID: {file_system_id}")
print(f"DNS Name: {dns_name}")
print(f"Mount Name: {mount_name}")

#### 10.2.2 Mount FSx and copy data from S3 to FSx [Optional if your model already exists on FSx]

NOTE: If you are using your own FSx file system rather than the one created in the previous step, replace the values of `file_system_id`, `dns_name`, and `mount_name` below with the values for your file system.

In [None]:
file_system_id = response['FileSystems'][0]['FileSystemId']
dns_name = response['FileSystems'][0]['DNSName']
mount_name = response['FileSystems'][0]['LustreConfiguration']['MountName']

print(f"File System ID: {file_system_id}")
print(f"DNS Name: {dns_name}")
print(f"Mount Name: {mount_name}")

In [None]:
# FSx file system details
mount_point = f'/mnt/fsx_{file_system_id}'  # Produces a path like /mnt/fsx_fs-12345678

print(f"Creating mount point at: {mount_point}")

# Create mount directory if it doesn't exist
!sudo mkdir -p {mount_point}

# Mount the FSx Lustre file system
mount_command = f"sudo mount -t lustre {dns_name}@tcp:/{mount_name} {mount_point}"
!{mount_command}

# Verify the mount
!df -h | grep fsx

print(f"File system mounted at {mount_point}")

In [None]:
!sudo chmod 777 {mount_point}

In [None]:
# Replace <model-location-on-s3> with the S3 URI of your model
!aws s3 cp <model-location-on-s3> {mount_point} --recursive

In [None]:
!ls $mount_point

In [None]:
!ls /

In [None]:
!sudo umount {mount_point}

In [None]:
!sudo rm -rf {mount_point}


After the model upload is complete, proceed with endpoint creation.

### 10.3 Create an Endpoint
Specify the name and namespace of the endpoint in the metadata section.
Specify your FSx `fileSystemId`, `dnsName`, `mountName`, and `region` based on the model location.

In [None]:
%%writefile endpoint-config.yaml
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: <Your deployment name>
  namespace: <Your namespace>
spec:
  modelName: <include-your-model-name>   #include model name 
  instanceType: ml.g5.24xlarge            #include desired instance type
  invocationEndpoint: v1/chat/completions
  replicas: 1
  modelSourceConfig:
    modelSourceType: fsx
    fsxStorage:
      fileSystemId: fs-12345678               # Replace with your FSx file system ID
      dnsName: fs-12345678.fsx.us-west-2.amazonaws.com  # Replace with your FSx DNS name
      mountName: abcdefgh                   # Replace with your FSx mount name
      region: us-west-2                    # Replace with your region
    modelLocation: <include-your-model-name>    # Path within FSx where model is stored
    prefetchEnabled: false
  kvCacheSpec:
    enableL1Cache: true
    enableL2Cache: true
    l2CacheSpec:
      l2CacheBackend: redis
      l2CacheLocalUrl: "${REDIS_URL}"  # include your Redis URL
  intelligentRoutingSpec:
    enabled: true
    routingStrategy: prefixaware
  tlsConfig:
    tlsCertificateOutputS3Uri: <your-S3-Uri>    #include your s3 URI
  metrics:
    enabled: true
    modelMetrics:
      port: 8000
  loadBalancer:
    healthCheckPath: /health
  worker:
    resources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "4"
    image: lmcache/vllm-openai:latest
    args:
      - "/opt/ml/model"
      - "--max-model-len"
      - "20000"
      - "--tensor-parallel-size"
      - "4"
    modelInvocationPort:
      containerPort: 8000
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables:
      - name: PYTHONHASHSEED
        value: "123"
      - name: OPTION_ROLLING_BATCH
        value: "vllm"
      - name: SAGEMAKER_SUBMIT_DIRECTORY
        value: "/opt/ml/model/code"
      - name: MODEL_CACHE_ROOT
        value: "/opt/ml/model"
      - name: SAGEMAKER_MODEL_SERVER_WORKERS
        value: "1"
      - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
        value: "3600"
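
In the configuration above, the worker requests 4 GPUs and passes `--tensor-parallel-size 4` to the model server. When a single model replica is meant to span all GPUs in the pod, these two values are expected to agree. The sketch below checks that on a dict shaped like the `worker` section; the helper name `tensor_parallel_matches_gpus` is hypothetical, and the check is a convenience, not something the operator is known to enforce.

```python
def tensor_parallel_matches_gpus(worker: dict) -> bool:
    """Check that --tensor-parallel-size equals the GPU limit of the worker."""
    args = worker["args"]
    # The value follows the flag as the next list element, as in the YAML above.
    tp_size = int(args[args.index("--tensor-parallel-size") + 1])
    gpu_limit = int(worker["resources"]["limits"]["nvidia.com/gpu"])
    return tp_size == gpu_limit


# Values copied from the worker section of the example configuration.
worker = {
    "resources": {"limits": {"nvidia.com/gpu": "4"}},
    "args": ["/opt/ml/model", "--max-model-len", "20000",
             "--tensor-parallel-size", "4"],
}
print(tensor_parallel_matches_gpus(worker))
```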

### 10.3.1 Enable Intelligent Routing
To enable intelligent routing, set `enabled` under `intelligentRoutingSpec` to `true`; set it to `false` to disable it.

In [None]:
intelligentRoutingSpec:
    enabled: true  #CHANGE HERE
    routingStrategy: prefixaware

#### 10.3.1.1 Change the Routing Strategy

The intelligent routing system supports four strategies that can be configured based on your requirements: "prefixaware", "kvaware", "session", and "roundrobin".

* **Prefix-aware routing (default)**: Maintains a tree structure to track which prefixes are cached on which endpoints, making it effective for workloads with common prompt prefixes. While it provides good general-purpose performance, it cannot detect when cache entries are evicted from workers, potentially leading to suboptimal routing decisions.
* **Round-robin routing**: The most straightforward approach, which distributes requests evenly across all available workers. While simple to implement, it doesn't optimize for cache reuse and is best suited for scenarios where requests are independent and cache sharing isn't critical.
* **Session-based routing**: Ensures requests from the same session are consistently routed to the same worker, making it ideal for maintaining conversation context in chatbot applications. However, it limits cache sharing opportunities across different sessions, potentially leading to redundant cache entries across workers.
* **KV-aware routing**: Offers the most sophisticated cache management by using a centralized controller to track cache locations and handle cache eviction events. However, it requires tokenizing prompts in the router and currently only supports Hugging Face models. This strategy is best when precise cache control is needed and these limitations are acceptable.

Select the strategy with the "routingStrategy" parameter.

In [None]:
intelligentRoutingSpec:
    enabled: true  
    routingStrategy: prefixaware  #CHANGE HERE

For "session" and "kvaware" routing strategies, you can customize the session identifier by adding a "SESSION_KEY" environment variable to your endpoint configuration.


In [None]:
environmentVariables:
  - name: SESSION_KEY           
    value: "user_id"
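
To make the differences between the routing strategies concrete, the following toy routers illustrate three of them. This is a conceptual sketch only, not HyperPod's implementation: the class names are hypothetical, the prefix-aware router uses a plain list of previously seen prompts instead of a tree, and KV-aware routing is omitted because it depends on a cache controller and tokenizer.

```python
import hashlib
from itertools import cycle


class RoundRobinRouter:
    """Hand out workers in a fixed rotation, ignoring request content."""
    def __init__(self, workers):
        self._workers = cycle(list(workers))

    def route(self, request):
        return next(self._workers)


class SessionRouter:
    """Send every request with the same session value to the same worker."""
    def __init__(self, workers, session_key="user_id"):
        self.workers = list(workers)
        self.session_key = session_key

    def route(self, request):
        # A stable hash keeps the mapping identical across processes.
        digest = hashlib.sha256(str(request[self.session_key]).encode()).digest()
        return self.workers[int.from_bytes(digest[:8], "big") % len(self.workers)]


class PrefixAwareRouter:
    """Prefer the worker that has already seen the longest matching prefix."""
    def __init__(self, workers):
        self.seen = {w: [] for w in workers}
        self._fallback = cycle(list(workers))

    @staticmethod
    def _common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def route(self, request):
        prompt = request["prompt"]
        best, best_len = None, 0
        for worker, prompts in self.seen.items():
            for old in prompts:
                length = self._common_prefix_len(prompt, old)
                if length > best_len:
                    best, best_len = worker, length
        if best is None:  # no shared prefix anywhere: fall back to rotation
            best = next(self._fallback)
        self.seen[best].append(prompt)
        return best
```

Note that the prefix-aware sketch never forgets a prompt, which mirrors the limitation described above: it cannot tell when a worker has evicted the corresponding cache entries.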

### 10.3.2 Enable L1 Cache
**L1 Cache**: Local CPU memory cache on each node for fastest access to recently computed key-value pairs.

To enable the L1 cache, set `enableL1Cache` under `kvCacheSpec` to `true`; set it to `false` to disable it.

In [None]:
kvCacheSpec:
    enableL1Cache: true  #CHANGE HERE

### 10.3.3 Enable L2 Cache


**L2 Cache**: Remote storage like Redis that enables offloading intermediate states across multiple nodes for shared access to key-value pairs.
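
The relationship between the two tiers, a per-node L1 cache in front of a shared L2 store, can be illustrated with a small lookup sketch. This is purely conceptual and is not how HyperPod or the serving container implements caching: the class name `TwoTierCache` is hypothetical, and an ordinary dict stands in for the remote Redis store.

```python
class TwoTierCache:
    """Toy two-tier cache: private L1 dict in front of a shared L2 store."""
    def __init__(self, l2_store: dict):
        self.l1 = {}         # local to this node
        self.l2 = l2_store   # shared by every node (stands in for Redis)

    def get(self, key):
        if key in self.l1:
            return self.l1[key], "l1"
        if key in self.l2:
            # Promote to L1 so the next lookup on this node is local.
            self.l1[key] = self.l2[key]
            return self.l1[key], "l2"
        return None, "miss"

    def put(self, key, value):
        self.l1[key] = value
        self.l2[key] = value


shared = {}
node_a, node_b = TwoTierCache(shared), TwoTierCache(shared)
node_a.put("prompt-prefix", "kv-blocks")
print(node_b.get("prompt-prefix"))  # served from the shared tier
print(node_b.get("prompt-prefix"))  # now served locally
```

The point of the shared tier is visible in the example: a value computed on one node can be reused by another node without recomputation.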

To enable L2 Cache functionality, ensure the Redis service is running and accessible on the configured port.
 
**Important:** The Redis cluster must be deployed within the same VPC as your HyperPod cluster to ensure network connectivity and maintain data encryption in transit.

#### 10.3.3.1 redis.yaml

In [None]:
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
data:
  redis.conf: |
    # --- Memory ---
    maxmemory 75gb
    maxmemory-policy allkeys-lfu
    # Pure cache mode
    appendonly no
    save ""

    # --- Concurrency / networking ---
    io-threads 8
    io-threads-do-reads yes
    tcp-backlog 65535
    tcp-keepalive 300
    timeout 0

    # Protect server at high fan-in (slow consumers, giant replies)
    client-output-buffer-limit normal 128mb 64mb 60
    client-output-buffer-limit replica 256mb 64mb 60
    client-output-buffer-limit pubsub 32mb 8mb 60
    client-query-buffer-limit 256mb
    proto-max-bulk-len 256mb

    # Allow many clients (match to file-descriptor ulimit below)
    maxclients 200000

    # Misc latency hygiene
    dynamic-hz yes
    hz 100


---
# ClusterIP Service that exposes Redis inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis
  ports:
  - name: redis
    port: 6379
    targetPort: 6379
  type: ClusterIP

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-headless
  replicas: 1
  selector:
    matchLabels: { app: redis }
  template:
    metadata:
      labels: { app: redis }
      annotations:
        # Skip service meshes/proxies if present
        sidecar.istio.io/inject: "false"
    spec:
      # Fastest path (optional). Remove these two lines if you don't want host networking.
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet

      # Keep Redis on beefy, quiet nodes
      nodeSelector:
        node.kubernetes.io/instance-type: ml.m5.24xlarge

      # Prefer co-location policy (tweak to your topology/keyspace)
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: redis

      containers:
      - name: redis
        image: redis:7.2-alpine
        imagePullPolicy: IfNotPresent
        args: ["redis-server", "/usr/local/etc/redis/redis.conf"]
        env:
          - name: PYTHONHASHSEED
            value: "123"
        ports:
        - containerPort: 6379
          name: redis
        resources:
          # Reserve real cores; **omit CPU limits** to prevent throttling
          requests:
            cpu: "32"
            memory: "90Gi"
          # Remove limits entirely for best latency, or keep a high memory limit if you must
          # limits:
          #   memory: "100Gi"
        readinessProbe:
          tcpSocket: { port: 6379 }
          initialDelaySeconds: 3
          periodSeconds: 3
        livenessProbe:
          tcpSocket: { port: 6379 }
          initialDelaySeconds: 10
          periodSeconds: 10
        volumeMounts:
        - name: cfg
          mountPath: /usr/local/etc/redis
      volumes:
      - name: cfg
        configMap:
          name: redis-config

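Apply the above file with `kubectl apply`, using the namespace that your `l2CacheLocalUrl` refers to (`redis-system` in the example below). Once the Redis pod is running, you can enable the L2 cache.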

#### 10.3.3.2 Enable L2 Cache Field 

In [None]:
enableL2Cache: true     #Change here
l2CacheSpec:
    l2CacheBackend: redis
    l2CacheLocalUrl: redis://redis.redis-system.svc.cluster.local:6379

The `l2CacheLocalUrl` value above should follow this format:

In [None]:
redis://<redis-service-name>.<namespace>.svc.cluster.local:6379
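
As a small sketch, the URL can be assembled from the Service name and namespace so that it always matches the Kubernetes in-cluster DNS pattern, including the `redis://` scheme used in the example value above. The helper name `l2_cache_url` is hypothetical.

```python
def l2_cache_url(service: str, namespace: str, port: int = 6379) -> str:
    """Build the in-cluster Redis URL for the l2CacheLocalUrl field."""
    return f"redis://{service}.{namespace}.svc.cluster.local:{port}"


# Service name and namespace taken from the example configuration above.
print(l2_cache_url("redis", "redis-system"))
```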

### 10.4 Apply the File
To apply the above file, use the following command

In [None]:
!kubectl apply -f <filename>.yaml

### 10.5 View the Pods

In [None]:
!kubectl get inferenceendpointconfig -n <yournamespace>  # List all inference endpoints
!kubectl describe inferenceendpointconfig <deployment-name> -n <yournamespace>  # Show detailed deployment info

When the pod status shows "Running", the deployment is complete and ready to serve inference requests.

### 10.6 Invoke the Endpoint

In [None]:
import boto3
import json

runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")  # use the Region of your endpoint

payload = {
    "model": "/opt/ml/model",
    "messages": [
        {
            "role": "user",
            "content": "What is machine learning?"
        }
    ],
    "max_tokens": 50,
    "temperature": 0.0,
    "user_id": "session_123"  # Add this field for "session" and "kvaware" routing
}
response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",           #use your endpoint name
    ContentType="application/json",
    Body=json.dumps(payload)
)

print(response["Body"].read().decode())
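
When you send many requests, it can help to factor payload construction and response parsing into small helpers. The sketch below assumes the endpoint returns an OpenAI-style chat completion body (which is what the `v1/chat/completions` invocation path suggests) and that the session field name matches your `SESSION_KEY` setting. The helper names `build_chat_payload` and `extract_text` are hypothetical.

```python
import json


def build_chat_payload(prompt, session_id=None, session_key="user_id",
                       max_tokens=50, temperature=0.0):
    """Build a chat completion request, optionally tagged with a session ID."""
    payload = {
        "model": "/opt/ml/model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    # Only needed for the "session" and "kvaware" routing strategies.
    if session_id is not None:
        payload[session_key] = session_id
    return payload


def extract_text(response_body: str) -> str:
    """Pull the assistant message out of an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


body = json.dumps(build_chat_payload("What is machine learning?", "session_123"))
print(body)
```

The resulting `body` string can be passed as `Body=` to `invoke_endpoint`, and `extract_text` applied to the decoded response.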

## 11.0 Uninstallation

In [None]:
!helm uninstall hyperpod-inference-operator -n kube-system

## 12.0 Conclusion

KV Cache and Intelligent Routing in SageMaker HyperPod Model Deployment help you optimize LLM inference performance and costs through efficient memory management and smart request routing. You can get started today by adding these configurations to your HyperPod model deployments in all AWS Regions where SageMaker HyperPod is available.