# Deployment

Let's deploy the 8-bit quantized Foundation AI Foundation-Sec-8B-Instruct model to Amazon SageMaker AI. <br>
Once deployed, the model can be used for inference. Refer to [the inference notebook](./inference.ipynb) for sample code.

As a prerequisite, please launch JupyterLab on SageMaker in your AWS environment. For more details, visit: 
https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html

### Prerequisites

In [None]:
# Install the required packages:
!pip install sagemaker boto3

### Configuration and Initialization

In [None]:
import sagemaker
import os

AWS_REGION = "us-west-2"
os.environ["AWS_DEFAULT_REGION"] = AWS_REGION 
ACCOUNT_ID = ""  # Add your AWS account ID here

MODEL_NAME = "fdtn-ai/Foundation-Sec-8B-Instruct-Q8_0-GGUF"
INSTANCE_TYPE = "ml.c6i.4xlarge"    # CPU-based instance
TIMEOUT = 900

In [None]:
role = sagemaker.get_execution_role()

We need to build a custom Docker image to run the GGUF-quantized model on SageMaker. The Dockerfile and the steps to build and push the image are in the [llama_cpp_cpu](../../containers/llama_cpp_cpu) directory.

Once you've published the image, specify it in the cell below.

In [None]:
container_uri = f"{ACCOUNT_ID}.dkr.ecr.{AWS_REGION}.amazonaws.com/fdtn-llama-cpp-cpu:latest"
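Because `ACCOUNT_ID` starts out empty, it is easy to build an invalid image URI by accident. The following is a minimal sketch of a hypothetical helper (`build_ecr_image_uri` is not part of the SageMaker SDK) that assembles the same URI and rejects an account ID that is not 12 digits. It assumes the standard `amazonaws.com` domain, and the account ID shown is a placeholder.

```python
import re


def build_ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return an ECR image URI, rejecting an empty or malformed account ID."""
    # AWS account IDs are 12-digit numbers; an empty string fails this check.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"AWS account ID must be exactly 12 digits, got {account_id!r}")
    # Assumption: standard AWS partition (domain suffix amazonaws.com).
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID, for illustration only.
example_uri = build_ecr_image_uri("123456789012", "us-west-2", "fdtn-llama-cpp-cpu")
print(example_uri)
```

In the notebook, the same call with `ACCOUNT_ID` and `AWS_REGION` would fail fast if the account ID has not been filled in.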

In [None]:
endpoint_name = "Foundation-Sec-8B-Instruct-Quantized-CPU"
print(f"Deploying to endpoint: {endpoint_name}")

### Deploy

You can edit the `env_vars` dictionary to set environment variables as needed. If you are using the image built from the [llama_cpp_cpu](../../containers/llama_cpp_cpu) directory, the available environment variables are listed in its [Dockerfile](../../containers/llama_cpp_cpu/Dockerfile).

In [None]:
# Update other env variables as needed.
# Values must be strings: SageMaker rejects non-string environment values.
env_vars = {
    "HF_MODEL_ID": MODEL_NAME,
    "THREADS": "16",
    "CTX": "8192",
}
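The SageMaker `CreateModel` API accepts container environment values only as strings. The following is a minimal sketch of a hypothetical helper (`stringify_env` is not part of the SageMaker SDK) that normalizes a dictionary before it is passed as `env`. The sample values mirror the ones used in this notebook.

```python
def stringify_env(env: dict) -> dict:
    """Convert every key and value to str so the dict is valid as container env vars."""
    return {str(key): str(value) for key, value in env.items()}


# Sample input with integer values, as one might naturally write them.
raw_env = {
    "HF_MODEL_ID": "fdtn-ai/Foundation-Sec-8B-Instruct-Q8_0-GGUF",
    "THREADS": 16,
    "CTX": 8192,
}
safe_env = stringify_env(raw_env)
print(safe_env)
```

Note that `str()` turns booleans into `"True"` and `"False"`, so check how the container expects flags to be spelled before relying on this for them.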

Create the SageMaker model and deploy it to an endpoint.

In [None]:
model = sagemaker.Model(
    name=endpoint_name,
    image_uri=container_uri,
    role=role,
    env=env_vars,
)

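model.deploy(
    instance_type=INSTANCE_TYPE,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    # Allow up to TIMEOUT seconds for the container to start and pass health checks
    container_startup_health_check_timeout=TIMEOUT,
)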
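After deployment, it can be useful to confirm that the endpoint is actually in service before moving on. The following is a minimal sketch of a hypothetical helper (`get_endpoint_status`) that wraps the boto3 SageMaker client's `describe_endpoint` call. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def get_endpoint_status(sm_client, endpoint_name: str) -> str:
    """Return the EndpointStatus field reported by DescribeEndpoint."""
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


# In the notebook, with AWS credentials available, usage would look like:
#   import boto3
#   sm_client = boto3.client("sagemaker", region_name=AWS_REGION)
#   print(get_endpoint_status(sm_client, endpoint_name))
```

A status of `InService` means the endpoint is ready to accept requests. `Creating` means deployment is still in progress, and `Failed` means it did not succeed.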

You can now refer to the [inference](./inference.ipynb) notebook to perform inference using the created endpoint.