# CodeGen 2.5 7b
In this notebook, we create and deploy a CodeGen2.5-7b model using inference components on the endpoint you created in the first notebook. For this model, we use Faster Transformer through the SageMaker Large Model Inference (LMI) container. This is the second notebook in a series of five that deploy models against that endpoint. The last notebook shows other available APIs and cleans up the artifacts created.

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

---

Tested using the `Python 3 (Data Science)` kernel on SageMaker Studio and `conda_python3` kernel on SageMaker Notebook Instance.

### Install dependencies

Upgrade the SageMaker Python SDK, along with `boto3` and `huggingface_hub`.

In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

### Set configuration

Restore the `endpoint_name` value that the first notebook stored with Jupyter's `%store` magic. If it is not available, `REPLACE` it with the name of the endpoint you created.

In [None]:
%store -r \
endpoint_name

if "endpoint_name" not in locals():
    print("Please specify the endpoint_name before proceeding.")

else:
    print(f"Endpoint name: {endpoint_name}")

We start by creating the objects we will need for our notebook: the boto3 clients used to interact with SageMaker, and other variables that are referenced later in the notebook.

In [None]:
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")

sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # public attribute, equivalent to the private _region_name

In [None]:
role = sagemaker.get_execution_role()
print(f"Role: {role}")

s3_client = boto3.client("s3")

print(f"Demo endpoint name: {endpoint_name}")

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint

model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    "hf-large-model-djl-/codgen25-7b"  # folder within bucket where code artifact will go
)
s3_model_prefix = (
    "hf-large-model-djl-/codgen25-7b"  # folder within bucket where model artifact will go
)

region = sess.boto_region_name
account_id = sess.account_id()

jinja_env = jinja2.Environment()

# define a variable to contain the s3url of the location that has the model
model_artifact = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f" model tar ball will be uploaded to ---- > {model_artifact}")

## Preparing model artifacts and uploading them to S3
The LMI container expects some artifacts to help set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages that need to be installed
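
Before packaging, it can help to confirm that the working directory holds the artifacts listed above. The following is a minimal sketch, not part of the LMI tooling: the helper name `check_lmi_artifacts` is hypothetical, and a temporary directory stands in for the real working directory.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("serving.properties",)            # listed as required above
OPTIONAL_FILES = ("model.py", "requirements.txt")   # listed as optional above


def check_lmi_artifacts(directory):
    """Return (missing_required, present_optional) for an artifact directory."""
    root = Path(directory)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    present = [name for name in OPTIONAL_FILES if (root / name).is_file()]
    return missing, present


# Throwaway directory standing in for the notebook's working directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "serving.properties").write_text("engine = MPI\n")
    (Path(tmp) / "requirements.txt").write_text("tiktoken\n")
    missing, present = check_lmi_artifacts(tmp)
    print("missing required:", missing)
    print("optional present:", present)
```

An empty `missing` list means the required file is in place; the optional list is informational only.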

### For CodeGen 2.5, which is a LLAMA architecture, we need to prepare the artifacts properly before they can be used

#### The directory structure for the model in S3 MUST look like this to match the model.py code

#### Model uploaded to 

fp-16::
s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/1/1-gpu/

Model Prefix must point to --- > s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/



```
model-triton-16B
    |
    |---fastertransformer
            |
            | --- config.pbtxt
            | --- 1
                    |
                    | -  1-gpu
                          |
                          | -- model weights

```
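
The layout above can be expressed as a small path builder, which makes it easier to check that the model prefix and the weights folder line up. This is only an illustrative sketch: the function name `weights_s3_uri` and its `version` and `gpu_count` parameters are assumptions introduced here, not part of the notebook.

```python
def weights_s3_uri(bucket, model_prefix, version=1, gpu_count=1):
    """Build the S3 URI of the weights folder under the layout shown above."""
    prefix = model_prefix.strip("/")
    return f"s3://{bucket}/{prefix}/fastertransformer/{version}/{gpu_count}-gpu/"


# "bucket" is the same placeholder bucket name used in the example paths above
print(weights_s3_uri("bucket", "hf-large-model-djl-/codgen25-7b"))
```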

We start by making sure the directory we will work in locally is clean for our model artifacts

In [None]:
!rm -rf code_codgen25_ft
!mkdir -p code_codgen25_ft

### Prepare LMI container serving.properties file
The LMI container gives you the ability to deploy large models easily. In the serving.properties file you set the options you want for deployment, including the tensor parallel degree, the data type, the quantization strategy, and others. We start by creating this file in our working directory.

In [None]:
%%writefile code_codgen25_ft/serving.properties
engine = MPI
option.tensor_parallel_degree = 1
option.entryPoint = djl_python.huggingface
option.model_id = {{model_id}}
option.trust_remote_code = true
option.dtype = fp16

In [None]:
# Render the {{model_id}} placeholder in serving.properties in place
properties_path = Path("./code_codgen25_ft/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(model_id="Salesforce/codegen2-7B"))
!pygmentize ./code_codgen25_ft/serving.properties
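
To double-check the options before packaging, the rendered file can be read back into a dictionary. The sketch below assumes the simple `key = value` format used in this notebook. The parser `parse_properties` is a hypothetical helper written for illustration and does not cover the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple `key = value` lines, skipping blank lines and `#` comments."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Not a key=value line: {raw_line!r}")
        options[key.strip()] = value.strip()
    return options


# Sample content mirroring some of the options written earlier
sample = """\
engine = MPI
option.tensor_parallel_degree = 1
option.dtype = fp16
"""
options = parse_properties(sample)

# An unrendered Jinja placeholder such as {{model_id}} would show up here
unrendered = [key for key, value in options.items() if "{{" in value]
print(options)
print("unrendered placeholders:", unrendered)
```

In the notebook, the same function could be applied to the text of `code_codgen25_ft/serving.properties`.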

In [None]:
%%writefile ./code_codgen25_ft/requirements.txt
tiktoken
jinja2

We also need to specify the LMI container image that we would like to use.

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

Now that everything is prepared, we create our tarball and upload it to S3 so SageMaker can reference it at deploy time.

In [None]:
%%sh
rm -rf model.tar.gz
tar czvf model.tar.gz code_codgen25_ft/
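
If a shell is not available, a similar archive can be produced with Python's standard `tarfile` module. This is a sketch of an alternative, not what the notebook runs: the helper `make_tarball` is hypothetical, and it uses a temporary directory so it can run on its own.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(source_dir, tar_path):
    """Create a gzip tarball containing source_dir and return its member names."""
    source = Path(source_dir)
    with tarfile.open(str(tar_path), "w:gz") as tar:
        # arcname keeps the top-level folder name, as `tar czvf model.tar.gz dir/` does
        tar.add(str(source), arcname=source.name)
    with tarfile.open(str(tar_path), "r:gz") as tar:
        return sorted(tar.getnames())


# Throwaway copy of the working directory layout
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code_codgen25_ft"
    code_dir.mkdir()
    (code_dir / "serving.properties").write_text("engine = MPI\n")
    members = make_tarball(code_dir, Path(tmp) / "model.tar.gz")
    print(members)
```

Listing the members after writing is a quick way to confirm that `serving.properties` made it into the archive.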

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", model_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

## Creating an inference component to your endpoint
Inference components can reuse a SageMaker model that you may have already created. You also have the option to specify your artifacts and container directly when creating an inference component, which we show below. In this example, we also create a SageMaker model in case you want to reference it later.

In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("codegen25-7b")
print(model_name)

create_model_response = sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
# ensure that the endpoint_name has been set.
print(endpoint_name)

In [None]:
# ensure your endpoint is in service.
import time

resp = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

### Create the Inference component

In [None]:
prefix = sagemaker.utils.unique_name_from_base("CodeGen25-IC")

inference_component_name = f"{prefix}-inference-component"
print(f"Demo inference component name: {inference_component_name}:: endpoint_name={endpoint_name}")
variant_name = "AllTraffic"

We can now create our inference component. Note below that we specify an inference component name. You can use this name to update your inference component or to view its metrics and logs in CloudWatch. You will also want to set `ComputeResourceRequirements`, which tells SageMaker how much of each resource you want to reserve for EACH COPY of your inference component. Finally, we set the number of copies that we want to deploy. The number of copies can be managed through autoscaling policies.

In [None]:
sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "Container": {
            "Image": inference_image_uri,
            "ArtifactUrl": s3_code_artifact,
        },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 300,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
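
Because the resource requirements apply to each copy, the total reservation on the endpoint grows with `CopyCount`. The sketch below rebuilds the same `Specification` keys used in the request above and multiplies them out. The helper names `build_ic_specification` and `total_reserved`, the placeholder image and artifact values, and the positivity check are all assumptions made for illustration.

```python
def build_ic_specification(image_uri, artifact_url, accelerators=1, min_memory_mb=1024):
    """Assemble a Specification dict with the same keys as the request above."""
    if accelerators <= 0 or min_memory_mb <= 0:
        # Sanity check added for this sketch; not an API rule quoted from the docs
        raise ValueError("resource requirements must be positive")
    return {
        "Container": {"Image": image_uri, "ArtifactUrl": artifact_url},
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 300,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": accelerators,
            "MinMemoryRequiredInMb": min_memory_mb,
        },
    }


def total_reserved(specification, copy_count):
    """Multiply the per-copy requirements by the number of copies."""
    req = specification["ComputeResourceRequirements"]
    return {
        "accelerators": req["NumberOfAcceleratorDevicesRequired"] * copy_count,
        "memory_mb": req["MinMemoryRequiredInMb"] * copy_count,
    }


# Placeholder values; in the notebook these come from inference_image_uri and s3_code_artifact
spec = build_ic_specification("<image-uri>", "s3://<bucket>/<prefix>/model.tar.gz")
print(total_reserved(spec, copy_count=3))
```

Running the numbers this way before scaling out shows whether the endpoint's instances have enough accelerators and memory for the desired number of copies.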

In [None]:
import sys

while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
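
The loop above keeps polling until the component reaches `InService` or `Failed`. A variant with an upper time bound is sketched below. The function `wait_for_status` and its parameters are hypothetical, and a fake status sequence replaces the SageMaker call so the example runs offline.

```python
import time


def wait_for_status(describe, terminal=("InService", "Failed"),
                    timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Call describe() until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for describe_inference_component so the sketch needs no AWS access
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)
```

In the notebook, `describe` could be a lambda that calls `sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)` and returns its `"InferenceComponentStatus"` field.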

In [None]:
ic1_name = inference_component_name

In [None]:
# Store the inference component name for notebook 3.
%store \
ic1_name

#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text as a prompt and the model completes it and returns the result.

You can pass a batch of prompts as input to the model. This is done by setting `inputs` to the list of prompts. The model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which are passed to the endpoint under `parameters` as a dictionary of `kwargs`. Refer to this documentation for more details: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig

The code sample below illustrates invoking the endpoint with a prompt and setting some parameters.

In [None]:
%%time

prompts = [
    "def hello_world():"
]

response_model = sagemaker_runtime_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompts,
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
            },
        }
    ),
    ContentType="application/json",
)
# Read and decode the streamed response body
response_model["Body"].read().decode("utf8")
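
To send a batch of prompts, as described above, the request body only needs a longer `inputs` list. The following sketch builds such a body without calling the endpoint. The helper name `build_payload` and the second prompt are assumptions added for illustration.

```python
import json


def build_payload(prompts, **parameters):
    """Serialize prompts and generation parameters into the JSON body format used above."""
    if isinstance(prompts, str):
        prompts = [prompts]  # accept a single prompt for convenience
    prompts = list(prompts)
    if not prompts:
        raise ValueError("at least one prompt is required")
    return json.dumps({"inputs": prompts, "parameters": parameters})


body = build_payload(
    ["def hello_world():", "def fibonacci(n):"],
    max_new_tokens=128,
    do_sample=True,
)
print(body)
```

The resulting string could be passed as `Body` to `invoke_endpoint` in place of the inline `json.dumps` call.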

That's it! You have deployed the CodeGen2.5 model on SageMaker as an inference component. You can move on to the other notebooks to deploy more models or to clean up the artifacts created in this example.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference/generativeai/llm-workshop/lab-inference-components-with-scaling/2a_codegen25_FT_7b.ipynb)