## Deploy JumpStart and Non-JumpStart Models Asynchronously
---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium instance.*

**This step of our solution covers setting up the environment, downloading the requirements needed to run it, and deploying the model endpoints defined in the config.yml file asynchronously.**

1. Prerequisite: open the 0_setup.ipynb notebook and run its cell to download and install the packages listed in requirements.txt.

2. You can then run this notebook to deploy the models asynchronously in separate threads. The key components of this notebook are:

- Loading the globals.py and config.yml files.

- A blocking function, deploy_model, that deploys a single model endpoint, followed by:

- A pair of async functions that deploy the models defined in the config.yml file concurrently, with each deployment running in its own thread.

- Once the endpoints are deployed, their endpoint, endpoint configuration, and model descriptions are saved to an endpoints.json file.
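The last step above writes the boto3 `describe_*` responses to endpoints.json. Those responses contain `datetime` objects (for example `CreationTime`), which `json.dumps` cannot serialize on its own. The minimal sketch below shows why the notebook passes `default=str`. It uses a hypothetical hand-written record in place of a real `describe_endpoint` response, and the timestamp and timezone are illustrative only.

```python
import json
from datetime import datetime, timezone

# Hypothetical record shaped loosely like one entry of `all_info`; a real entry
# holds the full describe_endpoint / describe_endpoint_config / describe_model responses.
record = {
    "experiment_name": "mistral-7b-g5-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0",
    "endpoint": {
        "EndpointName": "lmistral7b-g5-2xlarge-1706561287",
        "EndpointStatus": "InService",
        "CreationTime": datetime(2024, 1, 29, 15, 48, 8, tzinfo=timezone.utc),
    },
}

# Without a fallback, json.dumps raises TypeError on the datetime value.
try:
    json.dumps([record])
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

# default=str converts any non-serializable value (here, the datetime) with str().
json_data = json.dumps([record], indent=2, default=str)
restored = json.loads(json_data)
print(restored[0]["endpoint"]["CreationTime"])  # 2024-01-29 15:48:08+00:00
```

Note that `default=str` is lossy: timestamps come back as plain strings. Any notebook that later reads endpoints.json has to parse them again if it needs real `datetime` values.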


In [37]:
## auto reload all of the changes made in the config/globals.py file 
%load_ext autoreload
%autoreload 2
!touch globals.py

CONFIG_FILE=configs/config-mistral-7b-tgi-g5.yml
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Import all of the necessary libraries below to run this notebook

In [38]:
import os  # used by deploy_model (os.path.join); also re-exported by globals.py
import sys
import time
import json
import boto3
import asyncio
import logging
import pathlib
import importlib.util
from botocore.exceptions import NoCredentialsError
from globals import *
from pathlib import Path
from utils import load_config
from sagemaker import get_execution_role
from utils import write_to_s3 
from typing import Dict, List, Optional
from botocore.exceptions import ClientError

CONFIG_FILE=configs/config-mistral-7b-tgi-g5.yml


#### Pygmentize globals.py to view the globally initialized variables used in this notebook

In [39]:
# global constants
!pygmentize globals.py

[34mimport[39;49;00m [04m[36mos[39;49;00m[37m[39;49;00m
[34mimport[39;49;00m [04m[36myaml[39;49;00m[37m[39;49;00m
[34mfrom[39;49;00m [04m[36menum[39;49;00m [34mimport[39;49;00m Enum[37m[39;49;00m
[34mfrom[39;49;00m [04m[36mpathlib[39;49;00m [34mimport[39;49;00m Path[37m[39;49;00m
[34mimport[39;49;00m [04m[36mboto3[39;49;00m[37m[39;49;00m
[34mfrom[39;49;00m [04m[36mdatetime[39;49;00m [34mimport[39;49;00m datetime[37m[39;49;00m
[37m[39;49;00m
CONFIG_FILEPATH_FILE: [36mstr[39;49;00m = [33m"[39;49;00m[33mconfig_filepath.txt[39;49;00m[33m"[39;49;00m[37m[39;49;00m
[37m[39;49;00m
[37m# S3 client initialization[39;49;00m[37m[39;49;00m
s3_client = boto3.client([33m'[39;49;00m[33ms3[39;49;00m[33m'[39;49;00m)[37m[39;49;00m
[37m[39;49;00m
CONFIG_FILE: [36mstr[39;49;00m = Path(CONFIG_FILEPATH_FILE).read_text()[37m[39;49;00m
[36mprint[39;49;00m([33mf[39;49;00m[33m"[39;49;00m[33mCONFIG_FILE=[39;49;00m[33m{[39;49

#### Set up a logger to log all messages while the code runs

In [4]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

### Load the config.yml file
------

The config.yml file contains information used across this benchmarking environment. This includes details about the AWS account, the prompts and payloads used for invocations, and the model configurations, such as the model version, the endpoint name, and the model_id to deploy. Each model configuration also specifies:

- the instance type to use (for example, "ml.g5.24xlarge")
- the image URI
- whether or not to deploy the model
- the deployment script to run (for example, "jumpstart.py", which this notebook uses to deploy JumpStart models)

The cell below loads the config.yml file and logs its contents. The resulting config dictionary drives the asynchronous endpoint deployments in the rest of this notebook.
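The sketch below is a rough illustration of how the experiment entries drive deployment. It uses a plain Python dict as a stand-in for what `load_config` returns. This is an assumption: the real loader may add or transform fields. The experiment keys mirror those printed in this notebook's deployment log, while the second, disabled entry and the bucket placeholder are hypothetical. The `experiments_to_deploy` helper is also hypothetical; it applies the same `deploy` check that `deploy_model` performs.

```python
from typing import Dict, List

# Hypothetical stand-in for the dict returned by load_config(CONFIG_FILE).
# Keys mirror those shown in this notebook's logs; values are illustrative.
config: Dict = {
    "general": {"name": "mistral-7b-tgi-g5-v1", "model_name": "mistral7b"},
    "aws": {"region": "us-east-1", "bucket": "<your-bucket>"},
    "experiments": [
        {
            "name": "mistral-7b-g5-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0",
            "model_id": "huggingface-llm-mistral-7b",
            "ep_name": "lmistral7b-g5-2xlarge",
            "instance_type": "ml.g5.2xlarge",
            "instance_count": 1,
            "deployment_script": "jumpstart.py",
            "deploy": True,
        },
        {
            # Hypothetical second entry that is switched off.
            "name": "mistral-7b-g5-24xlarge",
            "model_id": "huggingface-llm-mistral-7b",
            "ep_name": "lmistral7b-g5-24xlarge",
            "instance_type": "ml.g5.24xlarge",
            "instance_count": 1,
            "deployment_script": "jumpstart.py",
            "deploy": False,
        },
    ],
}


def experiments_to_deploy(cfg: Dict) -> List[Dict]:
    """Return the experiments whose `deploy` flag is truthy (a missing flag counts as False)."""
    return [e for e in cfg["experiments"] if e.get("deploy", False)]


for exp in experiments_to_deploy(config):
    print(exp["name"], exp["instance_type"], exp["deployment_script"])
```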

In [5]:
## Load the config.yml file referring to the globals.py file
config = load_config(CONFIG_FILE)

## configure the aws region and execution role
aws_region = config['aws']['region']


try:
    sagemaker_execution_role = get_execution_role()
    config['aws']['sagemaker_execution_role'] = sagemaker_execution_role
    logger.info(f"determined SageMaker execution role from get_execution_role")
except Exception as e:
    logger.error(f"could not determine SageMaker execution role, error={e}")
    logger.info(f"going to look for execution role in config file..")
    sagemaker_execution_role = config['aws'].get('sagemaker_execution_role')
    if sagemaker_execution_role is not None:
        logger.info(f"found SageMaker execution role in config file..")

logger.info(f"aws_region={aws_region}, sagemaker_execution_role={sagemaker_execution_role}")
logger.info(f"config={json.dumps(config, indent=2)}")

[2024-01-29 15:48:06,175] p15748 {2048960905.py:13} ERROR - could not determine SageMaker execution role, error=The current AWS identity is not a role: arn:aws:iam::015469603702:user/madhurusertest, therefore it cannot be used as a SageMaker execution role
[2024-01-29 15:48:06,176] p15748 {2048960905.py:14} INFO - going to look for execution role in config file..
[2024-01-29 15:48:06,176] p15748 {2048960905.py:17} INFO - found SageMaker execution role in config file..
[2024-01-29 15:48:06,176] p15748 {2048960905.py:19} INFO - aws_region=us-east-1, sagemaker_execution_role=arn:aws:iam::015469603702:role/service-role/AmazonSageMaker-ExecutionRole-20220504T122644
[2024-01-29 15:48:06,177] p15748 {2048960905.py:20} INFO - config={
  "general": {
    "name": "mistral-7b-tgi-g5-v1",
    "model_name": "mistral7b"
  },
  "aws": {
    "region": "us-east-1",
    "sagemaker_execution_role": "arn:aws:iam::015469603702:role/service-role/AmazonSageMaker-ExecutionRole-20220504T122644",
    "bucket": 

#### Deploy a single model: the blocking function used for asynchronous deployment

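This function deploys a single large language model endpoint. It takes three parameters:

- experiment_config: a dictionary with the deployment configuration for one experiment from the config.yml file.
- aws_region: the AWS region to deploy in.
- role_arn: the Amazon Resource Name of the IAM role used for the deployment.

It loads the script named in the experiment's deployment_script field and calls that script's deploy function. The script is downloaded from S3 if it has been uploaded there; otherwise it is loaded from the local scripts directory.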
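The sketch below isolates the `importlib` loading technique that `deploy_model` uses. It writes a hypothetical stand-in script to a temporary directory, imports it by file path, and calls its `deploy(experiment_config, role_arn)` function. The real jumpstart.py would call the SageMaker SDK instead. The module name, endpoint names, and the placeholder role ARN are all illustrative.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a deployment script such as jumpstart.py.
# It only echoes back names instead of creating a SageMaker endpoint.
SCRIPT_SOURCE = '''
def deploy(experiment_config, role_arn):
    return {"endpoint_name": experiment_config["ep_name"] + "-demo",
            "experiment_name": experiment_config["name"]}
'''


def load_module_from_path(module_name: str, script_path: str):
    """Import a Python file as a module, the same way deploy_model does."""
    spec = importlib.util.spec_from_file_location(module_name, script_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {script_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing, as deploy_model does
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = Path(tmp_dir) / "demo_deploy.py"
    script_path.write_text(SCRIPT_SOURCE)
    module = load_module_from_path(script_path.stem, str(script_path))
    # Placeholder role ARN; no AWS call is made.
    result = module.deploy({"name": "demo-experiment", "ep_name": "demo-ep"},
                           "arn:aws:iam::123456789012:role/demo")

print(result)
```

Loading by path like this lets each experiment choose its own deployment script from config.yml without the notebook importing every script up front.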

In [6]:
# function to deploy a model
def deploy_model(experiment_config: Dict, aws_region: str, role_arn: str) -> Optional[Dict]:
    
    # Log the deployment details
    logger.info(f"going to deploy {experiment_config}, in {aws_region} with {role_arn}")
    model_deployment_result = None
    
    # Check if deployment is enabled in the config; skip if not
    deploy = experiment_config.get('deploy', False)
    if not deploy:
        logger.info(f"skipping deployment of {experiment_config['model_id']} because deploy={deploy}")
        return model_deployment_result
    
    try: 
        # Set up the module name and S3 path
        module_name = Path(experiment_config['deployment_script']).stem
        logger.info(f"script name being used for model deployment --> {module_name}")

        ## Build the S3 key for this model's deployment script (DEPLOYMENT_SCRIPT_S3 comes from the globals.py star import)
        s3_file_path = f"{DEPLOYMENT_SCRIPT_S3}/{module_name}.py"
        logger.info(f"S3 key where the deployment script is expected --> {s3_file_path}")

        # Check whether the script has been uploaded to S3
        try:
            s3_client.head_object(Bucket=config['aws']['bucket'], Key=s3_file_path)
            file_exists_in_s3 = True
        except ClientError:
            file_exists_in_s3 = False

        # Define the local script path where we will install this script and use it to deploy the model
        local_script_path = f"/tmp/{module_name}.py"

        if file_exists_in_s3:
            # Download the script from S3
            s3_client.download_file(config['aws']['bucket'], s3_file_path, local_script_path)
            logger.info(f"Deploying using code from S3: {local_script_path}")
        else:
            # Fall back to local file
            local_script_path = os.path.join(pathlib.Path().absolute().resolve(), SCRIPTS_DIR, f"{module_name}.py")
            logger.info(f"Deploying using local code: {local_script_path}")

        # Load the specific module name and file path
        spec = importlib.util.spec_from_file_location(module_name, local_script_path)
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        spec.loader.exec_module(module)

        # Execute the deployment function from the imported module
        model_deployment_result = module.deploy(experiment_config, role_arn)

        # Return the model deployment result 
        return model_deployment_result

    
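    except ClientError as error:
        # Log AWS API errors (for example, missing permissions) and return None
        logger.error(f"an error occurred while deploying {experiment_config['model_id']}: {error}")
        return model_deployment_result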

### Asynchronous Model Deployment
----

#### async_deploy_model: 

- An asynchronous wrapper around deploy_model. It uses asyncio.to_thread (Python 3.9+) to run the synchronous deploy_model function in a worker thread. The call can then be awaited without blocking the event loop, so several deployments can proceed concurrently.
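Below is a minimal sketch of the `asyncio.to_thread` pattern. It uses a hypothetical `slow_deploy` function that sleeps instead of calling SageMaker. Because each blocking call runs in a worker thread, awaiting several of them with `asyncio.gather` takes roughly as long as the slowest call, not the sum of all calls.

```python
import asyncio
import time
from typing import Dict, Optional


def slow_deploy(experiment_config: Dict, aws_region: str, role_arn: str) -> Optional[Dict]:
    """Hypothetical blocking stand-in for deploy_model: sleeps, then returns a result."""
    time.sleep(0.5)
    return {"endpoint_name": experiment_config["ep_name"], "region": aws_region}


async def async_slow_deploy(experiment_config: Dict, aws_region: str, role_arn: str) -> Optional[Dict]:
    # Run the blocking call in the default thread pool so the event loop stays free.
    return await asyncio.to_thread(slow_deploy, experiment_config, aws_region, role_arn)


async def main() -> list:
    experiments = [{"ep_name": f"ep-{i}"} for i in range(3)]
    # Placeholder role ARN; no AWS call is made.
    return await asyncio.gather(
        *[async_slow_deploy(e, "us-east-1", "arn:aws:iam::123456789012:role/demo") for e in experiments]
    )


start = time.perf_counter()
results = asyncio.run(main())  # inside a Jupyter cell, use `await main()` instead
elapsed = time.perf_counter() - start
print([r["endpoint_name"] for r in results], f"{elapsed:.2f}s")  # roughly 0.5s rather than 1.5s
```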

#### async_deploy_all_models Function: 

- Deploys multiple models concurrently. It splits the experiments into batches of at most n = 4 models, to avoid throttling. The models within a batch are deployed concurrently with asyncio.gather, and each batch finishes before the next one starts.
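The batching expression in `async_deploy_all_models` is a ceiling division. With `n = 4`, ten experiments become batches of 4, 4, and 2. The sketch below shows that split with a hypothetical `split_into_batches` helper that uses the same expression.

```python
from typing import List, TypeVar

T = TypeVar("T")


def split_into_batches(items: List[T], n: int) -> List[List[T]]:
    """Split items into consecutive batches of at most n (same expression as the notebook)."""
    num_batches = (len(items) + n - 1) // n  # ceil(len(items) / n)
    return [items[i * n:(i + 1) * n] for i in range(num_batches)]


batches = split_into_batches(list(range(10)), 4)
print([len(b) for b in batches])  # [4, 4, 2]
```

Because batches run one after another, a single slow endpoint holds up the start of the next batch. A common alternative is to wrap each deployment in an `asyncio.Semaphore(n)`. This also caps concurrency at `n`, but it starts the next deployment as soon as any slot frees up. That is a different technique from the one this notebook uses.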

In [7]:
## Asynchronous wrapper that lets the blocking deploy_model function be awaited, so deployments can run concurrently
async def async_deploy_model(experiment_config: Dict, aws_region: str, role_arn: str) -> Optional[Dict]:
    # Run deploy_model in a worker thread so the event loop is not blocked
    return await asyncio.to_thread(deploy_model, experiment_config, aws_region, role_arn)

## Asynchronous function that deploys all of the models, one batch of concurrent deployments at a time
async def async_deploy_all_models(config: Dict) -> List[Optional[Dict]]:
    
    ## Extract experiments from the config.yml file (contains information on model configurations)
    experiments: List[Dict] = config['experiments']
    n: int = 4 # max concurrency so as to not get a throttling exception
    
    ## Split experiments into batches of at most n for concurrent deployment
    experiments_splitted = [experiments[i * n:(i + 1) * n] for i in range((len(experiments) + n - 1) // n )]
    results = []
    for exp_list in experiments_splitted:
        
        ## deploy every model in this batch concurrently and wait for the whole batch to finish
        result = await asyncio.gather(*[async_deploy_model(m,
                                                           config['aws']['region'],
                                                           config['aws']['sagemaker_execution_role']) for m in exp_list])
        ## Collect the results from each batch
        results.extend(result)
    return results

In [8]:
# Deploy all models asynchronously and time the deployment
s = time.perf_counter()

## Deploy all of the models defined in config.yml (Jupyter supports top-level await)
endpoint_names = await async_deploy_all_models(config)

## Compute the elapsed deployment time
elapsed_async = time.perf_counter() - s
print(f"endpoint_names -> {endpoint_names}, deployed in {elapsed_async:0.2f} seconds")

[2024-01-29 15:48:06,386] p15748 {2611178736.py:5} INFO - going to deploy {'name': 'mistral-7b-g5-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0', 'model_id': 'huggingface-llm-mistral-7b', 'model_version': '*', 'model_name': 'mistral7b', 'ep_name': 'lmistral7b-g5-2xlarge', 'instance_type': 'ml.g5.2xlarge', 'image_uri': '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04', 'deploy': True, 'instance_count': 1, 'deployment_script': 'jumpstart.py', 'payload_files': ['payload_en_1-500.jsonl', 'payload_en_500-1000.jsonl'], 'concurrency_levels': [1], 'accept_eula': True, 'env': {'SAGEMAKER_PROGRAM': 'inference.py', 'ENDPOINT_SERVER_TIMEOUT': '3600', 'MODEL_CACHE_ROOT': '/opt/ml/model', 'SAGEMAKER_ENV': '1', 'HF_MODEL_ID': '/opt/ml/model', 'MAX_INPUT_LENGTH': '8191', 'MAX_TOTAL_TOKENS': '8192', 'MAX_BATCH_PREFILL_TOKENS': '8191', 'SM_NUM_GPUS': '1', 'SAGEMAKER_MODEL_SERVER_WORKERS': '1'}}, in us-east-1 with arn:aws:iam::01

[2024-01-29 15:48:06,562] p15748 {2611178736.py:40} INFO - Deploying using local code: /Users/madhurpt/fmbt/scripts/jumpstart.py
Using model 'huggingface-llm-mistral-7b' with wildcard version identifier '*'. You can pin to version '2.0.1' for more stable results. Note that models may have different input/output signatures after a major version upgrade.
[2024-01-29 15:48:07,127] p15748 {session.py:3701} INFO - Creating model with name: hf-llm-mistral-7b-2024-01-29-20-48-07-126
[2024-01-29 15:48:07,954] p15748 {session.py:5377} INFO - Creating endpoint-config with name lmistral7b-g5-2xlarge-1706561287
[2024-01-29 15:48:08,488] p15748 {session.py:4279} INFO - Creating endpoint with name lmistral7b-g5-2xlarge-1706561287


---------!
endpoint_names -> [{'endpoint_name': 'lmistral7b-g5-2xlarge-1706561287', 'experiment_name': 'mistral-7b-g5-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0'}], deployed in 303.68 seconds


In [46]:
## Function to get all of the information on a deployed endpoint so it can be stored in a JSON file
def get_all_info_for_endpoint(ep: Dict) -> Optional[Dict]:
    
    ## extract the endpoint name
    ep_name = ep['endpoint_name']
    
    ## extract the experiment name returned by the deployment script
    experiment_name = ep['experiment_name']
    if ep_name is None:
        return None
    sm_client = boto3.client('sagemaker')
    
    ## get the description on the configuration of the deployed model
    endpoint = sm_client.describe_endpoint(EndpointName=ep_name)
    endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint['EndpointConfigName'])
    model_config = sm_client.describe_model(ModelName=endpoint_config['ProductionVariants'][0]['ModelName'])
    
    ## Store the experiment name and all of the other model configuration information in the 'info' dict
    info = dict(experiment_name=experiment_name,
                endpoint=endpoint,
                endpoint_config=endpoint_config,
                model_config=model_config)
    return info

all_info = list(map(get_all_info_for_endpoint, [ep for ep in endpoint_names if ep is not None]))

## a list with one info dictionary per deployed model endpoint
all_info

[{'experiment_name': 'mistral-7b-g5-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0',
  'endpoint': {'EndpointName': 'lmistral7b-g5-2xlarge-1706561287',
   'EndpointArn': 'arn:aws:sagemaker:us-east-1:015469603702:endpoint/lmistral7b-g5-2xlarge-1706561287',
   'EndpointConfigName': 'lmistral7b-g5-2xlarge-1706561287',
   'ProductionVariants': [{'VariantName': 'AllTraffic',
     'DeployedImages': [{'SpecifiedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04',
       'ResolvedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference@sha256:2739b630b95d8a95e6b4665e66d8243dd43b99c4fdb865feff13aab9c1da06eb',
       'ResolutionTime': datetime.datetime(2024, 1, 29, 15, 48, 9, 829000, tzinfo=tzlocal())}],
     'CurrentWeight': 1.0,
     'DesiredWeight': 1.0,
     'CurrentInstanceCount': 1,
     'DesiredInstanceCount': 1}],
   'EndpointStatus': 'InService',
   'CreationTime': datetime.da

In [47]:
MODELS_DIR

'data/models/2024/01/29/16/mistral-7b-tgi-g5-v1'

In [48]:
# Convert data to JSON
json_data = json.dumps(all_info, indent=2, default=str)

# Extract the model name from the config
model_name = config['general']['name']

# Specify the file name
file_name = "endpoints.json"

# Write to S3
write_to_s3(json_data, config['aws']['bucket'], MODELS_DIR, "", file_name)

Data successfully written to s3://fmbttest/data/models/2024/01/29/16/mistral-7b-tgi-g5-v1/endpoints.json


In [49]:
# Optionally, also write all endpoint info to a local endpoints.json file so that other notebooks can read it and run inference against the endpoints (commented out; the file is already written to S3 above)
# Path(ENDPOINT_LIST_FPATH).write_text(json.dumps(all_info, indent=2, default=str))