## Deploy JumpStart and Non-JumpStart Models Asynchronously
---------------------
*This notebook works best with the conda_python3 kernel on a ml.t3.medium machine*.

**This step of our solution design covers setting up the environment, downloading the requirements needed to run it, and deploying the model endpoints from the config.yml file asynchronously.**

1. Prerequisite: Open 0_setup.ipynb and run its cell to download and install the requirements from requirements.txt.

2. Now you can run this notebook to deploy the models asynchronously in different threads. The key components of this notebook are:

- Loading the globals.py and config.yml file.

- Defining a blocking function, deploy_model, that deploys a given model endpoint, followed by:

- A series of async functions that schedule tasks to deploy the models from the config.yml file asynchronously in different threads.

- Once the endpoints are deployed, their model configurations are stored within the endpoints.json file.


#### Import all of the necessary libraries below to run this notebook

In [None]:
# if interactive mode is set to no -> pickup fmbench from Python installation path
# if interactive mode is set to yes -> pickup fmbench from the current path (one level above this notebook)
# if interactive mode is not defined -> pickup fmbench from the current path (one level above this notebook)
# the premise is that if run non-interactively then it can only be run through main.py which will set interactive mode to no
import os
import sys
if os.environ.get("INTERACTIVE_MODE_SET", "yes") == "yes":
    sys.path.append(os.path.dirname(os.getcwd()))

In [None]:
import sys
import time
import json
import boto3
import asyncio
import logging
import importlib.util
import fmbench.scripts
from pathlib import Path
from fmbench.utils import *
from fmbench.globals import *
from fmbench.scripts import constants
from typing import Dict, List, Optional
from sagemaker import get_execution_role
import importlib.resources as pkg_resources
from botocore.exceptions import ClientError
from botocore.exceptions import NoCredentialsError

#### Pygmentize globals.py to view and use any of the globally initialized variables 

#### Set up a logger to log all messages while the code runs

In [None]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove existing handlers
logger.handlers.clear()

# Add a simple handler
handler = logging.StreamHandler()
formatter = logging.Formatter('[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

### Load the config.yml file
------

The config.yml file contains information that is used across this benchmarking environment, such as the AWS account details, prompts, payloads to be used for invocations, and model configurations such as the model version, the endpoint name, and the model_id to be deployed. Each model configuration also specifies the instance type to use (for example, "ml.g5.24xlarge"), the image URI, whether or not to deploy the model, and an inference script (for example, "jumpstart.py" for JumpStart models) that is used with the model deployed in this notebook.

View the contents of the config.yml file below, and see how it is loaded and used throughout this notebook to deploy the model endpoints asynchronously.
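
As a rough illustration of the structure described above, the sketch below builds a config-like dictionary in Python and checks that each experiment carries the keys that the code in this notebook reads (some of them only on certain paths). The key names are taken from the cells in this notebook; all values and the `missing_experiment_keys` helper are hypothetical, and the real config.yml may contain many more fields.

```python
from typing import Dict, List

# Keys that the code in this notebook reads from an experiment entry
REQUIRED_EXPERIMENT_KEYS = [
    "name", "model_id", "ep_name", "instance_type",
    "instance_count", "deployment_script", "inference_script",
]

# Hypothetical values, for illustration only
example_config: Dict = {
    "aws": {
        "region": "us-east-1",
        "bucket": "my-example-bucket",
        "sagemaker_execution_role": "arn:aws:iam::111122223333:role/ExampleRole",
    },
    "experiments": [
        {
            "name": "example-experiment",
            "model_id": "example-model-id",
            "ep_name": "example-endpoint",
            "instance_type": "ml.g5.24xlarge",
            "instance_count": 1,
            "deploy": True,
            "deployment_script": "jumpstart.py",
            "inference_script": "jumpstart.py",
        }
    ],
}

def missing_experiment_keys(config: Dict) -> Dict[str, List[str]]:
    """Map each experiment name to the required keys it lacks."""
    problems: Dict[str, List[str]] = {}
    for exp in config.get("experiments", []):
        missing = [k for k in REQUIRED_EXPERIMENT_KEYS if k not in exp]
        if missing:
            problems[exp.get("name", "<unnamed>")] = missing
    return problems

# An empty dict means every experiment has the keys listed above
print(missing_experiment_keys(example_config))
```

A check like this, run right after loading the config, would surface a missing key before any deployment thread starts rather than midway through a batch.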

In [None]:
## Load the config.yml file referring to the globals.py file
config = load_main_config(CONFIG_FILE)

## configure the aws region and execution role
aws_region = config['aws']['region']


# try:
#     sagemaker_execution_role = get_execution_role()
#     config['aws']['sagemaker_execution_role'] = sagemaker_execution_role
#     logger.info(f"determined SageMaker execution role from get_execution_role")
# except Exception as e:
#     logger.error(f"could not determine SageMaker execution role, error={e}")
#     logger.info(f"going to look for execution role in config file..")
#     sagemaker_execution_role = config['aws'].get('sagemaker_execution_role')
#     if sagemaker_execution_role is not None:
#         logger.info(f"found SageMaker execution role in config file..")

logger.info(f"aws_region={aws_region}, execution_role={config['aws']['sagemaker_execution_role']}")
logger.info(f"config={json.dumps(config, indent=2)}")

#### Deploy a single model: blocking function used for asynchronous deployment

This function is designed to deploy a single large language model endpoint. It takes three parameters: experiment_config (a dictionary containing configuration details for the model deployment from the config.yml file), aws_region (the AWS region where the model will be deployed), and role_arn (the AWS role's Amazon Resource Name used for the deployment).
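
The deployment function locates a script by file name, loads it as a module, and calls its `deploy` function. The sketch below shows that loading pattern in isolation, using only the standard library. The script contents, the `"sagemaker"` platform string, and the `load_script` helper are hypothetical stand-ins; the actual scripts shipped with fmbench are not reproduced here.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType

def load_script(script_path: Path) -> ModuleType:
    """Load a Python file as a module from an explicit file path."""
    spec = importlib.util.spec_from_file_location(script_path.stem, str(script_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Hypothetical stand-in for a deployment script
SCRIPT_SOURCE = '''
PLATFORM = "sagemaker"

def deploy(experiment_config, role_arn):
    return dict(endpoint_name=experiment_config["ep_name"], deployed=True)
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_deploy.py"
    path.write_text(SCRIPT_SOURCE)
    module = load_script(path)
    # PLATFORM is optional, so read it defensively
    platform = getattr(module, "PLATFORM", None)
    result = module.deploy({"ep_name": "demo-endpoint"}, "example-role-arn")

print(platform, result)
```

Loading by path keeps the choice of deployment script in the config file: adding support for a new platform means adding a script with a `deploy` function, not changing this notebook.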

In [None]:
# Initialize a module-level flag that tracks whether any of the endpoints are deployed on SageMaker
# this flag is set to False by default and changed to True if a model is deployed on SageMaker
any_ep_on_sagemaker: bool = False

In [None]:
def check_if_deployed_on_sagemaker(scripts_dir: Path, script_path: str) -> bool:
    on_sagemaker: bool = False
    module_name = Path(script_path).stem
    logger.info(f"check_if_deployed_on_sagemaker, script provided --> {module_name}")
    full_script_path = scripts_dir / f"{module_name}.py"
    logger.info(f"check_if_deployed_on_sagemaker, script path is --> {full_script_path}")

    # Check and proceed with local script
    if not full_script_path.exists():
        logger.error(f"check_if_deployed_on_sagemaker, script {full_script_path} not found, on_sagemaker={on_sagemaker}")
        return on_sagemaker
    spec = importlib.util.spec_from_file_location(module_name, str(full_script_path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    if hasattr(module, 'PLATFORM'):
        logger.info(f"check_if_deployed_on_sagemaker, module.PLATFORM: {module.PLATFORM}")
        if module.PLATFORM == constants.PLATFORM_SAGEMAKER:            
            on_sagemaker = True
            logger.warning(f"check_if_deployed_on_sagemaker, on_sagemaker={on_sagemaker}")
    else:
        logger.warning(f"check_if_deployed_on_sagemaker, Module {module_name} does not have a PLATFORM attribute, on_sagemaker={on_sagemaker}")
    return on_sagemaker

# function to deploy a model
def deploy_model(experiment_config: Dict, aws_region: str, role_arn: str) -> Optional[Dict]:
    global any_ep_on_sagemaker
    # Log the deployment details
    logger.info(f"going to deploy {experiment_config}, in {aws_region} with {role_arn}")
    model_deployment_result: Optional[Dict] = None

    # Assuming fmbench is a valid Python package and scripts is a subdirectory within it
    scripts_dir = Path(pkg_resources.files('fmbench'), 'scripts')
    logger.info(f"Using fmbench.scripts directory: {scripts_dir}")

    # Check if deployment is enabled in the config; skip if not
    deploy = experiment_config.get('deploy', False)
    if deploy is False:
        logger.info(f"skipping deployment of {experiment_config['model_id']} because deploy={deploy}")
        # In the case 'deploy' is set to 'False', we introduce a new parameter to the config files that users can configure: the production variant.
        # If you have a custom production variant that you want to use, configure that in the experiment section within your config file. If you do not
        # have a custom production variant configured, the default variant name 'AllTraffic' will be used instead. 
        production_variant_name = experiment_config.get('production_variant_name', DEFAULT_PRODUCTION_VARIANT_NAME)
        logger.info(f"Using production variant name: {production_variant_name}.")
        model_deployment_result = dict(endpoint_name=experiment_config['ep_name'], 
                                       experiment_name=experiment_config['name'], 
                                       instance_type=experiment_config['instance_type'], 
                                       instance_count=experiment_config['instance_count'], 
                                       production_variant_name=production_variant_name,
                                       deployed=False)
        # set the sagemaker flag if this is a sagemaker endpoint, we do this based
        # on the predictor (inference script) so that even if we are not deploying the model
        # we can still know if this is a SageMaker endpoint; the flag is only ever set to True
        # so that a later non-SageMaker experiment cannot reset it
        if check_if_deployed_on_sagemaker(scripts_dir, experiment_config['inference_script']):
            any_ep_on_sagemaker = True
        return model_deployment_result

    # Initialize the S3 client
    s3_client = boto3.client('s3', region_name=aws_region)    

    # Proceed with deployment as before
    try:
        module_name = Path(experiment_config['deployment_script']).stem
        logger.info(f"script provided for deploying this model is --> {module_name}")
        deployment_script_path = scripts_dir / f"{module_name}.py"
        logger.info(f"script path is --> {deployment_script_path}")

        # Check and proceed with local script
        if not deployment_script_path.exists():
            logger.error(f"Deployment script {deployment_script_path} not found.")
            return None

        logger.info(f"Deploying using local code: {deployment_script_path}")

        spec = importlib.util.spec_from_file_location(module_name, str(deployment_script_path))
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        spec.loader.exec_module(module)
        if hasattr(module, 'PLATFORM'):
            logger.info(f"module.PLATFORM: {module.PLATFORM}")
            if module.PLATFORM == constants.PLATFORM_SAGEMAKER:
                any_ep_on_sagemaker = True
        else:
            logger.warning(f"Module {module_name} does not have a PLATFORM attribute.")
        st = time.perf_counter()
        model_deployment_result = module.deploy(experiment_config, role_arn)
        elapsed_time = time.perf_counter() - st
        logger.info(f"time taken to deploy model_id={experiment_config['model_id']} via "
                    f"{deployment_script_path} is {elapsed_time:0.2f} seconds")
        return model_deployment_result
    except Exception as error:  # Broader exception handling for non-ClientError issues
        logger.error(f"An error occurred during deployment: {error}")
        return model_deployment_result

### Asynchronous Model Deployment
----

#### async_deploy_model: 

- This is an asynchronous wrapper around the deploy_model function. It uses asyncio.to_thread to run the synchronous deploy_model function in a separate thread. This allows the function to be awaited in an asynchronous context, enabling concurrent model deployments without blocking the event loop running in the main thread.
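
A minimal sketch of this wrapping pattern, assuming Python 3.9 or later (where `asyncio.to_thread` is available). `blocking_deploy` is a hypothetical stand-in that sleeps instead of deploying anything.

```python
import asyncio
import time

def blocking_deploy(name: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for a blocking deployment call."""
    time.sleep(delay)
    return f"{name}-deployed"

async def async_deploy(name: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_deploy, name)

async def main() -> list:
    # Both calls run in separate threads; gather returns results in call order
    return await asyncio.gather(async_deploy("model-a"), async_deploy("model-b"))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:0.2f}s")
```

Because the two sleeps overlap, the elapsed time should be close to one delay rather than two. Inside a notebook cell, where an event loop is already running, `await main()` would replace `asyncio.run(main())`.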

#### async_deploy_all_models Function: 

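- The async_deploy_all_models function deploys multiple models concurrently. It splits the experiments into batches of at most n entries (4 by default, to avoid throttling) and deploys each batch concurrently using asyncio.gather, waiting for one batch to finish before starting the next.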
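
The batching arithmetic can be sketched on its own as follows. `split_into_batches`, `fake_deploy`, and `deploy_in_batches` are hypothetical names used only for this illustration; `fake_deploy` sleeps briefly instead of deploying a model.

```python
import asyncio
from typing import List

def split_into_batches(items: List, n: int) -> List[List]:
    """Split items into consecutive batches of at most n elements."""
    num_batches = (len(items) + n - 1) // n  # ceiling division
    return [items[i * n:(i + 1) * n] for i in range(num_batches)]

async def fake_deploy(name: str) -> str:
    """Hypothetical stand-in for an awaitable deployment."""
    await asyncio.sleep(0.01)
    return f"{name}-done"

async def deploy_in_batches(names: List[str], n: int) -> List[str]:
    results: List[str] = []
    for batch in split_into_batches(names, n):
        # Everything in a batch runs concurrently; the next batch starts
        # only after the whole current batch has finished
        results.extend(await asyncio.gather(*[fake_deploy(x) for x in batch]))
    return results

batches = split_into_batches(list("abcdefghij"), 4)
results = asyncio.run(deploy_in_batches(["m1", "m2", "m3", "m4", "m5"], n=4))
print([len(b) for b in batches], results)
```

One consequence of this design is that a batch takes as long as its slowest member. A semaphore-based limit would keep n deployments in flight at all times, at the cost of slightly more code.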

In [None]:
## Asynchronous wrapper that lets the blocking deploy_model function run concurrently
async def async_deploy_model(experiment_config: Dict, aws_region: str, role_arn: str) -> Optional[Dict]:
    # Run the deploy_model function in a separate thread to deploy the models asynchronously;
    # arguments are passed in the same order as deploy_model(experiment_config, aws_region, role_arn)
    return await asyncio.to_thread(deploy_model, experiment_config, aws_region, role_arn)

## Final asynchronous function to deploy all of the models concurrently
async def async_deploy_all_models(config: Dict) -> List[Dict]:
    
    ## Extract experiments from the config.yml file (contains information on model configurations)
    experiments: List[Dict] = config['experiments']
    n: int = 4 # max concurrency so as to not get a throttling exception
    
    # special check for deploy_w_djl_serving.py, if there are more than one deployments
    # to be done through this deployment script then we set the deployment concurrency 'n'
    # to 1 since this script is not re-entrant.
    non_reentrant_deployment_scripts = ['deploy_w_djl_serving.py']
    non_reentrant_deployment_scripts_present = [e['deployment_script'] for e in experiments\
                                                if e['deployment_script'] in non_reentrant_deployment_scripts]
    if len(non_reentrant_deployment_scripts_present) > 1:
        logger.info(f"non_reentrant_deployment_scripts_present={len(non_reentrant_deployment_scripts_present)}, going to deploy "
                    f"models serially")
        n = 1
    ## Split experiments into smaller batches for concurrent deployment
    experiments_splitted = [experiments[i * n:(i + 1) * n] for i in range((len(experiments) + n - 1) // n )]
    results = []
    for exp_list in experiments_splitted:
        
        ## send the deployment in batches
        result = await asyncio.gather(*[async_deploy_model(m,
                                                           config['aws']['region'],
                                                           config['aws']['sagemaker_execution_role']) for m in exp_list])
        ## Collect the results from each batch
        results.extend(result)
    return results

In [None]:
# async version
s = time.perf_counter()

## Call all of the models for deployment using the config.yml file model configurations
endpoint_names = await async_deploy_all_models(config)

## Measure the total time taken to deploy the models
elapsed_async = time.perf_counter() - s
print(f"endpoint_names -> {endpoint_names}, deployed in {elapsed_async:0.2f} seconds")

In [None]:
## Function to get all of the information on the deployed endpoints and store it in a json
def get_all_info_for_endpoint(ep: Dict) -> Optional[Dict]:
    ## extract the endpoint name up front so that the exception handlers below can always reference it
    ep_name = ep.get('endpoint_name')
    if ep_name is None:
        return None
    try:
        ## extract the experiment name from the config.yml file
        experiment_name = ep['experiment_name']
        sm_client = boto3.client('sagemaker')
        
        # check if the given endpoint is a sagemaker endpoint
        next_token = None
        ep_names = []
        while True:
            # if this is not an endpoint on sagemaker, then break out of the loop and carry on. Assuming that this is
            # an EKS/EC2/Bedrock endpoint to benchmark
            if any_ep_on_sagemaker == False:
                logger.info(f"This is not a sagemaker endpoint")
                break
            if next_token:
                resp = sm_client.list_endpoints(MaxResults=100, StatusEquals='InService', NextToken=next_token)
            else:
                resp = sm_client.list_endpoints(MaxResults=100, StatusEquals='InService')
            ep_names.extend([e['EndpointName'] for e in resp['Endpoints']])
        
            next_token = resp.get('NextToken')
            if next_token is None:
                break
        if ep_name in ep_names:
            logger.info(f"ep_name={ep_name} is an InService SageMaker endpoint")
            # get the description on the configuration of the deployed model
            endpoint = sm_client.describe_endpoint(EndpointName=ep_name)
            endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint['EndpointConfigName'])
            model_config = sm_client.describe_model(ModelName=endpoint_config['ProductionVariants'][0]['ModelName'])
            # Store the experiment name and all of the other model configuration information in the 'info' dict
            info = dict(experiment_name=experiment_name,
                        endpoint=endpoint,
                        endpoint_config=endpoint_config,
                        deployed=ep['deployed'],
                        model_config=model_config)
        # if it is not an in service sagemaker endpoint, but the endpoint was deployed and exists
        elif ep_name not in ep_names and ep_name is not None:
            logger.info(f"ep_name={ep_name} is an \"InService\" EKS or EC2 or other endpoint")
            # populate the endpoint name in a dictionary
            info = dict(
                experiment_name=experiment_name,
                endpoint={'EndpointName': ep_name},
                endpoint_config={'ProductionVariants': None},
                instance_type=ep['instance_type'],
                instance_count=ep['instance_count'],
                deployed=ep['deployed'],
                model_config=None
            )
        else:
            logger.info(f"ep_name={ep_name} is not an \"InService\" SageMaker endpoint, "
                        f"setting all info about it to None")
            info = None
    except ClientError as e:
        # if there are any access denied exception with regards to permissions for listing or describing the endpoint
        # information, then use the production_variant_name information from the config file. This is useful when the configuration
        # file is a bring your own endpoint version, and the user already has access to the production variant name. If the 
        # production variant name that is provided is incorrect, then an exception will be thrown. If the production variant name
        # is not provided and there are any exceptions, then the production variant resorts to the default, which is AllTraffic.
        if e.response['Error']['Code'] == 'AccessDeniedException':
            logger.warning(f"An access denied exception occurred: {e}")
            # Use the provided production_variant_name or default to 'AllTraffic'
            production_variant = ep.get('production_variant_name', DEFAULT_PRODUCTION_VARIANT_NAME)
            logger.info(f"Using production variant name: {production_variant}.")
            info = dict(
                experiment_name=experiment_name,
                endpoint={'EndpointName': ep_name,
                        'ProductionVariants': [{'VariantName': production_variant}]},
                instance_type=ep['instance_type'],
                instance_count=ep['instance_count'],
                deployed=ep['deployed'],
                model_config=None
            )
        else:
            raise
    except Exception as e:
        logger.error(f"Error processing endpoint {ep_name}: {str(e)}")
        info = None
    return info

all_info = list(filter(None,
                  list(map(get_all_info_for_endpoint,
                             list(filter(None,
                                          endpoint_names))))))

## all_info is a list holding the collected information for all of the deployed model endpoints
all_info
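
The `NextToken` loop in the cell above could be factored into a small helper, sketched below. `list_all_endpoint_names` and `FakeSageMakerClient` are hypothetical names; the fake client only mimics the `Endpoints` and `NextToken` fields of a `list_endpoints` response so that the loop can be exercised without AWS credentials.

```python
from typing import Dict, List

def list_all_endpoint_names(sm_client, status: str = "InService") -> List[str]:
    """Collect endpoint names across all pages of list_endpoints."""
    names: List[str] = []
    next_token = None
    while True:
        kwargs = dict(MaxResults=100, StatusEquals=status)
        if next_token:
            kwargs["NextToken"] = next_token
        resp = sm_client.list_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in resp["Endpoints"])
        # A missing NextToken means this was the last page
        next_token = resp.get("NextToken")
        if next_token is None:
            break
    return names

class FakeSageMakerClient:
    """Hypothetical stand-in that serves two pages of results."""
    def __init__(self):
        self.pages = {None: (["ep-1", "ep-2"], "page-2"),
                      "page-2": (["ep-3"], None)}

    def list_endpoints(self, **kwargs) -> Dict:
        names, token = self.pages[kwargs.get("NextToken")]
        resp = {"Endpoints": [{"EndpointName": n} for n in names]}
        if token is not None:
            resp["NextToken"] = token
        return resp

endpoint_names_found = list_all_endpoint_names(FakeSageMakerClient())
print(endpoint_names_found)
```

With a real client, `boto3.client('sagemaker')` would take the place of the fake; the loop itself stays the same.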

In [None]:
# Convert data to JSON
json_data = json.dumps(all_info, indent=2, default=str)

# Specify the file name
file_name = "endpoints.json"

# Write to S3
endpoint_s3_path = write_to_s3(json_data, config['aws']['bucket'], MODELS_DIR, "", file_name)

logger.info(f"deployed endpoint info is written to this file --> {endpoint_s3_path}")

In [None]:
# check that every endpoint we needed to deploy actually got deployed;
# if not, raise an exception because we cannot run inference on the missing
# endpoints, so there is no point in continuing further.
# the default for 'deploy' is False here to match deploy_model, which skips experiments without it
expected_deploy_count: int = len([e for e in config['experiments'] if e.get('deploy', False) is True])
actual_deploy_count: int = len([ep for ep in all_info if ep.get('deployed') is True])
assert_text: str = f"expected_deploy_count={expected_deploy_count} but actual_deploy_count={actual_deploy_count}, cannot continue"
assert expected_deploy_count == actual_deploy_count, assert_text