remove usage of SERVING_LOAD_MODELS and OPTION_MODEL_ID in examples/docs/tests
siddvenk committed Jun 7, 2024
1 parent b178dc9 commit 7203944
Showing 8 changed files with 48 additions and 100 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/llm_integration.yml
@@ -562,7 +562,7 @@ jobs:
working-directory: tests/integration
run: |
rm -rf models
echo -en "SERVING_LOAD_MODELS=test::MPI=/opt/ml/model\nOPTION_MAX_ROLLING_BATCH_SIZE=2\nOPTION_OUTPUT_FORMATTER=jsonlines\nOPTION_TENSOR_PARALLEL_DEGREE=1\nOPTION_MODEL_ID=gpt2\nOPTION_TASK=text-generation\nOPTION_ROLLING_BATCH=lmi-dist" > docker_env
echo -en "OPTION_MAX_ROLLING_BATCH_SIZE=2\nOPTION_OUTPUT_FORMATTER=jsonlines\nTENSOR_PARALLEL_DEGREE=1\nHF_MODEL_ID=gpt2\nOPTION_TASK=text-generation\nOPTION_ROLLING_BATCH=lmi-dist" > docker_env
./launch_container.sh deepjavalibrary/djl-serving:$DJLSERVING_DOCKER_TAG nocode lmi
python3 llm/client.py lmi_dist gpt2
docker rm -f $(docker ps -aq)
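In effect, the test container's environment drops the `SERVING_LOAD_MODELS`/`OPTION_MODEL_ID` pair in favor of the simpler form below; the values are taken directly from the two `echo` lines above, with the unchanged keys omitted:

```
# before
SERVING_LOAD_MODELS=test::MPI=/opt/ml/model
OPTION_MODEL_ID=gpt2
OPTION_TENSOR_PARALLEL_DEGREE=1

# after
HF_MODEL_ID=gpt2
TENSOR_PARALLEL_DEGREE=1
```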
2 changes: 2 additions & 0 deletions engines/python/setup/djl_python/test_model.py
@@ -219,6 +219,8 @@ def load_properties(properties_dir):
def update_properties_with_env_vars(kwargs):
env_vars = os.environ
for key, value in env_vars.items():
if key == "HF_MODEL_ID":
kwargs.setdefault("model_id", value)
if key.startswith("OPTION_"):
key = key[7:].lower()
if key == "entrypoint":
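For context, here is a minimal sketch of what this mapping does for the two kinds of keys involved. It is an illustration only, not the full `update_properties_with_env_vars` implementation (which, as the diff shows, also special-cases keys such as `entrypoint`):

```python
# Illustrative sketch of the env-var translation above, not the djl_python source.
def translate_env_vars(env_vars):
    kwargs = {}
    for key, value in env_vars.items():
        if key == "HF_MODEL_ID":
            # HF_MODEL_ID maps to the model_id property unless it was already set.
            kwargs.setdefault("model_id", value)
        if key.startswith("OPTION_"):
            # OPTION_<PROPERTY> maps to the lower-cased <property> name.
            kwargs[key[len("OPTION_"):].lower()] = value
    return kwargs


print(translate_env_vars({
    "HF_MODEL_ID": "gpt2",
    "OPTION_ROLLING_BATCH": "lmi-dist",
}))
# {'model_id': 'gpt2', 'rolling_batch': 'lmi-dist'}
```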
20 changes: 4 additions & 16 deletions engines/python/setup/djl_python/tests/test_test_model.py
@@ -61,17 +61,14 @@ def test_all_code(self):

def test_with_env(self):
envs = {
"OPTION_MODEL_ID": "NousResearch/Nous-Hermes-Llama2-13b",
"SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
"HF_MODEL_ID": "NousResearch/Nous-Hermes-Llama2-13b",
"OPTION_ROLLING_BATCH": "auto",
"OPTION_TGI_COMPAT": "true"
}
for key, value in envs.items():
os.environ[key] = value
huggingface.get_rolling_batch_class_from_str = override_rolling_batch
handler = TestHandler(huggingface)
self.assertEqual(handler.serving_properties["model_id"],
envs["OPTION_MODEL_ID"])
self.assertEqual(handler.serving_properties["rolling_batch"],
envs["OPTION_ROLLING_BATCH"])
self.assertEqual(handler.serving_properties["tgi_compat"],
@@ -100,17 +97,14 @@ def test_with_env(self):

def test_with_tgi_compat_env(self):
envs = {
"OPTION_MODEL_ID": "NousResearch/Nous-Hermes-Llama2-13b",
"SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
"HF_MODEL_ID": "NousResearch/Nous-Hermes-Llama2-13b",
"OPTION_ROLLING_BATCH": "auto",
"OPTION_TGI_COMPAT": "true"
}
for key, value in envs.items():
os.environ[key] = value
huggingface.get_rolling_batch_class_from_str = override_rolling_batch
handler = TestHandler(huggingface)
self.assertEqual(handler.serving_properties["model_id"],
envs["OPTION_MODEL_ID"])
self.assertEqual(handler.serving_properties["rolling_batch"],
envs["OPTION_ROLLING_BATCH"])
self.assertEqual(handler.serving_properties["tgi_compat"],
@@ -162,16 +156,13 @@ def test_all_code_chat(self):

def test_with_env_chat(self):
envs = {
"OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
"SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
"HF_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
"OPTION_ROLLING_BATCH": "auto"
}
for key, value in envs.items():
os.environ[key] = value
huggingface.get_rolling_batch_class_from_str = override_rolling_batch
handler = TestHandler(huggingface)
self.assertEqual(handler.serving_properties["model_id"],
envs["OPTION_MODEL_ID"])
self.assertEqual(handler.serving_properties["rolling_batch"],
envs["OPTION_ROLLING_BATCH"])
inputs = [{
@@ -248,8 +239,7 @@ def test_exception_handling(self):
@unittest.skip
def test_profiling(self, logging_method):
envs = {
"OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
"SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
"HF_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
"OPTION_ROLLING_BATCH": "auto",
"DJL_PYTHON_PROFILING": "true",
"DJL_PYTHON_PROFILING_TOP_OBJ": "60"
@@ -259,8 +249,6 @@ def test_profiling(self, logging_method):
os.environ[key] = value
huggingface.get_rolling_batch_class_from_str = override_rolling_batch
handler = TestHandler(huggingface)
self.assertEqual(handler.serving_properties["model_id"],
envs["OPTION_MODEL_ID"])
self.assertEqual(handler.serving_properties["rolling_batch"],
envs["OPTION_ROLLING_BATCH"])
inputs = [{
111 changes: 34 additions & 77 deletions serving/docs/lmi/deployment_guide/configurations.md
@@ -1,21 +1,14 @@
# Container and Model Configurations

The configuration supplied to LMI provides required and optional information that LMI will use to load and serve your model.
LMI containers accept configurations provided in two formats. In order of priority, these are:
The configuration supplied to LMI provides information that LMI will use to load and serve your model.
LMI containers accept configurations provided in two formats.

* `serving.properties` Configuration File (per model configurations)
* Environment Variables (global configurations)

We recommend using the `serving.properties` configuration file for the following reasons:

* Supports SageMaker Multi Model Endpoints with per model configurations
* Environment Variables are applied globally to all models hosted by the model server, so they can't be used for model specific configuration
* Separates model configuration from the SageMaker Model Object (deployment unit)
* Configurations can be modified and updated independently of the deployment unit/code

Environment Variables are a good option for the proof-of-concept and experimentation phase for a single model.
You can modify the environment variables as part of your deployment script without having to re-upload configurations to S3.
This typically leads to a faster iteration loop when modifying and experimenting with configuration values.
For most use-cases, using environment variables is sufficient.
If you are deploying LMI to serve multiple models within the same container (SageMaker Multi-Model Endpoint use-case), you should use per model `serving.properties` configuration files.
Environment Variables are global settings and will apply to all models being served within a single instance of LMI.

While you can mix configurations between `serving.properties` and environment variables, we recommend you choose one and specify all configuration in that format.
Configurations specified in the `serving.properties` files will override configurations specified in environment variables.
@@ -24,59 +17,50 @@ Both configuration mechanisms offer access to the same set of configurations.

If you know which backend you are going to use, you can find a set of starter configurations in the corresponding [user guide](../user_guides/README.md).
We recommend using the quick start configurations as a starting point if you have decided on a particular backend.
The only change required to the starter configurations is specifying `option.model_id` to point to your model artifacts.

We will now cover the components of a minimal starting configuration. This minimal configuration will look like:
We will now cover the two configuration formats.

```
# use standard python engine, or mpi aware python engine
engine=<Python|MPI>
# where the model artifacts are stored
option.model_id=<hf_hub_model_id|s3_uri>
# which inference library to use
option.rolling_batch=<auto|vllm|lmi-dist|tensorrtllm>
# how many gpus to shard the model across with tensor parallelism
option.tensor_parallel_degree=<max|number between 1 and number of gpus available>
```
## serving.properties

There are additional configurations that can be specified.
We will cover the common configurations (across backends) in [LMI Common Configurations](#lmi-common-configurations)
### Model Artifact Configuration (required)

## Model Artifact Configuration
If you are deploying model artifacts directly with the container, LMI will detect the artifacts in the default model store `/opt/ml/model`.
This is the default location when using SageMaker, and where SageMaker will mount the artifacts when specified via [`ModelDataSource`](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html).
You do not need to set any model artifact configurations when using this mechanism.

If you are deploying a model hosted on the HuggingFace Hub, you must specify the `option.model_id=<hf_hub_model_id>` configuration.
When using a model directly from the hub, we recommend you also specify the model revision (commit hash) via `option.revision=<commit hash>`.
When using a model directly from the hub, we recommend you also specify the model revision (commit hash or branch) via `option.revision=<commit hash/branch>`.
Since model artifacts are downloaded at runtime from the Hub, using a specific revision ensures you are using a model compatible with package versions in the runtime environment.
Open Source model artifacts on the hub are subject to change at any time, and these changes may cause issues when instantiating the model (the model may require a newer version of transformers than what is available in the container).
If a model provides custom model (*modeling.py) and custom tokenizer (*tokenizer.py) files, you need to specify `option.trust_remote_code=true` to load and use the model.
Open Source model artifacts on the hub are subject to change at any time.
These changes may cause issues when instantiating the model (updated model artifacts may require a newer version of a dependency than what is bundled in the container).
If a model provides custom model (*modeling.py) and/or custom tokenizer (*tokenizer.py) files, you need to specify `option.trust_remote_code=true` to load and use the model.

If you are deploying a model hosted in S3, `option.model_id=<s3 uri>` should be the s3 object prefix of the model artifacts.
Alternatively, you can upload the `serving.properties` file to S3 alongside your model artifacts (under the same prefix) and omit the `option.model_id` config from your `serving.properties` file.
Example code for leveraging uncompressed artifacts in S3 is provided in the [deploying your endpoint](deploying-your-endpoint.md#configuration---servingproperties) section.
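Putting the artifact options above together, a minimal `serving.properties` for a Hub-hosted model might look like the following sketch (the values are placeholders, not recommendations):

```
option.model_id=<hf_hub_model_id>
# pin a revision so runtime downloads stay compatible with the container
option.revision=<commit hash or branch>
# only needed if the model ships custom *modeling.py / *tokenizer.py files
option.trust_remote_code=true
```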

## Inference Library Configuration

LMI expects the following two configurations to determine which backend to use:
### Inference Library Configuration (optional)

* `engine`. The options are `Python` and `MPI`, which dictates how we launch the Python processes
* `option.rolling_batch`. This represents the inference library to use. The available options depend on the container.
* `option.entryPoint`. This represents the default inference handler to use. In most cases, this can be auto-detected and does not need to be specified
Inference library configurations are optional, but allow you to override the default backend for your model.
To override or explicitly set the inference backend, set `option.rolling_batch`.
This represents the inference library to use.
The available options depend on the container.

In the LMI Container:

* to use vLLM, use `engine=Python` and `option.rolling_batch=vllm`
* to use lmi-dist, use `engine=MPI` and `option.rolling_batch=lmi-dist`
* to use huggingface accelerate, use `engine=Python` and `option.rolling_batch=auto` for text generation models, or `option.rolling_batch=disable` for non-text generation models.
* to use vLLM, use `option.rolling_batch=vllm`
* to use lmi-dist, use `option.rolling_batch=lmi-dist`
* to use huggingface accelerate, use `option.rolling_batch=auto` for text generation models, or `option.rolling_batch=disable` for non-text generation models.

In the TensorRT-LLM Container:

* use `engine=MPI` and `option.rolling_batch=trtllm` to use TensorRT-LLM
* use `option.rolling_batch=trtllm` to use TensorRT-LLM (this is the default)

In the Transformers NeuronX Container:

* use `engine=Python` and `option.rolling_batch=auto` to use Transformers NeuronX
* use `option.rolling_batch=auto` to use Transformers NeuronX (this is the default)
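For example, to pin a text-generation model to the lmi-dist backend in the LMI container, the corresponding entry would simply be the following (or its environment variable equivalent, `OPTION_ROLLING_BATCH=lmi-dist`):

```
option.rolling_batch=lmi-dist
```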

## Tensor Parallelism Configuration
### Tensor Parallelism Configuration

The `option.tensor_parallel_degree` configuration is used to specify how many GPUs to shard the model across using tensor parallelism.
This value should be between 1 and the maximum number of GPUs available on the instance.
@@ -87,12 +71,13 @@ Alternatively, if this value is specified as a number, LMI will attempt to maxim

For example, using an instance with 4 gpus and a tensor parallel degree of 2 will result in 2 model copies, each using 2 gpus.
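As a sketch, that 4-GPU example corresponds to the configuration below in `serving.properties` form; the environment variable equivalent would be `OPTION_TENSOR_PARALLEL_DEGREE=2`:

```
option.tensor_parallel_degree=2
```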

## LMI Common Configurations

### LMI Common Configurations

There are two classes of configurations provided by LMI:

* Model Server level configurations. These configurations do not have a prefix (e.g. `job_queue_size`)
* Engine/Backend level configurations. These configurations have an `option.` prefix (e.g. `option.model_id`)
* Engine/Backend level configurations. These configurations have an `option.` prefix (e.g. `option.dtype`)

Since LMI is built using the DJLServing model server, all DJLServing configurations are available in LMI.
You can find a list of these configurations [here](../../configurations_model.md#python-model-configuration).
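As a small illustration of the two classes (the values are placeholders, not recommendations), a `serving.properties` file could mix both:

```
# model server level configuration (no prefix)
job_queue_size=100
# engine/backend level configuration (option. prefix)
option.dtype=fp16
```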
@@ -123,48 +108,20 @@ You can find these configurations in the respective [user guides](../user_guides/README.md)

## Environment Variable Configurations

All LMI Configuration keys available in the `serving.properties` format can be specified as environment variables.

The translation for `engine` is unique. The configuration `engine=<engine>` is translated to `SERVING_LOAD_MODELS=test::<engine>=/opt/ml/model`.
For example:
The core configurations available via environment variables are documented in our [starting guide](../user_guides/starting-guide.md#available-environment-variable-configurations).

* `engine=Python` is translated to environment variable `SERVING_LOAD_MODELS=test::Python=/opt/ml/model`
* `engine=MPI` is translated to environment variable `SERVING_LOAD_MODELS=test::MPI=/opt/ml/model`
For other configurations, the `serving.properties` configuration can be translated into an equivalent environment variable configuration.

Configuration keys that start with `option.` can be specified as environment variables using the `OPTION_` prefix.
Keys that start with `option.` can be specified as environment variables using the `OPTION_` prefix.
The configuration `option.<property>` is translated to environment variable `OPTION_<PROPERTY>`. For example:

* `option.model_id` is translated to environment variable `OPTION_MODEL_ID`
* `option.tensor_parallel_degree` is translated to environment variable `OPTION_TENSOR_PARALLEL_DEGREE`
* `option.rolling_batch` is translated to environment variable `OPTION_ROLLING_BATCH`

Configuration keys that do not start with option can be specified as environment variables using the `SERVING_` prefix.
Configuration keys that do not start with `option` can be specified as environment variables using the `SERVING_` prefix.
The configuration `<property>` is translated to environment variable `SERVING_<PROPERTY>`. For example:

* `job_queue_size` is translated to environment variable `SERVING_JOB_QUEUE_SIZE`

For a full example, given the following `serving.properties` file:

```
engine=MPI
option.model_id=tiiuae/falcon-40b
option.entryPoint=djl_python.transformersneuronx
option.trust_remote_code=true
option.tensor_parallel_degree=4
option.max_rolling_batch_size=32
option.rolling_batch=auto
```

We can translate the configuration to environment variables like this:

```
HF_MODEL_ID=tiiuae/falcon-40b
OPTION_ENTRYPOINT=djl_python.transformersneuronx
HF_MODEL_TRUST_REMOTE_CODE=true
TENSOR_PARALLEL_DEGREE=4
OPTION_MAX_ROLLING_BATCH_SIZE=32
OPTION_ROLLING_BATCH=auto
```

Next: [Deploying your endpoint](deploying-your-endpoint.md)

Previous: [Backend Selection](backend-selection.md)
3 changes: 2 additions & 1 deletion serving/docs/lmi/deployment_guide/deploying-your-endpoint.md
@@ -176,7 +176,8 @@ The following options may be added to the `ModelDataSource` field to support unc
This mechanism is useful when deploying SageMaker endpoints with network isolation.
Model artifacts will be downloaded by SageMaker and mounted to the container rather than being downloaded by the container at runtime.

If you use this mechanism to deploy the container, you should set `option.model_id=/opt/ml/model` in serving.properties, or `OPTION_MODEL_ID=/opt/ml/model` in environment variables depending on which configuration style you are using.
If you use this mechanism to deploy the container, you do not need to specify the `option.model_id` or `HF_MODEL_ID` config.
LMI will load the model artifacts from the model directory by default, which is where SageMaker downloads and mounts the model artifacts from S3.

Follow this link for a detailed overview of this option: https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html

2 changes: 1 addition & 1 deletion serving/docs/lmi/deployment_guide/testing-custom-script.md
@@ -48,7 +48,7 @@ from djl_python import huggingface
from djl_python.test_model import TestHandler

envs = {
"OPTION_MODEL_ID": "NousResearch/Nous-Hermes-Llama2-13b",
"HF_MODEL_ID": "NousResearch/Nous-Hermes-Llama2-13b",
"OPTION_MPI_MODE": "true",
"OPTION_ROLLING_BATCH": "lmi-dist",
"OPTION_TENSOR_PARALLEL_DEGREE": 4
6 changes: 3 additions & 3 deletions serving/docs/lmi/tutorials/trtllm_aot_tutorial.md
@@ -50,7 +50,7 @@ docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-te
The configurations below help you set the inference optimization parameters. You can check all the configurations of the TensorRT-LLM LMI handler [in our docs](../user_guides/trt_llm_user_guide.md#advanced-tensorrt-llm-configurations).

```
OPTION_MODEL_ID={{s3url}}
HF_MODEL_ID={{s3url}}
OPTION_TENSOR_PARALLEL_DEGREE=8
OPTION_MAX_ROLLING_BATCH_SIZE=128
OPTION_DTYPE=fp16
@@ -87,7 +87,7 @@ In the below example, the model artifacts will be saved to `$MODEL_REPO_DIR` cre
docker run --runtime=nvidia --gpus all --shm-size 12gb \
-v $MODEL_REPO_DIR:/tmp/trtllm \
-p 8080:8080 \
-e OPTION_MODEL_ID=$OPTION_MODEL_ID \
-e HF_MODEL_ID=$HF_MODEL_ID \
-e OPTION_TENSOR_PARALLEL_DEGREE=$OPTION_TENSOR_PARALLEL_DEGREE \
-e OPTION_MAX_ROLLING_BATCH_SIZE=$OPTION_MAX_ROLLING_BATCH_SIZE \
-e OPTION_DTYPE=$OPTION_DTYPE \
@@ -115,7 +115,7 @@ aws s3 cp $MODEL_REPO_DIR s3://YOUR_S3_FOLDER_NAME/ --recursive
**Note:** After uploading the model artifacts to S3, you only need to update the model_id (as an environment variable or in `serving.properties`) to the newly created S3 URL containing the compiled model artifacts; the rest of the environment variables or `serving.properties` stay the same when deploying on SageMaker. You can check the [tutorial](https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/sample-llm/trtllm_rollingbatch_deploy_llama_13b.ipynb) on how to run inference using the TensorRT-LLM DLC. The snippet below shows an example of the updated model_id.

```
OPTION_MODEL_ID=s3://YOUR_S3_FOLDER_NAME
HF_MODEL_ID=s3://YOUR_S3_FOLDER_NAME
OPTION_TENSOR_PARALLEL_DEGREE=8
OPTION_MAX_ROLLING_BATCH_SIZE=128
OPTION_DTYPE=fp16
@@ -254,7 +254,7 @@ Finally, you can use one of the following configuration to load your model on Sa

### 1. Environment variables:
```
OPTION_MODEL_ID=s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/
HF_MODEL_ID=s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/
OPTION_TENSOR_PARALLEL_DEGREE=2
OPTION_MAX_ROLLING_BATCH_SIZE=64
```
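Applying the translation rules from `configurations.md` above, a roughly equivalent `serving.properties` sketch for the same deployment would be:

```
option.model_id=s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/
option.tensor_parallel_degree=2
option.max_rolling_batch_size=64
```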
