
# All DJL configuration options

DJLServing is highly configurable. This document aims to capture all of those configuration options in a single place.

Note: For tunable parameters for Large Language Models, please refer to this guide.

## DJL settings

DJLServing is built on top of Deep Java Library (DJL). Here is a list of settings for DJL:

| Key                            | Type                | Description                                                                            |
|--------------------------------|---------------------|----------------------------------------------------------------------------------------|
| DJL_DEFAULT_ENGINE             | env var/system prop | The preferred engine for DJL if there are multiple engines, default: MXNet              |
| ai.djl.default_engine          | system prop         | The preferred engine for DJL if there are multiple engines, default: MXNet              |
| DJL_CACHE_DIR                  | env var/system prop | The cache directory for DJL, default: $HOME/.djl.ai/                                    |
| ENGINE_CACHE_DIR               | env var/system prop | The cache directory for engine native libraries, default: $DJL_CACHE_DIR                |
| ai.djl.dataiterator.autoclose  | system prop         | Automatically close the data set iterator, default: true                                |
| ai.djl.repository.zoo.location | system prop         | Global model zoo search locations, not recommended                                      |
| offline                        | system prop         | Don't access the network to download the engine's native library and model zoo metadata |
| collect-memory                 | system prop         | Enable memory metric collection, default: false                                         |
| disableProgressBar             | system prop         | Disable the progress bar, default: false                                                |
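For example, to relocate the DJL cache and pin the default engine (the path and engine below are illustrative values, not defaults):

```
export DJL_CACHE_DIR=/opt/djl/cache
export DJL_DEFAULT_ENGINE=PyTorch
```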

### PyTorch

| Key                                | Type                | Description                                                                  |
|------------------------------------|---------------------|-------------------------------------------------------------------------------|
| PYTORCH_LIBRARY_PATH               | env var/system prop | User-provided custom PyTorch native library                                   |
| PYTORCH_VERSION                    | env var/system prop | PyTorch version to load                                                       |
| PYTORCH_EXTRA_LIBRARY_PATH         | env var/system prop | Custom PyTorch libraries to load (e.g. torchneuron/torchvision/torchtext)     |
| PYTORCH_PRECXX11                   | env var/system prop | Load the precxx11 build of libtorch                                           |
| PYTORCH_FLAVOR                     | env var/system prop | Force override auto detection (e.g. cpu/cpu-precxx11/cu102/cu116-precxx11)    |
| PYTORCH_JIT_LOG_LEVEL              | env var             | Enable JIT logging                                                            |
| ai.djl.pytorch.native_helper       | system prop         | A user-provided custom loader class to help locate PyTorch native resources   |
| ai.djl.pytorch.num_threads         | system prop         | Override the OMP_NUM_THREADS environment variable                             |
| ai.djl.pytorch.num_interop_threads | system prop         | Set PyTorch interop threads                                                   |
| ai.djl.pytorch.graph_optimizer     | system prop         | Enable/disable JIT execution optimization, default: true. See: https://github.com/deepjavalibrary/djl/blob/master/docs/development/inference_performance_optimization.md#graph-optimizer |
| ai.djl.pytorch.cudnn_benchmark     | system prop         | Speed up loading of ConvNN-related models, default: false                     |
| ai.djl.pytorch.use_mkldnn          | system prop         | Enable MKLDNN, default: false; not recommended, use at your own risk          |

### TensorFlow

| Key                         | Type                | Description                                        |
|-----------------------------|---------------------|----------------------------------------------------|
| TENSORFLOW_LIBRARY_PATH     | env var/system prop | User-provided custom TensorFlow native library     |
| TENSORRT_EXTRA_LIBRARY_PATH | env var/system prop | Extra TensorFlow custom operators library to load  |
| TF_CPP_MIN_LOG_LEVEL        | env var             | TensorFlow log level                               |
| ai.djl.tensorflow.debug     | env var             | Enable devicePlacement logging, default: false     |

### MXNet

| Key                               | Type                | Description                                                                               |
|-----------------------------------|---------------------|-------------------------------------------------------------------------------------------|
| MXNET_LIBRARY_PATH                | env var/system prop | User-provided custom MXNet native library                                                 |
| MXNET_VERSION                     | env var/system prop | The version of a custom MXNet build                                                       |
| MXNET_EXTRA_LIBRARY_PATH          | env var/system prop | Load extra MXNet custom libraries, e.g. Elastic Inference                                 |
| MXNET_EXTRA_LIBRARY_VERBOSE       | env var/system prop | Set verbosity for MXNet custom library                                                    |
| ai.djl.mxnet.static_alloc         | system prop         | CachedOp option, default: true                                                            |
| ai.djl.mxnet.static_shape         | system prop         | CachedOp option, default: true                                                            |
| ai.djl.use_local_parameter_server | system prop         | Use the Java parameter server instead of the MXNet native implementation, default: false  |

### PaddlePaddle

| Key                                     | Type                | Description                                       |
|-----------------------------------------|---------------------|---------------------------------------------------|
| PADDLE_LIBRARY_PATH                     | env var/system prop | User-provided custom PaddlePaddle native library  |
| ai.djl.paddlepaddle.disable_alternative | system prop         | Disable alternative engine                        |

### Huggingface tokenizers

| Key              | Type    | Description                                                 |
|------------------|---------|-------------------------------------------------------------|
| TOKENIZERS_CACHE | env var | User-provided custom Huggingface tokenizer native library   |

### Python

| Key                               | Type                | Description                                                                                                                                |
|-----------------------------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| PYTHON_EXECUTABLE                 | env var             | The location of the python executable, default: python                                                                                      |
| DJL_ENTRY_POINT                   | env var             | The entry point python file or module, default: model.py                                                                                    |
| MODEL_LOADING_TIMEOUT             | env var             | Python worker model loading timeout, default: 240 seconds                                                                                   |
| PREDICT_TIMEOUT                   | env var             | Python predict call timeout, default: 120 seconds                                                                                           |
| DJL_VENV_DIR                      | env var/system prop | The venv directory, default: $DJL_CACHE_DIR/venv                                                                                            |
| ai.djl.python.disable_alternative | system prop         | Disable alternative engine                                                                                                                  |
| TENSOR_PARALLEL_DEGREE            | env var             | Set the tensor parallel degree. For MPI mode, the default is the number of accelerators. Use "max" in non-MPI mode to use all GPUs for tensor parallelism. |

## Global Model Server settings

Global settings are configured at the model server level. Changes to these settings usually require a model server restart to take effect.

Most of the model server specific configuration can be set in the conf/config.properties file. You can find the configuration keys here: ConfigManager.java

Each configuration key can also be overridden by an environment variable with the SERVING_ prefix, for example:

```
export SERVING_JOB_QUEUE_SIZE=1000 # This will override JOB_QUEUE_SIZE in the config
```
| Key               | Type    | Description                                                                                               |
|-------------------|---------|-----------------------------------------------------------------------------------------------------------|
| MODEL_SERVER_HOME | env var | DJLServing home directory, default: installation directory (e.g. /usr/local/Cellar/djl-serving/0.19.0/)    |
| DEFAULT_JVM_OPTS  | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`<br>Override default JVM startup options and system properties. |
| JAVA_OPTS         | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`<br>Add extra JVM options.                              |
| SERVING_OPTS      | env var | default: N/A<br>Add serving related JVM options.                                                            |

Some DJL configuration options can only be set through JVM system properties; set the DEFAULT_JVM_OPTS environment variable to configure them:

- `-Dai.djl.pytorch.num_interop_threads=2` overrides the interop threads for PyTorch
- `-Dai.djl.pytorch.num_threads=2` overrides OMP_NUM_THREADS for PyTorch
- `-Dai.djl.logging.level=debug` changes the DJL logging level
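For example, to keep the default log4j configuration while tuning the PyTorch thread settings (the thread counts below are illustrative):

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml \
    -Dai.djl.pytorch.num_interop_threads=2 \
    -Dai.djl.pytorch.num_threads=2"
```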

## Model specific settings

You can set per-model settings by adding a serving.properties file in the root of your model directory (or .zip).

You can set the number of workers for each model: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8

For example, set minimum workers and maximum workers for your model:

```
minWorkers=32
maxWorkers=64
```

Or you can configure minimum workers and maximum workers differently for GPU and CPU:

```
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
```

Job queue size, batch size, max batch delay, and max worker idle time can be configured at the per-model level; these override the global settings:

```
job_queue_size=10
batch_size=2
max_batch_delay=1
max_idle_time=120
```

You can configure which devices to load the model on; the default is *:

```
load_on_devices=gpu4;gpu5
# or simply:
load_on_devices=4;5
```
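Putting these together, a complete serving.properties might look like the following sketch (the engine name and all values are illustrative):

```
engine=PyTorch
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
job_queue_size=10
batch_size=2
max_batch_delay=1
max_idle_time=120
load_on_devices=4;5
```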

### Python (DeepSpeed)

For the Python (DeepSpeed) engine, DJL loads multiple workers sequentially by default to avoid running out of memory. You can reduce model loading time by loading workers in parallel if you know the peak memory won't cause out of memory:

```
# allow loading DeepSpeed workers in parallel
option.parallel_loading=true
# specify tensor parallel degree (number of partitions)
option.tensor_parallel_degree=2
# specify per-model timeouts
option.model_loading_timeout=600
option.predict_timeout=240
# mark the model as failure after python process crashing 10 times
retry_threshold=0

# enable virtual environment
option.enable_venv=true

# use the built-in DeepSpeed handler
option.entryPoint=djl_python.deepspeed
# pass extra options to model.py or the built-in handler
option.model_id=gpt2
option.data_type=fp32
option.max_new_tokens=50

# define custom environment variables
env=LARGE_TENSOR=1
# specify the path to the python executable
option.pythonExecutable=/usr/bin/python3
```

## Engine specific settings

DJL supports 12 deep learning frameworks, and each framework has its own settings. Please refer to each framework's documentation for details.

A common setting for most engines is OMP_NUM_THREADS. For the best throughput, DJLServing sets this to 1 by default (for some engines, e.g. MXNet, this value must be 1). Since this is a global environment variable, setting it impacts all other engines as well.

The following table shows some engine-specific environment variables that are overridden by default by DJLServing:

| Key                    | Engine     | Description                                          |
|------------------------|------------|------------------------------------------------------|
| TF_NUM_INTEROP_THREADS | TensorFlow | default 1, OMP_NUM_THREADS will override this value  |
| TF_NUM_INTRAOP_THREADS | TensorFlow | default 1                                            |
| TF_CPP_MIN_LOG_LEVEL   | TensorFlow | default 1                                            |
| MXNET_ENGINE_TYPE      | MXNet      | this value must be NaiveEngine                       |
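For example, if a TensorFlow model benefits from more intra-op parallelism, you could raise these defaults before starting the server (the values below are illustrative; keep in mind that OMP_NUM_THREADS is global and affects all engines):

```
export OMP_NUM_THREADS=4
export TF_NUM_INTRAOP_THREADS=4
```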

## Appendix

### How to configure logging

Option 1: enable debug log:

```
export SERVING_OPTS="-Dai.djl.logging.level=debug"
```

Option 2: use your own log4j2.xml:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/MY_CONF/log4j2.xml"
```

DJLServing provides a few built-in log4j2-XXX.xml files in the DJLServing containers. Use the following environment variable to print the HTTP access log to the console:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-access.xml"
```

Use the following environment variable to print the access log, server metrics, and model metrics to the console:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-console.xml"
```

### How to download uncompressed model from S3

To enable fast model downloading, you can store your model artifacts (weights) in an S3 bucket, and only keep the model code and metadata in the model.tar.gz (.zip) file. DJL can leverage s5cmd to download uncompressed files from S3 at extremely high speed.

To enable s5cmd downloading, configure serving.properties as follows:

```
option.model_id=s3://YOUR_BUCKET/...
```
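For example, the packaged archive could then contain only code and configuration, with the weights left uncompressed in S3 (the bucket name and layout below are hypothetical):

```
model.tar.gz
├── serving.properties   # contains: option.model_id=s3://my-bucket/my-model/
└── model.py             # custom handler code
```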

### How to resolve python package conflict between models

If you want to deploy multiple python models but their dependencies conflict, you can enable a python virtual environment for each model:

```
option.enable_venv=true
```
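A sketch of what this could look like per model, assuming the model directory also ships a requirements.txt with its pinned dependencies (the package and version below are hypothetical):

```
# serving.properties
option.enable_venv=true

# requirements.txt (dependencies that would conflict with another model)
transformers==4.30.2
```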