diff --git a/README.md b/README.md
index 60d56107ce..bfc7152c4e 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ Cortex is an open source platform for large-scale inference workloads.
## Model serving infrastructure
-* Supports deploying TensorFlow, PyTorch, sklearn and other models as realtime or batch APIs.
+* Supports deploying TensorFlow, PyTorch, and other models as realtime or batch APIs.
* Ensures high availability with availability zones and automated instance restarts.
* Runs inference on on-demand instances or spot instances with on-demand backups.
* Autoscales to handle production workloads with support for overprovisioning.
@@ -98,13 +98,13 @@ import cortex
cx = cortex.client("aws")
cx.create_api(api_spec, predictor=PythonPredictor, requirements=requirements)
-# creating https://example.com/text-generator
+# creating http://example.com/text-generator
```
#### Consume your API
```bash
-$ curl https://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
+$ curl http://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
```
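
The endpoint can also be consumed from Python, for example with the `requests` library (a sketch using the placeholder URL from the example above):

```python
import requests

# placeholder endpoint from the example above
response = requests.post(
    "http://example.com/text-generator",
    json={"text": "hello world"},
)
print(response.text)
```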
diff --git a/docs/clusters/aws/gpu.md b/docs/clusters/aws/gpu.md
deleted file mode 100644
index 243f8e3862..0000000000
--- a/docs/clusters/aws/gpu.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Using GPUs
-
-To use GPUs:
-
-1. Make sure your AWS account is subscribed to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM).
-1. You may need to [request a limit increase](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) for your desired instance type.
-1. Set instance type to an AWS GPU instance (e.g. `g4dn.xlarge`) when installing Cortex.
-1. Set the `gpu` field in the `compute` configuration for your API. One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.
-
-## Tips
-
-### If using `processes_per_replica` > 1, TensorFlow-based models, and Python Predictor
-
-When using `processes_per_replica` > 1 with TensorFlow-based models (including Keras) in the Python Predictor, loading the model in separate processes at the same time will throw a `CUDA_ERROR_OUT_OF_MEMORY: out of memory` error. This is because the first process that loads the model will allocate all of the GPU's memory, leaving none for the other processes. To prevent this from happening, the per-process GPU memory usage can be limited. There are two methods:
-
-1\) Configure the model to allocate only as much memory as it requires, via [tf.config.experimental.set_memory_growth()](https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth):
-
-```python
-for gpu in tf.config.list_physical_devices("GPU"):
-    tf.config.experimental.set_memory_growth(gpu, True)
-```
-
-2\) Impose a hard limit on how much memory the model can use, via [tf.config.set_logical_device_configuration()](https://www.tensorflow.org/api_docs/python/tf/config/set_logical_device_configuration):
-
-```python
-mem_limit_mb = 1024
-for gpu in tf.config.list_physical_devices("GPU"):
-    tf.config.set_logical_device_configuration(
-        gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=mem_limit_mb)]
-    )
-```
-
-See the [TensorFlow GPU guide](https://www.tensorflow.org/guide/gpu) and this [blog post](https://medium.com/@starriet87/tensorflow-2-0-wanna-limit-gpu-memory-10ad474e2528) for additional information.
diff --git a/docs/clusters/aws/inferentia.md b/docs/clusters/aws/inferentia.md
deleted file mode 100644
index e96defee95..0000000000
--- a/docs/clusters/aws/inferentia.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# Using Inferentia
-
-1. You may need to [request a limit increase](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) for running Inferentia instances.
-1. Set the instance type to an AWS Inferentia instance (e.g. `inf1.xlarge`) when creating your Cortex cluster.
-1. Set the `inf` field in the `compute` configuration for your API. One unit of `inf` corresponds to one Inferentia ASIC with 4 NeuronCores *(not the same thing as `cpu`)* and 8GB of cache memory *(not the same thing as `mem`)*. Fractional requests are not allowed.
-
-## Neuron
-
-Inferentia ASICs come in different sizes depending on the instance type:
-
-* `inf1.xlarge`/`inf1.2xlarge` - each has 1 Inferentia ASIC
-* `inf1.6xlarge` - has 4 Inferentia ASICs
-* `inf1.24xlarge` - has 16 Inferentia ASICs
-
-Each Inferentia ASIC comes with 4 NeuronCores and 8GB of cache memory. To better understand how Inferentia ASICs work, read these [technical notes](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/README.md) and this [FAQ](https://github.com/aws/aws-neuron-sdk/blob/master/FAQ.md).
-
-### NeuronCore Groups
-
-A [NeuronCore Group](https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/tutorial-NeuronCore-Group.md) (NCG) is a set of NeuronCores that is used to load and run a compiled model. NCGs exist to aggregate NeuronCores to improve hardware performance. Models can be shared within an NCG, but this would require the device driver to dynamically context switch between each model, which degrades performance. Therefore we've decided to only allow one model per NCG (unless you are using a multi-model endpoint, in which case there will be multiple models on a single NCG, and there will be context switching).
-
-Each Cortex API process will have its own copy of the model and will run on its own NCG (the number of API processes is configured by the `processes_per_replica` field for Realtime APIs in the API configuration). Each NCG will have an equal share of NeuronCores. Therefore, the size of each NCG will be `4 * inf / processes_per_replica` (`inf` refers to your API's `compute` request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip).
-
-For example, if your API requests 2 `inf` chips, there will be 8 NeuronCores available. If you set `processes_per_replica` to 1, there will be one copy of your model running on a single NCG of size 8 NeuronCores. If `processes_per_replica` is 2, there will be two copies of your model, each running on a separate NCG of size 4 NeuronCores. If `processes_per_replica` is 4, there will be 4 NCGs of size 2 NeuronCores, and if `processes_per_replica` is 8, there will be 8 NCGs of size 1 NeuronCore. In this scenario, these are the only valid values for `processes_per_replica`. In other words, the total number of requested NeuronCores (which equals 4 * the number of requested Inferentia chips) must be divisible by `processes_per_replica`.
-
-The 8GB cache memory is shared between all 4 NeuronCores of an Inferentia chip. Therefore an NCG with 8 NeuronCores (i.e. 2 Inferentia chips) will have access to 16GB of cache memory. An NCG with 2 NeuronCores will have access to 8GB of cache memory, which will be shared with the other NCG of size 2 running on the same Inferentia chip.
-
-### Compiling models
-
-Before a model can be deployed on Inferentia chips, it must be compiled for Inferentia. The Neuron compiler can be used to convert a regular TensorFlow SavedModel or PyTorch model into the hardware-specific instruction set for Inferentia. Inferentia currently supports compiled models from TensorFlow and PyTorch.
-
-By default, the Neuron compiler will compile a model to use 1 NeuronCore, but it can be manually set to a different size (1, 2, 4, etc.).
-
-For optimal performance, your model should be compiled to run on the number of NeuronCores available to it. The number of NeuronCores will be `4 * inf / processes_per_replica` (`inf` refers to your API's `compute` request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip). See [NeuronCore Groups](#neuroncore-groups) above for an example, and see [Improving performance](#improving-performance) below for a discussion of choosing the appropriate number of NeuronCores.
-
-Here is an example of compiling a TensorFlow SavedModel for Inferentia:
-
-```python
-import tensorflow.neuron as tfn
-
-tfn.saved_model.compile(
-    model_dir,
-    compiled_model_dir,
-    batch_size,
-    compiler_args=["--num-neuroncores", "1"],
-)
-```
-
-Here is an example of compiling a PyTorch model for Inferentia:
-
-```python
-import torch_neuron, torch
-
-model.eval()
-example_input = torch.zeros([batch_size] + input_shape, dtype=torch.float32)
-model_neuron = torch.neuron.trace(
-    model,
-    example_inputs=[example_input],
-    compiler_args=["--num-neuroncores", "1"]
-)
-model_neuron.save(compiled_model)
-```
-
-The versions of `tensorflow-neuron` and `torch-neuron` that are used by Cortex are found in the Realtime API pre-installed packages list and the Batch API pre-installed packages list. When installing these packages with `pip` to compile models of your own, use the extra index URL `--extra-index-url=https://pip.repos.neuron.amazonaws.com`.
-
-A list of model compilation examples for Inferentia can be found on the [`aws/aws-neuron-sdk`](https://github.com/aws/aws-neuron-sdk) repo for [TensorFlow](https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/) and for [PyTorch](https://github.com/aws/aws-neuron-sdk/blob/master/docs/pytorch-neuron/README.md).
-
-### Improving performance
-
-A few things can be done to improve performance using compiled models on Cortex:
-
-1. There's a minimum number of NeuronCores for which a model can be compiled. That number depends on the model's architecture. Generally, compiling a model for more cores than its required minimum helps to distribute the model's operators across multiple cores, which in turn [can lead to lower latency](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/neuroncore-pipeline.md). However, compiling a model for more NeuronCores means that you'll have to set `processes_per_replica` to be lower so that the NeuronCore Group has access to the number of NeuronCores for which you compiled your model. This is acceptable if latency is your top priority, but if throughput is more important to you, this tradeoff is usually not worth it. To maximize throughput, compile your model for as few NeuronCores as possible and increase `processes_per_replica` to the maximum possible (see above for a sample calculation).
-
-1. Try to achieve a near [100% placement](https://github.com/aws/aws-neuron-sdk/blob/b28262e3072574c514a0d72ad3fe5ca48686d449/src/examples/tensorflow/keras_resnet50/pb2sm_compile.py#L59) of your model's graph onto the NeuronCores. During the compilation phase, any operators that can't execute on NeuronCores will be compiled to execute on the machine's CPU and memory instead. Even if just a few percent of the operations reside on the host's CPU/memory, the maximum throughput of the instance can be significantly limited.
-
-1. Use the [`--static-weights` compiler option](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/performance-tuning.md#compiling-for-pipeline-optimization) when possible. This option tells the compiler to make it such that the entire model gets cached onto the NeuronCores. This avoids a lot of back-and-forth between the machine's CPU/memory and the Inferentia ASICs.
diff --git a/docs/clusters/aws/install.md b/docs/clusters/aws/install.md
index 904f2c8810..8885807835 100644
--- a/docs/clusters/aws/install.md
+++ b/docs/clusters/aws/install.md
@@ -5,6 +5,7 @@
1. [Docker](https://docs.docker.com/install)
1. Subscribe to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM) (for GPU clusters)
1. An IAM user with `AdministratorAccess` and programmatic access (see [security](security.md) if you'd like to use less privileged credentials after spinning up your cluster)
+1. You may need to [request a limit increase](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) for your desired instance type
## Spin up Cortex on your AWS account
diff --git a/docs/clusters/aws/spot.md b/docs/clusters/aws/spot.md
index a5484a7db2..71a49f7992 100644
--- a/docs/clusters/aws/spot.md
+++ b/docs/clusters/aws/spot.md
@@ -1,7 +1,5 @@
# Spot instances
-[Spot instances](https://aws.amazon.com/ec2/spot) are spare capacity that AWS sells at a discount (up to 90%). The caveat is that spot instances may not always be available, and can be recalled by AWS at any time. Cortex allows you to use spot instances in your cluster to take advantage of the discount while ensuring uptime and reliability of APIs. You can configure your cluster to use spot instances using the configuration below:
-
```yaml
# cluster.yaml
@@ -9,7 +7,7 @@
spot: false
spot_config:
- # additional instances with identical or better specs than the primary instance type (defaults to only the primary instance)
+ # additional instance types with identical or better specs than the primary cluster instance type (defaults to only the primary instance type)
instance_distribution: # [similar_instance_type_1, similar_instance_type_2]
# minimum number of on demand instances (default: 0)
diff --git a/docs/summary.md b/docs/summary.md
index 9d16be3748..d7f3ce747c 100644
--- a/docs/summary.md
+++ b/docs/summary.md
@@ -24,6 +24,7 @@
* Multi-model
* [Example](workloads/multi-model/example.md)
* [Configuration](workloads/multi-model/configuration.md)
+ * [Caching](workloads/multi-model/caching.md)
* Traffic Splitter
* [Example](workloads/traffic-splitter/example.md)
* [Configuration](workloads/traffic-splitter/configuration.md)
@@ -31,7 +32,7 @@
* [Example](workloads/dependencies/example.md)
* [Python packages](workloads/dependencies/python-packages.md)
* [System packages](workloads/dependencies/system-packages.md)
- * [Docker images](workloads/dependencies/docker-images.md)
+ * [Custom images](workloads/dependencies/images.md)
## Clusters
@@ -40,8 +41,6 @@
* [Update](clusters/aws/update.md)
* [Security](clusters/aws/security.md)
* [Spot instances](clusters/aws/spot.md)
- * [GPUs](clusters/aws/gpu.md)
- * [Inferentia](clusters/aws/inferentia.md)
* [Networking](clusters/aws/networking.md)
* [VPC peering](clusters/aws/vpc-peering.md)
* [Custom domain](clusters/aws/custom-domain.md)
diff --git a/docs/workloads/batch/configuration.md b/docs/workloads/batch/configuration.md
index 0a7653b090..1effda8ed7 100644
--- a/docs/workloads/batch/configuration.md
+++ b/docs/workloads/batch/configuration.md
@@ -18,10 +18,10 @@
endpoint: # the endpoint for the API (default: )
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: # GPU request per worker (default: 0)
- inf: # Inferentia ASIC request per worker (default: 0)
- mem: # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: # Inferentia request per worker. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```
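
For illustration, here is a sketch of how these compute fields might look in an API spec passed to the Python client (`cx.create_api(api_spec, ...)`); the name, predictor path, and values are arbitrary examples:

```python
# hypothetical Batch API spec illustrating the compute units described above
api_spec = {
    "name": "image-classifier",
    "kind": "BatchAPI",
    "predictor": {"type": "python", "path": "predictor.py"},
    "compute": {
        "cpu": "500m",  # half a virtual CPU (equivalent to 0.5)
        "gpu": 1,       # one virtual GPU per worker
        "mem": "4Gi",   # 4 gibibytes of memory per worker
    },
}
```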
## TensorFlow Predictor
@@ -54,10 +54,10 @@
endpoint: # the endpoint for the API (default: )
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: # GPU request per worker (default: 0)
- inf: # Inferentia ASIC request per worker (default: 0)
- mem: # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: # Inferentia request per worker. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```
## ONNX Predictor
@@ -84,7 +84,7 @@
endpoint: # the endpoint for the API (default: )
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: # GPU request per worker (default: 0)
- mem: # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ mem: # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```
diff --git a/docs/workloads/batch/predictors.md b/docs/workloads/batch/predictors.md
index f3d1143d59..2dbc76ac21 100644
--- a/docs/workloads/batch/predictors.md
+++ b/docs/workloads/batch/predictors.md
@@ -88,83 +88,10 @@ class PythonPredictor:
pass
```
-For proper separation of concerns, it is recommended to use the constructor's `config` parameter for information such as from where to download the model and initialization files, or any configurable model parameters. You define `config` in your API configuration, and it is passed through to your Predictor's constructor. The `config` parameters in the `API configuration` can be overridden by providing `config` in the job submission requests.
-
-### Pre-installed packages
-
-The following Python packages are pre-installed in Python Predictors and can be used in your implementations:
-
-```text
-boto3==1.14.53
-cloudpickle==1.6.0
-Cython==0.29.21
-dill==0.3.2
-fastapi==0.61.1
-google-cloud-storage==1.32.0
-joblib==0.16.0
-Keras==2.4.3
-msgpack==1.0.0
-nltk==3.5
-np-utils==0.5.12.1
-numpy==1.19.1
-opencv-python==4.4.0.42
-pandas==1.1.1
-Pillow==7.2.0
-pyyaml==5.3.1
-requests==2.24.0
-scikit-image==0.17.2
-scikit-learn==0.23.2
-scipy==1.5.2
-six==1.15.0
-statsmodels==0.12.0
-sympy==1.6.2
-tensorflow-hub==0.9.0
-tensorflow==2.3.0
-torch==1.6.0
-torchvision==0.7.0
-xgboost==1.2.0
-```
-
-#### Inferentia-equipped APIs
-
-The list is slightly different for inferentia-equipped APIs:
-
-```text
-boto3==1.13.7
-cloudpickle==1.6.0
-Cython==0.29.21
-dill==0.3.1.1
-fastapi==0.54.1
-google-cloud-storage==1.32.0
-joblib==0.16.0
-msgpack==1.0.0
-neuron-cc==1.0.20600.0+0.b426b885f
-nltk==3.5
-np-utils==0.5.12.1
-numpy==1.18.2
-opencv-python==4.4.0.42
-pandas==1.1.1
-Pillow==7.2.0
-pyyaml==5.3.1
-requests==2.23.0
-scikit-image==0.17.2
-scikit-learn==0.23.2
-scipy==1.3.2
-six==1.15.0
-statsmodels==0.12.0
-sympy==1.6.2
-tensorflow==1.15.4
-tensorflow-neuron==1.15.3.1.0.2043.0
-torch==1.5.1
-torch-neuron==1.5.1.1.0.1721.0
-torchvision==0.6.1
-```
-
-
-The pre-installed system packages are listed in [images/python-predictor-cpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/python-predictor-cpu/Dockerfile) (for CPU), [images/python-predictor-gpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/python-predictor-gpu/Dockerfile) (for GPU), or [images/python-predictor-inf/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/python-predictor-inf/Dockerfile) (for Inferentia).
-
## TensorFlow Predictor
+**Uses TensorFlow version 2.3.0 by default**
+
### Interface
```python
@@ -220,32 +147,10 @@ Cortex provides a `tensorflow_client` to your Predictor's constructor. `tensorfl
When multiple models are defined using the Predictor's `models` field, the `tensorflow_client.predict()` method expects a second argument `model_name` which must hold the name of the model that you want to use for inference (for example: `self.client.predict(payload, "text-generator")`).
-For proper separation of concerns, it is recommended to use the constructor's `config` parameter for information such as from where to download the model and initialization files, or any configurable model parameters. You define `config` in your API configuration, and it is passed through to your Predictor's constructor. The `config` parameters in the `API configuration` can be overridden by providing `config` in the job submission requests.
-
-### Pre-installed packages
-
-The following Python packages are pre-installed in TensorFlow Predictors and can be used in your implementations:
-
-```text
-boto3==1.14.53
-dill==0.3.2
-fastapi==0.61.1
-google-cloud-storage==1.32.0
-msgpack==1.0.0
-numpy==1.19.1
-opencv-python==4.4.0.42
-pyyaml==5.3.1
-requests==2.24.0
-tensorflow-hub==0.9.0
-tensorflow-serving-api==2.3.0
-tensorflow==2.3.0
-```
-
-
-The pre-installed system packages are listed in [images/tensorflow-predictor/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/tensorflow-predictor/Dockerfile).
-
## ONNX Predictor
+**Uses ONNX Runtime version 1.4.0 by default**
+
### Interface
```python
@@ -300,24 +205,3 @@ class ONNXPredictor:
Cortex provides an `onnx_client` to your Predictor's constructor. `onnx_client` is an instance of [ONNXClient](https://github.com/cortexlabs/cortex/tree/master/pkg/cortex/serve/cortex_internal/lib/client/onnx.py) that manages an ONNX Runtime session to make predictions using your model. It should be saved as an instance variable in your Predictor, and your `predict()` function should call `onnx_client.predict()` to make an inference with your exported ONNX model. Preprocessing of the JSON payload and postprocessing of predictions can be implemented in your `predict()` function as well.
When multiple models are defined using the Predictor's `models` field, the `onnx_client.predict()` method expects a second argument `model_name` which must hold the name of the model that you want to use for inference (for example: `self.client.predict(model_input, "text-generator")`).
-
-For proper separation of concerns, it is recommended to use the constructor's `config` parameter for information such as from where to download the model and initialization files, or any configurable model parameters. You define `config` in your API configuration, and it is passed through to your Predictor's constructor. The `config` parameters in the `API configuration` can be overridden by providing `config` in the job submission requests.
-
-### Pre-installed packages
-
-The following Python packages are pre-installed in ONNX Predictors and can be used in your implementations:
-
-```text
-boto3==1.14.53
-dill==0.3.2
-fastapi==0.61.1
-google-cloud-storage==1.32.0
-msgpack==1.0.0
-numpy==1.19.1
-onnxruntime==1.4.0
-pyyaml==5.3.1
-requests==2.24.0
-```
-
-
-The pre-installed system packages are listed in [images/onnx-predictor-cpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/onnx-predictor-cpu/Dockerfile) (for CPU) or [images/onnx-predictor-gpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/onnx-predictor-gpu/Dockerfile) (for GPU).
diff --git a/docs/workloads/dependencies/docker-images.md b/docs/workloads/dependencies/images.md
similarity index 93%
rename from docs/workloads/dependencies/docker-images.md
rename to docs/workloads/dependencies/images.md
index 111b899d5b..4372e2be14 100644
--- a/docs/workloads/dependencies/docker-images.md
+++ b/docs/workloads/dependencies/images.md
@@ -1,6 +1,6 @@
# Docker images
-You can build a custom Docker image for use in your APIs. Common reasons to do this are to avoid installing dependencies during replica initialization, to have smaller images, and/or to mirror images to your cloud's container registry (for speed and reliability).
+Cortex includes a default set of Docker images with pre-installed Python and system packages, but you can build custom images for use in your APIs. Common reasons to do this are to avoid installing dependencies during replica initialization, to have smaller images, and/or to mirror images to your cloud's container registry (for speed and reliability).
## Create a Dockerfile
diff --git a/docs/workloads/dependencies/python-packages.md b/docs/workloads/dependencies/python-packages.md
index e00bfdf5c8..4d6f44533e 100644
--- a/docs/workloads/dependencies/python-packages.md
+++ b/docs/workloads/dependencies/python-packages.md
@@ -1,4 +1,4 @@
-# Python / Conda packages
+# Python packages
## PyPI packages
diff --git a/docs/workloads/multi-model/caching.md b/docs/workloads/multi-model/caching.md
new file mode 100644
index 0000000000..23ccdc9e8f
--- /dev/null
+++ b/docs/workloads/multi-model/caching.md
@@ -0,0 +1,14 @@
+# Multi-model caching
+
+Multi-model caching allows each replica to serve more models than would fit into its memory by keeping only a specified number of models in memory (and on disk) at a time. When the in-memory model limit is reached, the least recently accessed model is evicted from the cache. This can be useful when you have many models and only some of them are frequently accessed while the rest are rarely used, or when running on smaller instances to control costs.
+
+The model cache is a two-layer cache, configured by the following parameters in the `predictor.models` configuration:
+
+* `cache_size` sets the number of models to keep in memory
+* `disk_cache_size` sets the number of models to keep on disk (must be greater than or equal to `cache_size`)
+
+Both of these fields must be specified, in addition to either the `dir` or `paths` field (which specifies the model paths, see [models](../realtime/models.md) for documentation). Multi-model caching is only supported if `predictor.processes_per_replica` is set to 1 (the default value).
+
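+As a sketch, a multi-model API spec with caching enabled might look like the following (shown as a Python API spec for the Cortex client; the predictor type, bucket path, and cache sizes are placeholder assumptions):
+
+```python
+# hypothetical example: serve many models from one replica, keeping at most
+# 5 models in memory and 10 on disk at any time
+api_spec = {
+    "name": "multi-model-api",
+    "kind": "RealtimeAPI",
+    "predictor": {
+        "type": "tensorflow",
+        "path": "predictor.py",
+        "processes_per_replica": 1,  # multi-model caching requires the default of 1
+        "models": {
+            "dir": "s3://my-bucket/models/",  # placeholder bucket
+            "cache_size": 5,
+            "disk_cache_size": 10,
+        },
+    },
+}
+```
+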
+## Out of memory errors
+
+Cortex runs a background process every 10 seconds that counts the number of models in memory and on disk, and evicts the least recently used models if the count exceeds `cache_size` / `disk_cache_size`. If many new models are requested between executions of this process, there may temporarily be more models in memory and/or on disk than the configured `cache_size` or `disk_cache_size` limits, which could lead to out-of-memory errors.
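+
+To make the eviction rule concrete, here is a conceptual sketch of LRU eviction (illustrative only, not Cortex's actual implementation):
+
+```python
+from collections import OrderedDict
+
+def touch(models: "OrderedDict[str, object]", name: str) -> None:
+    # mark `name` as the most recently used model
+    models.move_to_end(name)
+
+def evict_lru(models: "OrderedDict[str, object]", cache_size: int) -> None:
+    # drop least recently used entries until at most `cache_size` remain
+    while len(models) > cache_size:
+        models.popitem(last=False)  # the first entry is the least recently used
+```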
diff --git a/docs/workloads/realtime/configuration.md b/docs/workloads/realtime/configuration.md
index bf18e9f634..e11aa1d02a 100644
--- a/docs/workloads/realtime/configuration.md
+++ b/docs/workloads/realtime/configuration.md
@@ -32,10 +32,10 @@
endpoint: # the endpoint for the API (default: )
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide) (aws only)
compute:
- cpu: # CPU request per replica, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: # GPU request per replica (default: 0)
- inf: # Inferentia ASIC request per replica (default: 0) (aws only)
- mem: # memory request per replica, e.g. 200Mi or 1Gi (default: Null)
+ cpu: # CPU request per replica. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: # GPU request per replica. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: # Inferentia request per replica. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: # memory request per replica. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
autoscaling:
min_replicas: # minimum number of replicas (default: 1)
max_replicas: # maximum number of replicas (default: 100)
@@ -89,10 +89,10 @@
endpoint: # the endpoint for the API (default: )
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide) (aws only)
compute:
- cpu: # CPU request per replica, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: # GPU request per replica (default: 0)
- inf: # Inferentia ASIC request per replica (default: 0) (aws only)
- mem: # memory request per replica, e.g. 200Mi or 1Gi (default: Null)
+ cpu: # CPU request per replica. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: # GPU request per replica. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: # Inferentia request per replica. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: # memory request per replica. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
autoscaling:
min_replicas: # minimum number of replicas (default: 1)
max_replicas: # maximum number of replicas (default: 100)
@@ -140,9 +140,9 @@
endpoint: # the endpoint for the API (default: )
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide) (aws only)
compute:
- cpu: # CPU request per replica, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: # GPU request per replica (default: 0)
- mem: # memory request per replica, e.g. 200Mi or 1Gi (default: Null)
+ cpu: # CPU request per replica. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: # GPU request per replica. One unit of GPU corresponds to one virtual GPU (default: 0)
+ mem: # memory request per replica. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
autoscaling:
min_replicas: # minimum number of replicas (default: 1)
max_replicas: # maximum number of replicas (default: 100)
diff --git a/docs/workloads/realtime/models.md b/docs/workloads/realtime/models.md
index eb55b647d4..1337d741e0 100644
--- a/docs/workloads/realtime/models.md
+++ b/docs/workloads/realtime/models.md
@@ -148,7 +148,7 @@ or:
dir: s3://my-bucket/models/
```
-For the Python predictor type, the `models` field comes under the name of `multi_model_reloading`. It is also not necessary to specify the `multi_model_reloading` section at all, since you can download and load the model in your predictor's `__init__()` function. That said, it is necessary to use the `models` field to take advantage of [live model reloading](#live-model-reloading) or [multi model caching](#multi-model-caching).
+For the Python predictor type, the `models` field is named `multi_model_reloading`. It is also not necessary to specify the `multi_model_reloading` section at all, since you can download and load the model in your predictor's `__init__()` function. That said, this field is required to take advantage of live model reloading or multi-model caching.
When using the `models.paths` field, each path must be a valid model directory (see above for valid model directory structures).
@@ -372,22 +372,3 @@ You can also retrieve information about the model by calling the `onnx_client`'s
}
}
```
-
-## Multi model caching
-
-Multi model caching allows each API replica to serve more models than would all fit into its memory. It achieves this by keeping only a specified number of models in memory (and on disk) at a time. When the in-memory model limit has been reached, the least recently accessed model is evicted from the cache.
-
-This feature can be useful when you have hundreds or thousands of models, when some models are frequently accessed while a larger portion of them are rarely used, or when running on smaller instances to control costs.
-
-The model cache is a two-layer cache, configured by the following parameters in the `predictor.models` configuration:
-
-* `cache_size` sets the number of models to keep in memory
-* `disk_cache_size` sets the number of models to keep on disk (must be greater than or equal to `cache_size`)
-
-Both of these fields must be specified, in addition to either the `dir` or `paths` field (which specifies the model paths, see above for documentation). Multi model caching is only supported if `predictor.processes_per_replica` is set to 1 (the default value).
-
-### Caveats
-
-Cortex periodically runs a background script (every 10 seconds) that counts the number of models in memory and on disk, and evicts the least recently used models if the count exceeds `cache_size` / `disk_cache_size`.
-
-The benefit of this approach is that there are no added steps on the critical path of the inference. The limitation of this approach is that if many new models are requested between executions of the script, then until the script runs again, there may be more models in memory and/or on disk than the configured `cache_size` or `disk_cache_size` limits. This has the potential to lead to out-of-memory errors.
diff --git a/docs/workloads/realtime/predictors.md b/docs/workloads/realtime/predictors.md
index dfdc59bf90..8c2e211ea1 100644
--- a/docs/workloads/realtime/predictors.md
+++ b/docs/workloads/realtime/predictors.md
@@ -129,81 +129,10 @@ Your API can accept requests with different types of payloads such as `JSON`-par
Your `predictor` method can return different types of objects such as `JSON`-parseable, `string`, and `bytes` objects. Navigate to the [API responses](#api-responses) section to learn about how to configure your `predictor` method to respond with different response codes and content-types.
-### Pre-installed packages
-
-The following Python packages are pre-installed in Python Predictors and can be used in your implementations:
-
-```text
-boto3==1.14.53
-cloudpickle==1.6.0
-Cython==0.29.21
-dill==0.3.2
-fastapi==0.61.1
-google-cloud-storage==1.32.0
-joblib==0.16.0
-Keras==2.4.3
-msgpack==1.0.0
-nltk==3.5
-np-utils==0.5.12.1
-numpy==1.19.1
-opencv-python==4.4.0.42
-pandas==1.1.1
-Pillow==7.2.0
-pyyaml==5.3.1
-requests==2.24.0
-scikit-image==0.17.2
-scikit-learn==0.23.2
-scipy==1.5.2
-six==1.15.0
-statsmodels==0.12.0
-sympy==1.6.2
-tensorflow-hub==0.9.0
-tensorflow==2.3.0
-torch==1.6.0
-torchvision==0.7.0
-xgboost==1.2.0
-```
-
-#### Inferentia-equipped APIs
-
-The list is slightly different for inferentia-equipped APIs:
-
-```text
-boto3==1.13.7
-cloudpickle==1.6.0
-Cython==0.29.21
-dill==0.3.1.1
-fastapi==0.54.1
-google-cloud-storage==1.32.0
-joblib==0.16.0
-msgpack==1.0.0
-neuron-cc==1.0.20600.0+0.b426b885f
-nltk==3.5
-np-utils==0.5.12.1
-numpy==1.18.2
-opencv-python==4.4.0.42
-pandas==1.1.1
-Pillow==7.2.0
-pyyaml==5.3.1
-requests==2.23.0
-scikit-image==0.17.2
-scikit-learn==0.23.2
-scipy==1.3.2
-six==1.15.0
-statsmodels==0.12.0
-sympy==1.6.2
-tensorflow==1.15.4
-tensorflow-neuron==1.15.3.1.0.2043.0
-torch==1.5.1
-torch-neuron==1.5.1.1.0.1721.0
-torchvision==0.6.1
-```
-
-
-The pre-installed system packages are listed in [images/python-predictor-cpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/python-predictor-cpu/Dockerfile) (for CPU), [images/python-predictor-gpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/python-predictor-gpu/Dockerfile) (for GPU), or [images/python-predictor-inf/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/python-predictor-inf/Dockerfile) (for Inferentia).
-
## TensorFlow Predictor
+**Uses TensorFlow version 2.3.0 by default**
+
### Interface
```python
@@ -270,30 +199,10 @@ Your API can accept requests with different types of payloads such as `JSON`-par
Your `predictor` method can return different types of objects such as `JSON`-parseable, `string`, and `bytes` objects. Navigate to the [API responses](#api-responses) section to learn about how to configure your `predictor` method to respond with different response codes and content-types.
-### Pre-installed packages
-
-The following Python packages are pre-installed in TensorFlow Predictors and can be used in your implementations:
-
-```text
-boto3==1.14.53
-dill==0.3.2
-fastapi==0.61.1
-google-cloud-storage==1.32.0
-msgpack==1.0.0
-numpy==1.19.1
-opencv-python==4.4.0.42
-pyyaml==5.3.1
-requests==2.24.0
-tensorflow-hub==0.9.0
-tensorflow-serving-api==2.3.0
-tensorflow==2.3.0
-```
-
-
-The pre-installed system packages are listed in [images/tensorflow-predictor/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/tensorflow-predictor/Dockerfile).
-
## ONNX Predictor
+**Uses ONNX Runtime version 1.4.0 by default**
+
### Interface
```python
@@ -360,25 +269,6 @@ Your API can accept requests with different types of payloads such as `JSON`-par
Your `predictor` method can return different types of objects such as `JSON`-parseable, `string`, and `bytes` objects. Navigate to the [API responses](#api-responses) section to learn about how to configure your `predictor` method to respond with different response codes and content-types.
-### Pre-installed packages
-
-The following Python packages are pre-installed in ONNX Predictors and can be used in your implementations:
-
-```text
-boto3==1.14.53
-dill==0.3.2
-fastapi==0.61.1
-google-cloud-storage==1.32.0
-msgpack==1.0.0
-numpy==1.19.1
-onnxruntime==1.4.0
-pyyaml==5.3.1
-requests==2.24.0
-```
-
-
-The pre-installed system packages are listed in [images/onnx-predictor-cpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/onnx-predictor-cpu/Dockerfile) (for CPU) or [images/onnx-predictor-gpu/Dockerfile](https://github.com/cortexlabs/cortex/tree/master/images/onnx-predictor-gpu/Dockerfile) (for GPU).
-
## API requests
The type of the `payload` parameter in `predict(self, payload)` can vary based on the content type of the request. The `payload` parameter is parsed according to the `Content-Type` header in the request. Here are the parsing rules (see below for examples):
@@ -394,42 +284,12 @@ Here are some examples:
#### Making the request
-##### Curl
-
```bash
$ curl https://***.amazonaws.com/my-api \
    -X POST -H "Content-Type: application/json" \
    -d '{"key": "value"}'
```
-Or if you have a json file:
-
-```bash
-$ curl https://***.amazonaws.com/my-api \
- -X POST -H "Content-Type: application/json" \
- -d @file.json
-```
-
-##### Python
-
-```python
-import requests
-
-url = "https://***.amazonaws.com/my-api"
-requests.post(url, json={"key": "value"})
-```
-
-Or if you have a json string:
-
-```python
-import requests
-import json
-
-url = "https://***.amazonaws.com/my-api"
-jsonStr = json.dumps({"key": "value"})
-requests.post(url, data=jsonStr, headers={"Content-Type": "application/json"})
-```
-
#### Reading the payload
When sending a JSON payload, the `payload` parameter will be a Python object:
@@ -447,25 +307,12 @@ class PythonPredictor:
#### Making the request
-##### Curl
-
```bash
$ curl https://***.amazonaws.com/my-api \
    -X POST -H "Content-Type: application/octet-stream" \
    --data-binary @object.pkl
```
-##### Python
-
-```python
-import requests
-import pickle
-
-url = "https://***.amazonaws.com/my-api"
-pklBytes = pickle.dumps({"key": "value"})
-requests.post(url, data=pklBytes, headers={"Content-Type": "application/octet-stream"})
-```
-
#### Reading the payload
Since the `Content-Type: application/octet-stream` header is used, the `payload` parameter will be a `bytes` object:
@@ -501,8 +348,6 @@ class PythonPredictor:
#### Making the request
-##### Curl
-
```bash
$ curl https://***.amazonaws.com/my-api \
    -X POST \
@@ -511,22 +356,6 @@ $ curl https://***.amazonaws.com/my-api \
-F "image=@image.png"
```
-##### Python
-
-```python
-import requests
-import pickle
-
-url = "https://***.amazonaws.com/my-api"
-files = {
- "text": open("text.txt", "rb"),
- "object": open("object.pkl", "rb"),
- "image": open("image.png", "rb"),
-}
-
-requests.post(url, files=files)
-```
-
#### Reading the payload
When sending files via form data, the `payload` parameter will be `starlette.datastructures.FormData` (key-value pairs where the values are `starlette.datastructures.UploadFile`, see [Starlette's documentation](https://www.starlette.io/requests/#request-files)). Either `Content-Type: multipart/form-data` or `Content-Type: application/x-www-form-urlencoded` can be used (typically `Content-Type: multipart/form-data` is used for files, and is the default in the examples above).
@@ -554,23 +383,12 @@ class PythonPredictor:
#### Making the request
-##### Curl
-
```bash
$ curl https://***.amazonaws.com/my-api \
    -X POST \
    -d "key=value"
```
-##### Python
-
-```python
-import requests
-
-url = "https://***.amazonaws.com/my-api"
-requests.post(url, data={"key": "value"})
-```
-
#### Reading the payload
When sending text via form data, the `payload` parameter will be `starlette.datastructures.FormData` (key-value pairs where the values are strings, see [Starlette's documentation](https://www.starlette.io/requests/#request-files)). Either `Content-Type: multipart/form-data` or `Content-Type: application/x-www-form-urlencoded` can be used (typically `Content-Type: application/x-www-form-urlencoded` is used for text, and is the default in the examples above).
@@ -588,23 +406,12 @@ class PythonPredictor:
#### Making the request
-##### Curl
-
```bash
$ curl https://***.amazonaws.com/my-api \
    -X POST -H "Content-Type: text/plain" \
    -d "hello world"
```
-##### Python
-
-```python
-import requests
-
-url = "https://***.amazonaws.com/my-api"
-requests.post(url, "hello world", headers={"Content-Type": "text/plain"})
-```
-
#### Reading the payload
Since the `Content-Type: text/plain` header is used, the `payload` parameter will be a `string` object:
@@ -624,11 +431,11 @@ The response of your `predict()` function may be:
1. A JSON-serializable object (*lists*, *dictionaries*, *numbers*, etc.)
-2. A `string` object (e.g. `"class 1"`)
+1. A `string` object (e.g. `"class 1"`)
-3. A `bytes` object (e.g. `bytes(4)` or `pickle.dumps(obj)`)
+1. A `bytes` object (e.g. `bytes(4)` or `pickle.dumps(obj)`)
-4. An instance of [starlette.responses.Response](https://www.starlette.io/responses/#response)
+1. An instance of [starlette.responses.Response](https://www.starlette.io/responses/#response)
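
For instance, returning a `starlette.responses.Response` from a Python Predictor might look like this sketch (the payload handling is illustrative):

```python
from starlette.responses import Response

class PythonPredictor:
    def __init__(self, config):
        pass

    def predict(self, payload):
        # respond with raw bytes, an explicit content type, and a status code
        return Response(content=b"\x00\x01", media_type="application/octet-stream", status_code=200)
```
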
## Chaining APIs