Docs #1754 (merged)
6 changes: 3 additions & 3 deletions README.md
@@ -10,7 +10,7 @@ Cortex is an open source platform for large-scale inference workloads.

## Model serving infrastructure

- * Supports deploying TensorFlow, PyTorch, sklearn and other models as realtime or batch APIs.
+ * Supports deploying TensorFlow, PyTorch, and other models as realtime or batch APIs.
* Ensures high availability with availability zones and automated instance restarts.
* Runs inference on on-demand instances or spot instances with on-demand backups.
* Autoscales to handle production workloads with support for overprovisioning.
@@ -98,13 +98,13 @@ import cortex
cx = cortex.client("aws")
cx.create_api(api_spec, predictor=PythonPredictor, requirements=requirements)

- # creating https://example.com/text-generator
+ # creating http://example.com/text-generator
```
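
The snippet above uses `api_spec`, `PythonPredictor`, and `requirements` without defining them. Below is a minimal sketch of what they might look like, assuming a Hugging Face text-generation model; the `__init__(self, config)` / `predict(self, payload)` interface follows Cortex's Python Predictor convention, but everything else here is illustrative rather than taken from this diff.

```python
# hypothetical definitions for the names used in the snippet above

class PythonPredictor:
    def __init__(self, config):
        # runs once per worker: load the model into memory
        from transformers import pipeline
        self.generator = pipeline(task="text-generation")

    def predict(self, payload):
        # payload is the parsed JSON request body, e.g. {"text": "hello world"}
        return self.generator(payload["text"])[0]["generated_text"]

api_spec = {
    "name": "text-generator",
    "kind": "RealtimeAPI",
}

requirements = ["torch", "transformers"]
```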

#### Consume your API

```bash
- $ curl https://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
+ $ curl http://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
```
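
The same request can also be made from Python; here is a small sketch using the `requests` library and the placeholder endpoint from the example above.

```python
import requests

# equivalent of the curl command above (placeholder endpoint)
response = requests.post(
    "http://example.com/text-generator",
    json={"text": "hello world"},
)
print(response.status_code, response.text)
```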

<br>
33 changes: 0 additions & 33 deletions docs/clusters/aws/gpu.md

This file was deleted.

75 changes: 0 additions & 75 deletions docs/clusters/aws/inferentia.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/clusters/aws/install.md
@@ -5,6 +5,7 @@
1. [Docker](https://docs.docker.com/install)
1. Subscribe to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM) (for GPU clusters)
1. An IAM user with `AdministratorAccess` and programmatic access (see [security](security.md) if you'd like to use less privileged credentials after spinning up your cluster)
+ 1. You may need to [request a limit increase](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) for your desired instance type

## Spin up Cortex on your AWS account
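
As a rough sketch of the step this section describes, the cluster is created from a configuration file with the Cortex CLI; the exact flag syntax varies across Cortex versions, so treat this as illustrative.

```bash
# create the cluster defined in cluster.yaml (flag syntax may differ by version)
cortex cluster up --config cluster.yaml
```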

4 changes: 1 addition & 3 deletions docs/clusters/aws/spot.md
@@ -1,15 +1,13 @@
# Spot instances

[Spot instances](https://aws.amazon.com/ec2/spot) are spare capacity that AWS sells at a discount (up to 90%). The caveat is that spot instances may not always be available, and can be recalled by AWS at any time. Cortex allows you to use spot instances in your cluster to take advantage of the discount while ensuring uptime and reliability of APIs. You can configure your cluster to use spot instances using the configuration below:

```yaml
# cluster.yaml

# whether to use spot instances in the cluster (default: false)
spot: false

spot_config:
- # additional instance types with identical or better specs than the primary instance type (defaults to only the primary instance)
+ # additional instance types with identical or better specs than the primary cluster instance type (defaults to only the primary instance type)
instance_distribution: # [similar_instance_type_1, similar_instance_type_2]

# minimum number of on demand instances (default: 0)
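# --- illustrative example (not part of this diff) ---
# a hypothetical cluster.yaml fragment that enables spot instances with a
# fallback distribution; the instance type names are placeholders, and the
# primary type is assumed to come from the cluster's instance_type field:
#
#   instance_type: m5.large
#   spot: true
#   spot_config:
#     instance_distribution: [m5a.large, m5d.large, m5n.large]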
5 changes: 2 additions & 3 deletions docs/summary.md
@@ -24,14 +24,15 @@
* Multi-model
* [Example](workloads/multi-model/example.md)
* [Configuration](workloads/multi-model/configuration.md)
+ * [Caching](workloads/multi-model/caching.md)
* Traffic Splitter
* [Example](workloads/traffic-splitter/example.md)
* [Configuration](workloads/traffic-splitter/configuration.md)
* Managing dependencies
* [Example](workloads/dependencies/example.md)
* [Python packages](workloads/dependencies/python-packages.md)
* [System packages](workloads/dependencies/system-packages.md)
- * [Docker images](workloads/dependencies/docker-images.md)
+ * [Custom images](workloads/dependencies/images.md)

## Clusters

@@ -40,8 +41,6 @@
* [Update](clusters/aws/update.md)
* [Security](clusters/aws/security.md)
* [Spot instances](clusters/aws/spot.md)
- * [GPUs](clusters/aws/gpu.md)
- * [Inferentia](clusters/aws/inferentia.md)
* [Networking](clusters/aws/networking.md)
* [VPC peering](clusters/aws/vpc-peering.md)
* [Custom domain](clusters/aws/custom-domain.md)
22 changes: 11 additions & 11 deletions docs/workloads/batch/configuration.md
@@ -18,10 +18,10 @@
endpoint: <string> # the endpoint for the API (default: <api_name>)
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: <string | int | float> # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: <int> # GPU request per worker (default: 0)
- inf: <int> # Inferentia ASIC request per worker (default: 0)
- mem: <string> # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: <string | int | float> # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: <int> # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: <int> # Inferentia request per worker. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: <string> # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```
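
To make the units above concrete, here is a hypothetical compute block (the values are examples only, not defaults): `cpu: 500m` requests half a virtual CPU, `mem: 4Gi` requests 4 gibibytes, and with `inf: 1` and `processes_per_replica: 2` each process would get a NeuronCore Group with 4 * 1 / 2 = 2 NeuronCores.

```yaml
# illustrative compute block for a batch API worker (example values)
compute:
  cpu: 500m   # half a virtual CPU, equivalent to 0.5
  gpu: 1      # one virtual GPU
  mem: 4Gi    # 4 gibibytes of memory
```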

## TensorFlow Predictor
@@ -54,10 +54,10 @@
endpoint: <string> # the endpoint for the API (default: <api_name>)
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: <string | int | float> # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: <int> # GPU request per worker (default: 0)
- inf: <int> # Inferentia ASIC request per worker (default: 0)
- mem: <string> # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: <string | int | float> # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: <int> # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: <int> # Inferentia request per worker. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: <string> # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```

## ONNX Predictor
@@ -84,7 +84,7 @@
endpoint: <string> # the endpoint for the API (default: <api_name>)
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: <string | int | float> # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: <int> # GPU request per worker (default: 0)
- mem: <string> # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: <string | int | float> # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: <int> # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ mem: <string> # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```