Docs #1754 (merged)
6 changes: 3 additions & 3 deletions README.md
@@ -10,7 +10,7 @@ Cortex is an open source platform for large-scale inference workloads.

## Model serving infrastructure

- * Supports deploying TensorFlow, PyTorch, sklearn and other models as realtime or batch APIs.
+ * Supports deploying TensorFlow, PyTorch, and other models as realtime or batch APIs.
* Ensures high availability with availability zones and automated instance restarts.
* Runs inference on on-demand instances or spot instances with on-demand backups.
* Autoscales to handle production workloads with support for overprovisioning.
@@ -98,13 +98,13 @@ import cortex
cx = cortex.client("aws")
cx.create_api(api_spec, predictor=PythonPredictor, requirements=requirements)

- # creating https://example.com/text-generator
+ # creating http://example.com/text-generator
```
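
The snippet above uses `api_spec`, `PythonPredictor`, and `requirements` without defining them. Below is a minimal sketch of what they might look like, assuming a Hugging Face text-generation model; the `__init__(self, config)` / `predict(self, payload)` interface follows Cortex's Python Predictor convention, but everything else here is illustrative rather than taken from this diff.

```python
# hypothetical definitions for the names used in the snippet above

class PythonPredictor:
    def __init__(self, config):
        # runs once per worker: load the model into memory
        from transformers import pipeline
        self.generator = pipeline(task="text-generation")

    def predict(self, payload):
        # payload is the parsed JSON request body, e.g. {"text": "hello world"}
        return self.generator(payload["text"])[0]["generated_text"]

api_spec = {
    "name": "text-generator",
    "kind": "RealtimeAPI",
}

requirements = ["torch", "transformers"]
```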

#### Consume your API

```bash
- $ curl https://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
+ $ curl http://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
```
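
The same request can also be made from Python; here is a small sketch using the `requests` library and the placeholder endpoint from the example above.

```python
import requests

# equivalent of the curl command above (placeholder endpoint)
response = requests.post(
    "http://example.com/text-generator",
    json={"text": "hello world"},
)
print(response.status_code, response.text)
```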

<br>
33 changes: 0 additions & 33 deletions docs/clusters/aws/gpu.md

This file was deleted.

75 changes: 0 additions & 75 deletions docs/clusters/aws/inferentia.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/clusters/aws/install.md
@@ -5,6 +5,7 @@
1. [Docker](https://docs.docker.com/install)
1. Subscribe to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM) (for GPU clusters)
1. An IAM user with `AdministratorAccess` and programmatic access (see [security](security.md) if you'd like to use less privileged credentials after spinning up your cluster)
+ 1. You may need to [request a limit increase](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) for your desired instance type

## Spin up Cortex on your AWS account
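
As a rough sketch of the step this section describes, the cluster is created from a configuration file with the Cortex CLI; the exact flag syntax varies across Cortex versions, so treat this as illustrative.

```bash
# create the cluster defined in cluster.yaml (flag syntax may differ by version)
cortex cluster up --config cluster.yaml
```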

4 changes: 1 addition & 3 deletions docs/clusters/aws/spot.md
@@ -1,15 +1,13 @@
# Spot instances

[Spot instances](https://aws.amazon.com/ec2/spot) are spare capacity that AWS sells at a discount (up to 90%). The caveat is that spot instances may not always be available, and can be recalled by AWS at any time. Cortex allows you to use spot instances in your cluster to take advantage of the discount while ensuring uptime and reliability of APIs. You can configure your cluster to use spot instances using the configuration below:

```yaml
# cluster.yaml

# whether to use spot instances in the cluster (default: false)
spot: false

spot_config:
- # additional instance types with identical or better specs than the primary instance type (defaults to only the primary instance)
+ # additional instance types with identical or better specs than the primary cluster instance type (defaults to only the primary instance type)
instance_distribution: # [similar_instance_type_1, similar_instance_type_2]

# minimum number of on demand instances (default: 0)
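# --- illustrative example (not part of this diff) ---
# a hypothetical cluster.yaml fragment that enables spot instances with a
# fallback distribution; the instance type names are placeholders, and the
# primary type is assumed to come from the cluster's instance_type field:
#
#   instance_type: m5.large
#   spot: true
#   spot_config:
#     instance_distribution: [m5a.large, m5d.large, m5n.large]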
5 changes: 2 additions & 3 deletions docs/summary.md
@@ -24,14 +24,15 @@
* Multi-model
* [Example](workloads/multi-model/example.md)
* [Configuration](workloads/multi-model/configuration.md)
+ * [Caching](workloads/multi-model/caching.md)
* Traffic Splitter
* [Example](workloads/traffic-splitter/example.md)
* [Configuration](workloads/traffic-splitter/configuration.md)
* Managing dependencies
* [Example](workloads/dependencies/example.md)
* [Python packages](workloads/dependencies/python-packages.md)
* [System packages](workloads/dependencies/system-packages.md)
- * [Docker images](workloads/dependencies/docker-images.md)
+ * [Custom images](workloads/dependencies/images.md)

## Clusters

@@ -40,8 +41,6 @@
* [Update](clusters/aws/update.md)
* [Security](clusters/aws/security.md)
* [Spot instances](clusters/aws/spot.md)
- * [GPUs](clusters/aws/gpu.md)
- * [Inferentia](clusters/aws/inferentia.md)
* [Networking](clusters/aws/networking.md)
* [VPC peering](clusters/aws/vpc-peering.md)
* [Custom domain](clusters/aws/custom-domain.md)
22 changes: 11 additions & 11 deletions docs/workloads/batch/configuration.md
@@ -18,10 +18,10 @@
endpoint: <string> # the endpoint for the API (default: <api_name>)
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: <string | int | float> # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: <int> # GPU request per worker (default: 0)
- inf: <int> # Inferentia ASIC request per worker (default: 0)
- mem: <string> # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: <string | int | float> # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: <int> # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: <int> # Inferentia request per worker. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: <string> # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```
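
To make the units above concrete, here is a hypothetical compute block (the values are examples only, not defaults): `cpu: 500m` requests half a virtual CPU, `mem: 4Gi` requests 4 gibibytes, and with `inf: 1` and `processes_per_replica: 2` each process would get a NeuronCore Group with 4 * 1 / 2 = 2 NeuronCores.

```yaml
# illustrative compute block for a batch API worker (example values)
compute:
  cpu: 500m   # half a virtual CPU, equivalent to 0.5
  gpu: 1      # one virtual GPU
  mem: 4Gi    # 4 gibibytes of memory
```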

## TensorFlow Predictor
@@ -54,10 +54,10 @@
endpoint: <string> # the endpoint for the API (default: <api_name>)
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: <string | int | float> # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: <int> # GPU request per worker (default: 0)
- inf: <int> # Inferentia ASIC request per worker (default: 0)
- mem: <string> # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: <string | int | float> # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: <int> # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ inf: <int> # Inferentia request per worker. One unit corresponds to one Inferentia ASIC with 4 NeuronCores and 8GB of cache memory. Each process will have one NeuronCore Group with (4 * inf / processes_per_replica) NeuronCores, so your model should be compiled to run on (4 * inf / processes_per_replica) NeuronCores. (default: 0) (aws only)
+ mem: <string> # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```

## ONNX Predictor
@@ -84,7 +84,7 @@
endpoint: <string> # the endpoint for the API (default: <api_name>)
api_gateway: public | none # whether to create a public API Gateway endpoint for this API (if not, the API will still be accessible via the load balancer) (default: public, unless disabled cluster-wide)
compute:
- cpu: <string | int | float> # CPU request per worker, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
- gpu: <int> # GPU request per worker (default: 0)
- mem: <string> # memory request per worker, e.g. 200Mi or 1Gi (default: Null)
+ cpu: <string | int | float> # CPU request per worker. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
+ gpu: <int> # GPU request per worker. One unit of GPU corresponds to one virtual GPU (default: 0)
+ mem: <string> # memory request per worker. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of-two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```