155 changes: 115 additions & 40 deletions site/content/3.12/deploy/production-checklist.md
@@ -10,34 +10,89 @@ have been performed on your production system before you go live.

## Operating System

- Executed the OS optimization scripts if you run ArangoDB on Linux.
- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux.
See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its subpages
[Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and
[Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details.

- OS monitoring is in place
(most common metrics, e.g. disk, CPU, RAM utilization).
- Ensure your OS is compatible with your ArangoDB version
and keep it up to date at all times for security and stability.

- OS monitoring is in place with specific alerting thresholds:
- **Disk usage**: Alert when reaching 60% (red line threshold).
- **CPU usage**: Alert when reaching 90% (red line threshold).
- **Memory usage**: Alert when reaching 85% (red line threshold).

- Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations.
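
The thresholds above can be expressed as a small check. This is a minimal sketch rather than a monitoring setup: the `THRESHOLDS` table and both function names are hypothetical, and only disk usage is actually measured here (via the standard library), since CPU and memory readings normally come from your monitoring agent.

```python
import shutil

# Alert thresholds from the checklist, as fractions of total capacity.
THRESHOLDS = {"disk": 0.60, "cpu": 0.90, "memory": 0.85}


def breached(metric: str, used_fraction: float) -> bool:
    """Return True once usage has reached the alert threshold for the metric."""
    return used_fraction >= THRESHOLDS[metric]


def disk_used_fraction(path: str = ".") -> float:
    """Return the used fraction of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = disk_used_fraction(".")
    print(f"disk usage {fraction:.0%}, alert: {breached('disk', fraction)}")
```

In practice the same comparison would live in an alerting rule of your monitoring system; the sketch only makes the "alert when reaching" semantics explicit (the threshold value itself triggers the alert).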

## ArangoDB

- The user _root_ is not used to run any ArangoDB processes
- **Use the latest versions**: Deploy the latest version series
of ArangoDB to benefit from performance improvements and security fixes.

- **Testing environments**: Use QA environments and UAT (User Acceptance Testing)
to test all changes, in particular queries, before going live with production deployments.

### Security

- Create a dedicated system user and group (e.g., "arango")
to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes
(if you run ArangoDB on Linux).

- **Access control**: Restrict access to the deployment to authorized personnel only.
Implement proper authentication and authorization mechanisms.

- **JWT authentication**: Enable JWT authentication
for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details.

- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md)
for sensitive data. Make sure to safely store any secret keys you create for this.

### Logging and Monitoring

- The _arangod_ (server) process and the _arangodb_ (_Starter_) process
(if in use) have some form of logging enabled and logs can easily be
located and inspected.

- *Memory considerations*
- If you run multiple processes (e.g. DB-Server and Coordinator) on a single
machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
environment variable accordingly.
- For versions prior to 3.8, make sure to change the
[`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit)
query option according to the node size and workload.
- Disable swap space to avoid slowdown which can result in servers being incorrectly
detected as failed.

- **Third-party monitoring**: Configure third-party metrics monitoring tools like
Grafana with Prometheus to monitor ArangoDB metrics comprehensively.

- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring:
- Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints
- Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics
- Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking
- Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint
- See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information

- **Enable RocksDB statistics**: Consider setting [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics.

- Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines:
- Disk usage: 60% (red line)
- CPU usage: 90% (red line)
- Memory usage: 85% (red line)
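
To illustrate what a scraper receives from the `/_admin/metrics/v2` endpoint, the following sketch fetches and parses the Prometheus text format using only the Python standard library. The function names are hypothetical, the `bearer` authorization header assumes JWT authentication is enabled, and the parser assumes sample lines carry no trailing timestamp. A real deployment would normally let Prometheus do the scraping.

```python
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}.

    Assumes each sample line ends with the value (no trailing timestamp).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(base_url: str, token: str) -> dict[str, float]:
    """Fetch and parse metrics; `base_url` and `token` are placeholders."""
    request = urllib.request.Request(
        f"{base_url}/_admin/metrics/v2",
        headers={"Authorization": f"bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

`parse_metrics` can be tested offline against a captured response, which is a cheap way to confirm that the metrics API is enabled and returns what your dashboards expect.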

### Memory

- For DB-Servers and Coordinators, override the
[`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
environment variable using this rule of thumb:
- Multiply available memory by 0.9 to leave headroom for OS/Kubernetes, client connections, etc.
- Use 3/4 of that value for DB-Servers.
- Use 1/4 of that value for Coordinators.
- Agents typically don't need much memory and can use the remaining 10% headroom.

- Note that if ArangoDB "sees" x GB of memory in a pod,
it will try to use those x GB. Memory accounting has been vastly improved in 3.12,
but overshooting in certain cases may still occur.

- Disable swap space to avoid slowdown which can result in servers being incorrectly
detected as failed.

- **Query memory limits**: Configure appropriate memory limits for AQL queries:
- Set [`--query.max-memory-per-query`](../components/arangodb-server/options.md#--querymax-memory-per-query) to limit memory usage per individual query.
- Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries.
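
The rule of thumb for `ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY` is simple arithmetic. The sketch below assumes one DB-Server and one Coordinator share the machine or pod and that the variable is given as a plain byte count; the function name is hypothetical.

```python
def memory_overrides(total_bytes: int) -> dict[str, int]:
    """Split memory per the rule of thumb: 90% usable, then 3/4 and 1/4."""
    usable = total_bytes * 9 // 10  # keep 10% headroom for OS, Agents, etc.
    return {
        "dbserver": usable * 3 // 4,
        "coordinator": usable // 4,
    }


if __name__ == "__main__":
    total = 64 * 1024**3  # example: a 64 GiB machine
    for role, value in memory_overrides(total).items():
        print(f"{role}: ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY={value}")
```

Integer arithmetic is used so the result is an exact byte count. With different process layouts (for example, several DB-Servers per machine) the fractions would need to be adjusted accordingly.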

### Service Management

- Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically
you would use the Kubernetes operator or use systemd to launch the _Starter_.
@@ -50,36 +105,56 @@ have been performed on your production system before you go live.
update-rc.d -f arangodb3 remove
```

- If you have deployed a Cluster, the _replication factor_ and
_minimal_replication_factor_ of your collections
are set to a value equal or higher than 2, otherwise you run the risk of
losing data in case of a node failure. See
[cluster startup options](../components/arangodb-server/options.md#cluster).

- *Disk Performance considerations*
- Verify that your **storage performance** is at least 100 IOPS for each
volume in production mode. This is the bare minimum and it's recommended to
provide more for performance. It is probably only a concern if you use a
cloud infrastructure. Note that IOPS might be allotted based on a volume size,
so make sure to check your storage provider for details. Furthermore, you should
be careful with burst mode guarantees as ArangoDB requires a sustainable
high IOPS rate.

- The considerations should be given to an IO bandwidth (especially considering
RocksDB write-amplification which can easily be 10x or more).

- Whenever possible use **block storage**. Database data is based on append
operations, so filesystem which support this should be used for best
performance. We would not recommend to use NFS for performance reasons,
### Cluster Configuration

- **Replication configuration**: For production clusters, configure collections with:
- _replication factor_ of 3 for optimal data availability and fault tolerance.
- _minimal_replication_factor_ of 2 or higher.
- _writeConcern_ of 2.
See [cluster startup options](../components/arangodb-server/options.md#cluster).

- **Shard limits**: Keep the total number of shards below 10,000 across your cluster
to maintain optimal performance and avoid resource exhaustion.
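
As a sketch of how these settings might be applied per collection, the snippet below builds a request body for creating a collection and checks a shard budget. The attribute names `numberOfShards`, `replicationFactor`, and `writeConcern` are assumed to match the collection HTTP API, so verify them against the API reference for your version. The helper names and the way shards are counted are hypothetical.

```python
import json

MAX_TOTAL_SHARDS = 10_000  # limit from the checklist


def collection_properties(name: str, number_of_shards: int) -> dict:
    """Build collection properties using the recommended replication settings."""
    return {
        "name": name,
        "numberOfShards": number_of_shards,
        "replicationFactor": 3,
        "writeConcern": 2,
    }


def within_shard_budget(existing_shards: int, new_shards: int) -> bool:
    """Return True if the cluster total stays below the recommended limit."""
    return existing_shards + new_shards < MAX_TOTAL_SHARDS


if __name__ == "__main__":
    body = collection_properties("orders", number_of_shards=9)
    if within_shard_budget(existing_shards=1200, new_shards=9):
        print(json.dumps(body, indent=2))
```

Checking the budget before creating collections helps keep the shard count from creeping past the limit as the number of collections grows.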

### Disk Performance

- **Storage performance**: Verify that your storage provides at least 100 IOPS for each
volume in production. This is the bare minimum, and more is recommended
for good performance. This is mostly a concern on cloud infrastructure,
where IOPS may be allotted based on volume size, so check your storage
provider's documentation for details. Be careful with burst-mode guarantees,
because ArangoDB requires a sustained high IOPS rate.

- **DB-Server storage limit**: Keep the data stored on each DB-Server below 2 TB to maintain optimal performance.

- **I/O bandwidth**: Take I/O bandwidth into account, especially because
RocksDB write amplification can easily be 10x or more.

- **Block storage**: Whenever possible use block storage. Database data is based on append
operations, so filesystems which support this should be used for best
performance. ArangoDB does not recommend using NFS for performance reasons,
furthermore we experienced some issues with hard links required for
Hot Backup.

- Verify your **Backup** and restore procedures are working.
### Backup and Recovery

- **Test restore procedures**: Verify your backup and restore procedures are working.
**TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures.

- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your
RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements.

- **arangodump backups**: Take periodic backups with arangodump as an
additional backup strategy alongside Hot Backups.

- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md).
Make sure to safely store any secret keys you create for this.
- **Secure backup storage**: Store backups in a secure, separate location from your
production systems. Use encrypted storage and ensure backups are geographically
distributed to protect against regional disasters. Implement proper access controls
for backup storage locations.

- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana).
- **Retry mechanisms**: Implement retries with exponential backoff and jitter in your applications
when connecting to ArangoDB to handle temporary network issues and failovers gracefully.
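
A retry loop with exponential backoff and jitter can look roughly like the following. This is a generic sketch, not tied to any ArangoDB driver: `operation` stands for whatever call your application makes, the "full jitter" variant (a random delay between zero and the exponential bound) is one common choice, and the exception types to retry on are assumptions that depend on your client library.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.random):
    """Yield full-jitter delays: random in [0, min(cap, base * 2**n)]."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2**n)


def call_with_retry(
    operation,
    attempts=5,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call `operation`, retrying on transient errors up to `attempts` times."""
    delays = list(backoff_delays(attempts - 1))
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

The jitter spreads retries from many clients over time, so that a failover is not followed by a synchronized burst of reconnects. Only retry operations that are safe to repeat, or make them idempotent first.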

## Kubernetes Operator (kube-arangodb)

Expand All @@ -89,4 +164,4 @@ have been performed on your production system before you go live.
- The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming)
of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion.

- Use native networking whenever possible to reduce delays.
- Use native networking whenever possible to reduce delays.