diff --git a/site/content/3.12/deploy/production-checklist.md b/site/content/3.12/deploy/production-checklist.md index 6cb59f8198..51f9242468 100644 --- a/site/content/3.12/deploy/production-checklist.md +++ b/site/content/3.12/deploy/production-checklist.md @@ -10,34 +10,89 @@ have been performed on your production system before you go live. ## Operating System -- Executed the OS optimization scripts if you run ArangoDB on Linux. +- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux. See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details. -- OS monitoring is in place - (most common metrics, e.g. disk, CPU, RAM utilization). +- Ensure your OS is compatible with your ArangoDB version + and keep it up to date at all times for security and stability. + +- OS monitoring is in place with specific alerting thresholds: + - **Disk usage**: Alert when reaching 60% (red line threshold). + - **CPU usage**: Alert when reaching 90% (red line threshold). + - **Memory usage**: Alert when reaching 85% (red line threshold). - Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations. ## ArangoDB -- The user _root_ is not used to run any ArangoDB processes +- **Use the latest versions**: Deploy the latest version series + of ArangoDB to benefit from performance improvements and security fixes. + +- **Testing environments**: Use QA environments and UAT (User Acceptance Testing) + to test all changes, in particular queries, before going live with production deployments. + +### Security + +- Create a dedicated system user and group (e.g., "arango") + to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes (if you run ArangoDB on Linux). +- **Access control**: Restrict access to the deployment to authorized personnel only. + Implement proper authentication and authorization mechanisms. + +- **JWT authentication**: Enable JWT authentication + for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details. + +- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) + for sensitive data. Make sure to safely store any secret keys you create for this. + +### Logging and Monitoring + - The _arangod_ (server) process and the _arangodb_ (_Starter_) process (if in use) have some form of logging enabled and logs can easily be located and inspected. - -- *Memory considerations* - - If you run multiple processes (e.g. DB-Server and Coordinator) on a single - machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) - environment variable accordingly. - - For versions prior to 3.8, make sure to change the - [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) - query option according to the node size and workload. - - Disable swap space to avoid slowdown which can result in servers being incorrectly - detected as failed. + +- **Third-party monitoring**: Configure third-party metrics monitoring tools like + Grafana with Prometheus to monitor ArangoDB metrics comprehensively. + +- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring: + - Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints + - Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics + - Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking + - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint + - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information + +- **Enable RocksDB statistics**: Consider enabling [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics. + +- Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines: + - Disk usage: 60% (red line) + - CPU usage: 90% (red line) + - Memory usage: 85% (red line) + +### Memory + +- For DB-Servers and Coordinators, override the + [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) + environment variable using this rule of thumb: + - Multiply available memory by 0.9 to leave headspace for OS/Kubernetes, client connections, etc. + - Use 3/4 of that value for DB-Servers. + - Use 1/4 of that value for Coordinators. + - Agents typically don't need much memory and can use the remaining 10% headspace. + +- Note that if ArangoDB "sees" x GB of memory in a pod, + it will try to use those x GB. Memory accounting has been vastly improved in 3.12, + but overshooting in certain cases may still occur. + +- Disable swap space to avoid slowdown which can result in servers being incorrectly + detected as failed. + +- **Query memory limits**: Configure appropriate memory limits for AQL queries: + - Set [`--query.max-memory-per-query`](../components/arangodb-server/options.md#--querymax-memory-per-query) to limit memory usage per individual query. + - Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries. + +### Service Management - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically you would use the Kubernetes operator or use systemd to launch the _Starter_. @@ -50,36 +105,56 @@ have been performed on your production system before you go live. update-rc.d -f arangodb3 remove ``` -- If you have deployed a Cluster, the _replication factor_ and - _minimal_replication_factor_ of your collections - are set to a value equal or higher than 2, otherwise you run the risk of - losing data in case of a node failure. See - [cluster startup options](../components/arangodb-server/options.md#cluster). - -- *Disk Performance considerations* - - Verify that your **storage performance** is at least 100 IOPS for each - volume in production mode. This is the bare minimum and it's recommended to - provide more for performance. It is probably only a concern if you use a - cloud infrastructure. Note that IOPS might be allotted based on a volume size, - so make sure to check your storage provider for details. Furthermore, you should - be careful with burst mode guarantees as ArangoDB requires a sustainable - high IOPS rate. - - - The considerations should be given to an IO bandwidth (especially considering - RocksDB write-amplification which can easily be 10x or more). - -- Whenever possible use **block storage**. Database data is based on append - operations, so filesystem which support this should be used for best - performance. We would not recommend to use NFS for performance reasons, +### Cluster Configuration + +- **Replication configuration**: For production clusters, configure collections with: + - _replication factor_ of 3 for optimal data availability and fault tolerance. + - _minimal_replication_factor_ of a value equal or higher than 2. + - _writeConcern_ of 2. + See [cluster startup options](../components/arangodb-server/options.md#cluster). + +- **Shard limits**: Keep the total number of shards below 10,000 across your cluster + to maintain optimal performance and avoid resource exhaustion. + +### Disk Performance + +- **Storage performance**: Verify that your storage performance is at least 100 IOPS for each + volume in production mode. This is the bare minimum and it's recommended to + provide more for performance. It is probably only a concern if you use a + cloud infrastructure. Note that IOPS might be allotted based on a volume size, + so make sure to check your storage provider for details. Furthermore, you should + be careful with burst mode guarantees as ArangoDB requires a sustainable + high IOPS rate. + +- **DB-Server storage limit**: Keep individual DB-Server storage below 2TB per server to maintain optimal performance. + +- **I/O bandwidth**: Give considerations to I/O bandwidth, especially considering + RocksDB write-amplification which can easily be 10x or more. + +- **Block storage**: Whenever possible use block storage. Database data is based on append + operations, so filesystems which support this should be used for best + performance. ArangoDB does not recommend using NFS for performance reasons, furthermore we experienced some issues with hard links required for Hot Backup. -- Verify your **Backup** and restore procedures are working. +### Backup and Recovery + +- **Test restore procedures**: Verify your backup and restore procedures are working. + **TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures. + +- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your + RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements. + +- **arangodump backups**: Take backups with arangodump from time to time as an + additional backup strategy alongside Hot Backups. -- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md). - Make sure to safely store any secret keys you create for this. +- **Secure backup storage**: Store backups in a secure, separate location from your + production systems. Use encrypted storage and ensure backups are geographically + distributed to protect against regional disasters. Implement proper access controls + for backup storage locations. -- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana). +- **Retry mechanisms**: Implement exponential retry with jitter in your applications + when connecting to ArangoDB to handle temporary network issues and failovers gracefully. ## Kubernetes Operator (kube-arangodb) @@ -89,4 +164,4 @@ have been performed on your production system before you go live. - The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion. -- Use native networking whenever possible to reduce delays. +- Use native networking whenever possible to reduce delays. \ No newline at end of file diff --git a/site/content/3.13/deploy/production-checklist.md b/site/content/3.13/deploy/production-checklist.md index 6cb59f8198..51f9242468 100644 --- a/site/content/3.13/deploy/production-checklist.md +++ b/site/content/3.13/deploy/production-checklist.md @@ -10,34 +10,89 @@ have been performed on your production system before you go live. ## Operating System -- Executed the OS optimization scripts if you run ArangoDB on Linux. +- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux. See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details. -- OS monitoring is in place - (most common metrics, e.g. disk, CPU, RAM utilization). +- Ensure your OS is compatible with your ArangoDB version + and keep it up to date at all times for security and stability. + +- OS monitoring is in place with specific alerting thresholds: + - **Disk usage**: Alert when reaching 60% (red line threshold). + - **CPU usage**: Alert when reaching 90% (red line threshold). + - **Memory usage**: Alert when reaching 85% (red line threshold). - Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations. ## ArangoDB -- The user _root_ is not used to run any ArangoDB processes +- **Use the latest versions**: Deploy the latest version series + of ArangoDB to benefit from performance improvements and security fixes. + +- **Testing environments**: Use QA environments and UAT (User Acceptance Testing) + to test all changes, in particular queries, before going live with production deployments. + +### Security + +- Create a dedicated system user and group (e.g., "arango") + to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes (if you run ArangoDB on Linux). +- **Access control**: Restrict access to the deployment to authorized personnel only. + Implement proper authentication and authorization mechanisms. + +- **JWT authentication**: Enable JWT authentication + for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details. + +- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) + for sensitive data. Make sure to safely store any secret keys you create for this. + +### Logging and Monitoring + - The _arangod_ (server) process and the _arangodb_ (_Starter_) process (if in use) have some form of logging enabled and logs can easily be located and inspected. - -- *Memory considerations* - - If you run multiple processes (e.g. DB-Server and Coordinator) on a single - machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) - environment variable accordingly. - - For versions prior to 3.8, make sure to change the - [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) - query option according to the node size and workload. - - Disable swap space to avoid slowdown which can result in servers being incorrectly - detected as failed. + +- **Third-party monitoring**: Configure third-party metrics monitoring tools like + Grafana with Prometheus to monitor ArangoDB metrics comprehensively. + +- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring: + - Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints + - Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics + - Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking + - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint + - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information + +- **Enable RocksDB statistics**: Consider enabling [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics. + +- Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines: + - Disk usage: 60% (red line) + - CPU usage: 90% (red line) + - Memory usage: 85% (red line) + +### Memory + +- For DB-Servers and Coordinators, override the + [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) + environment variable using this rule of thumb: + - Multiply available memory by 0.9 to leave headspace for OS/Kubernetes, client connections, etc. + - Use 3/4 of that value for DB-Servers. + - Use 1/4 of that value for Coordinators. + - Agents typically don't need much memory and can use the remaining 10% headspace. + +- Note that if ArangoDB "sees" x GB of memory in a pod, + it will try to use those x GB. Memory accounting has been vastly improved in 3.12, + but overshooting in certain cases may still occur. + +- Disable swap space to avoid slowdown which can result in servers being incorrectly + detected as failed. + +- **Query memory limits**: Configure appropriate memory limits for AQL queries: + - Set [`--query.max-memory-per-query`](../components/arangodb-server/options.md#--querymax-memory-per-query) to limit memory usage per individual query. + - Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries. + +### Service Management - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically you would use the Kubernetes operator or use systemd to launch the _Starter_. @@ -50,36 +105,56 @@ have been performed on your production system before you go live. update-rc.d -f arangodb3 remove ``` -- If you have deployed a Cluster, the _replication factor_ and - _minimal_replication_factor_ of your collections - are set to a value equal or higher than 2, otherwise you run the risk of - losing data in case of a node failure. See - [cluster startup options](../components/arangodb-server/options.md#cluster). - -- *Disk Performance considerations* - - Verify that your **storage performance** is at least 100 IOPS for each - volume in production mode. This is the bare minimum and it's recommended to - provide more for performance. It is probably only a concern if you use a - cloud infrastructure. Note that IOPS might be allotted based on a volume size, - so make sure to check your storage provider for details. Furthermore, you should - be careful with burst mode guarantees as ArangoDB requires a sustainable - high IOPS rate. - - - The considerations should be given to an IO bandwidth (especially considering - RocksDB write-amplification which can easily be 10x or more). - -- Whenever possible use **block storage**. Database data is based on append - operations, so filesystem which support this should be used for best - performance. We would not recommend to use NFS for performance reasons, +### Cluster Configuration + +- **Replication configuration**: For production clusters, configure collections with: + - _replication factor_ of 3 for optimal data availability and fault tolerance. + - _minimal_replication_factor_ of a value equal or higher than 2. + - _writeConcern_ of 2. + See [cluster startup options](../components/arangodb-server/options.md#cluster). + +- **Shard limits**: Keep the total number of shards below 10,000 across your cluster + to maintain optimal performance and avoid resource exhaustion. + +### Disk Performance + +- **Storage performance**: Verify that your storage performance is at least 100 IOPS for each + volume in production mode. This is the bare minimum and it's recommended to + provide more for performance. It is probably only a concern if you use a + cloud infrastructure. Note that IOPS might be allotted based on a volume size, + so make sure to check your storage provider for details. Furthermore, you should + be careful with burst mode guarantees as ArangoDB requires a sustainable + high IOPS rate. + +- **DB-Server storage limit**: Keep individual DB-Server storage below 2TB per server to maintain optimal performance. + +- **I/O bandwidth**: Give considerations to I/O bandwidth, especially considering + RocksDB write-amplification which can easily be 10x or more. + +- **Block storage**: Whenever possible use block storage. Database data is based on append + operations, so filesystems which support this should be used for best + performance. ArangoDB does not recommend using NFS for performance reasons, furthermore we experienced some issues with hard links required for Hot Backup. -- Verify your **Backup** and restore procedures are working. +### Backup and Recovery + +- **Test restore procedures**: Verify your backup and restore procedures are working. + **TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures. + +- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your + RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements. + +- **arangodump backups**: Take backups with arangodump from time to time as an + additional backup strategy alongside Hot Backups. -- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md). - Make sure to safely store any secret keys you create for this. +- **Secure backup storage**: Store backups in a secure, separate location from your + production systems. Use encrypted storage and ensure backups are geographically + distributed to protect against regional disasters. Implement proper access controls + for backup storage locations. -- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana). +- **Retry mechanisms**: Implement exponential retry with jitter in your applications + when connecting to ArangoDB to handle temporary network issues and failovers gracefully. ## Kubernetes Operator (kube-arangodb) @@ -89,4 +164,4 @@ have been performed on your production system before you go live. - The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion. -- Use native networking whenever possible to reduce delays. +- Use native networking whenever possible to reduce delays. \ No newline at end of file