From 8c76083db00488ffcb52d049b974cebf42425b82 Mon Sep 17 00:00:00 2001 From: Paula Date: Wed, 10 Sep 2025 14:04:32 +0200 Subject: [PATCH 1/4] review and update production checklist --- .../3.12/deploy/production-checklist.md | 126 +++++++++++++----- .../3.13/deploy/production-checklist.md | 126 +++++++++++++----- 2 files changed, 188 insertions(+), 64 deletions(-) diff --git a/site/content/3.12/deploy/production-checklist.md b/site/content/3.12/deploy/production-checklist.md index 6cb59f8198..efe5d13d2d 100644 --- a/site/content/3.12/deploy/production-checklist.md +++ b/site/content/3.12/deploy/production-checklist.md @@ -15,29 +15,78 @@ have been performed on your production system before you go live. [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details. -- OS monitoring is in place - (most common metrics, e.g. disk, CPU, RAM utilization). +- Ensure your OS is compatible with your ArangoDB version + and keep it up to date at all times for security and stability. + +- OS monitoring is in place with specific alerting thresholds: + - **Disk usage**: Alert when reaching 60% (red line threshold). + - **CPU usage**: Alert when reaching 90% (red line threshold). + - **Memory usage**: Alert when reaching 85% (red line threshold). - Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations. ## ArangoDB +- **Use the latest versions**: Deploy the latest bug fix version and latest major version + of ArangoDB to benefit from performance improvements and security fixes. + +- **Testing environments**: Use UAT (User Acceptance Testing) and QA environments + to test all changes before going live with production deployments. + +### Security + - The user _root_ is not used to run any ArangoDB processes (if you run ArangoDB on Linux). 
+- **Access control**: Restrict access to the cluster to authorized personnel only. + Implement proper authentication and authorization mechanisms. + +- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) + for sensitive data. Make sure to safely store any secret keys you create for this. + +### Logging and Monitoring + - The _arangod_ (server) process and the _arangodb_ (_Starter_) process (if in use) have some form of logging enabled and logs can easily be located and inspected. - -- *Memory considerations* - - If you run multiple processes (e.g. DB-Server and Coordinator) on a single - machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) - environment variable accordingly. - - For versions prior to 3.8, make sure to change the - [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) - query option according to the node size and workload. - - Disable swap space to avoid slowdown which can result in servers being incorrectly - detected as failed. + +- **Third-party monitoring**: Configure third-party metrics monitoring tools like + Grafana with Prometheus to monitor ArangoDB metrics comprehensively. 
+ +- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring: + - Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints + - Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics + - Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking + - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint + - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information + +- Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines: + - Disk usage: 60% (red line) + - CPU usage: 90% (red line) + - Memory usage: 85% (red line) + +### Memory + +- For DB-Servers and Coordinators, override the + [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) + environment variable using this rule of thumb: + - Multiply available memory by 0.9 to leave headspace for OS/Kubernetes, client connections, etc. + - Use 3/4 of that value for DB-Servers. + - Use 1/4 of that value for Coordinators. + - Agents typically don't need much memory and can use the remaining 10% headspace. + +- Note that if ArangoDB "sees" x GB of memory in a pod, + it will try to use those x GB. Memory accounting has been vastly improved in 3.12, + but occasional overshooting by a few kB may still occur. + +- For versions prior to 3.8, make sure to change the + [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) + query option according to the node size and workload. + +- Disable swap space to avoid slowdown which can result in servers being incorrectly + detected as failed. 
+ +### Service Management - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically you would use the Kubernetes operator or use systemd to launch the _Starter_. @@ -50,36 +99,49 @@ have been performed on your production system before you go live. update-rc.d -f arangodb3 remove ``` +### Cluster Configuration + - If you have deployed a Cluster, the _replication factor_ and _minimal_replication_factor_ of your collections are set to a value equal or higher than 2, otherwise you run the risk of losing data in case of a node failure. See [cluster startup options](../components/arangodb-server/options.md#cluster). -- *Disk Performance considerations* - - Verify that your **storage performance** is at least 100 IOPS for each - volume in production mode. This is the bare minimum and it's recommended to - provide more for performance. It is probably only a concern if you use a - cloud infrastructure. Note that IOPS might be allotted based on a volume size, - so make sure to check your storage provider for details. Furthermore, you should - be careful with burst mode guarantees as ArangoDB requires a sustainable - high IOPS rate. - - - The considerations should be given to an IO bandwidth (especially considering - RocksDB write-amplification which can easily be 10x or more). - -- Whenever possible use **block storage**. Database data is based on append - operations, so filesystem which support this should be used for best - performance. We would not recommend to use NFS for performance reasons, +- **Shard limits**: Keep the total number of shards below 10,000 across your cluster + to maintain optimal performance and avoid resource exhaustion. + +### Disk Performance + +- **Storage performance**: Verify that your storage performance is at least 100 IOPS for each + volume in production mode. This is the bare minimum and it's recommended to + provide more for performance. 
It is probably only a concern if you use a + cloud infrastructure. Note that IOPS might be allotted based on a volume size, + so make sure to check your storage provider for details. Furthermore, you should + be careful with burst mode guarantees as ArangoDB requires a sustainable + high IOPS rate. + +- **IO bandwidth**: Give considerations to IO bandwidth, especially considering + RocksDB write-amplification which can easily be 10x or more. + +- **Block storage**: Whenever possible use block storage. Database data is based on append + operations, so filesystems which support this should be used for best + performance. We would not recommend using NFS for performance reasons, furthermore we experienced some issues with hard links required for Hot Backup. -- Verify your **Backup** and restore procedures are working. +### Backup and Recovery + +- **Test restore procedures**: Verify your backup and restore procedures are working. + **TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures. + +- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your + RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements. -- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md). - Make sure to safely store any secret keys you create for this. +- **arangodump backups**: Take backups with arangodump from time to time as an + additional backup strategy alongside Hot Backups. -- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana). +- **Retry mechanisms**: Implement exponential retry with jitter in your applications + when connecting to ArangoDB to handle temporary network issues and failovers gracefully. ## Kubernetes Operator (kube-arangodb) @@ -89,4 +151,4 @@ have been performed on your production system before you go live. 
- The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion. -- Use native networking whenever possible to reduce delays. +- Use native networking whenever possible to reduce delays. \ No newline at end of file diff --git a/site/content/3.13/deploy/production-checklist.md b/site/content/3.13/deploy/production-checklist.md index 6cb59f8198..efe5d13d2d 100644 --- a/site/content/3.13/deploy/production-checklist.md +++ b/site/content/3.13/deploy/production-checklist.md @@ -15,29 +15,78 @@ have been performed on your production system before you go live. [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details. -- OS monitoring is in place - (most common metrics, e.g. disk, CPU, RAM utilization). +- Ensure your OS is compatible with your ArangoDB version + and keep it up to date at all times for security and stability. + +- OS monitoring is in place with specific alerting thresholds: + - **Disk usage**: Alert when reaching 60% (red line threshold). + - **CPU usage**: Alert when reaching 90% (red line threshold). + - **Memory usage**: Alert when reaching 85% (red line threshold). - Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations. ## ArangoDB +- **Use the latest versions**: Deploy the latest bug fix version and latest major version + of ArangoDB to benefit from performance improvements and security fixes. + +- **Testing environments**: Use UAT (User Acceptance Testing) and QA environments + to test all changes before going live with production deployments. + +### Security + - The user _root_ is not used to run any ArangoDB processes (if you run ArangoDB on Linux). 
+- **Access control**: Restrict access to the cluster to authorized personnel only. + Implement proper authentication and authorization mechanisms. + +- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) + for sensitive data. Make sure to safely store any secret keys you create for this. + +### Logging and Monitoring + - The _arangod_ (server) process and the _arangodb_ (_Starter_) process (if in use) have some form of logging enabled and logs can easily be located and inspected. - -- *Memory considerations* - - If you run multiple processes (e.g. DB-Server and Coordinator) on a single - machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) - environment variable accordingly. - - For versions prior to 3.8, make sure to change the - [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) - query option according to the node size and workload. - - Disable swap space to avoid slowdown which can result in servers being incorrectly - detected as failed. + +- **Third-party monitoring**: Configure third-party metrics monitoring tools like + Grafana with Prometheus to monitor ArangoDB metrics comprehensively. 
+ +- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring: + - Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints + - Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics + - Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking + - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint + - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information + +- Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines: + - Disk usage: 60% (red line) + - CPU usage: 90% (red line) + - Memory usage: 85% (red line) + +### Memory + +- For DB-Servers and Coordinators, override the + [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md) + environment variable using this rule of thumb: + - Multiply available memory by 0.9 to leave headspace for OS/Kubernetes, client connections, etc. + - Use 3/4 of that value for DB-Servers. + - Use 1/4 of that value for Coordinators. + - Agents typically don't need much memory and can use the remaining 10% headspace. + +- Note that if ArangoDB "sees" x GB of memory in a pod, + it will try to use those x GB. Memory accounting has been vastly improved in 3.12, + but occasional overshooting by a few kB may still occur. + +- For versions prior to 3.8, make sure to change the + [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) + query option according to the node size and workload. + +- Disable swap space to avoid slowdown which can result in servers being incorrectly + detected as failed. 
+ +### Service Management - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically you would use the Kubernetes operator or use systemd to launch the _Starter_. @@ -50,36 +99,49 @@ have been performed on your production system before you go live. update-rc.d -f arangodb3 remove ``` +### Cluster Configuration + - If you have deployed a Cluster, the _replication factor_ and _minimal_replication_factor_ of your collections are set to a value equal or higher than 2, otherwise you run the risk of losing data in case of a node failure. See [cluster startup options](../components/arangodb-server/options.md#cluster). -- *Disk Performance considerations* - - Verify that your **storage performance** is at least 100 IOPS for each - volume in production mode. This is the bare minimum and it's recommended to - provide more for performance. It is probably only a concern if you use a - cloud infrastructure. Note that IOPS might be allotted based on a volume size, - so make sure to check your storage provider for details. Furthermore, you should - be careful with burst mode guarantees as ArangoDB requires a sustainable - high IOPS rate. - - - The considerations should be given to an IO bandwidth (especially considering - RocksDB write-amplification which can easily be 10x or more). - -- Whenever possible use **block storage**. Database data is based on append - operations, so filesystem which support this should be used for best - performance. We would not recommend to use NFS for performance reasons, +- **Shard limits**: Keep the total number of shards below 10,000 across your cluster + to maintain optimal performance and avoid resource exhaustion. + +### Disk Performance + +- **Storage performance**: Verify that your storage performance is at least 100 IOPS for each + volume in production mode. This is the bare minimum and it's recommended to + provide more for performance. 
It is probably only a concern if you use a + cloud infrastructure. Note that IOPS might be allotted based on a volume size, + so make sure to check your storage provider for details. Furthermore, you should + be careful with burst mode guarantees as ArangoDB requires a sustainable + high IOPS rate. + +- **IO bandwidth**: Give considerations to IO bandwidth, especially considering + RocksDB write-amplification which can easily be 10x or more. + +- **Block storage**: Whenever possible use block storage. Database data is based on append + operations, so filesystems which support this should be used for best + performance. We would not recommend using NFS for performance reasons, furthermore we experienced some issues with hard links required for Hot Backup. -- Verify your **Backup** and restore procedures are working. +### Backup and Recovery + +- **Test restore procedures**: Verify your backup and restore procedures are working. + **TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures. + +- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your + RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements. -- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md). - Make sure to safely store any secret keys you create for this. +- **arangodump backups**: Take backups with arangodump from time to time as an + additional backup strategy alongside Hot Backups. -- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana). +- **Retry mechanisms**: Implement exponential retry with jitter in your applications + when connecting to ArangoDB to handle temporary network issues and failovers gracefully. ## Kubernetes Operator (kube-arangodb) @@ -89,4 +151,4 @@ have been performed on your production system before you go live. 
- The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion. -- Use native networking whenever possible to reduce delays. +- Use native networking whenever possible to reduce delays. \ No newline at end of file From 54e1034df32a148c21c0505890a62f4cc4546341 Mon Sep 17 00:00:00 2001 From: Paula Date: Thu, 11 Sep 2025 15:42:05 +0200 Subject: [PATCH 2/4] review --- site/content/3.12/deploy/production-checklist.md | 14 +++++--------- site/content/3.13/deploy/production-checklist.md | 14 +++++--------- 2 files changed, 10 insertions(+), 18 deletions(-) diff --git a/site/content/3.12/deploy/production-checklist.md b/site/content/3.12/deploy/production-checklist.md index efe5d13d2d..f1be6ff5e8 100644 --- a/site/content/3.12/deploy/production-checklist.md +++ b/site/content/3.12/deploy/production-checklist.md @@ -10,7 +10,7 @@ have been performed on your production system before you go live. ## Operating System -- Executed the OS optimization scripts if you run ArangoDB on Linux. +- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux. See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details. @@ -27,7 +27,7 @@ have been performed on your production system before you go live. ## ArangoDB -- **Use the latest versions**: Deploy the latest bug fix version and latest major version +- **Use the latest versions**: Deploy the latest version series of ArangoDB to benefit from performance improvements and security fixes. 
- **Testing environments**: Use UAT (User Acceptance Testing) and QA environments @@ -38,7 +38,7 @@ have been performed on your production system before you go live. - The user _root_ is not used to run any ArangoDB processes (if you run ArangoDB on Linux). -- **Access control**: Restrict access to the cluster to authorized personnel only. +- **Access control**: Restrict access to the deployment to authorized personnel only. Implement proper authentication and authorization mechanisms. - **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) @@ -79,10 +79,6 @@ have been performed on your production system before you go live. it will try to use those x GB. Memory accounting has been vastly improved in 3.12, but occasional overshooting by a few kB may still occur. -- For versions prior to 3.8, make sure to change the - [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) - query option according to the node size and workload. - - Disable swap space to avoid slowdown which can result in servers being incorrectly detected as failed. @@ -120,12 +116,12 @@ have been performed on your production system before you go live. be careful with burst mode guarantees as ArangoDB requires a sustainable high IOPS rate. -- **IO bandwidth**: Give considerations to IO bandwidth, especially considering +- **I/O bandwidth**: Give considerations to I/O bandwidth, especially considering RocksDB write-amplification which can easily be 10x or more. - **Block storage**: Whenever possible use block storage. Database data is based on append operations, so filesystems which support this should be used for best - performance. We would not recommend using NFS for performance reasons, + performance. ArangoDB does not recommend using NFS for performance reasons, furthermore we experienced some issues with hard links required for Hot Backup. 
diff --git a/site/content/3.13/deploy/production-checklist.md b/site/content/3.13/deploy/production-checklist.md index efe5d13d2d..f1be6ff5e8 100644 --- a/site/content/3.13/deploy/production-checklist.md +++ b/site/content/3.13/deploy/production-checklist.md @@ -10,7 +10,7 @@ have been performed on your production system before you go live. ## Operating System -- Executed the OS optimization scripts if you run ArangoDB on Linux. +- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux. See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details. @@ -27,7 +27,7 @@ have been performed on your production system before you go live. ## ArangoDB -- **Use the latest versions**: Deploy the latest bug fix version and latest major version +- **Use the latest versions**: Deploy the latest version series of ArangoDB to benefit from performance improvements and security fixes. - **Testing environments**: Use UAT (User Acceptance Testing) and QA environments @@ -38,7 +38,7 @@ have been performed on your production system before you go live. - The user _root_ is not used to run any ArangoDB processes (if you run ArangoDB on Linux). -- **Access control**: Restrict access to the cluster to authorized personnel only. +- **Access control**: Restrict access to the deployment to authorized personnel only. Implement proper authentication and authorization mechanisms. - **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) @@ -79,10 +79,6 @@ have been performed on your production system before you go live. it will try to use those x GB. Memory accounting has been vastly improved in 3.12, but occasional overshooting by a few kB may still occur. 
-- For versions prior to 3.8, make sure to change the - [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) - query option according to the node size and workload. - - Disable swap space to avoid slowdown which can result in servers being incorrectly detected as failed. @@ -120,12 +116,12 @@ have been performed on your production system before you go live. be careful with burst mode guarantees as ArangoDB requires a sustainable high IOPS rate. -- **IO bandwidth**: Give considerations to IO bandwidth, especially considering +- **I/O bandwidth**: Give considerations to I/O bandwidth, especially considering RocksDB write-amplification which can easily be 10x or more. - **Block storage**: Whenever possible use block storage. Database data is based on append operations, so filesystems which support this should be used for best - performance. We would not recommend using NFS for performance reasons, + performance. ArangoDB does not recommend using NFS for performance reasons, furthermore we experienced some issues with hard links required for Hot Backup. From b5ba68a3bdda24ec3002d253a2912f81b3110c37 Mon Sep 17 00:00:00 2001 From: Paula Date: Mon, 15 Sep 2025 12:37:09 +0200 Subject: [PATCH 3/4] add note on enabling RocksDB statistics and other minor changes --- site/content/3.12/deploy/production-checklist.md | 6 ++++-- site/content/3.13/deploy/production-checklist.md | 6 ++++-- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/site/content/3.12/deploy/production-checklist.md b/site/content/3.12/deploy/production-checklist.md index f1be6ff5e8..e5ad8a6710 100644 --- a/site/content/3.12/deploy/production-checklist.md +++ b/site/content/3.12/deploy/production-checklist.md @@ -30,7 +30,7 @@ have been performed on your production system before you go live. - **Use the latest versions**: Deploy the latest version series of ArangoDB to benefit from performance improvements and security fixes. 
-- **Testing environments**: Use UAT (User Acceptance Testing) and QA environments +- **Testing environments**: Use QA environments and UAT (User Acceptance Testing) to test all changes before going live with production deployments. ### Security @@ -60,6 +60,8 @@ have been performed on your production system before you go live. - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information +- **Enable RocksDB statistics**: Consider enabling [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics. + - Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines: - Disk usage: 60% (red line) - CPU usage: 90% (red line) @@ -77,7 +79,7 @@ have been performed on your production system before you go live. - Note that if ArangoDB "sees" x GB of memory in a pod, it will try to use those x GB. Memory accounting has been vastly improved in 3.12, - but occasional overshooting by a few kB may still occur. + but overshooting in certain cases may still occur. - Disable swap space to avoid slowdown which can result in servers being incorrectly detected as failed. diff --git a/site/content/3.13/deploy/production-checklist.md b/site/content/3.13/deploy/production-checklist.md index f1be6ff5e8..e5ad8a6710 100644 --- a/site/content/3.13/deploy/production-checklist.md +++ b/site/content/3.13/deploy/production-checklist.md @@ -30,7 +30,7 @@ have been performed on your production system before you go live. - **Use the latest versions**: Deploy the latest version series of ArangoDB to benefit from performance improvements and security fixes. 
-- **Testing environments**: Use UAT (User Acceptance Testing) and QA environments +- **Testing environments**: Use QA environments and UAT (User Acceptance Testing) to test all changes before going live with production deployments. ### Security @@ -60,6 +60,8 @@ have been performed on your production system before you go live. - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information +- **Enable RocksDB statistics**: Consider enabling [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics. + - Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines: - Disk usage: 60% (red line) - CPU usage: 90% (red line) @@ -77,7 +79,7 @@ have been performed on your production system before you go live. - Note that if ArangoDB "sees" x GB of memory in a pod, it will try to use those x GB. Memory accounting has been vastly improved in 3.12, - but occasional overshooting by a few kB may still occur. + but overshooting in certain cases may still occur. - Disable swap space to avoid slowdown which can result in servers being incorrectly detected as failed. 
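Reviewer note on the Memory section touched above: the rule of thumb for `ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY` (take 90% of the available memory, then give 3/4 of that to the DB-Server and 1/4 to the Coordinator) can be sketched as simple arithmetic. This is only an illustration. The function name `memory_overrides` is hypothetical, and it assumes one DB-Server and one Coordinator share the machine and that the variable is given a plain byte count.

```python
def memory_overrides(total_bytes: int) -> dict[str, int]:
    """Apply the checklist's rule of thumb to a machine's total memory.

    Hypothetical helper, not part of ArangoDB. Integer arithmetic is used
    so the results are exact byte counts.
    """
    # Keep 10% as headroom for the OS/Kubernetes, client connections, Agents.
    usable = total_bytes * 9 // 10
    return {
        "dbserver": usable * 3 // 4,   # 3/4 of the usable memory
        "coordinator": usable // 4,    # 1/4 of the usable memory
    }


if __name__ == "__main__":
    total = 64 * 1024**3  # example: a 64 GiB machine (assumed value)
    for role, value in memory_overrides(total).items():
        # The value would be set per process before starting arangod.
        print(f"{role}: ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY={value}")
```

The two values never add up to more than 90% of the total, so the remaining 10% stays free as the checklist intends.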
From bef5bc90bedb8ae2d2205771e6d11ac7827525c1 Mon Sep 17 00:00:00 2001 From: Paula Date: Tue, 23 Sep 2025 13:52:01 +0200 Subject: [PATCH 4/4] added more best practices for the production checklist via mamoona --- .../3.12/deploy/production-checklist.md | 29 ++++++++++++++----- .../3.13/deploy/production-checklist.md | 29 ++++++++++++++----- 2 files changed, 44 insertions(+), 14 deletions(-) diff --git a/site/content/3.12/deploy/production-checklist.md b/site/content/3.12/deploy/production-checklist.md index e5ad8a6710..51f9242468 100644 --- a/site/content/3.12/deploy/production-checklist.md +++ b/site/content/3.12/deploy/production-checklist.md @@ -31,16 +31,20 @@ have been performed on your production system before you go live. of ArangoDB to benefit from performance improvements and security fixes. - **Testing environments**: Use QA environments and UAT (User Acceptance Testing) - to test all changes before going live with production deployments. + to test all changes, in particular queries, before going live with production deployments. ### Security -- The user _root_ is not used to run any ArangoDB processes +- Create a dedicated system user and group (e.g., "arango") + to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes (if you run ArangoDB on Linux). - **Access control**: Restrict access to the deployment to authorized personnel only. Implement proper authentication and authorization mechanisms. +- **JWT authentication**: Enable JWT authentication + for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details. + - **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) for sensitive data. Make sure to safely store any secret keys you create for this. @@ -84,6 +88,10 @@ have been performed on your production system before you go live. 
- Disable swap space to avoid slowdown which can result in servers being incorrectly detected as failed. +- **Query memory limits**: Configure appropriate memory limits for AQL queries: + - Set [`--query.max-memory-per-query`](../components/arangodb-server/options.md#--querymax-memory-per-query) to limit memory usage per individual query. + - Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries. + ### Service Management - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically @@ -99,11 +107,11 @@ have been performed on your production system before you go live. ### Cluster Configuration -- If you have deployed a Cluster, the _replication factor_ and - _minimal_replication_factor_ of your collections - are set to a value equal or higher than 2, otherwise you run the risk of - losing data in case of a node failure. See - [cluster startup options](../components/arangodb-server/options.md#cluster). +- **Replication configuration**: For production clusters, configure collections with: + - _replication factor_ of 3 for optimal data availability and fault tolerance. + - _minimal_replication_factor_ of a value equal or higher than 2. + - _writeConcern_ of 2. + See [cluster startup options](../components/arangodb-server/options.md#cluster). - **Shard limits**: Keep the total number of shards below 10,000 across your cluster to maintain optimal performance and avoid resource exhaustion. @@ -118,6 +126,8 @@ have been performed on your production system before you go live. be careful with burst mode guarantees as ArangoDB requires a sustainable high IOPS rate. +- **DB-Server storage limit**: Keep individual DB-Server storage below 2TB per server to maintain optimal performance. 
+ - **I/O bandwidth**: Give considerations to I/O bandwidth, especially considering RocksDB write-amplification which can easily be 10x or more. @@ -138,6 +148,11 @@ have been performed on your production system before you go live. - **arangodump backups**: Take backups with arangodump from time to time as an additional backup strategy alongside Hot Backups. +- **Secure backup storage**: Store backups in a secure, separate location from your + production systems. Use encrypted storage and ensure backups are geographically + distributed to protect against regional disasters. Implement proper access controls + for backup storage locations. + - **Retry mechanisms**: Implement exponential retry with jitter in your applications when connecting to ArangoDB to handle temporary network issues and failovers gracefully. diff --git a/site/content/3.13/deploy/production-checklist.md b/site/content/3.13/deploy/production-checklist.md index e5ad8a6710..51f9242468 100644 --- a/site/content/3.13/deploy/production-checklist.md +++ b/site/content/3.13/deploy/production-checklist.md @@ -31,16 +31,20 @@ have been performed on your production system before you go live. of ArangoDB to benefit from performance improvements and security fixes. - **Testing environments**: Use QA environments and UAT (User Acceptance Testing) - to test all changes before going live with production deployments. + to test all changes, in particular queries, before going live with production deployments. ### Security -- The user _root_ is not used to run any ArangoDB processes +- Create a dedicated system user and group (e.g., "arango") + to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes (if you run ArangoDB on Linux). - **Access control**: Restrict access to the deployment to authorized personnel only. Implement proper authentication and authorization mechanisms. +- **JWT authentication**: Enable JWT authentication + for production deployments. 
See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details. + - **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md) for sensitive data. Make sure to safely store any secret keys you create for this. @@ -84,6 +88,10 @@ have been performed on your production system before you go live. - Disable swap space to avoid slowdown which can result in servers being incorrectly detected as failed. +- **Query memory limits**: Configure appropriate memory limits for AQL queries: + - Set [`--query.max-memory-per-query`](../components/arangodb-server/options.md#--querymax-memory-per-query) to limit memory usage per individual query. + - Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries. + ### Service Management - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically @@ -99,11 +107,11 @@ have been performed on your production system before you go live. ### Cluster Configuration -- If you have deployed a Cluster, the _replication factor_ and - _minimal_replication_factor_ of your collections - are set to a value equal or higher than 2, otherwise you run the risk of - losing data in case of a node failure. See - [cluster startup options](../components/arangodb-server/options.md#cluster). +- **Replication configuration**: For production clusters, configure collections with: + - _replication factor_ of 3 for optimal data availability and fault tolerance. + - _minimal_replication_factor_ equal to or higher than 2. + - _writeConcern_ of 2. + See [cluster startup options](../components/arangodb-server/options.md#cluster). - **Shard limits**: Keep the total number of shards below 10,000 across your cluster to maintain optimal performance and avoid resource exhaustion.
@@ -118,6 +126,8 @@ have been performed on your production system before you go live. be careful with burst mode guarantees as ArangoDB requires a sustainable high IOPS rate. +- **DB-Server storage limit**: Keep individual DB-Server storage below 2TB per server to maintain optimal performance. + - **I/O bandwidth**: Give considerations to I/O bandwidth, especially considering RocksDB write-amplification which can easily be 10x or more. @@ -138,6 +148,11 @@ have been performed on your production system before you go live. - **arangodump backups**: Take backups with arangodump from time to time as an additional backup strategy alongside Hot Backups. +- **Secure backup storage**: Store backups in a secure, separate location from your + production systems. Use encrypted storage and ensure backups are geographically + distributed to protect against regional disasters. Implement proper access controls + for backup storage locations. + - **Retry mechanisms**: Implement exponential retry with jitter in your applications when connecting to ArangoDB to handle temporary network issues and failovers gracefully.
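
The "Retry mechanisms" item in this patch recommends exponential retry with jitter but does not show what that looks like on the client side. The following is a minimal sketch in Python, not part of the patch and not tied to any ArangoDB driver: `retry_with_backoff`, its parameters, and the choice of "full jitter" (sleeping a random fraction of the exponential cap) are all assumptions made for illustration. Which exceptions count as retryable depends on the driver in use, so `retry_on` should be adjusted accordingly.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts calls have failed.

    Hypothetical helper for illustration only. `sleep` and `rng` are
    injectable so the timing behavior can be tested deterministically.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Exponential growth of the upper bound, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the cap so that many
            # clients recovering from the same failover do not retry in sync.
            sleep(rng() * cap)
```

In an application, `operation` would be a zero-argument callable that wraps a single request to the deployment, for example a lambda around whatever query call the chosen driver provides. Only idempotent operations should be retried blindly, since a request may have been applied before the connection dropped.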