From 5a6070cbe8c64fa56bea386a5e9f8ead16d3ccf9 Mon Sep 17 00:00:00 2001 From: Paula Date: Mon, 16 Oct 2023 15:18:52 +0200 Subject: [PATCH 1/6] add arangodump resource usage limits --- .../version-3.12/whats-new-in-3-12.md | 49 +++++++++++++++++++ 1 file changed, 49 insertions(+) diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md index 5fd411cf3d..8c0de6cd45 100644 --- a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md +++ b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md @@ -222,5 +222,54 @@ of outgrowing the maximum number of file descriptors the ArangoDB process can open. Thus, these options should only be enabled on deployments with a limited number of collections/shards/indexes. +## Client tools + +### arangodump + +#### Improved dump performance + + + +#### Resource usage limits + +The following startup options that can be used to limit +the resource usage of parallel _arangodump_ invocations have been added: + +- `--dump.max-memory-usage`: Maximum memory usage (in bytes) to be + used by the server-side parts of all ongoing _arangodump_ invocations. + This option can be used to limit the amount of memory for prefetching + and keeping results on the server side when _arangodump_ is invoked + with the `--parallel-dump` option. It does not have an effect for + _arangodump_ invocations that did not use the `--parallel-dump` option. + Note that the memory usage limit is not exact and that it can be + slightly exceeded in some situations to guarantee progress. +- `--dump.max-docs-per-batch`: Maximum number of documents per batch + that can be used in a dump. If an _arangodump_ invocation requests + higher values than configured here, the value is automatically + capped to this value. Will only be followed for _arangodump_ invocations + that use the `--parallel-dump` option. 
+- `--dump.max-batch-size`: Maximum batch size value (in bytes) that + can be used in a dump. If an _arangodump_ invocation requests larger + batch sizes than configured here, the actual batch size is capped + to this value. Will only be followed for arangodump invocations that + use the `--parallel-dump` option. +- `--dump.max-parallelism`: Maximum parallelism (number of server-side + threads) that can be used in a dump. If an _arangodump_ invocation requests + a higher number of prefetch threads than configured here, the actual + number of server-side prefetch threads is capped to this value. + Will only be followed for _arangodump_ invocations that use the + `--parallel-dump` option. + +The following metrics have been added to observe the behavior of parallel +_arangodump_ operations on the server: + +- `arangodb_dump_memory_usage`: Current memory usage of all ongoing + _arangodump_ operations on the server. +- `arangodb_dump_ongoing`: Number of currently ongoing _arangodump_ + operations on the server. +- `arangodb_dump_threads_blocked_total`: Number of times a server-side + dump thread was blocked because it honored the server-side memory + limit for dumps. + ## Internal changes From d430a5085747d67119864cae9a06578ff0207b00 Mon Sep 17 00:00:00 2001 From: Paula Date: Wed, 18 Oct 2023 10:58:01 +0200 Subject: [PATCH 2/6] improved dump performance --- .../3.12/release-notes/version-3.12/whats-new-in-3-12.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md index 8c0de6cd45..6403f4d39e 100644 --- a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md +++ b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md @@ -228,7 +228,12 @@ limited number of collections/shards/indexes. 
#### Improved dump performance - +ArangoDB 3.12 includes extended parallelization capabilities to work not only +at the collection level, but also at the shard level. In combination with the +new optimized format, database dumps are now created and restored quickly and +occupy minimal disk space. This major performance boost makes dumps five times +faster and restores three times faster, which is extremely useful when dealing +with large shards. #### Resource usage limits From c512b044efa082b7f71bf0f72e2a31671790154a Mon Sep 17 00:00:00 2001 From: Paula Date: Wed, 25 Oct 2023 16:53:13 +0200 Subject: [PATCH 3/6] clarifications --- .../version-3.12/whats-new-in-3-12.md | 30 +++++++++++++++---- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md index 6403f4d39e..fad08c0098 100644 --- a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md +++ b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md @@ -230,15 +230,33 @@ limited number of collections/shards/indexes. ArangoDB 3.12 includes extended parallelization capabilities to work not only at the collection level, but also at the shard level. In combination with the -new optimized format, database dumps are now created and restored quickly and -occupy minimal disk space. This major performance boost makes dumps five times -faster and restores three times faster, which is extremely useful when dealing +new optimized format, database dumps are now created and restored more quickly +and occupy minimal disk space. This major performance boost makes dumps and +restores up to several times faster, which is extremely useful when dealing with large shards. +The new dump variant can be enabled via `--use-parallel-dump`. The default +value is `true`. 
+ +To achieve the best dump performance and the smallest data dumps in terms of +size, you can use the `--dump-vpack` option. The resulting dump data is stored +in velocypack format instead of JSON. The velocypack format is more compact than +JSON, therefore the output file size can be reduced compared to JSON, even +when compression is enabled, and can also lead to faster dumps. Note, however, +that this option is experimental and disabled by default. + +Optionally, you can make _arangodump_ write multiple output files per +collection/shard. The file splitting allows better parallelization when +writing the results into the output file, which in case of non-split files +must be serialized. +You can enable it by setting the `--split-files` option to `true`. This option +is disabled by default considering that dumps created with this option enabled +cannot be restored into previous versions of ArangoDB easily. + #### Resource usage limits -The following startup options that can be used to limit -the resource usage of parallel _arangodump_ invocations have been added: +The following `arangod` startup options can be used to limit +the resource usage of parallel _arangodump_ invocations: - `--dump.max-memory-usage`: Maximum memory usage (in bytes) to be used by the server-side parts of all ongoing _arangodump_ invocations. @@ -256,7 +274,7 @@ the resource usage of parallel _arangodump_ invocations have been added: - `--dump.max-batch-size`: Maximum batch size value (in bytes) that can be used in a dump. If an _arangodump_ invocation requests larger batch sizes than configured here, the actual batch size is capped - to this value. Will only be followed for arangodump invocations that + to this value. Will only be followed for _arangodump_ invocations that use the `--parallel-dump` option. - `--dump.max-parallelism`: Maximum parallelism (number of server-side threads) that can be used in a dump. 
If an _arangodump_ invocation requests From fcc67bc9876f0b4c1eb123272c2c521f1870d409 Mon Sep 17 00:00:00 2001 From: Paula Mihu <97217318+nerpaula@users.noreply.github.com> Date: Mon, 30 Oct 2023 12:23:49 +0100 Subject: [PATCH 4/6] Apply suggestions from code review Co-authored-by: Simran --- .../version-3.12/whats-new-in-3-12.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md index 9c0a180016..e5ef70bb9f 100644 --- a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md +++ b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md @@ -245,12 +245,12 @@ limited number of collections/shards/indexes. ### arangodump -#### Improved dump performance +#### Improved dump performance and size ArangoDB 3.12 includes extended parallelization capabilities to work not only at the collection level, but also at the shard level. In combination with the -new optimized format, database dumps are now created and restored more quickly -and occupy minimal disk space. This major performance boost makes dumps and +new VelocyPack format, database dumps are now created and restored more quickly +and occupy less disk space. This major performance boost makes dumps and restores up to several times faster, which is extremely useful when dealing with large shards. @@ -258,11 +258,12 @@ The new dump variant can be enabled via `--use-parallel-dump`. The default value is `true`. To achieve the best dump performance and the smallest data dumps in terms of -size, you can use the `--dump-vpack` option. The resulting dump data is stored -in velocypack format instead of JSON. The velocypack format is more compact than +size, you can additionally use the `--dump-vpack` option. The resulting dump data is stored +in VelocyPack format instead of JSON. 
The VelocyPack format is more compact than JSON, therefore the output file size can be reduced compared to JSON, even -when compression is enabled, and can also lead to faster dumps. Note, however, -that this option is experimental and disabled by default. +when compression is enabled. It can also lead to faster dumps because there is less data to +transfer and no conversion from the server-internal format (VelocyPack) to JSON is needed. +Note, however, that this option is experimental and disabled by default. Optionally, you can make _arangodump_ write multiple output files per collection/shard. The file splitting allows better parallelization when @@ -270,7 +271,7 @@ writing the results into the output file, which in case of non-split files must be serialized. You can enable it by setting the `--split-files` option to `true`. This option is disabled by default considering that dumps created with this option enabled -cannot be restored into previous versions of ArangoDB easily. +cannot be restored into previous versions of ArangoDB. #### Resource usage limits From b4046dc9dd7e6fc4854112004c63e4277858990a Mon Sep 17 00:00:00 2001 From: Paula Mihu <97217318+nerpaula@users.noreply.github.com> Date: Mon, 30 Oct 2023 12:27:48 +0100 Subject: [PATCH 5/6] Apply suggestions from code review Co-authored-by: Simran --- .../3.12/release-notes/version-3.12/whats-new-in-3-12.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md index e5ef70bb9f..12641ed453 100644 --- a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md +++ b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md @@ -254,8 +254,8 @@ and occupy less disk space. This major performance boost makes dumps and restores up to several times faster, which is extremely useful when dealing with large shards. 
-The new dump variant can be enabled via `--use-parallel-dump`. The default -value is `true`. +Whether the new parallel dump variant is used is controlled by the newly added +`--use-parallel-dump` startup option. The default value is `true`. To achieve the best dump performance and the smallest data dumps in terms of size, you can additionally use the `--dump-vpack` option. The resulting dump data is stored From 09b53804b3258c39ed8c39e4a4726075192b655a Mon Sep 17 00:00:00 2001 From: Simran Spiller Date: Wed, 8 Nov 2023 16:09:04 +0100 Subject: [PATCH 6/6] Review --- .../components/tools/arangodump/examples.md | 47 ++++++++++++++-- .../version-3.12/whats-new-in-3-12.md | 56 ++++++++++--------- 2 files changed, 72 insertions(+), 31 deletions(-) diff --git a/site/content/3.12/components/tools/arangodump/examples.md b/site/content/3.12/components/tools/arangodump/examples.md index 593d7bc457..f9cd1b5182 100644 --- a/site/content/3.12/components/tools/arangodump/examples.md +++ b/site/content/3.12/components/tools/arangodump/examples.md @@ -113,7 +113,7 @@ with these attributes: Document data for a collection is saved in files with name pattern `.data.json`. Each line in a data file is a document insertion/update or -deletion marker, alongside with some meta data. +deletion marker. ## Cluster Backup @@ -213,12 +213,14 @@ RocksDB encryption-at-rest feature. ## Compression -`--compress-output` +The size of dumps can be reduced using compression, for storing but also for the +data transfer. -Data can optionally be dumped in a compressed format to save space on disk. -The `--compress-output` option cannot be used together with [Encryption](#encryption). +You can optionally store data in a compressed format to save space on disk with +the `--compress-output` startup option. It cannot be used together with +[Encryption](#encryption). -If compression is enabled, no `.data.json` files are written. Instead, the +If output compression is enabled, no `.data.json` files are written. 
Instead, the collection data gets compressed using the Gzip algorithm and for each collection a `.data.json.gz` file is written. Metadata files such as `.structure.json` and `.view.json` do not get compressed. @@ -234,13 +236,39 @@ detects whether the data is compressed or not based on the file extension. arangorestore --input-directory "dump" ``` +You can optionally let the server compress the data for the network transfer +with the `--compress-transfer` startup option. This can reduce the traffic and +thus save time and money. + +The data is automatically decompressed on the client side. You can use the option +independent of the `--compress-output` option, which controls whether to store +the dump compressed or not but without affecting the transfer size. + +``` +arangodump --output-directory "dump" --compress-transfer --compress-output false +``` + +{{< comment >}} Experimental feature in 3.12 +## Storage format + +The default output format for dumps is JSON. + +To achieve the best dump performance and the smallest data dumps in terms of +size, you can enable the `--dump-vpack` startup option. The resulting dump data +is then stored in the more compact but binary [VelocyPack](http://github.com/arangodb/velocypack) +format instead of the text-based JSON format. The output file size can be less +even compared to compressed JSON. It can also lead to faster dumps because there +is less data to transfer and no conversion from the server-internal VelocyPack +format to JSON is needed. +{{< /comment >}} + ## Threads _arangodump_ can use multiple threads for dumping database data in parallel. To speed up the dump of a database with multiple collections, it is often beneficial to increase the number of _arangodump_ threads. The number of threads can be controlled via the `--threads` option. The default -value was changed from `2` to the maximum of `2` and the number of available CPU cores. +value is the maximum of `2` and the number of available CPU cores. 
The `--threads` option works dynamically, its value depends on the number of available CPU cores. If the amount of available CPU cores is less than `3`, a @@ -267,3 +295,10 @@ file should be expected. Also note that when dumping the data of multiple shards from the same collection, each thread's results are written to the result file in a non-deterministic order. This should not be a problem when restoring such dump, as _arangorestore_ does not assume any order of input. + +From v3.12.0 onward, you can make _arangodump_ write multiple output files per +collection/shard. The file splitting allows for better parallelization when +writing the results to disk, which in case of non-split files must be serialized. +You can enable it with the `--split-files` startup option. It is disabled by +default because dumps created with this option enabled cannot be restored into +previous versions of ArangoDB. diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md index 12641ed453..ad830618bd 100644 --- a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md +++ b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md @@ -232,7 +232,7 @@ can be mixed and written into the same .sst files. When these options are enabled, the RocksDB compaction is more efficient since a lot of different collections/shards/indexes are written to in parallel. -The disavantage of enabling these options is that there can be more .sst +The disadvantage of enabling these options is that there can be more .sst files than when the option is turned off, and the disk space used by these .sst files can be higher. In particular, on deployments with many collections/shards/indexes @@ -247,33 +247,39 @@ limited number of collections/shards/indexes. 
#### Improved dump performance and size -ArangoDB 3.12 includes extended parallelization capabilities to work not only -at the collection level, but also at the shard level. In combination with the -new VelocyPack format, database dumps are now created and restored more quickly -and occupy less disk space. This major performance boost makes dumps and +From version 3.12 onward, _arangodump_ has extended parallelization capabilities +to work not only at the collection level, but also at the shard level. +In combination with the newly added support for the VelocyPack format that +ArangoDB uses internally, database dumps can now be created and restored more +quickly and occupy less disk space. This major performance boost makes dumps and restores up to several times faster, which is extremely useful when dealing with large shards. -Whether the new parallel dump variant is used is controlled by the newly added -`--use-parallel-dump` startup option. The default value is `true`. - -To achieve the best dump performance and the smallest data dumps in terms of -size, you can additionally use the `--dump-vpack` option. The resulting dump data is stored -in VelocyPack format instead of JSON. The VelocyPack format is more compact than -JSON, therefore the output file size can be reduced compared to JSON, even -when compression is enabled. It can also lead to faster dumps because there is less data to -transfer and no conversion from the server-internal format (VelocyPack) to JSON is needed. -Note, however, that this option is experimental and disabled by default. - -Optionally, you can make _arangodump_ write multiple output files per -collection/shard. The file splitting allows better parallelization when -writing the results into the output file, which in case of non-split files -must be serialized. -You can enable it by setting the `--split-files` option to `true`. 
This option -is disabled by default considering that dumps created with this option enabled -cannot be restored into previous versions of ArangoDB. - -#### Resource usage limits +- Whether the new parallel dump variant is used is controlled by the newly added + `--use-parallel-dump` startup option. The default value is `true`. + +- To achieve the best dump performance and the smallest data dumps in terms of + size, you can additionally use the `--dump-vpack` option. The resulting dump data + is then stored in the more compact but binary VelocyPack format instead of the + text-based JSON format. The output file size can be less even compared to + compressed JSON. It can also lead to faster dumps because there is less data to + transfer and no conversion from the server-internal format (VelocyPack) to JSON + is needed. Note, however, that this option is **experimental** and disabled by + default. + +- Optionally, you can make _arangodump_ write multiple output files per + collection/shard. The file splitting allows for better parallelization when + writing the results to disk, which in case of non-split files must be serialized. + You can enable it by setting the `--split-files` option to `true`. This option + is disabled by default because dumps created with this option enabled cannot + be restored into previous versions of ArangoDB. + +- You can enable the new `--compress-transfer` startup option for compressing the + dump data on the server for a faster transfer. This is helpful especially if + the network is slow or its capacity is maxed out. The data is decompressed on + the client side and recompressed if you enable the `--compress-output` option. + +#### Resource usage limits and metrics The following `arangod` startup options can be used to limit the resource usage of parallel _arangodump_ invocations: