doc: Add performance tuning section (#4639)
I started this PR mainly as a way to document the new wildcard options, but ultimately updated a few things relating to the CloudQuery docs:
 - added brief notes about wildcards to the source plugin reference sections for `tables` and `skip_tables`
 - placed detailed information about wildcards in a `Performance tuning` page under `Advanced Topics`. The idea is that we will expand this page over time.
 - updated the docs relating to `concurrency` (`resource_concurrency` and `table_concurrency` were deprecated)
 - some other misc fixes
hermanschaaf committed Nov 15, 2022
1 parent 75162dd commit b90ff96
Showing 9 changed files with 75 additions and 18 deletions.
2 changes: 1 addition & 1 deletion website/components/mdx/_configure.mdx
@@ -91,7 +91,7 @@ spec:
```

- All general options for source spec you can find under [references/source-spec](/docs/reference/source-spec).
-- All options for `postgresql` destination plugin spec you can find [here](https://github.com/cloudquery/cloudquery/blob/main/plugins/source/aws/docs/configuration.md)
+- All options for `aws` source plugin spec you can find [here](https://github.com/cloudquery/cloudquery/blob/main/plugins/source/aws/docs/configuration.md)

<Callout>

9 changes: 5 additions & 4 deletions website/pages/docs/advanced-topics/_meta.json
@@ -1,8 +1,9 @@
{
"environment-variable-substitution": "Environment Variable Substitution",
-"running-cloudquery-in-parallel": "Running CloudQuery in Parallel",
-"proxy-configuration": "Proxy Configuration",
"docker": "Docker",
-"security": "Security",
-"rate-limiting": "Rate Limiting"
+"proxy-configuration": "Proxy Configuration",
+"performance-tuning": "Performance Tuning",
+"rate-limiting": "Rate Limiting",
+"running-cloudquery-in-parallel": "Running CloudQuery in Parallel",
+"security": "Security"
}
58 changes: 58 additions & 0 deletions website/pages/docs/advanced-topics/performance-tuning.md
@@ -0,0 +1,58 @@
---
title: Performance Tuning
---

# Performance Tuning

This page contains a number of tips and tricks for improving the performance of `cloudquery sync` for large cloud estates.

## Wildcard Matching

import { Callout } from 'nextra-theme-docs'

Sometimes the easiest way to improve the performance of the `sync` command is to limit the number of tables that get synced. The `tables` and `skip_tables` source config options both support wildcard matching. This means that you can use `*` anywhere in a name to match multiple tables.

For example, when using the `aws` source plugin, it is possible to use a wildcard pattern to match all tables related to AWS EC2:

```yaml
tables:
- aws_ec2_*
```

This can also be combined with `skip_tables`. For example, let's say we want to include all EC2 tables, but not EBS-related ones:

```yaml
tables:
- "aws_ec2_*"
skip_tables:
- "aws_ec2_ebs_*"
```
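
To make the combined behavior concrete, here is a minimal Python sketch of how `tables` and `skip_tables` patterns could be resolved against a list of table names. It is an illustration only, not CloudQuery's implementation: `wildcard_to_regex` and `select_tables` are hypothetical helpers, the table names are example inputs, and `*` is assumed to be the only special character.

```python
import re


def wildcard_to_regex(pattern):
    # Treat "*" as "any run of characters"; every other character is literal.
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def select_tables(all_tables, tables, skip_tables=()):
    """Hypothetical helper: keep names matching `tables` but not `skip_tables`."""
    include = [wildcard_to_regex(p) for p in tables]
    skip = [wildcard_to_regex(p) for p in skip_tables]
    return [
        name
        for name in all_tables
        if any(rx.fullmatch(name) for rx in include)
        and not any(rx.fullmatch(name) for rx in skip)
    ]


# Example table names, for illustration only.
known = ["aws_ec2_instances", "aws_ec2_ebs_volumes", "aws_s3_buckets"]
selected = select_tables(known, ["aws_ec2_*"], ["aws_ec2_ebs_*"])
print(selected)  # ['aws_ec2_instances']
```

In this model a table is kept only if at least one `tables` pattern matches it and no `skip_tables` pattern does, so the skip list always wins.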

<Callout>

The CloudQuery CLI will warn if a wildcard pattern does not match any known tables.

</Callout>

## Improving Performance by Skipping Relations

Some tables require many API calls to sync. This is especially true of tables that depend on other tables, because often multiple API calls need to be made for every row in the parent table. This can lead to thousands of API calls, increasing the time it takes to sync. If you know that some child tables are not strictly necessary, you can improve sync performance by skipping them with the `skip_tables` setting.

Let's say we have three tables: `A`, `B` and `C`. `A` is the top-level table. `B` depends on it, and `C` depends on `B`:

```text
A
↳ B
↳ C
```

We might want table `A`, but not need the information in table `B`. We can then write our source config as:

```yaml
tables:
- A
skip_tables:
- B
```

By skipping table `B`, we automatically skip its dependent table `C` as well. Likewise, by including table `A`, we automatically include its dependent tables `B` and `C`, unless they are explicitly skipped in `skip_tables` (as in the example above).
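
The inheritance rule for the `A`, `B`, `C` example can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the CLI's actual code: `CHILDREN` and `resolve` are hypothetical names, and tables are matched by exact name to keep the example short.

```python
# Hypothetical parent -> children mapping for the A, B, C example.
CHILDREN = {"A": ["B"], "B": ["C"], "C": []}


def resolve(tables, skip_tables):
    """Return the tables to sync: each listed table plus its descendants,
    pruning any skipped table together with everything below it."""
    skipped = set(skip_tables)
    result = []

    def visit(name):
        if name in skipped or name in result:
            return  # a skipped table hides its whole subtree
        result.append(name)
        for child in CHILDREN.get(name, []):
            visit(child)

    for name in tables:
        visit(name)
    return result


print(resolve(["A"], ["B"]))  # ['A']
print(resolve(["A"], []))     # ['A', 'B', 'C']
```

Because the walk stops at a skipped table, `C` is never reached when `B` is skipped, which mirrors the behavior described above.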
10 changes: 3 additions & 7 deletions website/pages/docs/advanced-topics/rate-limiting.md
@@ -4,12 +4,8 @@ title: Rate Limiting

# Rate Limiting

-There are two main levers to control the rate at which CloudQuery fetches resources from cloud providers. These are the `table_concurrency` and `resource_concurrency` options that can be specified as [part of the source spec](/docs/reference/source-spec). Note that these options were introduced in CloudQuery CLI v1.0.8.
+There is currently one main lever to control the rate at which CloudQuery fetches resources from cloud providers. This setting is called `concurrency`, and it can be specified as [part of the source spec](/docs/reference/source-spec). Note that this option was introduced in CloudQuery CLI v1.4.1.

-## Table Concurrency
+## Concurrency

-`table_concurrency` controls the number of concurrent tables that will be processed while performing a sync. Setting this to a low number will reduce the number of concurrent requests, making it less likely to hit rate limits. The trade-off is that syncs will take longer to complete.
-
-## Resource Concurrency
-
-`resource_concurrency` is an approximate global limit on how many concurrent requests will be made to fetch details about the initial rows returned by a table's resolver. This limit applies only to top-level tables, and child relations will not be limited. Setting this to a lower number will also reduce the number of concurrent requests made, regardless of how many tables are being synced at any one time. As with `table_concurrency`, the trade-off is that syncs will take longer to complete.
+`concurrency` provides rough control over the number of concurrent requests that will be made while performing a sync. Setting this to a low number will reduce the number of concurrent requests, reducing the memory used and making the sync less likely to hit rate limits. The trade-off is that syncs will take longer to complete.
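
As a generic illustration of this trade-off (not CloudQuery's scheduler), the Python sketch below caps the number of in-flight requests with a semaphore. `fetch` and `run_sync` are hypothetical stand-ins, and the short sleep simulates an API call.

```python
import asyncio


async def fetch(i, sem, stats):
    async with sem:  # at most `limit` coroutines run this body at once
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for an API request
        stats["in_flight"] -= 1
    return i


async def run_sync(n_requests, limit):
    sem = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch(i, sem, stats) for i in range(n_requests))
    )
    return results, stats["peak"]


results, peak = asyncio.run(run_sync(n_requests=20, limit=4))
print(len(results), peak)
```

With a lower `limit`, fewer requests are open at any moment, but the same 20 requests need more rounds to finish, so the total run time grows.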
2 changes: 1 addition & 1 deletion website/pages/docs/quickstart/linux.mdx
@@ -1,5 +1,5 @@
---
-title: Linux
+title: Quickstart - Linux
---

import Intro from '../../../components/mdx/_intro.mdx'
2 changes: 1 addition & 1 deletion website/pages/docs/quickstart/macOS.mdx
@@ -1,5 +1,5 @@
---
-title: macOS
+title: Quickstart - macOS
---

import Intro from '../../../components/mdx/_intro.mdx'
2 changes: 1 addition & 1 deletion website/pages/docs/quickstart/windows.mdx
@@ -1,5 +1,5 @@
---
-title: Windows
+title: Quickstart - Windows
---

import Intro from '../../../components/mdx/_intro.mdx'
3 changes: 2 additions & 1 deletion website/pages/docs/reference/cli/_meta.json
@@ -1,4 +1,5 @@
{
"cloudquery": "cloudquery",
-"cloudquery_sync": "cloudquery sync"
+"cloudquery_sync": "cloudquery sync",
+"cloudquery_migrate": "cloudquery migrate"
}
5 changes: 3 additions & 2 deletions website/pages/docs/reference/source-spec.md
@@ -14,6 +14,7 @@ spec:
name: "aws"
path: "cloudquery/aws"
version: "v6.0.0" # latest version of aws plugin
+tables: ["*"]
destinations: ["postgresql"]

spec:
@@ -57,13 +58,13 @@ Configures how to retrieve the plugin. The contents depend on the value of `regi

(`[]string`, optional, default: `["*"]`)

-Tables to sync from the source plugin.
+Tables to sync from the source plugin. It accepts wildcards. For example, to match all EC2-related tables, use `aws_ec2_*`. Matched tables will also sync all their descendant tables, unless these are skipped in `skip_tables`.

### skip_tables

(`[]string`, optional, default: `[]`)

-Useful when using glob in `tables`, specify which tables to skip when syncing the source plugin.
+Useful when using wildcards in `tables`. Specify which tables to skip when syncing the source plugin. Note that if a table that has dependent tables is skipped, all of those dependent tables will also be skipped.

### destinations
