[AWS] Implement retry/backoff when being rate limited by a resource API #59

Closed
James-Quigley opened this issue Feb 12, 2021 · 9 comments

@James-Quigley
Contributor

If you hit the rate limits of a resource API, CloudQuery stops fetching data and errors out. It would be great if the CloudQuery AWS provider detected this type of error and retried the request with backoff.
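
For illustration only, here is a minimal sketch of the kind of retry/backoff wrapper being asked for. This is not the provider's actual code: isThrottleError is a hypothetical predicate, and a real implementation would inspect the AWS error code (e.g. Throttling, TooManyRequestsException) rather than the error string.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"strings"
	"time"
)

// isThrottleError is a hypothetical predicate; real code would inspect the
// AWS error code instead of matching on the error message.
func isThrottleError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Throttling")
}

// withBackoff retries fn with exponential backoff plus jitter while the error
// looks like a rate-limit error, up to maxAttempts attempts.
func withBackoff(ctx context.Context, maxAttempts int, fn func(context.Context) error) error {
	backoff := 200 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := fn(ctx)
		if err == nil || !isThrottleError(err) || attempt >= maxAttempts {
			return err
		}
		// Sleep with jitter, but give up early if the context is cancelled.
		delay := backoff + time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
}

func main() {
	err := withBackoff(context.Background(), 5, func(ctx context.Context) error {
		return fmt.Errorf("Throttling: rate exceeded") // simulated throttled API call
	})
	fmt.Println(err)
}
```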

@roneli
Contributor

roneli commented Mar 5, 2021

Fixed in cq-provider-aws v0.2.1 (cloudquery/cq-provider-aws#8).

roneli closed this as completed Mar 5, 2021
@Rackme
Contributor

Rackme commented Apr 1, 2021

Hello, I am not able to perform a fetch in my environment; it fails on random operations.

I assume this is due to a rate limit. Is there any way to show more debug output?

2021/03/31 14:06:35 rpc error: code = Unknown desc = operation error IAM: GenerateCredentialReport, request canceled, context deadline exceeded
ERROR: 1

Error: rpc error: code = Unknown desc = operation error S3: GetBucketCors, https response error StatusCode: 0, RequestID: , HostID: , canceled, context deadline exceeded

Error: rpc error: code = Unknown desc = operation error IAM: GetAccountAuthorizationDetails, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded

Used: cloudquery_Linux_x86_64-v0.11.6 / cq-provider-aws_linux_amd64-v0.2.17

@yevgenypats
Member

@Rackme can you try changing the following settings:

aws_debug: true
max_retries: 5
max_backoff: 30
timeout: 30

Maybe change timeout to 300?
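
For context, a rough sketch of how settings like max_retries and max_backoff usually map onto the aws-sdk-go-v2 standard retryer. This is an assumption about the shape of the wiring, not the provider's actual code:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	// Values mirroring the suggested provider settings.
	maxRetries := 5
	maxBackoff := 30 * time.Second

	cfg, err := config.LoadDefaultConfig(context.Background(),
		config.WithRetryer(func() aws.Retryer {
			// Standard retryer: exponential backoff capped at maxBackoff,
			// up to maxRetries attempts per API call.
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = maxRetries
				o.MaxBackoff = maxBackoff
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // cfg would then be used to build the service clients (IAM, S3, ...)
}
```

The separate timeout value, by contrast, appears to bound the request context, which is why exceeding it surfaces as "context deadline exceeded" rather than as a throttling error.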

@Rackme
Contributor

Rackme commented Apr 1, 2021

I tried with 30 and 300s for the timeout, with the following configuration; it's still failing on random operations:

providers:
  - name: aws
    version: latest
    accounts:     
      - id: production
      - id: sharedstorage
      - id: test-us
      - id: test-eu
      - id: test-rds
      - id: prod-kms
      - id: uballot
      - id: mmkpm
#    regions: # Optional. if commented out assumes all regions
#      - us-east-1
#      - us-west-2
    log_level: debug # Optional. if commented out will enable AWS SDK debug logging. possible values: debug, debug_with_signing, debug_with_http_body, debug_with_request_retries, debug_with_request_error, debug_with_event_stream_body
    aws_debug: true
    max_retries: 5
    max_backoff: 30
    timeout: 300
    resources: # You can comment out resources you are not interested in for faster fetching.
    *

Unfortunately, I don't have the request ID of the failed request:

11:21AM INF Fetched resources @module=aws account_id=12345678912 count=0 region=sa-east-1 resource=fsx.backups timestamp=2021-04-01T11:21:06.750Z
11:21AM INF Fetched resources @module=aws account_id=12345678912 count=1 region=sa-east-1 resource=ec2.security_groups timestamp=2021-04-01T11:21:06.751Z
SDK 2021/04/01 11:21:06 DEBUG Response
HTTP/2.0 200 OK
Content-Length: 14
Content-Type: application/x-amz-json-1.1
Date: Thu, 01 Apr 2021 11:21:06 GMT
X-Amzn-Requestid: aa6764e5-de84-4321-87ed-dec999a19428

SDK 2021/04/01 11:21:06 DEBUG Response
HTTP/1.1 200 OK
Content-Length: 210
Content-Type: application/x-amz-json-1.1
Date: Thu, 01 Apr 2021 11:21:06 GMT
X-Amzn-Requestid: 647c8634-da01-4ca4-bf66-2ddc95df1233

11:21AM INF Fetched resources @module=aws account_id=12345678912 count=1 region=sa-east-1 resource=ec2.network_acls timestamp=2021-04-01T11:21:06.768Z
11:21AM INF Fetched resources @module=aws account_id=12345678912 count=1 region=sa-east-1 resource=cloudtrail.trails timestamp=2021-04-01T11:21:06.776Z
11:21AM INF Fetched resources @module=aws account_id=12345678912 count=1 region=sa-east-1 resource=ec2.vpcs timestamp=2021-04-01T11:21:06.783Z
11:21AM INF Fetched resources @module=aws account_id=12345678912 count=0 region=sa-east-1 resource=sns.subscriptions timestamp=2021-04-01T11:21:06.841Z
SDK 2021/04/01 11:21:06 DEBUG Response
HTTP/1.1 200 OK
Content-Length: 291
Content-Type: text/xml
Date: Thu, 01 Apr 2021 11:21:05 GMT
X-Amzn-Requestid: 53c9f70e-63ad-55e9-8a85-afd1cd4fa10a

Error: rpc error: code = Unknown desc = operation error IAM: ListRoleTags, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded
11:21AM INF Fetched resources @module=aws account_id=12345678912 count=0 region=sa-east-1 resource=ecs.cluster timestamp=2021-04-01T11:21:06.901Z
SDK 2021/04/01 11:21:06 DEBUG Response
HTTP/1.1 200 
Content-Length: 108
Content-Type: application/x-amz-json-1.1
Date: Thu, 01 Apr 2021 11:21:06 GMT
X-Amzn-Requestid: 042f8c02-e5e8-4546-8952-c225be8da1b1

Usage:
  cloudquery fetch [flags]

Flags:
      --driver string   database driver postgresql/neo4j (env: CQ_DRIVER) (default "postgresql")
      --dsn string      database connection string (env: CQ_DSN) (example: 'host=localhost user=postgres password=pass DB.name=postgres port=5432')
  -h, --help            help for fetch
      --path string     path to configuration file. can be generated with 'gen config' command (env: CQ_CONFIG_PATH) (default "./config.yml")
      --version         version for fetch

Global Flags:
      --enableConsoleLog      Enable console logging (default true)
      --enableFileLogging     enableFileLogging makes the framework logging to a file (default true)
      --encodeLogsAsJson      EncodeLogsAsJson makes the logging framework logging JSON
      --logDirectory string   Directory to logging to to when file logging is enabled (default ".")
      --logFile string        Filename is the name of the logfile which will be placed inside the directory (default "cloudquery.log")
      --maxAge int            MaxAge the max age in days to keep a logfile (default 3)
      --maxBackups int        MaxBackups the max number of rolled files to keep (default 3)
      --maxSize int           MaxSize the max size in MB of the logfile before it's rolled (default 30)
      --plugin-dir string     Directory to save and load Cloudquery plugins from (env: CQ_PLUGIN_DIR) (default "/root")
  -v, --verbose               Enable Verbose logging

2021/04/01 11:21:06 rpc error: code = Unknown desc = operation error IAM: ListRoleTags, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded
root@59f586c902df:~# 

yevgenypats reopened this Apr 1, 2021
@yevgenypats
Member

After how long does it fail? If it fails after about 5 minutes, then I think you are hitting the 300-second timeout (can you try increasing it to 1200?). If it throws the error before 5 minutes have passed, it might be a different issue.

@Rackme
Contributor

Rackme commented Apr 1, 2021

Started at:

11:20AM INF No regions specified in config.yml. Assuming all 22 regions @module=aws timestamp=2021-04-01T11:20:00.526Z
11:20AM INF Configuring SDK retryer @module=aws max_backoff=30 retry_attempts=5 timestamp=2021-04-01T11:20:00.526Z
SDK 2021/04/01 11:20:00 DEBUG Request

Failed at:
SDK 2021/04/01 11:21:06 DEBUG Response

:'(

@yevgenypats
Member

Got it. Can you please try increasing the timeout to 3000 and also commenting out some of the accounts, just to understand whether it's a specific account or not? For example, try running on one account only.

@Rackme
Contributor

Rackme commented Apr 1, 2021

All good with the 8 accounts and a timeout at 3000 👍
Total execution time: 2021-04-01T12:02:28.017Z -> 2021-04-01T12:04:13.869Z ≈ 106s

@yevgenypats
Member

@Rackme I also fixed some bugs with how contexts are being passed, as well as a better default for the timeout, so this should be solved by https://github.com/cloudquery/cq-provider-aws/releases/tag/v0.2.18.
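
For readers hitting the same "context deadline exceeded" errors: a minimal sketch of how a fetch timeout translates into a context deadline that cancels all in-flight API calls at once. fetchAllResources is a hypothetical stand-in, not the provider's actual function:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchAllResources is a hypothetical stand-in for the provider's fetch logic.
func fetchAllResources(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend the fetch takes 2 seconds
		return nil
	case <-ctx.Done():
		// This is the path that surfaces as "context deadline exceeded".
		return ctx.Err()
	}
}

func main() {
	// A timeout that is too small for the amount of data being fetched
	// cancels every remaining API call, regardless of which resource it is.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	if err := fetchAllResources(ctx); err != nil {
		fmt.Println("fetch failed:", err) // fetch failed: context deadline exceeded
	}
}
```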
