Pulling all data from many accounts is quite slow #58

Closed
James-Quigley opened this issue Feb 12, 2021 · 10 comments

Comments

@James-Quigley
Contributor

When fetching data from multiple AWS accounts via roles, would it be possible to run each account fetch concurrently, or at least batch the operations somehow? With many accounts it takes quite a long time to fetch all the data, since every account, every region, and every resource is fetched sequentially.

@yevgenypats
Member

Actually, CloudQuery now pulls data concurrently within the same region. It should be pretty easy to add the same logic for accounts. One issue I think we might hit is rate limits, if we have too many concurrent API calls.

A few thoughts: maybe we can add a variable that specifies the number of concurrent requests? Another option is to have one long-running job that fetches all the data and then subscribes to CloudTrail logs to pull only the resources that changed. What do you think?

@James-Quigley
Contributor Author

James-Quigley commented Feb 12, 2021

That assumes you have a CloudTrail log. And it might be challenging to follow the stream. If the long-running process dies, would it know where to pick back up, or would it re-pull everything and then restart following the stream?

I like the idea of parallelizing as much as possible, and having robust retry/backoff logic for provider API calls. I made a separate issue for that: #59

@Rackme
Contributor

Rackme commented Feb 15, 2021

Rate limiting could be a quick fix at first, since even with concurrency the AWS API is rate limited by IP, not only by access key/role.

@yevgenypats
Member

Yeah, I guess robust retry/backoff should be part of the solution here. Also, there is AWS V2, which should be faster in general, and I think we need to migrate to it. Another option is to try to pull data from AWS Config in bulk (never tried it, so just an idea).

@Rackme
Contributor

Rackme commented Feb 15, 2021

Hey @yevgenypats, what do you mean by 'AWS Config in bulk'?

@yevgenypats
Member

@Rackme I haven't done enough research yet, but an idea I had in the back of my mind is to try the https://docs.aws.amazon.com/config/latest/APIReference/API_SelectResourceConfig.html or https://docs.aws.amazon.com/config/latest/APIReference/API_BatchGetResourceConfig.html API calls to get the data some way other than the standard APIs, which should help with the throttling issue. I'm not sure it's possible, though, and this API might not have all the data we want. Are you familiar with AWS Config? Maybe you can help shed some light on this one?

@Rackme
Contributor

Rackme commented Feb 15, 2021

@yevgenypats I've never used AWS Config to pull a bunch of data, only for a few checks, sorry...

As you said, some services already covered by CloudQuery (directconnect, emr, organizations) seem to be missing from their schema:
https://github.com/awslabs/aws-config-resource-schema/tree/master/config/properties/resource-types

If there is a maximum response size, the Select API documentation is a little worrying about how easily pagination can be handled, don't you think?
LIMIT
Valid Range: Minimum value of 0. Maximum value of 100.

I've tried with select-resource-config, and only 25 resources are returned per page.

@zscholl
Contributor

zscholl commented Feb 20, 2021

AWS Config does not have a complete accounting of resources in AWS. IAM access keys are a good example: you can get IAM users/roles/groups out of Config, but you cannot query access key IDs.

You could use it to get some subset of the data, but it wouldn't be complete.

@James-Quigley
Contributor Author

I think AWS tends to rate limit at the account level. So if you run each account concurrently, they shouldn't step on each other's toes, unless AWS also implements a global rate limit based on IP or something like that.

yevgenypats added a commit to cloudquery/cq-provider-aws that referenced this issue Mar 10, 2021