Pulling all data from many accounts is quite slow #58
Comments
So right now CloudQuery already pulls data concurrently from the same region. It should be pretty easy to add the same logic for accounts. One issue I think we might hit is rate limiting if we have too many concurrent API calls. A few thoughts: maybe we can add a variable that specifies the number of concurrent requests? Another option is to have one long-running job that fetches all the data and then subscribes to CloudTrail logs to pull only the resources that changed. What do you think?
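A minimal sketch of the "variable for the number of concurrent requests" idea, written in Python for brevity rather than in CloudQuery's own code. The names `fetch_region`, `fetch_all`, and `max_concurrent_requests` are hypothetical and only illustrate capping how many account/region fetches are in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fetch_region(account_id, region):
    """Hypothetical stand-in for one account/region fetch; replace with real provider calls."""
    return f"{account_id}/{region}"


def fetch_all(accounts, regions, max_concurrent_requests=4, fetch=fetch_region):
    """Run one fetch per (account, region) pair with at most max_concurrent_requests in flight."""
    pairs = list(product(accounts, regions))
    # The pool size is the single knob that bounds concurrency across all accounts.
    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as pool:
        # map() returns results in the same order as the input pairs.
        results = list(pool.map(lambda pair: fetch(*pair), pairs))
    return dict(zip(pairs, results))
```

Setting `max_concurrent_requests=1` would reproduce the sequential behaviour, so the limit could be raised gradually while watching for throttling.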
That assumes you have a CloudTrail log, and it might be challenging to follow the stream. If the long-running process dies, would it know where to pick back up, or would it re-pull everything and then restart following the stream? I like the idea of parallelizing as much as possible and having robust retry/backoff logic for provider API calls. I made a separate issue for that: #59
Rate limiting could be a quick fix at first, since even with concurrency the AWS API rate-limits by IP, not only by access key/role.
Yeah, I guess robust retry/backoff should be part of the solution here. Also, there is the AWS SDK v2, which should be faster in general, and I think we need to migrate to it. Another option is to try to pull data from AWS Config in bulk (I've never tried it, so it's just an idea).
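For illustration, retry/backoff around a provider call could look roughly like the Python sketch below. It uses exponential backoff with full jitter; `ThrottledError` and `call_with_backoff` are hypothetical names, and a real implementation would catch whatever throttling error the SDK actually raises.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical exception standing in for a provider throttling error."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Cap the exponential delay, then sleep a random fraction of it so
            # concurrent workers do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is only there so the delay can be swapped out in tests; the jitter matters more once many accounts and regions are fetched in parallel.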
Hey @yevgenypats, what do you mean by 'AWS Config in bulk'?
@Rackme I haven't done enough research yet, but an idea I had in the back of my mind is to try the https://docs.aws.amazon.com/config/latest/APIReference/API_SelectResourceConfig.html or https://docs.aws.amazon.com/config/latest/APIReference/API_BatchGetResourceConfig.html API calls to get the data without going through the standard APIs, which should help with the throttling issue. I'm not sure it's possible, though, and this API might not have all the data we want. Are you familiar with AWS Config? Maybe you can help shed some light on this one?
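As a rough Python sketch of what paging through `SelectResourceConfig` might look like: the helper `select_all` is hypothetical, and it assumes a client object exposing `select_resource_config` with `Expression`, `Limit`, and `NextToken` parameters and a response carrying `Results` (JSON-encoded strings) and `NextToken`, as the boto3 `config` client does.

```python
import json


def select_all(config_client, expression, page_size=100):
    """Yield every row of an AWS Config advanced query, following NextToken across pages."""
    kwargs = {"Expression": expression, "Limit": page_size}
    while True:
        page = config_client.select_resource_config(**kwargs)
        for row in page.get("Results", []):
            yield json.loads(row)  # each result is a JSON-encoded string
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
```

With boto3 installed and credentials configured, this could be called as `select_all(boto3.client("config"), "SELECT resourceId, resourceType WHERE resourceType = 'AWS::EC2::Instance'")`. Whether the returned data is complete enough for this use case is exactly the open question.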
@yevgenypats I've never used AWS Config to pull a bunch of data, only for a few checks, sorry... As you said, some of the services already covered by cloudquery (directconnect, emr, organizations) seem to be missing from their schema: If there is a maximum response size, the I've tried with
AWS Config does not have a complete accounting of resources in AWS. IAM access keys are a good example. You can get IAM users/roles/groups out of Config, but you cannot query access key IDs. You could use it to get some subset of the data, but it wouldn't be complete.
I think AWS tends to rate limit at the account level. So if you run each account concurrently, they shouldn't step on each other's toes unless AWS also implements a global rate limit based on IP or something like that.
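If the limits really are per account, one way to use that is to cap in-flight calls per account rather than globally. A minimal Python sketch under that assumption follows; `fetch_accounts`, `per_account_limit`, and the `fetch` callable are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fetch_accounts(accounts, regions, fetch, per_account_limit=2):
    """Fetch every (account, region) pair, capping in-flight calls per account, not globally."""
    # One semaphore per account: accounts proceed in parallel, while calls
    # within the same account are limited to per_account_limit at a time.
    limits = {account: threading.Semaphore(per_account_limit) for account in accounts}

    def worker(account, region):
        with limits[account]:
            return fetch(account, region)

    # Region-major order interleaves accounts, so workers are not all queued
    # behind a single account's semaphore.
    pairs = [(account, region) for region in regions for account in accounts]
    workers = max(1, len(accounts) * per_account_limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pair: pool.submit(worker, *pair) for pair in pairs}
        return {pair: future.result() for pair, future in futures.items()}
```

If AWS also throttles by source IP, a global cap would still be needed on top of this.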
When fetching data from multiple AWS accounts via roles, would it be possible to run each account fetch concurrently, or at least somehow batch the operations? If you have many accounts, it takes quite a long time to fetch all the data, since it processes every account, every region, and every resource sequentially.