Performance considerations #486

Open
rossbeehler opened this issue Jan 25, 2023 · 8 comments

@rossbeehler

rossbeehler commented Jan 25, 2023

We're on GitLab SaaS with over 3000 repos and climbing, and our GitLabForm nightly process takes many, many hours to run. Are there any tips, tricks, etc. to make this faster? I know I could use the group structure and just run concurrent CI/CD jobs per 2nd-level group, but I wondered if there were any other ideas, configurations, etc. that I might be missing. For example, it would be nice if there were a concurrency setting, so that all groups in the hierarchy are processed separately/concurrently based on that setting.

gdubicki added a commit that referenced this issue Jan 25, 2023
as reported in #486
@gdubicki
Member

gdubicki commented Jan 25, 2023

Hey @rossbeehler!

I have thought about adding concurrency a few times in the past, but:
a) I always ended up concluding that there is no need,
b) it's not trivial.

I felt that it's not needed because at Egnyte, where we use a self-hosted GitLab instance, applying the config for a bit over 1000 projects and over 30 groups takes about 16 minutes now. I assumed that this is not much for a pretty large scale.

And it's not trivial because of the output - you'd have to implement some buffering to prevent the output for all groups and projects from getting mixed up.

Anyway, because I don't have much time for the project lately (see #343), I am open to PRs adding this.

As for the other things that you might do:

  • run your CI/CD worker as close to GitLab.com as possible for minimal latency - according to their public docs they are hosted in GCP in us-east1,
  • whenever available, use group-level configurations to apply settings once per group instead of for each project individually - e.g. group_variables instead of project_variables,
  • review the verbose output of your run and look for:

Perhaps to be continued...

Let me know what you think!

@rossbeehler
Author

Thanks for the detailed response, @gdubicki. It may take me a while, but I'll see if I can test in GCP us-east1 and report back on how much it improves performance. We are in Azure East US 2 at the moment, so only a couple of states away, but I'm sure co-locating on the same cloud/region would make a significant difference.

I will say we see a fairly consistent amount of time spent processing each project, but we'll also turn on verbose output at some point to see if anything stands out.

@nejch
Contributor

nejch commented Jan 29, 2023

Keep in mind that with a lot of API calls you probably start to hit GitLab's rate limiting rather than just network performance issues:

https://docs.gitlab.com/ee/user/gitlab_com/index.html#gitlabcom-specific-rate-limits

At least that would be my assumption, as I also work with a large self-hosted instance. So even with concurrency, since urllib3's Retry respects Retry-After headers by default, I think it would slow down after 429 responses. I may be wrong though, I would have to benchmark that.

So one aspect of optimization would be for gitlabform to make as few requests as possible (e.g. ensure the maximum of 100 items per page for pagination, avoid repeating the same GET calls if the data has already been fetched, etc.). Just an idea though!
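
A rough sketch of both points, assuming plain requests + urllib3 (this is not gitlabform's actual code, and the token handling is a placeholder):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry honours the Retry-After header on 429s by default
# (respect_retry_after_header=True), so rate-limited calls wait
# for the advertised interval instead of failing outright.
retry = Retry(
    total=5,
    status_forcelist=[429, 500, 502, 503, 504],
    backoff_factor=1,
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.headers["PRIVATE-TOKEN"] = "<token>"  # placeholder


def list_all(url, **params):
    """Paginate with the maximum per_page=100 to keep the request count down."""
    params = {"per_page": 100, **params}
    while url:
        resp = session.get(url, params=params)
        resp.raise_for_status()
        yield from resp.json()
        url = resp.links.get("next", {}).get("url")  # follow the Link header
        params = {}  # the "next" URL already carries the query string


projects = list(list_all("https://gitlab.com/api/v4/projects", membership=True))
```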

@adam-moss

We have this challenge with 10k+ repos, with this tool, danger, triage-bot, and renovate - basically anything we want to run across the estate. What we found was that running any of them as one continuous run took hours.

What we did was take the view that the tool is not the issue; at a repo level it is fast enough, so it was our execution approach that was suboptimal at scale. What we do now is:

  1. hit the GitLab API for a list of all groups & projects
  2. use that to generate child pipelines, running 1 instance of the job for each repo
  3. batch them into blocks of size n, whatever you're comfortable with within the rate limits
  4. use resource_group in the .gitlab-ci.yml to ensure only X child pipelines run in parallel.

Doing this took our renovate run-time from > 24hrs to < 3hrs, which was certainly a win for us.
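
For illustration only (this is not the script linked below), a minimal Python sketch of steps 1-4 above, with made-up names, batch size and image: the generated YAML is a single pipeline with one job per batch, and a small pool of resource_group names caps how many run at once.

```python
import os
import requests
import yaml  # PyYAML, assumed to be available in the generator job's image

GITLAB_URL = os.environ.get("CI_SERVER_URL", "https://gitlab.com")
BATCH_SIZE = 50     # projects per generated job
MAX_PARALLEL = 5    # distinct resource_group names => max jobs running at once

session = requests.Session()
session.headers["PRIVATE-TOKEN"] = os.environ["GITLAB_TOKEN"]  # placeholder CI variable


def all_project_paths(group_id):
    """Step 1: list every project under the top-level group (per_page=100)."""
    url = f"{GITLAB_URL}/api/v4/groups/{group_id}/projects"
    params = {"per_page": 100, "include_subgroups": True, "archived": False}
    while url:
        resp = session.get(url, params=params)
        resp.raise_for_status()
        yield from (p["path_with_namespace"] for p in resp.json())
        url = resp.links.get("next", {}).get("url")
        params = {}


def generated_pipeline(paths):
    """Steps 2-4: one job per batch; resource_group limits concurrency."""
    jobs = {}
    batches = [paths[i:i + BATCH_SIZE] for i in range(0, len(paths), BATCH_SIZE)]
    for n, batch in enumerate(batches):
        jobs[f"gitlabform-batch-{n}"] = {
            "image": "ghcr.io/gitlabform/gitlabform:latest",  # adjust to the image you use
            "resource_group": f"slot-{n % MAX_PARALLEL}",
            # assumes config.yml is present in the job's working directory
            "script": [f"gitlabform {path}" for path in batch],
        }
    return jobs


if __name__ == "__main__":
    paths = list(all_project_paths(os.environ["TOP_LEVEL_GROUP_ID"]))  # placeholder
    print(yaml.safe_dump(generated_pipeline(paths)))
```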

We also use the Audit Stream to run on triggers.

This isn't the code we use now, but you can see an earlier iteration I shared over at the renovate repo: renovatebot/renovate#13172 (reply in thread).

@nejch
Contributor

nejch commented Feb 16, 2023

I think that makes sense; if I remember correctly we had a similar approach, but just with parallel: and the CI_NODE_* variables. Btw, you might want to be careful with renovating 10k repos from a single renovate instance :P https://docs.renovatebot.com/gitlab-bot-security/ - it should get better after 15.9 with the job token scope.
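
For completeness, a tiny sketch (file name and command are placeholders) of how a driver script inside a parallel: job could shard the project list using those variables:

```python
import os
import subprocess

# GitLab sets these when a job uses `parallel: N` in .gitlab-ci.yml:
# CI_NODE_TOTAL = N and CI_NODE_INDEX = 1..N (1-based).
node_index = int(os.environ.get("CI_NODE_INDEX", 1))
node_total = int(os.environ.get("CI_NODE_TOTAL", 1))

# The full project list could come from an artifact of a previous job;
# the file name here is made up.
with open("all-projects.txt") as f:
    projects = sorted(line.strip() for line in f if line.strip())

# Deterministic round-robin shard: node 1 takes items 0, N, 2N, ...
my_shard = projects[node_index - 1::node_total]

for project in my_shard:
    subprocess.run(["gitlabform", project], check=True)
```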

@adam-moss

Yeah, it's a risk, but the exploit opportunity is minimised as much as is currently possible, and you need an active SSO session on our IdP. Token scopes will definitely be better, but it is only once support lands for adding the email address and signing key, and for regenerating rather than recreating PRaTs, that it can be truly secured.

@gdubicki
Member

What target are you using to run the app, @rossbeehler? ALL_DEFINED? ALL? Something else?

@rossbeehler
Author

We're running it against our top-level group for our organization.
