
Allow to delay/limit the number of GitHub repo checks and Pull requests in time #7690

Closed
Apollon77 opened this issue May 24, 2022 · 22 comments


Apollon77 commented May 24, 2022

Describe the problem

We have an on-premise Weblate installation which is connected to more than 80 different GitHub repositories, because we have a very decentralized plugin approach for a smart home system. They are configured as described in https://docs.weblate.org/en/latest/vcs.html?highlight=github#github-pull-requests

Today, after a reboot of the Weblate container, we received many emails from Weblate telling us that

or later also

  • Rate limits for user reached

It seems to me that something was hanging, so Weblate did not update the repositories over time, and many of them were outdated at startup. Now it has started doing the updates and pull requests one after the other (judging from the process list, it felt like partially two in parallel?).
Also: repositories that ran into such an error seem to be retried every 20 seconds! (at least from what Weblate shows me in "last seen" on the warnings page)

Describe the solution you'd like

It would be great to have an option to delay such "mass updates" so that they can be matched to the GitHub rate limits ... e.g. one per 5 minutes or something like that.

Additionally, there should be a configurable "delay after a repo got an error before it is retried" ... it seems we have some repos that are retried nearly once every 20 seconds!
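Purely as an illustration of the request, such knobs could look roughly like this in settings.py; these setting names are hypothetical and do not exist in Weblate:

```python
# Purely hypothetical settings sketch - these names do NOT exist in Weblate,
# they only illustrate the kind of configuration being requested.
VCS_PUSH_MIN_INTERVAL = 5 * 60    # seconds between two "mass update" pushes
VCS_ERROR_RETRY_DELAY = 30 * 60   # seconds to wait before retrying a failed repo
```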

Describe alternatives you've considered

None so far, because none came to our mind ... We also need to find out what actually happened that made it stop processing the updates in general ...

Screenshots

No response

Additional context

No response


nijel commented May 24, 2022

Seems like your Celery workers were not running, and the queued tasks are now being processed. I don't think this is a scenario we should support. In normal operation, rate limiting is not needed as commits don't happen often enough to hit any limits.

@nijel nijel added the question This is more a question for the support than an issue. label May 24, 2022
@github-actions

This issue looks more like a support question than an issue. We strive to answer these reasonably fast, but purchasing the support subscription is not only more responsible and faster for your business but also makes Weblate stronger.

In case your question is already answered, making a donation is the right way to say thank you!

@Apollon77

You could be right about the Celery workers ... now they run, but simply too fast ... also, as said, they are sometimes retrying every 25 seconds ... in my eyes this should be slowed down in general ... with many repositories and an active translation community it could be possible to run into rate limits in normal cases too :-)


nijel commented May 24, 2022

We're not running into GitHub limits on Hosted Weblate with thousands of projects, but indeed it could possibly happen.

@nijel nijel added enhancement Adding or requesting a new feature. undecided These features might not be implemented. Can be prioritized by sponsorship. and removed question This is more a question for the support than an issue. labels May 24, 2022
@github-actions

This issue has been put aside. It is currently unclear if it will ever be implemented as it seems to cover too narrow of a use case or doesn't seem to fit into Weblate.

Please try to clarify the use case or consider proposing something more generic to make it useful to more users.


Apollon77 commented May 24, 2022

We're not running into GitHub limits on Hosted Weblate with thousands of projects, but indeed it could possibly happen.

We also did not for the last 2 years :-) Now we had that issue and no way to fix it besides:

  • stop Weblate
  • wait 1-x hours for the rate limits to recover
  • start again ... see what comes through, and stop again as soon as we see rate limits ...
  • restart at step 1 :-)

We see that the same repo is hammered every 5-20 seconds ... this does not make the rate limit any better :-) and it "blocks the Weblate repo" but still tries to push because it thinks it can. Why does it not detect the rate limit response and change the retry strategy?

In fact, it is completely up to you what you do ... I can just say: I have a real-life issue and try to discuss/bring up ideas on how to prevent such issues for others - even if they are rare ones that "should" never happen.
Also: There was nothing - no info, nothing - that made me think that Celery was somehow not running as it should for a long time ... so I have no idea (besides a weekly restart or such) how to prevent this in the future.


nijel commented May 25, 2022

There is no reason why a single repo should get a pull request that often. See https://docs.weblate.org/en/latest/admin/continuous.html#lazy-commit for info on when Weblate commits changes to Git (which in the default configuration triggers pushing to the upstream repository).

There was nothing - no info, nothing - that made me think that Celery was somehow not running as it should for a long time

In case the queue is long, there should be an exclamation mark in the top navigation for all superusers, or it can be seen in the performance view. See also https://docs.weblate.org/en/latest/admin/install.html#monitoring-weblate, https://docs.weblate.org/en/latest/faq.html#how-can-i-check-whether-my-weblate-is-set-up-properly, https://docs.weblate.org/en/latest/admin/install.html#monitoring-celery-status

nijel added a commit that referenced this issue May 25, 2022
@UncleSamSwiss

There is no reason why a single repo should get a pull request that often.

The issue here is that the repo operation fails (due to the rate limit), so it is retried after about 25 seconds.
Now, this happens for all of our ~80 projects at once. This of course does not help with the rate limit, as we are now making a request roughly every 0.3 seconds (25/80 seconds). So, in my opinion there should be some kind of (exponential) back-off or a limit on GitHub requests per minute.

Note: I'm working on the same issue together with @Apollon77
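For illustration, a minimal sketch of the suggested exponential back-off with jitter (function names and delays are made up; this is not Weblate's actual code):

```python
import random
import time

# Hypothetical sketch: retry a failed GitHub push with exponentially
# growing, jittered delays instead of a fixed ~25 second retry.
def push_with_backoff(push, max_attempts=5, base_delay=60.0):
    for attempt in range(max_attempts):
        try:
            return push()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 60 s, 120 s, 240 s, ... plus jitter so that ~80 repositories
            # do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```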

@Apollon77

@nijel > there should be an exclamation mark in the top navigation for all superusers, or it can be seen in the performance view.

Hm ... as admin I'm not in the tool daily ... I would have expected an email ...
We also have Sentry linked to it, and I checked it and only found one entry: https://sentry.iobroker.net/share/issue/fe33d940637d40b4be4a9a616855f68f/ ... but because it had also happened earlier I did not expect it to be an issue ... I expected it to be retried :-)

BTW, regarding the requests ...
[Screenshot: Sentry error overview, 2022-05-25]

Here you can see Sentry reporting it ... since yesterday we have had this error 12,000 (!!) times for roughly 80 repositories ... and for most of the last 24 h Weblate was offline, in the hope that the rate limits would somehow recover - so it was effectively running for only about 5 h in that timeframe.


nijel commented May 25, 2022

Hm ... as admin I'm not in the tool daily ... I would have expected an email ...

The e-mails are being sent using Celery as well, so it would be a bit impractical to try to notify about non-working Celery using e-mail.

The best approach is to add Weblate metrics to whatever monitoring you are using. There is a metrics API endpoint exposing all the important info - the most important thing to look for is configuration_errors, but if you want more insight, looking at individual celery_queues is helpful as well. There is an existing integration for Munin.
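For example, an external monitor could poll that metrics endpoint roughly like this (URL, token handling and thresholds are assumptions; adjust to your installation):

```python
# Rough sketch of polling the Weblate metrics API from an external monitor;
# endpoint path, token handling and thresholds are assumptions.
import requests

WEBLATE_URL = "https://weblate.example.com"  # hypothetical instance
API_TOKEN = "..."                            # a Weblate API token

resp = requests.get(
    f"{WEBLATE_URL}/api/metrics/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
metrics = resp.json()

# Alert when configuration errors appear or the Celery queues pile up.
if metrics.get("configuration_errors", 0) > 0:
    print("Weblate reports configuration errors!")
for queue, length in metrics.get("celery_queues", {}).items():
    if length > 100:
        print(f"Celery queue {queue} is backed up: {length} tasks")
```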

Anyway, back to the original topic - there had to be a huge queue of tasks which caused this. There is no retrying of failed pushes; it is just tried again when a commit is triggered. Also, it's not caused by Weblate being offline, but by part of Weblate being alive and part not.

nijel added a commit that referenced this issue May 25, 2022
@shun2wang

Seems we also have this problem: https://hosted.weblate.org/projects/jasp/jaspcircular-qml/#alerts
This is from the JASP project, and we also have an annual subscription.


nijel commented Dec 19, 2022

@shun2wang Does it happen regularly? I've just manually pushed the repo (you could have done that as well).


shun2wang commented Dec 19, 2022

@nijel Yes, I'm not sure how regularly it happens, but we've had this problem many times. It clears the translated strings on Weblate.

Here we are using a GitHub workflow to update translations; this work is done automatically.

EDIT: I just learned that my colleague from JASP has contacted you, thank you


nijel commented Dec 19, 2022

That action can always lose translations – you don't force Weblate to push changes before trying to merge. So, there is always the possibility that Weblate has pending changes which were not committed yet (see https://docs.weblate.org/en/latest/admin/continuous.html#lazy-commit). To be on the safe side, invoke wlc push before finding existing pull requests in findAndMergeWeblate.sh. That way, you will also better handle the failed pull request situation – the script will fail instead of discarding translations.
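As a rough sketch, such a guard could look like this if the workflow step were written in Python (the component path is hypothetical; the point is only to abort when wlc push fails):

```python
# Sketch: force Weblate to push pending changes before merging, and fail
# loudly if that does not work, instead of silently discarding translations.
import subprocess
import sys

# Hypothetical component path; adjust to the actual Weblate project/component.
result = subprocess.run(["wlc", "push", "jasp/jaspcircular-qml"])
if result.returncode != 0:
    sys.exit("wlc push failed - aborting merge to avoid losing translations")
```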


nijel commented Dec 19, 2022

Back to the original topic – the problem is that GitHub doesn't give any information about what is triggering this:

Requests that create content which triggers notifications, such as issues, comments and pull requests, may be further limited and will not include a Retry-After header in the response. Please create this content at a reasonable pace to avoid further limiting.

I will ask their support for more info.


nijel commented Jan 6, 2023

Okay, there is nothing better than trying and slowing down if we hit this:

Secondary rate limits are there to prevent problematic traffic patterns which cause performance and reliability concerns on our infrastructure. It helps ensure the stability of our platform for all customers and applications; not just one.

We have different abuse rate limiters and the one you're hitting does not return the retry-after header as you expected.

The algorithm behind anti-abuse rate limits is not that simple (they observe multiple factors, and we also tweak them over time). In addition to that, since these are designed specifically to prevent abuse and performance problems on our end -- we can't really share all the details about how they work. I'm sure you would also agree that having them published defeats their purpose.

As referenced in our documentation, our recommendation for situations like this is that integrators create this content at a reasonable pace to avoid being further limited (i.e slow down the request their App is making to create PRs significantly, then gradually increase the request rate until the error appears again. At that point, you can then slow down a bit more.)
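For illustration, the pacing strategy described above could look roughly like this (class and parameter names are made up; this is not Weblate's implementation):

```python
import time

# Illustrative sketch of GitHub's suggested pacing: start slow, speed up
# gradually, and back off again when the secondary rate limit error appears.
class AdaptivePacer:
    def __init__(self, initial_delay=10.0, min_delay=1.0):
        self.delay = initial_delay
        self.min_delay = min_delay

    def before_request(self):
        # Wait before every content-creating request (e.g. opening a PR).
        time.sleep(self.delay)

    def on_success(self):
        # Gradually increase the request rate (shrink the delay).
        self.delay = max(self.min_delay, self.delay * 0.9)

    def on_rate_limited(self):
        # The error appeared again: slow down a bit more.
        self.delay *= 2
```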


mcm1957 commented Jan 8, 2023

The problem occurred again.

This time a single string was changed, which caused updates in approx. 100 components. Weblate tried to create pull requests for all of them (nearly) at once, hitting the GitHub limit again.

According to GitHub (https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits) there should be a delay of at least 1 s between two GitHub requests.

So please consider simply adding a possibility to configure/add some delay between two GitHub requests.
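As a rough illustration, such a fixed spacing between consecutive GitHub requests could look like this (a simplified single-process sketch, not Weblate's actual code; Celery workers would need a shared lock and timestamp, e.g. in the cache):

```python
import threading
import time

# Illustrative sketch: enforce a minimum gap between consecutive GitHub
# requests across threads.
_lock = threading.Lock()
_last_request = 0.0
MIN_INTERVAL = 1.0  # GitHub recommends at least 1 second between requests

def throttled_request(do_request):
    global _last_request
    with _lock:
        wait = MIN_INTERVAL - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        try:
            return do_request()
        finally:
            # Update the timestamp before releasing the lock so the next
            # consumer sees it.
            _last_request = time.monotonic()
```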

nijel added a commit that referenced this issue Jan 9, 2023
Share the code among classes to have consistent error handling and to
reuse the code.

This will allow adding more features to the shared code later (rate limiting as described in #7690).
nijel added a commit that referenced this issue Jan 9, 2023
Perform API requests with the lock held to ensure we do not perform more than one of them at a time.

Issue #7690
@nijel nijel closed this as completed in 08fb1ea Jan 9, 2023
@nijel nijel added this to the 4.15.1 milestone Jan 9, 2023
nijel added a commit that referenced this issue Jan 9, 2023
This should reduce the number of issues with GitHub secondary rate limiting and make Weblate behave nicer towards the services.

Fixes #7690
@nijel nijel self-assigned this Jan 9, 2023
@UncleSamSwiss

Thanks a lot for the improvement, we are looking forward to version 4.15.1!


nijel commented Jan 10, 2023

I'm curious whether it really addresses the issue in all cases, or whether some additional modifications will be needed. Feedback is welcome.


mcm1957 commented Jan 10, 2023

I'm curious whether it really addresses the issue in all cases, or whether some additional modifications will be needed. Feedback is welcome.

Thanks very much for the change.

If the issue still occurs, one improvement could be to delay retries significantly. As Weblate normally does not update in real time, it could be a good idea to delay retries AFTER errors by several minutes or longer, to avoid triggering any rate limits due to (failing) retries. But this is not a priority - at the moment we will simply evaluate whether the problem arises again.

@shun2wang

I think there are still problems on 4.15.1-dev, see here.

nijel added a commit that referenced this issue Jan 12, 2023
nijel added a commit that referenced this issue Jan 12, 2023
nijel added a commit that referenced this issue Jan 12, 2023
This way we better handle sleeping in concurrent contexts.

Fixes #7690
nijel added a commit that referenced this issue Jan 12, 2023
Otherwise the next consumer might not get the updated timestamp.

Fixes #7690

nijel commented Jan 12, 2023

Okay, I will try some more tweaking.
