Update recollection frequency #2550

Merged
merged 1 commit into from Oct 18, 2023

ABrain7710
Contributor

@ABrain7710 ABrain7710 commented Oct 18, 2023

Description
Repos are currently eligible for recollection once they are more than a day old. This causes problems in datasets where one user owns the majority of the repos and many other users each own a small number. For example, consider a dataset where one user has 20000 repos and 19 other users together have 2000 repos. The single user with a lot of repos only has a 25% chance of being selected by the scheduling algorithm (in fact, all users have a 25% chance). The other 19 users have enough repos that some of them are always more than 1 day old and can therefore be recollected. As a result, the 20000 repos from the single user are only considered 25% of the time, even if some of them are 3 months old and the repos for the other 19 users are 1 day old.

What makes this difficult is that this is the expected behavior: we don't want users who add a lot of repos to steal all the bandwidth. To solve this, I changed the recollection requirement to 7 days for core, 10 days for secondary, 7 days for facade, and 10 days for ml. The repos for the 19 users will likely be processed within a day or so, and for the remaining 6 days the older repos from the user with 20000 repos will be selected for collection every time, since that user is the only one left with repos eligible for collection.
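
To illustrate the effect of the new thresholds, here is a minimal sketch of the eligibility check. It is not the actual scheduler code: the names `RECOLLECTION_DAYS`, `is_eligible`, and `users_with_eligible_repos` are hypothetical, and treating never-collected repos as always eligible is an assumption.

```python
from datetime import datetime, timedelta

# Thresholds in days, taken from the description above.
# The dict name and its keys are illustrative, not Augur's actual identifiers.
RECOLLECTION_DAYS = {"core": 7, "secondary": 10, "facade": 7, "ml": 10}


def is_eligible(last_collected, phase, now):
    """Return True if a repo may be recollected for the given phase."""
    if last_collected is None:
        # Assumption: a repo that has never been collected is always eligible.
        return True
    return now - last_collected > timedelta(days=RECOLLECTION_DAYS[phase])


def users_with_eligible_repos(last_collected_by_user, phase, now):
    """Return the users that still have at least one eligible repo."""
    return [
        user
        for user, timestamps in last_collected_by_user.items()
        if any(is_eligible(ts, phase, now) for ts in timestamps)
    ]


now = datetime(2023, 10, 18)
data = {
    "big_user": [now - timedelta(days=90)],   # stale repo, 3 months old
    "small_user": [now - timedelta(days=1)],  # collected yesterday
}
print(users_with_eligible_repos(data, "core", now))
```

With the 7-day core threshold, only `big_user` remains a candidate here. Under the old 1-day rule, a small user whose repos were even slightly more than a day old would have stayed in the pool and kept competing for selection.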

This PR fixes #

  • Repos that were collected a day ago being collected before repos that haven't been collected in 3 months

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Andrew Brain <andrewbrain2019@gmail.com>
@sgoggins sgoggins merged commit b3ff92e into dev Oct 18, 2023
1 of 2 checks passed
@ABrain7710 ABrain7710 deleted the update-repo-collection-frequency branch February 20, 2024 00:47