
🐛 Bug Report: Timeout error while processing large parent location registered by Github Catalog Discovery Plugin #24369

Open
vikyathharekal opened this issue Apr 19, 2024 · 1 comment
Labels: area:catalog (Related to the Catalog Project Area), bug (Something isn't working), help wanted (Help/Contributions wanted from community members)

Comments

@vikyathharekal

📜 Description

We have a repository that contains over 3000 catalog-info.yaml files. Here is the directory structure -

repo

  • folder1
    • catalog-info.yaml
  • folder2
    • catalog-info.yaml
  • ....
  • folder3000
    • catalog-info.yaml

Now the GitHub Auto Discovery plugin has registered one Location that looks something like this: https://github.com/vikyathharekal/book-my-tickets/tree/main/services/*/catalog-info.yaml

This led to 3000 Components showing up in the Catalog. However, since then, refreshes have been failing intermittently for most entities. The error is "GitHub API time out" (see screenshot). The logs show that the GitHub Contents API fails for only one or two of the 3000 entities on every refresh loop.

Options to mitigate this -

  1. Failure of one or two API calls should not mark the parent location as failed.
  2. Retry mechanism to retry the GitHub API calls on certain errors like timeout.
  3. Emit multiple locations (3000 here) instead of one location with *.
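Option 2 could look something like the sketch below. This is not Backstage code; the `fetchWithRetry` name, the message-based transient-error check, and the backoff parameters are all hypothetical illustrations of the idea:

```typescript
// Hypothetical retry helper for transient GitHub API failures (sketch only).
// Retries an async operation on timeout-like errors with exponential backoff.
async function fetchWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Only retry errors that look transient (assumption: message-based check).
      const message = String((error as Error).message ?? error);
      const transient = /timeout|ETIMEDOUT|ECONNRESET|503/i.test(message);
      if (!transient || attempt === maxAttempts) throw error;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw lastError;
}
```

A permanent error (e.g. 404) would still fail immediately; only errors matching the transient pattern get retried.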

👍 Expected behavior

  • It should ideally handle errors for each blob API call independently and not fail the location.
  • Also, there should be some retry mechanism so that we retry the GitHub API calls on certain errors like timeout.
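The first point could be sketched with `Promise.allSettled`, so one failed blob fetch is recorded per file instead of failing the whole location. The `fetchAllBlobs` function and the result shape here are hypothetical, not part of the Backstage API:

```typescript
// Sketch: fetch each matching file independently so a single failure does
// not mark the whole location as failed. `fetchBlob` is a stand-in for a
// per-file GitHub Contents/Blob API call.
type BlobResult =
  | { url: string; status: 'ok'; content: string }
  | { url: string; status: 'error'; reason: string };

async function fetchAllBlobs(
  urls: string[],
  fetchBlob: (url: string) => Promise<string>,
): Promise<BlobResult[]> {
  const settled = await Promise.allSettled(urls.map(u => fetchBlob(u)));
  return settled.map((s, i) =>
    s.status === 'fulfilled'
      ? { url: urls[i], status: 'ok' as const, content: s.value }
      : { url: urls[i], status: 'error' as const, reason: String(s.reason) },
  );
}
```

The caller could then refresh the entities that succeeded and surface only the one or two failures, rather than failing all 3000.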

👎 Actual Behavior with Screenshots

Screenshot
The GitHub Contents API fails for only one or two of the 3000 entities on every refresh loop, yet this marks the entire parent location as failed.

👟 Reproduction steps

Test repository: https://github.com/vikyathharekal/book-my-tickets/tree/main/services

GitHub Catalog Discovery Plugin configuration:

```yaml
providers:
  github:
    providerId:
      organization: vikyathharekal
      catalogPath: /services/**/catalog-info.yaml
      filters:
        branch: main
        repository: book-my-tickets
```

📃 Provide the context for the Bug.

No response

🖥️ Your Environment

No response

👀 Have you spent some time to check if this bug has been raised before?

  • I checked and didn't find a similar issue

🏢 Have you read the Code of Conduct?

Are you willing to submit PR?

Yes I am willing to submit a PR!

@vikyathharekal vikyathharekal added the bug Something isn't working label Apr 19, 2024
@github-actions github-actions bot added the area:catalog Related to the Catalog Project Area label Apr 19, 2024
@Rugvip Rugvip added the help wanted Help/Contributions wanted from community members label Apr 22, 2024
@Rugvip
Member

Rugvip commented Apr 23, 2024

There are a couple of things that get wired together for this ingestion end to end. The entity provider will create a location in the catalog with the wildcard intact; that will in turn be picked up by the UrlReaderProcessor and treated as a "search" target here:

```ts
const response = await this.options.reader.search(location, { etag });
```

That in turn takes us to the GitHubUrlReader in this case, which does an initial search request here:

```ts
const recursiveTree: GhTreeResponse = await this.fetchJson(
  treesUrl.replace('{/sha}', `/${sha}?recursive=true`),
  init,
);
```
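For context, the Git Trees API response carries a `truncated` flag when the listing is too large to return in full (this is documented GitHub REST API behavior); the interface and check below are a simplified sketch, not the actual Backstage types:

```typescript
// Simplified shape of a GitHub Git Trees API response; the real response
// also carries sha, url, and richer per-item metadata.
interface GhTreeResponse {
  truncated: boolean;
  tree: Array<{ path: string; type: 'blob' | 'tree'; url?: string }>;
}

// Sketch: per-file fetching is only safe when the listing is complete.
// If GitHub truncated it, individual fetches would silently miss entries,
// so the reader must fall back to downloading the whole archive.
function canUseIndividualFetches(response: GhTreeResponse): boolean {
  return !response.truncated;
}
```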

If that response is not truncated, it will continue on to do individual fetches for each matching file over here:

```ts
const blob: GhBlobResponse = await this.fetchJson(item.url!, init);
```

I think we need to make sure this fetch doesn't happen in this case. I don't know whether it does right now, but based on the error message it seems like it might. What would be better is if it instead falls through to a full read of the repo tree, so that all files are available locally, over here:

```ts
const tree = await this.doReadTree(archiveUrl, sha, '', init, {
```

If we end up there, we don't have any issues with individual file fetches, since everything is available upfront.

So to move this ahead, we'd appreciate it if anyone wants to validate the above hypothesis that the initial search is not truncated. If that ends up being the case, I think we should probably add some additional check that causes us to fall back to the read-tree path for very large repos.
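One way to sketch that additional check. The threshold value and function name here are hypothetical illustrations, not a proposal for the actual API:

```typescript
// Hypothetical heuristic: above some match count, individual blob fetches
// become fragile (one timeout fails the whole location), so prefer the
// archive/read-tree path instead.
const MAX_INDIVIDUAL_FETCHES = 100; // assumed threshold, not a Backstage constant

function shouldFallBackToReadTree(
  matchingFileCount: number,
  truncated: boolean,
): boolean {
  // Truncated listings must always use the archive path; large but complete
  // listings fall back once they exceed the threshold.
  return truncated || matchingFileCount > MAX_INDIVIDUAL_FETCHES;
}
```

With 3000 matches, as in this issue, such a check would route the reader to `doReadTree`, avoiding per-file API calls entirely.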
