
🐛 Bug Report: Timeout error while processing large parent location registered by Github Catalog Discovery Plugin #24369

Open
vikyathharekal opened this issue Apr 19, 2024 · 1 comment
Labels: area:catalog (Related to the Catalog Project Area), bug (Something isn't working), help wanted (Help/Contributions wanted from community members)

Comments

@vikyathharekal

📜 Description

We have a repository that contains over 3000 catalog-info.yaml files. Here is the directory structure -

repo

  • folder1
    • catalog-info.yaml
  • folder2
    • catalog-info.yaml
  • ....
  • folder3000
    • catalog-info.yaml

Now the GitHub Auto Discovery plugin has registered one Location that looks something like this: https://github.com/vikyathharekal/book-my-tickets/tree/main/services/*/catalog-info.yaml

This led to 3000 Components showing up in the Catalog. However, since then, refreshes have been failing intermittently for most entities. The error is "GitHub API time out" (see screenshot). The logs show that the GitHub Contents API fails for only one or two of the 3000 entities on every refresh loop.

Options to mitigate this -

  1. Failure of one or two API calls should not mark the parent location as failed.
  2. Retry mechanism to retry the GitHub API calls on certain errors like timeout.
  3. Emit multiple locations (3000 here) instead of one location with *.
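Option 2 could look something like the sketch below. This is not Backstage code; the `fetchWithRetry` name, the message-based transient-error check, and the backoff parameters are all hypothetical illustrations of the idea:

```typescript
// Hypothetical retry helper for transient GitHub API failures (sketch only).
// Retries an async operation on timeout-like errors with exponential backoff.
async function fetchWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Only retry errors that look transient (assumption: message-based check).
      const message = String((error as Error).message ?? error);
      const transient = /timeout|ETIMEDOUT|ECONNRESET|503/i.test(message);
      if (!transient || attempt === maxAttempts) throw error;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw lastError;
}
```

A permanent error (e.g. 404) would still fail immediately; only errors matching the transient pattern get retried.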

👍 Expected behavior

  • It should ideally handle errors for each blob API call independently and not fail the location.
  • Also, there should be some retry mechanism so that we retry the GitHub API calls on certain errors like timeout.
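The first point could be sketched with `Promise.allSettled`, so one failed blob fetch is recorded per file instead of failing the whole location. The `fetchAllBlobs` function and the result shape here are hypothetical, not part of the Backstage API:

```typescript
// Sketch: fetch each matching file independently so a single failure does
// not mark the whole location as failed. `fetchBlob` is a stand-in for a
// per-file GitHub Contents/Blob API call.
type BlobResult =
  | { url: string; status: 'ok'; content: string }
  | { url: string; status: 'error'; reason: string };

async function fetchAllBlobs(
  urls: string[],
  fetchBlob: (url: string) => Promise<string>,
): Promise<BlobResult[]> {
  const settled = await Promise.allSettled(urls.map(u => fetchBlob(u)));
  return settled.map((s, i) =>
    s.status === 'fulfilled'
      ? { url: urls[i], status: 'ok' as const, content: s.value }
      : { url: urls[i], status: 'error' as const, reason: String(s.reason) },
  );
}
```

The caller could then refresh the entities that succeeded and surface only the one or two failures, rather than failing all 3000.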

👎 Actual Behavior with Screenshots

Screenshot
The GitHub Contents API fails for only one or two of the 3000 entities on every refresh loop, yet this marks the entire parent location as failed.

👟 Reproduction steps

Test repository: https://github.com/vikyathharekal/book-my-tickets/tree/main/services

GitHub Catalog Discovery Plugin configuration:

```yaml
providers:
  github:
    providerId:
      organization: vikyathharekal
      catalogPath: /services/**/catalog-info.yaml
      filters:
        branch: main
        repository: book-my-tickets
```

📃 Provide the context for the Bug.

No response

🖥️ Your Environment

No response

👀 Have you spent some time to check if this bug has been raised before?

  • I checked and didn't find a similar issue

🏢 Have you read the Code of Conduct?

Are you willing to submit PR?

Yes I am willing to submit a PR!

@vikyathharekal vikyathharekal added the bug Something isn't working label Apr 19, 2024
@github-actions github-actions bot added the area:catalog Related to the Catalog Project Area label Apr 19, 2024
@Rugvip Rugvip added the help wanted Help/Contributions wanted from community members label Apr 22, 2024
@Rugvip
Member

Rugvip commented Apr 23, 2024

There are a couple of things that get wired together for this ingestion end to end. The entity provider will create a location in the catalog with the wildcard intact; that will in turn be picked up by the UrlReaderProcessor and treated as a "search" target here:

```ts
const response = await this.options.reader.search(location, { etag });
```

That in turn takes us to the GitHubUrlReader in this case, which does an initial search request here:

```ts
const recursiveTree: GhTreeResponse = await this.fetchJson(
  treesUrl.replace('{/sha}', `/${sha}?recursive=true`),
  init,
);
```
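For context, the Git Trees API response carries a `truncated` flag when the listing is too large to return in full (this is documented GitHub REST API behavior); the interface and check below are a simplified sketch, not the actual Backstage types:

```typescript
// Simplified shape of a GitHub Git Trees API response; the real response
// also carries sha, url, and richer per-item metadata.
interface GhTreeResponse {
  truncated: boolean;
  tree: Array<{ path: string; type: 'blob' | 'tree'; url?: string }>;
}

// Sketch: per-file fetching is only safe when the listing is complete.
// If GitHub truncated it, individual fetches would silently miss entries,
// so the reader must fall back to downloading the whole archive.
function canUseIndividualFetches(response: GhTreeResponse): boolean {
  return !response.truncated;
}
```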

If that response is not truncated, it will continue on to do individual fetches for each matching file over here:

```ts
const blob: GhBlobResponse = await this.fetchJson(item.url!, init);
```

I think we need to make sure this fetch doesn't happen in this case. I don't know whether it does right now, but based on the error message it seems like it might. What would be better is if it instead falls through to a full read of the repo tree, so that all files are available locally, over here:

```ts
const tree = await this.doReadTree(archiveUrl, sha, '', init, {
```

If we end up there, we don't have any issues with individual file fetches, since everything is available upfront.

So to move this ahead, we'd appreciate it if anyone wants to validate the above hypothesis that the initial search is not truncated. If that ends up being the case, I think we should probably add some additional check that causes us to fall back to the read-tree path for very large repos.
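One way to sketch that additional check. The threshold value and function name here are hypothetical illustrations, not a proposal for the actual API:

```typescript
// Hypothetical heuristic: above some match count, individual blob fetches
// become fragile (one timeout fails the whole location), so prefer the
// archive/read-tree path instead.
const MAX_INDIVIDUAL_FETCHES = 100; // assumed threshold, not a Backstage constant

function shouldFallBackToReadTree(
  matchingFileCount: number,
  truncated: boolean,
): boolean {
  // Truncated listings must always use the archive path; large but complete
  // listings fall back once they exceed the threshold.
  return truncated || matchingFileCount > MAX_INDIVIDUAL_FETCHES;
}
```

With 3000 matches, as in this issue, such a check would route the reader to `doReadTree`, avoiding per-file API calls entirely.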
