bug: Gitingest fetches entire repo for tagged subdirectories #196

@jpotw

Hi! I've noticed some unexpected behavior with gitingest when fetching a subdirectory from a specific tag. It grabs the whole repository instead of just the subdirectory. The same kind of subdirectory URL works fine for branches.

Expected:

When I use a tag URL that points to a subdirectory (see the command in Steps to Reproduce below), gitingest should only fetch the files in that subdirectory, just like it does for branches.


Observed:

gitingest downloads a ton of files (hitting the max file limit on big repos like PyTorch) and seems to ignore the subdirectory part of the tag URL. It's pulling the entire repository for that tag.


Steps to Reproduce:

  1. Tag (Fails): Run:
    gitingest https://github.com/pytorch/pytorch/tree/v2.4.1/torch/distributed/elastic/agent/server
    (Note: this command times out unless --recurse-submodules is removed; see "Why Use --recurse-submodules in clone_repo? It slows down cloning large repos", #195.)

  2. Observe (Tag): You'll see output like this, indicating it's processing the whole repository:
    Maximum file limit (10000) reached
    ... (repeated many times) ...
    Analysis complete! Output written to: digest.txt
    
    Summary:
    Repository: pytorch/pytorch
    Files analyzed: 10000  # Should be much smaller!
    
    Estimated tokens: 16.8M
    ...
    

  3. Branch (Works): Now try the same subdirectory, but on the main branch:
    gitingest https://github.com/pytorch/pytorch/tree/main/torch/distributed/elastic/agent/server

  4. Observe (Branch): You'll see the correct output:
    Analysis complete! Output written to: digest.txt
    
    Summary:
    Repository: pytorch/pytorch
    Files analyzed: 4  # Correct!
    Subpath: /torch/distributed/elastic/agent/server
    
    Estimated tokens: 12.3k
    ...
    

Comparison (Branch Behavior - Working):

Just to confirm, this works perfectly for subdirectories on branches (both main and others):

  • Main: gitingest .../tree/main/... (Correct output: 4 files)
  • Other Branch: gitingest .../tree/gh/qqaatw/26/orig/... (Correct output: 4 files)

The full commands and output are in the steps above; the key difference is the Files analyzed count.
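For anyone who wants to check this without reading the output by eye, here is a minimal sketch that pulls the Files analyzed count out of a summary. The helper name `files_analyzed` is hypothetical (it is not part of gitingest), and the two sample strings are trimmed copies of the summaries shown above.

```python
import re


def files_analyzed(summary_text):
    """Return the integer after 'Files analyzed:', or None if the line is absent.

    Hypothetical helper for illustration; not a gitingest API.
    """
    match = re.search(r"Files analyzed:\s*(\d+)", summary_text)
    return int(match.group(1)) if match else None


# Trimmed copies of the two summaries from the steps above.
tag_summary = "Repository: pytorch/pytorch\nFiles analyzed: 10000\n"
branch_summary = (
    "Repository: pytorch/pytorch\n"
    "Files analyzed: 4\n"
    "Subpath: /torch/distributed/elastic/agent/server\n"
)

# The tag run hits the 10000-file limit, while the branch run sees 4 files.
tag_count = files_analyzed(tag_summary)
branch_count = files_analyzed(branch_summary)
```

A regression check for a fix could then assert that the tag count matches the count for the same subdirectory on a branch, assuming the directory contents are the same on both refs.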


It seems like gitingest handles tagged subdirectories differently than branch subdirectories, leading to unexpected behavior and hitting the file limit.
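To make the suspicion concrete, here is a rough sketch of one way a /tree/ URL could be split into a ref and a subpath so that tags and branches are treated the same. This is not gitingest's actual parsing code, which I haven't confirmed. The function name `split_ref_and_subpath` and the `known_refs` argument (a caller-supplied set of branch and tag names) are assumptions made for illustration.

```python
from urllib.parse import urlparse


def split_ref_and_subpath(url, known_refs):
    """Split a GitHub /tree/ URL into (ref, subpath).

    known_refs is a hypothetical, caller-supplied collection holding both
    branch names and tag names; how it is obtained is outside this sketch.
    """
    parts = urlparse(url).path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/tree/<ref...>/<subpath...>
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a /tree/ URL: {url}")
    tail = parts[3:]
    # Try the longest candidate first so refs that contain slashes
    # (for example gh/qqaatw/26/orig) are matched as a whole.
    for i in range(len(tail), 0, -1):
        candidate = "/".join(tail[:i])
        if candidate in known_refs:
            return candidate, "/" + "/".join(tail[i:])
    raise ValueError(f"no known ref matches: {url}")


refs = {"main", "v2.4.1", "gh/qqaatw/26/orig"}
tag_url = (
    "https://github.com/pytorch/pytorch/tree/v2.4.1/"
    "torch/distributed/elastic/agent/server"
)
ref, subpath = split_ref_and_subpath(tag_url, refs)
# ref is "v2.4.1" and subpath is "/torch/distributed/elastic/agent/server"
```

The point of the sketch is that if tags are missing from the set of refs being matched, the subpath for a tag URL cannot be recovered this way, which would be consistent with the whole repository being ingested.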

I'd be happy to help investigate and potentially submit a PR if you can confirm this is a bug! Let me know what you think.

Thanks!

Labels: bug, work in progress
