git2llm is a plug-and-play CLI tool and Python library that authenticates with GitHub, discovers repositories, mines commits, pull requests, issues, and tags in parallel, applies aggressive multi-stage quality filters, and generates clean JSONL datasets in Alpaca or ShareGPT format ready to drop into Unsloth, LLaMA-Factory, or Axolotl.
The project relies on the following key dependencies:
- Core Engine: Python
>=3.10 - Git Mining:
pydriller(shallow clones and commits traversal) - GitHub API Client:
pygithub(pull requests, issues, and release notes) - CLI & TUI:
click(CLI parser),inquirerpy(interactive checkbox prompts),rich(logging and progress visualizer) - Data & Configuration:
pydanticv2 (data validation and configuration models),pyyaml(YAML profiles) - Algorithms:
datasketch(MinHash LSH deduplication),tenacity(exponential backoff retry helper) - Environment:
uvpackage manager
git2llm is structured to minimize GitHub API consumption by performing commit mining locally via shallow clones, reserving GitHub REST API calls for PR and issue metadata.
┌─────────────────────────────────────────┐
│ git2llm CLI │
└────────────────────┬────────────────────┘
│
┌───────────────▼───────────────┐
│ Auth Layer │
│ (PAT or OAuth Device Flow) │
└───────────────┬───────────────┘
│ token
┌───────────────▼───────────────┐
│ Repo Discovery & TUI │
└───────────────┬───────────────┘
│ repositories
┌───────────────▼───────────────┐
│ Orchestrator Thread Pool │
└──┬─────────────────────────┬──┘
│ │
┌────────▼────────┐ ┌────────▼────────┐
│ Commit Collector│ │ PR Collector │
│ (PyDriller) │ │ (PyGithub API) │
└────────┬────────┘ └────────┬────────┘
│ │
└────────────┬────────────┘
│ raw records
┌───────────────▼───────────────┐
│ Quality Filter Pipeline │
│ - Stage 1: Hard Exclusions │
│ - Stage 2: Structural Checks │
│ - Stage 3: Content Scoring │
│ - Stage 4: MinHash Dedup │
└───────────────┬───────────────┘
│ clean records
┌───────────────▼───────────────┐
│ Schema Formatter │
│ (Alpaca / ShareGPT) │
└───────────────┬───────────────┘
│
┌───────────────▼───────────────┐
│ DatasetWriter │
│ (dataset.jsonl & run_report) │
└───────────────────────────────┘
- Authentication Options: Supports GitHub Personal Access Token (PAT) or GitHub OAuth Device Flow (enter a code in your browser, no local browser launch required).
- Interactive Selection: Discovers all org and personal repositories and presents an interactive checkbox list in the terminal.
- Data Collectors:
- Commits: Clones shallow copy (
--depth=500 --filter=blob:none) and mines commit messages and patches. - Pull Requests: Gathers merged PRs, inline review comments, and associated diff hunks.
- Issues: Resolves linked issues from PR bodies to extract problem descriptions.
- Tags: Gathers release notes for version changelog generation tasks.
- Commits: Clones shallow copy (
- 4-Stage Quality Pipeline:
- Stage 1 (Exclusions): Ignores merge commits, bot authors, revert commits, binary/lockfiles, and draft/WIP messages.
- Stage 2 (Structural): Filters based on message/PR word count, diff lines (prevents too small/large diffs), changed file count, and minimum issue description length (
min_issue_to_patch_words). - Stage 3 (Scoring & Alignment): Evaluates message informativeness, V-DO (Verb-Direct Object) imperative start patterns (e.g., Add, Fix, Refactor), and semantic overlap between commit messages and code diffs (
min_alignment_score). - Stage 4 (Deduplication): Eliminates identical or near-duplicate commits/diffs using MinHash LSH (Jaccard similarity).
- Output Schemas: Formats datasets directly into Alpaca (
instruction/input/output) or ShareGPT (conversationslist) format. - Context Optimization (
issue_to_patch): Combines PR titles, descriptions, and linked issues while stripping HTML comment templates, and enforces a configurable minimum word count constraint (min_issue_to_patch_words) to ensure high-quality fine-tuning samples.
git2llm/
├── configs/
│ ├── default.yaml # Standard pipeline filtering settings
│ ├── permissive.yaml # Loose filtering constraints
│ └── strict.yaml # Highly strict constraints (academic standard)
├── git2llm/
│ ├── auth/ # PAT/OAuth login and token caching
│ ├── collectors/ # PyDriller/PyGithub mining algorithms
│ ├── discovery/ # Repository lister and checkbox TUI
│ ├── filters/ # Stages 1 to 4 quality pipeline
│ ├── formatters/ # Alpaca and ShareGPT template engines
│ ├── cli.py # Click CLI entry point
│ ├── config.py # Pydantic configuration loader
│ ├── models.py # Standardized data objects
│ ├── orchestrator.py # Multi-threaded repository orchestrator
│ ├── writer.py # Output files and run stats writer
│ └── utils/ # Git and API rate-limiting utilities
├── tests/
│ ├── integration/ # Mocked end-to-end integration tests
│ └── unit/ # Heuristic and filtering unit tests
├── pyproject.toml # Build settings and dependencies
└── README.md
git2llm supports generating datasets for three primary training tasks. Each task generates standard instruction tuning records (available in Alpaca or ShareGPT format):
- Purpose: Trains models to generate conventional commit messages from code changes.
- Pipeline Flow: Traverses commits locally, filters out merges/bots/reverts, evaluates imperative verb usage, and checks semantic alignment.
- Command:
uv run git2llm run -r owner/repo -t commit_message --format [alpaca|sharegpt] - Configuration Params (YAML / Profiles):
min_commit_message_words(default:5): Minimum words required in the commit message.max_commit_message_chars(default:500): Maximum characters allowed.min_content_score(default:0.5): Minimum score based on verb start, informativeness, and language.min_alignment_score(default:0.15): Hard filter requiring minimum token overlap between the commit message and the diff.require_verb_start(default:true): Requires the commit message to start with an imperative verb (e.g. Add, Fix, Refactor).
- Dataset Structure (Alpaca):
- Instruction:
"You are an expert software engineer. Given a code diff, write a clear and informative commit message." - Input: The unified git diff.
- Output: The conventional commit subject line.
- Instruction:
- Purpose: Trains models to perform code reviews and write inline feedback comments.
- Pipeline Flow: Gathers merged PRs, collects inline review comments with their diff hunks, and filters out short description PRs.
- Command:
uv run git2llm run -r owner/repo -t pr_review --format [alpaca|sharegpt] - Configuration Params (YAML / Profiles):
min_pr_body_words(default:20): Discards PRs where the description is too short.dedup_threshold(default:0.85): Removes near-duplicate PR diffs using MinHash LSH.
- Dataset Structure (ShareGPT):
- Conversations:
system: Code review system prompt.human: PR title, description, and the full PR diff.gpt: Inline review comments, formatted with paths, contextual diff hunks, and review feedback.
- Conversations:
- Purpose: Trains autonomous coding agents to generate patches/diffs from issue descriptions and PR descriptions.
- Pipeline Flow: Gathers PRs and their linked issues, strips out HTML templates, merges description texts, and validates length.
- Command:
uv run git2llm run -r owner/repo -t issue_to_patch --format [alpaca|sharegpt] - Configuration Params (YAML / Profiles):
min_issue_to_patch_words(default:20): Discards examples where the combined description context is too short.require_linked_issue(default:false): Iftrue, only processes PRs that have explicitly linked issues.min_diff_lines(default:3): Minimum lines required in the diff patch.max_diff_lines(default:500): Maximum lines allowed in the patch.
- Dataset Structure (Alpaca):
- Instruction:
"You are an expert software engineer. Given the issue description and the current state of the relevant file(s), produce a minimal, correct git patch that resolves the issue." - Input: The combined PR title, PR description body, and linked issue bodies (with HTML comments stripped).
- Output: The unified patch/diff.
- Instruction:
Ensure you have Git and Python >=3.10 installed. Using uv is highly recommended.
Clone the repository and install it in editable mode:
uv pip install -e .Create a .env file (see .env.example as a template):
cp .env.example .envDefine your token:
GIT2LLM_TOKEN=your_personal_access_token_hereYou can invoke the CLI directly using the registered script name:
# Verify installation
uv run git2llm --help# Save your personal access token locally
uv run git2llm auth --token ghp_yourtokenhere# Run interactively (will prompt to pick repos, and then prompt to pick branches)
uv run git2llm run --format sharegpt
# Run with a built-in profile preset (default, strict, permissive)
uv run git2llm run -r owner/repo1 --profile permissive
# Run with a custom commit limit (cover only the N most recent commits)
uv run git2llm run -r owner/repo1 -n 100
# Run with a custom config file
uv run git2llm run \
-r owner/repo1 -r owner/repo2 \
-b main -b develop \
--format alpaca \
--task commit_message \
--config configs/strict.yaml \
--output ./dataset_outputsIf you want to customize configuration parameters, generate a starter YAML file from one of the built-in profiles:
# Generate a starter configuration file from the permissive profile
uv run git2llm init-config permissive -o configs/my_custom_config.yamlYou can then customize configs/my_custom_config.yaml and run the pipeline using the --config option pointing to it.
After generating a dataset (e.g. git2llm_output/dataset.jsonl), you can split it into training (train.jsonl) and evaluation (eval.jsonl) files:
# Split with default 10% evaluation set and shuffle enabled
uv run git2llm split git2llm_output/dataset.jsonl
# Split with a specific 20% eval ratio and random seed
uv run git2llm split git2llm_output/dataset.jsonl --eval-ratio 0.2 --seed 1234Options:
-r, --eval-ratio FLOAT: Proportion of dataset to assign to evaluation (default:0.1).-s, --seed INTEGER: Random seed for shuffling reproducibility (default:42).-o, --output-dir PATH: Target output directory (defaults to same folder as input file).--shuffle / --no-shuffle: Toggle shuffling of records before splitting (default:True).
- Install development dependencies:
uv add --dev pytest pytest-asyncio
- Make your edits inside
git2llm/. - Create corresponding test fixtures in
tests/. - Submit pull requests following Conventional Commit conventions (e.g.
feat: add tag collector).
Run all unit and integration test suites:
uv run pytest- Code Style: Follow PEP 8 guidelines. Format code using standard tools (such as Ruff or Black).
- Validation: All configuration profiles and API contracts are defined using Pydantic v2 models.
- Commits: Follow Conventional Commits format (
feat: ...,fix: ...,refactor: ...) for all development contributions.
- Fork the repository and create a new branch.
- Ensure new features are covered by unit or integration tests.
- Verify that all tests pass (
uv run pytest) before opening a pull request.
This project is licensed under the Apache License, Version 2.0. See LICENSE for details.