This project implements a simple code chunker using Tree-sitter in Python. It processes source code files and breaks them down into manageable chunks based on the abstract syntax tree (AST) generated by Tree-sitter.
- Supports multiple programming languages (based on available Tree-sitter grammars)
- Configurable chunk sizes, measured in tokens (MIN_CHUNK_SIZE and MAX_CHUNK_SIZE)
- MIN_CHUNK_SIZE: minimum chunk size; small nodes (e.g., comments, imports) are combined until they reach it
- MAX_CHUNK_SIZE: maximum chunk size; large nodes (e.g., big classes or functions) are split into smaller chunks
- Generated chunks fall within the range of MIN_CHUNK_SIZE to MAX_CHUNK_SIZE
- Generates metadata for each chunk, including file name, chunk type, and position in the source file
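
The merge-and-split behavior described in the list above can be sketched roughly as follows. This is only an illustration, not the project's implementation: it works on plain strings rather than Tree-sitter nodes, it assumes a token is a whitespace-separated word, and the names `count_tokens`, `split_large`, and `chunk_pieces` as well as the two size values are hypothetical.

```python
from typing import List

MIN_CHUNK_SIZE = 4  # hypothetical value, for illustration only
MAX_CHUNK_SIZE = 8  # hypothetical value, for illustration only


def count_tokens(text: str) -> int:
    """Assumption: a token is a whitespace-separated word."""
    return len(text.split())


def split_large(text: str, max_size: int) -> List[str]:
    """Split a piece into runs of at most max_size tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_size]) for i in range(0, len(words), max_size)]


def chunk_pieces(pieces: List[str],
                 min_size: int = MIN_CHUNK_SIZE,
                 max_size: int = MAX_CHUNK_SIZE) -> List[str]:
    """Merge adjacent small pieces and split oversized ones."""
    chunks: List[str] = []
    buffer = ""
    for piece in pieces:
        for part in split_large(piece, max_size):
            candidate = f"{buffer} {part}".strip()
            if count_tokens(candidate) <= max_size:
                # The part still fits, so keep accumulating.
                buffer = candidate
            else:
                # Adding the part would exceed the maximum: flush first.
                chunks.append(buffer)
                buffer = part
            if count_tokens(buffer) >= min_size:
                # The buffer is large enough to stand as its own chunk.
                chunks.append(buffer)
                buffer = ""
    if buffer:
        chunks.append(buffer)  # leftover may be smaller than min_size
    return chunks


pieces = [
    "import os",
    "import sys",
    "# helper",
    "def f ( a , b ) : return a + b + 1 + 2",
]
for chunk in chunk_pieces(pieces):
    print(count_tokens(chunk), chunk)
```

In this sketch the two imports are merged into one chunk, while the 16-token function is split in two. A small piece can still come out below the minimum when the piece that follows it would push the chunk past the maximum, so a real chunker has to decide how to handle that case.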
- Python 3.6+
- tree-sitter
- tree-sitter-language-pack
- Clone this repository:
    git clone https://github.com/arslan1510/code-chunker-py.git
    cd code-chunker-py
- Install the required dependencies:
    pip install tree-sitter tree-sitter-language-pack
- Create an instance of the BaseProcessor class with the desired language:
    from processors.base_processor import BaseProcessor
    processor = BaseProcessor(language="python")
- Process a source code file:
    file_name = "example.py"
    with open(file_name, "r") as file:
        source_code = file.read()
    chunks = processor.process_code(file_name=file_name, source_code=source_code)
- The chunks variable will contain a list of dictionaries, each representing a code chunk with its content and metadata.
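
To see what the returned list holds, a small helper like the one below may be useful. It is a sketch that makes no assumption about the key names in each dictionary; the helper `summarize_chunks` and the sample data are hypothetical and only stand in for the real output of `process_code`.

```python
from typing import Any, Dict, List


def summarize_chunks(chunks: List[Dict[str, Any]], preview: int = 40) -> List[str]:
    """Return one line per chunk, listing each key with a shortened value."""
    lines = []
    for index, chunk in enumerate(chunks):
        fields = []
        for key, value in chunk.items():
            text = repr(value)
            if len(text) > preview:
                text = text[:preview] + "..."  # keep long source snippets short
            fields.append(f"{key}={text}")
        lines.append(f"chunk {index}: " + ", ".join(fields))
    return lines


# Hypothetical sample data; the real key names may differ.
sample = [{"content": "import os", "file_name": "example.py"}]
for line in summarize_chunks(sample):
    print(line)
```

Passing the real `chunks` list instead of `sample` shows which metadata fields the processor actually emits.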
- Adjust the MAX_CHUNK_SIZE and MIN_CHUNK_SIZE class variables in BaseProcessor to change the chunk size limits.
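
One way to change the limits without editing the class itself is to override the class variables in a subclass. The sketch below uses a stand-in `BaseProcessor` so that it runs on its own; the numeric values and the `SmallChunkProcessor` name are hypothetical, and the approach only takes effect if the real class reads the limits through `self` or `cls` rather than by the hard-coded class name.

```python
class BaseProcessor:
    """Stand-in for processors.base_processor.BaseProcessor (illustration only)."""

    MIN_CHUNK_SIZE = 100  # placeholder value, not the project's default
    MAX_CHUNK_SIZE = 500  # placeholder value, not the project's default

    def __init__(self, language: str) -> None:
        self.language = language


class SmallChunkProcessor(BaseProcessor):
    """Hypothetical subclass that produces smaller chunks."""

    MIN_CHUNK_SIZE = 50
    MAX_CHUNK_SIZE = 200


processor = SmallChunkProcessor(language="python")
print(processor.MIN_CHUNK_SIZE, processor.MAX_CHUNK_SIZE)
```

With the real project, the stand-in class would be replaced by the import from `processors.base_processor`.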
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.