Tree-sitter Code Chunker

This project implements a simple code chunker using Tree-sitter in Python. It processes source code files and breaks them down into manageable chunks based on the abstract syntax tree (AST) generated by Tree-sitter.

Features

- Supports multiple programming languages (based on available Tree-sitter grammars)
- Configurable chunk sizes, measured in tokens (MIN_CHUNK_SIZE and MAX_CHUNK_SIZE)
  - MIN_CHUNK_SIZE: minimum chunk size; small nodes (e.g., comments, imports) are combined
  - MAX_CHUNK_SIZE: maximum chunk size; large nodes (e.g., big classes or functions) are split into smaller chunks
  - Generated chunks fall within the range MIN_CHUNK_SIZE to MAX_CHUNK_SIZE
- Generates metadata for each chunk, including file name, chunk type, and position in the source file
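
The merge-and-split behaviour described above can be illustrated with a minimal sketch. It is not the project's implementation: it works on plain strings instead of Tree-sitter nodes, counts tokens by whitespace splitting (an assumption, since the tokenizer is not specified here), and uses hypothetical names (`count_tokens`, `split_large`, `chunk_pieces`) and hypothetical default limits.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Approximate a token count by splitting on whitespace (an assumption)."""
    return len(text.split())


def split_large(text: str, max_size: int) -> List[str]:
    """Cut an oversized piece into runs of at most max_size tokens.

    A real AST-based chunker would more likely descend into child nodes;
    this flat split only keeps the sketch short.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_size]) for i in range(0, len(tokens), max_size)]


def chunk_pieces(pieces: List[str], min_size: int = 4, max_size: int = 8) -> List[str]:
    """Merge small pieces and split large ones so no chunk exceeds max_size."""
    chunks: List[str] = []
    buffer = ""
    for piece in pieces:
        # Oversized pieces are split first; everything else passes through whole.
        parts = split_large(piece, max_size) if count_tokens(piece) > max_size else [piece]
        for part in parts:
            # Flush the buffer early if adding this part would exceed the maximum.
            if buffer and count_tokens(buffer) + count_tokens(part) > max_size:
                chunks.append(buffer)
                buffer = ""
            buffer = f"{buffer}\n{part}" if buffer else part
            # Emit the buffer as soon as it reaches the minimum size.
            if count_tokens(buffer) >= min_size:
                chunks.append(buffer)
                buffer = ""
    if buffer:
        # Leftover content is emitted even if it is below min_size.
        chunks.append(buffer)
    return chunks


pieces = ["import os", "import sys", "# helper", "def add(a, b):\n    return a + b"]
for chunk in chunk_pieces(pieces):
    print(count_tokens(chunk), repr(chunk))
```

In this sketch a chunk can still end up below the minimum when the next piece would not fit alongside it, or at the end of the input; how the project itself handles those cases is not stated here.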

Requirements

- Python 3.6+
- tree-sitter
- tree-sitter-language-pack

Installation

Clone this repository:

git clone https://github.com/arslan1510/code-chunker-py.git
cd code-chunker-py

Install the required dependencies:

pip install tree-sitter tree-sitter-language-pack

Usage

Create an instance of the BaseProcessor class with the desired language:

from processors.base_processor import BaseProcessor

processor = BaseProcessor(language="python")

Process a source code file:

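file_name = "example.py"
# Read the whole file as text; UTF-8 encoding is assumed here.
with open(file_name, "r", encoding="utf-8") as file:
    source_code = file.read()

# Parse the source and split it into chunks.
chunks = processor.process_code(file_name=file_name, source_code=source_code)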

The chunks variable will contain a list of dictionaries, each representing a code chunk with its content and metadata.
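
Since the exact dictionary keys are not listed here, one way to explore the result is to count which top-level keys appear across the chunks. The sketch below uses a hypothetical helper, `summarize_chunks`, and stand-in data whose key names (`content`, `metadata`, `file_name`) are illustrative assumptions rather than the project's actual schema.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_chunks(chunks: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count how often each top-level key appears across all chunks."""
    key_counts: Counter = Counter()
    for chunk in chunks:
        key_counts.update(chunk.keys())
    return dict(key_counts)


# Stand-in data: these key names are hypothetical, not the project's schema.
sample_chunks = [
    {"content": "import os", "metadata": {"file_name": "example.py"}},
    {"content": "def main(): ...", "metadata": {"file_name": "example.py"}},
]
print(summarize_chunks(sample_chunks))
```

Passing the real `chunks` list to the same helper would show which keys the processor actually emits.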

Customization

Adjust the MAX_CHUNK_SIZE and MIN_CHUNK_SIZE class variables in BaseProcessor to change the chunk size limits.
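
As an alternative to editing the class in place, the limits could be overridden in a subclass. The sketch below uses a stand-in `BaseProcessor` with placeholder values so that it runs on its own; the real class lives in `processors.base_processor`, and both `SmallChunkProcessor` and the numbers are hypothetical. This only has an effect if the project reads the limits through the instance or class (for example `self.MAX_CHUNK_SIZE`), which is an assumption.

```python
class BaseProcessor:
    """Stand-in for the project's BaseProcessor; the limit values are placeholders."""

    MIN_CHUNK_SIZE = 100
    MAX_CHUNK_SIZE = 500

    def __init__(self, language: str) -> None:
        self.language = language


class SmallChunkProcessor(BaseProcessor):
    """Hypothetical subclass that tightens both limits."""

    MIN_CHUNK_SIZE = 50
    MAX_CHUNK_SIZE = 200


processor = SmallChunkProcessor(language="python")
# Attribute lookup on the instance resolves to the subclass values.
print(processor.MIN_CHUNK_SIZE, processor.MAX_CHUNK_SIZE)
```

Subclassing leaves the defaults untouched for any other code that uses `BaseProcessor` directly.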

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License; see the LICENSE.txt file for details.
