arslan1510/code-chunker-py

Tree-sitter Code Chunker

This project implements a simple code chunker using Tree-sitter in Python. It processes source code files and breaks them down into manageable chunks based on the abstract syntax tree (AST) generated by Tree-sitter.

Features

  • Supports multiple programming languages (based on available Tree-sitter grammars)
  • Configurable chunk sizes, measured in tokens (MIN_CHUNK_SIZE and MAX_CHUNK_SIZE)
    • MIN_CHUNK_SIZE: minimum chunk size; small nodes (e.g., comments, imports) are combined
    • MAX_CHUNK_SIZE: maximum chunk size; large nodes (e.g., big classes or functions) are split into smaller chunks
    • Generated chunks fall within the range MIN_CHUNK_SIZE to MAX_CHUNK_SIZE
  • Generates metadata for each chunk, including file name, chunk type, and position in the source file
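
The merge-and-split behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the project's implementation: it works on plain strings rather than Tree-sitter nodes, approximates tokens by splitting on whitespace, and the names `count_tokens`, `split_large`, and `chunk_pieces` as well as the limit values are hypothetical.

```python
from typing import List

# Illustrative limits only; these are NOT the project's defaults.
MIN_CHUNK_SIZE = 4
MAX_CHUNK_SIZE = 10


def count_tokens(text: str) -> int:
    """Approximate the token count by splitting on whitespace (an assumption)."""
    return len(text.split())


def split_large(text: str, max_size: int) -> List[str]:
    """Split one piece into runs of at most max_size whitespace tokens.

    Joining with single spaces discards the original layout, which is
    acceptable for a toy example but not for real source code.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_size]) for i in range(0, len(tokens), max_size)]


def chunk_pieces(pieces: List[str],
                 min_size: int = MIN_CHUNK_SIZE,
                 max_size: int = MAX_CHUNK_SIZE) -> List[str]:
    """Merge small pieces and split large ones, in source order."""
    chunks: List[str] = []
    buffer = ""  # accumulates small pieces until min_size is reached
    for piece in pieces:
        for part in split_large(piece, max_size):
            candidate = f"{buffer} {part}".strip()
            if count_tokens(candidate) > max_size:
                # Merging would exceed the maximum: emit the buffer as-is
                chunks.append(buffer)
                buffer = part
            else:
                buffer = candidate
            if count_tokens(buffer) >= min_size:
                chunks.append(buffer)
                buffer = ""
    if buffer:
        chunks.append(buffer)  # the trailing remainder may be below min_size
    return chunks


if __name__ == "__main__":
    pieces = ["import os", "import sys", "def f(): return 1",
              "a b c d e f g h i j k l"]
    for chunk in chunk_pieces(pieces):
        print(count_tokens(chunk), repr(chunk))
```

In this sketch the two import lines are merged into one chunk, while the 12-token piece is split at the 10-token limit. A chunk can still end up below the minimum when merging it with its neighbour would exceed the maximum, or when it is the last remainder of the input.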

Requirements

  • Python 3.6+
  • tree-sitter
  • tree-sitter-language-pack

Installation

  1. Clone this repository:

    git clone https://github.com/arslan1510/code-chunker-py.git
    cd code-chunker-py
    
  2. Install the required dependencies:

    pip install tree-sitter tree-sitter-language-pack
    

Usage

  1. Create an instance of the BaseProcessor class with the desired language:

    from processors.base_processor import BaseProcessor
    
    processor = BaseProcessor(language="python")
    
  2. Process a source code file:

    file_name = "example.py"
    # Read the whole file as text (UTF-8 is assumed here)
    with open(file_name, "r", encoding="utf-8") as file:
        source_code = file.read()
    
    # Split the source into chunks, each with its metadata
    chunks = processor.process_code(file_name=file_name, source_code=source_code)
    
  3. The chunks variable now holds a list of dictionaries, each representing one code chunk with its content and metadata.

Customization

  • Adjust the MAX_CHUNK_SIZE and MIN_CHUNK_SIZE class variables in BaseProcessor to change the chunk size limits.
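
One way to change the limits without editing the source is to override the class variables in a subclass. The sketch below is only an illustration: the `BaseProcessor` defined here is a stand-in stub so the example runs on its own, the numeric values are placeholders rather than the project's defaults, and `SmallChunkProcessor` is a hypothetical name. With the real class, replace the stub with `from processors.base_processor import BaseProcessor`.

```python
class BaseProcessor:
    """Stand-in stub for processors.base_processor.BaseProcessor (not the real class)."""

    # Placeholder values; the project's actual defaults may differ.
    MIN_CHUNK_SIZE = 100
    MAX_CHUNK_SIZE = 1000

    def __init__(self, language: str) -> None:
        self.language = language


class SmallChunkProcessor(BaseProcessor):
    """Hypothetical subclass that overrides the chunk size limits."""

    MIN_CHUNK_SIZE = 50
    MAX_CHUNK_SIZE = 400


processor = SmallChunkProcessor(language="python")
print(processor.MIN_CHUNK_SIZE, processor.MAX_CHUNK_SIZE)
```

This approach only takes effect if the real class reads the limits through `self` or `cls`. If it refers to `BaseProcessor.MAX_CHUNK_SIZE` directly, set the attributes on `BaseProcessor` itself instead.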

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A simple code chunker for LLMs.
