Code aware chunking in RAG strategies #967

Merged
dgageot merged 1 commit into docker:main from krissetto:code-aware-chunking on Nov 28, 2025

@krissetto
Contributor

The goal of this change is to allow RAG strategy chunking configurations to declare that they are code_aware.

When code_aware: true is set, the RAG system uses tree-sitter bindings to parse the code into an AST and chunks along its boundaries, so that logical blocks of code such as functions are not split up.

Only Go code is supported in this initial implementation, but we can add support for more languages fairly easily in follow-up PRs.

Example:

...
rag:
  codebase:
    docs: [./src]
    strategies:
      - type: chunked-embeddings
        embedding_model: openai/text-embedding-3-small
        database: ./code.db
        chunking:
          size: 2000
          code_aware: true    # <- Enable AST-based chunking
    results:
      limit: 5

The tree-sitter AST parsing will also be used in follow-up RAG strategies to gather the semantic meaning of the analyzed code before generating embeddings 😏

@krissetto krissetto requested a review from a team as a code owner November 27, 2025 13:23
@krissetto
Contributor Author

needs some extra work to prepare for using CGO in CI

Signed-off-by: Christopher Petito <chrisjpetito@gmail.com>
@dgageot dgageot merged commit fa676c0 into docker:main Nov 28, 2025
5 checks passed