EpicSplitter not following max_chunk_size #8

Open
zkx06111 opened this issue Jun 14, 2024 · 2 comments

Comments

@zkx06111

zkx06111 commented Jun 14, 2024

I looked further into the implementation of EpicSplitter. It seems that the chunking process does not actually ensure that all chunks are smaller than max_chunk_size. Is this intended, or is it a bug?

For instance, when building the index for sympy__sympy-11870, in the _chunk_block function of EpicSplitter, when the input is file_path == 'sympy/combinatorics/permutations.py' and codeblock.content == 'class Permutation(Basic):', the first chunk appended to chunks has 3333 tokens, even though self.max_chunk_size is 1500, self.hard_token_limit is 2000, and self.chunk_size is 750.

Specifically, this 3333-token chunk is appended by this line:

current_chunk.append(child)

It seems to me that this part of the code is recursively chunking the child. If that's correct, do we still need the parent when the child will be indexed separately?
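
To find every chunk that breaks a limit, not just this one, a check along these lines might help. It is only a sketch: count_tokens is a whitespace stand-in for whatever tokenizer the index really uses, and find_oversized is a hypothetical helper, not part of EpicSplitter.

```python
from typing import Callable, Iterable, List, Tuple


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def find_oversized(
    chunks: Iterable[str],
    limit: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every chunk whose size exceeds `limit`."""
    oversized = []
    for index, chunk in enumerate(chunks):
        tokens = counter(chunk)
        if tokens > limit:
            oversized.append((index, tokens))
    return oversized
```

Running it once with limit set to max_chunk_size and once with hard_token_limit would show which of the two limits is being exceeded.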

@aorwall
Owner

aorwall commented Jun 14, 2024

The solution is a bit over- and under-engineered at the same time 😅 The idea was to group small chunks (classes and methods) into the same vector. The different token limits are a bit vague, as you already noticed. The idea was that max_chunk_size and chunk_size would be the soft limits, and I added hard_token_limit to avoid errors when embedding. But as you also noticed in another issue, this is not always handled properly.
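
The soft/hard split described here could be sketched roughly as follows. This is not the EpicSplitter code: group_blocks is a hypothetical helper, count_tokens is a whitespace stand-in for the real tokenizer, and raising on a hard-limit violation is just one possible policy.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def group_blocks(
    blocks: List[str], max_chunk_size: int, hard_token_limit: int
) -> List[str]:
    """Greedily pack consecutive blocks into chunks.

    max_chunk_size is a soft limit: a chunk is closed before it would grow
    past it, but a single block larger than it still becomes its own chunk.
    hard_token_limit is enforced: a block above it raises ValueError, so it
    never reaches the embedding step.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for block in blocks:
        tokens = count_tokens(block)
        if tokens > hard_token_limit:
            raise ValueError(
                f"block has {tokens} tokens, hard limit is {hard_token_limit}"
            )
        # Close the running chunk if this block would push it past the soft limit.
        if current and current_tokens + tokens > max_chunk_size:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += tokens
    if current:
        chunks.append("\n".join(current))
    return chunks
```

With this shape the hard limit is checked per block before anything is appended, which is the step that seems to be missing in the 3333-token case above.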

I plan to simplify this by chunking at the method level and truncating methods larger than the hard_token_limit. If the parent is a class, the class signature, instance variables and constructors could go in a separate chunk.
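
A rough sketch of what that plan might look like for Python sources, using the standard ast module. Everything here is an assumption for illustration: chunk_class and truncate are hypothetical names, tokens are counted as whitespace-separated words, truncation drops whole trailing lines, the class signature is assumed to fit on one line, and decorators are ignored.

```python
import ast
from typing import List


def truncate(text: str, hard_token_limit: int) -> str:
    """Keep leading lines until the next line would exceed the limit."""
    kept: List[str] = []
    total = 0
    for line in text.splitlines():
        tokens = len(line.split())  # stand-in tokenizer (assumption)
        if total + tokens > hard_token_limit:
            break
        kept.append(line)
        total += tokens
    return "\n".join(kept)


def chunk_class(source: str, hard_token_limit: int) -> List[str]:
    """Split each top-level class into a header chunk plus one chunk per method.

    The header chunk holds the class signature, class-level assignments and
    __init__; every other method becomes its own chunk. All chunks are
    truncated to hard_token_limit.
    """
    lines = source.splitlines()

    def segment(node: ast.AST) -> str:
        # Slice whole source lines so the original indentation is preserved.
        return "\n".join(lines[node.lineno - 1 : node.end_lineno])

    chunks: List[str] = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        header = [lines[node.lineno - 1]]  # assumes a single-line signature
        methods: List[str] = []
        for item in node.body:
            if isinstance(item, (ast.Assign, ast.AnnAssign)):
                header.append(segment(item))
            elif isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if item.name == "__init__":
                    header.append(segment(item))
                else:
                    methods.append(segment(item))
        chunks.append(truncate("\n".join(header), hard_token_limit))
        chunks.extend(truncate(method, hard_token_limit) for method in methods)
    return chunks
```

Because no chunk contains another chunk, the parent class is never indexed together with its children, which also addresses the question of whether the parent is still needed.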

@zkx06111
Author

Happy to help with that and chat further! We can talk on Zoom or whatever.
