fix: crash on inputs containing runs of 4 spaces #8

Merged: admk merged 1 commit into admk:main from lukors:lukors/fix_spaces_crash on May 21, 2026

lukors (Contributor) commented Apr 20, 2026

fixes: #7

The processors register their replace_tokens values with the tokenizer via tokenizer.add_tokens(). The 4-space string used for the "\t" -> spaces substitution appears to have no row in the sembr2023 model's embedding matrix, so any input with four or more consecutive spaces produces an out-of-range token ID and crashes with:

IndexError: index out of range in self

Reproducer:

printf 'hello    world' | uvx sembr
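
The failure mode described above can be illustrated with a small, self-contained sketch. It does not use the real tokenizer or model: ToyTokenizer and ToyEmbedding are hypothetical stand-ins, and the two-word vocabulary is invented for illustration. It only assumes that a newly added token receives an ID past the end of an embedding table that was sized earlier.

```python
# Toy stand-ins for illustration only; NOT the real tokenizer or model classes.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def add_tokens(self, tokens):
        # New tokens get IDs appended after the existing vocabulary.
        added = 0
        for tok in tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
        return added

    def encode(self, pieces):
        return [self.vocab[p] for p in pieces]


class ToyEmbedding:
    def __init__(self, num_rows, dim=2):
        self.rows = [[0.0] * dim for _ in range(num_rows)]

    def lookup(self, ids):
        out = []
        for i in ids:
            if not 0 <= i < len(self.rows):
                raise IndexError("index out of range in self")
            out.append(self.rows[i])
        return out


tokenizer = ToyTokenizer(["hello", "world"])
# The table is sized before the extra token is registered (assumed scenario).
embedding = ToyEmbedding(num_rows=len(tokenizer.vocab))

tokenizer.add_tokens([" " * 4])  # registers ID 2, but the table has no row 2
ids = tokenizer.encode(["hello", " " * 4, "world"])

try:
    embedding.lookup(ids)
    crashed = False
except IndexError:
    crashed = True
```

In this toy setup the 4-space token is encoded as an ID with no matching row, so the lookup raises IndexError, which mirrors the shape of the reported crash.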

The fix in this PR is to not register the 4-space string with the tokenizer. Since its registration was a result of being a value in replace_tokens, I removed it from the dict and handle the tab-to-spaces substitution separately:

  • MarkdownProcessor, LaTeXProcessor, PlainTextProcessor: drop the "\t": " " * self.spaces entry from _get_replace_tokens.
  • MarkdownProcessor.parse_text: add an explicit text.replace("\t", " " * self.spaces). The other two processors already did this substitution.
  • BaseProcessor: drop the now-dead "if k != '\t'" filter from reverse_replace_tokens.
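
The shape of the change listed above could look roughly like the following sketch. SketchProcessor, tokens_to_register, and the "[newline]" placeholder are hypothetical names chosen for illustration; only _get_replace_tokens, parse_text, reverse_replace_tokens, and self.spaces come from the description, and their real signatures may differ.

```python
class SketchProcessor:
    """Hypothetical stand-in for a processor; not the project's actual class."""

    def __init__(self, spaces=4):
        self.spaces = spaces

    def _get_replace_tokens(self):
        # No "\t" entry here, so the 4-space string is never registered
        # with the tokenizer. "[newline]" is an illustrative placeholder.
        return {"\n": "[newline]"}

    def tokens_to_register(self):
        # Hypothetical helper: the values that would go to add_tokens().
        return list(self._get_replace_tokens().values())

    def parse_text(self, text):
        # Tabs are expanded directly rather than through replace_tokens.
        text = text.replace("\t", " " * self.spaces)
        for src, dst in self._get_replace_tokens().items():
            text = text.replace(src, dst)
        return text

    def reverse_replace_tokens(self):
        # With "\t" gone from the dict, no special-case filter is needed.
        return {v: k for k, v in self._get_replace_tokens().items()}


processor = SketchProcessor()
parsed = processor.parse_text("a\tb\n")
registered = processor.tokens_to_register()
```

Under these assumptions the tab is still expanded to spaces in the parsed text, while the 4-space string no longer appears among the tokens handed to the tokenizer.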

@admk admk merged commit 98f97fb into admk:main May 21, 2026
admk (Owner) commented May 21, 2026

Thanks, merged. I have not tested it though; I hope it fixes the bug and doesn't introduce regressions. I hope to add tests later.

Development

Successfully merging this pull request may close these issues.

Crash on \n -

2 participants