@Tasssadar
Contributor

On some large and "bad" inputs, such as a whole HTML page, the regex produces a lot of matches that are only one or two digits long, which are then simply filtered out.

For my particular use case, setting the minimum to 4 digits (I don't care about special short numbers) gives about a 50% speed-up on that input.
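
A minimal sketch of the idea, not the library's actual matcher: the pattern and the `find_candidates` helper below are hypothetical stand-ins, and only the `min_candidate_length` name is taken from this PR. It illustrates how a cheap length check can discard short candidates before any expensive parsing or validation runs.

```python
import re

# Hypothetical, simplified candidate pattern: a digit, optionally followed by
# more digits/separators and ending in a digit. The real regex is more involved.
_CANDIDATE = re.compile(r"\d(?:[\d\-. ]*\d)?")


def find_candidates(text, min_candidate_length=1):
    """Yield (start, candidate) pairs, skipping candidates below the minimum length."""
    for match in _CANDIDATE.finditer(text):
        candidate = match.group()
        # Cheap length check first, so short junk never reaches the
        # (assumed expensive) parsing and validation stage.
        if len(candidate) >= min_candidate_length:
            yield match.start(), candidate


html = '<td width="5" rowspan="2">Call 555-0100 or 12</td>'
print(list(find_candidates(html)))                          # every digit run
print(list(find_candidates(html, min_candidate_length=4)))  # short runs skipped
```

With the default of 1 nothing is filtered, so existing callers would see no change in behaviour. Note that the length here is that of the candidate string, separators included.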

candidate)
candidate_len = len(candidate)

if candidate_len >= self._min_candidate_length:
Owner

Please could we add a comment here to indicate that the Python code diverges from the upstream code?

 def __init__(self, text, region,
-             leniency=Leniency.VALID, max_tries=65535):
+             leniency=Leniency.VALID, max_tries=65535,
+             min_candidate_length=1):
Owner

I'm not normally keen on the Python port diverging from the upstream Java code, but this looks useful, and the default value for the argument means that it's backward-compatible with any existing client use.

Please could you add some unit tests for it, and mark them as Python-specific?
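
Such tests could look roughly like the following sketch. It exercises a hypothetical stand-in function rather than the real matcher class, so `count_candidates`, its pattern, and the test names are all assumptions; only the `min_candidate_length` parameter and its default of 1 come from the diff above.

```python
import re
import unittest


def count_candidates(text, min_candidate_length=1):
    """Hypothetical stand-in for the matcher: count digit runs of sufficient length."""
    return sum(1 for m in re.finditer(r"\d+", text)
               if len(m.group()) >= min_candidate_length)


class MinCandidateLengthTest(unittest.TestCase):
    # Python-specific: this option diverges from the upstream Java code.

    def test_default_is_backward_compatible(self):
        # Omitting the argument must behave like the old signature.
        self.assertEqual(count_candidates("1 22 333 4444"), 4)
        self.assertEqual(count_candidates("1 22 333 4444", 1), 4)

    def test_short_candidates_are_skipped(self):
        self.assertEqual(
            count_candidates("1 22 333 4444", min_candidate_length=4), 1)
```

Saved as a module, the tests can be run with `python -m unittest`.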

@Tasssadar
Contributor Author

Thanks, I added the comment and unit tests.

@daviddrysdale daviddrysdale merged commit f243ddd into daviddrysdale:dev Jan 16, 2026
10 checks passed