fix(glossary): fetch word boundary positions just once
Running repeated single-character regular expression matches is slow;
it is better to find all word boundaries in a single pass up front.
nijel committed May 23, 2024
1 parent 1b4a796 commit 8eb6659
Showing 2 changed files with 10 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/changes.rst
@@ -12,6 +12,8 @@ Not yet released.
 
 **Bug fixes**
 
+* Loading of strings with many glossary matches.
+
 **Compatibility**
 
 **Upgrading**
11 changes: 8 additions & 3 deletions weblate/glossary/models.py
@@ -93,17 +93,22 @@ def get_glossary_terms(unit: Unit) -> list[Unit]:
     source = PLURAL_SEPARATOR.join(parts)
 
     uses_whitespace = source_language.uses_whitespace()
+    boundaries: set[int] = set()
+    if uses_whitespace:
+        # Get list of word boundaries
+        boundaries = {match.span()[0] for match in NON_WORD_RE.finditer(source)}
+        boundaries.add(-1)
+        boundaries.add(len(source))
 
     automaton = project.glossary_automaton
-    positions = defaultdict(list[tuple[int, int]])
+    positions: dict[str, list[tuple[int, int]]] = defaultdict(list)
     # Extract terms present in the source
     with sentry_sdk.start_span(op="glossary.match", description=project.slug):
         for _termno, start, end in automaton.find_matches_as_indexes(
             source, overlapping=True
         ):
             if not uses_whitespace or (
-                (start == 0 or NON_WORD_RE.match(source[start - 1]))
-                and (end >= len(source) or NON_WORD_RE.match(source[end]))
+                (start - 1 in boundaries) and (end in boundaries)
             ):
                 term = source[start:end].lower()
                 positions[term].append((start, end))
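The change replaces a per-match regex check on individual characters with a single up-front pass that collects every word-boundary position into a set, so each candidate match only needs two O(1) membership tests. A minimal sketch of the same technique follows; it assumes a simplified `NON_WORD_RE` and stands in a naive substring scan for Weblate's Aho-Corasick automaton, so it is an illustration, not the real implementation:

```python
import re
from collections import defaultdict

# Simplified stand-in for Weblate's NON_WORD_RE; the real pattern is broader.
NON_WORD_RE = re.compile(r"\W")


def word_boundaries(source: str) -> set[int]:
    """Collect the index of every non-word character in one pass,
    plus virtual boundaries before the first and after the last character."""
    boundaries = {match.start() for match in NON_WORD_RE.finditer(source)}
    boundaries.add(-1)
    boundaries.add(len(source))
    return boundaries


def find_terms(source: str, terms: list[str]) -> dict[str, list[tuple[int, int]]]:
    """Naive substring scan standing in for the automaton's
    find_matches_as_indexes(); a candidate counts as a match only if it
    starts and ends on a precomputed word boundary."""
    boundaries = word_boundaries(source)
    lower = source.lower()
    positions: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for term in terms:
        needle = term.lower()
        start = lower.find(needle)
        while start != -1:
            end = start + len(needle)
            # Two O(1) set lookups instead of two regex matches per candidate.
            if start - 1 in boundaries and end in boundaries:
                positions[needle].append((start, end))
            start = lower.find(needle, start + 1)
    return positions
```

For example, `find_terms("Hello world, hello!", ["hello"])` finds both occurrences, while `find_terms("shell", ["hell"])` finds none, because the candidate does not start on a word boundary.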