
Implement language-aware word counting #10284

Closed
wants to merge 2 commits

Conversation

yheuhtozr
Contributor

Proposed changes

See #10278.

Any suggestions are welcome, especially on whether to:

  • universally count each symbol used in those languages as its own word in the word count,
  • use a different counting method depending on the specified language, or
  • add several options for users to choose per project, category, etc.

Checklist

  • Lint and unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have added documentation to describe my feature.
  • I have squashed my commits into logical units.
  • I have described the changes in the commit messages.

Other information

nijel added a commit that referenced this pull request Nov 3, 2023
Needed in case we are going to change the implementation,
see #10284 and #10278.
Member

@nijel nijel left a comment


Have you done any benchmarks to see how much slower this is? This code is executed on each source string change. Maybe we want to use this for a few affected languages only?

Can you please add tests for the East Asian languages, so that we can verify it works as expected? I've just added a test for the current implementation in f790db2.
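For illustration, the kind of assertions I have in mind (a rough sketch; count_words() is a hypothetical stand-in for whatever helper the implementation ends up exposing):

def test_word_count_east_asian():
    # English baseline: whitespace-separated words
    assert count_words("hello world") == 2
    # Each CJK character counts as one word under the proposed logic
    assert count_words("你好世界") == 4
    # Mixed Latin and CJK text
    assert count_words("Windows 系统") == 3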

@@ -40,6 +40,7 @@ pyparsing>=3.1.1,<3.2
python-dateutil>=2.8.1
python-redis-lock[django]>=4,<4.1
rapidfuzz>=2.6.0,<3.5
regex>=1.0
Member


There is no such version; please choose something reasonably recent as the lower bound. Adding an upper bound is also a good idea to avoid accidental breakage.
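For illustration only (the actual bound is your call; these numbers just show the shape of a pinned requirement, assuming the 2023.x release line):

regex>=2023.10.3,<2024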

@nijel nijel linked an issue Nov 8, 2023 that may be closed by this pull request
@yheuhtozr
Contributor Author

@nijel Hi, as I'm not familiar enough with Python to benchmark a specific method in Django, I made a small script to benchmark each piece of the core logic. Also, thanks for your suggestion in #10278 (comment).

import regex # re does not support script extensions
import unicodedataplus # unicodedata does not support script extensions
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""
monogram = r"[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}]"
splitter = regex.compile(
  rf"(?<!^)(?:\s+|(?<={monogram})(?=\S)|(?={monogram}))(?!$)", flags=regex.U | regex.V1
)
monolist = set(['Hani', 'Hang', 'Hira', 'Kana', 'Bopo'])

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def regex_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(splitter.split(txt))
  return pyperf.perf_counter() - t0

def loop_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    was_asian = False
    was_space = True
    for ch in txt:
      scx = unicodedataplus.script_extensions(ch)
      asian = not monolist.isdisjoint(scx)
      space = ch.isspace()
      if asian or ((was_asian or was_space) and not space):
        count += 1
      was_asian = asian
      was_space = space
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('regex split en', regex_split, en)
  runner.bench_time_func('regex split zh', regex_split, zh)
  runner.bench_time_func('loop split en', loop_split, en)
  runner.bench_time_func('loop split zh', loop_split, zh)

Result on my machine (WSL2):

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
$ python3 --version
Python 3.10.12
$ python3 test_split.py 
.....................
simple split en: Mean +- std dev: 602 ns +- 39 ns
.....................
simple split zh (wrong): Mean +- std dev: 373 ns +- 6 ns
.....................
regex split en: Mean +- std dev: 42.0 us +- 2.6 us
.....................
regex split zh: Mean +- std dev: 63.4 us +- 2.2 us
.....................
loop split en: Mean +- std dev: 46.4 us +- 1.2 us
.....................
loop split zh: Mean +- std dev: 28.5 us +- 1.0 us

It seems that script extension lookup is quite heavyweight anyway, and I wonder whether hardcoding code point ranges would improve anything.

@nijel
Member

nijel commented Nov 28, 2023

Thanks for the benchmark! I've also added the code from SO to the test (see below), and here are my results:

.....................
simple_split en: Mean +- std dev: 1.59 us +- 0.03 us
.....................
simple_split zh: Mean +- std dev: 760 ns +- 36 ns
.....................
regex_split en: Mean +- std dev: 112 us +- 9 us
.....................
regex_split zh: Mean +- std dev: 149 us +- 7 us
.....................
loop_split en: Mean +- std dev: 106 us +- 2 us
.....................
loop_split zh: Mean +- std dev: 65.8 us +- 1.6 us
.....................
loop_native en: Mean +- std dev: 65.9 us +- 3.5 us
.....................
loop_native zh: Mean +- std dev: 31.6 us +- 1.1 us

So the unicodedata-only solution seems the fastest of these, but it is still 30x slower than a simple split. We can still fall back to the current implementation for most languages if that is an issue.

benchmark.py
import regex # re does not support script extensions
import unicodedataplus # unicodedata does not support script extensions
import pyperf
import unicodedata

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""
monogram = r"[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}]"
splitter = regex.compile(
  rf"(?<!^)(?:\s+|(?<={monogram})(?=\S)|(?={monogram}))(?!$)", flags=regex.U | regex.V1
)
monolist = set(['Hani', 'Hang', 'Hira', 'Kana', 'Bopo'])


def simple_split(txt):
    return len(txt.split())

def regex_split(txt):
    return len(splitter.split(txt))

def loop_split(txt):
    count = 0
    was_asian = False
    was_space = True
    for ch in txt:
      scx = unicodedataplus.script_extensions(ch)
      asian = not monolist.isdisjoint(scx)
      space = ch.isspace()
      if asian or ((was_asian or was_space) and not space):
        count += 1
      was_asian = asian
      was_space = space
    return count

def loop_native(txt):
    wordcount = 0
    start = True
    for c in txt:
      cat = unicodedata.category(c)
      if cat == 'Lo':        # Letter, other
        wordcount += 1       # each letter counted as a word
        start = True
      elif cat[0] == 'Z':    # Some kind of separator
        start = True
      elif cat[0] != 'P':    # Everything else except punctuation
        if start:
            wordcount += 1     # Only count at the start
        start = False

    return wordcount

def wrapper(loops, callback, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    callback(txt)
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  for func in (simple_split, regex_split, loop_split, loop_native):
      name = func.__name__
      runner.bench_time_func(f'{name} en', wrapper, func, en)
      runner.bench_time_func(f'{name} zh', wrapper, func, zh)

@yheuhtozr
Contributor Author

yheuhtozr commented Nov 29, 2023

I also tried two versions:

  1. precalculate the applicable characters and match against a big set at runtime (not space-efficient at all)
  2. follow the Rust crate words-count and match based on Unicode blocks (claimed to be LibreOffice-compatible)

The two strategies (block-based and scx-based) differ in details, but that is probably fine, since the difference does not change the count of modifier characters between two CJK characters.
It is still ~25x or more slower than the original logic, so I think a language-based opt-in might be good.

test_split.py
import unicodedataplus # unicodedata supports neither script extensions nor blocks
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

monolist = set(['Hani', 'Hang', 'Hira', 'Kana', 'Bopo'])
monoset = set([chr(m) for m in range(0x110000) if not monolist.isdisjoint(unicodedataplus.script_extensions(chr(m)))])
cjkset = set([
  'CJK Unified Ideographs',
  'CJK Unified Ideographs Extension A',
  'CJK Unified Ideographs Extension B',
  'CJK Unified Ideographs Extension C',
  'CJK Unified Ideographs Extension D',
  'CJK Unified Ideographs Extension E',
  'CJK Unified Ideographs Extension F',
  'CJK Unified Ideographs Extension G',
  'CJK Unified Ideographs Extension H',
  'CJK Unified Ideographs Extension I',
  'CJK Compatibility',
  'CJK Compatibility Forms',
  'CJK Compatibility Ideographs',
  'CJK Compatibility Ideographs Supplement',
  'CJK Radicals Supplement',
  'CJK Strokes',
  'CJK Symbols and Punctuation',
  'Hiragana',
  'Katakana',
  'Katakana Phonetic Extensions',
  'Kana Extended-A',
  'Kana Extended-B',
  'Kana Supplement',
  'Small Kana Extension',
  'Hangul Jamo',
  'Hangul Compatibility Jamo',
  'Hangul Jamo Extended-A',
  'Hangul Jamo Extended-B',
  'Hangul Syllables',
  'Halfwidth and Fullwidth Forms',
  'Enclosed CJK Letters and Months',
  'Enclosed Ideographic Supplement',
  'Kangxi Radicals',
  'Ideographic Description Characters',
])

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_scxset(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    was_break = True
    for ch in txt:
      asian = ch in monoset
      space = ch.isspace()
      if asian or (was_break and not space):
        count += 1
      was_break = asian or space
    # return count
  return pyperf.perf_counter() - t0

def loop_block(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    was_break = True
    for ch in txt:
      asian = unicodedataplus.block(ch) in cjkset
      space = ch.isspace()
      if asian or (was_break and not space):
        count += 1
      was_break = asian or space
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('loop scxset en', loop_scxset, en)
  runner.bench_time_func('loop scxset zh', loop_scxset, zh)
  runner.bench_time_func('loop block en', loop_block, en)
  runner.bench_time_func('loop block zh', loop_block, zh)
$ python3 test_split.py
.....................
simple split en: Mean +- std dev: 593 ns +- 42 ns
.....................
simple split zh (wrong): Mean +- std dev: 379 ns +- 12 ns
.....................
loop scxset en: Mean +- std dev: 13.8 us +- 0.4 us
.....................
loop scxset zh: Mean +- std dev: 9.98 us +- 0.27 us
.....................
loop block en: Mean +- std dev: 27.9 us +- 0.9 us
.....................
loop block zh: Mean +- std dev: 17.3 us +- 0.4 us

Note that the specific logic from SO is not accurate, mostly because we also count each piece of Asian punctuation as a single word. The category-based approach also unintentionally affects a wide range of unrelated Arabic and Indic characters. With the same examples:

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""
$ python3 test_split_loop.py 
loop by script_extensions (en):  39
loop by script_extensions (zh):  118
loop by unicode block (en):  39
loop by unicode block (zh):  118
loop by StackOverflow (en):  39
loop by StackOverflow (zh):  106

MS Word: (screenshot of its word count for comparison)
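To make the side effect on unrelated scripts concrete, a tiny standard-library check (sketch):

import unicodedata

# Arabic letters are category 'Lo', the same category the SO logic uses to count
# each character as a separate word, so a single Arabic word is counted letter by letter:
print([unicodedata.category(c) for c in "مرحبا"])  # ['Lo', 'Lo', 'Lo', 'Lo', 'Lo']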

@nijel
Member

nijel commented Nov 29, 2023

I doubt that the current word counting is 100% compatible with LibreOffice, and I don't know whether we should aim for that.

Honestly, my original expectation was to find a Python package implementing this, but apparently none exists. If we were that concerned about performance, implementing it in C or interfacing with the Rust crate words-count would be the way to go.

@yheuhtozr
Contributor Author

yheuhtozr commented Nov 30, 2023

I would be fine with tens of microseconds each time a string is saved, unless it runs over all strings every time. So, if it is acceptable to carry around an additional large (>5 MB) object in the program, I think the precomputation approach is more efficient; otherwise, matching by block seems better.
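For what it's worth, a quick way to measure that overhead (a sketch using the standard-library tracemalloc; figures vary by Python build):

import tracemalloc
import unicodedataplus

tracemalloc.start()
monolist = {"Hani", "Hang", "Hira", "Kana", "Bopo"}
monoset = {
    chr(cp)
    for cp in range(0x110000)
    if not monolist.isdisjoint(unicodedataplus.script_extensions(chr(cp)))
}
current, peak = tracemalloc.get_traced_memory()
print(f"precomputed set: {current / 1e6:.1f} MB (peak {peak / 1e6:.1f} MB)")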

@nijel
Member

nijel commented Nov 30, 2023

Speaking of memory usage, both regex and unicodedataplus take about 2MB and the pre-computed list about 4MB. In the end, I think the best approach would be to pre-compute the list statically and include it in the code. We're doing that already for other purposes (though with a smaller set).

  • Add version pinned unicodedataplus dependency to requirements-dev.txt
  • Add generating logic to ./scripts/generate-non-word-chars
  • Include generated set of characters in the source code
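Roughly how the generating logic could look (a sketch only; the function and the emitted constant name are placeholders, not the actual script):

import unicodedataplus

SCRIPTS = {"Hani", "Hang", "Hira", "Kana", "Bopo"}

def cjk_ranges():
    # Collapse all code points whose script extensions touch a CJK script
    # into contiguous (start, end) ranges.
    ranges = []
    start = prev = None
    for cp in range(0x110000):
        if not SCRIPTS.isdisjoint(unicodedataplus.script_extensions(chr(cp))):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges

if __name__ == "__main__":
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in cjk_ranges())
    print(f'CJK_PATTERN = r"[{body}]+"  # paste into the source code')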

But anyway, we should first focus on how to actually count the words and then look at the implementation. Let's do this in #10278.

@nijel
Member

nijel commented Nov 30, 2023

Closing this PR as this won't be the final solution.

@nijel nijel closed this Nov 30, 2023
@yheuhtozr
Contributor Author

I found out that the lookbehind ((?<=...)) greatly slowed down regexp performance, and without it regex.split() is decently optimized. The first version runs at a relatively stable speed for both English and CJK strings, while the more contrived second version gives an extra boost for all-CJK strings at the cost of every non-CJK word.

test_split.py
import regex
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

splitter1 = regex.compile(r"([\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+)")
splitter2 = regex.compile(r"(\s+)|[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+")

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_split_both(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter1.split(txt):
      if even:
        count += len(sec)
      else:
        count += len(sec.split())
      even = not even
    # return count
  return pyperf.perf_counter() - t0

def loop_split_noncjk(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = len(txt)
    even = True
    for sec in splitter2.split(txt):
      if even and sec:
        count -= len(sec) - 1
      else:
        count -= len(sec or '')
      even = not even
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('regex partition en', loop_split_both, en)
  runner.bench_time_func('regex partition zh', loop_split_both, zh)
  runner.bench_time_func('regex & subtract en', loop_split_noncjk, en)
  runner.bench_time_func('regex & subtract zh', loop_split_noncjk, zh)
$ python3 test_split.py
simple split en: Mean +- std dev: 609 ns +- 39 ns
simple split zh (wrong): Mean +- std dev: 326 ns +- 9 ns
.....................
regex partition en: Mean +- std dev: 7.51 us +- 0.50 us
.....................
regex partition zh: Mean +- std dev: 6.41 us +- 0.22 us
.....................
regex & subtract en: Mean +- std dev: 19.8 us +- 0.9 us
.....................
regex & subtract zh: Mean +- std dev: 5.73 us +- 0.18 us

@nijel
Member

nijel commented Dec 15, 2023

loop_split_both seems to give wrong results (the branches are mixed up).

Anyway, most of the word counting will happen on English strings (as that is typically the source language), so the performance there should be the main focus. Counting Chinese strings quickly is nice, but not that relevant for Weblate.

Performance-wise, using the native re is 2-3 times faster than regex, which I believe is worth losing the Unicode block names and hard-coding the ranges:

import re

splitter_re = re.compile(r"([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\U00020000-\U0002F7FF\uFF00-\uFFEF\u3000-\u303F]+)")

def loop_split_both_re(txt):
    # logic here
    count = 0
    even = True
    for sec in splitter_re.split(txt):
      count += len(sec.split()) if even else len(sec)
      even = not even
    return count

Do you see any issues with this approach?

@yheuhtozr
Contributor Author

Thank you for the correction and suggestion. Translating to re does make it somewhat faster, but I wonder whether the reason your specific pattern runs that fast is the incompleteness of its ranges. It becomes rather detrimental for the English example once I convert the full range into re (please check whether my conversion has any error).

test_split.py
import re
import regex
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

splitter1 = regex.compile(r"([\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+)")
splitter2 = regex.compile(r"(\s+)|[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+")
splitter_re = re.compile(r"([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\U00020000-\U0002F7FF\uFF00-\uFFEF\u3000-\u303F]+)")

cjkcps = [c for c in range(0x110000) if splitter1.fullmatch(chr(c))]
cjkranges = []
prev = None
start = None
for cp in cjkcps:
  if start is None:
    start = cp
    prev = cp
  elif prev == cp - 1:
    prev = cp
  else:
    cjkranges.append((start, prev))
    start = cp
    prev = cp
if start is not None:
  cjkranges.append((start, prev))

splitter_nonscx = re.compile(rf"([{''.join(['-'.join([chr(c) for c in tup]) for tup in cjkranges])}]+)")

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_split_both(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter1.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

def loop_split_both_re(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter_re.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

def loop_split_both_nonscx(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter_nonscx.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('regex partition en', loop_split_both, en)
  runner.bench_time_func('regex partition zh', loop_split_both, zh)
  runner.bench_time_func('re given pat en', loop_split_both_re, en)
  runner.bench_time_func('re given pat zh', loop_split_both_re, zh)
  runner.bench_time_func('re equivalent en', loop_split_both_nonscx, en)
  runner.bench_time_func('re equivalent zh', loop_split_both_nonscx, zh)
$ python3 ~/test_split.py
.....................
simple split en: Mean +- std dev: 616 ns +- 37 ns
.....................
simple split zh (wrong): Mean +- std dev: 325 ns +- 11 ns
.....................
regex partition en: Mean +- std dev: 8.11 us +- 0.35 us
.....................
regex partition zh: Mean +- std dev: 6.34 us +- 0.22 us
.....................
re given pat en: Mean +- std dev: 3.22 us +- 0.17 us (<- your pattern)
.....................
re given pat zh: Mean +- std dev: 3.66 us +- 0.17 us (<- your pattern)
.....................
re equivalent en: Mean +- std dev: 12.2 us +- 0.3 us (<- my pattern in re)
.....................
re equivalent zh: Mean +- std dev: 5.19 us +- 0.18 us (<- my pattern in re)

where my pattern should translate roughly to:

r"([\U000002EA-\U000002EB\U00001100-\U000011FF\U00002E80-\U00002E99\U00002E9B-\U00002EF3\U00002F00-\U00002FD5\U00003001-\U00003003\U00003005-\U00003011\U00003013-\U0000301F\U00003021-\U00003035\U00003037-\U0000303F\U00003041-\U00003096\U00003099-\U000030FF\U00003105-\U0000312F\U00003131-\U0000318E\U00003190-\U000031E3\U000031F0-\U0000321E\U00003220-\U00003247\U00003260-\U0000327E\U00003280-\U000032B0\U000032C0-\U000032CB\U000032D0-\U00003370\U0000337B-\U0000337F\U000033E0-\U000033FE\U00003400-\U00004DBF\U00004E00-\U00009FFF\U0000A700-\U0000A707\U0000A960-\U0000A97C\U0000AC00-\U0000D7A3\U0000D7B0-\U0000D7C6\U0000D7CB-\U0000D7FB\U0000F900-\U0000FA6D\U0000FA70-\U0000FAD9\U0000FE45-\U0000FE46\U0000FF00-\U0000FFEF\U00016FE2-\U00016FE3\U00016FF0-\U00016FF1\U0001AFF0-\U0001AFF3\U0001AFF5-\U0001AFFB\U0001AFFD-\U0001AFFE\U0001B000-\U0001B122\U0001B132-\U0001B132\U0001B150-\U0001B152\U0001B155-\U0001B155\U0001B164-\U0001B167\U0001D360-\U0001D371\U0001F200-\U0001F200\U0001F250-\U0001F251\U00020000-\U0002A6DF\U0002A700-\U0002B739\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002EBF0-\U0002EE5D\U0002F800-\U0002FA1D\U00030000-\U0003134A\U00031350-\U000323AF]+)"

If this is a sign that fragmented code point ranges slow down the regexp, we could consider a words-count-style block-based approach, or more aggressive merging of ranges. Simply branching the logic by uses_ngram() is always an option too.
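To sketch that opt-in branch (assuming the existing uses_ngram() language flag; CJK_SPLITTER stands for a hypothetical precompiled pattern built from the ranges above):

def count_words(text, language):
    # Only languages flagged as n-gram based (roughly CJK) pay for the
    # character-aware counting; everything else keeps the fast simple split.
    if not language.uses_ngram():
        return len(text.split())
    count = 0
    even = True
    for sec in CJK_SPLITTER.split(text):  # capturing group: odd items are CJK runs
        count += len(sec.split()) if even else len(sec)
        even = not even
    return count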

@yheuhtozr
Contributor Author

Okay... so reducing range complexity does matter a lot. After manually collapsing the block ranges generated by #10284 (comment) into:

splitter_temp_unicodedataplus = re.compile(r"([\U00001100-\U000011FF\U00002E80-\U00002FDF\U00002FF0-\U00009FFF\U0000A960-\U0000A97F\U0000AC00-\U0000D7FF\U0000F900-\U0000FAFF\U0000FE30-\U0000FE4F\U0000FF00-\U0000FFEF\U0001AFF0-\U0001B16F\U0001F200-\U0001F2FF\U00020000-\U0003FFFF]+)")

runs like:

$ python3 ~/test_split.py
.....................
simple split en: Mean +- std dev: 615 ns +- 28 ns
.....................
simple split zh (wrong): Mean +- std dev: 330 ns +- 11 ns
.....................
re equivalent en: Mean +- std dev: 3.99 us +- 0.18 us
.....................
re equivalent zh: Mean +- std dev: 3.79 us +- 0.18 us

I believe an equivalent result can be computed automatically somehow.

@yheuhtozr
Contributor Author

I have figured out a couple of things:

  • unicodedataplus does not return the block name of an unassigned character, which makes the ranges highly fragmented. There is a package tinyunicodeblock that directly exposes the start and end of each block.
  • re is not smart enough to infer range continuity, so surprisingly e.g. [\u1000-\u1FFF\u2000-\u2FFF] runs slower than [\u1000-\u2FFF], which requires you to optimize by hand (a quick demonstration is sketched below).
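
A quick way to see the range-continuity point (sketch; absolute timings depend on the machine):

import re
import timeit

fragmented = re.compile(r"[\u1000-\u1FFF\u2000-\u2FFF]+")  # same coverage, two ranges
collapsed = re.compile(r"[\u1000-\u2FFF]+")                # same coverage, one range
text = "abc \u1234\u2345 def " * 1000

print("fragmented:", timeit.timeit(lambda: fragmented.findall(text), number=1000))
print("collapsed: ", timeit.timeit(lambda: collapsed.findall(text), number=1000))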

Current best:

test_split.py
import re
import tinyunicodeblock
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

cjkset = set([
  'CJK Unified Ideographs',
  'CJK Unified Ideographs Extension A',
  # 'CJK Unified Ideographs Extension B', # assumes entire Plane 2-3 would be CJK
  # 'CJK Unified Ideographs Extension C',
  # 'CJK Unified Ideographs Extension D',
  # 'CJK Unified Ideographs Extension E',
  # 'CJK Unified Ideographs Extension F',
  # 'CJK Unified Ideographs Extension G',
  # 'CJK Unified Ideographs Extension H',
  # 'CJK Unified Ideographs Extension I',
  'CJK Compatibility',
  'CJK Compatibility Forms',
  'CJK Compatibility Ideographs',
  # 'CJK Compatibility Ideographs Supplement',
  'CJK Radicals Supplement',
  'CJK Strokes',
  'CJK Symbols and Punctuation',
  'Hiragana',
  'Katakana',
  'Katakana Phonetic Extensions',
  'Kana Extended-A',
  'Kana Extended-B',
  'Kana Supplement',
  'Small Kana Extension',
  'Hangul Jamo',
  'Hangul Compatibility Jamo',
  'Hangul Jamo Extended-A',
  'Hangul Jamo Extended-B',
  'Hangul Syllables',
  'Halfwidth and Fullwidth Forms',
  'Enclosed CJK Letters and Months',
  'Enclosed Ideographic Supplement',
  'Kangxi Radicals',
  'Ideographic Description Characters',
  'Kanbun',
  'Yijing Hexagram Symbols', # not strictly necessary but for the sake of range continuity
  'Bopomofo',
  'Bopomofo Extended',
])

cjkranges = [(b[0], b[1]) for b in tinyunicodeblock.BLOCKS if b[2] in cjkset]
cjkranges.sort(key=lambda r: ord(r[0]))
cjkmerged = []
prev = None
for r in cjkranges:
  if prev is None:
    prev = r
  elif ord(prev[1]) == ord(r[0]) - 1:
    prev = (prev[0], r[1])
  else:
    cjkmerged.append(prev)
    prev = r
cjkmerged.append(prev)

splitter_nonscx = re.compile(rf"([{''.join([f'{r[0]}-{r[1]}' for r in cjkmerged])}\U00020000-\U0003FFFF]+)")

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_split_both_nonscx(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter_nonscx.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('re tinyunicodeblock en', loop_split_both_nonscx, en)
  runner.bench_time_func('re tinyunicodeblock zh', loop_split_both_nonscx, zh)
$ python3 ~/test_split.py
.....................
simple split en: Mean +- std dev: 621 ns +- 29 ns
.....................
simple split zh (wrong): Mean +- std dev: 330 ns +- 13 ns
.....................
re tinyunicodeblock en: Mean +- std dev: 3.88 us +- 0.15 us
.....................
re tinyunicodeblock zh: Mean +- std dev: 3.75 us +- 0.16 us

@nijel
Member

nijel commented Dec 15, 2023

Yes, I tried to collapse the ranges, but I really didn't attempt to make it complete, so some ranges were most likely missing. On the other hand, I intentionally included some reserved blocks because it really doesn't matter in this case (the behavior of reserved code points is not defined, so let's choose whatever performs better).

@yheuhtozr
Contributor Author

My last attempt automatically consolidates the listed blocks into consecutive ranges as much as possible, with some educated heuristics, so what you see is the performance of a feature-complete version (although the generation logic should be written more cleanly). Do you see any more room for optimization?

@nijel
Member

nijel commented Dec 16, 2023

I think this is fine performance-wise.

I'd like to avoid a tinyunicodeblock runtime dependency, so the regexp should be generated in scripts/ and embedded in the code.
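What I'd picture for the embedded side, roughly (a sketch; the constant name is a placeholder and the range list is abbreviated):

# Generated by a script under scripts/, not edited by hand.
CJK_RANGES = (
    "\u1100-\u11FF"          # Hangul Jamo
    "\u2E80-\u2FDF"          # CJK Radicals Supplement through Kangxi Radicals
    "\u2FF0-\u9FFF"          # Ideographic Description Characters through CJK Unified Ideographs
    "\uAC00-\uD7FF"          # Hangul Syllables and Hangul Jamo Extended-B
    "\uF900-\uFAFF"          # CJK Compatibility Ideographs
    "\uFF00-\uFFEF"          # Halfwidth and Fullwidth Forms
    "\U00020000-\U0003FFFF"  # Planes 2-3 (CJK extensions)
)

# Runtime code then only needs the stdlib re, with no tinyunicodeblock import:
import re

CJK_SPLITTER = re.compile(f"([{CJK_RANGES}]+)")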

@yheuhtozr yheuhtozr mentioned this pull request Dec 24, 2023