
Implement language-aware word counting #10284

Closed
wants to merge 2 commits

Conversation

yheuhtozr
Contributor

Proposed changes

See #10278.

Any suggestions are welcome, especially on whether to:

  • universally count each symbol used in those languages as its own word in the word count,
  • use a different counting method depending on the specified language, or
  • add several options for users to choose per project, category, etc.

Checklist

  • Lint and unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have added documentation to describe my feature.
  • I have squashed my commits into logical units.
  • I have described the changes in the commit messages.

Other information

nijel added a commit that referenced this pull request Nov 3, 2023
Needed in case we are going to change the implementation,
see #10284 and #10278.
Member

@nijel nijel left a comment


Have you done any benchmarks to see how much slower this is? This code is executed on each source string change. Maybe we want to use this for a few affected languages only?

Can you please add tests for the East Asian languages, so that we can verify it works as expected? I've just added a test for the current implementation in f790db2.
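For illustration, the kind of assertions I have in mind (a rough sketch; count_words() is a hypothetical stand-in for whatever helper the implementation ends up exposing):

def test_word_count_east_asian():
    # English baseline: whitespace-separated words
    assert count_words("hello world") == 2
    # Each CJK character counts as one word under the proposed logic
    assert count_words("你好世界") == 4
    # Mixed Latin and CJK text
    assert count_words("Windows 系统") == 3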

@@ -40,6 +40,7 @@ pyparsing>=3.1.1,<3.2
python-dateutil>=2.8.1
python-redis-lock[django]>=4,<4.1
rapidfuzz>=2.6.0,<3.5
regex>=1.0
Member


There is no such version; please choose something reasonably recent as the lower bound. Adding an upper bound is also a good idea to avoid accidental breakage.
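For illustration only (the actual bound is your call; these numbers just show the shape of a pinned requirement, assuming the 2023.x release line):

regex>=2023.10.3,<2024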

@nijel nijel linked an issue Nov 8, 2023 that may be closed by this pull request
@yheuhtozr
Contributor Author

@nijel Hi, as I'm not familiar enough with Python to benchmark a specific method in Django, I made a small script to benchmark each piece of the core logic. Also, thanks for your suggestion in #10278 (comment).

import regex # re does not support script extensions
import unicodedataplus # unicodedata does not support script extensions
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""
monogram = r"[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}]"
splitter = regex.compile(
  rf"(?<!^)(?:\s+|(?<={monogram})(?=\S)|(?={monogram}))(?!$)", flags=regex.U | regex.V1
)
monolist = set(['Hani', 'Hang', 'Hira', 'Kana', 'Bopo'])

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def regex_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(splitter.split(txt))
  return pyperf.perf_counter() - t0

def loop_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    was_asian = False
    was_space = True
    for ch in txt:
      scx = unicodedataplus.script_extensions(ch)
      asian = not monolist.isdisjoint(scx)
      space = ch.isspace()
      if asian or ((was_asian or was_space) and not space):
        count += 1
      was_asian = asian
      was_space = space
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('regex split en', regex_split, en)
  runner.bench_time_func('regex split zh', regex_split, zh)
  runner.bench_time_func('loop split en', loop_split, en)
  runner.bench_time_func('loop split zh', loop_split, zh)

Result on my machine (WSL2):

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
$ python3 --version
Python 3.10.12
$ python3 test_split.py 
.....................
simple split en: Mean +- std dev: 602 ns +- 39 ns
.....................
simple split zh (wrong): Mean +- std dev: 373 ns +- 6 ns
.....................
regex split en: Mean +- std dev: 42.0 us +- 2.6 us
.....................
regex split zh: Mean +- std dev: 63.4 us +- 2.2 us
.....................
loop split en: Mean +- std dev: 46.4 us +- 1.2 us
.....................
loop split zh: Mean +- std dev: 28.5 us +- 1.0 us

It seems that script extension lookup is quite heavyweight anyway, and I wonder whether hardcoding code point ranges would improve anything.

@nijel
Member

nijel commented Nov 28, 2023

Thanks for the benchmark! I've also added the code from SO to the test (see below), and here are my results:

.....................
simple_split en: Mean +- std dev: 1.59 us +- 0.03 us
.....................
simple_split zh: Mean +- std dev: 760 ns +- 36 ns
.....................
regex_split en: Mean +- std dev: 112 us +- 9 us
.....................
regex_split zh: Mean +- std dev: 149 us +- 7 us
.....................
loop_split en: Mean +- std dev: 106 us +- 2 us
.....................
loop_split zh: Mean +- std dev: 65.8 us +- 1.6 us
.....................
loop_native en: Mean +- std dev: 65.9 us +- 3.5 us
.....................
loop_native zh: Mean +- std dev: 31.6 us +- 1.1 us

So the unicodedata-only solution seems the fastest of these, but it is still 30x slower than a simple split. We can still fall back to the current implementation for most languages if that is an issue.

benchmark.py
import regex # re does not support script extensions
import unicodedataplus # unicodedata does not support script extensions
import pyperf
import unicodedata

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""
monogram = r"[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}]"
splitter = regex.compile(
  rf"(?<!^)(?:\s+|(?<={monogram})(?=\S)|(?={monogram}))(?!$)", flags=regex.U | regex.V1
)
monolist = set(['Hani', 'Hang', 'Hira', 'Kana', 'Bopo'])


def simple_split(txt):
    return len(txt.split())

def regex_split(txt):
    return len(splitter.split(txt))

def loop_split(txt):
    count = 0
    was_asian = False
    was_space = True
    for ch in txt:
      scx = unicodedataplus.script_extensions(ch)
      asian = not monolist.isdisjoint(scx)
      space = ch.isspace()
      if asian or ((was_asian or was_space) and not space):
        count += 1
      was_asian = asian
      was_space = space
    return count

def loop_native(txt):
    wordcount = 0
    start = True
    for c in txt:
      cat = unicodedata.category(c)
      if cat == 'Lo':        # Letter, other
        wordcount += 1       # each letter counted as a word
        start = True
      elif cat[0] == 'Z':    # Some kind of separator
        start = True
      elif cat[0] != 'P':    # Everything else except punctuation
        if start:
            wordcount += 1     # Only count at the start
        start = False

    return wordcount

def wrapper(loops, callback, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    callback(txt)
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  for func in (simple_split, regex_split, loop_split, loop_native):
      name = func.__name__
      runner.bench_time_func(f'{name} en', wrapper, func, en)
      runner.bench_time_func(f'{name} zh', wrapper, func, zh)

@yheuhtozr
Contributor Author

yheuhtozr commented Nov 29, 2023

I also tried two versions:

  1. precalculate the applicable characters and match against a big set at runtime (not space-efficient at all)
  2. follow the Rust crate words-count and match based on Unicode blocks (claimed to be LibreOffice-compatible)

The two strategies (block-based and scx-based) differ in details, but that is probably fine, since the difference does not change the count of modifier characters between two CJK characters.
It is still ~25x or more slower than the original logic, so I think a language-based opt-in might be good.

test_split.py
import unicodedataplus # unicodedata supports neither script extensions nor blocks
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

monolist = set(['Hani', 'Hang', 'Hira', 'Kana', 'Bopo'])
monoset = set([chr(m) for m in range(0x110000) if not monolist.isdisjoint(unicodedataplus.script_extensions(chr(m)))])
cjkset = set([
  'CJK Unified Ideographs',
  'CJK Unified Ideographs Extension A',
  'CJK Unified Ideographs Extension B',
  'CJK Unified Ideographs Extension C',
  'CJK Unified Ideographs Extension D',
  'CJK Unified Ideographs Extension E',
  'CJK Unified Ideographs Extension F',
  'CJK Unified Ideographs Extension G',
  'CJK Unified Ideographs Extension H',
  'CJK Unified Ideographs Extension I',
  'CJK Compatibility',
  'CJK Compatibility Forms',
  'CJK Compatibility Ideographs',
  'CJK Compatibility Ideographs Supplement',
  'CJK Radicals Supplement',
  'CJK Strokes',
  'CJK Symbols and Punctuation',
  'Hiragana',
  'Katakana',
  'Katakana Phonetic Extensions',
  'Kana Extended-A',
  'Kana Extended-B',
  'Kana Supplement',
  'Small Kana Extension',
  'Hangul Jamo',
  'Hangul Compatibility Jamo',
  'Hangul Jamo Extended-A',
  'Hangul Jamo Extended-B',
  'Hangul Syllables',
  'Halfwidth and Fullwidth Forms',
  'Enclosed CJK Letters and Months',
  'Enclosed Ideographic Supplement',
  'Kangxi Radicals',
  'Ideographic Description Characters',
])

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_scxset(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    was_break = True
    for ch in txt:
      asian = ch in monoset
      space = ch.isspace()
      if asian or (was_break and not space):
        count += 1
      was_break = asian or space
    # return count
  return pyperf.perf_counter() - t0

def loop_block(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    was_break = True
    for ch in txt:
      asian = unicodedataplus.block(ch) in cjkset
      space = ch.isspace()
      if asian or (was_break and not space):
        count += 1
      was_break = asian or space
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('loop scxset en', loop_scxset, en)
  runner.bench_time_func('loop scxset zh', loop_scxset, zh)
  runner.bench_time_func('loop block en', loop_block, en)
  runner.bench_time_func('loop block zh', loop_block, zh)
$ python3 test_split.py
.....................
simple split en: Mean +- std dev: 593 ns +- 42 ns
.....................
simple split zh (wrong): Mean +- std dev: 379 ns +- 12 ns
.....................
loop scxset en: Mean +- std dev: 13.8 us +- 0.4 us
.....................
loop scxset zh: Mean +- std dev: 9.98 us +- 0.27 us
.....................
loop block en: Mean +- std dev: 27.9 us +- 0.9 us
.....................
loop block zh: Mean +- std dev: 17.3 us +- 0.4 us

Note that the specific logic from SO is not accurate, mostly because we also count each piece of Asian punctuation as a single word. The category-based approach also unintentionally affects a wide range of unrelated Arabic and Indic characters. With the same examples:

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""
$ python3 test_split_loop.py 
loop by script_extensions (en):  39
loop by script_extensions (zh):  118
loop by unicode block (en):  39
loop by unicode block (zh):  118
loop by StackOverflow (en):  39
loop by StackOverflow (zh):  106

MS Word: (screenshot of its word count for comparison)
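To make the side effect on unrelated scripts concrete, a tiny standard-library check (sketch):

import unicodedata

# Arabic letters are category 'Lo', the same category the SO logic uses to count
# each character as a separate word, so a single Arabic word is counted letter by letter:
print([unicodedata.category(c) for c in "مرحبا"])  # ['Lo', 'Lo', 'Lo', 'Lo', 'Lo']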

@nijel
Member

nijel commented Nov 29, 2023

I doubt that the current word counting is 100% compatible with LibreOffice, and I don't know whether we should aim for that.

Honestly, my original expectation was to find a Python package implementing this, but apparently none exists. If we were that concerned about performance, implementing it in C or interfacing with the Rust crate words-count would be the way to go.

@yheuhtozr
Contributor Author

yheuhtozr commented Nov 30, 2023

I would be fine with tens of microseconds each time a string is saved, unless it runs over all strings every time. So, if it is acceptable to carry around an additional large (>5 MB) object in the program, I think the precomputation approach is more efficient; otherwise, matching by block seems better.
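For what it's worth, a quick way to measure that overhead (a sketch using the standard-library tracemalloc; figures vary by Python build):

import tracemalloc
import unicodedataplus

tracemalloc.start()
monolist = {"Hani", "Hang", "Hira", "Kana", "Bopo"}
monoset = {
    chr(cp)
    for cp in range(0x110000)
    if not monolist.isdisjoint(unicodedataplus.script_extensions(chr(cp)))
}
current, peak = tracemalloc.get_traced_memory()
print(f"precomputed set: {current / 1e6:.1f} MB (peak {peak / 1e6:.1f} MB)")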

@nijel
Member

nijel commented Nov 30, 2023

Speaking of memory usage, both regex and unicodedataplus take about 2MB and the pre-computed list about 4MB. In the end, I think the best approach would be to pre-compute the list statically and include it in the code. We're doing that already for other purposes (though with a smaller set).

  • Add version pinned unicodedataplus dependency to requirements-dev.txt
  • Add generating logic to ./scripts/generate-non-word-chars
  • Include generated set of characters in the source code
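Roughly how the generating logic could look (a sketch only; the function and the emitted constant name are placeholders, not the actual script):

import unicodedataplus

SCRIPTS = {"Hani", "Hang", "Hira", "Kana", "Bopo"}

def cjk_ranges():
    # Collapse all code points whose script extensions touch a CJK script
    # into contiguous (start, end) ranges.
    ranges = []
    start = prev = None
    for cp in range(0x110000):
        if not SCRIPTS.isdisjoint(unicodedataplus.script_extensions(chr(cp))):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges

if __name__ == "__main__":
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in cjk_ranges())
    print(f'CJK_PATTERN = r"[{body}]+"  # paste into the source code')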

But anyway, we should first focus on how to actually count the words and then look at the implementation. Let's do this in #10278.

@nijel
Member

nijel commented Nov 30, 2023

Closing this PR as this won't be the final solution.

@nijel nijel closed this Nov 30, 2023
@yheuhtozr
Contributor Author

I found out that the lookbehind ((?<=...)) greatly slowed down regexp performance, and without it regex.split() is decently optimized. The first version runs at a relatively stable speed for both English and CJK strings, while the more contrived second version gives an extra boost for all-CJK strings at the cost of every non-CJK word.

test_split.py
import regex
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

splitter1 = regex.compile(r"([\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+)")
splitter2 = regex.compile(r"(\s+)|[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+")

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_split_both(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter1.split(txt):
      if even:
        count += len(sec)
      else:
        count += len(sec.split())
      even = not even
    # return count
  return pyperf.perf_counter() - t0

def loop_split_noncjk(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = len(txt)
    even = True
    for sec in splitter2.split(txt):
      if even and sec:
        count -= len(sec) - 1
      else:
        count -= len(sec or '')
      even = not even
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('regex partition en', loop_split_both, en)
  runner.bench_time_func('regex partition zh', loop_split_both, zh)
  runner.bench_time_func('regex & subtract en', loop_split_noncjk, en)
  runner.bench_time_func('regex & subtract zh', loop_split_noncjk, zh)
$ python3 test_split.py
simple split en: Mean +- std dev: 609 ns +- 39 ns
simple split zh (wrong): Mean +- std dev: 326 ns +- 9 ns
.....................
regex partition en: Mean +- std dev: 7.51 us +- 0.50 us
.....................
regex partition zh: Mean +- std dev: 6.41 us +- 0.22 us
.....................
regex & subtract en: Mean +- std dev: 19.8 us +- 0.9 us
.....................
regex & subtract zh: Mean +- std dev: 5.73 us +- 0.18 us

@nijel
Member

nijel commented Dec 15, 2023

loop_split_both seems to give wrong results (the branches are mixed up).

Anyway, most of the word counting will happen on English strings (as that is typically the source language), so the performance there should be the main focus. Counting Chinese strings quickly is nice, but not that relevant for Weblate.

Performance-wise, using the native re is 2-3 times faster than regex, which I believe is worth losing the Unicode block names and hard-coding the ranges:

import re

splitter_re = re.compile(r"([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\U00020000-\U0002F7FF\uFF00-\uFFEF\u3000-\u303F]+)")

def loop_split_both_re(txt):
    # logic here
    count = 0
    even = True
    for sec in splitter_re.split(txt):
      count += len(sec.split()) if even else len(sec)
      even = not even
    return count

Do you see any issues with this approach?

@yheuhtozr
Contributor Author

Thank you for the correction and suggestion. Translating to re does make it somewhat faster, but I wonder whether the reason your specific pattern runs that fast is the incompleteness of its ranges. It becomes rather detrimental for the English example once I convert the full range into re (please check whether my conversion has any error).

test_split.py
import re
import regex
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

splitter1 = regex.compile(r"([\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+)")
splitter2 = regex.compile(r"(\s+)|[\p{scx=Hani}\p{scx=Hang}\p{scx=Hira}\p{scx=Kana}\p{scx=Bopo}\p{InHalfAndFullForms}]+")
splitter_re = re.compile(r"([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\U00020000-\U0002F7FF\uFF00-\uFFEF\u3000-\u303F]+)")

cjkcps = [c for c in range(0x110000) if splitter1.fullmatch(chr(c))]
cjkranges = []
prev = None
start = None
for cp in cjkcps:
  if start is None:
    start = cp
    prev = cp
  elif prev == cp - 1:
    prev = cp
  else:
    cjkranges.append((start, prev))
    start = cp
    prev = cp
if start is not None:
  cjkranges.append((start, prev))

splitter_nonscx = re.compile(rf"([{''.join(['-'.join([chr(c) for c in tup]) for tup in cjkranges])}]+)")

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_split_both(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter1.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

def loop_split_both_re(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter_re.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

def loop_split_both_nonscx(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter_nonscx.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('regex partition en', loop_split_both, en)
  runner.bench_time_func('regex partition zh', loop_split_both, zh)
  runner.bench_time_func('re given pat en', loop_split_both_re, en)
  runner.bench_time_func('re given pat zh', loop_split_both_re, zh)
  runner.bench_time_func('re equivalent en', loop_split_both_nonscx, en)
  runner.bench_time_func('re equivalent zh', loop_split_both_nonscx, zh)
$ python3 ~/test_split.py
.....................
simple split en: Mean +- std dev: 616 ns +- 37 ns
.....................
simple split zh (wrong): Mean +- std dev: 325 ns +- 11 ns
.....................
regex partition en: Mean +- std dev: 8.11 us +- 0.35 us
.....................
regex partition zh: Mean +- std dev: 6.34 us +- 0.22 us
.....................
re given pat en: Mean +- std dev: 3.22 us +- 0.17 us (<- your pattern)
.....................
re given pat zh: Mean +- std dev: 3.66 us +- 0.17 us (<- your pattern)
.....................
re equivalent en: Mean +- std dev: 12.2 us +- 0.3 us (<- my pattern in re)
.....................
re equivalent zh: Mean +- std dev: 5.19 us +- 0.18 us (<- my pattern in re)

where my pattern should translate roughly to:

r"([\U000002EA-\U000002EB\U00001100-\U000011FF\U00002E80-\U00002E99\U00002E9B-\U00002EF3\U00002F00-\U00002FD5\U00003001-\U00003003\U00003005-\U00003011\U00003013-\U0000301F\U00003021-\U00003035\U00003037-\U0000303F\U00003041-\U00003096\U00003099-\U000030FF\U00003105-\U0000312F\U00003131-\U0000318E\U00003190-\U000031E3\U000031F0-\U0000321E\U00003220-\U00003247\U00003260-\U0000327E\U00003280-\U000032B0\U000032C0-\U000032CB\U000032D0-\U00003370\U0000337B-\U0000337F\U000033E0-\U000033FE\U00003400-\U00004DBF\U00004E00-\U00009FFF\U0000A700-\U0000A707\U0000A960-\U0000A97C\U0000AC00-\U0000D7A3\U0000D7B0-\U0000D7C6\U0000D7CB-\U0000D7FB\U0000F900-\U0000FA6D\U0000FA70-\U0000FAD9\U0000FE45-\U0000FE46\U0000FF00-\U0000FFEF\U00016FE2-\U00016FE3\U00016FF0-\U00016FF1\U0001AFF0-\U0001AFF3\U0001AFF5-\U0001AFFB\U0001AFFD-\U0001AFFE\U0001B000-\U0001B122\U0001B132-\U0001B132\U0001B150-\U0001B152\U0001B155-\U0001B155\U0001B164-\U0001B167\U0001D360-\U0001D371\U0001F200-\U0001F200\U0001F250-\U0001F251\U00020000-\U0002A6DF\U0002A700-\U0002B739\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002EBF0-\U0002EE5D\U0002F800-\U0002FA1D\U00030000-\U0003134A\U00031350-\U000323AF]+)"

If this is a sign that fragmented code point ranges slow down the regexp, we could consider a words-count-style block-based approach, or more aggressive merging of ranges. Simply branching the logic by uses_ngram() is always an option too.
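To sketch that opt-in branch (assuming the existing uses_ngram() language flag; CJK_SPLITTER stands for a hypothetical precompiled pattern built from the ranges above):

def count_words(text, language):
    # Only languages flagged as n-gram based (roughly CJK) pay for the
    # character-aware counting; everything else keeps the fast simple split.
    if not language.uses_ngram():
        return len(text.split())
    count = 0
    even = True
    for sec in CJK_SPLITTER.split(text):  # capturing group: odd items are CJK runs
        count += len(sec.split()) if even else len(sec)
        even = not even
    return count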

@yheuhtozr
Contributor Author

Okay... so reducing range complexity does matter a lot. After manually collapsing the block ranges generated by #10284 (comment) into:

splitter_temp_unicodedataplus = re.compile(r"([\U00001100-\U000011FF\U00002E80-\U00002FDF\U00002FF0-\U00009FFF\U0000A960-\U0000A97F\U0000AC00-\U0000D7FF\U0000F900-\U0000FAFF\U0000FE30-\U0000FE4F\U0000FF00-\U0000FFEF\U0001AFF0-\U0001B16F\U0001F200-\U0001F2FF\U00020000-\U0003FFFF]+)")

runs like:

$ python3 ~/test_split.py
.....................
simple split en: Mean +- std dev: 615 ns +- 28 ns
.....................
simple split zh (wrong): Mean +- std dev: 330 ns +- 11 ns
.....................
re equivalent en: Mean +- std dev: 3.99 us +- 0.18 us
.....................
re equivalent zh: Mean +- std dev: 3.79 us +- 0.18 us

I believe an equivalent result can be computed automatically somehow.

@yheuhtozr
Contributor Author

I have figured out a couple of things:

  • unicodedataplus does not return the block name of an unassigned character, which makes the ranges highly fragmented. There is a package tinyunicodeblock that directly exposes the start and end of each block.
  • re is not smart enough to infer range continuity, so surprisingly e.g. [\u1000-\u1FFF\u2000-\u2FFF] runs slower than [\u1000-\u2FFF], which requires you to optimize by hand (a quick demonstration is sketched below).
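
A quick way to see the range-continuity point (sketch; absolute timings depend on the machine):

import re
import timeit

fragmented = re.compile(r"[\u1000-\u1FFF\u2000-\u2FFF]+")  # same coverage, two ranges
collapsed = re.compile(r"[\u1000-\u2FFF]+")                # same coverage, one range
text = "abc \u1234\u2345 def " * 1000

print("fragmented:", timeit.timeit(lambda: fragmented.findall(text), number=1000))
print("collapsed: ", timeit.timeit(lambda: collapsed.findall(text), number=1000))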

Current best:

test_split.py
import re
import tinyunicodeblock
import pyperf

en = """Cortana was demonstrated for the first time at the Microsoft Build developer conference in San Francisco in April 2014. It was launched as a key ingredient of Microsoft's planned "makeover" of future operating systems for Windows Phone and Windows."""
zh = """小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬,微软发布了“小娜”这一名字,作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。"""

cjkset = set([
  'CJK Unified Ideographs',
  'CJK Unified Ideographs Extension A',
  # 'CJK Unified Ideographs Extension B', # assumes entire Plane 2-3 would be CJK
  # 'CJK Unified Ideographs Extension C',
  # 'CJK Unified Ideographs Extension D',
  # 'CJK Unified Ideographs Extension E',
  # 'CJK Unified Ideographs Extension F',
  # 'CJK Unified Ideographs Extension G',
  # 'CJK Unified Ideographs Extension H',
  # 'CJK Unified Ideographs Extension I',
  'CJK Compatibility',
  'CJK Compatibility Forms',
  'CJK Compatibility Ideographs',
  # 'CJK Compatibility Ideographs Supplement',
  'CJK Radicals Supplement',
  'CJK Strokes',
  'CJK Symbols and Punctuation',
  'Hiragana',
  'Katakana',
  'Katakana Phonetic Extensions',
  'Kana Extended-A',
  'Kana Extended-B',
  'Kana Supplement',
  'Small Kana Extension',
  'Hangul Jamo',
  'Hangul Compatibility Jamo',
  'Hangul Jamo Extended-A',
  'Hangul Jamo Extended-B',
  'Hangul Syllables',
  'Halfwidth and Fullwidth Forms',
  'Enclosed CJK Letters and Months',
  'Enclosed Ideographic Supplement',
  'Kangxi Radicals',
  'Ideographic Description Characters',
  'Kanbun',
  'Yijing Hexagram Symbols', # not strictly necessary but for the sake of range continuity
  'Bopomofo',
  'Bopomofo Extended',
])

cjkranges = [(b[0], b[1]) for b in tinyunicodeblock.BLOCKS if b[2] in cjkset]
cjkranges.sort(key=lambda r: ord(r[0]))
cjkmerged = []
prev = None
for r in cjkranges:
  if prev is None:
    prev = r
  elif ord(prev[1]) == ord(r[0]) - 1:
    prev = (prev[0], r[1])
  else:
    cjkmerged.append(prev)
    prev = r
cjkmerged.append(prev)

splitter_nonscx = re.compile(rf"([{''.join([f'{r[0]}-{r[1]}' for r in cjkmerged])}\U00020000-\U0003FFFF]+)")

def simple_split(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    len(txt.split())
  return pyperf.perf_counter() - t0

def loop_split_both_nonscx(loops, txt):
  loops = range(loops)
  t0 = pyperf.perf_counter()
  for _ in loops:
    # logic here
    count = 0
    even = True
    for sec in splitter_nonscx.split(txt):
      if even:
        count += len(sec.split())
      else:
        count += len(sec)
      even = not even
    # return count
  return pyperf.perf_counter() - t0

if __name__ == "__main__":
  runner = pyperf.Runner()
  runner.bench_time_func('simple split en', simple_split, en)
  runner.bench_time_func('simple split zh (wrong)', simple_split, zh)
  runner.bench_time_func('re tinyunicodeblock en', loop_split_both_nonscx, en)
  runner.bench_time_func('re tinyunicodeblock zh', loop_split_both_nonscx, zh)
$ python3 ~/test_split.py
.....................
simple split en: Mean +- std dev: 621 ns +- 29 ns
.....................
simple split zh (wrong): Mean +- std dev: 330 ns +- 13 ns
.....................
re tinyunicodeblock en: Mean +- std dev: 3.88 us +- 0.15 us
.....................
re tinyunicodeblock zh: Mean +- std dev: 3.75 us +- 0.16 us

@nijel
Member

nijel commented Dec 15, 2023

Yes, I tried to collapse the ranges, but I really didn't attempt to make it complete, so some ranges were most likely missing. On the other hand, I intentionally included some reserved blocks because it really doesn't matter in this case (the behavior of reserved code points is not defined, so let's choose whatever performs better).

@yheuhtozr
Contributor Author

My last attempt automatically consolidates the listed blocks into consecutive ranges as much as possible, with some educated heuristics, so what you see is the performance of a feature-complete version (although the generation logic should be written more cleanly). Do you see any more room for optimization?

@nijel
Member

nijel commented Dec 16, 2023

I think this is fine performance-wise.

I'd like to avoid a tinyunicodeblock runtime dependency, so the regexp should be generated in scripts/ and embedded in the code.
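What I'd picture for the embedded side, roughly (a sketch; the constant name is a placeholder and the range list is abbreviated):

# Generated by a script under scripts/, not edited by hand.
CJK_RANGES = (
    "\u1100-\u11FF"          # Hangul Jamo
    "\u2E80-\u2FDF"          # CJK Radicals Supplement through Kangxi Radicals
    "\u2FF0-\u9FFF"          # Ideographic Description Characters through CJK Unified Ideographs
    "\uAC00-\uD7FF"          # Hangul Syllables and Hangul Jamo Extended-B
    "\uF900-\uFAFF"          # CJK Compatibility Ideographs
    "\uFF00-\uFFEF"          # Halfwidth and Fullwidth Forms
    "\U00020000-\U0003FFFF"  # Planes 2-3 (CJK extensions)
)

# Runtime code then only needs the stdlib re, with no tinyunicodeblock import:
import re

CJK_SPLITTER = re.compile(f"([{CJK_RANGES}]+)")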

@yheuhtozr yheuhtozr mentioned this pull request Dec 24, 2023