pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split() #182

hiroshi-matsuda-rit · 2021-11-24T19:44:52Z

I found an input pattern which causes exception in sudachipy==0.6.0.
The reproducing code below is the abstract of the Japanese tokenizer of spaCy v3.2.0.

from sudachipy import dictionary, tokenizer

def get_dtokens(tokenizer, sudachipy_tokens, need_sub_tokens):
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
    dtokens = [
        (
            t.surface(),
            t.part_of_speech()[:4],
            t.part_of_speech()[4:],
            t.dictionary_form(),
            t.normalized_form(),
            t.reading_form(),
            sub_tokens_list[idx] if need_sub_tokens else None,
        ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
    ]
    return dtokens

def get_sub_tokens(tokenizer, sudachipy_tokens):
    sub_tokens_list = []
    for token in sudachipy_tokens:
        sub_a = token.split(tokenizer.SplitMode.A)
        if len(sub_a) == 1:  # no sub tokens
            sub_tokens_list.append(None)
        else:
            sub_b = token.split(tokenizer.SplitMode.B)
            if len(sub_a) == len(sub_b):
                dtokens = get_dtokens(tokenizer, sub_a, False)
                sub_tokens_list.append([dtokens, dtokens])
            else:
                sub_tokens_list.append(
                    [
                        get_dtokens(tokenizer, sub_a, False),
                        get_dtokens(tokenizer, sub_b, False),
                    ]
                )
    return sub_tokens_list

tokenizer = dictionary.Dictionary().create(mode=tokenizer.Tokenizer.SplitMode.C)
sudachipy_tokens = tokenizer.tokenize("Ｔ社はｅコマース（電子商取引）を活用したリサイクル部品の取扱いを系列の部品販売店で平成１３年１０月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ランプ類などのＴ社の外装部品（「エ コロパーツ」）全１６品目と大手リサイクル部品流通事業社のＮグループ及びＢ社から供給を受ける国内全メーカーの外装・機能部品で、専用の中古部品ｅコマースサイトを開設し、自動車保有期間の長期化に伴う低価格修理の需要に応えることにしています。")
get_dtokens(tokenizer, sudachipy_tokens, True)

/home/matsuda/ginza/test.py:21: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_a = token.split(tokenizer.SplitMode.A)
/home/matsuda/ginza/test.py:25: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_b = token.split(tokenizer.SplitMode.B)
thread '<unnamed>' panicked at 'byte index 10 is not a char boundary; it is inside 'ｅ' (bytes 9..12) of `Ｔ社はｅコマース（電子商取引）を活用したリサイクル部品の取扱いを系列の部品販売店で平成１３年１０月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]', /github/workspace/sudachi/src/analysis/morpheme.rs:122:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/matsuda/ginza/test.py", line 40, in <module>
    get_dtokens(tokenizer, sudachipy_tokens, True)
  File "/home/matsuda/ginza/test.py", line 4, in get_dtokens
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
  File "/home/matsuda/ginza/test.py", line 27, in get_sub_tokens
    dtokens = get_dtokens(tokenizer, sub_a, False)
  File "/home/matsuda/ginza/test.py", line 5, in get_dtokens
    dtokens = [
  File "/home/matsuda/ginza/test.py", line 14, in <listcomp>
    ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
pyo3_runtime.PanicException: byte index 10 is not a char boundary; it is inside 'ｅ' (bytes 9..12) of `Ｔ社はｅコマース（電子商取引）を活用したリサイクル部品の取扱いを系列の部品販売店で平成１３年１０月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]

The text was updated successfully, but these errors were encountered:

eiennohito · 2021-11-24T22:58:20Z

Moved the issue to Sudachi.rs repo.

eiennohito · 2021-11-24T23:05:30Z

Seems to be the problem with indices recalculation around Python bindings and surface normalization.

fixes WorksApplications#182

eiennohito · 2021-11-26T08:39:24Z

@hiroshi-matsuda-rit should be fixed after #183 is merged

Also, doing len(t.surface()) is inefficient because it creates a new string every time now. I wonder if something can be done with that. After #183 is merged, it will become possible to write len(t) which does not create any strings.

Morpheme.split() method is also very inefficient with regards to allocations. I want to redesign after-analysis split API, but have no ideas. If you have any particular requirements or wishes on that, please feel free to comment.

hiroshi-matsuda-rit · 2021-11-27T13:50:41Z

Also, doing len(t.surface()) is inefficient because it creates a new string every time now.

I'd like to refactor this logic if there is any method to identify the byte offset of the beginning of each morpheme.

Morpheme.split() method is also very inefficient with regards to allocations. I want to redesign after-analysis split API, but have no ideas.

To reduce the allocation costs, it's better to call the split-analysis API with a buffer instance argument like:

input field
- morpheme id list buffer and its length
output field
- list of split morpheme id list for each input morpheme

eiennohito · 2021-11-28T02:18:13Z

I'd like to refactor this logic if there is any method to identify the byte offset of the beginning of each morpheme.

Python operates on codepoint offsets though. It is possible to get those (Morpheme.begin()/Morpheme.end())
If you need offsets from the utf-8 byte sequence it is possible to expose them, but they will be mostly useless for Python in my opinion.

To reduce the allocation costs, it's better to call the split-analysis API with a buffer instance argument like:

I'm leaning towards using MorphemeList as an output parameter which will be filled by the new split results. This will minimize API surface while achieving memory reuse goals. The only problem which arises that it will be impossible to hold into old morphemes, but that may be also solvable.

* add test infrastructure for using non-built dictionaries * correctly resolve boundaries wrt normalization on nodes splitting fixes #182

eiennohito transferred this issue from WorksApplications/SudachiPy Nov 24, 2021

eiennohito added this to the 0.6.1 milestone Nov 24, 2021

eiennohito added the bug Something isn't working label Nov 24, 2021

eiennohito added a commit to eiennohito/sudachi.rs that referenced this issue Nov 26, 2021

correctly resolve boundaries wrt normalization on nodes splitting

42b9bbe

fixes WorksApplications#182

eiennohito mentioned this issue Nov 26, 2021

Correctly resolve node boundaries wrt node splitting #183

Merged

adrianeboyd mentioned this issue Nov 26, 2021

PanicException in SudachiPy explosion/spaCy#9751

Closed

eiennohito closed this as completed in #183 Nov 29, 2021

eiennohito added a commit that referenced this issue Nov 29, 2021

Correctly resolve node boundaries wrt node splitting (#183)

7bf1a43

* add test infrastructure for using non-built dictionaries * correctly resolve boundaries wrt normalization on nodes splitting fixes #182

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split() #182

pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split() #182

hiroshi-matsuda-rit commented Nov 24, 2021

eiennohito commented Nov 24, 2021

eiennohito commented Nov 24, 2021

eiennohito commented Nov 26, 2021 •

edited

hiroshi-matsuda-rit commented Nov 27, 2021

eiennohito commented Nov 28, 2021

pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split() #182

pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split() #182

Comments

hiroshi-matsuda-rit commented Nov 24, 2021

eiennohito commented Nov 24, 2021

eiennohito commented Nov 24, 2021

eiennohito commented Nov 26, 2021 • edited

hiroshi-matsuda-rit commented Nov 27, 2021

eiennohito commented Nov 28, 2021

eiennohito commented Nov 26, 2021 •

edited