
pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split() #182

Closed
hiroshi-matsuda-rit opened this issue Nov 24, 2021 · 5 comments · Fixed by #183
Labels: bug (Something isn't working)
Milestone: 0.6.1

Comments

@hiroshi-matsuda-rit

I found an input pattern which causes an exception in sudachipy==0.6.0.
The reproducing code below is an abridged version of the Japanese tokenizer of spaCy v3.2.0.

from sudachipy import dictionary, tokenizer

def get_dtokens(tokenizer, sudachipy_tokens, need_sub_tokens):
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
    dtokens = [
        (
            t.surface(),
            t.part_of_speech()[:4],
            t.part_of_speech()[4:],
            t.dictionary_form(),
            t.normalized_form(),
            t.reading_form(),
            sub_tokens_list[idx] if need_sub_tokens else None,
        ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
    ]
    return dtokens

def get_sub_tokens(tokenizer, sudachipy_tokens):
    sub_tokens_list = []
    for token in sudachipy_tokens:
        sub_a = token.split(tokenizer.SplitMode.A)
        if len(sub_a) == 1:  # no sub tokens
            sub_tokens_list.append(None)
        else:
            sub_b = token.split(tokenizer.SplitMode.B)
            if len(sub_a) == len(sub_b):
                dtokens = get_dtokens(tokenizer, sub_a, False)
                sub_tokens_list.append([dtokens, dtokens])
            else:
                sub_tokens_list.append(
                    [
                        get_dtokens(tokenizer, sub_a, False),
                        get_dtokens(tokenizer, sub_b, False),
                    ]
                )
    return sub_tokens_list

tokenizer = dictionary.Dictionary().create(mode=tokenizer.Tokenizer.SplitMode.C)
sudachipy_tokens = tokenizer.tokenize("T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ランプ類などのT社の外装部品(「エ コロパーツ」)全16品目と大手リサイクル部品流通事業社のNグループ及びB社から供給を受ける国内全メーカーの外装・機能部品で、専用の中古部品eコマースサイトを開設し、自動車保有期間の長期化に伴う低価格修理の需要に応えることにしています。")
get_dtokens(tokenizer, sudachipy_tokens, True)
/home/matsuda/ginza/test.py:21: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_a = token.split(tokenizer.SplitMode.A)
/home/matsuda/ginza/test.py:25: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_b = token.split(tokenizer.SplitMode.B)
thread '<unnamed>' panicked at 'byte index 10 is not a char boundary; it is inside 'e' (bytes 9..12) of `T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]', /github/workspace/sudachi/src/analysis/morpheme.rs:122:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/matsuda/ginza/test.py", line 40, in <module>
    get_dtokens(tokenizer, sudachipy_tokens, True)
  File "/home/matsuda/ginza/test.py", line 4, in get_dtokens
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
  File "/home/matsuda/ginza/test.py", line 27, in get_sub_tokens
    dtokens = get_dtokens(tokenizer, sub_a, False)
  File "/home/matsuda/ginza/test.py", line 5, in get_dtokens
    dtokens = [
  File "/home/matsuda/ginza/test.py", line 14, in <listcomp>
    ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
pyo3_runtime.PanicException: byte index 10 is not a char boundary; it is inside 'e' (bytes 9..12) of `T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]
eiennohito transferred this issue from WorksApplications/SudachiPy Nov 24, 2021
@eiennohito (Collaborator)

Moved the issue to the Sudachi.rs repo.

eiennohito added this to the 0.6.1 milestone Nov 24, 2021
eiennohito added the bug label Nov 24, 2021
@eiennohito (Collaborator)

Seems to be a problem with index recalculation around the Python bindings and surface normalization.
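
For reference, the boundary condition from the panic message can be reproduced in plain Python (a standalone sketch, not SudachiPy code): the fullwidth characters 'T', '社', and 'は' each take 3 bytes in UTF-8, so the fullwidth 'e' occupies bytes 9..12 and byte index 10 falls in the middle of it.

prefix = "T社はe"                         # the first four characters of the input
data = prefix.encode("utf-8")
assert len("T社は".encode("utf-8")) == 9  # 'e' starts at byte 9, ends at byte 12
data[:10].decode("utf-8")                 # UnicodeDecodeError: cut mid-character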

@eiennohito (Collaborator) commented Nov 26, 2021

@hiroshi-matsuda-rit this should be fixed once #183 is merged.

Also, doing len(t.surface()) is inefficient because it currently creates a new string every time. I wonder if something can be done about that. After #183 is merged, it will become possible to write len(t), which does not create any strings.
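
For example, the filter in the reproducing code could then drop the per-morpheme allocation (a sketch, assuming len(t) becomes available as described):

# Sketch, assuming #183 makes Morpheme support len() directly:
# len(t) reports the surface length without building the surface string.
non_empty = [t for t in sudachipy_tokens if len(t) > 0]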

The Morpheme.split() method is also very inefficient with regard to allocations. I want to redesign the after-analysis split API, but have no concrete ideas yet. If you have any particular requirements or wishes for it, please feel free to comment.

@hiroshi-matsuda-rit (Author)

> Also, doing len(t.surface()) is inefficient because it currently creates a new string every time.

I'd like to refactor this logic if there is any method to identify the byte offset of the beginning of each morpheme.

> The Morpheme.split() method is also very inefficient with regard to allocations. I want to redesign the after-analysis split API, but have no concrete ideas yet.

To reduce the allocation costs, it would be better to call the split-analysis API with a buffer instance argument, along these lines (see the sketch after this list):

  • input field
    • morpheme id list buffer and its length
  • output field
    • list of split morpheme id lists, one for each input morpheme
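
A hypothetical shape for such an API; every name below (split_into, out_lists, and so on) is illustrative, not an existing SudachiPy interface:

# Hypothetical buffer-based split API; all names are illustrative.
# The caller owns and reuses both buffers, so repeated calls allocate nothing.
def split_into(tokenizer, mode, morpheme_ids, id_count, out_lists):
    """Fill out_lists[i] with the sub-morpheme ids of morpheme_ids[i].

    morpheme_ids -- input buffer of morpheme ids; id_count entries are used
    out_lists    -- preallocated list of id buffers, one per input morpheme
    """
    ...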

@eiennohito (Collaborator)

> I'd like to refactor this logic if there is any method to identify the byte offset of the beginning of each morpheme.

Python operates on codepoint offsets, though, and it is possible to get those (Morpheme.begin()/Morpheme.end()).
If you need offsets into the UTF-8 byte sequence, it is possible to expose them as well, but in my opinion they would be mostly useless for Python.
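
For instance, the len(t.surface()) > 0 filter in the reproducing code could be rewritten against those offsets (a sketch; it assumes begin()/end() return codepoint offsets into the original input text, and source_text is an illustrative name for the string that was tokenized):

# Sketch: begin()/end() are codepoint offsets, so emptiness can be checked,
# and the surface recovered by slicing, without calling surface() per token.
for t in sudachipy_tokens:
    begin, end = t.begin(), t.end()
    if end > begin:                       # non-empty surface
        surface = source_text[begin:end]  # slice only when actually needed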

> To reduce the allocation costs, it would be better to call the split-analysis API with a buffer instance argument.

I'm leaning towards using a MorphemeList as an output parameter that gets filled with the new split results. This would minimize the API surface while achieving the memory-reuse goals. The only problem that arises is that it would become impossible to hold on to the old morphemes, but that may also be solvable.
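
From Python, that could look roughly like the sketch below; the out= parameter and MorphemeList.empty() are illustrative of the design under discussion, not a released API:

# Sketch of the output-parameter idea; `out=` is the proposed design and
# `dic` stands for a loaded Dictionary instance (both illustrative).
buf = MorphemeList.empty(dic)             # reusable result buffer
for token in sudachipy_tokens:
    sub_a = token.split(tokenizer.SplitMode.A, out=buf)
    # sub_a is buf refilled in place; its previous contents are overwritten,
    # so copy anything you still need before the next call.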

eiennohito added a commit that referenced this issue Nov 29, 2021
* add test infrastructure for using non-built dictionaries

* correctly resolve boundaries wrt normalization on nodes splitting

fixes #182