SudachiPy compatibility #52

mh-northlander · 2021-09-29T08:43:24Z

Part of #25.
Provide the same interface as SudachiPy as much as possible.

Maybe after the Rust API becomes stable (#28).

mh-northlander · 2021-09-30T03:55:44Z

SudachiPy API to implement

Dictionary
- __new__(config_path=None, resource_dir=None, dict_type=None)
- create(self, mode=None) -> Tokenizer
- close(self)
Tokenizer
- tokenize(self, text: str, mode=None, logger=None) -> MorphemeList
MorphemeList
- behave as List[Morpheme]
  - __getitem__, __len__, __iter__
- __str__
  - "".join([morpheme.surface() for morpheme in self])
- empty(cls) -> MorphemeList
  - generates an empty MorphemeList
- get_internal_cost(self)
  - returns the total cost of this path
- size(self)
  - alias of __len__
Morpheme
- getter functions
  - begin, end
    - returns the index in the input text
  - surface, part_of_speech, part_of_speech_id
  - dictionary_form, normalized_form, reading_form
  - is_oov, word_id, dictionary_id, synonym_group_ids
- split(self, mode) -> MorphemeList
  - splits this morpheme based on mode
- get_word_info(self) -> WordInfo
WordInfo
- getter
  - surface, head_word_length, pos_id
  - normalized_form, dictionary_form_word_id, dictionary_form, reading_form
  - a_unit_split, b_unit_split, word_structure, synonym_group_ids
- length(self)
  - shorthand for head_word_length
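
To make the MorphemeList part of the list above concrete, here is a minimal pure-Python sketch of the list-like protocol (`__getitem__`, `__len__`, `__iter__`, `__str__`, `empty`, `get_internal_cost`, `size`). It is not the SudachiPy or binding implementation: the `Morpheme` class here is a hypothetical stand-in that only carries a surface string, and the constructor arguments are assumptions made for illustration.

```python
from typing import Iterator, List


class Morpheme:
    """Hypothetical stand-in that only carries a surface string."""

    def __init__(self, surface: str) -> None:
        self._surface = surface

    def surface(self) -> str:
        return self._surface


class MorphemeList:
    """Toy container illustrating the list-like protocol from the list above."""

    def __init__(self, morphemes: List[Morpheme], internal_cost: int = 0) -> None:
        # The constructor signature is an assumption, not part of the listed API.
        self._morphemes = list(morphemes)
        self._internal_cost = internal_cost

    @classmethod
    def empty(cls) -> "MorphemeList":
        # Generates an empty MorphemeList.
        return cls([])

    def __getitem__(self, index: int) -> Morpheme:
        return self._morphemes[index]

    def __len__(self) -> int:
        return len(self._morphemes)

    def __iter__(self) -> Iterator[Morpheme]:
        return iter(self._morphemes)

    def __str__(self) -> str:
        # Surfaces joined without a separator, as written in the list above.
        return "".join([morpheme.surface() for morpheme in self])

    def get_internal_cost(self) -> int:
        # Total cost of this path (here just a stored number).
        return self._internal_cost

    def size(self) -> int:
        # Alias of __len__.
        return len(self)


morphemes = MorphemeList([Morpheme("東京"), Morpheme("都")], internal_cost=42)
print(len(morphemes), morphemes.size(), str(morphemes))
```

Because the container defines `__len__`, `__getitem__`, and `__iter__`, code written against `List[Morpheme]` (indexing, `for` loops, `len`) keeps working without the object being an actual `list`.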

eiennohito · 2021-09-30T04:22:34Z

Dictionary - OK
Tokenizer - OK
Morpheme / WordInfo - from the binding's point of view, they should probably be aliases if possible, maybe eventually with lazy loading of dictionary information
- Otherwise there would be a lot of Arc internally in the morpheme (or is that OK?)
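
As a rough illustration of the lazy-loading idea, the sketch below shows a morpheme that holds only a word ID and a reference to its dictionary, and fetches the word information on first access. Everything here is hypothetical: `Lexicon`, its `lookups` counter, and the two-field `WordInfo` are stand-ins for illustration, and the real binding is written in Rust, where the shared reference would be the `Arc` mentioned above.

```python
from functools import cached_property
from typing import Dict, NamedTuple


class WordInfo(NamedTuple):
    """Reduced stand-in with only two of the fields."""
    surface: str
    reading_form: str


class Lexicon:
    """Hypothetical dictionary stand-in that counts how often it is queried."""

    def __init__(self, entries: Dict[int, WordInfo]) -> None:
        self._entries = entries
        self.lookups = 0

    def get_word_info(self, word_id: int) -> WordInfo:
        self.lookups += 1
        return self._entries[word_id]


class Morpheme:
    """Holds a shared reference to the lexicon and loads WordInfo on demand."""

    def __init__(self, lexicon: Lexicon, word_id: int) -> None:
        self._lexicon = lexicon
        self._word_id = word_id

    @cached_property
    def _word_info(self) -> WordInfo:
        # Runs once, on first access; the result is cached on the instance.
        return self._lexicon.get_word_info(self._word_id)

    def get_word_info(self) -> WordInfo:
        return self._word_info

    def surface(self) -> str:
        return self._word_info.surface

    def reading_form(self) -> str:
        return self._word_info.reading_form


lexicon = Lexicon({0: WordInfo("東京", "トウキョウ")})
morpheme = Morpheme(lexicon, 0)
print(lexicon.lookups)  # no lookup has happened yet
print(morpheme.surface(), morpheme.reading_form(), lexicon.lookups)
```

In this sketch the getters on `Morpheme` simply forward to the cached `WordInfo`, which is one way the two types could behave like aliases from the caller's side.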

mh-northlander · 2021-10-07T09:23:39Z

Remaining differences:
Dictionary.close() is not implemented.
Dictionary.__new__(-) should take dict_type (#73).
Tokenizer.tokenize(-) takes an enable_debug flag instead of logger.
MorphemeList.empty(dict: Dictionary) takes a dictionary as an argument.
SplitMode is located at the top level (not under Tokenizer).
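
For callers who want to keep SudachiPy-style call sites, two of these differences could in principle be bridged by a thin adapter on the user side. The sketch below is only an illustration under stated assumptions: `NewTokenizer` and `CompatTokenizer` are hypothetical stand-ins (not classes from either library), the whitespace split is a placeholder for real tokenization, and mapping a logger to `enable_debug` by its DEBUG level is just one possible policy.

```python
import logging
from enum import Enum
from typing import List, Optional


class SplitMode(Enum):
    """Stand-in for a split mode defined at the top level."""
    A = "A"
    B = "B"
    C = "C"


class NewTokenizer:
    """Hypothetical stand-in for a tokenizer that takes enable_debug."""

    def __init__(self) -> None:
        self.last_debug = False

    def tokenize(self, text: str, mode: Optional[SplitMode] = None,
                 enable_debug: bool = False) -> List[str]:
        self.last_debug = enable_debug
        return text.split()  # placeholder, not real morphological analysis


class CompatTokenizer:
    """Adapter offering a tokenize(text, mode, logger) signature."""

    def __init__(self, inner: NewTokenizer) -> None:
        self._inner = inner

    def tokenize(self, text: str, mode: Optional[SplitMode] = None,
                 logger: Optional[logging.Logger] = None) -> List[str]:
        # Assumed policy: debug output is on only if the logger accepts DEBUG.
        enable_debug = logger is not None and logger.isEnabledFor(logging.DEBUG)
        return self._inner.tokenize(text, mode, enable_debug=enable_debug)


# Alias so that CompatTokenizer.SplitMode.C resolves as well as SplitMode.C.
CompatTokenizer.SplitMode = SplitMode

debug_logger = logging.getLogger("compat-sketch")
debug_logger.setLevel(logging.DEBUG)

inner = NewTokenizer()
compat = CompatTokenizer(inner)
print(compat.tokenize("a b c", CompatTokenizer.SplitMode.C, logger=debug_logger))
print(inner.last_debug)
```

The adapter does not forward any log records to the logger; it only translates its presence into the boolean flag, so output that the debug mode prints would still go wherever the underlying tokenizer sends it.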

eiennohito · 2021-10-07T10:04:01Z

The most problematic part will be passing a logger to the tokenizer. I'm not sure whether any actual user relies on it (the option seems to be mostly for debugging purposes).

eiennohito · 2021-10-20T02:16:45Z

There may be comments from GiNZA, but we will create new issues for them.

hiroshi-matsuda-rit · 2021-10-20T08:50:22Z

I did not find any problematic points for GiNZA in this issue.
Please go ahead!

eiennohito mentioned this issue Sep 30, 2021

AnalyzedSentence Design #55

Closed

mh-northlander mentioned this issue Oct 1, 2021

Serializable PyTokenizer #58

Closed

eiennohito added this to the 0.1 milestone Oct 1, 2021

mh-northlander mentioned this issue Oct 7, 2021

Change Python API #68

Closed

mh-northlander mentioned this issue Oct 8, 2021

Read dictionary installed by sudachidict_* #73

Closed

eiennohito added the python Python binding-related label Oct 8, 2021

This was referenced Oct 8, 2021

Python compatibility #77

Merged

Fix import path for python binding #84

Closed

mh-northlander self-assigned this Oct 11, 2021

eiennohito closed this as completed Oct 20, 2021
