SudachiPy compatibility #52

Closed
mh-northlander opened this issue Sep 29, 2021 · 6 comments

@mh-northlander
Collaborator

Part of #25.
Provide the same interface as SudachiPy as far as possible.

Maybe after the Rust API becomes stable (#28).

@mh-northlander
Collaborator Author

mh-northlander commented Sep 30, 2021

SudachiPy API to implement

  • Dictionary
    • __new__(config_path=None, resource_dir=None, dict_type=None)
    • create(self, mode=None) -> Tokenizer
    • close(self)
  • Tokenizer
    • tokenize(self, text: str, mode=None, logger=None) -> MorphemeList
  • MorphemeList
    • behave as List[Morpheme]
      • __getitem__, __len__, __iter__
    • __str__
      • "".join([morpheme.surface() for morpheme in self])
    • empty(cls) -> MorphemeList
      • generates an empty MorphemeList
    • get_internal_cost(self)
      • returns the total cost of this path
    • size(self)
      • alias of __len__
  • Morpheme
    • getter functions
      • begin, end
        • returns the index in the input text
      • surface, part_of_speech, part_of_speech_id
      • dictionary_form, normalized_form, reading_form
      • is_oov, word_id, dictionary_id, synonym_group_ids
    • split(self, mode) -> MorphemeList
      • splits this morpheme based on mode
    • get_word_info(self) -> WordInfo
  • WordInfo
    • getter
      • surface, head_word_length, pos_id
      • normalized_form, dictionary_form_word_id, dictionary_form, reading_form
      • a_unit_split, b_unit_split, word_structure, synonym_group_ids
    • length(self)
      • shorthand for head_word_length
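
A minimal sketch of the list-like behaviour listed above for MorphemeList, written in plain Python. The classes are stand-ins for illustration only, not the binding's implementation; the constructor arguments and the `internal_cost` field are assumptions.

```python
from typing import Iterator, List


class Morpheme:
    """Stand-in morpheme that stores only what this sketch needs."""

    def __init__(self, surface: str, begin: int, end: int) -> None:
        self._surface = surface
        self._begin = begin
        self._end = end

    def surface(self) -> str:
        return self._surface

    def begin(self) -> int:
        # Start index of this morpheme in the input text
        return self._begin

    def end(self) -> int:
        # End index (exclusive) of this morpheme in the input text
        return self._end


class MorphemeList:
    """Stand-in list that behaves like List[Morpheme]."""

    def __init__(self, morphemes: List[Morpheme], internal_cost: int = 0) -> None:
        self._morphemes = list(morphemes)
        self._internal_cost = internal_cost  # assumed to be supplied by the tokenizer

    @classmethod
    def empty(cls) -> "MorphemeList":
        # Generates an empty MorphemeList
        return cls([])

    def __getitem__(self, index: int) -> Morpheme:
        return self._morphemes[index]

    def __len__(self) -> int:
        return len(self._morphemes)

    def __iter__(self) -> Iterator[Morpheme]:
        return iter(self._morphemes)

    def __str__(self) -> str:
        # Concatenate the surfaces, as described in the list above
        return "".join(m.surface() for m in self)

    def size(self) -> int:
        # Alias of __len__
        return len(self)

    def get_internal_cost(self) -> int:
        return self._internal_cost
```

With these stand-ins, `str(MorphemeList([Morpheme("東京", 0, 2), Morpheme("都", 2, 3)]))` yields the original text `東京都`, and `len(...)`, indexing, and iteration work as they would on a plain list.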

@eiennohito
Collaborator

eiennohito commented Sep 30, 2021

  • Dictionary - OK
  • Tokenizer - OK
  • Morpheme / WordInfo - from the binding's point of view they should probably be aliases if possible, maybe with lazy loading of dictionary information eventually
    • Otherwise there would be a lot of Arc internally in the morpheme (or is that OK?)
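
A rough sketch of what lazy loading of dictionary information could look like from the Python side. Everything here is hypothetical: the `loader` callable stands in for a dictionary lookup, and the reduced `WordInfo` field set is chosen only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class WordInfo:
    """Reduced stand-in for WordInfo (only a few of the fields listed earlier)."""
    surface: str
    head_word_length: int
    pos_id: int
    reading_form: str

    def length(self) -> int:
        # Shorthand for head_word_length
        return self.head_word_length


class Morpheme:
    """Stand-in morpheme that fetches its WordInfo on first access."""

    def __init__(self, word_id: int, loader: Callable[[int], WordInfo]) -> None:
        self._word_id = word_id
        self._loader = loader  # hypothetical dictionary lookup
        self._info: Optional[WordInfo] = None

    def word_id(self) -> int:
        # Available without touching the dictionary
        return self._word_id

    def get_word_info(self) -> WordInfo:
        # Load once, then reuse the cached value
        if self._info is None:
            self._info = self._loader(self._word_id)
        return self._info

    def surface(self) -> str:
        return self.get_word_info().surface

    def reading_form(self) -> str:
        return self.get_word_info().reading_form
```

In this sketch the getters on Morpheme simply forward to the cached WordInfo, so the two types expose the same data and the dictionary is consulted at most once per morpheme.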

@mh-northlander
Collaborator Author

mh-northlander commented Oct 7, 2021

Differences left:
Dictionary.close() is not implemented.
Dictionary.__new__(-) should take dict_type (#73).
Tokenizer.tokenize(-) takes an enable_debug flag instead of logger.
MorphemeList.empty(dict: Dictionary) takes a dictionary as an argument.
SplitMode is located at the top level (not under Tokenizer).
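
As an illustration of the last difference, one possible way to keep SudachiPy-style `Tokenizer.SplitMode.C` resolving while SplitMode lives at the top level is a class-level alias. The classes below are stand-ins written for this sketch, not the binding's actual definitions.

```python
from enum import Enum


class SplitMode(Enum):
    """Stand-in for the top-level split mode."""
    A = "A"
    B = "B"
    C = "C"


class Tokenizer:
    """Stand-in tokenizer; only the alias matters here."""

    # Re-export the top-level enum so the nested spelling still works
    SplitMode = SplitMode
```

Both spellings then refer to the same enum member, so code written against either layout would compare equal.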

@eiennohito
Collaborator

The most problematic thing will be passing a logger to the tokenizer. I'm not sure whether any actual user uses it (the option seems to be mostly for debugging purposes).
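
If the logger argument did need to be accepted, one conceivable shim is to map it onto the enable_debug flag. This is only a sketch under assumptions: `NewTokenizer` and `CompatTokenizer` are hypothetical stand-ins, and treating "logger enabled for DEBUG" as "enable_debug" is a guess at reasonable behaviour, not something decided in this issue.

```python
import logging
from typing import List, Optional


class NewTokenizer:
    """Hypothetical stand-in for a tokenizer that takes enable_debug."""

    def __init__(self) -> None:
        self.last_enable_debug: Optional[bool] = None

    def tokenize(self, text: str, mode=None, enable_debug: bool = False) -> List[str]:
        # Record the flag so the mapping can be observed; split on whitespace
        self.last_enable_debug = enable_debug
        return text.split()


class CompatTokenizer:
    """Hypothetical wrapper exposing the SudachiPy-style logger argument."""

    def __init__(self, inner: NewTokenizer) -> None:
        self._inner = inner

    def tokenize(self, text: str, mode=None,
                 logger: Optional[logging.Logger] = None) -> List[str]:
        # A logger that would emit DEBUG records is taken as a request for debug output
        enable_debug = logger is not None and logger.isEnabledFor(logging.DEBUG)
        return self._inner.tokenize(text, mode=mode, enable_debug=enable_debug)
```

The wrapper drops the logger itself, so any debug output would still go wherever the underlying tokenizer sends it rather than to the caller's handlers.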

@eiennohito eiennohito added the python Python binding-related label Oct 8, 2021
This was referenced Oct 8, 2021
@mh-northlander mh-northlander self-assigned this Oct 11, 2021
@eiennohito
Collaborator

There may be comments from GiNZA, but we will create new issues for them.

@hiroshi-matsuda-rit

I did not find any problematic points for GiNZA in this issue.
Please go ahead!
