implement analysis memory reuse via output parameters #185

eiennohito · 2021-11-30T09:19:33Z

Fixes #184

Implement out parameters for the Tokenizer.tokenize() and Morpheme.split() Python APIs.
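As a rough illustration of the out-parameter idea, the caller owns a result buffer and passes it in, so repeated calls refill the same allocation instead of creating a new list each time. This is only a sketch: `tokenize_into` and its whitespace splitting are hypothetical stand-ins, not Sudachi's API or its analysis logic.

```rust
/// Hypothetical whitespace tokenizer (NOT Sudachi's API).
/// Writes the byte range of every token into `out`, reusing its allocation.
fn tokenize_into(text: &str, out: &mut Vec<(usize, usize)>) {
    // `clear` drops the old elements but keeps the allocated capacity.
    out.clear();
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // A token ends at the first whitespace character after it.
            if let Some(s) = start.take() {
                out.push((s, i));
            }
        } else if start.is_none() {
            // First non-whitespace character starts a new token.
            start = Some(i);
        }
    }
    // Flush the last token if the text does not end with whitespace.
    if let Some(s) = start {
        out.push((s, text.len()));
    }
}
```

A caller would create one `Vec` up front and pass `&mut` to it on every call; a second call that yields fewer tokens than the first does not reallocate.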

For the memory sharing to be actually useful, I had to refactor the internal MorphemeList to allow multiple references to the input data while keeping distinct lists of morphemes. Let's welcome Arc<RefCell<X>> into our codebase.
The Python MorphemeListWrapper has also changed. As a side effect, custom pretokenizers no longer make a copy (win). The semantics of everything still need to be documented, but I will do that in a separate documentation pass.
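The shape of that refactoring can be sketched roughly as follows: several lists hold a reference-counted handle to one input while each owns its own morpheme ranges. The names `InputData`, `MorphemeList::new`, and `with_ranges` are hypothetical and do not mirror the real Sudachi types. The sketch uses `Rc` for simplicity; the same shape applies with `Arc`. The `RefCell` is what would let the input be refilled between analyses, which is not shown here.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for the shared analysis input (not Sudachi's type).
struct InputData {
    text: String,
}

// Hypothetical list: shares the input, owns its own byte ranges.
struct MorphemeList {
    input: Rc<RefCell<InputData>>,
    ranges: Vec<(usize, usize)>,
}

impl MorphemeList {
    fn new(input: Rc<RefCell<InputData>>, ranges: Vec<(usize, usize)>) -> Self {
        MorphemeList { input, ranges }
    }

    // Build another list over the same input without copying the text.
    fn with_ranges(&self, ranges: Vec<(usize, usize)>) -> MorphemeList {
        MorphemeList {
            input: Rc::clone(&self.input),
            ranges,
        }
    }

    // Copy out the surface of the morpheme at `idx` from the shared input.
    fn surface(&self, idx: usize) -> String {
        let (start, end) = self.ranges[idx];
        self.input.borrow().text[start..end].to_string()
    }
}
```

Here a split result would be created with `with_ranges`, so the original list and the split list stay distinct while the input text exists only once.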

mh-northlander

Arc<RefCell<T>> is not thread-safe: https://doc.rust-lang.org/std/sync/struct.Arc.html#thread-safety
We could use Mutex or RwLock instead of RefCell.
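For reference, the thread-safe alternative suggested here would look roughly like the sketch below, which uses only the standard library. `collect_from_threads` is a hypothetical helper for illustration and has nothing to do with the actual MorphemeList internals.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: each thread appends one word to a shared list.
fn collect_from_threads(words: Vec<String>) -> Vec<String> {
    // Arc gives shared ownership across threads; Mutex guards the mutation.
    let shared = Arc::new(Mutex::new(Vec::<String>::new()));
    let handles: Vec<_> = words
        .into_iter()
        .map(|w| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                shared.lock().unwrap().push(w);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Thread scheduling is nondeterministic, so sort for a stable result.
    let mut result = shared.lock().unwrap().clone();
    result.sort();
    result
}
```

Replacing `Mutex` with `RefCell` in this sketch would fail to compile, because `RefCell<T>` is not `Sync` and so `Arc<RefCell<T>>` cannot be moved into a spawned thread.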

mh-northlander · 2021-12-01T03:12:57Z

python/tests/test_morpheme.py

+        self.assertEqual(ms_a[0].surface(), '東京')
+        self.assertEqual(ms_a[1].surface(), '都')
+
+        ms = self.tokenizer_obj.tokenize("京都東京都京都", SplitMode.C)


I would like the second split check to use a different word from the one above.

The test binary dictionary has only 東京都 as a word with splits. I also wanted to check another word, but alas.
Moving all tests to a non-binary dictionary will fix this issue.

mh-northlander · 2021-12-01T03:35:01Z

sudachi/tests/stateful_tokenizer.rs

@@ -130,18 +130,19 @@ fn split_middle() {
    let ms = tok.tokenize("京都東京都京都");
    assert_eq!(ms.len(), 3);
    let m = ms.get(1);
-    assert_eq!(m.surface(), "東京都");
+    assert_eq!(m.surface().deref().deref(), "東京都");


double deref?
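One way a double deref can arise, sketched under the assumption that `surface()` returns a guard-like wrapper around a `String` (the diff does not show the actual return type): the first `deref` unwraps the guard to `&String`, and the second goes from `&String` to `&str`. The `Morpheme` type below is hypothetical.

```rust
use std::cell::{Ref, RefCell};
use std::ops::Deref;

// Hypothetical morpheme whose surface lives behind a RefCell.
struct Morpheme {
    surface: RefCell<String>,
}

impl Morpheme {
    // Returns a borrow guard rather than a plain &str (an assumption).
    fn surface(&self) -> Ref<'_, String> {
        self.surface.borrow()
    }
}

fn surface_equals(m: &Morpheme, expected: &str) -> bool {
    // First deref: Ref<String> -> &String. Second deref: &String -> &str.
    m.surface().deref().deref() == expected
}

fn surface_equals_short(m: &Morpheme, expected: &str) -> bool {
    // Same comparison; auto-deref through the guard reaches String::as_str.
    m.surface().as_str() == expected
}
```

Under that assumption, `as_str()` or `&*` gives the same result with less noise in test assertions.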

mh-northlander · 2021-12-01T03:42:41Z

sudachi/tests/stateless_tokenizer.rs

    let ms = tok.tokenize("京都", Mode::C);
    let ms: Vec<_> = ms.iter().collect();
    assert_eq!(1, ms.len());
-    let pos = ms[0].part_of_speech().expect("failed to get pos");
+    let pos = ms[0].part_of_speech();


We can now access nodes with ms.get(0) instead of ms[0] (same for the other test functions).

Tests need more cleanup, I agree.

eiennohito · 2021-12-01T07:39:47Z

Arc<RefCell<T>> by itself is not thread-safe, and that's OK. StatefulTokenizer and MorphemeList are not meant to be shared between threads. Additionally, for the Python bindings, all accesses to MorphemeList happen under the GIL and are therefore safe; for Rust, the compiler will reject invalid data sharing.
On second thought, it is OK to make the internals even Rc<RefCell<T>>.

eiennohito · 2021-12-01T09:07:12Z

@mh-northlander the comments should be addressed now.

implement analysis memory reuse via output parameters

e6b0058

eiennohito added the python Python binding-related label Nov 30, 2021

eiennohito added this to the 0.6.1 milestone Nov 30, 2021

eiennohito requested a review from mh-northlander November 30, 2021 09:19

mh-northlander requested changes Dec 1, 2021

eiennohito added 4 commits December 1, 2021 17:11

MorphemeList internals: Arc<RefCell> -> Rc<RefCell>

3ade682

PyMorphemeListWrapper now requires GIL for access

9b5582a

remove naked field usage

8f3199e

cleanup tests

9773f91

eiennohito mentioned this pull request Dec 2, 2021

0.6.1 documentation #186

Merged

mh-northlander approved these changes Dec 2, 2021

eiennohito merged commit 190bbdb into WorksApplications:develop Dec 2, 2021

eiennohito deleted the 184-memory-reuse branch December 2, 2021 02:34

eiennohito mentioned this pull request Dec 7, 2021

Move Morpheme.split functionality to another thing #92

Closed
