Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement analysis memory reuse via output parameters #185

Merged
merged 5 commits into from
Dec 2, 2021

Conversation

eiennohito
Copy link
Collaborator

@eiennohito eiennohito commented Nov 30, 2021

Fixes #184

Implement out parameters for Tokenizer.tokenize() and Morpheme.split() Python API.

For the memory sharing to be actually useful, I had to refactor internal MorphemeList to allow multiple references to input data, while having distinct list of morphemes. Let's welcome Arc<RefCell<X>> in our codebase.
Python MorphemeListWrapper has also changed. As a side effect there is no copy in custom pretokenizers (win). Need to document semantics of everything, but will do that as a documentation pass.

@eiennohito eiennohito added the python Python binding-related label Nov 30, 2021
@eiennohito eiennohito added this to the 0.6.1 milestone Nov 30, 2021
Copy link
Collaborator

@mh-northlander mh-northlander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc<RefCell> is not thread safe: https://doc.rust-lang.org/std/sync/struct.Arc.html#thread-safety
we may use Mutex or RwLock instead of RefCell.

self.assertEqual(ms_a[0].surface(), '東京')
self.assertEqual(ms_a[1].surface(), '都')

ms = self.tokenizer_obj.tokenize("京都東京都京都", SplitMode.C)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to check the second split with different word from the above one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test binary dictionary has only 東京都 as a word with splits. I also wanted to check another word, but alas.
Moving all tests to using non-binary dictionary will fix this issue

@@ -130,18 +130,19 @@ fn split_middle() {
let ms = tok.tokenize("京都東京都京都");
assert_eq!(ms.len(), 3);
let m = ms.get(1);
assert_eq!(m.surface(), "東京都");
assert_eq!(m.surface().deref().deref(), "東京都");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double deref?

Comment on lines 36 to 39
let ms = tok.tokenize("京都", Mode::C);
let ms: Vec<_> = ms.iter().collect();
assert_eq!(1, ms.len());
let pos = ms[0].part_of_speech().expect("failed to get pos");
let pos = ms[0].part_of_speech();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can now access nodes by ms.get(0) instead of ms[0]. (same for other test functions)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests need more cleanup, I agree.

@eiennohito
Copy link
Collaborator Author

eiennohito commented Dec 1, 2021

Arc<RefCell<T>> by itself is not thread-safe and that's OK. StatefulTokenizer and MorphemeList are not for sharing between threads. Additonally, for the Python bindings, all accesses to MorphemeList happen under GIL and are OK, for Rust the compiler will check invalid data sharing.
After second thought, it is OK to make internals to be even Rc<RefCell<T>>.

@eiennohito
Copy link
Collaborator Author

@mh-northlander comments should be addressed

@eiennohito eiennohito mentioned this pull request Dec 2, 2021
@eiennohito eiennohito merged commit 190bbdb into WorksApplications:develop Dec 2, 2021
@eiennohito eiennohito deleted the 184-memory-reuse branch December 2, 2021 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Python binding-related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pyhon API: ability to reduce allocations by reusing objects
2 participants