implement analysis memory reuse via output parameters #185
`Arc<RefCell>` is not thread safe: https://doc.rust-lang.org/std/sync/struct.Arc.html#thread-safety
We may use `Mutex` or `RwLock` instead of `RefCell`.
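To make the suggestion concrete, here is a minimal sketch of the `RwLock` variant. `InputData` and `read_then_write` are hypothetical names used only for illustration, not types or functions from this PR. The point is that `Arc<RwLock<T>>` can be cloned into other threads, whereas the same code with `Arc<RefCell<T>>` is rejected by the compiler because `RefCell` is not `Sync`.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical stand-in for the shared input; not a type from this PR.
struct InputData {
    text: String,
}

// Reads the shared text from several threads, then appends to it.
// Returns (sum of character counts seen by the readers, final character count).
fn read_then_write(reader_threads: usize) -> (usize, usize) {
    let shared = Arc::new(RwLock::new(InputData {
        text: String::from("京都東京都京都"),
    }));

    // Arc<RwLock<T>> is Send + Sync when T is, so clones may move into threads.
    let handles: Vec<thread::JoinHandle<usize>> = (0..reader_threads)
        .map(|_| {
            let data = Arc::clone(&shared);
            thread::spawn(move || {
                // Shared (read) lock; released at the end of this statement.
                let n = data.read().unwrap().text.chars().count();
                n
            })
        })
        .collect();

    // Wait for all readers before mutating, so the result is deterministic.
    let seen: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();

    // Exclusive (write) lock for the mutation.
    shared.write().unwrap().text.push('府');
    let final_len = shared.read().unwrap().text.chars().count();
    (seen, final_len)
}
```

The trade-off is that every access now goes through a lock, so `RefCell` stays cheaper if the list never crosses a thread boundary.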
self.assertEqual(ms_a[0].surface(), '東京')
self.assertEqual(ms_a[1].surface(), '都')

ms = self.tokenizer_obj.tokenize("京都東京都京都", SplitMode.C)
I want to check the second split with a different word from the one above.
The binary test dictionary has only 東京都 as a word with splits. I also wanted to check another word, but alas.
Moving all tests to a non-binary dictionary will fix this issue.
sudachi/tests/stateful_tokenizer.rs
Outdated
@@ -130,18 +130,19 @@ fn split_middle() {
let ms = tok.tokenize("京都東京都京都");
assert_eq!(ms.len(), 3);
let m = ms.get(1);
assert_eq!(m.surface(), "東京都");
assert_eq!(m.surface().deref().deref(), "東京都");
double deref?
sudachi/tests/stateless_tokenizer.rs
Outdated
let ms = tok.tokenize("京都", Mode::C);
let ms: Vec<_> = ms.iter().collect();
assert_eq!(1, ms.len());
let pos = ms[0].part_of_speech().expect("failed to get pos");
let pos = ms[0].part_of_speech();
We can now access nodes by `ms.get(0)` instead of `ms[0]` (same for other test functions).
Tests need more cleanup, I agree.
@mh-northlander comments should be addressed
Fixes #184
Implement out parameters for the `Tokenizer.tokenize()` and `Morpheme.split()` Python API.

For the memory sharing to be actually useful, I had to refactor the internal MorphemeList to allow multiple references to the input data while each list keeps its own distinct list of morphemes. Let's welcome `Arc<RefCell<X>>` in our codebase.

The Python `MorphemeListWrapper` has also changed. As a side effect there is no copy in custom pretokenizers (win). Need to document the semantics of everything, but will do that as a documentation pass.
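For readers unfamiliar with the pattern, the following is a rough sketch of the idea described above, not the actual implementation in this PR. `Input`, the simplified `MorphemeList`, `empty_sharing`, and `tokenize_into` are hypothetical names, and the whitespace "tokenizer" is a toy stand-in for real analysis. It shows two things: an out parameter whose buffers are reused across calls, and a second list that shares the same input through `Arc<RefCell<_>>` while holding its own morphemes.

```rust
use std::cell::RefCell;
use std::sync::Arc;

// Hypothetical stand-ins; the real types in this PR differ.
struct Input {
    text: String,
}

struct MorphemeList {
    input: Arc<RefCell<Input>>, // shared between lists
    spans: Vec<(usize, usize)>, // owned by this list: byte ranges into input.text
}

impl MorphemeList {
    fn new() -> Self {
        MorphemeList {
            input: Arc::new(RefCell::new(Input { text: String::new() })),
            spans: Vec::new(),
        }
    }

    // A new, empty list that refers to the same input as `other`.
    fn empty_sharing(other: &MorphemeList) -> Self {
        MorphemeList {
            input: Arc::clone(&other.input),
            spans: Vec::new(),
        }
    }

    fn surfaces(&self) -> Vec<String> {
        let input = self.input.borrow();
        self.spans
            .iter()
            .map(|&(s, e)| input.text[s..e].to_string())
            .collect()
    }
}

// Toy analysis: splits on whitespace and writes the result into `out`,
// clearing and reusing its existing allocations instead of building a new list.
fn tokenize_into(text: &str, out: &mut MorphemeList) {
    out.spans.clear();
    {
        let mut input = out.input.borrow_mut();
        input.text.clear();
        input.text.push_str(text);
    }
    let mut start = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            if let Some(s) = start.take() {
                out.spans.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        out.spans.push((s, text.len()));
    }
}
```

Calling `tokenize_into` repeatedly with the same `MorphemeList` keeps the `String` and `Vec` capacity from earlier calls, which is where the memory reuse comes from. A list created with `empty_sharing` sees the same text without copying it.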