levenshtein automata not matching Japanese Characters correctly #38
Comments
Thanks for the report. This does appear to be a bug. In the future, please provide a minimal runnable program that reproduces your issue. Here's one I came up with:

    extern crate fst;

    use fst::{Set, Levenshtein, IntoStreamer};

    fn main() {
        let k = "寿司は焦げられない";
        let keys = vec![k];
        let set = Set::from_iter(keys.iter()).unwrap();
        let lev = Levenshtein::new(k, 2).unwrap();
        let stream = set.search(lev).into_stream();
        let keys = stream.into_strs().unwrap();
        println!("{:?}", keys);
    }

I don't know the cause of it off the top of my head, but I fear this may be due to a bug in the Levenshtein automaton construction itself. If that's the case, I don't know when it will get fixed. |
Hi, I took a closer look at the issue and gained some insight, although there are some things I don't completely understand about the state data structures. The issue seems to occur when there are two characters with the same first byte.

    let keys = vec!["来探"];
    let lev = Levenshtein::new("来探", 1).unwrap();
    --> fails

    let keys = vec!["来食"];
    let lev = Levenshtein::new("来食", 1).unwrap();
    --> correct

    来 bytes [230, 157, 165]
    探 bytes [230, 142, 162]
    食 bytes [233, 163, 159]

I printed the states from the DFA, but I'm not sure how to read them. Are there multiple entry points, or does it always start at state 0? What does is_match mean? My understanding is that, following the machine from state 0, I should be able to complete the original byte sequences:

[Figure: "来探", failing state machine]
[Figure: "来食", correct state machine]

Btw, if you like the states representation, I can create a pull request for it. |
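The byte values quoted in the comment above can be checked with the standard library alone. This is a minimal sketch; `utf8_bytes` is a helper name made up for this illustration and is not part of the fst crate.

```rust
/// Returns the UTF-8 encoding of a single Unicode scalar value.
/// (Hypothetical helper for illustration; not part of the fst crate.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    for c in "来探食".chars() {
        println!("{} U+{:04X} {:?}", c, c as u32, utf8_bytes(c));
    }
    // 来 and 探 share the same lead byte, while 食 starts with a different one.
    assert_eq!(utf8_bytes('来')[0], utf8_bytes('探')[0]);
    assert_ne!(utf8_bytes('来')[0], utf8_bytes('食')[0]);
}
```

Each of these characters takes three bytes, so a byte-based automaton needs three transitions to consume one of them, and the first transition is shared between 来 and 探.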
@PSeitz I think you're on your own here. I haven't looked at that code in a long time, and I'd need to spend a couple of hours loading up my context to help you. (At which point, I would probably just try to go ahead and fix the bug.) You probably want some background reading: http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata |
Thanks for the links. I think I understood the basic concepts. These examples, however, all use characters in their states rather than bytes, which makes them a lot easier, because a new character is just one new step. I don't really understand how a multibyte character is translated into multiple steps here, and as far as I can tell this seems to be where the issue is. |
@PSeitz Yes, they are encoded as UTF-8 automata. I don't know any well documented place where they are explained, but you can see the |
Um, correct me if I'm wrong, but in cases like this, isn't it always better to compare individual bytes rather than "characters", since then it becomes a matter of comparing two numbers? |
@WarpspeedSCP I don't understand your question. Please show examples and tie it back to the specific implementation in this crate. |
What I meant was that comparing groups of bytes at a time as characters wouldn't be as easy as comparing bytes individually in the string to arrive at an answer. For example, you wouldn't just know which of 😎 and 😕 comes first, but if you looked at the bytes that represent them, it is possible to come to a conclusion. If one has fewer bytes it becomes immediately apparent, for example. But I don't know how utf8 works, so I am probably wrong. |
@WarpspeedSCP Sorry, but what you're saying doesn't make any sense to me. I suspect you aren't using the words "bytes" and "character" correctly, but I don't know for sure. Moreover, this talk about "which one coming first" doesn't make any sense to me; what does ordering have to do with this ticket? Please write out an example in more detail.

I'd also like to encourage you to carefully read the comments in this issue. Drive-by comments typically aren't helpful, and it honestly kind of feels like that's what your comments here are. For example, a couple comments up I linked to the docs for the

Making matters more complex is that regardless of the implementation strategy, the concept of character in the abstract sense still matters. In particular, while the Levenshtein automata in this crate are byte oriented, they measure differences between strings in terms of characters. The specific interpretation of character used in this crate is that of a single Unicode scalar value. This is intended as a less than ideal approximation (where a better approximation would be a Unicode grapheme cluster) because it is simpler to implement. |
What I meant by character was the set of bytes that represents a Unicode code point. For example, I now understand that your library works on the bytes of the string and not the characters. Say a state machine has been constructed for 来 and 探 is queried against it. The results wouldn't contain anything since there were no matches, right? The first byte would match, but the second would not, so the machine would report that there is no match. I wrote this after skimming the docs for this library, as well as the article, just once. Is this how the library works? |
This ticket isn't about equality. It's about building a Levenshtein automaton and matching keys that are within a certain edit distance of a query. The edit distance is measured in units of Unicode scalar values. The automaton is built such that its transitions are byte based.

This program emits no results, which is correct:

    extern crate fst;
    extern crate fst_levenshtein;

    use fst::{IntoStreamer, Set};
    use fst_levenshtein::Levenshtein;

    fn main() {
        let set = Set::from_iter(vec!["探"]).unwrap();
        let query = Levenshtein::new("来", 0).unwrap();
        let results = set.search(query).into_stream().into_strs().unwrap();
        println!("{:?}", results);
    }

This program also emits no results, but is incorrect. This is presumably the bug:

    extern crate fst;
    extern crate fst_levenshtein;

    use fst::{IntoStreamer, Set};
    use fst_levenshtein::Levenshtein;

    fn main() {
        let set = Set::from_iter(vec!["探"]).unwrap();
        let query = Levenshtein::new("来", 1).unwrap();
        let results = set.search(query).into_stream().into_strs().unwrap();
        println!("{:?}", results);
    }

|
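To see what the two programs above should return, the expected results can be computed by brute force without any automaton. The following is a sketch using the textbook dynamic-programming edit distance, not the crate's implementation; the names `levenshtein`, `char_distance`, `byte_distance`, and `brute_force_search` are invented for this illustration.

```rust
/// Textbook dynamic-programming Levenshtein distance over any comparable items.
/// (Illustrative reference only; the crate builds an automaton instead.)
fn levenshtein<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    // prev[j] is the distance between the first i items of `a`
    // and the first j items of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, x) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, y) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(x != y);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Distance measured in Unicode scalar values, the unit described above.
fn char_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    levenshtein(&a, &b)
}

/// Distance measured in raw UTF-8 bytes, shown only for contrast.
fn byte_distance(a: &str, b: &str) -> usize {
    levenshtein(a.as_bytes(), b.as_bytes())
}

/// Keeps the keys whose scalar-value distance to `query` is at most `max`.
fn brute_force_search<'a>(keys: &[&'a str], query: &str, max: usize) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|k| char_distance(k, query) <= max)
        .collect()
}

fn main() {
    let keys = ["探"];
    println!("max 0: {:?}", brute_force_search(&keys, "来", 0));
    println!("max 1: {:?}", brute_force_search(&keys, "来", 1));
    println!("byte distance: {}", byte_distance("来", "探"));
}
```

Measured in scalar values, 来 and 探 are one substitution apart, so a query with distance 1 should return 探. Measured in bytes they are two substitutions apart, which is why the unit of measurement matters even though the automaton's transitions are byte based.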
Thank you, now I understand why this bug is occurring. In the second case, the search returns only the first byte |
It returns nothing. Returning something that is invalid UTF-8 would be an egregious bug. |
Ok, so it either outputs a valid answer or nothing. |
I had another look at the issue, and the problem is the overwrite, where the old branch gets orphaned. That's why this appears only with characters that share the same first byte. This also fits the initial observation, where the next state is overwritten from 230 -> 142 to 230 -> 157.

fst/fst-levenshtein/src/lib.rs Lines 291 to 303 in 91ab58a

So I tried a simple merge of the old state's nexts into the new state's nexts. This fixes the problem in this case, but I think it is generally not allowed, because it would accept too much? Also, some cases would require doing this recursively, which is not feasible performance-wise.

    fn add_utf8_range(
        &mut self,
        overwrite: bool,
        from: usize,
        to: usize,
        range: &Utf8Range,
    ) {
        for b in range.start as usize..range.end as usize + 1 {
            if overwrite || self.dfa.states[from].next[b].is_none() {
                if let Some(state) = self.dfa.states[from].next[b] {
                    // Merge the existing state's transitions into the new state.
                    for i in 0..256 {
                        if let Some(si) = self.dfa.states[state].next[i] {
                            self.dfa.states[to].next[i] = Some(si);
                        }
                    }
                }
                self.dfa.states[from].next[b] = Some(to);
            }
        }
    }

|
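The orphaning described above can be reproduced in a toy byte-transition table. This is a deliberately simplified model and not the crate's DFA builder: `ToyDfa`, `add_path`, and `build` are hypothetical names, and the states here only spell out literal byte sequences rather than tracking edit distance.

```rust
/// One state of a toy byte-level automaton (not the crate's real type).
struct State {
    next: [Option<usize>; 256],
    is_match: bool,
}

struct ToyDfa {
    states: Vec<State>,
}

impl ToyDfa {
    fn new() -> Self {
        ToyDfa { states: vec![State { next: [None; 256], is_match: false }] }
    }

    fn add_state(&mut self) -> usize {
        self.states.push(State { next: [None; 256], is_match: false });
        self.states.len() - 1
    }

    /// Adds a path for `bytes` starting at state 0. With `overwrite`, every
    /// step allocates a fresh state and replaces any existing transition;
    /// otherwise an existing transition is followed and reused.
    fn add_path(&mut self, bytes: &[u8], overwrite: bool) {
        let mut from = 0;
        for &b in bytes {
            let existing = self.states[from].next[b as usize];
            let to = match existing {
                Some(state) if !overwrite => state,
                _ => {
                    let fresh = self.add_state();
                    // Any previous target of this transition is now unreachable.
                    self.states[from].next[b as usize] = Some(fresh);
                    fresh
                }
            };
            from = to;
        }
        self.states[from].is_match = true;
    }

    fn accepts(&self, bytes: &[u8]) -> bool {
        let mut state = 0;
        for &b in bytes {
            match self.states[state].next[b as usize] {
                Some(next) => state = next,
                None => return false,
            }
        }
        self.states[state].is_match
    }
}

/// Inserts 探 and then 来, which share the lead byte 230.
fn build(overwrite: bool) -> ToyDfa {
    let mut dfa = ToyDfa::new();
    dfa.add_path("探".as_bytes(), overwrite);
    dfa.add_path("来".as_bytes(), overwrite);
    dfa
}

fn main() {
    for overwrite in [true, false] {
        let dfa = build(overwrite);
        println!(
            "overwrite={}: 探 accepted={}, 来 accepted={}",
            overwrite,
            dfa.accepts("探".as_bytes()),
            dfa.accepts("来".as_bytes())
        );
    }
}
```

With overwriting, the transition on byte 230 is redirected to the new branch, so the branch for 探 can no longer be reached. Reusing the existing state keeps both paths in this toy, but whether reuse or merging is sound in the real construction depends on what the crate's states represent, which is exactly the open question raised in the comment.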
Hi,
awesome library and documentation. The explanations are really nice to read.
I have an example that should probably match, but doesn't. I guess it has something to do with the utf8-ranges matching. (Btw: this means sushi can't be burned :)