levenshtein automata not matching Japanese Characters correctly #38

PSeitz · 2017-03-19T20:06:22Z

Hi,
awesome library and documentation. The explanations are really nice to read.

I have an example that should probably match but, doesn't. I guess it has to do something with the utf8-ranges matching. (Btw: This means sushi can't be burned :)

let keys = vec!["寿司は焦げられない"];
let set = try!(Set::from_iter(keys.iter()));

let lev = try!(Levenshtein::new("寿司は焦げられない", 2));
let stream = set.search(lev).into_stream();
let keys = try!(stream.into_strs());

BurntSushi · 2017-03-19T20:13:55Z

Thanks for the report. This does appear to be a bug.

In the future, please provide a minimal runnable program that reproduces your issue. Here's one I came up with:

extern crate fst;

use fst::{Set, Levenshtein, IntoStreamer};

fn main() {
    let k = "寿司は焦げられない";
    let keys = vec![k];
    let set = Set::from_iter(keys.iter()).unwrap();

    let lev = Levenshtein::new(k, 2).unwrap();
    let stream = set.search(lev).into_stream();
    let keys = stream.into_strs().unwrap();
    println!("{:?}", keys);
}

I don't know the cause of it off the top of my head, but I fear this may be due to a bug in the Levenshtein automaton construction itself. If that's the case, I don't know when it will get fixed.

PSeitz · 2017-04-19T09:25:05Z

Hi,

I took a closer look on the issue with some insight, although there are some things I don't completely understand with regard to the states data-structures.

The issue seems to occur when there are two characters with same first byte.
When we alter the orignal example to these:

let keys = vec!["来探"];
let lev = Levenshtein::new("来探", 1).unwrap();
--> fails

let keys = vec!["来食"];
let lev = Levenshtein::new("来食", 1).unwrap();
--> correct

来 bytes [230, 157, 165]
探 bytes [230, 142, 162]
食 bytes [233, 163, 159]

I printed the states from the dfa, but I'm not sure how to read them, are there multiple entry points or does it always start at state 0? What does is_match mean?

Although my understanding is, that following the maching from state 0, I should be able to complete the original byte sequences:

来探 -> 230, 157, 165, 230, 142, 162
来食 -> 230, 157, 165, 233, 163, 159

"来探" Failing state machine
This is the only path from state 0, and the first 142 should be 157 ?
230 -> 142 -> 162 -> 230 -> 142 -> 162
Also, there are states which are not reachable from state 0.

"来食" Correct state machine
There are multiple paths from state 0 and the original byte sequence is part of it.
230 -> 157 -> 165 -> 233 -> 163 -> 159

Btw, if you like the states representation, I can create a pull request for it.

BurntSushi · 2017-04-19T10:42:40Z

@PSeitz I think you're on your own here. I haven't looked at that code in a long time, and I'd need to spend a couple of hours loading up my context to help you. (At which point, I would probably just try to go ahead and fix the bug.)

You probably want some background reading:

http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

http://fulmicoton.com/posts/levenshtein/

PSeitz · 2017-07-24T17:12:33Z

Thanks for the links. I think I understood the basic concepts. Theses examples although all use characters in their states and not bytes, which makes them a lot easier, because a new character is just one new step.

I don't really understand how a multibyte character is transferred to multiple steps here, and as far as I could tell this seems to be where the issue is.

BurntSushi · 2017-07-24T17:14:53Z

@PSeitz Yes, they are encoded as UTF-8 automata. I don't know any well documented place where they are explained, but you can see the utf8-ranges crate docs for a little bit of explanation. (This crate is used in the implementation here.)

WarpspeedSCP · 2017-12-18T18:29:52Z

Um, correct me if I'm wrong, but in cases like this, isn't it always better to compare individual bytes rather than "characters", since then it becomes a matter of comparing two numbers?

BurntSushi · 2017-12-18T18:30:52Z

@WarpspeedSCP I don't understand your question. Please show examples and tie it back to the specific implementation in this crate.

WarpspeedSCP · 2017-12-18T19:14:04Z

What I meant was that comparing groups of bytes at a time as characters wouldn't be as easy as comparing bytes individually in the string to arrive at an answer. For example, you wouldn't just know which of 😎 and 😕 comes first, but if you looked at the bytes that represent them, it is possible to come to a conclusion. If one has fewer bytes it becomes immediately apparent, for example. But I don't know how utf8 works, so I am probably wrong.

BurntSushi · 2017-12-18T19:19:36Z

@WarpspeedSCP Sorry, but what you're saying doesn't make any sense to me. I suspect you aren't using the words "bytes" and "character" correctly, but I don't know for sure. Moreover, this talk about "which one coming first" doesn't make any sense to me; what does ordering have to do with this ticket? Please write out an example in more detail.

I'd also like to encourage you to carefully read the comments in this issue. Drive by comments typically aren't helpful, and it honestly kind of feels like that's what your comments here are. For example, a couple comments up I linked to the docs for the utf8-ranges crate as a way of explaining how the UTF-8 automata are constructed. If you read those docs, then the fact that the implementation is in fact byte based should be very clear. And if isn't, then something is wrong with the docs and I'd appreciate it if you pointed out those problems to me.

Making matters more complex is that regardless of the implementation strategy, the concept of character in the abstract sense still matters. In particular, while the Levenshtein automata in this crate are byte oriented, they measure differences between strings in terms of characters. The specific interpretation of character used in this crate is that of a single Unicode scalar value. This is intended as a less than ideal approximation (where a better approximation would be a Unicode grapheme cluster) because it is simpler to implement.

WarpspeedSCP · 2017-12-19T02:35:03Z

What I meant by character, was the set of bytes that represent a Unicode code point.

For example,
来 is represented by the set of bytes [230, 157, 165], and
探 is represented by the set of bytes [230, 142, 162], right?

I now understand that your library works on the bytes of the string and not the characters.
So at that time, I had thought that just comparing the bytes directly at each state would be enough to say whether or not two strings were equal. If I were to do this with your library, would this be right?-

A state machine has been constructed for 来 and 探 is queried from it.-

let machine = fst::Levenshtein::new("来", 0).unwrap();
let query = vec!["探"];
let set = fst::Set::from_iter(query.iter()).unwrap();
let stream = set.search(machine).into_stream();
let results = stream.into_strs().unwrap();

Now results wouldn't contain anything since there were no matches, right?

The first byte would match, but the second would not, so the machine would output that there is no match. This was after I had skimmed the docs for this library, as well as the article just once. Is this how the library works?

BurntSushi · 2017-12-19T12:19:58Z

comparing the bytes directly at each state would be enough to say whether or not two strings were equal

This ticket isn't about equality. It's about building a Levenshtein automaton and matching keys that are within a certain edit distance of a query. The edit distance is measured in units of Unicode scalar values. The automaton is built such that its transitions are byte based.

This program emits no results, which is correct:

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    let set = Set::from_iter(vec!["探"]).unwrap();
    let query = Levenshtein::new("来", 0).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}

This program also emits no results, but is incorrect. This is presumably the bug:

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    let set = Set::from_iter(vec!["探"]).unwrap();
    let query = Levenshtein::new("来", 1).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}

WarpspeedSCP · 2017-12-20T12:05:02Z

Thank you, now I understand why this bug is occurring. In the second case, the search returns only the first byte [230], which is not a valid character, and won't print, right?

BurntSushi · 2017-12-20T12:18:53Z

In the second case, the search returns only the first byte [230], which is not a valid character, and won't print, right?

It returns nothing. Returning something that is invalid UTF-8 would be an egregious bug.

WarpspeedSCP · 2017-12-20T12:25:50Z

Ok, so it either outputs a valid answer or nothing.

PSeitz · 2017-12-28T17:09:25Z

I had another look on the issue, and the problem is the overwrite, where the old branch gets orphaned. That's the reason, why this appears only with same byte characters

This also fits the initial observation, where next state is overwritten from 230 -> 142 to 230 -> 157

fst/fst-levenshtein/src/lib.rs

Lines 291 to 303 in 91ab58a

    
           fn add_utf8_range( 
        
               &mut self, 
        
               overwrite: bool, 
        
               from: usize, 
        
               to: usize, 
        
               range: &Utf8Range, 
        
           ) { 
        
               for b in range.start as usize..range.end as usize + 1 { 
        
                   if overwrite || self.dfa.states[from].next[b].is_none() { 
        
                       self.dfa.states[from].next[b] = Some(to); 
        
                   } 
        
               } 
        
           }

So I tried with a simple merge of the old states nexts into the new state nexts. This fixes the problem in this case, but I think is generally not allowed, because we allow to much?

Also some cases would require to do this recursively, and that is not feasible performance wise.

    fn add_utf8_range(
        &mut self,
        overwrite: bool,
        from: usize,
        to: usize,
        range: &Utf8Range,
    ) {
        for b in range.start as usize..range.end as usize + 1 {
            if overwrite || self.dfa.states[from].next[b].is_none() {
                if let Some(state) = self.dfa.states[from].next[b] {
                    //merge existing into new state
                    for i in 0..256 {
                        if let Some(si) = self.dfa.states[state].next[i] {
                            self.dfa.states[to].next[i] = Some(si);
                        }
                    }
                }
                self.dfa.states[from].next[b] = Some(to);
            }
        }
    }

BurntSushi added the bug label Mar 19, 2017

fulmicoton mentioned this issue Nov 17, 2017

Update the levenshtein construction algorithm #47

Closed

BurntSushi changed the title ~~No Hit for Japanese Characters~~ levenshtein automata not matching Japanese Characters correctly Feb 21, 2020

ghost mentioned this issue Jun 14, 2023

Arabic support #160

Open

PSeitz mentioned this issue Oct 7, 2024

Use FST as dictionary? elijah-potter/harper#138

Open

grantlemons mentioned this issue Nov 2, 2024

feat: FST-based Curated Dictionary Spellchecking elijah-potter/harper#258

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

levenshtein automata not matching Japanese Characters correctly #38

levenshtein automata not matching Japanese Characters correctly #38

PSeitz commented Mar 19, 2017

BurntSushi commented Mar 19, 2017

PSeitz commented Apr 19, 2017 •

edited

Loading

BurntSushi commented Apr 19, 2017

PSeitz commented Jul 24, 2017

BurntSushi commented Jul 24, 2017

WarpspeedSCP commented Dec 18, 2017

BurntSushi commented Dec 18, 2017

WarpspeedSCP commented Dec 18, 2017 •

edited

Loading

BurntSushi commented Dec 18, 2017 •

edited

Loading

WarpspeedSCP commented Dec 19, 2017 •

edited

Loading

BurntSushi commented Dec 19, 2017 •

edited

Loading

WarpspeedSCP commented Dec 20, 2017 •

edited

Loading

BurntSushi commented Dec 20, 2017

WarpspeedSCP commented Dec 20, 2017

PSeitz commented Dec 28, 2017

levenshtein automata not matching Japanese Characters correctly #38

levenshtein automata not matching Japanese Characters correctly #38

Comments

PSeitz commented Mar 19, 2017

BurntSushi commented Mar 19, 2017

PSeitz commented Apr 19, 2017 • edited Loading

BurntSushi commented Apr 19, 2017

PSeitz commented Jul 24, 2017

BurntSushi commented Jul 24, 2017

WarpspeedSCP commented Dec 18, 2017

BurntSushi commented Dec 18, 2017

WarpspeedSCP commented Dec 18, 2017 • edited Loading

BurntSushi commented Dec 18, 2017 • edited Loading

WarpspeedSCP commented Dec 19, 2017 • edited Loading

BurntSushi commented Dec 19, 2017 • edited Loading

WarpspeedSCP commented Dec 20, 2017 • edited Loading

BurntSushi commented Dec 20, 2017

WarpspeedSCP commented Dec 20, 2017

PSeitz commented Dec 28, 2017

PSeitz commented Apr 19, 2017 •

edited

Loading

WarpspeedSCP commented Dec 18, 2017 •

edited

Loading

BurntSushi commented Dec 18, 2017 •

edited

Loading

WarpspeedSCP commented Dec 19, 2017 •

edited

Loading

BurntSushi commented Dec 19, 2017 •

edited

Loading

WarpspeedSCP commented Dec 20, 2017 •

edited

Loading