How does JamSpell correction work? #15
It uses an n-gram language model (word-based, 3-gram) and selects the candidate with the highest score. It is also optimized for speed (a modified SymSpell algorithm) and memory consumption (a bloom filter and a perfect hash). I will add a description to the README later. Here is an article (in Russian) with a detailed explanation: habrahabr.ru/post/346618
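To illustrate the scoring idea (a minimal sketch, not JamSpell's actual code; the class and function names here are hypothetical), a word-level 3-gram model can rank correction candidates by how well they fit their context:

```python
import math
from collections import defaultdict

# Minimal word-level 3-gram model for ranking correction candidates.
# A sketch only: a real model would use better smoothing, back-off, etc.
class TrigramLM:
    def __init__(self):
        self.trigrams = defaultdict(int)
        self.bigrams = defaultdict(int)

    def train(self, sentences):
        for words in sentences:
            padded = ["<s>", "<s>"] + list(words)
            for i in range(2, len(padded)):
                self.bigrams[(padded[i - 2], padded[i - 1])] += 1
                self.trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1

    def logprob(self, w1, w2, w3):
        # Add-one smoothing so unseen trigrams don't get -infinity.
        num = self.trigrams[(w1, w2, w3)] + 1
        den = self.bigrams[(w1, w2)] + 1
        return math.log(num / den)

def best_candidate(lm, left_context, candidates):
    # Pick the candidate that scores highest in its two-word left context.
    w1, w2 = left_context
    return max(candidates, key=lambda c: lm.logprob(w1, w2, c))

lm = TrigramLM()
lm.train([["i", "want", "to", "go", "home"]])
print(best_candidate(lm, ("want", "to"), ["go", "to"]))  # -> "go"
```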
Thanks, I'll have to read it. Closing this issue. P.S.: Your library is popular and the most recommended one for spell checking in the ODS #nlp channel community.
I want to try an LSTM in the future.
Hi, this is Wolf, the author of the original SymSpell algorithm. Unfortunately, my Russian is limited, so I used Google Translate to read your interesting habrahabr post. I hope I got it right:
While a bloom filter of deletes indeed takes much less space than storing the deletes and pointers to the original words, I believe that performance will be significantly reduced by performing the insertions. In your tests, JamSpell seems to be 3-4 times faster than Norvig. In my benchmark, the original SymSpell is 1000x faster than Norvig for a maximum edit distance of 2. By the way, the memory usage of recent SymSpell versions has been significantly reduced by prefix indexing.
Hi, thanks a lot for your feedback! In my implementation, the bottleneck now is generating sentences with candidates and getting language model predictions. Originally I was using Norvig's approach, but it was very slow, especially on long words. Your algorithm helped me improve performance.
For distance=2 and word length n, my approach gives an average O(n) solution, not O(n^2). I generate insertions at distance=1 (linear), and I go to level 2 only if my bloom filter says that such a delete exists there. So I don't see why performance should be significantly reduced.
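As an illustration of that pruning (a rough sketch under one reading of the comment above, not the library's code; `bloom_has` and `in_dictionary` are hypothetical callbacks):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def insertions(s):
    # All strings obtained by inserting one letter: (len(s) + 1) * 26 variants.
    return {s[:i] + ch + s[i:] for i in range(len(s) + 1) for ch in ALPHABET}

def candidates(delete, bloom_has, in_dictionary):
    # `delete` is a string derived from the input word by deleting characters.
    # `bloom_has(s)` asks the bloom filter whether s is a known delete of some
    # dictionary word; `in_dictionary(s)` checks real dictionary membership.
    found = set()
    for c1 in insertions(delete):        # level 1: linear in word length
        if in_dictionary(c1):
            found.add(c1)
        if bloom_has(c1):                # gate: expand to level 2 only when
            for c2 in insertions(c1):    # the filter says it can pay off
                if in_dictionary(c2):
                    found.add(c2)
    return found
```

Because the bloom filter rejects most level-1 strings, the inner loop runs rarely, which is what keeps the average cost near O(n).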
Agreed, it could be an issue for languages with lots of characters.
I don't think it's correct to compare your library with mine: your library doesn't consider a word's surroundings. Also, my benchmarks were performed in Python, which is rather slow. I would be glad to add SymSpell to my benchmarks too, but as far as I know, it doesn't have any Python bindings.
I will look at it. I tried to use a suffix tree; it reduced memory usage, but still required a lot. A bloom filter is much more compact.
In the best case (assuming there is at least one suggestion within MaxEditDistance), you would still have n*26 + (n+1)*26 insertions/dictionary lookups per delete. Those minimum 286 dictionary lookups per delete (n=5, distance=2) are one reason for a performance reduction. Even if not visible in Big O notation, constant factors may heavily influence performance. The second reason is that multiple level-1 deletes will probably exist, and you will have to iterate level 2 multiple times as well.
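For concreteness, that arithmetic works out as follows (a sketch assuming a 26-letter alphabet):

```python
def insertion_lookups(m, alphabet_size=26):
    # Inserting one character into a string of length m gives
    # (m + 1) * alphabet_size candidate strings, each needing a lookup.
    return (m + 1) * alphabet_size

n = 5
# Level 1 over a delete of length n - 1, plus one level-2 expansion of a
# length-n result: n*26 + (n+1)*26 lookups, as in the comment above.
total = insertion_lookups(n - 1) + insertion_lookups(n)
print(total)  # 130 + 156 = 286
```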
That's why I benchmarked a C# port of Norvig's algorithm against the C# implementation of SymSpell.
Can we use some neural language model instead of the n-gram language model for candidate suggestions?
If Hunspell suggests a list of words with minimal replacement cost, but only at the unigram level, does JamSpell then consider some N-gram model with a Markov chain, etc.? :)