Hunspell backend provided by node-spellchecker does not respect other dictionary encoding but UTF-8 #50

slodki · 2017-08-19T16:55:16Z

Please compare:

with Hunspell output:

$ echo "żólty jerz żółty jest śliczny i biały" | hunspell 
Hunspell 1.4.0
& żólty 1 0: żółty
& jerz 14 6: jeż, jesz, rejz, jer, jerze, jerez, jera, jarz, jery, perz, jedz, jeru, Perz, jer z
*
*
*
*
*

$ echo $LANG
pl_PL.UTF-8

Why is ż in jeż an unknown character? Maybe an encoding problem?

bartosz-antosik · 2017-08-19T19:23:28Z

I am far away from any Ubuntu machine. Could you please:

Ensure that the dictionaries you have linked are UTF-8?
Try to download the dictionaries according to the initial recipe ("Dictionaries" folder)?

pjssilva · 2017-08-19T19:25:06Z

I had this problem with Portuguese when I was not using a UTF-8 dictionary. Try the dictionaries from

https://github.com/titoBouzout/Dictionaries

slodki · 2017-08-19T19:54:11Z

$ grep SET *.aff
en_AU.aff:SET UTF-8
en_CA.aff:SET UTF-8
en_GB.aff:SET UTF-8
en_US.aff:SET ISO8859-1
en_ZA.aff:SET UTF-8
pl_PL.aff:SET ISO8859-2

Why is this flag in the file ignored? You should use the correct encoding when reading the dictionary.

I don't want to keep a copy of the dictionary for each app. This dictionary comes from the Ubuntu distribution and works without problems in other apps.
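For illustration, the encoding declared by the SET line in the grep output above can be read with a few lines of C++. This is only a sketch: the function name read_aff_encoding is hypothetical, it is not how Hunspell or node-spellchecker parse the file, and it assumes the first SET line is the one that matters.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the encoding named by the first
// "SET <encoding>" line of an .aff stream, or "" if there is none.
std::string read_aff_encoding(std::istream& aff) {
    std::string line;
    bool first_line = true;
    while (std::getline(aff, line)) {
        // Skip a UTF-8 byte order mark if the file starts with one.
        if (first_line && line.compare(0, 3, "\xEF\xBB\xBF") == 0) {
            line.erase(0, 3);
        }
        first_line = false;

        // Split the line on whitespace; a trailing '\r' counts as whitespace.
        std::istringstream fields(line);
        std::string keyword, encoding;
        if (fields >> keyword >> encoding && keyword == "SET") {
            return encoding;
        }
    }
    return "";
}
```

Passing a std::ifstream opened on pl_PL.aff from the listing above would be expected to yield ISO8859-2.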

bartosz-antosik · 2017-08-19T20:00:23Z

Spell Right does not use a preexisting Hunspell installation. I am using a module which is UTF-8 dependent and cannot use these flags. It may work the other way around: maybe you can switch this flag and replace the dictionary file in Ubuntu.

slodki · 2017-08-19T20:22:42Z

Small memory & CPU usage footprint - uses offline, OS native spell checking service whenever possible: Windows Spell Checking API (windows 8/10) or Hunspell (windows 7, macOS, Linux).

This is not true. On Linux it uses neither an OS-native service nor the system Hunspell.

And loading and using incompatible files without any check or warning is a bug in this extension.

bartosz-antosik · 2017-08-19T20:43:35Z

I understand. I will look into whether this can be resolved in a better way.

slodki · 2017-08-20T09:58:48Z

node-spellchecker calls new Hunspell(affixpath.c_str(), dpath.c_str()), which internally creates new AffixMgr(affpath, pHMgr, &maxdic, key), and that parses the SET line from the .aff file.

The encoding is used only for lower/uppercase conversion.

Converting text to the dictionary's encoding is left to the application.

slodki · 2017-08-20T10:25:38Z

node-spellchecker always sends text (which it expects to be a wide string) to the Hunspell library as UTF-8, ignoring the dictionary encoding.
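One plausible reading of the "unknown character" symptom, offered as an assumption rather than a confirmed diagnosis: a suggestion coming back from an ISO8859-2 dictionary stores ż as the single byte BF, and that byte sequence is not valid UTF-8, so a UTF-8 consumer cannot display it. The sketch below (the name is_valid_utf8 is hypothetical, and it checks only the byte structure, not overlong forms or surrogates) shows the difference.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (lead byte followed by the
// right number of continuation bytes). Overlong forms are not rejected.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80) extra = 0;                        // ASCII
        else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;  // 2-byte sequence
        else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;  // 3-byte sequence
        else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;  // stray continuation byte or invalid lead byte

        if (i + extra >= s.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}

// "jeż" as UTF-8 (ż = C5 BC) and as ISO8859-2 (ż = BF).
const std::string kJezUtf8 = "je\xC5\xBC";
const std::string kJezLatin2 = "je\xBF";
```

Here is_valid_utf8(kJezUtf8) holds while is_valid_utf8(kJezLatin2) does not, which matches the picture of UTF-8 text and an ISO8859-2 dictionary being mixed without conversion.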

bartosz-antosik · 2017-08-20T10:33:07Z

Yes, I saw the part that apparently reads the SET line some time ago. And node-spellchecker, as far as I can tell, uses exactly the same Hunspell source code that is compiled into the CLI version distributed with most Linux distros and that is used by Mozilla & LibreOffice (that is, BTW, what I meant by 'native', because otherwise, what is native for Linux?).

From some hints (e.g. from your CLI example) I deduce that it CAN run with UTF-8 on the front and the native dictionary encoding at the back. The problem is not trivial, however, as it has not been solved for Atom so far. Still, I consider it the best back-end module for spelling in VSCode, which I elaborated on in #20266 a while ago.

If you could dig into this a bit, it would be a great help! Thank you!

bartosz-antosik · 2017-08-20T10:41:14Z

(To the previous comment) I think this is exactly how it should be: Hunspell is queried in UTF-8 on the front, does the conversion internally, and responds in UTF-8 with an acknowledgement & suggestions.

bartosz-antosik · 2017-08-20T10:46:49Z

Oh, and I am sorry that my comment above was misleading because I simplified things. I knew about this UTF-8 requirement (hence my first comment, which in a way solves the issue), and that is why I stated that the module 'does not use the flags', which is partially untrue, as you have discovered on your own.

slodki · 2017-08-20T11:13:21Z

If you want to compile patched node-spellchecker code, it should be easy: surround each Hunspell call in spellchecker_hunspell.cc with

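// Pseudocode: size_t marks a length argument that still has to be filled in,
// and each "or" separates alternative Hunspell calls.
if (strcmp(hunspell->get_dict_encoding().c_str(), vscode->current_file_encoding) != 0) {
    // Convert the word from the file encoding to the dictionary encoding.
    toDict=iconv_open(hunspell->get_dict_encoding().c_str(), vscode->current_file_encoding);
    iconv(toDict,word,size_t,tmp_word,size_t);
    hunspell->spell(tmp_word.c_str());
    // or
    hunspell->add(tmp_word.c_str());
    // or
    hunspell->suggest(&slist_tmp, tmp_word.c_str());
    // Convert each suggestion back to the file encoding.
    fromDict=iconv_open(vscode->current_file_encoding, hunspell->get_dict_encoding().c_str());
    iconv(fromDict,slist_tmp[i],size_t,slist[i],size_t);
} else {
    // Encodings match, so call Hunspell directly.
    hunspell->spell(word.c_str());
    // or
    hunspell->add(word.c_str());
    // or
    hunspell->suggest(&slist, word.c_str());
}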

You can borrow chenc from the hunspell tool.
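As a rough illustration of the conversion step, here is a minimal, self-contained sketch built on POSIX iconv. The function name convert_encoding is hypothetical and this is not the chenc routine from the hunspell tool. It assumes a platform whose iconv takes char** input (as glibc does) and a stateless target encoding such as ISO8859-2.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Throws if the conversion is unsupported or the input cannot be converted.
std::string convert_encoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("unsupported conversion");
    }

    std::string output;
    char buffer[256];
    char* in = const_cast<char*>(input.data());  // iconv does not modify the bytes
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        const int err = errno;
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the buffer filled up, so loop again; anything
        // else (invalid or unconvertible input) is a real failure.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("conversion failed");
        }
    }
    iconv_close(cd);
    return output;
}
```

With this helper, convert_encoding(word, "UTF-8", "ISO-8859-2") would produce the bytes to hand to Hunspell, and the reverse call would turn suggestions back into UTF-8. On systems where iconv is a separate library, linking with -liconv may be needed.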

bartosz-antosik · 2017-08-20T11:51:48Z

Right now the bigger mystery to me is how to get vscode->current_file_encoding there, rather than where to get chenc...

Karuso33 · 2017-08-20T13:36:53Z

I had this problem too when I used "system dictionaries" (on Ubuntu 16.04) by symlinking them to /usr/share/hunspell (as you described in the readme), as those dictionaries were not in UTF-8. Maybe just add a short warning to the readme that the dictionary files have to be encoded in UTF-8...

bartosz-antosik · 2017-08-20T19:46:27Z

@Karuso33: That's exactly what I did in the latest release (1.1.16) a few hours ago. Thanks.

Karuso33 · 2017-08-20T19:51:53Z

@bartosz-antosik Oh, my bad.

bartosz-antosik · 2017-08-20T19:53:02Z

@Karuso33: On the contrary! Thank you for supporting this idea!

bartosz-antosik · 2017-08-20T19:56:28Z

@Karuso33: I think I will keep the thread open to try to verify whether it is possible to fix the situation.

P.S. As it seems you are using Spell Right on Linux, could you maybe comment on #51? Sorry for asking, but Linux support is pretty new and I have been away from an Ubuntu machine for some time; besides, I do not use it on a regular basis, so I would like to know whether it has this issue and on what scale.

bartosz-antosik · 2017-08-27T16:57:59Z

I have examined the solution suggested by @slodki a few posts above, and in the shape proposed it has serious drawbacks: it can only work on Linux (plus it is just a suggestion and does not compile as is, etc.), whereas Hunspell is also used on Windows 7, and there is no iconv in the typical node-gyp toolset. More code would have to be written to support this conversion on Windows as well. I would rather stay with the requirement for UTF-8 dictionaries for now, as I cannot spend that much time developing this solution.

I would of course welcome any solution/help that could resolve this inconvenience.

slodki · 2017-08-27T17:03:02Z

OK.

But what about reading the dictionary encoding from node-spellchecker and displaying a warning to the user when it is not UTF-8?
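A check along these lines could look like the sketch below. It assumes the encoding name is already available as a string, for example from the get_dict_encoding() call used in the snippet earlier in this thread; whether node-spellchecker exposes that value is not established here. Both function names are hypothetical.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true for "UTF-8", "utf8", "Utf_8" and similar
// spellings. Punctuation and letter case are ignored.
bool is_utf8_encoding_name(const std::string& name) {
    std::string normalized;
    for (unsigned char c : name) {
        if (std::isalnum(c)) {
            normalized += static_cast<char>(std::toupper(c));
        }
    }
    return normalized == "UTF8";
}

// Hypothetical helper: returns a warning for the user, or "" when the
// dictionary is already UTF-8 and nothing needs to be reported.
std::string dictionary_encoding_warning(const std::string& dictionary,
                                        const std::string& encoding) {
    if (is_utf8_encoding_name(encoding)) {
        return "";
    }
    return "Dictionary '" + dictionary + "' is encoded as " + encoding +
           "; spelling results may be wrong unless it is converted to UTF-8.";
}
```

For the files listed earlier in the thread, this would stay silent for en_GB (SET UTF-8) and produce a warning for pl_PL (SET ISO8859-2).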

bartosz-antosik added bug help wanted labels Aug 19, 2017

bartosz-antosik closed this as completed Aug 19, 2017

bartosz-antosik reopened this Aug 19, 2017

bartosz-antosik changed the title ~~big problems with polish language~~ Hunspell backend provided by node-spellchecker does not respect other dictionary encoding but UTF-8 Aug 20, 2017

bartosz-antosik closed this as completed Aug 27, 2017

bartosz-antosik mentioned this issue Jan 6, 2018

Large pop-up delay for "Show Fixes" on Linux #107

Closed

bartosz-antosik mentioned this issue Mar 12, 2018

Check identifiers in programming languages #124

Closed

kolya-ay mentioned this issue Mar 13, 2018

Respect dictionary encoding atom/node-spellchecker#89

Closed
