Hunspell backend provided by node-spellchecker does not respect other dictionary encoding but UTF-8 #50

Closed
slodki opened this issue Aug 19, 2017 · 20 comments

Comments

@slodki

slodki commented Aug 19, 2017

Please compare this screenshot (polish-error) with the Hunspell output:

$ echo "żólty jerz żółty jest śliczny i biały" | hunspell 
Hunspell 1.4.0
& żólty 1 0: żółty
& jerz 14 6: jeż, jesz, rejz, jer, jerze, jerez, jera, jarz, jery, perz, jedz, jeru, Perz, jer z
*
*
*
*
*

$ echo $LANG
pl_PL.UTF-8

Why is ż in jeż treated as an unknown character? Could this be an encoding problem?

@bartosz-antosik
Owner

I am far away from any Ubuntu machine. Could you maybe please:

  1. Ensure that the dictionaries you have linked are UTF-8?
  2. Try to download the dictionaries according to the initial recipe ("Dictionaries" folder)?

@pjssilva

I had this problem with Portuguese when using a dictionary that was not UTF-8. Try the dictionaries from

https://github.com/titoBouzout/Dictionaries

@slodki
Author

slodki commented Aug 19, 2017

grep SET *.aff
en_AU.aff:SET UTF-8
en_CA.aff:SET UTF-8
en_GB.aff:SET UTF-8
en_US.aff:SET ISO8859-1
en_ZA.aff:SET UTF-8
pl_PL.aff:SET ISO8859-2

Why is this flag in the file ignored? The extension should use the declared encoding when reading the dictionary.

I don't want to keep a separate copy of the dictionary for each app. This dictionary comes from the Ubuntu distribution and works without problems in other apps.
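
For reference, the encoding a dictionary declares can be read without Hunspell at all. The following is a minimal sketch, assuming the .aff file declares its encoding on a line of the form `SET <name>` as in the grep output above. `readAffEncoding` is a hypothetical helper, not part of node-spellchecker or Spell Right.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the encoding named on the first
// "SET <encoding>" line of an .aff stream, or an empty string if the
// stream contains no such line.
std::string readAffEncoding(std::istream& aff) {
    std::string line;
    while (std::getline(aff, line)) {
        // Tolerate CRLF line endings and a leading UTF-8 byte order mark.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.compare(0, 3, "\xEF\xBB\xBF") == 0) line.erase(0, 3);

        std::istringstream fields(line);
        std::string keyword, encoding;
        if (fields >> keyword >> encoding && keyword == "SET") return encoding;
    }
    return std::string();
}
```

Called on a stream opened over pl_PL.aff from the listing above, this would be expected to return `ISO8859-2`.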

@bartosz-antosik
Owner

Spell Right does not use a preinstalled Hunspell. I am using a module that depends on UTF-8 and cannot use these flags. It may work the other way around: maybe you can switch this flag and replace the dictionary file in Ubuntu.

@slodki
Author

slodki commented Aug 19, 2017

"Small memory & CPU usage footprint - uses offline, OS native spell checking service whenever possible: Windows Spell Checking API (windows 8/10) or Hunspell (windows 7, macOS, Linux)."

This is not true. On Linux it uses neither an OS-native service nor the system's Hunspell.

And loading and using incompatible files without any check or warning is an error in this extension.

@bartosz-antosik
Owner

I understand. I will look into whether this can be resolved in a better way.

bartosz-antosik changed the title from "big problems with polish language" to "Hunspell backend provided by node-spellchecker does not respect other dictionary encoding but UTF-8" on Aug 20, 2017
@slodki
Author

slodki commented Aug 20, 2017

node-spellchecker calls new Hunspell(affixpath.c_str(), dpath.c_str()), which internally creates new AffixMgr(affpath, pHMgr, &maxdic, key); that constructor parses the SET line from the .aff file.

The encoding read there is used only for lowercase/uppercase conversion.

Converting text to the dictionary's encoding is left to the application.

@slodki
Author

slodki commented Aug 20, 2017

node-spellchecker always sends text (which it expects to be a wide string) to the Hunspell library as UTF-8, ignoring the dictionary encoding.
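
To make the mismatch concrete, here is a small illustration of why a UTF-8 query misses an ISO-8859-2 entry. It is only a sketch: the `std::set` is a stand-in for a dictionary keyed by raw bytes, not Hunspell's actual data structure, and `isKnown` is a hypothetical name.

```cpp
#include <set>
#include <string>

// "jeż" as raw bytes in each encoding. The letter ż (U+017C) is the two
// bytes C5 BC in UTF-8 but the single byte BF in ISO-8859-2.
const std::string kJezUtf8 = "je\xC5\xBC";
const std::string kJezLatin2 = "je\xBF";

// Stand-in for a dictionary whose entries are stored as ISO-8859-2 bytes.
// The lookup compares bytes, so the query must use the same encoding.
bool isKnown(const std::set<std::string>& dictionary, const std::string& word) {
    return dictionary.count(word) > 0;
}
```

With a set containing only `kJezLatin2`, looking up `kJezLatin2` succeeds while looking up `kJezUtf8` fails, even though both spell the same word.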

@bartosz-antosik
Owner

Yes, I saw the part that apparently reads the SET line some time ago. And node-spellchecker, as far as I can tell, uses exactly the same Hunspell source code that is compiled into the CLI version distributed with most Linux distros and that is used by Mozilla and LibreOffice. (That is, by the way, what I meant by 'native', because otherwise, what is native for Linux?)

From some hints (e.g. your CLI example) I deduce that it CAN run with UTF-8 on the front end and the dictionary's own encoding at the back. The problem is not trivial, however, as it has not been solved for Atom so far. Still, I consider it the best back-end module for spelling in VSCode, which I elaborated on in #20266 a while ago.

If you could dig into this a bit, it would be a great help! Thank you!

@bartosz-antosik
Owner

(to previous comment) I think this is exactly how it should be: Hunspell is queried in UTF-8 on the front end, does the conversion internally, and responds in UTF-8 with an acknowledgement and suggestions.

@bartosz-antosik
Owner

bartosz-antosik commented Aug 20, 2017

Oh, and I am sorry: my comment above was misleading because I simplified things. I knew about the UTF-8 requirement (see my first comment, which works around the issue), and that is why I stated that the module 'does not use the flags', which is partially untrue, as you have discovered on your own.

@slodki
Author

slodki commented Aug 20, 2017

If you want to compile a patched node-spellchecker, it should be easy: wrap each Hunspell call in spellchecker_hunspell.cc like this:

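// Pseudocode: vscode->current_file_encoding stands for however the editor's encoding reaches this layer.
const char* dictEnc = hunspell->get_dict_encoding().c_str();
if (strcmp(dictEnc, vscode->current_file_encoding) != 0) {
    // Convert the word from the editor's encoding to the dictionary's.
    iconv_t toDict = iconv_open(dictEnc, vscode->current_file_encoding);
    iconv(toDict, &in, &inLeft, &out, &outLeft);    // word -> tmp_word
    hunspell->spell(tmp_word.c_str());
    // or: hunspell->add(tmp_word.c_str());
    // or: hunspell->suggest(&slist_tmp, tmp_word.c_str());
    // Convert each suggestion back to the editor's encoding.
    iconv_t fromDict = iconv_open(vscode->current_file_encoding, dictEnc);
    iconv(fromDict, &in, &inLeft, &out, &outLeft);  // slist_tmp[i] -> slist[i]
    iconv_close(toDict);
    iconv_close(fromDict);
} else {
    hunspell->spell(word.c_str());
    // or: hunspell->add(word.c_str());
    // or: hunspell->suggest(&slist, word.c_str());
}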

You can borrow the chenc function from the hunspell command-line tool.
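
The conversion step in the pseudocode above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not the code of chenc or node-spellchecker: it assumes a POSIX iconv is available, `convertEncoding` is a hypothetical name, and it only targets stateless encodings such as UTF-8 and ISO-8859-2.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding
// `to` with POSIX iconv. Throws std::runtime_error when the pair of
// encodings is unsupported or the input cannot be converted.
std::string convertEncoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the order: target encoding first
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string output;
    char buffer[256];
    char* in = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        std::size_t outLeft = sizeof(buffer);
        std::size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = (rc == (std::size_t)-1) ? errno : 0;
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the chunk buffer filled up; loop to convert the rest.
        if (err != 0 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv conversion failed");
        }
    }
    iconv_close(cd);
    return output;
}
```

A word would be converted to the dictionary's encoding before calling spell or suggest, and each suggestion converted back afterwards. Encoding names accepted by iconv vary by platform, so the name taken from the SET line may need to be mapped (for example ISO8859-2 to ISO-8859-2).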

@bartosz-antosik
Owner

Right now, how to get vscode->current_file_encoding there is more of a mystery to me than where to get chenc...

@Karuso33

Karuso33 commented Aug 20, 2017

I had this problem too when I used the "system dictionaries" (on Ubuntu 16.04) by symlinking them from /usr/share/hunspell (as you described in the readme), because those dictionaries were not in UTF-8. Maybe just add a short warning to the readme that the dictionary files have to be encoded in UTF-8...

@bartosz-antosik
Owner

@Karuso33: That's exactly what I did in the latest release (1.1.16) a few hours ago. Thanks.

@Karuso33

@bartosz-antosik Oh, my bad.

@bartosz-antosik
Owner

@Karuso33: On the contrary! Thank you for supporting this idea!

@bartosz-antosik
Owner

@Karuso33: I think I will keep the thread open to try to verify whether the situation can be remedied.

P.S. As it seems you are using Spell Right on Linux, could you maybe comment on #51? Sorry to ask, but Linux support is pretty new, I will be away from an Ubuntu machine for some time, and I do not use it on a regular basis, so I would like to know whether it has this issue and on what scale.

@bartosz-antosik
Owner

I have examined the solution suggested by @slodki a few posts above, and in the form proposed it has serious drawbacks: it can only work on Linux (plus it is just a suggestion and does not compile as is), whereas Hunspell is also used on Windows 7, and there is no iconv in the typical node-gyp toolset. More code would have to be written to support this conversion on Windows as well. I would rather stay with the requirement for UTF-8 dictionaries for now, as I cannot spend that much time developing this solution.

I would of course welcome every solution/help that could resolve this inconvenience.
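
As a rough idea of what an iconv-free conversion involves, the simplest case can be written by hand. The sketch below covers only ISO-8859-1 (the encoding declared by en_US.aff in the listing above), where each byte value equals its Unicode code point. `latin1ToUtf8` is a hypothetical helper, and ISO-8859-2 would additionally need a lookup table for the upper half of the byte range.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8 without library
// support. This works only because ISO-8859-1 byte values are identical to
// the Unicode code points U+0000..U+00FF.
std::string latin1ToUtf8(const std::string& input) {
    std::string output;
    output.reserve(input.size() * 2);
    for (unsigned char byte : input) {
        if (byte < 0x80) {
            // ASCII is encoded identically in both encodings.
            output.push_back(static_cast<char>(byte));
        } else {
            // Code points U+0080..U+00FF take two bytes in UTF-8.
            output.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            output.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return output;
}
```

The reverse direction and other single-byte encodings follow the same pattern but need explicit mapping tables, which is the extra code the comment above refers to.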

@slodki
Author

slodki commented Aug 27, 2017

OK.

But what about reading the dictionary encoding from node-spellchecker and displaying a warning to the user when it is not UTF-8?
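
A check like the one proposed here could look roughly like this. It is a minimal sketch: `isUtf8Name` and `encodingWarning` are hypothetical names, the encoding string is assumed to come from the dictionary's SET line, and how the warning reaches the user is left open.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True when an encoding name denotes UTF-8, ignoring letter case and the
// optional hyphen (so "UTF-8", "utf-8" and "utf8" all match).
bool isUtf8Name(std::string name) {
    name.erase(std::remove(name.begin(), name.end(), '-'), name.end());
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return name == "UTF8";
}

// Returns a warning message for the user, or an empty string when the
// dictionary is declared as UTF-8 and can be used as is.
std::string encodingWarning(const std::string& dictionaryName, const std::string& encoding) {
    if (isUtf8Name(encoding)) return std::string();
    const std::string shown = encoding.empty() ? "an undeclared encoding" : encoding;
    return "Dictionary " + dictionaryName + " uses " + shown +
           "; only UTF-8 dictionaries are checked correctly.";
}
```

For the files listed earlier in the thread, this would stay silent for en_AU (UTF-8) and produce a warning for pl_PL (ISO8859-2).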
