Encoding Problem Python3 #32

Closed
NicoDietrich opened this issue Jul 21, 2017 · 16 comments

@NicoDietrich

NicoDietrich commented Jul 21, 2017

Hey! I have a problem using hunspell with German words and I hope you can help me out.
An Example:

Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hunspell
>>> spellchecker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic',
...                                  '/usr/share/hunspell/de_DE.aff')
>>> spellchecker.spell('Wörterbuch')
True
>>> spellchecker.suggest('Wörterbuhc')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in method suggest of HunSpell object at 0xb717d6d0> returned a result with an error set

When I encode to utf-8 it works, but the result I get makes no sense:

>>> spellchecker.suggest('Wörterbuhc'.encode('utf-8'))
['Westerburger']

This seems very weird, since the exact same thing was done here and apparently worked.

I use it in an Ubuntu VM with all the necessary packages installed:

Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
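
A minimal sketch of what the two symptoms above may come down to, assuming the de_DE dictionary stores its words in Latin-1 (ISO-8859-1), as later comments in this thread suggest. It uses only built-in string methods and does not call hunspell itself. In Latin-1 the letter "ö" is the single byte 0xf6, which is not a valid UTF-8 start byte, and a word sent as UTF-8 bytes is read by a Latin-1 dictionary as a different, garbled string. Whether that garbled string is what actually produced 'Westerburger' is an assumption, not something the thread confirms.

```python
word = "Wörterbuhc"

# In Latin-1, "ö" is the single byte 0xf6 (position 1 in this word).
latin1_bytes = word.encode("latin-1")

# Decoding Latin-1 bytes as UTF-8 fails on that byte, which matches
# the "byte 0xf6 in position 1" message in the traceback above.
try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The opposite mistake: UTF-8 bytes interpreted as Latin-1 turn "ö"
# into two unrelated characters, so the checker sees another word.
misread = word.encode("utf-8").decode("latin-1")

print(latin1_bytes)
print(decode_error)
print(misread)
```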

@blatinier
Collaborator

blatinier commented Jul 22, 2017

Ok I can reproduce something similar. But not exactly. Here is what I have on Debian 8:

>>> import hunspell
>>> spellchecker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic','/usr/share/hunspell/de_DE.aff')
>>> spellchecker.spell('Wörterbuch')
True
>>> spellchecker.suggest('Wörterbuhc') # should return some result, not an empty list
[]
>>> spellchecker.suggest('Wörterbuhc'.encode('utf-8')) # should not fail
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1: invalid start byte

I will look into this a bit later.

Can you attach your Ubuntu version for the record please?

@PanderMusubi

In my applications, I use

# These affix files are Latin-1 encoded; everything else is opened with the default encoding.
if filename in ('de_AT_frami.aff', 'de_CH_frami.aff', 'de_DE_frami.aff', 'de_DE.aff', 'en_US.aff', 'pt_BR.aff', 'sl_SI.aff', 'th_TH.aff', 'ru_RU.aff', 'nn_NO.aff', 'an_ES.aff', 'af_ZA.aff', 'el_GR.aff', 'bg_BG.aff', 'de_CH.aff', 'it_IT.aff', 'hu_HU.aff', 'pl_PL.aff', 'cs_CZ.aff', 'eu.aff', 'lt_LT.aff', 'nb_NO.aff', 'oc_FR.aff', 'bs_BA.aff', 'de_AT.aff', ):
    input = open(filepath, 'r', encoding='ISO-8859-1')
else:
    input = open(filepath, 'r')
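
As an alternative to a hard-coded list of file names, here is a rough sketch that reads the encoding from the affix file itself. Hunspell .aff files normally declare their encoding on a `SET` line (for example `SET ISO8859-1` or `SET UTF-8`). The helper names `encoding_from_aff_lines` and `read_aff_encoding` are hypothetical, and the ISO8859-1 default for files without a `SET` line is an assumption about Hunspell's behaviour.

```python
def encoding_from_aff_lines(lines, default="ISO8859-1"):
    """Return the encoding named on the first SET line (hypothetical helper).

    `lines` is an iterable of raw byte lines from a .aff file. The default
    used when no SET line exists is an assumption, not taken from the thread.
    """
    for raw_line in lines:
        # A UTF-8 file may start with a byte order mark; drop it if present.
        parts = raw_line.lstrip(b"\xef\xbb\xbf").split()
        if len(parts) >= 2 and parts[0] == b"SET":
            return parts[1].decode("ascii")
    return default


def read_aff_encoding(aff_path, default="ISO8859-1"):
    """Read the declared encoding from a .aff file on disk."""
    # Binary mode: the encoding is unknown until the SET line is found.
    with open(aff_path, "rb") as handle:
        return encoding_from_aff_lines(handle, default)
```

The result could then be passed along as `open(filepath, 'r', encoding=read_aff_encoding(filepath))`, provided Python knows a codec under that name.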

@mike-fabian

I see a similar problem on Fedora 26:

$ python3
Python 3.6.2 (default, Aug 11 2017, 11:59:59)
[GCC 7.1.1 20170622 (Red Hat 7.1.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/de_DE.dic', '/usr/share/myspell/de_DE.aff')
>>> hobj.suggest('grun')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in method suggest of HunSpell object at 0x7f32edc26a50> returned a result with an error set

I think “grün” should be among the suggestions, but it fails because of the encoding error.

@isnok

isnok commented Sep 4, 2017

+1
I just ran into similar trouble with the German word "Gültigkeit".
Thanks to this thread I found that it is actually valid (hobj.spell returns True), but hobj.suggest raises the same UnicodeDecodeError... (can't decode at position 1)

@isnok

isnok commented Sep 4, 2017

Update: I tried with hunspell==0.4.0 and it works!

@rafis

rafis commented Feb 5, 2018

The problem still exists in hunspell==0.5.2, even for pure ASCII words:

import hunspell
hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
print(hobj.suggest('Eelysa'))

Related #35.

By the way, pyenchant doesn't have this problem, though it uses libenchant-dev instead of hunspell directly.

@thierry-FreeBSD

Same error with 0.5.3. See mike-fabian/ibus-typing-booster#23

@blatinier
Collaborator

I think it's OK in master now. If someone confirms, I will quickly publish a new version on PyPI.

@blatinier
Collaborator

For the record, some dictionaries are Latin-1 encoded, so I try UTF-8 first and fall back to Latin-1 on failure.
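
The fallback described above can be sketched in Python roughly as follows. This is an illustration of the idea only, not the actual C++ code in the binding, and the function name `decode_utf8_then_latin1` is hypothetical.

```python
def decode_utf8_then_latin1(raw: bytes) -> str:
    """Decode as UTF-8 first and fall back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this
        # second attempt cannot raise.
        return raw.decode("latin-1")


# The same word as stored by a Latin-1 and by a UTF-8 dictionary.
print(decode_utf8_then_latin1(b"gr\xfcn"))
print(decode_utf8_then_latin1("grün".encode("utf-8")))
```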

@blatinier blatinier reopened this Feb 20, 2018
blatinier added a commit that referenced this issue Feb 20, 2018
@mike-fabian

The problem seems to be fixed in current git master.

@mike-fabian

mike-fabian commented Mar 6, 2018 via email

@nkrot

nkrot commented Mar 6, 2018

In hunspell (0.5.3) there is still the same problem when using HunSpell.suggest() with German umlauts:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 5: invalid start byte

Is there any chance of getting it fixed, or is there a working workaround?

EDIT: As I only use hunspell for German, and German hunspell dictionaries are known to be in Latin-1, I changed hunspell.cpp at line 171 to

    //pystr = PyUnicode_DecodeUTF8(slist[i], str_size, "strict");
    pystr = PyUnicode_DecodeLatin1(slist[i], str_size, "strict");

and recompiled and reinstalled with

  > python3 setup.py install --user

Magically, it worked. Enter chaos!

Now seriously: is there a way to get the dictionary's encoding from hunspell and decode accordingly? I am too new to Python and not proficient enough in C to do it myself.
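
One possible direction, sketched in Python under stated assumptions: a later comment in this thread shows `get_dic_encoding()` returning a name such as 'ISO8859-2', and Python's `codecs.lookup` accepts names of that form. The helpers `to_python_codec` and `decode_with_dictionary_encoding` are hypothetical, and the sketch assumes the raw suggestion bytes are available, which is not the case through the binding's Python API as shown here. Names Python does not recognise fall back to Latin-1, which is a choice made for this sketch.

```python
import codecs


def to_python_codec(hunspell_name: str, default: str = "latin-1") -> str:
    """Map a dictionary encoding name to a Python codec name (hypothetical)."""
    try:
        return codecs.lookup(hunspell_name).name
    except LookupError:
        # Unknown to Python: fall back to the default codec.
        return default


def decode_with_dictionary_encoding(raw: bytes, hunspell_name: str) -> str:
    """Decode raw suggestion bytes with the dictionary's declared encoding."""
    return raw.decode(to_python_codec(hunspell_name))


print(decode_with_dictionary_encoding(b"gr\xfcn", "ISO8859-1"))
```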

@mike-fabian

@nkrot Current git master worked for me doing this:

$ python3
Python 3.6.4 (default, Feb 8 2018, 14:42:51)
[GCC 7.3.1 20180130 (Red Hat 7.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/de_DE.dic', '/usr/share/myspell/de_DE.aff')
>>> hobj.suggest('grun')
['grub', 'grün', '-run', 'Grund', 'Grunge']

@blatinier
Collaborator

@nkrot since current master solves this encoding issue, I published a new version (0.5.4 → https://pypi.python.org/pypi/hunspell/0.5.4)
You can try it.

@djstrong

Encoding is wrong:

>>> hobj = hunspell.HunSpell('/usr/share/hunspell/pl_PL.dic', '/usr/share/hunspell/pl_PL.aff')
>>> hobj.get_dic_encoding()
'ISO8859-2'
>>> hobj.suggest('narazie')
['zaranie',
 'narazi',
 'narzazie',
 'naradzie',
 'zarazie',
 'nakazie',
 'namazie',
 'nardzie',
 'narazi³',
 'naraziæ',
 'na razie',
 'na-razie',
 'nara zie',
 'nara-zie',
 'naraz ie']

Instead of "narazi³" and "naraziæ", the results should be "naraził" and "narazić".
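
A possible workaround on the Python side, sketched under the assumption that the binding decoded ISO-8859-2 bytes as Latin-1 (consistent with the fallback the maintainer describes earlier in the thread): re-encode each suggestion as Latin-1 to recover the original bytes, then decode them with the dictionary's real encoding. The function name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text, actual_encoding, wrong_encoding="latin-1"):
    """Undo a decode that used the wrong single-byte encoding."""
    try:
        return text.encode(wrong_encoding).decode(actual_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip is not possible; leave the text as it is.
        return text


suggestions = ["zaranie", "narazi³", "naraziæ", "na razie"]
repaired = [repair_mojibake(s, "iso8859-2") for s in suggestions]
print(repaired)
```

Plain ASCII suggestions pass through unchanged, because ASCII bytes mean the same thing in both encodings.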

@petasis

petasis commented Aug 14, 2019

I am re-opening this bug, as the solution provided is not a complete fix.
I am using the Greek dictionary, which has an ISO-8859-7 encoding.
Using PyUnicode_DecodeLatin1() returns invalid results.
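
A short illustration of why a Latin-1 decode cannot signal this problem, using only Python's built-in codecs rather than the binding: Latin-1 assigns a character to every byte value, so decoding never raises and ISO-8859-7 bytes silently become the wrong letters. The sample word "λέξη" is chosen here for illustration and does not come from the thread.

```python
# Every possible byte value decodes under Latin-1 without an error.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")

greek_word = "λέξη"
raw = greek_word.encode("iso8859-7")

# Decoding with Latin-1 succeeds but yields unrelated Latin letters.
wrong = raw.decode("latin-1")
# Decoding with the dictionary's real encoding restores the word.
right = raw.decode("iso8859-7")

print(wrong)
print(right)
```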
