Encoding Problem Python3 #32

NicoDietrich · 2017-07-21T14:30:53Z

Hey! I have a problem using hunspell with German words and I hope you can help me out.
An Example:

Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hunspell
>>> spellchecker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic',
...                                  '/usr/share/hunspell/de_DE.aff')
>>> spellchecker.spell('Wörterbuch')
True
>>> spellchecker.suggest('Wörterbuhc')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in method suggest of HunSpell object at 0xb717d6d0> returned a result with an error set

When I encode to utf-8 it works, but the result I get makes no sense:

>>> spellchecker.suggest('Wörterbuhc'.encode('utf-8'))
['Westerburger']

This seems very weird, since the exact same thing was done here and apparently worked.

I use it in an Ubuntu VM with all the necessary packages installed:

Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
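
Editorial note: the byte 0xf6 named in the traceback above is how Latin-1 (ISO-8859-1) stores "ö", which suggests the suggestion bytes come back in the dictionary's encoding rather than in UTF-8. The following minimal sketch uses only the standard library, not the hunspell binding, and reproduces the same error message from plain bytes:

```python
# "ö" is one byte in Latin-1 but two bytes in UTF-8.
word = "Wörterbuch"
latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")

print(latin1_bytes[1:2])  # b'\xf6'
print(utf8_bytes[1:3])    # b'\xc3\xb6'

# Decoding the Latin-1 bytes as UTF-8 fails exactly as in the traceback.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```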

blatinier · 2017-07-22T21:29:48Z

Ok I can reproduce something similar. But not exactly. Here is what I have on Debian 8:

>>> import hunspell
>>> spellchecker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic','/usr/share/hunspell/de_DE.aff')
>>> spellchecker.spell('Wörterbuch')
True
>>> spellchecker.suggest('Wörterbuhc') # should return some result, not an empty list
[]
>>> spellchecker.suggest('Wörterbuhc'.encode('utf-8')) # should not fail
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1: invalid start byte

I will look into this a bit later.

Can you attach your Ubuntu version for the record please?

PanderMusubi · 2017-07-23T09:49:16Z

In my applications, I use:

# Open these affix files as ISO-8859-1 (which accepts any byte sequence); the rest use the default encoding.
if filename in ('de_AT_frami.aff', 'de_CH_frami.aff', 'de_DE_frami.aff', 'de_DE.aff', 'en_US.aff', 'pt_BR.aff', 'sl_SI.aff', 'th_TH.aff', 'ru_RU.aff', 'nn_NO.aff', 'an_ES.aff', 'af_ZA.aff', 'el_GR.aff', 'bg_BG.aff', 'de_CH.aff', 'it_IT.aff', 'hu_HU.aff', 'pl_PL.aff', 'cs_CZ.aff', 'eu.aff', 'lt_LT.aff', 'nb_NO.aff', 'oc_FR.aff', 'bs_BA.aff', 'de_AT.aff'):
    input = open(filepath, 'r', encoding='ISO-8859-1')
else:
    input = open(filepath, 'r')
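
Editorial note: instead of a hard-coded list of file names, the encoding can be read from the affix file itself, since Hunspell `.aff` files declare it on a `SET` line. The sketch below is an illustration, not part of any commenter's code: the helper name `find_set_encoding` is hypothetical, and the `ISO8859-1` default for files without a `SET` line is an assumption.

```python
import re

def find_set_encoding(aff_lines, default="ISO8859-1"):
    """Return the encoding named on the first SET line of an .aff file.

    aff_lines is any iterable of byte strings, for example a file opened
    in binary mode. The default for a missing SET line is an assumption.
    """
    for raw_line in aff_lines:
        # A UTF-8 file may start with a byte order mark; drop it first.
        line = raw_line.lstrip(b"\xef\xbb\xbf")
        match = re.match(rb"\s*SET\s+(\S+)", line)
        if match:
            return match.group(1).decode("ascii")
    return default

sample = [b"# Polish affix file\n", b"SET ISO8859-2\n", b"TRY abc\n"]
print(find_set_encoding(sample))  # ISO8859-2
```

With a real file, the same helper could be called as `find_set_encoding(open(filepath, 'rb'))`. Reading in binary mode avoids having to guess an encoding before the `SET` line has been found.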

mike-fabian · 2017-09-04T08:49:29Z

I see a similar problem on Fedora 26:

$ python3
Python 3.6.2 (default, Aug 11 2017, 11:59:59)
[GCC 7.1.1 20170622 (Red Hat 7.1.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/de_DE.dic', '/usr/share/myspell/de_DE.aff')
>>> hobj.suggest('grun')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in method suggest of HunSpell object at 0x7f32edc26a50> returned a result with an error set

I think “grün” should be among the suggestions, but it fails because of the encoding error.

isnok · 2017-09-04T13:04:55Z

+1
I just ran into similar trouble with the German word "Gültigkeit".
Thanks to this thread I found that it is actually valid (hobj.spell returns True), but hobj.suggest raises the same UnicodeDecodeError... (can't decode at position 1)

isnok · 2017-09-04T13:07:36Z

Update: I tried with hunspell==0.4.0 and it works!

rafis · 2018-02-05T13:42:51Z

The problem still exists in hunspell==0.5.2, even for pure ASCII words:

import hunspell
hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
print(hobj.suggest('Eelysa'))

Related #35.

By the way, pyenchant doesn't have this problem, though it uses libenchant-dev instead of hunspell directly.

thierry-FreeBSD · 2018-02-18T12:29:49Z

Same error with 0.5.3. See mike-fabian/ibus-typing-booster#23

blatinier · 2018-02-20T21:51:43Z

I think it's OK in master now. If someone confirms, I will quickly publish a new version on PyPI.

blatinier · 2018-02-20T21:53:40Z

For the record, some dictionaries are Latin-1 encoded, so I try UTF-8 first and fall back to Latin-1 on failure.
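
Editorial note: the fallback described above can be sketched in Python as follows. This is only an illustration of the strategy, not the actual code in the binding, and the function name `decode_suggestion` is hypothetical. Because Latin-1 assigns a character to every byte, the fallback never raises, so bytes from a dictionary in some other single-byte encoding decode without error but to the wrong characters.

```python
def decode_suggestion(raw):
    """Decode raw suggestion bytes: strict UTF-8 first, then Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises.
        return raw.decode("latin-1")

print(decode_suggestion("grün".encode("utf-8")))    # grün
print(decode_suggestion("grün".encode("latin-1")))  # grün

# Bytes from an ISO-8859-2 dictionary decode silently, but wrongly.
print(decode_suggestion("naraził".encode("iso8859_2")))  # narazi³
```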

mike-fabian · 2018-03-06T08:56:35Z

The problem seems to be fixed in current git master.

mike-fabian · 2018-03-06T08:57:39Z

Benoît Latinier <notifications@github.com> wrote:

For the record, some dictionaries are Latin-1 encoded, so I try UTF-8 first and fall back to Latin-1 on failure.

The problem seems to be fixed in current git master indeed.

…

-- 📧 Mike FABIAN <mike.fabian@gmx.de> Lack of sleep is the enemy of good work.

nkrot · 2018-03-06T11:45:23Z

In hunspell (0.5.3) there is still the same problem when using HunSpell.suggest() with German umlauts:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 5: invalid start byte

Any chance of getting it fixed, or is there a working workaround?

EDIT: As I only use hunspell for German, and German hunspell dictionaries are known to be in Latin-1, I changed hunspell.cpp at line 171:

    //pystr = PyUnicode_DecodeUTF8(slist[i], str_size, "strict");
    pystr = PyUnicode_DecodeLatin1(slist[i], str_size, "strict");

and recompiled and reinstalled with

  > python3 setup.py install --user

Magically, it worked. Enter chaos!

Now seriously: is there a way to get the encoding of the dictionary from hunspell and decode accordingly? I am too new to Python and not proficient enough in C to do it myself.
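
Editorial note: a later comment in this thread shows `hobj.get_dic_encoding()` returning a name such as 'ISO8859-2'. Assuming raw suggestion bytes were available, decoding them with that name could look like the sketch below. The function `decode_with_dic_encoding` is hypothetical, and the Latin-1 fallback for names Python's codec registry does not recognise is an assumption, not the binding's behaviour.

```python
import codecs

def decode_with_dic_encoding(raw, dic_encoding, fallback="latin-1"):
    """Decode bytes using the encoding name reported for the dictionary."""
    try:
        codec_name = codecs.lookup(dic_encoding).name
    except LookupError:
        # Name unknown to Python: fall back (assumed policy).
        codec_name = fallback
    return raw.decode(codec_name)

print(decode_with_dic_encoding(b"narazi\xb3", "ISO8859-2"))  # naraził
print(decode_with_dic_encoding(b"gr\xfcn", "ISO8859-1"))     # grün
```

Passing the name through `codecs.lookup` first keeps an unrecognised name from raising `LookupError` in the middle of a suggestion call.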

mike-fabian · 2018-03-06T13:19:49Z

@nkrot Current git master worked for me doing this:

$ python3
Python 3.6.4 (default, Feb 8 2018, 14:42:51)
[GCC 7.3.1 20180130 (Red Hat 7.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/de_DE.dic', '/usr/share/myspell/de_DE.aff')
>>> hobj.suggest('grun')
['grub', 'grün', '-run', 'Grund', 'Grunge']

blatinier · 2018-03-07T14:01:41Z

@nkrot since current master solves this encoding issue, I published a new version (0.5.4 → https://pypi.python.org/pypi/hunspell/0.5.4)
You can try it.

djstrong · 2019-03-22T18:29:32Z

Encoding is wrong:

hobj = hunspell.HunSpell('/usr/share/hunspell/pl_PL.dic', '/usr/share/hunspell/pl_PL.aff')
hobj.get_dic_encoding()
'ISO8859-2'
hobj.suggest('narazie')
['zaranie',
 'narazi',
 'narzazie',
 'naradzie',
 'zarazie',
 'nakazie',
 'namazie',
 'nardzie',
 'narazi³',
 'naraziæ',
 'na razie',
 'na-razie',
 'nara zie',
 'nara-zie',
 'naraz ie']

Instead of "narazi³" and "naraziæ", the results should be "naraził" and "narazić".
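
Editorial note: the two wrong suggestions look like ISO-8859-2 bytes that were decoded as Latin-1. As a user-side workaround sketch (the helper name `repair_latin1_mojibake` is hypothetical), such strings can be turned back into bytes and decoded again with the dictionary's real encoding:

```python
def repair_latin1_mojibake(text, real_encoding):
    """Undo a wrong Latin-1 decode by re-decoding with the real encoding."""
    try:
        return text.encode("latin-1").decode(real_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way; return the text unchanged.
        return text

print(repair_latin1_mojibake("narazi³", "iso8859_2"))  # naraził
print(repair_latin1_mojibake("naraziæ", "iso8859_2"))  # narazić
```

This only helps when every suggestion went through the Latin-1 path. A string that was correctly decoded as UTF-8 and contains only characters below U+0100 would be altered as well.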

petasis · 2019-08-14T08:49:20Z

I am re-opening this bug, as the solution provided is not a complete fix.
I am using the Greek dictionary, which has an iso-8859-7 encoding.
Using PyUnicode_DecodeLatin1() returns invalid results.

mike-fabian mentioned this issue Sep 4, 2017

Errors with py-hunspell 0.5.0 mike-fabian/ibus-typing-booster#23

Closed

mwydmuch mentioned this issue Nov 2, 2017

Fix encoding problem in HunSpell.suggest method #35

Closed

blatinier closed this as completed in ce45ddd Feb 20, 2018

blatinier reopened this Feb 20, 2018

blatinier added a commit that referenced this issue Feb 20, 2018

fix encoding problem (#32)

6c207d8

blatinier added a commit that referenced this issue Feb 20, 2018

fix encoding problem (#32)

ca6acb2

blatinier added a commit that referenced this issue Feb 20, 2018

fix encoding problem (#32)

9f0505d

blatinier added a commit that referenced this issue Feb 20, 2018

fix encoding problem (#32)

220a646

blatinier added a commit that referenced this issue Feb 20, 2018

fix encoding problem (#32)

d838799

blatinier closed this as completed Mar 7, 2018

blatinier mentioned this issue Mar 27, 2019

Python3 encoding error #63

Open
