Cannot handle unicode Chinese strings? #7

GoogleCodeExporter · 2015-03-15T07:06:45Z

As the title says: can this function not handle unicode Chinese strings?
For example, cuttest(u"我喜欢python和c++。")
raises:
Traceback (most recent call last):
  File "D:\bluecat2\Desktop\smallseg_0.5.1\test_fenci.py", line 41, in <module>
    cuttest(u"我喜欢python和c++。")
  File "D:\bluecat2\Desktop\smallseg_0.5.1\test_fenci.py", line 18, in cuttest
    wlist = seg.cut(text)
  File "D:\bluecat2\Desktop\smallseg_0.5.1\smallseg.py", line 56, in cut
    text = text.decode('utf-8','ignore')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: 
ordinal not in range(128)
Windows, Python 2.7

Original issue reported on code.google.com by blurr...@gmail.com on 22 Feb 2012 at 12:50

GoogleCodeExporter · 2015-03-15T07:06:45Z

The argument to cuttest should be a UTF-8 encoded byte string; it is converted to unicode internally.

Original comment by ccnu...@gmail.com on 8 Mar 2012 at 1:29
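The traceback above matches Python 2 behaviour: calling `.decode('utf-8')` on a `unicode` object first encodes it implicitly with the ASCII codec, which fails on Chinese characters. A caller-side workaround, sketched here on the assumption that `cut` expects UTF-8 bytes as described, is to encode before calling. `to_utf8_bytes` is a hypothetical helper name, not part of smallseg.

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(text):
    """Return UTF-8 encoded bytes for either unicode text or bytes.

    Hypothetical helper, not part of smallseg. On Python 2, `bytes`
    is an alias for `str`, so byte strings pass through unchanged.
    """
    if isinstance(text, bytes):
        return text
    return text.encode("utf-8")


# Unicode input is encoded; already-encoded input is left alone.
encoded = to_utf8_bytes(u"我喜欢python和c++。")
same = to_utf8_bytes(encoded)
```

With the `seg` object from the traceback, the call would then look like `seg.cut(to_utf8_bytes(text))`.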

GoogleCodeExporter · 2015-03-15T07:06:45Z

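Could you add a unicode interface? In many applications the string has already been converted to unicode by the time segmentation is used.
Django, for example, is unicode throughout internally. Thank you very much!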

Original comment by blurr...@gmail.com on 13 Mar 2012 at 2:00

GoogleCodeExporter · 2015-03-15T07:06:46Z

Commenting out this line should be enough.
http://code.google.com/p/smallseg/source/browse/trunk/smallseg.py#67

Original comment by ccnu...@gmail.com on 16 Mar 2012 at 12:29
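If the linked line is the `text.decode('utf-8','ignore')` call shown in the traceback (an assumption; the line numbers differ between the two), commenting it out would let unicode input through but would stop UTF-8 byte strings from being decoded. A minimal sketch of an alternative that accepts both is to decode only when the input is a byte string. `to_text` is a hypothetical helper, not part of smallseg.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return unicode text, decoding only when given a byte string.

    Hypothetical helper, not part of smallseg. Mirrors the
    decode('utf-8', 'ignore') call from the traceback, but skips it
    for input that is already unicode.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, "ignore")
    return value


# Both forms of input yield the same unicode text.
from_bytes = to_text(u"我喜欢python和c++。".encode("utf-8"))
from_text = to_text(u"我喜欢python和c++。")
```

Inside `cut`, the unconditional decode could then be replaced by `text = to_text(text)`, which keeps existing UTF-8 callers working while also accepting unicode from frameworks such as Django.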

GoogleCodeExporter added Type-Defect Priority-Medium auto-migrated labels Mar 15, 2015
