Can't it handle Unicode Chinese characters? #7

Open
GoogleCodeExporter opened this issue Mar 15, 2015 · 3 comments

Comments

@GoogleCodeExporter

As the title says: can this function not handle Unicode Chinese strings?
For example, cuttest(u"我喜欢python和c++。")
fails with:
Traceback (most recent call last):
  File "D:\bluecat2\Desktop\smallseg_0.5.1\test_fenci.py", line 41, in <module>
    cuttest(u"我喜欢python和c++。")
  File "D:\bluecat2\Desktop\smallseg_0.5.1\test_fenci.py", line 18, in cuttest
    wlist = seg.cut(text)
  File "D:\bluecat2\Desktop\smallseg_0.5.1\smallseg.py", line 56, in cut
    text = text.decode('utf-8','ignore')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Windows, Python 2.7


Original issue reported on code.google.com by blurr...@gmail.com on 22 Feb 2012 at 12:50

@GoogleCodeExporter

The input argument to cuttest should be a UTF-8 encoded string; it is converted to unicode internally.

Original comment by ccnu...@gmail.com on 8 Mar 2012 at 1:29
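The advice above (pass UTF-8 bytes, not a unicode object) could be applied on the caller side roughly as follows. This is a minimal sketch, not part of smallseg: `to_utf8_bytes` is a hypothetical helper name, and the only smallseg call assumed is `seg.cut(text)` as it appears in the traceback.

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(text, encoding="utf-8"):
    """Hypothetical helper: encode text to bytes, pass byte strings through."""
    if isinstance(text, bytes):
        # Already a byte string (plain `str` on Python 2): leave unchanged.
        return text
    # Text string (`unicode` on Python 2, `str` on Python 3): encode it.
    return text.encode(encoding)


encoded = to_utf8_bytes(u"我喜欢python和c++。")
# With smallseg loaded as in the report's test script, the call would then be:
# wlist = seg.cut(encoded)
```

Because `bytes` is an alias for `str` on Python 2.7, the same check works there and on Python 3.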

@GoogleCodeExporter

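Could a unicode interface be added? In many applications, by the time word segmentation is used, the strings have already been converted to unicode.
Django, for example, is unicode throughout internally. Thank you very much!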

Original comment by blurr...@gmail.com on 13 Mar 2012 at 2:00
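One way such a unicode-friendly interface might look is a guarded decode: on Python 2, calling `.decode('utf-8')` on an object that is already unicode first encodes it implicitly with the ASCII codec, which is where the `UnicodeEncodeError` in the report comes from. The sketch below decodes only byte strings and returns unicode input untouched. `ensure_text` is a hypothetical name, and whether smallseg could adopt this in place of its unconditional `text.decode('utf-8','ignore')` is an assumption, not something confirmed in this thread.

```python
# -*- coding: utf-8 -*-

def ensure_text(text, encoding="utf-8", errors="ignore"):
    """Hypothetical helper: return a unicode string for bytes or unicode input."""
    if isinstance(text, bytes):
        # Byte string: decode it, mirroring the 'ignore' error policy
        # seen in the traceback's text.decode('utf-8','ignore') call.
        return text.decode(encoding, errors)
    # Already unicode: skip decoding, which avoids the implicit ASCII encode.
    return text


from_bytes = ensure_text(u"我喜欢python和c++。".encode("utf-8"))
from_text = ensure_text(u"我喜欢python和c++。")
```

Both calls produce the same unicode string, so callers such as Django code that already hold unicode would not need to encode first.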

@GoogleCodeExporter

Commenting out this line of code should be enough:
http://code.google.com/p/smallseg/source/browse/trunk/smallseg.py#67

Original comment by ccnu...@gmail.com on 16 Mar 2012 at 12:29
