Custom dictionary: bug when a mixed Chinese/English entry contains a space #300

Closed
summer1988 opened this issue Oct 19, 2015 · 9 comments

Comments

@summer1988

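For example, take the entry: Edu Trust认证 2000
Loading it with jieba.load_userdict('xx.dict') fails with this traceback:
ValueError: invalid dictionary entry in htopics/summary/user.dict at Line 979: Edu Trust认证 2000
Is this because jieba, when reading a custom dictionary file, separates the fields on each line with split, working from the left?
I think it should instead split from the right and take a fixed number of fields, e.g. 'Edu Trust认证 2000 nv'.rsplit(' ', n)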
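
The idea of parsing from the right can be sketched as follows. This is a minimal illustration, not jieba's actual loader: `parse_userdict_line` is a hypothetical helper, and it assumes the `word [freq] [tag]` layout with a frequency present whenever a tag is present (a line such as `word tag` with no frequency cannot be told apart from a two-token word by this approach).

```python
def parse_userdict_line(line):
    """Parse 'word [freq] [tag]' where the word itself may contain spaces.

    Hypothetical helper for illustration; not part of jieba.
    Runs of whitespace inside the word are normalized to a single space.
    """
    parts = line.split()
    freq, tag = None, None
    # A trailing non-numeric token directly after a numeric one is taken as the tag.
    if len(parts) >= 3 and parts[-2].isdecimal() and not parts[-1].isdecimal():
        tag = parts.pop()
    # A trailing numeric token (with at least one token left for the word) is the frequency.
    if len(parts) >= 2 and parts[-1].isdecimal():
        freq = int(parts.pop())
    word = " ".join(parts)
    return word, freq, tag


print(parse_userdict_line("Edu Trust认证 2000"))     # ('Edu Trust认证', 2000, None)
print(parse_userdict_line("Edu Trust认证 2000 nz"))  # ('Edu Trust认证', 2000, 'nz')
```

Reading the optional fields off the right end means everything that remains on the left is kept as the word, spaces included.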

@ycchuang

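Even once the custom dictionary is allowed to contain spaces, is it still impossible to segment out a word that contains a space?
For example:
Edu Trust认证
Using the custom dictionary entry Edu Trust认证 2000
the segmentation result is as follows:
jieba.cut('Edu Trust认证', cut_all=False)
Edu
(space)
Trust
认证

That is four tokens (one of which is a space)

rather than

Edu Trust认证

as a single word.

Is this a configuration problem? Thanks.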

@cavonchen

cavonchen commented May 24, 2016

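I have the same problem. How can I get Edu Trust认证 treated as a single word? @fxsjy @gumblex

I modified two regular expressions, and it works in my own testing. @fxsjy @gumblex, please take a look.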
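
The thread does not show the patched expressions, so the following is only a sketch of why a block-splitting regex matters here. The two patterns are illustrative assumptions (a character class of CJK characters and ASCII alphanumerics, with and without a space); they are not copied from jieba or from the patch.

```python
import re

# Illustrative patterns only; assumed for this sketch, not taken from jieba or the patch.
BLOCK_NO_SPACE = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9]+)")
BLOCK_WITH_SPACE = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9 ]+)")


def blocks(pattern, text):
    # re.split with a capturing group keeps the matched runs; drop empty pieces.
    return [piece for piece in pattern.split(text) if piece]


text = "我通过了Edu Trust认证"
print(blocks(BLOCK_NO_SPACE, text))    # ['我通过了Edu', ' ', 'Trust认证']
print(blocks(BLOCK_WITH_SPACE, text))  # ['我通过了Edu Trust认证']
```

If the text is cut into blocks at the space before any dictionary lookup happens, an entry such as Edu Trust认证 can never match, because no single block contains it. Allowing the space inside a block keeps the whole span available to the dictionary.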

cavonchen added a commit to cavonchen/jieba that referenced this issue May 25, 2016
fix: fxsjy#300
With jieba.cut(sentence, HMM=False), words that mix Chinese and English characters with whitespace can now also be output.
userdict:
Edu Trust认证 2000
jieba.cut("我通过了Edu Trust认证", HMM=False)
output: 我, 通过, 了, Edu Trust认证
cavonchen added a commit to cavonchen/jieba that referenced this issue May 25, 2016
tags =jieba.analyse.extract_tags("我通过了Edu Trust认证")
print(", ".join(tags))     

output:
Edu Trust认证, 通过
@geekan

geekan commented Jun 15, 2016

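A very good patch; at least it solves this long-standing problem.
However, a user-defined dictionary that is read from a file still needs its format changed, and this patch does not address that.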

@geekan

geekan commented Jun 15, 2016

@fxsjy

@vkjuju

vkjuju commented Nov 4, 2016

Is there a patch that fixes this for a custom dictionary read from a file? @fxsjy, @cavonchen, thanks, I need this urgently...

@vkjuju

vkjuju commented Nov 4, 2016

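One more question: can an English compound with a space in the middle also be segmented as a single word? For example, given the sentence "這是 This is an 一個 Apple Macbook, 我在自定字典內定義Apple Macbook", with "Apple Macbook" defined in my custom dictionary, the result should contain:

Apple Macbook

I would appreciate any help. Thanks @fxsjy, @cavonchen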
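
One possible workaround, independent of jieba's internals, is to post-process the token list and merge adjacent tokens whose concatenation is a known multi-word phrase. The sketch below assumes the tokenizer emits spaces as separate tokens, as reported earlier in this thread; `merge_tokens` and `max_len` are hypothetical names, not a jieba feature.

```python
def merge_tokens(tokens, phrases, max_len=5):
    """Greedily merge runs of adjacent tokens whose concatenation is in `phrases`.

    Hypothetical post-processing helper; `max_len` caps how many tokens a phrase may span.
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to a window of two tokens.
        for j in range(min(len(tokens), i + max_len), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in phrases:
                merged.append(candidate)
                i = j
                break
        else:
            # No phrase starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["Apple", " ", "Macbook", ",", " ", "我"]
print(merge_tokens(tokens, {"Apple Macbook"}))  # ['Apple Macbook', ',', ' ', '我']
```

This leaves the segmenter untouched and only repairs its output, so it does not help with loading such entries from a dictionary file.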

@vkjuju

vkjuju commented Nov 7, 2016

Mixed Chinese/English text that contains commas cannot be segmented out either, for example: "小王 , 小白, 小張"

@bringtree

bringtree commented Mar 19, 2018

An entry like <song>娃哈哈</song> also fails to be added to the dictionary.

@bringtree

@summer1988, could you fix this when you have time? Thanks.
