'seg' and 'lac' modes produce inconsistent segmentation results #188

Open
cheyong007 opened this issue Apr 14, 2021 · 1 comment

Comments

@cheyong007

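Problem:

With a default pip install, the word.dic dictionary files under lac_model, seg_model, and rank_model differ, which leads to inconsistent segmentation results.

lac_model\conf\word.dic: 746KB, 58223 lines, contains multi-character words;
rank_model\conf\word.dic: 746KB, 58223 lines, contains multi-character words, identical to the lac_model file;
seg_model\conf\word.dic: 71KB, 8223 lines, contains single characters only.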
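The comparison above can be reproduced with a small helper that counts entries and multi-character entries in a dictionary file. This is only a sketch: `summarize_dict` is a hypothetical name, and it assumes each line is tab-separated with the token in the last column, which may not match the actual word.dic layout.

```python
from typing import Dict, Iterable


def summarize_dict(lines: Iterable[str], token_column: int = -1) -> Dict[str, int]:
    """Count dictionary entries and how many are longer than one character.

    Assumption: each line is tab-separated and the token sits in
    `token_column` (last column by default). Adjust if the file differs.
    """
    total = 0
    multi_char = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        token = line.split("\t")[token_column]
        total += 1
        if len(token) > 1:
            multi_char += 1
    return {"entries": total, "multi_char": multi_char}


# In-memory sample standing in for a real dictionary file.
sample = ["0\t没", "1\t有", "2\t疫情"]
print(summarize_dict(sample))  # {'entries': 3, 'multi_char': 1}

# With a real file (the path is a placeholder for your install location):
# with open("path/to/word.dic", encoding="utf-8") as f:
#     print(summarize_dict(f))
```

Running it on each of the three files should show whether a dictionary holds only single characters (`multi_char` near zero) or multi-character words as well.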

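    from LAC import LAC

    text = "“没有什么比这场疫情下的生与死更能体现美国的肤色差异了”。"

    # Segmentation only
    lac_seg = LAC(mode='seg')
    seg_result = lac_seg.run(text)

    # Segmentation plus part-of-speech / entity tags
    lac_lac = LAC(mode='lac')
    lac_result = lac_lac.run(text)

    # Segmentation, tags, and word-importance ranks
    lac_rank = LAC(mode='rank')
    rank_result = lac_rank.run(text)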

Result:

seg_result = ['“', '没有', '什么', '比', '这', '场', '疫情', '下', '的', '生与死', '更', '能', '体现', '美国', '的', '肤色', '差异', '了', '”', '。']

lac_result = [['“', '没有', '什么', '比', '这场', '疫情', '下', '的', '生与死', '更', '能', '体现', '美国', '的', '肤', '色差异', '了', '”', '。'], ['w', 'v', 'r', 'p', 'r', 'n', 'f', 'u', 'n', 'd', 'v', 'v', 'LOC', 'u', 'n', 'a', 'u', 'w', 'w']]
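To see exactly where the two outputs diverge, the token lists can be aligned on the character offsets they share. The following is a minimal sketch; `boundaries`, `chunks_at`, and `diff_segmentations` are hypothetical helper names and not part of LAC. It assumes both token lists cover the same text and contain no empty tokens.

```python
from typing import List, Sequence, Set, Tuple


def boundaries(tokens: Sequence[str]) -> Set[int]:
    """Return the character offset at which each token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def chunks_at(tokens: Sequence[str], cuts: Set[int]) -> List[List[str]]:
    """Group tokens into chunks that close at each offset in `cuts`."""
    chunks, current, pos = [], [], 0
    for tok in tokens:
        current.append(tok)
        pos += len(tok)
        if pos in cuts:
            chunks.append(current)
            current = []
    return chunks


def diff_segmentations(
    a: Sequence[str], b: Sequence[str]
) -> List[Tuple[List[str], List[str]]]:
    """Return the pairs of chunks where the two segmentations disagree."""
    if "".join(a) != "".join(b):
        raise ValueError("token lists do not cover the same text")
    shared = boundaries(a) & boundaries(b)
    pairs = zip(chunks_at(a, shared), chunks_at(b, shared))
    return [(ca, cb) for ca, cb in pairs if ca != cb]


# Token lists copied from the results reported in this issue.
seg_words = ['“', '没有', '什么', '比', '这', '场', '疫情', '下', '的', '生与死',
             '更', '能', '体现', '美国', '的', '肤色', '差异', '了', '”', '。']
lac_words = ['“', '没有', '什么', '比', '这场', '疫情', '下', '的', '生与死',
             '更', '能', '体现', '美国', '的', '肤', '色差异', '了', '”', '。']

print(diff_segmentations(seg_words, lac_words))
# [(['这', '场'], ['这场']), (['肤色', '差异'], ['肤', '色差异'])]
```

For this sentence the two modes disagree in two places: 'seg' splits 这场 into two tokens, and the two modes cut 肤色差异 at different positions.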

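What I tried:

  1. Replacing seg_model\conf\word.dic with lac_model\conf\word.dic raises an error: multi-character words are not supported.
  2. Replacing lac_model\conf\word.dic with seg_model\conf\word.dic makes the results consistent.
  3. Replacing rank_model\conf\word.dic with seg_model\conf\word.dic has no effect on rank, which depends on the segmentation produced by lac mode.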

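Could all three modes use the same dictionary file so that segmentation results are consistent?

I would really rather not maintain three versions of the dictionary file, since that invites confusion.
Does using a single-character-only dictionary in lac mode have any side effects?
Is there a particular reason for shipping three versions of the dictionary file?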
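Until the dictionaries are unified, one possible workaround (not something the maintainers have confirmed) is to take the segmentation from a single mode everywhere. Since rank is reported above to follow the lac segmentation, keeping only the word list from a lac or rank result would keep all three views consistent. A minimal sketch, where `words_from` is a hypothetical helper and the result is assumed to have the `[words, tags, ...]` shape shown under "Result":

```python
from typing import Callable, List, Sequence


def words_from(run: Callable[[str], Sequence[Sequence[str]]], text: str) -> List[str]:
    """Run a lac- or rank-mode analyzer and keep only the word list.

    Assumption: `run(text)` returns a sequence whose first element is the
    list of words, as in the lac_result shown in this issue.
    """
    result = run(text)
    return list(result[0])


# Usage with LAC installed (not executed here):
# from LAC import LAC
# lac = LAC(mode='lac')
# words = words_from(lac.run, "“没有什么比这场疫情下的生与死更能体现美国的肤色差异了”。")
```

This avoids editing any word.dic file, at the cost of running the heavier lac model where only segmentation is needed.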

@yayaQAQ

yayaQAQ commented Aug 26, 2021

I ran into the same inconsistent segmentation problem. Could the maintainers make rank and seg produce the same segmentation?
(image attached)
