Wrong entity end. #17

Open
jxg972 opened this issue Dec 12, 2017 · 13 comments

Comments

@jxg972

jxg972 commented Dec 12, 2017

WARNING:rasa_nlu.extractors.mitie_entity_extractor:Example skipped: Invalid entity {u'start': 0, u'end': 6, u'value': u'\u6c5f\u94c3E200', u'entity': u'\u8f66\u7cfb'} in example '江铃E200VS东风风神AX7新能源': entities must span whole tokens. Wrong entity end.
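The warning says the entity position is annotated incorrectly, but the tokenizer output has exactly the same boundaries:
import jieba
for i in jieba.tokenize('江铃E200VS东风风神AX7新能源'):
    print(i)
('江铃E200', 0, 6)
('VS', 6, 8)
('东风风神AX7新能源', 8, 18)
What could be the cause?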

@crownpku
Owner

Could you post the annotation JSON you wrote for this example?

@jxg972
Author

jxg972 commented Dec 13, 2017

{
"text": "江铃E200VS东风风神AX7新能源",
"intent": "对比",
"entities": [
{
"start": 0,
"end": 6,
"value": "江铃E200",
"entity": "车系"
},
{
"start": 8,
"end": 18,
"value": "东风风神AX7新能源",
"entity": "车系"
}
]
}
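
A quick sanity check on an annotation like the one above is to confirm that `text[start:end]` equals each entity's `value`. The sketch below assumes Python 3, where strings are indexed by Unicode character; `find_offset_mismatches` is a hypothetical helper name, not part of rasa_nlu.

```python
def find_offset_mismatches(example):
    """Return (entity, actual_span) pairs where text[start:end] != value."""
    text = example["text"]
    mismatches = []
    for entity in example.get("entities", []):
        span = text[entity["start"]:entity["end"]]
        if span != entity["value"]:
            mismatches.append((entity, span))
    return mismatches


# The annotation posted in this thread.
example = {
    "text": "江铃E200VS东风风神AX7新能源",
    "intent": "对比",
    "entities": [
        {"start": 0, "end": 6, "value": "江铃E200", "entity": "车系"},
        {"start": 8, "end": 18, "value": "东风风神AX7新能源", "entity": "车系"},
    ],
}

print(find_offset_mismatches(example))  # prints [] because both spans match
```

An empty result means the offsets themselves are consistent, so any remaining warning has to come from how the text is tokenized rather than from the annotation.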

@jxg972
Author

jxg972 commented Dec 13, 2017

I have added a custom dictionary.

@crownpku
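Owner

"江铃E200" mixes Chinese characters with Latin letters and digits, so I suspect a Python 2 encoding problem.
The simplest suggestion: could you try running it with Python 3?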

@jxg972
Author

jxg972 commented Dec 13, 2017

OK, I'll give it a try.

@jxg972
Author

jxg972 commented Dec 13, 2017

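While installing the rasa packages for Python 3, I found a problem: the disk holding /usr, the default install location for Python packages, was full. After adding disk space and reinstalling rasa for Python 2, the problem went away. Perhaps the earlier installation was incomplete because of this?
rasa never reported an error, so I never noticed the problem.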

@crownpku
Owner

x_X

@jxg972
Author

jxg972 commented Dec 14, 2017

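Oops, a correction: the full disk was not the cause.
The cause was the current working directory, which I had never paid attention to.
Running python -m rasa_nlu.train -c xxx/config_jieba_mitie_sklearn.json from an arbitrary directory produces the error.
I have to cd into the cloned rasa_nlu_chi directory first; then everything is read correctly.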
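
The thread does not establish why the working directory matters. One possibility, which is an assumption and not confirmed here, is that the config refers to other files through relative paths that only resolve from the repository root. Under that assumption, a small diagnostic like the following could list config values that look like relative paths and do not exist from the current directory. The function name and the config keys in the example are hypothetical.

```python
import os


def missing_relative_paths(config, cwd=None):
    """List (key, value) pairs whose relative path does not exist under cwd."""
    cwd = cwd or os.getcwd()
    missing = []
    for key, value in config.items():
        # Only top-level string values that start with ./ or ../ are inspected.
        if isinstance(value, str) and value.startswith(("./", "../")):
            if not os.path.exists(os.path.join(cwd, value)):
                missing.append((key, value))
    return missing


# Hypothetical config contents, for illustration only.
example_config = {
    "pipeline": "demo",
    "data_dir": "./data",
    "model_dir": "./models",
}

print(missing_relative_paths(example_config))
```

If the list is non-empty when run from the failing directory and empty when run from the repository root, relative paths are a likely culprit.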

@crownpku
Owner

Anyway, glad the problem is solved :)

@BobCN2017

  {
    "text": "今天8点到9点45分有哪些闹钟",
    "intent": "alarm_search",
    "entities": [
      {
        "start": 2,
        "end": 4,
        "value": "8点",
        "entity": "time"
      },
      {
        "start": 5,
        "end": 10,
        "value": "9点45分",
        "entity": "time"
      }
    ]
  },

2018-06-10 22:13:48 WARNING rasa_nlu.extractors.mitie_entity_extractor - Example skipped: Invalid entity {'start': 2, 'end': 4, 'value': '8点', 'entity': 'time'} in example '今天8点到9点45分有哪些闹钟': entities must span whole tokens. Wrong entity end.
2018-06-10 22:13:48 WARNING rasa_nlu.extractors.mitie_entity_extractor - Example skipped: Invalid entity {'start': 5, 'end': 10, 'value': '9点45分', 'entity': 'time'} in example '今天8点到9点45分有哪些闹钟': entities must span whole tokens. Wrong entity end.
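I get this error too. Every failing entity is a digit followed by a Chinese character.
I tried running with Python 3 and still get the same error.
I checked the directory: I am running from the main directory.
Python 3 is a fresh install, the rasa dependencies were installed again under Python 3, and there is no shortage of disk space.
Has anyone run into something similar? Thanks.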

@jxg972
Author

jxg972 commented Jun 11, 2018

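For this problem, you need to check your tokenization. With the entities annotated this way, the tokenizer must produce
['今天', '8点', '到', '9点45分', '有', '哪些', '闹钟']
but jieba's default segmentation gives
['今天', '8', '点到', '9', '点', '45', '分有', '哪些', '闹钟']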
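
To see why the second segmentation triggers "Wrong entity end", the two token lists quoted above can be compared against the annotated offsets. This is only a sketch of the kind of alignment check the warning describes, not the actual extractor code; it assumes the tokens concatenate back to the original text with nothing dropped, and the helper names are hypothetical.

```python
def token_spans(tokens):
    """Turn a list of token strings into (start, end) character offsets."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def misaligned_entities(tokens, entities):
    """Report entities whose start or end is not a token boundary."""
    spans = token_spans(tokens)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    problems = []
    for ent in entities:
        if ent["start"] not in starts:
            problems.append((ent["value"], "wrong entity start"))
        elif ent["end"] not in ends:
            problems.append((ent["value"], "wrong entity end"))
    return problems


entities = [
    {"start": 2, "end": 4, "value": "8点", "entity": "time"},
    {"start": 5, "end": 10, "value": "9点45分", "entity": "time"},
]
required = ['今天', '8点', '到', '9点45分', '有', '哪些', '闹钟']
default = ['今天', '8', '点到', '9', '点', '45', '分有', '哪些', '闹钟']

print(misaligned_entities(required, entities))
# prints []
print(misaligned_entities(default, entities))
# prints [('8点', 'wrong entity end'), ('9点45分', 'wrong entity end')]
```

With the default segmentation, offset 4 falls inside the token '点到' and offset 10 falls inside '分有', which is why both entities are rejected at their end position.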

@BobCN2017

@jxg972 Fixed. It was indeed a tokenization problem; adding words such as 8点 to the user dictionary solved it.
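
For larger training sets, the user dictionary entries could be generated from the annotated entity values instead of being typed by hand. The sketch below relies on jieba's documented dictionary file format (one word per line, with optional frequency and part-of-speech columns separated by spaces, UTF-8 encoded); the helper names and the output path are hypothetical, and how the resulting file gets loaded by the training pipeline is not covered in this thread.

```python
def userdict_lines(examples):
    """Collect unique entity values, in first-seen order, one per line."""
    seen, lines = set(), []
    for example in examples:
        for entity in example.get("entities", []):
            word = entity["value"].strip()
            # Dictionary columns are space-separated, so values that
            # contain whitespace are skipped in this sketch.
            if word and word not in seen and not any(c.isspace() for c in word):
                seen.add(word)
                lines.append(word)
    return lines


def write_userdict(examples, path):
    """Write the collected words to a UTF-8 dictionary file."""
    with open(path, "w", encoding="utf-8") as f:
        for line in userdict_lines(examples):
            f.write(line + "\n")


examples = [
    {
        "text": "今天8点到9点45分有哪些闹钟",
        "intent": "alarm_search",
        "entities": [
            {"start": 2, "end": 4, "value": "8点", "entity": "time"},
            {"start": 5, "end": 10, "value": "9点45分", "entity": "time"},
        ],
    }
]

print(userdict_lines(examples))  # prints ['8点', '9点45分']
```

When testing the segmentation by hand, a file written this way can be loaded with `jieba.load_userdict(path)`, or single words can be registered with `jieba.add_word("8点")`.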

@zhoulijing01

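Hello, I still get the error even after loading my custom dictionary. Outside of rasa_nlu training, when I run the tokenizer with this dictionary myself, the segmentation is correct. Do you have any suggestions?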
