[BUG] the extract_email performed very badly even worse than simple regex #167

Closed
healthmatrice opened this issue Oct 7, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@healthmatrice

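When asking a question, please respect me! State the necessary information clearly: the environment, the exact input text, and the function you ran.
Do not toss out a single vague sentence that leaves me unable to locate the problem and wastes time. I will close such issues directly.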

Description

The email pattern seems quite limited and fails to extract emails in many cases. A simple modification of the example (dropping the trailing 。) makes the extraction fail: 请发简历至dongyu@163.com

I dug a bit deeper and found that your regex pattern is problematic.

I think it would be better to switch to a more general pattern. Here is what I found:

my_pattern = r"(([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*]))"

import re
import jionlp as jio

# Same inputs as in the reproduction further down.
text = '请发简历至dongyu@163.com。'
text1 = '请发简历至dongyu@163.com'

regex_jio_email = re.compile(jio.rule_pattern.EMAIL_PATTERN)
regex_mine = re.compile(my_pattern)
print(regex_jio_email.findall(text)) # ok
print(regex_jio_email.findall(text1)) # fail
print(regex_mine.findall(text)) # ok
print(regex_mine.findall(text1)) # ok

I also think the following line is problematic:

self.email_domain_pattern.search(item['text']).group(1)

The search can return None, and that case is not handled. I found similar problems repeatedly throughout the code. I think many such issues could be avoided by adding type hints and running mypy on the code before each release and before merging PRs.
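A minimal sketch of the kind of guard meant here. The pattern `EMAIL_DOMAIN_PATTERN` and the function `extract_domain` are hypothetical names for illustration and are not jionlp's actual code. With the `Optional[str]` return annotation, a type checker such as mypy can flag callers that forget to handle the `None` case.

```python
import re
from typing import Optional

# Hypothetical domain pattern for illustration only; not jionlp's real pattern.
EMAIL_DOMAIN_PATTERN = re.compile(r"@([A-Za-z0-9.-]+)")


def extract_domain(text: str) -> Optional[str]:
    """Return the part after '@', or None when the text has no match."""
    match = EMAIL_DOMAIN_PATTERN.search(text)
    if match is None:
        # re.search returns None on no match; calling .group() on it
        # would raise AttributeError.
        return None
    return match.group(1)


print(extract_domain("dongyu@163.com"))
print(extract_domain("no address here"))
```

Returning `None` is only one option; raising a descriptive exception or skipping the item may fit the calling code better.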

Please describe your issue here
extract_email performs very badly, even worse than a simple regex. Is this expected, or did I do something wrong?

  1. Version:
  • Python version: 3.10
  • jionlp version: 1.5.2
  2. jionlp calling code and input text (Code & Text):
import jionlp as jio
text = '请发简历至dongyu@163.com。'
text1 = '请发简历至dongyu@163.com'
print(jio.extract_email(text, detail=True)) # this works
print(jio.extract_email(text1, detail=True)) # returns empty

Expectation

If the returned result is not ideal, please describe what you expected to happen.

While you are here, please star the repository with the ⭐ in the upper right corner.

@healthmatrice healthmatrice added the bug Something isn't working label Oct 7, 2023
@dongrixinyu
Owner

Thank you for your contribution.

my_pattern = r"(([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*]))"

I will keep your regex as a backup.

For your example, I have fixed the bug and pushed the fix in the latest commit. The cause: when I wrote the regex, I added # at the beginning and end of the given text. # is not a suitable choice because it is also a character the email regex allows.
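The failure mode described above can be reproduced with a small sketch. Everything here is an assumption for illustration: `PATTERN` is a simplified stand-in rather than jionlp's real `EMAIL_PATTERN`, `extract` is a hypothetical helper, and using a space as the padding character is just one possible fix, not necessarily what the actual commit does.

```python
import re

# Hypothetical, simplified character set in which '#' counts as an email character.
EMAIL_CHARS = r"A-Za-z0-9._%+#-"

# An address must be preceded and followed by a non-email character.
PATTERN = re.compile(
    rf"(?<=[^{EMAIL_CHARS}])"
    rf"([{EMAIL_CHARS}]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    rf"(?=[^{EMAIL_CHARS}])"
)


def extract(text: str, sentinel: str) -> list:
    """Pad the text with a sentinel on both sides, then search it."""
    padded = sentinel + text + sentinel
    return [m.group(1) for m in PATTERN.finditer(padded)]


text = '请发简历至dongyu@163.com。'
text1 = '请发简历至dongyu@163.com'

print(extract(text, '#'))   # the trailing 。 acts as the boundary
print(extract(text1, '#'))  # the sentinel '#' looks like part of the address
print(extract(text1, ' '))  # a sentinel outside the email character set
```

When the address sits at the very end of the text, the trailing `#` is itself a valid email character, so the lookahead for a non-email character never succeeds and nothing is returned.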

Besides, my email regex cannot cover every case, for example addresses containing Chinese characters, which are in fact legal in email addresses. If you find other exceptions, feel free to raise an issue.
