Improve parsing of natural language string with punctuation #725

Open
systemcatch opened this issue Dec 3, 2019 · 1 comment

Collaborator

systemcatch commented Dec 3, 2019

I don't think an unlimited number of punctuation marks would work in my implementation, but I can increase the limit to any finite number you want (say 2 or 3 punctuation marks). I think one is fine; more than two is probably overkill.

Edit: Maybe allow 3 or 4 due to the use of "...", although I'm not sure how often people use those after dates. I could see people using a date like this: He said, "The date is 1/2/13." So maybe increasing the constraint is actually a good idea, and I can increase it infinitely following the date, just not preceding it.

Originally posted by @andrewchouman in #720

===========================================================

I tend to agree, but my one concern is that this worked before 0.15.0 (I used 0.13.0 as an example):

venv ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.__version__
'0.13.0'
>>> arrow.get("This date has too many punctuation marks following it 11.11.2011", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>
>>> arrow.get("This date has too many punctuation marks following it (11.11.2011)", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>
>>> arrow.get("This date has too many punctuation marks following it (11.11.2011).", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>

This is definitely an improvement, but to restore full pre-0.15.0 behavior while keeping the improvements, we probably need to support any number of punctuation marks. Out of curiosity, why would a finite number work but not an unbounded one (e.g. with the + quantifier in regex)?

Originally posted by @jadchaar in #720
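
One possible explanation for the finite-versus-unbounded question, assuming the leading punctuation is checked with a lookbehind assertion (an assumption; the actual regex is not shown in this thread): Python's standard `re` module only accepts fixed-width lookbehinds, while lookaheads may have variable width. That would fit the remark that the limit can be lifted after the date but not before it. A minimal sketch, where `DATE`, `PUNCT`, and `compiles` are illustrative names and not part of arrow:

```python
import re

# Hypothetical stand-ins, not arrow's real patterns.
DATE = r"\d{2}\.\d{2}\.\d{4}"      # rough equivalent of DD.MM.YYYY
PUNCT = r"[.,;:!?()\[\]]"          # a small sample punctuation class


def compiles(pattern: str) -> bool:
    """Return True if the standard `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Lookbehind with a fixed width (whitespace plus exactly one mark) is accepted.
bounded_leading = rf"(?<=\s{PUNCT}){DATE}"

# Lookbehind with an unbounded quantifier is rejected by `re`
# ("look-behind requires fixed-width pattern").
unbounded_leading = rf"(?<=\s{PUNCT}+){DATE}"

# Lookahead has no such restriction, so any amount of trailing punctuation
# can be allowed after the date.
trailing = re.compile(rf"(?P<date>{DATE})(?={PUNCT}*(?:\s|$))")

bounded_ok = compiles(bounded_leading)
unbounded_ok = compiles(unbounded_leading)
match = trailing.search("(11.11.2011)...")
```

Here `bounded_ok` is true and `unbounded_ok` is false, while `trailing` still finds the date in front of several punctuation marks. Whether this is the constraint the implementation actually hit is not confirmed in the thread.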

jadchaar changed the title from "Handling of punctuation in string parsing" to "Improve string parsing of natural language string with punctuation" on Dec 21, 2019
jadchaar changed the title from "Improve string parsing of natural language string with punctuation" to "Improve parsing of natural language string with punctuation" on Dec 21, 2019
Member

jadchaar commented Mar 3, 2020

We definitely need to figure out a way to make the regex simpler and more general. It would be nice to allow any number of punctuation marks rather than hardcoding a limit.

A starting word boundary of (?<![\S]) and an ending word boundary of (?![\w]) could be a possibility.
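
To see what these two boundaries would accept, here is a rough sketch using Python's `re` module. `DATE`, `PUNCT`, `plain`, `relaxed`, and `find_date` are hypothetical names for illustration only, and the `relaxed` variant (consuming leading punctuation after the boundary instead of inside a lookbehind) is just one possible way to cover the parenthesised case, not something proposed in the thread:

```python
import re

# Hypothetical stand-ins, not arrow's real patterns.
DATE = r"\d{2}\.\d{2}\.\d{4}"      # rough equivalent of DD.MM.YYYY
PUNCT = r"[.,;:!?()\[\]]"          # a small sample punctuation class

# The boundaries exactly as suggested: nothing but whitespace (or the start
# of the string) before the date, and no word character right after it.
plain = re.compile(rf"(?<![\S])(?P<date>{DATE})(?![\w])")

# Assumed variant: match any run of leading punctuation after the boundary,
# which avoids a variable-width lookbehind.
relaxed = re.compile(rf"(?<![\S]){PUNCT}*(?P<date>{DATE})(?![\w])")


def find_date(regex, text):
    """Return the captured date substring, or None if nothing matches."""
    m = regex.search(text)
    return m.group("date") if m else None


bare = find_date(plain, "following it 11.11.2011")
trailing_only = find_date(plain, "following it 11.11.2011).")
wrapped_plain = find_date(plain, "following it (11.11.2011).")
wrapped_relaxed = find_date(relaxed, "following it (11.11.2011).")
glued = find_date(relaxed, "id=x11.11.2011")
```

In this sketch the suggested boundaries handle any trailing punctuation, but `wrapped_plain` is `None` because the opening parenthesis is a non-whitespace character directly before the date. The `relaxed` variant finds the date in the parenthesised string and still rejects a date glued to a preceding word character.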
