Improve parsing of natural language string with punctuation #725

systemcatch · 2019-12-03T20:53:35Z

I don't think an unbounded number would work in my implementation, but I can raise the limit to any finite number you want (say, 2 or 3 punctuation marks). I think one is fine; more than two is probably overkill.

Edit: Maybe allow 3 or 4 because of "...", although I'm not sure how often people use that after dates. I could see people writing a date like this: He said, "The date is 1/2/13." So raising the limit is probably a good idea, and I can make it unlimited following the date, just not preceding it.

Originally posted by @andrewchouman in #720

===========================================================

I tend to agree, but my one concern is that this worked before 0.15.0 (I picked 0.13.0 as an example):

venv ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.__version__
'0.13.0'
>>> arrow.get("This date has too many punctuation marks following it 11.11.2011", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>
>>> arrow.get("This date has too many punctuation marks following it (11.11.2011)", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>
>>> arrow.get("This date has too many punctuation marks following it (11.11.2011).", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>

This is definitely an improvement, but to keep the full pre-0.15.0 behavior alongside the improvements, we probably need to support any number of punctuation marks. Out of curiosity, why would a finite number work but not an unbounded one (e.g. with the + quantifier in regex)?

Originally posted by @jadchaar in #720
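
One possible explanation for the question above, offered as an assumption and not confirmed anywhere in this thread: Python's built-in re module only accepts fixed-width lookbehind assertions, while lookahead assertions can use + freely. That would match the remark that the limit can be unbounded following the date but not preceding it. A minimal sketch, where the punctuation class and the helper name are hypothetical and not taken from arrow's parser:

```python
import re

# Illustrative punctuation class; NOT the set arrow's parser actually uses.
PUNCT = r"[.,;:?!()]"


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Exactly one punctuation mark before a digit: fixed width, accepted.
fixed_lookbehind = r"(?<=" + PUNCT + r")\d"

# One or more punctuation marks before a digit: variable width, rejected by re.
variable_lookbehind = r"(?<=" + PUNCT + r"+)\d"

# One or more punctuation marks after a digit: lookahead has no width limit.
variable_lookahead = r"\d(?=" + PUNCT + r"+)"

print(compiles(fixed_lookbehind))
print(compiles(variable_lookbehind))
print(compiles(variable_lookahead))
```

If the implementation relies on lookbehind, a bounded count can still be written with a separate fixed-width alternative for each count, which may be why any finite number is workable.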


jadchaar · 2020-03-03T17:32:39Z

We definitely need to figure out a way to make the regex simpler and more general. It would be nice to allow an arbitrary number of punctuation marks rather than hardcoding a limit.

A starting word boundary of (?<![\S]) and an ending word boundary of (?![\w]) could be a possibility.
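
A rough sketch of how those two boundaries behave, assuming a hard-coded DD.MM.YYYY pattern for illustration and not the regex arrow actually generates from a format string (DATE and find_date are hypothetical names):

```python
import re

# Hypothetical stand-in for the pattern generated from "DD.MM.YYYY".
DATE = r"\d{2}\.\d{2}\.\d{4}"

# Start boundary: the date must not be preceded by a non-whitespace character.
# End boundary: the date must not be followed by a word character.
BOUNDED = re.compile(r"(?<![\S])" + DATE + r"(?![\w])")


def find_date(text):
    """Return the first bounded date substring in text, or None."""
    match = BOUNDED.search(text)
    return match.group(0) if match else None


# Any amount of trailing punctuation is tolerated by the end boundary.
print(find_date("This date has punctuation after it 11.11.2011..."))
print(find_date('He said, "The date is 11.11.2011."'))

# Leading punctuation is rejected by the start boundary as written.
print(find_date("see (11.11.2011)."))

# A trailing word character is rejected by the end boundary.
print(find_date("id 11.11.2011abc"))
```

With these boundaries no punctuation count is hardcoded on the trailing side, but a date wrapped in parentheses, which parsed in 0.13.0 per the session quoted earlier, would still need a looser start boundary.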

jadchaar added the enhancement label Dec 9, 2019

jadchaar changed the title ~~Handling of punctuation in string parsing~~ Improve string parsing of natural language string with punctuation Dec 21, 2019

jadchaar changed the title ~~Improve string parsing of natural language string with punctuation~~ Improve parsing of natural language string with punctuation Dec 21, 2019

anishnya mentioned this issue May 18, 2021

Add wildcard as a supported token #976

Open
