testMonths fails #54

zed · 2014-10-11T04:41:37Z

$ python3 run_tests.py parsedatetime
..................F..............................
======================================================================
FAIL: testMonths (parsedatetime.tests.TestUnits.test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./parsedatetime/tests/TestUnits.py", line 107, in testMonths
    self.assertTrue(_compareResults(self.cal.parse('1 month',  start), (target, 1)))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 49 tests in 0.284s

FAILED (failures=1)

The text was updated successfully, but these errors were encountered:

zed · 2014-10-11T08:50:59Z

Whether the test succeeds depends on PYTHONHASHSEED envvar e.g., if it equals 12 then all tests pass:

$ PYTHONHASHSEED=12 python3 run_tests.py
.....................................................
----------------------------------------------------------------------
Ran 53 tests in 0.292s

OK

but with PYTHONHASHSEED=123 testMonths fails. The hash value is also may depend on the python version. The above values are for Python 3.4.

The result (as expected) doesn't depend on the PyICU version installed. I've tried with 1.5 and 1.8.

zed · 2014-10-11T09:26:10Z

testMonths continues to fail on the earliest parsedatetime version that I can run 7c5e6a0

fake-name · 2014-10-19T12:55:38Z

Can confirm this issue. I'm seeing strings like 1 month ago fail to be parsed at all in the current version from pip.

Edit: Just checked out the HEAD. I'm seeing the exact same behaviour as zed describes.

fake-name · 2014-10-19T13:05:26Z

The output of self.cal.parse('1 month', start) is (time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=3, tm_sec=0, tm_wday=6, tm_yday=292, tm_isdst=-1), 2), so it looks like the parser is incorrectly classifying the string as a time, rather then a date.

This is supported by comparing the output of self.cal.parse('1 month', start) and start:

Parsed:  
time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=4, tm_sec=25, tm_wday=6, tm_yday=292, tm_isdst=-1)
Now:  
time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=3, tm_sec=25, tm_wday=6, tm_yday=292, tm_isdst=-1)

So, it appears that the parser is incorrectly parsing the "1" as a minute quantity with a positive offset.

fake-name · 2014-10-19T13:13:31Z

Ok, I turned on lots of logging. Interesting results:

herp@mainnas:/media/Storage/Scripts/parsedatetime$ python3 run_tests.py parsedatetime.tests.TestUnits.test.testMonths
daysInMonth(11, 2014)

parse()
parse (top of loop): [1 month][]
parse 1 [][][]
parse 2 [][][]
CRE_UNITS matched
parse 3 [1 m][][onth]
parse 4 [1 m][][onth]
parse 5 [1 m][][onth]
parse 6 [1 m][][onth]
parse 7 [1 m][][onth]
parse 8 [1 m][][onth]
parse 9 [1 m][][onth]
parse A [1 m][][onth]
parse (bottom) [ onth][1 m][][onth]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False
dayStr False, modifier False, modifier2 False, units True, qunits False
_evalString(1 m, time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=12, tm_sec=49, tm_wday=6, tm_yday=292, tm_isdst=-1))
unitsFlag is set
_buildTime: [1 ][][m]
units m --> realunit minutes
parse (top of loop): [ onth][]
parse 1 [][][]
parse 2 [][][]
parse 3 [][][]
parse 4 [][][]
parse 5 [][][]
parse 6 [][][]
parse 7 [][][]
parse 8 [][][]
parse 9 [][][]
parse A [][][]
parse (bottom) [][][][]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False
dayStr False, modifier False, modifier2 False, units False, qunits False
parse() return dateFlag 0 timeFlag 2 totalTime time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=13, tm_sec=49, tm_wday=6, tm_yday=292, tm_isdst=-1)
F
======================================================================
FAIL: testMonths (parsedatetime.tests.TestUnits.test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/media/Storage/Scripts/parsedatetime/parsedatetime/tests/TestUnits.py", line 110, in testMonths
    self.assertTrue(_compareResults(self.cal.parse('1 month',  start), (target, 1)))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.144s

So it looks like, /sometimes only/, the "m" is being interpreted as the prefix for "minutes", rather then as part of the value "months".

It always happens in the parse 3 stage.

fake-name · 2014-10-19T13:28:45Z

Ok, I've figured out where the issue is.

Basically, it appears that the ordering of the contents of self.locale.re_values['numbers'] is non-determinstic. self.locale.re_values['numbers'] contains both the literal m and months.

When the literal m sorts before months, the test fails, because the regex binds to the m preferentially. When the months sorts first, everything works fine.

I think the solution here would be to add a word boundary match to the self.RE_UNITS regex, so the regex has to match an entire word, rather then just the starting boundary and the first n characters.

In other words:

        self.RE_UNITS     = r'''(?P<qty>(-?(\b(%(numbers)s)\b|\d+)\s*
                                         (?P<units>((\b%(units)s)s?))
                                        ))''' % self.locale.re_values

Becomes

        self.RE_UNITS     = r'''(?P<qty>(-?(\b(%(numbers)s)\b|\d+)\s*
                                         (?P<units>((\b(%(units)s))s?\b))
                                        ))''' % self.locale.re_values

Note that I had to add an additional set of parenthesis, because as it is, (\b%(units)s) is evaluating to (\bsecond|seconds|sec|s|month|months|mth|year|years|yr|y|hour|hours|hr|h|minute|minutes|min|m|week|weeks|wk|w|day|days|dy|d), and as such only the first item is actually even using the \b boundary.

The boundary grouping issue is also occuring in the RE_QUNITS regex, so there are certain conditions where it improperly matches as well:

parse() return dateFlag 1 timeFlag 0 totalTime time.struct_time(tm_year=2014, tm_mon=11, tm_mday=19, tm_hour=6, tm_min=34, tm_sec=0, tm_wday=2, tm_yday=323, tm_isdst=-1)
parse()
parse (top of loop): [1month][]
parse 1 [][][]
parse 2 [][][]
M None
parse 3 [][][]
CRE_QUNITS matched
parse 4 [1m][][onth]
parse 5 [1m][][onth]
parse 6 [1m][][onth]
parse 7 [1m][][onth]
parse 8 [1m][][onth]
parse 9 [1m][][onth]
parse A [1m][][onth]
parse (bottom) [ onth][1m][][onth]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False

(with PYTHONHASHSEED=152)

fake-name · 2014-10-19T13:48:12Z

Actually, considering some of the comments, I think the core of the issue might be the values of self.locale.re_values['units'] which both seems to include single char date characters, which contradicts some of the comments in the code:

units year|years|yr|y|day|days|dy|d|hour|hours|hr|h|minute|minutes|min|m|second|seconds|sec|s|week|weeks|wk|w|month|months|mth


        # Given string is a time string with units like "5 hrs 30 min"
        if self.unitsFlag:
--- SNIP ---
        # Given string is a time string with single char units like "5 h 30 m"
        if self.qunitsFlag:

fake-name · 2014-10-19T13:58:21Z

Ok, so the root of the issue is that single-character duration labels are getting into self.locale.units somehow.

I filtered them out:

(line 2226... in __init__.py)

            l = []
            for s in self.locale.units:
                l = l + self.locale.units[s]
---- to ----
            l = []
            for s in self.locale.units:
                l = l + [x for x in self.locale.units[s] if len(x) > 1]

And it fixes the issue initially described. I still see periodic failures in the 1month match, but I suspect that's a deeper issue with the regex.

Sorry if this thread has gotten a bit long. This is an interesting problem, and it's affecting a project I'm working on, so I've been poking at it.

fake-name · 2014-10-19T14:05:22Z

Ok, the fix as I described above fixes the month parsing issue, but breaks a number of the other tests, primarily those using the single-char time notation.

I suspect what it's actually doing is preventing the self.RE_UNITS regex from occationally working in some contexts, which is exposing an existing issue with the self.RE_QUNITS regex that was already there, and just papered over by the accidental functionality of the self.RE_UNITS regex, but the regexes are melting my brain at this point, so I'm going to stop for the moment.

fixes #54

zed · 2014-10-19T15:40:08Z

Note: locale.units is a dictionary that is why the order depends on PYTHONHASHSEED. I've created PR #57 that fixes this issue.

bear · 2014-10-19T15:58:00Z

Let me know if either of you would like to be added as contributors - your both actively using and fixing bugs.

fake-name · 2014-10-20T07:16:11Z

I'd be willing to spend some time trying to make the regex stuff more concise.

Particularly, is the use of named %(name)s string formatting a legacy compatibility thing? Moving to '{val}'.format(val=something) makes things MUCH more readable, and makes scoping much easier to understand.

The .format() method was added in python 2.6. Is < 2.6 still a support target?

bear · 2014-10-20T15:18:41Z

< 2.6 is not longer supported in the main branch - 2.7+ and 3.0 are it's goals

fake-name · 2014-10-20T17:27:44Z

Ok, cool. The .format() method was introduced in 2.6 anyways, so it's a bit of a moot point.

zed changed the title ~~testMonths fails on Python 3~~ testMonths fails Oct 11, 2014

bear closed this as completed in 7917b75 Oct 19, 2014

bear added a commit that referenced this issue Oct 19, 2014

Merge pull request #57 from zed/issue-54

4c9a55f

fixes #54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

testMonths fails #54

testMonths fails #54

zed commented Oct 11, 2014

zed commented Oct 11, 2014

zed commented Oct 11, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

zed commented Oct 19, 2014

bear commented Oct 19, 2014

fake-name commented Oct 20, 2014

bear commented Oct 20, 2014

fake-name commented Oct 20, 2014

testMonths fails #54

testMonths fails #54

Comments

zed commented Oct 11, 2014

zed commented Oct 11, 2014

zed commented Oct 11, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

fake-name commented Oct 19, 2014

zed commented Oct 19, 2014

bear commented Oct 19, 2014

fake-name commented Oct 20, 2014

bear commented Oct 20, 2014

fake-name commented Oct 20, 2014