Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testMonths fails #54

Closed
zed opened this issue Oct 11, 2014 · 14 comments
Closed

testMonths fails #54

zed opened this issue Oct 11, 2014 · 14 comments

Comments

@zed
Copy link
Contributor

zed commented Oct 11, 2014

$ python3 run_tests.py parsedatetime
..................F..............................
======================================================================
FAIL: testMonths (parsedatetime.tests.TestUnits.test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./parsedatetime/tests/TestUnits.py", line 107, in testMonths
    self.assertTrue(_compareResults(self.cal.parse('1 month',  start), (target, 1)))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 49 tests in 0.284s

FAILED (failures=1)
@zed
Copy link
Contributor Author

zed commented Oct 11, 2014

Whether the test succeeds depends on PYTHONHASHSEED envvar e.g., if it equals 12 then all tests pass:

$ PYTHONHASHSEED=12 python3 run_tests.py
.....................................................
----------------------------------------------------------------------
Ran 53 tests in 0.292s

OK

but with PYTHONHASHSEED=123 testMonths fails. The hash value is also may depend on the python version. The above values are for Python 3.4.

The result (as expected) doesn't depend on the PyICU version installed. I've tried with 1.5 and 1.8.

@zed
Copy link
Contributor Author

zed commented Oct 11, 2014

testMonths continues to fail on the earliest parsedatetime version that I can run 7c5e6a0

@zed zed changed the title testMonths fails on Python 3 testMonths fails Oct 11, 2014
@fake-name
Copy link
Contributor

Can confirm this issue. I'm seeing strings like 1 month ago fail to be parsed at all in the current version from pip.

Edit: Just checked out the HEAD. I'm seeing the exact same behaviour as zed describes.

@fake-name
Copy link
Contributor

The output of self.cal.parse('1 month', start) is (time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=3, tm_sec=0, tm_wday=6, tm_yday=292, tm_isdst=-1), 2), so it looks like the parser is incorrectly classifying the string as a time, rather then a date.

This is supported by comparing the output of self.cal.parse('1 month', start) and start:

Parsed:  
time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=4, tm_sec=25, tm_wday=6, tm_yday=292, tm_isdst=-1)
Now:  
time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=3, tm_sec=25, tm_wday=6, tm_yday=292, tm_isdst=-1)

So, it appears that the parser is incorrectly parsing the "1" as a minute quantity with a positive offset.

@fake-name
Copy link
Contributor

Ok, I turned on lots of logging. Interesting results:

herp@mainnas:/media/Storage/Scripts/parsedatetime$ python3 run_tests.py parsedatetime.tests.TestUnits.test.testMonths
daysInMonth(11, 2014)

parse()
parse (top of loop): [1 month][]
parse 1 [][][]
parse 2 [][][]
CRE_UNITS matched
parse 3 [1 m][][onth]
parse 4 [1 m][][onth]
parse 5 [1 m][][onth]
parse 6 [1 m][][onth]
parse 7 [1 m][][onth]
parse 8 [1 m][][onth]
parse 9 [1 m][][onth]
parse A [1 m][][onth]
parse (bottom) [ onth][1 m][][onth]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False
dayStr False, modifier False, modifier2 False, units True, qunits False
_evalString(1 m, time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=12, tm_sec=49, tm_wday=6, tm_yday=292, tm_isdst=-1))
unitsFlag is set
_buildTime: [1 ][][m]
units m --> realunit minutes
parse (top of loop): [ onth][]
parse 1 [][][]
parse 2 [][][]
parse 3 [][][]
parse 4 [][][]
parse 5 [][][]
parse 6 [][][]
parse 7 [][][]
parse 8 [][][]
parse 9 [][][]
parse A [][][]
parse (bottom) [][][][]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False
dayStr False, modifier False, modifier2 False, units False, qunits False
parse() return dateFlag 0 timeFlag 2 totalTime time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=13, tm_sec=49, tm_wday=6, tm_yday=292, tm_isdst=-1)
F
======================================================================
FAIL: testMonths (parsedatetime.tests.TestUnits.test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/media/Storage/Scripts/parsedatetime/parsedatetime/tests/TestUnits.py", line 110, in testMonths
    self.assertTrue(_compareResults(self.cal.parse('1 month',  start), (target, 1)))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.144s

So it looks like, /sometimes only/, the "m" is being interpreted as the prefix for "minutes", rather then as part of the value "months".

It always happens in the parse 3 stage.

@fake-name
Copy link
Contributor

Ok, I've figured out where the issue is.

Basically, it appears that the ordering of the contents of self.locale.re_values['numbers'] is non-determinstic. self.locale.re_values['numbers'] contains both the literal m and months.

When the literal m sorts before months, the test fails, because the regex binds to the m preferentially. When the months sorts first, everything works fine.

I think the solution here would be to add a word boundary match to the self.RE_UNITS regex, so the regex has to match an entire word, rather then just the starting boundary and the first n characters.

In other words:

        self.RE_UNITS     = r'''(?P<qty>(-?(\b(%(numbers)s)\b|\d+)\s*
                                         (?P<units>((\b%(units)s)s?))
                                        ))''' % self.locale.re_values

Becomes

        self.RE_UNITS     = r'''(?P<qty>(-?(\b(%(numbers)s)\b|\d+)\s*
                                         (?P<units>((\b(%(units)s))s?\b))
                                        ))''' % self.locale.re_values

Note that I had to add an additional set of parenthesis, because as it is, (\b%(units)s) is evaluating to (\bsecond|seconds|sec|s|month|months|mth|year|years|yr|y|hour|hours|hr|h|minute|minutes|min|m|week|weeks|wk|w|day|days|dy|d), and as such only the first item is actually even using the \b boundary.

The boundary grouping issue is also occuring in the RE_QUNITS regex, so there are certain conditions where it improperly matches as well:

parse() return dateFlag 1 timeFlag 0 totalTime time.struct_time(tm_year=2014, tm_mon=11, tm_mday=19, tm_hour=6, tm_min=34, tm_sec=0, tm_wday=2, tm_yday=323, tm_isdst=-1)
parse()
parse (top of loop): [1month][]
parse 1 [][][]
parse 2 [][][]
M None
parse 3 [][][]
CRE_QUNITS matched
parse 4 [1m][][onth]
parse 5 [1m][][onth]
parse 6 [1m][][onth]
parse 7 [1m][][onth]
parse 8 [1m][][onth]
parse 9 [1m][][onth]
parse A [1m][][onth]
parse (bottom) [ onth][1m][][onth]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False

(with PYTHONHASHSEED=152)

@fake-name
Copy link
Contributor

Actually, considering some of the comments, I think the core of the issue might be the values of self.locale.re_values['units'] which both seems to include single char date characters, which contradicts some of the comments in the code:

units year|years|yr|y|day|days|dy|d|hour|hours|hr|h|minute|minutes|min|m|second|seconds|sec|s|week|weeks|wk|w|month|months|mth

        # Given string is a time string with units like "5 hrs 30 min"
        if self.unitsFlag:
--- SNIP ---
        # Given string is a time string with single char units like "5 h 30 m"
        if self.qunitsFlag:

@fake-name
Copy link
Contributor

Ok, so the root of the issue is that single-character duration labels are getting into self.locale.units somehow.

I filtered them out:

(line 2226... in __init__.py)

            l = []
            for s in self.locale.units:
                l = l + self.locale.units[s]
---- to ----
            l = []
            for s in self.locale.units:
                l = l + [x for x in self.locale.units[s] if len(x) > 1]

And it fixes the issue initially described. I still see periodic failures in the 1month match, but I suspect that's a deeper issue with the regex.


Sorry if this thread has gotten a bit long. This is an interesting problem, and it's affecting a project I'm working on, so I've been poking at it.

@fake-name
Copy link
Contributor

Ok, the fix as I described above fixes the month parsing issue, but breaks a number of the other tests, primarily those using the single-char time notation.

I suspect what it's actually doing is preventing the self.RE_UNITS regex from occationally working in some contexts, which is exposing an existing issue with the self.RE_QUNITS regex that was already there, and just papered over by the accidental functionality of the self.RE_UNITS regex, but the regexes are melting my brain at this point, so I'm going to stop for the moment.

@bear bear closed this as completed in 7917b75 Oct 19, 2014
bear added a commit that referenced this issue Oct 19, 2014
@zed
Copy link
Contributor Author

zed commented Oct 19, 2014

Note: locale.units is a dictionary that is why the order depends on PYTHONHASHSEED. I've created PR #57 that fixes this issue.

@bear
Copy link
Owner

bear commented Oct 19, 2014

Let me know if either of you would like to be added as contributors - your both actively using and fixing bugs.

@fake-name
Copy link
Contributor

I'd be willing to spend some time trying to make the regex stuff more concise.

Particularly, is the use of named %(name)s string formatting a legacy compatibility thing? Moving to '{val}'.format(val=something) makes things MUCH more readable, and makes scoping much easier to understand.

The .format() method was added in python 2.6. Is < 2.6 still a support target?

@bear
Copy link
Owner

bear commented Oct 20, 2014

< 2.6 is not longer supported in the main branch - 2.7+ and 3.0 are it's goals

@fake-name
Copy link
Contributor

Ok, cool. The .format() method was introduced in 2.6 anyways, so it's a bit of a moot point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants