-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13168: [C++][R] Enable runtime timezone database for Windows #12536
Conversation
So timezone database seems to work, but methods that rely on
These do work if you set |
d8aeb45
to
2171629
Compare
497aab6
to
ab5d038
Compare
CI failure is unrelated Flight error. |
cc @pitrou |
@rem Download IANA Timezone Database for unit tests | ||
@rem | ||
@rem (Doc section: Download timezone database) | ||
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output tzdata.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this always be the same DB that R and Python will be using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that 2021e
is a version, which won't necessarily align with the R and Python ones in the future. I don't think it matters that we update it, unless the format changes somehow. This is just for testing that it works, and we don't ship it.
But the R unit tests use the one provided by the tzdb package, so we are testing that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently US and EU have DST, but they plan to abolish it soon. At that point we will have Python and R DBs with fresh DSTless times and arrow c++ using DST for tests. In isolation that's ok, but we do have tests comparing pandas and pyarrow results and similar for lubridate. It's not a big problem for sure, but if there is like a tzdata-latest.tar.gz that would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The times in the past won't change, so unless we test against a random "now", updates to the timezone database (such as for DST policy changes) shouldn't impact those tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this should be fine as is. We will never be testing between timezone databases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently test arrow vs pandas tzdb in CI. We have a test for a timestamp in 2033 and if it is in DST and DST is abolished it will error if we'll be using a pre-abolishment db with arrow and post-abolishment db with pandas. This is super irrelevant and as it's a simple fix and I'm sorry for wasting your time :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah got it. We'll address in the follow up PR where we'll have PyArrow use tzdb. For now the python timestamp tests are skipped on Windows
arrow/python/pyarrow/tests/test_compute.py
Lines 1915 to 1917 in d327f69
if sys.platform == 'win32': | |
# TODO: We should test on windows once ARROW-13168 is resolved. | |
pytest.skip('Timezone database is not available on Windows yet') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this @wjones127 . Here are some comments.
.. literalinclude:: ../../../ci/appveyor-cpp-setup.bat | ||
:language: cmd | ||
:start-after: @rem (Doc section: Download timezone database) | ||
:end-before: @rem (Doc section: Download timezone database) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather if you copied the relevant snippet here, we shouldn't ideally rely on the contents of CI scripts (which may contain specific quirks that irrelevant to normal user setups) for the public docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are pulling a specific part of the script, we should easily be able to avoid quirks? I like the idea of having our build instructions tested in CI. And we reference the CI scripts regularly when providing build instructions.
At the very least, if we get to a point where the public instructions and CI instructions diverge, we can easily separate them.
@@ -1875,6 +1870,9 @@ TEST_F(ScalarTemporalTest, StrftimeCLocale) { | |||
} | |||
|
|||
TEST_F(ScalarTemporalTest, StrftimeOtherLocale) { | |||
#ifdef _WIN32 | |||
GTEST_SKIP() << "There is a known bug in strftime for locales on Windows (ARROW-15922)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this on Windows or specifically MinGW? i.e., would the non-MinGW Windows CI pass if you remove this skip? I'm wondering if we can narrow the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If failed on Windows 2019 C++17, which uses MSVC. But seems like it was actually passing on MinGW. So maybe I can try skipping for just MSVC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bit weird, MinGW is just a different compiler but targetting the same runtime libraries...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think the test is likely skipping on the LocaleExists("fr_FR.UTF-8")
condition; IIRC MinGW doesn't support locales apart from "C" and "POSIX". Sadly, no indication in CI whether the test was skipped: https://github.com/apache/arrow/runs/5504310182?check_suite_focus=true
:start-after: @rem (Doc section: Download timezone database) | ||
:end-before: @rem (Doc section: Download timezone database) | ||
|
||
By default, the timezone database will be detected at ``%USERPROFILE%\Downloads\tzdata``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a reliable location? Is there a risk that the downloads folder gets cleared from time to time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the behavior of the vendored datetime library, not something we chose. I think it's a fine default for testing and in production applications I expect users will manually specify a more appropriate path at runtime.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things I noticed in CI
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
set -ex | ||
|
||
# Download database | ||
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output ~/Downloads/tzdata2021e.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, they just released 2022a a few days ago
Co-authored-by: Davis Vaughan <davis@rstudio.com>
Are there any outstanding comments that we need to resolve before merging? The failures in CI both look unrelated. I'm happy to merge if no one objects |
Benchmark runs are scheduled for baseline = 919d113 and contender = f4dfd6c. f4dfd6c is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Follow-up Jira created: https://issues.apache.org/jira/browse/ARROW-16054 |
This allows for runtime configuration of the timezone database on Windows for C++ and R. Python will be handled later because it's available timezone libraries use the binary rather than text format, which is not yet supported the vendored date library.
For R, Windows will only support the "C" locale, since (as far as I can tell) that's the only locale supported by the MingW std::locale implementation. I think R itself gets around this by implementing a completely custom version of
strftime()
and friends.