ARROW-13168: [C++][R] Enable runtime timezone database for Windows #12536

Closed

wjones127
Member

@wjones127 wjones127 commented Mar 1, 2022

This allows for runtime configuration of the timezone database on Windows for C++ and R. Python will be handled later because its available timezone libraries use the binary rather than text format, which is not yet supported by the vendored date library.
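
To illustrate the binary versus text distinction mentioned above, here is a minimal sketch (not part of this PR). It relies on the fact that compiled zoneinfo files start with the four-byte magic `TZif`, while the IANA source files are plain text. The helper name `tzdb_file_kind` is hypothetical.

```python
TZIF_MAGIC = b"TZif"  # leading bytes of a compiled (binary) zoneinfo file


def tzdb_file_kind(path):
    """Classify a timezone file as 'binary' (TZif) or 'text' (IANA source).

    Hypothetical helper for illustration only; it inspects nothing but the
    first four bytes of the file.
    """
    with open(path, "rb") as f:
        head = f.read(len(TZIF_MAGIC))
    return "binary" if head == TZIF_MAGIC else "text"
```

A loader that only understands the text format could use a check like this to reject binary files early with a clear error message.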

For R, Windows will only support the "C" locale, since (as far as I can tell) that's the only locale supported by the MinGW std::locale implementation. I think R itself gets around this by implementing a completely custom version of strftime() and friends.

@github-actions

github-actions bot commented Mar 1, 2022

@wjones127
Member Author

So the timezone database seems to work, but methods that rely on std::locale currently error with:

Error (test-dplyr-funcs-datetime.R:344:3): extract month from timestamp
Error: Invalid: Cannot find locale 'English_United States.1252': locale::facet::_S_create_c_locale name not valid
C:/Users/voltron/arrow/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc:1085  GetLocale(options.locale)
C:/Users/voltron/arrow/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc:1105  Make(ctx, *in.type)
C:/Users/voltron/arrow/cpp/src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, &out)
C:/Users/voltron/arrow/cpp/src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
C:/Users/voltron/arrow/cpp/src/arrow/compute/exec/expression.cc:547  executor->Execute(arguments, &listener)
C:/Users/voltron/arrow/cpp/src/arrow/compute/exec/expression.cc:533  ExecuteScalarExpression(call->arguments[i], input, exec_context)
C:/Users/voltron/arrow/cpp/src/arrow/compute/exec/project_node.cc:91  ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
C:/Users/voltron/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
C:/Users/voltron/arrow/cpp/src/arrow/record_batch.cc:336  ReadNext(&batch)
C:/Users/voltron/arrow/cpp/src/arrow/record_batch.cc:347  ReadAll(&batches)

These do work if you set Sys.setlocale("LC_TIME", "C"). If I don't find a fix for this, I may consider only supporting the "C" locale on Windows.

@wjones127 wjones127 force-pushed the ARROW-13168-timezone-database branch from d8aeb45 to 2171629 Compare March 8, 2022 18:08
@wjones127 wjones127 force-pushed the ARROW-13168-timezone-database branch from 497aab6 to ab5d038 Compare March 10, 2022 22:27
@wjones127
Member Author

The CI failure is an unrelated Flight error.

@wjones127
Member Author

cc @pitrou

@rem Download IANA Timezone Database for unit tests
@rem
@rem (Doc section: Download timezone database)
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output tzdata.tar.gz
Member

Will this always be the same DB that R and Python will be using?

Member Author

No, 2021e is a specific version, which won't necessarily align with the R and Python ones in the future. I don't think it matters whether we update it, unless the format changes somehow. This is just for testing that it works, and we don't ship it.

But the R unit tests use the one provided by the tzdb package, so we are testing that as well.

Member

Currently US and EU have DST, but they plan to abolish it soon. At that point we will have Python and R DBs with fresh DSTless times and arrow c++ using DST for tests. In isolation that's ok, but we do have tests comparing pandas and pyarrow results and similar for lubridate. It's not a big problem for sure, but if there is like a tzdata-latest.tar.gz that would be great.

Member

The times in the past won't change, so unless we test against a random "now", updates to the timezone database (such as for DST policy changes) shouldn't impact those tests?

Member Author

Yeah this should be fine as is. We will never be testing between timezone databases.

Member

We currently test arrow vs pandas tzdb in CI. We have a test for a timestamp in 2033, and if it falls in DST and DST is abolished, it will fail when Arrow uses a pre-abolishment database and pandas uses a post-abolishment one. This is mostly irrelevant since it's a simple fix, and I'm sorry for wasting your time :).
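
The point debated in this thread can be sketched with a toy model (an illustration, not Arrow or pandas code). A timezone database is modelled as a sorted list of UTC transition instants and offsets. `OLD_DB`, `NEW_DB`, and `offset_hours` are hypothetical names, and the table values are illustrative rather than taken from any tzdata release.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def utc(year, month, day, hour=0):
    """Build a timezone-aware UTC datetime."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# Toy transition tables: (UTC instant the offset takes effect, offset in hours).
OLD_DB = [
    (utc(2021, 1, 1), -5),
    (utc(2021, 3, 14, 7), -4),   # DST starts
    (utc(2021, 11, 7, 6), -5),   # DST ends
    (utc(2033, 3, 13, 7), -4),   # old rules still predict DST in 2033
    (utc(2033, 11, 6, 6), -5),
]
# Hypothetical newer release in which DST no longer applies to future years.
NEW_DB = OLD_DB[:3]


def offset_hours(db, instant):
    """Return the UTC offset in effect at `instant` according to `db`."""
    idx = bisect_right([start for start, _ in db], instant) - 1
    if idx < 0:
        raise ValueError("instant precedes the first transition in the table")
    return db[idx][1]


past = utc(2021, 7, 1)
future = utc(2033, 7, 1)
print(offset_hours(OLD_DB, past), offset_hours(NEW_DB, past))      # -4 -4
print(offset_hours(OLD_DB, future), offset_hours(NEW_DB, future))  # -4 -5
```

In this model, past instants resolve identically in both tables, while a future instant such as the 2033 one can differ, which is the only case where comparing results across two database versions would fail.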

Member Author

Ah, got it. We'll address this in the follow-up PR where we'll have PyArrow use tzdb. For now, the Python timestamp tests are skipped on Windows:

if sys.platform == 'win32':
    # TODO: We should test on windows once ARROW-13168 is resolved.
    pytest.skip('Timezone database is not available on Windows yet')

Member

@pitrou pitrou left a comment

Thanks a lot for this @wjones127 . Here are some comments.

Comment on lines +176 to +179
.. literalinclude:: ../../../ci/appveyor-cpp-setup.bat
   :language: cmd
   :start-after: @rem (Doc section: Download timezone database)
   :end-before: @rem (Doc section: Download timezone database)
Member

I'd rather you copied the relevant snippet here. Ideally, we shouldn't rely on the contents of CI scripts (which may contain specific quirks that are irrelevant to normal user setups) for the public docs.

Member Author

Since we are pulling a specific part of the script, we should easily be able to avoid quirks? I like the idea of having our build instructions tested in CI. And we reference the CI scripts regularly when providing build instructions.

At the very least, if we get to a point where the public instructions and CI instructions diverge, we can easily separate them.
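
For readers unfamiliar with the marker approach discussed here, the following is a rough sketch of how a section can be pulled out of a script between two identical marker lines. It only approximates the `:start-after:` / `:end-before:` behaviour as I understand it and is not Sphinx's implementation; `extract_doc_section` and `SCRIPT` are hypothetical names.

```python
def extract_doc_section(text, marker):
    """Return the lines strictly between the first two lines containing `marker`."""
    lines = text.splitlines()
    hits = [i for i, line in enumerate(lines) if marker in line]
    if len(hits) < 2:
        raise ValueError("expected an opening and a closing marker line")
    start, end = hits[0], hits[1]
    return lines[start + 1:end]


# Illustrative stand-in for a CI script with a marked documentation section.
SCRIPT = """\
@rem CI-only setup that should not appear in the docs
@rem (Doc section: Download timezone database)
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output tzdata.tar.gz
@rem (Doc section: Download timezone database)
@rem more CI-only steps
"""

for line in extract_doc_section(SCRIPT, "(Doc section: Download timezone database)"):
    print(line)
```

Only the lines between the two markers reach the docs, so CI-specific steps outside the markers stay out of the published instructions.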

@@ -1875,6 +1870,9 @@ TEST_F(ScalarTemporalTest, StrftimeCLocale) {
}

TEST_F(ScalarTemporalTest, StrftimeOtherLocale) {
#ifdef _WIN32
  GTEST_SKIP() << "There is a known bug in strftime for locales on Windows (ARROW-15922)";
Member

Is this on Windows or specifically MinGW? i.e., would the non-MinGW Windows CI pass if you remove this skip? I'm wondering if we can narrow the condition.

Member Author

It failed on Windows 2019 C++17, which uses MSVC. But it seems like it was actually passing on MinGW. So maybe I can try skipping for just MSVC?

Member

That's a bit weird, MinGW is just a different compiler but targeting the same runtime libraries...

Member Author

@wjones127 wjones127 Mar 23, 2022

Actually, I think the test is likely skipping on the LocaleExists("fr_FR.UTF-8") condition; IIRC MinGW doesn't support locales apart from "C" and "POSIX". Sadly, no indication in CI whether the test was skipped: https://github.com/apache/arrow/runs/5504310182?check_suite_focus=true
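
As a rough analogue of the kind of locale-availability guard described above (written in Python rather than the C++ test helper, and not Arrow's implementation), a check could try to activate the locale and restore the previous setting afterwards. `locale_exists` and `usable_time_locale` are hypothetical names.

```python
import locale


def locale_exists(name):
    """Return True if `name` can be activated for LC_TIME on this system."""
    saved = locale.setlocale(locale.LC_TIME)  # query the current setting
    try:
        locale.setlocale(locale.LC_TIME, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # always restore
    return True


def usable_time_locale(preferred, fallback="C"):
    """Pick `preferred` when available, otherwise fall back to the "C" locale."""
    return preferred if locale_exists(preferred) else fallback


print(usable_time_locale("fr_FR.UTF-8"))
```

The printed value depends on which locales the host provides, which is also why a test guarded this way can end up skipped without any visible signal.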

:start-after: @rem (Doc section: Download timezone database)
:end-before: @rem (Doc section: Download timezone database)

By default, the timezone database will be detected at ``%USERPROFILE%\Downloads\tzdata``,
Member

Is this a reliable location? Is there a risk that the downloads folder gets cleared from time to time?

Member Author

This is the behavior of the vendored datetime library, not something we chose. I think it's a fine default for testing and in production applications I expect users will manually specify a more appropriate path at runtime.
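
The "default location unless the user specifies one" behaviour can be sketched as below. This is an illustration of the lookup order only, not the vendored library's code; `resolve_tzdata_dir` is a hypothetical helper, and `Path.home()` is used as a stand-in for `%USERPROFILE%` (which is what it resolves to on Windows).

```python
from pathlib import Path


def resolve_tzdata_dir(override=None):
    """Return the explicit path if given, else the Downloads/tzdata default."""
    if override is not None:
        return Path(override)
    return Path.home() / "Downloads" / "tzdata"


print(resolve_tzdata_dir())                   # default under the user's home
print(resolve_tzdata_dir("D:/data/tzdata"))   # explicit path chosen at runtime
```

An application that cannot rely on the Downloads folder would pass an explicit path, which is the production usage expected in the reply above.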

Member

@jonkeane jonkeane left a comment

A few things I noticed in CI

wjones127 and others added 2 commits March 24, 2022 19:40
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
set -ex

# Download database
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output ~/Downloads/tzdata2021e.tar.gz
Contributor

FWIW, they just released 2022a a few days ago

Co-authored-by: Davis Vaughan <davis@rstudio.com>
@jonkeane
Member

Are there any outstanding comments that we need to resolve before merging?

The failures in CI both look unrelated. I'm happy to merge if no one objects.

@jonkeane jonkeane closed this in f4dfd6c Mar 28, 2022
@ursabot

ursabot commented Mar 28, 2022

Benchmark runs are scheduled for baseline = 919d113 and contender = f4dfd6c. f4dfd6c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.34% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.02% ⬆️0.81%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@wjones127
Member Author

Follow-up Jira created: https://issues.apache.org/jira/browse/ARROW-16054
