ARROW-7288: [C++][Parquet] Don't use regular expression to parse application version #9367

kou · 2021-01-30T01:57:10Z

std::regex provided by MinGW may take a long time with a Japanese locale on
Windows.

We can use std::regex, boost::regex or RE2 as the regular expression
engine for this, but RE2's syntax isn't compatible with the others. If
we support all of them, we need to maintain multiple regular
expressions, which increases maintenance cost. If we don't use regular
expressions, we don't need to think about regular expression engines,
but we need to maintain a hand-written parser.

github-actions · 2021-01-30T01:57:37Z

https://issues.apache.org/jira/browse/ARROW-7288

nealrichardson

I like this change. It seemed like overkill to require a regex library only to handle parsing a version string.

I guess there is some risk that some funky file out there has a version string we'll fail to parse, but (1) in our code we only seem to care about the major.minor.patch version, which is trivial to parse and covered in tests here, and (2) parquet-mr hasn't changed its code for writing version strings since 2015, and that was just to add prerelease info (which we never check after parsing). So it seems safe to me.
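
To illustrate why the major.minor.patch part is simple to parse by hand, here is a minimal sketch. It is not the code from this PR: `SimpleVersion`, `ReadNumber`, `ReadChar` and `ParseSimpleVersion` are hypothetical names, the input is assumed to be ASCII, and unlike the PR (which also accepts MAJOR-only and MAJOR.MINOR-only strings) it requires all three components.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical holder for the three numeric components (not Arrow's type).
struct SimpleVersion {
  int major_part = 0;
  int minor_part = 0;
  int patch_part = 0;
};

// Reads a run of ASCII digits starting at `pos` and advances `pos` past it.
// Returns false if there is no digit at `pos` or the value overflows int.
bool ReadNumber(const std::string& s, size_t& pos, int& out) {
  assert(pos <= s.size());
  const size_t begin = pos;
  int value = 0;
  while (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') {
    const int digit = s[pos] - '0';
    if (value > (INT_MAX - digit) / 10) return false;  // would overflow
    value = value * 10 + digit;
    ++pos;
  }
  if (pos == begin) return false;
  out = value;
  return true;
}

// Consumes `expected` at `pos` if it is there; otherwise leaves `pos` alone.
bool ReadChar(const std::string& s, size_t& pos, char expected) {
  if (pos >= s.size() || s[pos] != expected) return false;
  ++pos;
  return true;
}

// Parses a leading "MAJOR.MINOR.PATCH"; any trailing text is ignored.
// `out` is only written when parsing succeeds.
bool ParseSimpleVersion(const std::string& s, SimpleVersion& out) {
  size_t pos = 0;
  SimpleVersion v;
  if (!ReadNumber(s, pos, v.major_part)) return false;
  if (!ReadChar(s, pos, '.')) return false;
  if (!ReadNumber(s, pos, v.minor_part)) return false;
  if (!ReadChar(s, pos, '.')) return false;
  if (!ReadNumber(s, pos, v.patch_part)) return false;
  out = v;
  return true;
}
```

Comparing characters against '0' and '9' directly, rather than calling a locale-aware classification function, keeps the behavior independent of the process locale.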

cpp/src/parquet/metadata.cc

cpp/cmake_modules/ThirdpartyToolchain.cmake

kou · 2021-01-30T21:56:23Z

@github-actions crossbow submit -g nightly

github-actions · 2021-01-30T21:57:09Z

Revision: 076863b

Submitted crossbow builds: ursacomputing/crossbow @ actions-59

Task	Status
centos-7-amd64
centos-8-amd64
conda-clean
conda-linux-gcc-py36-aarch64
conda-linux-gcc-py36-cpu-r36
conda-linux-gcc-py36-cuda
conda-linux-gcc-py37-aarch64
conda-linux-gcc-py37-cpu-r40
conda-linux-gcc-py37-cuda
conda-linux-gcc-py38-aarch64
conda-linux-gcc-py38-cpu
conda-linux-gcc-py38-cuda
conda-linux-gcc-py39-aarch64
conda-linux-gcc-py39-cpu
conda-linux-gcc-py39-cuda
conda-osx-clang-py36-r36
conda-osx-clang-py37-r40
conda-osx-clang-py38
conda-osx-clang-py39
conda-win-vs2017-py36-r36
conda-win-vs2017-py37-r40
conda-win-vs2017-py38
debian-buster-amd64
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
gandiva-jar-osx
gandiva-jar-ubuntu
homebrew-cpp
homebrew-r-autobrew
nuget
python-sdist
test-conda-cpp
test-conda-cpp-valgrind
test-conda-python-3.6
test-conda-python-3.6-pandas-0.23
test-conda-python-3.7
test-conda-python-3.7-dask-latest
test-conda-python-3.7-hdfs-3.2
test-conda-python-3.7-kartothek-latest
test-conda-python-3.7-kartothek-master
test-conda-python-3.7-pandas-latest
test-conda-python-3.7-pandas-master
test-conda-python-3.7-spark-branch-3.0
test-conda-python-3.7-turbodbc-latest
test-conda-python-3.7-turbodbc-master
test-conda-python-3.8
test-conda-python-3.8-dask-master
test-conda-python-3.8-hypothesis
test-conda-python-3.8-jpype
test-conda-python-3.8-pandas-latest
test-conda-python-3.8-pandas-nightly
test-conda-python-3.8-spark-master
test-debian-10-cpp
test-debian-10-go-1.12
test-debian-10-python-3
test-debian-c-glib
test-debian-ruby
test-fedora-33-cpp
test-fedora-33-python-3
test-r-linux-as-cran
test-r-rhub-ubuntu-gcc-release
test-r-rocker-r-base-latest
test-r-rstudio-r-base-3.6-bionic
test-r-rstudio-r-base-3.6-centos7-devtoolset-8
test-r-rstudio-r-base-3.6-centos8
test-r-rstudio-r-base-3.6-opensuse15
test-r-rstudio-r-base-3.6-opensuse42
test-r-version-compatibility
test-r-versions
test-ubuntu-16.04-cpp
test-ubuntu-18.04-cpp
test-ubuntu-18.04-cpp-release
test-ubuntu-18.04-cpp-static
test-ubuntu-18.04-docs
test-ubuntu-18.04-python-3
test-ubuntu-18.04-r-sanitizer
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-14
test-ubuntu-20.04-cpp-17
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-c-glib
test-ubuntu-ruby
ubuntu-bionic-amd64
ubuntu-focal-amd64
ubuntu-groovy-amd64
ubuntu-xenial-amd64
wheel-manylinux2010-cp36m
wheel-manylinux2010-cp37m
wheel-manylinux2010-cp38
wheel-manylinux2010-cp39
wheel-manylinux2014-cp36m
wheel-manylinux2014-cp37m
wheel-manylinux2014-cp38
wheel-manylinux2014-cp39
wheel-osx-high-sierra-cp36m
wheel-osx-high-sierra-cp37m
wheel-osx-high-sierra-cp38
wheel-osx-high-sierra-cp39
wheel-osx-mavericks-cp36m
wheel-osx-mavericks-cp37m
wheel-osx-mavericks-cp38
wheel-osx-mavericks-cp39
wheel-windows-cp36m
wheel-windows-cp37m
wheel-windows-cp38
wheel-windows-cp39

pitrou

I'm happy to see this! Please see some comments below.

pitrou · 2021-02-01T17:38:31Z

cpp/src/parquet/metadata_test.cc

+  ASSERT_EQ("cdh5.5.0", version.version.pre_release);
+  ASSERT_EQ("cd", version.version.build_info);
+}
+


Can you add a test with a malformed version string?

I've added some more tests.

Note that this implementation assumes that the input encoding is ASCII. It may not work with other encodings.
Should we support non-ASCII encodings?

I think assuming ASCII is ok.

cpp/src/parquet/metadata.h

pitrou · 2021-02-01T17:48:22Z

cpp/src/parquet/metadata.cc

+ private:
+  void RemovePrecedingSpaces(const std::string& string, size_t& start,
+                             const size_t& end) {
+    while (start < end && string[start] == ' ') {


Note that \s in a regex may match other whitespace characters. But they're unlikely to appear in the created_by field anyway.

I've added \t, \v, \r, \n and \f to the supported whitespace characters, but as you said, they are unlikely to appear.
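
As a rough illustration of the whitespace handling discussed in this thread, the sketch below skips the space character plus the five characters listed above. It is not the code from this PR: `IsAsciiSpace` and `SkipLeadingWhitespace` are hypothetical names, and the helper only loosely mirrors the shape of `RemovePrecedingSpaces` shown in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for ' ' and the five other ASCII whitespace characters named in the
// thread. Explicit comparisons keep the check independent of the locale.
bool IsAsciiSpace(char c) {
  return c == ' ' || c == '\t' || c == '\v' || c == '\r' || c == '\n' ||
         c == '\f';
}

// Advances `start` past any whitespace in the half-open range [start, end).
void SkipLeadingWhitespace(const std::string& s, size_t& start, size_t end) {
  assert(end <= s.size());
  while (start < end && IsAsciiSpace(s[start])) {
    ++start;
  }
}
```

With an input such as " \t\r\nparquet-mr" and `start` initialized to 0, `start` ends up at index 4, the first non-whitespace character.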

cpp/src/parquet/metadata.cc

pitrou · 2021-02-02T10:32:29Z

Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/215736951

kou · 2021-02-02T20:08:54Z

Thanks for the boolean fix!

…ication version

std::regex provided by MinGW may take a long time with a Japanese locale on Windows.

We can use std::regex, boost::regex or RE2 as the regular expression engine for this, but RE2's syntax isn't compatible with the others. If we support all of them, we need to maintain multiple regular expressions, which increases maintenance cost. If we don't use regular expressions, we don't need to think about regular expression engines, but we need to maintain a hand-written parser.

Closes apache#9367 from kou/cpp-parquet-no-regex

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

github-actions bot added Component: C++ Component: Parquet labels Jan 30, 2021

kou mentioned this pull request Jan 30, 2021

ARROW-7288: [C++][R] read_parquet() freezes on Windows with Japanese locale #9320

Closed

nealrichardson approved these changes Jan 30, 2021

nealrichardson mentioned this pull request Jan 30, 2021

ARROW-7288: [C++] Replace boost::regex with re2 in Parquet #9332

Closed

pitrou requested changes Feb 1, 2021

kou added 7 commits February 2, 2021 06:52

Refer parquet-mr's implementation

4d6e0f4

Remove boost-regex dependency

89481af

Don't find boost::regex

e57f174

Add missing "s"

4cf2112

Add support for more spaces

ac0f802

Add support for MAJOR only and MAJOR.MINOR only cases

c85da69

kou force-pushed the cpp-parquet-no-regex branch from 076863b to c85da69 Compare February 2, 2021 05:39

Fix a bug that invalid patch version can't be detected

8a9f8b9

pitrou approved these changes Feb 2, 2021

Try to fix Travis-CI failure

a3f89f5

pitrou closed this in 3bddb01 Feb 2, 2021

kou deleted the cpp-parquet-no-regex branch February 2, 2021 20:06

asfimport mentioned this pull request Feb 4, 2021

[C++][R] read_parquet() freezes on Windows with Japanese locale #23577

Closed
