Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7288: [C++][Parquet] Don't use regular expression to parse application version #9367

Closed
wants to merge 9 commits into from

Conversation

kou
Copy link
Member

@kou kou commented Jan 30, 2021

std::regex provided by MinGW may take a long with Japanese location on
Windows.

We can use std::regex, boost::regex or RE2 as regular expression
engine for this but RE2 doesn't use compatible syntax with others. If
we support all of them, we need to maintain multiple regular
expressions. It increases maintenance cost. If we don't use regular
expression, we don't need to think about regular expression. But we
need to maintain hand-written parser.

@github-actions
Copy link

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change. It seemed like overkill to require a regex library only to handle parsing a version string.

I guess there is some risk that there is some funky file out there that we'll fail to parse its version string, but (1) in our code we only seem to care about the major.minor.patch version, which is trivial to parse and covered in tests here, and (2) parquet-mr hasn't changed its code for writing version strings since 2015, and that was just to add prerelease info (which we never check after parsing). So it seems safe to me.

cpp/src/parquet/metadata.cc Show resolved Hide resolved
cpp/cmake_modules/ThirdpartyToolchain.cmake Show resolved Hide resolved
@kou
Copy link
Member Author

kou commented Jan 30, 2021

@github-actions crossbow submit -g nightly

@github-actions
Copy link

Revision: 076863b

Submitted crossbow builds: ursacomputing/crossbow @ actions-59

Task Status
centos-7-amd64 Github Actions
centos-8-amd64 Github Actions
conda-clean Azure
conda-linux-gcc-py36-aarch64 Drone
conda-linux-gcc-py36-cpu-r36 Azure
conda-linux-gcc-py36-cuda Azure
conda-linux-gcc-py37-aarch64 Drone
conda-linux-gcc-py37-cpu-r40 Azure
conda-linux-gcc-py37-cuda Azure
conda-linux-gcc-py38-aarch64 Drone
conda-linux-gcc-py38-cpu Azure
conda-linux-gcc-py38-cuda Azure
conda-linux-gcc-py39-aarch64 Drone
conda-linux-gcc-py39-cpu Azure
conda-linux-gcc-py39-cuda Azure
conda-osx-clang-py36-r36 Azure
conda-osx-clang-py37-r40 Azure
conda-osx-clang-py38 Azure
conda-osx-clang-py39 Azure
conda-win-vs2017-py36-r36 Azure
conda-win-vs2017-py37-r40 Azure
conda-win-vs2017-py38 Azure
debian-buster-amd64 Github Actions
example-cpp-minimal-build-static Github Actions
example-cpp-minimal-build-static-system-dependency Github Actions
gandiva-jar-osx Github Actions
gandiva-jar-ubuntu Github Actions
homebrew-cpp TravisCI
homebrew-r-autobrew TravisCI
nuget Github Actions
python-sdist Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Github Actions
test-conda-python-3.6 Github Actions
test-conda-python-3.6-pandas-0.23 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-dask-latest Github Actions
test-conda-python-3.7-hdfs-3.2 Github Actions
test-conda-python-3.7-kartothek-latest Github Actions
test-conda-python-3.7-kartothek-master Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-pandas-master Github Actions
test-conda-python-3.7-spark-branch-3.0 Github Actions
test-conda-python-3.7-turbodbc-latest Github Actions
test-conda-python-3.7-turbodbc-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-dask-master Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-jpype Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-master Github Actions
test-debian-10-cpp CircleCI
test-debian-10-go-1.12 Azure
test-debian-10-python-3 Azure
test-debian-c-glib CircleCI
test-debian-ruby CircleCI
test-fedora-33-cpp CircleCI
test-fedora-33-python-3 Azure
test-r-linux-as-cran Github Actions
test-r-rhub-ubuntu-gcc-release Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-r-version-compatibility Github Actions
test-r-versions Github Actions
test-ubuntu-16.04-cpp CircleCI
test-ubuntu-18.04-cpp CircleCI
test-ubuntu-18.04-cpp-release CircleCI
test-ubuntu-18.04-cpp-static CircleCI
test-ubuntu-18.04-docs Azure
test-ubuntu-18.04-python-3 Azure
test-ubuntu-18.04-r-sanitizer Azure
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-14 Github Actions
test-ubuntu-20.04-cpp-17 Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-c-glib CircleCI
test-ubuntu-ruby Azure
ubuntu-bionic-amd64 Github Actions
ubuntu-focal-amd64 Github Actions
ubuntu-groovy-amd64 Github Actions
ubuntu-xenial-amd64 Github Actions
wheel-manylinux2010-cp36m Github Actions
wheel-manylinux2010-cp37m Github Actions
wheel-manylinux2010-cp38 Github Actions
wheel-manylinux2010-cp39 Github Actions
wheel-manylinux2014-cp36m Github Actions
wheel-manylinux2014-cp37m Github Actions
wheel-manylinux2014-cp38 Github Actions
wheel-manylinux2014-cp39 Github Actions
wheel-osx-high-sierra-cp36m Github Actions
wheel-osx-high-sierra-cp37m Github Actions
wheel-osx-high-sierra-cp38 Github Actions
wheel-osx-high-sierra-cp39 Github Actions
wheel-osx-mavericks-cp36m Github Actions
wheel-osx-mavericks-cp37m Github Actions
wheel-osx-mavericks-cp38 Github Actions
wheel-osx-mavericks-cp39 Github Actions
wheel-windows-cp36m Github Actions
wheel-windows-cp37m Github Actions
wheel-windows-cp38 Github Actions
wheel-windows-cp39 Github Actions

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to see this! Please see some comments below.

ASSERT_EQ("cdh5.5.0", version.version.pre_release);
ASSERT_EQ("cd", version.version.build_info);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test with a malformed version string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've add some more tests.

Note that this implementation assumes that input encoding is ASCII. It may not work with other encodings.
Should we support non ASCII encodings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think assuming ASCII is ok.

cpp/src/parquet/metadata.h Show resolved Hide resolved
private:
void RemovePrecedingSpaces(const std::string& string, size_t& start,
const size_t& end) {
while (start < end && string[start] == ' ') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that \s in a regex may match other whitespace characters. But they're unlikely to appear in the created_by field anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added \t, \v, \r, \n and \f to supported whitespace characters but they will not be appeared as you said.

cpp/src/parquet/metadata.cc Outdated Show resolved Hide resolved
cpp/src/parquet/metadata.cc Outdated Show resolved Hide resolved
…ication version

std::regex provided by MinGW may take a long with Japanese location on
Windows.

We can use std::regex, boost::regex or RE2 as regular expression
engine for this but RE2 doesn't use compatible syntax with others. If
we support all of them, we need to maintain multiple regular
expressions. It increases maintenance cost. If we don't use regular
expression, we don't need to think about regular expression. But we
need to maintain hand-written parser.
@pitrou
Copy link
Member

pitrou commented Feb 2, 2021

@pitrou pitrou closed this in 3bddb01 Feb 2, 2021
@kou kou deleted the cpp-parquet-no-regex branch February 2, 2021 20:06
@kou
Copy link
Member Author

kou commented Feb 2, 2021

Thanks for the boolean fix!

nevi-me pushed a commit to nevi-me/arrow that referenced this pull request Feb 13, 2021
…ication version

std::regex provided by MinGW may take a long with Japanese location on
Windows.

We can use std::regex, boost::regex or RE2 as regular expression
engine for this but RE2 doesn't use compatible syntax with others. If
we support all of them, we need to maintain multiple regular
expressions. It increases maintenance cost. If we don't use regular
expression, we don't need to think about regular expression. But we
need to maintain hand-written parser.

Closes apache#9367 from kou/cpp-parquet-no-regex

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…ication version

std::regex provided by MinGW may take a long with Japanese location on
Windows.

We can use std::regex, boost::regex or RE2 as regular expression
engine for this but RE2 doesn't use compatible syntax with others. If
we support all of them, we need to maintain multiple regular
expressions. It increases maintenance cost. If we don't use regular
expression, we don't need to think about regular expression. But we
need to maintain hand-written parser.

Closes apache#9367 from kou/cpp-parquet-no-regex

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…ication version

std::regex provided by MinGW may take a long with Japanese location on
Windows.

We can use std::regex, boost::regex or RE2 as regular expression
engine for this but RE2 doesn't use compatible syntax with others. If
we support all of them, we need to maintain multiple regular
expressions. It increases maintenance cost. If we don't use regular
expression, we don't need to think about regular expression. But we
need to maintain hand-written parser.

Closes apache#9367 from kou/cpp-parquet-no-regex

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants