ARROW-7288: [C++][Parquet] Don't use regular expression to parse application version #9367
I like this change. It seemed like overkill to require a regex library only to handle parsing a version string.
I guess there is some risk that some funky file out there has a version string we'll fail to parse, but (1) in our code we only seem to care about the major.minor.patch version, which is trivial to parse and covered in tests here, and (2) parquet-mr hasn't changed its code for writing version strings since 2015, and that was just to add prerelease info (which we never check after parsing). So it seems safe to me.
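For illustration, parsing only the major.minor.patch part without a regex could look roughly like the sketch below. This is not the code from this PR: `SimpleVersion`, `ReadNumber` and `ParseSimpleVersion` are hypothetical names, the input is assumed to be ASCII, and integer overflow is not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for the three numeric components (not Arrow's type).
struct SimpleVersion {
  int major_version = 0;
  int minor_version = 0;
  int patch_version = 0;
};

// Reads a run of ASCII digits starting at pos. Returns false if there is no
// digit at pos; on success stores the value in out and advances pos.
inline bool ReadNumber(const std::string& s, std::size_t& pos, int& out) {
  const std::size_t begin = pos;
  int value = 0;
  while (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') {
    value = value * 10 + (s[pos] - '0');
    ++pos;
  }
  if (pos == begin) return false;
  out = value;
  return true;
}

// Consumes a single '.' at pos. Returns false if the next character differs.
inline bool ReadDot(const std::string& s, std::size_t& pos) {
  if (pos >= s.size() || s[pos] != '.') return false;
  ++pos;
  return true;
}

// Parses "MAJOR.MINOR.PATCH" at the start of s. Anything after PATCH
// (pre-release, build info) is ignored. Returns false on malformed input.
inline bool ParseSimpleVersion(const std::string& s, SimpleVersion& v) {
  std::size_t pos = 0;
  return ReadNumber(s, pos, v.major_version) && ReadDot(s, pos) &&
         ReadNumber(s, pos, v.minor_version) && ReadDot(s, pos) &&
         ReadNumber(s, pos, v.patch_version);
}
```

A caller would check the return value and treat `false` as "unknown version" rather than assume zeros are meaningful.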
@github-actions crossbow submit -g nightly
Revision: 076863b Submitted crossbow builds: ursacomputing/crossbow @ actions-59
I'm happy to see this! Please see some comments below.
  ASSERT_EQ("cdh5.5.0", version.version.pre_release);
  ASSERT_EQ("cd", version.version.build_info);
}
Can you add a test with a malformed version string?
I've added some more tests.
Note that this implementation assumes that the input encoding is ASCII. It may not work with other encodings.
Should we support non-ASCII encodings?
I think assuming ASCII is ok.
cpp/src/parquet/metadata.cc
Outdated
 private:
  void RemovePrecedingSpaces(const std::string& string, size_t& start,
                             const size_t& end) {
    while (start < end && string[start] == ' ') {
Note that \s in a regex may match other whitespace characters. But they're unlikely to appear in the created_by field anyway.
I've added \t, \v, \r, \n and \f to the supported whitespace characters, though as you said they are unlikely to appear.
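As a rough sketch of what skipping those whitespace characters can look like (hypothetical helper names, ASCII input assumed, and not necessarily how the final code in this PR is written):

```cpp
#include <cstddef>
#include <string>

// True for the six ASCII whitespace characters discussed above:
// space, \t, \v, \r, \n and \f.
inline bool IsAsciiWhitespace(char c) {
  return c == ' ' || c == '\t' || c == '\v' || c == '\r' || c == '\n' ||
         c == '\f';
}

// Advances start past any leading whitespace in the range [start, end).
inline void SkipLeadingWhitespace(const std::string& s, std::size_t& start,
                                  std::size_t end) {
  while (start < end && IsAsciiWhitespace(s[start])) {
    ++start;
  }
}
```

Checking the characters explicitly, instead of calling `std::isspace`, keeps the behavior independent of the current C locale.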
…ication version std::regex provided by MinGW may take a long time with a Japanese locale on Windows. We can use std::regex, boost::regex or RE2 as the regular expression engine for this, but RE2's syntax isn't compatible with the others. If we support all of them, we need to maintain multiple regular expressions, which increases maintenance cost. If we don't use regular expressions, we don't need to think about regular expression engines, but we need to maintain a hand-written parser.
076863b to c85da69
Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/215736951
Thanks for the boolean fix!
…ication version std::regex provided by MinGW may take a long time with a Japanese locale on Windows. We can use std::regex, boost::regex or RE2 as the regular expression engine for this, but RE2's syntax isn't compatible with the others. If we support all of them, we need to maintain multiple regular expressions, which increases maintenance cost. If we don't use regular expressions, we don't need to think about regular expression engines, but we need to maintain a hand-written parser.
Closes apache#9367 from kou/cpp-parquet-no-regex
Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
std::regex provided by MinGW may take a long time with a Japanese locale on
Windows.
We can use std::regex, boost::regex or RE2 as the regular expression
engine for this, but RE2's syntax isn't compatible with the others. If
we support all of them, we need to maintain multiple regular
expressions, which increases maintenance cost. If we don't use regular
expressions, we don't need to think about regular expression engines,
but we need to maintain a hand-written parser.
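
To give a feel for the hand-written alternative, here is a simplified sketch that splits a created_by string into application, version and build parts using only std::string searches. It assumes a layout like "parquet-mr version 1.8.0 (build abcd)", which is an illustrative input; `CreatedBy`, `Trim` and `SplitCreatedBy` are hypothetical names, and the grammar accepted by the parser in this PR may differ.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not Arrow's ApplicationVersion).
struct CreatedBy {
  std::string application;
  std::string version;
  std::string build;
};

// Removes leading and trailing ASCII whitespace.
inline std::string Trim(const std::string& s) {
  const char* kWhitespace = " \t\v\r\n\f";
  const std::size_t begin = s.find_first_not_of(kWhitespace);
  if (begin == std::string::npos) return "";
  const std::size_t end = s.find_last_not_of(kWhitespace);
  return s.substr(begin, end - begin + 1);
}

// Splits "<application> version <version> (build <build>)".
// Missing parts are left empty instead of being reported as errors.
inline CreatedBy SplitCreatedBy(const std::string& created_by) {
  CreatedBy out;
  const std::string keyword = " version ";
  const std::size_t kw = created_by.find(keyword);
  if (kw == std::string::npos) {
    // No version keyword: treat the whole string as the application name.
    out.application = Trim(created_by);
    return out;
  }
  out.application = Trim(created_by.substr(0, kw));
  const std::string rest = created_by.substr(kw + keyword.size());

  const std::string build_tag = "(build";
  const std::size_t open = rest.find(build_tag);
  if (open == std::string::npos) {
    out.version = Trim(rest);
    return out;
  }
  out.version = Trim(rest.substr(0, open));

  const std::size_t build_begin = open + build_tag.size();
  const std::size_t close = rest.find(')', build_begin);
  out.build = Trim(close == std::string::npos
                       ? rest.substr(build_begin)
                       : rest.substr(build_begin, close - build_begin));
  return out;
}
```

The trade-off described above shows up directly: there is no regex engine or syntax dialect to worry about, but every accepted input shape has to be spelled out and tested by hand.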