Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINIFICPP-1026 - Added base64 encoder-decoder to StringUtils #690

Closed
wants to merge 2 commits into from

Conversation

bakaid
Copy link
Contributor

@bakaid bakaid commented Dec 5, 2019

Thank you for submitting a contribution to Apache NiFi - MiNiFi C++.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with MINIFICPP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file?
  • If applicable, have you updated the NOTICE file?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

Copy link
Contributor

@phrocker phrocker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we're leaning toward writing too much library code here. We build curl in most cases and in the cases where we can't we likely couldn't run anyway -- why not just rely on that which is already available versus writing our own.

* @param base64_length the length of base64
* @return true on success
*/
inline static bool from_base64(uint8_t* data, size_t* data_length, const char* base64, size_t base64_length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this your own impl or a copy and paste?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Own implementation, reasons explained in the main comment.

std::vector<uint8_t> decoded((base64_length / 4 + 1) * 3);
size_t data_length = decoded.size();
if (!from_base64(decoded.data(), &data_length, base64, base64_length)) {
throw std::runtime_error("Base64 encoded string is malformatted");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always felt that runtime errors "are due to events beyond the scope of the program and can not be easily predicted." so when those are used versus internal or named exceptions they have different connotations. We catch them but this error seems recoverable from a programmatic perspective. I don't have a super strong opinion on this but it does have a bit of code smell worth mentioning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, I'll think of something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote them to std::invalid_argument.

Copy link
Contributor

@arpadboda arpadboda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, the only concern I have is the placement of the code.

libminifi/include/utils/StringUtils.h Outdated Show resolved Hide resolved
libminifi/include/utils/StringUtils.h Outdated Show resolved Hide resolved
@bakaid
Copy link
Contributor Author

bakaid commented Dec 5, 2019

It seems that we're leaning toward writing too much library code here. We build curl in most cases and in the cases where we can't we likely couldn't run anyway -- why not just rely on that which is already available versus writing our own.

  • cURL does not have a public base64 encode-decode API. We have been using a header file adapted from the base64 implementation of an older cURL version, and used it independently of whether we actually built cURL.
  • that implementation was a particularly bad one and has since been rewritten in cURL:
    • it contained a buffer overread with the comment This final decode may actually read slightly past the end of the buffer if the input string is missing pad bytes. This will almost always be harmless.
    • it couldn't handle invalid characters in base64-encoded strings (didn't signal an error, just resulted in garbage)
    • it couldn't handle newlines, which are potentially valid in base64-encoded strings (didn't signal an error, just resulted in garbage)
    • it couldn't handle url-safe base64
  • the newer cURL implementation depends on other cURL-specific code. It could have been rewritten to be used independently of cURL, but I decided to create an own, well-tested implementation instead.
  • keep in mind that even if we used some base64 implementation from a third party we would still need
    • all the StringUtils functions that wrap it, except the encoding/decoding function
    • all the tests
  • "We build curl in most cases and in the cases where we can't we likely couldn't run anyway" is not a very convincing argument for anything. We should either make cURL build mandatory or fix the non-cURL build, if it indeed couldn't run.

@arpadboda
Copy link
Contributor

LGTM, I'm happy to proceed with this if CIs pass.

@phrocker
Copy link
Contributor

phrocker commented Dec 5, 2019

* keep in mind that even if we used some base64 implementation from a third party we would still need
  
  * all the StringUtils functions that wrap it, except the encoding/decoding function
  * all the tests

This is debatable but a tested third party impl if one exists can be trusted. This is a lot of code with the potential for bugs. In your case you've written a lot of code, so I don't want to demean it but the code for an external third party with StringUtils seems much less risky with much lower tests -- unless we simply can't find a third party impl. What you say about cURL is a valid point. Did you attempt to find a third party example? If not I can take a look -- there may be other string related libraries that provide this and some of what we do with StringUtils..


TEST_CASE("TestStringUtils::testBase64EncodeDecode", "[test base64 encode decode]") {
std::mt19937 gen(std::random_device { }());
for (size_t i = 0U; i < 1024U; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should add a test or scenario for arbitrarily large strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but after a few 4-base64-digit groups it doesn't involve different code paths, it would just test how large your memory is. What did you have in mind for size and what advantage would it have?

@arpadboda
Copy link
Contributor

arpadboda commented Dec 5, 2019

* keep in mind that even if we used some base64 implementation from a third party we would still need
  
  * all the StringUtils functions that wrap it, except the encoding/decoding function
  * all the tests

This is debatable but a tested third party impl if one exists can be trusted. This is a lot of code with the potential for bugs. In your case you've written a lot of code, so I don't want to demean it but the code for an external third party with StringUtils seems much less risky with much lower tests -- unless we simply can't find a third party impl. What you say about cURL is a valid point. Did you attempt to find a third party example? If not I can take a look -- there may be other string related libraries that provide this and some of what we do with StringUtils..

Partly agree, partly disagree.
For eg. boost does a lot of thing. But it is a huge dependency, heavily increases build time, etc.
For most of the 3rd parties, in case we don't want to flood our own repo, we need to maintain cmake files, sometimes update dependencies, maybe edit bootstrap. These are also costs and they apply time to time when something needs to be updated. There are cases when these costs are definitely justified by the functionality the library introduces (just as opencv, curl, kafka, mqtt, etc...), but in case of some quite simple string util functionality the longer terms costs of having the dependency might exceed the costs of doing it right, once.

@bakaid
Copy link
Contributor Author

bakaid commented Dec 6, 2019

* keep in mind that even if we used some base64 implementation from a third party we would still need
  
  * all the StringUtils functions that wrap it, except the encoding/decoding function
  * all the tests

This is debatable but a tested third party impl if one exists can be trusted. This is a lot of code with the potential for bugs. In your case you've written a lot of code, so I don't want to demean it but the code for an external third party with StringUtils seems much less risky with much lower tests -- unless we simply can't find a third party impl. What you say about cURL is a valid point. Did you attempt to find a third party example? If not I can take a look -- there may be other string related libraries that provide this and some of what we do with StringUtils..

@phrocker There are a myriad of base64 implementations in C, the two I meet the most often are https://web.mit.edu/freebsd/head/contrib/wpa/src/utils/base64.c and https://opensource.apple.com/source/QuickTimeStreamingServer/QuickTimeStreamingServer-452/CommonUtilitiesLib/base64.c
I have yet to find one that is at the same time

  • standalone
  • can pass the test suite I have written without modification
  • well tested in itself

The core part - that we would be able to substitute with a third-party implementation, albeit only after modifications - is composed of two look-up tables, a 30-line and a 65-line function.
I think if this is reviewed properly, given our test suite, the risk is pretty contained.

As for replacing the majority of StringUtils with some third party utility library: I don't know one that would fit our needs. There might well be, and in that I case I wouldn't be against it, but I don't want to do that integration work in the scope of this issue.

@bakaid
Copy link
Contributor Author

bakaid commented Dec 9, 2019

@phrocker Are you -1 on this? If so, please elaborate your technical justification for it, otherwise I would like to go forward with this PR.

@arpadboda
Copy link
Contributor

Merging this as no better alternative was proposed in a timely manner and it's clearly an improvement compared to the current (flawed) implementation we have.

Feel free to open follow-ups in case any related issue or a promising out-of-the-box alternative is found.

@arpadboda arpadboda closed this in 6c7ae81 Dec 12, 2019
@phrocker
Copy link
Contributor

phrocker commented Dec 12, 2019

-1

Yep. I was -1. My gut suggestion was to use boost's base 64 since it's relatively isolated. If you look at the boost variant they mention portions come from this repo ( https://github.com/ReneNyffenegger/cpp-base64). I'll follow up on the JIRA I've created.

@phrocker phrocker reopened this Dec 12, 2019
@phrocker phrocker closed this Dec 12, 2019
@phrocker
Copy link
Contributor

Merging this as no better alternative was proposed in a timely manner and it's clearly an improvement compared to the current (flawed) implementation we have.

Feel free to open follow-ups in case any related issue or a promising out-of-the-box alternative is found.

The subjective precedent here should be clarified and defined clearly here : https://github.com/apache/nifi-minifi-cpp/blob/master/CONTRIB.md .

@arpadboda
Copy link
Contributor

Merging this as no better alternative was proposed in a timely manner and it's clearly an improvement compared to the current (flawed) implementation we have.
Feel free to open follow-ups in case any related issue or a promising out-of-the-box alternative is found.

The subjective precedent here should be clarified and defined clearly here : https://github.com/apache/nifi-minifi-cpp/blob/master/CONTRIB.md .

Please clarify!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants