Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Base64.IsValid #76020

Closed
stephentoub opened this issue Sep 22, 2022 · 25 comments · Fixed by #85938
Closed

[API Proposal]: Base64.IsValid #76020

stephentoub opened this issue Sep 22, 2022 · 25 comments · Fixed by #85938
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory partner-impact This issue impacts a partner who needs to be kept updated
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Sep 22, 2022

Background and motivation

The Base64 class provides efficient methods for encoding and decoding base64 data, but it doesn't provide any means for validating Base64-encoded data is properly encoded, at least not without having output memory into which to decode the resulting data. For scenarios that, for example, want to validate configuration data promptly and that might need the results of that configuration ever or until later, it's desirable to support an efficient means for validating Base64-encoded data without requiring the output memory, which then also enables the decoding/validation to be performed faster.

API Proposal

namespace System.Buffers.Text;

public static class Base64
{
+    public static bool IsValid(ReadOnlySpan<char> base64Text);
+    public static bool IsValid(ReadOnlySpan<byte> base64Text);
}

API Usage

string base64Text = ...;
if (!Base64.IsValid(base64Text))
    throw new InvalidConfigurationException(...);

Alternative Designs

  • We could expose such methods on the new Ascii class instead, since base64-encoded data is all ASCII. However, we already have two different places for Base64 functionality (Convert and Base64), and ideally we'd use the Base64 class as the central location moving forward for Base64-related functionality (we might even want to duplicate the Convert APIs for working with chars onto the Base64 type for discoverability, or potentially create streaming char-based versions to parallel the streaming byte-based versions already there).
  • The existing Encode/Decode methods call the source text parameter "utf8"; we could do the same here, but only for the byte-based overload, which is then inconsistent between the overloads.

Risks

No response

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory labels Sep 22, 2022
@stephentoub stephentoub added this to the 8.0.0 milestone Sep 22, 2022
@ghost
Copy link

ghost commented Sep 22, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

The Base64 class provides efficient methods for encoding and decoding base64 data, but it doesn't provide any means for validating Base64-encoded data is properly encoded, at least not without having output memory into which to decode the resulting data. For scenarios that, for example, want to validate configuration data promptly and that might need the results of that configuration ever or until later, it's desirable to support an efficient means for validating Base64-encoded data without requiring the output memory, which then also enables the decoding/validation to be performed faster.

API Proposal

namespace System.Buffers.Text;

public static class Base64
{
+    public static bool IsValid(ReadOnlySpan<char> base64Text);
+    public static bool IsValid(ReadOnlySpan<byte> base64Text);
}

API Usage

string base64Text = ...;
if (!Base64.IsValid(base64Text))
    throw new InvalidConfigurationException(...);

Alternative Designs

  • We could expose such methods on the new Ascii class instead, since base64-encoded data is all ASCII. However, we already have two different places for Base64 functionality (Convert and Base64), and ideally we'd use the Base64 class as the central location moving forward for Base64-related functionality (we might even want to duplicate the Convert APIs for working with chars onto the Base64 type for discoverability, or potentially create streaming char-based versions to parallel the streaming byte-based versions already there).

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.Memory

Milestone: 8.0.0

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Sep 27, 2022
@terrajobst
Copy link
Member

terrajobst commented Sep 27, 2022

Video

  • Looks good as proposed
    • We should consider answering the question whether Decode will work, rather than just checking whether it's syntactically valid Base64, which would be more useful.
namespace System.Buffers.Text;

public static class Base64
{
    public static bool IsValid(ReadOnlySpan<char> base64Text);
    public static bool IsValid(ReadOnlySpan<char> base64Text, out int decodedLength);
    public static bool IsValid(ReadOnlySpan<byte> base64Text);
    public static bool IsValid(ReadOnlySpan<byte> base64Text, out int decodedLength);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Sep 27, 2022
@GrabYourPitchforks
Copy link
Member

Here are some observations from quick experimentation, as discussed during review.

Both Convert.FromBase64String and Base64.Decode handle non-canonical forms (RFC 4648, Sec. 3.5) equivalently. Padding bits are discarded without being checked to see if they're all-zero.

Convert.FromBase64String allows whitespace, while Base64.Decode forbids whitespace. (See RFC 4648, Sec. 3.1.)

@stephentoub
Copy link
Member Author

Convert.FromBase64String allows whitespace, while Base64.Decode forbids whitespace.

That's an unfortunate discrepancy. The cleanest option from my perspective would be to update Base64.Decode* to allow whitespace, and then Base64.IsValid would similarly ignore them.

@GrabYourPitchforks, what do you think we should do here? Base64.IsValid aside, it's strange to me that we have these two different base64-decoding routines that treat whitespace (in particular newlines) differently.

@gfoidl
Copy link
Member

gfoidl commented Sep 29, 2022

These APIs could be implemented based on #68328 (comment)

namespace System.Buffers.Text;

public static class Base64
{
    public static bool IsValid(ReadOnlySpan<char> base64Text) => base64Text.IndexOfAnyExcept(/* base64 alphabet + allowed whitespace */) < 0;
    public static bool IsValid(ReadOnlySpan<byte> base64Text) => => base64Text.IndexOfAnyExcept(/* base64 alphabet + allowed whitespace */) < 0;

    public static bool IsValid(ReadOnlySpan<char> base64Text, out int decodedLength)
    {
        int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(/* base64 alphabet + allowed whitespace */);

        if (indexOfFirstNonBase64 < 0)
        {
            decodedLength = base64Text.Length;    // probably wrong, see comment below
            return true;
        }

        decodedLength = indexOfFirstNonBase64 & -4;
        return false;
    }

    public static bool IsValid(ReadOnlySpan<byte> base64Text, out int decodedLength)
    {
        // the same
    }
}

Here I assume that decodedLength in the case of being not valid returns length that could actually be decoded (i.e. tha "valid part").

@stephentoub
Copy link
Member Author

These APIs could be implemented based

That's the intent :-) There are a bunch of places the new api will be useful.

@stephentoub
Copy link
Member Author

Here I assume that decodedLength in the case of being not valid returns length that could actually be decoded

We didn't talk about what it would be in the invalid case, but your suggestion is reasonable. I think the intent, however, was for it to be the required output buffer size for the decoded data rather than the input length; that's particularly relevant if whitespace is ignored.

@gfoidl
Copy link
Member

gfoidl commented Sep 29, 2022

was for it to be the required output buffer size for the decoded data

So for documentation that would mean:
IsValid = true -> required buffer size that can hold the decoded data
IsValid = false -> up until this length data can be decoded, then there's non base64 (useful for failure message, etc.)

@stephentoub
Copy link
Member Author

I actually think that's confusing. I'd be inclined to say the out is undefined/0 on false. @bartonjs, what did you intend for this?

@bartonjs
Copy link
Member

My intention is the required space for decoding. If IsValid is false, that makes no sense, so 0.

@stephentoub
Copy link
Member Author

stephentoub commented Oct 11, 2022

Convert.FromBase64String allows whitespace, while Base64.Decode forbids whitespace. (See RFC 4648, Sec. 3.1.)

So it seems like we have a few options here:

  1. Base64.IsValid allows whitespace, matching Convert.FromBase64String but differing from Base64.Decode. That seems like a non-starter, as BAse64.IsValid would end up returning true for something Base64.Decode would fail on.
  2. Base64.IsValid disallows whitespace, matching Base64.Decode but not matching Convert.FromBase64String. This also seems problematic, as a) it would fail to state as valid inputs Convert.FromBase64String would successfully handle, but more importantly b) Convert.ToBase64* accepts a Base64FormattingOptions that allows for the inserting of whitespace into encoded output; it'd be really strange for IsValid to return false for that.
  3. Make Base64.IsValid differ in whitespace-accepting behavior between its char and byte-based overloads. This would be extremely confusing and inconsistent.
  4. Update Base64.Decode to treat whitespace as valid, just as does Convert.FromBase64String, and then IsValid would allow whitespace. I don't know if this could manifest as an impactful enough breaking change so as to discard this option, e.g. if someone is relying on the existing whitespace-is-an-error behavior as part of some kind of input validation.
  5. Add an additional parameter to Base64.IsValid that dictates whether whitespace should be accepted, e.g. enum Base64ParsingOptions { None = 0, IgnoreWhiteSpace = 1 } and public static bool IsValid(ReadOnlySpan<...> base64Text, Base64ParsingOptions options).
  6. (5) + add overloads to Base64.Decode and Convert.FromBase64* that accept a Base64ParsingOptions.

1/2/3 all seem like bad choices.

I'd be ok with 4/5/6.

@GrabYourPitchforks? @bartonjs? @terrajobst?

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-approved API was approved in API review, it can be implemented labels Oct 28, 2022
@stephentoub
Copy link
Member Author

We approved this, but we need to figure out which of these options we want to pursue:
#76020 (comment)
so I'm marking it as ready for review again.

@gfoidl
Copy link
Member

gfoidl commented Oct 28, 2022

When this goes to review again, so consider dropping the overload with decodedLength:

namespace System.Buffers.Text;

public static class Base64
{
    public static bool IsValid(ReadOnlySpan<char> base64Text);
-   public static bool IsValid(ReadOnlySpan<char> base64Text, out int decodedLength);
    public static bool IsValid(ReadOnlySpan<byte> base64Text);
-   public static bool IsValid(ReadOnlySpan<byte> base64Text, out int decodedLength);
}
  • the API is for checking if it's valid base64 according the name, so don't overload if with an orthogonal concern (getting the decoded length)
  • getting the decoded length is a simple operation -- for max length it's just $\frac{length}{4} \cdot 3$ (API already provided), and if exact length is needed check if elements { x[n-2], x[n-1] } are padding chars then subtract that count
  • so I don't assume someone will call this O(n) API to get the exact decoded length

If the exact decoced length is needed, then it would be better to offer a separate method for this:

namespace System.Buffers.Text;

public static class Base64
{
+    public static int GetDecodedFromUtf8Length(int length);
}

But this I'd leave to a potential separate proposal.

@stephentoub
Copy link
Member Author

getting the decoded length is a simple operation

For exact length, that's not the case if there's whitespace to be ignored.

@bartonjs
Copy link
Member

I don't assume someone will call this O(n) API to get the exact decoded length

The overload emitting the decoded length is required to allow the version of this in PemEncoding.TryFind to unify with the public API; or, at least the functionality is required.

@bartonjs
Copy link
Member

So it seems like we have a few options here:

I'm sure the argument for the non-whitespace implementation is perf: if no data ever needs to be skipped over then it's a very fast transform. While I, personally, think (4) (make Base64.Decode just support whitespace) is probably the right way to go, if there's a big kerfuffle about losing an optimized path then I can support (6).

@stephentoub
Copy link
Member Author

FWIW, @bartonjs, 4 is my preference as well.

@GrabYourPitchforks
Copy link
Member

I'm fine with (4) as long as you include utf8 somewhere in the name. I'd prefer the method name because it's more visible, but the parameter name is also acceptable.

The reason for this is that Convert.FromBase64String currently understands just a handful of whitespace characters:

private static bool IsSpace(this char c) => c == ' ' || c == '\t' || c == '\r' || c == '\n';

If this list of allowed whitespace chars is ever expanded to encompass the full set of characters allowed by char.IsWhiteSpace, then we're starting to discuss chars outside of the US-ASCII range, which means the caller has to know how the provided ROS<byte> is going to be interpreted. (Latin-1? UTF-8? Something else?) Putting the UTF-8 moniker somewhere in the API surface makes this obvious and avoids confusion.

(Yes, expanding the list would be a breaking change for the Convert class. But we've been entertaining the idea here and elsewhere of breaking changes to .NET, so it seems prudent to future-proof ourselves.)

@stephentoub
Copy link
Member Author

I'm fine with (4) as long as you include utf8 somewhere in the name.

I don't mind using a different name, but I'm confused by the "utf8" aspect of it. Where in public static bool IsValid(ReadOnlySpan<char> base64Text); would you include "utf8"?

@GrabYourPitchforks
Copy link
Member

Only in the ROS<byte> overloads. :)

@stephentoub
Copy link
Member Author

stephentoub commented Nov 8, 2022

So e.g.

+public static bool IsValid(ReadOnlySpan<byte> utf8); // or base64Utf8Text
+public static bool IsValid(ReadOnlySpan<char> utf16); // or base64Text

// implementation updated to ignore same whitespace as Convert.FromBase64String
public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)

?

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Nov 8, 2022

No need to add the utf16 moniker to ROS<char> inputs. The char data type implies UTF-16. (Well, UCS-2 or UTF-16, context-dependent, but academic distinction for this issue.)

public static bool IsValid(ReadOnlySpan<byte> utf8); // or base64Utf8Text

Sure, that seems fine.

@stephentoub
Copy link
Member Author

No need to add the utf16 moniker to ROS inputs. The char data type implies UTF-16. (Well, UCS-2 or UTF-16, context-dependent, but academic distinction for this issue.)

Just aiming for some measure of consistency.

@gfoidl
Copy link
Member

gfoidl commented Nov 8, 2022

(6) is the one that aligns best with RFC4648 (especially section 3.1 and 3.3).
Allowing whitspace could be the default (parameter).

@terrajobst
Copy link
Member

terrajobst commented Nov 8, 2022

Video

  • We'd like to pursue option (4) from @stephentoub's suggestion:
    • Change Base64 to allow the same whitespace characters as Convert.ToBase64
    • We should output the sizes
    • If this proves problematic, we can produce overloads/alternative methods that disallow whitespace
    • Which happens to match the previously approved API shape
namespace System.Buffers.Text;

public static class Base64
{
    public static bool IsValid(ReadOnlySpan<char> base64Text);
    public static bool IsValid(ReadOnlySpan<char> base64Text, out int decodedLength);
    public static bool IsValid(ReadOnlySpan<byte> base64TextUtf8);
    public static bool IsValid(ReadOnlySpan<byte> base64TextUtf8, out int decodedLength);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 8, 2022
@joperezr joperezr assigned stephentoub and unassigned stephentoub Jan 30, 2023
@stephentoub stephentoub added the in-pr There is an active PR which will close this issue when it is merged label Mar 3, 2023
@danmoseley danmoseley added the partner-impact This issue impacts a partner who needs to be kept updated label Mar 21, 2023
@tarekgh tarekgh removed the in-pr There is an active PR which will close this issue when it is merged label May 4, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 8, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 9, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 8, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory partner-impact This issue impacts a partner who needs to be kept updated
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants