
Support Base 64 URL #1658

Open
Tracked by #64598
commonsensesoftware opened this issue Jan 12, 2020 · 55 comments · May be fixed by #102364
Labels
api-approved API was approved in API review, it can be implemented area-System.Net in-pr There is an active PR which will close this issue when it is merged

Comments

@commonsensesoftware

commonsensesoftware commented Jan 12, 2020

Updated by @MihaZupan on 2024-02-27

Proposed API

namespace System.Buffers.Text;

public static class Base64Url
{
    // Encode bytes => utf8
    public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten);

    // Decode from utf8 => bytes
    public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten);

    // Max length APIs
    public static int GetMaxDecodedFromUtf8Length(int length);
    public static int GetMaxEncodedToUtf8Length(int length);

    // IsValid
    public static bool IsValid(ReadOnlySpan<char> base64UrlText);
    public static bool IsValid(ReadOnlySpan<char> base64UrlText, out int decodedLength);
    public static bool IsValid(ReadOnlySpan<byte> base64UrlTextUtf8);
    public static bool IsValid(ReadOnlySpan<byte> base64UrlTextUtf8, out int decodedLength);

    // Up to this point, this is a mirror of System.Buffers.Text.Base64
    // Below are more helpers that bring over functionality similar to Convert.*Base64*

    // Encode to / decode from chars
    public static bool TryEncodeToChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten) { }
    public static bool TryDecodeFromChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) { }


    // These are just accelerator methods.
    // Should be efficiently implementable on top of the other ones in just a few lines.

    // Encode to string
    public static string EncodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
    public static string EncodeToString(ReadOnlySpan<byte> bytes) { }

    // Decode from chars => string
    // Decode from chars => byte[]
    // The names could also just be "Decode" without naming the return type
    public static string DecodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
    public static byte[] DecodeToByteArray(ReadOnlySpan<char> chars) { }
}

Original issue

The Base64 implementation in System.Memory provides excellent low-level optimizations for RFC 4648, but it currently uses a fixed alphabet. Minor refactoring would add additional use cases using the existing implementation.

Rationale and Usage

The most obvious use case for this change is to support the encoding variant known as Base 64 URL also described in RFC 4648 §4.

This excerpt from the RFC describes the difference between the standard Base 64 alphabet and the Base 64 URL alphabet.

This encoding is technically identical to the previous one, except
for the 62:nd and 63:rd alphabet character...
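For illustration only (not part of the proposal), the two alphabets can be compared with Python's `base64` module, which implements both RFC 4648 variants. The outputs differ only where characters 62 and 63 appear:

```python
import base64

# Bytes chosen so every 6-bit group is 62 or 63, exercising exactly the two
# characters that differ between the standard and URL-safe alphabets.
data = b"\xfb\xff\xfe"

print(base64.b64encode(data))          # b'+//+' -- standard alphabet uses '+' and '/'
print(base64.urlsafe_b64encode(data))  # b'-__-' -- URL-safe alphabet uses '-' and '_'
```

Everything else about the two encodings is identical, which is the core of the argument for a shared implementation.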

I have already been able to verify the encoding and decoding in a copy of the current implementation by merely changing the 2 relevant characters in the alphabet mappings at:

Today, this logic is suboptimally implemented in at least:

Furthermore, the encoding is generic and has use cases outside of ASP.NET (e.g. you shouldn't have to reference ASP.NET to use it).

Proposed API Change

The existing Base64.EncodeToUtf8 and Base64.DecodeFromUtf8 should each add a new method overload that allows one of the following:

  1. A new enumeration of allowed alphabets (ex: Base64Alphabet) which internally maps to well-known alphabet spans

  2. Allow any custom alphabet to be supplied as ReadOnlySpan<sbyte> that must have an exact length of 64 for encoding and 256 for decoding

    a. One or more new types could be provided with static properties for the well-known alphabet spans

Details

Option 1 requires less validation, but has less flexibility. Option 2 has more flexibility, including scenarios not described here, but requires more validation before using the alphabet.

Although the standard Base 64 and Base 64 URL are technically the same encoding with different alphabets, Base 64 URL typically does not include padding. RFC 4648 §3.2 indicates this is allowed (as it's explicitly stated), but there is no need to make that concession in this API. Including the = character for padding in Base 64 URL encoding is still correct.

Both approaches would have a similar looking API:

public static unsafe OperationStatus EncodeToUtf8(
    ReadOnlySpan<byte> bytes,
    Span<byte> utf8,
    Base64Alphabet alphabet,
    out int bytesConsumed,
    out int bytesWritten,
    bool isFinalBlock = true);

public static unsafe OperationStatus DecodeFromUtf8(
    ReadOnlySpan<byte> utf8,
    Span<byte> bytes,
    Base64Alphabet alphabet,
    out int bytesConsumed,
    out int bytesWritten,
    bool isFinalBlock = true);

Figure 1: Supply an alphabet enumeration

public static unsafe OperationStatus EncodeToUtf8(
    ReadOnlySpan<byte> bytes,
    Span<byte> utf8,
    ReadOnlySpan<byte> alphabet,
    out int bytesConsumed,
    out int bytesWritten,
    bool isFinalBlock = true);

public static unsafe OperationStatus DecodeFromUtf8(
    ReadOnlySpan<byte> utf8,
    Span<byte> bytes,
    ReadOnlySpan<byte> alphabet,
    out int bytesConsumed,
    out int bytesWritten,
    bool isFinalBlock = true);

Figure 2: Supply a custom alphabet

The support for trimming off and re-adding padding should be implemented separately. This could be in a separate Base64Url class or, perhaps more appropriately, added as a new type in System.Text.Encodings.Web.

Padding is generally very cheap to deal with compared to other implementation methods. Trimming involves walking the tail end of the span while there are padding characters and then slicing it off. Re-padding fills the end of the span buffer before decoding. Neither operation requires additional allocations.
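A minimal sketch of that trim/re-pad logic, written in Python for brevity (the helper names are invented for illustration; they are not part of the proposed API):

```python
import base64

def strip_padding(encoded: bytes) -> bytes:
    # Walk the tail while there are padding characters, then slice them off.
    return encoded.rstrip(b"=")

def repad(encoded: bytes) -> bytes:
    # The encoded length mod 4 determines how many '=' characters were removed.
    return encoded + b"=" * (-len(encoded) % 4)

trimmed = strip_padding(base64.urlsafe_b64encode(b"\x01\x02"))  # b'AQI'
assert base64.urlsafe_b64decode(repad(trimmed)) == b"\x01\x02"
```

In span terms, `strip_padding` corresponds to a `Slice`, and `repad` fills at most two trailing bytes, so neither step allocates.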

Open Questions

  • Does Option 1 or Option 2 make more sense?
  • To complete the cycle, it feels like there should be a type that fully handles the encoding and decoding with trimmed padding. Does that make sense to be in System.Memory, System.Text.Encodings.Web, or perhaps some other place?
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 12, 2020
@jkotas jkotas added area-System.Buffers api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jan 13, 2020
@danmoseley
Member

@GrabYourPitchforks

@GrabYourPitchforks
Member

What we're considering is no longer base64. It should really be a different API named base64url.

@commonsensesoftware
Author

@GrabYourPitchforks not exactly - that's merely a canonical name. From an encoding and decoding perspective they are just different alphabets at the lowest level. You literally just exchange 2 characters in the respective encoding and decoding (e.g. 4 numbers). It took me exponentially more time to understand the existing implementation than it did to make the actual change.

Mostly, I wanted to get the discussion going. There are several ways this proposed change could be implemented. If we split out the API into a new type called Base64Url I think that's more than sufficient. That's actually how I implemented it in my solution. After realizing that this is a common scenario, I thought it would be appropriate to formally discuss supporting it for others to leverage. The method signatures between Base64 and Base64Url would be exactly the same or, at least, they certainly can be.

If we were to take this approach, then it feels like most of the existing Base64 implementation should be refactored into some other internal type (ex: Base64Core), which can be shared between the two implementations (since they would both be static classes). The internal implementation would then accept the appropriate alphabet, leaving the rest of the encoding and decoding algorithm the same.

Thoughts?

@gfoidl
Member

gfoidl commented Jan 14, 2020

leaving the rest of the encoding and decoding algorithm the same

This depends how padding is handled (especially for base64Url).

Furthermore the current implementation is vectorized.
base64Url can also be vectorized, but some tricks used for base64 are not applicable for base64url.
That means either two different vectorized implementations, or switching to a less optimized variant so that the code paths align (even when vectorized).

The same holds for the option with the user specified alphabet.

@commonsensesoftware
Author

@gfoidl the padding need not be handled directly in the encoding/decoding, which makes things simpler; the RFC also says as much. I agree, however, that most people will expect the padding to be handled automatically, which is why my original suggestion was to move that responsibility outside of the encoding/decoding process. If we want to include that in the Base64Url class, that's fine by me, but that is merely encapsulating some very basic behavior after encoding and before decoding.

Code is worth a thousand words. I've thrown up the gist of what I'm talking about. I'm well aware that the implementation uses vectors to map the alphabets. Like you, I merely changed the mapping of the alphabet here and here (on this line and this line). I didn't change anything else. I suspect you changed things to make it work with .NET Standard, which is not one of my goals.

In pure terms of implementing the RFC specification, this is functionally correct. I acknowledge that consumers will expect padding to be automatically handled, which is why you can see everything come together in my rendition of Base64Url.cs.

I'm not saying that my implementation should necessarily be the winning design. Step one is to agree that these two things are very, very similar and that it makes sense for the core implementation to not only live together, but that the two have a shared implementation. Exactly which part of API is publicly exposed and what it looks like is wide open for discussion.

The specification implies that there could be alternate alphabets. These are the only two I know of. I'm completely onboard with closing the door on open-ended alphabets. There's no need to over-engineer this. I'm content with saying these are the only two alphabets supported and they are internally mapped. If that somehow ever changes in the future, there seems to be a straight forward path as to how that would be achieved.

@gfoidl
Member

gfoidl commented Jan 14, 2020

In general I agree with you.
But I'm afraid perf will suffer when the two implementations are shared -- especially for short to medium inputs, as the overhead for handling padding may be too high. I'm happy if my fear proves superfluous.

closing the door on open-ended alphabets

👍
(can you please update the title of this issue)


I suspect you changed things to make it work with .NET Standard

I don't know to which change you refer. In the linked project (in my last comment) most changes are for perf, and the vectorized approach -- especially for decoding -- is different for base64 and base64Url.

@commonsensesoftware
Author

Happy to update the title. What should it say now? "Support Base 64 URL"?

I didn't dissect your implementation, but it clearly looked different and I saw compiler directives. Maybe there's something I don't understand about the encoding and decoding process, which would negate my initial assumptions. My understanding of the specification and the vectorized implementation is as follows:

Encoding

Define the alphabet as a 64 element vector where each element represents the alphabet character. A bunch of optimized techniques and math are then used to calculate which character to use while encoding.

Decoding

Define the decoding map as a 256 (technically only 255 is needed) element vector where each element represents the encoded character which maps back to the corresponding encoding alphabet character. A bunch more optimized techniques and math are used to decode the text.


The difference between the two is merely using - and _ instead of + and / at the exact same positions. With the vectors correctly mapped, encoding/decoding should be no different. I've been using this implementation for a while and haven't encountered any tests or scenarios where things didn't encode/decode properly or produced results differing from other, suboptimal methods. However, that doesn't mean I might not be wrong.
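The lookup-table view described above can be made concrete with a small sketch (Python for illustration only; the real implementation is SIMD-accelerated C# and far more involved):

```python
# 64-entry encoding alphabet for the URL-safe variant (RFC 4648 section 5);
# it differs from the standard alphabet only at indices 62 and 63.
URL_ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

# 256-entry decoding map: byte value -> 6-bit value, with -1 marking invalid input.
DECODE_MAP = [-1] * 256
for i, b in enumerate(URL_ALPHABET):
    DECODE_MAP[b] = i

assert DECODE_MAP[ord("-")] == 62 and DECODE_MAP[ord("_")] == 63
```

Swapping to the standard alphabet changes only the two table entries for `+` and `/`; the surrounding scalar algorithm is untouched, which is the point being made here (the SIMD paths are a separate question, as @gfoidl notes below).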

I'll reiterate that padding need not be handled directly in the encoding/decoding process (refer to my example). Unless I've completely missed something else, that means the only difference between the encoding/decoding processes is the alphabet-to-vector map. Conceptually, it feels like this can be handled by refactoring to a parameter or some other method. I recognize these are very low-level, performance-sensitive APIs. I defer to bigger brains than mine as to what the best approach is, but it seems to me that choosing vector A or B, or passing it as a ref parameter, is quite cheap.

I re-reviewed your implementation and it seems you are accounting for the padding directly in the encoding/decoding process. That is not what I'm suggesting, but maybe that's a better way to go. In that case, we end up with two distinct encoding/decoding implementations for each alphabet, where the Base 64 URL implementation directly accounts for padding too. The underlying algorithms are quite complex and advanced. My original thinking was that we don't need different algorithms, just different alphabets where padding can be handled outside of the encoding/decoding process for Base 64 URL (the spec even shows = as the pad character in the alphabet). Furthermore, the worst possible scenario in Base 64 is 2 padding characters, which means any pre/post fix-up has seemingly low perf impact (for Base 64 URL only). I'm sure my technique to handle this is suboptimal, but I suspect improvements could be made without having to reimplement the entire algorithm.

I'm onboard to support two different, optimized algorithms if that's the agreed upon approach, it just seems like it's unnecessary maintenance overhead and complexity.

@gfoidl
Member

gfoidl commented Jan 14, 2020

"Support Base 64 URL"?

Sounds good.


We're going down to implementation details... I think it's too early for this. First there should be consensus about the API design / shape.

Maybe we have a mismatch on "vectorized". When I say "vectorized" I mean SIMD. You mean "array of data" (here a lookup table (LUT)).
The SIMD-versions of base64 and base64Url are different, as the cool tricks there won't map directly to the different alphabet. And the SIMD-version brings a huge perf-win.

For padding: I understood what you suggested 😉 But all users of base64Url (in the .NET landscape) use it without padding (e.g. ASP.NET Core), so this should be handled directly. At least with a default param -- so a user could opt in to keep the padding.

Pure scalar code, and w/o padding in mind, you're absolutely right that the implementations are equivalent, just with different LUTs. Hence it makes sense to share as much code as possible. As said before, it's heavily optimized, so I'm afraid it's not that easy without perf loss or introducing even more complex code by playing JIT tricks.

I'll wait for some info from the API designers here.
Personally I don't have any strong preference in either direction. I just want to keep an eye on the perf aspect.

@commonsensesoftware commonsensesoftware changed the title Base64 Should Support Custom Alphabet Support Base 64 URL Jan 14, 2020
@commonsensesoftware
Author

Title updated.

Alright ... we are agreed. I guess we enter the holding pattern for stakeholders to decide. I'm personally not married to any code or design, I'd just like to see all of the perf goodness flow from the lowest levels up, which currently isn't happening. 😉

@olivier-spinelli

Any news on this one? It doesn't seem to be a big deal... A standardized System.Memory.Base64Url will be useful...

@TravisSpomer

TravisSpomer commented Mar 21, 2022

(I opened what's effectively a dupe of this as #66841. Oops!)

I'd like to propose an alternative API shape for this: instead of the two options above, just create a new class System.Buffers.Text.Base64Url with the exact same shape as Base64, either deriving from Base64 or just sharing its implementation. That seems more consistent with the way .NET libraries are typically organized, and the likelihood of a third encoding popping up in the future that still uses the same algorithm as base64 seems extremely low. In addition, when this issue was opened, there was probably just one encoding table and one decoding table that differed between base64 and base64url, but now there are six: scalar, AVX2, and SSSE3 tables each for encoding and decoding. So option 2 "supply a custom alphabet" seems considerably more unwieldy now. The main downside to me of option 1 "alphabet parameter" is that it requires people to learn a new (albeit minor) concept to use the API: "What's an 'alphabet' in this context? Why do I need to pick one? Why am I using a class called Base64 to encode data in base64url?"

Just having two separate classes that share code lets people make a simpler decision: do I want to encode in base64 or base64url? And having a separate class might also make it easier to have different defaults for padding between base64 and base64url if options are added to control padding.

So, I propose option 3: "make it its own class":

namespace System.Buffers.Text
{
    public static class Base64Url
    {
        public static System.Buffers.OperationStatus DecodeFromUtf8(System.ReadOnlySpan<byte> utf8, System.Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
        public static System.Buffers.OperationStatus DecodeFromUtf8InPlace(System.Span<byte> buffer, out int bytesWritten);
        public static System.Buffers.OperationStatus EncodeToUtf8(System.ReadOnlySpan<byte> bytes, System.Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
        public static System.Buffers.OperationStatus EncodeToUtf8InPlace(System.Span<byte> buffer, int dataLength, out int bytesWritten);
        public static int GetMaxDecodedFromUtf8Length(int length);
        public static int GetMaxEncodedToUtf8Length(int length);
    }
}

@commonsensesoftware
Author

@TravisSpomer the term "alphabet" is the vernacular of the spec RFC 4648. I don't really care what it's called in the code, but some level of symmetry and meaning tends to be self-describing.

Where I was really going with in terms of an alphabet is in relation to how the spec calls them out. I'm more than onboard with having a restricted public surface area. My thought process was that there could be the ability to swap out alphabets at a lower level. As the spec calls out, there are more than just 2 possible alphabets. If we're only interested in having these two public approaches, then I don't have any resistance to restricting it that way. There's no sense in over-engineering it.

I posted a Gist (above) which is extracted from a working implementation. I didn't post the unit tests for it, but I have those handy if we were to move forward with a PR. I believe the only real holdup on this issue is an agreed-upon API design. I've contributed my 2¢, but I'm not married to anything. In comparing the implementations between Base64 and Base64Url (as provided thus far), the only real low-level difference is the encoding and decoding maps/tables/vectors. Handling padding, be it trim or add, is easily handled at a higher API level. As it relates to this issue, the allocation sizes are exactly the same. Trimming is a simple Slice against the result, if necessary.

I'm totally onboard with having 2 laser-focused public static classes for Base64 and Base64Url. I do think there's a lot of possibility for sharing and reduced duplication in the internal implementation, but I'm more than happy to let that organically flesh itself out through iterations of code review as opposed to designing it upfront in this issue.

I would ❤️ to see this land as a supported API. There's a wide benefit to client and server web stacks, in particular. According to the linked .NET 7 roadmap, this issue is queued on the backlog to potentially land.

@jeffhandley, is there something we can do to help drive this?

@TravisSpomer

Sorry, "What's an 'alphabet' in this context?" and the neighboring questions were supposed to be theoretical questions a person trying to comprehend the API would ask, not a literal question. I know what it means here, because I've read through that spec, but a normal person has not and will not. 🙂

@commonsensesoftware
Author

Roger that. Apologies, I didn't mean to spec-splain. 😛

@commonsensesoftware
Author

To be clear, padding will always be produced for encoding and required during decoding if the implementation uses the same vectorization approach that the existing Base64 uses. I can't think of a compelling reason why you wouldn't want to do that. This means that there are really only 3 options:

  1. Use padding in the same way as Base64 does; the onus of removing the padding characters is on the caller
  2. Always remove or append padding on behalf of the caller (as illustrated in the gist)
  3. Give the caller a choice by option

The encoding and decoding always have to follow #1 when using vectorization. If #2 isn't supported, then that means the caller has to add or remove padding if they don't want it. That's not a huge deal for something so low-level, but the ergonomics aren't great for the caller. Providing a choice to the caller has an impact, but it is incredibly small (measured in ns). With option #3, Required would just mean the data is encoded and decoded as is, whereas None would be the same strategy as #2. The Allowed option is more of an interop suggestion in a scenario where the caller might not know if the padding is expected; for example, when generated by an external source. It's not strictly necessary and could be eliminated. That just means the caller needs to determine if padding needs to be added/removed before calling Base64Url, which is reasonable.

The primary reason that padding is excluded from the encoded result is that = is often percent-escaped (e.g. %3D) when used in URLs. This would require double-encoding and additional allocations when encoding and decoding, which impacts performance. If the padding is absent, then escaping/unescaping a URL that contains a Base 64 URL value, but otherwise requires no escaping/unescaping, only has to validate (very cheap) and does not allocate.
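The escaping cost described above is easy to see with Python's urllib (for illustration only): a trailing = must be percent-escaped in a URL, while an unpadded token passes through unchanged.

```python
import base64
from urllib.parse import quote

padded = base64.urlsafe_b64encode(b"\x01\x02").decode()  # 'AQI='
print(quote(padded))    # 'AQI%3D' -- the '=' forces an escape (and a new string)
unpadded = padded.rstrip("=")
print(quote(unpadded))  # 'AQI'    -- survives URL escaping untouched
```

A validator that confirms no escaping is needed can therefore return the input as-is, which is where the "validate but not allocate" savings come from.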

If customers need a Base64 encoding with different alphabet they could continue using Convert.Base64 APIs and update the relevant chars as needed.

I'm not sure I follow this statement. The alphabet is intrinsically part of any of the data encoding schemes outlined in the RFC. The encoded characters cannot simply be updated. Base 64 and Base 64 Encoding with URL and Filename Safe Alphabet already use different alphabets. Specifically, the 62nd and 63rd characters for Base64Url are not the same as Base64.

I don't want to rat hole on alphabets - again. I'll be more than happy for Base64Url to land. The only reason I even mention it is because a design that can operate on such a concept, even if it starts internal to flush it out, could open the door for other encoding schemes such as Base 58 or Base 122 to land. It's not a requirement. I just wanted to turn some ⚙️⚙️ , but I'll let it be. When I last did a deep comparison of the two schemes, it seemed that the majority of the Base64 and forthcoming Base64Url implementations were nearly identical, save the alphabet. I guess we're not at that point in the design so it's irrelevant. Perhaps by the time you start implementing things, you'll come to another conclusion. 😉

The surface area for the proposed changes thus far seem reasonable IMO. 👍🏽

@buyaa-n
Member

buyaa-n commented Feb 22, 2024

I have not looked deeply into the implementation and don't want to go into implementation detail; also, I am not planning to implement it myself unless nobody else contributes. The padding-handling approach is based especially on the existing ASP.NET/Azure implementations we want to replace by adding this new API. Based on that, I would still prefer to ignore (meaning not require) padding on decode and skip/truncate padding on encode. I will reply to your proposed options inline:

  1. Use padding in the same way as Base64 does; the onus of removing the padding characters are on the caller
    • It is not only about removing; this means removing the padding after encode and adding it back before decode
  2. Always remove or append padding on behalf of the caller (as illustrated in the gist)
    • I guess it is close to what I am proposing: remove/skip padding on encode when the input is not a multiple of 24 bits; on decode, tolerate padding not being provided when it would otherwise be needed. The solution can be adding the missing padding back, but there could be other/better solutions, and that is an implementation detail. Plus, the implementation does not have to be the same as, or shared with, Base64, I assume.
  3. Give the caller a choice by option
    • This would need another parameter for almost every new API, and by my assumption the NoPadding option would be chosen in probably 99% of usage (correct me if I am wrong). If we go with this option I would propose options something like:
    public enum Base64UrlPadding
    {
        PaddingNone,
        PaddingEquals,
        PaddingPercentage,
    }

Anyway, the final decisions will be made by the API review board.

@stephentoub
Member

Thanks for the API proposal, @buyaa-n. I understand the logic of just replicating the Base64* surface area for Base64Url, but I'm wondering if we should actually instead just put everything into a Base64Url class rather than splitting it across that and Convert. I expect if we were starting over, we wouldn't have the APIs for Base64 on Convert at all (and maybe not Convert at all), and so while we've added more Base64 APIs there, it was only because it was the logical place for span-based versions of string-based versions that already existed. Base64Url is sufficiently its own thing that I don't think we need to mirror the Base64 APIs directly.

Further, I want to make sure we're considering all the different input / output types we want here. Looking for example at https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/blob/dev/src/Microsoft.IdentityModel.Tokens/Base64UrlEncoder.cs and https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/blob/dev/src/Microsoft.IdentityModel.Tokens/Base64UrlEncoding.cs, would we be exposing all the APIs necessary to efficiently go between, say, a UTF16 string to UTF8 bytes through Base64Url encoding and out to a string? Would the proposed APIs enable that to be achieved easily with minimal ceremony, or would it be more complicated for a consumer? Similarly, for an API like https://github.com/Azure/azure-sdk-for-net/blob/f65fb9760cbfd8894d40b3f02419d363690eb3bc/sdk/core/Azure.Core/src/Shared/Base64Url.cs#L32, are we making it easy and efficient to go from a Base64Url UTF16 string through UTF8 to a UTF16 string with minimal ceremony and overheads?

@buyaa-n
Member

buyaa-n commented Feb 22, 2024

I understand the logic of just replicating the Base64* surface area for Base64Url, but I'm wondering if we should actually instead just put everything into a Base64Url class rather than splitting it across that and Convert. I expect if we were starting over, we wouldn't have the APIs for Base64 on Convert at all (and maybe not Convert at all), and so while we've added more Base64 APIs there, it was only because it was the logical place for span-based versions of string-based versions that already existed. Base64Url is sufficiently its own thing that I don't think we need to mirror the Base64 APIs directly.

Sounds good to me

Further, I want to make sure we're considering all the different input / output types we want here. Looking for example at https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/blob/dev/src/Microsoft.IdentityModel.Tokens/Base64UrlEncoder.cs and https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/blob/dev/src/Microsoft.IdentityModel.Tokens/Base64UrlEncoding.cs, would we be exposing all the APIs necessary to efficiently go between, say, a UTF16 string to UTF8 bytes through Base64Url encoding and out to a string? Would the proposed APIs enable that to be achieved easily with minimal ceremony, or would it be more complicated for a consumer? Similarly, for an API like https://github.com/Azure/azure-sdk-for-net/blob/f65fb9760cbfd8894d40b3f02419d363690eb3bc/sdk/core/Azure.Core/src/Shared/Base64Url.cs#L32, are we making it easy and efficient to go from a Base64Url UTF16 string through UTF8 to a UTF16 string with minimal ceremony and overheads?

Good point, thank you! Did not really think about that, will look into and update the proposal as needed

@stephentoub
Member

There are also examples of this in ASP.NET, e.g.
https://github.com/dotnet/aspnetcore/blob/f26b79eb34d13648bedc5ec98e2151f8c55db978/src/Identity/Core/src/IdentityApiEndpointRouteBuilderExtensions.cs#L154
That has extra allocations and conversions along the way that should ideally all be handled internally (and avoided) by whatever APIs we expose here. Ideally https://github.com/dotnet/aspnetcore/blob/f26b79eb34d13648bedc5ec98e2151f8c55db978/src/Shared/WebEncoders/WebEncoders.cs could be deleted for .NET 9 as part of shipping this Base64Url type.
cc: @halter73

@halter73
Member

halter73 commented Feb 23, 2024

Ideally dotnet/aspnetcore@f26b79e/src/Shared/WebEncoders/WebEncoders.cs could be deleted for .NET 9 as part of shipping this Base64Url type.

We also have two public versions of Base64UrlTextEncoder that pass through to WebEncoders, but I'm okay with deprecating all of them if their usage can be trivially replaced by the new Base64Url and/or Convert APIs. I'm not sure it'd ever be worth it to actually delete anything, especially if they all pass through to the runtime APIs. We only just recently, in .NET 8, started shipping it as a package again to unblock customers who wanted to use it without an ASP.NET Core framework reference.

The scenario of doing Encoding.UTF8.GetString(WebEncoders.Base64UrlDecode(code)) or the reverse is very common, so I agree it'd be nice to support that more efficiently than the existing ASP.NET Core APIs do and avoid any extra intermediate allocations or conversions. In addition to all the copies of it in ASP.NET Core Identity code from templates, I found plenty of other projects doing the same thing using grep.app. For example, Bitwarden, IdentityManager2, and NATS.Net all have helpers for this, just to name a few.
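That common two-step pattern (base64url-decode, then UTF-8 decode) can be sketched like so, in Python for brevity; the helper name is hypothetical, and the point is that each step today produces an intermediate allocation the new API could avoid:

```python
import base64

def base64url_decode_to_string(code: str) -> str:
    # Re-add the padding the encoder stripped, then decode twice:
    # base64url -> bytes (one allocation) -> UTF-8 string (another).
    padded = code + "=" * (-len(code) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")

print(base64url_decode_to_string("aGVsbG8"))  # 'hello'
```

A combined API could go straight from the base64url text to the final string with a single output buffer.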

Obviously, we'd still want to support ReadOnly/Span<byte> for when we can skip the UTF16 string representation and keep everything UTF8, but a lot of things like query strings and headers can only be set or retrieved as strings.
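For illustration, the Encoding.UTF8.GetString(WebEncoders.Base64UrlDecode(code)) pattern above could be reworked against the proposed API roughly like this. This is a sketch only: Base64Url.TryDecodeFromChars is a proposed name from this issue, not a shipped API, and DecodeToday/DecodeProposed are hypothetical helper names.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;
using Microsoft.AspNetCore.WebUtilities;

static class Base64UrlStringHelpers
{
    // Today: two allocations per call (the byte[] from the decode, then the string).
    public static string DecodeToday(string code) =>
        Encoding.UTF8.GetString(WebEncoders.Base64UrlDecode(code));

    // Sketch against the proposed API: decode into a pooled buffer so that
    // only the final string is allocated.
    public static string DecodeProposed(string code)
    {
        // The decoded output is always shorter than the base64url input.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(code.Length);
        try
        {
            if (!Base64Url.TryDecodeFromChars(code, buffer, out int bytesWritten))
                throw new FormatException("Input is not valid base64url.");
            return Encoding.UTF8.GetString(buffer, 0, bytesWritten);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

An EncodeProposed counterpart would go the other way: UTF8-encode into a pooled buffer, then base64url-encode directly to the result string.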

cc: @keegan-caruso who just pinged me about improving the perf of Base 64 URL decoding in IdentityModel.

@MihaZupan
Copy link
Member

Wanting to insert new lines into Base64Url seems odd - do we need the Base64FormattingOptions overloads at all?

If we're embracing the new type over the split with Convert, should we also consider dropping the char[]/byte[] overloads?

@buyaa-n
Copy link
Member

buyaa-n commented Feb 23, 2024

Wanting to insert new lines into Base64Url seems odd - do we need the Base64FormattingOptions overloads at all?

Good point

If we're embracing the new type over the split with Convert, should we also consider dropping the char[]/byte[] overloads?

Could you explain why?

@buyaa-n
Copy link
Member

buyaa-n commented Feb 23, 2024

After applying the feedback, the API could look like this:

namespace System.Buffers.Text
{
    public static class Base64Url
    {
        // Encoding APIs, could just rename all with 'Encode' or 'TryEncode' without To***
        public static string EncodeToString(string str) { }
        public static string EncodeToString(byte[] inArray) { }
        public static string EncodeToString(byte[] inArray, int offset, int length) { }
        public static string EncodeToString(ReadOnlySpan<byte> bytes) { }
        public static int EncodeToCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut) { }
        public static bool TryEncodeToChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten) { }

        public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
        public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten);

        // Decoding APIs, also could rename all 'Decode' or 'TryDecode' without From***
        public static string DecodeFromString(string str) { }
        public static byte[] DecodeFromString(string str) { }
        public static byte[] DecodeFromCharArray(char[] inArray, int offset, int length) { }
        public static bool TryDecodeFromChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) { }
        public static bool TryDecodeFromString(string s, Span<byte> bytes, out int bytesWritten) { }

        public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
        public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten);

        // Max length APIs
        public static int GetMaxDecodedFromUtf8Length(int length);
        public static int GetMaxEncodedToUtf8Length(int length);
    }
}

@stephentoub @halter73 @MihaZupan PTAL

@MihaZupan
Copy link
Member

MihaZupan commented Feb 23, 2024

If we're embracing the new type over the split with Convert, should we also consider dropping the char[]/byte[] overloads?

Could you explain why?

Other than for trying to match other APIs, having both a byte[] and a ReadOnlySpan<byte> overload doesn't get you anything on Core where the former has an implicit conversion to the latter.

The array, offset, count overloads have some value, but it's trivial to array.AsSpan(offset, count) at the call site.


The decoded bytes could be anything. It feels odd to have string/chars overloads that assume an encoding, even if it's likely to be UTF8.

-public static string EncodeToString(string str) { } 
-public static string EncodeToString(byte[] inArray) { }
-public static string EncodeToString(byte[] inArray, int offset, int length) { }
+public static string EncodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
public static string EncodeToString(ReadOnlySpan<byte> bytes) { }

I would add an EncodeToString(ReadOnlySpan<char>) overload in case you already have a span so you don't have to pay for the string allocation for the argument.
If you have that overload, then you don't need one taking a string, as it will implicitly convert to the ReadOnlySpan<char>.


-public static int EncodeToCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut) { }
public static bool TryEncodeToChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten) { }

Would EncodeToCharArray throw for too small destinations?
If we just expose the span-based TryEncodeToChars, I think it would be more obvious what the behavior is.


-public static string DecodeFromString(string str) { } 
-public static byte[] DecodeFromString(string str) { }
-public static byte[] DecodeFromCharArray(char[] inArray, int offset, int length) { }
+public static string DecodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
+public static byte[] DecodeToByteArray(ReadOnlySpan<char> chars) { }

You can't overload just on the return type.
Similar to above, we can offer overloads taking spans instead of just strings.


-public static bool TryDecodeFromChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) { }
-public static bool TryDecodeFromString(string s, Span<byte> bytes, out int bytesWritten) { }
public static bool TryDecodeFromChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) { }

If you have a ReadOnlySpan<char> overload, you can call it with a string. I don't think we need multiple method names.


Together, that might look something like

namespace System.Buffers.Text;

public static class Base64Url
{
    // Encode bytes => utf8
    public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten);

    // Decode from utf8 => bytes
    public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten);

    // Max length APIs
    public static int GetMaxDecodedFromUtf8Length(int length);
    public static int GetMaxEncodedToUtf8Length(int length);

    // IsValid
    public static bool IsValid(ReadOnlySpan<char> base64UrlText);
    public static bool IsValid(ReadOnlySpan<char> base64UrlText, out int decodedLength);
    public static bool IsValid(ReadOnlySpan<byte> base64UrlTextUtf8);
    public static bool IsValid(ReadOnlySpan<byte> base64UrlTextUtf8, out int decodedLength);

    // Up to this point, this is a mirror of System.Buffers.Text.Base64
    // Below are more helpers that bring over functionality similar to Convert.*Base64*

    // Encode to / decode from chars
    public static bool TryEncodeToChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten) { }
    public static bool TryDecodeFromChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) { }


    // These are just accelerator methods.
    // Should be efficiently implementable on top of the other ones in just a few lines.

    // Encode to string
    public static string EncodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
    public static string EncodeToString(ReadOnlySpan<byte> bytes) { }

    // Decode from chars => string
    // Decode from chars => byte[]
    // The names could also just be "Decode" without naming the return type
    public static string DecodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
    public static byte[] DecodeToByteArray(ReadOnlySpan<char> chars) { }
}

That would get you

ENCODING    From chars                        From bytes
To string   EncodeToString(chars, encoding)   EncodeToString(bytes)
To byte[]   -                                 -
To chars    -                                 TryEncodeToChars
To utf8     -                                 EncodeToUtf8

DECODING    From chars                        From utf8
To string   DecodeToString(chars, encoding)   -
To byte[]   DecodeToByteArray(chars)          -
To chars    -                                 -
To bytes    TryDecodeFromChars                DecodeFromUtf8

Do we care about other combinations?
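To illustrate the "accelerator" point above, EncodeToString should indeed be implementable in a few lines on top of TryEncodeToChars. This is only a sketch: every Base64Url member used here is from the proposal in this thread, not a shipped API.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

public static class Base64UrlAccelerators
{
    // Sketch: rent a char buffer large enough for the worst case,
    // encode into it, and allocate only the final string.
    public static string EncodeToString(ReadOnlySpan<byte> bytes)
    {
        char[] chars = ArrayPool<char>.Shared.Rent(Base64Url.GetMaxEncodedToUtf8Length(bytes.Length));
        try
        {
            // The rented buffer is at least the max encoded length, so this cannot fail.
            Base64Url.TryEncodeToChars(bytes, chars, out int charsWritten);
            return new string(chars, 0, charsWritten);
        }
        finally
        {
            ArrayPool<char>.Shared.Return(chars);
        }
    }
}
```

DecodeToByteArray and DecodeToString would follow the same shape, built on TryDecodeFromChars plus an Encoding.GetString at the end.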

@buyaa-n
Copy link
Member

buyaa-n commented Feb 24, 2024

Thank you @MihaZupan, the updates make sense to me. We might want to rename the utf8 bytes as it's confusing, but that might not be a good idea when similar APIs already exist in System.Buffers.Text.Base64.

@MihaZupan MihaZupan added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Feb 27, 2024
@keegan-caruso
Copy link

keegan-caruso commented Mar 8, 2024

I work on the Identity Model project; the current proposal should work for us to remove our implementation.

@brentschmaltz
Copy link

brentschmaltz commented Mar 21, 2024

In IdentityModel, we have scenarios where we will have a base64UrlEncoded JWT as a Span<byte>.
We will want to obtain the utf8 bytes to feed to the Utf8JsonReader to parse the JWT.

@stephentoub
Copy link
Member

In IdentityModel, we have scenarios where we will have a base64UrlEncoded JWT as a Span. We will want to obtain the utf8 bytes to feed to the Utf8JsonReader to parse the JWT.

@brentschmaltz, could you share what that would translate to in terms of your ideal API in the proposal above?

@brentschmaltz
Copy link

brentschmaltz commented Mar 21, 2024

@stephentoub consider a protocol, such as SHR, which is a JWS inside a JWS (the envelope).
One specific claim in the envelope is the JWS.
We are reading the envelope with the Utf8JsonReader that was instantiated with a buffer we own.

When we hit that claim, we will remember where it was in the buffer, and pass a Span to the decoder when it is time to process.

We don't have to worry about escaping as the value must be base64url encoded or we will fault on the decoding.

Something simple like:

Decode(ReadOnlySpan<byte> source, Span<byte> destination, int size)

@MihaZupan
Copy link
Member

That would be Base64Url.DecodeFromUtf8 in the above proposal
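For the JWT scenario described above, that might look roughly like the following. This is a sketch against the proposed API (Base64Url does not exist yet), and ProcessInnerJws/envelopeSlice are hypothetical names for illustration.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text.Json;

static class InnerJwsReader
{
    // envelopeSlice: the base64url-encoded JWS claim value remembered
    // while reading the envelope with a Utf8JsonReader.
    public static void ProcessInnerJws(ReadOnlySpan<byte> envelopeSlice)
    {
        byte[] utf8Json = new byte[Base64Url.GetMaxDecodedFromUtf8Length(envelopeSlice.Length)];

        OperationStatus status = Base64Url.DecodeFromUtf8(envelopeSlice, utf8Json, out _, out int bytesWritten);
        if (status != OperationStatus.Done)
            throw new FormatException("Claim value is not valid base64url."); // fault on bad input

        var reader = new Utf8JsonReader(utf8Json.AsSpan(0, bytesWritten));
        while (reader.Read())
        {
            // ... parse the inner JWS claims ...
        }
    }
}
```

Since the input and output are both UTF8 bytes, no UTF16 string ever needs to be materialized along the way.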

@brentschmaltz
Copy link

Perfect, love it.
Just making sure.

@KrzysztofCwalina
Copy link
Member

Could we make it NS 2.0? This is so heavily used in the Azure SDK and we keep duplicating this all over the place.

@am11
Copy link
Member

am11 commented Apr 9, 2024

That was exactly my (Service Fabric) scenario where we needed it in non-aspnetcore service apps (#31099). Also Base64 VLQ (#15237) is popular in Google world.

@bartonjs
Copy link
Member

bartonjs commented Apr 9, 2024

Video

  • GetMaxEncodedToUtf8Length(int length) => GetEncodedLength(int bytesLength)
  • GetMaxDecodedFromUtf8Length(int length) => GetMaxDecodedLength(int base64Length)
  • We improved the parity between UTF8 and char source/destination
  • We added a lot of convenience/aesthetics overloads for the OperationStatus methods
  • In IsValid for utf8, we moved the utf8 marker from a suffix to a prefix, to match IUtf8SpanFormattable

This type is needed for .NET Standard 2.0. System.Memory.nupkg is the most obvious destination, but we may have problems with that which require a net new package.

namespace System.Buffers.Text;

public static class Base64Url
{
    public static int GetMaxDecodedLength(int base64Length);
    public static int GetEncodedLength(int bytesLength);

    public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static int EncodeToUtf8(ReadOnlySpan<byte> source, Span<byte> destination);
    public static bool TryEncodeToUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int charsWritten);
    public static byte[] EncodeToUtf8(ReadOnlySpan<byte> source);

    public static OperationStatus EncodeToChars(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool isFinalBlock = true);
    public static int EncodeToChars(ReadOnlySpan<byte> source, Span<char> destination);
    public static bool TryEncodeToChars(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static char[] EncodeToChars(ReadOnlySpan<byte> source);
    public static string EncodeToString(ReadOnlySpan<byte> source);

    public static bool TryEncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten);

    public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static int DecodeFromUtf8(ReadOnlySpan<byte> source, Span<byte> destination);
    public static bool TryDecodeFromUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static byte[] DecodeFromUtf8(ReadOnlySpan<byte> source);

    public static OperationStatus DecodeFromChars(ReadOnlySpan<char> source, Span<byte> destination, out int charsConsumed, out int bytesWritten, bool isFinalBlock = true);
    public static int DecodeFromChars(ReadOnlySpan<char> source, Span<byte> destination);
    public static bool TryDecodeFromChars(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);
    public static byte[] DecodeFromChars(ReadOnlySpan<char> source);

    public static int DecodeFromUtf8InPlace(Span<byte> buffer);

    public static bool IsValid(ReadOnlySpan<char> base64UrlText);
    public static bool IsValid(ReadOnlySpan<char> base64UrlText, out int decodedLength);
    public static bool IsValid(ReadOnlySpan<byte> utf8Base64UrlText);
    public static bool IsValid(ReadOnlySpan<byte> utf8Base64UrlText, out int decodedLength);
}
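A round trip with the approved convenience overloads might look like this. Illustrative only: the type was approved for a future release (targeting .NET 9 plus a down-level package), so this will not compile against earlier TFMs.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

byte[] payload = Encoding.UTF8.GetBytes("{\"alg\":\"HS256\",\"typ\":\"JWT\"}");

// Encode straight to a string; base64url uses '-' and '_' instead of
// '+' and '/', and emits no '=' padding.
string encoded = Base64Url.EncodeToString(payload);

// Size the destination with GetMaxDecodedLength, then decode from the
// string (implicitly converted to ReadOnlySpan<char>).
byte[] decoded = new byte[Base64Url.GetMaxDecodedLength(encoded.Length)];
int bytesWritten = Base64Url.DecodeFromChars(encoded, decoded);

Console.WriteLine(Encoding.UTF8.GetString(decoded, 0, bytesWritten));
```

The int-returning DecodeFromChars throws on invalid input or a too-small destination; the Try* and OperationStatus overloads cover the non-throwing and streaming cases.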

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 9, 2024
@jkotas
Copy link
Member

jkotas commented Apr 10, 2024

System.Memory.nupkg is the most obvious destination, but we may have problems with that which require a net new package.

Are we going to follow the established Microsoft.Bcl.* naming pattern for the NS2.0 down-level package?

@davidfowl
Copy link
Member
