Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One-shot PEM reader #29588

Closed
owlstead opened this issue May 17, 2019 · 32 comments · Fixed by #32260
Closed

One-shot PEM reader #29588

owlstead opened this issue May 17, 2019 · 32 comments · Fixed by #32260
Labels
api-approved API was approved in API review, it can be implemented area-System.Security
Milestone

Comments

@owlstead
Copy link

owlstead commented May 17, 2019

I was giving some hints on this StackOverflow question where the idea of a PEM reader / writer for .NET was coming up. Jeremy then posted that if such a reader was available that it would be a nice idea to make it a one-shot, RFC 7468 compliant reader using the new Span<char> framework.

So I've created a sample implementation / API proposal which performs the parsing of the "lax" (i.e. permissive). I've used the following additional goals / non-goals:

Goals:

  • no dynamic memory allocation that depends on the size of the input;
  • possibility to skip a description (in front) of the PEM structure;
  • possibility to detect the end of the PEM structure, if present;
  • related: allow for parsing multiple PEM structures from a single Span;
  • design in the spirit of other methods that use Span;

Non-goals:

  • returning the description in front of the PEM file;
  • detailed error reporting;
  • streaming implementation (requiring state machine or buffering);
  • parsing the base64 inside the header / footer lines (pre/post encapsulation boundaries);
  • support for stricter versions of PEM;
  • return or comparison of the label in the footer.

Niceties:

  • validation of the base64 inside the header / footer lines
  • return the amount of bytes so that a correctly sized buffer can be created in advance for the base64 decryption (yes, implemented).

Parsing the base64 is easily done afterwards with the existing parser, and we may just want to skip the PEM.

Note that I didn't get my Unit testing working for Core 3, so the implementation has not yet received any significant testing. First let me know what you think.

Implementation notes: this implementation searches for the post-encapsulation boundary (footer) before validating the base64. It might be possible to use a state machine instead to create a single pass parser. This adds significant complexity to the parsing though, and searching for a static string should be relatively performant already.

Code in followup comment for now, as I don't know yet where to put it.


(Edited in):

API Proposal

namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields) => throw null;
        public static PemFields Find(ReadOnlySpan<char> pemData) => throw null;

        public static int GetEncodedSize(int labelLength, int dataLength) => throw null;
        public static bool TryWrite(ReadOnlySpan<char> label, ReadOnlySpan<byte> data, Span<char> destination, out int charsWritten) => throw null;
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data) => throw null;
    }

    public readonly struct PemFields
    {
        internal PemFields(Range label, Range base64data, Range location, int decodedDataLength) => throw null;

        public Range Location { get; }
        public Range Label { get; }
        public Range Base64Data { get; }
        public int DecodedDataLength { get; }
    }
 }
@owlstead
Copy link
Author

owlstead commented May 17, 2019

Here is the proposed API / implementation code:

using System;
using System.Collections.Generic;
using System.Text;

// bogus namespace
namespace PEMOneShotNamespace
{
    // bogus class name
    public class PEMOneShot
    {
        private const String PREEB_PREFIX = "-----BEGIN ";
        private const String PREEB_POSTFIX = "-----";
        private const String POSTEB_PREFIX = "-----END ";
        private const String POSTEB_POSTFIX = PREEB_POSTFIX;

        private const int NOT_FOUND = -1;

        /// <summary>
        /// Finds a PEM data structure and returns the location of the label, the location of the base64 encoded contents and the end of the PEM structure.
        /// This method follows the "lax" definition of PEM within RFC 7468 for maximum compatibility.
        /// </summary>
        /// <remarks>
        /// <para>
        /// This method does not allow additional header lines to be present; anything between the encapsulation boundaries is considered to be base64.
        /// </para>
        /// <para>
        /// The <code>label</code> and <code>contents</code> may will not have a valid value and may be <code>null</code> if no valid PEM structure is found.
        /// The <code>bytesInContent</code> and <code>endOfPEM</code> values will be set to -1 and 0 respectively if no valid PEM structure is found.
        /// </para>
        /// </remarks>
        /// 
        /// <param name="pemData">Character buffer that - possibly - contains a valid PEM structure.</param>
        /// <param name="label">Output parameter which is a slice in the <code>pemData</code> that denotes the validated label.</param>
        /// <param name="contents">Output parameter which is a slice in the <code>pemData</code> that denotes the validated base64 encoding.</param>
        /// <param name="bytesInContent">Output parameter that shows the precise number of bytes in the base64 content.</param>
        /// <param name="endOfPEM">Output parameter set to the next character after the PEM structure.</param>
        /// <returns><code>true</code> if and only if a valid PEM structure is found</returns>
        public static bool FindNextPem(ReadOnlySpan<char> pemData, out ReadOnlySpan<char> label, out ReadOnlySpan<char> contents, out int bytesInContent, out int endOfPEM)
        {
            // define default output values returned if no PEM structure or an erroneous PEM structure is detected
            label = null;
            contents = null;
            bytesInContent = NOT_FOUND;
            endOfPEM = 0;

            // note that IndexOf always starts to parse at the start of a span, so we need to create multiple spans
            ReadOnlySpan<char> curSpan = pemData;

            int startPreeb = curSpan.IndexOf(PREEB_PREFIX);
            if (startPreeb == NOT_FOUND)
            {
                return false;
            }
            int startPreebLabel = startPreeb + PREEB_PREFIX.Length;

            curSpan = pemData.Slice(startPreebLabel);

            int endPreebLabel = curSpan.IndexOf(PREEB_POSTFIX);
            if (endPreebLabel == NOT_FOUND)
            {
                return false;
            }
            endPreebLabel += startPreebLabel;
            int endPreeb = endPreebLabel + PREEB_POSTFIX.Length;

            ReadOnlySpan<char> preebLabelSlice = pemData.Slice(startPreebLabel, endPreebLabel - startPreebLabel);
            if (!ValidateLabel(preebLabelSlice))
            {
                return false;
            }

            curSpan = pemData.Slice(endPreeb);

            int startPosteb = curSpan.IndexOf(POSTEB_PREFIX);
            if (startPosteb == NOT_FOUND)
            {
                return false;
            }
            startPosteb += endPreeb;

            int startPostebLabel = startPosteb + POSTEB_PREFIX.Length;

            curSpan = pemData.Slice(startPostebLabel);

            int endPostebLabel = curSpan.IndexOf(POSTEB_POSTFIX);
            if (endPostebLabel == NOT_FOUND)
            {
                return false;
            }
            endPostebLabel += startPostebLabel;
            int endPosteb = endPostebLabel + POSTEB_POSTFIX.Length;

            ReadOnlySpan<char> postebLabel = pemData.Slice(startPostebLabel, endPostebLabel - startPostebLabel);
            if (!ValidateLabel(postebLabel))
            {
                return false;
            }

            // perform base64 validation at the end
            ReadOnlySpan<char> contentSlice = pemData.Slice(endPreeb, startPosteb - endPreeb);
            if (!ValidateAndCountBase64Bytes(contentSlice, out int bytes))
            {
                return false;
            }

            label = preebLabelSlice;
            contents = contentSlice;
            bytesInContent = bytes;
            endOfPEM = endPosteb;
            return true;
        }

        /// <summary>
        /// Validates that the content consists of valid base64: that it doesn't contain invalid characters,
        /// too many or misplaced base64 padding characters.
        /// </summary>
        /// <param name="base64">The base64 encoded input array, using standard base64 with padding</param>
        /// <param name="bytes">The number of bytes enocoded by the base64 encoding</param>
        /// <returns><code>true</code> if and only if valid base64 is found</returns>
        private static bool ValidateAndCountBase64Bytes(ReadOnlySpan<char> base64, out int bytes)
        {
            // TODO find out if the base64 codec of .NET is compatible with the whitespace (especially w.r.t. the padding characters)

            bytes = NOT_FOUND;

            int count = 0;
            int padCharFound = 0;
            int offset = 0;

            int end = base64.Length;
            while (offset < end)
            {
                char c = base64[offset++];
                // TODO test if IsLetterOrDigit is not too permissive (Unicode after all)
                if (Char.IsLetterOrDigit(c) || c == '+' || c == '/')
                {
                    if (padCharFound > 0)
                    {
                        return false;
                    }
                    count++;
                    continue;
                }
                // TODO test if IsWhiteSpace is not too permissive (Unicode after all)
                if (Char.IsWhiteSpace(c))
                {
                    continue;
                }
                if (c == '=')
                {
                    count++;
                    if (++padCharFound > 2)
                    {
                        return false;
                    }
                    continue;
                }
                return false;
            }
            if (count % 4 != 0)
            {
                return false;
            }

            bytes = (count / 4) * 3 - padCharFound;
            return true;
        }

        /// <summary>
        /// Validates that the label confirms to RFC 7468.
        /// </summary>
        /// <param name="label">The label to validate.</param>
        /// <returns><code>true</code> if and only if the label consists of valid characters.</returns>
        private static bool ValidateLabel(ReadOnlySpan<char> label)
        {
            // set to true to automatically detect space or hyphen at start of label
            bool previousSpaceOrHyphen = true;
            foreach (char c in label)
            {
                // we'll handle spaces 0x20 and hyphens 0x2D separately
                if (c < 0x20 || c > 0x7E)
                {
                    return false;
                }
                if (c == ' ' || c == '-')
                {
                    if (previousSpaceOrHyphen)
                    {
                        return false;
                    }
                    previousSpaceOrHyphen = true;
                }
                else
                {
                    previousSpaceOrHyphen = false;
                }
            }

            // do not end with space or hyphen
            if (previousSpaceOrHyphen)
            {
                return false;
            }
            return true;
        }
    }
}

PS removed one or two copy / paste bugs during edit, nothing big

@bartonjs
Copy link
Member

Thanks, @owlstead!

From a requirements perspective, things that I can see someone wanting to do:

  • Find where the next PEM starts to do something with the header/preamble/whatever-you-want-to-call-it
    • This seems to be possible... it's the offset of the label span from the input span, minus 11 ("-----BEGIN ".Length)
  • Know where the PEM ends so they can start their next search there
    • I think that's the endOfPEM value
  • Get the span to throw at Convert.FromBase64Chars
    • That's contents. Good there.
    • Yeah, Convert.FromBase64Chars handles whitespace in the payload
  • Know what the label is
    • label.
  • Know how big the base64Decode output will be. (I hadn't thought of that before, nice)
    • bytesInContent

So I think, functionality-wise, it meets the needs.

Other things that might be reasonable for PEM (to help identify a type container for this)

  • Write a ReadOnlySpan<byte> and a label to an output
    • Perhaps to a Span<char>, or a Stream, etc.
  • Write an AsnWriter and a label to an output, when AsnWriter becomes public.
    • If they're in the same assembly it can more efficiently borrow the writer's internal buffer

Restricting to the cryptography uses of PEM (RFC 7468) makes all of this so much easier than my original thoughts, since the header/attributes section is out of scope.

Additionally, I know I'm the one who suggested (ReadOnlySpan<char>, out ReadOnlySpan<char>, out ReadOnlySpan<char>), but I'm wondering if the extra data makes the signature too unwieldy and that it'd be better to return the parse state as a struct.

I'll ignore my last sentence for now, to give the theoretical state of

// assembly: System.Security.Cryptography.Encoding
namespace System.Security.Cryptography
{
    public static partial class PemEncoding
    {
        public static bool TryFindNextPem(
            ReadOnlySpan<char> pemData,
            out ReadOnlySpan<char> label,
            out ReadOnlySpan<char> contents,
            out int contentDecodeLength,
            out int charsRead);

        public static bool TryWriteData(
            Span<char> destination,
            ReadOnlySpan<char> label,
            ReadOnlySpan<byte> data,
            out int charsWritten);

        // The formula, while straightforward, is probably too much to assume a caller would want
        // to redefine, so this is a helper for TryWriteData to see if destination is too small.
        public static int GetOutputSize(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);

        public static void WriteData(
            TextWriter destination,
            ReadOnlySpan<char> label,
            ReadOnlySpan<byte> data);
    }
}

I'm not entirely sold on the method names of any of the methods, just starting the creative juices flowing. In particular, it seems like AsnWriter variants should be proper overloads, so the "Write" name should unify.

public static void WriteData(
    TextWrite destination,
    ReadOnlySpan<char> label,
    AsnWriter data);

Feels... wrongish.

@owlstead
Copy link
Author

owlstead commented May 17, 2019

OK, I can see that giving the start of the PEM might also be a good idea, as that will automatically show how to get the "description" or pre-amble, where needed (presuming that the right Span is given, excluding any previous PEM as you would need the start offset). We could create a PEM class with a static methods and Content class.

The Content class would be a simple class with a single cto and getters (I don't see a use case for changing existing PEM content). We could still return a bool as to be compliant withe the general signature of the Try... methods and return the data class or struct in anout parameter.

Maybe we should return the label as a string rather than a Span, with the disadvantage that we should then think of an upperbound. I don't see any other use of the label other than as a string. Say 64 characters should be plenty (it would already be more than the default PEM line size after all, so nobody with any sense would create such a label).

Otherwise I would largely agree with your interface specification. I could, if you want, think of a ReadData method implementation as well (using an additional pass maybe, who cares, CPU's are plenty fast anyway). I'm visiting my family, will post an enhanced API spec. later.

@bartonjs
Copy link
Member

The advantage of emitting the label as a ReadOnlySpan<char> vs a string is that it is just a projection over the input. If the caller needs it as a System.String they can call ToString() on the span value.

While it's also true that a string can go to a (read-only) span, the string required making a new object and a copy (since .NET strings are length-prepended for performance and null-terminated for interop).

So a string (or System.Range) would make sense for returning it as a field in a parse structure, but the span has advantage when it can be done via the out parameter.

@owlstead
Copy link
Author

owlstead commented May 18, 2019

As for that last comment, I don't see any advantage myself to keep label to a Span. Sure ToString or new String is easily performed, but you've got to ask yourself why the Span is ever needed in the first place. From a user point of view it doesn't make sense and copying such small amounts doesn't have any performance benefits, in my opinion. It feels like over-optimization to me.

So what seems natural to me is to supply:

// assembly: System.Security.Cryptography.Encoding
namespace System.Security.Cryptography
{
    public partial class PEM
    {
        public ref struct FieldsWithBase64
        {
            public string label;
            public ReadOnlySpan<char> contentInBase64;
            public int contentSizeBytes;
        }

        public ref struct Fields
        {
            public string label;
            public ReadOnlySpan<byte> content;
        }

        public static bool TryFindPem(ReadOnlySpan<char> pemData, out FieldsWithBase64 fields, out Range location);

        public static int GetOutputSizePem(Fields fields);

        public static bool TryWritePem(Span<char> destination, Fields fields, out int pemSize);
    }
}

I've used "Smurf naming conventions" to allow for using static to be used for the static methods.

Other changes in order of appearance:

  • renamed PEMEncoding to just PEM for simplicity sake (debatable, but PEM is an encoding by itself, can change back);
  • structs introduced (FieldsWithBase64 should really be FieldsWithContentsInBase64 but yeah...);
  • renamed method to TryFindPem as it may be trying to find the first PEM struture;
  • renamed method TryWriteData to TryWritePem as I found Data to be too generic, and for symmetry with TryFindPem.

Most importantly I removed WriteData from the class in its entirety as it would require a full rewrite (unless you would count creating a Span<char> over a new char[] as acceptable solution). It also mixes high level streaming and low level spanning components, which seems a bad idea in itself (where would it end?). It could of course be added later.

Very much interested in your opinions!

Oh, yeah, got Unit testing working and have a working implementation of above. Added constructors and methods to ref structs, with a method FieldsWithBase64.GetFields which performs the base64 decoding as convenience method.

@owlstead
Copy link
Author

owlstead commented May 19, 2019

Oh, I can test things now. I used command line to create a test project and then imported in VS. However, before I test I would like to know if the API is OK-ish this way.

Note that I'm personally not that happy with evaluating my own application. I am critical enough to think of some pretty nasty tests (boundary tests), but I cannot change my own thought process. So some review will of course be required.

@bartonjs
Copy link
Member

but you've got to ask yourself why the Span is ever needed in the first place

When reading a large multi-PEM (like /etc/ssl/certs.pem) the allocations of each string would cause the GC to kick in more often, which can have impact on servers. Of course, the API could internally avoid that by caching known (or encountered) strings and returning the string that is equivalent to the identified span... so then it comes down to a) always allocate a string (easy, usable, higher GC impact), b) returned shared strings (more work, usable, low GC impact), c) just return a span (easy, less friendly than string, no GC impact).

Since very few different names are expected I'm sure that (a) would end up being (b) if we switched to using this to read /etc/ssl/certs.pem on Linux.

Type name:

RSA, DSA, and ECDsa definitely already violate the rule, but the rule is that acronyms should either be expanded or be treated as a simple word, so "PEM" (Privacy Enhanced Mail) would be "Pem". I threw "Encoding" on it because "Pem" as a standalone typename felt weird, and it's using the data encoding model from PEM without actually being about Privacy Enhanced Mail. (The actual guideline is "DO capitalize only the first character of acronyms with three or more characters, except the first word of a camel-cased identifier.")

Methods/Structs:

  • For the structs the exposed members should use PascalCasing, even if fields. Ideally the struct would be public readonly ref struct.
  • I'm on the fence for multiple outs vs making a ref struct.
  • I don't think that the struct for write adds value.
  • I think we have a nebulous guideline that says that if there's a write-to-a-span method there should also be a "make a properly sized array and return it" method, because they're easier to use.
    • The method to precalculate the size might obviate it, but it still means that you should check for the false return
  • The label and contents for the size calculator really only need to be the lengths, thinking about it.
  • I expect that the API review would suggest the method to write to a Stream (which I'd then counter that we're producing chars, not bytes, so a TextWriter would be more appropriate). But it can be left off until such a review meeting.
    • FWIW: The value is that it would allow the API internals to do something like a 64-char stackalloc, do 48 bytes at a time base64 encoding, and calling WriteLine with the 64 chars; avoiding the need for contiguous memory on a large payload (and avoids inserting / gapping for newlines).
  • I think we discourage "Smurf naming", and that we optimize for type-qualified invocation over using static. But I don't see a rule for it.
  • On the TryWrite, the convention is that if it writes to a Span<byte> there's an out int bytesWritten, for Span<char> it's out int charsWritten, and for any other Span type else it's (IIRC) out int written). It's just the convention we have (and will eventually be codified), so I've gone ahead and restored the out.

So, using "on the fence" means "don't change it", I'd suggest

// assembly: System.Security.Cryptography.Encoding
namespace System.Security.Cryptography
{
    public static partial class PemEncoding
    {
        public readonly ref struct FieldsWithBase64
        {
            public FieldsWithBase64(string label, ReadOnlySpan<char> contentInBase64, int contentSizeBytes);
            public string Label;
            public ReadOnlySpan<char> ContentInBase64;
            public int ContentSizeBytes;
        }

        public static bool TryFindPem(ReadOnlySpan<char> pemData, out FieldsWithBase64 fields, out Range location);

        public static int GetOutputSize(int labelLength, int contentBytesLength);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> contentBytes, out int charsWritten);
    }
}

And a formal API review would likely add public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> contentBytes) and public static void Write(TextWriter destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> contentBytes). API review might also do more cosmetic changes on the names.

Testing:

I'd just expect tests to go in src/System.Security.Cryptography.Encoding/tests/, along with the rest of the xunit tests. After you build src/System.Security.Cryptography.Encoding/ref/ with the new API structure the tests should be able to reference it. (msbuild /t:BuildAndTest /p:ForceRunTests=true to run the tests (even if they succeeded previously), and they should automatically see updated binaries after src/System.Security.Cryptography.Encoding/src is rebuilt.

One test that I think you're missing, from the current code: reading a payload with mismatched labels:

-----BEGIN A-----
base64==
-----END B-----

However, before I test I would like to know if the API is OK-ish this way.

The usual flow, per https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md, is

  • Issue gets opened (done)
  • Discussion on the issue ends up with what the relevant parties consider to be the appropriate shape (ongoing)
  • The issue gets marked api-ready-for-review
  • The issue gets discussed during an API review meeting and ends up as
    • api-approved as-is
    • api-approved with suggestions (e.g. rename some things)
    • api-needs-work with questions/comments (and the issue is now back in "Discussion")
  • Someone takes on the issue and does the PR for source and test.

Of course, sometimes it takes actually doing the implementation and tests to understand how the API really needs to work.. Mainly I'm sharing that just because you and I think we've hit a good state, doesn't mean the (rest of the) API review will agree 😄.

@bartonjs
Copy link
Member

bartonjs commented May 19, 2019

@vcsjones / @onovotny: Do either of you have any API feedback here? Missing scenarios, over-engineering, et cetera?

@vcsjones
Copy link
Member

vcsjones commented May 19, 2019

From a user point of view it doesn't make sense and copying such small amounts doesn't have any performance benefits, in my opinion. It feels like over-optimization to me.

I'm leaning toward Label being a ROS<char> here if not because of performance for this specifically but because the whole API proposal seems geared toward being efficient. The use of a ref struct presents some use constraints of a developer. It seems odd to use a ref struct and ROS<char> for the content but then allocate for the label.

If I were using this API (which is a great idea!) my typical usage pattern would probably be something like:

// If ROS<char>
FieldsWithBase64 fields = //get fields;
if (fields.Label.SequenceEqual("CERTIFICATE")) {
  //
}

// If string
FieldsWithBase64 fields = //get fields;
if (fields.Label == "CERTIFICATE") {
  //
}

So I don't think using ROS<char> feels like an optimization at any significant cost to the developer. That is if my supposed typical-usage is close.

If we feel that a ROS<char> for Label is over-optimized then I might suggest that applies to much of the rest of the API proposal, and change the whole struct to not be ref and instead operate entirely on ReadOnlyMemory<char>. That will let developers use the result as they are used to being able to use things, like return it from a method, add them to a List<T>, or put the result in a field.

I don't actually like my idea, I much prefer {ReadOnly}Span<char> here, but if we are OK with allocations then we can use that to the advantage of making the API more developer-friendly.


In the proposal the FieldsWithBase64 contains actual fields. Most of the API surface in the rest of the framework uses properties, even on readonly ref structs (e.g. Span<T> itself). Any particular reason to use fields here? I don't see any harm in it either, but it feels unconventional to me to add API surface exposing fields.


ContentSizeBytes seems ambiguous to me. DecodedContentSize or DecodedContentSizeBytes might help folks understand what the size refers to (though I have a personal distaste for putting Bytes in member names to indicate the size refers to bytes)

@owlstead
Copy link
Author

owlstead commented May 20, 2019

OK, so I've tried to take all your comments into consideration and came up with the following API, which is currently implemented and running some tests; I've simply refactored.

    public static class PemEncoding
    {
        public readonly ref struct Base64Fields
        {
            public Base64Fields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

            public readonly ReadOnlySpan<char> Label;

            public readonly ReadOnlySpan<char> Base64;

            public readonly int EncodedByteSize;

            public byte[] DecodeBase64();
        }

        public static bool TryFind(ReadOnlySpan<char> pemData, out Base64Fields fields, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);

        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);

        // implementation only

        private static bool ValidateLabel(ReadOnlySpan<char> label);

        private static bool ValidateAndCountBase64Bytes(ReadOnlySpan<char> base64, out int encodedByteSize);

        private static int SkipWhiteSpace(ReadOnlySpan<char> pemData, int startOffset)
    }

I've checked that the writers write all the PEM structures in the RFC, doing a full compare afterwards. I'm also skipping / parsing the following PEM's successfully even though they are obviously tosh:

"-----BEGIN BUT-SKIP----------END BUTT-SKIP-----We can read this -----BEGIN ig g-y-----A Q A B-----END pop-----eye"

Then I can write them again as strictly formatted PEM.

I've added a SkipWhiteSpace implementation method. The previous implementation of TryFind did not skip blanks and / or the first line ending after the post-encapsulation boundary (aka footer), which it should do (also according to the RFC).

I'm not completely happy with the EncodedByteSize and GetEncodedSize identifier names as they both contain the name Encoded and rely too much on the context to understand. What about Base64EncodedDataSize and GetEncodedPemSize? They are slightly larger, but more clear in my opinion.

Note that I'm quite the polyglot w.r.t. PL, but sometimes it takes a bit of time to get up to speed with e.g. C# best practices - I'm pretty much rooted in Java, taking up Kotlin as well.

@owlstead
Copy link
Author

OK, let's discuss the two bears left: streaming support and error handling.


The problem with streaming is that it doesn't allow to back track a stream. However, it is required if e.g. large CMS structures need to be supported because those can be of arbitrary length and not fit into memory. It is useful but maybe not required if PEM structures are present in a stream e.g. received through HTTP.

As for the large data structures: it seems .NET only supports encoding of data in a byte[] currently. So although streaming would definitely be useful for large messages, I don't see how we could do anything with the result that doesn't require at least one full copy in memory (correct me if I'm wrong about this). If the message size is not known in advance then you can simply create a buffer of a particular size and stop if it doesn't fit. So basically I think a streaming API doesn't seem to be useful in the short run.

It is relatively easy to add streaming for the generating part, but it would not be a symmetric API. I think I saw some requests to do this, so we could basically copy the generation code as I don't see any generic way of merging the two.


Now for the error handling. Currently the TryFind message simply returns false if there is any error in the PEM, even if a full preeb (header line) is found and if there is a base64 encoding mistake. This doesn't feel right to me, I'd say that if there is a full preeb that an exception should be thrown if there are any mistakes. I don't think it makes too much sense to keep parsing after an error, so returning the location of the error doesn't make too much sense (for a one-shot decoder).


Finally: should we create a Read method that errors out if no PEM / preeb is found at the start of the Span<char>? I don't think it would be all that useful as having text in front of the PEM is always allowed by the standard. If required, people can always question the returned Range and test if the PEM starts at offset zero after all. We could however think of a Find without Try that returns a byte array and throws an exception if no PEM could be found.

@vcsjones
Copy link
Member

I think a streaming API doesn't seem to be useful in the short run.

That seems reasonable. Using personal experience I would see myself using this for small PEM bundles that supply a certificate chain and reading a few kilobytes into memory shouldn't be problematic.

My suggestion now would be to get this API done now and consider streaming as a separate API proposal. It could either be added on later to these types or perhaps a new PemEncodingStream class that uses this as its internals.


Now for the error handling.

Throwing exceptions would probably be useful, though I can't say with any certainly. Some .NET developers have taken Try-prefixed methods to mean "does not throw" like TryParse* methods.


I don't think it would be all that useful as having text in front of the PEM is always allowed by the standard.

I agree with that, especially if this is intended to be a "lax" parser.

@bartonjs
Copy link
Member

bartonjs commented May 31, 2019

For the API shape: I think that the nested struct needs to move out to be a peer type. We don't have crisp guidelines here, but it boils down to something like "If the average person working with the API needs to use the type by name, it shouldn't be nested". Since it's the output of the primary API (TryFind) it seems pretty "average user" to me. (So what does get to be a public nested type? Mostly struct-based iterators that most people never see, they just put the call in a foreach and let things happen).

"Base64Fields" is a little generic once it moves out. ParsedPemFields, maybe? (Or PemFields?)

I don't think that the output struct needs the DecodeBase64 method on it. Since the base64 data is already exposed the caller can just use the existing base64 decoding APIs.

I think a streaming API doesn't seem to be useful in the short run.

I accept the logic here. Leaving it out until there's a demand seems reasonable.

Now for the error handling.

Some .NET developers have taken Try-prefixed methods to mean "does not throw" like TryParse* methods.

We have guidance on this, which is (effectively) all false responses need the same recovery action (different recovery action? throw.). So the question is really "as a caller, do you want to know the difference between 'no header found', 'no footer found', and 'contains attributes / has invalid base64'" (attributes being invalid base64), or are they all simply "there's no RFC 7468-PEM here"? Alternatively phrased, should

-----BEGIN A-----
maybe some stuff
-----BEGIN B-----
legit stuff
-----END B-----

find "B", or throw? I'm, personally, having trouble deciding which I want it to do, so I'm wide open to suggestions.

(I think I might be leaning toward false means no preeb was found, and throw for "couldn't successfully get to the end" (bad base64, can't find a posteb))

Finally: should we create a Read method that errors out ...

I think that Find, to be effectively TryFind + throw instead of false is useful, it simplifies the "I'm not testing, I'm asserting" case (and matches the guideline "DO provide an exception-throwing member for each member using the Try-Parse Pattern.").

I agree that there's not a lot of value in the "the first (or first non-whitespace) character wasn't -" version, and it's easy enough for someone to write themselves given the output of Find.

internal static PemFields FindPemAtStart(ReadOnlySpan<char> pem)
{
    PemFields fields = PemEncoding.Find(pem, out Range range);

    if (range.Start.Value > 0)
    {
        ReadOnlySpan<char> leader = pem.Slice(0, range.Start.Value);

        for (int i = 0; i < leader.Length; i++)
        {
            if (!char.IsWhitespace(leader[i]))
            {
                // Use the standard exception
                PemEncoding.Find("", out _);
                Debug.Fail("Should have thrown");
            }
        }
    }

    return fields;
}

Yeah, it's more than 3 lines of code, but it's not tricky logic... and given that the RFC says leaders are allowed, it doesn't fit with the class.

@owlstead
Copy link
Author

I'm all open to use PemFields as a separate instance. I'll check if I can get that to work with Span.

Ah, and here I just inserted the exceptions. Personally I'm in favor of throwing errors when if a full header line / pre encap boundary is found (i.e. five dashes, BEGIN, some kind of label (any chars) and then five dashes again. Because if anything is wrong, then it is hard to argue that this is not the PEM that you were looking for. Basically you ask to find it a PEM, it finds a full header, but then still returns false. What would you do next as a user? And if no exception is thrown then symmetry is also lost between TryFind and Find. So yeah, I'm not convinced that returning false instead of a FormatException is a good idea.

I'll try and finish cleaning up the code and post early next week. I haven't got a compiled version of Core 3 here so I'll just post it as a file (or two) here if you don't mind. I've got a holiday coming so time is getting precious (and I need to post soon because the race may eat me, lookup "Pointe du Raz" and then imagine kajaking in it :P ).

@owlstead
Copy link
Author

owlstead commented Jun 3, 2019

Sorry, things taking too much time. I hope tomorrow or Wednesday.

@vcsjones
Copy link
Member

To get things moving along again... here's the API proposal that I think we have, so far:

namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public readonly ref struct Base64Fields
        {
            public Base64Fields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

            public readonly ReadOnlySpan<char> Label;

            public readonly ReadOnlySpan<char> Base64;

            public readonly int EncodedByteSize;

            public byte[] DecodeBase64();
        }

        public static bool TryFind(ReadOnlySpan<char> pemData, out Base64Fields fields, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);

        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
}

I think the only feedback I would have left on this is that Label, Base64 and EncodedByteSize and fields when properties are more typically used. Since the type is a ref struct, I think that is doable.

With that change, the API proposal becomes:

namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public readonly ref struct Base64Fields
        {
            public Base64Fields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

            public ReadOnlySpan<char> Label { get; }

            public ReadOnlySpan<char> Base64 { get; }

            public int EncodedByteSize { get; }

            public byte[] DecodeBase64();
        }

        public static bool TryFind(ReadOnlySpan<char> pemData, out Base64Fields fields, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);

        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
}

@owlstead
Copy link
Author

I'll try and spend some more time on it. I've switched from laptop to desktop as it got to slow and ran out of SSD, and I'm currently running under Ubuntu - might switch back to Windows 10 due to audio problems among others...

@vcsjones
Copy link
Member

I wouldn't worry about implementation too much at this point - this needs to make it past the API review process. They may have significant feedback on the API shape.

Unless there is any more feedback on what the public API is, I think the next best step is to get any additional feedback from @bartonjs on the API proposal, and possibly get it flagged as "ready for review".

@bartonjs
Copy link
Member

I still think the nested type needs to be changed to a peer type (and then it probably needs a name change to PemFields, or something like that).

We are currently codifying what the guidelines are for TryWrite. I think that it'll end up being bool TryWrite(ReadOnlySpan<char> label, ReadOnlySpan<char> data, Span<char> destination, out int charsWritten) based on how the last meeting went, but maybe we'll feel differently once we're looking at examples of "two data sources, one destination", like this one. That doesn't really affect the proposal here, since it'll get normalized to the newly minted rules by the review process... so long as we don't think it needs any other inputs we're good to go.

Since the data portion is written as Base64 and Label is limited to (U+0021-U+002C , U+002E-U+007E) we know we're always writing ASCII output. Should we make TryWriteAscii/WriteAscii that writes to bytes instead of chars?

And, lastly, S.S.Cryptography is getting a bit bloated. Maybe it's time to consider a new namespace. Here we have PemEncoding and PemFields. ASN.1 will propose AsnReader, AsnWriter, AsnTag (or whatever final names they get). I've heard whispers of CBOR, which would probably add a reader and a writer. Maybe all of these things should be namespace System.Security.Cryptography.Encoding?

namespace System.Security.Cryptography
{
   public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields, out Range location);
        public static PemFields Find(ReadOnlySpan<char> pemData, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);
        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
    public readonly ref struct PemFields
    {
        public PemFields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

        public ReadOnlySpan<char> Label { get; }
        public ReadOnlySpan<char> Base64 { get; }
        public int EncodedByteSize { get; }

        public byte[] DecodeBase64();
    }
 }

(I left out WriteAscii, because it's an easy addition; and didn't change the namespace here because no consensus yet).

The only thing that feels weird in this now is the EncodedByteSize value. It totally makes sense for the PemEncoding class to calculate it, it does that on the fly while reading. But should the ctor validate it? Or is it garbage-in garbage-out? I can't think of why anyone (other than PemEncoding) would build one, so maybe it doesn't matter. Or maybe it means it should be an internal ctor to enforce that there's no reason to make one.

@vcsjones
Copy link
Member

Maybe it's time to consider a new namespace

Is there existing API surface in S.S.C that would fit in this new Encoding namespace? Would the new namespace lead to confusion, "Why is X not in the Encoding namespace?" "Because it didn't exist for these existing APIs"

@vcsjones
Copy link
Member

@bartonjs

Maybe it's time to consider a new namespace

Looking through the existing types I think my previous concern isn't valid. I'm always weary of namespaces, but it makes sense given the way the rest of corefx is broken up.

The only thing that feels weird in this now is the EncodedByteSize value

I don't mind the idea of this type being a simple way to get information out and not doing the validation at all.

I can't think of why anyone (other than PemEncoding) would build one, so maybe it doesn't matter

Also probably true, then perhaps that is enough reason to mark the constructor as internal. There will still be PemFields a = default leading to it having a value of zero, but it look like a "default" of this struct is still mostly okay.

So this then?

namespace System.Security.Cryptography.Encoding
{
   public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields, out Range location);
        public static PemFields Find(ReadOnlySpan<char> pemData, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);
        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
    public readonly ref struct PemFields
    {
        internal PemFields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

        public ReadOnlySpan<char> Label { get; }
        public ReadOnlySpan<char> Base64 { get; }
        public int EncodedByteSize { get; }

        public byte[] DecodeBase64();
    }
 }

@bartonjs
Copy link
Member

So this then?

Looks reasonable to me. I think API review will rewrite PemEncoding.TryWrite to (label, data, destination, out charsWritten), because we recently finished doing a looking-back and looking-forward review to come up with our official guidance (publication TBD).

@owlstead any thoughts? (I'm going to go ahead and mark it as ready for review, but we review things on Tuesdays, so there's at least a couple days between now and when it actually makes it to the front of the queue)

@GrabYourPitchforks
Copy link
Member

Feedback:

  • Change dataSize and labelSize to have *Length suffixes instead.
  • In TryWrite, move the destination parameter toward the end of the parameter list.
  • Rename struct's Base64 property to Base64Data. Rename EncodedByteSize to DecodedByteSize.
  • Remove the base64-decoding helper method from the ref struct. Callers can use the Convert class instead.
  • Consider using a normal struct instead of a ref struct and use Range properties instead of ROS properties. If we do this, then we can get away with not _out_ing a Range from the Find method, and instead just put Range location on the struct itself.

@vcsjones
Copy link
Member

vcsjones commented Jan 7, 2020

@bartonjs @owlstead

Rename EncodedByteSize to DecodedByteSize.

Should it be Length too? Not sure what the rule us there.

Consider using a normal struct instead of a ref struct and use Range properties instead of ROS properties.

The ref struct felt weird to me, but using Range seems even weirder to me, so far. It lacks any associativity to data, and on my first attempt I didn't know what to do with it. It took me a moment to realize I could do span[pemFields.Location] and the C# compiler would do it all for me. Since Span<T> lacks a real indexer accepting Range and it looks like the compiler special cases Span<T> and uses Slice, it didn't show up in intellisense or in docs.

System.Memory<T> doesn't feel right, either.

Here is the proposal with Range (added throw null so I could edit in an IDE easier without a sea of squiggles)

namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields) => throw null;
        public static PemFields Find(ReadOnlySpan<char> pemData) => throw null;

        public static int GetEncodedSize(int labelLength, int dataLength) => throw null;
        public static bool TryWrite(ReadOnlySpan<char> label, ReadOnlySpan<byte> data, Span<char> destination, out int charsWritten) => throw null;
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data) => throw null;
    }

    public readonly struct PemFields
    {
        internal PemFields(Range label, Range base64data, Range location, int decodedDataLength) => throw null;

        public Range Location { get; }
        public Range Label { get; }
        public Range Base64Data { get; }
        public int DecodedDataLength { get; }
    }
 }

@bartonjs
Copy link
Member

bartonjs commented Jan 8, 2020

I don't think the internal ctor needs ReadOnlyMemory values, since it's just storing ranges now, but the important part for the proposal is that it has an internal ctor :).

For DecodedByteSize and/or the word Length, perhaps DecodedDataLength?

Part of the reason for the struct having Range values instead of the Span values is that it's easier to map it back to Memory or array values, if delayed processing is desired. (Or you're in an async method and the state machine complains that you have a ref struct). It also sets things up for working with char8/byte later for UTF-8 data, if we wanted.

string pemString = await File.ReadAllTextAsync(path);
PemFields pemFields = PemEncoding.Find(pemString);
// Yeah, this uses Substring, but it works.
byte[] der = Convert.FromBase64String(pemString[pemFields.Base64Data]);
ReadOnlySpan<char> pemString = ...;
PemFields pemFields = PemEncoding.Find(pemString);
ArrayPool<byte> rented = ArrayPool<byte>.Shared.Rent(pemFields.DecodedDataLength);

if (!Convert.TryConvertFromBase64(pemString[pemFields.Base64Data], rented, out int bytesWritten) ||
    bytesWritten != pemFields.DecodedDataLength)
{
    throw new InvalidOperationException();
}

key.ImportSubjectPublicKeyInfo(rented, out _);
ArrayPool<byte>.Shared.Return(rented);

Assuming that seems sane, the only remaining question is the namespace. I know I'm the one who suggested creating the Encoding subnamespace, but I'm now having second thoughts... mainly because I'm thinking that the AsnReader/AsnWriter/etc types should really go somewhere other than System.Security.Cryptography, since it's also used by LDAP (and people have asked for it in the context of LDAP). I brought that up in the meeting, but we didn't really resolve it 😄.

If these would be the only two types in that namespace then we probably want to just shove them in S.S.C with everything else. (Kinda feels like a "there's no right answer" problem.) If we move these types out of S.S.C I'd want to change "Pem" to "SimplePem" or something like that, to indicate it's the PEM variant that doesn't do attributes.

@vcsjones
Copy link
Member

vcsjones commented Jan 9, 2020

Oh crud, I didn't mean to include the ctor that accepted ReadOnlyMemory. I was toying around with using ReadOnlyMemory in the ctor and as properties, too.

Edited to fix and remove namespace for now.

Should we keep the namespace and make it a new project entirely? If we have PEM, and ASN.1, and who-knows-what-else in the future (CBOR?)

It might be "just two" for now but I can see more being added in the future. Too late now but I would have also considered some of the PKCS12 implementation useful there as well, too.

@bartonjs
Copy link
Member

Should we keep the namespace and make it a new project entirely? If we have PEM, and ASN.1, and who-knows-what-else in the future (CBOR?)

A hallway chat for this suggested that we probably want a new assembly, distributed via NuGet (anticipating netstandard2.x requests)--and probably some engineering trickery to be able to use it as internal implementation details without getting into packaging.

  • ASN types (not including PEM, since it's not ASN): namespace System.Security.Cryptography.Asn(1) (strawman, I also have other thoughts below)
  • These two: namespace System.Security.Cryptography

Really it was that crypto doesn't (shouldn't) have enough stuff for a Encodings intermediate namespace (assuming that we want the ASN types to be together in a subnamespace), it'd pretty much be these two types. And System.Security.Cryptography.Encodings.Asn1 is pretty deep.

CBOR/etc would probably just be separate packages... there's not much need for it within the platform, just as a requested offering. I don't think this would be under S.S.C, because there's not even a "propensity" argument. But maybe this suggests we want System.{Encodings.Binary | BinaryEncodings}.Asn / System.{Encodings.Binary | BinaryEncodings}.{SomethingForCBOR}. (COSE would probably be something subnamespaced under S.S.C (depending on complexity), but that's a thing built on top of COSE, doesn't need to be in the same assembly, or namespace).

Also: I see you left DecodedByteSize as is, do you prefer that one over DecodedDataLength? I like having Data in it since it indicates it's only associated with that one Range/Base64-payload.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@bartonjs
Copy link
Member

bartonjs commented Feb 3, 2020

@vcsjones Once we can close down on DecodedByteSize vs DecodedDataLength I think this is ready to go back through review. (If you actually prefer DecodedByteSize, I'd like to understand why; perhaps I'll come around 😄)

@vcsjones
Copy link
Member

vcsjones commented Feb 3, 2020

@bartonjs I amended my comment here which I think is the most "current" proposal with changes: #29588 (comment)

If you actually prefer DecodedByteSize, I'd like to understand why

No preference, I don't think? Maybe I had a stronger opinion a month ago that I am unable to recall. Regardless, DecodedDataLength seems sensible.

@bartonjs bartonjs added api-ready-for-review and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels Feb 4, 2020
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review labels Feb 4, 2020
@terrajobst
Copy link
Member

terrajobst commented Feb 4, 2020

Video

  • The constructor on PemFields isn't public to avoid having to validate well-formedness
  • Should PemFields implement IEquatable`?
namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields);
        public static PemFields Find(ReadOnlySpan<char> pemData);

        public static int GetEncodedSize(int labelLength, int dataLength);
        public static bool TryWrite(ReadOnlySpan<char> label, ReadOnlySpan<byte> data, Span<char> destination, out int charsWritten);
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }

    public readonly struct PemFields
    {
        public Range Location { get; }
        public Range Label { get; }
        public Range Base64Data { get; }
        public int DecodedDataLength { get; }
    }
 }

@danmoseley
Copy link
Member

Is anyone interested in implementing this? I would like to get it in 5.0. @owlstead @vcsjones

@vcsjones
Copy link
Member

vcsjones commented Feb 7, 2020

@danmosemsft It looks like @owlstead has already invested a good chunk of time in to it, but if @owlstead is unable to or no longer wishes to, I can probably do this next after I wrap up #2406.

@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@maryamariyan maryamariyan removed the untriaged New issue has not been triaged by the area owner label Mar 3, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Security
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants