Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One-shot PEM reader #37748

Open
owlstead opened this issue May 17, 2019 · 22 comments

Comments

@owlstead
Copy link

commented May 17, 2019

I was giving some hints on this StackOverflow question where the idea of a PEM reader / writer for .NET was coming up. Jeremy then posted that if such a reader was available that it would be a nice idea to make it a one-shot, RFC 7468 compliant reader using the new Span<char> framework.

So I've created a sample implementation / API proposal which performs the parsing of the "lax" (i.e. permissive). I've used the following additional goals / non-goals:

Goals:

  • no dynamic memory allocation that depends on the size of the input;
  • possibility to skip a description (in front) of the PEM structure;
  • possibility to detect the end of the PEM structure, if present;
  • related: allow for parsing multiple PEM structures from a single Span;
  • design in the spirit of other methods that use Span;

Non-goals:

  • returning the description in front of the PEM file;
  • detailed error reporting;
  • streaming implementation (requiring state machine or buffering);
  • parsing the base64 inside the header / footer lines (pre/post encapsulation boundaries);
  • support for stricter versions of PEM;
  • return or comparison of the label in the footer.

Niceties:

  • validation of the base64 inside the header / footer lines
  • return the amount of bytes so that a correctly sized buffer can be created in advance for the base64 decryption (yes, implemented).

Parsing the base64 is easily done afterwards with the existing parser, and we may just want to skip the PEM.

Note that I didn't get my Unit testing working for Core 3, so the implementation has not yet received any significant testing. First let me know what you think.

Implementation notes: this implementation searches for the post-encapsulation boundary (footer) before validating the base64. It might be possible to use a state machine instead to create a single pass parser. This adds significant complexity to the parsing though, and searching for a static string should be relatively performant already.

Code in followup comment for now, as I don't know yet where to put it.

@owlstead

This comment has been minimized.

Copy link
Author

commented May 17, 2019

Here is the proposed API / implementation code:

using System;
using System.Collections.Generic;
using System.Text;

// bogus namespace
namespace PEMOneShotNamespace
{
    // bogus class name
    public class PEMOneShot
    {
        private const String PREEB_PREFIX = "-----BEGIN ";
        private const String PREEB_POSTFIX = "-----";
        private const String POSTEB_PREFIX = "-----END ";
        private const String POSTEB_POSTFIX = PREEB_POSTFIX;

        private const int NOT_FOUND = -1;

        /// <summary>
        /// Finds a PEM data structure and returns the location of the label, the location of the base64 encoded contents and the end of the PEM structure.
        /// This method follows the "lax" definition of PEM within RFC 7468 for maximum compatibility.
        /// </summary>
        /// <remarks>
        /// <para>
        /// This method does not allow additional header lines to be present; anything between the encapsulation boundaries is considered to be base64.
        /// </para>
        /// <para>
        /// The <code>label</code> and <code>contents</code> may will not have a valid value and may be <code>null</code> if no valid PEM structure is found.
        /// The <code>bytesInContent</code> and <code>endOfPEM</code> values will be set to -1 and 0 respectively if no valid PEM structure is found.
        /// </para>
        /// </remarks>
        /// 
        /// <param name="pemData">Character buffer that - possibly - contains a valid PEM structure.</param>
        /// <param name="label">Output parameter which is a slice in the <code>pemData</code> that denotes the validated label.</param>
        /// <param name="contents">Output parameter which is a slice in the <code>pemData</code> that denotes the validated base64 encoding.</param>
        /// <param name="bytesInContent">Output parameter that shows the precise number of bytes in the base64 content.</param>
        /// <param name="endOfPEM">Output parameter set to the next character after the PEM structure.</param>
        /// <returns><code>true</code> if and only if a valid PEM structure is found</returns>
        public static bool FindNextPem(ReadOnlySpan<char> pemData, out ReadOnlySpan<char> label, out ReadOnlySpan<char> contents, out int bytesInContent, out int endOfPEM)
        {
            // define default output values returned if no PEM structure or an erroneous PEM structure is detected
            label = null;
            contents = null;
            bytesInContent = NOT_FOUND;
            endOfPEM = 0;

            // note that IndexOf always starts to parse at the start of a span, so we need to create multiple spans
            ReadOnlySpan<char> curSpan = pemData;

            int startPreeb = curSpan.IndexOf(PREEB_PREFIX);
            if (startPreeb == NOT_FOUND)
            {
                return false;
            }
            int startPreebLabel = startPreeb + PREEB_PREFIX.Length;

            curSpan = pemData.Slice(startPreebLabel);

            int endPreebLabel = curSpan.IndexOf(PREEB_POSTFIX);
            if (endPreebLabel == NOT_FOUND)
            {
                return false;
            }
            endPreebLabel += startPreebLabel;
            int endPreeb = endPreebLabel + PREEB_POSTFIX.Length;

            ReadOnlySpan<char> preebLabelSlice = pemData.Slice(startPreebLabel, endPreebLabel - startPreebLabel);
            if (!ValidateLabel(preebLabelSlice))
            {
                return false;
            }

            curSpan = pemData.Slice(endPreeb);

            int startPosteb = curSpan.IndexOf(POSTEB_PREFIX);
            if (startPosteb == NOT_FOUND)
            {
                return false;
            }
            startPosteb += endPreeb;

            int startPostebLabel = startPosteb + POSTEB_PREFIX.Length;

            curSpan = pemData.Slice(startPostebLabel);

            int endPostebLabel = curSpan.IndexOf(POSTEB_POSTFIX);
            if (endPostebLabel == NOT_FOUND)
            {
                return false;
            }
            endPostebLabel += startPostebLabel;
            int endPosteb = endPostebLabel + POSTEB_POSTFIX.Length;

            ReadOnlySpan<char> postebLabel = pemData.Slice(startPostebLabel, endPostebLabel - startPostebLabel);
            if (!ValidateLabel(postebLabel))
            {
                return false;
            }

            // perform base64 validation at the end
            ReadOnlySpan<char> contentSlice = pemData.Slice(endPreeb, startPosteb - endPreeb);
            if (!ValidateAndCountBase64Bytes(contentSlice, out int bytes))
            {
                return false;
            }

            label = preebLabelSlice;
            contents = contentSlice;
            bytesInContent = bytes;
            endOfPEM = endPosteb;
            return true;
        }

        /// <summary>
        /// Validates that the content consists of valid base64: that it doesn't contain invalid characters,
        /// too many or misplaced base64 padding characters.
        /// </summary>
        /// <param name="base64">The base64 encoded input array, using standard base64 with padding</param>
        /// <param name="bytes">The number of bytes enocoded by the base64 encoding</param>
        /// <returns><code>true</code> if and only if valid base64 is found</returns>
        private static bool ValidateAndCountBase64Bytes(ReadOnlySpan<char> base64, out int bytes)
        {
            // TODO find out if the base64 codec of .NET is compatible with the whitespace (especially w.r.t. the padding characters)

            bytes = NOT_FOUND;

            int count = 0;
            int padCharFound = 0;
            int offset = 0;

            int end = base64.Length;
            while (offset < end)
            {
                char c = base64[offset++];
                // TODO test if IsLetterOrDigit is not too permissive (Unicode after all)
                if (Char.IsLetterOrDigit(c) || c == '+' || c == '/')
                {
                    if (padCharFound > 0)
                    {
                        return false;
                    }
                    count++;
                    continue;
                }
                // TODO test if IsWhiteSpace is not too permissive (Unicode after all)
                if (Char.IsWhiteSpace(c))
                {
                    continue;
                }
                if (c == '=')
                {
                    count++;
                    if (++padCharFound > 2)
                    {
                        return false;
                    }
                    continue;
                }
                return false;
            }
            if (count % 4 != 0)
            {
                return false;
            }

            bytes = (count / 4) * 3 - padCharFound;
            return true;
        }

        /// <summary>
        /// Validates that the label confirms to RFC 7468.
        /// </summary>
        /// <param name="label">The label to validate.</param>
        /// <returns><code>true</code> if and only if the label consists of valid characters.</returns>
        private static bool ValidateLabel(ReadOnlySpan<char> label)
        {
            // set to true to automatically detect space or hyphen at start of label
            bool previousSpaceOrHyphen = true;
            foreach (char c in label)
            {
                // we'll handle spaces 0x20 and hyphens 0x2D separately
                if (c < 0x20 || c > 0x7E)
                {
                    return false;
                }
                if (c == ' ' || c == '-')
                {
                    if (previousSpaceOrHyphen)
                    {
                        return false;
                    }
                    previousSpaceOrHyphen = true;
                }
                else
                {
                    previousSpaceOrHyphen = false;
                }
            }

            // do not end with space or hyphen
            if (previousSpaceOrHyphen)
            {
                return false;
            }
            return true;
        }
    }
}

PS removed one or two copy / paste bugs during edit, nothing big

@bartonjs

This comment has been minimized.

Copy link
Member

commented May 17, 2019

Thanks, @owlstead!

From a requirements perspective, things that I can see someone wanting to do:

  • Find where the next PEM starts to do something with the header/preamble/whatever-you-want-to-call-it
    • This seems to be possible... it's the offset of the label span from the input span, minus 11 ("-----BEGIN ".Length)
  • Know where the PEM ends so they can start their next search there
    • I think that's the endOfPEM value
  • Get the span to throw at Convert.FromBase64Chars
    • That's contents. Good there.
    • Yeah, Convert.FromBase64Chars handles whitespace in the payload
  • Know what the label is
    • label.
  • Know how big the base64Decode output will be. (I hadn't thought of that before, nice)
    • bytesInContent

So I think, functionality-wise, it meets the needs.

Other things that might be reasonable for PEM (to help identify a type container for this)

  • Write a ReadOnlySpan<byte> and a label to an output
    • Perhaps to a Span<char>, or a Stream, etc.
  • Write an AsnWriter and a label to an output, when AsnWriter becomes public.
    • If they're in the same assembly it can more efficiently borrow the writer's internal buffer

Restricting to the cryptography uses of PEM (RFC 7468) makes all of this so much easier than my original thoughts, since the header/attributes section is out of scope.

Additionally, I know I'm the one who suggested (ReadOnlySpan<char>, out ReadOnlySpan<char>, out ReadOnlySpan<char>), but I'm wondering if the extra data makes the signature too unwieldy and that it'd be better to return the parse state as a struct.

I'll ignore my last sentence for now, to give the theoretical state of

// assembly: System.Security.Cryptography.Encoding
namespace System.Security.Cryptography
{
    public static partial class PemEncoding
    {
        public static bool TryFindNextPem(
            ReadOnlySpan<char> pemData,
            out ReadOnlySpan<char> label,
            out ReadOnlySpan<char> contents,
            out int contentDecodeLength,
            out int charsRead);

        public static bool TryWriteData(
            Span<char> destination,
            ReadOnlySpan<char> label,
            ReadOnlySpan<byte> data,
            out int charsWritten);

        // The formula, while straightforward, is probably too much to assume a caller would want
        // to redefine, so this is a helper for TryWriteData to see if destination is too small.
        public static int GetOutputSize(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);

        public static void WriteData(
            TextWriter destination,
            ReadOnlySpan<char> label,
            ReadOnlySpan<byte> data);
    }
}

I'm not entirely sold on the method names of any of the methods, just starting the creative juices flowing. In particular, it seems like AsnWriter variants should be proper overloads, so the "Write" name should unify.

public static void WriteData(
    TextWrite destination,
    ReadOnlySpan<char> label,
    AsnWriter data);

Feels... wrongish.

@owlstead

This comment has been minimized.

Copy link
Author

commented May 17, 2019

OK, I can see that giving the start of the PEM might also be a good idea, as that will automatically show how to get the "description" or pre-amble, where needed (presuming that the right Span is given, excluding any previous PEM as you would need the start offset). We could create a PEM class with a static methods and Content class.

The Content class would be a simple class with a single cto and getters (I don't see a use case for changing existing PEM content). We could still return a bool as to be compliant withe the general signature of the Try... methods and return the data class or struct in anout parameter.

Maybe we should return the label as a string rather than a Span, with the disadvantage that we should then think of an upperbound. I don't see any other use of the label other than as a string. Say 64 characters should be plenty (it would already be more than the default PEM line size after all, so nobody with any sense would create such a label).

Otherwise I would largely agree with your interface specification. I could, if you want, think of a ReadData method implementation as well (using an additional pass maybe, who cares, CPU's are plenty fast anyway). I'm visiting my family, will post an enhanced API spec. later.

@bartonjs

This comment has been minimized.

Copy link
Member

commented May 17, 2019

The advantage of emitting the label as a ReadOnlySpan<char> vs a string is that it is just a projection over the input. If the caller needs it as a System.String they can call ToString() on the span value.

While it's also true that a string can go to a (read-only) span, the string required making a new object and a copy (since .NET strings are length-prepended for performance and null-terminated for interop).

So a string (or System.Range) would make sense for returning it as a field in a parse structure, but the span has advantage when it can be done via the out parameter.

@owlstead

This comment has been minimized.

Copy link
Author

commented May 18, 2019

As for that last comment, I don't see any advantage myself to keep label to a Span. Sure ToString or new String is easily performed, but you've got to ask yourself why the Span is ever needed in the first place. From a user point of view it doesn't make sense and copying such small amounts doesn't have any performance benefits, in my opinion. It feels like over-optimization to me.

So what seems natural to me is to supply:

// assembly: System.Security.Cryptography.Encoding
namespace System.Security.Cryptography
{
    public partial class PEM
    {
        public ref struct FieldsWithBase64
        {
            public string label;
            public ReadOnlySpan<char> contentInBase64;
            public int contentSizeBytes;
        }

        public ref struct Fields
        {
            public string label;
            public ReadOnlySpan<byte> content;
        }

        public static bool TryFindPem(ReadOnlySpan<char> pemData, out FieldsWithBase64 fields, out Range location);

        public static int GetOutputSizePem(Fields fields);

        public static bool TryWritePem(Span<char> destination, Fields fields, out int pemSize);
    }
}

I've used "Smurf naming conventions" to allow for using static to be used for the static methods.

Other changes in order of appearance:

  • renamed PEMEncoding to just PEM for simplicity sake (debatable, but PEM is an encoding by itself, can change back);
  • structs introduced (FieldsWithBase64 should really be FieldsWithContentsInBase64 but yeah...);
  • renamed method to TryFindPem as it may be trying to find the first PEM struture;
  • renamed method TryWriteData to TryWritePem as I found Data to be too generic, and for symmetry with TryFindPem.

Most importantly I removed WriteData from the class in its entirety as it would require a full rewrite (unless you would count creating a Span<char> over a new char[] as acceptable solution). It also mixes high level streaming and low level spanning components, which seems a bad idea in itself (where would it end?). It could of course be added later.

Very much interested in your opinions!

Oh, yeah, got Unit testing working and have a working implementation of above. Added constructors and methods to ref structs, with a method FieldsWithBase64.GetFields which performs the base64 decoding as convenience method.

@owlstead

This comment has been minimized.

Copy link
Author

commented May 19, 2019

Oh, I can test things now. I used command line to create a test project and then imported in VS. However, before I test I would like to know if the API is OK-ish this way.

Note that I'm personally not that happy with evaluating my own application. I am critical enough to think of some pretty nasty tests (boundary tests), but I cannot change my own thought process. So some review will of course be required.

@bartonjs

This comment has been minimized.

Copy link
Member

commented May 19, 2019

but you've got to ask yourself why the Span is ever needed in the first place

When reading a large multi-PEM (like /etc/ssl/certs.pem) the allocations of each string would cause the GC to kick in more often, which can have impact on servers. Of course, the API could internally avoid that by caching known (or encountered) strings and returning the string that is equivalent to the identified span... so then it comes down to a) always allocate a string (easy, usable, higher GC impact), b) returned shared strings (more work, usable, low GC impact), c) just return a span (easy, less friendly than string, no GC impact).

Since very few different names are expected I'm sure that (a) would end up being (b) if we switched to using this to read /etc/ssl/certs.pem on Linux.

Type name:

RSA, DSA, and ECDsa definitely already violate the rule, but the rule is that acronyms should either be expanded or be treated as a simple word, so "PEM" (Privacy Enhanced Mail) would be "Pem". I threw "Encoding" on it because "Pem" as a standalone typename felt weird, and it's using the data encoding model from PEM without actually being about Privacy Enhanced Mail. (The actual guideline is "DO capitalize only the first character of acronyms with three or more characters, except the first word of a camel-cased identifier.")

Methods/Structs:

  • For the structs the exposed members should use PascalCasing, even if fields. Ideally the struct would be public readonly ref struct.
  • I'm on the fence for multiple outs vs making a ref struct.
  • I don't think that the struct for write adds value.
  • I think we have a nebulous guideline that says that if there's a write-to-a-span method there should also be a "make a properly sized array and return it" method, because they're easier to use.
    • The method to precalculate the size might obviate it, but it still means that you should check for the false return
  • The label and contents for the size calculator really only need to be the lengths, thinking about it.
  • I expect that the API review would suggest the method to write to a Stream (which I'd then counter that we're producing chars, not bytes, so a TextWriter would be more appropriate). But it can be left off until such a review meeting.
    • FWIW: The value is that it would allow the API internals to do something like a 64-char stackalloc, do 48 bytes at a time base64 encoding, and calling WriteLine with the 64 chars; avoiding the need for contiguous memory on a large payload (and avoids inserting / gapping for newlines).
  • I think we discourage "Smurf naming", and that we optimize for type-qualified invocation over using static. But I don't see a rule for it.
  • On the TryWrite, the convention is that if it writes to a Span<byte> there's an out int bytesWritten, for Span<char> it's out int charsWritten, and for any other Span type else it's (IIRC) out int written). It's just the convention we have (and will eventually be codified), so I've gone ahead and restored the out.

So, using "on the fence" means "don't change it", I'd suggest

// assembly: System.Security.Cryptography.Encoding
namespace System.Security.Cryptography
{
    public static partial class PemEncoding
    {
        public readonly ref struct FieldsWithBase64
        {
            public FieldsWithBase64(string label, ReadOnlySpan<char> contentInBase64, int contentSizeBytes);
            public string Label;
            public ReadOnlySpan<char> ContentInBase64;
            public int ContentSizeBytes;
        }

        public static bool TryFindPem(ReadOnlySpan<char> pemData, out FieldsWithBase64 fields, out Range location);

        public static int GetOutputSize(int labelLength, int contentBytesLength);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> contentBytes, out int charsWritten);
    }
}

And a formal API review would likely add public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> contentBytes) and public static void Write(TextWriter destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> contentBytes). API review might also do more cosmetic changes on the names.

Testing:

I'd just expect tests to go in src/System.Security.Cryptography.Encoding/tests/, along with the rest of the xunit tests. After you build src/System.Security.Cryptography.Encoding/ref/ with the new API structure the tests should be able to reference it. (msbuild /t:BuildAndTest /p:ForceRunTests=true to run the tests (even if they succeeded previously), and they should automatically see updated binaries after src/System.Security.Cryptography.Encoding/src is rebuilt.

One test that I think you're missing, from the current code: reading a payload with mismatched labels:

-----BEGIN A-----
base64==
-----END B-----

However, before I test I would like to know if the API is OK-ish this way.

The usual flow, per https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md, is

  • Issue gets opened (done)
  • Discussion on the issue ends up with what the relevant parties consider to be the appropriate shape (ongoing)
  • The issue gets marked api-ready-for-review
  • The issue gets discussed during an API review meeting and ends up as
    • api-approved as-is
    • api-approved with suggestions (e.g. rename some things)
    • api-needs-work with questions/comments (and the issue is now back in "Discussion")
  • Someone takes on the issue and does the PR for source and test.

Of course, sometimes it takes actually doing the implementation and tests to understand how the API really needs to work.. Mainly I'm sharing that just because you and I think we've hit a good state, doesn't mean the (rest of the) API review will agree 😄.

@bartonjs

This comment has been minimized.

Copy link
Member

commented May 19, 2019

@vcsjones / @onovotny: Do either of you have any API feedback here? Missing scenarios, over-engineering, et cetera?

@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented May 19, 2019

From a user point of view it doesn't make sense and copying such small amounts doesn't have any performance benefits, in my opinion. It feels like over-optimization to me.

I'm leaning toward Label being a ROS<char> here if not because of performance for this specifically but because the whole API proposal seems geared toward being efficient. The use of a ref struct presents some use constraints of a developer. It seems odd to use a ref struct and ROS<char> for the content but then allocate for the label.

If I were using this API (which is a great idea!) my typical usage pattern would probably be something like:

// If ROS<char>
FieldsWithBase64 fields = //get fields;
if (fields.Label.SequenceEqual("CERTIFICATE")) {
  //
}

// If string
FieldsWithBase64 fields = //get fields;
if (fields.Label == "CERTIFICATE") {
  //
}

So I don't think using ROS<char> feels like an optimization at any significant cost to the developer. That is if my supposed typical-usage is close.

If we feel that a ROS<char> for Label is over-optimized then I might suggest that applies to much of the rest of the API proposal, and change the whole struct to not be ref and instead operate entirely on ReadOnlyMemory<char>. That will let developers use the result as they are used to being able to use things, like return it from a method, add them to a List<T>, or put the result in a field.

I don't actually like my idea, I much prefer {ReadOnly}Span<char> here, but if we are OK with allocations then we can use that to the advantage of making the API more developer-friendly.


In the proposal the FieldsWithBase64 contains actual fields. Most of the API surface in the rest of the framework uses properties, even on readonly ref structs (e.g. Span<T> itself). Any particular reason to use fields here? I don't see any harm in it either, but it feels unconventional to me to add API surface exposing fields.


ContentSizeBytes seems ambiguous to me. DecodedContentSize or DecodedContentSizeBytes might help folks understand what the size refers to (though I have a personal distaste for putting Bytes in member names to indicate the size refers to bytes)

@owlstead

This comment has been minimized.

Copy link
Author

commented May 20, 2019

OK, so I've tried to take all your comments into consideration and came up with the following API, which is currently implemented and running some tests; I've simply refactored.

    public static class PemEncoding
    {
        public readonly ref struct Base64Fields
        {
            public Base64Fields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

            public readonly ReadOnlySpan<char> Label;

            public readonly ReadOnlySpan<char> Base64;

            public readonly int EncodedByteSize;

            public byte[] DecodeBase64();
        }

        public static bool TryFind(ReadOnlySpan<char> pemData, out Base64Fields fields, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);

        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);

        // implementation only

        private static bool ValidateLabel(ReadOnlySpan<char> label);

        private static bool ValidateAndCountBase64Bytes(ReadOnlySpan<char> base64, out int encodedByteSize);

        private static int SkipWhiteSpace(ReadOnlySpan<char> pemData, int startOffset)
    }

I've checked that the writers write all the PEM structures in the RFC, doing a full compare afterwards. I'm also skipping / parsing the following PEM's successfully even though they are obviously tosh:

"-----BEGIN BUT-SKIP----------END BUTT-SKIP-----We can read this -----BEGIN ig g-y-----A Q A B-----END pop-----eye"

Then I can write them again as strictly formatted PEM.

I've added a SkipWhiteSpace implementation method. The previous implementation of TryFind did not skip blanks and / or the first line ending after the post-encapsulation boundary (aka footer), which it should do (also according to the RFC).

I'm not completely happy with the EncodedByteSize and GetEncodedSize identifier names as they both contain the name Encoded and rely too much on the context to understand. What about Base64EncodedDataSize and GetEncodedPemSize? They are slightly larger, but more clear in my opinion.

Note that I'm quite the polyglot w.r.t. PL, but sometimes it takes a bit of time to get up to speed with e.g. C# best practices - I'm pretty much rooted in Java, taking up Kotlin as well.

@owlstead

This comment has been minimized.

Copy link
Author

commented May 21, 2019

OK, let's discuss the two bears left: streaming support and error handling.


The problem with streaming is that it doesn't allow to back track a stream. However, it is required if e.g. large CMS structures need to be supported because those can be of arbitrary length and not fit into memory. It is useful but maybe not required if PEM structures are present in a stream e.g. received through HTTP.

As for the large data structures: it seems .NET only supports encoding of data in a byte[] currently. So although streaming would definitely be useful for large messages, I don't see how we could do anything with the result that doesn't require at least one full copy in memory (correct me if I'm wrong about this). If the message size is not known in advance then you can simply create a buffer of a particular size and stop if it doesn't fit. So basically I think a streaming API doesn't seem to be useful in the short run.

It is relatively easy to add streaming for the generating part, but it would not be a symmetric API. I think I saw some requests to do this, so we could basically copy the generation code as I don't see any generic way of merging the two.


Now for the error handling. Currently the TryFind message simply returns false if there is any error in the PEM, even if a full preeb (header line) is found and if there is a base64 encoding mistake. This doesn't feel right to me, I'd say that if there is a full preeb that an exception should be thrown if there are any mistakes. I don't think it makes too much sense to keep parsing after an error, so returning the location of the error doesn't make too much sense (for a one-shot decoder).


Finally: should we create a Read method that errors out if no PEM / preeb is found at the start of the Span<char>? I don't think it would be all that useful as having text in front of the PEM is always allowed by the standard. If required, people can always question the returned Range and test if the PEM starts at offset zero after all. We could however think of a Find without Try that returns a byte array and throws an exception if no PEM could be found.

@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented May 21, 2019

I think a streaming API doesn't seem to be useful in the short run.

That seems reasonable. Using personal experience I would see myself using this for small PEM bundles that supply a certificate chain and reading a few kilobytes into memory shouldn't be problematic.

My suggestion now would be to get this API done now and consider streaming as a separate API proposal. It could either be added on later to these types or perhaps a new PemEncodingStream class that uses this as its internals.


Now for the error handling.

Throwing exceptions would probably be useful, though I can't say with any certainly. Some .NET developers have taken Try-prefixed methods to mean "does not throw" like TryParse* methods.


I don't think it would be all that useful as having text in front of the PEM is always allowed by the standard.

I agree with that, especially if this is intended to be a "lax" parser.

@bartonjs

This comment has been minimized.

Copy link
Member

commented May 31, 2019

For the API shape: I think that the nested struct needs to move out to be a peer type. We don't have crisp guidelines here, but it boils down to something like "If the average person working with the API needs to use the type by name, it shouldn't be nested". Since it's the output of the primary API (TryFind) it seems pretty "average user" to me. (So what does get to be a public nested type? Mostly struct-based iterators that most people never see, they just put the call in a foreach and let things happen).

"Base64Fields" is a little generic once it moves out. ParsedPemFields, maybe? (Or PemFields?)

I don't think that the output struct needs the DecodeBase64 method on it. Since the base64 data is already exposed the caller can just use the existing base64 decoding APIs.

I think a streaming API doesn't seem to be useful in the short run.

I accept the logic here. Leaving it out until there's a demand seems reasonable.

Now for the error handling.

Some .NET developers have taken Try-prefixed methods to mean "does not throw" like TryParse* methods.

We have guidance on this, which is (effectively) all false responses need the same recovery action (different recovery action? throw.). So the question is really "as a caller, do you want to know the difference between 'no header found', 'no footer found', and 'contains attributes / has invalid base64'" (attributes being invalid base64), or are they all simply "there's no RFC 7468-PEM here"? Alternatively phrased, should

-----BEGIN A-----
maybe some stuff
-----BEGIN B-----
legit stuff
-----END B-----

find "B", or throw? I'm, personally, having trouble deciding which I want it to do, so I'm wide open to suggestions.

(I think I might be leaning toward false means no preeb was found, and throw for "couldn't successfully get to the end" (bad base64, can't find a posteb))

Finally: should we create a Read method that errors out ...

I think that Find, to be effectively TryFind + throw instead of false is useful, it simplifies the "I'm not testing, I'm asserting" case (and matches the guideline "DO provide an exception-throwing member for each member using the Try-Parse Pattern.").

I agree that there's not a lot of value in the "the first (or first non-whitespace) character wasn't -" version, and it's easy enough for someone to write themselves given the output of Find.

internal static PemFields FindPemAtStart(ReadOnlySpan<char> pem)
{
    PemFields fields = PemEncoding.Find(pem, out Range range);

    if (range.Start.Value > 0)
    {
        ReadOnlySpan<char> leader = pem.Slice(0, range.Start.Value);

        for (int i = 0; i < leader.Length; i++)
        {
            if (!char.IsWhitespace(leader[i]))
            {
                // Use the standard exception
                PemEncoding.Find("", out _);
                Debug.Fail("Should have thrown");
            }
        }
    }

    return fields;
}

Yeah, it's more than 3 lines of code, but it's not tricky logic... and given that the RFC says leaders are allowed, it doesn't fit with the class.

@owlstead

This comment has been minimized.

Copy link
Author

commented May 31, 2019

I'm all open to use PemFields as a separate instance. I'll check if I can get that to work with Span.

Ah, and here I just inserted the exceptions. Personally I'm in favor of throwing errors when if a full header line / pre encap boundary is found (i.e. five dashes, BEGIN, some kind of label (any chars) and then five dashes again. Because if anything is wrong, then it is hard to argue that this is not the PEM that you were looking for. Basically you ask to find it a PEM, it finds a full header, but then still returns false. What would you do next as a user? And if no exception is thrown then symmetry is also lost between TryFind and Find. So yeah, I'm not convinced that returning false instead of a FormatException is a good idea.

I'll try and finish cleaning up the code and post early next week. I haven't got a compiled version of Core 3 here so I'll just post it as a file (or two) here if you don't mind. I've got a holiday coming so time is getting precious (and I need to post soon because the race may eat me, lookup "Pointe du Raz" and then imagine kajaking in it :P ).

@owlstead

This comment has been minimized.

Copy link
Author

commented Jun 3, 2019

Sorry, things taking too much time. I hope tomorrow or Wednesday.

@stephentoub stephentoub modified the milestones: Future, 5.0 Jul 18, 2019
@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented Sep 22, 2019

To get things moving along again... here's the API proposal that I think we have, so far:

namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public readonly ref struct Base64Fields
        {
            public Base64Fields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

            public readonly ReadOnlySpan<char> Label;

            public readonly ReadOnlySpan<char> Base64;

            public readonly int EncodedByteSize;

            public byte[] DecodeBase64();
        }

        public static bool TryFind(ReadOnlySpan<char> pemData, out Base64Fields fields, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);

        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
}

I think the only feedback I would have left on this is that Label, Base64 and EncodedByteSize and fields when properties are more typically used. Since the type is a ref struct, I think that is doable.

With that change, the API proposal becomes:

namespace System.Security.Cryptography
{
    public static class PemEncoding
    {
        public readonly ref struct Base64Fields
        {
            public Base64Fields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

            public ReadOnlySpan<char> Label { get; }

            public ReadOnlySpan<char> Base64 { get; }

            public int EncodedByteSize { get; }

            public byte[] DecodeBase64();
        }

        public static bool TryFind(ReadOnlySpan<char> pemData, out Base64Fields fields, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);

        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);

        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
}
@owlstead

This comment has been minimized.

Copy link
Author

commented Sep 22, 2019

I'll try and spend some more time on it. I've switched from laptop to desktop as it got to slow and ran out of SSD, and I'm currently running under Ubuntu - might switch back to Windows 10 due to audio problems among others...

@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented Sep 22, 2019

I wouldn't worry about implementation too much at this point - this needs to make it past the API review process. They may have significant feedback on the API shape.

Unless there is any more feedback on what the public API is, I think the next best step is to get any additional feedback from @bartonjs on the API proposal, and possibly get it flagged as "ready for review".

@bartonjs

This comment has been minimized.

Copy link
Member

commented Sep 22, 2019

I still think the nested type needs to be changed to a peer type (and then it probably needs a name change to PemFields, or something like that).

We are currently codifying what the guidelines are for TryWrite. I think that it'll end up being bool TryWrite(ReadOnlySpan<char> label, ReadOnlySpan<char> data, Span<char> destination, out int charsWritten) based on how the last meeting went, but maybe we'll feel differently once we're looking at examples of "two data sources, one destination", like this one. That doesn't really affect the proposal here, since it'll get normalized to the newly minted rules by the review process... so long as we don't think it needs any other inputs we're good to go.

Since the data portion is written as Base64 and Label is limited to (U+0021-U+002C , U+002E-U+007E) we know we're always writing ASCII output. Should we make TryWriteAscii/WriteAscii that writes to bytes instead of chars?

And, lastly, S.S.Cryptography is getting a bit bloated. Maybe it's time to consider a new namespace. Here we have PemEncoding and PemFields. ASN.1 will propose AsnReader, AsnWriter, AsnTag (or whatever final names they get). I've heard whispers of CBOR, which would probably add a reader and a writer. Maybe all of these things should be namespace System.Security.Cryptography.Encoding?

namespace System.Security.Cryptography
{
   public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields, out Range location);
        public static PemFields Find(ReadOnlySpan<char> pemData, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);
        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
    public readonly ref struct PemFields
    {
        public PemFields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

        public ReadOnlySpan<char> Label { get; }
        public ReadOnlySpan<char> Base64 { get; }
        public int EncodedByteSize { get; }

        public byte[] DecodeBase64();
    }
 }

(I left out WriteAscii, because it's an easy addition; and didn't change the namespace here because no consensus yet).

The only thing that feels weird in this now is the EncodedByteSize value. It totally makes sense for the PemEncoding class to calculate it, it does that on the fly while reading. But should the ctor validate it? Or is it garbage-in garbage-out? I can't think of why anyone (other than PemEncoding) would build one, so maybe it doesn't matter. Or maybe it means it should be an internal ctor to enforce that there's no reason to make one.

@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented Sep 22, 2019

Maybe it's time to consider a new namespace

Is there existing API surface in S.S.C that would fit in this new Encoding namespace? Would the new namespace lead to confusion, "Why is X not in the Encoding namespace?" "Because it didn't exist for these existing APIs"

@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented Oct 17, 2019

@bartonjs

Maybe it's time to consider a new namespace

Looking through the existing types I think my previous concern isn't valid. I'm always weary of namespaces, but it makes sense given the way the rest of corefx is broken up.

The only thing that feels weird in this now is the EncodedByteSize value

I don't mind the idea of this type being a simple way to get information out and not doing the validation at all.

I can't think of why anyone (other than PemEncoding) would build one, so maybe it doesn't matter

Also probably true, then perhaps that is enough reason to mark the constructor as internal. There will still be PemFields a = default leading to it having a value of zero, but it look like a "default" of this struct is still mostly okay.

So this then?

namespace System.Security.Cryptography.Encoding
{
   public static class PemEncoding
    {
        public static bool TryFind(ReadOnlySpan<char> pemData, out PemFields fields, out Range location);
        public static PemFields Find(ReadOnlySpan<char> pemData, out Range location);

        public static int GetEncodedSize(int labelSize, int dataSize);
        public static bool TryWrite(Span<char> destination, ReadOnlySpan<char> label, ReadOnlySpan<byte> data, out int charsWritten);
        public static char[] Write(ReadOnlySpan<char> label, ReadOnlySpan<byte> data);
    }
    public readonly ref struct PemFields
    {
        internal PemFields(ReadOnlySpan<char> label, ReadOnlySpan<char> base64, int encodedByteSize);

        public ReadOnlySpan<char> Label { get; }
        public ReadOnlySpan<char> Base64 { get; }
        public int EncodedByteSize { get; }

        public byte[] DecodeBase64();
    }
 }
@bartonjs

This comment has been minimized.

Copy link
Member

commented Oct 17, 2019

So this then?

Looks reasonable to me. I think API review will rewrite PemEncoding.TryWrite to (label, data, destination, out charsWritten), because we recently finished doing a looking-back and looking-forward review to come up with our official guidance (publication TBD).

@owlstead any thoughts? (I'm going to go ahead and mark it as ready for review, but we review things on Tuesdays, so there's at least a couple days between now and when it actually makes it to the front of the queue)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.