Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods to convert between hexadecimal strings and bytes #17837

Closed
GSPP opened this issue Jul 12, 2016 · 25 comments
Closed

Add methods to convert between hexadecimal strings and bytes #17837

GSPP opened this issue Jul 12, 2016 · 25 comments

Comments

@GSPP
Copy link

@GSPP GSPP commented Jul 12, 2016

It is quite common to need to convert bytes to hex strings and back. The .NET Framework does not have an API to do that. Look at what lengths people have gone to to solve this problem for them: How do you convert Byte Array to Hexadecimal String, and vice versa? Many of those solutions are somehow bad (e.g. slow or broken or not validating).

BitConverter.ToString(byte[]) generates strings of a practically useless format: BitConverter.ToString(new byte[] { 0x00, 0x01, 0xAA, 0xFF }) => 00-01-AA-FF. I don't understand how this decision was made. This seems to be a special solution for some specific purpose (debug output?). Using this API requires Replace(str, "-", "") which is ugly and slow. Also, this API is not discoverable because it does not have "hex" in it's name.

I propose adding an API that does the following:

  • Supports char[] and string
  • Supports specifying the output buffer to support zero-allocation scenarios
  • Supports upper- and lowercase
  • Fast (probably, this means working in-place using unsafe code)
  • Validates input
  • No support for any kind of hex prefix (such as "0x"), whitespace or digit separator (such as "-" or "\n"). This is out of scope.

The use case for char[] in- and outputs is low-allocation scenarios.

The API could look like this:

static class HexConvert {
 byte[] ToBytes(string input, int start, int count, HexFormat hexFormat);
 void ToBytes(string input, int start, int count, ArraySegment<byte> outputBuffer, HexFormat hexFormat);
 //Repeat the same for ArraySegment<char> as input

 string ToString(ArraySegment<byte> bytes);
 char[] ToChars(ArraySegment<byte> bytes);
 void ToChars(ArraySegment<byte> bytes, ArraySegment<char> outputBuffer);

 //Plus obvious convenience overloads

 int HexDigitToInteger(char val); //0-9A-F => 0-15.
 char IntegerToHexDigit(int value, HexFormat hexFormat); //0-15 => 0-9A-F
}

[Flags]
enum HexFormat { UseDefaults = 0, Lowercase = 1 } //extensible

A possible extension would be to support Stream and IEnumerable<byte> as sources and destinations. I would consider this out of scope.

I feel that converting hexadecimal strings is an important and frequent enough thing that this should be supported well. Of course, a "light" version of this proposal is also feasible (just support for string and byte[] with no user-provided output buffer). Not sure if anyone wants/needs lowercase hex, I don't but others might.

This seems cheap enough to implement. Testing should be easily automatable. I don't see any particular usability problems. This is suitable for community contribution.

@joshfree
Copy link
Member

@joshfree joshfree commented Jul 12, 2016

Thanks for creating this issue @GSPP.

cc: @terrajobst

https://aka.ms/apireview

Loading

@juliusfriedman
Copy link
Contributor

@juliusfriedman juliusfriedman commented Jul 12, 2016

Replacement is not the ONLY way, you can just skip the char if you need to.

///Could easily allow upper or lower case here...   
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
public static byte HexCharToByte(char c, bool upperCase = false) { c = char.ToUpperInvariant(c); return (byte)(upperCase ? char.ToUpperInvariant((char)(c > '9' ? c - 'A' + 10 : c - '0')) : (c > '9' ? c - 'A' + 10 : c - '0')); }

        /// <summary>
        /// Converts a String in the form 0011AABB to a Byte[] using the chars in the string as bytes to caulcate the decimal value.
        /// </summary>
        /// <notes>
        /// Reduced string allocations from managed version substring
        /// About 10 milliseconds faster then Managed when doing it 100,000 times. otherwise no change
        /// </notes>
        /// <param name="str"></param>
        /// <returns></returns>
        public static byte[] HexStringToBytes(string str, int start = 0, int length = -1)
        {
            if (length.Equals(Common.Binary.Zero)) return null;

            if (length <= -1) length = str.Length;

            if (start > length - start) throw new System.ArgumentOutOfRangeException("start");

            if (length > length - start) throw new System.ArgumentOutOfRangeException("length");

            System.Collections.Generic.List<byte> result = new System.Collections.Generic.List<byte>(length >> 1); // / 2

            //Dont check the results for overflow
            unchecked
            {
                //Iterate the pointer using the managed length ....
                //Todo, optomize with reverse or i - 1
                for (int i = start, e = length; i < e; i += 2)
                {
                    //to reduce string manipulations pre call
                    if(str[i] == '-') ++i;

                    //Todo, Native and Unsafe
                    //Convert 2 Chars to a byte
                    result.Add((byte)(HexCharToByte(str[i]) << 4 | HexCharToByte(str[i + 1])));
                }
            }

            //Dont use a List..

            //Return the bytes
            return result.ToArray();
        }

I agree that the format is useless in MOST cases...

However if you think about it a little more and try to be as outside the box as possible you will clearly see that there is an extra index for every octet, finally then you can achieve some interesting things by realigning the data or using the extra indexes....

Loading

@Priya91 Priya91 assigned AlexGhiondea and unassigned Priya91 Aug 22, 2016
@AlexGhiondea
Copy link
Member

@AlexGhiondea AlexGhiondea commented Nov 23, 2016

@GSPP how about augmenting the BitConverter type with these new methods?

I feel like that would be a better place for these since that is already a type people use for converting between various formats.

Loading

@AlexGhiondea AlexGhiondea removed their assignment Nov 24, 2016
@AlexGhiondea
Copy link
Member

@AlexGhiondea AlexGhiondea commented Nov 24, 2016

There are at least a couple of other issues tracking something like this:
https://github.com/dotnet/corefx/issues/1647
https://github.com/dotnet/corefx/issues/4441

Let's track it with a single issue (this one).

It does look like this is something for which there is interest in implementing.

@GSPP, @GrabYourPitchforks @KrzysztofCwalina is anyone of you interested in writing up a formal API proposal for this?

Loading

@CodesInChaos
Copy link

@CodesInChaos CodesInChaos commented Nov 24, 2016

If the slices API lands before this, I'd go with something like:

// all instance methods are thread-safe
public abstract class BinaryEncoding
{
	protected abstract void ToBytesOverride(ReadOnlySpan<char> sourceChars, Span<byte> destinationBytes);
	protected abstract int GetByteLengthOverride(int stringLength);
	protected abstract void ToCharsOverride(ReadOnlySpan<byte> sourceBytes, Span<char> destinationChars);
	protected abstract int GetStringLengthOverride(int byteLength);

	public byte[] ToBytes(ReadOnlySpan<char> str);
	public void ToBytes(ReadOnlySpan<char> sourceChars, Span<byte> destinationBytes);
	public int GetByteLength(int stringLength);
	
	public string ToString(ReadOnlySpan<byte> bytes);
	public char[] ToChars(ReadOnlySpan<byte> bytes);
	public void ToChars(ReadOnlySpan<byte> sourceBytes, Span<char> destinationChars);
	public int GetStringLength(int byteLength);
}

public class HexEncoding
{
	public static HexEncoding UpperCase { get; }
	public static HexEncoding LowerCase { get; }
}

perhaps add some additional overloads with byte[], char[], string inputs.

Loading

@karelz
Copy link
Member

@karelz karelz commented Nov 24, 2016

We don't take any new API additions if they would be obsolete when Span lands.
Marking as blocked on Span dotnet/corefx#13892 - @AlexGhiondea feel free to change it if you think I am wrong.

Loading

@GSPP
Copy link
Author

@GSPP GSPP commented Aug 12, 2018

Span has landed so this can be unblocked.

Loading

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Feb 13, 2019

I propose a slightly different API, instead putting this on the Convert class near the existing base64 APIs. That would help discoverability, and it gives us an existing template for the recommended API shape.

namespace System
{
    // NEW methods on EXISTING type
    public static class Convert
    {
        public static byte[] FromHexString(ReadOnlySpan<char> chars); // doesn't match an existing overload of FromBase64String, but could be useful
        public static byte[] FromHexString(string s);

        public static string ToHexString(byte[] inArray, HexFormattingOptions options = default);
        public static string ToHexString(byte[] inArray, int offset, int length, HexFormattingOptions options = default);
        public static string ToHexString(ReadOnlySpan<byte> bytes, HexFormattingOptions options = default);
    }

    [Flags]
    public enum HexFormattingOptions
    {
        None = 0,
        InsertLineBreaks = 1,
        Lowercase = 2,
    }
}

It also makes sense IMO to have OperationStatus-based APIs. Those APIs could be done in the corefxlab repo to live alongside the existing Base64Encoder and Base64Decoder types in that repo. Then all of the OperationStatus-based base64 and hex APIs can be reviewed at once if we decide to move them into corefx.

Loading

@vcsjones
Copy link
Member

@vcsjones vcsjones commented Feb 14, 2019

To give some parity with the *CharArray from base64, I would consider a few other methods:

public static class Convert
{
    public bool TryFromHexString(ReadOnlySpan<char> chars, Span<byte> buffer, out int bytesWritten);
    public bool TryFromHexString(string s, Span<byte> buffer, out int bytesWritten);
    public bool TryToHexString(ReadOnlySpan<byte> bytes, Span<char> buffer, out int charsWritten, HexFormattingOptions options = default);
}

The rationale for this is in my typical usage I expect a hex string to be short (think a SHA digest) which tops out at 64 bytes right now.

I'm not strongly tied to these, but worth considering during review.


Other considerations, does there need to be an options for the FromHexString that provide similar functionality like specifying if the input is expected to be upper or lower, or would the input be treated case insensitive?

What about continuing from line breaks?

Loading

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Feb 14, 2019

@vcsjones I envision HexFormattingOptions.InsertLineBreaks as inserting every 72 chars, just like base64 does today. And parsing would allow all of uppercase / lowercase / whitespace, hence no options parameter.

Loading

@GSPP
Copy link
Author

@GSPP GSPP commented Feb 14, 2019

I'd not put it on Convert. My argument for that is this: Imagine there were 10 different such formats (Hex, base64, base85, ...). Then, we clearly would like each of them to be their own static class. Anything else would make for a very crowded API. Now we don't have 10 formats but from this thought experiment we can see that creating a separate class is the better option. Especially, since more formats might be added in the future. For example, Bitcoin uses a base85 format which is more efficient. Who knows if additional formats might be desirable in the future.

Should this API accept any other input besides spans? It could accept arrays, strings, ArraySegment and even pointers. In my opinion it should only take spans. That would make for the cleanest and least crowded API. Anything else is easily converted to span. In particular I don't want to see overloads with index and count parameters. Thank god we no longer need those!

We certainly need methods that do not throw in the error case. My original proposal missed those.

InsertLineBreaks as inserting every 72 chars

Is this a widely known standard format? Is it only for base64 or also for hex?


New API proposal:

namespace System
{
    public static class HexConvert
    {
        //convert to bytes, throwing
        public static byte[] GetBytes(ReadOnlySpan<char> chars, HexFormattingOptions formattingOptions = default);
        public static void GetBytes(ReadOnlySpan<char> chars, Span<byte> output, out int bytesWritten, HexFormattingOptions formattingOptions = default);

        //convert to bytes, try
        public static byte[] TryGetBytes(ReadOnlySpan<char> chars, HexFormattingOptions formattingOptions = default); //can return null
        public static bool TryGetBytes(ReadOnlySpan<char> chars, Span<byte> output, out int bytesWritten, HexFormattingOptions formattingOptions = default);

         //convert to chars
         string GetString(ReadOnlySpan<byte> bytes);
         char[] GetChars(ReadOnlySpan<byte> bytes);

         //plus non-throwing overloads
         //plus overloads with output buffer
         //plus overloads without HexFormattingOptions (I was lazy so I made them default arguments)
    }

    [Flags]
    public enum HexFormattingOptions
    {
        None = 0,
        Lowercase = 1,
    }
}

I chose the name prefix to be "Get" so that "TryGet" is fitting. This succinct naming would not be possible when placing these methods in Convert.

Loading

@AlexeiScherbakov
Copy link

@AlexeiScherbakov AlexeiScherbakov commented Feb 17, 2019

There are many variants of API. It can be used in web,cryptography and many other places. Some hex conversion is used in json for writing non-ASCII chars - \u0000 - \uFFFF

I think that .NET Core must provide optimized variants for each purpose

Writing API

  1. It can be uppercase/lowercase
  2. It can use separators or not use
  3. Separators can be single char or multi (char[] or string)
  4. It can create new instance of string and char[] or use Span, StringBuilder, TextWriter or Stream as output
  5. Source can be Span or types like,(s)byte,(u)short,char,(u)int,(u)long
  6. If source is Span function can take entire span or it part (for direct byte[] calls)

All above give us more than 100 function declarations. If it will have byte[] inputs - it take much more

For example just writing API for ReadOnlySpan as source, with all kinds of separators, and only lowercase give us 6 optimized variants

public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, char separator);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, ReadOnlySpan<char> separator);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, int offset, int count);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, int offset, int count, char separator);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, int offset, int count, ReadOnlySpan<char> separator);

This is bad idea to place it in existing places, I think, that better to place this API in external netcoreapp2.1+netstandard2.1 package. Something like System.Text.Hex

Loading

@AlexeiScherbakov
Copy link

@AlexeiScherbakov AlexeiScherbakov commented Feb 17, 2019

Also, as @vcsjones mentioned for short strings, we need some different allocation mechanism, to avoid dirty hacks for reducing allocations count: (https://docs.microsoft.com/en-us/dotnet/csharp/how-to/modify-string-contents#unsafe-modifications-to-string)

public static unsafe string ToHexStringLowerCase(this byte[] array)
{
	string ret = new string('\0', array.Length * 2);
	
	fixed(char* strPtr=ret)
	{
		char* ptr = strPtr;
		for (int i = 0; i < array.Length; i++)
		{
			byte bytic = array[i];
			*ptr = HexChars.HexDigitsLowerCase[(bytic >> 4) & 0xF];
			ptr++;
			*ptr = HexChars.HexDigitsLowerCase[bytic & 0xF];
			ptr++;
		}
	}
	return ret;
}

For example in this case we can allocate string memory (x2 of byte array length) and write to it, without copy contructor.

Loading

@tkp1n
Copy link
Contributor

@tkp1n tkp1n commented Jul 13, 2019

I consolidated the different proposals and comments made in this discussion and came up with the following API. I also provide a sample implementation (internally utilizing AVX2, SSE4.1, SSSE3, SSE2 or Scalar/Lookup with the following performance characteristics) of the API to experiment with via a seperate GitHub repository and nuget package.

Sample implementation to play arround with

https://github.com/tkp1n/HexMate
https://www.nuget.org/packages/HexMate

Adressed proposals and comments:

Open questions:

API proposal

namespace System
{
    // NEW type
    [Flags]
    public enum HexFormattingOptions
    {
        None = 0,
        InsertLineBreaks = 1,
        Lowercase = 2
    }

    // NEW methods on EXISTING type
    public static class Convert
    {
        // Decode from chars
        public static byte[] FromHexCharArray(char[] inArray, int offset, int length) => throw null;
        public static bool TryFromHexChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) => throw null;

        // Decode from strings
        public static byte[] FromHexString(string s) => throw null;
        public static byte[] FromHexString(ReadOnlySpan<char> chars) => throw null;
        public static bool TryFromHexString(string s, Span<byte> bytes, out int bytesWritten) => throw null;

        // Encode to chars
        public static int ToHexCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut, HexFormattingOptions options = default) => throw null;
        public static bool TryToHexChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten, HexFormattingOptions options = default) => throw null;

        // Encode to strings
        public static string ToHexString(byte[] inArray, HexFormattingOptions options = default) => throw null;
        public static string ToHexString(byte[] inArray, int offset, int length, HexFormattingOptions options = default) => throw null;
        public static string ToHexString(ReadOnlySpan<byte> bytes, HexFormattingOptions options = default) => throw null;
    }
}

namespace System.Buffers.Text
{
    // NEW type
    public static class Hex
    {
        // Decode
        public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) => throw null;
        public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten) => throw null;
        
        // Encode
        public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) => throw null;
        public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten) => throw null;
    }
}

Loading

@GSPP
Copy link
Author

@GSPP GSPP commented Jul 13, 2019

I think we should demonstrate a need for inserting line breaks first. I am not aware of any common scenario where this would be used. If this feature is added then the line length must be variable. I cannot imagine any fixed line length to be appropriate.

What are the intended use cases?

Should the parsing routines accept --separated input

By default, the parsing routines should be maximally strict. This means no support for whitespace or any separator. When we parse data we generally want to be as strict as possible in order to find bugs in the originating systems and in order to not accidentally accept invalid data. The motto be strict in what you produce and liberal in what you accept is false. It has lead to the mess that is web development. The correct approach is to be as strict as possible and relax only if required.

Parsing should accept any case of the letters A-F, though. I see no good alternative to that.

I see no need for supporting a separator. This is a rare need. An easy workaround exists at a performance cost: str.Replace("-", "").

Loading

@terrajobst
Copy link
Member

@terrajobst terrajobst commented Aug 20, 2019

@GrabYourPitchforks, it seems we should survey what internal implementations we have and what all the places are where we expose something similar. It seems new proposals were added after you marked it as read for review.

Loading

@bugproof
Copy link

@bugproof bugproof commented Dec 5, 2019

Really looking forward to seeing this. I benchmarked @tkp1n implementation and it's faster than everything else in .NET Core. Compared to these methods: https://stackoverflow.com/a/624379/2828480

Loading

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Apr 19, 2020

I have some follow-up thoughts on this.

My instinct is that @tkp1n's comment at #17837 (comment) mostly nails it, but I'd recommend some further changes. First, let's not bother with anything in System.Buffers.Text just yet, reserving this specific issue for new APIs in System.Convert. Second, upon further reflection I don't think we need a HexFormattingOption enum. It could instead be an optional parameter bool lowercase = false at the end of the method signature. There's no need to introduce a new enum type into the System namespace just yet.

I spent some time building up a list of places in the framework where we convert bytes to hex strings. See the end of this comment for a partial list. Every single one of these matches can be replaced with a call to one of the Convert APIs as proposed in this issue.

Now, for a rude Q and A.

What's the primary scenario for this?

The 99% use case for converting to hex is to produce a case-insensitive identifier from a binary buffer (say, a GUID or similar). It's also frequently used in cryptography to represent symmetric key material or digests.

In particular, the majority of hex use cases are for small-ish inputs and outputs (say, 256 bits max in the common case). That's also about the bounds of what we would expect where identifiers might appear.

This means that we don't need to worry about hyper-optimizing any such implementation. Until we get evidence that this particular API is bottlenecking real-world scenarios, a simple for loop should suffice as an implementation.

What about customizing joiners or line breaks?

For the primary scenario of identifiers and other small-ish payloads, callers typically don't want any type of joiner.

Where there are exceptions they tend to be things we can't easily account for in a generalized API. For example, some callers want the format "00-11-22-..." (using joiners). Some callers want the format "0x00, 0x11, ..." (using a prefix and joiners). Some callers want "%00%11..." (using just a prefix) as the format. Hex editors might want simple spaces as a joiner, except where after every 60 characters a newline might occur, and that should be followed by a horizontal tab instead of a space.

I anticipate that such callers will continue to use their own custom logic. Since they represent the minority use case I'm not too worried about them for this scenario.

What about Convert.FromHexString?

The behavior would be that it allows only [0-9A-Fa-f]. The incoming string length must be evenly divisible by 2. (Empty strings are allowed.)

Framework matches

libraries\Common\tests\System\Net\Http\LoopbackServer.AuthenticationHelpers.cs:
  299:                     sb.Append(b.ToString("x2"));

libraries\Common\tests\System\Security\Cryptography\ByteUtils.cs:
  59:                 builder.Append(bytes[i].ToString("X2"));

libraries\System.CodeDom\src\Microsoft\CSharp\CSharpCodeGenerator.cs:
  2512:                     Output.Write(b.ToString("X2", CultureInfo.InvariantCulture));

libraries\System.CodeDom\src\Microsoft\VisualBasic\VBCodeGenerator.cs:
  2273:                     Output.Write(b.ToString("X2", CultureInfo.InvariantCulture));

libraries\System.DirectoryServices.AccountManagement\src\System\DirectoryServices\AccountManagement\Utils.cs:
  62:                 stringizedArray.Append(b.ToString("x2", CultureInfo.InvariantCulture));
  82:                 stringizedBinarySid.Append(b.ToString("x2", CultureInfo.InvariantCulture));

libraries\System.DirectoryServices.AccountManagement\src\System\DirectoryServices\AccountManagement\AD\ADStoreCtx_Query.cs:
  496:                         stringguid.Append(b.ToString("x2", CultureInfo.InvariantCulture));
  630:                 stringizedBinarySid.Append(b.ToString("x2", CultureInfo.InvariantCulture));

(below implementation is incorrect)
libraries\System.Drawing.Common\tests\mono\System.Drawing\BitmapTests.cs:
  397:                 sOutput.Append(arrInput[i].ToString("X2"));

libraries\System.Net.Http\src\System\Net\Http\SocketsHttpHandler\AuthenticationHelper.Digest.cs:
  245:                     bool formatted = result[i].TryFormat(byteX2, out int charsWritten, "x2");

libraries\System.Private.CoreLib\src\System\Diagnostics\Tracing\TraceLogging\XplatEventLogger.cs:
  156:                 byteArray[i].TryFormat(hex, out int charsWritten, hexFormat);

libraries\System.Private.CoreLib\src\System\Reflection\AssemblyNameFormatter.cs:
  75:                         sb.Append(b.ToString("x2", CultureInfo.InvariantCulture));

libraries\System.Reflection.MetadataLoadContext\tests\src\TestUtils\TestUtils.cs:
  127:                 builder.Append(bytes[i].ToString("X2"));

libraries\System.Security.Cryptography.Xml\src\System\Security\Cryptography\Xml\SignedXmlDebugLog.cs:
  203:                 builder.Append(b.ToString("x2", CultureInfo.InvariantCulture));

libraries\System.Security.Cryptography.Xml\tests\XmlDsigXsltTransformTest.cs:
  117:                 sb.Append(b.ToString("X2"));

libraries\System.Text.Encodings.Web\tests\UrlEncoderTests.cs:
  296:             return string.Concat(_utf8EncodingThrowOnInvalidBytes.GetBytes(char.ConvertFromUtf32(codePoint)).Select(b => string.Format(CultureInfo.InvariantCulture, "%{0:X2}", b)));

libraries\System.Security.Cryptography.X509Certificates\src\Internal\Cryptography\Helpers.cs:
  17:         public static char[] ToHexArrayUpper(this byte[] bytes)
  24:         private static void ToHexArrayUpper(ReadOnlySpan<byte> bytes, Span<char> chars)
  36:         public static string ToHexStringUpper(this byte[] bytes) =>

Loading

@tkp1n
Copy link
Contributor

@tkp1n tkp1n commented Apr 20, 2020

@GrabYourPitchforks thank you for picking this up. I generally agree with the points you've brought up.

What's the primary scenario for this?

Let me add to your list two additional use cases I could think of: The tracing of binary data in motion (e.g., low-level network protocol) and hex dumps of binary files. Both typically encode a quantity of data well above the otherwise typical "max 256 bits"-cases. Ingesting those traces/dumps would consequently also use the decode methods with larger data sizes.

Investigating whether a vectorized implementation was worth it for those cases showed that AVX2 could be up to 10x faster in decoding 2KB of hex-encoded data (see table below / full comparison here).

hex

There are already vectorized implementations in the broader ecosystem (e.g., this one in ASP.NET Core) that could be migrated to the API proposed here if we allow for some "hyper-optimization" to ensure the migration causes no perf regressions.

I would suggest a for-loop approach for small input data sizes and using an SSSE3 or AVX2 approach above a given threshold. All of this is an implementation detail, however, and we should first settle on the API, I guess. More than happy to provide a PR by the way, once the API is approved.

After integrating your feedback to my comment above, the API proposal would look like this:

namespace System
{
    // NEW methods on EXISTING type
    public static class Convert
    {
        // Decode from chars
        public static byte[] FromHexCharArray(char[] inArray, int offset, int length) => throw null;
        public static bool TryFromHexChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) => throw null;

        // Decode from strings
        public static byte[] FromHexString(string s) => throw null;
        public static byte[] FromHexString(ReadOnlySpan<char> chars) => throw null;
        public static bool TryFromHexString(string s, Span<byte> bytes, out int bytesWritten) => throw null;

        // Encode to chars
        public static int ToHexCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut, bool lowercase = false) => throw null;
        public static bool TryToHexChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten, bool lowercase = false) => throw null;

        // Encode to strings
        public static string ToHexString(byte[] inArray, bool lowercase = false) => throw null;
        public static string ToHexString(byte[] inArray, int offset, int length, bool lowercase = false) => throw null;
        public static string ToHexString(ReadOnlySpan<byte> bytes, bool lowercase = false) => throw null;
    }
}

Loading

@GSPP
Copy link
Author

@GSPP GSPP commented Apr 25, 2020

Maybe get in a simple implementation soon and rely on a community contribution to provide a super fast vectorized implementation. Sounds like a fun project for someone.

Loading

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Apr 25, 2020

@tkp1n For your scenarios, would those tracers want to do a straight hex dump "00112233..." or would they want things to be formatted "00-11-22-..."? I've never seen any amount of data larger than a few dozen bytes dumped to hex as-is without some kind of formatting to make it more human-readable.

In the case of the aspnet core code you linked to above, it can be written in terms of the 100% fully type-safe HexConverter internal class already in CoreLib, and its performance is within 1 nanosecond of the unsafe SSE3 version.

Method value Mean Error StdDev
UseSse3_Unsafe 3735928559 1.324 ns 0.0298 ns 0.0249 ns
UseBitTwiddle_Safe_B 3735928559 2.237 ns 0.0259 ns 0.0216 ns

Loading

@tkp1n
Copy link
Contributor

@tkp1n tkp1n commented Apr 27, 2020

@GSPP provide a super fast vectorized implementation

I did that already: https://github.com/tkp1n/HexMate/

@GrabYourPitchforks a straight hex dump or would they want things to be formatted

I'd say that depends on whether the primary reason for tracing is the consumption of the trace by a human or by a machine (with maybe a human looking at it from time to time). I've seen both, but over a certain amount of data there is almost always some formatting applied.

Another not yet mentioned source of "large" hex strings is the xs:hexBinary primitive datatype in XML schemas. Such values are also encoded by the internal HexConverter class in CoreLib (via BinHexEncoder) that is undoubtedly fast for small inputs but falls behind vectorized implementations quickly for more substantial buffers. Decoding of hexBinary is done in BinHexDecoder using (yet) another implementation tolerating a mix of upper- and lowercase characters as well as whitespace.

Loading

@terrajobst
Copy link
Member

@terrajobst terrajobst commented Jun 4, 2020

Video

  • We decided to go with a simpler route.
namespace System
{
    public static class Convert
    {
        public static byte[] FromHexString(string s);
        public static byte[] FromHexString(ReadOnlySpan<char> chars);

        public static string ToHexString(byte[] inArray);
        public static string ToHexString(byte[] inArray, int offset, int length);
        public static string ToHexString(ReadOnlySpan<byte> bytes);
    }
}

Loading

@tkp1n
Copy link
Contributor

@tkp1n tkp1n commented Jun 4, 2020

@terrajobst or @GrabYourPitchforks feel free to assign this to me. I‘d be happy to provide a PR for this.

Loading

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Jun 5, 2020

@tkp1n done! Thanks :)

Loading

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.