Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods to convert between hexadecimal strings and bytes #10013

Open
GSPP opened this issue Jul 12, 2016 · 15 comments

Comments

@GSPP
Copy link

commented Jul 12, 2016

It is quite common to need to convert bytes to hex strings and back. The .NET Framework does not have an API to do that. Look at what lengths people have gone to to solve this problem for them: How do you convert Byte Array to Hexadecimal String, and vice versa? Many of those solutions are somehow bad (e.g. slow or broken or not validating).

BitConverter.ToString(byte[]) generates strings of a practically useless format: BitConverter.ToString(new byte[] { 0x00, 0x01, 0xAA, 0xFF }) => 00-01-AA-FF. I don't understand how this decision was made. This seems to be a special solution for some specific purpose (debug output?). Using this API requires Replace(str, "-", "") which is ugly and slow. Also, this API is not discoverable because it does not have "hex" in it's name.

I propose adding an API that does the following:

  • Supports char[] and string
  • Supports specifying the output buffer to support zero-allocation scenarios
  • Supports upper- and lowercase
  • Fast (probably, this means working in-place using unsafe code)
  • Validates input
  • No support for any kind of hex prefix (such as "0x"), whitespace or digit separator (such as "-" or "\n"). This is out of scope.

The use case for char[] in- and outputs is low-allocation scenarios.

The API could look like this:

static class HexConvert {
 byte[] ToBytes(string input, int start, int count, HexFormat hexFormat);
 void ToBytes(string input, int start, int count, ArraySegment<byte> outputBuffer, HexFormat hexFormat);
 //Repeat the same for ArraySegment<char> as input

 string ToString(ArraySegment<byte> bytes);
 char[] ToChars(ArraySegment<byte> bytes);
 void ToChars(ArraySegment<byte> bytes, ArraySegment<char> outputBuffer);

 //Plus obvious convenience overloads

 int HexDigitToInteger(char val); //0-9A-F => 0-15.
 char IntegerToHexDigit(int value, HexFormat hexFormat); //0-15 => 0-9A-F
}

[Flags]
enum HexFormat { UseDefaults = 0, Lowercase = 1 } //extensible

A possible extension would be to support Stream and IEnumerable<byte> as sources and destinations. I would consider this out of scope.

I feel that converting hexadecimal strings is an important and frequent enough thing that this should be supported well. Of course, a "light" version of this proposal is also feasible (just support for string and byte[] with no user-provided output buffer). Not sure if anyone wants/needs lowercase hex, I don't but others might.

This seems cheap enough to implement. Testing should be easily automatable. I don't see any particular usability problems. This is suitable for community contribution.

@joshfree joshfree added this to the Future milestone Jul 12, 2016

@joshfree

This comment has been minimized.

Copy link
Member

commented Jul 12, 2016

Thanks for creating this issue @GSPP.

cc: @terrajobst

https://aka.ms/apireview

@juliusfriedman

This comment has been minimized.

Copy link

commented Jul 12, 2016

Replacement is not the ONLY way, you can just skip the char if you need to.

///Could easily allow upper or lower case here...   
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
public static byte HexCharToByte(char c, bool upperCase = false) { c = char.ToUpperInvariant(c); return (byte)(upperCase ? char.ToUpperInvariant((char)(c > '9' ? c - 'A' + 10 : c - '0')) : (c > '9' ? c - 'A' + 10 : c - '0')); }

        /// <summary>
        /// Converts a String in the form 0011AABB to a Byte[] using the chars in the string as bytes to caulcate the decimal value.
        /// </summary>
        /// <notes>
        /// Reduced string allocations from managed version substring
        /// About 10 milliseconds faster then Managed when doing it 100,000 times. otherwise no change
        /// </notes>
        /// <param name="str"></param>
        /// <returns></returns>
        public static byte[] HexStringToBytes(string str, int start = 0, int length = -1)
        {
            if (length.Equals(Common.Binary.Zero)) return null;

            if (length <= -1) length = str.Length;

            if (start > length - start) throw new System.ArgumentOutOfRangeException("start");

            if (length > length - start) throw new System.ArgumentOutOfRangeException("length");

            System.Collections.Generic.List<byte> result = new System.Collections.Generic.List<byte>(length >> 1); // / 2

            //Dont check the results for overflow
            unchecked
            {
                //Iterate the pointer using the managed length ....
                //Todo, optomize with reverse or i - 1
                for (int i = start, e = length; i < e; i += 2)
                {
                    //to reduce string manipulations pre call
                    if(str[i] == '-') ++i;

                    //Todo, Native and Unsafe
                    //Convert 2 Chars to a byte
                    result.Add((byte)(HexCharToByte(str[i]) << 4 | HexCharToByte(str[i + 1])));
                }
            }

            //Dont use a List..

            //Return the bytes
            return result.ToArray();
        }

I agree that the format is useless in MOST cases...

However if you think about it a little more and try to be as outside the box as possible you will clearly see that there is an extra index for every octet, finally then you can achieve some interesting things by realigning the data or using the extra indexes....

@AlexGhiondea

This comment has been minimized.

Copy link
Member

commented Nov 23, 2016

@GSPP how about augmenting the BitConverter type with these new methods?

I feel like that would be a better place for these since that is already a type people use for converting between various formats.

@AlexGhiondea

This comment has been minimized.

Copy link
Member

commented Nov 24, 2016

There are at least a couple of other issues tracking something like this:
#1647
#4441

Let's track it with a single issue (this one).

It does look like this is something for which there is interest in implementing.

@GSPP, @GrabYourPitchforks @KrzysztofCwalina is anyone of you interested in writing up a formal API proposal for this?

@CodesInChaos

This comment has been minimized.

Copy link

commented Nov 24, 2016

If the slices API lands before this, I'd go with something like:

// all instance methods are thread-safe
public abstract class BinaryEncoding
{
	protected abstract void ToBytesOverride(ReadOnlySpan<char> sourceChars, Span<byte> destinationBytes);
	protected abstract int GetByteLengthOverride(int stringLength);
	protected abstract void ToCharsOverride(ReadOnlySpan<byte> sourceBytes, Span<char> destinationChars);
	protected abstract int GetStringLengthOverride(int byteLength);

	public byte[] ToBytes(ReadOnlySpan<char> str);
	public void ToBytes(ReadOnlySpan<char> sourceChars, Span<byte> destinationBytes);
	public int GetByteLength(int stringLength);
	
	public string ToString(ReadOnlySpan<byte> bytes);
	public char[] ToChars(ReadOnlySpan<byte> bytes);
	public void ToChars(ReadOnlySpan<byte> sourceBytes, Span<char> destinationChars);
	public int GetStringLength(int byteLength);
}

public class HexEncoding
{
	public static HexEncoding UpperCase { get; }
	public static HexEncoding LowerCase { get; }
}

perhaps add some additional overloads with byte[], char[], string inputs.

@karelz

This comment has been minimized.

Copy link
Member

commented Nov 24, 2016

We don't take any new API additions if they would be obsolete when Span lands.
Marking as blocked on Span #13892 - @AlexGhiondea feel free to change it if you think I am wrong.

@karelz karelz added blocked and removed up-for-grabs labels Nov 24, 2016

@GSPP

This comment has been minimized.

Copy link
Author

commented Aug 12, 2018

Span has landed so this can be unblocked.

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member

commented Feb 13, 2019

I propose a slightly different API, instead putting this on the Convert class near the existing base64 APIs. That would help discoverability, and it gives us an existing template for the recommended API shape.

namespace System
{
    // NEW methods on EXISTING type
    public static class Convert
    {
        public static byte[] FromHexString(ReadOnlySpan<char> chars); // doesn't match an existing overload of FromBase64String, but could be useful
        public static byte[] FromHexString(string s);

        public static string ToHexString(byte[] inArray, HexFormattingOptions options = default);
        public static string ToHexString(byte[] inArray, int offset, int length, HexFormattingOptions options = default);
        public static string ToHexString(ReadOnlySpan<byte> bytes, HexFormattingOptions options = default);
    }

    [Flags]
    public enum HexFormattingOptions
    {
        None = 0,
        InsertLineBreaks = 1,
        Lowercase = 2,
    }
}

It also makes sense IMO to have OperationStatus-based APIs. Those APIs could be done in the corefxlab repo to live alongside the existing Base64Encoder and Base64Decoder types in that repo. Then all of the OperationStatus-based base64 and hex APIs can be reviewed at once if we decide to move them into corefx.

@vcsjones

This comment has been minimized.

Copy link
Collaborator

commented Feb 14, 2019

To give some parity with the *CharArray from base64, I would consider a few other methods:

public static class Convert
{
    public bool TryFromHexString(ReadOnlySpan<char> chars, Span<byte> buffer, out int bytesWritten);
    public bool TryFromHexString(string s, Span<byte> buffer, out int bytesWritten);
    public bool TryToHexString(ReadOnlySpan<byte> bytes, Span<char> buffer, out int charsWritten, HexFormattingOptions options = default);
}

The rationale for this is in my typical usage I expect a hex string to be short (think a SHA digest) which tops out at 64 bytes right now.

I'm not strongly tied to these, but worth considering during review.


Other considerations, does there need to be an options for the FromHexString that provide similar functionality like specifying if the input is expected to be upper or lower, or would the input be treated case insensitive?

What about continuing from line breaks?

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member

commented Feb 14, 2019

@vcsjones I envision HexFormattingOptions.InsertLineBreaks as inserting every 72 chars, just like base64 does today. And parsing would allow all of uppercase / lowercase / whitespace, hence no options parameter.

@GSPP

This comment has been minimized.

Copy link
Author

commented Feb 14, 2019

I'd not put it on Convert. My argument for that is this: Imagine there were 10 different such formats (Hex, base64, base85, ...). Then, we clearly would like each of them to be their own static class. Anything else would make for a very crowded API. Now we don't have 10 formats but from this thought experiment we can see that creating a separate class is the better option. Especially, since more formats might be added in the future. For example, Bitcoin uses a base85 format which is more efficient. Who knows if additional formats might be desirable in the future.

Should this API accept any other input besides spans? It could accept arrays, strings, ArraySegment and even pointers. In my opinion it should only take spans. That would make for the cleanest and least crowded API. Anything else is easily converted to span. In particular I don't want to see overloads with index and count parameters. Thank god we no longer need those!

We certainly need methods that do not throw in the error case. My original proposal missed those.

InsertLineBreaks as inserting every 72 chars

Is this a widely known standard format? Is it only for base64 or also for hex?


New API proposal:

namespace System
{
    public static class HexConvert
    {
        //convert to bytes, throwing
        public static byte[] GetBytes(ReadOnlySpan<char> chars, HexFormattingOptions formattingOptions = default);
        public static void GetBytes(ReadOnlySpan<char> chars, Span<byte> output, out int bytesWritten, HexFormattingOptions formattingOptions = default);

        //convert to bytes, try
        public static byte[] TryGetBytes(ReadOnlySpan<char> chars, HexFormattingOptions formattingOptions = default); //can return null
        public static bool TryGetBytes(ReadOnlySpan<char> chars, Span<byte> output, out int bytesWritten, HexFormattingOptions formattingOptions = default);

         //convert to chars
         string GetString(ReadOnlySpan<byte> bytes);
         char[] GetChars(ReadOnlySpan<byte> bytes);

         //plus non-throwing overloads
         //plus overloads with output buffer
         //plus overloads without HexFormattingOptions (I was lazy so I made them default arguments)
    }

    [Flags]
    public enum HexFormattingOptions
    {
        None = 0,
        Lowercase = 1,
    }
}

I chose the name prefix to be "Get" so that "TryGet" is fitting. This succinct naming would not be possible when placing these methods in Convert.

@AlexeiScherbakov

This comment has been minimized.

Copy link

commented Feb 17, 2019

There are many variants of API. It can be used in web,cryptography and many other places. Some hex conversion is used in json for writing non-ASCII chars - \u0000 - \uFFFF

I think that .NET Core must provide optimized variants for each purpose

Writing API

  1. It can be uppercase/lowercase
  2. It can use separators or not use
  3. Separators can be single char or multi (char[] or string)
  4. It can create new instance of string and char[] or use Span, StringBuilder, TextWriter or Stream as output
  5. Source can be Span or types like,(s)byte,(u)short,char,(u)int,(u)long
  6. If source is Span function can take entire span or it part (for direct byte[] calls)

All above give us more than 100 function declarations. If it will have byte[] inputs - it take much more

For example just writing API for ReadOnlySpan as source, with all kinds of separators, and only lowercase give us 6 optimized variants

public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, char separator);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, ReadOnlySpan<char> separator);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, int offset, int count);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, int offset, int count, char separator);
public static string ToHexStringLowerCase(this ReadOnlySpan<byte> array, int offset, int count, ReadOnlySpan<char> separator);

This is bad idea to place it in existing places, I think, that better to place this API in external netcoreapp2.1+netstandard2.1 package. Something like System.Text.Hex

@AlexeiScherbakov

This comment has been minimized.

Copy link

commented Feb 17, 2019

Also, as @vcsjones mentioned for short strings, we need some different allocation mechanism, to avoid dirty hacks for reducing allocations count: (https://docs.microsoft.com/en-us/dotnet/csharp/how-to/modify-string-contents#unsafe-modifications-to-string)

public static unsafe string ToHexStringLowerCase(this byte[] array)
{
	string ret = new string('\0', array.Length * 2);
	
	fixed(char* strPtr=ret)
	{
		char* ptr = strPtr;
		for (int i = 0; i < array.Length; i++)
		{
			byte bytic = array[i];
			*ptr = HexChars.HexDigitsLowerCase[(bytic >> 4) & 0xF];
			ptr++;
			*ptr = HexChars.HexDigitsLowerCase[bytic & 0xF];
			ptr++;
		}
	}
	return ret;
}

For example in this case we can allocate string memory (x2 of byte array length) and write to it, without copy contructor.

@tkp1n

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2019

I consolidated the different proposals and comments made in this discussion and came up with the following API. I also provide a sample implementation (internally utilizing AVX2, SSE4.1, SSSE3, SSE2 or Scalar/Lookup with the following performance characteristics) of the API to experiment with via a seperate GitHub repository and nuget package.

Sample implementation to play arround with

https://github.com/tkp1n/HexMate
https://www.nuget.org/packages/HexMate

Adressed proposals and comments:

Open questions:

  • Would it make sense to add APIs where one could choose a separator at the cost of additional overloads and deviating from base64 API parity? (@AlexeiScherbakov, #10013 (comment))
  • Why should InsertLineBreaks add a new-line every 72 characters for hex but every 76 for base64? (@GrabYourPitchforks, #10013 (comment))
  • Should the parsing routines accept --separated input to allow processing strings returned by BitConverter.ToString(byte[])?
  • Would GetMaxDecodedFromUtf8Length and GetMaxEncodedToUtf8Length make sense for Hex as the calculation is much simpler as with base64 where those methods are provided?

API proposal

namespace System
{
    // NEW type
    [Flags]
    public enum HexFormattingOptions
    {
        None = 0,
        InsertLineBreaks = 1,
        Lowercase = 2
    }

    // NEW methods on EXISTING type
    public static class Convert
    {
        // Decode from chars
        public static byte[] FromHexCharArray(char[] inArray, int offset, int length) => throw null;
        public static bool TryFromHexChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) => throw null;

        // Decode from strings
        public static byte[] FromHexString(string s) => throw null;
        public static byte[] FromHexString(ReadOnlySpan<char> chars) => throw null;
        public static bool TryFromHexString(string s, Span<byte> bytes, out int bytesWritten) => throw null;

        // Encode to chars
        public static int ToHexCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut, HexFormattingOptions options = default) => throw null;
        public static bool TryToHexChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten, HexFormattingOptions options = default) => throw null;

        // Encode to strings
        public static string ToHexString(byte[] inArray, HexFormattingOptions options = default) => throw null;
        public static string ToHexString(byte[] inArray, int offset, int length, HexFormattingOptions options = default) => throw null;
        public static string ToHexString(ReadOnlySpan<byte> bytes, HexFormattingOptions options = default) => throw null;
    }
}

namespace System.Buffers.Text
{
    // NEW type
    public static class Hex
    {
        // Decode
        public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) => throw null;
        public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten) => throw null;
        
        // Encode
        public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) => throw null;
        public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten) => throw null;
    }
}
@GSPP

This comment has been minimized.

Copy link
Author

commented Jul 13, 2019

I think we should demonstrate a need for inserting line breaks first. I am not aware of any common scenario where this would be used. If this feature is added then the line length must be variable. I cannot imagine any fixed line length to be appropriate.

What are the intended use cases?

Should the parsing routines accept --separated input

By default, the parsing routines should be maximally strict. This means no support for whitespace or any separator. When we parse data we generally want to be as strict as possible in order to find bugs in the originating systems and in order to not accidentally accept invalid data. The motto be strict in what you produce and liberal in what you accept is false. It has lead to the mess that is web development. The correct approach is to be as strict as possible and relax only if required.

Parsing should accept any case of the letters A-F, though. I see no good alternative to that.

I see no need for supporting a separator. This is a rare need. An easy workaround exists at a performance cost: str.Replace("-", "").

@stephentoub stephentoub modified the milestones: Future, 5.0 Jul 18, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.