A library to assist in security-testing Unicode-enabled applications during fuzzing, XSS, SQLi, etc.



A library to assist in security-testing Unicode-enabled applications. The original intent of putting this together was threefold:

  1. To provide a reduced set of useful Unicode input to a software fuzzer
  2. To document historically problematic Unicode character sequences which might negatively affect protocols and Web applications
  3. To look up mappings for ASCII-equivalent characters

For example, the best-fit and normalization mappings can be useful for testing Web applications for cross-site scripting (XSS) or SQL injection (SQLi) vulnerabilities, by providing you with alternative characters which map back, or transform, to the intended ASCII-encoded input, such as "<", "'", etc.
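As a quick illustration of why such mappings matter (a Python sketch using the standard unicodedata module, not this C# library's API), the fullwidth less-than sign U+FF1C is a distinct code point that folds back to ASCII "<" under compatibility normalization:

```python
import unicodedata

# U+FF1C FULLWIDTH LESS-THAN SIGN renders like '<' but is not U+003C.
fullwidth_lt = "\uFF1C"

# NFKC (compatibility) normalization folds it back to ASCII U+003C,
# so a filter that blocks '<' before normalizing can be bypassed.
normalized = unicodedata.normalize("NFKC", fullwidth_lt)
print(normalized == "<")  # True
```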

Additionally, many problem characters have been pre-defined as a small set, reducing the number of iterations a fuzzer might need to perform.

Major features:

  • best fit mappings
  • Unicode normalization mappings
  • hard-coded Unicode characters useful in fuzzing

For fuzzing applications it includes:

  • ill-formed byte sequences
  • non-characters
  • private use area (PUA)
  • unassigned code points
  • code points with special meaning such as the BOM and RLO
  • half-surrogate values
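A few of the categories above can be sketched as Python string constants (illustrative only; the library's own constants live in its C# source). Lone surrogates are of particular interest because they cannot appear in well-formed UTF-8, so forcing them out produces exactly the ill-formed byte sequences a fuzzer wants to send:

```python
# Illustrative examples of the problem categories listed above.
non_character = "\uFFFE"  # a non-character code point
bom = "\uFEFF"            # BYTE ORDER MARK
rlo = "\u202E"            # RIGHT-TO-LEFT OVERRIDE
pua = "\uE000"            # first Private Use Area code point

# A lone low surrogate such as U+DEAD is illegal in well-formed UTF-8;
# 'surrogatepass' forces Python to emit the ill-formed byte sequence anyway.
half_surrogate = "\uDEAD".encode("utf-8", "surrogatepass")
print(half_surrogate)  # b'\xed\xba\xad'
```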


This Windows Forms application loads the UniHax library mainly to test the best-fit and normalization mappings.
If you simply input a single ASCII character, all of its equivalent characters will be displayed.

e.g. If you're testing a Web application and want to test equivalents for the "<" character U+003C, enter it as input and select either "best-fit mapping" equivalents, which are linked to a charset encoding, or "normalization" equivalents. For this character, the following are best-fit mappings:

  • U+003B in the APL-ISO-IR-68 encoding
  • U+0014 in the CP424 encoding
  • etc...

Also, the following are normalization decomposition mappings:



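One such decomposition mapping can be checked directly with Python's unicodedata module (a hedged sketch, not this library's API): U+226E NOT LESS-THAN canonically decomposes to "<" followed by the combining long solidus overlay U+0338, reintroducing a raw less-than sign:

```python
import unicodedata

# U+226E NOT LESS-THAN canonically decomposes to U+003C U+0338 under NFD.
decomposed = unicodedata.normalize("NFD", "\u226E")
print(decomposed == "\u003C\u0338")  # True: the decomposition contains a raw '<'
```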
This library contains a small set of problematic Unicode characters in Fuzzer.cs, such as the following:

        /// <summary>
        /// An unassigned code point U+0FED
        /// </summary>
        public static readonly string uUnassigned = "\u0FED";
        /// <summary>
        ///  An illegal low half-surrogate U+DEAD
        /// </summary>
        public static readonly string uDEAD = "\uDEAD";

It also provides the following method to return those characters as a byte array in any encoding:

        public byte[] GetCharacterBytes(string encoding, string character)

There's also the following method to return any Unicode character as a malformed byte sequence, produced simply by trimming the last byte:

        public byte[] GetCharacterBytesMalformed(string encoding, string character)
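A rough Python equivalent of this trim-the-last-byte idea (the function name mirrors the C# method but is hypothetical; it is not part of this library):

```python
def get_character_bytes_malformed(encoding: str, character: str) -> bytes:
    """Encode a character, then drop its final byte to yield an ill-formed sequence."""
    encoded = character.encode(encoding)
    return encoded[:-1]

# U+00E9 is two bytes in UTF-8 (0xC3 0xA9); trimming leaves a lone lead byte,
# which a strict decoder must treat as a malformed sequence.
print(get_character_bytes_malformed("utf-8", "\u00E9"))  # b'\xc3'
```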

This project also contains the data files, pre-created in the /data folder, and a Mapping class in Mapping.cs which can look up mapping equivalents for the following:

  • ASCII equivalent best-fit mappings across legacy character encodings
  • ASCII equivalent mappings for Unicode normalization types. For example, Web browsers commonly use a form of normalization for keeping URL content and host names compatible.
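The host-name case can be demonstrated with a Python sketch (the spoofed domain below is a made-up example, and this uses unicodedata rather than this library): fullwidth Latin letters look like ASCII but are distinct code points that compatibility normalization folds back:

```python
import unicodedata

# Fullwidth Latin letters (U+FF41..U+FF5A) render like ASCII look-alikes.
spoofed_host = "\uFF45\uFF58\uFF41\uFF4D\uFF50\uFF4C\uFF45.com"  # looks like example.com

# A browser-style normalization step collapses it to the ASCII host name.
print(unicodedata.normalize("NFKC", spoofed_host))  # example.com
```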

For more on Unicode normalization, see TR15: http://www.unicode.org/reports/tr15/


Unicode-Hax by Chris Weber is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://github.com/cweb/unicode-hax.