Toolset for handling similarly looking characters in strings.

Homoglyph toolset for Raku language

A homoglyph is a set of one or more graphemes that looks identical or very similar to some other set of graphemes.

For example:

  • 6 (DIGIT SIX) and б (CYRILLIC SMALL LETTER BE)
  • w (LATIN SMALL LETTER W) and ω (GREEK SMALL LETTER OMEGA)
  • oo (2 x LATIN SMALL LETTER O) and က (MYANMAR LETTER KA)
  • E (LATIN CAPITAL LETTER E) and Ε (GREEK CAPITAL LETTER EPSILON) and Е (CYRILLIC CAPITAL LETTER IE)
  • V (LATIN CAPITAL LETTER V) and \/ (REVERSE SOLIDUS + SOLIDUS)

Homoglyphs are:

  • Font dependent - two homoglyphs may be 100% identical in one font but visibly different when rendered in another. Even cursive matters: for example, т in cursive looks like m in some fonts.
  • Subjective - similarity cannot be measured and there is no fixed point where two sets of graphemes stop being homoglyphs. Are a and а homoglyphs? Sure! How about ź and ž? Probably yes. What will you say about R and Я? Er.... You see the point?
  • Funny - replace ; (SEMICOLON) with ; (GREEK QUESTION MARK) in someone's code and watch them trying to debug code that looks perfectly fine :)
  • Dangerous - someone can register an IDN domain that looks very similar to your business domain to swindle money out of your clients.

SYNOPSIS

use HomoGlypher;

my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'r' => [ 'г' ]
);

my %greek = (
    'a' => [ 'α' ],
    'o' => [ 'ο' ]
);

my %myanmar = (
    'oo' => [ 'က' ]
);

my $hg = HomoGlypher.new;

$hg.add-mapping( %cyrillic );
$hg.add-mapping( %greek );
$hg.add-mapping( %myanmar );

my @unwinded = $hg.unwind( 'foo' );    # [ 'foο', 'fοo', 'fοο', 'fက' ]

my @collapsed = $hg.collapse( 'бαг' ); # [ 'bar', '6ar' ]

my $randomized = $hg.randomize( 'bar', level => 80 ); # for example 'bαr'

my &tokenized = $hg.tokenize( );
say so 'bαг' ~~ / <&tokenized: 'bar'> /; # True

HINT

When dealing with homoglyphs, the easiest way to debug them is the uninames method (or uniname for a single character):

$ raku -e '.say for "fοο".uninames'

LATIN SMALL LETTER F
GREEK SMALL LETTER OMICRON
GREEK SMALL LETTER OMICRON
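
Besides listing names, it can help to see which Unicode scripts a string mixes. The following is a minimal sketch using Raku's built-in uniprop; scripts-of is a hypothetical helper name chosen for illustration, not part of HomoGlypher.

```raku
# Hypothetical helper (not part of HomoGlypher):
# list the Unicode scripts used by the characters of a string.
sub scripts-of(Str $text) {
    $text.comb.map(*.uniprop('Script')).unique.sort.List
}

say scripts-of('foo');    # plain ASCII: a single script
say scripts-of('fοο');    # LATIN f followed by two GREEK omicrons: two scripts
```

A string that reports more than one script is not necessarily malicious, but it is a good candidate for a closer look with uninames.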

METHODS

add-mapping

Merges the given mapping (a Hash of Arrays) with the existing mappings.

Typically keys are composed of ASCII characters. Duplicates are filtered out automatically. Multi-character glyphs can be used in both keys and values:

my %mapping = (
    'IO' => [ 'Ю' ],
    'P' => [ '|Ͻ']
);

You can inspect the merged mappings under $hg.mappings, just do not modify them directly. If you want to fine-tune them, fetch the merged result, tweak it and add it to a new HomoGlypher object.
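
To illustrate what merging a Hash of Arrays with duplicate filtering means, here is a small sketch in plain Raku. It is not the module's code: merge-mappings is a hypothetical helper, and keeping glyphs in first-seen order is an assumption made for this example.

```raku
# Hypothetical sketch (not the module's code): merge Hash-of-Arrays mappings,
# keeping first-seen order and dropping duplicate glyphs.
sub merge-mappings(%into, %new) {
    for %new.kv -> $key, $glyphs {
        %into{$key} = [ (|(%into{$key} // []), |$glyphs).unique ];
    }
}

my %merged;
merge-mappings(%merged, %( 'a' => [ 'а' ], 'b' => [ 'б', 'ь' ] ));   # Cyrillic
merge-mappings(%merged, %( 'a' => [ 'α' ], 'b' => [ 'б' ] ));        # Greek 'α' plus a duplicate 'б'

say %merged<a>;   # two glyphs: Cyrillic 'а' and Greek 'α'
say %merged<b>;   # still two glyphs, the duplicate 'б' was dropped
```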

A few ready-to-use mappings are provided in HomoGlypher::Mappings:

  • @basic - ASCII letters and digits in various scripts (armenian, cherokee, cyrillic, deseret, greek, georgian, latin, lisu, roman-numerals, etc.): ΤꜦꜪ QՍΙᴄк вᚱՕꓪɴ ꓝᏅХ jսოр𐑈 օ𐐷еᎱ tᏥе ιαzႸ Ժօց ОᛐշʒᏎƼỼ7ꝸᏭ.
  • %accented - ASCII letters with accents: ȚȞȆ ꝖṲÏÇꝂ ḂŔǾⱲṆ ḞṌẌ ĵữṁꝕṩ ǭⱱëȑ ʈẖḕ ļǟʐȳ ɗȫǵ. Try to read it loud... Correctly :)
  • %flipped - ASCII letters, digits and symbols in various rotations and mirroring: ꓕH⧢ Ꝺ⋂I𐐣ꓘ ꓭꓤOW𐐥 ꓞOX jᴝᴟpƨ ᴑ⋏ǝɹ ʇɥɘ ꞁɐzʎ dᴑᵷ 0ᛚ2Ƹ4567∞9;
use HomoGlypher;
use HomoGlypher::Mappings;

my $hg = HomoGlypher.new;

$hg.add-mapping( $_ ) for @HomoGlypher::Mappings::basic;    # load basic mappings

$hg.add-mapping( %HomoGlypher::Mappings::cyrillic );    # or load specific mapping,
                                                        # check source for available names

I won't tell you where to get a perfect, complete, ultimate mapping, because homoglyphs are font-dependent and similarity is subjective. A good starting point for creating your own mappings is the *_alphabet and *_numeral pages on Wikipedia. Or you can borrow mappings from other projects such as Codebox homoglyphs, IronGeek Homoglyph Attack Generator and many others.

unwind

Generates every possible mapping combination for your ASCII text. Beware: this works only for short inputs, because the list grows exponentially with the number of replaceable characters.

my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'e' => [ 'е', 'ё' ],
    'm' => [ 'м' ],
    'p' => [ 'р' ],
    'r' => [ 'г' ],
    'x' => [ 'х' ]
);

my $hg = HomoGlypher.new;
$hg.add-mapping( %cyrillic );

.say for $hg.unwind( 'example' );
examplё
examрle
examрlе
examрlё
exaмple
exaмplе
exaмplё
exaмрle
...

(total 143 combinations)
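
The total can be checked by hand. Assuming all mapping keys are single characters (as in the %cyrillic example above), each position contributes one choice for the original character plus one per mapped glyph, and the untouched input itself is excluded. The helper name count-variants below is hypothetical and only illustrates this arithmetic.

```raku
my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'e' => [ 'е', 'ё' ],
    'm' => [ 'м' ],
    'p' => [ 'р' ],
    'r' => [ 'г' ],
    'x' => [ 'х' ]
);

# Hypothetical helper: multiply (1 + number of alternatives) over all positions,
# then subtract 1 for the unchanged input. Valid for single-character keys only.
sub count-variants(Str $text, %mapping --> Int) {
    ([*] $text.comb.map({ 1 + (%mapping{$_} // []).elems })) - 1
}

# 'example' -> e:3, x:2, a:2, m:2, p:2, l:1, e:3 -> 144 - 1
say count-variants('example', %cyrillic);   # 143
```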

Output list:

  • Is lazy - so you can iterate over it without worrying about memory consumption.
  • Preserves mappings order - so if you sort your mappings from most to least similar, your result will have the same characteristic.

The main purpose of homoglyph unwinding is to check whether someone is spoofing your domain. See the ready-to-use IDN Checker script.

collapse

The opposite of unwind. If you have suspicious, homoglyphed text, you can check which ASCII texts it might be derived from. Beware: this works only for short inputs.

my %ascii-art = (
    'O' => [ '()' ],
    'V' => [ '\/' ],
    'W' => [ '\/\/' ]
);

my $hg = HomoGlypher.new;
$hg.add-mapping( %ascii-art );

.say for $hg.collapse( '\/()\/\/EL' );
VOVVEL
VOWEL

(as you can see, it may sometimes return more than one possible ASCII text)

The main purpose of homoglyph collapsing is to check whether someone is using your forums, hosting, or other services for phishing or false advertising. See also the tokenize method.

The Unicode::Security module does a similar thing.

tokenize

Constructs a token that can be used to match homoglyphed text in grammars.

my %greek = (
    'a' => [ 'α' ],
    'r' => [ 'Γ' ],
);

my $hg = HomoGlypher.new;
$hg.add-mapping( %greek );

my &homoglyphy = $hg.tokenize( );

'foobαΓbaz' ~~ / $<result>=<&homoglyphy: 'bar'> /;
say $/{ 'result' };

「bαΓ」

Beware: the token uses the mappings present at match time. You can create a token without any mappings added, define a grammar that uses this token and then add mappings before text is actually matched against the grammar. If you need tokens with different sets of mappings in one grammar, you can create and tokenize many HomoGlypher instances.
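
Conceptually, matching 'bar' with homoglyphs resembles a regex that accepts several alternatives at each position. The sketch below writes those alternatives by hand in plain Raku for the mapping 'a' => [ 'α' ], 'r' => [ 'Γ' ]; it only illustrates the idea and makes no claim about how the module builds its token internally.

```raku
# Sketch only: per-position alternatives for the word 'bar', written by hand.
my @a = 'a', 'α';    # 'a' or GREEK SMALL LETTER ALPHA
my @r = 'r', 'Γ';    # 'r' or GREEK CAPITAL LETTER GAMMA

# An array used inside a regex matches any one of its elements.
my $match = 'foobαΓbaz' ~~ / b @a @r /;
say ~$match;    # bαΓ
```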

The Regex::FuzzyToken module can be used to catch misspelled phrases. HomoGlypher and FuzzyToken can coexist in a single grammar:

say 'Suspicious!' if $email-text ~~ / [ <fuzzy: 'paypal'> | <&homoglyphy: 'paypal'> ] /;

This will catch both papyal (misspelled) and pαypαl (homoglyphed). And yes, you can throw a nuke at phishers and catch misspellings and homoglyphs at the same time:

say 'Suspicious!' if $email-text ~~ / <fuzzy: $hg.unwind('paypal')> /;

This will catch such sneaky phrases as pαpyαl.

randomize

Replaces characters in text with homoglyphs with a given probability.

my $hg = HomoGlypher.new;
$hg.add-mapping( %HomoGlypher::Mappings::flipped );

say $hg.randomize( 'DIRECTIONS & CAKE ARE A LIE', level => 100 );
⫏Iя∃C⟘IOИƧ ⅋ C∀K⧢ ∀Я∃ ∀ LI∃

Level is given as a percentage from 1 to 100 (default 50). It decides whether a possible mapping should be used at a given position. Do not confuse it with the number of replaced characters. For example, say you have the mapping 'a' => [ 'α' ] and level set to 50%. Transforming barrrr will yield unmodified barrrr with 50% probability (a transformation was possible at the second position but was not used) and modified bαrrrr with 50% probability (a transformation was possible at the second position and was used). Each position is rolled individually against level. Each possible replacement glyph has an equal chance of being picked.

The Text::Homoglyph module does a similar thing.
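
The per-position roll described above can be sketched in plain Raku as follows. This is not the module's code: randomize-sketch is a hypothetical name, and the sketch handles single-character keys only.

```raku
# Hypothetical sketch (not the module's code) of the per-position roll.
sub randomize-sketch(Str $text, %mapping, Int :$level = 50 --> Str) {
    $text.comb.map(-> $ch {
        my $alts = %mapping{$ch} // [];
        # Replace only when a mapping exists for this character AND the roll succeeds;
        # every alternative glyph is equally likely to be picked.
        $alts.elems && 100.rand < $level ?? $alts.pick !! $ch
    }).join
}

my %greek = ( 'a' => [ 'α' ] );
say randomize-sketch('barrrr', %greek, level => 100);   # always bαrrrr
say randomize-sketch('barrrr', %greek);                 # barrrr or bαrrrr, each about half the time
```

Only the second position has a mapping here, so the number of r characters has no influence on the outcome.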

CONTACT

You can find me on irc.freenode.net #raku channel as bbkr.
