Search by utf8? #25

kbd · 2021-04-13T19:53:33Z

Recently had a problem with some code I copied from a coworker on Slack. For some reason, lines showed up as changed in git even though I couldn't see what was different. I put it through a hex editor and saw e2 80 8b. Went to my usual tool for this type of thing, FileFormat.info, typed that in, and it came up with the right answer: zero-width spaces had been inserted.

I'd like to be able to use uni to search by UTF-8 byte sequence like that.
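
For reference, the lookup being asked for can be sketched outside of uni with Python's standard library. This is only an illustration of the idea, not anything uni does internally; the function name describe_utf8 is made up for the example.

```python
import unicodedata

def describe_utf8(hex_bytes: str) -> list[tuple[str, str]]:
    """Decode a string of hex digits as UTF-8 and describe each character.

    describe_utf8 is a hypothetical helper for illustration only.
    """
    raw = bytes.fromhex(hex_bytes)   # whitespace between byte pairs is ignored
    text = raw.decode("utf-8")       # raises UnicodeDecodeError on invalid UTF-8
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

print(describe_utf8("e2 80 8b"))
```

Feeding it the bytes from the hex editor gives the code point and name of the character they encode.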


arp242 · 2021-04-13T21:58:59Z

This is pretty much what uni identify is:

[~]% uni identify asd
     cpoint  dec    utf8        html       name (cat)
'a'  U+0061  97     61          &#x61;     LATIN SMALL LETTE… (Lowercase_Lett…)
's'  U+0073  115    73          &#x73;     LATIN SMALL LETTE… (Lowercase_Lett…)
'd'  U+0064  100    64          &#x64;     LATIN SMALL LETTE… (Lowercase_Lett…)

Essentially it's a "UTF-8 hexdump".
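
A rough sketch of what a "UTF-8 hexdump" means here, assuming nothing about how uni itself is implemented: walk the text character by character and show the code point, its decimal value, its UTF-8 bytes, and its name. The identify function below is a hypothetical stand-in, and bytes.hex with a separator needs Python 3.8 or later.

```python
import unicodedata

def identify(text: str) -> list[tuple[str, int, str, str]]:
    """Return (codepoint, decimal, utf8 hex, name) for every character in text.

    Hypothetical helper; it only mimics the columns shown above.
    """
    rows = []
    for ch in text:
        utf8_hex = ch.encode("utf-8").hex(" ")   # e.g. "e2 80 8b"
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", ord(ch), utf8_hex, name))
    return rows

# A zero-width space hidden between two letters shows up as its own row.
for row in identify("a\u200bs"):
    print(*row, sep="  ")
```

This goes from text to bytes; it does not cover starting from the bytes alone.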

kbd · 2021-04-14T03:00:59Z

I'm talking about something like:

$ uni identify --utf8 "e2 80 8b"
     cpoint  dec    utf8        html       name (cat)
'�'  U+200B  8203   e2 80 8b    &ZeroWidthSpace; ZERO WIDTH SPACE (Format)

arp242 · 2021-04-14T03:09:54Z

You could just copy the code from Slack to uni, right? That's how I use it anyway.

I suppose some syntax could be added to print; I'm not sure if it's a common use case, and I'm not likely to work on it any time soon, but I'll happily review and merge patches, or I'll probably take it up eventually.

Personally I'd just pipe it to grep (uni p all | grep 'e2 80 8b') in the rare case I'd want it, which is a wee bit slow, but works well enough.

kbd · 2021-04-14T05:14:01Z

$ uni p all | rg 'e2 80 8b'
 '�'  U+200B  8203   e2 80 8b    &ZeroWidthSpace; ZERO WIDTH SPACE (Format)

oh, that'll work most of the time, thanks.

arp242 · 2021-09-23T11:17:00Z

You can now use uni p 'utf8:e2 80 8b', and a few variants thereof:

$ uni p 'utf8:e2 80 8b' 'utf8:e2808b' 'utf8:0xe2 0x80 0x8b' 'utf8:e2-80-8b'
     cpoint  dec    utf8        html       name (cat)
'�'  U+200B  8203   e2 80 8b    &ZeroWidthSpace; ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    &ZeroWidthSpace; ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    &ZeroWidthSpace; ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    &ZeroWidthSpace; ZERO WIDTH SPACE (Format)

I think that should cover all the common syntaxes; the utf8: prefix is needed to disambiguate from codepoints, since uni p 0x200B or just uni p 200B without a leading U+ already prints that codepoint.
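
To illustrate how those four spellings can reduce to the same bytes, here is a minimal sketch in Python. It is not uni's actual parser; parse_utf8_arg is a hypothetical name, and the accepted separators (whitespace and hyphens) are an assumption based only on the examples above.

```python
import re

def parse_utf8_arg(arg: str) -> bytes:
    """Normalise 'utf8:e2 80 8b', 'utf8:e2808b', 'utf8:0xe2 0x80 0x8b'
    or 'utf8:e2-80-8b' into raw bytes. Hypothetical helper, not uni's code."""
    prefix = "utf8:"
    if not arg.startswith(prefix):
        raise ValueError("expected a 'utf8:' prefix")
    # Split on whitespace or hyphens, then drop an optional 0x from each token.
    tokens = re.split(r"[\s-]+", arg[len(prefix):].strip())
    digits = "".join(t[2:] if t[:2].lower() == "0x" else t for t in tokens)
    return bytes.fromhex(digits)   # raises ValueError on odd length or non-hex input

for arg in ["utf8:e2 80 8b", "utf8:e2808b", "utf8:0xe2 0x80 0x8b", "utf8:e2-80-8b"]:
    ch = parse_utf8_arg(arg).decode("utf-8")
    print(f"{arg!r} -> U+{ord(ch):04X}")
```

Requiring the prefix keeps a bare argument such as 200B unambiguous: without it there would be no way to tell a codepoint from a byte sequence.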

arp242 closed this as completed in 23594a3 Sep 23, 2021
