Search by utf8? #25

Closed
kbd opened this issue Apr 13, 2021 · 5 comments

kbd commented Apr 13, 2021

I recently had a problem with some code a coworker sent me over Slack. Lines showed up as changed in git even though I couldn't see any difference. I put the file through a hex editor and saw e2 80 8b. I went to my usual tool for this kind of thing, FileFormat.info, typed that in, and it gave the right answer: zero-width spaces had been inserted.

I'd like to be able to use uni to search by UTF-8 bytes like that.
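
For reference, the lookup described above can be approximated without any special tool. This is a minimal sketch using only Python's standard library, not anything uni provides; the function name `describe_utf8_hex` is made up for illustration.

```python
import unicodedata

def describe_utf8_hex(hex_bytes: str) -> list[tuple[str, str]]:
    """Decode a hex string of UTF-8 bytes and name each resulting code point."""
    # bytes.fromhex skips spaces, so "e2 80 8b" and "e2808b" both work.
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

print(describe_utf8_hex("e2 80 8b"))  # [('U+200B', 'ZERO WIDTH SPACE')]
```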

Owner

arp242 commented Apr 13, 2021

This is pretty much what uni identify is:

[~]% uni identify asd
     cpoint  dec    utf8        html       name (cat)
'a'  U+0061  97     61          a     LATIN SMALL LETTE… (Lowercase_Lett…)
's'  U+0073  115    73          s     LATIN SMALL LETTE… (Lowercase_Lett…)
'd'  U+0064  100    64          d     LATIN SMALL LETTE… (Lowercase_Lett…)

Essentially it's a "UTF-8 hexdump".
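
To illustrate the "UTF-8 hexdump" idea, here is a rough sketch of the same per-character breakdown in Python. It is not how uni is implemented; the `identify` function is hypothetical, and it prints Python's two-letter category codes (such as `Ll`) rather than the long category names uni shows.

```python
import unicodedata

def identify(text: str) -> list[str]:
    """Return one row per character: code point, decimal, UTF-8 bytes, name."""
    rows = []
    for ch in text:
        # Hex dump of the character's UTF-8 encoding, e.g. "e2 80 8b".
        utf8 = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        name = unicodedata.name(ch, "<unnamed>")
        cat = unicodedata.category(ch)
        rows.append(f"U+{ord(ch):04X}  {ord(ch):<6} {utf8:<11} {name} ({cat})")
    return rows

for row in identify("asd"):
    print(row)
```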

Author

kbd commented Apr 14, 2021

I'm talking about something like:

$ uni identify --utf8 "e2 80 8b"
     cpoint  dec    utf8        html       name (cat)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)

Owner

arp242 commented Apr 14, 2021

You could just copy the code from Slack to uni, right? That's how I use it anyway.

I suppose some syntax for this could be added to print. I'm not sure it's a common use case and I'm not likely to work on it any time soon, but I'll happily review and merge patches, and I'll probably take it up eventually.

Personally I'd just pipe it to grep (uni p all | grep 'e2 80 8b') in the rare case I'd want it, which is a wee bit slow, but works well enough.
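
The grep approach amounts to dumping every code point and doing a substring match on the text. A rough Python equivalent is sketched below, assuming only the standard library; `search_by_utf8_hex` is a hypothetical name, and unlike grep on uni's output it matches against the UTF-8 column only. Walking all code points on every search is also why this route is on the slow side.

```python
import unicodedata

def search_by_utf8_hex(pattern: str) -> list[str]:
    """Scan all code points, keeping those whose UTF-8 hex dump contains pattern."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        ch = chr(cp)
        dump = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        if pattern in dump:
            hits.append(f"U+{cp:04X} {dump} {unicodedata.name(ch, '<unnamed>')}")
    return hits

print(search_by_utf8_hex("e2 80 8b"))  # ['U+200B e2 80 8b ZERO WIDTH SPACE']
```

Because this is a plain substring match, a shorter pattern such as `80 8b` matches many unrelated code points, so a full byte sequence works best.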

Author

kbd commented Apr 14, 2021

$ uni p all | rg 'e2 80 8b'
 '�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)

oh, that'll work most of the time, thanks.

@arp242 arp242 closed this as completed in 23594a3 Sep 23, 2021
Owner

arp242 commented Sep 23, 2021

You can now use uni p 'utf8:e2 80 8b', and a few variants thereof:

$ uni p 'utf8:e2 80 8b' 'utf8:e2808b' 'utf8:0xe2 0x80 0x8b' 'utf8:e2-80-8b'
     cpoint  dec    utf8        html       name (cat)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)

I think that covers all the common syntaxes. The utf8: prefix is needed to disambiguate from codepoints, since uni p 0x200B or just uni p 200B (without a leading U+) already prints the codepoint.
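
As an illustration of the four spellings and of why the prefix matters, here is a small Python sketch. It is not uni's actual parser (uni is a separate tool and its code is not shown in this thread); `parse_utf8_arg` is a hypothetical helper that only mirrors the accepted input forms listed above.

```python
import re

def parse_utf8_arg(arg: str) -> str:
    """Turn 'utf8:e2 80 8b' and similar spellings into the decoded text."""
    prefix, sep, rest = arg.partition(":")
    if not sep or prefix.lower() != "utf8":
        raise ValueError("expected a 'utf8:' prefix")
    # Drop optional 0x markers that start a byte, then spaces and dashes.
    hex_digits = re.sub(r"(?<![0-9a-fA-F])0[xX]", "", rest)
    hex_digits = re.sub(r"[\s-]", "", hex_digits)
    return bytes.fromhex(hex_digits).decode("utf-8")

variants = ["utf8:e2 80 8b", "utf8:e2808b", "utf8:0xe2 0x80 0x8b", "utf8:e2-80-8b"]
decoded = [parse_utf8_arg(v) for v in variants]

# Without a prefix, "200B" is ambiguous: one code point, or two UTF-8 bytes.
as_codepoint = chr(int("200B", 16))                    # U+200B ZERO WIDTH SPACE
as_utf8_bytes = bytes.fromhex("200B").decode("utf-8")  # bytes 0x20 0x0b: space + vertical tab
```

All four variants decode to the same single character, U+200B, while the unprefixed `200B` yields two different results depending on how it is read.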
