Blog - add links to topic Unicode, ASCII, UTF-8 #290

Merged
merged 3 commits into from Apr 19, 2022

Conversation

nappex
Collaborator

@nappex nappex commented Apr 13, 2022

No description provided.

@nappex
Collaborator Author

nappex commented Apr 14, 2022

I have some questions about the explanation of Unicode:

  1. Latin-1 should start at number 128, and Unicode is a superset of Latin-1. All characters beyond ASCII (numbers higher than 127) will be encoded by Unicode into two bytes, where the first (resp. last) byte begins with 10 and the next byte starts with 1, so the lowest number created in this format should be
(10)000000 10000000

If that is the lowest number, it should equal 128, so I assume the first bits 10 are some kind of distinguishing bits that never change and are never included in the calculation. The highest number would then be 2**14 - 1 = 16383, in binary

(10)111111 11111111

so the range of numbers for two bytes would be 128 - 16383. Is that a correct assumption? I am asking because the lowest number of the range was 256 in the video stream.
2. If UTF-8 gets a number higher than 16383, it will encode it into three bytes in the format

(110)00000 10000000 10000000

If I assume distinguishing bits as in point one, then the lowest number will be

10000000 10000000

in decimal (2**15 + 2**7 = 32896), but then a lot of numbers are missing, from 16384 to 32895?

The lowest number should be, in binary,

(110)00000 01000000 00000000

What am I misunderstanding?

Thanks

@encukou
Owner

encukou commented Apr 14, 2022

Hm, I think I got a few details (regarding the exact bits, rather than the concepts) wrong on the stream.

  • latin1 covers 0-255; it's a superset of ASCII.
  • UTF-8 encoding:
    • 0-127 (0 .. 2⁷-1, 7 bits, ASCII): 0xxxxxxx
    • 128-2047 (2⁷ .. 2¹¹-1, 11 bits): 110xxxxx 10xxxxxx
    • 2048-65535 (2¹¹ .. 2¹⁶-1, 16 bits): 1110xxxx 10xxxxxx 10xxxxxx
    • 65536-1114111 (2¹⁶ .. the maximum, 21 bits): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
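
To make the table concrete, here is a minimal sketch of a hand-written encoder that follows the four ranges above and checks itself against Python's built-in `str.encode`. The function name `encode_utf8` is made up for illustration, and the surrogate check is a detail not discussed in this thread; it is only there so the sketch agrees with the built-in codec.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one code point by hand, following the four ranges in the table."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= codepoint <= 0xDFFF:
        # Assumption beyond the thread: surrogates are not encodable in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if codepoint < 0x80:        # 0-127: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # 128-2047: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint < 0x10000:     # 2048-65535: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    # 65536-1114111: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b111111),
                  0b10000000 | ((codepoint >> 6) & 0b111111),
                  0b10000000 | (codepoint & 0b111111)])


# Compare with the built-in codec at the boundaries of each range.
for n in (0, 127, 128, 2047, 2048, 65535, 65536, 0x10FFFF):
    assert encode_utf8(n) == chr(n).encode('utf-8')
```

The boundary values in the loop are exactly the edges of the ranges listed above, so a mistake in any range would show up as a failed assertion.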

So, details I got wrong:

  • “start” bytes have one more 1 bit at the beginning than I said in the stream
  • “continuation” bytes start with 10, not just 1
    This means you can always tell if a byte is the start of a codepoint, which is a very nice property of UTF-8.
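
A small sketch of that property, assuming the input is valid UTF-8: every byte that does not match `10xxxxxx` starts a new codepoint, so counting those bytes counts the codepoints without decoding anything. The helper name `is_start_byte` is hypothetical.

```python
def is_start_byte(byte: int) -> bool:
    """True unless the byte looks like 10xxxxxx (a continuation byte)."""
    return (byte & 0b11000000) != 0b10000000


# One 1-byte, one 2-byte and one 3-byte codepoint: 6 bytes in total.
text = 'a' + chr(192) + chr(12345)
data = text.encode('utf-8')

# Count the codepoints by looking only at the top two bits of each byte.
count = sum(is_start_byte(b) for b in data)
assert len(data) == 6
assert count == len(text)
```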

(The maximum possible Unicode codepoint is 1114111 (10FFFF in hex) because of limits of UTF-16. That won't increase.)

Try using this to get the exact values:

>>> [bin(n) for n in chr(12345).encode('utf-8')]
['0b11100011', '0b10000000', '0b10111001']
>>> ord(bytes([0b11100011, 0b10000000, 0b10111001]).decode('utf-8'))
12345

There is some “unused space” in UTF-8. For example, you could encode an ASCII codepoint like 32 (100000 in binary) using the rule for 2048-65535 (3 bytes), and get 11100000 10000000 10100000, but that is not a valid sequence in UTF-8.
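
Such sequences are usually called "overlong" encodings. As a quick check, assuming CPython's built-in `utf-8` codec, decoding the three bytes above fails instead of returning a space:

```python
# The 3-byte pattern filled with the bits of codepoint 32 (an overlong form).
overlong = bytes([0b11100000, 0b10000000, 0b10100000])

try:
    overlong.decode('utf-8')
except UnicodeDecodeError:
    rejected = True   # the codec refuses the sequence
else:
    rejected = False

assert rejected
# The only valid encoding of codepoint 32 is the single byte 0b00100000.
assert chr(32).encode('utf-8') == bytes([0b00100000])
```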

@nappex
Collaborator Author

nappex commented Apr 14, 2022

OK, when I tried some numbers, I got these results:

  • 128 in utf-8 is equal to ['0b11000010', '0b10000000']
  • 129 in utf-8 is equal to ['0b11000010', '0b10000001']
  • 192 in utf-8 is equal to ['0b11000011', '0b10000000']

But normally, for example, the number 192 would be stored as 0b11000000 rather than ['0b11000011', '0b10000000'],
so if the computer wants to read UTF-8 bytes and get the Unicode number (code point) back, does it have to do something like this?

flowchart TB
192--utf-8-->utf_result(0b11000011, 0b10000000)
utf_result-->fst_byte(110 00011) & snd_byte(10 000000)
fst_byte--start bits-->start(110)
fst_byte--rest bits-->fst_rest(00011)
snd_byte--continuation bits-->continuation(10)
snd_byte--rest bits-->snd_rest(000000)
fst_rest & snd_rest--join rest bits-->result(0b000 0b11000000)
result--get code point of unicode table-->192

By this process we can also get 2047.
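
The flowchart can be sketched in Python roughly as follows. This is only an illustration of the two-byte case, not how any real decoder is written, and the name `decode_two_bytes` is made up:

```python
def decode_two_bytes(first: int, second: int) -> int:
    """Recover the code point from a 110xxxxx 10xxxxxx pair."""
    if (first & 0b11100000) != 0b11000000:
        raise ValueError("first byte does not start with 110")
    if (second & 0b11000000) != 0b10000000:
        raise ValueError("second byte does not start with 10")
    # Drop the start/continuation bits and join the remaining 5 + 6 bits.
    value = ((first & 0b00011111) << 6) | (second & 0b00111111)
    if value < 128:
        # Values below 128 must use the 1-byte form, so this pair is invalid.
        raise ValueError("overlong encoding")
    return value


assert decode_two_bytes(0b11000011, 0b10000000) == 192
assert decode_two_bytes(0b11011111, 0b10111111) == 2047
```

The second assertion uses the largest two-byte pair: all eleven payload bits set gives 2047.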

@encukou
Owner

encukou commented Apr 14, 2022

Yes, that looks right.

@encukou encukou merged commit c0d9207 into encukou:master Apr 19, 2022

3 participants