Blog - add links to topic Unicode, ASCII, UTF-8 #290

nappex · 2022-04-13T07:33:49Z

No description provided.

nappex · 2022-04-14T08:16:20Z

I have questions about the explanation of Unicode:

1. Latin-1 should start at number 128, and Unicode is a superset of Latin-1. All characters beyond ASCII (numbers higher than 127) are encoded by UTF-8 into two bytes, where the first (or rather the last) byte begins with 10 and the next byte starts with 1, so the lowest number created in this format should be

(10)000000 10000000

If that is the lowest number, it should equal 128, so I assume the leading bits 10 are some kind of distinguishing bits that never change and are never included in the calculation. The highest number would then be 2**14 - 1 = 16383, in binary

(10)111111 11111111

So the range of numbers for two bytes would be 128 to 16383. Is that assumption correct? I am asking because in the video stream the lowest number of the range was 256.

2. If UTF-8 gets a number higher than 16383, it encodes it into three bytes in the format

(110)00000 10000000 10000000

If I assume distinguishing bits as in point 1, then the lowest number would be

10000000 10000000

in decimal 2**15 + 2**7 = 32896, but then a lot of numbers are missing, from 16384 to 32895?

The lowest number should be in binary

(110)00000 01000000 00000000

What am I misunderstanding?

Thanks

encukou · 2022-04-14T08:56:31Z

Hm, I think I got a few details (regarding the exact bits, rather than the concepts) wrong on the stream.

latin1 includes 0-255, it's a superset of ASCII.
UTF-8 encoding:
- 0-127 (0 .. 2⁷-1, 7 bits, ASCII): 0xxxxxxx
- 128-2047 (2⁷ .. 2¹¹-1, 11 bits): 110xxxxx 10xxxxxx
- 2048-65535 (2¹¹ .. 2¹⁶-1, 16 bits): 1110xxxx 10xxxxxx 10xxxxxx
- 65536-1114111 (2¹⁶ .. the maximum, 21 bits): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
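The four rules above can be written out by hand. This is only a sketch to make the bit patterns concrete: `utf8_encode` is a hypothetical helper name, it skips the check for surrogate code points (0xD800-0xDFFF) that the real codec refuses, and in practice you would just call `str.encode('utf-8')`.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one code point by hand, following the four bit patterns above.

    Illustrative only: surrogates (0xD800-0xDFFF) are not rejected here,
    although Python's real UTF-8 codec rejects them.
    """
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if codepoint < 0x80:        # 0-127: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # 128-2047: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (codepoint >> 6),
            0b10000000 | (codepoint & 0b111111),
        ])
    if codepoint < 0x10000:     # 2048-65535: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (codepoint >> 12),
            0b10000000 | ((codepoint >> 6) & 0b111111),
            0b10000000 | (codepoint & 0b111111),
        ])
    # 65536-1114111: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (codepoint >> 18),
        0b10000000 | ((codepoint >> 12) & 0b111111),
        0b10000000 | ((codepoint >> 6) & 0b111111),
        0b10000000 | (codepoint & 0b111111),
    ])


# Compare the hand-written version with the built-in codec at the range boundaries.
for cp in (0, 127, 128, 2047, 2048, 12345, 65535, 65536, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The boundaries 128, 2048 and 65536 are exactly where the number of payload bits (7, 11, 16) runs out and one more byte is needed.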

So, details I got wrong:

- “start” bytes have one more 1 bit at the beginning than I said in the stream
- “continuation” bytes start with 10, not just 1

This means you can always tell if a byte is the start of a codepoint, which is a very nice property of UTF-8.
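A small sketch of that property: since only continuation bytes begin with the bits 10, a mask on the top two bits is enough to find where each codepoint starts. The helper names `is_continuation` and `codepoint_starts` are made up for this example.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def codepoint_starts(data: bytes) -> list[int]:
    """Indexes of all bytes that begin a codepoint (everything except 10xxxxxx)."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


# 'a' takes 1 byte, U+00E9 takes 2 bytes, U+20AC takes 3 bytes.
data = "a\u00e9\u20ac".encode("utf-8")
print(codepoint_starts(data))  # [0, 1, 3]
```

This is why a reader dropped into the middle of a UTF-8 stream can skip forward to the next non-continuation byte and resume decoding from there.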

(The maximum possible Unicode codepoint is 1114111 (10FFFF in hex) because of limits of UTF-16. That won't increase.)

Try using this to get the exact values:

>>> [bin(n) for n in chr(12345).encode('utf-8')]
['0b11100011', '0b10000000', '0b10111001']
>>> ord(bytes([0b11100011, 0b10000000, 0b10111001]).decode('utf-8'))
12345

There is some “unused space” in UTF-8. For example, you could encode an ASCII codepoint like 32 (100000 in binary) using the rule for 2048-65535 (3 bytes), and get 11100000 10000000 10100000, but that is not a valid sequence in UTF-8.
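You can check this in Python: the three-byte form of 32 from the paragraph above is refused by the decoder, while the valid encoding is a single byte. The variable names are just for illustration.

```python
# The "too long" three-byte form of codepoint 32 described above.
overlong = bytes([0b11100000, 0b10000000, 0b10100000])

try:
    overlong.decode("utf-8")
except UnicodeDecodeError as exc:
    rejected = True
    print("rejected:", exc.reason)
else:
    rejected = False

# The only valid UTF-8 encoding of 32 is the one-byte form.
valid = chr(32).encode("utf-8")
print(valid)  # b' '
```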

nappex · 2022-04-14T11:52:52Z

OK, when I tried some numbers I got these results:

128 in utf-8 is equal to ['0b11000010', '0b10000000']
129 in utf-8 is equal to ['0b11000010', '0b10000001']
192 in utf-8 is equal to ['0b11000011', '0b10000000']

But normally the number 192, for example, would be stored as 0b11000000 rather than ['0b11000011', '0b10000000'].
So if the computer wants to read UTF-8 bytes and get back the Unicode number (code point), does it have to do something like this?

flowchart TB
192--utf-8-->utf_result(0b11000011, 0b10000000)
utf_result-->fst_byte(110 00011) & snd_byte(10 000000)
fst_byte--start bits-->start(110)
fst_byte--rest bits-->fst_rest(00011)
snd_byte--continuation bits-->continuation(10)
snd_byte--rest bits-->snd_rest(000000)
fst_rest & snd_rest--join rest bits-->result(0b000 0b11000000)
result--get code point of unicode table-->192

By the same process we can also get 2047.

encukou · 2022-04-14T12:43:43Z

Yes, that looks right.
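The flowchart, written as a rough Python sketch for the two-byte case only. `decode_two_bytes` is a hypothetical name; the real decoder handles all four lengths and does more validation.

```python
def decode_two_bytes(first: int, second: int) -> int:
    """Recover a code point from a 110xxxxx 10xxxxxx pair, as in the flowchart."""
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Drop the start bits (110) and continuation bits (10), then join the rest.
    codepoint = ((first & 0b00011111) << 6) | (second & 0b00111111)
    if codepoint < 128:
        # Values below 128 must use the one-byte form.
        raise ValueError("overlong encoding")
    return codepoint


print(decode_two_bytes(0b11000011, 0b10000000))  # 192
print(decode_two_bytes(0b11011111, 0b10111111))  # 2047
```

The second call uses all eleven payload bits set to 1, which is the largest value the two-byte form can hold.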

freezeyt_blog/articles/python_concepts.md

nappex added 2 commits April 13, 2022 09:26

Remove extra 'e' in names

5d70de1

Add link to Unicode, ASCII, UTF-8

5913eb4

jiri-one approved these changes Apr 16, 2022

encukou reviewed Apr 19, 2022

Add link to corrections

57320d8

encukou merged commit c0d9207 into encukou:master Apr 19, 2022
