Blog - add links to topic Unicode, ASCII, UTF-8 #290
I have questions about the explanation of Unicode:
if it should be the lowest number then should equal to number 128 so I assume that first bits
so the numbers range for two bytes will be
If I assume distinguishing bits like in point one, then the lowest number will be
in decimal The lowest number should be in binary
What do I misunderstand? Thanks
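As a quick check on where each byte length begins, here is a minimal sketch that asks Python for the encoded length at the range boundaries. The helper name `utf8_len` and the chosen boundary list are illustrative assumptions, not something from the stream.

```python
# Hypothetical helper: number of bytes UTF-8 needs for one codepoint.
def utf8_len(codepoint):
    return len(chr(codepoint).encode('utf-8'))

# Codepoints just below and at each point where the byte count changes,
# plus the largest codepoint Python accepts.
boundaries = [127, 128, 2047, 2048, 65535, 65536, 1114111]
lengths = {cp: utf8_len(cp) for cp in boundaries}

for cp, n in lengths.items():
    print(f"{cp:>7} -> {n} byte(s)")
```

Under these assumptions, 128 is the lowest codepoint that needs two bytes and 2047 is the highest.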
Hm, I think I got a few details (regarding the exact bits, rather than the concepts) wrong on the stream.
So, details I got wrong:
The maximum possible Unicode codepoint is 1114111.

Try using this to get the exact values:

>>> [bin(n) for n in chr(12345).encode('utf-8')]
['0b11100011', '0b10000000', '0b10111001']
>>> ord(bytes([0b11100011, 0b10000000, 0b10111001]).decode('utf-8'))
12345

There is some “unused space” in UTF-8. For example, you could encode an ASCII codepoint like 32 (
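To illustrate the “unused space” point, here is a rough sketch that pads codepoint 32 into the two-byte layout by hand and checks whether Python's decoder accepts it. The names `utf8_bits`, `overlong`, and `overlong_accepted` are made up for this example, and the exact bytes are my own construction rather than anything shown on the stream.

```python
import sys

# Hypothetical helper: the UTF-8 bytes of a codepoint as 8-bit binary strings.
def utf8_bits(codepoint):
    return [format(b, '08b') for b in chr(codepoint).encode('utf-8')]

# The normal encoding of 32 (a space) is a single byte.
space_bits = utf8_bits(32)

# The same value squeezed into the two-byte pattern 110xxxxx 10xxxxxx.
# This "overlong" form is exactly the kind of unused space in UTF-8.
overlong = bytes([0b11000000, 0b10100000])
try:
    overlong.decode('utf-8')
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

print(space_bits, overlong_accepted, sys.maxunicode)
```

Python's decoder refuses the padded form, so each codepoint has only one valid encoding, and `sys.maxunicode` reports the same 1114111 limit mentioned above.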
OK, when I tried some numbers, I got these results:
but normally, for example, the number 192 will be saved as:

flowchart TB
192--utf-8-->utf_result(0b11000011, 0b10000000)
utf_result-->fst_byte(110 00011) & snd_byte(10 000000)
fst_byte--start bits-->start(110)
fst_byte--rest bits-->fst_rest(00011)
snd_byte--continuation bits-->continuation(10)
snd_byte--rest bits-->snd_rest(000000)
fst_rest & snd_rest--join rest bits-->result(0b000 0b11000000)
result--get code point of unicode table-->192
By this process we can also get 2047.
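The flowchart steps can be sketched in Python roughly like this. The function names `encode_two_bytes` and `decode_two_bytes` are hypothetical, and the sketch only covers the two-byte range (128 to 2047), not UTF-8 in general.

```python
# Hypothetical helpers that mirror the flowchart for two-byte codepoints only.
def encode_two_bytes(codepoint):
    if not 128 <= codepoint <= 2047:
        raise ValueError("two-byte UTF-8 covers codepoints 128..2047 only")
    first = 0b11000000 | (codepoint >> 6)         # start bits 110 + top 5 bits
    second = 0b10000000 | (codepoint & 0b111111)  # continuation bits 10 + low 6 bits
    return bytes([first, second])

def decode_two_bytes(data):
    first, second = data
    # Drop the marker bits, then join the 5 + 6 remaining bits.
    return ((first & 0b00011111) << 6) | (second & 0b00111111)

encoded_192 = encode_two_bytes(192)
print([bin(b) for b in encoded_192])   # the two bytes from the flowchart
print(decode_two_bytes(encoded_192))   # back to the codepoint
print(decode_two_bytes(bytes([0b11011111, 0b10111111])))  # all payload bits set
```

With all eleven payload bits set, the decoded value is 2047, which is why that is the top of the two-byte range.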
Yes, that looks right.