In [1]:
type('There are many characters in the Unicode set')

str

In [2]:
'Ƭђ૯૨૯ α૨૯ ʍαทy ૮ђα૨α૮Ƭ૯૨ઽ ¡ท Ƭђ૯ Unicode ઽ૯Ƭ'

'Ƭђ૯૨૯ α૨૯ ʍαทy ૮ђα૨α૮Ƭ૯૨ઽ ¡ท Ƭђ૯ Unicode ઽ૯Ƭ'

In [3]:
'〶ⓗⓔⓡⓔ ⓐⓡⓔ ⓜⓐⓝⓨ ⓒⓗⓐⓡⓐⓒ〶ⓔⓡⓢ ⓘⓝ 〶ⓗⓔ Unicode ⓢⓔ〶'

'〶ⓗⓔⓡⓔ ⓐⓡⓔ ⓜⓐⓝⓨ ⓒⓗⓐⓡⓐⓒ〶ⓔⓡⓢ ⓘⓝ 〶ⓗⓔ Unicode ⓢⓔ〶'

In [4]:
ord('T'), ord('Ƭ'), ord('〶')

(84, 428, 12342)

In [5]:
'€' # a one-character string

'€'

In [6]:
ord('€')

8364

In [7]:
hex(ord('€'))

'0x20ac'

In [8]:
'\u20ac' # the same string using a Unicode escape sequence

'€'
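The cells above move from a character to its code point and back. As a small sketch using only built-ins (the variable names are illustrative), the round trip and the equivalent spellings of the same one-character string can be checked like this:

```python
# Round trip between a character and its code point.
euro = '€'
code_point = ord(euro)          # character -> integer code point
assert chr(code_point) == euro  # integer code point -> character

# Three equivalent ways to spell the same one-character string.
same = ['\u20ac', '\N{EURO SIGN}', chr(0x20ac)]
print(code_point, hex(code_point), all(s == euro for s in same))
```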

## Encoding From Unicode

[Unicode HOWTO](https://docs.python.org/3.7/howto/unicode.html)

_A Unicode string is a sequence of code points, which are numbers from `0` through `0x10FFFF` (1,114,111 decimal)._ 

This sequence of code points needs to be represented in memory as a set of **code units**, and **code units** are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a **character encoding**, or just an **encoding**.
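To make the idea of an encoding concrete, the sketch below turns one code point into bytes under three encodings. The choice of encodings and the `encoded` dictionary are only for illustration:

```python
text = '€'  # a single code point, U+20AC

# The same code point turned into bytes under three different encodings.
encoded = {name: text.encode(name) for name in ('utf-8', 'utf-16-be', 'utf-32-be')}

for name, data in encoded.items():
    print(name, len(data), data.hex())
```

The code point is the same in every case; only the byte sequence that represents it changes.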

![](assets/array_of_32_bit_integers.png)

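⚠️ **Conceptually, a Python `str` is a sequence of code points, as in the 32-bit array pictured above. CPython 3.3+ actually stores each string with 1, 2 or 4 bytes per code point, depending on the widest character it contains ([PEP 393](https://peps.python.org/pep-0393/)).**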
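How much memory a string uses per character is an implementation detail, but it can be probed. The sketch below assumes CPython and uses a hypothetical helper, `bytes_per_char`, that compares the sizes of two strings of different lengths. Other interpreters may report different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n

# On CPython 3.3+ the result is expected to grow with the widest character.
for ch in ('a', '€', '\U0001F600'):
    print(hex(ord(ch)), bytes_per_char(ch))
```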

In [9]:
string = 'This is the first line of the file €9.99'
file_writer = open('assets/unicode.txt', 'w')
print(type(file_writer))

<class '_io.TextIOWrapper'>


In [10]:
file_writer.write(string)
file_writer.close()

file_reader = open('assets/unicode.txt', 'rb')
print(type(file_reader))

<class '_io.BufferedReader'>


In [11]:
content = file_reader.read()
print(content)
file_reader.close()

b'This is the first line of the file \xe2\x82\xac9.99'


^^^ UTF-8 represents ASCII characters as themselves, one byte each; `€` is written as the three bytes `\xe2\x82\xac`.
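UTF-8 is a variable-width encoding. As a quick illustration (the sample characters are an arbitrary choice), the number of bytes grows with the code point:

```python
# Number of UTF-8 bytes needed for characters from different ranges.
samples = ['A', '\u00e9', '€', '\U0001F600']
widths = {ch: len(ch.encode('utf-8')) for ch in samples}
print(widths)

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert 'file'.encode('utf-8') == 'file'.encode('ascii')
```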

In [12]:
import sys
sys.getdefaultencoding()

'utf-8'
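Note that `sys.getdefaultencoding()` reports the default used by `str.encode()` and `bytes.decode()`. A text file opened without `encoding=` typically falls back to the locale's preferred encoding instead, which is not guaranteed to be UTF-8 on every system. A minimal sketch, assuming a throwaway temporary file (the path and file name are hypothetical), that compares the two and passes the encoding explicitly:

```python
import locale
import os
import sys
import tempfile

print(sys.getdefaultencoding())            # default for str.encode()/bytes.decode()
print(locale.getpreferredencoding(False))  # what open() typically falls back to

# Hypothetical scratch file; an explicit encoding makes the result portable.
path = os.path.join(tempfile.mkdtemp(), 'unicode_demo.txt')
with open(path, 'w', encoding='utf-8') as fh:
    fh.write('€9.99')
with open(path, 'rb') as fh:
    raw = fh.read()
print(raw)
```

Using `with` also closes the file even if the write raises an exception.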

In [13]:
file_writer = open('assets/unicode.txt', 'w', encoding='IBM775')
file_writer.write(string)

UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 35: character maps to <undefined>

In [14]:
file_writer.close()
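The write fails because the `IBM775` code page has no byte for `€`. When losing or escaping such characters is acceptable, an error handler can be passed through the `errors` parameter, which `open()` also accepts. A sketch with a made-up sample string:

```python
text = 'price: €9.99'

# Different strategies for characters the target code page lacks.
handlers = ('replace', 'ignore', 'backslashreplace', 'xmlcharrefreplace')
results = {h: text.encode('IBM775', errors=h) for h in handlers}

for handler, data in results.items():
    print(handler, data)
```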

In [15]:
file_writer = open('assets/unicode.txt', 'w', encoding='utf-16-be')
print(type(file_writer))

<class '_io.TextIOWrapper'>


In [16]:
file_writer.write(string)
file_writer.close()

file_reader = open('assets/unicode.txt', 'rb')
print(type(file_reader))

<class '_io.BufferedReader'>


In [17]:
content = file_reader.read()
file_reader.close()

# iterating over bytes yields integers: print each as hex and as a character
for byte in content:
    print('{:x} {:c}'.format(byte, byte))

0  
54 T
0  
68 h
0  
69 i
0  
73 s
0  
20  
0  
69 i
0  
73 s
0  
20  
0  
74 t
0  
68 h
0  
65 e
0  
20  
0  
66 f
0  
69 i
0  
72 r
0  
73 s
0  
74 t
0  
20  
0  
6c l
0  
69 i
0  
6e n
0  
65 e
0  
20  
0  
6f o
0  
66 f
0  
20  
0  
74 t
0  
68 h
0  
65 e
0  
20  
0  
66 f
0  
69 i
0  
6c l
0  
65 e
0  
20  
20  
ac ¬
0  
39 9
0  
2e .
0  
39 9
0  
39 9


In [18]:
be = b'\x20\xac'
be

b' \xac'

In [19]:
be.decode('utf-16-be')

'€'
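The `-be` suffix stands for big-endian: the most significant byte of each 16-bit code unit comes first. The following sketch contrasts it with the little-endian layout and with plain `utf-16`, which adds a byte order mark (BOM):

```python
import codecs

euro = '€'
big = euro.encode('utf-16-be')     # most significant byte first
little = euro.encode('utf-16-le')  # least significant byte first
print(big.hex(), little.hex())

# Plain 'utf-16' prepends a byte order mark (BOM) in the platform's byte order,
# so a decoder can tell which of the two layouts follows.
with_bom = euro.encode('utf-16')
print(with_bom.startswith(codecs.BOM_UTF16), with_bom.decode('utf-16') == euro)
```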

## Decoding To Unicode

In [20]:
file_writer = open('assets/unicode2.txt', 'w')
file_writer.write('€')

1

In [21]:
file_writer.close()

In [22]:
file_reader = open('assets/unicode2.txt', 'rb')
b_eur = file_reader.read()
[hex(b) for b in b_eur] # 3 bytes long in utf-8

['0xe2', '0x82', '0xac']

In [23]:
type('€'), type(b_eur)

(str, bytes)

In [24]:
[f for f in dir(b_eur) if not f.startswith('_')]

['capitalize',
 'center',
 'count',
 'decode',
 'endswith',
 'expandtabs',
 'find',
 'fromhex',
 'hex',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdigit',
 'islower',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

^^^ `bytes` objects have `decode()` but no `encode()`.

In [25]:
'encode' in dir('€'), 'decode' in dir('€')

(True, False)

**When we write information out, it must be encoded with a specific encoding, e.g. UTF-8.**
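Either the program encodes the text itself and writes bytes, or a text-mode wrapper does the encoding on its behalf. A sketch using in-memory streams in place of real files (the sample text is made up) suggests the two routes produce the same bytes:

```python
import io

text = 'Total: €9.99'

# Option 1: encode by hand and write bytes.
manual = io.BytesIO()
manual.write(text.encode('utf-8'))

# Option 2: let a text wrapper do the encoding on write.
raw = io.BytesIO()
wrapper = io.TextIOWrapper(raw, encoding='utf-8')
wrapper.write(text)
wrapper.flush()  # push buffered text through to the underlying bytes

print(manual.getvalue() == raw.getvalue())
```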

[codecs — Codec registry and base classes](https://docs.python.org/3/library/codecs.html)

In [26]:
help(b_eur.decode)

Help on built-in function decode:

decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
    Decode the bytes using the codec registered for encoding.
    
    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.
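
The effect of the `errors` argument is easiest to see on deliberately damaged input. A sketch with a made-up byte string:

```python
damaged = b'abc\xff'  # 0xff can never appear in valid UTF-8

handlers = ('replace', 'ignore', 'backslashreplace')
results = {h: damaged.decode('utf-8', errors=h) for h in handlers}
for handler, text in results.items():
    print(handler, repr(text))

try:
    damaged.decode('utf-8')  # errors='strict' is the default
except UnicodeDecodeError as exc:
    print('strict:', exc.reason)
```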



In [27]:
b_eur.decode()

'€'

In [28]:
import codecs
dir(codecs)

['BOM',
 'BOM32_BE',
 'BOM32_LE',
 'BOM64_BE',
 'BOM64_LE',
 'BOM_BE',
 'BOM_LE',
 'BOM_UTF16',
 'BOM_UTF16_BE',
 'BOM_UTF16_LE',
 'BOM_UTF32',
 'BOM_UTF32_BE',
 'BOM_UTF32_LE',
 'BOM_UTF8',
 'BufferedIncrementalDecoder',
 'BufferedIncrementalEncoder',
 'Codec',
 'CodecInfo',
 'EncodedFile',
 'IncrementalDecoder',
 'IncrementalEncoder',
 'StreamReader',
 'StreamReaderWriter',
 'StreamRecoder',
 'StreamWriter',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_false',
 'ascii_decode',
 'ascii_encode',
 'backslashreplace_errors',
 'builtins',
 'charmap_build',
 'charmap_decode',
 'charmap_encode',
 'decode',
 'encode',
 'escape_decode',
 'escape_encode',
 'getdecoder',
 'getencoder',
 'getincrementaldecoder',
 'getincrementalencoder',
 'getreader',
 'getwriter',
 'ignore_errors',
 'iterdecode',
 'iterencode',
 'latin_1_decode',
 'latin_1_encode',
 'lookup',
 'lookup_error',
 'make_encoding_map',
 'make_identit

In [29]:
utf_16_codec = codecs.utf_16_decode

In [30]:
type(utf_16_codec)

builtin_function_or_method

In [31]:
help(utf_16_codec)

Help on built-in function utf_16_decode in module _codecs:

utf_16_decode(data, errors=None, final=False, /)
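
The help text for this low-level function is sparse. A documented way to reach a codec's stateless decoder is `codecs.lookup()`, which returns a `CodecInfo` object. Its `decode` function returns a `(text, bytes consumed)` tuple rather than the text alone. A short sketch:

```python
import codecs

info = codecs.lookup('utf-16-be')  # CodecInfo for the named encoding
text, consumed = info.decode(b'\x20\xac')
print(info.name, repr(text), consumed)

# codecs.decode()/codecs.encode() are the one-call equivalents.
assert codecs.decode(b'\x20\xac', 'utf-16-be') == text
assert codecs.encode(text, 'utf-16-be') == b'\x20\xac'
```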



In [32]:
b_eur

b'\xe2\x82\xac'

In [33]:
b_eur.decode('utf-8')

'€'

In [34]:
b_eur.decode('utf-16')

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xac in position 2: truncated data
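
Bytes carry no record of the encoding that produced them, so decoding with the wrong one either fails, as above, or silently yields the wrong text. A sketch of both outcomes, where `cp1252` is just one example of a wrong guess:

```python
b_eur = b'\xe2\x82\xac'  # UTF-8 bytes for '€'

print(b_eur.decode('utf-8'))   # the encoding the bytes were written with
print(b_eur.decode('cp1252'))  # wrong guess: decodes "successfully" to mojibake

# UTF-16 works in 2-byte units, so an odd number of bytes cannot be complete.
try:
    b_eur.decode('utf-16')
except UnicodeDecodeError as exc:
    print(exc.reason)
```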

In [35]:
from encodings import cp1252

In [36]:
type(cp1252.decoding_table), type(cp1252.encoding_table)

(str, EncodingMap)

In [37]:
len(cp1252.decoding_table)

256

In [38]:
len(cp1252.encoding_table)

TypeError: object of type 'EncodingMap' has no len()
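
The two tables work in opposite directions: `decoding_table` maps a byte value to a character and `encoding_table` maps a character back to a byte. The `charmap_decode` and `charmap_encode` functions listed in `dir(codecs)` above take these tables as their third argument. A sketch of calling them directly, which ordinary code rarely needs to do:

```python
import codecs
from encodings import cp1252

# decoding_table: byte value -> character; encoding_table: character -> byte.
decoded, n_bytes = codecs.charmap_decode(b'\x80', 'strict', cp1252.decoding_table)
encoded, n_chars = codecs.charmap_encode('€', 'strict', cp1252.encoding_table)
print(repr(decoded), n_bytes, encoded, n_chars)
```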

In [39]:
[ord(c) for c in cp1252.decoding_table]

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 8364,
 65534,
 8218,
 402,
 8222,
 8230,
 8224,
 8225,
 710,
 8240,
 352,
 8249,
 338,
 65534,
 381,
 65534,
 65534,
 8216,
 8217,
 8220,
 8221,
 8226,
 8211,
 8212,
 732,
 8482,
 353,
 8250,
 339,
 65534,
 382,
 376,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 18

In [40]:
chr(8364)

'€'

In [41]:
chr(8224)

'†'

In [42]:
ord(cp1252.decoding_table[ord('€')])

IndexError: string index out of range
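
The lookup fails because `decoding_table` is indexed by byte value (0 to 255), while `ord('€')` is the Unicode code point 8364. A sketch of looking up in the intended direction:

```python
from encodings import cp1252

table = cp1252.decoding_table

print(table.index('€'))       # byte value that decodes to '€'
print('€'.encode('cp1252'))   # the same answer through the public API
print(hex(ord(table[0x81])))  # an unassigned byte maps to a placeholder
```

The placeholder is the value 65534 that appears in the listing of the table above.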

In [43]:
for o in range(32, 256):  # byte values 32..255 inclusive
    print(o, cp1252.decoding_table[o])

32  
33 !
34 "
35 #
36 $
37 %
38 &
39 '
40 (
41 )
42 *
43 +
44 ,
45 -
46 .
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
58 :
59 ;
60 <
61 =
62 >
63 ?
64 @
65 A
66 B
67 C
68 D
69 E
70 F
71 G
72 H
73 I
74 J
75 K
76 L
77 M
78 N
79 O
80 P
81 Q
82 R
83 S
84 T
85 U
86 V
87 W
88 X
89 Y
90 Z
91 [
92 \
93 ]
94 ^
95 _
96 `
97 a
98 b
99 c
100 d
101 e
102 f
103 g
104 h
105 i
106 j
107 k
108 l
109 m
110 n
111 o
112 p
113 q
114 r
115 s
116 t
117 u
118 v
119 w
120 x
121 y
122 z
123 {
124 |
125 }
126 ~
127 
128 €
129 ￾
130 ‚
131 ƒ
132 „
133 …
134 †
135 ‡
136 ˆ
137 ‰
138 Š
139 ‹
140 Œ
141 ￾
142 Ž
143 ￾
144 ￾
145 ‘
146 ’
147 “
148 ”
149 •
150 –
151 —
152 ˜
153 ™
154 š
155 ›
156 œ
157 ￾
158 ž
159 Ÿ
160  
161 ¡
162 ¢
163 £
164 ¤
165 ¥
166 ¦
167 §
168 ¨
169 ©
170 ª
171 «
172 ¬
173 ­
174 ®
175 ¯
176 °
177 ±
178 ²
179 ³
180 ´
181 µ
182 ¶
183 ·
184 ¸
185 ¹
186 º
187 »
188 ¼
189 ½
190 ¾
191 ¿
192 À
193 Á
194 Â
195 Ã
196 Ä
197 Å
198 Æ
199 Ç
200 È
201 É
202 Ê
203 Ë
204 Ì
205 Í
206 Î
207 Ï
208 Ð
209 Ñ
