<table>
<tr><td><img style="height: 150px;" src="images/geo_hydro1.jpg"></td>
<td bgcolor="#FFFFFF">
    <p style="font-size: xx-large; font-weight: 900; line-height: 100%">pyCRYPTO</p>
    <p style="font-size: large; color: rgba(0,0,0,0.5);"><b style=color:red;>Crypto</b>graphy</p>
    <p style="font-size: large; color: rgba(0,0,0,0.5);">Georg Kaufmann</p>
    </td>
<td><img style="height: 150px;" src="images/pyCRYPTO.png"></td>
</tr>
</table>

----
# `pyCRYPTO`

pyCRYPTO, a program package for cryptography tools and blockchains.

# Unicode 
----

In this notebook, we discuss different character encodings, used to store letters and numbers, or other symbols.

[See also this explanation on towardsdatascience ...](https://towardsdatascience.com/a-guide-to-unicode-utf-8-and-strings-in-python-757a232db95c)

----
## Checking character positions
Each letter, each digit, and each punctuation mark is assigned a position in an **encoding table**. 

We can test a character with the `ord()` function, which gives us the position in the encoding table:

In [1]:
character = 'a'
help(ord)
print(character,':',ord(character))

Help on built-in function ord in module builtins:

ord(c, /)
    Return the Unicode code point for a one-character string.

a : 97


The reverse function `chr(n)` returns the character belonging to the number $n$, which refers to the underlying encoding table:

In [2]:
n =ord(character)
help(chr)
print(n,':',chr(n))

Help on built-in function chr in module builtins:

chr(i, /)
    Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.

97 : a
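
As a small sketch of how `ord()` and `chr()` undo each other, the following cell applies them to every character of a string. The names `text`, `codes` and `restored` are illustrative and not part of the notebook.

```python
# Convert each character to its table position, then back again.
text = 'pyCRYPTO'
codes = [ord(c) for c in text]              # list of integer positions
restored = ''.join(chr(n) for n in codes)   # rebuild the string from the positions

print(codes)
print(restored == text)
```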


With the `bin()` function, we can figure out the position number in the encoding table as a **binary number**:

In [27]:
help(bin)
print(n,':',bin(n))

Help on built-in function bin in module builtins:

bin(number, /)
    Return the binary representation of an integer.
    
    >>> bin(2796202)
    '0b1010101010101010101010'

97 : 0b1100001
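
`bin()` drops leading zeros and adds the prefix `0b`. If a fixed width of one byte is preferred, the built-in `format()` can pad the result, and `int(..., 2)` converts it back. This is only a sketch; the name `byte_string` is an assumption for illustration.

```python
n = ord('a')

byte_string = format(n, '08b')   # binary digits, zero-padded to 8 places, no '0b' prefix
print(byte_string)               # 01100001
print(int(byte_string, 2))       # back to the decimal number: 97
print(hex(n))                    # the same position in hexadecimal: 0x61
```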


The binary number represents:
$$
1100001
=
1 \times 2^6 +
1 \times 2^5 +
0 \times 2^4 +
0 \times 2^3 +
0 \times 2^2 +
0 \times 2^1 +
1 \times 2^0
=97
$$
Test ...

In [25]:
2**6 + 2**5 + 2**0

97
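
The same sum can be written as a loop over the binary digits. The function below is a minimal sketch (the name `binary_to_int` is hypothetical); in practice the built-in `int(bits, 2)` does the same job.

```python
def binary_to_int(bits):
    """Sum digit * 2**position, counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        total += int(digit) * 2**position
    return total

print(binary_to_int('1100001'))
```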

----
## ASCII

One of the early **character encodings** is the **ASCII** code 
(American Standard Code for Information Interchange). 

The ASCII code uses 7 bits of a byte (which consists of eight bits) to map a character to a binary number used by the computer
to store symbols (such as letters, punctuation marks, ...). The eighth bit is not used in the original ASCII table (it was
sometimes used as a parity bit for testing integrity).
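
To illustrate the idea of a parity bit, here is a minimal sketch that assumes **even parity** stored in the eighth (highest) bit. This convention and the function name `with_even_parity` are assumptions for illustration; real systems used different parity schemes.

```python
def with_even_parity(character):
    """Return the 8-bit value of a 7-bit ASCII character plus an even-parity bit."""
    code = ord(character)
    if code >= 128:
        raise ValueError('not a 7-bit ASCII character')
    ones = bin(code).count('1')   # number of 1-bits in the 7-bit code
    parity = ones % 2             # 1 if the count is odd, so the total becomes even
    return (parity << 7) | code   # place the parity bit in the eighth position

print(format(with_even_parity('a'), '08b'))   # 'a' has three 1-bits: parity bit is 1
print(format(with_even_parity('A'), '08b'))   # 'A' has two 1-bits: parity bit is 0
```

If a single bit flips during transmission, the number of 1-bits becomes odd and the receiver can detect the error.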

Therefore, the ASCII encoding has space for $2^7=128$ characters:
- [0,31] and 127 - non-printable control characters
- [32,126] - printable characters
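
The split between control and printable characters can be checked with the string method `str.isprintable()`. This is a small sketch; the list names `control` and `printable` are illustrative.

```python
# Sort the 128 ASCII positions into non-printable and printable ones.
control   = [i for i in range(2**7) if not chr(i).isprintable()]
printable = [i for i in range(2**7) if chr(i).isprintable()]

print(len(control), 'control characters, the last one at position', control[-1])
print(len(printable), 'printable characters, from', printable[0], 'to', printable[-1])
```

Position 127 (the delete character) is a control character, although it sits at the end of the table.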

We print the positions 32 to 127 from the ASCII table:

In [28]:
for i in range(2**5,2**7):
    print(i,':',chr(i))

32 :  
33 : !
34 : "
35 : #
36 : $
37 : %
38 : &
39 : '
40 : (
41 : )
42 : *
43 : +
44 : ,
45 : -
46 : .
47 : /
48 : 0
49 : 1
50 : 2
51 : 3
52 : 4
53 : 5
54 : 6
55 : 7
56 : 8
57 : 9
58 : :
59 : ;
60 : <
61 : =
62 : >
63 : ?
64 : @
65 : A
66 : B
67 : C
68 : D
69 : E
70 : F
71 : G
72 : H
73 : I
74 : J
75 : K
76 : L
77 : M
78 : N
79 : O
80 : P
81 : Q
82 : R
83 : S
84 : T
85 : U
86 : V
87 : W
88 : X
89 : Y
90 : Z
91 : [
92 : \
93 : ]
94 : ^
95 : _
96 : `
97 : a
98 : b
99 : c
100 : d
101 : e
102 : f
103 : g
104 : h
105 : i
106 : j
107 : k
108 : l
109 : m
110 : n
111 : o
112 : p
113 : q
114 : r
115 : s
116 : t
117 : u
118 : v
119 : w
120 : x
121 : y
122 : z
123 : {
124 : |
125 : }
126 : ~
127 : 


----
## Extended ASCII
There is not enough space in the 128 positions of the ASCII table to cope with special characters such as accented letters. Therefore, the **eighth bit** was added
as storage for characters, extending the range of storage places to $2^8=256$  characters, the **extended ASCII table**.

We check the range [128,255]:

In [29]:
for i in range(2**7,2**8):
    #print(chr(i),' ',end='')
    print(i,':',chr(i))

128 : 
129 : 
130 : 
131 : 
132 : 
133 : 
134 : 
135 : 
136 : 
137 : 
138 : 
139 : 
140 : 
141 : 
142 : 
143 : 
144 : 
145 : 
146 : 
147 : 
148 : 
149 : 
150 : 
151 : 
152 : 
153 : 
154 : 
155 : 
156 : 
157 : 
158 : 
159 : 
160 :  
161 : ¡
162 : ¢
163 : £
164 : ¤
165 : ¥
166 : ¦
167 : §
168 : ¨
169 : ©
170 : ª
171 : «
172 : ¬
173 : ­
174 : ®
175 : ¯
176 : °
177 : ±
178 : ²
179 : ³
180 : ´
181 : µ
182 : ¶
183 : ·
184 : ¸
185 : ¹
186 : º
187 : »
188 : ¼
189 : ½
190 : ¾
191 : ¿
192 : À
193 : Á
194 : Â
195 : Ã
196 : Ä
197 : Å
198 : Æ
199 : Ç
200 : È
201 : É
202 : Ê
203 : Ë
204 : Ì
205 : Í
206 : Î
207 : Ï
208 : Ð
209 : Ñ
210 : Ò
211 : Ó
212 : Ô
213 : Õ
214 : Ö
215 : ×
216 : Ø
217 : Ù
218 : Ú
219 : Û
220 : Ü
221 : Ý
222 : Þ
223 : ß
224 : à
225 : á
226 : â
227 : ã
228 : ä
229 : å
230 : æ
231 : ç
232 : è
233 : é
234 : ê
235 : ë
236 : ì
237 : í
238 : î
239 : ï
240 : ð
241 : ñ
242 : ò
243 : ó
244 : ô
245 : õ
246 : ö
247 : ÷
248 : ø
249 : ù
250 : ú
251 : û
252 : ü
253 : ý
254 : þ
255 : ÿ


There are more characters in this range, such as accented letters and currency signs!
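
The first 32 positions of this range print as empty lines because they hold control characters. The sketch below checks this with the standard `unicodedata` module, and also shows that the positions 0 to 255 returned by `chr()` match the byte values of the Latin-1 (ISO 8859-1) encoding, which is one of several tables known as "extended ASCII". The variable names are illustrative.

```python
import unicodedata

# Positions 128..159 all belong to the Unicode category 'Cc' (control characters).
c1_categories = {unicodedata.category(chr(i)) for i in range(128, 160)}
print(c1_categories)

# In Latin-1, every position 0..255 is stored as exactly one byte of the same value.
round_trip_ok = all(chr(i).encode('latin-1') == bytes([i]) for i in range(2**8))
print(round_trip_ok)

e_acute = chr(233)
print(e_acute, ':', e_acute.encode('latin-1'))
```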

Let's create a **dictionary** of the characters ...

In [30]:
ascii_table = {}   # avoid the name `dict`, which would shadow the built-in type
for i in range(32,2**7):
    ascii_table[i]=chr(i)
print(ascii_table[97])

a


In [6]:
for key, value in ascii_table.items():
    print(key, '->', value)

32 ->  
33 -> !
34 -> "
35 -> #
36 -> $
37 -> %
38 -> &
39 -> '
40 -> (
41 -> )
42 -> *
43 -> +
44 -> ,
45 -> -
46 -> .
47 -> /
48 -> 0
49 -> 1
50 -> 2
51 -> 3
52 -> 4
53 -> 5
54 -> 6
55 -> 7
56 -> 8
57 -> 9
58 -> :
59 -> ;
60 -> <
61 -> =
62 -> >
63 -> ?
64 -> @
65 -> A
66 -> B
67 -> C
68 -> D
69 -> E
70 -> F
71 -> G
72 -> H
73 -> I
74 -> J
75 -> K
76 -> L
77 -> M
78 -> N
79 -> O
80 -> P
81 -> Q
82 -> R
83 -> S
84 -> T
85 -> U
86 -> V
87 -> W
88 -> X
89 -> Y
90 -> Z
91 -> [
92 -> \
93 -> ]
94 -> ^
95 -> _
96 -> `
97 -> a
98 -> b
99 -> c
100 -> d
101 -> e
102 -> f
103 -> g
104 -> h
105 -> i
106 -> j
107 -> k
108 -> l
109 -> m
110 -> n
111 -> o
112 -> p
113 -> q
114 -> r
115 -> s
116 -> t
117 -> u
118 -> v
119 -> w
120 -> x
121 -> y
122 -> z
123 -> {
124 -> |
125 -> }
126 -> ~
127 -> 
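
A dictionary like this can also be inverted, so that a character is the key and its position the value. The sketch below rebuilds the table under the illustrative name `ascii_table` and uses a hypothetical `reverse_table` to translate a short message into numbers and back.

```python
# Forward table: position -> character
ascii_table = {i: chr(i) for i in range(32, 2**7)}
# Reverse table: character -> position
reverse_table = {character: code for code, character in ascii_table.items()}

message = 'Hi!'
encoded = [reverse_table[c] for c in message]          # characters to numbers
decoded = ''.join(ascii_table[n] for n in encoded)     # numbers back to characters

print(encoded)
print(decoded)
```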


----
## Unicode
Now we have to admit that we cheated a bit. We talked about ASCII tables, but the `ord()` and `chr()` functions
actually work with Unicode code points (see e.g. the help on `ord()`).

Here is the explanation: 

The extended ASCII encoding was not sufficient to store symbols from all languages. Here, **Unicode** came into play,
as a new character set with a very large number of symbols. ASCII makes up the first 128 Unicode code points, and the extended table printed above (Latin-1) the first 256.

The Unicode set needs to be used with a proper **encoding**, e.g. **UTF-8**. In this encoding, a symbol occupies a **minimum**
of 8 bits (one byte) and a maximum of 32 bits (four bytes).

In **UTF-16**, a symbol occupies a **minimum** of 16 bits (two bytes), and four bytes otherwise.
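
The different sizes can be checked with `str.encode()`. This is a small sketch; the names `symbols` and `sizes` are illustrative, and `'utf-16-le'` is chosen here so that no byte-order mark is added to the result.

```python
symbols = ['a', '\u03a9', '\U0001F590']   # a, Greek capital omega, hand emoji

# For every symbol: (number of bytes in UTF-8, number of bytes in UTF-16)
sizes = {}
for symbol in symbols:
    label = f'U+{ord(symbol):04X}'
    sizes[label] = (len(symbol.encode('utf-8')), len(symbol.encode('utf-16-le')))

for label, (n_utf8, n_utf16) in sizes.items():
    print(label, ': UTF-8', n_utf8, 'bytes, UTF-16', n_utf16, 'bytes')
```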

In [8]:
a     = 'U+0061'
emoji = 'U+1F590'
omega = 'U+03A9'

# strip the 'U+' prefix; the remaining digits are hexadecimal (base 16)
print(a[2:],int(a[2:]),int(a[2:],16))
# 0061 61 97

print(a,':',chr(int(a[2:],16)))
print(emoji,':',chr(int(emoji[2:],16)))
print(omega,':',chr(int(omega[2:],16)))

0061 61 97
U+0061 : a
U+1F590 : 🖐
U+03A9 : Ω
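
The conversion used above can be wrapped into two small helper functions, one for each direction. Both names, `from_notation` and `to_notation`, are hypothetical and only meant as a sketch.

```python
def from_notation(label):
    """Turn a label such as 'U+03A9' into the corresponding character."""
    if not label.upper().startswith('U+'):
        raise ValueError("expected a label of the form 'U+XXXX'")
    return chr(int(label[2:], 16))   # the digits after 'U+' are hexadecimal

def to_notation(character):
    """Turn a single character into its 'U+XXXX' label (at least four hex digits)."""
    return f'U+{ord(character):04X}'

for label in ['U+0061', 'U+1F590', 'U+03A9']:
    character = from_notation(label)
    print(label, ':', character, ':', to_notation(character))
```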


----