#### PYTHON FUNDAMENTALS | SUPPLEMENTARY ► BYTES AND CHARACTER ENCODING
---

### I. Introduction

The **basic physical unit of information** in computing and digital communications is the **binary digit, or bit** (0 or 1). Whatever information you are dealing with (text, image, sound, ...), digital electronic devices convert it into binary digits.

Some terminology:
* A **bit** is either **0** or **1**; 
* A **byte** is a sequence of **8 bits** (for instance 1000 1111); see https://en.wikipedia.org/wiki/Binary_number.

In Python, to convert an integer to its binary representation and back (base-10 to base-2, then base-2 to base-10):

In [113]:
# Convert 8 (decimal) to its binary representation.
bin(8)

'0b1000'

In [115]:
# Convert it back to decimal
int(bin(8), 2) # the second argument specifies the base (here 2)

8
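Note that `bin` drops leading zeros, so its result is not always 8 digits wide. As a small sketch (the variable names are illustrative), `format` with the `'08b'` specifier pads a value to a full byte, and `int` accepts the digits with or without the `0b` prefix:

```python
# Pad the binary representation of 8 to a full byte (8 digits).
padded = format(8, '08b')

# The example byte from the terminology above, 1000 1111, as an integer.
value = int('10001111', 2)

print(padded)  # 00001000
print(value)   # 143
```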

### II. Encoding text

As we have just seen, it is pretty straightforward **to encode** numbers into binary digits. For **text (a sequence of characters)**, you need a kind of **lookup table linking each character to a sequence of digits**. In computing, **character encoding** standards define such tables for a given repertoire of characters.

#### II.1 Character encoding standard

One of the first character encoding standards (1963) is **ASCII (American Standard Code for Information Interchange)**. It encodes **128** specified characters (originally based on the English alphabet) into **seven bits**. For instance, the letter **A** (uppercase) is encoded as **41 in hexadecimal**, or **65 in decimal**.

![ascii-table.png](img/ascii-table.png)

For instance in Python, to know the ASCII code for **`"a"`**:

In [181]:
# ASCII code in base 10
'a'.encode('ascii')[0]

97
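The built-in functions `ord` and `chr` give the same lookup in both directions without going through `bytes`. Strictly speaking they work on Unicode code points, which coincide with ASCII codes for these characters. A minimal sketch (variable names are illustrative):

```python
code_a = ord('a')            # character -> code (decimal)
code_A_hex = hex(ord('A'))   # same lookup for 'A', shown in hexadecimal
char = chr(97)               # code -> character

print(code_a, code_A_hex, char)  # 97 0x41 a
```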

But if you want to encode an accented character like **`"à"`** (used in French, among other languages):

In [187]:
'à'.encode('ascii')[0]

UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 0: ordinal not in range(128)

You get an error as **`"à"`** is not part of the ASCII repertoire.
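If raising an error is not acceptable, `str.encode` takes an `errors` argument that controls what happens to characters outside the repertoire. Below is a short sketch of three standard handlers. Each one loses or alters information in a different way, so this is a workaround rather than a fix:

```python
# Replace the unsupported character with a question mark.
replaced = 'à'.encode('ascii', errors='replace')

# Silently drop the unsupported character.
ignored = 'à'.encode('ascii', errors='ignore')

# Substitute an XML/HTML numeric character reference.
escaped = 'à'.encode('ascii', errors='xmlcharrefreplace')

print(replaced, ignored, escaped)  # b'?' b'' b'&#224;'
```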

As said already, this encoding standard was initially designed for the Latin alphabet as used in written English. **What about Japanese, Arabic, Cyrillic, ... writing systems?**

To cover them, **Unicode** (closely aligned with the ISO **Universal Coded Character Set**) is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. **UTF-8** is a character encoding capable of encoding all possible **Unicode** code points. **UTF-8** encodes each of the **1,112,064** valid code points in **Unicode** using **one to four 8-bit bytes**.

In [209]:
'à'.encode('utf-8')

b'\xc3\xa0'

Now it works! (We will come back to the meaning of the **`b'\xc3\xa0'`** output in a few cells.)
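To make the "one to four bytes" rule concrete, here is a small sketch that counts the bytes utf-8 needs for four sample characters (the samples are arbitrary choices, one per length):

```python
samples = ['a', 'à', '€', '😀']

# Number of utf-8 bytes needed for each sample character.
byte_counts = {ch: len(ch.encode('utf-8')) for ch in samples}

print(byte_counts)  # {'a': 1, 'à': 2, '€': 3, '😀': 4}
```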

It also means that your Python script itself must be encoded to run correctly. Python 3 uses **utf-8 by default, including in a Jupyter notebook**:

In [194]:
import sys
sys.getdefaultencoding()

'utf-8'

You can change it but this is not advised. 

**Note:** In a stand-alone Python program (not in a notebook), you can declare the encoding of your choice by adding the two lines below at the top of your **.py** file (the hash sign is part of each line and has a special meaning here: these are not ordinary comments):

In [199]:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

#### II.2 Encoding and decoding

In Python 3, there are **two types** that **represent sequences of characters**: **`bytes`** and **`str`**:
* instances of **`bytes`** contain raw 8-bit values;
* instances of **`str`** contain Unicode characters.

As seen already, to create a sequence of characters in Python:

In [206]:
text = 'IoT'

In [211]:
type(text) # text is of type 'str'

str

Behind the scenes, and depending on your system **locale** (a set of parameters that defines the user's language and regional settings), when you type the letters **'IoT'** on your keyboard, your system **decodes a sequence of bytes** (here `utf-8`) into a **sequence of Unicode characters**.

**`text`** is essentially now a sequence of characters:

In [213]:
# to extract the first unicode character
text[0]

'I'

* **Encoding**

Now if we want to encode it (utf-8, ASCII, ...):

In [217]:
'internet of things'.encode('utf-8')

b'internet of things'

In [218]:
'internet of things'.encode('ascii')

b'internet of things'

In [219]:
type('internet of things'.encode('ascii'))

bytes

Two remarks at this stage:

1. The Python type is now **`bytes`** (indicated by the **`b`** prefix in front of the quoted string);
2. The **`utf-8`** and **`ascii`** encodings output the same result. This is because they use the same codes for the first 128 characters (for the sake of **backward compatibility** between them).
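The second remark can be checked directly. A minimal sketch that compares both encodings for every character with a code from 0 to 127:

```python
# All 128 characters of the ASCII repertoire.
ascii_chars = [chr(i) for i in range(128)]

# True only if both codecs produce identical bytes for every one of them.
same = all(c.encode('ascii') == c.encode('utf-8') for c in ascii_chars)

print(same)  # True
```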

* **Decoding**

In [220]:
text_encoded = 'internet of things'.encode('utf-8')

To decode it:

In [223]:
text_encoded.decode('utf-8')

'internet of things'

The type of the decoded value is `str` again:

In [226]:
type(text_encoded.decode('utf-8'))

str
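Decoding only gives back the original text if you use the codec the bytes were encoded with. A small sketch, using `latin-1` as an arbitrary example of a wrong choice:

```python
original = 'à'
encoded = original.encode('utf-8')     # two bytes: 0xc3 0xa0

round_trip = encoded.decode('utf-8')   # same codec: the text is recovered
garbled = encoded.decode('latin-1')    # wrong codec: no error, but wrong text

print(round_trip == original)  # True
print(len(garbled))            # 2 (each byte became its own character)
```

The second case is the usual source of "mojibake": the bytes are fine, but they are read with the wrong lookup table.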

### III. Representing encoded text

We saw that a string is essentially a sequence of characters:

In [228]:
text = "IoT"
text[0]

'I'

But when we encode it, the string is converted to a sequence of bytes:

In [230]:
text_b = text.encode('utf-8')

In [231]:
type(text_b)

bytes

When you print `text_b`:

In [233]:
text_b

b'IoT'

This is a **bit confusing** because you expect a sequence of `bytes` (groups of 8 bits) and you get a text representation instead.

This is only a matter of how these bytes are displayed to a human. To get access to the raw **bytes**:

In [234]:
# as decimal
list(text_b)

[73, 111, 84]

In [235]:
# as hexadecimal
[hex(i) for i in list(text_b)]

['0x49', '0x6f', '0x54']

In [236]:
# as binary digits
[bin(i) for i in list(text_b)]

['0b1001001', '0b1101111', '0b1010100']
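Two variations that may be handy, shown as a sketch: `bytes.hex` returns all the bytes as one hexadecimal string, and `format(b, '08b')` keeps the leading zero that `bin` drops, so every byte shows its full 8 bits. The variable is re-created here so that the snippet runs on its own.

```python
text_b = 'IoT'.encode('utf-8')

as_hex = text_b.hex()                          # one hexadecimal string
as_bits = [format(b, '08b') for b in text_b]   # 8 digits per byte

print(as_hex)   # 496f54
print(as_bits)  # ['01001001', '01101111', '01010100']
```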

So for instance for a longer string like `'internet of things'`, your computer or digital communication device will handle the following:

In [242]:
# Don't pay too much attention to the details of the code below; just look at the output.
''.join([bin(i)[2:].zfill(8) for i in list('internet of things'.encode('utf-8'))])

'011010010110111001110100011001010111001001101110011001010111010000100000011011110110011000100000011101000110100001101001011011100110011101110011'

You will admit this is not the most compact way to convey such a piece of data to a human!

So, instead, to make it human-readable (though these remain binary digits behind the scenes), each **byte** is displayed using the **ASCII** table. That's why the string **`IoT`** is printed:

In [245]:
text_b

b'IoT'

instead of: **`['1001001', '1101111', '1010100']`**.

**BUT** we have seen that the ASCII table encodes a maximum of **128** values, so **what happens** for characters **not included in the ASCII table**?

In [246]:
'à'.encode('utf-8')

b'\xc3\xa0'

In [249]:
list('à'.encode('utf-8'))

[195, 160]

We see that the character **`à`** is encoded in **`utf-8`** using two bytes: here **`C3 A0`** (remember that **`utf-8`** uses 1 to 4 bytes to encode Unicode characters, which is how it covers the 1,112,064 valid code points).

That's why we get two bytes above (in decimal): **195** for **`C3`** and **160** for **`A0`**. Both **195** and **160** exceed the maximum value of the ASCII table (127), so the notation **`\xc3`** is used instead, **which is the hexadecimal representation** (indicated by the `\x`) of each byte.
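To see where `C3 A0` comes from, here is a sketch of the two-byte utf-8 layout (`110xxxxx 10xxxxxx`), which applies to code points from U+0080 to U+07FF. It is written by hand for illustration only; in practice `str.encode` does this for you.

```python
code_point = ord('à')   # 224, i.e. U+00E0

# First byte: marker bits 110 followed by the top 5 bits of the code point.
first = 0b11000000 | (code_point >> 6)
# Second byte: marker bits 10 followed by the low 6 bits of the code point.
second = 0b10000000 | (code_point & 0b00111111)

manual = bytes([first, second])

print(manual)                         # b'\xc3\xa0'
print(manual == 'à'.encode('utf-8'))  # True
```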

So to summarize, strictly speaking we cannot say that we have an ASCII representation of an encoded string: it is actually a mix of ASCII characters (7 bits) and escape sequences such as **`'\xc3'`**, each representing a full 8-bit byte.
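A quick way to see this mixed display is to build a `bytes` object from integers directly. In this sketch, values that map to printable ASCII characters are shown as characters, and the others as `\x` escapes:

```python
# 73, 111, 84 are the ASCII codes of 'I', 'o', 'T'; 195 and 160 are above 127.
mixed = bytes([73, 111, 84, 195, 160])

print(mixed)                  # b'IoT\xc3\xa0'
print(mixed.decode('utf-8'))  # IoTà
```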

It means that for a string such as **`'àfdgèqsd'`**, we would get a representation such as:

In [250]:
'àfdgèqsd'.encode('utf-8')

b'\xc3\xa0fdg\xc3\xa8qsd'

In [251]:
list('àfdgèqsd'.encode('utf-8'))

[195, 160, 102, 100, 103, 195, 168, 113, 115, 100]

The string is 8 characters long, but we get 10 bytes once encoded, because two of the characters (**`à`** and **`è`**) each require two bytes in utf-8.
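The difference between counting characters and counting bytes can be sketched as follows (variable names are illustrative):

```python
s = 'àfdgèqsd'

n_chars = len(s)                   # length of the str: characters
n_bytes = len(s.encode('utf-8'))   # length of the bytes: encoded size

# Characters that need exactly two bytes in utf-8.
two_byte_chars = [c for c in s if len(c.encode('utf-8')) == 2]

print(n_chars, n_bytes)  # 8 10
print(two_byte_chars)    # ['à', 'è']
```

This matters whenever a size limit is expressed in bytes (a payload, a database column, ...) rather than in characters.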

### IV. Encoding/decoding hexadecimal values [Fasten your seat belt]

In [33]:
import binascii

Encoding a hexadecimal string is a bit special, as the string carries a special meaning. For instance, the hexadecimal string **`'AB'`** is the hexadecimal representation of the integer **`171`**.

In [34]:
mac = 'AB'

If you encode such a string with the `encode` method as in the previous examples, unsurprisingly you get:

In [35]:
mac.encode('utf-8')

b'AB'

In [36]:
list(mac.encode('utf-8'))

[65, 66]

2 bytes, which are the ASCII codes of **`A`** and **`B`**.

But usually, when we write a string such as **`'AB'`**, we have in mind the hexadecimal representation of the number **`171`**. There is an implicit meaning that we want to capture in the encoding. To do so, the **`binascii`** module provides a dedicated function.

In [70]:
mac = 'AB'

In [71]:
binascii.unhexlify(mac)

b'\xab'

In [79]:
# This is now encoded, as the type is bytes
type(binascii.unhexlify(mac))

bytes

In [72]:
list(binascii.unhexlify(mac))

[171]

Now the two hexadecimal digits are packed into a single byte whose value is 171 (this uses all 8 bits, whereas ASCII codes only need 7).
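As an aside, the same conversions are also available without `binascii`. A minimal sketch using the built-in `int` and `bytes.fromhex`, compared against `binascii.unhexlify`:

```python
import binascii

mac = 'AB'

as_int = int(mac, 16)           # hexadecimal string -> integer
as_bytes = bytes.fromhex(mac)   # hexadecimal string -> bytes

print(as_int)                               # 171
print(as_bytes)                             # b'\xab'
print(as_bytes == binascii.unhexlify(mac))  # True
```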

Now if you want to decode it using the `decode` method:

In [88]:
binascii.unhexlify(mac).decode(encoding='utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 0: invalid start byte

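You get an error because the byte **`171`** (`0xab`) is not valid on its own in utf-8: values above 127 only appear as part of a multi-byte sequence, and `0xab` cannot start one.

Instead, to safely decode an encoded hexadecimal string, you need to convert it back to hexadecimal digits first: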
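In utf-8, bytes of the form `10xxxxxx` are continuation bytes that may only follow a start byte, which is what the "invalid start byte" message refers to. The sketch below checks the bit pattern of `0xab`. It also shows that a single-byte codec such as `latin-1` (an arbitrary choice here) accepts any byte, although the resulting character has nothing to do with the hexadecimal string `'AB'`:

```python
raw = bytes([0xab])

bits = format(raw[0], '08b')             # the 8 bits of the byte
is_continuation = bits.startswith('10')  # utf-8 continuation byte pattern

as_latin1 = raw.decode('latin-1')        # decodes, but to an unrelated character

print(bits, is_continuation)  # 10101011 True
print(as_latin1)              # «
```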

In [90]:
# Here is our encoded hexadecimal string
mac_encoded = binascii.unhexlify(mac)
mac_encoded

b'\xab'

In [92]:
# To decode it, first hexlify it
binascii.hexlify(mac_encoded)

b'ab'

This is still of type bytes, and hence encoded, but it now consists of two ASCII bytes:

In [93]:
list(binascii.hexlify(mac_encoded))

[97, 98]

In [94]:
# Now we can decode it safely:
binascii.hexlify(mac_encoded).decode('utf-8')

'ab'

Now a real example: suppose you get an encoded hexadecimal string such as the 8-byte MAC address of a device, **`'70 b3 d5 49 9c 3a 7f 7d'`**:

In [95]:
mac_encoded = b'p\xb3\xd5I\x9c:\x7f}'

In [96]:
# First hexlify it:
binascii.hexlify(mac_encoded)

b'70b3d5499c3a7f7d'

In [97]:
# And if you want to decode to get an object of type str instead of bytes:
binascii.hexlify(mac_encoded).decode('utf-8')

'70b3d5499c3a7f7d'

Note: getting it as a `str` is useful if you want to concatenate it, write it to a CSV file, ...
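For display purposes, MAC addresses are often written with a separator between bytes. As a sketch (the colon separator and the variable names are illustrative choices), the same result as above and a separated variant can be obtained with `bytes.hex` and `format`:

```python
mac_encoded = b'p\xb3\xd5I\x9c:\x7f}'

# Same result as hexlify(...).decode(...), in one call.
mac_hex = mac_encoded.hex()

# Two hexadecimal digits per byte, joined with colons.
mac_pretty = ':'.join(format(b, '02x') for b in mac_encoded)

print(mac_hex)     # 70b3d5499c3a7f7d
print(mac_pretty)  # 70:b3:d5:49:9c:3a:7f:7d
```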

> I warned you! This is convoluted!