# Character encodings: A tutorial

This tutorial covers the fundamentals of how characters are encoded, with specific examples in Python.

It is based on the following guide: https://realpython.com/python-encodings-guide/

Structure:
- Number representation.
- ASCII standard.
- Unicode standard and its encodings (UTF-8, UTF-16, UTF-32).
- Additional Python notes.

## 1. Number representation

In a computer, each character is mapped to a number, and that number is then stored in binary, which is the only type of data the computer really understands. 
So, before formalizing character encodings, we should first talk about number representation.  

A number can be represented in different forms:
- Decimal (Base 10) -> 0123456789
- Hexadecimal (Base 16) -> 0123456789ABCDEF
- Octal (Base 8) -> 01234567
- Binary (Base 2) -> 01

In Python, we can use the following prefixes for each type of representation:
- Decimal -> No prefix; this is the default form and the most commonly used.
- Hexadecimal -> 0x{number} or 0X{number}
- Octal -> 0o{number} or 0O{number}
- Binary -> 0b{number} or 0B{number}

In each case, {number} stands for the digits of the value written in that base, not for its decimal form.


Let's understand this with an example: we want to know what the digits 11 mean in each of the bases above. 
We can do it in two ways:
- By adding the appropriate prefix at the start of the literal.
- By passing the keyword argument base to the int() constructor.


In [35]:
print(11)  # decimal
print(0o11)  # octal
print(0x11)  # hexadecimal 
print(0b11) # binary

# another way
print(int('11'))  # defaults to base 10
print(int('11', base=8))  # base 8
print(int('11', base=16))  # base 16
print(int('11', base=2))  # base 2

11
9
17
3
11
9
17
3


Another example: imagine that we want to represent a decimal number like 11 in each of its 4 different forms. 
We can do it by using the following functions:
- int(): Returns the integer itself, which Python prints in decimal. 
- bin(): Returns the binary representation of an integer as a string.
- hex(): Returns the hexadecimal representation of an integer as a string.
- oct(): Returns the octal representation of an integer as a string. 


In [39]:
print(bin(11))  # binary representation of the decimal number 11, starts with the prefix 0b
print(hex(11))  # hexadecimal representation of the decimal number 11, starts with the prefix 0x
print(oct(11))  # octal representation of the decimal number 11, starts with the prefix 0o
print(int(0xb))  # decimal representation of the hex number 0xb, no prefix used


0b1011
0xb
0o13
11
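
As a complementary sketch (not part of the original guide), the built-in format() can produce the same digits without the prefix, and int() with the matching base reverses each conversion. The variable names are only illustrative.

```python
number = 11

# format() gives the same digits as bin()/hex()/oct(), but without the prefix
binary_digits = format(number, 'b')
hex_digits = format(number, 'x')
octal_digits = format(number, 'o')
print(binary_digits, hex_digits, octal_digits)  # 1011 b 13

# int() with the matching base takes us back to the original number
round_trip = [int(binary_digits, 2), int(hex_digits, 16), int(octal_digits, 8)]
print(round_trip)  # [11, 11, 11]

# int() also accepts the prefixed strings returned by bin(), hex() and oct()
prefixed_round_trip = [int(bin(number), 2), int(hex(number), 16), int(oct(number), 8)]
print(prefixed_round_trip)  # [11, 11, 11]
```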


## 2. ASCII standard

As we have said in the previous section, a computer represents each character as a number which is finally converted to bits. 

The ASCII standard is one way to map characters to numbers. It covers 128 different characters, including the most common Latin ones. 
You can see the ASCII table in the following link: 

https://www.ascii-code.com/

If a character does not appear in this table, it means that you cannot represent it in ASCII encoding. 
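
A minimal sketch of what this means in practice, assuming Python 3.7 or later for str.isascii(): a string that contains a character outside the table cannot be encoded as ASCII, and Python raises UnicodeEncodeError. The flag encode_failed is just an illustrative name.

```python
text = 'España'

# str.isascii() tells us whether every character is in the ASCII table
print('Spain'.isascii())  # True
print(text.isascii())     # False, because of the letter ñ

# Trying to encode a non-ASCII character as ASCII raises an error
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError as error:
    encode_failed = True
    print(error)
```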




Before going deeper into the ASCII standard, let's take a look at the *string* module in Python. This module contains constants that group the different kinds of characters included in the ASCII standard.

In [10]:
import string

print(string.punctuation)
print(string.ascii_letters)
print(string.ascii_lowercase)
print(string.ascii_uppercase)
print(string.digits)
print(string.hexdigits)
print(string.octdigits)
# print(string.whitespace)  # ' \t\n\r\x0b\x0c' (space, tab, newline, carriage return, vertical tab, form feed)
print(string.printable)  # digits + ascii_letters + punctuation + whitespace


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
0123456789abcdefABCDEF
01234567
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	



In Python, we can translate between characters and their numerical representation with the following two main functions:
- ord(): to get the integer value (decimal) of a character.
- chr(): to get the character from an integer value in decimal. 

Let's see an example: by consulting the ASCII table, we can see that the letter 'a' corresponds to the integer 97 in base 10.

In [40]:
a_unicode = 97  # decimal representation of a in ASCII
a_character = 'a'  # readable character

print(chr(a_unicode))
print(ord(a_character))

a
97


If we want to get the ASCII code of the character 'a' in other bases like binary, we can use the bin(), hex() and oct() functions presented earlier.

In [41]:
a_character = 'a'
decimal_ascii = ord(a_character)
binary_ascii = bin(decimal_ascii)
print(binary_ascii)

0b1100001
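
The same idea extends to the other bases. The sketch below also uses format() with the '08b' specification to pad the binary digits to a full byte; that padding is an illustrative choice, not something ASCII requires.

```python
a_character = 'a'
decimal_ascii = ord(a_character)

hex_ascii = hex(decimal_ascii)
octal_ascii = oct(decimal_ascii)
padded_binary = format(decimal_ascii, '08b')  # 8 binary digits, no prefix

print(hex_ascii)      # 0x61
print(octal_ascii)    # 0o141
print(padded_binary)  # 01100001
```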


There is another relevant function: ascii().
ascii() gives you an ASCII-only representation of an object such as a str, with non-ASCII characters escaped.

For example, in the word 'España', all characters but 'ñ' belong to the ASCII standard. 

In [42]:
ascii('España')

"'Espa\\xf1a'"

## 3. Unicode standard

As you may have guessed, the problem with the ASCII standard is that 128 entries are not enough to represent all the symbols and characters that currently exist, especially in languages other than English.
To solve that, we use the Unicode standard. This standard can be seen as a superset of ASCII: its code space has room for more than 1 million code points (1,114,112), and its first 128 entries are exactly the ASCII entries.

You can browse and search for Unicode characters at the following link:

https://unicode-table.com/es/

If you visit that website and search for a character, for example 'a', you will see its Unicode number, which is 'U+0061'. 
In Python, we have multiple ways of representing a character:
- The character itself, i.e. 'a'.
- With the sequence: \ooo -> 'ooo' is the octal value of the character.
- With the sequence: \xhh -> 'hh' is the hex value of the character.
- With the sequence: \N{name} -> 'name' is the name of the character in the Unicode table. 
- With the sequence: \uxxxx -> 'xxxx' is the 16-bit hex value of the character.
- With the sequence: \Uxxxxxxxx -> 'xxxxxxxx' is the 32-bit hex value of the character.  

We will focus on the last two forms: \uxxxx and \Uxxxxxxxx. 

When we look at the Unicode table, the code 'U+0061' means that hexadecimal 0061 is the code point of 'a'. It fits in 16 bits, so it can be written directly as \u0061.



In [53]:
character = 'a'
print(hex(ord(character)))  # 0x61 (hex() drops the leading zeros)

character_hex16 = "\u0061"
character_hex32 = "\U00000061"  # we padded four more zeros at the beginning.

print(character == character_hex16 == character_hex32)

0x61
True
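
The remaining escape forms from the list can be checked in the same way. This is a small sketch: the emoji code point U+1F600 is an illustrative choice of a character that needs more than four hex digits, so it can only be written with \Uxxxxxxxx.

```python
# Every escape form below denotes the same character 'a'
forms = [
    'a',                          # the character itself
    '\141',                       # octal value
    '\x61',                       # hex value
    '\N{LATIN SMALL LETTER A}',   # Unicode name
    '\u0061',                     # 16-bit hex value
    '\U00000061',                 # 32-bit hex value
]
all_equal = all(form == 'a' for form in forms)
print(all_equal)  # True

# A code point above U+FFFF does not fit in \uxxxx and needs \Uxxxxxxxx
emoji = '\U0001F600'
print(len(emoji))       # 1
print(hex(ord(emoji)))  # 0x1f600
```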


A note before going to the next step: you can get information such as a character's name by using the *unicodedata* module in Python. 

In [56]:
import unicodedata
print(unicodedata.name('a'))

LATIN SMALL LETTER A
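
As a short sketch of what else the module offers, unicodedata.lookup() goes in the opposite direction (from name to character) and unicodedata.category() returns the general category of a character. The characters chosen here are only examples.

```python
import unicodedata

enye_name = unicodedata.name('ñ')
print(enye_name)  # LATIN SMALL LETTER N WITH TILDE

# lookup() is the inverse of name()
enye = unicodedata.lookup('LATIN SMALL LETTER N WITH TILDE')
print(enye == 'ñ')  # True

# category() returns a two-letter general category code
print(unicodedata.category('ñ'))  # Ll (letter, lowercase)
print(unicodedata.category('€'))  # Sc (symbol, currency)
```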


A relevant point here is that Unicode by itself is just a mapping of characters to code points (numbers); the mapping alone does not say how these numbers are then encoded into bits.
For that reason, apart from the Unicode character set, we need to define an encoding:
- Encoding means converting from human-readable text to bytes. In Python, it means that we go from a str object to a bytes object.
- Decoding is the reverse, meaning that we convert from bytes back to str.

The Unicode standard defines these main encodings:
- UTF-8: the most widely used one. It encodes a character into 1 to 4 bytes.
- UTF-16: it encodes a character into 2 or 4 bytes.
- UTF-32: it always encodes a character into 4 bytes.
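
The byte counts above can be verified with a small sketch. It assumes the Python codec names 'utf-16-le' and 'utf-32-le', which are the little-endian variants that do not prepend a byte-order mark, so only the bytes of the character itself are counted. The sample characters are illustrative.

```python
characters = ['a', 'ñ', '€', '\U0001F600']  # the last one is an emoji above U+FFFF
encodings = ['utf-8', 'utf-16-le', 'utf-32-le']

# Number of bytes each character needs in each encoding
byte_counts = {
    character: [len(character.encode(encoding)) for encoding in encodings]
    for character in characters
}

for character, counts in byte_counts.items():
    print(hex(ord(character)), counts)
# 0x61 [1, 2, 4]
# 0xf1 [2, 2, 4]
# 0x20ac [3, 2, 4]
# 0x1f600 [4, 4, 4]
```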

Let's see an example of how to encode and decode a str sentence.

In [4]:
sentence = "Let's encode this string with the spanish letter ñ."

# encoding
b_utf8 = sentence.encode('utf-8')  # returns a bytes object encoded in utf-8 (non-ASCII bytes are printed as \x hex escapes)
b_utf16 = sentence.encode('utf-16')  # same as above but in utf-16
b_utf32 = sentence.encode('utf-32')  # same as above but in utf-32

print(b_utf8)
print(b_utf16)
print(b_utf32)

# decoding
sentence_decoded = b_utf8.decode('utf-8')  # we have to decode with the same encoding that was used to encode the str.
sentence_decoded_16 = b_utf16.decode('utf-16')
# sentence_decoded_bad = b_utf16.decode('utf-8')  # raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

print(sentence_decoded)
print(sentence_decoded_16)

b"Let's encode this string with the spanish letter \xc3\xb1."
b"\xff\xfeL\x00e\x00t\x00'\x00s\x00 \x00e\x00n\x00c\x00o\x00d\x00e\x00 \x00t\x00h\x00i\x00s\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00 \x00w\x00i\x00t\x00h\x00 \x00t\x00h\x00e\x00 \x00s\x00p\x00a\x00n\x00i\x00s\x00h\x00 \x00l\x00e\x00t\x00t\x00e\x00r\x00 \x00\xf1\x00.\x00"
b"\xff\xfe\x00\x00L\x00\x00\x00e\x00\x00\x00t\x00\x00\x00'\x00\x00\x00s\x00\x00\x00 \x00\x00\x00e\x00\x00\x00n\x00\x00\x00c\x00\x00\x00o\x00\x00\x00d\x00\x00\x00e\x00\x00\x00 \x00\x00\x00t\x00\x00\x00h\x00\x00\x00i\x00\x00\x00s\x00\x00\x00 \x00\x00\x00s\x00\x00\x00t\x00\x00\x00r\x00\x00\x00i\x00\x00\x00n\x00\x00\x00g\x00\x00\x00 \x00\x00\x00w\x00\x00\x00i\x00\x00\x00t\x00\x00\x00h\x00\x00\x00 \x00\x00\x00t\x00\x00\x00h\x00\x00\x00e\x00\x00\x00 \x00\x00\x00s\x00\x00\x00p\x00\x00\x00a\x00\x00\x00n\x00\x00\x00i\x00\x00\x00s\x00\x00\x00h\x00\x00\x00 \x00\x00\x00l\x00\x00\x00e\x00\x00\x00t\x00\x00\x00t\x00\x00\x00e\x00\x00\x00r\x00\x00\x00 \x00\x00\x00\xf1\x00\x00\

As you can see in the previous example, when we face a UnicodeDecodeError it means that the original string was not encoded in the same way we are trying to decode it. 
When we are using an API or performing text I/O operations, we should not assume that the original encoding was utf-8. The best way to avoid this kind of error is to consult the documentation, but if you need to detect the encoding of some input bytes you can check the Python library called *chardet* (https://chardet.readthedocs.io/en/latest/).   
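
When the encoding is unknown and only a few candidates are plausible, one simple approach is to try them in order. The helper below, decode_with_fallback, is a hypothetical function written for this tutorial (it is not part of Python or chardet), and the candidate list is an assumption you should adapt to your data. Note that a successful decode does not prove the guess was right, only that the bytes are valid in that encoding.

```python
def decode_with_fallback(data, candidates=('utf-8', 'cp1252')):
    """Hypothetical helper: return (text, encoding) for the first candidate that decodes."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD
    return data.decode('utf-8', errors='replace'), 'utf-8 (lossy)'


from_utf8 = decode_with_fallback('España'.encode('utf-8'))
from_cp1252 = decode_with_fallback('España'.encode('cp1252'))

print(from_utf8)    # ('España', 'utf-8')
print(from_cp1252)  # ('España', 'cp1252')
```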

UTF-8 is a variable-length encoding, which means that a single character can take from 1 to 4 bytes. 

You can find out how many bytes a string occupies with the well-known len() function, but keep in mind that:
- len() always returns 1 for a single-character str, because it counts characters (code points).
- len() can return between 1 and 4 for the UTF-8 bytes of a single character, because it counts bytes. 

So, to know how many bytes a character occupies, you first have to encode it to bytes. 

In [45]:
character = '€'
print(len(character))  # 1 character
print(len(character.encode('utf-8')))  # this character is represented by 3 bytes using utf-8.

1
3


One more point: it is worth mentioning that the functions ord() and chr() accept any Unicode character, not only ASCII ones. 

For example, as we already know, the character 'ñ' is not an ASCII character, but we can get its code point using ord() as well. 

In [43]:
ord('ñ')  # ñ is a non-ascii character

241
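
A brief sketch of the same round trip beyond ASCII. The characters are illustrative, and the upper limit shown (0x10FFFF) is the largest code point that chr() accepts.

```python
print(hex(ord('ñ')))   # 0xf1, the same value that ascii('España') shows as \xf1
print(chr(241))        # ñ

print(ord('€'))        # 8364
print(chr(0x20AC))     # €

# chr() covers the whole Unicode range and rejects anything above it
last_code_point = chr(0x10FFFF)
try:
    chr(0x110000)
    out_of_range = False
except ValueError:
    out_of_range = True
print(out_of_range)    # True
```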

## 4. Additional Python notes

When working with Python, there are a few things you must keep in mind:
- Python 3 source code (i.e. the .py file) is assumed to be in UTF-8 by default, so no # -*- coding: UTF-8 -*- header is needed at the top.
- All strings (str) in Python 3 are Unicode. 
- Variable names can also contain Unicode characters in Python 3. 
- Python’s re module defaults to the re.UNICODE flag rather than re.ASCII, which means it matches Unicode characters instead of only ASCII.
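
Two of these points can be seen in a short sketch: a variable name with a non-ASCII letter, and the difference between the default Unicode matching of \w and the re.ASCII flag. The sample text is only an illustration.

```python
import re

# Identifiers may contain non-ASCII letters in Python 3
año = 2024
print(año)  # 2024

text = 'España año_2024'

# Default for str patterns: \w matches Unicode word characters
unicode_words = re.findall(r'\w+', text)
print(unicode_words)  # ['España', 'año_2024']

# With re.ASCII, \w only matches [a-zA-Z0-9_], so ñ splits the words
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)  # ['Espa', 'a', 'a', 'o_2024']
```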

When opening a file, it is very important to keep in mind that the default encoding used by the open() function is platform-dependent, which means the same file may be read differently on Windows, macOS or Linux.
To find out the default encoding of your platform, you can do:

In [47]:
import locale
locale.getpreferredencoding()  # 'cp1252' on Windows (a variant of the latin-1 encoding), 'UTF-8' on Linux.

'cp1252'
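
To avoid depending on the platform default, a common practice is to pass the encoding argument to open() explicitly. The sketch below writes and reads a temporary file (the file name example.txt is arbitrary) and also shows what happens when the same bytes are read with the wrong encoding.

```python
import os
import tempfile

text = 'España costs 5 €'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.txt')

    # Write and read with the same explicit encoding
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)
    with open(path, 'r', encoding='utf-8') as handle:
        read_back = handle.read()

    # Reading the UTF-8 bytes as cp1252 raises no error here, but the text is garbled
    with open(path, 'r', encoding='cp1252') as handle:
        misread = handle.read()

print(read_back == text)  # True
print(misread == text)    # False
```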