# Bits, bytes and encoded messages

Computers only understand information in binary form, but we (as humans) usually work in other bases, most commonly decimal (base 10). Here we will learn how to convert between bases and how characters are represented in the computer.

# Table of contents:

* [Bits](#bits)
* [Bytes](#bytes)
* [Bases: Decimal, binary, hexadecimal and octal](#bases)
* [Encoding information](#encoding)
* [Bonus: from hexadecimal to bytes](#hex-to-bytes)

    
Author: [Sebastià Agramunt Puig](https://github.com/sebastiaagramunt) for [OpenMined](https://www.openmined.org/) Privacy ML Series course.

## Bits <a class="anchor" id="bits"></a>

A **bit** is a binary digit, 0 or 1. Several bits written together form a binary number, for example:

$$1101$$

So what is this number in base 10? Each bit is weighted by a power of 2 according to its position, so the binary number 1101 can be written as $1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 13$.

In [None]:
from random import randrange

bits = '0b' + ''.join([str(randrange(2)) for _ in range(10)])
bits

In [None]:
int(bits, 2)

In [None]:
# Same conversion by hand: weight each bit by its power of 2 (bits[2:] skips the '0b' prefix)
sum(int(k) * 2**(len(bits[2:]) - i - 1) for i, k in enumerate(bits[2:]))

What's the largest number we can represent with $n$ bits? It is the number with all $n$ bits set to 1, which equals $2^n - 1$.

In [None]:
n = 8

print(f"max integer (base 10) value for {n} bits is {2**n -1}")
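As a quick check of the value printed above, the sketch below builds the binary number made of $n$ ones and compares it with $2^n - 1$. The helper name `max_unsigned` is only an illustration and is not part of the course code.

```python
def max_unsigned(n: int) -> int:
    """Largest unsigned integer that fits in n bits (all bits set to 1)."""
    return int("1" * n, 2)

for n in (1, 4, 8, 16):
    # n bits give 2**n distinct values, from 0 up to 2**n - 1
    assert max_unsigned(n) == 2**n - 1
    print(f"{n:>2} bits: max = {max_unsigned(n)}, distinct values = {2**n}")
```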

## Bytes <a class="anchor" id="bytes"></a>

A **byte** is a collection of 8 bits, so it can be written as eight binary digits. The maximum value (in base 10) of a byte is $2^8-1$, that is, 255.

In [None]:
bytes_int = [randrange(256) for _ in range(10)]
print(f"bytes_int\n\t{bytes_int}")

bytes_bin = [bin(x) for x in bytes_int]
print(f"bytes_bin\n\t{bytes_bin}")

In [None]:
int("0b00111010", 2)
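Note that `bin()` drops leading zeros, so the strings in `bytes_bin` above can be shorter than 8 digits. A minimal sketch of how to always show the full byte, assuming a hypothetical helper called `byte_to_bits`:

```python
def byte_to_bits(value: int) -> str:
    """Return the 8-character binary string for an integer in 0..255."""
    if not 0 <= value <= 255:
        raise ValueError("a byte must be between 0 and 255")
    return format(value, "08b")  # zero-padded to width 8, no '0b' prefix

print(bin(58))           # 0b111010 (leading zeros dropped)
print(byte_to_bits(58))  # 00111010
```

The padded form makes it easier to line bytes up when several of them are printed one after another.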

## Bases: Decimal, binary, hexadecimal and octal <a class="anchor" id="bases"></a>

So far we have represented numbers in base 2 (binary form) or base 10 (decimal form). We can also express numbers in octal (base 8) or hexadecimal (base 16); see [this](https://www.rapidtables.com/convert/number/base-converter.html) online converter.

**Hexadecimal characters**: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f]

**Octal characters**: [0, 1, 2, 3, 4, 5, 6, 7]

**Decimal characters**: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


For instance:

$654_{10} = 1010001110_2 = 1216_8 = 28\mathrm{E}_{16}$

8 binary digits make 1 byte, whose maximum value in base 10 is 255. In hexadecimal a byte takes exactly two digits, from 00 to ff.


In [None]:
n_bytes = 2
x = randrange(2**(8*n_bytes))

print(f"x in binary (base 2):\n\t{bin(x)}")
print(f"x in octal (base 8):\n\t{oct(x)}")
print(f"x in decimal (base 10):\n\t{x}")
print(f"x in hexadecimal (base 16):\n\t{hex(x)}")
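The built-ins `bin`, `oct` and `hex` hide how the conversion works. A minimal sketch of the usual method, repeated division by the base, is shown below; the names `DIGITS` and `to_base` are illustrative and not part of the course code.

```python
DIGITS = "0123456789abcdef"

def to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to a digit string in the given base (2..16)."""
    if n < 0 or not 2 <= base <= 16:
        raise ValueError("expected n >= 0 and 2 <= base <= 16")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)  # remainder is the next least-significant digit
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))

for base in (2, 8, 10, 16):
    print(f"654 in base {base}: {to_base(654, base)}")
```

The results can be compared with the worked example for 654 given earlier, and `int(text, base)` converts each string back to the original integer.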

## Encoding information <a class="anchor" id="encoding"></a>

How can we represent characters in the computer? We need a dictionary that maps each integer to a character and back. This is known as an encoding.

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-length encoding: each character is stored in 1 to 4 bytes. The ASCII (American Standard Code for Information Interchange) characters take a single byte in UTF-8, both because they are the most frequently used and for historical reasons: ASCII came first, and UTF-8 was designed to be backward compatible with it.
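To see the variable length in practice, the sketch below encodes four sample characters, chosen here only for illustration, that need 1, 2, 3 and 4 bytes respectively.

```python
samples = [
    "a",           # U+0061, an ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: code point U+{ord(ch):04X}, {len(encoded)} byte(s), hex = {encoded.hex()}")
```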

Let's print the ASCII characters!

In [None]:
for i in range(0, 128):
    b = i.to_bytes(1, byteorder='big')
    print(f"int = {i}, hex = {hex(i)}, bytes = {b}, decoded = {b.decode(encoding='UTF-8')}")

In [None]:
x = int('e0a887', 16)
x_b = x.to_bytes(3, byteorder='big')
dec_char = x_b.decode(encoding="UTF-8")

print(f"int = {x}")
print(f"bytes = {x_b}")
print(f"decoded = {dec_char}")
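The three bytes above decode to a single character because UTF-8 marks the first byte of a three-byte sequence with the bits `1110` and each following byte with `10`; the remaining bits carry the code point. The sketch below does this decoding by hand for the same bytes and compares the result with `decode()`.

```python
raw = bytes.fromhex("e0a887")
b0, b1, b2 = raw  # iterating over bytes yields integers

# A three-byte UTF-8 sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx
assert b0 >> 4 == 0b1110 and b1 >> 6 == 0b10 and b2 >> 6 == 0b10

# Strip the marker bits and concatenate the payload bits
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(f"code point = U+{code_point:04X}")
print(f"same character as decode(): {chr(code_point) == raw.decode('utf-8')}")
```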

We can turn a message made of ASCII letters into bytes (according to the UTF-8 encoding) and then display it in binary or hexadecimal form:

In [None]:
from crypto import bytes_to_bin, bytes_to_hex

message = b"simple message"

bin_repr = bytes_to_bin(message, pre="")
hex_repr = bytes_to_hex(message, pre="")

print(f"message:\n{str(message)}\n")
print(f"message in binary:\n{bin_repr}\n")
print(f"message in hexadecimal:\n{hex_repr}\n")
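The `crypto` module imported above is not shown in this section, so the exact behaviour of its helpers is an assumption here. A minimal sketch of what such helpers might look like, using hypothetical stand-ins named `sketch_bytes_to_bin` and `sketch_bytes_to_hex` with assumed default prefixes:

```python
def sketch_bytes_to_bin(data: bytes, pre: str = "0b") -> str:
    """Hypothetical stand-in: 8 binary digits per byte, preceded by `pre`."""
    return pre + "".join(format(byte, "08b") for byte in data)

def sketch_bytes_to_hex(data: bytes, pre: str = "0x") -> str:
    """Hypothetical stand-in: 2 hexadecimal digits per byte, preceded by `pre`."""
    return pre + data.hex()

message = b"simple message"
print(sketch_bytes_to_bin(message, pre=""))
print(sketch_bytes_to_hex(message, pre=""))
```

Padding every byte to a fixed width (8 binary or 2 hexadecimal digits) is what keeps the output unambiguous when it is split back into bytes.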

## Bonus: from hexadecimal to bytes <a class="anchor" id="hex-to-bytes"></a>

Sometimes we want to see a string of bytes in hexadecimal form instead of Python's default representation, which displays bytes in the printable ASCII range as characters (recall the ASCII table printed in the previous section).

In [None]:
from random import randrange, seed

seed(8)
bytes_int = [randrange(256) for _ in range(10)]
bytes_stream = bytes(bytes_int)
int_form = int.from_bytes(bytes_stream, "big")
print(f"(int bytes):\n\t{bytes_int}\n")
print(f"(int):\n\t{int_form}\n")
print(f"(bytes):\n\t{bytes_stream}\n")

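In this example the output contains characters such as "t", "+" or "~". These are obviously not hexadecimal digits: Python displays every byte in the printable ASCII range as its character and the remaining bytes as `\x..` escapes. Let's work a little bit on that.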

In [None]:
bin_repr = bytes_to_bin(bytes_stream, pre="")
hex_repr = bytes_to_hex(bytes_stream, pre="")

print(f"(bin):\n\t{bin_repr}\n")
print(f"(hex):\n\t{hex_repr}\n")

In [None]:
# convert hex string into bytes
bytes_stream2 = bytes.fromhex(hex_repr)

assert bytes_stream==bytes_stream2, "something went wrong"
print(f"(bytes):\n\t{bytes_stream2}\n")
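Python's standard library also covers these conversions through `bytes.hex()`, `bytes.fromhex()`, `int.from_bytes()` and `int.to_bytes()`. The sketch below uses a small byte string chosen for illustration (it is not the output of the seeded cell above) and checks that each conversion can be reversed.

```python
data = bytes([0, 116, 43, 126, 255])

hex_form = data.hex()                 # two hex digits per byte
as_int = int.from_bytes(data, "big")  # big-endian integer value

print(f"repr: {data}")
print(f"hex : {hex_form}")
print(f"int : {as_int}")

# Both conversions are reversible. The byte length must be supplied when going
# back from an integer, because leading zero bytes do not change its value.
assert bytes.fromhex(hex_form) == data
assert as_int.to_bytes(len(data), "big") == data
```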