# Everything is an int

Hello there!

As I made my way through the <i>Classic Computer Science Problems in Python</i> book, I came across an interesting fact. The author demonstrated how to implement a one-time pad, a simple form of cryptography that is unbreakable as long as the random key stays secret and is never reused. It works by XORing the input with a sequence of random bytes. What intrigued me is that, to implement this bitwise operation, the code converted the input string to an integer. How does this work?

Integers in Python have [unlimited precision][1]. This means that an integer can store any byte sequence, however long.
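
A quick way to see this in practice is to build a number far beyond 64 bits. This is only a minimal sketch, and the variable name `big` is illustrative:

```python
# Python integers grow as needed; there is no fixed 32- or 64-bit limit.
big = 2 ** 200

print(big)
print(big.bit_length())  # number of bits needed to represent the value

# Arithmetic on huge values stays exact.
print((big + 1) - big)
```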

An integer can be constructed from an [integer literal][2], which supports binary, octal, decimal and hexadecimal numbers. Likewise, while the default representation (`repr` function) of an integer is decimal, it can be formatted in other bases using the [`format` function][3].

In [3]:
numbers = [0b111000, 0o70, 56, 0x38]

print(" ".join([format(n) for n in numbers]))

print(" ".join([format(56, f) for f in ['b', 'o', 'd', 'x']]))

56 56 56 56
111000 70 56 38
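
The conversion also works in the other direction: `int` accepts a string together with a base. The sketch below assumes the same value, 56, and uses the illustrative names `round_trip` and `prefixed`:

```python
# Format 56 in each base, then parse the resulting string back with int().
round_trip = []
for code, base in [("b", 2), ("o", 8), ("d", 10), ("x", 16)]:
    text = format(56, code)
    round_trip.append(int(text, base))

print(round_trip)

# Base 0 tells int() to honor the literal prefix (0b, 0o, 0x).
prefixed = [int(text, 0) for text in ["0b111000", "0o70", "56", "0x38"]]
print(prefixed)
```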


## Representation

The binary representation of an integer can be retrieved using `bin`. This function returns a string with a `0b` prefix. The `b` format specifier of the `format` function (which is used in f-strings as well) gives the same digits without the prefix. Negative numbers are represented with a minus sign, not in two's complement.

TODO: how to get the two's complement representation of a negative integer?

In [4]:
n = 279
print(bin(n))
print(bin(-n))

print(f"{n:b}")
print(f"{-n:b}")

0b100010111
-0b100010111
100010111
-100010111
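
Two's complement only makes sense for a fixed width, so one possible approach is to choose a number of bits and mask the value to that width. The helper `twos_complement` below is a hypothetical name, not a built-in, and the 16-bit width is an arbitrary choice for the example:

```python
def twos_complement(value: int, bits: int) -> str:
    """Return the two's complement bit pattern of `value` in `bits` bits."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {bits} signed bits")
    # Masking maps a negative value onto its unsigned equivalent.
    mask = (1 << bits) - 1
    return format(value & mask, f"0{bits}b")

print(twos_complement(279, 16))
print(twos_complement(-279, 16))
```

The mask works because Python treats a negative integer as if it had an infinite run of leading one bits, so `value & mask` keeps just the lowest `bits` of that pattern.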


## Bytes

The `bytes` class represents immutable sequences of [single bytes][4], that is, integers `x` such that `0 <= x < 256`.

It can be created from an iterable of integers, from a buffer, or from a bytes literal. 

A bytes literal looks like a string literal with a `b` prefix, but it only allows ASCII characters. Any byte can be written with a hexadecimal escape such as `\xFF`. Bytes outside the printable ASCII range are displayed as escapes rather than characters, because a bytes object carries no encoding: ASCII is just a convenient way to show byte values (and UTF-8 happens to be backwards compatible with it).

The default representation of a bytes object is the bytes literal. The hexadecimal representation can be obtained with the `hex` method.


In [5]:
abcdef = bytes(range(97, 103))
print(abcdef)
print(abcdef.hex(' '))
print(b'\xF0 \x7ecool\x7e')

b'abcdef'
61 62 63 64 65 66
b'\xf0 ~cool~'
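
The properties described above (integer elements, the `0 <= x < 256` range, immutability) can be checked directly. This is a small sketch with made-up values; `bytearray` is the mutable counterpart from the standard library:

```python
data = bytes([72, 105])        # built from an iterable of integers
print(data)                    # printable ASCII values show up as characters
print(data[0], list(data))     # indexing and iterating yield integers

# Values outside range(256) are rejected.
try:
    bytes([256])
except ValueError as error:
    print(f"ValueError: {error}")

# bytes objects are immutable, so item assignment fails.
try:
    data[0] = 104
except TypeError as error:
    print(f"TypeError: {error}")

# A bytearray holds the same kind of data but can be modified in place.
mutable = bytearray(data)
mutable[0] = 104
print(bytes(mutable))
```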


## Converting bytes to integers

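The class method `int.from_bytes` converts a byte sequence to an integer. It takes the byte order, `'big'` or `'little'` (since Python 3.11 it defaults to `'big'`). If the `signed` argument is true, the bytes are interpreted as two's complement.

With the `big` byte order the first byte is the most significant, so the bytes are read from left to right, like a written number. With the `little` byte order the first byte is the least significant, so the bytes are read from right to left.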

In [6]:
d = b'\x01\x00'
d_big = int.from_bytes(d, 'big')
d_little = int.from_bytes(d, 'little')

print(f"BIG byteorder:\n\tByte sequence [{d.hex(' ')}] is converted to integer {d_big} [{bin(d_big)}]\n")
print(f"LITTLE byteorder:\n\tByte sequence [{d.hex(' ')}] is converted to integer {d_little} [{bin(d_little)}]\n")

one_byte_int = b'\x81'
signed = int.from_bytes(one_byte_int, 'big', signed=True)
unsigned = int.from_bytes(one_byte_int, 'big', signed=False)

print(f"Byte sequence [{one_byte_int.hex()}] is interpreted as {signed} [{bin(signed)}]")
print(f"or as {unsigned} as unsigned [{bin(unsigned)}]")


BIG byteorder:
	Byte sequence [01 00] is converted to integer 256 [0b100000000]

LITTLE byteorder:
	Byte sequence [01 00] is converted to integer 1 [0b1]

Byte sequence [81] is interpreted as -127 [-0b1111111]
or as 129 as unsigned [0b10000001]
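
To see where -127 comes from: in two's complement, a value whose top bit is set stands for the unsigned value minus `2**bits`, here 129 - 256. The helper `as_signed` is a hypothetical name used only for this sketch:

```python
def as_signed(unsigned: int, bits: int) -> int:
    """Reinterpret an unsigned `bits`-wide value as two's complement."""
    sign_bit = 1 << (bits - 1)
    # If the top bit is set, the pattern represents `unsigned - 2**bits`.
    if unsigned & sign_bit:
        return unsigned - (1 << bits)
    return unsigned

raw = b'\x81'
value = int.from_bytes(raw, 'big')
print(value, as_signed(value, 8 * len(raw)))

# The manual computation agrees with the signed=True argument.
print(as_signed(value, 8) == int.from_bytes(raw, 'big', signed=True))
```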


## Converting integers to bytes

The `bit_length` method returns the number of _bits_ needed to represent the given integer, ignoring the sign. It follows that the number of _bytes_ necessary to represent a positive integer `x` is `(x.bit_length() + 7) // 8`. The `to_bytes` method takes the `length`, `byteorder` and `signed` parameters. If the number is not representable in the given number of bytes (since Python 3.11 the default length is 1), it raises an `OverflowError`.

In [7]:
n = 0xdead

try:
  n.to_bytes(1, 'big')
except OverflowError:
  print(f"Error trying to convert number {n}")

length = n.bit_length()
n_bytes = n.to_bytes((length + 7) // 8, 'big')

print(f"Number {n} [{bin(n)}] converted to byte string {n_bytes} [{n_bytes.hex(' ')}]")

Error trying to convert number 57005
Number 57005 [0b1101111010101101] converted to byte string b'\xde\xad' [de ad]
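
Two edge cases are worth a look: zero, whose `bit_length()` is 0, and negative numbers, which need `signed=True`. The helper `to_min_bytes` is a hypothetical name, and reserving at least one byte for zero is a choice made for this sketch:

```python
def to_min_bytes(value: int, byteorder: str = "big") -> bytes:
    """Encode a non-negative integer in the fewest bytes that can hold it."""
    if value < 0:
        raise ValueError("negative values need signed=True and an explicit width")
    # bit_length() is 0 for the value 0, so reserve at least one byte.
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, byteorder)

print(to_min_bytes(0))
print(to_min_bytes(255))
print(to_min_bytes(256))

# Negative numbers are stored in two's complement when signed=True.
print((-127).to_bytes(1, "big", signed=True))
```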


## Strings

Strings are sequences of Unicode code points. A string can be encoded into a byte sequence, and a byte sequence can be decoded back into a string. The default encoding for both operations is UTF-8. Escape sequences can be used to represent any Unicode character.

In [8]:
s = "Python \u0394"
encoded = s.encode()
print(f"String \"{s}\" is encoded in UTF-8 as {encoded} [{encoded.hex(' ')}]")

b = bytes([65, 66, 67, 68])
print(f"Byte sequence {b} [{b.hex(' ')}] is decoded into the UTF-8 string {b.decode()}")

greeting = b'Tsch\xc3\xbcss'
print(f"Byte sequence {greeting} [{greeting.hex(' ')}] is decoded into the UTF-8 string {greeting.decode()}")

String "Python Δ" is encoded in UTF-8 as b'Python \xce\x94' [50 79 74 68 6f 6e 20 ce 94]
Byte sequence b'ABCD' [41 42 43 44] is decoded into the UTF-8 string ABCD
Byte sequence b'Tsch\xc3\xbcss' [54 73 63 68 c3 bc 73 73] is decoded into the UTF-8 string Tschüss
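
The encoding matters: the same text produces different bytes under a different encoding, and decoding with the wrong one can fail. A small sketch, using Latin-1 purely as a contrasting example:

```python
s = "Tsch\u00fcss"
encoded = s.encode("utf-8")

# Code points and bytes are counted differently: "ü" takes two bytes in UTF-8.
print(len(s), len(encoded))

# The same text yields different bytes under a different encoding.
latin = s.encode("latin-1")
print(latin)

# Decoding those bytes as UTF-8 fails, because 0xfc is not valid UTF-8 here.
try:
    latin.decode("utf-8")
except UnicodeDecodeError as error:
    print(f"UnicodeDecodeError: {error}")
```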


## Putting it all together

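Any string can be encoded into a byte sequence, which can be converted to an integer, QED. The string can be perfectly reconstructed, provided we know the byte order and the encoding. One caveat: with the big-endian order used here, leading NUL characters (`\x00`) become leading zero bytes, which the integer does not preserve.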

This is not only an academic fact: it lets us apply bitwise operations, which Python defines on integers but not on `bytes` objects. Going back to the top, such operations are the basis of cryptography, compression and other cool and fundamental techniques.

In [9]:
ENCODING = 'utf-8'
BYTEORDER = 'big'

# Encoding to UTF-8 succeeds for any string except those containing lone surrogates
def string_to_int(text: str) -> int:
  b: bytes = text.encode(ENCODING)
  output = int.from_bytes(b, BYTEORDER)
  return output

# Raises UnicodeDecodeError if the integer's bytes are not valid UTF-8
def int_to_string(number: int) -> str:
  length = number.bit_length()
  b: bytes = number.to_bytes((length + 7) // 8, BYTEORDER)
  output = b.decode(ENCODING)
  return output

message = "Ars longa, vita brevis"
message_as_an_int = string_to_int(message)
reconstructed_message = int_to_string(message_as_an_int)

print(f"The string \"{message}\" is represented by the number {message_as_an_int}")
print(f"The integer {message_as_an_int} can be decoded to \"{reconstructed_message}\"")

The string "Ars longa, vita brevis" is represented by the number 24486655688850483561325096422507014050203373183330675
The integer 24486655688850483561325096422507014050203373183330675 can be decoded to "Ars longa, vita brevis"
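
To close the loop with the one-time pad from the introduction, here is a minimal sketch of XOR encryption on integers. It is not the book's code: the names `encrypt` and `decrypt` and the choice to return the byte length are assumptions made for this example, and `secrets.token_bytes` from the standard library supplies the random key.

```python
from secrets import token_bytes

def encrypt(message: str) -> tuple[int, int, int]:
    """Return (ciphertext, key, length), with ciphertext and key as integers."""
    data = message.encode("utf-8")
    # The key is as long as the message and is drawn from a secure source.
    key = int.from_bytes(token_bytes(len(data)), "big")
    ciphertext = int.from_bytes(data, "big") ^ key
    return ciphertext, key, len(data)

def decrypt(ciphertext: int, key: int, length: int) -> str:
    """Undo the XOR and rebuild the original string."""
    plain = ciphertext ^ key
    # Passing the original length keeps any leading zero bytes.
    return plain.to_bytes(length, "big").decode("utf-8")

ciphertext, key, length = encrypt("Ars longa, vita brevis")
print(ciphertext)
print(decrypt(ciphertext, key, length))
```

XOR is its own inverse, so applying the same key twice returns the original integer. Keeping the byte length alongside the ciphertext is one way to make the round trip exact even when the message starts with zero bytes.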




[1]: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
[2]: https://docs.python.org/3/reference/lexical_analysis.html#integer-literals
[3]: https://docs.python.org/3/library/functions.html#format
[4]: https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview 