# How are integers stored in Python?

I stumbled upon a very curious Python function, `id()`.

The `id()` function in Python returns a unique identifier for an object. On CPython (the most common implementation of Python), this identifier is typically the memory address where the object is stored. 

`id(x)` is unique to the object `x` for its whole lifetime: the value stays the same for as long as that object exists. Once the object is garbage-collected, the same identifier may be reused by another object, and rebinding the name `x` to a different object gives a different `id`.
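
As a small illustration of that lifetime rule, the sketch below binds two names to one list and builds a separate list with equal contents. The variable names are made up for the example.

```python
# Two names bound to one object share an id; a distinct live object has its own.
first = [1, 2, 3]
alias = first             # a second name for the same list
clone = list(first)       # a new list with equal contents

same_id = id(first) == id(alias)        # same object, same identifier
different_id = id(first) != id(clone)   # both alive at once, so the ids differ

print(same_id, first is alias)
print(different_id, first is not clone)
```

Comparing ids of two live objects is what the `is` operator does, which is why both lines print matching answers.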

However, Python, to my (modest) knowledge, is a terrible fit for direct memory manipulations due to its notorious overheads and abstraction layers.

That being said, I wanted to perform a simple check to verify whether `id()` truly returns a memory address. Specifically, I wanted to see if I could "reverse engineer" the value stored in a variable by inspecting the data type and accessing the memory directly.

The answer is affirmative. Reaching it without referring to CPython's source code for PyLongObject was a fun problem-solving exercise.

Furthermore, navigating the source code at https://github.com/python/cpython/blob/main/Objects/longobject.c is no joke.
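
Before digging into the fields, here is a quick sanity check that `id()` behaves like an address on CPython. It is only a sketch: it relies on the CPython-specific fact that `id()` is the object's address, and the names `obj` and `recovered` are made up for the example.

```python
import ctypes

obj = ["any", "object"]
address = id(obj)  # on CPython this is the object's memory address

# Reinterpret the raw address as a reference to a Python object.
recovered = ctypes.cast(address, ctypes.py_object).value

print(recovered is obj)
```

If the cast hands back the very same object, the number returned by `id()` really did locate it in memory.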

# The id() function

Python abstracts away direct memory management, so it is not possible to code as in C.
My first guess was that an integer in Python is saved in memory as
<pre>
+------------------------------+
| PyLongObject (int in Python) |
+------------------------------+
| ob_refcnt (Reference Count)  |
+------------------------------+
| ob_type (Type Information)   |
+------------------------------+
| ob_digit[0] (Digits)         |
+------------------------------+
</pre>

and id() should point at ob_refcnt, the start of the int structure/object. 
In coding terms
<pre>
typedef struct {
    PyObject_HEAD  // Contains ob_refcnt and ob_type
    ssize_t ob_digit[1];  // Array storing the digits of the integer
} PyLongObject;
</pre>

In [14]:
import sys

a = 12345

# Access the value
print(f"Value of 'a': {a}")
# Access the reference count
print(f"Reference count of 'a': {sys.getrefcount(a)}")

Value of 'a': 12345
Reference count of 'a': 3


As a reminder, the assumed internal structure of an int in CPython is
<pre>
typedef struct {
    PyObject_HEAD
    ssize_t ob_digit[1];
} PyLongObject;
</pre>
`value_at_address`, obtained using ctypes in the next cell, reads the memory at the beginning of the structure of the integer `a`, which should be `ob_refcnt`. It returns 2.

Now, `sys.getrefcount(a)` returns 3, one more than the value read from memory. This has a straightforward explanation: passing `a` as an argument creates a temporary reference to the int, which is included in the returned count and destroyed as soon as the call returns. That is also why the call does not influence `value_at_address`, even though it runs before the memory is read.

In [15]:
import ctypes
import sys

a = 12345

# Get the id of 'a' (which is the memory address in CPython)
address_of_a = id(a)

# Get the reference count
ref_count = sys.getrefcount(a)

# Using ctypes to access the memory address
# We'll use ctypes.c_long since we're working with a long integer
value_at_address = ctypes.c_long.from_address(address_of_a)

# Display the values
print(f"Memory address of 'a': {address_of_a}")
print(f"Value at that address: {value_at_address.value}")
print(f"Reference count of 'a': {ref_count}")

Memory address of 'a': 4936991280
Value at that address: 2
Reference count of 'a': 3
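
The gap between the two counts can be reproduced with a fresh object instead of an int. This is a sketch under assumptions: a default (non-free-threaded) CPython build where the reference count is the first field of every object. The mask is a precaution of mine, in case a given CPython version keeps extra bits in the upper half of that field.

```python
import ctypes
import sys

fresh = []  # a brand-new object, referenced only by the name `fresh`

# Count reported by the interpreter; it generally includes the temporary
# reference created by passing `fresh` as an argument.
reported = sys.getrefcount(fresh)

# Raw counter read from the start of the object.
# Assumption: the low 32 bits hold the count itself on every supported layout.
raw = ctypes.c_ssize_t.from_address(id(fresh)).value & 0xFFFFFFFF

print(f"sys.getrefcount: {reported}")
print(f"raw ob_refcnt:   {raw}")
print(f"difference:      {reported - raw}")
```

The exact difference can vary with the interpreter version and with whether the code runs at module level or inside a function, so the numbers are best read as a trend rather than a guarantee.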


If this is true, I can now move beyond ob_refcnt. Let's check whether the adjacent memory addresses contain data that actually corresponds to the integer structure outlined before.
Let's ask for ChatGPT's help.

In [18]:
import ctypes
import sys

a = 12345

# Get the id of 'a' (which is the memory address in CPython)
address_of_a = id(a)

# Access the ob_refcnt (reference count)
ref_count = ctypes.c_long.from_address(address_of_a).value

# Access the ob_type (type object)
# We expect this to be a memory address pointing to the type object
type_ptr = ctypes.c_void_p.from_address(address_of_a + ctypes.sizeof(ctypes.c_long)).value

# Access the first digit of ob_digit (value of the integer)
digit = ctypes.c_long.from_address(address_of_a + 2 * ctypes.sizeof(ctypes.c_long)).value

# Display the values
print(f"Analyzing integer a = {a}")
print(f"Memory address of 'a': {address_of_a}")
print(f"Reference count of 'a': {ref_count}")
print(f"Type pointer of 'a': {type_ptr}")
print(f"First digit of 'a': {digit}")

Analyzing integer a = 12345
Memory address of 'a': 4936992496
Reference count of 'a': 2
Type pointer of 'a': 4368111432
First digit of 'a': 1


However, the digits returned are nonsensical.

In [19]:
import ctypes

a = 12345

# Get the id of 'a' (which is the memory address in CPython)
address_of_a = id(a)

# Extract the digits
# The first digit (which we've already extracted)
first_digit = ctypes.c_long.from_address(address_of_a + 2 * ctypes.sizeof(ctypes.c_long)).value
# The second digit
second_digit = ctypes.c_long.from_address(address_of_a + 3 * ctypes.sizeof(ctypes.c_long)).value
# The third digit
third_digit = ctypes.c_long.from_address(address_of_a + 4 * ctypes.sizeof(ctypes.c_long)).value
# The fourth digit
fourth_digit = ctypes.c_long.from_address(address_of_a + 5 * ctypes.sizeof(ctypes.c_long)).value
# The fifth digit
fifth_digit = ctypes.c_long.from_address(address_of_a + 6 * ctypes.sizeof(ctypes.c_long)).value

print(f"First digit: {first_digit}")
print(f"Second digit: {second_digit}")
print(f"Third digit: {third_digit}")
print(f"Fourth digit: {fourth_digit}")
print(f"Fifth digit: {fifth_digit}")

First digit: 1
Second digit: 146028900409
Third digit: 4838154448
Fourth digit: 4937146560
Fifth digit: 1


# Small integers are saved at the same address

Those digits are really not what was expected, and the first one matching the previous cell is probably a coincidence.
Python seems to be doing some manipulation here. Now let's define a = 40, b = 40, and see if I can retrieve the same value for both.

Spoiler: they are now at the same address despite being two different variables! Maybe this is an instance of Python caching small integers and reusing the cached objects. 

In [26]:
import ctypes

# Define a helper function to inspect memory and interpret values
def inspect_memory(address):
    # Read memory content as bytes
    bytes_value = (ctypes.c_ubyte * ctypes.sizeof(ctypes.c_long)).from_address(address)
    # Convert the bytes to an integer value
    interpreted_value = sum(byte << (i * 8) for i, byte in enumerate(bytes_value))
    return interpreted_value

# Create two variables with the same value
a = 40
b = 40

# Inspect the memory of these integers
a_addr = id(a)
b_addr = id(b)

# Get the interpreted values from memory
a_value = inspect_memory(a_addr)
b_value = inspect_memory(b_addr)

print(f"Memory address of 'a' (40): {a_addr}")
print(f"Interpreted value of 'a' from memory: {a_value}\n")

print(f"Memory address of 'b' (40): {b_addr}")
print(f"Interpreted value of 'b' from memory: {b_value}\n")

# Check if the addresses and values are the same
print(f"'a' and 'b' share the same memory address: {a_addr == b_addr}")
print(f"'a' and 'b' have the same interpreted value: {a_value == b_value}")

Memory address of 'a' (40): 4368167104
Interpreted value of 'a' from memory: 317

Memory address of 'b' (40): 4368167104
Interpreted value of 'b' from memory: 317

'a' and 'b' share the same memory address: True
'a' and 'b' have the same interpreted value: True
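
The caching can be probed without ctypes. The sketch below builds the integers at run time with `int("...")`, so that equal literals merged by the compiler cannot explain a shared address. The CPython documentation describes a cache for the integers from -5 to 256; what happens outside that range is an implementation detail.

```python
# Build the integers at run time so the compiler cannot merge equal literals.
a = int("40")
b = int("40")
print(a is b, id(a) == id(b))   # the same cached object on CPython

# 256 is the upper end of the documented small-integer cache.
c = int("256")
d = int("256")
print(c is d)

# Larger values are normally separate objects, even when they are equal.
e = int("12345")
f = int("12345")
print(e == f, e is f)
```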


# Retrieving the correct value of a = 12345

The following code correctly identifies the structure of PyLong and takes into account that CPython represents digits in base $2^{30}$ on a 64-bit architecture.

In [28]:
import ctypes

def retrieve_full_int_value_from_address(address):
    # Assume we're on a platform where each digit is stored in 30 bits
    # This is common for 64-bit systems, but this could vary

    class PyLongObject(ctypes.Structure):
        _fields_ = [
            ("ob_refcnt", ctypes.c_long),  # Reference count
            ("ob_type", ctypes.c_void_p),  # Type pointer
            ("ob_size", ctypes.c_ssize_t), # Number of digits (could be negative for negative numbers)
            ("ob_digit", ctypes.c_uint32 * 1),  # Placeholder for the first digit
        ]
    
    # Create an instance of PyLongObject from the given memory address
    py_long = PyLongObject.from_address(address)
    
    # Retrieve the number of digits and their values
    num_digits = abs(py_long.ob_size)
    
    # Adjust the structure to include all digits
    class PyLongObjectWithDigits(ctypes.Structure):
        _fields_ = [
            ("ob_refcnt", ctypes.c_long),
            ("ob_type", ctypes.c_void_p),
            ("ob_size", ctypes.c_ssize_t),
            ("ob_digit", ctypes.c_uint32 * num_digits),
        ]
    
    # Recreate the object to include all digits
    py_long_full = PyLongObjectWithDigits.from_address(address)
    
    # Reconstruct the integer from its digits
    value = 0
    base = 2**30  # Base used by Python for each digit on a 64-bit system
    
    for i in range(num_digits):
        value += py_long_full.ob_digit[i] * (base ** i)
    
    # Adjust for negative numbers
    if py_long_full.ob_size < 0:
        value = -value
    
    return value

# Example with a larger integer
a = 12345
address_of_a = id(a)

# Retrieve the full integer value from memory
retrieved_value = retrieve_full_int_value_from_address(address_of_a)

print(f"Original value: {a}")
print(f"Retrieved value from memory: {retrieved_value}")

Original value: 12345
Retrieved value from memory: 12345


The PyLongObject can be seen as a shelf with multiple compartments: 
<pre>
+-------------------+----------------+----------------+---------------------+
| Reference Count   | Type Information| Size and Sign  | Digits (Chunks)     |
+-------------------+----------------+----------------+---------------------+
|      2            |     (int)       |      1         | [chunk1, chunk2...] |
+-------------------+----------------+----------------+---------------------+
</pre>
Let's discuss why Size and Sign holds +1. In base $2^{30}$, 12345 is a very small number, represented by only 1 digit! And I guess (I don't know for sure) that +1 means "a positive number with one digit".
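
To see why 12345 fits in a single chunk, the base-$2^{30}$ decomposition can be done in pure Python, with no memory access at all. This is a sketch: `to_digits` and `from_digits` are hypothetical helpers, not CPython functions, and the 30-bit width is an assumption that matches typical 64-bit builds.

```python
BASE_BITS = 30            # assumed digit width on this build
BASE = 1 << BASE_BITS

def to_digits(n):
    """Split a non-negative int into base-2**30 digits, least significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while n:
        n, digit = divmod(n, BASE)
        digits.append(digit)
    return digits

def from_digits(digits):
    """Rebuild the int from its least-significant-first digits."""
    return sum(digit << (BASE_BITS * i) for i, digit in enumerate(digits))

print(to_digits(12345))        # [12345]: one digit is enough
print(to_digits(2**30))        # [0, 1]: the smallest two-digit value
print(from_digits(to_digits(2**100 + 7)) == 2**100 + 7)
```

The field-level layout of the real object differs between CPython versions (3.12 changed how size and sign are packed), which is why this example stays away from raw memory.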

Before, the line
<pre>
first_digit = ctypes.c_long.from_address(
    address_of_a + 2 * ctypes.sizeof(ctypes.c_long)
).value
</pre>
took address_of_a, which points to the beginning of the reference count, and skipped ahead by 2 times the size of a c_long, landing on "Size and Sign" rather than on the first digit. We did correctly find +1, as showcased in the code snippet below, so our mistake lay somewhere else.

In [31]:
first_digit = ctypes.c_long.from_address(
            address_of_a + 2 * ctypes.sizeof(ctypes.c_long)
    ).value

print(first_digit)

1


Then the problem must have been in interpreting and accessing the digits. Now we identify how many chunks there are in the compartment "Digits":
<pre>
    num_digits = abs(py_long.ob_size)
    
    # Adjust the structure to include all digits
    class PyLongObjectWithDigits(ctypes.Structure):
        _fields_ = [
            ("ob_refcnt", ctypes.c_long),
            ("ob_type", ctypes.c_void_p),
            ("ob_size", ctypes.c_ssize_t),
            ("ob_digit", ctypes.c_uint32 * num_digits),
        ]
</pre>
and we interpret the chunks as unsigned 32-bit integers with c_uint32, which is the way Python should store them. Before, we were interpreting them as longs with c_long!
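
The "nonsensical" second value printed by the earlier cell, 146028900409, turns out to contain the right answer. Reading eight bytes as a c_long where a four-byte digit lives picks up the digit plus four unrelated bytes. The arithmetic below assumes a little-endian machine, where the digit occupies the low half of that eight-byte read.

```python
misread = 146028900409          # "second digit" printed by the earlier cell

low_32 = misread & 0xFFFFFFFF   # the four bytes a c_uint32 would have read
high_32 = misread >> 32         # the four bytes that follow the digit

print(low_32)    # 12345
print(high_32)   # 34
```

So the offset of that read was fine; only the width of the type was wrong.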

Finally, the retrieved digits are read in the correct base, which is $2^{30}$. Interestingly, Python does not use all 32 bits available to a c_uint32; maybe (I don't know) this makes some operations more efficient and avoids the risk of overflows.
<pre>
    base = 2**30  # Base used by Python for each digit on a 64-bit system
    
    for i in range(num_digits):
        value += py_long_full.ob_digit[i] * (base ** i)
</pre>
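
The digit width does not have to be guessed: the standard library reports it through `sys.int_info`. The sketch below also checks that an integer grows by exactly one digit's worth of bytes when it crosses the one-digit limit. The variable names are made up for the example, and the values printed depend on the build.

```python
import sys

bits = sys.int_info.bits_per_digit     # 30 on typical 64-bit builds
digit_bytes = sys.int_info.sizeof_digit

one_digit = (1 << bits) - 1            # largest value that fits in one digit
two_digits = 1 << bits                 # smallest value that needs two

growth = sys.getsizeof(two_digits) - sys.getsizeof(one_digit)

print(f"bits per digit:  {bits}")
print(f"bytes per digit: {digit_bytes}")
print(f"extra bytes for a second digit: {growth}")
```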

# Minimal modification to code from before

What I wrote implies that the problem in the code I started from was casting the digit to the wrong type: I was casting it to c_long instead of c_uint32. So I should be able to read the number correctly if I keep the "Size and Sign" = +1 from before and read one additional c_uint32 after it. This proves to be the case.

In [32]:
import ctypes
import sys

a = 12345

# Get the id of 'a' (which is the memory address in CPython)
address_of_a = id(a)

# Access the ob_refcnt (reference count)
ref_count = ctypes.c_long.from_address(address_of_a).value

# Access the ob_type (type object)
# We expect this to be a memory address pointing to the type object
type_ptr = ctypes.c_void_p.from_address(address_of_a + ctypes.sizeof(ctypes.c_long)).value

# Access the ob_size (which includes the number of digits and sign)
size = ctypes.c_long.from_address(address_of_a + 2 * ctypes.sizeof(ctypes.c_long)).value

# Access the first digit of ob_digit (value of the integer)
# This should be done using c_uint32 because digits are stored in 30-bit chunks
first_digit = ctypes.c_uint32.from_address(address_of_a + 3 * ctypes.sizeof(ctypes.c_long)).value

# Display the values
print(f"Analyzing integer a = {a}")
print(f"Memory address of 'a': {address_of_a}")
print(f"Reference count of 'a': {ref_count}")
print(f"Type pointer of 'a': {type_ptr}")
print(f"Size (number of digits and sign) of 'a': {size}")
print(f"First digit of 'a': {first_digit}")

Analyzing integer a = 12345
Memory address of 'a': 4937898736
Reference count of 'a': 2
Type pointer of 'a': 4368111432
Size (number of digits and sign) of 'a': 1
First digit of 'a': 12345


# Reading huge numbers

Of course, the best way to conclude this exploration is by correctly retrieving a huge number.

In [37]:
import ctypes

def retrieve_full_int_value_from_address(address):
    # Assume we're on a platform where each digit is stored in 30 bits
    # This is common for 64-bit systems, but this could vary

    class PyLongObject(ctypes.Structure):
        _fields_ = [
            ("ob_refcnt", ctypes.c_long),  # Reference count
            ("ob_type", ctypes.c_void_p),  # Type pointer
            ("ob_size", ctypes.c_ssize_t), # Number of digits (could be negative for negative numbers)
            ("ob_digit", ctypes.c_uint32 * 1),  # Placeholder for the first digit
        ]
    
    # Create an instance of PyLongObject from the given memory address
    py_long = PyLongObject.from_address(address)
    
    # Retrieve the number of digits and their values
    num_digits = abs(py_long.ob_size)
    
    # Adjust the structure to include all digits
    class PyLongObjectWithDigits(ctypes.Structure):
        _fields_ = [
            ("ob_refcnt", ctypes.c_long),
            ("ob_type", ctypes.c_void_p),
            ("ob_size", ctypes.c_ssize_t),
            ("ob_digit", ctypes.c_uint32 * num_digits),
        ]
    
    # Recreate the object to include all digits
    py_long_full = PyLongObjectWithDigits.from_address(address)
    
    # Reconstruct the integer from its digits
    value = 0
    base = 2**30  # Base used by Python for each digit on a 64-bit system
    
    for i in range(num_digits):
        value += py_long_full.ob_digit[i] * (base ** i)
    
    # Adjust for negative numbers
    if py_long_full.ob_size < 0:
        value = -value
    
    return value
# ----------          START    ------------------------------
# choose the integer
a =54789549307749363717967245642654820654127167450746741273478270594796070634706458364583706806438045365380416845306453083476597689754398754937934754389543754383549754398343767263692853638647846584239476595470464783065807645160457365410601546530135648376584073635870564380316583645306584045836453816584334063458
address_of_a = id(a)

# Retrieve the full integer value from memory
retrieved_value = retrieve_full_int_value_from_address(address_of_a)

print(f"Original  value: {a}")
print(f"Retrieved value: {retrieved_value}")

Original  value: 54789549307749363717967245642654820654127167450746741273478270594796070634706458364583706806438045365380416845306453083476597689754398754937934754389543754383549754398343767263692853638647846584239476595470464783065807645160457365410601546530135648376584073635870564380316583645306584045836453816584334063458
Retrieved value: 54789549307749363717967245642654820654127167450746741273478270594796070634706458364583706806438045365380416845306453083476597689754398754937934754389543754383549754398343767263692853638647846584239476595470464783065807645160457365410601546530135648376584073635870564380316583645306584045836453816584334063458
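
As a cross-check on how many chunks such a number occupies, the digit count can be predicted from `int.bit_length()` alone. This is a sketch: `digits_needed` is a hypothetical helper and the 30-bit default is an assumption about the build.

```python
def digits_needed(n, bits_per_digit=30):
    """Number of base-2**bits_per_digit chunks needed to hold abs(n)."""
    return -(-abs(n).bit_length() // bits_per_digit)  # ceiling division

for value in (12345, 2**30 - 1, 2**30, 10**300):
    print(f"{value.bit_length():>4} bits -> {digits_needed(value)} digit(s)")
```

Passing the huge integer from the cell above to `digits_needed` gives the number of c_uint32 chunks that `retrieve_full_int_value_from_address` had to walk through.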
