# Python, on the inside

Let's compare two programs, one in C and the other in Python.

### C code

```C
int sum(int a, int b){
    return a + b;
}

int main(){
    sum(1, 2);
}
```

The C compiler is told that `a` and `b` are integers. The CPU works with integers natively and has dedicated circuitry for adding them, so the compiler only needs to load `a` and `b` into registers and emit an add instruction. C is called a "low level" language because many of its constructs map directly onto CPU hardware. In fact, CPU designs have evolved over time partly in response to how C programs are written.

C is compiled to assembly language, a human-readable form of the machine instructions that the CPU executes.


#### C compiled to assembly language (executed by the CPU)

```asm
sum:
        push    {r11, lr}
        mov     r11, sp
        sub     sp, sp, #8
        str     r0, [sp, #4]
        str     r1, [sp]
        ldr     r0, [sp, #4]
        ldr     r1, [sp]
        add     r0, r0, r1
        mov     sp, r11
        pop     {r11, pc}

main:
        push    {r11, lr}
        mov     r11, sp
        ldr     r0, .LCPI1_1
        ldr     r1, .LCPI1_2
        bl      sum
        ldr     r0, .LCPI1_0
        pop     {r11, pc}
.LCPI1_0:
        .long   0
.LCPI1_1:
        .long   1
.LCPI1_2:
        .long   2
```

This listing (32-bit ARM, compiled without optimizations) can be reproduced with Compiler Explorer: https://godbolt.org/

### Python code

```python
def sum(a, b):
    return a + b

sum(1, 2)
sum("hello ", "world")
```

Python is a "high level" language. Its goal is not to map directly onto the CPU and squeeze the most out of its power. Python's goal is to be more user-friendly and to express complex ideas simply.

Python code is executed by an interpreter. The reference interpreter, CPython, is itself written in C! For the same addition, it has to carry out more steps:
- Determine the types of `a` and `b` at run time (notice that `sum` works just as well with strings, yet runs completely different code for them)
- Use reference counting to keep track of memory (notice that we don't do _any_ memory management ourselves)
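
Both points can be observed from Python itself. The sketch below assumes CPython; `add` is simply a renamed copy of the `sum` function above (so the built-in `sum` is not shadowed), and the absolute values returned by `sys.getrefcount` vary between versions, which is why only the difference is printed.

```python
import sys

def add(a, b):
    # Same body as sum() above; renamed so the built-in sum() stays usable.
    return a + b

# The same function handles different types; the type is looked up at run time.
int_result = add(1, 2)
str_result = add("hello ", "world")
print(type(int_result).__name__, int_result)   # int 3
print(type(str_result).__name__, str_result)   # str hello world

# Every object carries a reference count that the interpreter maintains for us.
names = ["Homer", "Marge", "Bart"]
before = sys.getrefcount(names)
alias = names                      # a second reference to the same list
after = sys.getrefcount(names)
print(after - before)              # 1
```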

#### Python's intermediate _bytecode_ language

In [1]:
import dis

In [2]:
dis.dis("""
def sum(a, b):
    return a + b

sum(1,2)
""")

  0           0 RESUME                   0

  2           2 LOAD_CONST               0 (<code object sum at 0x14230c510, file "<dis>", line 2>)
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (sum)

  5           8 PUSH_NULL
             10 LOAD_NAME                0 (sum)
             12 LOAD_CONST               1 (1)
             14 LOAD_CONST               2 (2)
             16 PRECALL                  2
             20 CALL                     2
             30 POP_TOP
             32 LOAD_CONST               3 (None)
             34 RETURN_VALUE

Disassembly of <code object sum at 0x14230c510, file "<dis>", line 2>:
  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE


Full list of Python bytecode instructions at https://docs.python.org/3/library/dis.html#python-bytecode-instructions
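
The same information is also available programmatically through `dis.get_instructions`. A small sketch follows; `add` is a stand-in for the `sum` function above, and the exact opcode names depend on the Python version (for example `BINARY_OP` on 3.11 and later, `BINARY_ADD` before that), so none are hard-coded.

```python
import dis

def add(a, b):
    return a + b

# Collect the opcode names that make up the function body.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)

# The addition itself is a single BINARY_* instruction.
binary_ops = [name for name in opnames if name.startswith("BINARY_")]
print(binary_ops)
```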

Additionally, note that the bytecode does not deal with integers, floats or strings directly.
Every object in Python is represented in C by a common base structure called `PyObject`:

```C
/* Simplified: the real definition in CPython has a few more details. */
typedef struct _object {
     Py_ssize_t ob_refcnt;   /* object reference count */
     PyTypeObject* ob_type;  /* object type */
} PyObject;
```

However, specific types such as integers, floats, booleans and strings are C's equivalent of "sub-classes": their structs start with the same fields and then add more detail. For example, the actual value of an integer or a string has to be stored somewhere in there.
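
One visible consequence of this layout is that even a small integer occupies more memory than the 4 bytes a typical C `int` needs. A quick check, assuming CPython (the exact sizes vary by version and platform, so none are hard-coded here):

```python
import sys

# Size in bytes of a few small objects, object header included.
for value in (1, 1.0, True, "a"):
    print(type(value).__name__, sys.getsizeof(value))

# Every object also knows its own type at run time (the ob_type field).
print(type(1) is int)   # True
```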

More detail at:
- https://llllllllll.github.io/c-extension-tutorial/c-level-representation.html
- Official "Read Me" doc at https://github.com/python/cpython/blob/main/InternalDocs/interpreter.md

#### The difference?

The assembly generated for C is, essentially, executed directly by the CPU.

The Python bytecode is executed by a C program!

Notice the `BINARY_OP    0 (+)` instruction near the end of the bytecode listing: it executes the `+` operator.

However, that single instruction does a lot of complicated work. It fetches the operand objects, checks their types to pick the right implementation (integer addition, float addition, string concatenation?), performs the operation, packages the result into a new object, and so on.

Lots of work, just for an addition!
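
Those steps can be modeled in Python itself. The function below is a simplified, hypothetical sketch (`binary_add` is not a real CPython function, and the real rules have extra cases, such as giving priority to subclasses), but it shows the kind of dispatch that happens for every `+`.

```python
def binary_add(a, b):
    """Simplified, hypothetical model of the dispatch behind `a + b`."""
    # 1. Ask the left operand's type for its implementation of +.
    method = getattr(type(a), "__add__", None)
    if method is not None:
        result = method(a, b)
        if result is not NotImplemented:
            return result
    # 2. Fall back to the right operand's reflected implementation,
    #    which is only tried when the two types differ.
    method = getattr(type(b), "__radd__", None)
    if method is not None and type(a) is not type(b):
        result = method(b, a)
        if result is not NotImplemented:
            return result
    # 3. Neither type knows how to add the other.
    raise TypeError(
        f"unsupported operand type(s) for +: "
        f"{type(a).__name__!r} and {type(b).__name__!r}"
    )

print(binary_add(1, 2))                 # 3
print(binary_add("hello ", "world"))    # hello world
print(binary_add(1, 2.5))               # 3.5 (int declines, float takes over)
```

In the last call, `int.__add__` returns `NotImplemented` for a float operand, so the addition is handled by `float.__radd__` instead.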

### Python handles low level details, such as memory allocation, bounds checking, etc.

Take a close look at this Python snippet:

```python
names_len_dict = dict()

for name in ['Homer', 'Marge', 'Bart']:
    names_len_dict[name] = len(name)

do_something(names_len_dict)
...

```

Notice that we are creating a dictionary dynamically. In a language like C, we would have to manually allocate (`malloc`) the right amount of memory, then remember to `free` it.

In Python, memory allocation is not explicit. In CPython, reference counting frees an object as soon as nothing refers to it anymore, and a _garbage collector_ additionally finds and frees groups of objects that only refer to each other.
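
The automatic clean-up can be observed with a weak reference, which watches an object without keeping it alive. This is a sketch that assumes CPython, where an object is freed as soon as its last reference disappears; other interpreters may free it later. `Record` is a hypothetical placeholder class, used because plain dictionaries cannot be weakly referenced.

```python
import weakref

class Record:
    """Hypothetical placeholder class for this demonstration."""

record = Record()
probe = weakref.ref(record)   # observes the object without keeping it alive
print(probe() is record)      # True: the object is alive

del record                    # drop the only strong reference
print(probe() is None)        # True on CPython: the object was freed at once
```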

Even a simple loop operates differently in Python vs C:

```python
for n in ['Homer', 'Marge', 'Bart']:
    ...
```
In Python, in every iteration, the interpreter makes sure that the loop does not go beyond the bounds of the list.

In a C loop:
```C
for(int i = 0; i < 10; i ++){
    some_list[i];
}
```
It is possible that `some_list` has only three elements, in which case the loop will happily access memory it should not have access to.

In fact, this is a well-known way of hacking computers. What if you go past `some_list` and start accessing the part of memory where the program stores passwords?

All this bookkeeping, combined with the lack of static type checking and with run-time interpretation, slows down the language.
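
For comparison, here is what the same mistake looks like in Python. This minimal sketch reuses the list from the loop above: iteration stops on its own, and an explicit out-of-range index raises an exception instead of reading unrelated memory.

```python
names = ["Homer", "Marge", "Bart"]

# A for loop over the list stops by itself after the last element.
visited = []
for name in names:
    visited.append(name)
print(visited)   # ['Homer', 'Marge', 'Bart']

# Indexing past the end is caught by the interpreter's bounds check.
try:
    names[10]
except IndexError as error:
    message = str(error)
print(message)   # list index out of range
```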

### Why is NumPy so much faster?

NumPy is implemented in C, and the whole data structure and its values live in C. In a Python list, each element carries the full weight and complexity of a `PyObject`, along with other Python features such as reference counting and garbage collection. NumPy's implementation bypasses all of that by storing the raw values side by side in a single block of memory. This is why NumPy is so much faster:

In [6]:
import numpy as np

In [10]:
%timeit sum(range(10_000_000))

62.4 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
%timeit np.sum(np.arange(10_000_000))

6 ms ± 17.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
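
The storage difference can be illustrated without NumPy. As a stand-in (an assumption of this sketch, not a description of NumPy's internals), the standard library's `array` module also packs raw C values next to each other, so its footprint can be compared with that of a list of full Python integer objects. Exact byte counts depend on the CPython version and platform.

```python
import sys
from array import array

n = 1000
values = list(range(1000, 1000 + n))   # a list of pointers to separate int objects
packed = array("q", values)            # raw 8-byte signed integers, side by side

# List: the pointer table plus one full object per element.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# Array: one small header plus n * itemsize bytes of raw data.
packed_bytes = sys.getsizeof(packed)

print(packed.itemsize)             # 8
print(list_bytes, packed_bytes)    # the list needs several times more memory
```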
