# Module 5.a: Memory, GIL, & Internal Performance

### The Scenario

You've built a data processing pipeline that tracks 1 million delivery trucks in real-time. It worked great with 10 trucks on your laptop. But in production:
- The server is running out of **RAM**
- CPU usage is stuck at **5%** despite having 16 cores
- Memory slowly creeps up until the OS **kills the process**

### The Goal

By the end of this module, you will:
- Understand why objects use **more memory than expected**
- Know when **threading helps** and when it doesn't (GIL)
- Debug **memory leaks** from circular references
- Optimize with `__slots__`, generators, and multiprocessing

---

## Lesson 1: The Hidden Cost of Objects

### The Problem

Your server logs show `MemoryError`. You calculated that 1 million GPS points (two floats: x, y) should take **16 MB**. But Python is eating **180 MB**. Why?

### The "Aha!" Moment

In Python, **everything is an object**. Every object has overhead:
- Object header (~16 bytes)
- Reference to `__dict__` (~8 bytes)
- The `__dict__` itself (~104+ bytes)

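A simple class with two floats can use **several hundred bytes** per instance once the `__dict__` and the float objects themselves are counted (488 bytes in the measurement below; exact numbers vary by Python version).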

### The Solution: `__slots__`

By defining `__slots__`, you tell Python: "Don't give me a `__dict__`. Just reserve space for these specific attributes."

| Class Type | Memory per Instance | Savings |
|------------|---------------------|--------|
| Regular class | ~488 bytes | - |
| With `__slots__` | ~96 bytes | ~80% |

In [20]:
import sys

# Regular class with __dict__
class BloatedTruck:
    def __init__(self, lat, lng):
        self.lat = lat
        self.lng = lng

# Optimized class with __slots__
class OptimizedTruck:
    __slots__ = ['lat', 'lng']
    def __init__(self, lat, lng):
        self.lat = lat
        self.lng = lng

bloated = BloatedTruck(40.7128, -74.0060)
optimized = OptimizedTruck(40.7128, -74.0060)

In [21]:
# For accurate deep size, use pympler
# ! pip install pympler
try:
    from pympler import asizeof
    
    print("Deep size (includes __dict__ contents):")
    print(f"  BloatedTruck:   {asizeof.asizeof(bloated)} bytes")
    print(f"  OptimizedTruck: {asizeof.asizeof(optimized)} bytes")
    
    savings = (asizeof.asizeof(bloated) - asizeof.asizeof(optimized)) / asizeof.asizeof(bloated) * 100
    print(f"  Savings: {savings:.1f}%")
except ImportError:
    print("Install pympler for deep size: pip install pympler")

Deep size (includes __dict__ contents):
  BloatedTruck:   488 bytes
  OptimizedTruck: 96 bytes
  Savings: 80.3%


### When to Use `__slots__`

| Use `__slots__` | Avoid `__slots__` |
|-----------------|-------------------|
| Many instances (1000+) | Need dynamic attributes |
| Memory-constrained | Multiple inheritance |
| Fixed attributes | Rapid prototyping |
| Performance-critical | Inheriting from non-slots class |
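
The trade-offs in the table can be checked directly. The sketch below uses hypothetical class names (`Point`, `Base`, `SlottedChild`) that are not part of the pipeline; it shows that a slotted class rejects new attributes, and that inheriting from a regular class quietly brings the `__dict__` back.

```python
class Point:
    """Slotted class: no per-instance __dict__."""
    __slots__ = ("lat", "lng")

    def __init__(self, lat, lng):
        self.lat = lat
        self.lng = lng


class Base:
    """Regular class: instances carry a __dict__."""


class SlottedChild(Base):
    # __slots__ here cannot remove the __dict__ inherited from Base
    __slots__ = ("lat", "lng")


p = Point(40.7128, -74.0060)
has_dict = hasattr(p, "__dict__")

try:
    p.speed = 55  # not listed in __slots__
    dynamic_ok = True
except AttributeError:
    dynamic_ok = False

child = SlottedChild()
child.speed = 55  # allowed: stored in the inherited __dict__
child_has_dict = hasattr(child, "__dict__")

print(f"Point has __dict__:        {has_dict}")
print(f"Point accepts new attrs:   {dynamic_ok}")
print(f"SlottedChild has __dict__: {child_has_dict}")
```

If every class in the hierarchy defines `__slots__`, the subclass stays dict-free; a single regular base class is enough to lose the memory savings.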

---

## Lesson 2: List Over-Allocation

### The Problem

You noticed that even when you aren't adding new items, your lists use more memory than expected.

### The "Aha!" Moment

Python lists use **over-allocation** to make `append()` fast (amortized O(1)). When a list grows, Python allocates extra space so it doesn't resize on every append.

### List vs Tuple Memory

| Type | Allocation | Best For |
|------|------------|----------|
| `list` | Over-allocated | Dynamic collections |
| `tuple` | Exact size | Fixed records |

In [61]:
# Watch list over-allocation in action
lst = []

print(f"{'Items':8} {'Size (bytes)':20} {'Total Size'}")
print("-" * 45)

for i in range(12):
    lst.append(i)
    size = sys.getsizeof(lst)
    print(f"{len(lst):7}{size:10}\t\t\t{size} bytes")

# Compare with tuple
tup = tuple(range(12))
print(f"\nTuple with 12 items: {sys.getsizeof(tup)} bytes (exact size)")

Items    Size (bytes)         Total Size
---------------------------------------------
      1        88			88 bytes
      2        88			88 bytes
      3        88			88 bytes
      4        88			88 bytes
      5       120			120 bytes
      6       120			120 bytes
      7       120			120 bytes
      8       120			120 bytes
      9       184			184 bytes
     10       184			184 bytes
     11       184			184 bytes
     12       184			184 bytes

Tuple with 12 items: 136 bytes (exact size)
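
To see why over-allocation makes `append()` amortized O(1), it can help to count how often the list actually reallocates. This is a minimal sketch that treats every change in `sys.getsizeof()` as one reallocation; the exact growth pattern is a CPython implementation detail and may differ between versions.

```python
import sys

items = []
resizes = 0
last_size = sys.getsizeof(items)

for i in range(10_000):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:
        # The reported size only changes when the list reallocates its buffer
        resizes += 1
        last_size = size

print(f"{len(items):,} appends triggered {resizes} reallocations")
```

Only a small fraction of the appends pay for a reallocation; the rest reuse space that was reserved earlier.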


---

## Lesson 3: The GIL (Global Interpreter Lock)

### The Problem

You have CPU-heavy calculations for each truck. You tried `threading` to use all 16 cores, but the script is just as slow as the single-threaded version.

### The "Aha!" Moment

In CPython, the **Global Interpreter Lock (GIL)** ensures that only ONE thread executes Python bytecode at a time, so CPU-bound threads take turns instead of running in parallel.

### Threading vs Multiprocessing

| Aspect | Threading | Multiprocessing |
|--------|-----------|----------------|
| GIL Impact | Blocked (one at a time) | Bypassed (separate GIL) |
| Best For | I/O-bound (network, disk) | CPU-bound (math, processing) |
| Memory | Shared | Isolated |
| Overhead | Low | Higher (process spawn) |

In [62]:
import threading
import time

def cpu_heavy(n):
    """CPU-bound task: count down."""
    while n > 0:
        n -= 1

COUNT = 10_000_000

# Sequential
start = time.perf_counter()
cpu_heavy(COUNT)
cpu_heavy(COUNT)
seq_time = time.perf_counter() - start
print(f"Sequential: {seq_time:.2f}s")

# Threaded (limited by GIL)
threads = [
    threading.Thread(target=cpu_heavy, args=(COUNT,)),
    threading.Thread(target=cpu_heavy, args=(COUNT,)),
]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
thread_time = time.perf_counter() - start

print(f"Threaded:   {thread_time:.2f}s (speedup: {seq_time/thread_time:.2f}x)")
print("\n→ Threads don't help CPU-bound tasks due to GIL!")

Sequential: 0.24s
Threaded:   0.23s (speedup: 1.05x)

→ Threads don't help CPU-bound tasks due to GIL!
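
For contrast, here is a sketch of the I/O-bound case from the table. It assumes `time.sleep()` is a fair stand-in for a network or disk wait (the GIL is released while a thread sleeps); `fake_io` and `DELAYS` are hypothetical names used only for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for a blocking network or disk call."""
    time.sleep(delay)  # the GIL is released while the thread waits
    return delay


DELAYS = [0.1] * 4

# Sequential: the waits add up
start = time.perf_counter()
for d in DELAYS:
    fake_io(d)
io_seq_time = time.perf_counter() - start

# Threaded: the waits overlap
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(DELAYS)) as pool:
    results = list(pool.map(fake_io, DELAYS))
io_thread_time = time.perf_counter() - start

print(f"Sequential I/O: {io_seq_time:.2f}s")
print(f"Threaded I/O:   {io_thread_time:.2f}s")
```

Because the threads spend their time waiting rather than executing bytecode, they do not contend for the GIL, and the total time is close to that of a single wait.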


### The Solution: Multiprocessing

Each process has its own Python interpreter and GIL, enabling true parallelism.

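**Python 3.13+ Note:** An experimental free-threaded ("No-GIL") build is available as a separate interpreter build (often installed as `python3.13t`). On that build only, `python -X gil=0` (or `PYTHON_GIL=0`) forces the GIL to stay disabled; the option is not available on the standard build.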

In [68]:
import multiprocessing

# 'fork' lets the child processes inherit cpu_heavy from this notebook session.
# It is only available on POSIX systems (not on Windows).
multiprocessing.set_start_method('fork', force=True)

# Each process gets its own interpreter and its own GIL
processes = [
    multiprocessing.Process(target=cpu_heavy, args=(COUNT,)),
    multiprocessing.Process(target=cpu_heavy, args=(COUNT,)),
]

start = time.perf_counter()
for p in processes:
    p.start()
for p in processes:
    p.join()
mp_time = time.perf_counter() - start
print(f"Multiprocessing:   {mp_time:.2f}s (speedup: {seq_time/mp_time:.2f}x)")

Multiprocessing:   0.18s (speedup: 1.35x)


---

## Lesson 4: Reference Counting & Garbage Collection

### The Problem

Your worker script runs for days, but memory slowly creeps up until the OS kills it. You aren't storing data, so where is it going?

### The "Aha!" Moment

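Python uses **reference counting**: objects are freed when their reference count hits zero. But objects in a **circular reference** (A → B → A) keep each other's counts above zero, so reference counting alone never frees them. A separate cyclic garbage collector has to find and free them.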

### Memory Management

| Mechanism | What It Does | When It Runs |
|-----------|--------------|-------------|
| Reference counting | Frees objects when refcount = 0 | Immediately |
| Garbage collector | Finds and frees unreachable reference cycles | Periodically, once allocation thresholds are exceeded |

In [69]:
# Reference counting
# Note: getrefcount() reports one extra reference, held temporarily by its own argument
data = [1, 2, 3]
print(f"Initial refcount: {sys.getrefcount(data)}")

alias = data  # +1 reference
print(f"After alias: {sys.getrefcount(data)}")

container = {'data': data}  # +1 reference
print(f"After container: {sys.getrefcount(data)}")

del alias  # -1 reference
del container  # -1 reference
print(f"After deletions: {sys.getrefcount(data)}")

Initial refcount: 2
After alias: 3
After container: 4
After deletions: 2


In [71]:
import gc

# Circular reference problem
class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None

def create_cycle():
    a = Node("A")
    b = Node("B")
    a.partner = b  # A → B
    b.partner = a  # B → A (cycle!)
    # When function returns, a and b go out of scope
    # But refcounts never hit zero due to cycle!

gc.disable()  # Disable GC to see the problem
create_cycle()
print("Objects should be deleted, but...")

# Manually trigger GC
collected = gc.collect()
print(f"GC found and collected {collected} objects")

gc.enable()

Objects should be deleted, but...
GC found and collected 8 objects
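
One way to avoid the cycle in the first place is to make the back-reference weak. The sketch below is an assumption about how `Node` could be restructured, not the notebook's original class: `partner` is stored as a `weakref.ref`, so it does not raise the partner's reference count, and the objects are freed by reference counting alone even with the GC disabled.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self._partner = None  # holds a weak reference, or None

    @property
    def partner(self):
        # Dereference the weakref; returns None if the partner is gone
        return self._partner() if self._partner is not None else None

    @partner.setter
    def partner(self, node):
        self._partner = weakref.ref(node)


def make_pair():
    a = Node("A")
    b = Node("B")
    a.partner = b  # weak: does not increase b's refcount
    b.partner = a  # weak: no strong cycle is formed
    # Return weak references so we can observe the objects after they die
    return weakref.ref(a), weakref.ref(b)


gc.disable()  # prove that the cyclic GC is not needed here
try:
    ref_a, ref_b = make_pair()
    freed_without_gc = ref_a() is None and ref_b() is None
finally:
    gc.enable()

print(f"Freed by refcounting alone: {freed_without_gc}")
```

In real code usually only one direction is weak (for example, a child pointing back at its parent), so that something still keeps the objects alive while they are in use.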


---

## Lesson 5: Memory Optimization Strategies

### Quick Wins

| Strategy | Approx. Savings | Use Case |
|----------|---------|----------|
| `__slots__` | ~80% (as measured in Lesson 1) | Classes with many instances |
| Generators | Variable | Large sequences you iterate once |
| `tuple` over `list` | ~20% | Fixed-size data |
| `array.array` | ~50% | Homogeneous numeric data |
| NumPy arrays | ~90% | Large numeric datasets |
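
As a rough check of the `array.array` row, the sketch below compares a list of floats with an `array('d')` holding the same values. It assumes the list's cost is the list itself plus one float object per element; the exact savings depend on the data and the Python version.

```python
import sys
from array import array

n = 100_000
values = [float(i) for i in range(n)]

# A list stores pointers; each float is a separate object on the heap
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An array stores raw 8-byte doubles in one contiguous buffer
arr = array("d", values)
arr_bytes = sys.getsizeof(arr)

savings = (list_bytes - arr_bytes) / list_bytes * 100
print(f"list of floats: {list_bytes:,} bytes")
print(f"array('d'):     {arr_bytes:,} bytes")
print(f"Savings:        {savings:.0f}%")
```

The trade-off is that an array only holds one numeric type, and each element is converted to a Python object again whenever it is read.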

In [72]:
# Generators vs Lists
# List: creates all items in memory
list_size = sys.getsizeof([x**2 for x in range(10000)])

# Generator: creates items on-demand
gen_size = sys.getsizeof(x**2 for x in range(10000))

print(f"List comprehension: {list_size:,} bytes")
print(f"Generator:          {gen_size:,} bytes")
print("\n→ Generator uses constant memory regardless of size!")

List comprehension: 85,176 bytes
Generator:          200 bytes

→ Generator uses constant memory regardless of size!


---

## Summary

### Object Memory

| Issue | Cause | Solution |
|-------|-------|----------|
| Objects too large | `__dict__` overhead | Use `__slots__` |
| Lists use extra memory | Over-allocation | Use tuple/array |
| Memory grows slowly | Circular references | Use weakref, check GC |

### Threading vs Multiprocessing

| Task Type | Solution | Why |
|-----------|----------|-----|
| I/O-bound | Threading | GIL releases during I/O wait |
| CPU-bound | Multiprocessing | Separate process = separate GIL |

### Memory Optimization

| Technique | When to Use |
|-----------|------------|
| `__slots__` | Many instances of a class |
| Generators | Large sequences, iterate once |
| `tuple` | Fixed-size immutable data |
| NumPy | Large numeric computations |

### Debugging Tools

| Tool | Purpose |
|------|--------|
| `sys.getsizeof()` | Shallow object size |
| `pympler.asizeof.asizeof()` | Deep object size |
| `gc.collect()` | Force garbage collection |
| `tracemalloc` | Memory allocation tracking |
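
Of the tools above, `tracemalloc` is the one not demonstrated earlier in this module. Here is a minimal sketch of the snapshot-comparison workflow; the `payload` list is a hypothetical stand-in for whatever your worker allocates between two checkpoints.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical allocation under investigation: about 1 MB of byte strings
payload = [bytes(1000) for _ in range(1000)]

after = tracemalloc.take_snapshot()

# Group allocations by source line, largest growth first
top = after.compare_to(before, "lineno")
for stat in top[:3]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current:,} bytes, peak: {peak:,} bytes")
tracemalloc.stop()
```

In a long-running worker, taking snapshots a few minutes apart and comparing them usually points to the line whose allocations keep growing.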

---

**Next in this notebook:** Module 5.b: How the Python Compiler Works  

**Then:** Module 6: Advanced Python Concepts (Multiprocessing, AsyncIO, Context Managers)

---

# Module 5.b: How the Python Compiler Works

### The Scenario

You've heard that Python is "interpreted," but your `.py` files seem to run fast after the first run. You see a `__pycache__` folder full of `.pyc` files. What is Python actually doing under the hood?

### The Goal

By the end of this section, you will:
- Understand the **compilation pipeline**: source → bytecode
- Know where **bytecode** (.pyc) is stored and when it's used
- Use the **`dis`** module to inspect bytecode
- See how this ties into **startup time** and **imports**

## The Compilation Pipeline

CPython does **not** compile your source to native machine code ahead of time the way a C or Rust compiler does. Instead, it compiles to **bytecode**: a low-level instruction set for the **CPython virtual machine (VM)**.

### Stages

| Stage | Input | Output | What Happens |
|-------|--------|--------|----------------|
| **Tokenize** | Source (`.py`) | Tokens | Lexical analysis: keywords, names, literals |
| **Parse** | Tokens | AST (Abstract Syntax Tree) | Grammar rules → tree structure |
| **Compile** | AST | Bytecode | Tree → stack-based instructions |
| **Execute** | Bytecode | Result | VM interprets bytecode (or runs cached `.pyc`) |

So when you run `python script.py`, Python **compiles** the source to bytecode, then the **interpreter** executes that bytecode. For imported modules, it loads cached bytecode instead whenever a valid `.pyc` exists. "Interpreted" means the VM interprets bytecode, not the raw `.py` text.
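
Each stage in the table can be run by hand from the standard library. The sketch below walks one hypothetical line of source (`total = price * 2 + 1`) through `tokenize`, `ast.parse()`, `compile()`, and `exec()`, then prints the bytecode with `dis.dis()`. The exact instructions printed differ between Python versions, so no output is shown.

```python
import ast
import dis
import io
import tokenize

source = "total = price * 2 + 1\n"

# Stage 1: tokenize (drop the empty NEWLINE/ENDMARKER tokens for readability)
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.string.strip()
]
print("Tokens:", tokens)

# Stage 2: parse tokens into an AST
tree = ast.parse(source)
print("AST:   ", ast.dump(tree.body[0]))

# Stage 3: compile the AST to a code object holding bytecode
code = compile(tree, filename="<example>", mode="exec")

# Stage 4: execute the bytecode in a namespace we control
namespace = {"price": 10}
exec(code, namespace)
print("Result:", namespace["total"])

# Inspect the stack-based instructions the VM actually ran
dis.dis(code)
```

Reading the `dis` output from top to bottom shows the stack-based model: operands are loaded, an operation consumes them, and the result is stored under a name.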

## Bytecode and `.pyc` Files

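- **First import:** Python compiles the module's `.py` → bytecode and writes it to `__pycache__/<module>.cpython-<version>.pyc`. The script you run directly (`python script.py`) is compiled in memory and is not cached.
- **Later imports:** If the `.pyc` is present and still matches the source (by default, CPython checks the source's modification time and size recorded in the `.pyc`), Python **loads the bytecode** and skips compilation → faster startup.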
- **Version-specific:** Bytecode is not portable across Python versions (e.g., 3.11 vs 3.12).
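
The cache location can be confirmed without importing anything. This sketch writes a throwaway module (the hypothetical `trucks.py`) to a temporary directory, asks `importlib.util.cache_from_source()` where its `.pyc` belongs, and byte-compiles it with `py_compile`. The version tag in the filename depends on the interpreter you run it with.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "trucks.py"  # hypothetical module for illustration
    src.write_text("SPEED_LIMIT = 65\n")

    # Where the import system would look for this module's cached bytecode
    expected = importlib.util.cache_from_source(str(src))

    # Compile explicitly; returns the path of the .pyc that was written
    written = py_compile.compile(str(src))

    pyc_name = Path(written).name
    pyc_exists = Path(written).exists()
    same_path = Path(written) == Path(expected)

print(f"Cached as: __pycache__/{pyc_name}")
print(f"Matches import system's expected path: {same_path}")
```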

## Summary: How the Python Compiler Fits In

| Concept | Takeaway |
|--------|----------|
| **Compilation** | `.py` is compiled to bytecode (tokens → AST → bytecode). |
| **Caching** | Bytecode for imported modules is cached in `__pycache__/*.pyc` to speed up later imports. |
| **Execution** | The CPython VM **interprets** bytecode. Python 3.13+ includes an experimental JIT, but it is disabled by default. |
| **Inspection** | Use `ast.parse()` / `compile()` and `dis.dis()` to see how your code becomes bytecode. |

Understanding this pipeline helps explain **import cost** (compilation + bytecode load), **startup time**, and why tools like **Cython** (compile to C) or **PyPy** (an alternative interpreter with a JIT compiler) can change performance characteristics.