# Understanding PyO3 Bindings: How Rust Meets Python

This notebook explains the core concepts behind the GROMOS-RS Python bindings, whose design is inspired by Polars' architecture.

## Table of Contents
1. [What is PyO3?](#what-is-pyo3)
2. [Zero-Copy Data Sharing](#zero-copy)
3. [SIMD Acceleration](#simd)
4. [Memory Layout](#memory)
5. [Performance Benefits](#performance)

## 1. What is PyO3? <a name="what-is-pyo3"></a>

PyO3 is a Rust library that allows you to:
- Write Python extensions in Rust
- Call Rust code from Python
- Call Python code from Rust

### Architecture:

```
┌─────────────────────────────────────┐
│   Python Code (User Interface)     │
│   import gromos                      │
│   v = gromos.Vec3(1.0, 2.0, 3.0)   │
└─────────────────┬───────────────────┘
                  │
                  │ PyO3 Bindings
                  ▼
┌─────────────────────────────────────┐
│   Rust Core (High Performance)      │
│   - SIMD vectorization               │
│   - Parallel execution (Rayon)       │
│   - Memory safety guarantees         │
└─────────────────────────────────────┘
```

In [None]:
# First, let's import the library
# Note: You need to build it first with: maturin develop --release

try:
    import gromos
    import numpy as np
    print(f"✓ GROMOS-RS version: {gromos.__version__}")
    print(f"✓ NumPy version: {np.__version__}")
except ImportError as e:
    print(f"❌ Error: {e}")
    print("\nPlease build the package first:")
    print("  cd py-gromos")
    print("  maturin develop --release")

## 2. Zero-Copy Data Sharing <a name="zero-copy"></a>

One of the key features of the Polars architecture is **zero-copy data sharing**: Python and Rust read the same buffer instead of each keeping its own copy.

### Traditional Approach (With Copying):
```
Python → Copy to Rust → Process → Copy to Python
        (overhead)               (overhead)
```

### Zero-Copy Approach:
```
Python → Share Memory ← Rust
         (no copy!)    (direct access)
```
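
The difference between sharing and copying can be observed from the Python side without GROMOS-RS at all. The following is a minimal sketch using plain NumPy arrays as a stand-in for a buffer shared with Rust; the array names are illustrative only.

```python
import numpy as np

# Stand-in for a coordinate buffer: 4 atoms with x, y, z each
positions = np.zeros((4, 3), dtype=np.float32)

view = positions[1:3]          # basic slicing returns a view: no data is copied
copy = positions[1:3].copy()   # explicit copy: a new, independent buffer

view[0, 0] = 42.0              # writes through to the original buffer
copy[0, 1] = 7.0               # only changes the copy

shares_view = np.shares_memory(positions, view)
shares_copy = np.shares_memory(positions, copy)
print(shares_view, shares_copy)   # True False
print(float(positions[1, 0]))     # 42.0 (changed through the view)
print(float(positions[1, 1]))     # 0.0  (the copy did not touch it)
```

A view costs the same regardless of how many atoms it covers, while the cost of a copy grows with the size of the data.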

In [None]:
# Example: Creating a Vec3 and converting to NumPy
v = gromos.Vec3(1.0, 2.0, 3.0)
print(f"Vec3: {v}")
print(f"Type: {type(v)}")

# Convert to NumPy (cheap: only three floats are involved)
arr = v.to_numpy()
print(f"\nNumPy array: {arr}")
print(f"Type: {type(arr)}")
print(f"Dtype: {arr.dtype}")
print(f"Shape: {arr.shape}")

In [None]:
# Going the other way: NumPy to Vec3
np_vec = np.array([4.0, 5.0, 6.0], dtype=np.float32)
v2 = gromos.Vec3.from_numpy(np_vec)
print(f"Created Vec3 from NumPy: {v2}")

# This direction copies the data, but for three floats the cost is negligible

### Large Arrays: State Example

For large systems with thousands of atoms, zero-copy becomes critical:

In [None]:
# Create a state with 10,000 atoms
n_atoms = 10000
state = gromos.State(num_atoms=n_atoms, num_temp_groups=1, num_energy_groups=1)

print(f"Created state with {state.num_atoms()} atoms")

# Initialize positions
positions = np.random.rand(n_atoms, 3).astype(np.float32) * 5.0
state.set_positions(positions)
print(f"Set {len(positions)} positions")

# Get positions back (this returns a view when possible)
pos_view = state.positions()
print(f"\nRetrieved positions shape: {pos_view.shape}")
print(f"Memory size: {pos_view.nbytes / 1024:.2f} KB")
print(f"First 3 positions:\n{pos_view[:3]}")
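
Whether a returned array is a view or a copy can be checked from the NumPy side. The sketch below uses only NumPy; `describe_ownership` is a hypothetical helper, not part of GROMOS-RS. An array backed by memory owned elsewhere typically reports `OWNDATA` as `False`, so the same check could be applied to the result of `state.positions()`.

```python
import numpy as np

def describe_ownership(arr: np.ndarray) -> str:
    """Report whether a NumPy array owns its buffer or borrows it."""
    if arr.flags["OWNDATA"]:
        return "owns its data"
    return f"borrows its data from a {type(arr.base).__name__}"

original = np.zeros((4, 3), dtype=np.float32)
view = original[:2]            # slice: borrows the original buffer
copied = original[:2].copy()   # copy: allocates its own buffer

print("original:", describe_ownership(original))
print("view:    ", describe_ownership(view))
print("copied:  ", describe_ownership(copied))
```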

## 3. SIMD Acceleration <a name="simd"></a>

SIMD (Single Instruction, Multiple Data) allows processing multiple values simultaneously.

### Without SIMD:
```rust
// Process one at a time
result[0] = a[0] + b[0];
result[1] = a[1] + b[1];
result[2] = a[2] + b[2];
```

### With SIMD:
```rust
// Process all at once!
result = a + b;  // Hardware does 3 operations simultaneously
```

GROMOS-RS uses the `glam` library, whose SIMD-backed types (such as `Vec3A`) map vector operations onto SIMD instructions on supported targets.
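
The same one-at-a-time versus all-at-once contrast can be expressed in Python. This is only an analogy, sketched with NumPy: the whole-array expression hands the loop to compiled code, and whether that code uses SIMD instructions depends on the NumPy build and the CPU.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)

# Scalar style: one element per step, driven from Python
scalar_sum = np.empty(3, dtype=np.float32)
for i in range(3):
    scalar_sum[i] = a[i] + b[i]

# Vector style: a single expression over the whole array
vector_sum = a + b

same = np.array_equal(scalar_sum, vector_sum)
print(vector_sum)  # [5. 7. 9.]
print(same)        # True
```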

In [None]:
# All Vec3 operations use SIMD under the hood
import time

# Create two vectors
v1 = gromos.Vec3(1.0, 2.0, 3.0)
v2 = gromos.Vec3(4.0, 5.0, 6.0)

# These operations are SIMD-accelerated:
print("SIMD-accelerated operations:")
print(f"Addition:    v1 + v2 = {v1 + v2}")
print(f"Subtraction: v1 - v2 = {v1 - v2}")
print(f"Dot product: v1 · v2 = {v1.dot(v2)}")
print(f"Cross prod:  v1 × v2 = {v1.cross(v2)}")
print(f"Length:      |v1|    = {v1.length():.4f}")
print(f"Distance:    d(v1,v2)= {v1.distance(v2):.4f}")

In [None]:
# Benchmark: Vec3 (Rust) vs NumPy on a single 3-element dot product
# Note: at this size both timings are dominated by Python call overhead,
# so this compares per-call binding cost rather than SIMD throughput.
n_iterations = 100000

# Vec3 operations (SIMD-accelerated Rust)
start = time.perf_counter()
for _ in range(n_iterations):
    result = v1.dot(v2)
vec3_time = time.perf_counter() - start

# NumPy operations
arr1 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
arr2 = np.array([4.0, 5.0, 6.0], dtype=np.float32)
start = time.perf_counter()
for _ in range(n_iterations):
    result = np.dot(arr1, arr2)
numpy_time = time.perf_counter() - start

print(f"\nPerformance comparison ({n_iterations:,} iterations):")
print(f"Vec3 (Rust SIMD): {vec3_time:.4f} seconds")
print(f"NumPy:            {numpy_time:.4f} seconds")
print(f"Speedup:          {numpy_time/vec3_time:.2f}×")
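
Calling a tiny operation in a Python loop mostly measures call overhead. A common remedy is to batch the work so that one call processes many vectors. The sketch below illustrates this with NumPy only; the array sizes and function names are arbitrary choices for the example, and no timing result is implied.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2000, 3), dtype=np.float32)
b = rng.random((2000, 3), dtype=np.float32)

def dots_one_by_one():
    # One np.dot call per pair of vectors: 2000 Python-level calls
    return np.array([np.dot(x, y) for x, y in zip(a, b)], dtype=np.float32)

def dots_batched():
    # One call computes all row-wise dot products
    return np.einsum("ij,ij->i", a, b)

# Both approaches produce the same numbers
results_match = np.allclose(dots_one_by_one(), dots_batched(), rtol=1e-5)

t_loop = min(timeit.repeat(dots_one_by_one, number=1, repeat=3))
t_batch = min(timeit.repeat(dots_batched, number=1, repeat=3))
print(f"Results match: {results_match}")
print(f"Loop:    {t_loop * 1000:.3f} ms")
print(f"Batched: {t_batch * 1000:.3f} ms")
```

The same reasoning applies to bindings: operations that accept whole coordinate arrays amortize the Python-to-Rust call cost over all atoms.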

## 4. Memory Layout <a name="memory"></a>

Understanding memory layout is key to understanding performance.

### Rust Memory Layout (glam Vec3A):
```
[x][y][z][padding]
 4  4  4     4     bytes (16 bytes total, aligned)
```

### NumPy Array:
```
[x][y][z]
 4  4  4   bytes (12 bytes, contiguous)
```

The padding in `Vec3A` keeps every vector 16-byte aligned, so it can be loaded into a 128-bit SIMD register directly.
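
The two layouts can be mimicked in NumPy to see what the padding means for strides. This is a sketch under the assumption that an array of padded vectors is stored as rows of four `float32` values; it does not show how GROMOS-RS actually exposes its buffers.

```python
import numpy as np

n_atoms = 5

# Padded layout: 4 floats per row = 16 bytes, like [x][y][z][padding]
padded = np.zeros((n_atoms, 4), dtype=np.float32)

# A view on x, y, z only: the padding column is skipped, nothing is copied
xyz = padded[:, :3]
print(padded.strides)                # (16, 4)
print(xyz.strides)                   # (16, 4): rows are still 16 bytes apart
print(xyz.flags["C_CONTIGUOUS"])     # False
print(np.shares_memory(padded, xyz)) # True

# Packed layout: 3 floats per row = 12 bytes, obtained by copying
packed = np.ascontiguousarray(xyz)
print(packed.strides)                # (12, 4)
```

The strided view avoids a copy but is not contiguous, while the packed array is contiguous but requires a copy. This is the trade-off between the two layouts shown above.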

In [None]:
# Examine the memory layout of the NumPy array returned by to_numpy()
v = gromos.Vec3(1.0, 2.0, 3.0)
arr = v.to_numpy()

print("Memory information:")
print(f"NumPy array size: {arr.nbytes} bytes")
print(f"NumPy dtype: {arr.dtype}")
print(f"NumPy strides: {arr.strides}")
print(f"NumPy C-contiguous: {arr.flags['C_CONTIGUOUS']}")
print(f"NumPy aligned: {arr.flags['ALIGNED']}")

### State Memory Layout

For molecular systems, we store arrays of Vec3:

In [None]:
# Create a small system
n = 5
state = gromos.State(num_atoms=n, num_temp_groups=1, num_energy_groups=1)

# Set positions
pos = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
], dtype=np.float32)

state.set_positions(pos)

# Retrieve and examine
retrieved = state.positions()
print("Positions array:")
print(retrieved)
print(f"\nShape: {retrieved.shape}")
print(f"Total memory: {retrieved.nbytes} bytes")
print(f"Per atom: {retrieved.nbytes / n} bytes")

## 5. Performance Benefits <a name="performance"></a>

Let's measure the performance benefits of this architecture.

In [None]:
# Benchmark: Large system operations
import time

sizes = [100, 1000, 10000, 50000]

print("Performance scaling with system size:")
print(f"{'N atoms':<12} {'Create (ms)':<15} {'Set pos (ms)':<15} {'Get pos (ms)':<15}")
print("-" * 60)

for n in sizes:
    # Create state
    start = time.perf_counter()
    state = gromos.State(num_atoms=n, num_temp_groups=1, num_energy_groups=1)
    create_time = (time.perf_counter() - start) * 1000

    # Set positions
    pos = np.random.rand(n, 3).astype(np.float32)
    start = time.perf_counter()
    state.set_positions(pos)
    set_time = (time.perf_counter() - start) * 1000

    # Get positions
    start = time.perf_counter()
    retrieved = state.positions()
    get_time = (time.perf_counter() - start) * 1000

    print(f"{n:<12} {create_time:<15.3f} {set_time:<15.3f} {get_time:<15.3f}")

### Why Is It Fast?

1. **SIMD Vectorization**: One instruction operates on up to 4 `f32` values at once
2. **Cache-Friendly**: Contiguous memory layout
3. **Zero-Copy**: No unnecessary data copying
4. **Rust Optimizations**: Ahead-of-time compilation to optimized native code (`--release` builds)
5. **Parallel Execution**: Rayon for multi-threading (when applicable)
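
The zero-copy point in the list above can be checked without GROMOS-RS. The following sketch times a NumPy copy against a NumPy view for growing array sizes; `best_of` is a hypothetical helper that keeps the fastest of several runs to reduce noise, and the sizes are arbitrary.

```python
import time
import numpy as np

def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of fn() in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return min(timings)

print(f"{'N atoms':<12} {'Copy (ms)':<15} {'View (ms)':<15}")
print("-" * 42)
for n in (100, 1000, 10000):
    pos = np.random.rand(n, 3).astype(np.float32)
    copy_ms = best_of(lambda: pos.copy())  # cost grows with n
    view_ms = best_of(lambda: pos[:])      # cost does not depend on n
    print(f"{n:<12} {copy_ms:<15.5f} {view_ms:<15.5f}")
```

Taking the minimum over several repeats is one simple way to make small timings more stable. The standard library's `timeit` module offers the same idea with more control.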

## Summary

### Key Concepts:

| Concept | Benefit |
|---------|----------|
| PyO3 Bindings | Seamless Rust ↔ Python integration |
| Zero-Copy | Minimal memory overhead |
| SIMD | Potentially 2-4× speedup for vector operations (workload-dependent) |
| Memory Alignment | Hardware-optimized access patterns |
| Rust Safety | Memory safety without a garbage collector (no use-after-free or data races in safe Rust) |

### The Polars Pattern:

```
High Performance (Rust) + Easy to Use (Python) = Best of Both Worlds
```

This architecture allows you to:
- Write Python code (easy, productive)
- Get Rust performance (fast, efficient)
- Avoid manual memory management (safe)
- Leverage NumPy ecosystem (interoperable)

## Next Steps

Continue to the next notebooks to learn about:
- Notebook 02: Working with molecular systems
- Notebook 03: Advanced sampling methods
- Notebook 04: Performance optimization tricks