# PyTorch Basics: Prerequisites for Neural Networks

Before diving into more complex neural network architectures, let's understand PyTorch - the deep learning framework we'll use for the rest of the course.

## Table of Contents
1. [What is PyTorch?](#1)
2. [Tensors: The Building Block](#2)
3. [Tensors vs NumPy Arrays](#3)
4. [Basic Tensor Operations](#4)
5. [Autograd: Automatic Differentiation](#5)
6. [Building Neural Networks with PyTorch](#6)
7. [GPU Acceleration](#7)
8. [Summary](#8)

## 1. What is PyTorch?<a id="1"></a>

PyTorch is a deep learning framework originally developed by Meta (formerly Facebook) and now governed by the PyTorch Foundation. Think of it as:
- **NumPy on steroids** - similar array operations but with GPU support
- **Automatic differentiation engine** - like the micrograd you just built, but production-ready
- **Neural network building blocks** - pre-built layers, optimizers, and utilities

After building micrograd, you understand the core concepts. PyTorch is essentially a highly optimized, industrial-strength version of what you just created!

In [1]:
import torch
import numpy as np

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")



PyTorch version: 2.2.2
CUDA available: False


## 2. Tensors: The Building Block<a id="2"></a>

### What is a Tensor?

A **tensor** is a multi-dimensional array - a generalization of scalars, vectors, and matrices:

- **Scalar** (0D tensor): `5`
- **Vector** (1D tensor): `[1, 2, 3]`
- **Matrix** (2D tensor): `[[1, 2], [3, 4]]`
- **3D+ tensor**: Higher dimensional arrays

In deep learning:
- Images are typically **4D tensors**: `(batch_size, channels, height, width)`
- Text sequences are **3D tensors**: `(batch_size, sequence_length, embedding_dim)`
- Video data are **5D tensors**: `(batch_size, frames, channels, height, width)`

In [None]:
# Creating tensors
scalar = torch.tensor(5)
vector = torch.tensor([1, 2, 3])
matrix = torch.tensor([[1, 2], [3, 4]])
tensor_3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(f"Scalar shape: {scalar.shape}, ndim: {scalar.ndim}")
print(f"Vector shape: {vector.shape}, ndim: {vector.ndim}")
print(f"Matrix shape: {matrix.shape}, ndim: {matrix.ndim}")
print(f"3D Tensor shape: {tensor_3d.shape}, ndim: {tensor_3d.ndim}")
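
To connect these constructors to the deep learning shapes listed earlier, here is a small sketch that builds a placeholder image batch and inspects its attributes. The sizes (8 images, 3 channels, 32x32 pixels) are arbitrary values chosen for illustration, and the dtype comments assume PyTorch's default settings.

```python
import torch

# Hypothetical batch: 8 RGB images of 32x32 pixels (sizes chosen for illustration)
images = torch.zeros(8, 3, 32, 32)

print(images.shape)    # (batch_size, channels, height, width)
print(images.ndim)     # number of dimensions
print(images.dtype)    # floating-point dtype used by default
print(images.numel())  # total number of elements

# Integer literals produce an integer tensor, float literals a float tensor
int_vector = torch.tensor([1, 2, 3])
float_vector = torch.tensor([1.0, 2.0, 3.0])
print(int_vector.dtype, float_vector.dtype)
```

The dtype matters later: autograd only tracks floating-point (and complex) tensors, so training data and weights are normally floats.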

## 3. Tensors vs NumPy Arrays<a id="3"></a>

### Why use PyTorch tensors instead of NumPy?

#### 1. **GPU Acceleration**
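NumPy only runs on the CPU. PyTorch tensors can also run on a GPU, which often gives a large speedup (commonly quoted as **10-100x**) for big, parallelizable operations such as large matrix multiplications. The actual gain depends on the hardware and the workload.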

#### 2. **Automatic Differentiation**
PyTorch tracks operations and can automatically compute gradients - essential for training neural networks. Remember the backpropagation you implemented in micrograd? PyTorch does this automatically!

#### 3. **Deep Learning Ecosystem**
PyTorch integrates seamlessly with neural network layers, optimizers, and other deep learning tools.

#### 4. **Dynamic Computation Graphs**
PyTorch builds computation graphs on-the-fly, making debugging easier and allowing for dynamic architectures.
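
As a rough illustration of what "on-the-fly" means, the sketch below uses an ordinary Python `while` loop whose iteration count depends on the tensor's value at runtime. The function name `repeated_double` and the threshold are made up for this example.

```python
import torch

def repeated_double(x, threshold=10.0):
    # Hypothetical helper: keep doubling until the value reaches the threshold.
    # The number of iterations depends on the runtime value of x.
    steps = 0
    y = x
    while y.abs() < threshold:
        y = y * 2
        steps += 1
    return y, steps

x = torch.tensor(1.5, requires_grad=True)
y, steps = repeated_double(x)
y.backward()

# The graph recorded exactly the operations that ran, so dy/dx = 2 ** steps
print(f"steps = {steps}, y = {y.item()}, dy/dx = {x.grad.item()}")
```

Because the graph is rebuilt on every forward pass, a different input can take a different path through the code and still get correct gradients.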

### When to use what?
- **NumPy**: Data preprocessing, traditional scientific computing, when you don't need gradients or GPU
- **PyTorch**: Training neural networks, operations requiring gradients, when you need GPU acceleration

In [None]:
# Converting between NumPy and PyTorch
np_array = np.array([1, 2, 3, 4, 5])
torch_tensor = torch.from_numpy(np_array)

print(f"NumPy array: {np_array}")
print(f"PyTorch tensor: {torch_tensor}")

# Convert back to NumPy
back_to_numpy = torch_tensor.numpy()
print(f"Back to NumPy: {back_to_numpy}")

# They share memory! Changing one affects the other
np_array[0] = 999
print(f"After modifying np_array, torch_tensor: {torch_tensor}")
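
If the shared memory is not what you want, one option is to copy the data when creating the tensor. The sketch below contrasts `torch.from_numpy` (shares memory) with `torch.tensor` (copies); the variable names are illustrative.

```python
import numpy as np
import torch

np_array = np.array([1, 2, 3])

shared = torch.from_numpy(np_array)  # shares memory with np_array
copied = torch.tensor(np_array)      # makes an independent copy

np_array[0] = 999

print(f"shared: {shared}")  # reflects the change
print(f"copied: {copied}")  # keeps the original values
```

Sharing avoids a copy for large arrays, while copying protects you from accidental in-place modification.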

In [None]:
# Similar API to NumPy
print("\nPyTorch has similar operations to NumPy:")
print(f"torch.zeros(3, 3):\n{torch.zeros(3, 3)}\n")
print(f"torch.ones(2, 4):\n{torch.ones(2, 4)}\n")
print(f"torch.randn(2, 3): # Normal distribution\n{torch.randn(2, 3)}\n")
print(f"torch.arange(0, 10, 2): {torch.arange(0, 10, 2)}")

## 4. Basic Tensor Operations<a id="4"></a>

PyTorch operations are very similar to NumPy, so if you know NumPy, you're already 80% there!

In [None]:
# Element-wise operations
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

print(f"x + y = {x + y}")
print(f"x * y = {x * y}")
print(f"x ** 2 = {x ** 2}")
print(f"torch.sqrt(x) = {torch.sqrt(x)}")

In [2]:
# Matrix operations
A = torch.randn(3, 4)
B = torch.randn(4, 2)

# Matrix multiplication
C = torch.matmul(A, B)  # or A @ B
print(f"Matrix multiplication shape: {A.shape} @ {B.shape} = {C.shape}")
print(f"Result:\n{C}")

Matrix multiplication shape: torch.Size([3, 4]) @ torch.Size([4, 2]) = torch.Size([3, 2])
Result:
tensor([[-1.1068,  0.7997],
        [-1.8041,  0.7852],
        [-0.5516,  0.5687]])


In [3]:
# Reshaping and indexing
x = torch.arange(12)
print(f"Original: {x}")
print(f"Reshaped to (3, 4):\n{x.reshape(3, 4)}")
print(f"Reshaped to (2, 6):\n{x.reshape(2, 6)}")

# Indexing works like NumPy
matrix = x.reshape(3, 4)
print(f"\nFirst row: {matrix[0]}")
print(f"First column: {matrix[:, 0]}")
print(f"Submatrix:\n{matrix[:2, :2]}")

Original: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
Reshaped to (3, 4):
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
Reshaped to (2, 6):
tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])

First row: tensor([0, 1, 2, 3])
First column: tensor([0, 4, 8])
Submatrix:
tensor([[0, 1],
        [4, 5]])


In [4]:
# Broadcasting - automatic expansion of dimensions
x = torch.tensor([[1, 2, 3]])
y = torch.tensor([[1], [2], [3]])

print(f"x shape: {x.shape}")
print(f"y shape: {y.shape}")
print(f"x + y (broadcasted):\n{x + y}")
print(f"Result shape: {(x + y).shape}")

x shape: torch.Size([1, 3])
y shape: torch.Size([3, 1])
x + y (broadcasted):
tensor([[2, 3, 4],
        [3, 4, 5],
        [4, 5, 6]])
Result shape: torch.Size([3, 3])
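
Broadcasting compares shapes from the trailing dimension backwards: two sizes are compatible when they are equal or when one of them is 1 (or missing). A common use is subtracting a per-feature mean from a batch, sketched below with made-up data.

```python
import torch

# Hypothetical data: 4 samples with 3 features each
X = torch.tensor([[1.0, 10.0, 100.0],
                  [2.0, 20.0, 200.0],
                  [3.0, 30.0, 300.0],
                  [4.0, 40.0, 400.0]])

col_mean = X.mean(dim=0)  # shape (3,): one mean per feature
centered = X - col_mean   # (4, 3) - (3,) broadcasts to (4, 3)

print(f"col_mean: {col_mean}")
print(f"centered shape: {centered.shape}")
print(f"centered column means: {centered.mean(dim=0)}")
```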


## 5. Autograd: Automatic Differentiation<a id="5"></a>

This is where PyTorch really shines! Remember in micrograd how you manually implemented the `backward()` method for each operation? PyTorch does this automatically.

### Key Concepts:
- `requires_grad=True`: Tells PyTorch to track operations on this tensor
- `.backward()`: Computes all gradients automatically
- `.grad`: Stores the gradient after calling `.backward()`

In [5]:
# Simple example: f(x) = x^2
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

print(f"x = {x.item()}")
print(f"y = x^2 = {y.item()}")

# Compute gradient
y.backward()

# dy/dx = 2x = 2*3 = 6
print(f"dy/dx = {x.grad.item()}")

x = 3.0
y = x^2 = 9.0
dy/dx = 6.0


In [6]:
# More complex example - just like micrograd!
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

# f(a,b) = (a + b) * (a - b)
c = a + b  # c = 5
d = a - b  # d = -1
e = c * d  # e = -5

print(f"a={a.item()}, b={b.item()}")
print(f"c = a + b = {c.item()}")
print(f"d = a - b = {d.item()}")
print(f"e = c * d = {e.item()}")

# Compute gradients
e.backward()

print(f"\nde/da = {a.grad.item()}")
print(f"de/db = {b.grad.item()}")

a=2.0, b=3.0
c = a + b = 5.0
d = a - b = -1.0
e = c * d = -5.0

de/da = 4.0
de/db = -6.0
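
These values match the calculus: since e = (a + b)(a - b) = a² - b², we expect de/da = 2a = 4 and de/db = -2b = -6. One way to sanity-check autograd without doing the algebra is a central finite-difference estimate, sketched below. The step size `eps` and the use of `float64` are choices made for this example.

```python
import torch

def f(a, b):
    return (a + b) * (a - b)

# float64 keeps the finite-difference rounding error small
a = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
b = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
f(a, b).backward()

eps = 1e-6
with torch.no_grad():  # no graph needed for the numerical estimate
    num_da = (f(a + eps, b) - f(a - eps, b)) / (2 * eps)
    num_db = (f(a, b + eps) - f(a, b - eps)) / (2 * eps)

print(f"autograd: de/da = {a.grad.item():.6f}, de/db = {b.grad.item():.6f}")
print(f"numeric:  de/da = {num_da.item():.6f}, de/db = {num_db.item():.6f}")
```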


In [7]:
# Neural network example - forward pass with autograd
# Simulating a single neuron: y = sum(w * x) + b
x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -0.5, 1.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass
y_pred = (w * x).sum() + b
y_true = torch.tensor(2.0)

# Loss (squared error for a single example)
loss = (y_pred - y_true) ** 2

print(f"Prediction: {y_pred.item():.4f}")
print(f"Loss: {loss.item():.4f}")

# Backward pass
loss.backward()

print(f"\nGradients:")
print(f"dL/dw = {w.grad}")
print(f"dL/db = {b.grad.item():.4f}")

Prediction: 2.6000
Loss: 0.3600

Gradients:
dL/dw = tensor([1.2000, 2.4000, 3.6000])
dL/db = 1.2000
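
Once the gradients are available, a single gradient-descent step can be written by hand, much like the update loop in micrograd. This is a minimal sketch using the same values as the cell above; the learning rate `lr = 0.01` is an arbitrary choice for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -0.5, 1.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(2.0)

def compute_loss():
    return ((w * x).sum() + b - y_true) ** 2

loss_before = compute_loss()
loss_before.backward()

lr = 0.01  # assumed learning rate
with torch.no_grad():  # the update itself should not be tracked by autograd
    w -= lr * w.grad
    b -= lr * b.grad

# Reset gradients so the next backward pass starts from zero
w.grad.zero_()
b.grad.zero_()

loss_after = compute_loss()
print(f"Loss before: {loss_before.item():.4f}")
print(f"Loss after:  {loss_after.item():.4f}")
```

Optimizers such as `torch.optim.SGD` wrap this update-and-zero pattern so you don't have to write it for every parameter.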


In [8]:
# Important: Gradient accumulation
# Gradients accumulate by default - you need to zero them!
x = torch.tensor(2.0, requires_grad=True)

# First computation
y = x ** 2
y.backward()
print(f"After first backward: x.grad = {x.grad.item()}")

# Second computation without zeroing
y = x ** 3
y.backward()
print(f"After second backward (accumulated): x.grad = {x.grad.item()}")

# Zero the gradients
x.grad.zero_()
print(f"After zeroing: x.grad = {x.grad.item()}")

After first backward: x.grad = 4.0
After second backward (accumulated): x.grad = 16.0
After zeroing: x.grad = 0.0


### Comparison with Micrograd

In micrograd, you did:
```python
class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
```

PyTorch tensors work the same way, but:
- More efficient (optimized C++/CUDA code)
- Handles arrays/tensors, not just scalars
- Automatic graph construction and differentiation
- GPU support

## 6. Building Neural Networks with PyTorch<a id="6"></a>

PyTorch provides the `torch.nn` module with pre-built layers and utilities. This is like your `Neuron`, `Layer`, and `MLP` classes in micrograd, but production-ready!

In [9]:
import torch.nn as nn

# Simple neural network - similar to your MLP in micrograd
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.layer2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = torch.relu(self.layer1(x))  # Hidden layer with ReLU
        x = self.layer2(x)  # Output layer
        return x

# Create model
model = SimpleMLP(input_size=10, hidden_size=20, output_size=3)
print(model)

# Test forward pass
x = torch.randn(5, 10)  # Batch of 5 samples, 10 features each
output = model(x)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {output.shape}")

SimpleMLP(
  (layer1): Linear(in_features=10, out_features=20, bias=True)
  (layer2): Linear(in_features=20, out_features=3, bias=True)
)

Input shape: torch.Size([5, 10])
Output shape: torch.Size([5, 3])
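
For a plain stack of layers like this one, `nn.Sequential` is a more compact alternative to writing a custom class. The sketch below builds the same 10 → 20 → 3 architecture; the variable name `sequential_mlp` is just for illustration.

```python
import torch
import torch.nn as nn

# Same layer sizes as SimpleMLP(10, 20, 3), expressed as a pipeline
sequential_mlp = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 3),
)

x = torch.randn(5, 10)  # batch of 5 samples, 10 features each
output = sequential_mlp(x)
print(f"Output shape: {output.shape}")
```

A custom `nn.Module` subclass is still the better fit when the forward pass needs branching, multiple inputs, or anything other than a straight chain.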


In [10]:
# Access model parameters (like your .parameters() method in micrograd)
print("Model parameters:")
for name, param in model.named_parameters():
    print(f"{name}: shape {param.shape}, requires_grad={param.requires_grad}")

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params}")

Model parameters:
layer1.weight: shape torch.Size([20, 10]), requires_grad=True
layer1.bias: shape torch.Size([20]), requires_grad=True
layer2.weight: shape torch.Size([3, 20]), requires_grad=True
layer2.bias: shape torch.Size([3]), requires_grad=True

Total parameters: 283
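
The total of 283 can be checked by hand: each `nn.Linear` layer holds a weight matrix of `in_features * out_features` entries plus one bias per output. The helper `linear_param_count` below is a hypothetical function written for this check, not part of PyTorch.

```python
import torch.nn as nn

def linear_param_count(in_features, out_features):
    # Weight matrix (out_features x in_features) plus one bias per output
    return in_features * out_features + out_features

expected = linear_param_count(10, 20) + linear_param_count(20, 3)

layers = [nn.Linear(10, 20), nn.Linear(20, 3)]
actual = sum(p.numel() for layer in layers for p in layer.parameters())

print(f"By hand: {expected}, from PyTorch: {actual}")
```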


In [12]:
# Training loop example - putting it all together
import torch.optim as optim

# Create simple dataset
X = torch.randn(100, 10)  # 100 samples
y = torch.randint(0, 3, (100,))  # 100 labels (3 classes)

# Model, loss, and optimizer
model = SimpleMLP(10, 20, 3)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.025)

# Training loop
for epoch in range(5):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    
    # Backward pass
    optimizer.zero_grad()  # Zero gradients (like x.grad.zero_())
    loss.backward()  # Compute gradients
    optimizer.step()  # Update weights
    
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

Epoch 1, Loss: 1.1178
Epoch 2, Loss: 1.1165
Epoch 3, Loss: 1.1152
Epoch 4, Loss: 1.1139
Epoch 5, Loss: 1.1127
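
The losses printed above sit close to ln(3) ≈ 1.0986, which is the cross-entropy of a uniform guess over 3 classes. That is roughly what an untrained model produces, and because the labels here are random there is little real signal to learn. A minimal check of that baseline, using all-zero logits as a stand-in for "no preference":

```python
import math

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# All-zero logits give equal probability (1/3) to each of the 3 classes
uniform_logits = torch.zeros(4, 3)
labels = torch.tensor([0, 1, 2, 0])

loss = criterion(uniform_logits, labels)
print(f"Loss for a uniform guess: {loss.item():.4f}")
print(f"ln(3):                    {math.log(3):.4f}")
```

Comparing the initial loss against ln(number of classes) is a quick sanity check that a classifier is wired up correctly.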


### Comparison with Micrograd

**Micrograd:**
```python
class MLP:
    def __init__(self, nin, nouts):
        self.layers = [Layer(nin, nouts[0])] + [Layer(nouts[i], nouts[i+1]) for i in range(len(nouts)-1)]
    
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

**PyTorch:**
```python
class MLP(nn.Module):
    def __init__(self, nin, nouts):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(nin, nouts[0])] + [nn.Linear(nouts[i], nouts[i+1]) for i in range(len(nouts)-1)])
    
    # parameters() is inherited from nn.Module!
```

Same concepts, just more powerful and efficient!
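
The PyTorch snippet above omits `forward()`. A runnable version might look like the sketch below. The ReLU between layers and the absence of an activation on the final layer are assumptions made for this example, not something fixed by micrograd or PyTorch.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, nin, nouts):
        super().__init__()
        sizes = [nin] + list(nouts)
        self.layers = nn.ModuleList(
            [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(nouts))]
        )

    def forward(self, x):
        # Assumption: ReLU on hidden layers, raw output from the last layer
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        return self.layers[-1](x)

mlp = MLP(3, [4, 4, 1])
out = mlp(torch.randn(2, 3))  # batch of 2 samples, 3 features each

print(f"Output shape: {out.shape}")
print(f"Parameter tensors: {len(list(mlp.parameters()))}")
```

Using `nn.ModuleList` rather than a plain Python list is what lets `parameters()` find the layers automatically.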

## 7. GPU Acceleration<a id="7"></a>

One of the biggest advantages of PyTorch over NumPy is GPU support. Deep learning operations are highly parallelizable, so GPUs can give a large speedup (commonly quoted as **10-100x**) on big operations, depending on the hardware and the workload.

In [13]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")

Using device: cpu
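
On Apple Silicon Macs there is no CUDA, but recent PyTorch builds expose a Metal-based `mps` backend. The sketch below shows one possible fallback order. `pick_device` is a hypothetical helper, and the `getattr` guard is there because older PyTorch versions don't have `torch.backends.mps`.

```python
import torch

def pick_device():
    # Hypothetical helper: prefer CUDA, then Apple's MPS backend, then CPU
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 2, device=device)
print(f"Using device: {device}, tensor lives on: {x.device.type}")
```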


In [None]:
# Moving tensors to GPU
x_cpu = torch.randn(1000, 1000)
print(f"Tensor on: {x_cpu.device}")

# Move to GPU (if available)
x_gpu = x_cpu.to(device)
print(f"Tensor on: {x_gpu.device}")

# Or create directly on GPU
y_gpu = torch.randn(1000, 1000, device=device)
print(f"Created directly on: {y_gpu.device}")
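
Going the other way matters too: NumPy and plain Python code can only read CPU memory, so results are usually moved back before plotting or logging. A minimal sketch that works whether or not a GPU is present:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 3, device=device)
product = x @ x

# .cpu() returns the tensor itself when it is already on the CPU
as_numpy = product.cpu().numpy()

# .item() copies a single-element tensor into a Python number
as_float = product.sum().item()

print(type(as_numpy), as_numpy.shape)
print(type(as_float))
```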

In [None]:
# Speed comparison (only if GPU is available)
if torch.cuda.is_available():
    import time
    
    size = 5000
    
    # CPU
    x_cpu = torch.randn(size, size)
    y_cpu = torch.randn(size, size)
    
    start = time.perf_counter()  # perf_counter is better suited to timing than time.time
    z_cpu = torch.matmul(x_cpu, y_cpu)
    cpu_time = time.perf_counter() - start
    
    # GPU
    x_gpu = x_cpu.to(device)
    y_gpu = y_cpu.to(device)
    
    # Warm up GPU
    _ = torch.matmul(x_gpu, y_gpu)
    torch.cuda.synchronize()
    
    start = time.perf_counter()
    z_gpu = torch.matmul(x_gpu, y_gpu)
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    gpu_time = time.perf_counter() - start
    
    print(f"CPU time: {cpu_time:.4f}s")
    print(f"GPU time: {gpu_time:.4f}s")
    print(f"Speedup: {cpu_time/gpu_time:.2f}x")
else:
    print("No GPU available for speed comparison")

In [None]:
# Moving models to GPU
model = SimpleMLP(10, 20, 3)
model = model.to(device)

# Now all inputs must be on the same device
x = torch.randn(5, 10, device=device)
output = model(x)

print(f"Model device: {next(model.parameters()).device}")
print(f"Input device: {x.device}")
print(f"Output device: {output.device}")

## 8. Summary<a id="8"></a>

### Key Takeaways:

1. **Tensors** are multi-dimensional arrays - the fundamental data structure in PyTorch

2. **Why PyTorch over NumPy?**
   - GPU acceleration (often 10-100x faster on large operations, depending on hardware)
   - Automatic differentiation (no manual backprop!)
   - Deep learning ecosystem (pre-built layers, optimizers, utilities)
   - Dynamic computation graphs (easier debugging)

3. **Autograd** - PyTorch automatically tracks operations and computes gradients
   - `requires_grad=True` to track
   - `.backward()` to compute gradients
   - `.grad` to access gradients
   - Zero gradients before each new backward pass, because they accumulate by default

4. **Building Neural Networks**
   - Inherit from `nn.Module`
   - Define layers in `__init__`
   - Implement `forward()` method
   - Use built-in layers like `nn.Linear`, `nn.Conv2d`, etc.

5. **Training Loop Pattern**
   ```python
   for epoch in range(num_epochs):
       outputs = model(inputs)          # Forward pass
       loss = criterion(outputs, labels) # Compute loss
       optimizer.zero_grad()             # Zero gradients
       loss.backward()                   # Backward pass
       optimizer.step()                  # Update weights
   ```

6. **GPU Acceleration**
   - Move tensors and models to GPU with `.to(device)`
   - Massive speedup for large operations
   - Always keep tensors on the same device

### Connection to Micrograd:

| Micrograd | PyTorch |
|-----------|----------|
| `Value` | `torch.Tensor` |
| `.data` | `.item()` (Python scalar) or `.detach()` (tensor without gradient tracking) |
| `.grad` | `.grad` |
| `.backward()` | `.backward()` |
| `Neuron/Layer/MLP` | `nn.Linear/nn.Module` |
| `.parameters()` | `.parameters()` |

You built micrograd to understand the fundamentals. Now PyTorch lets you scale that understanding to real-world deep learning!

## Next Steps

Now that you understand PyTorch basics, you're ready to move on to more complex architectures:
- **makemore (Bigrams)**: Character-level language modeling
- **makemore (MLP)**: Building deeper networks
- **GPT**: Transformers and attention mechanisms

You'll see these same PyTorch patterns everywhere:
1. Define model architecture (`nn.Module`)
2. Forward pass (compute predictions)
3. Compute loss
4. Backward pass (compute gradients)
5. Update weights (optimizer step)

Let's go!