<a href="https://colab.research.google.com/github/ansh997/100-days-of-code/blob/master/LearnPytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning PyTorch as a framework, not as an API

This is a more philosophical way to learn `PyTorch`: treat it as a framework with its own paradigm (tensor-first computation), execution model (dynamic graphs), and idioms (modules, autograd, etc.).

### What is PyTorch?
- `Paradigm`: Imperative, Tensor-Based, Dynamically Executed
- `Use-Case DNA`: Built for research first --> flexibility over performance
- `Mental Model`: Everything is a tensor; gradients are computed automatically; models are modular graphs of operations

## Make Tensors Feel like home


- [x] Create and manipulate tensors (shapes, dtypes, devices)
- [x] Perform operations: `addition`, `broadcasting`, `.view()`, `.reshape()`
- [x] Compare against NumPy to develop intuition: `torch.from_numpy()`, `tensor.numpy()`

In [None]:
# Create and manipulate tensors(shapes, dtype, devices)

import torch
import numpy as np

data = [[1, 2], [3, 4]]
x_data = torch.tensor(data)
print('Attributes of a Tensor: ', x_data.shape, x_data.dtype, x_data.device)
# (torch.Size([2, 2]), torch.int64, device(type='cpu'))

# can be created from numpy array
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
print('Attributes of a Tensor: ', x_np.shape, x_np.dtype, x_np.device)
# (torch.Size([2, 2]), torch.int64, device(type='cpu'))

# can be created from another tensor
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

# With random or constant values:
# shape is a tuple of tensor dimensions.
# In the functions below, it determines the dimensionality of the output tensor.

shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

In [None]:
# Learning about operations on tensors
tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

# We move our tensor to the current accelerator if available
# Each of these operations can be run on the CPU and
# Accelerator such as CUDA, MPS, MTIA, or XPU.
if torch.accelerator.is_available():
    print('Accelerator is available')
    tensor = tensor.to(torch.accelerator.current_accelerator())

# By default, tensors are created on the CPU.
# We need to explicitly move tensors to the accelerator using `.to` method
# (after checking for accelerator availability).

# Standard numpy like indexing and slicing
tensor = torch.ones(4, 4)
print(f"First row: {tensor[0]}")
print(f"First column: {tensor[:, 0]}")
print(f"Last column: {tensor[..., -1]}")
tensor[:,1] = 0
print(tensor)

# Arithmetic Operation
# This computes the matrix multiplication between two tensors.
# y1, y2, y3 will have the same value
# ``tensor.T`` returns the transpose of a tensor
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)  # .T works like the transpose in NumPy

y3 = torch.rand_like(y1)
torch.matmul(tensor, tensor.T, out=y3)

# This computes the element-wise product. z1, z2, z3 will have the same value
# QUE: Do the operands of an element-wise product need the same shape?
#      No: the shapes only need to be broadcast-compatible; incompatible shapes raise a RuntimeError.
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)

# Single-element tensors: if you have a one-element tensor,
# for example by aggregating all values of a tensor into one value,
# you can convert it to a Python numerical value using item()

agg = tensor.sum()
agg_item = agg.item()
print(agg_item, type(agg_item))

# In-place operations: operations that store the result into the operand are called in-place.
# They are denoted by a _ suffix. For example, x.copy_(y) and x.t_() will change x.
print(f"{tensor} \n")
tensor.add_(5)
print(tensor)
# Note: In-place operations save some memory, but can be problematic when computing derivatives because of an immediate loss of history. Hence, their use is discouraged.

In [None]:
# Perform Operations - addition, broadcasting, .view(), .reshape()
# Tensor Addition
a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])
result = a + b
print(result)

# Broadcasting enables arithmetic operations between tensors of different shapes by automatically expanding the smaller tensor to match the shape of the larger one.
a = torch.tensor([[1, 2, 3], [4, 5, 6]])  # Shape: (2, 3)
b = torch.tensor([10, 20, 30])           # Shape: (3,)
result = a + b
print(result)

# .view() vs .reshape()
# .view(): Returns a new tensor with the same data but a different shape. It requires the tensor to be contiguous in memory. If the tensor is not contiguous, you need to call .contiguous() before using .view().

# .reshape(): Similar to .view(), but it can handle non-contiguous tensors by returning a copy if necessary. It's more flexible and safer to use when you're unsure about the tensor's memory layout.

x = torch.arange(6)  # Tensor with shape (6,)
print("Original tensor:", x)

# Using view
y = x.view(2, 3)
print("Reshaped with view:", y)

# Using reshape
z = x.reshape(3, 2)
print("Reshaped with reshape:", z)
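
The difference between `.view()` and `.reshape()` only shows up on non-contiguous tensors. Here is a minimal sketch, assuming a transposed tensor as the non-contiguous example (the variable names are illustrative, not from the cell above):

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.T  # transposed view: same storage, but no longer contiguous

print(xt.is_contiguous())

view_failed = False
try:
    xt.view(6)  # .view() needs a memory layout compatible with the new shape
except RuntimeError:
    view_failed = True

flat_reshape = xt.reshape(6)         # falls back to a copy when a view is impossible
flat_view = xt.contiguous().view(6)  # explicit copy first, then a view of the copy
print(view_failed, flat_reshape.tolist(), flat_view.tolist())
```

Both working paths give the same values; `.reshape()` just hides the copy.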

| Feature                     | NumPy `ndarray`                     | PyTorch `Tensor`                  |
| --------------------------- | ----------------------------------- | --------------------------------- |
| Device support              | CPU only                            | CPU + GPU (CUDA, MPS)             |
| Auto-grad (Differentiation) | ❌ Not supported                     | ✅ Via `requires_grad`             |
| Shared memory conversion    | ✅ with `.from_numpy()` / `.numpy()` | ✅ Shared unless explicitly cloned |
| Broadcasting                | ✅                                   | ✅ Compatible                      |
| Data type strictness        | Slightly relaxed                    | Stricter (esp. floats vs ints)    |
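
The "shared memory" row is easy to check by hand. The sketch below assumes a CPU tensor (sharing does not apply to tensors on an accelerator) and uses illustrative variable names:

```python
import numpy as np
import torch

arr = np.ones(3)
t = torch.from_numpy(arr)  # shares memory with arr
arr += 1                   # in-place change on the NumPy side
print(t)                   # the tensor sees the change

back = t.numpy()           # also shares memory with t
t.mul_(2)                  # in-place change on the PyTorch side
print(back)                # the array sees the change

independent = t.clone()    # clone() breaks the link
t.add_(1)
print(independent, t)
```

Note that `torch.from_numpy` keeps NumPy's dtype, so `t` is `float64` here rather than PyTorch's default `float32`.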


## Lens 1: Paradigm - Tensors, Graphs, and Gradients

- [x] Understand dynamic Computation graph (eager mode vs static)
- [x] Explore how autograd works:
    - [x] .requires_grad_()
    - [x] .backward()
    - [x] Inspect .grad on parameters
- [x] Build a simple 2-layer neural net from scratch using only tensor operations

### Understanding dynamic computation graph in PyTorch

- One of the most important mental models
- Different from the static graphs used in other frameworks like tensorflow

> Q. What is a Computation Graph?

A. A computation graph is a directed graph where:
- Nodes are operations (like add, multiply, relu, etc.)

- Edges are tensors flowing between those operations

The graph is used to:

- Track computations

- Compute gradients during backpropagation

| Feature                  | **Dynamic (Eager)** – PyTorch            | **Static** – TensorFlow 1.x            |
| ------------------------ | ---------------------------------------- | -------------------------------------- |
| Graph defined at         | **Run-time** (during forward pass)       | **Compile-time** (before running)      |
| Flexibility              | Very high – native Python control flow   | Limited – needs explicit graph def     |
| Debugging                | Easy – just print like Python            | Harder – must inspect graph            |
| Performance optimization | Less by default, but can use TorchScript | High, since graph is optimized upfront |

🧠 **Eager Execution (PyTorch’s Default)**:

Every operation you write in PyTorch immediately executes and gets added to the autograd tape (if requires_grad=True).

| Topic             | PyTorch Default             |
| ----------------- | --------------------------- |
| Graph Type        | **Dynamic / Eager**         |
| Defined At        | Run-time                    |
| Debugging         | Natural Python              |
| Gradient Tracking | Automatic when `requires_grad=True`; gradients computed by `.backward()` |
| Static Option     | Via `torch.jit.script()`    |
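
To make "defined at run-time" concrete, here is a small sketch in which ordinary Python control flow decides which operations get recorded. The `forward` function and the input values are made up for illustration:

```python
import torch

def forward(x, w):
    h = x * w
    # A plain Python `if` picks the branch; only the ops that run are recorded.
    if h.sum() > 0:
        return (h ** 2).sum()
    return (-h).sum()

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss_pos = forward(torch.tensor([1.0, 1.0]), w)
loss_pos.backward()
grad_pos = w.grad.clone()   # gradient of sum((x*w)^2)

w.grad.zero_()
loss_neg = forward(torch.tensor([-1.0, -1.0]), w)
loss_neg.backward()
grad_neg = w.grad.clone()   # gradient of sum(-(x*w))

print(grad_pos, grad_neg)
```

The two calls build two different graphs from the same Python function, which is what a static graph cannot do without special control-flow ops.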


### Automatic Differentiation with torch.autograd

PyTorch has a built-in differentiation engine called `torch.autograd` that computes the gradients needed by the backpropagation algorithm. It supports automatic computation of gradients for any computational graph.

```python
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # tunable
b = torch.randn(3, requires_grad=True)  # tunable
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
```

![PyTorch Computation Graph](https://docs.pytorch.org/tutorials/_images/comp-graph.png)

> A function that we apply to tensors to construct the computational graph is an object of class `Function`. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor.
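
A quick way to see this is to inspect `grad_fn` and `is_leaf` directly. The sketch below rebuilds the same `x`, `w`, `b`, `z` as above so it runs on its own:

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

print(w.is_leaf, w.grad_fn)  # user-created leaf tensor: no grad_fn
print(z.is_leaf, z.grad_fn)  # produced by an operation: has a grad_fn
print(x.requires_grad, z.requires_grad)
```

`z` requires gradients even though `x` does not, because at least one of its inputs (`w`, `b`) does.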

### Computing Gradients

To optimize the parameters of the neural network, we need to compute the derivatives of our loss function with respect to those parameters; namely, we need $\frac{\partial \text{loss}}{\partial w}$ and $\frac{\partial \text{loss}}{\partial b}$ under some fixed values of `x` and `y`. To compute those derivatives, we call `loss.backward()` and then retrieve the values from `w.grad` and `b.grad`.

```python
loss.backward()
print(w.grad)
print(b.grad)
```

> We can only obtain the grad properties for the leaf nodes of the computational graph, which have `requires_grad` property set to True. For all other nodes in our graph, gradients will not be available. We can only perform gradient calculations using `backward` once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass `retain_graph=True` to the backward call.
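
The note about repeated `backward` calls can be checked with a tiny graph. This is a sketch with a made-up loss; it also shows that gradients accumulate in `.grad` across calls:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()

loss.backward(retain_graph=True)  # keep the graph alive for another pass
first = w.grad.clone()            # d(loss)/dw = 2w

loss.backward()                   # second pass: gradients are added to w.grad
second = w.grad.clone()

third_failed = False
try:
    loss.backward()               # the previous call freed the graph's buffers
except RuntimeError:
    third_failed = True

print(first, second, third_failed)
```

The accumulation seen in `second` is the reason training loops reset gradients between steps.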


### Disabling gradient tracking

There are reasons you might want to disable gradient tracking:
* To mark some parameters in your neural network as frozen parameters.

* To speed up computations when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.

We can stop tracking computations by surrounding our computation code with `torch.no_grad()` block:

```python
z = torch.matmul(x, w)+b
print(z.requires_grad)  # True

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)  # False

# we can achieve same result using `.detach()` method on the tensor
z_det = z.detach()
print(z_det.requires_grad)  # False
```
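
For the first reason in the list above (frozen parameters), one common approach is to call `.requires_grad_(False)` on the parameters to freeze. The model below is a made-up example for illustration only:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))

# Freeze the first layer: autograd stops tracking its parameters.
for p in model[0].parameters():
    p.requires_grad_(False)

loss = model(torch.randn(2, 3)).sum()
loss.backward()

frozen_grads = [p.grad for p in model[0].parameters()]
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(frozen_grads, trainable)
```

The frozen layer ends up with no gradients at all, while the last layer still trains normally.
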
Autograd in PyTorch records operations in a directed acyclic graph (DAG) as you compute values forward. In this DAG, `leaves` are the `input tensors`, `roots` are the `output tensors`. Then, when you call `.backward()`, it traces this graph in reverse (from outputs back to inputs), applying the chain rule to compute gradients.

In a `forward pass`, autograd does two things:
- runs the requested operation to compute a resulting tensor
- maintains the operation's `gradient function` in the DAG

The `backward pass` kicks off when `.backward()` is called on the root. `autograd` then
- computes the gradients from each `.grad_fn`
- accumulates them in the respective tensor's `.grad` attribute
- using the chain rule, propagates all the way to the leaf tensors

```less
      [ x ]      [ w ]
        |          |
        +----*-----+
             | (mul)
         [ x * w ]      [ b ]
             |            |
             +-----+------+
                   | (+)
              [ xw + b ]
                   |
                   | (activation, e.g. ReLU)
                   |
              [ output ŷ ]
                   |
                   | (loss function)
                   |
               [ loss L ]
```
- **Forward Pass**: moves from top to bottom (inputs → output → loss).
- **Backward Pass**: triggered by `.backward()`, moves from loss back to inputs.
- Each operation (e.g., add, multiply, relu) is a **Function node** that knows how to compute its local gradient; the tensors flowing between operations are the edges.
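
To see the chain rule at work, the scalar sketch below follows the same pipeline (multiply, add, ReLU, squared-error loss) and compares autograd's result with the derivative worked out by hand. The numbers are arbitrary illustration values:

```python
import torch

x = torch.tensor(2.0)
y = torch.tensor(5.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

z = x * w + b             # 2*3 + 1 = 7
y_hat = torch.relu(z)     # 7 (z > 0, so ReLU passes it through)
loss = (y_hat - y) ** 2   # (7 - 5)^2 = 4
loss.backward()

# Chain rule by hand:
# dL/dy_hat = 2*(y_hat - y) = 4, dy_hat/dz = 1, dz/dw = x = 2, dz/db = 1
manual_dw = 2 * (7.0 - 5.0) * 1.0 * 2.0
manual_db = 2 * (7.0 - 5.0) * 1.0 * 1.0

print(w.grad.item(), manual_dw)
print(b.grad.item(), manual_db)
```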


In [None]:
import torch

# Define Architecture
input_size = 3
hidden_size = 4
output_size = 1

# random input
x = torch.randn(1, input_size)

# Parameters(manually initialised with gradients)
W1 = torch.randn(input_size, hidden_size, requires_grad=True)
b1 = torch.randn(hidden_size, requires_grad=True)

W2 = torch.randn(hidden_size, output_size, requires_grad=True)
b2 = torch.randn(output_size, requires_grad=True)

# Define a forward pass
z1 = x @ W1 + b1       # Linear Layer 1
a1 = torch.relu(z1)    # activation

z2 = a1 @ W2 + b2      # Linear Layer 2
y_pred = z2            # Output

# Define a loss func
y_true = torch.randn(1, output_size)
loss = ((y_pred - y_true) ** 2).mean()  # MSE loss (mean reduces it to a scalar)

# backprop
loss.backward()

# manual weight update
learning_rate = 1e-2

with torch.no_grad():
    W1 -= learning_rate * W1.grad
    b1 -= learning_rate * b1.grad
    W2 -= learning_rate * W2.grad
    b2 -= learning_rate * b2.grad

    # Don’t forget to zero the gradients
    W1.grad.zero_()
    b1.grad.zero_()
    W2.grad.zero_()
    b2.grad.zero_()

## Lens 2: Abstractions - From Layers to Modules
- [x] Use `torch.nn.Module` to create a basic neural network
- [x] Compose layers using nn.Sequential and custom modules
- [x] Explore Parameters Management (.parameters(), .state_dict())

**GOAL**: Understand how to modularize models and reuse components, like LEGO bricks

In [None]:
# Use torch.nn.Module to create a basic neural network
import torch
import torch.nn as nn

class MyNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

model = MyNN(3, 4, 1)
x = torch.randn(1, 3)
y = model(x)
print(y.item())

In [None]:
# Compose with nn.Sequential

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 1)
)

# Iterate through named parameters
for name, param in model.named_parameters():
    print(name, param.shape)

print('*'*80)

for param in model.parameters():
    print(param.data)

# Check and Save state_dict()
# Print parameter dictionary
print(model.state_dict())

# Save to file
torch.save(model.state_dict(), "model.pth")

# Load later
model.load_state_dict(torch.load("model.pth"))

In [None]:
# Training a nn.Sequential
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Transform: flatten the image and convert to tensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten 28x28 -> 784
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)  # 10 output classes
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    for images, labels in train_loader:
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch [{epoch+1}/5], Loss: {loss.item():.4f}")

In [None]:
# Building a CNN using nn.Sequential
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# MNIST with Image Transform (no flattening!)
transform = transforms.Compose([
    transforms.ToTensor(),  # Keeps shape as [1, 28, 28]
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# MNIST images are 28x28 grayscale, so input channels = 1
# No flattening at the start — we preserve spatial structure for convolutions

# Define CNN Model with nn.Sequential
model = nn.Sequential(
    # Extracts 16 local features with 3x3 receptive fields.
    # Padding=1 keeps output spatial size consistent (28x28).
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    # MaxPool(2) reduces spatial dimensions to 14x14, helping with translation invariance and reducing computation.
    nn.MaxPool2d(2),

    # Doubles feature channels from 16 → 32 to capture more abstract features.
    # Second pooling reduces spatial size to 7x7, compacting information further.
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),

    # After flattening, we feed a 1D vector of size 1568 (32×7×7) to a fully connected layer.
    nn.Flatten(),
    # 128 hidden units is a reasonable capacity for MNIST (a tunable hyperparameter).
    nn.Linear(32*7*7, 128),
    nn.ReLU(),
    # Final output is 10 logits — one for each digit class.
    nn.Linear(128, 10)

)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


for epoch in range(5):
    for images, labels in train_loader:
        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch [{epoch+1}/5], Loss: {loss.item():.4f}")

#### 🧠 CNN Design Cheat Sheet: Channels vs Spatial Size

## 📐 General Pattern

| Layer Stage     | Channels       | Spatial Size (Example: MNIST) | Notes |
|------------------|----------------|-------------------------------|-------|
| Input Image      | 1              | 28×28                         | Grayscale |
| Conv Block 1     | 16             | 28×28                         | Use padding=1 |
| MaxPool          | 16             | 14×14                         | Halve spatial |
| Conv Block 2     | 32             | 14×14                         | Double channels |
| MaxPool          | 32             | 7×7                           | Halve spatial |
| Flatten → Linear | —              | 32×7×7 = 1568                 | Ready for FC |
| FC Hidden        | 128            | —                             | Good trade-off |
| Output Layer     | 10             | —                             | MNIST classes |
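
The shapes in this table can be verified by pushing a dummy batch through the layers one at a time. A minimal sketch, assuming the same architecture as the CNN above and a zero tensor standing in for one MNIST image:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.zeros(1, 1, 28, 28)  # dummy batch: (batch, channels, height, width)
shapes = []
for layer in model:
    x = layer(x)
    shapes.append((layer.__class__.__name__, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:10s} -> {shape}")
```

This trick is handy whenever the input size of the first `nn.Linear` is unclear.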

---

## 🔁 Why Double Channels and Halve Size?

> Key Idea: As we reduce spatial detail, we increase semantic richness via channels.

This pattern is often called *progressive abstraction*.
As we go deeper, features become more abstract (e.g., from edges → shapes → digits), and more channels are needed to capture the growing variety of patterns. Doubling is a common convention rather than a rule. (Can't we quadruple? We can, but a convolution's parameter count scales with input channels × output channels, so quadrupling grows the model much faster.)

Similarly, halving the spatial size reduces computation cost and allows deeper networks. It also helps with local invariance: the network needs to focus less on exact positions.

- **Double Channels** → Capture more abstract patterns (depth).
- **Halve Size** → Reduce computation & increase translational robustness.

> 🔧 You can halve until spatial dimensions become too small to retain meaning (e.g., < 3x3 is usually risky for small images).

---

## 🧮 Why 128 Hidden Units?

- It's a **hyperparameter**.
- Works well for MNIST due to:
  - Low task complexity
  - Limited overfitting risk
  - Efficient training

> For more complex datasets, you may increase this to 256, 512, or beyond.

---

## 🛠️ Tips

- Use `Conv2d` padding that preserves spatial size (e.g., `padding=1` with a 3×3 kernel) so that pooling alone handles downsampling.
- Double filters → `16 → 32 → 64`, halve size → `28 → 14 → 7` is a classic design.
- Use `Dropout` or `BatchNorm` between layers to improve generalization.



## Lens 3: Execution & Training Loop - The Real Engine
- [ ] Write a manual training loop using:
    - [ ] `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`
    - [ ] DataLoader, Dataset abstractions
- [ ] Track Model loss and accuracy across epochs    
- [ ] **Use GPU**: `model.to(device)`, `tensor.to(device)`

**GOAL**: Know what happens per epoch, batch, and step.
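
Before using a ready-made dataset, it can help to see the `Dataset` / `DataLoader` contract on a toy example. `SquaresDataset` below is a hypothetical dataset invented for this sketch; only `__len__` and `__getitem__` are required for a map-style dataset:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Toy dataset: inputs 0..n-1, targets are their squares."""

    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.y = self.x ** 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)

# One pass over the loader is one epoch; each iteration yields one batch.
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(len(loader), batch_shapes)
```

Ten samples with `batch_size=4` give three batches per epoch, the last one smaller than the rest.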

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# 📦 Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'{device = }')

# 📊 Dataset and DataLoader
transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# 🧠 Model
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),  # 28 → 14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),  # 14 → 7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
).to(device)

# 📉 Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
num_epochs = 20

train_loss_hist = []
train_acc_hist = []

for epoch in range(num_epochs):
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_loader:
        # 🧠 Move to device (CPU/GPU)
        images, labels = images.to(device), labels.to(device)

        # 🔁 Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # 🔄 Backpropagation
        optimizer.zero_grad()         # 🔄 Reset gradients
        loss.backward()               # 🧮 Compute gradients
        optimizer.step()              # ⬆️ Update weights

        # 📊 Stats
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

    # 📣 End of epoch reporting
    epoch_loss = running_loss / len(train_loader)
    epoch_acc = correct / total

    train_loss_hist.append(epoch_loss)
    train_acc_hist.append(epoch_acc)

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}")

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_loss_hist, label="Loss")
plt.xlabel("Epoch"); plt.ylabel("Loss")
plt.title("Training Loss")

plt.subplot(1, 2, 2)
plt.plot(train_acc_hist, label="Accuracy")
plt.xlabel("Epoch"); plt.ylabel("Accuracy")
plt.title("Training Accuracy")
plt.show()

In [None]:
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = outputs.max(1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

print(f"Test Accuracy: {correct / total:.4f}")

## Lens 4: Philosophy and Idioms - What's considered **PyTorchic**?

- [x] Use official PyTorch Tutorials.
- [x] Study Idioms:
    - [x] Model Initialization and reuse
    - [x] `TorchVision` transforms and datasets
    - [x] `TorchScript` for exporting models
- [x] Read 1 open-source PyTorch repo

**GOAL**: PyTorch is designed to look like Python — if something feels unnatural, you're probably doing it wrong.

| Concept                               | What to Do                                                                                         | PyTorchic Tips                                                        |
| ------------------------------------- | -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **Model Initialization & Reuse**      | Create model classes with `__init__()` and `forward()` and use `model.load_state_dict()` to reuse. | Keep architecture code separate from training logic.                  |
| **TorchVision Datasets & Transforms** | Use `torchvision.transforms.Compose` and `torchvision.datasets` like `MNIST`, `CIFAR10`.           | Always visualize your transformed data.                               |
| **Exporting with TorchScript**        | Use `torch.jit.trace()` or `torch.jit.script()` to export trained models.                          | Choose `trace` for standard forward paths, `script` for control flow. |
| **Saving/Loading Models**             | Use `torch.save(model.state_dict(), 'model.pt')` and `load_state_dict()`                           | Save checkpoints with metadata (epoch, loss, etc.).                   |
| **Read Open Source Code**             | Start with [PyTorch examples](https://github.com/pytorch/examples) repo, e.g. `mnist/main.py`      | Observe how they structure training, validation, config, and logging. |
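
The "checkpoints with metadata" tip can look like the sketch below. The dictionary keys, the epoch and loss values, and the temporary file path are all illustrative choices, not a fixed PyTorch convention:

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = optim.Adam(model.parameters(), lr=0.001)

checkpoint = {
    "epoch": 5,        # hypothetical training progress
    "loss": 0.1234,    # hypothetical last recorded loss
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(checkpoint, path)

# Later: rebuild the same architecture, then restore weights and metadata.
restored = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
loaded = torch.load(path)
restored.load_state_dict(loaded["model_state_dict"])
start_epoch = loaded["epoch"] + 1
print(start_epoch)
```

Saving the optimizer state as well lets training resume with the same moment estimates instead of starting Adam from scratch.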


# 🧠 PyTorch Best Practices — For the Thoughtful Human

A reference guide of often-overlooked but essential PyTorch coding best practices. These are not enterprise-level SDLC rules, but habits that make your AI code more readable, robust, and human-friendly.

---

## 1. 📦 Device Handling Early

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
```

✅ Centralize `.to(device)` calls. Apply it once to the model and input batches.

---

## 2. 🧱 Separate Model, Training, and Evaluation

```python
model = MyModel()
train(model, train_loader)
evaluate(model, test_loader)
```

✅ Keep your functions clean. Avoid putting training logic inside your model class.

---

## 3. 🔁 Use `.train()` and `.eval()` Modes Explicitly

```python
model.train()  # Training mode: dropout is active, batchnorm uses batch statistics
model.eval()   # Eval mode: dropout is off, batchnorm uses its running statistics
```

✅ Critical for correct behavior, especially during inference.
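
A small sketch of what the mode switch changes, using a standalone `nn.Dropout` layer on a tensor of ones (an illustrative setup, not part of any model above):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
out_train = drop(x)  # elements are zeroed at random; survivors are scaled by 1/(1-p) = 2

drop.eval()
out_eval = drop(x)   # dropout is the identity in eval mode

print(out_train.unique(), torch.equal(out_eval, x))
```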

---

## 4. 🧼 Always Call `zero_grad()` Before Backward

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

✅ Avoids gradient accumulation unless you're doing it intentionally.

---

## 5. ❌ Avoid `requires_grad=True` on Input Data

```python
x = x.to(device)
```

✅ In ordinary training, only model parameters need gradients, not your data.

---

## 6. 🚫 Use `with torch.no_grad()` for Inference

```python
with torch.no_grad():
    output = model(x)
```

✅ Saves memory and speeds up forward pass during eval/inference.

---

## 7. 📊 Track Loss and Accuracy Across Epochs

```python
train_loss.append(loss.item())
```

✅ Simple tracking avoids "training blind." Don't optimize prematurely.

---

## 8. 💾 Save Only `state_dict()` for Portability

```python
torch.save(model.state_dict(), 'model.pth')
```

✅ More portable than pickling the entire model object, which ties the file to the exact class definitions and module paths.

---

## 9. 🧱 Prefer `nn.Sequential` for Simple Architectures

```python
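model = nn.Sequential(
    nn.Conv2d(1, 32, 3, 1),  # assuming 1x28x28 inputs: -> 32x26x26 (no padding)
    nn.ReLU(),
    nn.MaxPool2d(2),         # 32x26x26 -> 32x13x13
    nn.Flatten(),
    nn.Linear(5408, 10)      # 32 * 13 * 13 = 5408
)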
```

✅ Clean and modular. Ideal for feedforward-like stacks.

---

## 10. ❗ Avoid Magic Numbers — Use Named Variables

```python
hidden_dim = 128
nn.Linear(hidden_dim, hidden_dim // 2)
```

✅ Makes your model easier to reason about and tune.

---

## ❤️ Bonus — PyTorchic Mindset

- **Write for the reader, not the compiler.**
- **If it feels clunky, you're probably not using the idiom.**
- **Make things explicit. Be boring in a good way.**
- **Revisit tutorials not to learn APIs, but to learn writing styles.**

---

> Use this document as your compass when things feel messy. PyTorch is designed to make the code look like the idea.

# 💡 Mindset Shifts

- Think in batches (not single examples)

- Track the flow: tensor → operation → loss → gradient → update

- Don't memorize APIs — learn tensor transformations