# **Neural Architecture Search (NAS)**
**Neural Architecture Search (NAS)** is a subfield of **AutoML (Automated Machine Learning)** that aims to **automatically design neural network architectures** rather than hand-crafting them.

The idea is to let an algorithm *search* through a space of possible network designs to find one that performs best for a given task (e.g., image classification, detection, or segmentation).

---

## 1. Motivation

Traditionally, researchers manually design architectures like **ResNet**, **DenseNet**, or **Transformer** based on intuition and trial-and-error.
However, there are many hyperparameters and design decisions:

* Number of layers
* Kernel sizes
* Skip connections
* Width/depth of the network
* Type of blocks (conv, attention, etc.)

This manual process is **time-consuming** and often **sub-optimal**.
NAS automates this design process.

---

## 2. General Workflow of NAS

The process can be broken down into **three major components**:

| Component               | Role                                                                                       |
| ----------------------- | ------------------------------------------------------------------------------------------ |
| **Search Space**        | Defines *what* architectures can be explored (e.g., types of layers, connections, etc.)    |
| **Search Strategy**     | Defines *how* architectures are sampled and improved (e.g., RL, evolution, gradient-based) |
| **Evaluation Strategy** | Defines *how* to measure the performance (e.g., train model fully, or use proxy training)  |

---

### (a) **Search Space**

This specifies the **building blocks** and **rules** for generating architectures.

Example:

* Convolution types: 3×3, 5×5, depthwise conv, etc.
* Skip connections allowed or not.
* Number of channels, etc.
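A search space like the one listed above can be written down as plain data. The sketch below is only illustrative: the decision names and choice lists in `SEARCH_SPACE` are assumptions, not taken from any particular NAS paper. It shows how quickly the number of candidate architectures grows and how a single candidate can be sampled.

```python
import math
import random

# Hypothetical toy search space: each key is one design decision,
# each value lists the allowed choices for that decision.
SEARCH_SPACE = {
    "conv_type": ["conv3x3", "conv5x5", "depthwise3x3"],
    "use_skip": [True, False],
    "channels": [16, 32, 64],
    "num_layers": [4, 8, 12],
}

def space_size(space):
    """Number of distinct architectures (product of choices per decision)."""
    return math.prod(len(choices) for choices in space.values())

def sample_architecture(space, rng):
    """Draw one architecture by picking each decision uniformly at random."""
    return {name: rng.choice(choices) for name, choices in space.items()}

rng = random.Random(0)
arch = sample_architecture(SEARCH_SPACE, rng)
print(space_size(SEARCH_SPACE))  # 3 * 2 * 3 * 3 = 54
print(arch)
```

Even this tiny space has 54 members; realistic cell-based spaces are many orders of magnitude larger, which is why the search strategy matters.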

Formally, NAS tries to find:
$$
\arg\min_{A \in \mathcal{A}} \; \mathcal{L}_{val}(w^*(A), A)
$$

where

$$
w^*(A) = \arg\min_{w} \; \mathcal{L}_{train}(w, A)
$$

Here:

* $A$ = architecture from search space $\mathcal{A}$
* $w$ = its weights
* $\mathcal{L}_{train}$ = training loss
* $\mathcal{L}_{val}$ = validation loss

So NAS is a **bilevel optimization** problem: the outer level searches over architectures $A$, and the inner level trains the weights $w$ of each candidate.
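To make the two nested `argmin` problems concrete, here is a deliberately tiny sketch in which the "architecture" $A$ is just the degree of a polynomial and the "weights" $w$ are its coefficients. The data, the split, and the search space are assumptions chosen for illustration; real NAS replaces the least-squares fit with neural network training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy cubic, split into training and validation sets.
x = rng.uniform(-1.0, 1.0, size=60)
y = x**3 - 0.5 * x + rng.normal(scale=0.05, size=x.shape)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def fit_weights(degree):
    """Inner problem: w*(A) = argmin_w L_train(w, A), solved by least squares."""
    return np.polyfit(x_train, y_train, degree)

def val_loss(weights):
    """Outer objective: L_val(w*(A), A), measured as mean squared error."""
    return float(np.mean((np.polyval(weights, x_val) - y_val) ** 2))

# "Search space": the architecture A is the polynomial degree.
search_space = [1, 2, 3, 4, 5, 6]
losses = {A: val_loss(fit_weights(A)) for A in search_space}
best_A = min(losses, key=losses.get)
print(best_A, losses[best_A])
```

The selection uses validation loss rather than training loss, since training loss would always favor the most flexible candidate.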

---

### (b) **Search Strategy**

How to explore the search space efficiently.

#### Common Strategies:

1. **Reinforcement Learning (RL) based NAS**

   * A controller (e.g., RNN) generates architectures.
   * Reward = validation accuracy.
   * Example: the original RL-based NAS (Zoph & Le, 2017) and its successor **NASNet** (Zoph et al., 2018).

2. **Evolutionary Algorithms**

   * Start with random architectures.
   * Mutate and recombine top performers.
   * Example: **AmoebaNet** (Real et al., 2019).

3. **Gradient-Based (Differentiable NAS)**

   * Represent architecture choices as continuous parameters.
   * Optimize with gradient descent.
   * Example: **DARTS (Differentiable Architecture Search)**.
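As one concrete instance of these strategies, the following is a minimal sketch of an evolutionary search loop. It uses mutation only (no recombination) with tournament selection and removal of the oldest member, loosely in the spirit of aging evolution. The `CHOICES` space and the `fitness` function are hypothetical; in a real search, `fitness` would train the candidate and return its validation accuracy.

```python
import random
from collections import deque

CHOICES = {
    "kernel": [3, 5, 7],
    "width": [16, 32, 64],
    "depth": [2, 4, 6, 8],
}

def random_arch(rng):
    return {k: rng.choice(v) for k, v in CHOICES.items()}

def mutate(arch, rng):
    """Copy the parent and change exactly one decision to a different value."""
    child = dict(arch)
    key = rng.choice(list(CHOICES))
    child[key] = rng.choice([v for v in CHOICES[key] if v != arch[key]])
    return child

def fitness(arch):
    """Hypothetical stand-in for validation accuracy (no training happens here)."""
    return (-abs(arch["kernel"] - 5)
            - abs(arch["width"] - 32) / 16
            - abs(arch["depth"] - 6) / 2)

def evolve(rng, population_size=10, sample_size=3, cycles=50):
    population = deque(random_arch(rng) for _ in range(population_size))
    best = max(population, key=fitness)
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample becomes the parent.
        parent = max(rng.sample(list(population), sample_size), key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # drop the oldest member to keep the size fixed
        best = max(best, child, key=fitness)
    return best

best = evolve(random.Random(0))
print(best, fitness(best))
```

An RL-based search would replace `mutate` and the tournament with a controller that proposes architectures and is updated from the same reward signal.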

---

### (c) **Evaluation Strategy**

Training each candidate model fully is **expensive**.
So evaluation uses approximations:

* **Early stopping:** Train a few epochs only.
* **Weight sharing:** Train a “supernet” that contains all sub-architectures (e.g., **ENAS**).
* **Performance prediction:** Train a surrogate model that predicts an architecture's performance from its description, without training the architecture fully.
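The early-stopping idea can be sketched as a function that trains each candidate for a small, fixed budget and reports validation accuracy as a proxy score. Everything below is an assumption for illustration: the synthetic data, the two candidate models, and the helper name `proxy_evaluate` are not part of any NAS library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def proxy_evaluate(model, x_train, y_train, x_val, y_val, steps=20, lr=0.1):
    """Train for a small, fixed number of steps and return validation accuracy."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x_train), y_train).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()
    return acc

torch.manual_seed(0)
# Synthetic two-class data (illustration only).
x = torch.randn(200, 10)
y = (x[:, 0] > 0).long()
x_train, y_train, x_val, y_val = x[:150], y[:150], x[150:], y[150:]

# Two hypothetical candidates that differ only in hidden width.
candidates = {
    "narrow": nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 2)),
    "wide": nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2)),
}
scores = {name: proxy_evaluate(m, x_train, y_train, x_val, y_val)
          for name, m in candidates.items()}
print(scores)
```

The proxy is only useful if it ranks candidates roughly the same way full training would, which is not guaranteed and should be checked on a few candidates.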

---

## 3. Key NAS Variants

| Method                              | Type                            | Example Paper                    |
| ----------------------------------- | ------------------------------- | -------------------------------- |
| **RL-based**                        | Discrete search                 | NASNet (2017)                    |
| **Evolutionary**                    | Discrete search                 | AmoebaNet (2019)                 |
| **Differentiable (Gradient-based)** | Continuous relaxation           | DARTS (2018)                     |
| **One-shot NAS**                    | Weight sharing                  | ENAS (2018), ProxylessNAS (2019) |
| **Hardware-aware NAS**              | Adds latency/energy constraints | FBNet, MnasNet                   |

---

## 4. DARTS Example (Differentiable NAS)

In DARTS, instead of picking a single operation (like 3×3 conv or skip), we use a **weighted sum** of all possible ops:

$$
\tilde{o}(x) = \sum_{i} \frac{e^{\alpha_i}}{\sum_j e^{\alpha_j}} \, o_i(x)
$$

* $\alpha_i$ are learnable architecture parameters; the softmax turns them into mixture weights that sum to 1
* After training, the operation with the highest $\alpha_i$ is chosen for the final architecture.

This makes NAS **differentiable**, so we can train both network weights and architecture weights via backpropagation.
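A small numeric sketch of the two steps described above: turning α into softmax mixture weights during the search, and picking the op with the largest α afterwards. The α values are made up for illustration and do not come from a real search.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

ops = ["conv3x3", "conv5x5", "skip"]
alpha = [1.0, 0.5, -0.5]  # illustrative values, not from a real search

# During search: every op contributes, weighted by softmax(alpha).
weights = softmax(alpha)

# After search: keep only the op with the largest alpha.
chosen = ops[max(range(len(ops)), key=lambda i: alpha[i])]

for name, w in zip(ops, weights):
    print(f"{name}: {w:.3f}")
print("chosen:", chosen)
```

Because softmax is monotonic, taking the argmax of α gives the same op as taking the argmax of the mixture weights.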

---

## 5. Benefits and Challenges

✅ **Pros**

* Automates architecture design
* Can outperform manually designed models
* Can adapt to specific tasks and hardware constraints

❌ **Cons**

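* Can be very computationally expensive (early RL- and evolution-based searches required thousands of GPU-hours or more; weight-sharing and gradient-based methods are much cheaper)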
* Search may overfit to validation data
* Interpretability of found architectures is limited

---

## 6. Modern Trends

* **Efficiency**: Using weight sharing (ENAS, Once-for-All Networks)
* **Task-specific NAS**: For segmentation, detection, NLP
* **Multi-objective NAS**: Optimize accuracy + latency + energy
* **Zero-cost NAS**: Predict performance without training at all
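For the multi-objective case, one common approach is to fold accuracy and latency into a single scalar reward. The sketch below uses the form `accuracy * (latency / target) ** w`, which follows the soft-constraint reward described for MnasNet (with `w = -0.07`) as far as I recall it; treat the exact form and constant as assumptions. The candidate names and numbers are hypothetical.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Accuracy scaled by a soft latency penalty: acc * (latency / target) ** w."""
    return accuracy * (latency_ms / target_ms) ** w

# Hypothetical candidates: (name, validation accuracy, measured latency in ms).
candidates = [
    ("small", 0.72, 40.0),
    ("medium", 0.75, 80.0),
    ("large", 0.76, 160.0),
]
target = 80.0

ranked = sorted(
    candidates,
    key=lambda c: latency_aware_reward(c[1], c[2], target),
    reverse=True,
)
for name, acc, lat in ranked:
    print(name, round(latency_aware_reward(acc, lat, target), 4))
```

With a negative exponent, models slower than the target are penalized and faster ones get a small bonus, so the most accurate model is not automatically the winner.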

---

## 7. Example in Practice

1. Define a cell (micro-architecture):

   * Combination of convolutions, pooling, skip connections.
2. Use NAS algorithm (e.g., DARTS) to find the best combination.
3. Stack the discovered cell into a full network.
4. Train the final model from scratch.

Example architectures discovered by NAS:

* **NASNet**, **AmoebaNet**, **MnasNet**, **FBNet**, **EfficientNet**, **Once-for-All (OFA)**.

---


## **Minimal PyTorch Differentiable NAS** 


Below is a **minimal differentiable NAS (DARTS-style)** example in PyTorch that shows how both **network weights** and **architecture parameters** are optimized in alternating steps.

We’ll design a small “search cell” that chooses between a few candidate operations (e.g., conv3×3, conv5×5, and skip-connection), all weighted by learnable coefficients **α**.

---

## 1. Core Idea

For each edge (input → output), instead of picking one operation manually, we take a **soft mixture** of all candidate operations:

$$
\tilde{o}(x) = \sum_{i} \frac{e^{\alpha_i}}{\sum_j e^{\alpha_j}} \, o_i(x)
$$

So the network learns both:

* **Weights (W)** — as usual (Conv filters, etc.)
* **Architecture parameters (α)** — which operation to prefer

---

## 2. Minimal Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ----- Candidate operations -----
class Conv3x3(nn.Module):
    def __init__(self, C_in, C_out):
        super().__init__()
        self.op = nn.Conv2d(C_in, C_out, 3, padding=1)
    def forward(self, x):
        return F.relu(self.op(x))

class Conv5x5(nn.Module):
    def __init__(self, C_in, C_out):
        super().__init__()
        self.op = nn.Conv2d(C_in, C_out, 5, padding=2)
    def forward(self, x):
        return F.relu(self.op(x))

class SkipConnect(nn.Module):
    def forward(self, x):
        return x

# ----- Mixed operation (weighted sum of candidates) -----
class MixedOp(nn.Module):
    def __init__(self, C_in, C_out):
        super().__init__()
        self.ops = nn.ModuleList([
            Conv3x3(C_in, C_out),
            Conv5x5(C_in, C_out),
            SkipConnect()
        ])
        # α parameters (one per op) — initialized to zero
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# ----- Small NAS network -----
class NASNet(nn.Module):
    def __init__(self, C=16, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, C, 3, padding=1)
        self.mixed = MixedOp(C, C)
        self.fc = nn.Linear(C*8*8, num_classes)

    def forward(self, x):
        x = F.relu(self.stem(x))
        x = F.avg_pool2d(self.mixed(x), 4)
        x = x.view(x.size(0), -1)
        return self.fc(x)

# ----- Simple toy training -----
device = "cuda" if torch.cuda.is_available() else "cpu"
net = NASNet().to(device)

# Separate parameters: weights vs architecture
arch_params = [p for n, p in net.named_parameters() if 'alpha' in n]
weight_params = [p for n, p in net.named_parameters() if 'alpha' not in n]

opt_weights = torch.optim.SGD(weight_params, lr=0.01)
opt_alpha = torch.optim.Adam(arch_params, lr=0.003)

# Fake data
x = torch.randn(8, 3, 32, 32).to(device)
y = torch.randint(0, 10, (8,)).to(device)

for epoch in range(5):
    # --- Step 1: Update weights (train loss) ---
    net.train()
    opt_weights.zero_grad()
    preds = net(x)
    loss = F.cross_entropy(preds, y)
    loss.backward()
    opt_weights.step()

    # --- Step 2: Update architecture (validation loss) ---
    # (This toy example reuses the training batch; a real search would use held-out validation data.)
    opt_alpha.zero_grad()
    preds_val = net(x)
    val_loss = F.cross_entropy(preds_val, y)
    val_loss.backward()
    opt_alpha.step()

    print(f"Epoch {epoch+1} | train_loss={loss.item():.3f} | val_loss={val_loss.item():.3f}")
    print("α:", net.mixed.alpha.data.tolist())
```

---

## 3. What Happens Here

* The **`MixedOp`** contains 3 candidate operations:

  * 3×3 convolution
  * 5×5 convolution
  * Skip connection

* The architecture parameters **α = [α₁, α₂, α₃]** are trained via gradient descent.

* The **softmax(α)** gives the mixture weights for each operation.

* During training, **network weights (W)** and **architecture parameters (α)** are updated alternately:

  1. Update **W** using training loss.
  2. Update **α** using validation loss (this alternation corresponds to the first-order approximation of DARTS' bilevel optimization).

After training, the operation with the **highest α** becomes the chosen one for the final architecture.

---

## 4. Typical Workflow in Real NAS

| Step                       | Description                                                     |
| -------------------------- | --------------------------------------------------------------- |
| **1. Search phase**        | Train the supernet with both W and α parameters.                |
| **2. Derive architecture** | Pick top operations (argmax α).                                 |
| **3. Retrain phase**       | Train the derived architecture from scratch for final accuracy. |
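The "derive architecture" step can be sketched as follows. The helper names `derive_op` and `build_op` are hypothetical, and the α values stand in for the result of a finished search; the op list mirrors the three candidates used in the toy example.

```python
import torch
import torch.nn as nn

OP_NAMES = ["conv3x3", "conv5x5", "skip"]  # same order as the candidate list

def derive_op(alpha, op_names):
    """Return the name of the op with the largest architecture parameter."""
    return op_names[int(torch.argmax(alpha))]

def build_op(name, channels):
    """Instantiate the single chosen op (no mixture) for the retrain phase."""
    if name == "conv3x3":
        return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
    if name == "conv5x5":
        return nn.Sequential(nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU())
    return nn.Identity()

# Illustrative α values standing in for the result of a finished search.
alpha = torch.tensor([0.4, 0.1, -0.5])
chosen = derive_op(alpha, OP_NAMES)

# The derived op has fresh weights; it is then trained from scratch.
op = build_op(chosen, channels=16)
out = op(torch.randn(2, 16, 8, 8))
print(chosen, tuple(out.shape))
```

The derived network discards the supernet weights, which is why the retrain phase starts from a fresh initialization.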

---

## 5. Expected Output Example

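```
Epoch 1 | train_loss=<float> | val_loss=<float>
α: [<α₁>, <α₂>, <α₃>]
...
Epoch 5 | train_loss=<float> | val_loss=<float>
α: [<α₁>, <α₂>, <α₃>]
```

The script prints one such pair of lines per epoch, and the exact numbers depend on the random initialization. Because the data is random and the run takes only 5 steps, expect the losses to start near $\ln 10 \approx 2.30$ (chance level for 10 classes) and each α to move by only about the Adam learning rate (0.003) per step, so α stays close to zero. A clear preference for one op (for example α₁ → Conv3x3) only emerges after a longer search on real data.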

---

Would you like me to extend this toy example to **two cells connected in sequence** (so you can see how DARTS builds a graph of multiple “mixed” edges like in real NAS)?
