## Hyperparameters of a Typical Neural Network


### What is a hyperparameter? (one line)

Hyperparameters are configuration values set before training, rather than learned from the data, that govern the training process and the network architecture.

### 1. Training hyperparameters (most important)

These directly affect learning dynamics.

#### 1. Learning rate (η)

Step size for weight updates

Too high → divergence

Too low → slow learning

Typical values: 0.1, 0.01, 0.001
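
A minimal sketch of how the step size changes behaviour, assuming a toy loss f(w) = w² whose gradient is 2w. The function name `gradient_descent`, the starting point and the three rates are illustrative choices, not taken from any library.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the final w."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # update rule: w <- w - lr * grad
    return w

# On this toy loss, each step multiplies w by (1 - 2 * lr).
for lr in (1.1, 0.1, 0.001):
    print(f"lr={lr}: final |w| = {abs(gradient_descent(lr)):.4g}")
```

On this particular loss, 1.1 makes |w| grow at every step, 0.1 shrinks it quickly, and 0.001 barely moves it in 20 steps. The thresholds are specific to the toy problem.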

#### 2. Batch size

Number of samples per gradient update

Common values: 16, 32, 64, 128

Affects speed, memory, and generalization
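
As a rough illustration of "samples per gradient update", the sketch below slices a dataset into mini-batches and counts the updates in one epoch. The helper names `iter_batches` and `updates_per_epoch` are hypothetical.

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def updates_per_epoch(num_samples, batch_size):
    """Number of gradient updates in one full pass over the data."""
    return math.ceil(num_samples / batch_size)

data = list(range(1000))
for bs in (16, 32, 64, 128):
    print(bs, updates_per_epoch(len(data), bs))
```

Smaller batches mean more (and noisier) updates per epoch; larger batches mean fewer updates and more memory per step.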

#### 3. Number of epochs

Number of full passes over the dataset

Too few → underfitting

Too many → overfitting

#### 4. Optimizer

Algorithm for updating weights

Examples:

SGD

SGD + Momentum

Adam (most common)

RMSprop

#### 5. Momentum (if used)

Controls how much past gradients influence current update

Typical: 0.9
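
One common way to write the momentum update is sketched below, assuming a single scalar weight. The function name `sgd_momentum_step` is hypothetical, and frameworks differ slightly in the exact form they use.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single weight.

    The velocity is a decaying sum of past gradients, so earlier
    gradients keep influencing the current step.
    """
    velocity = momentum * velocity - lr * grad
    w = w + velocity
    return w, velocity

# Toy usage on f(w) = w**2 (gradient 2w), an illustrative assumption.
w, v = 1.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w, lr=0.1)
    print(round(w, 4), round(v, 4))
```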

### 2. Architecture hyperparameters

These define the structure of the network.

#### 6. Number of layers (depth)

How many hidden layers

More layers → more expressive power

Risk of overfitting if excessive

#### 7. Number of neurons per layer (width)

Controls model capacity

Example: 128 → 64 → 32
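
To make "capacity" concrete, the sketch below counts the trainable parameters of a fully connected stack with the widths from the example. The input size of 784 is an assumption added for illustration, and `count_dense_params` is a hypothetical helper.

```python
def count_dense_params(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix + bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Assumed input of 784 features feeding hidden layers 128 -> 64 -> 32.
print(count_dense_params([784, 128, 64, 32]))
```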

#### 8. Activation function

Adds non-linearity

Common:

ReLU (hidden layers)

Sigmoid / Softmax (output layer)
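
The three activations named above can be sketched in plain Python as follows. This is an illustrative version on scalars and lists, not a library implementation.

```python
import math

def relu(x):
    """max(0, x): passes positives through, zeroes out negatives."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes x into (0, 1); written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(logits):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), sigmoid(0.0), softmax([1.0, 2.0, 3.0]))
```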

#### 9. Weight initialization

How initial weights are set

Examples:

Xavier (for Tanh)

He initialization (for ReLU)
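
A sketch of the two schemes, assuming the commonly quoted forms: Xavier (Glorot) uniform with limit sqrt(6 / (fan_in + fan_out)) and He normal with standard deviation sqrt(2 / fan_in). The function names and the list-of-lists weight layout are illustrative.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Uniform in [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """Gaussian with mean 0 and std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)              # fixed seed for reproducibility
w_tanh = xavier_uniform(64, 32, rng)
w_relu = he_normal(64, 32, rng)
print(len(w_tanh), len(w_tanh[0]), len(w_relu), len(w_relu[0]))
```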

### 3. Regularization hyperparameters

Used to prevent overfitting.

#### 10. L1 / L2 regularization (L2 is often called weight decay)

Penalizes large weights

Common L2 values: 0.0001, 0.001

#### 11. Dropout rate

Fraction of neurons randomly dropped during training

Typical: 0.2–0.5

#### 12. Early stopping patience

Stops training when validation loss stops improving

Patience value: 5–10 epochs

### 4. Data-related hyperparameters
#### 13. Train–validation split

Example: 80/20, 70/30
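
A minimal sketch of a shuffled split, assuming the data fits in a Python list. `train_val_split` and its `val_fraction` and `seed` parameters are hypothetical names.

```python
import random

def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle indices once, then cut into disjoint train/validation sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)     # seeded for reproducibility
    n_val = int(round(len(samples) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [samples[i] for i in train_idx]
    val = [samples[i] for i in val_idx]
    return train, val

train, val = train_val_split(list(range(100)), val_fraction=0.2)
print(len(train), len(val))
```

Because both sets come from one shuffled index list, no sample can land in both.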

#### 14. Data augmentation settings (if used)

Image flips, rotations, noise, etc.

### Summary table (quick revision)

| Category       | Hyperparameters                              |
| -------------- | -------------------------------------------- |
| Training       | Learning rate, Batch size, Epochs, Optimizer |
| Architecture   | Layers, Neurons, Activation, Initialization  |
| Regularization | Dropout, L1/L2, Early stopping               |
| Data           | Split ratio, Augmentation                    |

### What is NOT a hyperparameter (common confusion)

Weights

Biases

Gradients (computed during training, not set in advance)

Embeddings (they are learned parameters)

### Typical default setup (real-world)

For a beginner or standard DL project:

Optimizer: Adam

Learning rate: 0.001

Batch size: 32 or 64

Activation (hidden): ReLU

Epochs: 20–50

Dropout: 0.3
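
These defaults could be collected in one place, for example as a dataclass. `TrainConfig` and its field names are hypothetical; the epoch count of 30 is just one value picked from the 20–50 range above.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Starting-point hyperparameters; tune them per project."""
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 32           # 64 is the other common choice
    hidden_activation: str = "relu"
    epochs: int = 30               # assumed value inside the 20-50 range
    dropout: float = 0.3

default = TrainConfig()
larger_batch = TrainConfig(batch_size=64)
print(default)
print(larger_batch.batch_size)
```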

## Methods to Overcome Overfitting in a Neural Network

| Method           | Sub-types                               |
| ---------------- | --------------------------------------- |
| Regularization   | L1, L2, Elastic Net                     |
| Dropout          | Standard, Spatial, DropConnect          |
| Training Control | Early stopping                          |
| Data-based       | Augmentation, Noise injection           |
| Architecture     | Model simplification, Parameter sharing |
| Normalization    | Batch Normalization                     |
| Ensemble         | Bagging, Averaging, Snapshot            |
| Optimization     | LR scheduling, Proper splits            |


## 1. Regularization Methods

Regularization adds constraints or penalties to prevent the model from becoming too complex.

### 1.1 L1 Regularization (Lasso)

Adds the sum of the absolute values of the weights to the loss function:

$$\text{Loss} = \text{Original Loss} + \lambda \sum_i |w_i|$$

Encourages sparse weights (many weights become exactly 0)

Performs implicit feature selection

Useful when many input features are irrelevant

Effect: Simpler model, reduced overfitting

### 1.2 L2 Regularization (Ridge / Weight Decay)

Adds the sum of the squared weights to the loss function:

$$\text{Loss} = \text{Original Loss} + \lambda \sum_i w_i^2$$

Penalizes large weights but does not make them zero

Most commonly used in neural networks

Effect: Smooths the model, prevents extreme weight values

### 1.3 Elastic Net

Combination of L1 + L2

Controls sparsity and stability together

📌 Used when L1 or L2 alone is insufficient
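
The three penalties can be sketched as plain functions added to the original loss. `lam` stands for λ, and the `l1_ratio` mixing parameter in `elastic_net_penalty` is an assumed parameterization; libraries weight the two terms in different ways.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|): pushes small weights toward exactly zero."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam * sum(w**2): penalizes large weights more strongly."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties (assumed form)."""
    return (l1_ratio * l1_penalty(weights, lam)
            + (1.0 - l1_ratio) * l2_penalty(weights, lam))

weights = [0.5, -1.0, 2.0]
original_loss = 0.30  # placeholder value for illustration
print(original_loss + l1_penalty(weights, 0.01))
print(original_loss + l2_penalty(weights, 0.01))
print(original_loss + elastic_net_penalty(weights, 0.01))
```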

## 2. Dropout

Dropout randomly disables neurons during training.

### 2.1 Standard Dropout

Randomly sets neuron outputs to 0 with probability p

Forces the network to not depend on specific neurons

Common values:

Fully connected layers: p = 0.5

CNN layers: p = 0.2–0.3
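
A sketch of dropout on a list of activations, assuming the "inverted" variant in which surviving values are scaled by 1 / (1 - p) during training so that nothing needs rescaling at inference. The function name and signature are illustrative.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)      # dropout is disabled at inference
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng, training=False))
```

The scaling keeps the expected activation unchanged between training and inference.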

### 2.2 Spatial Dropout (for CNNs)

Drops entire feature maps, not individual neurons

Preserves spatial structure

📌 Used mainly in convolutional layers

### 2.3 DropConnect

Randomly drops connections (weights) instead of neurons

More computationally expensive

## 3. Early Stopping

Training is stopped before overfitting begins.

How it works:

Monitor validation loss

Stop training when validation loss has not improved for a set number of epochs (the patience), typically keeping the weights from the best epoch

Advantages:

Simple

No change to model architecture

Very effective in practice
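
The monitoring logic can be sketched on a pre-recorded list of validation losses, with `patience` and `min_delta` as assumed parameter names. A real training loop would evaluate the loss each epoch instead of reading it from a list.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch   # never triggered

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.75, 0.74]
print(early_stopping_epoch(losses, patience=3))
```

In this toy run the best loss is at epoch 2, so the weights saved at that epoch are the ones to keep.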

## 4. Data-Level Techniques

Overfitting often occurs due to insufficient or poor-quality data.

### 4.1 Data Augmentation

Artificially increases the effective size and variety of the dataset by applying label-preserving transformations.

For images:

Rotation

Flipping

Zooming

Cropping

Brightness changes

For text:

Synonym replacement

Random word deletion

Back translation

Improves generalization without collecting new data
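
As one small example of image augmentation, the sketch below flips an image stored as a list of pixel rows. The representation and the `random_flip` helper are assumptions for illustration; real pipelines usually rely on an image library.

```python
import random

def horizontal_flip(image):
    """Mirror each row left-to-right; the label stays the same."""
    return [row[::-1] for row in image]

def random_flip(image, rng, p=0.5):
    """Flip with probability p, otherwise return an unchanged copy."""
    if rng.random() < p:
        return horizontal_flip(image)
    return [row[:] for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))
print(random_flip(image, random.Random(0)))
```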

### 4.2 Noise Injection

Add small random noise to:

Inputs

Weights

Activations

Makes the model more robust to small variations
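
Input noise is the simplest of the three to sketch. The function below adds zero-mean Gaussian noise to a feature vector; the name `add_gaussian_noise` and the default standard deviation of 0.05 are assumptions.

```python
import random

def add_gaussian_noise(inputs, std=0.05, rng=None):
    """Return a noisy copy of the inputs; apply during training only."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, std) for x in inputs]

features = [0.2, 0.5, 0.9]
print(add_gaussian_noise(features, std=0.05, rng=random.Random(1)))
```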

## 5. Model Architecture Control

Reduce the capacity of the model.

### 5.1 Reducing Model Complexity

Fewer layers

Fewer neurons per layer

Smaller kernel sizes (CNN)

Simpler models often generalize better when data is limited

### 5.2 Parameter Sharing

Used in CNNs and RNNs

Same weights reused across inputs

Reduces total parameters → less overfitting
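
A back-of-the-envelope comparison shows how much sharing saves. The sizes are assumptions: a 32×32 RGB image mapped to 16 feature maps of the same spatial size, once with a fully connected layer and once with a 3×3 convolution.

```python
def dense_params(n_inputs, n_outputs):
    """Every input connects to every output, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

def conv2d_params(in_channels, out_channels, kernel_size):
    """One small kernel per (input, output) channel pair, reused at
    every spatial position, plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

n_in = 32 * 32 * 3     # flattened input image
n_out = 32 * 32 * 16   # 16 output maps of the same spatial size
print(dense_params(n_in, n_out))
print(conv2d_params(3, 16, 3))
```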

## 6. Batch Normalization

Normalizes each layer's inputs using mini-batch statistics during training; running averages of those statistics are used at inference.

Why it helps:

Originally motivated as reducing internal covariate shift (later work attributes much of the benefit to a smoother loss landscape)

Adds slight regularization effect

Allows higher learning rates

Often reduces the need for dropout
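
The training-time computation for a single feature can be sketched as below. `gamma` and `beta` are the learnable scale and shift, and `eps` is an assumed small constant for numerical stability; the running statistics used at inference are omitted.

```python
import math

def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [1.0, 2.0, 3.0, 4.0]   # one feature over a batch of four samples
print(batch_norm_1d(batch))
```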

## 7. Ensemble Methods

Combine multiple models to reduce variance.

### 7.1 Bagging

Train multiple models on different bootstrap samples of the training data

Average predictions

### 7.2 Model Averaging

Train the same architecture several times with different random initializations, then average the predictions

### 7.3 Snapshot Ensembles

Save multiple model states from one training run

Improves generalization but increases inference cost
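
All three variants end with the same step: combining the members' outputs. A sketch of that step, assuming each model returns a list of class probabilities; `average_predictions` is a hypothetical helper.

```python
def average_predictions(model_probs):
    """Average class probabilities element-wise across ensemble members."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models
            for c in range(n_classes)]

# Three hypothetical models scoring the same sample over two classes.
member_outputs = [[0.6, 0.4], [0.8, 0.2], [0.4, 0.6]]
print(average_predictions(member_outputs))
```

Every member must be run at inference time, which is where the extra cost comes from.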

## 8. Optimization & Training Strategies

A poor training setup can cause overfitting or hide it.

### 8.1 Learning Rate Scheduling

Reduce learning rate over time

Smaller steps in later epochs help the model settle instead of oscillating, which can limit overfitting late in training
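
Step decay is one simple schedule, sketched here under the assumption that the rate is halved every 10 epochs. The function name and both defaults are illustrative.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```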

### 8.2 Proper Train–Validation Split

Ensure validation data is never seen during training

Avoid data leakage