## **Revision**

![image.png](attachment:image.png)

# Adam:
- Adamax
- Nadam

## **Gradient Descent**

Goal: Minimize the loss function by adjusting weights.

Steps:

Calculate Gradient: Compute the gradient (slope) of the loss function with respect to each weight.<br>
Update Weights: Subtract the gradient, scaled by the learning rate, from each weight.<br>
Repeat: Iterate this process for many epochs until the loss stops decreasing (converges).<br>
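
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: `gradient_descent`, `toy_grad`, and the toy loss `sum((w_i - 3)^2)` are hypothetical names and choices made only for this example.

```python
def gradient_descent(grad_fn, w, lr=0.1, epochs=100):
    """Plain gradient descent on a list of weights (illustrative sketch)."""
    for _ in range(epochs):                           # 3. repeat for many epochs
        g = grad_fn(w)                                # 1. calculate gradient
        w = [wi - lr * gi for wi, gi in zip(w, g)]    # 2. update weights
    return w


def toy_grad(w):
    # Hypothetical toy loss: sum((w_i - 3)^2), whose gradient is 2 * (w_i - 3)
    return [2.0 * (wi - 3.0) for wi in w]


w_final = gradient_descent(toy_grad, [0.0, 10.0])
print(w_final)  # both weights end up very close to 3.0
```

On this toy loss each step shrinks the distance to the minimum by a constant factor, which is why a fixed learning rate is enough here.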

## **Momentum**

Goal: Accelerate updates along directions where successive gradients agree, and damp oscillations where they do not.

Steps:

Initialize Velocity: Start with a velocity vector of zeros.<br>
Update Velocity: Combine the old velocity (multiplied by a momentum term) with the current gradient.<br>
Update Weights: Adjust weights using the velocity, which helps smooth out the updates and potentially speeds up convergence.<br>
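
A rough sketch of these steps, assuming one common formulation (`v = beta * v - lr * g`, then `w = w + v`); other equivalent forms exist and the figures in this notebook may use different notation. The function names and the toy loss are hypothetical.

```python
def momentum_sgd(grad_fn, w, lr=0.1, beta=0.9, epochs=500):
    """Gradient descent with classical momentum (illustrative sketch)."""
    v = [0.0] * len(w)                                # initialize velocity to zeros
    for _ in range(epochs):
        g = grad_fn(w)
        # update velocity: decayed old velocity plus the current gradient step
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        # update weights using the velocity
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


def toy_grad(w):
    # Hypothetical toy loss: sum((w_i - 3)^2), whose gradient is 2 * (w_i - 3)
    return [2.0 * (wi - 3.0) for wi in w]


w_final = momentum_sgd(toy_grad, [0.0, 10.0])
print(w_final)  # both weights end up very close to 3.0
```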

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## **Nesterov Accelerated Gradient (NAG)**

Goal: Look ahead to avoid overshooting.

Steps:

Look Ahead: Calculate the gradient at a point where the current velocity would take us.<br>
Update Velocity: Similar to momentum but using the gradient from the look-ahead point.<br>
Update Weights: Adjust weights based on this new velocity.<br>
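
The look-ahead idea can be sketched as follows. This assumes the same velocity convention as classical momentum (`v = beta * v - lr * g`), with the only change being where the gradient is evaluated; names and the toy loss are hypothetical.

```python
def nag(grad_fn, w, lr=0.1, beta=0.9, epochs=500):
    """Nesterov accelerated gradient (illustrative sketch)."""
    v = [0.0] * len(w)
    for _ in range(epochs):
        # look ahead: the point the current velocity would carry us to
        lookahead = [wi + beta * vi for wi, vi in zip(w, v)]
        g = grad_fn(lookahead)                        # gradient at the look-ahead point
        # update velocity as in momentum, but with the look-ahead gradient
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        # update weights with the new velocity
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


def toy_grad(w):
    # Hypothetical toy loss: sum((w_i - 3)^2), whose gradient is 2 * (w_i - 3)
    return [2.0 * (wi - 3.0) for wi in w]


w_final = nag(toy_grad, [0.0, 10.0])
print(w_final)  # both weights end up very close to 3.0
```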

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## **Adagrad**

Goal: Adapt the learning rate for each parameter.

Steps:

Accumulate Squared Gradients: Keep a running sum of the squares of past gradients.<br>
Scale Learning Rate: Adjust the learning rate for each weight based on the accumulated gradients.<br>
Update Weights: Use the scaled learning rate to update the weights.<br>
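
A minimal sketch of the Adagrad steps, assuming the usual form where each weight's step is divided by the square root of its accumulated squared gradients. The names, the toy loss, and the hyperparameter values (chosen only so the toy converges quickly) are assumptions.

```python
import math


def adagrad(grad_fn, w, lr=1.0, eps=1e-8, epochs=1000):
    """Adagrad with a per-parameter learning rate (illustrative sketch)."""
    acc = [0.0] * len(w)                              # running sum of squared gradients
    for _ in range(epochs):
        g = grad_fn(w)
        # accumulate squared gradients (never decays, so steps only shrink)
        acc = [ai + gi * gi for ai, gi in zip(acc, g)]
        # scale the learning rate per weight and update
        w = [wi - lr * gi / (math.sqrt(ai) + eps)
             for wi, gi, ai in zip(w, g, acc)]
    return w


def toy_grad(w):
    # Hypothetical toy loss: sum((w_i - 3)^2), whose gradient is 2 * (w_i - 3)
    return [2.0 * (wi - 3.0) for wi in w]


w_final = adagrad(toy_grad, [0.0, 10.0])
print(w_final)  # both weights end up close to 3.0
```

Because the accumulator only grows, the effective learning rate keeps shrinking over time, which is the behavior RMSProp is designed to relax.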

## **RMSProp**

Goal: Handle non-stationary objectives by dividing the learning rate by an exponentially decaying average of squared gradients.

Steps:

Exponential Average: Compute the exponential moving average of the squared gradients.<br>
Scale Learning Rate: Use this moving average to adjust the learning rate.<br>
Update Weights: Update weights with the scaled learning rate.<br>
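
The same idea with an exponential moving average instead of a running sum can be sketched like this. The decay rate `rho`, the other hyperparameters, and the toy loss are assumptions for illustration only.

```python
import math


def rmsprop(grad_fn, w, lr=0.01, rho=0.9, eps=1e-8, epochs=2000):
    """RMSProp (illustrative sketch)."""
    avg_sq = [0.0] * len(w)                           # moving average of squared gradients
    for _ in range(epochs):
        g = grad_fn(w)
        # exponential moving average of the squared gradients
        avg_sq = [rho * si + (1.0 - rho) * gi * gi for si, gi in zip(avg_sq, g)]
        # scale the learning rate by the root of the average and update
        w = [wi - lr * gi / (math.sqrt(si) + eps)
             for wi, gi, si in zip(w, g, avg_sq)]
    return w


def toy_grad(w):
    # Hypothetical toy loss: sum((w_i - 3)^2), whose gradient is 2 * (w_i - 3)
    return [2.0 * (wi - 3.0) for wi in w]


w_final = rmsprop(toy_grad, [0.0, 10.0])
print(w_final)  # both weights end up near 3.0
```

With a fixed learning rate the normalized steps do not shrink to zero, so in this sketch the weights hover near the minimum rather than converging to it exactly.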

<!DOCTYPE html>
<html>
<head>
	<title>Optimizer Table</title>
	<style>
		table {
			border-collapse: collapse;
			width: 100%;
			text-align: center;
		}
		th, td {
			border: 1px solid black;
			padding: 8px;
		}
		th {
			background-color: green;
		}
	</style>
</head>
<body>
	<h1>Optimizer Table</h1>
	<table>
		<thead>
			<tr>
				<th>Optimizer</th>
				<th>Key Idea</th>
				<th>Steps Involved</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>GD</td>
				<td>Basic weight update using gradients</td>
				<td>Gradient calculation, weight update, repeat</td>
			</tr>
			<tr>
				<td>Momentum</td>
				<td>Speed up by using past gradients</td>
				<td>Initialize velocity, update velocity, update weights</td>
			</tr>
			<tr>
				<td>NAG</td>
				<td>Anticipate future gradients</td>
				<td>Look ahead, update velocity, update weights</td>
			</tr>
			<tr>
				<td>Adagrad</td>
				<td>Adapt learning rates per parameter</td>
				<td>Accumulate squared gradients, scale learning rate, update weights</td>
			</tr>
			<tr>
				<td>RMSprop</td>
				<td>Smooth out learning rates using moving average</td>
				<td>Exponential average, scale learning rate, update weights</td>
			</tr>
			<tr>
				<td>Adam</td>
				<td>Combine momentum and RMSprop</td>
				<td>Exponential averages, bias correction, update weights</td>
			</tr>
			<tr>
				<td>Adamax</td>
				<td>Variant of Adam using infinity norm</td>
				<td>Follow Adam steps, use infinity norm, update weights</td>
			</tr>
			<tr>
				<td>Nadam</td>
				<td>Combine Adam with Nesterov look-ahead</td>
				<td>Follow Adam steps, look ahead, update weights</td>
			</tr>
	</tbody>
	</table>
</body>
</html>
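
The Adam row in the table above (exponential averages, bias correction, update weights) can be sketched as follows. This assumes the commonly used default values for `beta1`, `beta2`, and `eps`; the function names and the toy loss are hypothetical.

```python
import math


def adam(grad_fn, w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, epochs=5000):
    """Adam: momentum-style first moment plus RMSProp-style second moment (sketch)."""
    m = [0.0] * len(w)                                # moving average of gradients
    v = [0.0] * len(w)                                # moving average of squared gradients
    for t in range(1, epochs + 1):
        g = grad_fn(w)
        # exponential averages of the gradient and its square
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # bias correction: both averages start at zero and are biased low early on
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        # update weights
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w


def toy_grad(w):
    # Hypothetical toy loss: sum((w_i - 3)^2), whose gradient is 2 * (w_i - 3)
    return [2.0 * (wi - 3.0) for wi in w]


w_final = adam(toy_grad, [0.0, 10.0])
print(w_final)  # both weights end up near 3.0
```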

<!DOCTYPE html>
<html>
<head>
	<title>Where to use?</title>
	<style>
		table {
			border-collapse: collapse;
			width: 100%;
			text-align: center;
		}
		th, td {
			border: 1px solid black;
			padding: 8px;
		}
		th {
			background-color: green;
		}
	</style>
</head>
<body>
	<h1>Where to use?</h1>
	<table>
		<thead>
			<tr>
				<th>Optimizer</th>
				<th>Best Use Case</th>
				<th>When to Choose</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>GD</td>
				<td>Simple models, small/medium datasets</td>
				<td>If simplicity is key and the problem is straightforward</td>
			</tr>
			<tr>
				<td>Momentum</td>
				<td>Deep networks, noisy gradients</td>
				<td>When experiencing oscillations in learning</td>
			</tr>
			<tr>
				<td>NAG</td>
				<td>Complex loss surfaces</td>
				<td>When you need a lookahead mechanism to avoid overshooting</td>
			</tr>
			<tr>
				<td>Adagrad</td>
				<td>Sparse and high-dimensional data</td>
				<td>For problems with sparse features or varying feature frequencies</td>
			</tr>
			<tr>
				<td>RMSprop</td>
				<td>Recurrent networks, non-stationary objectives</td>
				<td>When adaptive learning rates are needed dynamically</td>
			</tr>
			<tr>
				<td>Adam</td>
				<td>Large datasets, high-dimensional parameter spaces</td>
				<td>As a default choice for most deep learning tasks</td>
			</tr>
			<tr>
				<td>Adamax</td>
				<td>Scenarios where Adam's L<sub>2</sub> norm is problematic</td>
				<td>When Adam is unstable or converges poorly</td>
			</tr>
			<tr>
				<td>Nadam</td>
				<td>Deep networks needing momentum and lookahead</td>
				<td>When combining Adam and NAG benefits is required</td>
			</tr>
		</tbody>
	</table>
</body>
</html>

# Practical Tips
- **Start with Adam:** It's generally a safe and efficient choice for most problems.
- **Experiment with RMSprop or Adagrad:** If you're dealing with RNNs or sparse data, these can be very effective.
- **Use Momentum or NAG:** If your model is deep and you notice a lot of oscillation in the gradient updates.
- **Try Nadam:** If you need the benefits of both Adam and NAG, especially for very deep networks.