# Adagrad optimizer

Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm commonly used in training artificial neural networks (ANNs). It dynamically adjusts the learning rates of each parameter based on the historical gradients. 

## Details of Adagrad Algorithm

Adagrad keeps a running statistic for each parameter and uses it to scale that parameter's step size during training. Here's how Adagrad works:

1. Compute Squared Gradients: Adagrad maintains a sum of the squared gradients for each parameter.

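2. Adapt Learning Rates: It divides the learning rate by the square root of the sum of squared gradients for each parameter (plus a small constant to avoid division by zero). Parameters that have received large or frequent gradients end up with a smaller effective learning rate, while parameters that have received small or infrequent gradients keep a relatively larger one. The effective learning rate never rises above the base learning rate's initial scaling; it only shrinks more slowly for some parameters than for others.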

3. Accumulation of Gradients: Adagrad accumulates the squared gradients over the whole training run and never discards old values, so each parameter's effective learning rate can only stay the same or decrease as training proceeds.
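
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the function name `adagrad_step`, the toy objective, and the values of `lr` and `eps` are assumptions chosen for the example, and the accumulator starts at zero here for simplicity.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.01, eps=1e-7):
    """One Adagrad update; returns the new parameters and accumulator."""
    # Steps 1 and 3: add this step's squared gradients to the running sum.
    accum = accum + grads ** 2
    # Step 2: divide the learning rate by the root of the sum, per parameter.
    params = params - lr * grads / (np.sqrt(accum) + eps)
    return params, accum

# Toy objective (assumed for illustration): f(w) = 0.5 * ||w||^2, so grad = w.
w = np.array([5.0, -3.0])
accum = np.zeros_like(w)
for _ in range(500):
    w, accum = adagrad_step(w, grads=w, accum=accum, lr=1.0)

print(w)      # both entries end up very close to 0
print(accum)  # the accumulator only ever grows
```

Each parameter has its own entry in `accum`, which is why the two coordinates shrink toward zero at different rates even though they share the same base learning rate.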

## Pros of Adagrad optimizer

1. Adaptive Learning Rates: Adagrad adapts the learning rate for each parameter individually based on its historical gradients. This is most useful for sparse data: parameters tied to features that occur infrequently accumulate little gradient history, so they keep a larger effective learning rate and still make meaningful updates on the rare occasions they are active.

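2. Less Manual Tuning of Learning Rate: Adagrad automatically rescales the learning rate per parameter based on the gradients, which reduces the need for hand-crafted learning rate schedules. A base learning rate still has to be chosen, but the method is typically less sensitive to that choice than plain gradient descent.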
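
To make the sparse-feature benefit concrete, the sketch below compares the effective learning rate of a parameter that receives a gradient at every step with one that receives a gradient only on every 10th step. The gradient values, the step count, and `lr` are hypothetical and chosen only for illustration.

```python
import numpy as np

lr, eps = 0.1, 1e-7
accum = np.zeros(2)  # index 0: frequent feature, index 1: rare feature

for step in range(100):
    # Hypothetical gradients: the frequent feature is active on every step,
    # the rare feature only on every 10th step.
    grads = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    accum += grads ** 2

# Effective per-parameter learning rate after 100 steps.
effective_lr = lr / (np.sqrt(accum) + eps)
print(effective_lr)
```

The rare feature's accumulator is ten times smaller, so its effective learning rate is larger by a factor of roughly the square root of 10.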

## Cons of Adagrad optimizer

1. Decreasing Learning Rates: Because the sum of squared gradients only grows, the effective learning rates can shrink toward zero over time. Training can then slow down or stall before the model has reached a good solution, especially for deep neural networks that need many update steps.

2. Memory Usage: Adagrad needs to store and update one accumulated value per parameter, so the optimizer state is as large as the model's parameters. This adds noticeable memory overhead compared with plain gradient descent, especially for models with many parameters.

3. Superseded by RMSprop and AdaDelta: Adagrad was one of the early adaptive learning rate algorithms. More recent algorithms like RMSprop and AdaDelta were developed to address some of its shortcomings, most notably the aggressive decrease in learning rates, by using a decaying average of squared gradients instead of an ever-growing sum.
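
The difference between the shrinking learning rate in Adagrad and the behaviour of a decaying average can be seen in a small sketch. It feeds the same constant gradient to an Adagrad-style sum and to an RMSprop-style exponential moving average. The decay rate `rho`, the gradient value, and the step count are assumed values for illustration, and AdaDelta is not shown.

```python
import numpy as np

lr, eps = 0.1, 1e-7
rho = 0.9  # RMSprop-style decay rate (assumed value)
g = 1.0    # constant gradient magnitude, for illustration only

adagrad_accum, rmsprop_avg = 0.0, 0.0
adagrad_lr, rmsprop_lr = [], []
for _ in range(1000):
    adagrad_accum += g ** 2                               # sum keeps growing
    rmsprop_avg = rho * rmsprop_avg + (1 - rho) * g ** 2  # average levels off
    adagrad_lr.append(lr / (np.sqrt(adagrad_accum) + eps))
    rmsprop_lr.append(lr / (np.sqrt(rmsprop_avg) + eps))

print(f"Adagrad: step 1 = {adagrad_lr[0]:.4f}, step 1000 = {adagrad_lr[-1]:.4f}")
print(f"RMSprop: step 1 = {rmsprop_lr[0]:.4f}, step 1000 = {rmsprop_lr[-1]:.4f}")
```

With a constant gradient, the Adagrad-style rate falls off in proportion to one over the square root of the step count, while the moving-average rate settles near the base learning rate.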

In [None]:
from fashionmnist_model import FMM
import tensorflow as tf

In [None]:
# Load and preprocess the data
X_train, y_train, X_test, y_test = FMM.load_data()

In [None]:
# Reshape the data
X_train, X_test = FMM.reshape_data(X_train, X_test)

In [None]:
# Keras defaults: learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07
optimizer = tf.keras.optimizers.Adagrad()
model = FMM.create_model()
print(f"Training with {optimizer.__class__.__name__} optimizer...")
history, train_accuracy, val_accuracy = FMM.compile_and_train(
    model, X_train, y_train, X_test, y_test, optimizer
)

In [None]:
loss, accuracy = FMM.evaluate(model, X_test, y_test)

In [None]:
print(f"Training accuracy : {train_accuracy}")
print(f"Validation accuracy : {val_accuracy}")
print(f"Loss : {loss}")
print(f"Accuracy : {accuracy}")

In [None]:
FMM.plot_history(history, optimizer)