In [None]:
Part 1: Understanding Optimizers

In [None]:
1. What is the role of optimization algorithms in artificial neural networks? Why are they necessary?




Optimization algorithms play a crucial role in training artificial neural networks. Their primary purpose is to adjust the parameters (weights and biases) of the network in order to minimize the loss function, which quantifies the difference between the predicted and actual outputs. Optimization algorithms are necessary for several reasons:

1. **Minimizing Loss:**
   - The ultimate goal of training a neural network is to minimize the loss, which represents the difference between the predicted and actual outputs. Optimization algorithms provide a systematic way to adjust the parameters to achieve this minimization.

2. **Gradient Descent:**
   - Optimization algorithms, particularly gradient descent and its variants, use the gradients of the loss with respect to the parameters to determine the direction and magnitude of parameter updates. Each gradient component indicates how the loss changes with respect to the corresponding parameter.

3. **Efficient Parameter Updates:**
   - Neural networks often have a large number of parameters, and manually tuning them to minimize the loss would be impractical. Optimization algorithms automate the process of updating parameters efficiently and iteratively.

4. **Convergence to Optimal Solution:**
   - Optimization algorithms aim to guide the model parameters toward values that correspond to a minimum or near-minimum of the loss function. A low training loss is a prerequisite for, though not a guarantee of, a model that generalizes well to new, unseen data.

5. **Handling Non-Convex Loss Landscapes:**
   - The loss landscape of neural networks is typically non-convex: it can contain many local minima, saddle points, and flat regions, so a local minimum is not guaranteed to be the global one. Optimization algorithms need to navigate this complex landscape to find a good set of parameters. In practice, gradient descent methods often converge to good solutions despite the non-convexity.

6. **Learning Rate Adaptation:**
   - Many optimization algorithms include mechanisms for adapting the learning rate during training. Learning rate adaptation helps control the step size of parameter updates, preventing issues like slow convergence, oscillations, or divergence.

7. **Stochasticity and Minibatch Training:**
   - Optimization algorithms handle stochasticity introduced by training on random minibatches of data. Stochastic Gradient Descent (SGD) and its variants use random subsets of the training data to estimate gradients and update parameters.

8. **Variants for Efficiency:**
   - Various optimization algorithms, such as Adam, RMSprop, and Adagrad, are designed to address specific challenges in training, such as adapting learning rates per parameter, handling sparse gradients, or coping with gradients whose scale varies widely across parameters.

9. **Regularization:**
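   - Optimization algorithms are often combined with regularization techniques to prevent overfitting. For example, L2 regularization (weight decay) is commonly applied as part of the parameter update, and it can be used with any optimizer, including L-BFGS and conjugate gradient methods.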

In summary, optimization algorithms are essential for training neural networks by automating the process of adjusting parameters to minimize the loss. They provide systematic and efficient methods for navigating the complex parameter space, enabling neural networks to learn from data and generalize well to unseen examples.

In [None]:
2. Explain the concept of gradient descent and its variants. Discuss their differences and tradeoffs in terms of convergence speed and memory requirements.



**Gradient Descent:**

Gradient descent is an iterative optimization algorithm used to minimize a differentiable function, typically the loss function in the context of training neural networks. The basic idea is to update the parameters of the model in the opposite direction of the gradient of the loss with respect to those parameters.

The update rule for gradient descent is as follows:

\[ \theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t) \]

Where:
- \(\theta_t\) is the parameter vector at iteration \(t\),
- \(\alpha\) is the learning rate,
- \(\nabla J(\theta_t)\) is the gradient of the loss function \(J\) with respect to the parameters at iteration \(t\).
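
As a minimal sketch of this update rule, the following pure-Python loop applies it to a toy quadratic loss. The loss itself, the names `gradient_descent` and `toy_grad`, and the values of `alpha` and `steps` are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeat theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t_i - alpha * g_i for t_i, g_i in zip(theta, g)]
    return theta


# Toy loss (an assumption): J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_final = gradient_descent(toy_grad, theta0=[0.0, 0.0])
print(theta_final)  # approaches the minimizer [3, -1]
```

On this loss each step shrinks the distance to the minimizer by a constant factor, which is why a fixed number of steps is enough here; real losses rarely behave this regularly.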

**Variants of Gradient Descent:**

1. **Stochastic Gradient Descent (SGD):**
   - Instead of using the entire training dataset to compute the gradient at each iteration, SGD randomly selects a single training example (or a minibatch) to compute an estimate of the gradient. This introduces stochasticity, which can help escape local minima and speed up training.
   - **Tradeoff:** It can have high variance in parameter updates, leading to noisy convergence.

2. **Batch Gradient Descent:**
   - Batch Gradient Descent computes the gradient of the loss with respect to the parameters using the entire training dataset at each iteration. It provides a more stable estimate of the gradient but can be computationally expensive for large datasets.
   - **Tradeoff:** Memory-intensive for large datasets.

3. **Mini-batch Gradient Descent:**
   - Mini-batch Gradient Descent strikes a balance by using a small, random subset (mini-batch) of the training data to compute the gradient. It combines some advantages of both SGD and Batch Gradient Descent.
   - **Tradeoff:** The choice of mini-batch size affects the tradeoff between computation efficiency and convergence speed.

4. **Momentum:**
   - Momentum introduces a moving average of past gradients to smooth out parameter updates. It helps accelerate convergence by damping oscillations across steep directions while building up speed along directions where the gradient is consistent.
   - **Tradeoff:** Requires an additional hyperparameter (momentum coefficient), and the accumulated velocity may carry the parameters past a minimum.

5. **Adagrad:**
   - Adagrad adapts the learning rate for each parameter based on the historical gradient information. Parameters that receive large gradients have a smaller effective learning rate, and vice versa.
   - **Tradeoff:** It can lead to a diminishing learning rate, which may cause slow convergence in later stages of training.

6. **RMSprop:**
   - RMSprop addresses the diminishing learning rate issue of Adagrad by using a moving average of squared gradients. It divides the current gradient by the square root of the moving average of squared gradients.
   - **Tradeoff:** Still requires careful tuning of hyperparameters.

7. **Adam:**
   - Adam combines ideas from Momentum and RMSprop. It incorporates both the moving average of past gradients and the moving average of past squared gradients to adapt the learning rate.
   - **Tradeoff:** It has more hyperparameters to tune but is often considered a robust choice.
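
The diminishing learning rate noted for Adagrad, and the way RMSprop's moving average avoids it, can be illustrated with a small sketch. It tracks only the effective learning rate of a single parameter for a hypothetical stream of constant gradients; the function names and constants are assumptions for illustration.

```python
import math


def adagrad_rates(grads, alpha=0.1, eps=1e-8):
    """Effective Adagrad learning rate after each gradient (running sum of squares)."""
    accum = 0.0  # sum of squared gradients, never decays
    rates = []
    for g in grads:
        accum += g * g
        rates.append(alpha / (math.sqrt(accum) + eps))
    return rates


def rmsprop_rates(grads, alpha=0.1, beta=0.9, eps=1e-8):
    """Effective RMSprop learning rate after each gradient (moving average of squares)."""
    avg = 0.0  # exponential moving average of squared gradients
    rates = []
    for g in grads:
        avg = beta * avg + (1.0 - beta) * g * g
        rates.append(alpha / (math.sqrt(avg) + eps))
    return rates


grads = [1.0] * 100  # hypothetical constant gradient
ada = adagrad_rates(grads)
rms = rmsprop_rates(grads)
print(f"Adagrad: first={ada[0]:.4f}, last={ada[-1]:.4f}")
print(f"RMSprop: first={rms[0]:.4f}, last={rms[-1]:.4f}")
```

With a constant gradient, the Adagrad rate keeps shrinking roughly as 1/sqrt(t), while the RMSprop rate levels off near `alpha`.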

**Tradeoffs:**
- **Convergence Speed:** Adam and other adaptive methods often converge faster in practice compared to standard SGD, especially for complex optimization landscapes. However, the actual performance can depend on the specific problem and hyperparameter settings.
- **Memory Requirements:** Optimizer state grows with the number of parameters. Plain SGD stores no extra state, Momentum, Adagrad, and RMSprop each store one extra value per parameter, and Adam stores two (the moving averages of the gradients and of the squared gradients).
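
As a rough back-of-the-envelope sketch of the memory tradeoff, the snippet below estimates optimizer state size from the number of extra values kept per parameter. The slot counts assume the basic form of each algorithm, and the model size and 4-byte values are hypothetical.

```python
# Extra per-parameter values kept by the basic form of each optimizer (assumed here).
STATE_SLOTS = {"sgd": 0, "momentum": 1, "adagrad": 1, "rmsprop": 1, "adam": 2}


def optimizer_state_bytes(name, n_params, bytes_per_value=4):
    """Approximate optimizer state size, excluding parameters and gradients themselves."""
    return STATE_SLOTS[name] * n_params * bytes_per_value


n_params = 1_000_000  # hypothetical model size
for name in STATE_SLOTS:
    print(name, optimizer_state_bytes(name, n_params) / 1e6, "MB")
```

Framework implementations may keep additional state (for example, RMSprop with momentum enabled), so these numbers are a lower bound rather than an exact figure.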

In summary, the choice of optimization algorithm depends on the specific characteristics of the problem, the dataset size, and the available computational resources. While adaptive methods like Adam are popular for their robust performance, the choice may involve tradeoffs between convergence speed, memory requirements, and the need for hyperparameter tuning.

In [None]:
3. Describe the challenges associated with traditional gradient descent optimization methods (e.g., slow
convergence, local minima). How do modern optimizers address these challenges?





Traditional gradient descent optimization methods, such as basic gradient descent and its variants like SGD (Stochastic Gradient Descent), face several challenges that can affect the efficiency and effectiveness of training neural networks. Some of these challenges include:

1. **Slow Convergence:**
   - **Issue:** In deep neural networks, the optimization landscape is often non-convex and may contain many flat or steep regions. Traditional gradient descent methods can converge slowly in such landscapes, leading to lengthy training times.
   - **Addressing:** Modern optimizers, such as Adam, RMSprop, and Adagrad, use adaptive learning rates to speed up convergence by adjusting the learning rates for each parameter based on historical gradient information.

2. **Vanishing and Exploding Gradients:**
   - **Issue:** In deep networks, gradients can become very small (vanishing) or very large (exploding) as they are backpropagated through multiple layers. This can hinder learning, especially in deep architectures.
   - **Addressing:** These problems are mostly tackled outside the optimizer, through weight initialization strategies and batch normalization. On the optimizer side, gradient clipping limits exploding gradients, and adaptive learning rates help when gradient magnitudes differ widely across parameters.

3. **Local Minima and Saddle Points:**
   - **Issue:** Traditional gradient descent methods may get stuck in local minima or saddle points, especially in high-dimensional spaces.
   - **Addressing:** Techniques like momentum, which helps the optimizer overcome small local minima, and the use of adaptive learning rates in modern optimizers can assist in escaping saddle points and exploring the optimization landscape more effectively.

4. **Sensitivity to Learning Rate:**
   - **Issue:** The choice of a suitable learning rate is critical for traditional gradient descent methods. A learning rate that is too small may lead to slow convergence, while a rate that is too large may cause divergence.
   - **Addressing:** Modern optimizers often incorporate mechanisms to adaptively adjust the learning rate during training, reducing the need for manual tuning. Techniques such as learning rate schedules and cyclical learning rates are also employed.

5. **Memory Requirements:**
   - **Issue:** Batch gradient descent keeps no per-parameter history, but it must process the entire dataset to compute a single update, which is costly in memory and time for large datasets.
   - **Addressing:** Mini-batch training bounds the cost of each update. Adaptive optimizers do add per-parameter state, but they keep a small, fixed number of running averages (one for RMSprop, two for Adam) rather than a full gradient history.

6. **Limited Exploration:**
   - **Issue:** Traditional gradient descent methods may follow a straightforward path towards convergence, limiting exploration of the parameter space.
   - **Addressing:** Optimizers with adaptive learning rates, momentum, and adaptive moments (e.g., Adam) can facilitate more effective exploration by adapting to the local geometry of the optimization landscape.
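
Gradient clipping, mentioned above as a way to stabilize training when gradients explode, can be sketched in a few lines. This version rescales the whole gradient vector when its Euclidean norm exceeds a threshold; the function name `clip_by_norm` and the threshold are assumptions for illustration.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm; direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already small enough (also covers the zero gradient)
    scale = max_norm / norm
    return [g * scale for g in grad]


clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)  # roughly [3, 4]: same direction, norm reduced from 50 to 5
```

Clipping changes only the step length, not its direction, so it caps the damage from a single very large gradient without otherwise altering the update.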

In summary, modern optimizers are designed to address these challenges associated with traditional gradient descent methods. They incorporate adaptive learning rates, momentum, and other techniques to improve convergence speed, escape local minima, handle vanishing/exploding gradients, and reduce sensitivity to hyperparameter choices. The combination of these features allows modern optimizers to offer more efficient and effective solutions for training deep neural networks.

In [None]:
4. Discuss the concepts of momentum and learning rate in the context of optimization algorithms. How do
they impact convergence and model performance?




**Momentum:**

Momentum is a technique used in optimization algorithms, particularly in the context of stochastic gradient descent (SGD) and its variants, to accelerate convergence and overcome oscillations. The basic idea is to introduce a moving average of past gradients, which helps dampen oscillations and speed up convergence.

The update rule for momentum is as follows:

\[ v_{t+1} = \beta \cdot v_t + (1 - \beta) \cdot \nabla J(\theta_t) \]
\[ \theta_{t+1} = \theta_t - \alpha \cdot v_{t+1} \]

Where:
- \( v_t \) is the momentum term at iteration \( t \),
- \( \beta \) is the momentum coefficient (typically close to 1),
- \( \nabla J(\theta_t) \) is the gradient of the loss function with respect to the parameters at iteration \( t \),
- \( \alpha \) is the learning rate,
- \( \theta_t \) is the parameter vector at iteration \( t \).

Momentum helps to smooth out the updates and allows the optimizer to keep moving in a consistent descent direction even when individual gradients are noisy or change direction frequently. It introduces a memory effect that helps the optimizer keep making progress through flat regions and reduces oscillation in narrow valleys.
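
The two momentum equations above can be sketched directly in pure Python. The toy "narrow valley" loss, the function names, and the hyperparameter values below are assumptions for illustration.

```python
def momentum_descent(grad, theta0, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent whose velocity is an exponential moving average of gradients."""
    theta = list(theta0)
    v = [0.0] * len(theta)  # momentum term, initialized at zero
    for _ in range(steps):
        g = grad(theta)
        # v_{t+1} = beta * v_t + (1 - beta) * grad
        v = [beta * v_i + (1.0 - beta) * g_i for v_i, g_i in zip(v, g)]
        # theta_{t+1} = theta_t - alpha * v_{t+1}
        theta = [t_i - alpha * v_i for t_i, v_i in zip(theta, v)]
    return theta


# Toy narrow valley (an assumption): J(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)
def valley_grad(theta):
    return [theta[0], 25.0 * theta[1]]


theta_final = momentum_descent(valley_grad, theta0=[1.0, 1.0])
print(theta_final)  # both coordinates end up close to the minimizer [0, 0]
```

The second coordinate is 25 times steeper than the first, which is the kind of ill-conditioned shape where averaging gradients over time is most useful.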

**Learning Rate:**

The learning rate (\( \alpha \)) is a hyperparameter that controls the step size of parameter updates in optimization algorithms. It determines how much the model parameters should be adjusted during each iteration based on the computed gradient. The learning rate is crucial because it influences the convergence speed, stability, and the quality of the final model.

Choosing an appropriate learning rate is important, and it involves finding a balance. A learning rate that is too small may lead to slow convergence, while a rate that is too large may cause the optimizer to overshoot the minimum or even diverge. Common learning rate values are in the range of \( 0.0001 \) to \( 0.1 \), but the optimal learning rate can vary depending on the problem, the optimizer, and the architecture of the neural network.
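
A small worked example makes the balance concrete. For the toy loss \( J(\theta) = \theta^2 \) (an assumption for illustration), the gradient is \( 2\theta \), so each update multiplies \( \theta \) by \( 1 - 2\alpha \). The sketch below runs 20 steps with three hypothetical learning rates.

```python
def run_gd(alpha, theta0=1.0, steps=20):
    """Run gradient descent on J(theta) = theta^2 and return the final theta."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * 2.0 * theta  # gradient of theta^2 is 2 * theta
    return theta


for alpha in (0.01, 0.4, 1.1):
    print(f"alpha={alpha}: |theta| after 20 steps = {abs(run_gd(alpha)):.3g}")
```

With `alpha=0.01` the parameter has barely moved toward zero, with `alpha=0.4` it is essentially at the minimum, and with `alpha=1.1` the factor \( 1 - 2\alpha \) has magnitude above 1, so the iterates grow instead of shrinking.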

**Impact on Convergence and Model Performance:**

1. **Momentum:**
   - **Impact on Convergence:** Momentum helps accelerate convergence, especially in the presence of noisy gradients or in regions with flat or oscillating surfaces. It dampens oscillations and allows the optimizer to navigate through saddle points more efficiently.
   - **Impact on Performance:** The use of momentum generally results in faster training and can help escape local minima. However, the momentum coefficient (\( \beta \)) needs to be carefully tuned to avoid overshooting.

2. **Learning Rate:**
   - **Impact on Convergence:** The learning rate directly influences the step size of parameter updates. A suitable learning rate is essential for stable convergence. Too small a learning rate may lead to slow convergence, while too large a learning rate can cause the optimizer to oscillate or diverge.
   - **Impact on Performance:** A well-chosen learning rate can significantly improve the efficiency of the optimization process. Learning rates that are too small may result in suboptimal solutions, while rates that are too large may cause the optimizer to overshoot the minimum.

In summary, momentum and learning rate are critical factors in optimization algorithms, impacting the convergence speed and performance of neural network training. Appropriate tuning of these hyperparameters is essential to achieve efficient and effective optimization, allowing the model to converge to a satisfactory solution in a reasonable amount of time.

In [None]:
Part 2: Optimizer Techniques

In [None]:
5. Explain the concept of Stochastic Gradient Descent (SGD) and its advantages compared to traditional
gradient descent. Discuss its limitations and scenarios where it is most suitable.




**Stochastic Gradient Descent (SGD):**

Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used for training machine learning models, including neural networks. It is a variant of traditional gradient descent that introduces a stochastic element by computing the gradient of the loss with respect to the parameters using only a subset of the training data at each iteration. In other words, instead of computing the gradient over the entire dataset (as in batch gradient descent), SGD uses a randomly selected subset, often referred to as a mini-batch.

The update rule for SGD is as follows:

\[ \theta_{t+1} = \theta_t - \alpha \cdot \nabla J(\theta_t; x^{(i)}, y^{(i)}) \]

Where:
- \( \theta_t \) is the parameter vector at iteration \( t \),
- \( \alpha \) is the learning rate,
- \( \nabla J(\theta_t; x^{(i)}, y^{(i)}) \) is the gradient of the loss function with respect to the parameters calculated using a randomly selected mini-batch \( (x^{(i)}, y^{(i)}) \).
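
As a minimal sketch of this rule, the following pure-Python loop fits a one-variable linear model with mini-batch SGD. The synthetic data, the function name `minibatch_sgd`, and the hyperparameter values are assumptions for illustration.

```python
import random


def minibatch_sgd(xs, ys, alpha=0.1, batch_size=8, epochs=100, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle so each mini-batch is a random subset
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# Synthetic, noise-free data (an assumption): y = 2x + 1
xs = [i / 50.0 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)  # close to the true slope 2 and intercept 1
```

Each update touches only `batch_size` examples, so the memory needed per step does not depend on the size of the dataset.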

**Advantages of SGD:**

1. **Faster Convergence:**
   - SGD often converges faster than traditional batch gradient descent since it updates parameters more frequently. This can be especially advantageous for large datasets.

2. **Memory Efficiency:**
   - Since SGD processes only a subset of the data at each iteration, it requires less memory compared to batch gradient descent, making it suitable for datasets that may not fit entirely in memory.

3. **Stochasticity for Escaping Local Minima:**
   - The stochastic nature of SGD introduces randomness in the parameter updates, which can help the optimization process escape local minima and explore different regions of the parameter space.

4. **Online Learning:**
   - SGD is suitable for online learning scenarios where new data points are continuously added to the training set. It allows the model to adapt quickly to changes in the data distribution.

**Limitations and Scenarios:**

1. **High Variance in Parameter Updates:**
   - The stochastic nature of SGD introduces high variance in parameter updates due to the use of mini-batches. This can result in noisy convergence, making it harder to determine the optimal learning rate.

2. **Non-Smooth Loss Functions:**
   - In the presence of non-smooth or noisy loss functions, the high variance in parameter updates can lead to oscillations and slow convergence.

3. **Sensitivity to Learning Rate:**
   - SGD is sensitive to the choice of learning rate. The learning rate needs to be carefully tuned, and the model's performance can be sensitive to small changes in this hyperparameter.

4. **Not Ideal for All Datasets:**
   - SGD may not be suitable for all types of datasets. It relies on shuffling: if the data is ordered (for example, sorted by class), mini-batches are not representative samples and the gradient estimates become biased.

5. **Suitability for Noisy Data:**
   - SGD may perform well on datasets with noisy or redundant data due to its inherent randomness, allowing it to escape local minima and reach more diverse regions of the parameter space.

In summary, SGD is a powerful optimization algorithm with advantages such as faster convergence, memory efficiency, and the ability to escape local minima. However, it is important to consider its limitations, including high variance in parameter updates and sensitivity to learning rate, and carefully choose scenarios where it is most suitable, such as large datasets, online learning, and situations where escaping local minima is crucial.

In [None]:
6. Describe the concept of Adam optimizer and how it combines momentum and adaptive learning rates.
Discuss its benefits and potential drawbacks.


**Adam Optimizer:**

The Adam optimizer (short for Adaptive Moment Estimation) is a popular optimization algorithm that combines the concepts of momentum and adaptive learning rates to improve the efficiency and effectiveness of neural network training. It was introduced by D. P. Kingma and J. Ba in their paper titled "Adam: A Method for Stochastic Optimization."

The key components of the Adam optimizer are:

1. **Momentum:**
   - Adam includes a momentum term that helps smooth out updates and accelerates convergence, similar to the momentum optimization technique. The momentum term is calculated as a moving average of past gradients.

2. **Adaptive Learning Rates:**
   - Adam adapts the learning rates for each parameter individually based on the magnitude of past gradients. It maintains a separate moving average of the squared gradients for each parameter, adjusting the learning rate for parameters with larger or smaller gradients.

The update rule for Adam is as follows:

\[ m_{t+1} = \beta_1 \cdot m_t + (1 - \beta_1) \cdot \nabla J(\theta_t) \]
\[ v_{t+1} = \beta_2 \cdot v_t + (1 - \beta_2) \cdot (\nabla J(\theta_t))^2 \]
\[ \hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}} \]
\[ \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}} \]
\[ \theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon} \]

Where:
- \( m_t \) is the first-moment estimate (momentum) at iteration \( t \),
- \( v_t \) is the second-moment estimate (squared gradients) at iteration \( t \),
- \( \beta_1 \) and \( \beta_2 \) are decay rates for the moment estimates (typically close to 1; the defaults suggested in the paper are 0.9 and 0.999),
- \( \alpha \) is the learning rate,
- \( \epsilon \) is a small constant to prevent division by zero.
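
The five equations above translate almost line by line into code. The following is a minimal pure-Python sketch, not a framework implementation; the toy loss, the function names, and the choice of `alpha=0.1` (larger than the usual default, to converge quickly on this small problem) are assumptions for illustration.

```python
import math


def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimal Adam loop following the update equations, one parameter at a time."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first-moment estimates
    v = [0.0] * len(theta)  # second-moment estimates
    for t in range(1, steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy loss (an assumption): J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_final = adam(toy_grad, theta0=[0.0, 0.0], alpha=0.1, steps=500)
print(theta_final)  # ends up close to the minimizer [3, -1]
```

Note that the first step has size close to `alpha` regardless of the gradient's scale, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is approximately 1 in magnitude at `t = 1`.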

**Benefits of Adam:**

1. **Efficient Learning Rate Adaptation:**
   - Adam adaptively adjusts the learning rates for each parameter based on the magnitude of its gradients and squared gradients. This helps overcome challenges associated with manual tuning of learning rates.

2. **Combination of Momentum and Adaptive Learning Rates:**
   - Adam combines the advantages of momentum, which helps accelerate convergence, with the benefits of adaptive learning rates, making it effective in a wide range of optimization landscapes.

3. **Robust Performance Across Diverse Tasks:**
   - Adam has demonstrated robust performance across various types of neural network architectures and tasks, making it a popular choice for many applications.

4. **Modest Memory Requirement:**
   - Adam maintains two moving averages per parameter. This is modest compared with second-order methods, but it is more optimizer state than plain SGD (none), momentum (one), or RMSprop (one) requires.

**Potential Drawbacks of Adam:**

1. **Sensitivity to Hyperparameters:**
   - Adam has additional hyperparameters (\( \beta_1, \beta_2, \epsilon \)). The default values often work well, but suboptimal choices can still hurt performance, and the learning rate usually needs tuning.

2. **Bias Correction:**
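   - The moment estimates are initialized at zero, so they are biased toward zero in early iterations. The bias correction terms (\(1 - \beta_1^{t+1}\) and \(1 - \beta_2^{t+1}\)) compensate for this at the cost of a little extra computation per step. Their effect fades as \( t \) grows and the correction factors approach 1.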

3. **Not Always the Best Performer:**
   - While Adam is widely used, it may not always outperform other optimization methods, and its effectiveness can depend on the specific characteristics of the dataset and task.

In summary, Adam is a versatile optimizer that combines momentum and adaptive learning rates to efficiently navigate complex optimization landscapes. It is suitable for a wide range of applications but requires careful tuning of its hyperparameters for optimal performance. Researchers and practitioners often experiment with different optimizers to determine the most effective choice for their specific use case.


In [None]:
7. Explain the concept of RMSprop optimizer and how it addresses the challenges of adaptive learning
rates. Compare it with Adam and discuss their relative strengths and weaknesses.



**RMSprop Optimizer:**

The RMSprop (Root Mean Square Propagation) optimizer is an optimization algorithm that addresses some of the challenges associated with adaptive learning rates. It was proposed by Geoffrey Hinton in a lecture and is designed to adjust the learning rates for each parameter individually based on the historical information of gradients.

The update rule for RMSprop is as follows:

\[ v_{t+1} = \beta \cdot v_t + (1 - \beta) \cdot (\nabla J(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \alpha \cdot \frac{\nabla J(\theta_t)}{\sqrt{v_{t+1}} + \epsilon} \]

Where:
- \( v_t \) is the moving average of squared gradients,
- \( \beta \) is a decay rate for the moving average (typically close to 1),
- \( \alpha \) is the learning rate,
- \( \epsilon \) is a small constant to prevent division by zero.
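
A minimal pure-Python sketch of these two equations follows. The toy loss, the function names, and the hyperparameter values are assumptions for illustration, not a framework implementation.

```python
import math


def rmsprop(grad, theta0, alpha=0.01, beta=0.9, eps=1e-8, steps=1000):
    """Minimal RMSprop loop following the update equations, one parameter at a time."""
    theta = list(theta0)
    v = [0.0] * len(theta)  # moving average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            v[i] = beta * v[i] + (1.0 - beta) * g[i] * g[i]
            theta[i] -= alpha * g[i] / (math.sqrt(v[i]) + eps)
    return theta


# Toy loss (an assumption): J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_final = rmsprop(toy_grad, theta0=[0.0, 0.0])
print(theta_final)  # ends up near the minimizer [3, -1]
```

Because the gradient is divided by its own running magnitude, the step length stays on the order of `alpha` even close to the minimum. With a constant learning rate the iterates therefore tend to hover within a few multiples of `alpha` of the minimizer rather than settling exactly on it, which is one reason a decaying learning rate is often paired with this method.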

**Comparison with Adam:**

1. **Adaptive Learning Rates:**
   - **Adam:** Adam also uses adaptive learning rates, adjusting them based on both first-moment (momentum) and second-moment (squared gradients) estimates.
   - **RMSprop:** RMSprop adapts learning rates based solely on the historical information of squared gradients.

2. **Exponential Moving Averages:**
   - **Adam:** Uses exponential moving averages for both the first and second moments.
   - **RMSprop:** Uses an exponential moving average for the squared gradients.

3. **Memory Requirements:**
   - **Adam:** Maintains two moving averages per parameter (momentum and squared gradients), resulting in relatively higher memory requirements.
   - **RMSprop:** Requires only one moving average per parameter, leading to lower memory requirements compared to Adam.

4. **Bias Correction:**
   - **Adam:** Includes bias correction terms (\(1 - \beta_1^{t+1}\) and \(1 - \beta_2^{t+1}\)) to address initialization bias.
   - **RMSprop:** Does not include bias correction terms in the update rule.

5. **Sensitivity to Hyperparameters:**
   - Both Adam and RMSprop have hyperparameters (\( \beta \) in RMSprop and \( \beta_1, \beta_2 \) in Adam) that need to be tuned, and their performance can be sensitive to the choice of these hyperparameters.

**Relative Strengths and Weaknesses:**

1. **Adam:**
   - **Strengths:**
     - Generally exhibits robust performance across a wide range of tasks.
     - Effective in scenarios where adaptive learning rates and momentum are beneficial.
   - **Weaknesses:**
     - Requires tuning of additional hyperparameters.
     - May not always outperform other optimizers, depending on the task.

2. **RMSprop:**
   - **Strengths:**
     - Simplicity in terms of hyperparameter tuning (primarily \( \beta \)).
     - Lower memory requirements compared to Adam.
     - Effectively handles sparse gradients.
   - **Weaknesses:**
     - May not perform as well as Adam in some cases, especially when momentum is crucial.

**Choosing Between Adam and RMSprop:**
- **Adam:** Consider using Adam as a default choice when starting experimentation. It is versatile and often performs well across different tasks.
- **RMSprop:** Can be a good alternative when memory efficiency is a concern, or in scenarios where Adam might be too aggressive.

In practice, the choice between Adam and RMSprop may involve empirical testing on the specific task and dataset. Researchers and practitioners often experiment with different optimizers to find the one that works best for their particular use case.

In [None]:
Part 3: Applying Optimizers

In [None]:
# 8. Implement SGD, Adam, and RMSprop optimizers in a deep learning model using a framework of your
# choice. Train the model on a suitable dataset and compare their impact on model convergence and
# performance.



import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create a synthetic binary classification dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


def build_model(n_features):
    """Return a fresh, untrained copy of the small network used for every optimizer."""
    return models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])


def train_with(optimizer):
    """Train a fresh model with the given optimizer and return its Keras History."""
    # Reseed the Python, NumPy, and TensorFlow RNGs so every run starts from the same random state
    tf.keras.utils.set_random_seed(42)
    model = build_model(X_train.shape[1])
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))


# Train the same architecture once per optimizer
sgd_history = train_with(optimizers.SGD(learning_rate=0.01, momentum=0.9))
adam_history = train_with(optimizers.Adam(learning_rate=0.001))
rmsprop_history = train_with(optimizers.RMSprop(learning_rate=0.001))

# Report final validation accuracy to compare performance
for name, history in [('SGD', sgd_history), ('Adam', adam_history), ('RMSprop', rmsprop_history)]:
    print(f"{name}: final val_accuracy = {history.history['val_accuracy'][-1]:.3f}")

# Plot the training histories to compare convergence
plt.plot(sgd_history.history['loss'], label='SGD Training Loss')
plt.plot(adam_history.history['loss'], label='Adam Training Loss')
plt.plot(rmsprop_history.history['loss'], label='RMSprop Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
