2. Implementing Feedforward neural networks with Keras and TensorFlow
a. Import the necessary packages
b. Load the training and testing data (MNIST/CIFAR10)
c. Define the network architecture using Keras
d. Train the model using SGD
e. Evaluate the network
f. Plot the training loss and accuracy


In [19]:
# a. IMPORT THE NECESSARY PACKAGES

import pandas as pd  # to read CSV files (not needed for the built-in MNIST loader used below)

import tensorflow as tf  # to build and train the neural network

from tensorflow import keras  # high-level API for defining and training deep learning models

import matplotlib.pyplot as plt  # to plot the training curves

import random  # generates random numbers in a given range, e.g. to pick a random test image to predict



In [20]:
# b. LOAD THE TRAINING AND TESTING DATA (MNIST)

mnist = tf.keras.datasets.mnist  # the MNIST dataset bundled with Keras

(x_train, y_train), (x_test, y_test) = mnist.load_data()  # returns the data already split into training and test sets

# Dividing by 255 normalizes pixel values from the range 0–255 to 0–1.
# Small, consistently scaled inputs make training faster and more stable.
x_train = x_train / 255
x_test = x_test / 255


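# To print the shape of a single image, uncomment the lines below.
# A separate variable is used so that x_train itself is not overwritten.
# first_image = x_train[0]
# img_len, img_wid = first_image.shape
# print(img_len, 'X', img_wid)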

# c. DEFINE THE NETWORK ARCHITECTURE USING KERAS

model = keras.Sequential([  # Sequential stacks layers in order, which is what a feedforward network needs
    # keras.layers is the module that provides the layer classes
    keras.layers.Flatten(input_shape=(28, 28)),  # input layer: Flatten converts each 28x28 image into a 784-element vector
    keras.layers.Dense(128, activation="relu"),
    # one hidden layer with 128 neurons; the activation is the Rectified Linear Unit, f(x) = max(0, x)
    keras.layers.Dense(10, activation="softmax")  # output layer: 10 neurons (one per digit) with softmax activation
])


model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_4 (Flatten)         (None, 784)               0         
                                                                 
 dense_8 (Dense)             (None, 128)               100480    
                                                                 
 dense_9 (Dense)             (None, 10)                1290      
                                                                 
Total params: 101770 (397.54 KB)
Trainable params: 101770 (397.54 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [21]:
# d. TRAIN THE MODEL USING SGD

model.compile(optimizer="sgd",  # stochastic gradient descent: an optimization algorithm used to minimize the loss function
              loss="sparse_categorical_crossentropy",  # measures the dissimilarity between predicted probabilities and the integer labels
              metrics=['accuracy'])  # reported during training and evaluation to judge model performance

# epochs=3: the model passes through the entire training set 3 times, updating its weights after every batch
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3)
 

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# e. EVALUATE THE NETWORK
# Evaluate the trained model on the test dataset to check its performance on unseen data
# This returns two values:
# - test_loss: the model's calculated loss (error) on the test data based on the loss function
# - test_acc: the model's accuracy (fraction of correct predictions, between 0 and 1) on the test data
test_loss, test_acc = model.evaluate(x_test, y_test)

# Print the test loss value, formatted to three decimal places
print(f"Test Loss: {test_loss:.3f}")

# Print the test accuracy value, formatted to three decimal places
print(f"Test Accuracy: {test_acc:.3f}")




In [None]:
# f. PLOT THE TRAINING LOSS AND ACCURACY
# Create a figure with a specific size for plotting (width=12, height=4)
plt.figure(figsize=(12, 4))

# Plot training & validation loss values
plt.subplot(1, 2, 1)  # Create a subplot (1 row, 2 columns, 1st plot)
plt.plot(history.history['loss'], label='Training Loss')  # Plot training loss over epochs
plt.plot(history.history['val_loss'], label='Validation Loss')  # Plot validation loss over epochs
plt.title('Training and Validation Loss')  # Title for the loss plot
plt.xlabel('Epoch')  # Label for the x-axis (epochs)
plt.ylabel('Loss')   # Label for the y-axis (loss values)
plt.legend()         # Show legend to distinguish between training and validation loss

# Plot training & validation accuracy values
plt.subplot(1, 2, 2)  # Create the second subplot (1 row, 2 columns, 2nd plot)
plt.plot(history.history['accuracy'], label='Training Accuracy')  # Plot training accuracy over epochs
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')  # Plot validation accuracy over epochs
plt.title('Training and Validation Accuracy')  # Title for the accuracy plot
plt.xlabel('Epoch')  # Label for the x-axis (epochs)
plt.ylabel('Accuracy')  # Label for the y-axis (accuracy values)
plt.legend()  # Show legend to distinguish between training and validation accuracy

# Display the entire figure with both subplots
plt.show()


In [None]:
An in-depth breakdown of the key theory concepts behind this practical on implementing Feedforward Neural Networks (FNN) with Keras and TensorFlow, which an examiner might ask about during the oral exam:

### 1. **Feedforward Neural Networks (FNN)**
   - **Definition**: A feedforward neural network is a type of artificial neural network where connections between the nodes (neurons) do not form a cycle, so information flows in one direction only, from input to output. It consists of an input layer, one or more hidden layers, and an output layer. In the fully connected (dense) form used in this practical, each neuron in one layer is connected to every neuron in the next layer.
   - **Usage**: It is used for supervised learning tasks like classification and regression.

### 2. **Keras and TensorFlow**
   - **Keras**: An open-source deep learning framework written in Python that provides a high-level interface to build and train neural networks. It runs on top of TensorFlow, allowing easy construction of complex models.
   - **TensorFlow**: An open-source machine learning library developed by Google. It provides the tools needed to build and train neural networks, and it is widely used for both deep learning and traditional machine learning tasks.

### 3. **MNIST Dataset**
   - **Definition**: The MNIST dataset is a collection of handwritten digits (0–9) used for training image processing systems. It consists of 60,000 training images and 10,000 testing images, each 28x28 pixels in size.
   - **Normalization**: Normalizing pixel values (by dividing by 255) scales the values from the range [0, 255] to [0, 1], improving the convergence speed of the neural network and reducing issues with training instability.

### 4. **Neural Network Architecture**
   - **Input Layer**: In this case, the input layer has 28x28=784 units, corresponding to the flattened 28x28 pixel values of each MNIST image.
   - **Hidden Layer**: A dense layer with 128 units and ReLU (Rectified Linear Unit) activation. ReLU introduces non-linearity by outputting the maximum between 0 and the input. It's computationally efficient and helps with the vanishing gradient problem.
   - **Output Layer**: A dense layer with 10 units, one for each digit (0–9). Softmax activation is applied here to convert the raw network outputs into probabilities that sum to 1, allowing us to interpret them as a classification of digits.

### 5. **Activation Functions**
   - **ReLU (Rectified Linear Unit)**: The ReLU function is defined as \( f(x) = \max(0, x) \). It is widely used in hidden layers because of its simplicity, efficiency, and its ability to mitigate the vanishing gradient problem.
   - **Softmax**: This function is applied to the output layer in multi-class classification tasks. It converts the raw outputs of the neurons into a probability distribution. The class with the highest probability is chosen as the model's prediction.
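
The two activation functions described above can be sketched in plain Python. This is a minimal illustration rather than the Keras implementation, and the input values are hypothetical.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [relu(v) for v in [-2.0, 0.0, 3.5]]   # [0.0, 0.0, 3.5]
probs = softmax([2.0, 1.0, 0.1])               # hypothetical logits for 3 classes
predicted_class = probs.index(max(probs))      # index of the most probable class
print(hidden, [round(p, 3) for p in probs], predicted_class)
```

The largest logit always receives the largest probability, so taking the index of the maximum gives the predicted class.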

### 6. **Stochastic Gradient Descent (SGD)**
   - **SGD Overview**: Stochastic Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively updating model parameters. It computes gradients using a random subset (mini-batch) of the data instead of the whole dataset (which is used in batch gradient descent). This makes the algorithm faster and suitable for large datasets.
   - **Learning Rate**: The learning rate is a hyperparameter that controls the size of the steps taken towards the minimum of the loss function. A higher learning rate may lead to overshooting, while a lower learning rate may slow down the convergence.
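
As a rough sketch of the update rule, the loop below runs mini-batch SGD on a one-weight model. The toy data, the learning rate and the batch size are all assumptions chosen for illustration, not values taken from the practical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical toy data: y = 3x, so the ideal weight is 3.0.
data = [(x / 10, 3.0 * x / 10) for x in range(1, 21)]

w = 0.0              # single trainable weight
learning_rate = 0.1  # step size
batch_size = 4

for epoch in range(50):
    random.shuffle(data)  # new random batches every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the squared error (w*x - y)^2 with respect to w, averaged over the batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad  # step against the gradient

print(round(w, 3))
```

Each inner-loop pass is one weight update from one mini-batch. Raising `learning_rate` far enough makes the steps overshoot and diverge, while lowering it makes convergence slower.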

### 7. **Loss Function**
   - **Sparse Categorical Cross-Entropy**: This loss function is used when the labels are integers (not one-hot encoded) and is suited for multi-class classification problems. It calculates the cross-entropy loss between the predicted class probabilities (from softmax) and the true class labels.

### 8. **Metrics**
   - **Accuracy**: Accuracy measures the percentage of correct predictions made by the model. It is defined as \( \text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \). It is commonly used for classification tasks.

### 9. **Model Evaluation**
   - **Test Loss and Test Accuracy**: After training the model, the test dataset is used to evaluate the model's performance. The test loss gives an indication of how well the model generalizes to unseen data, and test accuracy measures how many predictions were correct.

### 10. **Epochs and Iterations**
   - **Epoch**: One complete forward and backward pass through the entire training dataset. Increasing the number of epochs typically improves the model's performance, as long as overfitting doesn't occur.
   - **Iteration**: One update of the model's weights, computed from a single batch. The number of iterations in one epoch equals the number of batches the training set is divided into.
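
For the training cell in this practical, the number of weight updates can be worked out directly. The sketch assumes the Keras default batch size of 32, since no `batch_size` was passed to `fit`.

```python
import math

num_train_samples = 60000   # MNIST training set size
batch_size = 32             # Keras default when fit() is called without batch_size
epochs = 3                  # as in the training cell

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 1875 5625
```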

### 11. **Overfitting and Underfitting**
   - **Overfitting**: A model is said to overfit when it learns the training data too well, including its noise, leading to poor performance on new, unseen data.
   - **Underfitting**: A model is underfitting when it is too simple and fails to learn the underlying patterns in the data, leading to poor performance on both training and test data.
   - **Regularization**: Techniques like dropout, L2 regularization, and data augmentation can be used to reduce overfitting.

### 12. **Plotting Training Loss and Accuracy**
   - **Training Loss**: Measures how well the model is fitting the training data during each epoch.
   - **Validation Loss**: Measures how well the model is fitting the validation data (data it hasn't seen during training). It's essential to monitor this to detect overfitting.
   - **Training Accuracy**: The accuracy of the model on the training data during each epoch.
   - **Validation Accuracy**: The accuracy of the model on the validation data during each epoch.

### 13. **Model Summary**
   - The **model summary** provides an overview of the model architecture, showing the layers, their output shapes, and the number of parameters in each layer. This helps in understanding how data flows through the network and the number of trainable parameters.
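
The parameter counts in the summary printed earlier can be reproduced by hand. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

flatten_out = 28 * 28                       # Flatten has no parameters, it only reshapes
hidden = dense_params(flatten_out, 128)     # 784*128 + 128
output = dense_params(128, 10)              # 128*10 + 10
print(hidden, output, hidden + output)      # 100480 1290 101770
```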

### 14. **Model Compilation**
   - **Optimizer**: The optimizer is responsible for adjusting the weights of the network to minimize the loss function. In this case, "sgd" stands for Stochastic Gradient Descent, but other optimizers like Adam, RMSProp, etc., can also be used depending on the problem.
   - **Loss Function**: This defines the measure of the difference between the predicted output and the actual target.
   - **Metrics**: Metrics like accuracy are monitored during training and evaluation to assess model performance.

### 15. **Overfitting Detection**
   - **Validation Data**: It's common to use a validation dataset during training to track the model’s performance on unseen data. If validation loss starts increasing while training loss decreases, it is an indication of overfitting.
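   - **Cross-validation**: k-fold cross-validation trains k models, each validated on a different held-out fold of the data. It does not prevent overfitting by itself, but it gives a more reliable estimate of generalization than a single train/validation split, which is especially useful when the dataset is small.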
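
A minimal sketch of how k-fold splitting partitions sample indices is shown below. The function name `k_fold_indices` is hypothetical, and in practice the indices would usually be shuffled first.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(num_samples))
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        end = num_samples if fold == k - 1 else start + fold_size
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, val

folds = list(k_fold_indices(10, 5))
for train, val in folds:
    print("train:", train, "val:", val)
```

Every sample lands in the validation fold exactly once, so the k validation scores together cover the whole dataset.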

### 16. **Future Considerations**
   - **Hyperparameter Tuning**: Adjusting hyperparameters like the number of layers, number of neurons, activation functions, learning rate, etc., can improve model performance.
   - **Advanced Optimizers**: Instead of plain SGD, optimizers like Adam, RMSprop, and Adagrad may converge faster and avoid some problems of SGD.

These are foundational concepts, and you should be prepared to explain them with examples, especially their relevance to the practical implementation.
                                                                                      
                                                                                      
Here are some additional minor and technical points that could be relevant for the oral examination:

### 17. **TensorFlow and Keras Functions**
   - **`keras.Sequential`**: A linear stack of layers where each layer has exactly one input and one output. It is used to create feedforward neural networks where data flows through the layers sequentially.
   - **`keras.layers.Flatten`**: This layer is used to convert a multi-dimensional input (like the 28x28 MNIST image) into a 1D vector, which is required for the Dense layers.
   - **`keras.layers.Dense`**: A fully connected layer. It is called "dense" because each neuron in the layer is connected to every neuron in the previous layer. This is the primary layer used in feedforward neural networks.
   - **`activation="relu"`**: ReLU is a piecewise linear function that outputs the input if it is positive and 0 otherwise. It is computationally efficient and less prone to the vanishing gradient problem compared to sigmoid or tanh.
   - **`activation="softmax"`**: Softmax transforms the logits (raw scores) into probabilities. It ensures the sum of the output values is equal to 1, making it useful for multi-class classification tasks.

### 18. **Loss Function – Sparse Categorical Crossentropy**
   - **Sparse vs Categorical Crossentropy**: Sparse categorical crossentropy is used when the labels are integers (e.g., for MNIST, the labels are integers 0–9). In contrast, categorical crossentropy is used when the labels are one-hot encoded vectors.
   - **Cross-Entropy**: A measure of the difference between two probability distributions. It is commonly used in classification tasks as it penalizes incorrect classifications heavily.
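
The sketch below computes the loss for a single sample in both label formats to show that they agree. The probabilities are hypothetical, and the two functions are plain-Python stand-ins, not the Keras loss objects.

```python
import math

def sparse_categorical_crossentropy(label, probs):
    """Loss for an integer label: negative log of the probability given to the true class."""
    return -math.log(probs[label])

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot label: -sum(t * log(p)) over all classes."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.1, 0.7, 0.2]      # hypothetical softmax output for 3 classes
label = 1                    # integer label
one_hot = [0.0, 1.0, 0.0]    # same label, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(round(sparse_loss, 4), round(dense_loss, 4))
```

Both forms reduce to the negative log of the probability assigned to the true class, so a confident wrong prediction (true-class probability near 0) is penalized heavily.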

### 19. **Model Compilation**
   - **Optimizer ("sgd")**: Stochastic Gradient Descent updates the model parameters based on a random subset (mini-batch) of the training data. It is effective for large datasets.
   - **Loss Function ("sparse_categorical_crossentropy")**: A loss function that measures how well the predicted values match the actual values, for multi-class classification with integer labels.
   - **Metrics ("accuracy")**: Accuracy is the percentage of correct predictions. However, it can be misleading in imbalanced datasets, where other metrics like precision, recall, or F1-score are more informative.

### 20. **Learning Rate**
   - **Effect of Learning Rate**: A high learning rate might cause the model to overshoot the optimal point, while a low learning rate might lead to a slow convergence. This balance is critical for training efficiency.
   - **Learning Rate Scheduling**: The learning rate can be adjusted during training, either manually or using techniques like learning rate decay, to help the model converge faster.
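
One common schedule is exponential decay, sketched here with hypothetical values. The function name and the per-epoch decay are assumptions made for illustration.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch

# Hypothetical values chosen for illustration only.
schedule = [exponential_decay(0.1, 0.5, e) for e in range(4)]
print(schedule)  # [0.1, 0.05, 0.025, 0.0125]
```

Large early steps move quickly towards a good region, and the smaller later steps reduce overshooting near the minimum.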

### 21. **Epochs, Batches, and Iterations**
   - **Batch Size**: This refers to the number of samples processed before the model’s internal parameters are updated. Smaller batch sizes tend to provide more frequent updates, whereas larger batch sizes provide more stable estimates of the gradients.
   - **Epoch vs Iteration**: An epoch refers to one complete pass through the training dataset. An iteration refers to one update of the model's weights based on a batch of data.
   - **Mini-batch Gradient Descent**: When using SGD, the data is typically divided into small batches. This method is more computationally efficient than using the whole dataset or a single data point.

### 22. **Gradient Descent and Backpropagation**
   - **Gradient Descent**: An optimization algorithm that updates the model weights in the direction that minimizes the loss function. It does this by computing the gradient (partial derivatives) of the loss function with respect to the weights.
   - **Backpropagation**: A method used to compute the gradient of the loss function with respect to each weight in the network. It involves the chain rule from calculus to propagate the error back through the network.
   - **Learning Rate Decay**: Some models use a learning rate schedule where the learning rate decreases over time, which helps the model converge more smoothly after a certain number of epochs.
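
To make the chain rule concrete, the sketch below differentiates the loss of a single sigmoid neuron by hand and checks the result against a finite-difference estimate. The neuron, the squared-error loss and all numbers are assumptions chosen to keep the example small; the network in this practical uses ReLU, softmax and cross-entropy instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    """dL/dw via the chain rule: dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    dz_dw = x
    return dL_da * da_dz * dz_dw

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # hypothetical values
analytic = grad_w(w, b, x, y)

# Finite-difference check: (L(w + h) - L(w - h)) / 2h should agree closely.
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(analytic, numeric)
```

Backpropagation applies this same product of local derivatives layer by layer, reusing intermediate results so that every weight's gradient is obtained in one backward pass.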

### 23. **Overfitting and Underfitting**
   - **Early Stopping**: This technique stops training once the validation loss has stopped improving for a set number of epochs (the patience), which helps prevent overfitting.
   - **Dropout**: A regularization technique where randomly selected neurons are ignored (dropped out) during training, which helps prevent overfitting by reducing dependency on certain neurons.
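
The early stopping rule can be sketched as a small function over a list of validation losses. The function name and the loss values are hypothetical. In Keras the same idea is available as the `keras.callbacks.EarlyStopping` callback with its `monitor` and `patience` arguments.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the last epoch

# Hypothetical validation losses: improving, then rising (overfitting).
losses = [0.60, 0.45, 0.40, 0.42, 0.43, 0.47]
stop_at = early_stopping_epoch(losses, patience=2)
print(stop_at)  # 4
```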

### 24. **Model Evaluation**
   - **Confusion Matrix**: A confusion matrix is a performance measurement for classification problems. It shows the actual vs predicted classifications and helps in calculating metrics like precision, recall, and F1-score.
   - **Precision and Recall**: Precision measures the accuracy of positive predictions, while recall measures how well the model captures all positive instances.
   - **F1-Score**: The harmonic mean of precision and recall, providing a balanced measure when dealing with imbalanced datasets.
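
These metrics follow directly from the confusion matrix counts. The sketch below treats one class as positive. The function name and the labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as 'positive'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

For a 10-class problem like MNIST, the same calculation can be repeated once per digit, treating that digit as the positive class.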

### 25. **Model Serialization and Saving**
   - **Model Saving**: After training, a model can be saved for later use without having to retrain. Keras provides `model.save('model.h5')` to save the entire model, including the architecture, weights, and training configuration. The `.h5` extension selects the older HDF5 format; recent Keras versions recommend the native format, e.g. `model.save('model.keras')`.
   - **Loading a Saved Model**: You can use `keras.models.load_model('model.h5')` to reload a previously saved model and continue training or make predictions.

### 26. **Batch Normalization**
   - **Purpose**: Batch normalization normalizes the inputs to a given layer, which helps stabilize the training process and allows for higher learning rates. It is often used in deep networks to reduce the problem of vanishing/exploding gradients.
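
The core normalization step can be sketched for a single feature across one mini-batch. This is a simplified illustration with hypothetical values: a real batch normalization layer also learns a scale and a shift, and uses running statistics at inference time.

```python
import math

def batch_normalize(values, epsilon=1e-5):
    """Normalize one feature across a mini-batch to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon guards against division by zero when the variance is tiny.
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]   # hypothetical activations of one unit for 4 samples
normalized = batch_normalize(batch)
print([round(v, 3) for v in normalized])
```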

### 27. **Weight Initialization**
   - **Random Initialization**: Weights are typically initialized randomly, usually using a Gaussian or uniform distribution, to break symmetry and allow the network to learn different features.
   - **He and Xavier Initialization**: These methods help improve convergence by scaling the weight initialization based on the number of input or output units in a layer.
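
The scaling rules behind these two schemes can be computed directly. The sketch uses the shape of the first Dense layer in this practical (784 inputs, 128 units). The helper names are hypothetical.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    """Xavier/Glorot uniform: sample from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """He normal: standard deviation sqrt(2 / fan_in), usually paired with ReLU."""
    return math.sqrt(2.0 / fan_in)

fan_in, fan_out = 784, 128   # shape of the first Dense layer in this practical
limit = glorot_uniform_limit(fan_in, fan_out)
std = he_normal_std(fan_in)

random.seed(0)  # fixed seed so the sample is repeatable
sample_weights = [random.uniform(-limit, limit) for _ in range(5)]
print(round(limit, 4), round(std, 4))
```

Keras `Dense` layers use Glorot uniform initialization by default, so the model in this practical already starts from weights in roughly this range.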

### 28. **Gradient Clipping**
   - **Purpose**: In cases where gradients can become very large and cause instability, gradient clipping is used to cap the gradients at a predefined threshold, ensuring smoother training.
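
Clipping by norm, one common variant, can be sketched as follows. The function name and values are hypothetical. Keras optimizers expose comparable behaviour through their `clipnorm` and `clipvalue` arguments.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5.0
print(clipped)
```

Rescaling the whole vector keeps the direction of the update and only shortens the step.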

### 29. **Visualization of Training Process**
   - **Learning Curves**: Plotting the loss and accuracy over time (epochs) helps in diagnosing potential problems like overfitting, underfitting, or learning rate issues.
   - **Training vs Validation Curves**: If the training loss keeps decreasing while the validation loss increases, the model may be overfitting. If both losses stay high or decrease only slowly, the model may be underfitting or the learning rate may be too low.

### 30. **Computational Considerations**
   - **GPU vs CPU**: Training deep neural networks is computationally expensive. TensorFlow can run on GPUs, which accelerate training. MNIST and CIFAR-10 are small enough to train on a CPU, and the benefit of a GPU grows with larger models and datasets.
   - **Parallelization**: TensorFlow supports parallel processing on multiple GPUs or distributed systems to handle large-scale training tasks.

### 31. **Common Hyperparameters**
   - **Number of Hidden Layers**: A deep neural network can have multiple hidden layers. The depth and number of neurons in each hidden layer influence the model’s ability to capture complex patterns.
   - **Batch Size**: The batch size determines how many samples are used in one iteration of training. A smaller batch size gives a noisier gradient estimate and more weight updates per epoch; a larger batch size gives a more stable estimate but needs more memory per step.

### 32. **TensorFlow vs Keras**
   - **Low-Level (TensorFlow)** vs **High-Level (Keras)**: TensorFlow provides more flexibility for building custom models but requires more code and setup. Keras, on the other hand, simplifies the model-building process, making it easier for quick prototyping and experimentation.

These additional points provide a more complete understanding of deep learning concepts that could be discussed during the oral exam. Being prepared with both high-level and detailed knowledge of these concepts will give you a stronger foundation for answering questions effectively.  
                    