# MNIST classification with NNs
A first example of a simple neural network, applied to a well-known dataset.

In [12]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model
from tensorflow.keras import utils
import numpy as np

Let us load the MNIST dataset.

In [13]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [14]:
print(x_train.shape)
print("pixel range is [{},{}]".format(np.min(x_train),np.max(x_train)))

(60000, 28, 28)
pixel range is [0,255]


We normalize the input to the range [0,1].

In [15]:
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train.shape, x_test.shape

((60000, 28, 28), (10000, 28, 28))

We need to flatten each 28x28 image into a vector of 28*28 = 784 features, one per pixel.

In [16]:
# Flatten each image; -1 lets NumPy infer the 28*28 = 784 features per sample
x_train = x_train.reshape(len(x_train), -1)
x_test = x_test.reshape(len(x_test), -1)
x_train.shape, x_test.shape

((60000, 784), (10000, 784))

The output of the network will be a probability distribution over the ten digit categories. We likewise encode each label as a ground-truth distribution, and the training objective is to minimize the distance between the two, measured by categorical cross-entropy. The ground-truth distribution is the so-called "categorical" (one-hot) distribution: if x has label l, the corresponding distribution has probability 1 for category l and 0 for all the others.

In [17]:
print(y_train[0])
y_train_cat = utils.to_categorical(y_train)
print(y_train_cat[0])
y_test_cat = utils.to_categorical(y_test)

5
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]


This output shows that the digit "5" is represented by a vector of 10 values in which the entry at index 5 equals 1 and all the others equal 0. Recall that the digits in the dataset range from 0 to 9, hence the ten entries.
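
To make the encoding and the loss concrete, here is a minimal NumPy sketch. It assumes that one-hot encoding can be mimicked by indexing an identity matrix, and the predicted vector `y_pred` is invented for illustration; it is not an output of the network in this notebook.

```python
import numpy as np

# Integer labels, in the same form as y_train (values chosen for illustration).
labels = np.array([5, 0, 4])

# Row i of the 10x10 identity matrix is the one-hot vector for class i.
one_hot = np.eye(10, dtype="float32")[labels]

# A hypothetical predicted distribution for the first sample:
# 0.82 on class 5 and 0.02 on each of the other nine classes.
y_pred = np.full(10, 0.02)
y_pred[5] = 0.82

# Categorical cross-entropy: -sum_k y_true[k] * log(y_pred[k]).
# With a one-hot target this reduces to -log(probability of the true class).
loss = -np.sum(one_hot[0] * np.log(y_pred))

print(one_hot[0])
print(loss)
```

The loss approaches 0 as the probability assigned to the true class approaches 1, and grows without bound as that probability approaches 0.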

-----------

# FIRST NETWORK - Logistic Regression

Our first network just implements (multinomial) logistic regression.

## Network Construction

Code Explanation:
- __Input(shape=(784,))__ : This line defines the input layer of the neural network. The shape parameter specifies the shape of a single input sample, without the batch dimension. Here it is set to (784,), a one-element tuple (note the trailing comma: (784) alone is just the integer 784), so each input is expected to be a vector of length 784. This is the usual representation for a flattened 28x28 image in image classification.
- __Dense(10,activation='softmax')(xin)__ : This line defines a **fully connected (dense)** layer with **10 units** and a **softmax activation function**. The Dense layer connects every neuron in the previous layer (in this case, the input layer) to every neuron in the current layer. 
    - The softmax activation function is commonly used in the output layer of a classification model to obtain probabilities for each class. The output of this layer (res) represents the probabilities of the input belonging to each of the 10 classes.
- __Model(inputs=xin,outputs=res)__ : This line creates a Keras Model by specifying the inputs and outputs of the model. **xin** is specified as the input, and **res** (the output of the dense layer) is specified as the output. 
    - This essentially constructs a neural network model that takes inputs through the defined input layer, passes them through the dense layer, and outputs the resulting probabilities through the softmax activation.
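
The computation performed by this layer can be sketched in plain NumPy. This is only an illustration under stated assumptions: the weights `W` and biases `b` below are random placeholders with the right shapes, not the values Keras would initialize or learn, and `dense_softmax` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters with the shapes used by Dense(10) on 784 inputs.
W = rng.normal(scale=0.01, size=(784, 10))  # one weight per (input, unit) pair
b = np.zeros(10)                            # one bias per unit

def softmax(z):
    # Subtracting the row-wise max improves numerical stability
    # and does not change the result.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense_softmax(x):
    # x has shape (batch, 784); the output has shape (batch, 10).
    return softmax(x @ W + b)

x = rng.random((4, 784))   # four fake "images" with pixel values in [0, 1)
probs = dense_softmax(x)

print(probs.shape)
print(probs.sum(axis=1))   # each row sums to 1, up to rounding
```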

In [20]:
xin = Input(shape=(784, ))  # each input sample is a vector of 784 pixel values
res = Dense(10,activation='softmax')(xin) 

mynet = Model(inputs=xin,outputs=res)

In [21]:
mynet.summary()

__Summary__ explanation:
- It shows a summary of the network we just created.
- The input layer takes 784 values per sample, which equals the number of features.
    - The input layer does not have any parameters.
- The second layer is a __dense__ layer with 10 neurons, where __dense__ means fully connected. Each neuron has one weight per input, which gives 10*784 = 7840 weights. On top of that there is one __bias__ per neuron, so 10 biases for this layer. Since biases are trainable parameters too, the total comes to 7840 + 10 = 7850 parameters.
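
The same arithmetic can be written as a small helper. The function name `count_dense_params` is made up for this sketch, and the second call uses a hypothetical stack with one hidden layer of 128 units to show how the count grows.

```python
def count_dense_params(layer_sizes):
    # layer_sizes lists the width of each layer, input first,
    # e.g. [784, 10] for 784 inputs feeding 10 output units.
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total

print(count_dense_params([784, 10]))       # 7850
print(count_dense_params([784, 128, 10]))  # 101770
```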

## Compile and Train 

After constructing the model, the next steps are to compile it, specifying the loss function and the optimization algorithm, and then to train it on the data.

**Compiling the Model**: Before training the model, you need to compile it. Compiling configures the model for training by specifying the optimizer, the loss function, and optional metrics. We pass two key arguments:
*   the **optimizer**, which governs how the weights are updated from the gradients computed by backpropagation
    - **optimizer='adam'**: This selects the Adam optimization algorithm, a popular choice for training neural networks because it combines momentum with adaptive learning rates.
*   the **loss function**
    - **loss='categorical_crossentropy'**: This selects the loss function to minimize. __categorical_crossentropy__ is the usual choice for multi-class classification with one-hot encoded labels, as here.
* Optionally, we can specify additional metrics, mostly meant for monitoring the training process.
    - **metrics=['accuracy']**: This tracks the accuracy of the model during training.
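
As a rough illustration of what the accuracy metric measures for one-hot targets, here is a NumPy sketch. The function `categorical_accuracy` and the toy arrays are assumptions made for this example; the sketch mirrors the idea, not the Keras implementation.

```python
import numpy as np

def categorical_accuracy(y_true_one_hot, y_pred_probs):
    # Compare the index of the largest predicted probability
    # with the index of the 1 in the one-hot target.
    true_cls = np.argmax(y_true_one_hot, axis=1)
    pred_cls = np.argmax(y_pred_probs, axis=1)
    return np.mean(true_cls == pred_cls)

y_true = np.eye(3)[[0, 1, 2, 1]]   # four samples, three classes
y_pred = np.array([
    [0.7, 0.2, 0.1],   # correct: predicts class 0
    [0.1, 0.8, 0.1],   # correct: predicts class 1
    [0.5, 0.3, 0.2],   # wrong: predicts class 0, true class is 2
    [0.2, 0.6, 0.2],   # correct: predicts class 1
])

print(categorical_accuracy(y_true, y_pred))  # 0.75
```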

In [22]:
mynet.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

**Training the Model:** After compiling the model, you can train it using the **fit** method by providing input data and corresponding target labels.

Fitting requires just two arguments: the training data and the ground truth, that is, x and y. Additionally, we can specify epochs, batch_size, and many other arguments.

In particular, passing validation data allows the training procedure to measure loss and metrics on the validation set at the end of each epoch. Note that here the test set is used as validation data.
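
The relation between the dataset size, batch_size, and the number of update steps per epoch can be checked with a few lines of arithmetic. The values below are the ones used in this notebook; the step count is what the progress bar reports as 1875/1875.

```python
import math

n_samples = 60000   # number of training images in x_train
batch_size = 32
epochs = 10

# One step processes one batch; the last batch may be smaller, hence the ceil.
steps_per_epoch = math.ceil(n_samples / batch_size)

print(steps_per_epoch)            # 1875
print(steps_per_epoch * epochs)   # 18750 weight updates in total
```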

In [23]:
mynet.fit(x_train, y_train_cat, shuffle=True, epochs=10, batch_size=32,validation_data=(x_test,y_test_cat))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.8251 - loss: 0.6964 - val_accuracy: 0.9118 - val_loss: 0.3115
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.9122 - loss: 0.3157 - val_accuracy: 0.9220 - val_loss: 0.2797
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.9196 - loss: 0.2880 - val_accuracy: 0.9253 - val_loss: 0.2722
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.9242 - loss: 0.2696 - val_accuracy: 0.9266 - val_loss: 0.2692
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9244 - loss: 0.2710 - val_accuracy: 0.9268 - val_loss: 0.2666
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9271 - loss: 0.2585 - val_accuracy: 0.9247 - val_loss: 0.2675
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x7a60c715bd00>

## Another Network with 1 more layer
In the code below we define another network with one more Dense layer. We want to illustrate the impact of an additional layer on accuracy during training.

The new hidden layer has 128 neurons, which gives the network many more parameters, as the network summary shows.

- We use the __relu__ activation function for the hidden layer and the __softmax__ activation function for the output layer, since the output is a probability distribution over the categories.

#### Construction
Constructing the network:

In [26]:
xin = Input(shape=(784, ))
x = Dense(128,activation='relu')(xin) 
res = Dense(10,activation='softmax')(x)

mynet2 = Model(inputs=xin,outputs=res)

In [27]:
mynet2.summary()

#### Compile and Train
We compile the network with the same configuration as the previous one.

In [29]:
mynet2.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

Let's train it on the training data.

In [30]:
mynet2.fit(x_train,y_train_cat, shuffle=True, epochs=10, batch_size=32, validation_data=(x_test,y_test_cat))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.8788 - loss: 0.4254 - val_accuracy: 0.9599 - val_loss: 0.1338
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.9667 - loss: 0.1149 - val_accuracy: 0.9726 - val_loss: 0.0913
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 4ms/step - accuracy: 0.9760 - loss: 0.0774 - val_accuracy: 0.9727 - val_loss: 0.0870
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.9829 - loss: 0.0558 - val_accuracy: 0.9772 - val_loss: 0.0731
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.9871 - loss: 0.0419 - val_accuracy: 0.9791 - val_loss: 0.0718
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 4ms/step - accuracy: 0.9908 - loss: 0.0309 - val_accuracy: 0.9777 - val_loss: 0.0748
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x7a609c2726e0>

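The result shows a clear improvement: after six epochs the validation accuracy is about 0.978, compared with about 0.925 for the logistic regression network.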

## 3-Layered Network
In the next cell we create another network, which has:
- One more Dense layer
- ReLU activation in both hidden layers
- Sparse Categorical Crossentropy as the loss, so that it is trained on the integer labels directly

We have also plotted our network and its result.

In [39]:
xin = Input(shape=(784, ))
x = Dense(128,activation='relu')(xin)
x2 = Dense(128,activation='relu')(x)
res = Dense(10,activation='softmax')(x2)

mynet3 = Model(inputs=xin,outputs=res)
mynet3.summary()

**Change Explanation**:
- __Difference between Categorical Crossentropy and Sparse Categorical Crossentropy__
    - **Categorical Crossentropy:** This loss function is typically used when the target labels are one-hot encoded. 
        - One-hot encoding represents each class as a binary vector where only one element is nonzero, indicating the class index.
        - The output of the model is compared to the one-hot encoded target labels.
        - It calculates the cross-entropy loss between the predicted probabilities and the one-hot encoded target labels.
        - Suitable for multi-class classification tasks when the classes are mutually exclusive.
    - **Sparse Categorical Crossentropy:** This loss function is used when the target labels are integers, not one-hot encoded.
        - Instead of one-hot encoding, each target label is represented as an integer indicating the class index directly.
        - The output of the model is compared to the integer target labels.
        - It calculates the cross-entropy loss between the predicted probabilities and the integer target labels.
        - Convenient when you have a large number of classes, as it saves memory and computational resources by avoiding one-hot encoding.
        - It's especially useful when dealing with large datasets where one-hot encoding could lead to memory issues.
        
- __Other Activation functions__
    - There are several activation functions commonly used in neural networks, each with its own characteristics and suitability for different types of problems. Here are some of the most commonly used activation functions:
    1. **Sigmoid**: The sigmoid function squashes the input values between 0 and 1. It's commonly used in the output layer of binary classification problems where the goal is to predict probabilities.
$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
    2. **ReLU (Rectified Linear Unit)**: ReLU sets all negative values to zero and leaves positive values unchanged. It's known for its simplicity and effectiveness in training deep neural networks.
$$\text{ReLU}(x) = \max(0, x)$$
    3. **Leaky ReLU**: Leaky ReLU is similar to ReLU but keeps a small, nonzero slope $\alpha$ for negative input values, so the gradient never becomes exactly zero. This can help with the "dying ReLU" problem, where units get stuck outputting zero.
$$ \text{Leaky ReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{otherwise} \end{cases} $$
    4. **Tanh (Hyperbolic Tangent)**: Tanh squashes the input values between -1 and 1, making it zero-centered. It's often used in hidden layers of neural networks.
$$\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
    5. **Softmax**: Softmax is used in the output layer of multi-class classification problems. It converts raw scores (logits) into probabilities that sum up to 1.
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
    6. **Linear**: Linear activation simply outputs the input value without applying any transformation. It's often used in the output layer for regression tasks.
$$\text{Linear}(x) = x$$
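
The formulas above translate almost literally into NumPy. The sketch below is for illustration only: the slope `alpha=0.01` for Leaky ReLU is an assumed example value, and the softmax shown here handles a single vector rather than a batch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the slope applied to negative inputs (illustrative value).
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def linear(x):
    return x

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))      # values in (0, 1)
print(relu(x))         # negatives clipped to 0
print(leaky_relu(x))   # negatives scaled by alpha
print(np.tanh(x))      # values in (-1, 1)
print(softmax(x))      # non-negative values that sum to 1
print(linear(x))       # unchanged
```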

In [44]:
mynet3.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
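
The two losses differ only in how the target is encoded, not in the value they compute. The following NumPy sketch illustrates this on invented probabilities; it mimics the definitions of the two losses and is not the Keras implementation.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples over 4 classes.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])
labels = np.array([0, 1, 3])    # integer labels
one_hot = np.eye(4)[labels]     # the same labels, one-hot encoded

# Categorical cross-entropy: sums over classes using the one-hot target.
cce = -np.sum(one_hot * np.log(probs), axis=1)

# Sparse categorical cross-entropy: picks the true-class probability by index.
scce = -np.log(probs[np.arange(len(labels)), labels])

print(np.allclose(cce, scce))  # True
```

This is why the network can be fitted on `y_train` directly, without the one-hot encoded `y_train_cat`.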

In [46]:
mynet3.fit(x_train, y_train, shuffle=True, epochs=10, batch_size=32, validation_data=(x_test,y_test))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9824 - loss: 0.0537 - val_accuracy: 0.9741 - val_loss: 0.0832
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9875 - loss: 0.0406 - val_accuracy: 0.9775 - val_loss: 0.0771
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9900 - loss: 0.0311 - val_accuracy: 0.9765 - val_loss: 0.0825
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9916 - loss: 0.0252 - val_accuracy: 0.9811 - val_loss: 0.0696
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9929 - loss: 0.0223 - val_accuracy: 0.9786 - val_loss: 0.0825
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9940 - loss: 0.0167 - val_accuracy: 0.9766 - val_loss: 0.0996
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x7a6075759600>