**<em>Q: In the next part of this DL book, this code is called the 'dropout code'. What's the reason?</em>**
```python
import sys
import numpy as np  # the code below refers to NumPy as `np`
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

alpha, iterations, hidden_size = (0.005, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0,0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1,weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))
        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j%10 == 0):
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0,weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))

        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] +\
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images)))+\
                         " Train-Err:" + str(error/ float(len(images)))[0:5] +\
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))
```

A: The code you posted includes dropout, a regularization technique used to reduce overfitting in neural networks. In this particular code, dropout is implemented through the `dropout_mask`.

Here's the relevant part where dropout is applied:

```python
dropout_mask = np.random.randint(2, size=layer_1.shape)
layer_1 *= dropout_mask * 2
```

1. **`dropout_mask = np.random.randint(2, size=layer_1.shape)`**: Creates a mask where each element is either 0 or 1. The mask has the same shape as `layer_1`.

   **<em>Q: What does '2' do in "dropout_mask = np.random.randint(2, size=layer_1.shape)"?</em>**

   In the expression `np.random.randint(2, size=layer_1.shape)`, the number `2` specifies the upper bound (exclusive) for the random integers generated by NumPy's `randint` function. Specifically, the function draws random integers from the "discrete uniform" distribution over the interval `[0, 2)`, meaning it generates either 0 or 1.

   So, in the context of dropout, this line creates a "mask" of 0s and 1s with the same shape as the `layer_1` output activations. This mask is then multiplied element-wise with `layer_1`, effectively "dropping out" certain neurons by setting their activations to zero.

2. **`layer_1 *= dropout_mask * 2`**: The `dropout_mask` is applied to `layer_1`. This essentially sets approximately half of the activations in `layer_1` to zero. The remaining activations are scaled by 2 to keep the expected sum of activations the same.
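To make the effect of these two lines concrete, here is a small sketch. It assumes a stand-in `layer_1` filled with ones (not real activations) and a large width chosen only so that the averages are stable; the seed is arbitrary.

```python
import numpy as np

np.random.seed(0)                      # arbitrary seed, only for repeatability

layer_1 = np.ones((1, 100_000))        # stand-in activations, all equal to 1.0

# Integers drawn uniformly from [0, 2): every entry is 0 or 1
dropout_mask = np.random.randint(2, size=layer_1.shape)

# Zero out roughly half of the units and double the survivors
dropped = layer_1 * dropout_mask * 2

print("fraction kept:", dropout_mask.mean())
print("mean before  :", layer_1.mean())
print("mean after   :", dropped.mean())
```

The fraction kept should come out close to 0.5, and the mean after dropout should stay close to the mean before it, which is the point of the factor of 2.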

This dropout mask is also later applied to `layer_1_delta`:

```python
layer_1_delta *= dropout_mask
```

This ensures that during backpropagation, the neurons that were "dropped out" (set to zero) during the forward pass will also have no influence on learning; their corresponding gradients will be zero.

The rest of the code is similar to the previous example you provided. It does forward and backward passes, computes errors and updates weights. The key difference is the inclusion of dropout, which adds a form of regularization to the model.

The dropout is not applied to the test set; it's only used during training. This is standard practice. Dropout is a form of regularization used to make the model generalize better from the training set to unseen data.

So, the code is called 'Dropout code' because it incorporates dropout regularization in the neural network training process.
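The train/test difference described above can be sketched as a single forward function with a switch. The `forward` helper and its `training` flag are hypothetical names introduced for illustration; the code in the question inlines the same steps instead. The input is a random stand-in for one flattened image, not MNIST data.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def forward(layer_0, weights_0_1, weights_1_2, training):
    """Forward pass; the dropout mask is applied only when training is True."""
    layer_1 = relu(layer_0.dot(weights_0_1))
    dropout_mask = None
    if training:
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 = layer_1 * dropout_mask * 2
    layer_2 = layer_1.dot(weights_1_2)
    return layer_2, dropout_mask

np.random.seed(1)
weights_0_1 = 0.2 * np.random.random((784, 100)) - 0.1
weights_1_2 = 0.2 * np.random.random((100, 10)) - 0.1
layer_0 = np.random.random((1, 784))   # stand-in for one flattened image

train_out, mask = forward(layer_0, weights_0_1, weights_1_2, training=True)
test_out_a, _ = forward(layer_0, weights_0_1, weights_1_2, training=False)
test_out_b, _ = forward(layer_0, weights_0_1, weights_1_2, training=False)

# Without dropout the forward pass is deterministic
print(np.array_equal(test_out_a, test_out_b))
```

With `training=False` no random mask is drawn, so repeated evaluations of the same input give identical outputs.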

In [3]:
from tensorflow.keras.datasets import mnist
import numpy as np
import sys

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing the data
images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])
test_images, test_labels = (x_test.reshape(len(x_test), 28*28) / 255, y_test)

np.random.seed(1)
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

alpha, iterations, hidden_size = (0.0001, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

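for j in range(iterations): # training loop: forward pass + backpropagation
    error, correct_cnt = (0.0,0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(layer_0.dot(weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # randomly switch off a proportion of the neurons to prevent overfitting
        layer_1 *= dropout_mask * 2 # multiplying by 2 compensates for the roughly 50% of neurons that are switched off, so the training-time output roughly matches what is expected with all neurons active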
        layer_2 = layer_1.dot(weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2)**2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

        layer_2_delta = labels[i:i+1] - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask
        
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    #sys.stdout.write("\r I:" + str(j) + " Train-Err:" + str(error/float(len(images)))[0:5] + " Train-Acc:" + str(correct_cnt/float(len(images))))
    if(j % 10 == 0):
        test_error, test_correct_cnt = (0.0,0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            test_error += np.sum((layer_2 - test_labels[i:i+1])**2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] + \
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images))) + \
                         " Train-Err:" + str(error/ float(len(images)))[0:5] + \
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:61.74 Test-Acc:0.0788 Train-Err:127.7 Train-Acc:0.086
I:10 Test-Err:44.66 Test-Acc:0.0893 Train-Err:41.95 Train-Acc:0.089
I:20 Test-Err:45.05 Test-Acc:0.0709 Train-Err:39.04 Train-Acc:0.089
I:30 Test-Err:45.51 Test-Acc:0.102 Train-Err:37.23 Train-Acc:0.087
I:40 Test-Err:45.04 Test-Acc:0.131 Train-Err:35.71 Train-Acc:0.093
I:50 Test-Err:44.55 Test-Acc:0.0902 Train-Err:34.18 Train-Acc:0.123
I:60 Test-Err:45.74 Test-Acc:0.1013 Train-Err:33.05 Train-Acc:0.109
I:70 Test-Err:45.09 Test-Acc:0.1178 Train-Err:33.39 Train-Acc:0.111
I:80 Test-Err:45.71 Test-Acc:0.1799 Train-Err:31.28 Train-Acc:0.101
I:90 Test-Err:45.15 Test-Acc:0.1088 Train-Err:31.37 Train-Acc:0.154
I:100 Test-Err:45.51 Test-Acc:0.2027 Train-Err:29.13 Train-Acc:0.154
I:110 Test-Err:45.96 Test-Acc:0.1779 Train-Err:29.89 Train-Acc:0.178
I:120 Test-Err:46.80 Test-Acc:0.1543 Train-Err:28.98 Train-Acc:0.161
I:130 Test-Err:46.32 Test-Acc:0.0618 Train-Err:28.99 Train-Acc:0.103
I:140 Test-Err:48.59 Test-Acc:0.1985 Train-Err

In [5]:
#edit alpha = 0.0005
#add one-hot
from tensorflow.keras.datasets import mnist
import numpy as np
import sys

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing the data
images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

alpha, iterations, hidden_size = (0.0005, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

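for j in range(iterations): # training loop: forward pass + backpropagation
    error, correct_cnt = (0.0,0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(layer_0.dot(weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # randomly switch off a proportion of the neurons to prevent overfitting
        layer_1 *= dropout_mask * 2 # multiplying by 2 compensates for the roughly 50% of neurons that are switched off, so the training-time output roughly matches what is expected with all neurons active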
        layer_2 = layer_1.dot(weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2)**2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

        layer_2_delta = labels[i:i+1] - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask
        
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    #sys.stdout.write("\r I:" + str(j) + " Train-Err:" + str(error/float(len(images)))[0:5] + " Train-Acc:" + str(correct_cnt/float(len(images))))
    if(j % 10 == 0):
        test_error, test_correct_cnt = (0.0,0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            test_error += np.sum((layer_2 - test_labels[i:i+1])**2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] + \
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images))) + \
                         " Train-Err:" + str(error/ float(len(images)))[0:5] + \
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:0.912 Test-Acc:0.2563 Train-Err:1.391 Train-Acc:0.124
I:10 Test-Err:0.628 Test-Acc:0.6718 Train-Err:0.663 Train-Acc:0.615
I:20 Test-Err:0.570 Test-Acc:0.7188 Train-Err:0.594 Train-Acc:0.691
I:30 Test-Err:0.537 Test-Acc:0.7453 Train-Err:0.556 Train-Acc:0.703
I:40 Test-Err:0.514 Test-Acc:0.7626 Train-Err:0.536 Train-Acc:0.715
I:50 Test-Err:0.498 Test-Acc:0.7741 Train-Err:0.515 Train-Acc:0.741
I:60 Test-Err:0.485 Test-Acc:0.7866 Train-Err:0.500 Train-Acc:0.755
I:70 Test-Err:0.473 Test-Acc:0.7883 Train-Err:0.487 Train-Acc:0.758
I:80 Test-Err:0.471 Test-Acc:0.7868 Train-Err:0.487 Train-Acc:0.764
I:90 Test-Err:0.464 Test-Acc:0.7903 Train-Err:0.481 Train-Acc:0.774
I:100 Test-Err:0.459 Test-Acc:0.7941 Train-Err:0.473 Train-Acc:0.777
I:110 Test-Err:0.451 Test-Acc:0.7955 Train-Err:0.458 Train-Acc:0.792
I:120 Test-Err:0.449 Test-Acc:0.7977 Train-Err:0.460 Train-Acc:0.788
I:130 Test-Err:0.447 Test-Acc:0.8003 Train-Err:0.458 Train-Acc:0.789
I:140 Test-Err:0.449 Test-Acc:0.7998 Train-E

## Batch Gradient Descent

In the code you've provided, the concept of `batch_size` plays a pivotal role in the training of the neural network. The `batch_size` defines how many samples from the training dataset are used in each iteration to update the weights. It's set to 100, meaning in each training step, 100 images are processed before a weight update occurs. Let's break down how it's used in the code:

1. `batch_size = 100`: This sets the batch size to 100.

2. `for i in range(int(len(images) / batch_size)):` This loop iterates through the training set in chunks of `batch_size`. For example, if there are 5000 images, this loop would run 50 times (5000/100).

3. `batch_start, batch_end = ((i * batch_size),((i+1)*batch_size))`: This line calculates the start and end indices for the current batch of images and labels.

4. `layer_0 = images[batch_start:batch_end]`: This line selects a batch of images (100 images, as per the defined `batch_size`) from the dataset.

5. `labels[batch_start:batch_end]`: Similarly, the labels corresponding to the current batch of images are selected.

6. `error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)`: The error for the current batch is calculated and summed up with the error from the previous batches. The error is normalized later by the total number of images (`float(len(images))`).

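7. `for k in range(batch_size):`: This loop iterates through each example in the current batch to count the correctly classified examples. In the code as written, the delta computation and the weight updates are also indented inside this loop, so they run `batch_size` times per batch.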

8. `layer_2_delta = (labels[batch_start:batch_end]-layer_2)/batch_size`: The error gradient (`layer_2_delta`) is scaled by dividing it by `batch_size`, which is a common technique to average the gradient across the batch.

9. Later in the code, the weights (`weights_1_2` and `weights_0_1`) are updated based on these batch calculations.

In summary, `batch_size` is crucial for breaking the dataset into manageable chunks that can be used for training the network in each iteration. It's an essential hyperparameter that can affect both the efficiency and performance of a neural network. Given your interest in machine learning, understanding the role of batching and its impact on training dynamics could be particularly valuable for your research.
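The steps above can be sketched as one pass over the data with a single weight update per batch. This is a minimal sketch, not the notebook's code: it uses random stand-in data instead of MNIST, leaves dropout out to keep the focus on batching, and places the update outside any per-example loop. The sizes, the learning rate, and the `num_updates` counter are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return np.maximum(x, 0)

# Random stand-in data (an assumption; not MNIST)
num_samples, pixels_per_image, hidden_size, num_labels = 200, 784, 100, 10
images = np.random.random((num_samples, pixels_per_image))
labels = np.eye(num_labels)[np.random.randint(num_labels, size=num_samples)]  # one-hot rows

batch_size, alpha = 100, 0.001
weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

num_updates = 0
for i in range(len(images) // batch_size):
    batch_start, batch_end = i * batch_size, (i + 1) * batch_size
    layer_0 = images[batch_start:batch_end]          # shape (batch_size, 784)
    batch_labels = labels[batch_start:batch_end]     # shape (batch_size, 10)

    layer_1 = relu(layer_0.dot(weights_0_1))
    layer_2 = layer_1.dot(weights_1_2)

    # Dividing by batch_size averages the error signal over the batch
    layer_2_delta = (batch_labels - layer_2) / batch_size
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * (layer_1 > 0)

    # One update per batch
    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    num_updates += 1

print("weight updates in this pass:", num_updates)
```

With 200 samples and a batch size of 100, the loop performs two updates per pass, compared with 200 updates when each image is processed on its own.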

In [9]:
#edit alpha = 0.0005
#add one-hot
#add batch size
from tensorflow.keras.datasets import mnist
import numpy as np
import sys

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing the data
images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

batch_size = 100
alpha, iterations, hidden_size = (0.0005, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations): # training loop: forward pass + backpropagation
    error, correct_cnt = (0.0,0)
    for i in range(int(len(images) / batch_size )):
        batch_start, batch_end = ((i * batch_size),((i+1) * batch_size))

        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(layer_0.dot(weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # randomly switch off a proportion of the neurons to prevent overfitting
        layer_1 *= dropout_mask * 2 # multiplying by 2 compensates for the roughly 50% of neurons that are switched off, so the training-time output roughly matches what is expected with all neurons active
        layer_2 = layer_1.dot(weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2)**2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

            layer_2_delta = (labels[batch_start:batch_end] - layer_2)/batch_size
            layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
            layer_1_delta *= dropout_mask
        
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j % 10 == 0):
        test_error, test_correct_cnt = (0.0,0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            test_error += np.sum((layer_2 - test_labels[i:i+1])**2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] + \
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images))) + \
                         " Train-Err:" + str(error/ float(len(images)))[0:5] + \
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:0.910 Test-Acc:0.2586 Train-Err:1.428 Train-Acc:0.123
I:10 Test-Err:0.627 Test-Acc:0.6715 Train-Err:0.663 Train-Acc:0.618
I:20 Test-Err:0.569 Test-Acc:0.7189 Train-Err:0.594 Train-Acc:0.692
I:30 Test-Err:0.537 Test-Acc:0.7466 Train-Err:0.556 Train-Acc:0.704
I:40 Test-Err:0.514 Test-Acc:0.7631 Train-Err:0.536 Train-Acc:0.718
I:50 Test-Err:0.498 Test-Acc:0.7732 Train-Err:0.516 Train-Acc:0.738
I:60 Test-Err:0.486 Test-Acc:0.7857 Train-Err:0.501 Train-Acc:0.755
I:70 Test-Err:0.474 Test-Acc:0.7885 Train-Err:0.488 Train-Acc:0.755
I:80 Test-Err:0.473 Test-Acc:0.7858 Train-Err:0.488 Train-Acc:0.761
I:90 Test-Err:0.466 Test-Acc:0.7892 Train-Err:0.482 Train-Acc:0.774
I:100 Test-Err:0.461 Test-Acc:0.7932 Train-Err:0.475 Train-Acc:0.781
I:110 Test-Err:0.453 Test-Acc:0.7948 Train-Err:0.460 Train-Acc:0.79
I:120 Test-Err:0.452 Test-Acc:0.7961 Train-Err:0.462 Train-Acc:0.789
I:130 Test-Err:0.449 Test-Acc:0.7985 Train-Err:0.460 Train-Acc:0.788
I:140 Test-Err:0.452 Test-Acc:0.7972 Train-Er

The output provides a summary of the model's performance at regular intervals (every 10 iterations, as per your code). The key statistics include:

- **Train-Err and Test-Err**: These are the training and test errors, respectively. Lower values are generally better, and it's good to see these values decreasing over time.
- **Train-Acc and Test-Acc**: These are the training and test accuracies, respectively. Higher values are better, and you want to see these numbers going up over iterations.

### Observations

1. **Training Error vs. Testing Error**: Both training and test errors are generally decreasing, which is a good sign. It suggests the model is learning effectively from the data.
  
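2. **Training Accuracy vs. Testing Accuracy**: Both are increasing as well, which is positive. The test accuracy is higher than the training accuracy. That is plausible here, because the training metrics are computed with dropout active (about half of the hidden units zeroed), while the test pass uses the full network.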

3. **Convergence**: The model seems to be converging as the changes in errors and accuracies are becoming smaller.

4. **Overfitting**: There is no clear sign of overfitting here, as the test error continues to decrease and test accuracy continues to increase.

### Potential Action Points:

1. **Learning Rate**: The learning rate `alpha = 0.0005` is a hyperparameter you might want to experiment with. Different learning rates might lead to faster convergence or better generalization.

2. **Batch Size**: You could experiment with different batch sizes to see how they affect the training dynamics. Note that this NumPy code runs on the CPU, so a GPU would not speed it up as written.

3. **Dropout**: You implemented dropout to combat overfitting, which is a good strategy. You could try varying the dropout rate and observe its effects on overfitting.

4. **More Hidden Layers**: With just one hidden layer, the model is quite simple. Given your computational resources and academic background, it might be worth trying a deeper model to see how it performs.

5. **Iterations**: The number of iterations is set to 300. Depending on the final metrics, you might want to increase/decrease this number.

Overall, the code and its output suggest that the model is performing reasonably well, but there's room for further optimization and experimentation.
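For the dropout-rate experiment mentioned above, one way to generalize the fixed 50% mask is sketched here. The `apply_dropout` helper and its `keep_prob` parameter are hypothetical names, not part of the notebook; with `keep_prob = 0.5` the scaling factor `1 / keep_prob` equals the `* 2` used in the code. The input is a stand-in array of ones.

```python
import numpy as np

def apply_dropout(layer, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale the survivors by 1 / keep_prob."""
    mask = (rng.random(layer.shape) < keep_prob).astype(layer.dtype)
    return layer * mask / keep_prob, mask

rng = np.random.default_rng(1)       # arbitrary seed, only for repeatability
layer_1 = np.ones((1, 100_000))      # stand-in activations

for keep_prob in (0.5, 0.8):
    dropped, mask = apply_dropout(layer_1, keep_prob, rng)
    print("keep_prob:", keep_prob,
          "fraction kept:", mask.mean(),
          "mean after dropout:", dropped.mean())
```

In both cases the mean activation after dropout should stay close to the mean before it, so the next layer sees inputs on a similar scale whichever rate is chosen. The same mask would also need to be applied in the backward pass, as the notebook does with `layer_1_delta *= dropout_mask`.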

In [10]:
#edit alpha = 0.0001
#add one-hot
#add batch size
from tensorflow.keras.datasets import mnist
import numpy as np
import sys

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing the data
images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

batch_size = 100
alpha, iterations, hidden_size = (0.0001, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations): # training loop: forward pass + backpropagation
    error, correct_cnt = (0.0,0)
    for i in range(int(len(images) / batch_size )):
        batch_start, batch_end = ((i * batch_size),((i+1) * batch_size))

        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(layer_0.dot(weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # randomly switch off a proportion of the neurons to prevent overfitting
        layer_1 *= dropout_mask * 2 # multiplying by 2 compensates for the roughly 50% of neurons that are switched off, so the training-time output roughly matches what is expected with all neurons active
        layer_2 = layer_1.dot(weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2)**2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

            layer_2_delta = (labels[batch_start:batch_end] - layer_2)/batch_size
            layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
            layer_1_delta *= dropout_mask
        
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j % 10 == 0):
        test_error, test_correct_cnt = (0.0,0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            test_error += np.sum((layer_2 - test_labels[i:i+1])**2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] + \
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images))) + \
                         " Train-Err:" + str(error/ float(len(images)))[0:5] + \
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:1.118 Test-Acc:0.1123 Train-Err:1.767 Train-Acc:0.085
I:10 Test-Err:0.802 Test-Acc:0.4154 Train-Err:0.958 Train-Acc:0.286
I:20 Test-Err:0.720 Test-Acc:0.5481 Train-Err:0.802 Train-Acc:0.448
I:30 Test-Err:0.679 Test-Acc:0.6122 Train-Err:0.741 Train-Acc:0.521
I:40 Test-Err:0.653 Test-Acc:0.6472 Train-Err:0.706 Train-Acc:0.569
I:50 Test-Err:0.634 Test-Acc:0.6677 Train-Err:0.674 Train-Acc:0.606
I:60 Test-Err:0.618 Test-Acc:0.6839 Train-Err:0.656 Train-Acc:0.627
I:70 Test-Err:0.604 Test-Acc:0.6944 Train-Err:0.628 Train-Acc:0.664
I:80 Test-Err:0.593 Test-Acc:0.7018 Train-Err:0.621 Train-Acc:0.663
I:90 Test-Err:0.583 Test-Acc:0.7111 Train-Err:0.609 Train-Acc:0.671
I:100 Test-Err:0.575 Test-Acc:0.7172 Train-Err:0.608 Train-Acc:0.671
I:110 Test-Err:0.566 Test-Acc:0.7237 Train-Err:0.587 Train-Acc:0.687
I:120 Test-Err:0.559 Test-Acc:0.7284 Train-Err:0.584 Train-Acc:0.704
I:130 Test-Err:0.552 Test-Acc:0.7336 Train-Err:0.578 Train-Acc:0.704
I:140 Test-Err:0.546 Test-Acc:0.7393 Train-E

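**<em>All the final metrics got worse: with alpha = 0.0001, the accuracy at iteration 140 is lower and the error is higher than with alpha = 0.0005.</em>**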

In [11]:
#edit alpha = 0.0001
#add one-hot
#edit batch size = 500
from tensorflow.keras.datasets import mnist
import numpy as np
import sys

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing the data
images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

batch_size = 500
alpha, iterations, hidden_size = (0.0001, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations): # training loop: forward pass + backpropagation
    error, correct_cnt = (0.0,0)
    for i in range(int(len(images) / batch_size )):
        batch_start, batch_end = ((i * batch_size),((i+1) * batch_size))

        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(layer_0.dot(weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # randomly switch off a proportion of the neurons to prevent overfitting
        layer_1 *= dropout_mask * 2 # multiplying by 2 compensates for the roughly 50% of neurons that are switched off, so the training-time output roughly matches what is expected with all neurons active
        layer_2 = layer_1.dot(weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2)**2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

            layer_2_delta = (labels[batch_start:batch_end] - layer_2)/batch_size
            layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
            layer_1_delta *= dropout_mask
        
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j % 10 == 0):
        test_error, test_correct_cnt = (0.0,0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            test_error += np.sum((layer_2 - test_labels[i:i+1])**2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] + \
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images))) + \
                         " Train-Err:" + str(error/ float(len(images)))[0:5] + \
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:1.091 Test-Acc:0.1178 Train-Err:1.882 Train-Acc:0.085
I:10 Test-Err:0.800 Test-Acc:0.4166 Train-Err:0.959 Train-Acc:0.285
I:20 Test-Err:0.719 Test-Acc:0.5501 Train-Err:0.801 Train-Acc:0.452
I:30 Test-Err:0.678 Test-Acc:0.6135 Train-Err:0.741 Train-Acc:0.519
I:40 Test-Err:0.652 Test-Acc:0.6482 Train-Err:0.706 Train-Acc:0.569
I:50 Test-Err:0.633 Test-Acc:0.6682 Train-Err:0.674 Train-Acc:0.606
I:60 Test-Err:0.617 Test-Acc:0.6845 Train-Err:0.655 Train-Acc:0.625
I:70 Test-Err:0.604 Test-Acc:0.6949 Train-Err:0.628 Train-Acc:0.664
I:80 Test-Err:0.593 Test-Acc:0.7022 Train-Err:0.621 Train-Acc:0.661
I:90 Test-Err:0.583 Test-Acc:0.7115 Train-Err:0.609 Train-Acc:0.672
I:100 Test-Err:0.575 Test-Acc:0.7175 Train-Err:0.608 Train-Acc:0.672
I:110 Test-Err:0.566 Test-Acc:0.7238 Train-Err:0.587 Train-Acc:0.688
I:120 Test-Err:0.559 Test-Acc:0.7286 Train-Err:0.583 Train-Acc:0.703
I:130 Test-Err:0.552 Test-Acc:0.7341 Train-Err:0.578 Train-Acc:0.706
I:140 Test-Err:0.545 Test-Acc:0.7397 Train-E

**<em>The output is nearly the same as in the previous run, so changing the batch size from 100 to 500 did not change the result in this code.</em>**
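One possible reason for the matching outputs, offered as a hedged reading of the cells above: the weight updates are indented inside `for k in range(batch_size)`, and `layer_2` is not recomputed inside that loop, so the same update is applied `batch_size` times. For the output weights this cancels the division by `batch_size`. The sketch below checks that arithmetic on random stand-in arrays (not the notebook's data). It covers only `weights_1_2`; for `weights_0_1` the cancellation is only approximate, because `weights_1_2` changes between repetitions.

```python
import numpy as np

np.random.seed(1)
batch_size, hidden_size, num_labels, alpha = 100, 100, 10, 0.0001

# Random stand-ins for one batch of hidden activations, outputs, and one-hot labels
layer_1 = np.random.random((batch_size, hidden_size))
layer_2 = np.random.random((batch_size, num_labels))
batch_labels = np.eye(num_labels)[np.random.randint(num_labels, size=batch_size)]

# Averaged error signal, as in the notebook
layer_2_delta = (batch_labels - layer_2) / batch_size

# The same update applied once per example, as when it sits inside the k loop
repeated = np.zeros((hidden_size, num_labels))
for k in range(batch_size):
    repeated += alpha * layer_1.T.dot(layer_2_delta)

# A single update that uses the summed (not averaged) error signal
summed = alpha * layer_1.T.dot(batch_labels - layer_2)

print(np.allclose(repeated, summed))   # True: the repetition undoes the averaging
```

Under this reading, each batch effectively applies the summed per-example gradient, so with a small `alpha` the total step per pass over the data is about the same for batch sizes of 100 and 500. Moving the two delta lines and the two weight updates out of the `k` loop would make the batch size matter, and the printed results would then differ from those recorded here.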

In [13]:
#edit alpha = 0.0001
#add one-hot
#edit batch size = 100
#use whole dataset
from tensorflow.keras.datasets import mnist
import numpy as np
import sys

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing the data
images, labels = (x_train.reshape(60000, 28*28) / 255, y_train)

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 #returns 1 for input > 0

batch_size = 100
alpha, iterations, hidden_size = (0.0001, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations): # training loop: forward pass + backpropagation
    error, correct_cnt = (0.0,0)
    for i in range(int(len(images) / batch_size )):
        batch_start, batch_end = ((i * batch_size),((i+1) * batch_size))

        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(layer_0.dot(weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # randomly switch off a proportion of the neurons to prevent overfitting
        layer_1 *= dropout_mask * 2 # multiplying by 2 compensates for the roughly 50% of neurons that are switched off, so the training-time output roughly matches what is expected with all neurons active
        layer_2 = layer_1.dot(weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2)**2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

            layer_2_delta = (labels[batch_start:batch_end] - layer_2)/batch_size
            layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
            layer_1_delta *= dropout_mask
        
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j % 10 == 0):
        test_error, test_correct_cnt = (0.0,0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(layer_0.dot(weights_0_1))
            layer_2 = layer_1.dot(weights_1_2)

            test_error += np.sum((layer_2 - test_labels[i:i+1])**2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] + \
                         " Test-Acc:" + str(test_correct_cnt/ float(len(test_images))) + \
                         " Train-Err:" + str(error/ float(len(images)))[0:5] + \
                         " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:0.602 Test-Acc:0.7181 Train-Err:0.849 Train-Acc:0.4297
I:10 Test-Err:0.407 Test-Acc:0.8362 Train-Err:0.501 Train-Acc:0.7458
I:20 Test-Err:0.392 Test-Acc:0.8367 Train-Err:0.481 Train-Acc:0.7615666666666666
I:30 Test-Err:0.379 Test-Acc:0.8391 Train-Err:0.470 Train-Acc:0.7710166666666667
I:40 Test-Err:0.363 Test-Acc:0.8427 Train-Err:0.454 Train-Acc:0.7833666666666667
I:50 Test-Err:0.352 Test-Acc:0.8483 Train-Err:0.443 Train-Acc:0.79275
I:60 Test-Err:0.346 Test-Acc:0.8502 Train-Err:0.437 Train-Acc:0.7959333333333334
I:70 Test-Err:0.341 Test-Acc:0.8517 Train-Err:0.431 Train-Acc:0.7989166666666667
I:80 Test-Err:0.337 Test-Acc:0.8531 Train-Err:0.426 Train-Acc:0.8039166666666666
I:90 Test-Err:0.331 Test-Acc:0.8559 Train-Err:0.423 Train-Acc:0.8046833333333333
I:100 Test-Err:0.327 Test-Acc:0.8624 Train-Err:0.420 Train-Acc:0.8083666666666667
I:110 Test-Err:0.323 Test-Acc:0.862 Train-Err:0.413 Train-Acc:0.8127333333333333
I:120 Test-Err:0.323 Test-Acc:0.8617 Train-Err:0.412 Train-Acc