LINK TO ORIGINAL :- https://www.kaggle.com/code/ayushsyntax/simple-mnist-nn-from-scratch-numpy-no-tf-keras

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

digit_recognizer_path = kagglehub.competition_download('digit-recognizer')

print('Data source import complete.')


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Simple MNIST NN from scratch

In this notebook, I implemented a simple two-layer neural network and trained it on the MNIST digit recognizer dataset. It's meant to be an instructional example, through which you can understand the underlying math of neural networks better.

Here's a video I made explaining all the math and showing my progress as I coded the network: https://youtu.be/w8yWXqWQYmU

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')

In [None]:
data = np.array(data)
m, n = data.shape
np.random.shuffle(data) # shuffle before splitting into dev and training sets

data_dev = data[0:1000].T
Y_dev = data_dev[0]
X_dev = data_dev[1:n]
X_dev = X_dev / 255.

data_train = data[1000:m].T
Y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train / 255.
_,m_train = X_train.shape

In [None]:
Y_train


Our NN will have a simple two-layer architecture. Input layer $a^{[0]}$ will have 784 units corresponding to the 784 pixels in each 28x28 input image. A hidden layer $a^{[1]}$ will have 10 units with ReLU activation, and finally our output layer $a^{[2]}$ will have 10 units corresponding to the ten digit classes with softmax activation.

**Forward propagation**

$$Z^{[1]} = W^{[1]} X + b^{[1]}$$
$$A^{[1]} = g_{\text{ReLU}}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$$
$$A^{[2]} = g_{\text{softmax}}(Z^{[2]})$$

**Backward propagation**

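$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$
$$db^{[2]} = \frac{1}{m} \sum dZ^{[2]}$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \odot g^{[1]\prime}(Z^{[1]})$$
$$dW^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T}$$
$$db^{[1]} = \frac{1}{m} \sum dZ^{[1]}$$

Here $Y$ is the one-hot encoded label matrix, $\odot$ is element-wise multiplication, and each $\sum$ runs over the $m$ examples (the columns), so $db^{[1]}$ and $db^{[2]}$ are 10 x 1.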

**Parameter updates**

$$W^{[2]} := W^{[2]} - \alpha dW^{[2]}$$
$$b^{[2]} := b^{[2]} - \alpha db^{[2]}$$
$$W^{[1]} := W^{[1]} - \alpha dW^{[1]}$$
$$b^{[1]} := b^{[1]} - \alpha db^{[1]}$$

**Vars and shapes**

Forward prop

- $A^{[0]} = X$: 784 x m
- $Z^{[1]} \sim A^{[1]}$: 10 x m
- $W^{[1]}$: 10 x 784 (as $W^{[1]} A^{[0]} \sim Z^{[1]}$)
- $b^{[1]}$: 10 x 1
- $Z^{[2]} \sim A^{[2]}$: 10 x m
- $W^{[2]}$: 10 x 10 (as $W^{[2]} A^{[1]} \sim Z^{[2]}$)
- $b^{[2]}$: 10 x 1

Backprop

- $dZ^{[2]}$: 10 x m ($\sim A^{[2]}$)
- $dW^{[2]}$: 10 x 10
- $db^{[2]}$: 10 x 1
- $dZ^{[1]}$: 10 x m ($\sim A^{[1]}$)
- $dW^{[1]}$: 10 x 784
- $db^{[1]}$: 10 x 1

In [None]:
def init_params():
    W1 = np.random.rand(10, 784) - 0.5
    b1 = np.random.rand(10, 1) - 0.5
    W2 = np.random.rand(10, 10) - 0.5
    b2 = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2

def ReLU(Z):
    return np.maximum(Z, 0)

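def softmax(Z):
    # Subtract each column's max before exponentiating to avoid overflow;
    # the output is unchanged because softmax is shift-invariant.
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)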

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2

def ReLU_deriv(Z):
    return Z > 0

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T
    return one_hot_Y

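def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size  # number of examples in this batch, not the size of the full dataset
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = 1 / m * dZ2.dot(A1.T)
    # Sum over examples only (axis=1) so each bias gets its own gradient, shape (10, 1)
    db2 = 1 / m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)
    dW1 = 1 / m * dZ1.dot(X.T)
    db1 = 1 / m * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2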

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

In [None]:
def get_predictions(A2):
    return np.argmax(A2, 0)

def get_accuracy(predictions, Y):
    print(predictions, Y)
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print("Iteration: ", i)
            predictions = get_predictions(A2)
            print(get_accuracy(predictions, Y))
    return W1, b1, W2, b2

In [None]:
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 0.10, 500)

In [None]:
def make_predictions(X, W1, b1, W2, b2):
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    predictions = get_predictions(A2)
    return predictions

def test_prediction(index, W1, b1, W2, b2):
    current_image = X_train[:, index, None]
    prediction = make_predictions(X_train[:, index, None], W1, b1, W2, b2)
    label = Y_train[index]
    print("Prediction: ", prediction)
    print("Label: ", label)

    current_image = current_image.reshape((28, 28)) * 255
    plt.gray()
    plt.imshow(current_image, interpolation='nearest')
    plt.show()

In [None]:
test_prediction(0, W1, b1, W2, b2)
test_prediction(1, W1, b1, W2, b2)
test_prediction(2, W1, b1, W2, b2)
test_prediction(3, W1, b1, W2, b2)

In [None]:
dev_predictions = make_predictions(X_dev, W1, b1, W2, b2)
get_accuracy(dev_predictions, Y_dev)



---

## 🧾 Code Block Overview

This is a **Python notebook** that:
- Loads the MNIST dataset (handwritten digits).
- Preprocesses it.
- Implements a **two-layer neural network from scratch** (no PyTorch or TensorFlow).
- Trains it on the data.
- Tests predictions and evaluates accuracy.

We'll go **line-by-line**, explaining every function, loop, and concept.

---

### 🔹 1. Import Libraries
```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
```
- `numpy`: Used for fast numerical operations like matrix multiplication.
- `pandas`: Used to load CSV files into tables (like Excel).
- `matplotlib.pyplot`: Used to plot images and graphs.

---

### 🔹 2. Load Data
```python
data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
```
- This reads the MNIST training data (`train.csv`), which has 42,000 rows of digit images.
- Each row contains:
  - One label (the actual digit: 0–9)
  - Followed by 784 numbers representing pixel values of a 28x28 image.

---

### 🔹 3. Convert to NumPy Array
```python
data = np.array(data)
m, n = data.shape
```
- Converts the data into a NumPy array (easier to work with).
- `m` = number of samples (rows), `n` = number of columns (785: 784 pixels + 1 label).

---

### 🔹 4. Shuffle Data
```python
np.random.shuffle(data)
```
- Shuffles the rows in place so the dev/training split does not depend on the order of the file.
- Important before splitting into training and validation sets.
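If you want the same split on every run, one option is to shuffle with a seeded generator. The sketch below uses a small stand-in array rather than the real data, and the seed value is an arbitrary assumption; the notebook itself shuffles without a seed.

```python
import numpy as np

# Toy stand-in for the real data: 6 rows, first column plays the role of the label.
toy = np.arange(18).reshape(6, 3)

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
rng.shuffle(toy)                     # shuffles rows in place, like np.random.shuffle

# Only the row order changes; each row's contents stay together.
print(toy)
```

Because whole rows move together, each label stays attached to its pixels.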

---

### 🔹 5. Split Validation Set (`dev`)
```python
data_dev = data[0:1000].T
Y_dev = data_dev[0]
X_dev = data_dev[1:n]
X_dev = X_dev / 255.
```
- Takes the first 1000 examples for validation/testing later.
- `.T` transposes the data so each **column** is one example (instead of each row).
- `Y_dev`: Extracts labels (first row).
- `X_dev`: Extracts pixel data (remaining rows).
- Normalize pixel values to range [0, 1] by dividing by 255.
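To see what the transpose and slicing do, here is a minimal sketch on a made-up array with 5 examples and 4 "pixels" each (the real data has 784). The values and the names `toy`, `Y_toy`, and `X_toy` are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: one label column followed by 4 pixel columns (values 0-255).
toy = np.array([
    [3,   0, 255, 128,  64],
    [7, 255,   0,   0,  32],
    [1,  10,  20,  30,  40],
    [0, 200, 100,  50,  25],
    [9,   5,   5,   5,   5],
])
n = toy.shape[1]

toy_dev = toy[0:2].T         # first 2 examples, transposed: shape (5, 2)
Y_toy = toy_dev[0]           # labels: the first row after the transpose
X_toy = toy_dev[1:n] / 255.  # pixels: the remaining rows, scaled to [0, 1]

print(Y_toy)        # labels of the first two examples
print(X_toy.shape)  # (features, examples)
```

After the transpose, column `j` of `X_toy` holds the pixels of example `j`, which is the layout the forward pass expects.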

---

### 🔹 6. Split Training Set
```python
data_train = data[1000:m].T
Y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train / 255.
_, m_train = X_train.shape
```
- Uses the rest of the data (rows 1000 onward) for training.
- Again transposes and separates labels from inputs.
- Normalizes the inputs the same way.
- `_, m_train`: discards the first dimension (784 features) and stores the number of training examples.

---

## 🧠 Neural Network Implementation

Now comes the core of our neural network.

---

### 🔹 7. Define Initial Parameters
```python
def init_params():
    W1 = np.random.rand(10, 784) - 0.5
    b1 = np.random.rand(10, 1) - 0.5
    W2 = np.random.rand(10, 10) - 0.5
    b2 = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2
```
- Initializes weights and biases randomly.
- Shapes:
  - `W1`: (10 x 784): Input → Hidden layer (10 units)
  - `b1`: (10 x 1): Bias for hidden layer
  - `W2`: (10 x 10): Hidden → Output layer
  - `b2`: (10 x 1): Bias for output layer
- Subtracts 0.5 so the values fall in [-0.5, 0.5), since `np.random.rand` samples uniformly from [0, 1).
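A quick check of the shape and range, as a minimal sketch that re-creates only `W1`:

```python
import numpy as np

W1 = np.random.rand(10, 784) - 0.5  # uniform samples shifted from [0, 1) to [-0.5, 0.5)

print(W1.shape)
print(W1.min() >= -0.5, W1.max() < 0.5)
```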

---

### 🔹 8. Activation Functions
```python
def ReLU(Z):
    return np.maximum(Z, 0)

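def softmax(Z):
    # Subtract each column's max before exponentiating to avoid overflow;
    # the output is unchanged because softmax is shift-invariant.
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)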
```
- **ReLU**: Returns max(0, Z). Introduces non-linearity.
- **Softmax**: Turns raw outputs into probabilities that sum to 1.
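The following sketch re-defines both activations locally so it runs on its own, and checks the two properties described above. It uses the max-subtraction form of softmax; the input values are made up, with one deliberately large entry to show that the result stays finite.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0)

def softmax(Z):
    # Shift each column by its max so np.exp never receives a large positive number.
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)

# Two examples (columns), three units (rows).
Z = np.array([[ 1.0,   -2.0],
              [ 0.0,    3.0],
              [-1.0, 1000.0]])  # exp(1000) alone would overflow

print(relu(Z))        # negatives become 0
A = softmax(Z)
print(A.sum(axis=0))  # each column sums to 1
```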

---

### 🔹 9. Forward Propagation
```python
def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2
```
- Computes activations step by step.
  - `Z1` = linear transformation of input using weights `W1`
  - `A1` = apply ReLU
  - `Z2` = linear transformation of hidden layer using `W2`
  - `A2` = apply Softmax (final output)
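To make the shapes concrete, here is a minimal sketch that runs the same four steps on a random batch of 5 made-up examples. The names `m_demo` and `X_demo` are assumptions for illustration, and the activations are inlined so the block is self-contained.

```python
import numpy as np

m_demo = 5                            # hypothetical batch of 5 examples
X_demo = np.random.rand(784, m_demo)  # stand-in for normalized pixel columns
W1 = np.random.rand(10, 784) - 0.5
b1 = np.random.rand(10, 1) - 0.5
W2 = np.random.rand(10, 10) - 0.5
b2 = np.random.rand(10, 1) - 0.5

Z1 = W1.dot(X_demo) + b1  # (10, 784) @ (784, 5) -> (10, 5); b1 broadcasts across columns
A1 = np.maximum(Z1, 0)    # ReLU
Z2 = W2.dot(A1) + b2      # (10, 10) @ (10, 5) -> (10, 5)
expZ = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = expZ / expZ.sum(axis=0, keepdims=True)  # softmax over each column

for name, arr in [("Z1", Z1), ("A1", A1), ("Z2", Z2), ("A2", A2)]:
    print(name, arr.shape)
```

Each column of `A2` is a probability distribution over the ten digit classes for one example.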

---

### 🔹 10. Derivative of ReLU
```python
def ReLU_deriv(Z):
    return Z > 0
```
- Returns a boolean array that is `True` where Z > 0 and `False` elsewhere; in arithmetic these act as 1 and 0.
- Used during backpropagation.
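A small sketch of how that mask is applied during backpropagation. The `upstream` values are a made-up gradient, not output from the network.

```python
import numpy as np

Z = np.array([[-1.5, 0.0, 2.0]])
upstream = np.array([[10.0, 20.0, 30.0]])  # hypothetical gradient flowing back

mask = Z > 0
print(mask)             # boolean array
print(upstream * mask)  # gradient passes only where Z was positive
```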

---

### 🔹 11. One-Hot Encoding
```python
def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T
    return one_hot_Y
```
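- Converts labels like `[2, 1, 4]` into one row per label, with a 1 in the column given by the label:
  ```
  [[0, 0, 1, ..., 0],
   [0, 1, 0, ..., 0],
   ...]
  ```
- Transposes the result so each column corresponds to one sample, matching the layout of `A2`.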
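Running the function on the three-label example shows the final, transposed layout. Note that the number of rows is `Y.max() + 1`, so this toy input yields 5 rows; the full training labels contain a 9 and therefore yield 10.

```python
import numpy as np

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    return one_hot_Y.T

Y = np.array([2, 1, 4])
encoded = one_hot(Y)
print(encoded.shape)  # (classes, examples)
print(encoded)
```

Taking `argmax` down each column recovers the original labels.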

---

### 🔹 12. Backward Propagation
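```python
def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size  # number of examples in this batch, not the size of the full dataset
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = 1/m * dZ2.dot(A1.T)
    # Sum over examples only (axis=1) so each bias gets its own gradient, shape (10, 1)
    db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)
    dW1 = 1/m * dZ1.dot(X.T)
    db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```
- Computes the gradient of the loss with respect to each weight and bias, i.e. how much each parameter contributed to the error.
- Uses the chain rule to trace the error backwards from the output layer to the hidden layer.
- Bias gradients are summed across examples (`axis=1`) so each keeps its (10 x 1) shape, and `m` is taken from `Y` so the averaging matches the batch actually passed in.
- These gradients tell us how to update the parameters to reduce the error.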
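One way to sanity-check backprop formulas is to compare them against finite differences. The sketch below does this for one entry of `W2` and one of `b2` on a tiny made-up network. It assumes the loss is the mean cross-entropy, which the notebook does not state explicitly but which is the loss whose gradient with respect to `Z2` is `A2 - Y`. All sizes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, m_demo = 4, 3, 3, 6  # tiny hypothetical sizes

X = rng.random((n_in, m_demo))
Y = rng.integers(0, n_out, size=m_demo)
Y_onehot = np.zeros((n_out, m_demo))
Y_onehot[Y, np.arange(m_demo)] = 1

W1 = rng.random((n_hidden, n_in)) - 0.5
b1 = rng.random((n_hidden, 1)) - 0.5
W2 = rng.random((n_out, n_hidden)) - 0.5
b2 = rng.random((n_out, 1)) - 0.5

def forward(W1, b1, W2, b2):
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)
    Z2 = W2 @ A1 + b2
    expZ = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
    return A1, expZ / expZ.sum(axis=0, keepdims=True)

def loss(W1, b1, W2, b2):
    # Mean cross-entropy over the batch (assumed loss, see lead-in).
    _, A2 = forward(W1, b1, W2, b2)
    return -np.sum(Y_onehot * np.log(A2)) / m_demo

# Analytic gradients for the output layer, following the backprop equations.
A1, A2 = forward(W1, b1, W2, b2)
dZ2 = A2 - Y_onehot
dW2 = dZ2 @ A1.T / m_demo
db2 = dZ2.sum(axis=1, keepdims=True) / m_demo

# Central-difference estimates for a single entry of W2 and of b2.
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 1] += eps
W2_minus[0, 1] -= eps
num_dW2 = (loss(W1, b1, W2_plus, b2) - loss(W1, b1, W2_minus, b2)) / (2 * eps)

b2_plus, b2_minus = b2.copy(), b2.copy()
b2_plus[0, 0] += eps
b2_minus[0, 0] -= eps
num_db2 = (loss(W1, b1, W2, b2_plus) - loss(W1, b1, W2, b2_minus)) / (2 * eps)

print(dW2[0, 1], num_dW2)  # the two numbers should agree closely
print(db2[0, 0], num_db2)
```

The check is limited to the output layer because the loss is smooth in `W2` and `b2`; checking `W1` the same way can be noisy near the ReLU kink at zero.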

---

### 🔹 13. Update Parameters
```python
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2
```
- Adjust weights and biases based on gradients.
- `alpha` is the learning rate — controls how big the updates are.
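The same update rule on a one-variable toy problem may make the role of `alpha` easier to see. The loss `f(w) = w**2` and the starting value are assumptions chosen only for illustration.

```python
w = 4.0
alpha = 0.1

for step in range(3):
    grad = 2 * w          # derivative of the toy loss f(w) = w**2
    w = w - alpha * grad  # same rule as update_params
    print(step, w)        # w moves toward the minimum at 0
```

A larger `alpha` takes bigger steps and can overshoot the minimum; a smaller one converges more slowly.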

---

### 🔹 14. Get Final Predictions
```python
def get_predictions(A2):
    return np.argmax(A2, 0)
```
- From final output (softmax), returns predicted class (digit 0–9).
- `argmax` finds index with highest probability.
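For example, with a made-up output for two examples and three classes:

```python
import numpy as np

# Columns are examples, rows are classes (toy values, not real network output).
A2_demo = np.array([[0.1, 0.7],
                    [0.8, 0.2],
                    [0.1, 0.1]])

preds = np.argmax(A2_demo, 0)  # index of the largest value in each column
print(preds)                   # → [1 0]
```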

---

### 🔹 15. Calculate Accuracy
```python
def get_accuracy(predictions, Y):
    print(predictions, Y)
    return np.sum(predictions == Y) / Y.size
```
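- Compares predicted labels with true labels (and prints both arrays for inspection).
- Returns the fraction correct, a value between 0 and 1 (multiply by 100 for a percentage).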
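A worked example with four made-up predictions, three of which match:

```python
import numpy as np

predictions = np.array([1, 0, 4, 4])
Y_demo = np.array([1, 0, 4, 9])  # hypothetical true labels

accuracy = np.sum(predictions == Y_demo) / Y_demo.size
print(accuracy)           # → 0.75
print(f"{accuracy:.0%}")  # → 75%
```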

---

### 🔹 16. Full Gradient Descent Loop
```python
def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print("Iteration: ", i)
            predictions = get_predictions(A2)
            print(get_accuracy(predictions, Y))
    return W1, b1, W2, b2
```
- Runs all steps in a loop for fixed number of `iterations`.
- Every 10 iterations, prints current iteration and accuracy.
- Returns final trained weights and biases.

---

### 🔹 17. Train Model
```python
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 0.10, 500)
```
- Trains model for 500 iterations with learning rate `0.1`.
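- The original run reports roughly 85% accuracy on the training set; the exact figure varies between runs because of the random shuffle and initialization.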

---

### 🔹 18. Make Predictions Function
```python
def make_predictions(X, W1, b1, W2, b2):
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    predictions = get_predictions(A2)
    return predictions
```
- Takes new input `X` and makes prediction using trained weights.

---

### 🔹 19. Test Prediction on Specific Image
```python
def test_prediction(index, W1, b1, W2, b2):
    current_image = X_train[:, index, None]
    prediction = make_predictions(X_train[:, index, None], W1, b1, W2, b2)
    label = Y_train[index]
    print("Prediction: ", prediction)
    print("Label: ", label)
    current_image = current_image.reshape((28, 28)) * 255
    plt.gray()
    plt.imshow(current_image, interpolation='nearest')
    plt.show()
```
- Shows image at given `index`, its predicted label, and actual label.
- Displays grayscale image using matplotlib.

---

### 🔹 20. Evaluate on Dev Set
```python
dev_predictions = make_predictions(X_dev, W1, b1, W2, b2)
get_accuracy(dev_predictions, Y_dev)
```
- After training, run predictions on dev set.
- Check generalization accuracy (about 84% in the original run; results vary between runs).

---

## ✅ Summary Table

| Part | What It Does |
|------|---------------|
| `init_params()` | Starts with random weights |
| `forward_prop()` | Makes a prediction |
| `backward_prop()` | Computes gradients showing how each parameter contributed to the error |
| `update_params()` | Adjusts weights to improve next guess |
| `gradient_descent()` | Loops the above steps many times to train |
| `make_predictions()` | Makes predictions on new data |
| `test_prediction()` | Shows prediction vs truth visually |

---

