# Task 5


The goal of this exercise is to implement a multilayer perceptron and a chosen gradient-based optimization algorithm, together with backpropagation.

Next, train the multilayer perceptron to classify the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. MNIST is available through the `scikit-learn` package.

Scoring:
1. Implementation of forward propagation (`forward`) [1 pt]
2. Implementation of backpropagation (`backward`) [2 pts]
3. Experiments on the MNIST dataset, including:
    1. Comparison of at least two network architectures [1 pt]
    2. Testing each architecture on at least 3 seeds [1 pt]
    3. Conclusions [2.5 pts]
4. Code quality [0.5 pt]

Recommended sources (theory and intuition):
1. [Karpathy, CS231n Winter 2016: Lecture 4: Backpropagation, Neural Networks 1](https://www.youtube.com/watch?v=i94OvYb6noo&ab_channel=AndrejKarpathy)
2. [3Blue1Brown, Backpropagation calculus | Chapter 4, Deep learning](https://www.youtube.com/watch?v=tIeHLnjs5U8&t=4s&ab_channel=3Blue1Brown)


In [None]:
from abc import abstractmethod, ABC
from typing import List
import numpy as np

In [None]:
class Layer(ABC):
    def __init__(self) -> None:
        self._learning_rate = 0.01

    @abstractmethod
    def forward(self, x: np.ndarray) -> np.ndarray:
        pass

    @abstractmethod
    def backward(self, output_error_derivative: np.ndarray) -> np.ndarray:
        pass

    @property
    def learning_rate(self):
        return self._learning_rate

    @learning_rate.setter
    def learning_rate(self, learning_rate):
        assert 0 < learning_rate < 1, f"Learning rate must be between 0 and 1, got {learning_rate}"
        self._learning_rate = learning_rate

class FullyConnected(Layer):
    def __init__(self, input_size: int, output_size: int) -> None:
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        # Uniform initialization in [-1/sqrt(input_size), 1/sqrt(input_size)].
        limit = 1 / np.sqrt(self.input_size)
        self.weights = np.random.uniform(-limit, limit, (self.input_size, self.output_size))
        self.biases = np.random.uniform(-limit, limit, (1, self.output_size))

    def forward(self, x: np.ndarray) -> np.ndarray:
        self.input = x
        self.output = np.dot(x, self.weights) + self.biases
        return self.output

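    def backward(self, output_error_derivative: np.ndarray) -> np.ndarray:
        # The gradient w.r.t. the layer input must use the weights from the forward
        # pass, so it is computed before the parameters are updated.
        input_error_derivative = np.dot(output_error_derivative, self.weights.T)

        self.weights_gradient = np.dot(self.input.T, output_error_derivative)
        self.biases_gradient = np.sum(output_error_derivative, axis=0, keepdims=True)

        # Plain gradient-descent update.
        self.weights -= self.learning_rate * self.weights_gradient
        self.biases -= self.learning_rate * self.biases_gradient

        return input_error_derivative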

class Tanh(Layer):
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.input = x
        self.output = np.tanh(x)
        return self.output

    def backward(self, output_error_derivative: np.ndarray) -> np.ndarray:
        return output_error_derivative * (1 - self.output ** 2)

class Loss:
    def __init__(self, loss_function: callable, loss_function_derivative: callable) -> None:
        self.loss_function = loss_function
        self.loss_function_derivative = loss_function_derivative

    def loss(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        return self.loss_function(y_true, y_pred)

    def loss_derivative(self, y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
        return self.loss_function_derivative(y_true, y_pred)

class Network:
    def __init__(self, layers: List[Layer], learning_rate: float) -> None:
        self.layers = layers
        self.learning_rate = learning_rate

    def compile(self, loss: Loss) -> None:
        self.loss = loss

    def __call__(self, x: np.ndarray) -> np.ndarray:
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def fit(self, x_train: np.ndarray, y_train: np.ndarray, epochs: int, verbose: int = 0) -> None:
        for layer in self.layers:
            layer.learning_rate = self.learning_rate

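        for epoch in range(epochs):
            # Full-batch training: one forward and one backward pass over the
            # entire training set per epoch, i.e. a single weight update.
            predictions = self(x_train)

            loss_value = self.loss.loss(y_train, predictions)

            # Propagate the loss derivative back through the layers in reverse order.
            error = self.loss.loss_derivative(y_train, predictions)
            for layer in reversed(self.layers):
                error = layer.backward(error)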

            if verbose:
                print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss_value:.4f}")

    def evaluate(self, x: np.ndarray, y: np.ndarray) -> float:
        predictions = self(x)
        predicted_labels = np.argmax(predictions, axis=1)
        true_labels = np.argmax(y, axis=1)
        accuracy = np.mean(predicted_labels == true_labels)
        return accuracy


def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return np.mean((y_true - y_pred) ** 2)

def mse_derivative(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    return 2 * (y_pred - y_true) / y_true.size
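
A common way to validate a hand-written `backward` is a numerical gradient check. The block below is a minimal standalone sketch, not part of the original solution: it re-implements one linear layer followed by tanh with an MSE loss (mirroring `FullyConnected` + `Tanh` + `mse`) and compares the analytic gradients with central finite differences. The helper names `analytic_gradients` and `numeric_gradient` are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def forward(x, w, b):
    # Linear layer followed by tanh, mirroring FullyConnected + Tanh.
    return np.tanh(x @ w + b)

def analytic_gradients(x, y, w, b):
    out = forward(x, w, b)
    d_out = 2 * (out - y) / y.size       # derivative of the MSE loss
    d_pre = d_out * (1 - out ** 2)       # chain rule through tanh
    grad_w = x.T @ d_pre                 # same formula as weights_gradient
    grad_b = d_pre.sum(axis=0, keepdims=True)
    return grad_w, grad_b

def numeric_gradient(loss_fn, param, eps=1e-6):
    # Central finite differences, perturbing one entry of `param` at a time.
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = loss_fn()
        param[idx] = original - eps
        loss_minus = loss_fn()
        param[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
y = rng.normal(size=(5, 3))
w = rng.normal(size=(4, 3)) * 0.5
b = rng.normal(size=(1, 3)) * 0.5

grad_w, grad_b = analytic_gradients(x, y, w, b)
num_w = numeric_gradient(lambda: mse(y, forward(x, w, b)), w)
num_b = numeric_gradient(lambda: mse(y, forward(x, w, b)), b)

max_err_w = np.max(np.abs(grad_w - num_w))
max_err_b = np.max(np.abs(grad_b - num_b))
print(f"max |analytic - numeric|: weights {max_err_w:.2e}, biases {max_err_b:.2e}")
```

If the two gradients disagree by much more than the finite-difference error, the backward pass most likely contains a mistake.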

# Experiments

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = fetch_openml("mnist_784", version=1)
x = data.data.to_numpy()
y = data.target.astype(int).to_numpy()

scaler = StandardScaler()
x = scaler.fit_transform(x)

ohe = OneHotEncoder(sparse_output=False)
y = ohe.fit_transform(y.reshape(-1, 1))
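
The cell above fits the scaler on the full dataset before it is split. A possible alternative, shown here as a NumPy-only sketch on toy data, is to compute the standardization statistics on the training part only and reuse them for the test part. The helpers `fit_standardizer` and `one_hot` are hypothetical and only illustrate what `StandardScaler` and `OneHotEncoder` do in this setting.

```python
import numpy as np

def fit_standardizer(x_train):
    # Per-feature statistics computed from the training split only.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0  # constant features (e.g. always-black pixels) stay unscaled
    return mean, std

def one_hot(labels, num_classes):
    # One row per sample with a single 1 at the index of the class label.
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

rng = np.random.default_rng(0)
x_raw = rng.integers(0, 256, size=(10, 4)).astype(float)
x_raw[:, 0] = 0.0                      # a constant column
labels = rng.integers(0, 3, size=10)

split = 8                              # first 8 rows for training, last 2 for testing
mean, std = fit_standardizer(x_raw[:split])
x_train_s = (x_raw[:split] - mean) / std
x_test_s = (x_raw[split:] - mean) / std   # test data reuses the training statistics
y_encoded = one_hot(labels, 3)
```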

**Architecture I**

Layer 1: FullyConnected (784 → 128) + Tanh

Layer 2: FullyConnected (128 → 10) + Tanh

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

network = Network([
    FullyConnected(784, 128),
    Tanh(),
    FullyConnected(128, 10),
    Tanh()
], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

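# Note: the same network instance is reused across iterations, so training is
# cumulative: the totals after each stage are 10, 110 and 610 epochs.
for epochs in [10, 100, 500]:
    print(f"\nTraining for {epochs} epochs")
    network.fit(x_train, y_train, epochs=epochs, verbose=0)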

    train_accuracy = network.evaluate(x_train, y_train)
    test_accuracy = network.evaluate(x_test, y_test)

    print(f"After {epochs} epochs:")
    print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
    print(f"Test Accuracy: {test_accuracy * 100:.2f}%")



Training for 10 epochs
After 10 epochs:
Training Accuracy: 66.55%
Test Accuracy: 67.09%

Training for 100 epochs
After 100 epochs:
Training Accuracy: 83.52%
Test Accuracy: 83.75%

Training for 500 epochs
After 500 epochs:
Training Accuracy: 86.51%
Test Accuracy: 86.24%


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

network = Network([
    FullyConnected(784, 128),
    Tanh(),
    FullyConnected(128, 10),
    Tanh()
], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

network.fit(x_train, y_train, epochs=1000, verbose=1)

train_accuracy = network.evaluate(x_train, y_train)
test_accuracy = network.evaluate(x_test, y_test)

print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Epoch 1/1000, Loss: 0.1639
Epoch 2/1000, Loss: 0.1363
Epoch 3/1000, Loss: 0.1185
Epoch 4/1000, Loss: 0.1066
Epoch 5/1000, Loss: 0.0981
Epoch 6/1000, Loss: 0.0919
Epoch 7/1000, Loss: 0.0871
Epoch 8/1000, Loss: 0.0832
Epoch 9/1000, Loss: 0.0800
Epoch 10/1000, Loss: 0.0774
Epoch 11/1000, Loss: 0.0751
Epoch 12/1000, Loss: 0.0731
Epoch 13/1000, Loss: 0.0713
Epoch 14/1000, Loss: 0.0697
Epoch 15/1000, Loss: 0.0683
Epoch 16/1000, Loss: 0.0670
Epoch 17/1000, Loss: 0.0659
Epoch 18/1000, Loss: 0.0648
Epoch 19/1000, Loss: 0.0638
Epoch 20/1000, Loss: 0.0629
Epoch 21/1000, Loss: 0.0621
Epoch 22/1000, Loss: 0.0613
Epoch 23/1000, Loss: 0.0606
Epoch 24/1000, Loss: 0.0600
Epoch 25/1000, Loss: 0.0594
Epoch 26/1000, Loss: 0.0588
Epoch 27/1000, Loss: 0.0583
Epoch 28/1000, Loss: 0.0578
Epoch 29/1000, Loss: 0.0573
Epoch 30/1000, Loss: 0.0568
Epoch 31/1000, Loss: 0.0564
Epoch 32/1000, Loss: 0.0560
Epoch 33/1000, Loss: 0.0556
Epoch 34/1000, Loss: 0.0553
Epoch 35/1000, Loss: 0.0549
Epoch 36/1000, Loss: 0.0546

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=85)

network = Network([
    FullyConnected(784, 128),
    Tanh(),
    FullyConnected(128, 10),
    Tanh()
], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

network.fit(x_train, y_train, epochs=1000, verbose=1)

train_accuracy = network.evaluate(x_train, y_train)
test_accuracy = network.evaluate(x_test, y_test)

print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Epoch 1/1000, Loss: 0.1398
Epoch 2/1000, Loss: 0.1199
Epoch 3/1000, Loss: 0.1074
Epoch 4/1000, Loss: 0.0989
Epoch 5/1000, Loss: 0.0927
Epoch 6/1000, Loss: 0.0878
Epoch 7/1000, Loss: 0.0839
Epoch 8/1000, Loss: 0.0807
Epoch 9/1000, Loss: 0.0780
Epoch 10/1000, Loss: 0.0757
Epoch 11/1000, Loss: 0.0737
Epoch 12/1000, Loss: 0.0719
Epoch 13/1000, Loss: 0.0704
Epoch 14/1000, Loss: 0.0690
Epoch 15/1000, Loss: 0.0677
Epoch 16/1000, Loss: 0.0666
Epoch 17/1000, Loss: 0.0655
Epoch 18/1000, Loss: 0.0646
Epoch 19/1000, Loss: 0.0637
Epoch 20/1000, Loss: 0.0629
Epoch 21/1000, Loss: 0.0621
Epoch 22/1000, Loss: 0.0614
Epoch 23/1000, Loss: 0.0608
Epoch 24/1000, Loss: 0.0602
Epoch 25/1000, Loss: 0.0596
Epoch 26/1000, Loss: 0.0591
Epoch 27/1000, Loss: 0.0586
Epoch 28/1000, Loss: 0.0581
Epoch 29/1000, Loss: 0.0576
Epoch 30/1000, Loss: 0.0572
Epoch 31/1000, Loss: 0.0568
Epoch 32/1000, Loss: 0.0564
Epoch 33/1000, Loss: 0.0560
Epoch 34/1000, Loss: 0.0557
Epoch 35/1000, Loss: 0.0554
Epoch 36/1000, Loss: 0.0550

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=789)

network = Network([
    FullyConnected(784, 128),
    Tanh(),
    FullyConnected(128, 10),
    Tanh()
], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

network.fit(x_train, y_train, epochs=1000, verbose=1)

train_accuracy = network.evaluate(x_train, y_train)
test_accuracy = network.evaluate(x_test, y_test)

print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Epoch 1/1000, Loss: 0.1467
Epoch 2/1000, Loss: 0.1236
Epoch 3/1000, Loss: 0.1095
Epoch 4/1000, Loss: 0.1003
Epoch 5/1000, Loss: 0.0937
Epoch 6/1000, Loss: 0.0887
Epoch 7/1000, Loss: 0.0847
Epoch 8/1000, Loss: 0.0815
Epoch 9/1000, Loss: 0.0787
Epoch 10/1000, Loss: 0.0764
Epoch 11/1000, Loss: 0.0743
Epoch 12/1000, Loss: 0.0724
Epoch 13/1000, Loss: 0.0708
Epoch 14/1000, Loss: 0.0693
Epoch 15/1000, Loss: 0.0680
Epoch 16/1000, Loss: 0.0668
Epoch 17/1000, Loss: 0.0657
Epoch 18/1000, Loss: 0.0647
Epoch 19/1000, Loss: 0.0638
Epoch 20/1000, Loss: 0.0629
Epoch 21/1000, Loss: 0.0621
Epoch 22/1000, Loss: 0.0614
Epoch 23/1000, Loss: 0.0607
Epoch 24/1000, Loss: 0.0600
Epoch 25/1000, Loss: 0.0594
Epoch 26/1000, Loss: 0.0589
Epoch 27/1000, Loss: 0.0584
Epoch 28/1000, Loss: 0.0579
Epoch 29/1000, Loss: 0.0574
Epoch 30/1000, Loss: 0.0570
Epoch 31/1000, Loss: 0.0565
Epoch 32/1000, Loss: 0.0561
Epoch 33/1000, Loss: 0.0558
Epoch 34/1000, Loss: 0.0554
Epoch 35/1000, Loss: 0.0551
Epoch 36/1000, Loss: 0.0547

**Architecture II**

Layer 1: FullyConnected (784 → 128) + Tanh

Layer 2: FullyConnected (128 → 64) + Tanh

Layer 3: FullyConnected (64 → 10) + Tanh
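
To make the size difference between the two architectures concrete, the short sketch below counts the trainable parameters (weights plus biases) of a stack of fully connected layers. The function name `count_parameters` is an illustrative assumption; the layer sizes are those listed for Architecture I and Architecture II.

```python
def count_parameters(layer_sizes):
    # Each fully connected layer has n_in * n_out weights and n_out biases.
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

params_arch_1 = count_parameters([784, 128, 10])
params_arch_2 = count_parameters([784, 128, 64, 10])

print(f"Architecture I:  {params_arch_1} parameters")   # 101770
print(f"Architecture II: {params_arch_2} parameters")   # 109386
```

The second network is deeper but has only about 7.5% more parameters, because the first 784 → 128 layer dominates the count in both cases.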

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

network = Network([FullyConnected(input_size=784, output_size=128), Tanh(),
                    FullyConnected(input_size=128, output_size=64), Tanh(),
                    FullyConnected(input_size=64, output_size=10), Tanh()], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

network.fit(x_train, y_train, epochs=1000, verbose=1)

train_accuracy = network.evaluate(x_train, y_train)
test_accuracy = network.evaluate(x_test, y_test)

print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Epoch 1/1000, Loss: 0.1244
Epoch 2/1000, Loss: 0.1162
Epoch 3/1000, Loss: 0.1093
Epoch 4/1000, Loss: 0.1036
Epoch 5/1000, Loss: 0.0986
Epoch 6/1000, Loss: 0.0944
Epoch 7/1000, Loss: 0.0906
Epoch 8/1000, Loss: 0.0874
Epoch 9/1000, Loss: 0.0845
Epoch 10/1000, Loss: 0.0819
Epoch 11/1000, Loss: 0.0797
Epoch 12/1000, Loss: 0.0776
Epoch 13/1000, Loss: 0.0757
Epoch 14/1000, Loss: 0.0741
Epoch 15/1000, Loss: 0.0725
Epoch 16/1000, Loss: 0.0711
Epoch 17/1000, Loss: 0.0699
Epoch 18/1000, Loss: 0.0687
Epoch 19/1000, Loss: 0.0676
Epoch 20/1000, Loss: 0.0666
Epoch 21/1000, Loss: 0.0657
Epoch 22/1000, Loss: 0.0648
Epoch 23/1000, Loss: 0.0640
Epoch 24/1000, Loss: 0.0633
Epoch 25/1000, Loss: 0.0626
Epoch 26/1000, Loss: 0.0620
Epoch 27/1000, Loss: 0.0613
Epoch 28/1000, Loss: 0.0608
Epoch 29/1000, Loss: 0.0602
Epoch 30/1000, Loss: 0.0597
Epoch 31/1000, Loss: 0.0592
Epoch 32/1000, Loss: 0.0588
Epoch 33/1000, Loss: 0.0583
Epoch 34/1000, Loss: 0.0579
Epoch 35/1000, Loss: 0.0575
Epoch 36/1000, Loss: 0.0571

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=85)

network = Network([FullyConnected(input_size=784, output_size=128), Tanh(),
                    FullyConnected(input_size=128, output_size=64), Tanh(),
                    FullyConnected(input_size=64, output_size=10), Tanh()], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

network.fit(x_train, y_train, epochs=1000, verbose=1)

train_accuracy = network.evaluate(x_train, y_train)
test_accuracy = network.evaluate(x_test, y_test)

print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Epoch 1/1000, Loss: 0.1206
Epoch 2/1000, Loss: 0.1120
Epoch 3/1000, Loss: 0.1049
Epoch 4/1000, Loss: 0.0989
Epoch 5/1000, Loss: 0.0938
Epoch 6/1000, Loss: 0.0895
Epoch 7/1000, Loss: 0.0858
Epoch 8/1000, Loss: 0.0826
Epoch 9/1000, Loss: 0.0798
Epoch 10/1000, Loss: 0.0773
Epoch 11/1000, Loss: 0.0751
Epoch 12/1000, Loss: 0.0732
Epoch 13/1000, Loss: 0.0715
Epoch 14/1000, Loss: 0.0700
Epoch 15/1000, Loss: 0.0686
Epoch 16/1000, Loss: 0.0673
Epoch 17/1000, Loss: 0.0662
Epoch 18/1000, Loss: 0.0652
Epoch 19/1000, Loss: 0.0642
Epoch 20/1000, Loss: 0.0633
Epoch 21/1000, Loss: 0.0625
Epoch 22/1000, Loss: 0.0618
Epoch 23/1000, Loss: 0.0611
Epoch 24/1000, Loss: 0.0604
Epoch 25/1000, Loss: 0.0598
Epoch 26/1000, Loss: 0.0592
Epoch 27/1000, Loss: 0.0587
Epoch 28/1000, Loss: 0.0582
Epoch 29/1000, Loss: 0.0577
Epoch 30/1000, Loss: 0.0573
Epoch 31/1000, Loss: 0.0569
Epoch 32/1000, Loss: 0.0565
Epoch 33/1000, Loss: 0.0561
Epoch 34/1000, Loss: 0.0557
Epoch 35/1000, Loss: 0.0554
Epoch 36/1000, Loss: 0.0551

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=789)

network = Network([FullyConnected(input_size=784, output_size=128), Tanh(),
                    FullyConnected(input_size=128, output_size=64), Tanh(),
                    FullyConnected(input_size=64, output_size=10), Tanh()], learning_rate=0.2)

network.compile(Loss(mse, mse_derivative))

network.fit(x_train, y_train, epochs=1000, verbose=1)

train_accuracy = network.evaluate(x_train, y_train)
test_accuracy = network.evaluate(x_test, y_test)

print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Epoch 1/1000, Loss: 0.1194
Epoch 2/1000, Loss: 0.1113
Epoch 3/1000, Loss: 0.1046
Epoch 4/1000, Loss: 0.0989
Epoch 5/1000, Loss: 0.0941
Epoch 6/1000, Loss: 0.0900
Epoch 7/1000, Loss: 0.0865
Epoch 8/1000, Loss: 0.0835
Epoch 9/1000, Loss: 0.0808
Epoch 10/1000, Loss: 0.0785
Epoch 11/1000, Loss: 0.0764
Epoch 12/1000, Loss: 0.0746
Epoch 13/1000, Loss: 0.0729
Epoch 14/1000, Loss: 0.0714
Epoch 15/1000, Loss: 0.0701
Epoch 16/1000, Loss: 0.0688
Epoch 17/1000, Loss: 0.0677
Epoch 18/1000, Loss: 0.0667
Epoch 19/1000, Loss: 0.0657
Epoch 20/1000, Loss: 0.0648
Epoch 21/1000, Loss: 0.0640
Epoch 22/1000, Loss: 0.0633
Epoch 23/1000, Loss: 0.0626
Epoch 24/1000, Loss: 0.0619
Epoch 25/1000, Loss: 0.0613
Epoch 26/1000, Loss: 0.0607
Epoch 27/1000, Loss: 0.0602
Epoch 28/1000, Loss: 0.0596
Epoch 29/1000, Loss: 0.0592
Epoch 30/1000, Loss: 0.0587
Epoch 31/1000, Loss: 0.0583
Epoch 32/1000, Loss: 0.0579
Epoch 33/1000, Loss: 0.0575
Epoch 34/1000, Loss: 0.0571
Epoch 35/1000, Loss: 0.0567
Epoch 36/1000, Loss: 0.0564

# Conclusions

**Test-set accuracy after a given number of epochs for Architecture I**
$$
\begin{array}{|c|c|}
\hline
\text{epochs} & \text{test accuracy} \\
\hline
10 & 67.09\% \\
\hline
100 & 83.75\% \\
\hline
500 & 86.24\% \\
\hline
1000 & 88.21\% \\
\hline
\end{array}
$$

The model needs a very large number of epochs to reach a reasonably satisfactory accuracy. This follows partly from the number of parameters that have to be adjusted and partly from the training scheme: `fit` performs a single full-batch gradient step per epoch, so 1000 epochs amount to only 1000 weight updates. Accuracy rises sharply between 10 and 100 epochs and more slowly afterwards, as the network approaches the best parameters it can reach. To speed up learning, `learning_rate` was changed to 0.2, which improved accuracy for the tested numbers of epochs. Note that the 10, 100 and 500 epoch results come from one network trained in consecutive stages without re-initialization, so they correspond to 10, 110 and 610 cumulative epochs.
The number of epochs should nevertheless be increased with care, because the model may start to overfit, which lowers accuracy on the test set, and the computation time grows considerably.
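
The `fit` method in this notebook processes the whole training set in one pass per epoch. One commonly used way to obtain more weight updates per epoch is mini-batch training. The generator below is a hedged sketch of how batches could be drawn; `iterate_minibatches` is a hypothetical helper and is not part of the original `Network` class.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    # Shuffle the sample indices once per epoch, then yield consecutive slices.
    indices = rng.permutation(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]

rng = np.random.default_rng(0)
x_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10)

batches = list(iterate_minibatches(x_demo, y_demo, batch_size=4, rng=rng))
updates_per_epoch = len(batches)   # 10 samples with batch_size=4 give 3 batches
print(updates_per_epoch, [len(yb) for _, yb in batches])
```

Inside a training loop, each yielded pair would go through one forward and one backward pass, so a single epoch would produce several weight updates instead of one.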

**Test-set accuracy**
$$
\begin{array}{|c|c|c|}
\hline
\text{random\_state} & \text{Architecture I} & \text{Architecture II} \\
\hline
42 & 88.21\% & 87.01\% \\
\hline
85 & 87.15\% & 86.12\% \\
\hline
789 & 88.08\% & 86.31\% \\
\hline
\end{array}
$$

Mean accuracy for Architecture I: 87.81%

Mean accuracy for Architecture II: 86.48%
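
The averages above can be reproduced from the per-split results in the table. The sketch below also reports the sample standard deviation, which is an addition for illustration and was not part of the original analysis.

```python
import numpy as np

# Test accuracies (%) copied from the table above, one value per random_state.
test_accuracy = {
    "Architecture I": [88.21, 87.15, 88.08],
    "Architecture II": [87.01, 86.12, 86.31],
}

summary = {}
for name, values in test_accuracy.items():
    scores = np.array(values)
    # ddof=1 gives the sample standard deviation across the three runs.
    summary[name] = (scores.mean(), scores.std(ddof=1))
    print(f"{name}: mean = {summary[name][0]:.2f}%, std = {summary[name][1]:.2f}%")
```

With only three runs per architecture, the spread is a rough indication rather than a reliable estimate.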

Test-set accuracy was measured for three different data splits and two network architectures. Here `random_state` controls only the train/test split; the weight initialization itself is not seeded. The simpler Architecture I achieved the better mean accuracy. This may be because the second network has more parameters to optimize and needs more careful tuning of hyperparameters such as `learning_rate` and the number of epochs.
Because of its greater complexity, Architecture II also took noticeably longer to train: about 30 minutes per network for Architecture I versus about 45 minutes for Architecture II.
For both architectures at 1000 epochs, the test-set accuracy was about 1 percentage point lower than the training-set accuracy, which may indicate slight overfitting. A similar gap is already present at 500 epochs, however, when the accuracy on both sets is still unsatisfactory.
Possible ways to improve the result include better initialization of weights and biases, a different activation function, or a more sophisticated optimization method.
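
As one illustration of changing the activation function, a frequent choice for classification is a softmax output combined with cross-entropy loss instead of tanh with MSE. The block below is a sketch under that assumption and is not what the notebook implements. The function names are hypothetical. It relies on the standard result that the gradient of the mean cross-entropy with respect to the logits is `(softmax(logits) - y_true) / N`.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(y_true, probs, eps=1e-12):
    # Mean negative log-likelihood of the true classes.
    return -np.mean(np.sum(y_true * np.log(probs + eps), axis=1))

def cross_entropy_grad_wrt_logits(y_true, probs):
    # Combined softmax + cross-entropy gradient, averaged over the batch.
    return (probs - y_true) / y_true.shape[0]

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

probs = softmax(logits)
loss = cross_entropy(y_true, probs)
grad = cross_entropy_grad_wrt_logits(y_true, probs)
print(f"loss = {loss:.4f}")
```

In the layer structure used here, `grad` would take the place of the value returned by `loss_derivative` and would be passed directly to the last `FullyConnected` layer, with no final `Tanh`.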