# Understanding Multilayer Perceptrons (MLPs)

## Exercise 1

**Problem data.**
Input $x=[0.5,\,-0.2]$, target $y=1.0$.
Hidden layer (2 neurons, tanh):

$$
W^{(1)}=\begin{bmatrix}0.3&-0.1\\ 0.2&0.4\end{bmatrix},\quad
b^{(1)}=\begin{bmatrix}0.1\\-0.2\end{bmatrix}
$$

Output layer (1 neuron, tanh):

$$
W^{(2)}=\begin{bmatrix}0.5&-0.3\end{bmatrix},\quad
b^{(2)}=0.2
$$

---

### 1) *Forward pass*

**Hidden-layer pre-activations** $z^{(1)}=W^{(1)}x+b^{(1)}$:

$$
\begin{aligned}
z^{(1)}_1&=0.3\cdot0.5+(-0.1)\cdot(-0.2)+0.1=0.27 \\
z^{(1)}_2&=0.2\cdot0.5+0.4\cdot(-0.2)-0.2=-0.18
\end{aligned}
\Longrightarrow\quad
z^{(1)}=\begin{bmatrix}0.270000\\-0.180000\end{bmatrix}
$$

**Hidden-layer activations** $a^{(1)}=\tanh(z^{(1)})$:

$$
a^{(1)}=\begin{bmatrix}\tanh(0.27)\\ \tanh(-0.18)\end{bmatrix}
=\begin{bmatrix}0.263625\\-0.178081\end{bmatrix}
$$

**Output pre-activation** $z^{(2)}=W^{(2)}a^{(1)}+b^{(2)}$:

$$
z^{(2)}=0.5\cdot0.263625+(-0.3)\cdot(-0.178081)+0.2
=0.385237
$$

**Output** $\hat y=\tanh(z^{(2)})=\tanh(0.385237)=\mathbf{0.367247}$.

---

### 2) *Loss*

$$
L=(y-\hat y)^2=(1-0.367247)^2=\mathbf{0.400377}.
$$
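
The forward pass and loss above can be reproduced in a few lines of NumPy. This is only a sketch for checking the hand calculation; the variable names (`W1`, `b1`, `W2`, `b2`, `y_hat`) are illustrative choices that mirror the symbols in the text.

```python
import numpy as np

# Values copied from the problem statement.
x = np.array([0.5, -0.2])
y = 1.0
W1 = np.array([[0.3, -0.1], [0.2, 0.4]])
b1 = np.array([0.1, -0.2])
W2 = np.array([0.5, -0.3])
b2 = 0.2

z1 = W1 @ x + b1          # hidden pre-activations
a1 = np.tanh(z1)          # hidden activations
z2 = W2 @ a1 + b2         # output pre-activation (scalar)
y_hat = np.tanh(z2)       # network output
loss = (y - y_hat) ** 2   # squared error for the single sample

print("z1 =", np.round(z1, 6))
print("a1 =", np.round(a1, 6))
print("z2 =", round(float(z2), 6))
print("y_hat =", round(float(y_hat), 6))
print("loss =", round(float(loss), 6))
```

The printed values should agree with the worked numbers to about six decimal places, up to rounding in the last digit.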

---

### 3) *Backward pass*

Derivative of the loss with respect to the output:

$$
\frac{\partial L}{\partial \hat y}=2(\hat y-y)=2(0.367247-1)=-1.265507.
$$

Derivative of tanh: $\frac{d}{dz}\tanh(z)=1-\tanh^2(z)$. Since $\hat y=\tanh(z^{(2)})$, the chain rule gives:

$$
\frac{\partial L}{\partial z^{(2)}}=\frac{\partial L}{\partial \hat y}\,(1-\hat y^2)
=-1.265507\cdot(1-0.367247^2)=\mathbf{-1.094828}.
$$

**Output-layer gradients**

$$
\frac{\partial L}{\partial W^{(2)}}=\frac{\partial L}{\partial z^{(2)}}\,(a^{(1)})^{\top}
=\begin{bmatrix}-0.288624 & 0.194968\end{bmatrix},\quad
\frac{\partial L}{\partial b^{(2)}}=\mathbf{-1.094828}.
$$

**Backpropagation to the hidden layer**

$$
\frac{\partial L}{\partial a^{(1)}}=(W^{(2)})^{\top}\,\frac{\partial L}{\partial z^{(2)}}
=\begin{bmatrix}-0.547414\\ 0.328448\end{bmatrix},\qquad
1-(a^{(1)})^2=\begin{bmatrix}0.930502\\ 0.968287\end{bmatrix}.
$$

$$
\frac{\partial L}{\partial z^{(1)}}=\frac{\partial L}{\partial a^{(1)}}\odot\bigl(1-(a^{(1)})^2\bigr)
=\begin{bmatrix}-0.509370\\ 0.318032\end{bmatrix}.
$$

**Hidden-layer gradients** ($\frac{\partial L}{\partial W^{(1)}}=\frac{\partial L}{\partial z^{(1)}}\,x^{\top}$, the outer product with $x=[0.5,-0.2]$):

$$
\frac{\partial L}{\partial W^{(1)}}
=\begin{bmatrix}
-0.254685 & 0.101874\\
\phantom{-}0.159016 & -0.063606
\end{bmatrix},\quad
\frac{\partial L}{\partial b^{(1)}}=\begin{bmatrix}-0.509370\\ 0.318032\end{bmatrix}.
$$
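
One way to gain confidence in the gradients above is to compare them with central finite differences. The sketch below recomputes the analytic gradients and checks $\partial L/\partial W^{(1)}$ numerically; the helper `loss_fn` and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

# Values copied from the problem statement.
x = np.array([0.5, -0.2])
y = 1.0
W1 = np.array([[0.3, -0.1], [0.2, 0.4]])
b1 = np.array([0.1, -0.2])
W2 = np.array([0.5, -0.3])
b2 = 0.2

def loss_fn(W1, b1, W2, b2):
    """Squared error of the 2-2-1 tanh network on the single sample (x, y)."""
    a1 = np.tanh(W1 @ x + b1)
    return (y - np.tanh(W2 @ a1 + b2)) ** 2

# Analytic gradients, following the chain rule step by step.
a1 = np.tanh(W1 @ x + b1)
y_hat = np.tanh(W2 @ a1 + b2)
dz2 = 2 * (y_hat - y) * (1 - y_hat ** 2)   # dL/dz2
dW2 = dz2 * a1                              # dL/dW2
db2 = dz2                                   # dL/db2
da1 = dz2 * W2                              # dL/da1
dz1 = da1 * (1 - a1 ** 2)                   # dL/dz1
dW1 = np.outer(dz1, x)                      # dL/dW1
db1 = dz1                                   # dL/db1

# Numerical gradient of W1 by central differences.
eps = 1e-6
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num_dW1[i, j] = (loss_fn(Wp, b1, W2, b2) - loss_fn(Wm, b1, W2, b2)) / (2 * eps)

print("analytic dW1:\n", np.round(dW1, 6))
print("numeric  dW1:\n", np.round(num_dW1, 6))
print("max abs difference:", np.abs(dW1 - num_dW1).max())
```

The same loop can be repeated for `b1`, `W2`, and `b2` to check the remaining gradients.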

---

### 4) *Parameter update*  $(\eta=\mathbf{0.1})$

$$
\begin{aligned}
W^{(2)}_{\text{new}}&=W^{(2)}-\eta\,\frac{\partial L}{\partial W^{(2)}}
=\begin{bmatrix}0.528862 & -0.319497\end{bmatrix} \\
b^{(2)}_{\text{new}}&=b^{(2)}-\eta\,\frac{\partial L}{\partial b^{(2)}}
=\mathbf{0.309483} \\
W^{(1)}_{\text{new}}&=W^{(1)}-\eta\,\frac{\partial L}{\partial W^{(1)}} \\
&=\begin{bmatrix}
0.325468 & -0.110187\\
0.184098 & \phantom{-}0.406361
\end{bmatrix}\\[2pt]
b^{(1)}_{\text{new}}&=b^{(1)}-\eta\,\frac{\partial L}{\partial b^{(1)}}
=\begin{bmatrix}\phantom{-}0.150937\\ -0.231803\end{bmatrix}.
\end{aligned}
$$
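
As a sanity check on the update, the sketch below applies one gradient-descent step with $\eta=0.1$ and re-evaluates the loss with the new parameters. The helper `forward` and the names `loss_before` and `loss_after` are illustrative, not part of the exercise.

```python
import numpy as np

# Values copied from the problem statement.
x = np.array([0.5, -0.2])
y = 1.0
W1 = np.array([[0.3, -0.1], [0.2, 0.4]])
b1 = np.array([0.1, -0.2])
W2 = np.array([0.5, -0.3])
b2 = 0.2
eta = 0.1

def forward(W1, b1, W2, b2):
    """Return hidden activations and output of the 2-2-1 tanh network."""
    a1 = np.tanh(W1 @ x + b1)
    y_hat = np.tanh(W2 @ a1 + b2)
    return a1, y_hat

a1, y_hat = forward(W1, b1, W2, b2)
loss_before = (y - y_hat) ** 2

# Gradients from the backward pass.
dz2 = 2 * (y_hat - y) * (1 - y_hat ** 2)
dz1 = (dz2 * W2) * (1 - a1 ** 2)

# One gradient-descent step on every parameter.
W2_new = W2 - eta * dz2 * a1
b2_new = b2 - eta * dz2
W1_new = W1 - eta * np.outer(dz1, x)
b1_new = b1 - eta * dz1

_, y_hat_new = forward(W1_new, b1_new, W2_new, b2_new)
loss_after = (y - y_hat_new) ** 2

print("W2_new =", np.round(W2_new, 6), "b2_new =", round(float(b2_new), 6))
print("W1_new =\n", np.round(W1_new, 6))
print("b1_new =", np.round(b1_new, 6))
print("loss before:", float(loss_before), "loss after:", float(loss_after))
```

If the step is correct, the loss after the update should be lower than the loss before it.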

## Exercise 2: MLP *from scratch* (binary classification in 2D)

Generate a 2D dataset with **1 cluster** for class 0 and **2 clusters** for class 1 (using `make_classification` on subsets), train an **MLP from scratch** (NumPy only) with 1 hidden layer and binary cross-entropy (BCE) loss, and evaluate: training loss, test accuracy, and decision boundary.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

np.random.seed(42)
plt.rcParams["figure.figsize"] = (6, 5)

def plot_scatter(X, y, title=""):
    plt.figure()
    plt.scatter(X[y==0,0], X[y==0,1], s=10, alpha=0.7, label="class 0")
    plt.scatter(X[y==1,0], X[y==1,1], s=10, alpha=0.7, label="class 1")
    plt.xlabel("x1"); plt.ylabel("x2"); plt.legend(); plt.title(title)
    plt.show()

In [None]:
n0, n1 = 500, 500  # total = 1000

# Class-0-only subset (1 cluster): weights=[1.0, 0.0] assigns every sample to class 0
X0, y0 = make_classification(
    n_samples=n0, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=2, weights=[1.0, 0.0], class_sep=1.5,
    flip_y=0.0, random_state=42
)

# Class-1-only subset (2 clusters): weights=[0.0, 1.0] assigns every sample to class 1
X1, y1 = make_classification(
    n_samples=n1, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=2, n_classes=2, weights=[0.0, 1.0], class_sep=1.5,
    flip_y=0.0, random_state=43
)

X = np.vstack([X0, X1])
y = np.hstack([y0, y1])  # labels {0,1}

# shuffle
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]

plot_scatter(X, y, "2D data: class 0 (1 cluster) vs class 1 (2 clusters)")
X.shape, y.shape, y.min(), y.max()


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape

In [None]:
def tanh(x): return np.tanh(x)
def dtanh(a): return 1.0 - a**2               # a = tanh(z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_prob, eps=1e-9):
    y_true = y_true.reshape(-1, 1)
    y_prob = np.clip(y_prob, eps, 1.0-eps)
    return -(y_true*np.log(y_prob) + (1-y_true)*np.log(1-y_prob)).mean()

class MLPBinary:
    def __init__(self, in_dim=2, hidden=8, lr=0.05, seed=42):
        rng = np.random.default_rng(seed)
        # Xavier (Glorot) uniform init for tanh: U(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out)))
        lim1 = np.sqrt(6/(in_dim+hidden))
        self.W1 = rng.uniform(-lim1, lim1, size=(in_dim, hidden))
        self.b1 = np.zeros((1, hidden))
        lim2 = np.sqrt(6/(hidden+1))
        self.W2 = rng.uniform(-lim2, lim2, size=(hidden, 1))
        self.b2 = np.zeros((1, 1))
        self.lr = lr
        self.loss_hist = []

    def forward(self, X):
        Z1 = X @ self.W1 + self.b1          # (N,H)
        A1 = tanh(Z1)                        # (N,H)
        Z2 = A1 @ self.W2 + self.b2          # (N,1)
        A2 = sigmoid(Z2)                     # (N,1)
        cache = (X, Z1, A1, Z2, A2)
        return A2, cache

    def backward(self, cache, y_true):
        X, Z1, A1, Z2, A2 = cache
        N = X.shape[0]
        y = y_true.reshape(-1, 1)

        # BCE + sigmoid: dZ2 = (A2 - y) / N
        dZ2 = (A2 - y) / N                   # (N,1)
        dW2 = A1.T @ dZ2                     # (H,1)
        db2 = dZ2.sum(axis=0, keepdims=True) # (1,1)

        dA1 = dZ2 @ self.W2.T                # (N,H)
        dZ1 = dA1 * dtanh(A1)                # (N,H)
        dW1 = X.T @ dZ1                      # (in_dim,H)
        db1 = dZ1.sum(axis=0, keepdims=True) # (1,H)

        grads = (dW1, db1, dW2, db2)
        return grads

    def step(self, grads):
        dW1, db1, dW2, db2 = grads
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2

    def fit(self, X, y, epochs=300, verbose=False):
        self.loss_hist.clear()
        for ep in range(1, epochs+1):
            y_prob, cache = self.forward(X)
            loss = bce_loss(y, y_prob)
            self.loss_hist.append(float(loss))
            grads = self.backward(cache, y)
            self.step(grads)
            if verbose and ep % 50 == 0:
                print(f"epoch {ep:03d} | loss={loss:.4f}")
        return self

    def predict_proba(self, X):
        y_prob, _ = self.forward(X)
        return y_prob

    def predict(self, X, thr=0.5):
        return (self.predict_proba(X) >= thr).astype(int).ravel()


In [None]:
mlp = MLPBinary(in_dim=2, hidden=8, lr=0.05, seed=42).fit(X_train, y_train, epochs=300, verbose=True)

y_pred_test = mlp.predict(X_test)
test_acc = (y_pred_test == y_test).mean()

print(f"Test accuracy: {test_acc:.3f}")
plt.figure(); plt.plot(mlp.loss_hist); plt.xlabel("epoch"); plt.ylabel("loss (BCE)"); plt.title("Training loss"); plt.show()


In [None]:
# mesh grid for the decision boundary
xx, yy = np.meshgrid(
    np.linspace(X[:,0].min()-1, X[:,0].max()+1, 300),
    np.linspace(X[:,1].min()-1, X[:,1].max()+1, 300)
)
grid = np.c_[xx.ravel(), yy.ravel()]
probs = mlp.predict_proba(grid).reshape(xx.shape)

plt.figure(figsize=(6,5))
cs = plt.contourf(xx, yy, probs, levels=20, alpha=0.4)
plt.scatter(X_test[y_test==0,0], X_test[y_test==0,1], s=12, label="test: class 0")
plt.scatter(X_test[y_test==1,0], X_test[y_test==1,1], s=12, label="test: class 1")
plt.colorbar(cs, label="p(class 1)")  # bind the colorbar to the contour plot, not the last scatter
plt.title("Decision boundary (MLP)")
plt.xlabel("x1"); plt.ylabel("x2"); plt.legend()
plt.show()


The dataset was built with **1 cluster** for class 0 and **2 clusters** for class 1, which creates nonlinear class regions. An MLP with `tanh` in the hidden layer and `sigmoid` at the output, trained with **BCE**, models curved boundaries and overcomes the limitation of the linear perceptron from the previous exercise. The **loss** curve decreases steadily and the boundary follows the geometry of the data (two clusters in class 1). Results may vary with `hidden`, `lr`, and the number of epochs; moderate increases in `hidden` or `epochs` tend to improve accuracy until it saturates.
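
The `backward` method relies on the identity $\partial L/\partial Z^{(2)}=(A^{(2)}-y)/N$ for mean BCE applied to a sigmoid output. A quick way to convince yourself of this is a finite-difference check on a handful of logits. The sketch below is standalone: `mean_bce`, the random logits `z`, and the labels `y` are made up for the check and are not taken from the dataset above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_bce(z, y):
    """Mean binary cross-entropy computed from logits z and labels y."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

rng = np.random.default_rng(0)
z = rng.normal(size=5)                    # hypothetical logits
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # hypothetical labels

# Analytic gradient used in the backward pass: (sigmoid(z) - y) / N.
analytic = (sigmoid(z) - y) / z.size

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    zp = z.copy(); zp[i] += eps
    zm = z.copy(); zm[i] -= eps
    numeric[i] = (mean_bce(zp, y) - mean_bce(zm, y)) / (2 * eps)

max_diff = np.abs(analytic - numeric).max()
print("max abs difference:", max_diff)
```

A difference close to floating-point noise indicates that the sigmoid derivative and the BCE derivative cancel as expected, which is why `backward` never multiplies by `A2 * (1 - A2)` explicitly.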
