Single-layer neural network

Let (X, y) be the dataset.

The single-layer neural network is defined by: $\hat{y} = f(W X + B)$

Where:

 - W: neuron weights
 - B: bias
 - f: nonlinear activation function
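
As a minimal sketch of the definition above, the forward computation $\hat{y} = f(W X + B)$ can be written in NumPy. The sigmoid activation, the layer sizes and the input values are assumptions chosen for illustration; the definition itself only requires `f` to be nonlinear.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, used here as one possible nonlinear activation f
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, B, X, f=sigmoid):
    # y_hat = f(W X + B) for a single input vector X
    return f(W @ X + B)

rng = np.random.default_rng(0)       # fixed seed, for reproducibility only
W = rng.random((3, 4))               # 3 neurons, 4 input features (illustrative sizes)
B = rng.random(3)                    # one bias per neuron
X = np.array([5.1, 3.5, 1.4, 0.2])   # one input vector with 4 features

y_hat = forward(W, B, X)
print(y_hat)
```

Each row of `W` holds the weights of one neuron, so the output has one value per neuron.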
 

We want to minimize an error E between the target $y$ and the prediction $\hat{y}$. Common choices are:

 - Mean squared error (regression): $E = \frac{1}{2}(y - \hat{y})^2$
 - Cross-entropy (classification): $L_W(\hat{y}, y) = -\sum_c y_c \log(\hat{y}_c) = -\log(\hat{y}_{c^*})$, where $c^*$ is the true class and $y$ is one-hot
 - Kullback-Leibler divergence
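
The three error functions listed above can be sketched as follows. The function names, the small `eps` guard inside the logarithms and the example vectors are assumptions made for illustration. The squared error is summed over the output components, which generalizes the scalar formula to vectors.

```python
import numpy as np

def mse(y, y_hat):
    # E = 1/2 * sum (y - y_hat)^2
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    # -sum_c y_c log(y_hat_c); eps guards against log(0)
    return -np.sum(y * np.log(y_hat + eps))

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_c p_c log(p_c / q_c); terms with p_c = 0 contribute 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))

y = np.array([1.0, 0.0, 0.0])       # one-hot target
y_hat = np.array([0.7, 0.2, 0.1])   # made-up predicted distribution

e_mse = mse(y, y_hat)
e_ce = cross_entropy(y, y_hat)
e_kl = kl_divergence(y, y_hat)
print(e_mse, e_ce, e_kl)
```

With a one-hot target, the cross-entropy reduces to $-\log(\hat{y}_{c^*})$ and coincides with the Kullback-Leibler divergence, because the entropy of a one-hot distribution is zero.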

We minimize it by **gradient descent**, where $\epsilon$ is the learning rate and $h$ the iteration index:

$W^h = W^{h-1} - \epsilon \frac {\partial E} {\partial W^{h-1}}$

We compute the partial derivatives with the chain rule, using $\hat{y} = f(W X + B)$:

$\frac {\partial E} {\partial W^{h-1}} = \frac {\partial E} {\partial \hat{y}} \frac {\partial \hat{y}} {\partial W^{h-1}}$
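
The update rule and the chain rule above can be illustrated on a single training example. This is a sketch under stated assumptions: a sigmoid activation, the squared error $E = \frac{1}{2}\|y - \hat{y}\|^2$, small random initial weights, a zero bias that is left fixed, and a learning rate of 0.1. None of these choices come from the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(W, B, x, y):
    # E = 1/2 * ||y - f(W x + B)||^2
    y_hat = sigmoid(W @ x + B)
    return 0.5 * np.sum((y - y_hat) ** 2)

def grad_W(W, B, x, y):
    y_hat = sigmoid(W @ x + B)
    # Chain rule: dE/dy_hat = (y_hat - y), dy_hat/dz = y_hat * (1 - y_hat), dz/dW = x
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)
    return np.outer(delta, x)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 4))   # small initial weights (assumption)
B = np.zeros(3)                          # bias kept fixed in this sketch
x = np.array([5.1, 3.5, 1.4, 0.2])
y = np.array([1.0, 0.0, 0.0])
lr = 0.1                                 # plays the role of epsilon in the update rule

errors = [error(W, B, x, y)]
for _ in range(100):
    W = W - lr * grad_W(W, B, x, y)      # W^h = W^{h-1} - epsilon * dE/dW^{h-1}
    errors.append(error(W, B, x, y))

print(errors[0], errors[-1])
```

The error after the loop should be lower than at the start. A full training loop would also update the bias and iterate over the whole dataset.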


In [1]:
!ls iris.data || wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

iris.data


In [2]:
import pandas as pd
import numpy as np

In [43]:
data = pd.read_csv("iris.data", header=None, names=["sepal length", "sepal width", "petal length", "petal width", "class"], dtype={"class": 'category'})

In [38]:
# Turns an output vector into a probability distribution
def softmax(s):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

In [96]:
# Sigmoid
def σ(x):
    return 1 / (1 + np.exp(-x))

In [97]:
num_features = data.shape[1] - 1
categories = data["class"].cat.categories.to_numpy()
num_class = len(categories)
# Weights: one row per class, one column per feature
W = np.random.random((num_class, num_features))
# Bias
W_b = np.random.random((num_class,))

## Forward

In [110]:
# Cast to float: the row mixes floats and a category, so a plain to_numpy() gives an object array
x = data.iloc[0][:4].to_numpy(dtype=float)
h = W @ x + W_b
# Activation function
s = σ(h)
y = softmax(s)
y

array([0.33718283, 0.32209291, 0.34072426])

In [117]:
# Note: in this cell ŷ is the one-hot target and y the model output
ŷ = categories == data.iloc[0]["class"]
error = np.sum((y - ŷ)**2)
error

0.6591634651413831
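
The comparison `categories == label` used above yields a boolean one-hot vector. The sketch below makes that encoding explicit on hard-coded values so that it runs without `iris.data`. The helper name `one_hot` and the prediction vector are made up for illustration; the class names are those found in the Iris dataset.

```python
import numpy as np

categories = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

def one_hot(label, categories):
    # Boolean comparison cast to float: 1.0 at the matching class, 0.0 elsewhere
    return (categories == label).astype(float)

target = one_hot("Iris-versicolor", categories)
prediction = np.array([0.2, 0.5, 0.3])   # made-up model output, for illustration

# Sum of squared differences, as in the cell above
squared_error = np.sum((prediction - target) ** 2)
print(target, squared_error)
```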

## Backward / Error backpropagation

We want to minimize the error. To do so, we will use gradient descent.

The error (written $\epsilon$ in this section) is:
$
\epsilon = \frac {1} {2} ||\hat{y} - \mathrm{model}(x)||^2
$

And its derivative:
$
\delta_\epsilon = \frac {d \epsilon} {d x} = \frac {d \left( \frac {1} {2} ||\hat{y} - \mathrm{model}(x)||^2 \right)} {d x}
$

To compute this result, we use the **derivative of composite functions** *(f ∘ g ∘ h)(z)*, known as the "chain rule". With $y = softmax(s)$, $s = \sigma(h)$ and $h = W x + W_b$:

$
\delta_\epsilon = \frac {d \left( \frac {1} {2} ||\hat{y} - y||^2 \right)} {d y} \cdot \frac {d\, softmax(s)} {d s} \cdot \frac {d \sigma(h)} {d h} \cdot \frac {d (W x + W_b)} {d x}
$

The [derivative of softmax][1] is somewhat involved to compute.

Where:
 * $ \frac {d \left( \frac {1} {2} ||\hat{y} - y||^2 \right)} {d y} = y - \hat{y}$
 * $ \frac {d \sigma(h)} {d h} = \sigma(h)(1 - \sigma(h))$
 * $ \frac {d (W x + W_b)} {d x} = W$
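
The chain of derivatives above can be checked numerically. The sketch below assumes the same model as the forward pass, `softmax(sigmoid(W x + W_b))` with a squared error, but takes the last factor with respect to `W` rather than `x`, since the weights are what gradient descent updates. That last factor then becomes `x` instead of `W`. The softmax Jacobian `diag(y) - y yᵀ` is the standard result. The function names and the finite-difference check are illustrative assumptions.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def loss(W, W_b, x, target):
    y = softmax(sigmoid(W @ x + W_b))
    return 0.5 * np.sum((target - y) ** 2)

def gradients(W, W_b, x, target):
    h = W @ x + W_b
    s = sigmoid(h)
    y = softmax(s)
    d_y = y - target                     # d(1/2 ||target - y||^2) / dy
    jac = np.diag(y) - np.outer(y, y)    # d softmax(s) / ds (symmetric Jacobian)
    d_s = jac @ d_y
    d_h = d_s * s * (1.0 - s)            # d sigmoid(h) / dh = s (1 - s)
    return np.outer(d_h, x), d_h         # gradients w.r.t. W and W_b

rng = np.random.default_rng(0)
W = rng.random((3, 4))
W_b = rng.random(3)
x = np.array([5.1, 3.5, 1.4, 0.2])
target = np.array([1.0, 0.0, 0.0])

grad_W, grad_b = gradients(W, W_b, x, target)

# Central finite differences on every weight, to validate the analytic gradient
step = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += step
        W_minus[i, j] -= step
        numeric[i, j] = (loss(W_plus, W_b, x, target) - loss(W_minus, W_b, x, target)) / (2 * step)

max_diff = np.max(np.abs(grad_W - numeric))
print(max_diff)
```

A very small `max_diff` indicates that the analytic gradient agrees with the numerical one. A gradient descent step would then be `W -= lr * grad_W` and `W_b -= lr * grad_b` for some learning rate `lr`.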
 
References:
 * [1](https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1)
 * [2](https://deepnotes.io/softmax-crossentropy)

Cross Entropy Loss
Cross entropy indicates the distance between what the model believes the output distribution should be, and what the original distribution really is. It is defined as $H(y, p) = -\sum_i y_i \log(p_i)$. Cross entropy measure is a widely used alternative of squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks which have softmax activations in the output layer.
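
One reason cross-entropy pairs well with a softmax output is that the gradient with respect to the softmax input simplifies to `p - y` when `y` sums to 1. The sketch below checks this standard result numerically. The scores `z` and the one-hot target are arbitrary values chosen for illustration, and the softmax is applied directly to `z`, unlike the notebook's model, which applies a sigmoid first.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y, p):
    # H(y, p) = -sum_i y_i log(p_i)
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1])   # arbitrary scores fed to softmax
y = np.array([1.0, 0.0, 0.0])   # one-hot target
p = softmax(z)

analytic = p - y                # gradient of H(y, softmax(z)) w.r.t. z

# Central finite differences, one coordinate of z at a time
step = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = step
    numeric[i] = (cross_entropy(y, softmax(z + dz)) - cross_entropy(y, softmax(z - dz))) / (2 * step)

print(analytic, numeric)
```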

In [46]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   sepal length  150 non-null    float64 
 1   sepal width   150 non-null    float64 
 2   petal length  150 non-null    float64 
 3   petal width   150 non-null    float64 
 4   class         150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.0 KB
