
# Ensemble learning

$K$ classifiers: $h_1, h_2, \dots, h_K$

Ensemble by majority voting

Each hypothesis has the same error rate $ɛ$, and the errors are independent of each other

Ensemble decision for binary classification $y ∈ \{ 0, 1 \}$

$$
\begin{cases}
\hat{y} = 1 \quad &\textrm{if at least $M$ hypotheses predict $\hat{y}_i = 1$} \\
\hat{y} = 0 \quad &\textrm{else} 
\end{cases}
$$

where $M = \begin{cases} \frac{K+1}{2} \quad &\textrm{if $K$ odd} \\ \frac{K}{2} + 1 \quad &\textrm{if $K$ even} \end{cases}$

Note that in practice the assumption of independent errors is too strong: an example that is troublesome for one classifier is usually troublesome for the others as well.

$$
P(\textrm{ensemble miss}) = P(\textrm{at least $M$ classifiers miss}) 
$$

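Start by computing $P(\textrm{n classifiers miss})$, the probability that exactly $n$ of the $K$ classifiers miss.
Since each classifier misses independently with probability $ɛ$, the number of misses follows a Binomial distribution: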
$$
P(\textrm{n classifiers miss}) = \binom{K}{n} ɛ^n (1-ɛ)^{K - n}
$$
Thus, we conclude
$$
P(\textrm{ensemble miss}) = P(\textrm{at least $M$ classifiers miss})
= \sum_{n=M}^{K} \binom{K}{n} ɛ^n (1-ɛ)^{K - n}
$$
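
The sum above can be evaluated directly. The following is a minimal sketch using only the standard library; the function name `ensemble_error` is an illustrative choice, and it relies on the same independence assumption as the derivation.

```python
from math import comb


def ensemble_error(K: int, eps: float) -> float:
    """Probability that a majority vote of K independent classifiers misses."""
    # Smallest strict majority: (K+1)/2 for odd K, K/2 + 1 for even K.
    M = K // 2 + 1
    # Sum the Binomial probabilities of exactly n misses, for n = M, ..., K.
    return sum(comb(K, n) * eps**n * (1 - eps) ** (K - n) for n in range(M, K + 1))


print(ensemble_error(5, 0.1))  # roughly 0.0086
```

Writing $M$ as `K // 2 + 1` covers both the odd and the even case of the definition in a single expression.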

Now computing the probability for the specified cases:
$$
\begin{align}
K = 5, ɛ=0.1 &\Rightarrow P = 0.0086 \\
K = 20, ɛ=0.1 &\Rightarrow P = 7.09 \times 10^{-7} \\
K = 5, ɛ=0.4 &\Rightarrow P = 0.3174 \\
K = 20, ɛ=0.4 &\Rightarrow P = 0.1275 \\
\end{align}
$$
We can see that $P$ decreases as the number of hypotheses grows and as each hypothesis becomes more accurate individually (both error rates considered here satisfy $ɛ < 0.5$).


In [1]:
from scipy.stats import binom

In [4]:
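K = [5, 20]       # ensemble sizes
eps = [0.1, 0.4]  # individual error rates
for k in K:
    for ep in eps:
        # Smallest number of wrong votes that makes the majority wrong
        m = (k+1)//2 if k%2 != 0 else k//2 + 1
        # P(at least m misses) = 1 - P(at most m-1 misses)
        p = 1 - binom.cdf(m-1, k, ep)
        print(f"{k}, {ep}, {p}")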

5, 0.1, 0.008560000000000012
5, 0.4, 0.31744000000000006
20, 0.1, 7.088606331917546e-07
20, 0.4, 0.12752124614721672
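
As a sanity check on the closed-form values, the ensemble can also be simulated. This is a rough Monte Carlo sketch under the same independence assumption; the function name `simulate_ensemble_error` and the `trials` and `seed` parameters are illustrative choices, not part of the exercise.

```python
import random


def simulate_ensemble_error(K, eps, trials=100_000, seed=0):
    """Estimate the majority-vote error rate by sampling independent misses."""
    rng = random.Random(seed)
    M = K // 2 + 1  # number of misses needed for the majority to be wrong
    ensemble_misses = 0
    for _ in range(trials):
        # Each classifier misses independently with probability eps.
        wrong = sum(rng.random() < eps for _ in range(K))
        if wrong >= M:
            ensemble_misses += 1
    return ensemble_misses / trials


print(simulate_ensemble_error(5, 0.4))
```

With enough trials the estimate should land close to the Binomial result for $K = 5$, $ɛ = 0.4$. Very small probabilities, such as the $K = 20$, $ɛ = 0.1$ case, would need far more trials to resolve.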


# Neural network

Without activation functions, a neural network reduces to a linear (affine) model of its inputs, because a composition of affine maps is itself affine. In this case, we have:

$$
\begin{align}
\textrm{output} &= v_0 + v_1(w_0 + w_1 x + w_2 y) + v_2(u_0 + u_1 x + u_2 y) \\
&= c + bx + ay
\end{align}
$$

where $c = v_0 + v_1 w_0 + v_2 u_0$, $b = v_1 w_1 + v_2 u_1$, and $a = v_1 w_2 + v_2 u_2$.
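
The collapse can be checked numerically. The sketch below evaluates the two-hidden-unit network without activations and the equivalent model $c + bx + ay$; the weight values and function names are arbitrary illustrations, not taken from the exercise.

```python
def network_output(x, y, w, u, v):
    """Two hidden units with no activation, followed by a linear output unit."""
    h1 = w[0] + w[1] * x + w[2] * y
    h2 = u[0] + u[1] * x + u[2] * y
    return v[0] + v[1] * h1 + v[2] * h2


def collapsed_output(x, y, w, u, v):
    """The same function written as a single affine model c + b*x + a*y."""
    c = v[0] + v[1] * w[0] + v[2] * u[0]
    b = v[1] * w[1] + v[2] * u[1]
    a = v[1] * w[2] + v[2] * u[2]
    return c + b * x + a * y


# Hypothetical weights, chosen only for illustration.
w = (0.5, -1.0, 2.0)
u = (1.5, 0.25, -0.75)
v = (-2.0, 3.0, 0.5)

for x, y in [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]:
    assert abs(network_output(x, y, w, u, v) - collapsed_output(x, y, w, u, v)) < 1e-12
```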

If, on top of that, we train with the squared-error loss (the MSE up to a constant factor $1/N$):

$$
\mathcal{L}(\hat{y}, y) = \sum_i (\hat{y}_i - y_i)^2
$$

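we arrive at the linear regression model fitted by least squares. Note that here $y_i$ and $\hat{y}_i$ denote the target and the network output for example $i$, not the input coordinate $y$ used above.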
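
To make the connection explicit, the loss can be written in terms of the collapsed parameters. In the sketch below the target is written as $t_i$, a symbol introduced here only to avoid the clash with the input $y$; setting the gradient to zero gives the usual least-squares conditions.

$$
\begin{align}
\mathcal{L}(a, b, c) &= \sum_i (c + b x_i + a y_i - t_i)^2 \\
\frac{\partial \mathcal{L}}{\partial c} &= 2 \sum_i (c + b x_i + a y_i - t_i) = 0 \\
\frac{\partial \mathcal{L}}{\partial b} &= 2 \sum_i (c + b x_i + a y_i - t_i) \, x_i = 0 \\
\frac{\partial \mathcal{L}}{\partial a} &= 2 \sum_i (c + b x_i + a y_i - t_i) \, y_i = 0
\end{align}
$$

These three equations are linear in $(a, b, c)$ and are the normal equations of ordinary least squares.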
