# Logistic Regression

### Binary classification examples

A training example consists of an input vector $\boldsymbol{x}$ and a binary output label $y$:  
$(\boldsymbol{x}, y) \mid \boldsymbol{x} \in \mathbb{R}^{n_x}, y \in \lbrace 0, 1 \rbrace$  

$m$ training examples:  
$
\left\lbrace
\left(\boldsymbol{x}^{(1)}, y^{(1)}\right), 
\left(\boldsymbol{x}^{(2)}, y^{(2)}\right),
\dots,
\left(\boldsymbol{x}^{(m)}, y^{(m)}\right)
\right\rbrace
$  

So the full training set $\boldsymbol{X}$ consists of $m$ column vectors, each $n_x$ long, giving an $(n_x \times m)$ shaped matrix:  
$
\boldsymbol{X} =
\begin{bmatrix}
\mid & \mid & \mid & \mid \\
\boldsymbol{x}^{(1)} &  \boldsymbol{x}^{(2)} & \dots & \boldsymbol{x}^{(m)} \\
\mid & \mid & \mid & \mid
\end{bmatrix}
$  

The corresponding labels $y$ form the following $(1 \times m)$ matrix:  
$
\boldsymbol{Y} =
\left[
y^{(1)}, y^{(2)}, \dots, y^{(m)}
\right]
$ 
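
A minimal sketch of this column-wise layout in plain Python. The dataset values and the helper name `stack_columns` are made up for illustration; they are not part of the original notes.

```python
def stack_columns(examples):
    """Arrange m feature vectors of length n_x as the columns of an (n_x x m) matrix."""
    # zip(*examples) transposes the list: row i collects feature i of every example.
    return [list(row) for row in zip(*examples)]

# Hypothetical training set: m = 3 examples with n_x = 2 features each.
examples = [[0.5, 1.2], [1.5, 0.3], [2.0, 2.2]]
labels = [0, 1, 1]

X = stack_columns(examples)  # shape (n_x, m) = (2, 3)
Y = [labels]                 # shape (1, m) = (1, 3)

print(len(X), len(X[0]))  # 2 3
print(len(Y), len(Y[0]))  # 1 3
```

Each column of `X` is one training example, and column $i$ of `Y` holds the label for column $i$ of `X`.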


### Logistic Regression - A Single-Neuron System for Solving Binary Classification

Looking for: $\hat{y} = P(y=1 \mid \boldsymbol{x})$  
Note: $0 \le \hat{y} \le 1$

Logistic Regression parameters: weights and bias: $\boldsymbol{w} \in \mathbb{R}^{n_x}, b \in \mathbb{R}$  
Logistic Regression output: $$\hat{y} = \sigma(\boldsymbol{w}^{\intercal}\boldsymbol{x} + b)$$

where $\sigma$ is the sigmoid function: $\sigma(z) = \frac{1}{1+e^{-z}}$
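
The forward computation can be sketched in a few lines of standard-library Python. The function names `sigmoid` and `predict` and the parameter values are illustrative assumptions, not taken from the notes.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Return y_hat = sigmoid(w . x + b) for a single example."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input with n_x = 2.
w = [0.4, -0.6]
b = 0.1
x = [1.0, 2.0]

y_hat = predict(w, x, b)
print(y_hat)  # a probability between 0 and 1
```

This plain version is only meant to mirror the formula; `math.exp(-z)` can overflow for very large negative `z`, so a more careful implementation would guard against that.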

### Cross-Entropy Loss Function ("Log Loss")

MSE (Mean Squared Error) loss $L(\hat{y}, y) = \frac{1}{2}(\hat{y}-y)^2$ is a poor fit for Logistic Regression: combined with the sigmoid it is non-convex in $\boldsymbol{w}$ and $b$, so optimization can get stuck in local minima.  

Instead we use **Cross-Entropy Loss**, which is convex in $\boldsymbol{w}$ and $b$ and therefore has no spurious local minima: $$L(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$$  
Intuition: since $y \in \lbrace0,1\rbrace$:  
- if $y=1$: $L(\hat{y}, y) = -\log\hat{y}$.
As $\hat{y}$ moves away from the true value 1 towards 0, $L$ grows without bound.
- if $y=0$: $L(\hat{y}, y) = -\log(1-\hat{y})$.
As $\hat{y}$ moves away from the true value 0 towards 1, $L$ grows without bound.  
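
A quick numeric check of this intuition. This is a sketch that assumes $\log$ is the natural logarithm (the notes do not state the base), and the name `log_loss` is illustrative.

```python
import math

def log_loss(y_hat, y):
    """Cross-entropy loss for one example (natural log assumed)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label y = 1: the loss rises steeply as the prediction drifts towards 0.
for y_hat in (0.9, 0.5, 0.1, 0.01):
    print(y_hat, round(log_loss(y_hat, 1), 3))
# 0.9 0.105
# 0.5 0.693
# 0.1 2.303
# 0.01 4.605
```

A confident wrong prediction ($\hat{y}=0.01$ when $y=1$) is penalized far more than a slightly uncertain correct one ($\hat{y}=0.9$).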

Why is it called *Cross-Entropy*?  

**Entropy** is the average amount of information contained in an element of a distribution.  
Information can be regarded as the reduction in uncertainty when an outcome becomes known. It can be measured by comparing a distribution (i.e. the prediction) to an actual fixed outcome. We want to measure how many times the uncertainty was halved when replacing the predicted distribution with the actual outcome. This is the number of *bits* of information.  
E.g. take a predicted discrete distribution $\hat{p}$ of size 2: $\hat{p}_0 = 0.75$ and $\hat{p}_1 = 0.25$. Let's say the fixed outcome is $p_0=0$ and $p_1=1$. The uncertainty for case $\hat{p}_1$ has decreased four-fold (from probability 0.25 to certainty), which is 2 bits of information. This can also be calculated as $\text{InfoBits} = -\log_2\hat{p}_1$. Applying the same calculation to $\hat{p}_0$ yields 0.415 bits. The weighted average of the two bit counts is the entropy of the distribution: $0.75 \cdot 0.415 + 0.25 \cdot 2 = 0.811$ bits.  
**Entropy function:**
$$H(\hat{p}) = -\sum_i(\hat{p}_i\log_2\hat{p}_i)$$
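
The worked example can be reproduced with a small sketch of the entropy function. The name `entropy` is illustrative; zero-probability terms are skipped, following the usual convention that they contribute nothing.

```python
import math

def entropy(p_hat):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # Skip zero entries: log2(0) is undefined, and such terms count as 0 by convention.
    return -sum(p * math.log2(p) for p in p_hat if p > 0)

print(round(entropy([0.75, 0.25]), 3))  # 0.811
print(entropy([0.5, 0.5]))              # 1.0
```

A fair coin carries exactly 1 bit, while the skewed distribution from the example carries less because its outcome is easier to guess.
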
**Cross-Entropy**, however, is the amount of information gained by learning a particular truth in relation to a predicted distribution. It depends on both the predicted distribution $\hat{p}$ and the true distribution $p$: the bit counts $-\log_2\hat{p}_i$ come from the prediction, but they are weighted by the true probabilities $p_i$.
$$H(p,\hat{p}) = -\sum_i(p_i\log_2\hat{p}_i)$$
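
To connect this back to the loss function, here is a short worked step for the binary case. It assumes the truth is $p = (1-y,\ y)$, the prediction is $\hat{p} = (1-\hat{y},\ \hat{y})$, and $\log$ in $L$ is the natural logarithm:
$$H(p,\hat{p}) = -\left(y\log_2\hat{y} + (1-y)\log_2(1-\hat{y})\right) = \frac{L(\hat{y}, y)}{\ln 2}$$
Under these assumptions the Log Loss is the binary cross-entropy measured in nats rather than bits. The two differ only by the constant factor $\ln 2$, which does not move the minimum. For the example above ($y=1$, $\hat{y}=0.25$) this gives $-\log_2 0.25 = 2$ bits.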

In [None]:
import math

def cross_entropy(y, yhat):
    # Binary cross-entropy in bits; requires 0 < yhat < 1.
    return -(y*math.log2(yhat) + (1-y)*math.log2(1-yhat))
