# Prerequisites:
- Logistic regression basics
- Probability
- Logarithm and exponents
- Linear algebra
- Calculus

# Learning Objectives:
- Revisit Sigmoid function
- Establish Binary cross entropy as cost function
- Optimize cost function with Gradient Descent


In the previous unit, we learned about supervised learning methods to predict continuous data using linear regression as well as discrete data using logistic regression. In this notebook, we will expound on the sigmoid function for binary classification and cost function for logistic regression as well as optimize the parameters using gradient descent.

The logistic regression model is similar to the linear regression model, as both compute a weighted sum of the input features. However, in logistic regression, the output of the linear model is passed through a logistic function (sigmoid function) to give the _logistic_ of the output, which is read as an estimated probability.

$$ \mathcal{\hat{P}} = h_w(x) = \sigma(w^Tx) $$

# Revisiting the Sigmoid Function

The sigmoid function is an S-shaped curve that is used to classify a given sample into positive or negative classes and is defined by the expression:

$$ \sigma(w^Tx) = \frac{1}{1 + e^{-w^Tx}} $$

<div align='center'>

![logistic](https://drive.google.com/uc?id=16iX4PEywb5Od9IMOXfMFL1q20HMDInfQ)

<figcaption> Figure 1: Sigmoid Activation Function</figcaption>

</div>

The sigmoid function, also known as the logistic function, $\sigma(x)$ maps every positive value of $x$ to a value in $(0.5, 1)$, approaching $1$ (the positive class) as $x$ grows, and every negative value of $x$ to a value in $(0, 0.5)$, approaching $0$ (the negative class) as $x$ decreases. As the output of the function is bounded within $(0,1)$, it can be interpreted as a probability.

We can easily predict the class after selecting an appropriate threshold. Let's take $0.5$ to be the threshold, resulting in the prediction $\hat{y}$:

$$ \hat{y} =
\begin{cases}
  1 & \text{if} & \mathcal{\hat{P}} \ge 0.5 \\
  0 & \text{if} & \mathcal{\hat{P}} < 0.5
\end{cases}
$$

Since $\sigma(x) < 0.5$ for all $x< 0$ and $\sigma(x) \ge 0.5$ for all $x \ge 0$, the logistic regression model will predict $1$ if $w^Tx$ is non-negative and $0$ if it is negative.
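
The thresholding rule above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function names `sigmoid` and `predict`, the weights `w`, and the samples in `X` are hypothetical and chosen so that $w^Tx$ is positive, negative, and exactly zero.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X, threshold=0.5):
    """Return 1 where sigma(w^T x) >= threshold, else 0, for each row x of X."""
    p_hat = sigmoid(X @ w)
    return (p_hat >= threshold).astype(int)

# Hypothetical weights and samples, chosen only for illustration.
w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],   # w^T x =  1.0 -> probability above 0.5
              [1.0, 1.0],   # w^T x = -1.0 -> probability below 0.5
              [2.0, 1.0]])  # w^T x =  0.0 -> probability exactly 0.5

print(sigmoid(X @ w))  # estimated probabilities
print(predict(w, X))   # predicted classes: [1 0 1]
```

Note that the sample with $w^Tx = 0$ sits exactly on the threshold and is assigned to the positive class because the rule uses $\ge$.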

# Cost function
In linear regression, we expressed the cost function as
\begin{equation*}
\mathcal{J}(w) = \frac{1}{m} \sum_{i=1}^m \frac{1}{2} (h_w(x^{(i)}) - y^{(i)})^2
\end{equation*}

With the sigmoid inside $h_w(x)$, the sum of squared errors (SSE) for logistic regression is a non-convex function of $w$, so gradient descent is not guaranteed to converge to the global minimum. Therefore, SSE is not a suitable cost function for the logistic regression model.

|![convex](https://drive.google.com/uc?id=1lPxYbWqr7wXqJxG2ToBNZxWii_9gea__) | ![non-convex](https://drive.google.com/uc?id=1Ut38s1JYMS_jbe29VQctOYTOq-YSaS62) |
|-|-|

<center><figcaption>Figure 2: Convex and Non-convex function</figcaption></center>

The logistic regression model, $ h_w(x) $, can be used to make a prediction as it estimates the probabilities.

$$ \mathcal{\hat{P}} = h_w(x) = \sigma(w^Tx) $$

We now need to choose the parameters $w$ such that the model $h_w(x)$ estimates high probabilities for the positive class $(y=1)$ and low probabilities for the negative class $(y=0)$. The following cost function captures this behavior of $h_w(x)$.

<a name='eq1'></a>
$$
\text{Cost}(h_w(x),y) =
\begin{cases}
  -\log (h_w(x)) & \text{if} & y = 1 \\
  -\log (1-h_w(x)) & \text{if} & y = 0
\end{cases}   \tag{Equation 1}
$$


For the positive class, this cost function gives a large value if the model estimates a probability close to $0$, as $-\log(h_w(x))$ increases when $h_w(x)$ approaches $0$. Similarly, the cost is very large if the model estimates a probability close to $1$ for the negative class. Conversely, when the model estimates a probability close to $1$ for the positive class, or close to $0$ for the negative class, the cost is close to $0$.

The following plots illustrate the behavior of the cost function with respect to the prediction of the model for each class.



<div align='center'>

![cost function](https://drive.google.com/uc?id=1P9LYnXXkH-mYqXgz9FKGweR_HB55vv8R)

<figcaption>Figure 3: Behavior of the cost function for the positive class $(y=1)$ and the negative class $(y=0)$ </figcaption>
</div>

If $y=1$, the cost function is given as
$$\text{Cost}(h_w(x),y) = -\log (h_w(x))$$

From the diagram above, we can observe that $\text{Cost}=0$ at $h_w(x) = 1$, where the sample is correctly classified as the positive class. For the wrong prediction, the cost becomes extremely large, as shown by the red dotted line at $h_w(x)=0$.

Here $h_w(x)=0$ corresponds to predicting $\mathcal{P}(y=1|x;w)=0$: the probability of the positive class $(y=1)$ given $x$, parameterized by $w$, is zero.

Similarly, if $y=0$, the cost function is given as
$$\text{Cost}(h_w(x),y) = -\log (1- h_w(x))$$

and the plot above shows $\text{Cost}=0$ at $h_w(x) = 0$, which is the correct prediction for the negative class. When the model predicts the positive class, $h_w(x) = 1$ (that is, $(1-h_w(x)) = 0$), the cost is extremely high.


Alternatively, the cost function can be written in a single line as:
<a name="eq2"></a>
$$ \text{Cost}(h_w(x), y) = -y\log(h_w(x)) - (1-y)\log(1-h_w(x)) \tag{Equation 2}$$

[Equation 2](#eq2) is equivalent to [Equation 1](#eq1) and behaves in the same manner.
- If $y=1$, the $(1-y)$ term becomes zero, so only $-\log(h_w(x))$ remains.
- Similarly, if $y=0$, the $y$ term becomes zero, so only $-\log(1-h_w(x))$ remains in the equation.
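
As a quick numerical check of the two cases above, the following sketch evaluates both forms of the cost for a few predicted probabilities. The function names `cost_piecewise` and `cost_single_line` are hypothetical, and the probabilities are kept strictly inside $(0,1)$ so that the logarithm is defined.

```python
import numpy as np

def cost_piecewise(h, y):
    """Equation 1: -log(h) if y == 1, else -log(1 - h)."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

def cost_single_line(h, y):
    """Equation 2: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * np.log(h) - (1 - y) * np.log(1.0 - h)

# Compare both forms for each class and a few predicted probabilities.
for y in (0, 1):
    for h in (0.01, 0.5, 0.99):
        c1 = cost_piecewise(h, y)
        c2 = cost_single_line(h, y)
        print(f"y={y}, h={h:.2f}: Equation 1 = {c1:.4f}, Equation 2 = {c2:.4f}")
```

Both columns agree for every pair, and the cost grows as the predicted probability moves away from the true label.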

## Normalized Binary Cross-entropy
The cost function across all examples (say, $m$ of them) can thus be represented with a single _log loss_ expression as:

\begin{align*}
\mathcal{J}(w) &= - \frac{1}{m} \sum_{i=1}^m \left[ y_i \log(h_w(x_i)) + (1-y_i) \log(1-h_w(x_i)) \right]
\end{align*}

Since $\mathcal{\hat{P}} = h_w(x_i) = \sigma(w^Tx_i)$, the cost function can also be expressed as
\begin{align*}
 \mathcal{J}(w) &=- \frac{1}{m} \sum_{i=1}^m \left[ y_i \log (\sigma(w^T x_i)) + (1-y_i) \log (1-\sigma(w^Tx_i) ) \right] \\
 &=- \frac{1}{m} \sum_{i=1}^m \left[ y_i \log (\mathcal{\hat{P}}(1|x_i;w))+ (1-y_i) \log (1-\mathcal{\hat{P}}(1|x_i;w) ) \right] \\
 &=- \frac{1}{m} \sum_{i=1}^m \left[ y_i \log (\mathcal{\hat{P}}(1|x_i;w))+ (1-y_i) \log (\mathcal{\hat{P}}(0|x_i;w) ) \right] \tag{Equation 3}
\end{align*}

Equation 3 is known as the **Normalized Binary Cross Entropy (BCE)**, which can be derived using the likelihood function. It is discussed in detail later in the Probabilistic Methods module.
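
Equation 3 translates almost directly into NumPy. The sketch below is one possible implementation under a few assumptions: the names `binary_cross_entropy`, `X`, and `y` are hypothetical, each row of `X` is one example $x_i$, and the clipping step is an added numerical safeguard against $\log(0)$ that is not part of Equation 3 itself.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, X, y, eps=1e-12):
    """Mean log loss (Equation 3) for weights w, inputs X (m x n), labels y (m,)."""
    p_hat = sigmoid(X @ w)                  # estimated P(1 | x_i; w) per example
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # safeguard only: avoids log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical toy data: three examples with two features each.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])
y = np.array([1, 0, 1])

# With all-zero weights every estimated probability is 0.5,
# so the loss equals log(2), roughly 0.6931.
print(binary_cross_entropy(np.zeros(2), X, y))
```

The value $\log 2$ at $w = 0$ is a convenient sanity check: a model that always outputs $0.5$ carries no information about the labels.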

Equation 3 has no known closed-form solution for the parameters $w$ that minimize the cost function (unlike least squares for linear regression, which has one). However, this function is convex ([the proof of convexity of this function](https://towardsdatascience.com/binary-cross-entropy-and-logistic-regression-bf7098e75559)), so an optimization algorithm such as Gradient Descent can find the global minimum, provided the hyperparameters (learning rate and number of iterations) are chosen appropriately.

# Optimizing parameters: Gradient Descent on Binary cross entropy



Let $z = w_1 x_1 + w_2 x_2 + b$ be our linear model, then let $ \hat{y} = a = \sigma(z) $ be our logistic model.

Also, let $\mathcal{J}(\hat{y}, y)$ represent our log-loss function for a single example, where $\hat{y}$ is the predicted probability of the positive class and $y$ is the actual class.

\begin{align*}
\mathcal{J}(\hat{y}, y) = - y \log(a) - (1-y) \log(1-a)  \tag{$\hat{y}$ = a }
\end{align*}

Taking the partial derivative with respect to $w_1$,

\begin{align*}
\frac{\partial}{\partial w_1} \mathcal{J}(\hat{y}, y) =  \frac{\partial}{\partial w_1} \left[ - y \log(a) - (1-y) \log(1-a) \right]
\end{align*}

Using chain rule,
\begin{align*}
\frac{\partial \mathcal{J}}{\partial w_1}  =  \frac{\partial \mathcal{J}}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z} {\partial w_1}
\end{align*}

Now figuring out the individual partial derivatives
\begin{align*}
 \frac{\partial \mathcal{J}}{\partial a} &= \frac{\partial}{\partial a} \left[ - y \log(a) - (1-y) \log(1-a) \right] \\
 &= -y \left( \frac{1}{a} \right) -(-1) \left( \frac{1-y}{1-a} \right) \\
 & = \frac{a-y}{a(1-a)}
\end{align*}

Similarly,
\begin{align*}
\frac{\partial a}{\partial z} = a(1-a) \tag{$a = \sigma(z) = \frac{1}{1+e^{-z}}$}
\end{align*}
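
For completeness, the derivative of the sigmoid stated above can be worked out directly from its definition, using the fact that $1 - a = \frac{e^{-z}}{1+e^{-z}}$:

\begin{align*}
\frac{\partial a}{\partial z} &= \frac{\partial}{\partial z} \left( 1 + e^{-z} \right)^{-1} \\
&= \frac{e^{-z}}{(1 + e^{-z})^2} \\
&= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\
&= a(1-a)
\end{align*}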

Lastly,
$$
\frac{\partial z} {\partial w_1} = \frac{\partial (w_1 x_1 + w_2 x_2 + b)}{\partial w_1} = x_1
$$

Hence,
\begin{align*}
\frac{\partial \mathcal{J}}{\partial w_1}  =  \frac{\partial \mathcal{J}}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z} {\partial w_1} = \frac{a-y}{a(1-a)} a(1-a) x_1 = (a-y)x_1
\end{align*}
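
One way to gain confidence in the result $(a-y)x_1$ is to compare it with a finite-difference approximation of the derivative. The sketch below does this for a single example; all numeric values and the helper name `loss` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Log loss for one example with z = w1*x1 + w2*x2 + b."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# Hypothetical parameter and input values.
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, 2.0, 1

# Analytic derivative from the chain rule: (a - y) * x1.
a = sigmoid(w1 * x1 + w2 * x2 + b)
analytic = (a - y) * x1

# Central finite difference in w1 with a small step h.
h = 1e-6
numeric = (loss(w1 + h, w2, b, x1, x2, y) - loss(w1 - h, w2, b, x1, x2, y)) / (2 * h)

print(analytic, numeric)  # the two values should agree to several decimal places
```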

Now, we can update $w_1$ using gradient descent
$$
w_1 = w_1 -\alpha \frac{\partial \mathcal{J}}{\partial w_1}
$$

Similarly, for all parameters
\begin{align*}
w_i &= w_i - \alpha  \frac{\partial \mathcal{J}}{\partial w_i} \tag{$i=1,2,\dots,n$; $n$ is the number of weights} \\
b &= b - \alpha  \frac{\partial \mathcal{J}}{\partial b}
\end{align*}
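
The derivative needed for the bias update follows from the same chain rule. As a short worked step, since $\frac{\partial z}{\partial b} = 1$ for the linear model defined above:

\begin{align*}
\frac{\partial \mathcal{J}}{\partial b} = \frac{\partial \mathcal{J}}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} = \frac{a-y}{a(1-a)} \, a(1-a) \cdot 1 = a - y
\end{align*}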


For $m$ examples, the binary cross-entropy loss is given as

$$
 \mathcal{J}(w) =- \frac{1}{m} \sum_{i=1}^m \left[ y_i \log (\sigma(w^T x_i)) + (1-y_i) \log (1-\sigma(w^Tx_i) ) \right]
$$

Taking the partial derivative with respect to $w_j$,

\begin{align*}
\frac{\partial J(w)}{\partial w_j} = \frac{\partial}{\partial w_j} \left[ - \frac{1}{m} \sum_{i=1}^m \left( y_i \log (\sigma(w^T x_i)) + (1-y_i) \log (1-\sigma(w^Tx_i) ) \right) \right]
\end{align*}

Applying the chain rule to each term of the sum, where $\frac{\partial J(w)}{\partial \sigma(w^Tx_i)}$ denotes the derivative of the $i$-th example's cost term (the factor $\frac{1}{m}$ is kept outside the sum), we get
\begin{align*}
\frac{\partial J(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m \frac{\partial J(w)}{\partial \sigma (w^Tx_i) } \frac{\partial \sigma (w^Tx_i) }{\partial (w^T x_i)} \frac{\partial (w^T x_i)}{\partial w_j} \tag{i}
\end{align*}

\begin{align*}
\frac{\partial J(w)}{\partial \sigma(w^Tx_i)} &= \frac{\partial}{\partial \sigma (w^T x_i)} \left[ - ( y_i \log (\sigma(w^T x_i)) + (1-y_i) \log (1-\sigma(w^Tx_i) )) \right] \\
&= - \left[ \frac{y_i}{\sigma (w^T x_i)} - \frac{(1-y_i)}{ 1- \sigma (w^Tx_i)} \right] \\
&=  \frac{\sigma(w^T x_i) -  y_i}{\sigma(w^Tx_i) (1-\sigma(w^T x_i))} \tag{a}
\end{align*}

We also know that $$ \sigma(w^Tx) = \frac{1}{1 + e^{-w^Tx}}$$
Taking the derivative with respect to $(w^T x)$, we get
\begin{align*}
\frac{\partial \sigma (w^Tx_i) }{\partial (w^T x_i)} = \sigma(w^T x_i) ( 1 - \sigma (w^T x_i) ) \tag{b}
\end{align*}

Lastly, writing $x_{ij}$ for the $j$-th feature of the $i$-th example,
\begin{align*}
\frac{\partial (w^Tx_i) }{\partial w_j} = x_{ij} \tag{c}
\end{align*}

Substituting (a), (b), and (c) in (i) we get

\begin{align*}
\frac{\partial J(w)}{\partial w_j} &= \frac{1}{m} \sum_{i=1}^m   \frac{\sigma(w^T x_i) -  y_i}{\sigma(w^Tx_i) (1-\sigma(w^T x_i))}  [\sigma(w^T x_i) ( 1 - \sigma (w^T x_i) ) ] x_{ij} \\
&=  \frac{1}{m} \sum_{i=1}^m ( \sigma(w^T x_i) - y_i) x_{ij}
\end{align*}
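
The final expression says that the $j$-th partial derivative is the average, over all examples, of the prediction error $\sigma(w^Tx_i) - y_i$ multiplied by the $j$-th feature of that example. Stacking the examples as rows of a matrix gives the vectorized form $\frac{1}{m} X^T(\sigma(Xw) - y)$. The sketch below, with hypothetical function names and randomly generated data, compares this form against a finite-difference approximation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, X, y):
    """Binary cross-entropy averaged over the m rows of X."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_gradient(w, X, y):
    """(1/m) * X^T (sigmoid(Xw) - y): one entry per parameter w_j."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ w) - y) / m

# Hypothetical random data: 5 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
w = rng.normal(size=3)

analytic = bce_gradient(w, X, y)

# Central finite difference along each coordinate direction e.
h = 1e-6
numeric = np.array([
    (bce(w + h * e, X, y) - bce(w - h * e, X, y)) / (2 * h)
    for e in np.eye(3)
])

print(np.max(np.abs(analytic - numeric)))  # should be very close to zero
```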

Now we can use Gradient Descent to optimize the $w$ parameters

$$
 w_j := w_j - \alpha \frac{\partial \mathcal{J}(w)}{\partial w_j} \tag{Equation 4}
$$

Equation 4 is used to update the parameters $w$ based on the gradient. The size of each step is the slope of the cost function scaled by the learning rate $\alpha$, and the update is applied to every $w_j$ and repeated until the cost stops decreasing or a fixed number of iterations is reached.
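
Putting the pieces together, a batch gradient descent loop for logistic regression might look like the following. This is a minimal sketch rather than a reference implementation: the function name `gradient_descent`, the toy data, the learning rate, and the iteration count are all assumptions, and the bias is handled by adding a column of ones to `X`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, X, y):
    """Binary cross-entropy; clipping is only a guard against log(0)."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Repeat the Equation 4 update for every w_j, starting from w = 0."""
    m = X.shape[0]
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / m  # gradient of the cost
        w = w - alpha * grad                   # simultaneous update of all w_j
        history.append(bce(w, X, y))           # cost after this update
    return w, history

# Hypothetical toy data: a column of ones (bias) plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0, 0, 1, 0, 1, 1])

w, history = gradient_descent(X, y)
print("learned parameters:", w)
print("cost after first and last update:", history[0], history[-1])
```

Recording the cost at every iteration makes it easy to plot the learning curve and to judge whether the chosen learning rate and number of iterations are adequate.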



# Key Takeaways
- Sigmoid function estimates probabilities and hence can be used to make a prediction.
- Binary Cross Entropy can be used as the cost function for the logistic regression classification.
- Binary cross entropy has no closed-form solution, but it is a convex function, so optimization methods such as gradient descent can be used to find the optimum parameters.

# References

* Books
  * Aurélien Géron (2017), Hands-On Machine Learning with Scikit-Learn and TensorFlow, 1st edition, O'Reilly
    * Part I, Chapter 4 Logistic Regression, page 202
* [Cross entropy](https://jermwatt.github.io/machine_learning_refined/notes/6_Linear_twoclass_classification/6_2_Cross_entropy.html)
* [Binary cross entropy and logistic regression](https://towardsdatascience.com/binary-cross-entropy-and-logistic-regression-bf7098e75559)