# Logistic Regression

***

## Description: You will learn about the much widely used Logistic Regression, why it is used for classification and its real world applications. 

***

## Overview
- Why not Linear Regression for classification
- Sigmoid function
- Cost function with gradient descent
- Evaluation metrics

***

## Pre-requisite
- Python
- Numpy
- Pandas
- Statistics
- Probability
- Differential Calculus
- Linear Regression

***

## Learning Objective
- Understand when to use Logistic Regression
- Usage of odds, odds ratio and sigmoid function
- Binary and multi-class classification
- Different evaluation metrics for classification tasks

## Chapter 1: Why not Linear Regression for classification?

***

## Description: Understand what is classification and why Linear Regression is not a good algorithm to handle such a problem.

### 1.1 What is classification?

***

Classification is a central topic in machine learning that has to do with teaching machines how to group together data by particular criteria. It is different from regression in the sense that target variables in classification are discrete in nature while in regression they are continuous. Remember that both regression and classification fall into the category of supervised learning approaches.

There is an also an unsupervised version of classification, called clustering where computers find shared characteristics by which to group data when categories are not specified (we will encounter it later, not in this concept).

**Examples**

<img src='../images/example.jpg' >

Other common examples of classification come with classifying loan defaulters, predicting patients as having diabetes or not etc. 

All these types of classification problems can be effectively solved by Logistic Regression. Even though it contains the word "regression", do not take it as a regression algorithm. It is a linear model for classification and is widely used both in academia and industry mainly due to its simplicity and interpretibility. 


**What is the difference between linear regression and logistic regression?**

- *Outcome*: This is the fundamental and possibly the most intuitive difference between both the algorithms. In linear regression, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values, for instance, weight, height, number of hours, etc. Whereas in logistic regression, the outcome (dependent variable) has only a limited number of possible values. For instance, yes/no, true/false, red/green/blue, 1st/2nd/3rd/4th, etc.


- *Linear regression output as probabilities*: Its tempting to use the linear regression output as probabilities but it's a mistake because the output can be negative, and greater than 1 whereas probability can not. As regression might actually produce probabilities that could be less than 0, or even bigger than 1, logistic regression was introduced.

- *Equation*: Linear regression has its own equation of the form: $y = \theta_0 + \theta_1x_1 + ...+ \theta_nx_n$. Logistic regression has a somewhat different equation as it interprets probability. You will learn more about it in the coming chapters where its equation is given by $ y = \frac{1}{1+e^{-(\theta_0 + \theta_1x_1 + ...+ \theta_nx_n)}}$



*NOTE*: **Although there is regression at the end of its name Logistic Regression is used for classification purposes. At the same time it is a form of linear model only since the logistic function is a linear combination of weights (you will see later on in this course).**

### 1.2 Introduction to the dataset

***

Throughout the entire concept we will be using the Banknote authentication dataset where you will be classifying bank notes as **Authentic** or **Fake** based on given information.  The data was extracted from images that were taken for the evaluation of an authentication procedure for banknotes. There are 5 attributes in total in this dataset and a brief description about them is given below:

1. `variance`: variance of Wavelet Transformed image (**continuous**) 
2. `skewness`: skewness of Wavelet Transformed image (**continuous**) 
3. `curtosis`: curtosis of Wavelet Transformed image (**continuous**) 
4. `entropy`: entropy of image (**continuous**) 
5. `class`: authentic($1$) or fake($0$) (**integer**) 


The datasets consists of four predictor variables (`variance`, `skewness`, `curtosis` and `entropy`) and a target variable, `class`.




Now lets load the dataset and look at the characteristics

## Load the dataset

Load the dataset and print out its shape and total number of missing values.

### Instructions
- Load the dataset with the argument `bank.csv` using pandas `.read_csv()` method
- Print out the shape using `.shape` method
- Print out the total number of missing values using the `.isnull().sum()` method of pandas.

In [1]:
# import packages
import pandas as pd
import numpy as np
file_path = '../data/data.csv'

# Code starts here

# read data
data = pd.read_csv(file_path)

# display shape
print(data.shape)

# count of missing values
print(data.isnull().sum())

# Code ends here

(1372, 5)
variance    0
skewness    0
curtosis    0
entropy     0
class       0
dtype: int64


## Hints
- Load data as `data = pd.read_csv(file_path)`
- Visualize shape as `data.shape`
- Count of missing values as `data.isnull().sum()`

## Test cases
- variable declaration of `data`
- Shape of `data`: `data.shape == (1372, 5)`

### 1.3 Solving with linear regression

***

In the last concept you learnt about Linear Regression. Lets try to solve this problem with Linear Regression first. 

At first, lets take only the feature `entropy` to predict the `class` from it. We will look at how the decision boundary looks like and how will the line behave with an useen data point. **In worst case scenario the unseen point can be an outlier. We want our model to generalize well and so, we will test for the worst case scenario.**

The code snippet is given below:
```python
# import packages
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline

# instantiate model
linear_model = LinearRegression()

# fit the model
linear_model.fit(df[['entropy']], df[['class']])

# generate 1000 X-values
X_sample = np.linspace(-9, 3, 1000)

# calculate y-values for 1000 X-values
Y_sample = X_sample*linear_model.intercept_ + linear_model.coef_[0]

# threshold for entropy
threshold  = (0.5 - linear_model.coef_[0]) / linear_model.intercept_
print(threshold)

# scatter plot 
plt.scatter(df[['entropy']], df[['class']], marker='x')

# axes specifications
plt.xlabel('Entropy')
plt.ylabel('Class {1:Authentic, 0:Fake}')
plt.ylim(-0.1, 1.1)

# threshold lines
plt.axvline(threshold, linestyle='--', color='green')
plt.axhline(0.5, linestyle='--', color='black')

# line plot
plt.plot(X_sample, Y_sample, color='red')

# display plot
plt.show()
```
We get the output as:

```python
Threshold value is: [0.47315246]
```
And the output image looks somewhat like this:

<img src='../images/1.png'>


**Context of the code snippet**
- Trained a linear regression model `linear_model` on `entropy` and `class`
- Used `linear_model` to predict on 1000 samples (`X_sample`)
- Black horizontal line is the `class` threshold (set to 0.5 in our setting)
- Green vertical line is the corrseponding threshold for `entropy`
- Red line is the decision boundary


Lets understand more on decision boundary now.

**Decision boundary**: It is a sort of demarcation or criteria based on which we segregate our data into different groups. In the image below the straight line is a decision boundary separating the two sets points colored red and blue. We can also have non-linear decision boundaries (you will learn in the next chapter about it) as well as higher dimensional ones.

<img src='../images/db.png'>

In our setting `class` attribute has values either $0$ or $1$. Since we have fit a line, we assume that for y-value above $0.5$, we consider it as `1` and `0` for values lower than $0.5$. The threshold value is the value for `entropy` at which the `class` changes; which according to our code is approximately 0.47.

### 1.4 Linear Regression for Classification? Not a good idea

***

You saw that the threshold value for `entropy` was around $0.47$. Upon addition of a new data point we want the threshold values for `entropy` to change as little as possible so that it isn't affected by outliers. Lets add an outlier point and observe for ourselves whether or not the decison boundary and threshold changes.

The code snippet is given for you:

```python
# instantiate linear model
lm = LinearRegression()

# add outlier pointd to 'entropy' and 'class'
x = np.append(df.entropy.values, [35]).reshape(-1,1)
y = np.append(df['class'], [1]).reshape(-1,1)

# fit on new 'entropy' and 'class'
lm.fit(x, y)

# scatter plot 
plt.scatter(x, y, marker='x')

# axes modification
plt.xlabel('Entropy')
plt.ylabel('Class {1:Authentic, 0:Fake}')
plt.ylim(-0.1, 1.1)

# new threshold
new_threshold = (0.5 - lm.coef_[0]) / lm.intercept_

# threshold lines
plt.axvline(new_threshold, linestyle='--', color='green')
plt.axhline(0.5, linestyle='--', color='black')
plt.axvline(threshold, linestyle='--', color='yellow')

# line plots
new_Y_sample = X_sample*lm.intercept_ + lm.coef_[0]
plt.plot(X_sample, new_Y_sample, color='red')
plt.xlim(-8, 3)
plt.show()
```

The image looks somewhat like this: <img src='../images/2.png'>

**From the image it is pretty clear that the threshold value changes as evident from the line changing its color from yellow to green**. As the threshold changes, so does the predictions and so linear regression is not a reliable model for this type of tasks. 

## Quiz

1. Is Logistic Regression used for regression?

    a. TRUE

    b. FALSE
 
**Ans:** b. FALSE, don't get mislead by the use of the term "regression".


2. Using Linear regression for classification tasks results in the line being sensitive to outliers?

    a. TRUE

    b. FALSE
 
**Ans:** a. TRUE, see the example above as well as the video if you got this one incorrect.

## Chapter 2: Nuts and bolts of Logistic Regression

***

## Description: In this chapter you will learn about sigmoid function, odds ratio and cost function for logistic regression.

### 2.1 Sigmoid function

***

As you have seen at the end of the previous chapter, Linear Regression is not suitable for classification tasks mainly due to a couple of reasons:
- Upon addition of outliers, the best-fit line changes which in turn changes the threshold for the decision boundary. 
- Linear combination of features $\theta_0 + \sum_{i=1}^{n}\theta_ix_i$ spans from $-\infty \text{ to } \infty$. But for a classification (binary) one you can have only two possible values (for example: $0 \text{ and } 1$)

*Hence, linear regression is a very unstable process for classification tasks*.  


**Cure for classification problems** 

We can overcome it with the help of the **`sigmoid`** function, also known as the S-curve. It looks somewhat like this: <img src='../images/sigmoid.png'> 

In the figure we consider negative labels as having the value $0$ while the positive ones as being $1$s. 

***`Mathematical Form and Interpretation`***         

$ \sigma(z) = \frac {1} {(1 + e^(-z))}$ where $z = \theta_0 + \theta_1x + ..... + \theta_nx$

Probabilistically speaking, we have:

$ \begin{align}
P(y=1|x) &= h_\theta(x) = \frac{1}{1 + \exp(-\theta^\top x)} \equiv \sigma(\theta^\top x),\\
P(y=0|x) &= 1 - P(y=1|x) = 1 - h_\theta(x).
\end{align} $

**The sigmoid gives the probability of predicted output being positive against some pre-defined threshold**. For example: If we predict less than $0.5$ we call it negative otherwise we term it positive. Notice the term $z$ which is the same as the hypothesis of linear regression. 

*Maximum value of sigmoid:* $1$ for $z = \infty$ 

*Minimum value of sigmoid:* $0$ for $z = -\infty$. 

Thus, sigmoid is perfectly suitable for our binary classification task. 

## Make your own sigmoid function

In this task you will be defining a sigmoid function which takes in a single argument and returns the sigmoid output as given in the slide.

### Instructions
- Define a function `sigmoid` that takes in a single variable x and returns the sigmoid transformation (Use `np.exp()` to represent exponent)
- Calculate the `sigmoid` value for $0$ and save it as `result`

In [2]:
# Code starts here

def sigmoid(x):
    return 1/(1+(np.exp(-x)))

result = sigmoid(0)
print(result)

# Code ends here

0.5


## Hints
- Function `sigmoid` must return $\frac{1}{1 + e^{-x}}$

## Test cases
- Variable declaration for `result`
- Function declaration of `sigmoid`
- Value of result: `result == 0.5`

### 2.2 Odds ratio 

***

So by now you know that the sigmoid function gives us the probability of an instance belong to either of the classes. But ever wondered how we arrived at this step? Well, something called the odds-ratio helped us in arriving at the final form $g(z) = \frac {1} {(1 + e^(-z))}$. So let us understand more about odds ratio now.


**What is odds ratio?**

We will answer this question with the help of an example. Let the probability of success of an event is p ( $0 <= p <= 1$ ). So, the probability of event failure is 1 - p. The ratio of probability of success to probability of failure is called the odds ratio. Mathematically, it is equivalent to $ \frac{p}{1-p} $. If some event has odds of 4, then it means that the chances of success are 4 times more likely than those of failure. 

An interesting property of odds is that it is a monotonically increasing function. Odds increase as the probability increases or vice versa. 

<img src='../images/Odd.png'>


**Log odds**

The transformation from odds to log of odds is the log transformation.  This is also  a monotonic transformation i.e. greater the odds, greater the log of odds and vice versa.  

<img src='../images/log.png'>


**Why take log odds?**

Its difficult to model a variable with probability since it has restricted range between 0 and 1. The log of the odds also called logits transformation is an attempt to get around the restricted range problem. It maps probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. 


A logistic regression model allows us to establish a relationship between a binary outcome variable and a group of predictor variables.  It models the logit-transformed probability as a linear relationship with the predictor variables. Mathematically,

$\text{logit(p)} = log(\frac{p}{1-p}) = \theta_0 + \theta_1x_1 + \theta_2x_2 + .... + \theta_nx_n$


**Observe how the right hand side of the equation looks similar to the linear regression counterpart.**

> $ p = \frac{e^{\theta_0 + \theta_1x_1 + \theta_2x_2 + .... + \theta_nx_n}}{1 + e^{\theta_0 + \theta_1x_1 + \theta_2x_2 + .... + \theta_nx_n}}$

> $ p = \frac{1}{1 + e^-({\theta_0 + \theta_1x_1 + \theta_2x_2 + .... + \theta_nx_n})} $ which is the sigmoid function

### 2.3 Decision boundary for sigmoid function

***

The mathematical condition for decision boundary in case of sigmoid will occur when the probability is $0.5$.

i.e. $y = 0.5$
$$ => 1 + e^{-z} = 0.5 $$
$$ => e^{-z} = 1 $$
$$ =>z = 0 $$
$$ => \theta^TX = 0 $$

Now, $\theta^TX = \theta_0 + \theta_1x_1 + .... + \theta_nx_n$. 

**So, this condition i.e. $\theta_0 + \theta_1x_1 + .... + \theta_nx_n = 0$ is the equation for the decision boundary of sigmoid function.** 

By adding polynomial terms to the left hand side of the above equation we can also get a non-linear decision boundary. Lets say you have the function $$h_θ(x) = g(θ_0 + θ_1x_1+ θ_2x_2 + θ_3x_1^2 + θ_4x_2^2)$$.

Decision boundary for this equation is $$θ_0 + θ_1x_1+ θ_2x_2 + θ_3x_1^2 + θ_4x_2^2 = 0$$

Say θ was $[-1,0,0,1,1]$ then we have; 

Predict that $y = 1$ if
 - $-1 + x_1^2 + x_2^2 >= 0$ or
 - $x_1^2 + x_2^2 >= 1$
 
Else predict $y = 0$

The decision boundary for the above example is a circle of radius $1$ and is shown in the image below.

<img src='../images/non_linear_db.png'>


**Stability of sigmoid functions decision boundary**

The main advantage of the sigmoid function is that always buckets values in the range of [0,1] and it does not get affected by outliers at all. Take a look at the code snippet below and the images that it produces:

```python
# import packages
from sklearn.linear_model import LogisticRegression

logreg1 = LogisticRegression()
logreg2 = LogisticRegression()

# Outlier points added
x = np.append(df.entropy.values, [35]).reshape(-1,1)
y = np.append(df['class'], [1]).reshape(-1,1)

# fit model
logreg1.fit(df[['entropy']], df[['class']])
logreg2.fit(x, y)

# initialize figures
fig, (ax_1, ax_2) = plt.subplots(1, 2, figsize=(10,5))

# scatter plot
ax_1.scatter(df[['entropy']], df[['class']], marker='x')
ax_2.scatter(x,y, marker='x')

# axes modifications
ax_1.set_title('Without outlier')
ax_2.set_title('With outlier')
ax_1.set_xlabel('Entropy')
ax_1.set_ylabel('Class {1:Authentic, 0:Fake}')
ax_2.set_xlabel('Entropy')
ax_2.set_ylabel('Class {1:Authentic, 0:Fake}')

# predictions
old_pred = logreg1.predict(X_sample.reshape(-1,1))
new_pred = logreg2.predict(X_sample.reshape(-1,1))

# line plots showing decision boundary for sigmoid
ax_1.plot(X_sample.reshape(-1,1), old_pred, color='red')
ax_2.plot(X_sample.reshape(-1,1), new_pred, color='red')
ax_2.set_xlim(-9, 3)

# display plot
plt.show()
```

<img src='../images/sig.png'>

The decision boundary is robust enough to deal with outliers.

### 2.4 Multiclass Classification

***

So far we have discussed binary classification only. But what if we have more than two target classes? How to approach such problems? For this kind of situations we will be covering two methods; namely One-vs-All (One-vs-Rest) method and the Softmax method. Lets discuss them in details.

1. *__One-vs-all Method__*

   Consider the situation where you have K classes ( $ K > 2 $ ). You will create classifiers equal to the number of classes and use each of them to predict one class as 1 and other classes as 0s. Then you combine all the classifiers and obtain a single new classifier which assigns probabilities to instances belonging to every class. On encountering a new instance, it will output probability of that instance belonging to every class and in general we select it as the highest of the predicted probabilities. But like all other methods, this approach is also not bullet-proof. Particularly in problems with class imbalance it doesn't have the desired effect.
   
   For example you have a situation at hand to classify shapes as triangles, squares and crosses. So for this classification problem you will make 3 logistic classifiers. The first classifier will classify triangles as 1s and the other classes as 0s. Similarly the second classifier will classify squares as 1s and other classes as 0s and the third one will classify crosses as 1s and other classes as 0s. We combine these three classifiers and obtain a single classifer which can now classify shapes into any of the three classes. 
   
   <img src='../images/onevsall.png'> 
   
   You can go through scikit-learn's official [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) of One-vs-rest classifier.


2. *__Softmax Function__*

    Softmax is an extension of sigmoid function generalized for K classes. Consider you have m training examples $(x^{1}, y^{1}), (x^{2}, y^{2}), ..., (x^{n}, y^{n})$ and the class labels as $y^{(i)} \in \{1, 2, \ldots, K\}$. 
    
    Now, given a new input $x$ we want to predict the probability that it belongs each of the classes i.e. $P(y=k | x)$. The hypothesis is given by the equation:
    
   $ \begin{align}
h_\theta(x) =
\begin{bmatrix}
P(y = 1 | x; \theta) \\
P(y = 2 | x; \theta) \\
\vdots \\
P(y = K | x; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{K}{\exp(\theta^{(j)\top} x) }}
\begin{bmatrix}
e^{\theta^{1}\top} x  \\
e^{\theta^{2}\top} x  \\
\vdots \\
e^{\theta^{K}\top} x ) \\
\end{bmatrix}
\end{align} $

    The hypothesis outputs a $ K $ dimensional vector whose elements sum to 1 giving us our $ K $ estimated probabilities. Here, $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)} \in \Re^{n}$ are the model parameters. 
    
    Every $\theta^{(i)}$ is again made up of $ n $ parameters and so it is convenient $ \theta $ as a $ n\text{x}K $ matrix where $\theta = \left[\begin{array}{cccc}| & | & | & | \\
\theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \\
| & | & | & |
\end{array}\right].$

## Quiz

1. In logistic regression, what do we estimate for one each unit’s change in X?

    a. Change in Y multiplied with Y
    
    b. Change in Y from mean
    
    c. How much Y changes
    
    d. How much the natural logarithm of the odds for Y = 1 changes

    **Ans:** d. It represents for each unit change in X how much the natural logarithm of the odds for Y = 1 changes


2. A total predicted logit of 0 can be transformed to a probability of?

    a. 1
    
    b. 0
    
    c. 0.5
    
    d. 0.25
    
    **Ans:** c. $log\frac{p}{1-p} = 0$. Solving it we have $p = 0.5$

3. Which of the following option is true?

    a. Linear Regression errors values has to be normally distributed but in case of Logistic Regression it is not the case

    b. Logistic Regression errors values has to be normally distributed but in case of Linear Regression it is not the case

    **Ans:** a. Linear Regression errors values has to be normally distributed but in case of Logistic Regression it is not the case

## Chapter 3: Cost Function

***

## Description: In this chapter you will learn about the cost function for logistic regression and why is it so. 

### 3.1 Cost function for Logistic Regression

***

From the previous tutorial on linear regression, we know that the cost function ($J(\theta$)) for $m$ training examples with hypothesis $h_\theta(x_i)$ and actual target $y_i$ is 

$$ J(\theta) =   \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}(h_\theta(x_i) - y_i)^2$$

Now for linear regression, this is a convex cost function i.e. you are guaranteed to arrive at the global minimum cost. However, the same cannot be said for sigmoid function and applying the same for logistic regression where our hypothesis is $h_\theta(x) = \frac{1}{1 + e{-\theta^TX}}$.  It is  a non-convex function and chances are we might get stuck in some local minima while optimizing our solution. 

It is depicted in the image below where we can see multiple local minimas for a non-convex function. The image is taken from the massively popular lecture notes from Dr. Andrew NG.

<img src='../images/convex.png'>


**Maximum Likelihood Estimation**

In logistic regression instead of minimizing the least squares error, we try to maximize the likelihood. Likelihood function determines how likely is the observation according to the model. It determines values for the model parameters such that they maximise the chance that the process described by the model produced the data that were actually observed.

Lets assume our hypothesis to be $ h_\theta (x) $ with data points $ (x_i, y_i) $ where $ 0 <= h_\theta (x) <= 1 $. The probability of observing positive class is given by $ h_\theta (x) $ and that of observing negative class is given by $ 1 - h_\theta (x) $. 

Assuming the underlying distribution as Bernoulli, the objective function becomes 

$$ \text{Likelihood (L)} = \prod^{m}_{i=1} h_\theta (x_i)^{y_i} (1 - h_\theta (x_i))^{1-y_i}$$

Notice that we take products and not the sum while calculating the likelihood as total likelihood is the product of the likelihood of every observation. For convenience, it is easy to deal with the log-form of likelihood. Taking the natural logarithm of the above equation gives us:

$$ \ln(\text{L}) = \sum^{m}_{i=1}[y_i \ln h_\theta(x_i) + (1 - y_i)\ln(1 - h_\theta(x_i))]$$ 

Maximizing L is the same as minimizing -L and by taking the average over the entire set of m data points we obtain the new cost function as:

$$ J(\theta)  = -\frac{1}{m}\sum_{i=1}^{m}[y_i\ln h_\theta(x_i) + (1-y_i) \ln (1-h_\theta(x_i) )]$$

This term is our new cost function. Maximizing the log likelihood will give us an optimal solution as it will result in the maximum likelihood (L) and give us the best parameters. 


**Gradient Descent to find best parameters**

We have the sigmoid function $\sigma(x) = \frac{1}{1 + e^{\theta^T X}}$ 

On calculating the derivative w.r.t. $ \theta $ we obtain the best parameters for a single sample:

$\frac{\partial}{\partial \theta_j} L(\theta) = -(y \frac{1}{\sigma(\theta^TX)} - (1-y)  \frac{1}{1 - \sigma(\theta^TX)})  \frac{\partial}{\partial \theta_j} \sigma(\theta^TX) $

$\frac{\partial}{\partial \theta_j} L(\theta) = -(y \frac{1}{\sigma(\theta^TX)} - (1-y)  \frac{1}{1 - \sigma(\theta^TX)})  \sigma(\theta^TX)(1 - \sigma(\theta^TX)) \frac{\partial}{\partial \theta_j} \theta^TX $

$\frac{\partial}{\partial \theta_j} L(\theta) = -(y(1-\sigma(\theta^TX)) - (1-y)\sigma(\theta^TX))x_j$

$\frac{\partial}{\partial \theta_j} L(\theta) = (\sigma(\theta^TX)-y)x_j $

Adding up over all the samples we obtain:

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i = 1}^m (h_\theta(x_i)-y_i)x_j$

After that we can update the weights ( $\theta$s ) as per : $\theta = \theta + \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

### 3.2 Intuition behind the Cost Function

***

Now lets get into the intuition behind the cost function $J(\theta)  = -\frac{1}{m}\sum_{i=1}^{m}[y_i\ln h_\theta(x_i) + (1-y_i) \ln (1-h_\theta(x_i) )]$


We will be considering two scenarios, one when the actual target is **1** and other when it is **0**. The intuition is explained with the help of the image below where the X-axis represents the predicted values and Y-axis represents the cost $J(\theta$).

**Case I: $y = 1$** 

The cost function becomes  $ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \ln h_\theta(x_i) $, since $ (1 - y_i) = 0$.

Its graphical representation is given by the red line in the image below. Carefully observe that if we predict probabilities close to $0$, the cost is almost infinity; but as our prediction reaches $1$, it approaches $0$. 
Thus, it highly penalizes incorrect predictions.


**Case II: $y = 0$**

Now, this cost function becomes $ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}(1 - y_i) \ln (1 - h_\theta(x_i))$.

This is represented by the black line in the image below. For wrong predictions, it penalizes much more and for closer ones, it penalizes very less. 

This is the intuition behind cost function for logistic regression. After that the weights are being updated in the same manner as linear regression; the only difference being the cost function $J (\theta)$.

**With regularization, there is just an added term in cost function and accordingly the weights will get updated so that it doesn't overfit.**


<img src='../images/loss.png'>

### 3.3 Model building using scikit-learn

***

Till now, you have learnt about classification, sigmoid function and the modifications in the cost function for logistic regression. Now, you will perform necessary preprocessing steps and build a model on that data using **scikit-learn**.

**Remember these steps as well as the sequence is not universal, there are various other things to check for.** 


**Step 1:Split into train and test sets** 

Split the data into train and test sets with 20% data in the test set


**Step 2:Standardize data**

Although it is not a mandatory step, since we will be finding co-efficients using gradient descent, it is recommended to use normalization so that it converges faster


**Possible other measures:**

- **Check for multicollinearity** 
Logistic regression requires there to be little or no multicollinearity among the independent variables.  This means that the independent variables should not be too highly correlated with each other. If there exists some collinearity, avoid using them together.

- **Check for missing values and treat them**


In this topic we will be only performing only train-test split and standardize data with zero mean and unit variance.

After that you can fit the model on training data and use that to predict on unseen or test data. 

## Preprocess data

In this task you are going to split the data into train and test and then standardize the data

### Instructions
- All the packages have been imported for you
- Use `.train_test_split()` to split the dataset into train and test datasets with $20$% test data and `random_state=42`. Name them as `X_train`, `X_test`, `y_train` and `y_test`
- Initialize a scaler named `scaler` with `MinMaxScaler`. You can read more about it at [Documentation](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)
- Fit this scaler on the train data using `.fit(X_train)` and then transform both the train and test features with `.transform()` method
- Instantiate a Logistic Regression model `model` and fit it on the transformed training features and training target i.e. `X_train` and `y_train`
- Then make predictions on the transformed test features i.e. `X_test` and store it as `pred`

In [6]:
# import packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Code starts here

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,:4], data['class'], test_size=0.2, random_state=42)

# Standardize data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# make predictions
pred = model.predict(X_test)
print(pred)

# Code ends here

[0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0
 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0
 1 1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1
 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1
 1 0 0 1 1 0 1 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1
 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1
 1 0 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
 1 0 0 0 1 1 0 1 0 1 1 1 1 0 0 0]




## Test cases
- Variable declaration for `X_train`, `X_test`, `y_train`, `y_test`, `scaler`, `model` and `pred`
- Length of `pred`: `len(pred) == 275`
- Sum of values in `pred`: `sum(pred) == 117`

## Hints
- Do train-test split as `X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,:4], data['class'], test_size=0.2, random_state=42)`
- Fit `scaler` on `X_train` as `scaler.fit(X_train)` and then transform as `X_train = scaler.transform(X_train)` and `X_test = scaler.transform(X_test)`
- Fit `model` on training data as `model.fit(X_train, y_train)` and predict as `model.predict(X_test)`

## Chapter 4: Evaluation Metrics for Classification

***

## Description: In this chapter you will learn more about the different metrics for classification like accuracy score, precision, recall, f-1 score, AUC score etc.

### 4.1 Confusion matrix

***

Also called the error matrix, it is a table describing the performance of a supervised machine learning model on the testing data, where the true values are unknown. Each **row of the matrix** represents the instances in a **predicted class** while **each column** represents the instances in an **actual class** (and vice versa). 

Let us understand it with the help of an example. 


![confusion](https://storage.googleapis.com/ga-commit-live-stag-uat-data/account/b92/11111111-1111-1111-1111-000000000000/b376/03808ceb-2a3a-4828-9f7d-bdbe7a00c6a1/file.jpg)


Before going through the calculations, lets understand some terms:
- **True Positives (TP):** Actually `positive` and predicted `positive` (*CORRECT PREDICTIONS*)
- **True Negatives (TN):** Actually `negative` and predicted `negative` (*CORRECT PREDICTIONS*)
- **False Positives (FP):** Actually `negative` but predicted `positive` (*INCORRECT PREDICTIONS*)
- **False Negatives (FN):** Actually `positive` but predicted `negative` (*INCORRECT PREDICTIONS*)

In the above example with the binary outcome `Cat`($1$) or `Non-cat`($0$):
- Actual number of **cats** = 8
- Actual number of **non-cats** = 19

Now, from the predictions based on model,
- Predicted **cats** = $5 + 2 = 7$
- Predicted **non-cats** = $3 + 17 = 20$

So, 

- **TP** = 5 i.e. actual **cats** and also predicted **cats**.

- **FP** = 2 i.e. actual **non-cats** but predicted **cats**.

- **FN** = 3 i.e. actual **cats** but predicted **non-cats**.

- **TN** = 17 i.e. actual **non-cats** and predicted **non-cats**.

`Confusion matrix represents confusion on unseen data.`




### 4.2 Accuracy score, Precision, Recall and F score

***

#### Accuracy score 

It is the fraction of correct predictions to the total number of predictions on unseen data. Mathematically,
$$ Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$

Remember in our earlier example, **TP** = $5$, **FP** = $2$, **FN** = $3$, **TN** = $17$

So, accuracy score = $\frac{5 + 17}{5 + 2 + 3 + 17}$ = $\frac{22}{27}$ = $0.8148$

**When to use accuracy?**

It is a relatively good metric where we have symmetric distribution of the targets. but cannot handle cases where predictions for one of the instances is very less. For example: In case of cancer detection, even if we sample data we will encounter very less cases of people having cancer ($1$). In such a case model will learn more about $0$s i.e. non-cancerous and will output a good accuracy score even though it has missed out some actual cancer cases or all of them!

To resolve cases with unbalanced targets, precision, recall and f-1 score come in handy. You can use them according to the business requirement.

#### Precision

For every predicted class, it is the fraction of the correct predictions to the total number of predictions for that class. It answers the question **Of all the values predicted as belonging to the class "X", what percentage is correct?** Mathematically,

$$ Precision(P) = \frac{TP}{TP + FP}$$

The precision in our example would be = $\frac{5}{5 + 2} = \frac{5}{7} = 0.7143 $

**When to use precision?**

Precision is a good measure to determine, when the costs of **False Positive is high**. For instance, email spam detection. In email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted spam). The email user might lose important emails if the precision is not high for the spam detection model.


#### Recall 

For every actual class, it is the fraction of the number of correct predictions to the total number of actual instances of the class. It answers the question **Of all the instances of the class "X", what percentage did we predict correctly?** Mathematically,        

$$ Recall(R) = \frac{TP}{TP + FN}$$

The recall in our previous example is = $\frac{5}{5 + 3} = \frac{5}{8} = 0.625$

**When to use recall?**

Recall shall be the model metric we use to select our best model when there is a **high cost associated with False Negative**. For instance, in fraud detection or sick patient detection; if a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequence can be very bad for the bank.


#### F score 

It is the harmonic mean of the precision and recall for a classifier. Mathematically, 

$$ F score = \frac{2PR}{P + R}$$

Now, the F-score is = $\frac{2*0.7143*0.625}{0.7143 + 0.625} = \frac{0.892875}{1.3393} = 0.67$

**When to use F score?**

If you want to achieve a balance between precision and recall, use F-1 score. But unfortunately, the F-score isn’t the holy grail and has its tradeoffs. It favors classifiers that have similar precision and recall. This is a problem because you sometimes want a high precision and sometimes a high recall. The thing is that an increasing precision results in a decreasing recall and vice versa. This is called the precision/recall tradeoff. 

## Find accuracy, precision, recall, f-score

In this task you will calculate the accuracy score, precision, recall and f-score and also visualzise the confusion matrix to see how your classifier is performing. 

### Instructions
- Use the `predictions` array and actual target labels `y_test` to calculate the confusion matrix accuracy, precision, recall, f-score and save them as `cf`, `acc`, `precision`, `recall` and `f_score` respectively
- Go through official documentations for [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score), [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix), [F-1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score), [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score), [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) to look at the syntax

In [7]:
# Code starts here

# import packages
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# confusion matrix
cf = confusion_matrix(y_test, pred)
print(cf)
print('='*50)

# accuracy
acc = accuracy_score(y_test, pred)
print(acc)
print('='*50)

# precision
precision = precision_score(y_test, pred)
print(precision)
print('='*50)

# recall
recall = recall_score(y_test, pred)
print(recall)
print('='*50)

# F-score
f_score = f1_score(y_test, pred)
print(f_score)
print('='*50)

# Code ends here

[[144   4]
 [ 14 113]]
0.9345454545454546
0.9658119658119658
0.889763779527559
0.9262295081967213


## Hints
- Calculate `precision` as `precision_score(y_test, pred)`.Similarly repeat for others

## Test cases
- Variable declaration for `cf`, `precision`, `recall`, `acc` and `f_score`
- Value for `cf`: `(cf == [[144, 4], [14, 113]]).sum() == 4`
- Value for `acc`: `round(acc, 2) == 0.93`
- Value for `precision`: `round(precision, 2) == 0.97`
- Value for `recall`: `round(recall, 2) == 0.89`
- Value for `f_score`: `round(f_score, 2) == 0.93`

### 4.3 ROC-AUC score

***

ROC stands for Receiver Operator Characteristic, is a curve that helps us visualize the performance of a binary classifier. The Area under curve (AUC) of the ROC curve indicates the ability of the binary classifier to distinguish between both the classes. It is calculated using all possible threshold probabilities unlike other metrics that use a fixed threshold.

Now, let us understand this using the actual ROC diagram which is a graphical plot of **TPR** on the Y-axis and **FPR** on the X-axis for various threshold settings. The **TPR** is nothing but the recall term that we had already discussed whereas **FPR** is the term which which indicates how much the classifier incorrectly predicts a negative instance as positive.

**Just for your information**

**TPR** $=\frac{TP}{TP + FN}$

**FPR** $=\frac{FP}{FP + FN}$ 
 

![roc_auc](https://storage.googleapis.com/ga-commit-live-stag-uat-data/account/b92/11111111-1111-1111-1111-000000000000/b929/877bd87c-e9f3-426d-9c71-6c635ce844eb/file.png)


The top left image shows the two distributions of negatives (left side gaussian) and positives (right side gaussian) and the corresponding ROC curve below. The threshold is the vertical line in the top image. By moving the line from left to right, we obtain different thresholds, calculate the respective **TPR**s and **FPR**s and plot it on the ROC curves. **The AUC lies between 0 and 1.**

Evaluate your classifier based on the AUC score according to:

- .90-1 = excellent classifier
- .80-.90 = good classifier
- .70-.80 = fair classifier
- .60-.70 = poor classifier
- .50-.60 = fail classifier

***

Now let us understand why a low AUC score represents a bad classifier and vice-versa.


<img src='../images/img.png'>


In the first image, the classifier has an AUC score of $0.7$ whereas the lower one has $0.5$. Now, a higher AUC has classes which have a higher separation as evident from the image in the top left. More separation means for the same **FPR** the **TPR** is higher and vice-versa. An ideal classifier separates both the classes perfectly and as a result the **TPR** is $1$ and **FPR** is $0$. The worst classifer has an AUC score of $0.5$ which happends when it is unable to make distinction between the classes.

**Advantages of using AUC score**

- It is useful even if predicted probabilities are not properly calibrated. For instance our probabilities may lie in the range ($0.8, 0.95$). This would have no effect on the AUC score.
- Useful metric even if the classes are imbalanced.

***

### Example

In our previous (cat vs non-cat) example, we had **TP** = $5$, **FP** = $2$, **FN** = $3$ and **TN** = $17$. Here, **TPR** = $0.625$ and **FPR** = $0.12$ for a threshold of $0.5$. Lets say that we have the following data for the different thresholds and their corresponding **FPR**s and **TPR**s

| Threshold | FPR | TPR |
| --- | --- | --- | 
| 0.06 | 0.05 | 0.15 |
| 0.1 | 0.1 | 0.54 |
| 0.3 | 0.45 | 0.7 |
| 0.45 | 0.5 | 0.8 |
| 0.6 | 0.54 | 0.85 |
| 0.85 | 0.78 | 0.92 |
| 0.95 | 0.9 | 0.93 |

Plotting them in a X-Y plot we get the following graph:

![img](https://storage.googleapis.com/ga-commit-live-stag-uat-data/account/b92/11111111-1111-1111-1111-000000000000/b675/bf5673e5-52a4-43a7-8c30-6721fc537584/file.png)


The blue shaded region represents the **AUC** with **FPR** and **TPR** on the X-axis and Y-axis respectively.

## Calculate AUC

In this task calculate the `AUC` score using scikit-learn's function

### Instructions
- Calculate the ROC AUC score using scikit learn and save it as `roc`. Go through its official [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)

In [8]:
# import packages
from sklearn.metrics import roc_auc_score

# Code starts here
roc = roc_auc_score(y_test, pred)
print(roc)

# Code ends here

0.9313683762502661


## Hints
- Syntax is `roc_auc_score(y_test, pred)`

## Test cases
- variable declaration for `roc`
- Value for `roc`: `round(roc, 2) == 0.93`

## Concept End Quiz

1. Which of the following evaluation metrics can not be applied in case of logistic regression output to compare with target?

    a. AUC-ROC
    
    b. Accuracy
    
    c. Log-loss
    
    d. RMSE
    
**Ans:** d. RMSE; it is a metric for regression problems not classification

## Concept end quiz

1. Logistic regression is a linear model.

    a. TRUE
    
    b. FALSE
    
**ANS** a. TRUE

2. Which of the following metrics can be used for classification tasks?

    a. MSE
    
    b. None of these
    
    c. F-1 score
    
    d. RMSE
    
**ANS** c. F-1 score

3. Is sigmoid a linear function?

    a. YES
    
    b. NO
    
**ANS** b. NO

4. Given enough time, gradient descent is guaranteed to find the optimal global solution in logistic regression. 

    a. TRUE
    
    b. FALSE
    
**ANS**: a. TRUE

5. Standardisation of features is required before training a Logistic Regression.
    
    a. TRUE
    
    b. FALSE
    
**ANS** b. FALSE

6. What is the range of the logit (logarithm of odd ratio) function?

    a. (0,1)
    
    b. ($-\infty$, $\infty$)

7. Odds ratio is 1. What is the probability?

    a. 0
    
    b. 0.5
    
    c. 1
    
    d. 0.75
    
**ANS** b. 0.5

8. ROC curve represents ability of classifier to separate classes

    a. TRUE
    
    b. FALSE
    
**ANS** a. TRUE

9. One-vs-rest classification can deal with imbalanced classes. 

    a. TRUE
    
    b. FALSE
    
**ANS** b. FALSE

10. Recall represents

    a. Of all instances belonging to a certain class how much we did correctly
    
    b. Of all instances being predicted to a certain class how much we did correctly
    
**ANS** a. Of all instances belonging to a certain class how much we did correctly