# Table of Contents:

## [1. Definition](#1)
## [2. Hypothesis](#2)
## [3. Cost Function](#3)
## [4. Minimizing the cost function](#4)
## [5. Learning the logistic regression model](#5)
## [6. Why can’t we use the cost function for linear regression(MSE) in logistic regression?](#6)
## [7. Is logistic regression a linear classifier?](#7)
## [8. Non Linear Decision Boundary](#8)
## [9. Assumptions of logistic regression](#9)
## [10. Hyperparameter Optimization](#10)
## [11. Solutions to Overfitting](#11)
## [12. Solutions to Underfitting](#12)
## [13. Pros vs Cons](#13)

## Definition <a class="anchor" id="1"></a>

Logistic Regression (also called Logit Regression) is commonly
used to estimate the probability that an instance belongs to a particular class
(e.g., what is the probability that this email is spam?). If the estimated probability is
greater than 50%, then the model predicts that the instance belongs to that class
(called the positive class, labeled “1”), or else it predicts that it does not (i.e., it
belongs to the negative class, labeled “0”). This makes it a binary classifier.

## Hypothesis <a class="anchor" id="2"></a>

So how does it work? Just like a Linear Regression model, a Logistic Regression
model computes a weighted sum of the input features (plus a bias term), but instead
of outputting the result directly like the Linear Regression model does, it outputs the
logistic of this result:

<img src="https://miro.medium.com/max/4640/1*dm6ZaX5fuSmuVvM4Ds-vcg.jpeg" width=500>

<img src="https://www.crayondata.com/blog/wp-content/uploads/2018/06/equation1.jpg">

### Sigmoid Function

Linear regression gives us unbounded continuous values. In order to map these values to probabilities, we pass them through the sigmoid function, which maps any real value to a value between 0 and 1.

The logistic function, noted σ(·), is a sigmoid (i.e., S-shaped) function
that outputs a number between 0 and 1. Its inverse is the logit (log-odds) function, so the two names are not interchangeable.

<img src="https://miro.medium.com/max/945/1*RqXFpiNGwdiKBWyLJc_E7g.png" width=500> 

As ‘Z’ goes to infinity, Y(predicted) approaches 1, and as ‘Z’ goes to negative infinity, Y(predicted) approaches 0.
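The limiting behaviour described above can be checked with a minimal sketch of the sigmoid in plain Python. The two-branch form is a common way to avoid overflow for large negative inputs. It is an implementation choice made here, not something the sources above prescribe.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent form
    # e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The output stays between 0 and 1 and is exactly 0.5 at z = 0.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  sigmoid(z) = {sigmoid(z):.5f}")
```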

Two special properties of the sigmoid:
1.	Its output lies strictly between 0 and 1, so it can be interpreted as a probability.
2.	Its derivative is very simple: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

![image.png](attachment:image.png)

### Derivative of the sigmoid function:

https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x
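For quick reference, the derivative can be worked out in a few lines. This is a sketch using only the definition $\sigma(z) = \frac{1}{1 + e^{-z}}$ and the chain rule:

$$
\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^{2}}
= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
= \sigma(z)\,\bigl(1 - \sigma(z)\bigr),
\qquad \text{since } 1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}.
$$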


### Prediction

Logistic Regression model prediction:

$$
y = \begin{cases}
0 & \text{if } h_{\theta}(x) < 0.5 \\
1 & \text{if } h_{\theta}(x) \geq 0.5
\end{cases}
$$
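The prediction rule can be sketched in a few lines of Python. The parameter values in `theta` and the helper names `predict_proba` and `predict` are hypothetical, chosen only for illustration. The sketch assumes `theta[0]` is the bias term and the remaining entries are the feature weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x): sigmoid of the weighted sum of the features plus the bias."""
    z = theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))
    return sigmoid(z)


def predict(theta, x, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0


# Hypothetical parameters: bias = -1.0, weights = (2.0, 0.5).
theta = [-1.0, 2.0, 0.5]

print(predict_proba(theta, [1.0, 0.0]), predict(theta, [1.0, 0.0]))
print(predict_proba(theta, [0.0, 1.0]), predict(theta, [0.0, 1.0]))
```

Because the sigmoid equals 0.5 exactly when its input is 0, thresholding the probability at 0.5 is the same as checking the sign of the weighted sum.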

## Cost Function <a class="anchor" id="3"></a>

### Cost Function for a single instance

Source: https://youtu.be/HIQlmHxI6-0

![image.png](attachment:image.png)

![image.png](attachment:image.png)

When y=1 (first graph), the cost is 0 if the predicted value h(x) is 1, but it tends towards infinity as the predicted value approaches 0. The same logic applies, mirrored, to the second graph (y=0): the cost is 0 if h(x) is 0 and tends towards infinity as h(x) approaches 1.

The two pictures above are summarized in the one below:
<img src="https://miro.medium.com/max/1050/1*6oBgYMy4wOls9zGC-frSQg.png">

Combining both equations, we get the convex log loss function shown below:
<img src="https://miro.medium.com/max/1050/1*4pQ9NkAt8m-qdCCZF0Vs1Q.png">
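As a rough illustration, the per-instance cost and the averaged log loss can be written in plain Python. The clipping constant `eps` is an assumption added here to avoid taking `log(0)`; it is not part of the formula itself.

```python
import math


def log_loss_single(y, p, eps=1e-15):
    """Cost for one instance: -log(p) if y == 1, -log(1 - p) if y == 0."""
    # Clip p away from 0 and 1 so the logarithm stays finite (assumption).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def log_loss(ys, ps):
    """Average of the per-instance costs over the whole dataset."""
    return sum(log_loss_single(y, p) for y, p in zip(ys, ps)) / len(ys)


# A confident correct prediction is cheap, a confident wrong one is expensive.
print(log_loss_single(1, 0.99))
print(log_loss_single(1, 0.01))
print(log_loss([1, 0, 1], [0.9, 0.2, 0.6]))
```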

## Minimizing the Cost Function <a class="anchor" id="4"></a>

To optimize this convex function, we can use either gradient descent or Newton's method. In both cases, we need the gradient of the loss function.

Logistic regression cost function partial derivatives:

<img src="https://miro.medium.com/max/434/1*Nr2E9HL-RCDIW78VmBiJyg.png">

Derivation of cost function:
https://medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d
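A minimal sketch of batch gradient descent using this gradient is given below. It assumes the usual form of the partial derivative, the average of $(h_{\theta}(x) - y)\,x_j$ over the training set. The toy dataset, learning rate and iteration count are hypothetical values chosen for illustration. The first column of `X` is a constant 1 so that `theta[0]` plays the role of the bias.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def cost(theta, X, y):
    """Average log loss over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = hypothesis(theta, xi)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return total / len(X)


def gradient(theta, X, y):
    """Partial derivatives: mean over instances of (h(x) - y) * x_j."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = hypothesis(theta, xi) - yi
        for j, v in enumerate(xi):
            grad[j] += err * v / m
    return grad


def gradient_descent(X, y, lr=0.5, n_iters=2000):
    theta = [0.0] * len(X[0])
    for _ in range(n_iters):
        g = gradient(theta, X, y)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical toy data: bias column plus one feature, classes overlap.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y)
final_cost = cost(theta, X, y)
print("theta:", theta)
print("cost at start:", cost([0.0, 0.0], X, y), "cost at end:", final_cost)
```

The classes in the toy data overlap on purpose: on perfectly separable data the unregularized weights would keep growing instead of settling at a finite minimum.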

## Learning the Logistic Regression Model <a class="anchor" id="5"></a>

Source: https://machinelearningmastery.com/logistic-regression-for-machine-learning/

The coefficients ($\theta$ values) of the logistic regression algorithm must be estimated from your training data. This is done using maximum-likelihood estimation.

Maximum-likelihood estimation is a common learning algorithm used by a variety of machine learning algorithms, although it does make assumptions about the distribution of your data.

The best coefficients would result in a model that would predict a value very close to 1 (e.g. male) for the default class and a value very close to 0 (e.g. female) for the other class. The intuition for maximum-likelihood for logistic regression is that a search procedure seeks values for the coefficients ($\theta$ values) that minimize the error between the probabilities predicted by the model and those in the data (e.g. probability of 1 if the data is the primary class).

We are not going to go into the math of maximum likelihood. It is enough to say that a minimization algorithm is used to find the best values for the coefficients for your training data. This is often implemented in practice using an efficient numerical optimization algorithm (such as a quasi-Newton method).
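Without working through the full derivation, the link between maximum likelihood and the log loss can be stated compactly. This sketch assumes the labels are independent Bernoulli variables with $P(y = 1 \mid x) = h_{\theta}(x)$, and uses $m$ for the number of training instances and a superscript $(i)$ for the $i$-th instance (notation assumed here):

$$
L(\theta) = \prod_{i=1}^{m} h_{\theta}(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_{\theta}(x^{(i)})\bigr)^{\,1 - y^{(i)}}
$$

$$
\log L(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_{\theta}(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log \bigl(1 - h_{\theta}(x^{(i)})\bigr) \Bigr]
$$

Dividing by $-m$ gives the average log loss, so maximizing the likelihood and minimizing the log loss select the same $\theta$.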

## Why can’t we use the cost function for linear regression(MSE) in logistic regression? <a class="anchor" id="6"></a>

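Linear regression uses mean squared error (MSE) as its cost function. If MSE is used for logistic regression, the cost becomes a non-convex function of the parameters (theta), because the sigmoid is applied inside the squared error. Gradient descent is guaranteed to converge to the global minimum only if the function is convex; on a non-convex function it can stall in a local minimum or a flat region.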

![image.png](attachment:image.png)
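The non-convexity can be checked numerically on the simplest possible case. The sketch below assumes a single training instance with x = 1 and y = 1 and a single parameter `theta`, a setup chosen here purely for illustration. It estimates the second derivative of each cost with a central finite difference: a convex function never has negative curvature.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_cost(theta):
    """Squared error for one instance with x = 1, y = 1."""
    return (sigmoid(theta) - 1.0) ** 2


def log_loss_cost(theta):
    """Log loss for the same instance."""
    return -math.log(sigmoid(theta))


def second_derivative(f, t, h=1e-3):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)


grid = [i * 0.5 for i in range(-10, 11)]  # theta from -5 to 5
mse_curvature = [second_derivative(mse_cost, t) for t in grid]
log_loss_curvature = [second_derivative(log_loss_cost, t) for t in grid]

print("MSE has negative curvature somewhere:", min(mse_curvature) < 0)
print("Log loss curvature is positive on the grid:", min(log_loss_curvature) > 0)
```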

## Is logistic regression a linear classifier? <a class="anchor" id="7"></a>

Source: https://www.quora.com/Why-is-logistic-regression-called-a-linear-classifier?no_redirect=1

Yes. A linear classifier is one whose decision boundary is a hyperplane formed by taking a linear combination of the features, so “linear classifier” means that the decision boundary is linear in X. You can still obtain a non linear decision boundary by introducing a feature transformation h(X) (for example, increasing powers of the features) and using it in place of X.


## Non Linear Decision Boundary <a class="anchor" id="8"></a>

Source: https://medium.com/@akshayc123/logistic-regression-87f7fbb4aaf6 (see the section on Non Linear Decision Boundary)
 
Let’s assume that we have data (2-D) with features X1 and X2.
<img src="https://miro.medium.com/max/743/1*WoD2ccguEg2jHYQnI3kXzg.png">

Now we want to separate the data using a line. The equation is:
<img src="https://miro.medium.com/max/257/1*LMBAeheX2eqpl6BND0GqMg.gif">

Clearly, the above data is not linearly separable. To separate it, we need a more complex boundary, so we add polynomial features of degree 2:
<img src="https://miro.medium.com/max/584/1*-oKiY99NB4bNcMVODb8FsQ.gif">

Let’s assume that the weight vector is W = [-1, 0, 0, 1, 1, 0]. The equation then becomes:
<img src="https://miro.medium.com/max/219/1*ndM7zrcpzrCYE2HlKEz12g.gif">

This is the equation of a circle with unit radius, so the decision boundary in this case is a circle.

<img src="https://miro.medium.com/max/756/1*pSQ17HOGYkT6ArBzg5nLDQ.png">

Therefore, from the above example we can say that if the data is not linearly separable in lower dimensions, then by adding features (i.e., increasing the number of features) we can often make it linearly separable in higher dimensions.
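The circle example can be reproduced with a short sketch. It assumes the degree-2 features are ordered as [1, X1, X2, X1², X2², X1·X2], which is the ordering under which W = [-1, 0, 0, 1, 1, 0] gives X1² + X2² - 1 = 0. The function names are hypothetical.

```python
def poly_features(x1, x2):
    """Degree-2 polynomial features (ordering is an assumption)."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]


W = [-1, 0, 0, 1, 1, 0]


def decision_value(x1, x2):
    """Weighted sum W . h(X); this is linear in the transformed features."""
    return sum(w * f for w, f in zip(W, poly_features(x1, x2)))


def predict(x1, x2):
    # sigmoid(z) >= 0.5 exactly when z >= 0, so the sign of z decides the class.
    return 1 if decision_value(x1, x2) >= 0 else 0


print(predict(0.0, 0.0))  # centre of the circle
print(predict(2.0, 0.0))  # well outside the circle
print(decision_value(0.6, 0.8))  # a point on the unit circle, value close to 0
```

With these weights, points outside the unit circle fall in the positive class and points inside fall in the negative class. The model is still linear in the six transformed features, even though the boundary is a circle in the original two.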

## Assumptions of Logistic regression  <a class="anchor" id="9"></a>

Source: https://medium.com/@dhiraj8899/top-5-assumptions-for-logistic-regression-96b11d24d357

1. Logistic regression assumes that there is little or no multicollinearity among the independent variables.
2. Logistic regression assumes that the independent variables are linearly related to the log odds.
3. Logistic regression usually requires a large sample size to predict properly.
4. Binary logistic regression assumes that the dependent variable is binary, and ordered logistic regression requires the dependent variable to be ordered, for example Too Little, About Right, Too Much.
5. Logistic regression assumes that the observations are independent of each other.

<b>Checking that the assumptions of logistic regression are met using Python:</b> https://pythonfordatascienceorg.wordpress.com/logistic-regression-python/

## Hyperparameter optimization  <a class="anchor" id="10"></a>

The options below are parameters of the LogisticRegression class in scikit-learn.

1. Apply penalty or regularization with "penalty":

penalty: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=‘l2’ - If ‘none’ (not supported by the liblinear solver), no regularization is applied.

2. "C" determines penalty strength:

C, default=1.0 - Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

3. Type of solver:

solver: {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=‘lbfgs’ - Algorithm to use in the optimization problem.

- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.

- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

- ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty

- ‘liblinear’ and ‘saga’ also handle L1 penalty

- ‘saga’ also supports ‘elasticnet’ penalty

- ‘liblinear’ does not support setting penalty='none'

Note that the fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with approximately the same scale. <b>You can preprocess the data with a scaler from sklearn.preprocessing.</b>
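One possible way to tune C together with feature scaling is sketched below, assuming scikit-learn is installed. The synthetic dataset, the grid of C values and the 5-fold cross-validation are illustrative choices, not recommendations. The penalty is left at its default and only the solver and C are set explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling happens inside the pipeline, so it is refit on each CV fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_test, y_test)
print("best C:", best_C)
print("test accuracy:", test_accuracy)
```

Putting the scaler inside the pipeline keeps the held-out fold out of the scaling statistics during cross-validation.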

## Solutions to Overfitting  <a class="anchor" id="11"></a>

In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, or Bayesian priors).

1. Regularization is a way of finding a good bias-variance tradeoff by tuning the complexity of the model. It is a very useful method to handle collinearity (high correlation among features), filter out noise from data, and eventually prevent overfitting.

To apply regularization to our logistic regression, we just need to add the regularization term to the cost function to shrink the weights:
![image.png](attachment:image.png)
By increasing the value of $\lambda$, we increase the regularization strength.

![image.png](attachment:image.png)

The parameter C that is implemented for the LogisticRegression class in scikit-learn is the inverse of $\lambda$, so decreasing C increases the regularization strength.
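The shrinking effect of $\lambda$ can be seen in a small sketch written in plain Python. It assumes an L2 penalty of the form $\frac{\lambda}{2} w^2$ added to the average log loss, with the bias left unpenalized; the exact scaling in the figures above may differ. The one-feature toy data and the function name `fit` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, lam, lr=0.1, n_iters=5000):
    """Gradient descent on average log loss + (lam / 2) * w**2 (assumed form)."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iters):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xs)) / m + lam * w
        grad_b = sum(errs) / m  # the bias is not penalized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical one-feature dataset with overlapping classes.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

weights = {lam: fit(xs, ys, lam)[0] for lam in (0.0, 0.1, 1.0, 10.0)}
for lam, w in weights.items():
    print(f"lambda = {lam:5.1f}  ->  w = {w:.4f}")
```

Larger values of `lam` pull the learned weight towards zero, which is the same effect a smaller C has in scikit-learn.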

2. Increasing the size of the dataset: one of the ways to combat overfitting is to increase the training data size.

## Solutions to Underfitting  <a class="anchor" id="12"></a>

1. Use what you know about the data to engineer new features. For example, the product or the ratio of two variables may be a good indicator.

2. If you are regularizing your regression (or if the software applies regularization automatically), try reducing the regularization strength.

## Pros vs Cons  <a class="anchor" id="13"></a>


Source: https://iq.opengenus.org/advantages-and-disadvantages-of-logistic-regression/

<b>Pros:</b>

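1. Simple to implement
2. Effective
3. Feature scaling not strictly needed: does not require input features to be scaled (it can work with scaled features too), although scaling helps when regularization or the ‘sag’ and ‘saga’ solvers are used.
4. Little hyperparameter tuning needed: the defaults often work reasonably well, although the penalty, C and solver can be tuned as described above.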

<b>Cons:</b>

1. Poor performance on non-linear data (e.g., image data)
2. Poor performance with irrelevant and highly correlated features (a feature selection method such as Boruta can help remove irrelevant or redundant features).
3. Not a very powerful algorithm; it can be easily outperformed by other algorithms.
4. High reliance on proper presentation of the data: all the important variables / features should be identified for it to work well.
5. If the number of observations is less than the number of features, logistic regression should not be used, as it may overfit.