# Logistic Regression

- It models the relationship between a categorical dependent variable and one or more independent variables.
- Input values are combined linearly using weights (coefficients).
- Using linear regression for classification is sensitive to outliers.

Models:
- Linear Classifiers
- Logistic Regression
- Decision Trees
- Ensembles

Algorithms:
- Gradient
- Stochastic Gradient
- Recursive Greedy
- Boosting

Core ML:
- Alleviating overfitting
- Handling Missing data
- Precision Recall
- Online learning
- Use gradient ascent to maximize likelihood

1. Linear Classifier     

    - Uses training data to learn a weight (w), or coefficient, for each feature.
    - For 2-dimensional data, a linear decision boundary (a line) separates positive and negative predictions.
    - For multi-dimensional data, the linear classifier builds a hyperplane that tries to separate positive samples from negative samples.
    - A score function is defined as the weighted sum of the features, with the coefficients as weights; its output ranges from $-\infty$ to $+\infty$.
    - To predict a probability, a link function such as the logistic function maps the output of the score function into the range 0 to 1.

$Score(x_i) = w_0 + w_1 \times h_1(x_1) + w_2 \times h_2(x_2) + w_3 \times h_3(x_3) = w^Th(x_i)$

$\hat{y_i} = sign(Score(x_i))$
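
The score and sign rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the coefficient and feature values are made up, `h[0]` is assumed to be the constant feature 1 that pairs with $w_0$, and a score of exactly 0 is assigned to the negative class by assumption.

```python
def score(w, h):
    """Weighted sum w^T h(x). h[0] is assumed to be the constant 1 (intercept term)."""
    return sum(w_j * h_j for w_j, h_j in zip(w, h))


def predict_sign(w, h):
    """Return +1 when the score is positive, otherwise -1 (a zero score maps to -1 here)."""
    return 1 if score(w, h) > 0 else -1


# Hypothetical coefficients w_0, w_1, w_2 and two hypothetical feature vectors.
w = [-1.0, 2.0, -0.5]
h_pos = [1.0, 1.5, 1.0]   # score = -1.0 + 3.0 - 0.5 = 1.5
h_neg = [1.0, 0.2, 2.0]   # score = -1.0 + 0.4 - 1.0 = -1.6

print(score(w, h_pos), predict_sign(w, h_pos))
print(score(w, h_neg), predict_sign(w, h_neg))
```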

Example:

Predict most likely class:

$\hat{P}(y|x)$ = estimate of class probabilities

if $\hat{P}(y=+1|x) > 0.5$:
    $\hat{y} = +1$
else:
    $\hat{y} = -1$
             
             
$-\infty < Score(x_i) = w_0 + w_1 \times h_1(x_1) + w_2 \times h_2(x_2) + w_3 \times h_3(x_3) = w^Th(x_i) < +\infty$

corresponds to

$0 < P(y=+1|x_i, \hat{w}) = \frac{1}{1+e^{-\hat{w}^Th(x_i)}} < 1$

- A generalized linear model uses a link function to squeeze the score from the range $(-\infty, +\infty)$ into the range $(0, 1)$.

# Logistic Function

$Score(x_i) = w_0 + w_1 \times h_1(x_1) + w_2 \times h_2(x_2) + w_3 \times h_3(x_3) = w^Th(x_i)$

$sigmoid(Score(x_i)) = \frac{1}{1+e^{-Score(x_i)}} = \frac{1}{1+e^{-w^Th(x_i)}}$

$P(y=+1|x_i, \hat{w}) = \frac{1}{1+e^{-\hat{w}^Th(x_i)}}$
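
As a rough illustration of the logistic function, the sketch below maps a few scores to probabilities using only the standard library. The branch on the sign of the score is an implementation choice made here to avoid floating-point overflow for large negative scores; it is not part of the formula itself, and the sample scores are arbitrary.

```python
import math


def sigmoid(score):
    """Logistic function 1 / (1 + e^(-score)), written to avoid overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)          # score < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)


def predict_proba(w, h):
    """Estimate P(y = +1 | x, w) from coefficients w and features h (h[0] assumed to be 1)."""
    return sigmoid(sum(w_j * h_j for w_j, h_j in zip(w, h)))


for s in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(s, sigmoid(s))
```

A score of 0 maps to a probability of exactly 0.5, which is why thresholding the probability at 0.5 matches taking the sign of the score.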

## Multiclass training

Use 1 vs all approach:

$\hat{P}_c(y=+1|x_i)$ = estimate from the 1 vs all model for class $c$ (class $c$ against all other classes)

Predict the most likely class:

max_prob = 0; $\hat{y}$ = 0

for c = 1, ..., C:
    if $\hat{P}_c(y=+1|x_i)$ > max_prob:
        $\hat{y}$ = c
        max_prob = $\hat{P}_c(y=+1|x_i)$
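
The prediction loop above might look like this in Python. It is only a sketch under stated assumptions: each class is assumed to already have its own trained coefficient list, and the class names, coefficients, and the `predict_one_vs_all` helper are all hypothetical.

```python
import math


def sigmoid(score):
    # Plain form of the logistic function; fine for the small scores used here.
    return 1.0 / (1.0 + math.exp(-score))


def predict_one_vs_all(weights_per_class, h):
    """Return (label, probability) of the class whose 1 vs all model is most confident.

    weights_per_class: dict mapping a class label to that class's coefficient list.
    h: feature vector, with h[0] assumed to be the constant 1.
    """
    best_class, max_prob = None, 0.0
    for c, w in weights_per_class.items():
        prob = sigmoid(sum(w_j * h_j for w_j, h_j in zip(w, h)))
        if prob > max_prob:
            best_class, max_prob = c, prob
    return best_class, max_prob


# Hypothetical coefficients for three 1 vs all models.
models = {
    "triangle": [0.0, 1.0, -1.0],   # score = 0.0 + 2.0 - 0.5 = 1.5
    "circle": [0.5, -1.0, 1.0],     # score = 0.5 - 2.0 + 0.5 = -1.0
    "square": [-1.0, 0.2, 0.2],     # score = -1.0 + 0.4 + 0.1 = -0.5
}
label, prob = predict_one_vs_all(models, [1.0, 2.0, 0.5])
print(label, prob)
```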

## Quality Metric

### Likelihood function

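- Find $w$ such that the logistic function approaches 0 for negative data points and 1 for positive data points.

Ideal output for negative data points ($y_i = -1$): $P(y=+1|x_i, w) = 0.0$

Ideal output for positive data points ($y_i = +1$): $P(y=+1|x_i, w) = 1.0$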

- The likelihood function $l(w)$ measures the quality of fit for a model with coefficients $w$.
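
For reference, the usual form of the likelihood is written out below. It assumes the training examples are independent; the count $N$ and the log-likelihood name $ll(w)$ are notation introduced here, not taken from the notes above.

$l(w) = \prod_{i=1}^{N} P(y_i|x_i, w)$

where $P(y_i=-1|x_i, w) = 1 - P(y_i=+1|x_i, w)$.

Because a product of many probabilities becomes very small, the logarithm is typically maximized instead; it has the same maximizer:

$ll(w) = \ln l(w) = \sum_{i=1}^{N} \ln P(y_i|x_i, w)$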


### Find "best" classifier
- Maximize the likelihood $l(w_0, w_1, w_2)$ over all possible values of $w_0, w_1, w_2$.
- Use gradient ascent to find the $\hat{w}$ that maximizes the likelihood.
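
A minimal sketch of gradient ascent on the log-likelihood follows. It relies on the standard gradient for logistic regression, where the partial derivative for $w_j$ is the sum over examples of $h_j(x_i)$ times (indicator of $y_i=+1$ minus $P(y=+1|x_i, w)$). The toy data, step size, iteration count, and function names are all assumptions chosen for illustration.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def dot(w, h):
    return sum(w_j * h_j for w_j, h_j in zip(w, h))


def log_likelihood(w, H, y):
    """Sum of ln P(y_i | x_i, w), using P(y_i | x_i, w) = sigmoid(y_i * score_i)."""
    return sum(math.log(sigmoid(y_i * dot(w, h))) for h, y_i in zip(H, y))


def gradient_ascent(H, y, step_size=0.1, n_iters=500):
    """H: feature vectors with h[0] = 1; y: labels in {+1, -1}. Returns the fitted w."""
    w = [0.0] * len(H[0])
    for _ in range(n_iters):
        # Error per example: indicator(y_i = +1) - P(y = +1 | x_i, w)
        errors = [(1.0 if y_i == 1 else 0.0) - sigmoid(dot(w, h))
                  for h, y_i in zip(H, y)]
        # Step uphill along the gradient of the log-likelihood.
        for j in range(len(w)):
            w[j] += step_size * sum(e * h[j] for e, h in zip(errors, H))
    return w


# Hypothetical 1-feature data set with overlapping classes.
H = [[1.0, 2.0], [1.0, 1.0], [1.0, -0.5], [1.0, 0.5], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]

w_hat = gradient_ascent(H, y)
print(w_hat, log_likelihood(w_hat, H, y))
```

The step is added to `w` rather than subtracted because the goal is to maximize the likelihood; minimizing the negative log-likelihood with gradient descent would give the same result.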

## Feature Scaling for gradient descent
- For linear regression as well as logistic regression, scaling the features to comparable ranges helps gradient-based optimization converge on the feature weights faster.
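
One common way to scale features is standardization (z-scores), sketched below. The notes above do not say which scaling method is intended, so this choice is an assumption, as are the sample values and the `standardize` helper name.

```python
import math


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in column]


sq_feet = [850.0, 1200.0, 1500.0, 2400.0]   # hypothetical raw feature values
scaled = standardize(sq_feet)
print(scaled)
```

When scaling is used, the mean and standard deviation computed on the training data should also be applied to any new data before prediction.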

