### Logistic regression / Logit Regression / maximum-entropy classification (MaxEnt) / the log-linear classifier


- How will you deal with the multiclass classification problem using logistic regression?
- Explain the use of ROC curves and the AUC of an ROC curve.
- How can you use the concept of ROC in multiclass classification?
- How to visualize the decision boundary
- How to create polynomial and interaction terms
- How to visualize the learning curve
- How to read a classification report
- L1 and L2 regularization terms
- What is the Jaccard score?


## Logistic Regression

Logistic regression is an extension of linear regression in which the dependent variable is categorical rather than continuous. Both binary and multinomial logistic regression are estimated using Maximum Likelihood Estimation (MLE), unlike linear regression, which uses the Ordinary Least Squares (OLS) approach.

Maximum Likelihood (ML) estimation says that the coefficients should be chosen so that they maximize the probability of the observed Y given X (the likelihood). There is no closed-form solution, so the computer iterates, trying successive solutions until it converges on the maximum likelihood estimates. Fisher Scoring is one of the most popular iterative methods of estimating the regression parameters.

The output of a logistic regression model is a probability. We can select a threshold value: if the probability is greater than this threshold, the event is predicted to happen; otherwise it is predicted not to happen. A suitable cut-off point can be chosen based on the ROC curve.
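As a minimal sketch of the probability-then-threshold step described above: the weights `w`, the bias `b` and the 0.5 cut-off are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative only).
w = np.array([1.5, -2.0])
b = 0.25

X = np.array([[2.0, 0.5],
              [0.1, 1.0],
              [1.0, 1.0]])

probs = sigmoid(X @ w + b)            # P(y = 1 | x) for each row
threshold = 0.5                       # assumed cut-off; an ROC curve can guide this choice
labels = (probs >= threshold).astype(int)

print(probs.round(3))
print(labels)
```

Lowering the threshold trades more false positives for fewer false negatives, and raising it does the opposite.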

In the multiclass case, scikit-learn's training algorithm uses the one-vs-rest (OvR) scheme if the 'multi_class' option is set to 'ovr', and uses the cross-entropy loss if the 'multi_class' option is set to 'multinomial'. (The 'multinomial' option is supported by the 'lbfgs', 'newton-cg', 'sag' and 'saga' solvers, but not by 'liblinear'.)
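The two multiclass strategies can be sketched side by side. To stay independent of how a given scikit-learn release spells the `multi_class` option, this sketch wraps the model in `OneVsRestClassifier` for OvR and relies on the default behaviour of `LogisticRegression` with `lbfgs` for the multinomial case; that default is an assumption about a reasonably recent release.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# One-vs-rest: one binary logistic regression per class.
ovr = make_pipeline(StandardScaler(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
ovr.fit(X, y)

# Multinomial (softmax): a single model trained with the cross-entropy loss.
# Assumes a recent scikit-learn, where lbfgs handles multiclass targets this way by default.
softmax = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
softmax.fit(X, y)

ovr_proba = ovr.predict_proba(X[:3])
softmax_proba = softmax.predict_proba(X[:3])
print(ovr_proba.round(3))
print(softmax_proba.round(3))
```

Both variants return one probability per class for every sample; the values differ slightly because the models are trained differently.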

__Assumptions__

- The dependent variable is binary (for binary logistic regression).
- Observations are independent of each other.
- Little or no multicollinearity among the independent variables.
- Linearity between the independent variables and the log odds.

__Limitation__

Classic logistic regression doesn't handle nonlinearities well (however you can include interaction and polynomial terms).
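A small sketch of how polynomial and interaction terms can help, using synthetic concentric-ring data that no straight line can separate. The dataset and the choice of degree 2 are assumptions made for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two concentric rings: not separable by a straight line.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)

linear_model = LogisticRegression(max_iter=1000).fit(X, y)

# The degree-2 expansion adds x1^2, x2^2 and the interaction term x1*x2.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
).fit(X, y)

linear_acc = linear_model.score(X, y)
poly_acc = poly_model.score(X, y)
print(f"linear features:   {linear_acc:.2f}")
print(f"degree-2 features: {poly_acc:.2f}")
```

The model is still linear in its (expanded) inputs; only the feature space has changed.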


__Why Logistic Regression__

- Logistic regression does well at modeling the effects of individual variables but often struggles to capture interactions between variables unless they are added explicitly. Other models (e.g., XGBoost) capture interactions very well.

- Linear models work best with dummy variables, whereas algorithms like XGBoost and decision trees do not rely much on dummy variables.

- One-hot encoding (OHE) is usually the best encoding for regression models. The danger of one-hot encoding is overfitting the training set as a result of either incomplete sampling of the feature space or noise.

__For Effective LR__
- Remove excess collinearity (for example by applying PCA).
- Normalize the variables (using a Standard or Robust Scaler) so that all features are on the same scale before they go into the model.
- Interaction terms: investigate the factors for a linear relationship with the log odds, and add interaction terms where needed.
- Apply transformations as required to reduce the influence of outliers.
- Remove uninformative features (e.g. using RFE).
- Consider a non-linear model if the relationship is clearly not linear.
- Class imbalance: look for class imbalance in your data. With admit/reject data, for example, the number of rejects is typically much higher than the number of admits. Most classifiers in scikit-learn, including LogisticRegression, have a class_weight parameter. Setting it to 'balanced' may work well when classes are imbalanced.
- An SVM is able to learn more complex decision boundaries.
- Logistic regression cannot deal with missing values. Depending on the software, incomplete cases are either excluded during estimation or cause an error. To avoid that, impute the missing values (for example with the mean or median).
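Several of the points above (imputation, scaling, class weighting) can be combined in a single pipeline. This is a sketch on synthetic data; the 5% missing rate, the median strategy and the roughly 10% positive class are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 10% positives) with some values knocked out.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan        # about 5% of cells missing at random

model = make_pipeline(
    SimpleImputer(strategy="median"),          # fill missing values
    StandardScaler(),                          # put features on one scale
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
pred = model.predict(X)
print("predicted positives:", int(pred.sum()), "of", len(pred))
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training data only when the pipeline is cross-validated.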


__Model Performance__

A model's performance depends on the dataset and its properties, such as its distribution. A simple model like LR for binary classification can generalize better than boosting, particularly on small or noisy datasets.


__LR Hyperparameters__

Regularization is a very useful method for handling collinearity (high correlation among features) and preventing overfitting. Regularization introduces additional information (bias) to penalize large parameter weights. It works properly only when features are on comparable scales. Decision regions change when different regularization values are used.


C = 1/λ

C, the inverse regularization strength, controls the amount of overfitting (a lower value should decrease overfitting).
From the validation curve, choose the C that gives the smallest difference between training and validation accuracy while keeping validation accuracy high (our aim is a model that generalizes to unseen data).
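A validation curve over C can be computed along these lines. The dataset, the grid of C values and the 5-fold split are assumptions for the sake of the example; `logisticregression__C` is the parameter name that `make_pipeline` generates for the step.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

C_values = [0.001, 0.01, 0.1, 1, 10, 100]
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="logisticregression__C",
    param_range=C_values,
    cv=5,
    scoring="accuracy",
)

# One row per C value, one column per fold.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for C, tr, va in zip(C_values, train_mean, val_mean):
    print(f"C={C:<7} train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

Plotting `train_mean` and `val_mean` against `C_values` on a log axis gives the usual validation-curve picture.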

penalty: [L1, L2]

- L2 produces models with many small coefficients, whereas L1 produces models with a large number of zero coefficients.
- The overall size (norm) of the coefficient vector shrinks steadily as the L2 penalty is increased.
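The contrast between the two penalties can be checked by counting exact zeros in the fitted coefficients. This is a sketch on synthetic data; `C=0.05` is an arbitrary, fairly strong setting, and the `penalty=` argument assumes a scikit-learn release that accepts it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 20 features, only 3 of which carry signal.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
X = StandardScaler().fit_transform(X)

# liblinear supports both penalties, so only the penalty type differs.
l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```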

__Optimization Algorithms__

The optimization algorithm iteratively updates the weights so as to minimize the loss function. The standard algorithm for this is gradient descent (or its variant, stochastic gradient descent).

Our goal with gradient descent is to find the optimal weights, i.e. those that minimize the loss function. For logistic regression, this loss function is conveniently convex. A convex function has no local minima to get stuck in: any minimum that gradient descent finds is the global minimum.
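A minimal NumPy sketch of batch gradient descent on the log loss. The synthetic data, the step size of 0.1 and the 200 iterations are assumptions; real solvers are considerably more refined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)            # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic two-class data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)
b = 0.0
learning_rate = 0.1                          # assumed step size
losses = []

for step in range(200):
    p = sigmoid(X @ w + b)
    losses.append(log_loss(y, p))
    grad_w = X.T @ (p - y) / len(y)          # gradient of the mean log loss w.r.t. w
    grad_b = np.mean(p - y)                  # gradient w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, accuracy: {accuracy:.2f}")
```

With all-zero starting weights every prediction is 0.5, so the first recorded loss equals log 2; it then falls as the weights move toward the minimum.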

Candidate values for a hyperparameter search:

    solver_options = ['newton-cg', 'lbfgs', 'liblinear', 'sag']
    multi_class_options = ['ovr', 'multinomial']
    class_weight_options = [None, 'balanced']  # None (not the string 'None') means no reweighting
    C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    penalty = ['l1', 'l2']

__liblinear__
- For small datasets, 'liblinear' is a good choice, whereas 'sag' is faster for large ones.
- It penalizes (regularizes) the intercept, which is undesirable, so prefer another solver when fit_intercept=True.
- Does not handle multi_class='multinomial'.
- The only solver that handles dual=True.

__Stochastic Average Gradient (sag) and its variant SAGA (saga)__

- 'saga' is the only solver that handles penalty='l1' with multi_class='multinomial'.
- Very fast for large datasets, dense or sparse.
- May suffer slower convergence when the features are not scaled.
- May suffer slower convergence when there are large norm outlier samples. (This might be improved with a better step-size heuristic (e.g.), or with an adaptive step-size.)

__newton-cg__

- Newton's method minimizes a quadratic approximation of the loss at each step. The approximation is better than the linear one used by gradient descent because it uses both the first and second partial derivatives (the gradient and the Hessian).

__Limited-memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs)__
- Good default, but not the fastest for large datasets.
- Does not handle penalty='l1'.

__Q. How to Optimize a Logistic Regression Model?__

Use the parameter C as the regularization parameter, where C = 1/λ.

Lambda (λ) controls the trade-off between allowing the model to become as complex as it wants and keeping it simple.

- A higher C (lower λ) is more likely to overfit.

For example, if λ is very low or 0, the model has enough freedom to increase its complexity (overfit) by assigning large values to the weights. If, on the other hand, we increase the value of λ, the model tends to underfit, as it becomes too simple.

__Q. Why can't linear regression be used in place of logistic regression?__

- Distribution of error terms: The error distributions for linear and logistic regression differ. Linear regression assumes that error terms are normally distributed. With a binary outcome, this assumption does not hold.


- Model output: In linear regression, the output is continuous. In case of binary classification, an output of a continuous value does not make sense. For binary classification problems, linear regression may predict values that can go beyond 0 and 1. If we want the output in the form of probabilities, which can be mapped to two different classes, then its range should be restricted to 0 and 1. As the logistic regression model can output probabilities with logistic/sigmoid function, it is preferred over linear regression.


- Variance of Residual errors: Linear regression assumes that the variance of random errors is constant. This assumption is also violated in case of logistic regression.
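The output-range point can be illustrated with a one-feature toy problem in which the label switches from 0 to 1 at x = 0. The data is made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# One feature; the label switches from 0 to 1 at x = 0.
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

linear_pred = LinearRegression().fit(x, y).predict(x)
logistic_prob = LogisticRegression().fit(x, y).predict_proba(x)[:, 1]

print(f"linear regression output range: [{linear_pred.min():.2f}, {linear_pred.max():.2f}]")
print(f"logistic probability range:     [{logistic_prob.min():.2f}, {logistic_prob.max():.2f}]")
```

The fitted straight line drops below 0 and rises above 1 at the extremes, while the sigmoid keeps every output inside the unit interval.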

__Q. What is the difference between SGD and LR?__

Logistic regression can be fitted with different optimizers, and Stochastic Gradient Descent (SGD) is one of them. In scikit-learn, SGD-based fitting is provided by SGDClassifier rather than as a solver option of LogisticRegression.

- SGD is an optimization method, while Logistic Regression (LR) is a machine learning algorithm/model. A machine learning model defines a loss function, and the optimization method minimizes/maximizes it.
- scikit-learn's SGDClassifier is a linear classifier optimized by SGD; with the logistic loss it fits a logistic regression model.
- LR can use other optimizers such as L-BFGS, conjugate gradient or Newton-like methods.

## Performance Metrics

We compare the model's predictions on a test set with the actual, ground-truth results.

In some scenarios we are satisfied with the overall accuracy, whereas in others the cost of misclassifying a single data point is huge. For example, when a bank decides whether a customer is eligible for a loan, it may be acceptable to misclassify some eligible customers as not eligible. But when a doctor classifies patients as having cancer or not, it would be a serious mistake to declare some cancer patients cancer-free.

__Confusion Matrix__

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.


__Basic terms related to Confusion matrix or contingency table:__
- a) True Positives: Observations where the actual and predicted transactions were fraud

- b) True Negatives: Observations where the actual and predicted transactions weren’t fraud

- c) False Positives: Observations where the actual transactions weren’t fraud but predicted to be fraud

- d) False Negatives: Observations where the actual transactions were fraud but weren’t predicted to be fraud
- ideal scenario - zero values for FP and FN
- Samples in the FP set are actually negatives and samples in FN are actually positives.
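The four counts can be read off scikit-learn's `confusion_matrix`. The labels below are made-up toy values, with 1 standing for fraud and 0 for not fraud.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = fraud, 0 = not fraud (made-up values for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1], scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")   # TP=3 TN=3 FP=1 FN=1
```

Note that rows are actual classes and columns are predicted classes, which is the opposite of the layout used in some textbooks.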

__Accuracy (Acc)__

- Classification accuracy is the percentage of correct prediction over total instances.
- = (tp + tn) / (tp + tn + fp + fn)
- Accuracy is a good metric if the classes are balanced
-  it doesn't tell us where the model is making errors. Answering this "where" question is an essential part of model-building. 

__Classification Error / Error Rate / Misclassification Rate (ERR)__

- Measures the ratio of incorrect predictions to the total number of instances evaluated.
- ERR = (fp + fn) / (tp + tn + fp + fn)
- ERR = 1 - Acc
- Applicable to multi-class and multi-label problems.
- Another problem with accuracy is that two classifiers can yield the same accuracy but perform differently, because they make different kinds of errors.
- Both accuracy and error rate are sensitive to imbalanced data, which makes accuracy an unreliable performance metric on an imbalanced dataset. To cope with this problem, we can choose to penalize false positives or false negatives, which leads to two alternative metrics: precision and recall.
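A short illustration of why accuracy misleads on imbalanced data: a made-up "model" that always predicts the majority class scores well on accuracy yet finds no positives at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 negatives and 5 positives; the "model" always predicts the majority class.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
error_rate = 1 - accuracy
recall = recall_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f}  error rate={error_rate:.2f}  recall={recall:.2f}")
# accuracy=0.95  error rate=0.05  recall=0.00
```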


__Recall, Precision and F1__

- Precision is the ability of the classifier not to label a negative sample as positive. With precision, we evaluate the model on its 'positive' predictions.
- Precision is the probability that a sample the system labels positive really is positive.
- Precision = TP / (TP + FP)

- Recall is the ability of the classifier to find all the positive samples. With recall, we evaluate the model against the ground-truth positives.
- e.g. if a sample is positive for the disease, what is the probability that the system will pick it up?
- Recall = TP / (TP + FN)


__Sensitivity and Specificity__
- Sensitivity is the True Positive Rate (TPR) and is also called recall.
- Specificity, the True Negative Rate (TNR), is the counterpart of recall for the negative class: TN / (TN + FP).
- Sensitivity and specificity may give a biased picture on their own, especially for imbalanced classes.
- Both precision and recall are necessary to determine whether the classifier is performing well.
- The F1 score is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall).
- F1 is used where true negatives don't matter much.
- The best value for recall, precision and F1 is 1 and the worst value is 0.
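The formulas above can be worked through by hand on a small made-up example and cross-checked against scikit-learn's helpers.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # TN=4 FP=2 FN=1 TP=3

precision = tp / (tp + fp)           # 3 / 5 = 0.60
recall = tp / (tp + fn)              # 3 / 4 = 0.75  (sensitivity, TPR)
specificity = tn / (tn + fp)         # 4 / 6, about 0.67  (TNR)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, about 0.67

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} F1={f1:.2f}")

# Library equivalents for three of the four quantities.
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```

scikit-learn has no dedicated specificity function; computing it from the confusion matrix, as here, is one common approach.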


__Matthews Correlation Coefficient (MCC)__
- Like a correlation coefficient, MCC ranges from -1 to +1. A score of +1 indicates perfect prediction, 0 is no better than random guessing, and -1 indicates total disagreement between predictions and actual labels. This bounded range makes MCC easy to interpret.
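The behaviour of MCC at its extremes can be checked on made-up labels. The third case shows a majority-class predictor that reaches 70% accuracy while its MCC stays at zero.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

predictions = {
    "perfect": list(y_true),                        # every label correct
    "inverted": [1 - label for label in y_true],    # every label wrong
    "majority": [0] * len(y_true),                  # always predict the majority class
}

results = {}
for name, y_pred in predictions.items():
    results[name] = (accuracy_score(y_true, y_pred),
                     matthews_corrcoef(y_true, y_pred))
    print(f"{name:9s} accuracy={results[name][0]:.2f} MCC={results[name][1]:+.2f}")
```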


__ROC Curve__

- A ROC (Receiver Operating Characteristic) curve is a graphical assessment method that can help in deciding the best threshold value.
- It shows the performance of a classification model at all classification thresholds.
- The ROC curve/AUC score is most useful when comparing a model against itself, for example across different versions or settings.
- It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis).


__AUC__
- AUC is one of the popular ranking-type metrics.

- The area under the ROC curve is called the Area Under the Curve (AUC). It can be read as the probability that the model ranks a randomly chosen positive sample above a randomly chosen negative one: 1.0 is a perfect ranking and 0.5 is no better than chance.
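A small sketch tying the ROC curve, the AUC and threshold selection together. The scores are made up, and maximizing Youden's J (TPR minus FPR) is only one possible rule for picking a cut-off, assumed here for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])       # predicted P(y = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# AUC as a ranking metric: the fraction of (positive, negative) pairs
# in which the positive sample gets the higher score.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

# One possible cut-off rule (an assumption, not the only choice):
# maximize Youden's J = TPR - FPR.
best_threshold = thresholds[np.argmax(tpr - fpr)]

print(f"AUC={auc:.2f}  pairwise={pairwise:.2f}  threshold={best_threshold}")
```

Three of the four positive-negative pairs are ranked correctly, which is why both computations give 0.75.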

__Choice of Metrics__

It depends on the business objective.

__Using cross-validation for more robust error measurement__

Using a single validation dataset has drawbacks. Firstly, it decreases the amount of training data; secondly, since the model is evaluated against a small amount of data, the performance estimate is noisy and model choices can overfit to that particular split. To overcome this, there is a technique called cross-validation. The most common form, and the one we will be using, is k-fold cross-validation. The data is split into k parts ('folds'); in each iteration the model is trained on k - 1 folds and evaluated on the remaining one, so 'k' is the number of folds. In the diagram above, we have illustrated k-fold validation where k is 5.
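A 5-fold cross-validation run might look like the following sketch. The dataset and the scaled logistic regression pipeline are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k = 5: train on four folds, score on the held-out fold, five times over.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores.round(3))
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the estimate is.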


__Example__

When classes are imbalanced, accuracy is not a suitable measure.
The target variable was highly unbalanced with the positive class (heavy drinkers) making up less than 5% of the population. For this reason, recall was chosen as the performance metric rather than accuracy. That is, performance should be weighted towards identifying heavy drinkers even if it means incorrectly classifying some normal drinkers as heavy drinkers.

__Q. What is a sparse solution? How can logistic regression be used to perform feature selection?__

As the L1 regularization strength (called gamma here, λ elsewhere in these notes) increases, more variables' coefficients go to 0. Once a variable has a 0 coefficient, it no longer has any impact on the model. So, as gamma increases, the model uses fewer and fewer variables. This is what we mean by a sparse solution: it only uses a few variables in the dataset.

A logistic regression with an L1 penalty yields sparse models and can thus be used to perform feature selection.
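One way to turn an L1-penalized model into a feature selector is scikit-learn's `SelectFromModel`, sketched here on synthetic data. The value `C=0.05` is an arbitrary choice, and `penalty="l1"` assumes a release that accepts that argument.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 20 features, only 3 of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
X = StandardScaler().fit_transform(X)

# Strong L1 penalty (small C) drives many coefficients to exactly zero.
sparse_lr = LogisticRegression(penalty="l1", C=0.05, solver="liblinear")
selector = SelectFromModel(sparse_lr).fit(X, y)

X_selected = selector.transform(X)
kept = selector.get_support(indices=True)
print("kept feature indices:", kept)
print("shape before/after:", X.shape, X_selected.shape)
```

Features whose coefficient is (numerically) zero are dropped, and the reduced matrix can then be passed to any downstream model.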

__Q. Does regularization in logistic regression always result in a better fit and better generalization?__

Regularization does NOT improve the performance on the training set that the algorithm used to learn the model parameters (feature weights). However, it can improve the generalization performance, i.e., the performance on new, unseen data, which is exactly what we want.

In intuitive terms, we can think of regularization as a penalty against complexity. Increasing the regularization strength penalizes "large" weight coefficients. Our goal is to prevent the model from picking up "peculiarities" or "noise", or from imagining a pattern where there is none.


Again, we don't want the model to memorize the training dataset, we want a model that generalizes well to new, unseen data.

__Q. What is the difference between Logistic Regression and SVM?__

- A kernel SVM is non-parametric and a hard classifier, whereas LR is a parametric, probabilistic one.

- A linear SVM (without any kernel function) usually performs about the same as logistic regression, as the two are structurally similar and differ only in their loss function (hinge for SVM, logistic for LR).

- LR produces probabilistic values while SVM produces 1 or 0. SVM may perform worse than LR when the dataset is small.

Logistic regression learns a linear decision boundary, so on its own it can only separate linearly separable classes, whereas an SVM (with the kernel trick) can find an arbitrarily shaped decision boundary, which means more flexibility. This means that the SVM will usually do better at separating your classes (at least on your training set) but is more prone to overfitting.

Logistic regression is also a simpler model with fewer hyperparameters to tune (zero if you're not using regularization), making it easier to implement.

Logistic regression outputs a probability of being in the positive class (you still need to choose a threshold to make it a classifier), while an SVM just outputs the classes. An SVM can give you probabilities via Platt scaling, but this can be very slow.

To sum up, as a rule of thumb, use an SVM when you have a large dataset with a large number of features and there is a clear decision boundary in the dataset, keeping in mind that kernel SVMs become slow to train as the number of samples grows. Otherwise use logistic regression.
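The difference in decision boundaries and outputs can be seen on a synthetic two-moons dataset, which is not linearly separable. The dataset, the noise level and the default SVM settings are assumptions for the example.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: the classes are not linearly separable.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression().fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)      # default C and gamma

lr_acc = lr.score(X_test, y_test)
svm_acc = svm.score(X_test, y_test)
print(f"logistic regression accuracy: {lr_acc:.3f}")
print(f"RBF-kernel SVM accuracy:      {svm_acc:.3f}")

# LR exposes class probabilities directly; SVC exposes signed distances
# to the boundary unless it is built with probability=True (Platt scaling).
lr_proba = lr.predict_proba(X_test[:2])
svm_margin = svm.decision_function(X_test[:2])
print(lr_proba.round(3))
print(svm_margin.round(3))
```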


__Q. What is Pearson's chi-square test of association/independence? How is it useful in feature selection?__

The chi-square test is used for categorical features in a dataset. In practice, we calculate the chi-square statistic between each feature and the target and select the desired number of features with the best chi-square scores.

- The chi-square test is akin to correlation, but for categorical variables.
- It is used for testing relationships between categorical variables: a categorical response and a categorical predictor.
- The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population, i.e. they are independent.
- When p > 0.05 we fail to reject independence; when p < 0.05 we conclude that the variables are dependent. The higher the chi-square value, the more strongly the feature depends on the target.
- The chi-square test can also be used for feature selection.
- ANOVA handles a continuous response with a categorical predictor, and can also be used for feature selection.
- Chi-square is sensitive to small frequencies in the cells of the table. Generally, when the expected value in a cell is less than 5, chi-square can lead to errors in conclusions.
- The other chi-square test is the goodness-of-fit test.

    # Chi-square demo
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    iris_dataset = load_iris()

    X = iris_dataset.data
    y = iris_dataset.target
    X = X.astype(int)  # truncates measurements to whole numbers; chi2 needs non-negative features

    # Two features with highest chi-squared statistics are selected
    chi2_features = SelectKBest(chi2, k=2)
    X_kbest_features = chi2_features.fit_transform(X, y)

    # Reduced features
    print('Original feature number:', X.shape[1])
    print('Reduced feature number:', X_kbest_features.shape[1])
    chi_scores = chi2(X, y)  # first array: chi-square statistics, second array: p-values
    print(chi_scores)

Output:

    Original feature number: 4
    Reduced feature number: 2
    (array([ 10.28712871,   5.02267003, 133.06854839,  74.27906977]), array([5.83684799e-03, 8.11598175e-02, 1.27213107e-29, 7.42172639e-17]))
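For the test of independence itself (as opposed to feature scoring), SciPy's `chi2_contingency` works directly on a contingency table. The 2x2 table below is made up for illustration, and the 0.05 significance level is the conventional assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 contingency table: rows = predictor category, columns = response.
observed = np.array([[30, 10],
                     [10, 30]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2={chi2_stat:.2f}  p={p_value:.2g}  dof={dof}")
print("expected counts under independence:")
print(expected)

alpha = 0.05
if p_value < alpha:
    print("reject H0: the variables appear to be associated")
else:
    print("fail to reject H0: no evidence of association")
```

For 2x2 tables SciPy applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected textbook value.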


__Q. What is Logistic Loss / Log Loss / Cross-Entropy Loss?__

Cross-entropy measures the divergence between two probability distributions. If the cross-entropy is large, the difference between the two distributions is large; if it is small, the two distributions are similar to each other.

Log loss uses the negative log to provide an easy metric for comparison. It takes this approach because the log of a number less than 1 is negative, which is confusing to work with when comparing the performance of two models.

It measures the performance of a classifier whose prediction is a probability value. The goal is to minimize this value, since it is a loss.

- A log loss of 0 corresponds to a perfect model.
- The log loss is only defined for two or more labels.
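The binary log loss can be computed by hand and compared with scikit-learn's `log_loss`. The labels and predicted probabilities are made-up values.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.1, 0.8, 0.3])       # predicted P(y = 1)

# Per-sample penalty: -( y*log(p) + (1-y)*log(1-p) )
per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
manual = per_sample.mean()
library = log_loss(y_true, p_pred)

print(f"manual={manual:.4f}  sklearn={library:.4f}")
print(per_sample.round(3))
```

The last sample, a true positive predicted at only 0.3, contributes far more to the loss than the confident correct predictions.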

__log loss vs Cross Entropy__

Log loss and cross-entropy differ slightly depending on the context. In machine learning, when computing the loss for predicted probabilities between 0 and 1, they resolve to the same thing.


__Q. Why is accuracy not a good measure for classification problems?__

Accuracy is not always a good measure for classification problems because it gives equal importance to false positives and false negatives and cannot differentiate between them. In most business problems, however, the two kinds of error do not cost the same.

__Q. Why can’t we use Mean Square Error (MSE) as a cost function for logistic regression?__

Logistic regression uses the sigmoid function to perform a non-linear transformation that yields probabilities. Squaring the error of this non-linear transformation leads to a non-convex loss with local minima. In such cases gradient descent is not guaranteed to find the global minimum. For this reason, MSE is not suitable for logistic regression.

Cross-entropy or log loss is used as the cost function for logistic regression instead.
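The convexity claim can be checked numerically in the simplest possible setting: a single training example with x = 1 and y = 1, so that the prediction is sigmoid(w). This is a sketch, not a proof. Second differences on a grid stand in for the second derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single training example with x = 1 and y = 1, so the prediction is sigmoid(w).
w = np.linspace(-6, 6, 241)
mse = (1.0 - sigmoid(w)) ** 2
log_loss = -np.log(sigmoid(w))

# Discrete second differences approximate the second derivative (up to a factor h^2).
mse_curvature = np.diff(mse, n=2)
log_curvature = np.diff(log_loss, n=2)

print("MSE has negative curvature somewhere:", bool((mse_curvature < 0).any()))
print("log loss curvature is never negative:", bool((log_curvature >= 0).all()))
```

A convex function never curves downward, so the negative curvature found for the squared error is enough to show it is not convex in w.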

__Q What is the difference between “L1” and “L2” regularization?__


Regularization refers to methods used to modify an objective function in order to reduce model overfitting.

L1 and L2 regularization refer to ways of calculating the length of the vector of model parameters (the vector norm), so that this length can be minimized as part of fitting the model.

- L1, or the L1-norm, is calculated as the sum of the absolute vector values. This form of regularization is used in Lasso Regression.
- L2 penalizes the squared L2-norm, calculated as the sum of the squared vector values. This form of regularization is used in Ridge Regression.
- The ElasticNet Regression algorithm uses a combination of both L1 and L2 regularization.
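The penalty terms are easy to compute by hand. In the sketch below the weight vector, the strength `lam` and the mixing weight `alpha` are hypothetical, and the elastic-net formula is one common parameterization among several.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])           # example weight vector

l1_norm = np.sum(np.abs(w))              # |3| + |-4| + |0| = 7
l2_norm = np.sqrt(np.sum(w ** 2))        # sqrt(9 + 16) = 5
l2_penalty = np.sum(w ** 2)              # squared L2 norm = 25 (the ridge-style penalty)

lam, alpha = 0.1, 0.5                    # hypothetical strength and L1/L2 mix
elastic_net = lam * (alpha * l1_norm + (1 - alpha) * l2_penalty)

print(l1_norm, l2_norm, l2_penalty, round(elastic_net, 3))
```

Each penalty is added to the data loss, so larger weights cost more and the optimizer is pushed toward smaller (L2) or sparser (L1) solutions.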

## Summary


- Logistic regression is a supervised machine learning classifier that extracts real-valued features from the input, multiplies each by a weight, sums them, and passes the sum through a sigmoid function to generate a probability. A threshold is used to make a decision.
- Logistic regression can be used with two classes (e.g., positive and negative sentiment) or with multiple classes (multinomial logistic regression, for example for n-ary text classification, part-of-speech labeling, etc.).
- Multinomial logistic regression uses the softmax function to compute probabilities.
- The weights (vector w and bias b) are learned from a labeled training set via a loss function, such as the cross-entropy loss, that must be minimized.
- Minimizing this loss function is a convex optimization problem, and iterative algorithms like gradient descent are used to find the optimal weights.
- Regularization is used to avoid overfitting.