# Logistic Regression for Binary Classification

Logistic regression is a statistical method for binary classification problems. Unlike linear regression, which predicts continuous outcomes, logistic regression is designed to predict the probability of a binary outcome (0 or 1, true or false, yes or no).

It models the relationship between a set of independent variables (features) and a binary dependent variable (target) using a logistic function (sigmoid function). The output of the logistic function is a probability value that ranges between 0 and 1.

In this notebook, we use a logit model, in which the sigmoid function of a variable z is `f(z) = 1 / (1 + e^(-z))`.

![alt text](assets/sigmoid.png)
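As a minimal sketch of the sigmoid function described above, the following uses only the Python standard library. The function name `sigmoid` and the branch on the sign of `z` (added to avoid overflow for large negative inputs) are illustrative choices, not part of the notebook's own code.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function f(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z)
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# The output always lies between 0 and 1, with f(0) = 0.5
for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
```

Large negative inputs map close to 0 and large positive inputs map close to 1, which is what lets the output be read as a probability.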

Assuming we have an independent (exogenous) variable X, and we are interested in the probability of observing Y given X, the equation whose parameters we try to estimate is as follows:

![alt text](assets/eqn.png)

Here, we try to estimate the parameters β0 and β1. If we have more variables X2, X3, ..., Xn, we would also be estimating their coefficients β2, β3, ..., βn.
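Written out in LaTeX, and assuming the image above shows the standard single-feature logit form (the symbols follow the text; the exact layout of the image is an assumption), the model is:

$$
P(Y = 1 \mid X) = f(\beta_0 + \beta_1 X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
$$

Equivalently, the log-odds are linear in X:

$$
\ln \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X
$$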

To estimate the parameters, we use a method called Maximum Likelihood Estimation (MLE). The mathematics behind MLE will not be discussed in this notebook, but we encourage you to follow the links below. In short, MLE chooses the parameter values (here, the coefficients) that maximize a likelihood function, so that the observed data are most probable under the model.

[https://en.wikipedia.org/wiki/Maximum_likelihood_estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)

[https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1](https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1)
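To make the idea of maximizing a likelihood more concrete, here is a rough sketch on a small, made-up dataset (the values in `xs` and `ys` are hypothetical and unrelated to the bank note data). It evaluates the log-likelihood of a one-feature logit model and improves it with plain gradient ascent. Gradient ascent is chosen here only for readability; statsmodels uses its own optimizers.

```python
import math

# Hypothetical toy data: one feature x and a binary label y
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]


def log_likelihood(b0: float, b1: float) -> float:
    """Sum over observations of y*log(p) + (1-y)*log(1-p), with p = sigmoid(b0 + b1*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log(1 + e^z), computed in a numerically stable way
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - softplus
    return total


def gradient(b0: float, b1: float) -> tuple[float, float]:
    """Partial derivatives of the log-likelihood with respect to b0 and b1."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += (y - p) * x
    return g0, g1


b0, b1 = 0.0, 0.0
start_ll = log_likelihood(b0, b1)

# Repeatedly step in the direction that increases the log-likelihood
for _ in range(2000):
    g0, g1 = gradient(b0, b1)
    b0 += 0.05 * g0
    b1 += 0.05 * g1

final_ll = log_likelihood(b0, b1)
print(f"log-likelihood: {start_ll:.3f} -> {final_ll:.3f}")
print(f"estimated b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The estimated coefficients are those under which the observed labels are more probable than under the starting guess of all zeros.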

# Creating and training a Logit Model for Binary Classification

In this notebook, we use the [statsmodels](https://www.statsmodels.org/stable/index.html) library, which offers multiple powerful statistical tools, to fit a logit model to the [Bank Note Authentication dataset](https://archive.ics.uci.edu/dataset/267/banknote+authentication). Each datapoint in the dataset has 4 exogenous variables, explained below, and a single binary target variable: whether or not the note is fake.

In [1]:
import pandas as pd
from statsmodels.api import Logit
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score

We import the dataset using the pandas `read_csv()` function.

In [2]:
dataframe = pd.read_csv('datasets/BankNoteAuthentication.csv')
dataframe.head()

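|   | variance | skewness | curtosis | entropy | class |
|---|----------|----------|----------|---------|-------|
| 0 | 3.6216   | 8.6661   | -2.8073  | -0.44699 | 0    |
| 1 | 4.5459   | 8.1674   | -2.4586  | -1.4621  | 0    |
| 2 | 3.866    | -2.6383  | 1.9242   | 0.10645  | 0    |
| 3 | 3.4566   | 9.5228   | -4.0112  | -3.5944  | 0    |
| 4 | 0.32924  | -4.4552  | 4.5718   | -0.9888  | 0    |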


In [3]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   variance  1372 non-null   float64
 1   skewness  1372 non-null   float64
 2   curtosis  1372 non-null   float64
 3   entropy   1372 non-null   float64
 4   class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


In [4]:
dataframe.dtypes

variance    float64
skewness    float64
curtosis    float64
entropy     float64
class         int64
dtype: object

Upon some preliminary analysis, we see that there are 4 variables (variance, skewness, curtosis and entropy), each of which is explained below:
1. variance of Wavelet Transformed image (continuous): Variance measures the spread or dispersion of pixel values in the wavelet-transformed image. High variance indicates a wide range of pixel intensities, which often corresponds to more texture or detail in the image.

2. skewness of Wavelet Transformed image (continuous): Skewness measures the asymmetry of the pixel value distribution in the wavelet-transformed image. A positive skewness indicates a distribution with a long right tail, while a negative skewness indicates a distribution with a long left tail.

3. curtosis of Wavelet Transformed image (continuous): Kurtosis measures the "tailedness" of the pixel value distribution in the wavelet-transformed image. High kurtosis indicates a distribution with heavy tails and a sharp peak, while low kurtosis indicates a distribution with lighter tails and a flatter peak.

4. entropy of image (continuous): Entropy measures the randomness or complexity of the pixel value distribution in the image. Higher entropy indicates more complexity and less predictability in the image's pixel values.

We split the dataframe into exogenous (explanatory) and target dataframes.

In [5]:
X = dataframe[['variance', 'skewness', 'curtosis', 'entropy']]
y = dataframe[['class']]

Before fitting the model, we must split our dataset into training & testing sets. This is useful when we want to gauge how well a trained model performs on previously unseen data - that is, the testing set.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
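Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below illustrates what such a shuffled 75/25 split does to the row indices. The helper `shuffle_split` is hypothetical and uses Python's own random number generator, so it will not pick the same rows as scikit-learn does for `random_state=42`.

```python
import math
import random


def shuffle_split(n_rows: int, test_fraction: float = 0.25, seed: int = 42):
    """Return (train_indices, test_indices) after shuffling the row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 1372 rows, as reported by dataframe.info()
train_idx, test_idx = shuffle_split(1372)
print(len(train_idx), len(test_idx))
```

With 1372 rows, this gives 1029 training rows and 343 testing rows, and no row appears in both sets.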

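Next, we create and fit a Logit model to the training dataset. Note that statsmodels' `Logit` does not add an intercept term automatically, so the model below is fit without β0; to include one, add a constant column to the features with `statsmodels.api.add_constant` before fitting.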

In [7]:
logreg = Logit(y_train, X_train).fit()

Optimization terminated successfully.
         Current function value: 0.085039
         Iterations 11


In [8]:
print(logreg.summary())

                           Logit Regression Results                           
Dep. Variable:                  class   No. Observations:                 1029
Model:                          Logit   Df Residuals:                     1025
Method:                           MLE   Df Model:                            3
Date:                Fri, 28 Jun 2024   Pseudo R-squ.:                  0.8762
Time:                        18:34:39   Log-Likelihood:                -87.506
converged:                       True   LL-Null:                       -707.03
Covariance Type:            nonrobust   LLR p-value:                2.467e-268
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
variance      -2.9600      0.311     -9.513      0.000      -3.570      -2.350
skewness      -1.8234      0.235     -7.744      0.000      -2.285      -1.362
curtosis      -1.9610      0.241     -8.133      0.0

Calling `logreg.summary()` presents us with a lot of information, including the estimated coefficient of each variable. The P>|z| column reports a p-value of 0.000 (that is, below 0.0005) for each coefficient, so all 4 coefficients are significant at the 99.9% confidence level. In other words, for each variable we can reject the null hypothesis that its coefficient is zero. The very small LLR p-value (2.467e-268) likewise indicates that the model as a whole has explanatory power compared to the null model.

To learn more about hypotheses in Linear and Logistic regression models, click [this link](https://www.statology.org/null-hypothesis-of-logistic-regression/).
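The z and P>|z| columns of the summary can be approximated by hand: z is the coefficient divided by its standard error, and the two-sided p-value comes from the standard normal distribution. The sketch below does this for the `variance` row using only the standard library. The helper name `wald_z_and_p` is illustrative, and because the inputs are the rounded values printed in the summary, the result only roughly matches the reported z of -9.513.

```python
import math


def wald_z_and_p(coef: float, std_err: float) -> tuple[float, float]:
    """Return the z statistic and the two-sided p-value under a standard normal."""
    z = coef / std_err
    # P(|Z| > |z|) for Z ~ N(0, 1), expressed via the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Rounded values from the 'variance' row of the summary table
z, p_value = wald_z_and_p(coef=-2.9600, std_err=0.311)
print(f"z = {z:.3f}, p = {p_value:.3e}")
```

A p-value this small is what the summary table displays as 0.000.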

## Testing the Model

To test the model, we must form predictions on the testing set, which we do as follows:

In [9]:
y_pred_raw = logreg.predict(X_test)
y_pred_raw

430     1.073900e-06
588     3.190383e-03
296     1.984329e-03
184     1.065975e-08
244     5.865388e-08
            ...     
1121    9.503673e-01
940     3.017936e-01
1189    9.977810e-01
438     9.113253e-09
1022    3.395049e-01
Length: 343, dtype: float64

We notice that the predictions are all floats between 0 and 1, so we round them to get either 0 or 1. In effect, this applies a probability threshold of 0.5.

In [10]:
y_pred = list(map(round, y_pred_raw))
y_pred

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
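The rounding step can also be written as an explicit threshold, which makes the cut-off visible and adjustable. The sketch below is one way to do that; the helper `to_labels` is a hypothetical name, and the sample probabilities are copied from the prediction output shown earlier. One subtle difference: Python's built-in `round` rounds exact halves to the nearest even integer, so `round(0.5)` is 0, whereas the comparison `p >= 0.5` assigns 1.

```python
def to_labels(probabilities, threshold: float = 0.5) -> list[int]:
    """Map each predicted probability to class 1 if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# A few probabilities taken from the y_pred_raw output
sample = [1.073900e-06, 9.503673e-01, 3.017936e-01, 9.977810e-01, 3.395049e-01]

labels = to_labels(sample)
print(labels)

# A stricter threshold turns borderline predictions into class 0
strict_labels = to_labels(sample, threshold=0.96)
print(strict_labels)
```

Raising the threshold trades recall for precision, so the choice of 0.5 is a convention rather than a requirement.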


Let's see how well the model fared by plotting a confusion matrix. A confusion matrix is a table used to evaluate the performance of a classification algorithm. It is particularly useful for understanding the breakdown of correct and incorrect predictions.

![alt text](assets/confmat.png)

In [15]:
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat

array([[191,   0],
       [ 20, 132]])

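Scikit-learn places the true classes along the rows and the predicted classes along the columns, so, treating class 1 as the positive class, our model produced 191 true negatives, 132 true positives, no false positives and 20 false negatives.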
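For 0/1 labels, scikit-learn's `confusion_matrix` returns the layout `[[TN, FP], [FN, TP]]`: rows are the true classes and columns are the predicted classes, both in ascending label order. The sketch below rebuilds that layout by hand on a small made-up example (the function `confusion_counts` and the toy labels are illustrative), then unpacks the matrix reported above.

```python
def confusion_counts(y_true, y_pred) -> list[list[int]]:
    """Count outcomes for binary 0/1 labels, laid out as [[TN, FP], [FN, TP]]."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]


# Hypothetical toy labels, not taken from the dataset
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]
toy_matrix = confusion_counts(toy_true, toy_pred)
print(toy_matrix)

# Unpacking the matrix printed by the notebook
(tn, fp), (fn, tp) = [[191, 0], [20, 132]]
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```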

### Evaluation Metrics

#### 1. Accuracy: 
The ratio of correctly predicted instances to the total instances.

![alt text](assets/accuracy.png)

In [11]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9416909620991254

We see that the model has an accuracy of 0.9417, or 94.17%. While accuracy is a useful metric, it has limitations, particularly in the context of imbalanced datasets. In many real-world classification problems, the classes are not evenly distributed, and accuracy alone can be misleading. Therefore, we use some other metrics such as the following.
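As a quick check, accuracy can be recomputed from the confusion matrix counts. The sketch below also includes a hypothetical imbalanced case (95 negatives, 5 positives, and a model that always predicts 0) to show how accuracy can look good while the model finds no positives at all. The function name and the imbalanced numbers are made up for illustration.

```python
def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)


# Counts from the notebook's confusion matrix [[191, 0], [20, 132]]
notebook_accuracy = accuracy(tn=191, fp=0, fn=20, tp=132)
print(notebook_accuracy)

# Hypothetical imbalanced dataset where the model always predicts class 0
imbalanced_accuracy = accuracy(tn=95, fp=0, fn=5, tp=0)
imbalanced_recall = 0 / (0 + 5)  # no positive is ever found
print(imbalanced_accuracy, imbalanced_recall)
```

In the imbalanced case, 95% of predictions are correct even though every positive instance is missed.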

#### 2. Precision: 
Precision is the ratio of true positive predictions to the total number of positive predictions made by the model. It measures the accuracy of the positive predictions.

![alt text](assets/precision.png)

In [12]:
precision = precision_score(y_test, y_pred)
precision

np.float64(1.0)

The precision score of 1 means that every note the model predicted as positive (class 1) in the testing set was actually positive; that is, there were no false positives.

#### 3. Recall: 
Recall is the ratio of true positive predictions to the total number of actual positive instances in the dataset. It measures the ability of the model to identify all positive instances.

![alt text](assets/recall.png)

In [13]:
recall = recall_score(y_test, y_pred)
recall

np.float64(0.868421052631579)
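Both precision and recall can be re-derived from the confusion matrix counts, taking class 1 as the positive class (which is what scikit-learn does by default for binary labels). The helper names below are illustrative, and the guard against an empty denominator is an added assumption about how to treat that edge case.

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Counts from the notebook's confusion matrix: TP = 132, FP = 0, FN = 20
precision_value = precision(tp=132, fp=0)
recall_value = recall(tp=132, fn=20)
print(precision_value, recall_value)
```

Here the model never raises a false alarm (precision of 1) but misses 20 of the 152 actual positives, which is why recall is lower.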

#### 4. F1 score:
The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. The F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It's a good measure of a model's accuracy when the classes are imbalanced.

![alt text](assets/f1.png)

In [14]:
f1 = f1_score(y_test, y_pred)
f1

np.float64(0.9295774647887324)
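The F1 score above can be reproduced either from precision and recall or directly from the counts, as in the following sketch (function names are illustrative). It also compares the harmonic mean with a plain average on a made-up lopsided pair of scores, to show why the harmonic mean is used.

```python
def f1_from_scores(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Equivalent form written in terms of confusion matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Values from the notebook: precision = 1.0, recall = 132 / 152
f1_scores = f1_from_scores(1.0, 132 / 152)
f1_counts = f1_from_counts(tp=132, fp=0, fn=20)
print(f1_scores, f1_counts)

# Hypothetical lopsided model: perfect precision, very poor recall
lopsided_f1 = f1_from_scores(1.0, 0.1)
lopsided_average = (1.0 + 0.1) / 2
print(lopsided_f1, lopsided_average)
```

The harmonic mean stays close to the weaker of the two scores, so a model cannot reach a high F1 by excelling at only one of them.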