<b><font size="6">Logistic Regression</font><a class="anchor"></a><a id='toc'></a></b><br>

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, logistic regression transforms its output with the logistic sigmoid function to return a probability, which can then be mapped to two or more discrete classes. The logistic function maps any real value into the interval (0, 1). The model is represented by:

$$
\hat{y} = \sigma(w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n)
$$

where $\sigma$ is the logistic sigmoid function given by:

$$
\sigma(t) = \frac{1}{1 + e^{-t}}
$$

In binary classification problems, the logistic regression model calculates the probability of an event occurring. If the calculated probability is greater than 0.5, the observation is assigned to class 1; otherwise it is assigned to class 0. In this way, logistic regression can be understood as modeling the probability of a certain event as a function of the independent variables.
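As a minimal sketch of the two formulas above, the snippet below computes the sigmoid of a linear combination and applies the 0.5 cut-off. The weights and the observation are hypothetical values chosen only for illustration; they are not taken from the dataset used in this notebook.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_label(weights, intercept, x, threshold=0.5):
    """Return (probability, label) for a single observation x."""
    t = intercept + sum(w * xi for w, xi in zip(weights, x))
    p = sigmoid(t)
    return p, int(p > threshold)

# Hypothetical intercept w_0 and weights w_1, w_2 (illustration only)
w0, w = -1.0, [0.8, -0.5]
p, label = predict_label(w, w0, [2.0, 1.0])
print(round(p, 3), label)
```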

__`Step 1`__ - Import the data and pandas

In [1]:
import pandas as pd
tugas = pd.read_csv('datasets/final_tugas.csv')
tugas

Unnamed: 0,Custid,Year_Birth,Dependents,Income,Rcn,Frq,Mnt,Clothes,Kitchen,SmallAppliances,...,DepVar,Gender_M,Education_Basic,Education_Graduation,Education_Master,Education_PhD,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Together,Marital_Status_Widow
0,1003,1991,1,29761.20,69,11,45.76,32,19,24,...,0,1,0,1,0,0,0,1,0,0
1,1004,1956,1,98249.55,10,26,923.52,60,10,19,...,0,1,0,0,1,0,0,1,0,0
2,1006,1983,1,23505.30,65,14,58.24,47,2,48,...,0,0,0,0,0,1,0,0,1,0
3,1007,1970,1,72959.25,73,18,358.80,71,7,13,...,0,0,0,1,0,0,0,0,0,0
4,1009,1941,0,114973.95,75,30,1457.04,38,9,35,...,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,10989,1996,1,29551.20,41,10,47.84,11,40,24,...,0,0,1,0,0,0,0,0,0,0
2496,10991,1940,0,132566.70,36,46,2320.24,32,4,47,...,0,0,0,1,0,0,0,1,0,0
2497,10993,1955,0,91768.95,1,25,870.48,56,8,27,...,0,0,0,1,0,0,0,0,1,0
2498,10994,1961,1,99085.35,1,28,931.84,68,5,21,...,0,0,1,0,0,0,0,1,0,0


Since we are in a classification scenario, it is important that we consider the proportions of the target variable.

In [2]:
tugas.DepVar.value_counts()

0    2325
1     175
Name: DepVar, dtype: int64
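To make the imbalance explicit, the counts above can be turned into proportions. The sketch below hard-codes the two counts from the output so that it runs without the CSV file; on the real DataFrame, `tugas.DepVar.value_counts(normalize=True)` should give the same shares.

```python
# Class counts copied from the value_counts() output above
counts = {0: 2325, 1: 175}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
```

Only about 7% of the observations belong to class 1, which is worth keeping in mind when the evaluation metrics are interpreted later on.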

__`Step 2`__ - Data partition
- Assign all the variables excluding the DepVar to the object `data`
- Assign the dependent variable to the object `target`
- Import the needed library to make the partition of the dataset
- Split the data and the target into `X_train`, `X_test`, `y_train`, `y_test`, with `test_size` equal to 0.2, `random_state` equal to 5 and `stratify` equal to `target`

In [3]:
data = tugas.drop(['DepVar'], axis=1)
target = tugas['DepVar']

In [4]:
#make the split here
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=5, stratify=target)
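The effect of `stratify` can be checked on a stand-in target. The sketch below does not load the real dataset: it builds a synthetic label vector with the same class counts as `DepVar` and a placeholder feature column, then confirms that the train and test sets keep the same share of class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as DepVar (2325 zeros, 175 ones)
y = np.array([0] * 2325 + [1] * 175)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not the real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)

# With stratification both shares should match the overall share of class 1
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```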

__`Step 3`__ - Import the model and create an instance

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression(fit_intercept=True,...)</a>

__Definition:__ <br>
Applies Logistic Regression classifier.

__Parameters:__ <br>
*fit_intercept*: whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations; <br>
...
</div>

In [6]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

__`Step 4`__ - Fit the model to the train data

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression().fit(X,y,...)</a>

__Definition:__ <br>
Fit the logistic model to the training data.

__Parameters:__ <br>
X : The features (regressors) of the training dataset; <br>
y : The target of the training dataset; <br>
...
</div>

In [7]:
# fit the model on the training partition only
log_model.fit(X_train, y_train)

__`Step 5`__ - Use the model to predict the labels of the test data. Assign them to **y_pred**.

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression().predict(X)</a>

__Definition:__ <br>
Predict class labels for samples in X.

__Parameters:__ <br>
X : Samples to predict; <br>
...

</div>

In [8]:
y_pred = log_model.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

***Note:*** You can get the predicted probabilities for each sample, instead of the assigned class, using the method `predict_proba()`. Each row holds the probability of class 0 followed by the probability of class 1.

In [10]:
pred_prob = log_model.predict_proba(X_test)
pred_prob

array([[9.95583919e-01, 4.41608077e-03],
       [9.96770454e-01, 3.22954560e-03],
       [9.83968216e-01, 1.60317840e-02],
       [3.46966909e-01, 6.53033091e-01],
       [9.94684619e-01, 5.31538108e-03],
       [9.88694276e-01, 1.13057239e-02],
       [9.99174415e-01, 8.25585488e-04],
       [9.92827846e-01, 7.17215382e-03],
       [6.78444605e-01, 3.21555395e-01],
       [9.85968695e-01, 1.40313047e-02],
       [9.28410324e-01, 7.15896760e-02],
       [9.98952923e-01, 1.04707722e-03],
       [9.87820356e-01, 1.21796443e-02],
       [9.97947370e-01, 2.05262958e-03],
       [9.98966921e-01, 1.03307936e-03],
       [9.97300447e-01, 2.69955272e-03],
       [9.90187322e-01, 9.81267804e-03],
       [9.94739371e-01, 5.26062907e-03],
       [8.80736125e-01, 1.19263875e-01],
       [9.61548210e-01, 3.84517902e-02],
       [9.98086372e-01, 1.91362777e-03],
       [9.92777318e-01, 7.22268242e-03],
       [2.14466014e-01, 7.85533986e-01],
       [9.83173532e-01, 1.68264683e-02],
       [4.712858
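The link between these probabilities and the labels returned by `predict()` can be sketched by applying the 0.5 cut-off by hand. The rows below are the first five rows of the output above, typed in as a plain list; `to_labels` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# First five rows of the predict_proba() output above: [P(class 0), P(class 1)]
pred_prob_head = [
    [9.95583919e-01, 4.41608077e-03],
    [9.96770454e-01, 3.22954560e-03],
    [9.83968216e-01, 1.60317840e-02],
    [3.46966909e-01, 6.53033091e-01],
    [9.94684619e-01, 5.31538108e-03],
]

def to_labels(prob_rows, threshold=0.5):
    """Assign class 1 when P(class 1) exceeds the threshold, else class 0."""
    return [int(p1 > threshold) for _, p1 in prob_rows]

labels = to_labels(pred_prob_head)
print(labels)  # [0, 0, 0, 1, 0]
```

The result matches the first five entries of `y_pred`. Passing a different `threshold` is one way to trade precision against recall, although the notebook itself keeps the default cut-off.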

***Note:*** As with linear regression, you can retrieve the fitted coefficients (`coef_`) and the intercept (`intercept_`).

In [11]:
log_model.coef_

array([[ 2.80001806e-05, -3.85905571e-03,  4.26737090e-05,
         1.55298947e-05,  6.00127787e-04,  1.19019723e-03,
         2.43909462e-03,  1.72590586e-02, -2.65835014e-03,
        -9.22565389e-03, -3.16836628e-03, -2.52209886e-03,
         6.01005789e-03, -6.34963795e-03,  8.41784856e-05,
        -2.33297227e-05,  3.31711416e-06, -1.64116441e-05,
         3.76034926e-05, -5.67068085e-05,  5.31851429e-05,
        -5.23352808e-05, -1.06689509e-05, -3.99910604e-06]])
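To connect the coefficients back to the model equation, the sketch below refits a logistic regression on a small synthetic dataset (generated with `make_classification`, so the numbers are unrelated to the output above) and rebuilds the class 1 probabilities from `coef_` and `intercept_` by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only to illustrate the attributes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Linear part w_0 + w_1 x_1 + ... + w_n x_n for every sample
z = X @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid of the linear part gives P(class 1)
manual = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual, model.predict_proba(X)[:, 1]))
```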

__`Step 6`__ - Evaluate the model

***Note:*** Since we are predicting a categorical target (classification), we evaluate the model with different metrics from those used for a regression problem. Also, for logistic regression the R-squared cannot be obtained in the same way as in the linear case.

### The confusion matrix

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix'>sklearn.metrics.confusion_matrix(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute confusion matrix to evaluate the accuracy of a classification.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [12]:
from sklearn.metrics import confusion_matrix

In [13]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[455,  10],
       [ 29,   6]])

The confusion matrix in sklearn is presented in the following format (rows are the true classes, columns are the predicted classes): <br>
[ [ TN  FP  ] <br>
    [ FN  TP ] ]
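The four cells can be unpacked into named counts, which makes the metrics in the following sections easier to follow. The sketch below copies the matrix from the output above as a nested list; with the actual NumPy array, `confusion_matrix(y_test, y_pred).ravel()` returns the same four values in the same order.

```python
# Confusion matrix values copied from the output above
cm = [[455, 10],
      [29, 6]]

# Row 0 holds the true negatives and false positives,
# row 1 holds the false negatives and true positives
(tn, fp), (fn, tp) = cm

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("actual positives in the test set:", fn + tp)
```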

### The accuracy score
<img src="img/accuracy.png" alt="Drawing" style="width: 300px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score'>sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,...)</a>

__Definition:__ <br>
Accuracy classification score.

__Interpretation:__ <br>
If normalize is True, then the best performance is 1. When normalize = False, then the best performance is the number of samples.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
_normalize_: If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples. <br>
...
</div>

In [14]:
from sklearn.metrics import accuracy_score

In [15]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.922
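As a quick check, the accuracy can be recomputed from the confusion matrix counts reported earlier. The sketch also computes a baseline that is not part of the original notebook: the accuracy of always predicting the majority class 0.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 455, 10, 29, 6
n = tn + fp + fn + tp

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / n

# Baseline (an addition for comparison): always predict class 0
baseline = (tn + fp) / n

print(accuracy, baseline)
```

On this imbalanced test set the constant baseline reaches 0.93, slightly above the model's 0.922, which suggests accuracy alone is not a reliable guide here and motivates the metrics that follow.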

### The precision
<img src="img/precision.png" alt="Drawing" style="width: 200px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score'>sklearn.metrics.precision_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the precision.

__Interpretation:__ <br>
The best value is 1, and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [16]:
from sklearn.metrics import precision_score

In [17]:
precision = precision_score(y_test, y_pred)
precision

0.375
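The same value follows from the confusion matrix counts: precision is the share of predicted positives that are actually positive. A minimal sketch, using the counts reported earlier:

```python
# Counts taken from the confusion matrix above
fp, tp = 10, 6

# Precision: of the 16 samples predicted as class 1, how many are truly class 1
precision = tp / (tp + fp)
print(precision)  # 0.375
```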

### The recall
<img src="img/recall.png" alt="Drawing" style="width: 180px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score'>sklearn.metrics.recall_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the recall.

__Interpretation:__ <br>
The best value is 1 and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [19]:
from sklearn.metrics import recall_score

In [20]:
recall = recall_score(y_test, y_pred)
recall

0.17142857142857143
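Recall can be recomputed in the same way: it is the share of actual positives that the model finds. A minimal sketch, using the counts reported in the confusion matrix:

```python
# Counts taken from the confusion matrix above
fn, tp = 29, 6

# Recall: of the 35 samples that truly belong to class 1, how many were detected
recall = tp / (tp + fn)
print(round(recall, 4))
```

Only 6 of the 35 positive cases in the test set are detected, so the model misses most of the minority class at the default 0.5 cut-off.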

### The F1 Score
<img src="img/f1.png" alt="Drawing" style="width: 270px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score'>sklearn.metrics.f1_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the F1 score, also known as balanced F-score or F-measure.

__Interpretation:__ <br>
F1 score reaches its best value at 1 and worst score at 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [21]:
from sklearn.metrics import f1_score

In [22]:
f1 = f1_score(y_test, y_pred)
f1

0.23529411764705876
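The F1 score is the harmonic mean of precision and recall, so it can be rebuilt from the two values obtained above or directly from the confusion matrix counts. The sketch below shows both routes, which should agree up to floating-point rounding.

```python
# Counts taken from the confusion matrix above
fp, fn, tp = 10, 29, 6

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_from_pr = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1_from_pr, 4), round(f1_from_counts, 4))
```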