### Gaussian Distribution

A Gaussian (or normal) distribution is a statistical distribution in which the values are spread symmetrically around the centre. Most of the values in the data lie close to the average value and become progressively rarer further away from it; roughly 68% of the values fall within one standard deviation of the mean. A normal distribution is plotted using the count of observations for each value, rather than the raw values themselves.
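As a rough illustration of how values cluster around the mean in a normal distribution, the sketch below draws hypothetical values (mean 50, standard deviation 10, both chosen only for illustration) using the standard library and measures the share that falls within one standard deviation of the mean. It is not based on the loans dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# draw hypothetical values from a normal distribution (mean 50, std 10)
values = [random.gauss(50, 10) for _ in range(10_000)]

mean = statistics.fmean(values)
std = statistics.pstdev(values)

# share of values lying within one standard deviation of the mean
within_one_std = sum(abs(v - mean) <= std for v in values) / len(values)
print(round(mean, 1), round(std, 1), round(within_one_std, 2))
```

The printed share should come out close to 0.68 for data that is approximately normal.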

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [None]:
loans = pd.read_csv(r"https://raw.githubusercontent.com/puneettrainer/datasets/main/bankloan.csv")
loans.head()

In [None]:
# creating distribution of Experience
plt.bar(loans['Experience'].value_counts().index, loans['Experience'].value_counts())
plt.show()

In [None]:
loans['Experience'].value_counts().std()

In [None]:
# creating distribution of Income
plt.bar(loans['Income'].value_counts().index, loans['Income'].value_counts())
plt.show()

In [None]:
loans['Income'].value_counts().std()

### Logistic Regression

In `linear regression`, we attempt to predict the value of the `target` variable based on the values of the input variable by fitting a line such that all the data points are as close to this line as possible. To make a prediction, we locate the input value on the `x-axis`, move up to the fitted line, and read off the corresponding value on the `y-axis`.

<img style='margin: auto; width: 400px;' src='https://raw.githubusercontent.com/puneettrainer/pics/main/linreg.png'>
<h4 style='text-align: center;'>Linear Regression - $y = m * x + c$</h4>

In the case of `logistic regression`, we attempt to predict the `probability` of something being true; the final prediction is then either `True` or `False`. This restricts the `y-axis` for logistic regression to the range 0 to 1. It also makes it difficult to fit a line such that it is as close to all the data points as possible, because the observed outcomes are only ever 0 or 1. To overcome this, the relation is created (the line is fit) by choosing the line which has the maximum likelihood.

<img style='margin: auto; width: 450px;' src='https://raw.githubusercontent.com/puneettrainer/pics/main/logreg-prob.png'>
<h4 style='text-align: center;'>Logistic Regression using Maximum Likelihood</h4>

In terms of finding the best fitting line, logistic regression is similar to linear regression, but with one small change. We start off by converting the `y-axis` from `probability` to the $ln$ (`natural log`) of the `odds`, where the odds are the ratio of the probability of something being true to the probability of it being false.<br>
$odds =$ $\frac{p}{1-p}$

$ln(odds) =$ $ln(\frac{p}{1-p})$

This function is called the `logit function`. When we transform probability using $ln$, we get a new `y-axis` with the range as follows:

Upper range in old `y-axis` = 1<br>
Upper range in new `y-axis`<br>
&emsp; = $ln(\frac{1}{1 - 1})$<br>
&emsp; = $ln(\frac{1}{0})$<br>
&emsp; = $ln(1) - ln(0)$<br>
&emsp; = 0 - ($- \infty$)<br>
&emsp; = $+\infty$

Lower range in old `y-axis` = 0<br>
Lower range in new `y-axis`<br>
&emsp; = $ln(\frac{0}{1 - 0})$<br>
&emsp; = $ln({0})$<br>
&emsp; = $- \infty$<br>
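The range worked out above can be checked numerically. The sketch below defines a small helper (the name `logit` and the sample probabilities are chosen here for illustration) that takes the natural log of $\frac{p}{1-p}$ and evaluates it for probabilities approaching 0 and 1.

```python
import math

def logit(p):
    """Natural log of p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

# probabilities close to 0 give large negative values,
# probabilities close to 1 give large positive values
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(p, round(logit(p), 3))
```

A probability of 0.5 maps to 0, and the output grows without bound in either direction as `p` moves towards 0 or 1.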

In this new graph, we fit a straight line. Of all the candidate lines, we choose the one whose corresponding curve in the original graph has the maximum likelihood.

<img style='margin: auto; width: 450px;' src='https://raw.githubusercontent.com/puneettrainer/pics/main/logreg-log.png'>

In this graph, we are viewing the log of the odds (the log-odds). Since the observed points, which have a probability of 0 or 1, sit at $\pm\infty$ on this axis, the distance of points from a line (to figure out which line is as close to all points as possible) cannot be calculated. Instead, for each value on the `x-axis` we trace the corresponding point on the line. These values (on this line) are in the form of log-odds. To plot them on the original graph (with probability on the `y-axis`), we convert them back to probabilities.

From the line in the new chart, we have values of $ln(odds)$. We want to find the value of `p`.<br>
$ln(odds) =$ $ln(\frac{p}{1-p})$<br>
Exponentiating both sides, we get:<br>
$e^{ln(odds)} = \frac{p}{1-p}$<br>
$\implies p = e^{ln(odds)} \times (1-p)$<br>
$\implies p = e^{ln(odds)} - (e^{ln(odds)} \times p)$<br>
$\implies p + (e^{ln(odds)} \times p) = e^{ln(odds)}$<br>
$\implies p \times (1 + e^{ln(odds)}) = e^{ln(odds)}$<br>
$\implies p = \frac{e^{ln(odds)}}{(1 + e^{ln(odds)})}$<br>

This function is called the `sigmoid` function. By plugging in the values of $ln(odds)$ read from the line, we can now get the predicted probabilities and plot them on the original chart (with probability as `y-axis`). For each observation, the `likelihood` is the predicted probability of the outcome that was actually observed: $p$ when the observation is `True` and $1 - p$ when it is `False`. The algorithm calculates the product of these values and selects the line which has the highest product. Since the way `logistic regression` predicts values is still built on trying to fit a line (even if here we are trying to get the maximum likelihood), it is named a `regression` model, although its predictions are used for `classification`. Also, since `logistic regression` predicts the probability of something being `True` (or `False`), it is applicable when we want to perform `binary classification`.
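The conversion back to probabilities and the product of likelihoods can be sketched in a few lines. The data points and the two candidate lines below are hypothetical, and the helper names `sigmoid` and `likelihood` are chosen here for illustration; the inputs to `sigmoid` are values of $ln(\frac{p}{1-p})$ read off a line. A real implementation would search over many candidate lines and would typically work with the sum of log-likelihoods rather than the raw product.

```python
import math

def sigmoid(log_odds):
    """Convert a value of ln(p / (1 - p)) back to the probability p."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

def likelihood(slope, intercept, xs, ys):
    """Product of the probabilities a candidate line assigns to the observed outcomes."""
    result = 1.0
    for x, y in zip(xs, ys):
        p = sigmoid(slope * x + intercept)
        # use p for an observed True (1) and 1 - p for an observed False (0)
        result *= p if y == 1 else 1 - p
    return result

# hypothetical data: larger x values are more often True
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

# two hypothetical candidate lines
sloped_line = likelihood(1.0, -3.5, xs, ys)
flat_line = likelihood(0.0, 0.0, xs, ys)
print(sloped_line > flat_line)
```

The sloped line assigns higher probabilities to the outcomes that were actually observed, so its product is larger than that of the flat line, which predicts 0.5 for every point.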

### Creating a model to predict whether loan would be approved or not

### Choosing columns as `input features` of the model

Before choosing fields as `input features` of the model, we figure out the correlation of fields to the `target`.

We also avoid choosing fields which are highly correlated to each other. When two inputs move together, `linear models` (linear regression, logistic regression) cannot separate the individual effect of each feature on the target, which can result in models that are inaccurate.

Input features should also be `diverse`. By having a wider set of values, the model can learn more patterns. For example, if we have more records where applicants have 10 family members compared to records where applicants don't have as many members, the model may learn that applicants generally have 10 family members. For this reason, input features chosen should ideally have a `normal distribution`.

Input features are also chosen as per their relevance, assessed by domain knowledge.

In [None]:
# correlation of all fields with the target field
loans.corr().loc[:, 'Personal.Loan']

In [None]:
# correlation of fields ("highly" correlated with the target), amongst themselves
# highly here simply means correlation is greater than 0.1
selected_features = ['Income', 'CCAvg', 'Education', 'Mortgage', 'CD.Account']
loans.corr().loc[selected_features, selected_features + ['Personal.Loan']].sort_values(by=['Personal.Loan'], ascending=False)

From the above, `CCAvg` and `Income` are strongly correlated, so we choose either one of them. Seeing that `Income` is more correlated to the target than `CCAvg`, we will choose `Income`.

In [None]:
loans.describe().loc['std', ['Income', 'CD.Account', 'Personal.Loan']]

Since there is no significantly large `standard deviation` value, we can continue with these fields as `input features`.

In [None]:
# the target field is left out of the inputs so that the model cannot simply read the answer
input_fields = ['Income', 'CD.Account']
target_field = 'Personal.Loan'

In [None]:
# splitting the dataset into training and testing datasets

from sklearn.model_selection import train_test_split

training_data, test_data = train_test_split(loans
                                           ,test_size=0.3
                                           ,random_state=10)

In [None]:
from sklearn.linear_model import LogisticRegression

# instantiating the model
model = LogisticRegression()

# training the model
model.fit(training_data[input_fields], training_data[target_field])

In [None]:
predictions = model.predict(test_data[input_fields])
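`predict` returns class labels, but the fitted model also exposes the underlying probabilities through `predict_proba`. The sketch below uses a small hypothetical dataset (not the loans data) and a separate model named `sketch_model` to show that, for a binary problem, the predicted label is positive when the positive-class probability exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical single-feature data, not the loans dataset
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

sketch_model = LogisticRegression().fit(X, y)

# column 1 holds the predicted probability of the positive class
probabilities = sketch_model.predict_proba(X)[:, 1]

# predict() labels a row positive when that probability exceeds 0.5
labels_from_probabilities = (probabilities > 0.5).astype(int)
print(np.array_equal(labels_from_probabilities, sketch_model.predict(X)))
```

Working with the probabilities directly makes it possible to apply a different cut-off when the cost of a false positive and a false negative are not the same.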

### Evaluating a Logistic Regression model

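Since the predictions of a logistic regression model are class labels (`True` or `False`) rather than continuous values, we cannot use `MAE`, `MSE` or `RMSE` to evaluate the performance of our model. Instead, each prediction falls into one of four outcomes: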

| Result | Actual Status | Model Status |
| --- | --- | --- |
| True Positive | Approved | Approved |
| False Positive | Not Approved | Approved |
| True Negative | Not Approved | Not Approved |
| False Negative | Approved | Not Approved |

Some commonly used evaluation metrics for logistic regression are:
- `Accuracy`: simply the ratio of correct predictions to the total number of predictions.

$Accuracy = \frac{True \: Positive \: + \: True \: Negative}{True \: Positive \: + \: False \: Positive \: + \: True \: Negative \: + \: False \: Negative} = \frac{correct\:predictions}{total\:predictions}$

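This is a straightforward metric which indicates the proportion of predictions the model got right. It is not very indicative of the correctness of the model when the dataset is `imbalanced`, that is, when there are many more observations of one class than another. For example, if 95% of applications are not approved, a model that always predicts `Not Approved` scores 95% accuracy without identifying a single approval.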

- `Precision`: ratio of `True Positive` to the total number of `True` predictions.

$Precision = \frac{True\:Positive}{True\:Positive\:+\:False\:Positive}$

Since this metric focuses only on positive predictions, it does not evaluate the model's performance in case of negative outcomes. Due to this reason, it is not sufficient on its own for models where the dataset is imbalanced with more negative outcomes. In particular, it ignores `False Negative`s: a model with a high `Precision` score may still, for example, fail to flag many applicants who are in fact defaulters.

- `Sensitivity` (also called `Recall`): ratio of `True Positive` to the sum of `True Positive` and `False Negative`.

$Sensitivity = \frac{True\:Positive}{True\:Positive\:+\:False\:Negative}$

This metric is suitable in case of imbalanced datasets where there are more `False` instances than `True`. However, as it only considers the records which are actually `True`, it says nothing about `False Positive`s: a model that predicts `True` for every record has a perfect sensitivity score.

- `F1 Score`: this evaluation metric is used in case of imbalanced datasets and takes into account both the `False Positive`s and the `False Negative`s. Due to this, it is a more comprehensive way of evaluating a model. However, it gives equal weightage to both `Precision` and `Sensitivity` (written as `Recall` in the formula); this means that even if the model has a high precision score, the F1 score may be low due to a low sensitivity score (or vice-versa).<br>

$F1\:Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
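The formulas above can be worked through by hand. The counts below are hypothetical, picked to resemble an imbalanced test set with far more negative than positive records.

```python
# hypothetical confusion counts for an imbalanced test set
tp, fp, tn, fn = 30, 10, 940, 20

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)  # also called recall
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(accuracy, 3), round(precision, 3), round(sensitivity, 3), round(f1, 3))
# prints: 0.97 0.75 0.6 0.667
```

Accuracy looks strong at 0.97 because the negatives dominate, while the sensitivity of 0.6 shows that 20 of the 50 actual positives were missed.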

### Creating a `Confusion Matrix`

A `Confusion Matrix` is a convenient way to view all the possible outcomes of a classification model. It lays out the counts of `True Positive`s, `False Positive`s, `True Negative`s and `False Negative`s of the model.

In the `sklearn` library, it can be accessed in the `metrics` sub-module.

### `confusion_matrix(actual, prediction)`

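Structure of the output (rows are actual values, columns are predictions, and both follow the sorted class labels, so `False`/0 comes first):

|  | Predicted False | Predicted True |
| --- | --- | --- |
| <b>Actual False</b> | True Negative | False Positive |
| <b>Actual True</b> | False Negative | True Positive |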

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_data[target_field], predictions)
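To read the four counts out of the matrix, the array can be flattened. The sketch below uses short hypothetical label lists rather than the test data; for 0/1 labels the flattened order is True Negative, False Positive, False Negative, True Positive.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels: 1 = approved, 0 = not approved
actual    = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

matrix = confusion_matrix(actual, predicted)

# rows are actual classes, columns are predicted classes, both sorted as 0 then 1
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
# prints: 4 1 1 2
```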

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(test_data[target_field], predictions)

In [None]:
from sklearn.metrics import precision_score

precision_score(test_data[target_field], predictions)

In [None]:
from sklearn.metrics import recall_score

recall_score(test_data[target_field], predictions)

In [None]:
from sklearn.metrics import f1_score

f1_score(test_data[target_field], predictions)

### Saving model for later use

In [None]:
import joblib as jb

# save model
loan_approval = {'inputs':input_fields
                ,'target':target_field
                ,'model': model}
jb.dump(loan_approval, 'loan_approval.joblib')
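A saved file can be read back with `joblib.load`, which returns the same dictionary that was dumped. The sketch below is self-contained: it trains a tiny stand-in model on hypothetical data and writes it to a temporary location (the names `sketch_model`, `bundle` and `sketch_model.joblib` are illustrative), then reloads it and predicts with the restored model.

```python
import os
import tempfile

import joblib as jb
from sklearn.linear_model import LogisticRegression

# hypothetical stand-in for the trained model and its metadata
sketch_model = LogisticRegression().fit([[1], [2], [3], [4]], [0, 0, 1, 1])
bundle = {'inputs': ['feature'], 'target': 'label', 'model': sketch_model}

# write to a temporary directory so the sketch leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), 'sketch_model.joblib')
jb.dump(bundle, path)

# load the dictionary back and reuse the stored model
restored = jb.load(path)
print(restored['inputs'], restored['model'].predict([[1], [4]]))
```

Storing the input and target field names alongside the model means the code that loads it later does not need to hard-code which columns to pass in. Files saved this way should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.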