# Project Phase 5: Model Understanding
In the previous phase, we evaluated different machine learning models to predict employee attrition. In this phase, we will try to understand how the model arrives at its predictions, in order to gain some insight into the relationship between the selected features and our target variable.

## Import the Training and Testing Set
Before we start, let's load the training set, testing set, and names of the selected features.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Import the Dataset
X_train = pd.read_csv("dataset/preprocessed/Features_Training_Set.csv", index_col=0).to_numpy()
X_test = pd.read_csv("dataset/preprocessed/Features_Testing_Set.csv", index_col=0).to_numpy()
y_train = pd.read_csv("dataset/preprocessed/Target_Training_Set.csv", index_col=0).to_numpy().ravel()
y_test = pd.read_csv("dataset/preprocessed/Target_Testing_Set.csv", index_col=0).to_numpy().ravel()
ATTRS_SELECTED = pd.read_csv("dataset/constants/ATTRS_SELECTED.csv", index_col=0)

## Training and Testing the Model
Once the dataset is loaded, we train our logistic regression model on the training set and evaluate it on the test set. Then we print the resulting metrics.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Instantiate the LogisticRegression model
classifier = LogisticRegression(max_iter=10000)

# Train the model from the training set
classifier.fit(X_train, y_train)

# Make predictions from the test set
y_pred = classifier.predict(X_test)

# Preview the resulting metrics
pd.DataFrame({
  "Metric": ["Accuracy", "Precision", "Recall", "F1"],
  "Value": [metric(y_true=y_test, y_pred=y_pred) for metric in [accuracy_score, precision_score, recall_score, f1_score]]
})

| | Metric | Value |
|---|---|---|
| 0 | Accuracy | 0.781377 |
| 1 | Precision | 0.761905 |
| 2 | Recall | 0.8 |
| 3 | F1 | 0.780488 |


## Understanding the Logistic Regression Model
Before we proceed, let's revisit the fundamental concepts of the logistic regression model.

### Probability $p$
In a logistic regression model, $p$ is defined as the probability that a given observation with $n$ features is labelled $1$. In our case, $0$ means the employee stays and $1$ means the employee leaves. In other words, $p$ is the probability that the observed employee will leave the company.

$$ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}} $$
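
As a rough illustration of the formula above, the sketch below computes $p$ for a single observation. The function name `predict_probability` and the intercept, coefficient, and feature values are made up for illustration only; they are not taken from our trained model.

```python
import numpy as np

def predict_probability(beta_0, betas, x):
    """Return p = e^z / (1 + e^z), where z = beta_0 + sum(beta_i * x_i)."""
    z = beta_0 + np.dot(betas, x)
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical intercept, coefficients, and feature values (illustration only)
beta_0 = -1.0
betas = np.array([2.0, -0.5])
x = np.array([1.0, 2.0])

p = predict_probability(beta_0, betas, x)
print(p)  # z = -1 + 2 - 1 = 0, so p = 0.5
```

When the linear combination is $0$, the model is undecided ($p = 0.5$); larger values push $p$ towards $1$ and smaller values push it towards $0$.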

### Odds $O$
Now, let's define another quantity called the *odds*: the ratio of the probability that an event happens to the probability that it does not happen.
$$ O = \frac{p}{1-p} $$

### Log-odds
The natural logarithm of the odds.

$$ \ln O = \ln \frac{p}{1-p} $$

### Logit Function
The logit function is shorthand for the log-odds:
$$ \operatorname{logit}(p) = \ln O = \ln \frac{p}{1-p} $$
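
To make these three quantities concrete, here is a minimal sketch that converts a probability into odds and log-odds, and back again. The helper names `odds`, `logit`, and `inverse_logit` and the value $p = 0.75$ are chosen purely for illustration.

```python
import numpy as np

def odds(p):
    """Odds of the event: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds of the event: ln(p / (1 - p))."""
    return np.log(odds(p))

def inverse_logit(z):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = 0.75
print(odds(p))                  # 3.0: leaving is three times as likely as staying
print(logit(p))                 # ln(3), approximately 1.0986
print(inverse_logit(logit(p)))  # recovers approximately 0.75
```

Note that a probability of $0.5$ corresponds to odds of $1$ and log-odds of $0$, so positive log-odds mean the event is more likely than not.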

### Logistic Regression Model
Now, we will define the logistic regression model using the above concepts.

$$ \operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$

> A unit increase in $x_i$, with the other features held fixed, changes $\operatorname{logit}(p)$ by $\beta_i$.

> If the coefficient $\beta_i$ is positive, there is a positive relationship between $x_i$ and $p$.

> If the coefficient $\beta_i$ is negative, there is a negative relationship between $x_i$ and $p$.
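
The link between the coefficients and the log-odds can be checked numerically. The sketch below fits scikit-learn's `LogisticRegression` on a small synthetic dataset (generated here only for illustration, not the attrition data) and verifies two things: that the log-odds recovered from `predict_proba` match the intercept plus the weighted sum of the features, and that raising one feature by one unit shifts the log-odds by that feature's coefficient. The names `X_demo`, `y_demo`, and `demo_model` are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only (not the attrition dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# logit(p) recovered from the predicted probability of class 1
p_demo = demo_model.predict_proba(X_demo)[:, 1]
logit_from_p = np.log(p_demo / (1 - p_demo))

# logit(p) computed directly from the fitted intercept and coefficients
logit_from_betas = demo_model.intercept_[0] + X_demo @ demo_model.coef_[0]
print(np.allclose(logit_from_p, logit_from_betas))

# Raising feature 0 by one unit shifts logit(p) by its coefficient
# (up to floating-point error)
x_before = X_demo[:1].copy()
x_after = x_before.copy()
x_after[0, 0] += 1.0
shift = demo_model.decision_function(x_after) - demo_model.decision_function(x_before)
print(shift[0], demo_model.coef_[0, 0])
```

For a binary problem, `decision_function` returns the log-odds directly, which is why the difference between the two calls isolates the coefficient of the feature that was changed.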

In [3]:
pd.DataFrame({
  "Attribute": ATTRS_SELECTED.to_numpy()[:,0].tolist(),
  "Coefficient": classifier.coef_.T[:,0].tolist(),
}).sort_values(by="Coefficient", ascending=False)

| | Attribute | Coefficient |
|---|---|---|
| 23 | OverTime_Yes | 1.941574 |
| 10 | YearsAtCompany | 0.784192 |
| 22 | MaritalStatus_Single | 0.724073 |
| 12 | YearsSinceLastPromotion | 0.413564 |
| 6 | NumCompaniesWorked | 0.384305 |
| 2 | DistanceFromHome | 0.245063 |
| 5 | MonthlyRate | 0.068491 |
| 3 | HourlyRate | 0.067564 |
| 14 | Education | 0.03187 |
| 17 | JobLevel | -0.092323 |


## Interpreting the Coefficients
Now, guided by the fundamental concepts above, let's interpret the coefficients!

### Top Attributes That Are Directly Related to Employee Attrition
- `OverTime_Yes` - If an employee works overtime, as indicated by this feature, the log-odds that the employee will leave the company increase by 1.94.

- `YearsAtCompany` - For every unit increase in the employee's years of tenure at the company, the log-odds that the employee will leave increase by 0.78.

- `MaritalStatus_Single` - If an employee is single, the log-odds that the employee will leave the company increase by 0.72.

- `YearsSinceLastPromotion` - For every unit increase in the number of years since the employee's last promotion, the log-odds that the employee will leave increase by 0.41.

- `NumCompaniesWorked` - For each additional company that the employee has previously worked for, the log-odds that the employee will leave increase by 0.38.

- `DistanceFromHome` - For each additional mile between the employee's home and the workplace, the log-odds that the employee will leave increase by 0.25.
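
Changes in log-odds can be hard to picture. One common way to make them more tangible is to exponentiate each coefficient, which gives the factor by which the odds of leaving are multiplied per unit increase in that feature. The sketch below does this for the positive coefficients listed in the table above; the dictionary is typed in by hand here so that the example runs on its own.

```python
import numpy as np

# Coefficients copied from the table above (log-odds scale)
coefficients = {
    "OverTime_Yes": 1.941574,
    "YearsAtCompany": 0.784192,
    "MaritalStatus_Single": 0.724073,
    "YearsSinceLastPromotion": 0.413564,
    "NumCompaniesWorked": 0.384305,
    "DistanceFromHome": 0.245063,
}

# exp(beta) is the factor by which the odds are multiplied per unit increase
odds_multipliers = {name: float(np.exp(beta)) for name, beta in coefficients.items()}

for name, factor in odds_multipliers.items():
    print(f"{name}: odds multiplied by {factor:.2f}")
```

For instance, the `OverTime_Yes` coefficient of about 1.94 corresponds to the odds of leaving being multiplied by roughly 7, with all other features held fixed.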

### Top Attributes That Are Inversely Related to Employee Attrition
- Following the same reasoning as above, a unit increase in `TotalWorkingYears`, `YearsWithCurrManager`, or `YearsInCurrentRole` decreases the log-odds that the employee will leave the company by 0.67, 0.66, and 0.34, respectively.

- For each unit increase in the rating an employee gives for `JobInvolvement`, `EnvironmentSatisfaction`, `JobSatisfaction`, `RelationshipSatisfaction`, or `WorkLifeBalance`, the log-odds that the employee will leave the company decrease by 0.56, 0.39, 0.33, 0.22, and 0.19, respectively.

- For each unit increase in `MonthlyIncome` and `PercentSalaryHike`, the log-odds that the employee will leave the company decrease by 0.29 and 0.22, respectively.

- For each unit increase in `TrainingTimesLastYear`, the log-odds that the employee will leave the company decrease by 0.17.