### MACHINE LEARNING FOR ANALYSIS AND PREDICTION OF ATTRITION

**The objective of this mini project is to enable practice in data analysis and prediction by classification and
clustering algorithms.**

#### Problem Statement

Attrition is the rate at which employees leave their jobs. When attrition reaches high levels, it becomes a
concern for the company. It is therefore important to find out why employees leave and which factors contribute
to such a significant decision.

### Environment

#### 1. Data wrangling and exploration

- Load and explore the data, clean it, and analyse it by statistics
- Select the most relevant features of an employee for machine learning operations on prediction of
the attrition

In [None]:
# data structure
import pandas as pd

# visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn for machine learning methods
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist


# for numeric calculations
import numpy as np

In [None]:
df = pd.read_excel('WA_Fn-UseC_-HR-Employee-Attrition.xlsx')

In [None]:
# view df size
df.shape

In [None]:
# view df
df

In [None]:
# get an overview of missing values and data types
df.info()

In [None]:
# view number of unique values in each object-dtype column
df.select_dtypes(include='object').nunique()

In [None]:
# map each non-numeric value to numeric values
column_mappings = {
    'Attrition': {'No': 0, 'Yes': 1},
    'BusinessTravel': {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2},
    'Department': {'Human Resources': 0, 'Research & Development': 1, 'Sales': 2},
    'EducationField': {'Life Sciences': 0, 'Other': 1, 'Medical': 2, 'Marketing': 3, 'Technical Degree': 4, 'Human Resources': 5},
    'Gender': {'Male': 0, 'Female': 1},
    'JobRole': {'Sales Executive': 0, 'Research Scientist': 1, 'Laboratory Technician': 2, 'Manufacturing Director': 3, 'Healthcare Representative': 4, 'Manager': 5, 'Sales Representative': 6, 'Research Director': 7, 'Human Resources': 8},
    'MaritalStatus': {'Single': 0, 'Married': 1, 'Divorced': 2},
    'Over18': {'Y': 1},
    'OverTime': {'No': 0, 'Yes': 1}
}

df.replace(column_mappings, inplace=True)
df

In [None]:
df.describe()

In [None]:
# correlation between attrition and other features
correlation_matrix = df.corr()
correlation_with_attrition = correlation_matrix['Attrition'].sort_values(ascending=False)

# heatmap 
plt.figure(figsize=(5, 10))
sns.heatmap(pd.DataFrame(correlation_with_attrition), annot=True, cmap='coolwarm', fmt=".2f", cbar=False)
plt.title('Correlation with attrition')
plt.show()
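
The ranking behind the heatmap can also be used to shortlist features. The following is a minimal sketch on a made-up toy frame (the values and the choice of two features are assumptions for illustration, not results from the HR data): it ranks columns by the absolute value of their correlation with the target, so that strong negative correlations are not pushed to the bottom of the list.

```python
import pandas as pd

# Toy stand-in for the encoded HR data (values are invented for illustration)
toy = pd.DataFrame({
    "Attrition":         [0, 0, 1, 1, 0, 1, 0, 1],
    "OverTime":          [0, 0, 1, 1, 0, 1, 0, 0],
    "TotalWorkingYears": [20, 15, 2, 3, 12, 1, 18, 5],
    "Gender":            [0, 1, 0, 1, 0, 1, 0, 1],
})

# Correlation of every feature with the target, excluding the target itself
corr = toy.corr()["Attrition"].drop("Attrition")

# Rank by absolute value so negative correlations count as much as positive ones
ranked = corr.abs().sort_values(ascending=False)
top_features = ranked.head(2).index.tolist()

print(ranked)
print(top_features)
```

Correlation only captures linear, pairwise relationships, so a shortlist like this is a starting point rather than a final feature selection.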

#### Prepare the data for training

#### Data split 

- y is our target value (the value we want to predict)
- X is our feature matrix (the data we use to predict the target value)

In [None]:
# define X and y
X = df[['OverTime', 'BusinessTravel', 'TotalWorkingYears', 'MaritalStatus', 'YearsInCurrentRole']].values
y = df['Attrition'].values

In [None]:
# split 80/20
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

#### Training the model (Decision Tree)

In [None]:
# initialize decision tree
model = DecisionTreeClassifier(random_state=42)

model.fit(X_train, y_train)

In [None]:
# Install the graphviz package
#!pip install graphviz

In [None]:
import graphviz
from sklearn.tree import export_graphviz

# draw tree 
gr_data = export_graphviz(model, out_file=None, 
                          feature_names=['OverTime', 'BusinessTravel', 'TotalWorkingYears', 'MaritalStatus', 'YearsInCurrentRole'],
                          class_names=True, filled=True, rounded=True, proportion=False, special_characters=True)  

dtree = graphviz.Source(gr_data)

dtree.render("decision_tree")  # Optionally save the tree to a file

In [None]:
# view tree
dtree 

### Model Validation

We need a metric for the evaluation. Accuracy is the percentage of correctly predicted instances out of the total number of instances in the dataset.
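
As a minimal sketch of that definition, the snippet below computes accuracy by hand on hypothetical labels (invented for illustration only). It also compares the result with a baseline that always predicts the majority class, because on imbalanced targets such as attrition a high accuracy can come mostly from the majority class.

```python
# Hypothetical true and predicted labels, chosen only to illustrate the formula
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_hat = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]

# Accuracy = correctly predicted instances / all instances
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

# Baseline: always predict the majority class (0 in this toy example)
baseline = sum(t == 0 for t in y_true) / len(y_true)

print("accuracy:", accuracy)
print("majority-class baseline:", baseline)
```

In this toy case the model is no better than the baseline, which is why the confusion matrix and per-class precision and recall are worth checking alongside accuracy.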

#### Apply the model to the test set



In [None]:
y_pred = model.predict(X_test)
y_pred

#### Accuracy on the test data

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#### Confusion matrix

In [None]:
# confusion matrix
confusion_mat = confusion_matrix(y_test,y_pred)
confusion_mat

In [None]:
pd.crosstab(y_test,y_pred)

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN                 | FP                 |
| Actual Positive | FN                 | TP                 |

Here is a quick summary:

- Of the 255 0's (attrition = no), the model predicted 223 correctly and 32 wrongly.
- Of the 39 1's (attrition = yes), the model predicted 27 correctly and 12 wrongly.
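
The four cells of the confusion matrix are enough to derive the usual metrics. The sketch below assumes the summary is read as TN = 223, FP = 32, FN = 12 and TP = 27 (an interpretation of the counts above, not a new result) and computes accuracy, precision, recall and F1 for the attrition class.

```python
# Cell counts as read from the summary (assumed interpretation)
tn, fp = 223, 32  # actual 0: predicted 0 / predicted 1
fn, tp = 12, 27   # actual 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted leavers who actually left
recall = tp / (tp + fn)     # share of actual leavers the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Under this reading the accuracy is dominated by the large "no attrition" class, while precision for the attrition class is noticeably lower.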

In [None]:
# visualize confusion matrix
plt.imshow(confusion_mat, interpolation='nearest')
plt.title('Confusion matrix')
plt.colorbar()
ticks = np.arange(2)
plt.xticks(ticks, ticks)
plt.yticks(ticks, ticks)
plt.ylabel('True labels')
plt.xlabel('Predicted labels')
plt.show()

#### Compare scores for Train and Test data

In [None]:
class_names = ['Not Attrition', 'Attrition']

# classifier performance on training dataset
print("Classification Report - Training Data:")
print(classification_report(y_train, model.predict(X_train), target_names=class_names))

# classifier performance on test dataset
print("Classification Report - Test Data:")
print(classification_report(y_test, model.predict(X_test), target_names=class_names))

### Naive Bayes

In [None]:
# define X and y directly from DataFrame
X = df[['OverTime', 'BusinessTravel', 'TotalWorkingYears', 'MaritalStatus', 'YearsInCurrentRole']].values
y = df['Attrition'].values

In [None]:
# split 80/20
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

We are ready to apply algorithms for training a model from our data. We try Gaussian Naive Bayes (NB), as it is appropriate for analysis of numeric data.
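
As a rough illustration of what Gaussian NB learns, the sketch below fits `GaussianNB` on a tiny one-feature dataset (invented for illustration). The model stores a prior per class and a mean per class and feature, and assigns new points to the class whose Gaussian makes them most likely. Note that this treats every feature as continuous and normally distributed within each class, which is only an approximation for encoded categorical columns such as `OverTime`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny invented dataset: one numeric feature, two well-separated classes
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X_demo, y_demo)

print("class priors:", nb.class_prior_)        # estimated P(class)
print("per-class means:", nb.theta_.ravel())   # mean of the feature in each class

# New points are assigned to the class with the highest posterior probability
new_points = np.array([[2.5], [9.0]])
demo_pred = nb.predict(new_points)
print("predictions:", demo_pred)
```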

In [None]:
# build the model from the train
model = GaussianNB()
model.fit(X_train, y_train)

In [None]:
# test the model on the test set
model.score(X_test, y_test)

#### Validating the Model

In [None]:
X_test

In [None]:
# Test on the test data, try prediction
prediction = model.predict(X_test)
prediction

In [None]:
prediction.shape

#### Estimate the Accuracy


In [None]:
# Calculate the accuracy of the model on the test set
print(accuracy_score(y_test, prediction))

In [None]:
# Confusion matrix, then classification report with per-class precision, recall, f1-score and support
cmat = confusion_matrix(y_test, prediction)
print(cmat)
print(classification_report(y_test, prediction))

In [None]:
sns.set()
sns.heatmap(cmat, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
# sklearn's confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
class_names = ['Not Attrition', 'Attrition']

# classifier performance on training dataset
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, model.predict(X_train), target_names=class_names))

# classifier performance on test dataset
print("\nClassifier performance on test dataset\n")
print(classification_report(y_test, model.predict(X_test), target_names=class_names))

### The model with the highest accuracy appears to be Naive Bayes

### Unsupervised Machine Learning by K-Means Algorithm

#### 3. Unsupervised machine learning: clustering
- apply at least one clustering algorithm (e.g. K-Means) for segmentation of the employees in groups of
similarity
- evaluate the quality of the clustering by calculating a silhouette score and recommend the cluster
configuration with higher score

#### 2. Supervised machine learning: classification

- train, test, and validate two machine learning models for classification and prediction of attrition (e.g.
Decision Tree and Naïve Bayes)
- apply appropriate methods and measures for assessing the validity of the models and recommend the
one with highest accuracy

#### Determine K by Elbow Method


I want to find the optimal number of clusters by using the elbow method. K stands for the number of clusters.

In [None]:
# Determine k from the distortion:
# the mean Euclidean distance between each observation and its nearest centroid
distortions = []
K = range(2,10)
for k in K:
    # fit once per k (KMeans(...).fit(X) already returns the fitted model)
    model = KMeans(n_clusters=k, n_init=10).fit(X)
    distortions.append(sum(np.min(cdist(X, model.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
print("Distortion: ", distortions)

Distortion measures the average distance between each data point and the centroid of its cluster.
In our data, the distortion falls as the number of clusters (K) increases. The goal is to find the point where a further increase of K no longer yields a noticeable improvement in distortion. This is the elbow point.
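
The distortion used here (mean distance to the nearest centroid) is closely related to the within-cluster sum of squared distances that scikit-learn exposes as `KMeans.inertia_`. The sketch below, on synthetic blobs standing in for the employee features (an assumption for illustration), computes both quantities from the same distances; either one can be plotted against K for an elbow plot.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the feature matrix X (assumption for illustration)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Distance from each point to its nearest centroid
nearest = cdist(X_demo, km.cluster_centers_, "euclidean").min(axis=1)

mean_distance = nearest.mean()       # the "distortion" as computed in this notebook
sum_squared = (nearest ** 2).sum()   # within-cluster sum of squared distances

print("mean distance:", mean_distance)
print("sum of squares:", sum_squared)
print("KMeans.inertia_:", km.inertia_)
```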

In [None]:
# Plot the distortion to discover the elbow
plt.title('Elbow Method for Optimal K')
plt.plot(K, distortions, 'bx-')
plt.xlabel('K')
plt.ylabel('Distortion')
plt.show()
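
The elbow is often hard to read by eye, so the silhouette score can serve as a cross-check. The following is a minimal sketch on synthetic data (three well-separated blobs, an assumption for illustration, not the HR data): it fits K-Means for several values of K and keeps the K with the highest mean silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: three well-separated groups (assumption for illustration)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    # Mean silhouette over all samples; ranges from -1 (poor) to 1 (well separated)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On the notebook's own data the same loop could be run with `X` in place of `X_demo`; scaling the features first may matter, since K-Means and the silhouette score both depend on Euclidean distances.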

#### 4. Machine Learning application
- create and deploy on localhost an interactive prototype Streamlit application, visualizing the stages
and results of your work
- enable input of user data and make predictions on attrition using the classification model
created in part 2 above
- test the application with various previously unseen input data and record the results



Submit a link to the GitHub repository of your solution, and in the README file provide answers to the
following questions:

- Which machine learning methods did you choose to apply in the application?
- How accurate is your solution of prediction?
- Which are the most decisive factors for quitting a job?
- Which work positions and departments are at higher risk of losing employees?
- Are employees of different gender paid equally in all departments?
- Do the family status and the distance from work influence the work-life balance?
- Does education make people happy (satisfied with their work)?
- Which were the challenges in the project development?