# Attrition in an Organization || Why Workers Quit?

Employees are the backbone of an organization, and its performance depends heavily on the quality of its employees. Employee attrition creates several challenges for an organization:

1. Training new employees is expensive in both money and time.
1. Experienced employees are lost.
1. Productivity suffers.
1. Profit suffers.

Before getting our hands dirty with the data, the first step is to frame the business question. Clarity on the questions below is crucial, because the solution being developed only makes sense if the problem is well stated.

### Business questions to brainstorm:
1. What factors are contributing more to employee attrition?
1. What type of measures should the company take in order to retain their employees?
1. What business value does the model bring?
1. Will the model save lots of money?
1. Which business unit faces the attrition problem?

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import drive

drive.mount('/content/drive/')

%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

pd.set_option("display.float_format", "{:.2f}".format)
pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 80)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [9]:
import warnings
warnings.filterwarnings("ignore")

In [10]:
df = pd.read_csv("MECD/Grupos/TAD - Pratical/Final Milestone/Python notebook/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

FileNotFoundError: ignored

# 1. Exploratory Data Analysis

- Find patterns in the data through visualization: graphs and charts reveal structure that summary tables hide.
    - Univariate analysis
        - Continuous variables: histograms and boxplots, which show central tendency and spread
        - Categorical variables: bar charts showing the frequency of each category
    - Bivariate analysis
        - Continuous & continuous: scatter plots to see how continuous variables interact with each other
        - Categorical & categorical: stacked column charts to show how the frequencies are spread between two categorical variables
        - Categorical & continuous: boxplots, swarm plots or even bar charts
- Detect outliers
- Feature engineering
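
For the outlier-detection step, one common rule of thumb flags values that lie more than 1.5 × IQR beyond the quartiles. The notebook does not say which rule it uses, so this is only an assumption. A minimal sketch on made-up income values; `iqr_outliers` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Made-up monthly incomes, not taken from the HR dataset
income = pd.Series([2100, 2300, 2500, 2600, 2800, 3000, 3100, 19000])
outliers = iqr_outliers(income)
print(outliers)
```

The same helper could be applied to each continuous column in turn; whether a flagged value is an error or a genuine extreme is a judgment call.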

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
for column in df.columns:
    print(f"{column}: Number of unique values {df[column].nunique()}")
    print("==========================================================")

We notice that `EmployeeCount`, `Over18` and `StandardHours` each have only one unique value, while `EmployeeNumber` has `1470` unique values, which makes it an identifier rather than a feature.
These features aren't useful for us, so we are going to drop those columns.
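
The same check can be done without reading the printed list by eye. This is a small sketch on a made-up frame; `toy`, `constant_cols` and `id_like_cols` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Toy frame mimicking the situation described above (values are made up)
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
    "Age": [41, 49, 37, 41],
    "Over18": ["Y", "Y", "Y", "Y"],
})

n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()        # no variance at all
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()  # one distinct value per row

print(constant_cols)
print(id_like_cols)
```

A continuous column can also have one distinct value per row, so the "identifier" list should be reviewed by hand before dropping anything.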

In [None]:
df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis="columns", inplace=True)

In [None]:
print(df.Attrition.value_counts())
print(df.Attrition.value_counts(normalize=True))
df['Attrition'].value_counts().plot(kind='bar', color=('C0','C1')).set_title('Attrition')

In [None]:
def plot_percentage(df, feature):
    """Print counts and shares for `feature`, then draw an annotated count plot."""
    print("Value counts")
    print(df[feature].value_counts())
    print("Percentage")
    print(df[feature].value_counts(normalize=True))
    # Create the plot (recent seaborn versions require x= as a keyword)
    plot = sns.countplot(x=feature, data=df)
    for p in plot.patches:
        count_percentage = '{:.1f}%'.format(100 * p.get_height() / len(df[feature]))
        a_x = p.get_x() + p.get_width() / 2 - 0.05
        a_y = p.get_y() + p.get_height()
        plot.annotate(count_percentage, (a_x, a_y), size=11)
    plt.show()

In [None]:
plot_percentage(df, "Attrition")

In [None]:
df_age = df[["Age", "Attrition"]].copy()  # copy so the binning below does not modify df
df_age = df_age[df_age["Age"].notnull()]
def in_bin_age(df):
    dtmp = []
    for i in df:
        if(i < 30):
            dtmp.append("20")
        elif(i <40):
            dtmp.append("30")
        elif(i<50):
            dtmp.append("40")
        else:
            dtmp.append("50")
    return dtmp

In [None]:
df_age["Age"] = in_bin_age(df_age["Age"])
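
The binning above can also be written with `pandas.cut`. The sketch below is an alternative, not the notebook's own method, and uses made-up ages; the bin edges and labels mirror `in_bin_age`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 29, 30, 39, 40, 49, 50, 60])  # made-up ages

# right=False makes each bin closed on the left: [0, 30), [30, 40), [40, 50), [50, inf)
age_band = pd.cut(ages, bins=[0, 30, 40, 50, np.inf], right=False,
                  labels=["20", "30", "40", "50"])
print(age_band.tolist())
```

The result is an ordered categorical, so plots keep the bands in ascending order without an explicit `order=` argument.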

In [None]:
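# sns.factorplot was renamed to sns.catplot and later removed from seaborn
sns.catplot(x="Age", data=df_age, aspect=1, kind='count', hue='Attrition', order=["20", "30", "40", "50"], palette=['C0', 'C1']).set_ylabels('N° Employees')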

In [None]:
# Count plot of each feature, split by Attrition
# (sns.factorplot was renamed to sns.catplot and later removed from seaborn)
features = ["BusinessTravel", "Department", "EnvironmentSatisfaction", "Gender",
            "JobLevel", "JobSatisfaction", "MaritalStatus",
            "RelationshipSatisfaction", "WorkLifeBalance", "NumCompaniesWorked"]

for feature in features:
    print(f"---- {feature}")
    sns.catplot(x=feature, data=df, aspect=1, kind='count', hue='Attrition',
                palette=['C0', 'C1']).set_ylabels('N° Employees')



## Categorical Features

In [None]:
object_col = []
for column in df.columns:
    if df[column].dtype == object and len(df[column].unique()) <= 30:
        object_col.append(column)
        print(f"{column} : {df[column].unique()}")
        print(df[column].value_counts())
        print("====================================")
object_col.remove('Attrition')

In [None]:
len(object_col)

In [None]:
object_col

In [None]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
df["Attrition"] = label.fit_transform(df.Attrition)

## Numerical Features

In [None]:
disc_col = []
for column in df.columns:
    if df[column].dtypes != object and df[column].nunique() < 30:
        print(f"{column} : {df[column].unique()}")
        disc_col.append(column)
        print("====================================")
disc_col.remove('Attrition')

In [None]:
cont_col = []
for column in df.columns:
    if df[column].dtypes != object and df[column].nunique() >= 30:  # >= so a column with exactly 30 values is not skipped
        print(f"{column} : Minimum: {df[column].min()}, Maximum: {df[column].max()}")
        cont_col.append(column)
        print("====================================")

## Data Visualisation

In [None]:
plt.figure(figsize=(20, 20))

for i, column in enumerate(disc_col, 1):
    plt.subplot(4, 4, i)
    df[df["Attrition"] == 0][column].hist(bins=35, color='blue', label='Attrition = NO', alpha=0.6)
    df[df["Attrition"] == 1][column].hist(bins=35, color='red', label='Attrition = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)

It seems that the `EnvironmentSatisfaction`, `JobSatisfaction`, `PerformanceRating`, and `RelationshipSatisfaction` features don't have a big impact on determining the `Attrition` of employees.

In [None]:
plt.figure(figsize=(20, 10))

for i, column in enumerate(cont_col, 1):
    plt.subplot(2, 4, i)
    df[df["Attrition"] == 0][column].hist(bins=35, color='blue', label='Attrition = NO', alpha=0.6)
    df[df["Attrition"] == 1][column].hist(bins=35, color='red', label='Attrition = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)

In [None]:
plt.figure(figsize=(15, 20))

for i, column in enumerate(object_col, 1):
    plt.subplot(3, 3, i)
    df[df["Attrition"] == 0][column].hist(bins=35, color='blue', label='Attrition = NO', alpha=0.6)
    df[df["Attrition"] == 1][column].hist(bins=35, color='red', label='Attrition = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
    plt.xticks(rotation=45)

**Conclusions:**

***
- Workers with low `JobLevel`, `MonthlyIncome`, `YearsAtCompany`, and `TotalWorkingYears` are more likely to quit their jobs.
- `BusinessTravel` : Workers who travel a lot are more likely to quit than other employees.

- `Department` : Workers in `Research & Development` are more likely to stay than workers in other departments.

- `EducationField` : Workers with a `Human Resources` or `Technical Degree` background are more likely to quit than employees from other fields of education.

- `Gender` : `Male` employees are more likely to quit.

- `JobRole` : Workers in the `Laboratory Technician`, `Sales Representative`, and `Human Resources` roles are more likely to quit than workers in other positions.

- `MaritalStatus` : `Single` workers are more likely to quit than `Married` and `Divorced` ones.

- `OverTime` : Workers who work overtime are more likely to quit than others.

***
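
Overlaid histograms show raw counts, which can mislead when the groups differ in size. A "more likely to quit" statement is easier to check as a rate within each category. A minimal sketch with made-up rows (the column names follow the dataset, the values do not):

```python
import pandas as pd

toy = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No",  "No",  "Yes", "No", "No", "No", "No", "No"],
})

# normalize="index" divides each row by its total, giving P(Attrition | OverTime)
rates = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="index")
print(rates)
```

In this toy table half of the overtime workers leave, against one in six of the others. Running the same `crosstab` on the real columns would back each bullet above with a number.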

# 3. Correlation Matrix

In [None]:
df1 = df.copy()

In [None]:
label = LabelEncoder()
df1_enco = df1.apply(label.fit_transform)

In [None]:
plt.figure(figsize=(30, 30))
sns.heatmap(df1_enco[disc_col].corr(), annot=True, cmap="RdYlGn", annot_kws={"size":15})

In [None]:
col = df1_enco.corr().nlargest(20, "Attrition").Attrition.index
plt.figure(figsize=(15, 15))
sns.heatmap(df1_enco[col].corr(), annot=True, cmap="RdYlGn", annot_kws={"size":10})

In [None]:
df1_enco.drop('Attrition', axis=1).corrwith(df1_enco.Attrition).plot(kind='barh', figsize=(10, 7))

**Analysis of correlation results (sample analysis):**
- Monthly income is highly correlated with job level.
- Job level is highly correlated with total working years.
- Monthly income is highly correlated with total working years.
- Age is also positively correlated with total working years.
- Marital status and stock option level are negatively correlated.
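
Reading pairs off a large heatmap is error-prone. As a sketch, the strongest pairs can be listed directly from the correlation matrix; the numbers below are made up and only the column names are borrowed from the dataset:

```python
from itertools import combinations

import pandas as pd

# Made-up numbers; column names borrowed from the dataset for readability
toy = pd.DataFrame({
    "JobLevel":          [1, 2, 3, 4, 5, 1],
    "MonthlyIncome":     [2000, 4100, 6500, 9800, 15000, 2500],
    "TotalWorkingYears": [1, 6, 10, 18, 25, 3],
    "DistanceFromHome":  [10, 2, 7, 1, 9, 4],
})

corr = toy.corr()
# One entry per unordered pair of columns
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})
# Order by absolute correlation, strongest first, keeping the sign
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs.head(3))
```

Note that correlations computed on label-encoded categorical columns (as in `df1_enco`) depend on the arbitrary integer codes, so they are best read as rough hints.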

# 4. Data Processing

In [None]:
# Transform categorical data into dummies
dummy_col = [column for column in df.drop('Attrition', axis=1).columns if df[column].nunique() < 20]
data = pd.get_dummies(df, columns=dummy_col, drop_first=True, dtype='uint8')
data.info()

In [None]:
print(data.shape)

# Remove duplicate Features
data = data.T.drop_duplicates()
data = data.T

# Remove Duplicate Rows
data.drop_duplicates(inplace=True)

print(data.shape)

In [None]:
data.shape

In [None]:
data.drop('Attrition', axis=1).corrwith(data.Attrition).sort_values().plot(kind='barh', figsize=(10, 30))

In [None]:
feature_correlation = data.drop('Attrition', axis=1).corrwith(data.Attrition).sort_values()
model_col = feature_correlation[np.abs(feature_correlation) > 0.02].index
len(model_col)

# 5. Applying machine learning algorithms

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler

X = data.drop('Attrition', axis=1)
y = data.Attrition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,
                                                    stratify=y)

scaler = RobustScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
X_std = scaler.transform(X)

In [None]:
def feature_imp(df, model):
    fi = pd.DataFrame()
    fi["feature"] = df.columns
    fi["importance"] = model.feature_importances_
    return fi.sort_values(by="importance", ascending=False)

## What defines success?
The data is imbalanced: if we simply predict that every employee stays, we already reach an accuracy of `83.90%` on the test set, so accuracy alone is not a meaningful measure of success.
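
The arithmetic behind this baseline can be sketched in a few lines. The labels below are illustrative (84 stayers, 16 leavers), not the notebook's actual split:

```python
from collections import Counter

# Illustrative labels only (0 = stay, 1 = leave); not the notebook's test split
y_true = [0] * 84 + [1] * 16

majority_class, majority_count = Counter(y_true).most_common(1)[0]
y_pred = [majority_class] * len(y_true)  # "everyone stays"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_leavers = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)

print(f"accuracy = {accuracy:.2%}, recall on leavers = {recall_leavers:.2%}")
```

The baseline scores high on accuracy while finding none of the leavers, which is why precision, recall and F1 on the positive class are reported alongside accuracy below.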


In [None]:
y_test.value_counts()[0] / y_test.shape[0]

In [None]:
stay = (y_train.value_counts()[0] / y_train.shape)[0]
leave = (y_train.value_counts()[1] / y_train.shape)[0]

print("===============TRAIN=================")
print(f"Staying Rate: {stay * 100:.2f}%")
print(f"Leaving Rate: {leave * 100 :.2f}%")

stay = (y_test.value_counts()[0] / y_test.shape)[0]
leave = (y_test.value_counts()[1] / y_test.shape)[0]

print("===============TEST=================")
print(f"Staying Rate: {stay * 100:.2f}%")
print(f"Leaving Rate: {leave * 100 :.2f}%")

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print("Classification Report:", end='')
        print(f"\tPrecision Score: {precision_score(y_train, pred) * 100:.2f}%")
        print(f"\t\t\tRecall Score: {recall_score(y_train, pred) * 100:.2f}%")
        print(f"\t\t\tF1 score: {f1_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")

    elif train==False:
        pred = clf.predict(X_test)
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print("Classification Report:", end='')
        print(f"\tPrecision Score: {precision_score(y_test, pred) * 100:.2f}%")
        print(f"\t\t\tRecall Score: {recall_score(y_test, pred) * 100:.2f}%")
        print(f"\t\t\tF1 score: {f1_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
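
The scores printed by `print_score` all derive from the four cells of the confusion matrix. A worked example with hypothetical counts (not results from this notebook):

```python
# Hypothetical confusion-matrix counts, laid out like sklearn's confusion_matrix:
# rows = actual class, columns = predicted class, class 1 = "leave"
tn, fp = 350, 20
fn, tp = 40, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted leavers, how many really left
recall = tp / (tp + fn)      # of the real leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here accuracy is about 0.86 even though fewer than half of the leavers are caught, which is the pattern to watch for on imbalanced data.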

## 5. 1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr_classifier = LogisticRegression(solver='liblinear', penalty='l1')
lr_classifier.fit(X_train_std, y_train)
y_pred_log = lr_classifier.predict(X_test_std)  # the model was fitted on scaled features


print_score(lr_classifier, X_train_std, y_train, X_test_std, y_test, train=True)
print_score(lr_classifier, X_train_std, y_train, X_test_std, y_test, train=False)

In [None]:
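# plot_confusion_matrix / plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below replace them
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

disp = ConfusionMatrixDisplay.from_estimator(lr_classifier, X_test_std, y_test,
                                             cmap='Blues', values_format='d',
                                             display_labels=['Stay', 'Churn'])

In [None]:
disp = RocCurveDisplay.from_estimator(lr_classifier, X_test_std, y_test)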

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_log})
print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))

print(sorted(incorrect_predictions.index))  # row labels of the misclassified employees

## 5. 2. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rand_forest = RandomForestClassifier(n_estimators=1200,
#                                      bootstrap=False,
#                                      class_weight={0:stay, 1:leave}
                                    )
rand_forest.fit(X_train, y_train)

y_pred_rf = rand_forest.predict(X_test)

print_score(rand_forest, X_train, y_train, X_test, y_test, train=True)
print_score(rand_forest, X_train, y_train, X_test, y_test, train=False)

In [None]:
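from sklearn.metrics import RocCurveDisplay  # replaces the removed plot_roc_curve

disp = RocCurveDisplay.from_estimator(lr_classifier, X_test_std, y_test)
RocCurveDisplay.from_estimator(rand_forest, X_test, y_test, ax=disp.ax_)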

In [None]:
fi_rf = feature_imp(X, rand_forest)[:40]  # separate name so the raw `df` is not overwritten
fi_rf.set_index('feature', inplace=True)
fi_rf.plot(kind='barh', figsize=(10, 10))
plt.title('Feature Importance according to Random Forest')

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_rf})
print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))

print(sorted(incorrect_predictions.index))  # row labels of the misclassified employees

## 5. 3. Support Vector Machine

In [None]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(X_train_std, y_train)

y_pred_svc = svc.predict(X_test_std)  # the model was fitted on scaled features

print_score(svc, X_train_std, y_train, X_test_std, y_test, train=True)
print_score(svc, X_train_std, y_train, X_test_std, y_test, train=False)

In [None]:
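from sklearn.metrics import RocCurveDisplay  # replaces the removed plot_roc_curve

disp = RocCurveDisplay.from_estimator(lr_classifier, X_test_std, y_test)
RocCurveDisplay.from_estimator(rand_forest, X_test, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(svc, X_test_std, y_test, ax=disp.ax_)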

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_svc})
print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))

print(sorted(incorrect_predictions.index))  # row labels of the misclassified employees

## 5. 4. XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)


print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
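from sklearn.metrics import RocCurveDisplay  # replaces the removed plot_roc_curve

disp = RocCurveDisplay.from_estimator(lr_classifier, X_test_std, y_test)
RocCurveDisplay.from_estimator(rand_forest, X_test, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(svc, X_test_std, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(xgb_clf, X_test, y_test, ax=disp.ax_)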

In [None]:
fi_xgb = feature_imp(X, xgb_clf)[:35]  # separate name so the raw `df` is not overwritten
fi_xgb.set_index('feature', inplace=True)
fi_xgb.plot(kind='barh', figsize=(10, 8))
plt.title('Feature Importance according to XGBoost')

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_xgb})
print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))

print(sorted(incorrect_predictions.index))  # row labels of the misclassified employees

## 6. Balance the dataset

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.under_sampling import TomekLinks
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import EditedNearestNeighbours
from collections import Counter
from sklearn.decomposition import PCA
from collections import defaultdict
from sklearn.model_selection import train_test_split, cross_val_score
from scikitplot.metrics import plot_roc

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_train_std, y_train)

y_pred0 = clf.predict(X_test_std)

print('Accuracy %s' % accuracy_score(y_test, y_pred0))
print('F1-score %s' % f1_score(y_test, y_pred0, average=None))
print(classification_report(y_test, y_pred0))

# ROC curve
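
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen leaver receives a higher score than a randomly chosen stayer. The sketch below computes it by brute force over all pairs; `pairwise_auc` is a hypothetical helper and the labels and scores are made up:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))
```

Three of the four positive/negative pairs are ordered correctly, so the value is 0.75. The cells below obtain the same quantity from `roc_curve` and `auc`, which integrate the curve with the trapezoidal rule.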

In [None]:
y_score = clf.predict_proba(X_test_std)
plot_roc(y_test, y_score)
plt.show()

In [None]:
y_score = clf.predict_proba(X_test_std)
fpr0, tpr0, _ = roc_curve(y_test, y_score[:, 1])
roc_auc0 = auc(fpr0, tpr0)

In [None]:
plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

## Undersampling

### Random Under Sampler
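
Random undersampling keeps every minority row and draws an equally sized random subset of the majority class. The following is a hand-written NumPy sketch of that idea on toy arrays, not imbalanced-learn's implementation; `random_undersample` is a hypothetical helper:

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Keep all minority rows and an equally sized random subset of each other class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_small, y_small = random_undersample(X_toy, y_toy)
print(np.bincount(y_small))
```

The price of the balance is discarded data: four of the seven majority rows are thrown away in this toy case.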

In [None]:
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res, y_res)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

### Condensed Nearest Neighbour

In [None]:
import warnings
warnings.simplefilter("ignore")

In [None]:
cnn = CondensedNearestNeighbour(random_state=42, n_jobs=10)
X_res, y_res = cnn.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res, y_res)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

### Tomek Links

In [None]:
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res, y_res)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

### Edited Nearest Neighbors

In [None]:
enn = EditedNearestNeighbours()
X_res, y_res = enn.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res, y_res)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

## Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

### Random Over Sampler

In [None]:
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res, y_res)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

### SMOTE
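
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that single interpolation step with NumPy on made-up points. It is a simplified illustration, not imbalanced-learn's code, and `smote_like_sample` is a hypothetical helper:

```python
import numpy as np

def smote_like_sample(X_min, k=2, seed=42):
    """Create one synthetic point between a random minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_min))
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dist)[1:k + 1]  # position 0 is the row itself (no duplicate rows assumed)
    j = rng.choice(neighbours)
    gap = rng.random()                      # uniform in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i]), i, j

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]])
synthetic, i, j = smote_like_sample(X_min)
print(synthetic, i, j)
```

Because the new point lies on the segment between two real rows, one-hot encoded columns can take fractional values after resampling. This is worth keeping in mind for the dummy-encoded `X_train` used here.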

In [None]:
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res, y_res)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

### ADASYN

In [None]:
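ada = ADASYN(random_state=42)
# Unlike the samplers above, ADASYN is fitted on the full X, y (not only the training split),
# because X_res_ada is re-split in the next section. The scores in the next cell are
# therefore optimistic: the rows of X_test are also part of the data the tree is trained on.
X_res_ada, y_res_ada = ada.fit_resample(X, y)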
print('Resampled dataset shape %s' % Counter(y_res_ada))

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_pca = pca.transform(X_res_ada)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_res_ada, cmap=plt.cm.prism, edgecolor='k', alpha=0.7)
plt.show()

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
clf.fit(X_res_ada, y_res_ada)

y_pred = clf.predict(X_test)

print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))

y_score = clf.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr0, tpr0, color='darkorange', lw=3, label='$AUC_0$ = %.3f' % (roc_auc0))
plt.plot(fpr, tpr, color='green', lw=3, label='$AUC_1$ = %.3f' % (roc_auc))

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

# Analysis with balanced dataset

### Robust Scaler

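`RobustScaler` scales features using statistics that are robust to outliers: with its default settings it subtracts the median and divides by the interquartile range (25th to 75th percentile).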
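
The effect is easiest to see on a single feature with one extreme value. A minimal NumPy sketch on made-up numbers, computing the robust scaling by hand next to ordinary mean/standard-deviation scaling:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up feature with one outlier

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - median) / (q3 - q1)

# For comparison: mean/std scaling is pulled around by the outlier
x_standard = (x - x.mean()) / x.std()

print(x_robust)
print(x_standard)
```

With robust scaling the four ordinary values stay evenly spread between -1 and 0.5 and only the outlier is far away. With mean/std scaling the ordinary values are squeezed together because the outlier inflates both statistics.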

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res_ada, y_res_ada, test_size=0.3, random_state=42,
                                                    stratify=y_res_ada)

scaler = RobustScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
X_std = scaler.transform(X_res_ada)

## 6. 1. Logistic Regression

In [None]:
lr_classifier = LogisticRegression(solver='liblinear', penalty='l1')
lr_classifier.fit(X_train_std, y_train)
y_pred_log = lr_classifier.predict(X_test_std)

print_score(lr_classifier, X_train_std, y_train, X_test_std, y_test, train=True)
print_score(lr_classifier, X_train_std, y_train, X_test_std, y_test, train=False)


In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_log})
#print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))

#print(sorted(incorrect_predictions))

In [None]:
# Print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_log)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
class_report = classification_report(y_test, y_pred_log)
print("Classification Report:")
print(class_report)

In [None]:
y_probabilities = lr_classifier.predict_proba(X_test_std)

# Extract the scores associated with the positive class
y_scores = y_probabilities[:, 1]

# Build a DataFrame with the true labels, predictions and scores
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_log, 'Score': y_scores})

# Print the DataFrame
print(results_df)

In [None]:
lr_classifier = LogisticRegression()
lr_classifier.fit(X_res_ada, y_res_ada)

# Get the coefficients and the corresponding features
coefficients = lr_classifier.coef_[0]
feature_names = X_res_ada.columns

# Build a DataFrame for visualization
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Ordinamento per valore assoluto
coefficients_df['Abs_Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values(by='Abs_Coefficient', ascending=True)


# Visualizzazione dei coefficienti
plt.figure(figsize=(10, 6))
plt.barh(coefficients_df['Feature'], coefficients_df['Coefficient'], color='skyblue')
plt.title('Logistic Regression Coefficients')
plt.xlabel('Coefficient Value')
plt.show()

# Visualizzazione dei coefficienti in valore assoluto
plt.figure(figsize=(10, 6))
plt.barh(coefficients_df['Feature'], coefficients_df['Abs_Coefficient'], color='skyblue')
plt.title('Absolute Logistic Regression Coefficients')
plt.xlabel('Absolute Coefficient Value')
plt.show()

In [None]:
# Same fit as in the previous cell, repeated so this cell can run on its own
lr_classifier = LogisticRegression()
lr_classifier.fit(X_res_ada, y_res_ada)

# Get the coefficients and the matching feature names
coefficients = lr_classifier.coef_[0]
feature_names = X_res_ada.columns

# Build a DataFrame for plotting
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort by absolute value
coefficients_df['Abs_Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values(by='Abs_Coefficient', ascending=True)

# Keep only the 20 coefficients with the largest absolute value
top_20_coefficients = coefficients_df.tail(20)

# Plot the signed coefficients
plt.figure(figsize=(10, 6))
plt.barh(top_20_coefficients['Feature'], top_20_coefficients['Coefficient'], color='skyblue')
plt.title('Logistic Regression Coefficients (Top 20)')
plt.xlabel('Coefficient Value')
plt.show()

# Plot the absolute coefficients
plt.figure(figsize=(10, 6))
plt.barh(top_20_coefficients['Feature'], top_20_coefficients['Abs_Coefficient'], color='skyblue')
plt.title('Absolute Logistic Regression Coefficients (Top 20)')
plt.xlabel('Absolute Coefficient Value')
plt.show()
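
Ranking features by coefficient size is only meaningful when the features share a comparable scale. The sketch below illustrates this on synthetic data: the two columns and their names (`MonthlyIncomeDemo`, `SatisfactionDemo`) are invented for the example and are not the notebook's features. It fits the same model on raw and on `RobustScaler`-transformed inputs so the two sets of coefficients can be compared side by side.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

# Synthetic data (an assumption for illustration): one feature in the
# thousands, one on a 1-4 scale, both related to the label.
rng = np.random.default_rng(0)
n_rows = 500
X_demo = pd.DataFrame({
    "MonthlyIncomeDemo": rng.normal(5000, 1500, n_rows),
    "SatisfactionDemo": rng.integers(1, 5, n_rows).astype(float),
})
logits = (-0.0005 * (X_demo["MonthlyIncomeDemo"] - 5000)
          - 0.8 * (X_demo["SatisfactionDemo"] - 2.5))
y_demo = (rng.random(n_rows) < 1 / (1 + np.exp(-logits))).astype(int)

# Scale each feature by its median and interquartile range.
X_demo_scaled = RobustScaler().fit_transform(X_demo)

# Fit the same model on raw and on scaled inputs.
raw_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
scaled_model = LogisticRegression(max_iter=1000).fit(X_demo_scaled, y_demo)

coef_table = pd.DataFrame({
    "Feature": X_demo.columns,
    "Raw_Coefficient": raw_model.coef_[0],
    "Scaled_Coefficient": scaled_model.coef_[0],
})
print(coef_table)
```

On raw inputs the large-unit feature tends to get a tiny coefficient simply because of its units, so a plot of raw coefficients can understate its influence.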


In [None]:
import matplotlib.pyplot as plt

# Define the color map
cmap = plt.get_cmap("viridis")

# Scatter plot of the score per actual class, colored by score
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(results_df['Actual'], results_df['Score'], c=results_df['Score'], cmap=cmap, s=20)

# Add the color bar, built from the scatter so its range matches the plotted scores
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Score')

# Title and axis labels
ax.set_title('Scatter Plot of Scores with Color Gradient')
ax.set_xlabel('Actual Class')
ax.set_ylabel('Score')

plt.show()


## 6. 2. Random Forest Classifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res_ada, y_res_ada, test_size=0.3, random_state=42,
                                                    stratify=y_res_ada)

In [None]:
rand_forest = RandomForestClassifier(n_estimators=1500,
                                     bootstrap=True,
                                     oob_score=True
                                    )
rand_forest.fit(X_train, y_train)

y_pred_rf = rand_forest.predict(X_test)


print_score(rand_forest, X_train, y_train, X_test, y_test, train=True)
print_score(rand_forest, X_train, y_train, X_test, y_test, train=False)
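
Because the forest is built with `oob_score=True`, scikit-learn also stores an out-of-bag accuracy estimate in the fitted model's `oob_score_` attribute. Each tree is scored on the training rows left out of its bootstrap sample. The sketch below shows how to read it; it uses synthetic data from `make_classification` and a smaller forest, and the names `X_demo`, `y_demo` and `forest_demo` are placeholders rather than the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the resampled training data (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

forest_demo = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # required for out-of-bag estimation
    oob_score=True,
    random_state=0,
)
forest_demo.fit(X_demo, y_demo)

# Accuracy estimated on the rows each tree did not see during training.
print(f"OOB accuracy: {forest_demo.oob_score_:.3f}")
```

The out-of-bag estimate gives a quick sanity check against the test-set score without touching the test data.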

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_rf})
#print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))



In [None]:
# Print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
class_report = classification_report(y_test, y_pred_rf)
print("Classification Report:")
print(class_report)

In [None]:
y_probabilities = rand_forest.predict_proba(X_test)

# Extract the scores (predicted probabilities) for the positive class
y_scores = y_probabilities[:, 1]

# Build a DataFrame with the true labels, the predictions and the scores
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_rf, 'Score': y_scores})

# Print the DataFrame
print(results_df)

In [None]:
df = feature_imp(X, rand_forest)[:40]
df.set_index('feature', inplace=True)
df.plot(kind='barh', figsize=(10, 10))
plt.title('Feature Importance according to Random Forest')

## 6. 3. Support Vector Machine

In [None]:
svc = SVC(kernel='linear', probability=True)
svc.fit(X_train_std, y_train)

y_pred_svc = svc.predict(X_test_std)

print_score(svc, X_train_std, y_train, X_test_std, y_test, train=True)
print_score(svc, X_train_std, y_train, X_test_std, y_test, train=False)

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_svc})
#print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))


In [None]:
# Print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_svc)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
class_report = classification_report(y_test, y_pred_svc)
print("Classification Report:")
print(class_report)

In [None]:
y_probabilities = svc.predict_proba(X_test_std)

# Extract the scores (predicted probabilities) for the positive class
y_scores = y_probabilities[:, 1]

# Build a DataFrame with the true labels, the predictions and the scores
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_svc, 'Score': y_scores})

# Print the DataFrame
print(results_df)

In [None]:
print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

## 6. 4. XGBoost classifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res_ada, y_res_ada, test_size=0.3, random_state=42,
                                                    stratify=y_res_ada)

In [None]:
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)

y_pred_xgb = xgb_clf.predict(X_test)

print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_xgb})
#print(results_df)

print(results_df.Predicted.value_counts())
print(results_df.Actual.value_counts())

correct_predictions = results_df[results_df['Actual'] == results_df['Predicted']]
incorrect_predictions = results_df[results_df['Actual'] != results_df['Predicted']]

print("Number of incorrect Predictions: %s" % len(incorrect_predictions))

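# Show the misclassified rows ordered by index
# (sorted() on a DataFrame would only return its column names)
print(incorrect_predictions.sort_index())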

In [None]:
# Print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_xgb)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
class_report = classification_report(y_test, y_pred_xgb)
print("Classification Report:")
print(class_report)

In [None]:
# Use the XGBoost model here, not the logistic regression, so the scores match y_pred_xgb
y_probabilities = xgb_clf.predict_proba(X_test)

# Extract the scores (predicted probabilities) for the positive class
y_scores = y_probabilities[:, 1]

# Build a DataFrame with the true labels, the predictions and the scores
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_xgb, 'Score': y_scores})

# Print the DataFrame
print(results_df)

In [None]:
df = feature_imp(X, xgb_clf)[:35]
df.set_index('feature', inplace=True)
df.plot(kind='barh', figsize=(10, 8))
plt.title('Feature Importance according to XGBoost')