# Employee Retention Study Notebook

---

# Change working directory

To ensure the notebook accesses files relative to the project’s root directory, we retrieve the current working directory and then move up one level to its parent directory.

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Change the working directory to the parent directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
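
One detail worth knowing about the step above: os.path.dirname() works purely on the path string, so a trailing separator changes its result. os.getcwd() does not return a trailing separator, so the cells above are unaffected. The sketch below uses a hypothetical helper name (`parent_directory`) and a made-up folder layout, neither of which is part of this notebook, to show how normalising the path first gives the expected parent in both cases.

```python
import os

def parent_directory(path):
    # Hypothetical helper: drop any trailing separator, then take the parent
    return os.path.dirname(os.path.normpath(path))

# Made-up example path, for illustration only
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

print(parent_directory(notebook_dir))
# Same parent even when the path ends with a separator
print(parent_directory(notebook_dir + os.sep))
# Without normalisation, dirname only strips the trailing separator
print(os.path.dirname(notebook_dir + os.sep))
```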

---

# Load Data

We load the Employee Attrition dataset into a DataFrame named df to start our data exploration and analysis.

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

# Data Exploration

After confirming the correct working directory, we perform one-hot encoding on the categorical variables in the dataset to prepare it for correlation analysis. This step converts categorical data into a numerical format, which is required for the correlation calculations.

In [None]:
import os
print(os.getcwd())

In [17]:
df_ohe = pd.get_dummies(df, drop_first=True)

In [None]:
print(df_ohe.columns)
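
To make the effect of drop_first=True concrete, here is a minimal sketch on a tiny made-up DataFrame. The rows and the choice of columns are assumptions for illustration and are not taken from the dataset. Each categorical column is replaced by one indicator column per category, minus the first category in sorted order.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research", "Sales"],
})

# Numeric columns pass through unchanged; each object column becomes
# indicator columns, and the first category of each is dropped
toy_ohe = pd.get_dummies(toy, drop_first=True)
print(toy_ohe.columns.tolist())
```

Dropping the first category avoids a redundant column: for a two-level variable such as Attrition, a single Attrition_Yes indicator already carries all the information.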

# Correlation Study

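Explanation:
We use feature_engine's OneHotEncoder to convert the categorical (object) columns into binary columns. With drop_last=False, every category keeps its own column, so no category information is dropped. This overwrites the df_ohe created earlier with pd.get_dummies and ensures that all variables are numeric for the correlation study.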

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

# Check Dataset for correct loading

In [None]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

We define custom functions to calculate correlation matrices (using the Spearman and Pearson methods) and the Predictive Power Score (PPS). Pearson correlation assesses linear relationships and Spearman correlation assesses monotonic relationships between variables, while PPS evaluates how well one variable predicts another. Heatmaps are generated to visualize these relationships, aiding in the selection of features for further analysis.

In [1]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ppscore as pps
import pandas as pd

def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5)
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()

def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                    linewidth=0.05, linecolor='grey', ax=ax)
        plt.ylim(len(df.columns), 0)
        plt.show()

def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix

def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multi-collinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print("PPS detects linear or non-linear relationships between two columns.\n"
          "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
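
As a small illustration of why both correlation methods are reported, the sketch below builds a made-up relationship that is monotonic but not linear (y = x cubed). The data is invented for this example and is unrelated to the dataset. Spearman, which works on ranks, scores it as a perfect relationship, while Pearson scores it lower because the points do not lie on a straight line.

```python
import pandas as pd

# Made-up data: y always increases with x, but not linearly
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Same DataFrame.corr calls as in CalculateCorrAndPPS, on the toy data
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```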


# Run the Correlation and PPS Analysis

In [None]:
# Calculate Correlations and PPS for the Employee Retention Analyzer dataset
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df_ohe)

# Display Heatmaps for Correlation and PPS Analysis
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.3, PPS_Threshold=0.2,
                  figsize=(12,10), font_annot=10)
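
To see what a correlation threshold of 0.3 does to the heatmap, the sketch below applies the same two masking steps used in heatmap_corr to a small made-up correlation matrix. The values and the column names a, b, c are assumptions for illustration. The upper triangle and the diagonal are hidden because they duplicate the lower triangle, and any remaining cell whose absolute value is below the threshold is hidden as well.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix, for illustration only
corr = pd.DataFrame(
    [[1.0, 0.5, 0.1],
     [0.5, 1.0, -0.4],
     [0.1, -0.4, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)

threshold = 0.3
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True            # hide diagonal and upper triangle
mask[(corr.abs() < threshold).to_numpy()] = True   # hide weak correlations

# Cells that would still be drawn on the heatmap; masked cells become NaN
visible = corr.where(~mask)
print(visible)
```

Only the b/a and c/b cells remain visible here, since the c/a value of 0.1 falls below the threshold.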

# Parallel Plot

In [None]:
import plotly.express as px

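# Create a multi-dimensional categorical plot.
# df_parallel must be defined before this cell and must include the
# Attrition_Yes column used for colouring.
fig = px.parallel_categories(df_parallel, color="Attrition_Yes")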
fig.show(renderer='jupyterlab')

We use a parallel plot to visualize how different categorical features relate to employee attrition. This plot helps identify patterns and interactions between features that contribute to employees leaving the company.
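
The cell above does not show how df_parallel is built. One possible way to prepare such a frame is sketched below, under stated assumptions: the rows are made up, the choice of OverTime and MonthlyIncome as features is hypothetical, and the income bin edges are arbitrary. The idea is to keep categorical columns as they are, bin numeric columns into labelled ranges, and add a numeric Attrition_Yes column for colouring.

```python
import pandas as pd

# Made-up rows; column choice and bin edges are assumptions for illustration
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes", "Yes"],
    "MonthlyIncome": [2000, 5200, 9100, 3100],
})

# Keep categorical features unchanged
df_parallel = toy[["OverTime"]].copy()

# Bin the numeric feature so it can be drawn as a categorical dimension
df_parallel["MonthlyIncome"] = pd.cut(
    toy["MonthlyIncome"],
    bins=[0, 3000, 6000, float("inf")],
    labels=["up to 3k", "3k-6k", "above 6k"],
).astype(str)

# Numeric target flag used for colouring the plot
df_parallel["Attrition_Yes"] = (toy["Attrition"] == "Yes").astype(int)

print(df_parallel)
```

A frame shaped like this could then be passed to px.parallel_categories with color="Attrition_Yes", as in the cell above.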

---

## Explanation:
The analyses reveal the relationships between various employee attributes and attrition. The insights gained from these studies will guide the development of a predictive model and help the HR team address the key factors influencing employee attrition.