# Employee performance analysis with `Python`: predictive modelling

```
@author: Aleksandras Urbonas
@date  : 20241211T2250 ALUR
```


---

# 0. Config



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# set chart size
sns.set_theme(rc={'figure.figsize':(3,3)})



---

# 1. Import data



In [None]:
raw_data_file_path = 'data/data_model.csv'

# Load the data
data_model = pd.read_csv(raw_data_file_path)



In [None]:
# Display the first few rows to understand the structure
print(data_model.head(2), end='\n\n\n')

# Check the columns and data types
# (info() prints its report directly and returns None, so it is not wrapped in print)
data_model.info()
print('\n')

# Check for any missing values
print(data_model.isnull().sum()) 



---

# Predictive

Now let's build a predictive model. Can we predict whether an employee was promoted (the binary target `is_promo`)?



In [None]:
# import pandas as pd
# import numpy as np
# import seaborn as sns
# import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix



## Preprocess the Data:

    Handle missing values, encode categorical variables, and scale numerical features.



In [None]:
# Handle missing values
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent call)
data_model = data_model.ffill()
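
Forward fill copies the last observed value downward, so the result depends on row order, and a gap in the first row has nothing to copy from. The toy frame below (invented values, not the notebook's data) is a minimal sketch of that behaviour:

```python
import numpy as np
import pandas as pd

# Toy column (invented values) with gaps, including one in the first row
toy = pd.DataFrame({"score": [np.nan, 3.0, np.nan, 5.0]})

filled = toy.ffill()

# The leading gap has no earlier value to copy, so it is still missing
remaining = int(filled["score"].isna().sum())
print(filled)
print("still missing:", remaining)
```

If any values are still missing after the fill, a second strategy (for example a column median) would be needed before modelling.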



In [None]:
data_model.head(2)

In [None]:
data_model.drop(columns=['job_level', 'job_rank','job_role'], inplace=True)



In [None]:
# Encode categorical variables
data = pd.get_dummies(
    data_model
    , columns=['region', 'job_function', 'perf_rank']  # 'job_level' was dropped in the previous cell
    , drop_first=False # True
)

data.head(2)
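
To see what the `drop_first` flag changes, here is a small sketch on an invented `region` column. With `drop_first=False` every category gets its own indicator column; with `drop_first=True` the first category (in sorted order) is dropped and becomes the implicit reference level.

```python
import pandas as pd

# Invented values, only to show the shape of the encoded output
toy = pd.DataFrame({"region": ["north", "south", "east", "south"]})

full = pd.get_dummies(toy, columns=["region"], drop_first=False)
reduced = pd.get_dummies(toy, columns=["region"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Tree-based models such as a random forest are generally unaffected by the redundant column, which is one reason to keep `drop_first=False` here; linear models are more sensitive to it.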



---

## Features for machine learning

In [None]:
# Extract features and target variable
X = data.drop(columns=['is_promo'])
y = data['is_promo']



In [None]:
X.head(2)



In [None]:
X.columns



## Class Imbalance

    Handling class imbalance is crucial when working with classification problems to ensure that the model performs well across all classes.
    
    Here are some effective methods to tackle class imbalance in Python:



### Oversampling:
    Increase the number of instances in the minority class by duplicating existing instances or creating synthetic samples.
    The imbalanced-learn library offers the SMOTE (Synthetic Minority Over-sampling Technique) algorithm.
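
The notebook itself does not run an oversampling step. As a minimal sketch of the "duplicating existing instances" variant, the block below resamples minority rows with replacement using only pandas. The toy frame and its `tenure` column are invented; SMOTE (available as `imblearn.over_sampling.SMOTE`) would instead create synthetic points between minority neighbours.

```python
import pandas as pd

# Toy data (invented): 6 non-promoted rows, 2 promoted rows
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 6, 7, 8],
    "is_promo": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["is_promo"] == 0]
minority = toy[toy["is_promo"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=8)

balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
counts = balanced["is_promo"].value_counts().to_dict()
print(counts)
```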
    
    

### Undersampling:

    Reduce the number of instances in the majority class.



In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Fix the seed so the same majority rows are kept on every run
rus = RandomUnderSampler(random_state=8)

X_resampled, y_resampled = rus.fit_resample(X, y)



In [None]:
# Replace the full feature matrix and target with the undersampled versions
X = X_resampled
y = y_resampled



In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)
X_train.head(2)
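
The cells above undersample the whole dataset before splitting, so the test set is balanced too and no longer reflects the real promotion rate. One alternative ordering, sketched here on synthetic data with hypothetical names (`X_toy`, `y_toy`), is to split first with `stratify` and undersample only the training part:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 10% promoted (invented values)
rng = np.random.default_rng(8)
X_toy = pd.DataFrame({"feature": rng.normal(size=200)})
y_toy = pd.Series([1] * 20 + [0] * 180, name="is_promo")

# 1) Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=8
)

# 2) Undersample the majority class in the training part only
n_minority = int((y_tr == 1).sum())
keep_majority = y_tr[y_tr == 0].sample(n=n_minority, random_state=8).index
keep = keep_majority.union(y_tr[y_tr == 1].index)
X_tr_bal, y_tr_bal = X_tr.loc[keep], y_tr.loc[keep]

print(y_tr_bal.value_counts().to_dict())  # class counts in the training part
print(round(y_te.mean(), 2))              # share of promoted rows in the test part
```

With this ordering the reported metrics describe performance on data that still has the original class ratio.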



---

# Data anomalies



In [None]:
# Fit the model
from sklearn.ensemble import IsolationForest

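clf = IsolationForest(
    n_estimators=800
    , n_jobs=-1           # use all available CPU cores
    , contamination=0.05  # expected share of anomalies
    , random_state=8      # reproducible results
)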

clf.fit(X_train)



In [None]:
X_test.head(2)



In [None]:
predictions = clf.predict(X_test)
print(predictions[:10])



In [None]:
predictions.shape



In [None]:
type(predictions)



In [None]:
X_test.shape



In [None]:
# Identify anomalies (IsolationForest labels them -1)
anomalies = np.where(predictions == -1)[0]

# np.where returns row positions, so select the rows with .iloc
anomalous_data = X_test.iloc[anomalies]
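
`IsolationForest.predict` returns `1` for inliers and `-1` for anomalies, and `np.where` turns that into integer row positions, which on a DataFrame are selected with `.iloc`. A minimal sketch on invented data with one obvious outlier (the names `toy`, `forest`, and `flagged` are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(8)
# Invented data: 200 ordinary points near the origin plus one extreme row
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
toy = pd.DataFrame(np.vstack([normal, [[12.0, 12.0]]]), columns=["a", "b"])
# Non-default index labels, similar to X_test after train_test_split
toy.index = toy.index + 1000

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=8)
labels = forest.fit_predict(toy)

positions = np.where(labels == -1)[0]   # integer positions, not index labels
flagged = toy.iloc[positions]           # .iloc selects rows by position

print(len(flagged), "rows flagged out of", len(toy))
```

Because `contamination=0.05` fixes the share of rows labelled `-1`, roughly 5% of rows are flagged even when the data contains no real anomalies.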



In [None]:
# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
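
The scaler learns its mean and standard deviation from the training rows only and reuses them on the test rows, which keeps test information out of the fit. A small sketch with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # invented values
test = np.array([[2.0], [5.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

print(toy_scaler.mean_, toy_scaler.scale_)
print(test_scaled.ravel())
```

Note that `fit_transform` returns NumPy arrays, so column names are lost after this step. Tree-based models split on thresholds and do not require scaling, so the step matters mainly if other algorithms are tried later.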



## Build and Train the Model:

    We'll use a Random Forest Classifier for this example.
    
    Class imbalance: we also adjust the weights of classes so that the model pays more attention to the minority class (promoted).
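
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below checks that formula on invented labels (90 not promoted, 10 promoted):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 90 not promoted (0), 10 promoted (1)
y_toy = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# Same result by hand: n_samples / (n_classes * count per class)
by_hand = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights)))
```

Since the data in this notebook has already been undersampled to roughly equal class sizes, both weights will be close to 1, so the setting has little additional effect here.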


In [None]:
# Initialize the model
model = RandomForestClassifier(
    n_estimators=800
    , class_weight='balanced'
    , random_state=8
)



In [None]:
# Train the model
model.fit(X_train, y_train)



## Evaluate the Model:

In [None]:
# Make predictions
y_pred = model.predict(X_test)



In [None]:
# Evaluate the performance
print("Accuracy:\n", accuracy_score(y_test, y_pred), end="\n-----------------------\n\n")

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred), end="\n-----------------------\n\n")

print("Classification Report:\n", classification_report(y_test, y_pred), end="\n-----------------------\n\n")



In [None]:
# Plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()



## Explanation:

Data Preprocessing:

    Handling missing values, encoding categorical variables, and scaling numerical features are crucial steps in preparing data for modeling.

Model Selection:

    A Random Forest Classifier is chosen for its robustness and ability to handle complex data structures.

Evaluation:

    Accuracy, classification report, and confusion matrix provide insights into the model's performance.
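
To make the link between these metrics concrete, the following sketch derives accuracy, precision, and recall from the four cells of a confusion matrix. The labels are invented for illustration and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration (1 = promoted)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of those predicted promoted, the share that were
recall = tp / (tp + fn)      # of those actually promoted, the share that were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```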


# Insights:
    
    A predictive model can estimate whether an employee is likely to be promoted based on the features provided.
    The classification report and confusion matrix help us understand the model's performance and areas for improvement.
    We can tweak the model parameters or try different algorithms to see which one performs best on our dataset.

    The current model has an accuracy of 58%. Because undersampling leaves the two classes roughly equal in size, random guessing would score about 50%, so the model has low predictive power, although the features do carry some signal.
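
One way to compare algorithms on equal terms is cross-validation against a trivial baseline. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`; the candidate list and settings are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's X and y (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=8)

candidates = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=8),
}

results = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated accuracy for each candidate
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A model is only worth keeping if it clearly beats the baseline row; on the real data, the 58% figure would be read against that reference.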
    
    