# Final Project: Solving HR Problems - Employee Attrition Analysis

- Name: Mahdi shidqi
- Email: m128d5y1057@student.devacademy.id
- Dicoding ID: m128d5y1057

## Business Understanding

### Problem Statement
The company is experiencing a high attrition rate (employees leaving), which leads to:
- Recruitment and training costs for new employees
- Lost productivity and knowledge
- Lower team morale

### Objectives
1. Identify the main factors that influence attrition
2. Build a predictive model to identify high-risk employees
3. Provide actionable recommendations for HR

### Success Criteria
- Model accuracy > 80%
- Clear and actionable insights
- A user-friendly dashboard
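
Because attrition labels are usually imbalanced, an accuracy target is easier to interpret next to the accuracy of a trivial model that always predicts the majority class. The sketch below uses made-up labels; the `labels` list and the `majority_baseline_accuracy` helper are illustrative and not part of the project data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels: 1 = left the company, 0 = stayed
labels = [0] * 83 + [1] * 17
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

If the baseline is already close to the target, metrics such as recall or ROC-AUC on the attrition class say more about the model than accuracy does.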

## Preparation

### Preparing the required libraries

In [4]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import xgboost as xgb

# Model persistence
import joblib

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

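Note: `ModuleNotFoundError: No module named 'xgboost'` means the XGBoost package is not installed in the current environment. Install it (for example with `pip install xgboost`) and rerun the cell above.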
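
The imports above depend on several third-party packages. One way to see which are missing before running the notebook is to query the import system. This is a small sketch, and the `is_installed` helper is an illustrative name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None

# Top-level module names used by this notebook
for name in ["pandas", "numpy", "sklearn", "xgboost"]:
    status = "available" if is_installed(name) else "missing (install it with pip)"
    print(f"{name}: {status}")
```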

### Preparing the data

In [None]:
# Load dataset
df = pd.read_csv('employee_data.csv')

# Display basic info
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

## Data Understanding

In [None]:
# Dataset info
print("Dataset Information:")
df.info()

print("\n" + "="*50)
print("Missing Values:")
print(df.isnull().sum())

print("\n" + "="*50)
print("Statistical Summary:")
df.describe()

In [None]:
# Check attrition distribution
print("Attrition Distribution:")
print(df['Attrition'].value_counts())
print("\nAttrition Rate:")
print(df['Attrition'].value_counts(normalize=True) * 100)

### Exploratory Data Analysis (EDA)

In [None]:
# Create a copy for EDA (with missing values)
df_eda = df.copy()

# Visualize attrition rate
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
df_eda['Attrition'].value_counts().plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
axes[0].set_title('Attrition Count', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Attrition (0=No, 1=Yes)')
axes[0].set_ylabel('Count')
axes[0].tick_params(rotation=0)

# Pie chart
df_eda['Attrition'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', 
                                         colors=['#2ecc71', '#e74c3c'], startangle=90)
axes[1].set_title('Attrition Distribution', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Age distribution by attrition
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(data=df_eda, x='Age', hue='Attrition', kde=True, bins=20)
plt.title('Age Distribution by Attrition', fontsize=14, fontweight='bold')

plt.subplot(1, 2, 2)
sns.boxplot(data=df_eda, x='Attrition', y='Age')
plt.title('Age vs Attrition', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Department analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Department distribution
df_eda['Department'].value_counts().plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Employees by Department', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Department')
axes[0].set_ylabel('Count')
axes[0].tick_params(rotation=45)

# Attrition by department
dept_attrition = df_eda.groupby('Department')['Attrition'].mean() * 100
dept_attrition.plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Attrition Rate by Department', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Department')
axes[1].set_ylabel('Attrition Rate (%)')
axes[1].tick_params(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Monthly Income analysis
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
sns.histplot(data=df_eda, x='MonthlyIncome', hue='Attrition', kde=True, bins=30)
plt.title('Monthly Income Distribution by Attrition', fontsize=14, fontweight='bold')

plt.subplot(1, 2, 2)
sns.boxplot(data=df_eda, x='Attrition', y='MonthlyIncome')
plt.title('Monthly Income vs Attrition', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Work-Life Balance and Satisfaction
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Work-Life Balance
wlb_attrition = df_eda.groupby('WorkLifeBalance')['Attrition'].mean() * 100
wlb_attrition.plot(kind='bar', ax=axes[0, 0], color='#3498db')
axes[0, 0].set_title('Attrition Rate by Work-Life Balance', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Attrition Rate (%)')
axes[0, 0].tick_params(rotation=0)

# Job Satisfaction
job_sat_attrition = df_eda.groupby('JobSatisfaction')['Attrition'].mean() * 100
job_sat_attrition.plot(kind='bar', ax=axes[0, 1], color='#e74c3c')
axes[0, 1].set_title('Attrition Rate by Job Satisfaction', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Attrition Rate (%)')
axes[0, 1].tick_params(rotation=0)

# Environment Satisfaction
env_sat_attrition = df_eda.groupby('EnvironmentSatisfaction')['Attrition'].mean() * 100
env_sat_attrition.plot(kind='bar', ax=axes[1, 0], color='#2ecc71')
axes[1, 0].set_title('Attrition Rate by Environment Satisfaction', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Attrition Rate (%)')
axes[1, 0].tick_params(rotation=0)

# Overtime
overtime_attrition = df_eda.groupby('OverTime')['Attrition'].mean() * 100
overtime_attrition.plot(kind='bar', ax=axes[1, 1], color='#f39c12')
axes[1, 1].set_title('Attrition Rate by Overtime', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Attrition Rate (%)')
axes[1, 1].tick_params(rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
# Select numeric columns only
numeric_cols = df_eda.select_dtypes(include=[np.number]).columns
correlation_matrix = df_eda[numeric_cols].corr()

plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0, 
            linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Top correlations with Attrition (the target's correlation with itself is excluded)
attrition_corr = correlation_matrix['Attrition'].drop('Attrition').sort_values(ascending=False)
print("Top 15 Features Correlated with Attrition:")
print(attrition_corr.head(15))

## Data Preparation / Preprocessing

In [None]:
# Create ML dataset (only rows with Attrition labels)
df_ml = df[df['Attrition'].notna()].copy()
print(f"ML Dataset shape: {df_ml.shape}")
print(f"Attrition distribution:\n{df_ml['Attrition'].value_counts()}")

In [None]:
# Remove unnecessary columns
columns_to_drop = ['EmployeeId', 'EmployeeCount', 'Over18', 'StandardHours']
df_ml = df_ml.drop(columns=columns_to_drop, errors='ignore')

print(f"Shape after dropping columns: {df_ml.shape}")

In [None]:
# Feature Engineering
# Age groups
df_ml['AgeGroup'] = pd.cut(df_ml['Age'], bins=[0, 30, 40, 50, 100], 
                           labels=['<30', '30-40', '40-50', '50+'])

# Tenure groups (include_lowest=True keeps employees with 0 years in the '0-2' bin)
df_ml['TenureGroup'] = pd.cut(df_ml['YearsAtCompany'], bins=[0, 2, 5, 10, 50], 
                              labels=['0-2', '2-5', '5-10', '10+'], include_lowest=True)

# Income level
df_ml['IncomeLevel'] = pd.cut(df_ml['MonthlyIncome'], bins=[0, 3000, 6000, 10000, 20000], 
                              labels=['Low', 'Medium', 'High', 'Very High'])

# Promotion gap
df_ml['PromotionGap'] = df_ml['YearsAtCompany'] - df_ml['YearsSinceLastPromotion']

# Average satisfaction score
df_ml['AvgSatisfaction'] = (df_ml['JobSatisfaction'] + 
                            df_ml['EnvironmentSatisfaction'] + 
                            df_ml['RelationshipSatisfaction']) / 3

print("New features created successfully!")
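
The binning above relies on `pd.cut`, whose intervals are right-closed by default. A value equal to the lowest bin edge therefore falls outside every bin unless `include_lowest=True` is passed. A small sketch on made-up tenure values:

```python
import pandas as pd

# Made-up tenure values, including the boundary case 0
years = pd.Series([0, 1, 2, 3, 10, 11])
bins = [0, 2, 5, 10, 50]
labels = ['0-2', '2-5', '5-10', '10+']

# Default: intervals are (0, 2], (2, 5], ... so 0 is not assigned to any bin
default_cut = pd.cut(years, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 2]
inclusive_cut = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'years': years,
                    'default': default_cut,
                    'include_lowest': inclusive_cut}))
```

The same boundary rule applies to the `Age` and `MonthlyIncome` bins, although a value of exactly 0 is unlikely there.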

In [None]:
# Encode categorical variables
# Binary encoding
binary_cols = ['Gender', 'OverTime']
le = LabelEncoder()
for col in binary_cols:
    if col in df_ml.columns:
        df_ml[col] = le.fit_transform(df_ml[col])

# One-hot encoding for multi-class categorical
categorical_cols = ['BusinessTravel', 'Department', 'EducationField', 'JobRole', 
                   'MaritalStatus', 'AgeGroup', 'TenureGroup', 'IncomeLevel']

df_ml = pd.get_dummies(df_ml, columns=categorical_cols, drop_first=True)

print(f"Shape after encoding: {df_ml.shape}")
print(f"\nColumn names:\n{df_ml.columns.tolist()}")
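
With `drop_first=True`, `pd.get_dummies` omits the first category of each encoded column, so that category is represented by all remaining dummy columns being zero. A toy illustration follows; the department values are made up and need not match the dataset.

```python
import pandas as pd

# Toy frame with three made-up categories
toy = pd.DataFrame({'Department': ['Sales', 'HR', 'R&D', 'Sales']})

# Categories are ordered alphabetically, so 'HR' is the dropped reference level
encoded = pd.get_dummies(toy, columns=['Department'], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The 'HR' row ends up with zeros (or `False`) in both remaining columns, which is how the reference level is encoded.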

In [None]:
# Separate features and target
X = df_ml.drop('Attrition', axis=1)
y = df_ml['Attrition']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:\n{y.value_counts()}")

In [None]:
# Train-test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42, stratify=y)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining target distribution:\n{y_train.value_counts()}")
print(f"\nTest target distribution:\n{y_test.value_counts()}")

In [None]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling completed!")
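
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. The arithmetic behind standardization can be sketched with NumPy on tiny made-up arrays; this mirrors the formula rather than calling `StandardScaler`.

```python
import numpy as np

# Tiny made-up splits with a single feature
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std

print("train scaled:", train_scaled.ravel())
print("test scaled:", test_scaled.ravel())
```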

## Modeling

### Model 1: Logistic Regression (Baseline)

In [None]:
# Train Logistic Regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("Logistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_lr):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_lr):.4f}")

### Model 2: Random Forest

In [None]:
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

# Evaluation
print("Random Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_rf):.4f}")

### Model 3: XGBoost

In [None]:
# Train XGBoost
# (use_label_encoder is omitted: it is deprecated and no longer needed in recent XGBoost releases)
xgb_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Predictions
y_pred_xgb = xgb_model.predict(X_test)
y_pred_proba_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Evaluation
print("XGBoost Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_xgb):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_xgb):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_xgb):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_xgb):.4f}")

In [None]:
# Model comparison
models_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_lr),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_xgb)
    ],
    'Precision': [
        precision_score(y_test, y_pred_lr),
        precision_score(y_test, y_pred_rf),
        precision_score(y_test, y_pred_xgb)
    ],
    'Recall': [
        recall_score(y_test, y_pred_lr),
        recall_score(y_test, y_pred_rf),
        recall_score(y_test, y_pred_xgb)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_lr),
        f1_score(y_test, y_pred_rf),
        f1_score(y_test, y_pred_xgb)
    ],
    'ROC-AUC': [
        roc_auc_score(y_test, y_pred_proba_lr),
        roc_auc_score(y_test, y_pred_proba_rf),
        roc_auc_score(y_test, y_pred_proba_xgb)
    ]
})

print("\nModel Comparison:")
print(models_comparison)

# Visualize comparison
models_comparison.set_index('Model').plot(kind='bar', figsize=(12, 6), rot=0)
plt.title('Model Performance Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

## Evaluation

In [None]:
# Select best model (assuming Random Forest performs best)
best_model = rf_model
y_pred_best = y_pred_rf
y_pred_proba_best = y_pred_proba_rf

print("Best Model: Random Forest")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))
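
The cell above hard-codes Random Forest as the best model. As an alternative, the choice can be read from a comparison table by taking the row with the highest value of a chosen metric. In the sketch below, the model names, the scores, and the `pick_best` helper are placeholders for illustration, not results from this project.

```python
import pandas as pd

# Placeholder scores for illustration only
comparison = pd.DataFrame({
    'Model': ['model_a', 'model_b', 'model_c'],
    'ROC-AUC': [0.70, 0.75, 0.72],
})

def pick_best(table, metric='ROC-AUC'):
    """Return the name of the model with the highest value of `metric`."""
    return table.loc[table[metric].idxmax(), 'Model']

print("Best model by ROC-AUC:", pick_best(comparison))
```

Which metric to rank by is a judgment call. For an imbalanced target, ROC-AUC or recall on the attrition class is often more informative than accuracy.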

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix - Best Model', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
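
The confusion matrix contains everything needed to recompute the headline metrics by hand. The sketch below uses made-up counts in scikit-learn's layout (rows are actual classes, columns are predicted classes); `cm_example` is illustrative and not this project's result.

```python
import numpy as np

# Hypothetical counts: rows = actual (0, 1), columns = predicted (0, 1)
cm_example = np.array([[90, 10],
                       [12, 8]])

# ravel() flattens row by row: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)   # share of predicted leavers who actually left
recall = tp / (tp + fn)      # share of actual leavers the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```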

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_best)
roc_auc = roc_auc_score(y_test, y_pred_proba_best)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Best Model', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# Feature Importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Top 15 features
top_features = feature_importance.head(15)

plt.figure(figsize=(10, 8))
sns.barplot(data=top_features, y='Feature', x='Importance', palette='viridis')
plt.title('Top 15 Most Important Features', fontsize=14, fontweight='bold')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

print("\nTop 15 Features:")
print(top_features)

In [None]:
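# Save the best model (create the output directory first so joblib.dump does not fail)
import os
os.makedirs('models', exist_ok=True)
joblib.dump(best_model, 'models/best_model.pkl')
joblib.dump(scaler, 'models/scaler.pkl')
joblib.dump(X.columns.tolist(), 'models/feature_names.pkl')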

print("Model saved successfully!")
print("Files saved:")
print("- models/best_model.pkl")
print("- models/scaler.pkl")
print("- models/feature_names.pkl")
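
The saved feature names matter at prediction time: new data must be encoded into the same columns, in the same order, as the training matrix. One way to align them is `DataFrame.reindex`. In the sketch below, `feature_names` is a short hypothetical stand-in for the list stored in `models/feature_names.pkl`, and the new row is made up.

```python
import pandas as pd

# Hypothetical stand-in for the saved feature list
feature_names = ['Age', 'OverTime', 'Department_Sales', 'Department_R&D']

# A made-up new employee record
new_rows = pd.DataFrame({'Age': [29], 'OverTime': [1], 'Department': ['Sales']})

# Encode without drop_first: with only one category present,
# drop_first would remove the only dummy column
encoded = pd.get_dummies(new_rows, columns=['Department'])

# Add missing dummy columns as 0, drop extras, and match the training order
aligned = encoded.reindex(columns=feature_names, fill_value=0)
print(aligned)
```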

## Conclusion

### Key Findings

1. **Attrition Rate**: The company has a significant attrition rate that needs to be addressed

2. **Top Factors Affecting Attrition**:
   - **Overtime**: Employees who frequently work overtime have a higher attrition rate
   - **Monthly Income**: Employees with low salaries tend to leave
   - **Work-Life Balance**: Low scores correlate with high attrition
   - **Years at Company**: New employees (0-2 years) are at high risk
   - **Job Satisfaction**: Low job satisfaction increases attrition risk

3. **Model Performance**:
   - Best model: Random Forest
   - Accuracy: >80%
   - The model can predict high-risk employees with good accuracy

### Recommendations

**Immediate Actions**:
1. Reduce overtime requirements, especially for high-risk departments
2. Review and adjust compensation for underpaid employees
3. Implement work-life balance initiatives (flexible hours, remote work)
4. Focus on employee engagement in first 2 years

**Long-term Strategies**:
1. Develop structured career progression pathways
2. Enhance training and development programs
3. Implement regular satisfaction surveys
4. Create mentorship programs for new employees
5. Use predictive model to identify and intervene with at-risk employees
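
Using the model for intervention could start with ranking employees by predicted attrition probability and flagging those above a cutoff. The employee IDs, the probabilities, the 0.5 threshold, and the `flag_at_risk` helper below are all hypothetical.

```python
def flag_at_risk(probabilities, threshold=0.5):
    """Return (employee_id, probability) pairs at or above `threshold`,
    sorted from highest to lowest risk."""
    flagged = [(emp_id, p) for emp_id, p in probabilities.items() if p >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical predicted attrition probabilities keyed by employee ID
scores = {'E001': 0.12, 'E002': 0.81, 'E003': 0.55, 'E004': 0.49}

for emp_id, p in flag_at_risk(scores, threshold=0.5):
    print(f"{emp_id}: {p:.2f}")
```

The threshold trades recall against the number of follow-ups HR can realistically handle, so it is better chosen with HR than fixed at 0.5.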

**Business Impact**:
- Reduced recruitment and training costs
- Improved productivity and knowledge retention
- Better employee morale and company culture
- Data-driven HR decision making