# Model Training

**Objective:** Build and train machine learning models to predict employee attrition

**Models We'll Build:**
1. **Logistic Regression** - Baseline linear model
2. **Random Forest** - Advanced tree-based model

**Evaluation Metrics:**
- Accuracy
- Precision
- Recall
- F1-Score
- ROC-AUC
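
As a quick reference for how these metrics are defined, here is a minimal pure-Python sketch. The labels and scores are toy values invented for illustration (1 = left, 0 = stayed); they are not taken from the attrition dataset.

```python
# Toy labels, predictions, and scores (hypothetical, for illustration only).
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.3, 0.7]

# Confusion-matrix counts for the positive class (1 = left).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)       # share of all predictions that are right
precision = tp / (tp + fp)               # of predicted leavers, how many left
recall = tp / (tp + fn)                  # of actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# ROC-AUC as a ranking statistic: the chance that a randomly chosen positive
# gets a higher score than a randomly chosen negative (ties count as half).
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
roc_auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, precision, recall, f1, roc_auc)
```

Accuracy, precision, recall, and F1 depend on the hard predictions, while ROC-AUC depends only on how the scores rank the two classes.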

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report
)
import joblib
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

All libraries imported successfully!


In [2]:
# Load ML-ready data
print("Loading ML-ready dataset...")

df = pd.read_csv('../data/processed/ml_ready_data.csv')

print("Data loaded!")
print(f"Shape: {df.shape}")

Loading ML-ready dataset...
Data loaded!
Shape: (1470, 59)


In [8]:
# Prepare features and target
print("\nPreparing features and target variable...")

# Target variable
target = 'Attrition_Binary'

# First, check what columns we have
print(f"Total columns: {len(df.columns)}")
print("\nColumn types:")
print(df.dtypes.value_counts())

# Identify categorical columns that need to be excluded
categorical_text_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"\nCategorical text columns found: {len(categorical_text_cols)}")
print(categorical_text_cols)

# Columns to exclude from features
exclude_cols = [
    'EmployeeNumber',      # ID column
    'Attrition',           # Original categorical target
    'Attrition_Binary',    # Target variable
    'TenureGroup',         # Categorical (we'll use YearsAtCompany instead)
    'SalaryBin',           # Categorical (we have MonthlyIncome)
    'AgeGroup',            # Categorical (we have Age)
    'DistanceCategory'     # Categorical (we have DistanceFromHome)
] + categorical_text_cols  # Add ALL remaining text columns

# Remove duplicates
exclude_cols = list(set(exclude_cols))

print(f"\nTotal columns to exclude: {len(exclude_cols)}")

# Create feature matrix X and target y
X = df.drop(columns=exclude_cols, errors='ignore')
y = df[target]

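# Keep only numeric columns. Note: np.number does not match bool,
# so the boolean columns are dropped here along with any text columns.
X = X.select_dtypes(include=[np.number])

print("\nFeatures prepared!")
print(f"  Feature columns: {X.shape[1]}")
print(f"  Training samples: {X.shape[0]}")  # all rows; the train/test split happens later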

print("\nTarget distribution:")
print(y.value_counts())
print(f"\nAttrition rate: {y.mean()*100:.2f}%")

# Show feature names
print("\nFeatures being used for modeling:")
for i, col in enumerate(X.columns, 1):
    print(f"   {i:2d}. {col}")


Preparing features and target variable...
Total columns: 59

Column types:
int64      30
bool       19
object      7
float64     3
Name: count, dtype: int64

Categorical text columns found: 7
['Attrition', 'Gender', 'OverTime', 'TenureGroup', 'SalaryBin', 'AgeGroup', 'DistanceCategory']

Total columns to exclude: 9

Features prepared!
  Feature columns: 31
  Training samples: 1470

Target distribution:
Attrition_Binary
0    1233
1     237
Name: count, dtype: int64

Attrition rate: 16.12%

Features being used for modeling:
    1. Age
    2. DailyRate
    3. DistanceFromHome
    4. Education
    5. EnvironmentSatisfaction
    6. HourlyRate
    7. JobInvolvement
    8. JobLevel
    9. JobSatisfaction
   10. MonthlyIncome
   11. MonthlyRate
   12. NumCompaniesWorked
   13. PercentSalaryHike
   14. PerformanceRating
   15. RelationshipSatisfaction
   16. StockOptionLevel
   17. TotalWorkingYears
   18. TrainingTimesLastYear
   19. WorkLifeBalance
   20. YearsAtCompany
   21. YearsInCurrent

In [9]:
# Check for any remaining missing values
print("\nChecking for missing values...")

missing = X.isnull().sum()
if missing.sum() > 0:
    print("Found missing values:")
    print(missing[missing > 0])
    # Fill with median
    X = X.fillna(X.median())
    print("✓ Filled with median")
else:
    print("No missing values found!")


Checking for missing values...
No missing values found!


In [10]:
# Train-test split
print("\nSplitting data into train and test sets...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,     # For reproducibility
    stratify=y          # Maintain class distribution
)

print("Data split complete!")
print(f"  Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

print("\nClass distribution in train set:")
print(y_train.value_counts(normalize=True))


Splitting data into train and test sets...
Data split complete!
  Training set: 1176 samples (80.0%)
  Test set: 294 samples (20.0%)

Class distribution in train set:
Attrition_Binary
0    0.838435
1    0.161565
Name: proportion, dtype: float64
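
To see what `stratify=y` does in isolation, the following sketch repeats the split on a synthetic target with the same class counts as above (1233 stayed, 237 left). The single placeholder feature is an assumption made only so that `train_test_split` has something to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the class counts reported above; X is a placeholder.
y_demo = np.array([0] * 1233 + [1] * 237)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,   # keep the positive rate nearly equal in both parts
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positives: {train_rate:.4f}, test positives: {test_rate:.4f}")
```

Without stratification, a random 20% sample of a 16% minority class can drift by a few percentage points, which makes test metrics noisier.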


---

## MODEL 1: Logistic Regression

**Why Logistic Regression?**
- Simple, fast, interpretable
- Good baseline model
- Works well with scaled features

In [11]:
# Feature scaling (important for Logistic Regression)
print("Scaling features for Logistic Regression...")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save scaler for future use (create the output folder if it is missing)
import os
os.makedirs('../models', exist_ok=True)
joblib.dump(scaler, '../models/scaler.pkl')

print("Features scaled and scaler saved!")

Scaling features for Logistic Regression...
Features scaled and scaler saved!
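
The scaling step amounts to a z-score computed from training statistics only. Below is a minimal pure-Python sketch of the same idea; the income values are invented, and `pstdev` (population standard deviation) is used because `StandardScaler` divides by the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical income values, for illustration only.
train_income = [2000.0, 4000.0, 6000.0, 8000.0]
test_income = [5000.0, 10000.0]

# "fit": learn the mean and standard deviation from the training data only.
mu = mean(train_income)
sigma = pstdev(train_income)

# "transform": apply the training statistics to both sets.
train_scaled = [(x - mu) / sigma for x in train_income]
test_scaled = [(x - mu) / sigma for x in test_income]

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation on the test set is what keeps information about the test data out of the fitted model.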


In [12]:
# Train Logistic Regression
print("\n" + "="*70)
print("TRAINING MODEL 1: LOGISTIC REGRESSION")
print("="*70)

lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    class_weight='balanced'  # Handle class imbalance
)

print("\nTraining model...")
lr_model.fit(X_train_scaled, y_train)

print("Logistic Regression trained successfully!")


TRAINING MODEL 1: LOGISTIC REGRESSION

Training model...
Logistic Regression trained successfully!
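
For intuition about `class_weight='balanced'`, scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula by hand. The class counts (986 and 190) are inferred from the 1176 training rows and the proportions printed earlier, not read from the notebook directly.

```python
# Training class counts inferred from 1176 rows at 83.8435% / 16.1565%.
n_stayed, n_left = 986, 190
n_samples = n_stayed + n_left
n_classes = 2

# "balanced" heuristic: n_samples / (n_classes * class_count)
w_stayed = n_samples / (n_classes * n_stayed)
w_left = n_samples / (n_classes * n_left)

print(f"weight for stayed (0): {w_stayed:.4f}")
print(f"weight for left   (1): {w_left:.4f}")
print(f"ratio: {w_left / w_stayed:.2f}x")
```

Each misclassified leaver therefore costs roughly five times as much as a misclassified stayer during training, which pushes the model toward higher recall on the minority class at the expense of precision.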


In [13]:
# Make predictions
print("\nMaking predictions...")

y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

print("Predictions complete!")


Making predictions...
Predictions complete!
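
`predict` returns hard labels, which for a binary logistic regression corresponds to cutting the positive-class probability at 0.5, while `predict_proba` exposes the probability itself so that a different cutoff can be applied. The sketch below shows the trade-off on synthetic data from `make_classification`; the dataset, the thresholds, and the choice to score on the training data are assumptions made only to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, imbalanced data standing in for the attrition features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=0
)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]   # probability of the positive class
default_pred = clf.predict(X_demo)        # hard labels at the default cutoff

# Lower thresholds flag more rows as positive: recall rises, precision tends to fall.
recalls = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_demo, pred)
    precision = precision_score(y_demo, pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recalls[threshold]:.3f}")
```

Because ROC-AUC is computed from the probabilities, it is unaffected by the choice of threshold, whereas accuracy, precision, recall, and F1 all change with it.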


In [14]:
# Evaluate Logistic Regression
print("\nLOGISTIC REGRESSION PERFORMANCE")
print("="*70)

lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_roc_auc = roc_auc_score(y_test, y_pred_proba_lr)

print(f"Accuracy:  {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print(f"Precision: {lr_precision:.4f} ({lr_precision*100:.2f}%)")
print(f"Recall:    {lr_recall:.4f} ({lr_recall*100:.2f}%)")
print(f"F1-Score:  {lr_f1:.4f}")
print(f"ROC-AUC:   {lr_roc_auc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Stayed', 'Left']))


LOGISTIC REGRESSION PERFORMANCE
Accuracy:  0.7415 (74.15%)
Precision: 0.3505 (35.05%)
Recall:    0.7234 (72.34%)
F1-Score:  0.4722
ROC-AUC:   0.7997

Classification Report:
              precision    recall  f1-score   support

      Stayed       0.93      0.74      0.83       247
        Left       0.35      0.72      0.47        47

    accuracy                           0.74       294
   macro avg       0.64      0.73      0.65       294
weighted avg       0.84      0.74      0.77       294
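
The report above can be turned back into approximate confusion-matrix counts, which helps put the accuracy figure in context. The counts below are inferred from the rounded precision and recall, so treat them as a reconstruction rather than notebook output.

```python
# Test-set class sizes and "Left" metrics as reported above.
n_stayed, n_left = 247, 47
recall_left, precision_left = 0.7234, 0.3505

tp = round(recall_left * n_left)              # leavers correctly flagged
predicted_left = round(tp / precision_left)   # everyone the model flagged
fp = predicted_left - tp                      # stayers wrongly flagged
fn = n_left - tp                              # leavers missed
tn = n_stayed - fp                            # stayers correctly passed

accuracy = (tp + tn) / (n_stayed + n_left)

# Accuracy of a trivial model that always predicts "Stayed".
majority_baseline = n_stayed / (n_stayed + n_left)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.4f}  always-stayed baseline={majority_baseline:.4f}")
```

The reconstructed accuracy matches the reported 74.15%, and it sits below the roughly 84% that always predicting "Stayed" would achieve. That is why recall, precision, and ROC-AUC are more informative than accuracy for this imbalanced target.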



In [15]:
# Cross-validation for Logistic Regression
print("\nCross-Validation (5-Fold)...")

lr_cv_scores = cross_val_score(
    lr_model, X_train_scaled, y_train,
    cv=5, scoring='roc_auc'
)

print(f"Cross-validation ROC-AUC scores: {lr_cv_scores}")
print(f"Mean CV ROC-AUC: {lr_cv_scores.mean():.4f} (+/- {lr_cv_scores.std():.4f})")


Cross-Validation (5-Fold)...
Cross-validation ROC-AUC scores: [0.67995747 0.83449105 0.82006412 0.81138124 0.86254341]
Mean CV ROC-AUC: 0.8017 (+/- 0.0633)
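
One caveat with cross-validating on `X_train_scaled`: the scaler was fitted on the whole training set, so each validation fold has already influenced the scaling statistics. A common alternative, sketched here on synthetic data, is to wrap the scaler and model in a pipeline so that scaling is refitted inside every fold. The dataset and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the unscaled training data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=42
)

# The pipeline refits StandardScaler on the training part of each fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"Mean CV ROC-AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

For standardization the effect of this leakage is usually small, but the pipeline form also has the practical benefit that a single object can be saved and reused for prediction.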


In [16]:
# Save Logistic Regression model
joblib.dump(lr_model, '../models/logistic_model.pkl')
print("\nLogistic Regression model saved: models/logistic_model.pkl")


Logistic Regression model saved: models/logistic_model.pkl
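
When the saved artifacts are reused later, the scaler has to be applied before the logistic model, in the same order as during training. The following round-trip sketch uses a temporary directory and synthetic data instead of the notebook's `../models/` folder, so the paths and data here are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train a small stand-in scaler and model on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
scaler = StandardScaler().fit(X_demo)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_demo), y_demo)

# Save both artifacts, then load them back as a later session would.
with tempfile.TemporaryDirectory() as tmp:
    scaler_path = Path(tmp) / "scaler.pkl"
    model_path = Path(tmp) / "logistic_model.pkl"
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New rows must go through the loaded scaler before the loaded model.
new_rows = X_demo[:5]
reloaded_proba = loaded_model.predict_proba(loaded_scaler.transform(new_rows))[:, 1]
original_proba = model.predict_proba(scaler.transform(new_rows))[:, 1]
print(reloaded_proba)
```

New data also needs the same feature columns, in the same order, as the matrix used for fitting.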


---

## MODEL 2: Random Forest

**Why Random Forest?**
- Handles non-linear relationships
- Provides feature importance
- Often more accurate than a linear baseline, though this is not guaranteed
- No need for feature scaling

In [17]:
# Train Random Forest
print("\n" + "="*70)
print("TRAINING MODEL 2: RANDOM FOREST")
print("="*70)

rf_model = RandomForestClassifier(
    n_estimators=100,        # Number of trees
    max_depth=10,            # Prevent overfitting
    min_samples_split=20,    # Minimum samples to split
    min_samples_leaf=10,     # Minimum samples in leaf
    random_state=42,
    class_weight='balanced', # Handle imbalance
    n_jobs=-1                # Use all CPU cores
)

print("\nTraining Random Forest (this may take 30-60 seconds)...")
rf_model.fit(X_train, y_train)  # No scaling needed for trees!

print("Random Forest trained successfully!")


TRAINING MODEL 2: RANDOM FOREST

Training Random Forest (this may take 30-60 seconds)...
Random Forest trained successfully!


In [18]:
# Make predictions
print("\nMaking predictions...")

y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

print("Predictions complete!")


Making predictions...
Predictions complete!


In [19]:
# Evaluate Random Forest
print("\nRANDOM FOREST PERFORMANCE")
print("="*70)

rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_roc_auc = roc_auc_score(y_test, y_pred_proba_rf)

print(f"Accuracy:  {rf_accuracy:.4f} ({rf_accuracy*100:.2f}%)")
print(f"Precision: {rf_precision:.4f} ({rf_precision*100:.2f}%)")
print(f"Recall:    {rf_recall:.4f} ({rf_recall*100:.2f}%)")
print(f"F1-Score:  {rf_f1:.4f}")
print(f"ROC-AUC:   {rf_roc_auc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Stayed', 'Left']))


RANDOM FOREST PERFORMANCE
Accuracy:  0.8231 (82.31%)
Precision: 0.4545 (45.45%)
Recall:    0.5319 (53.19%)
F1-Score:  0.4902
ROC-AUC:   0.7804

Classification Report:
              precision    recall  f1-score   support

      Stayed       0.91      0.88      0.89       247
        Left       0.45      0.53      0.49        47

    accuracy                           0.82       294
   macro avg       0.68      0.71      0.69       294
weighted avg       0.84      0.82      0.83       294



In [20]:
# Cross-validation for Random Forest
print("\nCross-Validation (5-Fold)...")

rf_cv_scores = cross_val_score(
    rf_model, X_train, y_train,
    cv=5, scoring='roc_auc'
)

print(f"Cross-validation ROC-AUC scores: {rf_cv_scores}")
print(f"Mean CV ROC-AUC: {rf_cv_scores.mean():.4f} (+/- {rf_cv_scores.std():.4f})")


Cross-Validation (5-Fold)...
Cross-validation ROC-AUC scores: [0.67450824 0.8371627  0.80977825 0.75514293 0.82433877]
Mean CV ROC-AUC: 0.7802 (+/- 0.0598)


In [21]:
# Feature Importance (Random Forest)
print("\nTOP 15 MOST IMPORTANT FEATURES")
print("="*70)

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print(feature_importance.head(15).to_string(index=False))


TOP 15 MOST IMPORTANT FEATURES
             Feature  Importance
                 Age    0.077171
       MonthlyIncome    0.068291
           WLB_Index    0.062073
      YearsAtCompany    0.059051
   TotalWorkingYears    0.058869
YearsWithCurrManager    0.057148
           DailyRate    0.046561
    StockOptionLevel    0.045078
    Income_Age_Ratio    0.044418
   TotalSatisfaction    0.039326
    OverTime_Numeric    0.039254
  NumCompaniesWorked    0.035645
     OverTime_Binary    0.033402
         MonthlyRate    0.032690
    DistanceFromHome    0.032221
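
The table lists both `OverTime_Numeric` and `OverTime_Binary`. If these two columns encode the same underlying field (an assumption based only on their names), the forest can split on either one, so the importance of overtime is divided between them. A rough way to gauge the combined effect is to add the two values, as in this small sketch using the importances printed above.

```python
# Selected importances copied from the table above.
importances = {
    "Age": 0.077171,
    "MonthlyIncome": 0.068291,
    "WLB_Index": 0.062073,
    "OverTime_Numeric": 0.039254,
    "OverTime_Binary": 0.033402,
}

# Rough heuristic: sum the importances of columns assumed to be duplicates.
combined_overtime = importances["OverTime_Numeric"] + importances["OverTime_Binary"]

print(f"Combined OverTime importance: {combined_overtime:.6f}")
print(f"Age: {importances['Age']:.6f}, MonthlyIncome: {importances['MonthlyIncome']:.6f}")
```

Under that assumption, overtime would rank just behind `Age` rather than 11th. Summing is only an approximation; dropping one of the duplicate columns and retraining would give a cleaner answer.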


In [22]:
# Save Random Forest model
joblib.dump(rf_model, '../models/random_forest_model.pkl')
print("\nRandom Forest model saved: models/random_forest_model.pkl")


Random Forest model saved: models/random_forest_model.pkl


---

## MODEL COMPARISON

In [23]:
# Compare both models
print("\n" + "="*70)
print("FINAL MODEL COMPARISON")
print("="*70)

comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
    'Logistic Regression': [lr_accuracy, lr_precision, lr_recall, lr_f1, lr_roc_auc],
    'Random Forest': [rf_accuracy, rf_precision, rf_recall, rf_f1, rf_roc_auc]
})

print(comparison.to_string(index=False))

# Determine winner (by test-set ROC-AUC only)
print("\nBEST MODEL:")
if rf_roc_auc > lr_roc_auc:
    print(f"   Random Forest (ROC-AUC: {rf_roc_auc:.4f})")
    best_model = 'Random Forest'
else:
    print(f"   Logistic Regression (ROC-AUC: {lr_roc_auc:.4f})")
    best_model = 'Logistic Regression'


FINAL MODEL COMPARISON
   Metric  Logistic Regression  Random Forest
 Accuracy             0.741497       0.823129
Precision             0.350515       0.454545
   Recall             0.723404       0.531915
 F1-Score             0.472222       0.490196
  ROC-AUC             0.799724       0.780429

BEST MODEL:
   Logistic Regression (ROC-AUC: 0.7997)
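
The winner above is chosen from a single test split. As a sanity check, the per-fold cross-validation scores printed earlier can be compared directly. This sketch assumes both `cross_val_score` calls used the same folds, which should hold because `cv=5` with a classifier uses unshuffled stratified folds on the same `y_train`.

```python
from statistics import mean, pstdev

# Per-fold ROC-AUC scores copied from the cross-validation outputs above.
lr_folds = [0.67995747, 0.83449105, 0.82006412, 0.81138124, 0.86254341]
rf_folds = [0.67450824, 0.8371627, 0.80977825, 0.75514293, 0.82433877]

# Fold-by-fold difference (positive means Logistic Regression scored higher).
diffs = [lr - rf for lr, rf in zip(lr_folds, rf_folds)]
lr_wins = sum(d > 0 for d in diffs)

print(f"LR mean: {mean(lr_folds):.4f} (std {pstdev(lr_folds):.4f})")
print(f"RF mean: {mean(rf_folds):.4f} (std {pstdev(rf_folds):.4f})")
print(f"Mean difference: {mean(diffs):.4f}, LR higher in {lr_wins} of {len(diffs)} folds")
```

Logistic Regression scores higher in four of the five folds, which agrees with the test-set result, but the average gap of about 0.02 is small compared with the fold-to-fold spread of about 0.06, so the ranking should be read as tentative.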


---

## Model Training Complete!

### Summary:

**Models Trained:**
1. Logistic Regression (baseline)
2. Random Forest (advanced)

**Files Saved:**
- `models/scaler.pkl` - Feature scaler
- `models/logistic_model.pkl` - Logistic Regression
- `models/random_forest_model.pkl` - Random Forest

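**Key Metrics:**
- Logistic Regression has the higher test ROC-AUC (0.7997 vs. 0.7804) and recall (72.3% vs. 53.2%), so it is selected as the best model
- Random Forest has the higher accuracy (82.3% vs. 74.2%), precision (45.5% vs. 35.1%), and F1-score (0.490 vs. 0.472)
- Random Forest feature importance ranks **Age** first, followed by **MonthlyIncome** and **WLB_Index**; the two OverTime encodings rank 11th and 13th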

---

**Next Step:** Proceed to `06_model_evaluation.ipynb` for detailed analysis