# Employee Salary Analytics - Salary Modeling

This notebook builds and evaluates machine learning models to predict employee salaries.

## Objectives:
1. Load processed dataset
2. Prepare features for modeling
3. Train and evaluate models
4. Analyze feature importance
5. Extract business insights


In [1]:
# Import necessary libraries
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add src directory to path
project_root = Path().resolve().parent
sys.path.append(str(project_root / 'src'))

# Import custom modules
from load_data import load_processed_data
from modeling import (
    prepare_features, train_models, evaluate, plot_feature_importance,
    train_test_split_data
)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


## 1. Load Processed Dataset


In [2]:
# Load processed data
df = load_processed_data('salaries_clean.csv')

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {len(df.columns)}")


✓ Loaded 1200 rows and 23 columns from salaries_clean.csv

Dataset shape: (1200, 23)
Columns: 23


## 2. Prepare Features

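Extract the features (numeric columns plus encoded categoricals) and the target variable (`salary_usd`). In this run the helper reports all 14 features as numeric and none as newly encoded, which suggests the one-hot columns were already present in `salaries_clean.csv`.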
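
The internals of `prepare_features` are not shown in this notebook. As a rough sketch of what "numeric + encoded categoricals" could look like, assuming one-hot encoding with `pandas.get_dummies` (the helper name `prepare_features_sketch`, the `drop_first=True` choice, and the toy columns are all hypothetical):

```python
import pandas as pd


def prepare_features_sketch(df, target_col):
    """Hypothetical stand-in for prepare_features: numeric columns plus one-hot categoricals."""
    y = df[target_col]
    features = df.drop(columns=[target_col])

    numeric = features.select_dtypes(include="number")
    categorical = features.select_dtypes(exclude="number")

    # drop_first=True drops one level per categorical column (an assumption, not confirmed by the source)
    encoded = pd.get_dummies(categorical, drop_first=True, dtype=int)

    X = pd.concat([numeric, encoded], axis=1)
    return X, y


# Toy rows for illustration only
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "job_title": ["developer", "manager", "designer"],
    "salary_usd": [70000, 120000, 65000],
})
X_toy, y_toy = prepare_features_sketch(toy, target_col="salary_usd")
print(X_toy.columns.tolist())
```

Encoded column names follow the `column_value` pattern, which matches names such as `job_title_manager` in the output below.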


In [3]:
# Prepare features and target
X, y = prepare_features(df, target_col='salary_usd')

print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns ({len(X.columns)}):")
print(X.columns.tolist()[:10], "..." if len(X.columns) > 10 else "")


✓ Prepared 14 features for modeling
  Numeric features: 14
  Encoded categorical features: 0

Features shape: (1200, 14)
Target shape: (1200,)

Feature columns (14):
['age', 'experience_years', 'bonus_usd', 'work_hours_per_week', 'performance_score', 'joining_year', 'job_title_designer', 'job_title_developer', 'job_title_manager', 'education_high school'] ...


## 3. Train-Test Split

Split the data into training (80%) and testing (20%) sets.
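
The `train_test_split_data` helper takes the same `test_size` and `random_state` arguments as scikit-learn's `train_test_split`, so it may be a thin wrapper around it; that is an assumption. A minimal sketch of an 80/20 split on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (10 rows, 2 features)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```

With 1200 rows, the same ratio gives the 960/240 split reported below.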


In [4]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split_data(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")


✓ Split data: Train=960, Test=240

Training set: 960 samples
Test set: 240 samples


## 4. Train Models

Train both Linear Regression and Random Forest models.
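
The body of `train_models` is not shown here. A sketch of what fitting the two model types could look like with scikit-learn, on synthetic data (the function name `train_models_sketch` and the hyperparameters are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def train_models_sketch(X_train, y_train, random_state=42):
    """Hypothetical stand-in for train_models: fit one linear and one tree-ensemble model."""
    linear = LinearRegression().fit(X_train, y_train)
    forest = RandomForestRegressor(n_estimators=100, random_state=random_state).fit(X_train, y_train)
    return linear, forest


# Synthetic data: the target depends only on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

linear_demo, forest_demo = train_models_sketch(X_demo, y_demo)
print(linear_demo.coef_.round(2))
```

Fixing `random_state` on the forest keeps repeated runs comparable, since each tree is grown on a random bootstrap sample.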


In [5]:
# Train both models
linear_model, rf_model = train_models(X_train, y_train)


Training Models

Training Linear Regression...
✓ Linear Regression trained

Training Random Forest Regressor...
✓ Random Forest Regressor trained



## 5. Evaluate Models

Evaluate both models using MAE, RMSE, and R² metrics.
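
For reference, the three metrics can be computed directly from their definitions. The sketch below uses NumPy only; `regression_metrics` is a hypothetical helper, and the dictionary keys simply mirror the `'mae'`, `'rmse'`, and `'r2'` keys that `evaluate` returns later in this notebook:

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))             # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))      # penalizes large errors more heavily
    ss_res = np.sum(errors ** 2)              # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variance around the mean of y_true
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(metrics)
```

R² is 0 when the predictions equal the mean of `y_true` and turns negative when the model does worse than that constant prediction.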


In [6]:
# Evaluate Linear Regression
print("=" * 60)
print("LINEAR REGRESSION MODEL")
print("=" * 60)
linear_results = evaluate(linear_model, X_test, y_test)


LINEAR REGRESSION MODEL
Model Evaluation Metrics
MAE (Mean Absolute Error):  $29,887.06
RMSE (Root Mean Squared Error): $34,567.30
R² (Coefficient of Determination): -0.0047



In [7]:
# Evaluate Random Forest
print("=" * 60)
print("RANDOM FOREST MODEL")
print("=" * 60)
rf_results = evaluate(rf_model, X_test, y_test)


RANDOM FOREST MODEL
Model Evaluation Metrics
MAE (Mean Absolute Error):  $29,730.31
RMSE (Root Mean Squared Error): $34,573.92
R² (Coefficient of Determination): -0.0051



## 6. Model Comparison

Compare the performance of both models side-by-side.


In [8]:
# Create comparison table
comparison_df = pd.DataFrame({
    'Linear Regression': {
        'MAE': f"${linear_results['mae']:,.2f}",
        'RMSE': f"${linear_results['rmse']:,.2f}",
        'R²': f"{linear_results['r2']:.4f}"
    },
    'Random Forest': {
        'MAE': f"${rf_results['mae']:,.2f}",
        'RMSE': f"${rf_results['rmse']:,.2f}",
        'R²': f"{rf_results['r2']:.4f}"
    }
})

print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
display(comparison_df)
print("=" * 60)



MODEL COMPARISON


| Metric | Linear Regression | Random Forest |
|---|---|---|
| MAE | $29,887.06 | $29,730.31 |
| RMSE | $34,567.30 | $34,573.92 |
| R² | -0.0047 | -0.0051 |
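
An R² slightly below zero means each model does marginally worse on the test set than a constant prediction equal to the test-set mean. One way to make that reference point explicit is a mean-only baseline. The sketch below uses synthetic targets, which is an assumption for illustration and not a claim about this dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic targets only; the values are illustrative and not drawn from salaries_clean.csv
y_train_demo = rng.normal(loc=90_000, scale=35_000, size=960)
y_test_demo = rng.normal(loc=90_000, scale=35_000, size=240)

# Baseline: predict the training-set mean for every test row
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())

errors = y_test_demo - baseline_pred
baseline_mae = np.mean(np.abs(errors))
baseline_rmse = np.sqrt(np.mean(errors ** 2))
baseline_r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_test_demo - y_test_demo.mean()) ** 2)

print(f"Baseline MAE:  ${baseline_mae:,.2f}")
print(f"Baseline RMSE: ${baseline_rmse:,.2f}")
print(f"Baseline R²:   {baseline_r2:.4f}")
```

If the trained models' MAE and RMSE land close to such a baseline, the features carry little usable signal for the target.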




## 7. Feature Importance Analysis

Visualize which features are most important for salary prediction (Random Forest model).


In [9]:
# Plot feature importance for Random Forest
plot_feature_importance(rf_model, X_train.columns)


✓ Saved feature importance plot to /Users/aviyamegiddoshaked/employee-salary-analytics/reports/plots/feature_importance.png


In [10]:
# Get top features by importance
feature_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)

print("\nTop 15 Most Important Features:")
display(feature_importance_df)



Top 15 Most Important Features:


| | feature | importance |
|---|---|---|
| 2 | bonus_usd | 0.209207 |
| 0 | age | 0.14267 |
| 1 | experience_years | 0.140756 |
| 3 | work_hours_per_week | 0.130637 |
| 5 | joining_year | 0.128648 |
| 4 | performance_score | 0.088599 |
| 13 | contract_type_part-time | 0.022647 |
| 8 | job_title_manager | 0.021233 |
| 9 | education_high school | 0.020062 |
| 10 | education_master | 0.019878 |


## 8. Business Insights

### 8.1 Which Features Explain Salary Most?

Based on the Random Forest feature importance analysis:


In [11]:
# Analyze top features
top_features = feature_importance_df.head(10)

print("Top 10 Features Explaining Salary:")
for idx, row in top_features.iterrows():
    print(f"  {row['feature']}: {row['importance']:.4f} ({row['importance']*100:.2f}%)")


Top 10 Features Explaining Salary:
  bonus_usd: 0.2092 (20.92%)
  age: 0.1427 (14.27%)
  experience_years: 0.1408 (14.08%)
  work_hours_per_week: 0.1306 (13.06%)
  joining_year: 0.1286 (12.86%)
  performance_score: 0.0886 (8.86%)
  contract_type_part-time: 0.0226 (2.26%)
  job_title_manager: 0.0212 (2.12%)
  education_high school: 0.0201 (2.01%)
  education_master: 0.0199 (1.99%)


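**Key Insights:**
- `bonus_usd` has the highest importance (20.9%), followed by `age` (14.3%), `experience_years` (14.1%), `work_hours_per_week` (13.1%), and `joining_year` (12.9%). Together with `performance_score` (8.9%), these six numeric features account for about 84% of total importance.
- The one-hot job title, education, and contract type features each contribute roughly 2% or less, and no department feature appears in the top 10.
- Both models score a slightly negative R² on the test set (-0.0047 and -0.0051), so they predict no better than always guessing the mean salary. The importances therefore describe how the forest split the training data, not reliable drivers of salary.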
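
Impurity-based importances from `feature_importances_` are computed on the training data. A complementary check is permutation importance on a held-out split, which measures how much the test score drops when a feature is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the feature names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
# Only the first column drives the target; the other two are pure noise
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and measure the drop in R²
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

names = ["signal", "noise_a", "noise_b"]
for name, mean, std in zip(names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Features whose permutation importance is near zero on held-out data are unlikely to be helping the model generalize, whatever their impurity-based ranking.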


### 8.2 Which Roles/Departments are Underpaid or Overpaid?

Analyze salary differences by role and department compared to predicted values.


In [12]:
# Get predictions for all data
# Note: 960 of these 1200 rows were used to fit rf_model, so most residuals are in-sample
y_pred_all = rf_model.predict(X)

# Add predictions and residuals to original dataframe
df_analysis = df.copy()
df_analysis['predicted_salary'] = y_pred_all
df_analysis['residual'] = df_analysis['salary_usd'] - df_analysis['predicted_salary']
df_analysis['residual_pct'] = (df_analysis['residual'] / df_analysis['predicted_salary']) * 100

# Analyze by job title
if 'job_title' in df_analysis.columns:
    job_title_analysis = df_analysis.groupby('job_title').agg({
        'salary_usd': 'mean',
        'predicted_salary': 'mean',
        'residual': 'mean',
        'residual_pct': 'mean'
    }).sort_values('residual_pct', ascending=False)
    
    print("\nJob Title Analysis (Average Residual %):")
    print("Positive = Overpaid relative to model prediction")
    print("Negative = Underpaid relative to model prediction\n")
    display(job_title_analysis.round(2))


In [13]:
# Analyze by department
if 'department' in df_analysis.columns:
    dept_analysis = df_analysis.groupby('department').agg({
        'salary_usd': 'mean',
        'predicted_salary': 'mean',
        'residual': 'mean',
        'residual_pct': 'mean'
    }).sort_values('residual_pct', ascending=False)
    
    print("\nDepartment Analysis (Average Residual %):")
    print("Positive = Overpaid relative to model prediction")
    print("Negative = Underpaid relative to model prediction\n")
    display(dept_analysis.round(2))



Department Analysis (Average Residual %):
Positive = Overpaid relative to model prediction
Negative = Underpaid relative to model prediction



| department | salary_usd | predicted_salary | residual | residual_pct |
|---|---|---|---|---|
| marketing | 90131.29 | 88930.84 | 1200.46 | -1.97 |
| finance | 92749.64 | 92024.85 | 724.78 | -2.17 |
| hr | 90762.11 | 91033.37 | -271.26 | -3.22 |
| it | 85660.4 | 88518.76 | -2858.36 | -6.25 |
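
In the table, marketing and finance have a positive mean `residual` but a negative mean `residual_pct`. That is possible because `residual_pct` is averaged per employee instead of being computed from the group means. A toy example with made-up numbers shows the effect:

```python
import pandas as pd

# Made-up numbers chosen to show the effect; not taken from the dataset
toy = pd.DataFrame({
    "salary_usd": [150_000, 20_000],
    "predicted_salary": [100_000, 60_000],
})
toy["residual"] = toy["salary_usd"] - toy["predicted_salary"]
toy["residual_pct"] = toy["residual"] / toy["predicted_salary"] * 100

mean_residual = toy["residual"].mean()        # average dollar gap
mean_of_pct = toy["residual_pct"].mean()      # average of per-row percentages
pct_of_means = mean_residual / toy["predicted_salary"].mean() * 100  # percentage from group means

print(f"mean residual: {mean_residual:,.0f}")
print(f"mean of residual_pct: {mean_of_pct:.2f}")
print(f"residual as % of mean prediction: {pct_of_means:.2f}")
```

Here the mean dollar residual is positive while the mean percentage residual is negative, because a shortfall on a low predicted salary weighs more in percentage terms than an equal-sized surplus on a high one.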


**Key Insights:**
- All four departments have a negative mean `residual_pct`, ranging from -1.97% (marketing) to -6.25% (IT), so in percentage terms employees in every department earn less on average than the model predicts.
- In dollar terms, marketing (+$1,200) and finance (+$725) sit slightly above their predicted means, HR is roughly level (-$271), and IT is furthest below (-$2,858).
- Because the model's test R² is slightly negative, these gaps are weak evidence of over- or underpayment and should be read as prompts for further review, not as findings.
- Consider market rates, retention needs, and internal equity when interpreting these results for HR decisions.
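
The residuals above come from `rf_model.predict(X)`, which scores rows the forest was trained on, and in-sample residuals for a random forest tend to be smaller than those on unseen rows. One alternative, sketched here on synthetic data with hypothetical values, is to compute residuals from out-of-fold predictions with scikit-learn's `cross_val_predict`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic data for illustration; not drawn from salaries_clean.csv
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "experience_years": rng.integers(0, 30, size=n),
    "department": rng.choice(["finance", "hr", "it", "marketing"], size=n),
})
demo["salary_usd"] = 50_000 + 2_000 * demo["experience_years"] + rng.normal(scale=5_000, size=n)

X_demo = demo[["experience_years"]]
y_demo = demo["salary_usd"]

# With cv=5, each row is predicted by a model fitted on the other four folds
forest = RandomForestRegressor(n_estimators=50, random_state=42)
demo["predicted_salary"] = cross_val_predict(forest, X_demo, y_demo, cv=5)
demo["residual"] = demo["salary_usd"] - demo["predicted_salary"]

print(demo.groupby("department")["residual"].mean().round(2))
```

Every row then gets a prediction from a model that never saw it, so group-level residuals are not flattered by memorized training rows.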
### Insights
- The dataset appears complete, with no missing values.
- All columns have data for all 1,200 employees.
