# Salary Prediction Model Report

## Table of Contents

1. [Introduction](#introduction)  
2. [Project Setup & Kedro Context](#setup)  
3. [Data Overview](#data-overview)  
   - 3.1. [Raw Datasets](#raw-datasets)  
   - 3.2. [Target Variable](#target-variable)  
   - 3.3. [Missing Data & Quality Checks](#missing-data)  
4. [Preprocessing & Feature Engineering](#feature-engineering)  
   - 4.1. [Preprocessing Versions](#preprocessing-versions)  
   - 4.2. [Excluded Features (e.g., Text)](#excluded-features)  
5. [Modeling Approach](#modeling-approach)  
   - 5.1. [Baseline & Linear Models](#baseline-linear)  
   - 5.2. [Regularized Models (Lasso, ElasticNet)](#regularized-models)  
   - 5.3. [Tree-Based Models (RF, XGBoost)](#tree-models)  
   - 5.4. [Hyperparameter Optimization](#hyperparameter-optimization)  
   - 5.5. [SHAP-Based Feature Selection](#shap-selection)  
6. [Model Evaluation](#evaluation)  
   - 6.1. [Evaluation Metrics](#metrics-explained)  
   - 6.2. [Confidence Intervals](#confidence-intervals)  
   - 6.3. [Performance Comparison Table](#comparison-table)  
   - 6.4. [Performance Visualization](#comparison-plot)  
7. [Best Model Deep Dive](#best-model)  
   - 7.1. [Feature Importance (SHAP)](#feature-importance)  
   - 7.2. [Residual Analysis](#residual-analysis)  
8. [Conclusion & Next Steps](#conclusion)  

## 1. Introduction

This project was developed as part of a technical challenge. The goal is to design, implement, and evaluate a machine learning model capable of predicting an individual's salary based on structured features provided in a public dataset.

The dataset includes the following columns:
- Age
- Gender
- Education level
- Job title
- Years of experience
- Description (free-text field)

The predictive task is to estimate the salary from these features, using Python-based tools and best practices in modular code development.

### Challenge Goals

The technical requirements for the challenge included:
- Developing a predictive model with a clean and modular Python codebase
- Applying preprocessing and feature transformations
- Evaluating model performance using appropriate metrics and confidence intervals
- Including a clear comparison with a sensible baseline (e.g., DummyRegressor)
- Presenting final results in a well-documented Jupyter Notebook
- Hosting the solution in a public GitHub repository

### Project Design Overview

To address the challenge in a structured and scalable way, this project was implemented using **[Kedro](https://kedro.org/)**, a Python framework for reproducible, modular data science workflows.

The core of this project is divided into:
1. **Preprocessing Pipelines**: Structured in two alternative versions (v1 and v2) to experiment with different feature engineering strategies.
2. **Modeling Pipeline**: Trains and evaluates a wide range of regressors, including baseline, linear, regularized, and tree-based models.
3. **Reporting Pipeline**: Aggregates and visualizes model performance, selects top models, and explains results using tools like SHAP.

Although the dataset includes a free-text `Description` field, it was **intentionally excluded** from modeling in this version, to keep the focus on structured features and simplify the development pipeline.


## 2. Project Setup & Kedro Context

This project follows a modular structure using **Kedro**, which organizes code into independent pipelines for data preprocessing, modeling, and reporting.

The directory structure includes:
- `src/salary_prediction/pipelines/data_processing`: Data cleaning, splitting, and feature engineering (v1 and v2)
- `src/salary_prediction/pipelines/data_science`: Model training, hyperparameter optimization, SHAP analysis
- `src/salary_prediction/pipelines/reporting`: Visualization, model comparison, and selection

Kedro allows us to:
- Run individual pipelines, or subsets of their nodes, with optional dataset versioning
- Pass parameters via `parameters.yml`
- Use a `DataCatalog` to manage datasets cleanly

The first step is to load the Kedro context so we can interact with the pipelines and datasets from within the notebook.


In [1]:
import warnings
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.context import KedroContext
# Note: _create_hook_manager is a private Kedro helper and may change between releases
from kedro.framework.hooks import _create_hook_manager
from kedro.framework.project import configure_project

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Assumes the notebook runs from ./notebooks, so the project root is one level up
project_path = Path.cwd().parent
configure_project("salary_prediction")

# Point the config loader at the project's conf/ directory
conf_path = project_path / "conf"
env = "local"
config_loader = OmegaConfigLoader(conf_source=str(conf_path))
hook_manager = _create_hook_manager()

# Build the context by hand; its catalog exposes every registered dataset
context = KedroContext(
    package_name="salary_prediction",
    project_path=project_path,
    config_loader=config_loader,
    hook_manager=hook_manager,
    env=env,
)

catalog = context.catalog
print("Kedro context initialized successfully.")

Kedro context initialized successfully.


## 3. Data Overview

The original dataset includes records for individuals with the following fields:

- `Age`
- `Gender`
- `Education Level`
- `Job Title`
- `Years of Experience`
- `Description` (free text)
- `Salary` (target)

In this challenge, we chose to **exclude the `Description` field** to focus solely on structured features. Future iterations could experiment with NLP techniques to incorporate this field.

### 3.1. Raw Datasets

Dataset composition:
- Total number of samples: 375
- Salary dataset: 375 rows, 2 columns (id, Salary)
- People dataset: 375 rows, 6 columns (id, Age, Gender, Education Level, Job Title, Years of Experience)

Education Levels:
- Bachelor's: 222 samples
- Master's: 97 samples
- PhD: 51 samples

Job Titles:
- Diverse range of job titles (50+ unique titles)
- Top job titles include:
  - Director of Marketing (12 samples)
  - Director of Operations (10 samples)

Additional Features:
- Age: Varies across the dataset
- Gender: Male and Female
- Years of Experience: Ranges from entry-level to senior positions

### 3.2. Target Variable

Salary Distribution:
- Total valid salary entries: 373 (2 null values)
- Mean Salary: $100,577
- Median Salary: $95,000
- Salary Range: $350 - $250,000
- Standard Deviation: $48,240

### 3.3. Missing Data & Quality Checks

Missing Data in Salary Dataset:
- Total samples: 375
- Valid salary entries: 373
- Missing salary entries: 2 (0.53% of the dataset)
- The two rows with a null salary are removed in the data processing steps

Missing Data in People Dataset:
- Education Level contains NaN values: the three recorded levels (Bachelor's: 222, Master's: 97, PhD: 51) account for 370 of the 375 rows, leaving 5 rows with a missing or unlisted level
- Job Title also contains NaN entries, alongside its 50+ unique titles

Data Quality Observations:
- The `id` column is consistent across both datasets
- Salary ranges from $350 to $250,000, with a standard deviation of $48,240
- Age, Years of Experience, and other numerical columns appear to be complete
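
The checks in this section can be reproduced with a few pandas calls. The sketch below is illustrative only: it runs on a tiny made-up frame rather than the project data, the helper name `summarize_and_clean` is hypothetical, and the column names simply follow the ones listed in this report.

```python
import pandas as pd


def summarize_and_clean(people: pd.DataFrame, salaries: pd.DataFrame):
    """Join the two tables on `id`, count missing values, drop rows without a salary."""
    merged = people.merge(salaries, on="id", how="inner")
    missing_counts = merged.isna().sum()        # NaN count per column
    cleaned = merged.dropna(subset=["Salary"])  # rows without a target cannot be used
    return missing_counts, cleaned.reset_index(drop=True)


# Toy data for illustration only (NOT the real dataset)
people = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Age": [32, 45, 28, 51],
    "Education Level": ["Bachelor's", None, "Master's", "PhD"],
    "Job Title": ["Data Analyst", "Director of Marketing", None, "Director of Operations"],
})
salaries = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Salary": [90000.0, None, 65000.0, 180000.0],
})

missing_counts, cleaned = summarize_and_clean(people, salaries)
print(missing_counts)
print(f"Rows kept after dropping null salaries: {len(cleaned)}")
```

Dropping rows only on the target column keeps records whose features are incomplete, so those gaps can still be handled by imputation later in preprocessing.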

## 4. Preprocessing & Feature Engineering

The project uses two alternative preprocessing pipelines:

- **v1**: A minimal pipeline with basic encoding and imputation
- **v2**: A more refined approach, including feature scaling and improved handling of categorical features

The pipelines include:
- Dropping duplicates
- Handling missing values
- One-hot or ordinal encoding
- Scaling numerical features (v2 only)
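
The steps listed above could be expressed with scikit-learn transformers roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual Kedro nodes: the function name `build_preprocessor`, the imputation strategies, the education ordering, and the toy data are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

NUMERIC = ["Age", "Years of Experience"]
ORDINAL = ["Education Level"]
NOMINAL = ["Gender"]
EDUCATION_ORDER = [["Bachelor's", "Master's", "PhD"]]  # assumed ordering


def build_preprocessor(scale_numeric: bool) -> ColumnTransformer:
    """Hypothetical helper: scale_numeric=False mimics v1, True mimics v2."""
    numeric_steps = [("impute", SimpleImputer(strategy="median"))]
    if scale_numeric:
        numeric_steps.append(("scale", StandardScaler()))
    ordinal_steps = [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=EDUCATION_ORDER)),
    ]
    nominal_steps = [
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
    return ColumnTransformer(
        [
            ("num", Pipeline(numeric_steps), NUMERIC),
            ("ord", Pipeline(ordinal_steps), ORDINAL),
            ("nom", Pipeline(nominal_steps), NOMINAL),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# Toy data for illustration only; rows 0 and 3 are exact duplicates
raw = pd.DataFrame({
    "Age": [30, 40, np.nan, 30],
    "Years of Experience": [5, 15, 20, 5],
    "Education Level": ["Bachelor's", "PhD", np.nan, "Bachelor's"],
    "Gender": ["Male", "Female", np.nan, "Male"],
})
deduped = raw.drop_duplicates()

X_v1 = build_preprocessor(scale_numeric=False).fit_transform(deduped)
X_v2 = build_preprocessor(scale_numeric=True).fit_transform(deduped)
print(X_v1.shape, X_v2.shape)
```

In practice the transformer should be fitted on the training split only and then applied to the test split, so that imputation and scaling statistics do not leak from held-out data.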

In [2]:
catalog.load("X_train_preprocessed").head()
# A notebook cell renders only its last expression, so the table below is the v2 (scaled) output
catalog.load("X_train_preprocessed_v2").head()

Unnamed: 0,Age,Years of Experience,Education Level,JobTitleGroup_Executive_Finance,JobTitleGroup_Executive_HR,JobTitleGroup_Executive_Marketing,JobTitleGroup_Executive_Operations,JobTitleGroup_Executive_Other,JobTitleGroup_Executive_Product,JobTitleGroup_Executive_Sales,...,JobTitleGroup_Senior_HR,JobTitleGroup_Senior_Marketing,JobTitleGroup_Senior_Operations,JobTitleGroup_Senior_Other,JobTitleGroup_Senior_Product,JobTitleGroup_Senior_Sales,JobTitleGroup_Senior_Technical,JobTitleGroup_Unknown,Gender_Male,Gender_None
193,-0.493751,-0.469123,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
75,-0.066439,-0.007995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
84,-1.205939,-1.237669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
363,-0.636189,-0.776541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
16,-0.636189,-0.469123,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 5. Modeling Approach

Several regression models were trained and evaluated using the two preprocessing strategies. These include:

- **Baseline**:
  - DummyRegressor (predicts mean salary)
  
- **Linear Models**:
  - Linear Regression
  - Lasso
  - ElasticNet (with and without hyperparameter tuning)

- **Tree-Based Models**:
  - Random Forest Regressor (vanilla and tuned)
  - XGBoost (with Optuna optimization)

- **SHAP + Random Forest**:
  - SHAP used for feature selection, followed by RF training

All models were evaluated using cross-validation, with metrics reported as confidence intervals.
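
To illustrate how a model is compared against the baseline under cross-validation, here is a small sketch on synthetic data. It is not the project's training code: the data-generating formula, the 5-fold setup, and the helper name `cv_rmse` are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration only: salary grows linearly with experience plus noise
rng = np.random.default_rng(0)
years = rng.uniform(0, 25, size=200)
salary = 40_000 + 5_000 * years + rng.normal(0, 10_000, size=200)
X = years.reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_rmse(model) -> np.ndarray:
    """Return the per-fold RMSE of `model` (scikit-learn reports it negated)."""
    scores = cross_val_score(
        model, X, salary, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores


baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))
linear_rmse = cv_rmse(LinearRegression())

print(f"Baseline RMSE (mean over folds): {baseline_rmse.mean():,.0f}")
print(f"Linear   RMSE (mean over folds): {linear_rmse.mean():,.0f}")
```

Using the same `KFold` object for every model keeps the folds identical, so differences in RMSE reflect the models rather than the split.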


In [3]:
catalog.load("randomforestregressor_model_v2")


## 6. Model Evaluation

Each model was evaluated using the following metrics:

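- **RMSE**: Root Mean Squared Error, expressed in the same units as salary (dollars); penalizes large errors more heavily than small ones
- **R² Score**: Proportion of the variance in salary explained by the model (1.0 is a perfect fit; 0.0 is no better than always predicting the mean)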

All metrics are reported as **confidence intervals**, using bootstrapped cross-validation.
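
The exact bootstrap procedure is not spelled out in this section, so the following is only one plausible reading: compute each metric on held-out predictions, then resample the (true, predicted) pairs with replacement to obtain a percentile confidence interval. The function names and the toy data are hypothetical.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled prediction pairs."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)


# Toy held-out predictions for illustration only
rng = np.random.default_rng(42)
y_true = rng.normal(100_000, 48_000, size=200)
y_pred = y_true + rng.normal(0, 15_000, size=200)

rmse_low, rmse_high = bootstrap_ci(y_true, y_pred, rmse)
r2_low, r2_high = bootstrap_ci(y_true, y_pred, r2)
print(f"RMSE 95% CI: [{rmse_low:,.0f}, {rmse_high:,.0f}]")
print(f"R2   95% CI: [{r2_low:.3f}, {r2_high:.3f}]")
```

When two models' intervals overlap heavily, the difference in their point estimates should be read with caution, especially on a dataset of only a few hundred rows.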
