# Data Science Salaries - Part 2: Error Analysis and Improvements

## 1. Error Analysis Conclusions & Work Plan

Based on the error analysis from Part 1, we identified several key issues:

- **Salary Range Bias**:
   - Model significantly underestimates high-salary positions (>250k USD)
   - Negative skew in error distribution

- **Feature Importance Issues**:
   - Employee residence dominates with ~0.40 importance score
   - Work year and remote ratio show very low importance (<0.05)
   - Potential sparsity issues with job titles

- **Experience Level Patterns**:
   - Largest error variance in Executive (EX) level
   - Significant outliers in Senior (SE) level

We start by loading the data and rebuilding the model from Part 1:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Load the original data (forward slashes work on Windows, macOS, and Linux)
path = "data/ds_salaries.csv"
df = pd.read_csv(path)

# Import the data-preparation and model-building helpers from Part 1
from part1 import prepare_data, build_model
processed_data = prepare_data(df)
model, X_test, y_test, y_pred, X_train, feature_cols = build_model(processed_data)

### Root Cause Analysis

Let's analyze potential causes for the observed issues:

In [None]:
# Analyze salary distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=processed_data, x='salary_in_usd', bins=50)
plt.title('Salary Distribution')
plt.xlabel('Salary (USD)')
plt.ylabel('Count')
plt.show()

# Print summary statistics
print("Salary Distribution Statistics:")
print(processed_data['salary_in_usd'].describe())

# Calculate skewness
print(f"\nSkewness: {processed_data['salary_in_usd'].skew():.2f}")

1. **Data Distribution Issues**:
   - Right-skewed salary distribution
   - Underrepresentation of high-salary positions
   - Possible outliers affecting model training

In [None]:
# Analyze feature cardinality
categorical_cols = ['job_title', 'employee_residence', 'company_location']

print("Feature Cardinality Analysis:")
for col in categorical_cols:
    unique_count = processed_data[col].nunique()
    top_5_freq = processed_data[col].value_counts().head()
    print(f"\n{col}:")
    print(f"Unique values: {unique_count}")
    print("Top 5 most frequent values:")
    print(top_5_freq)

2. **Feature Engineering Problems**:
   - High cardinality in categorical variables (e.g. job title and location)
   - Simple label encoding imposes an arbitrary order on categories and might not capture their relationship to salary
   - Missing interaction effects between features
   

3. **Model Limitations**:
   - Linear scaling might not handle salary ranges well
   - No special handling of outliers
   - Basic feature preprocessing
   - Limited hyperparameter optimization
   - Single model approach for all salary ranges

### Potential Solutions to Some of the Issues

**Handling Salary Distribution**
- Apply quantile transformation to normalize salary distribution
- Use stratified sampling to ensure representation across salary ranges
- Implement separate models for different salary brackets
- Consider log transformation for salary values
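
As a rough illustration of two of the ideas above (log transformation and stratified sampling), the sketch below uses synthetic log-normal salaries rather than the real dataset. The generated values, the choice of five salary brackets, and the variable names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic right-skewed salaries (assumption: stands in for salary_in_usd)
rng = np.random.default_rng(0)
salaries = pd.Series(
    rng.lognormal(mean=11.7, sigma=0.5, size=2000), name="salary_in_usd"
)

# The log transform compresses the long right tail
log_salaries = np.log1p(salaries)
skew_before = salaries.skew()
skew_after = log_salaries.skew()
print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# Predictions made in log space are mapped back to USD with expm1
round_trip = np.expm1(log_salaries)

# Stratify the split on salary quintiles so every bracket is represented
salary_bins = pd.qcut(salaries, q=5, labels=False)
row_ids = np.arange(len(salaries))
train_idx, test_idx = train_test_split(
    row_ids, test_size=0.2, stratify=salary_bins, random_state=42
)
test_bin_share = salary_bins.iloc[test_idx].value_counts(normalize=True).sort_index()
print(test_bin_share)
```

If the model is trained on the log-transformed target, its predictions need to be passed through `np.expm1` before errors are compared with those from Part 1, which were measured in USD.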

**Feature Engineering Solutions**
- Implement target encoding for categorical variables
- Use embedding techniques for high-cardinality categorical features
- Create experience-title interaction features
- Add location-based interaction terms
- Develop remote work impact factors
- Include company size-location interactions
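
A minimal sketch of target encoding and an experience-title interaction feature might look like the following. The toy rows are made up, and the helper `smoothed_target_encode` and its `smoothing` parameter are hypothetical names introduced for this example, not part of the Part 1 code.

```python
import pandas as pd

# Toy data (assumption: column names mirror the dataset, values are made up)
toy = pd.DataFrame({
    "experience_level": ["SE", "SE", "EX", "MI", "SE", "EX"],
    "job_title": ["Data Scientist", "Data Engineer", "Data Scientist",
                  "Data Analyst", "Data Scientist", "Head of Data"],
    "salary_in_usd": [150_000, 140_000, 250_000, 90_000, 160_000, 300_000],
})

def smoothed_target_encode(train, col, target, smoothing=5.0):
    """Map each category to a blend of its mean target and the global mean.

    Rare categories get a small weight, so they are pulled toward the global mean.
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1 - weight) * global_mean
    return encoding, global_mean

# Interaction feature: one category per (experience level, job title) pair
toy["exp_title"] = toy["experience_level"] + "_" + toy["job_title"]

# Here the encoding is fitted on all toy rows; on real data it should be
# fitted on the training split only to avoid leaking the target
encoding, global_mean = smoothed_target_encode(toy, "job_title", "salary_in_usd")
toy["job_title_te"] = toy["job_title"].map(encoding)

# Categories never seen during fitting fall back to the global mean
unseen = pd.Series(["ML Engineer"]).map(encoding).fillna(global_mean)
print(toy[["job_title", "exp_title", "job_title_te"]])
```

The smoothing term matters for sparse job titles: a title that appears once would otherwise be encoded with a single salary value, which the model could memorize.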

**Model Architecture & Training Improvements**
- Use quantile regression for better uncertainty estimation
- Use k-fold cross-validation
- Implement hyperparameter tuning
- Add regularization techniques
- Use weighted sampling for underrepresented cases
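
To make the first two points more concrete, here is a hedged sketch that combines quantile regression with k-fold cross-validation. It uses scikit-learn's `GradientBoostingRegressor` rather than the XGBoost model from Part 1, and the data is synthetic, so the numbers say nothing about the salary dataset itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Synthetic data (assumption): one feature, noise that grows with the feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 1))
y = 50 + 10 * X[:, 0] + rng.normal(0, 1 + X[:, 0], size=600)

coverages = []
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X):
    bounds = {}
    for q in (0.1, 0.9):
        # loss="quantile" fits the model to the q-th conditional quantile
        gbr = GradientBoostingRegressor(
            loss="quantile", alpha=q, n_estimators=100, max_depth=2, random_state=0
        )
        gbr.fit(X[train_idx], y[train_idx])
        bounds[q] = gbr.predict(X[test_idx])
    # Share of held-out targets that fall inside the predicted interval
    inside = (y[test_idx] >= bounds[0.1]) & (y[test_idx] <= bounds[0.9])
    coverages.append(inside.mean())

mean_coverage = float(np.mean(coverages))
print(f"Mean coverage of the 10%-90% interval: {mean_coverage:.2f}")
```

An interval built from the 10% and 90% quantiles should contain roughly 80% of held-out targets. Coverage far below that on high-salary rows would point to the same underestimation seen in the error analysis.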