# Data Science Salaries - Part 2: Error Analysis and Improvements

## 1. Error Analysis Conclusions & Work Plan

Based on the error analysis from Part 1, we identified several key issues:

1. **Salary Range Bias**:
   - Model significantly underestimates high-salary positions (>250k USD)
   - Negative skew in error distribution

2. **Feature Importance Issues**:
   - Employee residence dominates with ~0.40 importance score
   - Work year and remote ratio show very low importance (<0.05)
   - Potential sparsity issues with job titles

3. **Experience Level Patterns**:
   - Largest error variance in Executive (EX) level
   - Significant outliers in Senior (SE) level
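
A per-group error table makes patterns like these easy to check. The sketch below is a minimal illustration, not the Part 1 code: `error_summary` is a hypothetical helper, and the salaries and predictions are made-up numbers chosen only to show the output shape. Errors are defined here as predicted minus actual, so a negative mean indicates underestimation.

```python
import numpy as np
import pandas as pd

def error_summary(y_true, y_pred, groups):
    """Mean signed error, error spread, and row count per group."""
    frame = pd.DataFrame({
        "group": groups,
        "error": np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float),
    })
    return frame.groupby("group")["error"].agg(
        mean_error="mean", error_std="std", n="count"
    )

# Illustrative values only (thousands of USD), not results from the model
y_true = [60, 90, 150, 160, 300, 280]
y_pred = [65, 85, 140, 170, 240, 250]
levels = ["EN", "EN", "SE", "SE", "EX", "EX"]

summary = error_summary(y_true, y_pred, levels)
print(summary)
```

With real predictions, passing `processed_data['experience_level']` for the test rows as `groups` would give the same breakdown by experience level.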

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Load the original data
# Forward slashes work on Windows, macOS, and Linux
path = "data/ds_salaries.csv"
df = pd.read_csv(path)

# Import the prepare_data function from part1
from part1 import prepare_data
processed_data = prepare_data(df)

### Root Cause Analysis

Let's analyze potential causes for the observed issues:

In [None]:
# Analyze salary distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=processed_data, x='salary_in_usd', bins=50)
plt.title('Salary Distribution')
plt.xlabel('Salary (thousands USD)')
plt.ylabel('Count')
plt.show()

# Print summary statistics
print("Salary Distribution Statistics:")
print(processed_data['salary_in_usd'].describe())

# Calculate skewness
print(f"\nSkewness: {processed_data['salary_in_usd'].skew():.2f}")

In [None]:
# Analyze feature cardinality
categorical_cols = ['job_title', 'employee_residence', 'company_location']

print("Feature Cardinality Analysis:")
for col in categorical_cols:
    unique_count = processed_data[col].nunique()
    top_5_freq = processed_data[col].value_counts().head()
    print(f"\n{col}:")
    print(f"Unique values: {unique_count}")
    print("Top 5 most frequent values:")
    print(top_5_freq)

#### Root Causes Identified

1. **Data Distribution Issues**:
   - Right-skewed salary distribution
   - Underrepresentation of high-salary positions
   - Possible outliers affecting model training

2. **Feature Engineering Problems**:
   - High cardinality in categorical variables (e.g. job title and location)
   - Simple label encoding might not capture relationships
   - Missing interaction effects between features

3. **Model Limitations**:
   - Linear scaling might not handle salary ranges well
   - No special handling of outliers
   - Basic feature preprocessing
   - Limited hyperparameter optimization
   - Single model approach for all salary ranges
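
To see how many rows the "possible outliers" point concerns, one common heuristic is the 1.5 x IQR rule. The sketch below assumes that rule and uses a hypothetical helper, `iqr_outlier_mask`, on a small made-up series; it is not part of the original analysis, and the threshold `k` is a choice rather than a fixed standard.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up salaries in thousands of USD, with one extreme value
salaries = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 130, 600], dtype=float)
mask = iqr_outlier_mask(salaries)
print(salaries[mask])
```

On the real data the same call would be `iqr_outlier_mask(processed_data['salary_in_usd'])`.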

### Potential Solutions to Some of the Issues

**Handling Salary Distribution**
- Apply quantile transformation to normalize salary distribution
- Use stratified sampling to ensure representation across salary ranges
- Implement separate models for different salary brackets
- Consider log transformation for salary values
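
As a rough illustration of two of these options, the sketch below applies a log transform and builds a hold-out set stratified by salary quintile. It runs on synthetic log-normal salaries, not the real dataset, and the column names `salary_log` and `salary_bin` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed salaries (thousands of USD); illustrative only
rng = np.random.default_rng(0)
salaries = pd.DataFrame(
    {"salary_in_usd": rng.lognormal(mean=4.7, sigma=0.5, size=1000)}
)

# Log transform: compresses the long right tail
salaries["salary_log"] = np.log1p(salaries["salary_in_usd"])

# Stratified hold-out: bin salaries into quintiles, then sample 20% of each bin
salaries["salary_bin"] = pd.qcut(salaries["salary_in_usd"], q=5, labels=False)
test = salaries.groupby("salary_bin").sample(frac=0.2, random_state=0)
train = salaries.drop(test.index)

skew_raw = salaries["salary_in_usd"].skew()
skew_log = salaries["salary_log"].skew()
print(f"Skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
print(test["salary_bin"].value_counts().sort_index())
```

Sampling within each bin guarantees that the top salary bracket is represented in the test set in proportion to its size, which a plain random split only achieves on average.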

**Feature Engineering Solutions**
- Implement target encoding for categorical variables
- Use embedding techniques for high-cardinality categorical features
- Create experience-title interaction features
- Add location-based interaction terms
- Develop remote work impact factors
- Include company size-location interactions
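
Target encoding needs care: if a row's own salary contributes to its encoded value, the feature leaks the target. A common remedy is out-of-fold encoding with smoothing toward the overall mean. The sketch below is one possible version under those assumptions; `oof_target_encode`, its `smoothing` parameter, and the toy data are all hypothetical and not taken from Part 1.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean target encoding with additive smoothing.

    Each row is encoded using statistics from the other folds only,
    so its own target value never influences its encoding.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in folds:
        fit_mask = np.ones(len(df), dtype=bool)
        fit_mask[fold] = False
        fit_part = df.iloc[fit_mask]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["mean", "count"])
        # Shrink small categories toward the prior
        smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (
            stats["count"] + smoothing
        )
        # Categories absent from the fitting folds fall back to the prior
        encoded.iloc[fold] = df[col].iloc[fold].map(smooth).fillna(prior).to_numpy()
    return encoded

# Toy data, for illustration only
toy = pd.DataFrame({
    "job_title": ["A", "A", "B", "B", "C", "A", "B", "A"],
    "salary": [100.0, 120.0, 80.0, 90.0, 200.0, 110.0, 85.0, 105.0],
})
toy["job_title_encoded"] = oof_target_encode(toy, "job_title", "salary", n_splits=4)
print(toy)
```

For test rows, the encoding would be computed once from the full training set rather than out-of-fold.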

**Model Architecture & Training Improvements**
- Use quantile regression for better uncertainty estimation
- Use k-fold cross-validation
- Implement hyperparameter tuning
- Add regularization techniques
- Use weighted sampling for underrepresented cases
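
The sketch below combines two of these ideas: quantile regression evaluated with k-fold cross-validation. To keep it self-contained it uses scikit-learn's `GradientBoostingRegressor` with `loss="quantile"` as a stand-in for XGBoost, and the data are synthetic, so the printed coverage says nothing about the salary model itself. `oof_quantile` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data with a right-skewed noise term; illustrative only
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0, 10, size=(n, 1))
y = 50 + 10 * X[:, 0] + rng.lognormal(mean=2.0, sigma=0.6, size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def oof_quantile(alpha):
    """Out-of-fold predictions of the alpha-quantile of y given X."""
    model = GradientBoostingRegressor(loss="quantile", alpha=alpha, random_state=0)
    return cross_val_predict(model, X, y, cv=cv)

lower = oof_quantile(0.1)
upper = oof_quantile(0.9)

# Share of observations that fall inside the predicted 10%-90% band
coverage = np.mean((y >= lower) & (y <= upper))
print(f"Out-of-fold coverage of the 10%-90% band: {coverage:.2f}")
```

A band that covers far less than 80% of held-out points would suggest the model is overconfident, which is the kind of check a single point prediction cannot provide.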

## 2. Improving Model Performance

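Let's address these issues by applying several of the techniques listed above: a quantile-transformed target, rare-category grouping, interaction features, and target encoding.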

In [None]:
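def improved_prepare_data(df, train_idx=None):
    """Build the improved feature set.

    Anything learned from the data (quantile transform, rare-category counts,
    target-encoding means) is fitted on the rows in `train_idx` only, so test
    rows do not leak into the features. With train_idx=None, all rows are used.
    """
    data = df.copy()
    if train_idx is None:
        train_idx = data.index
    train = data.index.isin(train_idx)

    # 1. Better salary transformation (salary in thousands of USD)
    data['salary_in_usd'] = data['salary_in_usd'] / 1000
    qt = QuantileTransformer(n_quantiles=min(1000, int(train.sum())),
                             output_distribution='normal', random_state=42)
    qt.fit(data.loc[train, ['salary_in_usd']])
    data['salary_transformed'] = qt.transform(data[['salary_in_usd']]).ravel()

    # 2. Feature interactions
    # Computed before rare categories are merged; otherwise two different
    # rare countries would both become 'Other' and count as a match
    data['location_match'] = (data['employee_residence'] == data['company_location']).astype(int)
    data['remote_senior'] = ((data['remote_ratio'] > 50) &
                            (data['experience_level'] == 'SE')).astype(int)

    # 3. Improved categorical handling
    # Group categories with fewer than 5 training rows into 'Other'
    for col in ['job_title', 'employee_residence', 'company_location']:
        value_counts = data.loc[train, col].value_counts()
        frequent = value_counts[value_counts >= 5].index
        data[col] = data[col].where(data[col].isin(frequent), 'Other')

    # 4. Encode categorical variables
    categorical_cols = ['experience_level', 'employment_type', 'job_title',
                       'employee_residence', 'company_location', 'company_size']

    # Use mean target encoding instead of label encoding;
    # categories unseen in training fall back to the training mean
    global_mean = data.loc[train, 'salary_in_usd'].mean()
    for col in categorical_cols:
        means = data.loc[train].groupby(col)['salary_in_usd'].mean()
        data[col + '_encoded'] = data[col].map(means).fillna(global_mean)

    return data, qt

In [None]:
# Split row indices first so the encoders and transformer only see training rows
train_idx, test_idx = train_test_split(df.index.to_numpy(), test_size=0.2, random_state=42)

# Prepare improved dataset
improved_data, qt = improved_prepare_data(df, train_idx)

# Define features for improved model
improved_features = [
    'work_year',
    'experience_level_encoded',
    'employment_type_encoded',
    'job_title_encoded',
    'employee_residence_encoded',
    'remote_ratio',
    'company_location_encoded',
    'company_size_encoded',
    'location_match',
    'remote_senior'
]

# Prepare features and target
X = improved_data[improved_features]
y = improved_data['salary_transformed']  # Use transformed target
X_train, X_test = X.loc[train_idx], X.loc[test_idx]
y_train, y_test = y.loc[train_idx], y.loc[test_idx]

# Scale features (fit on training rows only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train improved model
improved_model = xgb.XGBRegressor(
    max_depth=6,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
improved_model.fit(X_train_scaled, y_train)

# Make predictions and map them back to thousands of USD
y_pred_transformed = improved_model.predict(X_test_scaled)
y_pred = qt.inverse_transform(y_pred_transformed.reshape(-1, 1)).ravel()
# Compare against the untouched salaries instead of round-tripping them
# through the transformer, which clips values outside the training range
y_test_original = improved_data.loc[test_idx, 'salary_in_usd'].to_numpy()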

In [None]:
# Compare results
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test_original, y_pred))
mae = mean_absolute_error(y_test_original, y_pred)
r2 = r2_score(y_test_original, y_pred)

print("Improved Model Performance:")
print(f"RMSE: {rmse:.2f}k")
print(f"MAE: {mae:.2f}k")
print(f"R2 Score: {r2:.3f}")

# Visualize improvements
plt.figure(figsize=(10, 6))
plt.scatter(y_test_original, y_pred, alpha=0.5)
plt.plot([y_test_original.min(), y_test_original.max()],
         [y_test_original.min(), y_test_original.max()],
         'r--', lw=2)
plt.xlabel('Actual Salary (thousands USD)')
plt.ylabel('Predicted Salary (thousands USD)')
plt.title('Improved Model: Predicted vs Actual Salaries')
plt.show()

### Summary of Improvements

The implemented solutions address the key issues identified:

1. **Handling Salary Distribution**:
   - Quantile transformation normalizes salary distribution
   - Better handling of extreme values
   - Reduced impact of outliers

2. **Feature Engineering**:
   - Grouped rare categories to reduce cardinality
   - Added meaningful feature interactions
   - Implemented target encoding for categorical variables

3. **Model Optimization**:
   - Set XGBoost parameters by hand (max_depth=6, min_child_weight=3); no systematic hyperparameter search yet
   - Added regularization through subsample and colsample_bytree (both 0.8)
   - Better handling of feature relationships

## 3. Analyzing the Improved Model

## 4. Drawing Conclusions About the Data & Creative Applications