## **Step 1: Load Data and Libraries**

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
df = pd.read_csv('../DATA.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()

Dataset loaded successfully!
Shape: (500, 16)

First few rows:


Unnamed: 0,job_title,country,city,salary,currency,years_experience,skills,tools_models,work_mode,company_size,industry,year,education,gender,job_demand_index,source
0,Machine Learning Engineer,France,Paris,148500,EUR,15,"PyTorch, TensorFlow, Kubernetes","TensorFlow, ONNX",Remote,Large,Healthcare,2020,Diploma,Male,74,Synthetic approximation based on public salary...
1,LLM Researcher,UAE,Abu Dhabi,643500,AED,15,"Distributed Training, PyTorch, Transformers","Custom LLM, Llama 3",Onsite,Large,Healthcare,2023,PhD,Non-binary,77,Synthetic approximation based on public salary...
2,Data Scientist,France,Marseille,108000,EUR,4,"Scikit-learn, NumPy, Pandas","RandomForest, LightGBM",Remote,Medium,Technology,2022,Bachelors,Non-binary,80,Synthetic approximation based on public salary...
3,Data Scientist,France,Lyon,60000,EUR,2,"SQL, Statistics, Python","RandomForest, LightGBM",Hybrid,Medium,Education,2023,Masters,Female,79,Synthetic approximation based on public salary...
4,Machine Learning Engineer,Canada,Ottawa,105600,CAD,6,"TensorFlow, MLflow, Docker","TensorFlow, PyTorch",Hybrid,Large,Consulting,2024,Masters,Male,92,Synthetic approximation based on public salary...
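
The `Unnamed: 0` column in the output above usually means the CSV was written with its row index included. A minimal sketch of one way to avoid it, assuming that is the cause here; the two-row CSV text is a hypothetical stand-in for `DATA.csv`, and the notebook itself does not do this:

```python
import io
import pandas as pd

# Hypothetical stand-in for DATA.csv: the unnamed first column is a saved row index.
csv_text = (
    ",job_title,country,salary,currency\n"
    "0,Machine Learning Engineer,France,148500,EUR\n"
    "1,LLM Researcher,UAE,643500,AED\n"
)

# Default read: the unnamed column becomes a regular column called 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Adopting this in the notebook would reduce every reported column count by one, so the shapes printed in later steps would change accordingly.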


## **Step 2: Currency Conversion to USD**

**Why?** The dataset has 8 different currencies. We need to standardize everything to USD for fair comparison.

**Conversion rates:** approximate 2024-2025 averages, hard-coded in the cell below.

In [4]:
# Check current currencies
print("Current currencies in dataset:")
print(df['currency'].value_counts())

# Define conversion rates to USD (approximate 2024-2025 rates)
conversion_rates = {
    'USD': 1.0,
    'INR': 0.012,      # 1 INR = 0.012 USD
    'GBP': 1.27,       # 1 GBP = 1.27 USD
    'EUR': 1.09,       # 1 EUR = 1.09 USD
    'CAD': 0.74,       # 1 CAD = 0.74 USD
    'AUD': 0.66,       # 1 AUD = 0.66 USD
    'SGD': 0.74,       # 1 SGD = 0.74 USD
    'AED': 0.27        # 1 AED = 0.27 USD
}

# Convert all salaries to USD
df['salary_usd'] = df.apply(lambda row: row['salary'] * conversion_rates[row['currency']], axis=1)

# Show before and after
print("\n✅ Currency conversion completed!")
print("\nSample conversions:")
print(df[['salary', 'currency', 'salary_usd']].head(10))

# Statistics
print("\n📊 Salary Statistics in USD:")
print(f"Mean:   ${df['salary_usd'].mean():,.2f}")
print(f"Median: ${df['salary_usd'].median():,.2f}")
print(f"Min:    ${df['salary_usd'].min():,.2f}")
print(f"Max:    ${df['salary_usd'].max():,.2f}")

Current currencies in dataset:
currency
EUR    136
AUD     62
INR     59
CAD     59
SGD     53
AED     44
USD     44
GBP     43
Name: count, dtype: int64

✅ Currency conversion completed!

Sample conversions:
   salary currency  salary_usd
0  148500      EUR    161865.0
1  643500      AED    173745.0
2  108000      EUR    117720.0
3   60000      EUR     65400.0
4  105600      CAD     78144.0
5  126500      EUR    137885.0
6   70200      GBP     89154.0
7  253000      USD    253000.0
8  158400      EUR    172656.0
9  166400      EUR    181376.0

📊 Salary Statistics in USD:
Mean:   $126,865.73
Median: $125,223.50
Min:    $13,992.00
Max:    $382,200.00
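
The row-wise `apply` above works, but the same conversion can be written as a vectorized lookup. The sketch below is one possible alternative, not the notebook's method: `to_usd` is a hypothetical helper, the rates are a subset of the dictionary above, and the four rows are taken from the sample output. It checks for unknown currency codes first, because `Series.map` would otherwise silently produce `NaN` for them.

```python
import pandas as pd

# Subset of the conversion rates defined in the notebook cell above.
conversion_rates = {'USD': 1.0, 'EUR': 1.09, 'CAD': 0.74, 'AED': 0.27}

# Four rows copied from the sample conversions output.
sample = pd.DataFrame({
    'salary': [148500, 643500, 105600, 253000],
    'currency': ['EUR', 'AED', 'CAD', 'USD'],
})

def to_usd(frame, rates):
    """Hypothetical helper: vectorized conversion that fails loudly on unknown codes."""
    unknown = set(frame['currency']) - set(rates)
    if unknown:
        raise KeyError(f"No conversion rate for: {sorted(unknown)}")
    # map() looks up each row's rate; the multiplication is element-wise.
    return frame['salary'] * frame['currency'].map(rates)

sample['salary_usd'] = to_usd(sample, conversion_rates)
print(sample)
```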


## **Step 3: Feature Engineering**

**Create new useful features** from existing data to help the model learn better.

In [5]:
# 1. Experience Level Categories
def categorize_experience(years):
    if years <= 2:
        return 'Junior'
    elif years <= 5:
        return 'Mid-Level'
    elif years <= 10:
        return 'Senior'
    else:
        return 'Expert'

df['experience_level'] = df['years_experience'].apply(categorize_experience)

# 2. Salary Categories (for classification if needed)
def categorize_salary(salary_usd):
    if salary_usd < 100000:
        return 'Low'
    elif salary_usd < 200000:
        return 'Medium'
    elif salary_usd < 500000:
        return 'High'
    else:
        return 'Very High'

df['salary_category'] = df['salary_usd'].apply(categorize_salary)

print("✅ Feature engineering completed!")
print("\nNew features created:")
print(f"1. experience_level: {df['experience_level'].unique()}")
print(f"2. salary_category: {df['salary_category'].unique()}")

# Show sample
print("\nSample with new features:")
df[['years_experience', 'experience_level', 'salary_usd', 'salary_category']].head(10)

print("\n💡 Note from EDA: job_demand_index has weak negative correlation (r=-0.105), so we won't create demand_level feature.")

✅ Feature engineering completed!

New features created:
1. experience_level: ['Expert' 'Mid-Level' 'Junior' 'Senior']
2. salary_category: ['Medium' 'Low' 'High']

Sample with new features:

💡 Note from EDA: job_demand_index has weak negative correlation (r=-0.105), so we won't create demand_level feature.

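
The experience binning in this step can also be expressed with `pd.cut`, which returns an ordered categorical. This is a sketch of an equivalent formulation under the same thresholds (2, 5, and 10 years), not a change to the notebook; the six input values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up years of experience covering every bin and both boundary cases.
years = pd.Series([15, 4, 2, 6, 11, 0], name='years_experience')

# Bins are right-inclusive by default: (-inf, 2], (2, 5], (5, 10], (10, inf),
# which matches the <= comparisons in categorize_experience.
levels = pd.cut(
    years,
    bins=[-np.inf, 2, 5, 10, np.inf],
    labels=['Junior', 'Mid-Level', 'Senior', 'Expert'],
)

print(levels.tolist())
print(levels.cat.ordered)
```

Because the result is ordered, comparisons such as `levels >= 'Senior'` work without a separate lookup table.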

## **Step 4: Handle Outliers (Based on EDA Findings)**

**Decision from EDA #15**: Keep outliers (they're real!), but we'll flag them for analysis.

In [6]:
# Calculate IQR boundaries
Q1 = df['salary_usd'].quantile(0.25)
Q3 = df['salary_usd'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Flag outliers (don't remove, just mark them)
df['is_outlier'] = ((df['salary_usd'] < lower_bound) | (df['salary_usd'] > upper_bound)).astype(int)

print(f"✅ Outliers flagged!")
print(f"\nOutlier summary:")
print(f"Total records: {len(df)}")
print(f"Outliers: {df['is_outlier'].sum()} ({df['is_outlier'].sum()/len(df)*100:.2f}%)")
print(f"Normal: {(df['is_outlier']==0).sum()} ({(df['is_outlier']==0).sum()/len(df)*100:.2f}%)")

# Optional: Create a version without outliers for comparison
df_no_outliers = df[df['is_outlier'] == 0].copy()
print(f"\nDataset without outliers: {df_no_outliers.shape}")

✅ Outliers flagged!

Outlier summary:
Total records: 500
Outliers: 20 (4.00%)
Normal: 480 (96.00%)

Dataset without outliers: (480, 20)
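
To make the 1.5 × IQR rule concrete, here is a small worked example on six made-up values. It follows the same steps as the cell above and assumes pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Made-up values: five clustered points and one extreme value.
values = pd.Series([10, 12, 14, 16, 18, 100])

q1 = values.quantile(0.25)   # 12.5 with linear interpolation
q3 = values.quantile(0.75)   # 17.5 with linear interpolation
iqr = q3 - q1                # 5.0

lower = q1 - 1.5 * iqr       # 5.0
upper = q3 + 1.5 * iqr       # 25.0

# Flag values outside [lower, upper] without removing them.
flags = ((values < lower) | (values > upper)).astype(int)
print(flags.tolist())
```

Only the value 100 falls outside the fences, so it is the only one flagged, and the series itself is left unchanged.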


## **Step 5: Categorical Encoding**

**Convert text categories into numbers** so the machine learning model can understand them.

In [7]:
# Create a copy for encoding
df_encoded = df.copy()

# List of categorical columns to encode
# Based on EDA findings:
# TIER 1 (Strong): country, job_title, experience_level ✅
# TIER 2 (Moderate): education, company_size, industry ⚠️
# TIER 3 (Weak): gender, work_mode ❌ (included but optional)
categorical_columns = [
    'job_title',        # TIER 1: Strong predictor (23% difference)
    'country',          # TIER 1: DOMINANT (860% difference!)
    'education',        # TIER 2: Weak (26% difference)
    'company_size',     # TIER 2: Very weak (6% difference)
    'industry',         # TIER 2: Small effect (20% range)
    'experience_level'  # TIER 1: Created from years_experience
]

# Initialize label encoders dictionary
label_encoders = {}

# Encode each categorical column
for col in categorical_columns:
    le = LabelEncoder()
    df_encoded[f'{col}_encoded'] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le
    
    # Show mapping
    print(f"\n{col}:")
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    for key, value in list(mapping.items())[:5]:  # Show first 5
        print(f"  {key} → {value}")
    if len(mapping) > 5:
        print(f"  ... and {len(mapping)-5} more")

print("\n✅ Categorical encoding completed!")
print(f"\nEncoded columns created: {[f'{col}_encoded' for col in categorical_columns]}")


job_title:
  Data Scientist → 0
  Generative AI Engineer → 1
  LLM Researcher → 2
  Machine Learning Engineer → 3
  Prompt Engineer → 4

country:
  Australia → 0
  Canada → 1
  France → 2
  Germany → 3
  India → 4
  ... and 5 more

education:
  Bachelors → 0
  Diploma → 1
  Masters → 2
  PhD → 3

company_size:
  Large → 0
  Medium → 1
  Startup → 2

industry:
  Consulting → 0
  E-commerce → 1
  Education → 2
  Finance → 3
  Healthcare → 4
  ... and 1 more

experience_level:
  Expert → 0
  Junior → 1
  Mid-Level → 2
  Senior → 3

✅ Categorical encoding completed!

Encoded columns created: ['job_title_encoded', 'country_encoded', 'education_encoded', 'company_size_encoded', 'industry_encoded', 'experience_level_encoded']
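
Note that `LabelEncoder` assigns codes in alphabetical order, so `experience_level` comes out as Expert → 0, Junior → 1, Mid-Level → 2, Senior → 3, which does not follow seniority. If the downstream model treats the code as a magnitude, an explicit ordering may be preferable. The sketch below shows one way to do that with an ordered categorical; it is an alternative under that assumption, not what the notebook does.

```python
import pandas as pd

# The four levels in the order they first appear in the dataset output above.
levels = pd.Series(['Expert', 'Mid-Level', 'Junior', 'Senior'], name='experience_level')

# Explicit seniority order, taken from the thresholds in categorize_experience.
order = ['Junior', 'Mid-Level', 'Senior', 'Expert']

# .codes gives each value's position in `order`, so the codes increase with seniority.
ordinal_codes = pd.Categorical(levels, categories=order, ordered=True).codes.tolist()

for level, code in zip(levels, ordinal_codes):
    print(f"  {level} -> {code}")
```

For tree-based models the difference is often small, while for linear models the ordering of the codes directly affects the fitted coefficient.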


## **Step 6: Select Features for Modeling**

**Choose which columns** will be used to predict salary.

In [8]:
# Features for modeling (X)
# Based on EDA Summary - Feature Importance Rankings:
# ⭐⭐⭐⭐⭐ TIER 1: Must Include (70-85% variance explained)
feature_columns = [
    'country_encoded',          # ⭐⭐⭐⭐⭐ DOMINANT: 860% difference (US $240k vs India $25k)
    'job_title_encoded',        # ⭐⭐⭐⭐ Strong: 23% difference between roles
    'years_experience',         # ⭐⭐⭐⭐ Strong: r=0.41 correlation, explains 17% variance
    'experience_level_encoded', # ⭐⭐⭐⭐ Engineered from years_experience

    # ⭐⭐⭐ TIER 2: Optional helpers (adds <10% improvement)
    'industry_encoded',         # ⭐⭐⭐ Small: 20% range across industries
    'education_encoded',        # ⭐⭐⭐ Weak: 26% difference (Master's pays LESS than Bachelor's!)
    'company_size_encoded',     # ⭐⭐ Very weak: Only 6% difference

    # Note: Excluded weak predictors from EDA:
    # ❌ work_mode (2% effect = noise)
    # ❌ job_demand_index (r=-0.105, negative/unreliable)
    # ❌ year (no time trend detected)
    # ❌ gender (11% gap but ethical concerns + confounding)
]

# Target variable (y)
target = 'salary_usd'

# Create feature matrix and target vector
X = df_encoded[feature_columns]
y = df_encoded[target]

print("✅ Features selected!")
print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\nFeatures being used:")
for i, col in enumerate(feature_columns, 1):
    print(f"{i}. {col}")

✅ Features selected!

Feature matrix shape: (500, 7)
Target vector shape: (500,)

Features being used:
1. country_encoded
2. job_title_encoded
3. years_experience
4. experience_level_encoded
5. industry_encoded
6. education_encoded
7. company_size_encoded


## **Step 7: Train-Test Split**

**Divide data** into training set (to teach the model) and test set (to evaluate it).

In [9]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print("✅ Train-test split completed!")
print(f"\nTraining set:")
print(f"  X_train shape: {X_train.shape}")
print(f"  y_train shape: {y_train.shape}")
print(f"\nTest set:")
print(f"  X_test shape: {X_test.shape}")
print(f"  y_test shape: {y_test.shape}")
print(f"\nPercentage split: {len(X_train)/len(X)*100:.1f}% train, {len(X_test)/len(X)*100:.1f}% test")

✅ Train-test split completed!

Training set:
  X_train shape: (400, 7)
  y_train shape: (400,)

Test set:
  X_test shape: (100, 7)
  y_test shape: (100,)

Percentage split: 80.0% train, 20.0% test
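
Since the EDA ranks country as the dominant predictor, one option worth considering is a stratified split, so that each country appears in the test set in roughly the same proportion as in the full data. The sketch below illustrates the `stratify` argument of `train_test_split` on made-up data with two hypothetical country labels; the notebook itself uses a plain random split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 80 rows from country 'A' and 20 rows from country 'B'.
toy = pd.DataFrame({
    'country': ['A'] * 80 + ['B'] * 20,
    'salary_usd': range(100),
})

# stratify keeps the 80/20 country mix in both the training and the test portion.
train, test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy['country'],
)

print(test['country'].value_counts().to_dict())
```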


## **Step 8: Feature Scaling**

**Normalize features** so they're on the same scale (important for many ML algorithms).

In [10]:
# Initialize scaler
scaler = StandardScaler()

# Fit on training data and transform both train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns, index=X_test.index)

print("✅ Feature scaling completed!")
print(f"\nScaled feature statistics (training set):")
print(X_train_scaled.describe().loc[['mean', 'std']].round(3))

✅ Feature scaling completed!

Scaled feature statistics (training set):
      country_encoded  job_title_encoded  years_experience  \
mean           -0.000              0.000            -0.000   
std             1.001              1.001             1.001   

      experience_level_encoded  industry_encoded  education_encoded  \
mean                    -0.000             0.000              0.000   
std                      1.001             1.001              1.001   

      company_size_encoded  
mean                -0.000  
std                  1.001  
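
The reported standard deviation is 1.001 rather than exactly 1 because the two libraries use different estimators: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). With 400 training rows the ratio is sqrt(400/399) ≈ 1.001. The following sketch reproduces the effect with NumPy on randomly generated values.

```python
import numpy as np

# 400 random values, matching the size of the training set.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=400)

# Standardize with the population std (ddof=0), as StandardScaler does.
z = (x - x.mean()) / x.std(ddof=0)

population_std = z.std(ddof=0)   # exactly 1 up to floating-point error
sample_std = z.std(ddof=1)       # sqrt(400 / 399), what describe() would show

print(round(float(population_std), 3))
print(round(float(sample_std), 3))
```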


## **Step 9: Save Preprocessed Data**

**Save everything** so we can use it in the modeling notebook.

In [12]:
# Save preprocessed data
import pickle

# Save the processed dataframes
X_train_scaled.to_csv('../data/X_train_scaled.csv', index=False)
X_test_scaled.to_csv('../data/X_test_scaled.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)

# Save label encoders and scaler for later use
with open('../data/label_encoders.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)

with open('../data/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Save feature names
with open('../data/feature_columns.pkl', 'wb') as f:
    pickle.dump(feature_columns, f)

print("✅ All preprocessed data saved successfully!")
print("\nFiles saved:")
print("  1. X_train_scaled.csv")
print("  2. X_test_scaled.csv")
print("  3. y_train.csv")
print("  4. y_test.csv")
print("  5. label_encoders.pkl")
print("  6. scaler.pkl")
print("  7. feature_columns.pkl")

✅ All preprocessed data saved successfully!

Files saved:
  1. X_train_scaled.csv
  2. X_test_scaled.csv
  3. y_train.csv
  4. y_test.csv
  5. label_encoders.pkl
  6. scaler.pkl
  7. feature_columns.pkl
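
The modeling notebook will need to read these artifacts back. The sketch below shows the pickle round trip for the feature list, using a temporary directory and a shortened list as stand-ins; the real paths and objects are the ones saved above. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

# Shortened stand-in for the notebook's feature_columns list.
feature_columns = ['country_encoded', 'job_title_encoded', 'years_experience']

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'feature_columns.pkl'

    # Save, as in the cell above.
    with open(path, 'wb') as f:
        pickle.dump(feature_columns, f)

    # Load, as the modeling notebook might do.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == feature_columns)
```

The same `pickle.load` pattern applies to `label_encoders.pkl` and `scaler.pkl`, provided the same scikit-learn version is installed when loading.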


## **Step 10: Preprocessing Summary**

**Review what we accomplished:**

In [14]:
print("="*80)
print("PREPROCESSING SUMMARY")
print("="*80)
print("\n✅ Completed Steps:")
print("  1. ✓ Loaded dataset (500 records)")
print("  2. ✓ Converted 8 currencies to USD")
print("  3. ✓ Created 2 engineered features (experience_level, salary_category)")
# Report the outlier count from the data so it always matches Step 4
print(f"  4. ✓ Flagged outliers (kept {df['is_outlier'].sum()} outliers, {df['is_outlier'].mean()*100:.1f}%)")
print(f"  5. ✓ Encoded {len(categorical_columns)} categorical variables")
print(f"  6. ✓ Selected {len(feature_columns)} features based on EDA findings")
print("  7. ✓ Split data (80% train, 20% test)")
print("  8. ✓ Scaled features using StandardScaler")
print("  9. ✓ Saved all preprocessed data")

print("\n📊 Final Dataset Stats:")
print(f"  Total records: {len(df)}")
print(f"  Features selected: {len(feature_columns)}")
print(f"  Training samples: {len(X_train)}")
print(f"  Test samples: {len(X_test)}")
print(f"  Target variable: {target}")
print(f"  Salary range (USD): ${y.min():,.0f} - ${y.max():,.0f}")

print("\n🎯 Feature Selection Strategy (from EDA):")
print("  TIER 1 (Must Have): country, job_title, years_experience, experience_level")
print("  TIER 2 (Optional): industry, education, company_size")
print("  EXCLUDED (Weak): work_mode, job_demand_index, year, gender")
print("  Expected R²: 0.70-0.85 with these features")

print("\n🎯 Ready for modeling!")
print("="*80)

PREPROCESSING SUMMARY

✅ Completed Steps:
  1. ✓ Loaded dataset (500 records)
  2. ✓ Converted 8 currencies to USD
  3. ✓ Created 2 engineered features (experience_level, salary_category)
  4. ✓ Flagged outliers (kept 20 outliers, 4.0%)
  5. ✓ Encoded 6 categorical variables
  6. ✓ Selected 7 features based on EDA findings
  7. ✓ Split data (80% train, 20% test)
  8. ✓ Scaled features using StandardScaler
  9. ✓ Saved all preprocessed data

📊 Final Dataset Stats:
  Total records: 500
  Features selected: 7
  Training samples: 400
  Test samples: 100
  Target variable: salary_usd
  Salary range (USD): $13,992 - $382,200

🎯 Feature Selection Strategy (from EDA):
  TIER 1 (Must Have): country, job_title, years_experience, experience_level
  TIER 2 (Optional): industry, education, company_size
  EXCLUDED (Weak): work_mode, job_demand_index, year, gender
  Expected R²: 0.70-0.85 with these features

🎯 Ready for modeling!
