## What salary could a candidate get after the transition?

To support data analysts transitioning to data science roles, we build an XGBoost model that estimates salary from the key skills we defined for the transition. In this section we:

- Remove salary outliers and filter for Data Scientist job postings
- Engineer binary skill features
- Train an XGBoost regression model
- Predict the monthly salary in EUR

### Import Libraries

In [None]:
from pathlib import Path
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

import plotly.express as px

### Load Cleaned Dataset

In [257]:
df = pd.read_pickle(Path.cwd().parents[1] / 'Raw_Data' / 'df_Final_2.pkl')

### Predicting Data Scientist Salaries Based on Analyst Transition Skills

To support data analysts transitioning to data science roles, we build a regression model that estimates salary from the skill set we defined as most appropriate for the transition.

- **Target Role:** Data Scientist  
- **Base Skills:** `Excel`, `Python`, `SQL`, `Tableau`/`Looker`/`Power BI`
- **Key Transition Skills:** `R`/`SAS`, `AWS`/`Azure`, `Spark`/`TensorFlow`/`PyTorch`

In [258]:
df_ml = df[df['job_title_short'] == 'Data Scientist'].copy()

print(f"Valid Data Scientist rows: {len(df_ml)}")

Valid Data Scientist rows: 9507


### Skill Feature Engineering

We create binary features to represent whether a posting mentions:
- Base skills (Excel, Python, SQL, BI tools)
- Transition skills (R/SAS, AWS/Azure, Spark/TensorFlow/PyTorch)

We also add:
- Total skill count (`total_skills`)
- A composite binary `meets_criteria` flag for postings that mention every base and transition skill group.

In [259]:
# Lowercase skill lists
df_ml['job_skills'] = df_ml['job_skills'].apply(lambda x: [s.lower() for s in x] if isinstance(x, list) else [])

# Define skill groups
viz_tools = ['tableau', 'power bi', 'looker']
r_sas = ['r', 'sas']
spark_tf_pt = ['spark', 'tensorflow', 'pytorch']
aws_azure = ['aws', 'azure']

# Individual known skills
df_ml['has_python'] = df_ml['job_skills'].apply(lambda skills: 'python' in skills).astype(int)
df_ml['has_sql'] = df_ml['job_skills'].apply(lambda skills: 'sql' in skills).astype(int)
df_ml['has_excel'] = df_ml['job_skills'].apply(lambda skills: 'excel' in skills).astype(int)

# Group features
df_ml['has_viz_tool'] = df_ml['job_skills'].apply(lambda skills: any(v in skills for v in viz_tools)).astype(int)
df_ml['has_r_sas'] = df_ml['job_skills'].apply(lambda skills: any(v in skills for v in r_sas)).astype(int)
df_ml['has_spark_tf_pt'] = df_ml['job_skills'].apply(lambda skills: any(v in skills for v in spark_tf_pt)).astype(int)
df_ml['has_aws_azure'] = df_ml['job_skills'].apply(lambda skills: any(v in skills for v in aws_azure)).astype(int)

# Total skills count in posting
df_ml['total_skills'] = df_ml['job_skills'].apply(len)

# Composite feature: posting mentions every base and transition skill group
df_ml['meets_criteria'] = (
    df_ml['has_excel'] &
    df_ml['has_python'] &
    df_ml['has_sql'] &
    df_ml['has_viz_tool'] &
    df_ml['has_r_sas'] &
    df_ml['has_spark_tf_pt'] &
    df_ml['has_aws_azure']
).astype(int)
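
The flags above test exact membership in each posting's skill list rather than substring matches, which matters for short names such as `r`. A minimal sketch on made-up postings (the toy data and the `has_any` helper are illustrative assumptions, not part of the dataset or the notebook):

```python
import pandas as pd

# Toy postings (hypothetical data, for illustration only)
toy = pd.DataFrame({
    'job_skills': [
        ['python', 'sql', 'r'],
        ['rust', 'spark'],
        [],
    ]
})

def has_any(skills, group):
    """Return 1 if any skill in `group` appears as an exact list element."""
    return int(any(skill in skills for skill in group))

toy['has_r_sas'] = toy['job_skills'].apply(lambda skills: has_any(skills, ['r', 'sas']))
toy['has_spark_tf_pt'] = toy['job_skills'].apply(
    lambda skills: has_any(skills, ['spark', 'tensorflow', 'pytorch'])
)
toy['total_skills'] = toy['job_skills'].apply(len)

print(toy[['has_r_sas', 'has_spark_tf_pt', 'total_skills']])
```

Here `rust` does not set `has_r_sas`, because `'r' in ['rust', 'spark']` compares whole list elements.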

### Additional Features

- Location-based flag: `is_eu`
- Work arrangement: `job_work_from_home`
- Benefits: `job_health_insurance`
- Job schedule type: `job_schedule_type` (one-hot encoded)

In [260]:
cat_features = ['job_schedule_type']
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df_ml[cat_features])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(cat_features), index=df_ml.index)

# Merge with main dataframe
df_model = pd.concat([
    df_ml[[
        'has_excel', 'has_python', 'has_sql', 'has_viz_tool',
        'has_r_sas', 'has_spark_tf_pt', 'has_aws_azure',
        'total_skills', 'meets_criteria',
        'is_eu', 'job_work_from_home', 'job_health_insurance'
    ]],
    encoded_df
], axis=1)

# Target variable
y = df_ml['salary_month_avg_eur']

### Train-Test Split

Split data into training (80%) and testing (20%) sets.

In [261]:
X_train, X_test, y_train, y_test = train_test_split(df_model, y, test_size=0.2, random_state=42)

### XGBoost Model

We train an XGBoost Regressor and evaluate it using:
- R² score
- RMSE (EUR/month)

In [262]:
model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Metrics
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"RMSE (EUR/month): {rmse:.2f}")

R² Score: 0.1461
RMSE (EUR/month): 2991.37
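
To read the two metrics: RMSE is expressed in EUR/month, and R² compares the model's squared error with that of always predicting the mean salary, so an R² of 0.1461 means roughly 15% of the salary variance in the test set is explained. The sketch below reproduces both formulas by hand on made-up numbers (`rmse_score` and `r2_manual` are illustrative helper names, and the salaries are not taken from the dataset):

```python
import numpy as np

# Made-up monthly salaries in EUR (illustrative values only)
y_true = np.array([4000.0, 5000.0, 6000.0, 7000.0])
y_hat = np.array([4500.0, 5000.0, 5500.0, 7500.0])

def rmse_score(actual, predicted):
    """Root mean squared error, in the same unit as the target."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r2_manual(actual, predicted):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Baseline: always predict the mean salary
baseline = np.full_like(y_true, y_true.mean())

print(f"Model:    RMSE={rmse_score(y_true, y_hat):.2f}, R²={r2_manual(y_true, y_hat):.4f}")
print(f"Baseline: RMSE={rmse_score(y_true, baseline):.2f}, R²={r2_manual(y_true, baseline):.4f}")
```

The mean-only baseline scores an R² of zero by construction, which gives a reference point for judging how much the skill features add.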


### Predict for the candidate

In [263]:
# Candidate feature vector
candidate_features = pd.DataFrame([{
    'has_excel': 1,
    'has_python': 1,
    'has_sql': 1,
    'has_viz_tool': 1,     
    'has_r_sas': 1,
    'has_spark_tf_pt': 1,
    'has_aws_azure': 1,
    'total_skills': 7,    
    'meets_criteria': 1,
    'is_eu': 1,            
    'job_work_from_home': 0,
    'job_health_insurance': 0,
    **{col: 0 for col in encoded_df.columns}  
}])

predicted_salary = model.predict(candidate_features)[0]
print(f"Predicted salary for candidate: {predicted_salary:.2f} EUR/month")

Predicted salary for candidate: 12949.20 EUR/month
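
The candidate row has to carry the same columns, in the same order, as the training matrix. One defensive option, sketched here with hypothetical column names standing in for `X_train.columns`, is to reindex the candidate frame against the training layout before predicting:

```python
import pandas as pd

# Hypothetical training columns (stand-in for X_train.columns)
train_columns = ['has_python', 'has_sql', 'total_skills', 'job_schedule_type_Part-time']

# Candidate row written in a different order and without the encoded column
candidate = pd.DataFrame([{'total_skills': 7, 'has_sql': 1, 'has_python': 1}])

# Reorder to the training layout; columns absent from the candidate are filled with 0
candidate_aligned = candidate.reindex(columns=train_columns, fill_value=0)

print(candidate_aligned)
```

Filling missing columns with 0 is only appropriate when 0 is a meaningful default, as it is for binary flags and one-hot columns.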


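### Median Salary vs. Number of Skills

We compare median monthly salaries of Data Scientist and Data Analyst postings by number of listed skills (up to 15) and overlay the predicted candidate salary.

In [264]: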
# Ensure total_skills exists for all postings
df.loc[:, 'job_skills'] = df['job_skills'].apply(
    lambda x: [s.lower() for s in x] if isinstance(x, list) else []
)
df.loc[:, 'total_skills'] = df['job_skills'].apply(len)

# Filter <= 15 skills
df_ds_filtered = df[(df['job_title_short'] == 'Data Scientist') & (df['total_skills'] <= 15)]
df_da_filtered = df[(df['job_title_short'] == 'Data Analyst') & (df['total_skills'] <= 15)]

# Compute median salaries and counts per skill count
median_ds = df_ds_filtered.groupby('total_skills').agg(
    median_salary=('salary_month_avg_eur', 'median'),
    count=('salary_month_avg_eur', 'size')
).reset_index()
median_ds['role'] = 'Data Scientist'

median_da = df_da_filtered.groupby('total_skills').agg(
    median_salary=('salary_month_avg_eur', 'median'),
    count=('salary_month_avg_eur', 'size')
).reset_index()
median_da['role'] = 'Data Analyst'

# Combine for plotting
median_df = pd.concat([median_ds, median_da], ignore_index=True)

# Create scatter plot (markers sized by number of postings)
fig = px.scatter(
    median_df,
    x='total_skills',
    y='median_salary',
    size='count',
    color='role',
    color_discrete_map={
        'Data Scientist': 'darkblue',
        'Data Analyst': 'gray'
    },
    title='Median Salary vs. Number of Skills (<= 15 skills)',
    labels={
        'total_skills': 'Number of Skills',
        'median_salary': 'Median Monthly Salary (EUR)',
        'count': 'Number of Job Postings'
    },
    size_max=20,  # maximum marker size
)

# Add predicted candidate salary
fig.add_scatter(
    x=[candidate_features['total_skills'].iloc[0]],
    y=[predicted_salary],
    mode='markers+text',
    marker=dict(color='red', size=15, line=dict(color='black', width=1)),
    text=[f"{predicted_salary:,.0f} EUR"],
    textposition='top center',
    name='Predicted candidate salary'
)

fig.update_layout(
    plot_bgcolor='white',
    xaxis=dict(showgrid=True, gridcolor='lightgray'),
    yaxis=dict(showgrid=True, gridcolor='lightgray'),
    legend_title_text=''
)

fig.show()
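
The per-skill-count summary above uses pandas named aggregation: `'median'` is less sensitive to extreme salaries than the mean, and `'size'` counts every row in a group, including rows with a missing salary. A minimal sketch on made-up postings (the values are hypothetical and only illustrate the shape of the result):

```python
import pandas as pd

# Toy postings (hypothetical salaries, for illustration only)
toy = pd.DataFrame({
    'total_skills': [3, 3, 3, 5, 5],
    'salary_month_avg_eur': [4000.0, 5000.0, 9000.0, 6000.0, 8000.0],
})

# One output row per skill count, with a median salary and a posting count
summary = toy.groupby('total_skills').agg(
    median_salary=('salary_month_avg_eur', 'median'),
    count=('salary_month_avg_eur', 'size'),
).reset_index()

print(summary)
```

Groups with very few postings produce unstable medians, which is why the marker size in the plot is worth reading alongside the salary level.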

### Summary
TBD