# How well are the most popular roles paid in Europe?
- Global Salary Distribution
- Salaries of Top-3 in-Demand Data Roles in Europe
- Salary Gap% for Top-3 Data Roles in US and EU
- Interactive Map - Median Salary for Top-3 Data Roles in EU

In [1]:
import pandas as pd
import plotly.express as px

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objects as go

from pathlib import Path
current_dir = Path.cwd()
project_dir = current_dir.parents[1]
raw_data_dir = project_dir / "Raw_Data"
df = pd.read_csv(raw_data_dir / 'df_Final.csv')

### Remove Salary Outliers
During the exploratory analysis, we identified unusually high salaries in countries with very few job postings. These outliers may result from salaries listed in local currencies, company registration addresses being used instead of actual job locations, or other inconsistencies.

Action Taken:
- Outliers were removed per job title and country, using the 1.5 × IQR rule, to preserve meaningful comparisons.
- For country and job-title groups with fewer than 5 postings, bounds computed from the global salary distribution were applied instead, to avoid skewed insights from limited data.

In [3]:
# Calculate global IQR bounds (fallback for groups with fewer than 5 postings)
Q1_global = df['salary_month_avg_eur'].quantile(0.25)
Q3_global = df['salary_month_avg_eur'].quantile(0.75)
IQR_global = Q3_global - Q1_global
lower_bound_global = Q1_global - 1.5 * IQR_global
upper_bound_global = Q3_global + 1.5 * IQR_global

# Build list to collect filtered results
filtered_groups = []

# Iterate over (country, job title) groups
for (country, title), group in df.groupby(['job_country', 'job_title_short']):
    if len(group) >= 5:
        Q1 = group['salary_month_avg_eur'].quantile(0.25)
        Q3 = group['salary_month_avg_eur'].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
    else:
        lower = lower_bound_global
        upper = upper_bound_global

    filtered = group[
        (group['salary_month_avg_eur'] >= lower) &
        (group['salary_month_avg_eur'] <= upper)
    ]
    filtered_groups.append(filtered)

# Combine all groups back
df_filtered = pd.concat(filtered_groups, ignore_index=True)

print(f"Original dataset size: {len(df)}")
print(f"Filtered dataset size: {len(df_filtered)}")

Original dataset size: 22002
Filtered dataset size: 21307
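
The loop above can also be written without explicit iteration. The following is a minimal sketch on toy data, not the code that produced the numbers above. The function name `filter_outliers`, its parameters, and the toy column names `country`, `title`, and `salary` are assumptions made for illustration.

```python
import pandas as pd


def filter_outliers(df, value_col, group_cols, min_size=5, k=1.5):
    """Keep rows inside 1.5 x IQR fences; small groups fall back to global fences."""
    # Global fences, used for groups with fewer than `min_size` rows
    q1_all, q3_all = df[value_col].quantile([0.25, 0.75])
    iqr_all = q3_all - q1_all
    lower_all, upper_all = q1_all - k * iqr_all, q3_all + k * iqr_all

    # Per-group quartiles and group sizes, broadcast back to every row
    grouped = df.groupby(group_cols)[value_col]
    size = grouped.transform("size")
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    # Use group fences where the group is large enough, global fences otherwise
    big_enough = size >= min_size
    lower = (q1 - k * iqr).where(big_enough, lower_all)
    upper = (q3 + k * iqr).where(big_enough, upper_all)

    return df[df[value_col].between(lower, upper)]


# Toy data (hypothetical): one large group with an outlier, two small groups
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] + ["C"] * 2,
    "title": ["Data Analyst"] * 9,
    "salary": [3000, 3100, 3200, 3300, 3400, 20000, 50000, 3500, 3600],
})
toy_filtered = filter_outliers(toy, "salary", ["country", "title"])
print(len(toy), len(toy_filtered))
```

Unlike the loop, this version keeps the original row order and index, which may or may not matter for later steps.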


### Global Salary Distribution

In [4]:
median_salary = df_filtered.groupby('job_title_short')['salary_month_avg_eur'].median().sort_values()

fig1 = px.bar(
    median_salary,
    x=median_salary.values,
    y=median_salary.index,
    orientation='h',
    title='Global Median Salary by Job Title',
    labels={'x': 'Median Salary, EUR', 'index': 'Job Title'},
    color_discrete_sequence=['#5DADE2'],
    text=median_salary.values
)

fig1.update_traces(
    texttemplate='%{text:.0f}', 
    textposition='inside', 
    insidetextanchor='middle'
    )

fig1.update_layout(
    showlegend=False,
    yaxis_title=''
    )
fig1.show()

Despite being the most popular role in Europe, Data Analyst has the lowest median salary among the top-3 data roles, indicating high accessibility but limited earning potential in the long term. In contrast, Data Scientist and Data Engineer both offer significantly higher median salaries (nearly 40% more). 

For job seekers, starting as a Data Analyst offers a practical entry point, but upskilling toward Data Science — currently one of the highest-paid and fastest-growing roles globally — provides stronger long-term career and salary growth potential. 

### Salaries of Top-3 in-Demand Data Roles in Europe

In [5]:
top_roles = ['Data Analyst', 'Data Scientist', 'Data Engineer']

df_eu_top = df_filtered[
    (df_filtered['job_title_short'].isin(top_roles)) &
    (df_filtered['region_group'] == 'EU')
]

In [6]:
fig2 = px.box(
    df_eu_top,
    x='job_title_short',
    y='salary_month_avg_eur',
    title='Salary Distribution for Top-3 Data Roles in EU',
    labels={
        'salary_month_avg_eur': 'Monthly Salary (EUR)',
        'job_title_short': ''
    }
)

fig2.update_traces(marker_color='#5DADE2', line_color='#5DADE2')

# Add median labels
medians = df_eu_top.groupby('job_title_short')['salary_month_avg_eur'].median().round(0)
for job, med in medians.items():
    fig2.add_annotation(
        x=job,
        y=med,
        text=f"Median: €{med:,.0f}",
        showarrow=False,
        font=dict(size=12),
        yshift=10
    )

fig2.update_layout(showlegend=False)
fig2.show()

As at the global level, Data Scientists and Data Engineers in Europe earn noticeably more than Data Analysts. At the same time, Data Analyst salaries have a narrow, predictable distribution, making them attractive for entry-level professionals seeking stability and lower risk. In contrast, the broader, skewed distributions for Data Scientist and Data Engineer roles indicate greater variability but also higher earning potential, especially for those with advanced skills and niche expertise.
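
The spread and skew read off the box plot can also be expressed as numbers. Below is a minimal sketch on made-up salaries (not values from the dataset). The helper name `spread_summary` and the toy column names are assumptions for illustration.

```python
import pandas as pd


def spread_summary(df, group_col, value_col):
    """Median, interquartile range and sample skewness per group."""
    grouped = df.groupby(group_col)[value_col]
    return pd.DataFrame({
        "median": grouped.median(),
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
        "skew": grouped.skew(),
    })


# Toy data (hypothetical): a tight symmetric group and a right-skewed group
toy_roles = pd.DataFrame({
    "role": ["Data Analyst"] * 5 + ["Data Scientist"] * 5,
    "salary": [3000, 3100, 3200, 3300, 3400,
               4000, 4500, 5000, 6000, 12000],
})
summary = spread_summary(toy_roles, "role", "salary")
print(summary)
```

A larger `iqr` indicates a wider middle 50% of salaries, and a positive `skew` indicates a long tail toward high salaries.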

### Salary Gap% for Top-3 Data Roles in US and EU

The US market for data roles is a step ahead globally in adopting emerging technologies and job specializations. Comparing salaries between the US and EU can therefore help predict future trends for the European market, offering insight into potential career growth, remote work advantages, and long-term salary expectations.

In [7]:
df_top = df_filtered[
    (df_filtered['region_group'].isin(['EU', 'US'])) &
    (df_filtered['job_title_short'].isin(top_roles))
    ]

salary_medians = df_top.groupby(['job_title_short', 'region_group'])['salary_month_avg_eur'].median().unstack()
salary_medians['gap_abs'] = salary_medians['US'] - salary_medians['EU']
salary_medians['gap_pct'] = 100 * (salary_medians['gap_abs'] / salary_medians['EU'])

# Prepare base data
df_base = salary_medians.reset_index()
x_labels = df_base['job_title_short']

# Create the bar chart
fig3 = go.Figure()

# US bars (first in group)
fig3.add_trace(go.Bar(
    x=x_labels,
    y=df_base['US'],
    name='US',
    marker_color='#aec7e8',
    width=0.35,
    text=df_base['US'].round(0),
    textposition='inside',
    insidetextanchor='middle'
))

# EU bars (second in group)
fig3.add_trace(go.Bar(
    x=x_labels,
    y=df_base['EU'],
    name='EU',
    marker_color='#1f77b4',
    width=0.35,
    text=df_base['EU'].round(0),
    textposition='inside',
    insidetextanchor='middle'
))

# Add arrows and % gap labels
for i, row in df_base.iterrows():
    x = row['job_title_short']
    y_us = row['US']
    y_eu = row['EU']
    pct_gap = row['gap_pct']
    y_max = max(y_us, y_eu)

    fig3.add_annotation(
        x=x,
        y=y_max + 800,
        text=f"{pct_gap:.1f}%",
        showarrow=True,
        arrowhead=2,
        arrowwidth=2.5,
        arrowcolor='gray',
        ax=0,
        ay=-40,
        font=dict(size=12, color='black'),
        standoff=6
    )

# Layout settings
fig3.update_layout(
    barmode='group',
    bargap=0.1,  
    title='Salary Gap% for Top-3 Data Roles in US and EU',
    xaxis_title='',
    yaxis_title='Median Salary, EUR',
    height=500,
    legend=dict(title='Region'),
    margin=dict(t=80)
)

fig3.show()

The salary gap between the US and EU is almost negligible for Data Analysts (1%), indicating this role is well-established and globally standardized in terms of expectations and compensation. However, the much larger gap for Data Scientists and especially Data Engineers suggests that the EU is still catching up in adopting advanced analytics and scaling data infrastructure.
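
The percentages in the chart use the EU median as the baseline, which matters when reading the gap. The sketch below restates that calculation on illustrative numbers (not values from the dataset); `pct_gap` is a hypothetical helper name.

```python
def pct_gap(us_median, eu_median):
    """Gap of the US median over the EU median, as a percentage of the EU median."""
    return 100 * (us_median - eu_median) / eu_median


# Illustrative numbers only
print(pct_gap(4040, 4000))   # 1.0  -> US median is 1% above the EU median
print(pct_gap(9000, 6000))   # 50.0 -> US median is 50% above the EU median

# The same absolute gap looks smaller when the US median is the baseline
print(100 * (9000 - 6000) / 9000)  # about 33.3
```

Because the baseline changes the percentage, it helps to state which region is the denominator whenever the gap is quoted.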

### Interactive Map - Median Salary for Top-3 Data Roles in EU

In [13]:
# Prepare dataset for plotting the map
df_map = df_eu_top.groupby(['job_country', 'job_title_short'])['salary_month_avg_eur'].median().reset_index()

df_iso_lookup = df_eu_top[['job_country', 'ISO']].drop_duplicates()
df_map = df_map.merge(df_iso_lookup, on='job_country', how='left')
df_map.rename(columns={'ISO': 'iso_alpha'}, inplace=True)
df_map = df_map.dropna(subset=['iso_alpha'])

# Prepare traces for each role (Data Analyst, Scientist, Engineer)
fig4 = go.Figure()

job_titles = df_map['job_title_short'].unique()

for role in job_titles:
    df_role = df_map[df_map['job_title_short'] == role]
    fig4.add_trace(go.Choropleth(
        locations=df_role['iso_alpha'],
        z=df_role['salary_month_avg_eur'],
        text=df_role['job_country'],
        colorscale='Blues',
        colorbar_title='EUR',
        zmin=df_map['salary_month_avg_eur'].min(),
        zmax=df_map['salary_month_avg_eur'].max(),
        visible=(role == job_titles[0]),
        name=role,
        locationmode='ISO-3'
    ))

# Side-by-side toggle buttons
buttons = [
    {
        'label': role,
        'method': 'update',
        'args': [
            {'visible': [r == role for r in job_titles]},
            {'title': f'Median Monthly Salary in EU — {role}'}
        ]
    }
    for role in job_titles
]

# Layout with buttons above the plot
fig4.update_layout(
    title=dict(
        text=f'Median Monthly Salary in EU — {job_titles[0]}',
        y=0.92,  # move title up to make space
        x=0.5,
        xanchor='center'
    ),
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='natural earth',
        scope='europe'
    ),
    height=600,
    updatemenus=[{
        'type': 'buttons',
        'buttons': buttons,
        'direction': 'right',
        'pad': {'r': 10, 't': 10},
        'x': 0.5,
        'xanchor': 'center',
        'y': 1.05,
        'yanchor': 'top'
    }]
)

fig4.show()

The interactive map above shows median monthly salaries by European country for the top-3 in-demand roles. The highest salaries are found in Cyprus, Sweden, and the Netherlands. Cyprus stands out because of its favorable conditions for registering companies, which make it a profitable location for business. Sweden and the Netherlands host many global companies, such as Spotify, Netflix, and H&M, which contributes to higher salary levels in these countries.
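
Country-level medians can rest on very few postings, which is the concern raised in the outlier section. It may therefore help to look at posting counts next to each median before drawing conclusions from the map. Below is a minimal sketch on toy data. The helper `median_with_counts`, the threshold of 5, and the toy country names are assumptions for illustration.

```python
import pandas as pd


def median_with_counts(df, country_col, value_col, min_postings=5):
    """Median salary and number of postings per country, flagging small samples."""
    summary = (
        df.groupby(country_col)[value_col]
        .agg(median="median", postings="count")
        .sort_values("median", ascending=False)
    )
    summary["low_sample"] = summary["postings"] < min_postings
    return summary


# Toy data (hypothetical): a high median backed by only two postings
toy_map = pd.DataFrame({
    "country": ["Country A"] * 2 + ["Country B"] * 6,
    "salary": [13000, 14000, 5000, 5500, 6000, 6500, 7000, 7500],
})
check = median_with_counts(toy_map, "country", "salary")
print(check)
```

Rows with `low_sample` set to `True` are candidates for a closer look rather than evidence of a genuinely higher-paying market.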

In [10]:
df_ds = df_eu_top[df_eu_top['job_title_short'] == 'Data Scientist'].copy()

top3_countries = (
    df_ds.groupby('job_country')['salary_month_avg_eur']
    .median()
    .sort_values(ascending=False)
    .head(3)
    .index.tolist()
)

ds_top_countries = df_ds[df_ds['job_country'].isin(top3_countries)]

# Median salary per company in those countries
company_medians = (
    ds_top_countries.groupby(['company_name', 'job_country'])['salary_month_avg_eur']
    .median()
    .reset_index()
    .sort_values(by='salary_month_avg_eur', ascending=False)
    .round(0)
)

print("Top 3 countries by median Data Scientist salary:", top3_countries)
print(company_medians.head(15))

Top 3 countries by median Data Scientist salary: ['Cyprus', 'Sweden', 'Netherlands']
            company_name  job_country  salary_month_avg_eur
15                 Palta       Cyprus               13709.0
10                   ING  Netherlands               12243.0
7               Elsevier  Netherlands               12238.0
0                  Adyen  Netherlands               11305.0
16               Spotify       Sweden               11305.0
17        Syngenta Group  Netherlands               11305.0
1         Avery Dennison  Netherlands               11305.0
6       Dun & Bradstreet       Sweden               11305.0
14               Netflix  Netherlands               11305.0
18            Vattenfall  Netherlands               11305.0
9              H&M Group       Sweden               10407.0
3       Creative Fabrica  Netherlands               10374.0
13                  Miro  Netherlands                8491.0
11  JACOBS DOUWE EGBERTS  Netherlands                7859.0
2              