**Abstract**

This analysis investigates salary variation among professionals in four data-related roles (Data Scientist, Data Engineer, Data Analyst, and Data Architect) across work models and geographies. Although these roles share overlapping responsibilities, their compensation differs substantially. Using a global dataset of salaries from 2020 to 2024, the analysis visualizes pay patterns by role, work model, and country to support workforce planning and career decisions.

**Introduction**

The COVID-19 pandemic accelerated the adoption of remote work, raising a critical question: *Which work model pays more?*

Salary trends across roles, work models, and countries from 2020 to 2024 were investigated to understand these dynamics.

**Methodology**
1. *Data Collection*
   - Source: Kaggle dataset “Data Science Salaries”
   - Roles analyzed: Data Scientist, Data Engineer, Data Analyst, Data Architect
   - Experience levels: Entry-level, Mid-level, Senior-level, Executive-level
   - Dimensions: Work model (Remote, On-site), Year (2020–2024), Country

Dataset URL: https://www.kaggle.com/datasets/sazidthe1/data-science-salaries

In [1]:
# Imports
import kagglehub
import pandas as pd
import os
import plotly.express as px
import plotly.graph_objects as go

In [2]:
# Download the "Data Science Salaries" dataset from Kaggle
path_ds = kagglehub.dataset_download("sazidthe1/data-science-salaries")
print("Data Scientist dataset path:", path_ds)
print(os.listdir(path_ds))

df_ds = pd.read_csv(f"{path_ds}/data_science_salaries.csv")
print(df_ds.head())


Data Scientist dataset path: /root/.cache/kagglehub/datasets/sazidthe1/data-science-salaries/versions/2
['data_science_salaries.csv']
        job_title experience_level employment_type work_models  work_year  \
0   Data Engineer        Mid-level       Full-time      Remote       2024   
1   Data Engineer        Mid-level       Full-time      Remote       2024   
2  Data Scientist     Senior-level       Full-time      Remote       2024   
3  Data Scientist     Senior-level       Full-time      Remote       2024   
4    BI Developer        Mid-level       Full-time     On-site       2024   

  employee_residence  salary salary_currency  salary_in_usd company_location  \
0      United States  148100             USD         148100    United States   
1      United States   98700             USD          98700    United States   
2      United States  140032             USD         140032    United States   
3      United States  100022             USD         100022    




2. *Data Preparation*
   - Dataset filtered for selected roles and experience levels
   - Hybrid work model removed for clarity
   - Average salaries aggregated by role, work model, year, and geography

In [3]:
# Define roles and experience levels of interest
roles_of_interest = ["Data Scientist", "Data Engineer", "Data Analyst", "Data Architect"]
experience_levels = ["Entry-level", "Mid-level", "Senior-level", "Executive-level"]

# Filter dataset and create a copy to avoid SettingWithCopyWarning
filtered_df = df_ds[
    (df_ds['job_title'].isin(roles_of_interest)) &
    (df_ds['experience_level'].isin(experience_levels))
].copy()

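# Integer copy of the year, used later as a numeric plotting axis
filtered_df['work_year_display'] = filtered_df['work_year'].astype(int)

# Remove the Hybrid work model so that only Remote and On-site remain
filtered_df = filtered_df[filtered_df['work_models'] != 'Hybrid'].copy()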

print(filtered_df.head())



        job_title experience_level employment_type work_models  work_year  \
0   Data Engineer        Mid-level       Full-time      Remote       2024   
1   Data Engineer        Mid-level       Full-time      Remote       2024   
2  Data Scientist     Senior-level       Full-time      Remote       2024   
3  Data Scientist     Senior-level       Full-time      Remote       2024   
8   Data Engineer  Executive-level       Full-time      Remote       2024   

  employee_residence  salary salary_currency  salary_in_usd company_location  \
0      United States  148100             USD         148100    United States   
1      United States   98700             USD          98700    United States   
2      United States  140032             USD         140032    United States   
3      United States  100022             USD         100022    United States   
8      United States  219650             USD         219650    United States   

  company_size  work_year_display  
0       Medium      
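
The preparation steps above can also be wrapped in a small reusable function. The following is a minimal sketch, not the notebook's own code: `prepare_salaries` is a hypothetical helper name, and the toy rows are invented purely for illustration (only the column names mirror the dataset).

```python
import pandas as pd

ROLES = ["Data Scientist", "Data Engineer", "Data Analyst", "Data Architect"]
LEVELS = ["Entry-level", "Mid-level", "Senior-level", "Executive-level"]


def prepare_salaries(df, roles=ROLES, levels=LEVELS):
    """Hypothetical helper: keep selected roles/levels and drop Hybrid rows."""
    mask = (
        df["job_title"].isin(roles)
        & df["experience_level"].isin(levels)
        & (df["work_models"] != "Hybrid")
    )
    # .copy() returns an independent frame, so later column assignments are safe
    return df.loc[mask].copy()


# Invented toy rows (not taken from the dataset)
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "BI Developer", "Data Analyst", "Data Scientist"],
    "experience_level": ["Mid-level", "Mid-level", "Senior-level", "Entry-level"],
    "work_models": ["Remote", "On-site", "Hybrid", "On-site"],
    "salary_in_usd": [148_100, 90_000, 80_000, 70_000],
})

prepared = prepare_salaries(toy)
print(prepared[["job_title", "work_models"]])
```

Combining the three conditions into one mask keeps the filtering in a single place, which makes it easier to rerun the analysis with a different set of roles.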

3. *Tools and Libraries*

The analysis was conducted in Google Colab using Python, leveraging the following libraries:

*  pandas: for data loading, filtering, grouping, and tabular manipulation.
*  kagglehub: to access and download the dataset from Kaggle directly within the Colab environment.
*  plotly.express: for creating interactive visualizations.
*  matplotlib and seaborn: for static plots and exploratory data visualization.

4. *Visualization Approach*

To address *“Which work model pays more?”* and *“Has this changed since COVID?”*, the following visualizations were created:
- *Bar Chart*: Average salary by work model across roles
- *Line Charts*: Salary trends by work model (2020–2024) with COVID period highlighted
- *Choropleth Maps*: Average salary by country for each role
- *Country vs Work Model Comparison*: Geographic influence on pay differences

In [4]:
# 1. Bar Chart – Average salary by work model across roles
avg_salary_by_model_role = (
    filtered_df.groupby(['work_models', 'job_title'])['salary_in_usd']
    .mean()
    .reset_index()
)
fig1 = px.bar(
    avg_salary_by_model_role,
    x='job_title', y='salary_in_usd', color='work_models', barmode='group',
    title='Average Salary by Work Model Across Roles',
    labels={'salary_in_usd': 'Average Salary (USD)', 'job_title': 'Role', 'work_models': 'Work Model'}
)
fig1.show()
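
Averages can be pulled around by a few very high salaries or by small groups, so it may help to view the median and the group size next to the mean. The sketch below uses pandas named aggregation on invented toy data; the values are illustrative only and the column names are assumed to match the dataset.

```python
import pandas as pd

# Invented toy rows (not taken from the dataset)
toy = pd.DataFrame({
    "job_title": ["Data Analyst"] * 3 + ["Data Engineer"] * 3,
    "work_models": ["Remote", "Remote", "On-site", "Remote", "On-site", "On-site"],
    "salary_in_usd": [90_000, 110_000, 100_000, 150_000, 120_000, 140_000],
})

# Mean, median, and observation count per role and work model
summary = (
    toy.groupby(["job_title", "work_models"])["salary_in_usd"]
    .agg(mean_salary="mean", median_salary="median", n="count")
    .reset_index()
)
print(summary)
```

A large gap between the mean and the median, or a very small `n`, would suggest reading the corresponding bar with caution.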

In [5]:
# 2. Line Chart – Salary trends by work model from 2020 to 2024

for role in roles_of_interest:
    role_df = filtered_df[filtered_df['job_title'] == role]
    avg_salary_by_year_model = role_df.groupby(['work_year_display', 'work_models'])['salary_in_usd'].mean().reset_index()

    fig = px.line(
        avg_salary_by_year_model,
        x='work_year_display', y='salary_in_usd', color='work_models',
        title=f'Salary Trends for {role} by Work Model (2020–2024)',
        labels={'salary_in_usd': 'Average Salary (USD)', 'work_year_display': 'Year', 'work_models': 'Work Model'}
    )

    # Add subtle shaded band for COVID (2020–2021)
    fig.add_shape(
        type="rect",
        x0=2020, x1=2021,
        y0=avg_salary_by_year_model['salary_in_usd'].min(),
        y1=avg_salary_by_year_model['salary_in_usd'].max(),
        fillcolor="rgba(255,0,0,0.15)",  # Red with moderate opacity
        layer="below",
        line=dict(color="rgba(255,0,0,0.4)", width=1, dash="dot")  # Red border
    )


    # Add annotation
    fig.add_annotation(
        x=2020.5,
        y=avg_salary_by_year_model['salary_in_usd'].max(),
        text="COVID Period",
        showarrow=False,
        font=dict(color="gray", size=12),
        yshift=10
    )

    # Use numeric axis for proper spacing
    fig.update_xaxes(type='linear', tickmode='array', tickvals=[2020, 2021, 2022, 2023, 2024])
    fig.show()


In [6]:
# 3. Choropleth Map – Average salary by country for each role

for role in roles_of_interest:
    role_df = filtered_df[filtered_df['job_title'] == role]
    avg_salary_by_country = (
        role_df.groupby('company_location')['salary_in_usd']
        .mean()
        .reset_index()
    )

    fig = px.choropleth(
        avg_salary_by_country,
        locations='company_location', locationmode='country names',
        color='salary_in_usd', hover_name='company_location',
        color_continuous_scale='Viridis',
        title=f'Average Salary by Country for {role}',
        labels={'salary_in_usd': 'Average Salary (USD)'}
    )
    fig.show()
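
A country with only one or two records can dominate the color scale of a map. One possible safeguard, sketched here on invented toy data, is to keep only countries with a minimum number of observations before plotting. `MIN_OBSERVATIONS` is a hypothetical threshold and was not part of the original analysis.

```python
import pandas as pd

MIN_OBSERVATIONS = 3  # hypothetical cut-off, chosen only for this example

# Invented toy rows (not taken from the dataset)
toy = pd.DataFrame({
    "company_location": ["United States"] * 4 + ["Canada"] * 3 + ["Portugal"],
    "salary_in_usd": [150_000, 140_000, 160_000, 150_000,
                      100_000, 110_000, 120_000, 60_000],
})

# Average salary and record count per country
country_stats = (
    toy.groupby("company_location")["salary_in_usd"]
    .agg(avg_salary="mean", n="count")
    .reset_index()
)

# Keep only countries with enough records to make the average meaningful
reliable = country_stats[country_stats["n"] >= MIN_OBSERVATIONS]
print(reliable)
```

The resulting frame could be passed to `px.choropleth` in place of the unfiltered averages, with `color='avg_salary'`.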

In [7]:
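# 4. Country vs Work Model Comparison – average Remote vs On-site salary by country
# (Hybrid rows were already removed above; this filter is kept as an explicit guard)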
df_filtered = filtered_df[filtered_df['work_models'].isin(['Remote', 'On-site'])]

# Group by country and work model
avg_salary_country_model = (
    df_filtered.groupby(['company_location', 'work_models'])['salary_in_usd']
    .mean()
    .reset_index()
)

# Pivot for comparison
pivot_df = avg_salary_country_model.pivot(index='company_location', columns='work_models', values='salary_in_usd').dropna()

# Calculate difference (Remote - On-site)
pivot_df['Difference_Remote_vs_Onsite'] = pivot_df['Remote'] - pivot_df['On-site']

# Sort countries by difference
pivot_df_sorted = pivot_df.sort_values(by='Difference_Remote_vs_Onsite', ascending=False).reset_index()

# Melt for grouped bar chart
melted_df = pivot_df_sorted.melt(id_vars='company_location', value_vars=['Remote', 'On-site'],
                                 var_name='Work_Model', value_name='Average_Salary_USD')

# Create grouped bar chart
fig = px.bar(
    melted_df,
    x='company_location', y='Average_Salary_USD', color='Work_Model', barmode='group',
    title='Remote vs On-site Salary Comparison by Country (2020–2024 combined)',
    labels={'company_location': 'Country', 'Average_Salary_USD': 'Average Salary (USD)', 'Work_Model': 'Work Model'}
)

# Sort by difference for clarity
fig.update_xaxes(categoryorder='array', categoryarray=pivot_df_sorted['company_location'].tolist())

fig.show()


# Yearly Remote vs On-site trend by country (independent copy of the same subset)
df_filtered = filtered_df[filtered_df['work_models'].isin(['Remote', 'On-site'])].copy()

# Average salary by Year x Country x Work Model
avg_salary_year = (
    df_filtered
    .groupby(['work_year', 'company_location', 'work_models'])['salary_in_usd']
    .mean()
    .reset_index()
    .rename(columns={'salary_in_usd': 'Average_Salary_USD'})
)

# Limit to top 8 countries (by row count)
top_n_countries = 8
if top_n_countries:
    top_countries = (
        df_filtered.groupby('company_location').size()
        .sort_values(ascending=False)
        .head(top_n_countries)
        .index.tolist()
    )
    plot_df = avg_salary_year[avg_salary_year['company_location'].isin(top_countries)]
else:
    plot_df = avg_salary_year.copy()

# Trend lines (faceted by country)
fig = px.line(
    plot_df,
    x='work_year',
    y='Average_Salary_USD',
    color='work_models',
    markers=True,
    facet_col='company_location',
    facet_col_wrap=4,
    category_orders={'work_year': [2020, 2021, 2022, 2023, 2024]},
    labels={
        'work_year': 'Year',
        'Average_Salary_USD': 'Average Salary (USD)',
        'work_models': 'Work Model'
    },
    title='Remote vs On-site Salary Trend by Country (2020–2024)'
)
# Formatting tweaks
fig.update_xaxes(dtick=1)  # whole-year ticks on the numeric year axis
fig.update_yaxes(matches=None, tickprefix='$', separatethousands=True)
fig.update_layout(legend_title_text='Work Model', hovermode='x unified')
# Clean facet titles (remove "company_location=" prefix)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()
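
Absolute differences are hard to compare between countries with very different pay levels. As a hedged sketch, the Remote vs On-site gap could also be expressed as a percentage of the On-site average. The toy values below are invented, and `gap_pct` is a hypothetical column name.

```python
import pandas as pd

# Invented toy rows (not taken from the dataset)
toy = pd.DataFrame({
    "company_location": ["United States", "United States", "Canada", "Canada"],
    "work_models": ["Remote", "On-site", "Remote", "On-site"],
    "salary_in_usd": [150_000, 160_000, 90_000, 120_000],
})

# One row per country, one column per work model
pivot = toy.pivot_table(
    index="company_location",
    columns="work_models",
    values="salary_in_usd",
    aggfunc="mean",
).dropna()

# Relative gap: positive means Remote pays more than On-site
pivot["gap_pct"] = (pivot["Remote"] - pivot["On-site"]) / pivot["On-site"] * 100
print(pivot.round(2))
```

Sorting by `gap_pct` instead of the absolute difference would rank countries by the relative size of the gap.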

**Storyline and Key Insights**

Which work model pays more?

Salary trends over time help answer this question. The charts show that during 2020–2021, Remote roles carried a clear premium, likely driven by COVID-related shifts. From 2022 onward, On-site salaries grew faster, and by 2023 they had overtaken Remote salaries for most of the roles analyzed. The Remote advantage has therefore not been stable over time.
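
The crossover described above can be checked numerically by computing the Remote minus On-site gap for each year. The sketch below runs on invented toy values that merely imitate the described pattern; they are not results from the dataset, and `remote_minus_onsite` is a hypothetical column name.

```python
import pandas as pd

# Invented toy rows shaped like the described trend (not real results)
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022],
    "work_models": ["Remote", "On-site"] * 3,
    "salary_in_usd": [120_000, 100_000, 125_000, 115_000, 130_000, 140_000],
})

# One row per year, one column per work model
yearly = toy.pivot_table(
    index="work_year",
    columns="work_models",
    values="salary_in_usd",
    aggfunc="mean",
)

# Positive gap: Remote pays more; negative gap: On-site pays more
yearly["remote_minus_onsite"] = yearly["Remote"] - yearly["On-site"]

# Years in which On-site pay exceeds Remote pay
crossover_years = yearly.index[yearly["remote_minus_onsite"] < 0].tolist()
print(yearly)
print("On-site ahead in:", crossover_years)
```

Running the same steps on `filtered_df`, optionally per role, would give the actual year-by-year gap behind the line charts.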

Geography also influences which model pays more, as the maps and country-level charts show. In countries such as Australia and Canada, On-site salaries exceed Remote salaries, while in the U.S. and U.K., Remote pay remains competitive. North America and Western Europe show the highest overall pay levels, while emerging markets show lower salaries regardless of work model.

Role differences also matter. Data Architect consistently leads, with average Remote pay peaking near USD 178K. Data Analyst shows minimal difference between work models, averaging around USD 110K.

The trend is clear: the Remote premium seen during COVID had narrowed significantly by 2024, signaling market normalization. Today, the choice between Remote and On-site depends more on role and region than on pandemic-driven trends.
