**Abstract**

This analysis explores salary variations among professionals in data-related roles, collectively referred to as Data Practitioners, across countries. Roles such as Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect often overlap in responsibilities but differ significantly in compensation. Using publicly available salary data, we visualize how average pay varies by role, work model, year, and geography, providing insights for workforce planning and career decisions.

**Introduction**
Organizations increasingly rely on data-driven decision-making, creating demand for roles like Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect. Despite overlapping responsibilities, these roles have distinct salary ranges influenced by factors such as skill requirements, industry, and location. This report addresses the question:
***“How much do we get paid?”***

**Methodology**

1. *Data Collection*

The salary information was gathered for four common data-related roles:

*   Data Scientist
*   Data Engineer
*   Data Analyst
*   Data Architect

The primary source for this analysis is Kaggle, which hosts datasets containing detailed salary information for various data roles, including the four roles listed above. These datasets provide granular fields such as job title, experience level, employment type, work model, and salary figures.
For this study, we filter and clean the Kaggle dataset to include only those four roles: Data Scientist, Data Engineer, Data Analyst, and Data Architect.

Source: the `sazidthe1/data-science-salaries` dataset on Kaggle [https://www.kaggle.com/datasets/sazidthe1/data-science-salaries]

In [1]:
# Imports
import kagglehub
import pandas as pd
import os
import plotly.express as px

In [2]:
# Download Data Scientist dataset
path_ds = kagglehub.dataset_download("sazidthe1/data-science-salaries")
print("Data Scientist dataset path:", path_ds)
print(os.listdir(path_ds))

df_ds = pd.read_csv(os.path.join(path_ds, "data_science_salaries.csv"))
print(df_ds.head())


Using Colab cache for faster access to the 'data-science-salaries' dataset.
Data Scientist dataset path: /kaggle/input/data-science-salaries
['data_science_salaries.csv']
        job_title experience_level employment_type work_models  work_year  \
0   Data Engineer        Mid-level       Full-time      Remote       2024   
1   Data Engineer        Mid-level       Full-time      Remote       2024   
2  Data Scientist     Senior-level       Full-time      Remote       2024   
3  Data Scientist     Senior-level       Full-time      Remote       2024   
4    BI Developer        Mid-level       Full-time     On-site       2024   

  employee_residence  salary salary_currency  salary_in_usd company_location  \
0      United States  148100             USD         148100    United States   
1      United States   98700             USD          98700    United States   
2      United States  140032             USD         140032    United States   
3      United States  100022             USD  
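
Before filtering, it can help to confirm that the columns used in later cells are present and that the text fields are tidy. The following is a minimal sketch under stated assumptions: `basic_clean` and `REQUIRED_COLUMNS` are hypothetical names introduced here, the column list is taken from the printed header above, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

# Columns the later cells rely on (names taken from the printed header above).
REQUIRED_COLUMNS = ["job_title", "experience_level", "work_models",
                    "work_year", "salary_in_usd", "company_location"]

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: validate columns and tidy text fields."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    out = df.copy()
    # Trim stray whitespace so that isin() filters match exactly.
    for col in ["job_title", "experience_level", "work_models", "company_location"]:
        out[col] = out[col].astype(str).str.strip()
    # Coerce salaries to numbers and drop rows without a usable USD salary.
    out["salary_in_usd"] = pd.to_numeric(out["salary_in_usd"], errors="coerce")
    out = out.dropna(subset=["salary_in_usd"])
    return out.reset_index(drop=True)

# Tiny illustrative frame (made-up values, not taken from the dataset).
sample = pd.DataFrame({
    "job_title": [" Data Engineer", "Data Analyst ", "Data Scientist"],
    "experience_level": ["Mid-level", "Entry-level", "Senior-level"],
    "work_models": ["Remote", "On-site", "Hybrid"],
    "work_year": [2024, 2023, 2022],
    "salary_in_usd": [100000, None, 120000],
    "company_location": ["United States", "Canada", "Germany"],
})
cleaned = basic_clean(sample)
print(cleaned)
```

In the notebook, the same helper could be applied as `df_ds = basic_clean(df_ds)` right after loading. Whether any rows actually need cleaning in this dataset is not established here.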

2. *Data Preparation*

A global perspective was adopted by analyzing salary data across the countries where data-related roles are prevalent. The dataset was filtered to include four key job titles: Data Scientist, Data Engineer, Data Analyst, and Data Architect. Four experience levels were considered: Entry-level, Mid-level, Senior-level, and Executive-level. For each combination of country and experience level, the average salary in USD was calculated. The resulting data was structured into a tabular format using Python's pandas library, enabling efficient manipulation and visualization for comparative analysis.

In [3]:
# Define the roles and experience levels of interest
roles_of_interest = ["Data Scientist", "Data Engineer", "Data Analyst", "Data Architect"]
experience_levels = ["Entry-level", "Mid-level", "Senior-level", "Executive-level"]

# Filter the dataset
filtered_df = df_ds[
    (df_ds['job_title'].isin(roles_of_interest)) &
    (df_ds['experience_level'].isin(experience_levels))
]
print(filtered_df)

# Group by country and experience level, then calculate average salary
avg_salary_by_country_level = filtered_df.groupby(['company_location', 'experience_level'])['salary_in_usd'].mean().reset_index()

# Rename columns for clarity
avg_salary_by_country_level.columns = ['Country', 'Experience_Level', 'Average_Salary_USD']

# Display the result
print(avg_salary_by_country_level)

           job_title experience_level employment_type work_models  work_year  \
0      Data Engineer        Mid-level       Full-time      Remote       2024   
1      Data Engineer        Mid-level       Full-time      Remote       2024   
2     Data Scientist     Senior-level       Full-time      Remote       2024   
3     Data Scientist     Senior-level       Full-time      Remote       2024   
8      Data Engineer  Executive-level       Full-time      Remote       2024   
...              ...              ...             ...         ...        ...   
6589   Data Engineer        Mid-level       Full-time      Remote       2020   
6591  Data Scientist        Mid-level       Full-time     On-site       2020   
6592  Data Scientist      Entry-level       Full-time      Hybrid       2020   
6597   Data Engineer        Mid-level       Full-time      Hybrid       2020   
6598  Data Scientist     Senior-level       Full-time     On-site       2020   

     employee_residence  salary salary_
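
The long table of country and experience-level averages can be easier to compare when pivoted so that each experience level becomes a column. This is a small sketch, not part of the original notebook: the column names mirror those assigned to `avg_salary_by_country_level`, while `long_df`, `wide_df`, and the numbers in them are made up for illustration.

```python
import pandas as pd

# Made-up long-format averages, shaped like avg_salary_by_country_level.
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Experience_Level": ["Entry-level", "Senior-level", "Entry-level", "Senior-level"],
    "Average_Salary_USD": [60000.0, 120000.0, 40000.0, 90000.0],
})

# Career-progression order for the columns (alphabetical order would be misleading).
level_order = ["Entry-level", "Mid-level", "Senior-level", "Executive-level"]

wide_df = (
    long_df.pivot(index="Country", columns="Experience_Level", values="Average_Salary_USD")
    .reindex(columns=level_order)  # fixed column order; absent levels become NaN
)
print(wide_df)
```

Levels with no observations for a country show up as `NaN`, which makes gaps in coverage visible at a glance.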

3. *Tools and Libraries*

The analysis was conducted in Google Colab using Python, leveraging the following libraries:

*  pandas: for data loading, filtering, grouping, and tabular manipulation.
*  kagglehub: to access and download the dataset from Kaggle directly within the Colab environment.
*  plotly.express: for creating interactive visualizations.
*  matplotlib and seaborn: for static plots and exploratory data visualization.

4. *Visualization Approach*

To answer the question “How much do we get paid?”, three types of interconnected visualizations were created to uncover salary patterns across roles, work models, time, and geography:

* Bar Chart: Compares average salaries for the selected data roles, segmented by work model (Remote, On-site, Hybrid), highlighting how the work model relates to compensation.
* Line Charts: Show salary trends from 2020 to 2024 for each role, revealing growth or decline in pay over time and differences between the Remote and On-site work models (Hybrid is excluded from these charts).
* Choropleth Maps: Provide a global perspective by illustrating average salaries by country. Four maps focus on individual roles, and one combined map shows overall trends across all roles, making geographic disparities clear and visually engaging.

These visuals were chosen to build a cohesive narrative around how employment models, time, and location shape salary dynamics in data-related professions, helping uncover evolving trends and global disparities in compensation.

In [4]:
# 1. Bar Chart – Average salary by work model across roles
avg_salary_by_model_role = filtered_df.groupby(['work_models', 'job_title'])['salary_in_usd'].mean().reset_index()
fig1 = px.bar(avg_salary_by_model_role, x='job_title', y='salary_in_usd', color='work_models', barmode='group',
              title='Average Salary by Work Model Across Roles',
              labels={'salary_in_usd': 'Average Salary (USD)', 'job_title': 'Role', 'work_models': 'Work Model'})
fig1.show()

In [5]:
# 2. Line Chart – Salary trends by work model from 2020 to 2024

# Remove the 'Hybrid' work model and take an explicit copy, so the column
# assignment below does not trigger pandas' SettingWithCopyWarning.
# Note: filtered_df is reassigned, so later cells also exclude 'Hybrid'.
filtered_df = filtered_df[filtered_df['work_models'] != 'Hybrid'].copy()

# Keep original work_year for calculations but create a clean display column
filtered_df['work_year_display'] = filtered_df['work_year'].astype(int)

# Create separate line charts for each role
for role in roles_of_interest:
    role_df = filtered_df[filtered_df['job_title'] == role]
    avg_salary_by_year_model = role_df.groupby(['work_year_display', 'work_models'])['salary_in_usd'].mean().reset_index()

    fig = px.line(avg_salary_by_year_model, x='work_year_display', y='salary_in_usd', color='work_models',
                  title=f'Salary Trends for {role} by Work Model (2020–2024)',
                  labels={'salary_in_usd': 'Average Salary (USD)', 'work_year_display': 'Year', 'work_models': 'Work Model'})

    # Treat years as categories so the X-axis shows clean integers
    fig.update_xaxes(type='category')
    fig.show()



In [6]:
# 3. Choropleth Maps – Average salary by country, per role and combined

# Generate maps for each role
for role in roles_of_interest:
    role_df = filtered_df[filtered_df['job_title'] == role]
    avg_salary_by_country = role_df.groupby('company_location')['salary_in_usd'].mean().reset_index()

    fig = px.choropleth(avg_salary_by_country,
                        locations='company_location',
                        locationmode='country names',
                        color='salary_in_usd',
                        hover_name='company_location',
                        hover_data={'salary_in_usd': True},
                        color_continuous_scale='Viridis',
                        title=f'Average Salary by Country for {role}',
                        labels={'salary_in_usd': 'Average Salary (USD)'})

    fig.show()

# Combined map for all roles
avg_salary_combined = filtered_df.groupby('company_location')['salary_in_usd'].mean().reset_index()
fig_combined = px.choropleth(avg_salary_combined,
                             locations='company_location',
                             locationmode='country names',
                             color='salary_in_usd',
                             hover_name='company_location',
                             hover_data={'salary_in_usd': True},
                             color_continuous_scale='Viridis',
                             title='Average Salary by Country for All Data Roles',
                             labels={'salary_in_usd': 'Average Salary (USD)'})
fig_combined.show()
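
Country-level averages like the ones mapped above can be skewed when a country contributes only a handful of records. One possible safeguard, sketched here as an assumption rather than something the notebook does, is to compute the record count alongside the mean and keep only countries above a threshold. `MIN_OBSERVATIONS`, `toy`, `summary`, and `reliable` are hypothetical names, and the values are made up.

```python
import pandas as pd

MIN_OBSERVATIONS = 3  # hypothetical cutoff; choose based on the data at hand

# Made-up records using the same column names as the dataset.
toy = pd.DataFrame({
    "company_location": ["X", "X", "X", "Y", "Z", "Z", "Z", "Z"],
    "salary_in_usd": [100, 110, 120, 500, 80, 90, 100, 110],
})

# Named aggregation: mean salary and number of records per country.
summary = (
    toy.groupby("company_location")["salary_in_usd"]
    .agg(mean_salary="mean", n="count")
    .reset_index()
)

# Keep only countries with enough records to make the mean meaningful.
reliable = summary[summary["n"] >= MIN_OBSERVATIONS]
print(reliable)
```

In this toy example country "Y" has a single, very high record and is dropped. The `reliable` frame could then be passed to `px.choropleth` with `color='mean_salary'`.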

**Conclusion**

The analysis reveals clear salary disparities across roles, work models, time, and geography:


 - Role-Based Differences

Data Architect consistently commands the highest salaries, followed by Data Scientist and Data Engineer, while Data Analyst earns the least. This likely reflects the complexity and strategic importance of architecture and advanced analytics roles.


 - Impact of Work Model

Remote work generally offers competitive pay, often surpassing on-site roles for senior positions like Data Architect. However, trends vary: for Data Engineer and Data Scientist, remote salaries peaked around 2023 but declined slightly in 2024, while on-site pay remained stable or increased.


 - Time Trends (2020–2024)

Salaries for all roles grew significantly from 2020 to 2023, which is consistent with strong demand for data expertise. Growth plateaued or slightly declined in 2024, which may indicate market stabilization after pandemic-era remote work premiums.
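
Growth and plateau statements of this kind can be quantified as year-over-year percentage changes per role. The sketch below uses made-up numbers purely to show the calculation; `toy`, `avg`, and `yoy_change_pct` are illustrative names, and only the column names `job_title`, `work_year`, and `salary_in_usd` come from the dataset.

```python
import pandas as pd

# Made-up yearly salaries for two placeholder roles.
toy = pd.DataFrame({
    "job_title": ["Role A"] * 3 + ["Role B"] * 3,
    "work_year": [2020, 2021, 2022, 2020, 2021, 2022],
    "salary_in_usd": [100000.0, 110000.0, 121000.0, 200000.0, 200000.0, 180000.0],
})

# Average per role and year; groupby sorts the keys, so years are in order.
avg = toy.groupby(["job_title", "work_year"], as_index=False)["salary_in_usd"].mean()

# Percentage change relative to the previous year, computed within each role.
avg["yoy_change_pct"] = avg.groupby("job_title")["salary_in_usd"].pct_change() * 100
print(avg.round(1))
```

The first year of each role has no previous year to compare against, so its change is `NaN`.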


 - Geographic Insights

North America and Western Europe dominate as top-paying regions across all roles, with the U.S. leading globally. Countries like Australia and Canada also offer strong compensation, while emerging markets in Asia and South America show considerably lower pay scales.


Answer to *“How much do we get paid?”*

Compensation depends heavily on role, location, and work model. Senior technical roles (Data Architect, Data Scientist) in North America and Western Europe earn the highest salaries, often exceeding $160K annually, especially for remote positions. Entry-level and analyst roles earn substantially less, and geographic disparities remain pronounced.
