# The Analysis of Data Science Salaries in the United States in 2023
I've analyzed data science salaries using a dataset I found on Kaggle: "Data Science Salaries 2023". The charts are restricted to 2023 records for employees residing in the United States. My aim is to extract meaningful insights that can inform my design decisions. Here's a breakdown of my tasks:

### Task 1: Salary Benchmarking by Experience Level
My goal here is to establish clear benchmarks for salaries at various experience levels within the data science field. I'm analyzing salary distributions and calculating percentiles for each experience category to understand what constitutes standard, above-average, and exceptional salary ranges. This task is crucial during the initial data exploration phase and is typically handled by Data Analysts or HR professionals.

### Task 2: Remote Work Ratio Influence on Salary Distribution
My goal is to analyze the impact of remote work arrangements on salary distribution in the data science sector. I'm employing a box plot to visualize the salary ranges and medians for fully remote, hybrid, and in-person work settings. This will help determine whether remote work status is associated with higher or lower salaries. This task is part of a broader analysis to understand the effects of remote work on compensation and is suited for HR Data Analysts, Workforce Planners, and Policy Makers.

### Task 3: Analyzing Salary and Remote Work Preferences of Each Job Title by Experience Level
I'm exploring the relationship between salary levels and remote work preferences across different experience levels in the data science field. By creating a scatter plot, I'm visualizing salaries and categorizing jobs by their remote work status—Remote, Hybrid, or In-person. This highlights how salary and remote work preferences vary among Entry Level, Mid Level, Senior, and Executive roles. This deeper dive into the data follows the initial benchmarks and is a task for Data Scientists, HR Analysts, and Remote Work Consultants.

### Task 4: Company Size Impact on Salary Distribution
I'm investigating how the size of a company affects the salary distribution within the data science field. To do this, I'm utilizing a box plot to represent the range and distribution of salaries across different company sizes. This will help me identify the median, quartiles, and potential outliers in salary for small, medium, and large companies. This compensation analysis is part of understanding the influence of company size on employee remuneration and is a task for Compensation Analysts, Business Strategists, and Organizational Consultants.



Each of these tasks is designed to directly influence my design by providing a clear picture of the current state of data science salaries and how various factors such as experience level, company size, and remote work preferences can affect them.

### Snapshot of First Few Rows

In [1]:
# Importing necessary libraries
import pandas as pd  
import altair as alt  
import warnings
warnings.filterwarnings('ignore')
# Reading the dataset from a CSV file
data = pd.read_csv("ds_salaries.csv")

# Displaying the first few rows of the dataset to inspect its structure
data.head()

| Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |


### The Interactive Graph of Salary Benchmarking by Experience Level


In [2]:
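# Filter the data to 2023 entries where the employee residence is in the United States.
# .copy() gives an independent DataFrame, so columns added later do not write to a slice of `data`.
data_us = data[(data['employee_residence'] == 'US') & (data['work_year'] == 2023)].copy()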

# Calculate quartiles for each experience level
percentiles = data_us.groupby('experience_level')['salary_in_usd'].quantile([0.25, 0.5, 0.75]).reset_index()

# Pivot the data to have quartiles as columns
percentiles_pivot = percentiles.pivot(index='experience_level', columns='level_1', values='salary_in_usd').reset_index()

# Reverse the order of quartiles for better visualization
percentiles_pivot = percentiles_pivot[['experience_level', 0.75, 0.5, 0.25]]

# Melt the DataFrame to have quartiles as a single column
percentiles_melted = percentiles_pivot.melt(id_vars='experience_level', var_name='percentile', value_name='salary_in_usd')

# Define axis labels based on experience levels
axis_labels = ("datum.label == 'EN' ? 'Entry Level': datum.label == 'MI' ? 'Mid-Level': datum.label == 'SE' ? 'Senior': 'Executive'")

# Create a selection point for interactive filtering
select = alt.selection_point(fields=['experience_level'], empty=False)

# Create the Benchmark chart showing quartiles by experience level
benchmark = alt.Chart(percentiles_melted).mark_bar().encode(
    x=alt.X('experience_level:N', title='Experience Level', axis=alt.Axis(labelExpr=axis_labels)),
    y=alt.Y('salary_in_usd:Q', title='Salary (USD)'),
    color=alt.Color('percentile:N', title='Percentile', scale=alt.Scale(scheme='category10'), legend=alt.Legend(orient='right')),
    opacity=alt.condition(select, alt.value(.5), alt.value(1)),
    order=alt.Order('percentile', sort='ascending'), 
    tooltip=['experience_level', 'salary_in_usd']
).properties(
    title='Salary Benchmarking by Experience Level',
    width=600,
    height=400
).add_params(
    select
)

# Create the average salary per job title chart
job_titles_chart = alt.Chart(data_us).mark_bar().encode(
    x=alt.X('job_title:N', title='Job Title', axis=alt.Axis(labelAngle=-55)),
    y=alt.Y('mean(salary_in_usd):Q', title='Average Salary (USD)'),
    tooltip=['job_title', 'mean(salary_in_usd)']
).transform_filter(
    select
).properties(
    title='Job Titles by Average Salary',
    width=600,
    height=400
)

# Combine both charts side by side (hconcat lays them out horizontally)
chart = alt.hconcat(benchmark, job_titles_chart, padding={'bottom': 100,'right':50})

# Display the chart
chart


### Remote Work Ratio Influence on Salary Distribution

In [3]:
# Define labels for remote work ratios
remote_ratio_labels = {100: 'Remote', 50: 'Hybrid', 0: 'In-person'}
data_us.loc[:, 'remote_ratio_label'] = data_us['remote_ratio'].map(remote_ratio_labels)

# Define colors for different remote ratios
color_scale = alt.Scale(domain=['Remote', 'Hybrid', 'In-person'], range=['green', 'blue', 'red'])

# Create the boxplot
boxplot = alt.Chart(data_us).mark_boxplot().encode(
    x=alt.X('remote_ratio_label:N', title='Remote Ratio'),
    y=alt.Y('salary_in_usd:Q', title='Salary (USD)', scale=alt.Scale(zero=False)),
    color=alt.Color('remote_ratio_label:N', title='Remote Ratio', scale=color_scale)
).properties(
    width=600,
    height=400,
    title='Remote Work vs. Salary Distribution'
)

boxplot


### Analyzing Salary and Remote Work Preferences of Each Job Title by Experience Level


In [4]:
# Make a copy of the dataset to avoid modifying the original data
data_us = data_us.copy()

# Map remote ratio values to labels
remote_ratio_labels = {100: 'Remote', 50: 'Hybrid', 0: 'In-person'}
data_us['remote_ratio_label'] = data_us['remote_ratio'].map(remote_ratio_labels)

# Map experience level codes to labels
experience_level_labels = {'EN':'Entry Level', 'EX': 'Executive','MI':'Mid Level','SE':'Senior'}
data_us['experience_level_label'] = data_us['experience_level'].map(experience_level_labels)

# Define colors for different remote ratios
color_scale = alt.Scale(domain=['Remote', 'Hybrid', 'In-person'], range=['green', 'blue', 'red'])
shape_scale = alt.Scale(domain=['Remote', 'Hybrid', 'In-person'], range=['circle', 'square', 'triangle'])

# Create the scatter plot facet
scatter_plot = alt.Chart(data_us).mark_point(filled=True, size=100).encode(
    x=alt.X('job_title:N', title='Job Title', axis=alt.Axis(labelAngle=-55)),
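    # Use salary_in_usd so the values match the "Salary (USD)" axis title;
    # the raw `salary` column is in each row's own currency
    y=alt.Y('salary_in_usd:Q', title='Salary (USD)'),
    color=alt.Color('remote_ratio_label:N', title='Remote Ratio', scale=color_scale),
    shape=alt.Shape('remote_ratio_label:N', title='Remote Ratio', scale=shape_scale),
    tooltip=['job_title', 'salary_in_usd', 'remote_ratio_label']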
).properties(
    width=700,
    height=400
)

# Facet by experience level, one column per level
scatter_plot_facet = scatter_plot.facet(
    column=alt.Column('experience_level_label:N', title='Experience Level'),
    title='Salary and Remote Ratio by Experience Level of Each Data Science Job'
).resolve_axis(
    y='independent'
).configure_legend(
    orient='left'  # Position the legend on the left
)

# Display the scatter plot facet
scatter_plot_facet


### Company Size Impact on Salary Distribution

In [5]:
company_size_labels = {'L': 'Large', 'S': 'Small', 'M': 'Medium'}
data_us.loc[:, 'company_size_label'] = data_us['company_size'].map(company_size_labels)

box_plot = alt.Chart(data_us).mark_boxplot().encode(
    x=alt.X('company_size_label:N', title='Company Size'),
    y=alt.Y('salary_in_usd:Q', title='Salary (USD)'),
    color=alt.Color('company_size_label:N', title='Company Size')
).properties(
    title='Company Size vs. Salary Distribution',
    width=600,
    height=400
)

box_plot


### Summary of Key Design Elements and Justification

- Task 1: This task involves analyzing salary distributions and calculating percentiles for different experience levels in the data science field. By establishing clear benchmarks, it provides a foundation for understanding salary norms and setting expectations for compensation.
- Task 2: This task focuses on investigating the impact of remote work arrangements on salary distribution within the data science sector. Utilizing a box plot allows for a clear visualization of salary ranges and medians for different remote work settings, providing insights into whether remote work status correlates with salary levels. 
- Task 3: This task explores the relationship between salary levels and remote work preferences across different experience levels and job titles in the data science field. By creating a scatter plot, it visually illustrates how salary and remote work preferences vary among various roles, highlighting potential trends and disparities. 
- Task 4: This task involves investigating how the size of a company impacts salary distribution within the data science field. Using a box plot to represent salary ranges across different company sizes enables the identification of median salaries, quartiles, and potential outliers. 


### Final Evaluation

For the final evaluation approach, I employed a combination of informal interviews and observation. I gathered a small group of friends and family members to serve as participants for the evaluation.

The procedure was straightforward: I presented each graph individually and asked participants to interact with them. They were instructed to provide feedback on their understanding of the graphs, the clarity of information presented, and any suggestions for improvement. To facilitate this process, I employed the journaling method, asking participants to explain their observations, thoughts, and questions while engaging with the graphs.

The results of the evaluation were varied. Participants with limited statistical background expressed that they grasped the basic information conveyed in the graphs but found the statistical elements to be unclear. Concepts such as percentiles, quartiles, and statistical significance were challenging for them to interpret.

In contrast, a participant with a technical background was able to comprehend all the graphs and provide detailed explanations. Their feedback included suggestions for using different types of graphs, adding clarification to certain data points, and ensuring proper titles and labels for enhanced clarity.

Overall, the evaluation yielded valuable insights into the effectiveness of the graphs in communicating information to a diverse audience. It underscored the importance of considering the varying levels of statistical literacy among users and highlighted the need for clear and concise data visualization techniques.

### Synthesis of Findings

#### Task 1: Salary Benchmarking by Experience Level
  - As expected, executive-level positions tend to have higher salaries compared to other experience levels, indicating a positive correlation between experience level and salary. 
  - The variety of job titles at the senior level is higher, suggesting a broader range of roles and responsibilities within this experience category. 
  - The analysis successfully establishes clear benchmarks for salaries at different experience levels, providing valuable insights for both job seekers and employers in understanding salary norms and setting appropriate compensation levels.
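
The observation about the variety of job titles per experience level is read off the linked bar chart; it could also be checked numerically. Below is a minimal sketch, assuming a frame shaped like `data_us`. The `toy` rows and the helper name `summarize_by_experience` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for data_us; the values are invented for illustration only.
toy = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "SE", "SE", "EX"],
    "job_title": ["Data Analyst", "Data Scientist", "Data Scientist",
                  "ML Engineer", "Data Engineer", "Head of Data"],
    "salary_in_usd": [70000, 110000, 150000, 165000, 155000, 210000],
})


def summarize_by_experience(df: pd.DataFrame) -> pd.DataFrame:
    """Count distinct job titles and take the median salary per experience level."""
    return (
        df.groupby("experience_level")
        .agg(
            distinct_titles=("job_title", "nunique"),
            median_salary=("salary_in_usd", "median"),
        )
        .sort_values("median_salary")
    )


summary = summarize_by_experience(toy)
print(summary)
```

Calling `summarize_by_experience(data_us)` on the real frame would give one row per experience code, which makes the "more titles at the senior level" claim directly comparable across levels.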

#### Task 2: Remote Work Ratio Influence on Salary Distribution
  - Based on the box plots, the median salary for in-person positions appears to be higher compared to remote and hybrid arrangements, indicating a potential salary premium for roles that require physical presence in the workplace. 
  - The third quartile (Q3) of hybrid positions is higher than that of remote and in-person positions, suggesting that while the median salary of hybrid positions may not be the highest, there is a higher potential for top earners in hybrid roles. 
  - The minimum salary for remote positions is the lowest among the three categories, highlighting potential disparities in compensation based on remote work status. 
  - This analysis provides valuable insights into the relationship between remote work arrangements and salary distribution, informing organizations and employees about the potential impacts of remote work policies on compensation structures.
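
The medians, third quartiles, and minimums discussed above are read from the box plot. A small table of the same statistics makes them easier to compare exactly. This is a sketch under the assumption that the frame has `remote_ratio_label` and `salary_in_usd` columns as in the notebook; the `toy` data and the helper `salary_five_number` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for data_us; the salaries are invented for illustration only.
toy = pd.DataFrame({
    "remote_ratio": [0, 0, 0, 0, 50, 50, 50, 100, 100, 100, 100],
    "salary_in_usd": [100000, 120000, 140000, 160000,
                      90000, 130000, 200000,
                      60000, 110000, 130000, 150000],
})
labels = {100: "Remote", 50: "Hybrid", 0: "In-person"}
toy["remote_ratio_label"] = toy["remote_ratio"].map(labels)


def salary_five_number(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return min, quartiles, and max of salary_in_usd for each group in `by`."""
    described = df.groupby(by)["salary_in_usd"].describe()
    return described[["min", "25%", "50%", "75%", "max"]]


remote_summary = salary_five_number(toy, "remote_ratio_label")
print(remote_summary)
```

One reason to prefer the table for the "minimum salary" comparison: Altair's `mark_boxplot` draws whiskers at 1.5 times the interquartile range by default, so the lowest whisker on the chart is not necessarily the true minimum, whereas the `min` column here is.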

#### Task 3: Analyzing Salary and Remote Work Preferences of Each Job Title by Experience Level
  - The analysis reveals that in-person positions appear to be more prevalent across different experience levels compared to remote and hybrid roles. 
  - In-person positions also tend to have higher salaries, indicating a potential preference or incentive for roles that require physical presence in the workplace. 
  - This deeper dive into the data highlights how salary and remote work preferences vary among different job titles and experience levels, providing valuable insights for job seekers and employers in understanding the intersection of salary, job role, experience level, and remote work preferences.

#### Task 4: Company Size Impact on Salary Distribution
  - As expected, large companies tend to offer higher salaries compared to medium and small companies, reflecting the influence of company size on employee remuneration. 
  - Medium-sized companies exhibit more outliers in salary distribution, suggesting potential variability in compensation practices within this category. 
  - This analysis sheds light on how company size affects salary distribution within the data science field, informing organizations and employees about the relationship between organizational scale and compensation levels.
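
The remark about outliers in medium-sized companies comes from counting points beyond the whiskers by eye. A rough way to count them instead is the usual 1.5 × IQR rule applied within each group. The sketch below assumes `company_size` and `salary_in_usd` columns as in the notebook; the `toy` data and the helper `count_iqr_outliers` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for data_us; the salaries are invented for illustration only.
toy = pd.DataFrame({
    "company_size": ["S"] * 5 + ["M"] * 7 + ["L"] * 5,
    "salary_in_usd": [
        80000, 85000, 90000, 95000, 100000,                      # S
        100000, 110000, 120000, 130000, 140000, 150000, 400000,  # M
        120000, 140000, 160000, 180000, 200000,                  # L
    ],
})


def count_iqr_outliers(df: pd.DataFrame, by: str,
                       value: str = "salary_in_usd", k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], computed within each group."""
    grouped = df.groupby(by)[value]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    is_outlier = (df[value] < q1 - k * iqr) | (df[value] > q3 + k * iqr)
    return is_outlier.groupby(df[by]).sum()


outlier_counts = count_iqr_outliers(toy, "company_size")
print(outlier_counts)
```

Because group sizes differ, it may be more informative to divide these counts by the number of rows per group before comparing company sizes.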

### Reflection on Approach

#### Successes
The approach effectively segmented the analysis into distinct tasks, each targeting specific aspects of data science salaries, such as experience level, remote work arrangements, job titles, and company size. Visualizations such as box plots and scatter plots were chosen to convey insights clearly, although the evaluation showed that participants with limited statistical background still found elements such as quartiles hard to interpret. Consideration of factors such as remote work preferences and company size added depth to the analysis, providing a broader view of salary determinants within the data science field.

#### Areas for Improvement
Incorporating statistical tests or predictive modeling techniques could provide deeper insights into the relationships between variables, enhancing the robustness of the analysis and the reliability of conclusions drawn. Ensuring data quality and robustness through thorough cleaning and validation processes is critical for accurate conclusions and actionable insights. Paying close attention to data integrity and addressing any inconsistencies or outliers is essential for the credibility of the analysis.
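
As one illustration of the kind of statistical test mentioned above, a permutation test can check whether a difference in median salary between two groups is larger than chance relabelling would produce. This is a sketch using only the standard library, not something the notebook ran; the function name `permutation_test_median` and the two salary lists are made up for illustration.

```python
import random
from statistics import median


def permutation_test_median(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Returns (observed difference, approximate p-value).
    """
    rng = random.Random(seed)
    observed = median(a) - median(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values
        rng.shuffle(pooled)
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Invented salaries for two groups, e.g. large vs. small companies
large = [150000, 160000, 170000, 180000, 190000, 200000]
small = [70000, 80000, 90000, 100000, 110000, 120000]

observed_diff, p_value = permutation_test_median(large, small)
print(f"median difference: {observed_diff:.0f}, p-value: {p_value:.4f}")
```

On the real data, the two lists would be `salary_in_usd` values for the groups being compared. A small p-value suggests the difference is unlikely under random group assignment, but it says nothing about why the groups differ.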