# Overview
Data science positions remain in high demand, particularly for professionals with data analysis, machine learning, and data visualization skills. According to data from the US Bureau of Labor Statistics (BLS), the median annual wage for occupations in the "computer and information research scientists" category, which includes data scientists, was $122,840 in 2020. However, salary expectations can vary based on the candidate's specific industry, location, and skills and experience. For example, the BLS reports that the median wage for these occupations in the Washington-Arlington-Alexandria metropolitan area was $134,780 in 2021.

Individual-level data on data science salaries can provide insights beyond summary statistics. This analysis uses data from the ai-jobs.net salary survey, which was found on Kaggle and is available here: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries. The data was analyzed in Python with pandas and NumPy and visualized with Plotly Express.

# Key Insights:
- Across most experience levels and company sizes, very few job titles within the computer and information research scientist industry have a median income of less than $113,000.
- Job titles matter for future earning potential. Data scientists tend to have higher salary expectations than data engineers and analysts. The top quarter of data analysts report salaries over $129,000, while the top quarter of data scientists report salaries over $185,000.
- Most surveyed mid-level data scientists in the United States in 2022 earned between $102,100 and $141,300. Surveyed senior-level data scientists in the United States in 2022 earned between $140,000 and $205,300.
- Surveyed data scientists working remotely in the US report earning between $9,000 and $22,500 more than data scientists working completely in person.
- Of surveyed data scientists in 2022, only five (around 9%) reported earning less than $100,000.

These key insights can be helpful for data scientists during salary negotiation periods, as they provide a data-centric approach to understanding potential salaries. At the very least, these insights can inform job seekers about what they might be worth based on observational survey data.

### Import Requirements

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

### Load and Examine Data

In [34]:
#Load data
df = pd.read_csv('/Users/gdhan/Documents/Data Science/Data/ds_salaries.csv')

#Examine a sample of the data: contents, structure, and column names
df.head()


Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [35]:
#Identify data types of variables within the dataset
df.dtypes

Unnamed: 0             int64
work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object

In [36]:
#Examine salary: unconditional summary statistics across all data positions in the survey
round(df.salary_in_usd.describe())

count       607.0
mean     112298.0
std       70957.0
min        2859.0
25%       62726.0
50%      101570.0
75%      150000.0
max      600000.0
Name: salary_in_usd, dtype: float64

The global average gives only a ballpark figure: it pools a variety of data roles, job titles, countries, remote and in-person positions, and experience levels. The mean may also overestimate what can reasonably be expected, because a few high earners skew it upward. The median is a more robust reference point.
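
To make the skew argument concrete, here is a minimal sketch with a handful of made-up salaries (hypothetical values, not taken from the survey). A single very high earner pulls the mean well above the median, while the median stays at the middle value.

```python
import pandas as pd

# Toy salaries (hypothetical values, NOT from the survey):
# four typical earners and one very high earner.
toy_salaries = pd.Series([60_000, 80_000, 100_000, 120_000, 600_000])

toy_mean = toy_salaries.mean()      # pulled upward by the single outlier
toy_median = toy_salaries.median()  # middle value; ignores how large the outlier is

print(f"mean:   {toy_mean:,.0f}")
print(f"median: {toy_median:,.0f}")
```

The same pattern shows up, less dramatically, in the survey summary above, where the mean (112,298) sits above the 50% value (101,570).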

In [37]:
#group by experience level, summarize by median salary in USD, sort in descending order
df.groupby('experience_level').salary_in_usd.median().sort_values(ascending=False)

experience_level
EX    171437.5
SE    135500.0
MI     76940.0
EN     56500.0
Name: salary_in_usd, dtype: float64

### Job Title, Experience, Company Size and Salary

To further refine salary expectations, we filter to employees residing in the United States in 2022 and group the data by job title, experience level, and company size. We then summarize each group by its median (50th percentile) salary in US dollars. The resulting lookup table gives a clearer picture of what someone with data science expertise can expect to earn.

In [38]:
#filter to US residents in 2022, group by job title, experience level, and company size, summarize by median salary in USD
df[(df['employee_residence']=='US') & (df['work_year']==2022)].groupby(['job_title', 'experience_level', 'company_size'], as_index=False).salary_in_usd.median()

Unnamed: 0,job_title,experience_level,company_size,salary_in_usd
0,AI Scientist,MI,M,120000.0
1,Analytics Engineer,EX,M,155000.0
2,Analytics Engineer,SE,M,195000.0
3,Applied Data Scientist,MI,L,157000.0
4,Applied Data Scientist,SE,L,278500.0
5,Computer Vision Engineer,EN,M,125000.0
6,Data Analyst,EX,M,120000.0
7,Data Analyst,MI,M,121000.0
8,Data Analyst,MI,S,58000.0
9,Data Analyst,SE,M,112900.0


- **Key Insight:** Across most experience levels and company sizes, very few job titles within the computer and information research scientist industry have a median income of less than $113,000.
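
A grouped table like the one above can also be queried directly for a single combination of job title, experience level, and company size. The sketch below assumes a small made-up DataFrame (`toy`, hypothetical values) that only mimics the survey's column names. It keeps the group keys as the index instead of using `as_index=False`, so one combination can be read off with `.loc`.

```python
import pandas as pd

# Hypothetical rows that mimic the survey's columns; the values are made up.
toy = pd.DataFrame({
    "employee_residence": ["US", "US", "US", "US", "DE"],
    "work_year": [2022, 2022, 2022, 2021, 2022],
    "job_title": ["Data Analyst", "Data Analyst", "Data Scientist",
                  "Data Scientist", "Data Analyst"],
    "experience_level": ["MI", "MI", "SE", "SE", "MI"],
    "company_size": ["M", "M", "L", "L", "M"],
    "salary_in_usd": [100_000, 120_000, 150_000, 140_000, 70_000],
})

# Keep only US residents in 2022.
us_2022 = toy[(toy["employee_residence"] == "US") & (toy["work_year"] == 2022)]

# A Series indexed by (job_title, experience_level, company_size)
# serves as the lookup table of median salaries.
lookup = (
    us_2022
    .groupby(["job_title", "experience_level", "company_size"])["salary_in_usd"]
    .median()
)

# Read off one combination: mid-level Data Analyst at a medium-sized company.
analyst_mid_medium = lookup.loc[("Data Analyst", "MI", "M")]
print(analyst_mid_medium)
```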

### Visualize Reported Median Salaries by Job Title

In [40]:
#filter data to US, 2022 ... group by experience and job_title
US2022 = df[(df['employee_residence']=='US') & (df['work_year']==2022)].groupby(['job_title', 'experience_level'], as_index=False).salary_in_usd.median().sort_values(by='salary_in_usd')

#Create a bar chart of median salary by job title, colored by experience level
fig = px.bar(US2022 , x='job_title', y='salary_in_usd', color='experience_level', barmode='group')

#Update layout features
fig.update_layout(
    title="Data Industry Median Salaries by Job Title and Experience Level",
    xaxis_title="Job Title",
    yaxis_title="Salary (USD)",
    legend_title="Experience Level"
)

#Display the plot
fig.show()

- **Key Insight:** Job titles matter for earnings growth. Data Scientists tend to have higher salary expectations than Data Engineers and Data Analysts. The top quarter of Data Analysts report salaries over $129,000, while the top quarter of Data Scientists report salaries over $185,000.

Knowing where a salary falls within the distribution for a specific job title and experience level can be more informative than a single summary statistic such as the median. To drill down, we filter further to the specific job title of "Data Scientist".

In [41]:
#narrow the dataframe to US, 2022, Data Scientist job title
USDS2022 = df[(df['employee_residence']=='US') & (df['work_year']==2022) & (df['job_title']=="Data Scientist")]
USDS2022.describe()

Unnamed: 0.1,Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
count,57.0,57.0,57.0,57.0,57.0
mean,456.54386,2022.0,157210.526316,157210.526316,71.929825
std,103.155235,0.0,42012.348585,42012.348585,45.333628
min,292.0,2022.0,78000.0,78000.0,0.0
25%,357.0,2022.0,130000.0,130000.0,0.0
50%,472.0,2022.0,144000.0,144000.0,100.0
75%,553.0,2022.0,185100.0,185100.0,100.0
max,599.0,2022.0,260000.0,260000.0,100.0


### Visualize Reported Salary Distribution by Experience

In [42]:
#Boxplot visualization of salary by experience level
fig = px.box(USDS2022 , x='experience_level', y='salary_in_usd', color='experience_level')

#Update layout features
fig.update_layout(
    title="2022 Data Scientist Salary Expectations by Experience Level",
    xaxis_title="Experience Level",
    yaxis_title="Salary (USD)",
    legend_title="Experience Level",
)

#display chart
fig.show()

- **Key Insight:** Most surveyed mid-level Data Scientists within the United States during 2022 earned between $102,100 and $141,300. Most surveyed senior-level Data Scientists within the United States during 2022 earned between $140,000 and $205,300.
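
Ranges like these can be read off a box plot as the edges of each box, that is, the 25th and 75th percentiles. This is an assumption about how the figures were obtained, since the source does not say. The sketch below shows one way to compute such edges numerically with `describe()`, using made-up salaries (hypothetical values) in a toy DataFrame. Plotly may use a slightly different quartile convention than pandas, so the numbers need not match a chart exactly.

```python
import pandas as pd

# Hypothetical salaries for two experience levels; the values are made up.
toy = pd.DataFrame({
    "experience_level": ["MI"] * 5 + ["SE"] * 5,
    "salary_in_usd": [90_000, 100_000, 110_000, 120_000, 130_000,
                      130_000, 140_000, 150_000, 160_000, 170_000],
})

# describe() per group returns count, mean, std, min, 25%, 50%, 75%, max.
summary = toy.groupby("experience_level")["salary_in_usd"].describe()

# The 25% and 75% columns correspond to the lower and upper edges of each box.
box_edges = summary[["25%", "75%"]]
print(box_edges)
```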

### Visualize Reported Data Scientist Salary by Remote Ratio

In [43]:
#Boxplot visualization of salary by remote ratio
fig = px.box(USDS2022 , x='remote_ratio', y='salary_in_usd', color='remote_ratio')

#Update layout features
fig.update_layout(
    title="2022 United States Data Scientist Reported Salary by Remote Ratio",
    xaxis_title="Remote Ratio",
    yaxis_title="Salary (USD)",
    legend_title="Remote Ratio",
)

#display chart
fig.show()

- **Key Insight:** Surveyed Data Scientists working remotely in the US report earning between ~$9,000 and ~$22,500 more than Data Scientists working completely in person. 
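
The source does not state how this range was derived. One plausible reading, offered here only as an assumption, is that it compares the quartiles of the fully remote group with those of the fully in-person group. The sketch below illustrates that comparison on made-up salaries (hypothetical values), not on the survey data.

```python
import pandas as pd

# Hypothetical salaries for in-person (0) and fully remote (100) workers.
toy = pd.DataFrame({
    "remote_ratio": [0, 0, 0, 100, 100, 100],
    "salary_in_usd": [120_000, 130_000, 140_000, 130_000, 145_000, 160_000],
})

# Rows: remote_ratio; columns: the 25th, 50th, and 75th percentiles.
quartiles = (
    toy.groupby("remote_ratio")["salary_in_usd"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Difference between fully remote and fully in-person at each quartile.
premium = quartiles.loc[100] - quartiles.loc[0]
print(premium)
```

Reporting the difference at several quartiles, instead of at the median alone, yields a range instead of a single number.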

### Visualize a histogram of 2022 US Data Scientist Salaries

In [44]:
# Create a histogram using Plotly Express
fig = px.histogram(USDS2022, x="salary_in_usd", nbins=20)

#Update layout features
fig.update_layout(
    title="Distribution of Data Scientist Salaries",
    xaxis_title="Salary (USD)",
    yaxis_title="Count",
)

# Display the figure
fig.show()

- **Key Insight:** Of surveyed Data Scientists in the United States in 2022, only five (about 9% of the 57 respondents) reported earning less than $100,000.
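
A count and share like this can be computed with a boolean mask instead of being read off the histogram. The sketch below uses made-up salaries (hypothetical values, not the survey data) and a hypothetical `threshold` variable. The mask's sum gives the count and its mean gives the share.

```python
import pandas as pd

# Hypothetical salaries; the values are made up for illustration.
toy_salaries = pd.Series([80_000, 95_000, 110_000, 130_000, 150_000,
                          150_000, 170_000, 185_000, 200_000, 260_000])

threshold = 100_000
below = toy_salaries < threshold   # boolean mask: True where salary is under the threshold

n_below = int(below.sum())         # number of salaries under the threshold
share_below = below.mean()         # fraction of salaries under the threshold

print(f"{n_below} of {len(toy_salaries)} ({share_below:.0%}) earn under ${threshold:,}")
```

For the survey figures quoted above, the same arithmetic is 5 / 57, which is roughly 8.8%.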