# Exploratory Data Analysis on Data Analytics Salaries

⌨ Write your name here.

This analysis explores salaries of data analytics professionals around the world to find patterns in the data. Specifically, the goal is to determine which factors influence pay rates around the world and learn more about what a career path might look like for somebody starting out in Data Analytics.

## About the data
This data set comes from Kaggle user [randomarnab](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023) and contains information about various roles in data analytics from around the world. The data was gathered in 2023 and contains details about each role's experience level, job title, salary, remote ratio, company location, and company size.

In [None]:
import pandas as pd
df = pd.read_csv('data_analytics_salaries.csv')

## Analysis
The analysis below explores salaries of data analytics professionals. Specifically, it addresses the following questions:

- How does experience level affect salary?
- How does experience level affect remote ratio?
- Which job titles are the most common in the United States and how does the job title affect salary?
- How have salaries changed between 2020 and 2022 for Data Analysts?
- Where are most data analytics positions located (according to this data set)? Which countries pay the most?
- What percent of employees are based in another country but are paid in USD?

One notable aspect of this data set is the presence of both `salary` and `salary_in_usd` columns. The former details the salary for the position in the local currency where the company is based, whereas the latter column standardizes all of the salaries into USD. Thus, this analysis will exclusively use the `salary_in_usd` column for comparisons.

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### How does experience level affect salary?
👉 Use the example text below to model your own responses further below.


At first glance, experience level seems to be the obvious candidate for the most influential variable in determining salary for data analytics professionals. This analysis assumes that the experience levels are, in order from least to most experience:

1. EN - Entry level
2. MI - Mid-level
3. SE - Senior level
4. EX - Executive level
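The assumed ordering above can be made explicit in pandas so that grouped results come out in experience order rather than alphabetical order. This is only a minimal sketch: `toy` and `level_order` are hypothetical names, and the salary values are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data set (values are made up)
toy = pd.DataFrame({
    "experience_level": ["SE", "EN", "EX", "MI", "EN", "SE"],
    "salary_in_usd": [150000, 60000, 200000, 90000, 70000, 130000],
})

# Encode the assumed ordering EN < MI < SE < EX as an ordered categorical
level_order = ["EN", "MI", "SE", "EX"]
toy["experience_level"] = pd.Categorical(
    toy["experience_level"], categories=level_order, ordered=True
)

# Grouping on a categorical column returns the groups in category order
mean_by_level = (
    toy.groupby("experience_level", observed=True)["salary_in_usd"]
    .mean()
    .round(2)
)
print(mean_by_level)
```

With the real data, the same conversion could be applied to `df['experience_level']` before grouping, assuming the column only contains these four codes.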

According to the output of the code below, average salary tends to increase as experience level increases, as hypothesized. However, these figures may be skewed because part-time salaries are included in the data set. Because part-time workers are more likely to be entry level or mid-level, the lower salaries of these positions (which result from working fewer hours) should be removed for this part of the analysis.

In [None]:
# Code provided as an example
round(df[['experience_level', 'salary_in_usd']].groupby('experience_level').mean().sort_values(by='salary_in_usd'),2)

The code below creates a subset of the data that contains only full-time positions. Recalculating the average salary for each experience level on this subset brought the averages only slightly closer together. The change in average salary was barely noticeable.

In [None]:
# Get a subset that only includes full-time (FT) positions
df_ft = df[df['employment_type'] == 'FT']

# Average salary per experience level among full-time positions, lowest first
round(df_ft[['experience_level', 'salary_in_usd']].groupby('experience_level').mean().sort_values(by='salary_in_usd'), 2)

From this analysis, I can conclude that more experience is associated with a higher average salary. Average salaries differ greatly across experience levels, which suggests that experience is very influential in determining a person's salary.
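An average alone can hide how many rows sit behind it and whether a few very large salaries pull it upward. As a rough sketch of one way to check this, the code below reports the count, mean, and median per experience level for full-time rows. The `toy` data frame and its values are invented for illustration, and the names `toy_ft` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (made-up values, not taken from the real data set)
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "EN", "MI", "MI", "SE", "SE", "SE"],
    "employment_type":  ["FT", "PT", "FT", "FT", "PT", "FT", "FT", "FT"],
    "salary_in_usd":    [60000, 20000, 70000, 90000, 30000, 140000, 150000, 400000],
})

# Keep full-time rows only, mirroring the FT filter used in this section
toy_ft = toy[toy["employment_type"] == "FT"]

# Count, mean, and median per experience level, sorted by median salary
summary = (
    toy_ft.groupby("experience_level")["salary_in_usd"]
    .agg(["count", "mean", "median"])
    .sort_values(by="median")
    .round(2)
)
print(summary)
```

In this toy example the single 400000 salary raises the SE mean well above the SE median, which is the kind of skew a median column makes visible.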

### How does experience level affect remote ratio?
When determining how much employees are allowed to work remotely, I immediately think that senior employees are given more liberty to work from home than employees with less experience.

However, according to the results of the code below...

👉 Talk about your results here. Type code below.

My results show that mid-level employees have a lower average remote ratio than entry-level employees.

In [None]:
# Average remote ratio per experience level, lowest first
round(df[['experience_level', 'remote_ratio']].groupby('experience_level').mean().sort_values(by='remote_ratio'), 2)

The year 2022 saw an increase in the remote ratio for employees in several different experience levels. Which one saw the biggest increase in average remote ratio?

Executive Level

In [None]:
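# Restrict to 2022, then average the remote ratio per experience level
# (compare with the all-years averages above to see which levels changed)
df2 = df[df['work_year'] == 2022]
round(df2[['experience_level', 'remote_ratio']].groupby('experience_level').mean().sort_values(by='remote_ratio'), 2)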

Which experience level saw a reduction in its ability to work remotely in 2022 (i.e., a smaller average remote ratio)?

Mid Level

In [None]:
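# Restrict to 2022, then average the remote ratio per experience level
# (compare with the all-years averages above to see which levels changed)
df2 = df[df['work_year'] == 2022]
round(df2[['experience_level', 'remote_ratio']].groupby('experience_level').mean().sort_values(by='remote_ratio'), 2)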
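Answering "which level increased or decreased the most" requires putting two years side by side. One possible approach, sketched here under the assumption that 2021 is the comparison year, is to pivot the average remote ratio by experience level and year and then subtract the columns. The `toy` values are invented and say nothing about the real data; `by_year`, `biggest_increase`, and `biggest_decrease` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows (made-up values, two rows per level and year)
toy = pd.DataFrame({
    "work_year": [2021] * 8 + [2022] * 8,
    "experience_level": ["EN", "EN", "MI", "MI", "SE", "SE", "EX", "EX"] * 2,
    "remote_ratio": [0, 50, 100, 50, 50, 50, 50, 100,
                     50, 50, 50, 50, 100, 100, 100, 50],
})

# One row per experience level, one column per year, cells hold the mean ratio
by_year = toy.pivot_table(
    index="experience_level",
    columns="work_year",
    values="remote_ratio",
    aggfunc="mean",
)

# Positive values mean more remote work in 2022 than in 2021
by_year["change"] = by_year[2022] - by_year[2021]
print(by_year.sort_values(by="change"))

biggest_increase = by_year["change"].idxmax()
biggest_decrease = by_year["change"].idxmin()
print(biggest_increase, biggest_decrease)
```

Running the same pivot on `df` would show the actual change for each experience level in a single table.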

### Which job titles are the most common in the United States and how do they affect salary?

---


👉 Follow the pattern above to write about your findings and talk about your results here. Type code below.

When thinking about which job titles are the most common in the United States, my first assumption is that the more common jobs pay more than the less common ones. However, according to the results of the code below, one of the least common job titles (Data Analytics Lead) has the highest average salary, while the most common titles fall in the middle to lower end of the pay range.


In [None]:
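# Keep only positions at companies located in the United States
df_us = df[df['company_location'] == 'US']

# Number of rows per job title, most common first
# (the salary_in_usd column holds the count here, not a salary)
df_us[['job_title', 'salary_in_usd']].groupby('job_title').count().sort_values(by='salary_in_usd', ascending=False)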

Some of the job titles below are listed only once in the United States. Which ones are they? (Select all that apply)

*  Big Data Engineer
*  Data Analytics Engineer
*  Cloud Data Engineer


In [None]:
round(df_us[['job_title', 'salary_in_usd']].groupby('job_title').mean().sort_values(by='salary_in_usd', ascending=False),2)


In [None]:
# Standard deviation of salary per job title, highest first
round(df_us[['job_title', 'salary_in_usd']].groupby('job_title').std().sort_values(by='salary_in_usd', ascending=False),2)
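Job titles that appear only once have no standard deviation, so pandas reports `NaN` for them. A small sketch of how the count, mean, and standard deviation could be viewed together is shown below. The `toy_us` rows are made up for illustration, and `title_stats` and `single_titles` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows for US positions (made-up values)
toy_us = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Data Engineer",
                  "Data Engineer", "Data Analytics Lead"],
    "salary_in_usd": [90000, 110000, 120000, 160000, 300000],
})

# Count, mean, and sample standard deviation per job title
title_stats = (
    toy_us.groupby("job_title")["salary_in_usd"]
    .agg(["count", "mean", "std"])
    .sort_values(by="count", ascending=False)
    .round(2)
)
print(title_stats)

# Titles listed exactly once; their std is NaN because one value has no spread
single_titles = title_stats.index[title_stats["count"] == 1].tolist()
print(single_titles)
```

Seeing the count next to the mean also helps when a title with a very high average salary turns out to rest on a single row.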

In [None]:
# Mean salary per job title, rounded to one decimal place
round(df_us[['job_title', 'salary_in_usd']].groupby('job_title').mean().sort_values(by='salary_in_usd', ascending=False),1)


### How have salaries changed between 2020 and 2022 for Data Analysts?

👉 Talk about your results here. Type code below.

When considering how salaries changed between 2020 and 2022, my immediate thought is that salaries went up, but only slightly. However, according to the results of the code below, average Data Analyst salaries increased significantly between 2020 and 2022.

In [None]:
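# Keep only rows with the exact job title "Data Analyst"
df_ana = df[df['job_title'] == 'Data Analyst']

# Average Data Analyst salary per year, lowest first
round(df_ana[['work_year', 'salary_in_usd']].groupby('work_year').mean().sort_values(by='salary_in_usd'), 2)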
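To put a number on how large the change is, the yearly averages can be converted into percentage changes. The sketch below is one way to do that with `pct_change`. The `toy` salaries are invented for illustration, and `analysts`, `yearly`, `yoy_pct`, and `total_change` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows (made-up values, not taken from the real data set)
toy = pd.DataFrame({
    "job_title": ["Data Analyst"] * 6 + ["Data Engineer"],
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022, 2022],
    "salary_in_usd": [40000, 60000, 70000, 80000, 90000, 110000, 150000],
})

# Same filter as in the cell above: exact title match only
analysts = toy[toy["job_title"] == "Data Analyst"]

# Average salary per year, in chronological order
yearly = analysts.groupby("work_year")["salary_in_usd"].mean().sort_index()

# Percent change from each year to the next (the first year has no predecessor)
yoy_pct = (yearly.pct_change() * 100).round(1)

# Overall percent change from 2020 to 2022
total_change = round((yearly.loc[2022] / yearly.loc[2020] - 1) * 100, 1)

print(yearly)
print(yoy_pct)
print(total_change)
```

Sorting by year rather than by salary keeps the table chronological even if an average ever drops from one year to the next.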

### Where are most data analytics positions located (according to this data set)? Which countries pay the most?

👉 Talk about your results here. Type code below.

In [None]:
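# Keep only rows with the exact job title "Data Analyst"
df_ana = df[df['job_title'] == 'Data Analyst']

# Average Data Analyst salary per company location, highest first
round(df_ana[['company_location', 'salary_in_usd']].groupby('company_location').mean().sort_values(by='salary_in_usd', ascending=False), 2)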

Which country has 30 job openings available?

In [None]:
# Number of rows per company location, most common first
df.value_counts('company_location')


What is the average salary (USD) for people who work in China (CN)?

* 71665.50

In [None]:
# Average salary across all job titles per company location, highest first
round(df[['company_location', 'salary_in_usd']].groupby('company_location').mean().sort_values(by='salary_in_usd', ascending=False), 2)

Which three countries have the most data positions?

In [None]:
# Number of rows per company location, most common first
df.value_counts('company_location')
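The two questions in this section (where the positions are and which countries pay the most) can be read from a single table that holds both the count and the average salary per country. This is only a sketch on invented data: the `toy` rows are made up, `country_stats`, `top3`, and `reliable` are hypothetical names, and the minimum of 2 rows is an arbitrary threshold chosen for the example.

```python
import pandas as pd

# Hypothetical toy rows (made-up values, not taken from the real data set)
toy = pd.DataFrame({
    "company_location": ["US"] * 4 + ["GB"] * 3 + ["CA"] * 2 + ["IN"],
    "salary_in_usd": [100000, 120000, 140000, 160000,
                      80000, 90000, 100000,
                      95000, 105000,
                      250000],
})

# Number of positions and average salary per company location
country_stats = (
    toy.groupby("company_location")["salary_in_usd"]
    .agg(["count", "mean"])
    .round(2)
)

# The three locations with the most positions
top3 = country_stats.nlargest(3, "count")
print(top3)

# Highest average pay, ignoring locations with fewer than 2 rows
# (the threshold is an assumption for this example)
reliable = country_stats[country_stats["count"] >= 2]
print(reliable.sort_values(by="mean", ascending=False))
```

In the toy data the highest average belongs to a country with a single row, which shows why a ranking by mean alone can be misleading.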

### What percent of employees are based in another country but are paid in USD?
This is a tricky one.

To figure this out, I'll need a filter that keeps only the rows where the company is located outside the United States and the salary currency is USD. Then I can divide the number of rows in the filtered dataframe by the number of rows in the original dataframe to get the answer.

---



👉 Talk about your results here. Type code below.

In [None]:
filtered_df = df[(df['salary_currency'] == 'USD') & (df['company_location'] != 'US')]

# Calculate percentage
percentage = (len(filtered_df) / len(df)) * 100

print(f"Percentage of employees working for a company based outside the US but paid in USD: {percentage:.2f}%")
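As a cross-check on the calculation above, the same percentage can be obtained by taking the mean of the boolean filter, since `True` counts as 1 and `False` as 0. The sketch below uses invented rows; `toy`, `mask`, and `percentage_check` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows (made-up values, not taken from the real data set)
toy = pd.DataFrame({
    "salary_currency":  ["USD", "USD", "EUR", "USD", "GBP", "USD", "USD", "EUR"],
    "company_location": ["US",  "CA",  "DE",  "US",  "GB",  "IN",  "US",  "FR"],
})

# True where the salary is paid in USD and the company is outside the US
mask = (toy["salary_currency"] == "USD") & (toy["company_location"] != "US")

# The mean of a boolean Series is the share of True values
percentage_check = mask.mean() * 100

# Same result as dividing the filtered row count by the total row count
assert percentage_check == len(toy[mask]) / len(toy) * 100

print(f"{percentage_check:.2f}%")
```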

## Conclusion
👉 Write a brief conclusion about what you learned from the data here.