### Dataset

1. **work_year**: The year the salary was paid.
2. **experience_level**: The experience level in the job during the year, with the following possible values:

   * EN = Entry-level/Junior
   * MI = Mid-level/Intermediate
   * SE = Senior-level/Expert
   * EX = Executive-level/Director
3. **employment_type**: The type of employment for the role:

   * PT = Part-time
   * FT = Full-time
   * CT = Contract
   * FL = Freelance
4. **job_title**: The role worked in during the year.
5. **salary**: The total gross salary amount paid.
6. **salary_currency**: The currency of the salary paid, as an ISO 4217 currency code.
7. **salary_in_usd**: The salary in USD.
8. **employee_residence**: The employee's primary country of residence during the work year, as an ISO 3166 country code.
9. **remote_ratio**: The overall amount of work done remotely, with the following possible values:

   * 0 = No remote work
   * 50 = Partially remote
   * 100 = Fully remote
10. **company_location**: The country of the employer's main office or contracting branch, as an ISO 3166 country code.
11. **company_size**: The average number of people that worked for the company during the year:

    * S = less than 50 employees (small)
    * M = 50 to 250 employees (medium)
    * L = more than 250 employees (large)
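
The code tables above can also be used as a quick sanity check on the raw file. The following is a minimal sketch, not part of the original notebook: `ALLOWED_CODES`, `unexpected_codes`, and the three-row `sample` frame are hypothetical names and made-up data introduced only for illustration. It reports any value in a coded column that the documentation does not list.

```python
import pandas as pd

# Code tables copied from the column descriptions above.
ALLOWED_CODES = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "employment_type": {"PT", "FT", "CT", "FL"},
    "remote_ratio": {0, 50, 100},
    "company_size": {"S", "M", "L"},
}


def unexpected_codes(frame):
    """Return {column: set of values} for values missing from the code tables."""
    problems = {}
    for column, allowed in ALLOWED_CODES.items():
        # tolist() converts NumPy scalars to plain Python values before comparing.
        found = set(frame[column].dropna().tolist())
        extra = found - allowed
        if extra:
            problems[column] = extra
    return problems


# Hypothetical sample rows (not taken from the real file);
# "XL" is a deliberately invalid company size.
sample = pd.DataFrame({
    "experience_level": ["EN", "SE", "MI"],
    "employment_type": ["FT", "FT", "CT"],
    "remote_ratio": [0, 50, 100],
    "company_size": ["S", "XL", "L"],
})

problems = unexpected_codes(sample)
print(problems)
```

On the real data, a check like this would need to run on the raw codes, before they are replaced with readable labels in the cleaning step.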





### Objectives

The purpose of this EDA is to analyze data scientist salaries in America by experience and employment type.
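
As a rough illustration of that objective, the sketch below filters to data scientists at US-based employers and averages `salary_in_usd` by experience level and employment type. It rests on assumptions that the notebook does not state: "in America" is read as `company_location == 'US'`, "data scientist" as the exact job title `Data Scientist`, and the `sample` rows are hypothetical values rather than rows from the real file.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's columns; the values are made up.
sample = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Scientist",
                  "Data Engineer", "Data Scientist"],
    "experience_level": ["EN", "SE", "SE", "SE", "MI"],
    "employment_type": ["FT", "FT", "FT", "FT", "CT"],
    "company_location": ["US", "US", "US", "US", "DE"],
    "salary_in_usd": [80000, 150000, 170000, 140000, 90000],
})

# Assumption: "in America" means the employer is located in the US.
is_us_data_scientist = (
    (sample["job_title"] == "Data Scientist")
    & (sample["company_location"] == "US")
)

# Mean salary per (experience level, employment type) pair.
summary = (
    sample[is_us_data_scientist]
    .groupby(["experience_level", "employment_type"])["salary_in_usd"]
    .mean()
    .reset_index(name="mean_salary_in_usd")
)
print(summary)
```

Filtering on `employee_residence` instead of `company_location` would be an equally defensible reading; the two can differ for remote roles.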

### Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import datetime as dt
import plotly.graph_objects as go


### Data Loading

In [None]:
data = pd.read_csv('./sample_data/ds_salaries.csv')

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB


### Data Cleaning

In [None]:
# Dropping the redundant index column 'Unnamed: 0'
data = data.drop(columns='Unnamed: 0')

In [None]:
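# Replacing coded values with readable labels.
# Assigning the result back avoids calling replace(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not update the DataFrame.
data['experience_level'] = data['experience_level'].replace({'EN':'Entry-Level','MI':'Mid-Level','EX':'Executive Level','SE':'Senior'})
data['employment_type'] = data['employment_type'].replace({'PT':'Part-Time','FT':'Full-Time','CT':'Contract','FL':'Freelance'})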

In [None]:
# Checking for null values
data.isnull().sum()


### Exploratory Data Analysis

In [None]:
data.head()

|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Mid-Level | Full-Time | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 2020 | Senior | Full-Time | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2020 | Senior | Full-Time | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 2020 | Mid-Level | Full-Time | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 2020 | Senior | Full-Time | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |


In [None]:
data.describe()

|   | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 607.0 | 607.0 | 607.0 | 607.0 |
| mean | 2021.405272 | 324000.1 | 112297.869852 | 70.92257 |
| std | 0.692133 | 1544357.0 | 70957.259411 | 40.70913 |
| min | 2020.0 | 4000.0 | 2859.0 | 0.0 |
| 25% | 2021.0 | 70000.0 | 62726.0 | 50.0 |
| 50% | 2022.0 | 115000.0 | 101570.0 | 100.0 |
| 75% | 2022.0 | 165000.0 | 150000.0 | 100.0 |
| max | 2022.0 | 30400000.0 | 600000.0 | 100.0 |


In [None]:
# Mean salary in USD grouped by job title
z = data.groupby('job_title', as_index=False)[['salary_in_usd']].mean().rename({'salary_in_usd' : 'mean_salary_in_usd'}, axis=1).sort_values(by='mean_salary_in_usd',ascending=False)
print(z)

                                   job_title  mean_salary_in_usd
14                       Data Analytics Lead       405000.000000
45                   Principal Data Engineer       328333.333333
28                    Financial Data Analyst       275000.000000
46                  Principal Data Scientist       215242.428571
25                  Director of Data Science       195074.000000
16                            Data Architect       177873.909091
3                     Applied Data Scientist       175655.000000
2                         Analytics Engineer       175000.000000
23                           Data Specialist       165000.000000
29                              Head of Data       160162.600000
41                Machine Learning Scientist       158412.500000
21                      Data Science Manager       158328.500000
24              Director of Data Engineering       156738.000000
30                      Head of Data Science       146718.750000
4         Applied Machine

In [None]:
z['mean_salary_in_usd']=round(z['mean_salary_in_usd'],2)
fig=px.bar(z.head(10),x='job_title',y='mean_salary_in_usd',color='job_title',
           labels={'job_title':'job title','mean_salary_in_usd':'mean salary in usd'},text='mean_salary_in_usd',template='seaborn',title='<b> Top 10 Roles in Data Science based on Average Pay')
fig.show()

In [None]:
# Mean salary in USD grouped by job title and experience level
z = data.groupby(['job_title', 'experience_level'], as_index=False)[['salary_in_usd']].mean().rename({'salary_in_usd' : 'mean_salary_in_usd'}, axis=1).sort_values(by='mean_salary_in_usd',ascending=False)
print(z)

                         job_title experience_level  mean_salary_in_usd
95         Principal Data Engineer  Executive Level           600000.00
61          Financial Data Analyst        Mid-Level           450000.00
97        Principal Data Scientist  Executive Level           416000.00
33             Data Analytics Lead           Senior           405000.00
8           Applied Data Scientist           Senior           278500.00
..                             ...              ...                 ...
1                     AI Scientist      Entry-Level            21987.25
30         Data Analytics Engineer      Entry-Level            20000.00
76                     ML Engineer      Entry-Level            18974.50
100           Product Data Analyst        Mid-Level            13036.00
0    3D Computer Vision Researcher        Mid-Level             5409.00

[105 rows x 3 columns]


In [None]:
z['mean_salary_in_usd']=round(z['mean_salary_in_usd'],2)
z['job-experience'] = z['job_title'].map(str) + ' - ' + z['experience_level'].map(str)
fig=px.bar(z.head(10),x='job-experience',y='mean_salary_in_usd',color='job_title',
           labels={'job-experience':'job title - experience level','mean_salary_in_usd':'mean salary in usd'},text='mean_salary_in_usd',template='seaborn',title='<b> Top 10 average salary in USD grouped by job title and experience level')
fig.show()

In [None]:
# Max salary in USD grouped by job title, experience level and employment type
z = data.groupby(['job_title', 'experience_level', 'employment_type'], as_index=False)[['salary_in_usd']].max().rename({'salary_in_usd' : 'max_salary_in_usd'}, axis=1).sort_values(by='max_salary_in_usd',ascending=False)
print(z)

                              job_title experience_level employment_type  \
107             Principal Data Engineer  Executive Level       Full-Time   
70               Financial Data Analyst        Mid-Level       Full-Time   
114                  Research Scientist        Mid-Level       Full-Time   
11   Applied Machine Learning Scientist        Mid-Level       Full-Time   
109            Principal Data Scientist  Executive Level        Contract   
..                                  ...              ...             ...   
86                          ML Engineer      Entry-Level       Part-Time   
100          Machine Learning Scientist        Mid-Level       Freelance   
2                          AI Scientist      Entry-Level       Part-Time   
31                         Data Analyst      Entry-Level       Part-Time   
0         3D Computer Vision Researcher        Mid-Level       Part-Time   

     max_salary_in_usd  
107             600000  
70              450000  
114         

In [None]:
z['max_salary_in_usd']=round(z['max_salary_in_usd'],2)
z['job-experience-employment'] = z['job_title'].map(str) + ' - ' + z['experience_level'].map(str) + ' - ' + z['employment_type'].map(str)
fig=px.bar(z.head(10),x='job-experience-employment',y='max_salary_in_usd',color='job_title',
           labels={'job-experience-employment':'job title - experience level - employment type','max_salary_in_usd':'max salary in usd'},text='max_salary_in_usd',template='seaborn',title='<b> Top 10 salaries in USD grouped by job title, experience level and employment type')
fig.show()

In [None]:
# Count machine learning scientist jobs grouped by experience level and employment type
z = data.groupby(['job_title', 'experience_level', 'employment_type'], as_index=False)[['salary']].count().rename({'salary' : 'count'}, axis=1).sort_values(by='count',ascending=False)
z = z.loc[z['job_title'] == "Machine Learning Scientist"]
z['experience-employment'] = z['experience_level'].map(str) + ': ' + z['employment_type'].map(str)
print(z)

                      job_title experience_level employment_type  count  \
101  Machine Learning Scientist        Mid-Level       Full-Time      3   
102  Machine Learning Scientist           Senior       Full-Time      3   
100  Machine Learning Scientist        Mid-Level       Freelance      1   
99   Machine Learning Scientist      Entry-Level       Full-Time      1   

      experience-employment  
101    Mid-Level: Full-Time  
102       Senior: Full-Time  
100    Mid-Level: Freelance  
99   Entry-Level: Full-Time  


In [None]:
fig=px.pie(z ,names='experience-employment',values='count',color='experience-employment',hole=0.7,labels={'experience-employment':'Experience level','count':'count'}
,template='seaborn',title='<b> Total Machine Learning Scientist Jobs Based on Experience Level and Employment Type')
fig.update_layout(title_x=0.5)

In [None]:
px.histogram(data,x='salary_in_usd',marginal='rug',template='seaborn',labels={'salary_in_usd':'Salary in USD'},title='<b> Salary Distribution')

In [None]:
px.box(data.loc[data['job_title'] == "Machine Learning Scientist"],x='experience_level',y='salary_in_usd',color='experience_level',template='ggplot2'
,labels={'experience_level':'Experience Level','salary_in_usd':'salary in usd'},title='<b>Machine Learning Scientist Salaries by Experience')
