In [28]:
import numpy as np 
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import statistics
import seaborn as sns
import numbers

In [29]:
data = pd.read_csv('/kaggle/input/data-science-job-salaries/ds_salaries.csv')

The dataset, called 'Data Science Job Salaries', contains salary data for jobs in the data science domain. The sections below analyze its rows and columns and present graphs that summarize the relevant information.

# **DS overview**

Let's start with a general overview. Let's see what the dataset is and what information it contains.

In [30]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
5,5,2020,EN,FT,Data Analyst,72000,USD,72000,US,100,US,L
6,6,2020,SE,FT,Lead Data Scientist,190000,USD,190000,US,100,US,S
7,7,2020,MI,FT,Data Scientist,11000000,HUF,35735,HU,50,HU,L
8,8,2020,MI,FT,Business Data Analyst,135000,USD,135000,US,100,US,L
9,9,2020,SE,FT,Lead Data Engineer,125000,USD,125000,NZ,50,NZ,S


In [31]:
print("Rows & Columns")
print(data.shape)

Rows & Columns
(607, 12)


In total, this dataset has 607 rows and 12 columns. Let's figure out what each column represents and how many unique values it holds.

In [32]:
data.describe()

Unnamed: 0.1,Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
count,607.0,607.0,607.0,607.0,607.0
mean,303.0,2021.405272,324000.1,112297.869852,70.92257
std,175.370085,0.692133,1544357.0,70957.259411,40.70913
min,0.0,2020.0,4000.0,2859.0,0.0
25%,151.5,2021.0,70000.0,62726.0,50.0
50%,303.0,2022.0,115000.0,101570.0,100.0
75%,454.5,2022.0,165000.0,150000.0,100.0
max,606.0,2022.0,30400000.0,600000.0,100.0


In [33]:
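# Correlate numeric columns only; string columns make corr() raise an error in recent pandas versions
fig = px.imshow(data.select_dtypes('number').corr(), text_auto=True)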
fig.show()

In [34]:
print('Output the data in the format:')
print('type', '-', 'unique el', '-', 'col name')
for col in data:
    print(data[col].dtypes, '-', len(data[col].unique()), '-', col)

Output the data in the format:
type - unique el - col name
int64 - 607 - Unnamed: 0
int64 - 3 - work_year
object - 4 - experience_level
object - 4 - employment_type
object - 50 - job_title
int64 - 272 - salary
object - 17 - salary_currency
int64 - 369 - salary_in_usd
object - 57 - employee_residence
int64 - 3 - remote_ratio
object - 50 - company_location
object - 3 - company_size
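
The same overview can also be collected into a single table instead of being printed line by line. The snippet below is only a sketch on a small hypothetical frame (`toy` and `summary` are illustrative names, not part of the notebook); it relies on `DataFrame.dtypes` and `DataFrame.nunique()`.

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few columns of the dataset
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021],
    "experience_level": ["MI", "SE", "SE"],
    "salary_in_usd": [79833, 260000, 109024],
})

# One row per column: its dtype and its number of distinct values
summary = pd.DataFrame({"dtype": toy.dtypes, "n_unique": toy.nunique()})
print(summary)
```

Applied to the real `data` frame, the same two calls would give the type and unique-value count of every column at once.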


The output shows the type and the number of unique values for each column; only 'Unnamed: 0' is unique for every row. Next, let's check each column for missing values.

In [35]:
data.isnull().sum()

Unnamed: 0            0
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64
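
Note that `isnull()` counts missing values (NaN/None), not zeros. A legitimate zero, such as a `remote_ratio` of 0, is not reported by it. The sketch below illustrates the difference on a hypothetical frame (`toy` is an assumption made for this example only).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing salary and one genuine zero in remote_ratio
toy = pd.DataFrame({
    "salary_in_usd": [79833, np.nan, 109024],
    "remote_ratio": [0, 50, 100],
})

missing_per_column = toy.isnull().sum()  # counts NaN/None only
zeros_per_column = (toy == 0).sum()      # counts literal zeros

print(missing_per_column)
print(zeros_per_column)
```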

# **DS Cleanup**

Some columns are redundant: 'salary', 'salary_currency' and 'Unnamed: 0'. The column 'Unnamed: 0' merely repeats the row index and carries no useful information, while 'salary' and 'salary_currency' are superseded by 'salary_in_usd', so all three can be removed.

In [36]:
data.drop(columns=['salary', 'salary_currency', 'Unnamed: 0'], inplace=True)
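
One practical caveat: running a drop cell a second time in a notebook raises a `KeyError`, because the columns are already gone. A possible workaround, sketched here on a hypothetical frame (`toy`, `redundant` and `cleaned` are illustrative names), is to return a new frame and pass `errors="ignore"`.

```python
import pandas as pd

# Hypothetical frame with the three redundant columns plus salary_in_usd
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "salary": [70000, 260000],
    "salary_currency": ["EUR", "USD"],
    "salary_in_usd": [79833, 260000],
})

redundant = ["salary", "salary_currency", "Unnamed: 0"]

# Non-mutating drop: toy itself stays unchanged
cleaned = toy.drop(columns=redundant, errors="ignore")

# Dropping again is harmless because missing labels are ignored
cleaned_again = cleaned.drop(columns=redundant, errors="ignore")
print(cleaned_again.columns.tolist())
```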

Let's display the number of rows and columns again to confirm that the cleanup reduced the number of columns by three.

In [37]:
print("Rows & Columns")
print(data.shape)
data.head(5)

Rows & Columns
(607, 9)


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,US,50,US,L


Let us consider in more detail what each column represents and which values it takes.

In [38]:
print(f"Based on the data obtained above 'work_year' has 3 unique values. Consider them")
data.work_year.unique()

Based on the data obtained above 'work_year' has 3 unique values. Consider them


array([2020, 2021, 2022])

In [39]:
print("Let's also look at the specific values of 'remote_ratio', 'employment_type', 'experience_level' and 'company_size'")
print('', *data.remote_ratio.unique(), '\n',*data.employment_type.unique(), '\n', *data.experience_level.unique(), '\n', *data.company_size.unique())


Let's also look at the specific values of 'remote_ratio', 'employment_type', 'experience_level' and 'company_size'
 0 50 100 
 FT CT PT FL 
 MI SE EN EX 
 L S M


# **Statistics**

The mean, median and standard deviation of the salary in USD are computed below.

In [40]:
print('Mean total salaries (in $):',data['salary_in_usd'].mean())
print('Median total salaries (in $):',data['salary_in_usd'].median())
print("Standard Deviation of the salary is % s "%(statistics.stdev(data['salary_in_usd'])))

Mean total salaries (in $): 112297.86985172982
Median total salaries (in $): 101570.0
Standard Deviation of the salary is 70957.25941139569 
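
`statistics.stdev` computes the sample standard deviation, which is also what `Series.std()` returns by default (`ddof=1`). `numpy.std` defaults to the population formula (`ddof=0`) and therefore gives a slightly smaller number. A minimal sketch, using only the five `salary_in_usd` values shown by `data.head` after the cleanup:

```python
import statistics

import numpy as np
import pandas as pd

# salary_in_usd of the first five rows shown above
salaries = pd.Series([79833, 260000, 109024, 20000, 150000])

sample_std_pandas = salaries.std()                       # ddof=1 by default
sample_std_stdlib = statistics.stdev(salaries.tolist())  # sample formula as well
population_std_numpy = np.std(salaries.to_numpy())       # ddof=0 by default

print(sample_std_pandas, sample_std_stdlib, population_std_numpy)
```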


# **Data Transformation**

One of the most important pieces of information is the salary level. We can therefore add one more column that expresses each salary relative to the overall level, characterizing how much money is received for the work.


To implement this, we first calculate the mean salary and print it.

In [41]:
# Despite its name, med_salary holds the mean salary, not the median
med_salary = data['salary_in_usd'].mean()
print(med_salary)

112297.86985172982


To make the dataset more convenient to read, let's add a column called 'salary_ratio': each salary divided by the mean salary.

In [42]:
data = data.assign(salary_ratio = data.salary_in_usd / med_salary)
data

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,salary_ratio
0,2020,MI,FT,Data Scientist,79833,DE,0,DE,L,0.710904
1,2020,SE,FT,Machine Learning Scientist,260000,JP,0,JP,S,2.315271
2,2020,SE,FT,Big Data Engineer,109024,GB,50,GB,M,0.970847
3,2020,MI,FT,Product Data Analyst,20000,HN,0,HN,S,0.178098
4,2020,SE,FT,Machine Learning Engineer,150000,US,50,US,L,1.335733
...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,US,100,US,M,1.371353
603,2022,SE,FT,Data Engineer,126000,US,100,US,M,1.122016
604,2022,SE,FT,Data Analyst,129000,US,0,US,M,1.148731
605,2022,SE,FT,Data Analyst,150000,US,100,US,M,1.335733


In [43]:
data = data.assign(salary_level = 'a')
data

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,salary_ratio,salary_level
0,2020,MI,FT,Data Scientist,79833,DE,0,DE,L,0.710904,a
1,2020,SE,FT,Machine Learning Scientist,260000,JP,0,JP,S,2.315271,a
2,2020,SE,FT,Big Data Engineer,109024,GB,50,GB,M,0.970847,a
3,2020,MI,FT,Product Data Analyst,20000,HN,0,HN,S,0.178098,a
4,2020,SE,FT,Machine Learning Engineer,150000,US,50,US,L,1.335733,a
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,US,100,US,M,1.371353,a
603,2022,SE,FT,Data Engineer,126000,US,100,US,M,1.122016,a
604,2022,SE,FT,Data Analyst,129000,US,0,US,M,1.148731,a
605,2022,SE,FT,Data Analyst,150000,US,100,US,M,1.335733,a


Next, it is logical to add a column with the salary level. If 'salary_ratio' is lower than 0.75, 'salary_level' is set to 'low'; if it is between 0.75 and 1.25 (inclusive), to 'medium'; and if it is higher than 1.25, to 'high'.

In [44]:
data.loc[(data['salary_ratio'] < 0.75), 'salary_level'] = 'low'
data.loc[(data['salary_ratio'] >= 0.75) & (data['salary_ratio'] <= 1.25), 'salary_level'] = 'medium'
data.loc[(data['salary_ratio'] > 1.25), 'salary_level'] = 'high'
data

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,salary_ratio,salary_level
0,2020,MI,FT,Data Scientist,79833,DE,0,DE,L,0.710904,low
1,2020,SE,FT,Machine Learning Scientist,260000,JP,0,JP,S,2.315271,high
2,2020,SE,FT,Big Data Engineer,109024,GB,50,GB,M,0.970847,medium
3,2020,MI,FT,Product Data Analyst,20000,HN,0,HN,S,0.178098,low
4,2020,SE,FT,Machine Learning Engineer,150000,US,50,US,L,1.335733,high
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,US,100,US,M,1.371353,high
603,2022,SE,FT,Data Engineer,126000,US,100,US,M,1.122016,medium
604,2022,SE,FT,Data Analyst,129000,US,0,US,M,1.148731,medium
605,2022,SE,FT,Data Analyst,150000,US,100,US,M,1.335733,high
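
The placeholder column followed by three `.loc` assignments can also be written as one vectorized step. The sketch below uses `numpy.select` with the same thresholds; it is an alternative formulation, not the notebook's own code, and `ratio` is a hypothetical series made of three ratios from the table above plus the two boundary values.

```python
import numpy as np
import pandas as pd

# Three ratios from the table above plus the boundary values 0.75 and 1.25
ratio = pd.Series([0.710904, 2.315271, 0.970847, 0.75, 1.25])

# Conditions are checked in order; the first one that holds wins
levels = np.select(
    [ratio < 0.75, ratio <= 1.25],
    ["low", "medium"],
    default="high",
)
print(levels.tolist())
```

Both boundary values end up in 'medium', which matches the `>= 0.75` and `<= 1.25` comparisons used in the cell above.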


The dataset now has two new columns. Let's state a hypothesis: the share of 'high' salaries increased over the period from 2020 to 2022.

In [45]:
fig=px.histogram(data_frame=data,
                 x="work_year",
                 color_discrete_sequence=px.colors.sequential.Magma_r,
                 template="plotly_white",
                 title="The Level of Salary",
                 color="salary_level",
                 barmode="group",
                 histnorm="percent",
                 text_auto=".2f")

fig.update_layout(yaxis_title="Share of employees (%)",
                  xaxis_title="Work Year",
                  xaxis={"categoryorder":"total descending"})

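Based on this histogram, the hypothesis appears to be correct: 'high' salaries are considerably more common in 2022 than in 2020. Keep in mind that histnorm="percent" normalizes each salary level separately, so each bar shows what percentage of that level's records falls in a given year, not the composition of that year.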
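
To check the hypothesis against the composition of each year, the counts can be normalized per year instead. The following sketch uses `pandas.crosstab` with `normalize="index"`; the frame `toy` is hypothetical and its values are invented purely to show the mechanics, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical records: four per year, invented for illustration only
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2020, 2020, 2022, 2022, 2022, 2022],
    "salary_level": ["low", "low", "medium", "high",
                     "low", "medium", "high", "high"],
})

# Each row (year) sums to 100, so years of different size are comparable
share_by_year = pd.crosstab(
    toy["work_year"], toy["salary_level"], normalize="index"
) * 100
print(share_by_year)
```

Running the same call on `data["work_year"]` and `data["salary_level"]` would give the share of 'high' salaries within each year directly.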

# **Visual analysis**

Let's delve deeper into the dataset by creating a few simple graphs that summarize its data.

Let's start with an analysis of the most popular jobs. The pie chart below shows the top 10 most common job titles.

In [46]:
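# Ten most frequent job titles
sum_counts = data['job_title'].value_counts()[:10]
fig = go.Figure()
# Pull offsets for the slices; the largest slice gets an offset of 0, so no slice is pulled out
pull = [0]*len(sum_counts)
pull[sum_counts.tolist().index(sum_counts.max())] = 0
# hole=0.9 turns the pie into a thin ring with room for the title in the middle
fig.add_trace(go.Pie(values=sum_counts, labels=sum_counts.index, pull=pull, hole=0.9))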

fig.update_layout(
    margin=dict(l=0, r=0, t=30, b=0),
    legend_orientation="h",
    annotations=[dict(text='Top 10 Job Titles', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()

Looking at the number of employees per job title, four titles stand out from the rest and occupy the leading positions: 'Data Scientist', 'Data Engineer', 'Data Analyst' and 'Machine Learning Engineer'.

Let's also consider two more graphs. The first shows the number of professionals at each experience level. The second shows the distribution of salaries in USD.

In [47]:
exp_lvl = data['experience_level'].value_counts()[:50]
fig = px.bar(y=exp_lvl.values, 
             x=exp_lvl.index, 
             color = exp_lvl.index,
             color_discrete_sequence=px.colors.sequential.Magma_r,
             text=exp_lvl.values,
             title= 'Professionals on each level',
             template= 'plotly_white')

fig.update_layout(
    xaxis_title="Experience Level",
    yaxis_title="Number of Professionals",
    font = dict(size=17,family="Franklin Gothic"))

fig.show()

In [48]:
fig=px.histogram(data_frame=data,
                 x="salary_in_usd",
                 color_discrete_sequence=px.colors.sequential.Viridis_r,
                 template="plotly_white",
                 title="Distribution of Salary in USD $")

fig.update_layout(yaxis_title="Count",
                  xaxis_title="Salary in USD",
                  xaxis={"categoryorder":"total descending"})

The graph shows that salaries are concentrated around 100k USD, which matches the median of 101,570 USD reported above. The right tail thins out as salaries grow, which suggests that increasing your income by half is a challenging task.

The size of the company, its location and the employee's level of experience can strongly influence the salary level. Below we analyze how many records fall into each company size and how the companies are distributed across countries.
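
A statement about where salaries concentrate can be backed by quantiles and by the share of salaries inside a band. The sketch below uses only the ten `salary_in_usd` values shown by `data.head(10)`; the band from 75,000 to 125,000 USD is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# salary_in_usd of the ten rows shown by data.head(10)
salaries = pd.Series([79833, 260000, 109024, 20000, 150000,
                      72000, 190000, 35735, 135000, 125000])

# Quartiles describe where the bulk of the distribution lies
q1, median, q3 = salaries.quantile([0.25, 0.5, 0.75])

# Share of salaries inside an (assumed) band around 100k USD, bounds included
share_near_100k = salaries.between(75_000, 125_000).mean()

print(q1, median, q3, share_near_100k)
```

On the full `data['salary_in_usd']` column the same two calls quantify how many salaries actually sit near 100k USD.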

In [49]:
fig=px.histogram(data_frame=data,
                 x="company_size",
                 color_discrete_sequence=px.colors.sequential.Viridis_r,
                 template="plotly_white",
                 title="Company Size")

fig.update_layout(yaxis_title="Amount of Companies",
                  xaxis_title="Company Size",
                  xaxis={"categoryorder":"total descending"})

The graph clearly shows that medium-sized companies (M), those with an average number of employees of 50-200, lead the other types, ahead of large and small companies.

In [50]:
fig = px.histogram(data,
                   x='company_location',
                   color_discrete_sequence=px.colors.sequential.Viridis_r,
                   template="plotly_white",
                   title="Company Location")

fig.show()

This shows that the largest number of companies is located in the US. So the best chance of getting a job in the data science field may be to move to the USA, the country with the most successful IT companies and plenty of employment prospects.
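
The dominance of one country can be expressed as a share with `value_counts(normalize=True)`. This sketch uses only the `company_location` values of the ten rows shown by `data.head(10)`, so the resulting share describes that small sample, not the whole dataset.

```python
import pandas as pd

# company_location of the ten rows shown by data.head(10)
locations = pd.Series(
    ["DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US", "NZ"],
    name="company_location",
)

# Relative frequencies, sorted from most to least common
share = locations.value_counts(normalize=True)
top_country, top_share = share.index[0], share.iloc[0]

print(top_country, top_share)
```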

# **Detailed Overview**

Hypothesis: do programmers improve their skills every year?

In [51]:
fig=px.histogram(data_frame=data,
                 x="experience_level",
                 color_discrete_sequence=px.colors.sequential.Magma_r,
                 template="plotly_white",
                 title="Distribution of Experience Level/Employment Type %",
                 color="work_year",
                 barmode="group",
                 histnorm="percent",
                 text_auto=".2f")

fig.update_layout(yaxis_title="Share of employees (%)",
                  xaxis_title="Experience Level",
                  xaxis={"categoryorder":"total descending"})

According to this graph, the hypothesis is confirmed: the share of Senior-level programmers has grown, while the shares of Mid-level and Entry-level programmers have declined.

Hypothesis: does the salary level increase when an employee moves to remote work?

In [52]:
fig=px.histogram(data_frame=data,
                 x="salary_in_usd",
                 color_discrete_sequence=px.colors.sequential.Magma_r,
                 template="plotly_white",
                 title="Level of Salaries According to Remote Ratio",
                 color="remote_ratio",
                 barmode="group",
                 histnorm="percent",
                 text_auto=".2f")

fig.update_layout(yaxis_title="Share of Employees (%)",
                  xaxis_title="Salary in USD",
                  xaxis={"categoryorder":"total descending"})

As can be seen from the graph, yes: higher salaries tend to go together with a higher share of remote work.

To understand more deeply how the share of remote work relates to income, we can look at the correlation between these two variables.
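
Overlapping histogram bars are hard to compare by eye, so a grouped summary can complement the graph. The sketch below groups by `remote_ratio` and reports the count, median and mean salary per group. It uses only the ten rows shown by `data.head(10)`, so its numbers are not representative of the whole dataset.

```python
import pandas as pd

# remote_ratio and salary_in_usd of the ten rows shown by data.head(10)
toy = pd.DataFrame({
    "remote_ratio": [0, 0, 50, 0, 50, 100, 100, 50, 100, 50],
    "salary_in_usd": [79833, 260000, 109024, 20000, 150000,
                      72000, 190000, 35735, 135000, 125000],
})

# One row per remote_ratio value with basic salary statistics
summary = toy.groupby("remote_ratio")["salary_in_usd"].agg(
    ["count", "median", "mean"]
)
print(summary)
```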

In [53]:
# Keep only rows with some remote work (remote_ratio of 50 or 100);
# trendline="ols" requires the statsmodels package
fig = px.scatter(data[data['remote_ratio'] > 10],
                 x='remote_ratio',
                 y='salary_in_usd',
                 trendline="ols",
                 title='Correlation between Remote Ratio and Salary in USD')
fig.show()

The positive correlation supports the latter hypothesis.
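
A trendline shows the direction of the relationship, while the correlation coefficient shows its strength. The sketch below computes Pearson's r in two ways and fits a straight line with `numpy.polyfit`. It again uses only the ten rows shown by `data.head(10)`, so the result should not be read as a finding about the full dataset.

```python
import numpy as np
import pandas as pd

# remote_ratio and salary_in_usd of the ten rows shown by data.head(10)
toy = pd.DataFrame({
    "remote_ratio": [0, 0, 50, 0, 50, 100, 100, 50, 100, 50],
    "salary_in_usd": [79833, 260000, 109024, 20000, 150000,
                      72000, 190000, 35735, 135000, 125000],
})

x, y = toy["remote_ratio"], toy["salary_in_usd"]

r_pandas = x.corr(y)                 # Pearson's r via pandas
r_numpy = np.corrcoef(x, y)[0, 1]    # the same coefficient via NumPy

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

print(r_pandas, slope, intercept)
```

For a simple linear fit the slope equals r multiplied by the ratio of the standard deviations of y and x, so the sign of the slope always matches the sign of r, while a steep slope alone does not imply a strong correlation.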