In [14]:
import numpy as np
import numpy.random as npr
import pandas as pd
import altair as alt

In [24]:
df = pd.read_csv("../data/raw/ds_salaries.csv")
df = df[['experience_level', 'employment_type', 'job_title', 'salary_in_usd', 'company_location', 'remote_ratio', 'company_size']]
df.shape

(607, 7)

In [25]:
df.head()

,experience_level,employment_type,job_title,salary_in_usd,company_location,remote_ratio,company_size
0,MI,FT,Data Scientist,79833,DE,0,L
1,SE,FT,Machine Learning Scientist,260000,JP,0,S
2,SE,FT,Big Data Engineer,109024,GB,50,M
3,MI,FT,Product Data Analyst,20000,HN,0,S
4,SE,FT,Machine Learning Engineer,150000,US,50,L


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   experience_level  607 non-null    object
 1   employment_type   607 non-null    object
 2   job_title         607 non-null    object
 3   salary_in_usd     607 non-null    int64 
 4   company_location  607 non-null    object
 5   remote_ratio      607 non-null    int64 
 6   company_size      607 non-null    object
dtypes: int64(2), object(5)
memory usage: 33.3+ KB


In [27]:
df.nunique()

experience_level      4
employment_type       4
job_title            50
salary_in_usd       369
company_location     50
remote_ratio          3
company_size          3
dtype: int64

In [31]:
df['employment_type'].value_counts()

employment_type
FT    588
PT     10
CT      5
FL      4
Name: count, dtype: int64

In [28]:
df['job_title'].value_counts()

job_title
Data Scientist                              143
Data Engineer                               132
Data Analyst                                 97
Machine Learning Engineer                    41
Research Scientist                           16
Data Science Manager                         12
Data Architect                               11
Machine Learning Scientist                    8
Big Data Engineer                             8
Director of Data Science                      7
AI Scientist                                  7
Principal Data Scientist                      7
Data Science Consultant                       7
Data Analytics Manager                        7
BI Data Analyst                               6
Computer Vision Engineer                      6
ML Engineer                                   6
Lead Data Engineer                            6
Applied Data Scientist                        5
Business Data Analyst                         5
Data Engineering Manager      

In [29]:
df['company_location'].value_counts()

company_location
US    355
GB     47
CA     30
DE     28
IN     24
FR     15
ES     14
GR     11
JP      6
NL      4
PT      4
PL      4
AT      4
MX      3
DK      3
AE      3
PK      3
LU      3
TR      3
BR      3
AU      3
RU      2
CN      2
CH      2
BE      2
NG      2
SI      2
IT      2
CZ      2
NZ      1
HU      1
HN      1
SG      1
HR      1
MT      1
IL      1
UA      1
RO      1
IQ      1
MD      1
CL      1
IR      1
VN      1
KE      1
CO      1
AS      1
DZ      1
EE      1
MY      1
IE      1
Name: count, dtype: int64

In [30]:
alt.Chart(df).mark_bar().encode(
    x='experience_level',
    y='mean(salary_in_usd)'
)

The dataset we will visualize contains 607 job records in the field of data science. Each record has seven variables describing the position, the employer, and the compensation. We expect this information to help MDS graduate students explore the job market and narrow their job search. The variables include:

- Experience level of the role (experience_level, e.g. Entry-level (EN), Mid-level (MI), Senior (SE), Executive (EX))
- Type of employment (employment_type, e.g. Full-time (FT), Part-time (PT), Contract (CT), Freelance (FL))
- Specific position or role within the data science field (job_title, e.g. Data Scientist, Data Analyst)
- Salary in USD (salary_in_usd)
- Country of the company, recorded as a two-letter country code (company_location)
- Size of the employing company (company_size, e.g. Large (L), Medium (M), Small (S))

We will also derive a new variable (work_arrangement) from the existing variable (remote_ratio) that indicates whether a position is remote (remote_ratio = 100), hybrid (0 < remote_ratio < 100), or onsite (remote_ratio = 0). Because students may have different preferences for work arrangements, this variable lets them filter job opportunities by their preferred work style.
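
The derivation of work_arrangement described above could be sketched as follows. This is only an illustration of the stated thresholds: the function name `classify_work_arrangement` and the lowercase label strings are assumptions, not part of the notebook, and the sample values assume that 100 is the third remote_ratio value alongside the 0 and 50 seen in df.head().

```python
def classify_work_arrangement(remote_ratio: int) -> str:
    """Map a remote_ratio percentage (0-100) to a work-arrangement label."""
    if remote_ratio == 100:
        return "remote"
    if remote_ratio == 0:
        return "onsite"
    if 0 < remote_ratio < 100:
        return "hybrid"
    # Anything outside 0-100 is not covered by the definition in the text.
    raise ValueError(f"remote_ratio out of range: {remote_ratio}")


# df.nunique() reports three distinct remote_ratio values; 0 and 50 appear
# in df.head(), and 100 is assumed here to be the third.
for ratio in (0, 50, 100):
    print(ratio, classify_work_arrangement(ratio))
```

With the DataFrame loaded earlier, `df['work_arrangement'] = df['remote_ratio'].map(classify_work_arrangement)` would add the new column. Raising an error for out-of-range values, rather than silently labelling them, is a design choice that surfaces unexpected data early.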