# Problem definition: comparing salaries by job title, experience level, employment type, and company size
#### Analysis method: drill-down
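Drill-down means starting from a coarse grouping and adding one dimension at a time. The sketch below illustrates the idea on a tiny made-up frame (`toy_df` and its values are assumptions for illustration, not rows from `salaries.csv`); `drill_path` is a hypothetical name for the order in which dimensions are added.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_df = pd.DataFrame({
    "experience_level": ["SE", "SE", "SE", "EN", "EN", "EN"],
    "employment_type":  ["FT", "FT", "CT", "FT", "FT", "CT"],
    "company_size":     ["M", "L", "M", "M", "M", "S"],
    "salary_in_usd":    [160000, 200000, 120000, 80000, 90000, 50000],
})

# Each step adds one more grouping key, narrowing the view one dimension at a time.
drill_path = ["experience_level", "employment_type", "company_size"]
levels = {}
for depth in range(1, len(drill_path) + 1):
    keys = drill_path[:depth]
    levels[depth] = toy_df.groupby(keys)["salary_in_usd"].mean()
    print(levels[depth], end="\n\n")
```

Each entry of `levels` is the mean salary at one depth, so a surprising value at a coarse level can be traced to the finer groups beneath it.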

In [1]:
import pandas as pd

salary_df = pd.read_csv('salaries.csv')
salary_df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2025,MI,FT,Analyst,142000,USD,142000,US,0,US,M
1,2025,MI,FT,Analyst,87000,USD,87000,US,0,US,M
2,2025,SE,FT,Data Quality Lead,218700,USD,218700,US,0,US,M
3,2025,SE,FT,Data Quality Lead,163200,USD,163200,US,0,US,M
4,2025,MI,FT,Data Quality Specialist,121524,USD,121524,US,0,US,M
...,...,...,...,...,...,...,...,...,...,...,...
146343,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L
146344,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L
146345,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
146346,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L


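#### Column descriptions

- work_year :          year the salary was reported
- experience_level :   experience level (EN: entry-level, MI: mid-level, SE: senior, EX: executive)
- employment_type :    employment contract type (FT: full-time, CT: contract, PT: part-time, FL: freelance)
- job_title :          job title
- salary :             annual salary, in the currency given by salary_currency
- salary_currency :    currency in which the salary is paid
- salary_in_usd :      salary converted to USD
- employee_residence : employee's country of residence
- remote_ratio :       percentage of work done remotely
- company_location :   country of the company's or employer's main office
- company_size :       company size (S / M / L)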

In [2]:
salary_df.shape

(146348, 11)

In [3]:
salary_df.isna().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [4]:
salary_df['experience_level'].unique()

array(['MI', 'SE', 'EX', 'EN'], dtype=object)

In [5]:
group_SE = salary_df[salary_df['experience_level'] == 'SE']
group_MI = salary_df[salary_df['experience_level'] == 'MI']
group_EX = salary_df[salary_df['experience_level'] == 'EX']
group_EN = salary_df[salary_df['experience_level'] == 'EN']

In [6]:
print(group_SE.shape)
print(group_MI.shape)
print(group_EX.shape)
print(group_EN.shape)

(84659, 11)
(44410, 11)
(3813, 11)
(13466, 11)
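The four filters and `shape` calls above can also be condensed into one count per level. A minimal sketch on a small made-up frame (`toy_df` is an assumption for illustration, not the real data):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_df = pd.DataFrame({
    "experience_level": ["SE", "SE", "MI", "EN", "EX", "SE"],
    "salary_in_usd": [180000, 150000, 110000, 70000, 250000, 160000],
})

# Row count per experience level, equivalent to checking each subset's shape[0].
level_counts = toy_df["experience_level"].value_counts()
print(level_counts)

# groupby can hand back every subset without writing one filter per level.
groups = {level: sub for level, sub in toy_df.groupby("experience_level")}
print(groups["SE"].shape)
```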


In [7]:
group_SE.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
2,2025,SE,FT,Data Quality Lead,218700,USD,218700,US,0,US,M
3,2025,SE,FT,Data Quality Lead,163200,USD,163200,US,0,US,M
10,2025,SE,FT,Machine Learning Engineer,166000,USD,166000,US,0,US,M
11,2025,SE,FT,Machine Learning Engineer,114000,USD,114000,US,0,US,M
14,2025,SE,FT,Machine Learning Engineer,256500,USD,256500,US,100,US,M


## Simple comparison

In [37]:
# Entry-level salary comparison
pd.DataFrame(group_EN.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)).head(20)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Engineering Manager,407560.0
Head of Data,240500.0
Director,204718.0
AI Researcher,203643.636364
Quantitative Researcher,197000.0
Research Scientist,192503.714286
Software Architect,190000.0
Machine Learning Model Engineer,180000.0
Architect,170837.5
Data Analytics Lead,165350.0


In [9]:
# Senior-level salary comparison
pd.DataFrame(group_SE.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)).head(7)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Research Team Lead,450000.0
Analytics Engineering Manager,399880.0
Data Science Tech Lead,375000.0
Applied AI ML Lead,292500.0
IT Enterprise Data Architect,284090.0
Director of Machine Learning,281616.666667
Deep Learning Engineer,272374.0


In [10]:
# Executive-level salary comparison
pd.DataFrame(group_EX.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)).head(7)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Principal Data Scientist,416000.0
Research Engineer,290000.0
Principal Engineer,280000.0
Technical Lead,280000.0
AI Developer,276000.0
Head of Applied AI,273875.0
Director of Data,270000.0


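#### Interpretation

1. Manager and head-level titles (Engineering Manager, Head of Data, Director) appear in the entry-level table, so it is likely that some experienced people are labeled as entry-level.
2. Leaving those titles aside, the top of the entry-level ranking is dominated by research and AI roles (possibly because many of these hires hold master's or doctoral degrees).
3. The top salaries in the SE (senior) table are higher than the top salaries in the EX (executive) table. The EX group is small (3,813 rows versus 84,659 for SE), so its per-title averages may be distorted.
   Alternatively, executives may not take part in coding or modeling, so their pay may in fact be set lower than that of senior engineers on technical grounds.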
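One way to check whether a per-title average is distorted by a handful of rows is to report the row count and the median next to the mean. The following is a sketch under assumptions: `toy_df` is made-up data, and `MIN_ROWS` is an arbitrary illustrative threshold, not a value used in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_df = pd.DataFrame({
    "job_title": ["Engineering Manager", "Data Analyst", "Data Analyst",
                  "Data Analyst", "Research Scientist", "Research Scientist"],
    "salary_in_usd": [400000, 60000, 70000, 80000, 150000, 190000],
})

# Report the sample size and the median next to the mean for each job title.
summary = (
    toy_df.groupby("job_title")["salary_in_usd"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

# MIN_ROWS is an arbitrary illustrative threshold, not a value from the notebook.
MIN_ROWS = 2
reliable = summary[summary["count"] >= MIN_ROWS]
print(summary)
print(reliable)
```

In the toy data the single "Engineering Manager" row tops the ranking by mean but drops out once titles with fewer than `MIN_ROWS` rows are excluded.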

### Examining the senior-level table

In [11]:
# FT: full-time, CT: contract, PT: part-time, FL: freelance
group_SE_FT = group_SE[group_SE['employment_type'] == 'FT']
group_SE_CT = group_SE[group_SE['employment_type'] == 'CT']
group_SE_PT = group_SE[group_SE['employment_type'] == 'PT']
group_SE_FL = group_SE[group_SE['employment_type'] == 'FL']

In [12]:
print(group_SE_FT.shape)
print(group_SE_CT.shape)
print(group_SE_PT.shape)
print(group_SE_FL.shape)

(84452, 11)
(145, 11)
(56, 11)
(6, 11)


In [13]:
group_SE_FT['employee_residence'].value_counts()

employee_residence
US    77384
CA     3260
GB     1620
DE      197
AU      194
      ...  
CF        2
RU        1
OM        1
SA        1
MD        1
Name: count, Length: 66, dtype: int64

In [14]:
group_SE_CT['employee_residence'].value_counts()

employee_residence
US    114
IN      8
CA      5
FR      5
BR      2
LT      2
AU      2
MX      2
GB      2
EG      2
PL      1
Name: count, dtype: int64

In [15]:
group_SE_PT['employee_residence'].value_counts()

employee_residence
US    46
CA     6
GB     2
DE     1
AE     1
Name: count, dtype: int64

In [16]:
group_SE_FL['employee_residence'].value_counts()

employee_residence
NG    2
CZ    1
IN    1
UA    1
RU    1
Name: count, dtype: int64

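#### Interpretation
1. In the US, at-will employment makes dismissal relatively easy, which may explain the high share of full-time positions (the other employment types have too few rows to interpret).
2. Although the data is sparse, many of the countries near the top of the counts are ones where full-time employment itself is legally protected.
3. The data appears to have been collected mainly from the US.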
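Raw counts make countries hard to compare because the US dominates every table. A normalized cross-tabulation puts each country on the same scale. This is a sketch on made-up rows (`toy_df` is an assumption, not the real data):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_df = pd.DataFrame({
    "employee_residence": ["US", "US", "US", "US", "IN", "IN", "CA", "CA"],
    "employment_type":    ["FT", "FT", "FT", "CT", "FT", "CT", "FT", "FT"],
})

# Share of each employment type within each country (every row sums to 1).
type_share = pd.crosstab(
    toy_df["employee_residence"],
    toy_df["employment_type"],
    normalize="index",
)

# Share of all rows contributed by each country, to gauge how US-heavy the data is.
country_share = toy_df["employee_residence"].value_counts(normalize=True)
print(type_share)
print(country_share)
```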

In [17]:
# Mean salary by job title among senior-level contract workers (bottom 5)
pd.DataFrame(group_SE_CT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False).tail(5))


Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Data Developer,110860.0
Staff Data Scientist,105000.0
Analyst,85583.333333
Machine Learning Engineer,38000.0
Artificial Intelligence Engineer,30000.0


In [18]:
pd.DataFrame(group_SE_CT['job_title'] == 'Finance Data Analyst').value_counts()

job_title
False        144
True           1
Name: count, dtype: int64

In [19]:
pd.DataFrame(group_SE_CT['job_title'] == 'Machine Learning Engineer')

Unnamed: 0,job_title
276,False
277,False
2070,False
2071,False
2465,False
...,...
134220,False
141076,False
142419,False
145306,False
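The boolean frame above shows only True/False per row. To see how many rows match a title, and which rows they are, the same comparison can be used as a mask. A minimal sketch on made-up rows (`toy_ct` is an assumption, not the real contract subset):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_ct = pd.DataFrame({
    "job_title": ["Machine Learning Engineer", "Analyst", "Analyst", "Finance Data Analyst"],
    "salary_in_usd": [40000, 90000, 80000, 300000],
})

# Boolean mask: True where the job title matches.
is_mle = toy_ct["job_title"] == "Machine Learning Engineer"

# Summing a boolean mask counts the matching rows.
print(is_mle.sum())

# Indexing with the mask returns the matching rows themselves.
print(toy_ct[is_mle])

# Number of rows behind every job title at once.
title_counts = toy_ct["job_title"].value_counts()
print(title_counts)
```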


In [66]:
group_SE_CT_D = pd.DataFrame(group_SE_CT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_SE_CT_D.head(10)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Finance Data Analyst,323905.0
Technical Writer,300000.0
Manager,165000.0
Data Architect,160000.0
AI Engineer,153344.714286
Software Engineer,144071.428571
QA Engineer,140000.0
Architect,131333.333333
Engineer,131214.285714
Data Scientist,129302.0


In [67]:
group_SE_CT_D.describe()

Unnamed: 0,salary_in_usd
count,20.0
mean,137333.713853
std,69007.619573
min,30000.0
25%,113703.386364
50%,129026.0
75%,146389.75
max,323905.0


In [68]:
# Mean salary by job title among senior-level full-time workers
group_SE_FT_D = pd.DataFrame(group_SE_FT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_SE_FT_D.head(10)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Research Team Lead,450000.0
Analytics Engineering Manager,399880.0
Data Science Tech Lead,375000.0
Applied AI ML Lead,292500.0
IT Enterprise Data Architect,284090.0
Director of Machine Learning,281616.666667
Deep Learning Engineer,272374.0
AIRS Solutions Specialist,263250.0
Machine Learning Performance Engineer,262500.0
Machine Learning Model Engineer,255000.0


In [70]:
group_SE_FT_D.describe()

Unnamed: 0,salary_in_usd
count,330.0
mean,149955.942239
std,52193.019736
min,23649.0
25%,119842.96875
50%,149211.700658
75%,180867.955882
max,450000.0


In [71]:
# Mean salary by job title among senior-level part-time workers
group_SE_PT_D = pd.DataFrame(group_SE_PT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_SE_PT_D.head()

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Machine Learning Engineer,225000.0
Architect,189100.0
Software Engineer,185817.333333
Engineer,172500.0
Statistician,142400.0


In [72]:
group_SE_PT_D.describe()

Unnamed: 0,salary_in_usd
count,16.0
mean,113591.192708
std,55789.422426
min,40000.0
25%,72132.25
50%,97351.5
75%,149925.0
max,225000.0


In [73]:
# Mean salary by job title among senior-level freelancers
group_SE_FL_D = pd.DataFrame(group_SE_FL.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_SE_FL_D.head()

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Backend Engineer,61333.0
Computer Vision Engineer,60000.0
Machine Learning Developer,60000.0
Machine Learning Researcher,50000.0
Software Data Engineer,50000.0


In [74]:
group_SE_FL_D.describe()

Unnamed: 0,salary_in_usd
count,6.0
mean,52891.166667
std,9735.420041
min,36014.0
25%,50000.0
50%,55000.0
75%,60000.0
max,61333.0


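#### Interpretation
1. For senior-level staff, full-time work has the highest average salary for most job titles (mean of per-title averages: FT 149,956 > CT 137,334 > PT 113,591 > FL 52,891 USD).
2. Full-time has the highest average, but also the widest range of per-title averages (23,649 to 450,000 USD).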
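The claim that full-time pays best "for most job titles" can be checked title by title by placing the employment types side by side. The sketch below uses made-up rows (`toy_se` is an assumption, not the real senior-level subset):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_se = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Scientist",
                  "Analyst", "Analyst", "Architect"],
    "employment_type": ["FT", "FT", "CT", "FT", "CT", "PT"],
    "salary_in_usd": [160000, 180000, 130000, 90000, 110000, 150000],
})

# Mean salary per job title (rows) and employment type (columns);
# combinations with no rows become NaN.
by_type = toy_se.pivot_table(
    index="job_title",
    columns="employment_type",
    values="salary_in_usd",
    aggfunc="mean",
)

# For each job title, the employment type with the highest mean (NaN is skipped).
best_type = by_type.idxmax(axis=1)

# Fraction of job titles where full-time pays the most.
ft_share = (best_type == "FT").mean()
print(by_type)
print(best_type)
print(ft_share)
```

Titles that exist under only one employment type win by default, so it may be worth restricting the comparison to titles that have rows in at least two types.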

### Examining the entry-level table

In [29]:
# FT: full-time, CT: contract, PT: part-time, FL: freelance
group_EN_FT = group_EN[group_EN['employment_type'] == 'FT']
group_EN_CT = group_EN[group_EN['employment_type'] == 'CT']
group_EN_PT = group_EN[group_EN['employment_type'] == 'PT']
group_EN_FL = group_EN[group_EN['employment_type'] == 'FL']

In [30]:
print(group_EN_FT.shape)
print(group_EN_CT.shape)
print(group_EN_PT.shape)
print(group_EN_FL.shape)

(13104, 11)
(128, 11)
(230, 11)
(4, 11)


In [64]:
group_EN_FT_D = pd.DataFrame(group_EN_FT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_EN_FT_D

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Engineering Manager,407560.0
Head of Data,240500.0
AI Researcher,218008.0
Director,204718.0
Quantitative Researcher,197000.0
...,...
Data Quality Engineer,23753.0
Web Developer,22584.0
Data Analytics Engineer,20000.0
Alternance,19825.0


In [52]:
group_EN_FT_D.describe()

Unnamed: 0,salary_in_usd
count,183.0
mean,91532.769563
std,46567.682629
min,18000.0
25%,58559.833333
50%,85104.411765
75%,112851.428571
max,407560.0


In [62]:
group_EN_CT_D = pd.DataFrame(group_EN_CT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_EN_CT_D

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
BI Analyst,129000.0
Business Data Analyst,100000.0
Data Scientist,93692.75
AI Research Scientist,88888.0
Product Data Analyst,83200.0
Software Engineer,81440.0
Data Engineer,74019.2
Data Analyst,72452.114754
Analyst,63808.0
Technical Writer,50000.0


In [54]:
group_EN_CT_D.describe()

Unnamed: 0,salary_in_usd
count,17.0
mean,64546.984201
std,29514.587957
min,16000.0
25%,44753.0
50%,63808.0
75%,83200.0
max,129000.0


In [55]:
group_EN_FL_D = pd.DataFrame(group_EN_FL.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False))
group_EN_FL_D

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Machine Learning Engineer,100000.0
AI Researcher,60000.0
Data Analytics Consultant,50000.0
Applied Data Scientist,30000.0


In [75]:
group_EN_FL_D.describe()

Unnamed: 0,salary_in_usd
count,4.0
mean,60000.0
std,29439.202888
min,30000.0
25%,45000.0
50%,55000.0
75%,70000.0
max,100000.0


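#### Interpretation
1. The relative salary pattern across employment types is similar for senior-level and entry-level staff.
2. Freelancers earn more on average than the lowest-paid full-time job titles,
   but markedly less than the highest-paid ones. (For people in the lowest-paid full-time roles, freelancing may be worth considering.)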
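The per-subset `describe()` calls above can be reproduced in a single pass, which puts every experience level and employment type in one table. This is a sketch on made-up rows (`toy_df` is an assumption, not the real data); it follows the same two steps as the notebook: average per job title first, then summarize those averages.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from salaries.csv.
toy_df = pd.DataFrame({
    "experience_level": ["SE", "SE", "SE", "SE", "EN", "EN", "EN", "EN"],
    "employment_type":  ["FT", "FT", "FL", "FL", "FT", "FT", "FL", "FL"],
    "job_title": ["Data Scientist", "Analyst", "Backend Engineer", "Data Scientist",
                  "Data Scientist", "Analyst", "AI Researcher", "Analyst"],
    "salary_in_usd": [180000, 120000, 60000, 50000, 100000, 60000, 60000, 30000],
})

# Step 1: mean salary per (experience level, employment type, job title).
title_means = toy_df.groupby(
    ["experience_level", "employment_type", "job_title"]
)["salary_in_usd"].mean()

# Step 2: summarize those per-title means within each level/type pair,
# mirroring the separate describe() calls on each *_D frame.
summary = title_means.groupby(level=["experience_level", "employment_type"]).describe()
print(summary[["count", "mean", "min", "max"]])
```

Because these are statistics over per-title averages, each job title counts once regardless of how many rows it has. Grouping `toy_df` directly by level and type would give per-person statistics instead.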