# Problem definition: comparing salaries by job title / experience level / employment type / company size
#### Analysis method: drill-down
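As a rough illustration of what drill-down means here, the sketch below groups a tiny, made-up sample first by experience level, then by employment type within each level, then by job title. The `sample` frame and its values are hypothetical; only the column names follow `salaries.csv`.

```python
import pandas as pd

# Hypothetical sample rows that mimic a few columns of salaries.csv.
sample = pd.DataFrame({
    "experience_level": ["SE", "SE", "SE", "EN", "EN", "SE"],
    "employment_type": ["FT", "FT", "CT", "FT", "FT", "FT"],
    "job_title": ["Data Scientist", "Data Scientist", "Analyst",
                  "Analyst", "Analyst", "Data Engineer"],
    "salary_in_usd": [200000, 180000, 90000, 60000, 70000, 150000],
})

# Level 1: mean salary per experience level.
by_level = sample.groupby("experience_level")["salary_in_usd"].mean()

# Level 2: drill into each experience level by employment type.
by_level_type = sample.groupby(
    ["experience_level", "employment_type"]
)["salary_in_usd"].mean()

# Level 3: drill further down to individual job titles.
by_level_type_title = sample.groupby(
    ["experience_level", "employment_type", "job_title"]
)["salary_in_usd"].mean()

print(by_level)
print(by_level_type)
print(by_level_type_title)
```

Each step keeps the grouping keys of the previous one and adds one more, so every finer table can be read as a breakdown of a row in the coarser table.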

In [53]:
import pandas as pd

salary_df = pd.read_csv('salaries.csv')
salary_df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2025,MI,FT,Analyst,142000,USD,142000,US,0,US,M
1,2025,MI,FT,Analyst,87000,USD,87000,US,0,US,M
2,2025,SE,FT,Data Quality Lead,218700,USD,218700,US,0,US,M
3,2025,SE,FT,Data Quality Lead,163200,USD,163200,US,0,US,M
4,2025,MI,FT,Data Quality Specialist,121524,USD,121524,US,0,US,M
...,...,...,...,...,...,...,...,...,...,...,...
146343,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L
146344,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L
146345,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
146346,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L


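#### Column descriptions

- work_year :          year the salary was reported
- experience_level :   experience level (EN: entry-level, MI: mid-level, SE: senior, EX: executive)
- employment_type :    type of employment contract (FT: full-time, CT: contract, PT: part-time, FL: freelance)
- job_title :          job title
- salary :             annual salary, in the currency given by salary_currency
- salary_currency :    currency in which the salary is paid
- salary_in_usd :      salary converted to USD
- employee_residence : employee's country of residence
- remote_ratio :       percentage of work done remotely
- company_location :   country where the company or employer has its main office
- company_size :       company size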
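The code values in the column descriptions can be turned into readable labels with `Series.map`. This is only a sketch on a made-up `sample` frame; the dictionary names and the English labels are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Label mappings based on the column descriptions (labels are illustrative).
EXPERIENCE_LABELS = {"EN": "Entry", "MI": "Mid", "SE": "Senior", "EX": "Executive"}
EMPLOYMENT_LABELS = {"FT": "Full-time", "CT": "Contract",
                     "PT": "Part-time", "FL": "Freelance"}

# Hypothetical sample rows using the same codes as salaries.csv.
sample = pd.DataFrame({
    "experience_level": ["MI", "SE", "EN", "EX"],
    "employment_type": ["FT", "CT", "PT", "FL"],
})

# Add label columns next to the original codes instead of overwriting them.
labeled = sample.assign(
    experience_label=sample["experience_level"].map(EXPERIENCE_LABELS),
    employment_label=sample["employment_type"].map(EMPLOYMENT_LABELS),
)
print(labeled)
```

Codes that are missing from a mapping become `NaN`, which makes unexpected values easy to spot.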

In [54]:
salary_df.shape

(146348, 11)

In [55]:
salary_df.isna().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [56]:
salary_df['experience_level'].unique()

array(['MI', 'SE', 'EX', 'EN'], dtype=object)

In [57]:
group_SE = salary_df[salary_df['experience_level'] == 'SE']
group_MI = salary_df[salary_df['experience_level'] == 'MI']
group_EX = salary_df[salary_df['experience_level'] == 'EX']
group_EN = salary_df[salary_df['experience_level'] == 'EN']

In [58]:
print(group_SE.shape)
print(group_MI.shape)
print(group_EX.shape)
print(group_EN.shape)

(84659, 11)
(44410, 11)
(3813, 11)
(13466, 11)
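The same split and row counts can also be obtained without one variable per level. The sketch below uses `value_counts` and `groupby` on a small, hypothetical `sample` frame; the `groups` dictionary is an illustrative helper, not something the notebook defines.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow salaries.csv.
sample = pd.DataFrame({
    "experience_level": ["SE", "MI", "SE", "EN", "EX", "SE"],
    "salary_in_usd": [200000, 120000, 180000, 60000, 250000, 150000],
})

# Number of rows per experience level in a single call.
counts = sample["experience_level"].value_counts()

# One sub-frame per level, keyed by the level code.
groups = {level: frame for level, frame in sample.groupby("experience_level")}

print(counts)
print(groups["SE"].shape)
```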


In [59]:
group_SE.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
2,2025,SE,FT,Data Quality Lead,218700,USD,218700,US,0,US,M
3,2025,SE,FT,Data Quality Lead,163200,USD,163200,US,0,US,M
10,2025,SE,FT,Machine Learning Engineer,166000,USD,166000,US,0,US,M
11,2025,SE,FT,Machine Learning Engineer,114000,USD,114000,US,0,US,M
14,2025,SE,FT,Machine Learning Engineer,256500,USD,256500,US,100,US,M


## Simple comparison

In [119]:
# Entry-level salary comparison
pd.DataFrame(group_EN.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)).tail(20)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Research Specialist,45250.0
Compliance Data Analyst,45000.0
Staff Data Analyst,44753.0
Robotics Engineer,42761.8
Data Management Consultant,41000.0
Applied Machine Learning Scientist,40766.666667
Finance Data Analyst,40000.0
Technical Specialist,40000.0
Data Analysis,38131.5
Data Operations,38000.0


In [61]:
# Senior-level salary comparison
pd.DataFrame(group_SE.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)).head(7)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Research Team Lead,450000.0
Analytics Engineering Manager,399880.0
Data Science Tech Lead,375000.0
Applied AI ML Lead,292500.0
IT Enterprise Data Architect,284090.0
Director of Machine Learning,281616.666667
Deep Learning Engineer,272374.0


In [62]:
# Executive-level salary comparison
pd.DataFrame(group_EX.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)).head(7)

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Principal Data Scientist,416000.0
Research Engineer,290000.0
Principal Engineer,280000.0
Technical Lead,280000.0
AI Developer,276000.0
Head of Applied AI,273875.0
Director of Data,270000.0


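#### Interpretation

1. The entry-level table includes manager and lead titles, so some experienced employees have likely been classified as entry-level.
2. Leaving out the lead and manager titles, the entry-level ranking is dominated by research and AI-related jobs (interpreted as reflecting the many master's and doctoral degree holders in those roles).
3. The top salaries in the SE (senior) table are higher than the top salaries in the EX (executive) table. The EX group has far fewer rows (3,813 versus 84,659 for SE), so its averages may be distorted.
   Alternatively, executives may in fact be paid less than senior engineers because they do not take part in coding or modeling and are therefore valued lower on technical skill.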
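One way to check whether a high average comes from only a handful of rows is to report the row count and median next to the mean, and to hide titles with too few rows. The sketch below does this on made-up data; `MIN_ROWS` is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Hypothetical sample rows; one title has a single, very high salary.
sample = pd.DataFrame({
    "job_title": ["Research Team Lead", "Data Scientist", "Data Scientist",
                  "Data Scientist", "Data Engineer", "Data Engineer"],
    "salary_in_usd": [450000, 150000, 170000, 160000, 140000, 160000],
})

MIN_ROWS = 2  # hypothetical minimum number of rows for a title to be ranked

# Mean, median, and row count per job title, sorted by mean.
summary = (
    sample.groupby("job_title")["salary_in_usd"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

# Keep only titles backed by at least MIN_ROWS rows.
reliable = summary[summary["count"] >= MIN_ROWS]

print(summary)
print(reliable)
```

In this toy sample the top-ranked title rests on a single row and drops out once the threshold is applied, which is the kind of distortion the third point describes.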

In [63]:
# FT: full-time, CT: contract, PT: part-time, FL: freelance
group_SE_FT = group_SE[group_SE['employment_type'] == 'FT']
group_SE_CT = group_SE[group_SE['employment_type'] == 'CT']
group_SE_PT = group_SE[group_SE['employment_type'] == 'PT']
group_SE_FL = group_SE[group_SE['employment_type'] == 'FL']

In [80]:
print(group_SE_FT.shape)
print(group_SE_CT.shape)
print(group_SE_PT.shape)
print(group_SE_FL.shape)

(84452, 11)
(145, 11)
(56, 11)
(6, 11)


In [86]:
group_SE_FT['employee_residence'].value_counts()

employee_residence
US    77384
CA     3260
GB     1620
DE      197
AU      194
      ...  
CF        2
RU        1
OM        1
SA        1
MD        1
Name: count, Length: 66, dtype: int64

In [87]:
group_SE_CT['employee_residence'].value_counts()

employee_residence
US    114
IN      8
CA      5
FR      5
BR      2
LT      2
AU      2
MX      2
GB      2
EG      2
PL      1
Name: count, dtype: int64

In [96]:
group_SE_PT['employee_residence'].value_counts()

employee_residence
US    46
CA     6
GB     2
DE     1
AE     1
Name: count, dtype: int64

In [97]:
group_SE_FL['employee_residence'].value_counts()

employee_residence
NG    2
CZ    1
IN    1
UA    1
RU    1
Name: count, dtype: int64

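#### Interpretation
1. The US has at-will employment, which makes dismissal comparatively easy, and this likely explains the high share of full-time positions (the other employment types have too few rows to interpret).
2. Although the data is sparse, many of the other top-ranked countries are ones where full-time employment itself is legally protected.
3. The data appears to have been collected mainly from the US.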
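The share of each employment type per country can be read more directly from a normalized cross-tabulation than from separate `value_counts` calls. This is a sketch on made-up rows; only the column names follow `salaries.csv`.

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "employee_residence": ["US", "US", "US", "US", "CA", "CA", "IN", "IN"],
    "employment_type": ["FT", "FT", "FT", "CT", "FT", "PT", "CT", "FL"],
})

# Rows are countries, columns are employment types;
# normalize="index" makes each row sum to 1.
shares = pd.crosstab(
    sample["employee_residence"],
    sample["employment_type"],
    normalize="index",
)
print(shares.round(2))
```

Comparing rows of `shares` shows whether the full-time share really differs between countries, while the raw counts show how much data each row is based on.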

In [123]:
# Mean salary by job title for senior-level contract workers
pd.DataFrame(group_SE_CT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False).tail(5))


Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Data Developer,110860.0
Staff Data Scientist,105000.0
Analyst,85583.333333
Machine Learning Engineer,38000.0
Artificial Intelligence Engineer,30000.0


In [122]:
pd.DataFrame(group_SE_CT['job_title'] == 'Finance Data Analyst').value_counts()

job_title
False        144
True           1
Name: count, dtype: int64

In [128]:
pd.DataFrame(group_SE_CT['job_title'] == 'Machine Learning Engineer')

Unnamed: 0,job_title
276,False
277,False
2070,False
2071,False
2465,False
...,...
134220,False
141076,False
142419,False
145306,False
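The cell above displays the boolean mask itself. To count or inspect the matching rows, the mask can be summed or used as a row filter. The sketch below shows both on a small, hypothetical `sample` frame.

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "job_title": ["Machine Learning Engineer", "Analyst",
                  "Data Developer", "Analyst"],
    "salary_in_usd": [38000, 90000, 110860, 80000],
})

# Boolean mask: True where the job title matches.
mask = sample["job_title"] == "Machine Learning Engineer"

# True counts as 1, so the sum is the number of matching rows.
print(mask.sum())

# Indexing with the mask keeps only the matching rows.
print(sample[mask])
```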


In [110]:
# Mean salary by job title for senior-level full-time workers
pd.DataFrame(group_SE_FT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False).tail(5))

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Computer Vision Software Engineer,53984.0
Consultant Data Engineer,50000.0
Principal Data Architect,38154.0
AI Software Development Engineer,25210.0
AI Engineering Lead,23649.0


In [111]:
# Mean salary by job title for senior-level part-time workers
pd.DataFrame(group_SE_PT.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False).tail(5))

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
DevOps Engineer,73333.0
Solutions Architect,68530.0
Developer,57142.5
Data Specialist,49500.0
CRM Data Analyst,40000.0


In [112]:
# Mean salary by job title for senior-level freelancers
pd.DataFrame(group_SE_FL.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False).tail(5))

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Computer Vision Engineer,60000.0
Machine Learning Developer,60000.0
Machine Learning Researcher,50000.0
Software Data Engineer,50000.0
Manager Data Management,36014.0
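The per-employment-type tables above can also be placed side by side with a pivot table, so that the same job title is compared across employment types in one row. This is a sketch on made-up rows, not a result from `salaries.csv`.

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "employment_type": ["FT", "FT", "CT", "PT", "FT", "CT"],
    "job_title": ["Analyst", "Analyst", "Analyst", "Analyst",
                  "Data Developer", "Data Developer"],
    "salary_in_usd": [120000, 100000, 85000, 60000, 130000, 110000],
})

# Rows are job titles, columns are employment types, cells are mean salaries.
pivot = sample.pivot_table(
    index="job_title",
    columns="employment_type",
    values="salary_in_usd",
    aggfunc="mean",
)
print(pivot)
```

Cells with no rows for a given combination are `NaN`, which makes gaps in the data visible rather than hiding them.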


#### Interpretation
1. For entry-level employees, working full-time is, for most job titles, the most

In [66]:
group_SE_PT.head(5)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
1044,2025,SE,PT,Data Specialist,52000,USD,52000,US,0,US,M
1045,2025,SE,PT,Data Specialist,47000,USD,47000,US,0,US,M
1074,2025,SE,PT,Analyst,138000,USD,138000,US,0,US,M
1075,2025,SE,PT,Analyst,92000,USD,92000,US,0,US,M
7518,2024,SE,PT,DevOps Engineer,66000,EUR,73333,DE,100,DE,M


In [67]:
group_SE_FL.head(5)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
104563,2024,SE,FL,Backend Engineer,1380000,CZK,61333,CZ,0,CZ,M
130699,2024,SE,FL,Manager Data Management,3000000,INR,36014,IN,100,IN,S
142324,2021,SE,FL,Machine Learning Developer,60000,USD,60000,NG,100,NG,M
142966,2023,SE,FL,Machine Learning Researcher,50000,USD,50000,UA,50,UA,S
143497,2023,SE,FL,Software Data Engineer,50000,USD,50000,NG,50,AU,M
