# "[ML] Decision Tree"
> "Preprocessing and modeling for salary forecasting"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [Decision Tree]
- author: 도형준

![header](https://capsule-render.vercel.app/api?type=waving&color=auto&height=200&section=header&text=Decision%20Tree&fontSize=50&animation=fadeIn&fontAlignY=30&desc=2022/11/10&descAlignY=51&descAlign=62)

In [1]:
# https://www.kaggle.com/datasets/ayessa/salary-prediction-classification
# https://raw.githubusercontent.com/bigdata-young/bigdata_16th/main/data/salary.csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the data

In [2]:
file_url = 'https://raw.githubusercontent.com/bigdata-young/bigdata_16th/main/data/salary.csv'
df = pd.read_csv(file_url, skipinitialspace=True)

In [3]:
# df.head()
df.tail()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
48841,52,Self-emp-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      46033 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  47985 non-null  object
 13  class           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB


In [5]:
df.isnull().sum()

age                  0
workclass         2799
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
class                0
dtype: int64

In [6]:
# [Continuous variables]
# age : age
# education-num : years of education
# capital-gain : capital gains
# capital-loss : capital losses
# hours-per-week : hours worked per week
df.describe()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,10.078089,1079.067626,87.502314,40.422382
std,13.71051,2.570973,7452.019058,403.004552,12.391444
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,48.0,12.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


In [7]:
# [Categorical variables]
# workclass : type of employment
# education : education level
# marital-status : marital status
# occupation : occupation
# relationship : family relationship
# race : race
# sex : sex
# native-country : country of origin
# class : salary bracket
df.describe(include=['O'])

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,class
count,46043,48842,48842,46033,48842,48842,48842,47985,48842
unique,8,16,7,14,6,5,2,41,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,33906,15784,22379,6172,19716,41762,32650,43832,37155


In [8]:
df.describe(include='all')

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
count,48842.0,46043,48842,48842.0,48842,46033,48842,48842,48842,48842.0,48842.0,48842.0,47985,48842
unique,,8,16,,7,14,6,5,2,,,,41,2
top,,Private,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,15784,,22379,6172,19716,41762,32650,,,,43832,37155
mean,38.643585,,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,,12.0,,,,,,0.0,0.0,45.0,,


In [9]:
import plotly.express as px
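# numeric_only=True restricts the correlation to numeric columns; recent pandas raises on object columns otherwise
fig = px.imshow(df.corr(numeric_only=True), text_auto=True, color_continuous_scale='RdBu_r', title='Correlation heatmap', aspect='auto')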
fig.show()

![header](https://capsule-render.vercel.app/api?type=waving&color=auto&height=200&section=header&text=전처리&fontSize=50&animation=fadeIn&fontAlignY=30&desc=&descAlignY=51&descAlign=62)

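- Categorical variables (drop, convert to a continuous variable, dummy-encode, or replace?)<br>
- Missing values
- -> Scaling and outliers: no need to worry about them, because the model is a decision tree
<br>
2-1. education: duplicates education-num => drop education<br>
2-2. occupation: already grouped and judged to be important => dummies<br>
2-3. native-country: countries of origin, but since the model is a decision tree anyway...<br>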
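
The note above says scaling can be skipped for a decision tree. A minimal sketch of why, on synthetic data: the tree splits on thresholds, so only the ordering of values within each feature matters. The feature names, the toy target, and the `rescale` helper are all assumptions made for illustration; `rescale` stands in for a real scaler.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_features(n):
    # Hypothetical integer features on very different scales
    age = rng.integers(17, 90, size=n)
    gain = rng.integers(0, 100_000, size=n)
    return np.column_stack([age, gain]).astype(float)

X_train, X_new = make_features(300), make_features(100)
# Toy target that depends on both features
y_train = (X_train[:, 0] + X_train[:, 1] / 1000 > 100).astype(int)

def rescale(X):
    # Stand-in for a scaler: an increasing affine map applied to every feature
    return X * 10 + 5

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=100).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=100).fit(rescale(X_train), y_train)

# Compare predictions on unseen points, rescaled the same way
same_predictions = np.array_equal(
    tree_raw.predict(X_new), tree_scaled.predict(rescale(X_new))
)
print(same_predictions)
```

Because the rescaling preserves the order of values, both trees should partition the data identically, which is why this notebook goes straight to encoding and missing values.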

# Preprocessing

In [10]:
df['class']

0        <=50K
1        <=50K
2         >50K
3         >50K
4        <=50K
         ...  
48837    <=50K
48838     >50K
48839    <=50K
48840    <=50K
48841     >50K
Name: class, Length: 48842, dtype: object

In [11]:
df['class'].value_counts()

<=50K    37155
>50K     11687
Name: class, dtype: int64

In [12]:
df['class'] = df['class'].map({'<=50K': 0, '>50K': 1})
df['class']

0        0
1        0
2        1
3        1
4        0
        ..
48837    0
48838    1
48839    0
48840    0
48841    1
Name: class, Length: 48842, dtype: int64

In [13]:
df['age'].dtype

dtype('int64')

In [14]:
obj_list = []
for c in df.columns:
  if df[c].dtype == 'object':
    #print(c,df[c].dtype)
    obj_list.append(c)
obj_list

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']

In [15]:
# List Comprehension
obj_list2 = [c for c in df.columns if df[c].dtype=='object']
obj_list2

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
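
As a possible alternative to the loop and the list comprehension above, pandas can select columns by dtype directly. This is a small sketch on a toy frame; the frame and its values are made up, and the columns are forced to `object` dtype to mirror what `df.info()` reports in this notebook.

```python
import pandas as pd

# Toy frame: one numeric column and two object (string) columns
toy = pd.DataFrame({
    "age": [27, 40],
    "workclass": pd.Series(["Private", "Self-emp-inc"], dtype=object),
    "sex": pd.Series(["Female", "Male"], dtype=object),
})

# Same idea as the list comprehension in the notebook
via_loop = [c for c in toy.columns if toy[c].dtype == "object"]

# dtype-based selection
via_select = toy.select_dtypes(include="object").columns.tolist()

print(via_loop, via_select)
```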

In [16]:
for o in obj_list:
    print(o, df[o].nunique())

workclass 8
education 16
marital-status 7
occupation 14
relationship 6
race 5
sex 2
native-country 41


In [17]:
# Judged that education has too many categories to group on directly
for o in obj_list:
    if df[o].nunique() > 10:
        print(o, df[o].nunique())

education 16
occupation 14
native-country 41


In [18]:
df['education'].value_counts()

HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64

In [19]:
# Check 'education-num', a column similar to 'education'
df['education-num']

0         7
1         9
2        12
3        10
4        10
         ..
48837    12
48838     9
48839     9
48840     9
48841     9
Name: education-num, Length: 48842, dtype: int64

In [20]:
df[df['education-num']==1]['education']

779      Preschool
818      Preschool
1029     Preschool
1059     Preschool
1489     Preschool
           ...    
48079    Preschool
48316    Preschool
48505    Preschool
48640    Preschool
48713    Preschool
Name: education, Length: 83, dtype: object

In [21]:
for n in range(1,17): #1~16
  #  print(df[df['education-num']==n]['education'])
  print(f"**{n}**", df[df['education-num'] == n]['education'].unique())
# The two columns match one-to-one.

**1** ['Preschool']
**2** ['1st-4th']
**3** ['5th-6th']
**4** ['7th-8th']
**5** ['9th']
**6** ['10th']
**7** ['11th']
**8** ['12th']
**9** ['HS-grad']
**10** ['Some-college']
**11** ['Assoc-voc']
**12** ['Assoc-acdm']
**13** ['Bachelors']
**14** ['Masters']
**15** ['Prof-school']
**16** ['Doctorate']
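
The loop above checks the mapping by eye. A compact way to assert the same one-to-one relationship is sketched below on a toy frame with made-up rows. The variable names are illustrative; on the real data the same two `groupby` calls could be run against `df` before `education` is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Masters", "Bachelors"],
    "education-num": [9, 9, 13, 14, 13],
})

# Each number should map to exactly one label, and each label to exactly one number
labels_per_num = toy.groupby("education-num")["education"].nunique()
nums_per_label = toy.groupby("education")["education-num"].nunique()

one_to_one = bool((labels_per_num == 1).all() and (nums_per_label == 1).all())
print(one_to_one)
```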


In [22]:
df.drop('education',axis=1,inplace=True)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   education-num   48842 non-null  int64 
 3   marital-status  48842 non-null  object
 4   occupation      46033 non-null  object
 5   relationship    48842 non-null  object
 6   race            48842 non-null  object
 7   sex             48842 non-null  object
 8   capital-gain    48842 non-null  int64 
 9   capital-loss    48842 non-null  int64 
 10  hours-per-week  48842 non-null  int64 
 11  native-country  47985 non-null  object
 12  class           48842 non-null  int64 
dtypes: int64(6), object(7)
memory usage: 4.8+ MB


In [24]:
# The data is already grouped
# Occupation is expected to have a large effect on salary
df['occupation'].value_counts()

Prof-specialty       6172
Craft-repair         6112
Exec-managerial      6086
Adm-clerical         5611
Sales                5504
Other-service        4923
Machine-op-inspct    3022
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64

In [25]:
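df['native-country'].value_counts() # The United States dominates. Dummy-encode the rest?
# 1. US vs. others: 0/1
# 2. By continent or language region => define an extra grouping, then dummy-encode it
# 3. Label-encode everything as 1, 2, 3, 4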

United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                     
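
Option 1 from the comments above (US vs. others as 0/1) could look like the sketch below. The column name `is_us` is hypothetical, and the notebook does not end up using this encoding. Note that a missing country is treated as "not US" here, which is an assumption rather than something the source decides.

```python
import pandas as pd

toy = pd.DataFrame({
    "native-country": ["United-States", "Mexico", None, "United-States"],
})

# isin returns False for missing values, so they fall into the 0 bucket
toy["is_us"] = toy["native-country"].isin(["United-States"]).astype(int)
print(toy)
```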

In [26]:
# groupby(column to group by)[target column].aggregation() -> mean()
# df.groupby('native-country')['class'].mean()
df.groupby('native-country')['class'].mean().sort_values(ascending=False)

native-country
France                        0.421053
India                         0.410596
Taiwan                        0.400000
Iran                          0.372881
England                       0.370079
Greece                        0.367347
Yugoslavia                    0.347826
Japan                         0.347826
Canada                        0.346154
Italy                         0.323810
Cambodia                      0.321429
Hungary                       0.315789
Ireland                       0.297297
China                         0.295082
Philippines                   0.288136
Germany                       0.281553
Hong                          0.266667
Cuba                          0.246377
United-States                 0.243977
Poland                        0.195402
Portugal                      0.179104
South                         0.173913
Thailand                      0.166667
Scotland                      0.142857
Jamaica                       0.141509
Ecuador   

In [27]:
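# Goal: merge country_group (the share of high earners per country) back into the original DataFrame
# per country -> country name -> index ->
# Select 'class' before mean() so the non-numeric columns are never averaged
country_group = df.groupby('native-country')['class'].mean()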
country_group

native-country
Cambodia                      0.321429
Canada                        0.346154
China                         0.295082
Columbia                      0.047059
Cuba                          0.246377
Dominican-Republic            0.048544
Ecuador                       0.133333
El-Salvador                   0.070968
England                       0.370079
France                        0.421053
Germany                       0.281553
Greece                        0.367347
Guatemala                     0.034091
Haiti                         0.120000
Holand-Netherlands            0.000000
Honduras                      0.100000
Hong                          0.266667
Hungary                       0.315789
India                         0.410596
Iran                          0.372881
Ireland                       0.297297
Italy                         0.323810
Jamaica                       0.141509
Japan                         0.347826
Laos                          0.086957
Mexico    

In [28]:
country_group = country_group.reset_index() # merge needs a key column, so move the index into a column and get a DataFrame
country_group

Unnamed: 0,native-country,class
0,Cambodia,0.321429
1,Canada,0.346154
2,China,0.295082
3,Columbia,0.047059
4,Cuba,0.246377
5,Dominican-Republic,0.048544
6,Ecuador,0.133333
7,El-Salvador,0.070968
8,England,0.370079
9,France,0.421053


In [29]:
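# A.merge(B): A is the left frame, B is the right frame
# By default merge joins on the columns both frames share (DataFrame.join is the index-based one); here the key is given with on=
# native-country contains missing values...

# The missing values are on the A (left) side
# Keep every row of A even when B has no match => how='left'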

df2 = df.merge(country_group,on='native-country',how='left')

In [30]:
df2.tail()

Unnamed: 0,age,workclass,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class_x,class_y
48837,27,Private,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0,0.243977
48838,40,Private,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1,0.243977
48839,58,Private,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0,0.243977
48840,22,Private,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0,0.243977
48841,52,Self-emp-inc,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,1,0.243977


In [31]:
df2.isnull().sum()

age                  0
workclass         2799
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
class_x              0
class_y            857
dtype: int64
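
The merge above produces `class_x` and `class_y` because both frames carry a column named `class`. One way to avoid the suffixes is to rename the aggregated column before merging, sketched here on a toy frame. The name `country_class_mean` is borrowed from the commented-out line later in the notebook, and the rows are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "native-country": ["United-States", "United-States", "Mexico", np.nan],
    "class": [1, 0, 0, 1],
})

# Per-country share of class == 1, renamed so it cannot clash with 'class'
country_rate = (
    toy.groupby("native-country")["class"].mean()
    .rename("country_class_mean")
    .reset_index()
)

# how='left' keeps every row of toy, including the one with a missing country
merged = toy.merge(country_rate, on="native-country", how="left")
print(merged)
```

The row with a missing country survives the left merge with a missing `country_class_mean`, which matches the 857 nulls in `class_y` above.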

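# Progress so far
- Categorical variables (drop, convert to a continuous variable, dummy-encode, or replace?)<br>
- Missing values
- -> Scaling and outliers: no need to worry about them, because the model is a decision tree
<br>
2-1. education: duplicates education-num => drop education<br>
2-2. occupation: already grouped and judged to be important => dummies<br>
2-3. native-country: countries of origin, but since the model is a decision tree anyway...<br>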

In [32]:
# df['country_class_mean'] = df.groupby('native-country')['class'].transform('mean')
# class x, class y
#df.drop('native-country', axis=1, inplace=True)
#df
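
The commented-out `transform('mean')` line above hints at a shorter route than groupby, reset_index, and merge. A minimal sketch on a toy frame with made-up rows: `transform` returns one value per original row, so the result can be assigned straight back as a new column.

```python
import pandas as pd

toy = pd.DataFrame({
    "native-country": ["United-States", "United-States", "Mexico", "Mexico", "India"],
    "class": [1, 0, 0, 0, 1],
})

# One group mean per row, aligned with the original index
toy["country_class_mean"] = toy.groupby("native-country")["class"].transform("mean")
print(toy)
```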

![header](https://capsule-render.vercel.app/api?type=waving&color=auto&height=200&section=header&text=결측치%20처리%20/%20더미%20변수화&fontSize=50&animation=fadeIn&fontAlignY=30&desc=&descAlignY=51&descAlign=62)


# Missing-value handling / dummy encoding

In [33]:
df.isna().mean()

age               0.000000
workclass         0.057307
education-num     0.000000
marital-status    0.000000
occupation        0.057512
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    0.017546
class             0.000000
dtype: float64

In [34]:
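#@title Fill with an arbitrary value
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
df['native-country'] = df['native-country'].fillna(-99)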
df.isna().mean()

age               0.000000
workclass         0.057307
education-num     0.000000
marital-status    0.000000
occupation        0.057512
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    0.000000
class             0.000000
dtype: float64

In [36]:
# 'Private' is the most frequent workclass value
df['workclass'] = df['workclass'].fillna('Private')
df.isna().mean()

age               0.000000
workclass         0.000000
education-num     0.000000
marital-status    0.000000
occupation        0.057512
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    0.000000
class             0.000000
dtype: float64

In [37]:
df['occupation'] = df['occupation'].fillna('Unknown')
df.isna().mean()

age               0.0
workclass         0.0
education-num     0.0
marital-status    0.0
occupation        0.0
relationship      0.0
race              0.0
sex               0.0
capital-gain      0.0
capital-loss      0.0
hours-per-week    0.0
native-country    0.0
class             0.0
dtype: float64
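
The three fills above can also be written as a single call by passing a dict to `DataFrame.fillna`. This sketch uses a toy frame with made-up rows but the same fill values as the notebook. The frame is forced to `object` dtype so the integer sentinel `-99` can sit in a string column, mirroring the original data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Local-gov"],
    "occupation": [np.nan, "Sales", "Tech-support"],
    "native-country": ["United-States", "Mexico", np.nan],
}, dtype=object)

# Column name -> replacement for missing values (same choices as the cells above)
fill_values = {"workclass": "Private", "occupation": "Unknown", "native-country": -99}
filled = toy.fillna(fill_values)

print(filled.isna().sum())
```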

In [38]:
# Categorical columns still remain
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   education-num   48842 non-null  int64 
 3   marital-status  48842 non-null  object
 4   occupation      48842 non-null  object
 5   relationship    48842 non-null  object
 6   race            48842 non-null  object
 7   sex             48842 non-null  object
 8   capital-gain    48842 non-null  int64 
 9   capital-loss    48842 non-null  int64 
 10  hours-per-week  48842 non-null  int64 
 11  native-country  48842 non-null  object
 12  class           48842 non-null  int64 
dtypes: int64(6), object(7)
memory usage: 4.8+ MB


In [39]:
df2 = pd.get_dummies(df, drop_first=True)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 84 columns):
 #   Column                                     Non-Null Count  Dtype
---  ------                                     --------------  -----
 0   age                                        48842 non-null  int64
 1   education-num                              48842 non-null  int64
 2   capital-gain                               48842 non-null  int64
 3   capital-loss                               48842 non-null  int64
 4   hours-per-week                             48842 non-null  int64
 5   class                                      48842 non-null  int64
 6   workclass_Local-gov                        48842 non-null  uint8
 7   workclass_Never-worked                     48842 non-null  uint8
 8   workclass_Private                          48842 non-null  uint8
 9   workclass_Self-emp-inc                     48842 non-null  uint8
 10  workclass_Self-emp-not-inc                 488
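
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame with made-up rows. For each categorical column, the first category in sorted order is dropped, since it is implied when all the other dummies are 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [27, 40, 58],
    "sex": ["Female", "Male", "Female"],
    "race": ["White", "Black", "Asian-Pac-Islander"],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each column dropped

print(list(full.columns))
print(list(reduced.columns))
```

Numeric columns such as `age` pass through unchanged, which is why the numeric columns and `class` appear first in the `df2.info()` output above.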

![header](https://capsule-render.vercel.app/api?type=waving&color=auto&height=200&section=header&text=모델링/평가&fontSize=50&animation=fadeIn&fontAlignY=30&desc=&descAlignY=51&descAlign=62)


# Modeling / evaluation

In [42]:
from sklearn.model_selection import train_test_split
X = df2.drop('class', axis=1)
y = df2['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=100
)
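
The split above is purely random. Since the classes are imbalanced (37155 vs. 11687 in the `value_counts()` output), one option the notebook does not use is `stratify=y`, which keeps the class ratio the same in both splits. The sketch below uses a synthetic 75/25 target, which is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 75 zeros, 25 ones
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=100, stratify=y_toy
)

# Share of the positive class in each split
print(y_tr.mean(), y_te.mean())
```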

In [41]:
from sklearn.tree import DecisionTreeClassifier

In [52]:
model = DecisionTreeClassifier(random_state=100)
model.fit(X_train,y_train)
pred = model.predict(X_test)

In [53]:
pred

array([1, 0, 0, ..., 1, 0, 1])

In [54]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,pred)

0.8163996519424681
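
To put the accuracy above in context, it helps to compare it with a majority-class baseline, that is, always predicting `<=50K`. The sketch below uses the class counts from the `value_counts()` output earlier in the notebook. The baseline is computed on the full dataset, so the exact figure on the test split may differ slightly.

```python
# Class counts reported by df['class'].value_counts()
n_low, n_high = 37155, 11687

# Accuracy of always predicting the majority class (<=50K)
baseline_accuracy = n_low / (n_low + n_high)

# Accuracy reported for the decision tree above
model_accuracy = 0.8163996519424681

lift = model_accuracy - baseline_accuracy
print(round(baseline_accuracy, 4), round(lift, 4))
```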