##### 【 UCI Adult Income Dataset 】<hr>

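- Data overview
    * Features: a mix of numeric and categorical columns
    * Size: about 46,000 rows once rows with missing values are dropped (sampling is allowed)
    * File: adult.csv

- Modeling goal
    * Predict whether a person's annual income exceeds $50,000, based on demographic and occupational information

- Column descriptions
    * `class`: target label indicating whether annual income is above 50K (`>50K`) or not (`<=50K`)
    * `age`: age
    * `sex`: sex (Male / Female)
    * `education-num`: education level expressed as a number
    * `workclass`: type of employment (such as Private, Self-emp, Government)
    * `occupation`: kind of job
    * `hours-per-week`: hours worked per week
    * `capital-gain`: capital gains (stocks, investment income, and so on)
    * `capital-loss`: capital losses
    * `marital-status`: marital status
    * `relationship`: relationship within the household (such as Husband, Wife, Not-in-family)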
    




### Evaluation criteria
- What matters most is whether the full model training process is applied!


In [None]:
import pandas as pd 
import numpy as np

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline 

from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.metrics import classification_report

In [None]:
DATA_FILE = '../Data/adult.csv'

# Load only the columns used in this analysis (selected by position)
df = pd.read_csv(DATA_FILE, usecols=[1,2,5,6,7,8,10,11,12,13,15])
df.head(3)

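|   | age | workclass | education-num | marital-status | occupation | relationship | sex | capital-gain | capital-loss | hours-per-week | class |
|---|-----|-----------|---------------|----------------|------------|--------------|-----|--------------|--------------|----------------|-------|
| 0 | 25 | Private | 7 | Never-married | Machine-op-inspct | Own-child | Male | 0 | 0 | 40 | <=50K |
| 1 | 38 | Private | 9 | Married-civ-spouse | Farming-fishing | Husband | Male | 0 | 0 | 50 | <=50K |
| 2 | 28 | Local-gov | 12 | Married-civ-spouse | Protective-serv | Husband | Male | 0 | 0 | 40 | >50K |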


In [None]:
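# Handle missing values: drop every row that contains a NaN
df.isnull().sum().sum()  # NaN count before dropping (not displayed; only the last expression is)
df = df.dropna()
df.isnull().sum().sum()  # NaN count after dropping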

np.int64(0)
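
`dropna()` removes only rows that hold real `NaN` values. Some distributions of the Adult data mark unknown entries with a literal `?` string instead, which `isnull()` does not count. Whether the `adult.csv` used here contains such markers is not shown, so the following is only a sketch on a made-up three-row CSV, assuming the `?` convention. The `na_values` argument of `pd.read_csv` converts the marker to `NaN` at load time:

```python
import io
import pandas as pd

# Hypothetical CSV text; two cells use '?' as a missing-value marker.
csv_text = (
    "age,workclass,occupation\n"
    "25,Private,Sales\n"
    "38,?,Tech-support\n"
    "28,Local-gov,?\n"
)

# Without na_values the '?' cells are ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text))
n_missing_raw = int(raw.isnull().sum().sum())

# With na_values='?' the same cells are read as NaN.
marked = pd.read_csv(io.StringIO(csv_text), na_values='?')
n_missing_marked = int(marked.isnull().sum().sum())

# Now dropna() removes the incomplete rows.
cleaned = marked.dropna()

print(n_missing_raw, n_missing_marked, len(cleaned))
```

If the real file does use `?`, running `df['workclass'].value_counts()` before encoding would reveal it.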

In [None]:
# Convert the string-valued feature columns to integer codes.
# A fresh LabelEncoder is fitted for each column.
cat_cols = ['workclass', 'marital-status', 'occupation', 'relationship', 'sex']

for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
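
`LabelEncoder` is designed for target labels, and fitting it on the whole frame before the split lets the test rows take part in building the encoding. One alternative, shown here only as a sketch on made-up rows, is `OrdinalEncoder`: it encodes several feature columns at once, can be fitted on the training rows alone, and with `handle_unknown='use_encoded_value'` maps categories it did not see during fitting to a chosen code (`-1` below is an arbitrary choice).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test rows; 'Self-emp-inc' appears only in test.
train = pd.DataFrame({'workclass': ['Private', 'Local-gov', 'Private'],
                      'sex': ['Male', 'Female', 'Male']})
test = pd.DataFrame({'workclass': ['Self-emp-inc'], 'sex': ['Female']})

# Fit on the training rows only; unseen categories become -1.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = enc.fit_transform(train)
test_codes = enc.transform(test)

print(enc.categories_)
print(train_codes)
print(test_codes)
```

For a tree-based model such as a random forest the practical difference is likely small, but the encoder can then be reused unchanged on new data.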

In [51]:
featureDF = df[df.columns[0:-1]]
targetSR = df[df.columns[-1]]

print(f'{featureDF.shape}, {featureDF.ndim}D, {targetSR.shape}, {targetSR.ndim}D')

(46033, 10), 2D, (46033,), 1D


In [52]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR, test_size=0.2, random_state=42)
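
The test-set support counts reported at the end of this notebook (6862 versus 2345) show that the two classes are imbalanced, roughly three to one. With an imbalance like that, passing `stratify` to `train_test_split` keeps the class ratio identical in both parts. A minimal sketch on toy labels with the same approximate ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 75 samples of class 0 and 25 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 75 + [1] * 25)

# stratify=y preserves the 75/25 ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Adding `stratify=targetSR` to the call above would change which rows land in each part, so the scores reported in this notebook would not be reproduced exactly.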

In [53]:
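# Encode the target: fit on the training labels only,
# then apply the same mapping to the test labels.
lbEncoder = LabelEncoder()

en_y_train = lbEncoder.fit_transform(y_train)
en_y_test = lbEncoder.transform(y_test)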
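
`LabelEncoder` assigns integer codes in sorted order of the label strings, and the fitted `classes_` attribute records that order. A small sketch on three sample labels shows how the mapping can be read back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample labels in the same format as the `class` column.
y_sample = np.array(['<=50K', '>50K', '<=50K'])

enc = LabelEncoder()
codes = enc.fit_transform(y_sample)

# classes_[i] is the original label that was encoded as integer i.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
restored = enc.inverse_transform(codes)

print(mapping)
print(codes)
```

Because `<` sorts before `>`, `<=50K` becomes 0 and `>50K` becomes 1.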


In [54]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, en_y_train)

| Parameter | Value |
|-----------|-------|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [58]:
train_score = rf.score(X_train, en_y_train)
test_score = rf.score(X_test, en_y_test)

print(f'train score : {train_score}, test score : {test_score}')

train score : 0.9676044099277684, test score : 0.839795807537743
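
The gap between the train score (about 0.97) and the test score (about 0.84) suggests that the default, fully grown trees overfit. `GridSearchCV` and `Pipeline` are imported at the top of the notebook but not used. The following is a minimal sketch of how they could be combined to search over tree depth and leaf size. The synthetic data from `make_classification`, the step name `rf`, and the grid values are assumptions for illustration, not settings tuned for the Adult data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic binary data standing in for the encoded Adult features.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Single-step pipeline; preprocessing steps could be added before 'rf'.
pipe = Pipeline([
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Shallower trees and larger leaves limit how closely the forest fits the training rows.
param_grid = {
    'rf__max_depth': [3, 5, None],
    'rf__min_samples_leaf': [1, 5],
}

# 3-fold cross-validation on the training part, scored by F1 of class 1.
search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1')
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_score = search.score(X_te, y_te)
print(best_params, test_score)
```

On the real data, `search.fit(X_train, en_y_train)` would take the place of the synthetic call, and comparing `search.best_score_` with the test score gives a second check on overfitting.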


In [59]:
y_pred = rf.predict(X_test)

In [60]:
print(classification_report(en_y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.92      0.89      6862
           1       0.72      0.62      0.66      2345

    accuracy                           0.84      9207
   macro avg       0.80      0.77      0.78      9207
weighted avg       0.83      0.84      0.84      9207
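
The report scores each class separately, and the recall of 0.62 for class 1 (`>50K`) is the weakest number in it. The `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` functions imported at the top return these values one at a time; for binary labels the last three score label 1 by default. A small sketch on made-up labels, with the counts worked out in the comments:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 stands for '>50K').
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

# Counts for class 1: TP = 2, FP = 1, FN = 2, TN = 3.
acc = accuracy_score(y_true, y_hat)     # (TP + TN) / all = 5 / 8
prec = precision_score(y_true, y_hat)   # TP / (TP + FP) = 2 / 3
rec = recall_score(y_true, y_hat)       # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_hat)            # harmonic mean of prec and rec = 4 / 7

print(acc, prec, rec, f1)
```

Passing `pos_label=0` to the last three functions would score the `<=50K` class instead.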

