# Predicting the income range with financial data

## Introduction to Financial Data and Overview of Predictive Models
Predicting a customer's income range is one of the central problems in financial data analysis.
Before starting the analysis, two points are worth noting.

### <b> Properties of financial data</b>
Financial data mainly has the following characteristics:
- 1) <b>Combination of heterogeneous data</b>: Data differ in source, format, scale, and other characteristics
- 2) <b>Skewness of distributions</b>: When a distribution is skewed, predictions can fall far from the true values, and the trained model can end up heavily biased.
- 3) <b>Ambiguity of classification labels</b>: Income brackets, credit ratings, product types, and similar labels embed business logic, so the class boundaries are somewhat arbitrary → the analyst's interpretation is important
- 4) <b>Multicollinearity of variables</b>: Variables may be strongly interdependent or correlated
- 5) <b>Nonlinearity of variables</b>: The effect of a variable may not be linear, e.g., what is the effect of age on income?
- Data may be incomplete (missing, truncated, censored) due to other practical limitations such as regulation, collection, and storage

### <b>Multi-classification and prediction of income brackets</b>
When there are three or more classes (also called labels or levels) to predict, the task is called a multiclass classification problem; when a regression-style method is used, it is called multinomial logistic regression. The classes are assumed to sit at the same level, with no hierarchical (inclusion) relationship among them.

Predicting income brackets is a classic multiclass classification problem. Before analyzing, consider the following:
- 1) <b>When the boundaries between classes are not clear</b>: How should the income brackets be defined, and how many classes should there be?
- 2) <b>When the classes are ordered</b>: Strictly speaking, income levels should be treated as ordinal classes.
- 3) <b>When a specific class has too few samples</b>: How do you handle the gap between the number of high-income customers and the number of middle-income customers?
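
One common response to the class-size gap raised in the last point is to weight classes inversely to their frequency. The sketch below is only an illustration on made-up labels (the `labels` array is hypothetical); it computes the `n_samples / (n_classes * count)` heuristic that scikit-learn applies when a classifier is given `class_weight='balanced'`.

```python
import numpy as np

# Hypothetical bracket labels: class 2 (high income) is rare
labels = np.array([0] * 50 + [1] * 40 + [2] * 10)

classes, counts = np.unique(labels, return_counts=True)

# "Balanced" weighting: n_samples / (n_classes * count)
weights = len(labels) / (len(classes) * counts)

# Rare classes receive larger weights
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Whether reweighting, resampling, or merging sparse brackets works best depends on the data, so treat this as one option to evaluate rather than a default.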

The multiclass classification problem has the following additional considerations compared to the binary classification problem.
- 1) <b>Cautions when implementing the model</b>: One-hot-encoding of variables, determination of objective function, etc.
- 2) <b>Cautions when interpreting results</b>: Accuracy, F1 score, Confusion Matrix, etc.

-----

## Load data to predict

### Introduction to data
 
- This topic uses the US Adult Income dataset, collected by the US Census Bureau and distributed by UCI, together with a version in which the instructor has added and modified simulated variables.
- The first dataset used is the US Adult Income dataset; its columns are as follows.
 
 
- `age`: age
- `workclass`: class of worker (employment type)
- `education`: education level
- `education.num`: education level (numerically coded)
- `marital.status`: marital status
- `occupation`: occupation
- `relationship`: family relationship
- `race`: race
- `sex`: sex
- `capital.gain`: capital gains
- `capital.loss`: capital losses
- `hours.per.week`: working hours per week
- `income`: income bracket
 
Data from: https://archive.ics.uci.edu/ml/datasets/adult

--------------

### Import data

In [4]:
import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [5]:
datapath = 'https://github.com/mchoimis/financialML/raw/main/income/'
df = pd.read_csv(datapath + 'income.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


-----

### Data preview

In [6]:
print(df.shape)
print(df.columns)

(32561, 15)
Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


-----------------------

### Check Data

In [8]:
# Replace missing values (coded as '?') with NaN
df[df =='?'] = np.nan

In [9]:
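# Fill missing values with the mode of each column
for col in ['workclass', 'occupation', 'native.country']:
    # Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
    df[col] = df[col].fillna(df[col].mode()[0])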

In [16]:
# result
df.head()
 

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,Private,77053,HS-grad,9,Widowed,Prof-specialty,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,Private,186061,Some-college,10,Widowed,Prof-specialty,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [17]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64
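
The two cleaning steps above can be tried on a small frame first. This is a minimal sketch on made-up values (the `toy` DataFrame is hypothetical, not part of `income.csv`); it uses `DataFrame.replace` to turn the `'?'` placeholder into NaN and then fills each column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; '?' marks a missing value, as in income.csv
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})

# Step 1: turn the '?' placeholder into a real missing value
toy = toy.replace("?", np.nan)
n_missing = int(toy.isnull().sum().sum())

# Step 2: fill each column with its most frequent value (mode)
for col in toy.columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(n_missing)                      # cells that were missing before imputation
print(int(toy.isnull().sum().sum()))  # cells still missing afterwards
```

Mode imputation is simple but pushes every missing row into the majority category, which can matter for columns such as `occupation` where missingness may itself carry information.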

-----

## Feature Engineering

### Creating input features and target values

In [19]:
X = df.drop(['income','education','fnlwgt'], axis =1)
y = df['income']

In [21]:
X.head()

Unnamed: 0,age,workclass,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,90,Private,9,Widowed,Prof-specialty,Not-in-family,White,Female,0,4356,40,United-States
1,82,Private,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States
2,66,Private,10,Widowed,Prof-specialty,Unmarried,Black,Female,0,4356,40,United-States
3,54,Private,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States
4,41,Private,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States


-----------

In [22]:
y.head()

0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: object

------

### Divide the raw data into training set and test set

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
X_train.head()

Unnamed: 0,age,workclass,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
32098,40,State-gov,13,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,20,United-States
25206,39,Local-gov,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,38,United-States
23491,42,Private,10,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,40,United-States
12367,27,Local-gov,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,40,United-States
7054,38,Federal-gov,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States


----

### Handling categorical variables

In [25]:
from sklearn.preprocessing import LabelEncoder

categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
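for feature in categorical:
    le = LabelEncoder()
    # Learn the category-to-integer mapping on the training set only
    X_train[feature] = le.fit_transform(X_train[feature])
    # Reuse the same mapping on the test set; transform() raises an error for categories not seen in training
    X_test[feature] = le.transform(X_test[feature])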

### Check the result of categorical variable processing

In [26]:
# Check the transformed categorical variable column (X_train)
X_train[categorical].head(3)


Unnamed: 0,workclass,marital.status,occupation,relationship,race,sex,native.country
32098,6,2,3,5,4,0,38
25206,1,2,6,0,4,1,38
23491,3,4,3,1,4,0,38


In [28]:
# Checking the converted categorical variable column (X_test)

X_test[categorical].head(3)

Unnamed: 0,workclass,marital.status,occupation,relationship,race,sex,native.country
22278,3,6,11,4,4,0,38
8950,3,4,5,3,4,0,38
7838,3,4,7,1,1,0,39


In [29]:
X_train[categorical].nunique()

workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    41
dtype: int64

In [30]:
X_test[categorical].nunique()

workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    40
dtype: int64
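
The counts above differ for `native.country` (41 categories in training, 40 in test). Reusing a training-set mapping only works while every test category also occurs in training. The sketch below shows one way to guard against the opposite case, swapping in scikit-learn's `OrdinalEncoder` with `handle_unknown='use_encoded_value'`. This is not the notebook's method, and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy split: 'Peru' appears only in the test set
train = pd.DataFrame({"native.country": ["United-States", "Mexico", "United-States"]})
test = pd.DataFrame({"native.country": ["Mexico", "Peru"]})

# Categories unseen during fit are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train)   # learn the mapping on the training set only
test_codes = enc.transform(test)         # reuse it on the test set

print(enc.categories_[0])    # categories learned from the training set
print(test_codes.ravel())    # 'Peru' is encoded as -1
```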

----------

### Note: Handling of categorical variables
There are roughly two ways to handle categorical variables.

- Convert each class to a number (label encoding)
- One-hot encoding (dummy encoding)

In financial data, categorical variables often make up most of the columns, so after one-hot encoding the majority of the dataset may consist of zeros. A high-dimensional dataset dominated by such uninformative zeros is said to have sparse features, and learning on it may be inefficient.
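
The difference between the two approaches, and the sparsity that one-hot encoding introduces, can be seen on a single column. The sketch below uses a hypothetical four-row `toy` frame and plain pandas; it is an illustration, not the encoding used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Local-gov"]})

# Option 1: convert each class to an integer code (categories are sorted alphabetically)
codes = toy["workclass"].astype("category").cat.codes

# Option 2: one-hot encoding, one 0/1 column per category
onehot = pd.get_dummies(toy["workclass"], dtype=int)

# Share of zero cells in the one-hot matrix; it grows with the number of categories
zero_share = float((onehot == 0).to_numpy().mean())

print(codes.tolist())
print(onehot.shape, round(zero_share, 3))
```

With k categories, each row of the one-hot block holds a single 1 and k - 1 zeros, so a column such as `native.country` with about 40 categories would be more than 97% zeros.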

### Scaling Features

In [31]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()   
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns) 
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

In [32]:
# Check the X_train data before scaling (label-encoded, not yet scaled)
X_train.head()
 

Unnamed: 0,age,workclass,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
32098,40,6,13,2,3,5,4,0,0,0,20,38
25206,39,1,9,2,6,0,4,1,0,0,38,38
23491,42,3,10,4,3,1,4,0,0,0,40,38
12367,27,1,9,4,4,3,4,1,0,0,40,38
7054,38,0,14,2,3,0,4,1,0,0,40,38


In [39]:
print(min(X_train['age']))
print(max(X_train['age']))
print(np.mean(X_train['age']))
print(np.var(X_train['age']))
print('\n')
print(min(X_test['age']))
print(max(X_test['age']))
print(np.mean(X_test['age']))
print(np.var(X_test['age']))

17
90
38.61429448929449
186.44402697680712


17
90
38.505476507319074
185.14136114309127


In [None]:
print(min(X_train_scaled['age']))
print(max(X_train_scaled['age']))
print(np.mean(X_train_scaled['age']))
print(np.var(X_train_scaled['age']))
print('\n')
print(min(X_test_scaled['age']))
print(max(X_test_scaled['age']))
print(np.mean(X_test_scaled['age']))
print(np.var(X_test_scaled['age']))

### Note: feature scaler provided by scikit-learn

- `StandardScaler`: the default choice; transforms each feature to have mean 0 and standard deviation 1
- `RobustScaler`: similar to the above, but uses the median and interquartile range instead of the mean and standard deviation, which reduces the influence of outliers
- `MinMaxScaler`: scales each feature so that its maximum and minimum are 1 and 0, respectively
- `Normalizer`: normalizes each row rather than each feature (column), rescaling it so that its Euclidean length is 1
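
The four scalers listed above can be compared side by side on a small matrix. The values in `X_toy` are made up for illustration (one age-like column and one capital-gain-like column with an outlier), and all scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# Hypothetical toy matrix: column 0 resembles age, column 1 resembles capital gain with an outlier
X_toy = np.array([[20.0, 0.0], [30.0, 0.0], [40.0, 100.0], [50.0, 99999.0]])

standard = StandardScaler().fit_transform(X_toy)   # per column: mean 0, standard deviation 1
robust = RobustScaler().fit_transform(X_toy)       # per column: (x - median) / interquartile range
minmax = MinMaxScaler().fit_transform(X_toy)       # per column: minimum 0, maximum 1
unit_rows = Normalizer().fit_transform(X_toy)      # per row: Euclidean length 1

print(standard.mean(axis=0).round(6), standard.std(axis=0).round(6))
print(np.median(robust, axis=0))
print(minmax.min(axis=0), minmax.max(axis=0))
print(np.linalg.norm(unit_rows, axis=1))
```

Note that `Normalizer` works across columns within a row, so it mixes features with different units; it is rarely a substitute for the column-wise scalers on tabular data like this.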

<p> Scaling matters because training may not work properly when feature values are too large or too small. For classifiers that depend directly on scale (e.g., distance-based algorithms such as k-NN), scaling is essential.
    
<p> On the other hand, for some features it may be better to keep the original distribution. For example, if a feature whose values are concentrated almost at a single point is standardized to match the other distributions, small changes may be learned as large differences. You can also skip scaling when you use a classifier that is not strongly affected by scale (e.g., a tree-based ensemble algorithm), when performance is already acceptable, or when overfitting is less of a concern.
    
<p> One thing to keep in mind is that scaling can strip the original data of its meaning. When the goal is not just a correct prediction but interpreting the model or applying it to other datasets later, losing the explanatory power of the original features can make the model hard to improve. Take this into account as well.

## Step 3. Implementing a linear classification model

### Problem 10. Run a Logistic Regression model on the original data

In [None]:
# Original data before feature scaling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg =   
logreg.fit(  )

### Problem 11. Check the accuracy of the Logistic Regression model on the original data

In [None]:
y_pred =   
logreg_score =   

print('Logistic Regression accuracy score: {0:0.4f}'. format(  ))

### Problem 12. Run a Logistic Regression model on the scaled data

In [None]:
# Transformed data after feature scaling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg =  
logreg.fit(  ) ##

### Problem 13. Check the accuracy of the Logistic Regression model on the scaled data

In [None]:
y_pred = logreg.predict( )  ##
logreg_score =   
print('Logistic Regression (scaled data) accuracy score: {0:0.4f}'. format(  ))

In [None]:
# Check the predicted values



### Problem 14. Check the classification results of the Logistic Regression model on the scaled data

In [None]:
from sklearn.metrics import classification_report

cm_logreg =   
print(cm_logreg)
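
The logistic regression exercises in this step follow a fit, predict, score pattern. The sketch below walks through that pattern on synthetic data from `make_classification`, not on the income data. The variable names (`X_syn`, `logreg_raw`, `logreg_scaled`, and so on) and the `max_iter=1000` setting are assumptions made for illustration, so treat it as one possible shape of a solution rather than the expected answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the income data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=6, n_informative=4, random_state=0)
X_syn[:, 0] *= 1000.0   # give one feature a much larger scale than the others

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

# Model on the unscaled data
logreg_raw = LogisticRegression(max_iter=1000)
logreg_raw.fit(Xs_train, ys_train)
raw_score = accuracy_score(ys_test, logreg_raw.predict(Xs_test))

# Model on standardized data: fit the scaler on the training set only
scaler_syn = StandardScaler()
logreg_scaled = LogisticRegression(max_iter=1000)
logreg_scaled.fit(scaler_syn.fit_transform(Xs_train), ys_train)
ys_pred = logreg_scaled.predict(scaler_syn.transform(Xs_test))
scaled_score = accuracy_score(ys_test, ys_pred)

print('raw: {0:0.4f}  scaled: {1:0.4f}'.format(raw_score, scaled_score))
print(classification_report(ys_test, ys_pred))
```

On the notebook's data the same pattern would use `X_train` / `X_test` for the unscaled model and `X_train_scaled` / `X_test_scaled` for the scaled one.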

## Step 4. Implementing tree-based classification models

### Problem 15. Implement a Random Forest model and check its accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rfc =  
rfc.fit(  )

In [None]:
y_pred = rfc.predict( ) 
rfc_score =  accuracy_score( )
print('Random Forest Model accuracy score : {0:0.4f}'. format(   ))

### Problem 16. Check the Confusion Matrix of the Random Forest model

In [None]:
from sklearn.metrics import confusion_matrix

cm =   
print('Confusion Matrix for Binary Labels \n')
# scikit-learn convention: rows are actual classes, columns are predicted classes,
# so with labels ordered [<=50K, >50K] the layout is
# [[correct <=50K,            <=50K predicted as >50K],
#  [>50K predicted as <=50K,  correct >50K           ]]
print(cm)

In [None]:
# Compute Recall and Precision from the Confusion Matrix

print('\nRecall for Class [<=50K] = ', cm[0,0], '/' , cm[0,0] + cm[0,1])
print('\nPrecision for Class [<=50K] = ', cm[0,0], '/' , cm[0,0] + cm[1,0])
print('\nRecall for Class [>50K] = ', cm[1,1], '/' , cm[1,0] + cm[1,1])
print('\nPrecision for Class [>50K] = ', cm[1,1], '/' , cm[0,1] + cm[1,1])
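
The cell above prints each ratio as a numerator and denominator. The sketch below carries out the same index arithmetic on a small hand-made example and cross-checks it against scikit-learn's `precision_score` and `recall_score`. The label arrays are hypothetical, with 0 standing for `<=50K` and 1 for `>50K`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 stands for <=50K, 1 stands for >50K
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

cm_toy = confusion_matrix(y_true, y_hat)   # rows: actual class, columns: predicted class

recall_0 = cm_toy[0, 0] / (cm_toy[0, 0] + cm_toy[0, 1])      # row sum in the denominator
precision_0 = cm_toy[0, 0] / (cm_toy[0, 0] + cm_toy[1, 0])   # column sum in the denominator
recall_1 = cm_toy[1, 1] / (cm_toy[1, 0] + cm_toy[1, 1])
precision_1 = cm_toy[1, 1] / (cm_toy[0, 1] + cm_toy[1, 1])

print(cm_toy)
print(recall_0, precision_0, recall_1, precision_1)

# Cross-check against the library functions
print(recall_score(y_true, y_hat, pos_label=0), precision_score(y_true, y_hat, pos_label=0))
```

Recall divides by a row sum (all actual members of the class) and precision by a column sum (all predictions of the class), which is an easy way to remember which index goes where.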

### Problem 17. Check the classification results of the Random Forest model

In [None]:
from sklearn.metrics import classification_report

cm_rfc =   
print(cm_rfc)

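## Step 5. Implementing boosting-based classification models

### Boosting model overview


- Boosting is an ensemble algorithm that combines the fitted results of multiple trees, with the added notion of sequence. That is, weak learners are built one after another, and each weak learner is fitted to reflect the error of the one immediately before it. In the Gradient Boosting Model (GBM), this idea extends to choosing each weak learner in the direction that keeps reducing the loss.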

![boost](https://pluralsight2.imgix.net/guides/81232a78-2e99-4ccc-ba8e-8cd873625fdf_2.jpg)


- Boosting models include AdaBoost, Gradient Boosting Model (GBM), XGBoost, and LightGBM.


- More details are covered in the next Step.
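
The sequential idea described above can be sketched in a few lines. This is a simplified illustration, assuming squared-error loss (for which the negative gradient equals the residual), depth-1 regression trees as weak learners, and synthetic data. It is not how `GradientBoostingClassifier` or LightGBM are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_b = rng.uniform(-3, 3, size=(200, 1))
y_b = np.sin(X_b[:, 0]) + rng.normal(scale=0.1, size=200)   # synthetic target

learning_rate = 0.3
prediction = np.full_like(y_b, y_b.mean())   # start from a constant model
mse_history = [float(np.mean((y_b - prediction) ** 2))]

for _ in range(20):
    residual = y_b - prediction                        # error left by the learners so far
    stump = DecisionTreeRegressor(max_depth=1)         # weak learner
    stump.fit(X_b, residual)                           # fit the current error
    prediction += learning_rate * stump.predict(X_b)   # add a damped correction
    mse_history.append(float(np.mean((y_b - prediction) ** 2)))

# Training error before the first round and after the last one
print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

Each round only has to correct what the previous rounds got wrong, which is why many shallow trees can add up to a strong model.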

### Problem 18. Implement a Gradient Boosting model and check its accuracy

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbc =   
gbc.fit( )

In [None]:
y_pred =   
gbc_score =   
print('Gradient Boosting accuracy score : {0:0.4f}'.format(gbc_score))

### Problem 19. Check the classification results of the Gradient Boosting model

In [None]:
from sklearn.metrics import classification_report

cm_gbc =   
print( )

### Problem 20. Implement Light GBM and check its accuracy

In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

lgbm =   
lgbm.fit( )
y_pred =   

lgbm_score =   
print('LGBM Model accuracy score : {0:0.4f}'.format(lgbm_score))

### Problem 21. Check the Light GBM classification results

In [None]:
from sklearn.metrics import classification_report

cm_lgbm =   
print( )

### Problem 22. Summary: compare the accuracy of the binary income classification models

In [None]:
print ('Accuracy Comparisons for Binary Models\n')
print ('logreg_score:', '{0:0.5f}'.format(logreg_score))
print ('rfc_score   :', '{0:0.5f}'.format(rfc_score))
print ('gbc_score   :', '{0:0.5f}'.format(gbc_score ))
print ('lgbm_score  :', '{0:0.5f}'.format(lgbm_score))

### Problem 23. Compare the final binary classification models

In [None]:
print ('Classification Comparisons for Binary Models\n')
print ('logreg_score:', '{0:0.4f}'.format(logreg_score))
print (cm_logreg)
print ('rfc_score   :', '{0:0.4f}'.format(rfc_score))
print (cm_rfc)
print ('gbc_score   :', '{0:0.4f}'.format(gbc_score ))
print (cm_gbc)
print ('lgbm_score  :', '{0:0.4f}'.format(lgbm_score))
print (cm_lgbm)

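## Step 6. Implementing multiclass classification models

### Introduction to data

- The second dataset is the US Adult data used above, with simulated financial variables modified (+) or added (++). Its columns are as follows.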

income_ext.csv
- `age`: age
- `workclass`: class of worker (employment type)
- `education`: education level
- `education.num`: education level (numerically coded)
- `marital.status`: marital status
- `occupation`: occupation
- `relationship`: family relationship
- `sex`: sex
- `capital.gain`: capital gains
- `capital.loss`: capital losses
- `hours.per.week`: working hours per week
- `spending.groc`: grocery spending amount (continuous) ++
- `spending.med`: hospital and clinic spending amount (continuous) ++
- `spending.trav`: travel and leisure spending amount (continuous) ++
- `income`: binary income class (<=50K: 0, >50K: 1) +
- `income.num`: income amount (continuous) ++

### Problem 24. Load the data

In [None]:
data =  



In [None]:
data.head()

In [None]:
data['income'].value_counts()

### Light GBM overview


- Gradient Boosting Decision Tree (GBDT), an ensemble of decision trees, is well known in practice through implementations such as XGBoost (eXtreme Gradient Boosting). In each iteration, GBDT trains a decision tree by fitting the negative gradients (the residual errors).


- On high-dimensional, large-scale data, however, this took too much time, because the entire dataset had to be scanned to evaluate the information gain of every possible split point.


- Light GBM overcomes this drawback of Gradient Boosting models by using techniques such as sampling to <b>reduce the amount of data scanned</b>, which dramatically shortens analysis time.


- LGBM is a useful algorithm for <b>tabular data with many categorical variables</b> and for <b>multiclass classification</b>, so knowing its basic principles will help.


- Reference: LightGBM: A Highly Efficient Gradient Boosting Decision Tree (NIPS 2017)
[https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html]

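### Light GBM parameters

- `objective`: objective function; regression, binary, and multiclass are available
- `categorical_feature`: declares which features are categorical
- `max_cat_group`: when a categorical variable has many categories, finding split points on it tends to overfit, so the categories are merged into max_cat_group groups and split points are searched on the group boundaries; the default is 64
- `boosting`: boosting method; the default is gbdt (gradient boosting decision tree), and other options include goss (Gradient-based One-Side Sampling), which uses sampling, dart (Dropouts meet Multiple Additive Regression Trees), which resembles dropout in deep learning, and rf (Random Forest)
- `learning_rate`: learning rate; determines how much weight each predictor receives during training
- `early_stopping_round`: the number of rounds without improvement on the validation data after which training stops
- `metric`: loss metric; options include binary_logloss, multi_logloss, mae, rmse, auc, and cross_entropy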
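
To make the parameter list concrete, the sketch below collects a few of these settings for a multiclass model using LightGBM's scikit-learn wrapper. The values are illustrative rather than tuned, `lgbm_params` and `build_lgbm` are hypothetical names, and `boosting_type` is the wrapper's name for the `boosting` parameter. The import is placed inside the function so the dictionary can be inspected even where LightGBM is not installed.

```python
# Hypothetical parameter set for a multiclass income model; values are illustrative, not tuned
lgbm_params = {
    "objective": "multiclass",   # regression / binary / multiclass
    "boosting_type": "gbdt",     # scikit-learn wrapper name for the `boosting` parameter
    "learning_rate": 0.1,        # weight given to each new tree
    "n_estimators": 200,         # number of boosting rounds
    "random_state": 0,           # fixed seed for reproducibility
}


def build_lgbm(params=None):
    """Create an LGBMClassifier from a parameter dict (requires lightgbm to be installed)."""
    from lightgbm import LGBMClassifier   # imported lazily on purpose
    return LGBMClassifier(**(params or lgbm_params))


# Usage, assuming lightgbm is installed and x_train / y_train already exist:
# model = build_lgbm()
# model.fit(x_train, y_train, categorical_feature=['workclass', 'occupation'])
```

Parameter names and defaults differ between LightGBM versions, so check the documentation for the installed version before relying on any of them.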

### Problem 25. Check the data

In [None]:
# Create raw dataset for input
X =                              # Drop columns
y =                              # Choose column

In [None]:
# Check the raw X data

 

In [None]:
# Check the raw y data

 

### Problem 26. Create multiclass labels

In [None]:
def value_change(x):
    if x <= 10000: return 0
    if x >  10000 and x <= 20000 : return 1
    if x >  20000 and x <= 30000 : return 2
    if x >  30000 and x <= 40000 : return 3
    if x >  40000 and x <= 50000 : return 4
    if x >  50000 and x <= 60000 : return 5
    if x >  60000 and x <= 70000 : return 6
    if x >  70000 and x <= 80000 : return 7
    if x >  80000 and x <= 90000 : return 8
    if x >  90000 and x <= 100000 : return 9
    return 10                          
                                       ## Define function

y =                                    ## Apply Lambda function
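
The bracket rule in `value_change` (10,000-wide intervals closed on the right, with everything above 100,000 in class 10) can also be expressed with `pandas.cut`. The sketch below is an alternative formulation on made-up amounts, not the notebook's intended answer; `income_num` is a hypothetical stand-in for a continuous income column.

```python
import numpy as np
import pandas as pd

# Hypothetical income amounts, including values exactly on bracket edges
income_num = pd.Series([5000, 10000, 10001, 45000, 50000, 99999, 100000, 250000])

# Bracket edges: (-inf, 10000], (10000, 20000], ..., (90000, 100000], (100000, inf)
edges = [-np.inf] + list(range(10000, 100001, 10000)) + [np.inf]

# labels=False returns the integer index of the bracket each value falls into
labels_cut = pd.cut(income_num, bins=edges, labels=False, right=True)

print(labels_cut.tolist())
```

Keeping the edges in a single list makes it easy to try a different number of brackets later without rewriting a chain of `if` statements.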

In [None]:
# Check the generated labels



In [None]:
y.value_counts()

In [None]:
sns.set(font_scale=1.4)
y.value_counts().plot(kind='bar', figsize=(7, 6), rot=0)
plt.xlabel("Income Labels", labelpad=14)
plt.ylabel("Counts", labelpad=14)
plt.title("Counts of 11 Income Labels\n", y=1.02);

### Problem 27. Feature Engineering

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(   )

In [None]:
from sklearn import preprocessing

categorical =  
for feature in categorical:
        le =   
        x_train[feature] =  
        x_test[feature] =  

In [None]:
x_train[categorical].head()

### Problem 28. Implement multiclass classification with Light GBM

In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

lgbm =  
lgbm.fit(x_train, y_train)

### Problem 29. Check the Light GBM multiclass results: Accuracy, Confusion Matrix, Heatmap

In [None]:
## Compute the accuracy
y_pred1 =   

lgbm_score =   
print('LGBM Model accuracy score : {0:0.4f}'.format(lgbm_score))

LGBM Model accuracy score : 0.8119


In [None]:
pd.DataFrame(y_test).head(10)

In [None]:
pd.DataFrame(y_pred1).head(10)

In [None]:
## Check the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm1 =   
print('LGBM Confusion Matrix for 11-class Labels\n')
print(cm1)


In [None]:
## Visualize as a heatmap
plt.figure(figsize=[8,7])
sns.heatmap( )
plt.title('LGBM Heatmap for 11-class Labels\n')
plt.show()

In [None]:
print('LGBM Model accuracy score : {0:0.4f}'.format(lgbm_score))
print('\n')
print(classification_report(y_test, y_pred1))

### Problem 30. Implement multiclass classification with a Random Forest model and check its accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rfc =   
rfc.fit(x_train, y_train) 

In [None]:
y_pred =   
rfc_score =   
print('Random Forest Model accuracy score : {0:0.4f}'. format(rfc_score))

## Step 7. Evaluating and improving multiclass classification models

### Problem 31. Check the accuracy of the Random Forest multiclass model

In [None]:
## Review of the previous Step
y_pred = rfc.predict(x_test)
rfc_score = accuracy_score(y_test, y_pred)
print('Random Forest Model accuracy score : {0:0.4f}'. format(rfc_score)) 

### Problem 32. Compute the adjacent accuracy of the Random Forest multiclass model

In [None]:
## Compute the adjacent accuracy
precise_accuracy =   
adjacent_accuracy =  

print('precise accuracy: {0:0.4f}'. format(precise_accuracy))
print('adjacent accuracy: {0:0.4f}'. format(adjacent_accuracy))
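
The notebook does not spell out a formula for adjacent accuracy. The sketch below assumes the common definition for ordinal labels: a prediction counts as correct when it is at most one class away from the true label. The label arrays are made up for illustration.

```python
import numpy as np

# Hypothetical ordinal labels (0 = lowest income bracket)
y_true_toy = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y_pred_toy = np.array([0, 2, 2, 5, 3, 5, 9, 7])

# Distance in brackets between prediction and truth
diff = np.abs(y_true_toy - y_pred_toy)

precise_acc = float(np.mean(diff == 0))    # exact hits only
adjacent_acc = float(np.mean(diff <= 1))   # exact hits plus one-bracket misses

print('precise accuracy: {0:0.4f}'.format(precise_acc))
print('adjacent accuracy: {0:0.4f}'.format(adjacent_acc))
```

Under this assumption adjacent accuracy is never lower than precise accuracy, and the gap between the two shows how many errors are near misses. With the notebook's variables, `np.asarray(y_test)` and `y_pred` would take the place of the toy arrays.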

### Problem 33. Check the Confusion Matrix of the Random Forest multiclass model

In [None]:
print(cm1) # light gbm 

In [None]:
from sklearn.metrics import confusion_matrix
cm2 =   
print('Random Forest Confusion Matrix for 11-class Labels\n')
print(cm2)

In [None]:
plt.figure(figsize=[8, 7])
sns.heatmap(cm1, cmap='Reds', annot=True, fmt='.0f')
plt.show()

In [None]:
## Seaborn Heatmap

plt.figure(figsize=[8, 7])
sns.heatmap(cm2, cmap='Reds', annot=True, fmt='.0f')
plt.title('Random Forest Heatmap for 11-class Labels\n', fontsize=14)
plt.show()

### Problem 34. Check the classification results of the Random Forest multiclass model

In [None]:
from sklearn.metrics import classification_report

print('Random Forest precise  accuracy for 11 labels: {0:0.4f}'. format(  )) 
print('Random Forest adjacent accuracy for 11 labels: {0:0.4f}'. format(  ))
print('\n')
print(classification_report(    ))

### Problem 35. Improve the model by converting to an appropriate number of classes

In [None]:
def value_change(x):
    if x <= 20000: return 0
    if x >  20000 and x <= 50000 : return 1
    if x >  50000 and x <= 70000 : return 2
    if x >  70000 and x <= 90000 : return 3 
    return 4
                                       ## Define new function
y =                                    ## Apply Lambda function

In [None]:
y.value_counts()

In [None]:
y.value_counts().sum()

In [None]:
sns.set(font_scale=1.4)
y.value_counts().plot(kind='bar', figsize=(7, 6), rot=0)
plt.xlabel("Income Labels", labelpad=14)
plt.ylabel("Counts", labelpad=14)
plt.title("Counts of 5 Income Labels\n", y=1.02);

### Problem 36. Feature Engineering

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
from sklearn import preprocessing

categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'sex', 'native.country']
for feature in categorical:
        le = preprocessing.LabelEncoder()
        x_train[feature] = le.fit_transform(x_train[feature])
        x_test[feature] = le.transform(x_test[feature])

### Problem 37. Check the improvement of the Random Forest multiclass model with the new classes

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rfc = RandomForestClassifier(random_state=0)
rfc.fit(x_train, y_train)

In [None]:
## Check the Confusion Matrix
y_pred =   
cm3 =   
print('Random Forest Confusion Matrix for 5-class Labels\n')
print(cm3)

In [None]:
plt.figure(figsize=[8, 7])
sns.heatmap(cm2, cmap='Reds', annot=True, fmt='.0f')
plt.title('Random Forest Heatmap for 11-class Labels\n', fontsize=14)
plt.show()

In [None]:
# Draw the heatmap

plt.figure(figsize=[8,7])
sns.heatmap(cm3, cmap="Reds", annot=True, fmt='.0f')
plt.title('Random Forest Heatmap for 5-class Labels\n', fontsize=14)
plt.show()

In [None]:
## Accuracy Evaluation
precise_accuracy =   
adjacent_accuracy =   

print('precise accuracy: {0:0.4f}'. format(precise_accuracy))
print('adjacent accuracy: {0:0.4f}'. format(adjacent_accuracy))

In [None]:
## Compare the two Accuracy scores
print('Random Forest precise  accuracy for 5 labels: {0:0.4f}'. format(   ))
print('Random Forest adjacent accuracy for 5 labels: {0:0.4f}'. format(   ))
print('\n')
print(classification_report(  ))

### Problem 38. Check the improvement of the Light GBM multiclass model with the new classes

In [None]:
lgbm =   
lgbm.fit(  )
y_pred1 =  

lgbm_score1 = accuracy_score(   )
print('LGBM Model accuracy score : {0:0.4f}'.format(  ))

In [None]:
## Compute the Confusion Matrix

cm4 =   
print('LGBM Confusion Matrix for 5-class Labels\n')
print(cm4)

In [None]:
# Heatmap using seaborn

plt.figure(figsize=[8, 7])
sns.heatmap(cm3, cmap='Reds', annot=True, fmt='.0f')
plt.title('Random Forest Classifier Heatmap for 5-class Labels\n', fontsize=14)
plt.show()

In [None]:
# Heatmap using seaborn

plt.figure(figsize=[8, 7])
sns.heatmap(cm4, cmap='Reds', annot=True, fmt='.0f')
plt.title('LGBM Heatmap for 5-class Labels\n', fontsize=14)
plt.show()

In [None]:
## Accuracy Evaluation
precise_accuracy1 =   
adjacent_accuracy1 =   

print('LGBM precise  accuracy for 5 labels: {0:0.4f}'. format(precise_accuracy1))
print('LGBM adjacent accuracy for 5 labels: {0:0.4f}'. format(adjacent_accuracy1))
print('\n')
print(classification_report(y_test, y_pred1))

### Problem 39. Summary: compare the results of the multiclass income prediction models

In [None]:
print ('Accuracy Comparisons for Multiclass Models\n')
print ('rfc_score  (11 labels)  :', '{0:0.5f}'.format(rfc_score)) # Step 6, Problem 30
print ('rfc_score  ( 5 labels)  :', '{0:0.5f}'.format(precise_accuracy)) # Step 7, Problem 37
print ('lgbm_score (11 labels)  :', '{0:0.5f}'.format(lgbm_score)) # Step 6, Problem 29
print ('lgbm_score ( 5 labels)  :', '{0:0.5f}'.format(precise_accuracy1)) # Step 7, Problem 38

In [None]:
print('LGBM Confusion Matrix')
print(cm1) # Step 6, Problem 29
print('\n')
print('Random Forest Confusion Matrix')
print(cm2) # Step 7, Problem 33

In [None]:
print ('Classification Comparisons for Multiclass Models\n')
# y_test, y_pred and y_pred1 hold the 5-label results at this point, so the 5-label scores are shown
print ('rfc_score   :', '{0:0.4f}'.format(precise_accuracy))
print (classification_report(y_test, y_pred)) # Step 7, Problem 37
print ('lgbm_score  :', '{0:0.4f}'.format(lgbm_score1))
print (classification_report(y_test, y_pred1)) # Step 7, Problem 38

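## Step 8. Summary

- 1) Understanding binary and multiclass classification models<p>: modeling methods (parameter handling) that vary with the number of classes

- 2) Understanding linear and tree-based classification models<p>: Logistic Regression, Random Forest, Gradient Boosting, <strong>Light GBM</strong>, etc.


- 3) Understanding how to process variables<p>: handling categorical variables, scaling, handling three or more classes


- 4) Learning how to interpret classification results<p>: simple accuracy, <b>adjacent accuracy</b>, Precision, Recall, etc.


- 5) Learning how to improve a model based on evaluation results<p>: combined use of the F1 score, Confusion Matrix, Classification Report, etc.
    

- Next topic: explaining the implemented machine learning models with XAI techniques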