## Loading the saved data

In [2]:
import pandas as pd

train_df = pd.read_csv('elect_training.csv', header=None)
print(train_df[1].value_counts())
test_df = pd.read_csv('elect_test.csv', header=None)
print(test_df[1].value_counts())

0    209
1    197
Name: 1, dtype: int64
1    100
0    100
Name: 1, dtype: int64


## Word tokenization

In [3]:
from konlpy.tag import Okt

# Okt replaces the Twitter class, which was renamed in KoNLPy v0.4.5
okt = Okt()

def tw_tokenizer(text):
    # Tokenize the input text into morphemes and return them as a list
    return okt.morphs(text)


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Use the morphs()-based tw_tokenizer defined above; ngram_range is (1, 2)
tfidf_vect = TfidfVectorizer(tokenizer=tw_tokenizer, ngram_range=(1,2), min_df=3, max_df=0.9)
tfidf_vect.fit(train_df[0])
tfidf_matrix_train = tfidf_vect.transform(train_df[0])
tfidf_matrix_test = tfidf_vect.transform(test_df[0])
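
To make the `ngram_range`, `min_df`, and `max_df` settings above more concrete, here is a minimal pure-Python sketch of how a vocabulary could be selected before any TF-IDF weighting is applied. It is an illustration, not scikit-learn's implementation: the helper names `ngrams` and `select_vocabulary` and the toy English sentences are assumptions made for this example, and whitespace splitting stands in for the morpheme tokenizer.

```python
from collections import Counter

def ngrams(tokens, lo=1, hi=2):
    """Return every n-gram with lo <= n <= hi, joined by single spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def select_vocabulary(docs, tokenizer, min_df=3, max_df=0.9):
    """Keep terms whose document frequency is at least min_df (a count)
    and at most max_df * n_docs (a proportion), mirroring the settings above."""
    doc_freq = Counter()
    for doc in docs:
        # A term counts once per document, however often it repeats inside it
        doc_freq.update(set(ngrams(tokenizer(doc))))
    max_count = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_count)

# Hypothetical toy corpus; str.split stands in for tw_tokenizer
toy_docs = ["good movie", "good plot", "bad movie", "bad plot good movie"]
vocab = select_vocabulary(toy_docs, str.split, min_df=2, max_df=0.9)
print(vocab)
```

Only the bigram `good movie` survives here because it is the only two-word term that appears in at least two documents; rarer bigrams are dropped by `min_df`.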



In [5]:
tfidf_matrix_train

<406x1890 sparse matrix of type '<class 'numpy.float64'>'
	with 16324 stored elements in Compressed Sparse Row format>

In [6]:
train_df[1].shape

(406,)

## Building the model

In [143]:
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Set n_estimators to 100; fix random_state so every run of the example gives the same predictions
xgb_clf = XGBClassifier(n_estimators=100, random_state=0)

# Train with AUC as the evaluation metric and an early-stopping patience of 30 rounds
xgb_clf.fit(tfidf_matrix_train, train_df[1], early_stopping_rounds=30, eval_metric="auc",
            eval_set=[(tfidf_matrix_train, train_df[1]), (tfidf_matrix_test, test_df[1])])

xgb_roc_score = roc_auc_score(test_df[1], xgb_clf.predict_proba(tfidf_matrix_test)[:, 1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))

[0]	validation_0-auc:0.86863	validation_1-auc:0.63180
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.

Will train until validation_1-auc hasn't improved in 30 rounds.
[1]	validation_0-auc:0.92042	validation_1-auc:0.67585
[2]	validation_0-auc:0.94314	validation_1-auc:0.68985
[3]	validation_0-auc:0.95796	validation_1-auc:0.72510
[4]	validation_0-auc:0.96844	validation_1-auc:0.72290
[5]	validation_0-auc:0.97532	validation_1-auc:0.73150
[6]	validation_0-auc:0.97564	validation_1-auc:0.72650
[7]	validation_0-auc:0.97715	validation_1-auc:0.73065
[8]	validation_0-auc:0.97956	validation_1-auc:0.72070
[9]	validation_0-auc:0.98240	validation_1-auc:0.73040
[10]	validation_
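
The `auc` values in the log above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly by comparing every positive/negative pair. The function name `roc_auc` and the toy labels and scores are assumptions for illustration only; in the notebook itself the value comes from `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """Pairwise-ranking form of ROC AUC for binary labels (0 or 1).

    Each (positive, negative) pair scores 1 if the positive is ranked
    higher, 0.5 on a tie, and 0 otherwise; the AUC is the mean over pairs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: three of the four positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

This quadratic-time version is only meant to show what the metric measures; it is not how a library would compute it on large data.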

In [148]:
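# Use GridSearchCV to tune the XGBoost hyperparameters
# (max_depth is an integer tree depth in XGBoost, so the 0.5 entry is not a meaningful depth)
params = {'max_depth': [0.5, 1, 3, 5],
          'min_child_weight': [0.25, 0.5, 1],
          'colsample_bytree': [0.25, 0.5, 0.75]}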

grid_cv = GridSearchCV(xgb_clf, param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv.fit(tfidf_matrix_train, train_df[1])
print(grid_cv.best_params_, round(grid_cv.best_score_, 4))

Fitting 3 folds for each of 36 candidates, totalling 108 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed:   19.3s finished


{'colsample_bytree': 0.25, 'max_depth': 3, 'min_child_weight': 0.5} 0.6996
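
The "36 candidates, totalling 108 fits" line in the log follows from the size of the grid: every combination of the three value lists is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below enumerates the combinations with `itertools.product` to check that arithmetic. It illustrates the counting only and is not how `GridSearchCV` is implemented; the variable names are chosen for this example.

```python
from itertools import product

# Same grid as in the cell above
params = {'max_depth': [0.5, 1, 3, 5],
          'min_child_weight': [0.25, 0.5, 1],
          'colsample_bytree': [0.25, 0.5, 0.75]}
n_folds = 3  # matches cv=3

# One candidate per element of the Cartesian product of the value lists
keys = sorted(params)
candidates = [dict(zip(keys, values))
              for values in product(*(params[k] for k in keys))]

total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Because the cost grows multiplicatively with each added hyperparameter, trimming a value list is usually the cheapest way to shorten a search.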


## Training the model

In [149]:
import xgboost as xgb
from xgboost import plot_importance
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [150]:
# Import XGBClassifier, the scikit-learn wrapper class for XGBoost
from xgboost import XGBClassifier

xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)
xgb_wrapper.fit(tfidf_matrix_train, train_df[1])

# Tuned model from the grid search; note that the predictions below come from xgb_wrapper
best_estimator = grid_cv.best_estimator_
w_preds = xgb_wrapper.predict(tfidf_matrix_test)

In [151]:
from sklearn.metrics import accuracy_score

print('XGBoost accuracy: ', accuracy_score(test_df[1], w_preds))

XGBoost accuracy:  0.72
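
Accuracy is simply the share of predictions that match the labels, so on the 200-row test set loaded earlier a score of 0.72 corresponds to 144 correct predictions. The sketch below shows the computation together with the four confusion counts, which reveal whether errors lean toward one class. The helper names and the toy labels are assumptions for illustration; the notebook itself uses `accuracy_score`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    pairs = Counter(zip(y_true, y_pred))
    return pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)], pairs[(1, 1)]

# Hypothetical toy labels and predictions
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
acc = accuracy(toy_true, toy_pred)
tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
print(acc, tn, fp, fn, tp)
```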
