<h2>07 Logistic Regression

<h4>[Loading the Wisconsin Breast Cancer Data Set]

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()

<h4>[Applying Standard Scaling (Mean 0, Variance 1)]

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Use StandardScaler() to rescale each feature to mean 0 and variance 1
scaler = StandardScaler()
data_scaled = scaler.fit_transform(cancer.data)

# Split the data set into training and test sets
X_train , X_test, y_train , y_test = train_test_split(data_scaled, cancer.target, test_size=0.3, random_state=0)
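
The cell above scales the full data set before splitting it, so the test rows contribute to the scaler's mean and variance. A minimal sketch of an alternative, assuming you want the scaler fitted on the training split only, is to split the raw data first and wrap both steps in a pipeline. The names `pipe`, `X_train_raw`, `X_test_raw`, `y_train_raw`, `y_test_raw`, and `pipe_accuracy` are illustrative and do not appear in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Split the raw (unscaled) data first
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0)

# The pipeline fits StandardScaler on the training split only and
# reuses those statistics when transforming the test split
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train_raw, y_train_raw)

pipe_accuracy = pipe.score(X_test_raw, y_test_raw)
print('pipeline accuracy: {0:.3f}'.format(pipe_accuracy))
```

On a data set this small the difference from the original approach is likely minor, but the pipeline form also carries over cleanly to cross-validation.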

<h4>[Training and Predicting with Logistic Regression]

In [11]:
from sklearn.metrics import accuracy_score, roc_auc_score

# Train and predict with logistic regression.
# If no solver argument is passed to the constructor, solver='lbfgs' is used.
lr_clf = LogisticRegression() # solver='lbfgs'
lr_clf.fit(X_train, y_train)
lr_preds = lr_clf.predict(X_test)
lr_preds_proba = lr_clf.predict_proba(X_test)[:, 1]

# Measure accuracy and roc_auc
print('accuracy: {0:.3f}, roc_auc:{1:.3f}'.format(accuracy_score(y_test, lr_preds),
                                                 roc_auc_score(y_test , lr_preds_proba)))

accuracy: 0.977, roc_auc:0.995


-> With solver='lbfgs', accuracy is 0.977 and ROC-AUC is 0.995.
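
The ROC-AUC above is computed from `predict_proba`, the estimated probability of the positive class. As a rough illustration of where that number comes from in the binary case, the sketch below recomputes it by passing the linear score through a sigmoid. It refits a model on the full scaled data purely for demonstration; `X_demo`, `demo_clf`, `z`, and `manual_proba` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_demo = StandardScaler().fit_transform(cancer.data)
demo_clf = LogisticRegression().fit(X_demo, cancer.target)

# Linear score z = w . x + b for every sample
z = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]

# The sigmoid maps the score to a probability of the positive class
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with the values scikit-learn reports
matches = np.allclose(manual_proba, demo_clf.predict_proba(X_demo)[:, 1])
print('manual sigmoid matches predict_proba:', matches)
```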

<Training and evaluating with different solver values>

In [12]:
solvers = ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']
# Train LogisticRegression with each solver value and evaluate performance
for solver in solvers:
    # max_iter is the maximum number of iterations the solver's optimization algorithm may run to converge
    lr_clf = LogisticRegression(solver=solver, max_iter=600)
    lr_clf.fit(X_train, y_train)
    lr_preds = lr_clf.predict(X_test)
    lr_preds_proba = lr_clf.predict_proba(X_test)[:, 1]

    # Measure accuracy and roc_auc
    print('solver:{0}, accuracy: {1:.3f}, roc_auc:{2:.3f}'.format(solver, 
                                                                  accuracy_score(y_test, lr_preds),
                                                                  roc_auc_score(y_test , lr_preds_proba)))                              

solver:lbfgs, accuracy: 0.977, roc_auc:0.995
solver:liblinear, accuracy: 0.982, roc_auc:0.995
solver:newton-cg, accuracy: 0.977, roc_auc:0.995
solver:sag, accuracy: 0.982, roc_auc:0.995
solver:saga, accuracy: 0.982, roc_auc:0.995


-> The differences between solvers are small: accuracy ranges from 0.977 to 0.982, and ROC-AUC is 0.995 for all five.
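
Since `max_iter` only sets an upper bound, it can help to see how many iterations each solver actually used. The sketch below reads the fitted model's `n_iter_` attribute. It fits on the full scaled data for brevity rather than on the training split, and `X_all` and `iters_used` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_all = StandardScaler().fit_transform(cancer.data)

iters_used = {}
for solver in ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']:
    clf = LogisticRegression(solver=solver, max_iter=600).fit(X_all, cancer.target)
    # n_iter_ holds the number of iterations actually run by the solver
    iters_used[solver] = int(clf.n_iter_.max())
    print('solver:{0}, iterations used: {1}'.format(solver, iters_used[solver]))
```

A solver whose count equals `max_iter` stopped at the cap, which suggests it may not have converged.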

<Optimizing solver, penalty, and C>

In [13]:
from sklearn.model_selection import GridSearchCV

params={'solver':['liblinear', 'lbfgs'],
        'penalty':['l2', 'l1'],
        'C':[0.01, 0.1, 1, 5, 10]}

lr_clf = LogisticRegression()

grid_clf = GridSearchCV(lr_clf, param_grid=params, scoring='accuracy', cv=3 )
grid_clf.fit(data_scaled, cancer.target)
print('Best hyperparameters:{0}, best mean accuracy:{1:.3f}'.format(grid_clf.best_params_,
                                                                    grid_clf.best_score_))

Best hyperparameters:{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}, best mean accuracy:0.979


15 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\s2102\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\s2102\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\s2102\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

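-> The output includes a FitFailedWarning. The lbfgs solver does not support L1 regularization, yet the GridSearchCV parameter grid pairs solver='lbfgs' with penalty='l1'. Those 5 combinations (one per C value) fail on each of the 3 folds, which accounts for the 15 failed fits out of 60.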
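
One way to avoid the failed fits, sketched here as an option rather than as the notebook's own method, is to pass `param_grid` as a list of dictionaries so that each solver is paired only with penalties it supports. That leaves 45 fits, the same 45 that succeeded above. The names `X_scaled`, `c_values`, `valid_params`, and `valid_grid` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

c_values = [0.01, 0.1, 1, 5, 10]
# Each dictionary is expanded separately, so lbfgs is never combined with l1
valid_params = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': c_values},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': c_values},
]

valid_grid = GridSearchCV(LogisticRegression(), param_grid=valid_params,
                          scoring='accuracy', cv=3)
valid_grid.fit(X_scaled, cancer.target)

print('Best hyperparameters:{0}, best mean accuracy:{1:.3f}'.format(
    valid_grid.best_params_, valid_grid.best_score_))
```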

<hr>

<h2>08 Regression Trees

In [14]:
'''from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # Suppress the warning that the Boston housing data set will be removed in scikit-learn 1.2

# Load the Boston data set
boston = load_boston()
bostonDF = pd.DataFrame(boston.data, columns = boston.feature_names)

bostonDF['PRICE'] = boston.target
y_target = bostonDF['PRICE']
X_data = bostonDF.drop(['PRICE'], axis=1,inplace=False)

rf = RandomForestRegressor(random_state=0, n_estimators=1000)
neg_mse_scores = cross_val_score(rf, X_data, y_target, scoring="neg_mean_squared_error", cv = 5)
rmse_scores  = np.sqrt(-1 * neg_mse_scores)
avg_rmse = np.mean(rmse_scores)

print(' Per-fold negative MSE scores (5-fold CV): ', np.round(neg_mse_scores, 2))
print(' Per-fold RMSE scores (5-fold CV): ', np.round(rmse_scores, 2))
print(' Average RMSE (5-fold CV): {0:.3f} '.format(avg_rmse))'''
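
The cell above is wrapped in a string literal, presumably because `load_boston` is not available in scikit-learn 1.2 and later. To show the same cross-validated RMSE workflow in runnable form, the sketch below swaps in the bundled `load_diabetes` data set. This is a substitution, not the original data, so the scores are not comparable to Boston housing results. `n_estimators` is also reduced from 1000 to 100 to keep the run short, and all names ending in `_demo` or `_diab` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Swapped-in data set (assumption): load_diabetes stands in for load_boston
diabetes = load_diabetes()
X_diab, y_diab = diabetes.data, diabetes.target

# Fewer trees than the original cell, only to keep the sketch fast
rf_demo = RandomForestRegressor(random_state=0, n_estimators=100)

# scikit-learn reports negated MSE so that larger scores are always better
neg_mse_demo = cross_val_score(rf_demo, X_diab, y_diab,
                               scoring='neg_mean_squared_error', cv=5)
rmse_demo = np.sqrt(-1 * neg_mse_demo)

print('Per-fold RMSE scores:', np.round(rmse_demo, 2))
print('Average RMSE: {0:.3f}'.format(np.mean(rmse_demo)))
```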



In [15]:
'''def get_model_cv_prediction(model, X_data, y_target):
    neg_mse_scores = cross_val_score(model, X_data, y_target, scoring="neg_mean_squared_error", cv = 5)
    rmse_scores  = np.sqrt(-1 * neg_mse_scores)
    avg_rmse = np.mean(rmse_scores)
    print('##### ',model.__class__.__name__ , ' #####')
    print(' Average RMSE (5-fold CV): {0:.3f} '.format(avg_rmse))'''


In [16]:
'''from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

dt_reg = DecisionTreeRegressor(random_state=0, max_depth=4)
rf_reg = RandomForestRegressor(random_state=0, n_estimators=1000)
gb_reg = GradientBoostingRegressor(random_state=0, n_estimators=1000)
xgb_reg = XGBRegressor(n_estimators=1000)
lgb_reg = LGBMRegressor(n_estimators=1000)

# Loop over the tree-based regression models and evaluate each one
models = [dt_reg, rf_reg, gb_reg, xgb_reg, lgb_reg]
for model in models:  
    get_model_cv_prediction(model, X_data, y_target)'''


In [17]:
'''import seaborn as sns
%matplotlib inline

rf_reg = RandomForestRegressor(n_estimators=1000)

# Train on the X_data and y_target data set built in the earlier example.
rf_reg.fit(X_data, y_target)

feature_series = pd.Series(data=rf_reg.feature_importances_, index=X_data.columns )
feature_series = feature_series.sort_values(ascending=False)
sns.barplot(x= feature_series, y=feature_series.index)'''


In [18]:
'''import matplotlib.pyplot as plt
%matplotlib inline

bostonDF_sample = bostonDF[['RM','PRICE']]
bostonDF_sample = bostonDF_sample.sample(n=100,random_state=0)
print(bostonDF_sample.shape)
plt.figure()
plt.scatter(bostonDF_sample.RM , bostonDF_sample.PRICE,c="darkorange")'''


In [19]:
'''import numpy as np
from sklearn.linear_model import LinearRegression

# Create a linear regressor and two decision tree regressors. The DecisionTreeRegressor max_depth values are 2 and 7.
lr_reg = LinearRegression()
rf_reg2 = DecisionTreeRegressor(max_depth=2)
rf_reg7 = DecisionTreeRegressor(max_depth=7)

# Create a test set of 100 points from 4.5 to 8.5 on which to run predictions.
X_test = np.arange(4.5, 8.5, 0.04).reshape(-1, 1)

# For visualization, extract only the RM feature and the target PRICE from the Boston housing data
X_feature = bostonDF_sample['RM'].values.reshape(-1,1)
y_target = bostonDF_sample['PRICE'].values.reshape(-1,1)

# Train and predict.
lr_reg.fit(X_feature, y_target)
rf_reg2.fit(X_feature, y_target)
rf_reg7.fit(X_feature, y_target)

pred_lr = lr_reg.predict(X_test)
pred_rf2 = rf_reg2.predict(X_test)
pred_rf7 = rf_reg7.predict(X_test)'''


In [20]:
'''fig , (ax1, ax2, ax3) = plt.subplots(figsize=(14,4), ncols=3)

# Plot the linear regression and decision tree regression prediction lines for X values from 4.5 to 8.5
# Prediction line of the model trained with linear regression
ax1.set_title('Linear Regression')
ax1.scatter(bostonDF_sample.RM, bostonDF_sample.PRICE, c="darkorange")
ax1.plot(X_test, pred_lr,label="linear", linewidth=2 )

# Prediction line of DecisionTreeRegressor with max_depth=2
ax2.set_title('Decision Tree Regression: \n max_depth=2')
ax2.scatter(bostonDF_sample.RM, bostonDF_sample.PRICE, c="darkorange")
ax2.plot(X_test, pred_rf2, label="max_depth:2", linewidth=2 )

# Prediction line of DecisionTreeRegressor with max_depth=7
ax3.set_title('Decision Tree Regression: \n max_depth=7')
ax3.scatter(bostonDF_sample.RM, bostonDF_sample.PRICE, c="darkorange")
ax3.plot(X_test, pred_rf7, label="max_depth:7", linewidth=2)'''
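
The plots above contrast a straight regression line with the step-shaped predictions of a decision tree. As a small runnable illustration of why the tree's line is stepped, the sketch below fits trees of depth 2 and 7 on synthetic one-feature data and counts the distinct values each predicts. A tree of depth d has at most 2^d leaves, and each leaf predicts a single constant. The synthetic data and the names `X_syn`, `y_syn`, `X_grid`, `levels`, and `leaves` are assumptions made for this example and are not the Boston sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a noisy linear trend over the RM-like range 4.5 to 8.5
rng = np.random.RandomState(0)
X_syn = np.sort(rng.uniform(4.5, 8.5, size=100)).reshape(-1, 1)
y_syn = 9.0 * X_syn.ravel() - 35.0 + rng.normal(scale=5.0, size=100)

# Evenly spaced inputs on which to evaluate each tree
X_grid = np.arange(4.5, 8.5, 0.04).reshape(-1, 1)

levels = {}
leaves = {}
for depth in (2, 7):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_syn, y_syn)
    # Every leaf predicts one constant, so the prediction line is a step function
    levels[depth] = len(np.unique(tree.predict(X_grid)))
    leaves[depth] = tree.get_n_leaves()
    print('max_depth={0}: {1} leaves, {2} distinct predicted values (upper bound {3})'.format(
        depth, leaves[depth], levels[depth], 2 ** depth))
```

The shallow tree can only produce a handful of plateaus, while the deeper tree follows individual points much more closely, which is the overfitting tendency the max_depth=7 panel is meant to show.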
