[Boston House Price Prediction Model]

- Dataset: boston.csv
- Learning method: regression (supervised learning)
- Features: 13
- Target: 1

In [52]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler,RobustScaler
from sklearn.model_selection import train_test_split

In [53]:
data_file='../data/boston.csv'

In [54]:
data_df=pd.read_csv(data_file)
data_df.head(2)

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6


In [55]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


2) Preprocessing
- Data cleaning

In [56]:
# Handle missing values, duplicates, outliers, etc. (check for anomalous data by extracting the unique values of each column)
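
The checks named in the comment above could be sketched roughly as follows. `toy_df` is a made-up stand-in for `data_df` (its values are illustrative only), so the block runs without boston.csv, and the IQR rule is just one common outlier check, not necessarily the one intended here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'CRIM': [0.006, 0.027, 0.027, np.nan],
    'CHAS': [0, 0, 0, 1],
    'MEDV': [24.0, 21.6, 21.6, 33.4],
})

# Missing values per column
missing_counts = toy_df.isna().sum()

# Number of fully duplicated rows
duplicate_count = int(toy_df.duplicated().sum())

# Unique values per column: a quick way to spot unexpected codes
unique_counts = toy_df.nunique()

# IQR rule as one possible outlier check for a numeric column
q1, q3 = toy_df['MEDV'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy_df['MEDV'] < q1 - 1.5 * iqr) | (toy_df['MEDV'] > q3 + 1.5 * iqr)

print(missing_counts, duplicate_count, unique_counts, int(outlier_mask.sum()), sep='\n')
```

On the real data, the same calls would be applied to `data_df`; rows flagged by `duplicated()` or `outlier_mask` still need a case-by-case decision before being dropped.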

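- Standardization, normalization (whether applying them changes performance differs from dataset to dataset)
    * Models that assume normally distributed data: StandardScaler, MinMaxScaler, log transform, etc.
    * Reducing differences in feature value ranges: feature scaling with MinMaxScaler, RobustScaler, etc.
    * Categorical features: OneHotEncoder, OrdinalEncoder
    * String targets: LabelEncoder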
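
To make the difference between the scalers listed above concrete, here is a minimal sketch on a single made-up feature that contains one very large value. The array `x` is an assumption for illustration and is not taken from boston.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One made-up feature with a single large value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)   # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x)       # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x)       # (x - median) / IQR

for name, scaled in [('standard', standard), ('minmax', minmax), ('robust', robust)]:
    print(name, np.round(scaled.ravel(), 3))
```

With MinMaxScaler the large value squeezes the other four points close to 0, whereas RobustScaler keeps them spread around the median, which is why it is often preferred when outliers are present.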

In [57]:
# Split features / target
feature_df=data_df.iloc[:,:-1]
target_df=data_df['MEDV']

In [58]:
print(f'feature: {feature_df.shape}, target: {target_df.shape}')

feature: (506, 13), target: (506,)


3) Training preparation

In [59]:
# Split into training / test sets (test_size defaults to 0.25)
x_train,x_test,y_train,y_test=train_test_split(feature_df,target_df,random_state=10)

In [60]:
print(f'x_train: {x_train.shape}, y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}, y_test: {y_test.shape}')

x_train: (379, 13), y_train: (379,)
x_test: (127, 13), y_test: (127,)


In [61]:
# Create the scaler and fit it on the training set only
# The numeric features differ widely in range => scale with StandardScaler
sscaler=StandardScaler()
sscaler.fit(x_train)

In [62]:
scaled_x_train=sscaler.transform(x_train)
scaled_x_test=sscaler.transform(x_test)
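
One caveat: `scaled_x_train` is scaled with statistics from the whole training set, so when it is later split into cross-validation folds, each validation fold has already influenced the scaler. A possible alternative, sketched here on synthetic data from `make_regression` (an assumption, not the notebook's data), wraps the scaler and the model in a pipeline so that the scaler is refitted inside every fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston features/target (506 rows, 13 features)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Inside cross_validate the scaler is fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_result = cross_validate(pipe, X_train, y_train, cv=3, scoring='r2',
                           return_train_score=True)

print(cv_result['test_score'])
print(cv_result['train_score'])
```

For a linear model on standardized features the difference is usually small, but the pipeline form keeps the validation scores free of this kind of leakage.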

4) Training with cross-validation

In [63]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge

In [64]:
# Tune alpha, the hyperparameter that controls the strength of Ridge's L2 penalty
alpha_value=[0.,1.,10.,100.,1000.]

for value in alpha_value:
    # Create the model instance
    ridge_model=Ridge(alpha=value)   # alpha default: 1.0

    # Train with cross-validation
    # cv: 3 folds / scoring: neg_mean_squared_error, r2
    result=cross_validate(ridge_model,scaled_x_train,y_train,cv=3,scoring=['neg_mean_squared_error','r2'],
                            return_train_score=True,return_estimator=True)

    result_df=pd.DataFrame(result)[['test_r2','train_r2']]
    result_df['diff']=abs(result_df['test_r2']-result_df['train_r2'])   # gap between validation and train R²: smaller means less overfitting
    best_idx=result_df['diff'].idxmin()   # index of the fold with the smallest gap (sort_values()[0] selects by label, not by position)

    # Coefficient (slope) of each feature for that fold's estimator: 13 features, so 13 values.
    # Comparing them across alpha shows how the penalty shrinks the coefficients.
    print(result['estimator'][best_idx].coef_)
    print()
    print(f'[Ridge(alpha {value})]')
    print(result_df,end='\n\n')
    print(f'best_idx: {best_idx}')
    print()


[-1.41407793  1.56590993  0.15536906  0.65522098 -2.36200159  2.31948624
  0.1173831  -3.59071105  2.71475429 -2.33252925 -1.88390034  1.04036915
 -3.50250877]

[Ridge(alpha 0.0)]
    test_r2  train_r2      diff
0  0.747022  0.755720  0.008699
1  0.756482  0.740082  0.016400
2  0.680801  0.786156  0.105355

best_idx: 0

[-1.39035961  1.53043843  0.11109741  0.6621853  -2.29024619  2.34249774
  0.10030677 -3.52062389  2.57481444 -2.20749462 -1.86406784  1.03607796
 -3.48102887]

[Ridge(alpha 1.0)]
    test_r2  train_r2      diff
0  0.748283  0.755663  0.007380
1  0.756292  0.740039  0.016253
2  0.680991  0.786097  0.105106

best_idx: 0

[-1.23221033  1.29302258 -0.12737786  0.70280521 -1.80949922  2.48028701
 -0.00860666 -2.99831755  1.75466332 -1.51704375 -1.73434856  1.00368486
 -3.30809117]

[Ridge(alpha 10.0)]
    test_r2  train_r2      diff
0  0.753103  0.752474  0.000629
1  0.755100  0.737457  0.017643
2  0.677471  0.783225  0.105755

best_idx
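
The loop above judges each alpha by the train/validation gap of individual folds. Another common way to compare the candidates, shown here as a sketch on synthetic `make_regression` data (an assumption; boston.csv is not loaded), is to average the validation R² over the folds and pick the alpha with the highest mean. As in the notebook, the data is scaled once up front.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape as the training set
X, y = make_regression(n_samples=379, n_features=13, noise=10.0, random_state=10)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    result = cross_validate(Ridge(alpha=alpha), X_scaled, y, cv=3,
                            scoring='r2', return_train_score=True)
    rows.append({'alpha': alpha,
                 'mean_test_r2': result['test_score'].mean(),
                 'mean_train_r2': result['train_score'].mean()})

# One row per alpha, with fold-averaged scores
summary_df = pd.DataFrame(rows).set_index('alpha')
best_alpha = summary_df['mean_test_r2'].idxmax()

print(summary_df)
print(f'best alpha: {best_alpha}')
```

Averaging over folds makes the comparison less sensitive to a single unusual split than picking one fold by its gap.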

Hyperparameter tuning and cross-validation in a single step

In [65]:
from sklearn.model_selection import GridSearchCV

In [66]:
# Set the Ridge hyperparameter grid
params={'alpha':[0.,0.1,0.5,1.0],
        'max_iter':[3,5]}   # 4 alpha values x 2 max_iter values = 8 Ridge candidates

In [67]:
# Create the instances
r_model=Ridge()
scv=GridSearchCV(r_model,params,cv=3,verbose=True,return_train_score=True)  # verbose=True: print progress

In [68]:
# Run the search
scv.fit(scaled_x_train,y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


In [72]:
# Inspect the best model and its parameters
best_model=scv.best_estimator_
best_model

In [74]:
resultdf=pd.DataFrame(scv.cv_results_)
resultdf

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_max_iter,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,0.001334,0.0004717637,0.000666,0.000471258,0.0,3,"{'alpha': 0.0, 'max_iter': 3}",0.747022,0.756482,0.680801,0.728101,0.033669,7,0.75572,0.740082,0.786156,0.760653,0.019131
1,0.001,2.973602e-07,0.001,1.123916e-07,0.0,5,"{'alpha': 0.0, 'max_iter': 5}",0.747022,0.756482,0.680801,0.728101,0.033669,7,0.75572,0.740082,0.786156,0.760653,0.019131
2,0.001029,4.062968e-05,0.000668,0.0004722743,0.1,3,"{'alpha': 0.1, 'max_iter': 3}",0.747159,0.756462,0.680831,0.728151,0.033675,5,0.75572,0.740081,0.786156,0.760652,0.019131
3,0.000781,0.0005683891,0.000336,0.0004747421,0.1,5,"{'alpha': 0.1, 'max_iter': 5}",0.747159,0.756462,0.680831,0.728151,0.033675,5,0.75572,0.740081,0.786156,0.760652,0.019131
4,0.000336,0.0004748545,0.000675,0.00047765,0.5,3,"{'alpha': 0.5, 'max_iter': 3}",0.747682,0.756385,0.680927,0.728331,0.033708,3,0.755705,0.74007,0.786141,0.760639,0.019129
5,0.00067,0.0004739555,0.0,0.0,0.5,5,"{'alpha': 0.5, 'max_iter': 5}",0.747682,0.756385,0.680927,0.728331,0.033708,3,0.755705,0.74007,0.786141,0.760639,0.019129
6,0.000727,0.0005189453,0.000335,0.0004740678,1.0,3,"{'alpha': 1.0, 'max_iter': 3}",0.748283,0.756292,0.680991,0.728522,0.033768,1,0.755663,0.740039,0.786097,0.7606,0.019124
7,0.000671,0.0004746864,0.000335,0.0004737306,1.0,5,"{'alpha': 1.0, 'max_iter': 5}",0.748283,0.756292,0.680991,0.728522,0.033768,1,0.755663,0.740039,0.786097,0.7606,0.019124
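
In the table, rows that differ only in `max_iter` have identical scores. This is what one would expect if Ridge uses a direct solver for this dense input, since `max_iter` only affects the iterative solvers; that reading is an interpretation, not something the notebook states. The sketch below, on synthetic `make_regression` data (an assumption), therefore searches over alpha only, reduces `cv_results_` to the columns needed for comparison, and scores the refitted best model on the held-out test set.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Fit the scaler on the training set, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

search = GridSearchCV(Ridge(), {'alpha': [0.0, 0.1, 0.5, 1.0]}, cv=3,
                      return_train_score=True)
search.fit(X_train_s, y_train)

# Keep only the columns needed to compare candidates
cols = ['param_alpha', 'mean_test_score', 'mean_train_score', 'rank_test_score']
summary_df = pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score')

print(summary_df)
print(search.best_params_, search.best_score_)

# best_estimator_ is refitted on the full training set (refit=True by default)
test_r2 = search.best_estimator_.score(X_test_s, y_test)
print(f'test R2: {test_r2:.3f}')
```

`best_score_` is the mean validation score of the winning candidate, so the separate test-set score is the one to report as the final estimate.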
