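### Boston House Price Prediction Model
- Dataset: boston.csv
- Learning method: supervised learning >> regression
- Features / independent variables: 13
- Target / dependent variable: 1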

In [50]:
# Load modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler
from sklearn.model_selection import train_test_split

In [51]:
# Load the data
dataDF=pd.read_csv('../data/boston.csv')
dataDF.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2    4.09    1  296.0     15.3   396.9   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8   396.9   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7   396.9   5.33  36.2


In [52]:
# Check basic information about the data
dataDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


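[2] Preprocessing
- [2-1] Data cleaning
    - Check for anomalous data: missing values, duplicates, outliers, and the unique values of each column

- [2-2] Standardization & normalization ==> whether this helps performance varies from case to case!!
    - Models that assume normally distributed data ==> StandardScaler, MinMaxScaler, log transform
    - Reducing differences in feature value ranges ==> feature scaling: MinMaxScaler, RobustScaler...
    - Categorical features ==> numeric encoding: OneHotEncoder, OrdinalEncoder
    - String targets ==> integer label encoding: LabelEncoder

- [2-3] Separate features and target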

In [53]:
featureDF=dataDF.iloc[:,:-1]
targetSR=dataDF['MEDV']

In [54]:
print(f'featureDF: {featureDF.shape} targetSR : {targetSR.shape}')

featureDF: (506, 13) targetSR : (506,)
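
The cleaning checks named in [2-1] are not run in this notebook. A minimal sketch of what they could look like follows. It uses a small made-up frame (`toyDF`, hypothetical values) rather than `boston.csv`, and the 1.5 * IQR rule is one common outlier convention, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dataDF (values are made up)
toyDF = pd.DataFrame({
    'RM':   [6.5, 6.4, 7.1, 6.4, np.nan, 6.9],
    'CHAS': [0, 0, 1, 0, 0, 0],
    'TAX':  [296.0, 242.0, 242.0, 242.0, 222.0, 2220.0],
})

# Missing values per column
missing_counts = toyDF.isnull().sum()

# Fully duplicated rows (the first occurrence is not counted)
duplicate_count = toyDF.duplicated().sum()

# Distinct values per column; very low counts hint at categorical features
unique_counts = toyDF.nunique()

# Outliers by the 1.5 * IQR rule, continuous columns only
# (a binary column such as CHAS has IQR 0 here, so the rule is not meaningful for it)
cont_cols = ['RM', 'TAX']
q1 = toyDF[cont_cols].quantile(0.25)
q3 = toyDF[cont_cols].quantile(0.75)
iqr = q3 - q1
outlier_mask = (toyDF[cont_cols] < q1 - 1.5 * iqr) | (toyDF[cont_cols] > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(missing_counts, duplicate_count, unique_counts, outlier_counts, sep='\n\n')
```

The same calls can be pointed at `dataDF`. According to the `info()` output above, every column has 506 non-null entries, so the missing-value check would report zeros there.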


[3] Training preparation
- [3-1] Split into a training dataset and a test dataset

In [55]:
X_train,X_test,y_train,y_test = train_test_split(featureDF,targetSR,random_state=10)

In [56]:
print(f'X_train : {X_train.shape} y_train : {y_train.shape}')
print(f'X_test : {X_test.shape} y_test : {y_test.shape}')

X_train : (379, 13) y_train : (379,)
X_test : (127, 13) y_test : (127,)
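
The 379 / 127 split follows from the defaults of `train_test_split`: when neither `test_size` nor `train_size` is given, `test_size` is 0.25 and the test count is rounded up. The sketch below reproduces the shapes with placeholder arrays of zeros (not the actual data).

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506
X = np.zeros((n_samples, 13))   # placeholder features, same shape as featureDF
y = np.zeros(n_samples)         # placeholder target, same length as targetSR

# No test_size/train_size given, so test_size falls back to 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=10)

expected_test = math.ceil(n_samples * 0.25)    # 126.5 rounded up
expected_train = n_samples - expected_test

print(X_tr.shape, X_te.shape)
print(expected_train, expected_test)
```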


- [3-2] Fit a scaler on the training dataset

In [57]:
## - The numeric features differ widely in value range!! => apply scaling
sScaler=StandardScaler()
sScaler.fit(X_train)

In [58]:
X_train_scaled=sScaler.transform(X_train)
X_test_scaled=sScaler.transform(X_test)
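
To illustrate why the scaler is fitted on the training split only, and how the three imported scalers differ, here is a small sketch on a single made-up feature with one extreme value. The arrays `train` and `test` are hypothetical and are not taken from the Boston data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single feature; the training split contains one extreme value
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test = np.array([[2.5], [200.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaler.fit(train)   # statistics come from the training split only
    scaled[type(scaler).__name__] = (scaler.transform(train), scaler.transform(test))

for name, (tr, te) in scaled.items():
    print(name, tr.ravel().round(3), te.ravel().round(3))
```

Because the test split is transformed with training statistics, its values can fall outside the training range: with `MinMaxScaler`, for example, the test value 200 maps above 1. `RobustScaler` centers on the median and scales by the interquartile range, so the extreme value has less influence on how the remaining points are scaled.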

[4] Training => carried out with cross-validation

In [59]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge

- cross_validate settings
    - cv : 3 folds
    - scoring : 'neg_mean_squared_error', 'r2'
    - return_train_score : True
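
As a rough illustration of these settings, the sketch below runs `cross_validate` on synthetic data from `make_regression` (a stand-in, not the Boston data) and inspects what comes back. Note that the scikit-learn scorer is named `'neg_mean_squared_error'`: the error is negated so that a higher score is always better, and flipping the sign recovers the MSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training split (not the Boston data)
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

result = cross_validate(Ridge(alpha=1.0), X_demo, y_demo,
                        cv=3,
                        scoring=['neg_mean_squared_error', 'r2'],
                        return_train_score=True)

# One array of length cv per key: timings plus train/test scores for each metric
print(sorted(result.keys()))

# Flip the sign to get the usual (non-negative) mean squared error per fold
test_mse = -result['test_neg_mean_squared_error']
print(test_mse)
print(result['test_r2'], result['train_r2'])
```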

In [62]:
### Tune the hyper-parameter that drives model performance
alpha_values=[0.,1.,10,100]
for value in alpha_values:
    # Create the model instance
    ridge_model=Ridge(alpha=value,max_iter=3)    # alpha defaults to 1.0

    # Note: the unscaled X_train is passed here, not X_train_scaled from [3-2]
    result=cross_validate(ridge_model,
                        X_train,y_train,
                        cv=3,scoring=['neg_mean_squared_error','r2'],
                        return_train_score=True,
                        return_estimator=True)

    resultDF=pd.DataFrame(result)[['test_r2','train_r2']]
    resultDF['diff']=resultDF['test_r2']-resultDF['train_r2']
    # Smallest absolute train/test R2 gap across the folds (a value, not an index)
    min_diff=abs(resultDF['diff']).min()
    print(min_diff)
    print(result['estimator'][0].coef_)    # coefficients of the first fold's estimator
    print(f'[Ridge(alpha={value})]')
    print(resultDF,end='\n\n')

0.008698695430572112
[-1.52744153e-01  6.51063403e-02  2.23088731e-02  2.74434110e+00
 -2.00944768e+01  3.46160017e+00  4.16479761e-03 -1.69169487e+00
  3.04135898e-01 -1.35369447e-02 -8.76435913e-01  1.08560072e-02
 -4.92976820e-01]
[Ridge(alpha=0.0)]
    test_r2  train_r2      diff
0  0.747022  0.755720 -0.008699
1  0.756482  0.740082  0.016400
2  0.680801  0.786156 -0.105355

0.005855089203014807
[-1.45670213e-01  6.68220286e-02 -2.18282920e-02  2.56307944e+00
 -8.65836330e+00  3.55215282e+00 -6.35515281e-03 -1.51026733e+00
  2.70526000e-01 -1.40310943e-02 -7.46191873e-01  1.13116933e-02
 -5.06006268e-01]
[Ridge(alpha=1.0)]
    test_r2  train_r2      diff
0  0.756784  0.750929  0.005855
1  0.752085  0.736984  0.015102
2  0.671653  0.784746 -0.113093

0.01327365193147434
[-0.14374244  0.07005825 -0.05203339  1.57086862 -1.42210754  3.2681783
 -0.00899405 -1.39725863  0.26539344 -0.01493116 -0.67912898  0.01156032
 -0.53911292]
[Ridge(alpha=10)]
    test_r2  train_r2      diff
0  0.76