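### Feature Scaling and Normalization
- Feature scaling
    - Rescaling variables that have different value ranges onto a common scale
- Methods
    - Z-scaling
        - Standardization
        - Transforms each feature to have mean 0 and variance 1 (the values are shifted and rescaled; the shape of the distribution is unchanged, so the result is not necessarily Gaussian)
        - StandardScaler class in sklearn.preprocessing
        - z = (x_i - mean(x)) / std(x)
        - Algorithms for which standardization is typically applied beforehand
            - Regression algorithms (linear regression (regression), logistic regression (classification))
            - SVM
    - Min-Max scaling
        - Transforms values into the range 0 to 1
        - Maps the minimum to 0 and the maximum to 1
        - MinMaxScaler class in sklearn.preprocessing
        - x_new = (x_i - min(x)) / (max(x) - min(x))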
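
The two formulas in the list above can be checked by hand on a small array. The following is a minimal sketch using NumPy only; the array `x` is a hypothetical toy feature chosen for illustration and is not part of the iris data used below.

```python
import numpy as np

# Hypothetical toy feature, chosen only for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization: subtract the mean, then divide by the
# population standard deviation (NumPy's default, ddof=0).
z = (x - x.mean()) / x.std()

# Min-Max scaling: the minimum maps to 0 and the maximum to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

print('standardized :', z)
print('min-max      :', x_minmax)
```

After standardization the values have mean 0 and standard deviation 1; after Min-Max scaling they lie between 0 and 1 with the original ordering preserved.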

### Using StandardScaler

In [1]:
from sklearn.datasets import load_iris # loader for the iris dataset
import pandas as pd

In [3]:
iris = load_iris()

In [5]:
iris_df = pd.DataFrame(data = iris.data, columns = iris.feature_names)
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [6]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
scaler.fit(iris_df) # compute each column's mean and standard deviation (the statistics used for scaling)
iris_sc_df = pd.DataFrame(data = scaler.transform(iris_df),
                         columns = iris.feature_names)
iris_sc_df

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 |
| ... | ... | ... | ... | ... |
| 145 | 1.038005 | -0.131979 | 0.819596 | 1.448832 |
| 146 | 0.553333 | -1.282963 | 0.705921 | 0.922303 |
| 147 | 0.795669 | -0.131979 | 0.819596 | 1.053935 |
| 148 | 0.432165 | 0.788808 | 0.933271 | 1.448832 |
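
The separate `fit` and `transform` calls above can also be written as a single `fit_transform` call when the same data is used for both steps. The sketch below rebuilds `iris_df` so that it runs on its own and checks that the two forms give the same result; the names `two_step` and `one_step` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Two-step form: learn the statistics, then apply them.
two_step = StandardScaler().fit(iris_df).transform(iris_df)

# One-step shorthand on the same data.
one_step = StandardScaler().fit_transform(iris_df)

print(np.allclose(two_step, one_step))  # True
```

Keeping `fit` separate is still useful when the statistics learned from one dataset need to be applied to another.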


In [9]:
print('Feature means before scaling :', iris_df.mean())

Feature means before scaling : sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
dtype: float64


In [10]:
print('Feature means after scaling :', iris_sc_df.mean())

Feature means after scaling : sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.842970e-15
petal length (cm)   -1.698641e-15
petal width (cm)    -1.409243e-15
dtype: float64


In [11]:
print('Feature variances after scaling :', iris_sc_df.var())

Feature variances after scaling : sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
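
The variances above come out as 1.006711 rather than exactly 1. This is because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `var()` defaults to the sample variance (`ddof=1`); with 150 rows the two differ by a factor of 150/149. A minimal sketch that rebuilds the scaled frame and compares both definitions:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_sc_df = pd.DataFrame(StandardScaler().fit_transform(iris_df),
                          columns=iris.feature_names)

n = len(iris_sc_df)                       # number of rows (150)
sample_var = iris_sc_df.var()             # pandas default, ddof=1
population_var = iris_sc_df.var(ddof=0)   # matches StandardScaler's definition

print('ddof=1 :', sample_var.round(6).tolist())
print('ddof=0 :', population_var.round(6).tolist())
print('n / (n - 1) :', round(n / (n - 1), 6))
```

With `ddof=0` every column's variance is 1 up to floating-point error, which is the same reason the means printed earlier are on the order of 1e-15 instead of exactly 0.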


### MinMaxScaler
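- Apply Min-Max scaling when the distribution of the data is far from Gaussian
- Use it when domain knowledge gives reasonable confidence that the minimum and maximum of the column (feature) will not change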
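
The second point matters because `MinMaxScaler` remembers the minimum and maximum it saw during `fit`. The sketch below uses hypothetical single-feature arrays (`train` and `new` are made up for illustration) and the scaler's default settings to show what happens when later values fall outside that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training values for a single feature (min 10, max 30).
train = np.array([[10.0], [20.0], [30.0]])

# Hypothetical later values: one below and one above the training range.
new = np.array([[5.0], [40.0]])

scaler = MinMaxScaler().fit(train)   # learns min=10, max=30
scaled_new = scaler.transform(new)   # applies (x - 10) / (30 - 10)

print(scaled_new.ravel())  # [-0.25  1.5 ]
```

Values outside the fitted range land outside 0 to 1, so the guarantee only holds as long as the feature's true minimum and maximum stay stable.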

In [12]:
from sklearn.preprocessing import MinMaxScaler

In [14]:
m_scaler = MinMaxScaler()
m_scaler.fit(iris_df) # compute each column's minimum and maximum
iris_sc_mx = m_scaler.transform(iris_df)
iris_mx_df = pd.DataFrame(iris_sc_mx, columns = iris.feature_names)
iris_mx_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.222222 | 0.625 | 0.067797 | 0.041667 |
| 1 | 0.166667 | 0.416667 | 0.067797 | 0.041667 |
| 2 | 0.111111 | 0.5 | 0.050847 | 0.041667 |
| 3 | 0.083333 | 0.458333 | 0.084746 | 0.041667 |
| 4 | 0.194444 | 0.666667 | 0.067797 | 0.041667 |


In [15]:
print('Feature maximums :', iris_mx_df.max())
print('Feature minimums :', iris_mx_df.min())

Feature maximums : sepal length (cm)    1.0
sepal width (cm)     1.0
petal length (cm)    1.0
petal width (cm)     1.0
dtype: float64
Feature minimums : sepal length (cm)    0.0
sepal width (cm)     0.0
petal length (cm)    0.0
petal width (cm)     0.0
dtype: float64
