# AI & Data Mining, Day 4 (2020.07.30)
- ① 
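- ② Use a variety of graphs suited to the purpose by passing different parameter options
- ③ Handle missing values with pandas
- ④ Handling missing values is the first step of preprocessing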

# 05. Data Preprocessing
## Feature Scaling
### Overview
- Problem: when the value ranges (scales) of the features differ widely, the feature with the largest scale has the greatest influence on training
- Normalization: (x - x_min) / (x_max - x_min)
- Standardization: (x - mean) / standard deviation
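
The two formulas above can be checked by hand with NumPy before using the scikit-learn scalers. This is a minimal sketch on a small illustrative array (the values are an assumption chosen for the example, not the full data set). It uses the population standard deviation (`ddof=0`, the NumPy default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Small illustrative feature column (assumed values for this example)
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Normalization (min-max): rescale the values into the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to mean 0 and scale to standard deviation 1
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
```

After normalization the smallest value maps to 0 and the largest to 1. After standardization the column has mean 0 and standard deviation 1.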


In [3]:
import pandas as pd
df = pd.read_csv("./data/iris.csv")
df.index.name = 'record'

cols = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
print(df[cols])
df.describe().transpose()

        sepal.length  sepal.width  petal.length  petal.width
record                                                      
0                5.1          3.5           1.4          0.2
1                4.9          3.0           1.4          0.2
2                4.7          3.2           1.3          0.2
3                4.6          3.1           1.5          0.2
4                5.0          3.6           1.4          0.2
...              ...          ...           ...          ...
145              6.7          3.0           5.2          2.3
146              6.3          2.5           5.0          1.9
147              6.5          3.0           5.2          2.0
148              6.2          3.4           5.4          2.3
149              5.9          3.0           5.1          1.8

[150 rows x 4 columns]


              count      mean       std  min  25%   50%  75%  max
sepal.length  150.0  5.843333  0.828066  4.3  5.1  5.80  6.4  7.9
sepal.width   150.0  3.057333  0.435866  2.0  2.8  3.00  3.3  4.4
petal.length  150.0  3.758000  1.765298  1.0  1.6  4.35  5.1  6.9
petal.width   150.0  1.199333  0.762238  0.1  0.3  1.30  1.8  2.5


In [4]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# normalize the data and store it in the out_scaled numpy array
out_scaled = scaler.fit_transform(df[cols])
print(out_scaled)

[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]
 [0.30555556 0.79166667 0.11864407 0.125     ]
 [0.08333333 0.58333333 0.06779661 0.08333333]
 [0.19444444 0.58333333 0.08474576 0.04166667]
 [0.02777778 0.375      0.06779661 0.04166667]
 [0.16666667 0.45833333 0.08474576 0.        ]
 [0.30555556 0.70833333 0.08474576 0.04166667]
 [0.13888889 0.58333333 0.10169492 0.04166667]
 [0.13888889 0.41666667 0.06779661 0.        ]
 [0.         0.41666667 0.01694915 0.        ]
 [0.41666667 0.83333333 0.03389831 0.04166667]
 [0.38888889 1.         0.08474576 0.125     ]
 [0.30555556 0.79166667 0.05084746 0.125     ]
 [0.22222222 0.625      0.06779661 0.08333333]
 [0.38888889 0.75       0.11864407 0.08333333]
 [0.22222222 0.75       0.08474576 0.08333333]
 [0.30555556 0.58333333 0.11864407 0.04166667]
 [0.22222222 

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# standardize the data and store in out_scaled numpy array
out_scaled = scaler.fit_transform(df[cols])
print(out_scaled)

[[-9.00681170e-01  1.01900435e+00 -1.34022653e+00 -1.31544430e+00]
 [-1.14301691e+00 -1.31979479e-01 -1.34022653e+00 -1.31544430e+00]
 [-1.38535265e+00  3.28414053e-01 -1.39706395e+00 -1.31544430e+00]
 [-1.50652052e+00  9.82172869e-02 -1.28338910e+00 -1.31544430e+00]
 [-1.02184904e+00  1.24920112e+00 -1.34022653e+00 -1.31544430e+00]
 [-5.37177559e-01  1.93979142e+00 -1.16971425e+00 -1.05217993e+00]
 [-1.50652052e+00  7.88807586e-01 -1.34022653e+00 -1.18381211e+00]
 [-1.02184904e+00  7.88807586e-01 -1.28338910e+00 -1.31544430e+00]
 [-1.74885626e+00 -3.62176246e-01 -1.34022653e+00 -1.31544430e+00]
 [-1.14301691e+00  9.82172869e-02 -1.28338910e+00 -1.44707648e+00]
 [-5.37177559e-01  1.47939788e+00 -1.28338910e+00 -1.31544430e+00]
 [-1.26418478e+00  7.88807586e-01 -1.22655167e+00 -1.31544430e+00]
 [-1.26418478e+00 -1.31979479e-01 -1.34022653e+00 -1.44707648e+00]
 [-1.87002413e+00 -1.31979479e-01 -1.51073881e+00 -1.44707648e+00]
 [-5.25060772e-02  2.16998818e+00 -1.45390138e+00 -1.31544430e

## Example: improving matches from a dating site with kNN
### Everyone Hellen has dated falls into one of the following three types
- People she didn't like
- People she liked in small doses
- People she liked in large doses

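### Hellen wants to see "People she liked in small doses" on weekdays and "People she liked in large doses" on weekends
### Hellen's request
- Help her classify the people she will meet in the future into the three types
- Collect additional data that is not on the dating site but could help with the classification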

In [14]:
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv("./data/knnDatingTestSet2.tsv", sep='\t')
print(df.head())
print(df.describe())

   miles       game  icecream  level
0  40920   8.326976  0.953952      3
1  14488   7.153469  1.673904      2
2  26052   1.441871  0.805124      1
3  75136  13.147394  0.428964      1
4  38344   1.669788  0.134296      1
              miles         game     icecream        level
count   1000.000000  1000.000000  1000.000000  1000.000000
mean   33635.421000     6.559961     0.832073     1.985000
std    21957.006833     4.243618     0.497239     0.818196
min        0.000000     0.000000     0.001156     1.000000
25%    13796.000000     2.933963     0.408995     1.000000
50%    31669.000000     6.595204     0.809420     2.000000
75%    47716.250000    10.056500     1.272847     3.000000
max    91273.000000    20.919349     1.695517     3.000000


In [15]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.loc[:, df.columns != 'level'], df['level'], test_size=.33, random_state=100)

In [16]:
print(type(x_train))
print(x_train.head())

<class 'pandas.core.frame.DataFrame'>
     miles       game  icecream
3    75136  13.147394  0.428964
538  45944   9.213215  0.797743
709  42545   3.677287  0.244167
224  15987   2.037080  0.715243
370  51542   6.517133  0.402519


In [17]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 3)
classifier.fit(x_train, y_train)
print(classifier)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')


In [18]:
print(classifier.score(x_test, y_test))

0.7696969696969697
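
One likely reason for the low score is that the classifier uses a Euclidean distance (`metric='minkowski'`, `p=2` in the output above), so the feature with the largest range dominates the distance. The sketch below takes the first two records from the `df.head()` output and shows how much each feature contributes to their squared distance. It illustrates the effect on one pair of rows only and is not a full diagnosis of the model.

```python
import numpy as np

# First two records from the df.head() output: (miles, game, icecream)
a = np.array([40920, 8.326976, 0.953952])
b = np.array([14488, 7.153469, 1.673904])

# Per-feature squared differences that add up to the squared Euclidean distance
sq_diff = (a - b) ** 2

# Fraction of the squared distance contributed by each feature
share = sq_diff / sq_diff.sum()

for name, s in zip(["miles", "game", "icecream"], share):
    print(f"{name}: {s:.10f}")
```

For this pair, almost the entire distance comes from `miles`, so `game` and `icecream` barely affect which neighbors are selected.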


## Normalization (MinMax)

In [19]:
# load module and instantiate scaler object
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# normalize the data and store it in the minmax_scaled DataFrame
minmax_scaled = pd.DataFrame(scaler.fit_transform(df.loc[:, df.columns != 'level']), columns=['miles', 'games', 'icecreams'])
print(minmax_scaled)

        miles     games  icecreams
0    0.448325  0.398051   0.562334
1    0.158733  0.341955   0.987244
2    0.285429  0.068925   0.474496
3    0.823201  0.628480   0.252489
4    0.420102  0.079820   0.078578
..        ...       ...        ...
995  0.122106  0.163037   0.372224
996  0.754287  0.476818   0.394621
997  0.291159  0.509103   0.510795
998  0.527111  0.436655   0.429005
999  0.479408  0.376809   0.785718

[1000 rows x 3 columns]


In [23]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(minmax_scaled, df['level'], test_size=.33, random_state=100)
print(type(x_train))
print(x_train.head())

<class 'pandas.core.frame.DataFrame'>
        miles     games  icecreams
3    0.823201  0.628480   0.252489
538  0.503369  0.440416   0.470140
709  0.466129  0.175784   0.143423
224  0.175156  0.097378   0.421449
370  0.564701  0.311536   0.236882


In [24]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 3)
classifier.fit(x_train, y_train)
print(classifier)
print(classifier.score(x_test, y_test))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
0.9606060606060606


## Standardization (StandardScaler)

In [25]:
# load module and instantiate scaler object
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# standardize the data and store it in the std_scaled DataFrame
std_scaled = pd.DataFrame(scaler.fit_transform(df.loc[:, df.columns != 'level']), columns=['miles', 'games', 'icecreams'])
print(std_scaled)

        miles     games  icecreams
0    0.331932  0.416602   0.245234
1   -0.872478  0.139929   1.693857
2   -0.345549 -1.206671  -0.054224
3    1.891029  1.553092  -0.811100
4    0.214553 -1.152936  -1.404005
..        ...       ...        ...
995 -1.024806 -0.742505  -0.402895
996  1.604417  0.805083  -0.326537
997 -0.321718  0.964316   0.069526
998  0.659599  0.606995  -0.209316
999  0.461203  0.311833   1.006806

[1000 rows x 3 columns]


In [26]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(std_scaled, df['level'], test_size=.33, random_state=100)
print(type(x_train))
print(x_train.head())

<class 'pandas.core.frame.DataFrame'>
        miles     games  icecreams
3    1.891029  1.553092  -0.811100
538  0.560857  0.625547  -0.069076
709  0.405977 -0.679636  -1.182932
224 -0.804174 -1.066341  -0.235075
370  0.815937 -0.010097  -0.864310


In [27]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 3)
classifier.fit(x_train, y_train)
print(classifier)
print(classifier.score(x_test, y_test))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
0.9575757575757575
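
The cells above fit the scaler on all 1000 rows before splitting, so the test rows also influence the scaling statistics. A common alternative is to put the scaler and the classifier in one pipeline, so that the scaler is fitted on the training split only. The sketch below shows that pattern on synthetic stand-in data, which is an assumption made so the example runs without the course file. The feature names `noise` and `signal` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the dating data set):
# a large-scale feature unrelated to the label and a small-scale feature that decides it
rng = np.random.default_rng(0)
n = 600
noise = rng.uniform(0, 90000, size=n)
signal = rng.uniform(0, 2, size=n)
X = np.column_stack([noise, signal])
y = (signal > 1).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=100)

# kNN on the raw features
raw = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)

# Scaler and kNN in one pipeline: fit() learns the mean and standard deviation
# from x_train only, and score() reuses them to transform x_test
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled.fit(x_train, y_train)

raw_score = raw.score(x_test, y_test)
scaled_score = scaled.score(x_test, y_test)
print(raw_score, scaled_score)
```

With this setup the same pipeline object can be reused for new records, because it applies the training-set scaling before predicting.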


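## Handling Categorical Data
### Inputs to data mining and machine learning methods
- Almost all of them are designed to handle continuous variables and integer inputs
- Strings and categorical data cannot be passed in as input directly
- Strings and categorical data must be encoded into a numeric form that can stand in for the original values

### Whether a variable is ordered
- Categorical / ordinal: shoe size (230, 270, 280, ...)
- Categorical / nominal: T-shirt color (red, blue, green, ...)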

### Encoding methods
- <font size = 5 color = "red">One Hot Encoding, Label Encoding, Ordinal Encoding,</font>
- Helmert Encoding, Binary Encoding, Frequency Encoding, Mean Encoding
- Weight of Evidence Encoding, Probability Ratio Encoding, Hashing Encoding
- Backward Difference Encoding, Leave One Out Encoding, James-Stein Encoding, M-estimator Encoding
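
As a rough illustration of the first two methods in the list, the sketch below encodes a nominal column with One Hot Encoding (via `pandas.get_dummies`) and with Label Encoding (via scikit-learn's `LabelEncoder`). The `colors` frame is a hypothetical example, not part of the course data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal column (illustrative values, not from the course data)
colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One Hot Encoding: one 0/1 column per category, so no order is implied
one_hot = pd.get_dummies(colors["color"], dtype=int)
print(one_hot)

# Label Encoding: each category becomes an integer code;
# LabelEncoder assigns the codes in sorted (alphabetical) order of the classes
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(colors["color"])
print(list(label_encoder.classes_), labels)
```

Integer codes suggest an order (`blue` < `green` < `red`) that a nominal variable does not have, which is why One Hot Encoding is often preferred for nominal features used as model inputs.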

### Ordinal Data Encoding
- Categorical / ordinal data: T-shirt size, shoe size
- Encode it with scikit-learn's OrdinalEncoder
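
A minimal sketch of `OrdinalEncoder` on T-shirt sizes follows. The `sizes` frame and the category order are assumptions made for illustration. The order is passed explicitly through `categories`, because otherwise the encoder sorts the categories alphabetically, which would not match the size order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical T-shirt sizes (illustrative values, not from the course data)
sizes = pd.DataFrame({"size": ["M", "S", "XL", "L", "S"]})

# One list of ordered categories per input column: S < M < L < XL
encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])

# fit_transform expects a 2D input, hence the double brackets
encoded = encoder.fit_transform(sizes[["size"]])
print(encoded.ravel())
```

Each size is replaced by its position in the given order, so `S` becomes 0 and `XL` becomes 3.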

In [31]:
df = pd.read_csv("./data/long_jump.csv")
df.set_index('Person', inplace=True)