# 1. Data problems

## Data quality problems

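- Features have different minima and maxima -> the scale of a feature affects the y value
- How should ordinal or nominal values be represented?
- How should incorrectly entered values be handled?
- What should be done when a value is missing?
- Should extremely large or small values be left as they are?

Which of these problems will we solve?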

## Data preprocessing issues

- Missing data (handling missing values)
- Handling labeled (categorical) data
- Features whose scales differ greatly

---

# 2. Missing Values

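## Strategies for missing data

- Drop any sample that has missing data
- Set a **minimum number** of values a sample must have and **drop the samples** that fall short
- For a feature with almost no data, **drop the feature itself**
- Fill the empty cells with the mode or the mean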

## Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Example from - https://chrisalbon.com/python/pandas_missing_data.html
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}

In [3]:
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])

In [4]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


Row index 1 contains NaN in every column.

## Data drop

- Count the NaN values in each column

In [5]:
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [6]:
df.isnull().sum()

first_name       1
last_name        1
age              1
sex              1
preTestScore     2
postTestScore    2
dtype: int64

What matters is not the absolute number of missing values but what fraction of the whole dataset is missing. So divide the counts by the number of rows:

In [7]:
df.isnull().sum() / len(df)

first_name       0.2
last_name        0.2
age              0.2
sex              0.2
preTestScore     0.4
postTestScore    0.4
dtype: float64
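The missing ratio can drive the "drop the feature itself" strategy directly. The sketch below is only an illustration: the `toy` frame and the 0.5 cutoff are assumptions, not values from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "location" is entirely missing.
toy = pd.DataFrame({
    "age": [42, np.nan, 36, 24, 73],
    "preTestScore": [4, np.nan, np.nan, 2, 3],
    "location": [np.nan] * 5,
})

max_missing_ratio = 0.5  # assumed cutoff, tune per dataset

# Fraction of missing values per column (mean of a boolean mask).
missing_ratio = toy.isnull().mean()

# Keep only the columns whose missing ratio is at or below the cutoff.
kept = toy.loc[:, missing_ratio <= max_missing_ratio]
print(list(kept.columns))
```

`isnull().mean()` gives the same numbers as `isnull().sum() / len(df)`, since the mean of a boolean mask is the fraction of `True` entries.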

- `dropna()` -> every row that contains a NaN is removed

In [8]:
df_no_missing = df.dropna()

In [9]:
df_no_missing

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


- Drop a row only when all of its values are missing (`how='all'`)

In [10]:
df_cleaned = df.dropna(how='all')

In [11]:
df_cleaned

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


- Add a column that contains only NaN

In [12]:
df['location'] = np.nan

In [13]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,,,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


- Drop by column (`axis=1`): columns whose values are all NaN are removed

In [14]:
df.dropna(axis=1, how='all')

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


- Drop columns that have fewer than 3 non-NaN values
- In other words, a column needs at least 3 existing values to survive

In [15]:
df.dropna(axis=1, thresh=3)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [16]:
df.dropna(axis=1, thresh=4)

Unnamed: 0,first_name,last_name,age,sex
0,Jason,Miller,42.0,m
1,,,,
2,Tina,Ali,36.0,f
3,Jake,Milner,24.0,m
4,Amy,Cooze,73.0,f
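Because `thresh` counts non-NaN values rather than missing ones, it is easy to misread. As a minimal sketch on a hypothetical `toy` frame, the same selection can be written out with an explicit count:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [42, np.nan, 36, 24, 73],            # 4 non-NaN values
    "preTestScore": [4, np.nan, np.nan, 2, 3],  # 3 non-NaN values
    "location": [np.nan] * 5,                   # 0 non-NaN values
})

# thresh=3 keeps the columns that have at least 3 non-NaN values.
by_thresh = toy.dropna(axis=1, thresh=3)

# The same selection written out explicitly.
non_nan_counts = toy.notnull().sum()
by_mask = toy.loc[:, non_nan_counts >= 3]

print(by_thresh.equals(by_mask))
```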


- Drop rows that have fewer than 5 non-NaN values

In [17]:
df.dropna(thresh=5) # default axis = 0

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


## Data Fill

- Use the mean, the median, or the mode
    - Mean: the sum of the given numbers divided by how many there are
    - Median: the value in the middle when the given values are sorted by size
    - Mode: the most frequently observed value, i.e. the value that occurs most often

<img src="../../img/Screen Shot 2019-03-20 at 5.42.38 PM.png" width="700">

In [18]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,,,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


In [19]:
df["preTestScore"].mean()

3.0

In [20]:
df["postTestScore"].median()

62.0

In [21]:
df["postTestScore"].mode()

0    25.0
1    62.0
2    70.0
dtype: float64
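As the output above shows, `mode()` returns a Series because several values can tie. A minimal sketch of filling with the mode and the median follows; the `toy` data is hypothetical and was chosen so that the mode is unique.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sex": ["m", np.nan, "f", "m", "f", "m"],
    "postTestScore": [25, np.nan, np.nan, 62, 70, 62],
})

# mode() returns a Series (ties are possible), so take its first entry.
most_common_sex = toy["sex"].mode().iloc[0]
filled_sex = toy["sex"].fillna(most_common_sex)

# The median is less sensitive to extreme values than the mean.
filled_score = toy["postTestScore"].fillna(toy["postTestScore"].median())

print(filled_sex.tolist())
print(filled_score.tolist())
```

When all values tie, as in the `postTestScore` column of the notebook, the first entry of `mode()` is simply the smallest tied value, so the mode is not a meaningful fill value there.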

- Fill every missing value with 0

In [22]:
df.fillna(0)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,0.0
1,0,0,0.0,0,0.0,0.0,0.0
2,Tina,Ali,36.0,f,0.0,0.0,0.0
3,Jake,Milner,24.0,m,2.0,62.0,0.0
4,Amy,Cooze,73.0,f,3.0,70.0,0.0


- Fill the missing values in the preTestScore column with the mean of preTestScore

In [23]:
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())

In [24]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


- Fill with the mean computed separately for each sex

In [25]:
df.groupby("sex")["postTestScore"].mean()

sex
f    70.0
m    43.5
Name: postTestScore, dtype: float64

In [26]:
df.groupby("sex")["postTestScore"].transform("mean")

0    43.5
1     NaN
2    70.0
3    43.5
4    70.0
Name: postTestScore, dtype: float64

In [27]:
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))

In [28]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,70.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
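Row 1 is still NaN after the group-wise fill because its `sex` is missing, so it belongs to no group. One possible way to handle such rows, sketched here on a hypothetical `toy` frame, is to fall back to the overall mean:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sex": ["m", np.nan, "f", "m", "f"],
    "postTestScore": [25, np.nan, np.nan, 62, 70],
})

# Group means aligned to the original rows; rows with a missing key get NaN.
group_mean = toy.groupby("sex")["postTestScore"].transform("mean")

# First fill from the group mean, then fall back to the overall mean.
filled = (
    toy["postTestScore"]
    .fillna(group_mean)
    .fillna(toy["postTestScore"].mean())
)
print(filled.tolist())
```

Whether the overall mean is a sensible fallback depends on the data. Dropping the row, as `dropna(how='all')` did earlier, is an equally reasonable choice here.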


- Show only the rows where both age and sex are not null

In [29]:
df[df['age'].notnull() & df['sex'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
2,Tina,Ali,36.0,f,3.0,70.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
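The same filter can also be expressed with the `subset` argument of `dropna`, which restricts the NaN check to the listed columns. A minimal sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [42, np.nan, 36],
    "sex": ["m", np.nan, "f"],
    "location": [np.nan, np.nan, np.nan],
})

# Boolean-mask version, as in the cell above.
by_mask = toy[toy["age"].notnull() & toy["sex"].notnull()]

# dropna restricted to the columns that must be present.
by_subset = toy.dropna(subset=["age", "sex"])

print(by_mask.equals(by_subset))
```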


---

# 3. Category data

- How should discrete (categorical) data be handled?

<img src="../../img/Screen Shot 2019-03-20 at 7.32.31 PM.png" width="600">

## Data type

In [30]:
edges = pd.DataFrame({'source': [0, 1, 2],
                      'target': [2, 2, 3],
                      'weight': [3, 4, 5],
                      'color': ['red', 'green', 'blue']})

In [31]:
edges

Unnamed: 0,source,target,weight,color
0,0,2,3,red
1,1,2,4,green
2,2,3,5,blue


In [32]:
edges["source"]

0    0
1    1
2    2
Name: source, dtype: int64

In [33]:
edges["color"]

0      red
1    green
2     blue
Name: color, dtype: object

## One Hot Encoding

- Use `get_dummies()`
- Columns whose dtype is object are one-hot encoded.

In [34]:
pd.get_dummies(edges)

Unnamed: 0,source,target,weight,color_blue,color_green,color_red
0,0,2,3,0,0,1
1,1,2,4,0,1,0
2,2,3,5,1,0,0


In [35]:
pd.get_dummies(edges["color"])

Unnamed: 0,blue,green,red
0,0,0,1
1,0,1,0
2,1,0,0


In [36]:
pd.get_dummies(edges[["color"]])

Unnamed: 0,color_blue,color_green,color_red
0,0,0,1
1,0,1,0
2,1,0,0


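- Ordered (graded) data -> One Hot Encoding

For ordered data the numeric codes are not absolute quantities (e.g. 3: M, 4: L, 5: XL, or grade 1, grade 2, grade 3), so they should be converted to categorical data with one-hot encoding. Note that one-hot encoding discards the order information.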

In [37]:
weight_dict = {3:"M", 4:"L", 5:"XL"}

In [38]:
edges["weight_sign"] = edges["weight"].map(weight_dict) # add a new column

In [39]:
edges

Unnamed: 0,source,target,weight,color,weight_sign
0,0,2,3,red,M
1,1,2,4,green,L
2,2,3,5,blue,XL


In [40]:
edges = pd.get_dummies(edges)

In [41]:
edges

Unnamed: 0,source,target,weight,color_blue,color_green,color_red,weight_sign_L,weight_sign_M,weight_sign_XL
0,0,2,3,0,0,1,0,1,0
1,1,2,4,0,1,0,1,0,0
2,2,3,5,1,0,0,0,0,1


In [42]:
edges.values

array([[0, 2, 3, 0, 0, 1, 0, 1, 0],
       [1, 2, 4, 0, 1, 0, 1, 0, 0],
       [2, 3, 5, 1, 0, 0, 0, 0, 1]])

##  Data binning

- When values are spread too widely, divide them into intervals (bins) to narrow them down!

<img src="../../img/Screen Shot 2019-03-20 at 7.59.59 PM.png" width="400">

The equal-width method, the upper one in the figure, is used more often.
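A rough sketch of the difference, assuming the other method in the figure is equal-frequency binning (the figure itself is not reproduced here). The `values` series is hypothetical and contains one extreme value to make the contrast visible.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: 4 bins that each span the same range of values.
equal_width = pd.cut(values, bins=4)

# Equal frequency: 5 bins that each hold the same number of samples.
equal_freq = pd.qcut(values, q=5)

# Count the samples per bin, listed in bin order.
print(equal_width.value_counts().sort_index().tolist())
print(equal_freq.value_counts().sort_index().tolist())
```

With equal-width bins the single extreme value leaves the middle bins empty, while equal-frequency bins stay balanced at the cost of very uneven interval widths.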

In [43]:
# Example from - https://chrisalbon.com/python/pandas_binning_data.html

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

In [44]:
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])

In [45]:
df

Unnamed: 0,regiment,company,name,preTestScore,postTestScore
0,Nighthawks,1st,Miller,4,25
1,Nighthawks,1st,Jacobson,24,94
2,Nighthawks,2nd,Ali,31,57
3,Nighthawks,2nd,Milner,2,62
4,Dragoons,1st,Cooze,3,70
5,Dragoons,1st,Jacon,4,25
6,Dragoons,2nd,Ryaner,24,94
7,Dragoons,2nd,Sone,31,57
8,Scouts,1st,Sloan,2,62
9,Scouts,1st,Piger,3,70
10,Scouts,2nd,Riani,2,62
11,Scouts,2nd,Ali,3,70


- The data can be divided into intervals

In [46]:
bins = [0, 25, 50, 75, 100] # Define bins as 0 to 25, 25 to 50, 50 to 75, 75 to 100

In [47]:
group_names = ['Low', 'Okay', 'Good', 'Great'] # bin labels

In [48]:
categories = pd.cut(df['postTestScore'], bins, labels=group_names) # cut, then assign the result to categories

In [49]:
categories

0       Low
1     Great
2      Good
3      Good
4      Good
5       Low
6     Great
7      Good
8      Good
9      Good
10     Good
11     Good
Name: postTestScore, dtype: category
Categories (4, object): [Low < Okay < Good < Great]

- Assign the result to the existing DataFrame

In [50]:
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

In [51]:
df

Unnamed: 0,regiment,company,name,preTestScore,postTestScore,categories
0,Nighthawks,1st,Miller,4,25,Low
1,Nighthawks,1st,Jacobson,24,94,Great
2,Nighthawks,2nd,Ali,31,57,Good
3,Nighthawks,2nd,Milner,2,62,Good
4,Dragoons,1st,Cooze,3,70,Good
5,Dragoons,1st,Jacon,4,25,Low
6,Dragoons,2nd,Ryaner,24,94,Great
7,Dragoons,2nd,Sone,31,57,Good
8,Scouts,1st,Sloan,2,62,Good
9,Scouts,1st,Piger,3,70,Good
10,Scouts,2nd,Riani,2,62,Good
11,Scouts,2nd,Ali,3,70,Good


In [52]:
df['categories'].value_counts()

Good     8
Great    2
Low      2
Okay     0
Name: categories, dtype: int64

In [53]:
pd.get_dummies(df)

Unnamed: 0,preTestScore,postTestScore,regiment_Dragoons,regiment_Nighthawks,regiment_Scouts,company_1st,company_2nd,name_Ali,name_Cooze,name_Jacobson,...,name_Milner,name_Piger,name_Riani,name_Ryaner,name_Sloan,name_Sone,categories_Low,categories_Okay,categories_Good,categories_Great
0,4,25,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,24,94,0,1,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2,31,57,0,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,2,62,0,1,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
4,3,70,1,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
5,4,25,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,24,94,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
7,31,57,1,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
8,2,62,0,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
9,3,70,0,0,1,1,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
10,2,62,0,0,1,0,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
11,3,70,0,0,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,1,0


---

# 4. Feature scaling

- One of two variables has much larger values than the other! e.g. when weight and height are the variables, height has the larger influence

## Feature scaling

- Bring the min-max ranges of the features into line with each other!

<img src="../../img/Screen Shot 2019-03-20 at 7.59.35 PM.png" width="600">

## Feature scaling strategies

### Min-Max Normalization

- Maps the range of a variable onto a **new minimum and maximum**
- Usually rescales to **values between 0 and 1**

$$x^{(i)}_{norm} = \frac{x^{(i)}-x_{min}}{x_{max}-x_{min}} (new \, max - new \, min) + new \, min$$

- Example: max 98,000 / min 12,000 -> original value 73,600
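Plugging the numbers from the example into the formula, and assuming the usual target range of 0 to 1 (no target range is stated in the example):

$$x_{norm} = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} = \frac{61{,}600}{86{,}000} \approx 0.716$$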

### Standardization (Z-score Normalization)

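- Rescales a variable to **mean 0 and standard deviation 1** (the scale of the standard normal distribution; the shape of the distribution itself does not change)
- **Usable when the actual min and max values are unknown**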

$$x^{(i)}_{std \, norm} = \frac{x^{(i)}-\mu}{s_i}$$

- Example: mean 54,000 / standard deviation 16,000 -> original value 73,600
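Worked through with the numbers from the example:

$$x_{std \, norm} = \frac{73{,}600 - 54{,}000}{16{,}000} = \frac{19{,}600}{16{,}000} = 1.225$$

The value lies 1.225 standard deviations above the mean.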

## Caveats

- In practice, always keep the normalization parameters (min/max, mean/standard deviation) and apply those same parameters to new values
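A minimal sketch of what keeping the parameters can look like. The `train` and `new` series and the `params` dictionary are hypothetical names used only for illustration.

```python
import pandas as pd

train = pd.Series([12_000, 54_000, 73_600, 98_000], name="income")
new = pd.Series([30_000, 110_000], name="income")

# Fit: compute the parameters once, on the training data only.
params = {
    "min": train.min(),
    "max": train.max(),
    "mean": train.mean(),
    "std": train.std(),
}

# Transform: apply the stored parameters to both old and new data.
train_minmax = (train - params["min"]) / (params["max"] - params["min"])
new_minmax = (new - params["min"]) / (params["max"] - params["min"])
new_zscore = (new - params["mean"]) / params["std"]

print(new_minmax.tolist())
```

A new value outside the training range, such as 110,000 here, maps outside [0, 1]. That is expected and is a reason not to recompute the min and max on the new data.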

## Min-Max Normalization

$$x^{(i)}_{norm} = \frac{x^{(i)}-x_{min}}{x_{max}-x_{min}} (new \, max - new \, min) + new \, min$$

In [54]:
# code from - https://stackoverflow.com/questions/24645153/pandas-dataframe-columns-scaling-with-sklearn

df = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                   'B':[103.02,107.26,110.35,114.23,114.68], 
                   'C':['big','small','big','small','small']})

In [55]:
df

Unnamed: 0,A,B,C
0,14.0,103.02,big
1,90.2,107.26,small
2,90.95,110.35,big
3,96.27,114.23,small
4,91.21,114.68,small


In [56]:
df["A"]

0    14.00
1    90.20
2    90.95
3    96.27
4    91.21
Name: A, dtype: float64

In [57]:
df["A"]  - df["A"].min()

0     0.00
1    76.20
2    76.95
3    82.27
4    77.21
Name: A, dtype: float64

In [58]:
df["A"].max() - df["A"].min()

82.27

In [59]:
df["A"] = (df["A"] - df["A"].min()) / (df["A"].max() - df["A"].min()) * (5 - 1) + 1

In [60]:
df

Unnamed: 0,A,B,C
0,1.0,103.02,big
1,4.704874,107.26,small
2,4.741339,110.35,big
3,5.0,114.23,small
4,4.753981,114.68,small


## Z-Score Normalization

$$x^{(i)}_{std \, norm} = \frac{x^{(i)}-\mu}{s_i}$$

In [61]:
df["B"]

0    103.02
1    107.26
2    110.35
3    114.23
4    114.68
Name: B, dtype: float64

In [62]:
df["B"].mean(), df["B"].std()

(109.90799999999999, 4.901619120249964)

In [63]:
df["B"] = (df["B"] - df["B"].mean()) / df["B"].std()

In [64]:
df

Unnamed: 0,A,B,C
0,1.0,-1.40525,big
1,4.704874,-0.54023,small
2,4.741339,0.090174,big
3,5.0,0.881749,small
4,4.753981,0.973556,small
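One detail behind the numbers above: pandas `std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population form (`ddof=0`). The sketch below compares the two on the values of column B, so z-scores computed with different tools can differ slightly.

```python
import numpy as np
import pandas as pd

b = pd.Series([103.02, 107.26, 110.35, 114.23, 114.68])

sample_std = b.std()              # pandas default: ddof=1 (divide by n - 1)
population_std = b.std(ddof=0)    # divide by n
numpy_std = np.std(b.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```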


##  Feature Scaling Function

In [65]:
def feture_scaling(df, scaling_strategy="min-max", column=None):
    """Scale the given columns of df in place and return df."""
    if column is None:
        # Default to the numeric columns; strings such as 'big' cannot be scaled.
        column = df.select_dtypes(include="number").columns

    for column_name in column:
        if scaling_strategy == "min-max":
            # Rescale to the [0, 1] range.
            df[column_name] = (df[column_name] - df[column_name].min()) / \
                              (df[column_name].max() - df[column_name].min())
        elif scaling_strategy == "z-score":
            # Shift to mean 0 and scale to (sample) standard deviation 1.
            df[column_name] = (df[column_name] - df[column_name].mean()) / \
                              df[column_name].std()
    return df

In [66]:
df = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                   'B':[103.02,107.26,110.35,114.23,114.68], 
                   'C':['big','small','big','small','small']})

In [67]:
df

Unnamed: 0,A,B,C
0,14.0,103.02,big
1,90.2,107.26,small
2,90.95,110.35,big
3,96.27,114.23,small
4,91.21,114.68,small


In [68]:
feture_scaling(df, column=["A","B"])

Unnamed: 0,A,B,C
0,0.0,0.0,big
1,0.926219,0.363636,small
2,0.935335,0.628645,big
3,1.0,0.961407,small
4,0.938495,1.0,small


In [69]:
feture_scaling(df, scaling_strategy="z-score", column=["A","B"])

Unnamed: 0,A,B,C
0,-1.784641,-1.40525,big
1,0.390289,-0.54023,small
2,0.411695,0.090174,big
3,0.563541,0.881749,small
4,0.419116,0.973556,small


In [70]:
# code from - http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

df = pd.read_csv(
    'https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/wine_data.csv',
     header=None,
     usecols=[0,1,2])

In [71]:
df.columns=['Class label', 'Alcohol', 'Malic acid']

In [72]:
df.head()

Unnamed: 0,Class label,Alcohol,Malic acid
0,1,14.23,1.71
1,1,13.2,1.78
2,1,13.16,2.36
3,1,14.37,1.95
4,1,13.24,2.59


In [73]:
df = feture_scaling(df, "min-max", column=['Alcohol', 'Malic acid'])
df.head()

Unnamed: 0,Class label,Alcohol,Malic acid
0,1,0.842105,0.1917
1,1,0.571053,0.205534
2,1,0.560526,0.320158
3,1,0.878947,0.23913
4,1,0.581579,0.365613


In [74]:
df = feture_scaling(df, "z-score", column=['Alcohol', 'Malic acid'])
df.head()

Unnamed: 0,Class label,Alcohol,Malic acid
0,1,1.514341,-0.560668
1,1,0.245597,-0.498009
2,1,0.196325,0.021172
3,1,1.686791,-0.345835
4,1,0.294868,0.227053
