<font color = "#CC3D3D">
# Feature Engineering

* [Handling Missing Values](#Handling-Missing-Values)
* [Handling Categorical Variables](#Handling-Categorical-Variables)
* [Feature Scaling](#Feature-Scaling)
* [Feature Selection](#Feature-Selection)
* [Feature Generation](#Feature-Generation)

In [None]:
import pandas as pd
import copy
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

**[Allstate Purchase Prediction Challenge](https://www.kaggle.com/c/allstate-purchase-prediction-challenge/data)**
- Transaction records of customers up to the point where they purchase an auto insurance product
- Includes the quote history for each customer ID
- The last row for each customer ID is the purchase point (record_type=1)

In [None]:
data = pd.read_csv('Allstate_train.csv')
data.head()

In [None]:
data[data.customer_ID == 10000000]

In [None]:
dataP = data.loc[data.record_type == 1].copy()
con = ['group_size','car_age','age_oldest','age_youngest','duration_previous','cost']
cat = ['day','homeowner','car_value','risk_factor','married_couple','C_previous','state', 'location','shopping_pt']

***
## Handling Missing Values

##### Check missing values

In [None]:
dataP.isnull().sum()

### 1. Drop

In [None]:
dataP.shape

In [None]:
dataP_drop = dataP.dropna(subset=['risk_factor','C_previous','duration_previous'])
dataP_drop.shape

In [None]:
dataP_drop.isnull().sum()

### 2. Impute

##### Imputing missing values in continuous features

In [None]:
dataP[con].dtypes

In [None]:
from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
imputer_con = SimpleImputer(strategy="median")

imputer_con.fit(dataP[con])

<div class="alert alert-block alert-warning">
- strategy="mean": replace with the mean

- strategy="median": replace with the median
- strategy="most_frequent": replace with the mode (most frequent value)
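
A minimal sketch of how the three strategies differ, using `sklearn.impute.SimpleImputer` on a made-up column. The toy values are an assumption for illustration and are not taken from the Allstate data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value in the last row
toy = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # fit learns the fill value from the observed entries; transform applies it
    filled[strategy] = imputer.fit_transform(toy)[-1, 0]

print(filled)
```

The mean is pulled upward by the single large value, while the median and the mode are not, which is one reason the median is often preferred for skewed continuous features.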


In [None]:
dataP[con]

In [None]:
x = imputer_con.transform(dataP[con]); x

In [None]:
dataP_imp = dataP.copy()
dataP_imp[con] = x
dataP_imp.head()

<font color='red'>
- The sklearn.preprocessing.Imputer class was removed in scikit-learn 0.22; use the sklearn.impute.SimpleImputer class, added in version 0.20, instead.

##### Imputing missing values in categorical features

In [None]:
dataP_imp[cat].dtypes

In [None]:
dataP['risk_factor']

In [None]:
obj=['car_value','state'] 
print(dataP['car_value'].astype('category').cat.categories)
print(dataP['state'].astype('category').cat.categories)

<font color='blue'>
* Select only the object-type features

In [None]:
dataP_imp[obj]

In [None]:
pd.Series([1,2,3]).apply(lambda x: x+2)

In [None]:
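# cat.codes encodes NaN as -1; restore NaN so the imputer below can fill it
dataP_imp[obj] = dataP_imp[obj].apply(lambda x: x.astype('category').cat.codes.astype(float).replace(-1, np.nan))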

<font color='blue'>
* Convert the object type to the category type, then encode it as numeric codes
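
One detail worth checking at this step: `cat.codes` assigns the code `-1` to missing values, so they no longer register as `NaN`. A small sketch on a made-up series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical object-type column containing one missing value
s = pd.Series(["a", "c", np.nan, "b", "a"])

codes = s.astype("category").cat.codes
print(codes.tolist())  # the missing entry is encoded as -1

# Map -1 back to NaN so that an imputer still sees the entry as missing
codes_nan = codes.astype(float).replace(-1, np.nan)
print(codes_nan.isnull().sum())
```

If the `-1` codes are left in place, a later `most_frequent` imputation has nothing to fill, and `-1` silently becomes an extra category.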

<font color=green>
(Impute using strategy="most_frequent")

In [None]:
dataP_imp[obj] 

In [None]:
from sklearn.impute import SimpleImputer

imputer_cat = SimpleImputer(strategy="most_frequent")
# Equivalent: dataP_imp[cat] = imputer_cat.fit_transform(dataP_imp[cat])
dataP_imp[cat] = imputer_cat.fit(dataP_imp[cat]).transform(dataP_imp[cat])

In [None]:
dataP_imp[cat].head()

<font color='blue'>
- For all categorical features, replace missing values with the mode and store the result back in the data frame

***
## Handling Categorical Variables

In [None]:
dataP_imp[cat] = dataP_imp[cat].astype(int)

In [None]:
dataP_imp[cat].head()

### 1. One-Hot Encoding

In [None]:
dataP['day']

In [None]:
dataP_imp = pd.get_dummies(dataP_imp,columns=['day'])

In [None]:
dataP_imp.head()

In [None]:
dataP_imp.filter(like='day').head()

### 2. Label Encoding

In [None]:
dataP['car_value'].value_counts()

In [None]:
dataP['car_value'] = dataP['car_value'].astype('category')

In [None]:
dataP['car_value'] = dataP['car_value'].cat.codes

In [None]:
dataP['car_value'].value_counts()

<font color='blue'>
Many other encoders can be applied to categorical features (see the package below).  
- [Category Encoders](https://github.com/scikit-learn-contrib/categorical-encoding)

****
## Feature Scaling

### 1. Min-max scaling

In [None]:
dataP_imp = dataP_imp.drop(['customer_ID', 'record_type', 'time','location'], axis=1)

In [None]:
dataP_imp.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(dataP_imp)

In [None]:
x = scaler.transform(dataP_imp)
dataP_imp_s = pd.DataFrame(x, columns=dataP_imp.columns)

In [None]:
dataP_imp_s.describe()

### 2. Standardization

<font color=green>
(Proceed as above, using StandardScaler instead of MinMaxScaler)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(dataP_imp)
x = scaler.transform(dataP_imp)
dataP_imp_s = pd.DataFrame(x, columns=dataP_imp.columns)

In [None]:
dataP_imp_s.describe()
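
For reference, both scalers used above apply a simple column-wise formula: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch below checks this by hand on a made-up array; the toy values are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([[1.0, 100.0], [2.0, 300.0], [4.0, 500.0]])

# Min-max scaling by hand: each column is mapped onto [0, 1]
manual_minmax = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
# Standardization by hand; StandardScaler uses the population std (ddof=0)
manual_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(toy)))
print(np.allclose(manual_std, StandardScaler().fit_transform(toy)))
```

Note that pandas `describe()` reports the sample standard deviation (ddof=1), so the `std` row after `StandardScaler` is close to 1 but not exactly 1.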

<font color='darkgreen'>
### *The effect of preprocessing on supervised learning* #####

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0) 

In [None]:
from sklearn.svm import SVC
svm = SVC(C=100)
svm.fit(X_train, y_train).score(X_test, y_test)

In [None]:
# preprocessing using 0-1 scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_scaled = scaler.fit(X_train).transform(X_train)

# Scaling training and test data the same way
X_test_scaled = scaler.transform(X_test) 

svm.fit(X_train_scaled, y_train).score(X_test_scaled, y_test)

In [None]:
# preprocessing using zero mean and unit variance scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
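
The pattern above (fit the scaler on the training data, then apply the same transformation to the test data) can also be packaged with `sklearn.pipeline.make_pipeline`. This is a sketch of an alternative way to write the same steps, not something the original notebook uses; it reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Manual version: fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
manual_score = SVC(C=100).fit(scaler.transform(X_train), y_train).score(
    scaler.transform(X_test), y_test)

# Pipeline version: the same two steps chained into a single estimator
pipe = make_pipeline(StandardScaler(), SVC(C=100))
pipe_score = pipe.fit(X_train, y_train).score(X_test, y_test)

print(manual_score, pipe_score)
```

Both versions should give the same score. The pipeline makes it harder to accidentally fit the scaler on the test data, which matters most once cross-validation is involved.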

****
## Feature Selection ###

### 1. Model based feature selection #####

In [None]:
from sklearn.feature_selection import SelectFromModel

from sklearn.ensemble import RandomForestClassifier

select = SelectFromModel(RandomForestClassifier(), threshold=None)

In [None]:
X_train_fs = select.fit(X_train, y_train).transform(X_train)

print("X_train.shape: {}, X_train_fs.shape: {}".format(
    X_train.shape, X_train_fs.shape))

In [None]:
mask = select.get_support()
plt.matshow(mask.reshape(1,-1), cmap="gray_r")

<font color = "blue">
In **numpy.reshape()**, one shape dimension can be **-1**. In this case, the value is inferred from the length of the array and remaining dimensions.

<font color = "blue">
All built-in colormaps can be reversed by appending **_r**: For instance, **gray_r** is the reverse of **gray**.<br>
See [color map](https://matplotlib.org/2.0.2/api/pyplot_summary.html#matplotlib.pyplot.colormaps).
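
A minimal sketch connecting these pieces. According to the scikit-learn documentation, `threshold=None` makes `SelectFromModel` fall back to the mean feature importance for estimators without an L1 penalty, and `reshape(1, -1)` turns the 1-D mask into a single row. The fixed `random_state` and fitting on the full data set are assumptions made here for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# random_state is fixed only to make this sketch reproducible
select = SelectFromModel(RandomForestClassifier(random_state=0), threshold=None)
select.fit(X, y)

importances = select.estimator_.feature_importances_
mask = select.get_support()

# With threshold=None and a random forest, the cutoff is the mean importance
print(select.threshold_, importances.mean())
# Row-vector view of the mask, in the shape passed to plt.matshow
print(mask.reshape(1, -1).shape)
```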

In [None]:
X_test_fs = select.transform(X_test)
svm.fit(X_train_fs, y_train).score(X_test_fs, y_test)

### 2. Univariate feature selection

In [None]:
from sklearn.feature_selection import SelectKBest

print(X_train.shape)
X_train_new = SelectKBest(k=5).fit_transform(X_train, y_train)
X_train_new.shape
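
To see which columns `SelectKBest` keeps, and to apply the same selection to the test set, a sketch along these lines may help. It writes out `f_classif`, the default score function for `SelectKBest`, explicitly, and reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# f_classif (ANOVA F-test) is the default score function, written out explicitly
selector = SelectKBest(score_func=f_classif, k=5).fit(X_train, y_train)

# Names of the five features with the highest univariate scores
kept = cancer.feature_names[selector.get_support()]
print(kept)

# Reuse the fitted selector so the test set keeps the same five columns
X_test_new = selector.transform(X_test)
print(X_test_new.shape)
```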

****
## Feature Generation ###

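### Automatically generating polynomial and interaction features
Transform the input $x$ into polynomial terms:
$$ x \rightarrow [1, x, x^2, x^3, \cdots] $$

With two columns and a degree-2 polynomial, the transformation is:
$$ [x_1, x_2] \rightarrow [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2] $$

`PolynomialFeatures` takes the following parameters:
- degree: the polynomial degree
- interaction_only: if True, generate only interaction terms (products of distinct features) and omit powers of a single feature
- include_bias: whether to include the constant (bias) column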
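
As a quick check of the formula and the parameters above, the following sketch transforms a single made-up sample with $x_1 = 2$ and $x_2 = 3$ (the sample values are an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single sample with x_1 = 2 and x_2 = 3
sample = np.array([[2, 3]])

# Degree 2: [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2]
full = PolynomialFeatures(degree=2).fit_transform(sample)
# interaction_only=True drops the pure powers x_1^2 and x_2^2
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(sample)
# include_bias=False drops the leading constant column
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(sample)

print(full)
print(inter)
print(no_bias)
```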

In [None]:
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1,7).reshape(3, 2)
X

In [None]:
poly = PolynomialFeatures(3)
poly.fit_transform(X)

In [None]:
poly = PolynomialFeatures(interaction_only=True)
poly.fit_transform(X)

In [None]:
print(X_train.shape)

poly = PolynomialFeatures(2)
poly.fit_transform(X_train).shape

<font color='blue'>
#### [Discretization](http://scikit-learn.org/stable/modules/preprocessing.html#discretization) and [Dimensionality reduction](http://scikit-learn.org/stable/modules/unsupervised_reduction.html) are also commonly used feature engineering methods.
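
A brief sketch of both ideas, assuming `KBinsDiscretizer` for discretization and `PCA` for dimensionality reduction. These are two common scikit-learn choices picked here for illustration; the original notebook does not specify which tools to use.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Discretization: cut the first feature into 3 equal-width bins coded 0, 1, 2
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(X[:, [0]])
print(np.unique(binned))

# Dimensionality reduction: project the standardized features onto 2 components
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(X_pca.shape)
```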

<br><font color = "#CC3D3D">
## End