# ML Pipeline

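* From now on, keep two questions in mind whenever you write code.
    * How should it be written so that it can be reused?
    * How should it be written so that the steps chain into a smoothly flowing pipeline?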

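* You are a data analyst at OO Telecom.
* The company wants to stop customers from porting their numbers to another carrier (churning) once their contract period ends.
* It has asked you to analyze which customers churn.
* Let's find the factors that influence customer churn (CHURN).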

![](https://d18lkz4dllo6v2.cloudfront.net/cumulus_uploads/entry/23964/mobile%20phones.png)

|Variable|Description|Role|
|	----	|	----	|	----	|
|	COLLEGE	|	College graduate (1, 0) - categorical	|		|
|	INCOME	|	Annual income (USD)	|		|
|	OVERAGE	|	Monthly overage usage time (minutes)	|		|
|	LEFTOVER	|	Share of the monthly allowance left unused (%)	|		|
|	HOUSE	|	House price (USD)	|		|
|	HANDSET_PRICE	|	Handset price (USD)	|		|
|	OVER_15MINS_CALLS_PER_MONTH	|	Average number of long calls (15 minutes or longer)	|		|
|	AVERAGE_CALL_DURATION	|	Average call duration (minutes)	|		|
|	REPORTED_SATISFACTION	|	Satisfaction survey ('very_unsat', 'unsat', 'avg', 'sat', 'very_sat') - categorical	|		|
|	REPORTED_USAGE_LEVEL	|	Usage level survey ('very_little', 'little', 'avg', 'high', 'very_high') - categorical	|		|
|	CONSIDERING_CHANGE_OF_PLAN	|	Plan change survey ('never_thought', 'no', 'perhaps', 'considering', 'actively_looking_into_it') - categorical	|		|
|	**CHURN**	|	Churn (1: churned, 0: stayed)	|	**Target**	|


## 0. Environment Setup

### 1) Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler

from sklearn.svm import SVC
from sklearn.metrics import classification_report

### 2) Loading the Data

In [2]:
use_cols = ['INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE', 'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION', 'CHURN', 'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL']
data = pd.read_csv('data/mobile.csv', usecols = use_cols )
data.head()

Unnamed: 0,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CHURN
0,47711,183,17,730589.0,192,19,5,unsat,little,0
1,74132,191,43,535092.0,349,15,2,unsat,very_little,1
2,150419,0,14,204004.0,682,0,6,unsat,very_high,0
3,159567,0,58,281969.0,634,1,1,very_unsat,very_high,0
4,23392,0,0,216707.0,233,0,15,unsat,very_little,1


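## 2. Data Preprocessing

### 1) Handling Unnecessary Data
It is best to load only the columns you actually need from the start, as done with `usecols` above.

### 2) Splitting the Data

#### x, y split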

In [3]:
target = 'CHURN'
x0 = data.drop(target, axis = 1)
y0 = data.loc[:, target]

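#### test split

Here we hold out only a few rows. Passing an integer to `test_size` sets the number of rows, so `test_size = 10` keeps exactly 10 rows as the test set.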

In [4]:
x, x_test, y, y_test = train_test_split(x0, y0, test_size = 10, random_state = 2022)

In [5]:
x_test

Unnamed: 0,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL
17631,27296,104,18,293757.0,181,5,5,very_sat,little
548,89370,78,13,266922.0,344,4,5,unsat,avg
9178,39363,0,23,704168.0,217,1,4,very_unsat,little
17249,151891,0,14,829930.0,508,0,5,very_unsat,very_little
5057,94895,74,77,727593.0,307,3,4,unsat,little
11890,35551,138,0,896854.0,151,28,12,very_unsat,little
3310,64381,86,6,761415.0,354,23,1,avg,little
3154,138100,200,23,434073.0,717,21,4,very_sat,very_little
3901,24827,179,72,246542.0,144,29,1,very_unsat,very_high
5898,33434,40,16,415481.0,379,3,6,very_sat,little


#### train, val split

In [6]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = .3, random_state = 2022)

### 3) Feature Engineering

### 4) NaN Handling ①

* First, check x for NaN values.

In [7]:
x_train.isna().sum()

INCOME                          0
OVERAGE                         0
LEFTOVER                        0
HOUSE                           9
HANDSET_PRICE                   0
OVER_15MINS_CALLS_PER_MONTH     0
AVERAGE_CALL_DURATION           0
REPORTED_SATISFACTION          29
REPORTED_USAGE_LEVEL            0
dtype: int64

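* When each NaN is handled depends on the method used to handle it.
    * REPORTED_SATISFACTION is filled **now** with avg, the middle level of the scale.
    * HOUSE is filled with KNNImputer **after dummy encoding**.

* If you decide to drop the rows that contain NaN...
    * It means that any NaN arriving in production will be discarded as well.
        * If that is acceptable, fine...
        * But it rarely is.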
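The "fill now" step can be sketched on a toy example. This is only an illustration: the two small frames below are made-up stand-ins for `x_train` and `x_val`, and the variable names are assumptions. The point is that the imputer is fitted on the training data once and then only applied (`transform`) to any other data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for x_train / x_val (values are made up for illustration).
train = pd.DataFrame({'REPORTED_SATISFACTION': ['sat', np.nan, 'unsat']}, dtype=object)
val = pd.DataFrame({'REPORTED_SATISFACTION': [np.nan, 'very_sat']}, dtype=object)

cols = ['REPORTED_SATISFACTION']
imputer = SimpleImputer(strategy='constant', fill_value='avg')

# Fit on the training data only, then reuse the fitted imputer elsewhere.
train[cols] = imputer.fit_transform(train[cols])
val[cols] = imputer.transform(val[cols])

print(train['REPORTED_SATISFACTION'].tolist())
print(val['REPORTED_SATISFACTION'].tolist())
```

With a constant fill value the fitted state is trivial, but keeping the fit/transform split as a habit matters for strategies such as `'median'` or `'most_frequent'`, where refitting on new data would change the fill value.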

#### SimpleImputer 

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [8]:
from sklearn.impute import SimpleImputer

In [9]:
# Declare the target columns as a list.
imputer1_list = ['REPORTED_SATISFACTION']

# Declare the imputer, then fit_transform (fill_value must be the string 'avg')
imputer1 = SimpleImputer(strategy = 'constant', fill_value = 'avg')
x_train[imputer1_list] = imputer1.fit_transform(x_train[imputer1_list])

x_train.isna().sum()

INCOME                         0
OVERAGE                        0
LEFTOVER                       0
HOUSE                          9
HANDSET_PRICE                  0
OVER_15MINS_CALLS_PER_MONTH    0
AVERAGE_CALL_DURATION          0
REPORTED_SATISFACTION          0
REPORTED_USAGE_LEVEL           0
dtype: int64

#### Applying to the validation set

In [10]:
x_val[imputer1_list] = imputer1.transform(x_val[imputer1_list])

x_val.isna().sum()

INCOME                         0
OVERAGE                        0
LEFTOVER                       0
HOUSE                          5
HANDSET_PRICE                  0
OVER_15MINS_CALLS_PER_MONTH    0
AVERAGE_CALL_DURATION          0
REPORTED_SATISFACTION          0
REPORTED_USAGE_LEVEL           0
dtype: int64

In [11]:
x_val['REPORTED_SATISFACTION'].value_counts()

very_unsat    2164
very_sat      1333
unsat         1057
avg            516
sat            267
Name: REPORTED_SATISFACTION, dtype: int64

In [12]:
x_val['REPORTED_USAGE_LEVEL'].value_counts()

little         2130
very_high      1374
very_little    1045
high            512
avg             276
Name: REPORTED_USAGE_LEVEL, dtype: int64

### 5) Dummy Encoding

In [13]:
cat = {'REPORTED_SATISFACTION':["very_unsat", "very_sat", "unsat", "avg", "sat"],
       'REPORTED_USAGE_LEVEL':["little", "very_high", "very_little", "high", "avg"]}

for k, v in cat.items():
    x_train[k] = pd.Categorical(x_train[k], categories=v, ordered=False)

x_train.head()

Unnamed: 0,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL
11788,107677,155,23,154007.0,859,15,4,very_sat,little
16055,64070,0,0,326646.0,275,1,9,sat,little
2880,58182,210,0,355513.0,312,19,1,very_unsat,little
3023,97009,228,0,165152.0,290,28,9,very_unsat,little
7528,33879,0,0,573076.0,221,1,14,very_unsat,little


In [14]:
x_train = pd.get_dummies(x_train, columns=cat.keys(), drop_first=True)

Unnamed: 0,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION_very_sat,REPORTED_SATISFACTION_unsat,REPORTED_SATISFACTION_avg,REPORTED_SATISFACTION_sat,REPORTED_USAGE_LEVEL_very_high,REPORTED_USAGE_LEVEL_very_little,REPORTED_USAGE_LEVEL_high,REPORTED_USAGE_LEVEL_avg
11788,107677,155,23,154007.0,859,15,4,1,0,0,0,0,0,0,0
16055,64070,0,0,326646.0,275,1,9,0,0,0,1,0,0,0,0
2880,58182,210,0,355513.0,312,19,1,0,0,0,0,0,0,0,0
3023,97009,228,0,165152.0,290,28,9,0,0,0,0,0,0,0,0
7528,33879,0,0,573076.0,221,1,14,0,0,0,0,0,0,0,0


In [None]:
x_train.head()

In [15]:
# Wrap the steps above in a function
cat = {'REPORTED_SATISFACTION':["very_unsat", "very_sat", "unsat", "avg", "sat"],
       'REPORTED_USAGE_LEVEL':["little", "very_high", "very_little", "high", "avg"]}

def mobile_dumm(df, cat):
    temp = df.copy()
    for k, v in cat.items():
        temp[k] = pd.Categorical(temp[k], categories=v, ordered=False)

    temp = pd.get_dummies(temp, columns=cat.keys(), drop_first=True)
    return temp

#### Applying to the validation set

In [16]:
x_val = mobile_dumm(x_val, cat)
x_val.head()

Unnamed: 0,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION_very_sat,REPORTED_SATISFACTION_unsat,REPORTED_SATISFACTION_avg,REPORTED_SATISFACTION_sat,REPORTED_USAGE_LEVEL_very_high,REPORTED_USAGE_LEVEL_very_little,REPORTED_USAGE_LEVEL_high,REPORTED_USAGE_LEVEL_avg
14427,70306,223,76,233303.0,315,15,2,0,0,0,1,0,0,0,0
5559,36995,0,55,702443.0,174,1,2,0,0,0,0,0,1,0,0
17239,84922,0,0,240698.0,149,1,14,0,1,0,0,0,0,0,0
10538,46597,172,23,155241.0,169,0,6,1,0,0,0,1,0,0,0
16745,149379,224,0,528332.0,693,22,11,1,0,0,0,0,0,0,1
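Declaring the category levels up front is what keeps the dummy columns identical across train, validation, and new data. The sketch below illustrates this with a made-up two-row frame; the helper name `dummies_with_fixed_levels` is hypothetical and is not part of the notebook.

```python
import pandas as pd

levels = ['little', 'very_high', 'very_little', 'high', 'avg']

def dummies_with_fixed_levels(df, col, levels):
    """Hypothetical helper: one-hot encode `col` using a fixed list of levels."""
    out = df.copy()
    out[col] = pd.Categorical(out[col], categories=levels, ordered=False)
    return pd.get_dummies(out, columns=[col], drop_first=True)

# A tiny batch that happens to contain only two of the five levels.
small = pd.DataFrame({'REPORTED_USAGE_LEVEL': ['little', 'avg']})

plain = pd.get_dummies(small, columns=['REPORTED_USAGE_LEVEL'], drop_first=True)
fixed = dummies_with_fixed_levels(small, 'REPORTED_USAGE_LEVEL', levels)

print(list(plain.columns))  # depends on which levels appear in this batch
print(list(fixed.columns))  # always one column per level except the first
```

Without the fixed levels, a small batch would produce fewer columns than the scaler and the model were fitted on, and the later `transform` calls would fail or silently misalign.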


### 6) Scaling


In [19]:
scaler = MinMaxScaler()
x_train_s = scaler.fit_transform(x_train) # the result is already a NumPy array at this point

#### Applying to the validation set

In [20]:
# apply to validation
x_val_s = scaler.transform(x_val)

### 7) NaN Handling ②

#### KNNImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

In [21]:
from sklearn.impute import KNNImputer

In [22]:
imputer2_list = list(x_train)

# Declare the imputer, then fit_transform
imputer2 = KNNImputer()

x_train_s = imputer2.fit_transform(x_train_s)
# The result comes back as a NumPy array

# x_train_s[imputer2_list] = imputer2.fit_transform(x_train_s[imputer2_list])
# x_train_s.isna().sum()
# DataFrame-based alternative (not used here: the scaler already returned a NumPy array, so we proceed with the version above)

#### Applying to the validation set

In [23]:
# apply to validation
x_val_s = imputer2.transform(x_val_s)
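Scaling before KNN imputation works because `MinMaxScaler` ignores NaN when fitting and leaves it in place when transforming, so the missing HOUSE values survive until `KNNImputer` fills them using distances computed on the scaled features. A minimal sketch with made-up numbers (the arrays and `n_neighbors=2` are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy training data with one missing value in the second column.
train = np.array([[0.0, 100.0],
                  [5.0, np.nan],
                  [10.0, 300.0],
                  [4.0, 180.0]])
new = np.array([[4.5, np.nan]])

# Fit both steps on the training data only.
scaler = MinMaxScaler().fit(train)
train_s = scaler.transform(train)
knn = KNNImputer(n_neighbors=2).fit(train_s)

# On new data: transform only, in the same order as in training.
new_s = scaler.transform(new)
print(np.isnan(new_s).sum())   # the NaN is still there after scaling
new_filled = knn.transform(new_s)
print(np.isnan(new_filled).sum())
```

Imputing on scaled features keeps large-valued columns such as INCOME or HOUSE from dominating the neighbor distances.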

## 3. Modeling


In [25]:
# Fit an SVM model
model = SVC()
model.fit(x_train_s, y_train)

SVC()

In [26]:
# validation
pred = model.predict(x_val_s)
print(classification_report(y_val, pred))

              precision    recall  f1-score   support

           0       0.66      0.74      0.70      2705
           1       0.70      0.61      0.65      2632

    accuracy                           0.68      5337
   macro avg       0.68      0.68      0.68      5337
weighted avg       0.68      0.68      0.68      5337



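## 4. Putting the Data Pipeline Together

* Suppose the best model has now been built and deployed to the production system.
* When new data arrives in production, which steps, in which order, should the pipeline run?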

In [27]:
# new data : x_test
x_test.head()

Unnamed: 0,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL
17631,27296,104,18,293757.0,181,5,5,very_sat,little
548,89370,78,13,266922.0,344,4,5,unsat,avg
9178,39363,0,23,704168.0,217,1,4,very_unsat,little
17249,151891,0,14,829930.0,508,0,5,very_unsat,very_little
5057,94895,74,77,727593.0,307,3,4,unsat,little


### 1) Collecting the code from the [Applying to the validation set] steps

* Declare the function and variables

In [28]:
def mobile_dumm(df, cat):
    temp = df.copy()  # work on a copy so the caller's DataFrame is not modified
    for k, v in cat.items():
        temp[k] = pd.Categorical(temp[k], categories=v, ordered=False)
    temp = pd.get_dummies(temp, columns=list(cat), drop_first=True)
    return temp

imputer1_list = ['REPORTED_SATISFACTION']

cat = {'REPORTED_SATISFACTION':["very_unsat", "very_sat", "unsat", "avg", "sat"],
       'REPORTED_USAGE_LEVEL':["little", "very_high", "very_little", "high", "avg"]}


* Run the preprocessing

In [29]:
temp = x_test.copy()

In [31]:
# Feature Engineering
# temp = mobile_fe(temp)

# NaN handling ① : SimpleImputer (transform only; never refit on new data)
temp[imputer1_list] = imputer1.transform(temp[imputer1_list])

# Dummy encoding
temp = mobile_dumm(temp, cat)

# Scaling
temp = scaler.transform(temp)

# NaN handling ② : KNNImputer
temp = imputer2.transform(temp)

temp

array([[0.05205967, 0.31454006, 0.20224719, 0.16911719, 0.0663199 ,
        0.17241379, 0.28571429, 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.49552774, 0.23738872, 0.14606742, 0.13754493, 0.27828349,
        0.13793103, 0.28571429, 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 1.        ],
       [0.13826854, 0.00593472, 0.25842697, 0.65197922, 0.11313394,
        0.03448276, 0.21428571, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.94218926, 0.00593472, 0.15730337, 0.79994235, 0.49154746,
        0.        , 0.28571429, 0.        , 0.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.53499936, 0.22551929, 0.86516854, 0.67953951, 0.23016905,
        0.10344828, 0.21428571, 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 

### 2) Building and running the data pipeline function

In [38]:
def mobile_datapipeline(df, simpleimputer, simple_impute_list, dumm_list, scaler, knnimputer):

    temp = df.copy()

    # Feature Engineering
    # temp = mobile_fe(temp)

    # NaN handling ① : SimpleImputer (already fitted on train, so transform only)
    temp[simple_impute_list] = simpleimputer.transform(temp[simple_impute_list])

    # Dummy encoding
    temp = mobile_dumm(temp, dumm_list)

    # Keep the column names, because the next steps return NumPy arrays
    x_cols = list(temp)

    # Scaling
    temp = scaler.transform(temp)

    # NaN handling ② : KNNImputer
    temp = knnimputer.transform(temp)

    return pd.DataFrame(temp, columns = x_cols)


In [39]:
# Apply the pipeline (named x_new to avoid shadowing the built-in input)
x_new = mobile_datapipeline(x_test, imputer1, imputer1_list, cat, scaler, imputer2)

In [40]:
# Predict
model.predict(x_new)



array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=int64)
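As a point of comparison only, and not the approach this notebook takes, scikit-learn's own `Pipeline` class can bundle the array-based steps (scaling, KNN imputation, model) into one object that applies `fit` and `transform` in the right order. The sketch below uses randomly generated data, not the churn dataset, and the step names are arbitrary assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data standing in for the already dummy-encoded features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)
X[::7, 1] = np.nan  # inject a few missing values

# Same order as in the notebook: scale, then KNN-impute, then model.
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('impute', KNNImputer()),
    ('model', SVC()),
])
pipe.fit(X, y)

# predict() only transforms with the fitted steps; nothing is refitted.
X_new = np.array([[0.3, np.nan, -1.0]])
pred = pipe.predict(X_new)
print(pred)
```

The hand-written `mobile_datapipeline` function stays useful for the pandas-specific steps (constant fill and dummy encoding with fixed categories), which this sketch leaves out.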

## 5. Saving Python Objects