# [Project 3] Preprocessing Time-Series Data
---


## Project goals
---
- Fill the missing values in the dataset using interpolation.
- Standardize the variables so that their value ranges become comparable.


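## Project outline
---

1. **Loading the data:** Load the dataset to be preprocessed.

2. **Handling missing values (interpolation):** Count the missing values in the dataset, fill them using interpolation, and export the result to a CSV file.

3. **Data standardization:** Standardize the variables so that their value ranges are similar, and export the result to a CSV file.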



## Project overview
---

We preprocess the time-series data for cell id `1004_0` (missing-value imputation and standardization).


## 1. Loading the data
---

Load the dataset to be preprocessed and inspect its basic structure.

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns
import os
from sklearn.preprocessing import StandardScaler
#plt.rcParams['axes.unicode_minus']=False

In [2]:
df=pd.read_csv('1004_0.csv')

In [3]:
df

Unnamed: 0,New,dl_bler,ul_bler,conn_avg,conn_max,interx2in_succ_rate,interx2out_succ_rate,intraenb_succ_rate,dl_prb,ul_prb,reconfig_succ_rate
0,1004_0-20210901-0,7.551555,14.944129,27.913285,43.0,97.523220,100.000000,100.000000,45.402603,59.021408,100.409763
1,1004_0-20210901-1,8.685262,11.891272,23.013333,36.0,98.170732,,100.000000,36.740540,21.128778,99.953052
2,1004_0-20210901-2,7.186353,12.175938,22.158977,38.0,98.039216,98.765432,100.000000,28.568421,15.338933,100.249377
3,1004_0-20210901-3,6.350503,11.190096,18.396667,34.0,98.717949,100.000000,100.000000,16.503713,8.009815,100.512821
4,1004_0-20210901-4,7.859695,14.823305,14.173889,28.0,97.014925,100.000000,100.000000,14.133542,5.109716,99.923136
...,...,...,...,...,...,...,...,...,...,...,...
2179,1004_0-20211130-19,9.969192,25.551625,29.048916,44.0,97.241993,99.905393,99.605523,53.394651,43.841664,98.565677
2180,1004_0-20211130-20,11.489667,28.470170,,,89.108062,99.900794,99.779736,33.844835,46.722683,98.897528
2181,1004_0-20211130-21,8.944882,17.154398,26.782657,42.0,82.404951,100.000000,100.000000,50.477553,35.947183,98.873623
2182,1004_0-20211130-22,8.804690,16.822283,31.691495,45.0,97.417504,100.000000,100.000000,65.563727,43.669957,98.735844


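## 2. Handling missing values (interpolation)
---

In mathematics and data analysis, interpolation is "a type of estimation that constructs or finds new data points within the range of a discrete set of known data points." In this section we use interpolation to fill the missing values in the dataset ourselves.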
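To make the definition concrete, here is a minimal sketch on a made-up five-point series (the values are not from the `1004_0` data). It compares linear interpolation with the `method='polynomial'` option used later in this notebook. The polynomial method relies on SciPy being installed, and it uses the index as the x-axis.

```python
import numpy as np
import pandas as pd

# Toy series: the known values follow y = (i + 1)^2 over index i = 0..4,
# and the value at index 2 is missing.
s = pd.Series([1.0, 4.0, np.nan, 16.0, 25.0])

# Linear interpolation draws a straight line between the two neighbours.
filled_linear = s.interpolate(method="linear")

# Order-2 interpolation (requires SciPy) fits a quadratic curve through
# the known points instead of a straight line.
filled_poly = s.interpolate(method="polynomial", order=2)

print(filled_linear.iloc[2])  # midpoint of 4 and 16
print(filled_poly.iloc[2])    # follows the curvature of the known points
```

Because the toy values lie on a parabola, the order-2 result lands close to the "true" value of 9, while the linear result is the midpoint 10. Real measurements are noisier, so which method is closer depends on the data.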

#### Number of missing values

In [4]:
df.isnull().sum()

New                       0
dl_bler                  95
ul_bler                  96
conn_avg                109
conn_max                109
interx2in_succ_rate      86
interx2out_succ_rate    131
intraenb_succ_rate      116
dl_prb                  114
ul_prb                  114
reconfig_succ_rate       91
dtype: int64

#### Copy the data into df_temp

In [5]:
df_temp=df.copy()

#### Polynomial interpolation (order 2)

In [6]:
# polynomial interpolation of order 2, applied one column at a time
dl_bler_inter_2=df_temp['dl_bler'].interpolate(method='polynomial', order=2)
ul_bler_inter_2=df_temp['ul_bler'].interpolate(method='polynomial', order=2)
conn_avg_inter_2=df_temp['conn_avg'].interpolate(method='polynomial', order=2)
conn_max_inter_2=df_temp['conn_max'].interpolate(method='polynomial', order=2)
interx2in_succ_rate_inter_2=df_temp['interx2in_succ_rate'].interpolate(method='polynomial', order=2)
interx2out_succ_rate_inter_2=df_temp['interx2out_succ_rate'].interpolate(method='polynomial', order=2)
intraenb_succ_rate_inter_2=df_temp['intraenb_succ_rate'].interpolate(method='polynomial', order=2)
dl_prb_inter_2=df_temp['dl_prb'].interpolate(method='polynomial', order=2)
ul_prb_inter_2=df_temp['ul_prb'].interpolate(method='polynomial', order=2)
reconfig_succ_rate_inter_2=df_temp['reconfig_succ_rate'].interpolate(method='polynomial', order=2)

In [7]:
df_preprocessed=pd.DataFrame()
df_preprocessed['dl_bler_inter']=dl_bler_inter_2
df_preprocessed['ul_bler_inter']=ul_bler_inter_2
df_preprocessed['conn_avg_inter']=conn_avg_inter_2
df_preprocessed['conn_max_inter']=conn_max_inter_2
df_preprocessed['interx2in_succ_rate_inter']=interx2in_succ_rate_inter_2
df_preprocessed['interx2out_succ_rate_inter']=interx2out_succ_rate_inter_2
df_preprocessed['intraenb_succ_rate_inter']=intraenb_succ_rate_inter_2
df_preprocessed['dl_prb_inter']=dl_prb_inter_2
df_preprocessed['ul_prb_inter']=ul_prb_inter_2
df_preprocessed['reconfig_succ_rate_inter']=reconfig_succ_rate_inter_2
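The two cells above repeat the same call for every column. As a hedged alternative, `DataFrame.interpolate` can process all numeric columns in one call. The sketch below uses a made-up miniature frame (the `toy` data and its values are assumptions for illustration) and reproduces the `_inter` suffix with `add_suffix`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same kind of layout:
# one string key column ("New") and numeric metric columns.
toy = pd.DataFrame({
    "New": ["a-0", "a-1", "a-2", "a-3", "a-4"],
    "dl_bler": [7.5, np.nan, 7.2, 6.4, 7.9],
    "conn_max": [43.0, 36.0, np.nan, 34.0, 28.0],
})

# Keep only the numeric columns, interpolate them all in one call,
# and append the "_inter" suffix used in this notebook.
numeric = toy.select_dtypes(include="number")
toy_inter = numeric.interpolate(method="polynomial", order=2).add_suffix("_inter")

print(toy_inter.columns.tolist())
print(toy_inter.isnull().sum())
```

Selecting numeric columns first keeps the string key column out of the interpolation step, which mirrors how `New` is left out of `df_preprocessed` above.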

In [8]:
df_preprocessed

Unnamed: 0,dl_bler_inter,ul_bler_inter,conn_avg_inter,conn_max_inter,interx2in_succ_rate_inter,interx2out_succ_rate_inter,intraenb_succ_rate_inter,dl_prb_inter,ul_prb_inter,reconfig_succ_rate_inter
0,7.551555,14.944129,27.913285,43.000000,97.523220,100.000000,100.000000,45.402603,59.021408,100.409763
1,8.685262,11.891272,23.013333,36.000000,98.170732,98.619780,100.000000,36.740540,21.128778,99.953052
2,7.186353,12.175938,22.158977,38.000000,98.039216,98.765432,100.000000,28.568421,15.338933,100.249377
3,6.350503,11.190096,18.396667,34.000000,98.717949,100.000000,100.000000,16.503713,8.009815,100.512821
4,7.859695,14.823305,14.173889,28.000000,97.014925,100.000000,100.000000,14.133542,5.109716,99.923136
...,...,...,...,...,...,...,...,...,...,...
2179,9.969192,25.551625,29.048916,44.000000,97.241993,99.905393,99.605523,53.394651,43.841664,98.565677
2180,11.489667,28.470170,26.397539,41.225172,89.108062,99.900794,99.779736,33.844835,46.722683,98.897528
2181,8.944882,17.154398,26.782657,42.000000,82.404951,100.000000,100.000000,50.477553,35.947183,98.873623
2182,8.804690,16.822283,31.691495,45.000000,97.417504,100.000000,100.000000,65.563727,43.669957,98.735844


In [9]:
df_preprocessed.isnull().sum()

dl_bler_inter                 0
ul_bler_inter                 0
conn_avg_inter                0
conn_max_inter                0
interx2in_succ_rate_inter     0
interx2out_succ_rate_inter    0
intraenb_succ_rate_inter      0
dl_prb_inter                  0
ul_prb_inter                  0
reconfig_succ_rate_inter      0
dtype: int64
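For this machine every count is zero. Interpolation estimates values between known points, so gaps at the very start or end of a series may or may not be filled depending on the method. It is worth repeating this check on other machines. The sketch below uses a made-up series, and the `bfill`/`ffill` fallback is an assumption for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap at the start, in the middle, and at the end.
s = pd.Series([np.nan, 2.0, np.nan, 5.0, 9.0, 14.0, np.nan])

inter = s.interpolate(method="polynomial", order=2)
remaining = int(inter.isnull().sum())
print("NaN left after interpolation:", remaining)

# Fallback (an assumption, not part of the original pipeline):
# copy the nearest valid value into any leftover edge gaps.
filled = inter.bfill().ffill()
print("NaN left after fallback:", int(filled.isnull().sum()))
```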

In [10]:
df_preprocessed.to_csv('df_preprocessed(inter)_1004_0.csv', index=None)

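## 3. Data standardization
---

Because each variable is measured in a different unit, standardization rescales the variables so that their value ranges are comparable: each column is shifted to mean 0 and scaled to unit variance.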

We standardize the data with `StandardScaler` from sklearn.
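As a small illustration of what the scaler computes, the sketch below standardizes a made-up two-column frame (the `toy` values are assumptions) and checks the result against the manual formula z = (x - mean) / std. `StandardScaler` handles each column independently, so a whole frame can be passed in one call.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with two columns on different scales.
toy = pd.DataFrame({
    "dl_bler": [7.5, 8.7, 7.2, 6.4, 7.9],
    "conn_max": [43.0, 36.0, 38.0, 34.0, 28.0],
})

# StandardScaler learns one mean and one standard deviation per column,
# so the whole frame can be scaled in a single call.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Equivalent manual computation: z = (x - mean) / std, with the
# population standard deviation (ddof=0) that StandardScaler uses.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
print(np.allclose(scaled, manual))
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the manual check has to pass `ddof=0` to match the scaler.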

#### Standardization

In [11]:
from sklearn.preprocessing import StandardScaler

In [12]:
df_dl_bler=df_preprocessed['dl_bler_inter'].values.reshape(-1, 1)
df_ul_bler=df_preprocessed['ul_bler_inter'].values.reshape(-1, 1)
df_conn_avg=df_preprocessed['conn_avg_inter'].values.reshape(-1, 1)
df_conn_max=df_preprocessed['conn_max_inter'].values.reshape(-1, 1)
df_interx2in_succ_rate=df_preprocessed['interx2in_succ_rate_inter'].values.reshape(-1, 1)
df_interx2out_succ_rate=df_preprocessed['interx2out_succ_rate_inter'].values.reshape(-1, 1)
df_intraenb_succ_rate=df_preprocessed['intraenb_succ_rate_inter'].values.reshape(-1, 1)
df_dl_prb=df_preprocessed['dl_prb_inter'].values.reshape(-1, 1)
df_ul_prb=df_preprocessed['ul_prb_inter'].values.reshape(-1, 1)
df_reconfig_succ_rate=df_preprocessed['reconfig_succ_rate_inter'].values.reshape(-1, 1)

In [13]:
# standardize
scaler = StandardScaler()

df_dl_bler_inter_scaled= scaler.fit_transform(df_dl_bler)
df_ul_bler_inter_scaled= scaler.fit_transform(df_ul_bler)
df_conn_avg_inter_scaled= scaler.fit_transform(df_conn_avg)
df_conn_max_inter_scaled= scaler.fit_transform(df_conn_max)
df_interx2in_succ_rate_inter_scaled= scaler.fit_transform(df_interx2in_succ_rate)
df_interx2out_succ_rate_inter_scaled= scaler.fit_transform(df_interx2out_succ_rate)
df_intraenb_succ_rate_inter_scaled= scaler.fit_transform(df_intraenb_succ_rate)
df_dl_prb_inter_scaled= scaler.fit_transform(df_dl_prb)
df_ul_prb_inter_scaled= scaler.fit_transform(df_ul_prb)
df_reconfig_succ_rate_inter_scaled= scaler.fit_transform(df_reconfig_succ_rate)

In [14]:
df_preprocessed_2=pd.DataFrame()
df_preprocessed_2['dl_bler']=pd.Series(df_dl_bler_inter_scaled.reshape(-1))
df_preprocessed_2['ul_bler']=pd.Series(df_ul_bler_inter_scaled.reshape(-1))
df_preprocessed_2['conn_avg']=pd.Series(df_conn_avg_inter_scaled.reshape(-1))
df_preprocessed_2['conn_max']=pd.Series(df_conn_max_inter_scaled.reshape(-1))
df_preprocessed_2['interx2in_succ_rate']=pd.Series(df_interx2in_succ_rate_inter_scaled.reshape(-1))
df_preprocessed_2['interx2out_succ_rate']=pd.Series(df_interx2out_succ_rate_inter_scaled.reshape(-1))
df_preprocessed_2['intraenb_succ_rate']=pd.Series(df_intraenb_succ_rate_inter_scaled.reshape(-1))
df_preprocessed_2['dl_prb']=pd.Series(df_dl_prb_inter_scaled.reshape(-1))
df_preprocessed_2['ul_prb']=pd.Series(df_ul_prb_inter_scaled.reshape(-1))
df_preprocessed_2['reconfig_succ_rate']=pd.Series(df_reconfig_succ_rate_inter_scaled.reshape(-1))

In [15]:
df_preprocessed_2

Unnamed: 0,dl_bler,ul_bler,conn_avg,conn_max,interx2in_succ_rate,interx2out_succ_rate,intraenb_succ_rate,dl_prb,ul_prb,reconfig_succ_rate
0,-1.322526,-1.388984,0.781895,0.534733,0.026699,0.406745,0.524891,0.962706,2.595942,1.715813
1,-0.718152,-1.911244,-0.293207,-0.600515,0.228621,-4.085459,0.524891,0.291056,-0.475253,1.231342
2,-1.517214,-1.862546,-0.480662,-0.276159,0.187608,-3.611406,0.524891,-0.342604,-0.944519,1.545678
3,-1.962802,-2.031196,-1.306153,-0.924872,0.399266,0.406745,0.524891,-1.278092,-1.538543,1.825135
4,-1.158258,-1.409654,-2.232676,-1.897941,-0.131808,0.406745,0.524891,-1.461873,-1.773596,1.199608
...,...,...,...,...,...,...,...,...,...,...
2179,-0.033694,0.425669,1.031065,0.696911,-0.060999,0.098826,-0.636653,1.582403,1.365625,-0.240361
2180,0.776865,0.924952,0.449324,0.246894,-2.597498,0.083857,-0.123681,0.066525,1.599131,0.111661
2181,-0.579749,-1.010867,0.533823,0.372554,-4.687807,0.406745,0.524891,1.356214,0.725778,0.086303
2182,-0.654485,-1.067683,1.610875,0.859089,-0.006267,0.406745,0.524891,2.525984,1.351708,-0.059851



Export the fully preprocessed data for machine `1004_0` to a CSV file.

In [16]:
df_preprocessed_2.to_csv('df_preprocessed(inter&scaled)_1004_0.csv', index=None)

To preprocess data from machines other than `1004_0`, we wrap the preprocessing steps in functions.

The `order` of the polynomial interpolation is left as a parameter so that the user can choose it.

In [17]:
def fill_missing_value(df_temp, ordern):
    # polynomial interpolation of order `ordern`, applied one column at a time
    dl_bler_inter_2=df_temp['dl_bler'].interpolate(method='polynomial', order=ordern)
    ul_bler_inter_2=df_temp['ul_bler'].interpolate(method='polynomial', order=ordern)
    conn_avg_inter_2=df_temp['conn_avg'].interpolate(method='polynomial', order=ordern)
    conn_max_inter_2=df_temp['conn_max'].interpolate(method='polynomial', order=ordern)
    interx2in_succ_rate_inter_2=df_temp['interx2in_succ_rate'].interpolate(method='polynomial', order=ordern)
    interx2out_succ_rate_inter_2=df_temp['interx2out_succ_rate'].interpolate(method='polynomial', order=ordern)
    intraenb_succ_rate_inter_2=df_temp['intraenb_succ_rate'].interpolate(method='polynomial', order=ordern)
    dl_prb_inter_2=df_temp['dl_prb'].interpolate(method='polynomial', order=ordern)
    ul_prb_inter_2=df_temp['ul_prb'].interpolate(method='polynomial', order=ordern)
    reconfig_succ_rate_inter_2=df_temp['reconfig_succ_rate'].interpolate(method='polynomial', order=ordern)
    
    df_preprocessed=pd.DataFrame()
    df_preprocessed['dl_bler_inter']=dl_bler_inter_2
    df_preprocessed['ul_bler_inter']=ul_bler_inter_2
    df_preprocessed['conn_avg_inter']=conn_avg_inter_2
    df_preprocessed['conn_max_inter']=conn_max_inter_2
    df_preprocessed['interx2in_succ_rate_inter']=interx2in_succ_rate_inter_2
    df_preprocessed['interx2out_succ_rate_inter']=interx2out_succ_rate_inter_2
    df_preprocessed['intraenb_succ_rate_inter']=intraenb_succ_rate_inter_2
    df_preprocessed['dl_prb_inter']=dl_prb_inter_2
    df_preprocessed['ul_prb_inter']=ul_prb_inter_2
    df_preprocessed['reconfig_succ_rate_inter']=reconfig_succ_rate_inter_2
    
    return df_preprocessed

In [None]:
from elice_utils import EliceUtils

elice_utils = EliceUtils()

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt 
import pandas as pd

import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matrixprofile import *
from pmdarima.arima import auto_arima

import os
from datetime import datetime, timedelta


def interpolation_and_standardization(df, order_n):
    df_preprocessed = df.copy()
    for column in df.columns:
        # TODO: [Instruction 1-A] Apply polynomial interpolation to each feature in the data.
        df_preprocessed[column] = df[column].interpolate(method='polynomial',order=order_n)

    df_preprocessed.drop("New", axis=1, inplace=True)
    
    # TODO: [Instruction 1-B] Create a scaler for standardization and apply it to the dataset.
    scaler = StandardScaler()
    df_preprocessed = pd.DataFrame(scaler.fit_transform(df_preprocessed), columns=df_preprocessed.columns)
    
    return df_preprocessed


def visualize_two_variables_correlation(df, first_var, second_var, length=200):
    fig = plt.subplots(figsize=(20,5))

    plt.plot(df[first_var][:length], label=first_var)
    plt.plot(df[second_var][:length], label=second_var)
    plt.legend(loc='upper left')

    plt.savefig(f"./two_{first_var}_{second_var}.png")
    elice_utils.send_image(f"./two_{first_var}_{second_var}.png")


def plot_autocorrelation_anomaly(df, var, lags=500):
    fig, ax = plt.subplots(1, 2, figsize=(20,5))
    fig.suptitle('Raw Data')
    
    # TODO: [Instruction 3-A] Draw the autocorrelation plot of the variable given by var.
    plot_acf(df[var], lags=lags, ax=ax[0])
    
    # TODO: [Instruction 3-B] Draw the partial autocorrelation plot of the variable given by var.
    plot_pacf(df[var], lags=lags, ax=ax[1])

    plt.savefig(f"./autocorrelation_anomaly_{var}.png")
    elice_utils.send_image(f"./autocorrelation_anomaly_{var}.png")


def plot_matrix_profile_anomaly(df, var_list, window_size=50, length=200):
    mp_list = []
    for var in var_list:
        # TODO: [Instruction 4] Compute the matrix profile of the variable var using the scrip++ algorithm.
        mp = matrixProfile(df[var][:length].values, window_size, algorithm='scrip++')
        mp_list.append((var, mp))

    fig = plt.subplots(figsize=(20,5))
    for var, mp in mp_list:
        plt.plot(range(length), mp[0][:length], label=var)

    plt.legend(loc='upper left')

    plt.savefig("./matrix_profile_anomaly.png")
    elice_utils.send_image("./matrix_profile_anomaly.png")


def run_and_plot_arima_anomaly(df, var, length=200):
    train_data, test_data = train_test_split(df[var][:length], test_size=0.2, shuffle=False)

    # TODO: [Instruction 5-A] Build an ARIMA model with the auto_arima function.
    auto_arima_model = auto_arima(train_data,start_p=1,start_q=1,max_p=1,max_q=1,m=24, seasonal=True, D=1, max_D = 1, trace = True, error_action='ignore',suppress_warnings=True, stepwise=True)

    # TODO: [Instruction 5-B] Use the fitted ARIMA model to get predictions for the test period.
    prediction = auto_arima_model.predict(n_periods=len(test_data), return_conf_int=True)
    
    predicted_value = prediction[0]
    predicted_lb = prediction[1][:,0]
    predicted_ub = prediction[1][:,1]
    predict_index = list(test_data.index)
    r2 = r2_score(test_data, predicted_value)

    fig, ax = plt.subplots(figsize=(15,8))
    ax.plot(df[var][0:500], label=var);
    ax.vlines(predict_index[0], 0, 10, linestyle='--',color='r', label='Start of Forecast'); 
    ax.plot(predict_index, predicted_value, color='orange', label='Prediction');
    ax.fill_between(
        predict_index,
        predicted_lb,
        predicted_ub,
        color='k',
        alpha=0.1,
        label='0.95 Prediction Interval'  # predict() uses its default alpha=0.05 here
    )
    ax.legend(loc='upper left')
    plt.suptitle(
        f"ARIMA {auto_arima_model.order},{auto_arima_model.seasonal_order} "
        f"Prediction Results (r2_score: {round(r2,2)})"
    )

    plt.savefig(f"./auto_arima_{var}.png")
    elice_utils.send_image(f"./auto_arima_{var}.png")


def main():
    df_merged = pd.read_csv("./preprocessed_1004_0.csv")
    df_preprocessed = interpolation_and_standardization(df_merged, 2)
    
    # TODO: [Instruction 2-A] Plot dl_prb and ul_prb together to inspect their correlation.
    visualize_two_variables_correlation(df_preprocessed, "dl_prb", "ul_prb", length=200)
    
    # TODO: [Instruction 2-B] Plot dl_bler and reconfig_succ_rate together to inspect their correlation.
    visualize_two_variables_correlation(df_preprocessed, "dl_bler", "reconfig_succ_rate", length=200)

    plot_autocorrelation_anomaly(df_preprocessed, "dl_prb", lags=500)
    plot_matrix_profile_anomaly(
        df_preprocessed,
        ["dl_prb", "ul_prb", "dl_bler", "ul_bler"],
        window_size=50,
        length=200
    )
    run_and_plot_arima_anomaly(df_preprocessed, "dl_prb", length=200)

if __name__ == "__main__":
    main()

In [18]:
def standard_value(df_preprocessed):
    df_dl_bler=df_preprocessed['dl_bler_inter'].values.reshape(-1, 1)
    df_ul_bler=df_preprocessed['ul_bler_inter'].values.reshape(-1, 1)
    df_conn_avg=df_preprocessed['conn_avg_inter'].values.reshape(-1, 1)
    df_conn_max=df_preprocessed['conn_max_inter'].values.reshape(-1, 1)
    df_interx2in_succ_rate=df_preprocessed['interx2in_succ_rate_inter'].values.reshape(-1, 1)
    df_interx2out_succ_rate=df_preprocessed['interx2out_succ_rate_inter'].values.reshape(-1, 1)
    df_intraenb_succ_rate=df_preprocessed['intraenb_succ_rate_inter'].values.reshape(-1, 1)
    df_dl_prb=df_preprocessed['dl_prb_inter'].values.reshape(-1, 1)
    df_ul_prb=df_preprocessed['ul_prb_inter'].values.reshape(-1, 1)
    df_reconfig_succ_rate=df_preprocessed['reconfig_succ_rate_inter'].values.reshape(-1, 1)
    
    # standardize
    scaler = StandardScaler()

    df_dl_bler_inter_scaled= scaler.fit_transform(df_dl_bler)
    df_ul_bler_inter_scaled= scaler.fit_transform(df_ul_bler)
    df_conn_avg_inter_scaled= scaler.fit_transform(df_conn_avg)
    df_conn_max_inter_scaled= scaler.fit_transform(df_conn_max)
    df_interx2in_succ_rate_inter_scaled= scaler.fit_transform(df_interx2in_succ_rate)
    df_interx2out_succ_rate_inter_scaled= scaler.fit_transform(df_interx2out_succ_rate)
    df_intraenb_succ_rate_inter_scaled= scaler.fit_transform(df_intraenb_succ_rate)
    df_dl_prb_inter_scaled= scaler.fit_transform(df_dl_prb)
    df_ul_prb_inter_scaled= scaler.fit_transform(df_ul_prb)
    df_reconfig_succ_rate_inter_scaled= scaler.fit_transform(df_reconfig_succ_rate)
    
    df_preprocessed_2=pd.DataFrame()
    df_preprocessed_2['dl_bler']=pd.Series(df_dl_bler_inter_scaled.reshape(-1))
    df_preprocessed_2['ul_bler']=pd.Series(df_ul_bler_inter_scaled.reshape(-1))
    df_preprocessed_2['conn_avg']=pd.Series(df_conn_avg_inter_scaled.reshape(-1))
    df_preprocessed_2['conn_max']=pd.Series(df_conn_max_inter_scaled.reshape(-1))
    df_preprocessed_2['interx2in_succ_rate']=pd.Series(df_interx2in_succ_rate_inter_scaled.reshape(-1))
    df_preprocessed_2['interx2out_succ_rate']=pd.Series(df_interx2out_succ_rate_inter_scaled.reshape(-1))
    df_preprocessed_2['intraenb_succ_rate']=pd.Series(df_intraenb_succ_rate_inter_scaled.reshape(-1))
    df_preprocessed_2['dl_prb']=pd.Series(df_dl_prb_inter_scaled.reshape(-1))
    df_preprocessed_2['ul_prb']=pd.Series(df_ul_prb_inter_scaled.reshape(-1))
    df_preprocessed_2['reconfig_succ_rate']=pd.Series(df_reconfig_succ_rate_inter_scaled.reshape(-1))
    
    return df_preprocessed_2

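The function that preprocesses a given machine's data and exports it to a CSV file is shown below.

### [TODO] Using the functions implemented above, write a function that preprocesses a given machine's data and exports it to a CSV file.

- Fill the missing values using the `df_new` and `order` arguments.
- Standardize the interpolated data.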

In [19]:
def preprocess_file(df_new, order, machine_name):
    pre_df = fill_missing_value(df_new, order)
    std_df = standard_value(pre_df)
    std_df.to_csv("./df_preprocessed(inter&scaled)_" + machine_name + ".csv", index = None)

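If you only want the interpolated data, comment out the `standard_value` line in the function above and write `pre_df` to the CSV file instead.

In that case, also change the file name to `df_preprocessed(inter)_machine_name.csv` to avoid confusion.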
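Instead of commenting lines in and out, one option is a flag that switches the standardization step and the file name together. The sketch below is a hypothetical variant, not the notebook's `preprocess_file`: the names `preprocess_to_csv`, `scale`, and `out_dir` are assumptions, and it keeps the original column names rather than adding the `_inter` suffix.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_to_csv(df, order, machine_name, scale=True, out_dir="."):
    """Hypothetical helper: interpolate, optionally standardize, write a CSV."""
    numeric = df.select_dtypes(include="number")
    result = numeric.interpolate(method="polynomial", order=order)

    if scale:
        scaler = StandardScaler()
        result = pd.DataFrame(scaler.fit_transform(result), columns=result.columns)
        tag = "inter&scaled"
    else:
        tag = "inter"

    # The file name follows the naming pattern used in this notebook.
    path = os.path.join(out_dir, f"df_preprocessed({tag})_{machine_name}.csv")
    result.to_csv(path, index=False)
    return path


# Usage on a made-up frame, written to a temporary directory.
toy = pd.DataFrame({"dl_prb": [45.4, np.nan, 28.6, 16.5, 14.1],
                    "ul_prb": [59.0, 21.1, np.nan, 8.0, 5.1]})
with tempfile.TemporaryDirectory() as tmp:
    inter_path = preprocess_to_csv(toy, 2, "toy", scale=False, out_dir=tmp)
    scaled_path = preprocess_to_csv(toy, 2, "toy", scale=True, out_dir=tmp)
    inter_name = os.path.basename(inter_path)
    scaled_name = os.path.basename(scaled_path)
    inter_df = pd.read_csv(inter_path)
    scaled_df = pd.read_csv(scaled_path)
print(inter_name, scaled_name)
```

Tying the file name to the flag means the two outputs cannot be mixed up by forgetting to rename one of them.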

Next, we load the merged data so that it can be preprocessed.

To load the merged data, use the module below.

In [20]:
import data_load

Pass `df_list` and the name of the machine you want to the `return_data(df_list, machine_name)` function of `data_load`, as shown below, to get the merged data back.

For example, to merge the data for machine 1005_1, write the following.

In [21]:
df_0=pd.read_csv('./mySuni_PJT_1_Data_bler.csv')
df_1=pd.read_csv('./mySuni_PJT_1_Data_connection.csv')
df_2=pd.read_csv('.ice/dataset/mySuni_PJT_1_Data_interx2in_succ_rate.csv')
df_3=pd.read_csv('/mnt/elice/dataset/mySuni_PJT_1_Data_interx2out_succ_rate.csv')
df_4=pd.read_csv('/mnt/elice/dataset/mySuni_PJT_1_Data_intraenb_succ_rate.csv')
df_5=pd.read_csv('/mnt/elice/dataset/mySuni_PJT_1_Data_PRB.csv')
df_6=pd.read_csv('/mnt/elice/dataset/mySuni_PJT_1_Data_reconfig.csv')

df_list = [df_0,df_1,df_2,df_3,df_4,df_5,df_6]

FileNotFoundError: [Errno 2] No such file or directory: './mySuni_PJT_1_Data_bler.csv'

In [None]:
df_new = data_load.return_data(df_list, '1005_1')

To preprocess the merged data for machine `1005_1` and export the result to a CSV file, call the function as follows.

In [None]:
preprocess_file(df_new, 2, '1005_1')

In [None]:
df_new_preprocess = pd.read_csv('./df_preprocessed(inter&scaled)_1005_1.csv')

In [None]:
df_new_preprocess

The machine names can be listed as follows.

In [None]:
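with open('./machine_list.txt') as f:  # close the file automatically
    for line in f:
        print(line.strip())  # strip the trailing newline to avoid blank lines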