# Session 1: All is Function

---

Training for PT. Astra Honda Motor with Pacmann AI

Turn every preprocessing step into a function:
- `load_dataset`
- `fit_imputer`
- `transform_imputer`
- `fit_standardize`
- `transform_standardize`
- and so on

## The `load_dataset` function
---

- Use this function to load every dataset produced by the previous step: `X_train`, `y_train`, `X_valid`, `y_valid`, `X_test`, and `y_test`.

In [1]:
# Import library
import src.utils as utils

CONFIG_DATA = utils.config_load()
CONFIG_DATA

{'raw_dataset_path': 'data/raw/machining_maintenance.csv',
 'dataset_path': 'data/output/data.pkl',
 'input_set_path': 'data/output/input.pkl',
 'output_set_path': 'data/output/output.pkl',
 'input_cols_path': 'data/output/input_cols.pkl',
 'train_set_path': ['data/output/X_train.pkl', 'data/output/y_train.pkl'],
 'valid_set_path': ['data/output/X_valid.pkl', 'data/output/y_valid.pkl'],
 'test_set_path': ['data/output/X_test.pkl', 'data/output/y_test.pkl'],
 'output_cols': 'Failure Type',
 'drop_cols': ['Product ID', 'Failure Type'],
 'seed': 123,
 'test_size': 0.2,
 'num_cols': ['Air temperature [K]',
  'Process temperature [K]',
  'Rotational speed [rpm]',
  'Torque [Nm]',
  'Tool wear [min]'],
 'cat_cols': ['Type'],
 'num_imputer_path': 'data/output/num_imputer.pkl',
 'cat_imputer_path': 'data/output/cat_imputer.pkl',
 'scaler_path': 'data/output/scaler.pkl',
 'train_clean_path': 'data/output/X_train_clean.pkl',
 'valid_clean_path': 'data/output/X_valid_clean.pkl',
 'test_clean_path

In [2]:
def load_dataset():
    # Load train data
    X_train = utils.pickle_load(CONFIG_DATA['train_set_path'][0])
    y_train = utils.pickle_load(CONFIG_DATA['train_set_path'][1])

    # Load valid data
    X_valid = utils.pickle_load(CONFIG_DATA['valid_set_path'][0])
    y_valid = utils.pickle_load(CONFIG_DATA['valid_set_path'][1])

    # Load test data
    X_test = utils.pickle_load(CONFIG_DATA['test_set_path'][0])
    y_test = utils.pickle_load(CONFIG_DATA['test_set_path'][1])

    # Print
    print("X_train shape :", X_train.shape)
    print("y_train shape :", y_train.shape)
    print("X_valid shape :", X_valid.shape)
    print("y_valid shape :", y_valid.shape)
    print("X_test shape  :", X_test.shape)
    print("y_test shape  :", y_test.shape)

    return X_train, X_valid, X_test, y_train, y_valid, y_test

In [3]:
# Call the function
X_train, X_valid, X_test, y_train, y_valid, y_test = load_dataset()

X_train shape : (6400, 6)
y_train shape : (6400,)
X_valid shape : (1600, 6)
y_valid shape : (1600,)
X_test shape  : (2000, 6)
y_test shape  : (2000,)
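
`load_dataset` relies on `utils.pickle_load`, which lives in the project's `src/utils.py` and is not shown in this notebook. As a rough sketch only, assuming the helpers are thin wrappers around the standard-library `pickle` module (the real module may well use `joblib` or something else), they could look like this:

```python
import os
import pickle
import tempfile


def pickle_dump(obj, path):
    """Serialize `obj` to `path` (hypothetical stand-in for utils.pickle_dump)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def pickle_load(path):
    """Load a pickled object from `path` (hypothetical stand-in for utils.pickle_load)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip check inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    pickle_dump({"seed": 123, "test_size": 0.2}, demo_path)
    restored = pickle_load(demo_path)

print(restored)
```

The same pair of helpers is what later lets the fitted imputers and the scaler be saved once and reloaded for validation, test, or new data.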


## The `split_num_cat` function
---

- Splits the input data into its numerical and categorical columns.

In [4]:
# Numerical columns
num_cols = ['Air temperature [K]', 'Process temperature [K]',
            'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']

# Select the numerical columns
X_train_num = X_train[num_cols]

# Sanity check
print('Data shape:', X_train_num.shape)
X_train_num.head()

Data shape: (6400, 5)


| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] |
|---|---|---|---|---|---|
| 8588 | 297.3 | 307.8 | 1385 | 44.4 | 169 |
| 7189 | 300.4 | 310.4 | 1486 | 46.3 | 48 |
| 7205 | 299.8 | 309.7 | 1483 | 39.6 | 87 |
| 9492 | 299.1 | 309.9 | 1586 | 36.6 | 215 |
| 9241 | 298.1 | 308.7 | 1683 | 29.1 | 161 |


In [5]:
# Categorical columns
cat_cols = ['Type']

# Select the categorical columns
X_train_cat = X_train[cat_cols]

# Sanity check
print('Data shape:', X_train_cat.shape)
X_train_cat.head()

Data shape: (6400, 1)


| | Type |
|---|---|
| 8588 | L |
| 7189 | L |
| 7205 | L |
| 9492 | L |
| 9241 | L |


In [6]:
# Define the function
def split_num_cat(X_train, num_cols, cat_cols):
    # Split data
    X_train_num = X_train[num_cols]
    X_train_cat = X_train[cat_cols]

    # Sanity check
    print('Data numerik shape   :', X_train_num.shape)
    print('Data kategorik shape :', X_train_cat.shape)

    return X_train_num, X_train_cat


In [7]:
# Specify the numerical and categorical columns
num_cols = ['Air temperature [K]', 'Process temperature [K]',
            'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
cat_cols = ['Type']

In [8]:
# Call the function
X_train_num, X_train_cat = split_num_cat(X_train, num_cols, cat_cols)

Data numerik shape   : (6400, 5)
Data kategorik shape : (6400, 1)


- Modification
  - Read the numerical and categorical column names from `config` instead of hard-coding them.

In [9]:
CONFIG_DATA = utils.config_load()
CONFIG_DATA


In [10]:
# Specify the numerical and categorical columns
num_cols = CONFIG_DATA['num_cols']
cat_cols = CONFIG_DATA['cat_cols']

In [11]:
# Call the function on every split
X_train_num, X_train_cat = split_num_cat(X_train, num_cols, cat_cols)
X_valid_num, X_valid_cat = split_num_cat(X_valid, num_cols, cat_cols)
X_test_num, X_test_cat = split_num_cat(X_test, num_cols, cat_cols)

Data numerik shape   : (6400, 5)
Data kategorik shape : (6400, 1)
Data numerik shape   : (1600, 5)
Data kategorik shape : (1600, 1)
Data numerik shape   : (2000, 5)
Data kategorik shape : (2000, 1)


## Numerical imputation functions
---

- Impute missing numerical values with the column median learned from the train data.
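
To make the fit/transform split concrete, here is a small sketch on made-up numbers (the toy frames below are illustrative and not part of the course dataset). The median is learned from the train frame only and then reused to fill the gap in the validation frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration
toy_train = pd.DataFrame({"Torque [Nm]": [10.0, 20.0, np.nan, 40.0]})
toy_valid = pd.DataFrame({"Torque [Nm]": [np.nan, 100.0]})

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
toy_imputer.fit(toy_train)  # learns the median of the non-missing train values

toy_valid_imputed = pd.DataFrame(
    toy_imputer.transform(toy_valid),
    columns=toy_valid.columns,
    index=toy_valid.index,
)

print(toy_imputer.statistics_)  # [20.]
print(toy_valid_imputed)
```

The missing validation value is filled with 20.0, the train median, rather than with anything computed from the validation data itself.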

In [12]:
# Import NumPy, pandas, and SimpleImputer
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [13]:
# Create the imputer object
imputer_num = SimpleImputer(missing_values = np.nan,
                            strategy = 'median')

In [14]:
# Learn the median of each numerical column from the train data
imputer_num.fit(X_train_num)

# Transform the data
X_train_num_imputed = pd.DataFrame(
    imputer_num.transform(X_train_num),
    columns = X_train_num.columns,
    index = X_train_num.index
)

# Sanity check
print('Data shape :', X_train_num_imputed.shape)
print('')
print('Missing val:\n', X_train_num_imputed.isnull().sum())
print('')
X_train_num_imputed.head()

Data shape : (6400, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64



| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] |
|---|---|---|---|---|---|
| 8588 | 297.3 | 307.8 | 1385.0 | 44.4 | 169.0 |
| 7189 | 300.4 | 310.4 | 1486.0 | 46.3 | 48.0 |
| 7205 | 299.8 | 309.7 | 1483.0 | 39.6 | 87.0 |
| 9492 | 299.1 | 309.9 | 1586.0 | 36.6 | 215.0 |
| 9241 | 298.1 | 308.7 | 1683.0 | 29.1 | 161.0 |


In [15]:
# Define fit_num_imputer
def fit_num_imputer(X_train_num):
    # Create the imputer
    imputer_num = SimpleImputer(missing_values = np.nan,
                                strategy = 'median')

    # Fit the imputer
    imputer_num.fit(X_train_num)

    return imputer_num

In [16]:
# Define transform_num_imputer
def transform_num_imputer(X_num, num_imputer):
    # Work on a copy so the caller's data is left untouched
    X_num = X_num.copy()

    # Transform
    X_num_imputed = pd.DataFrame(
        num_imputer.transform(X_num),
        columns = X_num.columns,
        index = X_num.index
    )

    # Sanity check
    print('Data shape :', X_num_imputed.shape)
    print('')
    print('Missing val:\n', X_num_imputed.isnull().sum())
    print('')

    return X_num_imputed

In [17]:
# Fit the imputer on the train data
num_imputer = fit_num_imputer(X_train_num)

In [18]:
# Impute every split
X_train_num_imputed = transform_num_imputer(X_train_num, num_imputer)
X_valid_num_imputed = transform_num_imputer(X_valid_num, num_imputer)
X_test_num_imputed = transform_num_imputer(X_test_num, num_imputer)

Data shape : (6400, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64

Data shape : (1600, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64

Data shape : (2000, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64



- Modification
  - Dump the fitted numerical imputer to disk so it can be reused later.

In [19]:
CONFIG_DATA = utils.config_load()
CONFIG_DATA


In [20]:
# Redefine fit_num_imputer so that it also saves the imputer
def fit_num_imputer(X_train_num, num_imputer_path):
    # Create the imputer
    imputer_num = SimpleImputer(missing_values = np.nan,
                                strategy = 'median')

    # Fit the imputer
    imputer_num.fit(X_train_num)

    # Dump the imputer
    utils.pickle_dump(imputer_num, num_imputer_path)

    return imputer_num

In [21]:
# Fit the imputer and save it to the path from the config
num_imputer_path = CONFIG_DATA['num_imputer_path']
num_imputer = fit_num_imputer(X_train_num, num_imputer_path)

In [22]:
# Impute every split
X_train_num_imputed = transform_num_imputer(X_train_num, num_imputer)
X_valid_num_imputed = transform_num_imputer(X_valid_num, num_imputer)
X_test_num_imputed = transform_num_imputer(X_test_num, num_imputer)

Data shape : (6400, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64

Data shape : (1600, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64

Data shape : (2000, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64



## Categorical imputation functions
---

- Impute missing categorical values with the constant label 'KOSONG' (Indonesian for "empty").
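
As a quick illustration on made-up values (not taken from the course dataset), the `constant` strategy learns nothing from the data: it simply replaces every missing entry with `fill_value`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing entry, invented for illustration
toy_cat = pd.DataFrame({"Type": ["L", np.nan, "H"]}, dtype=object)

toy_cat_imputer = SimpleImputer(
    missing_values=np.nan,
    strategy="constant",
    fill_value="KOSONG",
)

toy_cat_imputed = pd.DataFrame(
    toy_cat_imputer.fit_transform(toy_cat),
    columns=toy_cat.columns,
    index=toy_cat.index,
)

print(toy_cat_imputed["Type"].tolist())  # ['L', 'KOSONG', 'H']
```

Fitting is still required by the scikit-learn API, which is why the notebook keeps the same fit/transform structure as for the numerical imputer.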

In [23]:
# Create the imputer object
imputer_cat = SimpleImputer(missing_values = np.nan,
                            strategy = 'constant',
                            fill_value = 'KOSONG')  # fill with 'KOSONG'

In [24]:
# Fit the imputer on the categorical train data
# (the constant strategy does not learn a statistic such as the median)
imputer_cat.fit(X_train_cat)

# Transform the data
X_train_cat_imputed = pd.DataFrame(
    imputer_cat.transform(X_train_cat),
    columns = X_train_cat.columns,
    index = X_train_cat.index
)

# Sanity check
print('Data shape :', X_train_cat_imputed.shape)
print('')
print('Missing val:\n', X_train_cat_imputed.isnull().sum())
print('')
X_train_cat_imputed.head()

Data shape : (6400, 1)

Missing val:
 Type    0
dtype: int64



| | Type |
|---|---|
| 8588 | L |
| 7189 | L |
| 7205 | L |
| 9492 | L |
| 9241 | L |


In [25]:
# Define fit_cat_imputer
def fit_cat_imputer(X_train_cat):
    # Create the imputer
    imputer_cat = SimpleImputer(missing_values = np.nan,
                                strategy = 'constant',
                                fill_value = 'KOSONG')  # fill with 'KOSONG'

    # Fit the imputer
    imputer_cat.fit(X_train_cat)

    return imputer_cat

In [26]:
# Define transform_cat_imputer
def transform_cat_imputer(X_cat, cat_imputer):
    # Work on a copy so the caller's data is left untouched
    X_cat = X_cat.copy()

    # Transform
    X_cat_imputed = pd.DataFrame(
        cat_imputer.transform(X_cat),
        columns = X_cat.columns,
        index = X_cat.index
    )

    # Sanity check
    print('Data shape :', X_cat_imputed.shape)
    print('')
    print('Missing val:\n', X_cat_imputed.isnull().sum())
    print('')

    return X_cat_imputed

In [27]:
# Fit the imputer on the train data
cat_imputer = fit_cat_imputer(X_train_cat)

In [28]:
# Impute every split
X_train_cat_imputed = transform_cat_imputer(X_train_cat, cat_imputer)
X_valid_cat_imputed = transform_cat_imputer(X_valid_cat, cat_imputer)
X_test_cat_imputed = transform_cat_imputer(X_test_cat, cat_imputer)

Data shape : (6400, 1)

Missing val:
 Type    0
dtype: int64

Data shape : (1600, 1)

Missing val:
 Type    0
dtype: int64

Data shape : (2000, 1)

Missing val:
 Type    0
dtype: int64



- Modification
  - Dump the fitted `cat_imputer` to disk.

In [29]:
CONFIG_DATA = utils.config_load()
CONFIG_DATA


In [30]:
# Redefine fit_cat_imputer so that it also saves the imputer
def fit_cat_imputer(X_train_cat, cat_imputer_path):
    # Create the imputer
    imputer_cat = SimpleImputer(missing_values = np.nan,
                                strategy = 'constant',
                                fill_value = 'KOSONG')  # fill with 'KOSONG'

    # Fit the imputer
    imputer_cat.fit(X_train_cat)

    # Dump the imputer
    utils.pickle_dump(imputer_cat, cat_imputer_path)

    return imputer_cat

In [31]:
# Fit the imputer and save it to the path from the config
cat_imputer_path = CONFIG_DATA['cat_imputer_path']
cat_imputer = fit_cat_imputer(X_train_cat, cat_imputer_path)

In [32]:
# Impute every split
X_train_cat_imputed = transform_cat_imputer(X_train_cat, cat_imputer)
X_valid_cat_imputed = transform_cat_imputer(X_valid_cat, cat_imputer)
X_test_cat_imputed = transform_cat_imputer(X_test_cat, cat_imputer)

Data shape : (6400, 1)

Missing val:
 Type    0
dtype: int64

Data shape : (1600, 1)

Missing val:
 Type    0
dtype: int64

Data shape : (2000, 1)

Missing val:
 Type    0
dtype: int64



## Categorical encoding function
---

- Next, we encode the categorical data.
- As in the earlier modelling step, we use label encoding: each `Type` value is mapped to an integer.
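
One thing worth knowing about encoding with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes `NaN`. The toy series below (invented for illustration) shows what would happen to an imputed 'KOSONG' label if it ever appeared in `Type`:

```python
import pandas as pd

map_dict = {"L": 0, "M": 1, "H": 2}

# Toy values; 'KOSONG' is the fill label used by the categorical imputer
toy_type = pd.Series(["L", "M", "H", "KOSONG"], name="Type")
toy_encoded = toy_type.map(map_dict)

print(toy_encoded.tolist())      # [0.0, 1.0, 2.0, nan]
print(toy_encoded.isna().sum())  # 1
```

In this notebook's data no values are missing, so the issue does not show up in the outputs, but checking `isna().sum()` after encoding is a cheap way to catch it on new data.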

In [33]:
X_train_cat_enc = X_train_cat_imputed.copy()
X_train_cat_enc['Type'] = X_train_cat_imputed['Type'].map({'L': 0, 'M': 1, 'H': 2})

# Check the result
print('Data shape:', X_train_cat_enc.shape)
X_train_cat_enc.head()

Data shape: (6400, 1)


| | Type |
|---|---|
| 8588 | 0 |
| 7189 | 0 |
| 7205 | 0 |
| 9492 | 0 |
| 9241 | 0 |


In [34]:
# Wrap it in a function
def cat_encoding(X_cat):
    # Mapping from category label to integer code
    map_dict = {
        'L': 0,
        'M': 1,
        'H': 2
    }

    # Apply the mapping on a copy; labels missing from map_dict become NaN
    X_cat_enc = X_cat.copy()
    X_cat_enc['Type'] = X_cat['Type'].map(map_dict)

    # Sanity check
    print('Data shape:', X_cat_enc.shape)

    return X_cat_enc


In [35]:
# Call the function on the imputed categorical data
X_train_cat_enc = cat_encoding(X_train_cat_imputed)
X_valid_cat_enc = cat_encoding(X_valid_cat_imputed)
X_test_cat_enc = cat_encoding(X_test_cat_imputed)

Data shape: (6400, 1)
Data shape: (1600, 1)
Data shape: (2000, 1)


## Function to join the numerical & categorical data
---

In [36]:
# Join the numerical and categorical train data column-wise
X_train_concat = pd.concat([X_train_num_imputed,
                            X_train_cat_enc],
                           axis = 1)

# Sanity check
print('Data shape:', X_train_concat.shape)
X_train_concat.head()

Data shape: (6400, 6)


| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Type |
|---|---|---|---|---|---|---|
| 8588 | 297.3 | 307.8 | 1385.0 | 44.4 | 169.0 | 0 |
| 7189 | 300.4 | 310.4 | 1486.0 | 46.3 | 48.0 | 0 |
| 7205 | 299.8 | 309.7 | 1483.0 | 39.6 | 87.0 | 0 |
| 9492 | 299.1 | 309.9 | 1586.0 | 36.6 | 215.0 | 0 |
| 9241 | 298.1 | 308.7 | 1683.0 | 29.1 | 161.0 | 0 |


In [37]:
# Define the function
def concat_data(X_num, X_cat):
    X_concat = pd.concat([X_num, X_cat], axis=1)

    # Sanity check
    print('Data shape:', X_concat.shape)

    return X_concat

In [38]:
# Call the function
X_train_concat = concat_data(X_train_num_imputed, X_train_cat_enc)
X_train_concat.head()

Data shape: (6400, 6)


| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Type |
|---|---|---|---|---|---|---|
| 8588 | 297.3 | 307.8 | 1385.0 | 44.4 | 169.0 | 0 |
| 7189 | 300.4 | 310.4 | 1486.0 | 46.3 | 48.0 | 0 |
| 7205 | 299.8 | 309.7 | 1483.0 | 39.6 | 87.0 | 0 |
| 9492 | 299.1 | 309.9 | 1586.0 | 36.6 | 215.0 | 0 |
| 9241 | 298.1 | 308.7 | 1683.0 | 29.1 | 161.0 | 0 |


In [39]:
# Call the function for the other splits
X_valid_concat = concat_data(X_valid_num_imputed, X_valid_cat_enc)
X_test_concat = concat_data(X_test_num_imputed, X_test_cat_enc)

Data shape: (1600, 6)
Data shape: (2000, 6)
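
`concat_data` works here because `pd.concat(axis=1)` lines rows up by index label, and every transform function above rebuilds its DataFrame with `index = X.index`. The toy frames below (invented for illustration) sketch what would go wrong if one of the pieces lost its original index:

```python
import pandas as pd

num_part = pd.DataFrame({"Torque [Nm]": [44.4, 46.3]}, index=[8588, 7189])

# Same index: rows line up one to one
cat_same_index = pd.DataFrame({"Type": [0, 1]}, index=[8588, 7189])
aligned = pd.concat([num_part, cat_same_index], axis=1)

# Reset index: pandas takes the union of both indexes and fills gaps with NaN
cat_reset_index = cat_same_index.reset_index(drop=True)
misaligned = pd.concat([num_part, cat_reset_index], axis=1)

print(aligned.shape)                       # (2, 2)
print(misaligned.shape)                    # (4, 2)
print(int(misaligned.isna().sum().sum()))  # 4
```

This is why the shape printed by `concat_data` is a useful check: the row count should match the input split exactly.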


## Data standardization functions
---
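
Standardization rescales each column to zero mean and unit variance using z = (x - mean) / std, where the mean and standard deviation come from the data the scaler was fitted on. As a small sketch on made-up numbers, `StandardScaler` gives the same result as the manual formula with the population standard deviation (`ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data, invented for illustration
toy_X = np.array([[1.0], [2.0], [3.0], [4.0]])

toy_scaler = StandardScaler().fit(toy_X)
scaled_sklearn = toy_scaler.transform(toy_X)

# Same computation by hand: z = (x - mean) / std, with std using ddof=0
scaled_manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0, ddof=0)

print(toy_scaler.mean_)                            # [2.5]
print(np.allclose(scaled_sklearn, scaled_manual))  # True
```

Note that the cells below fit the scaler on the concatenated data, so the integer-encoded `Type` column is standardized together with the numerical columns.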

In [40]:
# Import library
from sklearn.preprocessing import StandardScaler

In [41]:
# Create the scaler object
scaler = StandardScaler()

In [42]:
# Learn each column's mean and standard deviation from the train data
scaler.fit(X_train_concat)

# Transform the data
X_train_clean = pd.DataFrame(
    scaler.transform(X_train_concat),
    columns = X_train_concat.columns,
    index = X_train_concat.index
)

# Check the result
print('Data shape:', X_train_clean.shape)
X_train_clean.head()

Data shape: (6400, 6)


| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Type |
|---|---|---|---|---|---|---|
| 8588 | -1.344143 | -1.474361 | -0.864169 | 0.441091 | 0.965616 | -0.747463 |
| 7189 | 0.201073 | 0.267583 | -0.294506 | 0.631797 | -0.941701 | -0.747463 |
| 7205 | -0.098001 | -0.201402 | -0.311426 | -0.040691 | -0.326946 | -0.747463 |
| 9492 | -0.446921 | -0.067406 | 0.269518 | -0.341805 | 1.690712 | -0.747463 |
| 9241 | -0.945378 | -0.87138 | 0.81662 | -1.09459 | 0.839512 | -0.747463 |


In [43]:
# Fit scaler
def fit_scaler(X_train):
    # Create the scaler
    scaler = StandardScaler()

    # Fit scaler
    scaler.fit(X_train)

    return scaler

In [44]:
# Transform with a fitted scaler
def transform_scaler(X, scaler):
    X_clean = pd.DataFrame(
        scaler.transform(X),
        columns = X.columns,
        index = X.index
    )

    # Sanity check
    print('Data shape:', X_clean.shape)

    return X_clean

In [45]:
# Fit the scaler on the train data
scaler = fit_scaler(X_train_concat)

In [46]:
# Transform data
X_train_clean = transform_scaler(X_train_concat, scaler)
X_train_clean.head()

Data shape: (6400, 6)


| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Type |
|---|---|---|---|---|---|---|
| 8588 | -1.344143 | -1.474361 | -0.864169 | 0.441091 | 0.965616 | -0.747463 |
| 7189 | 0.201073 | 0.267583 | -0.294506 | 0.631797 | -0.941701 | -0.747463 |
| 7205 | -0.098001 | -0.201402 | -0.311426 | -0.040691 | -0.326946 | -0.747463 |
| 9492 | -0.446921 | -0.067406 | 0.269518 | -0.341805 | 1.690712 | -0.747463 |
| 9241 | -0.945378 | -0.87138 | 0.81662 | -1.09459 | 0.839512 | -0.747463 |


- Improvement
  - Dump the fitted `scaler` to disk.

In [47]:
CONFIG_DATA = utils.config_load()
CONFIG_DATA


In [48]:
# Fit scaler
def fit_scaler(X_train, scaler_path):
    # Create the scaler
    scaler = StandardScaler()

    # Fit scaler
    scaler.fit(X_train)

    # Dump the fitted scaler
    utils.pickle_dump(scaler, scaler_path)

    return scaler

In [49]:
# Fit the scaler and save it to the path from the config
scaler_path = CONFIG_DATA['scaler_path']
scaler = fit_scaler(X_train_concat, scaler_path)

In [50]:
# Transform data
X_train_clean = transform_scaler(X_train_concat, scaler)
X_valid_clean = transform_scaler(X_valid_concat, scaler)
X_test_clean = transform_scaler(X_test_concat, scaler)

Data shape: (6400, 6)
Data shape: (1600, 6)
Data shape: (2000, 6)


Yes! Just one more step to go.

## Function to run the whole preprocessing
---

In [51]:
def preprocess_data(X, types, CONFIG_DATA):
    if X is None:
        # No data passed in: load the input set named by `types`
        path = f'{types}_set_path'
        X = utils.pickle_load(CONFIG_DATA[path][0])

    # Run the preprocessing steps
    # First, split the numerical and categorical columns
    num_cols = CONFIG_DATA['num_cols']
    cat_cols = CONFIG_DATA['cat_cols']
    X_num, X_cat = split_num_cat(X, num_cols, cat_cols)

    # Imputation
    if types=='train':
        # Train data: fit the imputers and dump them
        num_imputer_path = CONFIG_DATA['num_imputer_path']
        cat_imputer_path = CONFIG_DATA['cat_imputer_path']
        num_imputer = fit_num_imputer(X_num, num_imputer_path)
        cat_imputer = fit_cat_imputer(X_cat, cat_imputer_path)
    else:
        # Any other data: load the imputers fitted on train
        num_imputer_path = CONFIG_DATA['num_imputer_path']
        cat_imputer_path = CONFIG_DATA['cat_imputer_path']
        num_imputer = utils.pickle_load(num_imputer_path)
        cat_imputer = utils.pickle_load(cat_imputer_path)

    X_num_imputed = transform_num_imputer(X_num, num_imputer)
    X_cat_imputed = transform_cat_imputer(X_cat, cat_imputer)

    # Encoding
    X_cat_enc = cat_encoding(X_cat_imputed)

    # Join the numerical and categorical data
    X_concat = concat_data(X_num_imputed, X_cat_enc)

    # Scaling
    if types=='train':
        # Train data: fit the scaler and dump it
        scaler_path = CONFIG_DATA['scaler_path']
        scaler = fit_scaler(X_concat, scaler_path)
    else:
        # Any other data: load the scaler fitted on train
        scaler_path = CONFIG_DATA['scaler_path']
        scaler = utils.pickle_load(scaler_path)

    X_clean = transform_scaler(X_concat, scaler)

    # Sanity check
    print('Data shape:', X_clean.shape)

    # Dump the cleaned data for the known splits
    if types in ['train', 'valid', 'test']:
        clean_path = CONFIG_DATA[f'{types}_clean_path']
        utils.pickle_dump(X_clean, clean_path)

    return X_clean
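
The `types=='train'` branches in `preprocess_data` follow a fit-on-train, load-elsewhere pattern: the fitted objects are written to disk once and every other call reads them back. The sketch below isolates that pattern on toy data. `fit_or_load_scaler` is a hypothetical helper written for this illustration and uses the standard-library `pickle` directly instead of the project's `utils`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler


def fit_or_load_scaler(X, is_train, scaler_path):
    """Fit and save a scaler on train data, otherwise load the saved one.

    Hypothetical helper for illustration; it is not part of the course code.
    """
    if is_train:
        scaler = StandardScaler().fit(X)
        with open(scaler_path, "wb") as f:
            pickle.dump(scaler, f)
    else:
        with open(scaler_path, "rb") as f:
            scaler = pickle.load(f)
    return scaler


# Toy data, invented for illustration
toy_train = np.array([[0.0], [10.0]])
toy_new = np.array([[5.0], [20.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    fit_or_load_scaler(toy_train, is_train=True, scaler_path=path)
    loaded = fit_or_load_scaler(toy_new, is_train=False, scaler_path=path)
    toy_new_scaled = loaded.transform(toy_new)

print(loaded.mean_)            # [5.]
print(toy_new_scaled.ravel())  # [0. 3.]
```

Two consequences follow for `preprocess_data` as written: it has to be run with `types='train'` before any other split so that the pickle files exist, and a call with `X` passed explicitly and a `types` value outside `'train'`, `'valid'`, `'test'` reuses the saved objects without dumping the result.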

In [52]:
CONFIG_DATA = utils.config_load()
CONFIG_DATA


In [53]:
# Preprocess every split (train first, so the fitted objects exist on disk)
preprocess_data(X=None, types='train', CONFIG_DATA=CONFIG_DATA)
preprocess_data(X=None, types='valid', CONFIG_DATA=CONFIG_DATA)
preprocess_data(X=None, types='test', CONFIG_DATA=CONFIG_DATA)

Data numerik shape   : (6400, 5)
Data kategorik shape : (6400, 1)


Data shape : (6400, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64

Data shape : (6400, 1)

Missing val:
 Type    0
dtype: int64

Data shape: (6400, 1)
Data shape: (6400, 6)
Data shape: (6400, 6)
Data shape: (6400, 6)
Data numerik shape   : (1600, 5)
Data kategorik shape : (1600, 1)
Data shape : (1600, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
dtype: int64

Data shape : (1600, 1)

Missing val:
 Type    0
dtype: int64

Data shape: (1600, 1)
Data shape: (1600, 6)
Data shape: (1600, 6)
Data shape: (1600, 6)
Data numerik shape   : (2000, 5)
Data kategorik shape : (2000, 1)
Data shape : (2000, 5)

Missing val:
 Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]     

| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Type |
|---|---|---|---|---|---|---|
| 6632 | 0.699529 | 0.267583 | -1.337949 | 1.123616 | 0.067128 | -0.747463 |
| 3669 | 1.197986 | 1.205553 | -0.412950 | 0.019532 | 0.555779 | -0.747463 |
| 4480 | 1.347523 | 0.267583 | -1.315388 | 1.525101 | -1.383063 | 2.243322 |
| 2319 | -0.397075 | -0.804382 | 0.912504 | -0.652956 | 1.217823 | 0.747930 |
| 3809 | 1.098295 | 0.535575 | -1.061577 | 1.665621 | -0.500338 | -0.747463 |
| ... | ... | ... | ... | ... | ... | ... |
| 1780 | -0.745995 | -1.206370 | 0.320280 | -0.773402 | -0.973227 | -0.747463 |
| 8299 | -0.646304 | -0.000408 | 0.557169 | -0.823587 | -1.036278 | -0.747463 |
| 4658 | 1.546906 | 0.803566 | 0.794059 | -0.522473 | -1.304248 | -0.747463 |
| 4997 | 1.796134 | 1.875532 | 6.321487 | -2.871162 | -1.288486 | 0.747930 |


Great! Now all that is left is to move these functions into a `.py` file.