**Notes**:
- The pipeline starts at Feature Engineering, so the dataset used is `imputed.csv`, in which the missing values have already been filled and several columns have already been dropped.
- I made several attempts to use FunctionTransformer so that the Pipeline would be end-to-end (from data cleaning all the way to the trained model). However, I ran into several errors indicating that a DataFrame is not callable inside FunctionTransformer. These errors can be seen in `experiment/Pipeline Experiment.ipynb`.
- Even if I had managed to convert my own data cleaning functions (the ones used in EDA) into FunctionTransformer steps, they would produce results different from those obtained during EDA. In EDA, the data being explored is the raw data (`df`), whereas the pipeline runs the functions on data that has already been split (`X_train`/`X_test`). Reference: https://towardsdatascience.com/building-an-automated-machine-learning-pipeline-part-one-5c70ae682f35
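
The last note can be illustrated with a minimal sketch. It is not the cleaning code used in EDA: `fill_age_with_median` and the toy `Age` values are hypothetical, chosen only to show that a data-dependent cleaning function wrapped in FunctionTransformer gives different results depending on which rows it receives.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def fill_age_with_median(frame):
    """Hypothetical cleaning step: fill missing Age values with the
    median of whatever frame is passed in."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())
    return out


# Wrap the plain function so it can be used as a Pipeline step.
cleaner = FunctionTransformer(fill_age_with_median)

# Toy stand-in for the raw data (assumption, not the real dataset).
raw = pd.DataFrame({"Age": [20.0, 30.0, None, 50.0, 60.0, None]})
train_part = raw.iloc[:3]  # pretend this is the training split

# Median over the whole frame versus median over the split only.
full_result = cleaner.fit_transform(raw)
train_result = cleaner.fit_transform(train_part)

print(full_result["Age"].iloc[2])   # filled with the median of all rows
print(train_result["Age"].iloc[2])  # filled with the median of the split
```

Because the function recomputes its statistic on every call, the same missing value is filled differently for `raw` and for `train_part`, which is why cleaning inside the pipeline would not reproduce the EDA results.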

# Import Libraries

In [1]:
import pandas as pd
import numpy as np

from func import zero_std, custom_info

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('csv/imputed.csv')

In [3]:
X = df.drop('Attrition',axis=1)
y = df['Attrition'].map({'Yes':1,'No':0})

categorical_features = X.select_dtypes(include='O').columns.tolist()
categorical_features

['BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'OverTime']

In [4]:
numerical_features = X.select_dtypes(exclude='O').columns
numerical_features

Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,random_state=11111992)

In [6]:
categorical_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(drop='first'))
])

In [7]:
X_train_cat = categorical_pipeline.fit_transform(X_train[categorical_features])

In [8]:
X_train_cat.toarray().shape

(771, 21)
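
The 21 columns follow from `drop='first'`: each categorical feature with k categories contributes k - 1 columns, so seven features yielding 21 columns implies 28 distinct categories in total. A small sketch of that arithmetic on toy data (the `toy` frame below is an assumption, not the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: Gender has 2 categories, MaritalStatus has 3.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
})

encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(toy)

# Number of categories found per feature, and the expected column count
# after dropping the first category of each feature.
n_categories = [len(cats) for cats in encoder.categories_]
expected_columns = sum(n - 1 for n in n_categories)

print(n_categories, expected_columns, encoded.shape)
```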

In [9]:
feature_engineering = ColumnTransformer([
    ('categoric', categorical_pipeline, categorical_features),
], remainder='passthrough')

In [10]:
X_train_prep = feature_engineering.fit_transform(X_train)
X_train_prep.shape

(771, 44)

In [11]:
# Reuse the transformer fitted on X_train; do not refit it on the test split
X_test_prep = feature_engineering.transform(X_test)
X_test_prep.shape

(258, 44)
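
The test split has the same 44 columns as the training split here, but that is only guaranteed when the transformer fitted on the training data is reused. The sketch below, on hypothetical toy data, shows how refitting on a split that lacks one category changes the column count:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy splits (assumption): the test split never contains 'R&D'.
train = pd.DataFrame({"Department": ["Sales", "HR", "R&D", "Sales"],
                      "Age": [30, 41, 25, 52]})
test = pd.DataFrame({"Department": ["Sales", "HR"],
                     "Age": [33, 47]})


def make_prep():
    # Same structure as the notebook's feature_engineering step.
    return ColumnTransformer(
        [("categoric", OneHotEncoder(drop="first"), ["Department"])],
        remainder="passthrough",
    )


prep = make_prep()
train_prep = prep.fit_transform(train)  # learns 3 categories
test_prep = prep.transform(test)        # reuses the learned categories

# Refitting on the test split learns only 2 categories.
test_refit = make_prep().fit_transform(test)

print(train_prep.shape, test_prep.shape, test_refit.shape)
```

When the fitted `Pipeline` is used for prediction it calls `transform` internally, so this only matters when the preprocessing step is applied by hand.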

In [12]:
pipeline = Pipeline(steps=[
    ('fe', feature_engineering),
    ('pca', PCA(n_components=10)),
    ('dtree', DecisionTreeClassifier(class_weight={0: 1, 1: 5},
                                     max_depth=4.0, max_features=1,
                                     min_samples_leaf=163,
                                     min_samples_split=163,
                                     random_state=11111992,
                                     splitter='random'))
])

In [13]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('fe',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('categoric',
                                                  Pipeline(steps=[('one_hot',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['BusinessTravel',
                                                   'Department',
                                                   'EducationField', 'Gender',
                                                   'JobRole', 'MaritalStatus',
                                                   'OverTime'])])),
                ('pca', PCA(n_components=10)),
                ('dtree',
                 DecisionTreeClassifier(class_weight={0: 1, 1: 5},
                                        max_depth=4.0, max_features=1,
                                        min_samples_leaf=163,
                                        min_

In [14]:
pipeline.predict(X_test)

array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

In [15]:
X_test

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
608,27.0,Non-Travel,727.0,Research & Development,8.0,3,Life Sciences,3,Male,41,...,3,3,2,1,3,3,1,0,0,0
1027,29.0,Travel_Rarely,1378.0,Research & Development,13.0,2,Other,4,Male,46,...,3,1,1,10,2,3,4,3,0,3
242,41.0,Travel_Rarely,549.0,Research & Development,7.0,2,Medical,4,Female,42,...,3,2,0,8,6,3,2,2,2,1
78,29.0,Travel_Rarely,694.0,Research & Development,1.0,3,Life Sciences,4,Female,87,...,3,2,2,9,2,2,7,7,1,7
399,31.0,Travel_Frequently,1125.0,Sales,7.0,4,Marketing,1,Female,68,...,3,4,2,9,3,3,3,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234,30.0,Travel_Rarely,1334.0,Sales,4.0,2,Medical,3,Female,63,...,3,2,3,11,4,2,11,8,2,7
581,34.0,Travel_Rarely,1031.0,Research & Development,6.0,4,Life Sciences,3,Female,45,...,3,3,1,12,3,3,1,0,0,0
907,52.0,Travel_Rarely,319.0,Research & Development,8.0,3,Medical,4,Male,39,...,3,3,0,28,4,3,5,4,0,4
458,37.0,Travel_Rarely,367.0,Research & Development,25.0,2,Medical,3,Female,52,...,3,3,2,9,2,3,6,2,1,3


# Pipeline Testing

In [16]:
X_train.columns

Index(['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [17]:
X_train.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
604,40.0,Travel_Rarely,1416.0,Research & Development,2.0,2,Medical,1,Male,49,...,3,4,1,22,5,3,21,7,3,9
506,54.0,Travel_Frequently,966.0,Research & Development,1.0,4,Life Sciences,4,Female,53,...,3,1,1,33,2,1,5,4,1,4
29,37.0,Travel_Frequently,289.0,Research & Development,2.0,2,Medical,3,Male,38,...,3,3,0,8,2,2,0,0,0,0
482,52.0,Travel_Rarely,1325.0,Research & Development,11.0,4,Life Sciences,4,Female,82,...,4,2,1,9,3,3,5,2,1,4
206,37.0,Travel_Rarely,408.0,Research & Development,19.0,2,Life Sciences,2,Male,73,...,4,1,0,8,1,3,1,0,0,0


In [18]:
data = {'Age' : 40,
        'BusinessTravel' : 'Travel_Frequently',
        'DailyRate': 800,
        'Department': 'Research & Development',
        'DistanceFromHome': 12,
        'Education': 2,
        'EducationField': 'Medical',
        'EnvironmentSatisfaction': 4,
        'Gender': 'Male',
        'HourlyRate': 50,
        'JobInvolvement': 3,
        'JobLevel': 5,
        'JobRole': 'Laboratory Technician',
        'JobSatisfaction': 2,
        'MaritalStatus': 'Divorced',
        'MonthlyIncome': 2561,
        'MonthlyRate': 6969,
        'NumCompaniesWorked': 3,
        'OverTime': 'No',
        'PercentSalaryHike': 15,
        'PerformanceRating': 1,
        'RelationshipSatisfaction': 4,
        'StockOptionLevel': 1,
        'TotalWorkingYears': 20,
        'TrainingTimesLastYear': 3,
        'WorkLifeBalance': 3,
        'YearsAtCompany': 20,
        'YearsInCurrentRole': 8,
        'YearsSinceLastPromotion': 3,
        'YearsWithCurrManager': 4}

test = pd.DataFrame(data, index=[1])
test

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1,40,Travel_Frequently,800,Research & Development,12,2,Medical,4,Male,50,...,1,4,1,20,3,3,20,8,3,4


In [19]:
pipeline.predict(test)

array([1])

# Export Model

In [20]:
import pickle

In [21]:
filename = 'final_model.sav'
# The context manager closes the file even if pickling fails
with open(filename, 'wb') as write:
    pickle.dump(pipeline, write)
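
A quick way to check an export like this is a round trip: dump the fitted pipeline, load it back, and compare predictions. The sketch below uses a toy pipeline and a temporary file (`toy_model.sav`, the random data, and the step names are assumptions for illustration, not the notebook's model):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data and model standing in for the real pipeline.
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = (X[:, 0] > 0.5).astype(int)

model = Pipeline([
    ("scale", StandardScaler()),
    ("dtree", DecisionTreeClassifier(random_state=0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.sav")
    # Write and read inside context managers so the handles are closed.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored pipeline should predict exactly like the original.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Pickled scikit-learn models are generally expected to be loaded with the same scikit-learn version that created them, so it helps to record the version alongside the file.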

# Testing Exported Model

In [22]:
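def predictor(data):
    """Load the exported pipeline and return class probabilities for one record."""
    filename = 'final_model.sav'
    # Close the file as soon as the model has been loaded
    with open(filename, 'rb') as read:
        model = pickle.load(read)
    # Wrap the dict of scalars in a single-row DataFrame
    df = pd.DataFrame(data, index=[1])
    return model.predict_proba(df)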

In [23]:
data = {'Age' : 25,
        'BusinessTravel' : 'Travel_Frequently',
        'DailyRate': 800,
        'Department': 'Research & Development',
        'DistanceFromHome': 12,
        'Education': 2,
        'EducationField': 'Medical',
        'EnvironmentSatisfaction': 4,
        'Gender': 'Male',
        'HourlyRate': 50,
        'JobInvolvement': 3,
        'JobLevel': 2,
        'JobRole': 'Laboratory Technician',
        'JobSatisfaction': 2,
        'MaritalStatus': 'Married',
        'MonthlyIncome': 2561,
        'MonthlyRate': 6969,
        'NumCompaniesWorked': 3,
        'OverTime': 'No',
        'PercentSalaryHike': 15,
        'PerformanceRating': 3,
        'RelationshipSatisfaction': 4,
        'StockOptionLevel': 1,
        'TotalWorkingYears': 20,
        'TrainingTimesLastYear': 3,
        'WorkLifeBalance': 3,
        'YearsAtCompany': 20,
        'YearsInCurrentRole': 8,
        'YearsSinceLastPromotion': 3,
        'YearsWithCurrManager': 4}

In [24]:
predictor(data)[0][1]

0.5345316934720908
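
The indexing `[0][1]` picks the first (and only) row and the second probability column. Columns of `predict_proba` follow the order of `classes_`, so with labels mapped to 0/1 the second column is the probability of class 1 (`Attrition` = Yes). A minimal sketch on a toy classifier (the data below is an assumption) that looks the column up by label instead of hard-coding it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy, perfectly separable data with labels 0 and 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(np.array([[2.5]]))

# Find which column corresponds to the positive label.
positive_col = list(clf.classes_).index(1)
p_yes = proba[0][positive_col]

print(clf.classes_, positive_col, p_yes)
```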