# Python Machine Learning Assignment with PACMANN AI

## Machine Learning Process Flowchart
### 1. Importing Data to Python
* Drop Duplicates
### 2. Data Preprocessing
* Input-Output Split, Train-Test Split
* Imputation, Processing Categorical, Normalization
### 3. Training Machine Learning
* Choose Score to optimize and Hyperparameter Space
### 4. Test Prediction
* Evaluate model performance on Test Data

## 1. Importing Data to Python

In [1]:
# Import libraries

import numpy as np # Import NumPy as np
import pandas as pd # Then import pandas as pd

## Get Dataset

Below is the documentation for the dataset used in this assignment. Its columns fall into two categories: numerical and categorical.

### Dataset Information

"Churn Rate" is a business term describing the rate at which customers leave or cease paying for a product or service. It's a critical figure in many businesses, as it's often the case that acquiring new customers is a lot more costly than retaining existing ones (in some cases, 5 to 20 times more expensive).
 Predicting churn is particularly important for businesses with subscription models such as cell phone, cable, or merchant credit card processing plans.  
### Content
There are 21 variables: 

1. State
2. Account Length
3. Area Code
4. Phone
5. Int'l Plan
6. VMail Plan
7. VMail Message
8. Day Mins
9. Day Calls
10. Day Charge
11. Eve Mins
12. Eve Calls
13. Eve Charge
14. Night Mins
15. Night Calls
16. Night Charge
17. Intl Mins
18. Intl Calls
19. Intl Charge
20. CustServ Calls
21. Churn? : Output

In [2]:
# Read the dataset "churn.csv" using the read_csv() function

data_churn = pd.read_csv("churn.csv")

In [3]:
# Check the first 5 observations of the dataset

data_churn.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


## Dropping Duplicates

In [4]:
# Check the shape of the data before dropping duplicates

data_churn.shape

(3333, 21)

In [5]:
# Check whether there are any duplicate observations

data_churn.duplicated().sum()

0

In [6]:
# Drop the duplicate rows

data_churn = data_churn.drop_duplicates()

In [7]:
# Check the shape again

data_churn.shape

(3333, 21)
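
The churn data happens to contain no duplicates, so the effect of the two calls is not visible above. A minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from `churn.csv`) shows what `duplicated()` counts and what `drop_duplicates()` keeps:

```python
import pandas as pd

# Hypothetical toy frame: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "Account Length": [128, 107, 128],
    "Churn?": ["False.", "False.", "False."],
})

n_duplicates = toy.duplicated().sum()   # counts rows that repeat an earlier row
deduplicated = toy.drop_duplicates()    # keeps the first occurrence of each row

print(n_duplicates)
print(deduplicated.shape)
```

Note that the surviving rows keep their original index labels, so the index may have gaps after dropping.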

In [8]:
# Check the data again

data_churn.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


### Make function to import and drop

Write a function that does the following:

 1. import the data
 2. report the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS
 3. drop duplicate rows
 4. drop unnecessary columns
 5. report the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS again after dropping
 6. return the data after dropping

Name the function `import_data`. It takes 2 arguments:

 1. `filename`: Path to the file where the data is stored
 2. `drop`    : Names of the columns to drop

Then assign the result of the function to a variable named `data_churn`.

In [9]:
# Define the function

def import_data(filename, drop):
    # Read the CSV file and report its original size
    data = pd.read_csv(filename)
    print("Original data : %d observations, %d columns." % data.shape)
    print("Number of duplicate rows :", data.duplicated().sum())
    data_drop = data.drop_duplicates()
    
    # Pass drop="none" to keep every column; otherwise pass a list of column names
    if drop != "none":
        print("Dropped columns :", drop)
        data_drop = data_drop.drop(drop, axis=1)
    print("Data after dropping : %d observations, %d columns." % data_drop.shape)
    
    return data_drop

In [10]:
# Assign the result of the function to the variable data_churn

data_churn = import_data(filename = "churn.csv", drop = ["State", "Area Code", "Phone"]) # result of the function

Original data : 3333 observations, 21 columns.
Number of duplicate rows : 0
Dropped columns : ['State', 'Area Code', 'Phone']
Data after dropping : 3333 observations, 18 columns.
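
As a variation, the same steps can be written with `drop=None` as the "drop nothing" default instead of a sentinel string. The sketch below is an assumption-laden illustration: `import_data_sketch` and the inline CSV text are hypothetical stand-ins, and `io.StringIO` is used only so the example runs without `churn.csv` on disk.

```python
import io
import pandas as pd

def import_data_sketch(source, drop=None):
    """Read a CSV, drop duplicate rows, then drop the columns listed in `drop`."""
    data = pd.read_csv(source)
    data = data.drop_duplicates()
    if drop:  # None or an empty list means "drop nothing"
        data = data.drop(columns=drop)
    return data

# Made-up CSV content standing in for churn.csv (the second row repeats the first).
csv_text = (
    "State,Phone,Day Mins,Churn?\n"
    "KS,382-4657,265.1,False.\n"
    "KS,382-4657,265.1,False.\n"
    "OH,371-7191,161.6,False.\n"
)

sample = import_data_sketch(io.StringIO(csv_text), drop=["State", "Phone"])
untouched = import_data_sketch(io.StringIO(csv_text))

print(sample.shape)
print(untouched.shape)
```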


## 2. Data Preprocessing
### Input-Output Split

Here we separate the columns into input and output.

The input data is named `X` and the output `y`.

In this dataset, the only output column is `Churn?`; every other column is an input.

In [11]:
# Check the data using head()

data_churn.head()

Unnamed: 0,Account Length,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,128,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,107,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,137,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,84,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,75,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


### Make function for input and output

Write a function that meets the criteria below:

1. data_input
2. data_output
3. return data_input and data_output
* The point of wrapping this in a function is that it can be reused on other cases.

Name the function `extract_input_output`. It takes 2 arguments:

1. `data`        : Dataset to split
2. `column_name` : Name of the column to use as the output


In [12]:
# Write the function here

def extract_input_output(data, column_name):
    data_input = data.drop(column_name, axis=1)
    data_output = data[column_name]
    
    return data_input, data_output

# Assign the result of the function to X, y.
# X: input data
# y: output data
X, y = extract_input_output(data = data_churn, column_name = "Churn?")

In [13]:
X.head()

Unnamed: 0,Account Length,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
0,128,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,107,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1
2,137,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0
3,84,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,75,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3


In [14]:
y.head()

0    False.
1    False.
2    False.
3    False.
4    False.
Name: Churn?, dtype: object

## Train and Test Split

In this part, X and y are each split into 2 sets: training and test. We use the `train_test_split` function from the Scikit Learn library.

In [15]:
# import the train_test_split function from the Scikit Learn library

from sklearn.model_selection import train_test_split

#### Train Test Split Function
1. X is the input
2. y is the output
3. test size = the proportion of the data held out for testing, e.g. 0.20 for a 20% test set
4. random state is the seed for the random shuffle; use the same value on every run to get the same split, e.g. random_state = 123
5. Output: 
    * X_train = input of the training data
    * X_test = input of the test data
    * y_train = output of the training data
    * y_test = output of the test data
6. the order X_train, X_test, y_train, y_test must not be swapped

In [16]:
# Split dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 123)

In [17]:
# Check the shape of each set (X_train, X_test, y_train, y_test)

print("Data input training :", X_train.shape)
print("Data input test :", X_test.shape)
print("Data output training :", y_train.shape)
print("Data output test :", y_test.shape)

Data input training : (2666, 17)
Data input test : (667, 17)
Data output training : (2666,)
Data output test : (667,)
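
Two properties of `train_test_split` are worth checking on a small example: a fixed `random_state` reproduces the same split, and the optional `stratify` argument (not used in this notebook) keeps the class proportions equal in both sets, which can matter for an imbalanced target such as churn. The data below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 15% positive class.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series(["True."] * 15 + ["False."] * 85, name="Churn?")

# Same random_state -> identical split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=123)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=123)
print(split_a[1].index.equals(split_b[1].index))

# stratify=y_demo keeps the 15% / 85% class mix in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print((y_te == "True.").mean())
```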


## Separating Categorical and Numerical Data Manually

## Getting Categorical


In [18]:
X_train.columns

Index(['Account Length', "Int'l Plan", 'VMail Plan', 'VMail Message',
       'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls',
       'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins',
       'Intl Calls', 'Intl Charge', 'CustServ Calls'],
      dtype='object')

In [19]:
for i in X_train.columns:
    print(X_train[i].value_counts())

105    36
93     36
101    34
112    33
87     33
100    32
86     32
116    32
99     31
90     30
123    30
64     30
80     30
130    29
95     28
122    28
98     27
107    27
73     27
119    27
74     27
75     27
85     27
88     26
127    26
97     26
113    26
92     26
94     26
120    26
       ..
170     2
217     2
210     2
224     2
10      2
225     2
168     2
12      2
7       2
18      2
186     2
194     2
195     2
26      1
8       1
6       1
4       1
243     1
178     1
192     1
196     1
204     1
208     1
232     1
5       1
199     1
205     1
215     1
221     1
2       1
Name: Account Length, dtype: int64
no     2412
yes     254
Name: Int'l Plan, dtype: int64
no     1932
yes     734
Name: VMail Plan, dtype: int64
0     1932
31      57
30      38
33      37
28      37
29      37
27      36
24      36
32      34
23      30
26      29
35      28
36      28
25      28
22      25
39      23
21      22
37      21
38      20
20      20
34      19
19      16
16 

In [20]:
# Get Categorical

categories = ["Int'l Plan", "VMail Plan", ] #list of categorical columns

X_train_cat = X_train[categories] # define the categorical columns from x_train dataset

In [21]:
# check the top observations!

X_train_cat.head()

Unnamed: 0,Int'l Plan,VMail Plan
1881,no,no
48,no,no
2886,no,no
2294,no,no
314,no,no


In [22]:
# Get numerical

numerical = ["Account Length", "VMail Message", "Day Mins", 
             "Day Calls", "Day Charge", "Eve Mins", "Eve Calls", 
             "Eve Charge", "Night Mins", "Night Calls", "Night Charge", 
             "Intl Mins", "Intl Calls", "Intl Charge", "CustServ Calls"]

X_train_num = X_train[numerical]

In [23]:
# check the top observations!

X_train_num.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1881,76,0,272.7,97,46.36,236.4,95,20.09,235.5,105,10.6,7.7,2,2.08,0
48,119,0,159.1,114,27.05,231.3,117,19.66,143.2,91,6.44,8.8,3,2.38,5
2886,85,0,144.4,88,24.55,264.6,105,22.49,185.4,94,8.34,9.9,3,2.67,1
2294,59,0,189.7,100,32.25,115.9,133,9.85,220.6,115,9.93,7.4,4,2.0,0
314,128,0,125.2,99,21.28,205.4,107,17.46,254.4,111,11.45,18.9,2,5.1,0


### Make a function for Separating Numerical and Categorical

In [24]:
# Def a function that returns x_train categorical and x_train numerical

def categorical_numerical_separation(data, categorical_columns, numerical_columns):
    categorical_data = data[categorical_columns]
    numerical_data = data[numerical_columns]
    
    return categorical_data, numerical_data

X_train_cat, X_train_num = categorical_numerical_separation(data = X_train, 
                                                          categorical_columns = categories,
                                                          numerical_columns = numerical)

In [25]:
# check the top of the x_train categorical observations!

X_train_cat.head()

Unnamed: 0,Int'l Plan,VMail Plan
1881,no,no
48,no,no
2886,no,no
2294,no,no
314,no,no


In [26]:
# check the top of the x_train numerical observations!

X_train_num.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1881,76,0,272.7,97,46.36,236.4,95,20.09,235.5,105,10.6,7.7,2,2.08,0
48,119,0,159.1,114,27.05,231.3,117,19.66,143.2,91,6.44,8.8,3,2.38,5
2886,85,0,144.4,88,24.55,264.6,105,22.49,185.4,94,8.34,9.9,3,2.67,1
2294,59,0,189.7,100,32.25,115.9,133,9.85,220.6,115,9.93,7.4,4,2.0,0
314,128,0,125.2,99,21.28,205.4,107,17.46,254.4,111,11.45,18.9,2,5.1,0


## Data Imputation

Data imputation is the process of filling in missing values, which usually appear as NaN.

The process has 2 parts:
* Numerical Imputation
* Categorical Imputation

In [27]:
# Check for missing values in the training set input

X_train.isnull().any()

Account Length    False
Int'l Plan        False
VMail Plan        False
VMail Message     False
Day Mins          False
Day Calls         False
Day Charge        False
Eve Mins          False
Eve Calls         False
Eve Charge        False
Night Mins        False
Night Calls       False
Night Charge      False
Intl Mins         False
Intl Calls        False
Intl Charge       False
CustServ Calls    False
dtype: bool

## Finding Difference between Numerical and Categorical (Optional)

In [28]:
X_train.head()

Unnamed: 0,Account Length,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1881,76,no,no,0,272.7,97,46.36,236.4,95,20.09,235.5,105,10.6,7.7,2,2.08,0
48,119,no,no,0,159.1,114,27.05,231.3,117,19.66,143.2,91,6.44,8.8,3,2.38,5
2886,85,no,no,0,144.4,88,24.55,264.6,105,22.49,185.4,94,8.34,9.9,3,2.67,1
2294,59,no,no,0,189.7,100,32.25,115.9,133,9.85,220.6,115,9.93,7.4,4,2.0,0
314,128,no,no,0,125.2,99,21.28,205.4,107,17.46,254.4,111,11.45,18.9,2,5.1,0


In [29]:
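# Get Numerical

X_train_num = X_train._get_numeric_data()
# _get_numeric_data() keeps only the integer and float columns.
# It is a private pandas method; X_train.select_dtypes(include="number") is the public way to select numeric columns.
# Be careful with categorical data that is stored as integers!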
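
The warning about integer-coded categories can be made concrete with the public `select_dtypes` method. The frame below is a hypothetical example in which `Area Code` is stored as an integer even though it is really a label:

```python
import pandas as pd

# Hypothetical frame: "Area Code" is an integer column but behaves like a category.
frame = pd.DataFrame({
    "Day Mins": [265.1, 161.6, 243.4],
    "Area Code": [415, 415, 408],
    "Int'l Plan": ["no", "no", "yes"],
})

numeric_part = frame.select_dtypes(include="number")      # int and float columns
categorical_part = frame.select_dtypes(exclude="number")  # everything else

print(list(numeric_part.columns))
print(list(categorical_part.columns))

# An integer-coded category has to be converted explicitly before the split.
frame["Area Code"] = frame["Area Code"].astype(str)
numeric_after = frame.select_dtypes(include="number")
print(list(numeric_after.columns))
```

Selecting by dtype is convenient with many columns, but it only reflects how the values are stored, so a manual column list (as used earlier in this notebook) stays the safer choice when in doubt.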

### Checking for NaN (Not a Number)

In [30]:
# check the missing value of the numerical x_train

X_train_num.isnull().any()

Account Length    False
VMail Message     False
Day Mins          False
Day Calls         False
Day Charge        False
Eve Mins          False
Eve Calls         False
Eve Charge        False
Night Mins        False
Night Calls       False
Night Charge      False
Intl Mins         False
Intl Calls        False
Intl Charge       False
CustServ Calls    False
dtype: bool

## Numerical Data Imputation

In [31]:
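# Import library for imputation
# Note: Imputer was removed in scikit-learn 0.22. On newer releases, use
# sklearn.impute.SimpleImputer with missing_values=np.nan instead.

from sklearn.preprocessing import Imputer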

In [32]:
# Get Numerical and Categorical Manually

X_train_cat, X_train_num = categorical_numerical_separation(data = X_train, 
                                                          categorical_columns = categories,
                                                          numerical_columns = numerical)

In [33]:
imput = Imputer(missing_values='NaN', strategy='median')
# create an Imputer instance named imput; don't forget the parentheses ()
# missing_values is the marker for missing values in the data: it can be NaN, 9999, "KOSONG" ("empty"), ...
# strategy='median' is the imputation strategy: missing entries are replaced with the column median
# strategy can also be set to 'mean' (the average)
# see median: https://en.wikipedia.org/wiki/Median

* fit: lets the imputer learn the mean or median of each column
* transform: fills the missing entries with that median or mean
* the output of transform is a NumPy array, so it is wrapped in a pd DataFrame
* name the columns of x_train_numerical_imputed after those of X_train_num.
     - WHY? Because the column names are lost after data imputation
* give x_train_numerical_imputed the same index as X_train_num.
     - WHY? Because the index is lost after data imputation
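
The fit / transform / relabel steps listed above can be sketched on a small frame. This sketch assumes a recent scikit-learn release, where the class is `sklearn.impute.SimpleImputer` rather than the older `Imputer`; the data and index labels are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frame with one missing value per column and a non-default index.
train_num = pd.DataFrame(
    {"Day Mins": [100.0, np.nan, 300.0, 200.0],
     "Day Calls": [10.0, 20.0, np.nan, 40.0]},
    index=[11, 22, 33, 44],
)

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(train_num)                         # learns one median per column
imputed_array = imputer.transform(train_num)   # NumPy array by default: labels are gone

# Restore the column names and the row index from the input frame.
train_num_imputed = pd.DataFrame(
    imputed_array, columns=train_num.columns, index=train_num.index
)

print(imputer.statistics_)
print(train_num_imputed)
```

The learned medians live in `imputer.statistics_`, which is why the fitted object, not just the imputed frame, has to be kept for the test data.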

In [34]:
# these are the commands that will later go into a new function
# the imputer needs to be fitted to the data

imput.fit(X_train_num)
x_train_numerical_imputed = pd.DataFrame(imput.transform(X_train_num))
x_train_numerical_imputed.columns = X_train_num.columns
x_train_numerical_imputed.index = X_train_num.index

In [35]:
# check the imputer result again: are there any missing values left?

x_train_numerical_imputed.isnull().any()

Account Length    False
VMail Message     False
Day Mins          False
Day Calls         False
Day Charge        False
Eve Mins          False
Eve Calls         False
Eve Charge        False
Night Mins        False
Night Calls       False
Night Charge      False
Intl Mins         False
Intl Calls        False
Intl Charge       False
CustServ Calls    False
dtype: bool

## Categorical Imputation

In [36]:
# Get the list of categorical column names.
# You can type the columns out directly, or build the list from the data when there are many variables.

x_train_columns = list(X_train.columns)        # all column names as a list
numerical_columns = list(X_train_num.columns)      # numerical column names as a list
cat_columns = list(X_train_cat.columns)            # categorical column names as a list

In [37]:
cat_columns

["Int'l Plan", 'VMail Plan']

In [38]:
# take the categorical columns from X_train and store them as the DataFrame categorical_data

categorical_data = X_train[cat_columns]     

# check for missing values in the categorical data

categorical_data.isnull().any()

Int'l Plan    False
VMail Plan    False
dtype: bool

In [39]:
categorical_data = X_train[cat_columns] # select the categorical columns from the data
categorical_data = categorical_data.fillna(value='KOSONG')  # fill missing values with the category "KOSONG" ("empty")

In [40]:
# check the missing values again

categorical_data.isnull().any()

Int'l Plan    False
VMail Plan    False
dtype: bool

## Make a Function
* Make a function to get the numerical variables and impute the missing values
* Make a function to get the categorical variables and impute the missing values

In [41]:
imput_numerical = Imputer(missing_values='NaN', strategy='median')

def numericalImputation(data, numerical_columns):
    numerical_data = data[numerical_columns]
    imput_numerical.fit(numerical_data)
    
    numerical_data_imputed = pd.DataFrame(imput_numerical.transform(numerical_data))
    numerical_data_imputed.columns = numerical_columns
    numerical_data_imputed.index = numerical_data.index
    
    return numerical_data_imputed, imput_numerical

def categoricalImputation(data, categorical_columns):
    categorical_data = data[categorical_columns].fillna(value="KOSONG")
    
    return categorical_data

X_train_numerical, imput_numerical = numericalImputation(data = X_train, numerical_columns = numerical_columns)
X_train_categorical = categoricalImputation(data = X_train, categorical_columns = cat_columns)

## Preprocessing Categorical Variables

* create dummy variable for each of categorical variable

In [42]:
categorical_dummies = pd.get_dummies(X_train_categorical)

In [43]:
# check the top observations

categorical_dummies.head()

Unnamed: 0,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1881,1,0,1,0
48,1,0,1,0
2886,1,0,1,0
2294,1,0,1,0
314,1,0,1,0


### Make a function to get the dummies

In [44]:
def extractCategorical(data, categorical_columns):
    X_train_categorical = categoricalImputation(data, categorical_columns)
    categorical_dummies = pd.get_dummies(X_train_categorical)
    
    return categorical_dummies

X_train_cat_dummies = extractCategorical(data = X_train, categorical_columns = cat_columns)
dummies_columns = X_train_cat_dummies.columns

In [45]:
# check the top observations

X_train_cat_dummies.head()

Unnamed: 0,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1881,1,0,1,0
48,1,0,1,0
2886,1,0,1,0
2294,1,0,1,0
314,1,0,1,0
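
The dummy column names are saved in `dummies_columns` because a later dataset may not contain every category. A minimal sketch, on made-up data, of how `reindex` can align a new frame with the training columns (`dtype=int` is passed so the dummies print as 0/1 on recent pandas versions, which otherwise return booleans):

```python
import pandas as pd

# Hypothetical data: the category "yes" appears in training but not in the new data.
train_cat = pd.DataFrame({"Int'l Plan": ["no", "yes", "no"]})
test_cat = pd.DataFrame({"Int'l Plan": ["no", "no"]})

train_dummies = pd.get_dummies(train_cat, dtype=int)
dummy_columns = train_dummies.columns          # the column layout the model will expect

test_dummies = pd.get_dummies(test_cat, dtype=int)
# reindex returns a new frame: missing columns are added and filled with 0,
# and columns that were not seen in training are dropped.
test_aligned = test_dummies.reindex(columns=dummy_columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```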


## Join Numerical and Categorical Data

In [46]:
# take the numerical variables that no longer have missing values and the categorical variables that are now dummies
# join those columns back together as X_train_concat

X_train_concat = pd.concat([X_train_numerical, X_train_cat_dummies], axis=1)

In [47]:
X_train_concat.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1881,76.0,0.0,272.7,97.0,46.36,236.4,95.0,20.09,235.5,105.0,10.6,7.7,2.0,2.08,0.0,1,0,1,0
48,119.0,0.0,159.1,114.0,27.05,231.3,117.0,19.66,143.2,91.0,6.44,8.8,3.0,2.38,5.0,1,0,1,0
2886,85.0,0.0,144.4,88.0,24.55,264.6,105.0,22.49,185.4,94.0,8.34,9.9,3.0,2.67,1.0,1,0,1,0
2294,59.0,0.0,189.7,100.0,32.25,115.9,133.0,9.85,220.6,115.0,9.93,7.4,4.0,2.0,0.0,1,0,1,0
314,128.0,0.0,125.2,99.0,21.28,205.4,107.0,17.46,254.4,111.0,11.45,18.9,2.0,5.1,0.0,1,0,1,0


In [48]:
# Check NaN values

X_train_concat.isnull().any()

Account Length    False
VMail Message     False
Day Mins          False
Day Calls         False
Day Charge        False
Eve Mins          False
Eve Calls         False
Eve Charge        False
Night Mins        False
Night Calls       False
Night Charge      False
Intl Mins         False
Intl Calls        False
Intl Charge       False
CustServ Calls    False
Int'l Plan_no     False
Int'l Plan_yes    False
VMail Plan_no     False
VMail Plan_yes    False
dtype: bool

## Standardizing Variables

- PURPOSE: put all input variables on the same scale
- fit: lets the scaler learn the mean and standard deviation of each column
- transform: replaces the data with the standardized values
- the output of transform is a NumPy array, so it is wrapped in a pd DataFrame
- the fitted scaler `normalize` is returned because it will be reused on the test data
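
A small sketch of these points, using made-up numbers: the scaler is fitted on the training frame only, and the same fitted object then transforms the test frame, so test values are expressed in training-set units.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column frames with non-default indexes.
train = pd.DataFrame({"Day Mins": [100.0, 200.0, 300.0]}, index=[5, 6, 7])
test = pd.DataFrame({"Day Mins": [200.0, 400.0]}, index=[8, 9])

scaler = StandardScaler()
scaler.fit(train)  # learns the training mean (200) and population standard deviation

# transform returns a NumPy array by default, so labels are restored by hand.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and unit variance; the test column generally does not, which is expected, since it is scaled with the training statistics.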

In [49]:
from sklearn.preprocessing import StandardScaler

def standardizer(data):
    data_columns = data.columns
    data_index = data.index
    normalize = StandardScaler()
    normalize.fit(data)
    
    normalize_x = pd.DataFrame(normalize.transform(data))
    normalize_x.columns = data_columns
    normalize_x.index = data_index
    
    return normalize_x, normalize

X_train_clean, normalize = standardizer(X_train_concat)

In [50]:
X_train_clean.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1881,-0.607469,-0.58996,1.714374,-0.177334,1.714428,0.704097,-0.248097,0.703106,0.681734,0.25096,0.682777,-0.919537,-0.999315,-0.918985,-1.178748,0.32451,-0.32451,0.616374,-0.616374
48,0.476222,-0.58996,-0.380475,0.674057,-0.380209,0.60281,0.856959,0.602636,-1.141555,-0.470287,-1.14331,-0.523387,-0.598446,-0.518813,2.591097,0.32451,-0.32451,0.616374,-0.616374
2886,-0.38065,-0.58996,-0.651551,-0.62807,-0.651394,1.264154,0.254201,1.263865,-0.307939,-0.315734,-0.30928,-0.127237,-0.598446,-0.13198,-0.424779,0.32451,-0.32451,0.616374,-0.616374
2294,-1.035905,-0.58996,0.183807,-0.027089,0.183857,-1.689055,1.660636,-1.689469,0.387401,0.766136,0.388672,-1.027578,-0.197577,-1.025698,-1.178748,0.32451,-0.32451,0.616374,-0.616374
314,0.703041,-0.58996,-1.00561,-0.07717,-1.006105,0.088431,0.354661,0.088606,1.055084,0.560066,1.055896,3.113992,-0.999315,3.109416,-1.178748,0.32451,-0.32451,0.616374,-0.616374


## 3. Training Machine Learning

* Choose Score to optimize and Hyperparameter Space
* Cross-Validation: Random vs Grid Search CV (Optional for Beginner Class)
* We have to beat the benchmark

### Benchmark:

In [51]:
# the benchmark is the proportion of the largest target class

y.value_counts(normalize=True)

False.    0.855086
True.     0.144914
Name: Churn?, dtype: float64

In [52]:
benchmark = y.value_counts(normalize=True).iloc[0]  # value_counts sorts descending, so the first entry is the largest class

benchmark

0.85508550855085508
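
The benchmark is the accuracy of a model that always predicts the majority class. One way to check this, sketched here on hypothetical labels, is scikit-learn's `DummyClassifier` with `strategy="most_frequent"`, which is not used elsewhere in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical target with an 85.5% / 14.5% class mix.
y_demo = pd.Series(["False."] * 855 + ["True."] * 145, name="Churn?")
X_demo = np.zeros((len(y_demo), 1))  # the features are ignored by this baseline

majority_share = y_demo.value_counts(normalize=True).max()

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(majority_share)
print(baseline_accuracy)
```

Any real classifier should score above this value; otherwise it is no better than ignoring the inputs.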

In [53]:
# Import classifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [54]:
# K-Nearest Neighbors Classifier: fitting

knn = KNeighborsClassifier()
knn.fit(X_train_clean, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [55]:
# Logistic Regression : fitting

logreg = LogisticRegression(random_state=123)
logreg.fit(X_train_clean, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [56]:
# Random Forest : fitting

rf = RandomForestClassifier(random_state=123)
rf.fit(X_train_clean, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=123,
            verbose=0, warm_start=False)

In [57]:
# Predict

knn_train_predicted = knn.predict(X_train_clean)
logreg_train_predicted = logreg.predict(X_train_clean)
rf_train_predicted = rf.predict(X_train_clean)

In [58]:
from sklearn.metrics import confusion_matrix

In [59]:
knn_train_matrix = pd.DataFrame(confusion_matrix(y_train, knn_train_predicted))
knn_train_matrix

Unnamed: 0,0,1
0,2264,14
1,186,202


In [60]:
logreg_train_matrix = pd.DataFrame(confusion_matrix(y_train, logreg_train_predicted))
logreg_train_matrix

Unnamed: 0,0,1
0,2217,61
1,308,80


In [61]:
rf_train_matrix = pd.DataFrame(confusion_matrix(y_train, rf_train_predicted))
rf_train_matrix

Unnamed: 0,0,1
0,2278,0
1,15,373


In [62]:
# Check model performance on the training data

print("Score train knn :", knn.score(X_train_clean, y_train))
print("Score train logreg :", logreg.score(X_train_clean, y_train))
print("Score train rf :", rf.score(X_train_clean, y_train))
print("Benchmark :", benchmark)

Score train knn : 0.924981245311
Score train logreg : 0.861590397599
Score train rf : 0.994373593398
Benchmark : 0.855085508551
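
The score printed by `.score()` is accuracy, which can be recomputed from the confusion matrix. The sketch below uses the KNN training matrix printed earlier and assumes scikit-learn's convention (rows are actual classes, columns are predicted classes, labels in sorted order, so `False.` comes before `True.`). Recall and precision are added as an illustration; the notebook itself reports only accuracy.

```python
import numpy as np

# KNN training confusion matrix from the notebook output.
matrix = np.array([[2264, 14],
                   [186, 202]])

tn, fp, fn, tp = matrix.ravel()

accuracy = (tp + tn) / matrix.sum()
recall = tp / (tp + fn)       # share of actual churners that were found
precision = tp / (tp + fp)    # share of predicted churners that really churned

print(accuracy)
print(recall)
print(precision)
```

With an imbalanced target, accuracy close to the benchmark can hide a low recall on the churn class, so these two extra numbers are often worth a look.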


## Export necessary object

Several objects will be needed again to process future data. They can be saved with the joblib package.

In [63]:
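# import the joblib package to save and load Python objects

try:
    import joblib  # standalone package
except ImportError:
    from sklearn.externals import joblib  # legacy location, removed in scikit-learn 0.23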

In [64]:
# dump imput_numerical

joblib.dump(imput_numerical, 'imput_numerical.pkl')

['imput_numerical.pkl']

In [65]:
# dump standardizer

joblib.dump(normalize, 'standardizer.pkl')

['standardizer.pkl']

In [66]:
# dump dummy_columns

joblib.dump(dummies_columns, 'dummies_columns.pkl')

['dummies_columns.pkl']

In [67]:
# dump machine learning model

joblib.dump(knn, 'knn.pkl')
joblib.dump(logreg, 'logreg.pkl')
joblib.dump(rf, 'rf.pkl')

['rf.pkl']
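
A quick round trip shows that a dumped object comes back with its fitted state intact. This sketch assumes the standalone `joblib` package and writes to a temporary directory, so the file name is only illustrative:

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import StandardScaler

# Fit a small scaler so there is learned state to save.
scaler = StandardScaler().fit([[1.0], [2.0], [3.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "standardizer.pkl")
    written = joblib.dump(scaler, path)   # returns the list of files written
    restored = joblib.load(path)

print(restored.mean_)
print(restored.transform([[2.0]]))
```

Pickled estimators are generally only safe to load with the same scikit-learn version that saved them.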

## 4. Test Prediction

### Preprocessing Test Data

Write a function to preprocess the test data. The preprocessing consists of imputation (numeric and categorical) and standardizing.
 

Name the function `extract_test`. It takes 6 arguments:
 1. `data` : Data to process
 2. `numerical_columns`   : Names of the numerical columns
 3. `categorical_columns` : Names of the categorical columns
 4. `dummies_columns`     : Names of the dummy columns
 5. `imput_numericals`    : Imputer for the numerical data (fitted while preprocessing the training data)
 6. `standardizer`        : Standardizer (fitted while preprocessing the training data)
 
 
Then assign the result of the function to a variable named `X_test_clean`.



In [68]:
# function definition

def extract_test(data, numerical_columns, categorical_columns, dummies_columns, imput_numericals, standardizer):
        
    numerical_data = data[numerical_columns]
    categorical_data = data[categorical_columns]
    
    # Impute with the medians learned on the training data (transform only, no fit)
    numerical_data = pd.DataFrame(imput_numericals.transform(numerical_data))
    numerical_data.columns = numerical_columns
    numerical_data.index = data.index
    
    categorical_data = categorical_data.fillna(value="KOSONG")
    categorical_data.index = data.index
    categorical_data = pd.get_dummies(categorical_data) 
    # Align with the training dummy columns; reindex returns a new frame, so assign it back
    categorical_data = categorical_data.reindex(columns=dummies_columns, fill_value=0)
    x_test = pd.concat([ numerical_data, categorical_data], axis = 1)
    # Scale with the mean and standard deviation learned on the training data
    x_test_transform = pd.DataFrame(standardizer.transform(x_test))
    x_test_transform.columns = x_test.columns
    # Note: the row index is not restored here, so the result is indexed 0..n-1
    
    return x_test_transform

In [69]:
# load necessary object

dummies_columns = joblib.load('dummies_columns.pkl')
imput_numerical = joblib.load('imput_numerical.pkl')
standardizer = joblib.load('standardizer.pkl')

In [70]:
# preprocess test data

X_test_clean = extract_test(data = X_test, numerical_columns = numerical_columns, 
                            categorical_columns = cat_columns, dummies_columns = dummies_columns, 
                            imput_numericals = imput_numerical, standardizer = standardizer)

In [71]:
X_test_clean.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
0,-0.73348,-0.58996,-0.02826,0.623975,-0.028752,-0.658311,-0.298327,-0.659073,-0.373127,0.560066,-0.375124,1.205269,-0.598446,1.201928,1.083159,-3.081568,3.081568,0.616374,-0.616374
1,0.325009,-0.58996,-0.43764,2.026266,-0.4377,1.109245,-1.403383,1.109656,-0.507454,1.178277,-0.506813,0.593036,0.203291,0.58833,0.32919,0.32451,-0.32451,0.616374,-0.616374
2,-1.388735,1.530943,-0.81014,0.173239,-0.809766,0.428041,0.053282,0.427399,0.68371,-0.418769,0.682777,-0.84751,0.60416,-0.85229,0.32919,0.32451,-0.32451,-1.622391,1.622391
3,-2.270809,-0.58996,0.060255,0.123157,0.060197,0.275117,-1.152234,0.275526,0.106894,-1.397604,0.107735,-0.5594,0.60416,-0.55883,0.32919,0.32451,-0.32451,0.616374,-0.616374
4,-0.305044,-0.58996,-0.764038,0.774221,-0.764207,0.696153,1.911785,0.696096,-0.432389,0.508548,-0.432189,-0.235277,-0.197577,-0.238692,1.083159,0.32451,-0.32451,0.616374,-0.616374


### Predict Test Data

In [72]:
# load necessary object

knn = joblib.load('knn.pkl')
logreg = joblib.load('logreg.pkl')
rf = joblib.load('rf.pkl')

In [73]:
knn_test_predicted = knn.predict(X_test_clean)
logreg_test_predicted = logreg.predict(X_test_clean)
rf_test_predicted = rf.predict(X_test_clean)

In [74]:
knn_test_matrix = pd.DataFrame(confusion_matrix(y_test, knn_test_predicted))
knn_test_matrix

Unnamed: 0,0,1
0,563,9
1,63,32


In [75]:
logreg_test_matrix = pd.DataFrame(confusion_matrix(y_test, logreg_test_predicted))
logreg_test_matrix

Unnamed: 0,0,1
0,557,15
1,72,23


In [76]:
rf_test_matrix = pd.DataFrame(confusion_matrix(y_test, rf_test_predicted))
rf_test_matrix

Unnamed: 0,0,1
0,563,9
1,27,68


In [77]:
print("Score test knn :", knn.score(X_test_clean, y_test))
print("Score test logreg :", logreg.score(X_test_clean, y_test))
print("Score test rf :", rf.score(X_test_clean, y_test))
print("Benchmark :", benchmark)

Score test knn : 0.892053973013
Score test logreg : 0.869565217391
Score test rf : 0.946026986507
Benchmark : 0.855085508551


### RandomizedSearchCV

In [78]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [79]:
knn_param = {'n_neighbors':[1,2,3,4,5], 'p':[1,2,3]}

In [80]:
random_knn = RandomizedSearchCV(knn, param_distributions=knn_param, n_iter=3, cv=5, n_jobs=4, random_state=123)

In [81]:
random_knn.fit(X_train_clean, y_train)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
          fit_params={}, iid=True, n_iter=3, n_jobs=4,
          param_distributions={'n_neighbors': [1, 2, 3, 4, 5], 'p': [1, 2, 3]},
          pre_dispatch='2*n_jobs', random_state=123, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [82]:
random_knn.best_params_

{'n_neighbors': 3, 'p': 2}

In [83]:
random_knn.best_score_

0.88859714928732181

In [84]:
best_random_knn = KNeighborsClassifier(n_neighbors=random_knn.best_params_.get('n_neighbors'),
                                      p=random_knn.best_params_.get('p'))

In [85]:
best_random_knn.fit(X_train_clean, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [86]:
best_random_knn.score(X_test_clean, y_test)

0.8935532233883059

### GridSearchCV

In [87]:
grid_knn = GridSearchCV(knn, param_grid=knn_param)

In [88]:
grid_knn.fit(X_train_clean, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5], 'p': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [89]:
grid_knn.best_params_

{'n_neighbors': 5, 'p': 1}

In [90]:
grid_knn.best_score_

0.89272318079519875

In [91]:
best_grid_knn = KNeighborsClassifier(n_neighbors=grid_knn.best_params_.get('n_neighbors'),
                                    p=grid_knn.best_params_.get('p'))

In [92]:
best_grid_knn.fit(X_train_clean, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=1,
           weights='uniform')

In [93]:
best_grid_knn.score(X_test_clean, y_test)

0.89805097451274363

### RandomizedSearchCV vs GridSearchCV

In [94]:
print('RandomizedSearchCV KNN =', best_random_knn.score(X_test_clean, y_test))
print('GridSearchCV KNN =', best_grid_knn.score(X_test_clean, y_test))

RandomizedSearchCV KNN = 0.893553223388
GridSearchCV KNN = 0.898050974513
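
The difference between the two searches is how many parameter combinations they try. A short sketch with scikit-learn's `ParameterGrid` and `ParameterSampler` helpers (not used in the notebook itself) lists the candidates for the same `knn_param` space:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

knn_param = {"n_neighbors": [1, 2, 3, 4, 5], "p": [1, 2, 3]}

# Grid search evaluates every combination: 5 values x 3 values.
grid_candidates = list(ParameterGrid(knn_param))

# Randomized search evaluates only n_iter combinations drawn from the same space.
random_candidates = list(ParameterSampler(knn_param, n_iter=3, random_state=123))

print(len(grid_candidates))
print(len(random_candidates))
print(random_candidates)
```

Note also that the two searches above were run with different cross-validation settings (`cv=5` for the randomized search, the library default for the grid search), so their `best_score_` values are not strictly comparable. A fitted search object also exposes the refitted winner as `best_estimator_`, which avoids rebuilding the classifier by hand.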
