# Python Machine Learning Assignment with PACMANN AI

## Machine Learning Process Flowchart:
### 1. Importing Data to Python:
* Drop Duplicates
### 2. Data Preprocessing:
* Input-Output Split, Train-Test Split
* Imputation, Processing Categorical, Normalization
### 3. Training Machine Learning:
* Choose Score to optimize and Hyperparameter Space
### 4. Test Prediction:
* Evaluate model performance on Test Data

## 1. Importing Data to Python

In [1]:
# Import libraries

import numpy as np # Import NumPy as np
import pandas as pd # Then import pandas as pd


## Dataset Information
 
 
### House Prices: Advanced Regression Techniques

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 31 explanatory variables describing aspects of residential homes in Ames, Iowa, this exercise challenges you to predict the final price of each home.

See the data description here: https://drive.google.com/open?id=1tFd-1tD3Z13XJvz-JcMjHg6pg2-H8T3Wi-4fDIaZta0

In [2]:
# Read the dataset

data = pd.read_csv('../datasets/train.csv')

In [3]:
# Check the first 5 observations of the dataset

data.head()

Unnamed: 0.1,Unnamed: 0,LotFrontage,LotArea,Utilities,MasVnrType,MasVnrArea,HouseStyle,Heating,BsmtQual,BsmtCond,...,GarageQual,GarageCond,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,SaleType,SaleCondition,SalePrice
0,0,65.0,8450,AllPub,BrkFace,196.0,2Story,GasA,Gd,TA,...,TA,TA,0,61,0,0,0,WD,Normal,208500
1,1,80.0,9600,AllPub,,0.0,1Story,GasA,Gd,TA,...,TA,TA,298,0,0,0,0,WD,Normal,181500
2,2,68.0,11250,AllPub,BrkFace,162.0,2Story,GasA,Gd,TA,...,TA,TA,0,42,0,0,0,WD,Normal,223500
3,3,60.0,9550,AllPub,,0.0,2Story,GasA,TA,Gd,...,TA,TA,0,35,272,0,0,WD,Abnorml,140000
4,4,84.0,14260,AllPub,BrkFace,350.0,2Story,GasA,Gd,TA,...,TA,TA,192,84,0,0,0,WD,Normal,250000


## Dropping Duplicates

In [4]:
# Check the shape of the data before dropping duplicates

data.shape

(1460, 33)

In [5]:
# Check whether there are any duplicate observations

data.duplicated().sum()

0

In [6]:
# Drop the duplicate rows

data = data.drop_duplicates()

In [7]:
# Check the shape again

data.shape

(1460, 33)

In [8]:
# Check the data again

data.head()

Unnamed: 0.1,Unnamed: 0,LotFrontage,LotArea,Utilities,MasVnrType,MasVnrArea,HouseStyle,Heating,BsmtQual,BsmtCond,...,GarageQual,GarageCond,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,SaleType,SaleCondition,SalePrice
0,0,65.0,8450,AllPub,BrkFace,196.0,2Story,GasA,Gd,TA,...,TA,TA,0,61,0,0,0,WD,Normal,208500
1,1,80.0,9600,AllPub,,0.0,1Story,GasA,Gd,TA,...,TA,TA,298,0,0,0,0,WD,Normal,181500
2,2,68.0,11250,AllPub,BrkFace,162.0,2Story,GasA,Gd,TA,...,TA,TA,0,42,0,0,0,WD,Normal,223500
3,3,60.0,9550,AllPub,,0.0,2Story,GasA,TA,Gd,...,TA,TA,0,35,272,0,0,WD,Abnorml,140000
4,4,84.0,14260,AllPub,BrkFace,350.0,2Story,GasA,Gd,TA,...,TA,TA,192,84,0,0,0,WD,Normal,250000


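### Make function to import and drop

Create a function that does the following:

 1. import the data
 2. check the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS
 3. drop duplicates
 4. drop unnecessary columns
 5. check the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS after dropping
 6. return the data after dropping

Name the function `import_data`. It takes 2 arguments:

 1. `filename`: Path where the data is stored
 2. `drop`    : Names of the columns to remove

Then assign the result of the function to a variable named `data`.

Note that the implementation below drops the columns first and the duplicates second. The order matters here: the `Unnamed: 0` column is unique for every row, so a duplicate row only becomes visible once that column has been removed.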

In [9]:
# Define the function

def import_data(filename, drop):
    data = pd.read_csv(filename)
    # Report the original shape ("Data asli" = original data)
    print("Data asli : %d Observasi, %d Kolom." %data.shape, '\n')
    # Report the columns to drop, then drop them
    print("Kolom yang di drop :", drop, '\n')
    data_drop = data.drop(drop, axis=1)
    # Count and drop duplicate rows (after the columns are removed)
    print("Banyaknya data duplicate :", data_drop.duplicated().sum(), '\n')
    data_unique = data_drop.drop_duplicates()
    # Report the final shape
    print("Data setelah di drop : %d Observasi, %d Kolom." %data_unique.shape)

    return data_unique

In [10]:
data.columns

Index(['Unnamed: 0', 'LotFrontage', 'LotArea', 'Utilities', 'MasVnrType',
       'MasVnrArea', 'HouseStyle', 'Heating', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'Electrical', '1stFlrSF', '2ndFlrSF', 'Fireplaces',
       'GarageType', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'SaleType', 'SaleCondition', 'SalePrice'],
      dtype='object')

In [11]:
# Assign the result of the function to the variable data

drop = ['Unnamed: 0', 'Heating', 'Utilities', 
        'BsmtCond', 'BsmtFinType2', 'Electrical', 
        'GarageQual', 'GarageCond', 'BsmtFinSF2', 
        'EnclosedPorch', '3SsnPorch', 'ScreenPorch']

data = import_data(filename = '../dataset/train.csv', drop = drop)

Data asli : 1460 Observasi, 33 Kolom. 

Kolom yang di drop : ['Unnamed: 0', 'Heating', 'Utilities', 'BsmtCond', 'BsmtFinType2', 'Electrical', 'GarageQual', 'GarageCond', 'BsmtFinSF2', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch'] 

Banyaknya data duplicate : 1 

Data setelah di drop : 1459 Observasi, 21 Kolom.


In [12]:
data.columns

Index(['LotFrontage', 'LotArea', 'MasVnrType', 'MasVnrArea', 'HouseStyle',
       'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'Fireplaces', 'GarageType', 'GarageFinish', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'SaleType', 'SaleCondition',
       'SalePrice'],
      dtype='object')

## 2. Data Preprocessing
### Input-Output Split

Here we separate the columns into input and output.

The data used as input is named `X` (the code below uses the lowercase variable `x`), and the output is named `y`.

In this dataset, we only need the `SalePrice` column as our output.

In [13]:
# Check the data using head()

data.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrType,MasVnrArea,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,BsmtFinSF1,1stFlrSF,...,Fireplaces,GarageType,GarageFinish,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,SaleType,SaleCondition,SalePrice
0,65.0,8450,BrkFace,196.0,2Story,Gd,No,GLQ,706,856,...,0,Attchd,RFn,2,548,0,61,WD,Normal,208500
1,80.0,9600,,0.0,1Story,Gd,Gd,ALQ,978,1262,...,1,Attchd,RFn,2,460,298,0,WD,Normal,181500
2,68.0,11250,BrkFace,162.0,2Story,Gd,Mn,GLQ,486,920,...,1,Attchd,RFn,2,608,0,42,WD,Normal,223500
3,60.0,9550,,0.0,2Story,TA,No,ALQ,216,961,...,1,Detchd,Unf,3,642,0,35,WD,Abnorml,140000
4,84.0,14260,BrkFace,350.0,2Story,Gd,Av,GLQ,655,1145,...,1,Attchd,RFn,3,836,192,84,WD,Normal,250000


### Make function for input and output

Create a function that produces the following:

1. data_input
2. data_output
3. return data_input and data_output
* The purpose of writing a function is so that it can be reused in other cases.

Name the function `extract_input_output`. It takes 2 arguments:

1. `data`        : Dataset to split
2. `column_name` : Name of the column to use as the output


In [14]:
# Define the function here

def extract_input_output(data, column_name):
    data_input = data.drop(column_name, axis=1)
    data_output = data[column_name]
    
    return data_input, data_output

# Assign the result of the function to x, y.
# x: input data
# y: output data

x, y = extract_input_output(data = data, column_name = 'SalePrice')

In [15]:
x.columns

Index(['LotFrontage', 'LotArea', 'MasVnrType', 'MasVnrArea', 'HouseStyle',
       'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'Fireplaces', 'GarageType', 'GarageFinish', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'SaleType', 'SaleCondition'],
      dtype='object')

## Train and Test Split

In this part, X and y are split into 2 sets: training and test. We will use the `train_test_split` function from the scikit-learn library.

In [16]:
# Import the train_test_split function from scikit-learn

from sklearn.model_selection import train_test_split

#### Train Test Split Function
1. x is the input
2. y is the output
3. test size = the size of the test set, e.g. 0.20 for 20% of the data
4. random state is the seed for the random split; use the same value every time to get a reproducible split, e.g. random_state = 123
5. Output:
    * x_train = input of the training data
    * x_test = input of the test data
    * y_train = output of the training data
    * y_test = output of the test data
6. the order of x_train, x_test, y_train and y_test must not be swapped
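
The points above can be tried out on a small toy dataset. The sketch below is only an illustration: the frame `x_demo`, the series `y_demo` and their values are made up for this example and are not part of the house price data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 input columns, 1 output column.
x_demo = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
y_demo = pd.Series(np.arange(10) * 3, name="target")

# The return order is always: x_train, x_test, y_train, y_test.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=123
)

print(x_tr.shape, x_te.shape)
# Input and output rows stay aligned because the original index is preserved.
print((x_tr.index == y_tr.index).all())
```

With `test_size=0.20`, 2 of the 10 rows go to the test set and 8 to the training set. Running the split again with the same `random_state` should give the same rows.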

In [17]:
# Split dataset

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 123)

In [18]:
# Check the shape of each set (x_train, x_test, y_train, y_test)

print("Data input training :", x_train.shape)
print("Data input test :", x_test.shape)
print("Data output training :", y_train.shape)
print("Data output test :", y_test.shape)

Data input training : (1167, 20)
Data input test : (292, 20)
Data output training : (1167,)
Data output test : (292,)


In [19]:
x_train.columns

Index(['LotFrontage', 'LotArea', 'MasVnrType', 'MasVnrArea', 'HouseStyle',
       'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'Fireplaces', 'GarageType', 'GarageFinish', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'SaleType', 'SaleCondition'],
      dtype='object')

## Separating Numerical and Categorical Data Manually

## Getting Numerical

In [20]:
# get numeric using ._get_numeric_data()

x_train_num = x_train._get_numeric_data()
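
`_get_numeric_data()` starts with an underscore, which by Python convention marks it as a private pandas method. A public alternative is `DataFrame.select_dtypes`. The sketch below shows it on a small made-up frame (`demo` and its values are hypothetical), not on the notebook's data:

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column.
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    "HouseStyle": ["2Story", "1Story", "2Story"],
})

# Keep only numeric columns (ints and floats).
numeric_part = demo.select_dtypes(include="number")
# Keep everything that is not numeric.
categorical_part = demo.select_dtypes(exclude="number")

print(list(numeric_part.columns))
print(list(categorical_part.columns))
```

As in the notebook, numerically coded categories such as `Fireplaces` and `GarageCars` would still need to be moved to the categorical side by hand.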

In [21]:
# check the columns

x_train_num.columns

Index(['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF'],
      dtype='object')

In [22]:
x_train_num.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,Fireplaces,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF
319,,14115,225.0,1036,1472,0,2,2,588,233,48
581,98.0,12704,306.0,0,2042,0,1,3,1390,0,90
962,24.0,2308,0.0,556,804,744,1,2,440,48,0
78,72.0,10778,0.0,0,1768,0,0,0,0,0,0
5,85.0,14115,0.0,732,796,566,0,2,480,40,30


In [23]:
# drop unexpected numerical column if any

num_categorical = ['Fireplaces', 'GarageCars']
x_train_num = x_train_num.drop(num_categorical, axis = 1)
numeric = x_train_num.columns

In [24]:
numeric

Index(['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF'],
      dtype='object')

In [25]:
x_train_num.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF
319,,14115,225.0,1036,1472,0,588,233,48
581,98.0,12704,306.0,0,2042,0,1390,0,90
962,24.0,2308,0.0,556,804,744,440,48,0
78,72.0,10778,0.0,0,1768,0,0,0,0
5,85.0,14115,0.0,732,796,566,480,40,30


## Getting Categorical


In [26]:
# Get Categorical

output = ['SalePrice']
categories = [x for x in data.columns if x not in x_train_num.columns and x not in output]

x_train_cat = x_train[categories] # define the categorical columns from x_train dataset

In [27]:
x_train_cat.columns

Index(['MasVnrType', 'HouseStyle', 'BsmtQual', 'BsmtExposure', 'BsmtFinType1',
       'Fireplaces', 'GarageType', 'GarageFinish', 'GarageCars', 'SaleType',
       'SaleCondition'],
      dtype='object')

In [28]:
# check the top observations!

x_train_cat.head()

Unnamed: 0,MasVnrType,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,Fireplaces,GarageType,GarageFinish,GarageCars,SaleType,SaleCondition
319,BrkFace,SLvl,Gd,Av,GLQ,2,Attchd,Unf,2,WD,Normal
581,BrkFace,1Story,Ex,No,Unf,1,Attchd,RFn,3,New,Partial
962,,2Story,Gd,No,ALQ,1,Detchd,Unf,2,WD,Normal
78,,1Story,TA,No,Unf,0,,,0,WD,Normal
5,,1.5Fin,Gd,No,GLQ,0,Attchd,Unf,2,WD,Normal


### Make a function for Separating Numerical and Categorical

In [29]:
# Def a function that returns x_train numerical and x_train categorical

def categorical_numerical_separation(data_all, data_input, num_categorical, output):
    numerical = data_input._get_numeric_data()
    numerical_data = numerical.drop(num_categorical, axis=1)
    
    categories = [x for x in data_all.columns if x not in numerical_data.columns and x not in output]
    categorical_data = data_input[categories]
    
    return numerical_data, categorical_data

x_train_num, x_train_cat = categorical_numerical_separation(data_all = data, data_input = x_train,
                                                           num_categorical = ['Fireplaces', 'GarageCars'],
                                                           output = ['SalePrice'])

In [30]:
# check the top of the x_train numerical observations!

x_train_num.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF
319,,14115,225.0,1036,1472,0,588,233,48
581,98.0,12704,306.0,0,2042,0,1390,0,90
962,24.0,2308,0.0,556,804,744,440,48,0
78,72.0,10778,0.0,0,1768,0,0,0,0
5,85.0,14115,0.0,732,796,566,480,40,30


In [31]:
# check the top of the x_train categorical observations!

x_train_cat.head()

Unnamed: 0,MasVnrType,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,Fireplaces,GarageType,GarageFinish,GarageCars,SaleType,SaleCondition
319,BrkFace,SLvl,Gd,Av,GLQ,2,Attchd,Unf,2,WD,Normal
581,BrkFace,1Story,Ex,No,Unf,1,Attchd,RFn,3,New,Partial
962,,2Story,Gd,No,ALQ,1,Detchd,Unf,2,WD,Normal
78,,1Story,TA,No,Unf,0,,,0,WD,Normal
5,,1.5Fin,Gd,No,GLQ,0,Attchd,Unf,2,WD,Normal


## Data Imputation

Data imputation is the process of filling in missing values, which usually appear as NaN.

The process is split into 2 parts:
* Numerical Imputation
* Categorical Imputation

In [32]:
# Check for missing values in the training set input

x_train.isnull().sum()

LotFrontage      218
LotArea            0
MasVnrType         5
MasVnrArea         5
HouseStyle         0
BsmtQual          30
BsmtExposure      31
BsmtFinType1      30
BsmtFinSF1         0
1stFlrSF           0
2ndFlrSF           0
Fireplaces         0
GarageType        60
GarageFinish      60
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
SaleType           0
SaleCondition      0
dtype: int64

## Numerical Data Imputation

### Checking for NaN (Not A Number)

In [33]:
# check the missing value of the x_train_num

x_train_num.isnull().sum()

LotFrontage    218
LotArea          0
MasVnrArea       5
BsmtFinSF1       0
1stFlrSF         0
2ndFlrSF         0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
dtype: int64

In [34]:
# Import library for imputation
# (Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer)

from sklearn.preprocessing import Imputer

In [35]:
# Instantiate Imputer as imput; do not forget the parentheses ()
# missing_values is the marker for missing values in the data: it can be NaN, 9999, or "KOSONG"
# strategy='median' is the imputation strategy: missing values are replaced with the column median
# the strategy can be changed to 'mean' (the average)
# see median: https://en.wikipedia.org/wiki/Median

imput = Imputer(missing_values='NaN', strategy='median')

* fit: lets the imputer learn the mean or median of each column
* transform: fills the missing values with that median or mean
* transform returns a NumPy array, which we wrap in a pandas DataFrame
* name the columns of x_train_num_imputed after those of x_train_num.
     - WHY? Because the column names are lost after imputation
* set the index of x_train_num_imputed to that of x_train_num.
     - WHY? Because the index is lost after imputation
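
The notebook was run with the old `Imputer` class. On recent scikit-learn versions, the same fit/transform idea can be sketched with `SimpleImputer`. This is an assumed equivalent rather than the notebook's own code; `train_demo`, `imputer_demo` and the values are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data with one missing LotFrontage value.
train_demo = pd.DataFrame(
    {"LotFrontage": [60.0, np.nan, 80.0, 70.0],
     "LotArea": [8000.0, 9000.0, 10000.0, 11000.0]},
    index=[319, 581, 962, 78],
)

# fit: learn the median of every column from the training data.
imputer_demo = SimpleImputer(missing_values=np.nan, strategy="median")
imputer_demo.fit(train_demo)

# transform returns a NumPy array, so column names and index are restored by hand.
imputed = pd.DataFrame(
    imputer_demo.transform(train_demo),
    columns=train_demo.columns,
    index=train_demo.index,
)
print(imputed)
```

The missing `LotFrontage` in row 581 is filled with 70.0, the median of the observed values 60, 80 and 70.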

In [36]:
# These are the steps that will later be wrapped in a function
# The imputer must be fitted to the data first

imput.fit(x_train_num)
x_train_num_imputed = pd.DataFrame(imput.transform(x_train_num))
x_train_num_imputed.columns = x_train_num.columns
x_train_num_imputed.index =  x_train_num.index

In [37]:
# Check x_train_num_imputed

x_train_num_imputed.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF
319,70.0,14115.0,225.0,1036.0,1472.0,0.0,588.0,233.0,48.0
581,98.0,12704.0,306.0,0.0,2042.0,0.0,1390.0,0.0,90.0
962,24.0,2308.0,0.0,556.0,804.0,744.0,440.0,48.0,0.0
78,72.0,10778.0,0.0,0.0,1768.0,0.0,0.0,0.0,0.0
5,85.0,14115.0,0.0,732.0,796.0,566.0,480.0,40.0,30.0


In [38]:
# Check the imputer result again: are there any missing values left?

x_train_num_imputed.isnull().any()

LotFrontage    False
LotArea        False
MasVnrArea     False
BsmtFinSF1     False
1stFlrSF       False
2ndFlrSF       False
GarageArea     False
WoodDeckSF     False
OpenPorchSF    False
dtype: bool

## Categorical Imputation

In [39]:
# check missing values in x_train_cat

x_train_cat.isnull().sum()

MasVnrType        5
HouseStyle        0
BsmtQual         30
BsmtExposure     31
BsmtFinType1     30
Fireplaces        0
GarageType       60
GarageFinish     60
GarageCars        0
SaleType          0
SaleCondition     0
dtype: int64

In [40]:
# Replace missing values with a new category, "KOSONG" (Indonesian for "empty")

x_train_cat_imputed = x_train_cat.fillna(value='KOSONG')

In [41]:
# Check the missing values again

x_train_cat_imputed.isnull().any()

MasVnrType       False
HouseStyle       False
BsmtQual         False
BsmtExposure     False
BsmtFinType1     False
Fireplaces       False
GarageType       False
GarageFinish     False
GarageCars       False
SaleType         False
SaleCondition    False
dtype: bool

## Make a Function
* Make a function for numerical imputation

In [42]:
# function definition

def fit_imput_num(data, missing_values, strategy):
    imput = Imputer(missing_values, strategy)
    imput.fit(data)
    
    return data, imput

def transf_imput_num(data, imputer):
    data_imputed = pd.DataFrame(imputer.transform(data))
    data_imputed.columns = data.columns
    data_imputed.index =  data.index    
    
    return data_imputed

In [43]:
# return imputed data and imputer

x_train_num , imput = fit_imput_num(data=x_train_num, missing_values='NaN', strategy='median')
x_train_num_imputed = transf_imput_num(data=x_train_num, imputer=imput)

In [44]:
# check imputed data

x_train_num_imputed.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF
319,70.0,14115.0,225.0,1036.0,1472.0,0.0,588.0,233.0,48.0
581,98.0,12704.0,306.0,0.0,2042.0,0.0,1390.0,0.0,90.0
962,24.0,2308.0,0.0,556.0,804.0,744.0,440.0,48.0,0.0
78,72.0,10778.0,0.0,0.0,1768.0,0.0,0.0,0.0,0.0
5,85.0,14115.0,0.0,732.0,796.0,566.0,480.0,40.0,30.0


In [45]:
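# Dump the imputer
# (sklearn.externals.joblib was removed in scikit-learn 0.23; on newer versions use `import joblib`)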

from sklearn.externals import joblib

joblib.dump(imput, 'imput.pkl')

['imput.pkl']

 * Make a function for categorical imputation

In [46]:
# function definition

def cat_imput(data):
    data_imputed = data.fillna(value='KOSONG')
    for i in data_imputed.columns:
        data_imputed[i] = data_imputed[i].astype('category')
    
    return data_imputed

In [47]:
# return imputed data

x_train_cat_imputed = cat_imput(x_train_cat)

In [48]:
x_train_cat_imputed.head()

Unnamed: 0,MasVnrType,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,Fireplaces,GarageType,GarageFinish,GarageCars,SaleType,SaleCondition
319,BrkFace,SLvl,Gd,Av,GLQ,2,Attchd,Unf,2,WD,Normal
581,BrkFace,1Story,Ex,No,Unf,1,Attchd,RFn,3,New,Partial
962,,2Story,Gd,No,ALQ,1,Detchd,Unf,2,WD,Normal
78,,1Story,TA,No,Unf,0,KOSONG,KOSONG,0,WD,Normal
5,,1.5Fin,Gd,No,GLQ,0,Attchd,Unf,2,WD,Normal


## Preprocessing Categorical Variables

* create dummy variable for each of categorical variable

In [49]:
# create dummies

categorical_dummies = pd.get_dummies(x_train_cat_imputed)

In [50]:
# Check the top observations

categorical_dummies.head()

Unnamed: 0,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_KOSONG,MasVnrType_None,MasVnrType_Stone,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
319,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
581,0,1,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
962,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
78,0,0,0,1,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
5,0,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


### Make a function to get the dummies

In [51]:
# function definition

from sklearn.preprocessing import LabelEncoder, LabelBinarizer

def cat_imput_dummy(data):
    data_imputed = data.fillna("KOSONG")
    data_imputed_dummy = pd.DataFrame([])
    label_encoder = pd.Series([])
    label_binarizer = pd.Series([])
    
    for i in list(data_imputed):
        label_en = LabelEncoder()
        label_bin = LabelBinarizer()
        
        encoded = label_en.fit_transform(data_imputed[i])
        binary = label_bin.fit_transform(encoded)
        
        if binary.shape[1] == 1:
            dummy = pd.DataFrame(binary, columns = [i], index = data_imputed.index)
        else :
            dummy = pd.DataFrame(binary, columns = ["{}_{}".format(a,b) for b in sorted(data_imputed[i].unique()) for a in [i]], index = data_imputed.index)
    
        data_imputed_dummy = pd.concat([data_imputed_dummy, dummy], axis=1)
        label_encoder[i] = label_en
        label_binarizer[i] = label_bin
        
    return data_imputed_dummy, label_encoder, label_binarizer, data_imputed_dummy.columns

In [52]:
x_train_cat_imputed_dummy, label_encoder, label_binarizer, dummy_columns = cat_imput_dummy(x_train_cat)

In [53]:
# dump dummy_columns

joblib.dump(dummy_columns, 'dummy_columns.pkl')

['dummy_columns.pkl']

In [54]:
# check the top observations

x_train_cat_imputed_dummy.head()

Unnamed: 0,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_KOSONG,MasVnrType_None,MasVnrType_Stone,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
319,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
581,0,1,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
962,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
78,0,0,0,1,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
5,0,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


## Join Numerical and Categorical Data

In [55]:
# Take the numerical variables that no longer have missing values and the categorical variables that have been turned into dummies,
# then join those columns back together as x_train_concat

x_train_concat = pd.concat([x_train_num_imputed, x_train_cat_imputed_dummy], axis=1)

In [56]:
x_train_concat.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF,MasVnrType_BrkCmn,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
319,70.0,14115.0,225.0,1036.0,1472.0,0.0,588.0,233.0,48.0,0,...,0,0,0,1,0,0,0,0,1,0
581,98.0,12704.0,306.0,0.0,2042.0,0.0,1390.0,0.0,90.0,0,...,0,1,0,0,0,0,0,0,0,1
962,24.0,2308.0,0.0,556.0,804.0,744.0,440.0,48.0,0.0,0,...,0,0,0,1,0,0,0,0,1,0
78,72.0,10778.0,0.0,0.0,1768.0,0.0,0.0,0.0,0.0,0,...,0,0,0,1,0,0,0,0,1,0
5,85.0,14115.0,0.0,732.0,796.0,566.0,480.0,40.0,30.0,0,...,0,0,0,1,0,0,0,0,1,0


In [57]:
# Check NaN values

x_train_concat.isnull().any().any()

False

## Standardizing Variables

- PURPOSE: put the input variables on the same scale
- fit: lets the standardizer learn the mean and standard deviation of each column
- transform: replaces the data with the standardized values
- transform returns a NumPy array, which we wrap in a pandas DataFrame
- the standardizer is returned because it will be reused on the test data
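
A minimal sketch of the points above, on made-up data (`train_demo`, `test_demo` and `scaler_demo` are hypothetical names): the scaler is fitted on the training data only, and the same fitted object then transforms the test data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test inputs.
train_demo = pd.DataFrame({"area": [1000.0, 2000.0, 3000.0], "rooms": [2.0, 3.0, 4.0]})
test_demo = pd.DataFrame({"area": [2000.0, 4000.0], "rooms": [3.0, 5.0]})

# fit: learn the mean and standard deviation of each column from training data only.
scaler_demo = StandardScaler().fit(train_demo)

# transform: apply (value - mean) / std, then restore column names and index.
train_scaled = pd.DataFrame(
    scaler_demo.transform(train_demo), columns=train_demo.columns, index=train_demo.index
)
test_scaled = pd.DataFrame(
    scaler_demo.transform(test_demo), columns=test_demo.columns, index=test_demo.index
)

print(train_scaled.mean().round(6).tolist())  # each training column is centered on 0
print(test_scaled)
```

The test data is scaled with the training mean and standard deviation, so its columns are generally not centered on exactly 0. That is expected: the test set must not influence the preprocessing.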

In [58]:
#Import Standard Scaler

from sklearn.preprocessing import StandardScaler

In [59]:
# define function for standardizing data

def fit_standardize(data):
    standardizer = StandardScaler()
    standardizer.fit(data)
    
    return data, standardizer

def transf_standardize(data, standardizer):
    data_clean = pd.DataFrame(standardizer.transform(data))
    data_clean.columns = data.columns
    data_clean.index = data.index
    
    return data_clean

In [60]:
# return standardized data and standardizer

x_train_concat, standard = fit_standardize(x_train_concat)
x_train_clean = transf_standardize(data=x_train_concat, standardizer=standard)

In [61]:
# dump standardizer

joblib.dump(standard, 'standard.pkl')

['standard.pkl']

In [62]:
# check data

x_train_clean.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF,MasVnrType_BrkCmn,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
319,0.014139,0.337235,0.658344,1.367393,0.806606,-0.788293,0.538128,1.128709,0.010085,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
581,1.471619,0.203654,1.101312,-1.016738,2.321884,-0.788293,4.333777,-0.743517,0.639798,-0.106137,...,-0.050767,3.248762,-0.041434,-2.508735,-0.271288,-0.041434,-0.092968,-0.101929,-2.147225,3.197054
962,-2.380292,-0.780546,-0.572123,0.262776,-0.969194,0.903183,-0.162316,-0.357822,-0.709586,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
78,0.118245,0.021318,-0.572123,-1.016738,1.593488,-0.788293,-2.244717,-0.743517,-0.709586,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
5,0.794932,0.337235,-0.572123,0.667803,-0.990461,0.498502,0.026993,-0.422105,-0.259791,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788


## 3. Training Machine Learning

* We have to beat the benchmark
* Choose Score to optimize and Hyperparameter Space
* Cross-Validation: Random Search CV 


### Benchmark:

In [63]:
# For regression, the benchmark is the MSE of always predicting the mean

# Compute the mean with:
y_train.mean()

181531.51071122536

In [64]:
# Compute the MSE against the mean

square_error = 0
for i in y_train:
    square_error += (i-y_train.mean())**2
mse = square_error/len(y_train)
print(mse)

6388738000.26
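
The loop above can also be written as one vectorized expression. The sketch below uses a small made-up series (`y_demo` is hypothetical) and shows that this benchmark MSE is the same quantity as the population variance of the target:

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices.
y_demo = pd.Series([100000.0, 150000.0, 200000.0, 250000.0])

# MSE of always predicting the mean: average squared deviation from the mean.
baseline_mse = ((y_demo - y_demo.mean()) ** 2).mean()

print(baseline_mse)
# Same value as the variance with ddof=0 (divide by n, not n - 1).
print(np.isclose(baseline_mse, y_demo.var(ddof=0)))
```

Computing the mean once, instead of inside the loop on every iteration, also avoids repeated work on larger datasets.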


In [65]:
# Import regressors: LinearRegression, Lasso, and Ridge

from sklearn.linear_model import LinearRegression, Lasso, Ridge

linreg = LinearRegression()
lasso = Lasso()
ridge = Ridge()

In [66]:
# fitting model

linreg.fit(x_train_clean, y_train)
lasso.fit(x_train_clean, y_train)
ridge.fit(x_train_clean, y_train)



Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [67]:
linreg.score(x_train_clean, y_train)

0.83366069472216409

In [68]:
lasso.score(x_train_clean, y_train)

0.83366109822487855

In [69]:
ridge.score(x_train_clean, y_train)

0.83365932452897717

In [70]:
# Predict on x_train and store the results in the y_pred variables

y_pred_linreg = linreg.predict(x_train_clean)
y_pred_lasso = lasso.predict(x_train_clean)
y_pred_ridge = ridge.predict(x_train_clean)

In [71]:
# Evaluate model performance by computing the MSE

from sklearn.metrics import mean_squared_error

print(mean_squared_error(y_train, y_pred_linreg))
print(mean_squared_error(y_train, y_pred_lasso))
print(mean_squared_error(y_train, y_pred_ridge))

1062698240.57
1062695662.69
1062706994.37
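
For scikit-learn regressors, `.score` returns the coefficient of determination R², which ties the MSE values above to the benchmark: R² = 1 - MSE / (benchmark MSE). With the notebook's numbers for linear regression, 1 - 1062698240.57 / 6388738000.26 is approximately 0.8337, which matches `linreg.score`. The sketch below checks the same relation on synthetic data (`X_demo`, `y_demo` and the coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (hypothetical), seeded for reproducibility.
rng = np.random.RandomState(123)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model_demo = LinearRegression().fit(X_demo, y_demo)
y_hat = model_demo.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model_demo.score(X_demo, y_demo), r2_score(y_demo, y_hat))
```

All three printed values should agree up to floating point error.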


In [72]:
# Run RandomizedSearchCV to choose the alpha parameter for Lasso and Ridge

from sklearn.model_selection import RandomizedSearchCV

alpha_param = {'alpha':np.logspace(-5,2,8)}
randomCVLasso = RandomizedSearchCV(Lasso(random_state=123), param_distributions=alpha_param, n_iter=5, cv=5)
randomCVRidge = RandomizedSearchCV(Ridge(random_state=123), param_distributions=alpha_param, n_iter=5, cv=5)
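
Two details of the search above are worth keeping in mind. With `n_iter=5`, only 5 of the 8 alpha candidates are sampled, and without a `random_state` on the search itself the sampled candidates can change between runs. Also, when `scoring` is not given, the search optimizes the estimator's default score (R² for regressors). The sketch below is a variation on synthetic data, not the notebook's configuration: it fixes `random_state` and scores by negative MSE. All data and names in it are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical).
rng = np.random.RandomState(123)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Same candidate grid as in the notebook: 1e-5, 1e-4, ..., 1e2.
alpha_grid = {"alpha": np.logspace(-5, 2, 8)}

search_demo = RandomizedSearchCV(
    Ridge(),
    param_distributions=alpha_grid,
    n_iter=5,                          # sample 5 of the 8 candidates
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    random_state=123,                  # makes the sampled candidates reproducible
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(-search_demo.best_score_)  # mean cross-validated MSE of the best alpha
```

`best_score_` is the mean score over the validation folds, which is a fairer estimate than calling `.score` on the same data the model was trained on.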

In [73]:
# Fitting

randomCVLasso.fit(x_train_clean, y_train)
randomCVRidge.fit(x_train_clean, y_train)



RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=123, solver='auto', tol=0.001),
          fit_params={}, iid=True, n_iter=5, n_jobs=1,
          param_distributions={'alpha': array([  1.00000e-05,   1.00000e-04,   1.00000e-03,   1.00000e-02,
         1.00000e-01,   1.00000e+00,   1.00000e+01,   1.00000e+02])},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [74]:
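# Evaluate the models
# (.score on a regressor returns R^2, so "Best Accuracy" below is the R^2 on the training data)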

print("Lasso :")
print("Best Accuracy =", randomCVLasso.score(x_train_clean, y_train))
print("Best Param =", randomCVLasso.best_params_)
print("\nRidge :")
print("Best Accuracy =", randomCVRidge.score(x_train_clean, y_train))
print("Best Param =", randomCVRidge.best_params_)

Lasso :
Best Accuracy = 0.833655704099
Best Param = {'alpha': 10.0}

Ridge :
Best Accuracy = 0.833516021783
Best Param = {'alpha': 10.0}


In [75]:
# Build the models with the best parameters

best_lasso = Lasso(alpha=randomCVLasso.best_params_.get('alpha'))
best_ridge = Ridge(alpha=randomCVRidge.best_params_.get('alpha'))

In [76]:
best_lasso.fit(x_train_clean, y_train)
best_ridge.fit(x_train_clean, y_train)



Ridge(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [77]:
# dump model

joblib.dump(best_lasso, 'best_lasso.pkl')
joblib.dump(best_ridge, 'best_ridge.pkl')

['best_ridge.pkl']

## 4. Test Prediction

### Preprocessing Test Data

Create a function to preprocess the test data. The preprocessing consists of imputation (numeric and categorical) and standardizing.

Name the function `extract_test`. It takes 6 arguments:
 1. `data` : Data to process
 2. `numerical_columns`   : Names of the numerical columns
 3. `categorical_columns` : Names of the categorical columns
 4. `dummies_columns`     : Names of the dummy columns
 5. `imput_numericals`    : Imputer for the numerical data (from preprocessing the training data)
 6. `standardizer`        : Standardizer (from preprocessing the training data)

The implementation below also takes the fitted `label_encoder` and `label_binarizer` objects, so that the test categories are encoded exactly as they were in training.

Then assign the result of the function to a variable named `x_test_clean`.



In [78]:
# function definition

def test_cat_dummy(data, categorical_columns, label_encoder, label_binarizer, dummy_columns):
    data_imputed = data[categorical_columns].fillna("KOSONG")
    data_imputed_dummy = pd.DataFrame([])
    
    for i in categorical_columns:
        label_en = label_encoder[i]
        label_bin = label_binarizer[i]
        
        encoded = label_en.transform(data_imputed[i])
        binary = label_bin.transform(encoded)
        
        # Wrap the binarized array in a DataFrame aligned with the test index
        dummy = pd.DataFrame(binary, index = data.index)
        
        data_imputed_dummy = pd.concat([data_imputed_dummy, dummy], axis = 1)
    
    data_imputed_dummy.columns = dummy_columns
    
    return data_imputed_dummy

def extract_test(numerical_columns, categorical_columns, data, imputer, standard, label_encoder, label_binarizer, dummy_columns):
    data_dummy = test_cat_dummy(data, categorical_columns, label_encoder, label_binarizer, dummy_columns)
    data_numeric_imput = transf_imput_num(data[numerical_columns], imputer=imputer)
    data_concat = pd.concat([data_numeric_imput, data_dummy], axis=1)
    data_clean = transf_standardize(data_concat, standard)
    
    return data_clean
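
As a point of comparison, recent scikit-learn versions can bundle the same kind of preprocessing into a single object that is fitted on training data and reused on test data. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses `ColumnTransformer`, `SimpleImputer` and `OneHotEncoder`, runs on made-up data, and unlike the notebook it does not standardize the dummy columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test data with missing values.
train_demo = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "LotArea": [8000.0, 9000.0, 10000.0, 11000.0],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
})
test_demo = pd.DataFrame({
    "LotFrontage": [np.nan],
    "LotArea": [9500.0],
    "GarageType": ["BuiltIn"],  # category never seen in training
})

# Numeric columns: median imputation, then standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill with "KOSONG", then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="KOSONG")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # unseen categories become all zeros
])

preprocess = ColumnTransformer(
    [("num", numeric_steps, ["LotFrontage", "LotArea"]),
     ("cat", categorical_steps, ["GarageType"])],
    sparse_threshold=0,  # always return a dense array
)

train_out = preprocess.fit_transform(train_demo)  # fit on training data only
test_out = preprocess.transform(test_demo)        # reuse the fitted steps on test data
print(train_out.shape, test_out.shape)
```

Both outputs have the same 5 columns (2 numeric, 3 one-hot), even though the test row contains a category that the training data never had.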

In [79]:
# load necessary object
# object = joblib.load("filename.pkl")

num_columns = numeric
cat_columns = categories
dummy_columns = joblib.load("dummy_columns.pkl")
imput = joblib.load("imput.pkl")
standardizer = joblib.load("standard.pkl")

In [80]:
# preprocess test data
x_test_clean = extract_test(numeric, categories, x_test, imput, standardizer, label_encoder, label_binarizer, dummy_columns)

If you get this error:

  ValueError: operands could not be broadcast together with shapes (???,??) (??,) (???,??)

it is because some categories are not present in the test data, so the test dummy columns no longer match the training ones.
The solution is to use LabelEncoder followed by LabelBinarizer, both fitted on the training data, in place of pd.get_dummies().
The details will be explained in class.
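
The column mismatch behind this error is easy to reproduce on a tiny example. The sketch below also shows one other possible workaround, which is an assumption of this example rather than the approach taken in the notebook: keep using `pd.get_dummies`, but reindex the test dummies to the training dummy columns. The frames and category values are made up.

```python
import pandas as pd

# Hypothetical categorical data: the test set lacks "New" and "COD"
# and contains "Oth", which never appears in training.
train_cat = pd.DataFrame({"SaleType": ["WD", "New", "COD"]})
test_cat = pd.DataFrame({"SaleType": ["WD", "Oth"]})

train_dummies = pd.get_dummies(train_cat, dtype=int)
raw_test_dummies = pd.get_dummies(test_cat, dtype=int)

# Without alignment, the two frames have different columns.
print(list(train_dummies.columns))
print(list(raw_test_dummies.columns))

# Align to the training columns: missing columns are added as 0,
# and columns for unseen categories are dropped.
test_dummies = raw_test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies)
```

After reindexing, the test frame has exactly the training columns in the same order, so a scaler or model fitted on the training data can be applied to it.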

In [81]:
x_test_clean.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageArea,WoodDeckSF,OpenPorchSF,MasVnrType_BrkCmn,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
147,0.014139,-0.099199,0.412251,-1.016738,-0.756523,1.828492,-0.190712,0.413567,0.010085,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
677,-0.922812,-0.144925,-0.572123,-1.016738,-1.001094,-0.788293,-1.108862,1.79564,-0.709586,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
1305,1.992148,0.248055,1.0685,2.600882,1.285115,-0.788293,1.730776,2.502747,0.819716,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
1373,0.014139,0.080203,3.283341,1.933509,3.892989,-0.788293,1.560398,1.779569,1.389456,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788
1427,-0.506389,0.035708,-0.572123,0.318007,-0.320548,0.371186,-0.881691,-0.743517,-0.709586,-0.106137,...,-0.050767,-0.30781,-0.041434,0.398607,-0.271288,-0.041434,-0.092968,-0.101929,0.465717,-0.312788


### Predict Test Data

In [82]:
# load necessary object

lasso = joblib.load("best_lasso.pkl")
ridge = joblib.load("best_ridge.pkl")

In [83]:
lasso.score(x_test_clean, y_test)

0.66350222956875093

In [84]:
ridge.score(x_test_clean, y_test)

0.66403448364077677

In [85]:
linreg.score(x_test_clean, y_test)

0.66329461098038422

### Random Forest

In [86]:
from sklearn.ensemble import RandomForestRegressor

In [87]:
randomForest = RandomForestRegressor(random_state=123)

def randomForest_fit(x_train, y_train):
    rforestparam = {'n_estimators': [100, 300, 500, 1000],
                  'min_samples_leaf': [2, 5, 8, 10, 15],
                  'min_samples_split': [2, 5, 8, 10]}
    randomCV_rforest = RandomizedSearchCV(randomForest, 
                                          param_distributions=rforestparam,
                                          n_iter=3, cv=5, scoring='neg_mean_squared_error',
                                          n_jobs=4, random_state=123)
    randomCV_rforest.fit(x_train, y_train)
    
    print("Best Accuracy :", randomCV_rforest.score(x_train, y_train))
    print("Best Params :", randomCV_rforest.best_params_)
    
    return randomCV_rforest

In [88]:
best_randomForest = randomForest_fit(x_train_clean, y_train)

Best Accuracy : -1016049839.8
Best Params : {'n_estimators': 1000, 'min_samples_split': 8, 'min_samples_leaf': 10}

Note: because the search uses `scoring='neg_mean_squared_error'`, the "Best Accuracy" value printed above is the negative MSE on the training data, not an accuracy or an R² score.


In [89]:
randomForest = RandomForestRegressor(n_estimators=best_randomForest.best_params_.get('n_estimators'),
                                     min_samples_split=best_randomForest.best_params_.get('min_samples_split'),
                                     min_samples_leaf=best_randomForest.best_params_.get('min_samples_leaf'),
                                     random_state=123)

In [90]:
randomForest.fit(x_train_clean, y_train)
randomForest.score(x_train_clean, y_train)

0.84096235598279956

In [91]:
randomForest.score(x_test_clean, y_test)

0.81275151909919285