## Dataset

#### Importing pandas

In [1]:
import pandas as pd

#### Loading the dataset as a pandas DataFrame

The dataset used is the [Iowa Housing Dataset](http://jse.amstat.org/v19n3/decock.pdf); it is a public dataset that can be used for exploration and research. <br/>
Here the training dataset and the testing dataset are already provided separately.

In [2]:
housing_df = pd.read_csv('./dataset/iowa_data.csv')
housing_df.shape

(1460, 81)

#### Defining the ```target``` and ```features```

In [3]:
housing_df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

#### SalePrice as the ```target```

In [4]:
housing_df.dropna(axis=0, subset=['SalePrice'], inplace=True)
housing_df.shape

(1460, 81)

In [5]:
y = housing_df['SalePrice']
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

#### Only numeric columns are included as ```features```

In [6]:
X = housing_df.drop(['SalePrice'], axis=1)
X = X.select_dtypes(exclude=['object'])
X.head()

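| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |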
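
As a side note on the column filter used here, `select_dtypes(exclude=['object'])` keeps every column that is not of `object` dtype, which is not quite the same as keeping only numeric columns. The sketch below uses a made-up four-column frame (the `HasPool` column is hypothetical and not part of the housing data) to show the difference from `include=['number']`:

```python
import pandas as pd

# Toy frame; values and the HasPool column are invented for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
    "Street": pd.Series(["Pave", "Pave"], dtype=object),
    "HasPool": [False, True],
})

# Dropping object columns keeps ints, floats, and also booleans.
by_exclusion = toy.select_dtypes(exclude=["object"])

# Asking for numbers explicitly keeps only ints and floats.
by_inclusion = toy.select_dtypes(include=["number"])

print(list(by_exclusion.columns))
print(list(by_inclusion.columns))
```

For a frame that holds only integer, float, and object columns, the two filters select the same columns.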


#### Setting aside part of the training dataset as a validation dataset

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
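
To make the effect of the split concrete, here is a minimal sketch on a ten-row toy dataset (the names `toy_X`, `toy_y`, `X_tr`, and `X_val` are illustrative only). With `train_size=0.8` and `test_size=0.2`, eight rows go to training and two to validation, and the feature and target indices stay aligned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows; the target is simply twice the feature.
toy_X = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series(np.arange(10) * 2, name="target")

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

print(X_tr.shape, X_val.shape)

# Rows are shuffled, but each feature row keeps its own target.
print((y_tr == X_tr["feature"] * 2).all())
```

The same proportions applied to the 1460 rows loaded above give 1168 training rows and 292 validation rows.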

#### Displaying the dataset

In [8]:
X_train.head()

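| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618 | 619 | 20 | 90.0 | 11694 | 9 | 5 | 2007 | 2007 | 452.0 | 48 | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
| 870 | 871 | 20 | 60.0 | 6600 | 5 | 5 | 1962 | 1962 | 0.0 | 0 | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 92 | 93 | 30 | 80.0 | 13360 | 5 | 7 | 1921 | 2006 | 0.0 | 713 | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 817 | 818 | 20 | NaN | 13265 | 8 | 5 | 2002 | 2002 | 148.0 | 1218 | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
| 302 | 303 | 20 | 118.0 | 13704 | 7 | 5 | 2001 | 2002 | 150.0 | 0 | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |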


## Identifying Missing Values

#### Dimensions of the training dataset

In [9]:
X_train.shape

(1168, 37)

#### Columns with missing values

In [20]:
missing_val_column = (X_train.isnull().sum())
missing_val_column[missing_val_column > 0]

LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64

In [24]:
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
cols_with_missing

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

#### Total number of missing values across all columns

In [22]:
sum(missing_val_column[missing_val_column > 0])

276
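
Note that adding up the per-column counts gives the number of missing cells, which can exceed the number of rows affected: a row that is missing values in two columns is counted twice. The following sketch, on a small made-up frame standing in for `X_train`, contrasts the two quantities:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (values are invented for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0, 1998.0],
})

# Missing cells: per-column counts added together.
missing_cells = int(toy.isnull().sum().sum())

# Rows with at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_cells, rows_with_missing)
```

On the real training split, the equivalent row count would be `X_train.isnull().any(axis=1).sum()`.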

## Handling Missing Values

## Model

#### Importing RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

#### Preparing several models with different configurations

In [None]:
model_rf = RandomForestRegressor(n_estimators=100, random_state=0)

#### Measuring the performance of each model with MAE

In [None]:
from sklearn.metrics import mean_absolute_error
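
For reference, the mean absolute error is the average absolute difference between the true targets and the predictions. With $n$ validation rows, true values $y_i$, and predictions $\hat{y}_i$ (the `y_hat` values in this notebook), it can be written as:

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

A lower MAE means the predicted sale prices are, on average, closer to the actual ones, and the value is expressed in the same unit as `SalePrice`.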

In [None]:
def score_model(model, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):
    # Fit on the training split, predict on the validation split, and return the MAE.
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    return mean_absolute_error(y_test, y_hat)
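
Depending on the scikit-learn version, `RandomForestRegressor` may reject inputs that still contain `NaN`, so the missing values identified earlier would need handling before scoring. The sketch below is one possible approach, not necessarily the one intended for this notebook. It assumes mean imputation with `SimpleImputer` and uses synthetic data with made-up `area` and `quality` columns, then repeats the same fit, predict, and MAE steps as `score_model`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "area": rng.uniform(50, 200, size=200),
    "quality": rng.integers(1, 11, size=200).astype(float),
})
toy_y = 1000 * toy_X["area"] + 5000 * toy_X["quality"]

# Blank out 10% of one column to imitate missing entries.
toy_X.loc[toy_X.sample(frac=0.1, random_state=0).index, "area"] = np.nan

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

# Fit the imputer on the training split only, then reuse it on validation.
imputer = SimpleImputer(strategy="mean")
X_tr_imp = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_val_imp = pd.DataFrame(
    imputer.transform(X_val), columns=X_val.columns, index=X_val.index
)

# Same fit / predict / MAE steps as score_model.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr_imp, y_tr)
mae = mean_absolute_error(y_val, model.predict(X_val_imp))
print(f"MAE after mean imputation: {mae:.1f}")
```

Fitting the imputer on the training split alone keeps information from the validation rows out of the column means.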