## Dataset

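<li>Actual price registration (實價登錄): x_train_raw
<li>Actual price registration + facilities: x_train_fac
<li>Actual price registration + facilities + street view + manual labels: x_train_fac_street
<li>Actual price registration + facilities + satellite + manual labels: x_train_fac_satel
<li>Actual price registration + facilities + street view + satellite + manual labels: x_train_full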

In [1]:
import pandas as pd
import numpy as np

In [2]:
# dataset loading
x_train_full = pd.read_csv('C:\\Users\\andyi\\OneDrive\\桌面\\x_train\\x_train_full.csv')
x_train_full = x_train_full.drop(columns=['Unnamed: 0'])
x_valid_full = pd.read_csv('C:\\Users\\andyi\\OneDrive\\桌面\\x_valid\\x_valid_full.csv')
x_valid_full = x_valid_full.drop(columns=['Unnamed: 0'])
x_test_full = pd.read_csv('C:\\Users\\andyi\\OneDrive\\桌面\\x_test\\x_test_full.csv')
x_test_full = x_test_full.drop(columns=['Unnamed: 0'])
ntrain = x_train_full.shape[0]
nvalid = x_valid_full.shape[0]
ntest = x_test_full.shape[0]
X = pd.concat([x_train_full, x_valid_full, x_test_full], axis=0).reset_index(drop=True)

In [3]:
col = X.columns
raw = X[col[:81]]
fac = X[col[81:96]]
street = X[col[96:160]]
satellite = X[col[160:220]]
score = X[col[220]]

In [4]:
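# Standardization: fac, street, score (raw and satellite are left unscaled)
# Note: each scaler fit below uses the train + valid + test rows together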
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
fac = scaler.fit_transform(fac)
street = scaler.fit_transform(street)
score = scaler.fit_transform(np.array(score).reshape(-1,1))

fac = pd.DataFrame(fac, columns=col[81:96])
street = pd.DataFrame(street, columns=[("Street" + str(i)) for i in range(1, 65)])
score = pd.DataFrame(score, columns=[col[220]])
satellite.columns = [("Sat" + str(i)) for i in range(1, 61)]
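
The cell above fits `StandardScaler` on the concatenated train, validation and test rows, so test-set statistics influence the scaling. A minimal sketch of one alternative, assuming the rows are stacked in train | valid | test order as in `X`: fit on the first `n_train` rows only and transform every row with those statistics. The helper `scale_with_train_stats` and the synthetic `block` frame are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for one feature block (e.g. `fac`), stacked as train | valid | test.
rng = np.random.default_rng(0)
n_train, n_valid, n_test = 60, 20, 20
block = pd.DataFrame(
    rng.normal(5.0, 2.0, size=(n_train + n_valid + n_test, 3)),
    columns=["f1", "f2", "f3"],
)

def scale_with_train_stats(frame, n_train):
    """Fit a StandardScaler on the first n_train rows, then transform every row."""
    scaler = StandardScaler().fit(frame.iloc[:n_train])
    return pd.DataFrame(scaler.transform(frame), columns=frame.columns, index=frame.index)

scaled = scale_with_train_stats(block, n_train)
# The training rows now have (approximately) zero mean; the other rows need not.
print(scaled.iloc[:n_train].mean().round(6).tolist())
```

Whether this changes the reported RMSE values is not something the notebook shows; the sketch only illustrates the mechanics.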

In [5]:
# Assign to each dataset

def assign(data):
    ALL = pd.concat([d for d in data], axis=1)
    x_train = ALL[:ntrain]
    x_valid = ALL[ntrain:(ntrain+nvalid)]
    x_test = ALL[(ntrain+nvalid):]
    return x_train, x_valid, x_test

# 1. x_train_raw, x_valid_raw, x_test_raw
x_train_raw, x_valid_raw, x_test_raw = assign([raw])

# 2. x_train_fac, x_valid_fac, x_test_fac
x_train_fac, x_valid_fac, x_test_fac = assign([raw, fac])

# 3. x_train_fac_street, x_valid_fac_street, x_test_fac_street
x_train_fac_street, x_valid_fac_street, x_test_fac_street = assign([raw, fac, street, score])

# 4. x_train_fac_satel, x_valid_fac_satel, x_test_fac_satel
x_train_fac_satel, x_valid_fac_satel, x_test_fac_satel = assign([raw, fac, satellite, score])

# 5. x_train_full, x_valid_full, x_test_full
x_train_full, x_valid_full, x_test_full = assign([raw, fac, street, satellite, score])

In [6]:
# y
y_train = pd.read_csv('C:\\Users\\andyi\\OneDrive\\桌面\\y\\y_train.csv')
y_train = y_train.drop(columns=['Unnamed: 0']).to_numpy().ravel()
y_valid = pd.read_csv('C:\\Users\\andyi\\OneDrive\\桌面\\y\\y_valid.csv')
y_valid = y_valid.drop(columns=['Unnamed: 0']).to_numpy().ravel()
y_test = pd.read_csv('C:\\Users\\andyi\\OneDrive\\桌面\\y\\y_test.csv')
y_test = y_test.drop(columns=['Unnamed: 0']).to_numpy().ravel()
y_all = np.concatenate([y_train, y_valid, y_test], axis=0)

## Perform Feature Selection via F-score

In [7]:
from sklearn.feature_selection import SelectKBest, f_regression

def FSelect(dataset, k=100):
    # Concatenate all data
    all_X = pd.concat(dataset, axis=0)
    all_Y = y_all

    # F-test
    selector = SelectKBest(f_regression, k=k)
    all_X_new = selector.fit_transform(all_X, all_Y)
    new_columns = selector.get_feature_names_out()
    all_X_new = pd.DataFrame(all_X_new, columns=new_columns)

    x_train = all_X_new[:ntrain]
    x_valid = all_X_new[ntrain:(ntrain + nvalid)]
    x_test = all_X_new[(ntrain + nvalid):]

    return x_train, x_valid, x_test

# 1. x_train_raw, x_valid_raw, x_test_raw: no feature selection needed
# x_train_raw, x_valid_raw, x_test_raw = FSelect([x_train_raw, x_valid_raw, x_test_raw], k=)

# 2. x_train_fac, x_valid_fac, x_test_fac: no feature selection needed
# x_train_fac, x_valid_fac, x_test_fac = FSelect([x_train_fac, x_valid_fac, x_test_fac])

# 3. x_train_fac_street, x_valid_fac_street, x_test_fac_street
x_train_fac_street, x_valid_fac_street, x_test_fac_street = FSelect([x_train_fac_street, x_valid_fac_street, x_test_fac_street], k=110)

# 4. x_train_fac_satel, x_valid_fac_satel, x_test_fac_satel
x_train_fac_satel, x_valid_fac_satel, x_test_fac_satel = FSelect([x_train_fac_satel, x_valid_fac_satel, x_test_fac_satel], k=110)

# 5. x_train_full, x_valid_full, x_test_full
x_train_full, x_valid_full, x_test_full = FSelect([x_train_full, x_valid_full, x_test_full], k=150)

print("X_train:", x_train_full.shape)
print("X_test:", x_test_full.shape)
print("X_valid:", x_valid_full.shape)

X_train: (10049, 150)
X_test: (2812, 150)
X_valid: (1100, 150)
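
`FSelect` scores features against `y_all`, which includes the test targets. The sketch below shows one possible variant that fits `SelectKBest` on the training split only and then applies the same column subset to the other splits. The function name `fselect_train_only` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

def fselect_train_only(x_train, y_train, others, k):
    """Fit SelectKBest on the training split and keep the same columns elsewhere."""
    selector = SelectKBest(f_regression, k=k).fit(x_train, y_train)
    keep = x_train.columns[selector.get_support()]
    return x_train[keep], [frame[keep] for frame in others]

# Synthetic data: only column "a" drives the target.
rng = np.random.default_rng(0)
cols = ["a", "b", "c", "d"]
x_tr = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)
x_va = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)
x_te = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)
y_tr = 3.0 * x_tr["a"].to_numpy() + rng.normal(scale=0.1, size=200)

x_tr_sel, (x_va_sel, x_te_sel) = fselect_train_only(x_tr, y_tr, [x_va, x_te], k=2)
print(list(x_tr_sel.columns))
```

Because the selected frames keep their original index and column names, they can be passed to the model functions below in the same way as the output of `FSelect`.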


# Model Definitions

## Ridge Regression

In [8]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def Ridgereg(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global ridge
    ridge = Ridge()
    col = X_train.columns
    tune_params = {'alpha':np.arange(1, 5, 0.1)}
    gd = GridSearchCV(ridge, tune_params)
    gd.fit(X_valid, y_valid)

    alpha = gd.best_params_["alpha"]
    print("best alpha:", alpha)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    ridge = Ridge(alpha=alpha)
    ridge.fit(X_valid_train, y_valid_train)
    pred = ridge.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    # Find the most important 10 features
    abs_coef_idx = np.flip(np.argsort(np.absolute(ridge.coef_)))
    print("The 10 most important features:")
    print(", ".join(col[abs_coef_idx[:10]]))

    return RMSE
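
Every model function in this notebook computes RMSE with `mean_squared_error(..., squared=False)`. Newer scikit-learn releases deprecate and then drop the `squared` keyword, so that call may fail depending on the installed version. A small version-independent sketch, where the helper name `rmse` is an assumption:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error computed without the `squared` keyword."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Errors of 3 and 4 give a mean squared error of 12.5.
print(rmse([3.0, 5.0], [0.0, 1.0]))
```

Replacing `mean_squared_error(y_test, pred, squared=False)` with `rmse(y_test, pred)` should give the same number wherever the original call still runs.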

## Lasso Regression

In [9]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def Lassoreg(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global lasso
    col = X_train.columns
    lasso = Lasso()
    tune_params = {'alpha':np.arange(1, 5, 0.1)}
    gd = GridSearchCV(lasso, tune_params)
    gd.fit(X_valid, y_valid)

    alpha = gd.best_params_["alpha"]
    print("best alpha:", alpha)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    lasso = Lasso(alpha=alpha)
    lasso.fit(X_valid_train, y_valid_train)
    pred = lasso.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    # Find the most important 10 features
    abs_coef_idx = np.flip(np.argsort(np.absolute(lasso.coef_)))
    print("The 10 most important features:")
    print(", ".join(col[abs_coef_idx[:10]]))
    return RMSE
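
`Ridgereg` and `Lassoreg` tune `alpha` by running the default cross-validation of `GridSearchCV` inside the validation split alone, then refit on train + validation. One alternative, sketched here under the assumption that the goal is "fit on train, score on validation", is `PredefinedSplit`: rows marked `-1` are always used for fitting and rows marked `0` form the single evaluation fold. The synthetic arrays and the `np.linspace` grid are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

rng = np.random.default_rng(0)
X_tr, X_va = rng.normal(size=(120, 5)), rng.normal(size=(40, 5))
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_tr = X_tr @ w + rng.normal(scale=0.5, size=120)
y_va = X_va @ w + rng.normal(scale=0.5, size=40)

# -1 marks rows that are only ever used for fitting; 0 marks the single validation fold.
fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
search = GridSearchCV(
    Ridge(),
    {"alpha": np.linspace(1.0, 5.0, 41)},
    scoring="neg_root_mean_squared_error",
    cv=PredefinedSplit(fold),
)
# With refit=True (the default) the best model is refit on train + validation.
search.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))
print(search.best_params_, -search.best_score_)
```

Using `np.linspace` instead of `np.arange(1, 5, 0.1)` also avoids accumulated floating-point drift in the grid values, such as the `3.500000000000002` that appears in the output further down.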

## KNN

In [10]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def KNN(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global knn
    knn = KNeighborsRegressor()
    tune_params = {'n_neighbors':range(1, 31), 'weights':['distance', 'uniform'],
                   'leaf_size':range(1, 20)}
    gd = GridSearchCV(knn, tune_params)
    gd.fit(X_valid, y_valid)

    n_neighbors = gd.best_params_["n_neighbors"]
    print("best n_neighbors:", n_neighbors)
    weights = gd.best_params_["weights"]
    print("best weights:", weights)
    leaf_size = gd.best_params_["leaf_size"]
    print("best leaf_size:", leaf_size)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    knn = KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights, leaf_size=leaf_size)
    knn.fit(X_valid_train, y_valid_train)
    pred = knn.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)
    return RMSE
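
The KNN grid also searches over `leaf_size`. In scikit-learn this parameter controls how the underlying tree is built and queried (speed and memory), not which neighbours are returned, so it should not change the predictions. The sketch below checks that on synthetic data; the data and the fixed `algorithm="kd_tree"` are assumptions made to keep the comparison simple.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_new = rng.normal(size=(20, 4))

# Same neighbours and weights, three different leaf sizes.
preds = [
    KNeighborsRegressor(n_neighbors=6, weights="distance", algorithm="kd_tree", leaf_size=ls)
    .fit(X, y)
    .predict(X_new)
    for ls in (1, 10, 30)
]
print(np.allclose(preds[0], preds[1]), np.allclose(preds[1], preds[2]))
```

If that holds for the real data as well, dropping `leaf_size` from `tune_params` would shrink the grid without affecting the selected model.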

## Random Forest

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def RandomForest(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global RF
    RF = RandomForestRegressor(random_state=439)
    tune_params = {'n_estimators':range(200, 300, 20)}
    gd = GridSearchCV(RF, tune_params)
    gd.fit(X_valid, y_valid)

    n_estimators = gd.best_params_["n_estimators"]
    print("best n_estimators:", n_estimators)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    RF = RandomForestRegressor(n_estimators=n_estimators, oob_score=True, random_state=439)
    RF.fit(X_valid_train, y_valid_train)
    pred = RF.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    # printing the 10 most important features
    importance = RF.feature_importances_
    idx = np.flip(np.argsort(importance))
    important_10feat = X_train.columns[idx[:10]]
    print("The 10 most important features:")
    print(", ".join(important_10feat))
    return RF.oob_score_, RMSE
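
`RandomForest` returns `RF.oob_score_` next to the test RMSE. For scikit-learn regressors the default OOB score is an R² value computed from the out-of-bag predictions, so it is not on the same scale as the RMSE. A minimal sketch on synthetic data showing the relationship and how an OOB RMSE could be derived from `oob_prediction_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=439).fit(X, y)

# R^2 recomputed from the per-sample out-of-bag predictions.
oob_r2 = r2_score(y, rf.oob_prediction_)
# The same predictions expressed as an RMSE, comparable to the test RMSE.
oob_rmse = float(np.sqrt(np.mean((y - rf.oob_prediction_) ** 2)))
print(rf.oob_score_, oob_r2, oob_rmse)
```

`BaggingRegressor` exposes the same `oob_score_` and `oob_prediction_` attributes when `oob_score=True`.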

## Bagging

In [12]:
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def Bagging(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global bagging
    bagging = BaggingRegressor(random_state=439)
    tune_params = {'n_estimators':range(200, 300, 20)}
    gd = GridSearchCV(bagging, tune_params)
    gd.fit(X_valid, y_valid)

    n_estimators = gd.best_params_["n_estimators"]
    print("best n_estimators:", n_estimators)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    bagging = BaggingRegressor(n_estimators=n_estimators,
                               oob_score=True,
                               random_state=439)
    bagging.fit(X_valid_train, y_valid_train)
    pred = bagging.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    return bagging.oob_score_, RMSE

## Decision Tree

In [13]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def DecisionTree(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global DT
    DT = DecisionTreeRegressor(random_state=439)
    tune_params = {'criterion':["squared_error", "friedman_mse", "absolute_error", "poisson"],
                   'max_depth':range(5, 20),
                   'min_samples_split':range(2, 10)}
    gd = GridSearchCV(DT, tune_params, scoring='neg_root_mean_squared_error')
    gd.fit(X_valid, y_valid)

    criterion = gd.best_params_["criterion"]
    max_depth = gd.best_params_["max_depth"]
    min_samples_split = gd.best_params_["min_samples_split"]
    print("best criterion:", criterion)
    print("best max_depth:", max_depth)
    print("best min_samples_split:", min_samples_split)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    DT = DecisionTreeRegressor(criterion=criterion, max_depth=max_depth,
                               min_samples_split=min_samples_split, random_state=439)
    DT.fit(X_valid_train, y_valid_train)
    pred = DT.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    # printing the 10 most important features
    importance = DT.feature_importances_
    idx = np.flip(np.argsort(importance))
    important_10feat = X_train.columns[idx[:10]]
    print("The 10 most important features:")
    print(", ".join(important_10feat))

    return RMSE

## Gradient Boosting

In [14]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def GBoosting(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global GBDT
    GBDT = GradientBoostingRegressor(random_state=439)
    tune_params = {'loss':["squared_error", "huber", "absolute_error", "quantile"],
                   'learning_rate':[0.01, 0.1, 0.2, 0.4],
                   'n_estimators':range(120, 200, 20)}
    gd = GridSearchCV(GBDT, tune_params, scoring='neg_root_mean_squared_error')
    gd.fit(X_valid, y_valid)

    loss = gd.best_params_["loss"]
    learning_rate = gd.best_params_["learning_rate"]
    n_estimators = gd.best_params_["n_estimators"]
    print("best loss:", loss)
    print("best learning_rate:", learning_rate)
    print("best n_estimators:", n_estimators)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    GBDT = GradientBoostingRegressor(loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
                                     random_state=439)
    GBDT.fit(X_valid_train, y_valid_train)
    pred = GBDT.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    # printing the 10 most important features
    importance = GBDT.feature_importances_
    idx = np.flip(np.argsort(importance))
    important_10feat = X_train.columns[idx[:10]]
    print("The 10 most important features:")
    print(", ".join(important_10feat))

    return RMSE

## Adaboost

In [15]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def AdaBoost(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global ada
    ada = AdaBoostRegressor(random_state=439)
    tune_params = {'loss':["linear", "square", "exponential"],
                   'learning_rate':[0.1, 0.3, 0.5, 0.8, 1, 1.5, 2, 2.5, 3],
                   'n_estimators':range(10, 150, 10)}
    gd = GridSearchCV(ada, tune_params, scoring='neg_root_mean_squared_error')
    gd.fit(X_valid, y_valid)

    loss = gd.best_params_["loss"]
    learning_rate = gd.best_params_["learning_rate"]
    n_estimators = gd.best_params_["n_estimators"]
    print("best loss:", loss)
    print("best learning_rate:", learning_rate)
    print("best n_estimators:", n_estimators)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    ada = AdaBoostRegressor(loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
                            random_state=439)
    ada.fit(X_valid_train, y_valid_train)
    pred = ada.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    # printing the 10 most important features
    importance = ada.feature_importances_
    idx = np.flip(np.argsort(importance))
    important_10feat = X_train.columns[idx[:10]]
    print("The 10 most important features:")
    print(", ".join(important_10feat))

    return RMSE

## Support Vector Machine Regression

In [16]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def SVM(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global svr
    svr = SVR()
    tune_params = {'kernel':["linear", "poly", "rbf", "sigmoid"],
                   'C':np.arange(0.1, 3, 0.1)}
    gd = GridSearchCV(svr, tune_params, scoring='neg_root_mean_squared_error')
    gd.fit(X_valid, y_valid)

    kernel = gd.best_params_["kernel"]
    C = gd.best_params_["C"]
    print("best kernel:", kernel)
    print("best C:", C)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    svr = SVR(kernel=kernel, C=C)
    svr.fit(X_valid_train, y_valid_train)
    pred = svr.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    return RMSE

## Stochastic Gradient Descent

In [17]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def SGD(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global sgd
    sgd = SGDRegressor()
    tune_params = {'loss':["squared_error", "huber", "epsilon_insensitive"],
                   'penalty':["l2", "l1"],
                   'alpha':np.logspace(-6, 1, 8, base=10)}
    gd = GridSearchCV(sgd, tune_params, scoring='neg_root_mean_squared_error')
    gd.fit(X_valid, y_valid)

    loss = gd.best_params_["loss"]
    penalty = gd.best_params_["penalty"]
    alpha = gd.best_params_["alpha"]
    print("best loss:", loss)
    print("best penalty:", penalty)
    print("best alpha:", alpha)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    sgd = SGDRegressor(loss=loss, penalty=penalty, alpha=alpha)
    sgd.fit(X_valid_train, y_valid_train)
    pred = sgd.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    return RMSE
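
Only the `fac`, `street` and `score` blocks are standardized earlier in the notebook, while `SGDRegressor` is generally sensitive to feature scale. A hedged sketch of one way to handle this is to put a `StandardScaler` in front of the regressor in a pipeline, so that scaling is learned from whatever data the pipeline is fit on. The two synthetic features below are assumptions that stand in for an area-like column and a 0/1 dummy.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
X = np.column_stack([rng.normal(100.0, 30.0, 500), rng.integers(0, 2, 500)])
y = 0.5 * X[:, 0] + 20.0 * X[:, 1] + rng.normal(scale=1.0, size=500)

# The scaler is fit inside the pipeline, on the rows passed to fit() only.
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
sgd_rmse = float(np.sqrt(np.mean((y[400:] - pred) ** 2)))
print(sgd_rmse)
```

When such a pipeline is passed to `GridSearchCV`, the parameter names gain the step prefix, for example `sgdregressor__alpha` instead of `alpha`.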

## Multi-Layer Perceptron

In [18]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def MLP(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global mlp
    mlp = MLPRegressor(early_stopping=True, hidden_layer_sizes=360)
    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    mlp.fit(X_valid_train, y_valid_train)
    pred = mlp.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)
    return RMSE

## XGBoost

In [19]:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def XGB(X_train, y_train, X_valid, y_valid, X_test, y_test):
    global xgb
    xgb = XGBRegressor()
    tune_params = {"n_estimators":range(1, 50)}
    gd = GridSearchCV(xgb, tune_params, scoring='neg_root_mean_squared_error')
    gd.fit(X_valid, y_valid)

    n_estimators = gd.best_params_["n_estimators"]
    print("best n_estimators:", n_estimators)
    print("==========")

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    xgb = XGBRegressor(n_estimators=n_estimators)
    xgb.fit(X_valid_train, y_valid_train)
    pred = xgb.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)

    return RMSE

## Stacking

In [20]:
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def Stacking(X_train, y_train, X_valid, y_valid, X_test, y_test):
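    # Uses the module-level estimators (RF, bagging, GBDT, knn, sgd, ridge, xgb)
    # set by the functions above, so those must have been run first.
    ests = [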
            ("Random Forest", RF),
            ("Bagging", bagging),
            ("Gradient Boosting", GBDT),
            ("KNN", knn),
            ("SGD", sgd),
            ("Ridge", ridge),
            ("XGB", xgb)
           ]
    stacking = StackingRegressor(estimators=ests)

    X_valid_train = pd.concat([X_valid, X_train], axis=0)
    y_valid_train = np.concatenate([y_valid, y_train], axis=0)

    stacking.fit(X_valid_train, y_valid_train)

    pred = stacking.predict(X_test)
    RMSE = mean_squared_error(y_test, pred, squared=False)
    return RMSE
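
`Stacking` reads its base models from global variables. `StackingRegressor` clones each estimator before fitting, so only the hyperparameters carry over and the objects passed in stay untouched. The sketch below passes the base estimators explicitly instead; `build_stack` and the three small base models are hypothetical choices for illustration, not the notebook's tuned models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

def build_stack(base_estimators):
    """Build a StackingRegressor from an explicit list of (name, estimator) pairs."""
    return StackingRegressor(estimators=list(base_estimators))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=200)

base = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=439)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("ridge", Ridge(alpha=1.0)),
]
stack = build_stack(base).fit(X[:150], y[:150])
pred = stack.predict(X[150:])

# The fitted copies live in stack.estimators_; the originals in `base` are not fitted.
print(pred.shape, hasattr(base[2][1], "coef_"))
```

Passing the list explicitly makes the function independent of the order in which the earlier cells were run.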

# Model Training and Evaluation!

## Dataset

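<li>Actual price registration (實價登錄): x_train_raw
<li>Actual price registration + facilities: x_train_fac
<li>Actual price registration + facilities + street view + manual labels: x_train_fac_street
<li>Actual price registration + facilities + satellite + manual labels: x_train_fac_satel
<li>Actual price registration + facilities + street view + satellite + manual labels: x_train_full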

In [21]:
import warnings
warnings.filterwarnings("ignore")

In [22]:
# Evaluation: Solely Government Data
print("==== Solely Government Data ====\n")

print("||Ridge Regression||")
rmse = Ridgereg(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Ridge Regression | RMSE =", rmse, "\n")

print("||Lasso Regression||")
rmse = Lassoreg(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Lasso Regression | RMSE =", rmse, "\n")

print("||KNN||")
rmse = KNN(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("KNN | RMSE =", rmse, "\n")

print("||Random Forest||")
oob, rmse = RandomForest(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Random Forest | RMSE =", rmse)
print("Random Forest | OOB Score =", oob, "\n")

print("||Bagging||")
oob, rmse = Bagging(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Bagging | RMSE =", rmse)
print("Bagging | OOB Score =", oob, "\n")

print("||Decision Tree||")
rmse = DecisionTree(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Decision Tree | RMSE =", rmse, "\n")

print("||Gradient Boosting||")
rmse = GBoosting(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Gradient Boosting | RMSE =", rmse, "\n")

print("||Ada Boosting||")
rmse = AdaBoost(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Ada Boosting | RMSE =", rmse, "\n")

print("||Support Vector Machine||")
rmse = SVM(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Support Vector Machine | RMSE =", rmse, "\n")

print("||Stochastic Gradient Descent||")
rmse = SGD(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Stochastic Gradient Descent | RMSE =", rmse, "\n")

print("||Multi-Layer Perceptron||")
rmse = MLP(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Multi-Layer Perceptron | RMSE =", rmse, "\n")

print("||XGBoost||")
rmse = XGB(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("XGBoost | RMSE =", rmse, "\n")

print("||Stacking||")
rmse = Stacking(x_train_raw, y_train, x_valid_raw, y_valid, x_test_raw, y_test)
print("Stacking | RMSE =", rmse)

==== Solely Government Data ====

||Ridge Regression||
best alpha: 3.500000000000002
The 10 most important features:
鄉鎮市區_大安區, 鄉鎮市區_萬華區, 鄉鎮市區_文山區, 鄉鎮市區_北投區, 移轉層次項目_平台, 鄉鎮市區_松山區, 主要用途_見使用執照, 鄉鎮市區_中正區, 建物型態_店面(店鋪), 使用分區_特定休閒旅館住宅專用區
Ridge Regression | RMSE = 41474.03474987123 

||Lasso Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 鄉鎮市區_萬華區, 移轉層次項目_平台, 鄉鎮市區_松山區, 鄉鎮市區_中正區, 主要用途_見使用執照, 鄉鎮市區_信義區, 移轉層次項目_騎樓, 建物型態_店面(店鋪), 鄉鎮市區_文山區
Lasso Regression | RMSE = 41463.25751878098 

||KNN||
best n_neighbors: 6
best weights: distance
best leaf_size: 1
KNN | RMSE = 40857.47617396677 

||Random Forest||
best n_estimators: 280
The 10 most important features:
建物型態_公寓(5樓含以下無電梯), 屋齡, 鄉鎮市區_大安區, 建物移轉總面積平方公尺, 土地移轉總面積平方公尺, 移轉層次, 總樓層數, 建材_鋼筋混凝土造, 鄉鎮市區_北投區, 鄉鎮市區_文山區
Random Forest | RMSE = 35399.856584492845
Random Forest | OOB Score = 0.6282794349825827 

||Bagging||
best n_estimators: 280
Bagging | RMSE = 35376.00555136411
Bagging | OOB Score = 0.62881634191952 

||Decision 

In [23]:
# Evaluation: Government Data with Facility Information
print("==== Government Data with Facility Information ====\n")

print("||Ridge Regression||")
rmse = Ridgereg(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Ridge Regression | RMSE =", rmse, "\n")

print("||Lasso Regression||")
rmse = Lassoreg(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Lasso Regression | RMSE =", rmse, "\n")

print("||KNN||")
rmse = KNN(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("KNN | RMSE =", rmse, "\n")

print("||Random Forest||")
oob, rmse = RandomForest(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Random Forest | RMSE =", rmse)
print("Random Forest | OOB Score =", oob, "\n")

print("||Bagging||")
oob, rmse = Bagging(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Bagging | RMSE =", rmse)
print("Bagging | OOB Score =", oob, "\n")

print("||Decision Tree||")
rmse = DecisionTree(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Decision Tree | RMSE =", rmse, "\n")

print("||Gradient Boosting||")
rmse = GBoosting(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Gradient Boosting | RMSE =", rmse, "\n")

print("||Ada Boosting||")
rmse = AdaBoost(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Ada Boosting | RMSE =", rmse, "\n")

print("||Support Vector Machine||")
rmse = SVM(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Support Vector Machine | RMSE =", rmse, "\n")

print("||Stochastic Gradient Descent||")
rmse = SGD(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Stochastic Gradient Descent | RMSE =", rmse, "\n")

print("||Multi-Layer Perceptron||")
rmse = MLP(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Multi-Layer Perceptron | RMSE =", rmse, "\n")

print("||XGBoost||")
rmse = XGB(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("XGBoost | RMSE =", rmse, "\n")

print("||Stacking||")
rmse = Stacking(x_train_fac, y_train, x_valid_fac, y_valid, x_test_fac, y_test)
print("Stacking | RMSE =", rmse)

==== Government Data with Facility Information ====

||Ridge Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 鄉鎮市區_萬華區, 建材_鋼骨造, 鄉鎮市區_大同區, 移轉層次項目_平台, 建物型態_店面(店鋪), 使用分區_住宅用地, 移轉層次項目_騎樓, 主要用途_見使用執照, 鄉鎮市區_北投區
Ridge Regression | RMSE = 39711.9329989575 

||Lasso Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 鄉鎮市區_萬華區, 建材_鋼骨造, 移轉層次項目_平台, 移轉層次項目_騎樓, 建物型態_店面(店鋪), 主要用途_見使用執照, 建材_鋼骨鋼筋混凝土造, 鄉鎮市區_大同區, 主要用途_見其他登記事項
Lasso Regression | RMSE = 39696.25589837171 

||KNN||
best n_neighbors: 14
best weights: distance
best leaf_size: 1
KNN | RMSE = 36024.7776388928 

||Random Forest||
best n_estimators: 280
The 10 most important features:
金融機構, 屋齡, 建物型態_公寓(5樓含以下無電梯), 建物移轉總面積平方公尺, 高中職, 超商, 土地移轉總面積平方公尺, 移轉層次, 大賣場, 建材_鋼筋混凝土造
Random Forest | RMSE = 33387.083343008955
Random Forest | OOB Score = 0.6733493161991587 

||Bagging||
best n_estimators: 200
Bagging | RMSE = 33352.51711337207
Bagging | OOB Score = 0.6713803755008805 

||Decision

In [22]:
# Evaluation: Above data with Street Scene Data
print("==== Above data with Street Scene Data ====\n")


print("||Ridge Regression||")
rmse = Ridgereg(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Ridge Regression | RMSE =", rmse, "\n")

print("||Lasso Regression||")
rmse = Lassoreg(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Lasso Regression | RMSE =", rmse, "\n")

print("||KNN||")
rmse = KNN(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("KNN | RMSE =", rmse, "\n")

print("||Random Forest||")
oob, rmse = RandomForest(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Random Forest | RMSE =", rmse)
print("Random Forest | OOB Score =", oob, "\n")

print("||Bagging||")
oob, rmse = Bagging(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Bagging | RMSE =", rmse)
print("Bagging | OOB Score =", oob, "\n")

print("||Decision Tree||")
rmse = DecisionTree(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Decision Tree | RMSE =", rmse, "\n")

print("||Gradient Boosting||")
rmse = GBoosting(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Gradient Boosting | RMSE =", rmse, "\n")

print("||Ada Boosting||")
rmse = AdaBoost(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Ada Boosting | RMSE =", rmse, "\n")

print("||Support Vector Machine||")
rmse = SVM(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Support Vector Machine | RMSE =", rmse, "\n")

print("||Stochastic Gradient Descent||")
rmse = SGD(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Stochastic Gradient Descent | RMSE =", rmse, "\n")

print("||Multi-Layer Perceptron||")
rmse = MLP(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Multi-Layer Perceptron | RMSE =", rmse, "\n")

print("||XGBoost||")
rmse = XGB(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("XGBoost | RMSE =", rmse, "\n")

print("||Stacking||")
rmse = Stacking(x_train_fac_street, y_train, x_valid_fac_street, y_valid, x_test_fac_street, y_test)
print("Stacking | RMSE =", rmse)

==== Above data with Street Scene Data ====

||Ridge Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 建材_鋼骨造, 鄉鎮市區_中正區, 鄉鎮市區_松山區, 移轉層次項目_平台, 鄉鎮市區_信義區, 建物型態_店面(店鋪), 鄉鎮市區_萬華區, 建材_鋼骨鋼筋混凝土造, 使用分區_住宅用地
Ridge Regression | RMSE = 39472.84058505757 

||Lasso Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 建材_鋼骨造, 鄉鎮市區_中正區, 建材_鋼骨鋼筋混凝土造, 移轉層次項目_平台, 鄉鎮市區_松山區, 鄉鎮市區_信義區, 建物型態_店面(店鋪), 主要用途_自由職業事務所, 鄉鎮市區_萬華區
Lasso Regression | RMSE = 39453.74146587292 

||KNN||
best n_neighbors: 11
best weights: distance
best leaf_size: 1
KNN | RMSE = 39909.39473587987 

||Random Forest||
best n_estimators: 200
The 10 most important features:
金融機構, 屋齡, 建物型態_公寓(5樓含以下無電梯), 高中職, 建材_鋼筋混凝土造, 大賣場, 移轉層次, 超商, 鄉鎮市區_大安區, 建物移轉總面積平方公尺
Random Forest | RMSE = 35188.12913695446
Random Forest | OOB Score = 0.6316459521596338 

||Bagging||
best n_estimators: 280
Bagging | RMSE = 35150.479701364224
Bagging | OOB Score = 0.6323601847428955 

||Decision Tree||
best

In [23]:
# Evaluation: Above data with Satellite data
print("==== Above data with Satellite data ====\n")

print("||Ridge Regression||")
rmse = Ridgereg(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Ridge Regression | RMSE =", rmse, "\n")

print("||Lasso Regression||")
rmse = Lassoreg(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Lasso Regression | RMSE =", rmse, "\n")

print("||KNN||")
rmse = KNN(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("KNN | RMSE =", rmse, "\n")

print("||Random Forest||")
oob, rmse = RandomForest(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Random Forest | RMSE =", rmse)
print("Random Forest | OOB Score =", oob, "\n")

print("||Bagging||")
oob, rmse = Bagging(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Bagging | RMSE =", rmse)
print("Bagging | OOB Score =", oob, "\n")

print("||Decision Tree||")
rmse = DecisionTree(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Decision Tree | RMSE =", rmse, "\n")

print("||Gradient Boosting||")
rmse = GBoosting(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Gradient Boosting | RMSE =", rmse, "\n")

print("||Ada Boosting||")
rmse = AdaBoost(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Ada Boosting | RMSE =", rmse, "\n")

print("||Support Vector Machine||")
rmse = SVM(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Support Vector Machine | RMSE =", rmse, "\n")

print("||Stochastic Gradient Descent||")
rmse = SGD(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Stochastic Gradient Descent | RMSE =", rmse, "\n")

print("||Multi-Layer Perceptron||")
rmse = MLP(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Multi-Layer Perceptron | RMSE =", rmse, "\n")

print("||XGBoost||")
rmse = XGB(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("XGBoost | RMSE =", rmse, "\n")

print("||Stacking||")
rmse = Stacking(x_train_fac_satel, y_train, x_valid_fac_satel, y_valid, x_test_fac_satel, y_test)
print("Stacking | RMSE =", rmse)

==== Above data with Satellite data ====

||Ridge Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 鄉鎮市區_中正區, 鄉鎮市區_信義區, 鄉鎮市區_松山區, 建材_鋼骨造, 建材_鋼骨鋼筋混凝土造, 移轉層次項目_平台, 建物型態_店面(店鋪), 鄉鎮市區_萬華區, 移轉層次項目_騎樓
Ridge Regression | RMSE = 38971.92559278172 

||Lasso Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 建材_鋼骨造, 建材_鋼骨鋼筋混凝土造, 鄉鎮市區_中正區, 鄉鎮市區_信義區, 鄉鎮市區_松山區, 移轉層次項目_平台, 建物型態_店面(店鋪), 移轉層次項目_騎樓, 建材_見其他登記事項
Lasso Regression | RMSE = 38957.8721603581 

||KNN||
best n_neighbors: 10
best weights: distance
best leaf_size: 1
KNN | RMSE = 35787.70928863243 

||Random Forest||
best n_estimators: 260
The 10 most important features:
金融機構, 屋齡, 建物型態_公寓(5樓含以下無電梯), 建材_鋼筋混凝土造, 高中職, 移轉層次, 鄉鎮市區_大安區, 大賣場, 超商, Sat4
Random Forest | RMSE = 34160.834912322745
Random Forest | OOB Score = 0.6559201252046813 

||Bagging||
best n_estimators: 260
Bagging | RMSE = 34146.50780088224
Bagging | OOB Score = 0.6559830028402012 

||Decision Tree||
best criterion: 
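
Each evaluation cell in this notebook repeats the same call-and-print pattern once per model. A minimal sketch of how that pattern could be driven from a table of models follows. `evaluate_all`, `mean_baseline`, and `mean_baseline_with_oob` are hypothetical names introduced for illustration and are not part of the notebook. The sketch assumes, as in the cells here, that each model function takes the six train/valid/test arguments and returns either an RMSE or an `(oob, rmse)` pair.

```python
import numpy as np


def evaluate_all(models, data, title):
    """Run each model function on one feature set and collect its scores.

    models: dict mapping a display name to a function
            f(x_train, y_train, x_valid, y_valid, x_test, y_test)
    data:   tuple holding those six arguments, in that order
    Returns a dict: name -> {"rmse": float, "oob": float or None}
    """
    print(f"==== {title} ====\n")
    results = {}
    for name, fn in models.items():
        print(f"||{name}||")
        out = fn(*data)
        # Assumption: ensemble-style functions return (oob, rmse),
        # all other functions return the RMSE alone.
        oob, rmse = out if isinstance(out, tuple) else (None, out)
        print(f"{name} | RMSE =", rmse)
        if oob is not None:
            print(f"{name} | OOB Score =", oob)
        print()
        results[name] = {"rmse": rmse, "oob": oob}
    return results


# Hypothetical stand-in models so the sketch runs on its own.
def mean_baseline(x_train, y_train, x_valid, y_valid, x_test, y_test):
    """Predict the training mean for every test row and return the RMSE."""
    pred = np.full(len(y_test), np.mean(y_train))
    return float(np.sqrt(np.mean((np.asarray(y_test) - pred) ** 2)))


def mean_baseline_with_oob(x_train, y_train, x_valid, y_valid, x_test, y_test):
    """Same baseline, returning a placeholder OOB score of 0.0 with the RMSE."""
    rmse = mean_baseline(x_train, y_train, x_valid, y_valid, x_test, y_test)
    return 0.0, rmse


# Toy data only, used to exercise the driver.
toy = (None, [1.0, 2.0, 3.0], None, [2.0], None, [1.0, 3.0])
demo = evaluate_all(
    {"Baseline": mean_baseline, "Baseline+OOB": mean_baseline_with_oob},
    toy,
    "Toy data",
)
```

In the notebook itself, the dictionary would map names such as "Ridge Regression" or "Random Forest" to the existing functions (`Ridgereg`, `RandomForest`, and so on), and `data` would be one feature set's six arguments. The returned dictionary also makes it easier to compare RMSE across feature sets than reading the printed logs.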

In [25]:
# Evaluation: All data included
print("==== All data included ====\n")

print("||Ridge Regression||")
rmse = Ridgereg(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Ridge Regression | RMSE =", rmse, "\n")

print("||Lasso Regression||")
rmse = Lassoreg(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Lasso Regression | RMSE =", rmse, "\n")

print("||KNN||")
rmse = KNN(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("KNN | RMSE =", rmse, "\n")

print("||Random Forest||")
oob, rmse = RandomForest(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Random Forest | RMSE =", rmse)
print("Random Forest | OOB Score =", oob, "\n")

print("||Bagging||")
oob, rmse = Bagging(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Bagging | RMSE =", rmse)
print("Bagging | OOB Score =", oob, "\n")

print("||Decision Tree||")
rmse = DecisionTree(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Decision Tree | RMSE =", rmse, "\n")

print("||Gradient Boosting||")
rmse = GBoosting(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Gradient Boosting | RMSE =", rmse, "\n")

print("||Ada Boosting||")
rmse = AdaBoost(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Ada Boosting | RMSE =", rmse, "\n")

print("||Support Vector Machine||")
rmse = SVM(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Support Vector Machine | RMSE =", rmse, "\n")

print("||Stochastic Gradient Descent||")
rmse = SGD(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Stochastic Gradient Descent | RMSE =", rmse, "\n")

print("||Multi-Layer Perceptron||")
rmse = MLP(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Multi-Layer Perceptron | RMSE =", rmse, "\n")

print("||XGBoost||")
rmse = XGB(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("XGBoost | RMSE =", rmse, "\n")

print("||Stacking||")
rmse = Stacking(x_train_full, y_train, x_valid_full, y_valid, x_test_full, y_test)
print("Stacking | RMSE =", rmse)

==== All data included ====

||Ridge Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 鄉鎮市區_中正區, 鄉鎮市區_信義區, 建材_鋼骨造, 鄉鎮市區_松山區, 建物型態_店面(店鋪), 鄉鎮市區_萬華區, 移轉層次項目_平台, 建材_鋼骨鋼筋混凝土造, 移轉層次項目_騎樓
Ridge Regression | RMSE = 38591.414097405825 

||Lasso Regression||
best alpha: 4.900000000000004
The 10 most important features:
鄉鎮市區_大安區, 建材_鋼骨造, 鄉鎮市區_中正區, 建材_鋼骨鋼筋混凝土造, 鄉鎮市區_信義區, 鄉鎮市區_松山區, 移轉層次項目_平台, 建物型態_店面(店鋪), 移轉層次項目_騎樓, 建材_見其他登記事項
Lasso Regression | RMSE = 38575.04332916942 

||KNN||
best n_neighbors: 20
best weights: distance
best leaf_size: 1
KNN | RMSE = 38698.72902698795 

||Random Forest||
best n_estimators: 220
The 10 most important features:
金融機構, 建物型態_公寓(5樓含以下無電梯), 屋齡, 建材_鋼筋混凝土造, 高中職, 移轉層次, 鄉鎮市區_大安區, 大賣場, 超商, Sat4
Random Forest | RMSE = 34932.13022475802
Random Forest | OOB Score = 0.6378945086428703 

||Bagging||
best n_estimators: 240
Bagging | RMSE = 34854.57439878327
Bagging | OOB Score = 0.6397787527348187 

||Decision Tree||
best criterion: absolute_err