## HW5 Band Gap Prediction
### Darian Yang

You can start with basic models, and then try your best to optimize your predictions by using more sophisticated models, feature engineering, and fine-tuning the hyperparameters. The grade of this homework will be based on the score you get on Kaggle. 

There is no specific requirement for a written summary for this homework, but please leave some necessary notes/comments in your submission to help the grader understand your workflow.

#### TODO:
* Given the training data: X_train and Y_train
* Find best suitable features
* You are allowed to build any type of regression model
* You are allowed to use any type of data processing
* Find the best performing model
* Use your model to score Y_test

This dataset provides quantitative measurements of the band gap (Egap) for a set of inorganic crystalline materials.

#### File descriptions
* X_train_kaggle.csv - the training set: file with Material column
* y_train_kaggle.csv - the training set: Egap for training with Id column
* X_test_kaggle.csv - the test set: file with Material and Id columns; Egap should be predicted for each Id
* y_sample_submission.csv - a sample submission file in the correct format

#### Data fields
* Id - an id unique to a given material (please note that 'Id' is the last column)
* D1-D132 - chemical descriptors for a given material
* Egap - Egap values in y- files

The evaluation metric for this competition is mean absolute error (MAE).
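As a point of reference, MAE averages the absolute deviations, whereas the mean squared error (MSE) squares them and therefore weights large misses more heavily. The sketch below uses made-up band gap values and two helper functions that are illustrative only; they are not part of the competition code.

```python
import numpy as np


def mean_absolute_error_np(y_true, y_pred):
    """Average absolute deviation between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mean_squared_error_np(y_true, y_pred):
    """Average squared deviation between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Made-up band gaps (eV) with one large miss on the last material
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.5, 1.0, 2.0, 5.0])

mae = mean_absolute_error_np(y_true, y_pred)  # (0.5 + 0 + 0 + 2.0) / 4
mse = mean_squared_error_np(y_true, y_pred)   # (0.25 + 0 + 0 + 4.0) / 4
print(f"MAE = {mae}, MSE = {mse}")
```

Because the two metrics penalize outliers differently, a model ranking based on MSE will not necessarily match the ranking on an MAE leaderboard.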

#### Submission Format
For every material in the test set, the submission file should contain two columns: Id and Egap.

The file should contain a header and have the following format:
```
Id,Egap
1,0.456
```
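A minimal sketch of producing a file in this format with pandas follows. The helper name `make_submission` and the example values are hypothetical; the sketch writes to an in-memory buffer rather than to disk so the resulting text can be inspected.

```python
import io

import numpy as np
import pandas as pd


def make_submission(ids, egap_pred):
    """Build the two-column (Id, Egap) table (hypothetical helper)."""
    ids = np.asarray(ids)
    egap_pred = np.asarray(egap_pred, dtype=float)
    if ids.shape != egap_pred.shape:
        raise ValueError("need exactly one prediction per Id")
    if len(np.unique(ids)) != len(ids):
        raise ValueError("Ids must be unique")
    return pd.DataFrame({"Id": ids, "Egap": egap_pred})


# Made-up ids and predictions, for illustration only
submission = make_submission([1, 2, 3], [0.456, 1.2, 0.0])

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps only Id and Egap
csv_text = buffer.getvalue()
header = csv_text.splitlines()[0]
print(csv_text)
```

Passing `index=False` matters here: otherwise pandas would write its row index as an extra, unnamed first column.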

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [8]:
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm

In [9]:
X_train = pd.read_csv("X_train_kaggle.csv", index_col=133).to_numpy()
y_train = pd.read_csv("y_train_kaggle.csv", index_col=1).to_numpy()
X_test = pd.read_csv("X_test_kaggle.csv", index_col=133).to_numpy()

In [10]:
TEST_ID = pd.read_csv("X_test_kaggle.csv",index_col=133).index

In [11]:
TEST_ID.shape

(4132,)

In [12]:
np.unique(TEST_ID).shape

(4132,)

In [13]:
# 'Id' is the last column (position 133), matching how TEST_ID is read above
TRAIN_ID = pd.read_csv("X_train_kaggle.csv",index_col=133).index

In [14]:
X_train.shape

(9640, 133)

Drop the material column:

In [15]:
X_train = X_train[:,1:]
X_train.shape

(9640, 132)

In [16]:
X_test = X_test[:,1:]
X_test.shape

(4132, 132)

In [17]:
y_train.shape

(9640, 1)

In [18]:
# reshape to 1d
y_train = y_train[:,0]
y_train.shape

(9640,)

In [19]:
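def worker(model, save_to=None, X_train=X_train, y_train=y_train, X_test=X_test):
    """
    Fit the model on the training data, predict Egap for the test set,
    and optionally write a Kaggle submission file to `save_to`.
    """
    model = model.fit(X_train, y_train)
    y_test = model.predict(X_test)

    if save_to:
        # two-column (Id, Egap) table; Id stays a regular column, so no index is written
        ret = pd.DataFrame(data={"Id": TEST_ID, "Egap": y_test})
        ret.to_csv(save_to, index=False)
    
    return y_test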

In [20]:
from sklearn.metrics import mean_squared_error

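def calc_score(model, X_train, y_train, X_test, y_test):
    """
    Fit an sklearn-style model and return the MSE on the held-out split.

    Models fit with oob_score=True (random forests) return a tuple of
    (out-of-bag R^2, held-out MSE) instead.
    Note: the Kaggle metric is MAE; MSE is used here only to rank models.
    """
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    # random forests fit with oob_score=True expose oob_score_
    if hasattr(model, "oob_score_"):
        return model.oob_score_, mse
    else:
        return mse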

In [21]:
# first standardize the features (the scaler is fit on the training data only)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [32]:
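# split the training data: 40% to fit, 60% held out for scoring
# (no random_state is set, so the split differs between runs)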
X_train_b, X_test_b, y_train_b, y_test_b = \
    model_selection.train_test_split(X_train_scaled, y_train, test_size=0.6)

In [33]:
calc_score(linear_model.LinearRegression(), X_train_b, y_train_b, X_test_b, y_test_b)

1.152794877500961

In [34]:
calc_score(RandomForestRegressor(max_depth=None, oob_score=True), X_train_b, y_train_b, X_test_b, y_test_b)

(0.7098500011839814, 0.7881029119201564)

In [35]:
calc_score(RandomForestRegressor(max_depth=None, oob_score=True, criterion="absolute_error"), X_train_b, y_train_b, X_test_b, y_test_b)

(0.7104884159832807, 0.7871670113942217)

In [36]:
calc_score(RandomForestRegressor(max_depth=None, oob_score=True, criterion="poisson"), X_train_b, y_train_b, X_test_b, y_test_b)

(0.7140032994267561, 0.7760138234797364)

In [207]:
calc_score(svm.LinearSVR(), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.2211295173422594

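For the random forests, `calc_score` returns a tuple of the out-of-bag R² score and the held-out MSE. Random forest has the lowest held-out MSE so far (about 0.78, versus 1.15 for linear regression and 1.22 for the linear SVR), so for now I will submit to Kaggle using random forest.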

In [209]:
worker(RandomForestRegressor(max_depth=None), "rf.csv", X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

array([3.133309  , 2.57074427, 2.68455313, ..., 1.64507733, 1.623616  ,
       3.172862  ])

So between linear regression, a linear SVM, and random forest, RF is the best. Note, however, that RF cannot extrapolate beyond the range of band gaps seen in training, so it might not be the best option here. I will try optimizing SVMs, since this dataset is smaller than the one in the last HW.
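The extrapolation limit can be illustrated on a toy problem. This is a sketch on synthetic data (y = 2x), not on the band gap descriptors: each tree predicts an average of training targets, so the forest's output cannot exceed the largest target it was trained on, while a linear model follows the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy 1-D problem: y = 2x, with training inputs restricted to [0, 10]
X_fit = np.linspace(0, 10, 200).reshape(-1, 1)
y_fit = 2.0 * X_fit[:, 0]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)
lin = LinearRegression().fit(X_fit, y_fit)

# Query far outside the training range; the true value would be 40
X_far = np.array([[20.0]])
rf_far = rf.predict(X_far)[0]
lin_far = lin.predict(X_far)[0]

print(f"largest training target: {y_fit.max():.1f}")
print(f"random forest at x=20:   {rf_far:.2f}")
print(f"linear model at x=20:    {lin_far:.2f}")
```

Whether this matters for the competition depends on whether the test materials have band gaps outside the range covered by the training set.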

In [210]:
calc_score(svm.LinearSVR(max_iter=10000, C=1), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.216609871226856

In [211]:
calc_score(svm.LinearSVR(max_iter=10000, C=10), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.2170656185325868

In [212]:
calc_score(svm.LinearSVR(max_iter=10000, C=100), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.7660575937714196

In [213]:
calc_score(svm.SVR(kernel="linear", C=1, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

1.2166339185799946

In [214]:
calc_score(svm.SVR(kernel="rbf", C=1, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.9560429991532254

Note that the literature reports the following for a similar problem (https://pubs.acs.org/doi/full/10.1021/acs.jpclett.8b00124):

C and γ were optimized to 10 and 0.01 for SVR, respectively, while ϵ was set at 0.1
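To get a feel for what γ controls, the sketch below evaluates the RBF kernel, k(x, z) = exp(-γ‖x - z‖²), for two random standardized points with 132 features. The points are synthetic stand-ins for scaled descriptors, and the helper `rbf_kernel_value` is illustrative. It also prints 1/132, which is roughly what scikit-learn's default `gamma="scale"` (1 / (n_features * X.var())) evaluates to when the features have unit variance.

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


rng = np.random.default_rng(0)
# Two synthetic points with 132 standardized features each
x, z = rng.standard_normal((2, 132))

kernel_values = {g: rbf_kernel_value(x, z, g) for g in (0.001, 0.01, 1.0)}
for g, k in kernel_values.items():
    print(f"gamma={g}: k(x, z) = {k:.3e}")

# Approximate value of gamma="scale" for 132 unit-variance features
gamma_scale_estimate = 1.0 / 132
print(f"1/132 = {gamma_scale_estimate:.4f}")
```

With a large γ the kernel between distinct points is effectively zero, so the model can only memorize training points and otherwise falls back to a near-constant prediction. Since 1/132 is close to 0.01, setting γ = 0.01 explicitly should behave much like the default on standardized data.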

In [215]:
calc_score(svm.SVR(kernel="rbf", C=10, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.8657977947153491

In [216]:
calc_score(svm.SVR(kernel="rbf", C=10, gamma=0.01, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.8647272137025013

In [217]:
calc_score(svm.SVR(kernel="rbf", C=10, gamma=0.01, epsilon=0.1, cache_size=1000), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.8647272137025013

Let's actually optimize this:

In [221]:
C_2d_range = [0.01, 0.1, 1, 10, 100]
gamma_2d_range = [0.01, 0.1, 1, 10, 100]
classifiers = []  # tuples of (C, gamma, fitted SVR, held-out MSE); these are regressors despite the name
for C in C_2d_range:
    for gamma in gamma_2d_range:
        clf = svm.SVR(C=C, gamma=gamma, cache_size=1000)
        clf.fit(X_train_b, y_train_b)
        y_pred = clf.predict(X_test_b)
        mse = mean_squared_error(y_test_b, y_pred)
        classifiers.append((C, gamma, clf, mse))

In [222]:
classifiers

[(0.01, 0.01, SVR(C=0.01, cache_size=1000, gamma=0.01), 1.8236485113967649),
 (0.01, 0.1, SVR(C=0.01, cache_size=1000, gamma=0.1), 2.5585353692385726),
 (0.01, 1, SVR(C=0.01, cache_size=1000, gamma=1), 2.602948806597704),
 (0.01, 10, SVR(C=0.01, cache_size=1000, gamma=10), 2.6043677219436296),
 (0.01, 100, SVR(C=0.01, cache_size=1000, gamma=100), 2.604429664100975),
 (0.1, 0.01, SVR(C=0.1, cache_size=1000, gamma=0.01), 1.2154735920005613),
 (0.1, 0.1, SVR(C=0.1, cache_size=1000, gamma=0.1), 2.196404116055674),
 (0.1, 1, SVR(C=0.1, cache_size=1000, gamma=1), 2.532059427595707),
 (0.1, 10, SVR(C=0.1, cache_size=1000, gamma=10), 2.5471858403151155),
 (0.1, 100, SVR(C=0.1, cache_size=1000, gamma=100), 2.5477969447261923),
 (1, 0.01, SVR(C=1, cache_size=1000, gamma=0.01), 0.9437444438182191),
 (1, 0.1, SVR(C=1, cache_size=1000, gamma=0.1), 1.3781565019410396),
 (1, 1, SVR(C=1, cache_size=1000, gamma=1), 2.2712499225498193),
 (1, 10, SVR(C=1, cache_size=1000, gamma=10), 2.363289627231711),
 

Interesting: the best RBF-kernel SVR from this grid uses the same C and γ as the paper, but overall it is probably still not as good as RF.
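The manual double loop scores each (C, γ) pair on a single held-out split. An alternative, sketched here on small synthetic data, is `GridSearchCV`, which cross-validates every pair and can score with the competition metric through `scoring="neg_mean_absolute_error"`. The data, grid values, and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Small synthetic regression problem standing in for the scaled descriptors
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((120, 5))
y_demo = X_demo[:, 0] ** 2 + 0.1 * rng.standard_normal(120)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_mean_absolute_error",  # sklearn maximizes, hence the negated error
    cv=3,
)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_  # flip the sign back to a plain MAE
print(search.best_params_, best_mae)
```

Cross-validated scores are less sensitive to one lucky or unlucky split, at the cost of fitting each candidate several times.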

I will submit to make sure:

In [223]:
worker(svm.SVR(kernel="rbf", C=10, gamma=0.01, cache_size=1000), "svm.csv", X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

array([3.69489438, 2.24564259, 3.23115338, ..., 1.64694377, 1.82499312,
       3.13586865])

As suspected, the SVR submission did not score as well as RF.

Okay, so let's optimize RF hyperparameters:

In [18]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# note: 'auto' (all features, for regressors) was removed in scikit-learn 1.3; use 1.0 there
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
random_grid

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [True, False]}

In [27]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 80 different combinations, using 8 parallel jobs
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=80, cv=3, verbose=1, n_jobs=8)

In [28]:
# Fit the random search model
rf_random.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed: 31.1min
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed: 110.5min
[Parallel(n_jobs=8)]: Done 240 out of 240 | elapsed: 127.2min finished


RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=80, n_jobs=8,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   verbose=1)

In [29]:
rf_random.best_params_

{'n_estimators': 800,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 20,
 'bootstrap': False}
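For context on how much of the space the random search covered, the arithmetic below counts the combinations in the grid defined earlier (10 values of n_estimators, 2 of max_features, 12 of max_depth, 3 each of min_samples_split and min_samples_leaf, 2 of bootstrap) and compares that with the 80 sampled candidates. The variable names are illustrative.

```python
from math import prod

# Number of candidate values per hyperparameter in the random grid
grid_sizes = {
    "n_estimators": 10,
    "max_features": 2,
    "max_depth": 12,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

n_combinations = prod(grid_sizes.values())
n_sampled = 80  # n_iter passed to RandomizedSearchCV
fraction = n_sampled / n_combinations

print(f"{n_sampled} of {n_combinations} combinations ({fraction:.1%})")
```

Only a small fraction of the grid was evaluated, so the reported best parameters are a good starting region rather than a guaranteed optimum.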

In [30]:
calc_score(RandomForestRegressor(**rf_random.best_params_), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.7353321062429926

It does score better in MSE than the default RF (0.735 versus about 0.788), so let's submit it to Kaggle.

In [31]:
worker(RandomForestRegressor(**rf_random.best_params_), "rf.csv", X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

array([3.14690964, 2.97176267, 2.64893846, ..., 1.80630918, 1.70350869,
       3.08582956])

It also scored well on Kaggle, with an error just barely above that of the optimized solution. So there is still a little more to go.

I'll try to narrow the RF parameter search using a more comprehensive GridSearchCV around the parameters identified by RandomizedSearchCV.

In [32]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [30, 40, 50],
    'max_features': [11, 12],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [3, 5, 7],
    'n_estimators': [600, 1000]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=4, verbose=1)

In [33]:
grid_search.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  4.1min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 18.9min
[Parallel(n_jobs=4)]: Done 216 out of 216 | elapsed: 21.2min finished


GridSearchCV(cv=3, estimator=RandomForestRegressor(), n_jobs=4,
             param_grid={'bootstrap': [False], 'max_depth': [30, 40, 50],
                         'max_features': [11, 12], 'min_samples_leaf': [1, 2],
                         'min_samples_split': [3, 5, 7],
                         'n_estimators': [600, 1000]},
             verbose=1)

In [34]:
grid_search.best_params_

{'bootstrap': False,
 'max_depth': 50,
 'max_features': 12,
 'min_samples_leaf': 1,
 'min_samples_split': 7,
 'n_estimators': 1000}
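Two quick checks on this grid, written as a sketch: the `max_features` candidates 11 and 12 bracket what `'sqrt'` resolves to for 132 descriptors (scikit-learn truncates the square root to an integer), and `ParameterGrid` reproduces the candidate count reported in the log.

```python
import math

from sklearn.model_selection import ParameterGrid

n_features = 132  # descriptors D1-D132
sqrt_features = max(1, int(math.sqrt(n_features)))  # what max_features='sqrt' resolves to

# Same grid as passed to GridSearchCV above
param_grid = {
    "bootstrap": [False],
    "max_depth": [30, 40, 50],
    "max_features": [11, 12],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [3, 5, 7],
    "n_estimators": [600, 1000],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # cv=3 folds per candidate

print(sqrt_features, n_candidates, n_fits)
```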

In [37]:
rf_best = {'bootstrap': False,
         'max_depth': 50,
         'max_features': 12,
         'min_samples_leaf': 1,
         'min_samples_split': 7,
         'n_estimators': 1000}
calc_score(RandomForestRegressor(**rf_best), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.7493932918984666

In [36]:
worker(RandomForestRegressor(**grid_search.best_params_), "rf.csv", 
       X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

array([3.04866572, 2.82187237, 2.64893846, ..., 1.76767495, 1.67762381,
       3.20202313])

This scored slightly better on Kaggle than before, which was expected since the grid search refines the random-search parameters.

However, I'm still not below the error of the optimized solution. To get there, I probably need to use XGBoost or maybe an ensemble voting method.
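As a rough illustration of the voting idea, scikit-learn's `VotingRegressor` averages the predictions of several fitted regressors. The sketch below combines a random forest and an RBF SVR on synthetic data; the member models and their settings are assumptions, not tuned choices for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.svm import SVR

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.standard_normal(200)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
svr = SVR(kernel="rbf", C=10, gamma=0.01)

ensemble = VotingRegressor([("rf", rf), ("svr", svr)])
ensemble.fit(X_demo[:150], y_demo[:150])

y_avg = ensemble.predict(X_demo[150:])

# The ensemble output is the unweighted mean of the members' predictions
manual_avg = np.mean(
    [est.predict(X_demo[150:]) for est in ensemble.estimators_], axis=0
)
print(np.allclose(y_avg, manual_avg))
```

Averaging tends to help most when the member models make different kinds of errors, which is plausible but not guaranteed for a tree ensemble paired with a kernel method.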

In [38]:
import xgboost as xgb

Arbitrary starting parameters:

In [44]:
# declare parameters
xgb_params = {
            'objective':'reg:squarederror',
            'max_depth': 4,
            'alpha': 10,
            'learning_rate': 1.0,
            'n_estimators':100
            }         

In [45]:
calc_score(xgb.XGBRegressor(**xgb_params), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.9015005774850364

Using these arbitrary parameters, XGBoost scored a bit worse than RF, so it needs a lot of optimization:

In [46]:
# import packages for hyperparameters tuning
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

In [99]:
space = {'max_depth': hp.quniform('max_depth', 3, 80, 1),
        'gamma': hp.uniform ('gamma', 1, 9),
        'reg_alpha' : hp.quniform('reg_alpha', 10, 180, 1),
        'reg_lambda' : hp.uniform('reg_lambda', 0, 1),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.5, 1),
        'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
        'n_estimators' : 1000
        #'n_estimators': hp.choice('n_estimators', [i for i in range(50, 450, 50)])
        }

In [100]:
def objective(space):
    clf=xgb.XGBRegressor(
            n_estimators=space['n_estimators'], max_depth=int(space['max_depth']), gamma=space['gamma'],
            reg_alpha=int(space['reg_alpha']), reg_lambda=space['reg_lambda'],
            min_child_weight=int(space['min_child_weight']),
            # colsample_bytree is a fraction in (0, 1]; int() would truncate it to 0
            colsample_bytree=space['colsample_bytree'], eval_metric="rmse", 
            early_stopping_rounds=10)
    
    # early stopping monitors the last entry of the evaluation list (the held-out split)
    evaluation = [( X_train_b, y_train_b), ( X_test_b, y_test_b)]
    
    clf.fit(X_train_b, y_train_b,
            eval_set=evaluation,
            verbose=False)
    
    y_pred = clf.predict(X_test_b)
    mse = mean_squared_error(y_test_b, y_pred)
    #print ("MSE SCORE:", mse)
    # hyperopt minimizes the loss, so return the MSE directly
    return {'loss': mse, 'status': STATUS_OK }

In [101]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 100,
                        trials = trials)

  0%|                                                        | 0/100 [00:00<?, ?trial/s, best loss=?]

XGBoostError: [22:49:38] /home/conda/feedstock_root/build_artifacts/xgboost-split_1667849527992/work/src/metric/metric.cc:49: Unknown metric function neg_mean_squared_error

Note: this error appears to come from an earlier version of the cell that passed eval_metric="neg_mean_squared_error", which is a scikit-learn scoring name rather than an XGBoost metric. The objective shown above uses eval_metric="rmse", and the hyperparameters printed below appear to be from a previous successful run.


In [89]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  

{'colsample_bytree': 0.8100351590645394, 'gamma': 1.013919903193054, 'max_depth': 79.0, 'min_child_weight': 4.0, 'reg_alpha': 12.0, 'reg_lambda': 0.5921827348994315}
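Note that `hp.quniform` returns floats (for example `max_depth` comes back as 79.0), so integer-valued parameters need a cast before being passed to the model, while fractional ones must be left alone. A minimal sketch of such a cast follows; the helper name `cast_hyperopt_params` and the choice of which keys count as integers are assumptions.

```python
# Keys treated as integer-valued (an assumption for this sketch)
INTEGER_PARAMS = {"max_depth", "min_child_weight", "n_estimators"}


def cast_hyperopt_params(params):
    """Return a copy with integer-valued entries cast to int, others untouched."""
    return {k: int(v) if k in INTEGER_PARAMS else v for k, v in params.items()}


# The dictionary printed by the hyperopt run above
raw_params = {
    "colsample_bytree": 0.8100351590645394,
    "gamma": 1.013919903193054,
    "max_depth": 79.0,
    "min_child_weight": 4.0,
    "reg_alpha": 12.0,
    "reg_lambda": 0.5921827348994315,
}

clean_params = cast_hyperopt_params(raw_params)
print(clean_params)

# Casting a fraction with int() truncates it to zero, which is why
# parameters such as colsample_bytree are excluded from the cast
truncated = int(raw_params["colsample_bytree"])
print(truncated)
```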


In [90]:
# from first hyperopt run but adjusted n_estimators
hyperopt_best = {'colsample_bytree': 0.74335034617887, 
                 'gamma': 1.0010259920525033, 
                 'max_depth': 13,
                 #'max_depth': 50,
                 'min_child_weight': 2, 
                 'n_estimators': 1000, 
                 'reg_alpha': 10, 
                 'reg_lambda': 0.5332326318817571}

In [91]:
calc_score(xgb.XGBRegressor(**hyperopt_best), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.7997002717916959

In [93]:
# test with 800 estimators and a larger depth range
hyperopt_best = {'colsample_bytree': 0.8100351590645394, 
                 'gamma': 1.013919903193054, 
                 'max_depth': 79, 
                 'min_child_weight': 4, 
                 'n_estimators': 800, 
                 'reg_alpha': 12, 
                 'reg_lambda': 0.5921827348994315}

In [94]:
calc_score(xgb.XGBRegressor(**hyperopt_best), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.801414287113136

In [95]:
# from first hyperopt with more depth and n_estimators
hyperopt_best = {'colsample_bytree': 0.74335034617887, 
                 'gamma': 1.0010259920525033, 
                 'max_depth': 79,
                 'min_child_weight': 2, 
                 'n_estimators': 1000, 
                 'reg_alpha': 10, 
                 'reg_lambda': 0.5332326318817571}

In [96]:
calc_score(xgb.XGBRegressor(**hyperopt_best), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.7919489474587308

So overall...

In [98]:
worker(xgb.XGBRegressor(**hyperopt_best), "xgb.csv", 
       X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

array([3.3196208, 2.7084322, 2.697912 , ..., 1.7753465, 1.3821244,
       2.6831644], dtype=float32)

Trained on the full dataset, this didn't score as well as the optimized random forest; it could probably do better with more hyperparameter optimization.

Let's try going back to sklearn grids:

In [119]:
# Create the random grid
random_grid = {'max_depth': [3, 5, 6, 10, 15, 20],
               'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
               'subsample': np.arange(0.3, 1.0, 0.1),
               'colsample_bytree': np.arange(0.3, 1.0, 0.1),
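               # note: colsample_bylevel must lie in (0, 1]; the values above 1 generated here
               # are invalid for XGBoost, so candidates that sample them will fail to fit
               'colsample_bylevel': np.arange(0.4, 1.4, 0.1),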
               'n_estimators': [100, 500, 1000, 1250, 1500],
               }

random_grid

{'max_depth': [3, 5, 6, 10, 15, 20],
 'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
 'subsample': array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
 'colsample_bytree': array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
 'colsample_bylevel': array([0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
 'n_estimators': [100, 500, 1000, 1250, 1500]}
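XGBoost documents the column-subsampling fractions (`colsample_bytree`, `colsample_bylevel`) as lying in (0, 1], so the entries 1.1, 1.2 and 1.3 in the printed grid are out of range. The sketch below shows one way to build the candidate list so that only valid fractions remain, and also removes the floating-point residue that `np.arange` leaves behind (such as 0.8999999999999999). The variable names are illustrative.

```python
import numpy as np

# Same call as in the grid above; a float step accumulates rounding error
raw = np.arange(0.4, 1.4, 0.1)

# Round to one decimal, then keep only valid sampling fractions in (0, 1]
candidates = np.round(raw, 1)
valid = candidates[(candidates > 0) & (candidates <= 1.0)]

print(valid)
```

Filtering before the search means none of the `n_iter` samples are spent on candidates that cannot be fit.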

In [112]:
from sklearn.model_selection import RandomizedSearchCV
xgbr = xgb.XGBRegressor()
rand_xgbr = RandomizedSearchCV(estimator=xgbr,
                               param_distributions=random_grid,
                               scoring='neg_mean_squared_error',
                               n_iter=80,
                               n_jobs=4,
                               cv=3, 
                               verbose=1)

In [113]:
rand_xgbr.fit(X_train_b, y_train_b)

Fitting 3 folds for each of 80 candidates, totalling 240 fits


In [114]:
print("Best parameters:", rand_xgbr.best_params_)
print("Lowest MSE: ", (-rand_xgbr.best_score_))

Best parameters: {'subsample': 0.5, 'n_estimators': 1000, 'max_depth': 10, 'learning_rate': 0.01, 'colsample_bytree': 0.5, 'colsample_bylevel': 0.8999999999999999}
Lowest MSE:  0.7395940808062352


In [115]:
calc_score(xgb.XGBRegressor(**rand_xgbr.best_params_), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.7155404299898452

In [116]:
# randomgrid combined with hyperopt results
best = {'subsample': 0.5, 
        'n_estimators': 1000, 
        'max_depth': 10, 
        'learning_rate': 0.01, 
        'colsample_bytree': 0.5, 
        'colsample_bylevel': 0.8999999999999999,
         # hyperopt
         'gamma': 1.0010259920525033, 
         'max_depth': 79,  # duplicate key: this overrides the max_depth of 10 above (the last one wins)
         'min_child_weight': 2, 
         'reg_alpha': 10, 
         'reg_lambda': 0.5332326318817571
        }

In [117]:
calc_score(xgb.XGBRegressor(**best), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.7639227945318532

Let's go with the random-search parameters only for now.

In [118]:
worker(xgb.XGBRegressor(**rand_xgbr.best_params_), "xgb.csv", 
       X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

array([3.0502646, 2.9054544, 2.6375864, ..., 1.8054647, 1.8178905,
       3.1208353], dtype=float32)

This didn't score better than my optimized RF. Maybe it will if I run the search on the full dataset.

In [None]:
# using full dataset and slightly updated randomgrid params (accounting for edges of previous best)
rand_xgbr.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 80 candidates, totalling 240 fits


In [None]:
print("Best parameters:", rand_xgbr.best_params_)
print("Lowest MSE: ", (-rand_xgbr.best_score_))

In [None]:
# this will score too well, since the held-out split was included in the hyperparameter search
calc_score(xgb.XGBRegressor(**rand_xgbr.best_params_), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

In [None]:
worker(xgb.XGBRegressor(**rand_xgbr.best_params_), "xgb.csv", 
       X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)

So...

Next, try further optimization of xgb using gridsearch.

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'subsample': [0.4, 0.5, 0.6], 
    'n_estimators': [1000], 
    'max_depth': [8, 10, 12], 
    # assumed intent: bracket the best learning rate (0.01); the original upper value was 1
    'learning_rate': [0.001, 0.01, 0.1], 
    'colsample_bytree': [0.3, 0.5, 0.7], 
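    # assumed intent: bracket the best value (0.9) within the valid range (0, 1]; the original listed 0.11
    'colsample_bylevel': [0.7, 0.9, 1.0],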
    }
xgbr = xgb.XGBRegressor()
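# Instantiate the grid search model; score by negative MSE so that -best_score_ is an MSE
# (the default scoring for regressors is R^2)
xgbr_grid_search = GridSearchCV(estimator=xgbr, param_grid=param_grid, scoring='neg_mean_squared_error',
                                cv=3, n_jobs=4, verbose=1)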

In [None]:
xgbr_grid_search.fit(X_train_scaled, y_train)

In [None]:
print("Best parameters:", xgbr_grid_search.best_params_)
print("Lowest MSE: ", (-xgbr_grid_search.best_score_))

In [None]:
worker(xgb.XGBRegressor(**xgbr_grid_search.best_params_), "xgb.csv", 
       X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled)