---

### Sklearn datasets

```q
skds:.p.import`sklearn.datasets
bc:skds`:load_breast_cancer
bh:skds`:load_boston  / removed in scikit-learn 1.2, so this needs an older release
ir:skds`:load_iris
```

---

### Comparison version 1

This version runs the standard grid, random and Sobol searches, and relies on the cross-validation built into `BayesSearchCV` for the Bayesian search.

```q
// load xv library
\l xval_updated.q

// import data
d:k!.p.import[`sklearn.datasets][`:load_iris;<][0]k:`data`target

// import python libraries for bayesian search
bsCV:.p.import[`skopt]`:BayesSearchCV
re  :.p.import[`skopt.space]`:Real
cat :.p.import[`skopt.space]`:Categorical
SGD :.p.import[`sklearn.linear_model]`:SGDClassifier

// BayesSearchCV wrapper: split off a holdout set of size sz, search with
// k-fold cv over n iterations, then score the best model on the holdout set
.ml.bs.bsCV:{[d;clf;hp;k;n;seed;sz]
 data:.ml.traintestsplit[;;sz]. d;
 opt:.p.import[`skopt][`:BayesSearchCV][clf;hp;`cv pykw k;`n_iter pykw n;`random_state pykw seed];
 r:(enlist[`random_state]!enlist seed),opt[`:fit][data`xtrain;data`ytrain][`:best_params_]`;
 (r;opt[`:score][data`xtest;data`ytest]`)}

// set cross validation parameters and scoring function
p:`k`n`test!(5;3;0.2)
s:.ml.xv.fitscore a:{.p.import[`sklearn.linear_model]`:SGDClassifier}
trials:1024

// Bayesian search hyperparameter space
bsCV_hp:`average`l1_ratio`alpha!
  (cat[01b]`;re[0;1;`prior pykw"uniform"]`;re[.00001;100;`prior pykw"log-uniform"]`)

// grid hyperparameter space
gs_01_gen:{((0.,(1_til x-1)*10%x-1),10.)%10}
gshp:`random_state`average`l1_ratio`alpha!(42;01b;gs_01_gen 16;xexp[10](gs_01_gen[32]*7)-5)

// random hyperparameter space and random/sobol parameters
rshp:`average`l1_ratio`alpha!(`boolean;(`uniform;0;1;"f");(`loguniform;-5;2;"f"))
prdm:`typ`random_state`n`p!(`random;45;trials;rshp)
psbl:`typ`random_state`n`p!(`sobol ;72;trials;rshp)

// set random seed
\S 24

// run bayesian, grid, random and sobol search
\ts show res_bs_bsCV:.ml.bs.bsCV[value d;SGD[];bsCV_hp;5;3;42;.2]
\ts show res_gs :-2#.ml.gs.kfsplit[p`k;p`n;d`data;d`target;s;gshp;p`test]
\ts show res_rdm:-2#.ml.rs.kfsplit[p`k;p`n;d`data;d`target;s;prdm;p`test]
\ts show res_sbl:-2#.ml.rs.kfsplit[p`k;p`n;d`data;d`target;s;psbl;p`test]
```
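
The three search spaces above are intended to cover the same ranges: `l1_ratio` between 0 and 1, and `alpha` between 1e-5 and 1e2 on a log scale. As a rough illustration in plain q (no ML toolkit or embedPy needed), the sketch below rebuilds the grid axes and draws a few random points. It assumes that `uniform` means a flat draw between the two bounds and that the `loguniform` bounds are powers of 10. The names `linspace01`, `l1grid`, `agrid` and `draw` are hypothetical and not part of any library.

```q
// n evenly spaced points on [0;1]; same points as gs_01_gen, up to rounding
linspace01:{[n](til n)%n-1}

// grid axes: 16 values of l1_ratio, 32 values of alpha from 1e-5 to 1e2
l1grid:linspace01 16
agrid :10 xexp(7*linspace01 32)-5

// full grid size: 1 random_state x 2 average x 16 l1_ratio x 32 alpha
show prd 1 2,count each(l1grid;agrid)    / 1024, equal to the number of trials

// assumed meaning of the random-search specs: a boolean, a uniform draw
// on [0;1) and 10 raised to a uniform exponent on [-5;2)
draw:{[n]`average`l1_ratio`alpha!(n?0b;n?1f;10 xexp(n?7f)-5)}
show flip draw 5
```

With 1024 grid points and `trials:1024`, the grid, random and Sobol searches each evaluate the same number of parameter sets.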

---

### Comparison version 2

This version uses the same grid, random and Sobol search methods. For the Bayesian search, the data is explicitly split into a holdout set and a training set with k folds, and the classifier is run on each fold.

```q
// load new xv and bayesian functions
\l xval_updated.q

// python imports for bayesian search
re :.p.import[`skopt.space]`:Real
cat:.p.import[`skopt.space]`:Categorical
SGD:.p.import[`sklearn.linear_model]`:SGDClassifier

// set random seed
\S 24
seed:24

// set xv params
p:`seed`k`n`test`trials!(24;5;1;.2;1024)

// import iris dataset
d:k!.p.import[`sklearn.datasets][`:load_iris;<][0]k:`data`target

// split data for bayesian in same way as xv/gs/rs - 5 fold cv + 20% holdout
holdout  :.ml.traintestsplit[;;p`test]. value d;
splitdata:raze(.ml.xv.i.idxR . .ml.xv.i`splitidx`groupidx)[p`k;p`n]. holdout`xtrain`ytrain

// gs scoring func
s:.ml.xv.fitscore a:{.p.import[`sklearn.linear_model]`:SGDClassifier}

// bayesian search CV
.ml.bs.bsCV:{[clf;kfolds;hld;hp;seed]
 // optimizer
 opt:.p.import[`skopt][`:BayesSearchCV][clf[];hp;`random_state pykw seed];
 // find best parameter set for each fold
 r:{bst:(x[`:fit]. first y[])[`:best_params_]`;(bst;(x[`:score]. last y[])`)}[opt]each kfolds;
 // find best parameter set across k folds
 best:(enlist[`random_state]!enlist seed),r[;0].ml.imax r[;1];
 // train the classifier passed in as clf on the entire set of k folds
 clf:(clf pykwargs@)best;clf[`:fit]. hld`xtrain`ytrain;
 // test on holdout set
 (best;(clf[`:score]. hld`xtest`ytest)`)}

// grid hp
gs_01_gen:{((0.,(1_til x-1)*10%x-1),10.)%10}
gshp:`random_state`average`l1_ratio`alpha!(seed;01b;gs_01_gen 16;xexp[10](gs_01_gen[32]*7)-5)

// random/sobol hp
rshp:`average`l1_ratio`alpha!(`boolean;(`uniform;0;1;"f");(`loguniform;-5;2;"f"))
prdm:`typ`random_state`n`p!(`random;p`seed;p`trials;rshp)
psbl:`typ`random_state`n`p!(`sobol ;p`seed;p`trials;rshp)

// bayesian search CV hp
bsCV_hp:`average`l1_ratio`alpha!
  (cat[01b]`;re[0;1;`prior pykw"uniform"]`;re[.00001;100;`prior pykw"log-uniform"]`)

// run bayesian, grid, random and sobol search
\ts show res_bscv:.ml.bs.bsCV[SGD;splitdata;holdout;bsCV_hp;42]
\ts show res_gs :-2#.ml.gs.kfsplit[p`k;p`n;d`data;d`target;s;gshp;p`test]
\ts show res_rdm:-2#.ml.rs.kfsplit[p`k;p`n;d`data;d`target;s;prdm;p`test]
\ts show res_sbl:-2#.ml.rs.kfsplit[p`k;p`n;d`data;d`target;s;psbl;p`test]
```
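
To make the split described above concrete, the following plain-q sketch builds a 20% holdout set and five train/validation index pairs from the remaining rows. It only illustrates the idea: it shuffles the rows first, and it is not claimed to reproduce the indices produced by `.ml.traintestsplit` or the `.ml.xv.i` helpers. All names in it are hypothetical.

```q
n:150; k:5; test:.2            / rows in iris, number of folds, holdout fraction

idx :neg[n]?n                  / random permutation of the row indices
nh  :"j"$test*n                / number of holdout rows (30)
hold:nh#idx                    / holdout indices, never seen during the search
rest:nh _ idx                  / remaining 120 indices used for cross validation

folds:(k;0N)#rest              / k validation folds of 24 indices each
// one (train;validation) pair per fold: train on the other k-1 folds
pairs:{[f;i](raze f _ i;f i)}[folds]each til k

show(count each)each pairs     / 96 24 for every fold
```

Every row outside the holdout set is used for validation exactly once, which is what allows a score from each fold to be compared before the final fit on all 120 training rows.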

---

### Additional functionality

I also investigated `hyperopt` as a method of hyperparameter optimization. It could not be run and scored in the same way as the other methods, so it has been omitted from the scripts above.

```q
// load python script, expected to define the find_best function used below
\l ../bayesian/script3.p

// hyperopt search: find best parameters on the training set,
// then fit and score a model on a 20% holdout set
.ml.bs.hyperopt:{[d;clf;hp]
  data:.ml.traintestsplit[;;.2]. d;
  hp,:`X`y`mdl!(data`xtrain;data`ytrain;clf`.);
  best:.p.get[`find_best][hp]`;
  mdl:(clf pykwargs@)best;
  mdl[`:fit]. data`xtrain`ytrain;
  (best;mdl[`:score][data`xtest;data`ytest]`)}

// hyperopt hp, keyed by sampling method
hyperopt_hp:`choice`uniform`loguniform!
 (enlist[`average]!enlist 01b;enlist[`l1_ratio]!enlist 0 1;enlist[`alpha]!enlist .00001 100)

\ts show res_bs_hyperopt:.ml.bs.hyperopt[value d;SGD;hyperopt_hp]
```

---

```q
\l hpopt_results.q
```

```q
fin  :"../random/test_data.csv"  / works with any csv containing a kdb table
fout :"results"   / works with any name
dtype:"FFFFFIB"
targ :`x6
param:`seed`k`n`test`trials!(42;5;1;.2;1024)  / current working parameters
.ml.save_hpopt_res[fin;fout;dtype;targ;param];
```

```
Running comparison
Plotting results
Saving results
Comparison complete, see outputs/
```
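
The `dtype` string presumably follows q's usual CSV convention of one type character per column: here five floats (`F`), one int (`I`) and one boolean (`B`), with `targ` naming the last of the seven columns. How `.ml.save_hpopt_res` actually consumes it is not shown here, so the sketch below only illustrates the standard `0:` load on a made-up three-column CSV held in memory. The names `lines` and `tab` are hypothetical.

```q
// a small CSV as a list of strings (header row plus two records); made-up data
lines:("x0,x1,x2";"1.5,3,1";"2.5,4,0")

// one type character per column; enlist"," treats the first row as column names
tab:("FIB";enlist",")0:lines

show tab
show meta tab     / column types f, i and b
```

A type string that is shorter or longer than the number of columns will not load the table as intended, so `dtype` has to be adjusted whenever a different CSV is used.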


```q
.ml.load_hpopt_res["outputs/files/";"results.txt"]
```

```
alpha        average fold_score                            l1_ratio  method  ..
-----------------------------------------------------------------------------..
0.02438354   0       1       0.99375 0.58125 1       1     0         grid    ..
0.0003729849 0       1       0.99375 0.8125  1       1     0.4863586 random  ..
0.0001647392 0       1       0.975   0.95    1       1     0.3925781 sobol   ..
0.01957704   0       0.93125 0.975   0.9375  0.98125 0.925 0.9920007 bayesian..
```
