# mlr3: Nested Resampling

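This series covers four topics; this post is the second part.

- Model tuning
- Tuning hyperparameters
    - Method 1: train the model with `TuningInstanceSingleCrit` and `Tuner`
    - Method 2: train the model with `AutoTuner`
    - Ways to define the hyperparameter search space
    - Parameter dependencies
- Nested resampling
    - Running nested resampling
    - Evaluating the model
    - Applying the hyperparameters to the model
- Hyperband tuning
- Feature selection
    - Filters
    - Calculating scores
    - Calculating variable importance
    - Wrapper methods
    - Automatic selection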

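## Nested resampling
Nested resampling wraps an inner resampling loop inside an outer one. The inner loop is used only to tune hyperparameters; the outer loop evaluates the tuned learner on data that played no part in the tuning. This keeps the performance estimate from being optimistically biased by the tuning itself and makes it more stable.

The underlying concepts are not covered in detail here; readers unfamiliar with them can look them up separately.

The figure below may help: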

<image src="./images/嵌套重抽样.webp">
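To make the nesting concrete, the following base-R sketch builds only the fold assignments, without fitting any model. It assumes 768 observations (the size of the `pima` task used below), 3 outer folds and 4 inner folds. The variable names are illustrative and are not part of mlr3.

```r
# Sketch of how nested resampling partitions the data (no model fitting).
set.seed(1)
n <- 768L            # assumed number of rows (size of the pima task)
outer_folds <- 3L
inner_folds <- 4L

# Randomly assign each row to one outer fold
outer_id <- sample(rep_len(seq_len(outer_folds), n))

sizes <- do.call(rbind, lapply(seq_len(outer_folds), function(k) {
  outer_train <- which(outer_id != k)  # used for tuning and the final fit
  outer_test <- which(outer_id == k)   # used only to score the tuned model
  # Inner folds are drawn from the outer training set only
  inner_id <- sample(rep_len(seq_len(inner_folds), length(outer_train)))
  inner_valid <- outer_train[inner_id == 1L]  # first inner validation set
  data.frame(
    outer_iter = k,
    outer_train = length(outer_train),
    outer_test = length(outer_test),
    inner_train = length(outer_train) - length(inner_valid),
    inner_valid = length(inner_valid),
    overlap = length(intersect(inner_valid, outer_test))
  )
}))
print(sizes)
```

The `overlap` column counts rows shared between an inner validation set and the outer test set. It is zero by construction, which is the property that keeps the outer estimate independent of the tuning.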

## Running nested resampling
The inner loop uses 4-fold cross-validation:

In [2]:
rm(list = ls())


In [3]:
library(mlr3verse)
library(mlr3tuning)


Loading required package: mlr3

Loading required package: paradox



In [4]:
learner <- lrn("classif.rpart")
resampling <- rsmp("cv", folds = 4)
measure <- msr("classif.ce")
search_space <- ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator <- trm("evals", n_evals = 5)
tuner <- tnr("grid_search", resolution = 10)


In [7]:
args(AutoTuner$new)


In [8]:
at <- AutoTuner$new(
    tuner = tuner,
    learner = learner,
    resampling = resampling,
    measure = measure,
    terminator = terminator,
    search_space = search_space
)
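A note on the tuning budget: as far as I understand, grid search with `resolution = 10` lays out 10 equally spaced values across the range of `cp`, while the terminator stops after 5 evaluations. Only about half of the grid is therefore tried in each tuning run. The base-R sketch below reproduces the candidate values under that assumption; it does not call mlr3.

```r
# Candidate cp values, assuming an equally spaced grid of 10 points
lower <- 0.001
upper <- 0.1
resolution <- 10L
n_evals <- 5L

grid <- seq(lower, upper, length.out = resolution)
print(round(grid, 3))

# Number of configurations actually evaluated under the evaluation budget
evaluated <- min(n_evals, length(grid))
print(evaluated)
```

If every grid point should be evaluated, the evaluation budget needs to be at least as large as the grid.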


The outer loop uses 3-fold cross-validation:

In [9]:
task <- tsk("pima")
outer_resampling <- rsmp("cv", folds = 3)

rr <- resample(task,
    at,
    outer_resampling,
    store_models = TRUE
)


INFO  [09:23:57.280] [mlr3] Applying learner 'classif.rpart.tuned' on task 'pima' (iter 1/3)
INFO  [09:23:57.348] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerBatchGridSearch>' and '<TerminatorEvals> [n_evals=5, k=0]'
INFO  [09:23:57.384] [bbotk] Evaluating 1 configuration(s)
INFO  [09:23:57.398] [mlr3] Running benchmark with 4 resampling iterations
INFO  [09:23:57.403] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/4)
INFO  [09:23:57.426] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 2/4)
INFO  [09:23:57.439] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 3/4)
INFO  [09:23:57.451] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 4/4)
INFO  [09:23:57.464] [mlr3] Finished benchmark
INFO  [09:23:57.486] [bbotk] Result of batch 1:
INFO  [09:23:57.490] [bbotk]  0.001  0.2773438        0      0            0.033
INFO  [09:23:57.490] [bbotk]                                 uhash
INFO  [09:23:57.490] [bbotk]  438b9456-ef2d

The dataset used here is small. For larger datasets, parallelization can be used, which will be covered later.

## Evaluating the model
Extract the performance of the inner resampling:

In [10]:
rr


<ResampleResult> with 3 resampling iterations
    pima classif.rpart.tuned            cv         1        0      0
    pima classif.rpart.tuned            cv         2        0      0
    pima classif.rpart.tuned            cv         3        0      0

In [11]:
as.data.table(rr)


task,learner,resampling,iteration,prediction
<list>,<list>,<list>,<int>,<list>
<environment: 0x558051dbfbc8>,<environment: 0x558053033e60>,<environment: 0x558051fc6890>,1,<environment: 0x5580570b4208>
<environment: 0x558051dbfbc8>,<environment: 0x558055d84250>,<environment: 0x558051fc6890>,2,<environment: 0x558056c603f8>
<environment: 0x558051dbfbc8>,<environment: 0x5580538ff400>,<environment: 0x558051fc6890>,3,<environment: 0x558052cb8e78>


In [12]:
extract_inner_tuning_results(rr)


iteration,cp,classif.ce,learner_param_vals,x_domain,task_id,learner_id,resampling_id
<int>,<dbl>,<dbl>,<list>,<list>,<chr>,<chr>,<chr>
1,0.034,0.2851562,"0.000, 0.034",0.034,pima,classif.rpart.tuned,cv
2,0.023,0.2382812,"0.000, 0.023",0.023,pima,classif.rpart.tuned,cv
3,0.034,0.2636719,"0.000, 0.034",0.034,pima,classif.rpart.tuned,cv


Extract the archives of the inner resampling:

In [13]:
extract_inner_tuning_archives(rr)


iteration,cp,classif.ce,x_domain_cp,runtime_learners,timestamp,batch_nr,warnings,errors,resample_result,task_id,learner_id,resampling_id
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>,<int>,<int>,<int>,<list>,<chr>,<chr>,<chr>
1,0.056,0.2910156,0.056,0.024,2024-09-01 09:23:58,1,0,0,<environment: 0x558051060038>,pima,classif.rpart.tuned,cv
1,0.089,0.2910156,0.089,0.024,2024-09-01 09:23:58,2,0,0,<environment: 0x558051097f30>,pima,classif.rpart.tuned,cv
1,0.067,0.2910156,0.067,0.025,2024-09-01 09:23:58,3,0,0,<environment: 0x5580510bfaf0>,pima,classif.rpart.tuned,cv
1,0.034,0.2851562,0.034,0.031,2024-09-01 09:23:58,4,0,0,<environment: 0x5580510f3900>,pima,classif.rpart.tuned,cv
1,0.1,0.3046875,0.1,0.025,2024-09-01 09:23:58,5,0,0,<environment: 0x558050c53850>,pima,classif.rpart.tuned,cv
2,0.023,0.2382812,0.023,0.026,2024-09-01 09:23:58,1,0,0,<environment: 0x558051201090>,pima,classif.rpart.tuned,cv
2,0.067,0.2597656,0.067,0.026,2024-09-01 09:23:58,2,0,0,<environment: 0x55805121fc60>,pima,classif.rpart.tuned,cv
2,0.089,0.2597656,0.089,0.024,2024-09-01 09:23:58,3,0,0,<environment: 0x5580512385a0>,pima,classif.rpart.tuned,cv
2,0.001,0.25,0.001,0.027,2024-09-01 09:23:58,4,0,0,<environment: 0x55805124d500>,pima,classif.rpart.tuned,cv
2,0.056,0.2421875,0.056,0.026,2024-09-01 09:23:58,5,0,0,<environment: 0x558051269e30>,pima,classif.rpart.tuned,cv


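Note that this differs from the result above. The archive lists every configuration that was evaluated, five per outer iteration, because the terminator was set to `n_evals = 5`. The previous table shows only the best configuration for each outer iteration.

View the performance on the outer resampling: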
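The relationship between the two tables can be checked by hand. The sketch below retypes the `cp` and `classif.ce` columns of the archive shown above for outer iterations 1 and 2, then picks the row with the lowest error in each iteration. It uses base R only; the data frame is a manual copy of the printed values, not an object returned by mlr3.

```r
# Archive values copied from the printed output above (iterations 1 and 2)
archive <- data.frame(
  iteration = rep(1:2, each = 5),
  cp = c(0.056, 0.089, 0.067, 0.034, 0.100,
         0.023, 0.067, 0.089, 0.001, 0.056),
  classif.ce = c(0.2910156, 0.2910156, 0.2910156, 0.2851562, 0.3046875,
                 0.2382812, 0.2597656, 0.2597656, 0.2500000, 0.2421875)
)

# For each outer iteration, keep the configuration with the lowest inner error
best <- do.call(rbind, lapply(split(archive, archive$iteration), function(d) {
  d[which.min(d$classif.ce), ]
}))
print(best)
```

The selected rows match the first two rows of the inner tuning results table.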

In [14]:
rr$score()


task,task_id,learner,learner_id,resampling,resampling_id,iteration,prediction,classif.ce
<list>,<chr>,<list>,<chr>,<list>,<chr>,<int>,<list>,<dbl>
<environment: 0x558051dbfbc8>,pima,<environment: 0x55804ddc8c48>,classif.rpart.tuned,<environment: 0x558051fc6890>,cv,1,<environment: 0x558053400110>,0.2421875
<environment: 0x558051dbfbc8>,pima,<environment: 0x5580564b8a08>,classif.rpart.tuned,<environment: 0x558051fc6890>,cv,2,<environment: 0x558056b59e40>,0.2617188
<environment: 0x558051dbfbc8>,pima,<environment: 0x5580542f0848>,classif.rpart.tuned,<environment: 0x558051fc6890>,cv,3,<environment: 0x558053e49e20>,0.28125


View the average performance:

In [16]:
rr$aggregate()
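The output of this cell is not shown above. Assuming the measure is aggregated as the plain mean over the outer folds (which I believe is the default for `classif.ce`), the value can be reproduced from the three scores printed by `rr$score()`. This is a base-R check, not output from the original run.

```r
# Outer-fold classification errors copied from the rr$score() output above
outer_ce <- c(0.2421875, 0.2617188, 0.28125)

# Mean over the outer folds, assuming macro-averaging
mean_ce <- mean(outer_ce)
print(round(mean_ce, 7))
```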


## Applying the hyperparameters to the model

In [17]:
at$train(task)


INFO  [09:30:28.543] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerBatchGridSearch>' and '<TerminatorEvals> [n_evals=5, k=0]'
INFO  [09:30:28.549] [bbotk] Evaluating 1 configuration(s)
INFO  [09:30:28.554] [mlr3] Running benchmark with 4 resampling iterations
INFO  [09:30:28.558] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/4)
INFO  [09:30:28.572] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 2/4)
INFO  [09:30:28.585] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 3/4)
INFO  [09:30:28.598] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 4/4)
INFO  [09:30:28.611] [mlr3] Finished benchmark
INFO  [09:30:28.630] [bbotk] Result of batch 1:
INFO  [09:30:28.632] [bbotk]  0.023   0.265625        0      0            0.025
INFO  [09:30:28.632] [bbotk]                                 uhash
INFO  [09:30:28.632] [bbotk]  7c5784d8-9170-4d22-90c9-b4e6265ab045
INFO  [09:30:28.635] [bbotk] Evaluating 1 configuration(s)
INFO  [09:

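The model can now be applied to new data.

There is also a shorthand for the whole procedure. Note that, at the time of writing, it required the GitHub version of `mlr3tuning`; the CRAN version had a bug, and it is unclear whether that has since been fixed: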
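As a minimal sketch of what applying the model to new data could look like, the code below rebuilds the same `AutoTuner`, trains it on the full task, and predicts with `predict_newdata()`. The "new" observations are simply the first five rows of `pima` with the target column dropped, which is an assumption made for illustration; in practice they would come from data the model has not seen. Logging is turned down only to keep the output short.

```r
library(mlr3)
library(mlr3tuning)

# Reduce console logging for this example
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

task <- tsk("pima")

# Same configuration as the AutoTuner defined earlier
at <- AutoTuner$new(
  tuner = tnr("grid_search", resolution = 10),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 4),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),
  search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
)

# Tune on the full task, then refit with the best cp found
at$train(task)
print(at$tuning_result)

# Stand-in for new data: feature columns of the first five rows (illustrative)
new_data <- task$data(rows = 1:5, cols = task$feature_names)
pred <- at$predict_newdata(new_data)
print(pred)
```

Because the target column is absent from `new_data`, the prediction contains no ground truth to score against; it only returns the predicted classes.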

In [21]:
rr1 <- tune_nested(
    tuner = tnr("grid_search", resolution = 10),
    task = task,
    learner = learner,
    inner_resampling = resampling,
    outer_resampling = outer_resampling,
    measure = measure,
    term_evals = 20,
    search_space = search_space
)


INFO  [09:33:27.913] [mlr3] Applying learner 'classif.rpart.tuned' on task 'pima' (iter 1/3)
INFO  [09:33:27.959] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerBatchGridSearch>' and '<TerminatorEvals> [n_evals=20, k=0]'
INFO  [09:33:27.965] [bbotk] Evaluating 1 configuration(s)
INFO  [09:33:27.970] [mlr3] Running benchmark with 4 resampling iterations
INFO  [09:33:27.974] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/4)
INFO  [09:33:27.987] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 2/4)
INFO  [09:33:28.000] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 3/4)
INFO  [09:33:28.013] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 4/4)
INFO  [09:33:28.025] [mlr3] Finished benchmark
INFO  [09:33:28.046] [bbotk] Result of batch 1:
INFO  [09:33:28.048] [bbotk]  0.034  0.2597656        0      0            0.028
INFO  [09:33:28.048] [bbotk]                                 uhash
INFO  [09:33:28.048] [bbotk]  5279255e-75e

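`rr1` is the same kind of object as `rr`, a `ResampleResult`. The numbers differ because the folds are drawn at random and because `term_evals = 20` lets the tuner evaluate all 10 grid points instead of 5: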

In [22]:
print(rr1)


<ResampleResult> with 3 resampling iterations
    pima classif.rpart.tuned            cv         1        0      0
    pima classif.rpart.tuned            cv         2        0      0
    pima classif.rpart.tuned            cv         3        0      0


In [23]:
print(rr)


<ResampleResult> with 3 resampling iterations
    pima classif.rpart.tuned            cv         1        0      0
    pima classif.rpart.tuned            cv         2        0      0
    pima classif.rpart.tuned            cv         3        0      0


View the performance of the inner resampling:

In [24]:
extract_inner_tuning_results(rr1)


iteration,cp,classif.ce,learner_param_vals,x_domain,task_id,learner_id,resampling_id
<int>,<dbl>,<dbl>,<list>,<list>,<chr>,<chr>,<chr>
1,0.001,0.2519531,"0.000, 0.001",0.001,pima,classif.rpart.tuned,cv
2,0.012,0.2539062,"0.000, 0.012",0.012,pima,classif.rpart.tuned,cv
3,0.023,0.2363281,"0.000, 0.023",0.023,pima,classif.rpart.tuned,cv


In [25]:
extract_inner_tuning_archives(rr1)


iteration,cp,classif.ce,x_domain_cp,runtime_learners,timestamp,batch_nr,warnings,errors,resample_result,task_id,learner_id,resampling_id
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>,<int>,<int>,<int>,<list>,<chr>,<chr>,<chr>
1,0.056,0.2558594,0.056,0.024,2024-09-01 09:33:28,1,0,0,<environment: 0x558050d82708>,pima,classif.rpart.tuned,cv
1,0.034,0.2558594,0.034,0.026,2024-09-01 09:33:29,2,0,0,<environment: 0x558050d6bab8>,pima,classif.rpart.tuned,cv
1,0.045,0.2558594,0.045,0.029,2024-09-01 09:33:29,3,0,0,<environment: 0x558050d51770>,pima,classif.rpart.tuned,cv
1,0.067,0.2695312,0.067,0.025,2024-09-01 09:33:29,4,0,0,<environment: 0x558050c62ba0>,pima,classif.rpart.tuned,cv
1,0.1,0.2617188,0.1,0.025,2024-09-01 09:33:29,5,0,0,<environment: 0x55805111f6f0>,pima,classif.rpart.tuned,cv
1,0.089,0.2675781,0.089,0.141,2024-09-01 09:33:29,6,0,0,<environment: 0x5580510ec348>,pima,classif.rpart.tuned,cv
1,0.023,0.2597656,0.023,0.021,2024-09-01 09:33:29,7,0,0,<environment: 0x5580510b59d8>,pima,classif.rpart.tuned,cv
1,0.078,0.2675781,0.078,0.022,2024-09-01 09:33:29,8,0,0,<environment: 0x55805106ec88>,pima,classif.rpart.tuned,cv
1,0.001,0.2519531,0.001,0.024,2024-09-01 09:33:29,9,0,0,<environment: 0x558051053b68>,pima,classif.rpart.tuned,cv
1,0.012,0.2519531,0.012,0.026,2024-09-01 09:33:29,10,0,0,<environment: 0x558051036318>,pima,classif.rpart.tuned,cv


View the model performance:

In [26]:
rr1$aggregate()


rr1$score()


task,task_id,learner,learner_id,resampling,resampling_id,iteration,prediction,classif.ce
<list>,<chr>,<list>,<chr>,<list>,<chr>,<int>,<list>,<dbl>
<environment: 0x55804f70f248>,pima,<environment: 0x558053d62b98>,classif.rpart.tuned,<environment: 0x55805055fcf0>,cv,1,<environment: 0x5580568ae7c0>,0.2265625
<environment: 0x55804f70f248>,pima,<environment: 0x55805676e550>,classif.rpart.tuned,<environment: 0x55805055fcf0>,cv,2,<environment: 0x5580546829a0>,0.2421875
<environment: 0x55804f70f248>,pima,<environment: 0x558052cf0d10>,classif.rpart.tuned,<environment: 0x55805055fcf0>,cv,3,<environment: 0x558055cc75b8>,0.2382812
