# mlr3: Feature Selection

The series covers four main topics. This is the second installment, covering Hyperband tuning and feature selection.

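- Model tuning
- Tuning hyperparameters
    - Method 1: train the model with `TuningInstanceBatchSingleCrit` and a `Tuner`
    - Method 2: train the model with `AutoTuner`
    - Ways to define the hyperparameter search space
    - Parameter dependencies
- Nested resampling
    - Running nested resampling
    - Evaluating the model
    - Applying the hyperparameters to the model
- Hyperband tuning
- Feature selection
    - Filters
    - Calculating scores
    - Calculating variable importance
    - Wrapper methods
    - Automatic selection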

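## Hyperband Tuning
Hyperband can be viewed as a special form of random search. As the saying goes, "you can't have your cake and eat it too": with a fixed budget you can either try many configurations cheaply or a few configurations thoroughly, and Hyperband is one way of handling that trade-off. Interested readers can study the details on their own.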

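Here is a small example. Suppose you have 8 horses, and each horse needs 4 units of food to perform at its best, but you only have 32 units of food. You need a strategy that makes full use of those 32 units (your computational resources) to find the best horse. There are two strategies:

- Strategy 1: drop 4 horses right away and spend all the food on the other 4. In the end you can pick the best of those 4, but you never learn whether one of the discarded horses would have been better.
- Strategy 2: start by giving every horse 1 unit of food and watch how they perform. Keep the 4 best and drop the rest, give the remaining 4 more food, then keep the best 2, and so on, until the remaining food goes to the last horse.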
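The second strategy is usually called successive halving, and Hyperband runs several rounds of it with different starting budgets. A minimal base-R sketch of the horse example follows. The hidden `true_speed` values, the noise model in `race()`, and the doubling food schedule (1, 2, 4, 8 units per surviving horse) are assumptions made for illustration; none of this is `mlr3hyperband` code.

```r
# Successive halving on the horse analogy (illustrative; all numbers are assumptions).
set.seed(1)
n_horses <- 8
true_speed <- runif(n_horses) # hidden quality of each horse (made up)

# Hypothetical noisy evaluation: more food gives a more reliable measurement.
race <- function(horses, food) {
  true_speed[horses] + rnorm(length(horses), sd = 0.3 / sqrt(food))
}

survivors <- seq_len(n_horses)
food_each <- 1
total_food <- 0
stages <- list()

repeat {
  observed <- race(survivors, food_each)
  total_food <- total_food + length(survivors) * food_each
  stages[[length(stages) + 1]] <- data.frame(
    horses = length(survivors),
    food_each = food_each
  )
  if (length(survivors) == 1) break
  # Keep the better half and double the food for the next stage.
  keep <- order(observed, decreasing = TRUE)[seq_len(length(survivors) / 2)]
  survivors <- survivors[keep]
  food_each <- food_each * 2
}

stages <- do.call(rbind, stages)
stages     # horses per stage and food per horse
total_food # total food consumed
survivors  # index of the horse that was kept until the end
```

Each of the four stages costs 8 units, so the schedule uses exactly the 32 units available. The surviving horse is not guaranteed to be the truly fastest one, because early stages judge horses on very little food.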

We mainly show how to apply this method with the `mlr3hyperband` package.

In [5]:
library(mlr3verse)
library(mlr3hyperband)


Loading required package: mlr3tuning

Loading required package: paradox



In [2]:
set.seed(123)

ll <- po("subsample") %>>% lrn("classif.rpart") # %>>% is the pipe operator from mlr3pipelines: subsample the data first, then fit the learner


In [3]:
search_space <- ps(
    classif.rpart.cp = p_dbl(
        lower = 0.001,
        upper = 0.1
    ),
    classif.rpart.minsplit = p_int(
        lower = 1,
        upper = 10
    ),
    subsample.frac = p_dbl(
        lower = 0.1,
        upper = 1,
        tags = "budget"
    )
) # the "budget" tag marks subsample.frac as the resource that Hyperband allocates


In [7]:
instance <- TuningInstanceBatchSingleCrit$new(
  task = tsk("iris"),
  learner = ll,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)


Next, run the Hyperband tuning:

In [8]:
tuner <- tnr("hyperband", eta = 3)

lgr::get_logger("bbotk")$set_threshold("warn")

tuner$optimize(instance)


INFO  [09:45:14.404] [mlr3] Running benchmark with 9 resampling iterations
INFO  [09:45:14.453] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.549] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.593] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.639] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.691] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.742] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.810] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.852] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.892] [mlr3] Applying learner 'subsample.classif.rpart' on task 'iris' (iter 1/1)
INFO  [09:45:14.932] [mlr3] Finished benchmark

classif.rpart.cp,classif.rpart.minsplit,subsample.frac,learner_param_vals,x_domain,classif.ce
<dbl>,<int>,<dbl>,<list>,<list>,<dbl>
0.008118506,1,0.1111111,"0.111111111, 0.000000000, 0.000000000, 0.000000000, 0.008118506, 1.000000000","0.008118506, 1.000000000, 0.111111111",0.02
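
The log shows 9 evaluations in the first batch, and the best configuration used `subsample.frac = 0.1111111`. Both numbers follow from how Hyperband lays out its brackets. The sketch below recomputes the schedule in base R from the formulas in the Hyperband paper, using the budget range 0.1 to 1 and `eta = 3` from above. `hyperband_sketch()` is a hypothetical helper written for this post, and the rounding inside `mlr3hyperband` may differ in detail.

```r
# Recompute a Hyperband schedule by hand (a sketch, not the package's code).
# r_min / r_max: smallest and largest budget; eta: halving factor.
hyperband_sketch <- function(r_min, r_max, eta) {
  R <- r_max / r_min
  # Index of the most aggressive bracket; the small epsilon guards against
  # floating point error when R is an exact power of eta.
  s_max <- floor(log(R) / log(eta) + 1e-9)
  rows <- list()
  for (s in s_max:0) {
    # Number of configurations sampled at the start of bracket s.
    n <- ceiling((s_max + 1) / (s + 1) * eta^s)
    for (i in 0:s) {
      rows[[length(rows) + 1]] <- data.frame(
        bracket = s,
        stage = i,
        n_configs = floor(n / eta^i), # survivors in this stage
        budget = r_max / eta^(s - i)  # budget given to each survivor
      )
    }
  }
  do.call(rbind, rows)
}

schedule <- hyperband_sketch(r_min = 0.1, r_max = 1, eta = 3)
schedule
```

Under these assumptions the first stage of the first bracket evaluates 9 configurations at a budget of 1/9, which is consistent with the 9 resampling iterations in the log and with the `subsample.frac` of the reported result.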


View the result:

In [9]:
instance$result


classif.rpart.cp,classif.rpart.minsplit,subsample.frac,learner_param_vals,x_domain,classif.ce
<dbl>,<int>,<dbl>,<list>,<list>,<dbl>
0.008118506,1,0.1111111,"0.111111111, 0.000000000, 0.000000000, 0.000000000, 0.008118506, 1.000000000","0.008118506, 1.000000000, 0.111111111",0.02


In [10]:
instance$result_learner_param_vals


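## Feature Selection
Feature selection is something of an art. A raw dataset usually contains information that is redundant or irrelevant and does not help modeling. Using such variables only adds noise and degrades model performance. The process of removing redundant information and picking the most suitable variables is called feature selection.

### Filters
A filter method first computes a score for every predictor and then ranks the predictors by that score, so that we can pick suitable predictors based on the ranking.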
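
To make the score-then-rank idea concrete, here is a small base-R sketch that does not use `mlr3filters`. It scores each numeric feature of `iris` with the one-way ANOVA F statistic against the class label, which is just one possible scoring rule chosen for illustration; `f_statistic()` is a hypothetical helper.

```r
# Score each numeric feature of iris by the one-way ANOVA F statistic
# against the class label, then rank the features by that score.
feature_names <- setdiff(names(iris), "Species")

# Hypothetical helper: F statistic of feature x across the groups in y.
f_statistic <- function(x, y) {
  fit <- aov(x ~ y)
  summary(fit)[[1]][["F value"]][1]
}

scores <- sapply(iris[feature_names], f_statistic, y = iris$Species)

ranking <- data.frame(
  feature = names(scores),
  score = unname(scores),
  stringsAsFactors = FALSE
)
ranking <- ranking[order(ranking$score, decreasing = TRUE), ]
rownames(ranking) <- NULL
ranking
```

A larger F statistic means the class means of that feature differ more relative to the spread within each class, so the feature separates the classes better when considered on its own.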

List the available scoring methods:

In [11]:
mlr_filters


<DictionaryFilter> with 23 stored values
Keys: anova, auc, boruta, carscore, carsurvscore, cmim, correlation,
  disr, find_correlation, importance, information_gain, jmi, jmim,
  kruskal_test, mim, mrmr, njmim, performance, permutation, relief,
  selected_features, univariate_cox, variance

In [13]:
mlr_filters$keys()


Feature engineering is a complex topic; readers who want the details can consult a dedicated book.


### Calculating Scores
Currently only classification and regression tasks are supported.

In [15]:
filter <- flt("jmim")

task <- tsk("iris")
filter$calculate(task)

filter


<FilterJMIM:jmim>: Minimal Joint Mutual Information Maximization
Task Types: classif, regr
Properties: -
Task Properties: -
Packages: praznik
Feature types: integer, numeric, factor, ordered
        feature     score
1:  Petal.Width 1.0000000
2: Sepal.Length 0.6666667
3: Petal.Length 0.3333333
4:  Sepal.Width 0.0000000

As shown, a score has been computed for every variable.
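
Once the scores exist, the selection itself is a simple cut. The base-R sketch below copies the four jmim scores from the output above and shows two common rules, top-k and threshold. The values `k = 2` and `threshold = 0.3` are arbitrary choices for illustration.

```r
# Scores copied from the jmim output above.
scores <- data.frame(
  feature = c("Petal.Width", "Sepal.Length", "Petal.Length", "Sepal.Width"),
  score = c(1.0000000, 0.6666667, 0.3333333, 0.0000000),
  stringsAsFactors = FALSE
)

# Option 1: keep the k best features (k is an arbitrary choice here).
k <- 2
top_k <- head(scores$feature[order(scores$score, decreasing = TRUE)], k)

# Option 2: keep every feature whose score exceeds a threshold (also arbitrary).
threshold <- 0.3
above <- scores$feature[scores$score > threshold]

# Reduce the data to the chosen features plus the target column.
iris_top_k <- iris[, c(top_k, "Species")]
head(iris_top_k)
```

With an mlr3 task, the same feature names could be passed to `task$select()`, the method used later in this post.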

In [16]:
# Select variables by correlation
filter_cor <- flt("correlation")

# Parameters can be changed; the default method is pearson
filter_cor$param_set


<ParamSet(2)>
       id    class lower upper nlevels    default  value
   <char>   <char> <num> <num>   <num>     <list> <list>
1:    use ParamFct    NA    NA       5 everything       
2: method ParamFct    NA    NA       3    pearson       

In [17]:
# The method can be changed to spearman
filter_cor$param_set$values <- list(method = "spearman")
filter_cor$param_set


<ParamSet(2)>
       id    class lower upper nlevels    default    value
   <char>   <char> <num> <num>   <num>     <list>   <list>
1:    use ParamFct    NA    NA       5 everything         
2: method ParamFct    NA    NA       3    pearson spearman
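
The filter is configured above but not yet applied to a task. As far as I understand, the correlation filter is meant for regression tasks with numeric features and scores each feature by its absolute correlation with the target. The base-R sketch below mimics that idea on the built-in `mtcars` data with `mpg` as an assumed target; it is an illustration only and does not call `mlr3filters`.

```r
# Absolute Spearman correlation of every feature with the target.
# mtcars and the target column "mpg" are assumptions for this example.
target <- mtcars$mpg
features <- mtcars[, setdiff(names(mtcars), "mpg")]

scores <- sapply(
  features,
  function(x) abs(cor(x, target, method = "spearman"))
)

# Rank features from most to least correlated with the target.
sort(scores, decreasing = TRUE)
```

Spearman correlation works on ranks, so it rewards any monotone relationship, whereas the default Pearson method only measures linear association.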

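### Calculating Variable Importance
Every `learner` that can report variable importance (the `importance` property) supports this method. Some learners, such as `classif.ranger` below, need their `importance` parameter set first.

For example: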

In [19]:
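# Use a name other than `lrn` so the variable does not shadow mlr3's lrn() function
ranger_learner <- lrn("classif.ranger",
    importance = "impurity"
)

task <- tsk("iris")
filter <- flt("importance", learner = ranger_learner)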


In [21]:
filter$calculate(task)
filter


<FilterImportance:importance>: Importance Score
Task Types: classif
Properties: -
Task Properties: -
Packages: mlr3, mlr3learners, ranger
Feature types: logical, integer, numeric, character, factor, ordered
        feature     score
1: Petal.Length 43.765053
2:  Petal.Width 42.174950
3: Sepal.Length 10.812783
4:  Sepal.Width  2.511471
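
The same idea can be seen without `ranger`. The sketch below swaps in `rpart` (a recommended package that ships with R) and reads the importance values stored in the fitted tree. This is a stand-in for illustration, not what `flt("importance")` runs internally, and the scores are on a different scale than the `ranger` impurity values above.

```r
library(rpart) # recommended package, normally installed with R

# Fit a classification tree on iris and read its stored variable importance.
fit <- rpart(Species ~ ., data = iris)
importance <- fit$variable.importance

# Put the named vector into the same feature/score layout a filter prints.
scores <- data.frame(
  feature = names(importance),
  score = unname(importance),
  stringsAsFactors = FALSE
)
scores
```

An importance-based filter is therefore only as good as the learner behind it: a different learner, or the same learner with different settings, can produce a different ranking.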

In [24]:
task$head()


Species,Petal.Length,Petal.Width,Sepal.Length,Sepal.Width
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
setosa,1.4,0.2,5.1,3.5
setosa,1.4,0.2,4.9,3.0
setosa,1.3,0.2,4.7,3.2
setosa,1.5,0.2,4.6,3.1
setosa,1.4,0.2,5.0,3.6
setosa,1.7,0.4,5.4,3.9


In [26]:
as.data.table(task)$Species


### Wrapper Methods
Wrapper methods work much like hyperparameter tuning and are provided by the `mlr3fselect` package.

In [28]:
library("mlr3verse")
library(mlr3fselect)


In [30]:
task <- tsk("pima")
learner <- lrn("classif.rpart")
hout <- rsmp("holdout")
measure <- msr("classif.ce")

evals20 <- trm("evals", n_evals = 20) # stop after 20 evaluations

# Build the instance
instance <- FSelectInstanceBatchSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  terminator = evals20
)
instance


<FSelectInstanceBatchSingleCrit>
* State:  Not optimized
* Objective: <ObjectiveFSelectBatch:classif.rpart_on_pima>
* Terminator: <TerminatorEvals>

`mlr3fselect` currently supports the following methods:

- Random Search (FSelectorBatchRandomSearch)
- Exhaustive Search (FSelectorBatchExhaustiveSearch)
- Sequential Search (FSelectorBatchSequential)
- Recursive Feature Elimination (FSelectorBatchRFE)
- Design Points (FSelectorBatchDesignPoints)

We pick random search:

In [31]:
fselector <- fs("random_search")
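
To show what a wrapper method with random search does conceptually, here is a base-R sketch that uses `rpart` directly. It runs on `iris` rather than `pima` so that it needs no extra packages. `evaluate_subset()` is a hypothetical helper, and the way subsets are drawn here is an assumption that may differ from how `mlr3fselect` samples them.

```r
library(rpart) # recommended package, normally installed with R

set.seed(42)
features <- setdiff(names(iris), "Species")

# Holdout split: two thirds for training, one third for testing.
train_idx <- sample(nrow(iris), size = round(2 / 3 * nrow(iris)))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Hypothetical helper: fit on one feature subset, return the holdout error.
evaluate_subset <- function(subset) {
  fit <- rpart(reformulate(subset, response = "Species"), data = train)
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred != test$Species)
}

n_evals <- 20 # plays the role of trm("evals", n_evals = 20)
archive <- vector("list", n_evals)
for (i in seq_len(n_evals)) {
  # Draw a random non-empty subset of the features.
  k <- sample(length(features), 1)
  subset <- sort(sample(features, k))
  archive[[i]] <- data.frame(
    features = paste(subset, collapse = ", "),
    n_features = k,
    classif.ce = evaluate_subset(subset),
    stringsAsFactors = FALSE
  )
}
archive <- do.call(rbind, archive)

# The subset with the lowest holdout error wins.
best <- archive[which.min(archive$classif.ce), ]
best
```

The key difference from a filter is that every candidate subset is judged by actually training and evaluating the learner, which is more expensive but takes interactions between features into account.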


Start the run:

In [32]:
lgr::get_logger("bbotk")$set_threshold("warn")


In [33]:
fselector$optimize(instance)


INFO  [10:13:25.690] [mlr3] Running benchmark with 10 resampling iterations
INFO  [10:13:25.695] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.708] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.722] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.735] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.748] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.762] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.776] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.788] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.802] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.816] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:13:25.836] [mlr3] Finished benchmark

age,glucose,insulin,mass,pedigree,pregnant,pressure,triceps,features,n_features,classif.ce
<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<list>,<int>,<dbl>
False,True,False,False,False,False,True,True,"glucose , pressure, triceps",3,0.2460938


View the selected variables:


In [34]:
instance$result_feature_set


In [35]:
as.data.table(task)


diabetes,age,glucose,insulin,mass,pedigree,pregnant,pressure,triceps
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
pos,50,148,,33.6,0.627,6,72,35
neg,31,85,,26.6,0.351,1,66,29
pos,32,183,,23.3,0.672,8,64,
neg,21,89,94,28.1,0.167,1,66,23
pos,33,137,168,43.1,2.288,0,40,35
neg,30,116,,25.6,0.201,5,74,
pos,26,78,88,31.0,0.248,3,50,32
neg,29,115,,35.3,0.134,10,,
pos,53,197,543,30.5,0.158,2,70,45
pos,54,125,,,0.232,8,96,


In [36]:
instance$result_y


In [37]:
as.data.table(instance$archive)


age,glucose,insulin,mass,pedigree,pregnant,pressure,triceps,classif.ce,runtime_learners,timestamp,batch_nr,warnings,errors,features,n_features,resample_result
<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<dbl>,<dbl>,<dttm>,<int>,<int>,<int>,<list>,<list>,<list>
False,False,True,False,True,True,False,False,0.34375,0.007,2024-09-01 10:13:25,1,0,0,"insulin , pedigree, pregnant",3,<environment: 0x560bc7f0f5b0>
True,True,True,True,True,True,True,True,0.25,0.007,2024-09-01 10:13:25,1,0,0,"age , glucose , insulin , mass , pedigree, pregnant, pressure, triceps",8,<environment: 0x560bc7ecd1b8>
False,False,True,True,True,False,True,True,0.3359375,0.008,2024-09-01 10:13:25,1,0,0,"insulin , mass , pedigree, pressure, triceps",5,<environment: 0x560bc7e8c640>
False,True,False,False,False,False,True,True,0.2460938,0.007,2024-09-01 10:13:25,1,0,0,"glucose , pressure, triceps",3,<environment: 0x560bc7e61ac8>
True,True,True,False,True,True,True,True,0.2539062,0.007,2024-09-01 10:13:25,1,0,0,"age , glucose , insulin , pedigree, pregnant, pressure, triceps",7,<environment: 0x560bc7e30ea0>
True,False,True,True,True,True,True,True,0.3046875,0.007,2024-09-01 10:13:25,1,0,0,"age , insulin , mass , pedigree, pregnant, pressure, triceps",7,<environment: 0x560bc7e14178>
False,False,True,True,False,False,False,False,0.3828125,0.006,2024-09-01 10:13:25,1,0,0,"insulin, mass",2,<environment: 0x560bc7c783d0>
True,True,True,True,True,True,True,True,0.25,0.008,2024-09-01 10:13:25,1,0,0,"age , glucose , insulin , mass , pedigree, pregnant, pressure, triceps",8,<environment: 0x560bc7be2358>
True,False,True,True,True,True,True,True,0.3046875,0.007,2024-09-01 10:13:25,1,0,0,"age , insulin , mass , pedigree, pregnant, pressure, triceps",7,<environment: 0x560bc7bc6150>
True,False,True,True,True,True,True,True,0.3046875,0.014,2024-09-01 10:13:25,1,0,0,"age , insulin , mass , pedigree, pregnant, pressure, triceps",7,<environment: 0x560bc7bada88>


In [41]:
instance$archive$benchmark_result


<BenchmarkResult> of 20 rows with 20 resampling runs
  1    pima classif.rpart       holdout     1        0      0
  2    pima classif.rpart       holdout     1        0      0
  3    pima classif.rpart       holdout     1        0      0
  4    pima classif.rpart       holdout     1        0      0
  5    pima classif.rpart       holdout     1        0      0
  6    pima classif.rpart       holdout     1        0      0
  7    pima classif.rpart       holdout     1        0      0
  8    pima classif.rpart       holdout     1        0      0
  9    pima classif.rpart       holdout     1        0      0
 10    pima classif.rpart       holdout     1        0      0
 11    pima classif.rpart       holdout     1        0      0
 12    pima classif.rpart       holdout     1        0      0
 13    pima classif.rpart       holdout     1        0      0
 14    pima classif.rpart       holdout     1        0      0
 15    pima classif.rpart       holdout     1        0      0
 16    pima class

Apply the result to the model and train on the task:

In [43]:
instance$result_feature_set


In [42]:
task$select(instance$result_feature_set) # use only the selected variables
learner$train(task)


### Automatic Selection

In [44]:
learner <- lrn("classif.rpart")
terminator <- trm("evals", n_evals = 10)
fselector <- fs("random_search")

at <- AutoFSelector$new(
    learner = learner,
    resampling = rsmp("holdout"),
    measure = msr("classif.ce"),
    terminator = terminator,
    fselector = fselector
)
at


<AutoFSelector:classif.rpart.fselector>
* Model: list
* Packages: mlr3, mlr3fselect, rpart
* Predict Type: response
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features,
  twoclass, weights

Compare the model performance with and without automatic feature selection:

In [45]:
grid <- benchmark_grid(
  tasks = tsk("pima"),
  learners = list(at, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 3)
)

bmr <- benchmark(grid, store_models = TRUE)


INFO  [10:23:19.290] [mlr3] Running benchmark with 6 resampling iterations
INFO  [10:23:19.294] [mlr3] Applying learner 'classif.rpart.fselector' on task 'pima' (iter 1/3)
INFO  [10:23:19.377] [mlr3] Running benchmark with 10 resampling iterations
INFO  [10:23:19.381] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.393] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.404] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.416] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.430] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.445] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.459] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.474] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/1)
INFO  [10:23:19.488] [mlr3] Applying learner 'classif.rp

In [46]:
bmr$aggregate(msrs(c("classif.ce", "time_train")))


nr,resample_result,task_id,learner_id,resampling_id,iters,classif.ce,time_train
<int>,<list>,<chr>,<chr>,<chr>,<int>,<dbl>,<dbl>
1,<environment: 0x560bd22d2868>,pima,classif.rpart.fselector,cv,3,0.265625,0.346
2,<environment: 0x560bd22b3be0>,pima,classif.rpart,cv,3,0.2591146,0.003666667
