
search on hyper parameters #38

Open
bifani opened this issue Oct 21, 2016 · 5 comments


bifani commented Oct 21, 2016

Hi,

I am using this package to reweight MC to look like sPlotted data, and I would like to scan the hyper-parameters to look for the best configuration
scikit-learn tools are available for this (e.g. GridSearchCV or RandomizedSearchCV), but I am having trouble interfacing the two packages
Has anyone done that? Are there alternative ways within hep_ml?

In particular, I have my pandas DataFrames for the original and target samples and I am trying something like

        GBreweighterPars = {"n_estimators"     : [10,500],
                            "learning_rate"    : [0.1, 1.0],
                            "max_depth"        : [1,5],
                            "min_samples_leaf" : [100,5000],
                            "subsample"        : [0.1, 1.0]}

        reweighter = reweight.GBReweighter(n_estimators     = GBreweighterPars["n_estimators"],
                                           learning_rate    = GBreweighterPars["learning_rate"],
                                           max_depth        = GBreweighterPars["max_depth"],
                                           min_samples_leaf = GBreweighterPars["min_samples_leaf"],
                                           gb_args          = {"subsample" : GBreweighterPars["subsample"]})

        gridSearch = GridSearchCV(reweighter, param_grid = GBreweighterPars)

        fit = gridSearch.fit(original, target)

but I get the following error

  File "mlWeight.py", line 273, in <module>
    rw = misc.reWeighter(ana, clSamples, inSamples, cutCL + " && " + cutEvt[i], cutMC + " && " + cutEvt[i], weightCL, weightMC, varsToMatch, varsToWatch, year, trigger, name + "_" + str(i), inName, useSW, search, test, add)
  File "/disk/moose/lhcb/simone/RD/Analysis/RKst/ml/misc.py", line 464, in reWeighter
    fit = gridSearch.fit(original, target)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 940, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 539, in _fit
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 273, in check_scoring
    "have a 'score' method. The estimator %r does not." % estimator)
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator GBReweighter(gb_args={'subsample': [0.1, 1.0]}, learning_rate=[0.1, 1.0],
       max_depth=[1, 5], min_samples_leaf=[100, 5000],
       n_estimators=[10, 500]) does not.

However, I am not sure how to set the score method for GBReweighter

Any help/suggestions/examples would be much appreciated


arogozhnikov commented Oct 22, 2016

Hi, Simone.

You're doing something somewhat strange and expecting the algorithm to do things it can't know about.

Cross-validation of a machine learning model is easy when you have some figure of merit (ROC AUC, MSE, classification accuracy). In this case the evaluation is quite straightforward.

However, in the case of reweighting, correct validation requires 2 steps:

  • weak check: looking at 1-d distributions (or computing simple 1-d tests)
  • strong check: checking that a machine learning model used in the analysis can't discriminate between data and reweighted MC (a rough sketch of this check is given below).
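
Something along these lines for the strong check (just an illustrative sketch, not an official hep_ml recipe; it assumes the pandas DataFrames `original` and `target` from your snippet, an already fitted `reweighter`, and uses a plain sklearn classifier with placeholder settings):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # weights assigned by the fitted reweighter to the original (MC) sample
    mc_weights = reweighter.predict_weights(original)

    # label reweighted MC as 0 and the (sPlotted) target data as 1
    X = np.concatenate([original.values, target.values])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    w = np.concatenate([mc_weights, np.ones(len(target))])

    X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(X, y, w, random_state=42)

    clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    clf.fit(X_tr, y_tr, sample_weight=w_tr)

    # ROC AUC close to 0.5 means the classifier cannot separate the two samples,
    # i.e. the reweighting removed the (multi-dimensional) discrepancy
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1], sample_weight=w_te)
    print('discriminator ROC AUC:', auc)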

(Also, is there any reason to optimize parameters automatically?)


bifani commented Oct 22, 2016

Hi,

OK, let me try to clarify what the situation is

I have played a bit with the hyper parameters and ended up using the following configuration

    GBReweighterPars = {"n_estimators"     : 200,
                        "learning_rate"    : 0.1,
                        "max_depth"        : 4,
                        "min_samples_leaf" : 1000,
                        "subsample"        : 1.0}

However, when I use different samples with much lower statistics I am afraid the above are far from optimal, e.g. too many n_estimators, causing the reweighter to misbehave
Rather than trying other settings by hand, I was wondering whether there is an automated way to study this

In particular, after having created the reweighter I compute the ROC AUC on a number of variables of interest, which I could use as a FoM
Would that be useful?

Thanks

@arogozhnikov

> Would that be useful?

Not really. 1-dimensional discrepancies are not all discrepancies.

You can drive the 1-dimensional ROC AUCs to 0.5 with max_depth=1, but you won't cover any non-trivial difference between the distributions.

(Well, you can use it as a starting point and then check the result using step 2, but no guarantees can be given for this approach.)


bifani commented Oct 23, 2016

OK, then how do you suggest picking the hyper-parameters?


arogozhnikov commented Oct 24, 2016

If you really want to automate this process, you need to write an evaluation function which covers both steps 1) and 2) mentioned above, e.g. the sum over features of KS(feature_i) plus abs(classifier ROC AUC - 0.5), as sketched below.
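
A rough sketch of such an evaluation function (illustration only; the function name and classifier settings are placeholders, it reuses the `original`/`target` DataFrames from above, and it relies on the weighted KS helper shipped with hep_ml; smaller values are better):

    import numpy as np
    from hep_ml.reweight import GBReweighter
    from hep_ml.metrics_utils import ks_2samp_weighted
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    def reweighting_fom(params, original, target):
        """Sum of per-feature weighted KS statistics plus |ROC AUC - 0.5|."""
        # params: dict of GBReweighter constructor arguments (n_estimators, max_depth, ...)
        reweighter = GBReweighter(**params)
        reweighter.fit(original, target)
        mc_weights = reweighter.predict_weights(original)
        data_weights = np.ones(len(target))

        # step 1: one-dimensional agreement, feature by feature
        ks_sum = sum(ks_2samp_weighted(original[col], target[col],
                                       weights1=mc_weights, weights2=data_weights)
                     for col in original.columns)

        # step 2: can a classifier still tell the two samples apart?
        X = np.concatenate([original.values, target.values])
        y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
        w = np.concatenate([mc_weights, data_weights])
        clf = GradientBoostingClassifier().fit(X, y, sample_weight=w)
        auc = roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)

        return ks_sum + abs(auc - 0.5)

One could then loop over sklearn's ParameterGrid and keep the parameters giving the smallest value (ideally fitting and evaluating on separate folds, so that an overfitted reweighter is not rewarded).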

As for me: I pick a relatively small number of trees (30-50), select the leaf size and regularization according to the dataset, and play with the depth (2-4) and learning rate (0.1-0.3). I stop when I see that I have significantly reduced the discrepancy between the datasets. There are many other errors to be dealt with in the analysis, and trying to minimize only one of them to zero isn't a wise strategy.
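
In code, such a manual starting point could look roughly like this (the numbers are just examples from the ranges above, not a recommendation for your dataset):

    from hep_ml.reweight import GBReweighter

    # small number of trees, moderate depth and learning rate;
    # scale min_samples_leaf to the size of the smaller sample;
    # subsample is one possible regularization knob (cf. your gb_args above)
    reweighter = GBReweighter(n_estimators=40, learning_rate=0.2, max_depth=3,
                              min_samples_leaf=1000, gb_args={'subsample': 0.7})
    reweighter.fit(original, target)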
