usage of identity in cross-validation #40

Closed
updiversity opened this issue Apr 22, 2015 · 5 comments

Comments

@updiversity

I am getting the following error when setting the aggregator option to opt.cross_validation.identity:

---------------------------------------------------------------------------
     33 # Define Parameter Tuning
---> 34 optimal_pars_clf_sgd, _, _ = opt.maximize(clf_sgd_cv, num_evals=n_hyperparams_evals, alpha=[0.001, .1], l1_ratio=[0., 1.])
     35 
     36 # Train model on the Inner Training Set with Tuned Hyperparameters

../local/lib/python2.7/site-packages/optunity/api.pyc in maximize(f, num_evals, solver_name, pmap, **kwargs)
    179     solver = make_solver(**suggestion)
    180     solution, details = optimize(solver, f, maximize=True, max_evals=num_evals,
--> 181                                  pmap=pmap)
    182     return solution, details, suggestion
    183 

../local/lib/python2.7/site-packages/optunity/api.pyc in optimize(solver, func, maximize, max_evals, pmap)
    243     time = timeit.default_timer()
    244     try:
--> 245         solution, report = solver.optimize(f, maximize, pmap=pmap)
    246     except fun.MaximumEvaluationsException:
    247         # early stopping because maximum number of evaluations is reached

../local/lib/python2.7/site-packages/optunity/solvers/ParticleSwarm.pyc in optimize(self, f, maximize, pmap)
    257             fitnesses = pmap(evaluate, list(map(self.particle2dict, pop)))
    258             for part, fitness in zip(pop, fitnesses):
--> 259                 part.fitness = fit*fitness
    260                 if not part.best or part.best_fitness < part.fitness:
    261                     part.best = part.position

TypeError: can't multiply sequence by non-int of type 'float'

Here is my code:

import optunity as opt
from optunity.metrics import _recall, contingency_table
from sklearn.linear_model import SGDClassifier
import numpy as np

n_in = 1
k_in = 2
n_hyperparams_evals = 10

clf_sgd = SGDClassifier(
            penalty="elasticnet",
            shuffle=True,
            n_iter=500,
            fit_intercept=True,
            learning_rate="optimal")

# Define Inner CV
cv_decorator = opt.cross_validated(x=X, y=Y.values, 
                                   num_folds=k_in, num_iter=n_in,
                                   strata=[Y[Y==1].index.values], 
                                   regenerate_folds=True,
                                   aggregator=opt.cross_validation.identity)

def obj_fun_clf_sgd(x_train, y_train, x_test, y_test, alpha, l1_ratio):
    model = clf_sgd.set_params(l1_ratio=l1_ratio, alpha=alpha).fit(x_train, y_train)
    y_pred = model.predict(x_test)
    score = _recall(contingency_table(y_test,y_pred))
    return score

clf_sgd_cv = cv_decorator(obj_fun_clf_sgd)

# Define Parameter Tuning
optimal_pars_clf_sgd, _, _ = opt.maximize(clf_sgd_cv, num_evals=n_hyperparams_evals, alpha=[0.001, .1], l1_ratio=[0., 1.])

# Train model on the Inner Training Set with Tuned Hyperparameters
optimal_model_clf_sgd = clf_sgd.set_params(**optimal_pars_clf_sgd).fit(X, Y.values)

The objective is to keep track of all the scores from the various folds. Is this a bug, or am I using the API incorrectly?

Thanks in advance

@claesenm
Owner

All solvers we use expect to get a single number for each evaluation of an objective function (which is constructed via cross-validation in your case). The identity aggregator constructs a list of all results, which causes this error.

The identity aggregator is mainly intended for other uses than this, for instance the outer cross-validation procedure of a nested cross-validation setup.

At this point I am not entirely sure what your intentions are: when you are optimizing hyperparameters, you really need to return a single number for every set of hyperparameters that is tried. Note that you already get a trace of all objective function evaluations (that is, the overall cross-validation score per evaluation) in the return values of maximize, minimize and optimize.
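For example, here is a minimal sketch of pulling that trace out of the return values, assuming your optunity version exposes details.call_log and the call_log2dataframe helper, and reusing clf_sgd_cv from your snippet with the default (mean) aggregator so every evaluation yields a single number:

```python
import optunity as opt

# every hyperparameter set that was tried is recorded together with its
# overall cross-validation score in details.call_log
optimal_pars, details, _ = opt.maximize(clf_sgd_cv, num_evals=10,
                                        alpha=[0.001, 0.1], l1_ratio=[0.0, 1.0])

trace = opt.call_log2dataframe(details.call_log)
print(trace.sort_values('value', ascending=False))
```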

If you want to retain scores for each fold, you will need to modify optunity internally. I can make a quick code example if you like.

@updiversity
Author

I'd like to keep track of the scores obtained during model selection in order to estimate the variance of the scores as a function of the hyper-parameter values.
I suppose this could be done by building an ad-hoc scorer and then using the "list_mean" aggregator.

However, even then I would only get num_evals data points, i.e. the number of iterations configured in the optimizer. I would still like to get at least the scores for each CV iteration.

So, yes, if it is not a problem for you, some lines of code would be welcome.

@claesenm
Owner

You can now freely return all fold-wise information during cross-validation. Solvers now have an extra layer between objective function evaluations and actual use of the return value: if your objective function returns an indexable, its first element is used in optimization.

You can see an example of how to use this feature in bin/examples/python/advanced_cv.py, which returns the full cross-validation results. If this is not what you were looking for, please do let me know so I can reopen this issue.
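For instance, a minimal sketch of that idea (mean_and_list is just a hypothetical aggregator name; X, Y and the objective function as in your snippet):

```python
import numpy as np
import optunity as opt

# hypothetical aggregator: the solver optimizes the first element (the mean),
# while the complete list of fold-wise scores is carried along
def mean_and_list(fold_scores):
    return (np.mean(fold_scores), fold_scores)

cv_decorator = opt.cross_validated(x=X, y=Y.values, num_folds=2,
                                   aggregator=mean_and_list)
clf_sgd_cv = cv_decorator(obj_fun_clf_sgd)
```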

PS: you will need to update optunity from git to get the new feature.

@updiversity
Author

Thanks for the update. It now even works with a user-provided aggregator. FYI, here is the one I have defined: it computes the mean, std, min and max for a list of lists of scores, and provides as its first element the score to be used by the optimizer.

```python
def score_aggregators(list_of_scores,
                      list_of_agg_fun=(np.mean, np.std, np.min, np.max)):
    """Aggregate fold-wise scores; the optimizer uses the first element."""
    try:
        # each fold returned several scores: transpose to group them by score type
        scores = list(zip(*list_of_scores))
        agg_scores = [[agg_func(s) for s in scores] for agg_func in list_of_agg_fun]
        # first element: mean of the first score, used by the optimizer;
        # second element: all aggregated statistics
        return (agg_scores[0][0], agg_scores)
    except TypeError:
        # each fold returned a single score
        return [agg_func(list_of_scores) for agg_func in list_of_agg_fun]
```
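
It plugs in exactly like the built-in aggregators (sketch, reusing the decorator setup from my first snippet):

```python
cv_decorator = opt.cross_validated(x=X, y=Y.values,
                                   num_folds=k_in, num_iter=n_in,
                                   regenerate_folds=True,
                                   aggregator=score_aggregators)
clf_sgd_cv = cv_decorator(obj_fun_clf_sgd)
```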

@claesenm
Owner

Glad to hear it is working! The current solution allows pretty much every use case I can think of, but feel free to let us know if there are still any remaining issues.
