
GridSearchCV seems to work, but how does it work internally? #29

Closed
komodovaran opened this issue Jun 3, 2020 · 2 comments

@komodovaran

Let's say my classifier is LearningWithNoisyLabels(GridSearchCV(estimator = RandomForestClassifier(), param_grid = ..., cv = ...), cv_n_folds = ...)

What will happen here? Will I get the best parameters from GridSearchCV's cross validation, and then re-train this model on the best set of data using LearningWithNoisyLabels's cross validation? Will this potentially produce bad results?
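For reference, a concrete version of that nesting might look like the sketch below (the hyperparameter values and data variables are placeholders, assuming the cleanlab API where the keyword is cv_n_folds):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from cleanlab.classification import LearningWithNoisyLabels

# GridSearchCV nested inside LearningWithNoisyLabels: cleanlab's own cross-validation
# runs on the outside, so the grid search is re-fit on whatever data each cleanlab fold sees.
inner_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 10]},  # placeholder grid
    cv=3,
)
clf = LearningWithNoisyLabels(clf=inner_search, cv_n_folds=5)
# clf.fit(X_train, noisy_labels); clf.predict(X_test)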

@cgnorthcutt
Copy link
Member

@komodovaran Your loops are in the wrong order.

You should instead write code like this:

for parameters in GridSearch using cross validation:
    Do cleanlab training  # (uses cross-validation to get out-of-sample predicted probabilities)

Instead, your implementation will do this:

After already removing data to perform cleanlab cross validation:
    Find the best parameters for that subset of data left over

It might work fine if your dataset is relatively large compared to model complexity, but if you want to implement this for harder classification problems, you'll want as much data as possible.

Check out this example: https://github.com/cgnorthcutt/cleanlab/blob/master/examples/classifier_comparison.ipynb
In that example, you can see that the loop over classifiers is the outermost loop.

So, your code should look something like:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score
from cleanlab.classification import LearningWithNoisyLabels

# X, y are your features and (noisy) labels; use parameters RandomForestClassifier accepts
grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
best_score, best_params = None, None
for params in ParameterGrid(grid):
    clf = LearningWithNoisyLabels(clf=RandomForestClassifier(**params))
    score = cross_val_score(clf, X, y, cv=3).mean()  # keep track of the best result
    if best_score is None or score > best_score:
        best_score, best_params = score, params
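
Once the loop finishes, a natural follow-up (a sketch, assuming best_params from the loop above) is to refit on the full training set with the winning parameters:

final_clf = LearningWithNoisyLabels(clf=RandomForestClassifier(**best_params))
final_clf.fit(X, y)  # cleanlab prunes likely label errors from (X, y), then fits on the cleaned data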

@cgnorthcutt
Member

closing as there were no further questions
