Automl: first-batch catboost is slow to fit on dataset #979

Closed
dsherry opened this issue Jul 27, 2020 · 1 comment · Fixed by #998
dsherry commented Jul 27, 2020

Problem
I kicked off a looking glass run this morning (logs here). On the first dataset, AP_Endometrium_Prostate_1.csv, I noticed catboost took about 10min to train, whereas other models took about 1min.

I also noticed the following in the log, for logistic regression:

//.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/metrics/_classification.py:2295: RuntimeWarning:
divide by zero encountered in log
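
For reference, that warning is numpy's divide-by-zero warning from taking the log of an exact zero, e.g. a predicted probability of 0 for the true class inside a log-loss style metric; a minimal standalone reproduction, not tied to our code:

```python
import numpy as np

# np.log(0.0) emits "RuntimeWarning: divide by zero encountered in log"
# and returns -inf, which is how a hard-zero predicted probability for the
# true class surfaces inside a log-loss style metric.
probs = np.array([0.0, 0.5, 1.0])
print(np.log(probs))  # [-inf, -0.693..., 0.0] plus the RuntimeWarning
```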

For repro: this would have run with random_state=0.

Next steps
Someone needs to look at this dataset, run automl on it locally, and determine what the problem(s) are. It's possible our hyperparameter ranges for catboost are still too permissive, particularly around the max tree depth. We may want to shrink some ranges and/or define a non-uniform prior. It's also possible the dataset has an error or it's hitting a bug in our code somewhere.
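
For whoever picks this up, here's a minimal sketch for timing catboost on the dataset directly, outside of automl. The local CSV path, the target column name, and the specific hyperparameter values are assumptions for illustration, not what the looking glass run used:

```python
import time

import pandas as pd
from catboost import CatBoostClassifier

# Assumed local path and target column name; adjust to wherever the dataset lives.
df = pd.read_csv("AP_Endometrium_Prostate_1.csv")
y = df["target"]                   # assumed target column
X = df.drop(columns=["target"])    # assuming all remaining columns are numeric features

# Compare a permissive configuration (deep trees, many iterations, roughly what a
# wide hyperparameter range could sample) against a much cheaper one.
for params in ({"n_estimators": 1000, "max_depth": 10},
               {"n_estimators": 100, "max_depth": 4}):
    model = CatBoostClassifier(silent=True, allow_writing_files=False,
                               random_state=0, **params)
    start = time.time()
    model.fit(X, y)
    print(params, f"fit took {time.time() - start:.1f}s")
```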

I suspect we should just move this dataset over to low_performing_datasets.yaml

dsherry added the performance label on Jul 27, 2020
dsherry self-assigned this on Jul 28, 2020

dsherry commented Jul 28, 2020

I shared some of my preliminary performance results in standup today. I was able to get catboost's runtime down by changing the n_estimators default from 1000 to 10. It does appear that this had a measurable negative impact on overall first-batch accuracy.

I also proposed some changes to the catboost hyperparameter ranges, specifically to lower the min and max values for n_estimators and to change the distribution we use for that parameter and for tree depth. I think these changes will safeguard against catboost taking up too much training time.
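
For concreteness, here's a sketch of the direction I mean using scikit-optimize space objects. The specific bounds are illustrative rather than the final proposal, and the log-uniform prior on Integer assumes a recent scikit-optimize version:

```python
from skopt.space import Integer

# Current-style ranges (illustrative): wide and uniform, so the tuner can
# sample very expensive configurations in the first batch.
wide_ranges = {
    "n_estimators": Integer(10, 1000),
    "max_depth": Integer(1, 16),
}

# Proposed direction: lower the n_estimators bounds and use a log-uniform
# prior so small, cheap values are sampled more often than very large ones.
tighter_ranges = {
    "n_estimators": Integer(4, 100, prior="log-uniform"),
    "max_depth": Integer(1, 8, prior="log-uniform"),
}
```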

Next steps

  • Put up a PR with the default value change and the current performance results I have, so we can get it in before 0.12.0
  • Make a yaml with a subset of datasets to use for testing
  • Run another before/after test where we limit the model family to catboost, to see what effect the hyperparameter range change has on catboost's performance (see the sketch after this list)
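
Something like the following is what I have in mind for the catboost-only run. The AutoMLSearch class name and the allowed_model_families argument are how I'd expect to restrict the search, and the CSV path/target column are the same assumptions as in the repro sketch above; the exact API on our current release may differ:

```python
import pandas as pd
from evalml.automl import AutoMLSearch  # class name assumed for this sketch

# Same assumed local CSV and target column as the repro sketch above.
df = pd.read_csv("AP_Endometrium_Prostate_1.csv")
y = df["target"]
X = df.drop(columns=["target"])

# Restrict the search to the catboost model family so the before/after
# comparison isolates the hyperparameter-range change.
automl = AutoMLSearch(
    problem_type="multiclass",
    allowed_model_families=["catboost"],
    random_state=0,
)
automl.search(X, y)
print(automl.rankings)
```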
