Automl: first-batch catboost is slow to fit on dataset #979

Closed
dsherry opened this issue Jul 27, 2020 · 1 comment · Fixed by #998
dsherry commented Jul 27, 2020

Problem
I kicked off a looking glass run this morning (logs here). On the first dataset, AP_Endometrium_Prostate_1.csv, I noticed catboost took about 10min to train, whereas other models took about 1min.

I also noticed the following in the log, for logistic regression:

//.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/metrics/_classification.py:2295: RuntimeWarning:
divide by zero encountered in log
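
For reference, that warning is numpy's divide-by-zero warning from taking the log of an exact zero, e.g. a predicted probability of 0 for the true class inside a log-loss style metric; a minimal standalone reproduction, not tied to our code:

```python
import numpy as np

# np.log(0.0) emits "RuntimeWarning: divide by zero encountered in log"
# and returns -inf, which is how a hard-zero predicted probability for the
# true class surfaces inside a log-loss style metric.
probs = np.array([0.0, 0.5, 1.0])
print(np.log(probs))  # [-inf, -0.693..., 0.0] plus the RuntimeWarning
```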

For repro: this would have run with random_state=0.

Next steps
Someone needs to look at this dataset, run automl on it locally, and determine what the problem(s) are. It's possible our hyperparameter ranges for catboost are still too permissive, particularly around the max tree depth. We may want to shrink some ranges and/or define a non-uniform prior. It's also possible the dataset has an error or it's hitting a bug in our code somewhere.
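
For whoever picks this up, here's a minimal sketch for timing catboost on the dataset directly, outside of automl. The local CSV path, the target column name, and the specific hyperparameter values are assumptions for illustration, not what the looking glass run used:

```python
import time

import pandas as pd
from catboost import CatBoostClassifier

# Assumed local path and target column name; adjust to wherever the dataset lives.
df = pd.read_csv("AP_Endometrium_Prostate_1.csv")
y = df["target"]                   # assumed target column
X = df.drop(columns=["target"])    # assuming all remaining columns are numeric features

# Compare a permissive configuration (deep trees, many iterations, roughly what a
# wide hyperparameter range could sample) against a much cheaper one.
for params in ({"n_estimators": 1000, "max_depth": 10},
               {"n_estimators": 100, "max_depth": 4}):
    model = CatBoostClassifier(silent=True, allow_writing_files=False,
                               random_state=0, **params)
    start = time.time()
    model.fit(X, y)
    print(params, f"fit took {time.time() - start:.1f}s")
```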

I suspect we should just move this dataset over to low_performing_datasets.yaml

dsherry added the performance label on Jul 27, 2020
dsherry self-assigned this on Jul 28, 2020

dsherry commented Jul 28, 2020

I shared some of my preliminary performance results in standup today. I was able to get catboost's runtime down by changing the n_estimators default from 1000 to 10. It does appear that this had a measurable negative impact on overall first-batch accuracy.

I also proposed some changes to the catboost hyperparameter ranges, specifically to lower the min and max values for n_estimators and to change the distribution we use for that parameter and for tree depth. I think these changes will safeguard against catboost taking up too much training time.
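
For concreteness, here's a sketch of the direction I mean using scikit-optimize space objects. The specific bounds are illustrative rather than the final proposal, and the log-uniform prior on Integer assumes a recent scikit-optimize version:

```python
from skopt.space import Integer

# Current-style ranges (illustrative): wide and uniform, so the tuner can
# sample very expensive configurations in the first batch.
wide_ranges = {
    "n_estimators": Integer(10, 1000),
    "max_depth": Integer(1, 16),
}

# Proposed direction: lower the n_estimators bounds and use a log-uniform
# prior so small, cheap values are sampled more often than very large ones.
tighter_ranges = {
    "n_estimators": Integer(4, 100, prior="log-uniform"),
    "max_depth": Integer(1, 8, prior="log-uniform"),
}
```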

Next steps

  • Put up a PR with the default value change and the current performance results I have, so we can get it in before 0.12.0
  • Make a yaml with a subset of datasets to use for testing
  • Run another before/after test where we limit the model family to catboost, to see what effect the hyperparameter range change has on catboost's performance (see the sketch after this list)
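
Something like the following is what I have in mind for the catboost-only run. The AutoMLSearch class name and the allowed_model_families argument are how I'd expect to restrict the search, and the CSV path/target column are the same assumptions as in the repro sketch above; the exact API on our current release may differ:

```python
import pandas as pd
from evalml.automl import AutoMLSearch  # class name assumed for this sketch

# Same assumed local CSV and target column as the repro sketch above.
df = pd.read_csv("AP_Endometrium_Prostate_1.csv")
y = df["target"]
X = df.drop(columns=["target"])

# Restrict the search to the catboost model family so the before/after
# comparison isolates the hyperparameter-range change.
automl = AutoMLSearch(
    problem_type="multiclass",
    allowed_model_families=["catboost"],
    random_state=0,
)
automl.search(X, y)
print(automl.rankings)
```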
