Problem
I kicked off a looking glass run this morning (logs here). On the first dataset, AP_Endometrium_Prostate_1.csv, I noticed catboost took about 10min to train, whereas other models took about 1min.
I also noticed the following in the log, for logistic regression:
//.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/metrics/_classification.py:2295: RuntimeWarning:
divide by zero encountered in log
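For context, that warning fires whenever numpy takes the log of a hard zero, which a log-based metric can do if a model assigns probability exactly 0 to the true class. The warning itself is easy to reproduce in isolation with plain numpy (this is just the warning mechanism, not a repro of our pipeline):

```python
import numpy as np

# np.log emits "divide by zero encountered in log" when it hits a hard 0,
# returning -inf rather than raising. A predicted probability of exactly 0
# for the true class would trigger this inside a log-based metric.
probs = np.array([0.0, 0.5, 1.0])
with np.errstate(divide="warn"):
    print(np.log(probs))  # RuntimeWarning: divide by zero encountered in log
```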
For repro: this would have run with random_state=0.
Next steps
Someone needs to look at this dataset, run automl on it locally, and determine what the problem(s) are. It's possible our hyperparameter ranges for catboost are still too permissive, particularly around the max tree depth. We may want to shrink some ranges and/or define a non-uniform prior. It's also possible the dataset has an error, or that we're hitting a bug in our code somewhere.
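As a rough sketch of what a tighter, non-uniform search space could look like (the parameter names, bounds, and distributions below are illustrative, not our actual search space definitions):

```python
# Illustrative only: tighter catboost ranges with a log-scaled prior, using
# scipy distributions. Our automl code may define its spaces differently.
from scipy.stats import loguniform, randint

catboost_ranges = {
    "n_estimators": randint(10, 250),        # narrower than a uniform 10-1000
    "max_depth": randint(4, 9),              # cap depth; deep trees dominate runtime
    "learning_rate": loguniform(1e-3, 3e-1), # log-uniform prior instead of uniform
}
```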
I suspect we should just move this dataset over to low_performing_datasets.yaml
I shared some of my preliminary performance results in standup today. I was able to get catboost's runtime down by changing the n_estimators default from 1000 to 10. It does appear that this had a measurable negative impact on overall first-batch accuracy.
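For reference, the default change itself is a one-liner wherever we construct the estimator, roughly (assuming we pass n_estimators straight through to catboost):

```python
from catboost import CatBoostClassifier

# n_estimators is catboost's alias for iterations; dropping the default
# from 1000 to 10 cuts training time roughly proportionally.
clf = CatBoostClassifier(n_estimators=10, random_state=0, verbose=False)
```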
I also proposed some changes to the catboost hyperparameter ranges, specifically to lower the min and max values for n_estimators and change the distribution we use for that and for tree depth. I think these will safeguard against catboost taking up too much training time.
Next steps
Put up a PR with the default value change and my current performance results, so we can get that in in time for 0.12.0
Make a yaml with a subset of datasets to use for testing
Run another before/after test where we limit the model family to catboost, to see what effect the hyperparameter range change has on catboost's performance (see the sketch after this list).
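For the catboost-only run, something along these lines should work; the `search` entry point, the `allowed_model_families` keyword, and the target column name are all stand-ins for whatever our automl API and this dataset actually use:

```python
# Hypothetical sketch: the import, keyword argument, and "label" column
# are placeholders, not our real API.
import pandas as pd
from automl import search  # hypothetical import

df = pd.read_csv("AP_Endometrium_Prostate_1.csv")
X, y = df.drop(columns=["label"]), df["label"]  # assumed target column name

results = search(X, y, allowed_model_families=["catboost"], random_state=0)
```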