Train forests serially [Resolves #542] #581
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #581     +/-   ##
=========================================
+ Coverage   82.99%      83%    +0.01%
=========================================
  Files          87       87
  Lines        5763     5779       +16
=========================================
+ Hits         4783     4797       +14
- Misses        980      982        +2
Continue to review full report at Codecov.
Made some comments in class_path_implements_parallelization. I guess you can do your n_jobs argument trick if you are able to change that argument from the ScaledLogisticRegression in catwalk rather than use LogisticClassifier; the latter should never be used in our grids.
@@ -202,3 +203,37 @@ def save_db_objects(db_engine, db_objects):
        )
        f.seek(0)
        postgres_copy.copy_from(f, type(db_objects[0]), db_engine, format="csv")


def class_path_implements_parallelization(class_path):
I was wondering if we can change this behavior. I think you can avoid the n_jobs problem by only targeting catwalk.estimators.classifiers.ScaledLogisticRegression instead of LogisticRegression (actually, we should strongly advise against using the latter). If you remove the n_jobs argument from the SLR, I guess your trick will work.
What does @saleiro think about this?
🤔 ☝️
I would refrain from hardcoding the large ones. We could check whether the user passed an 'n_jobs=-1' argument in the configuration parameters for the target class_path. That way this functionality would be equally compatible with xgboost or lightgbm, because they use the 'n_jobs' parameter too.
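A minimal sketch of the check suggested here, under assumed shapes: instead of hardcoding "large" model classes, look at whether the user passed n_jobs=-1 for a given class_path. The `grid_entry` dict and the helper name are hypothetical stand-ins, not code from this PR.

```python
def user_requested_full_parallelization(parameters):
    """Return True if the user asked the classifier itself to use all cores."""
    return parameters.get("n_jobs") == -1


# Hypothetical grid entry for illustration only
grid_entry = {
    "class_path": "sklearn.ensemble.RandomForestClassifier",
    "parameters": {"n_estimators": 1000, "n_jobs": -1},
}

if user_requested_full_parallelization(grid_entry["parameters"]):
    print("run this model serially; the estimator will parallelize itself")
```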
import logging
import sys
from contextlib import contextmanager
import importlib
original ordering made sense 😛 🤷♂️
    Returns: (bool) whether or not the classifier implements parallelization
    """
    if 'RandomForest' in class_path or 'ExtraTrees' in class_path:
🤷♂️ or just:

return 'RandomForest' in class_path or 'ExtraTrees' in class_path
src/triage/experiments/multicore.py
Outdated
if "n_jobs" not in serial_task["train_kwargs"]["parameters"]: | ||
logging.info("n_jobs not configured for serial task, " | ||
"defaulting to -1 for entire machine") | ||
serial_task["train_kwargs"]["parameters"]["n_jobs"] = -1 |
I'm guessing you don't feel it's important to create a copy of the dictionary rather than mutate it in place? (Just want to be sure it's called out.) Certainly I don't see a problem with this as it is, just noting the concern.
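For reference, a sketch of the copy-first alternative raised in this comment: deep-copy the task before filling in the n_jobs default so the caller's dict is left untouched. The shape of `serial_task` mirrors the diff above; the values are made up.

```python
import copy

# Hypothetical task dict shaped like the one in the diff above
serial_task = {"train_kwargs": {"parameters": {"n_estimators": 500}}}

# Work on a copy instead of mutating the original in place
task_copy = copy.deepcopy(serial_task)
if "n_jobs" not in task_copy["train_kwargs"]["parameters"]:
    task_copy["train_kwargs"]["parameters"]["n_jobs"] = -1

# The original task is unchanged; only the copy carries the default
assert "n_jobs" not in serial_task["train_kwargs"]["parameters"]
assert task_copy["train_kwargs"]["parameters"]["n_jobs"] == -1
```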
I want to suggest an optimization to this approach.
Train models in the following order:
- triage.component.catwalk.baselines
- scaled logistic regression, decision trees
- class_paths with 'n_jobs=-1' passed by the user
- all the others (which include the super slow sklearn GradientBoostingClassifier and AdaBoost, for example, or any other learners not so relevant to dsapp, like k-neighbors, so not so urgent...)
Train/test tasks are now implemented in batches, based on what results people should be interested in first:
- Batch 1: Short and important ones, like Decision Trees, Scaled Logistic Regressions, and baselines. These will be parallelized if using an Experiment subclass that supports parallelization.
- Batch 2: Big ones with n_jobs=-1. These will be run serially no matter what the Experiment subclass is, because the classifier is expected to parallelize itself and adding another layer is likely to crash the Experiment.
- Batch 3: All others, parallelized similarly to batch 1. These are ones that might be expected to take a while to complete (Gradient Boosting, forests without n_jobs=-1) and/or ones less likely to be effective.
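A rough sketch of that three-batch ordering, under assumed task and config shapes; the real logic lives in MulticoreExperiment's order_and_batch_tasks and may differ in detail. The `QUICK_CLASS_PATH_HINTS` list and `batch_tasks` name are illustrative, not the PR's actual identifiers.

```python
# Class-path fragments treated as "short and important" for batch 1 (assumed list)
QUICK_CLASS_PATH_HINTS = (
    "triage.component.catwalk.baselines",
    "ScaledLogisticRegression",
    "DecisionTreeClassifier",
)


def batch_tasks(tasks):
    """Split train/test tasks into (quick, self-parallelizing, everything else)."""
    batch_1, batch_2, batch_3 = [], [], []
    for task in tasks:
        class_path = task["train_kwargs"]["class_path"]
        parameters = task["train_kwargs"]["parameters"]
        if any(hint in class_path for hint in QUICK_CLASS_PATH_HINTS):
            batch_1.append(task)   # short, important: run first, parallelized
        elif parameters.get("n_jobs") == -1:
            batch_2.append(task)   # classifier parallelizes itself: run serially
        else:
            batch_3.append(task)   # slower and/or less critical: run last, parallelized
    return batch_1, batch_2, batch_3
```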
0248c26 to 4d71a61
        )
        return self.order_and_batch_tasks(train_test_tasks)

    def order_and_batch_tasks(self, tasks):
Nice!
In MulticoreExperiment, partition models to be trained/tested into two buckets: 'large ones' (random forests, extra trees) and 'small ones' (everything else). Train the small ones first, as they are trained now, and then train the large ones serially, maxing out n_jobs if it's not set. This is for RAM stability, as trying to parallelize classifiers like random forests can cause memory to spike and kill the experiment.
Under a dirtyduck test, this actually ended up taking 25% longer given the same parallelization. This makes sense, as running forests with multiprocessing will go pretty fast if you don't run out of memory. However, the unstable memory usage of the old way often leads to killed experiments which add a ton of extra human time and an extra experiment run (or two!), which is far worse. With the new method we can more reliably set n_processes to just be the number of cores on the machine.
before / after: (screenshots from the original PR, not reproduced here)