Train forests serially [Resolves #542] #581
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #581     +/-   ##
=========================================
+ Coverage   82.99%      83%    +0.01%
=========================================
  Files          87       87
  Lines        5763     5779       +16
=========================================
+ Hits         4783     4797       +14
- Misses        980      982        +2
Continue to review full report at Codecov.
Made some comments in class_path_implements_parallelization. I guess you can do your n_jobs argument trick if you are able to change that argument from the ScaledLogisticRegression in catwalk rather than use LogisticClassifier; the latter should never be used in our grids.
@@ -202,3 +203,37 @@ def save_db_objects(db_engine, db_objects):
        )
        f.seek(0)
        postgres_copy.copy_from(f, type(db_objects[0]), db_engine, format="csv")


def class_path_implements_parallelization(class_path):
I was wondering if we can change this behavior. I think you can avoid the n_jobs problem by only targeting catwalk.estimators.classifiers.ScaledLogisticRegression instead of LogisticRegression (actually, we should strongly advise against using the latter). If you remove the n_jobs argument from the SLR, I guess your trick will work.
What does @saleiro think about this?
🤔 ☝️
I would refrain from hardcoding the large ones. We could check whether the user passed an 'n_jobs=-1' argument in the configuration parameters for the target class_path. That way this functionality would be equally compatible with xgboost or lightgbm, because they use the 'n_jobs' parameter too.
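A minimal sketch of the check suggested here, under assumed shapes: instead of hardcoding "large" model classes, look at whether the user passed n_jobs=-1 for a given class_path. The `grid_entry` dict and the helper name are hypothetical stand-ins, not code from this PR.

```python
def user_requested_full_parallelization(parameters):
    """Return True if the user asked the classifier itself to use all cores."""
    return parameters.get("n_jobs") == -1


# Hypothetical grid entry for illustration only
grid_entry = {
    "class_path": "sklearn.ensemble.RandomForestClassifier",
    "parameters": {"n_estimators": 1000, "n_jobs": -1},
}

if user_requested_full_parallelization(grid_entry["parameters"]):
    print("run this model serially; the estimator will parallelize itself")
```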
import logging
import sys
from contextlib import contextmanager
import importlib
original ordering made sense 😛 🤷♂️
    Returns: (bool) whether or not the classifier implements parallelization
    """
    if 'RandomForest' in class_path or 'ExtraTrees' in class_path:
🤷♂️ or just:

return 'RandomForest' in class_path or 'ExtraTrees' in class_path
src/triage/experiments/multicore.py
Outdated
if "n_jobs" not in serial_task["train_kwargs"]["parameters"]: | ||
logging.info("n_jobs not configured for serial task, " | ||
"defaulting to -1 for entire machine") | ||
serial_task["train_kwargs"]["parameters"]["n_jobs"] = -1 |
I'm guessing you don't feel it's important to create a copy of the dictionary rather than mutate it in place? (Just want to be sure it's called out.) Certainly I don't see a problem with this as it is, just noting the concern.
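For reference, a sketch of the copy-first alternative raised in this comment: deep-copy the task before filling in the n_jobs default so the caller's dict is left untouched. The shape of `serial_task` mirrors the diff above; the values are made up.

```python
import copy

# Hypothetical task dict shaped like the one in the diff above
serial_task = {"train_kwargs": {"parameters": {"n_estimators": 500}}}

# Work on a copy instead of mutating the original in place
task_copy = copy.deepcopy(serial_task)
if "n_jobs" not in task_copy["train_kwargs"]["parameters"]:
    task_copy["train_kwargs"]["parameters"]["n_jobs"] = -1

# The original task is unchanged; only the copy carries the default
assert "n_jobs" not in serial_task["train_kwargs"]["parameters"]
assert task_copy["train_kwargs"]["parameters"]["n_jobs"] == -1
```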
I want to suggest an optimization to this approach.
Train models in the following order:
- triage.component.catwalk.baselines
- scaled logistic regression, decision trees
- class_paths with 'n_jobs=-1' passed by the user
- all the others (which include the super slow sklearn GradientBoostingClassifier and AdaBoost, for example, or any other learners not so relevant to dsapp, like k-neighbors, so not so urgent...)
Train/test tasks are now implemented in batches, based on what results people should be interested in first:
- Batch 1: Short and important ones, like Decision Trees, Scaled Logistic Regressions, and baselines. These will be parallelized if using an Experiment subclass that supports parallelization.
- Batch 2: Big ones with n_jobs=-1. These will be run serially no matter what the Experiment subclass is, because the classifier is expected to parallelize itself and adding another layer is likely to crash the Experiment.
- Batch 3: All others, parallelized similarly to batch 1. These are ones that might be expected to take a while to complete (Gradient Boosting, forests without n_jobs=-1) and/or ones less likely to be effective.
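A rough sketch of that three-batch ordering, under assumed task and config shapes; the real logic lives in MulticoreExperiment's order_and_batch_tasks and may differ in detail. The `QUICK_CLASS_PATH_HINTS` list and `batch_tasks` name are illustrative, not the PR's actual identifiers.

```python
# Class-path fragments treated as "short and important" for batch 1 (assumed list)
QUICK_CLASS_PATH_HINTS = (
    "triage.component.catwalk.baselines",
    "ScaledLogisticRegression",
    "DecisionTreeClassifier",
)


def batch_tasks(tasks):
    """Split train/test tasks into (quick, self-parallelizing, everything else)."""
    batch_1, batch_2, batch_3 = [], [], []
    for task in tasks:
        class_path = task["train_kwargs"]["class_path"]
        parameters = task["train_kwargs"]["parameters"]
        if any(hint in class_path for hint in QUICK_CLASS_PATH_HINTS):
            batch_1.append(task)   # short, important: run first, parallelized
        elif parameters.get("n_jobs") == -1:
            batch_2.append(task)   # classifier parallelizes itself: run serially
        else:
            batch_3.append(task)   # slower and/or less critical: run last, parallelized
    return batch_1, batch_2, batch_3
```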
0248c26 to 4d71a61
        )
        return self.order_and_batch_tasks(train_test_tasks)

    def order_and_batch_tasks(self, tasks):
Nice!
In MulticoreExperiment, partition models to be trained/tested into two buckets: 'large ones' (random forests, extra trees) and 'small ones' (everything else). Train the small ones first, as they are trained now, and then train the large ones serially, maxing out n_jobs if it's not set. This is for RAM stability, as trying to parallelize classifiers like random forests can cause memory to spike and kill the experiment.
Under a dirtyduck test, this actually ended up taking 25% longer given the same parallelization. This makes sense, as running forests with multiprocessing will go pretty fast if you don't run out of memory. However, the unstable memory usage of the old way often leads to killed experiments which add a ton of extra human time and an extra experiment run (or two!), which is far worse. With the new method we can more reliably set n_processes to just be the number of cores on the machine.
before / after: (screenshots from the original PR, not reproduced here)