[tabular] Parallel Model Training #4215

Open
Innixma opened this issue May 21, 2024 · 1 comment
Labels: enhancement (New feature or request), feature: distributed (Related to Distributed AutoGluon), module: tabular

Innixma (Contributor) commented May 21, 2024

Related: #4213

We should add support for parallel model training via Ray that goes beyond the currently implemented parallel fold model fitting and parallel HPO.

Given a portfolio of 100 models, we should be able to train several of them simultaneously when sufficient resources are available.

This should also be extended to work with a distributed cluster.

Pseudocode Example

Current logic:

models = []
for model in portfolio:
    models.append(fit_model(model))

Code that does this in mainline:

for i, model in enumerate(models):
    if isinstance(model, str):
        # models may be passed by name; load the persisted model from disk
        model = self.load_model(model)
    elif self.low_memory:
        model = copy.deepcopy(model)
    # pick the per-model HPO configuration, if any
    if hyperparameter_tune_kwargs is not None and isinstance(hyperparameter_tune_kwargs, dict):
        hyperparameter_tune_kwargs_model = hyperparameter_tune_kwargs.get(model.name, None)
    else:
        hyperparameter_tune_kwargs_model = None
    # TODO: Only update scores when finished, only update model as part of final models if finished!
    # compute the remaining time budget for this model
    if time_split:
        time_left = time_limit_model_split
    else:
        if time_limit is None:
            time_left = None
        else:
            time_start_model = time.time()
            time_left = time_limit - (time_start_model - time_start)
    model_name_trained_lst = self._train_single_full(
        X, y, model, time_limit=time_left, hyperparameter_tune_kwargs=hyperparameter_tune_kwargs_model, **kwargs
    )
    if self.low_memory:
        del model
    models_valid += model_name_trained_lst
return models_valid

Proposed logic:

# Need smart logic to determine how to schedule the models and how many resources to give each one
models = ray_parallel(portfolio)
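
A minimal sketch (assumed helper names, not AutoGluon's actual API) of what ray_parallel could look like using plain Ray tasks. fit_model, portfolio, X, and y are hypothetical placeholders; a real implementation would also need per-model resource requests, time limits, and failure handling.

import ray

ray.init(ignore_reinit_error=True)

@ray.remote(num_cpus=4)  # resources reserved per model fit; a smart scheduler would set this per model type
def fit_model_remote(model_config, X, y):
    # placeholder for the actual per-model fitting logic
    return fit_model(model_config, X, y)

def ray_parallel(portfolio, X, y):
    # put the training data into the Ray object store once so all tasks share it
    X_ref = ray.put(X)
    y_ref = ray.put(y)
    # launch all fits; Ray runs as many concurrently as cluster resources allow,
    # and the same code works unchanged on a multi-node Ray cluster
    futures = [fit_model_remote.remote(model, X_ref, y_ref) for model in portfolio]
    return ray.get(futures)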

jmakov commented Jun 12, 2024

Ray Tune offers distributed HPO, and so does Optuna. There's also https://github.com/fugue-project/fugue, which offers an abstract interface supporting different distributed backends (Dask, Ray, Spark, etc.).
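
For reference, a minimal sketch of distributed HPO with Ray Tune plus the Optuna search algorithm mentioned above; train_fn and the search space are illustrative placeholders, not AutoGluon code.

from ray import tune
from ray.tune.search.optuna import OptunaSearch

def train_fn(config):
    # train a model with the sampled hyperparameters and return a validation score;
    # the objective below is a dummy stand-in for illustration
    score = 1.0 - config["learning_rate"]
    return {"score": score}

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4}),  # resources per trial
    param_space={"learning_rate": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(),  # Optuna proposes hyperparameter configurations
        metric="score",
        mode="max",
        num_samples=20,  # trials are scheduled in parallel across the cluster
    ),
)
results = tuner.fit()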
