
Replace lambdas with partials to allow multi-gpu training #160

Closed · wants to merge 2 commits

Conversation

BramVanroy (Contributor)

Currently, the CometModel defines lambda functions. These cannot be pickled, and therefore multiprocessing (and with it multi-GPU training) is not possible.

This PR replaces the lambda functions with functools.partial objects, which should be picklable.

closes #159
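
For context, this is standard Python pickling behavior: a lambda has no importable qualified name, so `pickle` rejects it, while a `functools.partial` wrapping a module-level function serializes fine. A minimal, self-contained sketch of the pattern (the `scale` function below is illustrative, not COMET's actual code):

```python
import pickle
from functools import partial


def scale(x, factor):
    return x * factor


# A lambda cannot be pickled: it has no importable qualified name.
fn = lambda x: scale(x, factor=2)
try:
    pickle.dumps(fn)
except (pickle.PicklingError, AttributeError) as err:
    print(f"lambda is not picklable: {err}")

# A partial over a module-level function pickles fine, so it survives the
# process spawning that multi-GPU (DDP) training relies on.
fn = partial(scale, factor=2)
restored = pickle.loads(pickle.dumps(fn))
assert restored(3) == 6
```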

BramVanroy marked this pull request as draft August 8, 2023 11:48
BramVanroy (Contributor, Author) commented Aug 8, 2023

This PR is currently not ready for integration. I have done some work on making the RegressionMetrics(Metric) metric (not the model) compatible with distributed computing, and this seems to work for training. However, I can't get it to work for prediction.
As far as I can tell, the issue is that the metrics are not gathered correctly across the different processes. So in the following piece of code, RegressionMetrics should get dist_sync_on_step=True, but only in distributed settings.

```python
def init_metrics(self):
    """Initializes train/validation metrics."""
    self.train_metrics = RegressionMetrics(prefix="train")
    self.val_metrics = nn.ModuleList(
        [RegressionMetrics(prefix=d) for d in self.hparams.validation_data]
    )
```

PyTorch Lightning has so many hoops (subclasses, properties) to jump through that I lost my patience figuring out how to do something only "if multi-GPU" or "if distributed" (see the sketch below). For someone who knows PyTorch Lightning well, this is perhaps an easy fix, so feel free to chime in.
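
One possible shape for that conditional, sketched under the assumption (unverified) that `RegressionMetrics` forwards `dist_sync_on_step` to the torchmetrics `Metric` base class:

```python
import torch.distributed as dist


def init_metrics(self):
    """Initializes train/validation metrics, syncing per step only when distributed."""
    # Only request per-step synchronization when a process group actually exists.
    # Caveat: this assumes init_metrics runs after the process group is set up,
    # which may not hold in Lightning's setup order.
    sync = dist.is_available() and dist.is_initialized()
    self.train_metrics = RegressionMetrics(prefix="train", dist_sync_on_step=sync)
    self.val_metrics = nn.ModuleList(
        [
            RegressionMetrics(prefix=d, dist_sync_on_step=sync)
            for d in self.hparams.validation_data
        ]
    )
```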

A test case for the distributed scenario has also been added. Note that while developing this, I updated torchmetrics to the newest version to rule out underlying issues.
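
(Not the PR's actual test, but for illustration: the precondition that multi-GPU training imposes is that the model survives a pickle round-trip, so a minimal smoke test could look like the following, with `model` standing in for any instantiated CometModel.)

```python
import pickle


def test_model_is_picklable(model):
    """Fails while the model still holds unpicklable members such as lambdas."""
    restored = pickle.loads(pickle.dumps(model))
    assert type(restored) is type(model)
```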

BramVanroy mentioned this pull request Aug 8, 2023
ricardorei closed this Sep 18, 2023