
Parallel training #133

Open · wants to merge 153 commits into master
Conversation

@yorickbrunet commented Nov 24, 2023

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description
This work introduces the parallelisation of giotto-deep on multiple GPUs via two methods: PyTorch's FSDP and its pipeline tools.

The version of PyTorch is increased to 1.13.1 to support the features required by FSDP.

To allow the parallelisation to run efficiently, a new sampler, GiottoSampler, is defined that combines DistributedSampler and SubsetRandomSampler.

A benchmark tool allows running a model with different batch sizes on different GPU models and numbers of GPUs, to compare parallelised and non-parallelised runs. A generator of Kubernetes pods takes some GKE details as input and outputs the pod configurations, allowing users to build their own configurations for their own clusters.
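The idea of combining DistributedSampler and SubsetRandomSampler can be sketched in plain Python. This is an illustrative stand-in, not the actual GiottoSampler implementation; the class name and the round-robin split across ranks are assumptions:

```python
import random
from typing import Iterator, Sequence


class SubsetDistributedSampler:
    """Sketch: shuffle a subset of dataset indices, then partition
    the shuffled indices across distributed ranks (round-robin)."""

    def __init__(self, indices: Sequence[int], num_replicas: int, rank: int, seed: int = 0):
        self.indices = list(indices)
        self.num_replicas = num_replicas
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Like DistributedSampler.set_epoch: vary the shuffle each epoch
        # while keeping all ranks in sync.
        self.epoch = epoch

    def __iter__(self) -> Iterator[int]:
        # All ranks must shuffle identically, so seed from (seed, epoch).
        rng = random.Random(self.seed + self.epoch)
        shuffled = list(self.indices)
        rng.shuffle(shuffled)
        # Each rank takes every num_replicas-th index: disjoint, covering.
        return iter(shuffled[self.rank :: self.num_replicas])

    def __len__(self) -> int:
        return len(self.indices[self.rank :: self.num_replicas])
```

With two replicas, the two ranks see disjoint halves of the subset, and together they cover it exactly, which is the property a distributed subset sampler needs.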

Any other comments?

Checklist

  • I have read the guidelines for contributing.
  • My code follows the code style of this project. I used flake8 to check my Python changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed. I used pytest to check this on Python tests.

@matteocao (Contributor) left a comment

OK for me! As long as the CI passes.

@matteocao (Contributor)

@raphaelreinauer I see that @yorickbrunet answered to all your comments. Are you satisfied or is there anything else you would like to discuss?
If there is nothing more, I'll merge the PR.

@raphaelreinauer (Collaborator)

@matteocao thanks for checking in and thanks @yorickbrunet for your answers so far.

However, there's one key aspect that still needs attention - the PR description:

> Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

@yorickbrunet Could you please add a detailed description of all the features you added, then I can do a detailed PR review.

@yorickbrunet (Author)

@raphaelreinauer we answered or fixed all comments. Can you have another look, close the issues that can be closed, and maybe approve the PR? Thanks

@raphaelreinauer (Collaborator)

> Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks

@yorickbrunet (Author)

> Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.
>
> Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks

Hi @raphaelreinauer. I improved the general description of the PR, but the important part was already there: this work introduces the parallelisation of giotto-deep on multiple GPUs via two methods, PyTorch's FSDP and its pipeline tools. Even though 52 files are modified, the modifications in many of them are quite small. The most interesting file is gdeep/trainer/trainer.py, where most of the modifications were made.

@raphaelreinauer (Collaborator)

Thanks, @yorickbrunet, for the changes. I'll review them by Friday next week.

@raphaelreinauer (Collaborator)

Unfortunately, I didn't have time to review the PR last weekend, but I'll do it this weekend. Sorry for the delay.

@raphaelreinauer (Collaborator) left a comment

Hi @yorickbrunet, I've reviewed your additions and added some comments for your consideration. Also, it might be beneficial to include some tests. Thanks!

.gitignore Outdated
@@ -4,6 +4,7 @@
*.pyd
**/__pycache__

*data*
Collaborator:

This might be too restrictive, e.g. it would also include data_processor.py.

Author:

done

echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections && \
apt-get update && \
apt-get install -y \
python3 python3-pip \
Collaborator:

Please pin specific versions to ensure reproducibility.

Author:

The packages come from Ubuntu's repositories. There is no version to pin, as they won't change during the life of this version of the distribution.

RUN cd giotto-deep && \
pip3 install --no-cache-dir --disable-pip-version-check -r requirements.txt

COPY ./benchmark/requirements.txt giotto-deep/requirements2.txt
Collaborator:

requirements2.txt is not the best name; make it more descriptive.

Author:

It will be renamed to requirements.txt, which in the end is not much better.

return RunData(r.start_time, r.end_time, model, parallel, epochs, batch_size, r.loss, r.accuracy, device_count, device_model)


def uniq(data: typing.List[RunData]):
Collaborator:

What is this function doing? Can you simplify it as list(set(data))?

Author:

Actually, no. Some elements of the list may be of the same class but have different generation times, e.g. benchmark runs that were restarted. The function keeps the most recent element of each class.

I added a comment in the function.

plt.savefig(str(imgfile))


def plot_csv(run_data: typing.List[RunData], img_dir: pathlib.Path, now: datetime.datetime):
Collaborator:

This function has a lot of repeated code. Could you consider refactoring to reduce duplication and complexity?

Author:

Yes, the function is complex, but so is the data to fetch.
Some pieces of code are duplicated, but mostly because the two blocks build their data with different loops and cannot share the exact same code.
Thus I don't think a refactoring would really help.

Collaborator:

I still think that four nested loops are too complicated and should be refactored.


with open("pod-template-plot.yml", "r") as fp:
ymlt = string.Template(fp.read())
ymlv = ymlt.substitute(values)
Collaborator:

Add a timestamp or unique identifier to the output filename.

Author:

That would result in a mess of files.
People can instead move the files of a given batch into folders.

# general enough, and backends like XLA can reuse them in Colab notebooks as well.
# Currently we only add this API first, we can consider adding it to documentation as
# needed in the future.
def spawn(fn, args=(), nprocs=1):
Collaborator:

Add error handling.

Author:

I don't see how to add error handling here.
This code is heavily based on https://github.com/pytorch/pytorch/blob/v1.13.1/torch/multiprocessing/spawn.py#L178, where there is also no error handling, so I assume we're fine.

self.world_size = len(self.devices)


def setup_env():
Collaborator:

Why do we need to set env variables here? This looks very error-prone.

Author:

This is necessary for parallel training with RPC.
There is no choice but to define env variables.

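For context, the kind of environment setup RPC-based training needs looks roughly like the sketch below. MASTER_ADDR and MASTER_PORT are the variable names PyTorch's distributed/RPC rendezvous actually reads; the function name and default values here are illustrative, not the project's actual code:

```python
import os


def setup_env(master_addr: str = "localhost", master_port: str = "29500") -> None:
    """Point every worker process at the same rendezvous endpoint.

    PyTorch's distributed/RPC initialization reads MASTER_ADDR and
    MASTER_PORT from the environment, so they must be set before
    init_rpc / init_process_group is called in each process.
    """
    # Address and port of the rank-0 rendezvous endpoint.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
```

Since child processes inherit the environment, calling a function like this before spawning workers is the usual way to propagate the rendezvous configuration.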
Comment on lines 3 to 4
RUN ln -snf /usr/share/zoneinfo/Europe/Zurich /etc/localtime && \
echo Europe/Zurich > /etc/timezone && \
Collaborator:

I still think it's better to remove it since specifying the time zone does not seem necessary. But keep it if you prefer.

Collaborator:

I think it would be easier to specify the node pools as config files.

Author:

Such a change would require too much work and retesting for a project that is now closed.
I think it will be OK for this version; anyone can adapt it later as they wish.

Collaborator:

I think this should still be changed.

n_sentences_to_consider=4000

tmp_path=os.path.join('./cola_public','raw','in_domain_train.tsv')
df = pd.read_csv(tmp_path, delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

@VascoSch92:

Put a comma after names=[...]. This will make the line more readable.

Collaborator:

@VascoSch92 Could you please explain what you mean exactly?

@VascoSch92:

Instead of writing

df = pd.read_csv(tmp_path, delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

use

df = pd.read_csv(
    tmp_path, 
    delimiter='\t', 
    header=None, 
    names=['sentence_source', 'label', 'label_notes', 'sentence'],
)

Note the comma at the end of names. If you are using an automatic formatter (like yapf or black) and you put a comma at the end, it will format that for you.

The second one is much easier to read: you see exactly which method/class/function you are using and what the parameters are.

@raphaelreinauer (Collaborator) commented Jan 29, 2024:

Thanks for the clarification @VascoSch92, I agree with your suggestion. We have a pre-commit hook that runs the black formatter.

@yorickbrunet Could you please use that? See here: https://github.com/giotto-ai/giotto-deep?tab=readme-ov-file#contributing

Author:

done

@raphaelreinauer (Collaborator)

Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?

@yorickbrunet (Author)

> Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?

Hi @raphaelreinauer, I answered all comments with either a modification of the code or a justification of why it cannot or won't be modified. Basically, the project is closed and we cannot spend two more weeks adding tests, retesting the setup after modifications, etc.

@yorickbrunet (Author) commented Mar 8, 2024

> Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?

@raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.

@VascoSch92

> Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?
>
> @raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.

@yorickbrunet for me it's good. You can resolve the discussions ;-) (I cannot)

Comment on lines +362 to +388
    """Keep the most recent elements of each class.

    Some elements of the list may be of the same class but of different generation
    time, e.g. some benchmark runs that were restarted.
    """
    data2 = []
    idx = 0
    # parse every element in the list (except those removed during the process)
    while idx < len(data):
        jdx = idx + 1
        keep = data[idx]  # set current data as kept
        # parse every further element in the list (except those removed during the process)
        while jdx < len(data):
            # if the currently kept element and the current element are of the same "class" ...
            if data[jdx].same(keep):
                # ... check whether the current element is greater than the kept one ...
                if data[jdx].gt(keep):
                    # ... and keep the current element if it is greater
                    keep = data.pop(jdx)
                else:
                    # ... or just remove the current element if it is not
                    del data[jdx]
            else:
                jdx += 1
        data2.append(keep)
        idx += 1
    return data2
Collaborator:

The naming is terrible. Why is uniq not unique or, better, a more descriptive name? Also, data2 doesn't say anything. The logic is super complicated: two nested while loops with nested if-else statements, and the comments are just repeating what the code already expresses.

This can be written more concisely as:

def get_unique_latest_runs(data: typing.List[RunData]) -> typing.List[RunData]:
    unique_config_latest_run = {}

    for run in data:
        configuration_key = (run.model, run.parallel, run.batch_size, run.gpu_count, run.gpu_model)
        latest = unique_config_latest_run.get(configuration_key)
        if latest is None or run.end_time > latest.end_time:
            unique_config_latest_run[configuration_key] = run

    return list(unique_config_latest_run.values())
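To make the suggestion concrete, here is a self-contained version with a minimal stand-in for RunData; the field names are taken from the suggested configuration key, not from the project's actual class:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunData:
    """Minimal stand-in for the benchmark's RunData (fields assumed)."""
    model: str
    parallel: str
    batch_size: int
    gpu_count: int
    gpu_model: str
    end_time: float


def get_unique_latest_runs(data: List[RunData]) -> List[RunData]:
    # Index runs by their configuration; keep only the latest per key.
    latest = {}
    for run in data:
        key = (run.model, run.parallel, run.batch_size, run.gpu_count, run.gpu_model)
        if key not in latest or run.end_time > latest[key].end_time:
            latest[key] = run
    return list(latest.values())


runs = [
    RunData("resnet", "fsdp", 32, 2, "T4", end_time=100.0),
    RunData("resnet", "fsdp", 32, 2, "T4", end_time=250.0),  # restarted run
    RunData("resnet", "none", 32, 1, "T4", end_time=50.0),
]
deduped = get_unique_latest_runs(runs)
```

The dictionary makes the "latest run per configuration" invariant explicit, replacing the two nested while loops with a single pass and no in-place mutation of the input list.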

Comment on lines +22 to +27
class Parallelism(enum.Enum):
    none = enum.auto()
    fsdp_full_shard = enum.auto()
    fsdp_shard_grad_op = enum.auto()
    fsdp_no_shard = enum.auto()
    pipeline = enum.auto()
Collaborator:

Parallelism is a mixture of a ParallelismType and a sharding strategy - instead one should use a composite of both.
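One way to express that composition, sketched here with hypothetical names (these are not existing giotto-deep classes):

```python
import enum
from dataclasses import dataclass
from typing import Optional


class ParallelismType(enum.Enum):
    NONE = enum.auto()
    FSDP = enum.auto()
    PIPELINE = enum.auto()


class ShardingStrategy(enum.Enum):
    # Mirrors the FSDP sharding strategies folded into the original enum.
    FULL_SHARD = enum.auto()
    SHARD_GRAD_OP = enum.auto()
    NO_SHARD = enum.auto()


@dataclass(frozen=True)
class ParallelismConfig:
    kind: ParallelismType
    sharding: Optional[ShardingStrategy] = None

    def __post_init__(self) -> None:
        # A sharding strategy only makes sense when FSDP is selected.
        if self.kind is ParallelismType.FSDP and self.sharding is None:
            raise ValueError("FSDP requires a sharding strategy")
        if self.kind is not ParallelismType.FSDP and self.sharding is not None:
            raise ValueError("a sharding strategy only applies to FSDP")
```

This keeps the parallelism method and the FSDP sharding strategy as orthogonal axes, and the validation in __post_init__ rejects the combinations the flat enum silently ruled out.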
