
Parallel training #133

Open · wants to merge 153 commits into master
Conversation

@yorickbrunet commented Nov 24, 2023

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description
This work introduces the parallelisation of giotto-deep on multiple GPUs via two methods: PyTorch's FSDP and its pipeline tools.

The version of PyTorch is increased to 1.13.1 to support the features required by FSDP.

To allow the parallelisation to run efficiently, a new sampler, GiottoSampler, is defined that combines DistributedSampler and SubsetRandomSampler.

A benchmark tool allows running a model with different batch sizes on different GPU models and numbers of GPUs, to compare parallelised and non-parallelised runs. A generator of Kubernetes pods takes some GKE details as input and outputs the pod configurations, allowing users to build their own configurations for their own clusters.
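The idea of combining DistributedSampler and SubsetRandomSampler can be sketched in plain Python. This is an illustrative stand-in, not the actual GiottoSampler implementation; the class name and the round-robin split across ranks are assumptions:

```python
import random
from typing import Iterator, Sequence


class SubsetDistributedSampler:
    """Sketch: shuffle a subset of dataset indices, then partition
    the shuffled indices across distributed ranks (round-robin)."""

    def __init__(self, indices: Sequence[int], num_replicas: int, rank: int, seed: int = 0):
        self.indices = list(indices)
        self.num_replicas = num_replicas
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Like DistributedSampler.set_epoch: vary the shuffle each epoch
        # while keeping all ranks in sync.
        self.epoch = epoch

    def __iter__(self) -> Iterator[int]:
        # All ranks must shuffle identically, so seed from (seed, epoch).
        rng = random.Random(self.seed + self.epoch)
        shuffled = list(self.indices)
        rng.shuffle(shuffled)
        # Each rank takes every num_replicas-th index: disjoint, covering.
        return iter(shuffled[self.rank :: self.num_replicas])

    def __len__(self) -> int:
        return len(self.indices[self.rank :: self.num_replicas])
```

With two replicas, the two ranks see disjoint halves of the subset, and together they cover it exactly, which is the property a distributed subset sampler needs.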

Any other comments?

Checklist

  • I have read the guidelines for contributing.
  • My code follows the code style of this project. I used flake8 to check my Python changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed. I used pytest to check this on Python tests.

@matteocao (Contributor) left a comment

OK for me! As long as the CI passes.

@matteocao (Contributor)

@raphaelreinauer I see that @yorickbrunet answered to all your comments. Are you satisfied or is there anything else you would like to discuss?
If there is nothing more, I'll merge the PR.

@raphaelreinauer (Collaborator)

@matteocao thanks for checking in and thanks @yorickbrunet for your answers so far.

However, there's one key aspect that still needs attention - the PR description:

> Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

@yorickbrunet Could you please add a detailed description of all the features you added, then I can do a detailed PR review.

@yorickbrunet (Author)

@raphaelreinauer we answered or fixed all comments. Can you have another look, close the issues that can be closed, and maybe approve the PR? Thanks

@raphaelreinauer (Collaborator)

> Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks

@yorickbrunet (Author)

> Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.
>
> Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks

Hi @raphaelreinauer. I improved the general description of the PR, but the important part was already there: this work introduces the parallelisation of giotto-deep on multiple GPUs via two methods, PyTorch's FSDP and its pipeline tools. Even though 52 files are modified, the modifications in many of them are quite small. The most interesting file is gdeep/trainer/trainer.py, where most of the modifications were made.

@raphaelreinauer (Collaborator)

Thanks, @yorickbrunet, for the changes. I'll review them by Friday next week.

@raphaelreinauer (Collaborator)

Unfortunately, I didn't have time to review the PR last weekend, but I'll do it this weekend. Sorry for the delay.

@raphaelreinauer (Collaborator) left a comment

Hi @yorickbrunet, I've reviewed your additions and added some comments for your consideration. Also, it might be beneficial to include some tests. Thanks!

.gitignore Outdated
@@ -4,6 +4,7 @@
*.pyd
**/__pycache__

*data*
Collaborator:

This might be too restrictive, e.g. it would also include data_processor.py.

Author:

done

echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections && \
apt-get update && \
apt-get install -y \
python3 python3-pip \
Collaborator:

Please pin specific versions to ensure reproducibility.

Author:

The packages come from Ubuntu's repositories. There is no version to pin, as they won't change during the life of this version of the distribution.

RUN cd giotto-deep && \
pip3 install --no-cache-dir --disable-pip-version-check -r requirements.txt

COPY ./benchmark/requirements.txt giotto-deep/requirements2.txt
Collaborator:

requirements2.txt is not the best name; make it more descriptive.

Author:

It will be renamed to requirements.txt, which in the end is not much better.

return RunData(r.start_time, r.end_time, model, parallel, epochs, batch_size, r.loss, r.accuracy, device_count, device_model)


def uniq(data: typing.List[RunData]):
Collaborator:

What is this function doing? Can you simplify it as list(set(data))?

Author:

Actually, no. Some elements of the list may be of the same class but have different generation times, e.g. benchmark runs that were restarted. The function keeps the most recent element of each class.

I added a comment in the function.

plt.savefig(str(imgfile))


def plot_csv(run_data: typing.List[RunData], img_dir: pathlib.Path, now: datetime.datetime):
Collaborator:

This function has a lot of repeated code. Could you consider refactoring to reduce duplication and complexity?

Author:

Yes, the function is complex, but so is the data to fetch.
Some pieces of code are duplicated, but mostly because the two blocks build their data with different loops and cannot share the exact same code.
Thus I don't think a refactoring would really help.

Collaborator:

I still think that four nested loops are too complicated and should be refactored.


with open("pod-template-plot.yml", "r") as fp:
ymlt = string.Template(fp.read())
ymlv = ymlt.substitute(values)
Collaborator:

Add a timestamp or unique identifier to the output filename.

Author:

That would result in a mess of files.
People can instead move the files of a given batch into folders.

# general enough, and backends like XLA can reuse them in Colab notebooks as well.
# Currently we only add this API first, we can consider adding it to documentation as
# needed in the future.
def spawn(fn, args=(), nprocs=1):
Collaborator:

Add error handling.

Author:

I don't see how to add error handling here.
This code is heavily based on https://github.com/pytorch/pytorch/blob/v1.13.1/torch/multiprocessing/spawn.py#L178, where there is also no error handling, so I assume we're fine.

self.world_size = len(self.devices)


def setup_env():
Collaborator:

Why do we need to set env variables here? This looks very error-prone.

Author:

This is necessary for parallel training with RPC.
There is no choice but to define env variables.

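For context, the kind of environment setup RPC-based training needs looks roughly like the sketch below. MASTER_ADDR and MASTER_PORT are the variable names PyTorch's distributed/RPC rendezvous actually reads; the function name and default values here are illustrative, not the project's actual code:

```python
import os


def setup_env(master_addr: str = "localhost", master_port: str = "29500") -> None:
    """Point every worker process at the same rendezvous endpoint.

    PyTorch's distributed/RPC initialization reads MASTER_ADDR and
    MASTER_PORT from the environment, so they must be set before
    init_rpc / init_process_group is called in each process.
    """
    # Address and port of the rank-0 rendezvous endpoint.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
```

Since child processes inherit the environment, calling a function like this before spawning workers is the usual way to propagate the rendezvous configuration.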
Comment on lines 3 to 4
RUN ln -snf /usr/share/zoneinfo/Europe/Zurich /etc/localtime && \
echo Europe/Zurich > /etc/timezone && \
Collaborator:

I still think it's better to remove it since specifying the time zone does not seem necessary. But keep it if you prefer.

Collaborator:

I think it would be easier to specify the node pools as config files.

Author:

Such a change would require too much work and retesting for a project that is now closed.
I think it will be OK for this version; anyone can adapt it later as they wish.

Collaborator:

I think this should still be changed.

n_sentences_to_consider=4000

tmp_path=os.path.join('./cola_public','raw','in_domain_train.tsv')
df = pd.read_csv(tmp_path, delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

@VascoSch92:

Put a comma after names=[...]. This will make the line more readable.

Collaborator:

@VascoSch92 Could you please explain what you mean exactly?

@VascoSch92:

Instead of writing

df = pd.read_csv(tmp_path, delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

use

df = pd.read_csv(
    tmp_path, 
    delimiter='\t', 
    header=None, 
    names=['sentence_source', 'label', 'label_notes', 'sentence'],
)

Note the comma at the end of names. If you are using an automatic formatter (like yapf or black) and you put a comma at the end, it will format that for you.

The second one is much easier to read: you see exactly which method/class/function you are using and what the parameters are.

@raphaelreinauer (Collaborator) commented Jan 29, 2024:

Thanks for the clarification @VascoSch92, I agree with your suggestion. We have a pre-commit hook that runs the black formatter.

@yorickbrunet Could you please use that? See here: https://github.com/giotto-ai/giotto-deep?tab=readme-ov-file#contributing

Author:

done

@raphaelreinauer (Collaborator)

Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?

@yorickbrunet (Author)

> Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?

Hi @raphaelreinauer, I answered all comments with either a modification of the code or a justification of why it cannot or won't be modified. Basically, the project is closed and we cannot spend two more weeks adding tests, retesting the setup after modifications, etc.

@yorickbrunet (Author) commented Mar 8, 2024

> Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?

@raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.

@VascoSch92

> Hey @yorickbrunet and @matteocao, I noticed there haven't been any responses to the PR review comments yet. Could you respond to them so that we can merge this PR?
>
> @raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.

@yorickbrunet for me it's good. You can resolve the discussions ;-) (I cannot)

Comment on lines +362 to +388
    """Keep the most recent elements of each class.

    Some elements of the list may be of the same class but of different generation
    time, e.g. some benchmark runs that were restarted.
    """
    data2 = []
    idx = 0
    # parse every element in the list (except those removed during the process)
    while idx < len(data):
        jdx = idx + 1
        keep = data[idx]  # set current data as kept
        # parse every further element in the list (except those removed during the process)
        while jdx < len(data):
            # if the currently kept element and the current element are of the same "class" ...
            if data[jdx].same(keep):
                # ... check whether the current element is greater than the kept one ...
                if data[jdx].gt(keep):
                    # ... and keep the current element if it is greater
                    keep = data.pop(jdx)
                else:
                    # ... or just remove the current element if it is not
                    del data[jdx]
            else:
                jdx += 1
        data2.append(keep)
        idx += 1
    return data2
Collaborator:

The naming is terrible. Why is uniq not unique or, better, a more descriptive name? Also, data2 doesn't say anything. The logic is super complicated: two nested while loops with nested if-else statements, and the comments are just repeating what the code already expresses.

This can be written more concisely as:

def get_unique_latest_runs(data: typing.List[RunData]) -> typing.List[RunData]:
    unique_config_latest_run = {}

    for run in data:
        configuration_key = (run.model, run.parallel, run.batch_size, run.gpu_count, run.gpu_model)
        latest = unique_config_latest_run.get(configuration_key)
        if latest is None or run.end_time > latest.end_time:
            unique_config_latest_run[configuration_key] = run

    return list(unique_config_latest_run.values())
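To make the suggestion concrete, here is a self-contained version with a minimal stand-in for RunData; the field names are taken from the suggested configuration key, not from the project's actual class:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunData:
    """Minimal stand-in for the benchmark's RunData (fields assumed)."""
    model: str
    parallel: str
    batch_size: int
    gpu_count: int
    gpu_model: str
    end_time: float


def get_unique_latest_runs(data: List[RunData]) -> List[RunData]:
    # Index runs by their configuration; keep only the latest per key.
    latest = {}
    for run in data:
        key = (run.model, run.parallel, run.batch_size, run.gpu_count, run.gpu_model)
        if key not in latest or run.end_time > latest[key].end_time:
            latest[key] = run
    return list(latest.values())


runs = [
    RunData("resnet", "fsdp", 32, 2, "T4", end_time=100.0),
    RunData("resnet", "fsdp", 32, 2, "T4", end_time=250.0),  # restarted run
    RunData("resnet", "none", 32, 1, "T4", end_time=50.0),
]
deduped = get_unique_latest_runs(runs)
```

The dictionary makes the "latest run per configuration" invariant explicit, replacing the two nested while loops with a single pass and no in-place mutation of the input list.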

Comment on lines +22 to +27
class Parallelism(enum.Enum):
    none = enum.auto()
    fsdp_full_shard = enum.auto()
    fsdp_shard_grad_op = enum.auto()
    fsdp_no_shard = enum.auto()
    pipeline = enum.auto()
Collaborator:

Parallelism is a mixture of a ParallelismType and a sharding strategy - instead one should use a composite of both.
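One way to express that composition, sketched here with hypothetical names (these are not existing giotto-deep classes):

```python
import enum
from dataclasses import dataclass
from typing import Optional


class ParallelismType(enum.Enum):
    NONE = enum.auto()
    FSDP = enum.auto()
    PIPELINE = enum.auto()


class ShardingStrategy(enum.Enum):
    # Mirrors the FSDP sharding strategies folded into the original enum.
    FULL_SHARD = enum.auto()
    SHARD_GRAD_OP = enum.auto()
    NO_SHARD = enum.auto()


@dataclass(frozen=True)
class ParallelismConfig:
    kind: ParallelismType
    sharding: Optional[ShardingStrategy] = None

    def __post_init__(self) -> None:
        # A sharding strategy only makes sense when FSDP is selected.
        if self.kind is ParallelismType.FSDP and self.sharding is None:
            raise ValueError("FSDP requires a sharding strategy")
        if self.kind is not ParallelismType.FSDP and self.sharding is not None:
            raise ValueError("a sharding strategy only applies to FSDP")
```

This keeps the parallelism method and the FSDP sharding strategy as orthogonal axes, and the validation in __post_init__ rejects the combinations the flat enum silently ruled out.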
