Add file logger callback & export train loss json file #22

Merged

Conversation

@alex-jw-brooks (Collaborator) commented Jan 24, 2024

This PR adds a new callback that is invoked on the trainer's logging event: if the most recent log contains the expected keys, the corresponding subdictionary is extracted and the training loss file is re-exported to the output directory.
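For reference, here is a minimal sketch of what a callback like this could look like on top of the Hugging Face TrainerCallback API (the class name, keys, and output format below are an illustrative reconstruction, not necessarily the exact code in this PR):

import datetime
import json
import os

from transformers import TrainerCallback


class FileLoggingCallback(TrainerCallback):
    """Sketch: append each training-loss log event to train_loss.jsonl."""

    log_keys = {"loss", "epoch"}

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Skip log events that don't carry the expected keys,
        # e.g. the train-completion summary.
        if logs is None or not self.log_keys.issubset(logs):
            return
        log_line = {
            "name": "loss",
            "data": {
                "epoch": logs["epoch"],
                "step": state.global_step,
                "value": logs["loss"],
                "timestamp": datetime.datetime.now().isoformat(),
            },
        }
        # One JSON object per line (jsonl), written to the output directory.
        log_file_path = os.path.join(args.output_dir, "train_loss.jsonl")
        with open(log_file_path, "a") as log_file:
            log_file.write(json.dumps(log_line, sort_keys=True) + "\n")

Such a callback would then be registered through the trainer's callbacks argument, e.g. SFTTrainer(..., callbacks=[FileLoggingCallback()]).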

Example using Llama 7B with the Twitter complaints dataset (tested on a single GPU with a V100):

python tuning/sft_trainer.py \
  --model_name_or_path $MODEL_PATH \
  --data_path $DATA_PATH \
  --output_dir $OUTPUT_PATH \
  --num_train_epochs 5 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --evaluation_strategy "no" \
  --save_strategy "epoch" \
  --learning_rate 1e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --include_tokens_per_second \
  --packing False \
  --response_template "\n### Label:" \
  --peft_method pt \
  --tokenizer_name_or_path $MODEL_PATH \
  --dataset_text_field "output" \
  --torch_dtype "float32" \
  --use_flash_attn False

Produces a train_loss.jsonl file in the output directory:

$ ls out
checkpoint-13  checkpoint-15  checkpoint-3  checkpoint-6  checkpoint-9  train_loss.jsonl

Here is an example of the JSONL file that gets created:

{"data": {"epoch": 0.31, "step": 1, "timestamp": "2024-02-13T12:29:27.387128", "value": 10.6876}, "name": "loss"}
{"data": {"epoch": 0.62, "step": 2, "timestamp": "2024-02-13T12:29:28.078879", "value": 10.8365}, "name": "loss"}
{"data": {"epoch": 0.92, "step": 3, "timestamp": "2024-02-13T12:29:28.749249", "value": 10.767}, "name": "loss"}
{"data": {"epoch": 1.23, "step": 4, "timestamp": "2024-02-13T12:29:31.796401", "value": 10.8091}, "name": "loss"}
{"data": {"epoch": 1.54, "step": 5, "timestamp": "2024-02-13T12:29:32.436218", "value": 10.7143}, "name": "loss"}
{"data": {"epoch": 1.85, "step": 6, "timestamp": "2024-02-13T12:29:33.187178", "value": 10.7687}, "name": "loss"}
{"data": {"epoch": 2.15, "step": 7, "timestamp": "2024-02-13T12:29:36.343676", "value": 10.7223}, "name": "loss"}
{"data": {"epoch": 2.46, "step": 8, "timestamp": "2024-02-13T12:29:37.021809", "value": 10.7292}, "name": "loss"}
{"data": {"epoch": 2.77, "step": 9, "timestamp": "2024-02-13T12:29:37.708132", "value": 10.8619}, "name": "loss"}
{"data": {"epoch": 3.08, "step": 10, "timestamp": "2024-02-13T12:29:42.506561", "value": 10.6114}, "name": "loss"}
{"data": {"epoch": 3.38, "step": 11, "timestamp": "2024-02-13T12:29:43.154734", "value": 10.6282}, "name": "loss"}
{"data": {"epoch": 3.69, "step": 12, "timestamp": "2024-02-13T12:29:43.838959", "value": 10.7648}, "name": "loss"}
{"data": {"epoch": 4.0, "step": 13, "timestamp": "2024-02-13T12:29:44.538090", "value": 10.8187}, "name": "loss"}
{"data": {"epoch": 4.31, "step": 14, "timestamp": "2024-02-13T12:29:48.925495", "value": 10.6256}, "name": "loss"}
{"data": {"epoch": 4.62, "step": 15, "timestamp": "2024-02-13T12:29:49.682787", "value": 10.7308}, "name": "loss"}

Note that logs are only exported from the main process to avoid writing from multiple processes during multi-GPU tuning.
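As a sketch of that guard (using the is_world_process_zero flag the Trainer exposes to callbacks; the surrounding class is illustrative), non-zero ranks simply skip the write:

from transformers import TrainerCallback


class FileLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # In multi-GPU runs only the main process (rank 0) writes the file,
        # so workers don't clobber each other's output.
        if not state.is_world_process_zero:
            return
        ...  # extract the loss subdict and append it to the loss file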

@gkumbhat (Collaborator) left a comment:

  • General question: should we follow the same pattern as used by caikit-nlp, from a consistency and compatibility perspective?
  • We'll need to see how the callbacks work for steps vs. epochs, since we may want to log individual steps even if it is set to epoch-wise decisions. Another scenario we'll need to understand better is the multi-GPU case.

Comment on lines 49 to 61
log_file_path = os.path.join(args.output_dir, "train_loss.json")
if logs is not None:
    try:
        # Take the subdict of the last log line; if any log_keys aren't part of this log
        # object, assume this line is something else, e.g., train completion, and skip.
        log_obj = {k: logs[k] for k in FileLoggingCallback.log_keys}
    except KeyError:
        return

    # Redump the json file in the checkpoint directory with the updated log line
    self.existing_logs.append(log_obj)
    with open(log_file_path, "w") as log_file:
        json.dump(self.existing_logs, log_file, sort_keys=True, indent=4)
@gkumbhat (Collaborator):

One thing I would be a bit worried about is the need to open and close the file repeatedly. Also, it looks like we are overwriting the existing logs instead of appending; any particular reason for that?

@alex-jw-brooks (Collaborator, Author):

Just that this is writing a JSON file, so we need to reparse it to add to the list. We could alternatively write a list of JSON objects (i.e., each line is a JSON object representing one log), and in that case just append one line per log?

I agree that it's a bit of a bummer to open and close the file so much, although the cost is probably still quite small compared to the actual training of the model; the reparsing is the main problem. If we write one JSON object per log, we can keep the file open in append mode and flush on each written log, I guess?
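As a rough sketch of that alternative (an illustrative helper, not code from this PR), the file could be held open in append mode and flushed per log line:

import json


class AppendingJsonlWriter:
    """Hypothetical helper: one JSON object per line, flushed as written."""

    def __init__(self, log_file_path):
        self._log_file = open(log_file_path, "a")

    def write(self, log_obj):
        self._log_file.write(json.dumps(log_obj, sort_keys=True) + "\n")
        self._log_file.flush()  # make the line visible on disk immediately

    def close(self):
        self._log_file.close()

This avoids reparsing the whole file on every log event, at the cost of keeping a file handle open for the duration of training.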

tuning/sft_trainer.py (resolved)
appends the subdict of the log & dumps the file.
"""

log_file_path = os.path.join(args.output_dir, "train_loss.json")
@gkumbhat (Collaborator):

Can we add either a conversion or a check to make sure the loss is actually a float rather than some other data type? Otherwise the JSON dump would fail.
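One way to handle that (a hypothetical helper, assuming the loss may arrive as e.g. a tensor or numpy scalar) is to coerce the value before dumping and skip the log line if it can't be converted:

def to_float(value):
    """Return the value as a plain float, or None if it can't be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None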

@alex-jw-brooks (Collaborator, Author) commented Jan 30, 2024

Hey @gkumbhat! Thanks for the review. Some thoughts on your broader questions:

General question: should we follow the same pattern as used by caikit-nlp, from a consistency and compatibility perspective?

I would greatly prefer to avoid subclassing the trainer just for logging. The actual logging events look similar either way, so any logic in the logging should be pretty portable between the two approaches, and this project is already using callbacks, so callbacks feel more natural. Plus, the SFT trainer is already a trainer subclass and pretty complex; subclassing it again just to write an extra file seems like unnecessary complexity to me, since it would basically be for the one function.

We'll need to see how the callbacks work for steps vs. epochs, since we may want to log individual steps even if it is set to epoch-wise decisions. Another scenario we'll need to understand better is the multi-GPU case.

For steps vs. epochs: the event being handled here just gives a parsed version of the logged object, so we can handle those keys separately, I think. I'll look into making sure this behavior is sensible.

For multi-GPU, we probably need to make sure that the file is only written by the main process, but I'm not sure how rank is taken into account for the logging event. I can look into this as well; do you know if we formally support multi-GPU at the moment, or is it still experimental?

@gkumbhat (Collaborator) commented Feb 1, 2024

I would greatly prefer to avoid subclassing the trainer just for logging. The actual logging events look similar either way, so any logic in the logging should be pretty portable between the two approaches, and this project is already using callbacks, so callbacks feel more natural. Plus, the SFT trainer is already a trainer subclass and pretty complex; subclassing it again just to write an extra file seems like unnecessary complexity to me, since it would basically be for the one function.

By "follow pattern", I actually meant the formatting of the output file. Sorry I was not quite clear there

@gkumbhat (Collaborator) commented Feb 1, 2024

do you know if we formally support multi-GPU at the moment, or is it still experimental?

Oops, I don't know 🤷

@raghukiran1224 (Contributor):

Yes, we can do multi-GPU using this code base, and it has been well tested.

@alex-jw-brooks (Collaborator, Author):

Thanks, everyone - I updated the PR to write in JSONL format and to only dump logs from process zero, to prevent clobbering when running multiprocess training.

@alex-jw-brooks (Collaborator, Author):

Hey @gkumbhat - I updated the PR and description based on our discussions to match the legacy log format; it should be ready for another look when you have a moment.

@gkumbhat (Collaborator) left a comment:

LGTM

@alex-jw-brooks alex-jw-brooks merged commit 517652d into foundation-model-stack:main Feb 13, 2024
1 check passed
@alex-jw-brooks alex-jw-brooks deleted the train_loss_file branch February 13, 2024 21:06
anhuong pushed a commit to anhuong/fms-hf-tuning that referenced this pull request Apr 3, 2024
Add file logger callback & export train loss json file (foundation-model-stack#22)

* Add file logger callback & export train loss json file

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* only update logs from process 0

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Export logs in jsonl format

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Formatting

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

---------

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>