Add support for collecting metrics programmatically #33

Closed
2 tasks
VassilisVassiliadis opened this issue Feb 7, 2024 · 3 comments · May be fixed by #49

Comments

@VassilisVassiliadis
Contributor

Issue

I would like to use fms-hf-tuning to collect system-level metrics while fine-tuning models. These include model load time as well as device-related metrics (like those that AIM collects).

One option is to rely only on the measurements that AIM collects: point fms-hf-tuning at an AIM server, then query AIM to retrieve the data. However, this is restrictive: we cannot collect custom data, we cannot sample system metrics at a period other than the 30 seconds that AIM uses, and we cannot collect any data at all unless we spin up an AIM server.

A more convenient solution would be to let train() accept an optional parameter containing a list of callbacks:

def train(
    model_args: configs.ModelArguments,
    data_args: configs.DataArguments,
    train_args: configs.TrainingArguments,
    peft_config: Optional[Union[peft_config.LoraConfig, peft_config.PromptTuningConfig]] = None,
):
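
A minimal sketch of what the extra parameter could look like, assuming the callbacks are simply forwarded to the trainer that train() builds (the Hugging Face Trainer constructor already accepts a callbacks argument). RecordingTrainer and the plain-callable callbacks are hypothetical stand-ins so the example runs on its own; they are not part of fms-hf-tuning.

```python
from typing import Any, Callable, List, Optional


class RecordingTrainer:
    """Hypothetical stand-in for the trainer that train() builds internally."""

    def __init__(self, callbacks: Optional[List[Callable[[str], None]]] = None):
        self.callbacks = list(callbacks or [])

    def train(self) -> None:
        # Notify every callback at the start and at the end of training.
        for callback in self.callbacks:
            callback("train_begin")
        for callback in self.callbacks:
            callback("train_end")


def train(
    model_args: Any,
    data_args: Any,
    train_args: Any,
    peft_config: Optional[Any] = None,
    callbacks: Optional[List[Callable[[str], None]]] = None,  # proposed parameter
) -> None:
    # Forward caller-supplied callbacks to the trainer; None means "no extras".
    trainer = RecordingTrainer(callbacks=callbacks)
    trainer.train()


events: List[str] = []
train(None, None, None, callbacks=[events.append])
print(events)
```

Defaulting the new parameter to None would keep existing callers of train() working unchanged.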

In the same spirit, I'd like to get access to the TrainOutput object that the trainer.train() call inside sft_trainer.train() returns here (input_tokens_per_second, train_runtime, etc.):

trainer.train()

(just by returning the output of trainer.train() as the output of train()).
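
Roughly, the change amounts to propagating the return value instead of discarding it. In the sketch below, TrainOutput is a local stand-in that mirrors the fields of the transformers TrainOutput named tuple (global_step, training_loss, metrics), StubTrainer is hypothetical, and the numbers are placeholders rather than real measurements.

```python
from typing import Dict, NamedTuple


class TrainOutput(NamedTuple):
    """Local stand-in mirroring the fields of transformers' TrainOutput."""

    global_step: int
    training_loss: float
    metrics: Dict[str, float]


class StubTrainer:
    """Hypothetical trainer that returns placeholder values."""

    def train(self) -> TrainOutput:
        return TrainOutput(
            global_step=10,
            training_loss=0.5,
            metrics={"train_runtime": 1.25},
        )


def train() -> TrainOutput:
    trainer = StubTrainer()
    # Return the trainer's result to the caller instead of dropping it.
    return trainer.train()


output = train()
print(output.metrics["train_runtime"])
```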

Done when

  • support collecting custom metrics via custom callbacks
  • return the output of trainer.train() to the caller of train()
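
To illustrate the first item, here is a sketch of a callback that samples a metric at a caller-chosen period instead of a fixed 30 seconds. PeriodicMetricsCallback and read_cpu_time are hypothetical names, and the class uses only the standard library so that it runs on its own. Its hook names and arguments follow those of transformers.TrainerCallback (on_train_begin, on_train_end), so in practice it would presumably subclass that class.

```python
import threading
import time
from typing import Callable, Dict, List, Optional, Tuple


class PeriodicMetricsCallback:
    """Samples metrics on a background thread between train begin and end."""

    def __init__(self, sample: Callable[[], Dict[str, float]], period_s: float = 5.0):
        self.sample = sample
        self.period_s = period_s
        self.samples: List[Tuple[float, Dict[str, float]]] = []
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def _run(self) -> None:
        # Take one sample immediately, then one per period until stopped.
        while True:
            self.samples.append((time.monotonic(), self.sample()))
            if self._stop.wait(self.period_s):
                break

    def on_train_begin(self, args=None, state=None, control=None, **kwargs):
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def on_train_end(self, args=None, state=None, control=None, **kwargs):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


def read_cpu_time() -> Dict[str, float]:
    # Example metric: CPU seconds consumed by this process so far.
    return {"process_cpu_seconds": time.process_time()}


callback = PeriodicMetricsCallback(read_cpu_time, period_s=0.01)
callback.on_train_begin()
time.sleep(0.05)  # stands in for the training loop
callback.on_train_end()
print(len(callback.samples), "samples collected")
```

Device metrics (for example GPU utilization) could be plugged in by passing a different sample function; which library would provide them is left open here.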
@VassilisVassiliadis
Contributor Author

I can take a stab at the above if you like!

@VassilisVassiliadis
Contributor Author

Ideally, I'd also like to measure the time it takes to load the model so that I can predict how long it would take to instantiate the weights that I produce. This would probably involve returning the output of trainer.train() together with some extra information that the sft_trainer.train method measures (e.g. model_load_time).
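
One possible shape for this, sketched with standard-library timing only: wrap the loading step in a timer and return the measurement next to the training metrics. The helper timed, the stand-in load_model, and the tuple return type are all assumptions made for illustration, not the actual fms-hf-tuning code.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def load_model() -> str:
    # Hypothetical stand-in for the real model-loading call.
    time.sleep(0.01)
    return "model"


def train() -> Tuple[Dict[str, float], Dict[str, float]]:
    model, model_load_time = timed(load_model)
    # Placeholder for the metrics that trainer.train() would report.
    train_metrics = {"train_runtime": 0.0}
    return train_metrics, {"model_load_time": model_load_time}


train_metrics, extra = train()
print(f"model_load_time: {extra['model_load_time']:.3f}s")
```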

@dushyantbehl
Contributor

@VassilisVassiliadis this can be closed now. If you run into any errors when testing it, let me know.
