Add support for collecting metrics programmatically #33

VassilisVassiliadis · 2024-02-07T10:44:26Z

Issue

I would like to use fms-hf-tuning to collect system-level metrics while fine-tuning models. These include model load time as well as device-related metrics (like those that AIM collects).

One way would be to rely solely on the measurements that AIM collects, by pointing fms-hf-tuning to an AIM server and then querying AIM to retrieve the data. However, this is restrictive: we cannot collect custom data, we cannot collect system metrics at a period other than the 30 seconds that AIM uses, and we cannot collect any data at all unless we spin up an AIM server.

A more convenient solution would be to let train() accept an optional parameter containing a list of callbacks:

fms-hf-tuning/tuning/sft_trainer.py

Lines 29 to 34 in fc07060

    
    def train(
        model_args: configs.ModelArguments,
        data_args: configs.DataArguments,
        train_args: configs.TrainingArguments,
        peft_config: Optional[Union[peft_config.LoraConfig, peft_config.PromptTuningConfig]] = None,
    ):
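To make the callback idea concrete, here is a minimal sketch of a callback that samples custom metrics at a caller-chosen period. It assumes the standard Hugging Face `TrainerCallback` interface (`on_step_end(args, state, control, **kwargs)`). The class name `PeriodicMetricsCallback`, the `sampler` callable, and the `clock` parameter are hypothetical names introduced only for illustration, and how the callback would be passed to train() is exactly what this issue proposes, so it is not shown as an existing API.

```python
import time
from typing import Any, Callable, Dict, List, Optional

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch can be run without transformers installed.
    class TrainerCallback:  # type: ignore[no-redef]
        pass


class PeriodicMetricsCallback(TrainerCallback):
    """Hypothetical callback: records sampler() at most once per period_seconds."""

    def __init__(
        self,
        sampler: Callable[[], Dict[str, Any]],
        period_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.sampler = sampler              # user-supplied function returning custom metrics
        self.period_seconds = period_seconds
        self.clock = clock                  # injectable so the timing logic can be tested
        self.samples: List[Dict[str, Any]] = []
        self._last_sample: Optional[float] = None

    def on_step_end(self, args, state, control, **kwargs):
        now = self.clock()
        due = self._last_sample is None or now - self._last_sample >= self.period_seconds
        if due:
            self._last_sample = now
            record = {"time": now, "global_step": getattr(state, "global_step", None)}
            record.update(self.sampler())
            self.samples.append(record)
```

Because sampling only happens at step boundaries, the effective period can never be shorter than one training step. A background thread would be needed for finer-grained sampling.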

In the same spirit, I'd like to get access to the TrainOutput object that the trainer.train() call in sft_trainer.py returns (input_tokens_per_second, train_runtime, etc.):

fms-hf-tuning/tuning/sft_trainer.py

Line 168 in fc07060

trainer.train()

(just by returning the output of trainer.train() as the output of train()).
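As a rough sketch of what this would enable: `transformers.Trainer.train()` returns a `TrainOutput` named tuple with `global_step`, `training_loss`, and `metrics` fields, so a caller that receives it could pick out the metrics it cares about. The helper names `train_and_return` and `pick_metrics` below are hypothetical and stand in for whatever shape the final change takes.

```python
from typing import Any, Dict, Iterable


def train_and_return(trainer: Any) -> Any:
    """Hypothetical wrapper: hand the result of trainer.train() back to the caller."""
    return trainer.train()


def pick_metrics(train_output: Any, keys: Iterable[str]) -> Dict[str, float]:
    """Return only the requested keys that are present in train_output.metrics."""
    return {k: train_output.metrics[k] for k in keys if k in train_output.metrics}
```

Which keys appear in `metrics` depends on the training arguments, so `pick_metrics` skips missing keys instead of raising.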

Done when

support collecting custom metrics via custom callbacks
return the output of trainer.train() to the caller of train()


VassilisVassiliadis · 2024-02-07T10:44:34Z

I can take a stab at the above if you like!

VassilisVassiliadis · 2024-02-07T14:38:38Z

Ideally, I'd also like to measure the time it takes to load the model so that I can predict how long it would take to instantiate the weights that I produce. This would probably involve returning both the output of trainer.train() and some extra information that the sft_trainer.train method measures (e.g. model_load_time).
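One possible shape for this, sketched with the standard library only: time the model-loading step with `time.perf_counter()` and return the measurement alongside the training output. The names `timed`, `train_with_load_time`, `load_model`, and `run_training` are hypothetical placeholders, not functions from fms-hf-tuning.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Call fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def train_with_load_time(
    load_model: Callable[[], Any],
    run_training: Callable[[Any], Any],
) -> Tuple[Any, Dict[str, float]]:
    """Load the model, train, and return (train_output, extra_metrics)."""
    model, model_load_time = timed(load_model)
    train_output = run_training(model)  # e.g. the value of trainer.train()
    return train_output, {"model_load_time": model_load_time}
```

Returning the extra measurements as a separate dict leaves the training output untouched, so existing code that expects the plain result of trainer.train() would only need to unpack the tuple.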

dushyantbehl · 2024-05-10T18:00:05Z

@VassilisVassiliadis this can be closed now. If you run into any errors when testing it, let me know.

VassilisVassiliadis mentioned this issue Feb 19, 2024

feat: custom callbacks for train() and return TrainOutput plus model_load_time #49

Open

dushyantbehl mentioned this issue Feb 19, 2024

feat: change tracker API to initialize tracker early and track additional metrics. #50

Closed

dushyantbehl mentioned this issue Mar 13, 2024

Generic tracker API and implementation of Aimstack tracker #89

Merged

2 tasks

VassilisVassiliadis closed this as completed May 13, 2024
