-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data custom collator #260
Data custom collator #260
Conversation
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
find some quality results here https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/621#issuecomment-86629029 using both collators |
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
@alex-jw-brooks review requested is addressed, also added EOS token Attached some quality results above. Remaining TODOs:
|
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
The PR is ready to be reviewed @alex-jw-brooks @anhuong . I have documented only JSONL format for new input/output and will extend it to JSON in following PR . Also current functionality is retained and we are just adding support for a new data format . I think it is safe to merge as one quality test looked good. we can continue to do quality testing before announcing release |
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
@Ssukriti i think this PR is not accurately named, the changes should be to allow for custom processing by means of passing in the Jinja template? Collation is usually referred to the step where the examples come out of the sampler, and they are formed into a single tensor to be passed out of the dataloader. |
* refactor code to preprocess datasets Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix formatting Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * allow input/output in validate args Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * format input/output JSON and mask Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * function to return suitable collator Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add tests for SFT Trainer input/output format Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * remove unused functions Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add eos token to input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix tests Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * improve docstrings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * keeping JSON keys constant Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * support for input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting fixes Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * update rEADME formats Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting README Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
* Set default value of target_modules to be None in LoraConfig Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Removal of transformers logger and addition of python logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT and lint check: Removal of transformers logger and addition of python logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix: remove lm_head for granite with llama arch models (#258) * initial code for deleting lm_head Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix logic for copying checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix check that embed_tokens and lm_head weights are the same Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix warning assertion Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix lm_head check, remove test Signed-off-by: Anh-Uong <anh.uong@ibm.com> * small fixes from code review Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fmt Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Anh-Uong <anh.uong@ibm.com> Co-authored-by: Anh-Uong <anh.uong@ibm.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add config_utils tests Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Fix fmt Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Separate tests out and use docstrings Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Update more field/value checks from HF defaults Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Fix: Addition of env var TRANSFORMERS_VERBOSITY check Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT Fix: Addition of env var TRANSFORMERS_VERBOSITY check Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add test for tokenizer in lora config (should be ignored) Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Adding logging support to accelerate launch Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT_FIX: Adding logging support to accelerate launch Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * bug: On save event added to callback (#256) * feat: On save event added to callback Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed additional bracket Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed additional bracket Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Format issues resolved Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase with upstream and add new line Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: All metric handling changes (#263) * feat: All metric handling changes Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Format issues Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Configuration to set logging level for trigger log (#241) * feat: Added the triggered login in the operation Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Formatting issues Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Added default config Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Moved the variable to right scope Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Checked added to validate config log level Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed some unwanted log file Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * limit peft deps until investigate (#274) Signed-off-by: Anh-Uong <anh.uong@ibm.com> * Data custom collator (#260) * refactor code to preprocess datasets Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix formatting Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * allow input/output in validate args Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * format input/output JSON and mask Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * function to return suitable collator Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add tests for SFT Trainer input/output format Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * remove unused functions Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add eos token to input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix tests Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * improve docstrings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * keeping JSON keys constant Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * support for input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting fixes Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * update rEADME formats Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting README Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> * Revert "limit peft deps until investigate (#274)" (#275) This reverts commit f57ff63. Signed-off-by: Anh-Uong <anh.uong@ibm.com> * feat: per process state metric (#239) Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com> * Modify test to pass with target_modules: None Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Logging changes and unit tests added Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Add a dockerfile argument to enable aimstack (#261) * Add a dockerfile argument at the end of final layer to enable aimstack. Currenlty guarded by a dockerfile argument. Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Set the default value of ENABLE_AIM to false Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> --------- Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Solved conflict with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT:Fix Solved conflict with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * enabling tests for prompt tuning Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Support pretokenized (#272) * feat: support pretokenized datasets Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * consolidate collator code Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add valuerrors for incorrect args Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> * Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212) Updates the requirements on [packaging](https://github.com/pypa/packaging) to permit the latest version. - [Release notes](https://github.com/pypa/packaging/releases) - [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst) - [Commits](pypa/packaging@23.2...24.1) --- updated-dependencies: - dependency-name: packaging dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * enabling tests for prompt tuning (#278) Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * fix: do not add special tokens for custom tokenizer (#279) Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * PR changes for changing logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix: bug where the logger was not being used properly (#286) Signed-off-by: Hari <harikrishmenon@gmail.com> * Unit Tests changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add functionality to free disk space from Github Actions (#287) * Add functionality to free disk space from Github Actions Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add functionality to free disk space from Github Actions, relocate from build-and-publish.yaml to image.yaml Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Move freeing space step to before building image Signed-off-by: Will Johnson <mwjohnson728@gmail.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * commented os.environ[LOG_LEVEL] in accelerate.py for testing Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FIX:FMT Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add unit test to verify target_modules defaults correctly (#281) * Add unit test to verify target_modules defaults correctly Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add sft_trainer.main test to ensure target modules properly default for LoRA when set to None from CLI Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * fmt Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Use model_args instead of importing, fix nits Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add test to ensure target_modules defaults to None in job config Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add additional check, fix nits Signed-off-by: Will Johnson <mwjohnson728@gmail.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * docs: Add documentation on experiment tracking. (#257) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Ensure additional metadata to trackers don't throw error in happy case. (#290) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix multiple runid creation bug with accelerate. (#268) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * feat: logging control operation (#264) Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * Metrics file epoch indexing from 0 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Revert last commit Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix run evaluation to get base model path (#273) Signed-off-by: Anh-Uong <anh.uong@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293) Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * Always update setuptools to latest (#288) Signed-off-by: James Busche <jbusche@us.ibm.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * Rename all fixtures with correct .jsonl extension (#295) Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * feat: add save_model_dir flag where final checkpoint saved (#291) * add save_model_dir flag for final checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * remove output_dir logic, add save method Signed-off-by: Anh-Uong <anh.uong@ibm.com> * update accelerate_launch, remove save tokenizer Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix: put back creation of .complete file Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix failing tests and add new ones Signed-off-by: Anh-Uong <anh.uong@ibm.com> * tests: add sft_trainer test to train and save - small refactor of tests Signed-off-by: Anh-Uong <anh.uong@ibm.com> * add docs on saving checkpoints and fix help msg Signed-off-by: Anh-Uong <anh.uong@ibm.com> * update example and note best checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * changes based on PR review Signed-off-by: Anh-Uong <anh.uong@ibm.com> * add logging to save, fix error out properly Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Signed-off-by: Anh-Uong <anh.uong@ibm.com> Signed-off-by: Angel Luu <angel.luu@us.ibm.com> Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com> Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Hari <harikrishmenon@gmail.com> Signed-off-by: James Busche <jbusche@us.ibm.com> Co-authored-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com> Co-authored-by: Anh-Uong <anh.uong@ibm.com> Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com> Co-authored-by: Angel Luu <angel.luu@us.ibm.com> Co-authored-by: Angel Luu <an317gel@gmail.com> Co-authored-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Co-authored-by: Hari <harikrishmenon@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: James Busche <jbusche@us.ibm.com>
Description of the change
Addition of support for JSON datasets containing input/output fields . Masking on input to leave output unmasked. Create single sequence by concatenation.
Avoids datacollatorforcompletionLM that requires response template. This is because response template is very error prone and subject to it being present in text.
Refactored https://github.com/foundation-model-stack/fms-hf-tuning/pull/166/files a bit to allow extension to support pretokenized datasets which are masked.
Related issue number
How to verify the PR
Was the PR tested