feat: Support pretokenized #272
Conversation
Force-pushed from 6419ca9 to 50399be.
tuning/utils/preprocessing_utils.py (outdated)

@@ -29,8 +29,48 @@
logger = logging.get_logger("sft_trainer_preprocessing")


def _is_pretokenized_dataset(dataset: Dataset):
I suggest moving this inside validate_data_args, and adding a comment that this is how SFT checks for tokenized datasets internally.
@fabianlim done
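For context, a minimal sketch of the column check under discussion, assuming it keys off the conventional input_ids column (the helper shape here is illustrative, not the merged code); per the comment above, SFT checks for tokenized datasets internally the same way:

from datasets import Dataset

def _is_pretokenized_dataset(dataset: Dataset) -> bool:
    # Treat the dataset as pretokenized if it already carries token ids;
    # this mirrors how SFT checks for tokenized datasets internally.
    return dataset is not None and "input_ids" in dataset.column_names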
tuning/utils/preprocessing_utils.py (outdated)

def validate_data_args(data_args: configs.DataArguments, packing: bool):

    # validation for pretokenized datasets case
This kind of check can be hard to parse; I suggest documenting it in plain English above.
@fabianlim documented them wherever possible, thanks.
tuning/utils/preprocessing_utils.py (outdated)

if data_args.is_pretokenized:
    return DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        padding="max_length",
@kmehant can we handle the data collator kwargs more carefully? This collator calls pad() underneath, and there are various options. If we decide the options for the user, then we need to choose the best one; for example, we probably do not want to choose padding='max_length' or padding=None.
And if there are defaults, we can see below how they are handled in configs: https://github.com/foundation-model-stack/fms-hf-tuning/pull/272/files#diff-0949af2da3611757d2552841654a34f1fe0a349ee114ce6342b9e0ade3eb13a6R180
And if we decide on the correct behavior, then maybe we want to make it explicit. But padding="max_length" is not recommended.

DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=configs.PADDING_STRATEGY_LONGEST,
    max_length=max_sequence_length,
    return_tensors="pt",
)

On the fence on "pt" or configs.PADDING_RETURN_TENSORS.
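To make the padding tradeoff concrete, a small illustrative comparison (the tokenizer and feature values are arbitrary examples, not code from this PR):

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

features = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5]},
]

# padding="longest" pads only to the longest sequence in the batch.
longest = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding="longest")
print(longest(features)["input_ids"].shape)  # torch.Size([2, 3])

# padding="max_length" pads every batch out to max_length, wasting compute
# whenever the batch is shorter than the limit.
fixed = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding="max_length", max_length=512)
print(fixed(features)["input_ids"].shape)  # torch.Size([2, 512])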
Thanks, updated!
On the fence on "pt" or configs.PADDING_RETURN_TENSORS
Avoided this, since the collator should mostly return tensors rather than Python lists.
Force-pushed from 5823331 to d48ea3c.
tuning/utils/preprocessing_utils.py (outdated)

def is_pretokenized_dataset(data_path: str):
    if data_path:
        # load one sample from the dataset in order to inspect columns
        dataset = datasets.load_dataset("json", data_files=data_path, split="train[:1]")
Can you add graceful handling for the case where the dataset is empty?
Raising it from the same error class, DatasetGenerationError, since the existing tests on malformed data already assert on DatasetGenerationError. Let me know if the message needs to be refined. Thanks.
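A minimal sketch of the guard described here, assuming the empty case is surfaced as a DatasetGenerationError so the existing malformed-data tests keep passing (the message and the column check are illustrative):

import datasets
# In recent datasets releases the error lives in datasets.exceptions
# (datasets.builder in older ones); adjust the import to your version.
from datasets.exceptions import DatasetGenerationError

def is_pretokenized_dataset(data_path: str) -> bool:
    if not data_path:
        return False
    # Load a single sample just to inspect the columns.
    dataset = datasets.load_dataset("json", data_files=data_path, split="train[:1]")
    if len(dataset) == 0:
        raise DatasetGenerationError(
            f"dataset at {data_path} is empty, cannot inspect its columns"
        )
    return "input_ids" in dataset.column_names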
tuning/utils/preprocessing_utils.py (outdated)

    tokenizer=tokenizer,
    padding=configs.PADDING_STRATEGY_LONGEST,
    max_length=max_sequence_length,
    return_tensors="pt",
nit - I'd suggest not setting return_tensors="pt" directly here, since it is the default and we don't set it when initializing other collators.
ah, right! updated
    or data_args.data_formatter_template
    or data_args.dataset_text_field
):
    raise ValueError(
Can you change this to a warning instead of an error? IMO this and packing=True should be handled the same way
Modified, thanks @alex-jw-brooks.
@@ -68,13 +115,13 @@ def validate_data_args(data_args: configs.DataArguments, packing: bool):
# TODO(s) In future support two more formats:
# 1. Allow no response template, and JSON with input/output fields and mask input
# 2. Allow pretokenized Dataset besides JSON.


def get_data_collator(
Unit tests need to be added for these updates; see the existing @pytest.mark.parametrize cases. Probably after rebasing, we can re-use the same unit test.
Using the same unit test, thanks. But if you see some duplicated flow in this function, let me know and we will work it out.
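For illustration, a hypothetical shape such a parametrized case could take; the import paths, fixture file, and expected outcomes are assumptions, not the merged tests:

import pytest

from tuning.config import configs
from tuning.utils import preprocessing_utils

@pytest.mark.parametrize("packing,should_raise", [(False, False), (True, True)])
def test_validate_data_args_pretokenized(packing, should_raise):
    # Hypothetical pretokenized fixture carrying input_ids/labels columns.
    data_args = configs.DataArguments(
        training_data_path="tests/data/twitter_complaints_tokenized.json"
    )
    if should_raise:
        with pytest.raises(ValueError):
            preprocessing_utils.validate_data_args(data_args, packing)
    else:
        preprocessing_utils.validate_data_args(data_args, packing)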
tuning/utils/preprocessing_utils.py (outdated)

def validate_data_args(data_args: configs.DataArguments, packing: bool):

    is_train_data_pretokenized = is_pretokenized_dataset(
        data_path=data_args.training_data_path
Unit tests need to be added here as well.
tuning/utils/preprocessing_utils.py (outdated)

@@ -129,6 +189,21 @@ def format_dataset(data_args: configs.DataArguments, tokenizer: AutoTokenizer):
    tuple containing train_dataset, eval_dataset and dataset_text_field
    """
    eval_dataset = None
    is_train_data_pretokenized = is_pretokenized_dataset(
        data_path=data_args.training_data_path
Unit tests are needed here too.
Added, thanks
Force-pushed from d48ea3c to 05d797c.
"attention_mask" in formatted_train_dataset.column_names | ||
and "labels" in formatted_train_dataset.column_names | ||
): | ||
if is_train_data_pretokenized: |
Packing is not supported for pretokenized data anyway, so I just moved the check here @kmehant, like it was before; it was doing the same thing as your is_pretokenized function. padding=True is the same as padding="longest", see https://huggingface.co/transformers/v4.9.2/main_classes/data_collator.html
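A quick illustrative check of that equivalence (arbitrary values, not code from the PR):

import torch
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

features = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4], "labels": [4]},
]

# Both strategies pad to the longest sequence in the batch.
batch_true = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)(features)
batch_longest = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding="longest")(features)
assert torch.equal(batch_true["input_ids"], batch_longest["input_ids"])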
Thanks
tuning/utils/preprocessing_utils.py (outdated)

    or data_args.data_formatter_template
    or data_args.dataset_text_field
):
    logger.warning(
This should be a ValueError. If a response template is provided, we will return the completion LM collator; there is no need to allow a response template here, it is an incorrect argument. I am committing this change.
sure @Ssukriti thanks
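For reference, a hedged sketch of what the committed validation might look like, reconstructed from the diff fragments above (the field names come from the snippets; the message wording is illustrative):

def validate_data_args(data_args, packing: bool):
    is_train_data_pretokenized = is_pretokenized_dataset(
        data_path=data_args.training_data_path
    )
    # Formatting arguments make no sense once the data is already tokenized.
    if is_train_data_pretokenized and (
        data_args.response_template
        or data_args.data_formatter_template
        or data_args.dataset_text_field
    ):
        raise ValueError(
            "response_template, data_formatter_template, and dataset_text_field "
            "are not applicable for pretokenized datasets"
        )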
tuning/utils/preprocessing_utils.py (outdated)

# packing wont be available for pretokenized datasets in trl library
# see: https://github.com/huggingface/trl/issues/1848
if packing:
    logger.warning("packing will not be used when datasets are pretokenized")
Why is this only a warning? A warning is fine when the code merely performs worse (e.g., performance won't be as good), but we can't warn when the code simply won't work: if the packing flag is set, it will still be true when passed to TRL.
Agree with you, we should error out.

Tests are still incomplete for both validate_data_args and format_dataset. I understand it's because the JSON is not readily available for tests. You can take a few examples from any JSON here, https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/data/twitter_complaints_input_output.json, convert them to pretokenized form, and contribute a new pretokenized JSON. Then it can easily be used in all the tests.
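A hedged sketch of how such a fixture could be generated; the input/output field names follow the linked file's naming, while the tokenizer, output path, and labeling scheme are assumptions for illustration:

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer

with open("tests/data/twitter_complaints_input_output.json") as f:
    samples = json.load(f)

pretokenized = []
for sample in samples[:5]:  # a handful of examples is enough for tests
    text = sample["input"] + sample["output"] + tokenizer.eos_token
    ids = tokenizer(text)["input_ids"]
    # Simplest labeling for a fixture: labels mirror input_ids.
    pretokenized.append({"input_ids": ids, "labels": ids})

with open("tests/data/twitter_complaints_tokenized.json", "w") as f:
    json.dump(pretokenized, f)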
Force-pushed from 12ef956 to 4f39639.
* feat: support pretokenized datasets
* fix: rebase with upstream and review commits
* consolidate collator code
* add ValueErrors for incorrect args
* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
* Set default value of target_modules to be None in LoraConfig
* Removal of transformers logger and addition of python logger (plus fmt/lint fixes)
* fix: remove lm_head for granite with llama arch models (#258)
* Add config_utils tests; separate tests out and use docstrings; update more field/value checks from HF defaults
* Fix: Addition of env var TRANSFORMERS_VERBOSITY check
* Add test for tokenizer in lora config (should be ignored)
* Adding logging support to accelerate launch
* bug: On save event added to callback (#256)
* feat: All metric handling changes (#263)
* feat: Configuration to set logging level for trigger log (#241)
* limit peft deps until investigate (#274), later reverted (#275)
* Data custom collator (#260)
* feat: per process state metric (#239)
* Modify test to pass with target_modules: None
* Logging changes and unit tests added
* feat: Add a dockerfile argument to enable aimstack (#261)
* enabling tests for prompt tuning (#278)
* feat: Support pretokenized (#272)
* Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212)
* fix: do not add special tokens for custom tokenizer (#279)
* fix: bug where the logger was not being used properly (#286)
* Add functionality to free disk space from Github Actions (#287)
* Add unit test to verify target_modules defaults correctly (#281)
* docs: Add documentation on experiment tracking (#257)
* Ensure additional metadata to trackers don't throw error in happy case (#290)
* fix multiple runid creation bug with accelerate (#268)
* feat: logging control operation (#264)
* fix run evaluation to get base model path (#273)
* feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293)
* Always update setuptools to latest (#288)
* Rename all fixtures with correct .jsonl extension (#295)
* feat: add save_model_dir flag where final checkpoint saved (#291)

Co-authored-by: Abhishek Maurya, Sukriti Sharma, Anh Uong, Angel Luu, Padmanabha V Seshadri, Mehant Kammakomati, Alex Brooks, Hari, Dushyant Behl, Will Johnson, James Busche, and dependabot[bot].