attention mask for different documents in dataset chunk #2

Open
waterhorse1 opened this issue May 8, 2023 · 3 comments

@waterhorse1

Hi Chaoyi,

Thanks for your great work. I have a question about dataset tokenization in the following code.

import numpy as np
import datasets

# Flatten all tokenized documents into one token stream, inserting EOS/BOS
# between consecutive documents (the literal 1 plays the role of the initial BOS id).
all_tokens = [1] + [
    tok
    for row in all_tokenized
    for tok in row + [tokenizer.eos_token_id, tokenizer.bos_token_id]
]
# Drop the tail so the stream splits evenly into fixed-length chunks.
truncated_tokens = all_tokens[:(len(all_tokens) // args.max_seq_length) * args.max_seq_length]
arr = np.array(truncated_tokens).reshape(-1, args.max_seq_length)
ds = datasets.Dataset.from_dict({"input_ids": arr})
ds.save_to_disk(args.save_path)

From my understanding, this preprocessing means that different documents can end up in the same data chunk. For example, the first document might take 512 tokens and the second document 128 tokens of a 640-token chunk. In that case, generation for the second document should not see the first one, so we would need an attention mask that hides the first document when generating the second. Am I correct?
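For illustration, here is a minimal sketch of the kind of per-chunk mask I have in mind (my own example, not code from this repo; it assumes eos_token_id marks document boundaries inside a packed chunk, and the helper name is hypothetical):

import numpy as np

def build_document_attention_mask(chunk, eos_token_id):
    # Causal mask restricted to the current document: position i may attend to
    # position j only if j <= i and both tokens belong to the same document.
    seq_len = len(chunk)
    # Document id per position: increment after each EOS token.
    doc_ids = np.cumsum([0] + [1 if tok == eos_token_id else 0 for tok in chunk[:-1]])
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return same_doc & causal

# Example: two packed documents (bos=1, eos=2) inside one chunk.
chunk = [1, 5, 6, 2, 1, 7, 2]
mask = build_document_attention_mask(chunk, eos_token_id=2)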

@chaoyi-wu
Owner

Thanks for the kind words.

Yes, your understanding is correct.

Since this project is a tutorial, the code here is mainly meant to keep the main scripts simple, avoid some messy padding operations, and make the whole training flow more readable.

In practice, this kind of preprocessing is only suitable for large, unstructured pre-training corpora. In most cases, you need to replace the dataset script with your own and add the correct attention and padding masks based on the characteristics of your data.
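For example, a rough sketch of the per-document route with a Hugging Face tokenizer (the "text" field and the data file path are placeholders for your own data):

import datasets

def tokenize_per_document(examples, tokenizer, max_seq_length):
    # Tokenize each document on its own; padding to a fixed length makes the
    # tokenizer return an attention_mask alongside input_ids.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
        padding="max_length",
    )

# raw = datasets.load_dataset("json", data_files="my_corpus.jsonl")["train"]
# ds = raw.map(lambda ex: tokenize_per_document(ex, tokenizer, 2048), batched=True)
# ds.save_to_disk(save_path)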

@waterhorse1
Author

@chaoyi-wu Thanks for your answer! I also ran into a problem when running finetune_pp_peft_trainer_lora.sh:

ValueError: FlatParameter requires uniform requires_grad

Any idea why this happens?

@chaoyi-wu
Owner

Yes, FSDP with LoRA has this bug and we are going to fix it. You may use DeepSpeed instead if you are working with LoRA.
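For reference, a minimal ZeRO-2 DeepSpeed config that can be handed to the Hugging Face Trainer in place of the FSDP flags (a sketch only; the "auto" values and the ZeRO stage should be tuned to your setup):

# Minimal DeepSpeed ZeRO-2 config; TrainingArguments accepts it as a dict
# (or as a path to an equivalent JSON file).
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
}

# training_args = transformers.TrainingArguments(..., deepspeed=ds_config)
# trainer = transformers.Trainer(model=lora_model, args=training_args, ...)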
