attention mask for different documents in dataset chunk #2

Open
waterhorse1 opened this issue May 8, 2023 · 3 comments

@waterhorse1

Hi Chaoyi,

Thanks for your great work. I have a question about dataset tokenization in the following code.

import numpy as np
import datasets

# Flatten all tokenized documents into one token stream, inserting EOS/BOS
# between consecutive documents (the literal 1 plays the role of the initial BOS id).
all_tokens = [1] + [
    tok
    for row in all_tokenized
    for tok in row + [tokenizer.eos_token_id, tokenizer.bos_token_id]
]
# Drop the tail so the stream splits evenly into fixed-length chunks.
truncated_tokens = all_tokens[:(len(all_tokens) // args.max_seq_length) * args.max_seq_length]
arr = np.array(truncated_tokens).reshape(-1, args.max_seq_length)
ds = datasets.Dataset.from_dict({"input_ids": arr})
ds.save_to_disk(args.save_path)

From my understanding, this preprocessing means that different documents can end up in the same data chunk. For example, the first document might take 512 tokens and the second document 128 tokens of a 640-token chunk. In that case, generation for the second document should not see the first one, so we would need an attention mask that hides the first document when generating the second. Am I correct?
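For illustration, here is a minimal sketch of the kind of per-chunk mask I have in mind (my own example, not code from this repo; it assumes eos_token_id marks document boundaries inside a packed chunk, and the helper name is hypothetical):

import numpy as np

def build_document_attention_mask(chunk, eos_token_id):
    # Causal mask restricted to the current document: position i may attend to
    # position j only if j <= i and both tokens belong to the same document.
    seq_len = len(chunk)
    # Document id per position: increment after each EOS token.
    doc_ids = np.cumsum([0] + [1 if tok == eos_token_id else 0 for tok in chunk[:-1]])
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return same_doc & causal

# Example: two packed documents (bos=1, eos=2) inside one chunk.
chunk = [1, 5, 6, 2, 1, 7, 2]
mask = build_document_attention_mask(chunk, eos_token_id=2)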

@chaoyi-wu
Owner

Thanks for the kind words.

Yes, your understanding is correct.

Since this project is a tutorial, the code here is mainly meant to keep the main scripts simple, avoid some messy padding operations, and make the whole training flow more readable.

In practice, this kind of preprocessing is only suitable for large, unstructured pre-training corpora. In most cases, you need to replace the dataset script with your own and add the correct attention and padding masks based on the characteristics of your data.
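For example, a rough sketch of the per-document route with a Hugging Face tokenizer (the "text" field and the data file path are placeholders for your own data):

import datasets

def tokenize_per_document(examples, tokenizer, max_seq_length):
    # Tokenize each document on its own; padding to a fixed length makes the
    # tokenizer return an attention_mask alongside input_ids.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
        padding="max_length",
    )

# raw = datasets.load_dataset("json", data_files="my_corpus.jsonl")["train"]
# ds = raw.map(lambda ex: tokenize_per_document(ex, tokenizer, 2048), batched=True)
# ds.save_to_disk(save_path)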

@waterhorse1
Author

@chaoyi-wu Thanks for your answer! I also ran into a problem when running finetune_pp_peft_trainer_lora.sh:

ValueError: FlatParameter requires uniform requires_grad

Any idea why this happens?

@chaoyi-wu
Owner

Yes, FSDP with LoRA has this bug and we are going to fix it. You may use DeepSpeed instead if you are working with LoRA.
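For reference, a minimal ZeRO-2 DeepSpeed config that can be handed to the Hugging Face Trainer in place of the FSDP flags (a sketch only; the "auto" values and the ZeRO stage should be tuned to your setup):

# Minimal DeepSpeed ZeRO-2 config; TrainingArguments accepts it as a dict
# (or as a path to an equivalent JSON file).
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
}

# training_args = transformers.TrainingArguments(..., deepspeed=ds_config)
# trainer = transformers.Trainer(model=lora_model, args=training_args, ...)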
