Hi chaoyi,
Thanks for your great work. I have a question about dataset tokenization in the following code.
Finetune_LLAMA/Data_sample/tokenize_dataset.py
Lines 38 to 47 in 1d4280e
From my understanding, this preprocessing means that different documents can end up in the same data chunk. For example, the first document might take 512 tokens and the second 128 tokens within a 640-token chunk. In that case, generation for the second document should not see the first one, so we might need an attention mask that hides the first document while generating the second. Am I correct?
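For concreteness, here is a minimal sketch of the concatenate-and-chunk preprocessing described above. It is not the repository's actual code; the tokenizer, the `eos_token_id` boundary marker, and the 640-token chunk size are placeholders taken from the example in the question.

```python
# Illustrative sketch only, not the code in tokenize_dataset.py.
# Documents are tokenized, joined into one long token stream, and cut into
# fixed-size chunks, so one chunk can contain pieces of several documents.
def chunk_documents(documents, tokenizer, chunk_size=640):
    stream = []
    for doc in documents:
        stream.extend(tokenizer.encode(doc))
        stream.append(tokenizer.eos_token_id)  # mark the document boundary
    n_chunks = len(stream) // chunk_size       # drop the incomplete tail
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]
```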
Since this project is a tutorial, the code here mainly aims to keep the main code simple, avoid messy padding operations, and make the whole training flow more readable.
In practice, this kind of preprocessing is only suitable for large, loosely structured pre-training corpora. In most cases, you need to replace the dataset Python file with your own and add the correct attention and padding masks based on the characteristics of your data.
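The mask mentioned here is usually a block-diagonal ("document-wise") causal mask over the packed chunk, so each token only attends to earlier tokens of its own document. A rough sketch, assuming the EOS token id is what separates documents in the packed sequence:

```python
import torch

def packed_attention_mask(input_ids: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    """Causal mask that also blocks attention across document boundaries.

    input_ids: (seq_len,) token ids of one packed chunk.
    Returns a (seq_len, seq_len) bool mask where True means "may attend".
    """
    is_eos = (input_ids == eos_token_id).long()
    # Document id of each position = number of EOS tokens strictly before it.
    doc_ids = torch.cumsum(is_eos, dim=0) - is_eos
    same_doc = doc_ids.unsqueeze(1) == doc_ids.unsqueeze(0)
    causal = torch.tril(torch.ones(len(input_ids), len(input_ids), dtype=torch.bool))
    return same_doc & causal
```

Frameworks that support sequence packing often achieve the same effect by resetting `position_ids` at each document boundary and using a variable-length attention kernel, so it is worth checking what your training stack already provides.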