[Feature] Implement sample-ids-to-text extractor #116
stas00 merged 11 commits into bigscience-workshop:main from
Conversation
I started validating that the data matches, and it does for the first samples, but then it no longer matches. I'm trying to find out where the indices start to diverge. I think the problem is that the script in your PR indexes into the dataset rather than the dataloader, so the two aren't the same: the dataloader shuffles the data (although it saves a cache once it has done so). To compare, I'm dumping the sequence indices on the Deepspeed side and trying to match the two sides.
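To make the failure mode concrete, here is a minimal illustration (hypothetical code, not the PR's script) of why indexing the dataset directly cannot match a shuffling dataloader: the sampler permutes the indices, so position `i` in the dataloader's stream generally refers to a different record than `dataset[i]`.

```python
# Minimal sketch (not the PR's code): a shuffling sampler yields a
# permutation of the indices, so the i-th element seen via the
# dataloader is generally not dataset[i].
dataset = ["rec_a", "rec_b", "rec_c", "rec_d"]

# An example permutation such as a shuffling sampler might produce.
shuffled_order = [3, 0, 2, 1]

# What iterating the (shuffled) dataloader effectively sees:
dataloader_view = [dataset[i] for i in shuffled_order]

print(dataset[0])          # rec_a
print(dataloader_view[0])  # rec_d -- same position, different record
```

The same records are present on both sides, just in a different order, which is exactly why the dumps agree on membership but not position.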
Interesting, it's identical at first, but then it starts diverging at idx 256: meg-ds vs. your script, dumping range 0..300 (modified a bit). Something happens at sample 256. Trying to figure it out.
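A tiny helper along these lines (hypothetical, not part of the PR) is enough to pinpoint where two dumped index sequences first diverge:

```python
def first_divergence(a, b):
    """Return the first position where sequences a and b differ,
    or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Equal on the common prefix; differ only if lengths differ.
    return None if len(a) == len(b) else min(len(a), len(b))

first_divergence([5, 8, 2, 9], [5, 8, 7, 9])  # -> 2
```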
I figured it out. My config was set to run for only 100 iterations; it consumed 256 samples and then ran validation. Since I was dumping inside Deepspeed from the iterator, from sample 256 on it was actually dumping from the validation dataloader, hence the discrepancy. So I think it's all good, but I will do some more checks.
I added a few variations on what data we dump and how we dump it. So the newly added args are: And an example: We should think about where to document this new feature.
I'm thinking we probably don't need some of the args the script currently requires to run, as in the example in the OP. Need to check which of these don't impact the generated .npy file and set those to some default. e.g. I don't think
You are right that arguments like
OK, I managed to change the code so it no longer requires the irrelevant args; we only need 3 of them to get the correct .npy files. So now it's just: Please let me know if the latest incarnation looks good, and I'll be happy to merge it then.
Hmm, I think I was using this script incorrectly. The index here is a sample ID, and I used it to match an iteration ID. The sample ID is very dynamic while we do batch-size ramp-up, so it can be tricky to calculate the right sample ID. Moreover, in the current trainings we use GBS=1024, so it'd be pretty much impossible to actually find which record led to a blow-up, since one would have to review up to 10240 records if we log every 10 iterations. Hmm. I will post on Slack and we will see if we find some better way to "zoom in". But we should probably stress in the documentation that the sample ID is not an iteration ID.
Yes, so I usually looked at the
It would be nice to stress that!
I think this opinion is wrong. Never mind!
We have to deal with batch-size ramp-up, so as you said, "consumed_samples" is probably what is needed. Except with GBS=2048 and logging every 10 iterations, we have to read through 20K records! Let's continue the discussion here: https://huggingface.slack.com/archives/C01NHER1JLS/p1632961981146200 and we can post a summary after we sort it out.
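Under a constant global batch size (i.e. ignoring ramp-up, which makes the real mapping trickier), the translation from an iteration number to the sample IDs it consumed is just arithmetic. A hedged sketch (`sample_id_range` is a made-up helper for illustration, not part of the script):

```python
def sample_id_range(iteration, gbs):
    """Sample IDs consumed by a 1-based iteration at constant global
    batch size, as a half-open [start, end) range."""
    start = (iteration - 1) * gbs
    return start, start + gbs

# Logging every 10 iterations at GBS=2048 leaves 10 * 2048 = 20480
# records to scan per logged window, hence the "20K records" above.
sample_id_range(101, 2048)  # -> (204800, 206848)
```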
OK, we probably should add an output-file option; otherwise all that logging gets mixed into the output. OK, done.
LGTM! Thanks @stas00
Hello, I tried to tackle issue #56, and here is the sample script I used:
Example usage:
If you want tokens instead of text, remove
`--print-text` and add `--print-tokens` (you can also have both). If you want full token dumps, add `--all-tokens`. Full doc is inside the script's preamble.
(edited to bring the OP up to date)
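Conceptually, the flag combination behaves like the following sketch (hypothetical code, not the script's actual implementation; `render_sample` and the toy vocab are made up for illustration):

```python
def render_sample(token_ids, vocab, print_text=True, print_tokens=False):
    """Emit the decoded text, the raw token IDs, or both for one sample,
    mirroring the --print-text / --print-tokens flag semantics."""
    lines = []
    if print_tokens:
        lines.append(" ".join(str(t) for t in token_ids))
    if print_text:
        lines.append("".join(vocab[t] for t in token_ids))
    return "\n".join(lines)

vocab = {0: "hello", 1: " ", 2: "world"}
print(render_sample([0, 1, 2], vocab))                     # hello world
print(render_sample([0, 1, 2], vocab, print_tokens=True))  # both lines
```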
Fixes #56