fix: Convert Multilingual/Crosslingual to fast-loading format #635
Conversation
One of the issues with converting existing datasets is that several of them use custom loading scripts, which makes the conversion to the fast format non-trivial. For example, Flores, NTREX or IN22-Conv don't explicitly define subsets for each language pair: they contain data for each language, and the pairs are created on the fly (so the 'en-fr' subset and the 'en-es' one share the same 'en' sentences). I'm not sure what the correct way to handle this would be. Converting those datasets to the standard "1 file per subset" format would duplicate a lot of data, but having a different configuration for each dataset is hard to maintain.
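A rough sketch of the on-the-fly pairing described above (the language names and sentences are made up; this is not the actual Flores/NTREX/IN22-Conv loading-script code):

```python
from itertools import permutations

# Hypothetical per-language sentence lists, aligned line by line; the real
# loading scripts read these from one file per language.
sentences = {
    "en": ["Hello world.", "How are you?"],
    "fr": ["Bonjour le monde.", "Comment allez-vous ?"],
    "es": ["Hola mundo.", "¿Cómo estás?"],
}

# Pairs are built on the fly: every "<src>-<tgt>" subset zips two of the
# shared lists, so "en-fr" and "en-es" reuse the exact same "en" sentences.
pairs = {
    f"{l1}-{l2}": list(zip(sentences[l1], sentences[l2]))
    for l1, l2 in permutations(sentences, 2)
}

print(pairs["en-fr"][0])  # ('Hello world.', 'Bonjour le monde.')
```

Materializing one file per pair would store each language's sentences once per counterpart language, hence the duplication mentioned above.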
@loicmagne, let us do the easy ones in this PR and then discuss how to best handle the rest.
I've managed to convert most multilingual datasets from the STS, Pair Classification and Classification categories, at least those with more than 10 subsets. The remaining datasets are in the Bitext Mining category and are not straightforward to convert: they either have a different format, or have too many files to be loaded in one go (see this issue huggingface/datasets#6877). I suggest we merge this PR and discuss how to handle the remaining datasets? @KennethEnevoldsen
I checked that all the results remain the same within a 1e-4 threshold, although I don't really know why they sometimes vary slightly.
My guess is that it has to do with a single place where the calculations change per run, influencing the seed (a solution is to use an rng_state which is passed along but not influenced by other operations, as is e.g. done in #481). For a related blog post you might want to check out: https://builtin.com/data-science/numpy-random-seed. I think that is a separate PR though. I think merging this is a great idea. Will you add points (for bug fixes, one point per dataset seems reasonable)? Will you also create an issue for the remaining datasets not addressed here?
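For illustration, a minimal sketch of the rng_state idea mentioned above, assuming numpy: an explicit Generator is threaded through the sampling code so unrelated RNG calls can't shift the result. The function and variable names are made up and this is not the #481 implementation.

```python
import numpy as np

def subsample(items: list, n: int, rng: np.random.Generator) -> list:
    # Draw n indices with the generator that was passed in; any np.random.*
    # calls elsewhere in the run no longer influence which items are picked.
    idx = rng.choice(len(items), size=min(n, len(items)), replace=False)
    return [items[i] for i in idx]

rng = np.random.default_rng(42)  # created once from a fixed seed
sampled = subsample(list(range(1000)), 10, rng)
```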
Sounds good, I'm writing the issue for the remaining datasets.
Issue opened here for the remaining datasets: #651
@KennethEnevoldsen I think this makes those datasets usable now: loading and running evaluation on Flores over the 42k language pairs now takes <10 minutes with small models. The last remaining dataset, BibleNLPBitextMining, relies on a fix in the datasets lib (huggingface/datasets#6893) which isn't in the latest release yet, so I'll wait for that.
@loicmagne once we have resolved the question related to BUCC, I believe we can merge this.
@KennethEnevoldsen For the BUCC dataset, you can see the new results in the BUCC.json file. Overall there's a ~1% point difference; I can revert the changes if that's a problem, but it simplifies the code greatly. How did you proceed for other tasks (clustering, I think) where the changes created non-backward-compatible results?
@loicmagne what we have done is keep the old implementation (which you might do here by moving the logic over to the BUCC task) and then add a separate v2 task. I believe this is the best approach here as well.
@KennethEnevoldsen Alright, I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation time difference between the two, so I think it's worth having the new version. I think we can merge then.
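As an illustration of the versioning pattern, a sketch with placeholder classes only (not mteb's actual task hierarchy or API):

```python
class BUCCBitextMining:
    """Legacy task: keeps the original loading logic so previously
    published scores stay reproducible."""
    name = "BUCC"

    def load_data(self):
        ...  # original per-pair loading script


class BUCCBitextMiningFast(BUCCBitextMining):
    """Same evaluation, but reads the repacked fast-format data
    (roughly 20x faster to evaluate end to end)."""
    name = "BUCC.v2"

    def load_data(self):
        ...  # single load of the dataset with a "lang" column
```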
@KennethEnevoldsen Can we merge this?
Yes indeed - thanks again @loicmagne! I have enabled automerge
Sorry, I was out yesterday so I didn't have time to look at this before now. 20x definitely seems like a good reason to introduce v2.
Following #530, #572
The goal of this PR is to convert multilingual/crosslingual datasets to the fast-loading format, i.e. where each row in the dataset has an additional "lang" feature. I don't know of an automatic way to do this, so for now I'm updating each dataset on a case-by-case basis, which is a bit tedious.
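Roughly, the fast-loading layout looks like this on the consumer side (the repository name and split are placeholders; only the "lang" column comes from this PR's description):

```python
from collections import defaultdict
from datasets import load_dataset

# One dataset holds every language pair; each row carries its own "lang"
# value instead of living in a separate per-pair configuration.
ds = load_dataset("some-org/some-fast-format-dataset", split="test")

# A single download/parse, then an in-memory group-by per language pair.
by_pair = defaultdict(list)
for row in ds:
    by_pair[row["lang"]].append(row)

print(list(by_pair)[:3])  # e.g. ['en-fr', 'en-es', 'fr-es']
```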
List of datasets converted/to convert:
- STS:
- Pair classification:
- Bitext Mining:
  - 🚧 BibleNLPBitextMining
- Classification

Those are the datasets with >10 subsets, which would benefit the most from fast loading.