dataset in French NLI and STS - like #4
Comments
Hi @dataislife ,
XNLI is the cross-lingual NLI corpus, which extends the development and test sets of MultiNLI to 15 languages. The training set for each language is machine-translated from the English training set of MultiNLI. The dev and test sets for each language are translated by professional translators (the dev set is translated from the dev set of MultiNLI, whereas the test set is created anew following the same data collection procedure as MultiNLI's). All of the data are obtained from the XNLI paper.
We also evaluate Flaubert on an STS-like dataset in French, namely the CLS dataset. This dataset consists of Amazon reviews for three product categories (books, DVD, and music) in four languages: English, French, German, and Japanese. The labels are derived from the review ratings (a rating greater than 3 is labeled as positive, and less than 3 as negative). You can check out more tasks that we used to evaluate Flaubert in our paper.
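The rating-to-label rule just described can be sketched in a few lines of Python. This is only an illustration: the helper name and the treatment of 3-star reviews as discarded neutrals are assumptions, not the exact CLS preprocessing.

```python
# Sketch of the CLS-style label binarization described above.
# Ratings > 3 map to "positive", < 3 to "negative"; reviews rated
# exactly 3 are treated here as neutral and dropped (an assumption,
# not necessarily the official CLS procedure).

def binarize_rating(rating):
    """Map a 1-5 star rating to a binary sentiment label (or None)."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return None  # neutral reviews are dropped

# Illustrative toy data, not from the actual CLS files
reviews = [("Superbe livre", 5), ("Décevant", 1), ("Correct", 3)]
labeled = [(text, binarize_rating(r)) for text, r in reviews
           if binarize_rating(r) is not None]
```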
Thank you for your suggestion! This is indeed what we did in our paper (the translated data was obtained from XNLI as well).
Hi @formiel,

However, I don't quite get the function `extract_nli.py`. Indeed, I used the bash file `get-data-xnli.sh`, and now I want to extract the data for French only. I could do it with pandas, but I would like to benefit from the same preprocessing steps you did, so I focused on `extract_nli.py`:

```python
import csv
import os

# `path` and `get_labels` are defined elsewhere in extract_nli.py
splits = ['valid', 'test', 'train']
lang = 'fr'

for s in splits:
    sent_pairs = []
    labels = []
    with open(os.path.join(path, lang + '.raw.' + s), 'rt', encoding='utf-8') as f_in:
        next(f_in)  # skip the header line
        with open(os.path.join(path, '{}_0.xlm.tsv'.format(s)), 'w') as f_out:
            tsv_output = csv.writer(f_out, delimiter='\t')
            for line in f_in:
                sent_pair, label = get_labels(line)
                sent_pairs.append(sent_pair)
                labels.append(label)
                tsv_output.writerow([sent_pair, label])
```

I don't quite get where you get the data from. Indeed, I don't see any 'valid', 'test', or 'train' sets after getting the data from the bash file. I have two folders, XNLI-1.0 and XNLI-MT-1.0, which do not contain such files. Also, at training time for French, did you concatenate XNLI-MT-1.0/multinli/multinli.train.fr.tsv with XNLI-MT-1.0/xnli/xnli.dev.en.tsv (with additional filtering on the language column to get 'fr' only)?

Cheers,
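As a side note, the language-column filtering mentioned above can be done with the standard library alone, without pandas. This is only a sketch: the function name is mine, and the assumptions that the XNLI TSV has a header row and a language column at index 0 should be checked against your local copy of the data.

```python
import csv

def filter_language(in_path, out_path, lang="fr", lang_col=0):
    """Keep only rows whose language column matches `lang`.

    Assumes a tab-separated file with a header row; the index of the
    language column (`lang_col`) may differ in your copy of the data.
    """
    with open(in_path, encoding="utf-8", newline="") as f_in, \
         open(out_path, "w", encoding="utf-8", newline="") as f_out:
        reader = csv.reader(f_in, delimiter="\t")
        writer = csv.writer(f_out, delimiter="\t")
        header = next(reader)
        writer.writerow(header)       # keep the header in the output
        for row in reader:
            if row[lang_col] == lang:
                writer.writerow(row)
```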
Hi @dataislife , Thank you for using our code as well! :)
The script By running
Could you please send me your running command so that I can know why it does not output the raw train, valid, and test sets as expected?
The script
We obtain the data for fine-tuning by first running After the data preparation step, you can fine-tune by running the script We have just re-organized the repo, so could you please temporarily use the code from the I hope this helps. Please let me know if you encounter any other problem.
Hi, Thank you very much for your detailed answer once again. It was very helpful. I had the following issue:
I solved it by running the bash command as root and modifying the bash file prepare-data-xnli.sh to:
I'm running this on macOS; this may be useful for someone in the future. Cheers,
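One recurring portability pitfall in these shell scripts is that the BSD sed shipped with macOS requires an explicit (possibly empty) backup suffix after `-i`, unlike GNU sed on Linux. If editing the script per platform is a hassle, a small Python helper (a sketch, not part of the repo) performs the same in-place substitution identically on both systems:

```python
import re

def replace_in_file(path, pattern, repl):
    """Portable in-place regex substitution, usable instead of `sed -i`,
    whose syntax differs between GNU sed (Linux) and BSD sed (macOS)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(re.sub(pattern, repl, text))
```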
Hi, I also modified prepare-data-xnli.sh to make it work for me.
However, for the preprocessing to be complete, the preprocess.py file could be useful. Thanks
Quick update: it is working fine, except that I have an "L" letter included in the first token of each line of the files containing hypotheses and premises, so that in the train.x1 file, I have:
Obviously this could be handled afterwards, but I wonder where exactly it originates from. Any idea is welcome! :) Cheers,
@dataislife: You don't need to run the script as root. Regarding the L issue, it's because macOS @QuentinSpalla:
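Until the root cause is found, the stray leading "L" can indeed be handled afterwards, as suggested above. The following is a hypothetical clean-up sketch (the function name and the assumption that every line carries the prefix are mine, not the repo's):

```python
def strip_stray_prefix(in_path, out_path, prefix="L"):
    """Remove a stray leading `prefix` from the first token of each line.

    Workaround sketch for the extra 'L' observed in train.x1 on macOS.
    Assumes the stray prefix appears at the very start of every affected
    line; lines legitimately starting with `prefix` would also be
    truncated, so inspect the data before applying this.
    """
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            if line.startswith(prefix):
                line = line[len(prefix):]
            f_out.write(line)
```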
@formiel: Thanks for your addition. This is my error, coming from embedder.py:
Because I don't have it, and I would like to do the training on a classical server. Thanks
@QuentinSpalla : Could you please tell me how you installed PyTorch and on what operating system? Try replacing
with
@formiel: Thanks for your help. I modified embedder.py and flue.py multiple times in order to make the fine-tuning work on CPU. Thanks
@QuentinSpalla: Yes, the fine-tuning runs much slower on CPU than on GPU. On one Nvidia Titan X GPU, an epoch (i.e. a pass through the entire dataset) takes ~1.5 hours, and I ran it for 30 epochs. Regarding PyTorch, I guess you should always use the latest version (currently 1.4.0), but previous versions (1.1-1.3) work just fine. I don't have much experience with Google Cloud Platform, but if you use a few Nvidia Tesla K80s for the task, I think it would take a few hours. For your information, I'm going to add the fine-tuning script using Hugging Face's Transformers for the XNLI task in the upcoming days. We may release our pre-trained models on downstream tasks in the near future. Let me confirm this later.
@formiel: Thanks for all the details, I am going to try it.
That would be great, I look forward to seeing it.
@formiel: Besides, how can I use the fine-tuned model to make sentence embeddings? Thanks
Hi @QuentinSpalla, I suggest you open a new issue for the question related to sentence embeddings. This could be useful for other people using this tool. Cheers,
@QuentinSpalla You can download all models from here: https://zenodo.org/record/3622251 (Please note that the base-cased and large-cased models are partially trained, we will upload the fully-trained weights when they are available.)
Could you please be more specific about what you want to get? You may consider opening a new issue, as @dataislife has suggested.
Hi,
When getting the data from get-data-xnli.sh, I noticed that most of the dataset is not in French. Hence, I wonder how you used it in practice?
I am currently looking for some NLI-like and STS-like datasets in French. It would be great to fine-tune Flaubert on those!
As a suggestion, translating the English versions of NLI and STS to French could be a good option for fine-tuning Flaubert on such tasks.