dataset in French NLI and STS - like #4

Closed
dataislife opened this issue Jan 15, 2020 · 15 comments
@dataislife

Hi,

When getting data from get-data-xnli.sh, I notice that most of the dataset is not in French. Hence, I wonder how you used it in practice?

I am currently looking for some NLI-like and STS-like datasets in French. That would be great for fine-tuning Flaubert!

As a suggestion, translating the English versions of NLI and STS to French could be a good option for fine-tuning Flaubert on such tasks.

@formiel
Contributor

formiel commented Jan 16, 2020

Hi @dataislife ,

When getting data from get-data-xnli.sh, I notice that most of the dataset is not in French. Hence, I wonder how you used it in practice?

XNLI is the cross-lingual NLI corpus which extends the development and test sets of MultiNLI to 15 languages. The training set for each language is machine-translated from the English training set of MultiNLI. The dev and test sets for each language are translated by professional translators (the dev set is translated from the dev set of MultiNLI, whereas the test set is created anew following the same data collection procedure as MultiNLI's). All of the data are obtained from the XNLI paper.

Therefore, in the folder XNLI-MT-1.0/multinli you can see 15 files with the name format multinli.train.${lang}.tsv, where ${lang} is the language identifier (en for English, fr for French, etc.). These files are translated from the training set of MultiNLI into each language respectively. We thus obtain the French training set from the file multinli.train.fr.tsv.

Regarding the dev and test sets, we extract them from xnli/xnli.${split}.en.tsv, where ${split} is dev or test. The dev and test samples for all 15 languages are included in these 2 files, with a language column in each file. Hence we can obtain the dev and test sets for the language of interest among the 15 languages based on this column.
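For example, a quick way to pull out just the French rows from one of these files could look like the sketch below (illustrative only, not from our code; the column names language, sentence1, sentence2 and gold_label are assumptions about the TSV header, so adjust them to whatever your copy of the file actually contains):

import csv
import pandas as pd

# Illustrative sketch: keep only the French rows of the XNLI dev file.
# quoting=csv.QUOTE_NONE stops stray quote characters from being treated
# as field delimiters.
df = pd.read_csv("XNLI-MT-1.0/xnli/xnli.dev.en.tsv", sep="\t",
                 quoting=csv.QUOTE_NONE, encoding="utf-8")
df_fr = df[df["language"] == "fr"]
df_fr[["sentence1", "sentence2", "gold_label"]].to_csv(
    "xnli.dev.fr.tsv", sep="\t", index=False)
print(len(df_fr), "French dev examples")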

I am currently looking for some NLI-like and STS-like datasets in French. That would be great for fine-tuning Flaubert!

We also evaluate Flaubert on an STS-like dataset in French, the CLS dataset. This dataset consists of Amazon reviews for three product categories (books, DVD, and music) in four languages (English, French, German, and Japanese). The labels are based on the product ratings (a rating greater than 3 is labeled as positive and less than 3 as negative). You can check out the other tasks that we used to evaluate Flaubert in our paper.
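To make the labelling rule concrete, it boils down to something like this (just a sketch of the rule, not code from our pipeline; under this rule, reviews rated exactly 3 belong to neither class):

# Sketch of the CLS labelling rule described above: ratings above 3 are
# positive, ratings below 3 are negative, and 3-star reviews get no label.
def cls_label(rating: float) -> str:
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    raise ValueError("3-star reviews are not assigned a label")

print(cls_label(5))  # positive
print(cls_label(1))  # negative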

As a suggestion, translating the English versions of NLI and STS to French could be a good option for fine-tuning Flaubert on such tasks.

Thank you for your suggestion! This is indeed what we did in our paper (the translated data was obtained from XNLI as well).

@dataislife
Author

dataislife commented Jan 16, 2020

Hi @formiel,
Thank you for the quick and detailed answer. I can see that you indeed have much more data than I had with only XNLI in French. That is really great!

However, I don't quite get the script extract_xnli.py. I used the bash file get-data-xnli.sh, and now I want to extract the data for French only. I could do it with pandas, but I would like to benefit from the same preprocessing steps you did, so I focused on extract_xnli.py.
But, from the code in the main function:

splts = ['valid', 'test', 'train']
lang = 'fr'

for s in splts:
    sent_pairs = []
    labels = []

    with open(os.path.join(path, lang + '.raw.' + s), 'rt', encoding='utf-8') as f_in:
        next(f_in)
        with open(os.path.join(path, '{}_0.xlm.tsv'.format(s)), 'w') as f_out:
            tsv_output = csv.writer(f_out, delimiter='\t')
            for line in f_in:
                sent_pair, label = get_labels(line)
                sent_pairs.append(sent_pair)
                labels.append(label)

                tsv_output.writerow([sent_pair, label])

I don't quite get where you get the data from. Indeed, I don't see any 'valid', 'test', or 'train' sets after getting the data from the bash file. I have two folders, XNLI-1.0 and XNLI-MT-1.0, which do not contain such files.
Is there something I am missing?

Also, at training time for French, did you concatenate XNLI-MT-1.0/multinli/multinli.train.fr.tsv with XNLI-MT-1.0/xnli/xnli.dev.en.tsv (with additional filtering on the language column to keep 'fr' only)?

Cheers,

@formiel
Contributor

formiel commented Jan 16, 2020

Hi @dataislife ,

Thank you for using our code as well! :)

I don't quite get where you get the data from. Indeed, I don't see any 'valid', 'test', or 'train' sets after getting the data from the bash file. I have two folders, XNLI-1.0 and XNLI-MT-1.0, which do not contain such files.
Is there something I am missing?

The script get-data-xnli.sh should get the raw data for French exclusively since it handles the language filtering step here.

By running bash get-data-xnli.sh $DATA_DIR, you should get:

  • In $DATA_DIR/raw: 2 folders XNLI-1.0 and XNLI-MT-1.0 (as you already had). This is the raw data downloaded from XNLI.
  • In $DATA_DIR/processed: 3 files, fr.raw.${split}, where ${split} is train, valid, and test respectively (a quick sanity check is sketched below).
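
For instance, a quick sanity check that the expected files are in place could look like this (illustrative only, not part of our scripts; the script name check_xnli_layout.py is made up):

# check_xnli_layout.py -- illustrative check that get-data-xnli.sh produced
# the layout listed above. Usage: python check_xnli_layout.py $DATA_DIR
import os
import sys

data_dir = sys.argv[1]
expected = [
    os.path.join(data_dir, "raw", "XNLI-1.0"),
    os.path.join(data_dir, "raw", "XNLI-MT-1.0"),
] + [
    os.path.join(data_dir, "processed", "fr.raw." + split)
    for split in ("train", "valid", "test")
]
for path in expected:
    print(("OK       " if os.path.exists(path) else "MISSING  ") + path)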

Could you please send me your running command so that I can know why it does not output the raw train, valid, and test sets as expected?

However, I don't quite get the script extract_xnli.py. I used the bash file get-data-xnli.sh, and now I want to extract the data for French only. I could do it with pandas, but I would like to benefit from the same preprocessing steps you did, so I focused on extract_xnli.py.

The script extract_xnli.py extracts the labels and cleans the text from each of the above raw files (i.e. fr.raw.${split}). The outputs from this script are 3 files named ${split}_0.xlm.tsv, saved in $DATA_DIR/processed. So you can obtain the clean data by running: python extract_xnli.py --indir $DATA_DIR/processed.
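If you want to double-check the output, a minimal sketch for reading a few rows back from one of these files (run from $DATA_DIR/processed; illustrative only):

# Illustrative only: print the first few rows of a file produced by
# extract_xnli.py to verify that the sentences and labels look right.
import csv

with open("valid_0.xlm.tsv", encoding="utf-8") as f_in:
    reader = csv.reader(f_in, delimiter="\t")
    for i, row in enumerate(reader):
        print(row)
        if i >= 2:
            break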

Also, at training time for French, did you concatenate XNLI-MT-1.0/multinli/multinli.train.fr.tsv with XNLI-MT-1.0/xnli/xnli.dev.en.tsv (with additional filtering on the language column to keep 'fr' only)?

We obtain the data for fine-tuning by first running get-data-xnli.sh and then running prepare-data-xnli.sh (please refer to the detailed commands here). The script prepare-data-xnli.sh cleans the data (by running extract_xnli.py in line 23), then tokenises and binarises it to prepare it for fine-tuning. The output is 9 files, 3 for each split, of the following form: ${split}.s1.pth, ${split}.s2.pth, and ${split}.label, where *.s1.pth and *.s2.pth contain the premise and hypothesis sentences respectively, and *.label contains the labels (entailment, contradiction, or neutral).
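If you are curious about what ends up inside the binarised files, you can load one with torch and look at its top-level structure; the sketch below is illustrative only and the exact keys depend on the XLM preprocessing code:

# Illustrative only: inspect one of the binarised outputs of prepare-data-xnli.sh.
import torch

data = torch.load("train.s1.pth")
if isinstance(data, dict):
    print("keys:", list(data.keys()))
else:
    print("type:", type(data))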

After the data preparation step, you can fine-tune by running the script flue_xnli.py (please refer to here for the detailed running command).

We have just reorganized the repo, so could you please temporarily use the code from the dev branch to run these scripts?

I hope this helps. Please let me know if you encounter any other problems.

@dataislife
Author

dataislife commented Jan 17, 2020

Hi,

Thank you very much for your detailed answer once again. It was very helpful.

I had the following issue:

prepare-data-xnli.sh: line 42: tools/tokenize.sh: Permission denied 

I solved it by running the bash command as root and modifying prepare-data-xnli.sh to:

#!/bin/bash
# Hang Le (hangtp.le@gmail.com)
# Modified from
# https://github.com/facebookresearch/XLM/blob/master/prepare-xnli.sh
# Original copyright is appended below.
#
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

#
# This script is meant to prepare data to reproduce XNLI experiments
# Just modify the "code" and "vocab" path for your own model
#
DATA_DIR=$1
MODEL_DIR=$2

DATA_PATH=$DATA_DIR/processed

python extract_xnli.py --indir $DATA_PATH

set -e
lg=fr

# data paths
FASTBPE=tools/fastBPE/fast
TOKENIZER=tools/tokenize.sh

mkdir -p $DATA_PATH

#CODES_PATH=$MODEL_DIR/codes
#VOCAB_PATH=$MODEL_DIR/vocab

## Clean text
for split in train valid test; do
    awk -F '\t' '{ print $1}' $DATA_PATH/${split}_0.xlm.tsv \
    | awk '{gsub(/\"/,"")};1' \
    | sed -e 's/\(.*\)/\L\1/' \
    | sudo bash $TOKENIZER $lg \
    > $DATA_PATH/${split}.x1

    awk -F '\t' '{ print $2}' $DATA_PATH/${split}_0.xlm.tsv \
    | awk '{gsub(/\"/,"")};1' \
    | sed -e 's/\(.*\)/\L\1/' \
    | sudo bash $TOKENIZER $lg \
    > $DATA_PATH/${split}.x2

    sudo awk -F '\t' '{ print $3}' $DATA_PATH/${split}_0.xlm.tsv \
    > $DATA_PATH/${split}.label

    echo "Finished processing ${split} and saved to $DATA_PATH."
done
echo 'Finished preparing data.'

# apply BPE codes and binarize the GLUE corpora
#for splt in train valid test; do
#    echo "BPE-rizing $splt..."
#    $FASTBPE applybpe $DATA_PATH/$splt.s1 $DATA_PATH/$splt.x1 $CODES_PATH
#    python preprocess.py $VOCAB_PATH $DATA_PATH/$splt.s1
#    rm $DATA_PATH/$splt.x1

#    $FASTBPE applybpe $DATA_PATH/$splt.s2 $DATA_PATH/$splt.x2 $CODES_PATH
#    python preprocess.py $VOCAB_PATH $DATA_PATH/$splt.s2
#    rm $DATA_PATH/$splt.x2
#done

I'm running this on macOS; it may be useful for someone in the future.

Cheers,

@QuentinSpalla

Hi,
and thanks for this useful topic.

I also modified the prepare-data-xnli.sh to make it work for me.

#!/bin/bash
# Hang Le (hangtp.le@gmail.com)
# Modified from
# https://github.com/facebookresearch/XLM/blob/master/prepare-xnli.sh
# Original copyright is appended below.
# 
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

#
# This script is meant to prepare data to reproduce XNLI experiments
# Just modify the "code" and "vocab" path for your own model
#
DATA_DIR=$1
MODEL_DIR=$2

DATA_PATH=$DATA_DIR/processed

python3 extract_xnli.py --indir $DATA_PATH  

set -e
lg=fr

# data paths
FASTBPE=tools/fastBPE/fast
TOKENIZER=tools/tokenize.sh

mkdir -p $DATA_PATH

CODES_PATH=$MODEL_DIR/codes
VOCAB_PATH=$MODEL_DIR/vocab

## Clean text
for split in train valid test; do
    awk -F '\t' '{ print $1}' $DATA_PATH/${split}_0.xlm.tsv \
    | awk '{gsub(/\"/,"")};1' \
    | sed -e 's/\(.*\)/\L\1/' \
    | bash $TOKENIZER $lg \
    > $DATA_PATH/${split}.x1

    awk -F '\t' '{ print $2}' $DATA_PATH/${split}_0.xlm.tsv \
    | awk '{gsub(/\"/,"")};1' \
    | sed -e 's/\(.*\)/\L\1/' \
    | bash $TOKENIZER $lg \
    > $DATA_PATH/${split}.x2

    awk -F '\t' '{ print $3}' $DATA_PATH/${split}_0.xlm.tsv \
    > $DATA_PATH/${split}.label

    echo "Finished processing ${split} and saved to $DATA_PATH."
done
echo 'Finished preparing data.'

# apply BPE codes and binarize the GLUE corpora
for splt in train valid test; do
    echo "BPE-rizing $splt..."
    $FASTBPE applybpe $DATA_PATH/$splt.s1 $DATA_PATH/$splt.x1 $CODES_PATH
    # python3 preprocess.py $VOCAB_PATH $DATA_PATH/$splt.s1
    rm $DATA_PATH/$splt.x1

    $FASTBPE applybpe $DATA_PATH/$splt.s2 $DATA_PATH/$splt.x2 $CODES_PATH
    # python3 preprocess.py $VOCAB_PATH $DATA_PATH/$splt.s2
    rm $DATA_PATH/$splt.x2
done

However, for the preprocessing to be complete, the preprocess.py file would be useful.
It seems it is not in the folder. Does anybody know if it is necessary? If yes, what is in it?

Thanks

@dataislife
Author

dataislife commented Jan 17, 2020

Quick update: it is working fine, except that an "L" is included in the first token of each line of the files that contain the hypotheses and premises, so that in the train.x1 file I have:

LL ' écrémage conceptuel de la crème a deux dimensions fondamentales : le produit et la géographie .
LTu sais pendant la saison et je suppose qu ' à ton niveau euh tu les perds au niveau suivant si s ' ils décident de se rappeler l ' équipe des parents les braves décident d ' appeler pour rappeler un mec du triple a puis un double un mec monte à remplacez-le et un seul homme monte pour le remplacer .
LUn de nos numéros vous fera suivre vos instructions minutieusement .
LQu ' est-ce que tu en sais ? Tout ceci est à nouveau leur information .
LOuais je te dis ce que si tu vas prix certaines de ces chaussures de tennis je peux voir pourquoi maintenant tu sais qu ' ils se se dans la gamme des cent dollars

Obviously this could be handled afterwards, but I wonder where exactly it originates from. Any ideas are welcome! :)

Cheers,

@formiel
Contributor

formiel commented Jan 18, 2020

@dataislife: You don't need to run the script with sudo, just run chmod +x tools/tokenize.sh to give execute permission to the tokenize.sh script (I will update the script to add this line). Alternatively, you can add bash as @QuentinSpalla did.

Regarding the L issue, it's because macOS sed doesn't support the \L function. You can replace this line and this line with perl -ne 'print lc' if you want to use the uncased (i.e. lowercase) model. The cased Flaubert is also available; I will update the instructions for using it soon.
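
If you would rather not depend on perl, an equivalent stdin filter in Python also works; a minimal sketch (the file name lowercase_stdin.py is made up, and you would pipe through it in place of the sed call, e.g. | python lowercase_stdin.py \):

# lowercase_stdin.py -- illustrative, portable replacement for the
# sed -e 's/\(.*\)/\L\1/' step: lowercase every line read from stdin.
import sys

for line in sys.stdin:
    sys.stdout.write(line.lower())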

@QuentinSpalla: preprocess.py is used to binarise the data. It reads an input text file and the BPE vocabulary file learned during pre-training, and outputs the corresponding binarised .pth file. Sorry for missing this script! I've just added it to the repo. Please download this file and add it to your local folder (I'm reorganising the code, so downloading the whole repo again may not work due to path issues).

@QuentinSpalla

@formiel: Thanks for adding it.
Is it mandatory to use CUDA for fine-tuning FlauBERT?

This is my error:
AssertionError: Torch not compiled with CUDA enabled

coming from embedder.py:

def cuda(self):
    self.model.cuda()

I don't have a GPU and I would like to do the training on a regular (CPU-only) server.

Thanks

@formiel
Contributor

formiel commented Jan 20, 2020

@QuentinSpalla : Could you please tell me how you installed PyTorch and on what operating system? Try replacing

def cuda(self):
    self.model.cuda()

with

def cuda(self):
    self.model = self.model.to("cuda" if torch.cuda.is_available() else "cpu")

@QuentinSpalla

@formiel: thanks for your help.

I modified embedder.py and flue.py multiple times in order to make the fine-tuning work on CPU.
Now it is running, but it seems slow.
I have the opportunity to use Google Cloud Platform for fine-tuning FlauBERT.
Can you advise me on which Nvidia GPU to use, which torch version, etc., please?
Also, do you know how long the fine-tuning on the XNLI corpus takes?

Thanks

@formiel
Contributor

formiel commented Jan 21, 2020

@QuentinSpalla: Yes, the fine-tuning runs much more slowly on CPU than on GPU. On one Nvidia Titan X GPU, an epoch (i.e. a pass through the entire dataset) takes ~1.5 hours, and I ran it for 30 epochs.
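That is roughly 30 × 1.5 ≈ 45 hours in total on that single GPU.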

Regarding PyTorch, I guess you should always use the latest version (currently 1.4.0) but previous versions (1.1-1.3) work just fine.

I don't have much experience with Google Cloud Platform, but if you use a few Nvidia Tesla K80s for the task, I think it would take a few hours.

For your information, I'm going to add the fine-tuning script using Hugging Face's Transformers for XNLI task in the upcoming days. We may release our pre-trained models on downstream tasks in the near future. Let me confirm this later.

@QuentinSpalla

@formiel: thanks for all the details, I am going to try it.

For your information, I'm going to add the fine-tuning script using Hugging Face's Transformers for XNLI task in the upcoming days. We may release our pre-trained models on downstream tasks in the near future. Let me confirm this later.

That will be great, I look forward to seeing it.

@QuentinSpalla

@formiel:
I fine-tuned the flaubert_base_lower model on the XNLI corpus.
Can you tell me where the new model is? Is it just that flaubert_base_lower.pth is updated?

Besides, how can I use the fine-tuned model to make sentence embeddings?

Thanks

@dataislife
Author

Hi @QuentinSpalla, I suggest you open a new issue for the question related to sentence embeddings. This could be useful for other people using this tool.

Cheers,

@formiel
Contributor

formiel commented Jan 22, 2020

Can you tell me where the new model is? Is it just that flaubert_base_lower.pth is updated?

@QuentinSpalla You can download all models from here: https://zenodo.org/record/3622251 (Please note that the base-cased and large-cased models are partially trained, we will upload the fully-trained weights when they are available.)

Besides, how can I use the fine-tuned model to make sentence embeddings?

Could you please be more specific on what you want to get? You may consider opening a new issue as @dataislife has suggested.
