How can I train Flaubert on a different corpus (not Gutenberg, Wiki) for another domain? #9
Hi @keloemma, Sorry for the unclear documentation. I have updated the README with detailed instructions and added scripts for splitting and preprocessing the data. Could you please try again?
Good afternoon. Thank you for your reply. I tried the proposed solution and I am getting this error:

```
INFO - 01/29/20 17:57:17 - 0:00:00 - The experiment will be stored in data/model/flaubert_base_cased/nqp72nh6ph
INFO - 01/29/20 17:57:17 - 0:00:00 - Running command: python train.py --exp_name flaubert_base_cased --dump_path 'data/model' --data_path 'data/processed/fr_corpus/BPE/10k' --amp 1 --lgs fr --clm_steps '' --mlm_steps fr --emb_dim 768 --n_layers 12 --n_heads 12 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --batch_size 16 --bptt 512 --optimizer 'adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001' --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion '_valid_fr_mlm_ppl,20' --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'
INFO - 01/29/20 17:57:17 - 0:00:00 - ***** Starting time 1580317037.413298 *****
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/valid.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/test.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - ============ Data summary
INFO - 01/29/20 17:57:17 - 0:00:00 - ***** Time limit to run script: -1 (min) *****
```

Do you have any idea how I can debug it? It seems to be linked to namespaces and the argument parser.
Hi @keloemma, Thanks for reporting the error! Sorry, that was a bug. I have fixed it (lines 257-259). Could you please try again?
Thank you, I am now getting this error: I tried to change the server (and looked for an answer in other GitHub repos where people have the same error), but I am still getting the same error. When I tried to lower the parameters (emb_layers, batch_size, etc.) I still get the error, or other errors, so I was wondering which parameters I should change in order for the command line to work.
@keloemma Could you share the output of
On this server I get this other error:

```
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
```

plus the one related to the lack of memory in CUDA. The other one is like that:
@keloemma There's too little memory available on your servers. If you aim to obtain a good, well-trained Flaubert model on your data, then you should first secure the necessary resources (unfortunately, pre-training usually needs a lot of them). Currently the second GPU of the second server has around 3 GB available, so maybe you can try training a tiny model with a small batch size on it:
I haven't tested it, so I'm not sure how much memory the above will take. Note that the effective batch size is
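For what it's worth, the effective batch size under gradient accumulation follows the usual arithmetic: each optimizer update sees `batch_size` streams per GPU, summed over `accumulate_gradients` steps. A minimal sketch (`effective_batch_size` is an illustrative helper, not part of the XLM/Flaubert codebase):

```python
def effective_batch_size(batch_size, accumulate_gradients, n_gpus=1):
    """Number of streams (sentences of `bptt` tokens each) contributing
    to a single optimizer update when gradients are accumulated over
    `accumulate_gradients` forward/backward passes on `n_gpus` GPUs."""
    return batch_size * accumulate_gradients * n_gpus

# With the flags used in this thread: --batch_size 16 --accumulate_gradients 16
print(effective_batch_size(16, 16))  # 256 on a single GPU
```

Lowering either `--batch_size` or `--accumulate_gradients` reduces memory pressure, but only `--batch_size` reduces per-step GPU memory; accumulation trades memory for wall-clock time at the same effective batch size.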
Thank you, I will try your proposed solution.
Hello, I am coming back to you and I would like to know if you can help me understand --epoch_size and --max_epoch. From what I understood, max_epoch = number of epochs (an epoch being one pass of the entire dataset forward and backward through the neural network), but what is epoch_size in your specific case?

Also, is Flaubert_tiny equal to flaubert_small_cased? And what does --stopping_criterion _valid_fr_mlm_ppl,20 mean?

When the training finishes, I get this file/model. Is this the directory which I am supposed to use for my classification task? I get two files ending with *.pth, so I am guessing I should use the last one.
Hi @keloemma,
No, their architectures are different from each other (you can check out the architecture of our models here). I took the example of a very small network (Flaubert_tiny) so that it would be quicker for you to run and debug the code first; then you can change the parameters to fit the model into your available GPU memory later.
Yes, it means that the training will stop if the validation perplexity does not improve (or decrease) for 20 consecutive epochs.
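A minimal sketch of that patience rule (assuming lower perplexity is better; `should_stop` is an illustrative helper, not the actual XLM implementation):

```python
def should_stop(ppl_history, patience=20):
    """Return True if the last `patience` validation perplexities
    never improved on the best value seen before them."""
    if len(ppl_history) <= patience:
        return False  # not enough evaluations yet
    best_before = min(ppl_history[:-patience])
    # stop if nothing in the last `patience` evals beat the earlier best
    return min(ppl_history[-patience:]) >= best_before

# Perplexity plateaus at 21 for 21 evals -> training would stop
print(should_stop([30 - i for i in range(10)] + [21] * 21))  # True
```

With `--stopping_criterion '_valid_fr_mlm_ppl,20'`, the check runs once per epoch on the validation MLM perplexity.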
Yes, you can use the pretrained weights (saved in …). There are 2 …
I assume that you got your answer, @keloemma?
Good afternoon,
I tried to follow your instructions to train on my own corpus with Flaubert, in order to get a model to use for my classification task, but I am having trouble understanding the procedure.
You said we should use this line to train on our preprocessed data:
```
/Flaubert$ python train.py --exp_name flaubert_base_lower --dump_path ./dumped/ --data_path ./own_data/data/ --lgs 'fr' --clm_steps '' --mlm_steps 'fr' --emb_dim 768 --n_layers 12 --n_heads 12 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 16 --bptt 512 --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion _valid_fr_mlm_ppl,20 --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'
```
I tried it after cloning Flaubert and all the necessary libraries, but I am getting this error:
```
FAISS library was not found.
FAISS not available. Switching to standard nearest neighbors search implementation.
./own_data/data/train.fr.pth not found
./own_data/data/valid.fr.pth not found
./own_data/data/test.fr.pth not found
Traceback (most recent call last):
  File "train.py", line 387, in <module>
    check_data_params(params)
  File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params
    assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()])
AssertionError
```
Does this mean I have to split my own data into three corpora after preprocessing it (train, valid and test)? Should I run your preprocessing script on my own data before executing the command?
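The missing-file names in the traceback suggest the loader expects a train/valid/test split per language. A minimal sketch of such a split on a one-sentence-per-line corpus, done before BPE and binarization (the ratios and the `split_corpus` helper are assumptions; the repo's own splitting script should take precedence):

```python
import random

def split_corpus(lines, valid_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle sentences and carve off validation and test fractions;
    the remainder becomes the training set."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed for reproducibility
    n = len(lines)
    n_valid, n_test = int(n * valid_frac), int(n * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test

train, valid, test = split_corpus(f"sentence {i}" for i in range(100))
print(len(train), len(valid), len(test))  # 90 5 5
```

Each split would then be written to its own file (e.g. `train.fr`, `valid.fr`, `test.fr`) and run through BPE and the binarization step that produces the `*.fr.pth` files the loader asserts on.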