Fine-tune BERT on SQuAD train then SQuAD dev #7

Closed
fmikaelian opened this issue Feb 11, 2019 · 13 comments

@fmikaelian
Collaborator

Use the output weights of a first BERT fine-tuned on SQuAD train as the starting weights for a new BERT fine-tuned on SQuAD dev.

https://github.com/huggingface/pytorch-pretrained-BERT#squad

@fmikaelian
Collaborator Author

It would be a good thing to report for each model training:

  • the commands we used
  • the data we used
  • the training time
  • ...

Also, using MLflow Tracking might make it easier for us to track different models.
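
A minimal sketch of what that tracking could look like, assuming we use the mlflow Python package; the parameter names and values below are placeholders, not actual training results:

import mlflow

with mlflow.start_run(run_name="bert_qa_squad_v1.1"):
    # the commands and data we used
    mlflow.log_param("bert_model", "bert-base-uncased")
    mlflow.log_param("train_file", "squad_data/train-v1.1.json")
    mlflow.log_param("learning_rate", 3e-5)
    mlflow.log_param("num_train_epochs", 2.0)
    # the training time and evaluation metrics (placeholder values)
    mlflow.log_metric("training_time_hours", 0.0)
    mlflow.log_metric("exact_match", 0.0)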

@fmikaelian
Collaborator Author

fmikaelian commented Feb 16, 2019

I have fetched one of the URLs for --bert_model to see what is inside:

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar xvzf bert-base-uncased.tar.gz

It's just a weights file and a model config file:

./pytorch_model.bin
./bert_config.json

I will adapt download.py to save what @andrelmfarias released under the /models folder with the same structure.
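
A rough sketch of what that download.py adaptation could look like; the release URL is a placeholder, only the folder structure follows the archive above:

import tarfile
import urllib.request
from pathlib import Path

model_dir = Path("models/bert_qa_squad_v1.1")
model_dir.mkdir(parents=True, exist_ok=True)

# Placeholder URL; the real one is the released model archive.
archive, _ = urllib.request.urlretrieve("<url-to-released-model.tar.gz>")
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(model_dir)  # yields pytorch_model.bin and bert_config.json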

Then, it seems run_squad.py just loads a model specified by --bert_model when not in training mode (e.g. --do_predict).

    if args.do_train:
        # Save a trained model and the associated configuration
        ...
        # Load a trained model and config that you have fine-tuned
        ...
    else:
        model = BertForQuestionAnswering.from_pretrained(args.bert_model)

The function from_pretrained can take as argument:

  • a path or url to a pretrained model archive containing:
    . bert_config.json a configuration file for the model
    . pytorch_model.bin a PyTorch dump of a BertForPreTraining instance
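
For example, pointing from_pretrained at the folder extracted above (a minimal sketch; the local path is the one we plan to use under /models):

from pytorch_pretrained_bert import BertForQuestionAnswering

# Loads bert_config.json and pytorch_model.bin from the given folder.
model = BertForQuestionAnswering.from_pretrained("models/bert_qa_squad_v1.1")
model.eval()  # prediction only, no gradient updates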

To predict with a model fine-tuned on SQuAD v1.1, we need to run:

python run_squad.py \
  --bert_model models/bert_qa_squad_v1.1 \
  --do_predict \
  --predict_fp16 \
  --do_lower_case \
  --predict_file samples/custom-sample-v2.0.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir logs

With samples/custom-sample-v2.0.json being the file to predict on, in SQuAD format.
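
A minimal sketch of that SQuAD-format structure, written here as a Python dict; the context and question are placeholders, and answers can stay empty at prediction time:

import json

sample = {
    "version": "v2.0",
    "data": [{
        "title": "custom-sample",
        "paragraphs": [{
            "context": "BERT is a language representation model released in 2018.",
            "qas": [{
                "id": "1",
                "question": "When was BERT released?",
                "answers": [],           # can be left empty for prediction
                "is_impossible": False   # v2.0-only field for unanswerable questions
            }]
        }]
    }]
}

with open("samples/custom-sample-v2.0.json", "w") as f:
    json.dump(sample, f)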

@fmikaelian fmikaelian pinned this issue Feb 16, 2019
@fmikaelian
Collaborator Author

@andrelmfarias Should I try this and let you know if it works?

@andrelmfarias
Collaborator

@fmikaelian I think if you do not select the option --do_train, it is only going to predict on samples/custom-sample-v2.0.json and is not going to run a second fine-tuning.

Don't we need to change the script in order to do the fine-tuning on SQuAD dev?

@fmikaelian
Collaborator Author

@andrelmfarias Yes, you are right, the code snippet above is only for prediction.

I can try to generate predictions on the sample samples/custom-sample-v2.0.json with the first model you released (bert_qa_squad_v1.1).

Could you try the second fine-tuning on SQuAD dev to validate the workflow? Don't forget to report your actions and commands ✍️.

@fmikaelian
Collaborator Author

fmikaelian commented Feb 19, 2019

Ideas for second fine-tuning:

    if args.do_train and args.bert_model != 'models/bert_qa_squad_v1.1':
        # Save a trained model and the associated configuration
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
        output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
        torch.save(model_to_save.state_dict(), output_model_file)
        output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
        with open(output_config_file, 'w') as f:
            f.write(model_to_save.config.to_json_string())
        # Load a trained model and config that you have fine-tuned
        config = BertConfig(output_config_file)
        model = BertForQuestionAnswering(config)
        model.load_state_dict(torch.load(output_model_file))
    else:
        model = BertForQuestionAnswering.from_pretrained(args.bert_model)

See: https://github.com/huggingface/pytorch-pretrained-BERT/blob/833774075447b5eaef92b9da92ee4ce2decf89fb/examples/run_squad.py#L1011

@andrelmfarias
Collaborator

andrelmfarias commented Feb 21, 2019

Some errors when re-training with the saved model 'models/bert_qa_squad_v1.1' using the run_squad.py script:

02/21/2019 17:14:43 - ERROR - pytorch_pretrained_bert.tokenization -   Model name './output_bert/squad_1.1_train' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed './output_bert/squad_1.1_train/vocab.txt' was a path or url but couldn't find any file associated to this path or url.
02/21/2019 17:14:44 - INFO - pytorch_pretrained_bert.modeling -   loading archive file ./output_bert/squad_1.1_train
02/21/2019 17:14:44 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Traceback (most recent call last):
  File "run_squad.py", line 945, in main
    with open(cached_train_features_file, "rb") as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'squad_data/dev_mod.json_squad_1.1_train_384_128_64'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_squad.py", line 1077, in <module>
    main()
  File "run_squad.py", line 954, in main
    is_training=True)
  File "run_squad.py", line 211, in convert_examples_to_features
    query_tokens = tokenizer.tokenize(example.question_text)
AttributeError: 'NoneType' object has no attribute 'tokenize'

It seems to be an error related to the absence of a tokenizer vocabulary alongside the saved model.

I solved the problem by running my own script run_squad_fine-tunned.py, which will be committed to the repo.

The usage of run_squad_fine-tunned.py to retrain a saved model is as below:

python run_squad_fine-tunned.py \
--bert_model bert-base-uncased \
--do_retrain \
--do_predict \
--do_lower_case \
--train_file <path-to-train-file> \
--predict_file squad_data/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir output_bert/squad_1.1_dev \
--fine_tunned_weights <path-to-model.bin-file>
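
For reference, a hedged guess at what the --fine_tunned_weights option might do inside the modified script (the actual implementation lives in run_squad_fine-tunned.py; the paths below are illustrative):

# Sketch: build the tokenizer and model from bert-base-uncased, then overwrite
# the weights with the previously fine-tuned checkpoint before continuing
# training on the dev set. Paths are illustrative, not the actual script's.
import torch
from pytorch_pretrained_bert import BertForQuestionAnswering, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
state_dict = torch.load("models/bert_qa_squad_v1.1/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
# ...then run the usual SQuAD training loop on the dev set...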

@fmikaelian
Collaborator Author

@andrelmfarias

The error traceback for FileNotFoundError is weird: has the squad_data path been declared somewhere? What command did you use to get this error?

For the tokenizer error, did you look at the BertTokenizer class in the tokenization.py script? In its from_pretrained method, the tokenizer loads a vocab_file located via PRETRAINED_VOCAB_ARCHIVE_MAP.

I'd like to see what you changed in the script. Changing the script is a strategy that we should debate.

Maybe we just need to drop the vocab file of the bert-base-uncased tokenizer under the models/bert_qa_squad_v1.1 folder?
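
A minimal sketch of that alternative: fetch the bert-base-uncased vocabulary and place it next to the saved weights so BertTokenizer.from_pretrained() finds vocab.txt in the model folder (the URL is the one listed in PRETRAINED_VOCAB_ARCHIVE_MAP):

import urllib.request

vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
urllib.request.urlretrieve(vocab_url, "models/bert_qa_squad_v1.1/vocab.txt")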

@andrelmfarias
Collaborator

@fmikaelian

The command I ran was the following:

python run_squad.py \
  --bert_model models/bert_qa_squad_v1.1 \
  --do_train \
  --fp16 \
  --do_lower_case \
  --train_file samples/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir logs

Yes, I looked at the BertTokenizer class. I understand the argument fed to BertTokenizer is either the path to the saved model or the name of one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP (args.bert_model at line 883 in run_squad.py).
If args.bert_model is not one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP, the variable vocab_file in the function from_pretrained (line 126 in BertTokenizer) is going to be the path we fed to run_squad.py, in our case models/bert_qa_squad_v1.1.

If there is no vocab.txt file (variable VOCAB_NAME in tokenization.py) in the folder where we saved the model, we are going to get an error. Yes, another solution would be to drop the vocab file there, but we would be constrained to do this every time we need to retrain. Yes, we can automate this with another script, but in the end we are going to have to either use more scripts or change the one we have (as I decided to do).
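
A small sketch of that behaviour, assuming the loader returns None when vocab.txt is missing, which matches the ERROR line and the AttributeError in the traceback above:

from pytorch_pretrained_bert import BertTokenizer

# Name in PRETRAINED_VOCAB_ARCHIVE_MAP: the vocabulary is fetched from S3.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# Local path: the loader looks for models/bert_qa_squad_v1.1/vocab.txt; if the
# file is missing it logs the ERROR above and returns None, which later fails
# with "'NoneType' object has no attribute 'tokenize'".
tokenizer = BertTokenizer.from_pretrained("models/bert_qa_squad_v1.1", do_lower_case=True)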

I am committing the new script in a new branch so that we can take a look at it. There are no major changes and it is easy to understand.

@fmikaelian
Collaborator Author

@andrelmfarias You released the model bert_qa_squad_v1.1_dev fine-tuned on SQuAD dev, but do you think we will use this model?

@andrelmfarias
Collaborator

andrelmfarias commented Mar 8, 2019

@fmikaelian I understood we would use this model to do a 2nd fine-tuning on the BNP dataset, as after this fine-tuning on the dev set the model would be able to generalise more (having seen more samples). I imagine, however, that the performance might not increase that much in comparison to the model trained on SQuAD train.

We can discuss it.

@fmikaelian
Collaborator Author

I've just released the model trained with the sklearn wrapper: https://github.com/fmikaelian/cdQA/releases/tag/bert_qa_squad_v1.1_sklearn

This means you can load it and predict directly. You might need to reset some parameters like model.device manually (I think issue #68 was still present when I trained it).
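
A rough sketch of loading it, assuming the release is a joblib pickle of the sklearn-style estimator; the file name and attribute names are assumptions based on this comment, not the actual API:

import torch
import joblib

# Assumed file name for the released artifact.
reader = joblib.load("bert_qa_squad_v1.1_sklearn.joblib")
# Reset device-related parameters manually, as mentioned above
# (attribute name is an assumption).
reader.device = torch.device("cpu")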

@fmikaelian
Collaborator Author

I could make predictions but didn't evaluate them yet (see #70).

@fmikaelian fmikaelian unpinned this issue Mar 11, 2019