Fine-tune BERT on SQuAD train then SQuAD dev #7
It would be a good thing to report for each model training:
Also, using MLflow Tracking might make it easier for us to track different models.
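A minimal sketch of what MLflow Tracking could look like here; the parameter and metric names (and the placeholder values) are illustrative, not from an actual run:

```python
import mlflow

with mlflow.start_run():
    # Log the training configuration for this model.
    mlflow.log_param('bert_model', 'bert-base-uncased')
    mlflow.log_param('train_file', 'train-v1.1.json')
    mlflow.log_param('num_train_epochs', 2)
    # Log the SQuAD evaluation metrics (illustrative placeholders).
    mlflow.log_metric('exact_match', 0.0)
    mlflow.log_metric('f1', 0.0)
```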
I have fetched one of the URLs for
It's just a weights file and a model config file:
I will adapt run_squad.py. Then, it seems:

```python
if args.do_train:
    # Save a trained model and the associated configuration
    ...
    # Load a trained model and config that you have fine-tuned
    ...
else:
    model = BertForQuestionAnswering.from_pretrained(args.bert_model)
```

The function
To predict with a model fine-tuned on SQuAD v1.1, we need to do:
With
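A minimal sketch of what that loading step could look like, assuming the release contains the standard `WEIGHTS_NAME` / `CONFIG_NAME` files (`pytorch_model.bin`, `bert_config.json`); the directory path is illustrative:

```python
import torch
from pytorch_pretrained_bert import BertConfig, BertForQuestionAnswering

# Rebuild the model from the released config and weights, then switch
# to eval mode for prediction.
config = BertConfig('models/bert_qa_squad_v1.1/bert_config.json')
model = BertForQuestionAnswering(config)
model.load_state_dict(torch.load('models/bert_qa_squad_v1.1/pytorch_model.bin',
                                 map_location='cpu'))
model.eval()
```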
@andrelmfarias Should I try this and let you know if it works?
@fmikaelian I think if you do not select the option `--do_train`, it is only going to predict on
Don't we need to change the script in order to do the fine-tuning on SQuAD dev?
@andrelmfarias Yes, you are right, the code snippet above is only for prediction. I can try to generate predictions on a sample
Can you try to do the second fine-tuning on SQuAD dev to validate the workflow? Don't forget to report your actions and commands ✍️.
Ideas for second fine-tuning:

```python
if args.do_train and args.bert_model != 'models/bert_qa_squad_v1.1':
    # Save a trained model and the associated configuration
    model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
    output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
    torch.save(model_to_save.state_dict(), output_model_file)
    output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
    with open(output_config_file, 'w') as f:
        f.write(model_to_save.config.to_json_string())

    # Load a trained model and config that you have fine-tuned
    config = BertConfig(output_config_file)
    model = BertForQuestionAnswering(config)
    model.load_state_dict(torch.load(output_model_file))
else:
    model = BertForQuestionAnswering.from_pretrained(args.bert_model)
```
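For the `else` branch above, a small sketch of what it relies on: to my understanding, `from_pretrained` in pytorch-pretrained-BERT also accepts a local directory, so the saved model path can be passed directly as `--bert_model`:

```python
from pytorch_pretrained_bert import BertForQuestionAnswering

# Load from a local directory holding pytorch_model.bin and bert_config.json
# (the path is the one used in the condition above).
model = BertForQuestionAnswering.from_pretrained('models/bert_qa_squad_v1.1')
```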
Some errors occurred when re-training with the saved model 'models/bert_qa_squad_v1.1' using the run_squad.py script:
It seems to be an error related to the non-existence of a tokenizer with the saved model. I solved the problem by running my own script, run_squad_fine-tunned.py, which will be committed to the repo. The usage of run_squad_fine-tunned.py for retraining a saved model is as below:
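A minimal sketch of the kind of workaround being described, assuming the fix is to load the tokenizer from the original pre-trained vocab while the fine-tuned weights come from the saved directory (names as in run_squad.py):

```python
from pytorch_pretrained_bert import BertTokenizer

# The saved model directory has no vocab.txt, so load the tokenizer from the
# original pre-trained checkpoint instead of from args.bert_model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
```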
The error traceback for
For the tokenizer error, did you look at the BertTokenizer class in the
I'd like to see what you changed in the script. Changing the script is a strategy that we should debate. Maybe we just need to drop the vocab file of the
The command I ran was the following:
Yes, I looked at the BertTokenizer class. I understand that the argument fed to BertTokenizer is either the path to the saved model or the name of one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP (args.bert_model at Line 883 in run_squad.py). If there is no
I am committing the new script in a new branch so that we can take a look at it. There are no major changes and it is easy to understand.
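A simplified paraphrase of that resolution logic as I understand it (this is a sketch, not the library's exact code; the URL is the one mapped to bert-base-uncased in PRETRAINED_VOCAB_ARCHIVE_MAP):

```python
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'bert-base-uncased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt',
    # ... other released models
}

def resolve_vocab(bert_model: str) -> str:
    """Map args.bert_model to a vocab location, as BertTokenizer.from_pretrained does."""
    if bert_model in PRETRAINED_VOCAB_ARCHIVE_MAP:
        return PRETRAINED_VOCAB_ARCHIVE_MAP[bert_model]  # known model name -> download URL
    # Otherwise it is treated as a local path, which must contain vocab.txt;
    # a saved model directory without one triggers the tokenizer error above.
    return bert_model
```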
@andrelmfarias You released model
@fmikaelian I understood we would use this model to do a 2nd fine-tuning on the BNP dataset, as after this fine-tuning on the dev set, the model would be able to generalise more (it has seen more samples). I imagine, however, that the performance might not increase that much in comparison to the model trained only on SQuAD train. We can discuss it.
I've just released the model trained with the sklearn wrapper: https://github.com/fmikaelian/cdQA/releases/tag/bert_qa_squad_v1.1_sklearn
This means you can load it and predict directly. You might need to reset some parameters like
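A hedged sketch of loading it, assuming the release ships a joblib-pickled sklearn-style estimator (the file name below is hypothetical; take the real one from the release assets):

```python
import joblib

# Hypothetical file name; use the asset name from the release page.
model = joblib.load('bert_qa_squad_v1.1_sklearn.joblib')

# sklearn-style parameter reset before predicting; the exact parameter
# names depend on the wrapper and are placeholders here.
model.set_params(do_lower_case=True)
```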
I could make predictions but didn't evaluate them yet (see #70).
Use the output weights of a first BERT fine-tuned on SQuAD train as the starting weights for a new BERT fine-tuned on SQuAD dev.
https://github.com/huggingface/pytorch-pretrained-BERT#squad
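A compact sketch of the proposed two-stage workflow (paths are illustrative; the training loops themselves, which run_squad.py handles, are elided):

```python
from pytorch_pretrained_bert import BertForQuestionAnswering

# Stage 1: fine-tune the pre-trained model on SQuAD train.
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
# ... train on train-v1.1.json, then save to models/bert_qa_squad_v1.1

# Stage 2: reload the stage-1 weights and fine-tune again on SQuAD dev.
model = BertForQuestionAnswering.from_pretrained('models/bert_qa_squad_v1.1')
# ... train on dev-v1.1.json, then save the final model
```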