💫 Add experimental ULMFit/BERT/Elmo-like pretraining #2931

Merged 4 commits into develop Nov 15, 2018



honnibal commented Nov 15, 2018

Add support for a new command, spacy pretrain:

usage: spacy pretrain [-h] [-cw 128] [-cd 4] [-er 1000] [-d 0.2] [-i 1] [-s 0]
                      texts_loc vectors_model output_dir

    Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
    using an approximate language-modelling objective. Specifically, we load
    pre-trained vectors, and train a component like a CNN, BiLSTM, etc. to predict
    vectors which match the pre-trained ones. The weights are saved to a directory
    after each epoch. You can then pass a path to one of these pre-trained weights
    files to the 'spacy train' command.

    This technique may be especially helpful if you have little labelled data.
    However, it's still quite experimental, so your mileage may vary.

    To load the weights back in during 'spacy train', you need to ensure
    all settings are the same between pretraining and training. The API and
    errors around this need some improvement.

positional arguments:
  texts_loc             Path to jsonl file with texts to learn from
  vectors_model         Name or path to vectors model to learn from
  output_dir            Directory to write models each epoch

optional arguments:
  -h, --help            show this help message and exit
  -cw 128, --width 128  Width of CNN layers
  -cd 4, --depth 4      Depth of CNN layers
  -er 1000, --embed-rows 1000
                        Embedding rows
  -d 0.2, --dropout 0.2
  -i 1, --nr-iter 1     Number of iterations to pretrain
  -s 0, --seed 0        Seed for random number generators

The pretrain command uses a novel trick to support pre-training with our small models. Previous work has mostly used language-modelling objectives over large vocabularies. Meeting that objective requires a sufficiently large output layer, which means the hidden layers have to be quite large too. This doesn't work well for our small, fast CNN.

My solution is to instead load in a pre-trained vectors file, and use the vector-space as the objective. This means we only need to predict a 300d vector for each word, instead of trying to softmax over 10,000 IDs or whatever. It also means the vocabulary we can learn is very large, which is quite satisfying.
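In rough terms, the objective looks like the following sketch (illustrative only, not spaCy's actual implementation; the array names are made up). The model emits one vector per token, and the loss is the squared distance to that token's pre-trained vector, so each token is a 300d regression rather than a softmax over the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, vector_dim, n_tokens = 10_000, 300, 8

# Stand-in for a static vectors table loaded from disk.
pretrained_vectors = rng.normal(size=(vocab_size, vector_dim)).astype("float32")

# Stand-in for the tok2vec model's per-token output (CNN, BiLSTM, etc.).
token_ids = rng.integers(0, vocab_size, size=n_tokens)
predicted = rng.normal(size=(n_tokens, vector_dim)).astype("float32")

# L2 loss against the pre-trained vectors: a 300d regression per token,
# instead of a softmax over thousands of word IDs.
targets = pretrained_vectors[token_ids]
diff = predicted - targets
loss = float((diff ** 2).sum() / n_tokens)
d_predicted = 2 * diff / n_tokens  # gradient to backprop into the tok2vec layer
```

Because the output layer is only `vector_dim` wide, the hidden layers can stay small, and any word with a row in the vectors table contributes to the objective.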

To make it easier to use the pre-trained weights, spacy train now supports a new CLI argument, -t2v, which takes a path to a pre-trained weights file. It's the user's responsibility to make sure the settings match up across the two commands, which is a bit fiddly at the moment if you've used non-default depth or width etc.
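One way to keep the two commands in sync (purely illustrative; this bookkeeping is not part of the spaCy CLI) is to write the hyper-parameters next to the weights during pretraining, then check them before passing the file to `-t2v`:

```python
import json
import tempfile
from pathlib import Path

# During pretraining: save the settings alongside the weights files.
pretrain_cfg = {"width": 128, "depth": 4, "embed_rows": 1000}
out_dir = Path(tempfile.mkdtemp())
(out_dir / "pretrain_config.json").write_text(json.dumps(pretrain_cfg))

# Later, before 'spacy train -t2v ...': assert the settings still match.
train_cfg = {"width": 128, "depth": 4, "embed_rows": 1000}
saved = json.loads((out_dir / "pretrain_config.json").read_text())
mismatched = {k for k in train_cfg if saved.get(k) != train_cfg[k]}
assert not mismatched, f"settings differ between pretrain and train: {mismatched}"
```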

I thought of this trick a long time ago, and had it implemented in a half-finished state in the Tensorizer component. I always assumed it wouldn't work, as it feels too much like compression, and not enough like prediction. The strong performance of the BERT model made me take another look. The BERT model's objective isn't much different from simply using dropout, which we can easily apply.

In preliminary tests, I've already achieved pretty strong improvements for text classification over small training sizes. Training on 1000 documents from the IMDB corpus, with pre-training I'm able to reach 87% accuracy, which is roughly what Jeremy and Sebastian report in their ULMFit paper (Figure 3). Without pre-training, the best I could get to was 85%.

Interestingly, the technique seems to work better if the vectors are also used as part of the input. I find this completely surprising -- I expected the opposite. I don't know what's going on with this, but I've currently hard-coded that the vectors should be used as features.

The obvious thing to try is running something like ULMFit or BERT as the target for the CNN to learn from, rather than just using static vectors. I expect that should work better.


  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal honnibal changed the base branch from master to develop Nov 15, 2018

@ines ines changed the title Add experimental ULMFit/BERT/Elmo-like pretraining 💫 Add experimental ULMFit/BERT/Elmo-like pretraining Nov 15, 2018

honnibal added some commits Nov 15, 2018

@honnibal honnibal merged commit 8fdb9bc into develop Nov 15, 2018

4 checks passed

continuous-integration/appveyor/branch AppVeyor build succeeded
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

honnibal commented Nov 27, 2018

Quick update in case people revisit this:

  • Better hyper-parameter search pushed the baseline for my small textcat experiment above 86%, while the pre-training still only took it to around 87%. So, the gains here weren't so impressive after all. Tests on NER were also largely negative. You can find a stream-of-consciousness log of the experiments here:

  • The problem is that the objective of simply predicting the word's vector is indeed too "compressionish". We need to do something that looks more like prediction. The BERT 'masked language model' objective seems to be working very well. I'll push an update to the pretrain script with this objective soon.
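The corruption step of a masked-LM-style objective can be sketched like this (an assumption-laden illustration, not the pretrain script itself; `mask_id` and the 15%/10% rates are made up for the example). The model would see the corrupted ids, while the loss still targets the original token's vector, so it has to predict rather than merely compress:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_tokens = 10_000, 1_000
mask_id = 0  # hypothetical id reserved for a mask token

token_ids = rng.integers(1, vocab_size, size=n_tokens)
corrupted = token_ids.copy()

# Mask ~15% of tokens, and swap another ~10% for random words.
drop = rng.random(n_tokens) < 0.15
swap = (rng.random(n_tokens) < 0.10) & ~drop
corrupted[drop] = mask_id
corrupted[swap] = rng.integers(1, vocab_size, size=int(swap.sum()))

# The model is run over `corrupted`; the regression targets are still
# looked up with the original `token_ids`.
```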
