@joelgrus joelgrus released this Mar 29, 2019 · 28 commits to master since this release

Highlighted Changes

  • This release includes code for working with the DROP dataset, including the official evaluation script, a DatasetReader, and the NAQANet model. (#2559, #2556 and #2560)
  • We added a no-op trainer that allows you to create AllenNLP model archives for programmatic baselines, alternatively trained models, etc. (#2610)

Breaking Changes

  1. In #2607 we changed the (default) SpacyWordSplitter to return allennlp Tokens (which are compact, efficient NamedTuples) rather than spacy.tokens.Tokens. This was done primarily to decrease memory usage for large datasets; secondarily to play nicer with the multiprocess dataset reader.

This is a breaking change in two ways, neither of which should affect most users:

  • in theory everyone should be programming to the Token abstraction that's shared between both implementations, but it's possible that someone could be relying on having the actual spacy token, in which case they would need to configure their word splitter with keep_spacy_tokens=True.

  • a NamedTuple can't have different constructor parameters and field names. Our previous Token implementation used e.g. pos as the name of the constructor argument but then pos_ as the name of the field. Converting this to a namedtuple meant that the constructor argument now also has to be pos_. If you were for some reason generating your own tokens manually (which the wikitables dataset reader was doing) you would need to make the corresponding changes to that code; if you were only creating Tokens using our Tokenizers, then there's no difference to you.

It's quite likely that neither of these changes will affect even a single user, but in theory they could.
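
To make the second point concrete, here is a minimal sketch using a simplified stand-in for the token class. SketchToken and its two fields are assumptions made for illustration; the real allennlp Token has more fields. Because a NamedTuple derives its constructor from its field names, the keyword has to be pos_ rather than pos:

```python
from typing import NamedTuple, Optional


class SketchToken(NamedTuple):
    """Simplified, hypothetical stand-in for a NamedTuple-based token."""
    text: Optional[str] = None
    pos_: Optional[str] = None


# The constructor keyword matches the field name exactly.
token = SketchToken(text="dog", pos_="NOUN")

# The old-style keyword (pos, without the underscore) is rejected.
try:
    SketchToken(text="dog", pos="NOUN")
    old_keyword_accepted = True
except TypeError:
    old_keyword_accepted = False
```

If you only ever obtain tokens from a tokenizer, none of this is visible to you.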

List of Commits

@schmmd schmmd released this Feb 19, 2019 · 88 commits to master since this release

List of Commits

@schmmd schmmd released this Jan 8, 2019 · 183 commits to master since this release

Highlighted changes

  • We now include a script to easily update pre-existing model archives (#2223).
  • We fixed a bug that caused problems when using AllenNLP from within IPython (#2257).

List of Commits

@schmmd schmmd released this Dec 20, 2018 · 213 commits to master since this release

Major New Features

  • PyTorch 1.0 support

List of Commits

Dec 19, 2018


bump version number to v0.8.0

@schmmd schmmd released this Dec 3, 2018 · 245 commits to master since this release

Major New Features

  • You can now run allennlp configure to launch a GUI tool that helps you build a model configuration.
  • You can now use a BERT embedder within your model.

List of Commits

@schmmd schmmd released this Nov 12, 2018 · 300 commits to master since this release

This is a minor release.

List of Commits

@joelgrus joelgrus released this Oct 5, 2018

Major new features

  • A new framework for training state-machine-based models, and several examples of using this for semantic parsing. This still has a few rough edges, but we've successfully used it for enough models now that we're comfortable releasing it.
  • A model for neural open information extraction
  • A re-implementation of a graph-based semantic dependency parser.
  • A MultiProcessDataIterator for loading data on multiple threads on the CPU (we haven't actually used this much, though - if you have trouble with it, let us know).

Breaking Changes

  1. Previously if you were working on a GPU, you would specify a cuda_device at the time you called instance.as_tensor_dict(), and the tensors would be generated on the GPU initially. As we started to develop code for generating instances in parallel across multiple processes, we became concerned that over-generation of instances could potentially exhaust the GPU memory.

Accordingly, now instance.as_tensor_dict() (and all the field.as_tensor operations that underlie it) always return tensors on the CPU, and then the Trainer (or the evaluation loop, or whoever) moves them to the GPU right before sending them to the model.

Most likely this won't affect you (other than making your training loop a tiny bit slower), but if you've been creating your own custom Fields or Iterators, they'll require small changes as in #1731.
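
The "build on the CPU, move right before the model" pattern can be sketched roughly as follows. This is an illustration only: move_to_device and FakeTensor are hypothetical names introduced here (the stand-in lets the sketch run without PyTorch), and the actual changes are the ones in #1731.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; it only tracks which device it lives on."""
    device: str = "cpu"

    def to(self, device: str) -> "FakeTensor":
        return FakeTensor(device=device)


def move_to_device(obj: Any, device: str) -> Any:
    """Recursively move every tensor-like object in a nested structure to `device`."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(item, device) for item in obj)
    return obj  # leave non-tensor metadata untouched


# Fields always build their tensors on the CPU...
batch = {"tokens": {"tokens": FakeTensor()}, "label": FakeTensor(), "metadata": ["raw text"]}
# ...and the training loop moves the whole batch just before calling the model.
gpu_batch = move_to_device(batch, "cuda:0")
```

The original batch is left on the CPU, so instances generated ahead of time never occupy GPU memory until they are actually needed.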

List of commits

@schmmd schmmd released this Aug 31, 2018 · 435 commits to master since this release

This release includes a new dependency parser model, a QUAC model, and a new NLI model, as well as many bugfixes and small improvements.

@joelgrus joelgrus released this Aug 15, 2018 · 487 commits to master since this release

AllenNLP v0.6.0 has been upgraded to use PyTorch 0.4.1. Accordingly, it should now run on Python 3.7.

It contains a handful of breaking changes, most of which probably won't affect you.

Breaking changes:

1. HOCON -> Jsonnet for Configuration files

Although our experiment configurations look like JSON, they were technically HOCON (which is a superset of JSON). In this release we changed the format to Jsonnet, which is a different superset of JSON.

If your configuration files are "JSON with comments", this change should not affect you. Your configuration files are valid Jsonnet and will work fine as is. We believe this describes 99+% of people using allennlp.

If you are using advanced features of HOCON, then these changes will be breaking for you. Probably the two most common issues will be unquoted strings and environment variables:

unquoted strings

JSON requires strings to be quoted. HOCON doesn't. Jsonnet does. So in the off chance that you have not been putting your strings in quotes, you'll need to start putting them in quotes.

environment variables

HOCON allows you to substitute in environment variables, like

    "root_directory": ${HOME}

Jsonnet only allows substitution of explicit variables, using a syntax like

    "root_directory": std.extVar("HOME")

Strictly speaking, these are external variables fed to the Jsonnet parser, not environment variables. However, the allennlp code reads all the environment variables and feeds them to the parser.

the elimination of ConfigTree

(you probably don't care about this)

Previously the AllenNLP Params object was a wrapper around a pyhocon ConfigTree, which is basically a fancy dict. After this change, Params.params is just a plain dict instead of a ConfigTree, so if you have code that relies on it being a ConfigTree, that code will break. This is very unlikely to affect you.

why did we make this change?

There is a bug in the Python HOCON parser that incorrectly handles backslashes in strings. This created issues involving initializer regexes being serialized and deserialized incorrectly. Once we determined that the bug was not simple enough for us to easily fix, we chose this as the next best solution.

(in addition, jsonnet has some nice features involving templates that you might find useful in your experiments)

2. Change to the Predictor API

The API for the _json_to_instance method of the Predictor used to be (json: JsonDict) -> Tuple[Instance, JsonDict], where the returned JsonDict contained information from the input which you wanted to be returned in the predictor. This is now not allowed, and the _json_to_instance method returns only an Instance, meaning any additional information must be routed through your model via the use of MetadataFields. This change was to make Predictors agnostic of where Instances they process come from, allowing us to generate predictions from an original dataset using a DatasetReader to generate instances.

This means you can now do:
allennlp predict /path/to/original/dataset --use-dataset-reader, rather than having to format your data as .jsonl files.

3. Automatic implementation of from_params

It used to be the case that if you implemented your own Model or DatasetReader or whatever, you were required to implement a from_params classmethod that unpacked a Params object and called the constructor with the relevant values. In most cases this method was just boilerplate that didn't do anything interesting -- it popped off strings as strings and ints as ints and so on. And it opened you up to a class of subtle bugs if your from_params popped parameters with a different default value than the constructor used.

In the latest version, any class that inherits from FromParams (which automatically includes all Registrable classes) gets for free a from_params method that does the "right thing". If you need complex logic to instantiate your class from a JSON config, you'll still have to write your own method, but in most cases you won't need to.

There are some from_params methods that take additional parameters; for example, every Model constructor requires a Vocabulary, which will need to be supplied by its from_params method. To support this, the automatic from_params allows extra keyword-only arguments. That is, if you are calling the from_params method yourself (which you probably aren't), you have to do

YourModel.from_params(params, vocab=vocab)

If you try to supply the extra arguments positionally (which you could when all of the from_params were defined explicitly), you will get an error. This is the "breaking" component of the change.
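
To give a feel for the idea, here is a minimal sketch of how an automatic from_params could be built by inspecting the constructor. It is not AllenNLP's actual implementation: FromParamsSketch and TinyModel are hypothetical names, params is a plain dict rather than a Params object, and the vocabulary is a plain dict standing in for a Vocabulary.

```python
import inspect
from typing import Any, Dict


class FromParamsSketch:
    """Hypothetical mixin: builds constructor arguments by inspecting __init__."""

    @classmethod
    def from_params(cls, params: Dict[str, Any], **extras: Any):
        kwargs = {}
        signature = inspect.signature(cls.__init__)
        for name, parameter in signature.parameters.items():
            if name == "self":
                continue
            if name in extras:
                # Extras such as a vocabulary come from the caller, not the config.
                kwargs[name] = extras[name]
            elif name in params:
                kwargs[name] = params.pop(name)
            elif parameter.default is inspect.Parameter.empty:
                raise ValueError(f"missing required parameter: {name}")
            # Otherwise the constructor's own default applies,
            # so the config default and the constructor default cannot drift apart.
        if params:
            raise ValueError(f"unexpected parameters: {sorted(params)}")
        return cls(**kwargs)


class TinyModel(FromParamsSketch):
    def __init__(self, vocab: Dict[str, int], hidden_dim: int = 10, dropout: float = 0.0) -> None:
        self.vocab = vocab
        self.hidden_dim = hidden_dim
        self.dropout = dropout


vocab = {"the": 0, "dog": 1}
model = TinyModel.from_params({"hidden_dim": 32}, vocab=vocab)

# Supplying the extra argument positionally fails, because extras are keyword-only here.
try:
    TinyModel.from_params({"hidden_dim": 32}, vocab)
    positional_allowed = True
except TypeError:
    positional_allowed = False
```

In this sketch dropout is never mentioned in the config, so it simply takes the constructor's default, which is the class of bug the hand-written methods made possible.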

4. Changes to TokenIndexers

Previously the interface for TokenIndexer was

TokenIndexer.token_to_indices(self, token: Token, vocabulary: Vocabulary) -> TokenType:

This assumption, (one token) -> (one or more indices), turned out not to be general enough. There are cases where you want to generate indices that depend on multiple tokens, and cases where you want to generate multiple sets of (related) indices from one input text. Accordingly, we changed the API to

TokenIndexer.tokens_to_indices(self, tokens: List[Token], vocabulary: Vocabulary, index_name: str) -> Dict[str, List[TokenType]]:

This is some real library-innards stuff, and it is unlikely to affect you or your code unless you have been writing your own TokenIndexer subclasses or Field subclasses (which most users have not). If this does describe you, look at the changes to TextField to see how to update your code.
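
As a rough picture of what the new shape allows, here is a self-contained sketch of an indexer that sees the whole token list and returns two related index lists. Everything in it is a stand-in for illustration: SketchIndexer is hypothetical, tokens are plain strings rather than Tokens, the vocabulary is a plain dict, and the "<unk>" entry and the "-capitalized" key suffix are assumptions, not AllenNLP conventions.

```python
from typing import Dict, List

# Hypothetical stand-ins: tokens are plain strings and the vocabulary is a dict.
Vocab = Dict[str, int]


class SketchIndexer:
    """Illustrative indexer that follows the shape of the new interface."""

    def tokens_to_indices(self, tokens: List[str], vocabulary: Vocab, index_name: str) -> Dict[str, List[int]]:
        unknown = vocabulary["<unk>"]
        ids = [vocabulary.get(token.lower(), unknown) for token in tokens]
        # With the whole sequence available, one call can return
        # several related index lists under different keys.
        is_capitalized = [int(token[:1].isupper()) for token in tokens]
        return {index_name: ids, f"{index_name}-capitalized": is_capitalized}


vocabulary = {"<unk>": 0, "the": 1, "dog": 2}
indexed = SketchIndexer().tokens_to_indices(["The", "dog", "barks"], vocabulary, "tokens")
```

Under the old one-token-at-a-time signature, the second list would have needed a separate indexer and a separate pass over the text.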

other changes:

