Grammar as a Foreign Language
TLDR; The authors apply a 3-layer seq2seq LSTM with 256 units per layer and an attention mechanism to the constituency parsing task and achieve a new state of the art. Attention made a huge difference on a small dataset (40k examples), but less so on a noisy large dataset (~11M examples).
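A minimal sketch of this kind of setup, written in PyTorch rather than the authors' original implementation. The 3 layers and 256 units follow the summary above; the vocabulary sizes and the dot-product attention are illustrative assumptions, not the paper's exact model.

```python
# Minimal seq2seq-with-attention sketch for producing linearized parse trees.
import torch
import torch.nn as nn


class Seq2SeqParser(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=128, hidden=256, layers=3):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        # Decoder state is combined with the attention context before projecting
        # onto the output vocabulary (labels plus opening/closing brackets).
        self.out = nn.Linear(2 * hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        enc_out, state = self.encoder(self.src_embed(src_tokens))      # (B, S, H)
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), state)   # (B, T, H)
        # Dot-product attention over encoder states at every decoder step.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))           # (B, T, S)
        context = torch.bmm(scores.softmax(dim=-1), enc_out)           # (B, T, H)
        return self.out(torch.cat([dec_out, context], dim=-1))         # (B, T, V)


if __name__ == "__main__":
    model = Seq2SeqParser()
    src = torch.randint(0, 10000, (2, 12))   # token ids of two input sentences
    tgt = torch.randint(0, 128, (2, 30))     # shifted linearized-tree symbols
    print(model(src, tgt).shape)             # torch.Size([2, 30, 128])
```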

Data Sets and Model Performance

  • WSJ (40k examples): 90.5 F1
  • Large distantly supervised corpus (90k gold examples, ~11M noisy examples): 92.8 F1

Key Takeaways

  • The authors use existing parsers to label a large corpus and train on it; the resulting model then outperforms its "teacher" parsers (see the sketch after this list). A possible explanation is that the errors of the supervising parsers look like noise to the more powerful LSTM model. This result is extremely valuable in practice, since data is typically the limiting factor while existing (weaker) models are almost always available.
  • Attention mechanism can lead to huge improvements on small data sets.
  • All of the trained LSTM models were able to handle long sentences (~70 tokens) without a significant drop in performance.
  • Reversing the input sequence is common in seq2seq tasks. Here, however, it gave only a 0.2-point bump in accuracy.
  • Pre-trained word vectors bumped scores by only 0.4 points (92.9 -> 94.3).
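An illustrative sketch of the labeling pipeline behind the first takeaway: existing "teacher" parsers label raw sentences, and only labels they agree on are kept as (noisy) training examples. The parser functions here are hypothetical stand-ins for off-the-shelf constituency parsers, and the agreement filter is one plausible way to keep the noise down, not necessarily the authors' exact recipe.

```python
# Build a "silver" training corpus from unlabeled sentences using two teacher parsers.
from typing import Callable, Iterable

Parser = Callable[[str], str]  # sentence -> linearized parse tree


def build_silver_corpus(sentences: Iterable[str],
                        parser_a: Parser,
                        parser_b: Parser) -> list[tuple[str, str]]:
    corpus = []
    for sentence in sentences:
        tree_a = parser_a(sentence)
        tree_b = parser_b(sentence)
        if tree_a == tree_b:          # keep only high-confidence (agreed) parses
            corpus.append((sentence, tree_a))
    return corpus


if __name__ == "__main__":
    # Toy parser that tags every word as XX; two identical parsers always agree.
    toy = lambda s: f"(S {' '.join('XX' for _ in s.split())} )S"
    print(build_silver_corpus(["a toy sentence"], toy, toy))
```

The seq2seq model is then trained on the gold data plus this much larger silver corpus; the claim above is that it ends up outperforming the teachers that produced the labels.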

Notes/Questions

  • How much does the output data representation matter? The authors linearized the parse tree using a depth-first traversal and parentheses (see the sketch below). Are there more efficient representations that may lead to better results?
  • How much does the noise in the auto-labeled training data matter when compared to the data size? Are there systematic errors in the auto-labeled data that put a ceiling on model performance?
  • Would a bidirectional LSTM encoder help?
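A small sketch of the linearization referenced above: a depth-first traversal that emits the tree as a bracketed symbol sequence, with labeled closing brackets as in the paper's example output. The tree construction and tags here are illustrative; the paper additionally normalizes POS tags, which is omitted.

```python
# Linearize a constituency parse tree via depth-first traversal with brackets.
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def linearize(node: Node) -> list[str]:
    """Depth-first traversal producing the bracketed output symbol sequence."""
    if not node.children:             # preterminal/leaf: emit just the tag
        return [node.label]
    out = [f"({node.label}"]
    for child in node.children:
        out.extend(linearize(child))
    out.append(f"){node.label}")      # labeled closing bracket
    return out


if __name__ == "__main__":
    tree = Node("S", [Node("NP", [Node("XX")]), Node("VP", [Node("XX")]), Node(".")])
    print(" ".join(linearize(tree)))  # (S (NP XX )NP (VP XX )VP . )S
```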