Parsing TAG with feature structures (tulipac)
Alto can parse tree-adjoining grammars with feature structures (FTAG). It performs feature unification based on the parse chart and not on the enumerated derivation trees. As far as we know, it is the fastest FTAG parser that currently exists.
The simplest way to write an FTAG grammar for use with Alto is in a format called tulipac
. The name comes from the tulipac tool, which compiles grammars in a human-readable format to the XML files required by the TuLiPA parser (which is slower than Alto). Below we describe the tulipac format. You can also directly use Alto to parse an IRTG grammar with a FeatureStructureAlgebra
.
A tulipac grammar consists of the following types of elements, which may be interleaved in any order:
- Declarations of elementary trees, beginning with the
tree
keyword - Declarations of tree families, beginning with the
family
keyword - Declarations of words and lemmas, beginning with the
word
andlemma
keywords - Includes statements, beginning with the
#include
keyword.
Anything that follows two double slashes //
up to the end of the
line is considered a comment, as in Java or C++ programs.
All symbols in the grammar (e.g. tree or tree family names,
words and lemmas, feature names and feature values, etc.)
must start with an
alphabetical ASCII character, e.g. A-Z and a-z. All further characters
in the symbol can be A-Z, a-z, or 0-9. If you would like to
include whitespace, special characters, or non-ASCII characters
(e.g. Umlauts) in your symbol, of if you would like to start
the symbol with a number, enclose it with single quotes
('like this') or double quotes ("like this").
For example, you can write [person='1']
.
A tree declaration looks like this:
tree trans:
S {
NP![case=nom][]
VP {
V+
NP![case=acc][]
}
}
It starts with the tree
keyword and the tree name ("trans"), followed by a
colon. Then the tree structure is described. Trees consist of nodes;
the example tree has five nodes, with syntactic categories s, np, vp, v, and np.
Nodes can be of different types:
-
!
= substitution node -
*
= foot node -
+
= lexical anchor - all other nodes are standard nodes
You can add a null adjunction constraint to a standard node by adding @NA
after the node name; this means that no auxiliary trees may be adjoined at this node.
tree foo:
S {
S @NA {
a+
S*
}
}
Each node may be followed by one or two feature structures of the form
[ft1=val1, ft2=val2, ...]
. The first feature structure is the top FS
of the node; the second feature structure is the bottom FS. If a node
is followed by only one FS, this is assumed to be the top FS, and the
bottom FS is assumed to be empty. If a node is not annotated with any
FSs, both FSs are assumed to be empty.
The values val1, val2, etc. in a feature structure may either be identifiers (such as plural
)
or variables (such as ?case
).
Important: Alto assumes that the start symbol of your grammar (which must appear at the root of the derived tree) is S
(uppercase). If you get unexpected parsing errors, check whether you have used lowercase s
instead.
A tree family groups a set of trees together into a logical unit. This is so a lexicon entry can assign a word to a whole tree family at once, avoiding the need to specify all the trees for each word again and again. Tulipac assumes that all trees in the same family have lexical anchors of the same syntactic category.
Define a tree family by simply listing the trees in the family:
family vinf_tv: { vinf_tv, vinf_tv_aux }
If you do not assign an elementary tree to a family, a tree family with the same name of the elementary tree is automatically generated for it.
In the current version of tulipac, a tree can belong only to a single family. The Alto-Tulipac parser does not have this restriction.
You can define word forms and assign them to lemmas
and elementary trees using the lemma
and word
keywords, as
follows:
lemma 'schnell': aux_adj [foo=bar] {
word 'schnelle': [case=nom]
word 'schnellen': [case=acc]
word demoword
}
This declares a lemma "schnell" for the elementary tree aux_adj
, and
declares the words "schnelle" and "schnellen" as inflected word
forms of this lemma. For all of these word forms, the top feature
structure of the lexical anchor node is unified with [foo=bar]
. This
feature structure is further unified with [case=nom]
for "schnelle" and
with [case=acc]
for "schnellen". Note that word forms and feature structures
are separated by a colon. The "demoword" does not add a feature
structure and therefore does not get a colon.
Note that if your lemmas or word forms contain umlauts or other non-ASCII characters, you must enclose them with (single or double) quotes because tulipac does not accept non-ASCII characters in identifiers.
If you wish to simply use the word form itself as the lemma, you can use the following abbreviated word declaration (at top level, i.e. not as part of a 'lemma' declaration):
word 'jagt': <vinf_tv>[tense=present]
This declares the word form "jagt" for the lemma
"jagt". Note the <vinf_tv>
in angled brackets: This declares lexicon
entries for "jagt" for all elementary trees in the tree family
vinf_tv
. You can also use tree families in this way in lemma
declarations.
It is sometimes convenient to split your grammar
into several files, e.g. one file for the elementary trees and one
file for the lexicon. You can use the #include
keyword to combine
these files:
#include "lexicon.tag"
Tulipac will treat this as if you had copied the contents of "lexicon.tag" into your grammar file.
To run Alto and parse sentences with a tulipac grammar, start it as follows:
java -cp <alto.jar> de.up.ling.irtg.script.TulipacParser <grammarname>
where <alto.jar>
is the filename of your Alto jarfile, and <grammarname>
is the filename of your tulipac grammar. See below for a concrete example, and observe that the angle brackets are not part of the actual command.
Note that there are two example grammars, chasing.tag and english.tag, in the examples
subdirectory of the Alto source code.
You should see output that looks like this:
$ java -cp alto-2.3.1-all.jar de.up.ling.irtg.script.TulipacParser chasing.tag
Alto tulipac-style TAG parser, v1.0
Type a sentence to parse it, or type 'help' for help.
Reading grammar from examples/chasing.tag ...
Done, read grammar in 174 ms
parse>
You can now type a sentence and hit return to have it parsed by Alto. If Alto found a parse, it will open a window to display the parsing results. For instance, for the German sentence "der hund jagt den schnellen hasen" ("the-NOM dog-NOM chases the-ACC quick-ACC rabbit-ACC") with the chasing.tag
grammar, Alto will show the following result (click on the image to magnify):
The leftmost pane displays the derivation tree. Most nodes are labeled with the name of an elementary tree, together with the lexical anchor and possibly information about feature values. The *NOP*
nodes are needed because Alto parses an IRTG grammar below the hood, and the TAG-to-IRTG conversion introduces the *NOP*
s. To read the IRTG derivation tree as a TAG derivation tree, simply ignore all the *NOP*
nodes.
The middle pane shows the derived tree at the top. Finally, the rightmost pane shows the feature structures at some of the nodes (n1, n3, ...) of the derivation tree.
Note that Alto grammars are case-sensitive. If your grammar defines the word "john", but not the word "John", Alto will not find a parse for the sentence "John sleeps". You need to parse "john sleeps" instead.
If the sentence cannot be parsed, Alto displays an error message that is as helpful as it can make it. For instance, when you parse "der hund jagt der schnelle hase" ("the-NOM dog-NOM chases the-NOM quick-NOM rabbit-NOM"), Alto will show:
parse> der hund jagt der schnelle hase
computed chart: 53 ms
filtered chart for feature structures: 23 ms
Found parses, but they all violate the constraints from the feature structures.
Example of a derivation tree that did not unify:
trans-jagt__
|---np_n-hund_case__acc_
|------det-der_case__nom_
|---------*NOP*_Det_A
|------*NOP*_N_A
|------*NOP*_NP_A
|---*NOP*_V_A
|---np_n-hase_case__nom_
|------det-der_case__nom_
|---------*NOP*_Det_A
|------aux_adj-schnelle_case__nom_
|---------*NOP*_Adj_A
|---------*NOP*_N_A
|------*NOP*_NP_A
|---*NOP*_VP_A
|---*NOP*_S_A
Unification failed at this subtree:
np_n-hund_case__acc_
|---det-der_case__nom_
|------*NOP*_Det_A
|---*NOP*_N_A
|---*NOP*_NP_A
Feature structures for children:
(1) [n1b: #1 [case: nom], n1t: #1, root: #1]
(2) [foot: #1, root: #1]
(3) [foot: #1, root: #1]
Alto first prints a derivation tree that would be grammatical according to the grammar, but had some unifications fail (if such a derivation tree exists). It then points to the node in this derivation tree at which the unification failed. In the example, Alto tried to substitute a determiner "der" in nominative case into the elementary tree for the noun "hund" in accusative case. The feature structure (1) of the "der" subtree has a case: nom
entry which did not unify with the requirement of the elementary tree that the determiner should have case: acc
.