Skip to content

Parsing TAG with feature structures (tulipac)

Alexander Koller edited this page May 2, 2024 · 12 revisions

Alto can parse tree-adjoining grammars with feature structures (FTAG). It performs feature unification based on the parse chart and not on the enumerated derivation trees. As far as we know, it is the fastest FTAG parser that currently exists.

The simplest way to write an FTAG grammar for use with Alto is in a format called tulipac. The name comes from the tulipac tool, which compiles grammars in a human-readable format to the XML files required by the TuLiPA parser (which is slower than Alto). Below we describe the tulipac format. You can also directly use Alto to parse an IRTG grammar with a FeatureStructureAlgebra.

Tulipac grammar format

A tulipac grammar consists of the following types of elements, which may be interleaved in any order:

  • Declarations of elementary trees, beginning with the tree keyword
  • Declarations of tree families, beginning with the family keyword
  • Declarations of words and lemmas, beginning with the word and lemma keywords
  • Includes statements, beginning with the #include keyword.

Anything that follows two double slashes // up to the end of the line is considered a comment, as in Java or C++ programs.

All symbols in the grammar (e.g. tree or tree family names, words and lemmas, feature names and feature values, etc.) must start with an alphabetical ASCII character, e.g. A-Z and a-z. All further characters in the symbol can be A-Z, a-z, or 0-9. If you would like to include whitespace, special characters, or non-ASCII characters (e.g. Umlauts) in your symbol, of if you would like to start the symbol with a number, enclose it with single quotes ('like this') or double quotes ("like this"). For example, you can write [person='1'].

Trees

A tree declaration looks like this:

tree trans:
  S {
    NP![case=nom][]
    VP {
      V+
      NP![case=acc][]
    }
}

It starts with the tree keyword and the tree name ("trans"), followed by a colon. Then the tree structure is described. Trees consist of nodes; the example tree has five nodes, with syntactic categories s, np, vp, v, and np. Nodes can be of different types:

  • ! = substitution node
  • * = foot node
  • + = lexical anchor
  • all other nodes are standard nodes

You can add a null adjunction constraint to a standard node by adding @NA after the node name; this means that no auxiliary trees may be adjoined at this node.

tree foo:
  S {
    S @NA {
      a+
      S*
    }
  }

Each node may be followed by one or two feature structures of the form [ft1=val1, ft2=val2, ...]. The first feature structure is the top FS of the node; the second feature structure is the bottom FS. If a node is followed by only one FS, this is assumed to be the top FS, and the bottom FS is assumed to be empty. If a node is not annotated with any FSs, both FSs are assumed to be empty.

The values val1, val2, etc. in a feature structure may either be identifiers (such as plural) or variables (such as ?case).

Important: Alto assumes that the start symbol of your grammar (which must appear at the root of the derived tree) is S (uppercase). If you get unexpected parsing errors, check whether you have used lowercase s instead.

Tree Families

A tree family groups a set of trees together into a logical unit. This is so a lexicon entry can assign a word to a whole tree family at once, avoiding the need to specify all the trees for each word again and again. Tulipac assumes that all trees in the same family have lexical anchors of the same syntactic category.

Define a tree family by simply listing the trees in the family:

family vinf_tv: { vinf_tv, vinf_tv_aux }

If you do not assign an elementary tree to a family, a tree family with the same name of the elementary tree is automatically generated for it.

In the current version of tulipac, a tree can belong only to a single family. The Alto-Tulipac parser does not have this restriction.

Lexicon Entries

You can define word forms and assign them to lemmas and elementary trees using the lemma and word keywords, as follows:

lemma 'schnell': aux_adj [foo=bar] {
  word 'schnelle': [case=nom]
  word 'schnellen': [case=acc]
  word demoword
}

This declares a lemma "schnell" for the elementary tree aux_adj, and declares the words "schnelle" and "schnellen" as inflected word forms of this lemma. For all of these word forms, the top feature structure of the lexical anchor node is unified with [foo=bar]. This feature structure is further unified with [case=nom] for "schnelle" and with [case=acc] for "schnellen". Note that word forms and feature structures are separated by a colon. The "demoword" does not add a feature structure and therefore does not get a colon.

Note that if your lemmas or word forms contain umlauts or other non-ASCII characters, you must enclose them with (single or double) quotes because tulipac does not accept non-ASCII characters in identifiers.

If you wish to simply use the word form itself as the lemma, you can use the following abbreviated word declaration (at top level, i.e. not as part of a 'lemma' declaration):

word 'jagt': <vinf_tv>[tense=present]

This declares the word form "jagt" for the lemma "jagt". Note the <vinf_tv> in angled brackets: This declares lexicon entries for "jagt" for all elementary trees in the tree family vinf_tv. You can also use tree families in this way in lemma declarations.

Includes Statements

It is sometimes convenient to split your grammar into several files, e.g. one file for the elementary trees and one file for the lexicon. You can use the #include keyword to combine these files:

#include "lexicon.tag"

Tulipac will treat this as if you had copied the contents of "lexicon.tag" into your grammar file.

FTAG parsing with Alto

To run Alto and parse sentences with a tulipac grammar, start it as follows:

java -cp <alto.jar> de.up.ling.irtg.script.TulipacParser <grammarname>

where <alto.jar> is the filename of your Alto jarfile, and <grammarname> is the filename of your tulipac grammar. See below for a concrete example, and observe that the angle brackets are not part of the actual command.

Note that there are two example grammars, chasing.tag and english.tag, in the examples subdirectory of the Alto source code.

You should see output that looks like this:

$ java -cp alto-2.3.1-all.jar de.up.ling.irtg.script.TulipacParser chasing.tag
Alto tulipac-style TAG parser, v1.0
Type a sentence to parse it, or type 'help' for help.

Reading grammar from examples/chasing.tag ...
Done, read grammar in 174 ms

parse> 

Parsing a sentence

You can now type a sentence and hit return to have it parsed by Alto. If Alto found a parse, it will open a window to display the parsing results. For instance, for the German sentence "der hund jagt den schnellen hasen" ("the-NOM dog-NOM chases the-ACC quick-ACC rabbit-ACC") with the chasing.tag grammar, Alto will show the following result (click on the image to magnify):

Alto screenshot

The leftmost pane displays the derivation tree. Most nodes are labeled with the name of an elementary tree, together with the lexical anchor and possibly information about feature values. The *NOP* nodes are needed because Alto parses an IRTG grammar below the hood, and the TAG-to-IRTG conversion introduces the *NOP*s. To read the IRTG derivation tree as a TAG derivation tree, simply ignore all the *NOP* nodes.

The middle pane shows the derived tree at the top. Finally, the rightmost pane shows the feature structures at some of the nodes (n1, n3, ...) of the derivation tree.

Note that Alto grammars are case-sensitive. If your grammar defines the word "john", but not the word "John", Alto will not find a parse for the sentence "John sleeps". You need to parse "john sleeps" instead.

Ungrammatical sentences

If the sentence cannot be parsed, Alto displays an error message that is as helpful as it can make it. For instance, when you parse "der hund jagt der schnelle hase" ("the-NOM dog-NOM chases the-NOM quick-NOM rabbit-NOM"), Alto will show:

parse> der hund jagt der schnelle hase
computed chart: 53 ms
filtered chart for feature structures: 23 ms
Found parses, but they all violate the constraints from the feature structures.

Example of a derivation tree that did not unify:

trans-jagt__
|---np_n-hund_case__acc_
|------det-der_case__nom_
|---------*NOP*_Det_A
|------*NOP*_N_A
|------*NOP*_NP_A
|---*NOP*_V_A
|---np_n-hase_case__nom_
|------det-der_case__nom_
|---------*NOP*_Det_A
|------aux_adj-schnelle_case__nom_
|---------*NOP*_Adj_A
|---------*NOP*_N_A
|------*NOP*_NP_A
|---*NOP*_VP_A
|---*NOP*_S_A

Unification failed at this subtree:

np_n-hund_case__acc_
|---det-der_case__nom_
|------*NOP*_Det_A
|---*NOP*_N_A
|---*NOP*_NP_A


Feature structures for children:
(1) [n1b: #1 [case: nom], n1t: #1, root: #1]
(2) [foot: #1, root: #1]
(3) [foot: #1, root: #1]

Alto first prints a derivation tree that would be grammatical according to the grammar, but had some unifications fail (if such a derivation tree exists). It then points to the node in this derivation tree at which the unification failed. In the example, Alto tried to substitute a determiner "der" in nominative case into the elementary tree for the noun "hund" in accusative case. The feature structure (1) of the "der" subtree has a case: nom entry which did not unify with the requirement of the elementary tree that the determiner should have case: acc.