Montague is a little CCG semantic parsing library for Scala.
Scala Java

README.md

montague

montague is a little CCG semantic parsing library for Scala.

You can build on this code to translate English-into-SQL, English-into-API commands, etc. To do so, you need to build a lexicon for your specific application. We include a few sample lexicons, as described in the Getting Started section of this README.

The code currently implements boolean (non-probabilistic) CCG parsing, using a CKY-based parse search strategy.

An example syntactic parse tree An example semantic parse tree

Authors

Note that the repo history doesn't accurately reflect authorship, because much of the code was ported from the original UPSHOT repo.

Background

At UPSHOT (acquired by Workday), we built a semantic parser that translated English into SQL, and — later — English into SOQL (the Salesforce query language). This was packaged in a mobile application with the following architecture:

UPSHOT architecture

This package extracts and open-sources the core CCG-based semantic parser component of UPSHOT, in a form that is general-purpose and self-contained. We hope that that other people find it useful. We plan to clean up the SQL-generation code and release that too. If you have more requests, please email us.

Joseph Turian and Alex Nisnevich gave a talk at Strata 2016 introducing montague (video here).

Introduction

"Oh, get ahold of yourself. Nobody's proposing that we parse English." — Larry Wall in <199709032332.QAA21669@wall.org>

montague takes its name from Montague semantics, the idea that human language can be expressed through formal logic and lambda-calculus. The process of inferring this formal representation from natural language is called "semantic parsing". Specifically, montague implements Combinatory Categorial Grammar (CCG), a particular grammar formalism that has become popular recently for semantic parsing.

Lambda-calculus? Combinatory grammar? Huh??

Let's break it down. Here's an example of a definition in montague:

  ("plus" -> ((N\N)/N, λ {y: Int => λ {x: Int => x + y}}))

Here's what it means:

  • There's a term called "plus".
  • It has the syntactic category (N\N)/N. This means that it's something that attaches to a noun (N) after it to form a N\N, which is a thing that attaches to a noun before it to form another noun. In other words, "plus" must be between two nouns, and the end result of (Noun) plus (Noun) syntactically is just another noun. So far, so good!
  • It has the semantic definition λ {y: Int => λ {x: Int => x + y}}. In other words, it's a function that takes an integer and returns a function that another integer, adding the first integer to it. Uncurrying it (because in Montague semantics, all functions must be curried) simply yields λ {x: Int, y: Int => x + y}. Well, that's pretty straightforward.

Looking through the code of the examples below, you'll notice that not all definitions look quite like this. Some of them have multiple synonyms for one definition, or multiple definitions for a single term (in that case, we say that the term is ambiguous). Some of them don't operate on single terms at all, but on matchers (functions that try to find certain kinds of strings). Some of them don't have semantic definitions at all and only describe syntactic categories. But the general idea for all of these is the same.

The way you use montague is by defining your own lexicon of terms with syntactic and semantic definitions. And the semantic parser does the rest.

Getting Started

montague comes with a few simple examples demonstrating some applications of its semantic parsing features.

English-to-calculator arithmetic

(See ArithmeticParser.)

In this example, English is parsed into a semantic form, which is then realized as arithmetic operations.

> sbt "runMain example.ArithmeticParser (3 + 5) * 2"
Input: (3 + 5) * 2
Output: 16

Our current implementation treats all grammatical rule applications as either possible (true) or impossible (false). For this reason, the parser cannot currently discriminate between rules of different precedence:

> sbt "runMain example.ArithmeticParser 3 + 5 * 2"
Input: 3 + 5 * 2
Output: Ambiguous(13, 16)

Besides ambiguity arising from the inability to discriminate between different rule applications, we might want intentionally to encode ambiguity into our language. For example, the +/- operation is intentionally ambiguous, and has multiple valid semantic interpretations:

> sbt "runMain example.ArithmeticParser (3 +/- 5) * 2"
Input: (3 +/- 5) * 2
Output: Ambiguous(16, -4)

We ignore all unrecognized tokens by adding an Else clause in the lexicon, which matches all tokens that wouldn't match otherwise, and in this case produces semantically null parses:

> sbt "runMain example.ArithmeticParser Could you please tell me, what is 100 + 100 ?"
Input: Could you please tell me, what is 100 + 100 ?
Output: 200

English-to-syntactic structure

(See CcgBankParser.)

Using the CCGBank lexicon, we parse English sentences into syntax dependency trees.

If you don't have the CCGBank lexicon, you can download an older version of it from Julia Hockenmaier's site:

pushd data/ && \
wget http://juliahmr.cs.illinois.edu/CCGlexicon/lexicon.wsj02-21.gz && \
gunzip lexicon.wsj02-21.gz && \
popd

You can then parse sentences using the old CCGBank lexicon as follows:

> sbt "runMain example.OldCcgBankParser Thom and Alex and Joseph are writing a parser"
Input: Thom and Alex and Joseph are writing a parser
Output:
  are
    writing
      a
        parser
    and
      joseph
      and
        alex
        thom

If you do have the CCGBank lexicon and would like to use it, put CCGbank.00-24.lexicon into subdirectory data/, and invoke the parser as follows:

> sbt "runMain example.CcgBankParser Thom and Alex and Joseph are writing a parser"

English-to-information storage and retrieval

(See InformationStore.)

InformationStore uses SemanticRepl to implement a very basic information storage and retrieval system, by parsing statements into Define constructs and queries into Query constructs, then executing them accordingly.

Important: This example uses the older-style CCGBank lexicon - see the above example for download instructions.

An example interactive session with InformationStore:

> sbt "runMain example.InformationStore"
>> Joseph is a programmer
Ok
>> Joseph is pretty weird
Ok
>> Who is Joseph?
{a(programmer), pretty(weird)}
>> Who are Joseph and Ted?
I don't know
>> Joseph and Ted are Alex's bandmates
Ok
>> Who are Joseph and Ted?
alex's(bandmates)

Library overview

SemanticParser

SemanticParser is the main entry point into montague. To instantiate a SemanticParser, you need a syntactic scheme (CcgCat for our purposes) and a lexicon, stored in a ParserDict.

Once you've instantiated a SemanticParser, .parse(text, tokenizer) yields a SemanticParseResult, which you can unpack to find the parse tree and resulting semantic representation (if the parse succeeded). See SemanticParser.main for an example of how to extract results.

Lexicons

To build up a lexicon, you can create a new ParserDict() (or load syntactic entries from a CCGbank lexicon, if you have one, with ParserDict.fromCcgBankLexicon) and add entries to it with the + operator.

A lexicon entry looks like (matcher -> meaning), where

  • matcher can be
    1. a term (String),
    2. a Seq of terms,
    3. an instance of TokenMatcher[T] (a String => Seq[T] function), or
    4. Else, which matches any otherwise un-matched token; and
  • meaning can be
    1. a syntactic category,
    2. a (syntactic category, semantic representation) pair,
    3. a Seq of either of the above (in which case the meaning of the term is ambiguous), or
    4. (only if the matcher is a TokenMatcher or Else), a function of the matched object that produces of Seq of syntactic categories or a (syntactic category, semantic representation) pairs
SemanticRepl

SemanticRepl is a wrapper on top of SemanticParser that's useful for making REPLs that repeatedly read input, parse it into an "action", and pattern-match that action to perform some operation against an internal state. For an example of SemanticRepl at work, see the InformationStore example above.

Syntactic Categories

montague supports arbitrary syntactic schemes, but the only one built-in is CcgCat, representing CCG categories. Here are the categories available in the ccg package:

  • Terminal syntactic categories are ones that can appear at the top of the parse tree and cannot consume adjacent terms. Built-in terminals are S ("sentence"), N ("noun"), NP ("noun phrase"), and PP ("prepositional phrase"), but others are easy to add, depending on your application.
  • Non-terminal syntactic categories are ones that can (and must) consume adjacent terms. A parse cannot succeed if there is a non-terminal at the top of the parse tree. Types of non-terminal categories are:
    • Forward application: A/B consumes a B in front of it to become an A.
    • Backward application: A\B consumes a B behind it to become an A.
    • Bidirectional application: A|B consumes an adjacent Y to become an A.
    • Identity categories: X|X, X/X, and X\X are special cases of the above -- they can consume a term of any category to become that category.
    • Conj, the conjunction category, is a short-hand for (X\X)/X. (Exercise: Why is this called the "conjunction" category?)
  • Additionally, a category assigned to a term may have a probability attached to it: Cat % prob. For example, if the term apple has categories N % 0.9 and (NP/N) % 0.1, that means that it's 9 times more likely to be a noun than an adjective, and the parser will score potential parses accordingly. Probabilities default to 1.0 if unspecified.
  • A category may also have a label: Cat("label"). A/B("somelabel") can consume a B("somelabel") but not a regular B.

Semantic Representations

A SemanticState is generally one of two things (in each case, LF corresponds to the type of the objects we're dealing with -- in the examples below, LF = Int):

  • A form Form[LF] represents a semantic state that is "complete" (i.e. doesn't consume any arguments) -- for example, Form(4) represents the number four.
  • A lambda λ {x: LF => <semantic state>} represents a semantic state that must consume arguments -- for example, λ {x: Int => Form(x + 5)} represents the function that adds 5 to any integer. Implicit conversions allow us to write this more concisely to λ {x: Int => x + 5}. Multi-argument functions are represented by currying (e.g. λ {y: Int => λ {x: Int => x + y}}).

There are a few other possible SemanticStates that generally shouldn't be specified directly in the lexicon, but can appear in parse results:

  • Nonsense represents a parse with no valid semantic outputs.
  • Ambiguous(Set[SemanticState]) represents a parse with more than one valid semantic output.
  • Ignored represents a parse that ignored semantics entirely (i.e. you didn't specify semantic representations in the lexicon).

Exercises for the reader

(In rough order of difficulty.)

Parsing

  1. While all of our examples involve CCG parsing, montague supports alternative syntactic schemes. Create your own semantic scheme (that is, a type hierarchy that inherits from SyntacticLabel), and parse something with it.
  2. Composition. montague's CCG implementation currently supports only one of the three CCG combinators: the application combinator (X/Y Y -> X, X X\Y -> Y). Extend it to also support the composition combinator (X/Y Y/Z -> X/Z).
  3. Type-raising. As above, but for the type-raising combinator (X -> T/(T\X)).
  4. Probabilistic parsing. The parser currently supports boolean parsing: a parse of the input string is either possible (true) or impossible (false). (This corresponds to implementing the boolean semiring parser of Goodman, 1999 -- see Figure 5.) Extend the code so that it supports probabilistic or weighted parsing. (The existing probabilistic implementation of English parsing, based upon multiplying out lexicon weights, is a hack that including the token weights within the CCG category.)
  5. Speed improvements. The parser implements a CKY search strategy, which is bottom-up. If the parser had weight implemented, we could parse faster using agenda-based parsing: You use an agenda to order nodes by some priority. For example, the priority can be the cumulative probability of applying the rules (best-first parsing). Alternately, instead of agenda-based parsing, beam pruning could be used to reduce the size of the search space. In this case, only the top k weighted nodes are kept in any parse cell.
  6. † Fuzzy matching. What if a user enters a phrase that doesn't parse successfully, but adding or removing one word (or perhaps correcting a misspelling) would fix it? Create a "Did You Mean?" feature that implements this efficiently.

Applications

  1. Add more features to the ArithmeticParser example. For example, improve the tokenizer to correctly handle infix expressions without spaces (e.g. (1+2)*3), or add more operations.
  2. Add more features to the InformationStore example. For example, add other types of relations, or support more kinds of expressions.
  3. IFTTT. Parse English phrases like "When (this happens), (do this)" into semantic forms corresponding to IFTTT API calls. Try building a REPL on top of SemanticRepl for communicating with IFTTT via natural language.
  4. Slack bot. Similar to the above, but do something cool with a Slack bot instead.
  5. † Game semantics. Come up with a semantic scheme for representing rule descriptions for a simple card game (think Magic, Hearthstone, etc., but simplify!) For example, a card may say something like "Whenever your opponent loses life, draw a card". Then write a parser for it.
  6. † English to Freebase. Parse English phrases into Freebase queries. For example (borrowing an example from SEMPRE), "Which college did Obama go to?"(and (Type University) (Education BarackObama))"Occidental College, Columbia University". (Hint: You'll have to generate most of the lexicon programmatically using Freebase as well.)
  7. † English to SQL. Parse English questions (such as "How many customers in Europe made a purchase last month?") into SQL statements. Assume that you have all relevant table structure information available. (Hints: Generate an abstract structure for the full parse first, and then worry about generating SQL out of it. Much of the lexicon will have to be generated from table and column names, as well as entries for categorical columns. There will be a lot of ambiguous definitions.)

Related work

  • SEMPRE is a toolkit for training semantic parsers.
  • Cornell Semantic Parsing Framework is an open source research software package. It includes a semantic parsing algorithm, a flexible meaning representation language and learning algorithms.