Gabriel Weaver edited this page Jun 2, 2013 · 1 revision

NAME

xupath - A general-purpose querying syntax for structured texts, including but not limited to XML.

DESCRIPTION

A xupath consists of a sequence of steps (references to language constructs) that are delimited by one or more slashes. Each language construct is represented as a language name (from our XUTools Grammar Library) that may be qualified with a predicate. In this man page, we describe the xupath syntax as well as the grammar library in which language constructs are defined.

XUPath Syntax

Definition of xupath and step

As mentioned above, a xupath consists of a sequence of references to language constructs that are delimited by one or more slashes. We will call each of these 'references to a language construct' a step.

For example, the xupath /builtin:file/ios:interface consists of two steps that each reference their own language construct. The first step, /builtin:file, references a file. The second step, /ios:interface, references an interface as defined in Cisco IOS. The order of the steps means that matches to this xupath query consist of IOS interfaces contained within files.
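To make the notion of a step concrete, the following minimal sketch splits a predicate-free xupath into its steps. The function name split_steps is hypothetical and is not part of XUTools; the sketch treats any run of slashes as a single delimiter, as described above, and does not handle predicates whose regular expressions contain slashes.

```python
import re


def split_steps(xupath):
    """Split a predicate-free xupath into its steps (hypothetical helper)."""
    # Steps are delimited by one or more slashes; the leading slash
    # produces an empty first piece, which we drop.
    return [step for step in re.split(r"/+", xupath) if step]


steps = split_steps("/builtin:file/ios:interface")
print(steps)
```

The order of the returned list mirrors the containment order of the query: each step is matched within the matches of the step before it.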

Definition of language name (and components)

Every xupath step contains a language name, a string that consists of two components: a grammar name and a production name. The grammar name corresponds to a grammar defined in the XUTools Grammar Library (described in the next section) and the production name corresponds to a production within that grammar.

For example, the second step in the xupath /builtin:file/ios:interface/builtin:line corresponds to the interface production in the Cisco IOS grammar of the Grammar Library and references strings in the language of that production. Notice that the xupath syntax allows us to mix and match language constructs from different grammars.
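As an illustration, a language name can be decomposed into its two components with a few lines of Python. This is only a sketch: the names LanguageName and parse_language_name are our own for this example and are not taken from the XUTools source, and we assume that the colon separates the grammar name from the production name, as in the examples above.

```python
from typing import NamedTuple


class LanguageName(NamedTuple):
    """The two components of a language name (hypothetical type)."""
    grammar_name: str
    production_name: str


def parse_language_name(language_name):
    """Split 'grammar:production' into its components (hypothetical helper)."""
    grammar_name, sep, production_name = language_name.partition(":")
    if not sep or not grammar_name or not production_name:
        raise ValueError("expected 'grammar:production', got %r" % language_name)
    return LanguageName(grammar_name, production_name)


for step in ["builtin:file", "ios:interface", "builtin:line"]:
    name = parse_language_name(step)
    print(name.grammar_name, name.production_name)
```

Because each step carries its own grammar name, steps from different grammars can appear in the same xupath.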

Definition of xupath predicate

Sometimes a step's 'language construct' may be further qualified by a predicate.

Currently, the only predicate defined in the xupath syntax is the re:testsubtree predicate, which allows us to filter strings by a regular expression. (The name of the predicate may change in the future; it is an artifact of xupath's origins in XPath.) For example, the second step of the xupath /builtin:file/tei:subsubsection[re:testsubtree('Globus','e')] references all subsubsections that contain the word 'Globus'.
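The effect of the predicate can be sketched as a filter over candidate strings. The code below is not the XUTools implementation. The helpers parse_step and filter_matches are hypothetical, the step string is written without internal whitespace (an assumption about the intended spelling), and the second predicate argument ('e') is carried along uninterpreted because its meaning is not described here. Python's re.search stands in for whatever regular-expression engine XUTools uses.

```python
import re

# A step is a language name, optionally followed by a re:testsubtree predicate.
STEP_RE = re.compile(
    r"^(?P<language_name>[^\[\]]+)"
    r"(?:\[re:testsubtree\('(?P<pattern>[^']*)','(?P<flags>[^']*)'\)\])?$"
)


def parse_step(step):
    """Return (language_name, pattern, flags); pattern and flags may be None."""
    match = STEP_RE.match(step)
    if match is None:
        raise ValueError("not a recognized step: %r" % step)
    return match.group("language_name"), match.group("pattern"), match.group("flags")


def filter_matches(candidates, pattern):
    """Keep only the candidate strings in which the pattern occurs."""
    if pattern is None:
        return list(candidates)
    regex = re.compile(pattern)
    return [text for text in candidates if regex.search(text)]


language_name, pattern, flags = parse_step(
    "tei:subsubsection[re:testsubtree('Globus','e')]"
)
# Made-up subsubsection texts, for illustration only.
subsubsections = ["Globus Toolkit setup", "Certificate renewal"]
print(language_name, filter_matches(subsubsections, pattern))
```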

XUTools Grammar Library

In this subsection we discuss the intent of our XUTools Grammar Library and how to extend the library to include additional constructs.

Intent of the XUTools Grammar Library

We designed the grammar library to isolate references to language constructs from the encoding of those constructs, much as an abstract data type separates a reference to an operation from that operation's implementation. Practitioners already isolate a language construct from its encoding naturally: policy analysts reference sections and subsections of a policy, and network administrators reference interfaces. C developers, in order to use a library function in their own code, must know the name, purpose, and calling sequence of that function but not its implementation. Similarly, users of our grammar library need to know the name and purpose of the construct upon which they want to operate, but not its specification as a context-free grammar production.

We designed XUTools to operate in terms of references to language constructs because the way in which people reference information remains relatively stable but the manner in which people encode information changes with technology. Consider the historical transmission of text in which books and lines of Homer's Odyssey migrated from manuscripts, to books, to digital formats. Although the physical media and encoding of text changed, the high-level constructs of book and line survived (thanks to friends at the Holy Cross Classics Department, The Perseus Project at Tufts University, and Harvard's Center for Hellenic Studies for exposing me to this argument). In software engineering, the principle of an Abstract Data Type (ADT) echoes this philosophy: although an ADT's implementation may change over time, the interface to that implementation remains relatively stable.

How to Extend the XUTools Grammar Library

Currently, the XUTools Grammar Library is implemented as a set of grammars written in PyParsing (we may generalize this in the future). In order to extend the XUTools Grammar Library, one must first create a class for the grammar of the language one wants to add and then register that class with the class for the grammar library.

STEP 1: Create a class for the new language's grammar. For this discussion, we will use the xutools.grammar.pyparsing.CiscoIOSGrammar class as an example.

When we added the Cisco IOS grammar, we created the CiscoIOSGrammar class in the xutools/grammar/pyparsing directory. Within this class, we wrote a PyParsing grammar for Cisco IOS so that the productions of that grammar correspond to instance variables. We then defined the GRAMMAR_NAME as well as the language names for the grammar productions we want to process with XUTools (CONFIG, INTERFACE). A grammar class needs to implement the following methods:

  • get_grammar: Given a language name, get the grammar that specifies strings in that language.
  • get_language_name: Get all language names defined for this grammar.
  • get_label_for_match: Get the label to associate with a match for a language name. Note that we use the setResultsName method from PyParsing to set a label upon parsing for most productions. One can also, however, use the match_idx to assign a number that captures the document order of matches.
  • normalize_parse_tree: Given a match returned from PyParsing, process the result list into a canonical form for a parse tree and set tree vertex properties. These properties include type, id, and value. This method needs to be reworked so that parse trees use the parse tree abstractions in the xutools.parsers package. That abstraction will be closely tied to our xutools.corpus interface, as we will have trees of corpus elements rather than the current ad hoc representation.
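The four methods above can be sketched with a toy grammar class. This is not the real CiscoIOSGrammar: the class ToyIOSGrammar, its grammar name, its method signatures, and the dictionary used for the parse tree vertex are all assumptions made for illustration. To keep the sketch runnable without third-party packages, a compiled regular expression from Python's re module stands in for a PyParsing production, and a named group plays the role that a results name set with setResultsName would play.

```python
import re


class ToyIOSGrammar:
    """Hypothetical grammar class; a stand-in, not the XUTools class."""

    GRAMMAR_NAME = "toyios"          # hypothetical grammar name
    INTERFACE = "toyios:interface"   # language name for one production

    def __init__(self):
        # The production is an instance variable. Here it matches an
        # 'interface' line followed by any indented lines.
        self.interface = re.compile(
            r"^interface (?P<id>\S+)\n(?:[ \t]+.*\n?)*", re.MULTILINE
        )

    def get_grammar(self, language_name):
        """Given a language name, return the production for it."""
        if language_name == self.INTERFACE:
            return self.interface
        raise KeyError(language_name)

    def get_language_name(self):
        """Return all language names defined for this grammar."""
        return [self.INTERFACE]

    def get_label_for_match(self, match, match_idx):
        """Prefer the label set during parsing; fall back to document order."""
        label = match.groupdict().get("id")
        return label if label is not None else str(match_idx)

    def normalize_parse_tree(self, language_name, match, match_idx):
        """Build a vertex with the type, id, and value properties."""
        return {
            "type": language_name,
            "id": self.get_label_for_match(match, match_idx),
            "value": match.group(0),
        }


config = (
    "interface Loopback0\n"
    " ip address 10.0.0.1 255.255.255.255\n"
    "interface Vlan1\n"
    " shutdown\n"
)
grammar = ToyIOSGrammar()
production = grammar.get_grammar(ToyIOSGrammar.INTERFACE)
trees = [
    grammar.normalize_parse_tree(ToyIOSGrammar.INTERFACE, m, i)
    for i, m in enumerate(production.finditer(config))
]
for tree in trees:
    print(tree["type"], tree["id"])
```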

STEP 2: Register the grammar class for the new language with the grammar library in xutools/grammar/__init__.py. To do so, make the following modifications to this file. First, import the grammar class for the newly-defined language from the previous step. Then, define an instance variable for the grammar name of the newly-added language. We then want to modify the following methods:

  • get_language_names: Given a grammar name, return the language names for that grammar.
  • get_grammar_instance: Given a language name, get the appropriate grammar.
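The registration step can be sketched as follows. The classes ToyGrammar and ToyGrammarLibrary are hypothetical, and the dictionary-based dispatch is one possible design, not a description of the actual code in xutools/grammar/__init__.py. In the real library, the grammar class would be imported from its own module instead of being defined alongside the library class.

```python
class ToyGrammar:
    """Hypothetical grammar class from STEP 1, reduced to what STEP 2 needs."""

    GRAMMAR_NAME = "toy"
    LINE = "toy:line"

    def get_language_name(self):
        return [self.LINE]


class ToyGrammarLibrary:
    """Hypothetical grammar library that dispatches on the grammar name."""

    # Variable holding the grammar name of the newly added language.
    TOY = ToyGrammar.GRAMMAR_NAME

    def __init__(self):
        # Registering a grammar amounts to adding one entry here.
        self._grammars = {self.TOY: ToyGrammar()}

    def get_language_names(self, grammar_name):
        """Given a grammar name, return the language names for that grammar."""
        return self._grammars[grammar_name].get_language_name()

    def get_grammar_instance(self, language_name):
        """Given a language name, return the grammar that defines it."""
        grammar_name, _, _ = language_name.partition(":")
        return self._grammars[grammar_name]


library = ToyGrammarLibrary()
print(library.get_language_names("toy"))
print(type(library.get_grammar_instance("toy:line")).__name__)
```

Keeping the mapping in one place means that both methods pick up a new grammar once it is registered, instead of each needing its own conditional branch.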

FILES

ENVIRONMENT

DIAGNOSTICS

BUGS

There may be some bugs in the grammar library that need to be worked out. I am working on a cleaner interface that is less dependent on PyParsing.

AUTHOR

Gabriel A. Weaver

SEE ALSO

xudiff(1), xugrep(1), xuwc(1).


Creative Commons License
XUTools Wiki by Gabriel A. Weaver is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
