Change Log
Version 1.5.0 - June, 2008
This version of pyparsing includes work on two long-standing
FAQs: support for forcing parsing of the complete input string
(without having to explicitly append StringEnd() to the grammar),
and a method to improve the mechanism of detecting where syntax
errors occur in an input string with various optional and
alternative paths. This release also includes a helper method
to simplify definition of indentation-based grammars. With
these changes (and the past few minor updates), I thought it was
finally time to bump the minor rev number on pyparsing - so
1.5.0 is now available! Read on...
- AT LAST!!! You can now call parseString and have it raise
an exception if the expression does not parse the entire
input string. This has been an FAQ for a LONG time.
The parseString method now includes an optional parseAll
argument (default=False). If parseAll is set to True, then
the given parse expression must parse the entire input
string. (This is equivalent to adding StringEnd() to the
end of the expression.) The default value is False to
retain backward compatibility.
Inspired by MANY requests over the years, most recently by
ecir-hana on the pyparsing wiki!
- Added new operator '-' for composing grammar sequences. '-'
behaves just like '+' in creating And expressions, but '-'
is used to mark grammar structures that should stop parsing
immediately and report a syntax error, rather than just
backtracking to the last successful parse and trying another
alternative. For instance, running the following code:
port_definition = Keyword("port") + '=' + Word(nums)
entity_definition = (Keyword("entity") + "{" +
                     Optional(port_definition) + "}")
entity_definition.parseString("entity { port 100 }")
pyparsing fails to detect the missing '=' in the port definition.
But, since this expression is optional, pyparsing then proceeds
to try to match the closing '}' of the entity_definition. Not
finding it, pyparsing reports that there was no '}' after the '{'
character. Instead, we would like pyparsing to parse the 'port'
keyword, and if not followed by an equals sign and an integer,
to signal this as a syntax error.
This can now be done simply by changing the port_definition to:
port_definition = Keyword("port") - '=' + Word(nums)
Now after successfully parsing 'port', pyparsing must also find
an equals sign and an integer, or it will raise a fatal syntax
error.
By judicious insertion of '-' operators, a pyparsing developer
can have their grammar report much more informative syntax error
messages.
Patches and suggestions proposed by several contributors on
the pyparsing mailing list and wiki - special thanks to
Eike Welk and Thomas/Poldy on the pyparsing wiki!
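The difference between '+' and '-' can be pictured without pyparsing at all. The following is a minimal pure-Python sketch, not pyparsing's implementation: ParseError, FatalParseError, parse_port, parse_entity, describe_error and the commit flag are all hypothetical stand-ins. With commit=False the optional port definition backs out quietly (like '+'); with commit=True a failure after the 'port' keyword is reported at once (like '-').

```python
class ParseError(Exception):
    """Ordinary failure: the caller is free to try an alternative."""


class FatalParseError(ParseError):
    """Committed failure: stop parsing and report the error."""


def parse_port(tokens, pos, commit):
    """Match: 'port' '=' <integer>. Return the new position, or None."""
    if tokens[pos:pos + 1] != ["port"]:
        return None  # no port definition here at all
    rest = tokens[pos + 1:pos + 3]
    if len(rest) == 2 and rest[0] == "=" and rest[1].isdigit():
        return pos + 3
    if commit:
        # '-' behaviour: 'port' was seen, so the rest is mandatory
        raise FatalParseError("expected '=' and an integer after 'port'")
    return None  # '+' behaviour: give up quietly and let the caller go on


def parse_entity(tokens, commit):
    """Match: 'entity' '{' [port definition] '}'."""
    if tokens[:2] != ["entity", "{"]:
        raise ParseError("expected 'entity {'")
    pos = 2
    after_port = parse_port(tokens, pos, commit)
    if after_port is not None:  # the port definition is optional
        pos = after_port
    if tokens[pos:pos + 1] != ["}"]:
        raise ParseError("expected '}' after '{'")
    return pos + 1


def describe_error(text, commit):
    """Return the error message produced for text, or None on success."""
    try:
        parse_entity(text.split(), commit)
    except ParseError as exc:
        return str(exc)
    return None


source = "entity { port 100 }"
print(describe_error(source, commit=False))  # expected '}' after '{'
print(describe_error(source, commit=True))   # expected '=' and an integer after 'port'
```

The first message points at the wrong place (the brace), while the committed version names the real problem, which is the improvement this entry describes.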
- Added indentedBlock helper method, to encapsulate the parse
actions and indentation stack management needed to keep track of
indentation levels. Use indentedBlock to define indentation-based
grouping grammars, like Python's.
indentedBlock takes up to 3 parameters:
- blockStatementExpr - expression defining syntax of statement
that is repeated within the indented block
- indentStack - list created by caller to manage indentation
stack (multiple indentedBlock expressions
within a single grammar should share a common indentStack)
- indent - boolean indicating whether block must be indented
beyond the current level; set to False for block of
left-most statements (default=True)
A valid block must contain at least one indented statement.
- Fixed bug in nestedExpr in which ignored expressions needed
to be set off with whitespace. Reported by Stefaan Himpe,
nice catch!
- Expanded multiplication of an expression by a tuple, to
accept tuple values of None:
. expr*(n,None) or expr*(n,) is equivalent
to expr*n + ZeroOrMore(expr)
(read as "at least n instances of expr")
. expr*(None,n) is equivalent to expr*(0,n)
(read as "0 to n instances of expr")
. expr*(None,None) is equivalent to ZeroOrMore(expr)
. expr*(1,None) is equivalent to OneOrMore(expr)
Note that expr*(None,n) does not raise an exception if
more than n exprs exist in the input stream; that is,
expr*(None,n) does not enforce a maximum number of expr
occurrences. If this behavior is desired, then write
expr*(None,n) + ~expr
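The same distinction exists in regular expressions, which may make it easier to picture. This is only an analogy using Python's re module, not pyparsing code: a bounded repetition stops quietly after n matches, and a negative lookahead plays the role of '~expr' in rejecting one more occurrence.

```python
import re

# Like expr*(None,2): match up to two "ab" groups, then simply stop.
up_to_two = re.compile(r"(?:ab){0,2}")

# Like expr*(None,2) + ~expr: also require that no further "ab" follows.
at_most_two = re.compile(r"(?:ab){0,2}(?!ab)")

print(up_to_two.match("ababab").group())   # abab
print(at_most_two.match("ababab"))         # None
print(at_most_two.match("abab").group())   # abab
```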
- Added None as a possible operator for operatorPrecedence.
None signifies "no operator", as in multiplying m times x
in "y=mx+b".
- Fixed bug in Each, reported by Michael Ramirez, in which the
order of terms in the Each affected the parsing of the results.
Problem was due to premature grouping of the expressions in
the overall Each during grammar construction, before the
complete Each was defined. Thanks, Michael!
- Also fixed bug in Each in which Optional's with default values
were not getting the defaults added to the results of the
overall Each expression.
- Fixed a bug in Optional in which results names were not
assigned if a default value was supplied.
- Cleaned up Py3K compatibility statements, including exception
construction statements, and better equivalence between _ustr
and basestring, and __nonzero__ and __bool__.
Version 1.4.11 - February, 2008
- With help from Robert A. Clark, this version of pyparsing
is compatible with Python 3.0a3. Thanks for the help, Robert!
- Added WordStart and WordEnd positional classes, to support
expressions that must occur at the start or end of a word.
Proposed by piranha on the pyparsing wiki, good idea!
- Added matchOnlyAtCol helper parser action, to simplify
parsing log or data files that have optional fields that are
column dependent. Inspired by a discussion thread with
hubritic on comp.lang.python.
- Added withAttribute.ANY_VALUE as a match-all value when using
withAttribute. Used to ensure that an attribute is present,
without having to match on the actual attribute value.
- Added get() method to ParseResults, similar to dict.get().
Suggested by new pyparsing user, Alejandro Dubrovksy, thanks!
- Added '==' short-cut to see if a given string matches a
pyparsing expression. For instance, you can now write:
integer = Word(nums)
if "123" == integer:
    # do something
    pass
print([ x for x in "123 234 asld".split() if x==integer ])
# prints ['123', '234']
- Simplified the use of nestedExpr when using an expression for
the opening or closing delimiters. Now the content expression
will not have to explicitly negate closing delimiters. Found
while working with dfinnie on GHOP Task #277, thanks!
- Fixed bug when defining ignorable expressions that are
later enclosed in a wrapper expression (such as ZeroOrMore,
OneOrMore, etc.) - found while working with Prabhu
Gurumurthy, thanks Prabhu!
- Fixed bug in withAttribute in which keys were automatically
converted to lowercase, making it impossible to match XML
attributes with uppercase characters in them. Using
withAttribute requires that you reference attributes in all
lowercase if parsing HTML, and in correct case when parsing
XML.
- Changed '<<' operator on Forward to return None, since this
is really used as a pseudo-assignment operator, not as a
left-shift operator. By returning None, it is easier to
catch faulty statements such as a << b | c, where precedence
of operations causes the '|' operation to be performed
*after* inserting b into a, so no alternation is actually
implemented. The correct form is a << (b | c). With this
change, an error will be reported instead of silently
clipping the alternative term. (Note: this may break some
existing code, but if it does, the code had a silent bug in
it anyway.) Proposed by wcbarksdale on the pyparsing wiki,
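Why returning None helps can be seen with a small stand-in class. This is a sketch under assumptions, not pyparsing's Forward: ForwardSketch is hypothetical and models only the '<<' operator, and plain Python sets stand in for expressions so that '|' has a meaning.

```python
class ForwardSketch:
    """Hypothetical stand-in for Forward; models only the '<<' operator."""

    def __init__(self):
        self.expr = None

    def __lshift__(self, other):
        self.expr = other
        return None  # nothing useful is left to chain onto


b, c = {"b"}, {"c"}  # sets stand in for expressions; '|' is set union

faulty = ForwardSketch()
try:
    faulty << b | c  # evaluated as (faulty << b) | c, that is None | c
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

correct = ForwardSketch()
correct << (b | c)

print(outcome)               # TypeError
print(sorted(faulty.expr))   # ['b']
print(sorted(correct.expr))  # ['b', 'c']
```

In the faulty form only b was ever inserted; the difference is that the mistake now surfaces as an error instead of passing silently.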
- Several unit tests were added to pyparsing's regression
suite, courtesy of the Google Highly-Open Participation
Contest. Thanks to all who administered and took part in
this event!
Version 1.4.10 - December 9, 2007
- Fixed bug introduced in v1.4.8, parse actions were called for
intermediate operator levels, not just the deepest matching
operation level. Again, big thanks to Torsten Marek for
helping isolate this problem!
Version 1.4.9 - December 8, 2007
- Added '*' multiplication operator support when creating
grammars, accepting either an integer, or a two-integer
tuple multiplier, as in:
ipAddress = Word(nums) + ('.'+Word(nums))*3
usPhoneNumber = Word(nums) + ('-'+Word(nums))*(1,2)
If multiplying by a tuple, the two integer values represent
min and max multiples. Suggested by Vincent of,
great idea, Vincent!
- Fixed bug in nestedExpr, original version was overly greedy!
Thanks to Michael Ramirez for raising this issue.
- Fixed internal bug in ParseResults - when an item was deleted,
the key indices were not updated. Thanks to Tim Mitchell for
posting a bugfix patch to the SF bug tracking system!
- Fixed internal bug in operatorPrecedence - when the results of
a right-associative term were sent to a parse action, the wrong
tokens were sent. Reported by Torsten Marek, nice job!
- Added pop() method to ParseResults. If pop is called with an
integer or with no arguments, it will use list semantics and
update the ParseResults' list of tokens. If pop is called with
a non-integer (a string, for instance), then it will use dict
semantics and update the ParseResults' internal dict.
Suggested by Donn Ingle, thanks Donn!
- Fixed quoted string built-ins to accept '\xHH' hex characters
within the string.
Version 1.4.8 - October, 2007
- Added new helper method nestedExpr to easily create expressions
that parse lists of data in nested parentheses, braces, brackets,
- Added withAttribute parse action helper, to simplify creating
filtering parse actions to attach to expressions returned by
makeHTMLTags and makeXMLTags. Use withAttribute to qualify a
starting tag with one or more required attribute values, to avoid
false matches on common tags such as <TD> or <DIV>.
- Added new examples and to demonstrate
the new features.
- Added performance speedup to grammars using operatorPrecedence,
instigated by Stefan Reichör - thanks for the feedback, Stefan!
- Fixed bug/typo when deleting an element from a ParseResults by
using the element's results name.
- Fixed whitespace-skipping bug in wrapper classes (such as Group,
Suppress, Combine, etc.) and when using setDebug(), reported by
new pyparsing user dazzawazza on SourceForge, nice job!
- Added restriction to prevent defining Word or CharsNotIn expressions
with minimum length of 0 (should use Optional if this is desired),
and enhanced docstrings to reflect this limitation. Issue was
raised by Joey Tallieu, who submitted a patch with a slightly
different solution. Thanks for taking the initiative, Joey, and
please keep submitting your ideas!
- Fixed bug in makeHTMLTags that did not detect HTML tag attributes
with no '= value' portion (such as "<td nowrap>"), reported by
hamidh on the pyparsing wiki - thanks!
- Fixed minor bug in makeHTMLTags and makeXMLTags, which did not
accept whitespace in closing tags.
Version 1.4.7 - July, 2007
- NEW NOTATION SHORTCUT: ParserElement now accepts results names using
a notational shortcut, following the expression with the results name
in parentheses. So this:
stats = "AVE:" + realNum.setResultsName("average") + \
"MIN:" + realNum.setResultsName("min") + \
"MAX:" + realNum.setResultsName("max")
can now be written as this:
stats = "AVE:" + realNum("average") + \
"MIN:" + realNum("min") + \
"MAX:" + realNum("max")
The intent behind this change is to make it simpler to define results
names for significant fields within the expression, while keeping
the grammar syntax clean and uncluttered.
- Fixed bug when packrat parsing is enabled, with cached ParseResults
being updated by subsequent parsing. Reported on the pyparsing
wiki by Kambiz, thanks!
- Fixed bug in operatorPrecedence for unary operators with left
associativity, if multiple operators were given for the same term.
- Fixed bug in example, corrected precedence of "and" vs.
"or" operations.
- Fixed bug in Dict class, in which keys were converted to strings
whether they needed to be or not. Have narrowed this logic to
convert keys to strings only if the keys are ints (which would
confuse __getitem__ behavior for list indexing vs. key lookup).
- Added ParserElement method setBreak(), which will invoke the pdb
module's set_trace() function when this expression is about to be
parsed.
- Fixed bug in StringEnd in which reading off the end of the input
string raises an exception - should match. Resolved while
answering a question for Shawn on the pyparsing wiki.
Version 1.4.6 - April, 2007
- Simplified constructor for ParseFatalException, to support common
exception construction idiom:
raise ParseFatalException("unexpected text: 'Spanish Inquisition'")
- Added method getTokensEndLoc(), to be called from within a parse action,
for those parse actions that need both the starting *and* ending
location of the parsed tokens within the input text.
- Enhanced behavior of keepOriginalText so that named parse fields are
preserved, even though tokens are replaced with the original input
text matched by the current expression. Also, cleaned up the stack
traversal to be more robust. Suggested by Tim Arnold - thanks, Tim!
- Fixed subtle bug in which countedArray (and similar dynamic
expressions configured in parse actions) failed to match within Or,
Each, FollowedBy, or NotAny. Reported by Ralf Vosseler, thanks for
your patience, Ralf!
- Fixed Unicode bug in upcaseTokens and downcaseTokens parse actions,
scanString, and default debugging actions; reported (and patch submitted)
by Nikolai Zamkovoi, spasibo!
- Fixed bug when saving a tuple as a named result. The returned
token list gave the proper tuple value, but accessing the result by
name only gave the first element of the tuple. Reported by
Poromenos, nice catch!
- Fixed bug in makeHTMLTags/makeXMLTags, which failed to match tag
attributes with namespaces.
- Fixed bug in SkipTo when setting include=True, to have the skipped-to
tokens correctly included in the returned data. Reported by gunars on
the pyparsing wiki, thanks!
- Fixed typobug in OnlyOnce.reset method, omitted self argument.
Submitted by Eike Welk, thanks for the lint-picking!
- Added performance enhancement to Forward class, suggested by
akkartik on the pyparsing Wiki discussion, nice work!
- Added optional asKeyword to Word constructor, to indicate that the
given word pattern should be matched only as a keyword, that is, it
should only match if it is within word boundaries.
- Added S-expression parser to examples directory.
- Added macro substitution example to examples directory.
- Added example, excerpted from Marco Alfonso's blog -
muchas gracias, Marco!
- Modified internal cyclic references in ParseResults to use weakrefs;
this should help reduce the memory footprint of large parsing
programs, at some cost to performance (3-5%). Suggested by bca48150 on
the pyparsing wiki, thanks!
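The general technique looks roughly like this. It is a minimal sketch of weak parent links using the standard weakref module, not the actual ParseResults code; ResultNode and make_orphan are hypothetical names. Because the child holds only a weak reference to its parent, parent and child no longer form a reference cycle.

```python
import weakref


class ResultNode:
    """Hypothetical tree node whose parent link is a weak reference."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self._parent = None  # will hold a weakref.ref, never a strong reference

    def add(self, child):
        child._parent = weakref.ref(self)
        self.children.append(child)
        return child

    @property
    def parent(self):
        # Dereference the weak link; None if unset or already collected
        return self._parent() if self._parent is not None else None


def make_orphan():
    temp = ResultNode("temp")
    return temp.add(ResultNode("child"))  # temp is dropped on return


root = ResultNode("root")
leaf = root.add(ResultNode("leaf"))
print(leaf.parent is root)   # True
print(make_orphan().parent)  # None in CPython, where temp is freed at once
```

The extra dereference on every parent lookup is one plausible source of the small performance cost mentioned above.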
- Enhanced the documentation describing the vagaries and idiosyncrasies
of parsing strings with embedded tabs, and the impact on:
. parse actions
. scanString
. col and line helper functions
(Suggested by Eike Welk in response to some unexplained inconsistencies
between parsed location and offsets in the input string.)
- Cleaned up internal decorators to preserve function names,
docstrings, etc.
Version 1.4.5 - December, 2006
- Removed debugging print statement from QuotedString class. Sorry
for not stripping this out before the 1.4.4 release!
- A significant performance improvement, the first one in a while!
For my Verilog parser, this version of pyparsing is about double the
speed - YMMV.
- Added support for pickling of ParseResults objects. (Reported by
Jeff Poole, thanks Jeff!)
- Fixed minor bug in makeHTMLTags that did not recognize tag attributes
with embedded '-' or '_' characters. Also, added support for
passing expressions to makeHTMLTags and makeXMLTags, and used this
feature to define the globals anyOpenTag and anyCloseTag.
- Fixed error in alphas8bit, I had omitted the y-with-umlaut character.
- Added punc8bit string to complement alphas8bit - it contains all the
non-alphabetic, non-blank 8-bit characters.
- Added commonHTMLEntity expression, to match common HTML "ampersand"
codes, such as "&lt;", "&gt;", "&amp;", "&nbsp;", and "&quot;". This
expression also defines a results name 'entity', which can be used
to extract the entity field (that is, "lt", "gt", etc.). Also added
built-in parse action replaceHTMLEntity, which can be attached to
commonHTMLEntity to translate "&lt;", "&gt;", "&amp;", "&nbsp;", and
"&quot;" to "<", ">", "&", " ", and '"'.
- Added example,, that strips HTML tags and scripts
from HTML pages. It also translates common HTML entities to their
respective characters.
Version 1.4.4 - October, 2006
- Fixed traceParseAction decorator to also trap and record exception
returns from parse actions, and to handle parse actions with 0,
1, 2, or 3 arguments.
- Enhanced parse action normalization to support using classes as
parse actions; that is, the class constructor is called at parse
time and the __init__ function is called with 0, 1, 2, or 3
arguments. If passing a class as a parse action, the __init__
method must use one of the valid parse action parameter list
formats. (This technique is useful when using pyparsing to compile
parsed text into a series of application objects - see the new
- Fixed bug in ParseResults when setting an item using an integer
index. (Reported by Christopher Lambacher, thanks!)
- Fixed whitespace-skipping bug, patch submitted by Paolo Losi -
grazie, Paolo!
- Fixed bug when a Combine contained an embedded Forward expression,
reported by cie on the pyparsing wiki - good catch!
- Fixed listAllMatches bug, when a listAllMatches result was
nested within another result. (Reported by don pasquale on
comp.lang.python, well done!)
- Fixed bug in ParseResults items() method, when returning an item
marked as listAllMatches=True
- Fixed bug in definition of cppStyleComment (and javaStyleComment)
in which '//' line comments were not continued to the next line
if the line ends with a '\'. (Reported by eagle-eyed Ralph
Corderoy!)
- Optimized re's for cppStyleComment and quotedString for better
re performance - also provided by Ralph Corderoy, thanks!
- Added new example,, showing how to
define a grammar using indentation to show grouping (as Python
does for defining statement nesting). Instigated by an e-mail
discussion with Andrew Dalke, thanks Andrew!
- Added new helper operatorPrecedence (based on e-mail list discussion
with Ralph Corderoy and Paolo Losi), to facilitate definition of
grammars for expressions with unary and binary operators. For
instance, this grammar defines a 6-function arithmetic expression
grammar, with unary plus and minus, proper operator precedence, and
right- and left-associativity:
expr = operatorPrecedence( operand,
    [("!", 1, opAssoc.LEFT),
    ("^", 2, opAssoc.RIGHT),
    (oneOf("+ -"), 1, opAssoc.RIGHT),
    (oneOf("* /"), 2, opAssoc.LEFT),
    (oneOf("+ -"), 2, opAssoc.LEFT),]
    )
Also added example and to provide
more detailed code samples using this new helper method.
- Added new helpers matchPreviousLiteral and matchPreviousExpr, for
creating adaptive parsing expressions that match the same content
as was parsed in a previous parse expression. For instance:
first = Word(nums)
matchExpr = first + ":" + matchPreviousLiteral(first)
will match "1:1", but not "1:2". Since this matches at the literal
level, this will also match the leading "1:1" in "1:10".
In contrast:
first = Word(nums)
matchExpr = first + ":" + matchPreviousExpr(first)
will *not* match the leading "1:1" in "1:10"; the expressions are
evaluated first, and then compared, so "1" is compared with "10".
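The two behaviors can be mimicked with Python's re module. This is an analogy only, not how the helpers are implemented: a backreference repeats the literal text (like matchPreviousLiteral), while the hypothetical expr_style function below parses both fields completely before comparing them (like matchPreviousExpr).

```python
import re

# Literal-style: the backreference \1 repeats the exact text of group 1.
literal_style = re.compile(r"(\d+):\1")


def expr_style(text):
    """Parse both numbers completely, then compare them."""
    m = re.match(r"(\d+):(\d+)", text)
    return m is not None and m.group(1) == m.group(2)


print(literal_style.match("1:10").group())  # 1:1
print(literal_style.match("1:2"))           # None
print(expr_style("1:10"))                   # False
print(expr_style("1:1"))                    # True
```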
- Added keepOriginalText parse action. Sometimes pyparsing's
whitespace-skipping leaves out too much whitespace. Adding this
parse action will restore any internal whitespace for a parse
expression. This is especially useful when defining expressions
for scanString or transformString applications.
- Added __add__ method for ParseResults class, to better support
using Python sum built-in for summing ParseResults objects returned
from scanString.
- Added reset method for the new OnlyOnce class wrapper for parse
actions (to allow a grammar to be used multiple times).
- Added optional maxMatches argument to scanString and searchString,
to short-circuit scanning after 'n' expression matches are found.
Version 1.4.3 - July, 2006
- Fixed implementation of multiple parse actions for an expression
(added in 1.4.2).
. setParseAction() reverts to its previous behavior, setting
one (or more) actions for an expression, overwriting any
action or actions previously defined
. new method addParseAction() appends one or more parse actions
to the list of parse actions attached to an expression
Now it is harder to accidentally append parse actions to an
expression, when what you wanted to do was overwrite whatever had
been defined before. (Thanks, Jean-Paul Calderone!)
- Simplified interface to parse actions that do not require all 3
parse action arguments. Very rarely do parse actions require more
than just the parsed tokens, yet parse actions still require all
3 arguments including the string being parsed and the location
within the string where the parse expression was matched. With this
release, parse actions may now be defined to be called as:
. fn(string,locn,tokens) (the current form)
. fn(locn,tokens)
. fn(tokens)
. fn()
The setParseAction and addParseAction methods will internally decorate
the provided parse actions with compatible wrappers to conform to
the full (string,locn,tokens) argument sequence.
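One way such a compatible wrapper could be written is sketched below. This is an assumption about the general approach, not pyparsing's internal code; normalize_action is a hypothetical name, and the sketch counts parameters with inspect.signature and passes only the trailing arguments that the function asks for.

```python
import functools
import inspect


def normalize_action(fn):
    """Wrap fn so it can always be called as wrapper(string, locn, tokens)."""
    arity = len(inspect.signature(fn).parameters)
    if arity > 3:
        raise TypeError("parse actions take at most 3 arguments")

    @functools.wraps(fn)
    def wrapper(string, locn, tokens):
        full = (string, locn, tokens)
        return fn(*full[3 - arity:])  # keep only the trailing arguments

    return wrapper


to_int = normalize_action(lambda tokens: int(tokens[0]))
where = normalize_action(lambda locn, tokens: (locn, tokens[0]))
constant = normalize_action(lambda: "fixed")

print(to_int("x = 42", 4, ["42"]))    # 42
print(where("x = 42", 4, ["42"]))     # (4, '42')
print(constant("x = 42", 4, ["42"]))  # fixed
```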
- Removed the deprecated special handling of tuples returned from
parse actions.
I announced this in March, 2004, and gave a final warning in the last
release. Now you can return a tuple from a parse action, and it will
be treated like any other return value (i.e., the tuple will be
substituted for the incoming tokens passed to the parse action,
which is useful when trying to parse strings into tuples).
- Added setFailAction method, taking a callable function fn that
takes the arguments fn(s,loc,expr,err) where:
. s - string being parsed
. loc - location where expression match was attempted and failed
. expr - the parse expression that failed
. err - the exception thrown
The function returns no values. It may throw ParseFatalException
if it is desired to stop parsing immediately.
(Suggested by peter21081944 on
- Added class OnlyOnce as helper wrapper for parse actions. OnlyOnce
only permits a parse action to be called one time, after which
all subsequent calls throw a ParseException.
- Added traceParseAction decorator to help debug parse actions.
Simply insert "@traceParseAction" ahead of the definition of your
parse action, and each invocation will be displayed, along with
incoming arguments, and returned value.
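For readers unfamiliar with tracing decorators, the idea can be sketched in a few lines. This is not the traceParseAction source; trace_action is a hypothetical name and the message format is an assumption. functools.wraps keeps the decorated function's name and docstring intact.

```python
import functools


def trace_action(fn):
    """Print each call to fn with its arguments and its result or exception."""

    @functools.wraps(fn)
    def wrapper(*args):
        print(">> entering %s%r" % (fn.__name__, args))
        try:
            result = fn(*args)
        except Exception as exc:
            print("<< %s raised %r" % (fn.__name__, exc))
            raise
        print("<< leaving %s (ret: %r)" % (fn.__name__, result))
        return result

    return wrapper


@trace_action
def to_int(tokens):
    """Convert the first token to an integer."""
    return int(tokens[0])


value = to_int(["42"])
print(value)            # 42
print(to_int.__name__)  # to_int
```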
- Fixed bug when copying ParserElements using copy() or
setResultsName(). (Reported by Dan Thill, great catch!)
- Fixed bug in asXML() where token text contains <, >, and &
characters - generated XML now escapes these as &lt;, &gt; and
&amp;. (Reported by Jacek Sieka, thanks!)
- Fixed bug in SkipTo() when searching for a StringEnd(). (Reported
by Pete McEvoy, thanks Pete!)
- Fixed "except Exception" statements, the most critical added as part
of the packrat parsing enhancement. (Thanks, Erick Tryzelaar!)
- Fixed end-of-string infinite looping on LineEnd and StringEnd
expressions. (Thanks again to Erick Tryzelaar.)
- Modified setWhitespaceChars to return self, to be consistent with
other ParserElement modifiers. (Suggested by Erick Tryzelaar.)
- Fixed bug/typo in new ParseResults.dump() method.
- Fixed bug in searchString() method, in which only the first token of
an expression was returned. searchString() now returns a
ParseResults collection of all search matches.
- Added example program, a string transformer that
converts text files with hard line-breaks into one with line breaks
only between paragraphs.
- Added example program, to illustrate using the
listAllMatches option when specifying results names (also shows new
support for passing lists to oneOf).
- Added example program, to illustrate using the
helper methods lineno, line, and col, and returning objects from a
parse action.
- Added example program, to which can parse the
string representation of a Python list back into a true list. Taken
mostly from my PyCon presentation examples, but now with support
for tuple elements, too!
Version 1.4.2 - April 1, 2006 (No foolin'!)
- Significant speedup from memoizing nested expressions (a technique
known as "packrat parsing"), thanks to Chris Lesniewski-Laas! Your
mileage may vary, but my Verilog parser almost doubled in speed to
over 600 lines/sec!
This speedup may break existing programs that use parse actions that
have side-effects. For this reason, packrat parsing is disabled when
you first import pyparsing. To activate the packrat feature, your
program must call the class method ParserElement.enablePackrat(). If
your program uses psyco to "compile as you go", you must call
enablePackrat before calling psyco.full(). If you do not do this,
Python will crash. For best results, call enablePackrat() immediately
after importing pyparsing.
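The caveat about side effects follows directly from how memoization works. The sketch below is a generic illustration, not pyparsing's packrat code; memoize, parse_integer and the calls list are hypothetical. Once a result for a given input and position is cached, the underlying function (and any side effect inside it) does not run again.

```python
calls = []  # records every real invocation, to make the side effect visible


def memoize(parse_fn):
    """Cache results of parse_fn keyed on (text, position)."""
    cache = {}

    def wrapper(text, pos):
        key = (text, pos)
        if key not in cache:
            cache[key] = parse_fn(text, pos)
        return cache[key]

    return wrapper


def parse_integer(text, pos):
    """Match digits starting at pos; return (new_pos, value) or None."""
    calls.append(pos)  # a side effect, as a parse action might have
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return (end, int(text[pos:end])) if end > pos else None


packrat_integer = memoize(parse_integer)
first = packrat_integer("123+4", 0)
second = packrat_integer("123+4", 0)  # served from the cache

print(first, second)  # (3, 123) (3, 123)
print(calls)          # [0]
```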
- Added new helper method countedArray(expr), for defining patterns that
start with a leading integer to indicate the number of array elements,
followed by that many elements, matching the given expr parse
expression. For instance, this two-liner:
wordArray = countedArray(Word(alphas))
print(wordArray.parseString("3 Practicality beats purity")[0])
returns the parsed array of words:
['Practicality', 'beats', 'purity']
The leading token '3' is suppressed, although it is easily obtained
from the length of the returned array.
(Inspired by e-mail discussion with Ralf Vosseler.)
- Added support for attaching multiple parse actions to a single
ParserElement. (Suggested by Dan "Dang" Griffith - nice idea, Dan!)
- Added support for asymmetric quoting characters in the recently-added
QuotedString class. Now you can define your own quoted string syntax
like "<<This is a string in double angle brackets.>>". To define
this custom form of QuotedString, your code would define:
dblAngleQuotedString = QuotedString('<<',endQuoteChar='>>')
QuotedString also supports escaped quotes, escape character other
than '\', and multiline.
- Changed the default value returned internally by Optional, so that
None can be used as a default value. (Suggested by Steven Bethard -
I finally saw the light!)
- Added dump() method to ParseResults, to make it easier to list out
and diagnose values returned from calling parseString.
- A new example, a search query string parser, submitted by Steven
Mooij and Rudolph Froger - a very interesting application, thanks!
- Added an example that parses the BNF in Python's Grammar file, in
support of generating Python grammar documentation. (Suggested by
J H Stovall.)
- A new example, submitted by Tim Cera, of a flexible parser module,
using a simple config variable to adjust parsing for input formats
that have slight variations - thanks, Tim!
- Added an example for parsing Roman numerals, showing the capability
of parse actions to "compile" Roman numerals into their integer
values during parsing.
- Added a new docs directory, for additional documentation or help.
Currently, this includes the text and examples from my recent
presentation at PyCon.
- Fixed another typo in CaselessKeyword, thanks Stefan Behnel.
- Expanded oneOf to also accept tuples, not just lists. This really
should be sufficient...
- Added deprecation warnings when tuple is returned from a parse action.
Looking back, I see that I originally deprecated this feature in March,
2004, so I'm guessing people really shouldn't have been using this
feature - I'll drop it altogether in the next release, which will
allow users to return a tuple from a parse action (which is really
handy when trying to reconstruct tuples from a tuple string
Version 1.4.1 - February, 2006
- Converted generator expression in QuotedString class to list
comprehension, to retain compatibility with Python 2.3. (Thanks, Titus
Brown for the heads-up!)
- Added searchString() method to ParserElement, as an alternative to
using "scanString(instring).next()[0][0]" to search through a string
looking for a substring matching a given parse expression. (Inspired by
e-mail conversation with Dave Feustel.)
- Modified oneOf to accept lists of strings as well as a single string
of space-delimited literals. (Suggested by Jacek Sieka - thanks!)
- Removed deprecated use of Upcase in pyparsing test code. (Also caught by
Titus Brown.)
- Removed lstrip() call from Literal - too aggressive in stripping
whitespace which may be valid for some grammars. (Point raised by Jacek
Sieka). Also, made Literal more robust in the event of passing an empty
string.
- Fixed bug in replaceWith when returning None.
- Added cautionary documentation for Forward class when assigning a
MatchFirst expression, as in:
fwdExpr << a | b | c
Precedence of operators causes this to be evaluated as:
(fwdExpr << a) | b | c
thereby leaving b and c out as parseable alternatives. Users must
explicitly group the values inserted into the Forward:
fwdExpr << (a | b | c)
(Suggested by Scot Wilcoxon - thanks, Scot!)
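The grouping can be checked with Python's own parser, independent of pyparsing. The helper below is a small illustrative sketch (grouping is a hypothetical name) that uses the standard ast module to print an expression fully parenthesized.

```python
import ast


def grouping(source):
    """Return source with the grouping Python applies made explicit."""

    def show(node):
        if isinstance(node, ast.BinOp):
            op = {ast.LShift: "<<", ast.BitOr: "|"}[type(node.op)]
            return "(%s %s %s)" % (show(node.left), op, show(node.right))
        return node.id  # a bare name such as fwdExpr, a, b or c

    return show(ast.parse(source, mode="eval").body)


print(grouping("fwdExpr << a | b | c"))    # (((fwdExpr << a) | b) | c)
print(grouping("fwdExpr << (a | b | c)"))  # (fwdExpr << ((a | b) | c))
```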
Version 1.4 - January 18, 2006
- Added Regex class, to permit definition of complex embedded expressions
using regular expressions. (Enhancement provided by John Beisley, great
- Converted implementations of Word, oneOf, quoted string, and comment
helpers to utilize regular expression matching. Performance improvements
in the 20-40% range.
- Added QuotedString class, to support definition of non-standard quoted
strings (Suggested by Guillaume Proulx, thanks!)
- Added CaselessKeyword class, to streamline grammars with, well, caseless
keywords (Proposed by Stefan Behnel, thanks!)
- Fixed bug in SkipTo, when using an ignorable expression. (Patch provided
by Anonymous, thanks, whoever-you-are!)
- Fixed typo in NoMatch class. (Good catch, Stefan Behnel!)
- Fixed minor bug in _makeTags(), using string.printables instead of
- Cleaned up some of the expressions created by makeXXXTags helpers, to
suppress extraneous <> characters.
- Added some grammar definition-time checking to verify that a grammar is
being built using proper ParserElements.
- Added examples:
. - linear algebra C preprocessor (submitted by Mike Ellis,
thanks Mike!)
. - converts word description of a number back to
the original number (such as 'one hundred and twenty three' -> 123)
. updated to support unary minus, added BNF comments
Version 1.3.3 - September 12, 2005
- Improved support for Unicode strings that would be returned using
srange. Added example, for a Korean version of
"Hello, World!" using Unicode. (Thanks, June Kim!)
- Added 'hexnums' string constant (nums+"ABCDEFabcdef") for defining
hexadecimal value expressions.
- Modified tag and results definitions returned by makeHTMLTags(),
to better support the looseness of HTML parsing. Tags to be
parsed are now caseless, and keys generated for tag attributes are
now converted to lower case.
Formerly, makeHTMLTags("XYZ") would return a tag with results
name of "startXYZ", this has been changed to "startXyz". If this
tag is matched against '<XYZ Abc="1" DEF="2" ghi="3">', the
matched keys formerly would be "Abc", "DEF", and "ghi"; keys are
now converted to lower case, giving keys of "abc", "def", and
"ghi". These changes were made to try to address the lax
case sensitivity agreement between start and end tags in many
HTML pages.
No changes were made to makeXMLTags(), which assumes more rigorous
parsing rules.
Also, cleaned up case-sensitivity bugs in closing tags, and
switched to using Keyword instead of Literal class for tags.
(Thanks, Steve Young, for getting me to look at these in more
- Added two helper parse actions, upcaseTokens and downcaseTokens,
which will convert matched text to all uppercase or lowercase,
respectively.
- Deprecated Upcase class, to be replaced by upcaseTokens parse
action.
- Converted messages sent to stderr to use warnings module, such as
when constructing a Literal with an empty string, one should use
the Empty() class or the empty helper instead.
- Added ' ' (space) as an escapable character within a quoted
string.
- Added helper expressions for common comment types, in addition
to the existing cStyleComment (/*...*/) and htmlStyleComment
(<!-- ... -->)
. dblSlashComment = // ... (to end of line)
. cppStyleComment = cStyleComment or dblSlashComment
. javaStyleComment = cppStyleComment
. pythonStyleComment = # ... (to end of line)
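One common use of these comment expressions is to have a grammar skip over comments with ignore(). The sketch below assumes a toy grammar of bare words; the sample input is made up for illustration:

```python
from pyparsing import OneOrMore, Word, alphas, cppStyleComment

words = OneOrMore(Word(alphas))

# Skip both /* ... */ and // ... comments wherever they appear.
words.ignore(cppStyleComment)

source = """alpha // trailing comment
/* block
comment */ beta gamma"""

result = words.parseString(source).asList()
```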
Version 1.3.2 - July 24, 2005
- Added Each class as an enhanced version of And. 'Each' requires
that all given expressions be present, but may occur in any order.
Special handling is provided to group ZeroOrMore and OneOrMore
elements that occur out-of-order in the input string. You can also
construct 'Each' objects by joining expressions with the '&'
operator. When using the Each class, results names are strongly
recommended for accessing the matched tokens. (Suggested by Pradam
Amini - thanks, Pradam!)
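A small sketch of the '&' operator, using results names as recommended above. The "color" and "size" grammar is a hypothetical example, not taken from the release:

```python
from pyparsing import Keyword, Word, alphas, nums

color = Keyword("color") + Word(alphas).setResultsName("color")
size = Keyword("size") + Word(nums).setResultsName("size")

# '&' builds an Each: both parts are required, in either order.
spec = color & size

first = spec.parseString("color red size 10")
second = spec.parseString("size 10 color red")
```

Because the token order differs between the two inputs, reading values by results name gives the same access code for both.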
- Stricter interpretation of 'max' qualifier on Word elements. If the
'max' attribute is specified, matching will fail if an input field
contains more than 'max' consecutive body characters. For example,
previously, Word(nums,max=3) would match the first three characters
of '0123456', returning '012' and continuing parsing at '3'. Now,
when constructed using the max attribute, Word will raise an
exception with this string.
- Cleaner handling of nested dictionaries returned by Dict. No
longer necessary to dereference sub-dictionaries as element [0] of
their parents.
(Prompted by discussion thread on the Python Tutor list, with
contributions from Danny Yoo, Kent Johnson, and original post by
Liam Clarke - thanks all!)
Version 1.3.1 - June, 2005
- Added markInputline() method to ParseException, to display the input
text line location of the parsing exception. (Thanks, Stefan Behnel!)
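A minimal sketch of markInputline(), assuming a grammar of two integers and an input whose second field is not numeric:

```python
from pyparsing import ParseException, Word, nums

two_numbers = Word(nums) + Word(nums)

marked = None
try:
    two_numbers.parseString("123 abc")
except ParseException as err:
    # Returns the offending input line with a marker inserted
    # at the error position.
    marked = err.markInputline()
```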
- Added setDefaultKeywordChars(), so that Keyword definitions using a
custom keyword character set do not all need to add the keywordChars
constructor argument (similar to setDefaultWhitespaceChars()).
(Suggested by rzhanka on the SourceForge pyparsing forum.)
- Simplified passing debug actions to setDebugAction(). You can now
pass 'None' for a debug action if you want to take the default
debug behavior. To suppress a particular debug action, you can pass
the pyparsing method nullDebugAction.
- Refactored parse exception classes, moved all behavior to
ParseBaseException, and the former ParseException is now a subclass of
ParseBaseException. Added a second subclass, ParseFatalException, as
a subclass of ParseBaseException. User-defined parse actions can raise
ParseFatalException if a data inconsistency is detected (such as a
begin-tag/end-tag mismatch), and this will stop all parsing immediately.
(Inspired by e-mail thread with Michele Petrazzo - thanks, Michele!)
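The begin-tag/end-tag check mentioned above might look like the following sketch. The simplified tag grammar and the check_tags helper are hypothetical and only illustrate raising ParseFatalException from a parse action:

```python
from pyparsing import ParseFatalException, SkipTo, Suppress, Word, alphas

open_tag = Suppress("<") + Word(alphas).setResultsName("open") + Suppress(">")
close_tag = Suppress("</") + Word(alphas).setResultsName("close") + Suppress(">")
element = open_tag + SkipTo("</").setResultsName("body") + close_tag

def check_tags(s, loc, toks):
    # A mismatch is a data inconsistency, not a reason to try alternatives.
    if toks["open"] != toks["close"]:
        raise ParseFatalException(s, loc, "mismatched tags")

element.setParseAction(check_tags)

good_body = element.parseString("<b>hi</b>")["body"]

try:
    element.parseString("<b>hi</i>")
    mismatch_detected = False
except ParseFatalException:
    mismatch_detected = True
```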
- Added helper methods makeXMLTags and makeHTMLTags, that simplify the
definition of XML or HTML tag parse expressions for a given tagname.
Both functions return a pair of parse expressions, one for the opening
tag (that is, '<tagname>') and one for the closing tag ('</tagname>').
The opening tag also recognizes any attribute definitions that have
been included in the opening tag, as well as an empty tag (one with a
trailing '/', as in '<BODY/>' which is equivalent to '<BODY></BODY>').
makeXMLTags uses stricter XML syntax for attributes, requiring that they
be enclosed in double quote characters - makeHTMLTags is more lenient,
and accepts single-quoted strings or any contiguous string of characters
up to the next whitespace character or '>' character. Attributes can
be retrieved as dictionary or attribute values of the returned results
from the opening tag.
- Added example, a refinement on that adds
an interactive session and support for variables. (Thanks, Steven Siew!)
- Added performance improvement, up to 20% reduction! (Found while working
with Wolfgang Borgert on performance tuning of his TTCN3 parser.)
- And another performance improvement, up to 25%, when using scanString!
(Found while working with Henrik Westlund on his C header file scanner.)
- Updated UML diagrams to reflect latest class/method changes.
Version 1.3 - March, 2005
- Added new Keyword class, as a special form of Literal. Keywords
must be followed by whitespace or other non-keyword characters, to
distinguish them from variables or other identifiers that just
happen to start with the same characters as a keyword. For instance,
the input string containing "ifOnlyIfOnly" will match a Literal("if")
at the beginning and in the middle, but will fail to match a
Keyword("if"). Keyword("if") will match only strings such as "if only"
or "if(only)". (Proposed by Wolfgang Borgert, and Berteun Damman
separately requested this on comp.lang.python - great idea!)
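The difference can be seen by scanning the same text with both classes. The input string below is an assumed example chosen so that "if" appears once as a prefix of an identifier and once as a standalone word:

```python
from pyparsing import Keyword, Literal

text = "ifdef and if (x)"

# Literal matches the characters wherever they occur, including inside "ifdef".
literal_starts = [start for tokens, start, end in Literal("if").scanString(text)]

# Keyword matches only where "if" is not part of a longer identifier.
keyword_starts = [start for tokens, start, end in Keyword("if").scanString(text)]
```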
- Added setWhitespaceChars() method to override the characters to be
skipped as whitespace before matching a particular ParserElement. Also
added the class-level method setDefaultWhitespaceChars(), to allow
users to override the default set of whitespace characters (space,
tab, newline, and return) for all subsequently defined ParserElements.
(Inspired by Klaas Hofstra's inquiry on the SourceForge pyparsing
forum.)
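A sketch of per-element whitespace control, assuming an input in which a newline should end the match:

```python
from pyparsing import OneOrMore, Word, alphas

text = "one two\nthree"

# Default whitespace includes newlines, so all three words are matched.
all_words = OneOrMore(Word(alphas)).parseString(text).asList()

# With only space and tab treated as skippable, matching stops at the newline.
line_word = Word(alphas).setWhitespaceChars(" \t")
first_line_words = OneOrMore(line_word).parseString(text).asList()
```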
- Added helper parse actions to support some very common parse
action use cases:
. replaceWith(replStr) - replaces the matching tokens with the
provided replStr replacement string; especially useful with
. removeQuotes - removes first and last character from string enclosed
in quotes (note - NOT the same as the string strip() method, as only
a single character is removed at each end)
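Both helpers are attached with setParseAction(). The sketch below assumes an HTML entity replacement and a double-quoted string as sample inputs:

```python
from pyparsing import Literal, quotedString, removeQuotes, replaceWith

# replaceWith: substitute a fixed string for whatever was matched.
nbsp = Literal("&nbsp;").setParseAction(replaceWith(" "))
plain_text = nbsp.transformString("a&nbsp;b&nbsp;c")

# removeQuotes: drop the opening and closing quote characters.
unquoted = quotedString.copy().setParseAction(removeQuotes)
inner_text = unquoted.parseString('"hello world"')[0]
```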
- Added copy() method to ParserElement, to make it easier to define
different parse actions for the same basic parse expression. (Note, copy
is implicitly called when using setResultsName().)
(The following changes were posted to CVS as Version 1.2.3 -
October-December, 2004)
- Added support for Unicode strings in creating grammar definitions.
(Big thanks to Gavin Panella!)
- Added constant alphas8bit to include the following 8-bit characters:
- Added srange() function to simplify definition of Word elements, using
regexp-like '[A-Za-z0-9]' syntax. This also simplifies referencing
common 8-bit characters.
- Fixed bug in Dict when a single element Dict was embedded within another
Dict. (Thanks Andy Yates for catching this one!)
- Added 'formatted' argument to ParseResults.asXML(). If set to False,
suppresses insertion of whitespace for pretty-print formatting. Default
equals True for backward compatibility.
- Added setDebugActions() function to ParserElement, to allow user-defined
debugging actions.
- Added support for escaped quotes (either in \', \", or doubled quote
form) to the predefined expressions for quoted strings. (Thanks, Ero
- Minor performance improvement (~5%) converting "char in string" tests
to "char in dict". (Suggested by Gavin Panella, cool idea!)
Version 1.2.2 - September 27, 2004
- Modified delimitedList to accept an expression as the delimiter, instead
of only accepting strings.
- Modified ParseResults, to convert integer field keys to strings (to
avoid confusion with list access).
- Modified Combine, to convert all embedded tokens to strings before
combining.
- Fixed bug in MatchFirst in which parse actions would be called for
expressions that only partially match. (Thanks, John Hunter!)
- Fixed bug in example that fixes right-associativity of ^
operator. (Thanks, Andrea Griffini!)
- Added class FollowedBy(expression), to look ahead in the input string
without consuming tokens.
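A short sketch of lookahead with FollowedBy, assuming a "label:" style input. The colon must be present for the label to match, but it is left in the input for the next expression:

```python
from pyparsing import FollowedBy, ParseException, Word, alphas

label = Word(alphas) + FollowedBy(":")

# The lookahead contributes no tokens of its own.
label_only = label.parseString("name: value").asList()

# The ':' is still available to the expressions that follow.
full = (label + ":" + Word(alphas)).parseString("name: value").asList()

try:
    label.parseString("name value")
    rejected = False
except ParseException:
    rejected = True
```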
- Added class NoMatch that never matches any input. Can be useful in
debugging, and in very specialized grammars.
- Added example, for parsing chess game files stored in Portable
Game Notation. (Thanks, Alberto Santini!)
Version 1.2.1 - August 19, 2004
- Added SkipTo(expression) token type, simplifying grammars that only
want to specify delimiting expressions, and want to match any characters
between them.
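For example, a grammar that only cares about the delimiters of an HTML-style comment could be sketched like this (the delimiters and sample text are assumptions for illustration):

```python
from pyparsing import Literal, SkipTo

comment = (
    Literal("<!--")
    + SkipTo("-->").setResultsName("text")
    + Literal("-->")
)

result = comment.parseString("<!-- note to self -->")

# SkipTo returns everything up to the closing delimiter, so strip
# any whitespace that precedes it.
comment_text = result["text"].strip()
```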
- Added helper method dictOf(key,value), making it easier to work with
the Dict class. (Inspired by Pavel Volkovitskiy, thanks!).
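A minimal sketch of dictOf, assuming simple "name=number" settings separated by whitespace:

```python
from pyparsing import Suppress, Word, alphas, dictOf, nums

# Each key is a word; each value is the number after a suppressed '='.
settings = dictOf(Word(alphas), Suppress("=") + Word(nums))

result = settings.parseString("width=80 height=24")
width = result["width"]
height = result["height"]
```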
- Added optional argument listAllMatches (default=False) to
setResultsName(). Setting listAllMatches to True overrides the default
modal setting of tokens to results names; instead, the results name
acts as an accumulator for all matching tokens within the local
repetition group. (Suggested by Amaury Le Leyzour - thanks!)
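The effect of listAllMatches can be sketched by parsing the same input with and without it (the results name "num" is an arbitrary choice):

```python
from pyparsing import OneOrMore, Word, nums

# Default behavior: the results name keeps only the last matching token.
last_only = OneOrMore(Word(nums).setResultsName("num"))
last_value = last_only.parseString("1 2 3")["num"]

# listAllMatches=True: the results name accumulates every matching token.
accumulate = OneOrMore(Word(nums).setResultsName("num", listAllMatches=True))
all_values = accumulate.parseString("1 2 3")["num"].asList()
```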
- Fixed bug in ParseResults, throwing exception when trying to extract
slice, or make a copy using [:]. (Thanks, Wilson Fowlie!)
- Fixed bug in transformString() when the input string contains <TAB>'s
(Thanks, Rick Walia!).
- Fixed bug in returning tokens from un-Grouped And's, Or's and
MatchFirst's, where too many tokens would be included in the results,
confounding parse actions and returned results.
- Fixed bug in naming ParseResults returned by And's, Or's, and
MatchFirst's.
- Fixed bug in LineEnd() - matching this token now correctly consumes
and returns the end of line "\n".
- Added a beautiful example for parsing Mozilla calendar files (Thanks,
Petri Savolainen!).
- Added support for dynamically modifying Forward expressions during
parsing.
Version 1.2 - 20 June 2004
- Added definition for htmlComment to help support HTML scanning and
parsing.
- Fixed bug in generating XML for Dict classes, in which trailing item was
duplicated in the output XML.
- Fixed release bug in which was omitted from release
- Fixed bug in transformString() when parse actions are not defined on the
outermost parser element.
- Added example, as another example of using scanString
and parse actions.
Version 1.2beta3 - 4 June 2004
- Added White() token type, analogous to Word, to match on whitespace
characters. Use White in parsers with significant whitespace (such as
configuration file parsers that use indentation to indicate grouping).
Construct White with a string containing the whitespace characters to be
matched. Similar to Word, White also takes optional min, max, and exact
arguments.
- As part of supporting whitespace-significant parsing, added parseWithTabs()
method to ParserElement, to override the default behavior in parseString
of automatically expanding tabs to spaces. To retain tabs during
parsing, call parseWithTabs() before calling parseString(), parseFile() or
scanString(). (Thanks, Jean-Guillaume Paradis for catching this, and for
your suggestions on whitespace-significant parsing.)
- Added transformString() method to ParserElement, as a complement to
scanString(). To use transformString, define a grammar and attach a parse
action to the overall grammar that modifies the returned token list.
Invoking transformString() on a target string will then scan for matches,
and replace the matched text patterns according to the logic in the parse
action. transformString() returns the resulting transformed string.
(Note: transformString() does *not* automatically expand tabs to spaces.)
Also added to the examples directory to show sample uses of
scanString() and transformString().
- Removed group() method that was introduced in beta2. This turns out NOT to
be equivalent to nesting within a Group() object, and I'd prefer not to sow
more seeds of confusion.
- Fixed behavior of asXML() where tags for groups were incorrectly duplicated.
(Thanks, Brad Clements!)
- Changed beta version message to display to stderr instead of stdout, to
make asXML() easier to use. (Thanks again, Brad.)
Version 1.2beta2 - 19 May 2004
- *** SIMPLIFIED API *** - Parse actions that do not modify the list of tokens
no longer need to return a value. This simplifies those parse actions that
use the list of tokens to update a counter or record or display some of the
token content; these parse actions can simply end without having to specify
'return toks'.
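A sketch of a parse action that only records what it sees and returns nothing (the record function and the seen list are hypothetical names):

```python
from pyparsing import OneOrMore, Word, alphas

seen = []

def record(tokens):
    # Side effect only; with no return value the tokens pass through unchanged.
    seen.append(tokens[0])

word = Word(alphas).setParseAction(record)
result = OneOrMore(word).parseString("a b c").asList()
```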
- *** POSSIBLE API INCOMPATIBILITY *** - Fixed CaselessLiteral bug, where the
returned token text was not the original string (as stated in the docs),
but the original string converted to upper case. (Thanks, Dang Griffith!)
**NOTE: this may break some code that relied on this erroneous behavior.
Users should scan their code for uses of CaselessLiteral.**
- *** POSSIBLE CODE INCOMPATIBILITY *** - I have renamed the internal
attributes on ParseResults from 'dict' and 'list' to '__tokdict' and
'__toklist', to avoid collisions with user-defined data fields named 'dict'
and 'list'. Any client code that accesses these attributes directly will
need to be modified. Hopefully the implementation of methods such as keys(),
items(), len(), etc. on ParseResults will make such direct attribute
accesses unnecessary.
- Added asXML() method to ParseResults. This greatly simplifies the process
of parsing an input data file and generating XML-structured data.
- Added getName() method to ParseResults. This method is helpful when
a grammar specifies ZeroOrMore or OneOrMore of a MatchFirst or Or
expression, and the parsing code needs to know which expression matched.
(Thanks, Eric van der Vlist, for this idea!)
- Added items() and values() methods to ParseResults, to better support using
ParseResults as a Dictionary.
- Added parseFile() as a convenience function to parse the contents of an
entire text file. Accepts either a file name or a file object. (Thanks
again, Dang!)
- Added group() method to And, Or, and MatchFirst, as a short-cut alternative
to enclosing a construct inside a Group object.
- Extended to support exponentiation, and simple built-in functions.
- Added EBNF parser to examples, including a demo where it parses its own
EBNF! (Thanks to Seo Sanghyeon!)
- Added Delphi Form parser to examples, plus a couple of
sample Delphi forms as tests. (Well done, Dang!)
- Another performance speedup, 5-10%, inspired by Dang! Plus about a 20%
speedup, by pre-constructing and caching exception objects instead of
constructing them on the fly.
- Fixed minor bug when specifying oneOf() with 'caseless=True'.
- Cleaned up and added a few more docstrings, to improve the generated docs.
Version 1.1.2 - 21 Mar 2004
- Fixed minor bug in scanString(), so that start location is at the start of
the matched tokens, not at the start of the whitespace before the matched
tokens.
- Inclusion of HTML documentation, generated using Epydoc. Reformatted some
doc strings to better generate readable docs. (Beautiful work, Ed Loper,
thanks for Epydoc!)
- Minor performance speedup, 5-15%
- And on a process note, I've used the unittest module to define a series of
unit tests, to help avoid the embarrassment of the version 1.1 snafu.
Version 1.1.1 - 6 Mar 2004
- Fixed critical bug introduced in 1.1, which broke MatchFirst(!) token
matching.
- Added "from __future__ import generators" to permit running under
pre-Python 2.3.
- Added example, showing how to use pyparsing to extract
a text pattern from the HTML of a web page.
Version 1.1 - 3 Mar 2004
- ***Changed API*** - While testing out parse actions, I found that the value
of loc passed in was not the starting location of the matched tokens, but
the location of the next token in the list. With this version, the location
passed to the parse action is now the starting location of the tokens that
matched.
A second part of this change is that the return value of parse actions no
longer needs to return a tuple containing both the location and the parsed
tokens (which may optionally be modified); parse actions only need to return
the list of tokens. Parse actions that return a tuple are deprecated; they
will still work properly for conversion/compatibility, but this behavior will
be removed in a future version.
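The location semantics can be checked with a parse action that records loc. This sketch uses the three-argument form (input string, location, tokens) and a made-up input with uneven spacing:

```python
from pyparsing import OneOrMore, Word, alphas

starts = []

def remember_start(s, loc, toks):
    # loc is the index in s where the matched tokens begin.
    starts.append(loc)

word = Word(alphas).setParseAction(remember_start)
tokens = OneOrMore(word).parseString("ab  cd e").asList()
```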
- Added validate() method, to help diagnose infinite recursion in a grammar tree.
validate() is not 100% fool-proof, but it can help track down nasty infinite
looping due to recursively referencing the same grammar construct without some
intervening characters.
- Cleaned up default listing of some parse element types, to more closely match
ordinary BNF. Instead of the form <classname>:[contents-list], some changes
are:
. And(token1,token2,token3) is "{ token1 token2 token3 }"
. Or(token1,token2,token3) is "{ token1 ^ token2 ^ token3 }"
. MatchFirst(token1,token2,token3) is "{ token1 | token2 | token3 }"
. Optional(token) is "[ token ]"
. OneOrMore(token) is "{ token }..."
. ZeroOrMore(token) is "[ token ]..."
- Fixed an infinite loop in oneOf if the input string contains a duplicated
option. (Thanks Brad Clements)
- Fixed a bug when specifying a results name on an Optional token. (Thanks
again, Brad Clements)
- Fixed a bug introduced in 1.0.6 when I converted quotedString to use
CharsNotIn; I accidentally permitted quoted strings to span newlines. I have
fixed this in this version to go back to the original behavior, in which
quoted strings do *not* span newlines.
- Fixed minor bug in HTTP server log parser. (Thanks Jim Richardson)
Version 1.0.6 - 13 Feb 2004
- Added CharsNotIn class (Thanks, Lee SangYeong). This is the opposite of
Word, in that it is constructed with a set of characters *not* to be matched.
(This enhancement also allowed me to clean up and simplify some of the
definitions for quoted strings, cStyleComment, and restOfLine.)
- **MINOR API CHANGE** - Added joinString argument to the __init__ method of
Combine (Thanks, Thomas Kalka). joinString defaults to "", but some
applications might choose some other string to use instead, such as a blank
or newline. joinString was inserted as the second argument to __init__,
so if you have code that specifies an adjacent value, without using
'adjacent=', this code will break.
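Passing both arguments by keyword avoids the positional change described above. A sketch, with assumed sample inputs:

```python
from pyparsing import Combine, Word, nums

# Default: adjacent tokens are joined with an empty string.
real_number = Combine(Word(nums) + "." + Word(nums))
real_text = real_number.parseString("3.14")[0]

# Keyword arguments keep the call independent of parameter order.
phone = Combine(Word(nums) + Word(nums), joinString="-", adjacent=False)
phone_text = phone.parseString("555 1234")[0]
```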
- Modified LineStart to recognize the start of an empty line.
- Added optional caseless flag to oneOf(), to create a list of CaselessLiteral
tokens instead of Literal tokens.
- Added some enhancements to the SQL example:
. Oracle-style comments (Thanks to Harald Armin Massa)
. simple WHERE clause
- Minor performance speedup - 5-15%
Version 1.0.5 - 19 Jan 2004
- Added scanString() generator method to ParserElement, to support regex-like
pattern searching
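A sketch of scanString() used to search free text; each iteration yields the matched tokens together with the start and end locations of the match. The sample sentence is made up:

```python
from pyparsing import Word, nums

text = "order 66 shipped 12 items on day 3"

# Collect every run of digits along with where it was found.
matches = [
    (tokens[0], start, end)
    for tokens, start, end in Word(nums).scanString(text)
]
numbers = [m[0] for m in matches]
```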
- Added items() list to ParseResults, to return named results as a
list of (key,value) pairs
- Fixed memory overflow in asList() for deeply nested ParseResults (Thanks,
Sverrir Valgeirsson)
- Minor performance speedup - 10-15%
Version 1.0.4 - 8 Jan 2004
- Added positional tokens StringStart, StringEnd, LineStart, and LineEnd
- Added commaSeparatedList to pre-defined global token definitions; also added to the examples directory, to demonstrate the differences between
parsing comma-separated data and simple line-splitting at commas
- Minor API change: delimitedList does not automatically enclose the
list elements in a Group, but makes this the responsibility of the caller;
also, if invoked using 'combine=True', the list delimiters are also included
in the returned text (good for scoped variables, such as a.b.c or a::b::c, or
for directory paths such as a/b/c)
- Performance speed-up again, 30-40%
- Added to examples directory, as this is
a common parsing task
Version 1.0.3 - 23 Dec 2003
- Performance speed-up again, 20-40%
- Added Python distutils installation, etc. (thanks, Dave Kuhlman)
Version 1.0.2 - 18 Dec 2003
- **NOTE: Changed API again!!!** (for the last time, I hope)
+ Renamed module from parsing to pyparsing, to better reflect Python
- Also added to examples directory, to illustrate
usage of the Dict class.
Version 1.0.1 - 17 Dec 2003
- **NOTE: Changed API!**
+ Renamed 'len' argument on Word.__init__() to 'exact'
- Performance speed-up, 10-30%
Version 1.0.0 - 15 Dec 2003
- Initial public release
Version 0.1.1 thru 0.1.17 - October-November, 2003
- initial development iterations:
- added Dict, Group
- added helper methods oneOf, delimitedList
- added helpers quotedString (and double and single), restOfLine, cStyleComment
- added MatchFirst as an alternative to the slower Or
- added UML class diagram
- fixed various logic bugs