# Learning from:

Getting Started with Pyparsing by Paul McGuire Publisher: O'Reilly Media 
http://shop.oreilly.com/product/9780596514235.do


In [212]:
from pyparsing import *

In [213]:
import random

### "Hello World ! on Steroids" 
page 9. 

The task is to write a parser for these strings:

Hello, World! <br>
Hi, Mom! <br>
Good morning, Miss Crabtree!   
Yo, Adrian!   
Whattup, G? <br>
How's it goin', Dude? <br>
Hey, Jude! <br>
Goodbye, Mr. Chips! <br>




Defining the input values as a list of strings:

In [214]:
tests=['   Hello, World!', 'Hi, Mom!', 
      'Good morning, Miss Crabtree!',
      'Yo, Adrian!',
      'Whattup, G?',
      'Hey, Jude!',
      'Goodbye, Mr. Chips!',
      'How\'s it going\', Dude?']

Printing the input values to check:

In [215]:
print(tests)

['   Hello, World!', 'Hi, Mom!', 'Good morning, Miss Crabtree!', 'Yo, Adrian!', 'Whattup, G?', 'Hey, Jude!', 'Goodbye, Mr. Chips!', "How's it going', Dude?"]


"The first step is to identify the pattern that they all follow" <br>

Writing this pattern as a BNF:

greeting ::= salutation comma greetee endpunc <br>
salutation ::= word+ <br>
comma ::= , <br>
greetee ::= word+ <br>
word ::= a collection of one or more characters, which are any alpha or ' or . <br>
endpunc ::= ! | ? <br>




In [216]:
word = Word(alphas+"'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc



The greeting variable holds the 'formula' for the parse: it is an object that can perform the parse and other operations. Parsing the third element of the list (index 2, since Python lists start at index 0):

In [217]:
greeting.parseString(tests[2])

(['Good', 'morning', ',', 'Miss', 'Crabtree', '!'], {})
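
The pair printed above is the repr of a pyparsing ParseResults object: the matched tokens, followed by a dictionary of named results (empty here, because this grammar defines no results names). A minimal sketch, rebuilding the same grammar so that it runs on its own, of how such a result can be handled like a list:

```python
from pyparsing import Word, OneOrMore, Literal, oneOf, alphas

# Same grammar as in the cell above, repeated so the example is self-contained
word = Word(alphas + "'.")
greeting = OneOrMore(word) + Literal(",") + OneOrMore(word) + oneOf("! ?")

result = greeting.parseString("Good morning, Miss Crabtree!")

tokens = result.asList()   # plain Python list of the matched tokens
first = result[0]          # ParseResults supports indexing ...
count = len(result)        # ... and len(), like a list

print(tokens, first, count)
```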

Parsing every item in the list:

In [218]:
for t in tests:
    view = greeting.parseString(t)
    print(view)

['Hello', ',', 'World', '!']
['Hi', ',', 'Mom', '!']
['Good', 'morning', ',', 'Miss', 'Crabtree', '!']
['Yo', ',', 'Adrian', '!']
['Whattup', ',', 'G', '?']
['Hey', ',', 'Jude', '!']
['Goodbye', ',', 'Mr.', 'Chips', '!']
["How's", 'it', "going'", ',', 'Dude', '?']


"to identify the tokens that compose the initial part of the greeting--the salutation--we need to iterate over the results until we reach the comma token:"

In [219]:
for t in tests:
    results = greeting.parseString(t)
    salutation = []
    for token in results:
        if token == ",": break
        salutation.append(token)
    print(salutation) 
        
    

['Hello']
['Hi']
['Good', 'morning']
['Yo']
['Whattup']
['Hey']
['Goodbye']
["How's", 'it', "going'"]


"Since we know that the salutation and greetee parts of the greeting are logical groups, we can use pyparsing's Group class to give more structure to the returned results. By changing the definitions of salutation and greetee to:"   (not so clear)

In [220]:
salutation = Group( OneOrMore(word))

In [221]:
print(salutation)

Group:({W:(ABCD...)}...)


In [222]:
greetee = Group( OneOrMore(word) )

In [223]:
print(greetee)

Group:({W:(ABCD...)}...)


The results do not look like those in the book's example: the tokens are still not grouped.

In [224]:
for t in tests:
    view = greeting.parseString(t)
    print(view)

['Hello', ',', 'World', '!']
['Hi', ',', 'Mom', '!']
['Good', 'morning', ',', 'Miss', 'Crabtree', '!']
['Yo', ',', 'Adrian', '!']
['Whattup', ',', 'G', '?']
['Hey', ',', 'Jude', '!']
['Goodbye', ',', 'Mr.', 'Chips', '!']
["How's", 'it', "going'", ',', 'Dude', '?']


Perhaps the whole grammar has to be declared again, this time with the Group class:

In [225]:
word = Word(alphas+"'.")
salutation = Group(OneOrMore(word))
comma = Literal(",")
greetee = Group(OneOrMore(word))
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

In [226]:
for t in tests:
    view = greeting.parseString(t)
    print(view)

[['Hello'], ',', ['World'], '!']
[['Hi'], ',', ['Mom'], '!']
[['Good', 'morning'], ',', ['Miss', 'Crabtree'], '!']
[['Yo'], ',', ['Adrian'], '!']
[['Whattup'], ',', ['G'], '?']
[['Hey'], ',', ['Jude'], '!']
[['Goodbye'], ',', ['Mr.', 'Chips'], '!']
[["How's", 'it', "going'"], ',', ['Dude'], '?']


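Alright, the whole grammar had to be declared again. Assigning new Group expressions to salutation and greetee does not change the greeting object that was already built from the old ones, so greeting has to be rebuilt as well. Redeclaring only the greeting formula can still fail later on, see [errorJustGreeting](#errorJustGreeting) <!-- How to reference another cell http://stackoverflow.com/a/28080529/7896359 @Amit -->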
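
A minimal sketch of why the grouped results only appear once greeting is rebuilt: a pyparsing expression keeps references to the objects it was composed from, not to the variable names. The variables `flat` and `nested` are names introduced here only for illustration.

```python
from pyparsing import Word, OneOrMore, Literal, Group, oneOf, alphas

word = Word(alphas + "'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

# Rebinding the names creates new objects; greeting still holds the old ones
salutation = Group(OneOrMore(word))
greetee = Group(OneOrMore(word))
flat = greeting.parseString("Hi, Mom!").asList()

# Rebuilding greeting picks up the grouped expressions
greeting = salutation + comma + greetee + endpunc
nested = greeting.parseString("Hi, Mom!").asList()

print(flat)
print(nested)
```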

Using list-to-variable assignment to access the different parts:

In [227]:
for t in tests:
    salutation, dummy, greetee, endpunc = greeting.parseString(t)
    print(salutation, greetee, endpunc)

['Hello'] ['World'] !
['Hi'] ['Mom'] !
['Good', 'morning'] ['Miss', 'Crabtree'] !
['Yo'] ['Adrian'] !
['Whattup'] ['G'] ?
['Hey'] ['Jude'] !
['Goodbye'] ['Mr.', 'Chips'] !
["How's", 'it', "going'"] ['Dude'] ?


"The comma is a very important element during parsing, since it shows where the parser stops reading the salutation and starts the greetee. But in the returned results, the comma is not really very interesting at all, and it would be nice to supress it from the returned results. You can do this by wrapping the definition of comma in a pyparsing Supress instance:"

In [228]:
#comma = Suppress( Literal(",")) # or
comma = Literal(",").suppress() #or
#comma = Suppress(",") # the three are equivalent
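
A quick check, as a sketch, that the three spellings behave the same way. The two-word test expression and the input string are made up for this example.

```python
from pyparsing import Literal, Suppress, Word, alphas

# The three spellings from the cell above
comma_forms = [
    Suppress(Literal(",")),
    Literal(",").suppress(),
    Suppress(","),
]

outputs = []
for comma in comma_forms:
    pair = Word(alphas) + comma + Word(alphas)
    outputs.append(pair.parseString("Yo, Adrian").asList())

# Each form matches the comma but leaves it out of the results
print(outputs)
```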

<a id='errorJustGreeting'></a>
Looking at the results again, now with the comma suppressed, which requires declaring the greeting formula again:

In [229]:
#greeting = salutation + comma + greetee + endpunc

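It seems important to declare everything again. A likely reason: the unpacking loop in In [227] rebound salutation, greetee and endpunc to parsed tokens, so they are no longer grammar expressions and greeting cannot be rebuilt from them.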
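
One plausible cause of the problem, sketched under the assumption that the cells ran in the order shown: unpacking a parse result into variables named salutation and greetee overwrites the grammar expressions of the same name. The names `salutation_tokens`, `greetee_tokens` and `punct` are hypothetical, chosen here to avoid the clash.

```python
from pyparsing import (Word, OneOrMore, Literal, Group, oneOf, alphas,
                       ParserElement, ParseResults)

word = Word(alphas + "'.")
salutation = Group(OneOrMore(word))
comma = Literal(",").suppress()
greetee = Group(OneOrMore(word))
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

was_parser = isinstance(salutation, ParserElement)   # a grammar expression

# Unpacking into the same names replaces the expressions with parsed tokens
salutation, greetee, endpunc = greeting.parseString("Hi, Mom!")
is_parser = isinstance(salutation, ParserElement)    # no longer an expression
is_results = isinstance(salutation, ParseResults)    # just the grouped tokens

# Separate names for the results leave the grammar variables untouched
salutation_tokens, greetee_tokens, punct = greeting.parseString("Yo, Adrian!")
print(was_parser, is_parser, is_results, salutation_tokens.asList())
```

The already-built greeting keeps working because it holds its own references to the original expressions; only a later attempt to rebuild it from the overwritten names is affected.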

In [230]:
word = Word(alphas+"'.")
salutation = Group(OneOrMore(word))
comma = Literal(",").suppress()
greetee = Group(OneOrMore(word))
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

In [231]:
for t in tests:
    view = greeting.parseString(t)
    print(view)

[['Hello'], ['World'], '!']
[['Hi'], ['Mom'], '!']
[['Good', 'morning'], ['Miss', 'Crabtree'], '!']
[['Yo'], ['Adrian'], '!']
[['Whattup'], ['G'], '?']
[['Hey'], ['Jude'], '!']
[['Goodbye'], ['Mr.', 'Chips'], '!']
[["How's", 'it', "going'"], ['Dude'], '?']


"Now that we have a decent parser and a good way to get out the results, we can start to have fun with the test data. First, let's accumulate the salutations and greetees into lists of their own:"

In [232]:
salutes=[]

In [233]:
greetees = []

In [234]:
for t in tests:
    salutation, greetee, endpunc = greeting.parseString(t)
    salutes.append( (" ".join(salutation), endpunc) )
    greetees.append( " ".join(greetee) )

Checking what is in salutes:

In [235]:
print(salutes)

[('Hello', '!'), ('Hi', '!'), ('Good morning', '!'), ('Yo', '!'), ('Whattup', '?'), ('Hey', '!'), ('Goodbye', '!'), ("How's it going'", '?')]


In [236]:
print(salutes[2])

('Good morning', '!')


And what is in greetees:

In [237]:
print(greetees)

['World', 'Mom', 'Miss Crabtree', 'Adrian', 'G', 'Jude', 'Mr. Chips', 'Dude']


"Now that we have collected these assorted names and salutations, we can use them to contrive some additional, never-before-seen greetings and introductions."

In [238]:
for i in range(50):
    salute = random.choice( salutes )
    greetee = random.choice( greetees )
    print("%s, %s%s" % ( salute[0], greetee, salute[1] ))

Hi, Miss Crabtree!
Goodbye, Dude!
How's it going', G?
Goodbye, Dude!
Yo, World!
Hi, Miss Crabtree!
Hey, Miss Crabtree!
Yo, Jude!
Good morning, Mr. Chips!
Yo, Mr. Chips!
Hi, World!
How's it going', Dude?
Good morning, World!
Goodbye, World!
Hello, Miss Crabtree!
Hello, Mr. Chips!
Goodbye, Jude!
Goodbye, Jude!
Good morning, Jude!
Yo, Adrian!
Hello, G!
How's it going', Adrian?
Whattup, World?
Goodbye, G!
Hi, Mom!
Yo, Dude!
Whattup, Dude?
Goodbye, Miss Crabtree!
Hi, Mom!
How's it going', G?
Yo, Miss Crabtree!
Good morning, Jude!
Hi, G!
Whattup, Dude?
How's it going', G?
Hey, Mr. Chips!
How's it going', Mr. Chips?
Hey, Mr. Chips!
Goodbye, World!
Hey, Mom!
Hi, Mom!
Goodbye, Adrian!
Hey, Adrian!
Hi, World!
Hey, Mom!
Hello, Jude!
Good morning, Adrian!
Yo, World!
Hey, Dude!
Whattup, Adrian?


"We can also simulate some introductions with the following code:"

In [239]:
for i in range(50):
    print('%s, say "%s" to %s.' % (random.choice( greetees ),
                                   "".join( random.choice( salutes ) ),
                                  random.choice( greetees ) ) )

Jude, say "Hello!" to Miss Crabtree.
World, say "Whattup?" to Mom.
G, say "Hello!" to Dude.
Miss Crabtree, say "Hello!" to Mom.
Mr. Chips, say "Whattup?" to Dude.
Jude, say "Whattup?" to Dude.
Miss Crabtree, say "Whattup?" to Mom.
World, say "Goodbye!" to G.
Jude, say "Goodbye!" to Dude.
Dude, say "Hey!" to Miss Crabtree.
Jude, say "Whattup?" to Miss Crabtree.
Jude, say "Yo!" to Mom.
Miss Crabtree, say "How's it going'?" to Adrian.
World, say "Hello!" to G.
Mr. Chips, say "Hey!" to G.
World, say "Hey!" to Mr. Chips.
Dude, say "Goodbye!" to World.
Mom, say "Hello!" to Miss Crabtree.
Adrian, say "Goodbye!" to World.
Mom, say "Whattup?" to Miss Crabtree.
Mr. Chips, say "Goodbye!" to Dude.
Mom, say "Hey!" to Mom.
Adrian, say "How's it going'?" to World.
Mom, say "How's it going'?" to Dude.
Jude, say "Yo!" to G.
Mr. Chips, say "Good morning!" to Miss Crabtree.
G, say "Goodbye!" to Jude.
World, say "Goodbye!" to Dude.
Mr. Chips, say "Hey!" to Adrian.
Adrian, say "Goodbye!" to Adrian.
Jude, s

### Whitespace markers clutter and distract from the grammar definition 

In [240]:
test = 'a(1,2,def,5)'

In [241]:
print(test)

a(1,2,def,5)


In [242]:
whitespace = Word(alphas)+"("+Group( Optional(delimitedList(Word(nums)|Word(alphas)))) + ")"

In [243]:
whitespace.parseString(test)

(['a', '(', (['1', '2', 'def', '5'], {}), ')'], {})

In [244]:
test2 = 'abc(1, 2,def, 5)'

In [245]:
whitespace.parseString(test2)

(['abc', '(', (['1', '2', 'def', '5'], {}), ')'], {})

In [246]:
test3 = 'abc(a,def,def,130)'


In [247]:
whitespace.parseString(test3)

(['abc', '(', (['a', 'def', 'def', '130'], {}), ')'], {})

In [248]:
view = whitespace.parseString(test3)

In [249]:
print(view)

['abc', '(', ['a', 'def', 'def', '130'], ')']


In [250]:
type(view)

pyparsing.ParseResults

"You can see that the function arguments have been collected into their own sublist, making the extraction of hte function arguments easier during post-parsing analysis. If grammar definition includes results names, specific fields can be accessed by name instead of by error-prone list indexing.

These higher-level access techniques are crucial to making sense of the results from a complex grammar"
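
A small sketch of the results names mentioned in the quote, applied to the same function-call grammar. The names "name" and "args" are hypothetical labels chosen for this example.

```python
from pyparsing import Word, Group, Optional, delimitedList, alphas, nums

# "name" and "args" are hypothetical results names chosen for this sketch
func_name = Word(alphas).setResultsName("name")
func_args = Group(
    Optional(delimitedList(Word(nums) | Word(alphas)))
).setResultsName("args")
call = func_name + "(" + func_args + ")"

result = call.parseString("abc(1, 2,def, 5)")

# Fields are looked up by name instead of by position
name = result["name"]
args = result["args"].asList()
print(name, args)
```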



### Parsing Data from a Table --Using Parse Actions and ParseResults

"As our first example, let's look at a simple set of scores for college football games that might be given in a datafile"

09/04/2004 Virginia		 44  Temple		          14<br>
09/04/2004 LSU			 22 Oregon State 	      21<br>
09/09/2004 Troy State	 24  Missouri             14<br>
01/02/2003 Florida State 103  University of Miami  2<br>		


In [251]:
tests="""\
      09/04/2004 Virginia   44  Temple    14
09/04/2004 LSU         22 Oregon State  21
09/09/2004 Troy State    24  Missouri  14
01/02/2003 Florida State    103  University of Miami 2
""".splitlines()


In [252]:
print(tests)

['      09/04/2004 Virginia   44  Temple    14', '09/04/2004 LSU         22 Oregon State  21', '09/09/2004 Troy State    24  Missouri  14', '01/02/2003 Florida State    103  University of Miami 2']


"Our BNF for this data is simple and clean"

digit      ::= '0'..'9'<br>
alpha      ::= 'A'..'Z' 'a'..'z'<br>
date       ::= digit+ '/' digit+ '/' digit+<br>
schoolName ::= (alpha+ )+ <br>
score      ::= digit+ <br>
schoolAndScore ::= schoolName score <br>
gameResult ::= date schoolAndScore schoolAndScore <br>

In [253]:
#nums and alphas are already defined by pyparsing
num = Word(nums)
date = num + "/" + num + "/" + num
schoolName = OneOrMore( Word(alphas) )
#"Notice that you can compose pyparsing expression using the + operator
#to combine pyparsing expressions and string literals. Using these 
#basic elements, we can finish the grammar by combining them into larger
#expressions:"
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

In [254]:
for test in tests:
    stats = gameResult.parseString(test)
    print(stats.asList())


['09', '/', '04', '/', '2004', 'Virginia', '44', 'Temple', '14']
['09', '/', '04', '/', '2004', 'LSU', '22', 'Oregon', 'State', '21']
['09', '/', '09', '/', '2004', 'Troy', 'State', '24', 'Missouri', '14']
['01', '/', '02', '/', '2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']


Leading spaces in the lines of the tests variable are not a problem:
by default pyparsing skips whitespace before each token, which is why
the first line above, which starts with several spaces, parses without
error even though gameResult does not contain a first space in its formula.
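
Whether leading whitespace matters can be checked directly. The sketch below parses the same game line with and without leading spaces, relying on pyparsing's default behavior of skipping whitespace before each token.

```python
from pyparsing import Word, OneOrMore, alphas, nums

num = Word(nums)
date = num + "/" + num + "/" + num
schoolName = OneOrMore(Word(alphas))
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

indented = gameResult.parseString("      09/04/2004 Virginia   44  Temple    14").asList()
flush = gameResult.parseString("09/04/2004 Virginia   44  Temple    14").asList()

# Both inputs produce the same tokens
print(indented == flush)
print(indented)
```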

"The first change we'll make is to combine the tokens returned by date into a single MM/DD/YYY date string. The pyparsing Combine does this for us by simply wrapping the composed expression:"

In [255]:
date = Combine( num + "/" + num + "/" + num )

In [256]:
num = Word(nums)
date = Combine( num + "/" + num + "/" + num )
schoolName = OneOrMore( Word(alphas) )
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

In [257]:
for test in tests:
    stats = gameResult.parseString(test)
    print(stats.asList())

['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Oregon', 'State', '21']
['09/09/2004', 'Troy', 'State', '24', 'Missouri', '14']
['01/02/2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']


"Combine actually perfoms two tasks for us. In addition to concatenating the matched tokens into a single string, it also enforces that the tokensare are adjacent in the incoming text"

"The next change to make will be to combine the school names, too. Because Combine's default behavior requires that the tokens be adjacent, we will not use it, since some of the school names have embedded spaces. Instead we'll define a routine to be run at parse time to join and return the tokens as a single string. As mentioned previously, such routines are referred to in pyparsing as parse actions, and they can perform a variety of functions during the parsing process."

In [258]:
schoolName.setParseAction( lambda tokens: " ".join(tokens) )

{W:(ABCD...)}...

In [259]:
num = Word(nums)
date = Combine( num + "/" + num + "/" + num )
schoolName.setParseAction( lambda tokens: " ".join(tokens) )
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

In [260]:
for test in tests:
    stats = gameResult.parseString(test)
    print(stats)

['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Oregon State', '21']
['09/09/2004', 'Troy State', '24', 'Missouri', '14']
['01/02/2003', 'Florida State', '103', 'University of Miami', '2']
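
Parse actions can also convert tokens, not only join them. As a sketch under that idea, the grammar below attaches a second parse action that turns each score into an int. The conversion is an addition for illustration and is not part of the cells above.

```python
from pyparsing import Word, OneOrMore, Combine, alphas, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)

# Join the words of a school name into one string
schoolName = OneOrMore(Word(alphas))
schoolName.setParseAction(lambda tokens: " ".join(tokens))

# Convert the matched digits to an integer at parse time
score = Word(nums)
score.setParseAction(lambda tokens: int(tokens[0]))

schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

stats = gameResult.parseString("09/04/2004 LSU         22 Oregon State  21").asList()
print(stats)
```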


In [261]:
print(stats.asList())

['01/02/2003', 'Florida State', '103', 'University of Miami', '2']


In [262]:
lineDemon = "     1    1   H    1s        0.0000     0.0000     0.0000    -0.0003     0.0000"

In [286]:
lineDemon1 = "     1    1   H    1s        0.0000     0.0000        -0.0003     -122.0000"

In [283]:
num = Word(nums).suppress()
atomInInput = Word(nums)
atomSymbol = Word(alphas)
orbitalSymbol = Word(alphanums)
# Word takes a set of allowed characters, so nums + "." covers digits and the decimal point
orbitalValues = OneOrMore(Combine(Optional("-") + Word(nums + ".")))
lineOrbitalInfo = num + atomInInput + atomSymbol + orbitalSymbol + orbitalValues
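
A self-contained sketch of this grammar applied to the lineDemon1 string. It assumes every orbital value is written as digits, a decimal point, and digits, and uses a stricter `real` expression than the cell above. The names `real` and `line_parser` are hypothetical.

```python
from pyparsing import Word, Combine, Optional, OneOrMore, alphas, alphanums, nums

# Assumed number format: optional minus sign, digits, ".", digits
real = Combine(Optional("-") + Word(nums) + "." + Word(nums))

line_parser = (
    Word(nums).suppress()   # leading counter, dropped from the results
    + Word(nums)            # atom number in the input
    + Word(alphas)          # atom symbol, e.g. H
    + Word(alphanums)       # orbital symbol, e.g. 1s
    + OneOrMore(real)       # orbital values
)

line = "     1    1   H    1s        0.0000     0.0000        -0.0003     -122.0000"
parsed = line_parser.parseString(line).asList()
print(parsed)
```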

In [288]:
viewInfoOrbital88

[['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '-0.0003', '-122.0000']]

In [287]:
viewInfoOrbital88.append(lineOrbitalInfo.parseString(lineDemon1).asList())
print(viewInfoOrbital88)

[['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'], ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'], ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'], ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'], ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'], ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'], ['1', 'H', '1s', '0.0000'], ['1', 'H', '1s', '0.0000', '0.0000', '-0.0003', '-122.0000']]


In [289]:
viewInfoOrbital88

[['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000'],
 ['1', 'H', '1s', '0.0000'],
 ['1', 'H', '1s', '0.0000', '0.0000', '-0.0003', '-122.0000']]

In [291]:
#sum(1 for x in a if isinstance(x, list))
#from: https://stackoverflow.com/a/2059028/7896359
#@Ignacio Vazquez-Abrams


NameError: name 'size' is not defined

In [293]:
numListas = sum(1 for x in viewInfoOrbital88 if isinstance(x, list))

In [294]:
for j in range(numListas):
    print(viewInfoOrbital88[j])

['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000']
['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000']
['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000']
['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000']
['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000']
['1', 'H', '1s', '0.0000', '0.0000', '0.0000', '-0.0003', '0.0000']
['1', 'H', '1s', '0.0000']
['1', 'H', '1s', '0.0000', '0.0000', '-0.0003', '-122.0000']


In [267]:
lineDemon2 = "     1    1   H    1s        12.0000     0.0000     0.0000    -0.0003     0.0000"

In [268]:
viewInfoOrbital99[1] = lineOrbitalInfo.parseString(lineDemon1)
print(viewInfoOrbital)

IndexError: list assignment index out of range

In [None]:
#viewInfoOrbital.append('0.333')

In [None]:
#print(viewInfoOrbital)

In [None]:
num = Word(nums)
atomInInput = Word(nums)
atomSymbol = Word(alphas)
orbitalSymbol = Word(alphanums)
orbitalValues = OneOrMore( Word(nums) )
lineOrbitalInfo = num + atomInInput + atomSymbol + orbitalSymbol + orbitalValues


In [None]:
#viewInfoOrbital = lineOrbitalInfo.parseString(lineDemon)
#print(viewInfoOrbital)