# Context-free Parsing and Derivative Scanning which generates proofs

## This is a version of CH10_Derivatives.ipynb that also prints out a proof of why an RE matched (or did not match) a string. It follows the rules given in Figure 10.2.  **NOTE** I forgot to include a rule for `&` (i.e. AND) in that figure. The rule is similar to that for `+` and is present in the code. Since I did not have a rule number for `&`, I call it Rule-10 (Figure 10.2 has Rules 1-9).


This Jove file covers two topics. 

* The first, context-free parsing, helps us design a parser for regular expressions. This
is the subject of Chapter 11. 

* The second is derivative-based scanning, the topic of Chapter 10.

These are now described. 

You may wish to watch the video before embarking on this work.


In [None]:
# This YouTube video walks through this notebook

from IPython.display import YouTubeVideo
YouTubeVideo('xGvCjoWemWg')

## Context-free Parsing

We now present the parser for regular expressions.

* The CFG for regular expressions that we'd like to deal with (during derivative-based scanning) 
is the one shown below. 

* Note that the rules for AND and NOT were originally left as exercises for the reader; 
the code in this version includes them (see the lines marked `ADDED`).


expression -> expression PLUS catexp | catexp

catexp -> catexp andexp | andexp 

andexp -> andexp AND ordyexp | ordyexp

ordyexp -> str | eps | LPAREN expression RPAREN | ordyexp STAR | NOT ordyexp


In [None]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import sys

# -- Detect if in Own Install or in Colab
try:
    import google.colab
    OWN_INSTALL = False
except ImportError:
    OWN_INSTALL = True
    
if OWN_INSTALL:
    
  #---- Leave these definitions ON if running on laptop
  #---- Else turn OFF by putting them between ''' ... '''

  sys.path[0:0] = ['../../../../..',  '../../../../../3rdparty',  
                   '../../../..',  '../../../../3rdparty',  
                   '../../..',     '../../../3rdparty', 
                   '../..',        '../../3rdparty',
                   '..',           '../3rdparty' ]

else: # In colab
  ! if [ ! -d Jove ]; then git clone https://github.com/ganeshutah/Jove Jove; fi
  sys.path.append('./Jove')
  sys.path.append('./Jove/jove')

# -- common imports --
from jove.lex import lex
from jove.yacc import yacc
from jove.StateNameSanitizers import ResetStNum, NxtStateStr
from jove.SystemImports       import *
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Token definitions

This is the lexer for REs. We begin with the token definitions.

**NOTE** 

* Adding the tokens for negation and conjunction was originally left as an exercise; they are included below as `NOT` and `AND`,

i.e. the lexer supports things like !a for negation and !a & b for conjunction


In [None]:
tokens = ('EPS','STR','LPAREN','RPAREN','PLUS','STAR', 'NOT', 'AND')

# Tokens
t_PLUS    = r'\+'
t_STAR    = r'\*'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_EPS     = r'\'\'|\"\"'   
t_STR     = r'[a-zA-Z0-9]'
t_NOT     = r'\!' 
t_AND     = r'\&'

# Ignored characters
t_ignore = " \t"

def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")
    
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
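
As a rough illustration of what these token definitions accept, the following standalone sketch mimics them with Python's `re` module. It is not how the PLY-style lexer works internally: `TOKEN_SPEC` and `tokenize_sketch` are hypothetical names, and unlike `t_error` above (which prints a message and skips the character), this sketch raises an exception on an illegal character.

```python
import re

# Same patterns as the t_* definitions above, plus a SKIP rule standing in for t_ignore
TOKEN_SPEC = [
    ("EPS",    "''" + "|" + '""'),   # two single quotes or two double quotes
    ("STR",    r"[a-zA-Z0-9]"),      # exactly one alphanumeric character
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("NOT",    r"!"),
    ("AND",    r"&"),
    ("SKIP",   r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize_sketch(text):
    """Return a list of (kind, text) pairs; hypothetical stand-in for the real lexer."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"Illegal character {text[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize_sketch("a+b* & ''"))
```

Note that `STR` matches a single character, so an input such as `ab` yields two `STR` tokens; it is the parser's concatenation rule that joins them.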
    
 

### These parsing rules specify many things. 

We begin with operator precedence rules, which are there essentially to help the LALR parser (a kind of bottom-up parser) resolve shift-reduce conflicts. Operators listed later in the table bind more tightly.

In [None]:
# Parsing rules
# 
precedence = (
   ('left','PLUS'),
   ('left', 'AND'),   #<== ADDED this 
   ('left','STAR'),
   ('right','NOT')    #<== ADDED this
   )

### CFG productions and semantic actions

The Python functions whose names begin with "p_" house (1) the CFG production rules, within their documentation strings, and (2) the semantic actions, within their bodies. The semantic actions can refer to the attributes of the grammar symbols in the CFG productions. We will explain one of these rules now.

Take the rules 

 expression -> expression PLUS catexp
 expression -> catexp
 
1) This function defines the first production rule

def p_expression_plus(t):

   a) This comment string expresses the production rule
   
    '''expression : expression PLUS catexp'''
    
   b) This line below tells us that the occurrence of 'expression' on
      the left-hand side is marked t[0], and its value is determined by
      applying function attrDyadicInfix onto its three arguments below.
      Here, t[1] is the attribute of 'expression' coming after the colon (:)
      and the attribute of catexp is t[3]
      
    t[0] = attrDyadicInfix("+", t[1], t[3])    
    
2) This function expresses the second, related production rule, which is
   the basis case 
    
def p_expression_plus1(t):
    '''expression : catexp'''

    t[0] = t[1]  
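
The slot numbering can be tried out without the parser. In the sketch below, a plain Python list stands in for the object `t` that the parser passes to a semantic action, and the attributes carry only the `'ast'` field (the real attributes in this notebook also carry a `'dig'` field for drawing). The names `attr_leaf` and `attr_infix_ast_only` are hypothetical and used only for this illustration.

```python
def attr_leaf(ch):
    """AST-only attribute for a single character (hypothetical helper)."""
    return {'ast': ('str', ch)}

def attr_infix_ast_only(op, attr1, attr3):
    """AST-only version of what an infix semantic action computes (hypothetical helper)."""
    return {'ast': (op, (attr1['ast'], attr3['ast']))}

# Slots for the production  expression : expression PLUS catexp
#   t[0] = left-hand side, t[1] = expression, t[2] = the PLUS token, t[3] = catexp
t = [None, attr_leaf('a'), '+', attr_leaf('b')]

# The semantic action fills in slot 0 from slots 1 and 3
t[0] = attr_infix_ast_only('+', t[1], t[3])

print(t[0]['ast'])
```

Slot 2 holds the operator token itself, which is why the action reads `t[1]` and `t[3]` and skips `t[2]`.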

In [None]:
def p_expression_plus(t):
    'expression : expression PLUS catexp'
    #
    t[0] = attrDyadicInfix("+", t[1], t[3])    
    
def p_expression_plus1(t):
    'expression : catexp'
    #
    t[0] = t[1]  

In [None]:

def p_expression_cat(t):
    'catexp :  catexp andexp'
    #
    t[0] = attrDyadicInfix(".", t[1], t[2])

def p_expression_cat1(t):
    'catexp :  andexp'
    #
    t[0] = t[1]  


def p_expression_ordy(t):          #<== Added this
    'andexp : andexp AND ordyexp'  #<== to support infix and
    #
    t[0] = attrDyadicInfix("&", t[1], t[3])


def p_expression_ordy1(t):
    'andexp : ordyexp'
    #
    t[0] = t[1]



# Documentation for p_expression_ordy_star:
#
# We employ field 'ast' of the dict to record the abstract syntax tree. 
# Field 'dig' holds a digraph. It too is a dict. 
# Its fields are nl for the node list and el for the edge list
    
def p_expression_ordy_star(t):
    'ordyexp : ordyexp STAR'
    #
    ast = ('*', t[1]['ast'])

    nlin = t[1]['dig']['nl']
    elin = t[1]['dig']['el']
    
    rootin = nlin[0]

    root = NxtStateStr("R*_") 
    right = NxtStateStr("*_")

    t[0] = {'ast' : ast,
            'dig' : {'nl' : [root] + nlin + [right], # this order important for proper layout!
                     'el' : elin + [ (root, rootin),
                                     (root, right) ]
                    }}


def p_expression_ordy_not(t):  #<== The tree-drawing for NOT happens here
    'ordyexp : NOT ordyexp'
    #
    ast  = ('!', t[2]['ast'])
    
    nlin = t[2]['dig']['nl']
    elin = t[2]['dig']['el']
    
    rootin = nlin[0]

    root = NxtStateStr("!R_") 
    left = NxtStateStr("!_")

    t[0] = {'ast' : ast,
            'dig' : {'nl' : [ root, left ] + nlin, # this order important for proper layout!
                     'el' : elin + [ (root, left),
                                     (root, rootin) ]
                    }}

    
def p_expression_ordy_paren(t):
    'ordyexp : LPAREN expression RPAREN'
    #
    ast  = t[2]['ast']
    
    nlin = t[2]['dig']['nl']
    elin = t[2]['dig']['el']
    
    rootin = nlin[0]
    
    root = NxtStateStr("(R)_")
    left = NxtStateStr("(_")
    right= NxtStateStr(")_")
    
    t[0] = {'ast' : ast,
            'dig' : {'nl' : [root, left] + nlin + [right], #order important f. proper layout!
                     'el' : elin + [ (root, left),
                                     (root, rootin),
                                     (root, right) ]
                    }}

def p_expression_ordy_eps(t):
    'ordyexp : EPS'
    #
    strn = '@'
    ast  = ('@', strn)           
    t[0] = { 'ast' : ast,
             'dig' : {'nl' : [ strn + NxtStateStr("_") ],
                      'el' : []
                     }}          
    
def p_expression_ordy_str(t):
    'ordyexp : STR'
    #
    strn = t[1]
    ast  = ('str', strn)
    t[0] = {'ast' : ast,
            'dig' : {'nl' : [ strn + NxtStateStr("_") ],
                     'el' : [] 
                    }}

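def p_error(t):
    # PLY-style parsers call p_error with None when the error is at end of input
    if t is None:
        print("Syntax error at end of input")
    else:
        print("Syntax error at '%s'" % t.value)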

#--
    
def attrDyadicInfix(op, attr1, attr3):         # <== this builds the attribute (AST plus
    ast  = (op, (attr1['ast'], attr3['ast']))  # <== parse-tree digraph) for an infix operator
    
    nlin1 = attr1['dig']['nl']
    nlin3 = attr3['dig']['nl']
    nlin  = nlin1 + nlin3
    
    elin1 = attr1['dig']['el']
    elin3 = attr3['dig']['el']
    elin  = elin1 + elin3
    
    rootin1 = nlin1[0]
    rootin3 = nlin3[0]    
    
    root   = NxtStateStr("R1"+op+"R2"+"_") # NxtStateStr("$_")
    left   = rootin1
    middle = NxtStateStr(op+"_")
    right  = rootin3
    
    return {'ast' : ast,
            'dig' : {'nl' : [ root, left, middle, right ] + nlin,
                     'el' : elin + [ (root, left),
                                     (root, middle),
                                     (root, right) ]
                     }}

#===
# This is the entry-point into the parser.
#===

def parseRE(s):
    """In: a string s containing a regular expression.
       Out: An attribute triple consisting of
            1) An abstract syntax tree suitable for processing in the derivative-based scanner
            2) A node-list for the parse-tree digraph generated. Good for drawing a parse tree 
               using the drawPT function below
            3) An edge list for the parse-tree generated (again good for drawing using the
               drawPT function below)
    """
    mylexer  = lex()
    myparser = yacc()
    pt = myparser.parse(s, lexer = mylexer)             # <== pass the right lexer into the parser
    return (pt['ast'], pt['dig']['nl'], pt['dig']['el']) # <== the parser returns the parse-tree
                                                        # <== as a Python data structure, plus a tree data structure for drawing

In [None]:
def drawPT(ast_nl_el, comment="PT"):
    """Given an (ast, nl, el) triple where nl is the node and el the edge-list,
       draw the Parse Tree by returning a dot object.
    """
    (ast, nl, el) = ast_nl_el
    print("Drawing AST for ", ast)
    dotObj_pt = Digraph(comment)
    dotObj_pt.graph_attr['rankdir'] = 'TB'
    for n in nl:
        prNam = n.split('_')[0]
        dotObj_pt.node(n, prNam, shape="oval", peripheries="1")
    for e in el:
        dotObj_pt.edge(e[0], e[1])
    return dotObj_pt
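
The node and edge lists that `drawPT` consumes can be examined without Graphviz. The sketch below builds, by hand, the `nl` and `el` lists that the semantic actions for `STR` and `STAR` would produce for the RE `a*`, and then derives the printed labels the same way `drawPT` does (the text before the first underscore). `next_name` is a hypothetical stand-in for `NxtStateStr`; it assumes that function simply appends a fresh number to the given prefix.

```python
import itertools

_counter = itertools.count()

def next_name(prefix):
    """Hypothetical stand-in for NxtStateStr: prefix followed by a fresh number."""
    return prefix + str(next(_counter))

# Leaf for 'a', mirroring p_expression_ordy_str
leaf = {'nl': ['a' + next_name("_")], 'el': []}

# Star node on top of it, mirroring p_expression_ordy_star
root  = next_name("R*_")
right = next_name("*_")
star  = {'nl': [root] + leaf['nl'] + [right],
         'el': leaf['el'] + [(root, leaf['nl'][0]), (root, right)]}

# drawPT labels each node with the part of its name before the first underscore
labels = [n.split('_')[0] for n in star['nl']]
print(star['nl'], star['el'], labels)
```

The number after the underscore keeps node names unique, so two occurrences of the same character become two distinct leaves in the drawing while showing the same label.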

# Exercise: Study of Parsing by Drawing Parse Trees

**Question Q1(a):** Some simple parse-tree examples are now given. Please produce three more interesting-looking parse trees of your own. They can be anything, but ensure that you understand the trees generated. Write two short sentences describing each parse tree produced. Try to limit yourself to about eight leaf nodes and about the same number of operators (rough guideline only).

In [None]:
drawPT(parseRE("''"))

In [None]:


parseRE('""')

In [None]:
parseRE('a')

In [None]:
parseRE("a*")

In [None]:
parseRE('a&b')

In [None]:
parseRE('ab')

In [None]:
parseRE("!a")

In [None]:
parseRE('!a* b*')

In [None]:
drawPT(parseRE("1"))

In [None]:
drawPT(parseRE("(0*1*)*"))

In [None]:
drawPT(parseRE("0+11*"))

# Derivative-based Pattern Matching
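
For reference, the rules implemented by the `dv` and `nullable` functions in this section can be summarised as follows. This summary is transcribed from the code in this notebook rather than copied from Figure 10.2, so the notation is an assumption: $D_c(E)$ stands for the derivative of $E$ with respect to the character $c$, $\varepsilon$ for `@`, $\emptyset$ for `mty`, and $\nu(E)$ for "E is nullable". The rule numbers are the ones printed in the proof trace.

$$
\begin{aligned}
\text{Rule 1:}\quad & D_c(c) = \varepsilon \\
\text{Rule 2:}\quad & D_c(a) = \emptyset \quad (a \neq c) \\
\text{Rule 3:}\quad & D_c(\varepsilon) = \emptyset \\
\text{Rule 4:}\quad & D_c(\emptyset) = \emptyset \\
\text{Rule 5:}\quad & D_c(E^*) = D_c(E)\, E^* \\
\text{Rule 6:}\quad & D_c(!E) = \; !D_c(E) \\
\text{Rule 7:}\quad & D_c(E_1 + E_2) = D_c(E_1) + D_c(E_2) \\
\text{Rule 8:}\quad & D_c(E_1 E_2) = D_c(E_1)\, E_2 + D_c(E_2) \quad \text{if } \nu(E_1) \\
\text{Rule 9:}\quad & D_c(E_1 E_2) = D_c(E_1)\, E_2 \quad \text{if not } \nu(E_1) \\
\text{Rule 10:}\quad & D_c(E_1 \,\&\, E_2) = D_c(E_1) \,\&\, D_c(E_2)
\end{aligned}
$$

A string $w = c_1 c_2 \cdots c_n$ matches $E$ exactly when $\nu\big(D_{c_n}(\cdots D_{c_2}(D_{c_1}(E))\cdots)\big)$ holds, which is the test that `matches` performs.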

In [None]:
#=== Now comes derivMatch as illustration of RE Derivatives and Pattern-matching
# These four functions are simple extractors of the operator and arguments

def opr(E):
    """Retrieves the operator of an expression.
    """
    return E[0]

def arg1(E):
    """Retrieves the first argument of a binary operator-based expression.
    """
    return E[1][0]

def arg2(E):
    """Retrieves the second argument of a binary operator-based expression.
    """
    return E[1][1]

def arg(E):
    """Retrieves the only argument of a unary operator-based expression.
    """
    return E[1]

def nullable(E):
    """This is the nullability test defined in Chapter 10.
    """
    if (opr(E) == "str") :
        return False
    elif (opr(E) == '@') :
        return True
    elif (opr(E) == "mty") :
        return False
    elif (opr(E) == "*"):
        return True
    elif (opr(E) == "!"):
        return not nullable(arg(E))
    elif (opr(E) == '+') :
        return nullable(arg1(E)) or nullable(arg2(E))
    elif (opr(E) == '.') :
        return nullable(arg1(E)) and nullable(arg2(E))
    elif (opr(E) == '&') :
        return nullable(arg1(E)) and nullable(arg2(E))
    else:
        # Raise instead of returning a (truthy) string, which callers would treat as True
        raise ValueError("Undefined expression given to the nullability test: " + repr(E))

#--- Computes the derivative of E w.r.t. c. 
#--- Also returns a list of recursive steps executed to produce the derivative, 
#--- suitably decorated with the Rule numbers and other helpful debugging info.

In [None]:
def dyadicstr(E, L, ch, R, ind):
    return (L + prt(arg1(E),ind) + ch + prt(arg2(E),ind) + R)

In [None]:
def prt(E, ind):
    if opr(E)=='str':
        return(arg(E))
    elif opr(E) in ['+', '&']:
        return dyadicstr(E, '(', opr(E), ')',ind)
    elif opr(E)=='.':
        return dyadicstr(E, '(', ' ', ')',ind)
    elif opr(E)=='mty':
        return '{}'
    elif opr(E)=='!':
        return '!' + '(' + prt(arg(E),ind) + ')'
    elif opr(E)=='*':
        return '(' + prt(arg(E),ind) + ')' + '*'
    elif opr(E)=='@':
        return '@'
    else:
        return "??? illegal opr(E) for prt. ???"

In [None]:
def dv(E, c, ind):
    """This function computes the derivative
       of a regular expression E with respect
       to character "c".  
    """
    if (opr(E) == "str") :
        if (arg(E) == c):
            Dout = ('@', '@')
            return ( Dout,
                     [ ' '*ind + 'Rule 1: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] )
        else:
            Dout = ("mty", "mty") 
            return ( Dout,
                     [ ' '*ind + 'Rule 2: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] )
        
    elif (opr(E) == '@') :
        Dout = ("mty", "mty")
        return ( Dout,
                 [ ' '*ind + 'Rule 3: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] )
    
    elif (opr(E) == "mty") :
        Dout = ("mty", "mty")
        return ( Dout,
                 [ ' '*ind + 'Rule 4: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] )
    
    elif (opr(E) == "*"):
        (D, P) = dv(arg(E), c, ind+4)
        Dout   = (".", (D, E))
        return ( Dout,
                 [ ' '*ind + 'Rule 5: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] + P )
    
    elif (opr(E) == "!"):
        (D, P) = dv(arg(E), c, ind+4)
        Dout   = ("!", D)
        return ( Dout,
                 [ ' '*ind + 'Rule 6: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] + P )
    
    elif (opr(E) == '+') :
        (D1, P1) = dv(arg1(E), c, ind+4)
        (D2, P2) = dv(arg2(E), c, ind+4)
        Dout     = ("+", (D1, D2))
        return ( Dout,
                  [ ' '*ind + 'Rule 7: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] 
                +  P1 
                +  P2 )
    
    elif (opr(E) == '&') :
        (D1, P1) = dv(arg1(E), c, ind+4)
        (D2, P2) = dv(arg2(E), c, ind+4)
        Dout     = ("&", (D1, D2))
        return ( Dout,
                  [ ' '*ind + 'Rule 10: Expn ' + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] 
                +  P1
                +  P2 )    
    
    elif (opr(E) == '.') :
        if nullable(arg1(E)):
            (D1, P1) = dv(arg1(E), c, ind+4)
            (D2, P2) = dv(arg2(E), c, ind+4) 
            Dout     = ("+", 
                             ( ('.', (D1, 
                                      arg2(E)
                                     )),
                                D2
                             ))
            return ( Dout,
                      [ ' '*ind + 'Rule 8: Nullable ' + prt( arg1(E),ind ) + " :: " 
                         + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] 
                    +  P1 
                    +  P2 )    
        
        else:
            (D1, P1) = dv(arg1(E), c, ind+4)
            Dout     = ('.', (D1, arg2(E)))
            return ( Dout,
                      [ ' '*ind + 'Rule 9: !Nullable ' + prt( arg1(E),ind ) + " :: " 
                         + prt(E,ind) + ' ~~' + c + '~~> ' + prt(Dout,ind) + '\n' ] 
                    +  P1 )
                    
    else:
        # Raise instead of returning a string, which callers could not unpack as (D, P)
        raise ValueError("Undefined operator in Expn given to dv: " + repr(E))

In [None]:
def matches(w, E):
    """Prints a proof of whether RE (given as AST E) matches string w.
       Returns True if it matches and False otherwise.
    """
    if w=="":
        print("----- Derivatives of all characters over; subjecting final derivative to nullability test -----")
        n = nullable(E)
        if n:
            print("-- Final derivative ", prt(E,0), " is nullable, hence the given RE matches the given string :-)")
        else:
            print("-- Final derivative ", prt(E,0), " is not nullable, hence the given RE does not match the given string :-(")
        return n
    
    else:
        print("----- Taking derivative of first/next character", w[0], "-----")
        (D, P) = dv(E, w[0], 0) # indent of 0
        for x in P:
            print(x)
        return matches(w[1:], D)
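
Stripped of the proof trace, the matching procedure is quite short. The following standalone sketch restates it on the same tuple-shaped ASTs and cross-checks it against Python's `re` module on one RE that uses only `+`, concatenation and `*`. The names `nullable_s`, `deriv` and `matches_s` are hypothetical, and this is a compact restatement under those assumptions, not a replacement for the functions above.

```python
import re

MTY = ("mty", "mty")   # the empty language
EPS = ("@", "@")       # the language containing only the empty string

def nullable_s(E):
    op = E[0]
    if op in ("@", "*"):
        return True
    if op in ("str", "mty"):
        return False
    if op == "!":
        return not nullable_s(E[1])
    left, right = E[1]
    if op == "+":
        return nullable_s(left) or nullable_s(right)
    if op in (".", "&"):
        return nullable_s(left) and nullable_s(right)
    raise ValueError(f"unknown operator {op!r}")

def deriv(E, c):
    op = E[0]
    if op == "str":
        return EPS if E[1] == c else MTY
    if op in ("@", "mty"):
        return MTY
    if op == "*":
        return (".", (deriv(E[1], c), E))
    if op == "!":
        return ("!", deriv(E[1], c))
    left, right = E[1]
    if op in ("+", "&"):
        return (op, (deriv(left, c), deriv(right, c)))
    if op == ".":
        head = (".", (deriv(left, c), right))
        return ("+", (head, deriv(right, c))) if nullable_s(left) else head
    raise ValueError(f"unknown operator {op!r}")

def matches_s(w, E):
    for c in w:               # take one derivative per input character
        E = deriv(E, c)
    return nullable_s(E)      # accept iff the final derivative is nullable

# AST written by hand for (ab+a)*, compared with the equivalent Python regex (ab|a)*
ab_or_a_star = ("*", ("+", ((".", (("str", "a"), ("str", "b"))), ("str", "a"))))
for w in ["", "a", "ab", "aba", "abb", "b"]:
    assert matches_s(w, ab_or_a_star) == (re.fullmatch(r"(ab|a)*", w) is not None)
```

Python's `re` has no direct counterpart for `!` and `&`, so the cross-check is limited to REs that avoid them; with derivatives, those two operators cost only one extra case each.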

In [None]:
RE = "a"

In [None]:
RE = "a+b&c"
(ast, n, e) = parseRE(RE)
matches("zb", ast)

In [None]:
RE = "a+b* & c*+d"
(ast, n, e) = parseRE(RE)
matches("aab", ast)

In [None]:
RE = "!b*"  
(ast, n, e) = parseRE(RE)
matches("aba", ast)

In [None]:
RE = "(ab+a)*"  
(ast, n, e) = parseRE(RE)
matches("aba", ast)

In [None]:
RE = "(a&!b)"  
(ast, n, e) = parseRE(RE)
matches("a", ast)

In [None]:
RE = "(!a&!b)"  
(ast, n, e) = parseRE(RE)
matches("ah", ast)

In [None]:
RE = "!(a*)"  
(ast, n, e) = parseRE(RE)
matches("ab", ast)

In [None]:
RE = "(p+q)*"
(ast, n, e) = parseRE(RE)
matches("pq",ast)