In this repository I keep my parsers for simple parsers for basic arithemtics on natural numbers.
I have four distinct directories for three distinct-ish here which all differ in some way.
Those are in the c_*
directories with appropriate READMEs
attached. There are two of them but they are quite similar apart from
couple major differences.
This is the first one I made. It processes the maths expression in a
non-sequential manner ie. it loads the expression, finds the +
, *
etc and then finds the both sides of it and only then evaluates the
captured expression between those two pointers
This one works in a sequential way – it keeps only one pointer that goes through the expression from right to left and then adds next elements recursively (more details below)
In the ocaml_grammary
I keep the first ocaml parser I wrote where
when desinging the AST (term explained below) I have tried to mimic
the grammar as closely as possible. In ocaml_norm
I have a much more
normal of a parser in the sense the tree is kept very concise and
simple.
Both depend on the menhir library (which I think dates back to yacc).
For all parsers the usage (should be) is specified in their directories’ readmes.
With parsing one ought always think of the grammar before actually indulging in any coding. Context free grammar is quite cool and lets you pretend to be super smart but perhaps a better choice would be to use something more readable such as the Backus–Naur form (BNF, please consult wikipedia for details). But for that one should have some intuition before hand or else how would you come up with the grammar?
The idea of making an arithemtics parser occured to me first when
I was analysing the nature of the Łukasiewicz notation (aka the
Polish notation) and the Azciweisakuł notation (/the Reversed
Polish notation). It first was used as a notation for logic but then
it was deemed useful by computer scientists because of its natural
stacky/tree-y nature. Eg. the expression 2 + 3 * (4 + 5)
in PN would
look like + 2 * 3 + 4 5
so as we see when compared to the standard
notation it is like something <binary operator> <arg1> <arg2>
instead of <a1> <bop> <a2>
. So we first get the element that
combines the two values and then we recursively get the values
themselves (perhaps composed of something like that as well). Hence
one could see that there is a tree-like structure hidden here. The
leafs are filled with numbers (the atoms of arithmetic) and the nodes
that combine them are the binary operations.
This way we discover the idea of an abstract syntax tree or AST for short. How does such tree look like? Using the example from above:
+ / \ / \ 2 * / \ / \ 3 + / \ / \ 4 5
this is the tree representing the operation
standard notation | Polish notation | reverse pn |
---|---|---|
2 + 3 * (4 + 5) | + 2 * 3 + 4 5 | 2 3 4 5 + * + |
Here we can see how the structure plays out – when we have a +
in a
node that has incoming multiplication, then it needs parentheses to
perserve the order of operations. Without any further details we cas
present this grammar:
<expr> ::= <term> + <expr> | <term> - <expr> | <term> <term> ::= - <term> | <factor> * <term> | <factor> / <term> <factor> ::= (<expr>) | <number> | <variable>
So we have the syntax denoted here. But actually it denotes something
that is not exactly as our classicl arithmetic expressions
look. Something like 2 * -3
is completely valid when it comes to
this grammar (and frankly – I think it makes sense, it is unambiguous
hence there is no reason but aesthetics that would make us resign for
letting the unary minus work without parentheses) but in reality we’d
use 2 * (-3)
.
So the grammar would look better as:
<expr> ::= [ - ] <term> + <expr> | <term> - <expr> | <term> <term> ::= <term> | <factor> * <term> | <factor> / <term> <factor> ::= (<expr>) | <number> | <variable>
with the minus still being an aspect of a term (or even a factor) but
it needs to be caught in the <expr>
production.
We might as well add exponents if it is what we want. Right now I’ve implemented that only in the simpler ocaml parser. We’d have to adjust our grammar accordingly:
<expr> ::= [ - ] <term> + <expr> | <term> - <expr> | <term> <term> ::= <term> | <factor> * <term> | <factor> / <term> <factor> ::= <factor ^ <factor> | (<expr>) | <number> | <variable>
And actually we have to consider the assosciativity problem. So we
usually consider minus and division to work like: 3 - 2 - 1 = (3 - 2) - 1
so perhaps a better grammar would be:
<expr> ::= - <expr> + <term> | <expr> - <term> | <term> <term> ::= - <term> | <term> * <factor> | <term> / <factor> | <factor> <factor> = <factor>^<factor> | (expr) | num | var
We use structs as the data structure letting us maintain these trees.
typedef struct node {
int is_op, value;
char op;
struct node * l, * r;
} node;
/* simplified grammar
* <expr> ::= <term> { + <term> }
* <term> ::= <factor> { * <factor> }
* <factor> ::= <num> | ( <expr> ) */
we then use them in different ways. is_op
serves as a way to
distinguish between a node and a leaf.
We have functions that create a node of each type, they get the left and right index of beggining and ending of the appriopriate section
node *expr(char* p, int l, int r);
node *term(char* p, int l, int r);
node *factor(char* p, int l, int r);
here we parse the string as it goes
node *expr(char** p);
node *term(char** p);
node *factor(char** p);
We can keep everything very simple and design our AST based not on the grammar but on our tree diagram from above. Then we need those posibilities:
- leaves – the atoms of arithmetic which end the tree. we have
two options for those
- leaves carrying a number
- leaves carrying a symbol
- the pesky frontal minus as a singular node which just marks that the term should be negated
- the basic binary node with children being the operands and an operator in the node
Having considered these points we can make such an AST:
(* binary operation *)
type bop = Sum | Mult | Diff | Divis | Pow
(* the most basic arithemtics tree *)
type arithtree =
| Leaf of int
| VLeaf of string
| SNode of arithtree
| Node of arithtree * bop * arithtree
it is much simpler than the ocaml_grammary
version but of course it
doesn’t mimic the grammar but rather the actual arithmetic
tree. Another minor disadvantage is that we dont have original parentheses
and structure marked that well so adding those parentheses when trying
to printout an expr infix becomes quite tricky.
The lexer.mll file translates the raw text into lexems like DIV
TIMES
et which are much easy to handle. The parser.mly has these productions listed:
expr:
| e = expr PLUS t = term { Node (e, Sum, t) }
| e = expr MINUS t = term { Node (e, Diff, t) }
| t = term { t }
;
term:
| MINUS t = term %prec FMINUS { SNode t }
| t = term TIMES f = factor { Node (t, Mult, f) }
| t = term DIV f = factor { Node (t, Divis, f) }
| f = factor { f }
;
factor:
| f1 = factor POW f2 = factor { Node(f1, Pow, f2) }
| LPAREN e = expr RPAREN { e }
| n = NUM { Leaf n }
| x = VAR { VLeaf x }
;
# let s = "-5 * (2 - 4) * ((3 * 2) + 5)";;
val s : string = "-5 * (2 - 4) * ((3 * 2) + 5)"
# let e = parse s;;
val e : Ast.arithtree =
SNode
(Node (Leaf 5, Mult,
Node (Node (Leaf 2, Diff, Leaf 4), Mult,
Node (Node (Leaf 3, Mult, Leaf 2), Sum, Leaf 5))))
# eval e;;
- : int = 110
# pn e; print_endline ""; rpn e; print_endline ""; infix e; print_endline "";;
* 5 * - 2 4 + * 3 2 5
5 2 4 - 3 2 * 5 + * *
5 * (2 - 4) * (3 * 2 + 5)
Ocaml as a functional language is the real charm. We can in fact mimic the AST so well it exactly matches the grammar of ours. Take a look:
type expr =
| Plus of term * expr
| Minus of term * expr
| Term of term
and term =
| FMinus of term
| Times of factor * term
| Div of factor * term
| Factor of factor
and factor = Expr of expr | Num of int | Var of string
the structure is preserved perfectly – we have all pieces as in above main grammar.
When it comes to parsing per se we have this neat piece of .mly
code:
expr:
| t = term PLUS e = expr { Plus (t, e) }
| t = term MINUS e = expr { Minus (t, e) }
| t = term { Term t }
;
term:
| MINUS t = term %prec FMINUS { FMinus t }
| f = factor TIMES t = term { Times (f, t) }
| f = factor DIV t = term { Div (f, t) }
| f = factor { Factor f }
;
factor:
| LPAREN e = expr RPAREN { Expr e }
| n = NUM { Num n }
| x = VAR { Var x }
;
and it is very nice indeed but it allows something that is not that
possible in mathematics as we discussed above (2 * -3
makes sense,
prove me wrong!). So the proper way to denote it is:
expr:
| MINUS t = term PLUS e = expr %prec FMINUS { Plus(FMinus t, e) }
| MINUS t = term MINUS e = expr %prec FMINUS { Plus(FMinus t, e) }
| MINUS t = term %prec FMINUS { Term (FMinus t) }
| t = term PLUS e = expr { Plus (t, e) }
| t = term MINUS e = expr { Minus (t, e) }
| t = term { Term t }
;
term:
| f = factor TIMES t = term { Times (f, t) }
| f = factor DIV t = term { Div (f, t) }
| f = factor { Factor f }
;
factor:
| LPAREN e = expr RPAREN { Expr e }
| n = NUM { Num n }
| x = VAR { Var x }
;
with the frontal minus being caught by the beggining of the expression production.
So we have ast.ml with the above-shown AST, we have parser.mly with
the parser and we have a lexer.mll file that changes written text as
2 + 3 / 1
into simple lexemes like 2 PLUS 3 DIV 1
etc. In main.ml
we have evaluation and different notations to choose from.
# let s = "-5 * (2 - 4) * ((3 * 2) + 5)";;
val s : string = "-5 * (2 - 4) * ((3 * 2) + 5)"
# let e = parse s;;
val e : Ast.expr =
Ast.Term
(Ast.FMinus
(Ast.Times (Ast.Num 5,
Ast.Times
(Ast.Expr
(Ast.Minus (Ast.Factor (Ast.Num 2),
Ast.Term (Ast.Factor (Ast.Num 4)))),
Ast.Factor
(Ast.Expr
(Ast.Plus
(Ast.Factor
(Ast.Expr
(Ast.Term (Ast.Times (Ast.Num 3, Ast.Factor (Ast.Num 2))))),
Ast.Term (Ast.Factor (Ast.Num 5)))))))))
# eval e;;
- : int = 110
# pn e; infix e; rpn e;;
* 5 * - 2 4 + * 3 2 5
5 * (2 - 4) * ((3 * 2) + 5)
5 2 4 - 3 2 * 5 + * *
we have parsing, evaluation and PN, infix and RPN notations.