Skip to content

Simple parsers for basic arithmetics. Some in C, some in Ocaml

Notifications You must be signed in to change notification settings

ggegoge/parsing_expressions

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parsing arithmetical expressions

In this repository I keep my parsers for simple parsers for basic arithemtics on natural numbers.

Brief overview of the repository

I have four distinct directories for three distinct-ish here which all differ in some way.

Parsers written in C

Those are in the c_* directories with appropriate READMEs attached. There are two of them but they are quite similar apart from couple major differences.

c_non_seq

This is the first one I made. It processes the maths expression in a non-sequential manner ie. it loads the expression, finds the +, * etc and then finds the both sides of it and only then evaluates the captured expression between those two pointers

c_seq

This one works in a sequential way – it keeps only one pointer that goes through the expression from right to left and then adds next elements recursively (more details below)

The ocaml parsers

In the ocaml_grammary I keep the first ocaml parser I wrote where when desinging the AST (term explained below) I have tried to mimic the grammar as closely as possible. In ocaml_norm I have a much more normal of a parser in the sense the tree is kept very concise and simple.

Both depend on the menhir library (which I think dates back to yacc).

Usage

For all parsers the usage (should be) is specified in their directories’ readmes.

Not as brief of an overview

Parsing arithmetic in general

With parsing one ought always think of the grammar before actually indulging in any coding. Context free grammar is quite cool and lets you pretend to be super smart but perhaps a better choice would be to use something more readable such as the Backus–Naur form (BNF, please consult wikipedia for details). But for that one should have some intuition before hand or else how would you come up with the grammar?

arithemtics as trees and forests

The idea of making an arithemtics parser occured to me first when I was analysing the nature of the Łukasiewicz notation (aka the Polish notation) and the Azciweisakuł notation (/the Reversed Polish notation). It first was used as a notation for logic but then it was deemed useful by computer scientists because of its natural stacky/tree-y nature. Eg. the expression 2 + 3 * (4 + 5) in PN would look like + 2 * 3 + 4 5 so as we see when compared to the standard notation it is like something <binary operator> <arg1> <arg2> instead of <a1> <bop> <a2>. So we first get the element that combines the two values and then we recursively get the values themselves (perhaps composed of something like that as well). Hence one could see that there is a tree-like structure hidden here. The leafs are filled with numbers (the atoms of arithmetic) and the nodes that combine them are the binary operations.

This way we discover the idea of an abstract syntax tree or AST for short. How does such tree look like? Using the example from above:

   +
  / \
 /   \
2     *
     / \
    /   \
   3     +
        / \
       /   \
      4     5

this is the tree representing the operation

standard notationPolish notationreverse pn
2 + 3 * (4 + 5)+ 2 * 3 + 4 52 3 4 5 + * +

Here we can see how the structure plays out – when we have a + in a node that has incoming multiplication, then it needs parentheses to perserve the order of operations. Without any further details we cas present this grammar:

<expr> ::= <term> + <expr> | <term> - <expr> | <term>
<term> ::= - <term> | <factor> * <term> | <factor> / <term>
<factor> ::= (<expr>) | <number> | <variable>

So we have the syntax denoted here. But actually it denotes something that is not exactly as our classicl arithmetic expressions look. Something like 2 * -3 is completely valid when it comes to this grammar (and frankly – I think it makes sense, it is unambiguous hence there is no reason but aesthetics that would make us resign for letting the unary minus work without parentheses) but in reality we’d use 2 * (-3).

So the grammar would look better as:

<expr> ::= [ - ] <term> + <expr> | <term> - <expr> | <term>
<term> ::= <term> | <factor> * <term> | <factor> / <term>
<factor> ::= (<expr>) | <number> | <variable>

with the minus still being an aspect of a term (or even a factor) but it needs to be caught in the <expr> production.

powers

We might as well add exponents if it is what we want. Right now I’ve implemented that only in the simpler ocaml parser. We’d have to adjust our grammar accordingly:

<expr> ::= [ - ] <term> + <expr> | <term> - <expr> | <term>
<term> ::= <term> | <factor> * <term> | <factor> / <term>
<factor> ::= <factor ^ <factor> | (<expr>) | <number> | <variable>

And actually we have to consider the assosciativity problem. So we usually consider minus and division to work like: 3 - 2 - 1 = (3 - 2) - 1 so perhaps a better grammar would be:

<expr> ::= - <expr> + <term> | <expr> - <term> | <term>
<term> ::= - <term> | <term> * <factor> | <term> / <factor> | <factor>
<factor> = <factor>^<factor> | (expr) | num | var

C parsers

We use structs as the data structure letting us maintain these trees.

typedef struct node {
  int is_op, value;
  char op;
  struct node * l, * r;
} node;


/* simplified grammar
 * <expr> ::= <term> { + <term> }
 * <term> ::= <factor> { * <factor> }
 * <factor> ::= <num> | ( <expr> ) */

we then use them in different ways. is_op serves as a way to distinguish between a node and a leaf.

non sequential

We have functions that create a node of each type, they get the left and right index of beggining and ending of the appriopriate section

node *expr(char* p, int l, int r);
node *term(char* p, int l, int r);
node *factor(char* p, int l, int r);

sequential

here we parse the string as it goes

node *expr(char** p);
node *term(char** p);
node *factor(char** p);

Ocaml parsers

the simple version

We can keep everything very simple and design our AST based not on the grammar but on our tree diagram from above. Then we need those posibilities:

  1. leaves – the atoms of arithmetic which end the tree. we have two options for those
    1. leaves carrying a number
    2. leaves carrying a symbol
  2. the pesky frontal minus as a singular node which just marks that the term should be negated
  3. the basic binary node with children being the operands and an operator in the node

Having considered these points we can make such an AST:

(* binary operation *)
type bop = Sum | Mult | Diff | Divis | Pow

(* the most basic arithemtics tree *)
type arithtree =
  | Leaf of int
  | VLeaf of string
  | SNode of arithtree
  | Node of arithtree * bop * arithtree

it is much simpler than the ocaml_grammary version but of course it doesn’t mimic the grammar but rather the actual arithmetic tree. Another minor disadvantage is that we dont have original parentheses and structure marked that well so adding those parentheses when trying to printout an expr infix becomes quite tricky.

The lexer.mll file translates the raw text into lexems like DIV TIMES et which are much easy to handle. The parser.mly has these productions listed:

expr:
  | e = expr PLUS t = term { Node (e, Sum, t) }
  | e = expr MINUS t = term { Node (e, Diff, t) }
  | t = term { t }
;

term:
  | MINUS t = term %prec FMINUS { SNode t }
  | t = term TIMES f = factor { Node (t, Mult, f) }
  | t = term DIV f = factor { Node (t, Divis, f) }
  | f = factor { f }
;

factor:
  | f1 = factor POW f2 = factor { Node(f1, Pow, f2) }
  | LPAREN e = expr RPAREN { e }
  | n = NUM { Leaf n }
  | x = VAR { VLeaf x }
;
Example of it working
# let s = "-5 * (2 - 4) * ((3 * 2) + 5)";;
val s : string = "-5 * (2 - 4) * ((3 * 2) + 5)"
# let e = parse s;;
val e : Ast.arithtree =
  SNode
   (Node (Leaf 5, Mult,
     Node (Node (Leaf 2, Diff, Leaf 4), Mult,
      Node (Node (Leaf 3, Mult, Leaf 2), Sum, Leaf 5))))
# eval e;;
- : int = 110
# pn e; print_endline ""; rpn e; print_endline ""; infix e; print_endline "";;
 * 5 * - 2 4 + * 3 2 5
 5 2 4 - 3 2 * 5 + * *
 5 * (2 - 4) * (3 * 2 + 5)

the grammarised version

Ocaml as a functional language is the real charm. We can in fact mimic the AST so well it exactly matches the grammar of ours. Take a look:

type expr =  
  | Plus of term * expr
  | Minus of term * expr
  | Term of term
and term =
  | FMinus of term
  | Times of factor * term
  | Div of factor * term
  | Factor of factor
and factor = Expr of expr | Num of int | Var of string

the structure is preserved perfectly – we have all pieces as in above main grammar.

When it comes to parsing per se we have this neat piece of .mly code:

expr:
  | t = term PLUS e = expr { Plus (t, e) }
  | t = term MINUS e = expr { Minus (t, e) }
  | t = term { Term t }
;

term:
  | MINUS t = term %prec FMINUS { FMinus t }
  | f = factor TIMES t = term { Times (f, t) }
  | f = factor DIV t = term { Div (f, t) }
  | f = factor { Factor f }
;

factor:
  | LPAREN e = expr RPAREN { Expr e }
  | n = NUM { Num n }
  | x = VAR { Var x }
;

and it is very nice indeed but it allows something that is not that possible in mathematics as we discussed above (2 * -3 makes sense, prove me wrong!). So the proper way to denote it is:

expr:
  | MINUS t = term PLUS e = expr %prec FMINUS { Plus(FMinus t, e) }
  | MINUS t = term MINUS e = expr %prec FMINUS { Plus(FMinus t, e) }
  | MINUS t = term %prec FMINUS { Term (FMinus t) }
  | t = term PLUS e = expr { Plus (t, e) }
  | t = term MINUS e = expr { Minus (t, e) }
  | t = term { Term t }
;

term:
  | f = factor TIMES t = term { Times (f, t) }
  | f = factor DIV t = term { Div (f, t) }
  | f = factor { Factor f }
;

factor:
  | LPAREN e = expr RPAREN { Expr e }
  | n = NUM { Num n }
  | x = VAR { Var x }
;

with the frontal minus being caught by the beggining of the expression production.

So we have ast.ml with the above-shown AST, we have parser.mly with the parser and we have a lexer.mll file that changes written text as 2 + 3 / 1 into simple lexemes like 2 PLUS 3 DIV 1 etc. In main.ml we have evaluation and different notations to choose from.

Example of it working:
# let s = "-5 * (2 - 4) * ((3 * 2) + 5)";;
val s : string = "-5 * (2 - 4) * ((3 * 2) + 5)"
# let e = parse s;;
val e : Ast.expr =
  Ast.Term
   (Ast.FMinus
     (Ast.Times (Ast.Num 5,
       Ast.Times
        (Ast.Expr
          (Ast.Minus (Ast.Factor (Ast.Num 2),
            Ast.Term (Ast.Factor (Ast.Num 4)))),
        Ast.Factor
         (Ast.Expr
           (Ast.Plus
             (Ast.Factor
               (Ast.Expr
                 (Ast.Term (Ast.Times (Ast.Num 3, Ast.Factor (Ast.Num 2))))),
             Ast.Term (Ast.Factor (Ast.Num 5)))))))))
# eval e;;
- : int = 110
# pn e; infix e; rpn e;;
  * 5 * - 2 4 + * 3 2 5
  5 * (2 - 4) * ((3 * 2) + 5)
  5 2 4 - 3 2 * 5 + * *

we have parsing, evaluation and PN, infix and RPN notations.