/
manual.scrbl
1107 lines (838 loc) · 36.5 KB
/
manual.scrbl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
#lang scribble/manual
@(require scribble/eval
racket/date
file/md5
(for-label racket
ragg/support
ragg/examples/nested-word-list
(only-in parser-tools/lex lexer-src-pos)
(only-in syntax/parse syntax-parse ~literal)))
@(define (lookup-date filename [default ""])
(cond
[(file-exists? filename)
(define modify-seconds (file-or-directory-modify-seconds filename))
(define a-date (seconds->date modify-seconds))
(date->string a-date)]
[else
default]))
@(define (compute-md5sum filename [default ""])
(cond [(file-exists? filename)
(bytes->string/utf-8 (call-with-input-file filename md5 #:mode 'binary))]
[else
default]))
@title{ragg: a Racket AST Generator Generator}
@author+email["Danny Yoo" "dyoo@hashcollision.org"]
@section{Informal quickstart}
@(define my-eval (make-base-eval))
@(my-eval '(require ragg/examples/nested-word-list
racket/list
racket/match))
Salutations! Let's consider the following scenario: say that we're given the
following string:
@racketblock["(radiant (humble))"]
@margin-note{(... and pretend that we don't already know about the built-in
@racket[read] function.)} How do we go about turning this kind of string into a
structured value? That is, how would we @emph{parse} it?
We need to first consider the shape of the things we'd like to parse. The
string above looks like a deeply nested list of words. How might we describe
this formally? A convenient notation to describe the shape of these things is
@link["http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form"]{Backus-Naur
Form} (BNF). So let's try to notate the structure of nested word lists in BNF.
@nested[#:style 'code-inset]{
@verbatim{
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN
}}
What we intend by this notation is this: @racket[nested-word-list] is either an
atomic @racket[WORD], or a parenthesized list of any number of
@racket[nested-word-list]s. We use the character @litchar{*} to represent zero
or more repetitions of the previous thing, and we treat the uppercased
@racket[LEFT-PAREN], @racket[RIGHT-PAREN], and @racket[WORD] as placeholders
for atomic @emph{tokens}.
@margin-note{See @secref{install-ragg} for instructions on installing
@tt{ragg.}}
Here are a few examples of tokens:
@interaction[#:eval my-eval
(require ragg/support)
(token 'LEFT-PAREN)
(token 'WORD "crunchy" #:span 7)
(token 'RIGHT-PAREN)]
Have we made progress? At this point, we only have a BNF description in hand,
but we're still missing a @emph{parser}, something to take that description and
use it to make structures out of a sequence of tokens.
It's clear that we don't yet have a program because there's no @litchar{#lang}
line. We should add one. Put @litchar{#lang ragg} at the top of the BNF
description, and save it as a file called @filepath{nested-word-list.rkt}.
@filebox["nested-word-list.rkt"]{
@verbatim{
#lang ragg
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN
}}
Now it is a proper program. But what does it do?
@interaction[#:eval my-eval
@eval:alts[(require "nested-word-list.rkt") (void)]
parse
]
It gives us a @racket[parse] function. Let's investigate what @racket[parse]
does for us. What happens if we pass it a sequence of tokens?
@interaction[#:eval my-eval
(define a-parsed-value
(parse (list (token 'LEFT-PAREN "(")
(token 'WORD "some")
(token 'LEFT-PAREN "[")
(token 'WORD "pig")
(token 'RIGHT-PAREN "]")
(token 'RIGHT-PAREN ")"))))
a-parsed-value]
Wait... that looks suspiciously like a syntax object!
@interaction[#:eval my-eval
(syntax->datum a-parsed-value)
]
That's @racket[(some [pig])], essentially.
What happens if we pass it a more substantial source of tokens?
@interaction[#:eval my-eval
@code:comment{tokenize: string -> (sequenceof token-struct?)}
@code:comment{Generate tokens from a string:}
(define (tokenize s)
(for/list ([str (regexp-match* #px"\\(|\\)|\\w+" s)])
(match str
["("
(token 'LEFT-PAREN str)]
[")"
(token 'RIGHT-PAREN str)]
[else
(token 'WORD str)])))
@code:comment{For example:}
(define token-source (tokenize "(welcome (to (((ragg)) ())))"))
(define v (parse token-source))
(syntax->datum v)
]
Welcome to @tt{ragg}.
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@section{Introduction}
@tt{ragg} is a parsing framework for Racket with the design goal to be easy
to use. It includes the following features:
@itemize[
@item{It provides a @litchar{#lang} for writing extended BNF grammars.
A module written in @litchar{#lang ragg} automatically generates a
parser. The output of this parser tries to follow
@link["http://en.wikipedia.org/wiki/How_to_Design_Programs"]{HTDP}
doctrine; the structure of the grammar informs the structure of the
Racket syntax objects it generates.}
@item{The language uses a few conventions to simplify the expression of
grammars. The first rule in the grammar is automatically assumed to be the
starting production. Identifiers in uppercase are assumed to represent
terminal tokens, and are otherwise the names of nonterminals.}
@item{Tokenizers can be developed completely independently of parsers.
@tt{ragg} takes a liberal view on tokens: they can be strings,
symbols, or instances constructed with @racket[token]. Furthermore,
tokens can optionally provide location: if tokens provide location, the
generated syntax objects will as well.}
@item{The underlying parser should be able to handle ambiguous grammars.}
@item{It should integrate with the rest of the Racket
@link["http://docs.racket-lang.org/guide/languages.html"]{language toolchain}.}
]
@subsection[#:tag "install-ragg"]{Installation}
@itemize[
@item{@margin-note{At the time of this writing, Racket 5.3.2 is in
@link["http://pre.racket-lang.org/"]{pre-release}.} If you are using a version
of Racket > 5.3.1, then follow the instructions on the
@link["https://plt-etc.byu.edu:9004/info/ragg"]{PLaneT2 page}.}
@item{For those who are using Racket <= 5.3.1, you can download the following PLT package:
@nested[#:style 'inset]{@link["ragg.plt"]{ragg.plt} [md5sum: @compute-md5sum["ragg.plt" "ab79038b40e510a5cf13363825c4aef4"]]
Last updated: @lookup-date["ragg.plt" "Wednesday, January 16th, 2013"]
}
Once downloaded, either use DrRacket's package installation features
(@link["http://docs.racket-lang.org/drracket/Menus.html#(idx._(gentag._57._(lib._scribblings/drracket/drracket..scrbl)))"]{Install
PLT File...} under DrRacket's File menu), or use the command line:
@nested[#:style 'inset]{@tt{raco setup -A ragg.plt}}}
]
@subsection{Example: a small DSL for ASCII diagrams}
@margin-note{This is a
@link["http://stackoverflow.com/questions/12345647/rewrite-this-script-by-designing-an-interpreter-in-racket"]{restatement
of a question on Stack Overflow}.} To motivate @tt{ragg}'s design, let's look
at the following toy problem: we'd like to define a language for
drawing simple ASCII diagrams. We'd like to be able write something like this:
@nested[#:style 'inset]{
@verbatim|{
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|}
whose interpretation should generate the following picture:
@nested[#:style 'inset]{
@verbatim|{
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
XXX
XXX
XXX
XXX
XXX
XXX
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
}|}
@subsection{Syntax and semantics}
We're being very fast-and-loose with what we mean by the program above, so
let's try to nail down some meanings. Each line of the program has a semicolon
at the end, and describes the output of several @emph{rows} of the line
drawing. Let's look at two of the lines in the example:
@itemize[
@item{@litchar{3 9 X;}: ``Repeat the following 3 times: print @racket["X"] nine times, followed by
a newline.''}
@item{@litchar{6 3 b 3 X 3 b;}: ``Repeat the following 6 times: print @racket[" "] three times,
followed by @racket["X"] three times, followed by @racket[" "] three times, followed by a newline.''}
]
Then each line consists of a @emph{repeat} number, followed by pairs of
(number, character) @emph{chunks}. We will
assume here that the intent of the lowercased character @litchar{b} is to
represent the printing of a 1-character whitespace @racket[" "], and for other
uppercase letters to represent the printing of themselves.
Once we have a better idea of the pieces of each line, we have a better chance
to capture that meaning in a formal notation. Once we have each instruction in
a structured format, we should be able to interpret it with a straighforward
case analysis.
Here is a first pass at expressing the structure of these line-drawing
programs.
@subsection{Parsing the concrete syntax}
@filebox["simple-line-drawing.rkt"]{
@verbatim|{
#lang ragg
drawing: rows*
rows: repeat chunk+ ";"
repeat: INTEGER
chunk: INTEGER STRING
}|
}
@margin-note{@secref{ragg-syntax} describes @tt{ragg}'s syntax in more detail.}
We write a @tt{ragg} program as an extended BNF grammar, where patterns can be:
@itemize[
@item{the names of other rules (e.g. @racket[chunk])}
@item{literal and symbolic token names (e.g. @racket[";"], @racket[INTEGER])}
@item{quantified patterns (e.g. @litchar{+} to represent one-or-more repetitions)}
]
The result of a @tt{ragg} program is a module with a @racket[parse] function
that can parse tokens and produce a syntax object as a result.
Let's exercise this function:
@interaction[#:eval my-eval
(require ragg/support)
@eval:alts[(require "simple-line-drawing.rkt")
(require ragg/examples/simple-line-drawing)]
(define stx
(parse (list (token 'INTEGER 6)
(token 'INTEGER 2)
(token 'STRING " ")
(token 'INTEGER 3)
(token 'STRING "X")
";")))
(syntax->datum stx)
]
Tokens can either be: plain strings, symbols, or instances produced by the
@racket[token] function. (Plus a few more special cases, one in which we'll describe in a
moment.)
Preferably, we want to attach each token with auxiliary source location
information. The more source location we can provide, the better, as the
syntax objects produced by @racket[parse] will incorporate them.
Let's write a helper function, a @emph{lexer}, to help us construct tokens more
easily. The Racket standard library comes with a module called
@racketmodname[parser-tools/lex] which can help us write a position-sensitive
tokenizer:
@interaction[#:eval my-eval
(require parser-tools/lex)
(define (tokenize ip)
(port-count-lines! ip)
(define my-lexer
(lexer-src-pos
[(repetition 1 +inf.0 numeric)
(token 'INTEGER (string->number lexeme))]
[upper-case
(token 'STRING lexeme)]
["b"
(token 'STRING " ")]
[";"
(token ";" lexeme)]
[whitespace
(token 'WHITESPACE lexeme #:skip? #t)]
[(eof)
(void)]))
(define (next-token) (my-lexer ip))
next-token)
(define a-sample-input-port (open-input-string "6 2 b 3 X;"))
(define token-thunk (tokenize a-sample-input-port))
@code:comment{Now we can pass token-thunk to the parser:}
(define another-stx (parse token-thunk))
(syntax->datum another-stx)
@code:comment{The syntax object has location information:}
(syntax-line another-stx)
(syntax-column another-stx)
(syntax-span another-stx)
]
There are a few things to note from this lexer example:
@itemize[
@item{The @racket[parse] function can consume either sequences of tokens, or a
function that produces tokens. Both of these are considered sources of
tokens.}
@item{As a special case for acceptable tokens, a token can also be an instance
of the @racket[position-token] structure of @racketmodname[parser-tools/lex],
in which case the token will try to derive its position from that of the
position-token.}
@item{The @racket[parse] function will stop reading from a token source if any
token is @racket[void].}
@item{The @racket[parse] function will skip over any token with the
@racket[#:skip?] attribute. Elements such as whitespace and comments will
often have @racket[#:skip?] set to @racket[#t].}
]
@subsection{From parsing to interpretation}
We now have a parser for programs written in this simple-line-drawing language.
Our parser will give us back syntax objects:
@interaction[#:eval my-eval
(define parsed-program
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(syntax->datum parsed-program)
]
Moreover, we know that these syntax objects have a regular, predictable
structure. Their structure follows the grammar, so we know we'll be looking at
values of the form:
@racketblock[
(drawing (rows (repeat <number>)
(chunk <number> <string>) ... ";")
...)
]
where @racket[drawing], @racket[rows], @racket[repeat], and @racket[chunk]
should be treated literally, and everything else will be numbers or strings.
Still, these syntax object values are just inert structures. How do we
interpret them, and make them @emph{print}? We did claim at the beginning of
this section that these syntax objects should be fairly easy to case-analyze
and interpret, so let's do it.
@margin-note{This is a very quick-and-dirty treatment of @racket[syntax-parse].
See the @racketmodname[syntax/parse] documentation for a gentler guide to its
features.} Racket provides a special form called @racket[syntax-parse] in the
@racketmodname[syntax/parse] library. @racket[syntax-parse] lets us do a
structural case-analysis on syntax objects: we provide it a set of patterns to
parse and actions to perform when those patterns match.
As a simple example, we can write a function that looks at a syntax object and
says @racket[#t] if it's the literal @racket[yes], and @racket[#f] otherwise:
@interaction[#:eval my-eval
(require syntax/parse)
@code:comment{yes-syntax-object?: syntax-object -> boolean}
@code:comment{Returns true if the syntax-object is yes.}
(define (yes-syntax-object? stx)
(syntax-parse stx
[(~literal yes)
#t]
[else
#f]))
(yes-syntax-object? #'yes)
(yes-syntax-object? #'nooooooooooo)
]
Here, we use @racket[~literal] to let @racket[syntax-parse] know that
@racket[yes] should show up literally in the syntax object. The patterns can
also have some structure to them, such as:
@racketblock[({~literal drawing} rows-stxs ...)]
which matches on syntax objects that begin, literally, with @racket[drawing],
followed by any number of rows (which are syntax objects too).
Now that we know a little bit more about @racket[syntax-parse],
we can use it to do a case analysis on the syntax
objects that our @racket[parse] function gives us.
We start by defining a function on syntax objects of the form @racket[(drawing
rows-stx ...)].
@interaction[#:eval my-eval
(define (interpret-drawing drawing-stx)
(syntax-parse drawing-stx
[({~literal drawing} rows-stxs ...)
(for ([rows-stx (syntax->list #'(rows-stxs ...))])
(interpret-rows rows-stx))]))]
When we encounter a syntax object with @racket[(drawing rows-stx
...)], then @racket[interpret-rows] each @racket[rows-stx].
@;The pattern we
@;express in @racket[syntax-parse] above marks what things should be treated
@;literally, and the @racket[...] is a a part of the pattern matching language
@;known by @racket[syntax-parse] that lets us match multiple instances of the
@;last pattern.
Let's define @racket[interpret-rows] now:
@interaction[#:eval my-eval
(define (interpret-rows rows-stx)
(syntax-parse rows-stx
[({~literal rows}
({~literal repeat} repeat-number)
chunks ... ";")
(for ([i (syntax-e #'repeat-number)])
(for ([chunk-stx (syntax->list #'(chunks ...))])
(interpret-chunk chunk-stx))
(newline))]))]
For a @racket[rows], we extract out the @racket[repeat-number] out of the
syntax object and use it as the range of the @racket[for] loop. The inner loop
walks across each @racket[chunk-stx] and calls @racket[interpret-chunk] on it.
Finally, we need to write a definition for @racket[interpret-chunk]. We want
it to extract out the @racket[chunk-size] and @racket[chunk-string] portions,
and print to standard output:
@interaction[#:eval my-eval
(define (interpret-chunk chunk-stx)
(syntax-parse chunk-stx
[({~literal chunk} chunk-size chunk-string)
(for ([k (syntax-e #'chunk-size)])
(display (syntax-e #'chunk-string)))]))
]
@margin-note{Here are the definitions in a single file:
@link["examples/simple-line-drawing/interpret.rkt"]{interpret.rkt}.}
With these definitions in hand, now we can pass it syntax objects
that we construct directly by hand:
@interaction[#:eval my-eval
(interpret-chunk #'(chunk 3 "X"))
(interpret-drawing #'(drawing (rows (repeat 5) (chunk 3 "X") ";")))
]
or we can pass it the result generated by our parser:
@interaction[#:eval my-eval
(define parsed-program
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(interpret-drawing parsed-program)]
And now we've got an interpreter!
@subsection{From interpretation to compilation}
@margin-note{For a gentler tutorial on writing @litchar{#lang} extensions, see:
@link["http://hashcollision.org/brainfudge"]{F*dging up a Racket}.} (Just as a
warning: the following material is slightly more advanced, but shows how
writing a compiler for the line-drawing language reuses the ideas for the
interpreter.)
Wouldn't it be nice to be able to write something like:
@nested[#:style 'inset]{
@verbatim|{
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|}
and have Racket automatically compile this down to something like this?
@racketblock[
(for ([i 3])
(for ([k 9]) (displayln "X"))
(newline))
(for ([i 6])
(for ([k 3]) (displayln " "))
(for ([k 3]) (displayln "X"))
(for ([k 3]) (displayln " "))
(newline))
(for ([i 3])
(for ([k 9]) (displayln "X"))
(newline))
]
Well, of course it won't work: we don't have a @litchar{#lang} line.
Let's add one.
@filebox["letter-i.rkt"]{
@verbatim|{
#lang ragg/examples/simple-line-drawing
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|
}
Now @filepath{letter-i.rkt} is a program.
How does this work? From the previous sections, we've seen how to take the
contents of a file and interpret it. What we want to do now is teach Racket
how to compile programs labeled with this @litchar{#lang} line. We'll do two
things:
@itemize[
@item{Tell Racket to use the @tt{ragg}-generated parser and lexer we defined
earlier whenever it sees a program written with
@litchar{#lang ragg/examples/simple-line-drawing}.}
@item{Define transformation rules for @racket[drawing], @racket[rows], and
@racket[chunk] to rewrite these into standard Racket forms.}
]
The second part, the writing of the transformation rules, will look very
similar to the definitions we wrote for the interpreter, but the transformation
will happen at compile-time. (We @emph{could} just resort to simply calling
into the interpreter we just wrote up, but this section is meant to show that
compilation is also viable.)
We do the first part by defining a @emph{module reader}: a
@link["http://docs.racket-lang.org/guide/syntax_module-reader.html"]{module
reader} tells Racket how to parse and compile a file. Whenever Racket sees a
@litchar{#lang <name>}, it looks for a corresponding module reader in
@filepath{<name>/lang/reader}.
Here's the definition for
@filepath{ragg/examples/simple-line-drawing/lang/reader.rkt}:
@filebox["ragg/examples/simple-line-drawing/lang/reader.rkt"]{
@codeblock|{
#lang s-exp syntax/module-reader
ragg/examples/simple-line-drawing/semantics
#:read my-read
#:read-syntax my-read-syntax
#:whole-body-readers? #t
(require ragg/examples/simple-line-drawing/lexer
ragg/examples/simple-line-drawing/grammar)
(define (my-read in)
(syntax->datum (my-read-syntax #f in)))
(define (my-read-syntax src ip)
(list (parse src (tokenize ip))))
}|
}
We use a helper module @racketmodname[syntax/module-reader], which provides
utilities for creating a module reader. It uses the lexer and
@tt{ragg}-generated parser we defined earlier (saved into
@link["http://hashcollision.org/ragg/examples/simple-line-drawing/lexer.rkt"]{lexer.rkt}
and
@link["http://hashcollision.org/ragg/examples/simple-line-drawing/grammar.rkt"]{grammar.rkt}
modules), and also tells Racket that it should compile the forms in the syntax
object using a module called @filepath{semantics.rkt}.
@margin-note{For a systematic treatment on capturing the semantics of
a language, see @link["http://cs.brown.edu/~sk/Publications/Books/ProgLangs/"]{Programming Languages: Application and
Interpretation}.}
Let's look into @filepath{semantics.rkt} and see what's involved in
compilation:
@filebox["ragg/examples/simple-line-drawing/semantics.rkt"]{
@codeblock|{
#lang racket/base
(require (for-syntax racket/base syntax/parse))
(provide #%module-begin
;; We reuse Racket's treatment of raw datums, specifically
;; for strings and numbers:
#%datum
;; And otherwise, we provide definitions of these three forms.
;; During compiliation, Racket uses these definitions to
;; rewrite into for loops, displays, and newlines.
drawing rows chunk)
;; Define a few compile-time functions to do the syntax rewriting:
(begin-for-syntax
(define (compile-drawing drawing-stx)
(syntax-parse drawing-stx
[({~literal drawing} rows-stxs ...)
(syntax/loc drawing-stx
(begin rows-stxs ...))]))
(define (compile-rows rows-stx)
(syntax-parse rows-stx
[({~literal rows}
({~literal repeat} repeat-number)
chunks ...
";")
(syntax/loc rows-stx
(for ([i repeat-number])
chunks ...
(newline)))]))
(define (compile-chunk chunk-stx)
(syntax-parse chunk-stx
[({~literal chunk} chunk-size chunk-string)
(syntax/loc chunk-stx
(for ([k chunk-size])
(display chunk-string)))])))
;; Wire up the use of "drawing", "rows", and "chunk" to these
;; transformers:
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)
}|
}
The semantics hold definitions for @racket[compile-drawing],
@racket[compile-rows], and @racket[compile-chunk], similar to what we had for
interpretation with @racket[interpret-drawing], @racket[interpret-rows], and
@racket[interpret-chunk]. However, compilation is not the same as
interpretation: each definition does not immediately execute the act of
drawing, but rather returns a syntax object whose evaluation will do the actual
work.
There are a few things to note:
@itemize[
@item{@tt{ragg}'s native data structure is the syntax object because the
majority of Racket's language-processing infrastructure knows how to read and
write this structured value.}
@item{
@margin-note{By the way, we can just as easily rewrite the semantics so that
@racket[compile-rows] does explicitly call @racket[compile-chunk]. Often,
though, it's easier to write the transformation functions in this piecemeal way
and depend on the Racket macro expansion system to do the rewriting as it
encounters each of the forms.}
Unlike in interpretation, @racket[compile-rows] doesn't
compile each chunk by directly calling @racket[compile-chunk]. Rather, it
depends on the Racket macro expander to call each @racket[compile-XXX] function
as it encounters a @racket[drawing], @racket[rows], or @racket[chunk] in the
parsed value. The three statements at the bottom of @filepath{semantics.rkt} inform
the macro expansion system to do this:
@racketblock[
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)
]}
]
Altogether, @tt{ragg}'s intent is to be a parser generator generator for Racket
that's easy and fun to use. It's meant to fit naturally with the other tools
in the Racket language toolchain. Hopefully, it will reduce the friction in
making new languages with alternative concrete syntaxes.
The rest of this document describes the @tt{ragg} language and the parsers it
generates.
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@section{The language}
@subsection[#:tag "ragg-syntax"]{Syntax and terminology}
A program in the @tt{ragg} language consists of the language line
@litchar{#lang ragg}, followed by a collection of @tech{rule}s and
@tech{line comment}s.
A @deftech{rule} is a sequence consisting of: a @tech{rule identifier}, a colon
@litchar{":"}, and a @tech{pattern}.
A @deftech{rule identifier} is an @tech{identifier} that is not in upper case.
A @deftech{token identifier} is an @tech{identifier} that is in upper case.
An @deftech{identifier} is a character sequence of letters, numbers, and
characters in @racket["-.!$%&/<=>?^_~@"]. It must not contain
@litchar{*} or @litchar{+}, as those characters are used to denote
quantification.
A @deftech{pattern} is one of the following:
@itemize[
@item{an implicit sequence of @tech{pattern}s separated by whitespace}
@item{a terminal: either a literal string or a @tech{token identifier}}
@item{a @tech{rule identifier}}
@item{a @deftech{choice pattern}: a sequence of @tech{pattern}s delimited with @litchar{|} characters.}
@item{a @deftech{quantifed pattern}: a @tech{pattern} followed by either @litchar{*} (``zero or more'') or @litchar{+} (``one or more'')}
@item{an @deftech{optional pattern}: a @tech{pattern} surrounded by @litchar{[} and @litchar{]}}
@item{an explicit sequence: a @tech{pattern} surrounded by @litchar{(} and @litchar{)}}]
A @deftech{line comment} begins with either @litchar{#} or @litchar{;} and
continues till the end of the line.
For example, in the following program:
@nested[#:style 'inset
@verbatim|{
#lang ragg
;; A parser for a silly language
sentence: verb optional-adjective object
verb: greeting
optional-adjective: ["happy" | "frumpy"]
greeting: "hello" | "hola" | "aloha"
object: "world" | WORLD
}|]
the elements @tt{sentence}, @tt{verb}, @tt{greeting}, and @tt{object} are rule
identifiers. The first rule, @litchar{sentence: verb optional-adjective
object}, is a rule whose right side is an implicit pattern sequence of three
sub-patterns. The uppercased @tt{WORLD} is a token identifier. The fourth rule in the program associates @tt{greeting} with a @tech{choice pattern}.
More examples:
@itemize[
@item{A
@link["http://hashcollision.org/ragg/examples/01-equal.rkt"]{BNF} for binary
strings that contain an equal number of zeros and ones.
@verbatim|{
#lang ragg
equal: [zero one | one zero] ;; equal number of "0"s and "1"s.
zero: "0" equal | equal "0" ;; has an extra "0" in it.
one: "1" equal | equal "1" ;; has an extra "1" in it.
}|
}
@item{A @link["http://hashcollision.org/ragg/examples/baby-json.rkt"]{BNF} for
@link["http://www.json.org/"]{JSON}-like structures.
@verbatim|{
#lang ragg
json: number | string
| array | object
number: NUMBER
string: STRING
array: "[" [json ("," json)*] "]"
object: "{" [kvpair ("," kvpair)*] "}"
kvpair: ID ":" json
}|
}
]
The @link["https://github.com/dyoo/ragg"]{ragg github source repository}
includes
@link["https://github.com/dyoo/ragg/tree/master/ragg/examples"]{several more
examples}.
@subsection{Syntax errors}
Besides the basic syntax errors that can occur with a malformed grammar, there
are a few other classes of situations that @litchar{#lang ragg} will consider
as syntax errors.
@tt{ragg} will raise a syntax error if the grammar:
@itemize[
@item{doesn't have any rules.}
@item{has a rule with the same left hand side as any other rule.}
@item{refers to rules that have not been defined. e.g. the
following program:
@nested[#:style 'code-inset
@verbatim|{
#lang ragg
foo: [bar]
}|
]
should raise an error because @tt{bar} has not been defined, even though
@tt{foo} refers to it in an @tech{optional pattern}.}
@item{uses the token name @racket[EOF]; the end-of-file token type is reserved
for internal use by @tt{ragg}.}
@item{contains a rule that has no finite derivation. e.g. the following
program:
@nested[#:style 'code-inset
@verbatim|{
#lang ragg
infinite-a: "a" infinite-a
}|
]
should raise an error because no finite sequence of tokens will satisfy
@tt{infinite-a}.}
]
Otherwise, @tt{ragg} should be fairly tolerant and permit even ambiguous
grammars.
@subsection{Semantics}
@declare-exporting[ragg/examples/nested-word-list]
A program written in @litchar{#lang ragg} produces a module that provides a few
bindings. The most important of these is @racket[parse]:
@defproc[(parse [source any/c #f]
[token-source (or/c (sequenceof token)
(-> token))])
syntax?]{
Parses the sequence of @tech{tokens} according to the rules in the grammar, using the
first rule as the start production. The parse must completely consume
@racket[token-source].
The @deftech{token source} can either be a sequence, or a 0-arity function that
produces @tech{tokens}.
A @deftech{token} in @tt{ragg} can be any of the following values:
@itemize[
@item{a string}
@item{a symbol}
@item{an instance produced by @racket[token]}
@item{an instance produced by the token constructors of @racketmodname[parser-tools/lex]}
@item{an instance of @racketmodname[parser-tools/lex]'s @racket[position-token] whose
@racket[position-token-token] is a @tech{token}.}
]
A token whose type is either @racket[void] or @racket['EOF] terminates the
source.
If @racket[parse] succeeds, it will return a structured syntax object. The
structure of the syntax object follows the overall structure of the rules in
the BNF. For each rule @racket[r] and its associated pattern @racket[p],
@racket[parse] generates a syntax object @racket[#'(r p-value)] where
@racket[p-value]'s structure follows a case analysis on @racket[p]:
@itemize[
@item{For implicit and explicit sequences of @tech{pattern}s @racket[p1],
@racket[p2], ..., the corresponding values, spliced into the
structure.}
@item{For terminals, the value associated to the token.}
@item{For @tech{rule identifier}s: the associated parse value for the rule.}
@item{For @tech{choice pattern}s: the associated parse value for one of the matching subpatterns.}
@item{For @tech{quantifed pattern}s and @tech{optional pattern}s: the corresponding values, spliced into the structure.}
]
Consequently, it's only the presence of @tech{rule identifier}s in a rule's
pattern that informs the parser to introduces nested structure into the syntax
object.
If the grammar has ambiguity, @tt{ragg} will choose and return a parse, though
it does not guarantee which one it chooses.
If the parse cannot be performed successfully, or if a token in the
@racket[token-source] uses a type that isn't mentioned in the grammar, then
@racket[parse] raises an instance of @racket[exn:fail:parsing].}
It's often convenient to extract a parser for other non-terminal rules in the
grammar, and not just for the first rule. A @tt{ragg}-generated module also
provides a form called @racket[make-rule-parser] to extract a parser for the
other non-terminals:
@defform[#:id make-rule-parser
(make-rule-parser name)]{
Constructs a parser for the @racket[name] of one of the non-terminals
in the grammar.
For example, given the @tt{ragg} program
@filepath{simple-arithmetic-grammar.rkt}:
@filebox["simple-arithmetic-grammar.rkt"]{
@verbatim|{
#lang ragg
expr : term ('+' term)*
term : factor ('*' factor)*
factor : INT
}|
}
the following interaction shows how to extract a parser for @racket[term]s.
@interaction[#:eval my-eval
@eval:alts[(require "simple-arithmetic-grammar.rkt")
(require ragg/examples/simple-arithmetic-grammar)]
(define term-parse (make-rule-parser term))
(define tokens (list (token 'INT 3)
"*"
(token 'INT 4)))
(syntax->datum (parse tokens))
(syntax->datum (term-parse tokens))
(define another-token-sequence
(list (token 'INT 1) "+" (token 'INT 2)
"*" (token 'INT 3)))
(syntax->datum (parse another-token-sequence))
@code:comment{Note that term-parse will break on another-token-sequence}
@code:comment{as it does not know what to do with the "+"}
(term-parse another-token-sequence)
]
}
Finally, the module provides a set of all the used token types in the grammar
in @racket[all-token-types]:
@defthing[all-token-types (setof symbol?)]{
A set of all the token types used in a grammar.
For example:
@interaction[#:eval my-eval
@eval:alts[(require "simple-arithmetic-grammar.rkt")
(require ragg/examples/simple-arithmetic-grammar)]
all-token-types
]
}
@section{Support API}
@defmodule[ragg/support]
The @racketmodname[ragg/support] module provides functions to interact with
@tt{ragg} programs. The most useful is the @racket[token] function, which
produces tokens to be parsed.
@defproc[(token [type (or/c string? symbol?)]
[val any/c #f]
[#:line line (or/c positive-integer? #f) #f]