#lang scribble/manual
@(require planet/scribble
          planet/version
          scribble/eval
          (for-label (this-package-in main)))
@(define myeval (make-base-eval))
@(myeval '(require racket/sequence))
@(myeval '(require racket/list))
@title{python-tokenizer: a translation of Python's @tt{tokenize} library for Racket}
@author+email["Danny Yoo" ""]
This is a fairly close translation of the @tt{tokenize}
library from @link[""]{Python}.
The main function, @racket[generate-tokens], consumes an input port
and produces a sequence of tokens.
For example:
@interaction[#:eval myeval
(require (planet dyoo/python-tokenizer))
(define sample-input (open-input-string "def d22(a, b, c=2, d=2, *k): pass"))
(define tokens
  (generate-tokens sample-input))
(for ([t tokens])
  (printf "~s ~s ~s ~s\n" (first t) (second t) (third t) (fourth t)))
]
@defproc[(generate-tokens [inp input-port?]) (sequenceof (list/c symbol? string? (list/c number? number?) (list/c number? number?) string?))]{
Consumes an input port and produces a sequence of tokens.
Each token is a 5-tuple consisting of:
@itemize[#:style 'ordered
@item{token-type: one of the following symbols:
@racket['NAME], @racket['NUMBER], @racket['STRING],
@racket['OP], @racket['COMMENT], @racket['NL],
@racket['NEWLINE], @racket['DEDENT], @racket['INDENT],
@racket['ERRORTOKEN], or @racket['ENDMARKER]. The only difference between
@racket['NEWLINE] and @racket['NL] is that @racket['NEWLINE] will only occur
if the indentation level is at @racket[0].}
@item{text: the string content of the token.}
@item{start-pos: the line and column as a list of two numbers.}
@item{end-pos: the line and column as a list of two numbers.}
@item{current-line: the current line that the tokenizer is on.}
]
The last token produced, under normal circumstances, will be
@racket['ENDMARKER].
If a recoverable error occurs, @racket[generate-tokens] will produce
single-character tokens with the @racket['ERRORTOKEN] type until it
can recover.
Unrecoverable errors occur when the tokenizer encounters @racket[eof]
in the middle of a multi-line string or statement, or if an
indentation level is inconsistent. On an unrecoverable error,
@racket[generate-tokens] will raise an @racket[exn:fail:token] or
@racket[exn:fail:indentation] error.
}
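As a rough sketch of how the 5-element token lists described above can be
taken apart, the following module destructures each token with
@racket[match-define]. The helper name @racket[summarize-tokens] is made up
for this illustration and is not part of the package.

@codeblock|{
#lang racket
(require (planet dyoo/python-tokenizer))

;; summarize-tokens is a hypothetical helper, not provided by the package.
;; It prints one line per token, naming each of the five fields.
(define (summarize-tokens source)
  (for ([tok (generate-tokens (open-input-string source))])
    (match-define (list type text
                        (list start-line start-col)
                        (list end-line end-col)
                        current-line)
      tok)
    ;; current-line is bound only to show the shape of the token list
    (printf "~a ~s from ~a:~a to ~a:~a\n"
            type text start-line start-col end-line end-col)))

(summarize-tokens "x = 1\n")
}|

Binding the positions as nested lists keeps the line and column available
separately, which tends to be convenient when reporting source locations.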
@defstruct[(exn:fail:token exn:fail) ([loc (list/c number? number?)])]{
Raised when @racket[eof] is unexpectedly encountered.
@racket[exn:fail:token-loc] holds the start position.
}
@defstruct[(exn:fail:indentation exn:fail) ([loc (list/c number? number?)])]{
Raised when the indentation is inconsistent.
@racket[exn:fail:indentation-loc] holds the start position.
}
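One way to guard against the two unrecoverable errors is to wrap the
consumption of the token sequence in @racket[with-handlers]. The sketch below
assumes the predicates and accessors generated by the two structure
definitions above; @racket[try-tokenize] is a hypothetical helper name, and
the sample input is only expected, not guaranteed, to trigger
@racket[exn:fail:token].

@codeblock|{
#lang racket
(require (planet dyoo/python-tokenizer))

;; try-tokenize is a hypothetical helper, not provided by the package.
;; for/list consumes the entire sequence inside the handlers, so an error
;; raised at any point during tokenization is caught here.
(define (try-tokenize source)
  (with-handlers ([exn:fail:token?
                   (lambda (e)
                     (list 'token-error
                           (exn-message e)
                           (exn:fail:token-loc e)))]
                  [exn:fail:indentation?
                   (lambda (e)
                     (list 'indentation-error
                           (exn-message e)
                           (exn:fail:indentation-loc e)))])
    (for/list ([tok (generate-tokens (open-input-string source))])
      tok)))

;; The input ends inside an open parenthesis, that is, mid-statement.
(try-tokenize "x = (1 +\n")
}|

On success the helper returns the list of tokens; on failure it returns a
tagged list carrying the message and the reported start position.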
@section{Translator comments}
The translation is a fairly direct one; I wrote an
@link[""]{auxiliary package} to deal
with the @racket[while] loops, which proved invaluable during the
translation of the code. It may be instructive to compare the
source here to that of Python's @tt{tokenize.py}.
Here are some points I observed while doing the translation:
@itemize[
@item{Mutation pervades the entirety of the tokenizer's main loop.
The main reason is that @racket[while] has no return value and
doesn't carry variables around; the @racket[while] loop communicates
values from one part of the code to others through mutation, often in
wildly distant locations.}
@item{Racket makes a syntactic distinction between variable definition
(@racket[define]) and mutation (@racket[set!]). I've had to deduce
which variables were intended to be temporaries, and hopefully I
haven't introduced any errors along the way.}
@item{In some cases, Racket has finer-grained type distinctions than
Python. Python does not use a separate type to represent individual
characters, and instead uses a length-1 string. In this translation,
I've used characters where I think they're appropriate.}
@item{Most uses of raw strings in Python can be translated to
uses of the}
@item{Generators in Racket and in Python are pretty similar, though
the Racket documentation could do a better job of documenting them.
When dealing with generators in Racket, what one usually wants to
produce is a generic sequence. For that reason, the
Racket documentation really needs to place more emphasis on
@racket[in-generator], not the raw @racket[generator] form.}
@item{Python heavily overloads the @tt{in} operator. Its expressivity
makes it easy to write code with it. On the flip side, its
flexibility makes it a little harder to know what it actually means.}
@item{Regular expressions, on the whole, match
@; Yeah, that's a pun. I had to get that in somewhere... :)
well between the two
languages. Minor differences in the syntax are potholes: Racket's
regular expression matcher does not have an implicit @emph{begin}
anchor, and Racket's regexps are more sensitive to escape characters.
Python's regexp engine returns a single match object that can support
different operators. Racket, on the other hand, requires the user to
select between getting the position of the match, with
@racket[regexp-match-positions], or getting the textual content with
@racket[regexp-match].}
]
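Several of these observations can be seen in a small, self-contained sketch.
It is not taken from the tokenizer itself: @racket[scan-words] is a
hypothetical toy scanner, far simpler than @racket[generate-tokens], written
only to show @racket[in-generator], the explicit @litchar{^} anchor, and the
split between @racket[regexp-match-positions] and @racket[regexp-match].

@codeblock|{
#lang racket
(require racket/generator)

;; regexp-match searches anywhere in the input, much like Python's re.search,
;; so a leading ^ is needed to mimic the implicit anchor of Python's re.match.
(regexp-match #rx"[0-9]+" "abc123")            ; => '("123")
(regexp-match #rx"^[0-9]+" "abc123")           ; => #f

;; Positions and text come from two different functions.
(regexp-match-positions #rx"[0-9]+" "abc123")  ; => '((3 . 6))

;; scan-words is a hypothetical toy scanner.  in-generator turns the loop
;; into a sequence that for loops can consume directly.
(define (scan-words str)
  (in-generator
   (let loop ([pos 0])
     (when (< pos (string-length str))
       (cond
         ;; string-ref yields a character, not a length-1 string
         [(char-whitespace? (string-ref str pos))
          (loop (add1 pos))]
         ;; with a start offset, ^ anchors the match at that offset
         [(regexp-match-positions #rx"^[a-zA-Z_]+" str pos)
          => (lambda (m)
               (define end (cdr (first m)))
               (yield (list 'NAME (substring str pos end)))
               (loop end))]
         [else
          (yield (list 'OTHER (string (string-ref str pos))))
          (loop (add1 pos))])))))

(for ([tok (scan-words "def f(x)")])
  (printf "~s\n" tok))
}|

The scanner threads the position through a named @racket[let] instead of
mutating it, which is one way to avoid the @racket[set!]-heavy style that a
direct translation of a @racket[while] loop tends to produce.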
@section{Release history}
@itemize[
@item{1.0 (2012-02-29): Initial release.}
@item{1.1 (2012-09-10): Bug fix. Corrected an infinite-loop bug due to mis-typing a paren. Thanks to Joe Politz for the bug report!}
]