#lang scribble/manual
@(require planet/scribble
scribble/eval
(for-label (this-package-in main)
racket/base
racket/sequence))
@(define myeval (make-base-eval))
@(myeval '(require racket/sequence))
@(myeval '(require racket/list))
@title{python-tokenizer: a translation of Python's @tt{tokenize.py} library for Racket}
@author+email["Danny Yoo" "dyoo@hashcollision.org"]
This is a fairly close translation of the
@link["http://hg.python.org/cpython/file/2.7/Lib/tokenize.py"]{@tt{tokenize.py}}
library from @link["http://python.org"]{Python}.
The main function, @racket[generate-tokens], consumes an input port
and produces a sequence of tokens.
For example:
@interaction[#:eval myeval
(require (planet dyoo/python-tokenizer))
(define sample-input (open-input-string "def d22(a, b, c=2, d=2, *k): pass"))
(define tokens
(generate-tokens sample-input))
(for ([t tokens])
(printf "~s ~s ~s ~s\n" (first t) (second t) (third t) (fourth t)))
]
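
Because each token is a plain list, ordinary list operations and
@racket[match] patterns apply to it. The following is a minimal sketch
rather than part of the package: @racket[names-in] is a hypothetical
helper that keeps only the @racket['NAME] tokens and pairs each name
with the line and column where it starts.

@codeblock|{
#lang racket/base
(require (planet dyoo/python-tokenizer)
         racket/match)

;; names-in: hypothetical helper, not provided by python-tokenizer.
;; Returns a list of (text line column) entries, one per NAME token.
(define (names-in source)
  (for/list ([tok (generate-tokens (open-input-string source))]
             #:when (eq? (car tok) 'NAME))
    (match tok
      ;; token-type, text, start-pos, end-pos, current-line
      [(list _ text (list line col) _ _)
       (list text line col)])))

(names-in "def d22(a, b, c=2, d=2, *k): pass")
}|

Filtering inside the @racket[for/list] clause keeps the traversal to a
single pass over the token sequence.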
@section{API}
@defmodule/this-package[main]
@defproc[(generate-tokens [inp input-port?]) (sequenceof (list/c symbol? string? (list/c number? number?) (list/c number? number?) string?))]{
Consumes an input port and produces a sequence of tokens.
Each token is a 5-tuple consisting of:
@itemize[#:style 'ordered
@item{token-type: one of the following symbols:
@racket['NAME], @racket['NUMBER], @racket['STRING],
@racket['OP], @racket['COMMENT], @racket['NL],
@racket['NEWLINE], @racket['DEDENT], @racket['INDENT],
@racket['ERRORTOKEN], or @racket['ENDMARKER]. The only difference between
@racket['NEWLINE] and @racket['NL] is that @racket['NEWLINE] will only occur
if the indentation level is at @racket[0].}
@item{text: the string content of the token.}
@item{start-pos: the line and column where the token starts, as a list of two numbers.}
@item{end-pos: the line and column where the token ends, as a list of two numbers.}
@item{current-line: the text of the line that the tokenizer is currently on.}
]
The last token produced, under normal circumstances, will be
@racket['ENDMARKER].
If a recoverable error occurs, @racket[generate-tokens] will produce
single-character tokens with the @racket['ERRORTOKEN] type until it
can recover.
Unrecoverable errors occur when the tokenizer encounters @racket[eof]
in the middle of a multi-line string or statement, or if an
indentation level is inconsistent. On an unrecoverable error,
@racket[generate-tokens] will raise an @racket[exn:fail:token] or
@racket[exn:fail:indentation] error.
@defstruct[(exn:fail:token exn:fail) ([loc (list/c number? number?)])]{
Raised when @racket[eof] is unexpectedly encountered.
@racket[exn:fail:token-loc] holds the start position.
}
@defstruct[(exn:fail:indentation exn:fail) ([loc (list/c number? number?)])]{
Raised when the indentation is inconsistent.
@racket[exn:fail:indentation-loc] holds the start position.
}
}
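
The two exception types can be caught with @racket[with-handlers].
What follows is a sketch under a few assumptions, not a definitive
recipe: @racket[try-tokenize] is a hypothetical helper, and the two
sample inputs are only my guesses at text that triggers each error, based
on the description above. The iteration itself is placed inside
@racket[with-handlers] on the assumption that tokens are produced
lazily, so an error may not surface until the sequence is consumed.

@codeblock|{
#lang racket/base
(require (planet dyoo/python-tokenizer))

;; try-tokenize: hypothetical helper, not provided by python-tokenizer.
;; Returns the list of tokens, or a short description of the failure.
(define (try-tokenize source)
  (with-handlers ([exn:fail:token?
                   (lambda (e)
                     (list 'token-error
                           (exn:fail:token-loc e)
                           (exn-message e)))]
                  [exn:fail:indentation?
                   (lambda (e)
                     (list 'indentation-error
                           (exn:fail:indentation-loc e)
                           (exn-message e)))])
    ;; Consume the whole sequence here, inside the handlers.
    (for/list ([tok (generate-tokens (open-input-string source))])
      tok)))

;; Assumption: eof arrives inside an unterminated triple-quoted string.
(try-tokenize "x = '''never closed\n")

;; Assumption: the last line dedents to a column that matches no
;; enclosing block.
(try-tokenize "if x:\n        y = 1\n    z = 2\n")
}|

Recoverable errors need no handler; they show up in the result as
@racket['ERRORTOKEN] entries.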
@section{Translator comments}
The translation is a fairly direct one; I wrote an
@link["https://github.com/dyoo/while-loop"]{auxiliary package} to deal
with the @racket[while] loops, which proved invaluable during the
translation of the code. It may be instructive to compare the
@link["https://github.com/dyoo/python-tokenizer/blob/master/python-tokenizer.rkt"]{source}
here to that of
@link["http://hg.python.org/cpython/file/2.7/Lib/tokenize.py"]{tokenize.py}.
Here are some points I observed while doing the translation:
@itemize[
@item{Mutation pervades the entirety of the tokenizer's main loop.
The main reason is that @racket[while] has no return value and
doesn't carry variables around; the @racket[while] loop communicates
values from one part of the code to others through mutation, often in
wildly distant locations.}
@item{Racket makes a syntactic distinction between variable definition
(@racket[define]) and mutation (@racket[set!]). I've had to deduce
which variables were intended to be temporaries, and hopefully I
haven't induced any errors along the way.}
@item{In some cases, Racket has finer-grained type distinctions than
Python. Python does not use a separate type to represent individual
characters, and instead uses a length-1 string. In this translation,
I've used characters where I think they're appropriate.}
@item{Most uses of raw strings in Python can be translated to
uses of the
@link["http://docs.racket-lang.org/scribble/reader-internals.html#(mod-path._at-exp)"]{at-exp}
reader.}
@item{Generators in Racket and in Python are pretty similar, though
the Racket documentation could do a better job of documenting them.
When dealing with generators in Racket, what one usually wants to
produce is a generic sequence. For that reason, the
Racket documentation really needs to place more emphasis on
@racket[in-generator], not the raw @racket[generator] form.}
@item{Python heavily overloads the @tt{in} operator. Its expressivity
makes it easy to write code with it. On the flip side, its
flexibility makes it a little harder to know what it actually means.}
@item{Regular expressions, on the whole, match
@; Yeah, that's a pun. I had to get that in somewhere... :)
well between the two
languages. Minor differences in the syntax are potholes: Racket's
regular expression matcher does not have an implicit @emph{begin}
anchor, and Racket's regexps are more sensitive to escape characters.
Python's regexp engine returns a single match object that can support
different operators. Racket, on the other hand, requires the user to
select between getting the position of the match, with
@racket[regexp-match-positions], or getting the textual content with
@racket[regexp-match].}
]
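
Three of the points above (characters, generators, and regexp
anchoring) can be tried in plain Racket without the tokenizer. The
sketch below uses only @racket[racket/base] and
@racket[racket/generator]; @racket[count-up] is a made-up example and
does not come from the tokenizer source.

@codeblock|{
#lang racket/base
(require racket/generator)

;; Characters vs. strings: indexing a Racket string yields a character,
;; whereas Python's "abc"[0] is the length-1 string "a".
(string-ref "abc" 0)                 ; the character #\a
(string (string-ref "abc" 0))        ; back to the string "a"

;; Generators as sequences: in-generator wraps the body so that the
;; yielded values can be consumed by any `for` form.
;; count-up is a hypothetical example.
(define (count-up n)
  (in-generator
   (let loop ([i 0])
     (when (< i n)
       (yield i)
       (loop (add1 i))))))

(for/list ([i (count-up 3)]) i)      ; => '(0 1 2)

;; Regexp anchoring: Python's re.match only matches at the start of the
;; string, while Racket's regexp-match searches the whole string unless
;; the pattern is anchored with ^.
(regexp-match #rx"[0-9]+" "abc123")            ; => '("123")
(regexp-match #rx"^[0-9]+" "abc123")           ; => #f
(regexp-match-positions #rx"[0-9]+" "abc123")  ; => '((3 . 6))
}|

When translating a call to Python's @tt{re.match}, adding an explicit
@tt{^} to the Racket pattern is one way to keep the same anchoring.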
@section{Release history}
@itemlist[
@item{1.0 (2012-02-29): Initial release.}
@item{1.1 (2012-09-10): Bug fix. Corrected an infinite-loop bug due to mis-typing a paren. Thanks to Joe Politz for the bug report!}
]