Parsing Gmane with Factor Part 4
Part 4 of the Parsing Gmane with Factor tutorial.
In part 1 we defined the database and in part 2 the structure for populating it from HTTP URLs. One of the remaining tasks is deciding how to display data to the user in a tasteful way. The problem can be demonstrated with this line of code:
IN: scratchpad 10 [ random-mail ] replicate
--- Data stack:
{ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~...
IN: scratchpad
That's the default way Factor displays a list of tuples, and it isn't very user friendly. A better way to show a data structure like the above would be as a table in reverse chronological order, similar to how e-mail clients render users' inboxes.
A common dilemma in software engineering is deciding how general a solution should be. In the example above we have a sequence of exactly 10 mail tuples, so we could theoretically write a function called pretty-print-10-mail-tuples that only works when the number of tuples is exactly 10. However, that is probably too specific; it makes more sense to use the same code whether the sequence length is 10, 0, 5 or 50.
But we can go one step further and write code that handles both sequences of any length and tuples of any type, not just mail tuples! That likely involves writing more code, but code that can be reused in a variety of situations. Thinking further in that direction, here is how we could declaratively describe how the mail tuple table should look:
: db-assigned-id>string ( number/f -- str )
    [ number>string ] [ "f" ] if* ;

: mail-format ( -- seq )
    {
        { "id" t 6 db-assigned-id>string }
        { "mid" t 5 number>string }
        { "group" f 15 >string }
        { "date" f 10 timestamp>ymd }
        { "sender" f 15 >string }
        { "subject" f 50 >string }
    } ;
This data structure is a Factor sequence containing six sequences of length four representing a table with six columns.
- The first item in each subsequence is the name of the tuple slot from which we will retrieve the value with the help of the mirrors (the Factor devs love puns) introspection vocabulary.
- The second item is a boolean value indicating whether the column should be right-aligned or not. Columns containing numerical data should almost always be right-aligned.
- The third item specifies how wide the column should be, in characters. If the data doesn't fit in the allotted space it will be truncated, and if it is too short it will be padded.
- Lastly, we have a reference to a word that converts the slot value to a string.
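As a minimal sketch of the slot lookup described in the first bullet, a mirror can be treated as an assoc keyed by slot name. The demo-point tuple below is hypothetical and exists only for illustration; it is not part of the tutorial's code.

```factor
USING: assocs kernel mirrors prettyprint ;

! demo-point is a hypothetical tuple used only for this illustration.
TUPLE: demo-point x y ;

! Wrap a tuple instance in a mirror and look up slots by name.
1 2 demo-point boa <mirror>
[ "x" swap at . ]       ! prints 1
[ "y" swap at . ] bi    ! prints 2
```

The same lookup, with "id", "mid" and so on as keys, is what the first column of each format entry is meant to drive.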
Note that the id slot is most often a numeric value but can also be f, so we create our own word, db-assigned-id>string, to handle its conversion.
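A quick listener-style check of both cases, with the definition repeated so the snippet is self-contained. The input 42 is just an arbitrary sample value.

```factor
USING: kernel math.parser prettyprint ;

! Same definition as above, repeated so this snippet stands alone.
: db-assigned-id>string ( number/f -- str )
    [ number>string ] [ "f" ] if* ;

42 db-assigned-id>string .   ! prints "42"
f db-assigned-id>string .    ! prints "f"
```

if* keeps the value on the stack for the true branch and drops it for the false branch, which is why the second quotation only has to push a string.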
Initially, the structure didn't contain the fourth element of the subsequences because I thought the present word would handle stringification, but I wasn't happy with the way it stringified timestamps, so it became necessary to specify the conversion word explicitly.
Next, we need to actually write the print-table function, which will take a sequence of tuples and the format declaration above and produce a good-looking table:
IN: scratchpad 10 [ random-mail ] replicate mail-format print-table
... show expected output here
But first we need to create some utility functions.
Factor supports printf-style string formatting, a feature we will make good use of to format the table. Consider the second column declaration in mail-format above:
{ "mid" t 5 number>string }
t means that the column should be right-aligned and 5 is the number of characters it should occupy. It can be directly translated to this format string:
IN: scratchpad 25 number>string "[%5s]\n" printf
[ 25]
Or, if we wanted left-aligned output, we could have used "%-5s" instead:
IN: scratchpad 25 number>string "[%-5s]\n" printf
[25 ]
The built-in string formatting is very convenient for us, as it means we won't have to write any code for left- and right-aligning and padding values ourselves. But there is one big catch: printf is implemented as a macro, so format strings cannot be built dynamically. I think this is a flaw in the library, and it will probably be corrected in a future release. For now, it's just the price one has to pay for using a young language, and because Factor is so powerful it is easy to work around. Create a new vocabulary called gmane.formatting with the following contents:
USING:
    arrays
    calendar.format
    combinators
    formatting formatting.private
    fry
    io io.streams.string
    kernel
    math.parser
    sequences
    strings ;
IN: gmane.formatting

: printf>write-quot ( quot -- quot' )
    {
        { [ dup string? ] [ '[ _ ] ] }
        { [ dup [ "%" ] = ] [ '[ _ first ] ] }
        [ '[ unclip @ ] ]
    } cond [ write ] append ;

: vprintf ( seq format -- )
    parse-printf reverse [ first printf>write-quot ] map
    compose-all call( x -- x ) drop ;

: vsprintf ( seq format -- str )
    [ vprintf ] with-string-writer ;

: vsprintf1 ( elt format -- str )
    [ 1array ] dip vsprintf ;
This tutorial is too short to explain in detail how this rather dense code works. It also reaches into undocumented internal parts of Factor (the formatting.private import), which should be avoided but is sometimes necessary. The best way to understand it is to experiment in the listener to get a feel for how it works:
IN: scratchpad "subject" { t 20 } first2 [ "" "-" ? ] dip 2array "%%%s%ds\n" vsprintf printf
subject
IN: scratchpad "subject" "%20s\n" printf
subject
Note how vsprintf allows you to construct format strings on the fly, which normally would be impossible with printf.
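One piece of vsprintf that is easy to try in isolation is with-string-writer, which captures whatever a quotation writes to the output stream and returns it as a string. The strings written below are arbitrary sample values.

```factor
USING: io io.streams.string prettyprint ;

! Everything written inside the quotation is collected into one string.
[ "id" write "|" write "mid" write ] with-string-writer .
! prints "id|mid"
```

This is how vsprintf turns the side-effecting vprintf into a word that returns its result instead of printing it.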
The prettyprint vocabulary contains a handy word called simple-table.
It accepts a 2-dimensional array of objects and prints a table.
IN: scratchpad { { "foo" "bar" } { "a" "b" } { 33 f } } simple-table.
foo bar
a b
33 f
However, it doesn't handle truncation or right alignment of cell values, so we have to fix up those details ourselves before sending our data to the simple-table. word.
The task can be split into three logical parts:
- Formatting the table header.
- Formatting each table row.
- Aggregating the output from steps 1 and 2 and passing it to the simple-table. word.
Each of these steps is only a few lines of code when built upon the vsprintf word we created earlier.
: table-header ( format -- header-cells )
[ first3 nip "%%-%ds" vsprintf1 vsprintf1 ] map ;
The output from this function is a sequence of strings which are the table headers. first3 nip has the effect of picking the first and third items of the input sequence (the slot name and the column width) and putting them on the stack. Use the join word to see what the result will look like:
IN: scratchpad mail-format table-header "|" join print
id |mid |group |date|sender |subject
Great! Let's do the data rows of the table too.
: table-cell-format ( right-align width -- str )
[ "" "-" ? ] dip 2array "%%%s%ds" vsprintf ;
: table-cell ( right-align width word value -- str )
swap execute( x -- x ) over short head -rot table-cell-format vsprintf1 ;
: table-row ( format row -- row-cells )
<mirror> '[ unclip _ at suffix first4 table-cell ] map ;
This code is significantly harder to understand. But, as always, experimenting with it in the listener should clear things up.
IN: scratchpad mail-format random-mail table-row "|" join print
f| 2466|uybecybpgq |2013|uwifstroyqiexky |aeheotnpsumasyttehbklzetzoxzzx
IN: scratchpad t 10 table-cell-format .
"%10s"
IN: scratchpad { t 10 number>string 75 } first4 table-cell .
" 75"
Note the call to map in table-row and the fried quotation that precedes it. Fried quotations are how you implement anonymous functions with some bound parameters in Factor. In other languages they are called closures. The fried quotation here consumes the row parameter and uses it to create a regular quotation. Since that parameter is consumed, what map will iterate over is the format sequence.
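A smaller sketch of the same idea, using arbitrary numbers instead of mail tuples: the value 10 is consumed from the stack and bound into the _ placeholder, and map then iterates over the remaining sequence.

```factor
USING: fry kernel math prettyprint sequences ;

! 10 is bound into the _ slot; map iterates over { 1 2 3 }.
{ 1 2 3 } 10 '[ _ + ] map .    ! prints { 11 12 13 }

! The equivalent quotation with the value written out by hand.
{ 1 2 3 } [ 10 + ] map .       ! prints { 11 12 13 }
```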
If the fry syntax feels tricky, you can manually "expand" it in the listener:
IN: scratchpad mail-format [ unclip random-mail <mirror> at suffix first4 table-cell ] map .
{
" f"
" 2363"
"dhikaejcaa "
"2012"
"gjsgosbexvnwliv "
"jcouzszjztsnvhkmtzkoordyihqsvb"
}
IN: scratchpad mail-format random-mail <mirror> '[ unclip _ at suffix first4 table-cell ] map .
{
" f"
" 2506"
"lpcefuzwnp "
"2013"
"mbcujjlrauqswzn "
"xuybiuqkvfbssawgcrnfmjkplatdyj"
}
As you can see, the lines are roughly similar. In the first example random-mail <mirror> is run once for each iteration of map, while in the second it is run once and then "bound" into the _ placeholder.
We just call table-header once to make the table header and then call table-row on each mail tuple in our sequence.
: print-table ( seq format -- )
[ '[ _ swap table-row ] map ] [ table-header ] bi prefix simple-table. ;
The quotations passed to bi may look like they are in the wrong order; after all, the table header should come before the table rows. But it all works out because prefix prepends its second argument (the last one on the stack) to its first (the next-to-last one on the stack). Then we let simple-table. print out the result for us.
IN: scratchpad clear 10 [ random-mail ] replicate mail-format print-table
id mid group date sender subject
f 4057 kvtxnpyqik 2013 ouybpihdztjcvug gxsqmiifjrdhodyvdnxnsekhzwcpjs
f 1720 hwsdifocfb 2013 gmtmffskyiofzwa bddwczpqzbwoujxvqqkqzvoolekxes
f 4644 pplsqkflll 2012 jlndeolfhjktkks kloinuqblpuseprtfjyhvftwnkqxsl
f 2252 cbjbepfqes 2013 xjaljwzcnpjfchs njvwaaatnkyztaddsbonwbdwhkhgek
f 4189 uxvoefnrhb 2013 ocbjtolivaeitub fcxccrqajcbipiqffxyiamjcfrlqix
f 639 lgascyuxij 2012 jofshtqqpxgamba wnwtjnwpgzzedvuetjjnhkkqzebbcx
f 1214 hzcesktucw 2012 hcqomnhbvelcuyb gyqkzmeyorbbuhlgwnlpyeoynzviyg
f 2194 yvmibchwpk 2013 gptrzethazwohig hbjfujnkdnlrrqkgakmxitnbliqtkg
f 545 yqjgobgorn 2013 dsqgabphculhplk zzsymwywpxocbnwrrrlargybfdogyz
f 4845 pvffdfuptm 2012 xdfkkhtedrzwtgm mhviqaklvgofwpwhrheudeslvjiebv
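The bi and prefix ordering used in print-table can also be seen on plain numbers. This is only a sketch with made-up data: the first quotation stands in for building the rows and the second for building the header.

```factor
USING: kernel math prettyprint sequences ;

! First quotation builds the "rows", second builds the "header".
! prefix then puts the header in front of the rows.
{ 1 2 3 } [ [ 10 * ] map ] [ length ] bi prefix .
! prints { 3 10 20 30 }
```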
Yay! The gmane.formatting vocabulary is fairly complete at this point. Continue on to part 5 to see how to build the final vocabulary of the application.
To me, the db-assigned-id>string and mail-format words stand out like dirt in what would otherwise have been a clean and elegant vocabulary. The other words are useful for anyone wanting to format tabular data, but these two are specific to this application. Where could they be placed instead? Clearly they are needed, but none of the vocabularies we have written so far seems to offer a natural home for them. I don't think there is a perfect solution to the problem, so I'll let the question go unanswered for now.