Parsing Gmane with Factor Part 4

luftluft edited this page Oct 27, 2013 · 47 revisions

Part 4 of the Parsing Gmane with Factor tutorial.

In part 1 we defined the database and in part 2 the structure for populating it from HTTP URLs. One of the remaining tasks is deciding how to display data to the user in a tasteful way. The problem can be demonstrated with this line of code:

IN: scratchpad 10 [ random-mail ] replicate

--- Data stack:
{ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~ ~mail~...
IN: scratchpad 

That's the default way Factor displays a list of tuples, and it isn't very user-friendly. A better way to show a data structure like the above would be as a table in reverse chronological order, similar to how e-mail clients render users' inboxes.

Pretty-printing Sequences of Tuples

A common dilemma in software engineering is deciding how general a solution should be. In the example above, we have a sequence of exactly 10 mail tuples, so we could theoretically write a function called pretty-print-10-mail-tuples that only works when the number of tuples is exactly 10. However, that is probably too specific, and it would make sense to use the same code whether the sequence length is 10, 0, 5 or 50.

But we can go one step further and write code that handles both sequences of any length and tuples of any type, not just mail tuples! That likely involves writing more code, but code that can be reused in a variety of situations. Thinking further in that direction, here is how we could declaratively describe how the mail tuple table should look:

: db-assigned-id>string ( number/f -- str )
    [ number>string ] [ "f" ] if* ;

: mail-format ( -- seq )
    {
        { "id" t 6 db-assigned-id>string }
        { "mid" t 5 number>string }
        { "group" f 15 >string }
        { "date" f 10 timestamp>ymd }
        { "sender" f 15 >string }
        { "subject" f 50 >string }
    } ;

This data structure is a Factor sequence containing six sequences of length four representing a table with six columns.

  • The first item in each subsequence is the name of the tuple slot from which we will retrieve the value with the help of the mirrors (the Factor developers love puns) introspection vocabulary.
  • The second item is a boolean value indicating whether the column should be right-aligned or not. Columns containing numerical data should almost always be right-aligned.
  • The third item in the sequence specifies how wide the column should be in characters. If the data doesn't fit in the allotted space, it will be truncated, and if it is too short, it will be padded.
  • Lastly, we have a reference to a word that converts the slot value to a string.
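
To get a feel for how a mirror exposes slots by name, here is a small listener sketch. The point tuple is hypothetical and exists only for this illustration; assocs and mirrors are loaded explicitly in case they are not already in the listener's search path:

IN: scratchpad USING: assocs mirrors ;
IN: scratchpad TUPLE: point x y ;
IN: scratchpad 1 2 point boa <mirror> "x" swap at .
1

The mirror behaves like an assoc keyed by slot name, which is why a plain string such as "id" or "subject" is enough to identify a column.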

Note that the id slot is most often a numeric value, but can also be f so we create our own word db-assigned-id>string to handle its conversion.
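
Assuming the definition above has been loaded, both cases can be checked quickly in the listener:

IN: scratchpad 42 db-assigned-id>string .
"42"
IN: scratchpad f db-assigned-id>string .
"f"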

Initially, the structure didn't contain the fourth element of the subsequences because I thought the present word would handle stringification. But I wasn't happy with the way it stringified timestamps, so it became necessary to specify the conversion word explicitly.

Next, we need to actually write the print-table function that will take a proper sequence of tuples and the format declaration above to produce a good looking table:

IN: scratchpad 10 [ random-mail ] replicate mail-format print-table
... show expected output here

But first we need to create some utility functions.

Dynamic printf

Factor supports printf-style string formatting which is the feature we will make good use of to format the table. Consider the second column declaration in mail-format above:

{ "mid" t 5 number>string }

t means that the column should be right-aligned and 5 is the number of characters it should occupy. It can be directly translated to this format string:

IN: scratchpad 25 number>string "[%5s]\n" printf
[   25]

Or, if we wanted left-aligned output, we could have used "[%-5s]\n":

IN: scratchpad 25 number>string "[%-5s]\n" printf
[25   ]

The built-in string formatting is very convenient for us, as it means that we won't have to write any code for left- and right-aligning and padding values ourselves. But there is one big catch: printf is implemented as a macro, so format strings cannot be built dynamically. I think it is a flaw in the library and it will probably be corrected in a future release. For now, it's just the price one has to pay for using a young language, and because Factor is so powerful it is easy to work around. Create a new vocabulary called gmane.formatting with the following contents:

USING:
    arrays
    calendar.format
    combinators
    formatting formatting.private
    fry
    io io.streams.string
    kernel
    math.parser
    sequences
    strings ;
IN: gmane.formatting

: printf>write-quot ( quot -- quot' )
    {
        { [ dup string? ] [ '[ _ ] ] }
        { [ dup [ "%" ] = ] [ '[ _ first ] ] }
        [ '[ unclip @ ] ]
    } cond [ write ] append ;

: vprintf ( seq format -- )
    parse-printf reverse [ first printf>write-quot ] map
    compose-all call( x -- x ) drop ;

: vsprintf ( seq format -- str )
    [ vprintf ] with-string-writer ;

: vsprintf1 ( elt format -- str )
    [ 1array ] dip vsprintf ;

This tutorial is too short to explain in detail how this pretty dense code works. It also reaches into undocumented internal parts of Factor (the formatting.private import) which is to be avoided but sometimes is necessary. The best way to understand it is to experiment in the listener to get a feel for how it works:

IN: scratchpad "subject" { t 20 } first2 [ "" "-" ? ] dip 2array "%%%s%ds\n" vsprintf printf
             subject
IN: scratchpad "subject" "%20s\n" printf
             subject

Note how vsprintf allows you to construct format strings on the fly which normally would be impossible with printf.

Table Printing

The prettyprint vocabulary contains a handy word called simple-table. (the trailing dot is part of the name). It accepts a 2-dimensional array of objects and prints a table.

IN: scratchpad { { "foo" "bar" } { "a" "b" } { 33 f } } simple-table.
foo bar
a   b
33  f

However, it doesn't handle truncation and right alignment of cell values, so we have to fix up those small details before sending our data to the simple-table. word.

The task can be split into three logical parts:

  1. Formatting the table header.
  2. Formatting each table row.
  3. Aggregating the output from steps 1 and 2 and passing it to the simple-table. word.

Each of these steps is only a few lines of code when based upon the vsprintf word we created earlier.

The Table Header

: table-header ( format -- header-cells )
    [ first3 nip "%%-%ds" vsprintf1 vsprintf1 ] map ;

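The output from this function is a sequence of strings which are the table headers. first3 nip has the effect of picking the first and third items of the input sequence and putting them on the stack. Note that the sample outputs in this part were produced with slightly different column widths than the ones declared in mail-format above, so the padding does not match the declaration exactly. Use the join word to see what the result will look like: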

IN: scratchpad mail-format table-header "|" join print
id     |mid  |group               |date|sender              |subject
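
If the stack shuffling is unclear, first3 nip can be tried on its own with one of the column declarations. This is only an illustration of the two core words, not part of the vocabulary:

IN: scratchpad { "mid" t 5 number>string } first3 nip

--- Data stack:
"mid"
5

The alignment flag is dropped, leaving the column name and the width, which is exactly what the two vsprintf1 calls in table-header consume.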

Great! Let's do the data rows of the table too.

The Table Rows

: table-cell-format ( right-align width -- str )
    [ "" "-" ? ] dip 2array "%%%s%ds" vsprintf ;

: table-cell ( right-align width word value -- str )
    swap execute( x -- x ) over short head -rot table-cell-format vsprintf1 ;

: table-row ( format row -- row-cells )
    <mirror> '[ unclip _ at suffix first4 table-cell ] map ;

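This code is significantly harder to understand. It also uses at from the assocs vocabulary and <mirror> from mirrors, so both need to be added to the USING: list of gmane.formatting. As always, experimenting with it in the listener should clear things up.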

IN: scratchpad mail-format random-mail table-row "|" join print
      f| 2466|uybecybpgq          |2013|uwifstroyqiexky     |aeheotnpsumasyttehbklzetzoxzzx
IN: scratchpad t 10 table-cell-format .
"%10s"
IN: scratchpad { t 10 number>string 75 } first4 table-cell .
"        75"

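Two more small experiments may help. The first assumes the table-cell-format word above is loaded and shows the left-aligned case; the others use only short and head from the sequences vocabulary to show how table-cell truncates a value that is too wide and leaves a short one untouched:

IN: scratchpad f 10 table-cell-format .
"%-10s"
IN: scratchpad "a long subject line" 6 short head .
"a long"
IN: scratchpad "hi" 6 short head .
"hi"

short clamps the requested length to the length of the sequence, so head never fails on values that are narrower than the column.
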
Note the call to map in table-row and the fried quotation that precedes it. Fried quotations are how you implement anonymous functions with some bound parameters in Factor. In other languages they are called closures. The fried quotation here consumes the row parameter and uses it to create a regular quotation. Since that parameter is consumed, what map will iterate over is the format sequence.

If the fry syntax feels tricky, you can manually "expand" it in the listener:
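
As a minimal sketch of the same idea outside the table code, here is a fried quotation with a single bound value. The numbers are arbitrary and the example is only meant to show what _ does:

IN: scratchpad USE: fry
IN: scratchpad 5 '[ _ + ] .
[ 5 + ]
IN: scratchpad { 1 2 3 } 5 '[ _ + ] map .
{ 6 7 8 }

The value on top of the stack is taken when the quotation is built and spliced in where the _ placeholder sits, so the result is an ordinary quotation that map can call.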

IN: scratchpad mail-format [ unclip random-mail <mirror> at suffix first4 table-cell ] map .
{
    "      f"
    " 2363"
    "dhikaejcaa          "
    "2012"
    "gjsgosbexvnwliv     "
    "jcouzszjztsnvhkmtzkoordyihqsvb"
}
IN: scratchpad mail-format random-mail <mirror> '[ unclip _ at suffix first4 table-cell ] map .
{
    "      f"
    " 2506"
    "lpcefuzwnp          "
    "2013"
    "mbcujjlrauqswzn     "
    "xuybiuqkvfbssawgcrnfmjkplatdyj"
}

As you can see, the two results have the same shape. In the first example, random-mail <mirror> is run once for each iteration of map, while in the second it is run once and then "bound" into the _ placeholder.

Putting It Together

We just call table-header once to make a table header and then call table-row on each mail tuple in our sequence.

: print-table ( seq format -- )
    [ '[ _ swap table-row ] map ] [ table-header ] bi prefix simple-table. ;

It may look as if the quotations passed to bi are in the wrong order. After all, the table header should come before the table rows. But it all works out thanks to prefix, which prepends its second argument (the last one on the stack) to its first (the next to last one on the stack). Then we let the simple-table. word from the prettyprint vocabulary (remember to add it to the USING: list) print out the result for us.
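
The argument order of prefix is easy to confirm in the listener with a couple of placeholder strings standing in for the rows and the header:

IN: scratchpad { "row 1" "row 2" } "header" prefix .
{ "header" "row 1" "row 2" }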

IN: scratchpad clear 10 [ random-mail ] replicate mail-format print-table
id      mid   group                date sender               subject                       
      f  4057 kvtxnpyqik           2013 ouybpihdztjcvug      gxsqmiifjrdhodyvdnxnsekhzwcpjs
      f  1720 hwsdifocfb           2013 gmtmffskyiofzwa      bddwczpqzbwoujxvqqkqzvoolekxes
      f  4644 pplsqkflll           2012 jlndeolfhjktkks      kloinuqblpuseprtfjyhvftwnkqxsl
      f  2252 cbjbepfqes           2013 xjaljwzcnpjfchs      njvwaaatnkyztaddsbonwbdwhkhgek
      f  4189 uxvoefnrhb           2013 ocbjtolivaeitub      fcxccrqajcbipiqffxyiamjcfrlqix
      f   639 lgascyuxij           2012 jofshtqqpxgamba      wnwtjnwpgzzedvuetjjnhkkqzebbcx
      f  1214 hzcesktucw           2012 hcqomnhbvelcuyb      gyqkzmeyorbbuhlgwnlpyeoynzviyg
      f  2194 yvmibchwpk           2013 gptrzethazwohig      hbjfujnkdnlrrqkgakmxitnbliqtkg
      f   545 yqjgobgorn           2013 dsqgabphculhplk      zzsymwywpxocbnwrrrlargybfdogyz
      f  4845 pvffdfuptm           2012 xdfkkhtedrzwtgm      mhviqaklvgofwpwhrheudeslvjiebv

Yay! The gmane.formatting vocabulary is fairly complete at this point. Continue on to part 5 to see how to build the final vocabulary of the application.

Location of the mail-format and db-assigned-id>string words

To me, those words stand out like dirt in what would otherwise have been a clean and elegant vocabulary. The other words are useful for anyone wanting to format tabular data, but these two are specific to this application. Where could they be placed instead? Clearly, they are needed, but none of the vocabularies we have written so far seems to offer a natural home for them. I don't think there is a perfect solution to the problem, so I'll let the question go unanswered for now.