Skip to content

Ergonomic, efficient data processing

License

Notifications You must be signed in to change notification settings

emacsmirror/transducers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

81 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Transducers: Ergonomic, efficient data processing

I think Transducers are a fundamental primitive that decouples critical logic from list/sequence processing, and if I had to do Clojure all over I would put them at the bottom.

– Rich Hickey

Transducers are an ergonomic and extremely memory-efficient way to process a data source. Here “data source” means simple collections like Lists or Vectors, but also potentially large files or generators of infinite data.

Transducers…

  • allow the chaining of operations like map and filter without allocating memory between each step.
  • aren’t tied to any specific data type; they need only be implemented once.
  • vastly simplify “data transformation code”.
  • have nothing to do with “lazy evaluation”.
  • are a joy to use!

Example: While skipping every second line of a file, sum the lengths of only evenly-lengthed lines.

(t-transduce
  ;; How do we want to process each element?
  (t-comp (t-step 2) (t-map #'length) (t-filter #'cl-evenp))
  ;; How do we want to combine all the elements together?
  #'+
  ;; What's our original data source?
  (t-file-read "README.org"))
6690

Looking for Transducers in other Lisps? Check out the Common Lisp and Fennel implementations!

History and Motivation

Originally invented in Clojure and later adapted to other Lisps, Transducers are an excellent way to think about - and efficiently operate on - collections or streams of data. Transduction operations are strict and don’t involve “laziness” or “thunking” in any way, yet only process the exact amount of data you ask them to.

This library is mostly a port of the Common Lisp implementation, with a few alterations to account for the minor differences between Common Lisp and Emacs Lisp.

Installation

This package is available on MELPA.

Usage

Importing

Since this is just a library, you can import it as usual:

(require 'transducers)

Every function in the library is prefixed by transducers- but you’re encouraged to use read-symbol-shorthands to shorten this to t-. This can be done interactively in your own files via add-file-local-variable, which you can use to set this at the bottom of your file:

;; Local Variables:
;; read-symbol-shorthands: (("t-" . "transducers-"))
;; End:

After this, you can make relatively clean calls like:

(t-transduce (t-map #'1+) #'t-vector '(1 2 3))
[2 3 4]

This can also be done in .org files, so that Transducers can be used in their short forms even in Babel source blocks. That’s exactly what this README does!

The remaining examples below use t- for brevity.

Transducers, Reducers, and Sources

;; The fundamental pattern.
(t-transduce <transducer-chain> <reducer> <source>)

Data processing largely has three concerns:

  1. Where is my data coming from? (sources)
  2. What do I want to do to each element? (transducers)
  3. How do I want to collect the results? (reducers)

Each full “transduction” requires all three. We pass one of each to the t-transduce function, which drives the process. It knows how to pull values from the source, feed them through the transducer chain, and wrap everything together via the reducer.

  • Typical transducers are t-map, t-filter, and t-take.
  • Typical reducers are +, t-count, t-cons, and t-fold.
  • Typical sources are lists, vectors, strings, hash tables, and files.

Generators are a special kind of source that yield infinite data. Typical generators are t-repeat, t-cycle, and t-ints.

Let’s sum the squares of the first 1000 odd integers:

(t-transduce
 (t-comp (t-filter #'cl-oddp)          ;; (2) Keep only odd numbers.
         (t-take 1000)                 ;; (3) Keep the first 1000 filtered odds.
         (t-map (lambda (n) (* n n)))) ;; (4) Square those 1000.
 #'+         ;; (5) Reducer: Add up all the squares.
 (t-ints 1)) ;; (1) Source: Generate all positive integers.
1333333000

Two things of note here:

  1. t-comp is used here to chain together different transducer steps. Notice that the order appears “backwards” from usual function composition. It may help to imagine that t-comp is acting like the thread-last macro here.
  2. The reduction via + is listed as Step 5, but really it’s occuring throughout the transduction process. Each value that makes it through the composed transducer chain is immediately added to an internal accumulator.

Explore the other transducers and reducers to see what’s possible! You’ll never write a loop again.

API

The examples here use show each symbol prefixed by t-, but recall that you’ll need to set an explicit shorthand for this to work. When searching in-editor documentation, each symbol is prefixed fully by transducers-.

Transducers

Transducers describe how to alter the items of some stream of values. Some transducers, like take, can short-circuit.

Multiple transducer functions can be chained together with comp.

pass, map

Just pass along each value of the transduction.

(t-transduce #'t-pass #'t-cons '(1 2 3))
(1 2 3)

Apply a function F to all elements of the transduction.

(t-transduce (t-map #'1+) #'t-cons '(1 2 3))
(2 3 4)

filter, filter-map, unique, dedup

Only keep elements from the transduction that satisfy PRED.

(t-transduce (t-filter #'cl-evenp) #'t-cons '(1 2 3 4 5))
(2 4)

Apply a function F to the elements of the transduction, but only keep results that are non-nil.

(t-transduce (t-filter-map #'cl-first) #'t-cons '(() (2 3) () (5 6) () (8 9)))
(2 5 8)

Only allow values to pass through the transduction once each. Stateful; this uses a hash table internally so could get quite heavy if you’re not careful.

(t-transduce #'t-unique #'t-cons '(1 2 1 3 2 1 2 "abc"))
(1 2 3 "abc")

Remove adjacent duplicates from the transduction.

(t-transduce #'t-dedup #'t-cons '(1 1 1 2 2 2 3 3 3 4 3 3))
(1 2 3 4 3)

drop, drop-while, take, take-while

Drop the first N elements of the transduction.

(t-transduce (t-drop 3) #'t-cons '(1 2 3 4 5))
(4 5)

Drop elements from the front of the transduction that satisfy PRED.

(t-transduce (t-drop-while #'cl-evenp) #'t-cons '(2 4 6 7 8 9))
(7 8 9)

Keep only the first N elements of the transduction.

(t-transduce (t-take 3) #'t-cons '(1 2 3 4 5))
(1 2 3)

Keep only elements which satisfy a given PRED, and stop the transduction as soon as any element fails the test.

(t-transduce (t-take-while #'cl-evenp) #'t-cons '(2 4 6 8 9 2))
(2 4 6 8)

uncons, concatenate, flatten

Split up a transduction of cons cells.

(t-transduce #'t-uncons #'t-cons '((:a . 1) (:b . 2) (:c . 3)))
(:a 1 :b 2 :c 3)

Concatenate all the sublists in the transduction.

(t-transduce #'t-concatenate #'t-cons '((1 2 3) (4 5 6) (7 8 9)))
(1 2 3 4 5 6 7 8 9)

Entirely flatten all lists in the transduction, regardless of nesting.

(t-transduce #'t-flatten #'t-cons '((1 2 3) 0 (4 (5) 6) 0 (7 8 9) 0))
(1 2 3 0 4 5 6 0 7 8 9 0)

segment, window, group-by

Partition the input into lists of N items. If the input stops, flush any accumulated state, which may be shorter than N.

(t-transduce (t-segment 3) #'t-cons '(1 2 3 4 5))
((1 2 3) (4 5))

Yield N-length windows of overlapping values. This is different from segment which yields non-overlapping windows. If there were fewer items in the input than N, then this yields nothing.

(t-transduce (t-window 3) #'t-cons '(1 2 3 4 5))
((1 2 3) (2 3 4) (3 4 5))

Group the input stream into sublists via some function F. The cutoff criterion is whether the return value of F changes between two consecutive elements of the transduction.

(t-transduce (t-group-by #'cl-evenp) #'t-cons '(2 4 6 7 9 1 2 4 6 3))
((2 4 6) (7 9 1) (2 4 6) (3))

intersperse, enumerate, step, scan

Insert an ELEM between each value of the transduction.

(t-transduce (t-intersperse 0) #'t-cons '(1 2 3))
(1 0 2 0 3)

Index every value passed through the transduction into a cons pair. Starts at 0.

(t-transduce #'t-enumerate #'t-cons '("a" "b" "c"))
((0 . "a") (1 . "b") (2 . "c"))

Only yield every Nth element of the transduction. The first element of the transduction is always included.

(t-transduce (t-step 2) #'t-cons '(1 2 3 4 5 6 7 8 9))
(1 3 5 7 9)

Build up successsive values from the results of previous applications of a given function F.

(t-transduce (t-scan #'+ 0) #'t-cons '(1 2 3 4))
(0 1 3 6 10)

once

Inject some ITEM onto the front of the transduction.

(t-transduce (t-comp (t-filter (lambda (n) (> n 10)))
                     (t-once 'hello)
                     (t-take 3))
             #'t-cons (t-ints 1))
(hello 11 12)

log

Call some LOGGER function for each step of the transduction. The LOGGER must accept the running results and the current element as input. The original items of the transduction are passed through as-is.

(t-transduce (t-log (lambda (_ n) (print! "Got: %d" n))) #'t-cons '(1 2 3 4 5))
Got: 1
Got: 2
Got: 3
Got: 4
Got: 5

These are STDOUT results. The actual return value is the result of the reducer, in this case cons, thus a list.

from-csv, into-csv

Interpret the data stream as CSV data.

The first item found is assumed to be the header list, and it will be used to construct useable hashtables for all subsequent items.

Note: This function makes no attempt to convert types from the original parsed strings. If you want numbers, you will need to further parse them yourself.

(t-transduce (t-comp #'t-from-csv
                     (t-map (lambda (hm) (gethash "Name" hm))))
             #'t-cons '("Name,Age" "Alice,35" "Bob,26"))
("Alice" "Bob")

Given a sequence of HEADERS, rerender each item in the data stream into a CSV string. It’s assumed that each item in the transduction is a hash table whose keys are strings that match the values found in HEADERS.

(t-transduce (t-comp #'t-from-csv
                     (t-into-csv '("Name" "Age")))
             #'t-cons '("Name,Age,Hair" "Alice,35,Blond" "Bob,26,Black"))
("Name,Age" "Alice,35" "Bob,26")

Reducers

Reducers describe how to fold the stream of items down into a single result, be it either a new collection or a scalar.

Some reducers, like first, can also force the entire transduction to short-circuit.

cons, snoc, vector, string, hash-table

Collect all results as a list.

(t-transduce #'t-pass #'t-cons '(1 2 3))
(1 2 3)

Collect all results as a list, but results are reversed. In theory, slightly more performant than cons since it performs no final reversal.

(t-transduce #'t-pass #'t-snoc '(1 2 3))
(3 2 1)

Collect a stream of values into a vector.

(t-transduce #'t-pass #'t-vector '(1 2 3))
[1 2 3]

Collect a stream of characters into to a single string.

(t-transduce (t-map #'upcase) #'t-string "hello")
"HELLO"

Collect a stream of key-value cons pairs into a hash table.

(t-transduce #'t-enumerate #'t-hash-table '("a" "b" "c"))
#s(hash-table size 65 test equal rehash-size 1.5 rehash-threshold 0.8125 data (0 "a" 1 "b" 2 "c"))

count, average

Count the number of elements that made it through the transduction.

(t-transduce #'t-pass #'t-count '(1 2 3 4 5))
5

Calculate the average value of all numeric elements in a transduction.

(t-transduce #'t-pass #'t-average '(1 2 3 4 5 6))
3.5

anyp, allp

Yield t if any element in the transduction satisfies PRED. Short-circuits the transduction as soon as the condition is met.

(t-transduce #'t-pass (t-anyp #'cl-evenp) '(1 3 5 7 9 2))
t

Yield t if all elements of the transduction satisfy PRED. Short-circuits with NIL if any element fails the test.

(t-transduce #'t-pass (t-allp #'cl-oddp) '(1 3 5 7 9))
t

first, last, find

Yield the first value of the transduction. As soon as this first value is yielded, the entire transduction stops.

(t-transduce (t-filter #'cl-oddp) #'t-first '(2 4 6 7 10))
7

Yield the last value of the transduction.

(t-transduce #'t-pass #'t-last '(2 4 6 7 10))
10

Find the first element in the transduction that satisfies a given PRED. Yields NIL if no such element were found.

(t-transduce #'t-pass (t-find #'cl-evenp) '(1 3 5 6 9))
6

fold

fold is the fundamental reducer. fold creates an ad-hoc reducer based on a given 2-argument function. An optional SEED value can also be given as the initial accumulator value, which also becomes the return value in case there were no input left in the transduction.

Functions like + and * are automatically valid reducers, because they yield sane values even when given 0 or 1 arguments. Other functions like cl:max cannot be used as-is as reducers since they can’t be called without arguments. For functions like this, fold is appropriate.

(t-transduce #'t-pass (t-fold #'max) '(1 2 3 4 1000 5 6))
1000

With a seed:

(t-transduce #'t-pass (t-fold #'max 0) '())
0

In Clojure this function is called completing.

for-each

Run through every item in a transduction for their side effects. Throws away all results and yields t.

(t-transduce (t-map (lambda (n) (print! "%d" n))) #'t-for-each [1 2 3 4])
t

Sources

Data is pulled in an on-demand fashion from Sources. They can be either finite or infinite in length. A list is an example of a simple Source, but you can also pull from files and endless number generators.

ints, random

Yield all integers, beginning with START and advancing by an optional STEP value which can be positive or negative. If you only want a specific range within the transduction, then use take-while within your transducer chain.

(t-transduce (t-take 10) #'t-cons (t-ints 0 :step 2))
(0 2 4 6 8 10 12 14 16 18)

Yield an endless stream of random numbers, based on a given LIMIT.

(t-transduce (t-take 20) #'t-cons (t-random 10))
(2 2 8 7 5 9 8 0 7 0 6 6 0 0 5 4 3 6 3 2)
(t-transduce (t-take 4) #'t-cons (t-random 1.0))
(0.5335642099380493 0.7913782596588135 0.6917074918746948 0.09546732902526855)

cycle, repeat, shuffle

Yield the values of a given SEQ endlessly.

(t-transduce (t-take 10) #'t-cons (t-cycle '(1 2 3)))
(1 2 3 1 2 3 1 2 3 1)

Endlessly yield a given ITEM.

(t-transduce (t-take 4) #'t-cons (t-repeat 9))
(9 9 9 9)

Endlessly yield random elements from a given vector.

(t-transduce (t-take 5) #'t-cons (t-shuffle ["Alice" "Bob" "Dennis"]))
("Dennis" "Bob" "Bob" "Dennis" "Bob")

Recall also that strings are vectors too:

(t-transduce (t-take 15) #'t-string (t-shuffle "Númenor"))
"mnromnrrrnNNoor"

plist

Yield key-value pairs from a Property List, usually known as a ‘plist’. The pairs are passed as a cons cell.

(t-transduce (t-map #'cdr) #'+ (t-plist '(:a 1 :b 2 :c 3)))
6

See also the uncons transducer for another way to handle incoming cons cells.

Utilities

comp, const

Function composition. You can pass as many functions as you like and they are applied from right to left.

(funcall (t-comp #'length #'reverse) [1 2 3])
3

For transducer functions specifically, they are composed from right to left, but their effects are applied from left to right. This is due to how the reducer function is chained through them all internally via transduce.

Notice here how drop is clearly applied first:

(t-transduce (t-comp (t-drop 3) (t-take 2)) #'t-cons '(1 2 3 4 5 6))
(4 5)

Return a function that ignores its argument and returns ITEM instead.

(funcall (t-comp (t-const 108) (lambda (n) (* 2 n)) #'1+) 1)
108

make-reduced, reduced-p, reduced-val

When writing your own transducers and reducers, these functions allow you to short-circuit the entire operation.

Here is a simplified definition of first:

(defun first (&optional (acc nil a-p) (input nil i-p))
  (cond ((and a-p i-p) (make-transducers-reduced :val input))
        ((and a-p (not i-p)) acc)
        (t acc)))

You can see make-reduced being used to wrap the return value. transduce sees this wrapping and immediately halts further processing.

reduced-p and reduced-val can similarly be used (mostly within transducer functions) to check if some lower transducer (or the reducer) has signaled a short-circuit, and if so potentially perform some clean-up. This is important for transducers that carry internal state.

Example Gallery

Words in a File

(t-transduce (t-comp (t-map #'split-string)
                     #'t-concatenate)
             #'t-count
             (t-file-read "README.org"))
1101

Splitting a string by its lines

Transducing over a string yields its characters:

(t-transduce #'t-pass #'t-count "hello\nworld!")
12

If you want to transduce over its lines instead, create a temporary buffer first:

(with-temp-buffer
  (insert "hello\nworld!")
  (t-transduce #'t-pass #'t-cons (current-buffer)))
("hello" "world!")

Reading and Writing CSV data

This library also provides two transducers for processing CSV data: t-from-csv and t-into-csv. The original data can come from any source, like a file, open buffer, or raw string.

t-from-csv reads the data into a stream of Hash Tables with each value keyed to the fields provided in the first line. t-into-csv reverses the process, given a sequence of headers to select.

(t-transduce (t-comp #'t-from-csv
                     (t-into-csv ["Age" "Name"]))
             #'t-cons
             ["Name,Age,Hair" "Alice,35,Blond" "Bob,26,Black"])
("Age,Name" "35,Alice" "26,Bob")

Here we’re immediately converting back into CSV strings, but with t-comp we’re free to add as many intermediate steps as we like.

Reading and Writing JSON data

It is also possible to read from and write to JSON buffers. Reading assumes that the buffer contains a top-level array. By default, yielded objects are plists, but this can be customized via the :object-type keyword.

(with-temp-buffer
  (insert "[{\"name\": \"Colin\"},{\"name\": \"Jack\"}]")
  (t-transduce #'t-pass #'t-cons (t-from-json-buffer (current-buffer))))
((:name "Colin") (:name "Jack"))

Note that t-from-json-buffer is a “source”.

Likewise, t-into-json-buffer is a reducer that writes a stream of lisp values back into the current buffer.

(with-temp-buffer
  (t-transduce #'t-pass #'t-into-json-buffer '((:name "Colin") (:name "Jack")))
  (buffer-string))
[{"name":"Colin"},{"name":"Jack"}]

Note that t-into-json-buffer makes no assumptions about where point initially is the current buffer, nor is the buffer automatically saved. Such concerns are the responsibility of the user.

Reducing into Property Lists and Assocation Lists

There is no special reducer function for plists, because none is needed. If you have a stream of cons cells, you can break it up with t-uncons and then collect with t-cons as usual:

(t-transduce (t-comp (t-map (lambda (pair) (cons (car pair) (1+ (cdr pair)))))
                     #'t-uncons)
             #'t-cons
             (t-plist '(:a 1 :b 2 :c 3)))
(:a 2 :b 3 :c 4)

Likewise, Association Lists are already lists-of-cons-cells, so no special treatment is needed:

(t-transduce #'t-pass #'t-cons '((:a . 1) (:b . 2) (:c . 3)))
((:a . 1) (:b . 2) (:c . 3))

Resources