Switch branches/tags
Nothing to show
Find file
Fetching contributors…
Cannot retrieve contributors at this time
258 lines (200 sloc) 10.2 KB

Data structures

Data structures

I’m trying to figure out what Erlang data structures to use for my data.

Choosing the right data structure — in any language — does matter for both code clarity and performance.

Especially for the core data, yes. And my experience is that it’s a good idea to know one’s options, in order to make good choices.

What kind of data structures will you need?

Well, certainly both sequences and dictionaries/maps from keys to values.

Of course.

Table 1. Sequences

For sequences, I take it I’ll be using the list data type?

Much of the time, yes. And possibly also for things — intermediate results — where you wouldn’t build a data structure at all in some other languages.

But the list type only gives quick access to the beginning of the sequence. It’s a singly-linked list. Great for stacks, but how about e.g a queue — where you need efficient access to both ends?

Then the queue module is what you’ll need. It provides O(1) insertion and amortized O(1) extraction time. The same module also provides deque operations.

(Erlang also has another kind of queue, by the way: the message queue of a process. Sometimes you need a queue within a process, sometimes you need a queue between processes.)

And what if I need a sequence with efficient random access? What should I use for the kind of situations where I’d usually use an array or array-based list?

That depends on the usage pattern.

If it’s read-heavy, then an Erlang tuple is your array. You manipulate it with element/2 (O(1)) and setelement/3 (O(n)). And, for construction it in the first place, make_tuple() and list_to_tuple. tuple_to_list() and append_element() may also come in handy.

If the content is less static, then look into the array module for O(log n) read and write operations.

I guess that’s as good as you can do with purely functional data structures.

Yes. If you need better than that — or better time constants —  then you can use one of the key-value mapping data structures.

On the other hand, if you expect to always operate on small data sets, then plain lists may be your best choice.

Table 2. Maps

Maps from keys to values, then — what are my options in that department?

The straight-forward choice is a property list — a list of pairs.

Ah, I know that one. I can use the proplists module to manipulate such a list.

You can, yes. But…​

But what?

…​But proplists actually handles a data structure a bit more general than lists-of-pairs: besides pairs, it recognizes atoms, which it takes to mean {the_atom, true}.

OK. Nothing wrong with that; being more general doesn’t hurt.

Not as such, no; many function in the standard libraries which take a property list of options understand that kind of shortcut.

And it is a nice convention for configuration parameters and such, but…​ you have to consider the cost of that convenience feature, before using it in inner-loop like code.

The cost? Are we talking about some premature optimization?

Call it premature if you must.

Just be aware that an alternative to proplists:lookup(Key, List), which does nearly the same thing except it doesn’t handle the atom special case, is lists:keyfind(Key, 1, List) — which is 10 times faster.

Ah. Perhaps we’re talking more “low hanging fruit” here.


How come it’s that much faster? Is the proplists version implemented in a stupid manner?

No; the reason is that lists:keyfind/3 is a BIF - a built-in function, written in C. That gives it a performance edge.

(If there is any lesson in this, I guess it is: don’t accidentally reinvent any of the functions in the lists module. You may gain NIH points, but you may lose performance.)

What other options are there for a key-value map data structure?

Next up are sorted property lists — as handled by the orddict module. They are on average twice as fast for lookups (given that the key is actually there), and more efficient for key union and intersection operations.

We’re still talking linear time, though, but also still with a small constant factor.

So still not something I’d use for large data sets.

No. For those cases, your primary question would be, “will I require the keys to be ordered or not?”

If you need ordered keys, then use gb_trees — balanced trees — for O(log n) time operations.

If you don’t, then use dict — a hash table — for O(1) lookup and O(n) insertion (with a quite small constant factor).

And either way, if you don’t need a persistent (purely functional) data structure, there are other options.

Do tell — I can buy some performance by allowing destructive updates, right?

Yes. It’s not always worth it, but for the right kinds of data it is.

So what do I use for these kinds of data?

For ordered keys, you can use an ETS table of the “ordered_set” flavour.

For unordered keys, you can use one of the “set” flavour. (Or “bag”, or “duplicate_bag”, if that’s what you need.)

“Set” provides constant-time operations (like a hash table), while “ordered_set” requires logarithmic-time (like a balanced tree).

Finally, there’s the process dictionary. It’s there for you to use, but resist the temptation unless you must.

Key-value store functional overview
              ---Pure------   ---Ordered-----
             /             \ /               \   ets (set)
O(1)        /    dict    ___X___              \
           /            /       \              \  pdict
          |           |           |	      	|
O(log n)  |           | gb_trees  |    ets     	|
          |           |           |(ordered_set)|
           \           \ orddict /             /
            \ pair-list \___ ___/             /
O(n)         \              X                /
              -------------- ----------------
Table 3. Strings

How about strings? I’ll need to manipulate textual data, too.

Once again, the right representation depends on what you’re doing. There’s of course the basic string type, which is just a list of characters, each represented by an integer.

And like any other list, that gives me easy access to the first elements, but slow access to the rest.

True. For some applications, that’s fine, of course — parsing, for instance.

And there’s also quite a bit of space overhead?

Yes. Two words per character, to be specific — which translates into 8 bytes per character on a 32-bit platform.

That’s a lot, if all you want to do is represent an ASCII string.

It is. And if that data comes from e.g. a text file or a network packet, it will have to be converted from the one-byte-per-character form into the internal list representation.

And for a lot of applications, you’re just going to put that same data into a new file or network packet, which means the data will have to be converted again, just in the other direction. That’s silly.

It is…​ Fortunately, the list representation is not the only option.

What else can I do?

You can use binaries — the Erlang data type called ‘binary’. Then you can keep the data in the same form the entire time.

Now that makes more sense.

And because that’s a typical use of binaries — starting and ending their lifespans in a device driver of some sort, and perhaps going through a few processes in between, binaries above a certain threshold size are handled specially:

They live outside the heap of any particular process, which means that they are not copied when they are sent from one process to another, and that they are not moved around by the normal process-heap garbage collections.

So, we have character lists, and we have binaries.

But neither of those types have efficient concatenation operations —  appending to a list is quite expensive (especially if you do it repeatedly), and appending one large binary to another means constructing a new, larger binary.

Actually, you can append to a binary repeatedly, rather cheaply: the first time, some extra space will be allocated at the end ; if that space is still free when the subsequent appendings take place, the binary will not have to be moved.

But the data which is appended will still have to be copied. If you concatenate two large binaries, at least one of them will have to be copied.

True. For such construction maneuvers, there is a third option. It’s called “IO lists” — usually written as “iolist”.

An iolist is a binary, or a list containing characters, binaries, and/or sub-iolists. (Impure lists are also allowed, as long as the final tail is a binary.)

That does obviate the need for costly concatenation operations. But once I have such an iolist, what can I do with it?

If nothing else, you can apply iolist_to_binary to it. But many Erlang function which expect binaries or strings are also able to handle iolists. Look for it in the documentation.

Most significantly, I/O operations usually accept iolists.

And will I also have to accept iolists occasionally?

I/O input is typically pure binaries. But some functions for constructing strings — in particular, io_lib:format() — return iolists; you may want to flatten those (using iolist_to_binary() or lists:flatten()) before you display them or pass them to some entity which is not iolist-aware.

So, all in all we have three string representations for three different scenarios:

  • character lists for deconstructing/parsing;

  • iolists for constructing; and

  • binaries for just passing through

where the iolist is a generalization over both character lists and binaries.

That just about sums it up, yes.