Pipe and Filter Message Processing

Vladimir Gamalian edited this page Jun 30, 2016 · 1 revision

Many common uses of cryptography involve processing one or more streams of data. Botan provides services that make it easy to set up data flows through various operations, such as compression, encryption, and base64 encoding. Each of these operations is implemented in what Botan calls a filter. A set of filters is created and placed into a pipe, and information "flows" through the pipe until it reaches the end, where the output is collected for retrieval. If you're familiar with the Unix shell environment, this design will sound quite familiar.

Here is an example that uses a pipe to base64 encode some strings:

  Pipe pipe = Pipe(new Base64Encoder); // pipe owns the pointer
  pipe.startMsg();
  pipe.write("message 1");
  pipe.endMsg(); // flushes buffers, increments message number

  // processMsg(x) is startMsg() && write(x) && endMsg()
  pipe.processMsg("message2");

  string m1 = pipe.toString(0); // base64 encoding of "message 1"
  string m2 = pipe.toString(1); // base64 encoding of "message2"

Byte streams in the pipe are grouped into messages: blocks of data that are processed in an identical fashion (i.e., with the same sequence of filter operations). Messages are delimited by calls to startMsg and endMsg. Each message in a pipe has its own identifier, which currently is an integer that counts up from zero.
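To make the message numbering concrete, here is a small self-contained model of a pipe that holds a single base64 filter. It is only a sketch: MiniPipe is a made-up type built on Phobos' std.base64, not Botan's Pipe, and it assumes that starting a message twice or writing outside a message should throw, as described for the real API.

```d
import std.base64 : Base64;
import std.exception : enforce;
import std.stdio : writeln;
import std.string : representation;

// Toy stand-in for a pipe with one base64 filter; NOT Botan's Pipe.
struct MiniPipe
{
    private ubyte[] pending;   // input of the message currently being written
    private string[] messages; // finished messages, numbered from zero
    private bool active;

    void startMsg()
    {
        enforce(!active, "a message is already in progress");
        active = true;
        pending.length = 0;
    }

    void write(string data)
    {
        enforce(active, "no message in progress");
        pending ~= data.representation;
    }

    void endMsg()
    {
        enforce(active, "no message in progress");
        messages ~= Base64.encode(pending).idup; // the "filter" runs here
        active = false;
    }

    // Convenience wrapper: one write per message.
    void processMsg(string data)
    {
        startMsg();
        write(data);
        endMsg();
    }

    string toString(size_t msg) const { return messages[msg]; }
}

void main()
{
    MiniPipe pipe;
    pipe.startMsg();
    pipe.write("message 1");
    pipe.endMsg();
    pipe.processMsg("message2");

    writeln(pipe.toString(0)); // bWVzc2FnZSAx
    writeln(pipe.toString(1)); // bWVzc2FnZTI=
}
```

Each call to endMsg closes one message slot, so the second message is retrieved with index 1 regardless of how many writes went into the first.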

The Base64Encoder was allocated using new; but where was it deallocated? When a filter object is passed to a Pipe, the pipe takes ownership of the object, and will deallocate it when it is no longer needed.

There are two different ways to make use of messages. One is to send several messages through a Pipe without changing the Pipe configuration, so you end up with a sequence of messages; one use of this would be to send a sequence of identically encrypted UDP packets, for example (note that the data need not be identical; it is just that each is encrypted, encoded, signed, etc in an identical fashion). Another is to change the filters that are used in the Pipe between each message, by adding or removing filters; functions that let you do this are documented in the Pipe API section.

Botan has about 40 filters that perform different operations on data. Here's code that uses one of them to encrypt a string with AES:

  AutoSeededRNG rng;
  auto key = SymmetricKey(rng, 16); // a random 128-bit key
  auto iv = InitializationVector(rng, 16); // a random 128-bit IV

  // The algorithm we want is specified by a string
  Pipe pipe(getCipher("AES-128/CBC", key, iv, ENCRYPTION));

  pipe.processMsg("secrets");
  pipe.processMsg("more secrets");

  SecureVector!ubyte c1 = pipe.readAll(0);

  ubyte[4096] c2;
  size_t got_out = pipe.read(c2.ptr, c2.length, 1);
  // use c2[0 .. got_out]

Note the use of AutoSeededRNG, which is a random number generator. You can explicitly set up the random number generators and entropy sources if you want to; however, in the vast majority of cases AutoSeededRNG is preferable.

Pipe also has convenience methods for dealing with File. Here is an example of those, using the BzipCompression filter (included as a module; if you have bzlib available, check the build instructions for how to enable it) to compress a file:

  File infile = File("data.bin", "rb");
  File outfile = File("data.bin.bz2", "wb+");

  Pipe pipe = Pipe(new BzipCompression);

  pipe.startMsg();
  // read fixed-size chunks; byLine would strip the line terminators
  foreach (chunk; infile.byChunk(4096))
     pipe.write(chunk.ptr, chunk.length);
  pipe.endMsg();
  outfile.write(pipe.toString());

However, there is a hitch in the code above: the complete compressed output is held in memory until the entire message has been compressed. Only then is the final statement, outfile.write(pipe.toString()), executed and the data read from the pipe and written to the file. If the file is very large, we might not have enough physical memory (or even enough virtual memory!) for that to be practical. So instead of storing the compressed data in the pipe and reading it out later, we divert it directly to the file:

  File infile = File("data.bin", "rb");
  File outfile = File("data.bin.bz2", "wb+");

  Pipe pipe = Pipe(new BzipCompression, new DataSinkStream(outfile));

  pipe.startMsg();
  // read fixed-size chunks; byLine would strip the line terminators
  foreach (chunk; infile.byChunk(4096))
      pipe.write(chunk.ptr, chunk.length);
  pipe.endMsg();

This is the first code we've seen so far that uses more than one filter in a pipe. The output of the compressor is sent to the DataSinkStream. Anything written to a DataSinkStream is written to a file; the filter produces no output. As soon as the compression algorithm finishes up a block of data, it will send it along to the sink filter, which will immediately write it to the stream; if you were to call pipe.readAll() after pipe.endMsg(), you'd get an empty vector out. This is particularly useful for cases where you are processing a large amount of data, as it means you don't have to store everything in memory at once.

Here's an example using two computational filters:

   AutoSeededRNG rng;
   auto key = SymmetricKey(rng, 32); // a random 256-bit key
   auto iv = InitializationVector(rng, 16);

   Pipe encryptor = Pipe(getCipher("AES-256/CBC/PKCS7", key, iv, ENCRYPTION),
                         new Base64Encoder);

   encryptor.startMsg();
   // file is an already opened std.stdio.File
   foreach (chunk; file.byChunk(4096))
       encryptor.write(chunk.ptr, chunk.length);
   encryptor.endMsg(); // flush buffers, complete computations
   writeln(encryptor.toString());

You can read from a pipe while you are still writing to it, which allows you to bound the amount of memory that is in use at any one time. A common idiom for this is:

   pipe.startMsg();
   ubyte[4096] buffer;
   while (!infile.eof)
   {
      // rawRead returns the part of buffer that was actually filled
      ubyte[] chunk = infile.rawRead(buffer[]);
      pipe.write(chunk.ptr, chunk.length);

      if (infile.eof)
         pipe.endMsg();

      // drain whatever output is ready before reading more input
      while (pipe.remaining() > 0)
      {
         const size_t buffered = pipe.read(buffer.ptr, buffer.length);
         outfile.rawWrite(buffer[0 .. buffered]);
      }
   }
   if (infile.error)
      throw new Exception("error while reading the input file");

Fork

You will often receive some data and want to perform more than one operation on it (e.g., encrypt it with Serpent and calculate the SHA-256 hash of the plaintext at the same time). That's where Fork comes in. Fork is a filter that takes input and passes it on to one or more filters that are attached to it. Fork changes the nature of the pipe system completely: instead of being a linked list, it becomes a tree or acyclic graph.

Each filter in the fork is given its own output buffer, and thus its own message. For example, if you had previously written two messages into a pipe, and you then start a new one with a fork that has three paths of filters inside it, you add three new messages to the pipe. The data you put into the pipe is duplicated and sent into each set of filters, and the eventual output is placed into a dedicated message slot in the pipe.

Messages in the pipe are allocated in a depth-first manner. This is only interesting if you are using more than one fork in a single pipe. As an example, consider the following:

   auto pipe = Pipe(new Fork(
                new Fork(
                   new Base64Encoder,
                   new Fork(
                      null,
                      new Base64Encoder
                      )
                   ),
                new HexEncoder
                )
      );

In this case, message 0 will be the output of the first Base64Encoder, message 1 will be a copy of the input (see below for how Fork interprets null pointers), message 2 will be the output of the second Base64Encoder, and message 3 will be the output of the HexEncoder. Message numbers are thus allocated from top to bottom as the code appears on the screen. Note, however, that this can lead to bugs if it is not anticipated. For example, if your code is passed a filter and you assume it is a "normal" one that only uses one message, your message offsets will be wrong, and you will read the wrong output.
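The depth-first numbering can be reproduced with a few lines of plain D. The sketch below models the fork tree from the example with a hypothetical Node class (it is not Botan's Fork); messageOrder walks the tree depth-first, and the position of each leaf in the result is its message number.

```d
import std.stdio : writeln;

// Hypothetical model of a filter tree; NOT Botan's Fork class.
class Node
{
    string label;
    Node[] branches; // empty for a leaf, i.e. an ordinary filter

    this(string label, Node[] branches = null)
    {
        this.label = label;
        this.branches = branches;
    }
}

// Collects leaf labels depth-first; the index of a label is its message number.
string[] messageOrder(Node node)
{
    if (node.branches.length == 0)
        return [node.label];

    string[] order;
    foreach (branch; node.branches)
        order ~= messageOrder(branch);
    return order;
}

void main()
{
    auto tree = new Node("fork", [
        new Node("fork", [
            new Node("first Base64Encoder"),
            new Node("fork", [
                new Node("copy of input (null branch)"),
                new Node("second Base64Encoder"),
            ]),
        ]),
        new Node("HexEncoder"),
    ]);

    foreach (i, label; messageOrder(tree))
        writeln("message ", i, ": ", label);
}
```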

If Fork's first argument is a null pointer, but a later argument is not, then Fork will feed a copy of its input directly through. Here's a case where that is useful:

   // have string ciphertext, auth_code, key, iv, mac_key;

   Pipe pipe = Pipe(new Base64Decoder,
                    getCipher("AES-128/CBC", key, iv, DECRYPTION),
                    new Fork(
                        null, // this message gets plaintext
                        new MACFilter("HMAC(SHA-1)", mac_key)
                ));

   pipe.processMsg(ciphertext);
   string plaintext = pipe.toString(0);
   SecureVector!ubyte mac = pipe.readAll(1);

   if(mac != auth_code)
      error();

Here we wanted to not only decrypt the message, but send the decrypted text through an additional computation, in order to compute the authentication code.

Any filters that are attached to the pipe after the fork are implicitly attached onto the first branch created by the fork. For example, let's say you created this pipe:

   Pipe pipe = Pipe(new Fork(new HashFilter("SHA-256"),
                             new HashFilter("SHA-512")),
                    new HexEncoder);

Suppose you then called startMsg, inserted some data, and called endMsg. The pipe would then contain two messages. The first one (message number 0) would contain the SHA-256 sum of the input in hex encoded form, and the other would contain the SHA-512 sum of the input in raw binary. In many situations you'll want to perform a sequence of operations on multiple branches of the fork; in that case, use the Chain filter described in the next section.
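To see what the two messages would look like, the following sketch computes the same two results with Phobos' std.digest instead of Botan filters. The function names firstBranch and secondBranch are made up for this illustration; only the first branch gets the hex encoding, mirroring the implicit attachment described above.

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of, sha512Of;
import std.stdio : writeln;
import std.string : representation;

// Message 0: SHA-256 followed by the hex encoder attached after the fork.
string firstBranch(const(ubyte)[] input)
{
    ubyte[32] digest = sha256Of(input);
    return toHexString(digest[]).idup;
}

// Message 1: SHA-512 only; the hex encoder is not applied to this branch.
ubyte[] secondBranch(const(ubyte)[] input)
{
    ubyte[64] digest = sha512Of(input);
    return digest[].dup;
}

void main()
{
    auto input = "some data".representation;
    writeln("message 0 (hex text):   ", firstBranch(input));
    writeln("message 1 (raw binary): ", secondBranch(input).length, " bytes");
}
```

Both results happen to be 64 units long here (64 hex characters versus 64 raw bytes), which is exactly the kind of coincidence that can hide a missing encoder on the second branch.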

Chain

A Chain filter creates a chain of filters and encapsulates them inside a single filter (itself). This allows a sequence of filters to become a single filter, to be passed into or out of a function, or to a Fork constructor.

You can call Chain's constructor with up to four Filter pointers (they will be added in order), or with an array of filter pointers and a size_t that tells Chain how many filters are in the array (again, they will be attached in order). Here's the example from the last section, using chain instead of relying on the implicit passthrough the other version used:

  Pipe pipe = Pipe(new Fork(
                     new Chain(new HashFilter("SHA-256"), new HexEncoder),
                     new HashFilter("SHA-512")
              ));

Sources and Sinks

Data Sources

A DataSource is a simple abstraction for a thing that stores bytes. This type is used heavily in the areas of the API related to ASN.1 encoding/decoding. The following types are DataSources: Pipe, SecureQueue, and a couple of special purpose ones: DataSourceMemory and DataSourceStream.

You can create a DataSourceMemory with an array of bytes and a length field. The object will make a copy of the data, so you don't have to worry about keeping that memory allocated. This is mostly for internal use, but if it comes in handy, feel free to use it.

A DataSourceStream is probably more useful than the memory based one. Its constructors take either a File or a string. If a File is given, the data source will use it to satisfy read requests. If the string version is used, it will attempt to open a file with that name and read from it.

Data Sinks

A DataSink (in botan.filters.data_snk) is a Filter that takes arbitrary amounts of input, and produces no output. This means it's doing something with the data outside the realm of what Filter/Pipe can handle, for example, writing it to a file (which is what the DataSinkStream does). There is no need for DataSinks that write to a string or memory buffer, because Pipe can handle that by itself.

Here's a quick example of using a DataSink, which encrypts in.txt and sends the output to out.txt. There is no explicit output operation; the writing of out.txt is implicit:

   // "in" is a reserved word in D, so the source is named input
   DataSourceStream input = DataSourceStream("in.txt");
   Pipe pipe = Pipe(getCipher("AES-128/CTR-BE", key, iv, ENCRYPTION),
                    new DataSinkStream("out.txt"));
   pipe.processMsg(input);

A real advantage of this is that even if "in.txt" is large, only as much memory as is needed for internal I/O buffers will be used.

The Pipe API

Initializing Pipe

By default, Pipe will do nothing at all; any input placed into the Pipe will be read back unchanged. Obviously, this has limited utility, and presumably you want to use one or more filters to somehow process the data. First, you can choose a set of filters to initialize the Pipe via the constructor. You can pass it either a set of up to four filter pointers, or a pre-defined array and a length:

   Pipe pipe1 = Pipe(new Filter1(/*args*/), new Filter2(/*args*/),
                     new Filter3(/*args*/), new Filter4(/*args*/));
   Pipe pipe2 = Pipe(new Filter1(/*args*/), new Filter2(/*args*/));

   Filter[5] filters = [
     new Filter1(/*args*/), new Filter2(/*args*/), new Filter3(/*args*/),
     new Filter4(/*args*/), new Filter5(/*args*/) /* more if desired... */
   ];
   Pipe pipe3 = Pipe(filters, 5);

This is by far the most common way to initialize a Pipe. However, occasionally a more flexible initialization strategy is necessary; this is supported by 4 member functions. These functions may only be used while the pipe in question is not in use; that is, either before calling startMsg, or after endMsg has been called (and no new calls to startMsg have been made yet).

void prepend(Filter filter)

Calling prepend will put the passed filter first in the list of transformations. For example, if you prepend a filter implementing encryption, and the pipe already had a filter that hex encoded the input, then the next message processed would be first encrypted, and then hex encoded.

void append(Filter filter);

Like prepend, but places the filter at the end of the message flow. This doesn't always do what you expect if there is a fork.

void pop()

Removes the first filter in the flow.

void reset()

Removes all the filters that the pipe currently holds - it is reset to an empty/no-op state. Any data that is being retained by the pipe is retained after a reset, and reset does not affect message numbers (discussed later).
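The effect of these four functions on the order of operations can be pictured with a simple list. The sketch below is a hypothetical model in which filters are represented by their names only; FilterFlow is not part of Botan, and it does not model retained data or message numbers.

```d
import std.stdio : writeln;

// Hypothetical model: a pipe's filter sequence as a list of names.
struct FilterFlow
{
    string[] filters;

    void prepend(string f) { filters = [f] ~ filters; }
    void append(string f)  { filters ~= f; }
    void pop()             { if (filters.length > 0) filters = filters[1 .. $]; }
    void reset()           { filters = null; }
}

void main()
{
    FilterFlow flow;
    flow.append("HexEncoder");
    flow.prepend("cipher");  // the cipher now runs first, then the hex encoder
    writeln(flow.filters);

    flow.pop();              // removes the first filter, here the cipher
    writeln(flow.filters);

    flow.reset();            // back to the empty, pass-through state
    writeln(flow.filters.length);
}
```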

Giving Data to a Pipe

Input to a Pipe is delimited into messages, which can be read from independently (i.e., you can read 5 bytes from one message, and then all of another message, without either read affecting any other messages).

void startMsg();

Starts a new message; if a message was already running, an exception is thrown. After this function returns, you can call write.

void write(const ubyte* input, size_t length);

void write(in Vector!ubyte input);

void write(in string input);

void write(DataSource input);

void write(ubyte input);

All versions of write write the input into the filter sequence. If a message is not currently active, an exception is thrown.

void endMsg()

End the currently active message

Sometimes, you may want to do only a single write per message. In this case, you can use the processMsg series of functions, which start a message, write their argument into the pipe, and then end the message. In this case you would not make any explicit calls to startMsg/endMsg.

Getting Output from a Pipe

Retrieving the processed data from a pipe is a bit more complicated, for various reasons. The pipe will separate each message into a separate buffer, and you have to retrieve data from each message independently. Each of the reader functions has a final parameter that specifies what message to read from. If this parameter is set to Pipe.DEFAULT_MESSAGE, it will read the current default message (DEFAULT_MESSAGE is also the default value of this parameter).

Functions in Pipe related to reading include:

size_t read(ubyte* output, size_t len);

Reads up to len bytes into output, and returns the number of bytes actually read.

size_t peek(ubyte* output, size_t len);

Acts exactly like read, except the data is not actually read; the next read will return the same data.
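The difference between read and peek comes down to whether the read position advances. Here is a minimal sketch of that behavior for a single message; MessageBuffer is a made-up type for illustration and says nothing about how Botan stores messages internally.

```d
import std.algorithm.comparison : min;
import std.stdio : writeln;

// Hypothetical single-message buffer; NOT Botan's implementation.
struct MessageBuffer
{
    private const(ubyte)[] data;
    private size_t offset; // position of the next unread byte

    // Copies up to len bytes without consuming them.
    size_t peek(ubyte* output, size_t len) const
    {
        immutable n = min(len, data.length - offset);
        output[0 .. n] = data[offset .. offset + n];
        return n;
    }

    // Copies up to len bytes and advances the read position.
    size_t read(ubyte* output, size_t len)
    {
        immutable n = peek(output, len);
        offset += n;
        return n;
    }

    size_t remaining() const { return data.length - offset; }
}

void main()
{
    auto msg = MessageBuffer([10, 20, 30, 40, 50]);
    ubyte[2] buf;

    msg.peek(buf.ptr, buf.length);
    writeln(buf, " remaining: ", msg.remaining()); // peek consumed nothing

    msg.read(buf.ptr, buf.length);
    writeln(buf, " remaining: ", msg.remaining()); // same bytes, now consumed
}
```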

SecureVector!ubyte readAll();

Reads the entire message into a buffer and returns it

string toString()

Like readAll, but it returns the data as a string. No encoding is done; if the message contains raw binary, so will the string.

size_t remaining()

Returns how many bytes are left in the message.

Pipe.message_id defaultMsg()

Returns the current default message number

Pipe.message_id messageCount()

Returns the total number of messages currently in the pipe

void setDefaultMsg(Pipe.message_id msgno);

Sets the default message number (which must be a valid message number for that pipe).

Filter Catalog

This section documents most of the useful filters included in the library.

Keyed Filters

A few sections ago, it was mentioned that Pipe can process multiple messages, treating each of them the same. Well, that was a bit of a lie. There are some algorithms (in particular, block ciphers not in ECB mode, and all stream ciphers) that change their state as data is put through them.

Naturally, you might well want to reset the keys or (in the case of block cipher modes) IVs used by such filters, so multiple messages can be processed using completely different keys, or new IVs, or new keys and IVs, or whatever. And in fact, even for a MAC or an ECB block cipher, you might well want to change the key used from message to message.

Enter KeyedFilter, which acts as an abstract interface for any filter that uses keys: block cipher modes, stream ciphers, MACs, and so on. It has two functions, setKey and setIv. Calling setKey will set (or reset) the key used by the algorithm. Setting the IV only makes sense in certain algorithms; a call to setIv on an object that doesn't support IVs will cause an exception. You must call setKey before calling setIv.

Here's an example:

   KeyedFilter aes, hmac;
   Pipe pipe = Pipe(new Base64Decoder,
                    // Note the assignments to the aes and hmac variables
                    aes = getCipher("AES-128/CBC", aes_key, iv, DECRYPTION),
                    new Fork(
                       null, // Read the section 'Fork' to understand this
                       new Chain(
                          hmac = new MACFilter("HMAC(SHA-1)", mac_key, 12),
                          new Base64Encoder
                       )
                    )
   );
   pipe.startMsg();
   // use pipe for a while, decrypt some stuff, derive new keys and IVs
   pipe.endMsg();

   aes.setKey(aes_key2);
   aes.setIv(iv2);
   hmac.setKey(mac_key2);

   pipe.startMsg();
   // use pipe for some other things
   pipe.endMsg();

There are some requirements to using KeyedFilter that you must follow. If you call setKey or setIv on a filter that is owned by a Pipe, you must do so while the Pipe is "unlocked". This refers to the times when no messages are being processed by Pipe -- either before Pipe's startMsg is called, or after endMsg is called (and no new call to startMsg has happened yet). Doing otherwise will result in undefined behavior, probably silently getting invalid output.

And remember: if you're resetting both values, reset the key first.

Cipher Filters

Getting a hold of a Filter implementing a cipher is very easy. Make sure you're importing module botan.libstate.lookup, and then call getCipher. You will pass the return value directly into a Pipe. There are a couple different functions which do varying levels of initialization:

KeyedFilter getCipher(string cipher_spec, 
                        SymmetricKey key, 
                        InitializationVector iv, 
                        CipherDir dir);

KeyedFilter getCipher(string cipher_spec, 
                        SymmetricKey key,
                        CipherDir dir);

The version that doesn't take an IV is useful for things that don't use them, like block ciphers in ECB mode, or most stream ciphers. If you specify a cipher spec that does want an IV, and you use the version that doesn't take one, an exception will be thrown. The dir argument can be either ENCRYPTION or DECRYPTION.

The cipher_spec is a string that specifies what cipher is to be used. The general syntax for "cipher_spec" is "STREAM_CIPHER", "BLOCK_CIPHER/MODE", or "BLOCK_CIPHER/MODE/PADDING". In the case of stream ciphers, no mode is necessary, so just the name is sufficient. A block cipher requires a mode of some sort, which can be "ECB", "CBC", "CFB(n)", "OFB", "CTR-BE", or "EAX(n)". The argument to CFB mode is how many bits of feedback should be used. If you just use "CFB" with no argument, it will default to using a feedback equal to the block size of the cipher. EAX mode also takes an optional bit argument, which tells EAX how large a tag size to use; generally this is the block size of the cipher, which is the default if you don't specify any argument.

In the case of the ECB and CBC modes, a padding method can also be specified. If it is not supplied, ECB defaults to not padding, and CBC defaults to using PKCS #5/#7 compatible padding. The padding methods currently available are "NoPadding", "PKCS7", "OneAndZeros", and "CTS". CTS padding is currently only available for CBC mode, but the others can also be used in ECB mode.

Some example cipher_spec arguments are: "AES-128/CBC", "Blowfish/CTR-BE", "Serpent/XTS", and "AES-256/EAX".
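The structure of a spec string is easy to see once it is split on the slashes. The helper below is purely illustrative: CipherSpec and parseCipherSpec are hypothetical names, the library does its own parsing inside getCipher, and the sketch only fills in the two padding defaults stated above (PKCS7 for CBC, NoPadding for ECB) without validating cipher or mode names.

```d
import std.array : split;
import std.stdio : writeln;

// Hypothetical helper type for illustration only.
struct CipherSpec
{
    string cipher;
    string mode;    // empty for a stream cipher
    string padding; // empty when the mode takes no padding
}

CipherSpec parseCipherSpec(string spec)
{
    auto parts = spec.split("/");
    if (parts.length == 0 || parts.length > 3)
        throw new Exception("malformed cipher spec: " ~ spec);

    CipherSpec result;
    result.cipher = parts[0];
    if (parts.length >= 2)
        result.mode = parts[1];

    if (parts.length == 3)
        result.padding = parts[2];
    else if (result.mode == "CBC")
        result.padding = "PKCS7";     // documented default for CBC
    else if (result.mode == "ECB")
        result.padding = "NoPadding"; // documented default for ECB

    return result;
}

void main()
{
    foreach (spec; ["AES-128/CBC", "Blowfish/CTR-BE", "AES-128/ECB/PKCS7"])
        writeln(spec, " -> ", parseCipherSpec(spec));
}
```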

"CTR-BE" refers to counter mode where the counter is incremented as if it were a big-endian encoded integer. This is compatible with most other implementations, but it is possible some will use the incompatible little endian convention. This version would be denoted as "CTR-LE" if it were supported.

"EAX" is a new cipher mode designed by Wagner, Rogaway, and Bellare. It is an authenticated cipher mode (that is, no separate authentication is needed), has provable security, and is free from patent entanglements. It runs about half as fast as most of the other cipher modes (like CBC, OFB, or CTR), which is not bad considering you don't need to use an authentication code.

Hashes and MACs

Hash functions and MACs don't need anything special when it comes to filters. Both just take their input and produce no output until endMsg is called, at which time they complete the hash or MAC and send that as output.

These filters take a string naming the type to be used. If for some reason you name something that doesn't exist, an exception will be thrown.

class HashFilter {
    this(string hash, size_t outlen = 0);
}

This constructor creates a filter that hashes its input with hash. When endMsg is called on the owning pipe, the hash is completed and the digest is sent on to the next filter in the pipeline. The parameter outlen specifies how many bytes of the hash output will be passed along to the next filter when endMsg is called. By default, it will pass the entire hash.

Examples of names for HashFilter are "SHA-1" and "Whirlpool".

class MACFilter {
   this(string mac, SymmetricKey key, size_t outlen = 0);
}

This constructor takes a name for a mac, such as "HMAC(SHA-1)" or "CMAC(AES-128)", along with a key to use. The optional outlen works the same as in HashFilter.

PK Filters

There are four classes in this category, PKEncryptorFilter, PKDecryptorFilter, PKSignerFilter, and PKVerifierFilter. Each takes a pointer to an object of the appropriate type (PKEncryptor, PKDecryptor, etc) that is deleted by the destructor. These classes are found in botan.filters.pk_filts.

Three of these (encryption, decryption, and signing) are much the same in terms of dataflow: each of them buffers its input until the end of the message is marked with a call to the endMsg function. Then they encrypt, decrypt, or sign the entire input as a single blob and send the output (the ciphertext, the plaintext, or the signature) into the next filter.

Signature verification works a little differently, because it needs to know what the signature is in order to check it. You can either pass this in along with the constructor, or call the function set_signature; with this second method, you need to keep a pointer to the filter around so you can send it this command. In either case, after endMsg is called, it will try to verify the signature (if the signature has not been set by either method, an exception will be thrown here). It will then send a single byte onto the next filter: a 1 or a 0, which specifies whether the signature verified or not (respectively).

For more information about PK algorithms (including creating the appropriate objects to pass to the constructors), see Public Key Cryptography.

Encoders

Often you want your data to be in some form of text (for sending over channels that aren't 8-bit clean, printing it, etc). The filters HexEncoder and Base64Encoder will convert arbitrary binary data into hex or base64 formats. Not surprisingly, you can use HexDecoder and Base64Decoder to convert it back into its original form.

Both of the encoders can take a few options about how the data should be formatted (all of which have defaults). The first is a bool which says if the encoder should insert line breaks. This defaults to false. Line breaks don't matter either way to the decoder, but it makes the output a bit more appealing to the human eye, and a few transport mechanisms (notably some email systems) limit the maximum line length.

The second encoder option is an integer specifying how long such lines will be (obviously this will be ignored if line-breaking isn't being used). The default tends to be in the range of 60-80 characters, but is not specified. If you want a specific value, set it. Otherwise the default should be fine.

Lastly, HexEncoder takes an argument of type Case, which can be Uppercase or Lowercase (default is Uppercase). This specifies what case the characters A-F should be output as. The base64 encoder has no such option, because it uses both upper and lower case letters for its output.
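If you just want to see what the two encodings and the formatting options look like, Phobos can produce equivalent output without any Botan filters. This sketch uses std.digest.toHexString and std.base64; wrapLines is a made-up helper that mimics the line-breaking option and assumes ASCII input and a width greater than zero.

```d
import std.ascii : LetterCase;
import std.base64 : Base64;
import std.digest : toHexString;
import std.stdio : writeln;

// Hypothetical helper: inserts a line break every width characters.
string wrapLines(string text, size_t width)
{
    assert(width > 0);
    string result;
    for (size_t i = 0; i < text.length; i += width)
    {
        if (i > 0)
            result ~= '\n';
        immutable end = i + width < text.length ? i + width : text.length;
        result ~= text[i .. end];
    }
    return result;
}

void main()
{
    immutable ubyte[] data = [0xDE, 0xAD, 0xBE, 0xEF];

    writeln(toHexString(data));                    // DEADBEEF
    writeln(toHexString!(LetterCase.lower)(data)); // deadbeef
    writeln(Base64.encode(data));                  // 3q2+7w==

    // Decoding restores the original bytes.
    assert(Base64.decode("3q2+7w==") == data);

    writeln(wrapLines("0123456789ABCDEF", 6));
}
```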

You can find the declarations for these types in botan.filters.hex_filt and botan.filters.b64_filt.

Compressors

There are two compression algorithms supported by Botan, zlib and bzip2. Only lossless compression algorithms are currently supported by Botan, because they tend to be the most useful for cryptography. However, it is very reasonable to consider supporting something like GSM speech encoding (which is lossy), for use in encrypted voice applications.

You should always compress before you encrypt, because encryption seeks to hide the redundancy that compression is supposed to try to find and remove.

To test for Bzip2, check to see if BOTAN_HAS_COMPRESSOR_BZIP2 is defined. If so, you can import botan.filters.bzip2, which will declare a pair of Filter objects: Bzip2_Compression and Bzip2_Decompression.

You should be prepared to take an exception when using the decompressing filter, for if the input is not valid bzip2 data, that is what you will receive. You can specify the desired level of compression to Bzip2_Compression's constructor as an integer between 1 and 9, 1 meaning worst compression, and 9 meaning the best. The default is to use 9, since small values take the same amount of time, just use a little less memory.

Zlib compression works much like Bzip2 compression. The only differences in this case are that the macro is BOTAN_HAS_COMPRESSOR_ZLIB and the module you need to import is called botan.filters.zlib. The Botan classes for zlib compression/decompression are called ZlibCompression and ZlibDecompression.

Like Bzip2, a ZlibDecompression object will throw an exception if invalid (in the sense of not being in the Zlib format) data is passed into it.

While the zlib compression library uses the same compression algorithm as the gzip and zip programs, the format is different. The zlib format is defined in RFC 1950.
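The zlib framing is easy to recognize from its two-byte header. The sketch below uses Phobos' std.zlib (a separate binding to the same zlib library, not Botan's ZlibCompression filter) and checks the two properties RFC 1950 requires of the header: the low four bits of the first byte name the compression method (8 for deflate), and the first two bytes read as a big-endian number are a multiple of 31. The function name looksLikeZlib is made up for this example.

```d
import std.stdio : writefln, writeln;
import std.zlib : compress, uncompress;

// Checks the RFC 1950 header: CM == 8 and (CMF * 256 + FLG) % 31 == 0.
bool looksLikeZlib(const(ubyte)[] data)
{
    if (data.length < 2)
        return false;
    return (data[0] & 0x0F) == 8
        && (data[0] * 256 + data[1]) % 31 == 0;
}

void main()
{
    auto original = "hello hello hello hello";
    ubyte[] packed = compress(original);

    writefln("header bytes: 0x%02X 0x%02X", packed[0], packed[1]);
    writeln("zlib framing: ", looksLikeZlib(packed));

    // Round trip back to the original text.
    auto restored = cast(const(char)[]) uncompress(packed);
    assert(restored == original);
}
```

A gzip file would fail this check because it starts with a different magic number, which is why data produced by the gzip program cannot be fed directly to a zlib-format decompressor.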

Writing New Filters

The system of filters and pipes was designed in an attempt to make it as simple as possible to write new filter types. There are four functions that matter to a class deriving from Filter: write must be implemented, send is called to pass output along, and startMsg and endMsg are optional.

void write(in ubyte* input, size_t length)

This function is what is called when a filter receives input for it to process. The filter is not required to process the data right away; many filters buffer their input before producing any output. A filter will usually have write called many times during its lifetime.

void send(ubyte* output, size_t length);

Eventually, a filter will want to produce some output to send along to the next filter in the pipeline. It does so by calling send with whatever it wants to send along to the next filter. There is also a version of send taking a single ubyte argument, as a convenience.

void startMsg();

Implementing this function is optional. Implement it if your filter would like to do some processing or setup at the start of each message, such as allocating a data structure.

void endMsg();

Implementing this function is optional. It is called when it has been requested that filters finish up their computations. The filter should finish up with whatever computation it is working on (for example, a compressing filter would flush the compressor and send the final block), and empty any buffers in preparation for processing a fresh new set of input.

Additionally, if necessary, filters can define a constructor that takes any needed arguments, and a destructor to deal with deallocating memory, closing files, etc.
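The following sketch shows how these four functions fit together. Because it has to compile on its own, it defines a simplified stand-in base class, ToyFilter, rather than deriving from Botan's Filter; the real base class has more machinery, and the way output is collected at the end of the chain here is an assumption made for the example. ReverseFilter is likewise hypothetical: it buffers its input and emits it reversed when the message ends.

```d
import std.algorithm.mutation : reverse;
import std.stdio : writeln;
import std.string : representation;

// Simplified stand-in for the Filter base class; NOT Botan's Filter.
abstract class ToyFilter
{
    ToyFilter next;    // next filter in the chain, or null at the end
    ubyte[] collected; // output gathered when there is no next filter

    abstract void write(const(ubyte)* input, size_t length);
    void startMsg() {} // optional hook
    void endMsg() {}   // optional hook

    // Passes output along to whatever comes next.
    protected final void send(const(ubyte)* output, size_t length)
    {
        if (next !is null)
            next.write(output, length);
        else
            collected ~= output[0 .. length];
    }
}

// Buffers the whole message and emits it reversed at endMsg.
final class ReverseFilter : ToyFilter
{
    private ubyte[] buffer;

    override void startMsg() { buffer.length = 0; }

    override void write(const(ubyte)* input, size_t length)
    {
        buffer ~= input[0 .. length]; // no output yet, just buffering
    }

    override void endMsg()
    {
        reverse(buffer);
        send(buffer.ptr, buffer.length);
        buffer.length = 0; // ready for a fresh message
    }
}

void main()
{
    auto filter = new ReverseFilter;
    filter.startMsg();
    foreach (part; ["abc", "def"])
    {
        auto bytes = part.representation;
        filter.write(bytes.ptr, bytes.length); // write may be called many times
    }
    filter.endMsg();
    writeln(cast(const(char)[]) filter.collected); // fedcba
}
```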