<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en"
xmlns:xlink="http://www.w3.org/1999/xlink">
<!--
Licensed to Odiago, Inc. under one or more contributor license
agreements. See the NOTICE.txt file distributed with this work for
additional information regarding copyright ownership. Odiago, Inc.
licenses this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License.
-->
<title>FlumeBase User Guide</title>
<subtitle>version <?eval ${project.version} ?></subtitle>
<section>
<title>Introduction</title>
<para>
FlumeBase is a database-inspired stream processing system built on top
of <productname>Flume</productname>. This system allows users to
dynamically insert queries into a data collection environment and
inspect the stream of events being collected by Flume. These
queries may spot-check incoming data, or specify persistent
monitoring, data transformation, or quality filtering tasks.
Queries are written in a SQL-like language called "rtsql."
</para>
<para>
FlumeBase can present data back to users of an interactive shell environment.
It can also be configured to deliver streams of output events back into a
Flume network, for consumption by other tools or persistence in HBase, HDFS,
or other storage media.
</para>
<para>
The emphasis of this system is on low-latency analysis of
incoming data being captured by Flume. The name "rtsql"
(FlumeBase's query language)
underscores the real-time nature of the query system, as well as
the SQL-based origin of the query language syntax. It is hoped
that FlumeBase will allow you to perform useful in-line data
transformation or filtering, or time-sensitive alerting or
tuning of a broader system, before subjecting the data being
captured by <productname>Flume</productname> to a deeper (but
perhaps higher-latency) analysis with other tools such as
<productname>Hadoop MapReduce</productname>.
</para>
<warning>
<para>
FlumeBase is an EXPERIMENTAL system! This is in no way ready
for production use. Use this AT YOUR OWN RISK. Connecting
this system to production Flume nodes may result in data
loss, misconfiguration, or other serious problems.
</para>
</warning>
<para>
This document explains how to install and configure the FlumeBase
system. It then explains the rtsql language, used to submit
queries to the runtime environment, and the commands used to
control the terminal client itself. This document is intended
for:
<itemizedlist>
<listitem>System administrators</listitem>
<listitem>Data analysts</listitem>
<listitem>Data engineers</listitem>
</itemizedlist>
</para>
</section>
<section>
<title>Quick Start</title>
<para>
For those who understand Flume and SQL, and just want to see a demo of what
can be done with FlumeBase, follow the steps in this section. This is a
five-minute tour of the FlumeBase world.
</para>
<para>
First, copy the following text into a file named <filename>data.txt</filename>.
</para>
<programlisting>
1,aaron,purple,42
2,bob,blue,11
3,cindy,green,312
</programlisting>
<para>
Install Flume 0.9.3, Hadoop 0.20, and Java 6. If you are running Cloudera's
Distribution of Hadoop 3 beta 4 (CDH3B4), you have already installed all
of these. Users of older versions of these products will need to upgrade.
See <xref linkend="installation" /> for more thorough installation
instructions.
</para>
<para>
Unzip the FlumeBase installation:
</para>
<programlisting>
$ <userinput>tar vzxf flumebase-(version).tar.gz</userinput>
</programlisting>
<para>
Start the FlumeBase shell:
</para>
<programlisting>
$ <userinput>cd flumebase-(version)/</userinput>
$ <userinput>bin/flumebase shell</userinput>
</programlisting>
<para>
By default, FlumeBase is configured with a self-contained environment that
embeds the FlumeBase server and Flume itself within the same process as
the shell. Now let's define a stream over the file, and query it.
</para>
<programlisting>
rtsql> <userinput>CREATE STREAM data(id int, name string, favcolor string,</userinput>
-> <userinput>luckynumber int) FROM LOCAL SOURCE 'tail("/path/to/data.txt")';</userinput>
CREATE STREAM
rtsql> <userinput>SELECT * FROM data;</userinput>
</programlisting>
<para>
You created a stream that operates over a local (self-hosted) Flume logical node,
which reads all the lines from <filename>data.txt</filename>. You then ran
a query that extracts all fields from each event in the stream. Each line
of the file corresponds to a different event.
</para>
<para>
In another terminal, now execute the following:
</para>
<programlisting>
$ <userinput>echo 4,dave,orange,611 >> /path/to/data.txt</userinput>
</programlisting>
<para>
You should observe that as soon as Flume detects the new record (about a second's
delay), it will be passed along to FlumeBase and emitted on your console.
</para>
<para>
The submitted query has created a "flow," which runs as long as we allow it.
If more data were to enter Flume via that file, we would continue to process
it. Now, let's cancel that flow:
</para>
<programlisting>
rtsql> <userinput>\d 1</userinput>
</programlisting>
<para>
(As FlumeBase decommissions the internal logical node, there may be an error
emitted by Flume itself; this is normal. In general, running in a single
process will be "noisy" because of both client and server activity condensed
to a single console. For a cleaner session experience, run the server and
client in separate processes; see <xref linkend="installation"/> for
instructions.)
</para>
<para>
And now let's run another query:
</para>
<programlisting>
rtsql> <userinput>SELECT favcolor FROM data WHERE luckynumber = 42;</userinput>
</programlisting>
<para>
After a few seconds, this flow is initialized with the data in the Flume
logical node. Note that we get only one row out of our original data set.
If you add more lines to the file which add events where the <literal>luckynumber</literal>
column is <constant>42</constant>, you'll see them appear in the FlumeBase
console.
</para>
<para>
This concludes our tour. To quit the FlumeBase shell, run:
</para>
<programlisting>
rtsql> <userinput>\q</userinput>
</programlisting>
<para>
The remaining sections of this user guide will describe multi-process configuration,
the rtsql language, and shell operation in greater detail. Good luck!
</para>
</section>
<section id="installation">
<title>Installation</title>
<section>
<title>Prerequisites</title>
<para>
FlumeBase requires a few prerequisites before it can be run on your machine:
</para>
<itemizedlist>
<listitem><productname>Java</productname> 6.0</listitem>
<listitem><productname>Hadoop</productname> 0.20</listitem>
<listitem><productname>Flume</productname> 0.9.3</listitem>
</itemizedlist>
<para>
Java can be obtained from <link
xlink:href="http://www.oracle.com/technetwork/java/index.html">http://www.oracle.com/technetwork/java/index.html</link>.
The Java 6.0 SE JRE (or JDK) is required. Java downloads and installation
instructions can be found on Oracle's web site.
</para>
<para>
The other two prerequisites can be installed from <productname>Cloudera's
Distribution for Hadoop</productname>, version 3-beta-4 (CDH3b4) or
newer. See
<link xlink:href="http://archive.cloudera.com">http://archive.cloudera.com</link>
for instructions on downloading and installing <productname>Cloudera's
Distribution for Hadoop</productname>.
</para>
<para>
While FlumeBase is written in <productname>Java</productname> and thus
should be portable across a wide variety of operating systems, testing
has only been performed under a Linux environment. It is likely to work
under cygwin and OS X as well, but no guarantees are made.
</para>
<para>
The following prerequisite knowledge is required to understand
this documentation:
<itemizedlist>
<listitem>Basic computer technology and terminology</listitem>
<listitem>Familiarity with command-line interfaces such as
<literal>bash</literal></listitem>
<listitem>Prior understanding of Flume's operation and purpose</listitem>
<listitem>Prior exposure to SQL is recommended</listitem>
</itemizedlist>
</para>
</section>
<section>
<title>Program installation</title>
<para>
FlumeBase itself is distributed as a tar file. Install FlumeBase by unzipping
the tar file:
<screen>
$ <userinput>tar vzxf flumebase-(version).tar.gz</userinput>
</screen>
</para>
<para>
This will expand to a directory called
<filename>flumebase-(version)/</filename>.
</para>
</section>
</section>
<section>
<title>Configuration</title>
<para>
By default, FlumeBase is configured to run in a single process
combining both the interactive shell, and the execution engine.
Terminating the shell will also terminate the execution
environment, including all running queries. This is most useful
for evaluating FlumeBase. For more serious use, the execution
environment should be run in a persistent process on a server.
Clients should be configured to connect to this server, or users
should be instructed to explicitly do so.
</para>
<para>
To enable zero-configuration evaluation of FlumeBase, the
FlumeBase process also hosts an embedded Flume master node. To
interact with existing streaming data sources, it should instead
be reconfigured to point to an existing Flume deployment.
</para>
<section>
<title>Server configuration</title>
<para>
Install FlumeBase on a server where the query execution engine should
be run. Then edit the <filename>etc/flumebase-site.xml</filename> file
to contain the following values:
</para>
<table>
<caption>Configuration settings for FlumeBase servers</caption>
<thead>
<tr><td>Property</td><td>Value</td></tr>
</thead>
<tbody>
<tr><td><constant>flume.home</constant></td>
<td>The path to $FLUME_HOME on your server.</td></tr>
<tr><td><constant>flumebase.remote.port</constant></td>
<td>The port where the FlumeBase server listens for clients.</td></tr>
<tr><td><constant>embedded.flume.master</constant></td>
<td>This should be set to <constant>false</constant> if a Flume
master is available. A value of <constant>true</constant> means
that the FlumeBase environment acts as its own Flume master, separate
from an existing Flume network.</td></tr>
<tr><td><constant>flumebase.flume.master.host</constant></td>
<td>The hostname of the foreign Flume master to connect to.</td></tr>
<tr><td><constant>flumebase.flume.master.port</constant></td>
<td>The port the foreign Flume master listens on.</td></tr>
<tr><td><constant>flumebase.flume.collector.port.min/max</constant></td>
<td>FlumeBase uses Flume collectors to receive data from the broader
Flume network. Set <constant>...port.min</constant> and
<constant>...port.max</constant> to the range of ports on the
FlumeBase server which the FlumeBase daemon may use for this purpose.</td></tr>
</tbody>
</table>
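<para>
As a concrete illustration, a server's
<filename>etc/flumebase-site.xml</filename> pointing at an existing Flume
master might contain entries like the following, using the same
Hadoop-style property format as <filename>flume-site.xml</filename>. The
host name and port value shown here are placeholders only; substitute the
address of your own Flume master:
<programlisting>
<property>
<name>embedded.flume.master</name>
<value>false</value>
</property>
<property>
<name>flumebase.flume.master.host</name>
<value>flume-master.example.com</value>
</property>
<property>
<name>flumebase.flume.master.port</name>
<value>9090</value>
</property>
</programlisting>
</para>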
<para>
Finally, to run in distributed mode, the Flume master node needs to
register the FlumeBase plugin. You should copy the
<filename>flumebase-(version).jar</filename> file from the FlumeBase
installation into <filename>/usr/lib/flume/lib</filename> on the Flume
master machine. Then edit <filename>flume-site.xml</filename> on
the master to include the setting:
<programlisting>
<property>
<name>flume.plugin.classes</name>
<value>com.odiago.flumebase.flume.FlumePlugin</value>
</property>
</programlisting>
</para>
<para>
You may need to restart the Flume master process for this to take effect.
</para>
<para>
After a server is configured, you may start a server instance by running:
<literal>bin/flumebase server</literal> from the directory where FlumeBase
was installed. To shut down a running server, see <xref
linkend="flumebase.client.connecting" />. Killing a server process with
<literal>^C</literal> is not recommended.
</para>
</section>
<section>
<title>Client configuration</title>
<para>
Install a copy of FlumeBase on every client machine where users intend to
submit queries to the FlumeBase system. The client must be able to open a
TCP connection to the FlumeBase server. In order to view output events on
the FlumeBase console, the server must be able to open a TCP connection
back to the client.
</para>
<para>
Set the following settings in <filename>etc/flumebase-site.xml</filename>
on the client machine:
</para>
<table>
<caption>Configuration settings for FlumeBase clients</caption>
<thead>
<tr><td>Property</td><td>Value</td></tr>
</thead>
<tbody>
<tr><td><constant>flume.home</constant></td>
<td>The path to $FLUME_HOME on the client.</td></tr>
<tr><td><constant>flumebase.autoconnect</constant></td>
<td>The host:port of the FlumeBase server to connect to. If set
to <constant>local</constant>, this will use an in-process server.
If set to <constant>none</constant>, the user must explicitly open
a server connection with <userinput>\open</userinput> in the
console.</td></tr>
<tr><td><constant>flumebase.flow.autowatch</constant></td>
<td>Defaults to <constant>true</constant>; this boolean property
specifies whether you want every query to automatically send its
output to the console when submitted. If false, you must explicitly
watch flow output with the <userinput>\watch</userinput> command.
</td></tr>
<tr><td><constant>flumebase.console.port</constant></td>
<td>FlumeBase uses a Thrift RPC connection to relay query output back to
the client. The client listens on the port specified by this
property.</td></tr>
</tbody>
</table>
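<para>
For example, a client that should automatically connect to a shared server
might set the following. The host and port shown are placeholders; use your
own server's address and its configured
<constant>flumebase.remote.port</constant>:
<programlisting>
<property>
<name>flumebase.autoconnect</name>
<value>flumebase-server.example.com:9292</value>
</property>
</programlisting>
</para>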
</section>
</section>
<section>
<title>Architecture</title>
<para>
The FlumeBase system is composed of a command-line client, a server called
the "execution environment," and the Flume system that collects and
transports data. These may be configured as separate, distributed
processes, or collocated on a single machine, or in a single process.
</para>
<para>
The command-line client is the simplest component in the product. This
process is run directly by a user (perhaps on a server, but more
often on her own desktop or laptop). The client connects to the execution
environment and provides the user with a prompt, where new
queries or control statements may be entered.</para>
<para>
Each query (i.e., <literal>SELECT</literal> statement) produces a
<emphasis>flow</emphasis> in the execution environment. The user may
subscribe to running flows (this is done automatically for new flows
created by the user). When a subscribed flow emits an output event, its
text is printed to the client terminal.
</para>
<para>
Closing the client does not terminate any submitted flows. These are
running in the <emphasis>execution environment</emphasis>, a separate
long-lived process which may be shared by multiple users. An execution
environment holds the definitions of all streams (created by
<literal>CREATE STREAM</literal> statements), and processes the running flows. The
execution environment is typically run on a dedicated server. For
evaluation purposes, it may also be hosted inside the same process as a
command-line client. (When the execution environment is embedded
in the client, terminating the client will terminate all running
flows, and discard knowledge of any streams.)
</para>
<para>
Submitted queries (flows) allow computation over
<emphasis>streams</emphasis> of data. Streams are defined as a set of
<emphasis>events</emphasis>, which are roughly analogous to "records" in a
table-based SQL environment. These events are directly linked to "events"
in Flume. Users define streams before querying them; these definitions
specify the fields within each event, how to parse the event body into the
fields, and where the stream originates. Each flow is itself a stream;
its output is also a series of events, based on the computations specified
by the user and the set of input events the flow operates over.
</para>
<para>
By default, queries submitted by users result in anonymous flows, which
deliver their outputs only to the subscribed client instances. These flows
continue to operate while no users are subscribed, but output events
generated when no users are subscribed are simply dropped (there is no way
to retrieve them later).
</para>
<para>
Users can bind a name to a running flow (or do so when submitting a flow
with the <literal>CREATE STREAM AS SELECT</literal> syntax). This name is
used as the name for a Flume logical node, which broadcasts the output of the
flow as a set of Avro-encoded events. Users may then use the Flume shell
to configure this logical node to direct a copy of its output to a
monitoring application, persistent storage (such as HDFS), or elsewhere.
<xref linkend="create.as.select" /> describes the <literal>CREATE STREAM
AS SELECT</literal> syntax and its effects in greater detail. <xref
linkend="controlling.flows" /> describes how to manipulate flow names.
</para>
<para>
FlumeBase reads from a Flume network by modifying the sink
definitions of nodes specified with <literal>CREATE
STREAM</literal> statements. When a logical node is identified
as a stream source, its sink definition is rewritten as a
fan-out sink containing its original sink, and a new agent sink
which forwards the node's output to a collector source hosted
within the FlumeBase execution environment. (The FlumeBase execution
environment will host an embedded Flume physical node, which
then hosts logical nodes as necessary to receive and transmit
streams of events.) When a stream is gracefully dropped (by
issuing <userinput>DROP STREAM</userinput>, or by using
<userinput>\shutdown!</userinput> to shut down the execution
environment), the original logical node definition is restored
to the logical node which provided the data stream.
</para>
<para>
Interaction between a FlumeBase execution environment and Flume is performed
via the Flume master node's Thrift interface. The physical node hosted
within an execution environment is controlled by the Flume master node,
and is, for all intents and purposes, an ordinary Flume node. For this
reason, flows may take a few seconds to initialize (or cancel), as they
are dependent on Flume for aspects of their configuration. Once
initialized, flows should operate on events with low latency. If no
external Flume network is available, you can configure the FlumeBase
execution environment to host an embedded Flume master node, for
evaluation or single-machine computation purposes.
</para>
</section>
<section id="rtsql.language">
<title>The rtsql language</title>
<para>
Users interact with FlumeBase by submitting commands and queries
written in a language called rtsql.
The rtsql language is designed to allow on-going analysis of incoming
data. The language is similar to <productname>SQL:2003</productname>; its
syntax will be largely familiar to SQL experts. It also provides
<productname>SQL:2003</productname>-style <emphasis>windowed
operators</emphasis> which allow joining and aggregation over bounded
amounts of time.
</para>
<para>
In rtsql, all data is consumed through <emphasis>streams</emphasis>. The
FlumeBase architecture assumes that these streams cannot be replayed, and may
be of infinite length. Therefore, all operators such as <literal>GROUP
BY</literal> which can use “all the rows” as input are restricted so that
they can only use windowed views into the stream. rtsql does allow a
stream to be defined over a file. A <literal>SELECT</literal> statement
querying such a stream will read the data in-order in the file and then
terminate when it reaches the end of the file, but rtsql does not
currently have special provisions for working with these data sources
in a different fashion than Flume-based sources.
</para>
<para>
Keywords in rtsql are case-insensitive. Identifiers (stream, column,
function names, etc.) are translated to lower-case for their canonical
representation, unless they are <literal>"double-quoted"</literal> in
which case they are interpreted literally.
</para>
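<para>
For example, assuming the <literal>data</literal> stream from the Quick
Start, the identifiers <literal>FAVCOLOR</literal> and
<literal>favcolor</literal> both name the same column, so the following
two queries are equivalent; a double-quoted identifier such as
<literal>"FavColor"</literal> would instead be interpreted literally as a
distinct, mixed-case name:
<screen>
rtsql> <userinput>SELECT FAVCOLOR FROM DATA;</userinput>
rtsql> <userinput>SELECT favcolor FROM data;</userinput>
</screen>
</para>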
<section>
<title>DDL Commands</title>
<section>
<title><literal>CREATE STREAM</literal></title>
<para>
The <literal>CREATE STREAM</literal> statement will create a stream
definition which may be used in subsequent statements such as
<literal>SELECT</literal>.
</para>
<programlisting>
CREATE STREAM <userinput>stream_name</userinput> (<userinput>col_name</userinput> data_type [, ...])
FROM [LOCAL] {FILE | NODE | SOURCE} <userinput>input_spec</userinput>
[EVENT FORMAT format_spec [PROPERTIES (key = val, …)]]
CREATE STREAM <userinput>stream_name</userinput> AS select_statement
data_type ::= BOOLEAN | BINARY | BIGINT | INT | FLOAT | DOUBLE | PRECISE(int) | STRING | TIMESTAMP
format_spec ::= 'delimited' | 'regex' | 'avro'
</programlisting>
<para>
<xref linkend="types" /> describes the rtsql data types in greater
detail.
</para>
<para>
<literal>input_spec</literal> is a
<literal>'single-quoted-string'</literal> identifying the filename /
Flume logical node / Flume source specification to use as the input
for this stream.
</para>
<para>
File names are Hadoop <classname>Path</classname> objects; they may
specify the complete URI to a file, using any protocol permitted by
the Hadoop common library, e.g.:
<screen>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM FILE</userinput>
-> <userinput>'hdfs://nn.example.com/user/aaron/foo.txt';</userinput>
</screen>
</para>
<para>
Unqualified file names are interpreted relative to the value of the
<constant>fs.default.name</constant> configuration parameter. For
example, if this were set to
<userinput>'hdfs://nn.example.com'</userinput>, the following
definition would be equivalent to the previous one:
<screen>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM FILE '/user/aaron/foo.txt';</userinput>
</screen>
</para>
<para>
Using the <literal>LOCAL</literal> keyword will cause the source
definition to be interpreted relative to the local filesystem of the
FlumeBase server. The following two statements are equivalent:
<screen>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM LOCAL FILE '/home/aaron/foo.txt';</userinput>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM FILE 'file:///home/aaron/foo.txt';</userinput>
</screen>
</para>
<para>
Note that if the FlumeBase server is on a different machine than the
client, this will read from <filename>/home/aaron/foo.txt</filename>
on the FlumeBase server -- not the client.
</para>
<para>
The <literal>EVENT FORMAT</literal> clause specifies how the bytes
inside an event should be interpreted. By default, rtsql uses the
<literal>delimited</literal> event format. Events are assumed to
contain UTF-8 text representations of each field, separated by commas.
</para>
<para>
By specifying an <literal>EVENT FORMAT</literal>, you can choose which
parser to apply to each event. The event format is specified as a
<literal>'quoted string'</literal>. The next few subsections define
the available event formats.
</para>
<para>
You can further control the behavior of the event parser
by specifying (key, value) pairs in the <literal>PROPERTIES</literal>
section. The keys recognized are specific to each event format. Keys
and values are both single-quoted strings.
</para>
<section id="stream.timestamp.col">
<title>Designated timestamp columns</title>
<para>
When reading a stream from a file, there is no Flume timestamp to
associate with each event. By default, as FlumeBase reads each line of
the file, it associates the current system timestamp with the event
generated for that line. This can be overridden by specifying
the <constant>timestamp.col</constant> property in the
<literal>PROPERTIES</literal> section of the <literal>CREATE
STREAM</literal> statement. The <constant>timestamp.col</constant>
must refer to a column of type <type>TIMESTAMP</type>. If the
timestamp value for an event is null, the current system timestamp
will be used instead.
</para>
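<para>
For example, assuming a file whose first field holds a timestamp, a
hypothetical stream definition designating that column might look like:
<screen>
rtsql> <userinput>CREATE STREAM events(ts TIMESTAMP, msg STRING) FROM LOCAL FILE</userinput>
-> <userinput>'/path/to/events.txt' EVENT FORMAT 'delimited'</userinput>
-> <userinput>PROPERTIES ('timestamp.col' = 'ts');</userinput>
</screen>
</para>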
</section>
<section>
<title>The <literal>delimited</literal> event format</title>
<para>
The <literal>delimited</literal> event format allows FlumeBase to
interpret events consisting of UTF-8 encoded text. Individual fields
are expected to be separated by commas. All values are expected to
be converted to text. <type>BINARY</type> columns are created as
the bytes holding a UTF-8 encoded string (which was terminated by
the field delimiter).
</para>
<para>
The delimiter character is controlled by the
<constant>delimiter</constant> property. You may set this to any
other character; for example, a pipe character:
<screen>
rtsql> <userinput>CREATE STREAM x(a int, b int) FROM LOCAL FILE 'foo.txt'</userinput>
-> <userinput>EVENT FORMAT 'delimited' PROPERTIES ('delimiter' = '|');</userinput>
</screen>
</para>
<para>
Nullable integer, timestamp, etc. fields are regarded as null if the
field is an empty string (i.e., two delimiters occur in a row). A
column of type <type>STRING</type> of zero length will be an empty
string. NULL string values are, by default, indicated by the
sequence <literal>\N</literal>. This sequence can be overridden by
any other string with the <constant>null.sequence</constant>
property.
</para>
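<para>
For example, a hypothetical stream over pipe-delimited data whose producer
writes the word <literal>NULL</literal> for missing values could be
declared as:
<screen>
rtsql> <userinput>CREATE STREAM y(a INT, b STRING) FROM LOCAL FILE 'bar.txt'</userinput>
-> <userinput>EVENT FORMAT 'delimited' PROPERTIES ('delimiter' = '|',</userinput>
-> <userinput>'null.sequence' = 'NULL');</userinput>
</screen>
With this definition, an input line <literal>3|NULL</literal> yields an
event whose <literal>b</literal> column is null, while
<literal>3|</literal> yields an empty string.
</para>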
</section>
<section>
<title>The <literal>avro</literal> event format</title>
<para>
FlumeBase can interpret events which contain a single serialized avro
record as a collection of fields. The event is assumed to be in the
Avro binary encoding format. You must specify the
<constant>schema</constant> property to describe the expected
encoding schema. (This is in addition to the normal column
definition section of the <literal>CREATE STREAM</literal>
statement.) The schema is expected to be a single Avro record (with
any name) which contains a set of fields; these fields must have the
correct avro types (<literal>"string"</literal>,
<literal>"long"</literal>, etc.) to match the expected rtsql types
(<type>STRING NOT NULL</type>, <type>BIGINT NOT NULL</type>, etc.).
A nullable type (e.g., <type>STRING</type>) is expressed as
an avro union of <literal>["string", "null"]</literal>.
</para>
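<para>
For example, a hypothetical stream of Avro-encoded records with a nullable
string and a non-null long might be declared as follows (the node name is
illustrative, and the schema string must match your actual record layout):
<screen>
rtsql> <userinput>CREATE STREAM users(name STRING, id BIGINT NOT NULL) FROM NODE 'weblogs'</userinput>
-> <userinput>EVENT FORMAT 'avro' PROPERTIES ('schema' =</userinput>
-> <userinput>'{"type": "record", "name": "users", "fields": [</userinput>
-> <userinput>{"name": "name", "type": ["string", "null"]},</userinput>
-> <userinput>{"name": "id", "type": "long"}]}');</userinput>
</screen>
</para>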
</section>
<section>
<title>The <literal>regex</literal> event format</title>
<para>
Another text-based event format, this format allows you to specify
a regular expression, the groups of which are extracted as the columns.
Each event is a single line of UTF-8 encoded text. The
<constant>regex</constant> property is required. This should define
as many binding groups (with <literal>(parentheses)</literal>) as
columns are specified in the stream definition. The
<constant>null.sequence</constant> property applies to this format
as well.
</para>
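<para>
For example, a hypothetical stream over a log file with lines of the form
<literal>WARN: disk full</literal> could bind a level and a message column
with two groups:
<screen>
rtsql> <userinput>CREATE STREAM logs(level STRING, message STRING) FROM LOCAL FILE</userinput>
-> <userinput>'/var/log/app.log' EVENT FORMAT 'regex'</userinput>
-> <userinput>PROPERTIES ('regex' = '(\w+): (.*)');</userinput>
</screen>
</para>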
</section>
<section id="create.as.select">
<title><literal>CREATE STREAM AS SELECT</literal></title>
<para>
One of the most powerful uses of rtsql is as an inline processor of
Flume events. The output of a FlumeBase flow can be used as a Flume
source for further downstream processing or data collection. Named
streams, defined by <literal>CREATE STREAM AS SELECT</literal>, will
cause the FlumeBase execution environment to host a Flume logical node
with the same name as the stream name. This logical node will
deliver to its sink all output events of the flow. The events will
be in binary-encoded Avro format: a record with the same name as the
stream, with field names equal to the display names of each select
expression.
</para>
<para>
By default, the <literal>null</literal> sink is used for the logical
node created by this syntax. You should use the Flume shell to
reconfigure the logical node to deliver this output to other
required sinks.
</para>
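<para>
For example, a named stream built from the Quick Start's
<literal>data</literal> stream might be created as follows; FlumeBase would
then host a logical node named <literal>bignumbers</literal>, broadcasting
the flow's output as Avro-encoded events:
<screen>
rtsql> <userinput>CREATE STREAM bignumbers AS SELECT name, luckynumber FROM data</userinput>
-> <userinput>WHERE luckynumber > 100;</userinput>
</screen>
</para>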
</section>
</section>
<section>
<title><literal>DROP STREAM</literal></title>
<para>
The <literal>DROP STREAM</literal> statement removes a stream
definition created by <literal>CREATE STREAM</literal>.
</para>
<programlisting>
DROP STREAM <userinput>stream_name</userinput>
</programlisting>
<para>
When dropping a stream created in terms of a flow (<literal>CREATE
STREAM AS SELECT</literal>), this statement decommissions the Flume
logical node and drops the stream identifier, but does not cancel the
flow itself. See <xref linkend="controlling.flows" /> for more
information on cancelling flows.
</para>
</section>
<section>
<title><literal>SHOW STREAMS</literal></title>
<para>
The <literal>SHOW STREAMS</literal> statement shows the definitions
of all streams.
</para>
</section>
<section>
<title><literal>SHOW FUNCTIONS</literal></title>
<para>
The <literal>SHOW FUNCTIONS</literal> statement shows the definitions
of all functions which may be applied to expressions in a statement.
The output of this command is a list of functions and their types.
Types are written in the form <literal>((input_types) ->
output_type)</literal>.
<screen>
rtsql> <userinput>SHOW FUNCTIONS;</userinput>
length ((STRING) -> INT)
...
</screen>
</para>
<para>
The <function>length</function> function may take a
<type>STRING</type> or <type>NULL</type> value, and returns an
<type>INT</type> (or <type>NULL</type>, if the input was
<type>NULL</type>).
</para>
<para>
Some functions are polymorphic -- their input types are flexible,
subject to certain constraints, and their output types may or may not
match their input types. For example, the <function>sum</function>
function can operate over any numeric type:
<screen>
rtsql> <userinput>SHOW FUNCTIONS;</userinput>
sum ((var('a, constraints={TYPECLASS_NUMERIC})) -> var('a, constraints={TYPECLASS_NUMERIC}))
...
</screen>
</para>
<para>
The input argument’s type is <literal>var('a, constraints={
TYPECLASS_NUMERIC})</literal>. This is a type variable with the name
<literal>'a</literal> (pronounced "alpha"), and can take any type subject to the
constraint that it is in the typeclass "<type>numeric</type>" -- that is, it is one
of <type>INT</type>, <type>BIGINT</type>, <type>FLOAT</type>, or
<type>DOUBLE</type>. It is an error to take the sum of a
<type>STRING</type> or <type>BOOLEAN</type> column.
</para>
<para>
The output argument is the same type variable "alpha;" whatever type
is used for the input will also be used as the output type. For
more information on polymorphic types, see <xref
linkend="polymorphic" />.
</para>
</section>
<section>
<title><literal>DESCRIBE</literal></title>
<para>
The <literal>DESCRIBE</literal> statement shows the definition of a
single object in rtsql:
</para>
<programlisting>
DESCRIBE <userinput>identifier</userinput>
</programlisting>
<para>
This may be used to inspect a single stream, function, or other entity
present in the symbol table.
</para>
<para>
The following statement displays the argument and return types for the
<function>length</function> function:
<screen>
rtsql> <userinput>DESCRIBE length;</userinput>
length ((STRING) -> INT)
</screen>
</para>
</section>
<section>
<title><literal>EXPLAIN</literal></title>
<para>
The <literal>EXPLAIN</literal> statement shows the execution plan
for an rtsql statement:
</para>
<programlisting>
EXPLAIN <userinput>statement</userinput>
</programlisting>
<para>
This may be used to inspect the operation of any rtsql statement.
The output of the command is a text description of how the statement
was parsed (in a tree-based representation), followed by a control-flow
graph of the steps applied in the runtime environment to satisfy
the query.
</para>
<screen>
rtsql> <userinput>EXPLAIN SELECT x FROM foo;</userinput>
</screen>
</section>
</section>
<section>
<title><literal>SELECT</literal> statements</title>
<para>
The <literal>SELECT</literal> statement returns an event stream
computed in terms of one or more existing event streams.
</para>
<programlisting>
select_statement ::= SELECT select_expr, select_expr ... FROM stream_reference
[ JOIN stream_reference ON join_expr OVER range_expr, JOIN ... ]
[ WHERE where_condition ]
[ GROUP BY column_list ]
[ OVER range_expr ]
[ HAVING having_condition ]
[ WINDOW <userinput>window_name</userinput> AS ( range_expr ), WINDOW ... ]
</programlisting>
<para>
A simple <literal>SELECT</literal> statement can return all events in
a stream:
<screen>
rtsql> <userinput>SELECT * FROM foo;</userinput>
</screen>
</para>
<para>
It can also return only a specific subset of fields from the
underlying stream:
<screen>
rtsql> <userinput>SELECT a, b, d FROM foo;</userinput>
</screen>
</para>
<para>
In addition to referencing specific fields, mathematical expressions
may be calculated as well:
<screen>
rtsql> <userinput>SELECT 2 * a + 3 FROM foo;</userinput>
</screen>
</para>
<para>
The following table lists all available operators. Operators at one
level of the table have higher priority than operators in a lower row
of the table. Operators of the same priority are applied
left-to-right. Parentheses can be used to override precedence. (This
is the same precedence order as used by Java, for the subset of Java
operators supported by rtsql.)
</para>
<table><caption>Operator precedence rules in rtsql</caption>
<thead>
<tr><td>Operator class</td><td>Operators</td>
</tr>
</thead>
<tbody>
<tr><td>unary null operators:</td>
<td><literal>IS NULL</literal>, <literal>IS NOT NULL</literal></td></tr>
<tr><td>unary operators:</td>
<td><literal>+ - NOT</literal></td></tr>
<tr><td>multiplicative:</td>
<td><literal>* / %</literal></td></tr>
<tr><td>additive:</td>
<td><literal>+ -</literal></td></tr>
<tr><td>comparison:</td>
<td><literal>&gt; &lt; &gt;= &lt;=</literal></td></tr>
<tr><td>equality:</td>
<td><literal>= !=</literal></td></tr>
<tr><td>logical conjunction:</td>
<td><literal>AND</literal></td></tr>
<tr><td>logical disjunction:</td>
<td><literal>OR</literal></td></tr>
<tr><td>function call:</td>
<td><literal>f(e1, e2, e3...)</literal></td></tr>
<tr><td>identifiers and constants:</td><td><literal>x 42 'hello!'</literal></td></tr>
</tbody>
</table>
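<para>
For example, since multiplicative operators bind more tightly than
additive ones, the following two statements (over a hypothetical
stream <literal>foo</literal>) differ; the first selects 14 for each
input event, the second 20:
<screen>
rtsql> <userinput>SELECT 2 + 3 * 4 FROM foo;</userinput>
rtsql> <userinput>SELECT (2 + 3) * 4 FROM foo;</userinput>
</screen>
</para>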
<para>
Each selected expression may have an alias associated with it:
<screen>
rtsql> <userinput>SELECT 2 * a AS doubled FROM foo;</userinput>
</screen>
</para>
<para>
The <literal>AS</literal> keyword itself is optional.
</para>
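<para>
For example, the following is equivalent to the previous statement:
<screen>
rtsql> <userinput>SELECT 2 * a doubled FROM foo;</userinput>
</screen>
</para>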
<para>
Aliases are particularly useful in the context of nested
<literal>SELECT</literal> statements:
<screen>
rtsql> <userinput>SELECT doubled FROM (SELECT 2 * a AS doubled FROM foo)</userinput>
-> <userinput>AS q WHERE doubled > 4;</userinput>
</screen>
</para>
<para>
rtsql does not support the <literal>DISTINCT</literal> or
<literal>ALL</literal> keywords; every query is implicitly
"<literal>SELECT ALL</literal>."
</para>
<section>
<title>Stream references</title>
<programlisting>
stream_reference ::= (<userinput>stream_name</userinput> | select_statement) [[AS] <userinput>ref_name</userinput>]
</programlisting>
<para>
The <literal>stream_reference</literal> in a <literal>SELECT</literal>
statement may literally identify a stream:
<screen>
rtsql> <userinput>CREATE STREAM foo (x string) FROM ...;</userinput>
CREATE STREAM
rtsql> <userinput>SELECT * FROM foo;</userinput>
...
</screen>
</para>
<para>
You may also qualify column names with their stream name:
<screen>
rtsql> <userinput>SELECT foo.x FROM foo;</userinput>
</screen>
</para>
<para>
And you may provide a reference name (<literal>ref_name</literal>)
that is different than the stream name:
<screen>
rtsql> <userinput>SELECT v.x FROM verylongname AS v;</userinput>
</screen>
</para>
<para>
The <literal>AS</literal> keyword is optional. This is equivalent to:
<screen>
rtsql> <userinput>SELECT v.x FROM verylongname v;</userinput>
</screen>
</para>
<para>
A <literal>stream_reference</literal> may also be a nested
<literal>SELECT</literal> statement.
<screen>
rtsql> <userinput>SELECT length(x) FROM (SELECT x FROM foo) AS f;</userinput>
</screen>
</para>
<para>
Each nested <literal>SELECT</literal> statement must be given a
<literal>ref_name</literal> alias (<userinput>f</userinput> in the
previous example). You do not need to qualify individual column names
with the <literal>ref_name</literal> unless the column name would
otherwise be ambiguous (e.g., if two sources are joined, and they each
contain a column named <userinput>x</userinput>, then all references
to <userinput>x</userinput> must be qualified with the source
<literal>ref_name</literal>).
</para>
</section>
<section>
<title><literal>WHERE</literal> clauses</title>
<programlisting>
where_clause ::= WHERE bool_expr
</programlisting>
<para>
A <literal>SELECT</literal> statement may filter some input events,
and emit output events corresponding only to input events that match a
boolean predicate.
<screen>
rtsql> <userinput>SELECT x FROM foo WHERE length(x) > 5;</userinput>
</screen>
</para>
<para>
This may be a compound boolean expression (using the
<literal>AND</literal> and <literal>OR</literal> operators). rtsql
does not support the <literal>IN</literal> or
<literal>EXISTS</literal> operators. Subqueries are also not
permitted in a <literal>WHERE</literal> clause.
</para>
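<para>
For example, the following combines two predicates with
<literal>AND</literal>:
<screen>
rtsql> <userinput>SELECT x FROM foo WHERE x IS NOT NULL AND length(x) > 5;</userinput>
</screen>
</para>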
</section>
<section id="select.join.clause">
<title><literal>JOIN</literal> clauses</title>
<programlisting>
join_clause ::= JOIN stream_reference ON join_expr OVER range_expr
</programlisting>
<para>
A <literal>SELECT</literal> statement may correlate events from
multiple sources and operate on their joined representation. In
table-based SQL systems, any row of one table may be joined with any
row of another table in a <literal>JOIN</literal> clause. Since FlumeBase
operates over potentially infinite streams of data, this model would
not scale. Instead, <literal>JOIN</literal> clauses require a window
clause which defines the time-based boundaries within which a join may
occur.
</para>
<para>
The only join expression supported is an equi-join; the