<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en"
xmlns:xlink="http://www.w3.org/1999/xlink">
<!--
Licensed to Odiago, Inc. under one or more contributor license
agreements. See the NOTICE.txt file distributed with this work for
additional information regarding copyright ownership. Odiago, Inc.
licenses this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License.
-->
<title>FlumeBase User Guide</title>
<subtitle>version <?eval ${project.version} ?></subtitle>
<section>
<title>Introduction</title>
<para>
FlumeBase is a database-inspired stream processing system built on top
of <productname>Flume</productname>. This system allows users to
dynamically insert queries into a data collection environment and
inspect the stream of events being collected by Flume. These
queries may spot-check incoming data, or specify persistent
monitoring, data transformation, or quality filtering tasks.
Queries are written in a SQL-like language called "rtsql."
</para>
<para>
FlumeBase can present data back to users of an interactive shell environment.
It can also be configured to deliver streams of output events back into a
Flume network, for consumption by other tools or persistence in HBase, HDFS,
or other storage media.
</para>
<para>
The emphasis of this system is on low-latency analysis of
incoming data being captured by Flume. The name "rtsql"
(FlumeBase's query language)
underscores the real-time nature of the query system, as well as
the SQL-based origin of the query language syntax. It is hoped
that FlumeBase will allow you to perform useful in-line data
transformation or filtering, or time-sensitive alerting or
tuning of a broader system, before subjecting the data being
captured by <productname>Flume</productname> to a deeper (but
perhaps higher-latency) analysis with other tools such as
<productname>Hadoop MapReduce</productname>.
</para>
<warning>
<para>
FlumeBase is an EXPERIMENTAL system! This is in no way ready
for production use. Use this AT YOUR OWN RISK. Connecting
this system to production Flume nodes may result in data
loss, misconfiguration, or other serious problems.
</para>
</warning>
<para>
This document explains how to install and configure the FlumeBase
system. It then explains the rtsql language, used to submit
queries to the runtime environment, and the commands used to
control the terminal client itself. This document is intended
for:
<itemizedlist>
<listitem>System administrators</listitem>
<listitem>Data analysts</listitem>
<listitem>Data engineers</listitem>
</itemizedlist>
</para>
</section>
<section>
<title>Quick Start</title>
<para>
For those who understand Flume and SQL, and just want to see a demo of what
can be done with FlumeBase, follow the steps in this section. This is a
five-minute tour of the FlumeBase world.
</para>
<para>
First, copy the following text into a file named <filename>data.txt</filename>.
</para>
<programlisting>
1,aaron,purple,42
2,bob,blue,11
3,cindy,green,312
</programlisting>
<para>
Install Flume 0.9.3, Hadoop 0.20, and Java 6. If you are running Cloudera's
Distribution of Hadoop 3 beta 4 (CDH3B4), you have already installed all
of these. Users of older versions of these products will need to upgrade.
See <xref linkend="installation" /> for more thorough installation
instructions.
</para>
<para>
Unzip the FlumeBase installation:
</para>
<programlisting>
$ <userinput>tar vzxf flumebase-(version).tar.gz</userinput>
</programlisting>
<para>
Start the FlumeBase shell:
</para>
<programlisting>
$ <userinput>cd flumebase-(version)/</userinput>
$ <userinput>bin/flumebase shell</userinput>
</programlisting>
<para>
By default, FlumeBase is configured with a self-contained environment that
embeds the FlumeBase server and Flume itself within the same process as
the shell. Now let's define a stream over the file, and query it.
</para>
<programlisting>
rtsql> <userinput>CREATE STREAM data(id int, name string, favcolor string,</userinput>
-> <userinput>luckynumber int) FROM LOCAL SOURCE 'tail("/path/to/data.txt")';</userinput>
CREATE STREAM
rtsql> <userinput>SELECT * FROM data;</userinput>
</programlisting>
<para>
You created a stream that operates over a local (self-hosted) Flume logical node,
which reads all the lines from <filename>data.txt</filename>. You then ran
a query that extracts all fields from each event in the stream. Each line
of the file corresponds to a different event.
</para>
<para>
In another terminal, now execute the following:
</para>
<programlisting>
$ <userinput>echo 4,dave,orange,611 >> /path/to/data.txt</userinput>
</programlisting>
<para>
You should observe that as soon as Flume detects the new record (about a second's
delay), it will be passed along to FlumeBase and emitted on your console.
</para>
<para>
The submitted query has created a "flow," which runs as long as we allow it.
If more data were to enter Flume via that file, we would continue to process
it. Now, let's cancel that flow:
</para>
<programlisting>
rtsql> <userinput>\d 1</userinput>
</programlisting>
<para>
(As FlumeBase decommissions the internal logical node, there may be an error
emitted by Flume itself; this is normal. In general, running in a single
process will be "noisy" because of both client and server activity condensed
to a single console. For a cleaner session experience, run the server and
client in separate processes; see <xref linkend="installation"/> for
instructions.)
</para>
<para>
And now let's run another query:
</para>
<programlisting>
rtsql> <userinput>SELECT favcolor FROM data WHERE luckynumber = 42;</userinput>
</programlisting>
<para>
After a few seconds, this flow is initialized with the data in the Flume
logical node. Note that we get only one row out of our original data set.
If you add more lines to the file which add events where the <literal>luckynumber</literal>
column is <constant>42</constant>, you'll see them appear in the FlumeBase
console.
</para>
<para>
This concludes our tour. To quit the FlumeBase shell, run:
</para>
<programlisting>
rtsql> <userinput>\q</userinput>
</programlisting>
<para>
The remaining sections of this user guide will describe multi-process configuration,
the rtsql language, and shell operation in greater detail. Good luck!
</para>
</section>
<section id="installation">
<title>Installation</title>
<section>
<title>Prerequisites</title>
<para>
FlumeBase requires a few prerequisites before it can be run on your machine:
</para>
<itemizedlist>
<listitem><productname>Java</productname> 6.0</listitem>
<listitem><productname>Hadoop</productname> 0.20</listitem>
<listitem><productname>Flume</productname> 0.9.3</listitem>
</itemizedlist>
<para>
Java can be obtained from <link
xlink:href="http://www.oracle.com/technetwork/java/index.html">http://www.oracle.com/technetwork/java/index.html</link>.
The Java 6.0 SE JRE (or JDK) is required. Java downloads and installation
instructions can be found on Oracle's web site.
</para>
<para>
The other two prerequisites can be installed from <productname>Cloudera's
Distribution for Hadoop</productname>, version 3-beta-4 (CDH3b4) or
newer. See
<link xlink:href="http://archive.cloudera.com">http://archive.cloudera.com</link>
for instructions on downloading and installing <productname>Cloudera's
Distribution for Hadoop</productname>.
</para>
<para>
While FlumeBase is written in <productname>Java</productname> and thus
should be portable across a wide variety of operating systems, testing
has only been performed under a Linux environment. It is likely to work
under cygwin and OS X as well, but no guarantees are made.
</para>
<para>
The following prerequisite knowledge is required to understand
this documentation:
<itemizedlist>
<listitem>Basic computer technology and terminology</listitem>
<listitem>Familiarity with command-line interfaces such as
<literal>bash</literal></listitem>
<listitem>Prior understanding of Flume's operation and purpose</listitem>
<listitem>Prior exposure to SQL is recommended</listitem>
</itemizedlist>
</para>
</section>
<section>
<title>Program installation</title>
<para>
FlumeBase itself is distributed as a tar file. Install FlumeBase by unzipping
the tar file:
<screen>
$ <userinput>tar vzxf flumebase-(version).tar.gz</userinput>
</screen>
</para>
<para>
This will expand to a directory called
<filename>flumebase-(version)/</filename>.
</para>
</section>
</section>
<section>
<title>Configuration</title>
<para>
By default, FlumeBase is configured to run in a single process
combining both the interactive shell, and the execution engine.
Terminating the shell will also terminate the execution
environment, including all running queries. This is most useful
for evaluating FlumeBase. For more serious use, the execution
environment should be run in a persistent process on a server.
Clients should be configured to connect to this server, or users
should be instructed to explicitly do so.
</para>
<para>
To enable zero-configuration evaluation of FlumeBase, the
FlumeBase process also hosts an embedded Flume master node. To
interact with existing streaming data sources, it should instead
be reconfigured to point to an existing Flume deployment.
</para>
<section>
<title>Server configuration</title>
<para>
Install FlumeBase on a server where the query execution engine should
be run. Then edit the <filename>etc/flumebase-site.xml</filename> file
to contain the following values:
</para>
<table>
<caption>Configuration settings for FlumeBase servers</caption>
<thead>
<tr><td>Property</td><td>Value</td></tr>
</thead>
<tbody>
<tr><td><constant>flume.home</constant></td>
<td>The path to $FLUME_HOME on your server.</td></tr>
<tr><td><constant>flumebase.remote.port</constant></td>
<td>The port where the FlumeBase server listens for clients.</td></tr>
<tr><td><constant>embedded.flume.master</constant></td>
<td>This should be set to <constant>false</constant> if a Flume
master is available. A value of <constant>true</constant> means
that the FlumeBase environment acts as its own Flume master, separate
from an existing Flume network.</td></tr>
<tr><td><constant>flumebase.flume.master.host</constant></td>
<td>The hostname of the foreign Flume master to connect to.</td></tr>
<tr><td><constant>flumebase.flume.master.port</constant></td>
<td>The port the foreign Flume master listens on.</td></tr>
<tr><td><constant>flumebase.flume.collector.port.min/max</constant></td>
<td>FlumeBase uses Flume collectors to receive data from the broader
Flume network. Set <constant>...port.min</constant> and
<constant>...port.max</constant> to the range of ports on the
FlumeBase server which the FlumeBase daemon may use for this purpose.</td></tr>
</tbody>
</table>
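<para>
As a concrete illustration, a server's
<filename>etc/flumebase-site.xml</filename> pointing at an existing Flume
master might contain entries like the following, using the same
Hadoop-style property format as <filename>flume-site.xml</filename>. The
host name and port value shown here are placeholders only; substitute the
address of your own Flume master:
<programlisting>
<property>
<name>embedded.flume.master</name>
<value>false</value>
</property>
<property>
<name>flumebase.flume.master.host</name>
<value>flume-master.example.com</value>
</property>
<property>
<name>flumebase.flume.master.port</name>
<value>9090</value>
</property>
</programlisting>
</para>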
<para>
Finally, to run in distributed mode, the Flume master node needs to
register the FlumeBase plugin. You should copy the
<filename>flumebase-(version).jar</filename> file from the FlumeBase
installation into <filename>/usr/lib/flume/lib</filename> on the Flume
master machine. Then edit <filename>flume-site.xml</filename> on
the master to include the setting:
<programlisting>
<property>
<name>flume.plugin.classes</name>
<value>com.odiago.flumebase.flume.FlumePlugin</value>
</property>
</programlisting>
</para>
<para>
You may need to restart the Flume master process for this to take effect.
</para>
<para>
After a server is configured, you may start a server instance by running:
<literal>bin/flumebase server</literal> from the directory where FlumeBase
was installed. To shut down a running server, see <xref
linkend="flumebase.client.connecting" />. Killing a server process with
<literal>^C</literal> is not recommended.
</para>
</section>
<section>
<title>Client configuration</title>
<para>
Install a copy of FlumeBase on every client machine where users intend to
submit queries to the FlumeBase system. The client must be able to open a
TCP connection to the FlumeBase server. In order to view output events on
the FlumeBase console, the server must be able to open a TCP connection
back to the client.
</para>
<para>
Set the following settings in <filename>etc/flumebase-site.xml</filename>
on the client machine:
</para>
<table>
<caption>Configuration settings for FlumeBase clients</caption>
<thead>
<tr><td>Property</td><td>Value</td></tr>
</thead>
<tbody>
<tr><td><constant>flume.home</constant></td>
<td>The path to $FLUME_HOME on the client.</td></tr>
<tr><td><constant>flumebase.autoconnect</constant></td>
<td>The host:port of the FlumeBase server to connect to. If set
to <constant>local</constant>, this will use an in-process server.
If set to <constant>none</constant>, the user must explicitly open
a server connection with <userinput>\open</userinput> in the
console.</td></tr>
<tr><td><constant>flumebase.flow.autowatch</constant></td>
<td>Defaults to <constant>true</constant>; this boolean property
specifies whether you want every query to automatically send its
output to the console when submitted. If false, you must explicitly
watch flow output with the <userinput>\watch</userinput> command.
</td></tr>
<tr><td><constant>flumebase.console.port</constant></td>
<td>FlumeBase uses a Thrift RPC connection to relay query output back to
the client. The client listens on the port specified by this
property.</td></tr>
</tbody>
</table>
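<para>
For example, a client that should automatically connect to a shared server
might set the following. The host and port shown are placeholders; use your
own server's address and its configured
<constant>flumebase.remote.port</constant>:
<programlisting>
<property>
<name>flumebase.autoconnect</name>
<value>flumebase-server.example.com:9292</value>
</property>
</programlisting>
</para>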
</section>
</section>
<section>
<title>Architecture</title>
<para>
The FlumeBase system is composed of a command-line client, a server called
the "execution environment," and the Flume system that collects and
transports data. These may be configured as separate, distributed
processes, or collocated on a single machine, or in a single process.
</para>
<para>
The command-line client is the simplest component in the product. This
process is run directly by a user (perhaps on a server, but more
often on her own desktop or laptop). The client connects to the execution
environment and provides the user with a prompt, where new
queries or control statements may be entered.</para>
<para>
Each query (i.e., <literal>SELECT</literal> statement) produces a
<emphasis>flow</emphasis> in the execution environment. The user may
subscribe to running flows (this is done automatically for new flows
created by the user). When a subscribed flow emits an output event, its
text is printed to the client terminal.
</para>
<para>
Closing the client does not terminate any submitted flows. These are
running in the <emphasis>execution environment</emphasis>, a separate
long-lived process which may be shared by multiple users. An execution
environment holds the definitions of all streams (created by
<literal>CREATE STREAM</literal> statements), and processes the running flows. The
execution environment is typically run on a dedicated server. For
evaluation purposes, it may also be hosted inside the same process as a
command-line client. (When the execution environment is embedded
in the client, terminating the client will terminate all running
flows, and discard knowledge of any streams.)
</para>
<para>
Submitted queries (flows) allow computation over
<emphasis>streams</emphasis> of data. Streams are defined as a set of
<emphasis>events</emphasis>, which are roughly analogous to "records" in a
table-based SQL environment. These events are directly linked to "events"
in Flume. Users define streams before querying them; these definitions
specify the fields within each event, how to parse the event body into the
fields, and where the stream originates. Each flow is itself a stream;
its output is also a series of events, based on the computations specified
by the user and the set of input events the flow operates over.
</para>
<para>
By default, queries submitted by users result in anonymous flows, which
deliver their outputs only to the subscribed client instances. These flows
continue to operate while no users are subscribed, but output events
generated when no users are subscribed are simply dropped (there is no way
to retrieve them later).
</para>
<para>
Users can bind a name to a running flow (or do so when submitting a flow
with the <literal>CREATE STREAM AS SELECT</literal> syntax). This name is
used as the name for a Flume logical node, which broadcasts the output of the
flow as a set of Avro-encoded events. Users may then use the Flume shell
to configure this logical node to direct a copy of its output to a
monitoring application, persistent storage (such as HDFS), or elsewhere.
<xref linkend="create.as.select" /> describes the <literal>CREATE STREAM
AS SELECT</literal> syntax and its effects in greater detail. <xref
linkend="controlling.flows" /> describes how to manipulate flow names.
</para>
<para>
FlumeBase reads from a Flume network by modifying the sink
definitions of nodes specified with <literal>CREATE
STREAM</literal> statements. When a logical node is identified
as a stream source, its sink definition is rewritten as a
fan-out sink containing its original sink, and a new agent sink
which forwards the node's output to a collector source hosted
within the FlumeBase execution environment. (The FlumeBase execution
environment will host an embedded Flume physical node, which
then hosts logical nodes as necessary to receive and transmit
streams of events.) When a stream is gracefully dropped (by
issuing <userinput>DROP STREAM</userinput>, or by using
<userinput>\shutdown!</userinput> to shut down the execution
environment), the original logical node definition is restored
to the logical node which provided the data stream.
</para>
<para>
Interaction between a FlumeBase execution environment and Flume is performed
via the Flume master node's Thrift interface. The physical node hosted
within an execution environment is controlled by the Flume master node,
and is, for all intents and purposes, an ordinary Flume node. For this
reason, flows may take a few seconds to initialize (or cancel), as they
are dependent on Flume for aspects of their configuration. Once
initialized, flows should operate on events with low latency. If no
external Flume network is available, you can configure the FlumeBase
execution environment to host an embedded Flume master node, for
evaluation or single-machine computation purposes.
</para>
</section>
<section id="rtsql.language">
<title>The rtsql language</title>
<para>
Users interact with FlumeBase by submitting commands and queries
written in a language called rtsql.
The rtsql language is designed to allow on-going analysis of incoming
data. The language is similar to <productname>SQL:2003</productname>; its
syntax will be largely familiar to SQL experts. It also provides
<productname>SQL:2003</productname>-style <emphasis>windowed
operators</emphasis> which allow joining and aggregation over bounded
amounts of time.
</para>
<para>
In rtsql, all data is consumed through <emphasis>streams</emphasis>. The
FlumeBase architecture assumes that these streams cannot be replayed, and may
be of infinite length. Therefore, all operators such as <literal>GROUP
BY</literal> which can use “all the rows” as input are restricted so that
they can only use windowed views into the stream. rtsql does allow a
stream to be defined over a file. A <literal>SELECT</literal> statement
querying such a stream will read the data in-order in the file and then
terminate when it reaches the end of the file, but rtsql does not
currently have special provisions for working with these data sources
in a different fashion than Flume-based sources.
</para>
<para>
Keywords in rtsql are case-insensitive. Identifiers (stream, column,
function names, etc.) are translated to lower-case for their canonical
representation, unless they are <literal>"double-quoted"</literal> in
which case they are interpreted literally.
</para>
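<para>
For example, assuming the <literal>data</literal> stream from the Quick
Start, the identifiers <literal>FAVCOLOR</literal> and
<literal>favcolor</literal> both name the same column, so the following
two queries are equivalent; a double-quoted identifier such as
<literal>"FavColor"</literal> would instead be interpreted literally as a
distinct, mixed-case name:
<screen>
rtsql> <userinput>SELECT FAVCOLOR FROM DATA;</userinput>
rtsql> <userinput>SELECT favcolor FROM data;</userinput>
</screen>
</para>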
<section>
<title>DDL Commands</title>
<section>
<title><literal>CREATE STREAM</literal></title>
<para>
The <literal>CREATE STREAM</literal> statement will create a stream
definition which may be used in subsequent statements such as
<literal>SELECT</literal>.
</para>
<programlisting>
CREATE STREAM <userinput>stream_name</userinput> (<userinput>col_name</userinput> data_type [, ...])
FROM [LOCAL] {FILE | NODE | SOURCE} <userinput>input_spec</userinput>
[EVENT FORMAT format_spec [PROPERTIES (key = val, …)]]
CREATE STREAM <userinput>stream_name</userinput> AS select_statement
data_type ::= BOOLEAN | BINARY | BIGINT | INT | FLOAT | DOUBLE | PRECISE(int) | STRING | TIMESTAMP
format_spec ::= 'delimited' | 'regex' | 'avro'
</programlisting>
<para>
<xref linkend="types" /> describes the rtsql data types in greater
detail.
</para>
<para>
<literal>input_spec</literal> is a
<literal>'single-quoted-string'</literal> identifying the filename /
Flume logical node / Flume source specification to use as the input
for this stream.
</para>
<para>
File names are Hadoop <classname>Path</classname> objects; they may
specify the complete URI to a file, using any protocol permitted by
the Hadoop common library, e.g.:
<screen>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM FILE</userinput>
-> <userinput>'hdfs://nn.example.com/user/aaron/foo.txt';</userinput>
</screen>
</para>
<para>
Unqualified file names are interpreted relative to the value of the
<constant>fs.default.name</constant> configuration parameter. For
example, if this were set to
<userinput>'hdfs://nn.example.com'</userinput>, the following
definition would be equivalent to the previous one:
<screen>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM FILE '/user/aaron/foo.txt';</userinput>
</screen>
</para>
<para>
Using the <literal>LOCAL</literal> keyword will cause the source
definition to be interpreted relative to the local filesystem of the
FlumeBase server. The following two statements are equivalent:
<screen>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM LOCAL FILE '/home/aaron/foo.txt';</userinput>
rtsql> <userinput>CREATE STREAM foo (x STRING) FROM FILE 'file:///home/aaron/foo.txt';</userinput>
</screen>
</para>
<para>
Note that if the FlumeBase server is on a different machine than the
client, this will read from <filename>/home/aaron/foo.txt</filename>
on the FlumeBase server -- not the client.
</para>
<para>
The <literal>EVENT FORMAT</literal> clause specifies how the bytes
inside an event should be interpreted. By default, rtsql uses the
<literal>delimited</literal> event format. Events are assumed to
contain UTF-8 text representations of each field, separated by commas.
</para>
<para>
By specifying an <literal>EVENT FORMAT</literal>, you can choose which
parser to apply to each event. The event format is specified as a
<literal>'quoted string'</literal>. The next few subsections define
the available event formats.
</para>
<para>
You can further control the behavior of the event parser
by specifying (key, value) pairs in the <literal>PROPERTIES</literal>
section. The keys recognized are specific to each event format. Keys
and values are both single-quoted strings.
</para>
<section id="stream.timestamp.col">
<title>Designated timestamp columns</title>
<para>
When reading a stream from a file, there is no Flume timestamp to
associate with each event. By default, as FlumeBase reads each line of
the file, it associates the current system timestamp with the event
generated for that line. This can be overridden by specifying
the <constant>timestamp.col</constant> property in the
<literal>PROPERTIES</literal> section of the <literal>CREATE
STREAM</literal> statement. The <constant>timestamp.col</constant>
must refer to a column of type <type>TIMESTAMP</type>. If the
timestamp value for an event is null, the current system timestamp
will be used instead.
</para>
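<para>
For example, assuming a file whose first field holds a timestamp, a
hypothetical stream definition designating that column might look like:
<screen>
rtsql> <userinput>CREATE STREAM events(ts TIMESTAMP, msg STRING) FROM LOCAL FILE</userinput>
-> <userinput>'/path/to/events.txt' EVENT FORMAT 'delimited'</userinput>
-> <userinput>PROPERTIES ('timestamp.col' = 'ts');</userinput>
</screen>
</para>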
</section>
<section>
<title>The <literal>delimited</literal> event format</title>
<para>
The <literal>delimited</literal> event format allows FlumeBase to
interpret events consisting of UTF-8 encoded text. Individual fields
are expected to be separated by commas. All values are expected to
be converted to text. <type>BINARY</type> columns are created as
the bytes holding a UTF-8 encoded string (which was terminated by
the field delimiter).
</para>
<para>
The delimiter character is controlled by the
<constant>delimiter</constant> property. You may set this to any
other character; for example, a pipe character:
<screen>
rtsql> <userinput>CREATE STREAM x(a int, b int) FROM LOCAL FILE 'foo.txt'</userinput>
-> <userinput>EVENT FORMAT 'delimited' PROPERTIES ('delimiter' = '|');</userinput>
</screen>
</para>
<para>
Nullable integer, timestamp, etc. fields are regarded as null if the
field is an empty string (i.e., two delimiters occur in a row). A
column of type <type>STRING</type> of zero length will be an empty
string. NULL string values are, by default, indicated by the
sequence <literal>\N</literal>. This sequence can be overridden by
any other string with the <constant>null.sequence</constant>
property.
</para>
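<para>
For example, a hypothetical stream over pipe-delimited data whose producer
writes the word <literal>NULL</literal> for missing values could be
declared as:
<screen>
rtsql> <userinput>CREATE STREAM y(a INT, b STRING) FROM LOCAL FILE 'bar.txt'</userinput>
-> <userinput>EVENT FORMAT 'delimited' PROPERTIES ('delimiter' = '|',</userinput>
-> <userinput>'null.sequence' = 'NULL');</userinput>
</screen>
With this definition, an input line <literal>3|NULL</literal> yields an
event whose <literal>b</literal> column is null, while
<literal>3|</literal> yields an empty string.
</para>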
</section>
<section>
<title>The <literal>avro</literal> event format</title>
<para>
FlumeBase can interpret events which contain a single serialized avro
record as a collection of fields. The event is assumed to be in the
Avro binary encoding format. You must specify the
<constant>schema</constant> property to describe the expected
encoding schema. (This is in addition to the normal column
definition section of the <literal>CREATE STREAM</literal>
statement.) The schema is expected to be a single Avro record (with
any name) which contains a set of fields; these fields must have the
correct avro types (<literal>"string"</literal>,
<literal>"long"</literal>, etc.) to match the expected rtsql types
(<type>STRING NOT NULL</type>, <type>BIGINT NOT NULL</type>, etc.).
A nullable type (e.g., <type>STRING</type>) is expressed as
an avro union of <literal>["string", "null"]</literal>.
</para>
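<para>
For example, a hypothetical stream of Avro-encoded records with a nullable
string and a non-null long might be declared as follows (the node name is
illustrative, and the schema string must match your actual record layout):
<screen>
rtsql> <userinput>CREATE STREAM users(name STRING, id BIGINT NOT NULL) FROM NODE 'weblogs'</userinput>
-> <userinput>EVENT FORMAT 'avro' PROPERTIES ('schema' =</userinput>
-> <userinput>'{"type": "record", "name": "users", "fields": [</userinput>
-> <userinput>{"name": "name", "type": ["string", "null"]},</userinput>
-> <userinput>{"name": "id", "type": "long"}]}');</userinput>
</screen>
</para>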
</section>
<section>
<title>The <literal>regex</literal> event format</title>
<para>
Another text-based event format, this format allows you to specify
a regular expression, the groups of which are extracted as the columns.
Each event is a single line of UTF-8 encoded text. The
<constant>regex</constant> property is required. This should define
as many binding groups (with <literal>(parentheses)</literal>) as
columns are specified in the stream definition. The
<constant>null.sequence</constant> property applies to this format
as well.
</para>
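<para>
For example, a hypothetical stream over a log file with lines of the form
<literal>WARN: disk full</literal> could bind a level and a message column
with two groups:
<screen>
rtsql> <userinput>CREATE STREAM logs(level STRING, message STRING) FROM LOCAL FILE</userinput>
-> <userinput>'/var/log/app.log' EVENT FORMAT 'regex'</userinput>
-> <userinput>PROPERTIES ('regex' = '(\w+): (.*)');</userinput>
</screen>
</para>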
</section>
<section id="create.as.select">
<title><literal>CREATE STREAM AS SELECT</literal></title>
<para>
One of the most powerful uses of rtsql is as an inline processor of
Flume events. The output of a FlumeBase flow can be used as a Flume
source for further downstream processing or data collection. Named
streams, defined by <literal>CREATE STREAM AS SELECT</literal>, will
cause the FlumeBase execution environment to host a Flume logical node
with the same name as the stream name. This logical node will
deliver to its sink all output events of the flow. The events will
be in binary-encoded Avro format: a record with the same name as the
stream, with field names equal to the display names of each select
expression.
</para>
<para>
By default, the <literal>null</literal> sink is used for the logical
node created by this syntax. You should use the Flume shell to
reconfigure the logical node to deliver this output to other
required sinks.
</para>
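<para>
For example, a named stream built from the Quick Start's
<literal>data</literal> stream might be created as follows; FlumeBase would
then host a logical node named <literal>bignumbers</literal>, broadcasting
the flow's output as Avro-encoded events:
<screen>
rtsql> <userinput>CREATE STREAM bignumbers AS SELECT name, luckynumber FROM data</userinput>
-> <userinput>WHERE luckynumber > 100;</userinput>
</screen>
</para>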
</section>
</section>
<section>
<title><literal>DROP STREAM</literal></title>
<para>
The <literal>DROP STREAM</literal> statement removes a stream
definition created by <literal>CREATE STREAM</literal>.
</para>
<programlisting>
DROP STREAM <userinput>stream_name</userinput>
</programlisting>
<para>
When dropping a stream created in terms of a flow (<literal>CREATE
STREAM AS SELECT</literal>), this statement decommissions the Flume
logical node and drops the stream identifier, but does not cancel the
flow itself. See <xref linkend="controlling.flows" /> for more
information on cancelling flows.
</para>
</section>
<section>
<title><literal>SHOW STREAMS</literal></title>
<para>
The <literal>SHOW STREAMS</literal> statement shows the definitions
of all streams.
</para>
</section>
<section>
<title><literal>SHOW FUNCTIONS</literal></title>
<para>
The <literal>SHOW FUNCTIONS</literal> statement shows the definitions
of all functions which may be applied to expressions in a statement.
The output of this command is a list of functions and their types.
Types are written in the form <literal>((input_types) ->
output_type)</literal>.
<screen>
rtsql> <userinput>SHOW FUNCTIONS;</userinput>
length ((STRING) -> INT)
...
</screen>
</para>
<para>
The <function>length</function> function may take a
<type>STRING</type> or <type>NULL</type> value, and returns an
<type>INT</type> (or <type>NULL</type>, if the input was
<type>NULL</type>).
</para>
<para>
Some functions are polymorphic -- their input types are flexible,
subject to certain constraints, and their output types may or may not
match their input types. For example, the <function>sum</function>
function can operate over any numeric type:
<screen>
rtsql> <userinput>SHOW FUNCTIONS;</userinput>
sum ((var('a, constraints={TYPECLASS_NUMERIC})) -> var('a, constraints={TYPECLASS_NUMERIC}))
...
</screen>
</para>
<para>
The input argument’s type is <literal>var('a, constraints={
TYPECLASS_NUMERIC})</literal>. This is a type variable with the name
<literal>'a</literal> (pronounced "alpha"), and can take any type subject to the
constraint that it is in the typeclass "<type>numeric</type>" -- that is, it is one
of <type>INT</type>, <type>BIGINT</type>, <type>FLOAT</type>, or
<type>DOUBLE</type>. It is an error to take the sum of a
<type>STRING</type> or <type>BOOLEAN</type> column.
</para>
<para>
The output argument is the same type variable "alpha;" whatever type
is used for the input will also be used as the output type. For
more information on polymorphic types, see <xref
linkend="polymorphic" />.
</para>
</section>
<section>
<title><literal>DESCRIBE</literal></title>
<para>
The <literal>DESCRIBE</literal> statement shows the definition of a
single object in rtsql:
</para>
<programlisting>
DESCRIBE <userinput>identifier</userinput>
</programlisting>
<para>
This may be used to inspect a single stream, function, or other entity
present in the symbol table.
</para>
<para>
The following statement displays the argument and return types for the
<function>length</function> function:
<screen>
rtsql> <userinput>DESCRIBE length;</userinput>
length ((STRING) -> INT)
</screen>
</para>
</section>
<section>
<title><literal>EXPLAIN</literal></title>
<para>
The <literal>EXPLAIN</literal> statement shows the execution plan
for an rtsql statement:
</para>
<programlisting>
EXPLAIN <userinput>statement</userinput>
</programlisting>
<para>
This may be used to inspect the operation of any rtsql statement.
The output of the command is a text description of how the statement
was parsed (in a tree-based representation), followed by a control-flow
graph of the steps applied in the runtime environment to satisfy
the query.
</para>
<screen>
rtsql> <userinput>EXPLAIN SELECT x FROM foo;</userinput>
</screen>
</section>
</section>
<section>
<title><literal>SELECT</literal> statements</title>
<para>
The <literal>SELECT</literal> statement returns an event stream
computed in terms of one or more existing event streams.
</para>
<programlisting>
select_statement ::= SELECT select_expr, select_expr ... FROM stream_reference
[ JOIN stream_reference ON join_expr OVER range_expr, JOIN ... ]
[ WHERE where_condition ]
[ GROUP BY column_list ]
[ OVER range_expr ]
[ HAVING having_condition ]
[ WINDOW <userinput>window_name</userinput> AS ( range_expr ), WINDOW ... ]
</programlisting>
<para>
A simple <literal>SELECT</literal> statement can return all events in
a stream:
<screen>
rtsql> <userinput>SELECT * FROM foo;</userinput>
</screen>
</para>
<para>
It can also return only a specific subset of fields from the
underlying stream:
<screen>
rtsql> <userinput>SELECT a, b, d FROM foo;</userinput>
</screen>
</para>
<para>
In addition to referencing specific fields, mathematical expressions
may be calculated as well:
<screen>
rtsql> <userinput>SELECT 2 * a + 3 FROM foo;</userinput>
</screen>
</para>
<para>
The following table lists all available operators. Operators at one
level of the table have higher priority than operators in a lower row
of the table. Operators of the same priority are applied
left-to-right. Parentheses can be used to override precedence. (This
is the same precedence order as used by Java, for the subset of Java
operators supported by rtsql.)
</para>
<table><caption>Operator precedence rules in rtsql</caption>
<thead>
<tr><td>Operator class</td><td>Operators</td>
</tr>
</thead>
<tbody>
<tr><td>unary null operators:</td>
<td><literal>IS NULL</literal>, <literal>IS NOT NULL</literal></td></tr>
<tr><td>unary operators:</td>
<td><literal>+ - NOT</literal></td></tr>
<tr><td>multiplicative:</td>
<td><literal>* / %</literal></td></tr>
<tr><td>additive:</td>
<td><literal>+ -</literal></td></tr>
<tr><td>comparison:</td>
<td><literal>&gt; &lt; &gt;= &lt;=</literal></td></tr>
<tr><td>equality:</td>
<td><literal>= !=</literal></td></tr>
<tr><td>logical conjunction:</td>
<td><literal>AND</literal></td></tr>
<tr><td>logical disjunction:</td>
<td><literal>OR</literal></td></tr>
<tr><td>function call:</td>
<td><literal>f(e1, e2, e3...)</literal></td></tr>
<tr><td>identifiers and constants:</td><td><literal>x 42 'hello!'</literal></td></tr>
</tbody>
</table>
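<para>
For example, since multiplicative operators bind more tightly than
additive ones, the following two statements (over a hypothetical
stream <literal>foo</literal>) differ; the first selects 14 for each
input event, the second 20:
<screen>
rtsql> <userinput>SELECT 2 + 3 * 4 FROM foo;</userinput>
rtsql> <userinput>SELECT (2 + 3) * 4 FROM foo;</userinput>
</screen>
</para>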
<para>
Each selected expression may have an alias associated with it:
<screen>
rtsql> <userinput>SELECT 2 * a AS doubled FROM foo;</userinput>
</screen>
</para>
<para>
The <literal>AS</literal> keyword itself is optional.
</para>
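<para>
For example, the following is equivalent to the previous statement:
<screen>
rtsql> <userinput>SELECT 2 * a doubled FROM foo;</userinput>
</screen>
</para>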
<para>
Aliases are particularly useful in the context of nested
<literal>SELECT</literal> statements:
<screen>
rtsql> <userinput>SELECT doubled FROM (SELECT 2 * a AS doubled FROM foo)</userinput>
-> <userinput>AS q WHERE doubled > 4;</userinput>
</screen>
</para>
<para>
rtsql does not support the <literal>DISTINCT</literal> or
<literal>ALL</literal> keywords; every query is implicitly
"<literal>SELECT ALL</literal>."
</para>
<section>
<title>Stream references</title>
<programlisting>
stream_reference ::= (<userinput>stream_name</userinput> | select_statement) [[AS] <userinput>ref_name</userinput>]
</programlisting>
<para>
The <literal>stream_reference</literal> in a <literal>SELECT</literal>
statement may literally identify a stream:
<screen>
rtsql> <userinput>CREATE STREAM foo (x string) FROM ...;</userinput>
CREATE STREAM
rtsql> <userinput>SELECT * FROM foo;</userinput>
...
</screen>
</para>
<para>
You may also qualify column names with their stream name:
<screen>
rtsql> <userinput>SELECT foo.x FROM foo;</userinput>
</screen>
</para>
<para>
And you may provide a reference name (<literal>ref_name</literal>)
that is different than the stream name:
<screen>
rtsql> <userinput>SELECT v.x FROM verylongname AS v;</userinput>
</screen>
</para>
<para>
The <literal>AS</literal> keyword is optional. This is equivalent to:
<screen>
rtsql> <userinput>SELECT v.x FROM verylongname v;</userinput>
</screen>
</para>
<para>
A <literal>stream_reference</literal> may also be a nested
<literal>SELECT</literal> statement.
<screen>
rtsql> <userinput>SELECT length(x) FROM (SELECT x FROM foo) AS f;</userinput>
</screen>
</para>
<para>
Each nested <literal>SELECT</literal> statement must be given a
<literal>ref_name</literal> alias (<userinput>f</userinput> in the
previous example). You do not need to qualify individual column names
with the <literal>ref_name</literal> unless the column name would
otherwise be ambiguous (e.g., if two sources are joined, and they each
contain a column named <userinput>x</userinput>, then all references
to <userinput>x</userinput> must be qualified with the source
<literal>ref_name</literal>).
</para>
</section>
<section>
<title><literal>WHERE</literal> clauses</title>
<programlisting>
where_clause ::= WHERE bool_expr
</programlisting>
<para>
A <literal>SELECT</literal> statement may filter some input events,
and emit output events corresponding only to input events that match a
boolean predicate.
<screen>
rtsql> <userinput>SELECT x FROM foo WHERE length(x) > 5;</userinput>
</screen>
</para>
<para>
This may be a compound boolean expression (using the
<literal>AND</literal> and <literal>OR</literal> operators). rtsql
does not support the <literal>IN</literal> or
<literal>EXISTS</literal> operators. Subqueries are also not
permitted in a <literal>WHERE</literal> clause.
</para>
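<para>
For example, the following combines two predicates with
<literal>AND</literal>:
<screen>
rtsql> <userinput>SELECT x FROM foo WHERE x IS NOT NULL AND length(x) > 5;</userinput>
</screen>
</para>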
</section>
<section id="select.join.clause">
<title><literal>JOIN</literal> clauses</title>
<programlisting>
join_clause ::= JOIN stream_reference ON join_expr OVER range_expr
</programlisting>
<para>
A <literal>SELECT</literal> statement may correlate events from
multiple sources and operate on their joined representation. In
table-based SQL systems, any row of one table may be joined with any
row of another table in a <literal>JOIN</literal> clause. Since FlumeBase
operates over potentially infinite streams of data, this model would
not scale. Instead, <literal>JOIN</literal> clauses require a window
clause which defines the time-based boundaries within which a join may
occur.
</para>
<para>
The only join expression supported is an equi-join; the