Jürgen Jakobitsch edited this page Aug 24, 2016 · 9 revisions
Website https://flume.apache.org/
Supported versions Apache Flume 1.6.0
Apache Zookeeper 3.5.2-alpha
Current responsible(s) Jürgen Jakobitsch @ SWC -- j.jakobitsch@semantic-web.at
Docker image(s) bde2020/flume:latest
More info https://flume.apache.org/FlumeUserGuide.html
https://flume.apache.org/FlumeDeveloperGuide.html
https://flume.apache.org/releases/content/1.6.0/apidocs/

Short description

Apache Flume is a distributed data acquisition framework used to collect and move or redistribute large amounts of data. It is based on pipelines that consist of a source, a channel and a sink. Such pipelines are set up with simple key-value configuration files, which are either read from the filesystem or stored in an Apache Zookeeper node. Multiple sources, channels and sinks are already available in the default distribution. These components, along with their documentation and configuration, are described in the Apache Flume User Guide linked above.
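To make the key-value format concrete, the following minimal sketch renders a configuration for one agent with a single source, channel and sink. It uses the netcat source, memory channel and logger sink from the default distribution. The function name render_agent_config and the component names r1, c1 and k1 are arbitrary choices made for this illustration and are not part of this image.

```python
def render_agent_config(agent):
    """Return a minimal Flume configuration for one source, channel and sink.

    The component names r1, c1 and k1 are arbitrary labels (assumption).
    """
    properties = {
        # Declare the components that belong to this agent.
        f"{agent}.sources": "r1",
        f"{agent}.channels": "c1",
        f"{agent}.sinks": "k1",
        # Source: listens on a TCP port and turns each line into an event.
        f"{agent}.sources.r1.type": "netcat",
        f"{agent}.sources.r1.bind": "localhost",
        f"{agent}.sources.r1.port": "44444",
        # Channel: buffers events in memory.
        f"{agent}.channels.c1.type": "memory",
        # Sink: writes events to the log.
        f"{agent}.sinks.k1.type": "logger",
        # Wire the source and the sink to the channel.
        f"{agent}.sources.r1.channels": "c1",
        f"{agent}.sinks.k1.channel": "c1",
    }
    return "\n".join(f"{key} = {value}" for key, value in properties.items()) + "\n"


if __name__ == "__main__":
    print(render_agent_config("a1"), end="")
```

Every key is prefixed with the agent name, so the same file content only applies to an agent started under that name.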

Example usage

To include this Docker image in an arbitrary BDE pipeline, extend the bde2020/flume image and add a flume-startup.json file and a Flume configuration file to the /config directory. The flume-startup.json file contains the startup command and its options for Flume in JSON format. An example is given below:

[
  {
    "bash":"/app/bin/flume-ng",
    "agent":"",
    "--name":"$FLUME_AGENT",
    "--conf":"/config/",
    "-z":"192.168.88.219:2181,192.168.88.220:2181",
    "-p":"/flume"
  }
]

Notes on flume-startup.json

  • The included flume-bin.py reads the key value pairs in order and issues the resulting command.
  • The first key value pair must be "bash" and "/app/bin/flume-ng" (or another binary from the Flume bin directory).
  • Environment variables can be used in flume-startup.json: flume-bin.py retrieves any value starting with "$" from the environment.
  • All other options depend on how Flume needs to be started. Requirements for certain options derive from the flume-ng binary. See the Flume User Guide (linked above) to learn more.
  • If a "chroot" is used in Apache Zookeeper via Apache Flume's -p option, its value must start with a slash.
  • If the -z option is used, flume-bin.py uploads the contents of the Flume configuration file to a Zookeeper node with the name specified by the --name option.
  • Important note on the naming convention: when Zookeeper is used, the value of the --name option in the startup command, the file name of the Flume configuration (added to /config) and the name of the agent inside that configuration must be the same.
  • If the defined Apache Flume pipeline requires additional Java libraries, include those libraries in the extension of this Docker image at an arbitrary path, and pass that path in flume-startup.json using the --plugins-path option. See the User Guide for more details about using plugins.
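The rules in the notes can be sketched in a few lines of Python. This is not the source of flume-bin.py, only an illustration of the described behaviour. The function name build_command is hypothetical, and skipping empty values (as for "agent":"" in the example) is an assumption inferred from the example file.

```python
import json
import os


def build_command(startup_json, env=None):
    """Turn the content of a flume-startup.json file into an argument list."""
    env = os.environ if env is None else env
    # The file holds a list with one object; json.loads keeps the key order,
    # which is the order of the arguments on the command line.
    items = list(json.loads(startup_json)[0].items())
    if not items or items[0][0] != "bash":
        raise ValueError('the first key value pair must be "bash"')
    args = [items[0][1]]  # the binary, e.g. /app/bin/flume-ng
    for key, value in items[1:]:
        if value.startswith("$"):
            # Look the value up in the environment; a missing variable
            # raises KeyError instead of silently passing an empty string.
            value = env[value[1:]]
        if key == "-p" and not value.startswith("/"):
            raise ValueError("the Zookeeper chroot given with -p must start with a slash")
        args.append(key)
        if value:  # assumption: empty values (e.g. "agent": "") add no argument
            args.append(value)
    return args


if __name__ == "__main__":
    example = """[{"bash": "/app/bin/flume-ng", "agent": "",
                   "--name": "$FLUME_AGENT", "--conf": "/config/",
                   "-z": "192.168.88.219:2181,192.168.88.220:2181",
                   "-p": "/flume"}]"""
    print(" ".join(build_command(example, env={"FLUME_AGENT": "a1"})))
```

With FLUME_AGENT set to a1, the naming convention then requires the configuration file in /config and the agent defined inside it to be named a1 as well.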

Scaling

In principle, scaling is done simply by running multiple instances of the same agent. Note, however, that avoiding duplicate work (e.g. several instances running the same query on the source and getting the same results) must also be supported by the defined Apache Flume source.
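One way a custom source could support this is to partition the work by a stable hash, so that each record is handled by exactly one instance. The sketch below is purely illustrative and not something Apache Flume or this image provides. The function owns and the parameters instance_index and instance_count are hypothetical.

```python
import zlib


def owns(key, instance_index, instance_count):
    """Return True if the instance with this index is responsible for the key."""
    # crc32 gives the same result in every process, unlike Python's
    # built-in hash() for strings, which is randomised per process.
    return zlib.crc32(key.encode("utf-8")) % instance_count == instance_index


if __name__ == "__main__":
    keys = [f"record-{i}" for i in range(10)]
    for index in range(3):
        print(index, [key for key in keys if owns(key, index, 3)])
```

Each instance would need to know its own index and the total number of instances, for example through environment variables set when the container is started.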
