# Apache Flume – Ingesting Streaming Data to HDFS

In this session we will look in detail at how to use Apache Flume to ingest streaming, real-time data.
* Overview of Flume
* Develop first Flume Agent
* Understand Source, Sink and Channel
* Flume Multi Agent Flows
* Get data into HDFS using Flume
* Limitations and Conclusion

For this demo we will be using our [Big Data developer labs](https://labs.itversity.com/). You need access to an existing big data cluster, or you can sign up for our labs.

### Overview of Flume
Let us understand what Flume is all about.
* Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
* It has a simple and flexible architecture based on streaming data flows.
* It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
* It can be integrated with other technologies like Spark Streaming to build streaming analytic applications.
* We need to configure an agent to get data from a source, such as web server logs, and write it to a target data store such as HDFS.
* A Flume agent contains three components: Source, Sink and Channel.
* Here is the [link](https://flume.apache.org/FlumeUserGuide.html) to the latest Flume documentation.
![](https://kaizen.itversity.com/wp-content/uploads/2018/09/01FlumeAgentComponents.png)

### Develop first Flume Agent
Now let us go ahead and develop our first Flume agent.
* [Here](https://flume.apache.org/FlumeUserGuide.html#a-simple-example) is the complete configuration file for a simple Flume example.
* Agent Name: a1
* Source
    * Name: r1
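    * Type: netcat
    * The netcat source listens on a given IP address and port and turns each line of text it receives into a Flume event. It is a plain TCP listener, not a web server.
    * IP address is either gw02.itversity.com or gw03.itversity.com, whichever host you want to run the netcat source on. Binding to 0.0.0.0, as in the configuration that follows, listens on all interfaces of that host.
    * Port Number: 44444
    * The listener will be started by the Flume agent automatically
    * This is primarily used to understand Flume. We do not use this in actual implementations.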
* Sink
    * Name: k1
    * Type: logger
    * Each event is written to the Flume agent's log at INFO level, which is useful for testing and debugging
* Channel
    * Name: c1
    * Type: memory
* Now we can start the Flume agent using the flume-ng command.

##### #Create a file by name example.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

##### #Start the agent from the directory that contains example.conf
flume-ng agent \
  --conf-file example.conf \
  --name a1

### Understand Source, Sink and Channel
Each Flume agent has a Source, a Sink and a Channel.
* Source reads data from an origin such as web server logs. There are several types of sources.
    * netcat
    * exec
    * syslog
    * avro
    * and more.
* Sink writes data into data stores or forwards it to the source of another Flume agent (via avro). There are several types of sinks.
    * logger
    * HDFS
    * avro
    * and many more
* Channel buffers events between source and sink. There are different types of channels, but the most popular ones are memory, file and Kafka.
* Memory channel gives good performance but is not reliable: events still in the channel are lost if the agent stops or crashes. File channel gives reliability at the cost of performance.
* A source can write to one or more channels, so one source can feed many sinks. Each sink, however, reads from exactly one channel.
* The properties available for a component are determined by the type chosen.
* Flume is easily extensible. For example, we can integrate custom sources, sinks or channels with Flume.
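
To make the memory versus file trade-off concrete, here is a minimal sketch of the same channel `c1` configured both ways. The directory paths are hypothetical placeholders, not values from this course; `checkpointDir` and `dataDirs` are the file channel properties that control where events are persisted.

```properties
# Option 1: memory channel. Fast, but events held in the channel
# are lost if the agent process stops or crashes.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Option 2: file channel. Events are persisted to local disk, so they
# survive an agent restart, at the cost of extra disk I/O.
# The two directories below are assumed paths; use locations the
# agent user can write to.
# a1.channels.c1.type = file
# a1.channels.c1.checkpointDir = /tmp/flume/checkpoint
# a1.channels.c1.dataDirs = /tmp/flume/data
```

Only one of the two options should be active for a given channel name. Note how switching the type changes which properties apply.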

### Flume Multi Agent Flows
Flume can be configured with multiple agents for scalability, consolidation etc.
* Consolidation
    * Typically used to collect data for the same application when it is deployed on multiple web/app servers
    * We need to define one agent for each web/app server, and each of those agents uses an avro sink that points to the same collector agent
    * The collector agent then reads the data through an avro source and writes it to whatever target you want
![](https://kaizen.itversity.com/wp-content/uploads/2018/09/02FlumeConsolidation.png)

![](https://kaizen.itversity.com/wp-content/uploads/2018/09/03FlumeConsolidation.png)
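
The consolidation flow described above can be sketched with two agent definitions. This is only an illustration under stated assumptions: the agent names `web` and `collector`, the host name `collector.example.com` and port 4545 are hypothetical, and the logger sink on the collector stands in for whatever target you actually want.

```properties
# Agent "web": runs on every web/app server and forwards log lines
# to the collector through an avro sink.
web.sources = logsource
web.sinks = avrosink
web.channels = c1

web.sources.logsource.type = exec
web.sources.logsource.command = tail -F /opt/gen_logs/logs/access.log
web.sources.logsource.channels = c1

# Assumed collector host and port; must match the avro source below.
web.sinks.avrosink.type = avro
web.sinks.avrosink.hostname = collector.example.com
web.sinks.avrosink.port = 4545
web.sinks.avrosink.channel = c1

web.channels.c1.type = memory

# Agent "collector": runs on one host, receives events from all
# "web" agents through an avro source.
collector.sources = avrosource
collector.sinks = k1
collector.channels = c1

collector.sources.avrosource.type = avro
collector.sources.avrosource.bind = 0.0.0.0
collector.sources.avrosource.port = 4545
collector.sources.avrosource.channels = c1

# Logger sink used here only as a placeholder target.
collector.sinks.k1.type = logger
collector.sinks.k1.channel = c1

collector.channels.c1.type = memory
```

Each agent is started separately with flume-ng, passing its own name through `--name`. The collector should be running before the web agents start sending.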

* Multiplexing
    * We can have data from one source fanned out to multiple channels, and therefore to multiple sinks.
    * This approach is primarily used to
        * Scale up write operations
        * Write into a variety of targets such as HDFS, JMS, avro etc
![](https://kaizen.itversity.com/wp-content/uploads/2018/09/04FlumeMultiplexing.png)
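
How events are distributed across the channels is controlled by the source's channel selector. The fragment below is a sketch assuming an agent `a1` with a source `r1`, channels `c1` and `c2`, and sinks `k1` and `k2` already declared. The header name `logtype` and its values are hypothetical and would have to be set upstream, for example by an interceptor.

```properties
# The source writes to two channels.
a1.sources.r1.channels = c1 c2

# Replicating selector (the default): every event is copied to
# both c1 and c2.
a1.sources.r1.selector.type = replicating

# Multiplexing selector (alternative): route each event to a channel
# based on the value of an event header. "logtype" is an assumed
# header name for illustration.
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = logtype
# a1.sources.r1.selector.mapping.access = c1
# a1.sources.r1.selector.mapping.error = c2
# a1.sources.r1.selector.default = c1

# Each sink drains exactly one channel.
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```

Replicating suits the case where every target needs all the data. Multiplexing suits the case where different kinds of events belong in different targets.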

### Get data into HDFS using Flume
Now let us understand how to get data into HDFS using Flume.
* Source – exec
* Sinks – logger and HDFS
* Channels – memory for the logger sink and file for the HDFS sink
* We will explore the relevant options in the official documentation and then use them as part of the agent definition.

##### #flume-logger-hdfs.conf: Read data from logs and write it to both logger and hdfs
##### #flume command to start the agent - flume-ng agent --name a1 --conf-file flume-logger-hdfs.conf

##### #Name the components on this agent
a1.sources = logsource
a1.sinks = loggersink hdfssink
a1.channels = loggerchannel hdfschannel

##### #Describe/configure the source
a1.sources.logsource.type = exec
a1.sources.logsource.command = tail -F /opt/gen_logs/logs/access.log

##### #Describe the logger sink
a1.sinks.loggersink.type = logger

##### #Use a channel which buffers events in memory
a1.channels.loggerchannel.type = memory
a1.channels.loggerchannel.capacity = 1000
a1.channels.loggerchannel.transactionCapacity = 100

##### #Bind the source and sinks to the channels
a1.sources.logsource.channels = loggerchannel hdfschannel
a1.sinks.loggersink.channel = loggerchannel
a1.sinks.hdfssink.channel = hdfschannel

##### #Describe the HDFS sink
a1.sinks.hdfssink.type = hdfs
a1.sinks.hdfssink.hdfs.path = hdfs://nn01.itversity.com:8020/user/dgadiraju/flume_example_%Y-%m-%d
a1.sinks.hdfssink.hdfs.fileType = DataStream
a1.sinks.hdfssink.hdfs.rollInterval = 120
a1.sinks.hdfssink.hdfs.rollSize = 10485760
a1.sinks.hdfssink.hdfs.rollCount = 30
a1.sinks.hdfssink.hdfs.filePrefix = retail
a1.sinks.hdfssink.hdfs.fileSuffix = .txt
a1.sinks.hdfssink.hdfs.inUseSuffix = .tmp
a1.sinks.hdfssink.hdfs.useLocalTimeStamp = true

##### #Use a channel which buffers events in file for HDFS sink
a1.channels.hdfschannel.type = file
a1.channels.hdfschannel.capacity = 1000
a1.channels.hdfschannel.transactionCapacity = 100
a1.channels.hdfschannel.checkpointInterval = 300

### Limitations and Conclusion
Flume is primarily a data ingestion tool.
* It is flexible enough to apply rules while ingesting data.
* We can integrate it with tools like Spark Streaming, Flink, Storm etc for streaming analytics.
* Flume is also used for getting data from the web server logs of existing applications into much more flexible streaming ingestion tools like Kafka. However, it might be replaced by Kafka Connect in the future.
* Over a period of time there can be many Flume agents, and there are no tools to manage them (unlike Kafka).
* There are too many moving parts, and there are simpler tools than Flume, such as Kafka.
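
For the Flume-to-Kafka hand-off mentioned above, a Kafka sink definition might look like the sketch below. It assumes a Flume release that ships the Kafka sink with the `kafka.`-prefixed property names (older releases used different names), an agent `a1` with a channel `c1` already defined, and a hypothetical broker address and topic name.

```properties
# Add a Kafka sink to an existing agent. The sink name, broker
# address and topic are placeholders for illustration.
a1.sinks = kafkasink

a1.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafkasink.kafka.bootstrap.servers = broker01.example.com:9092
a1.sinks.kafkasink.kafka.topic = retail_logs
a1.sinks.kafkasink.channel = c1
```

With this in place the source side of the agent stays unchanged. Only the target moves from HDFS or logger to a Kafka topic.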