# Basic Apache Flume

## Create folders

- First, we will create a folder named "flume" in `/media/notebooks/` to store all our work with Flume.
- Inside the "flume" folder, we will create another folder named "configuraciones" to hold all our Flume configuration files.
- We will also create a "data" folder, where we will store the files to be read and later sent via Flume to HDFS.

In [1]:
# Create folders
! mkdir -p flume
! mkdir -p flume/data
! mkdir -p flume/configuraciones

In [2]:
import os
os.chdir("/media/notebooks/flume/configuraciones")
! pwd

/media/notebooks/flume/configuraciones


## Agent Configuration File

The configuration file outlines the data flow, which involves copying the content of files in a specified local path to a temporary HDFS file that Flume will create for us. This setup allows an agent to "listen" to everything that arrives in a folder and establish a data flow from that folder to HDFS. Here’s the configuration breakdown:

- **General Agent Configuration**: We name the agent `agente1`. The first lines declare the names of the source, sink, and channel: `source1`, `sink1`, and `channel1`.

- **Source Configuration**: We specify the type of source, which in this case is `spooldir`, since we’ll be monitoring a local directory of files. We then set the path for this source.

- **Channel Configuration**: We define the channel type that will connect our source with the sink. For this setup, we use `memory`, a memory channel that stores events in the agent’s memory.

- **Sink Configuration**: Since we’ll be writing to HDFS, the sink type is set to `hdfs`, and we provide the HDFS path where the files will be stored. We then specify the output file type as `DataStream` and the data format as `Text`.

- **Linking Source and Sink to the Channel**: We specify the channel used by the source and sink. Notably, for the source, we use the `channels` property, while for the sink, we use `channel`. This is because a source can write to multiple channels, while a sink can only read from one channel.

In [3]:
%%writefile ejemplo.config
# Component names
agente1.sources = source1
agente1.sinks = sink1
agente1.channels = channel1

# Source configuration
agente1.sources.source1.type = spooldir
agente1.sources.source1.spoolDir = /media/notebooks/flume/data/ejemplo

# Channel configuration
agente1.channels.channel1.type = memory

# Sink configuration
agente1.sinks.sink1.type = hdfs
agente1.sinks.sink1.hdfs.path = hdfs://namenode:9000/ejemplo-flume/
agente1.sinks.sink1.hdfs.fileType = DataStream
agente1.sinks.sink1.hdfs.writeFormat = Text

# Bind the source and the sink to the channel
agente1.sources.source1.channels = channel1
agente1.sinks.sink1.channel = channel1

Writing ejemplo.config
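
The wiring rules described above (every component key is prefixed with the agent name, a source lists one or more `channels`, a sink names exactly one `channel`) can be checked before starting the agent. The following is a minimal sketch, not part of Flume: `parse_properties` and `check_wiring` are hypothetical helper names, and the configuration is embedded as a string so the cell runs on its own.

```python
# Hypothetical helpers (not a Flume API): parse a Flume-style properties
# text and check that sources and sinks are linked to declared channels.

CONFIG_TEXT = """\
agente1.sources = source1
agente1.sinks = sink1
agente1.channels = channel1
agente1.sources.source1.type = spooldir
agente1.sources.source1.spoolDir = /media/notebooks/flume/data/ejemplo
agente1.channels.channel1.type = memory
agente1.sinks.sink1.type = hdfs
agente1.sinks.sink1.hdfs.path = hdfs://namenode:9000/ejemplo-flume/
agente1.sources.source1.channels = channel1
agente1.sinks.sink1.channel = channel1
"""


def parse_properties(text):
    """Return a dict of key -> value, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems found for the given agent name."""
    problems = []
    declared = set(props.get(f"{agent}.channels", "").split())
    sources = props.get(f"{agent}.sources", "").split()
    sinks = props.get(f"{agent}.sinks", "").split()
    if not (declared and sources and sinks):
        problems.append(f"agent {agent} does not declare sources, sinks and channels")
    for source in sources:
        linked = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not linked:
            problems.append(f"source {source} has no channels")
        for channel in linked:
            if channel not in declared:
                problems.append(f"source {source} uses undeclared channel {channel}")
    for sink in sinks:
        linked = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(linked) != 1:
            problems.append(f"sink {sink} must name exactly one channel")
        elif linked[0] not in declared:
            problems.append(f"sink {sink} uses undeclared channel {linked[0]}")
    return problems


props = parse_properties(CONFIG_TEXT)
print(check_wiring(props, "agente1"))  # an empty list means the wiring is consistent
print(check_wiring(props, "agent1"))   # a misspelled agent name finds no components
```

The second call illustrates a common pitfall: the name passed to `--name` must match the key prefix used in the file, because Flume only reads the keys that start with that name.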


In [4]:
# Create the ejemplo folder for the input files
! mkdir -p /media/notebooks/flume/data/ejemplo
os.chdir("/media/notebooks/flume/data/ejemplo")
! pwd

/media/notebooks/flume/data/ejemplo


## Create Dataset

We create the three files from which we will read their contents:

In [5]:
%%writefile ejemplo1-fichero1.txt
Este es el texto del fichero 1 para el ejemplo de flume agente1

Writing ejemplo1-fichero1.txt


In [6]:
%%writefile ejemplo1-fichero2.txt
Sin embargo, este es el texto del fichero 2 para el ejemplo de flume agente1

Writing ejemplo1-fichero2.txt


In [7]:
%%writefile ejemplo1-fichero3.txt
Por ultimo, este es el texto del fichero 3 para el ejemplo de flume agente1

Writing ejemplo1-fichero3.txt


In [9]:
# List the content of the folder
!ls /media/notebooks/flume/data/ejemplo

ejemplo1-fichero1.txt  ejemplo1-fichero2.txt  ejemplo1-fichero3.txt


## Create the Agent and Execute it

Start an Apache Flume agent named `agente1` with the configuration in `ejemplo.config`, logging at `INFO` level to the console.

In [10]:
! flume-ng agent --conf ./conf/ \
    --conf-file /media/notebooks/flume/configuraciones/ejemplo.config \
    --name agente1 \
    -Dflume.root.logger=INFO,console

Info: Including Hadoop libraries found via (/usr/local/hadoop/bin/hadoop) for HDFS access
Info: Including Hive libraries found via (/usr/local/hive) for Hive access
+ exec /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp './conf/:/usr/local/flume/lib/*:/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hive/lib/*' -Djava.library.path=:/usr/local/hadoop/lib/native org.apache.flume.node.Application --conf-file /media/notebooks/flume/configuraciones/ejemplo.config --name agente1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/flume/lib/log4j-slf4j-impl-2.18.0.j

Wait until all the files have been read, then stop the agent manually (with the stop button of the Jupyter kernel, or with Ctrl+C in a terminal). Flume processes data as a stream, so it has no way of knowing when to stop listening: if we do not stop it, it stays active indefinitely and the following cells cannot run.

To check that the files have been read completely, look at the source folder `/media/notebooks/flume/data/ejemplo`: the suffix `.COMPLETED` is appended to the name of each file once it has been fully read.
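
The manual wait-and-stop step can also be automated. The sketch below is one possible approach, not the notebook's method: `pending_files` and `run_until_ingested` are hypothetical helpers, and `grace_seconds` is an assumed safety margin, since a renamed file only shows that the source finished reading it, not that the sink has already written every event to HDFS.

```python
import subprocess
import time
from pathlib import Path


def pending_files(spool_dir, suffix=".COMPLETED"):
    """Return the regular files in spool_dir whose names lack the suffix."""
    return sorted(
        path.name
        for path in Path(spool_dir).iterdir()
        if path.is_file() and not path.name.endswith(suffix)
    )


def run_until_ingested(command, spool_dir, poll_seconds=1.0,
                       grace_seconds=5.0, timeout_seconds=300.0):
    """Run command until spool_dir has no pending files, then stop it.

    Returns True if every file was renamed before the timeout expired.
    """
    process = subprocess.Popen(command)
    deadline = time.monotonic() + timeout_seconds
    finished = False
    try:
        while process.poll() is None and time.monotonic() < deadline:
            if not pending_files(spool_dir):
                finished = True
                time.sleep(grace_seconds)  # assumed margin for the sink to drain the channel
                break
            time.sleep(poll_seconds)
    finally:
        if process.poll() is None:
            process.terminate()  # polite stop request (SIGTERM on POSIX)
            try:
                process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                process.kill()
                process.wait()
    return finished


# Example call with the command used in this notebook (not executed here):
# run_until_ingested(
#     ["flume-ng", "agent", "--conf", "./conf/",
#      "--conf-file", "/media/notebooks/flume/configuraciones/ejemplo.config",
#      "--name", "agente1", "-Dflume.root.logger=INFO,console"],
#     "/media/notebooks/flume/data/ejemplo",
# )
```

Only regular files are inspected, so any subdirectory inside the spooling folder is ignored by the check.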


See the content of the source folder:

In [11]:
!ls /media/notebooks/flume/data/ejemplo

ejemplo1-fichero1.txt.COMPLETED  ejemplo1-fichero3.txt.COMPLETED
ejemplo1-fichero2.txt.COMPLETED


See the content of the HDFS folder:

In [12]:
!hdfs dfs -ls /ejemplo-flume/

Found 1 items
-rw-r--r--   3 root supergroup        217 2024-08-26 13:55 /ejemplo-flume/FlumeData.1724680541208
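
The numeric suffix of the `FlumeData` file appears to be a timestamp in milliseconds since the Unix epoch. The short sketch below decodes it under that assumption; `flume_suffix_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def flume_suffix_to_utc(file_name):
    """Interpret the numeric suffix of a FlumeData file name as epoch milliseconds."""
    millis = int(file_name.rsplit(".", 1)[1])
    seconds, remainder = divmod(millis, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=remainder)


print(flume_suffix_to_utc("FlumeData.1724680541208").isoformat())
# 2024-08-26T13:55:41.208000+00:00
```

The decoded value falls in the same minute as the modification time shown in the listing, which is consistent with this reading if the cluster clock is set to UTC.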


The agent created this `FlumeData` file: the source read the content of the three input files, passed it through the channel (here, the memory channel), and the sink wrote it to HDFS.

Printing the file shows that all three lines were transmitted successfully:

In [13]:
! hdfs dfs -cat /ejemplo-flume/FlumeData*

Este es el texto del fichero 1 para el ejemplo de flume agente1
Sin embargo, este es el texto del fichero 2 para el ejemplo de flume agente1
Por ultimo, este es el texto del fichero 3 para el ejemplo de flume agente1
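
The source, channel, and sink flow in this example can be pictured with a toy model. This is only an illustration in plain Python, not how Flume is implemented: `spool_source` and `text_sink` are hypothetical names, and the model assumes one event per input line and one newline written after each event.

```python
from collections import deque

# The three input files created earlier in this notebook.
FILES = {
    "ejemplo1-fichero1.txt":
        "Este es el texto del fichero 1 para el ejemplo de flume agente1\n",
    "ejemplo1-fichero2.txt":
        "Sin embargo, este es el texto del fichero 2 para el ejemplo de flume agente1\n",
    "ejemplo1-fichero3.txt":
        "Por ultimo, este es el texto del fichero 3 para el ejemplo de flume agente1\n",
}


def spool_source(files, channel):
    """Toy source: put one event per line on the channel, then mark the file."""
    completed = []
    for name in sorted(files):
        for line in files[name].splitlines():
            channel.append(line)
        completed.append(name + ".COMPLETED")
    return completed


def text_sink(channel):
    """Toy sink: drain the channel, writing each event followed by a newline."""
    written = []
    while channel:
        written.append(channel.popleft() + "\n")
    return "".join(written)


channel = deque()  # stands in for the memory channel
completed = spool_source(FILES, channel)
output = text_sink(channel)

print(completed)
print(len(output.encode("utf-8")))  # 217
```

The byte count matches the 217 bytes reported by `hdfs dfs -ls` for the `FlumeData` file, which supports the reading that the three lines were stored in a single output file.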
