
datasources: Add kafka descriptions #5

61 changes: 61 additions & 0 deletions content/datasources/kafka.md
# Collecting Data from Kafka

## Scenario

[Kafka](http://kafka.apache.org) is a high-throughput, distributed messaging system.

You run Kafka as a messaging system and now want to send its messages into various other systems.

Fluentd can be set up to collect messages from Kafka. Applications include:

1. Sending Kafka messages into HDFS for analysis
2. Sending Kafka messages into Elasticsearch for analysis
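
As a sketch of the second application, once events are flowing through Fluentd they can be routed onward with an output plugin such as [fluent-plugin-elasticsearch](https://github.com/uken/fluent-plugin-elasticsearch). The tag pattern, host, and port below are placeholder assumptions, not part of this document:

```
# Hypothetical output stage: forward collected events to a local
# Elasticsearch node (host, port, and tag pattern are placeholders).
<match kafka.**>
  @type elasticsearch
  host localhost
  port 9200
  logstash_format true
</match>
```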

Member: Writing options like "Fluentd has two ways to consume Kafka messages: kafka-fluentd-consumer or fluent-plugin-kafka" is better.

Fluentd has two ways to consume Kafka messages: the `in_kafka` plugin from fluent-plugin-kafka, or `kafka-fluentd-consumer`.

## Setup: fluent-plugin-kafka

1. Install the [Kafka input plugin](https://github.com/htgc/fluent-plugin-kafka) by running the following command:

```
$ fluent-gem install fluent-plugin-kafka
```

2. Open your Fluentd configuration file and add the following lines:

```
<source>
  @type kafka
  host <broker host>
  port <broker port (default: 9092)>
  topics <topics to listen to (comma-separated)>
  format <input text type (text|json|ltsv|msgpack)>
  message_key <key (optional, text format only; default: message)>
</source>
```

With the above setup, Fluentd consumes Kafka messages via the `in_kafka` plugin.
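
For example, here is a filled-in sketch that consumes JSON messages from two hypothetical topics on a local broker and prints the events for inspection. The broker address and the topic names `web_logs` and `app_logs` are placeholders, and this assumes `in_kafka`'s default behavior of tagging each event with its topic name:

```
<source>
  @type kafka
  host localhost
  port 9092
  topics web_logs,app_logs
  format json
</source>

# Events are assumed to be tagged with their source topic name,
# so match the topics directly; stdout is only for verification.
<match web_logs app_logs>
  @type stdout
</match>
```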

## Setup: kafka-fluentd-consumer

1. Download the latest [kafka-fluentd-consumer jar](https://github.com/treasure-data/kafka-fluentd-consumer/releases).

2. Configure kafka-fluentd-consumer's settings. (See [fluentd-consumer.properties](https://github.com/treasure-data/kafka-fluentd-consumer/blob/master/config/fluentd-consumer.properties) for an example.)

3. Open your Fluentd configuration file and add the following lines:

```
<source>
  @type exec
  command java -Dlog4j.configuration=file:///path/to/log4j.properties -jar /path/to/kafka-fluentd-consumer-LATEST_VERSION-all.jar /path/to/config/fluentd-consumer.properties
  tag dummy
  format json
</source>
```

With the above setup, Fluentd consumes messages from the Kafka topics specified in `fluentd-consumer.properties` via the `in_exec` plugin.
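
Note that kafka-fluentd-consumer delivers the consumed events to Fluentd over the forward protocol; the `in_exec` source above only launches and supervises the Java process (hence the throwaway `tag dummy`). Here is a minimal sketch of the receiving side, assuming the consumer forwards to `localhost:24224` with a tag prefix of `kafka.event.` (both values must match your fluentd-consumer.properties):

```
# Receives events sent by kafka-fluentd-consumer over the forward protocol.
<source>
  @type forward
  port 24224
</source>

# Assumes fluentd.tag.prefix is set to kafka.event. in the properties file;
# stdout is only for verification.
<match kafka.event.**>
  @type stdout
</match>
```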

### Note

For simplicity, you can use the `in_kafka` plugin to retrieve Kafka messages.
If you expect heavy Kafka traffic in production, we recommend using `kafka-fluentd-consumer` instead of `in_kafka`, because `in_kafka` has been reported to use high CPU at around 1000 req/sec. For more detail, please refer to [the issue](https://github.com/htgc/fluent-plugin-kafka/issues/16).