@@ -10,7 +10,7 @@ Kafka batch source that emits records with a user-specified schema.
Usage Notes
-----------

- Kafka Batch Source can be used to read events from a kafka topic. It uses kafka consumer [0.9.1 apis](https://kafka.apache.org/090/documentation.html) to read events from a kafka topic. The Kafka Batch Source supports providing additional kafka properties for the kafka consumer, reading from kerberos-enabled kafka and limiting the number of records read. Kafka Batch Source converts incoming kafka events into cdap structured records which then can be used for further transformations.
+ Kafka Batch Source can be used to read events from a Kafka topic. It uses the Kafka consumer [0.10.2 APIs](https://kafka.apache.org/0100/documentation.html) to read events from a Kafka topic. The Kafka Batch Source supports providing additional Kafka properties for the consumer, reading from Kerberos-enabled Kafka, and limiting the number of records read. The Kafka Batch Source converts incoming Kafka events into CDAP structured records, which can then be used for further transformations.

For the first run, the source will read from the earliest available offset or from the initial offset specified in the config; it remembers the last offset it read in each run and continues from that offset on the next run.
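
This offset-resume behavior follows the standard Kafka consumer pattern of assigning a partition and seeking to a saved offset. The sketch below is a minimal, hypothetical illustration of that pattern against the plain 0.10.2 consumer API; it is not the plugin's code, and the broker list, topic, and stored offset are assumed values.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetResumeSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "host1:9092,host2:9092"); // assumed brokers
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    TopicPartition partition = new TopicPartition("purchases", 0); // assumed topic and partition
    long savedOffset = 42L; // assumption: the offset persisted at the end of the previous run

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      // Assign the partition explicitly and seek to the saved offset (offsets are inclusive),
      // mirroring the "continue from the last offset" behavior described above.
      consumer.assign(Collections.singletonList(partition));
      consumer.seek(partition, savedOffset);

      ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
      for (ConsumerRecord<byte[], byte[]> record : records) {
        savedOffset = record.offset() + 1; // the next run would start after the last record read
      }
    }
  }
}
```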

85 changes: 85 additions & 0 deletions kafka-plugins-0.10/docs/KAFKASOURCE.md
@@ -0,0 +1,85 @@
[![Build Status](https://travis-ci.org/hydrator/kafka-plugins.svg?branch=master)](https://travis-ci.org/hydrator/kafka-plugins) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Kafka Source
===========

Kafka streaming source that emits records with a user-specified schema.

<img align="center" src="kafka-source-plugin-config.png" width="400" alt="plugin configuration" />

Usage Notes
-----------

Kafka Streaming Source can be used to read events from a Kafka topic. It uses the Kafka consumer [0.10.2 APIs](https://kafka.apache.org/0100/documentation.html) to read events from a Kafka topic. The Kafka Source converts incoming Kafka events into CDAP structured records, which can then be used for further transformations.

The source can read from the latest offset, from the beginning, or from a provided Kafka offset. The plugin relies on Spark Streaming's offset [storage capabilities](https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html) to manage offsets and checkpoints.
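
For context, the Spark Streaming Kafka 0-10 integration that the plugin builds on is typically wired up as in the sketch below. This is a minimal illustration of that integration, not the plugin's implementation; the application name, checkpoint directory, batch interval, and consumer settings are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("kafka-source-sketch").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    // Checkpointing is what lets Spark Streaming track processed offsets across runs.
    jssc.checkpoint("/tmp/kafka-source-checkpoint"); // assumed checkpoint location

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "host1.example.com:9092,host2.example.com:9092");
    kafkaParams.put("key.deserializer", ByteArrayDeserializer.class);
    kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
    kafkaParams.put("group.id", "kafka-source-sketch");
    kafkaParams.put("auto.offset.reset", "latest"); // corresponds to a default initial offset of -1

    JavaInputDStream<ConsumerRecord<byte[], byte[]>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<byte[], byte[]>Subscribe(Arrays.asList("purchases"), kafkaParams));

    stream.foreachRDD(rdd -> System.out.println("records in batch: " + rdd.count()));

    jssc.start();
    jssc.awaitTermination();
  }
}
```

With this wiring, offsets are tracked through the checkpoint data, so clearing the checkpoint directory generally restarts consumption from the configured initial offsets.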

Plugin Configuration
---------------------

| Configuration | Required | Default | Description |
| :------------ | :------: | :----- | :---------- |
| **Kafka Brokers** | **Y** | N/A | List of Kafka brokers specified in host1:port1,host2:port2 form. |
| **Kafka Topic** | **Y** | N/A | The Kafka topic to read from. |
| **Topic Partition** | **N** | N/A | List of topic partitions to read from. If not specified, all partitions will be read. |
| **Default Initial Offset** | **N** | -1 | The default initial offset for all topic partitions. An offset of -2 means the smallest offset. An offset of -1 means the latest offset. Offsets are inclusive. If an offset of 5 is used, the message at offset 5 will be read. If you wish to set different initial offsets for different partitions, use the initialPartitionOffsets property. |
| **Initial Partition Offsets** | **N** | N/A | The initial offset for each topic partition. If this is not specified, all partitions will use the same initial offset, which is determined by the defaultInitialOffset property. Any partitions specified in the partitions property, but not in this property will use the defaultInitialOffset. An offset of -2 means the smallest offset. An offset of -1 means the latest offset. Offsets are inclusive. If an offset of 5 is used, the message at offset 5 will be read. |
| **Time Field** | **N** | N/A | Optional name of the field containing the read time of the batch. If this is not set, no time field will be added to output records. If set, this field must be present in the schema property and must be a long. |
| **Key Field** | **N** | N/A | Optional name of the field containing the message key. If this is not set, no key field will be added to output records. If set, this field must be present in the schema property and must be bytes. |
| **Partition Field** | **N** | N/A | Optional name of the field containing the partition the message was read from. If this is not set, no partition field will be added to output records. If set, this field must be present in the schema property and must be an int. |
| **Offset Field** | **N** | N/A | Optional name of the field containing the partition offset the message was read from. If this is not set, no offset field will be added to output records. If set, this field must be present in the schema property and must be a long. |
| **Format** | **N** | N/A | Optional format of the Kafka event message. Any format supported by CDAP is supported. For example, a value of 'csv' will attempt to parse Kafka payloads as comma-separated values. If no format is given, Kafka message payloads will be treated as bytes. |
| **Kerberos Principal** | **N** | N/A | The kerberos principal used for the source when kerberos security is enabled for kafka. |
| **Keytab Location** | **N** | N/A | The keytab location for the kerberos principal when kerberos security is enabled for kafka. |


Build
-----
To build this plugin:

```
mvn clean package
```

The build will create a .jar and .json file under the ``target`` directory.
These files can be used to deploy your plugins.

Deployment
----------
You can deploy your plugins using the CDAP CLI:

> load artifact <target/kafka-plugins-<version>.jar> config-file <target/kafka-plugins-<version>.json>

For example, if your artifact is named 'kafka-plugins-<version>':

> load artifact target/kafka-plugins-<version>.jar config-file target/kafka-plugins-<version>.json

## Mailing Lists

CDAP User Group and Development Discussions:

* [cdap-user@googlegroups.com](https://groups.google.com/d/forum/cdap-user)

The *cdap-user* mailing list is primarily for users of the product who are developing
applications or building plugins for applications. You can expect questions from
users, release announcements, and any other discussions that we think will be helpful
to users.

## License and Trademarks

Copyright © 2018 Cask Data, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the License for the specific language governing permissions
and limitations under the License.

Cask is a trademark of Cask Data, Inc. All rights reserved.

Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with
permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.
@@ -12,14 +12,14 @@ The sink also allows you to write events into kerberos-enabled kafka.
Usage Notes
-----------

- Kafka sink emits events in realtime to configured kafka topic and partition. It uses kafka producer [0.8.2 apis](https://kafka.apache.org/082/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html) to write events into kafka.
+ The Kafka sink emits events in real time to the configured Kafka topic and partition. It uses the Kafka producer [0.10.2 APIs](https://kafka.apache.org/0100/documentation.html) to write events into Kafka.

This sink can be configured to operate in synchronous or asynchronous mode. In synchronous mode, each event is sent to the broker synchronously on the thread that calls it. This is not sufficient for most high-volume environments.
In async mode, the Kafka producer batches the Kafka events together for greater throughput, but this opens up the possibility of dropping unsent events if the client machine fails. Since the Kafka producer uses synchronous mode by default, this sink also uses a synchronous producer by default.

It uses a String partitioner and String serializers for the key and value to write events to Kafka. Optionally, if a Kafka key is provided, the producer will use that key to partition events across multiple partitions in a given topic. This sink also allows compression configuration; by default, compression is none.

- Kafka producer can be tuned using many properties as shown [here](https://kafka.apache.org/082/javadoc/org/apache/kafka/clients/producer/ProducerConfig.html). This sink allows user to configure any property supported by kafka 0.8.2 Producer.
+ The Kafka producer can be tuned using many properties, as shown [here](https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/producer/ProducerConfig.html). This sink allows the user to configure any property supported by the Kafka 0.10.2.0 producer.
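
To make the synchronous/asynchronous trade-off concrete, here is a minimal, hypothetical sketch against the plain 0.10.2 producer API using String serializers, a record key for partitioning, and compression enabled. It illustrates the behavior described above rather than the sink's actual code; the broker list, topic, key, and property values are assumed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerModeSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092"); // assumed brokers
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // the sink's default is "none"
    props.put(ProducerConfig.ACKS_CONFIG, "1"); // tunable like any other producer property

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The key ("user-42" here, an illustrative value) determines the target partition.
      ProducerRecord<String, String> record =
          new ProducerRecord<>("events", "user-42", "{\"message\":\"hello\"}");

      // Synchronous: block the calling thread until the broker acknowledges the write.
      producer.send(record).get();

      // Asynchronous: return immediately; the callback reports success or failure later,
      // so unsent batches can be lost if the client machine fails.
      producer.send(record, (metadata, exception) -> {
        if (exception != null) {
          exception.printStackTrace();
        }
      });
      producer.flush();
    }
  }
}
```

Any of the producer settings shown here can instead be supplied through the sink's additional Kafka properties.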


Plugin Configuration
63 changes: 63 additions & 0 deletions kafka-plugins-0.10/docs/Kafka-alert-publisher.md
@@ -0,0 +1,63 @@
# kafka-alert-plugin

<a href="https://cdap-users.herokuapp.com/"><img alt="Join CDAP community" src="https://cdap-users.herokuapp.com/badge.svg?t=kafka-alert-plugin"/></a> [![Build Status](https://travis-ci.org/hydrator/kafka-alert-plugin.svg?branch=master)](https://travis-ci.org/hydrator/kafka-alert-plugin) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) <img src="https://cdap-users.herokuapp.com/assets/cm-available.svg"/>

The Kafka Alert Publisher allows you to publish alerts to Kafka as JSON objects. The plugin internally uses the Kafka producer APIs to publish alerts.
The plugin allows you to specify the Kafka topic to use for publishing, as well as additional Kafka producer properties.
This plugin uses the Kafka 0.10.2 Java APIs.

Build
-----
To build this plugin:

```
mvn clean package
```

The build will create a .jar and .json file under the ``target`` directory.
These files can be used to deploy your plugins.

Deployment
----------
You can deploy your plugins using the CDAP CLI:

> load artifact <target/kafka-alert-plugin-<version>.jar> config-file <target/kafka-alert-plugin-<version>.json>

For example, if your artifact is named 'kafka-alert-plugin-<version>':

> load artifact target/kafka-alert-plugin-<version>.jar config-file target/kafka-alert-plugin-<version>.json

## Mailing Lists

CDAP User Group and Development Discussions:

* [cdap-user@googlegroups.com](https://groups.google.com/d/forum/cdap-user)

The *cdap-user* mailing list is primarily for users of the product who are developing
applications or building plugins for applications. You can expect questions from
users, release announcements, and any other discussions that we think will be helpful
to users.

## IRC Channel

CDAP IRC Channel: #cdap on irc.freenode.net


## License and Trademarks

Copyright © 2018 Cask Data, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the License for the specific language governing permissions
and limitations under the License.

Cask is a trademark of Cask Data, Inc. All rights reserved.

Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with
permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.
@@ -7,7 +7,7 @@ Kafka sink that allows you to write events as CSV or JSON to Kafka.
The plugin has the capability to push data to a Kafka topic. It can also be
configured to partition events being written to Kafka based on a configurable key.
The sink can also be configured to operate in sync or async mode and apply different
- compression types to events. Kafka sink is compatible with Kafka 0.9 and 0.10
+ compression types to events. This plugin uses the Kafka 0.10.2 Java APIs.


Configuration
@@ -55,4 +55,4 @@ Additional properties like the number of acknowledgements and client id can also be
"kafkaProperties": "acks:2,client.id:myclient",
"key": "message"
}
- }
+ }
@@ -7,7 +7,7 @@ Kafka batch source. Emits the record from kafka. It will emit a record based on
you use, or if no schema or format is specified, the message payload will be emitted. The source will
remember the offset it read last run and continue from that offset for the next run.
The Kafka batch source supports providing additional kafka properties for the kafka consumer,
- reading from kerberos-enabled kafka and limiting the number of records read
+ reading from kerberos-enabled Kafka and limiting the number of records read. This plugin uses the Kafka 0.10.2 Java APIs.

Use Case
--------
@@ -106,4 +106,3 @@ For each Kafka message read, it will output a record with the schema:
| count | int |
| price | double |
+================================+

118 changes: 118 additions & 0 deletions kafka-plugins-0.10/docs/Kafka-streamingsource.md
@@ -0,0 +1,118 @@
# Kafka Streaming Source


Description
-----------
Kafka streaming source. Emits a record with the schema specified by the user. If no schema
is specified, it will emit a record with two fields: 'key' (nullable string) and 'message'
(bytes). This plugin uses the Kafka 0.10.2 Java APIs.


Use Case
--------
This source is used whenever you want to read from Kafka. For example, you may want to read messages
from Kafka and write them to a Table.


Properties
----------
**referenceName:** This will be used to uniquely identify this source for lineage, annotating metadata, etc.

**brokers:** List of Kafka brokers specified in host1:port1,host2:port2 form. (Macro-enabled)

**topic:** The Kafka topic to read from. (Macro-enabled)

**partitions:** List of topic partitions to read from. If not specified, all partitions will be read. (Macro-enabled)

**defaultInitialOffset:** The default initial offset for all topic partitions.
An offset of -2 means the smallest offset. An offset of -1 means the latest offset. Defaults to -1.
Offsets are inclusive. If an offset of 5 is used, the message at offset 5 will be read.
If you wish to set different initial offsets for different partitions, use the initialPartitionOffsets property. (Macro-enabled)

**initialPartitionOffsets:** The initial offset for each topic partition. If this is not specified,
all partitions will use the same initial offset, which is determined by the defaultInitialOffset property.
Any partitions specified in the partitions property, but not in this property will use the defaultInitialOffset.
An offset of -2 means the smallest offset. An offset of -1 means the latest offset.
Offsets are inclusive. If an offset of 5 is used, the message at offset 5 will be read. (Macro-enabled)

**schema:** Output schema of the source. If you would like the output records to contain a field with the
Kafka message key, the schema must include a field of type bytes or nullable bytes, and you must set the
keyField property to that field's name. Similarly, if you would like the output records to contain a field with
the timestamp of when the record was read, the schema must include a field of type long or nullable long, and you
must set the timeField property to that field's name. Any field that is not the timeField or keyField will be used
in conjunction with the format to parse Kafka message payloads.

**format:** Optional format of the Kafka event message. Any format supported by CDAP is supported.
For example, a value of 'csv' will attempt to parse Kafka payloads as comma-separated values.
If no format is given, Kafka message payloads will be treated as bytes.

**timeField:** Optional name of the field containing the read time of the batch.
If this is not set, no time field will be added to output records.
If set, this field must be present in the schema property and must be a long.

**keyField:** Optional name of the field containing the message key.
If this is not set, no key field will be added to output records.
If set, this field must be present in the schema property and must be bytes.

**partitionField:** Optional name of the field containing the partition the message was read from.
If this is not set, no partition field will be added to output records.
If set, this field must be present in the schema property and must be an int.

**offsetField:** Optional name of the field containing the partition offset the message was read from.
If this is not set, no offset field will be added to output records.
If set, this field must be present in the schema property and must be a long.

**maxRatePerPartition:** Maximum number of records to read per second per partition. Defaults to 1000.

**principal:** The Kerberos principal used for the source when Kerberos security is enabled for Kafka.

**keytabLocation:** The keytab location for the Kerberos principal when Kerberos security is enabled for Kafka.
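
When Kerberos security is enabled, a 0.10.2 Kafka client is typically pointed at the principal and keytab through standard SASL client properties such as the ones sketched below. Whether the plugin forwards the principal and keytabLocation in exactly this form is an assumption; the keytab path and principal shown are placeholders.

```java
import java.util.Properties;

public class KerberosPropsSketch {
  public static Properties kerberosClientProps() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "host1.example.com:9092");
    props.put("security.protocol", "SASL_PLAINTEXT");
    props.put("sasl.kerberos.service.name", "kafka");
    // sasl.jaas.config is supported by Kafka clients starting with 0.10.2.0;
    // the keytab and principal below are placeholders, not values used by the plugin.
    props.put("sasl.jaas.config",
        "com.sun.security.auth.module.Krb5LoginModule required "
            + "useKeyTab=true storeKey=true "
            + "keyTab=\"/etc/security/keytabs/cdap.keytab\" "
            + "principal=\"cdap/host1.example.com@EXAMPLE.COM\";");
    return props;
  }
}
```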

Example
-------
This example reads from the 'purchases' topic of a Kafka instance running
on brokers host1.example.com:9092 and host2.example.com:9092. The source will add
a time field named 'readTime' that contains a timestamp corresponding to the micro
batch when the record was read. It will also contain a field named 'key' which will have
the message key in it. It parses the Kafka messages using the 'csv' format
with 'user', 'item', 'count', and 'price' as the message schema.

    {
        "name": "Kafka",
        "type": "streamingsource",
        "properties": {
            "topics": "purchases",
            "brokers": "host1.example.com:9092,host2.example.com:9092",
            "format": "csv",
            "timeField": "readTime",
            "keyField": "key",
            "schema": "{
                \"type\":\"record\",
                \"name\":\"purchase\",
                \"fields\":[
                    {\"name\":\"readTime\",\"type\":\"long\"},
                    {\"name\":\"key\",\"type\":\"bytes\"},
                    {\"name\":\"user\",\"type\":\"string\"},
                    {\"name\":\"item\",\"type\":\"string\"},
                    {\"name\":\"count\",\"type\":\"int\"},
                    {\"name\":\"price\",\"type\":\"double\"}
                ]
            }"
        }
    }

For each Kafka message read, it will output a record with the schema:

+================================+
| field name | type |
+================================+
| readTime | long |
| key | bytes |
| user | string |
| item | string |
| count | int |
| price | double |
+================================+

Note that the readTime field is not derived from the Kafka message, but from the time that the
message was read.