KAFKA-3454: add Kafka Streams web docs #1127

9 changes: 9 additions & 0 deletions docs/documentation.html
@@ -136,6 +136,12 @@ <h1>Kafka 0.10.0 Documentation</h1>
<li><a href="#connect_development">8.3 Connector Development Guide</a></li>
</ul>
</li>
<li><a href="#streams">9. Kafka Streams</a>
<ul>
<li><a href="#streams_overview">9.1 Overview</a></li>
<li><a href="#streams_user">9.2 User Guide</a></li>
</ul>
</li>
</ul>

<h2><a id="gettingStarted" href="#gettingStarted">1. Getting Started</a></h2>
@@ -171,4 +177,7 @@ <h2><a id="security" href="#security">7. Security</a></h2>
<h2><a id="connect" href="#connect">8. Kafka Connect</a></h2>
<!--#include virtual="connect.html" -->

<h2><a id="streams" href="#streams">9. Kafka Streams</a></h2>
<!--#include virtual="streams.html" -->

<!--#include virtual="../includes/footer.html" -->
99 changes: 99 additions & 0 deletions docs/quickstart.html
@@ -249,3 +249,102 @@ <h4><a id="quickstart_kafkaconnect" href="#quickstart_kafkaconnect">Step 7: Use Kafka Connect to import/export data</a></h4>
</pre>

You should see the line appear in the console consumer output and in the sink file.

<h4><a id="quickstart_kafkastreams" href="#quickstart_kafkastreams">Step 8: Use Kafka Streams to process data</a></h4>

Kafka Streams is a client library of Kafka for real-time stream processing and analysis of data stored in Kafka brokers.
This quickstart example demonstrates how to run a streaming application written with this library. Here is the gist
of the <a href="https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java">WordCount</a> example code (using Java 8 lambda expressions).

<pre>

KStream&lt;String, Long&gt; wordCounts = textLines
    // Split each text line, by whitespace, into words.
    .flatMapValues(value -&gt; Arrays.asList(value.toLowerCase().split("\\W+")))
    // Ensure the words are available as message keys for the next aggregate operation.
    .map((key, value) -&gt; new KeyValue&lt;&gt;(value, value))
    // Count the occurrences of each word (message key).
    .countByKey(stringSerializer, longSerializer, stringDeserializer, longDeserializer, "Counts")
    // Convert the resulting aggregate table into another stream.
    .toStream();

</pre>

It implements the WordCount
algorithm, which computes a word occurrence histogram from the input text. Unlike other WordCount examples
you might have seen before, which operate on <b>bounded data</b>, this demo application behaves differently because it is
designed to operate on an <b>infinite, unbounded stream</b> of data. Like the bounded variant, it is a stateful algorithm that
tracks and updates the counts of words. However, since it must assume potentially
unbounded input data, it periodically outputs its current state and results while continuing to process more data,
because it cannot know when it has processed "all" of the input data.
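
For orientation, here is a minimal sketch of the driver code that configures and starts such a topology. This is an illustration only: it assumes the 0.10.0-era API (<b>KStreamBuilder</b>, <b>StreamsConfig</b>, a <b>Serdes</b>-based default serde configuration) and reuses the topic names of this quickstart; see the linked WordCountDemo.java for the actual code.

<pre>
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class WordCountDriverSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();
        KStream&lt;String, String&gt; textLines = builder.stream("streams-file-input");

        // ... build the wordCounts stream from textLines as in the gist above,
        // then write it to the output topic, e.g. wordCounts.to("streams-wordcount-output") ...

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();

        // Unlike a typical streaming application, the demo stops itself
        // after a few seconds instead of running indefinitely.
        Thread.sleep(5000L);
        streams.close();
    }
}
</pre>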

We will now prepare some input data, which will subsequently be processed by a Kafka Streams application.

<pre>
&gt; <b>echo -e "all streams lead to kafka\nhello kafka streams\njoin kafka summit" > file-input.txt</b>
</pre>

Next, we create the input topic named <b>streams-file-input</b> and send this input data to it using the console producer (in practice,
stream data would likely be flowing continuously into Kafka while the application is up and running):

<pre>
&gt; <b>bin/kafka-topics.sh --create \</b>
<b>--zookeeper localhost:2181 \</b>
<b>--replication-factor 1 \</b>
<b>--partitions 1 \</b>
<b>--topic streams-file-input</b>
</pre>

<pre>
&gt; <b>cat file-input.txt | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-file-input</b>
</pre>
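
The console producer stands in for a real data source here. As a hedged illustration of data flowing in continuously, this is what a minimal Java producer feeding the same topic might look like (the class name and configuration values are assumptions, not part of the demo):

<pre>
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class InputFeedSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // In a real deployment an upstream application would keep sending lines
        // like this for as long as it runs, while the Streams job consumes them.
        try (KafkaProducer&lt;String, String&gt; producer = new KafkaProducer&lt;&gt;(props)) {
            producer.send(new ProducerRecord&lt;&gt;("streams-file-input", "all streams lead to kafka"));
        }
    }
}
</pre>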


We can now run the WordCount demo application to process the input data:


<pre>
&gt; <b>bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo</b>
</pre>

Apart from log entries there won't be any STDOUT output, because the results are continuously written back to another Kafka topic named <b>streams-wordcount-output</b>.
The demo will run for a few seconds and then, unlike typical stream processing applications, terminate automatically.

We can now inspect the output of the WordCount demo application by reading from its output topic:


<pre>
&gt; <b>bin/kafka-console-consumer.sh --zookeeper localhost:2181 \</b>
<b>--topic streams-wordcount-output \</b>
<b>--from-beginning \</b>
<b>--formatter kafka.tools.DefaultMessageFormatter \</b>
<b>--property print.key=true \</b>
<b>--property print.value=true \</b>
<b>--property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \</b>
<b>--property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer</b>
</pre>

with the following output data printed to the console (you can stop the console consumer with <b>Ctrl-C</b>):

<pre>
all 1
streams 1
lead 1
to 1
kafka 1
hello 1
kafka 2
streams 2
join 1
kafka 3
summit 1
<b>^C</b>
</pre>

Here, the first column is the Kafka message key and the second column is the message value, both in <b>java.lang.String</b> format.
Note that the output is actually a continuous stream of updates, where each record (i.e. each line above) is
an updated count for a single word (the record key, e.g. "kafka"). For multiple records with the same key, each later record is an update of the previous one: the three lines for "kafka" show its count growing from 1 to 2 to 3 as more input lines are processed.
35 changes: 35 additions & 0 deletions docs/streams.html
@@ -0,0 +1,35 @@
<!--~
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
~-->

<h3><a id="streams_overview" href="#streams_overview">9.1 Overview</a></h3>

Kafka Streams is a client library for processing and analyzing data stored in Kafka, and for writing the resulting data either back to Kafka or out to an external system. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management of application state.
Kafka Streams has a <b>low barrier to entry</b>: you can quickly write and run a small-scale proof-of-concept on a single machine, and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka’s parallelism model.
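
As a concrete (and deliberately minimal) sketch of that scale-out model: all it takes configuration-wise is a shared application id. The id and broker address below are illustrative values, while the key names come from <b>StreamsConfig</b>.

<pre>
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ScaleOutConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Every instance started with the same application.id joins the same
        // logical application: Kafka Streams divides the input topics' partitions
        // among the running instances, so starting the application on one more
        // machine scales processing out, and stopping an instance rebalances
        // its share of the work onto the remaining ones.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // ... build a topology and pass it together with these props to KafkaStreams ...
    }
}
</pre>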

Some highlights of Kafka Streams:

<ul>
<li>Designed as a <b>simple and lightweight client library</b>, which can be easily embedded in any Java application and integrated with any existing packaging, deployment and operational tools that users have for their streaming applications.</li>
<li>Has <b>no external dependencies on systems other than Apache Kafka itself</b> as the internal messaging layer; notably, it uses Kafka’s partitioning model to horizontally scale processing while maintaining strong ordering guarantees.</li>
<li>Supports <b>fault-tolerant local state</b>, which enables very fast and efficient stateful operations like joins and windowed aggregations.</li>
<li>Employs <b>one-record-at-a-time processing</b> to achieve low processing latency, and supports <b>event-time based windowing operations</b>.</li>
<li>Offers necessary stream processing primitives, along with a <b>high-level Streams DSL</b> and a <b>low-level Processor API</b> (a sketch of the Processor API follows this list).</li>

</ul>
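
To make the last point of this list concrete, here is a hedged sketch of the low-level Processor API (interface and builder names as in the 0.10.0 <b>org.apache.kafka.streams.processor</b> package; the processor logic and topic names are made up for illustration):

<pre>
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.TopologyBuilder;

// A trivial processor that upper-cases each record value and forwards it downstream.
public class UppercaseProcessor implements Processor&lt;String, String&gt; {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void process(String key, String value) {
        context.forward(key, value.toUpperCase());
    }

    @Override
    public void punctuate(long timestamp) {
        // No scheduled work in this example.
    }

    @Override
    public void close() {}

    // Wiring the processor between a source topic and a sink topic:
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.addSource("Source", "my-input-topic")
               .addProcessor("Process", UppercaseProcessor::new, "Source")
               .addSink("Sink", "my-output-topic", "Process");
        // The builder would then be passed to new KafkaStreams(builder, props).
    }
}
</pre>

Most applications would start with the Streams DSL and drop down to this API only where custom per-record logic is needed.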

<h3><a id="streams_user" href="#streams_user">8.2 User Guide</a></h3>