[FLINK-24164] Use site.DOCS_BASE_URL where possible
zentol committed Sep 6, 2021
1 parent 28b251d commit 3ca96931d8a09227764154f35866872e0f517bd0
Showing 77 changed files with 329 additions and 329 deletions.
@@ -13,11 +13,11 @@ See the release changelog [here](https://issues.apache.org/jira/secure/ReleaseNo

## Overview of major new features

**Flink Streaming:** The gem of the 0.7.0 release is undoubtedly Flink Streaming. Available currently in alpha, Flink Streaming provides a Java API on top of Apache Flink that can consume streaming data sources (e.g., from Apache Kafka or Apache Flume) and process them in real time. A dedicated blog post on Flink Streaming and its performance is coming soon. You can check out the Streaming programming guide [here](http://ci.apache.org/projects/flink/flink-docs-release-0.7/streaming_guide.html).
**Flink Streaming:** The gem of the 0.7.0 release is undoubtedly Flink Streaming. Available currently in alpha, Flink Streaming provides a Java API on top of Apache Flink that can consume streaming data sources (e.g., from Apache Kafka or Apache Flume) and process them in real time. A dedicated blog post on Flink Streaming and its performance is coming soon. You can check out the Streaming programming guide [here]({{site.DOCS_BASE_URL}}flink-docs-release-0.7/streaming_guide.html).

**New Scala API:** The Scala API has been completely rewritten. The Java and Scala APIs now have the same syntax and transformations and will be kept in sync in every future release. See the new Scala API [here](http://ci.apache.org/projects/flink/flink-docs-release-0.7/programming_guide.html).
**New Scala API:** The Scala API has been completely rewritten. The Java and Scala APIs now have the same syntax and transformations and will be kept in sync in every future release. See the new Scala API [here]({{site.DOCS_BASE_URL}}flink-docs-release-0.7/programming_guide.html).

**Logical key expressions:** You can now specify grouping and joining keys with logical names for member variables of POJO data types. For example, you can join two data sets as ``persons.join(cities).where("zip").equalTo("zipcode")``. Read more [here](http://ci.apache.org/projects/flink/flink-docs-release-0.7/programming_guide.html#specifying-keys).
**Logical key expressions:** You can now specify grouping and joining keys with logical names for member variables of POJO data types. For example, you can join two data sets as ``persons.join(cities).where("zip").equalTo("zipcode")``. Read more [here]({{site.DOCS_BASE_URL}}flink-docs-release-0.7/programming_guide.html#specifying-keys).
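
For illustration, here is a minimal sketch of such a key-expression join in the Java DataSet API; the `Person` and `City` POJOs and their contents are made up for the example:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ZipJoin {

    // Hypothetical POJOs; Flink POJOs need public fields (or getters/setters)
    // and a public no-argument constructor.
    public static class Person { public String name; public String zip; }
    public static class City { public String zipcode; public String city; }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Person p = new Person();
        p.name = "Alice";
        p.zip = "12345";
        City c = new City();
        c.zipcode = "12345";
        c.city = "Springfield";

        DataSet<Person> persons = env.fromElements(p);
        DataSet<City> cities = env.fromElements(c);

        // Keys are referenced by logical field name instead of key-selector functions.
        persons.join(cities).where("zip").equalTo("zipcode").print();
    }
}
```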

**Hadoop MapReduce compatibility:** You can run unmodified Hadoop Mappers and Reducers (mapred API) in Flink, use all Hadoop data types, and read data with all Hadoop InputFormats.

@@ -81,10 +81,10 @@ Hadoop functions can be used at any position within a Flink program and of cours

## What comes next?

While the Hadoop compatibility package is already very useful, we are currently working on a dedicated Hadoop Job operation to embed and execute Hadoop jobs as a whole in Flink programs, including their custom partitioning, sorting, and grouping code. With this feature, you will be able to chain multiple Hadoop jobs, mix them with Flink functions, and combine them with other operations such as [Spargel](http://ci.apache.org/projects/flink/flink-docs-release-0.7/spargel_guide.html) operations (Pregel/Giraph-style jobs).
While the Hadoop compatibility package is already very useful, we are currently working on a dedicated Hadoop Job operation to embed and execute Hadoop jobs as a whole in Flink programs, including their custom partitioning, sorting, and grouping code. With this feature, you will be able to chain multiple Hadoop jobs, mix them with Flink functions, and combine them with other operations such as [Spargel]({{site.DOCS_BASE_URL}}flink-docs-release-0.7/spargel_guide.html) operations (Pregel/Giraph-style jobs).

## Summary

Flink lets you reuse a lot of the code you wrote for Hadoop MapReduce, including all data types, all Input- and OutputFormats, and Mappers and Reducers of the mapred API. Hadoop functions can be used within Flink programs and mixed with all other Flink functions. Due to Flink’s pipelined execution, Hadoop functions can be arbitrarily assembled without data exchange via HDFS. Moreover, the Flink community is currently working on a dedicated Hadoop Job operation to support the execution of Hadoop jobs as a whole.

If you want to use Flink’s Hadoop compatibility package, check out our [documentation](https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/hadoop_compatibility.html).
If you want to use Flink’s Hadoop compatibility package, check out our [documentation]({{site.DOCS_BASE_URL}}flink-docs-master/apis/batch/hadoop_compatibility.html).
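
To make the compatibility package more concrete, here is a sketch of reading with an unmodified Hadoop `TextInputFormat` and running an unmodified mapred-API `Mapper`. It assumes the `HadoopInputFormat` and `HadoopMapFunction` wrappers from the flink-hadoop-compatibility module; the `Tokenizer` mapper and the input path are made up:

```java
import java.io.IOException;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class HadoopCompatExample {

    // An unmodified Hadoop (mapred-API) Mapper, as it might exist in a legacy job.
    public static final class Tokenizer
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\W+")) {
                out.collect(new Text(token), new LongWritable(1L));
            }
        }
        @Override public void configure(JobConf conf) {}
        @Override public void close() {}
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read input with the unmodified Hadoop TextInputFormat.
        HadoopInputFormat<LongWritable, Text> input = new HadoopInputFormat<>(
                new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
        TextInputFormat.addInputPath(input.getJobConf(), new Path("hdfs:///input"));

        DataSet<Tuple2<LongWritable, Text>> lines = env.createInput(input);

        // Run the Hadoop Mapper as a Flink FlatMapFunction.
        lines.flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(
                new Tokenizer())).print();
    }
}
```
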
@@ -16,11 +16,11 @@ We are pleased to announce the availability of Flink 0.8.0. This release include


- **Extended filesystem support**: The former `DistributedFileSystem` interface has been generalized to `HadoopFileSystem`, which now supports all subclasses of `org.apache.hadoop.fs.FileSystem`. This allows users to use all file systems supported by Hadoop with Apache Flink.
[See connecting to other systems](http://ci.apache.org/projects/flink/flink-docs-release-0.8/example_connectors.html)
[See connecting to other systems]({{site.DOCS_BASE_URL}}flink-docs-release-0.8/example_connectors.html)

- **Streaming Scala API**: As an alternative to the existing Java API, streaming programs can now also be written in Scala. The Java and Scala APIs now have the same syntax and transformations and will be kept in sync in every future release.

- **Streaming windowing semantics**: The new windowing API offers an expressive way to define custom logic for triggering the execution of a stream window and removing elements. Among other features, it includes out-of-the-box support for windows based on logical or physical time and on data-driven properties of the events themselves. [Read more here](http://ci.apache.org/projects/flink/flink-docs-release-0.8/streaming_guide.html#window-operators)
- **Streaming windowing semantics**: The new windowing API offers an expressive way to define custom logic for triggering the execution of a stream window and removing elements. Among other features, it includes out-of-the-box support for windows based on logical or physical time and on data-driven properties of the events themselves. [Read more here]({{site.DOCS_BASE_URL}}flink-docs-release-0.8/streaming_guide.html#window-operators)

- **Mutable and immutable objects in runtime**: All Flink versions before 0.8.0 passed the same objects to user-written functions. This is a common performance optimization, also used in other systems such as Hadoop.
However, this is error-prone for new users because one has to carefully check that references to the object aren’t kept in the user function. Starting from 0.8.0, Flink can be configured to disable that mechanism.
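
In later Flink versions this switch surfaced on the `ExecutionConfig`, with object reuse being opt-in; a minimal sketch under that assumption:

```java
import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Opt back in to object reuse for performance; user functions must then
// be careful not to keep references to the objects they receive.
env.getConfig().enableObjectReuse();
```
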
@@ -15,7 +15,7 @@ In this post, we go through an example that uses the Flink Streaming
API to compute statistics on stock market data that arrive
continuously and combine the stock market data with Twitter streams.
See the [Streaming Programming
Guide](http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html) for a
Guide]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming/index.html) for a
detailed presentation of the Streaming API.

First, we read a bunch of stock price streams and combine them into
@@ -115,11 +115,11 @@ public static void main(String[] args) throws Exception {
</div>

See
[here](http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html#data-sources)
[here]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming/index.html#data-sources)
for how you can create streaming sources for Flink Streaming
programs. Flink, of course, has support for reading in streams from
[external
sources](http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/connectors/index.html)
sources]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming/connectors/index.html)
such as Apache Kafka, Apache Flume, RabbitMQ, and others. For the sake
of this example, the data streams are simply generated using the
`generateStock` method:
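
The `generateStock` implementation itself is not part of this excerpt; a hypothetical stand-in source in the same spirit (the `StockPrice` POJO and the random-walk pricing are assumptions) could look like:

```java
import java.util.Random;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Hypothetical stand-in for the post's generateStock source: emits random
// prices for one stock symbol until the source is cancelled.
public class StockSource implements SourceFunction<StockSource.StockPrice> {

    // Minimal POJO standing in for the post's StockPrice type.
    public static class StockPrice {
        public String symbol;
        public double price;

        public StockPrice() {}

        public StockPrice(String symbol, double price) {
            this.symbol = symbol;
            this.price = price;
        }
    }

    private final String symbol;
    private volatile boolean running = true;

    public StockSource(String symbol) {
        this.symbol = symbol;
    }

    @Override
    public void run(SourceContext<StockPrice> ctx) throws Exception {
        Random rand = new Random();
        double price = 1000;
        while (running) {
            price = Math.max(0, price + rand.nextGaussian()); // random walk
            ctx.collect(new StockPrice(symbol, price));
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```
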
@@ -230,7 +230,7 @@ Window aggregations
---------------

We first compute aggregations on time-based windows of the
data. Flink provides [flexible windowing semantics](http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/windows.html) where windows can
data. Flink provides [flexible windowing semantics]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming/windows.html) where windows can
also be defined based on a count of records or any custom user-defined
logic.
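
As an illustrative sketch (using a later DataStream API shape, not the exact syntax of this post; `stockStream` and the `StockPrice` fields are assumptions), a per-symbol maximum over 10-second tumbling windows could look like:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Maximum stock price per symbol over 10-second tumbling windows.
DataStream<StockPrice> maxPrices = stockStream
        .keyBy(stock -> stock.symbol)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .maxBy("price");
```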

@@ -432,7 +432,7 @@ Combining with a Twitter stream

Next, we will read a Twitter stream and correlate it with our stock
price stream. Flink has support for connecting to [Twitter's
API](https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/connectors/twitter.html)
API]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming/connectors/twitter.html)
but for the sake of this example we generate dummy tweet data.

<img alt="Social media analytics" src="{{ site.baseurl }}/img/blog/blog_social_media.png" width="100%" class="img-responsive center-block">
@@ -666,7 +666,7 @@ public static final class WindowCorrelation
Other things to try
---------------

For a full feature overview please check the [Streaming Guide](http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html), which describes all the available API features.
For a full feature overview please check the [Streaming Guide]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming/index.html), which describes all the available API features.
You are very welcome to try out our features for different use cases; we are looking forward to your experiences. Feel free to [contact us](http://flink.apache.org/community.html#mailing-lists).

Upcoming for streaming
@@ -24,7 +24,7 @@ In this blog post, we cut through Apache Flink’s layered architecture and take

### How do I join with Flink?

Flink provides fluent APIs in Java and Scala to write data flow programs. Flink’s APIs are centered around parallel data collections called data sets. Data sets are processed by applying transformations that compute new data sets. Flink’s transformations include Map and Reduce as known from MapReduce [[1]](http://research.google.com/archive/mapreduce.html), but also operators for joining, co-grouping, and iterative processing. The documentation gives an overview of all available transformations [[2]](http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html).
Flink provides fluent APIs in Java and Scala to write data flow programs. Flink’s APIs are centered around parallel data collections called data sets. Data sets are processed by applying transformations that compute new data sets. Flink’s transformations include Map and Reduce as known from MapReduce [[1]](http://research.google.com/archive/mapreduce.html), but also operators for joining, co-grouping, and iterative processing. The documentation gives an overview of all available transformations [[2]]({{site.DOCS_BASE_URL}}flink-docs-release-0.8/dataset_transformations.html).

Joining two Scala case class data sets is very easy as the following example shows:

@@ -52,7 +52,7 @@ Flink’s APIs also allow to:
* select fields of pairs of joined Tuple elements (projection), and
* define composite join keys such as `.where("orderDate", "zipCode").equalTo("date", "zip")`.

See the documentation for more details on Flink’s join features [[3]](http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html#join).
See the documentation for more details on Flink’s join features [[3]]({{site.DOCS_BASE_URL}}flink-docs-release-0.8/dataset_transformations.html#join).


### How does Flink join my data?
@@ -122,7 +122,7 @@ The Hybrid-Hash-Join distinguishes its inputs as build-side and probe-side input

Ship and local strategies do not depend on each other and can be independently chosen. Therefore, Flink can execute a join of two data sets R and S in nine different ways by combining any of the three ship strategies (RR, BF with R being broadcasted, BF with S being broadcasted) with any of the three local strategies (SM, HH with R being build-side, HH with S being build-side). Each of these strategy combinations results in different execution performance depending on the data sizes and the available amount of working memory. In case of a small data set R and a much larger data set S, broadcasting R and using it as build-side input of a Hybrid-Hash-Join is usually a good choice because the much larger data set S is not shipped and not materialized (given that the hash table completely fits into memory). If both data sets are rather large or the join is performed on many parallel instances, repartitioning both inputs is a robust choice.

Flink features a cost-based optimizer which automatically chooses the execution strategies for all operators, including joins. Without going into the details of cost-based optimization, this is done by computing cost estimates for execution plans with different strategies and picking the plan with the least estimated costs. To do this, the optimizer estimates the amount of data that is shipped over the network and written to disk. If no reliable size estimates for the input data can be obtained, the optimizer falls back to robust default choices. A key feature of the optimizer is to reason about existing data properties. For example, if the data of one input is already partitioned in a suitable way, the generated candidate plans will not repartition this input. Hence, the choice of an RR ship strategy becomes more likely. The same applies for previously sorted data and the Sort-Merge-Join strategy. Flink programs can help the optimizer to reason about existing data properties by providing semantic information about user-defined functions [[4]](https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations). While the optimizer is a killer feature of Flink, it can happen that a user knows better than the optimizer how to execute a specific join. Similar to relational database systems, Flink offers optimizer hints to tell the optimizer which join strategies to pick [[5]](https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html#join-algorithm-hints).
Flink features a cost-based optimizer which automatically chooses the execution strategies for all operators, including joins. Without going into the details of cost-based optimization, this is done by computing cost estimates for execution plans with different strategies and picking the plan with the least estimated costs. To do this, the optimizer estimates the amount of data that is shipped over the network and written to disk. If no reliable size estimates for the input data can be obtained, the optimizer falls back to robust default choices. A key feature of the optimizer is to reason about existing data properties. For example, if the data of one input is already partitioned in a suitable way, the generated candidate plans will not repartition this input. Hence, the choice of an RR ship strategy becomes more likely. The same applies for previously sorted data and the Sort-Merge-Join strategy. Flink programs can help the optimizer to reason about existing data properties by providing semantic information about user-defined functions [[4]]({{site.DOCS_BASE_URL}}flink-docs-release-1.0/apis/batch/index.html#semantic-annotations). While the optimizer is a killer feature of Flink, it can happen that a user knows better than the optimizer how to execute a specific join. Similar to relational database systems, Flink offers optimizer hints to tell the optimizer which join strategies to pick [[5]]({{site.DOCS_BASE_URL}}flink-docs-release-1.0/apis/batch/dataset_transformations.html#join-algorithm-hints).
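
As a sketch of such a hint in the Java DataSet API (the data sets and key positions are made up for illustration):

```java
import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<Integer, String>> r = env.fromElements(Tuple2.of(1, "small"));
DataSet<Tuple2<Integer, String>> s =
        env.fromElements(Tuple2.of(1, "large"), Tuple2.of(2, "large"));

// Force the BF ship strategy with r broadcasted and a Hybrid-Hash-Join
// with r as the build side, overriding the optimizer's estimates.
DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> joined =
        r.join(s, JoinHint.BROADCAST_HASH_FIRST).where(0).equalTo(0);

joined.print();
```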

### How is Flink’s join performance?

@@ -171,7 +171,7 @@ We have seen that off-the-shelf distributed joins work really well in Flink. But
#### References

[1] [“MapReduce: Simplified data processing on large clusters”](http://research.google.com/archive/mapreduce.html), Dean, Ghemawat, 2004 <br>
[2] [Flink 0.8.1 documentation: Data Transformations](http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html) <br>
[3] [Flink 0.8.1 documentation: Joins](http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html#join) <br>
[4] [Flink 1.0 documentation: Semantic annotations](https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations) <br>
[5] [Flink 1.0 documentation: Optimizer join hints](https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html#join-algorithm-hints) <br>
[2] [Flink 0.8.1 documentation: Data Transformations]({{site.DOCS_BASE_URL}}flink-docs-release-0.8/dataset_transformations.html) <br>
[3] [Flink 0.8.1 documentation: Joins]({{site.DOCS_BASE_URL}}flink-docs-release-0.8/dataset_transformations.html#join) <br>
[4] [Flink 1.0 documentation: Semantic annotations]({{site.DOCS_BASE_URL}}flink-docs-release-1.0/apis/batch/index.html#semantic-annotations) <br>
[5] [Flink 1.0 documentation: Optimizer join hints]({{site.DOCS_BASE_URL}}flink-docs-release-1.0/apis/batch/dataset_transformations.html#join-algorithm-hints) <br>
@@ -15,7 +15,7 @@ release is a preview release that contains known issues.
You can download the release
[here](http://flink.apache.org/downloads.html#preview) and check out the
latest documentation
[here](http://ci.apache.org/projects/flink/flink-docs-master/). Feedback
[here]({{site.DOCS_BASE_URL}}flink-docs-master/). Feedback
through the Flink [mailing
lists](http://flink.apache.org/community.html#mailing-lists) is, as
always, very welcome!
@@ -45,7 +45,7 @@ for Flink programs. Tables are available for both static and streaming
data sources (DataSet and DataStream APIs).

Check out the Table guide for Java and Scala
[here](https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html).
[here]({{site.DOCS_BASE_URL}}flink-docs-master/apis/batch/libs/table.html).

### Gelly Graph Processing API

@@ -60,13 +60,13 @@ algorithms, including PageRank, SSSP, label propagation, and community
detection.

Gelly internally builds on top of Flink’s [delta
iterations](https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/iterations.html). Iterative
iterations]({{site.DOCS_BASE_URL}}flink-docs-master/apis/batch/iterations.html). Iterative
graph algorithms are executed leveraging mutable state, achieving
performance similar to that of specialized graph processing systems.

Gelly will eventually subsume Spargel, Flink’s Pregel-like API. Check
out the Gelly guide
[here](https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html).
[here]({{site.DOCS_BASE_URL}}flink-docs-master/apis/batch/libs/gelly.html).
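
A minimal graph-construction sketch (factory and algorithm signatures vary across Gelly versions, so treat this as illustrative):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// A tiny directed graph with Long vertex ids and Double edge weights.
DataSet<Edge<Long, Double>> edges = env.fromElements(
        new Edge<>(1L, 2L, 1.0),
        new Edge<>(2L, 3L, 1.0));

Graph<Long, Double, Double> graph = Graph.fromDataSet(
        edges,
        new MapFunction<Long, Double>() {
            @Override
            public Double map(Long vertexId) { return 1.0; } // initial vertex value
        },
        env);

// Library algorithms (PageRank, SSSP, label propagation, ...) run via graph.run(...).
System.out.println(graph.numberOfVertices());
```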

### Flink Machine Learning Library

@@ -112,7 +112,7 @@ algorithms, Tez focuses on scalability and elastic resource usage in
shared YARN clusters.

Get started with Flink on Tez
[here](http://ci.apache.org/projects/flink/flink-docs-master/setup/flink_on_tez.html).
[here]({{site.DOCS_BASE_URL}}flink-docs-master/setup/flink_on_tez.html).

### Reworked Distributed Runtime on Akka

@@ -135,7 +135,7 @@ system is internally tracking the Kafka offsets to ensure that Flink
can pick up data from Kafka where it left off in case of a failure.

Read
[here](http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#apache-kafka)
[here]({{site.DOCS_BASE_URL}}flink-docs-master/apis/streaming_guide.html#apache-kafka)
on how to use the persistent Kafka source.
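
Connector class names have changed considerably since this release; as a rough sketch using the later `FlinkKafkaConsumer` API (an assumption here, not the exact class from those docs):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // Kafka offsets are stored in Flink's checkpoints

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-flink-job");

// After a failure the job restarts from the last checkpoint and resumes
// reading from the checkpointed Kafka offsets.
DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props));
```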

### Improved YARN support
@@ -152,7 +152,7 @@ integrators to easily control Flink on YARN within their Hadoop 2
cluster.

See the YARN docs
[here](http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html).
[here]({{site.DOCS_BASE_URL}}flink-docs-master/setup/yarn_setup.html).

## More Improvements and Fixes
