Add Flink quickstart #15062
Conversation
…pl will commonly look for it
MartijnVisser left a comment:
Thnx for the PR, I've checked the Flink input and left some minor nits, but overall +1
```diff
@@ -0,0 +1,51 @@
+# - Licensed to the Apache Software Foundation (ASF) under one or more
```
The Spark quickstart includes the YAML file in the documentation page itself.
Is there a specific reason we decided to do otherwise?
Yes, because there's a Dockerfile too, and that's a lot of code to put in a docs page when it could just be linked to :)
My fear is that it will be missed, and we will forget to update it after a new release.
The fewer hops we have, the less likely we are to make these kinds of mistakes.
Unless we release a Docker image too.
Is the `flink/quickstart/overview.excalidraw.svg` used somewhere?
@mxm, @Guosmilesmile: Could you please review?
Co-authored-by: pvary <peter.vary.apache@gmail.com>
I've added it to the doc as a reference image: 68b53bc
- Change to use flink 2.1 (apache#15062 (comment))
- Updated version variables for easier upgrades (apache#15062 (comment))
…sion variable handling
SeaweedFS is S3-compatible local storage that I was using in place of MinIO, which has been moved to maintenance mode.
```yaml
# - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# - See the License for the specific language governing permissions and
# - limitations under the License.
services:
```
I think it would be good to ask the community on the dev list about creating an official Flink quickstart Docker image.
We already have a `/docker/iceberg-rest-fixture` which is released by `.github/workflows/publish-iceberg-rest-fixture-docker.yml`.
Maybe we could have the same for Flink and Spark there.
If the community is not interested in having that there, I'm OK with adding this to Flink.
Yes, I will start this discussion, and would be happy to contribute a PR for it too.
#15114
mxm left a comment:
Thank you @rmoff for the PR! This is a great addition.
```diff
@@ -1,5 +1,5 @@
 ---
-title: "Flink Getting Started"
+title: "Getting Started"
```
Should we keep the Flink context?
Let's create a table using `iceberg_catalog.nyc.taxis` where `iceberg_catalog` is the catalog name, `nyc` is the database name, and `taxis` is the table name.

```sql
CREATE TABLE iceberg_catalog.nyc.taxis
```
I'm curious, why are we fully-qualifying the table name here when we set the default catalog and database name above?
Good point. Copy pasta from the Spark quickstart.
Fixed: bdd5763
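(For illustration only, since the actual schema lives in the quickstart doc itself: the fixed form names the catalog and database inline, along the lines of this sketch, where the columns are invented placeholders.)

```sql
-- Hedged sketch: column names/types are placeholders, not the quickstart's real schema.
-- Fully qualifying catalog.database.table means no prior USE CATALOG / USE is needed.
CREATE TABLE iceberg_catalog.nyc.taxis (
    vendor_id BIGINT,
    trip_distance DOUBLE
);
```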
site/docs/flink-quickstart.md (Outdated)
Then make this the active catalog in your Flink SQL session:

```sql
USE CATALOG iceberg_catalog;
```

Create a database in the catalog:

```sql
CREATE DATABASE IF NOT EXISTS nyc;
```

and set it as active:

```sql
USE nyc;
```
For brevity and to avoid confusion, I would remove changing the default catalog / database and continue to use fully-qualified table names (like below).
Fixed bdd5763
site/docs/flink-quickstart.md (Outdated)
First, switch to the default catalog (otherwise the table would be created using the Iceberg details that we configured in the catalog definition above):

```sql
USE CATALOG default_catalog;
```
I would prefer to avoid changing the default catalog because that would make these examples easier to read.
Fixed bdd5763
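(As an aside, the same fully-qualified style carries over to reads and writes while staying in `default_catalog`. A minimal sketch, using the invented placeholder columns from the earlier sketch:)

```sql
-- Hedged sketch: the rows are invented, for illustration only.
INSERT INTO iceberg_catalog.nyc.taxis VALUES (1, 3.5), (2, 1.2);

SELECT * FROM iceberg_catalog.nyc.taxis;
```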
```dockerfile
RUN echo "-> Install JARs: Hadoop" && \
    mkdir -p ./lib/hadoop && pushd $_ && \
    curl https://repo1.maven.org/maven2/org/apache/commons/commons-configuration2/2.1.1/commons-configuration2-2.1.1.jar -O && \
    curl https://repo1.maven.org/maven2/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/${HADOOP_VERSION}/hadoop-auth-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/${HADOOP_VERSION}/hadoop-common-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar -O && \
    curl https://repo1.maven.org/maven2/org/codehaus/woodstox/stax2-api/4.2.1/stax2-api-4.2.1.jar -O && \
    curl https://repo1.maven.org/maven2/com/fasterxml/woodstox/woodstox-core/5.3.0/woodstox-core-5.3.0.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/${HADOOP_VERSION}/hadoop-hdfs-client-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/${HADOOP_VERSION}/hadoop-mapreduce-client-core-${HADOOP_VERSION}.jar -O && \
    popd
```
Do we have to use Hadoop in 2026? :)
We could use S3 without Hadoop.
that's the dream, right? ;)
```
Flink SQL> CREATE CATALOG iceberg_catalog WITH (
>   'type' = 'iceberg',
>   'catalog-impl' = 'org.apache.iceberg.rest.RESTCatalog',
>   'uri' = 'http://iceberg-rest:8181',
>   'warehouse' = 's3://warehouse/',
>   'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
>   's3.endpoint' = 'http://minio:9000',
>   's3.access-key-id' = 'admin',
>   's3.secret-access-key' = 'password',
>   's3.path-style-access' = 'true'
> );
[ERROR] Could not execute SQL statement. Reason:
java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration

Flink SQL>
```
Yes, that would be the dream 😄 I just checked the code, and yes, there is a dependency on at least Hadoop's Configuration, even with a custom catalog / IO. I think it should suffice to only include hadoop-common. We can remove all the HDFS, Guava, etc.
Would you mind giving that a try?
I have iterated over them before, but let me try again and log the details. Stand by…
OK, managed to strip three out (e0f619e):
- commons-logging
- hadoop-aws (S3 handled by iceberg-aws-bundle)
- flink-s3-fs-hadoop (S3FileIO used instead)
The others are needed though:
| JAR Name | Error point | Error |
|---|---|---|
| commons-configuration2-2.1.1.jar | jobmanager startup | java.lang.NoClassDefFoundError: org/apache/commons/configuration2/Configuration |
| hadoop-auth-${HADOOP_VERSION}.jar | jobmanager startup | java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName |
| hadoop-common-${HADOOP_VERSION}.jar | CREATE CATALOG | java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration |
| hadoop-shaded-guava-1.1.1.jar | CREATE CATALOG | java.lang.ClassNotFoundException: org.apache.hadoop.thirdparty.com.google.common |
| stax2-api-4.2.1.jar | CREATE CATALOG | java.lang.ClassNotFoundException: org.codehaus.stax2.XMLInputFactory2 |
| woodstox-core-5.3.0.jar | CREATE CATALOG | java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper |
| hadoop-hdfs-client-${HADOOP_VERSION}.jar | CREATE CATALOG | java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.HdfsConfiguration |
| hadoop-mapreduce-client-core-${HADOOP_VERSION}.jar | SELECT | java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.input.FileInputFormat |
Thanks for checking! The outcome is a bit sad; we have some cleanup to do.
```yaml
services:
  jobmanager:
    build:
      context: .
      dockerfile: Dockerfile.flink
    hostname: jobmanager
    container_name: jobmanager
    depends_on:
```
Kubernetes seems to be a more typical setup from my experience, even for local testing, e.g. via Minikube.
I think everyone who does k8s does Docker, but not everyone who does Docker does k8s… so for the sake of making it accessible to as many people as possible, I'd suggest we stick with Docker.
Could we link this file from the docs page or remove it?
I've brought it into the page itself: 68b53bc
Once you have those, save these two files into a new folder:

* [`docker-compose.yml`](https://raw.githubusercontent.com/apache/iceberg/refs/heads/main/flink/v2.0/quickstart/docker-compose.yml)
I haven’t tried this myself, so I’d like to double‑check whether a v2.0 path will actually be created in this case.
* MinIO (local S3 storage)
* AWS CLI (to create the S3 bucket)

* [`Dockerfile.flink`](https://raw.githubusercontent.com/apache/iceberg/refs/heads/main/flink/v2.0/quickstart/Dockerfile.flink) - base Flink image, plus some required JARs for S3 and Iceberg.
Same as above.
site/docs/flink-quickstart.md (Outdated)
```sql
USE CATALOG default_catalog;
```
I also think it would be better to avoid changing the default catalog.
Fixed bdd5763
mxm left a comment:
Thanks for the update @rmoff! I wonder if we can reduce the Hadoop dependencies (see https://github.com/apache/iceberg/pull/15062/files#r2720565630).
mxm left a comment:
LGTM. Thank you @rmoff!
Removed an unnecessary blank line in the documentation.
Let's see where we land on the image location.

This is modelled on the existing Spark quickstart.
It uses a Docker Compose file and a Dockerfile; I've put these under
`/flink/v2.0/quickstart` in the repo, but not sure if that's the right location :)