
Conversation

@rmoff rmoff (Contributor) commented Jan 16, 2026

This is modelled on the existing Spark quickstart.

It uses a Docker Compose file and a Dockerfile; I've put these under /flink/v2.0/quickstart in the repo, but I'm not sure if that's the right location :)

@MartijnVisser MartijnVisser left a comment

Thanks for the PR. I've checked the Flink input and left some minor nits, but overall +1.

```diff
@@ -0,0 +1,51 @@
# - Licensed to the Apache Software Foundation (ASF) under one or more
```
Contributor

The Spark quickstart contains the YAML file in the documentation page itself.

Is there a specific reason we decided to do otherwise?

Contributor Author

Yes, because there's a Dockerfile too, and that's a lot of code to put in a docs page when it could just be linked to :)

Contributor

My fear is that it will be missed, and we will forget to update it after a new release.

The fewer hops we have, the less likely we are to make these kinds of mistakes.

Unless we release a Docker image too.

@pvary pvary (Contributor) commented Jan 19, 2026

Is `flink/quickstart/overview.excalidraw.svg` used somewhere?

@pvary pvary (Contributor) commented Jan 19, 2026

@mxm, @Guosmilesmile: Could you please review?

Co-authored-by: pvary <peter.vary.apache@gmail.com>
@rmoff rmoff (Contributor, Author) commented Jan 19, 2026

> Is `flink/quickstart/overview.excalidraw.svg` used somewhere?

I've added it into the doc as a reference image: 68b53bc

- Change to use flink 2.1 (apache#15062 (comment))
- Updated version variables for easier upgrades (apache#15062 (comment))
@rmoff rmoff marked this pull request as draft January 19, 2026 11:35
@rmoff rmoff marked this pull request as ready for review January 19, 2026 12:14
@rmoff rmoff (Contributor, Author) commented Jan 19, 2026

SeaweedFS is S3-compatible local storage that I was using in place of MinIO, which has been moved to maintenance mode.
I investigated different options, and SeaweedFS seemed like a good one, which is why I was using it here.
On Slack, @pvary advised using MinIO for now and opening the question on the dev list, since we need to find an alternative: "We use MinIO in many other places, and we will fix it in one batch."

@rmoff rmoff requested a review from pvary January 19, 2026 12:44
```yaml
# - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# - See the License for the specific language governing permissions and
# - limitations under the License.
services:
```
Contributor

I think it would be good to ask the community on the dev list about creating an official Flink quickstart Docker image.

We already have `/docker/iceberg-rest-fixture`, which is released by `.github/workflows/publish-iceberg-rest-fixture-docker.yml`.

Maybe we could have the same for Flink and Spark there.

If the community is not interested in having that there, I'm OK with adding this to Flink.

Contributor Author

Yes, I will start this discussion, and would be happy to contribute a PR for it too.
#15114

@mxm mxm (Contributor) left a comment

Thank you @rmoff for the PR! This is a great addition.

```diff
@@ -1,5 +1,5 @@
 ---
-title: "Flink Getting Started"
+title: "Getting Started"
```
Contributor

Should we keep the Flink context?

@rmoff rmoff (Contributor, Author) commented Jan 22, 2026

I was modelling this on what we've got for Spark.


Maybe I should address the LHN links in a separate PR, since by this logic the other Flink pages also oughtn't have the prefix (or the Spark ones should).

Let's create a table using `iceberg_catalog.nyc.taxis` where `iceberg_catalog` is the catalog name, `nyc` is the database name, and `taxis` is the table name.

```sql
CREATE TABLE iceberg_catalog.nyc.taxis
```
Contributor

I'm curious, why are we fully-qualifying the table name here when we set the default catalog and database name above?

Contributor Author

Good point. Copy-pasta from the Spark quickstart.

Fixed in bdd5763.

Comment on lines 77 to 93
Then make this the active catalog in your Flink SQL session:

```sql
USE CATALOG iceberg_catalog;
```

Create a database in the catalog:

```sql
CREATE DATABASE IF NOT EXISTS nyc;
```

and set it as active:

```sql
USE nyc;
```
Contributor

For brevity and to avoid confusion, I would remove changing the default catalog / database and continue to use fully-qualified table names (like below).

Contributor Author

Fixed in bdd5763.
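For illustration, the fully-qualified style looks roughly like this; a minimal sketch assuming the `iceberg_catalog` defined earlier, with an illustrative column list rather than the quickstart's actual schema:

```sql
-- No USE CATALOG / USE needed: every object name carries the catalog and database.
CREATE DATABASE IF NOT EXISTS iceberg_catalog.nyc;

-- Columns here are illustrative only.
CREATE TABLE IF NOT EXISTS iceberg_catalog.nyc.taxis (
  trip_id     BIGINT,
  fare_amount DOUBLE
);

SELECT * FROM iceberg_catalog.nyc.taxis;
```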

Comment on lines 148 to 152
First, switch to the default catalog (otherwise the table would be created using the Iceberg details that we configured in the catalog definition above):

```sql
USE CATALOG default_catalog;
```
Contributor

I would prefer to avoid changing the default catalog because that would make these examples easier to read.

Contributor Author

Fixed in bdd5763.

Comment on lines 43 to 54
```dockerfile
RUN echo "-> Install JARs: Hadoop" && \
    mkdir -p ./lib/hadoop && pushd $_ && \
    curl https://repo1.maven.org/maven2/org/apache/commons/commons-configuration2/2.1.1/commons-configuration2-2.1.1.jar -O && \
    curl https://repo1.maven.org/maven2/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/${HADOOP_VERSION}/hadoop-auth-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/${HADOOP_VERSION}/hadoop-common-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar -O && \
    curl https://repo1.maven.org/maven2/org/codehaus/woodstox/stax2-api/4.2.1/stax2-api-4.2.1.jar -O && \
    curl https://repo1.maven.org/maven2/com/fasterxml/woodstox/woodstox-core/5.3.0/woodstox-core-5.3.0.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/${HADOOP_VERSION}/hadoop-hdfs-client-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/${HADOOP_VERSION}/hadoop-mapreduce-client-core-${HADOOP_VERSION}.jar -O && \
    popd
```
Contributor

Do we have to use Hadoop in 2026? :)

Contributor

We could use S3 without Hadoop.

Contributor Author

That's the dream, right? ;)

```
Flink SQL> CREATE CATALOG iceberg_catalog WITH (
>   'type'                 = 'iceberg',
>   'catalog-impl'         = 'org.apache.iceberg.rest.RESTCatalog',
>   'uri'                  = 'http://iceberg-rest:8181',
>   'warehouse'            = 's3://warehouse/',
>   'io-impl'              = 'org.apache.iceberg.aws.s3.S3FileIO',
>   's3.endpoint'          = 'http://minio:9000',
>   's3.access-key-id'     = 'admin',
>   's3.secret-access-key' = 'password',
>   's3.path-style-access' = 'true'
> );
[ERROR] Could not execute SQL statement. Reason:
java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration

Flink SQL>
```

Contributor

Yes, that would be the dream 😄 I just checked the code, and yes, there is a dependency on at least Hadoop's Configuration, even with a custom catalog / IO. I think it should suffice to include only hadoop-common. We can remove all the HDFS, Guava, etc.

Would you mind giving that a try?

Contributor Author

I have iterated over them before, but let me try again and log the details. Stand by…

Contributor Author

OK, managed to strip three out (e0f619e):

* commons-logging
* hadoop-aws (S3 handled by iceberg-aws-bundle)
* flink-s3-fs-hadoop (S3FileIO used instead)

The others are needed, though:

| JAR name | Error point | Error |
| --- | --- | --- |
| commons-configuration2-2.1.1.jar | jobmanager startup | `java.lang.NoClassDefFoundError: org/apache/commons/configuration2/Configuration` |
| hadoop-auth-${HADOOP_VERSION}.jar | jobmanager startup | `java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName` |
| hadoop-common-${HADOOP_VERSION}.jar | CREATE CATALOG | `java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration` |
| hadoop-shaded-guava-1.1.1.jar | CREATE CATALOG | `java.lang.ClassNotFoundException: org.apache.hadoop.thirdparty.com.google.common` |
| stax2-api-4.2.1.jar | CREATE CATALOG | `java.lang.ClassNotFoundException: org.codehaus.stax2.XMLInputFactory2` |
| woodstox-core-5.3.0.jar | CREATE CATALOG | `java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper` |
| hadoop-hdfs-client-${HADOOP_VERSION}.jar | CREATE CATALOG | `java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.HdfsConfiguration` |
| hadoop-mapreduce-client-core-${HADOOP_VERSION}.jar | SELECT | `java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.input.FileInputFormat` |
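Putting the table together, the trimmed `RUN` step would look roughly like this; a sketch following the pattern of the snippet quoted above, which only drops `commons-logging` from that block (hadoop-aws and flink-s3-fs-hadoop were presumably installed in other steps of the Dockerfile):

```dockerfile
# Sketch: the Hadoop JAR install step with commons-logging removed,
# keeping only the JARs the table above shows are still required.
# Assumes HADOOP_VERSION is defined earlier in the Dockerfile, as in the original.
RUN echo "-> Install JARs: Hadoop" && \
    mkdir -p ./lib/hadoop && pushd $_ && \
    curl https://repo1.maven.org/maven2/org/apache/commons/commons-configuration2/2.1.1/commons-configuration2-2.1.1.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/${HADOOP_VERSION}/hadoop-auth-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/${HADOOP_VERSION}/hadoop-common-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar -O && \
    curl https://repo1.maven.org/maven2/org/codehaus/woodstox/stax2-api/4.2.1/stax2-api-4.2.1.jar -O && \
    curl https://repo1.maven.org/maven2/com/fasterxml/woodstox/woodstox-core/5.3.0/woodstox-core-5.3.0.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/${HADOOP_VERSION}/hadoop-hdfs-client-${HADOOP_VERSION}.jar -O && \
    curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/${HADOOP_VERSION}/hadoop-mapreduce-client-core-${HADOOP_VERSION}.jar -O && \
    popd
```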

Contributor

Thanks for checking! The outcome is a bit sad; we have some cleanup to do.

Comment on lines +15 to +22
```yaml
services:
  jobmanager:
    build:
      context: .
      dockerfile: Dockerfile.flink
    hostname: jobmanager
    container_name: jobmanager
    depends_on:
```
Contributor

Kubernetes seems to be a more typical setup in my experience, even for local testing, e.g. via Minikube.

Contributor Author

I think everyone who does k8s does Docker, but not everyone who does Docker does k8s… so for the sake of making it accessible to as many people as possible, I'd suggest we stick with Docker.

Contributor

Could we link this file from the docs page or remove it?

@rmoff rmoff (Contributor, Author) commented Jan 22, 2026

I've brought it into the page itself: 68b53bc


Once you have those, save these two files into a new folder:

* [`docker-compose.yml`](https://raw.githubusercontent.com/apache/iceberg/refs/heads/main/flink/v2.0/quickstart/docker-compose.yml)
Contributor

I haven't tried this myself, so I'd like to double-check whether a v2.0 path will actually be created in this case.

* MinIO (local S3 storage)
* AWS CLI (to create the S3 bucket)

* [`Dockerfile.flink`](https://raw.githubusercontent.com/apache/iceberg/refs/heads/main/flink/v2.0/quickstart/Dockerfile.flink) - base Flink image, plus some required JARs for S3 and Iceberg.
Contributor

The same as above.
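For anyone following along, a minimal sketch of fetching the two files and bringing the stack up, using the URLs listed in the doc excerpt above (assumes Docker Compose v2 and a hypothetical folder name):

```bash
# Create a working folder and download the two quickstart files.
mkdir iceberg-flink-quickstart && cd iceberg-flink-quickstart
curl -LO https://raw.githubusercontent.com/apache/iceberg/refs/heads/main/flink/v2.0/quickstart/docker-compose.yml
curl -LO https://raw.githubusercontent.com/apache/iceberg/refs/heads/main/flink/v2.0/quickstart/Dockerfile.flink

# Build the Flink image and start all services in the background.
docker compose up --build -d
```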

Comment on lines 150 to 152
```sql
USE CATALOG default_catalog;
```
Contributor

I also think it would be better to avoid changing the default catalog.

Contributor Author

Fixed in bdd5763.

@rmoff rmoff requested a review from mxm January 22, 2026 18:22
@mxm mxm (Contributor) left a comment

Thanks for the update @rmoff! I wonder if we can reduce the Hadoop dependencies (see https://github.com/apache/iceberg/pull/15062/files#r2720565630).

@rmoff rmoff requested a review from mxm January 23, 2026 11:51
@mxm mxm (Contributor) left a comment

LGTM. Thank you @rmoff!

Removed an unnecessary blank line in the documentation.
@rmoff rmoff requested a review from pvary January 23, 2026 12:54
@pvary pvary (Contributor) commented Jan 23, 2026

Let’s see where we land on the image location.
Directly providing the Docker image would help lower the entry barrier.
