
[FLINK-11893] [Project Website] Update ecosystem to encourage sharing of the connectors #187

Open. Wants to merge 14 commits into base: asf-site.

Conversation

becketqin commented Mar 13, 2019

Based on the discussion on the mailing list, we would like to encourage connector authors to share their connectors with the community, even if the connectors are not in the Flink repo.

This ticket updates the ecosystem page on the website to reflect the change. It also adds the ecosystem page back to the navigation bar.

fhueske (Contributor) commented Mar 13, 2019

Hi @becketqin, thanks for the PR!

Can you remove all changes generated by the build script?
That will make the PR easier to review.

Thank you, Fabian

<tr>
<td><a href="https://github.com/apache/bahir-flink" target="_blank">Redis, Flume, and ActiveMQ (via Apache Bahir)</a></td>
<td>sink</td>
<td>Flink Repo</td>

rmetzger (Contributor), Mar 13, 2019:

This needs to be "Bahir Repository"

<td>Apache Flink</td>
</tr>
<tr>
<td><a href="https://github.com/apache/bahir-flink" target="_blank">Redis, Flume, and ActiveMQ (via Apache Bahir)</a></td>

rmetzger (Contributor), Mar 13, 2019:

I think it would be nicer if Redis, Flume, and ActiveMQ each got their own row in the table.
There are now more connectors available in Apache Bahir: https://github.com/apache/bahir-flink

<td>1.7.x</td>
<td>Apache Flink</td>
</tr>
</table>
To run an application using one of these connectors, additional third party
components are usually required to be installed and launched, e.g., the servers

rmetzger (Contributor), Mar 13, 2019:

Can you add "https://github.com/TouK/nussknacker/" to the list of 3rd party projects?

rmetzger (Contributor) commented Mar 13, 2019

Thank you @becketqin ! I left some comments. Once these are resolved, I'm +1 to merge.

fhueske (Contributor) left a comment

I've added a few comments as well.
Thanks, Fabian

@@ -63,6 +63,9 @@
</ul>
</li>

<!-- Ecosystem -->
<li{% if page.url contains '/ecosystem.html' %} class="active"{% endif %}><a href="{{ baseurl_i18n }}/ecosystem.html">{{ site.data.i18n[page.language].ecosystem }}</a></li>

fhueske (Contributor), Mar 13, 2019:

I would move the Ecosystem link below "Getting Help".
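A sketch of what that reordering might look like in the nav template (the "Getting Help" entry markup and its i18n key are hypothetical here; only the Ecosystem `<li>` is taken from this PR):

```
<!-- Getting Help -->
<li><a href="{{ baseurl_i18n }}/gettinghelp.html">{{ site.data.i18n[page.language].gettinghelp }}</a></li>

<!-- Ecosystem: moved below "Getting Help" as suggested -->
<li{% if page.url contains '/ecosystem.html' %} class="active"{% endif %}><a href="{{ baseurl_i18n }}/ecosystem.html">{{ site.data.i18n[page.language].ecosystem }}</a></li>
```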

<li><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/cassandra.html" target="_blank">Apache Cassandra</a> (sink)</li>
<li><a href="https://github.com/apache/bahir-flink" target="_blank">Redis, Flume, and ActiveMQ (via Apache Bahir)</a> (sink)</li>
</ul>
<p>

fhueske (Contributor), Mar 13, 2019:

I think it would be good to make a pass over the third-party projects and remove all that are not in reasonable shape. For example, the Cascading integration should be removed, and there are probably others that are not maintained. Linking them does not provide any benefit and might even give the impression that the page is not maintained.

becketqin (Author), Mar 14, 2019:

That's a good idea. I made a quick pass on the third-party project list and removed the following projects:

Cascading: per your request.
BigPetStore: It does not have Flink support. We can add it back once Flink support is available.
FastR Flink: There is only one commit, back in 2015.
Apache SAMOA: Last commit was in Oct 2017.
Flink HTM: Last commit was in Sep 2017.
TINK: No real commit since Sep 2017.

I also moved the two example projects for Python and Clojure to a separate "examples in non-Java languages" section. Technically speaking, these should be part of the Flink tutorials rather than ecosystem projects.

Do these changes look reasonable to you?

becketqin (Author) commented Mar 14, 2019

@fhueske @rmetzger Thanks a lot for the review. I updated the PR based on your comments and removed the changes to the HTML files generated by the build script. I will do a rebuild and add them back before the PR is merged.

rmetzger (Contributor) commented Mar 14, 2019

+1 to merge from my point of view

fhueske (Contributor) left a comment

Thanks for the update @becketqin!

I made another pass and left a few comments.
Another issue we should consider is how to keep this list up to date when a new Flink version is released. For the Flink connectors, we could replace the version string with a version variable so that it is automatically incremented when a new version is released. For the third-party connectors, we may need to do this manually.

Best, Fabian
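Fabian's version-variable idea could look like this in the table source (a sketch; `site.stable` is assumed to hold the latest release version, matching the `{{site.stable}}.x` cells appearing elsewhere in this PR):

```
<!-- Hard-coded version string, needs a manual edit on every release: -->
<td>1.7.x</td>

<!-- Version variable, automatically tracks the latest release: -->
<td>{{ site.stable }}.x</td>
```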

<td>Apache Flink</td>
</tr>
<tr>
<td><a href="{{site.docs-stable}}/dev/connectors/filesystem_sink.html" target="_blank">HDFS</a></td>

fhueske (Contributor), Mar 14, 2019:

The filesystem sink does not only work for HDFS; it works with all file systems supported by Flink, such as S3 (see https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/filesystems.html for a complete list).

fhueske (Contributor), Mar 14, 2019:

There are also many different encodings for files, like CSV, Parquet, ORC, SequenceFiles, etc. Some are only supported by DataSet (or DataStream), some only for sources or sinks.
I am not sure how to include this information in the table, but leaving it out makes it much less useful.

Maybe it makes sense to add a dedicated section about file system connectors / formats to the page.

becketqin (Author), Mar 14, 2019:

Thanks for pointing out that the filesystem sink supports different file systems. I am thinking that we could put some popular systems such as S3 and HDFS in the list, and add an "Other filesystems" bullet linking to the FileSystem page.

I am not sure it is necessary to mention the encoding/format in this list; that seems specific to each system. It is a good idea to have a section describing them, but is the FileSystem connector page a more suitable place for that information?

fhueske (Contributor), Mar 14, 2019:

We could replace HDFS by File Systems (HDFS, S3, and others) and make others a link.

Regarding the encodings, I think we should list them in a separate section. Putting them here would clutter the connector list, but I think it is important to show that Flink supports different encodings like Parquet and ORC. Having a file system connector does not help a lot if the encoding that a user needs is not supported.
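A sketch of the combined row Fabian proposes (link targets are assumed from URLs already mentioned in this thread, and the cell values mirror the existing rows; this is not the final markup):

```
<tr>
  <td>
    File Systems:
    <a href="{{site.docs-stable}}/dev/connectors/filesystem_sink.html" target="_blank">HDFS, S3</a>
    (<a href="{{site.docs-stable}}/ops/filesystems.html" target="_blank">and others</a>)
  </td>
  <td>sink</td>
  <td>Flink Repo</td>
  <td>Apache Flink</td>
</tr>
```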

becketqin (Author), Mar 19, 2019:

It is absolutely necessary to mention the supported encodings. But this seems specific to particular file systems rather than applicable to all connectors. So would it be better to have the encoding information on the individual pages of those specific systems?

fhueske (Contributor), May 2, 2019:

The supported encodings are independent of the file system. However, they depend on the connector implementation (Flink API). For example, we support ORC as source format for DataSet and Table API (and probably also DataStream via continuous file source) but not as sink format.

becketqin (Author), May 6, 2019:

Should we only put information applicable to all connectors in this table? For information specific to particular systems, it seems better to leave it on the individual connector pages.


<table class="table table-bordered">
<tr>
<th>System Name</th>

fhueske (Contributor), Mar 14, 2019:

I think we also need to distinguish the different APIs: DataSet, DataStream, Table/SQL.

becketqin (Author), Mar 14, 2019:

Yeah, it is a little unfortunate that Flink currently has different connectors depending on the component. For now it might make sense to add a column specifying availability per API. Hopefully we can unify the connectors of the different APIs in the near future.

fhueske (Contributor), Mar 14, 2019:

Yes, that would be great!

<td>Apache Flink</td>
</tr>
<tr>
<td><a href="{{site.docs-stable}}/dev/connectors/rabbitmq.html" target="_blank">RabbitMQ</a></td>

fhueske (Contributor), Mar 14, 2019:

Should we group them by system type (message queue/event log (Kafka, Kinesis, RabbitMQ, ...), file system, database (JDBC, Cassandra, Redis, ...), API (Twitter) etc.), provider (Flink, Bahir, other Apache community, external), or (estimated) popularity? I think grouping them by system type makes most sense but maybe somebody has a better idea.

becketqin (Author), Mar 14, 2019:

Good point. Grouping by system type sounds reasonable to me.

<td>05/24/2017</td>
<td>1.7.x</td>
<td>Apache Bahir</td>
</tr>

fhueske (Contributor), Mar 14, 2019:

Apache Pulsar features Flink connectors (https://github.com/apache/pulsar/tree/master/pulsar-flink/src/main/java/org/apache/flink). Would be great to have it listed here as well.

fhueske (Contributor), Mar 14, 2019:

It can of course be added later as well.

becketqin (Author), Mar 14, 2019:

How about inviting Pulsar folks to add their connector to the list after we update the page?

fhueske (Contributor), Mar 14, 2019:

Yes, that would be great

becketqin (Author) left a comment

@fhueske Thanks for the comments. I replied with my thoughts. Please let me know what you think.



becketqin (Author) left a comment

@fhueske I just updated the patch. One thing to mention: it looks like we do not have a JDBC connector page, so I left the JDBC connector without a link. I'll create a follow-up ticket to add a JDBC page and then link it from here.

becketqin (Author) commented Mar 27, 2019

@fhueske Ping. Do you have time to take another look?

rmetzger (Contributor) commented Apr 29, 2019

I'd like to get this in asap, otherwise, the real community packages website will be done before this is merged :)

becketqin (Author) commented Apr 29, 2019

@rmetzger Yeah... I agree. @fhueske could you take a look? Maybe we can check in the current update and make further modifications in separate patches if needed.

fhueske (Contributor) commented Apr 30, 2019

Hi, sorry @becketqin, I forgot about this PR.
I'll have a look later today.

Thanks, Fabian

fhueske (Contributor) left a comment

Hi @becketqin,

I've left a few comments.

Best, Fabian

[Cascading](http://www.cascading.org/cascading-flink/) enables a user to build complex workflows easily on Flink and other execution engines.
[Cascading on Flink](https://github.com/dataArtisans/cascading-flink) is built by [dataArtisans](http://data-artisans.com/) and [Driven, Inc](http://www.driven.io/).
See Fabian Hueske's [Flink Forward talk](http://www.slideshare.net/FlinkForward/fabian-hueske-training-cascading-on-flink) for more details.
**Apache Beam**

fhueske (Contributor), May 2, 2019:

I think we should order the projects alphabetically.

<th>Maintained By</th>
</tr>
<tr>
<td><a href="{{site.docs-stable}}/dev/connectors/kafka.html" target="_blank">Apache Kafka</a></td>

fhueske (Contributor), May 2, 2019:

Should we link to the Apache Kafka website here?
We can add the link to Flink's documentation to the "Available For" cell. This would allow us to link to the docs of the different APIs (the docs of the Table API Kafka connector are different from the DataStream API Kafka connector docs).

becketqin (Author), May 6, 2019:

Currently this column links to the individual connector web pages, while the Location column links to the connector project page. But I agree it is a bit confusing. Would changing the column name from "System Name" to "Connector" help?

I am not sure about having fine-grained doc link in the "Available For" column. Doing this assumes that all the connector projects have separate links for each API. And there could be many combinations (e.g. DataSet Sink / DataStream Source, etc.) that makes the page very verbose.

Personally I am happy with just one link to the documentation for a connector, and users would probably navigate in that connector page to check on usages for different APIs. This leaves our index concise and also gives freedom to each project to organize their own documents.

What do you think?

fhueske (Contributor), May 13, 2019:

I don't think that adding links in "Available For" would mean that we need to have different pages/links for the different APIs. If the links are the same, we can simply link to the same page, IMO.

However, in most cases we have different documentation pages for the different APIs. For example the file system connectors:

<tr>
<th>System Name</th>
<th>Connector Type</th>
<th>Location</th>

fhueske (Contributor), May 2, 2019:

Do we need a "Location" and a "Maintained By" column? It seems these are duplicate.

becketqin (Author), May 6, 2019:

The idea was that "Location" goes directly to the connector project page, while "Maintained By" might link to the website of some organization or individuals. They might be the same in some cases and different in others.

fhueske (Contributor), May 13, 2019:

I'm fine keeping both.
At this point it just looked duplicate to me.

<td>Apache Bahir</td>
</tr>
<tr>
<td><a href="http://bahir.apache.org/docs/flink/current/flink-streaming-influxdb/" target="_blank">InfluxDB</a></td>

fhueske (Contributor), May 2, 2019:

duplicate entry

<td></td>
<td>Table</td>
<td>{{site.stable}}.x</td>
<td>Apache Bahir</td>

fhueske (Contributor), May 2, 2019:

JDBC is maintained by the Flink community

<td>sink/source</td>
<td>Apache Flink</td>
<td>Apache 2.0</td>
<td></td>

fhueske (Contributor), May 2, 2019:

Should we leave this cell empty?
For us it is clear that this is the last Flink release, but visitors might be confused by the missing date.
Should we add a variable with the latest Flink release here?

<td>{{site.stable}}.x</td>
<td>Apache Flink</td>
</tr>
<tr>

fhueske (Contributor), May 2, 2019:

We have file system source connectors for DataStream and DataSet

fhueske (Contributor), May 2, 2019:

We have file system connectors for DataSet API

becketqin (Author), May 6, 2019:

I only see Writer classes in org.apache.flink.streaming.connectors.fs. Did I miss something?

fhueske (Contributor), May 13, 2019:

The interfaces are org.apache.flink.api.common.io.FileInputFormat and org.apache.flink.api.common.io.FileOutputFormat.

Concrete implementations depend on the format (like ORC, CSV, etc.). There are also wrappers for Hadoop InputFormat implementations like Parquet, SequenceFile, etc.

<td>Apache Flink</td>
</tr>
<tr>
<td><a href="{{site.docs-stable}}/dev/connectors/streamfile_sink.html" target="_blank">Others File Systems (S3, others)</a></td>

fhueske (Contributor), May 2, 2019:

I would list all file systems under the same entry. We should not treat HDFS differently than other FSs, IMO.

fhueske (Contributor), May 13, 2019:

Flink's FileSystem interface hides many of the differences between the supported file systems (HDFS, S3, ...) from the connector implementations.

<td>{{site.stable}}.x</td>
<td>Apache Bahir</td>
</tr>
</table>

fhueske (Contributor), May 2, 2019:

We have an HBase source connector for DataSet and Table APIs.


becketqin (Author) left a comment

@fhueske Thanks for the review. I made some modifications and replied to some of the comments. In general, I would consider this page a quick reference to the connectors and ecosystem projects rather than a place for too many details.


@becketqin becketqin force-pushed the becketqin:update_ecosystem branch from 24676da to cb54f78 May 6, 2019
becketqin (Author) commented May 6, 2019

@rmetzger @fhueske One thing not quite clear to me is the compatible Flink version section. As far as I understand, anything compatible with Flink X.Y should also be compatible with Flink X.Z as long as Z >= Y. Is that correct? If that is the case, I'll check at which Flink version was each connector added, and update the Compatible Flink Versions accordingly.

fhueske (Contributor) commented May 13, 2019

Thanks for the update @becketqin. I agree that this page should be a quick reference to more detailed information. However, I also think it should be complete (at least with respect to built-in connectors). One of the main problems is the large number of possible combinations (APIs, source/sink, encodings, etc.).

Regarding your question: This is usually true. However, we sometimes also drop support for outdated connectors or older versions of other systems.

rmetzger (Contributor) commented Jun 21, 2019

@rmetzger @fhueske One thing not quite clear to me is the compatible Flink version section. As far as I understand, anything compatible with Flink X.Y should also be compatible with Flink X.Z as long as Z >= Y. Is that correct? If that is the case, I'll check at which Flink version was each connector added, and update the Compatible Flink Versions accordingly.

How about removing that information from the page?
I would prefer to have something about our ecosystem on the Flink website, instead of discussing forever here.

rmetzger (Contributor) commented Jun 21, 2019

Once this is merged, let's add https://issues.apache.org/jira/browse/FLINK-12783

fhueske (Contributor) commented Jun 24, 2019

I'm fine leaving the compatibility version out.

As I said before, I think the information about available built-in connectors should be complete, as this page is likely to be considered the source of truth wrt. connector availability, and it would be a shame if users were discouraged because they cannot find info about an available connector.

becketqin (Author) commented Jun 26, 2019

@rmetzger Thanks for pushing this. I'll remove the compatible versions column.

I'm fine leaving the compatibility version out.

As I said before, I think the information about available built-in connectors should be complete, as this page is likely to be considered the source of truth wrt. connector availability, and it would be a shame if users were discouraged because they cannot find info about an available connector.

Hi @fhueske, in general I agree that this page should serve as a good connector-availability reference for users. However, it is unclear to me what counts as complete. For example, some security mechanism may not be supported, or we only have an append sink while a user needs an upsert sink, or some source connectors may support timestamps while others do not. Is the availability information for these connectors incomplete? To the users who need those features, such a connector is not available. But including all of this here would simply explode the page and make it difficult to maintain.

So I think we need to draw a line between "available" and "fully meets user requirements".
Personally, I would take connector type, supported API, and compatible versions as necessary for users to consider a connector available. Other information, such as supported encodings, supported security mechanisms, or whether a sink is upsert or append, belongs to whether the connector "fully meets user requirements". We can keep that in the individual pages of each connector.

Thoughts?

3 participants