Add blog post "A review of input streaming connectors" #521
jphalip wants to merge 10 commits into apache:asf-site
Conversation
Hi @jbonofre. Just pinging you based on recent commit history for blog posts. Are you the right person to review blog submissions? Thanks! :)
R: @chamikaramj
iemejia left a comment
Nice! Really minor suggestions; the text is clear and the intention of highlighting streaming connectors is interesting.
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html">PubsubIO</a>
</td>
<td><a href="https://github.com/apache/bahir/tree/master/streaming-pubsub">Spark-streaming-pubsub</a> from <a href="http://bahir.apache.org">Apache Bahir</a>
Maybe write "spark" in lowercase, for consistency with the previous text.
Other streaming connectors seem to be missing for both Beam and Spark (not sure whether they are absent because they are not for distributed data stores, but they could make the comparison richer): JMS, MQTT, AMQP.
Which connectors exactly would you recommend for Spark in each case? Bahir has one for MQTT, but I'm not sure what other connectors to recommend for JMS and AMQP.
There doesn't seem to be a community-maintained version for either of those, so probably just mention that.
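For readers weighing the two ecosystems, here is a minimal sketch of the Beam side of this table row, assuming the Beam Java SDK with the google-cloud-platform IO module on the classpath; the project, subscription, and class names are invented for illustration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class PubsubReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read Pub/Sub messages as UTF-8 strings from a hypothetical subscription.
    PCollection<String> messages = pipeline.apply(
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    pipeline.run().waitUntilFinish();
  }
}
```

On the Spark side, Bahir's spark-streaming-pubsub connector plays the equivalent role for DStreams.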
<tr>
<td>HDFS<br>(Using the <code>hdfs://</code> URI)
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
Not sure I understand this column: are the options highlighted for configuration? Or should this probably rather be FileIO + HadoopFileSystem?
</td>
<td>Cloud Storage<br>(Using the <code>gs://</code> URI)
</td>
<td rowspan="2"><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
Similar to the above; maybe (GcsOptions or GcsFileSystem) and (S3Options or S3FileSystem)?
In this case, should it be: FileIO + GcsOptions, and FileIO + S3Options?
Yes, probably divide both. I still don't understand why we refer to HadoopFileSystemOptions above instead of HadoopFileSystem, though.
I've tried to link only to the documentation, which is why I referenced HadoopFileSystemOptions.
Somehow HadoopFileSystem doesn't seem to be documented. Is that intentional?
Would you like me to replace HadoopFileSystemOptions→HadoopFileSystem, GcsOptions→GcsFileSystem, and S3Options→S3FileSystem, and point the links to the source code on GitHub, since the *FileSystem classes aren't documented?
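To make the relationship between these classes concrete, a hedged sketch of how they typically combine in user code, assuming the Beam Java SDK with the hadoop-file-system module; the cluster address and paths are invented. HadoopFileSystemOptions only carries the Hadoop Configuration; reads on an hdfs:// path then go through TextIO (or FileIO) via the registered HadoopFileSystem:

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsReadSketch {
  public static void main(String[] args) {
    // HadoopFileSystemOptions supplies the Hadoop configuration; the
    // hdfs:// scheme itself is resolved by HadoopFileSystem under the hood.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline pipeline = Pipeline.create(options);
    // The actual read is expressed with TextIO against an hdfs:// path.
    pipeline.apply(TextIO.read().from("hdfs://namenode:8020/logs/*.txt"));
    pipeline.run().waitUntilFinish();
  }
}
```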
### **Scala**
Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a [Scala SDK](https://github.com/spotify/scio) open-sourced [by Spotify](https://labs.spotify.com/2017/10/16/big-data-processing-at-spotify-the-road-to-scio-part-1/).
s/SDK/API, as it's referred to on the scio GitHub page. It's probably clearer to say that Spotify has a Scala API on top of Apache Beam.
### **Go**
A [Go SDK](https://beam.apache.org/documentation/sdks/go/) for Apache Beam is under active development. It is currently experimental and is not recommended for production.
Probably worth adding that Spark does not have a Go SDK.
Spark offers two approaches to streaming: [Discretized Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html) (or DStreams) and [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html). DStreams are a basic abstraction that represents a continuous series of [Resilient Distributed Datasets](https://spark.apache.org/docs/latest/rdd-programming-guide.html) (or RDDs). Structured Streaming was introduced more recently (the alpha release came with Spark 2.1.0) and is based on a [model](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#programming-model) where live data is continuously appended to a table structure.
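As a hedged illustration of the difference between the two APIs, a sketch in Java, assuming Spark's streaming and SQL modules are on the classpath; the host, port, and class names are invented, and neither stream is actually started here:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TwoStreamingApisSketch {
  public static void main(String[] args) {
    // DStreams: micro-batches of RDDs produced on a fixed interval.
    JavaStreamingContext ssc = new JavaStreamingContext(
        new SparkConf().setMaster("local[*]").setAppName("dstream-sketch"),
        Durations.seconds(10));
    JavaDStream<String> dstreamLines = ssc.socketTextStream("localhost", 9999);

    // Structured Streaming: an unbounded table queried with the DataFrame API.
    SparkSession spark = SparkSession.builder()
        .master("local[*]") // local master for illustration only
        .appName("structured-sketch")
        .getOrCreate();
    Dataset<Row> tableLines = spark.readStream()
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load();
  }
}
```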
Spark Structured Streaming supports [file sources](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html) (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and [Kafka](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.
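For instance, a hedged sketch of the Kafka input path in Structured Streaming, assuming the spark-sql-kafka connector is on the classpath; the broker address, topic, and class name are invented:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaStructuredSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .master("local[*]") // local master for illustration only
        .appName("kafka-structured-sketch")
        .getOrCreate();

    // Requires the spark-sql-kafka connector on the classpath.
    Dataset<Row> kafka = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .load();

    // Kafka records surface as binary key/value columns; cast to read them as text.
    Dataset<Row> values = kafka.selectExpr("CAST(value AS STRING)");

    values.writeStream().format("console").start().awaitTermination();
  }
}
```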
This link is probably worth including here for reference:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
Extra comment: I cannot see the changes in the staging repo (maybe related to not having the correct date in the file, I suppose).
@iemejia Thanks a lot for the feedback! I've just pushed some updates and left a couple of questions above. I'm not sure how the staging repo works. I can change the publication date. Which date should I pick?
For the date, just assume a publication date of maybe the end of the week, assuming that the new release blog post is coming soon. Btw, can you please rebase the PR once #536 is merged? It seems that it broke the layout, so it's probably affecting this one too.
@iemejia I've set the date to 08/16/2018. I'll rebase once the other PR you've linked is merged.
<td>Local<br>(Using the <code>file://</code> URI)
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
I am not sure if this will work, but it would be good to refer to the latest-version URLs (this works in normal markdown links, so hopefully it will work here too):
{{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/org/apache/beam/sdk/io/BoundedSource.html
The other PR was merged; please rebase and I will LGTM/merge once it looks OK on staging.
Force-pushed from 948ff30 to 8bca0a1
@iemejia Thanks, I've rebased the branch and updated the doc links.
Content-wise LGTM. I have just one minor issue: I cannot see the blog section on the generated website.
@melap Am I looking at the wrong place? Or is there anything else missing (or something wrong with the generation)? In any case, feel free to merge when you are OK with it. Thanks @jphalip, and sorry for the delay (holiday time in the middle of the PR).
retest this please
I forced a regeneration and it's there now (you might have to shift-reload; sometimes it seems to cache unexpectedly). There are a bunch of HTML and table issues that will need to be resolved, though:
I just pushed a commit that fixes the issues, plus a couple of other minor changes (such as adding table borders). Go ahead and update the date to today, and I'll merge after I verify the final version on staging.
retest this please
@asfgit merge
This is a re-post of an article that I recently published on the GCP blog: https://cloud.google.com/blog/products/data-analytics/review-of-input-streaming-connectors-for-apache-beam-and-apache-spark
This is a slightly edited version of that article, adapted to be relevant to a broader audience beyond just GCP.