Add blog post "A review of input streaming connectors" #521

Closed
jphalip wants to merge 10 commits into apache:asf-site from jphalip:blog-post-streaming-connectors

Conversation


@jphalip jphalip commented Aug 3, 2018

This is a re-post of an article that I recently published on the GCP blog: https://cloud.google.com/blog/products/data-analytics/review-of-input-streaming-connectors-for-apache-beam-and-apache-spark

This is a slightly edited version of the other article, to make it relevant to a broader audience beyond just GCP.

jphalip added a commit to jphalip/beam-site that referenced this pull request Aug 6, 2018

jphalip commented Aug 6, 2018

Hi @jbonofre. Just pinging you based on recent commit history for blog posts. Are you the right person to review blog submissions? Thanks! :)


melap commented Aug 7, 2018

R: @chamikaramj


melap commented Aug 7, 2018

R: @iemejia or @jbonofre or @holdenk for Spark, I'm not familiar enough to review


@iemejia iemejia left a comment


Nice! Really minor suggestions: the text is clear, and the intention of highlighting streaming connectors is interesting.

</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html">PubsubIO</a>
</td>
<td><a href="https://github.com/apache/bahir/tree/master/streaming-pubsub">Spark-streaming-pubsub</a> from <a href="http://bahir.apache.org">Apache Bahir</a>
iemejia (Member):

Maybe "spark" in lowercase, for consistency with the previous text.

iemejia (Member):

Other streaming connectors seem to be missing for both Beam and Spark (not sure if that's because they are not for distributed data stores, but they could make the comparison richer): JMS, MQTT, AMQP.

jphalip (Author):

Which connectors exactly would you recommend in each case for Spark? Bahir has one for MQTT, but I'm not sure what other connectors to recommend for JMS & AMQP.

iemejia (Member):

There seems to be no community-maintained version of either, so probably just mention that.

<tr>
<td>HDFS<br>(Using the <code>hdfs://</code> URI)
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
iemejia (Member):

Not sure I understand this column. Are the options highlighted for configuration? Or should this rather be FileIO + HadoopFileSystem?

</td>
<td>Cloud Storage<br>(Using the <code>gs://</code> URI)
</td>
<td rowspan="2" ><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
iemejia (Member):

Similar to above: maybe (GcsOptions or GcsFileSystem) and (S3Options or S3FileSystem)?

jphalip (Author):

In this case, should it be: FileIO + GcsOptions, and FileIO + S3Options?

iemejia (Member):

Yes, probably divide both. I still don't understand why we refer to HadoopFileSystemOptions above instead of HadoopFileSystem.


@jphalip jphalip Aug 14, 2018


I've tried to only use links to the documentation, which is why I referenced HadoopFileSystemOptions.

Somehow HadoopFileSystem doesn't seem to be documented. Is that intentional?

jphalip (Author):

Would you like to replace HadoopFileSystemOptions→HadoopFileSystem, GcsOptions→GcsFileSystem, and S3Options→S3FileSystem, and set the links to the source code on GitHub, since the *FileSystem classes aren't documented?


### **Scala**

Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a [Scala SDK](https://github.com/spotify/scio) open-sourced [by Spotify](https://labs.spotify.com/2017/10/16/big-data-processing-at-spotify-the-road-to-scio-part-1/).
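As a sketch of the interoperability described above (the file paths and object name are hypothetical, and this assumes the Beam Java SDK jars are on the classpath), a Scala program can call the Java TextIO connector directly:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

object JavaConnectorFromScala {
  def main(args: Array[String]): Unit = {
    // Build pipeline options from command-line args, as in a Java Beam program.
    val options = PipelineOptionsFactory.fromArgs(args: _*).create()
    val pipeline = Pipeline.create(options)

    pipeline
      // The Java TextIO connector is used from Scala with no wrapper code.
      .apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.txt"))
      .apply("WriteLines", TextIO.write().to("gs://my-bucket/output/result"))

    pipeline.run().waitUntilFinish()
  }
}
```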
iemejia (Member):

s/SDK/API, as referred to on the scio GitHub page. Probably clearer to say that Spotify has a Scala API on top of Apache Beam.


### **Go**

A [Go SDK](https://beam.apache.org/documentation/sdks/go/) for Apache Beam is under active development. It is currently experimental and is not recommended for production.
iemejia (Member):

Probably worth adding that Spark does not have a Go SDK.


Spark offers two approaches to streaming: [Discretized Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html) (or DStreams) and [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html). DStreams are a basic abstraction that represents a continuous series of [Resilient Distributed Datasets](https://spark.apache.org/docs/latest/rdd-programming-guide.html) (or RDDs). Structured Streaming was introduced more recently (the alpha release came with Spark 2.1.0) and is based on a [model](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#programming-model) where live data is continuously appended to a table structure.

Spark Structured Streaming supports [file sources](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html) (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and [Kafka](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.
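As a rough sketch of the Structured Streaming approach (the broker address and topic name are hypothetical, and this assumes the spark-sql-kafka-0-10 connector artifact is on the classpath), a Kafka topic can be consumed as a streaming DataFrame:

```scala
import org.apache.spark.sql.SparkSession

object KafkaStructuredStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaExample").getOrCreate()

    // Each Kafka record arrives as a row with binary key/value columns,
    // which are cast to strings here for readability.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Print each micro-batch to the console; a real job would use a
    // durable sink instead.
    val query = stream.writeStream
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```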

iemejia commented Aug 12, 2018

Extra comment: I cannot see the changes in the staging repo (maybe related to not having the correct date in the file, I suppose).


jphalip commented Aug 13, 2018

@iemejia Thanks a lot for the feedback! I've just pushed some updates and left a couple of questions above.

I'm not sure how the staging repo works. I can change the publication date. Which date should I pick?


iemejia commented Aug 13, 2018

For the date, just assume a publication date of maybe the end of the week, assuming that the new release blog post is coming soon. Btw, can you please rebase the PR once #536 is merged? It seems that it broke the layout, so it's probably affecting this one too.


jphalip commented Aug 14, 2018

@iemejia I've set the date to 08/16/2018. I'll rebase once the other PR you've linked is merged.

<td>Local<br>(Using the <code>file://</code> URI)
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
iemejia (Member):

I am not sure if this will work, but it would be good to refer to the latest URLs (this works in normal markdown links, so hopefully it will work here too):
{{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/org/apache/beam/sdk/io/BoundedSource.html
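For illustration, a table cell following that suggestion would look roughly like this (a sketch; `site.baseurl` and `site.release_latest` are the site variables named in the comment above):

```html
<!-- The Javadoc version segment comes from the site configuration instead of
     being hard-coded, so links always point at the latest release. -->
<td><a href="{{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
</td>
```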


iemejia commented Aug 15, 2018

The other PR was merged; please rebase, and I will LGTM/merge once it looks OK on staging.

@jphalip jphalip force-pushed the blog-post-streaming-connectors branch from 948ff30 to 8bca0a1 Compare August 15, 2018 17:18

jphalip commented Aug 15, 2018

@iemejia Thanks, I've rebased the branch and updated the doc links.


@iemejia iemejia left a comment


Content-wise LGTM; I have just one minor issue: I cannot see the blog section in the generated website.
@melap, am I looking at the wrong place, or is there anything else missing (or something wrong with the generation)? In any case, feel free to merge when you are OK with it. Thanks @jphalip, and sorry for the delay (holiday time in the middle of the PR).


melap commented Aug 20, 2018

retest this please


melap commented Aug 20, 2018

I forced a regeneration and it's there now (you might have to shift-reload; sometimes it seems to cache unexpectedly). There are a bunch of HTML and table issues that will need to be resolved, though:
http://apache-beam-website-pull-requests.storage.googleapis.com/521/blog/2018/08/16/review-input-streaming-connectors.html


melap commented Aug 20, 2018

I just pushed a commit that fixes the issues, plus a couple of other minor changes (such as adding table borders). Go ahead and update the date to today, and I'll merge after I verify the final version on staging.


melap commented Aug 20, 2018

retest this please


melap commented Aug 20, 2018

@asfgit merge

asfgit pushed a commit that referenced this pull request Aug 20, 2018
@asfgit asfgit closed this in 67d7fba Aug 20, 2018
swegner pushed a commit to swegner/beam that referenced this pull request Sep 19, 2018