Add blog post "A review of input streaming connectors" #521
jphalip wants to merge 10 commits into apache:asf-site
Conversation
Hi @jbonofre. Just pinging you based on recent commit history for blog posts. Are you the right person to review blog submissions? Thanks! :)
R: @chamikaramj
iemejia left a comment
Nice! Really minor suggestions; the text is clear and the intention of highlighting streaming connectors is interesting.
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html">PubsubIO</a>
</td>
<td><a href="https://github.com/apache/bahir/tree/master/streaming-pubsub">Spark-streaming-pubsub</a> from <a href="http://bahir.apache.org">Apache Bahir</a>
Maybe write "spark" in lowercase, for consistency with the previous text.
Other streaming connectors seem to be missing for both Beam and Spark (not sure whether they are absent because they are not for distributed data stores, but they could make the comparison richer): JMS, MQTT, AMQP.
Which connectors exactly would you recommend for Spark in each case? Bahir has one for MQTT, but I'm not sure what other connectors to recommend for JMS and AMQP.
There doesn't seem to be a community-maintained version for either of those, so probably just mention that.
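For readers weighing the two ecosystems, here is a minimal sketch of the Beam side of this table row, assuming the Beam Java SDK with the google-cloud-platform IO module on the classpath; the project, subscription, and class names are invented for illustration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class PubsubReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read Pub/Sub messages as UTF-8 strings from a hypothetical subscription.
    PCollection<String> messages = pipeline.apply(
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    pipeline.run().waitUntilFinish();
  }
}
```

On the Spark side, Bahir's spark-streaming-pubsub connector plays the equivalent role for DStreams.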
<tr>
<td>HDFS<br>(Using the <code>hdfs://</code> URI)
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
Not sure I understand this column: are the options highlighted for configuration? Or should this probably rather be FileIO + HadoopFileSystem?
</td>
<td>Cloud Storage<br>(Using the <code>gs://</code> URI)
</td>
<td rowspan="2"><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
Similar to the above; maybe (GcsOptions or GcsFileSystem) and (S3Options or S3FileSystem)?
In this case, should it be: FileIO + GcsOptions, and FileIO + S3Options?
Yes, probably divide both. I still don't understand why we refer to HadoopFileSystemOptions above instead of HadoopFileSystem, though.
I've tried to link only to the documentation, which is why I referenced HadoopFileSystemOptions.
Somehow HadoopFileSystem doesn't seem to be documented. Is that intentional?
Would you like me to replace HadoopFileSystemOptions→HadoopFileSystem, GcsOptions→GcsFileSystem, and S3Options→S3FileSystem, and point the links to the source code on GitHub, since the *FileSystem classes aren't documented?
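To make the relationship between these classes concrete, a hedged sketch of how they typically combine in user code, assuming the Beam Java SDK with the hadoop-file-system module; the cluster address and paths are invented. HadoopFileSystemOptions only carries the Hadoop Configuration; reads on an hdfs:// path then go through TextIO (or FileIO) via the registered HadoopFileSystem:

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsReadSketch {
  public static void main(String[] args) {
    // HadoopFileSystemOptions supplies the Hadoop configuration; the
    // hdfs:// scheme itself is resolved by HadoopFileSystem under the hood.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline pipeline = Pipeline.create(options);
    // The actual read is expressed with TextIO against an hdfs:// path.
    pipeline.apply(TextIO.read().from("hdfs://namenode:8020/logs/*.txt"));
    pipeline.run().waitUntilFinish();
  }
}
```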
### **Scala**
Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a [Scala SDK](https://github.com/spotify/scio) open-sourced [by Spotify](https://labs.spotify.com/2017/10/16/big-data-processing-at-spotify-the-road-to-scio-part-1/).
s/SDK/API, as it's referred to on the scio GitHub page. It's probably clearer to say that Spotify has a Scala API on top of Apache Beam.
### **Go**
A [Go SDK](https://beam.apache.org/documentation/sdks/go/) for Apache Beam is under active development. It is currently experimental and is not recommended for production.
Probably worth adding that Spark does not have a Go SDK.
Spark offers two approaches to streaming: [Discretized Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html) (or DStreams) and [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html). DStreams are a basic abstraction that represents a continuous series of [Resilient Distributed Datasets](https://spark.apache.org/docs/latest/rdd-programming-guide.html) (or RDDs). Structured Streaming was introduced more recently (the alpha release came with Spark 2.1.0) and is based on a [model](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#programming-model) where live data is continuously appended to a table structure.
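As a hedged illustration of the difference between the two APIs, a sketch in Java, assuming Spark's streaming and SQL modules are on the classpath; the host, port, and class names are invented, and neither stream is actually started here:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TwoStreamingApisSketch {
  public static void main(String[] args) {
    // DStreams: micro-batches of RDDs produced on a fixed interval.
    JavaStreamingContext ssc = new JavaStreamingContext(
        new SparkConf().setMaster("local[*]").setAppName("dstream-sketch"),
        Durations.seconds(10));
    JavaDStream<String> dstreamLines = ssc.socketTextStream("localhost", 9999);

    // Structured Streaming: an unbounded table queried with the DataFrame API.
    SparkSession spark = SparkSession.builder()
        .master("local[*]") // local master for illustration only
        .appName("structured-sketch")
        .getOrCreate();
    Dataset<Row> tableLines = spark.readStream()
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load();
  }
}
```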
Spark Structured Streaming supports [file sources](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html) (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and [Kafka](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.
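For instance, a hedged sketch of the Kafka input path in Structured Streaming, assuming the spark-sql-kafka connector is on the classpath; the broker address, topic, and class name are invented:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaStructuredSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .master("local[*]") // local master for illustration only
        .appName("kafka-structured-sketch")
        .getOrCreate();

    // Requires the spark-sql-kafka connector on the classpath.
    Dataset<Row> kafka = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .load();

    // Kafka records surface as binary key/value columns; cast to read them as text.
    Dataset<Row> values = kafka.selectExpr("CAST(value AS STRING)");

    values.writeStream().format("console").start().awaitTermination();
  }
}
```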
This link is probably worth including here for reference:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
Extra comment: I cannot see the changes in the staging repo (maybe related to not having the correct date in the file, I suppose).
@iemejia Thanks a lot for the feedback! I've just pushed some updates and left a couple of questions above. I'm not sure how the staging repo works. I can change the publication date. Which date should I pick?
For the date, just assume a publication date of maybe the end of the week, assuming that the new release blog post is coming soon. Btw, can you please rebase the PR once #536 is merged? It seems that it broke the layout, so it's probably affecting this one too.
@iemejia I've set the date to 08/16/2018. I'll rebase once the other PR you've linked is merged.
<td>Local<br>(Using the <code>file://</code> URI)
</td>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
<td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
I am not sure if this will work, but it would be good to refer to the latest-version URLs (this works in normal markdown links, so hopefully it will work here too):
{{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/org/apache/beam/sdk/io/BoundedSource.html
The other PR was merged; please rebase and I will LGTM/merge once it looks OK on staging.
Force-pushed from 948ff30 to 8bca0a1
@iemejia Thanks, I've rebased the branch and updated the doc links.
Content-wise LGTM. I have just one minor issue: I cannot see the blog section on the generated website.
@melap Am I looking at the wrong place? Or is there anything else missing (or something wrong with the generation)? In any case, feel free to merge when you are OK with it. Thanks @jphalip, and sorry for the delay (holiday time in the middle of the PR).
retest this please
I forced a regeneration and it's there now (you might have to shift-reload; sometimes it seems to cache unexpectedly). There are a bunch of HTML and table issues that will need to be resolved, though:
I just pushed a commit that fixes the issues, plus a couple of other minor changes (such as adding table borders). Go ahead and update the date to today, and I'll merge after I verify the final version on staging.
retest this please
@asfgit merge
This is a re-post of an article that I recently published on the GCP blog: https://cloud.google.com/blog/products/data-analytics/review-of-input-streaming-connectors-for-apache-beam-and-apache-spark
This is a slightly edited version of that article, adapted to be relevant to a broader audience beyond just GCP.