
[SPARK-13149][SQL]Add FileStreamSource #11034

Closed
wants to merge 16 commits into from

Conversation

zsxwing
Member

@zsxwing zsxwing commented Feb 2, 2016

FileStreamSource is an implementation of org.apache.spark.sql.execution.streaming.Source. It takes advantage of the existing HadoopFsRelationProvider to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered on restart. The metadata files live in the file system; a follow-up PR will clean them up periodically.

This is based on the initial work from @marmbrus.
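For readers skimming the design, here is a minimal sketch of the per-batch metadata bookkeeping described above. The file layout and helper names are illustrative assumptions, not the PR's exact code:

import java.io.{File, PrintWriter}

// Hypothetical layout: one metadata file per batch under `metadataPath`,
// listing the input files that make up that batch.
def writeBatchMetadata(metadataPath: String, batchId: Long, files: Seq[String]): Unit = {
  val out = new PrintWriter(new File(metadataPath, batchId.toString), "UTF-8")
  try files.foreach(line => out.println(line)) finally out.close()
}

// On restart, re-read every batch file to rebuild the batchId -> files map.
def recoverBatches(metadataPath: String): Map[Long, Seq[String]] = {
  val batchFiles = Option(new File(metadataPath).listFiles()).getOrElse(Array.empty[File])
  batchFiles.map { f =>
    val src = scala.io.Source.fromFile(f, "UTF-8")
    try f.getName.toLong -> src.getLines().toList finally src.close()
  }.toMap
}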

@zsxwing zsxwing changed the title Add FileStreamSource and a simple version of FileStreamSink [SPARK-13149][SQL]Add FileStreamSource and a simple version of FileStreamSink Feb 2, 2016
@zsxwing
Member Author

zsxwing commented Feb 2, 2016

@marmbrus @tdas Please take a look.

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50604 has finished for PR 11034 at commit a2784ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileStreamSink(
    • class FileStreamSource(
    • trait HadoopFsRelationProvider extends StreamSourceProvider with StreamSinkProvider

import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.util.Utils

class FileStreamSourceSuite extends StreamTest with SharedSQLContext {
Contributor

Could we write these tests with MemorySink instead, using testStream? I'd like to be able to test things like dropped batches and restarting as well. It would also be good to have the tests less tightly coupled.
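For context, a minimal sketch of the testStream pattern referenced here, driving a MemoryStream and checking results through the in-memory sink (the helper names follow Spark's StreamTest; exact signatures at this commit may differ):

val inputData = MemoryStream[Int]
val mapped = inputData.toDS().map(_ + 1)

testStream(mapped)(
  AddData(inputData, 1, 2, 3),
  CheckAnswer(2, 3, 4),
  StopStream,              // simulate a failure/restart
  StartStream(),
  AddData(inputData, 4),
  CheckAnswer(2, 3, 4, 5)  // earlier results survive the restart
)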

@zsxwing zsxwing changed the title [SPARK-13149][SQL]Add FileStreamSource and a simple version of FileStreamSink [SPARK-13149][SQL]Add FileStreamSource Feb 3, 2016
@SparkQA

SparkQA commented Feb 3, 2016

Test build #50678 has finished for PR 11034 at commit 71e6312.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented Feb 3, 2016

retest this please

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50684 has finished for PR 11034 at commit 6a90c55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50687 has finished for PR 11034 at commit 6a90c55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import sqlContext.implicits._

/** Returns the schema of the data from this source */
override def schema: StructType = dataSchema.getOrElse(new StructType().add("value", StringType))
Contributor

This getOrElse is only going to work for the text file data source. For formats like JSON we should probably initialize the source using dataFrameBuilder and extract the schema from there.

We should also add tests that would catch a problem here.
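A hedged sketch of what deferring to the underlying source could look like (`dataFrameBuilder` is the PR's helper for building a DataFrame over a set of paths; its exact shape here is an assumption):

override def schema: StructType = dataSchema.getOrElse {
  // Let the underlying relation infer the schema from whatever files exist;
  // formats that cannot infer one (e.g. over an empty directory) raise their own error.
  dataFrameBuilder(fetchAllFiles().toArray).schema
}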

Member Author

Updated the logic here. Now, if there are any existing files, it will use them to infer the schema. I also added a test for this.

dataSchema.getOrElse {
  val filesPresent = fetchAllFiles()
  if (filesPresent.isEmpty) {
    new StructType().add("value", StringType)
Contributor

Even if there are no files present, we should probably still defer to the source. Formats that can support this will work, and those that can't will throw the correct error message.

Member Author

> Formats that can support this will work, and those that can't will throw the correct error message.

But we need to return some StructType here. Any magic to defer that?

Contributor

Oh I see, even sqlContext.read.format("text").load() doesn't work? I would rather fix that than hardcode this here.

For sources like parquet/json it doesn't really make sense to point them at an empty directory, so I would rather throw an error.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50779 has finished for PR 11034 at commit 93af82e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50780 has finished for PR 11034 at commit 2af6fc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50830 has finished for PR 11034 at commit ce0556d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50838 has finished for PR 11034 at commit 9a1042c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* an empty `Seq`.
*/
def readBatch(input: InputStream): Seq[String] = {
  val lines = scala.io.Source.fromInputStream(input)(Codec.UTF8).getLines().toArray
Contributor

We should probably validate the version too?

Member Author

Done
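A minimal sketch of the version check that was added (the VERSION constant and the first-line layout are assumptions):

import java.io.InputStream
import scala.io.Codec

def readBatch(input: InputStream): Seq[String] = {
  val lines = scala.io.Source.fromInputStream(input)(Codec.UTF8).getLines().toArray
  assert(lines.nonEmpty, "empty metadata file")
  // The first line is assumed to carry the serialization version.
  assert(lines.head == VERSION, s"unsupported metadata version: ${lines.head}")
  lines.tail.toSeq
}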

@marmbrus
Contributor

marmbrus commented Feb 5, 2016

Small comments, otherwise LGTM

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50848 has finished for PR 11034 at commit 1ffee5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Result(avgMs: Double, bestRate: Double, bestMs: Double)
    • s\"Unable to generate an encoder for inner class$`
    • case class NaturalJoin(tpe: JoinType) extends JoinType

val src = Utils.createTempDir("streaming.src")

// Only "text" doesn't need a schema
createFileStreamSource("text", src.getCanonicalPath)
Contributor

Can we also make sure we throw a better error for this case?

scala> sqlContext.read.format("text").stream()
java.util.NoSuchElementException: key not found: path
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at org.apache.spark.sql.sources.HadoopFsRelationProvider$class.createSource(interfaces.scala:206)
  at org.apache.spark.sql.execution.datasources.text.DefaultSource.createSource(DefaultSource.scala:42)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.createSource(ResolvedDataSource.scala:107)
  at org.apache.spark.sql.DataFrameReader.stream(DataFrameReader.scala:167)
  ... 40 elided

Member Author

I would like to fix it in a separate PR since load throws the same error:

scala> sqlContext.read.format("text").load()
java.util.NoSuchElementException: key not found: path
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
  ... 49 elided

Member Author

Fixed the path error for stream
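For reference, a sketch of the friendlier check (the exact message is an assumption; the final wording also went through the follow-up in #11154):

// Fail fast with a clear message instead of the raw NoSuchElementException
// when no path was passed to .stream()/.load().
val path = parameters.getOrElse("path",
  throw new IllegalArgumentException("'path' is not specified"))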

@zsxwing
Member Author

zsxwing commented Feb 8, 2016

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50937 has finished for PR 11034 at commit fb0e3f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val batchToMetadata = new HashMap[Long, Seq[String]]

{
// Restore statues from the metadata files
Contributor

nit: statuses, not statues. :) Also, what is a status here? Isn't it just file names?

@tdas
Contributor

tdas commented Feb 9, 2016

Offline discussion:
It would be good to have a unit test that documents and tests the behavior: which file source gets generated, and what schema it reports, when the format and/or schema is provided or not and existing files are present or not. That way we have a clear understanding of the behavior.
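A sketch of one cell of that behavior matrix, reusing the suite's createFileStreamSource helper from above (assertion details are illustrative):

test("text source, no user schema, no existing files: falls back to a single 'value' column") {
  val src = Utils.createTempDir("streaming.src")
  val source = createFileStreamSource("text", src.getCanonicalPath)
  assert(source.schema === new StructType().add("value", StringType))
}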

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51009 has finished for PR 11034 at commit 07e2ddd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Feb 10, 2016

LGTM. New tests are great, and they make sense. Merging this!
Thanks @zsxwing and @marmbrus

@asfgit asfgit closed this in b385ce3 Feb 10, 2016
@zsxwing zsxwing deleted the stream-df-file-source branch February 10, 2016 18:59
asfgit pushed a commit that referenced this pull request Feb 21, 2016
Improved the error message as per discussion in #11034 (comment). Also made `path` and `metadataPath` in FileStreamSource case insensitive.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11154 from zsxwing/path.