
SPARK-1795 - Add recursive directory file search to fileInputStream #537

Closed
wants to merge 3 commits

Conversation

patrickotoole commented:

Added recursive directory search to fileInputStream. We want Spark to be able to find files in subdirectories rather than just the parent directory.

tdas (Contributor) commented Apr 24, 2014:

Can you please add a JIRA for this and include the JIRA number in the title, like other PRs?

tdas (Contributor) commented Apr 24, 2014:

Also, please add a unit test for this use case in InputStreamsSuite.

The method under review in the diff:

    def fileStream[
        K: ClassTag,
        V: ClassTag,
        F <: NewInputFormat[K, V]: ClassTag
      ] (directory: String, filter: Path => Boolean, newFilesOnly: Boolean, recursive: Boolean): DStream[(K, V)] = {
      new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly, recursive)
    }
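
For context, a call against this proposed overload might look like the sketch below. The directory, filter, and key/value/input-format types are illustrative assumptions, and ssc stands for an existing StreamingContext; none of this is from the patch itself.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Hypothetical call site: watch hdfs:///logs and its subdirectories,
    // skipping hidden files. All names here are illustrative.
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///logs",
      (path: Path) => !path.getName.startsWith("."),
      newFilesOnly = true,
      recursive = true)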
Review comment (Contributor):

This looks like an API change; please add a default value for recursive.
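
One way to avoid the break, as a sketch, is to default the new flag so existing call sites compile unchanged; the rest of the signature is as in the diff above.

    // Sketch: recursive defaults to false, preserving source compatibility
    // for callers that pass only the original three arguments.
    def fileStream[
        K: ClassTag,
        V: ClassTag,
        F <: NewInputFormat[K, V]: ClassTag
      ] (directory: String, filter: Path => Boolean, newFilesOnly: Boolean,
         recursive: Boolean = false): DStream[(K, V)] = {
      new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly, recursive)
    }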

patrickotoole (Author) replied:

I have included a default value on the FileInputDStream but not on the API itself.

I'm wondering if we want to introduce default values in the more granular version of the API as well. Currently the exposed API essentially has two versions of these methods: one that assumes default values and one that exposes all the parameters of the DStream constructor.

Thoughts?
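
For illustration, the two tiers described above look roughly like the sketch below. This is a simplified assumption, not the exact Spark source; in particular, the inline filter stands in for whatever default filter Spark actually uses.

    // Convenience tier: only the directory is required; the rest is defaulted.
    def fileStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V]: ClassTag](
        directory: String): DStream[(K, V)] =
      fileStream[K, V, F](directory, (p: Path) => !p.getName.startsWith("."), newFilesOnly = true)

    // Granular tier: exposes the FileInputDStream constructor parameters directly.
    def fileStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V]: ClassTag](
        directory: String, filter: Path => Boolean, newFilesOnly: Boolean): DStream[(K, V)] =
      new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly)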

Review comment:

In which version of Spark can we get the API with support for nested directory streaming?

patrickotoole changed the title from "Add recursive directory file search to fileInputStream" to "SPARK-1795 - Add recursive directory file search to fileInputStream" on May 11, 2014.
    val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)

    val filePaths: Array[Path] = if (recursive)
      recursiveFileList(fs.listStatus(directoryPath).toList).toArray

Review comment:

If the input directory is already the lowest level of the hierarchy, none of the files in it will be considered. For example, consider the following layout:

    /a/file1.txt
    /a/file2.txt

and so on. If the input directory is given as "/a", there will be no output.

Review reply:

We could call it like this instead:

    val filePaths: Array[Path] = if (recursive)
      recursiveListDirs(List(fs.getFileStatus(new Path(directoryPath)))).toArray
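
For what it's worth, a helper matching that call could look like the sketch below. The body is an assumption for illustration, not the patch's implementation, and the FileSystem is passed explicitly here for self-containment, whereas the call above captures it from the enclosing scope.

    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    // Sketch: walk the given statuses, descending into directories and
    // collecting file paths. Seeding the walk with the root directory's own
    // FileStatus, as suggested above, handles a flat directory like /a too.
    def recursiveListDirs(fs: FileSystem, statuses: List[FileStatus]): List[Path] =
      statuses.flatMap { status =>
        if (status.isDirectory)
          recursiveListDirs(fs, fs.listStatus(status.getPath).toList)
        else
          List(status.getPath)
      }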

SparkQA commented Sep 5, 2014:

Can one of the admins verify this patch?

tdas (Contributor) commented Dec 24, 2014:

@patrickotoole Sorry for letting this patch sit here for so long without any attention. Would you mind updating it to the latest code?

srowen (Member) commented Jan 23, 2015:

I suggest we close this in favor of #2765, since that PR implements recursion with a maximum depth, merges cleanly, and has been active more recently.

AmplabJenkins commented:

Can one of the admins verify this patch?

srowen (Member) commented Apr 27, 2015:

Mind closing this PR?

asfgit closed this in 8dee274 on Apr 29, 2015.
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
One-line code change which is the initial patch for [HADOOP-16248](https://issues.apache.org/jira/browse/HADOOP-16248). See internal ticket number 87611 for more context.
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
* [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data (apache#531)
* [SPARK-27514][SQL] Skip collapsing windows with empty window expressions (apache#538)
* Bump hadoop to 2.9.2-palantir.5 (apache#537)
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019