
[SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc file in DataFrameReader.orc #10307

Closed
wants to merge 4 commits

Conversation

zjffdu
Contributor

@zjffdu zjffdu commented Dec 15, 2015

Besides the issue in the Spark API, this also fixes 2 minor issues in PySpark (see the usage sketch below):

  • support reading from multiple input paths for ORC
  • support reading from multiple input paths for text
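A rough sketch of the intended PySpark usage once multiple input paths are supported. The file paths below are hypothetical, and the example assumes a Spark 2.x SparkSession named `spark` (in 1.x the entry point would be `sqlContext.read`):

```python
# Hypothetical input paths; with this change the readers accept a list of
# paths instead of only a single string.
orc_df = spark.read.orc(["/data/events/day1.orc", "/data/events/day2.orc"])
text_df = spark.read.text(["/data/raw/part1.txt", "/data/raw/part2.txt"])

# Each DataFrame holds the union of the rows from all listed inputs.
print(orc_df.count(), text_df.count())
```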

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47725 has finished for PR 10307 at commit c0f90bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -322,11 +322,11 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
/**
* Loads an ORC file and returns the result as a [[DataFrame]].
Contributor

The comments should be updated as well.

Contributor Author

Thanks @bomeng, I also updated the docs for the other formats.

@zjffdu
Contributor Author

zjffdu commented Dec 17, 2015

Please build it again (if it works).

@yanboliang
Contributor

+1, I have been wanting this feature for a while.

val path2 = Utils.createTempDir()
makeOrcFile((1 to 10).map(Tuple1.apply), path1)
makeOrcFile((1 to 10).map(Tuple1.apply), path2)
assertResult(20)(read.orc(path1.getCanonicalPath, path2.getCanonicalPath).count())
Contributor

We need to remove the generated temporary files automatically; use withOrcFile or withTempDir.

Contributor Author

It will be deleted automatically after the program exits.

  /**
   * Create a temporary directory inside the given parent directory. The directory will be
   * automatically deleted when the VM shuts down.
   */
  def createTempDir(
      root: String = System.getProperty("java.io.tmpdir"),
      namePrefix: String = "spark"): File = {
    val dir = createDirectory(root, namePrefix)
    ShutdownHookManager.registerShutdownDeleteDir(dir)
    dir
  }

Contributor

withOrcFile will get cleaned up faster (e.g. as soon as the test ends rather than at program exit).

@holdenk
Contributor

holdenk commented Apr 19, 2016

@zjffdu would you want to update this against master so Jenkins can give it a run?

@SparkQA

SparkQA commented Apr 19, 2016

Test build #56216 has finished for PR 10307 at commit a8690e1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 19, 2016

Test build #56219 has finished for PR 10307 at commit 2109a94.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Donderia

Donderia commented Oct 3, 2016

I am trying to apply this patch on the 1.6 branch but the patch failed.
```
Applying: Support read from multiple input paths for orc file in DataFrameReader.orc
error: patch failed: python/pyspark/sql/readwriter.py:240
error: python/pyspark/sql/readwriter.py: patch does not apply
error: patch failed: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:388
error: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: patch does not apply
Patch failed at 0001 Support read from multiple input paths for orc file in DataFrameReader.orc
The copy of the patch that failed is found in:
/Users/vishaldonderia/Mobileum/Spark/spark/.git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
```
Am I missing something?

@holdenk
Contributor

holdenk commented Oct 7, 2016

@Donderia it seems like some of the files have changed since 1.6, so this won't apply cleanly against 1.6.
@zjffdu if you're still working on this, can you update it against the latest master? Also, it seems like part of this was fixed in bcaa799, but not the ORC part.

@zjffdu zjffdu changed the title [SPARK-12334][SQL][PYSPARK] Support read from multiple input paths fo… [SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc file in DataFrameReader.orc Oct 9, 2016
@SparkQA

SparkQA commented Oct 9, 2016

Test build #66591 has finished for PR 10307 at commit 727b35a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66592 has finished for PR 10307 at commit 6ac0580.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu zjffdu closed this Oct 9, 2016
@zjffdu zjffdu reopened this Oct 9, 2016
@SparkQA

SparkQA commented Oct 12, 2016

Test build #66773 has finished for PR 10307 at commit b9e6481.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Contributor Author

zjffdu commented Oct 17, 2016

ping @holdenk @JoshRosen @davies

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71521 has finished for PR 10307 at commit b9e6481.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Feb 14, 2017

Sorry for the delay in getting to this. Do you have time to update this to the latest master branch? It would be a nice small fix/improvement to get in :)

@holdenk
Contributor

holdenk commented Feb 17, 2017

Gentle ping @zjffdu

@Donderia

Donderia commented Feb 21, 2017 via email

@holdenk
Contributor

holdenk commented Feb 22, 2017

So this still doesn't merge with master; if you want to update it, it would be good to take a look :)

@holdenk
Contributor

holdenk commented Feb 24, 2017

Gentle ping @zjffdu :)

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73567 has finished for PR 10307 at commit 401f682.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73569 has started for PR 10307 at commit e425438.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73577 has finished for PR 10307 at commit 01501fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -388,16 +388,18 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))

@since(1.5)
def orc(self, path):
"""Loads an ORC file, returning the result as a :class:`DataFrame`.
def orc(self, paths):
Contributor

So if someone has been calling orc with a named param of path, this could cause them problems when they upgrade. I might be being overly cautious, but it seems like we should avoid breaking that since we don't have to until the next major version change.

Contributor Author

@zjffdu zjffdu Mar 1, 2017

Good catch, I should not break compatibility. BTW, I found that DataFrameReader.parquet uses a variable-length argument, which is not consistent with other file formats such as text, json, and orc that take a string or a list of strings. I can fix it in this PR, or do it in another PR, to make them consistent. What do you think?

@since(1.4)
    def parquet(self, *paths):
        """Loads Parquet files, returning the result as a :class:`DataFrame`.

Contributor

We might as well make it consistent in this PR if we can do it without breaking anything.
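One non-breaking way to do this is to keep the parameter name `path` and accept either a single string or a list of strings. The sketch below illustrates that compromise; it leans on the helpers already used in readwriter.py (`_to_seq`, `self._jreader`, and the `basestring` alias) and is an approximation of the idea, not necessarily the exact code that was merged:

```python
@since(1.5)
def orc(self, path):
    """Loads ORC files, returning the result as a :class:`DataFrame`.

    ``path`` may be a single path string or a list of path strings, so
    existing callers that pass the keyword argument ``path=`` keep working.
    """
    if isinstance(path, basestring):  # basestring is aliased to str on Python 3
        path = [path]
    return self._df(self._jreader.orc(_to_seq(self._spark._sc, path)))
```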

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73647 has finished for PR 10307 at commit 0686453.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74096 has finished for PR 10307 at commit a2d35dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74097 has finished for PR 10307 at commit fb883c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -282,6 +282,23 @@ def parquet(self, *paths):
"""
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))

@since(2.2)
def parquet(self, path):
Contributor

Having two functions with the same name and different args doesn't behave like in Scala (so this won't work). Please use kwargs or similar and add a test for paths and path.

Contributor Author

Thanks @holdenk, I learned a new thing about Python. I reverted the changes on parquet. It would be very weird to change it to def parquet(self, *paths, path=None), and def parquet(self, **kwargs) would break code that doesn't use keyword arguments, e.g. parquet("p_file").
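For readers less familiar with Python, here is a minimal, self-contained illustration of the point above (the class and return values are made up): a second def with the same name simply rebinds it, so there is no method overloading as in Scala:

```python
class Reader:
    def parquet(self, *paths):        # first definition
        return "varargs version"

    def parquet(self, path):          # silently rebinds the name; the varargs
        return "single-path version"  # definition above is no longer reachable

r = Reader()
print(r.parquet("a.parquet"))          # -> single-path version
# r.parquet("a.parquet", "b.parquet")  # TypeError: too many positional arguments
```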

@@ -407,15 +424,17 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non

@since(1.5)
def orc(self, path):
"""Loads an ORC file, returning the result as a :class:`DataFrame`.
"""Loads ORC files, returning the result as a :class:`DataFrame`.
Contributor

Maybe add a test for loading with a list of orc files.

Contributor Author

It is in python/pyspark/sql/tests.py
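For context, a rough sketch of what such a test could look like; the fixture (`self.spark`), names, and structure are illustrative and assume an ORC-capable session, not necessarily the exact test in python/pyspark/sql/tests.py:

```python
import shutil
import tempfile


def test_multiple_orc_paths(self):
    # Write the same small DataFrame to two separate ORC directories,
    # then read both back with a single orc() call that takes a list.
    tmp1, tmp2 = tempfile.mkdtemp(), tempfile.mkdtemp()
    try:
        df = self.spark.range(10)
        df.write.mode("overwrite").orc(tmp1)
        df.write.mode("overwrite").orc(tmp2)
        combined = self.spark.read.orc([tmp1, tmp2])
        self.assertEqual(combined.count(), 20)
    finally:
        shutil.rmtree(tmp1, ignore_errors=True)
        shutil.rmtree(tmp2, ignore_errors=True)
```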

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74198 has finished for PR 10307 at commit 6f366bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2017

Test build #74265 has finished for PR 10307 at commit 2a5c3c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Mar 9, 2017

So right now we've got a mix of path and paths as the argument names for the different file-loading methods. This is annoying to fix in Python, but we should maybe file a JIRA so we follow up on the reader/writer interfaces next time we have a major release. Can you do that @zjffdu?

Also, thank you for working on this for over a year; I'm so sorry it's taken so long to get to this.

@asfgit asfgit closed this in cabe1df Mar 9, 2017
@holdenk
Contributor

holdenk commented Mar 9, 2017

Merged to master, thank you @zjffdu

dongjoon-hyun pushed a commit that referenced this pull request Nov 22, 2019
…ValueGroupedDataset

### What changes were proposed in this pull request?

This PR proposes to add an `as` API to RelationalGroupedDataset. It creates a KeyValueGroupedDataset instance using the given grouping expressions, instead of a typed function as in the groupByKey API. Because it can leverage existing columns, it can use existing data partitioning, if any, when doing operations like cogroup.

### Why are the changes needed?

Currently, if users want to do cogroup on DataFrames, there is no good way to do it except through KeyValueGroupedDataset.

1. KeyValueGroupedDataset ignores existing data partitioning, if any. That is a problem.
2. groupByKey calls a typed function to create additional keys. You cannot reuse existing columns if you just need to group by them.

```scala
// df1 and df2 are certainly partitioned and sorted.
val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")
  .repartition($"a").sortWithinPartitions("a")
val df2 = Seq((1, 2, 4), (2, 3, 5)).toDF("a", "b", "c")
  .repartition($"a").sortWithinPartitions("a")
```
```scala
// This groupBy.as.cogroup won't unnecessarily repartition the data
val df3 = df1.groupBy("a").as[Int]
  .cogroup(df2.groupBy("a").as[Int]) { case (key, data1, data2) =>
    data1.zip(data2).map { p =>
      p._1.getInt(2) + p._2.getInt(2)
    }
}
```

```
== Physical Plan ==
*(5) SerializeFromObject [input[0, int, false] AS value#11247]
+- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4922/12067092816eec1b6f, a#11209: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [a#11209], [a#11225], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11246: int
   :- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(a#11209, 5), false, [id=#10218]
   :     +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211]
   :        +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204]
   +- *(4) Sort [a#11225 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(a#11225, 5), false, [id=#10223]
         +- *(3) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227]
            +- *(3) LocalTableScan [_1#11218, _2#11219, _3#11220]
```

```scala
// Current approach creates additional AppendColumns and repartition data again
val df4 = df1.groupByKey(r => r.getInt(0)).cogroup(df2.groupByKey(r => r.getInt(0))) {
  case (key, data1, data2) =>
    data1.zip(data2).map { p =>
      p._1.getInt(2) + p._2.getInt(2)
  }
}
```

```
== Physical Plan ==
*(7) SerializeFromObject [input[0, int, false] AS value#11257]
+- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4933/138102700737171997, value#11252: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [value#11252], [value#11254], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11256: int
   :- *(3) Sort [value#11252 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(value#11252, 5), true, [id=#10302]
   :     +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4930/19529195347ce07f47, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11252]
   :        +- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0
   :           +- Exchange hashpartitioning(a#11209, 5), false, [id=#10297]
   :              +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211]
   :                 +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204]
   +- *(6) Sort [value#11254 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(value#11254, 5), true, [id=#10312]
         +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4932/15265288491f0e0c1f, createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11254]
            +- *(5) Sort [a#11225 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(a#11225, 5), false, [id=#10307]
                  +- *(4) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227]
                     +- *(4) LocalTableScan [_1#11218, _2#11219, _3#11220]
```

### Does this PR introduce any user-facing change?

Yes, this adds a new `as` API to RelationalGroupedDataset. Users can use it to create KeyValueGroupedDataset and do cogroup.

### How was this patch tested?

Unit tests.

Closes #26509 from viirya/SPARK-29427-2.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>