
Conversation

@dongjoon-hyun (Member) commented May 21, 2016

What changes were proposed in this pull request?

SparkSession greatly reduces the number of concepts that Spark users must know. Currently, SparkSession is defined as the entry point for programming Spark with the Dataset and DataFrame API, and we can also easily obtain an RDD by calling Dataset.rdd or DataFrame.rdd.

However, in many usages (including the examples), users extract SparkSession.sparkContext and keep it in a separate variable just to call parallelize.

If SparkSession also supported RDDs seamlessly, it would be a nice usability improvement. We can do this by simply adding a parallelize API.
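
For context, here is a minimal sketch (hypothetical, not the patch itself) of what such a delegating method could look like on the Scala side, assuming it simply forwards to the SparkContext that the session already holds:

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper mirroring the proposed SparkSession.parallelize:
// it only forwards to SparkContext.parallelize on the session's context.
object SessionParallelize {
  def parallelize[T: ClassTag](
      spark: SparkSession,
      data: Seq[T],
      numSlices: Int): RDD[T] =
    spark.sparkContext.parallelize(data, numSlices)
}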

Example

Scala (SparkPi):

 object SparkPi {
   def main(args: Array[String]) {
     val spark = SparkSession
       .builder
       .appName("Spark Pi")
       .getOrCreate()
-    val sc = spark.sparkContext
     val slices = if (args.length > 0) args(0).toInt else 2
     val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
-    val count = sc.parallelize(1 until n, slices).map { i =>
+    val count = spark.parallelize(1 until n, slices).map { i =>
       val x = random * 2 - 1
       val y = random * 2 - 1
       if (x*x + y*y < 1) 1 else 0
     }.reduce(_ + _)
     println("Pi is roughly " + 4.0 * count / n)
     spark.stop()
   }
 }
Python (PythonPi):

 spark = SparkSession\
   .builder\
   .appName("PythonPi")\
   .getOrCreate()

- sc = spark._sc
-
 partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
 n = 100000 * partitions

 def f(_):
   x = random() * 2 - 1
   y = random() * 2 - 1
   return 1 if x ** 2 + y ** 2 < 1 else 0

-count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
+count = spark.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
 print("Pi is roughly %f" % (4.0 * count / n))

 spark.stop()

How was this patch tested?

Pass the Jenkins tests (with a new Python test) and also manual testing.

@dongjoon-hyun (Member Author)

Hi, @rxin.
I'm wondering what you think about this PR.

@SparkQA commented May 21, 2016

Test build #59081 has finished for PR 13245 at commit 810f08a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 22, 2016

Test build #59085 has finished for PR 13245 at commit 4f6a69e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 22, 2016

Test build #59082 has finished for PR 13245 at commit 65f9746.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented May 22, 2016

Hm, we are trying to avoid returning RDDs in the new APIs. One thing we can do is to introduce a parallelize API that returns a Dataset?
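
For reference, a brief sketch of the RDD-free alternatives that already exist on SparkSession (assuming the Spark 2.0 APIs spark.range and spark.createDataset), which is roughly the direction this suggestion points at:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DatasetAlternatives").getOrCreate()
import spark.implicits._ // supplies the Encoder[Int] needed by createDataset

// Dataset[java.lang.Long] holding 0 to 99; covers the common case where
// examples call sc.parallelize on a numeric range.
val ids = spark.range(100)

// Dataset[Int] built from a local collection, the closest Dataset
// analogue of parallelize for an in-memory Seq.
val nums = spark.createDataset(1 to 100)

spark.stop()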

@dongjoon-hyun (Member Author)

I see. Thank you!

@dongjoon-hyun (Member Author)

Unfortunately, Dataset (or DataFrame) does not seem suitable for achieving this goal in Python.

>>> spark.parallelize(range(1, 10)).toDS()
...
AttributeError: 'RDD' object has no attribute 'toDS'
>>> spark.parallelize(range(1, 10)).toDF()
...
TypeError: Can not infer schema for type: <type 'int'>

I'll think about this more until tomorrow and close this if I cannot find a neat solution.

@dongjoon-hyun (Member Author)

Sorry, extending SparkSession for RDDs wasn't a good idea. I'm closing this PR.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-15466 branch July 20, 2016 07:36