[SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState #16826

Closed · wants to merge 39 commits

Conversation

@kunalkhamar (Contributor) commented Feb 6, 2017

What changes were proposed in this pull request?

Forking a newSession() from SparkSession currently makes a new SparkSession that does not retain SessionState (i.e. temporary tables, SQL config, registered functions, etc.). This change adds a method cloneSession() which creates a new SparkSession with a copy of the parent's SessionState.

Subsequent changes to the base session are not propagated to the cloned session; the clone is independent after creation.
If the base is changed after the clone has been created, say the user registers a new UDF, then the new UDF will not be available inside the clone. The same goes for configs and temp tables.
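
A rough usage sketch of these semantics (illustrative only; the exact visibility of cloneSession() in the final patch may be narrower than shown here, e.g. private[sql]):

    import org.apache.spark.sql.SparkSession

    val base = SparkSession.builder().master("local").getOrCreate()
    base.conf.set("spark.sql.shuffle.partitions", "7")
    base.range(5).createOrReplaceTempView("baseView")

    // The clone starts with a copy of the base session's SessionState...
    val cloned = base.cloneSession()
    assert(cloned.conf.get("spark.sql.shuffle.partitions") == "7")
    assert(cloned.catalog.tableExists("baseView"))

    // ...but is independent afterwards: later changes to the base are not propagated.
    base.conf.set("spark.sql.shuffle.partitions", "11")
    assert(cloned.conf.get("spark.sql.shuffle.partitions") == "7")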

How was this patch tested?

Unit tests

@kunalkhamar (Contributor Author) commented Feb 6, 2017

Hey Ryan @zsxwing
Can you take another look and let me know if anything needs changes?

@rxin (Contributor) commented Feb 7, 2017

What are the semantics? Do functions/settings on the base SparkSession affect the newly forked one?

@kunalkhamar (Contributor Author)

@rxin
Changes to the base session are not propagated to the forked session; the forked session is independent after creation.
If the base is changed after the fork has been created, say the user registers a new UDF, then the new UDF will not be available inside the fork. The same goes for configs and temp tables.

@zsxwing (Member) commented Feb 8, 2017

ok to test

@zsxwing (Member) left a comment

Made one pass. Overall looks good. Could you also add the semantics in the PR description?

}

val result = new ExperimentalMethods
result.extraStrategies = cloneSeq(extraStrategies)
Member

You don't need to copy these two Seqs since they are not mutable Seqs.
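
For context, extraStrategies is an immutable Seq, so the clone can simply share the same instance. A small illustration with a stand-in class (not the real ExperimentalMethods):

    // The var can be reassigned, but the Seq itself is immutable,
    // so sharing it between original and clone is safe.
    class ExperimentalSketch {
      var extraStrategies: Seq[String] = Nil
    }

    val original = new ExperimentalSketch
    original.extraStrategies = Seq("strategyA")

    val copied = new ExperimentalSketch
    copied.extraStrategies = original.extraStrategies   // no element-wise copy needed

    original.extraStrategies = original.extraStrategies :+ "strategyB"   // rebinds, does not mutate
    assert(copied.extraStrategies == Seq("strategyA"))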

private[sql] class SessionState(sparkSession: SparkSession) {
private[sql] class SessionState(
sparkSession: SparkSession,
existingSessionState: Option[SessionState]) {
Member

nit: existingSessionState -> parentSessionState, to indicate that we copy its internal state.

lazy val conf: SQLConf = new SQLConf
lazy val conf: SQLConf = {
val result = new SQLConf
if (existingSessionState.nonEmpty) {
Member

nit:

    existingSessionState.foreach(_.conf.getAllConfs.foreach {
      case (k, v) => if (v ne null) result.setConfString(k, v)
    })


/**
* Internal catalog for managing functions registered by the user.
*/
lazy val functionRegistry: FunctionRegistry = FunctionRegistry.builtin.copy()
lazy val functionRegistry: FunctionRegistry = {
Member

It's better to just add a copy method to FunctionRegistry to simplify this code.
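
A hedged sketch of that suggestion with simplified stand-in types (the real FunctionRegistry trait has a richer API; the copy signature here is an assumption, not the merged code):

    import scala.collection.mutable

    trait RegistrySketch {
      def register(name: String, fn: Int => Int): Unit
      def list(): Seq[String]
      def lookup(name: String): Option[Int => Int]

      // The suggestion: let the registry copy itself, so cloning SessionState is one call.
      def copySelf(): RegistrySketch = {
        val cloned = new InMemoryRegistrySketch
        list().foreach(name => lookup(name).foreach(cloned.register(name, _)))
        cloned
      }
    }

    class InMemoryRegistrySketch extends RegistrySketch {
      private val fns = mutable.Map.empty[String, Int => Int]
      def register(name: String, fn: Int => Int): Unit = fns(name) = fn
      def list(): Seq[String] = fns.keys.toSeq
      def lookup(name: String): Option[Int => Int] = fns.get(name)
    }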

@@ -115,16 +113,22 @@ class TestHiveContext(
private[hive] class TestHiveSparkSession(
@transient private val sc: SparkContext,
@transient private val existingSharedState: Option[SharedState],
existingSessionState: Option[SessionState],
Member

Looks like you don't need to change this file?

@@ -151,7 +155,7 @@ private[hive] class TestHiveSparkSession(
new TestHiveSessionState(self)

override def newSession(): TestHiveSparkSession = {
Member

You can change it to override def newSession(inheritSessionState: Boolean) instead

* This method will force the initialization of the shared state to ensure that parent
* and child sessions are set up with the same shared state. If the underlying catalog
* implementation is Hive, this will initialize the metastore, which may take some time.
*/
Member

nit: please add @Experimental and @InterfaceStability.Evolving

Contributor

why not remove the boolean flag and just call this cloneSession?

Contributor Author

That seems cleaner, fixed.

@@ -213,6 +218,24 @@ class SparkSession private(
new SparkSession(sparkContext, Some(sharedState))
}

/**
* Start a new session, sharing the underlying `SparkContext` and cached data.
Member

nit: add :: Experimental ::

val activeSession = SparkSession
.builder()
.master("local")
.config("spark-configb", "b")
Member

This is in the shared state. You should use SparkSession.conf.set instead.
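
A sketch of the suggested test setup (assumed shape, not the exact patch): per-session options should go through the session's RuntimeConfig so the test exercises SessionState copying rather than context-level configuration.

    val activeSession = SparkSession.builder().master("local").getOrCreate()
    activeSession.conf.set("spark-configb", "b")   // session-local SQL conf, inherited by the clone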

@zsxwing (Member) commented Feb 8, 2017

@kunalkhamar by the way, please add the JIRA number to the title.

@SparkQA commented Feb 9, 2017

Test build #72606 has finished for PR 16826 at commit a343d8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Feb 9, 2017

@kunalkhamar you should create a JIRA ticket for this.

In addition, I'm not a big fan of the design that passes a base session in. It would be simpler if there were just a clone method on SessionState and the associated states we store; cloning a SparkSession is then just creating a new SparkSession with a cloned SessionState.
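
A rough sketch of that shape, with placeholder classes rather than the real Spark ones:

    // SessionState knows how to clone itself; cloning a SparkSession then just wraps
    // a cloned state. The single map stands in for conf, catalog, registry, etc.
    class SessionStateSketch(val state: Map[String, String]) {
      def cloneState(): SessionStateSketch = new SessionStateSketch(state)
    }

    class SparkSessionSketch(val sessionState: SessionStateSketch) {
      def cloneSession(): SparkSessionSketch =
        new SparkSessionSketch(sessionState.cloneState())
    }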

@kunalkhamar kunalkhamar changed the title Fork SparkSession with option to inherit a copy of the SessionState. [Spark-5864][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 9, 2017
@kunalkhamar kunalkhamar changed the title [Spark-5864][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 10, 2017
@kunalkhamar kunalkhamar changed the title [Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [WIP][Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 10, 2017
@SparkQA commented Feb 10, 2017

Test build #72679 has finished for PR 16826 at commit 4210079.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kunalkhamar kunalkhamar changed the title [WIP][Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [WIP][SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 10, 2017
@SparkQA commented Feb 11, 2017

Test build #72727 has finished for PR 16826 at commit 6da6bda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 14, 2017

Test build #72897 has finished for PR 16826 at commit 579d0b7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

…tialize all fields directly instead. Same change for HiveSessionState.
@kunalkhamar kunalkhamar changed the title [WIP][SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [WIP][SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState Feb 16, 2017
@SparkQA commented Feb 16, 2017

Test build #72968 has finished for PR 16826 at commit 2837e73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 16, 2017

Test build #72970 has finished for PR 16826 at commit 8c00344.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Feb 16, 2017

What's WIP about this?

@kunalkhamar (Contributor Author) commented Feb 16, 2017

@rxin
The change from SessionState(SparkSession, parentSessionState) to SessionState(SparkSession, SQLConf, SessionCatalog, ...) is mostly done. I still have 4 failing tests (OOMs) and will create some more tests.

Also, currently SparkSession has-a SessionState and SessionState has-a SparkSession. I am investigating removing the SparkSession reference from inside SessionState. This would change the constructor from SessionState(SparkSession, SQLConf, SessionCatalog, ...) to SessionState(SparkContext, SQLConf, SessionCatalog, ..., more params that depend on SparkSession).

There are fields inside SessionState that depend on SparkSession, e.g. Analyzer and QueryExecution, but these can be created and passed in to SessionState during creation. It seems we might be able to pull up all such references to SparkSession into initialization (i.e. inside SessionState.apply). The same goes for HiveSessionState.
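
A rough sketch of the constructor shape being described (placeholder types; the real signature has more parameters):

    // SessionState receives its pieces directly; anything that needs the SparkSession
    // (analyzer, query execution factory, ...) is built once in a companion apply().
    class SessionStateShape(
        val conf: Map[String, String],
        val catalog: AnyRef,
        val queryExecutionCreator: String => String)   // stand-in for LogicalPlan => QueryExecution

    object SessionStateShape {
      def apply(session: AnyRef): SessionStateShape = {
        // the only place that touches session-level context
        val qeCreator = (plan: String) => s"QueryExecution($plan)"
        new SessionStateShape(Map.empty, new Object, qeCreator)
      }
    }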

Any thoughts on this?

@SparkQA commented Mar 6, 2017

Test build #74039 has finished for PR 16826 at commit 2740c63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74044 has finished for PR 16826 at commit 0f167db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74056 has finished for PR 16826 at commit 2f0b1ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val original = new SessionCatalog(externalCatalog)
val tempTable1 = Range(1, 10, 1, 10)
val db1 = "copytest1"
original.createTempView(db1, tempTable1, overrideIfExists = false)
Member

db1? Let us give the view a more reasonable name.

// check if clone and original independent
val db2 = "copytest2"
val tempTable2 = Range(1, 20, 2, 20)
clone.createTempView(db2, tempTable2, overrideIfExists = false)
Member

the same here.

original.createTempView(db3, tempTable3, overrideIfExists = false)
original.setCurrentDatabase(db3)
assert(clone.getCurrentDatabase == db2)
}
@gatorsmile (Member) commented Mar 7, 2017

How about?

  test("clone SessionCatalog - current db") {
    val externalCatalog = newEmptyCatalog()
    val db1 = "db1"
    val db2 = "db2"
    val db3 = "db3"

    externalCatalog.createDatabase(newDb(db1), ignoreIfExists = true)
    externalCatalog.createDatabase(newDb(db2), ignoreIfExists = true)
    externalCatalog.createDatabase(newDb(db3), ignoreIfExists = true)

    val original = new SessionCatalog(externalCatalog)
    original.setCurrentDatabase(db1)

    // check if current db copied over
    val clone = original.clone(
      SimpleCatalystConf(caseSensitiveAnalysis = true),
      new Configuration(),
      new SimpleFunctionRegistry,
      CatalystSqlParser)
    assert(original != clone)
    assert(clone.getCurrentDatabase == db1)

    // check if clone and original independent
    clone.setCurrentDatabase(db2)
    assert(original.getCurrentDatabase == db1)
    original.setCurrentDatabase(db3)
    assert(clone.getCurrentDatabase == db2)
  }

Member

Done.

* @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations.
* @param streamingQueryManager Interface to start and stop
* [[org.apache.spark.sql.streaming.StreamingQuery]]s.
* @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]]
Member

Let us document all params?

Member

done

Contributor

Hm, most of the param documentation here is actually kind of useless (very little information conveyed).

Contributor Author

@rxin Removing the redundant comments in SPARK-20048.


/**
* SQL-specific key-value configurations.
* Logical query plan optimizer.
Member

Nit: remove the extra space

@@ -278,6 +278,8 @@ private[hive] class HiveClientImpl(
state.getConf.setClassLoader(clientLoader.classLoader)
// Set the thread local metastore client to the client associated with this HiveClientImpl.
Hive.set(client)
// Replace conf in the thread local Hive with current conf
Hive.get(conf)
Member

Because IsolatedClientLoader is shared? If we do not make this change, does any test case fail?

Member

Because of reusing IsolatedClientLoader.cachedHive. Without this line, it may use an out-of-date HiveConf. The failed test is HiveSparkSubmitSuite.test("SPARK-18360: default table path of tables in default database should depend on the " + "location of default database").

Member

Thank you!

functionRegistry: FunctionRegistry,
override val catalog: HiveSessionCatalog,
sqlParser: ParserInterface,
val metadataHive: HiveClient,
Member

This is for avoiding using lazy val?

Member

It is to avoid using SparkSession in this class.

analyzer: Analyzer,
streamingQueryManager: StreamingQueryManager,
queryExecutionCreator: LogicalPlan => QueryExecution,
val plannerCreator: () => SparkPlanner)
Member

How about adding val planner to SessionState? So far, the interface of HiveSessionState looks a little bit complex to me.

@gatorsmile (Member) commented Mar 7, 2017

Just checked how CarbonData extends HiveSessionState. Basically, this PR will break what they completed a few months ago. : )

Member

Right now, SessionState.planner is a method. So it will return a new SparkPlanner using the latest experimentalMethods.extraStrategies every time. Changing it to a val is a breaking change.
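
A tiny illustration of why the def matters (stand-in types, not Spark's):

    class ExpMethodsSketch { var extraStrategies: Seq[String] = Nil }

    class StateSketch(exp: ExpMethodsSketch) {
      // def: rebuilt on every call, so it picks up strategies added later
      def planner: Seq[String] = exp.extraStrategies :+ "defaultStrategy"
      // val: captured once at construction time, so it would not
      val frozenPlanner: Seq[String] = exp.extraStrategies :+ "defaultStrategy"
    }

    val exp = new ExpMethodsSketch
    val state = new StateSketch(exp)
    exp.extraStrategies = Seq("myStrategy")
    assert(state.planner.contains("myStrategy"))
    assert(!state.frozenPlanner.contains("myStrategy"))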

Member

uh, I see

@SparkQA commented Mar 7, 2017

Test build #74079 has finished for PR 16826 at commit c41e7bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74124 has finished for PR 16826 at commit 5eb6733.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveSessionCatalogSuite extends TestHiveSingleton

@kunalkhamar (Contributor Author) left a comment

Looks close to done!

activeSession.stop()
activeSession = null
}
super.afterAll()
}

test("fork new session and inherit RuntimeConfig options") {
val key = "spark-config-clone"
activeSession.conf.set(key, "active")
Contributor Author

Should this be inside try {}?

val spark = activeSession
// Cannot use `import activeSession.implicits._` due to the compiler limitation.
import spark.implicits._

activeSession
Contributor Author

Should the temp view creation be inside the try block?

original.setCurrentDatabase(db3)
assert(clone.getCurrentDatabase == db2)
}

test("SPARK-19737: detect undefined functions without triggering relation resolution") {
Contributor Author

Is this supposed to be part of this PR?

* @param conf SQL-specific key-value configurations.
* @param experimentalMethods The experimental methods.
* @param functionRegistry Internal catalog for managing functions registered by the user.
* @param catalog Internal catalog for managing table and database states.
Contributor Author

Add a comment on the difference from SessionCatalog: HiveSessionCatalog uses the Hive client for interacting with the metastore.

@SparkQA commented Mar 7, 2017

Test build #74134 has finished for PR 16826 at commit 05abcf8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Mar 7, 2017

retest this please

@zsxwing (Member) commented Mar 8, 2017

retest this please. The running one will fail because master was broken.

@SparkQA commented Mar 8, 2017

Test build #74148 has finished for PR 16826 at commit 05abcf8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2017

Test build #74165 has finished for PR 16826 at commit 05abcf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2017

Test build #74215 has finished for PR 16826 at commit 4c23e7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class LongWrapper
  • public static class IntWrapper
  • case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper
  • case class JoinPlan(itemIds: Set[Int], plan: LogicalPlan, joinConds: Set[Expression], cost: Cost)
  • case class Cost(rows: BigInt, size: BigInt)
  • abstract class RepartitionOperation extends UnaryNode
  • trait WatermarkSupport extends UnaryExecNode

@zsxwing (Member) commented Mar 8, 2017

LGTM. Merging to master.

@asfgit asfgit closed this in 6570cfd Mar 8, 2017
asfgit pushed a commit that referenced this pull request Mar 29, 2017
…n listeners

## What changes were proposed in this pull request?

Bugfix from [SPARK-19540.](#16826)
Cloning SessionState does not clone query execution listeners, so the cloned session is unable to listen to events on queries.

## How was this patch tested?

- Unit test

Author: Kunal Khamar <kkhamar@outlook.com>

Closes #17379 from kunalkhamar/clone-bugfix.
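
A rough sketch of what this follow-up fix addresses, with stand-in types (the real change copies the registered query execution listeners when SessionState is cloned):

    import scala.collection.mutable

    // Stand-in listener manager: cloning must carry over registered listeners,
    // otherwise the cloned session silently stops reporting query events.
    class ListenerManagerSketch {
      private val listeners = mutable.ArrayBuffer.empty[String => Unit]
      def register(l: String => Unit): Unit = listeners += l
      def post(event: String): Unit = listeners.foreach(_(event))

      def cloneManager(): ListenerManagerSketch = {
        val copied = new ListenerManagerSketch
        listeners.foreach(copied.register)   // the step that was missing before the fix
        copied
      }
    }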