[SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState #16826

Closed · wants to merge 39 commits

Conversation

@kunalkhamar (Contributor) commented Feb 6, 2017

What changes were proposed in this pull request?

Forking a newSession() from SparkSession currently makes a new SparkSession that does not retain SessionState (i.e. temporary tables, SQL config, registered functions, etc.). This change adds a method cloneSession() which creates a new SparkSession with a copy of the parent's SessionState.

Subsequent changes to the base session are not propagated to the cloned session; the clone is independent after creation.
If the base is changed after the clone has been created, say the user registers a new UDF, then the new UDF will not be available inside the clone. The same goes for configs and temp tables.
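
A rough usage sketch of these semantics (illustrative only; the exact visibility of cloneSession() in the final patch may be narrower than shown here, e.g. private[sql]):

    import org.apache.spark.sql.SparkSession

    val base = SparkSession.builder().master("local").getOrCreate()
    base.conf.set("spark.sql.shuffle.partitions", "7")
    base.range(5).createOrReplaceTempView("baseView")

    // The clone starts with a copy of the base session's SessionState...
    val cloned = base.cloneSession()
    assert(cloned.conf.get("spark.sql.shuffle.partitions") == "7")
    assert(cloned.catalog.tableExists("baseView"))

    // ...but is independent afterwards: later changes to the base are not propagated.
    base.conf.set("spark.sql.shuffle.partitions", "11")
    assert(cloned.conf.get("spark.sql.shuffle.partitions") == "7")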

How was this patch tested?

Unit tests

@kunalkhamar (Contributor Author) commented Feb 6, 2017

Hey Ryan @zsxwing
Can you take another look and let me know if anything needs changes?

@rxin (Contributor) commented Feb 7, 2017

What are the semantics? Do functions/settings on the base SparkSession affect the newly forked one?

@kunalkhamar (Contributor Author)

@rxin
Changes to the base session are not propagated to the forked session; the forked session is independent after creation.
If the base is changed after the fork has been created, say the user registers a new UDF, then the new UDF will not be available inside the fork. The same goes for configs and temp tables.

@zsxwing (Member) commented Feb 8, 2017

ok to test

@zsxwing (Member) left a comment

Made one pass. Overall looks good. Could you also add the semantics in the PR description?

}

val result = new ExperimentalMethods
result.extraStrategies = cloneSeq(extraStrategies)
Member

You don't need to copy these two Seqs since they are not mutable Seqs.
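
For context, extraStrategies is an immutable Seq, so the clone can simply share the same instance. A small illustration with a stand-in class (not the real ExperimentalMethods):

    // The var can be reassigned, but the Seq itself is immutable,
    // so sharing it between original and clone is safe.
    class ExperimentalSketch {
      var extraStrategies: Seq[String] = Nil
    }

    val original = new ExperimentalSketch
    original.extraStrategies = Seq("strategyA")

    val copied = new ExperimentalSketch
    copied.extraStrategies = original.extraStrategies   // no element-wise copy needed

    original.extraStrategies = original.extraStrategies :+ "strategyB"   // rebinds, does not mutate
    assert(copied.extraStrategies == Seq("strategyA"))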

private[sql] class SessionState(sparkSession: SparkSession) {
private[sql] class SessionState(
sparkSession: SparkSession,
existingSessionState: Option[SessionState]) {
Member

nit: existingSessionState -> parentSessionState, to indicate that we copy its internal state.

lazy val conf: SQLConf = new SQLConf
lazy val conf: SQLConf = {
val result = new SQLConf
if (existingSessionState.nonEmpty) {
Member

nit:

    existingSessionState.foreach(_.conf.getAllConfs.foreach {
      case (k, v) => if (v ne null) result.setConfString(k, v)
    })


/**
* Internal catalog for managing functions registered by the user.
*/
lazy val functionRegistry: FunctionRegistry = FunctionRegistry.builtin.copy()
lazy val functionRegistry: FunctionRegistry = {
Member

It's better to just add a copy method to FunctionRegistry to simplify this code.
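
A hedged sketch of that suggestion with simplified stand-in types (the real FunctionRegistry trait has a richer API; the copy signature here is an assumption, not the merged code):

    import scala.collection.mutable

    trait RegistrySketch {
      def register(name: String, fn: Int => Int): Unit
      def list(): Seq[String]
      def lookup(name: String): Option[Int => Int]

      // The suggestion: let the registry copy itself, so cloning SessionState is one call.
      def copySelf(): RegistrySketch = {
        val cloned = new InMemoryRegistrySketch
        list().foreach(name => lookup(name).foreach(cloned.register(name, _)))
        cloned
      }
    }

    class InMemoryRegistrySketch extends RegistrySketch {
      private val fns = mutable.Map.empty[String, Int => Int]
      def register(name: String, fn: Int => Int): Unit = fns(name) = fn
      def list(): Seq[String] = fns.keys.toSeq
      def lookup(name: String): Option[Int => Int] = fns.get(name)
    }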

@@ -115,16 +113,22 @@ class TestHiveContext(
private[hive] class TestHiveSparkSession(
@transient private val sc: SparkContext,
@transient private val existingSharedState: Option[SharedState],
existingSessionState: Option[SessionState],
Member

Looks like you don't need to change this file?

@@ -151,7 +155,7 @@ private[hive] class TestHiveSparkSession(
new TestHiveSessionState(self)

override def newSession(): TestHiveSparkSession = {
Member

You can change it to override def newSession(inheritSessionState: Boolean) instead

* This method will force the initialization of the shared state to ensure that parent
* and child sessions are set up with the same shared state. If the underlying catalog
* implementation is Hive, this will initialize the metastore, which may take some time.
*/
Member

nit: please add @Experimental and @InterfaceStability.Evolving

Contributor

why not remove the boolean flag and just call this cloneSession?

Contributor Author

That seems cleaner, fixed.

@@ -213,6 +218,24 @@ class SparkSession private(
new SparkSession(sparkContext, Some(sharedState))
}

/**
* Start a new session, sharing the underlying `SparkContext` and cached data.
Member

nit: add :: Experimental ::

val activeSession = SparkSession
.builder()
.master("local")
.config("spark-configb", "b")
Member

This is in the shared state. You should use SparkSession.conf.set instead.
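
A sketch of the suggested test setup (assumed shape, not the exact patch): per-session options should go through the session's RuntimeConfig so the test exercises SessionState copying rather than context-level configuration.

    val activeSession = SparkSession.builder().master("local").getOrCreate()
    activeSession.conf.set("spark-configb", "b")   // session-local SQL conf, inherited by the clone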

@zsxwing (Member) commented Feb 8, 2017

@kunalkhamar by the way, please add the JIRA number to the title.

@SparkQA commented Feb 9, 2017

Test build #72606 has finished for PR 16826 at commit a343d8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Feb 9, 2017

@kunalkhamar you should create a JIRA ticket for this.

In addition, I'm not a big fan of the design that passes a base session in. It would be simpler if there were just a clone method on SessionState and the associated states we store; cloning a SparkSession is then just creating a new SparkSession with a cloned SessionState.
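
A rough sketch of that shape, with placeholder classes rather than the real Spark ones:

    // SessionState knows how to clone itself; cloning a SparkSession then just wraps
    // a cloned state. The single map stands in for conf, catalog, registry, etc.
    class SessionStateSketch(val state: Map[String, String]) {
      def cloneState(): SessionStateSketch = new SessionStateSketch(state)
    }

    class SparkSessionSketch(val sessionState: SessionStateSketch) {
      def cloneSession(): SparkSessionSketch =
        new SparkSessionSketch(sessionState.cloneState())
    }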

@kunalkhamar kunalkhamar changed the title Fork SparkSession with option to inherit a copy of the SessionState. [Spark-5864][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 9, 2017
@kunalkhamar kunalkhamar changed the title [Spark-5864][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 10, 2017
@kunalkhamar kunalkhamar changed the title [Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [WIP][Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 10, 2017
@SparkQA commented Feb 10, 2017

Test build #72679 has finished for PR 16826 at commit 4210079.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kunalkhamar kunalkhamar changed the title [WIP][Spark-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [WIP][SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Feb 10, 2017
@SparkQA commented Feb 11, 2017

Test build #72727 has finished for PR 16826 at commit 6da6bda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 14, 2017

Test build #72897 has finished for PR 16826 at commit 579d0b7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

…tialize all fields directly instead. Same change for HiveSessionState.
@kunalkhamar kunalkhamar changed the title [WIP][SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState [WIP][SPARK-19540][SQL] Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState Feb 16, 2017
@SparkQA commented Feb 16, 2017

Test build #72968 has finished for PR 16826 at commit 2837e73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 16, 2017

Test build #72970 has finished for PR 16826 at commit 8c00344.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Feb 16, 2017

What's WIP about this?

@kunalkhamar (Contributor Author) commented Feb 16, 2017

@rxin
The change from SessionState(SparkSession, parentSessionState) to SessionState(SparkSession, SQLConf, SessionCatalog, ...) is mostly done. I still have 4 failing tests (OOMs) and will create some more tests.

Also, currently SparkSession has-a SessionState and SessionState has-a SparkSession. I am investigating removing the SparkSession reference from inside SessionState. This would change the constructor from SessionState(SparkSession, SQLConf, SessionCatalog, ...) to SessionState(SparkContext, SQLConf, SessionCatalog, ..., more params that depend on SparkSession).

There are fields inside SessionState that depend on SparkSession, e.g. Analyzer and QueryExecution, but these can be created and passed in to SessionState during creation. It seems we might be able to pull up all such references to SparkSession into initialization (i.e. inside SessionState.apply). The same goes for HiveSessionState.
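
A rough sketch of the constructor shape being described (placeholder types; the real signature has more parameters):

    // SessionState receives its pieces directly; anything that needs the SparkSession
    // (analyzer, query execution factory, ...) is built once in a companion apply().
    class SessionStateShape(
        val conf: Map[String, String],
        val catalog: AnyRef,
        val queryExecutionCreator: String => String)   // stand-in for LogicalPlan => QueryExecution

    object SessionStateShape {
      def apply(session: AnyRef): SessionStateShape = {
        // the only place that touches session-level context
        val qeCreator = (plan: String) => s"QueryExecution($plan)"
        new SessionStateShape(Map.empty, new Object, qeCreator)
      }
    }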

Any thoughts on this?

@SparkQA commented Mar 6, 2017

Test build #74039 has finished for PR 16826 at commit 2740c63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74044 has finished for PR 16826 at commit 0f167db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74056 has finished for PR 16826 at commit 2f0b1ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val original = new SessionCatalog(externalCatalog)
val tempTable1 = Range(1, 10, 1, 10)
val db1 = "copytest1"
original.createTempView(db1, tempTable1, overrideIfExists = false)
Member

db1? Let us give the view a more reasonable name.

// check if clone and original independent
val db2 = "copytest2"
val tempTable2 = Range(1, 20, 2, 20)
clone.createTempView(db2, tempTable2, overrideIfExists = false)
Member

the same here.

original.createTempView(db3, tempTable3, overrideIfExists = false)
original.setCurrentDatabase(db3)
assert(clone.getCurrentDatabase == db2)
}
@gatorsmile (Member) commented Mar 7, 2017

How about?

  test("clone SessionCatalog - current db") {
    val externalCatalog = newEmptyCatalog()
    val db1 = "db1"
    val db2 = "db2"
    val db3 = "db3"

    externalCatalog.createDatabase(newDb(db1), ignoreIfExists = true)
    externalCatalog.createDatabase(newDb(db2), ignoreIfExists = true)
    externalCatalog.createDatabase(newDb(db3), ignoreIfExists = true)

    val original = new SessionCatalog(externalCatalog)
    original.setCurrentDatabase(db1)

    // check if current db copied over
    val clone = original.clone(
      SimpleCatalystConf(caseSensitiveAnalysis = true),
      new Configuration(),
      new SimpleFunctionRegistry,
      CatalystSqlParser)
    assert(original != clone)
    assert(clone.getCurrentDatabase == db1)

    // check if clone and original independent
    clone.setCurrentDatabase(db2)
    assert(original.getCurrentDatabase == db1)
    original.setCurrentDatabase(db3)
    assert(clone.getCurrentDatabase == db2)
  }

Member

Done.

* @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations.
* @param streamingQueryManager Interface to start and stop
* [[org.apache.spark.sql.streaming.StreamingQuery]]s.
* @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]]
Member

Let us document all params?

Member

done

Contributor

Hm, most of the param documentation here is actually kind of useless (very little information conveyed).

Contributor Author

@rxin Removing the redundant comments in SPARK-20048.


/**
* SQL-specific key-value configurations.
* Logical query plan optimizer.
Member

Nit: remove the extra space

@@ -278,6 +278,8 @@ private[hive] class HiveClientImpl(
state.getConf.setClassLoader(clientLoader.classLoader)
// Set the thread local metastore client to the client associated with this HiveClientImpl.
Hive.set(client)
// Replace conf in the thread local Hive with current conf
Hive.get(conf)
Member

Because IsolatedClientLoader is shared? If we do not make this change, does any test case fail?

Member

Because of reusing IsolatedClientLoader.cachedHive. Without this line, it may use an out-of-date HiveConf. The failed test is HiveSparkSubmitSuite.test("SPARK-18360: default table path of tables in default database should depend on the " + "location of default database").

Member

Thank you!

functionRegistry: FunctionRegistry,
override val catalog: HiveSessionCatalog,
sqlParser: ParserInterface,
val metadataHive: HiveClient,
Member

This is for avoiding using lazy val?

Member

It is to avoid using SparkSession in this class.

analyzer: Analyzer,
streamingQueryManager: StreamingQueryManager,
queryExecutionCreator: LogicalPlan => QueryExecution,
val plannerCreator: () => SparkPlanner)
Member

How about adding val planner to SessionState? So far, the interface of HiveSessionState looks a little bit complex to me.

@gatorsmile (Member) commented Mar 7, 2017

Just checked how CarbonData extends HiveSessionState. Basically, this PR will break what they completed a few months ago. : )

Member

Right now, SessionState.planner is a method. So it will return a new SparkPlanner using the latest experimentalMethods.extraStrategies every time. Changing it to a val is a breaking change.
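
A tiny illustration of why the def matters (stand-in types, not Spark's):

    class ExpMethodsSketch { var extraStrategies: Seq[String] = Nil }

    class StateSketch(exp: ExpMethodsSketch) {
      // def: rebuilt on every call, so it picks up strategies added later
      def planner: Seq[String] = exp.extraStrategies :+ "defaultStrategy"
      // val: captured once at construction time, so it would not
      val frozenPlanner: Seq[String] = exp.extraStrategies :+ "defaultStrategy"
    }

    val exp = new ExpMethodsSketch
    val state = new StateSketch(exp)
    exp.extraStrategies = Seq("myStrategy")
    assert(state.planner.contains("myStrategy"))
    assert(!state.frozenPlanner.contains("myStrategy"))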

Member

uh, I see

@SparkQA commented Mar 7, 2017

Test build #74079 has finished for PR 16826 at commit c41e7bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74124 has finished for PR 16826 at commit 5eb6733.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveSessionCatalogSuite extends TestHiveSingleton

@kunalkhamar (Contributor Author) left a comment

Looks close to done!

activeSession.stop()
activeSession = null
}
super.afterAll()
}

test("fork new session and inherit RuntimeConfig options") {
val key = "spark-config-clone"
activeSession.conf.set(key, "active")
Contributor Author

Should this be inside try {}?

val spark = activeSession
// Cannot use `import activeSession.implicits._` due to the compiler limitation.
import spark.implicits._

activeSession
Contributor Author

Should the temp view creation be inside the try block?

original.setCurrentDatabase(db3)
assert(clone.getCurrentDatabase == db2)
}

test("SPARK-19737: detect undefined functions without triggering relation resolution") {
Contributor Author

Is this supposed to be part of this PR?

* @param conf SQL-specific key-value configurations.
* @param experimentalMethods The experimental methods.
* @param functionRegistry Internal catalog for managing functions registered by the user.
* @param catalog Internal catalog for managing table and database states.
Contributor Author

Add a comment on the difference from SessionCatalog: HiveSessionCatalog uses the Hive client for interacting with the metastore.

@SparkQA commented Mar 7, 2017

Test build #74134 has finished for PR 16826 at commit 05abcf8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Mar 7, 2017

retest this please

@zsxwing (Member) commented Mar 8, 2017

retest this please. The running one will fail because master was broken.

@SparkQA commented Mar 8, 2017

Test build #74148 has finished for PR 16826 at commit 05abcf8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2017

Test build #74165 has finished for PR 16826 at commit 05abcf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2017

Test build #74215 has finished for PR 16826 at commit 4c23e7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class LongWrapper
  • public static class IntWrapper
  • case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper
  • case class JoinPlan(itemIds: Set[Int], plan: LogicalPlan, joinConds: Set[Expression], cost: Cost)
  • case class Cost(rows: BigInt, size: BigInt)
  • abstract class RepartitionOperation extends UnaryNode
  • trait WatermarkSupport extends UnaryExecNode

@zsxwing (Member) commented Mar 8, 2017

LGTM. Merging to master.

@asfgit asfgit closed this in 6570cfd Mar 8, 2017
asfgit pushed a commit that referenced this pull request Mar 29, 2017
…n listeners

## What changes were proposed in this pull request?

Bugfix from [SPARK-19540.](#16826)
Cloning SessionState does not clone query execution listeners, so the cloned session is unable to listen to events on queries.

## How was this patch tested?

- Unit test

Author: Kunal Khamar <kkhamar@outlook.com>

Closes #17379 from kunalkhamar/clone-bugfix.
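
A rough sketch of what this follow-up fix addresses, with stand-in types (the real change copies the registered query execution listeners when SessionState is cloned):

    import scala.collection.mutable

    // Stand-in listener manager: cloning must carry over registered listeners,
    // otherwise the cloned session silently stops reporting query events.
    class ListenerManagerSketch {
      private val listeners = mutable.ArrayBuffer.empty[String => Unit]
      def register(l: String => Unit): Unit = listeners += l
      def post(event: String): Unit = listeners.foreach(_(event))

      def cloneManager(): ListenerManagerSketch = {
        val copied = new ListenerManagerSketch
        listeners.foreach(copied.register)   // the step that was missing before the fix
        copied
      }
    }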