[SPARK-19900][core] Remove driver when relaunching. #18084
Conversation
ok to test
    driver.state = DriverState.RELAUNCHING
    waitingDrivers += driver
    removeDriver(driver.id, DriverState.RELAUNCHING, None)
    val newDriver = createDriver(driver.desc)
Do you have a good reason to remove and create the driver in this case? It looks like some kind of overkill compared to the old logic.
First, we must distinguish the original driver from the newly relaunched one, because status updates for both versions will arrive at the master. For example, when a network-partitioned worker reconnects to the master, it will send `DriverStateChanged` with the driver id, and the master must recognize that this is the state of the original driver, not the state of the newly launched driver.

The patch simply chooses a new driver id to do this, which has some shortcomings, however. For example, in the UI the two versions of the driver are not related, and the final state of the original is RELAUNCHING (which would arguably read better as RELAUNCHED).

Another way is to add something like an `attemptId` to the driver state, and then let `DriverStateChanged` carry the `attemptId` to indicate which instance it refers to. This seems more complex.

What's your opinion?
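A minimal sketch of the relaunch-with-new-id approach discussed above (the names `RelaunchSketch`, `DriverInfo`, `relaunch`, and `onDriverStateChanged` are illustrative and do not match the real Master internals):

```scala
import scala.collection.mutable

// Hypothetical model: the master retires the old driver id on relaunch,
// so a late DriverStateChanged carrying the old id matches nothing.
object RelaunchSketch {
  final case class DriverInfo(id: String, desc: String, var state: String)

  private var nextId = 0
  val drivers = mutable.Map.empty[String, DriverInfo]

  def createDriver(desc: String): DriverInfo = {
    val d = DriverInfo(s"driver-$nextId", desc, "SUBMITTED")
    nextId += 1
    drivers(d.id) = d
    d
  }

  // Relaunch: retire the old id and submit a fresh driver under a new id.
  def relaunch(old: DriverInfo): DriverInfo = {
    drivers -= old.id
    createDriver(old.desc)
  }

  // A late status update from a partitioned worker references the old id,
  // which no longer matches any active driver, so it is ignored.
  def onDriverStateChanged(id: String, state: String): Boolean =
    drivers.get(id) match {
      case Some(d) => d.state = state; true
      case None    => false // stale update for a retired driver id
    }
}
```

With distinct ids, a state update for the original driver cannot be confused with the relaunched one; the trade-off, as noted above, is that the two drivers appear unrelated in the UI.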
    val driverEnv2 = RpcEnv.create("driver2", "localhost", 22345, conf, new SecurityManager(conf))
    val fakeDriver2 = driverEnv2.setupEndpoint("driver", new RpcEndpoint {
I believe this duplicate code can be combined.
Updated, please have a look.
Test build #77312 has finished for PR 18084 at commit
Test build #77353 has finished for PR 18084 at commit
Maybe some more actions should be done in … Now, to help us step forward, would you like to spend some time to create a valid regression test case? That will help a lot when we further discuss the proper bug-fix proposal.
Thanks for the reply. I have added some more tests to verify the state of the master and worker after relaunching. I will try to think about whether there are ways to reuse the old driver struct.
Hi, I've thought more thoroughly about this. The main state involved here is Master.workers, Master.idToWorker, and WorkerInfo.drivers. Say worker A reconnects: it will re-register to the master, and the master will then remove the old WorkerInfo (whose …). How to recognize the …

Now, how does worker A handle the …? After all this, I think it is better to relaunch the driver with a new id to keep things simple. As to the cost, …
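A rough sketch of the re-registration bookkeeping described above (the names `ReregisterSketch` and `WorkerInfoSketch` are hypothetical; this is a simplified model of Master.workers / Master.idToWorker, not the real Master code):

```scala
import scala.collection.mutable

// Plain class (reference equality) so instances stay safe inside a HashSet
// even when their mutable drivers set changes.
final class WorkerInfoSketch(val id: String) {
  val drivers = mutable.Set.empty[String]
}

object ReregisterSketch {
  val workers = mutable.Set.empty[WorkerInfoSketch]
  val idToWorker = mutable.Map.empty[String, WorkerInfoSketch]

  // On (re-)registration, drop any stale WorkerInfo recorded under the same
  // worker id before installing a fresh one.
  def registerWorker(id: String): WorkerInfoSketch = {
    idToWorker.remove(id).foreach(workers -= _)
    val w = new WorkerInfoSketch(id)
    workers += w
    idToWorker(id) = w
    w
  }
}
```

The key point is that the old WorkerInfo, and with it the record of the drivers it was running, is discarded on re-registration, which is why the late driver state must be reconciled separately.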
Test build #77423 has finished for PR 18084 at commit
ping @jiangxb1987
Since the issue is caused by the fact that we may have two running Driver instances with the same id under some conditions (worker lost and later rejoins), we have to resolve the root cause by adding a workerId to the DriverStateChanged message, so we can decide whether we should remove the driver we recorded.
Hi, adding a workerId may not work. For example, consider this scenario: …

Now, what should the master do?
We should also check in the Worker that we don't launch duplicate drivers; I think the logic should be added when handling …
OK, another scenario:
I think your point is: if a LaunchDriver message and a KillDriver message are sent out simultaneously, there is a race condition because the order in which the messages arrive at the worker is not determined. If the KillDriver message arrives later, we finally get a finished driver instead of a running driver.
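The race just described can be modeled in a few lines (the names `RaceSketch`, `onLaunchDriver`, and `onKillDriver` are illustrative; the real Worker logic differs). The idea is that if the kill is processed first, the worker must remember it so a late launch message cannot resurrect the driver:

```scala
import scala.collection.mutable

// Toy model of the worker-side handling of out-of-order
// LaunchDriver / KillDriver messages.
object RaceSketch {
  val running = mutable.Set.empty[String]
  val killed = mutable.Set.empty[String]

  // Ignore a launch for a driver that was already killed.
  def onLaunchDriver(driverId: String): Unit =
    if (!killed(driverId)) running += driverId

  def onKillDriver(driverId: String): Unit = {
    killed += driverId
    running -= driverId
  }
}
```

Either arrival order then converges to the same final state: the driver is not running.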
The fix LGTM; it would be better to add some comments to explain it clearly.

Also please rebase onto the latest master :)
* 'master' of https://github.com/apache/spark: (149 commits)
  [SPARK-19753][CORE] Un-register all shuffle output on a host in case of slave lost or fetch failure
  [SPARK-20986][SQL] Reset table's statistics after PruneFileSourcePartitions rule.
  [SPARK-12552][CORE] Correctly count the driver resource when recovering from failure for Master
  [SPARK-21016][CORE] Improve code fault tolerance for converting string to number
  [SPARK-21051][SQL] Add hash map metrics to aggregate
  [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite
  [SPARK-21060][WEB-UI] Css style about paging function is error in the executor page. Css style about paging function is error in the executor page. It is different of history server ui paging function css style.
  [SPARK-21039][SPARK CORE] Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
  [SPARK-21006][TESTS][FOLLOW-UP] Some Worker's RpcEnv is leaked in WorkerSuite
  [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions
  [TEST][SPARKR][CORE] Fix broken SparkSubmitSuite
  [SPARK-19910][SQL] `stack` should not reject NULL values due to type mismatch
  Revert "[SPARK-21046][SQL] simplify the array offset and length in ColumnVector"
  [SPARK-20979][SS] Add RateSource to generate values for tests and benchmark
  [SPARK-21050][ML] Word2vec persistence overflow bug fix
  [SPARK-21059][SQL] LikeSimplification can NPE on null pattern
  [SPARK-20345][SQL] Fix STS error handling logic on HiveSQLException
  [SPARK-17914][SQL] Fix parsing of timestamp strings with nanoseconds
  [SPARK-21046][SQL] simplify the array offset and length in ColumnVector
  [SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range
  ...
Test build #78023 has finished for PR 18084 at commit
    case RegisteredWorker(masterRef, _, _) =>
      masterRef.send(WorkerLatestState(id, Nil, drivers.keys.toSeq))
    case LaunchDriver(driverId, desc) =>
      drivers(driverId) = driverId
seems `drivers` can be a set instead of a map?
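The reviewer's point, sketched: since the quoted diff stores the driver id as both key and value (`drivers(driverId) = driverId`), a mutable Set carries the same information. The variable names below are illustrative:

```scala
import scala.collection.mutable

// The quoted diff stores drivers(driverId) = driverId, so the value
// merely duplicates the key.
val driversMap = mutable.Map.empty[String, String]
driversMap("driver-0") = "driver-0"

// A Set expresses the same membership information directly.
val driversSet = mutable.Set.empty[String]
driversSet += "driver-0"
```

Both structures answer the only question asked of them ("is this driver id present?"); the Set just says so without the redundant value.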
    val master = makeMaster(conf)
    master.rpcEnv.setupEndpoint(Master.ENDPOINT_NAME, master)
    eventually(timeout(10.seconds)) {
      val masterState = master.self.askSync[MasterStateResponse](RequestMasterState)
shall we move this out of the `eventually {...}` block?
Hi, this cannot be moved, because the `MasterStateResponse` changes over time. If we move the RPC out, the masterState will never change, and the assert will fail. See the test `SPARK-20529:...` above; it uses the same eventually assertion.
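The point being made can be shown with a minimal stand-in for ScalaTest's `eventually` (the helper `eventuallySketch` is hypothetical, not the actual test utility): the whole block is retried, so the `askSync` must sit inside it for each attempt to fetch fresh state.

```scala
// Minimal stand-in for ScalaTest's eventually: retry the block until it
// stops throwing or the deadline passes. Illustrative only.
def eventuallySketch[T](timeoutMs: Long, intervalMs: Long = 50)(block: => T): T = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (true) {
    try {
      return block // each retry re-evaluates the whole block
    } catch {
      case e: Throwable =>
        if (System.currentTimeMillis() > deadline) throw e
        Thread.sleep(intervalMs)
    }
  }
  throw new IllegalStateException("unreachable")
}
```

If the RPC were hoisted outside the block, every retry would re-assert against the same stale snapshot and could never succeed.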
    eventually(timeout(10.seconds)) {
      val masterState = master.self.askSync[MasterStateResponse](RequestMasterState)
ditto
LGTM, pending test
Test build #78049 has finished for PR 18084 at commit
thanks, merging to master!
This is #17888.

Below are some Spark UI snapshots.

Master, after worker disconnects:
![master_disconnect](https://cloud.githubusercontent.com/assets/2576762/26398687/d0ee228e-40ac-11e7-986d-d3b57b87029f.png)

Master, after worker reconnects, notice the `running drivers` part:
![master_reconnects](https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png)

This patch, after worker disconnects:
![patch_disconnect](https://cloud.githubusercontent.com/assets/2576762/26398009/c015d3dc-40aa-11e7-8bb4-df11a1f66645.png)

This patch, after worker reconnects:
![image](https://cloud.githubusercontent.com/assets/2576762/26398037/d313769c-40aa-11e7-8613-5f157d193150.png)

cc @cloud-fan @jiangxb1987

Author: Li Yichao <lyc@zhihu.com>

Closes #18084 from liyichao/SPARK-19900-1.