
[SPARK-14289][WIP] Support multiple eviction strategies for cached RDD partitions #12162

Closed
wants to merge 8 commits

Conversation

Earne
Contributor

@Earne Earne commented Apr 5, 2016

What changes were proposed in this pull request?

Currently, LRU is the only eviction strategy for cached RDD partitions in Spark.
This pull request refactors the memory store and adds support for multiple eviction strategies, such as FIFO, LFU (WIP), and LCS (WIP).

How was this patch tested?

Manually tested by setting "spark.memory.entryEvictionPolicy" to LRU (default), FIFO, or LCS.
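For reference, a minimal sketch of how this setting would be supplied through the standard SparkConf mechanism (note that spark.memory.entryEvictionPolicy exists only on this branch, not in any released Spark version):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "spark.memory.entryEvictionPolicy" is the configuration key
// proposed by this PR and is not available in released Spark.
val conf = new SparkConf()
  .setAppName("EvictionPolicyTest")
  .set("spark.memory.entryEvictionPolicy", "FIFO") // or "LRU" (default), "LCS"
val sc = new SparkContext(conf)
```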

@Earne Earne changed the title [SPARK-14289][WIP] Add support to multiple eviction strategies for cached RDD partitions [SPARK-14289][WIP] Support multiple eviction strategies for cached RDD partitions Apr 5, 2016
@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Apr 5, 2016

Thanks for the pull request. Is this actually motivated by a real use case, or just doing it because it might be good to support more than one policy?

@Earne
Contributor Author

Earne commented Apr 6, 2016

@rxin The use case that motivated this is described below.

  • Java objects consume a factor of 2-5x more space than the “raw” data inside their fields.

  • Running the graphx.LiveJournalPageRank example on an 8-node cluster (one node acting as master, each node configured with 45 GB of memory for Spark, running in legacy memory-management mode). The dataset (about 30 GB) was generated by HiBench; over 5 iterations, each iteration takes longer and longer.

  • By analyzing the log file, I realized this is because the memory space for cached RDDs is insufficient, so many partitions with high recomputation cost are dropped. Recomputing these partitions takes a lot of time.

  • FIFO can be implemented by initializing entries with LinkedHashMap[BlockId, MemoryEntry[_]](32, 0.75f, false), i.e. with accessOrder = false (see the sketch after this list). Even FIFO can achieve much better performance than LRU on this workload.

  • A storage level such as MEMORY_AND_DISK may partially solve the problem, but the effect is not very good.

    An eviction strategy that takes the computing cost into consideration may work well (even in unified memory mode, or with the MEMORY_AND_DISK level). Cost-aware replacement policies already exist in K-V stores, such as GD-Wheel (EuroSys '15).
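As a minimal, self-contained illustration of the LinkedHashMap trick mentioned above (plain Scala, not Spark internals): the third constructor argument, accessOrder, switches the same structure between LRU and FIFO iteration order, and eviction walks from the head.

```scala
import java.util.LinkedHashMap

object EvictionOrderDemo {
  def main(args: Array[String]): Unit = {
    // accessOrder = true reorders entries on access (LRU);
    // accessOrder = false keeps pure insertion order (FIFO).
    val lru  = new LinkedHashMap[String, Integer](32, 0.75f, true)
    val fifo = new LinkedHashMap[String, Integer](32, 0.75f, false)

    for (m <- Seq(lru, fifo); k <- Seq("a", "b", "c")) m.put(k, 0)

    lru.get("a")   // touching "a" moves it to the tail of the LRU map
    fifo.get("a")  // no effect on iteration order in the FIFO map

    // Eviction scans from the head, so the first key is the next victim.
    println(lru.keySet())  // [b, c, a] -> "b" would be evicted first
    println(fifo.keySet()) // [a, b, c] -> "a" would be evicted first
  }
}
```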

This PR can be separated into the sub-tasks below.

  • Refactor to support more than one policy (LRU, FIFO, LFU).
  • Add a policy that takes the computing cost into consideration.
  • Take serialization and deserialization cost into consideration.

@mozinrat

mozinrat commented Nov 8, 2016

@Earne Was anything related to this merged in Spark 2.0.1? Do we have a FIFO eviction policy, and if so, how can I leverage it?

@michaelmior
Member

michaelmior commented May 5, 2017

This branch appears to be incomplete. The configuration parameter entryEvictionPolicy does not exist, and a good chunk of the code is never called.

@HyukjinKwon
Member

@Earne, is this still active, and do you have any opinion on the comments above? Otherwise, I will propose closing this.

@mmakdessii

I'm working on my thesis on improving cache-management systems, but I don't know anything about Spark. I found this project and don't even know how to run it. Could someone point me to a video or the steps needed to run this code? If I can see a sample implementation of LRU and understand how it's built step by step, I'll be able to implement my own algorithm. I would be very grateful for any help!

@michaelmior
Member

As best I can tell, the code that was pushed here is incomplete. However, Spark's default cache eviction policy is LRU. You can find the code which performs eviction here. It basically just works by storing all the data in a LinkedHashMap configured to track which elements were accessed most recently.
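For anyone trying to reproduce the idea outside Spark, here is a simplified sketch (illustrative only, not Spark's actual MemoryStore code) of how an access-ordered LinkedHashMap yields LRU eviction:

```scala
import java.util.LinkedHashMap

// Simplified LRU block store; illustrative only, not Spark's MemoryStore.
class TinyLruStore(capacityBytes: Long) {
  // accessOrder = true: iteration order is least -> most recently used.
  private val entries = new LinkedHashMap[String, Array[Byte]](32, 0.75f, true)
  private var used = 0L

  def put(id: String, bytes: Array[Byte]): Unit = entries.synchronized {
    // Drop blocks from the head (least recently used) until the new one fits.
    val it = entries.values().iterator()
    while (used + bytes.length > capacityBytes && it.hasNext) {
      used -= it.next().length
      it.remove()
    }
    entries.put(id, bytes)
    used += bytes.length
  }

  def get(id: String): Option[Array[Byte]] = entries.synchronized {
    Option(entries.get(id)) // get() also refreshes this block's recency
  }
}
```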

@Earne Earne deleted the SPARK-14289 branch December 21, 2017 13:14