Adding utility threads for anti-cache eviction #175

apavlo · 2014-09-06T00:39:59Z

The following is a rough outline for how to add support for an additional thread to operate "down" in the EE while the main PartitionExecutor thread processes transactions. This is not possible in the current architecture because there is a single shared buffer that we use to pass data + error codes between the Java layer and the C++ layer (through JNI). The mapping between the Java and C++ layers is as follows:

Exception Buffer: ExecutionEngineJNI.exceptionBuffer -> `VoltDBEngine::m_reusedResultBuffer`
VoltTable Result Buffer: ExecutionEngineJNI.deserializer (this is a wrapper for the ByteBuffer) -> `VoltDBEngine::m_reusedResultBuffer`

Note that the memory is allocated in Java and then we pass down the pointers to the C++ layer.

To give an example of why this is a problem with the existing AntiCacheManager implementation, here is a race condition that can occur. The Java AntiCacheManager has its own thread that it uses to unevict data at a partition. If this uneviction process encounters an error, it will write a SerializedException into that partition's shared buffer. If the PartitionExecutor is processing a txn at the same time, it may trip its own exception and want to write into that same buffer. If the other thread is trying to deserialize its exception at that moment, the contents will get clobbered. This is a race condition that we have definitely seen crop up before.
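To make the clobbering concrete, here is a minimal single-threaded sketch that replays the problematic interleaving in a fixed order. It is only an illustration: `SharedBufferSketch`, `writeError`, and `readError` are hypothetical names, and the length-prefixed string stands in for a real SerializedException. It does not use the actual ExecutionEngineJNI code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class SharedBufferSketch {
    // Hypothetical stand-in for serializing an exception:
    // a 4-byte length prefix followed by the UTF-8 message.
    static void writeError(ByteBuffer buf, String msg) {
        byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
        buf.clear();
        buf.putInt(bytes.length);
        buf.put(bytes);
    }

    // Reads the message back from offset 0 without disturbing the writer's position.
    static String readError(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        view.position(0);
        int len = view.getInt();
        byte[] bytes = new byte[len];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One shared buffer, allocated in Java as in the current design.
        ByteBuffer shared = ByteBuffer.allocateDirect(1024);
        writeError(shared, "unevict failed"); // utility (AntiCacheManager) thread
        writeError(shared, "txn aborted");    // PartitionExecutor writes before the read below
        System.out.println("utility thread sees: " + readError(shared)); // the wrong error

        // Same interleaving, but the utility thread has its own buffer.
        ByteBuffer main = ByteBuffer.allocateDirect(1024);
        ByteBuffer utility = ByteBuffer.allocateDirect(1024);
        writeError(utility, "unevict failed");
        writeError(main, "txn aborted");
        System.out.println("utility thread sees: " + readError(utility));
    }
}
```

With the single buffer, the utility thread ends up deserializing the executor's error instead of its own. With two buffers, each side reads back what it wrote.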

What needs to happen is that we need to make a separate buffer for data and exceptions for utility operations. This will allow us to evict and unevict data in a separate thread without worrying about overwriting the main buffers.

1. Add this new utility buffer to ExecutionEngineJNI and update the parameters to ExecutionEngine.nativeSetBuffers() to pass down this new buffer pointer. You can see how we did the same thing with ExecutionEngineJNI.ariesLogBuffer. You will need to update VoltDBEngine::setBuffers() accordingly.
2. Modify VoltDBEngine::antiCacheReadBlocks() to use this new utility buffer when there is an exception. You can see the FIXME in the code that makes reference to this problem. You will also need to modify ExecutionEngineJNI.antiCacheReadBlocks() to look for errors in the new utility buffers.
3. The final step is to add an additional utility thread for evicting data without needing to block transactions. This is more complicated because we need to collect blocks of cold tuples to evict but also make sure that those tuples are not being used by an active txn. We may want to add a new global flag in the EE that tells us that the eviction thread is doing something and keep track of the read-write sets of any txn that executes during this brief window. When we're ready to do the eviction, we then set a lock to prevent txns from executing, remove any tuples from the block we're about to evict that are also in the active txn's ReadWriteSet, and then write out the block. I'm not sure how we want to do this just yet because I don't want to have to check for a lock every time we normally execute txns. We can talk about this problem when you get to this point.
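The eviction-window idea in the last step could look roughly like the sketch below. Everything here is an assumption for illustration: `EvictionWindowSketch` and its methods are hypothetical, tuples are reduced to `long` ids, and the real logic would live in the C++ EE rather than in Java. The lock only guards the bookkeeping, and the volatile flag is one possible way to keep the normal txn path from taking a lock when no eviction is running. It does not settle the open question above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

final class EvictionWindowSketch {
    private final ReentrantLock lock = new ReentrantLock();
    // "Global flag": true while the eviction thread is collecting cold tuples.
    private volatile boolean evictionInProgress = false;
    // Tuple ids read or written by txns while the window is open.
    private final Set<Long> touchedDuringWindow = new HashSet<>();

    // Eviction thread: open the window before collecting candidates.
    void beginEviction() {
        lock.lock();
        try {
            touchedDuringWindow.clear();
            evictionInProgress = true;
        } finally {
            lock.unlock();
        }
    }

    // Executor thread: called for each tuple a txn touches.
    void recordAccess(long tupleId) {
        if (!evictionInProgress) {
            return; // fast path: no lock when no eviction is running
        }
        lock.lock();
        try {
            if (evictionInProgress) {
                touchedDuringWindow.add(tupleId);
            }
        } finally {
            lock.unlock();
        }
    }

    // Eviction thread: drop candidates touched during the window, close the
    // window, and return the tuple ids that are safe to write out.
    List<Long> finishEviction(List<Long> coldCandidates) {
        lock.lock();
        try {
            List<Long> block = new ArrayList<>();
            for (Long id : coldCandidates) {
                if (!touchedDuringWindow.contains(id)) {
                    block.add(id);
                }
            }
            evictionInProgress = false;
            touchedDuringWindow.clear();
            return block;
        } finally {
            lock.unlock();
        }
    }
}
```

For example, if tuple 2 is accessed after `beginEviction()` and the candidates are 1, 2, and 3, then `finishEviction` returns only 1 and 3.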


mjgiardino · 2014-09-16T23:54:44Z

Edit: Never mind, I think I found the problem. I'm throwing the Exception all the way up, but ExecutionEngineJNI isn't checking the right buffer for the exception, which causes a failure.

In point 2: "You will also need to modify ExecutionEngineJNI.antiCacheReadBlocks() to look for errors in the new utility buffers."

What kind of errors should I be checking for?

After adding the additional buffer, I'm failing three tests, two of which are related to not catching an UnknownBlockAccessException correctly. I am pretty sure I have initialized and passed the new buffer correctly, though it must not be passing exceptions back up to the Java frontend.

Here is the latest commit.

Thanks.

mjgiardino · 2014-09-18T00:09:23Z

Any idea what could be causing this? The logs of commit 52 just have this single error and I'm not sure how to diagnose the source. I'm not sure why this test would have an issue with the changes I've made, as they should only affect the anti-caching code.

[junit] Running org.voltdb.regressionsuites.TestPlansGroupBySuite
[junit] 
[junit] Exception: java.lang.NullPointerException thrown from the UncaughtExceptionHandler in thread "ServerThread"
[junit] Running org.voltdb.regressionsuites.TestPlansGroupBySuite
[junit]     org.voltdb.regressionsuites.TestPlansGroupBySuite:testDistributedSumAndGroup-localCluster-2-2-JNI had an error.
[junit] Tests run:   0, Failures:   0, Errors:   1, Time elapsed: 0.00 sec

zheguang · 2014-10-15T15:20:48Z

How should I go about testing items 1 and 2? In general, when would AntiCacheEvictionManager::readBlock throw an exception, besides when it's a system error such as UnknownBlockAccess?

mjgiardino · 2014-10-15T16:50:57Z

To be honest, the Exception mechanisms are not my strongest coding area, so if I were doing it by myself, I would simply query the AntiCacheEvictionManager for an Abort true/false. Exceptions are probably a better-engineered solution.

AntiCacheEvictionManager::readBlock() could throw a (to be implemented) AbortAndReissueException when the block needed is from an SSD or disk backing store. I don't have a function/method yet to relay that information but the AntiCacheEvictionManager will know for which AntiCacheDBs it will stall and for which ones transactions will be aborted and reissued. Perhaps a boolean-returning method TransactionAbort? Something like that is what I would implement.
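A rough sketch of how the two options (an exception versus a boolean query) could fit together is below. It is written in Java only for illustration; the actual readBlock() is in the C++ EE. `ReadBlockSketch`, `Layer`, and `shouldAbortTransaction` are hypothetical names, and `AbortAndReissueException` is the not-yet-implemented exception mentioned above.

```java
final class ReadBlockSketch {
    // Hypothetical exception: the txn should be aborted and reissued later.
    static final class AbortAndReissueException extends RuntimeException {
        final int blockId;

        AbortAndReissueException(int blockId) {
            super("block " + blockId + " is on an aborting layer");
            this.blockId = blockId;
        }
    }

    // Hypothetical stand-in for one AntiCacheDB level.
    static final class Layer {
        final String name;
        final boolean stalls; // true: txn waits for the read; false: txn is aborted

        Layer(String name, boolean stalls) {
            this.name = name;
            this.stalls = stalls;
        }

        // The boolean-returning query suggested above.
        boolean shouldAbortTransaction() {
            return !stalls;
        }
    }

    // Reads synchronously on a stalling layer; signals abort-and-reissue otherwise.
    static String readBlock(Layer layer, int blockId) {
        if (layer.shouldAbortTransaction()) {
            throw new AbortAndReissueException(blockId);
        }
        return "block " + blockId + " read from " + layer.name;
    }

    public static void main(String[] args) {
        Layer fast = new Layer("fast layer", true);
        Layer disk = new Layer("disk layer", false);
        System.out.println(readBlock(fast, 1));
        try {
            readBlock(disk, 2);
        } catch (AbortAndReissueException e) {
            System.out.println("abort and reissue: " + e.getMessage());
        }
    }
}
```

The caller can either ask the layer up front via the boolean method or just call `readBlock` and handle the exception; the exception route keeps the decision inside the eviction manager.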

zheguang · 2014-10-28T19:53:42Z

Hi Andy and Michael, I have implemented 1 and 2 and it has passed the tests on Jenkins. The commits are: 0bf9ff6..641c861. I have started looking at 3 and have some thoughts about how to handle conflicting reads and writes while evicting data. I will sync up with Stan first and flesh out a base design for discussion with you guys.

Thanks,
sam

mjgiardino · 2014-10-28T20:16:32Z

I think I'm at a good place to sync as well. The migration between layers works and blocks can be found in any layer. In addition all the multilevel configuration is tested. I just now added a method in AntiCacheDB to identify whether it is a stalling or aborting layer. We need to discuss how we're going to decide this, as well as what specific policy we'd like to start with for block placement.

apavlo · 2014-10-28T20:32:14Z

More than just passing existing test cases, did you add test cases for the new features?

mjgiardino · 2014-10-28T20:46:24Z

There are new tests in the EE (anticache_eviction_manager_test) to test the physical act of migration and LRU block selection, as well as a new junit test (TestAntiCacheMultiLevel) to configure multilevel, evict and merge tuples, as well as fill a level and be forced to write to the one below. They are based upon your edu.brown.hstore.TestAntiCacheManager test.

apavlo · 2014-10-28T20:49:44Z

Beautiful. Should we merge this code back into the master?

mjgiardino · 2014-10-28T20:56:55Z

Let me rerun those performance tests overnight and I'll submit a pull request tomorrow. I want to skim the code and make sure any hacky debugging printfs are gone.

apavlo · 2014-10-28T20:59:10Z

Ok. Let's try to schedule a call for this Friday. Can you send an email to the group?

mjgiardino · 2014-10-28T20:59:43Z

Will do.

zheguang · 2015-01-28T02:32:39Z

Would you guys be available to take a look at my initial patch for the second bullet point? I wrote a long commit message to convey the overall design. This however is based on my current understanding of the frontend, so please do point out what looks wrong to you.

zheguang@5537d01

Many thanks!

mjgiardino · 2015-01-29T18:53:37Z

It all makes sense to me.

Should we meet tomorrow and sync up?

apavlo · 2015-01-29T18:57:22Z

Tomorrow is NEDB day, so we're all going to be busy.


apavlo added the Anti-Cache label Sep 6, 2014
