Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fresh state #705

Closed
wants to merge 39 commits into from

Conversation

isahkemat
Copy link
Contributor

@isahkemat isahkemat commented Mar 23, 2020

What changes were proposed in this pull request?

In BlockUtils.shutdownCache(ContainerCache), set default cache instance in ContainerCache to null.
clean up ContainerCache in tear down of TestContainerPersistence and TestBlockDeletingService.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3249

@fapifta
Copy link
Contributor

fapifta commented Mar 23, 2020

Hi, thank you for figuring the problem out and posting a JIRA for it.

I am on the other hand completely -1 on this approach of fixing it, at least if I understand the problem properly, so please let me know if I misunderstand something. (It would be nice to have a description in the PR we have a standard format for that which you can follow to help us understand better what the PR is aiming for).

So my understanding is that the TestContainerCache fails in some environment when it is running after TestBlockDeletingService. Is that correct?

For me the TestContainerCache does not fail and runs fine alone, so I do not see why it should be modified at all, as I don't see any reason to change the ContainerCache itself either. If the problem is that the TestBlockDeletingService is messing up the state before TestContainerCache is running, then TestBlockDeletingService should clean up the state properly on its teardown.
In the ContainerCache there is a method that clears the cache (shutdownCache()) can you check please if calling that in the teardown of TestBlockDeletingService solves the problem, and if so then modify the PR? Moreover, in the TestBlockDeletingService there is no direct mention of the ContainerCache, so you probably need to find the class which is using the ContainerCache, and see if you can cleanup the cache via that class, as just shutting down something that does not seem to be used in the test would seem strange also.

@isahkemat
Copy link
Contributor Author

Hi, Thank you for your comprehensive response. sorry for my mistake in creating PR, I'm new to this community.
I tried to complete the description of the PR.
I change my PR to set default instance of ContainerCache to null in the shutdown of the BlockUtils which is used in mentioned tests. ContainerCache.shutdown method is not enough because it does not change the capacity of the cache which is tested in TestContainerCache.

@fapifta
Copy link
Contributor

fapifta commented Mar 24, 2020

Oh, that gives some better understanding the problem, which is the size of the cache, thank you for enlighten me on that, and for the follow up. Also thank you for adding the PR description.

This approach is way more better, but I still have some problems but their roots are not in the fix you provided.
Let's see what the TestContainerCache actually does... it tests whether:

  1. the cache returns the same object with increased reference count with the same parameters
  2. even if we got two references we have space for a third one with cache size 2 as the first two value we got are the same and it is not cached twice
  3. after the second distinct element was get from the cache, the cache is full
  4. eviction happens when the reference count is zero and we can have a new element cached instead of the one with zero reference
  5. eviction does not happen early, so even if reference count is 0 for an element, if we get it again before evicted, we get the same object again with reference count 1
  6. ensures that ReferenceCountedDb can not decrement the reference count below zero without an error.

From here this is just my opinion, and I would like to understand what do you think about it.

I would argue whether we need this test at all... My argument is that in these cases we test the fact that LRUMap is behaving as it is documented, and this is not in the scope of our test.

Our TestContainerCache should ensure that we extended the functionality of LRUMap correctly, so we should test the removeLRU method, which is basically the test of ReferenceCountedDB.cleanup, which again is not in the scope for this class, however test that a reference we did not closed is not removable is good to have, to ensure this invariant is met even if functionality changes.
We can test the first assumption of the original test as it is ensuring that we are having a cache and we are counting references properly when one reference is obtained to the ReferenceCountedDB instance.
We can ensure that getDB and removeDB works as expected via tests, so it returns the cached instance, and removed the cached instance only when it is removable (~: referencecount=0).
Also we can test the cache shutdown whether it clears the cache for example, but we certainly don't need to test isFull and things like that.

SideNote:
the current design for me is a bit unfortunate, but this should be a subject of a follow up JIRAs which if you create I happy to collaborate in.
The problems I see:

  • by extending the LRUMap and not hiding its original functionality, getting the instance of ContainerCache allows a user to alter the cache causing weird behaviour if misused or misunderstood.
  • we have synchronization on a lock in all methods, why we do not have these methods as synchronized then?
  • we use this class only from BlockUtils, we should have the BlockUtils methods implemented in a way that it uses an internal LRUMap instead of the ContainerCache where we do not add much functionality?
  • ultimately, as we also doing instance creation in the getDB method of ContainerCache, why we are not using a LoadingCache instead with a simple size restriction, and forget reference counting on the db, and just re-create an instance if it was evicted from the cache?

@isahkemat
Copy link
Contributor Author

Your description is very helpful to me. thanks a lot. I agree with you that testing isFull for ContainerCache is somehow wired as it related to the LRUMap which is out of scope for this test.
we can simply remove this line (assertTrue(cache.isFull())) for this issue (in this PR) and open other Jira tasks to improve the structure of ContainerCache and BlockUtils.
about your side notes:

  1. I completely agree. composition would be a better idea than inheritance, we can hide the internals of ContainerCache better by composition. I will create a task regarding this in Jira to discuss more there.
  2. you're right, we can replace the ReentrantLock with synchronized methods. I will create a task for it in Jira.
  3. If ContainerCache is just used in BlockUtils, I think it would be better to make it as an inner class of BlockUtils or move it beside BlockUtils and make it package private.
  4. I don't understand it well. I will be appreciated if you elaborate.

@fapifta
Copy link
Contributor

fapifta commented Mar 24, 2020

Great, thank you for the follow up. +1 for the current state. (non-binding)

I completely agree we can fix this problem by just removing the isfull assertion, as that is testing the LRUMap functionality, and follow up in further tickets for additional improvements.

About LoadingCache:
LoadingCache is a guava utility, it provides and LRU cache, and provides facilities to create the instances that are cached, and also guava provides facilities to configure how the cache works.
What we need to consider before replacing is that why the ContainerCache is implemented in a way where you can not evict a DB instance from the cache if there is a reference for it.
Looking at all the usages of BlockUtils getDb and removeDB methods, all the cached instances are accessed for just short term, and even though creating an instance might be costly, with the default 1024 cache size if the LoadingCache evicts the least recently used item when the cache is full, we do not suffer a huge perf hit, while in the current implementation if the cache is full, then we get an exception, so I would definitely trade this off.

With LoadingCache which just have a size restriction, we can have the same cache functionality, if we get rid of reference counting which I think we do not need, as we use the instances just for short periods of time, so most of the time they are in the cache with a zero reference count, and can be evicted. If we still want to ensure that in extreme cases when the cache is pressured, and all items in the cache are referenced, we can use weakValues() in the guava cache to prevent eviction in this case, but this does not seem to be relevant based on the usage of the cached items.

@isahkemat
Copy link
Contributor Author

Thanks for your explanation. This would be great. I open the ticket in Jira here:
https://issues.apache.org/jira/browse/HDDS-3267
please take a look at it and correct me if I explain anything wrong.
BTW, I don't understand why it-client failed. is it related to my changes?

@isahkemat
Copy link
Contributor Author

isahkemat commented Mar 30, 2020

I mistakenly rebase master onto this branch, I think it would be better to create another PR #737

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet