HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fresh state #705

isahkemat · 2020-03-23T09:04:04Z

What changes were proposed in this pull request?

In BlockUtils.shutdownCache(ContainerCache), set default cache instance in ContainerCache to null.
clean up ContainerCache in tear down of TestContainerPersistence and TestBlockDeletingService.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3249

…sh state

…he#699)

fapifta · 2020-03-23T18:10:52Z

Hi, thank you for figuring the problem out and posting a JIRA for it.

I am on the other hand completely -1 on this approach of fixing it, at least if I understand the problem properly, so please let me know if I misunderstand something. (It would be nice to have a description in the PR we have a standard format for that which you can follow to help us understand better what the PR is aiming for).

So my understanding is that the TestContainerCache fails in some environment when it is running after TestBlockDeletingService. Is that correct?

For me the TestContainerCache does not fail and runs fine alone, so I do not see why it should be modified at all, as I don't see any reason to change the ContainerCache itself either. If the problem is that the TestBlockDeletingService is messing up the state before TestContainerCache is running, then TestBlockDeletingService should clean up the state properly on its teardown.
In the ContainerCache there is a method that clears the cache (shutdownCache()) can you check please if calling that in the teardown of TestBlockDeletingService solves the problem, and if so then modify the PR? Moreover, in the TestBlockDeletingService there is no direct mention of the ContainerCache, so you probably need to find the class which is using the ContainerCache, and see if you can cleanup the cache via that class, as just shutting down something that does not seem to be used in the test would seem strange also.

…erPersistence also set defaultCache to null on BlockUtils.shutdownCache

isahkemat · 2020-03-23T20:28:14Z

Hi, Thank you for your comprehensive response. sorry for my mistake in creating PR, I'm new to this community.
I tried to complete the description of the PR.
I change my PR to set default instance of ContainerCache to null in the shutdown of the BlockUtils which is used in mentioned tests. ContainerCache.shutdown method is not enough because it does not change the capacity of the cache which is tested in TestContainerCache.

fapifta · 2020-03-24T02:06:01Z

Oh, that gives some better understanding the problem, which is the size of the cache, thank you for enlighten me on that, and for the follow up. Also thank you for adding the PR description.

This approach is way more better, but I still have some problems but their roots are not in the fix you provided.
Let's see what the TestContainerCache actually does... it tests whether:

the cache returns the same object with increased reference count with the same parameters
even if we got two references we have space for a third one with cache size 2 as the first two value we got are the same and it is not cached twice
after the second distinct element was get from the cache, the cache is full
eviction happens when the reference count is zero and we can have a new element cached instead of the one with zero reference
eviction does not happen early, so even if reference count is 0 for an element, if we get it again before evicted, we get the same object again with reference count 1
ensures that ReferenceCountedDb can not decrement the reference count below zero without an error.

From here this is just my opinion, and I would like to understand what do you think about it.

I would argue whether we need this test at all... My argument is that in these cases we test the fact that LRUMap is behaving as it is documented, and this is not in the scope of our test.

Our TestContainerCache should ensure that we extended the functionality of LRUMap correctly, so we should test the removeLRU method, which is basically the test of ReferenceCountedDB.cleanup, which again is not in the scope for this class, however test that a reference we did not closed is not removable is good to have, to ensure this invariant is met even if functionality changes.
We can test the first assumption of the original test as it is ensuring that we are having a cache and we are counting references properly when one reference is obtained to the ReferenceCountedDB instance.
We can ensure that getDB and removeDB works as expected via tests, so it returns the cached instance, and removed the cached instance only when it is removable (~: referencecount=0).
Also we can test the cache shutdown whether it clears the cache for example, but we certainly don't need to test isFull and things like that.

SideNote:
the current design for me is a bit unfortunate, but this should be a subject of a follow up JIRAs which if you create I happy to collaborate in.
The problems I see:

by extending the LRUMap and not hiding its original functionality, getting the instance of ContainerCache allows a user to alter the cache causing weird behaviour if misused or misunderstood.
we have synchronization on a lock in all methods, why we do not have these methods as synchronized then?
we use this class only from BlockUtils, we should have the BlockUtils methods implemented in a way that it uses an internal LRUMap instead of the ContainerCache where we do not add much functionality?
ultimately, as we also doing instance creation in the getDB method of ContainerCache, why we are not using a LoadingCache instead with a simple size restriction, and forget reference counting on the db, and just re-create an instance if it was evicted from the cache?

…s to chaos testing. (apache#438)

isahkemat · 2020-03-24T11:58:50Z

Your description is very helpful to me. thanks a lot. I agree with you that testing isFull for ContainerCache is somehow wired as it related to the LRUMap which is out of scope for this test.
we can simply remove this line (assertTrue(cache.isFull())) for this issue (in this PR) and open other Jira tasks to improve the structure of ContainerCache and BlockUtils.
about your side notes:

I completely agree. composition would be a better idea than inheritance, we can hide the internals of ContainerCache better by composition. I will create a task regarding this in Jira to discuss more there.
you're right, we can replace the ReentrantLock with synchronized methods. I will create a task for it in Jira.
If ContainerCache is just used in BlockUtils, I think it would be better to make it as an inner class of BlockUtils or move it beside BlockUtils and make it package private.
I don't understand it well. I will be appreciated if you elaborate.

fapifta · 2020-03-24T14:09:22Z

Great, thank you for the follow up. +1 for the current state. (non-binding)

I completely agree we can fix this problem by just removing the isfull assertion, as that is testing the LRUMap functionality, and follow up in further tickets for additional improvements.

About LoadingCache:
LoadingCache is a guava utility, it provides and LRU cache, and provides facilities to create the instances that are cached, and also guava provides facilities to configure how the cache works.
What we need to consider before replacing is that why the ContainerCache is implemented in a way where you can not evict a DB instance from the cache if there is a reference for it.
Looking at all the usages of BlockUtils getDb and removeDB methods, all the cached instances are accessed for just short term, and even though creating an instance might be costly, with the default 1024 cache size if the LoadingCache evicts the least recently used item when the cache is full, we do not suffer a huge perf hit, while in the current implementation if the cache is full, then we get an exception, so I would definitely trade this off.

With LoadingCache which just have a size restriction, we can have the same cache functionality, if we get rid of reference counting which I think we do not need, as we use the instances just for short periods of time, so most of the time they are in the cache with a zero reference count, and can be evicted. If we still want to ensure that in extreme cases when the cache is pressured, and all items in the cache are referenced, we can use weakValues() in the guava cache to prevent eviction in this case, but this does not seem to be relevant based on the usage of the cached items.

…out SCM. (apache#656)" This reverts commit 281faf3.

isahkemat · 2020-03-24T19:39:52Z

Thanks for your explanation. This would be great. I open the ticket in Jira here:
https://issues.apache.org/jira/browse/HDDS-3267
please take a look at it and correct me if I explain anything wrong.
BTW, I don't understand why it-client failed. is it related to my changes?

…OzoneChaosCluster. (apache#711)

…tainer commands to Datanode. (apache#712)

…adata. (apache#714)

…s to chaos testing. (apache#438)

…out SCM. (apache#656)" This reverts commit 281faf3.

…OzoneChaosCluster. (apache#711)

…tainer commands to Datanode. (apache#712)

…adata. (apache#714)

…e#725)

…che#722)

apache#678)

…sh state

…erPersistence also set defaultCache to null on BlockUtils.shutdownCache

isahkemat · 2020-03-30T10:18:26Z

I mistakenly rebase master onto this branch, I think it would be better to create another PR #737

isa and others added 4 commits March 23, 2020 13:30

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fre…

f8e8084

…sh state

correct style

49599c7

HDDS-3234. Fix retry interval default in Ozone client. (apache#698)

c64d86f

HDDS-3235.Change to default of max retry count for Ozone client (apac…

f2e259b

…he#699)

fapifta mentioned this pull request Mar 23, 2020

HDDS-3248. shutdown defaultMetricsSystem before tests #704

Merged

isa added 3 commits March 23, 2020 23:58

cleanup ContainerCache after TestBlockDeletingService and TestContain…

db89958

…erPersistence also set defaultCache to null on BlockUtils.shutdownCache

remove unused import

f259a64

remove setting default instance to null in shutdown method

081b933

HDDS-2878. Refactor MiniOzoneLoadGenerator to add more load generator…

ad7cd6a

…s to chaos testing. (apache#438)

remove cache.isFull assertion from TestContainerCache

cee9c6c

Revert "HDDS-3142. Create isolated enviornment for OM to test it with…

512d607

…out SCM. (apache#656)" This reverts commit 281faf3.

mukul1987 and others added 13 commits March 25, 2020 15:43

HDDS-3250. Create a separate log file for Warnings and Errors in Mini…

8249385

…OzoneChaosCluster. (apache#711)

HDDS-3243. Recon should not have the ability to send Create/Close Con…

f6be766

…tainer commands to Datanode. (apache#712)

HDDS-3164. Add Recon endpoint to serve missing containers and its met…

4682bab

…adata. (apache#714)

HDDS-3284. ozonesecure-mr test fails due to lack of disk space (apach…

07fcb79

…e#725)

HDDS-3074. Make the configuration of container scrub consistent. (apa…

3d28568

…che#722)

HDDS-3179. Pipeline placement based on Topology does not have fallback (

7d132ce

apache#678)

HDDS-3281. Add timeouts to all robot tests (apache#723)

eece604

HDDS-3288: Update default RPC handler SCM/OM count to 100 (apache#729)

a661a75

HDDS-3273. getConf does not return all OM addresses. (apache#727)

099ab62

HDDS-3234. Fix retry interval default in Ozone client. (apache#698)

9c829fb

HDDS-3235.Change to default of max retry count for Ozone client (apac…

021bed6

…he#699)

HDDS-2878. Refactor MiniOzoneLoadGenerator to add more load generator…

da528e0

…s to chaos testing. (apache#438)

Revert "HDDS-3142. Create isolated enviornment for OM to test it with…

725ceed

…out SCM. (apache#656)" This reverts commit 281faf3.

mukul1987 and others added 16 commits March 29, 2020 20:30

HDDS-3250. Create a separate log file for Warnings and Errors in Mini…

d4e4be7

…OzoneChaosCluster. (apache#711)

HDDS-3243. Recon should not have the ability to send Create/Close Con…

28d5d6a

…tainer commands to Datanode. (apache#712)

HDDS-3164. Add Recon endpoint to serve missing containers and its met…

2f27380

…adata. (apache#714)

HDDS-3284. ozonesecure-mr test fails due to lack of disk space (apach…

3da3258

…e#725)

HDDS-3074. Make the configuration of container scrub consistent. (apa…

c19a3a5

…che#722)

HDDS-3179. Pipeline placement based on Topology does not have fallback (

18a4b05

apache#678)

HDDS-3281. Add timeouts to all robot tests (apache#723)

42d8d25

HDDS-3288: Update default RPC handler SCM/OM count to 100 (apache#729)

37281ac

HDDS-3273. getConf does not return all OM addresses. (apache#727)

562ac8b

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fre…

c2a0d48

…sh state

correct style

aa7cfac

cleanup ContainerCache after TestBlockDeletingService and TestContain…

bbdd30d

…erPersistence also set defaultCache to null on BlockUtils.shutdownCache

remove unused import

74e5804

remove setting default instance to null in shutdown method

af28759

remove cache.isFull assertion from TestContainerCache

744d4ea

Merge remote-tracking branch 'origin/HDDS-3249' into HDDS-3249

3421e2b

isahkemat mentioned this pull request Mar 30, 2020

HDDS-3249: remove unrelated assert in ContainerCache #737

Merged

isahkemat closed this Mar 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fresh state #705

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fresh state #705

isahkemat commented Mar 23, 2020 •

edited

Loading

fapifta commented Mar 23, 2020

isahkemat commented Mar 23, 2020

fapifta commented Mar 24, 2020

isahkemat commented Mar 24, 2020

fapifta commented Mar 24, 2020 •

edited

Loading

isahkemat commented Mar 24, 2020

isahkemat commented Mar 30, 2020 •

edited

Loading

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fresh state #705

HDDS-3249: renew ContainerCache.INSTANCE in order to test it in a fresh state #705

Conversation

isahkemat commented Mar 23, 2020 • edited Loading

What changes were proposed in this pull request?

What is the link to the Apache JIRA

fapifta commented Mar 23, 2020

isahkemat commented Mar 23, 2020

fapifta commented Mar 24, 2020

isahkemat commented Mar 24, 2020

fapifta commented Mar 24, 2020 • edited Loading

isahkemat commented Mar 24, 2020

isahkemat commented Mar 30, 2020 • edited Loading

isahkemat commented Mar 23, 2020 •

edited

Loading

fapifta commented Mar 24, 2020 •

edited

Loading

isahkemat commented Mar 30, 2020 •

edited

Loading