Add check to not flush when table is being deleted. #1887
milleruntime merged 1 commit into apache:main
* The CleanUp step of a table delete waits until all tablets of the table are unassigned. This change stops the memory manager from flushing a tablet if its table is being deleted, allowing the tablet to be unassigned faster.
|
I forgot to run the ITs when I opened this the other day. This change touches important code, so I am running them now. |
|
I saw a timeout failure in org.apache.accumulo.test.functional.ConcurrentDeleteTableIT. I didn't see any obvious errors but didn't have a chance to look into it further. |
That's a known flaky test. It is unlikely this has anything to do with that. |
|
I was doing some testing using Uno running 2 RW MultiTable jobs and saw a bad situation with this change. The |
@milleruntime This sounds serious. Have you already created a blocker issue to track this? I wouldn't want it to get lost. |
No I was about to revert the commit. |
This reverts commit b8dac78.
|
Reverted this commit with fd001c9 due to my previous comment about the bad situation. |
|
I think this wait in That would explain why the tablet wouldn't close, but I am not sure why a minor compaction was started. Maybe it was being flushed by another process? The wait above should only happen if this code was called: |
|
@milleruntime do you have any tserver stack traces from when it got stuck? I have been poking around in the code looking for a possible cause, but have not found anything yet. It does seem like the LargestFirstMemoryManager may only return a subset of the tablets. So if the subset it returns is always in the deleting state, maybe that held up minor compactions from starting. However something getting stuck in at
What made you think this? |
There weren't any errors, it was just the state of the tserver that I observed. It seemed no flushes from the memory manager were happening because the 4 tablets that were chosen by
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;6;5 total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,614,281 chosenIT = 313.02 load 3,326,974
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;9;8 total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,544,515 chosenIT = 313.02 load 3,238,194
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k<;9 total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,534,061 chosenIT = 313.02 load 3,224,885
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;4;3 total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,431,564 chosenIT = 313.02 load 3,094,446
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;6;5
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;6;5
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;9;8
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;9;8
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k<;9
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k<;9
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;4;3
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;4;3
Since the tablets should have been unloaded, I was just looking through the code to try to figure out what was preventing them from unloading. I was guessing that something else triggered a flush for the tablet calling |
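To make the state in the log easier to reason about, here is a minimal sketch of how it could arise. It is a simplified model, not Accumulo code: `FlushStarvationSketch`, `Report`, `chooseLargest`, and `flushable` are hypothetical names, and the limit of 4 flushes per pass is assumed from the four tablets in the log. The memory sizes of the `2k` tablets are taken from the log; the `3a;m;c` tablet is invented for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model, not Accumulo code: shows how a "table is being deleted"
// check applied after selection can leave nothing to flush.
class FlushStarvationSketch {

  // Simplified stand-in for a per-tablet memory report.
  static final class Report {
    final String extent;
    final long memBytes;
    final boolean tableDeleting;

    Report(String extent, long memBytes, boolean tableDeleting) {
      this.extent = extent;
      this.memBytes = memBytes;
      this.tableDeleting = tableDeleting;
    }
  }

  // Largest-first policy: return only the top maxFlushes tablets by memory.
  static List<Report> chooseLargest(List<Report> reports, int maxFlushes) {
    return reports.stream()
        .sorted(Comparator.comparingLong((Report r) -> r.memBytes).reversed())
        .limit(maxFlushes)
        .collect(Collectors.toList());
  }

  // The check applied afterwards: skip tablets whose table is being deleted.
  static List<Report> flushable(List<Report> chosen) {
    return chosen.stream().filter(r -> !r.tableDeleting).collect(Collectors.toList());
  }

  // Sizes of the 2k tablets come from the log; "3a;m;c" is an invented tablet.
  static List<Report> sampleReports() {
    return Arrays.asList(
        new Report("2k;6;5", 2_614_281L, true),
        new Report("2k;9;8", 2_544_515L, true),
        new Report("2k<;9", 2_534_061L, true),
        new Report("2k;4;3", 2_431_564L, true),
        new Report("3a;m;c", 1_000_000L, false));
  }

  public static void main(String[] args) {
    List<Report> chosen = chooseLargest(sampleReports(), 4);
    List<Report> flushed = flushable(chosen);
    System.out.println("chosen=" + chosen.size() + " flushed=" + flushed.size());
  }
}
```

In this model every chosen tablet belongs to the table being deleted, so each recommendation is ignored and the smaller tablet from the live table is never considered. If the same four tablets keep being chosen, no memory is freed by the memory manager on any pass.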
|
It could just be a lack of resources once the tserver gets into a state where the largest tablets all belong to a table that is being deleted. The memory manager isn't able to clear out the biggest chunks of memory, so everything slows down. I saw it happen again, but eventually the tablets got unloaded and deleted. So maybe this change just needs more work. I am thinking we might be able to just remove the tablets being deleted from the memory reports, moving things along. |
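The idea of dropping deleting tablets from the memory reports could look roughly like the sketch below. This is only an illustration under assumptions, not the change made in #1899: `DeletingAwareSelectionSketch`, `Report`, and `chooseLargestNotDeleting` are hypothetical names, and the two `3a` tablets are invented.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model, not Accumulo code: filter out tablets of deleting
// tables before the largest-first selection, so flush slots are not wasted.
class DeletingAwareSelectionSketch {

  // Simplified stand-in for a per-tablet memory report.
  static final class Report {
    final String extent;
    final long memBytes;
    final boolean tableDeleting;

    Report(String extent, long memBytes, boolean tableDeleting) {
      this.extent = extent;
      this.memBytes = memBytes;
      this.tableDeleting = tableDeleting;
    }
  }

  // Drop reports for deleting tables first, then pick the largest remaining.
  static List<Report> chooseLargestNotDeleting(List<Report> reports, int maxFlushes) {
    return reports.stream()
        .filter(r -> !r.tableDeleting)
        .sorted(Comparator.comparingLong((Report r) -> r.memBytes).reversed())
        .limit(maxFlushes)
        .collect(Collectors.toList());
  }

  // The 2k tablets mirror the deleting table; the 3a tablets are invented.
  static List<Report> sampleReports() {
    return Arrays.asList(
        new Report("2k;6;5", 2_614_281L, true),
        new Report("2k;9;8", 2_544_515L, true),
        new Report("2k<;9", 2_534_061L, true),
        new Report("2k;4;3", 2_431_564L, true),
        new Report("3a<;m", 800_000L, false),
        new Report("3a;m;c", 1_000_000L, false));
  }

  public static void main(String[] args) {
    for (Report r : chooseLargestNotDeleting(sampleReports(), 4)) {
      System.out.println("flush " + r.extent + " (" + r.memBytes + " bytes)");
    }
  }
}
```

Filtering before the selection means the flush slots go to tablets that can actually be flushed, at the cost of leaving the deleting tablets' memory in place until they are unloaded.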
|
I created an enhanced version of this change in #1899 |