
Add check to not flush when table is being deleted. #1887

Merged

milleruntime merged 1 commit into apache:main from milleruntime:minc-delete on Feb 1, 2021

Conversation

@milleruntime
Contributor

  • The CleanUp step of deletes will wait until all tablets of a table are unassigned. This will stop the memory manager from flushing if the table is being deleted, allowing it to be unassigned faster.

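For readers skimming the thread, here is a minimal sketch of the shape of such a check. This is not the actual Accumulo code: the TableState enum, the getTableState() lookup, and the extent/tableId fields are illustrative stand-ins for the tserver internals.

```java
// Illustrative sketch only -- shows the idea of the change: a tablet whose table is in a
// DELETING state declines a memory-manager-initiated minor compaction so it can be
// unassigned sooner. All names below are stand-ins, not the real Accumulo API.
enum TableState { ONLINE, OFFLINE, DELETING }

class TabletSketch {
  private final String extent = "2k;6;5"; // pretty-printed tablet extent
  private final String tableId = "2k";

  // Stand-in for a lookup of the table's current state.
  private TableState getTableState(String tableId) {
    return TableState.DELETING;
  }

  /** Returns true if a minor compaction was initiated. */
  boolean initiateMinorCompaction() {
    if (getTableState(tableId) == TableState.DELETING) {
      System.out.printf("Table %s is being deleted so don't flush %s%n", tableId, extent);
      return false; // skip the flush and let the tablet be unloaded instead
    }
    // ... otherwise queue the minor compaction as before ...
    return true;
  }
}
```
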
@milleruntime
Contributor Author

I forgot to run the ITs when I opened this the other day. This change touches important code, so I'm running them now.

@milleruntime
Contributor Author

I saw a timeout failure in org.apache.accumulo.test.functional.ConcurrentDeleteTableIT. I didn't see any obvious errors but didn't have a chance to look into it further.

@ctubbsii
Member

> I saw a timeout failure in org.apache.accumulo.test.functional.ConcurrentDeleteTableIT. I didn't see any obvious errors but didn't have a chance to look into it further.

That's a known flaky test. It is unlikely this has anything to do with that.

milleruntime merged commit b8dac78 into apache:main on Feb 1, 2021
milleruntime deleted the minc-delete branch on February 1, 2021 13:32
@milleruntime
Contributor Author

I was doing some testing using Uno, running 2 RW MultiTable jobs, and saw a bad situation with this change. The LargestFirstMemoryManager will select the largest tablets to be flushed (the default for tserver.compaction.minor.concurrent.max is 4), and if they all happen to be in the table being deleted, no tablets will get flushed and things on the tserver will get stuck. Something is preventing the tablets from being unloaded, and this new check stops them from being flushed. I am not sure why they still aren't being unloaded, but we may need to revert this change.
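
To make the failure mode concrete, here is a rough, self-contained sketch of the assumed interaction; the TabletMem record and the method names are illustrative, not the real LargestFirstMemoryManager. The manager only recommends the few largest tablets, so if every recommendation is rejected by the new check, no memory is freed and the next pass recommends the same tablets again.

```java
// Hypothetical model of the stall: selection is capped at the concurrent minC limit, and
// every selected tablet belongs to the table being deleted, so nothing ever gets flushed.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

record TabletMem(String extent, String tableId, long memBytes) {}

class StallSketch {
  static final int MAX_CONCURRENT_MINCS = 4; // default tserver.compaction.minor.concurrent.max

  // Pick only the largest few tablets, as a largest-first manager would.
  static List<TabletMem> tabletsToMinorCompact(List<TabletMem> reports) {
    return reports.stream()
        .sorted(Comparator.comparingLong(TabletMem::memBytes).reversed())
        .limit(MAX_CONCURRENT_MINCS)
        .collect(Collectors.toList());
  }

  static void pass(List<TabletMem> reports, String deletingTableId) {
    long started = tabletsToMinorCompact(reports).stream()
        .filter(t -> !t.tableId().equals(deletingTableId)) // the new check rejects these
        .count();
    if (started == 0) {
      // Ingest memory never drains, so the same tablets are recommended on the next pass.
      System.out.println("No flushes started; tserver memory stays full");
    }
  }
}
```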

@ctubbsii
Member

ctubbsii commented Feb 1, 2021

> ... no tablets will get flushed and things on the tserver will get stuck. Something is preventing the tablets from being unloaded and this new check stops them from being flushed. I am not sure why they still aren't being unloaded but we may need to revert this change.

@milleruntime This sounds serious. Have you already created a blocker issue to track this? I wouldn't want it to get lost.

@milleruntime
Contributor Author

> ... no tablets will get flushed and things on the tserver will get stuck. Something is preventing the tablets from being unloaded and this new check stops them from being flushed. I am not sure why they still aren't being unloaded but we may need to revert this change.
>
> @milleruntime This sounds serious. Have you already created a blocker issue to track this? I wouldn't want it to get lost.

No, I was about to revert the commit.

milleruntime added a commit that referenced this pull request Feb 1, 2021
@milleruntime
Contributor Author

Reverted this commit with fd001c9 due to the bad situation described in my previous comment.

@milleruntime
Contributor Author

I think this wait in Tablet.completeClose() is what might have been holding up the tablets from being closed:


That would explain why the tablet wouldn't close, but I am not sure why a minor compaction was started; maybe it was being flushed by another process? The wait above should only happen if this code was called:

@keith-turner
Contributor

keith-turner commented Feb 1, 2021

@milleruntime do you have any tserver stack traces from when it got stuck? I have been poking around in the code looking for a possible cause but have not found anything yet.

It does seem like the LargestFirstMemManager may only return a subset of the tablets, so if the subset it returns is always in the deleting state, maybe that held up minor compactions from starting. However, something getting stuck at getTabletMemory().waitForMinC() should be waiting on a compaction that is running or queued to run. If the threads in the executor service for minor compactions were gummed up, that could cause a problem (but I have not seen a cause for that).
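
As a hedged illustration of that point (not the actual tserver code; MinCWaitSketch and its fields are hypothetical), a waitForMinC-style call only has something to block on when a minor compaction was actually queued on the bounded minor-compaction executor:

```java
// If nothing was queued (e.g. the recommendation was rejected), there is nothing to wait on;
// if the pool's threads were tied up, queued compactions would never finish and the wait
// could hang. Purely illustrative of the reasoning above.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MinCWaitSketch {
  // Bounded pool, analogous in spirit to the concurrent minor-compaction limit.
  private final ExecutorService minorCompactionPool = Executors.newFixedThreadPool(4);
  private volatile Future<?> runningMinC; // non-null only after a minC was queued

  void queueMinorCompaction(Runnable flushWork) {
    runningMinC = minorCompactionPool.submit(flushWork);
  }

  // Stand-in for getTabletMemory().waitForMinC().
  void waitForMinC() throws Exception {
    Future<?> f = runningMinC;
    if (f != null) {
      f.get(); // blocks until the queued/running compaction completes
    }
  }
}
```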

> I think this wait in Tablet.completeClose() is what might have been holding up the tablets from being closed:

What made you think this?

@milleruntime
Contributor Author

> @milleruntime do you have any tserver stack traces from when it got stuck? I have been poking around in the code looking for a possible cause but have not found anything yet.

There weren't any errors; it was just the state of the tserver that I observed. It seemed no flushes from the memory manager were happening because the 4 tablets chosen by LargestFirstMemoryManager.tabletsToMinorCompact() were being rejected. Here is what was showing up in the logs repeatedly:

2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;6;5  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,614,281 chosenIT = 313.02 load 3,326,974
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;9;8  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,544,515 chosenIT = 313.02 load 3,238,194
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k<;9  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,534,061 chosenIT = 313.02 load 3,224,885
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;4;3  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,431,564 chosenIT = 313.02 load 3,094,446
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;6;5
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;6;5
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;9;8
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;9;8
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k<;9
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k<;9
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;4;3
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;4;3

> I think this wait in Tablet.completeClose() is what might have been holding up the tablets from being closed:
>
> What made you think this?

Since the tablets should have been unloaded, I was just looking through the code to try to figure out what was preventing them from unloading. I was guessing that something else had triggered a flush for the tablet by calling Tablet.prepareForMinC(), which would prevent the getTabletMemory().waitForMinC() call in the close from finishing. But now that I think about it more, if that happened, there shouldn't have been anything preventing the other flush thread from completing, since the check I added is in Tablet.initiateMinorCompaction().

@milleruntime
Contributor Author

It could just be a lack of resources once it gets to the state where the largest tablets are in the table being deleted. The memory manager isn't able to clear out the biggest chunks of memory, so everything slows down. I saw it happen again, but eventually the tablets got unloaded and deleted, so maybe this change just needs more work. I am thinking we might be able to just remove the tablets being deleted from the memory reports, which would keep things moving along.
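
A rough sketch of that idea, using the same illustrative TabletMem shape as the earlier sketch (redeclared here so the snippet stands alone) rather than the real memory-report plumbing: drop tablets of a deleting table from the reports before the memory manager sees them, so its limited compaction slots go to tablets that can actually be flushed.

```java
// Hypothetical pre-filtering of memory reports; not necessarily what the follow-up PR does,
// just the idea described in the comment above.
import java.util.List;
import java.util.stream.Collectors;

record TabletMem(String extent, String tableId, long memBytes) {}

class FilteredReportsSketch {
  static List<TabletMem> filterReports(List<TabletMem> reports, String deletingTableId) {
    return reports.stream()
        .filter(t -> !t.tableId().equals(deletingTableId))
        .collect(Collectors.toList());
  }
}
```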

milleruntime restored the minc-delete branch on February 2, 2021 16:12
@milleruntime
Contributor Author

I created an enhanced version of this change in #1899

DomGarguilo pushed a commit to DomGarguilo/accumulo that referenced this pull request Feb 10, 2021
DomGarguilo pushed a commit to DomGarguilo/accumulo that referenced this pull request Feb 10, 2021
milleruntime deleted the minc-delete branch on March 8, 2021 21:07