
Add check to not flush when table is being deleted. #1887

Merged

milleruntime merged 1 commit into apache:main from milleruntime:minc-delete on Feb 1, 2021

Conversation

@milleruntime
Contributor

  • The CleanUp step of deletes will wait until all tablets of a table are unassigned. This will stop the memory manager from flushing if the table is being deleted, allowing it to be unassigned faster.

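For readers skimming the thread, here is a minimal sketch of the shape of such a check. This is not the actual Accumulo code: the TableState enum, the getTableState() lookup, and the extent/tableId fields are illustrative stand-ins for the tserver internals.

```java
// Illustrative sketch only -- shows the idea of the change: a tablet whose table is in a
// DELETING state declines a memory-manager-initiated minor compaction so it can be
// unassigned sooner. All names below are stand-ins, not the real Accumulo API.
enum TableState { ONLINE, OFFLINE, DELETING }

class TabletSketch {
  private final String extent = "2k;6;5"; // pretty-printed tablet extent
  private final String tableId = "2k";

  // Stand-in for a lookup of the table's current state.
  private TableState getTableState(String tableId) {
    return TableState.DELETING;
  }

  /** Returns true if a minor compaction was initiated. */
  boolean initiateMinorCompaction() {
    if (getTableState(tableId) == TableState.DELETING) {
      System.out.printf("Table %s is being deleted so don't flush %s%n", tableId, extent);
      return false; // skip the flush and let the tablet be unloaded instead
    }
    // ... otherwise queue the minor compaction as before ...
    return true;
  }
}
```
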
@milleruntime
Contributor Author

I forgot to run the ITs when I opened this the other day. This change touches important code, so I'm running them now.

@milleruntime
Contributor Author

I saw a timeout failure in org.apache.accumulo.test.functional.ConcurrentDeleteTableIT. I didn't see any obvious errors but didn't have a chance to look into it further.

@ctubbsii
Member

> I saw a timeout failure in org.apache.accumulo.test.functional.ConcurrentDeleteTableIT. I didn't see any obvious errors but didn't have a chance to look into it further.

That's a known flaky test. It is unlikely this has anything to do with that.

milleruntime merged commit b8dac78 into apache:main on Feb 1, 2021
milleruntime deleted the minc-delete branch on February 1, 2021 13:32
@milleruntime
Contributor Author

I was doing some testing using Uno, running 2 RW MultiTable jobs, and saw a bad situation with this change. The LargestFirstMemoryManager will select the largest tablets to be flushed (the default for tserver.compaction.minor.concurrent.max is 4), and if they all happen to be in the table being deleted, no tablets will get flushed and things on the tserver will get stuck. Something is preventing the tablets from being unloaded, and this new check stops them from being flushed. I am not sure why they still aren't being unloaded, but we may need to revert this change.
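
To make the failure mode concrete, here is a rough, self-contained sketch of the assumed interaction; the TabletMem record and the method names are illustrative, not the real LargestFirstMemoryManager. The manager only recommends the few largest tablets, so if every recommendation is rejected by the new check, no memory is freed and the next pass recommends the same tablets again.

```java
// Hypothetical model of the stall: selection is capped at the concurrent minC limit, and
// every selected tablet belongs to the table being deleted, so nothing ever gets flushed.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

record TabletMem(String extent, String tableId, long memBytes) {}

class StallSketch {
  static final int MAX_CONCURRENT_MINCS = 4; // default tserver.compaction.minor.concurrent.max

  // Pick only the largest few tablets, as a largest-first manager would.
  static List<TabletMem> tabletsToMinorCompact(List<TabletMem> reports) {
    return reports.stream()
        .sorted(Comparator.comparingLong(TabletMem::memBytes).reversed())
        .limit(MAX_CONCURRENT_MINCS)
        .collect(Collectors.toList());
  }

  static void pass(List<TabletMem> reports, String deletingTableId) {
    long started = tabletsToMinorCompact(reports).stream()
        .filter(t -> !t.tableId().equals(deletingTableId)) // the new check rejects these
        .count();
    if (started == 0) {
      // Ingest memory never drains, so the same tablets are recommended on the next pass.
      System.out.println("No flushes started; tserver memory stays full");
    }
  }
}
```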

@ctubbsii
Member

ctubbsii commented Feb 1, 2021

> ... no tablets will get flushed and things on the tserver will get stuck. Something is preventing the tablets from being unloaded and this new check stops them from being flushed. I am not sure why they still aren't being unloaded but we may need to revert this change.

@milleruntime This sounds serious. Have you already created a blocker issue to track this? I wouldn't want it to get lost.

@milleruntime
Contributor Author

> ... no tablets will get flushed and things on the tserver will get stuck. Something is preventing the tablets from being unloaded and this new check stops them from being flushed. I am not sure why they still aren't being unloaded but we may need to revert this change.
>
> @milleruntime This sounds serious. Have you already created a blocker issue to track this? I wouldn't want it to get lost.

No, I was about to revert the commit.

milleruntime added a commit that referenced this pull request Feb 1, 2021
@milleruntime
Contributor Author

Reverted this commit with fd001c9 due to the bad situation described in my previous comment.

@milleruntime
Contributor Author

I think this wait in Tablet.completeClose() is what might have been holding up the tablets from being closed:


That would explain why the tablet wouldn't close, but I am not sure why a minor compaction was started; maybe it was being flushed by another process? The wait above should only happen if this code was called:

@keith-turner
Contributor

keith-turner commented Feb 1, 2021

@milleruntime do you have any tserver stack traces from when it got stuck? I have been poking around in the code looking for a possible cause but have not found anything yet.

It does seem like the LargestFirstMemManager may only return a subset of the tablets, so if the subset it returns is always in the deleting state, maybe that held up minor compactions from starting. However, something getting stuck at getTabletMemory().waitForMinC() should be waiting on a compaction that is running or queued to run. If the threads in the executor service for minor compactions were gummed up, that could cause a problem (but I have not seen a cause for that).
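
As a hedged illustration of that point (not the actual tserver code; MinCWaitSketch and its fields are hypothetical), a waitForMinC-style call only has something to block on when a minor compaction was actually queued on the bounded minor-compaction executor:

```java
// If nothing was queued (e.g. the recommendation was rejected), there is nothing to wait on;
// if the pool's threads were tied up, queued compactions would never finish and the wait
// could hang. Purely illustrative of the reasoning above.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MinCWaitSketch {
  // Bounded pool, analogous in spirit to the concurrent minor-compaction limit.
  private final ExecutorService minorCompactionPool = Executors.newFixedThreadPool(4);
  private volatile Future<?> runningMinC; // non-null only after a minC was queued

  void queueMinorCompaction(Runnable flushWork) {
    runningMinC = minorCompactionPool.submit(flushWork);
  }

  // Stand-in for getTabletMemory().waitForMinC().
  void waitForMinC() throws Exception {
    Future<?> f = runningMinC;
    if (f != null) {
      f.get(); // blocks until the queued/running compaction completes
    }
  }
}
```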

> I think this wait in Tablet.completeClose() is what might have been holding up the tablets from being closed:

What made you think this?

@milleruntime
Contributor Author

> @milleruntime do you have any tserver stack traces from when it got stuck? I have been poking around in the code looking for a possible cause but have not found anything yet.

There weren't any errors; it was just the state of the tserver that I observed. It seemed no flushes from the memory manager were happening because the 4 tablets chosen by LargestFirstMemoryManager.tabletsToMinorCompact() were being rejected. Here is what was showing up in the logs repeatedly:

2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;6;5  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,614,281 chosenIT = 313.02 load 3,326,974
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;9;8  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,544,515 chosenIT = 313.02 load 3,238,194
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k<;9  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,534,061 chosenIT = 313.02 load 3,224,885
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: COMPACTING 2k;4;3  total = 178,210,928 ingestMemory = 178,210,928
2021-02-01T15:26:09,855 [memory.LargestFirstMemoryManager] DEBUG: chosenMem = 2,431,564 chosenIT = 313.02 load 3,094,446
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;6;5
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;6;5
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;9;8
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;9;8
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k<;9
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k<;9
2021-02-01T15:26:09,855 [tablet.Tablet] DEBUG: Table 2k is being deleted so don't flush 2k;4;3
2021-02-01T15:26:09,855 [tserver.TabletServerResourceManager] INFO : Ignoring memory manager recommendation: not minor compacting 2k;4;3

> I think this wait in Tablet.completeClose() is what might have been holding up the tablets from being closed:
>
> What made you think this?

Since the tablets should have been unloaded, I was just looking through the code to try to figure out what was preventing them from unloading. I was guessing that something else had triggered a flush for the tablet by calling Tablet.prepareForMinC(), which would prevent the getTabletMemory().waitForMinC() call in the close from finishing. But now that I think about it more, if that happened, there shouldn't have been anything preventing the other flush thread from completing, since the check I added is in Tablet.initiateMinorCompaction().

@milleruntime
Contributor Author

It could just be a lack of resources once it gets to the state where the largest tablets are in the table being deleted. The memory manager isn't able to clear out the biggest chunks of memory, so everything slows down. I saw it happen again, but eventually the tablets got unloaded and deleted, so maybe this change just needs more work. I am thinking we might be able to just remove the tablets being deleted from the memory reports, which would keep things moving along.
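
A rough sketch of that idea, using the same illustrative TabletMem shape as the earlier sketch (redeclared here so the snippet stands alone) rather than the real memory-report plumbing: drop tablets of a deleting table from the reports before the memory manager sees them, so its limited compaction slots go to tablets that can actually be flushed.

```java
// Hypothetical pre-filtering of memory reports; not necessarily what the follow-up PR does,
// just the idea described in the comment above.
import java.util.List;
import java.util.stream.Collectors;

record TabletMem(String extent, String tableId, long memBytes) {}

class FilteredReportsSketch {
  static List<TabletMem> filterReports(List<TabletMem> reports, String deletingTableId) {
    return reports.stream()
        .filter(t -> !t.tableId().equals(deletingTableId))
        .collect(Collectors.toList());
  }
}
```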

milleruntime restored the minc-delete branch on February 2, 2021 16:12
@milleruntime
Contributor Author

I created an enhanced version of this change in #1899

DomGarguilo pushed a commit to DomGarguilo/accumulo that referenced this pull request Feb 10, 2021
DomGarguilo pushed a commit to DomGarguilo/accumulo that referenced this pull request Feb 10, 2021
milleruntime deleted the minc-delete branch on March 8, 2021 21:07