The description of this test failure (involving a split and an unresponsive tserver) has elements in common with what @keith-turner described on #1076, which also involved a MetadataMaxFilesIT test failure. However, I think this is a distinct issue, unrelated to Thrift, especially since the problem reported there was mitigated by rolling back the change that caused it.
The MetadataMaxFilesIT test output showed that it created 3 of the 5 tables it was trying to create, each with 1000 split points, and was in the middle of adding splits to the third table when it got stuck:
2020-03-27T23:40:30,398 [functional.ConfigurableMacBase] INFO : Creating table0
2020-03-27T23:40:30,712 [functional.ConfigurableMacBase] INFO : adding splits
2020-03-27T23:40:45,004 [functional.ConfigurableMacBase] INFO : flushing
2020-03-27T23:40:45,103 [functional.ConfigurableMacBase] INFO : Creating table1
2020-03-27T23:40:45,438 [functional.ConfigurableMacBase] INFO : adding splits
2020-03-27T23:41:01,393 [functional.ConfigurableMacBase] INFO : flushing
2020-03-27T23:41:01,903 [functional.ConfigurableMacBase] INFO : Creating table2
2020-03-27T23:41:02,309 [functional.ConfigurableMacBase] INFO : adding splits
2020-03-27T23:43:10,452 [clientImpl.ThriftTransportPool] WARN : Thread "addSplits 14" stuck on IO to jenkins:34735 (120000) for at least 120049 ms
2020-03-27T23:43:10,476 [clientImpl.ThriftTransportPool] INFO : Thread "addSplits 14" no longer stuck on IO to jenkins:34735 (120000) sawError = true
This is where the test timed out.
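For reference, the pattern the test follows looks roughly like this (a minimal sketch, not the actual MetadataMaxFilesIT source; the client setup, table names, and split values here are illustrative):

```java
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.hadoop.io.Text;

public class SplitPattern {
  public static void main(String[] args) throws Exception {
    try (AccumuloClient client =
        Accumulo.newClient().from("/path/to/accumulo-client.properties").build()) {
      for (int t = 0; t < 5; t++) {
        String table = "table" + t;
        client.tableOperations().create(table);            // "Creating tableN"
        SortedSet<Text> splits = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
          splits.add(new Text(String.format("%03d", i)));  // 1000 split points
        }
        client.tableOperations().addSplits(table, splits); // "adding splits"
        client.tableOperations().flush(table, null, null, true); // "flushing"
      }
    }
  }
}
```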
Looking at the tserver logs, it appears that at a certain point a metadata tablet split:
2020-03-27T23:41:10,629 [tablet.location] DEBUG: Split !0;3\\;516;2\\;875 into !0;3\\;45;2\\;875 and !0;3\\;516;3\\;45 on jenkins:34735[1000a3673370006]
Not long after, the tserver got into a state where the only thing it was doing was complaining that the new metadata tablet from that split had too many files, in response to repeated attempts from the client (presumably the client-side split code from the IT, since the requests came from the root user rather than the system user, though I haven't looked into how the split code works on the client side):
2020-03-27T23:41:10,904 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,904 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,916 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,917 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,921 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,922 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,929 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,930 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,942 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,943 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
The tserver never recovered from this state, and the test timed out. At first there were a few dozen log lines, on average, between each of these messages, but at a certain point all other tserver activity settled and the last 60,000 lines or so were just this (plus periodic Java GC stats from GarbageCollectionLogger, which didn't show anything unusual).
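I haven't traced the exact code path, but the warning implies a guard along these lines in the tserver's lookup path (a simplified, hypothetical sketch, not Accumulo's actual Tablet code; the limit presumably corresponds to a property like table.scan.max.files):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical illustration of the guard implied by the log message; class,
// method, and parameter names are illustrative, not Accumulo's.
class LookupGuard {
  private static final Logger log = LoggerFactory.getLogger(LookupGuard.class);

  // maxScanFiles would correspond to a limit like table.scan.max.files.
  static boolean canBatchLookup(String extent, int datafileCount, int maxScanFiles) {
    if (datafileCount >= maxScanFiles) {
      log.warn("Tablet {} has too many files, batch lookup can not run", extent);
      return false; // the client retries, producing the repeated warnings
    }
    return true;
  }
}
```

If that's the shape of it, the tserver would keep refusing batch lookups until compactions bring the file count back under the limit, which matches the endless warning spam seen here.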
The only mention of this tablet in any of the other logs is in the Master log, where it acknowledged the report of the split:
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : tablet 2;031;030 was loaded on jenkins:34735
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : jenkins:34735 reported split !0;3\\;45;2\\;875, !0;3\\;516;3\\;45
2020-03-27T23:41:10,783 [master.Master] INFO : Canceled migration of TKeyExtent(table:33, endRow:35 32 38, prevEndRow:35 32 34)
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : jenkins:34735 reported split 3;526;524, 3;528;526
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : tablet 2;036;035 was loaded on jenkins:34735
The Master did its own spamming before the test died, complaining about 4 offline tablets that were blocking balancing. These tablets never seemed to come online:
2020-03-27T23:43:13,845 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
2020-03-27T23:43:13,845 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
2020-03-27T23:43:13,862 [master.Master] DEBUG: [Metadata Tablets]: scan time 0.00 seconds
2020-03-27T23:43:13,862 [master.Master] DEBUG: [Metadata Tablets] sleeping for 60.00 seconds
2020-03-27T23:43:23,508 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:23,508 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:43:33,513 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:33,513 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:43:43,518 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:43,518 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:43:53,522 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:53,522 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:03,527 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:03,527 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:13,532 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.01 seconds
2020-03-27T23:44:13,532 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:13,947 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
2020-03-27T23:44:13,947 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
2020-03-27T23:44:13,968 [master.Master] DEBUG: [Metadata Tablets]: scan time 0.00 seconds
2020-03-27T23:44:13,969 [master.Master] DEBUG: [Metadata Tablets] sleeping for 60.00 seconds
2020-03-27T23:44:23,536 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:23,536 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:33,541 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:33,541 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:43,546 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:43,546 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:53,550 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:53,550 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:45:03,555 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:45:03,555 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
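That repeated debug message suggests the Master simply skips its balancing pass while any tablets are unassigned; a hypothetical sketch of such a guard (names are illustrative, not the actual Master code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the guard implied by the Master's debug output.
class BalancerGuard {
  private static final Logger log = LoggerFactory.getLogger(BalancerGuard.class);

  // Returns true only when it is safe to run the configured balancer.
  boolean canBalance(int unhostedTablets) {
    if (unhostedTablets > 0) {
      log.debug("not balancing because there are unhosted tablets: {}", unhostedTablets);
      return false; // balancing is deferred until every tablet is hosted
    }
    return true;
  }
}
```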
The second tserver showed no problematic activity, just repeated scans of the root tablet (which it was apparently hosting):
2020-03-27T23:45:12,058 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;76%00; : [] 9223372036854775807 false), [!0;~%00; : [] 9223372036854775807 false,!0<%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
2020-03-27T23:45:12,060 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,061 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,061 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;45%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
2020-03-27T23:45:12,062 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,063 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;76%00; : [] 9223372036854775807 false), [!0;~%00; : [] 9223372036854775807 false,!0<%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
2020-03-27T23:45:12,064 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,065 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,065 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;45%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
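Those scans look like tablet-location lookups against the root table. For illustration, an equivalent manual scan of the loc column family would be something like this (a sketch; the real lookups go through the client's internal tablet locator rather than a user-level Scanner):

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class RootLocScan {
  public static void main(String[] args) throws Exception {
    try (AccumuloClient client =
            Accumulo.newClient().from("/path/to/accumulo-client.properties").build();
        Scanner scanner = client.createScanner("accumulo.root", Authorizations.EMPTY)) {
      scanner.fetchColumnFamily(new Text("loc")); // current tserver locations
      for (Entry<Key,Value> e : scanner) {
        System.out.println(e.getKey().getRow() + " -> " + e.getValue());
      }
    }
  }
}
```

Since this tserver was answering those scans normally, the repeated traffic is consistent with the client re-resolving metadata tablet locations over and over while the other tserver refused its lookups.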