HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replica reports #1488

Merged 2 commits on Oct 13, 2020

sodonnel
Contributor

What changes were proposed in this pull request?

If you create a container and then close it, the BCSID is synced on the datanodes, and SCM then records that value by setting the "sequenceId" field on the container's ContainerInfo object.

If you later restart only SCM, the sequenceId is reset to zero, and container reports for the replicas then fail with a stack trace like:

Exception in thread "EventQueue-ContainerReportForContainerReportHandler" java.lang.AssertionError
	at org.apache.hadoop.hdds.scm.container.ContainerInfo.updateSequenceId(ContainerInfo.java:176)
	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerStats(AbstractContainerReportHandler.java:108)
	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:83)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:162)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:130)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
	at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The assertion fails because updateSequenceId does not allow the sequenceId to be changed on a CLOSED container:

  public void updateSequenceId(long sequenceID) {
    assert (isOpen() || state == HddsProtos.LifeCycleState.QUASI_CLOSED);
    sequenceId = max(sequenceID, sequenceId);
  }

The issue seems to be caused by the serialisation and deserialisation of the ContainerInfo object to and from protobuf, as sequenceId is never persisted or restored.
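To make the failure mode concrete, here is a minimal, self-contained sketch of a lossy round trip. It is not the Ozone code: `ToyContainerInfo`, `serialize`, `deserialize` and `processReplicaReport` are hypothetical stand-ins, a `ByteBuffer` replaces protobuf, and an explicit exception replaces the `assert` so the check fires even without `-ea`. The guard in `processReplicaReport` (only update when the replica's BCSID is ahead) is an assumption about how the report handler decides to call `updateSequenceId`.

```java
import java.nio.ByteBuffer;

public class SequenceIdRoundTrip {

  enum State { OPEN, QUASI_CLOSED, CLOSED }

  // Toy stand-in for ContainerInfo (hypothetical, not the real class).
  static final class ToyContainerInfo {
    final long containerId;
    final State state;
    long sequenceId;

    ToyContainerInfo(long containerId, State state, long sequenceId) {
      this.containerId = containerId;
      this.state = state;
      this.sequenceId = sequenceId;
    }

    // Same rule as the quoted method, expressed as an exception instead of
    // an assert so it is enforced regardless of JVM flags.
    void updateSequenceId(long newSequenceId) {
      if (state != State.OPEN && state != State.QUASI_CLOSED) {
        throw new IllegalStateException("sequenceId update on " + state);
      }
      sequenceId = Math.max(newSequenceId, sequenceId);
    }
  }

  // Toy serialiser. With includeSequenceId == false it models the bug:
  // the field is simply never written.
  static byte[] serialize(ToyContainerInfo info, boolean includeSequenceId) {
    int size = Long.BYTES + Integer.BYTES + (includeSequenceId ? Long.BYTES : 0);
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putLong(info.containerId);
    buf.putInt(info.state.ordinal());
    if (includeSequenceId) {
      buf.putLong(info.sequenceId);
    }
    return buf.array();
  }

  // A missing field falls back to 0, much like an unset numeric proto field.
  static ToyContainerInfo deserialize(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    long id = buf.getLong();
    State state = State.values()[buf.getInt()];
    long sequenceId = buf.hasRemaining() ? buf.getLong() : 0L;
    return new ToyContainerInfo(id, state, sequenceId);
  }

  // Assumed handler logic: touch sequenceId only when the replica is ahead.
  // Returns false when the update is rejected.
  static boolean processReplicaReport(ToyContainerInfo info, long replicaBcsId) {
    if (info.sequenceId < replicaBcsId) {
      try {
        info.updateSequenceId(replicaBcsId);
      } catch (IllegalStateException e) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    ToyContainerInfo closed = new ToyContainerInfo(1L, State.CLOSED, 10L);

    // Simulated SCM restart with the lossy format: sequenceId is lost.
    ToyContainerInfo lossy = deserialize(serialize(closed, false));
    System.out.println("lossy sequenceId=" + lossy.sequenceId
        + ", report accepted=" + processReplicaReport(lossy, 10L));

    // Simulated restart with sequenceId persisted: the report is a no-op.
    ToyContainerInfo fixed = deserialize(serialize(closed, true));
    System.out.println("fixed sequenceId=" + fixed.sequenceId
        + ", report accepted=" + processReplicaReport(fixed, 10L));
  }
}
```

Under these assumptions, the lossy round trip leaves a CLOSED container with sequenceId 0, so the first replica report carrying a higher BCSID attempts an update and is rejected. Once the field is written and read back, the restored value already matches the replica and no update is attempted.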

However, I am also confused about how this ever worked, as this is a pretty significant problem.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4336

How was this patch tested?

Added a new integration test that reproduces the issue without the fix.

@sodonnel changed the title from "HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replicas reports" to "HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replica reports" on Oct 13, 2020
@sodonnel merged commit 7ae037e into apache:master on Oct 13, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Oct 14, 2020
* master: (23 commits)
  HDDS-4122. Implement OM Delete Expired Open Key Request and Response (apache#1435)
  HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replica reports (apache#1488)
  Remove extra serialization from getBlockID (apache#1470)
  HDDS-4262. Use ClientID and CallID from Rpc Client to detect retry requests (apache#1436)
  HDDS-4285. Read is slow due to frequent calls to UGI.getCurrentUser() and getTokens() (apache#1454)
  HDDS-4312. findbugs check succeeds despite compile error (apache#1476)
  HDDS-4311. Type-safe config design doc points to OM HA (apache#1477)
  HDDS-3814. Drop a column family through debug cli tool (apache#1083)
  HDDS-3728. Bucket space: check quotaUsageInBytes when write key and allocate block. (apache#1458)
  HDDS-4316. Upgrade to angular 1.8.0 due to CVE-2020-7676 (apache#1481)
  HDDS-4325. Incompatible return codes from Ozone getconf -confKey (apache#1485). Contributed by Doroszlai, Attila.
  HDDS-4309. Fix inconsistency in recon config keys starting with recon and not ozone (apache#1478)
  HDDS-4310: Ozone getconf broke the compatibility (apache#1475)
  HDDS-4298. Use an interface in Ozone client instead of XceiverClientManager (apache#1460)
  HDDS-4280. Document notable configurations for Recon. (apache#1448)
  HDDS-4156. add hierarchical layout to Chinese doc (apache#1368)
  HDDS-4242. Copy PrefixInfo proto to new project hadoop-ozone/interface-storage (apache#1444)
  HDDS-4264. Uniform naming conventions of Ozone Shell Options. (apache#1447)
  HDDS-4271. Avoid logging chunk content in Ozone Insight (apache#1466)
  HDDS-4299. Display Ratis version with ozone version (apache#1464)
  ...