GMS cannot start due to misconfigured retention.ms in kafka topic DataHubUpgradeHistory_v1 #7882

Closed
jinlintt opened this issue Apr 21, 2023 · 2 comments
Labels
bug Bug report stale

Comments

@jinlintt
Contributor

jinlintt commented Apr 21, 2023

v0.10.0 introduced the new DataHubUpgradeHistory_v1 topic to coordinate between the system update job and GMS. Specifically, GMS won't be able to start until it can read a message from the DataHubUpgradeHistory_v1 topic.

The intention was to configure the topic with infinite retention. However, the Kafka command in the kafka-setup.sh script had a bug: the leading -- was missing from the --config argument. As a result, the infinite retention setting never took effect, and the topic was created with the default retention period of 7 days.
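
To make the difference concrete, here is a minimal sketch of a topic-creation command with the retention override spelled correctly. It is not copied from kafka-setup.sh: the helper name print_create_topic_cmd is hypothetical, and the partition and replication counts of 1 are placeholder assumptions. It only prints the command so the arguments can be reviewed before anything is run.

```shell
# Hypothetical helper (not taken from kafka-setup.sh): print the topic-creation
# command instead of running it, so the arguments can be reviewed first.
# Arguments: $1 = topic name, $2 = bootstrap server.
# Partition and replication counts of 1 are placeholder assumptions.
print_create_topic_cmd() {
  echo "kafka-topics.sh --create --if-not-exists" \
       "--bootstrap-server $2 --topic $1" \
       "--partitions 1 --replication-factor 1" \
       "--config retention.ms=-1"
}
```

Calling `print_create_topic_cmd DataHubUpgradeHistory_v1 localhost:9092` prints a single command line. The part that matters is the final `--config retention.ms=-1`; in the buggy version the same argument was written without the leading `--`.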

39920bb#diff-49b80548c7d96c9546170eefe1ef5340ef1d1a7e3dd67c4cd9f0655736156526

This bug was fixed in v0.10.1 in the following commit. However, if you already deployed v0.10.0, retention.ms will stay stuck at 7 days even after you upgrade to v0.10.1, because the kafka-setup.sh script won't recreate the topic if it already exists.

b4b3a39#diff-49b80548c7d96c9546170eefe1ef5340ef1d1a7e3dd67c4cd9f0655736156526R119

This means that if you have deployed v0.10.0, your GMS won't be able to start if it is restarted more than seven days after the last run of the system update job, because the messages will have expired. This can happen if your K8s provider performs a maintenance update of your nodes, which is what happened in our case.
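
The arithmetic behind the seven-day window can be sketched as follows. The function names retention_days and messages_expired are hypothetical, timestamps are assumed to be Unix epoch seconds, and the check is only an approximation: the exact moment the broker deletes expired messages may differ.

```shell
# Convert a retention.ms value to whole days (integer division).
retention_days() {
  echo $(( $1 / 1000 / 60 / 60 / 24 ))
}

# Return 0 (true) if messages written at last_run would be past retention at now.
# Arguments: $1 = last system update run (epoch seconds),
#            $2 = current time (epoch seconds),
#            $3 = the topic's retention.ms value.
# A negative retention.ms (-1) means infinite retention, so nothing expires.
messages_expired() {
  last_run=$1; now=$2; retention_ms=$3
  if [ "$retention_ms" -lt 0 ]; then
    return 1
  fi
  [ $(( now - last_run )) -gt $(( retention_ms / 1000 )) ]
}
```

For example, `retention_days 604800000` prints 7, which matches the default retention period described above. A call such as `messages_expired "$last_run" "$(date +%s)" 604800000` could then be used to warn before a planned restart, where last_run is a value you record yourself.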

If you have upgraded from v0.9.x to v0.10.1 directly, then you won't be affected by this bug.

There are two workarounds for this issue, the first temporary and the second permanent:

  1. Run helm upgrade again. This runs the system update job again, and your GMS will survive a restart for another 7 days.
  2. Log in to your Kafka pod, then run the following commands to update the DataHubUpgradeHistory_v1 topic's retention to infinite. This fixes the issue for good. Note that the commands below also read the topic's retention.ms config before and after the change, so you can make sure it was updated properly.
$ /opt/bitnami/kafka/bin/kafka-configs.sh --entity-type topics --entity-name DataHubUpgradeHistory_v1 --bootstrap-server localhost:9092 --describe --all | grep "retention.ms"
  retention.ms=604800000 sensitive=false synonyms={}
  delete.retention.ms=86400000 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.delete.retention.ms=86400000}

$ /opt/bitnami/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name DataHubUpgradeHistory_v1 --add-config retention.ms=-1
Completed updating config for topic DataHubUpgradeHistory_v1.

$ /opt/bitnami/kafka/bin/kafka-configs.sh --entity-type topics --entity-name DataHubUpgradeHistory_v1 --bootstrap-server localhost:9092 --describe --all | grep "retention.ms"
  retention.ms=-1 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:retention.ms=-1}
  delete.retention.ms=86400000 sensitive=false synonyms={DEFAULT_CONFIG:log.cleaner.delete.retention.ms=86400000}
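
If you want to script the before/after check, note that grep "retention.ms" also matches delete.retention.ms, which is why both lines appear in the output above. A minimal sketch of a stricter filter follows; the function name extract_retention_ms is hypothetical, and it assumes the describe output has the same line layout as shown in this issue.

```shell
# Read `kafka-configs.sh --describe --all` output on stdin and print only the
# topic's retention.ms value. Anchoring the pattern at the start of the line
# skips delete.retention.ms; the optional minus sign allows for -1 (infinite).
extract_retention_ms() {
  sed -n 's/^[[:space:]]*retention\.ms=\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}
```

Piping the describe command from above into `extract_retention_ms` should print 604800000 before the change and -1 after it.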
@github-actions

github-actions bot commented Jun 8, 2023

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@github-actions github-actions bot added the stale label Jun 8, 2023
@github-actions

github-actions bot commented Jul 8, 2023

This issue was closed because it has been inactive for 30 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Jul 8, 2023