Skip to content

Prevent dynamic service deletion during upgrade#16151

Merged
bhardwaj-priyanshu merged 1 commit into
developfrom
fix/service-creation
May 29, 2026
Merged

Prevent dynamic service deletion during upgrade#16151
bhardwaj-priyanshu merged 1 commit into
developfrom
fix/service-creation

Conversation

@bhardwaj-priyanshu
Copy link
Copy Markdown
Contributor

@bhardwaj-priyanshu bhardwaj-priyanshu commented May 29, 2026

Context

During CDAP upgrades (such as metadata or appfabric), the Kubernetes Services dynamically created by CDAP pods were sometimes garbage-collected (deleted) by Kubernetes, leaving the active pods unreachable and causing 503 errors.

This occurred due to two separate issues:

  1. Stale Owner References: If the port/payload has not changed KubeDiscoveryService skipped updating the Service. The Service's ownerReferences remained pointing to the old, inactive ReplicaSet. When GKE cleaned up the old ReplicaSet, the Service was automatically garbage-collected.
  2. Rollout Race Condition: A terminating pod (from the old ReplicaSet) that is slowly shutting down can complete its startup sequence and register after the new pod has already registered. This overwrites the Service's ownerReferences back to the dying ReplicaSet, causing GC deletion when the old ReplicaSet is pruned.

Testing

  • Verified by 200 runs.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a check to skip service registration if the underlying Kubernetes pod is terminating, passing the pod name to KubeDiscoveryService and querying its deletion timestamp. It also updates service update logic to check for changes in owner references. The review feedback highlights several improvement opportunities: removing debugging log prefixes ("Priyanshuuuu-" and "Priyanshuuu -"), downgrading a registration skip log from error to warning, and adding null checks for currentService.getMetadata() and the retrieved V1Pod object to prevent potential NullPointerExceptions.

@bhardwaj-priyanshu bhardwaj-priyanshu force-pushed the fix/service-creation branch 5 times, most recently from 15215aa to 604e025 Compare May 29, 2026 08:32
@cdapio cdapio deleted a comment from gemini-code-assist Bot May 29, 2026
@bhardwaj-priyanshu bhardwaj-priyanshu force-pushed the fix/service-creation branch 2 times, most recently from 982992a to 0a0da88 Compare May 29, 2026 08:40
@cdapio cdapio deleted a comment from gemini-code-assist Bot May 29, 2026
@bhardwaj-priyanshu bhardwaj-priyanshu changed the title prevent dynamic service deletion during upgrade Prevent dynamic service deletion during upgrade May 29, 2026
@cdapio cdapio deleted a comment from gemini-code-assist Bot May 29, 2026
@cdapio cdapio deleted a comment from gemini-code-assist Bot May 29, 2026
@bhardwaj-priyanshu bhardwaj-priyanshu added the build Triggers github actions build label May 29, 2026
@bhardwaj-priyanshu bhardwaj-priyanshu marked this pull request as ready for review May 29, 2026 08:45
* @param loadBalancerServiceList list of services that should be exposed via LoadBalancer
* @param loadBalancerServiceAnnotations annotations to apply to LoadBalancer services
*/
public KubeDiscoveryService(String namespace, String namePrefix, @Nullable String podName,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid introducing new constructor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private static final String SERVICE_TYPE_CLUSTER_IP = "ClusterIP";
private static final String PAYLOAD_NAME = "cdap.service.payload";

@Nullable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be nullable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.map(V1ObjectMeta::getOwnerReferences)
.orElse(Collections.emptyList());

List<V1OwnerReference> safeExpectedOwners = expectedOwners == null ?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for null check.

See:

this.ownerReferences = Collections.unmodifiableList(new ArrayList<>(ownerReferences));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

List<V1OwnerReference> safeExpectedOwners = expectedOwners == null ?
Collections.emptyList() : expectedOwners;

return currentOwners.size() != safeExpectedOwners.size()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use predefined Gauva or Collection util method for comparing collections.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used Set

}

private boolean isPodTerminating() {
if (podName == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove null check

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.map(V1ObjectMeta::getDeletionTimestamp)
.isPresent();

} catch (Exception e) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return true only if e is ApiException with 404.

Let other exceptions propagate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

private boolean hasOwnerReferencesChanged(V1Service currentService,
@Nullable List<V1OwnerReference> expectedOwners) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove @Nullable annotation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
27.8% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@bhardwaj-priyanshu bhardwaj-priyanshu merged commit 156cb17 into develop May 29, 2026
17 of 19 checks passed
@bhardwaj-priyanshu bhardwaj-priyanshu deleted the fix/service-creation branch May 29, 2026 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Triggers github actions build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants