Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: reconcile thinruntime failed when dataset is deleted #3300

Merged
merged 13 commits into from
Jun 29, 2023

Conversation

wangshli
Copy link
Contributor

Ⅰ. Describe what this PR does

bugfix: reconcile thinruntime failed when dataset is deleted

Ⅱ. Does this pull request fix one issue?

fixes #3295

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Jun 21, 2023

Hi @wangshli. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov
Copy link

codecov bot commented Jun 21, 2023

Codecov Report

Merging #3300 (1adf649) into master (3fe30e2) will increase coverage by 0.03%.
The diff coverage is 67.22%.

❗ Current head 1adf649 differs from pull request most recent head a6a2f41. Consider uploading reports for the commit a6a2f41 to get more accurate results

@@            Coverage Diff             @@
##           master    #3300      +/-   ##
==========================================
+ Coverage   65.51%   65.55%   +0.03%     
==========================================
  Files         399      399              
  Lines       23198    23196       -2     
==========================================
+ Hits        15198    15205       +7     
+ Misses       6215     6212       -3     
+ Partials     1785     1779       -6     
Impacted Files Coverage Δ
pkg/ddc/juicefs/data_migrate.go 67.96% <ø> (+1.43%) ⬆️
pkg/ddc/juicefs/transform_fuse.go 78.29% <ø> (+1.21%) ⬆️
pkg/ddc/juicefs/utils.go 78.71% <ø> (-0.72%) ⬇️
pkg/ddc/thin/referencedataset/cm.go 46.83% <45.83%> (ø)
pkg/ddc/thin/referencedataset/engine.go 50.00% <58.82%> (-2.95%) ⬇️
pkg/ddc/thin/referencedataset/runtime.go 68.18% <64.70%> (+13.07%) ⬆️
pkg/ddc/base/dataset.go 100.00% <100.00%> (ø)
pkg/ddc/thin/engine.go 86.79% <100.00%> (+1.68%) ⬆️
pkg/ddc/thin/referencedataset/sync.go 60.00% <100.00%> (+2.46%) ⬆️
pkg/ddc/thin/referencedataset/volume.go 65.09% <100.00%> (ø)

@@ -87,17 +87,17 @@ func (r *RuntimeReconciler) ReconcileInternal(ctx cruntime.ReconcileRequestConte
return utils.RequeueIfError(errors.Wrap(err, "Failed to create"))
}

// 2.Get or create the engine
engine, err := r.implement.GetOrCreateEngine(ctx)
// 2.Get the ObjectMeta of runtime
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add comments about the reason of changing the order of step 2 and 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the order was used to judge GetOrCreateEngine failed reason which is runtime having deletionTimeStamp. In this case we should ignore the GetOrCreateEngine error and continue to reconcileruntimeDeletion, but it would cause engine is a nil pointer. And we have resolvd this problem inside GetOrCreateEngine so that it would return an engine although it could not get mounted dataset. So the order is no need to change now and i will fix it.

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
@@ -79,7 +79,7 @@ func TestGetMountedDatasetNamespacedName(t *testing.T) {
},
}
for _, tt := range tests {
if got := GetMountedDatasetNamespacedName(tt.virtualDataset); len(got) != tt.want {
if got := GetMountedDatasetNamespacedName(tt.virtualDataset.Spec.Mounts); len(got) != tt.want {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming the function name to GetPhysicalDatasetFromMounts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -117,10 +117,21 @@ func Precheck(client client.Client, key types.NamespacedName) (found bool, err e
func CheckReferenceDatasetRuntime(client client.Client, runtime *datav1alpha1.ThinRuntime) (bool, error) {
dataset, err := utils.GetDataset(client, runtime.Name, runtime.Namespace)
if err != nil {
return false, err
if utils.IgnoreNotFound(err) == nil && runtime.Status.Mounts != nil && len(runtime.Status.Mounts) != 0 {
Copy link
Collaborator

@cheyang cheyang Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make it work even the virtualDataset is already deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, the virtualDataset would not be deleted because its reference runtime has not been cleaned up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it may happen when deleting virtualDataset forcely. How to handle this then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@cheyang
Copy link
Collaborator

cheyang commented Jun 25, 2023

/test fluid-e2e

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
var mounted []types.NamespacedName
if dataset != nil {
// getMountedDataset from dataset first
mounted = base.GetPhysicalDatasetFromMounts(dataset.Spec.Mounts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding more logging info for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} else if runtime.Status.Mounts != nil && len(runtime.Status.Mounts) != 0 {
// then try to getMountedDataset from runtime
mounted = base.GetPhysicalDatasetFromMounts(runtime.Status.Mounts)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if dataset is not found and the length of runtime mounts is 0? How will the user handle this situation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case will be protected by checking existence of reference datasets before removing any physical dataset. This can be done in the next PR.

@wangshli wangshli requested a review from cheyang June 25, 2023 09:01
mounted = base.GetPhysicalDatasetFromMounts(runtime.Status.Mounts)
}
// not mount other datasets
if len(mounted) == 0 {
return false, nil
}

// patch runtime with reference annotation
_, err = PatchReferenceThinRuntimeAnnotation(ctx.Client, runtime)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving PatchReferenceThinRuntimeAnnotation to another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
@cheyang
Copy link
Collaborator

cheyang commented Jun 26, 2023

/test fluid-e2e

} else if runtime.Status.Mounts != nil && len(runtime.Status.Mounts) != 0 {
// then try to getMountedDataset from runtime
mounted = base.GetPhysicalDatasetFromMounts(runtime.Status.Mounts)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case will be protected by checking existence of reference datasets before removing any physical dataset. This can be done in the next PR.

func CheckReferenceDatasetRuntime(ctx cruntime.ReconcileRequestContext, runtime *datav1alpha1.ThinRuntime) (bool, error) {
dataset, err := utils.GetDataset(ctx.Client, runtime.Name, runtime.Namespace)
if err != nil && utils.IgnoreNotFound(err) != nil {
// ignore dataset not found err and try to get mounted dataset from runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment should be added below? This is the case which does not ignore error.

if dataset != nil {
// get mountedRuntimeInfo from dataset first
mountedNameSpacedNames = base.GetPhysicalDatasetFromMounts(dataset.Spec.Mounts)
} else if runtime.Status.Mounts != nil && len(runtime.Status.Mounts) != 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check runtime.Status.Mounts != nil here because len(nil) == 0. We can remove runtime.Status.Mounts != nil to avoid code redundancy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
}

// getMountedRuntimeInfo get mountedRuntimeInfo from dataset.
// If could not get dataset, getMountedRuntimeInfo try to get mountedRuntimeInfo from runtime status.
func (e *ReferenceDatasetEngine) getMountedRuntimeInfo() (base.RuntimeInfoInterface, error) {
if e.mountedRuntimeInfo != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment: // If already have mountedRuntimeInfo, return it directly

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
…ncedataset

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
@cheyang
Copy link
Collaborator

cheyang commented Jun 29, 2023

/test fluid-e2e

1 similar comment
@cheyang
Copy link
Collaborator

cheyang commented Jun 29, 2023

/test fluid-e2e

@@ -260,7 +260,7 @@ func TestGetMountedDatasetSubPath(t *testing.T) {
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := GetMountedDatasetSubPath(tt.args.dataset); !reflect.DeepEqual(got, tt.want) {
if got := GetPhysicalDatasetSubPath(tt.args.dataset); !reflect.DeepEqual(got, tt.want) {
t.Errorf("GetMountedDatasetSubPath() = %v, want %v", got, tt.want)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls also fix the error msg in t.Errorf and Fix the test function's name

Suggested change
t.Errorf("GetMountedDatasetSubPath() = %v, want %v", got, tt.want)
t.Errorf("GetPhysicalDatasetSubPath() = %v, want %v", got, tt.want)

physicalDataset = base.GetPhysicalDatasetFromMounts(runtime.Status.Mounts)
dataset, err := utils.GetDataset(ctx.Client, runtime.Name, runtime.Namespace)
if err != nil {
return false, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not return every error here because in cases where len(runtime.Status.Mounts) == 0 && Dataset not found, the func will return error to keep engine building failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, CheckReferenceDatasetRuntime can't judge whether this dataset is a reference dataset, so we raise the error now. And this case will be repaired by next PR.

newDataset := mountedDataset.DeepCopy()
newDataset.Status.DatasetRef = utils.RemoveString(newDataset.Status.DatasetRef, datasetRefName)
err := e.Client.Status().Update(context.TODO(), newDataset)
if physicalRuntimeInfo != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add log message to indicate physicalRuntimeInfo == nil case so that we can know corner case happened.

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
@@ -61,7 +61,7 @@ func Build(id string, ctx cruntime.ReconcileRequestContext) (base.Engine, error)
return nil, fmt.Errorf("engine %s is failed due to type conversion", ctx.Name)
}

isRef, err := CheckReferenceDatasetRuntime(ctx.Client, runtime)
isRef, err := CheckReferenceDatasetRuntime(ctx, runtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in future we can simply check len(runtime.profileName) == 0 to indicate whether it is a VirtualRuntime or ThinRuntime instead of checking all the dataset mounts.

@cheyang
Copy link
Collaborator

cheyang commented Jun 29, 2023

/test fluid-e2e

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
@sonarcloud
Copy link

sonarcloud bot commented Jun 29, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
10.7% 10.7% Duplication

Copy link
Member

@TrafalgarZZZ TrafalgarZZZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Jun 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheyang, TrafalgarZZZ

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [TrafalgarZZZ,cheyang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot fluid-e2e-bot bot merged commit b56249f into fluid-cloudnative:master Jun 29, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Failed on Reconcile thinruntime when dataset is deleted
3 participants