[BUG]: The csm-isilon-controller keeps getting panic and is restarting #1104

Closed
N1K68 opened this issue Jan 19, 2024 · 3 comments

N1K68 commented Jan 19, 2024

Bug Description

The driver container of the csm-isilon-controller panics with: runtime error: index out of range [2] with length 2.

After some investigation it turned out that the controller had issues with a handful of PVCs for some snapshots that it wasn't able to mount: the directories for those snapshots had been deleted on the Isilon server, although the snapshot objects were still available.
We still don't know how that was possible, however.

Logs

In the driver container, the last log entries were:
" file="/go/src/service/controllerNodeToArrayConnectivity.go:74"
time="2024-01-19T08:40:00Z" level=debug msg="use pollingFrequency as 60 seconds"
time="2024-01-19T08:40:00Z" level=debug runid=360 msg="last connectivity was 0 sec back, tolerance is 60 sec" file="/go/src/service/controllerNodeToArrayConnectivity.go:86"
time="2024-01-19T08:40:00Z" level=info runid=360 msg="array sust070a-vl1047 is connected to node =#=#=10-32-20-36.node-exporter.openshift-monitoring.svc.cluster.local=#=#=10.32.20.36" file="/go/src/service/csi_extension_server.go:145"
time="2024-01-19T08:40:00Z" level=info runid=360 msg="ValidateVolumeHostConnectivity reply connected:true messages:"array sust070a-vl1047 is connected to node compute-16.dev01.sebshift.sebank.se=#=#=10-32-20-36.node-exporter.openshift-monitoring.svc.cluster.local=#=#=10.32.20.36"" file="/go/src/service/csi_extension_server.go:88"
time="2024-01-19T08:40:00Z" level=info msg="/podmon.v1.Podmon/ValidateVolumeHostConnectivity: REP 0360: Connected=true, IosInProgress=false, Messages=[array sust070a-vl1047 is connected to node compute-16.dev01.sebshift.sebank.se=#=#=10-32-20-36.node-exporter.openshift-monitoring.svc.cluster.local=#=#=10.32.20.36]"
time="2024-01-19T08:40:00Z" level=info msg="Get PVC labels for data-delta-zeebe-1 in namespace zeebe-delta"
time="2024-01-19T08:40:00Z" level=error msg="Error retrieving PVC info: client rate limiter Wait returned an error: context deadline exceeded"
time="2024-01-19T08:40:00Z" level=error msg="Cannot retrieve labels for PVC csi.storage.k8s.io/pvc/name in namespace: csi.storage.k8s.io/pvc/namespace, error: client rate limiter Wait returned an error: context deadline exceeded"
time="2024-01-19T08:40:00Z" level=warning msg="No Values Under Metadata"
time="2024-01-19T08:40:00Z" level=info msg="/csi.v1.Controller/CreateVolume: REQ 0361: Name=dev01-1ffeb494be, CapacityRange=required_bytes:5368709120 , VolumeCapabilities=[mount:<> access_mode:<mode:SINGLE_NODE_MULTI_WRITER > ], Parameters=map[AccessZone:az-sust070a-tst AzServiceIP:sust070a-vl1047.powerscale14.seb.net IsiPath:/ifs/az-sust070a-tst/tst-dr-off-csi-sebshift RootClientEnabled:false csi.storage.k8s.io/pv/name:dev01-1ffeb494be csi.storage.k8s.io/pvc/name:data-delta-zeebe-1 csi.storage.k8s.io/pvc/namespace:zeebe-delta], VolumeContentSource=snapshot:<snapshot_id:"5599941===sust070a-vl1047" > , XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-01-19T08:40:00Z" level=info runid=361 msg="Request doesn't include cluster name. Use default cluster 'sust070a-vl1047'" file="/go/src/service/service.go:998"
time="2024-01-19T08:40:00Z" level=debug clusterName=sust070a-vl1047 runid=361 msg="Cluster Name: sust070a-vl1047" file="/go/src/service/controller.go:265"
time="2024-01-19T08:40:00Z" level=debug clusterName=sust070a-vl1047 runid=361 msg="isiSvc already initialized, skip probing" file="/go/src/service/service.go:376"
time="2024-01-19T08:40:00Z" level=debug clusterName=sust070a-vl1047 runid=361 msg="Replication flag unset" file="/go/src/service/controller.go:317"
time="2024-01-19T08:40:00Z" level=info clusterName=sust070a-vl1047 runid=361 msg="Limit parameters considered for quota creation SoftLimit: '' , AdvisoryLimit: '',SoftGracePrd: ''" file="/go/src/service/controller.go:352"
file="/go/src/service/controller.go:352"
panic: runtime error: index out of range [2] with length 2
/go/pkg/mod/github.com/dell/gocsi@v1.9.0/utils/utils_middleware.go:99 +0x37
github.com/dell/gocsi.(*StoragePlugin).injectContext(0xc000392e00, {0x1f3c050, 0xc00078bb00}, {0x1b941c0, 0xc0001982a0}, 0x1b11640?, 0xc000303680)
/go/pkg/mod/github.com/dell/gocsi@v1.9.0/middleware.go:231 +0xa3
github.com/dell/gocsi.(*StoragePlugin).Serve.func1.ChainUnaryServer.func2.1.1({0x1f3c050?, 0xc00078bb00?}, {0x1b941c0?, 0xc0001982a0?})
/go/pkg/mod/github.com/dell/gocsi@v1.9.0/utils/utils_middleware.go:99 +0x37
github.com/dell/csi-isilon/v2/service/interceptor.(*rewriteRequestIDInterceptor).handleServer(0xc00076d3e0?, {0x1f3c050, 0xc00078bb00}, {0x1b941c0, 0xc0001982a0}, 0x19b22c0?, 0xc0003036a0)
/go/src/service/interceptor/interceptor.go:46 +0x133
github.com/dell/gocsi.(*StoragePlugin).Serve.func1.ChainUnaryServer.func2.1.1({0x1f3c050?, 0xc00078bb00?}, {0x1b941c0?, 0xc0001982a0?})
/go/pkg/mod/github.com/dell/gocsi@v1.9.0/utils/utils_middleware.go:99 +0x37
github.com/dell/csi-isilon/v2/service/interceptor.(*interceptor).createVolume(0xc0003234a0, {0x1f3c050, 0xc00078bb00}, 0xc0001982a0, 0xc00062d968?, 0xc0003036c0)
/go/src/service/interceptor/interceptor.go:228 +0x50b
github.com/dell/csi-isilon/v2/service/interceptor.NewCustomSerialLock.func1({0x1f3c050?, 0xc00078bb00?}, {0x1b941c0?, 0xc0001982a0?}, 0xc000762900?, 0xc000303600?)
/go/src/service/interceptor/interceptor.go:114 +0x93
github.com/dell/gocsi.(*StoragePlugin).Serve.func1.ChainUnaryServer.func2.1.1({0x1f3c050?, 0xc00078bb00?}, {0x1b941c0?, 0xc0001982a0?})
/go/pkg/mod/github.com/dell/gocsi@v1.9.0/utils/utils_middleware.go:99 +0x37
github.com/dell/gocsi.(*StoragePlugin).Serve.func1.ChainUnaryServer.func2({0x1f3c050, 0xc00078bb00}, {0x1b941c0, 0xc0001982a0}, 0xc000950a38?, 0x19d3060?)
/go/pkg/mod/github.com/dell/gocsi@v1.9.0/utils/utils_middleware.go:106 +0x99
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1c742e0?, 0xc000312000}, {0x1f3c050, 0xc00078bb00}, 0xc0001981c0, 0xc0004ee280)
/go/pkg/mod/github.com/container-storage-interface/spec@v1.6.0/lib/go/csi/csi.pb.go:5680 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005741e0, {0x1f44240, 0xc000583860}, 0xc000762900, 0xc000201dd0, 0x2d7b7c0, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.57.1/server.go:1358 +0xe15
google.golang.org/grpc.(*Server).handleStream(0xc0005741e0, {0x1f44240, 0xc000583860}, 0xc000762900, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.57.1/server.go:1735 +0x9e7
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/go/pkg/mod/google.golang.org/grpc@v1.57.1/server.go:970 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 74
/go/pkg/mod/google.golang.org/grpc@v1.57.1/server.go:981 +0x145

Screenshots

No response

Additional Environment Information

The Isilon nodes are running code version 9.4.0.14.

VolumeSnapshot:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"snapshot.storage.k8s.io/v1","kind":"VolumeSnapshot","metadata":{"annotations":{},"name":"elasticsearch-master-elasticsearch-master-0-2024-01-16-20230116-before-scaling","namespace":"zeebe-delta"},"spec":{"source":{"persistentVolumeClaimName":"elasticsearch-master-elasticsearch-master-0"},"volumeSnapshotClassName":"file-share-volumesnapshotclass-dr-off"}}
  creationTimestamp: "2024-01-16T13:05:40Z"
  finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  name: elasticsearch-master-elasticsearch-master-0-2024-01-16-20230116-before-scaling
  namespace: zeebe-delta
  resourceVersion: "6052036685"
  uid: c64afb97-7142-48ae-94f8-2c3a0e05e76b
spec:
  source:
    persistentVolumeClaimName: elasticsearch-master-elasticsearch-master-0
  volumeSnapshotClassName: file-share-volumesnapshotclass-dr-off
status:
  boundVolumeSnapshotContentName: snapcontent-c64afb97-7142-48ae-94f8-2c3a0e05e76b
  creationTime: "2024-01-16T14:04:10Z"
  readyToUse: true
  restoreSize: "20694458"

--

isi snapshot snapshots view snapshot-c64afb97-7142-48ae-94f8-2c3a0e05e76b

               ID: 5599937
             Name: snapshot-c64afb97-7142-48ae-94f8-2c3a0e05e76b
             Path: /ifs/az-sust070a-tst/tst-dr-off-csi-sebshift/dev01-1bf75cb38d
        Has Locks: No
         Schedule: -
  Alias Target ID: -
Alias Target Name: -
          Created: 2024-01-16T15:04:10
          Expires: -
             Size: 175.45M
     Shadow Bytes: 0.00
        % Reserve: 0.00%
     % Filesystem: 0.00%
            State: active

ls -l /ifs/az-sust070a-tst/tst-dr-off-csi-sebshift/dev01-1bf75cb38d
ls: /ifs/az-sust070a-tst/tst-dr-off-csi-sebshift/dev01-1bf75cb38d: No such file or directory

Steps to Reproduce

1. Create a VolumeSnapshot and mount it as a PVC.
2. Remove the directory for the snapshot on the Isilon server.

Expected Behavior

I had hoped that the controller would time out after a while and keep running.
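
For reference, one general way to get that behavior in a Go gRPC server is a recover interceptor that turns a panic inside a unary handler (such as CreateVolume or DeleteSnapshot) into an RPC error instead of letting the whole process crash. The sketch below is only the standard Go pattern under that assumption, with hypothetical names; it is not the driver's or gocsi's actual middleware.

package middleware

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// PanicRecoveryInterceptor converts a panic raised inside a unary RPC handler
// into a gRPC Internal error, so the server keeps serving other requests
// instead of the controller pod restarting.
func PanicRecoveryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (resp interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered from panic in %s: %v", info.FullMethod, r)
			err = status.Errorf(codes.Internal, "%s: recovered from panic: %v", info.FullMethod, r)
		}
	}()
	return handler(ctx, req)
}

// Hypothetical usage when constructing the gRPC server:
//   srv := grpc.NewServer(grpc.ChainUnaryInterceptor(PanicRecoveryInterceptor))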

CSM Driver(s)

isilon v2.9.0

Installation Type

The Operator was installed manually, without OLM, using: "bash scripts/install.sh"

Container Storage Modules Enabled

resiliency v1.8.0
observability v1.7.0

Container Orchestrator

OpenShift 4.13.27

Operating System

RHEL 9.2

N1K68 added the needs-triage and type/bug labels Jan 19, 2024
shanmydell added this to the v1.10.0 milestone Jan 22, 2024
shanmydell removed the needs-triage label Jan 22, 2024
shanmydell (Collaborator) commented

@cbartoszDell : Please have a look.


N1K68 commented Jan 23, 2024

It turns out that this issue also happens when the controller tries to delete a snapshot, and we have verified that this snapshot's directory was available.

2024-01-23T10:31:15.311451576Z time="2024-01-23T10:31:15Z" level=debug clusterName=sust070a-vl1047 runid=13 msg="Cluster Name: sust070a-vl1047" file="/go/src/service/controller.go:1390"
2024-01-23T10:31:15.311474055Z time="2024-01-23T10:31:15Z" level=debug clusterName=sust070a-vl1047 runid=13 msg="isiSvc already initialized, skip probing" file="/go/src/service/service.go:376"
2024-01-23T10:31:15.311492798Z time="2024-01-23T10:31:15Z" level=debug clusterName=sust070a-vl1047 runid=13 msg="ignoreUnresolvableHosts value is 'true', for clusterName 'sust070a-vl1047'" file="/go/src/service/controller.go:1403"
2024-01-23T10:31:15.311580133Z time="2024-01-23T10:31:15Z" level=debug clusterName=sust070a-vl1047 runid=13 msg="Node ID 'compute-12.dev01.sebshift.sebank.se=#=#=10-32-20-18.defender.twistlock.svc.cluster.local=#=#=10.32.20.18' parsed into node name 'compute-12.dev01.sebshift.sebank.se', node FQDN '10-32-20-18.defender.twistlock.svc.cluster.local' and IP address '10.32.20.18'" file="/go/src/common/utils/utils.go:409"
2024-01-23T10:31:15.311598831Z time="2024-01-23T10:31:15Z" level=debug clusterName=sust070a-vl1047 runid=13 msg="RemoveExportClientByIDWithZone client Name 'compute-12.dev01.sebshift.sebank.se', client FQDN '10-32-20-18.defender.twistlock.svc.cluster.local' client IP '10.32.20.18'" file="/go/src/service/isiService.go:683"
2024-01-23T10:31:15.311623640Z time="2024-01-23T10:31:15Z" level=debug clusterName=sust070a-vl1047 runid=13 msg="RemoveExportClientByName client '[10.32.20.18 compute-12.dev01.sebshift.sebank.se 10-32-20-18.defender.twistlock.svc.cluster.local]'" file="/go/src/service/isiService.go:687"
2024-01-23T10:31:15.311711935Z [DEBUG]
2024-01-23T10:31:15.311711935Z -------------------------- GOISILON HTTP REQUEST -------------------------
2024-01-23T10:31:15.311711935Z GET /platform/2/protocols/nfs/exports/100701?zone=az-sust070a-tst HTTP/1.1
2024-01-23T10:31:15.311711935Z Host: sust070a-01-325.sebank.se:8080
2024-01-23T10:31:15.311711935Z Authorization: d827724602@corp1.ad1.seb.net:******
2024-01-23T10:31:15.311711935Z
2024-01-23T10:31:15.311711935Z
2024-01-23T10:31:15.320781521Z time="2024-01-23T10:31:15Z" level=info msg="/csi.v1.Controller/DeleteSnapshot: REQ 0014: SnapshotId=5514835===sust070a-vl1047, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
2024-01-23T10:31:15.320833476Z time="2024-01-23T10:31:15Z" level=info runid=14 msg="DeleteSnapshot started" file="/go/src/service/controller.go:1740"
2024-01-23T10:31:15.324186072Z panic: runtime error: index out of range [2] with length 2


N1K68 commented Jan 23, 2024

It seems that the VolumeSnapshotContent objects for the snapshots that the controller has problems deleting all report a snapshotHandle in the status field that looks like "5599939===sust070a-vl1047" (that is, two fields separated by ===), while the snapshotHandle for any newly created VolumeSnapshotContent has three fields (5605183===sust070a-vl1047===az-sust070a-tst). Could that be the issue?
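
That would be consistent with the panic message: if the driver splits the snapshot handle on "===" and then reads the third field (index 2) as the access zone, a legacy two-field handle produces a slice of length 2 and the access fails with "index out of range [2] with length 2". The sketch below only illustrates that failure mode under this assumption; normalizeSnapshotID and its layout are hypothetical, not the driver's actual code.

package main

import (
	"fmt"
	"strings"
)

// normalizeSnapshotID is a hypothetical stand-in for the driver's handle parsing.
// It assumes the format "<snapshotID>===<clusterName>===<accessZone>".
func normalizeSnapshotID(handle string) (id, cluster, zone string) {
	parts := strings.Split(handle, "===")
	// With a two-field handle such as "5599939===sust070a-vl1047",
	// len(parts) == 2 and parts[2] panics with
	// "index out of range [2] with length 2".
	return parts[0], parts[1], parts[2]
}

func main() {
	// A three-field handle parses fine.
	fmt.Println(normalizeSnapshotID("5605183===sust070a-vl1047===az-sust070a-tst"))

	// A two-field handle reproduces the reported panic.
	fmt.Println(normalizeSnapshotID("5599939===sust070a-vl1047"))
}

A tolerant parser would check len(parts) before indexing and either fall back to a default access zone or return an error for two-field handles created by an older driver version.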

gallacher reopened this Jan 24, 2024
gallacher added the area/csi-powerscale label Feb 22, 2024
Labels
area/csi-powerscale: Issue pertains to the CSI Driver for Dell EMC PowerScale
type/bug: Something isn't working. This is the default label associated with a bug issue.