
[MS] MSUnmerged has been restarting in the t2t3 replica #10713

Closed
amaltaro opened this issue Aug 3, 2021 · 13 comments · Fixed by #10792

Comments

@amaltaro
Contributor

amaltaro commented Aug 3, 2021

Impact of the bug
MSUnmerged

Describe the bug
While checking the MSUnmerged logs, Todor noticed that the t2t3 replica has been having random service restarts. Note that it is not the pod being recreated, but the service restarting inside the same pod.

The reason is still unclear, and the only thing we can extract from the service logs is [1].

NOTE: this causes that MSUnmerged replica to keep cleaning up the same RSEs over and over, since the service cache gets wiped out during the restart process.

UPDATE: the t1 POD service also gets restarted when FNAL is enabled.

How to reproduce it
Not clear yet! Note that two replicas work with no issues; only the t2t3 replica is having these problems.

Expected behavior
Investigate the reason for these service/CherryPy restarts and fix the MSUnmerged code, or whatever else is making that replica unstable.

One comment I made to Todor, though, is that it would be interesting to log the number of files retrieved from the Rucio ConMon, just so we know in advance how many files need to be parsed.
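
As a minimal illustration of that suggestion (the helper below is hypothetical and only shows where such a log line could go; it is not an existing MSUnmerged method):

import logging

logger = logging.getLogger("MSUnmerged")

def logConMonFileCount(rseName, unmergedFiles):
    """Log how many files Rucio ConMon returned for an RSE, before any parsing starts."""
    logger.info("Rucio ConMon returned %d unmerged files for RSE %s",
                len(unmergedFiles), rseName)
    return unmergedFiles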

Additional context and error message
[1]

2021-08-02 02:19:42,617:WARNING:MSUnmerged: Error while cleaning RSE: T2_BR_UERJ. Will retry in the next cycle. Err: globus_ftp_client: the server responded with an error 550 550-GlobusError: v=1 c=PATH_NOT_FOUND  550-GridFTP-Errno: 2  55
0-GridFTP-Reason: System error in stat  550-GridFTP-Error-String: No such file or directory  550 End.   
2021-08-02 02:19:42,632:DEBUG:MSUnmerged: 
RSE: T2_BR_UERJ 
DELETING: /store/unmerged/SAM/testSRM/SAM-gridftp.hepgrid.uerj.br.
PFN list with: 15415 entries: 
    gsiftp://gridftp.hepgrid.uerj.br:2811/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-gridftp.hepgrid.uerj.br
    gsiftp://gridftp.hepgrid.uerj.br:2811/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-gridftp.hepgrid.uerj.br/lcg-util/testfile-put-nospacetoken-1557824813-6d2024e5235c.txt
    gsiftp://gridftp.hepgrid.uerj.br:2811/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-gridftp.hepgrid.uerj.br/lcg-util/testfile-put-nospacetoken-1557865488-9e5949b3c90b.txt
    gsiftp://gridftp.hepgrid.uerj.br:2811/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-gridftp.hepgrid.uerj.br/lcg-util/testfile-put-nospacetoken-1557867358-4acd14b37e4a.txt
    ...

[02/Aug/2021:02:19:44]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting
[02/Aug/2021:02:19:44]  INFO: final CherryPy configuration: {'engine.autoreload.on': False,
...

### and another restart
2021-08-02 04:05:47,963:WARNING:MSUnmerged: Error while cleaning RSE: T2_CN_Beijing. Will retry in the next cycle. Err: globus_ftp_client: the server responded with an error 550 550-GlobusError: v=1 c=PATH_NOT_FOUND  550-GridFTP-Errno: 2 
 550-GridFTP-Reason: System error in name  550-GridFTP-Error-String: No such file or directory  550 End.   
2021-08-02 04:05:47,981:DEBUG:MSUnmerged: 
RSE: T2_CN_Beijing 
DELETING: /store/unmerged/SAM/testSRM/SAM-srm.ihep.ac.cn/lcg-util.
PFN list with: 15296 entries: 
    gsiftp://ccsrm.ihep.ac.cn:2811/dpm/ihep.ac.cn/home/cms/store/unmerged/SAM/testSRM/SAM-srm.ihep.ac.cn/lcg-util/testfile-put-nospacetoken-1603812057-d43e8b99c21d.txt
    gsiftp://ccsrm.ihep.ac.cn:2811/dpm/ihep.ac.cn/home/cms/store/unmerged/SAM/testSRM/SAM-srm.ihep.ac.cn/lcg-util/testfile-put-nospacetoken-1604034358-5dbc61d69427.txt
    gsiftp://ccsrm.ihep.ac.cn:2811/dpm/ihep.ac.cn/home/cms/store/unmerged/SAM/testSRM/SAM-srm.ihep.ac.cn/lcg-util/testfile-put-nospacetoken-1605627798-1b14ee8202a1.txt
    gsiftp://ccsrm.ihep.ac.cn:2811/dpm/ihep.ac.cn/home/cms/store/unmerged/SAM/testSRM/SAM-srm.ihep.ac.cn/lcg-util/testfile-put-nospacetoken-1607445255-faf41de51a14.txt
    ...

[02/Aug/2021:04:05:49]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting
[02/Aug/2021:04:05:50]  INFO: final CherryPy configuration: {'engine.autoreload.on': False,
@amaltaro
Contributor Author

amaltaro commented Aug 4, 2021

As we enabled T1_US_FNAL_Disk to be taken into consideration by the MSUnmerged service, we started seeing the t1 webserver crash as well. This time it seems to be consistent, and it happens with the same directory:

2021-08-04 22:13:45,185:DEBUG:MSUnmerged: 
RSE: T1_US_FNAL_Disk 
DELETING: /store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2.
PFN list with: 15409 entries: 
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/140000/D3A5937B-BDDF-524C-BBD8-A7CAA2AD2DFF.root
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/270000/017BC22D-8F2F-3341-8334-8D31E208359D.root
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/270000/01F9113C-A3D4-C849-89DC-1850B01EA812.root
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/270000/023934EC-8A34-E94E-A2B3-237C034229B6.root
    ...

[04/Aug/2021:22:13:49]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting

2021-08-04 22:22:00,236:DEBUG:MSUnmerged: 
RSE: T1_US_FNAL_Disk 
DELETING: /store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2.
PFN list with: 15409 entries: 
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/140000/D3A5937B-BDDF-524C-BBD8-A7CAA2AD2DFF.root
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/270000/017BC22D-8F2F-3341-8334-8D31E208359D.root
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/270000/01F9113C-A3D4-C849-89DC-1850B01EA812.root
    https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIIAutumn18FSPremix/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/AODSIM/GridpackScan_102X_upgrade2018_realistic_v15-v2/270000/023934EC-8A34-E94E-A2B3-237C034229B6.root
    ...

[04/Aug/2021:22:22:04]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting

One can also observe that all these crashes happened when we had >= 15k files to be deleted under the same directory.

I'm going to skip the T1_US_FNAL_Disk RSE once again and, if possible, leave it up to Todor to investigate this issue post-vacation.

@nsmith-

nsmith- commented Aug 4, 2021

We could probably add a debug printout of the number of PFNs at:

delResult = ctx.unlink(pfnList)

Likely there is some buffer in gfal2 that is overflowing and causing a segfault if the list is sufficiently large. Once running with the printout, we can see what the largest non-crashing value is and maybe pick a chunk size of half that or so. For context, Rucio deletes them one at a time:
https://github.com/rucio/rucio/blob/a45f685af1963c8f1bd60bbbfe3e207e7df7e903/lib/rucio/rse/protocols/gfal.py#L473
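
A minimal sketch of that suggestion, assuming the gfal2 Python bindings are used the same way as in the ctx.unlink() snippet above (illustrative only, not the actual WMCore patch):

import logging
import gfal2  # gfal2 Python bindings, as already used by MSUnmerged

logger = logging.getLogger("MSUnmerged")

def deleteInChunks(pfnList, chunkSize=100):
    """Log the PFN count and delete in fixed-size slices instead of one huge ctx.unlink() call."""
    ctx = gfal2.creat_context()
    logger.debug("About to delete %d PFNs in chunks of %d", len(pfnList), chunkSize)
    delResults = []
    for idx in range(0, len(pfnList), chunkSize):
        chunk = pfnList[idx:idx + chunkSize]
        logger.debug("Calling ctx.unlink() with %d PFNs", len(chunk))
        # collect whatever result object gfal2 returns for this slice
        delResults.append(ctx.unlink(chunk))
    return delResults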

@todor-ivanov todor-ivanov self-assigned this Aug 16, 2021
@todor-ivanov
Contributor

Thanks @amaltaro @nsmith-, I am taking this issue now and will start investigating what is breaking. I agree, this indeed seems like a buffer overflow in the gfal client.

@todor-ivanov
Contributor

Now that we have the deletions in slices, we are sure the gfal buffer is not overflowing, and yet we can see the same case happening, but now with extremely small directories (only 37 entries last time):

  13550 2021-09-16 12:47:46,965:DEBUG:MSUnmerged: 
  13551 RSE: T1_US_FNAL_Disk 
  13552 DELETING: /store/unmerged/RunIISummer20UL16RECOAPV/TTToHadronic_mtop171p5_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v2.
  13553 PFN list with: 37 entries: 
  13554     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer20UL16RECOAPV/TTToHadronic_mtop171p5_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP
  13555     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer20UL16RECOAPV/TTToHadronic_mtop171p5_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP
  13556     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer20UL16RECOAPV/TTToHadronic_mtop171p5_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP
  13557     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer20UL16RECOAPV/TTToHadronic_mtop171p5_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP
  13558     ...
  13559 
  13560 [16/Sep/2021:12:47:52]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting

While fairly big ones do seem to go through without any problems, with only a bit of a delay:

   8679 2021-09-16 11:50:26,660:DEBUG:MSUnmerged: 
   8680 RSE: T1_US_FNAL_Disk 
   8681 DELETING: /store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV-sherpa/GEN-SIM/QCDEWNLO_correct_MCRUN2_71_V1-v1.
   8682 PFN list with: 9041 entries: 
   8683     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV-sherpa/GEN-SIM/QCDEWNLO_correct_MCRUN2_71_V1-v1/110012/2
   8684     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV-sherpa/GEN-SIM/QCDEWNLO_correct_MCRUN2_71_V1-v1/110012/7
   8685     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV-sherpa/GEN-SIM/QCDEWNLO_correct_MCRUN2_71_V1-v1/110012/E
   8686     https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV-sherpa/GEN-SIM/QCDEWNLO_correct_MCRUN2_71_V1-v1/20002/28
   8687     ...
   8688 
   8689 [16/Sep/2021:11:50:27] ms-unmer-t1-5985d7b69b-nvkb8 127.0.0.1 "GET /ms-unmerged/data/status HTTP/1.1" 200 OK [data: 302 in 133 out 773 us ] [auth: OK "" "" ] [ref: "" "Go-http-client
....
   8801 [16/Sep/2021:12:08:27] ms-unmer-t1-5985d7b69b-nvkb8 127.0.0.1 "GET /ms-unmerged/data/status HTTP/1.1" 200 OK [data: 302 in 133 out 786 us ] [auth: OK "" "" ] [ref: "" "Go-http-client
   8802 2021-09-16 12:08:36,908:DEBUG:MSUnmerged: RSE: T1_US_FNAL_Disk, Dir: /store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV-sherpa/GEN-SIM/QCDEWNLO_correct_MCRUN2_7
   8803 2021-09-16 12:08:36,908:DEBUG:MSUnmerged: Purging dirEntry: https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/RunIISummer15GS/DYjetstotautau_01234jets_Pt-0ToInf_13TeV- 

@nsmith-

nsmith- commented Sep 16, 2021

If the global Python logging level is DEBUG, I know gfal prints some more detailed information about its internal operations. It might hint at what is going wrong.

@todor-ivanov
Contributor

Actually, back then it was printing way too much, so I separated the Python log level from the gfal log level; now we have Python in DEBUG and gfal in WARNING: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/blob/prod/reqmgr2ms-unmerged/config-unmerged.py#L91

I will set it back to DEBUG and see what happens.
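
For illustration, the two levels can be driven independently along these lines; gfal2.set_verbose() and gfal2.verbose_level come from the gfal2 Python bindings, but treat the exact enum member names as an assumption to verify against the installed version (the real settings live in the config file linked above):

import logging
import gfal2

# Python/service side: keep the MSUnmerged logger at DEBUG
logging.getLogger("MSUnmerged").setLevel(logging.DEBUG)

# gfal2 client side: its verbosity is set separately; raising it from "normal"
# to "debug" is what produces the detailed Davix/GridFTP tracing shown below
gfal2.set_verbose(gfal2.verbose_level.normal)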

@todor-ivanov
Contributor

Increasing the gfal log level did not give us much [1]. The process basically receives a sudden signal 11 from outside and simply dies in the middle of execution. I still suspect the thing that would give us a clue is somewhere outside the service logs. We had a short iteration with Imran to check the system logs on the machine running the pods, but those were clean, at least in the part Imran provided.

[1]

2021-09-17 17:20:01,688:DEBUG:MSUnmerged: Davix: > DELETE /dcache/uscmsdisk/store/unmerged/CMSSW_11_3_0_pre5/RelValZTT_14TeV/MINIAODSIM/PU_113X_mcRun4_realistic_v6_2026D76PU200_RECOo
nly-v1/00000/fadd9323-7d5e-4e7c-a53a-1f3de6b22509.root HTTP/1.1
> User-Agent: gfal2/2.18.2 neon/0.0.29
> TE: trailers
> Host: cmsdcadisk.fnal.gov:2880
> 

2021-09-17 17:20:01,688:DEBUG:MSUnmerged: Davix: Sending request-line and headers:
2021-09-17 17:20:01,688:DEBUG:MSUnmerged: Davix: Request sent; retry is 1.
2021-09-17 17:20:01,798:DEBUG:MSUnmerged: Davix: < HTTP/1.1 404 Not Found
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Date: Fri, 17 Sep 2021 15:20:01 GMT
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Server: dCache/5.2.35-SNAPSHOT
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Strict-Transport-Security: max-age=31536000
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Access-Control-Allow-Credentials: true
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Access-Control-Allow-Origin: *
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Content-Type: text/html
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < Transfer-Encoding: chunked
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: < 
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: End of headers.
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: Running post_headers hooks
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix:  <- negotiateRequest
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: Bad server answer: Content-Length Invalid, impossible to determine answer size
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: NEON Read data flow
2021-09-17 17:20:01,799:DEBUG:MSUnmerged: Davix: [chunk] < 1293

2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: Got chunk size: 4755
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: Reading 4755 bytes of response body.

2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: Got 3825 bytes.
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: NEONRequest::readBlock read 3825 bytes
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: NEON Read data flow
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: Reading 930 bytes of response body.

2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: Got 930 bytes.
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: NEONRequest::readBlock read 930 bytes
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: NEON Read data flow
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: [chunk] < 0

2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: Got chunk size: 0
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: NEONRequest::readBlock read 0 bytes
2021-09-17 17:20:01,800:DEBUG:MSUnmerged: Davix: < 
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix: End of headers.
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix: Running post_send hooks
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix:  <- executeRequest
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix: Destroy HttpRequest
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix: Running destroy hooks.
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix: Request ends.
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix: add old session to cache httpscmsdcadisk.fnal.gov:2880
2021-09-17 17:20:01,801:DEBUG:MSUnmerged: Davix:  davix_unlink <-
[17/Sep/2021:17:20:06]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting

@klannon

klannon commented Sep 21, 2021

@todor-ivanov: Signal 11 usually indicates a segmentation violation, which occurs when a program attempts to access an invalid bit of memory. I would be surprised if this is something being generated from outside the MSUnmerged/gfal process. It would be consistent with an overflow of either memory or an array index in a list. Can you retrieve the core dump and view the stack trace?
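
As a side note, one cheap way to capture at least the Python-level stack when the interpreter dies on a fatal signal is the standard-library faulthandler module (a general suggestion, not something the service was running; it cannot show the C-level gfal2/Globus frames):

import faulthandler
import sys

# Dump the Python traceback of all threads to stderr when a fatal signal
# such as SIGSEGV (signal 11) or SIGABRT (signal 6) is received.
faulthandler.enable(file=sys.stderr, all_threads=True)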

@todor-ivanov
Contributor

Hi @klannon, that's absolutely correct. I did not elaborate too much, but by:

receiving signal 11

I meant by the kernel, exactly for the reasons you explained so well. That is why I was expecting the core dump to be in the system logs rather than the service logs. This is the big disappointment I was talking about when I requested the logs from the machine running the pods, which were kindly provided by Imran and were absolutely clean. The only thing I can see there are the periodic records from containerd: [1].

The "core dumped" part is again a misleading message, because it is basically a message that WMCore logs from [2], based on some bitwise operations on the exit status returned by the system to the CherryPy thread (line 470 at [2]). All of those operations seem to be correct (I double-checked them), but nothing in this piece of code gives a clue about where the core file should be. I still expect it to be in the system logs, though.

I found some more hints in old gfal2 bugs while digging into their repository and old versions: [3]. I even checked the PR that was supposed to solve this, but that is a different direction of investigation. It is a bug related to the creation of a ctx instance (the gfal context manager), which somehow overlaps with my latest observation: since we applied the fix that splits the chunks of files sent to gfal into sizes of 100 at most, I noticed that this bug reveals itself with random directory sizes, rather than our first assumption that it is related to a memory overflow and happens only for big lists.

Things are getting messier and messier unfortunately :(

[3]
https://github.com/cern-fts/gfal2/blob/90bc0a58587b986a0d7335f244fd4cfe3079b72f/RELEASE-NOTES#L570

[2]

(exitcode, exitsigno, exitcore) = (exitrc >> 8, exitrc & 127, exitrc & 128)
retval = (exitsigno and ("signal %d" % exitsigno)) or str(exitcode)
retmsg = retval + ((exitcore and " (core dumped)") or "")
restart = (exitsigno > 0 and exitsigno not in (2, 3, 15))
cherrypy.log("WATCHDOG: server exited with exit code %s%s"
             % (retmsg, (restart and "... restarting") or ""))
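
For reference, this decoding follows the standard POSIX wait-status layout (low 7 bits = terminating signal, bit 7 = core-dump flag, high byte = exit code). A quick sanity check with the standard library, assuming a raw status of 139, i.e. signal 11 plus a core dump:

import os

exitrc = 139  # example raw wait status: 11 | 128 -> SIGSEGV with core dump

# the decoding used in the WMCore snippet above
exitcode, exitsigno, exitcore = exitrc >> 8, exitrc & 127, exitrc & 128
print(exitsigno, bool(exitcore))   # 11 True

# the standard library helpers agree
print(os.WIFSIGNALED(exitrc))      # True
print(os.WTERMSIG(exitrc))         # 11
print(os.WCOREDUMP(exitrc))        # True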

[1]

Sep 16 08:54:57 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:54:57.397579484Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:54:58 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:54:58.086028766Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:55:07 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:07.397692219Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:55:08 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:08.085933904Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:55:17 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:17.397919906Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:55:18 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:18.086020586Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:55:27 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:27.398202097Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:55:28 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:28.085896248Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:55:37 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:37.398034400Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:55:38 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:38.085996297Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:55:47 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:47.398181710Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:55:48 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:48.086058455Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:55:57 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:57.397968680Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:55:58 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:55:58.085895259Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6
Sep 16 08:56:07 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:56:07.397969683Z" level=info msg="ExecSync for \"689d307c14864383c95fc27c38c4218e41f36e74
Sep 16 08:56:08 cmsweb-k8s-services--zone-b-azdbnihnhw2p-node-0 containerd[888]: time="2021-09-16T08:56:08.086146065Z" level=info msg="ExecSync for \"c65bbf2fb6fb24c78e32a0d04c98d74295e6c3d6

@amaltaro
Contributor Author

@muhammadimranfarooqi has deployed MSUnmerged with WMCore 1.5.5.pre3 (in HG2111a) to the production cluster.
We should now monitor production ms-unmerged for a couple of days and check whether the service is still crashing/restarting.

@amaltaro
Contributor Author

amaltaro commented Nov 1, 2021

I scanned the ms-unmerged logs and it appears we still got two restarts of the t2t3 instance over the last 4 days, while the other pods had no issues at all. Here is the t2t3 log around the time it crashed and the service was restarted by CherryPy:

2021-10-30 17:52:51,557:DEBUG:MSUnmerged: GSIFTP using certificate /etc/proxy/proxy
2021-10-30 17:52:51,557:DEBUG:MSUnmerged: GSIFTP using private key /etc/proxy/proxy
2021-10-30 17:52:51,557:DEBUG:MSUnmerged: gridftp session for: gsiftp://eoscmsftp.cern.ch:2811 found in  cache !
2021-10-30 17:52:51,557:DEBUG:MSUnmerged: Impossible to get integer parameter GRIDFTP PLUGIN:OPERATION_TIMEOUT, set to default value 300, err Key file does not have key “OPERATION_TIMEOUT” in group “GRIDFTP PLUGIN”
2021-10-30 17:52:51,557:DEBUG:MSUnmerged:    [GridFTP_Request_state::wait_callback] setup gsiftp timeout to 300 seconds
2021-10-30 17:52:51,557:DEBUG:(unknown file):  gridFTP operation done
2021-10-30 17:52:51,557:DEBUG:MSUnmerged: Impossible to get integer parameter GRIDFTP PLUGIN:OPERATION_TIMEOUT, set to default value 300, err Key file does not have key “OPERATION_TIMEOUT” in group “GRIDFTP PLUGIN”
2021-10-30 17:52:51,557:DEBUG:MSUnmerged: Impossible to get integer parameter GRIDFTP PLUGIN:OPERATION_TIMEOUT, set to default value 300, err Key file does not have key “OPERATION_TIMEOUT” in group “GRIDFTP PLUGIN”
2021-10-30 17:52:51,558:DEBUG:MSUnmerged:  -> [GridftpSimpleListReader::GridftpSimpleListReader]
2021-10-30 17:52:51,558:DEBUG:MSUnmerged:   -> [gridftp_read_stream]
2021-10-30 17:52:51,558:DEBUG:MSUnmerged:    [GridFTP_Request_state::wait_callback] setup gsiftp timeout to 300 seconds
2021-10-30 17:52:51,565:DEBUG:(unknown file):  gridFTP operation done
t244098816:p129: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed

[Thread System] invalid value passed to thread interface (EINVAL)[30/Oct/2021:17:52:56]  WATCHDOG: server exited with exit code signal 6 (core dumped)... restarting

With the error above, I think our next priority within the MSUnmerged project would be to actually streamline the certificates used by the service and make sure the certificate/proxy used is properly authorized to access the storage and delete unneeded files. Reported in this issue: #10680

@amaltaro
Contributor Author

I just had another look at the ms-unmerged logs and I did not see any CherryPy restarts for the last 5 days (1.5.6.pre1).
Once we get the right credentials in place, many of the soft errors of the service should be gone as well, leaving it in a more robust state.

@todor-ivanov from my side, I would say we can close this issue. What do you think?

@todor-ivanov
Contributor

Thanks for taking a look @amaltaro! Let's close it. We will certainly keep this in mind during the final check when swapping the credentials for the service. In case the errors come back, we can reopen it at any time.
