use lcg-utils from default WN env. in job wrapper #636

Closed
ericvaandering opened this issue Oct 31, 2013 · 7 comments

@ericvaandering
Member

Original Savannah ticket 73947 reported by None on Wed Oct 13 04:05:04 2010.

Dear All...

I want to alert you to the following situation we have experienced at LIP_Lisbon.

1./ A user came to us complaining that some data disappeared from our SE (a StoRM 1.4 SRM). After some investigation, we concluded that this happened because of an unusual chain of events.

2./ As you may know, when files are copied to a Storage Element, they may be categorized as "volatile" or "permanent". "Volatile" files are deleted by the software after some period (in our case, 6 months). This is what happened to our user.

3./ Usually, users' files are not categorized as volatile (unless they explicitly ask for it in the tools they use), but there was a sequence of events that led to this situation. Looking at the user job logs, we determined the cause.

3.1/ There are two possible methods for copying files, lcg-cp and srmcp. lcg-cp is the recommended way to make copies. srmcp is a more "low level" command invoked by other high level tools such as lcg-cp. Apparently, from the information collected through the user job logs (sent via CRAB), the job tries to list a file using lcg-ls, and since it does not find that command, it starts to use srmls and srmcp.

3.2/ The lcg-cp tools were available on the machines, but we discovered that the user job (CRAB script) rewrote the WN pre-configured environment information, and sourced paths and scripts which could be incorrect or incomplete at the time. This meant that the lcg-cp tools were not found, and that all file copies were done via srmcp. Here is an extract of the user job log:

---*---
which lcg-ls
/opt/lcg/bin/lcg-ls

##### details of SE interaction

2010-05-17 23:56:09.421223:
Executed: unset LD_LIBRARY_PATH; export PATH=/usr/bin:/bin; source /etc/profile; source /opt/glite/etc/profile.d/grid-env.sh ; lcg-ls -b -D srmv2 -t 2400 --verbose srm://srm01.lip.pt:8443/srm/managerv2?SFN=/lustre/lip.pt/data/cms/store/user/pela/MC_31X_V3_MCDB949toEDM_Zbb_v3/LHC7TeV_Zbb_CMSSW336_PATv2/5da882caab8d8856922891cd53e29740/out_7_1.root
Done with exit code: 127
and output:
/bin/sh: lcg-ls: command not found

2010-05-17 23:56:09.566497:
Executed: java -version
Done with exit code: 0
and output:
java version "1.6.0"
OpenJDK Runtime Environment (build 1.6.0-b09)
OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)

2010-05-17 23:56:12.612786:
Executed: srmls -recursion_depth=0 srm://srm01.lip.pt:8443/srm/managerv2?SFN=/lustre/lip.pt/data/cms/store/user/pela/MC_31X_V3_MCDB949toEDM_Zbb_v3/LHC7TeV_Zbb_CMSSW336_PATv2/5da882caab8d8856922891cd53e29740/
Done with exit code: 0
and output:
0 /lustre/lip.pt/data/cms/store/user/pela/MC_31X_V3_MCDB949toEDM_Zbb_v3/LHC7TeV_Zbb_CMSSW336_PATv2/5da882caab8d8856922891cd53e29740
---*---
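
The effect seen in the extract above can be reproduced in miniature. What follows is only a sketch under stated assumptions, not what CRAB actually runs: a dummy `lcg-ls` is created in a temporary directory standing in for /opt/lcg/bin, an empty temporary directory stands in for the reset PATH (/usr/bin:/bin in the log), and the helper name `find_tool` is hypothetical. It assumes a POSIX system.

```python
import os
import shutil
import stat
import tempfile


def find_tool(name, search_path):
    """Return the full path of `name` if it is found on `search_path`, else None."""
    return shutil.which(name, path=search_path)


with tempfile.TemporaryDirectory() as root:
    # Stand-in for /opt/lcg/bin: holds a dummy lcg-ls executable.
    lcg_bin = os.path.join(root, "opt_lcg_bin")
    # Stand-in for the reset PATH from the job log: no lcg-ls in here.
    base_bin = os.path.join(root, "usr_bin")
    os.makedirs(lcg_bin)
    os.makedirs(base_bin)

    tool = os.path.join(lcg_bin, "lcg-ls")
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\nexit 0\n")
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)

    wn_path = os.pathsep.join([base_bin, lcg_bin])  # pre-configured WN environment
    reset_path = base_bin                           # after the wrapper resets PATH

    found_before = find_tool("lcg-ls", wn_path)
    found_after = find_tool("lcg-ls", reset_path)

print("WN environment:", "found" if found_before else "not found")
print("after PATH reset:", "found" if found_after else "not found")
```

With the full search path the tool is located, as `which lcg-ls` reports at the top of the extract. With the reduced path the lookup returns nothing, which is the condition that yields exit code 127 in the shell.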

3.3/ We verified that when the copies are made using lcg-cp (lcg_util-1.7.6-2 + GFAL-client-1.11.8-3 installed in SL5 UIs and WNs) the files are categorized as permanent in our SE. When copies are executed via srmcp (distributed with dcache-srmclient-1.9.5-16.noarch installed in SL5 UIs and WNs) the option for permanent files must be given explicitly; otherwise the default of the SRM concerned is assumed, which may vary. For StoRM, the default is to mark files as volatile.

3.4/ Unfortunately the jobs sent via CRAB (version 2.7.1) invoked srmcp without specifying the permanent option, which meant that the files were categorized as volatile, leading the system to delete them 6 months later.

4./ This situation was triggered by a sequence of events, many of which are outside our control.

4.1/ For us it is impossible to determine whether a file was marked as "volatile" intentionally or not, since both markers are legitimate.

4.2/ CRAB should have copied the files using the same options as the lcg-cp implementation. This would probably include the permanent option, since the default depends on the SRM implementation. Unfortunately, in our case, CRAB rewrote the environment and prevented the job from using the lcg-cp tool.

5./ The corrective measures we took to avoid a repeat of this situation were:

5.1/ Delete the file entries categorized as "volatile" from the Storage Element DB. The system will then not erase these files, whether they were marked volatile intentionally or not.

5.2/ Increase from 6 to 600 months the default time that volatile files remain in storage (used if no other lifetime is set in the command).

5.3/ Get back in touch with the StoRM developers to find a way to define a storage area as permanent by default.

6./ On the CMS side, I think you should check what CRAB is doing (the jobs were using version 2.7.1).

6.1/ If the lcg-cp tools are found, there is no need to look for or source other environment scripts. If they are not found, then it is OK to source other environment scripts.

6.2/ If srmcp is going to be used, always invoke it with the -storagetype=permanent option for jobs sent via CRAB.
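
Points 6.1 and 6.2 together could look roughly like the sketch below. It is an illustration only, not CRAB's actual wrapper code: the function name `build_copy_command` is hypothetical, the URLs in the usage lines are placeholders, and the protocol and timeout options a real job would pass are left out. The function only builds the argument list and executes nothing.

```python
import shutil


def build_copy_command(source, destination, which=shutil.which):
    """Pick the copy tool along the lines of 6.1 and 6.2.

    `which` is injectable so the choice can be exercised without the grid
    tools being installed. Returns an argument list; nothing is executed.
    """
    if which("lcg-cp") is not None:
        # 6.1: lcg-cp is already on the PATH, so the environment is left alone.
        return ["lcg-cp", source, destination]
    # 6.2: fall back to srmcp, always asking for permanent storage explicitly.
    return ["srmcp", "-storagetype=permanent", source, destination]


# Placeholder URLs, for illustration only.
src = "file:///tmp/out_7_1.root"
dst = "srm://se.example.org:8443/srm/managerv2?SFN=/store/user/out_7_1.root"

# Simulate a worker node where lcg-cp cannot be found.
print(build_copy_command(src, dst, which=lambda name: None))
# Simulate a worker node where lcg-cp is available.
print(build_copy_command(src, dst, which=lambda name: "/opt/lcg/bin/" + name))
```

Keeping the tool lookup separate from any rewriting of the environment means the pre-configured WN PATH is consulted first, and the explicit storage type removes the dependence on the SRM's own default.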

I do not know whether the situation has already been corrected, since the last volatile entries in the Storage Element DB go back to August 2010. It may well be that the situation is not corrected but that all transfers since then were done via lcg-cp.

I can provide you extra info if needed.

Goncalo Borges
LIP_Lisbon

@ericvaandering
Member Author

Comment by goncalo on Tue Oct 12 11:12:43 2010

Dear All...

We have already received an answer from the StoRM developers (email from Riccardo Zappi on 12 Oct 2010), which I quote here. This strengthens our suggestion that CRAB should pass the -storagetype=permanent option in its srmcp calls.

Cheers
Goncalo

-------- Original Message --------
Subject: Re: [storm-support] user files deleted in storm 1.4 after lifetime expired, and question about the volatile/permanent space
Date: Tue, 12 Oct 2010 17:34:06 +0200
From: Riccardo Zappi <riccardo.zappi@cnaf.infn.it>
Reply-To: riccardo.zappi@cnaf.infn.it
To: Mario David <david@lip.pt>
CC: storm-support@lists.infn.it, jorge@lip.pt, Gonçalo Borges <goncalo@lip.pt>

I'm sorry for the accidental loss of user's files.

In the SRM v2.2 specification, files are always considered volatile unless otherwise specified.

In WLCG there was an agreement (Removal policies) to consider the file-type permanent as the default type, but the SRM v2.2 specification didn't change.

As you know, this agreement was not implemented in a uniform way. Some SRM clients (e.g. srmcp) have based their behaviour on the assumption that all SRM services would treat files as permanent (but I think that this contradicts the SRM specification). It is clear that the srmcp client wasn't certified for use with StoRM.

It is our opinion that the right implementation of the agreement lives on the SRM client side and not on the SRM service side, because the SRM service has to be compliant with the SRM specification. So, to obey the agreement, any WLCG SRM client must include 'permanent' as the default parameter in any SRM request (where expected) sent to an SRM service.

(...)
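
The client-side default argued for in the quoted email could be sketched as follows. This is only an illustration under assumptions: the helper name `with_permanent_default` is hypothetical, the URLs are placeholders, and it assumes the `-storagetype=` form named in this ticket is how the caller states an explicit choice. An explicit choice is left untouched, since marking a file volatile on purpose is legitimate.

```python
def with_permanent_default(argv):
    """Return a copy of an srmcp argument list that always states a storage type.

    If the caller already passed a -storagetype= option, it is respected;
    otherwise -storagetype=permanent is added right after the program name.
    """
    if any(arg.startswith("-storagetype=") for arg in argv[1:]):
        return list(argv)
    return [argv[0], "-storagetype=permanent", *argv[1:]]


# Placeholder arguments, for illustration only.
plain = ["srmcp", "file:///tmp/out_7_1.root", "srm://se.example.org:8443/out_7_1.root"]
explicit = ["srmcp", "-storagetype=volatile", "file:///tmp/scratch.root",
            "srm://se.example.org:8443/scratch.root"]

print(with_permanent_default(plain))     # gains -storagetype=permanent
print(with_permanent_default(explicit))  # unchanged: the explicit choice wins
```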

@ghost ghost assigned ericvaandering Oct 31, 2013
@ericvaandering
Member Author

Comment by belforte on Wed Oct 13 04:05:04 2010

This item has been reassigned from the project CMS Computing Infrastructure Support support tracker to your tracker.

The original report is still available at support #117279


@ericvaandering
Member Author

Comment by belforte on Wed Oct 13 04:09:00 2010

I changed the subject of this ticket to better reflect the message/request to Crab. I am not sure this can be addressed within the Crab2 scope, but at least it is something to be aware of for the Crab3 wrapper.

@ericvaandering
Member Author

Comment by belforte on Wed Oct 13 04:57:00 2010

See my comments in the original ticket.
My summary here is wrong, since Crab does this already. What is left is whether to change the srmcp call to indicate the storageType. I do not think it is critical; anyhow, I will open a different ticket to avoid confusion. This one can be closed.
StefanoB

@ericvaandering
Member Author

Comment by belforte on Wed Oct 13 05:07:27 2010

opened https://savannah.cern.ch/bugs/index.php?73950 instead.

@ericvaandering
Member Author

Comment by belforte on Wed Jan 5 07:16:56 2011

please close. see comment #4

@ericvaandering
Member Author

Closed by spiga on Tue Feb 1 07:12:55 2011
