pnfsmanager: Fix race leading to transaction failures in Chimera
Motivation:

29 Feb 2016 16:57:34 (PnfsManager) [mCo:6882279:srm2:prepareToPut:-1232458768:-1232458767 SRM PnfsCreateUploadPath] Create upload path failed
org.springframework.jdbc.UncategorizedSQLException: PreparedStatementCallback; uncategorized SQLException for SQL [SELECT ipnfsid,isize,inlink,itype,imode,iuid,igid,iatime,ictime,imtime from path2inodes(?, ?)]; SQL state [25P02]; error code [0]; ERROR: current transaction is aborted, commands ignored until end of transaction block; nested exception is org.postgresql.util.PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:84) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:645) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:680) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:712) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:762) ~[spring-jdbc-4.2.4.RELEASE.jar:4.2.4.RELEASE]
        at org.dcache.chimera.PgSQLFsSqlDriver.path2inodes(PgSQLFsSqlDriver.java:198) ~[chimera-2.14.13.jar:2.14.13]
        at org.dcache.chimera.JdbcFs.path2inodes(JdbcFs.java:633) ~[chimera-2.14.13.jar:2.14.13]
        at org.dcache.chimera.JdbcFs.path2inodes(JdbcFs.java:626) ~[chimera-2.14.13.jar:2.14.13]
        at sun.reflect.GeneratedMethodAccessor312.invoke(Unknown Source) ~[na:na]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_72]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_72]
        at org.dcache.commons.stats.MonitoringProxy.invoke(MonitoringProxy.java:54) ~[dcache-common-2.14.13.jar:2.14.13]
        at com.sun.proxy.$Proxy33.path2inodes(Unknown Source) ~[na:na]
        at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.pathToInode(ChimeraNameSpaceProvider.java:188) ~[dcache-chimera-2.14.13.jar:2.14.13]
        at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.lookupDirectory(ChimeraNameSpaceProvider.java:1154) ~[dcache-chimera-2.14.13.jar:2.14.13]
        at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.installDirectory(ChimeraNameSpaceProvider.java:1145) ~[dcache-chimera-2.14.13.jar:2.14.13]
        at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.installDirectory(ChimeraNameSpaceProvider.java:1139) ~[dcache-chimera-2.14.13.jar:2.14.13]
        at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.createUploadPath(ChimeraNameSpaceProvider.java:1175) ~[dcache-chimera-2.14.13.jar:2.14.13]
        at diskCacheV111.namespace.PnfsManagerV3.createUploadPath(PnfsManagerV3.java:1107) [dcache-core-2.14.13.jar:2.14.13]

The error is caused by two concurrent uploads trying to create the same target
directory. The code tries to recover from the failed mkdir in one of the
transactions, but at that point the transaction is already invalid due to the
failure.
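
To make the failure mode concrete, here is a toy model of it in plain Java. It is only a sketch: `ToyTransaction` is a hypothetical class invented for this illustration, not Chimera or JDBC code, and it ignores transaction isolation. It mimics one PostgreSQL behaviour, namely that once a statement fails inside a transaction, every later statement in that transaction is rejected.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of a PostgreSQL-style transaction; not Chimera or JDBC code. */
public class AbortedTransactionSketch {

    /** Hypothetical stand-in for a transaction working on a set of directories. */
    static class ToyTransaction {
        private final Set<String> directories;
        private boolean aborted = false;

        ToyTransaction(Set<String> directories) {
            this.directories = directories;
        }

        /** Fails, and marks the transaction as aborted, if the directory already exists. */
        void mkdir(String path) {
            checkUsable();
            if (!directories.add(path)) {
                aborted = true;
                throw new IllegalStateException("duplicate key: " + path);
            }
        }

        /** Lookup; rejected like any other statement once the transaction is aborted. */
        boolean exists(String path) {
            checkUsable();
            return directories.contains(path);
        }

        private void checkUsable() {
            if (aborted) {
                throw new IllegalStateException(
                        "current transaction is aborted, commands ignored until end of transaction block");
            }
        }
    }

    /** Tries mkdir, then attempts the lookup-based recovery inside the same transaction. */
    static String recoverInSameTransaction(Set<String> shared, String path) {
        ToyTransaction tx = new ToyTransaction(shared);
        try {
            tx.mkdir(path);
            return "created";
        } catch (IllegalStateException mkdirFailure) {
            try {
                // The old recovery path: look the directory up again.
                return tx.exists(path) ? "found" : "missing";
            } catch (IllegalStateException lookupFailure) {
                return lookupFailure.getMessage();
            }
        }
    }

    public static void main(String[] args) {
        Set<String> shared = new HashSet<>();
        shared.add("/upload/target"); // the competing upload created it first
        System.out.println(recoverInSameTransaction(shared, "/upload/target"));
    }
}
```

In this model the recovery lookup never runs; the call returns the "current transaction is aborted" message instead, which mirrors the log entry above.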

Modification:

Propagate the error as a LockedCacheException and let SRM retry instead.
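
The retry idea can be sketched in isolation as follows. This is a simplified, synchronous illustration, whereas the actual change resends an asynchronous message from a callback. `LockedException` and `callWithRetry` are hypothetical names introduced here, not dCache APIs. The limit of three failed attempts matches the `failures < 3` check in the patch.

```java
import java.util.concurrent.Callable;

/** Bounded retry on a "locked" failure; a sketch, not the SRM implementation. */
public class LockedRetrySketch {

    /** Hypothetical stand-in for dCache's LockedCacheException. */
    static class LockedException extends Exception {
        LockedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation, retrying while it fails with LockedException,
     * and rethrows the last failure once maxAttempts attempts have failed.
     */
    static <T> T callWithRetry(Callable<T> operation, int maxAttempts) throws Exception {
        int failures = 0;
        while (true) {
            try {
                return operation.call();
            } catch (LockedException e) {
                failures++;
                if (failures >= maxAttempts) {
                    throw e;
                }
                // Otherwise fall through and try again in a fresh attempt.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new LockedException("Please retry.");
            }
            return "created";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Each retry corresponds to a new request and therefore a new transaction on the pnfsmanager side, which is why retrying succeeds where the in-transaction lookup could not.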

Result:

Fixed a race condition between two concurrent uploads to the same non-existing
target directory. Symptoms of the race condition were 'PSQLException: ERROR:
current transaction is aborted, commands ignored until end of transaction
block' failures in the pnfs manager log. Both the srm and pnfsmanager services
need to be updated to effectively resolve the race.

Target: trunk
Require-notes: yes
Require-book: no
Request: 2.15
Request: 2.14
Request: 2.13
Acked-by: Paul Millar <paul.millar@desy.de>
Patch: https://rb.dcache.org/r/9084/
(cherry picked from commit 77b450d)
gbehrmann committed Mar 4, 2016
1 parent 979a710 commit 6699508
Showing 2 changed files with 23 additions and 4 deletions.
@@ -39,6 +39,7 @@
 import diskCacheV111.util.FileNotFoundCacheException;
 import diskCacheV111.util.FsPath;
 import diskCacheV111.util.InvalidMessageCacheException;
+import diskCacheV111.util.LockedCacheException;
 import diskCacheV111.util.NotDirCacheException;
 import diskCacheV111.util.NotFileCacheException;
 import diskCacheV111.util.PermissionDeniedCacheException;
@@ -1073,9 +1074,9 @@ private ExtendedInode installSystemDirectory(FsPath path, int mode, List<ACE> ac
             try {
                 inode = parentOfPath.mkdir(path.getName(), 0, 0, mode, acl, tags);
             } catch (FileExistsChimeraFsException e1) {
-                /* Concurrent directory creation. Do another lookup.
+                /* Concurrent directory creation. Current transaction is invalid.
                  */
-                inode = lookupDirectory(Subjects.ROOT, path);
+                throw new LockedCacheException("Concurrent access prevented this operation from completing. Please retry.");
             }
         }
         return inode;
@@ -1091,9 +1092,9 @@ private ExtendedInode installDirectory(Subject subject, FsPath path, int uid, in
             try {
                 inode = mkdir(subject, parentOfPath, path.getName(), uid, gid, mode);
             } catch (FileExistsChimeraFsException e1) {
-                /* Concurrent directory creation. Do another lookup.
+                /* Concurrent directory creation. Current transaction is invalid.
                  */
-                inode = lookupDirectory(subject, path);
+                throw new LockedCacheException("Concurrent access prevented this operation from completing. Please retry.");
             }
         }
         return inode;
@@ -1099,6 +1099,8 @@ private static Map<String,List<String>> resolve(String name, String[] attrIds)
             CellStub.addCallback(_pnfsStub.send(msg),
                                  new AbstractMessageCallback<PnfsCreateUploadPath>()
                                  {
+                                     int failures = 0;
+
                                      @Override
                                      public void success(PnfsCreateUploadPath message)
                                      {
@@ -1108,6 +1110,7 @@ public void success(PnfsCreateUploadPath message)
                                      @Override
                                      public void failure(int rc, Object error)
                                      {
+                                         failures++;
                                          String msg = Objects.toString(error, "");
                                          switch (rc) {
                                          case CacheException.PERMISSION_DENIED:
@@ -1119,6 +1122,21 @@ public void failure(int rc, Object error)
                                          case CacheException.FILE_NOT_FOUND:
                                              future.setException(new SRMInvalidPathException(msg));
                                              break;
+                                         case CacheException.LOCKED:
+                                             if (failures < 3) {
+                                                 /* Usually due to concurrent uploads to the same non-existing target
+                                                  * directory. Retry a few times.
+                                                  */
+                                                 PnfsCreateUploadPath retry =
+                                                         new PnfsCreateUploadPath(subject, fullPath,
+                                                                                  ((DcacheUser) user).getRoot(),
+                                                                                  uid, gid, NameSpaceProvider.DEFAULT,
+                                                                                  size, al, rp, spaceToken, options);
+                                                 CellStub.addCallback(_pnfsStub.send(retry), this, _executor);
+                                             } else {
+                                                 future.setException(new SRMInternalErrorException(msg));
+                                             }
+                                             break;
                                          case CacheException.TIMEOUT:
                                          default:
                                              future.setException(new SRMInternalErrorException(msg));
