This repository has been archived by the owner on Oct 5, 2021. It is now read-only.

Titan DynamoDB does not release all acquired locks on commit (via gremlin) #58

Closed
jfstephe opened this issue Feb 10, 2017 · 12 comments

@jfstephe

jfstephe commented Feb 10, 2017

This issue is explained in detail on StackOverflow. It seems like something isn't right in the DynamoDB backend (the BerkeleyDB backend looks OK).

The problem results in a lock not being released at the right time, so when subsequent transactions run you get:
tx 0x705eafda280e already locked key-column ( 8- 0- 0- 0- 0- 0- 0-128, 80-160) when tx 0x70629e1d56bf tried to lock

When the offending transaction is run, I can see in the logs that 3 locks are acquired while the transaction is open:

titan_server_1 | 120479 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - acquiring lock on ( 8- 0- 0- 0- 0- 0- 0-128, 80-160) at 123552624951495
titan_server_1 | 120489 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - acquiring lock on ( 6-137-160- 48- 46- 48- 46-177, 0) at 123552635424334
titan_server_1 | 120489 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - acquiring lock on ( 6-137-160- 48- 46- 48- 46-178, 0) at 123552635704705

...but only 2 are released when committing it:

titan_server_1 | 120722 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreTransaction - commit id:0x705eafda280e
titan_server_1 | 120722 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring ( 6-137-160- 48- 46- 48- 46-177, 0) in tx 0x705eafda280e because of EXPLICIT
titan_server_1 | 120722 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring ( 6-137-160- 48- 46- 48- 46-178, 0) in tx 0x705eafda280e because of EXPLICIT
titan_server_1 | 120722 [gremlin-server-exec-3] DEBUG org.apache.tinkerpop.gremlin.server.op.AbstractEvalOpProcessor - Preparing to iterate results from - RequestMessage{, requestId=09f27811-dcc3-4e53-a749-22828d34997f, op='eval', processor='', args={gremlin=g.V().hasLabel("databaseMetadata").has("version", "0.0.1").property("version", "0.0.2").next();g.tx().commit();, batchSize=64}} - in thread [gremlin-server-exec-3]
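
For reference, here is the Gremlin from that request translated into rough Java against the TinkerPop API. This is a sketch only: it assumes a graph already opened against the DynamoDB backend, and a schema (per the StackOverflow post) in which writes to the "version" property on "databaseMetadata" vertices take locks, e.g. via a locked index; the follow-up write is illustrative.

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;

public class LockReleaseRepro {

    public static void bumpVersion(Graph graph) {
        GraphTraversalSource g = graph.traversal();

        // Transaction 1: the update from the log above. On commit, only 2 of the
        // 3 acquired locks are expired.
        g.V().hasLabel("databaseMetadata")
             .has("version", "0.0.1")
             .property("version", "0.0.2")
             .next();
        graph.tx().commit();

        // A later transaction touching the same vertex then fails with
        // "tx ... already locked key-column (...) when tx ... tried to lock".
        g.V().hasLabel("databaseMetadata")
             .has("version", "0.0.2")
             .property("version", "0.0.3")
             .next();
        graph.tx().commit();
    }
}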

Please can someone take a look at this? If you could indicate that you are looking into this, that would be good, as I get the impression that this project isn't given that much attention (apologies if that's not the case).

@amcp amcp added the bug label Feb 12, 2017
@jfstephe
Author

@amcp - Can you confirm that someone is looking into this? Do I need to look into this myself (somewhat difficult)? Please can you let me know so we aren't duplicating efforts :-).

@jfstephe
Author

Ok. I've done a bit more digging.....
I added some extra logging in AbstractDynamoDBStore.java and I see 2 acquireLock calls from:

com.thinkaurelius.titan.diskstorage.BackendTransaction.acquireEdgeLock(BackendTransaction.java:221)

The first one stores the key-column in the tx (via put()) inside acquireLock(); the second one doesn't, as it's already there. These both correspond to the first of the 'acquiring lock' log messages from the original post for this issue. This means there's an entry in keyColumnLocalLocks for that key-column. All good so far. For the record, the next 2 locks are for:

com.thinkaurelius.titan.diskstorage.BackendTransaction.acquireIndexLock(BackendTransaction.java:245)

but I don't think that's relevant ATM.
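
To make the expected behaviour concrete, here is a much simplified model of the local lock table implied by those calls (the names echo the log messages; this is only a sketch, not the actual AbstractDynamoDBStore code):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified model of the per-store local lock table: acquiring the same
// key-column twice in the same tx is a no-op, acquiring it from another tx
// fails, and commit is expected to release every key-column the tx locked.
// The bug reported here behaves as if releaseLock() never takes effect for
// one of the three key-columns.
final class LocalLockTable {

    private final ConcurrentMap<String, String> keyColumnLocalLocks = new ConcurrentHashMap<>();

    void acquireLock(String keyColumn, String txId) {
        String holder = keyColumnLocalLocks.putIfAbsent(keyColumn, txId);
        if (holder != null && !holder.equals(txId)) {
            throw new IllegalStateException("tx " + holder + " already locked key-column "
                    + keyColumn + " when tx " + txId + " tried to lock");
        }
        // holder == txId means this tx already holds the lock, so nothing new is
        // stored (this is the second acquireEdgeLock call mentioned above).
    }

    void releaseLock(String keyColumn, String txId) {
        keyColumnLocalLocks.remove(keyColumn, txId); // removes only if this tx is the holder
    }
}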

I also added some logging when 'releaseLock()' is called in the same class. This is where things get weird:

titan_server_1 | 90331 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - RELEASING LOCK KC: (  0-  0-  0-  0-  0-  0-  0-128, 80-160)
titan_server_1 | 90332 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - RELEASING LOCK KC: (  6-137-160- 48- 46- 48- 46-177,  0)
titan_server_1 | 90332 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring (  6-137-160- 48- 46- 48- 46-177,  0) in tx 0x160c512584a70 because of EXPLICIT
titan_server_1 | 90332 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - RELEASING LOCK KC: (  6-137-160- 48- 46- 48- 46-178,  0)
titan_server_1 | 90332 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring (  6-137-160- 48- 46- 48- 46-178,  0) in tx 0x160c512584a70 because of EXPLICIT

Note that all 3 locks should get released, but only the last 2 do. I even added a keyColumnLocalLocks.cleanUp(); call after the invalidate(), but that didn't do anything (I'm no Java guy :-) ).

For me this is all pointing to this library being the source of the issue. I just can't see why the cache wouldn't get invalidated. I know items don't get removed immediately when invalidated (although I think calling cleanUp() should poke it), but it should be removed at some point before the trouble starts.

I had assumed that Google Guava would be pretty solid, but looking at the number of serious-sounding defects reported against it, that could be the issue.
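
For what it's worth, a minimal standalone check of the Guava behaviour in question suggests that invalidate() removes the entry for subsequent lookups straight away, and that cleanUp() only forces pending maintenance work (evictions and removal notifications). A quick sketch:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class GuavaInvalidateCheck {
    public static void main(String[] args) {
        Cache<String, String> locks = CacheBuilder.newBuilder().build();
        locks.put("edge-key-column", "tx-1");

        locks.invalidate("edge-key-column");
        // The entry is already gone for lookups right after invalidate():
        System.out.println(locks.getIfPresent("edge-key-column")); // prints: null

        locks.cleanUp(); // only flushes pending maintenance / removal notifications
    }
}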

@amcp - can you make any suggestions?

@amcp
Contributor

amcp commented Feb 14, 2017

Thank you for the deep dive. As soon as I come up for air, I can take a look at this (probably on the weekend). Meanwhile, let me see if I can get someone else to look at it.

@amcp
Contributor

amcp commented Feb 14, 2017

@jfstephe can you add a test case in a PR that reproduces this issue please? It will help our investigation.

@jfstephe
Author

@amcp Thanks for getting back to me. It would be great if you could get someone to look at this. I've had trouble building locally, let alone running the tests (that's me + IntelliJ rather than the project). Hopefully much more useful information follows.

I am using Docker with a Dockerfile based on this. After it's up and running I run the commands from the SO post and get the errors described. Note that in the SO post a kind soul has already translated my Gremlin console statements into Java, so that may help with test writing. In all honesty the existing tests aren't particularly clear IMHO; I would expect them to be written against some Titan API spec, but hey ho.

Anyway, I'm getting to the bottom of this! Using a bit of a build hack, I replaced the Guava cache with ConcurrentHashMaps and I have the same issue, so it wasn't that (relief!).

Next, I decided to print out the contents of the hash map when things get invalidated/queried, and to record how many instances of the AbstractDynamoDBStore class are created and which locks/releases are recorded against each of them. I did this with a static count, assigning a member variable 'count' in the constructor from the static count while incrementing it (a rough sketch of this instrumentation is included further down).

The results are interesting. I don't know enough about how Titan expects this plugin to work, so I can't comment too strongly, but it seems that the locks are being acquired using one instance and then an attempt is made to release them using another. Perhaps this is a Titan issue after all? I'm really grasping at straws now. If I don't get this resolved soon, we won't be using Titan for our graph DB and will have to work out a different solution for our needs.

Anyway, I've attached the log file. If you search for 'JFS' (my initials), those are 'my' log statements. In the run that produced that log file, the transaction error line was:

60 JFS tx 0x23f807deedf5 already locked key-column (184- 0- 0- 0- 0- 0- 0-128, 80-160) when tx 0x23f8b82a5baa tried to lock

Search for '60 JFS' - these seem to be the edge locks, and '61 JFS' is for the index locks. You can see the 2 lock acquisitions for '60 JFS' from line 23107. Originally I thought this was a single lock request in the original logs, but those don't indicate when a request is made to lock a key that is already locked in the same transaction. These edge locks should (I believe) be removed at line 23449, but only the index locks are being removed, and not the edge lock key?!?!
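
For clarity, the '60 JFS' / '61 JFS' prefixes come from the instance counter mentioned above. The instrumentation was roughly the following (the field names and log wording here are mine):

import java.util.concurrent.atomic.AtomicInteger;

// Rough shape of the instance-counting instrumentation: each store instance is
// stamped with a number taken from a shared counter, so the log lines show which
// instance acquired a lock and which instance tried to release it.
abstract class InstrumentedStore {

    private static final AtomicInteger INSTANCE_COUNTER = new AtomicInteger();

    // '60' and '61' in the attached log are these per-instance values.
    protected final int count = INSTANCE_COUNTER.getAndIncrement();

    protected void logAcquire(String keyColumn, String txId) {
        System.out.println(count + " JFS acquiring lock on " + keyColumn + " in tx " + txId);
    }

    protected void logRelease(String keyColumn, String txId) {
        System.out.println(count + " JFS RELEASING LOCK KC: " + keyColumn + " in tx " + txId);
    }
}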

As you are working on JanusGraph, perhaps you're best positioned to know where the issue lies.

Note: you will see quite a few 'RuntimeException's in the logs. These aren't real exceptions; I was using them as a quick and dirty way to get stack traces. I've also attached my version of AbstractDynamoDBStore for reference.

AbstractDynamoDBStore.java.txt
logOut.bingo.txt

Finally, if someone is able to look into this it would be of great help to understand when a fix may be available (an indication of 'tomorrow', 'next week', 'next month', 'next year', 'never' would be great - but hopefully closer to the left hand side of the scale ;-) ).

@amcp
Contributor

amcp commented Feb 26, 2017

While we investigate this issue, I believe a viable workaround may be available in the form of Titan's built-in locking. Please check out the locking branch and see if you are able to observe the desired behavior. Related to #64
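
For reference, Titan's own lock settings can be tuned when the graph is opened. This is a sketch only, using Titan's standard storage.lock.* configuration options; whether the locking branch needs an additional property to switch off this plugin's native locking is something to verify against that branch, and the exact value formats may differ by Titan version.

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class OpenWithTitanLocking {
    public static TitanGraph open() {
        // storage.lock.* are Titan's generic locking options; treat the values
        // here as illustrative rather than exact.
        return TitanFactory.build()
                .set("storage.backend", "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
                .set("storage.lock.retries", 5)
                .set("storage.lock.wait-time", "300 ms")
                .open();
    }
}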

@amcp amcp self-assigned this Feb 26, 2017
@amcp amcp added this to the janus-release milestone Apr 23, 2017
@amcp
Contributor

amcp commented Apr 23, 2017

@jfstephe I rebased the locking branch on the tip of 1.0.0.

@amcp amcp changed the title from "Titan/DynamoDb doesn't release all acquired locks on commit (via gremlin)" to "Titan DynamoDB does not release all acquired locks on commit (via gremlin)" Apr 23, 2017
@jfstephe
Author

Many thanks! I've been away from this for a while but hope to return to it in the coming couple of months(!). I'll re-run the tests I have when I get a chance.

@amcp amcp added question and removed bug labels May 2, 2017
@amcp
Contributor

amcp commented May 2, 2017

I've merged JanusGraph locking compatibility into the JanusGraph branch. I will investigate potential issues with the Guava and JanusGraph/Titan dependencies separately. Let me know if using JanusGraph locking works for you.

@amcp
Contributor

amcp commented May 2, 2017

I have a repro and will post a PR in a few minutes.

@amcp
Contributor

amcp commented May 2, 2017

#148

@amcp
Contributor

amcp commented May 2, 2017

Found a solution, and demonstrated the repro and a fix, so I am closing this issue.

@amcp amcp closed this as completed May 2, 2017