Data corruption in Blazegraph database #114

Open
smalyshev opened this issue Jan 8, 2019 · 26 comments

Comments

@smalyshev
Contributor

Blazegraph started producing these errors for us today, during a routine update cycle:

22:57:10.414 [com.bigdata.journal.Journal.executorService1571] ERROR com.bigdata.rdf.spo.SPORelation IP: UA: - java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.btree.DefaultEvictionListener.doEviction(DefaultEvictionListener.java:198)
Wrapped by: java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
22:57:12.010 [qtp1321640594-1977] ERROR c.b.r.sail.webapp.BigdataRDFServlet IP:localhost UA:Jetty/9.4.z-SNAPSHOT - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: org.openrdf.repository.RepositoryException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616, query=SPARQL-UPDATE: updateStr=# Clear out of date site links
....

The query is a regular (rather big) SPARQL Update that the Wikidata Query Service Updater sends.
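
For context, these updater batches are multi-part DELETE/INSERT updates. A heavily simplified sketch of their general shape (the entity, prefixes, and triples are illustrative assumptions, not the actual failing update):

    PREFIX schema: <http://schema.org/>
    PREFIX wd: <http://www.wikidata.org/entity/>

    # Clear out-of-date site links for a (hypothetical) batch of changed entities...
    DELETE { ?sitelink ?p ?o }
    WHERE {
      VALUES ?entity { wd:Q42 }          # illustrative batch of one entity
      ?sitelink schema:about ?entity ;
                ?p ?o .
    };
    # ...then write the fresh data for the same entities.
    INSERT DATA {
      <https://en.wikipedia.org/wiki/Douglas_Adams> schema:about wd:Q42 .
    }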

The error seems to be persistent - if I restart, I get this:

00:18:48.261 [com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService25] ERROR com.bigdata.journal.Name2Addr IP: UA: - l.name: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
        at com.bigdata.journal.Name2Addr$CommitIndexTask.call(Name2Addr.java:578)
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
00:18:48.262 [com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService25] ERROR com.bigdata.journal.Name2Addr IP: UA: - l.name: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
        at com.bigdata.journal.Name2Addr$CommitIndexTask.call(Name2Addr.java:578)
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
00:18:49.365 [qtp1321640594-32] ERROR c.b.r.sail.webapp.BigdataRDFServlet IP:localhost UA:Jetty/9.4.z-SNAPSHOT - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}, query=SPARQL-UPDATE: updateStr=# Clear out of date site links
... skipped query ....
com.bigdata.util.concurrent.ExecutionExceptions: 2 errors : [java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP, java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS]
        at com.bigdata.journal.Name2Addr.handleCommit(Name2Addr.java:860)
Wrapped by: java.lang.RuntimeException: nerrors=2
        at com.bigdata.journal.Name2Addr.handleCommit(Name2Addr.java:860)
Wrapped by: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
        at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3134)
Wrapped by: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
        at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1080)
Wrapped by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
Wrapped by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)

and all write operations seem to fail.

Any ideas what's going on and why it could have happened?

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@smalyshev
Contributor Author

(a) check the logs after the restart. Are any other errors being reported?

Yes, the second log is after the restart. No writes seem to be successful after that error happened.

the initial exception you reported above is for commitCounter=41668719. Is this still the most recent commit point on the database? (You can use DumpJournal to peek at the root blocks - and in fact I believe that there is a servlet API for this as well.)
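
(For reference, a typical DumpJournal invocation looks something like the sketch below; the jar name, paths, and the status-servlet URL are assumptions about this deployment, not commands from the thread:)

    # Offline, against the journal file directly (prints the root blocks / commit points):
    java -cp blazegraph.jar com.bigdata.journal.DumpJournal /path/to/wikidata.jnl

    # Or, while the server is running, via the status servlet (assumed endpoint):
    curl 'http://localhost:9999/bigdata/status?dumpJournal'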

dumpJournal says:

There are 2 commit points.
CommitRecord{timestamp=1546901828417, commitCounter=41668719, roots=[-2259782070534405924, -9213540186699333292, -2259782126368980696, -211123412402045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

I assume this means it's still 41668719? Full copy is at https://gist.github.com/smalyshev/b04d1366fdffb0fffc27c4a95901caed

We can copy the DB from another server, but that is time-consuming, so if we could easily restore the DB to working order, it'd be preferable. Since we store the last-update timestamp in the database, the updater could then re-apply updates from that point forward.
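
(The last-update timestamp mentioned here is itself queryable; a sketch, assuming the WDQS convention of recording it as schema:dateModified on the wiki's URI:)

    PREFIX schema: <http://schema.org/>
    # Read back the most recent update time recorded by the updater.
    SELECT ?lastUpdate WHERE {
      <http://www.wikidata.org> schema:dateModified ?lastUpdate
    }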

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@smalyshev
Contributor Author

What was the last successful transaction?

Unfortunately, I don't know the exact content of the last transaction. We do not log update queries, since they are huge and there are a lot of them. It was probably the update before the one that caused the error, but it's very hard now to know exactly what it contained.

@smalyshev
Contributor Author

Got another server crashing with exactly the same symptoms, three hours later... Journal dump is here: https://gist.github.com/smalyshev/bf19825ea984371a9e69666b26f23bd8. Commit counter in error messages matches the one in root block #1, just like in the first server.

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@smalyshev
Contributor Author

smalyshev commented Jan 8, 2019

Do both servers see the same sequence of logical updates?

Not exactly. The update sequence is the same (it comes from the same event channel), but the starting points may differ: for example, one server can have two batches of updates 1,2,3 and 4,5,6 while another has 1,2 and 3,4,5,6, etc. The end result should be the same (almost - we also store update timestamps per entry, which may differ slightly), but the batching may differ.

It could also be some internal limit not previously identified.

File sizes for both "bad" servers are 806G. But we have other servers with the same file size, and they are just fine.

Your other server has four commit points available. This means that it has not reclaimed some deleted data yet. This difference is likely due to some read transactions being open while the last updates went through, resulting in deferred recycling.

When should this recycling be happening? It has been hours and a couple of restarts (so no open transactions are possible), and there are still four commit points there. I checked other active servers, and there is a wide variety of commit-point counts there - from 2 to 25. Not sure what that means.

I strongly encourage you to figure out what updates are causing this issue.

That might not be easy, given the above. I can see at which point the failure happened - i.e. I could guess the first item in the update - but I can't know the exact content: there could be more or fewer items in the update, depending on various random network events. Also, since the failures happened on two servers 5 hours apart, it can't be the exact same update - the probability that the exact same items were updated in the exact same sequence with this interval is vanishingly small. I'd rather suspect some kind of limit or overflow in some storage or cache, or some "magic" value that produces the failure when it's randomly hit.

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@smalyshev
Contributor Author

Can you enable update logging on another server?

Yes, but only for a short time (it generates a ton of data), and probably not today: enabling it needs access that I don't have, and the people who have it are offline now - but we could do it tomorrow morning if needed. What would you like to log, though? The working servers are now way past the point where the failing servers failed (and they didn't fail, obviously), so their data would be useless. I could of course log updates for the failing servers once we restore them, and see whether the update reproduces the problem, but as I said, I think it's very unlikely that the content of the update matters much, since two servers failed on different updates with a 5-hour delta.

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@smalyshev
Contributor Author

I tried to remove the categories namespace, and it doesn't seem to have changed anything for the allocators.
Before: https://gist.github.com/smalyshev/19095c87b600900811e7f87d22ea250b
After: https://gist.github.com/smalyshev/4d99b5258542e0705a335b52c7805d50
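
(Assuming the namespace was dropped through the multitenancy REST API, the operation would have been something like the following; the endpoint URL is an assumption about this deployment:)

    # Drop the 'categories' namespace via the REST API.
    curl -X DELETE 'http://localhost:9999/bigdata/namespace/categories'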

I wonder why that is - does anything else need to be done?

@thompsonbry
Contributor

thompsonbry commented Jan 8, 2019 via email

@webshark

webshark commented Sep 1, 2022

Were you able to find a way to roll back a bad commit? I have a fresh server that I am importing a Wikidata dump into, and midway through it stopped with this error:

Caused by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -668674172594945626: lastRootBlock=rootBlock{ rootBlock=1, challisField=835, version=3, nextOffset=62398251232785478, localTime=1661485728195 [Friday, August 26, 2022 at 3:48:48 AM Coordinated Universal Time], firstCommitTime=1658240159557 [Tuesday, July 19, 2022 at 2:15:59 PM Coordinated Universal Time], lastCommitTime=1661485714508 [Friday, August 26, 2022 at 3:48:34 AM Coordinated Universal Time], commitCounter=835, commitRecordAddr={off=NATIVE:-42932493,len=422}, commitRecordIndexAddr={off=NATIVE:-158345759,len=220}, blockSequence=27784, quorumToken=-1, metaBitsAddr=47863639532635860, metaStartAddr=15961497, storeType=RW, uuid=f05859c7-286f-4e8e-8a91-593893119dcd, offsetBits=42, checksum=-586797382, createTime=1658240158542 [Tuesday, July 19, 2022 at 2:15:58 PM Coordinated Universal Time], closeTime=0}

I am now stuck in this position and can't continue.

@thompsonbry
Contributor

thompsonbry commented Sep 1, 2022 via email

@webshark

webshark commented Sep 2, 2022

So I did the following (maybe it was incorrect):

I stopped the Blazegraph server.
Added this to the config:
com.bigdata.journal.Options=ALTERNATE_ROOT_BLOCK
Started the server.
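
(A note on syntax: judging from how the constants in com.bigdata.journal.Options are defined, the property key in the line above is probably not what the journal reads; the key is likely the constant's value, i.e. the AbstractJournal class name plus ".alternateRootBlock". A sketch of what the properties line might need to be - this exact key is an assumption, not confirmed in the thread:)

    # Assumption: com.bigdata.journal.Options.ALTERNATE_ROOT_BLOCK expands to this key.
    com.bigdata.journal.AbstractJournal.alternateRootBlock=true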

Then I tried to import one file (I didn't know the syntax for update/commit).
This failed in the end, after a few hours, with the same error as before:

Caused by: java.lang.RuntimeException: Problem with entry at -668674172594945626: lastRootBlock=rootBlock{ rootBlock=1, challisField=835, version=3, nextOffset=62398251232785478, localTime=1661485728195 [Friday, August 26, 2022 at 3:48:48 AM Coordinated Universal Time], firstCommitTime=1658240159557 [Tuesday, July 19, 2022 at 2:15:59 PM Coordinated Universal Time], lastCommitTime=1661485714508 [Friday, August 26, 2022 at 3:48:34 AM Coordinated Universal Time], commitCounter=835, commitRecordAddr={off=NATIVE:-42932493,len=422}, commitRecordIndexAddr={off=NATIVE:-158345759,len=220}, blockSequence=27784, quorumToken=-1, metaBitsAddr=47863639532635860, metaStartAddr=15961497, storeType=RW, uuid=f05859c7-286f-4e8e-8a91-593893119dcd, offsetBits=42, checksum=-586797382, createTime=1658240158542 [Tuesday, July 19, 2022 at 2:15:58 PM Coordinated Universal Time], closeTime=0}

Perhaps I misunderstood what to do.

@thompsonbry
Contributor

thompsonbry commented Sep 2, 2022 via email

@webshark

webshark commented Sep 2, 2022

You can see a larger log: https://pastebin.com/JPCiBEvk

Do you have an example somewhere of a simple update/commit I could run?

With Wikidata, from what I was seeing, it was running all the updates/inserts and at the end tried to commit the changes.

Regarding hardware, I don't think it's that - it's a new setup.
This is the 3rd time it's happened to me in an attempt to import the Wikidata dump (it's huge and takes a long time).
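
(For reference, a minimal "update/commit" of the kind asked about above would be a tiny SPARQL UPDATE POSTed to the SPARQL endpoint, which forces a full write/commit cycle; the port and endpoint path are assumptions about this deployment:)

    # POST a tiny SPARQL UPDATE; a successful response means a commit went through.
    curl -X POST 'http://localhost:9999/bigdata/sparql' \
         --data-urlencode 'update=INSERT DATA { <http://example.org/s> <http://example.org/p> "test" }'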

@thompsonbry
Contributor

thompsonbry commented Sep 2, 2022 via email

@webshark

webshark commented Sep 2, 2022

OK... sad to hear :( It was running smoothly for 5 weeks and then crashed.
I will create a fresh server, just to make sure no "junk" or bad sectors are left over.

@thompsonbry
Contributor

thompsonbry commented Sep 2, 2022 via email

@thompsonbry
Contributor

thompsonbry commented Sep 2, 2022 via email

@webshark

webshark commented Sep 2, 2022

The problem is that the script continues running until you stop it, so it kept going, failing over and over again.
