Data corruption in Blazegraph database #114
Stanislav Malyshev opened the issue:

Blazegraph started producing these errors today for us, during a routine update cycle:
22:57:10.414 [com.bigdata.journal.Journal.executorService1571] ERROR com.bigdata.rdf.spo.SPORelation IP: UA: - java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
at com.bigdata.btree.DefaultEvictionListener.doEviction(DefaultEvictionListener.java:198)
Wrapped by: java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
22:57:12.010 [qtp1321640594-1977] ERROR c.b.r.sail.webapp.BigdataRDFServlet IP:localhost UA:Jetty/9.4.z-SNAPSHOT - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: org.openrdf.repository.RepositoryException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616, query=SPARQL-UPDATE: updateStr=# Clear out of date site links
....
The query is a regular (rather big) SPARQL Update which the Wikidata Query Service Updater sends.

The error seems to be persistent - if I restart, I get this:
00:18:48.261 [com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService25] ERROR com.bigdata.journal.Name2Addr IP: UA: - l.name: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP at com.bigdata.journal.Name2Addr$CommitIndexTask.call(Name2Addr.java:578) Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
at java.util.concurrent.FutureTask.report(FutureTask.java:122) 00:18:48.262 [com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService25] ERROR com.bigdata.journal.Name2Addr IP: UA: - l.name: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
at com.bigdata.journal.Name2Addr$CommitIndexTask.call(Name2Addr.java:578) Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
00:18:49.365 [qtp1321640594-32] ERROR c.b.r.sail.webapp.BigdataRDFServlet IP:localhost UA:Jetty/9.4.z-SNAPSHOT - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}, query=SPARQL-UPDATE: updateStr=# Clear out of date site links
... skipped query ....
com.bigdata.util.concurrent.ExecutionExceptions: 2 errors : [java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP, java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS] at com.bigdata.journal.Name2Addr.handleCommit(Name2Addr.java:860)
Wrapped by: java.lang.RuntimeException: nerrors=2 at com.bigdata.journal.Name2Addr.handleCommit(Name2Addr.java:860) Wrapped by: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3134)
Wrapped by: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42,
checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1080)
Wrapped by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
Wrapped by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC],
lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
and all write operations seem to fail.
Any ideas what's going on and why it could have happened?
Bryan B. Thompson commented:

Adding Martyn.

Bryan
Bryan B. Thompson commented:

Stas,
Here is where the exception is being thrown
https://github.com/blazegraph/database/blob/64bb8ca92eb063d45d68ec920a34b5321b80e2fe/bigdata-core/bigdata/src/java/com/bigdata/io/writecache/WriteCache.java#L978
/*
 * Add metadata for the record so it can be read back from the
 * cache.
 */
final RecordMetadata md = new RecordMetadata(offset, pos,
        datalen, latchedAddr);

if (recordMap.put(Long.valueOf(offset), md) != null) {
    /*
     * Note: This exception indicates that the abort protocol
     * did not reset() the current write cache before new writes
     * were laid down onto the buffer.
     */
    throw new AssertionError(
            "Record exists for offset in cache: offset="
                    + offset);
}
Obviously, this is an unexpected condition. Further, I would expect this
problem to NOT be restart durable.
I suggest:
(a) check the logs after the restart. Are any other errors being
reported?
(b) the initial exception you reported above is for commitCounter=41668719.
Is this still the most recent commit point on the database? (You can use
DumpJournal to peek at the root blocks - and in fact I believe that there
is a servlet API for this as well; a sketch of both follows this message.)
(c) If the commit counter HAS NOT CHANGED, then there might be a "bad
commit" and you MIGHT be able to rollback to the last commit point. This is
not an option to be taken lightly, and we should probably talk before you
attempt this. It basically discards the most recent commit in favor of the
previous commit. There are com.bigdata.journal.Options which will let you
affect this rollback. If there is only one bad commit, then this could get
you back up and running.
(d) If you are still having commits roll through, and it is just some
specific update that fails, then the problem might be triggered by that
update. This would be a best case scenario since it would imply that the
database is still working correctly, but unable to apply a specific update.
In this case, you might be able to break down the update and work around
the problem.
(e) A possible scenario leading to this situation would be a breakdown in
the commit logic such that some data remained buffered and was then flushed
through by the next transaction. This sort of thing could arise from a
concurrent modification of a B+Tree. However, there are no known bugs around
this.
Bryan
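A sketch of the DumpJournal check from point (b) above - the jar name, journal path, and port are illustrative assumptions; check the class's usage output for the exact flags on your version:

  # Dump the root blocks and commit points of an offline journal; add
  # -pages to walk the index pages and surface checksum/corruption errors.
  java -cp blazegraph.jar com.bigdata.journal.DumpJournal -pages /path/to/wikidata.jnl

  # Against a running server, the status servlet exposes similar information:
  curl 'http://localhost:9999/bigdata/status?dumpJournal'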
Stanislav Malyshev commented:

Yes, the second log is after the restart. No writes seem to be successful after that error happened.

dumpJournal says:

There are 2 commit points.
CommitRecord{timestamp=1546901828417, commitCounter=41668719, roots=[-2259782070534405924, -9213540186699333292, -2259782126368980696, -211123412402045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

I assume this means it's still 41668719? A full copy is at https://gist.github.com/smalyshev/b04d1366fdffb0fffc27c4a95901caed

We can copy the DB from another server, but it's a time-consuming thing, so if we could easily restore the DB to working order, that would be preferable. Since we have the last update timestamp in the database, it could then re-apply updates from that point forward.
Bryan B. Thompson commented:

Let's talk in 5m. Ok?
Bryan B. Thompson commented:

What was the last successful transaction?
Bryan B. Thompson commented:

See
https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/journal/Options.java#L433
for the ALTERNATE_ROOT_BLOCK option
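A minimal sketch of enabling it in the journal properties file - the concrete key below is my reading of Options.java and should be verified against the linked source:

  # Open the store against the previous (alternate) root block. Run only a
  # single commit in this mode, then remove the property and restart.
  com.bigdata.journal.AbstractJournal.alternateRootBlock=true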
Stanislav Malyshev commented:

Unfortunately, I don't know the exact content of the last transaction. We do not log update queries, since they are huge and there are a lot of them. It was probably the update before the one that caused the error, but it's very hard now to know what exactly it contained.
Stanislav Malyshev commented:

Got another server crashing with exactly the same symptoms, three hours later... Journal dump is here: https://gist.github.com/smalyshev/bf19825ea984371a9e69666b26f23bd8. The commit counter in the error messages matches the one in root block #1, just like on the first server.
Bryan B. Thompson commented:

You can test for data corruption using DumpJournal. E.g., have it visit
all index pages.

The root blocks are at the head of the file along with a few magic bytes.
I am pretty sure that the file header is fixed, maybe 512 bytes. See the
FileMetadata class and the code paths involving the RWStore there for the
actual file header dimensions. As long as you start up in a read-only
mode, you could use dd to copy those bytes somewhere and then restore them
if you needed to, and see how the data looks with the alternate root block
as well (you might also be able to start up in a read-only mode using the
alternate root block - I do not recall offhand whether that is disallowed).
But if you use the alternate root block and then allow any commits to
occur, then you are locked into the new commit history.

I strongly encourage you to figure out which updates are causing this
issue. You might want to put the servers into a read-only mode and suspend
updates until you figure that out and we figure out the root cause.
Bryan
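A sketch of the dd backup described above - the journal path and the amount copied are assumptions; confirm the real header and root block extent against FileMetadata first:

  # Save the head of the journal (file header plus both root blocks)
  # while the server is stopped or read-only:
  dd if=/srv/wdqs/wikidata.jnl of=/srv/wdqs/rootblocks.bak bs=512 count=128

  # Restore it later without truncating the journal:
  dd if=/srv/wdqs/rootblocks.bak of=/srv/wdqs/wikidata.jnl bs=512 count=128 conv=notrunc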
Bryan B. Thompson commented:

Your other server has four commit points available. This means that it has
not reclaimed some deleted data yet. This difference is likely due to some
read transactions being open while the last updates went through, resulting
in deferred recycling. Thus the binary data will be slightly different on
the two servers (you are handling logical replication at the application
level, not the database level, which means that the recycling patterns will
be different, hence the specific slots in use will not be the same, and the
number of reserved slots could be different as well). Since the data on
the disk are not identical, but we are hitting the same issue at the same
commit point, this implies that the problematic code path might be specific
to some update, which we would really want to identify.
Do both servers see the same sequence of logical updates?
It could also be some internal limit not previously identified.
Bryan
Stanislav Malyshev commented:

> Do both servers see the same sequence of logical updates?

Not exactly. The update sequence is the same (it comes from the same event channel) but the starting points may differ, so for example one server can have two batches of updates 1,2,3 and 4,5,6 and another can have 1,2 and 3,4,5,6, etc. The end result should be the same (almost, since we also store update timestamps per entry, which may differ slightly) but the sequence may differ.

> It could also be some internal limit not previously identified.

File sizes for both "bad" servers are 806G. But we have other servers with the same file size, and they are just fine.

> Your other server has four commit points available. This means that it has not reclaimed some deleted data yet.

When should this recycling be happening? It has been hours and a couple of restarts (so no open transactions are possible) and there are still four commit points there. I checked on other active servers and there is a wide variety of commit point counts there - from 2 to 25. Not sure what that means.

> I strongly encourage you to figure out which updates are causing this issue.

That might not be easy, given the above - I can see at which point the failure happened, i.e. I could guess the first item in the update, but I can't know the exact content - there could be more or fewer items in the update, depending on various random network events. Also, since the failures happened on two servers with 5 hours' difference, it can't be the same exact update - the probability that the same exact items have been updated in the same exact sequence with this interval is vanishingly small. I'd rather suspect some kind of limit or overflow in some storage or cache or something like that, or some "magic" value that, when it's randomly hit, produces the failure.
Bryan B. Thompson commented:

Can you enable update logging on another server?

Recycling is how the RWStore puts slots from deleted data back into play.
I rather doubt there is an issue with the recycling. I was pointing out the
number of commit points available mainly in terms of the options they might
represent.

Recycling will not occur until a commit goes through. So it will remain at
4 commit points until the server gets beyond this update issue.
Stanislav Malyshev commented:

Yes, but only for a short time (it generates a ton of data) and probably not today (since it needs access which I don't have to enable, and the people who have access are offline now - but we could do it tomorrow morning if needed). What would you like to log, though? The working servers are way past the point where the failing servers failed now (and they didn't fail, obviously), so their data would be useless. I could of course log updates for the failing servers, once we restore them, and see if the update reproduces the problem again, but as I said, I think it's very unlikely the content of the update matters too much, since the two servers failed on different updates with a 5 hour delta.
Bryan B. Thompson commented:

Stas,
Here is a summary of our call this morning. First, the underlying root
cause is that the maximum number of allocators has been reached. This limit
is 256 * 1024 and is based on a 19-bit address as documented at the bottom
of [1]. Each allocator manages a number of allocation slots (8k slots).
For a given allocator, all those slots are the same size. Based on the dump
journal reports, the vast majority of these allocators are for 64 and 128
byte allocation slots. Those allocation slots are being used for the
Id2Term index to store the lexical form of the dictionary entry outside of
the leaf pages as separate records on the journal. I have outlined several
things which you can do to address this problem, but these will require a
reload of the data.
1. Inlining - is there anything else that can be inlined? Anything inlined
will not be in the dictionary. A smaller dictionary will improve write
performance.
2. Ensure ID2TERM is NOT using raw records. This is extremely important.
See
https://github.com/blazegraph/database/blob/BLAZEGRAPH_RELEASE_2_1_5_RC/bigdata-core/bigdata-rdf/src/java/com/bigdata/rdf/lexicon/LexiconRelation.java#L1351
. This will radically reduce the #of 64 and 128 byte allocation slots in
use since the data will be stored directly in the leaves of the ID2TERM
index instead. This means that you will be able to have significantly more
data in the backing RWStore instance since the allocators will be for
larger slots corresponding to index pages rather than small slots
corresponding to individual lexical items.
3. Consider using INLINE_TEXT_LITERALS to force the inlining of strings
into the statement indices. Inlined strings will not be entered into the
dictionary at all (which will improve write performance but will also
result in a somewhat larger footprint on the disk). Instead, the inlined
strings will appear directly in the statement indices. Note that strings
longer than 256 bytes (by default) go onto the BLOBS index rather than the
TERM2ID and ID2TERM indices. See the following options in
AbstractTripleStore.Options: {INLINE_TEXT_LITERALS, MAX_INLINE_TEXT_LENGTH,
and BLOBS_THRESHOLD}; a property sketch follows this message. In many ways, this has the same effect as (2), but it
also avoids dictionary entries entirely for smaller strings.
4. Separate the two large data sets (wikidata and categories) into
different Blazegraph instances (optional, adds flexibility)
5. Ensure 32-bit pointers in Java (slightly less than 32G heaps) - this
will improve the Java memory performance.
Thanks,
Bryan
[1] https://wiki.blazegraph.com/wiki/index.php/FixedAllocators
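Back-of-the-envelope, from the figures above: 256 * 1024 = 262,144 allocators times 8,192 slots each is 2,147,483,648 (2^31) allocation slots in total - and the failing offset in the original report, 2147483616, sits just below that bound, which would be consistent with an allocator/address limit being hit. At 64-128 bytes per slot, that caps the small-record capacity at roughly 137-275 GB.

A sketch of what point (3) might look like as namespace properties - the property keys are my reading of AbstractTripleStore.Options and should be verified against the source before a reload:

  # Inline short literals directly into the statement indices:
  com.bigdata.rdf.store.AbstractTripleStore.inlineTextLiterals=true
  com.bigdata.rdf.store.AbstractTripleStore.maxInlineTextLength=256
  # Literals longer than this threshold go onto the BLOBS index:
  com.bigdata.rdf.store.AbstractTripleStore.blobsThreshold=256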
Stanislav Malyshev commented:

I tried to remove the categories namespace, and it doesn't seem to have changed anything for the allocators.

Before: https://gist.github.com/smalyshev/19095c87b600900811e7f87d22ea250b
After: https://gist.github.com/smalyshev/4d99b5258542e0705a335b52c7805d50

I wonder why that is - does anything else need to be done?
Bryan B. Thompson commented:

The #of allocators will not change. What will change is that you will have
free slots in the existing allocators.

The database needs to roll through a commit before it can free the slots.
I do see the #of recycled slots going up. I suggest trying a small commit
or two (insert and then delete a single triple) to ensure that the recycler
collects all the newly freed slots; a sketch follows this message.
Bryan
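A sketch of such a probe commit via the REST API - the endpoint and namespace are illustrative assumptions; each POST rolls through one commit:

  # First commit: insert a throwaway triple.
  curl -X POST 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
       --data-urlencode 'update=INSERT DATA { <http://example.org/probe> <http://example.org/p> "x" }'

  # Second commit: delete it again.
  curl -X POST 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
       --data-urlencode 'update=DELETE DATA { <http://example.org/probe> <http://example.org/p> "x" }'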
Simon Gelfand commented:

Were you able to find a way to roll back a bad commit? I have a fresh server that I am importing a Wikidata base into, and midway it stopped with the error:

Caused by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -668674172594945626: lastRootBlock=rootBlock{ rootBlock=1, challisField=835, version=3, nextOffset=62398251232785478, localTime=1661485728195 [Friday, August 26, 2022 at 3:48:48 AM Coordinated Universal Time], firstCommitTime=1658240159557 [Tuesday, July 19, 2022 at 2:15:59 PM Coordinated Universal Time], lastCommitTime=1661485714508 [Friday, August 26, 2022 at 3:48:34 AM Coordinated Universal Time], commitCounter=835, commitRecordAddr={off=NATIVE:-42932493,len=422}, commitRecordIndexAddr={off=NATIVE:-158345759,len=220}, blockSequence=27784, quorumToken=-1, metaBitsAddr=47863639532635860, metaStartAddr=15961497, storeType=RW, uuid=f05859c7-286f-4e8e-8a91-593893119dcd, offsetBits=42, checksum=-586797382, createTime=1658240158542 [Tuesday, July 19, 2022 at 2:15:58 PM Coordinated Universal Time], closeTime=0}

I am now stuck in this position and can't continue.
Bryan B. Thompson commented:

Look here:
https://blazegraph.com/database/apidocs/com/bigdata/journal/Options.html#ALTERNATE_ROOT_BLOCK
There are a few relevant options. You need to roll through a single commit
after doing something like this to have both root blocks be valid again.
E.g., a small write on the database, commit, then change the options back to
what they were and restart; a step-by-step sketch follows this message.
You can read about the commit protocol on the wiki under the RWStore page.
https://github.com/blazegraph/database/wiki/RWStore
Bryan
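A sketch of that sequence end to end, using the same assumed property key and endpoint as the sketches above (i.e. assumptions to verify against your installation):

  # 1. Stop the server, then enable the alternate root block for one restart:
  #      com.bigdata.journal.AbstractJournal.alternateRootBlock=true
  # 2. Start the server and roll through exactly one small commit, e.g.:
  curl -X POST 'http://localhost:9999/bigdata/namespace/kb/sparql' \
       --data-urlencode 'update=INSERT DATA { <http://example.org/probe> <http://example.org/p> "x" }'
  # 3. Stop the server, remove the property again, and restart normally;
  #    both root blocks are valid again after that commit.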
Simon Gelfand commented:

So I did the following (maybe it was incorrect):

I stopped the Blazegraph server. Added this to the config:

com.bigdata.journal.Options=ALTERNATE_ROOT_BLOCK

Started the server. Tried to import one file (I didn't know the syntax for update/commit). This failed in the end, after a few hours, with the same error as before:

Caused by: java.lang.RuntimeException: Problem with entry at -668674172594945626: lastRootBlock=rootBlock{ rootBlock=1, challisField=835, version=3, nextOffset=62398251232785478, localTime=1661485728195 [Friday, August 26, 2022 at 3:48:48 AM Coordinated Universal Time], firstCommitTime=1658240159557 [Tuesday, July 19, 2022 at 2:15:59 PM Coordinated Universal Time], lastCommitTime=1661485714508 [Friday, August 26, 2022 at 3:48:34 AM Coordinated Universal Time], commitCounter=835, commitRecordAddr={off=NATIVE:-42932493,len=422}, commitRecordIndexAddr={off=NATIVE:-158345759,len=220}, blockSequence=27784, quorumToken=-1, metaBitsAddr=47863639532635860, metaStartAddr=15961497, storeType=RW, uuid=f05859c7-286f-4e8e-8a91-593893119dcd, offsetBits=42, checksum=-586797382, createTime=1658240158542 [Tuesday, July 19, 2022 at 2:15:58 PM Coordinated Universal Time], closeTime=0}

Perhaps I misunderstood what to do.
Bryan B. Thompson commented:

1. The alternate root block should only be used for a single commit.
2. I am not sure how the Wikidata import breaks down into commits.
3. There is something pretty odd going on.
4. Can you provide a full backtrace for those error messages?

Overall, I might suspect a flaky hardware system. But I would like to see
the full backtrace and see if that provides any clues.
Bryan
Simon Gelfand commented:

You can see a larger log: https://pastebin.com/JPCiBEvk

Do you have an example somewhere of a simple update/commit I could run? With Wikidata, from what I was seeing, it was running all the updates/inserts and at the end tried to commit the changes.

Regarding hardware, I don't think it's that - it's a new setup. This is the 3rd time it's happening to me in an attempt to import the Wikidata dump (it's huge and takes a long time).
Bryan B. Thompson commented:

Ok. The root cause in that stack trace is a problem when trying to process
deferred frees (recycling). I've clipped that part of the trace below.
This is not a root block problem (for example, it is not a checksum issue
with the root block). A copy of old root blocks is kept for the recycler,
and they are eventually pruned out along with the list of addresses that
were recycled in the RWStore.

Your first post was at commit counter 833, if I recall the error message.
This one is at 835. But if these are not the root blocks used for commit,
but instead root blocks used by the recycler, then I really can't say
anything about whether or not it is making progress.

I cannot say offhand how things got into this state, but I also can't see
a way forward. I would advise reinstalling the Wikidata setup from
scratch, making sure that you clean out the old files and making sure that
there is not a 2nd Blazegraph process running (this should be disallowed by
file locks). Then, if you hit any issue again, please make sure to post the
first error.

I would also be happy to take a peek at the first error message from the
logs if you can find it. We might be able to find some indication of
what went wrong. The log you posted clearly indicates an issue within the
logged deferred frees and an inability to recycle those allocations.
com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:5411)
... 21 more
Caused by: java.lang.IllegalStateException: Error reading from WriteCache addr: 827585986560 length: 12, writeCacheDebug: No WriteCache debug info
Bryan
Simon Gelfand commented:

Ok... sad to hear :( It was running smoothly for 5 weeks and then crashed. I will create a fresh server - just to make sure no "junk" or "bad sectors" are left over.
Bryan B. Thompson commented:

If you do have the earliest error message in the log, I am happy to look at
that. I've also copied Martyn Cutcher. He might be able to glean
something from the message:
com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:5411)
... 21 more
Caused by: java.lang.IllegalStateException: Error reading from WriteCache addr: 827585986560 length: 12, writeCacheDebug: No WriteCache debug info
Bryan
Simon Gelfand commented:

The problem is that the script continues running until you stop it, so it kept going, failing over and over again.