Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic: failed to read value pointer from vlog file #926

Closed
256dpi opened this issue Jul 11, 2019 · 14 comments
Closed

panic: failed to read value pointer from vlog file #926

256dpi opened this issue Jul 11, 2019 · 14 comments
Assignees
Labels
area/crash This issue causes a panic or some other of exception that causes a crash. kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) status/confirmed The issue has been triaged by still not reproduced.

Comments

@256dpi
Copy link

256dpi commented Jul 11, 2019

I just ran into this when I restarted a service a bunch of times:

panic: failed to read value pointer from vlog file: {Fid:143 Len:277 Offset:337322920}: Unable to find log file. Please retry

Since I'm running badger with SyncWrites=false is it possible that RunValueLogGC deleted a vlog that still was referenced by a persisted table?

I'm using go1.12 and badger v1.6.0.

@jarifibrahim jarifibrahim added area/crash This issue causes a panic or some other of exception that causes a crash. kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) status/confirmed The issue has been triaged by still not reproduced. labels Jul 11, 2019
@jarifibrahim jarifibrahim self-assigned this Jul 11, 2019
@jarifibrahim
Copy link
Contributor

@256dpi Can you check if your badger directory has vlog file with id 143? Also can you share your badger directory? Did the DB crash by any chance?

Since I'm running badger with SyncWrites=false is it possible that RunValueLogGC deleted a vlog that still was referenced by a persisted table?

This can't be the case. We fix the reference before removing the vlog fie.
Here's how GC works

  1. Find list of entries to move.
  2. Insert the new entries into DB (these will be pushed to LSM tree and vlog).
  3. Delete the old vlog file.

@256dpi
Copy link
Author

256dpi commented Jul 11, 2019

I wasn't able to recover any data as I purged the directory right away to get the service running again... Next time I'll do that first! Here are the logs tough:

I 2019-07-10T15:01:30 badger: Flushing memtable, mt.size=65796570 size of flushChan: 0
I 2019-07-10T15:01:30 badger: Storing value log head: {Fid:146 Len:49 Offset:128069992}
I 2019-07-10T15:26:44 badger: Storing value log head: {Fid:148 Len:49 Offset:15097884}
I 2019-07-10T15:26:44 badger: Flushing memtable, mt.size=67109213 size of flushChan: 0
I 2019-07-10T15:26:46 badger: Got compaction priority: {level:0 score:1 dropPrefix:[]}
I 2019-07-10T15:26:46 badger: Running for level: 0
I 2019-07-10T15:26:49 badger: LOG Compact. Added 2056568 keys. Skipped 172200 keys. Iteration took: 2.339905392s
I 2019-07-10T15:27:17 badger: LOG Compact. Added 2499284 keys. Skipped 0 keys. Iteration took: 28.577111521s
E 2019-07-10T15:28:18 panic: send on closed channel
--- CRASH => RESTART            
I 2019-07-10T15:28:22 badger: 11 tables out of 15 opened in 3.212s
I 2019-07-10T15:28:23 badger: All 15 tables opened in 4.151s
I 2019-07-10T15:28:23 badger: Replaying file id: 148 at offset: 15097933
I 2019-07-10T15:28:23 badger: Replay took: 51.488701ms
E 2019-07-10T15:28:23 panic: failed to read value pointer from vlog file: {Fid:143 Len:277 Offset:337322920}: Unable to find log file. Please retry

As you can see the process crashed due to a send on closed channel which is a bug in my application.

Again I was running the database with SyncWrites=false. One goroutine was running every second and calling db.RunValueLogGC(0.75) another one was also running every second and calling db.Sync().

About the environment: The process was running in GKE and writing to an attached SSD PD.

@jarifibrahim
Copy link
Contributor

I have a wild guess about what might have happened.

  1. The discard map is stored in a file say 1.vlog which is on disk.
  2. Vlog GC reclaims space by removing 1.vlog and inserting all the valid entries in 2.vlog. But these entries are not yet flushed to the disk since syncWrites=false.
  3. The DB crashes at this point.
  4. On Re-opening, the DB looks for discard stats. Badger returns the old discard map which points to a vlog file (1.vlog) which was deleted by GC.
  5. Badger crashes since this is not a valid file.

@256dpi
Copy link
Author

256dpi commented Jul 12, 2019

@jarifibrahim It happened again! This time even with SyncWrites=true!! I was able to get all the data and logs. I'm happy to share them with you privately as the db contains customer data.

@jarifibrahim
Copy link
Contributor

jarifibrahim commented Jul 12, 2019

Thanks @256dpi! Can you please send the badger directory to ibrahim@dgraph.io?
I would also need to know the options you set while opening DB.

@jarifibrahim
Copy link
Contributor

jarifibrahim commented Jul 12, 2019

This issue does not depend on the value of syncWrites. I was able to reproduce this issue using the following test (with some modification to the value.go file so that GC works)

func TestCrash(t *testing.T) {
	dir, err := ioutil.TempDir("", "badger-test")
	require.NoError(t, err, "temp dir for badger count not be created")
	fmt.Println(dir)
	ops := getTestOptions(dir)
	ops.ValueLogMaxEntries = 1
	db, err := Open(ops)
	require.NoError(t, err)

	entries := []*Entry{{
		Key:   y.KeyWithTs(lfDiscardStatsKey, 1),
		Value: make([]byte, ops.ValueThreshold+100),
	}}
	req, err := db.sendToWriteCh(entries)
	if err != nil {
		panic(err)
	}
	req.Wait()
	db.Update(func(txn *Txn) error {
		e := NewEntry([]byte("f"), []byte("1")).WithMeta(bitFinTxn)
		require.NoError(t, txn.SetEntry(e))
		return nil
	})
	db.Update(func(txn *Txn) error {
		e := NewEntry([]byte("ff"), []byte("1")).WithMeta(bitFinTxn)
		require.NoError(t, txn.SetEntry(e))
		return nil
	})

	tr := trace.New("Badger.ValueLog", "GC")
	lf := db.vlog.filesMap[0]
	require.NoError(t, db.vlog.rewrite(lf, tr))
	require.NoError(t, db.Close())

	db, err = Open(ops)
	require.NoError(t, err)
	require.NoError(t, db.Close())
}

I'm still trying to figure out why the discard key is pointing to the wrong value log file.

@jarifibrahim
Copy link
Contributor

This has been fixed via #929 . Closing it.

@256dpi
Copy link
Author

256dpi commented Jul 19, 2019

Awesome! Thanks!

@cedricfung
Copy link

I'm using v2.0.0-rc2 and has the same bug

@jarifibrahim
Copy link
Contributor

@cedricfung The fix wasn't part of v2.0.0-rc2 release. Please use the master branch for now.

You might need to backup and restore your current badger directory to use it with the latest master.
If you want to use your current badger directory with the latest master -

  1. Create a backup (you can use the command line).
  2. Update badger to latest master.
  3. Restore the backup using the latest master.

@cedricfung
Copy link

I ported the fix to the release 2.0 branch, and it works.

@0xc0392b
Copy link

I'm having the same problem with v1.0.16 after a crash. How can I recover the data? I just need Dgraph to start so that I can run a backup - which I will then restore to a :master version.

@jgoodall
Copy link

Similar to @williamsandytoes - same problem with dgraph 1.0.17. Is there any way to recover data?

@cedricfung
Copy link

In value.go, make the populateDiscardStats() do nothing.

func (vlog *valueLog) populateDiscardStats() error {
	return nil
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/crash This issue causes a panic or some other of exception that causes a crash. kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) status/confirmed The issue has been triaged by still not reproduced.
Development

No branches or pull requests

5 participants