Skip to content

Use StreamWriter in bulk loader#3542

Merged
manishrjain merged 11 commits intomasterfrom
mrjn/bulk-stream-writer
Jun 10, 2019
Merged

Use StreamWriter in bulk loader#3542
manishrjain merged 11 commits intomasterfrom
mrjn/bulk-stream-writer

Conversation

@manishrjain
Copy link
Copy Markdown
Contributor

@manishrjain manishrjain commented Jun 7, 2019

This PR switched transaction based writes to Badger to StreamWriter.

This PR also refactors bulk loader code as follows:

  • Simplify shuffler and reducer code and merge them into one, i.e. reducer.
  • Remove shuffler.go file.
  • Remove metrics.go file.
  • The channel based heap merge was expensive. Switched that with a simple map entries iterator.

With these changes, the 21M dataset now takes 2 mins to load from the original 3 mins.


This change is Reviewable

@manishrjain manishrjain requested a review from a team June 7, 2019 22:51
Comment thread dgraph/cmd/bulk/loader.go Outdated
Comment thread dgraph/cmd/bulk/count_index.go
Comment thread dgraph/cmd/bulk/reduce.go
Comment thread dgraph/cmd/bulk/reduce.go Outdated
Comment thread dgraph/cmd/bulk/reduce.go
Comment thread dgraph/cmd/bulk/reduce.go Outdated
Copy link
Copy Markdown
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 8 files at r1, 20 of 21 files at r2.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @mangalaman93 and @manishrjain)


dgraph/cmd/bulk/reduce.go, line 47 at r2 (raw file):

func (r *reducer) run() error {
	shardDirs := shardDirs(r.opt.TmpDir)

minor: having the variable have the same name as the function makes this a bit confusing. Maybe the method can be named something like getShardDirs or the variable can be simply named dirs.


dgraph/cmd/bulk/reduce.go, line 68 at r2 (raw file):

			writer := db.NewStreamWriter()
			if err := writer.Prepare(); err != nil {
				panic(err)

why are you using panic here instead of x.Check like in some other places?


dgraph/cmd/bulk/reduce.go, line 213 at r2 (raw file):

		keyChanged := !bytes.Equal(prevKey, me.Key)
		if keyChanged && plistLen > 0 {

I am assuming the keys are returned in order so when the key changes we are done with the count for this key. Is that right?

If so, maybe adding a small comment explaining this invariant would be helpful for future readers of the code.

@manishrjain manishrjain merged commit d336e41 into master Jun 10, 2019
@manishrjain manishrjain deleted the mrjn/bulk-stream-writer branch June 10, 2019 23:36
danielmai pushed a commit that referenced this pull request Jul 15, 2019
Bug fix for bulk loader changes introduced in #3542.

Fixes #3607.

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
This PR switched transaction based writes to Badger to StreamWriter and brings in Badger master into vendor.

This PR also refactors bulk loader code as follows:

- Simplify shuffler and reducer code and merge them into one, i.e. reducer.
- Remove shuffler.go file.
- Remove metrics.go file.
- The channel based heap merge was expensive. Switched that with a simple map entries iterator.

With these changes, the 21M dataset now takes 2 mins to load from the original 3 mins.


Changes:
* Simplified shuffler and reducer code. But, encountered an issue where key sorted order changes due to version append in Badger.
* Working code after StreamWriter integration.
* Vendor Badger in, because it contains fixes to StreamWriter.
* Fix build breakages caused by importing Badger.
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Bug fix for bulk loader changes introduced in dgraph-io#3542.

Fixes dgraph-io#3607.

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants