Bulk Loader: Fix memory usage by JSON parser #3794

manishrjain · 2019-08-12T19:53:20Z

The current JSON chunker/parser creates a slice of NQuads for each JSON map. In a dataset we were working with, this produced more than 5M NQuads and caused the bulk loader to go OOM. This PR fixes that by introducing an NQuadBuffer, which accumulates a slice of batchSize NQuads and sends it over a channel. The caller can then continuously consume from that channel. This avoids creating a slice of >5M entries and the very high memory usage that comes with it.
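For readers who want to see the shape of the idea, here is a minimal sketch of a batching buffer along the lines described above. It is not the code from this PR: the `NQuad` struct is a hypothetical stand-in for `api.NQuad`, the channel capacity of 10 is an arbitrary choice, and `batchSizes` is a helper invented only for the demonstration.

```go
package main

import "fmt"

// NQuad is a hypothetical stand-in for api.NQuad.
type NQuad struct {
	Subject, Predicate, ObjectValue string
}

// NQuadBuffer collects NQuads and sends them over a channel in batches.
type NQuadBuffer struct {
	batchSize int
	nquads    []*NQuad
	nqCh      chan []*NQuad
}

// NewNQuadBuffer returns a buffer that sends batchSize NQuads per channel push.
// With a non-positive batchSize, everything is sent as one batch on Flush.
func NewNQuadBuffer(batchSize int) *NQuadBuffer {
	return &NQuadBuffer{batchSize: batchSize, nqCh: make(chan []*NQuad, 10)}
}

// Ch returns the channel from which batches can be consumed.
func (buf *NQuadBuffer) Ch() <-chan []*NQuad { return buf.nqCh }

// Push appends NQuads and sends a batch whenever batchSize entries are pending.
func (buf *NQuadBuffer) Push(nqs ...*NQuad) {
	for _, nq := range nqs {
		buf.nquads = append(buf.nquads, nq)
		if buf.batchSize > 0 && len(buf.nquads) >= buf.batchSize {
			buf.nqCh <- buf.nquads
			buf.nquads = make([]*NQuad, 0, buf.batchSize)
		}
	}
}

// Flush sends any remaining NQuads and closes the channel.
func (buf *NQuadBuffer) Flush() {
	if len(buf.nquads) > 0 {
		buf.nqCh <- buf.nquads
		buf.nquads = nil
	}
	close(buf.nqCh)
}

// batchSizes pushes total NQuads from a producer goroutine and returns the
// size of every batch the consumer receives.
func batchSizes(total, batchSize int) []int {
	buf := NewNQuadBuffer(batchSize)
	go func() {
		for i := 0; i < total; i++ {
			buf.Push(&NQuad{Subject: fmt.Sprintf("_:n%d", i), Predicate: "name", ObjectValue: "x"})
		}
		buf.Flush()
	}()
	var sizes []int
	for batch := range buf.Ch() {
		sizes = append(sizes, len(batch))
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes(10, 4))  // [4 4 2]
	fmt.Println(batchSizes(10, -1)) // [10]
}
```

Because the consumer drains the channel while the producer is still parsing, at most a few batches are alive at once, rather than one slice holding every NQuad.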

Also refactored the code to bring both the JSON and RDF parsers to the root level of the chunker package, to allow better code sharing with the new buffer.


pullrequest

✅ A review job has been created and sent to the PullRequest network.

golangcibot · 2019-08-12T19:54:59Z

chunker/json_parser.go

+	return buf
+}
+
+func (buf *NQuadBuffer) Ch() <-chan []*api.NQuad {


receiver name buf should be consistent with previous receiver name nqs for NQuadBuffer (from golint)

golangcibot · 2019-08-12T19:54:59Z

chunker/json_parser.go

+	return buf.nqCh
+}
+
+func (buf *NQuadBuffer) Push(nqs ...*api.NQuad) {


receiver name buf should be consistent with previous receiver name nqs for NQuadBuffer (from golint)

golangcibot · 2019-08-12T19:54:59Z

chunker/json_parser.go

+	}
+}
+
+func (buf *NQuadBuffer) Flush() {


receiver name buf should be consistent with previous receiver name nqs for NQuadBuffer (from golint)

golangcibot · 2019-08-12T19:57:07Z

chunker/chunk.go

@@ -117,28 +131,26 @@ func (c *rdfChunker) Chunk(r *bufio.Reader) (*bytes.Buffer, error) {
 }

 // Parse is not thread-safe. Only call it serially, because it reuses lexer object.
-func (c *rdfChunker) Parse(chunkBuf *bytes.Buffer) ([]*api.NQuad, error) {
+func (c *rdfChunker) Parse(chunkBuf *bytes.Buffer) error {


receiver name c should be consistent with previous receiver name rc for rdfChunker (from golint)

manishrjain

Reviewable status: 0 of 14 files reviewed, 4 unresolved discussions (waiting on @gitlw, @golangcibot, @mangalaman93, @manishrjain, and @martinmr)

chunker/json_parser.go, line 245 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

receiver name buf should be consistent with previous receiver name nqs for NQuadBuffer (from golint)

Done.

chunker/json_parser.go, line 249 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

receiver name buf should be consistent with previous receiver name nqs for NQuadBuffer (from golint)

Done.

chunker/json_parser.go, line 259 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

receiver name buf should be consistent with previous receiver name nqs for NQuadBuffer (from golint)

Done.

martinmr

Reviewed 10 of 14 files at r1, 4 of 4 files at r3.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @gitlw, @golangcibot, @mangalaman93, and @manishrjain)

dgraph/cmd/live/run.go, line 218 at r3 (raw file):

		chunkBuf, err := ck.Chunk(rd)
		// process parses the rdf entries from the chunk, and group them into batches (each one

comment is confusing. What is process? Should it just say "Parses the RDF entries ..."?

chunker/json_parser.go, line 180 at r3 (raw file):

}

func (nqs *NQuadBuffer) checkForDeletion(mr mapResponse, m map[string]interface{}, op int) {

why is mr no longer a pointer?

chunker/json_parser.go, line 233 at r3 (raw file):

"is set to -1"

What happens if it's set to another negative number? Maybe generalize the logic and comment to have this behavior for all numbers < 0, if it's not doing that already.

chunker/json_parser.go, line 245 at r3 (raw file):

}

func (buf *NQuadBuffer) Ch() <-chan []*api.NQuad {

add docstring to exported method.

chunker/json_parser.go, line 249 at r3 (raw file):

}

func (buf *NQuadBuffer) Push(nqs ...*api.NQuad) {

add docstring to exported method.

chunker/json_parser.go, line 259 at r3 (raw file):

}

func (buf *NQuadBuffer) Flush() {

add docstring to exported method.

chunker/json_parser.go, line 441 at r3 (raw file):

)

// ParseJSON converts the given byte slice into a slice of NQuads.

Maybe update the docstring? I guess it's still correct but now there's an intermediate step (the nquad buffer).

chunker/json/parse.go, line 349 at r3 (raw file):

			mr.nquads = append(mr.nquads, &nq)
			// Add the nquads that we got for the connecting entity.
			mr.nquads = append(mr.nquads, cr.nquads...)

If I am reading this correctly, the new logic is no longer appending cr.nquads to the output. Will that break anything?

chunker/json/parse.go, line 382 at r3 (raw file):

					mr.nquads = append(mr.nquads, &nq)
					// Add the nquads that we got for the connecting entity.
					mr.nquads = append(mr.nquads, cr.nquads...)

Same point here regarding cr.nquads

gitlw

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @golangcibot, @mangalaman93, and @manishrjain)

dgraph/cmd/bulk/run.go, line 218 at r3 (raw file):

			fmt.Printf("GC: %d. InUse: %s. Idle: %s\n", ms.NumGC, humanize.Bytes(ms.HeapInuse),
				humanize.Bytes(ms.HeapIdle-ms.HeapReleased))
			if ms.NumGC > lastNum {

Maybe add a comment to explain that in this case, GC has been run by the go runtime, so we don't need to force a GC in this iteration.
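A small sketch of the check being discussed, for illustration only: it forces a collection only when the runtime has not completed one since the previous check. The helper names `shouldForceGC` and `checkAndCollect` are hypothetical and the bookkeeping in the actual bulk loader may differ.

```go
package main

import (
	"fmt"
	"runtime"
)

// shouldForceGC reports whether a forced GC is needed. If NumGC has advanced
// past lastNum, the Go runtime already ran a GC since the previous check.
func shouldForceGC(numGC, lastNum uint32) bool {
	return numGC <= lastNum
}

// checkAndCollect reads memory stats, forces a GC only if the runtime has not
// run one since lastNum was recorded, and returns the NumGC value to remember.
func checkAndCollect(lastNum uint32) uint32 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if shouldForceGC(ms.NumGC, lastNum) {
		runtime.GC()
		runtime.ReadMemStats(&ms) // pick up the cycle that was just forced
	}
	return ms.NumGC
}

func main() {
	last := checkAndCollect(0)
	fmt.Println("GC cycles so far:", last)
	last = checkAndCollect(last)
	fmt.Println("GC cycles so far:", last)
}
```

In the loader this would run on a ticker (the commits mention intervals of 5s and 10s); the ticker is omitted here to keep the sketch short.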

chunker/json_parser.go, line 233 at r3 (raw file):

// NewNQuadBuffer would return a new buffer. It would batch up batchSize NQuads per push to channel,
// accessible via Ch(). If batchSize is set to -1, it would only do one push to Ch() during Flush.

Maybe replace -1 with "0 or a negative value" to be more precise.

gitlw

Added some minor comments, otherwise

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @golangcibot, @mangalaman93, and @manishrjain)

gitlw

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @golangcibot, @mangalaman93, @manishrjain, and @martinmr)

chunker/json/parse.go, line 349 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

If I am reading this correctly, the new logic is no longer appending cr.nquads to the output. Will that break anything?

I think the child's nquads will be pushed to the buffer through the function call above buf.mapToNquads(...)

Compared with the current implementation, this PR would make the child's nquads show up before the connecting nq. But it seems that won't break anything.

manishrjain

Reviewable status: 7 of 19 files reviewed, 15 unresolved discussions (waiting on @gitlw, @golangcibot, @mangalaman93, and @martinmr)

dgraph/cmd/bulk/run.go, line 218 at r3 (raw file):

Previously, gitlw (Lucas Wang) wrote…

Maybe add a comment to explain that in this case, GC has been run by the go runtime, so we don't need to force a GC in this iteration.

Done.

dgraph/cmd/live/run.go, line 218 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

comment is confusing. What is process? Should it just say "Parses the RDF entries ..."?

Done.

chunker/json_parser.go, line 180 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

why is mr no longer a pointer?

Nothing is being written to it.

chunker/json_parser.go, line 233 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

"is set to -1"

What happens if it's set to another negative number? Maybe generalize the logic and comment to have this behavior for all numbers < 0, if it's not doing that already.

Already works like that. I changed the comment to just say "negative".

chunker/json_parser.go, line 233 at r3 (raw file):

Previously, gitlw (Lucas Wang) wrote…

Maybe replace -1 with "0 or a negative value" to be more precise.

Done.

chunker/json_parser.go, line 245 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

add docstring to exported method.

Done.

chunker/json_parser.go, line 249 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

add docstring to exported method.

Done.

chunker/json_parser.go, line 259 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

add docstring to exported method.

Done.

chunker/json_parser.go, line 441 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Maybe update the docstring? I guess it's still correct but now there's an intermediate step (the nquad buffer).

Done.

chunker/json/parse.go, line 349 at r3 (raw file):

Previously, gitlw (Lucas Wang) wrote…

I think the child's nquads will be pushed to the buffer through the function call above buf.mapToNquads(...)

Compared with the current implementation, this PR would make the child's nquads show up before the connecting nq. But it seems that won't break anything.

Yup. The children would show up first, but NQuads work independently. Each NQuad is self-sufficient.
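To make the ordering point concrete, here is a simplified sketch (not the PR's `mapToNquads`) in which the recursive call pushes the child's NQuads straight to a shared sink, so they land before the connecting edge. The `NQuad` struct, `nquadSink`, and `convert` are hypothetical names; keys are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// NQuad is a hypothetical, simplified stand-in for api.NQuad.
type NQuad struct{ Subject, Predicate, Object string }

// nquadSink plays the role of the buffer that every recursion level pushes to.
type nquadSink struct{ out []NQuad }

func (s *nquadSink) push(nq NQuad) { s.out = append(s.out, nq) }

// mapToNquads pushes the NQuads for m directly to the sink and returns m's uid.
// Nothing is collected and handed back to the parent.
func (s *nquadSink) mapToNquads(m map[string]interface{}) string {
	uid, _ := m["uid"].(string)
	keys := make([]string, 0, len(m))
	for k := range m {
		if k != "uid" {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	for _, pred := range keys {
		switch v := m[pred].(type) {
		case string:
			s.push(NQuad{Subject: uid, Predicate: pred, Object: v})
		case map[string]interface{}:
			// The child's NQuads are pushed inside this call...
			childUID := s.mapToNquads(v)
			// ...so the connecting edge comes after them.
			s.push(NQuad{Subject: uid, Predicate: pred, Object: childUID})
		}
	}
	return uid
}

// convert runs the conversion and returns the NQuads in push order.
func convert(m map[string]interface{}) []NQuad {
	s := &nquadSink{}
	s.mapToNquads(m)
	return s.out
}

func main() {
	doc := map[string]interface{}{
		"uid":    "0x1",
		"name":   "Alice",
		"friend": map[string]interface{}{"uid": "0x2", "name": "Bob"},
	}
	for _, nq := range convert(doc) {
		fmt.Printf("<%s> <%s> %q\n", nq.Subject, nq.Predicate, nq.Object)
	}
}
```

Since each NQuad carries its own subject, predicate, and object, a consumer does not need the edge to arrive before the child's data.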

chunker/json/parse.go, line 382 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Same point here regarding cr.nquads

Done. See above.

pullrequest

The approach here makes sense to me and looks sound. I have a few more minor comments along the way but otherwise this looks good from my side.

Reviewed with ❤️ by PullRequest

pullrequest · 2019-08-12T23:01:16Z

chunker/chunk.go

 	default:
 		panic("unknown input format")
 	}
 }

 // RDF files don't require any special processing at the beginning of the file.
-func (c *rdfChunker) Begin(r *bufio.Reader) error {
+func (rdfChunker) Begin(r *bufio.Reader) error {


It is often considered best practice to keep receiver types consistent: if one method takes a pointer receiver, make them all pointer receivers. Since NQuads() and Parse take a pointer receiver, you may consider keeping the others (Begin, Chunk, End) as pointer receivers as well to avoid any surprises. (https://golang.org/doc/faq#methods_on_values_or_pointers)

The FAQ seems to only talk about values vs pointers. In this case, the rdfChunker object isn't even being used. Given that, it is clearer to not even declare the object variable. Clearly indicates to the user that the object isn't being used.

Sorry yeah I meant the type changing from *rdfChunker to rdfChunker, not the removing of the variable naming. So for example just updating to:

func (*rdfChunker) Begin(r *bufio.Reader) error {

It's totally fine to ignore this if you prefer, since it seems you always use the rdfChunker as a pointer, so it will have both sets of methods. It is just that an rdfChunker currently has a different set of methods available in value form than in pointer form.
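The method-set difference mentioned here can be seen in a few lines. This is an illustrative sketch only: the `rdfChunker` type below is a stripped-down stand-in, and the `Beginner` and `Parser` interfaces are hypothetical.

```go
package main

import "fmt"

// Hypothetical single-method interfaces used to probe method sets.
type Beginner interface{ Begin() string }
type Parser interface{ Parse() string }

// rdfChunker is a stripped-down stand-in for the type in the PR.
type rdfChunker struct{ parsed int }

// Value receiver: in the method set of both rdfChunker and *rdfChunker.
func (rdfChunker) Begin() string { return "begin" }

// Pointer receiver: only in the method set of *rdfChunker.
func (rc *rdfChunker) Parse() string {
	rc.parsed++
	return "parse"
}

func implementsBeginner(v interface{}) bool {
	_, ok := v.(Beginner)
	return ok
}

func implementsParser(v interface{}) bool {
	_, ok := v.(Parser)
	return ok
}

func main() {
	fmt.Println(implementsBeginner(rdfChunker{}), implementsParser(rdfChunker{}))   // true false
	fmt.Println(implementsBeginner(&rdfChunker{}), implementsParser(&rdfChunker{})) // true true
}
```

As long as the chunker is only ever handled through a pointer, both methods are available, which matches the conclusion reached in the thread.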

pullrequest · 2019-08-12T23:01:16Z

chunker/json_parser.go

+		buf.nqCh <- buf.nquads
+		buf.nquads = nil
+	}
+	close(buf.nqCh)


Could be worth adding a comment that once Flush() has been called, neither Flush() nor Push() can be called again, since the channel is closed.
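The reason is plain Go channel semantics: sending on a closed channel panics, and so does closing it a second time. A minimal sketch with a hypothetical `batchBuffer` (ints instead of NQuads, batch size fixed at 2) shows both cases:

```go
package main

import "fmt"

// batchBuffer is a hypothetical, minimal buffer that mirrors the close-on-Flush design.
type batchBuffer struct {
	pending []int
	ch      chan []int
}

func newBatchBuffer() *batchBuffer {
	return &batchBuffer{ch: make(chan []int, 4)}
}

// Push sends a batch once two values are pending.
func (b *batchBuffer) Push(v int) {
	b.pending = append(b.pending, v)
	if len(b.pending) >= 2 {
		b.ch <- b.pending
		b.pending = nil
	}
}

// Flush sends what is left and closes the channel. It must be the last call.
func (b *batchBuffer) Flush() {
	if len(b.pending) > 0 {
		b.ch <- b.pending
		b.pending = nil
	}
	close(b.ch)
}

// panics runs f and reports whether it panicked.
func panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

// pushAfterFlushPanics fills a batch after Flush, which sends on a closed channel.
func pushAfterFlushPanics() bool {
	b := newBatchBuffer()
	b.Flush()
	return panics(func() {
		b.Push(1)
		b.Push(2)
	})
}

// flushTwicePanics calls Flush twice, which closes an already closed channel.
func flushTwicePanics() bool {
	b := newBatchBuffer()
	b.Flush()
	return panics(b.Flush)
}

func main() {
	fmt.Println(pushAfterFlushPanics()) // true
	fmt.Println(flushTwicePanics())     // true
}
```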

pullrequest · 2019-08-12T23:01:16Z

dgraph/cmd/live/run.go

+		// containing opt.batchSize entries) and sends the batches to the loader.reqs channel (see
+		// above).
+		if oerr := ck.Parse(chunkBuf); oerr != nil && oerr != io.EOF {
+			return oerr


Is it worth wrapping this error to add more context?

pullrequest · 2019-08-12T23:01:18Z

chunker/json_parser.go

 			if err != nil {
 				return mr, err
 			}

 			// Add the connecting edge beteween the entities.
 			nq.ObjectId = cr.uid
 			nq.Facets = cr.fcts
-			mr.nquads = append(mr.nquads, &nq)
-			// Add the nquads that we got for the connecting entity.
-			mr.nquads = append(mr.nquads, cr.nquads...)


Are these other cr.nquads from the connecting entity no longer needed here or below on line 382?

Instead of the parent collecting the NQuads from the child, the child invocation of this func now pushes directly to the NQuadBuffer object.

manishrjain

Ah.. I see what you mean. Done.

Reviewable status: 7 of 19 files reviewed, 15 unresolved discussions (waiting on @gitlw, @golangcibot, @mangalaman93, and @martinmr)

manishrjain added 11 commits August 9, 2019 17:27

Create a new NQuads struct and move Parse under it.

ded891b

Move all nquads to the new NQuads struct.

9ac7d50

Make parse tests work

a6047b1

Code runs and all

6832162

Run GC every 5s

f4cf4be

GC every 10s

39a00b3

Set worker goroutines aka mappers to 1/4th of the number of cores.

b81b207

Only run GC if it hadn't been run. Also mention the RAM usage being proportional to worker threads in shout-case.

fa831be

Files moved to one directory to allow easier sharing of the code.

70eb295

Add batchSize option to Chunker and NQuadBuffer to simplify loader and others.

b7a299b

Bring the edgraph/server.go code back. Move multiple RDF parsing to chunker package as well.

17b557b

manishrjain requested a review from a team as a code owner August 12, 2019 19:53

pullrequest bot reviewed Aug 12, 2019

golangcibot reviewed Aug 12, 2019

Hook up NQuadBuffer to RDFChunker.NQuads

3f89c68

golangcibot reviewed Aug 12, 2019

manishrjain requested review from martinmr, gitlw and mangalaman93 August 12, 2019 20:00

manishrjain added 2 commits August 12, 2019 13:27

Address golint issues

5e12649

Fix the live loader test failure.

948bfde

manishrjain commented Aug 12, 2019

Don't think we need a for loop around chunker.Parse

3e25781

martinmr suggested changes Aug 12, 2019

gitlw approved these changes Aug 12, 2019

gitlw reviewed Aug 12, 2019

manishrjain added 3 commits August 12, 2019 14:49

Move FacetDelimiter to x package to avoid a cyclic import loop.

541c9be

Fix gql package test

297651f

Fix a test failure by handling io.EOF

8f7f8cf

Don't return io.EOF unnecessarily

1d1f529

manishrjain commented Aug 12, 2019

pullrequest bot reviewed Aug 12, 2019

manishrjain added 2 commits August 12, 2019 16:28

Address comments by PR folks. Also, no need to handle io.EOF because Parse no longer returns it.

de862d1

Address PR folks comments

5c420fb

manishrjain commented Aug 13, 2019

manishrjain merged commit c594918 into master Aug 13, 2019

manishrjain deleted the mrjn/chunk-parsech branch August 13, 2019 00:09
