Moss panic with runaway disk usage #61

Open

connorgorman opened this issue Oct 31, 2018 · 4 comments

@connorgorman

Hey everyone, I'm seeing the following panic, which looks like the process has run out of memory. However, I've attached a Grafana graph which shows the disk usage by the Moss index (it's from a file walker, so it could have a race condition if Moss is generating lots of files while it walks).

fatal error: runtime: cannot allocate memory

goroutine 103 [running]:
runtime.systemstack_switch()
	stdlib%/src/runtime/asm_amd64.s:311 fp=0xc000b75970 sp=0xc000b75968 pc=0x459c90
runtime.persistentalloc(0xd0, 0x0, 0x27ad2b0, 0x7c4eac)
	GOROOT/src/runtime/malloc.go:1142 +0x82 fp=0xc000b759b8 sp=0xc000b75970 pc=0x40c932
runtime.newBucket(0x1, 0x4, 0x425f76)
	GOROOT/src/runtime/mprof.go:173 +0x5e fp=0xc000b759f0 sp=0xc000b759b8 pc=0x42573e
runtime.stkbucket(0x1, 0x33a000, 0xc000b75a98, 0x4, 0x20, 0xc000b75a01, 0x7f08c8658138)
	GOROOT/src/runtime/mprof.go:240 +0x1aa fp=0xc000b75a50 sp=0xc000b759f0 pc=0x425a3a
runtime.mProf_Malloc(0xc01298a000, 0x33a000)
	GOROOT/src/runtime/mprof.go:344 +0xd6 fp=0xc000b75bc8 sp=0xc000b75a50 pc=0x425fd6
runtime.profilealloc(0xc0026e8000, 0xc01298a000, 0x33a000)
	GOROOT/src/runtime/malloc.go:1058 +0x4b fp=0xc000b75be8 sp=0xc000b75bc8 pc=0x40c6cb
runtime.mallocgc(0x33a000, 0x14f3080, 0x1, 0xc008fb0000)
	GOROOT/src/runtime/malloc.go:983 +0x46c fp=0xc000b75c88 sp=0xc000b75be8 pc=0x40bdac
runtime.makeslice(0x14f3080, 0x0, 0x338b32, 0xc008fb0000, 0x0, 0x17ec0)
	GOROOT/src/runtime/slice.go:70 +0x77 fp=0xc000b75cb8 sp=0xc000b75c88 pc=0x442c17
vendor/github.com/couchbase/moss.newSegment(...)
	vendor/github.com/couchbase/moss/segment.go:158
vendor/github.com/couchbase/moss.(*segmentStack).merge(0xc005eaf180, 0xc000b75e01, 0xc007dec910, 0xc002d71a90, 0xc00004bc90, 0x10, 0xc00004bcb)
	vendor/github.com/couchbase/moss/segment_stack_merge.go:73 +0x1bb fp=0xc000b75e48 sp=0xc000b75cb8 pc=0xcb199b
vendor/github.com/couchbase/moss.(*collection).mergerMain(0xc0004b00c0, 0xc005eaf180, 0xc007dec910, 0x1, 0xc005eaf180)
	vendor/github.com/couchbase/moss/collection_merger.go:248 +0x306 fp=0xc000b75ef0 sp=0xc000b75e48 pc=0xca6946
vendor/github.com/couchbase/moss.(*collection).runMerger(0xc0004b00c0)
	vendor/github.com/couchbase/moss/collection_merger.go:126 +0x2d0 fp=0xc000b75fd8 sp=0xc000b75ef0 pc=0xca5e30
runtime.goexit()
	stdlib%/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000b75fe0 sp=0xc000b75fd8 pc=0x45bbf1
created by vendor/github.com/couchbase/moss.(*collection).Start
	vendor/github.com/couchbase/moss/collection.go:118 +0x62

[screenshot: Grafana graph of Moss index disk usage, 2018-10-30 11:07 PM]

The disk usage grows for about 6 minutes, then implodes -- I assume when the disk is completely filled. The green line after the blip is our service restarting and our indexes being rebuilt.

@steveyen
Member

hi @connorgorman -- sorry to hear that. From the stack dump, I can't tell much, other than that moss is trying to merge segments together (e.g., compacting files together to hopefully make them smaller).

So, there's no "a-ha" that comes to mind, but some other thoughts -- were there perhaps multiple crashes over time, or just one?

Would you happen to see whether moss is generating tons of files, or instead just a few (or even one file) is growing really large?

Wondering about the application usage pattern? What's the app doing with moss?

One case, for example, is if there are lots of so-called "slow-reader" goroutines; they can hold onto reference counts on multiple moss "segments", which will keep the files around.

An edge case of an "infinitely slow reader", for example, is if the application forgets to close a snapshot, so that also pops to mind.
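
For illustration, here's a minimal sketch of that snapshot-leak edge case against the public moss Collection/Snapshot API (the key and the rest of the setup here are just placeholders):

```go
package main

import (
	"log"

	moss "github.com/couchbase/moss"
)

func main() {
	coll, err := moss.NewCollection(moss.DefaultCollectionOptions)
	if err != nil {
		log.Fatal(err)
	}
	if err := coll.Start(); err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	ss, err := coll.Snapshot()
	if err != nil {
		log.Fatal(err)
	}
	// Without this Close(), the snapshot behaves like an "infinitely slow
	// reader": it keeps ref-counts on the segments it can see, so merged /
	// compacted data can't be reclaimed and files pile up on disk.
	defer ss.Close()

	val, err := ss.Get([]byte("some-key"), moss.ReadOptions{})
	if err != nil {
		log.Fatal(err)
	}
	_ = val
}
```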

@connorgorman
Author

connorgorman commented Oct 31, 2018

Thanks for the reply @steveyen

I'm currently accessing moss + the lower-level store through Bleve. The app is basically writing/deleting batches of documents to Bleve periodically, and then has some ad hoc documents that flow in, so it's a mix of both batch and individual indexing operations.
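
Roughly, the write path looks like this (a simplified sketch -- the document IDs/values are made up, and it assumes an already-created bleve index backed by moss):

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Assumes an existing index that was created with moss as the KV store.
	idx, err := bleve.Open("example.bleve")
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	// Periodic batched writes/deletes...
	batch := idx.NewBatch()
	if err := batch.Index("doc-1", map[string]interface{}{"field": "value"}); err != nil {
		log.Fatal(err)
	}
	batch.Delete("old-doc")
	if err := idx.Batch(batch); err != nil {
		log.Fatal(err)
	}

	// ...mixed with ad hoc single-document indexing.
	if err := idx.Index("adhoc-doc", map[string]interface{}{"field": "value"}); err != nil {
		log.Fatal(err)
	}
}
```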

Perhaps the query in the stacktrace below is particularly slow and could bog down the indexing?

I'm currently running a cluster, so I can let you know about moss generating tons of files vs. one large file. It panics pretty frequently. Here are some more relevant stack traces. An interesting note is that the index isn't compacting, but when I run ./mossScope stats fragmentation it suddenly starts compacting. Is there a way to access the lower-level store when creating through Bleve? The mossStoreWrapper is unexported, so I can't type-assert to call Actual().

goroutine 822286 [runnable]:
vendor/github.com/couchbase/moss.persistBasicSegment.func1(0x19f0600, 0xc001a4a6b0, 0xc038ac2000, 0x5ef0, 0x5ef0, 0x6542a0000, 0xc000099440)
	vendor/github.com/couchbase/moss/segment.go:678
created by vendor/github.com/couchbase/moss.persistBasicSegment
	vendor/github.com/couchbase/moss/segment.go:678 +0x20d

goroutine 104 [runnable]:
reflect.typedmemmove(0x14f3040, 0xc000e85d30, 0xc00149fa08)
	GOROOT/src/runtime/mbarrier.go:177 +0x4e
...
vendor/github.com/couchbase/moss.(*Footer).doLoadSegments(0xc007bb5380, 0xc000320400, 0xc0018fa3c0, 0xc002983a40, 0x5, 0x6, 0x0, 0x0, 0x0, 0x0, ...)
	vendor/github.com/couchbase/moss/store_footer.go:315 +0xf57
vendor/github.com/couchbase/moss.(*Footer).loadSegments(0xc007bb5380, 0xc000320400, 0xc0018fa3c0, 0x19f0600, 0xc001b32018)
	vendor/github.com/couchbase/moss/store_footer.go:248 +0x84

goroutine 3232418 [runnable]:
vendor/github.com/blevesearch/bleve/index/store/moss.(*Reader).PrefixIterator(0xc000eb60e0, 0xc003506ae0, 0x28, 0x28, 0x0, 0x0)
	vendor/github.com/blevesearch/bleve/index/store/moss/reader.go:56 +0x11c
vendor/github.com/blevesearch/bleve/index/upsidedown.newUpsideDownCouchTermFieldReader(0xc000eb6100, 0xc003506ab0, 0x24, 0x30, 0x101010043, 0x30, 0x24, 0xc003506ab0)
	vendor/github.com/blevesearch/bleve/index/upsidedown/reader.go:95 +0x24b
vendor/github.com/blevesearch/bleve/index/upsidedown.(*IndexReader).TermFieldReader(0xc000eb6100, 0xc003506ab0, 0x24, 0x30, 0x17a8c3b, 0x14, 0xc002010101, 0x2cf70, 0x0, 0x0, ...)
	vendor/github.com/blevesearch/bleve/index/upsidedown/index_reader.go:41 +0xad
vendor/github.com/blevesearch/bleve/search/searcher.NewTermSearcher(0x19fcb40, 0xc000eb6100, 0xc001a4c9f0, 0x24, 0x17a8c3b, 0x14, 0x3ff0000000000000, 0xc001200100, 0xc00268f300, 0x0, ...)
	vendor/github.com/blevesearch/bleve/search/searcher/search_term.go:42 +0xd2
vendor/github.com/blevesearch/bleve/search/searcher.NewMultiTermSearcher(0x19fcb40, 0xc000eb6100, 0xc004244000, 0x7b18, 0x9600, 0x17a8c3b, 0x14, 0x3ff0000000000000, 0xc004010100, 0x7b18, ...)
	vendor/github.com/blevesearch/bleve/search/searcher/search_multi_term.go:35 +0x15c
vendor/github.com/blevesearch/bleve/search/searcher.NewRegexpSearcher(0x19fcb40, 0xc000eb6100, 0x19e6de0, 0xc005eefea0, 0x17a8c3b, 0x14, 0x3ff0000000000000, 0x100, 0x2, 0x2, ...)
	vendor/github.com/blevesearch/bleve/search/searcher/search_regexp.go:87 +0xf2
vendor/github.com/blevesearch/bleve/search/searcher.NewRegexpStringSearcher(0x19fcb40, 0xc000eb6100, 0xc007b2414a, 0x2, 0x17a8c3b, 0x14, 0x3ff0000000000000, 0xc000eb0100, 0x0, 0x0, ...)
	vendor/github.com/blevesearch/bleve/search/searcher/search_regexp.go:36 +0x4d6
vendor/github.com/blevesearch/bleve/search/query.(*WildcardQuery).Searcher(0xc009101440, 0x19fcb40, 0xc000eb6100, 0x19efe80, 0xc00022e080, 0x100, 0xc01c1bc180, 0xc001866190, 0x7bdcd9, 0xc00022e158)
	vendor/github.com/blevesearch/bleve/search/query/wildcard.go:87 +0xe2
vendor/github.com/blevesearch/bleve/search/query.(*ConjunctionQuery).Searcher(0xc0091014a0, 0x19fcb40, 0xc000eb6100, 0x19efe80, 0xc00022e080, 0xc001020100, 0x360e9, 0xc00120c5c8, 0xcade8c, 0x7f080aae6b24)
	vendor/github.com/blevesearch/bleve/search/query/conjunction.go:58 +0x122


goroutine 83 [runnable]:
vendor/github.com/couchbase/moss.persistBasicSegment(0x19f0000, 0xc003f84000, 0x19f0600, 0xc001a4a6b0, 0x65429f8f1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	vendor/github.com/couchbase/moss/segment.go:690 +0x2fe
vendor/github.com/couchbase/moss.(*segment).Persist(0xc003f84000, 0x19f0600, 0xc001a4a6b0, 0xc000324400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	vendor/github.com/couchbase/moss/segment.go:614 +0x28d
vendor/github.com/couchbase/moss.(*Store).persistSegment(0xc0004969a0, 0x19f0600, 0xc001a4a6b0, 0x19f0000, 0xc003f84000, 0xc000324400, 0x0, 0x0, 0x0, 0x0, ...)
	vendor/github.com/couchbase/moss/store.go:466 +0x16a
vendor/github.com/couchbase/moss.(*Store).persistSegments(0xc0004969a0, 0xc038ce60a0, 0xc038565ce0, 0x19f0600, 0xc001a4a6b0, 0xc0038c1a40, 0x0, 0x0)
	vendor/github.com/couchbase/moss/store.go:242 +0x28b
vendor/github.com/couchbase/moss.(*Store).persist(0xc0004969a0, 0x19ed920, 0xc038ce60a0, 0x428300, 0x1, 0x0, 0x0, 0x0, 0x0)
	vendor/github.com/couchbase/moss/store.go:140 +0x30e
vendor/github.com/couchbase/moss.(*Store).Persist(0xc0004969a0, 0x19ed920, 0xc038ce60a0, 0x942dfe4eee00, 0x1, 0x208aaa0800000000, 0x5bd9cad5, 0xc006db9ef8, 0x4a3fe6)
	vendor/github.com/couchbase/moss/store_api.go:329 +0x56

@steveyen
Member

hi @connorgorman

Perhaps the query in the stacktrace below is particularly slow and could bog down the indexing?

Perhaps -- can't tell from just the stacktrace alone. Like if you're doing some wildcard searches that somehow return everything, or the regexp is for a huge # of terms, and the dataset (the "everything") is large, then that might be a situation. btw, that goroutine 3232418 is in the early phases of search request processing (going through prep work & setup & constructors of various data structures), as opposed to being in the middle of actual iteration or Next()'ing through candidate hits, so that might mean the "slow searcher" theory isn't quite right.

I'm currently running a cluster so I can let you know about moss generating tons of files vs a large file. It panics pretty frequently.

Oh, that's interesting. Then it sounds like it might be possible to get to an easy, reliably reproducible situation: do step 1, 2, 3, then bam!

Here are some more relevant stack traces. An interesting note is that the index isn't compacting, but when I run ./mossScope stats fragmentation it suddenly starts compacting.

That's really weird! And that's a bug in "mossScope stats fragmentation" -- it's supposed to open the file in read-only mode. Created a ticket in JIRA (what the full-text team happens to use at couchbase for day-to-day work) to track that -- https://issues.couchbase.com/browse/MB-31816

Maybe that's the issue, if you happen to be running mossScope regularly with "stats fragmentation", perhaps leading to some kind of concurrent file access issue. Not sure, but it's a legit mossScope bug.

Is there a way to access the lower level store when creating through Bleve? The mossStoreWrapper is unexported so I can't cast Actual()

Here's the way we did that in the couchbase cbft (full text search) project, where we just had a local interface with the Actual() method on it...

https://github.com/couchbase/cbft/blob/master/pindex_bleve_moss_rollback.go#L41
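
i.e., something along these lines (a sketch -- the names here are illustrative, and it assumes the value you can reach from the bleve side has a concrete type with an Actual() *moss.Store method, like the unexported mossStoreWrapper):

```go
package mossaccess

import (
	moss "github.com/couchbase/moss"
)

// mossStoreActualHolder is a local stand-in for bleve's unexported
// mossStoreWrapper; interface satisfaction in Go only needs the method set.
type mossStoreActualHolder interface {
	Actual() *moss.Store
}

// lowerLevelMossStore type-asserts whatever the bleve moss KV store hands
// back and, if it matches, returns the underlying *moss.Store.
func lowerLevelMossStore(llStore interface{}) (*moss.Store, bool) {
	holder, ok := llStore.(mossStoreActualHolder)
	if !ok {
		return nil, false
	}
	return holder.Actual(), true
}
```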

@connorgorman
Author

@steveyen Thanks again!

Would you happen to see whether moss is generating tons of files, or instead just a few (or even one file) is growing really large?

Looks like there are lots of files generated. If I understand moss correctly (especially from the OpenStore code), only the last-written file matters as long as no other searchers are referencing the earlier files (which is why OpenStore can prune the index). If that's the case, then the ref-count stuff mentioned above makes sense. I'll take a look into that.

Maybe that's the issue, if you happen to be running mossScope regularly with "stats fragmentation", perhaps leading to some kind of concurrent file access issue.

We don't run mossScope frequently; it was mostly just to debug why things weren't compacting.
