Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large data collections cause garbage collection errors #21

Closed
mingard opened this issue Feb 2, 2017 · 27 comments
Closed

Large data collections cause garbage collection errors #21

mingard opened this issue Feb 2, 2017 · 27 comments
Assignees

Comments

@mingard
Copy link

mingard commented Feb 2, 2017

Using: JSONBlockStore
Number of records: 20,000
Node: 6.9 and 7.5 (--harmony-async-await and using /src/index.js)

When inserting large recordsets into a single collection I get the following error. I have tried inserting in batch, stopping and starting the app between batches, but I never get above 20k.


  544228 ms: Mark-sweep 1878.0 (2037.1) -> 1877.6 (2037.1) MB, 3625.3 / 0.0 ms [allocation failure] [GC in old space requested].
  547861 ms: Mark-sweep 1877.6 (2037.1) -> 1877.5 (2037.1) MB, 3632.8 / 0.0 ms [allocation failure] [GC in old space requested].
  551578 ms: Mark-sweep 1877.5 (2037.1) -> 1881.4 (2021.1) MB, 3716.4 / 0.0 ms [last resort gc].
  555225 ms: Mark-sweep 1881.4 (2021.1) -> 1885.2 (2021.1) MB, 3646.2 / 0.0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x27fbaddcfb51 <JS Object>
    1: PromiseEnqueue(aka PromiseEnqueue) [native promise.js:~98] [pc=0x2c5278ddd396] (this=0x27fbadd04381 <undefined>,y=0xd6b8f552bd1 <JS Array[18]>,C=0x3f7ffe835d41 <JS Array[2]>,x=1)
    2: then [native promise.js:~215] [pc=0x2c5278d07605] (this=0x18998b0e4c41 <a Promise with map 0x30c4b83231b1>,z=0x3f7ffe835ab1 <JS Function s (SharedFunctionInfo 0x1b026b524119)>,A=0x3f7ffe835af9 <JS Functio...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x10cf7dc [node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::CopyFixedArrayAndGrow(v8::internal::Handle<v8::internal::FixedArray>, int, v8::internal::PretenureFlag) [node]
 6: v8::internal::Isolate::EnqueueMicrotask(v8::internal::Handle<v8::internal::Object>) [node]
 7: v8::internal::Runtime_EnqueueMicrotask(int, v8::internal::Object**, v8::internal::Isolate*) [node]
 8: 0x2c52754092a7
Aborted (core dumped)
@anywhichway
Copy link
Owner

@mingard you might actually be out of RAM. What is the result of 20,000 * record size? What is the size of the files on disk? Have you tried increasing Node memory size: http://prestonparry.com/articles/IncreaseNodeJSMemorySize/

@mingard
Copy link
Author

mingard commented Feb 3, 2017

@anywhichway I increased the available memory but it seems that regardless of the available memory, past a certain point the V8 JS Array Handler can't cope.

Even if I stop the process whilst the inserts slow down (and they really slow down, hitting 100% CPU) then start again with a small handful of inserts, the process can't handle this. Looks like some sort of Array sort or Fixed array extension issue.

@anywhichway
Copy link
Owner

@mingard Ugh ... loathe to switch to Set. It is so slow compared to array. Will poke around to see if there is anything that can be done.

@anywhichway anywhichway self-assigned this Feb 4, 2017
@anywhichway
Copy link
Owner

@mingard If it is insert related, then I think we found the offending block of code. See src/index.js lines 2259 to 2289. This effectively creates an entire copy of the array of objects to be inserted with a bunch or Promises in it. The class Activity is backed by an Array. However, it may simply be related to the fact that indexes are in RAM and we don't have an LRU purge in place to deal with low RAM situations. This may take a couple of weeks to fix.

Can you provide a single sample record so that we are working with the same size objects?

@mingard
Copy link
Author

mingard commented Feb 6, 2017

@anywhichway regarding the on insert issue, yes that is when RAM usage ramps up more than I'd expect. When the process is stopped and started again, the RAM usage is around 30mb, until the first insert where it jumps up to around 50% of where it was before the app process was exited, so the purge might be necessary regardless. It seems to grab all documents before the first insert and in some cases that can occupy all available RAM instantly, slowing things to a halt.

Example record:

{
  "title": "Lorem Ipsum dolor",
  "subtitle": "Proin ut ex eget velit dignissim",
  "content": [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse eget neque eu lectus dictum placerat. Vestibulum tristique eu elit et mattis. Nulla scelerisque facilisis magna, at porta diam tincidunt egestas. Proin ut ex eget velit dignissim viverra. Curabitur sit amet ultrices lorem. Nulla cursus orci libero, quis finibus ex faucibus at. Proin fringilla eu neque et scelerisque. Sed rutrum ex in magna vulputate tincidunt. Cras hendrerit id dui et imperdiet. Donec imperdiet tortor tincidunt sodales tempor. Proin faucibus mauris molestie pretium laoreet.",
    "Aenean condimentum vel neque vel egestas. Nulla faucibus lacinia est, ut condimentum risus. Cras varius, risus eu mollis semper, velit libero convallis odio, volutpat pretium est purus vitae dui. Etiam tristique purus odio, vitae porttitor odio suscipit ut. Phasellus dapibus, augue dictum vulputate auctor, purus nulla luctus turpis, id dignissim felis nunc sit amet felis. Nulla at lacus a mi eleifend accumsan. Curabitur eu dictum purus. Proin metus purus, blandit et feugiat ut, consequat eu eros. Aenean in purus lacus. Donec suscipit dignissim arcu, in tempor eros dictum eu"
  ],
  "heroImage": [
    "adipiscing"
  ],
  "publications": [
    "condimentum",
    "libero",
    "purus"
  ],
  "categories": [
    "adipiscing",
    "semper",
    "donec"
  ],
  "primarySyndicatePosition": 2,
  "published": {
    "state": "published"
  },
  "embeds": [],
  "furl": "lorem-ipsum-dolor",
  "urls": [
    "lorem/ipsum/dolor",
    "lorem/ipsum/elit",
    "lorem/consectetur/elit"
  ],
  "excerpt": "Cras varius, risus eu mollis semper, velit libero convallis odio"
}

@anywhichway
Copy link
Owner

@mingard Yeah, it loads the indexes before insert. Have you tried limiting the properties to index? See the README.md:

"If you wish to skip indexing certain properties, then add them to a class property called .skipKeys. Alternatively, use .indexKeys and only list the keys you wish to index."

In your case I would at a minimum recommend:

<YourClass>.skipKeys["subtitle","content","excerpt"]

The record you show above would actually create an index on "excerpt" that would look like this:

<YourClass>.index.excerpt["Cras varius, risus eu mollis semper, velit libero convallis odio"].string.<recordId> = true

Imagine the size of the index with 20,000 records!

BTW, we have done a hacked full text index implementation for large text fields so they can be searched for key words. This results in only one occurrence of any given word, e.g. "condimentum" would only appear once. We could probably get this into production in a few weeks if you need it ... assuming the above skipKeys get's you over the current hurdle.

@mingard
Copy link
Author

mingard commented Feb 6, 2017

@anywhichway Skipping some keys works, but means that I can't search on those values (unless I'm missing something). 25k records now uses around 1GB when I allow the title, subtitle and content to be indexed but nothing else.

The full text index concept sounds great! Is this a reverse index lookup where you'll be storing a single word and a reference to all documents containing that word? If thats the case, on document load you could analyse each field where each search value exists and create an inverse frequency by determining the distance between words. That would give some pretty great results for multi word search.

@anywhichway
Copy link
Owner

@mingard "but means that I can't search on those values" correct ... we really need to implement proper text search to support your content field.

You are correct on the approach. May even collapse into trigrams at some point. Starting work on it next week.

@mingard
Copy link
Author

mingard commented Feb 7, 2017

@anywhichway Sounds good. This may also be of interest https://github.com/NaturalNode/natural

@anywhichway
Copy link
Owner

@mingard First the bad news ... the code base has changed sufficiently since we built the original full text search that it will be quite an effort to get it in. Next the GOOD NEWS. We have an alternate approach in test right now. We have added a class property called deferKeys. It works similar to skipKeys except that predicate tests will still work, they simply aren't resolved until the cursor is processed, i.e. they are done outside the join network directly against instantiated objects. This is transparent to all existing queries. Using the example above you would use the following:

<YourClass>.deferKeys["subtitle","content","excerpt"]

We have also added a $contains in addition to the $matches predicate so you can do:

select().from({$o: Object}).where({$o1: {<some indexed field>: {$neq null}, content: {$contains: "long"}}}).exec();

Note, the where clause must include at least one indexed property.

Also a heads-up. In the next build we will be pulling the storage drivers into separate files to reduce the size of the core file. This just means you will have to do 1 additional require or script statement.

Let us know your opinion on both the above.

@mingard
Copy link
Author

mingard commented Feb 9, 2017

@anywhichway I was going to ask about splitting up the codebase. Seemed like it could benefit from becoming more modular.

In this example class, how would one set deferKeys and/or skipKeys. In my tests, I've simply been forcing the skipClass by temporarily adjusting the core file, which is obviously not the right way!

class myClass {
  constructor () {
    // Do something
  }
}

@anywhichway
Copy link
Owner

@mingard you set the values like this:

class myClass {
  constructor () {
    // Do something
  }
}
myClass.skipKeys = ["aNonQueryableProperty","anotherNonQueryableProperty"]; // not indexed at all, JOQULAR queries will return empty results
myClass.deferKeys = ["content","summary"]; // you can still reference these in JOQULAR queries

v0.3.0 branch has been pushed but not promoted to master or npm. Give it a shot. May also fix the scope issue #18.

@mingard
Copy link
Author

mingard commented Feb 11, 2017

@anywhichway Will do. I'll also add this functionality to the wrapper

@anywhichway
Copy link
Owner

@mingard cool ... also GREAT news ... figured out how to add full text indexing in a really simple way. Should only take a week to wrap it up. Got an in memory ALPHA working today, just need to persist the index and do some more testing. When done you will be able to specify <class>.fullTextKeys[...] and use a new predicate {$search: <string>}. The string will be tokenized and all documents containing the tokens will be returned (currently unsorted, perhaps later sorted by frequency). Stop words will be modifiable by changing ReasonDB.stopWords array.

@mingard
Copy link
Author

mingard commented Feb 11, 2017

@anywhichway That's incredible! Have you been able to benchmark performance impact? This is seriously awesome.

Regarding loading drivers, what do you think about requiring on evoke with something like:

// Current
ReasonDB.prototype.save = ReasonDB.prototype.insert;
ReasonDB.Store = Store;
ReasonDB.MemStore = MemStore;
ReasonDB.LocalStore = LocalStore;
ReasonDB.LocalForageStore = LocalForageStore;
// New
ReasonDB.JSONBlockStore = () => {
  return require('./drivers/JSONBlockStore')(ReasonDB)
}

This way you can require but only when called

return new ReasonDB(dbpath, key, ReasonDB.JSONBlockStore(), true, true, {saveIndexAsync: true})

@mingard
Copy link
Author

mingard commented Feb 11, 2017

@anywhichway if I console.log object.constructor.skipKeys here I get nothing, however if I log object.skipKeys I get the skipKeys values I set.

I am booting the app and immediately inserting into myClass without any prior query.

@anywhichway
Copy link
Owner

@mingard ... interesting you should propose that. It was the desired approach; however, could not get our packaging to work right. Very focused on having a src version that will load in the browser or on the server without any compile/transform steps when using Chrome. Will take another go at it though, because I do think it is the best approach.

No benchmark yet on the $search at index of query time.

@mingard
Copy link
Author

mingard commented Feb 11, 2017

@anywhichway Ah, I see. I'd be tempted to create a second repository such as reasondb-client or reasondb-browser where you include reasondb as a dependency and commit a compiled build.

@anywhichway
Copy link
Owner

@mingard Do you want an earlier release of full-text indexing or do you want to wait a few days? A version can be pushed today if you wish.

@anywhichway
Copy link
Owner

@mingard I missed this earlier: "@anywhichway if I console.log object.constructor.skipKeys here I get nothing, however if I log object.skipKeys I get the skipKeys values I set." It seems like you might have added skipKeys as an instance property not a static class property. Although your approach may work, it is not thoroughly tested that way and hence not documented. Below is the set-up from our full-text test example. The same approach is used for skipKeys:

// Define a Book class. Classes are optional. ReasonDB can store items of type Object, Array, and Date by default.
class Book {
	constructor(title,summary) {
		this.title = title;
		this.summary = summary;
	}
}
Book.fullTextKeys = ["summary"];


// Insert Objects into the Person index casting it to a Person as it is inserted.
db.insert(
	new Book("JavaScript","This book is a very long book about how to code JavaScript applications."),
	new Book("JavaScript","This book is a very short book on the history of JavaScript."),
	new Book("C#","This book is a very long book about how to code C# applications."),
	new Book("C++","This book is a very long book about how to code C++ applications.")
).into(Book).exec().then(() => {
	db.select({title: {$b: "title"}}).from({$b: Book}).where({$b: {summary: {$search: "code"}}}).exec().then((cursor) => {
		cursor.forEach((row) => {
			console.log(row); //prints 3 rows
		});
	});
	db.select({title: {$b: "title"}}).from({$b: Book}).where({$b: {title: {$contains: "JavaScript"}, summary: {$search: "code"}}}).exec().then((cursor) => {
		cursor.forEach((row) => {
			console.log(row); //prints 1 row
		});
	});
	db.select({title: {$b: "title"}}).from({$b: Book}).where({$b: {summary: {$search: "javascript"}}}).exec().then((cursor) => {
		cursor.forEach((row) => {
			console.log(row); //prints 2 rows
		});
	});
});

@mingard
Copy link
Author

mingard commented Feb 12, 2017

@anywhichway I'd love to see if, but happy to wait.

Re skipKeys issue, I was not inserting an instance of the class, so rather than:

db.insert(
	new Book("JavaScript","This book is a very long book about how to code JavaScript applications."),
	new Book("JavaScript","This book is a very short book on the history of JavaScript."),
	new Book("C#","This book is a very long book about how to code C# applications."),
	new Book("C++","This book is a very long book about how to code C++ applications.")
).into(Book).exec()

I was inserting the data raw, no class:

db.insert(
	{
          "title": "Javascript",
          "This book is a very long book about how to code JavaScript applications."
        }
).into(Book).exec()

@anywhichway
Copy link
Owner

@mingard Aha, try Object.skipKeys = ["summary"] in addition to or in place of Book.skipKeys and let me know what you find out.

Will give the full-text a few more days to settle out then.

@anywhichway
Copy link
Owner

@mingard Pushed v0.3.1 with full text indexing. Let us know how it goes and we will promote accordingly.

@mingard
Copy link
Author

mingard commented Feb 14, 2017

@anywhichway testing now! What would you advise for a standard string search? $matches?

@mingard
Copy link
Author

mingard commented Feb 14, 2017

@anywhichway much much faster! I only have one issue left around this:

If I use deferKeys on insert and don't use deferKeys on select the keys are still not resolved. Is this something that is stored against the index? Is it possible for the select request to override this?

@anywhichway
Copy link
Owner

@mingard the best way to use deferKeys is on the class not an instance. In your case you should add deferKeys or skipKeys to the class Object not instances of Object. If you add them to an instance then the keys should be skipped or deferred for just that instance but resolved for other instances during select. We have not actually tested this use case, since it seems like an odd one and could result in some strange query results that seem inconsistent with underlying data. Strongly recommend you use class level skipping and deferring.

Use $matches for RegExp matching of indexed or deferred keys. Use $contains for exact matching of indexed or deffered keys. Use $search for fullTextKeys. $search does not currently support embedded RegExp, although it might in the future.

@anywhichway
Copy link
Owner

@mingard Think we have done about as much as we can to resolve the size/gc issue that drove this ticket in the first place. We would like to close it out with the push of v0.3.1 to the main branch. Let me know if this is OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants