Large data collections cause garbage collection errors #21
@mingard you might actually be out of RAM. What is the result of 20,000 * record size? What is the size of the files on disk? Have you tried increasing the Node memory size? http://prestonparry.com/articles/IncreaseNodeJSMemorySize/
@anywhichway I increased the available memory, but it seems that regardless of how much is available, past a certain point the V8 JS Array handler can't cope. Even if I stop the process while the inserts slow down (and they really slow down, hitting 100% CPU) and then start again with a small handful of inserts, the process can't handle it. It looks like some sort of Array sort or fixed-array extension issue.
@mingard Ugh ... loath to switch to Set. It is so slow compared to Array. Will poke around to see if there is anything that can be done.
@mingard If it is insert related, then I think we found the offending block of code. See src/index.js lines 2259 to 2289. This effectively creates an entire copy of the array of objects to be inserted, with a bunch of Promises in it. The class Activity is backed by an Array. However, it may simply be related to the fact that indexes are in RAM and we don't have an LRU purge in place to deal with low-RAM situations. This may take a couple of weeks to fix. Can you provide a single sample record so that we are working with the same size objects?
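The pattern described (building a full parallel array of Promises for the batch) can be illustrated in plain JavaScript. This is a hedged sketch of the general anti-pattern and a lower-memory alternative, not the actual src/index.js code; the function names are ours:

```javascript
// Illustrative sketch only, not the ReasonDB source. Mapping the whole
// batch into Promises up front retains a second full copy of the batch
// (one pending Promise per record) until every insert settles.
function insertAll(records, insertOne) {
  const pending = records.map(record => insertOne(record));
  return Promise.all(pending);
}

// Lower-memory alternative: insert sequentially, so only one record's
// Promise (and its index work) is in flight at a time.
async function insertSequentially(records, insertOne) {
  let count = 0;
  for (const record of records) {
    await insertOne(record);
    count++;
  }
  return count;
}
```

With 20,000 records the first form holds 20,000 pending Promises plus the copied batch; the second keeps peak memory roughly constant at the cost of parallelism.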
@anywhichway regarding the on-insert issue, yes, that is when RAM usage ramps up more than I'd expect. When the process is stopped and started again, RAM usage is around 30 MB until the first insert, where it jumps to around 50% of where it was before the app process exited, so the purge might be necessary regardless. It seems to grab all documents before the first insert, and in some cases that can occupy all available RAM instantly, slowing things to a halt. Example record:

```json
{
  "title": "Lorem Ipsum dolor",
  "subtitle": "Proin ut ex eget velit dignissim",
  "content": [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse eget neque eu lectus dictum placerat. Vestibulum tristique eu elit et mattis. Nulla scelerisque facilisis magna, at porta diam tincidunt egestas. Proin ut ex eget velit dignissim viverra. Curabitur sit amet ultrices lorem. Nulla cursus orci libero, quis finibus ex faucibus at. Proin fringilla eu neque et scelerisque. Sed rutrum ex in magna vulputate tincidunt. Cras hendrerit id dui et imperdiet. Donec imperdiet tortor tincidunt sodales tempor. Proin faucibus mauris molestie pretium laoreet.",
    "Aenean condimentum vel neque vel egestas. Nulla faucibus lacinia est, ut condimentum risus. Cras varius, risus eu mollis semper, velit libero convallis odio, volutpat pretium est purus vitae dui. Etiam tristique purus odio, vitae porttitor odio suscipit ut. Phasellus dapibus, augue dictum vulputate auctor, purus nulla luctus turpis, id dignissim felis nunc sit amet felis. Nulla at lacus a mi eleifend accumsan. Curabitur eu dictum purus. Proin metus purus, blandit et feugiat ut, consequat eu eros. Aenean in purus lacus. Donec suscipit dignissim arcu, in tempor eros dictum eu"
  ],
  "heroImage": [
    "adipiscing"
  ],
  "publications": [
    "condimentum",
    "libero",
    "purus"
  ],
  "categories": [
    "adipiscing",
    "semper",
    "donec"
  ],
  "primarySyndicatePosition": 2,
  "published": {
    "state": "published"
  },
  "embeds": [],
  "furl": "lorem-ipsum-dolor",
  "urls": [
    "lorem/ipsum/dolor",
    "lorem/ipsum/elit",
    "lorem/consectetur/elit"
  ],
  "excerpt": "Cras varius, risus eu mollis semper, velit libero convallis odio"
}
```
@mingard Yeah, it loads the indexes before insert. Have you tried limiting the properties to index? See the README.md: "If you wish to skip indexing certain properties, then add them to a class property called .skipKeys. Alternatively, use .indexKeys and only list the keys you wish to index." In your case I would at a minimum recommend:
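A sketch of that recommendation applied to the sample record above. The skipKeys/indexKeys convention is from the README quoted here; the class name and the helper function are ours for illustration:

```javascript
// Skip the large free-text fields from the sample record so they are
// never indexed (the class name "Article" is just for illustration).
class Article {}
Article.skipKeys = ["content", "excerpt"];
// Or, equivalently, list only the keys worth indexing:
// Article.indexKeys = ["title", "subtitle", "furl"];

// Hypothetical helper showing the effect: the keys that would be indexed
// for a record are its own keys minus the class's skipKeys.
function indexedKeys(record, cls) {
  const skip = cls.skipKeys || [];
  return Object.keys(record).filter(key => !skip.includes(key));
}
```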
The record you show above would actually create an index on "excerpt" that would look like this:
Imagine the size of the index with 20,000 records! BTW, we have done a hacked full text index implementation for large text fields so they can be searched for key words. This results in only one occurrence of any given word, e.g. "condimentum" would only appear once. We could probably get this into production in a few weeks if you need it ... assuming the above
@anywhichway Skipping some keys works, but means that I can't search on those values (unless I'm missing something). 25k records now use around 1 GB when I allow the title, subtitle and content to be indexed but nothing else. The full text index concept sounds great! Is this a reverse index lookup where you'll be storing a single word and a reference to all documents containing that word? If that's the case, on document load you could analyse each field where each search value exists and create an inverse frequency by determining the distance between words. That would give some pretty great results for multi-word search.
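The idea described here, a word-to-documents reverse index, can be sketched in a few lines of plain JavaScript (independent of ReasonDB; function names are ours):

```javascript
// Build a map from each word to the set of document ids containing it.
// Each distinct word is stored once, no matter how often it occurs.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(id);
    }
  }
  return index;
}

// Look up the ids of all documents containing a word.
function lookup(index, word) {
  return [...(index.get(word.toLowerCase()) || [])];
}
```

Because each word maps to document ids rather than positions, the index stays small even for large text fields; ranking by word distance, as suggested above, would require storing positions as well.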
@mingard "but means that I can't search on those values" correct ... we really need to implement proper text search to support your content field. You are correct on the approach. May even collapse into trigrams at some point. Starting work on it next week.
@anywhichway Sounds good. This may also be of interest https://github.com/NaturalNode/natural
@mingard First the bad news ... the code base has changed sufficiently since we built the original full text search that it will be quite an effort to get it in. Next the GOOD NEWS. We have an alternate approach in test right now. We have added a class property called
We have also added a $contains in addition to the $matches predicate so you can do:
Note, the where clause must include at least one indexed property. Also, a heads-up: in the next build we will be pulling the storage drivers into separate files to reduce the size of the core file. This just means you will have to do one additional require or script statement. Let us know your opinion on both of the above.
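Based on the descriptions in this thread ($matches for RegExp matching, $contains for exact matching, per a later comment), the two predicates' semantics can be approximated with plain-JS stand-ins. These are our sketches of the behavior, not ReasonDB internals:

```javascript
// Plain-JS stand-ins for the predicate semantics described in the thread;
// not the ReasonDB implementation.
const matches = (value, regexp) => regexp.test(String(value));   // like $matches
const contains = (values, item) => values.includes(item);        // like $contains
```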
@anywhichway I was going to ask about splitting up the codebase. It seemed like it could benefit from becoming more modular. In this example class, how would one set

```javascript
class myClass {
  constructor () {
    // Do something
  }
}
```
@mingard you set the values like this:
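A hedged reconstruction of the kind of set-up meant here, based on the Book.skipKeys / Object.skipKeys pattern that appears later in the thread; the property values are illustrative:

```javascript
// Assign the configuration as static (class-level) properties rather than
// in the constructor; instances then see them via object.constructor.
class myClass {
  constructor () {
    // Do something
  }
}
myClass.skipKeys = ["content"];   // illustrative values
myClass.indexKeys = ["title"];
```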
v0.3.0 branch has been pushed but not promoted to master or npm. Give it a shot. May also fix the scope issue #18. |
@anywhichway Will do. I'll also add this functionality to the wrapper.
@mingard cool ... also GREAT news ... figured out how to add full text indexing in a really simple way. Should only take a week to wrap it up. Got an in-memory ALPHA working today, just need to persist the index and do some more testing. When done you will be able to specify
@anywhichway That's incredible! Have you been able to benchmark the performance impact? This is seriously awesome. Regarding loading drivers, what do you think about requiring on invoke, with something like:

```javascript
// Current
ReasonDB.prototype.save = ReasonDB.prototype.insert;
ReasonDB.Store = Store;
ReasonDB.MemStore = MemStore;
ReasonDB.LocalStore = LocalStore;
ReasonDB.LocalForageStore = LocalForageStore;

// New
ReasonDB.JSONBlockStore = () => {
  return require('./drivers/JSONBlockStore')(ReasonDB)
}
```

This way you can still require the driver, but only when it is called:

```javascript
return new ReasonDB(dbpath, key, ReasonDB.JSONBlockStore(), true, true, {saveIndexAsync: true})
```
@anywhichway if I console.log object.constructor.skipKeys here I get nothing, however if I log object.skipKeys I get the skipKeys values I set. I am booting the app and immediately inserting into the collection.
@mingard ... interesting you should propose that. It was the desired approach; however, we could not get our packaging to work right. Very focused on having a src version that will load in the browser or on the server without any compile/transform steps when using Chrome. Will take another go at it though, because I do think it is the best approach. No benchmark yet on the $search at index or query time.
@anywhichway Ah, I see. I'd be tempted to create a second repository such as
@mingard Do you want an earlier release of full-text indexing or do you want to wait a few days? A version can be pushed today if you wish.
@mingard I missed this earlier: "@anywhichway if I console.log object.constructor.skipKeys here I get nothing, however if I log object.skipKeys I get the skipKeys values I set." It seems like you might have added skipKeys as an instance property, not a static class property. Although your approach may work, it is not thoroughly tested that way and hence not documented. Below is the set-up from our full-text test example. The same approach is used for skipKeys:
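The distinction being drawn here, a static class property versus an instance property, can be demonstrated in plain JavaScript. The Book fields and the "summary" key follow the examples elsewhere in the thread; this is our sketch, not the elided test set-up:

```javascript
class Book {
  constructor(title, summary) {
    this.title = title;
    this.summary = summary;
  }
}
// Static class property: visible as object.constructor.skipKeys.
Book.skipKeys = ["summary"];

const book = new Book("JavaScript", "A very long book.");
console.log(book.constructor.skipKeys); // the static, class-level value
console.log(book.skipKeys);             // undefined: not an instance property
```

This matches the behavior reported above: logging object.constructor.skipKeys shows nothing only when the keys were set on the instance instead of the class.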
@anywhichway I'd love to see it, but happy to wait. Re

```javascript
db.insert(
  new Book("JavaScript","This book is a very long book about how to code JavaScript applications."),
  new Book("JavaScript","This book is a very short book on the history of JavaScript."),
  new Book("C#","This book is a very long book about how to code C# applications."),
  new Book("C++","This book is a very long book about how to code C++ applications.")
).into(Book).exec()
```

I was inserting the data raw, no class:

```javascript
db.insert(
  {
    "title": "Javascript",
    "summary": "This book is a very long book about how to code JavaScript applications."
  }
).into(Book).exec()
```
@mingard Aha, try Object.skipKeys = ["summary"] in addition to or in place of Book.skipKeys and let me know what you find out. Will give the full-text a few more days to settle out then. |
@mingard Pushed v0.3.1 with full text indexing. Let us know how it goes and we will promote accordingly. |
@anywhichway testing now! What would you advise for a standard string search?
@anywhichway much, much faster! I only have one issue left, around this: If I use
@mingard the best way to use deferKeys is on the class, not an instance. In your case you should add deferKeys or skipKeys to the class Object, not instances of Object. If you add them to an instance, then the keys should be skipped or deferred for just that instance but resolved for other instances during select. We have not actually tested this use case, since it seems like an odd one and could result in strange query results that seem inconsistent with the underlying data. I strongly recommend class-level skipping and deferring.

Use:
- $matches for RegExp matching of indexed or deferred keys.
- $contains for exact matching of indexed or deferred keys.
- $search for fullTextKeys. $search does not currently support embedded RegExp, although it might in the future.
@mingard Think we have done about as much as we can to resolve the size/gc issue that drove this ticket in the first place. We would like to close it out with the push of v0.3.1 to the main branch. Let me know if this is OK. |
Using:
- JSONBlockStore
- Number of records: 20,000
- Node: 6.9 and 7.5 (--harmony-async-await, using /src/index.js)

When inserting large recordsets into a single collection I get the following error. I have tried inserting in batches, stopping and starting the app between batches, but I never get above 20k.