This repository has been archived by the owner on Feb 16, 2020. It is now read-only.

[LocalDB] OutOfMemory when creating a new DB #101

Closed
gabbello opened this issue Dec 30, 2013 · 20 comments

Comments

@gabbello

Note that I'm running gekko on a low-memory device (Raspberry Pi, 256 MB).

The localDB instance I started ran for about 16 hours and then crashed while creating the file for the new day, with the following message:

2013-12-30 01:59:56 (DEBUG):    inserting candle 1438 (23:58:00 UTC) vol: 10.772895739999997 2013-12-29
2013-12-30 01:59:56 (DEBUG):    Leftovers: 1439
2013-12-30 02:01:13 (DEBUG):    Scheduling next fetch: in 1 minutes
2013-12-30 02:01:13 (DEBUG):    Fetched 150 new trades, from 2013-12-29  23:54:26 (UTC) to 2013-12-30 00:01:05 (UTC)
2013-12-30 02:01:13 (DEBUG):    minimum trade treshold: 2013-12-29 23:59:53
2013-12-30 02:01:13 (DEBUG):    processing 23 trades
2013-12-30 02:01:13 (DEBUG):    This batch includes trades for a new day.
2013-12-30 02:01:13 (DEBUG):    Creating a new daily database for day 2013-12-30
FATAL ERROR: JS Allocation failed - process out of memory

When I restarted the instance this morning the first lines were the following (I presume this means the system still uses the file for the 29th, but makes a new one starting from the first candle for the 30th):

2013-12-30 10:41:07 (INFO):     I'm gonna make you rich, Bud Fox.
2013-12-30 10:41:07 (INFO):     Let me show you some Exponential Moving Averages.


2013-12-30 10:41:07 (INFO):     Using normal settings to monitor the live market
2013-12-30 10:41:07 (INFO):     NOT trading with real money
2013-12-30 10:41:10 (INFO):     Starting to watch the market: BTC-e USD/BTC
2013-12-30 10:41:11 (DEBUG):    Scheduling next fetch: in 2 minutes
2013-12-30 10:41:11 (DEBUG):    Fetched 150 new trades, from 2013-12-30 08:31:38 (UTC) to 2013-12-30 08:40:49 (UTC)
2013-12-30 10:41:12 (WARN):     Found a corrupted database ( 2013-12-30 ), going to clean it up
2013-12-30 10:41:12 (DEBUG):    This should not happen, please post details here: https://github.com/askmike/gekko/issues/90
2013-12-30 10:41:12 (INFO):     No history found, starting to build one now
2013-12-30 10:41:12 (DEBUG):    Creating a new daily database for day 2013-12-30
2013-12-30 10:41:12 (DEBUG):    minimum trade treshold: 2013-12-30 00:00:00
2013-12-30 10:41:12 (DEBUG):    processing 150 trades
2013-12-30 10:41:12 (DEBUG):    inserting candle 511 (08:31:00 UTC) vol: 10.229696559999999 2013-12-30
2013-12-30 10:41:12 (DEBUG):    inserting candle 512 (08:32:00 UTC) vol: 52.31596399999999 2013-12-30

After restarting the instance, the memory usage was:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16216 pi 20 0 82756 20m 6588 S 0,0 9,8 0:09.55 node

Note that yesterday (after about 10 hours of running) memory usage was about 4 times higher than it is now.

Over the same interval I was, and still am, running a master branch gekko without any problems.

@askmike
Owner

askmike commented Dec 30, 2013

What is your interval (this determines the amount of history Gekko will try to keep in memory)?

I am currently running multiple instances with memory logging each fetch cycle. The memory does not appear to be leaking right now (though the memory footprint is indeed quite large). I will keep this running for a couple of hours to see what happens.
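
For reference, the per-fetch memory logging is nothing fancier than something like this (a sketch only - the fetch hook and the interval are illustrative, not the actual watcher code):

// Sketch: log Node's memory usage once per fetch cycle.
// The fetch hook and the 60 second interval are illustrative, not the real watcher.
var util = require('util');

setInterval(function() {
  // ... fetch new trades here, as the watcher normally does ...

  var mem = process.memoryUsage();
  console.log(util.format(
    '(MEM)\trss: %d MB\theapUsed: %d MB',
    Math.round(mem.rss / 1048576),
    Math.round(mem.heapUsed / 1048576)
  ));
}, 60 * 1000);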

Here are some things I already know:

  • The localDB uses neDB* under the hood to persist data to disk.
  • neDB keeps all loaded databases in memory.
  • To make the memory footprint manageable Gekko divides all historical data into daily databases, which we can load on the fly (in theory).
  • [LEAK] Currently we do not unload days out of memory; doing so could drastically decrease the memory footprint. Unloading databases as soon as we move to a new day is not hard, I just need to verify that garbage collection happens on time etc. Gekko already calculates metadata for every daily database; once we have that we can unload a database unless it's the current day (in which case we still need to insert new candles into it). See the sketch after this list.
  • If there are other leaks they do not necessarily come from anything related to storing the history.
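
Roughly what I have in mind for the unloading, as a sketch: nedb has no explicit unload call, so "unloading" a day just means dropping our reference to its Datastore and letting GC reclaim it. The names below (openDays, getDay, unloadOldDays) are illustrative, not the actual databaseManager internals.

var Datastore = require('nedb');

var openDays = {};   // day string ('2013-12-30') -> nedb Datastore

function getDay(day) {
  if(!openDays[day])
    openDays[day] = new Datastore({filename: 'history/' + day + '.db', autoload: true});
  return openDays[day];
}

function unloadOldDays(currentDay) {
  Object.keys(openDays).forEach(function(day) {
    // keep only the current day in memory, we still append candles to it
    if(day !== currentDay)
      delete openDays[day];
  });
}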

*Why neDB?

I need to persist historical data and these are the limitations:

  • Should not require third party database that needs to be installed.
  • Should work cross platform (eg. on Windows).
  • Should be lightweight because it only needs to do these things:
    • append new 1m candles.
    • read all candles from a day.

That leaves me with 4 options I think:

  • neDB (similar to limited mongo)
  • levelUP (key/value store that needs to be compiled, there are JS based k/v stores as well)
  • MongoHQ (external hosting, free up to 500MB, requires registration and depends on a third party).
  • Roll my own: as I am thinking about the minimal requirements it should be trivial to implement. But then again, I also thought the same thing for the databaseManager (which has too many edge cases and is already over 1k LOC).

I am currently leaning towards the last option - does anybody have any other ideas?

If anybody wants to discuss: get on IRC: #gekkobot (freenode)!

@gabbello
Author

I kept the standard settings:

// Exponential Moving Averages settings:
config.EMA = {
// timeframe per candle
interval: 2, // in minutes
// EMA weight (α)
// the higher the weight, the more smooth (and delayed) the line
short: 10,
long: 21,
// amount of candles to remember and base initial EMAs on
candles: 100,
// the difference between the EMAs (to act as triggers)
sellTreshold: -0.25,
buyTreshold: 0.25
};

I will let this run until the end of the day to see if the problem recurs when the new file needs to be created.

@yin
Contributor

yin commented Dec 30, 2013

Pure time-series storage? Tough one. I have no experience here, but a few guys recommended looking into these two:

https://github.com/creationix/nstore
http://dev.nuodb.com/techblog/getting-started-nodejs-nuodb


@gabbello
Author

So same error today when the new file is created:

2013-12-31 02:00:16 (DEBUG):    inserting candle 1439 (23:59:00 UTC) vol: 7.035317699999999 2013-12-30
2013-12-31 02:00:16 (DEBUG):    Creating a new daily database for day 2013-12-31
FATAL ERROR: JS Allocation failed - process out of memory

A couple of hours before this the mem usage was:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16216 pi 20 0 110m 62m 2820 S 0.0 29.6 3:09.95 node

@yin
Contributor

yin commented Dec 31, 2013

What commit are you on?

@gabbello
Author

I'm on this one 2d9f142. I'll get the latest version.

@askmike
Owner

askmike commented Dec 31, 2013

@gabbello we are working on replacing the biggest memory hog in the localDB version (neDB) with our own solution.

@yin thanks for the tests! I went a little further with your tests and found an even more compact solution than storing in BSON: store in CSV and gzip it (using node's zlib) before saving it to disk. The tests are a mess right now because of my experimenting, but this is a really simple implementation hacked together:

https://gist.github.com/askmike/8191017

What do you think? (I could add some basic state to keep days 'open' by holding them in memory; if we need to append a new candle we can push to the array in memory and use the write method.) I haven't tested a lot of things yet.


EDIT: @gabbello note that the localDB branch is still using neDB so you probably won't notice a difference yet.
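
In case the gist moves, the idea is roughly the following sketch (illustrative function names and candle fields, not the gist's exact code): serialize a day of candles to CSV, gzip it with node's zlib and write one file per day.

var zlib = require('zlib');
var fs = require('fs');

function writeDay(day, candles, callback) {
  // candles: array of {start, open, high, low, close, volume} objects
  var csv = candles.map(function(c) {
    return [c.start, c.open, c.high, c.low, c.close, c.volume].join(',');
  }).join('\n');

  zlib.gzip(new Buffer(csv), function(err, compressed) {
    if(err) return callback(err);
    fs.writeFile('history/' + day + '.csv.gz', compressed, callback);
  });
}

function readDay(day, callback) {
  fs.readFile('history/' + day + '.csv.gz', function(err, compressed) {
    if(err) return callback(err);
    zlib.gunzip(compressed, function(err, buffer) {
      if(err) return callback(err);
      var candles = buffer.toString().split('\n').map(function(line) {
        var p = line.split(',');
        return {start: p[0], open: +p[1], high: +p[2], low: +p[3], close: +p[4], volume: +p[5]};
      });
      callback(null, candles);
    });
  });
}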

@yin
Contributor

yin commented Dec 31, 2013

The latest was broken when I ran it; I just want to find the latest usable version. Mike is looking into alternative DB options, CPU/memory-efficient BSON being one.

@gabbello
Author

OK, thanks guys, I will wait for a new version without neDB.

@yin
Contributor

yin commented Dec 31, 2013

Looks great Mike, actually. For the moment gzip is fine; native zlib should also be faster than a JS implementation of LZMA, my favorite compression algorithm.

In my company they generate GBs of CSVs a day and they keep trading, so this might be the way to go.

Let me know if you need me on this tomorrow.


https://github.com/nmrugg/LZMA-JS and others, if someone asks.

@askmike
Owner

askmike commented Dec 31, 2013

Ah, I am not sure whether the one in node core is JS based or native (the function expects a callback, which might hint that it is at least being compressed in another thread), but for now that probably does not matter much.

If you have some time available tomorrow that would be awesome! I have the entire daily / minutely DB implementation in the codebase in my head, so I think it's best to let me worry about the state of days in memory, etc. But if you want to help out we need some tests for the functions I wrote in the gist (the same as you did before, just without BSON; maybe you can use my Store straight away).

@djmuk
Contributor

djmuk commented Dec 31, 2013

Can I suggest that you DON'T compress the on-disk files within gekko? One of the advantages of the current db is that it is human readable and relatively easy to analyse, in Excel for example. So I would suggest sticking to a simple text based format (CSV sounds good!); management of disk space can be done externally, in a similar way to log file rotation on unix systems.

@yin
Contributor

yin commented Dec 31, 2013

So you want a switch to turn off automatic compression?
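
Something like this hypothetical flag, for example (it does not exist in the config yet, just to make the idea concrete):

// Hypothetical only: this flag is not part of Gekko's config.
config.history = {
  // store daily files as plain CSV (false) or gzipped CSV (true)
  compress: true
};

// the (equally hypothetical) store would then branch on it:
// gzip the CSV before writing when compress is true, write it as-is otherwise.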

@djmuk
Contributor

djmuk commented Dec 31, 2013

@yin - why compress anyway? The data isn't that big and requirements will vary, so you will just end up with too many options. On my desktop I am not bothered and can keep any number of files; on a Pi I may want compression and to keep a minimum number of files. So handle it like log files - in fact you could probably use the log rotation engine on a unix system - and do it outside of gekko. On my machine the current DB files are 113KB and compress down to 47KB.

Thinking about the process, I am not sure that any data needs to be held in memory at all (assuming a CSV): to store a new candle you are just appending a new line to a file (and the candle itself is passed around internally as a data structure), so there is no database / memory requirement there. The only time the history is accessed is on startup, so this is an infrequent operation that can again be handled as file operations with no need to cache the data in memory - just provide methods to retrieve historical data from the file.
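
To make that concrete, a minimal sketch of the append-only approach, assuming plain CSV day files (the file layout and candle fields are just illustration):

var fs = require('fs');

// storing a new candle is a one-line append, nothing is kept in memory
function appendCandle(day, candle, callback) {
  var line = [candle.start, candle.open, candle.high, candle.low, candle.close, candle.volume].join(',') + '\n';
  fs.appendFile('history/' + day + '.csv', line, callback);
}

// on startup the whole day is read back with a single file read
function readDay(day, callback) {
  fs.readFile('history/' + day + '.csv', 'utf8', function(err, csv) {
    if(err) return callback(err);
    callback(null, csv.trim().split('\n').map(function(line) {
      return line.split(',');
    }));
  });
}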

@askmike
Owner

askmike commented Dec 31, 2013

@djmuk when storing the data we have to make a tradeoff between stuff to keep in memory, disk space usage and CPU power. I don't know what systems people are using to run Gekko, so I put this question on the forum. If we did not compress the history we would gain readability of the files* and save CPU power, while losing disk space.

I think backtesting and looking at historical data is of extreme importance. I have some far-future ideas of Gekko connecting online to grab a trading method (maybe even eToro style), but until then it is important to keep as much history around as possible IMO. Log rotation would kill this.

Thinking about the process, I am not sure that any data needs to be held in memory at all (assuming a CSV): to store a new candle you are just appending a new line to a file (and the candle itself is passed around internally as a data structure), so there is no database / memory requirement there.

True, and one downside of compression is that we can't just append, so we need to keep at most the rest of the day in memory (which for a full day is ~500KB as a JS array, probably smaller as a CSV string). I'm not sure whether this outweighs the benefit of getting data on disk that is more than 2 times smaller.

The only time the history is accessed is on startup, so this is an infrequent operation that can again be handled as file operations with no need to cache the data in memory - just provide methods to retrieve historical data from the file.

Right now it's infrequent (but not per se on startup: in the scenario where the full history is not available on startup it will come in as soon as it's ready). So the only data that needs to be cached is the current day, if we are watching a market (i.e. inserting new candles every minute), so we can easily append. But I do have plans for building other things on top of Gekko, like a web GUI; in that scenario we need to access the data quite often to draw charts, etc.

TLDR:

Arguments for compressing:

  • Smaller files on disk

Arguments against compressing:

  • More CPU intensive
  • Bigger memory footprint
  • More things that can go wrong

Why I think small files are important:

  • EMA is a simple indicator that only looks at the price of a single market. As soon as we want more advanced stuff like watching multiple markets for correlation / arbitrage, you are storing data for multiple markets each day.
  • I want people to run Gekko for long periods of time; decreasing the file size by a factor of 2 means you're able to store twice as much without changing anything.
  • I think it's extremely important to store historical data straight from the exchange, for backtesting purposes as well as others.
  • *I don't like looking at the raw files to see what the data looks like (especially with neDB, as the data isn't sorted); we should create easy tools to turn a day into a candle chart, runnable from the command line or a GUI.

@yin
Contributor

yin commented Dec 31, 2013

I can't yet post to the forum. I run gekko on a reserved AWS micro instance (500MB RAM, 8GB disk volume); there, CPU time costs outscale storage. A Raspberry Pi is coming in soon - 265 * 0.5MB = 133MB per year per market. I also think of people storing the whole orderbook in the future, not just aggregated candles.

@djmuk
Contributor

djmuk commented Dec 31, 2013

@yin - are you saying it is cheaper to store raw data than to spend the CPU power compressing it?

My example of log rotation was to illustrate that it isn't syslogd's problem to compress or manage the data files; it is handled externally, and I think gekko should do the same and leave it to an external process (even if it's just a cron job/script included in the distribution). After all, on Windows I could just dump them into a compressed folder if I was worried about space. 100KB/day = 36MB/year; given the way storage costs scale I could store 100-1000x that and not worry.

I think it is MORE important that the data is stored in a transparent format such as CSV: as you say, historical data is important, so it needs to be accessible independently of gekko.

@askmike
Owner

askmike commented Dec 31, 2013

@djmuk I'd rather write a 20 LOC script that converts Gekko's storage into a more general format than let transparency of the format drive the design: if there ever comes a new JS based datastore that's super fast and does exactly what we need, I don't want to be stuck with a 'transparent format'. Because of CPU concerns we can make compressing optional, that's fine.

Also, looking at the data by running something that draws a chart beats opening raw database files.
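
For example, the kind of conversion script I have in mind, assuming gzipped CSV day files (the layout is not final), would be little more than:

// ungzip-day.js - turn a gzipped day file back into plain CSV.
// Usage: node ungzip-day.js history/2013-12-30.csv.gz
// (the .csv.gz layout is an assumption, not the final format)
var fs = require('fs');
var zlib = require('zlib');

var file = process.argv[2];

fs.createReadStream(file)
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream(file.replace(/\.gz$/, '')));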

But the 100KB per day is for a single market. If you want to watch BTC-e (18 markets) that's almost 2MB per day, ~700MB per year. And I don't even dare to count how many markets Cryptsy has (I think close to or over 100, which would be 10MB per day, ~3.6GB per year).

Why do we want to store more than 1 market? Right now the only method (EMA) is extremely simple; more advanced methods correlate between different markets (especially ones where FIAT/BTC markets create trends that ripple through the rest). Also arbitrage.

I don't think we should optimize Gekko for storing this amount of data, but it shouldn't eat up your hard drive if you want to watch 1 exchange IMO.

@yin
Contributor

yin commented Dec 31, 2013

I've done the math again: the CPU time wasted by gzip is negligible.

On handling it externally like log rotation: yup, I got it. But who would decompress the history when gekko needs it? We could think about it the other way around :) Have cron decompress the history and store it in your work folder.

On storing the data in a transparent format such as CSV: fully agree. Anyway, this is getting into preferences and local setups, let's not waste time chatting.

@mike Let me make compression configurable, when time allows.

@askmike
Owner

askmike commented Jun 6, 2016

fixed.

@askmike askmike closed this as completed Jun 6, 2016