
dat share is slow with lots of files #915

Open

dcposch opened this issue Jan 14, 2018 · 14 comments

dcposch commented Jan 14, 2018

@feross and i are creating a demo of Simple English Wikipedia hosted on dat

dat share-ing the dump (1 big file) is fast

setup

  • dat version: 13.9.2
  • os version: ubuntu 16.04.3
  • kernel version: 4.4.0-1047-aws
  • hardware: m5.large instance, using an 800GB EBS volume, standard SSD variety

results

$ ls -lha
total 1.2G
-rw-rw-r-- 1 ubuntu ubuntu 1.2G Jan  4  2017 wikipedia_en_simple_all_2017-01.zim
[...]

we are calling dat share on one file, total 1.2GB:

$ dat share
[... runs at about ~150MB/s, finishes in < 10 seconds ...]

dat share-ing the articles (small files) runs very slowly

$ du -sh *
252K	-
1.7G	A
1.3G	I
36K	M
$ time ls -lha A I/m | wc -l
291382
real	0m3.740s

(notice that stat-ing all 300k files only takes <4 seconds, so that's not the bottleneck...)

we are calling dat share on ~300k files, totalling 3GB:

[screenshot of dat share progress]

[... runs at about ~150KB/s, hasn't finished yet ...]

tl;dr: dat share throughput is ~1000x lower with small files

these files are about 10KB on average, and dat share is processing just a couple of them per second

dcposch commented Jan 14, 2018

update: it gets much slower over time

i tried dat share with just a few hundred wiki articles, and it went through them in a few seconds, reporting ~500KB/sec throughput

dat share with 2000 wiki articles slows down to ~150KB/sec throughput, as reported above

the original dat share run, on ~300k articles, has slowed down to 8KB/sec:

[screenshot of dat share progress]

dcposch commented Jan 14, 2018

@mafintosh feross says you know why?

dcposch commented Jan 14, 2018

@feross @mafintosh

update

even with a small number of entries per directory, it still grinds to a halt after a few hundred MB :/

[screenshot of dat share progress]

mafintosh commented Jan 14, 2018 via email

dcposch commented Jan 14, 2018

@mafintosh sure -- i just ran it on Mac using the profiler

it looks like the issue might be in append-tree

[profiler screenshot]

dcposch commented Jan 14, 2018

update: the issue almost certainly involves append-tree

i narrowed it down a bit by adding debug() statements. turns out, for every file dat adds, it waits for process.nextTick hundreds (maybe thousands) of times:

log

  dat-node IMPORT put: /A/Alan_Bennett.html +63ms
  append-tree cache hit, waiting for next tick: 227 +1ms
  append-tree cache hit, waiting for next tick: 1 +0ms
  append-tree cache hit, waiting for next tick: 2 +0ms
  append-tree cache hit, waiting for next tick: 3 +0ms
  append-tree cache hit, waiting for next tick: 4 +0ms
  append-tree cache hit, waiting for next tick: 5 +0ms
  append-tree cache hit, waiting for next tick: 6 +0ms
  append-tree cache hit, waiting for next tick: 7 +0ms
  append-tree cache hit, waiting for next tick: 8 +0ms
  append-tree cache hit, waiting for next tick: 9 +0ms
  append-tree cache hit, waiting for next tick: 10 +0ms
  append-tree cache hit, waiting for next tick: 11 +0ms
  append-tree cache hit, waiting for next tick: 12 +0ms
  append-tree cache hit, waiting for next tick: 13 +0ms
  append-tree cache hit, waiting for next tick: 14 +0ms
  append-tree cache hit, waiting for next tick: 15 +0ms
  append-tree cache hit, waiting for next tick: 16 +0ms
  append-tree cache hit, waiting for next tick: 17 +0ms
  append-tree cache hit, waiting for next tick: 18 +0ms
  append-tree cache hit, waiting for next tick: 19 +0ms
  append-tree cache hit, waiting for next tick: 20 +0ms
  append-tree cache hit, waiting for next tick: 21 +0ms
  append-tree cache hit, waiting for next tick: 22 +0ms
  append-tree cache hit, waiting for next tick: 23 +0ms

code

slightly modified append-tree:

Tree.prototype._getAndDecode = function (seq, opts, cb) {
  if (opts && opts.cached) opts.wait = false

  var self = this
  var cached = this._cache && this._cache.get(seq)
  if (cached) {
    // cache hit: the node is already in memory, but the callback is
    // still deferred by one tick
    debug('cache hit, waiting for next tick: ' + seq)
    return nextTick(cb, null, cached, seq)
  }

  // cache miss: fetch the entry from the hypercore feed, decode it, and
  // populate the cache before calling back
  debug('cache miss: ' + seq)
  this.feed.get(seq, opts, function (err, value) {
    if (err) return cb(err)
    var node = new Node(messages.Node.decode(value), seq)
    if (self._cache) self._cache.set(seq, node)
    cb(null, node, seq)
  })
}
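To see how the cache-hit path above adds up, here is a standalone sketch (hypothetical names, not append-tree's actual code) of a lookup that walks k cached nodes, deferring each hop with process.nextTick the way _getAndDecode does; resolving one put in a large folder can schedule hundreds of these ticks:

```javascript
// Sketch: each cached hop defers its callback with process.nextTick,
// mirroring the cache-hit branch above. A lookup that touches k cached
// nodes therefore schedules k ticks before its final callback fires.
function walkCached (k, done) {
  let hops = 0
  function step () {
    if (hops === k) return done(hops)
    hops++
    process.nextTick(step) // one deferred tick per cached node
  }
  step()
}

walkCached(227, function (ticks) {
  console.log('finished after ' + ticks + ' deferred ticks')
})
```

Each tick is cheap on its own; the problem is that the number of ticks per file grows with the size of the folder.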

dcposch commented Jan 15, 2018

adding n files takes O(n^2) time, O(n^2) space

this looks bad

after further investigation, i found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():

this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

this means that adding n files to a folder, one after another, takes O(n^2) time and produces a hypercore feed with O(n) entries but taking up O(n^2) total space
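The quadratic-space claim above can be checked with simple arithmetic. A sketch (simplified model of the behavior described in this thread, constants made up) of a flat index where entry i lists all i children added so far:

```javascript
// Simplified model: if each put appends an index entry listing every
// current child of the folder, entry i costs about i * bytesPerName
// bytes, so n puts cost ~ bytesPerName * n^2 / 2 bytes in total.
function totalLogBytes (n, bytesPerName) {
  let total = 0
  for (let i = 1; i <= n; i++) {
    total += i * bytesPerName
  }
  return total
}

// doubling the file count roughly quadruples the log size
const ratio = totalLogBytes(2000, 20) / totalLogBytes(1000, 20)
console.log(ratio.toFixed(2)) // → 4.00
```

This matches the debug log below, where the appended entries grow from tens of bytes to a couple of kilobytes as files accumulate.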

...

cc @mafintosh

log

running dat share with added debug logs in append-tree to demonstrate the issue:

the folder contains ~2500 files totalling 30MB

$ rm -rf .dat && DEBUG=append-tree node ~/code/dat/bin/cli.js share --watch=false
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)
Checking for file updates...
Ctrl+C to Exit
  append-tree appending node to hypercore, 58 bytes +0ms
  append-tree appending node to hypercore, 65 bytes +10ms
  append-tree appending node to hypercore, 60 bytes +2ms
  append-tree appending node to hypercore, 71 bytes +3ms
  append-tree appending node to hypercore, 71 bytes +2ms
  append-tree appending node to hypercore, 81 bytes +3ms
  append-tree appending node to hypercore, 71 bytes +6ms
  append-tree appending node to hypercore, 72 bytes +3ms
...
  append-tree appending node to hypercore, 439 bytes +6ms
  append-tree appending node to hypercore, 425 bytes +5ms
  append-tree appending node to hypercore, 419 bytes +5ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)



Metadata created for 357 of 1110 files (2.1 MB/s)
(Calculating file count...)
ADD: A/Albanian.html (2.0 KB)


Ctrl+C to Exit
  append-tree appending node to hypercore, 433 bytes +8ms
  append-tree appending node to hypercore, 439 bytes +4ms
  append-tree appending node to hypercore, 433 bytes +6ms
  append-tree appending node to hypercore, 432 bytes +4ms
  append-tree appending node to hypercore, 428 bytes +4ms
...
  append-tree appending node to hypercore, 1352 bytes +11ms
  append-tree appending node to hypercore, 1368 bytes +10ms
  append-tree appending node to hypercore, 1355 bytes +9ms
  append-tree appending node to hypercore, 1353 bytes +9ms
  append-tree appending node to hypercore, 1360 bytes +8ms
  append-tree appending node to hypercore, 1356 bytes +11ms
  append-tree appending node to hypercore, 1360 bytes +11ms
  append-tree appending node to hypercore, 1359 bytes +9ms
  append-tree appending node to hypercore, 1362 bytes +8ms
  append-tree appending node to hypercore, 1359 bytes +9ms
...
  append-tree appending node to hypercore, 2122 bytes +16ms
  append-tree appending node to hypercore, 2118 bytes +16ms
  append-tree appending node to hypercore, 2119 bytes +16ms
  append-tree appending node to hypercore, 2124 bytes +16ms
  append-tree appending node to hypercore, 2133 bytes +16ms
  append-tree appending node to hypercore, 2125 bytes +15ms
  append-tree appending node to hypercore, 2137 bytes +17ms
  append-tree appending node to hypercore, 2123 bytes +16ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)



Creating metadata for 2067 files (799 KB/s)
[=========================================-] 100%
ADD: A/Alzonne.html (1.9 KB)


Ctrl+C to Exit
  append-tree appending node to hypercore, 2148 bytes +21ms
  append-tree appending node to hypercore, 2140 bytes +16ms
  append-tree appending node to hypercore, 2128 bytes +15ms
  append-tree appending node to hypercore, 57 bytes +9ms
  append-tree appending node to hypercore, 56 bytes +2ms
DCs-MacBook:sample2 dc$

@gustafson

> adding n files takes O(n^2) time, O(n^2) space
>
> this looks bad
>
> after further investigation, i found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():
>
> this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

Hi, just bumping this one after 18 months. Is this something that can be fixed, or is it fundamental? I'm just getting started with dat and was super pumped until I hit this. Honestly, I'm still super pumped about dat, but this one makes one intended use non-feasible. Here's hoping there is still hope for this issue. Thanks all!

pfrazee commented Aug 3, 2019

There's a new version of dat coming that improves performance quite a bit

@RangerMauve

Here's the new module that enables this: https://github.com/mafintosh/hypertrie
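For intuition on why a trie helps (a rough model, not hypertrie's actual code): with a flat index, put number i touches all i existing entries, as observed with append-tree above, while a trie-style index touches on the order of log2(i) nodes, so total work for n puts drops from quadratic to near-linear:

```javascript
// Rough model: total index work for n sequential puts under a flat
// index (O(i) for put i) vs a trie-style index (O(log i) per put).
function flatWork (n) {
  let total = 0
  for (let i = 1; i <= n; i++) total += i
  return total
}

function trieWork (n) {
  let total = 0
  for (let i = 1; i <= n; i++) total += Math.ceil(Math.log2(i + 1))
  return total
}

// at the ~300k files from this issue the gap is several thousandfold
console.log(flatWork(300000) / trieWork(300000))
```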


okdistribute commented Aug 13, 2019

Current timeline is to release the next dat cli with this new update by the end of the year. Until then, you can use hyperdrive-daemon which has most of the basic functionality. Cc @andrewosh

@Koeng101

@okdistribute any update on that new release? I have similar data with lots of files I'd like to share, so I've been waiting a while for this issue to be closed. Thanks

@RangerMauve

@Koeng101 @andrewosh is ironing out the last of the bugs in hyperdrive-daemon, IIRC. It should be good to start testing if you're okay with potentially needing to rebuild your archives in the future. 😁
