
dat share is slow with lots of files #915

Open

dcposch opened this issue Jan 14, 2018 · 14 comments

dcposch commented Jan 14, 2018

@feross and i are creating a demo of Simple English Wikipedia hosted on dat

dat share-ing the dump (1 big file) is fast

setup

  • dat version: 13.9.2
  • os version: ubuntu 16.04.3
  • kernel version: 4.4.0-1047-aws
  • hardware: m5.large instance, using an 800GB EBS volume, standard SSD variety

results

$ ls -lha
total 1.2G
-rw-rw-r-- 1 ubuntu ubuntu 1.2G Jan  4  2017 wikipedia_en_simple_all_2017-01.zim
[...]

we are calling dat share on one file, total 1.2GB:

$ dat share
[... runs at about ~150MB/s, finishes in < 10 seconds ...]

dat share-ing the articles (small files) runs very slowly

$ du -sh *
252K	-
1.7G	A
1.3G	I
36K	M
$ time ls -lha A I/m | wc -l
291382
real	0m3.740s

(notice that stat-ing all 300k files only takes <4 seconds, so that's not the bottleneck...)

we are calling dat share on ~300k files, totalling 3GB:

[screenshot of dat share progress]

[... runs at about ~150KB/s, hasn't finished yet ...]

tl;dr: dat share throughput is ~1000x lower with small files

these files are about 10KB on average, and dat share is processing just a couple of them per second

dcposch commented Jan 14, 2018

update: it gets much slower over time

i tried dat share with just a few hundred wiki articles, and it went through them in a few seconds, reporting ~500KB/sec throughput

dat share with 2000 wiki articles slows down to ~150KB/sec throughput, as reported above

the original dat share run, on ~300k articles, has slowed down to 8KB/sec:

[screenshot of dat share progress]

dcposch commented Jan 14, 2018

@mafintosh feross says you know why?

dcposch commented Jan 14, 2018

@feross @mafintosh

update

even with a small number of entries per directory, it still grinds to a halt after a few hundred MB :/

[screenshot of dat share progress]

mafintosh commented Jan 14, 2018 via email

dcposch commented Jan 14, 2018

@mafintosh sure -- i just ran it on Mac using the profiler

it looks like the issue might be in append-tree

[profiler screenshot]

dcposch commented Jan 14, 2018

update: the issue almost certainly involves append-tree

i narrowed it down a bit by adding debug() statements. turns out, for every file dat adds, it waits for process.nextTick hundreds (maybe thousands) of times:

log

  dat-node IMPORT put: /A/Alan_Bennett.html +63ms
  append-tree cache hit, waiting for next tick: 227 +1ms
  append-tree cache hit, waiting for next tick: 1 +0ms
  append-tree cache hit, waiting for next tick: 2 +0ms
  append-tree cache hit, waiting for next tick: 3 +0ms
  append-tree cache hit, waiting for next tick: 4 +0ms
  append-tree cache hit, waiting for next tick: 5 +0ms
  append-tree cache hit, waiting for next tick: 6 +0ms
  append-tree cache hit, waiting for next tick: 7 +0ms
  append-tree cache hit, waiting for next tick: 8 +0ms
  append-tree cache hit, waiting for next tick: 9 +0ms
  append-tree cache hit, waiting for next tick: 10 +0ms
  append-tree cache hit, waiting for next tick: 11 +0ms
  append-tree cache hit, waiting for next tick: 12 +0ms
  append-tree cache hit, waiting for next tick: 13 +0ms
  append-tree cache hit, waiting for next tick: 14 +0ms
  append-tree cache hit, waiting for next tick: 15 +0ms
  append-tree cache hit, waiting for next tick: 16 +0ms
  append-tree cache hit, waiting for next tick: 17 +0ms
  append-tree cache hit, waiting for next tick: 18 +0ms
  append-tree cache hit, waiting for next tick: 19 +0ms
  append-tree cache hit, waiting for next tick: 20 +0ms
  append-tree cache hit, waiting for next tick: 21 +0ms
  append-tree cache hit, waiting for next tick: 22 +0ms
  append-tree cache hit, waiting for next tick: 23 +0ms

code

slightly modified append-tree:

Tree.prototype._getAndDecode = function (seq, opts, cb) {
  if (opts && opts.cached) opts.wait = false

  var self = this
  var cached = this._cache && this._cache.get(seq)
  if (cached) {
    // cache hit: the node is already in memory, but the callback is
    // still deferred by one tick
    debug('cache hit, waiting for next tick: ' + seq)
    return nextTick(cb, null, cached, seq)
  }

  // cache miss: fetch the entry from the hypercore feed, decode it, and
  // populate the cache before calling back
  debug('cache miss: ' + seq)
  this.feed.get(seq, opts, function (err, value) {
    if (err) return cb(err)
    var node = new Node(messages.Node.decode(value), seq)
    if (self._cache) self._cache.set(seq, node)
    cb(null, node, seq)
  })
}
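To see how the cache-hit path above adds up, here is a standalone sketch (hypothetical names, not append-tree's actual code) of a lookup that walks k cached nodes, deferring each hop with process.nextTick the way _getAndDecode does; resolving one put in a large folder can schedule hundreds of these ticks:

```javascript
// Sketch: each cached hop defers its callback with process.nextTick,
// mirroring the cache-hit branch above. A lookup that touches k cached
// nodes therefore schedules k ticks before its final callback fires.
function walkCached (k, done) {
  let hops = 0
  function step () {
    if (hops === k) return done(hops)
    hops++
    process.nextTick(step) // one deferred tick per cached node
  }
  step()
}

walkCached(227, function (ticks) {
  console.log('finished after ' + ticks + ' deferred ticks')
})
```

Each tick is cheap on its own; the problem is that the number of ticks per file grows with the size of the folder.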

dcposch commented Jan 15, 2018

adding n files takes O(n^2) time, O(n^2) space

this looks bad

after further investigation, i found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():

this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

this means that adding n files to a folder, one after another, takes O(n^2) time and produces a hypercore feed with O(n) entries but taking up O(n^2) total space
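The quadratic-space claim above can be checked with simple arithmetic. A sketch (simplified model of the behavior described in this thread, constants made up) of a flat index where entry i lists all i children added so far:

```javascript
// Simplified model: if each put appends an index entry listing every
// current child of the folder, entry i costs about i * bytesPerName
// bytes, so n puts cost ~ bytesPerName * n^2 / 2 bytes in total.
function totalLogBytes (n, bytesPerName) {
  let total = 0
  for (let i = 1; i <= n; i++) {
    total += i * bytesPerName
  }
  return total
}

// doubling the file count roughly quadruples the log size
const ratio = totalLogBytes(2000, 20) / totalLogBytes(1000, 20)
console.log(ratio.toFixed(2)) // → 4.00
```

This matches the debug log below, where the appended entries grow from tens of bytes to a couple of kilobytes as files accumulate.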

...

cc @mafintosh

log

running dat share with added debug logs in append-tree to demonstrate the issue:

the folder contains ~2500 files totalling 30MB

$ rm -rf .dat && DEBUG=append-tree node ~/code/dat/bin/cli.js share --watch=false
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)
Checking for file updates...
Ctrl+C to Exit
  append-tree appending node to hypercore, 58 bytes +0ms
  append-tree appending node to hypercore, 65 bytes +10ms
  append-tree appending node to hypercore, 60 bytes +2ms
  append-tree appending node to hypercore, 71 bytes +3ms
  append-tree appending node to hypercore, 71 bytes +2ms
  append-tree appending node to hypercore, 81 bytes +3ms
  append-tree appending node to hypercore, 71 bytes +6ms
  append-tree appending node to hypercore, 72 bytes +3ms
...
  append-tree appending node to hypercore, 439 bytes +6ms
  append-tree appending node to hypercore, 425 bytes +5ms
  append-tree appending node to hypercore, 419 bytes +5ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)



Metadata created for 357 of 1110 files (2.1 MB/s)
(Calculating file count...)
ADD: A/Albanian.html (2.0 KB)


Ctrl+C to Exit
  append-tree appending node to hypercore, 433 bytes +8ms
  append-tree appending node to hypercore, 439 bytes +4ms
  append-tree appending node to hypercore, 433 bytes +6ms
  append-tree appending node to hypercore, 432 bytes +4ms
  append-tree appending node to hypercore, 428 bytes +4ms
...
  append-tree appending node to hypercore, 1352 bytes +11ms
  append-tree appending node to hypercore, 1368 bytes +10ms
  append-tree appending node to hypercore, 1355 bytes +9ms
  append-tree appending node to hypercore, 1353 bytes +9ms
  append-tree appending node to hypercore, 1360 bytes +8ms
  append-tree appending node to hypercore, 1356 bytes +11ms
  append-tree appending node to hypercore, 1360 bytes +11ms
  append-tree appending node to hypercore, 1359 bytes +9ms
  append-tree appending node to hypercore, 1362 bytes +8ms
  append-tree appending node to hypercore, 1359 bytes +9ms
...
  append-tree appending node to hypercore, 2122 bytes +16ms
  append-tree appending node to hypercore, 2118 bytes +16ms
  append-tree appending node to hypercore, 2119 bytes +16ms
  append-tree appending node to hypercore, 2124 bytes +16ms
  append-tree appending node to hypercore, 2133 bytes +16ms
  append-tree appending node to hypercore, 2125 bytes +15ms
  append-tree appending node to hypercore, 2137 bytes +17ms
  append-tree appending node to hypercore, 2123 bytes +16ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)



Creating metadata for 2067 files (799 KB/s)
[=========================================-] 100%
ADD: A/Alzonne.html (1.9 KB)


Ctrl+C to Exit
  append-tree appending node to hypercore, 2148 bytes +21ms
  append-tree appending node to hypercore, 2140 bytes +16ms
  append-tree appending node to hypercore, 2128 bytes +15ms
  append-tree appending node to hypercore, 57 bytes +9ms
  append-tree appending node to hypercore, 56 bytes +2ms
DCs-MacBook:sample2 dc$

@gustafson

> adding n files takes O(n^2) time, O(n^2) space
>
> this looks bad
>
> after further investigation, i found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():
>
> this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

Hi, just bumping this one after 18 months. Is this something that can be fixed, or is it fundamental? I'm just getting started with dat and was super pumped until I hit this. Honestly, I'm still super pumped about dat, but this one makes one intended use non-feasible. Here's hoping there is still hope for this issue. Thanks all!

pfrazee commented Aug 3, 2019

There's a new version of dat coming that improves performance quite a bit

@RangerMauve

Here's the new module that enables this: https://github.com/mafintosh/hypertrie
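For intuition on why a trie helps (a rough model, not hypertrie's actual code): with a flat index, put number i touches all i existing entries, as observed with append-tree above, while a trie-style index touches on the order of log2(i) nodes, so total work for n puts drops from quadratic to near-linear:

```javascript
// Rough model: total index work for n sequential puts under a flat
// index (O(i) for put i) vs a trie-style index (O(log i) per put).
function flatWork (n) {
  let total = 0
  for (let i = 1; i <= n; i++) total += i
  return total
}

function trieWork (n) {
  let total = 0
  for (let i = 1; i <= n; i++) total += Math.ceil(Math.log2(i + 1))
  return total
}

// at the ~300k files from this issue the gap is several thousandfold
console.log(flatWork(300000) / trieWork(300000))
```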


okdistribute commented Aug 13, 2019

Current timeline is to release the next dat cli with this new update by the end of the year. Until then, you can use hyperdrive-daemon which has most of the basic functionality. Cc @andrewosh

@Koeng101

@okdistribute any update on that new release? I have similar data with lots of files I'd like to share, so I've been waiting a while for this issue to be closed. Thanks

@RangerMauve

@Koeng101 @andrewosh is ironing out the last of the bugs in hyperdrive-daemon, IIRC. It should be good to start testing if you're okay with potentially needing to rebuild your archives in the future. 😁
