Investigate ReGrid Performance vs Node #96

Open
bchavez opened this issue Aug 24, 2016 · 0 comments

bchavez commented Aug 24, 2016

It looks like Node ReGrid gets about 3x more writes than .NET ReGrid, which yields a faster upload wall time. See the image below (credit: @buskila):

[image: pasted_image_at_2016_08_24_13_03]

Test setup

Upload only:
File Size: 1 GB.
Server: RethinkDB / Linux / Ubuntu 14, 3 nodes
Client: .NET Core / Linux

Chunk Size: Default
Batch Size: Default 8 -> 32

They tried both a single connection and connection pooling. No difference.

Using Stream IO:

// Upload a file using an IO stream
Guid uploadId;
using( var fileStream = File.Open("C:\\video.mp4", FileMode.Open) )
using( var uploadStream = bucket.OpenUploadStream("/video.mp4") )
{
    uploadId = uploadStream.FileInfo.Id;
    fileStream.CopyTo(uploadStream);
}

Suspicion

Too much chunk calculation in stream upload code. Try to parallelize / simplify some of this, especially when given byte[].
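
One way the byte[] case could be simplified, as a rough sketch only: slice the array into `ArraySegment<byte>` views instead of copying each chunk into its own buffer. `ByteChunker` and `Slice` are hypothetical names for illustration, not part of the current ReGrid code, and whether copying is actually the bottleneck is still unverified.

```csharp
using System;
using System.Collections.Generic;

public static class ByteChunker
{
    // Yields views over `data`, each at most chunkSize bytes long.
    // No bytes are copied; every segment points into the original array.
    public static IEnumerable<ArraySegment<byte>> Slice(byte[] data, int chunkSize)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        return SliceIterator(data, chunkSize);
    }

    private static IEnumerable<ArraySegment<byte>> SliceIterator(byte[] data, int chunkSize)
    {
        var offset = 0;
        while (offset < data.Length)
        {
            // The last segment may be shorter than chunkSize.
            var count = Math.Min(chunkSize, data.Length - offset);
            yield return new ArraySegment<byte>(data, offset, count);
            offset += count;
        }
    }
}
```

Each segment could then be handed to a writer without any per-chunk allocation, which would also make it easier to dispatch chunks in parallel.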

Node's ReGrid upload code is here:
https://github.com/internalfx/regrid/blob/master/lib/upload.js

Other notes

This should come after #77 is done.

After some discussion with @internalfx (thanks a bunch): the Node upload code uses Node streams. Node streams info via @buskila:

Using .pipe() has other benefits too, like handling backpressure automatically so that node won't buffer chunks into memory needlessly when the remote client is on a really slow or high-latency connection.

https://github.com/substack/stream-handbook

Currently, @internalfx keeps 10 network requests in flight at any given time. Even with arbitrarily high network latency, Node won't issue another write to the ReGrid API until at least one of those in-flight requests completes.

Cool. I think we could do the same: run 10 async tasks laying down bytes over a connection pool and, as each network request completes, come back and read more bytes.
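
A minimal sketch of that idea, assuming a `SemaphoreSlim` as the gate that caps in-flight writes. `BoundedUpload`, `CopyAsync` and the `writeChunkAsync` delegate are hypothetical stand-ins (the delegate represents whatever inserts one chunk into RethinkDB); none of this is existing ReGrid API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedUpload
{
    // Reads `source` in chunkSize pieces and passes each piece to writeChunkAsync,
    // never allowing more than maxInFlight writes to be pending at once.
    // Returns the number of chunks written.
    public static async Task<int> CopyAsync(
        Stream source,
        Func<int, byte[], Task> writeChunkAsync, // hypothetical: writes chunk #index
        int chunkSize,
        int maxInFlight = 10)
    {
        var gate = new SemaphoreSlim(maxInFlight);
        var pending = new List<Task>();
        var index = 0;

        while (true)
        {
            // Take a slot BEFORE reading: a slow network stalls the reader
            // instead of piling chunks up in memory (backpressure).
            await gate.WaitAsync();

            var buffer = new byte[chunkSize];
            var read = await FillAsync(source, buffer);
            if (read == 0)
            {
                gate.Release();
                break;
            }
            if (read < chunkSize) Array.Resize(ref buffer, read);

            pending.Add(WriteAndReleaseAsync(writeChunkAsync, index++, buffer, gate));
        }

        // Wait for the tail of in-flight writes; surfaces a failure if any write threw.
        await Task.WhenAll(pending);
        return index;
    }

    // Stream.ReadAsync may return fewer bytes than requested, so loop until
    // the buffer is full or the stream ends.
    private static async Task<int> FillAsync(Stream source, byte[] buffer)
    {
        var total = 0;
        while (total < buffer.Length)
        {
            var n = await source.ReadAsync(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }

    private static async Task WriteAndReleaseAsync(
        Func<int, byte[], Task> write, int index, byte[] chunk, SemaphoreSlim gate)
    {
        try { await write(index, chunk); }
        finally { gate.Release(); } // free the slot even if the write fails
    }
}
```

Chunks can complete out of order here, so each write carries its chunk index. Memory use is bounded by roughly `maxInFlight * chunkSize` bytes of chunk buffers.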


Other Research Findings

RethinkDB Limitations

  • Query size (419554663) greater than maximum (134217727).
    So the batch size can't be too big: the maximum query size is 134217727 bytes (about 134 MB, i.e. 128 MiB), which caps each batch at roughly that size.
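
To make the limit concrete, here is a small sketch that works out how many chunks fit in one batch. The only number taken from the error above is 134217727; `BatchLimits`, `MaxChunksPerBatch` and the `perChunkOverheadBytes` allowance are hypothetical, and the real overhead depends on how the driver serializes each chunk document (not measured here).

```csharp
using System;

public static class BatchLimits
{
    // Maximum query size in bytes, taken from the RethinkDB error message.
    public const long MaxQueryBytes = 134217727;

    // Largest number of chunks per batch that keeps the batch at or under the limit.
    // perChunkOverheadBytes is an assumed allowance for per-document
    // serialization overhead; 0 means "payload bytes only".
    public static int MaxChunksPerBatch(int chunkSizeBytes, int perChunkOverheadBytes = 0)
    {
        if (chunkSizeBytes <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSizeBytes));
        if (perChunkOverheadBytes < 0) throw new ArgumentOutOfRangeException(nameof(perChunkOverheadBytes));

        var bytesPerChunk = (long)chunkSizeBytes + perChunkOverheadBytes;
        return (int)(MaxQueryBytes / bytesPerChunk);
    }
}
```

For example, with an illustrative 1 MiB chunk size (the issue only says "Default"), `BatchLimits.MaxChunksPerBatch(1024 * 1024)` gives 127, since 128 chunks would be exactly one byte over the limit.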