
General discussion #4

Closed
filerun opened this issue Jan 16, 2014 · 49 comments

@filerun

filerun commented Jan 16, 2014

Hi,

Well done with the library! Although there currently doesn't seem to be much difference between this and Resumable.js, I prefer Flow's code structure.
Unfortunately it inherits the same big performance issue.
The "testChunks" idea is nice, but it is useless in its current form. If, let's say, one has uploaded 5 of 10 chunks in a previous session and wishes to resume the transfer, wouldn't it be more practical for Flow to send only ONE GET request to retrieve the size of what has already been uploaded to the server and continue from there, instead of testing all chunks from 1 to 10?
There are a couple more big problems, but let's start with this.
I am willing to collaborate.

Best regards,

Vlad

@filerun
Author

filerun commented Jan 16, 2014

Imagine uploading a 2GB file in 2MB chunks. That means 1000 GET requests just to test the current status of the file transfer.
With the current implementation, it also means the server checks the total size of the chunks hundreds of times to find out whether the upload has completed, and that the server-side script has to join 1000 pieces together in a single request.
The original developer of the library really didn't test this with large files.

@AidasK
Member

AidasK commented Jan 16, 2014

Hi,

Nice to hear that you are interested in flow.js development. Currently the library is far, far away from a perfect solution, and this feature can make it a bit better.

This would be a breaking change though, so I am creating a develop branch and I hope to release it as flow.js v3.
If you are interested in implementing this feature, please make a pull request to the develop branch.

How to implement this? We could use some ideas from the tus project: http://tus.io/protocols/resumable-upload.html#5
I like the idea of a HEAD request with an offset header.

Tomorrow I hope to write some more thoughts about this, and maybe we could set some milestones for the next release (and maybe do some refactoring of code and tests).

@filerun
Author

filerun commented Jan 16, 2014

Great to hear from you and that the project is active.
The problem is, the project I'm working on doesn't give me much time, nor does it allow less than perfect solutions. So my task is to take Flow and make it perfect :) I don't have much experience with Git or GitHub, so I'll just send you my version of it as soon as I am ready.
I have already started the work on it. So far I have moved the "test" method from the "FlowChunk" object to "FlowFile", because that is where it belongs. The purpose of the function is to check with the server whether the upload of the file (not of a chunk) is possible and, if any data was uploaded before, to resume (the file's entire upload) from there.
I have completely dropped the feature that allows simultaneous uploads. It's a nice idea that will unfortunately never see a practical implementation. I'll explain my reasoning as soon as I have some free time.

For the testing function, the link you posted describes exactly the functionality I have in mind. I will stick to GET though, as I've seen servers reject HEAD requests, or handle them themselves instead of passing the request to the CGI module. GET is safe.
The request should either allow the server-side script to generate an error that can be presented to the user or, if there is no problem and data has already been uploaded, provide the data offset. I would recommend having the offset in the response body and not in a header; it will make the server-side solution more portable. I'll add a customizable function that reads the offset from the response.
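Roughly what I have in mind for that customizable reader, as a minimal sketch (the option name readOffset is made up for illustration, it is not an existing Flow option, and the JSON response shape is an assumption):

// Hypothetical hook: the application decides how to extract the resume offset
// from the body of the offset-check response instead of from a header.
var flow = new Flow({
    target: 'upload.php',
    // Assumed option name; the server is expected to reply with e.g. {"offset": 1048576}
    readOffset: function (xhr) {
        try {
            var data = JSON.parse(xhr.responseText);
            return (data && typeof data.offset === 'number') ? data.offset : 0;
        } catch (e) {
            return 0; // unreadable response: start the upload from the beginning
        }
    }
});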

@filerun
Author

filerun commented Jan 16, 2014

Another thing that I would like is for the server side to be able to process various chunk sizes. If a file has been split into 4 parts and two of them were uploaded, another upload process that uses a smaller chunk size (and would split the file into 8 parts) would not be able to restore the file properly. So Flow should first get the facts (how much it still has to upload) and then establish what the chunks are. So I'm now rewriting FlowFile.bootstrap.
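As a rough illustration of that order of operations (just a sketch; buildChunks, serverOffset and the chunk object shape are illustrative, not the actual FlowFile internals):

// First ask the server how many bytes it already has (serverOffset), then build
// the chunk list starting from that offset, so the chunk size used now does not
// have to match the chunk size of the interrupted session.
function buildChunks(fileSize, serverOffset, chunkSize) {
    var chunks = [];
    for (var start = serverOffset; start < fileSize; start += chunkSize) {
        chunks.push({
            startByte: start,
            endByte: Math.min(start + chunkSize, fileSize)
        });
    }
    return chunks;
}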

@codeperl

Hello everyone,
flow.js is a really nice library to work with. I am also interested in contributing to this project, but I'm not sure where to start.

@AidasK
Member

AidasK commented Jan 17, 2014

Well, Git is a must and I can't imagine library development without it. It is not as hard to learn as you think; take a look at these links: http://try.github.io/levels/1/challenges/1, https://help.github.com/articles/set-up-git.

For the testing function, the link you posted describes exactly the functionality I have in mind. I will stick to GET though, as I've seen servers reject HEAD requests, or handle them themselves instead of passing the request to the CGI module. GET is safe.

Could you give me some references about this issue?

I have completely dropped the feature that allows simultaneous uploads. It's a nice idea that will unfortunately never see a practical implementation. I'll explain my reasoning as soon as I have some free time.

I agree, simultaneous uploads give us no benefit if we have only one upload server; more about this: http://kufli.blogspot.dk/2012/06/speedier-upload-using-nodejs-and.html
I would like to hear your arguments too.

The request should either allow the server-side script to generate an error that can be presented to the user or if there is no problem and there is data already uploaded, to provide the data offset.

Maybe we need a custom response handling function here? The function would determine whether a request was successful or not. By default it would check that the response status is 200 and that the chunk's retry count is less than the maximum allowed.

Another thing that I would like is for the server side to be able to process various chunk sizes. If a file has been split into 4 parts and two of them were uploaded, another upload process that uses a smaller chunk size (and would split the file into 8 parts) would not be able to restore the file properly. So Flow should first get the facts (how much it still has to upload) and then establish what the chunks are. So I'm now rewriting FlowFile.bootstrap.

This would be great! Another issue is in the bootstrap function: https://github.com/flowjs/flow.js/blob/master/src/flow.js#L872
It slices the file into lots of chunks at bootstrap and that takes time; I prefer lazy chunk slicing with a buffer.
For now, if a file is 2GB in size and we have a chunk size of 4MB, the library generates 512 chunks at once.
With lazy slicing, we could create one chunk for upload and prepare the next chunk for later.
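Something along these lines, as a sketch only (getChunk and the returned object shape are illustrative, not the real FlowChunk API):

// Instead of pre-creating all chunk objects at bootstrap, derive a chunk on
// demand from its index and keep at most the current and the next one around.
function getChunk(file, index, chunkSize) {
    var startByte = index * chunkSize;
    if (startByte >= file.size) {
        return null; // past the end of the file
    }
    var endByte = Math.min(startByte + chunkSize, file.size);
    return {
        index: index,
        startByte: startByte,
        endByte: endByte,
        blob: function () {
            // File.prototype.slice is lazy; the bytes are only read when the
            // blob is handed to XMLHttpRequest.send
            return file.slice(startByte, endByte);
        }
    };
}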

Furthermore, we should write a flow.js protocol example, just a simple one, something like this (Core Protocol): http://tus.io/protocols/resumable-upload.html#5
Then we would see what needs to be done. Maybe this should even be done before we start changing the code. That way we will stay clear about what the result should be.

@codeperl I will write some milestones in a few days, take a look at them. If you think you can manage that, just say so.

@AidasK
Member

AidasK commented Jan 17, 2014

We could also borrow some ideas from the Dropbox upload API:
https://www.dropbox.com/developers/core/docs#chunked-upload

Typical usage:

Send a PUT request to /chunked_upload with the first chunk of the file without setting upload_id, and receive an upload_id in return.
Repeatedly PUT subsequent chunks using the upload_id to identify the upload in progress and an offset representing the number of bytes transferred so far.
After each chunk has been uploaded, the server returns a new offset representing the total amount transferred.
After the last chunk, POST to /commit_chunked_upload to complete the upload.

Maybe we should implement a commit step? This way, on the server side, we wouldn't need to check after each chunk upload whether the file upload is finished.
Can you see any drawbacks in the Dropbox upload API?

@filerun
Author

filerun commented Jan 17, 2014

Could you give me some references about this issue?

It is not an educated opinion, just my experience with implementing upload systems across a large number of servers. I've had trouble with many IIS servers that didn't have HEAD in the list of "allowed verbs". I went with POST in the end, as I leave the full length of the URL to the user implementing the library. Stuffing the URL with variables might leave the user limited (if he has to pass an absolute path in the URL).

I agree, simultaneous uploads give us no benefit if we have only one upload server; more about this:
I would like to hear your arguments too.

Uploading one file at the same time to multiple servers would definitely give a nice speed boost, but it is currently out of the question for most people. Uploading a file's chunks in a random order creates a big problem on the server side, which outweighs the marginal speed gain. It requires the server to keep the chunks, with their numbers, in a temporary place until it has all of them to put together. This is almost unusable with large files, which is the main point of doing this in the first place. An 8GB file, uploaded in 20MB chunks, would mean 400 chunks. Imagine how long the user will need to wait after the last chunk for the server-side script to put 400 chunks together into an 8GB file, assuming the process is successful and the script doesn't get killed for running too long or using too many server resources.
If the chunks are uploaded in the proper order, you keep only one temporary file, to which you append each chunk on each request, until it is done.
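For illustration, a minimal Node.js handler in that spirit (the function name, paths and callback shape are made up for the example; this is not the library's server code):

// Append each in-order chunk to a single temporary file; no per-chunk
// bookkeeping is needed because chunk N+1 only arrives after chunk N.
var fs = require('fs');

function appendChunk(tmpPath, chunkBuffer, offset, totalSize, done) {
    var currentSize = fs.existsSync(tmpPath) ? fs.statSync(tmpPath).size : 0;
    if (currentSize !== offset) {
        // out-of-order or repeated chunk: tell the client to resume from currentSize
        return done(new Error('unexpected offset, server has ' + currentSize + ' bytes'));
    }
    fs.appendFileSync(tmpPath, chunkBuffer);
    done(null, currentSize + chunkBuffer.length === totalSize); // true when the file is complete
}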

Maybe we need a custom response handling function here? The function would determine whether a request was successful or not. By default it would check that the response status is 200 and that the chunk's retry count is less than the maximum allowed.

The reason I don't want to rely only on HTTP status responses is the simple fact that if PHP fails, the status is 200 OK, even if the body contains an ugly error. There are actually quite a few cases where the status is 200 and the request got trashed. The user should be allowed to adapt this to the particularities of his case.

It slices the file into lots of chunks at bootstrap and that takes time; I prefer lazy chunk slicing with a buffer.

I have separated the calculation of the chunks from the bootstrap method, because it needs to be done after getting the stats from the server. However, I don't think the chunks are actually created there; only their number and offsets are. The actual data is read only when the current chunk gets uploaded.

Can you see any drawbacks in the Dropbox upload API?

There is a dilemma regarding chunk size that needs to be taken into consideration.
For compatibility with limited servers (post_max_size, upload_max_filesize, the web server's limited request body size, etc.) you want the chunks to be as small as possible. For fault tolerance you again want the chunks to be as small as possible, so that you don't lose much time if a chunk fails. So far so good.
The thing is, the bigger the chunk, the smaller the number of chunks. The smaller the number of chunks, the less time gets spent waiting for the server to process and reply. In most cases the server-side processing takes time (connecting to databases, etc.), and even one second between chunks, for a file split into 60 chunks, is a minute of extra waiting time. So we can say that the bigger the chunk, the shorter the upload time, which translates to faster upload speeds, as close as possible to a non-chunked upload.
So, for efficiency you want the chunks to be as big as possible.
So you have "speed", "fault tolerance" and "portability". For the project I am improving this upload tool for (http://www.filerun.com) the priority goes like this: "portability" (it needs to work) over "fault tolerance" (it needs to work reliably) over "speed" (at a decent speed).
You can of course make it very portable and fault tolerant with a chunk size of 1MB, but that will make it painfully slow for files larger than 50MB.
However, nowadays people upload files larger than 6GB on a daily basis. I have a client working with PSD files, 2GB each.
So, we should aim for the largest chunk size possible that does not affect fault tolerance too much and keeps the server happy.
Now, I would say a chunk size of 100MB over a decently reliable network connection would be a good choice.
Imagine having to upload a 100MB chunk just to get the server to tell you something like "A file with the same name already exists". That would be frustrating. So the server-side checks need to be done before proceeding with the actual work. That is why I think the Dropbox protocol sucks :)

P.S. I spent a few hours last night on the library and I already have it working pretty much the way I wanted. I will test it a bit more and tweak its code a bit more, as I don't like how much CPU the progress calculation is using, and then share it with you. I must say, it was very easy working with the code; it's nicely organized.

@filerun
Author

filerun commented Jan 18, 2014

Another change that I'm making: pausing the queue should be separate from pausing individual files. I can pause one file so that it is skipped when its turn comes. While the rest of the files are uploading, I may want to pause the whole queue, and the transfer stops. But when I resume the queue, the transfer continues, while the specifically paused file remains paused.

@filerun
Author

filerun commented Jan 18, 2014

I hope it's OK if I'm using this discussion thread to document my changes, so that you can understand what I've done once I share my code.
There was a logical mess with the order of reporting progress. Instead of FlowChunk reporting progress to FlowFile and FlowFile reporting progress to Flow, it was the other way around: Flow calculating progress for the FlowFile and FlowFile calculating progress for the FlowChunks. I've fixed that and the associated performance issue.

@filerun
Author

filerun commented Jan 18, 2014

Another thing "progressCallbacksInterval" seems like a good idea, but it fixes just half of the problem. Users don't care about how often the progress should be reported, they care when actual progress made. The user wants to know about the percentage when there is at least 1% (or even 0.X, but not 0.XXXXXXXXXX) of the file uploaded, so the progress event should be executed only when that's the case. In most cases the callback updates the user interface, and DOM manipulation is expensive, you don't want to resize elements quarters of a pixel, but at least one pixel.
I'm adding a solution to that as well ;)

@AidasK
Member

AidasK commented Jan 18, 2014

Another change that I'm making: pausing the queue should be separate from pausing individual files. I can pause one file so that it is skipped when its turn comes. While the rest of the files are uploading, I may want to pause the whole queue, and the transfer stops. But when I resume the queue, the transfer continues, while the specifically paused file remains paused.

Well spotted, this must be done. I will file a milestone for this.

About the 200 status on a PHP fatal error, we can avoid this: http://stackoverflow.com/questions/2331582/catch-php-fatal-error

I went with POST in the end, as I leave the full length of the URL to the user implementing the library. Stuffing the URL with variables might leave the user limited (if he has to pass an absolute path in the URL).

Well yes, the library would gain more flexibility.

Another thing "progressCallbacksInterval" seems like a good idea, but it is just a poor solution to the problem. Users don't care about how often the progress should be reported, they care when actual progress made. The user wants to know about the percentage when there is at least 1% (or even 0.X, but not 0.XXXXXXXXXX) of the file uploaded, so the progress event should be executed only when that's the case. In most cases the callback updates the user interface, and DOM manipulation is expensive, you don't want to resize elements quarters of a pixel, but at least one pixel.
I'm adding a solution to that as well ;)

This parameter is also used for the upload speed and remaining time calculation, and those depend on time, not on the total upload percentage.
Another drawback of this approach is that callbacks would be rare for large files and dense for small ones. If we track progress in 0.x% steps, that would mean 1000 callbacks for any file, large or small.

First thoughts on how the flow.js protocol should look:

A POST request is used to create a new resource or resume an old one.
The first request SHOULD NOT carry any file data, only information about it.
The content type MUST be multipart/form-data.
The request MUST include form data for: file size, file name and identifier. The identifier will later be used to resume the file upload.
The request MAY include additional data.
The second and later requests MAY contain a file blob and MUST set the offset information.

Servers MUST acknowledge a successful POST operation with a 200 Ok status.
The response MUST contain an Offset header. The value MUST be an integer that is 0 or larger.
The response MAY contain additional data, and the client may perform additional validation.

Example:

The file is an image named img.png. It has a size of 100 bytes. By default the identifier is created by concatenating the file size and the file name.
Making the first request:

POST /files HTTP/1.1
Content-Type:multipart/form-data

Content-Disposition: form-data; name="fileSize"
100

Content-Disposition: form-data; name="fileName"
img.png

Content-Disposition: form-data; name="identifier"
100-img.png

Response:

HTTP/1.1 200 Ok
Offset: 0

Starting the upload
The request is the same, but it includes the file blob and the offset. The blob is 30 bytes in size.

Request:

POST /files HTTP/1.1
Content-Type:multipart/form-data

Content-Disposition: form-data; name="fileSize"
100

Content-Disposition: form-data; name="fileName"
img.png

Content-Disposition: form-data; name="identifier"
100-img.png

Content-Disposition: form-data; name="offset"
0

Content-Disposition: form-data; name="file"; filename="blob"
Content-Type: application/octet-stream

Response:

HTTP/1.1 200 Ok
Offset: 30

However, this example does not include the final step. How should the client tell that this is the final chunk? Maybe the request could contain a parameter which indicates that this is the final request?
Or maybe we should use simple math: offset + current blob size == fileSize?
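The math variant is a one-liner on the server side, sketched here (the names receivedBytes, currentBlobSize and finalizeUpload are only illustrative):

// No explicit "final chunk" flag is needed: the upload is complete exactly when
// the bytes already stored plus the current blob reach the declared file size.
var isComplete = (receivedBytes + currentBlobSize) === fileSize;
if (isComplete) {
    finalizeUpload(); // assumed application hook, e.g. move the temp file into place
}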

Another question: should the request include currentChunkSize? In PHP we could use $_FILES["blob"]["size"] instead.

Also, I haven't mentioned error handling; can we keep the same logic?
Retry the request if the response status is not in the permanent errors list, otherwise stop the upload and report an error to the client.

@filerun
Author

filerun commented Jan 18, 2014

If we track progress in 0.x% steps, that would mean 1000 callbacks for any file, large or small.

Using a combination of both time and actual progress should work well for both small and large files, and for large files uploaded in either large or small chunks.
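A minimal sketch of that combination (the thresholds and the helper name are illustrative, not existing Flow options):

// Fire the progress callback only when at least minDelta of the file has been
// uploaded since the last call AND at least minInterval milliseconds have passed.
function makeProgressThrottle(callback, minDelta, minInterval) {
    var lastRatio = 0;
    var lastTime = 0;
    return function (uploadedBytes, totalBytes) {
        var ratio = totalBytes ? uploadedBytes / totalBytes : 1;
        var now = Date.now();
        if (ratio - lastRatio >= minDelta && now - lastTime >= minInterval) {
            lastRatio = ratio;
            lastTime = now;
            callback(ratio);
        }
    };
}

// Usage: update the UI when at least 1% more is uploaded and 100 ms have passed.
// progressBar is assumed to be some DOM element of the host page.
var onProgress = makeProgressThrottle(function (ratio) {
    progressBar.style.width = Math.floor(ratio * 100) + '%';
}, 0.01, 100);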

First thoughts on how the flow.js protocol should look:

Looking great.

However, this example does not include the final step. How should the client tell that this is the final chunk? Maybe the request could contain a parameter which indicates that this is the final request?

Yes. My version now submits the following two boolean variables: flowIsFirstChunk and flowIsLastChunk.
If you wait a bit, just a day or so, I will send you my full working code and you can see my changes. I've put quite a bit of work into it.

Another question: should the request include currentChunkSize? In PHP we could use $_FILES["blob"]["size"] instead.

The server should not care about the size of the chunk. The server gets as much data as arrives over the network from the client.

About the 200 status on a PHP fatal error, we can avoid this: http://stackoverflow.com/questions/2331582/catch-php-fatal-error

For my case, I'm not trusting the 200 status at all. Almost no hosting service has its servers configured to return HTTP 500 on PHP errors. And PHP errors can be very diverse, from parsing errors and warnings to silent failures. I am also not including the server's offset in an HTTP header, which might get lost through proxy and reverse-proxy servers; I'm including it in the response body.
You can of course choose to do it differently. I'm just sharing my opinion and choice.

@AidasK
Member

AidasK commented Jan 19, 2014

How about making the chunk size dynamic and recalculating it after each uploaded chunk?
For example, say we have a 4MB starting chunk and it gets uploaded in 4 seconds. This means the client has at least a 1MB/s connection speed and we can adjust the chunk size to something more appropriate. How about making the chunk size as big as the client can transfer in 1 minute? The chunk size would be calculated by multiplying the average upload speed by a time interval. In this example it would be 1MB/s * 60 seconds = a 60MB chunk. For this feature the library would need four more parameters: maxChunkSize, minChunkSize, defaultChunkSize, chunkSizeAdjustInterval.
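A sketch of that calculation (parameter names follow the ones proposed above; the clamping behaviour is an assumption):

// Recalculate the chunk size after each uploaded chunk from the measured average
// upload speed, clamped between minChunkSize and maxChunkSize.
// chunkSizeAdjustInterval is the target duration of one chunk, in seconds.
function nextChunkSize(avgBytesPerSecond, opts) {
    var size = avgBytesPerSecond * opts.chunkSizeAdjustInterval;
    return Math.max(opts.minChunkSize, Math.min(opts.maxChunkSize, size));
}

// Example: 1MB/s measured, aim for ~60-second chunks, bounded between 1MB and 128MB.
nextChunkSize(1024 * 1024, {
    chunkSizeAdjustInterval: 60,
    minChunkSize: 1024 * 1024,
    maxChunkSize: 128 * 1024 * 1024
}); // -> 62914560 bytes, i.e. a 60MB chunk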

I am also not including the server's offset in an HTTP header, which might get lost through proxy and reverse-proxy servers; I'm including it in the response body.

Managing headers in the library adds a bit of complexity; forcing the server to always return JSON would be a simpler solution.

@filerun
Author

filerun commented Jan 19, 2014

How about making the chunk size dynamic and recalculating it after each uploaded chunk?

I love the idea. The web upload process can definitely be made much "smarter" than it currently is. I was also thinking that we should keep count of the chunks that had to be retried and, if that count reaches a predefined threshold, automatically lower the chunk size, hoping for fewer failures. In the same way that YouTube lowers the video quality when it detects poor bandwidth.

@filerun
Author

filerun commented Jan 20, 2014

I finally managed to get the overall progress tracked properly, including the removal of the temporary data that had been uploaded before the XHR was forcefully aborted. That data was added to the progress, but not subtracted if the chunk was not successful. So if you now pause a 100MB chunk right in the middle, the file's progress and the overall progress go back by 50MB, which is what should happen, as it reflects reality. However, I'm writing this note as a TODO item, as this can be improved. The user has no idea that there is chunking going on in the background, so if he pauses the upload and the chunk size is large, say 100MB, it would be a pity to completely discard 90MB of an uploaded chunk just to pause at that instant, instead of waiting a moment for the chunk to complete. So we can add a timer and a threshold for that.

P.S. Keeping track of progress while taking failures into account turned out to be more challenging than I thought :P
However, it will be flawless. And it will allow multiple files (not chunks) to be uploaded at the same time while tracking the overall progress effortlessly, without killing the CPU.

@filerun
Author

filerun commented Jan 20, 2014

What do you think about multiplying "flowObj.opts.chunkRetryInterval" by "FlowChunk.retries" every time a chunk fails, with the purpose of increasing the wait time between attempts? Or at least doubling the retry interval with every failure. This should avoid having clients continuously flooding the server in case of trouble.
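Either variant is only a couple of lines; a sketch (retryDelay is a made-up helper, and chunk.send() stands in for whatever re-sends the failed chunk):

// Grow the wait between retries, either linearly (interval * retries) or by
// doubling it, so a struggling server is not hammered with immediate retries.
function retryDelay(baseInterval, retries, exponential) {
    return exponential
        ? baseInterval * Math.pow(2, retries)   // 500, 1000, 2000, 4000 ms, ...
        : baseInterval * (retries + 1);         // 500, 1000, 1500, 2000 ms, ...
}

setTimeout(function () {
    chunk.send(); // assumed re-send entry point for the failed chunk
}, retryDelay(flowObj.opts.chunkRetryInterval, chunk.retries, true));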

@filerun
Author

filerun commented Jan 20, 2014

Ok, so here's my code: http://bit.ly/1dK1pnk
I did quite some testing, but I'm sure it could use more, especially fault-tolerance testing.
I made so many changes that it would take me quite a while to remember and list them all. They are mostly about what I have written in this discussion. The biggest changes are these two:

  1. Chunks are uploaded one at a time, in their natural order. They are no longer tested individually; instead, before the upload starts, the library gets info from the server about the byte offset from which the upload needs to resume.
  2. Rewrote the progress reporting in an efficient way that doesn't load the CPU needlessly.

codeperl -> you said you wanted to contribute; you can do that by downloading and testing this version.

@AidasK
Member

AidasK commented Jan 20, 2014

What do you think about multiplying "flowObj.opts.chunkRetryInterval" by "FlowChunk.retries" every time a chunk fails, with the purpose of increasing the wait time between attempts? Or at least doubling the retry interval with every failure. This should avoid having clients continuously flooding the server in case of trouble.

I opened an issue for this two days ago: #6

Thanks for sharing your work, I will review it tomorrow.

@filerun
Author

filerun commented Jan 20, 2014

I opened an issue for this two days ago: #6

OK, great, so I see that there are more people who think it's a good idea. Cool! It should be implemented then, as it takes just a line of code.

@codeperl

@vvllaadd, Just got the zip. Hope I can start very soon.

@filerun
Author

filerun commented Jan 21, 2014

A quick overview of the new vars being passed to the server:

flowGetOffset - boolean, marking the request that checks for already uploaded data. The server should return the number of uploaded bytes in the response body, or 404 if no data has been uploaded so far
flowIsFirstChunk - boolean, marking the first uploaded chunk of the file
flowIsLastChunk - boolean, marking the last uploaded chunk of the file
flowCurrentChunkSize - no change
flowTotalSize - no change
flowFilename - no change
flowRelativePath - no change
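For reference, handling the flowGetOffset request on the server could look roughly like this (an illustrative Node.js/Express sketch, not code shipped with the library; the temp-file naming scheme is an assumption):

// Answer the offset check with the byte count already stored for this file,
// or 404 when nothing has been uploaded yet.
var express = require('express');
var fs = require('fs');
var app = express();

app.get('/upload', function (req, res) {
    if (req.query.flowGetOffset) {
        // assumed scheme; a real app would key on a unique identifier, not the name
        var tmpPath = '/tmp/flow-' + req.query.flowFilename;
        if (!fs.existsSync(tmpPath)) {
            return res.status(404).end();
        }
        return res.send(String(fs.statSync(tmpPath).size)); // offset in the response body
    }
    res.status(400).end();
});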

@filerun
Author

filerun commented Jan 22, 2014

A fix for progress monitoring for small files, files that are uploaded in a single small request:

/**
 * The size of this chunk
 * @type {number}
 */
this.size = this.endByte - this.startByte;

this.progressHandler = function (event) {
    if (event.lengthComputable) {
        $.loaded = Math.min(event.loaded, $.size);
        $.total = Math.min(event.total, $.size);
        if ($.loaded > $._lastLoaded) {
            var loadedNow = $.loaded - $._lastLoaded;
            $.fileObj.completedBytes += loadedNow;
            $.flowObj.completedBytes += loadedNow;
        }
        $._lastLoaded = event.loaded;
    }
    $.fileObj.chunkEvent('progress');
};

(event.loaded can be larger than the file because it includes the HTTP headers)

@filerun
Author

filerun commented Jan 22, 2014

Another improvement that I will make is to skip the "FlowFile.getOffset" call for tiny files (files that would take less than just a couple of seconds to transfer), as it only adds a delay when uploading a few hundred small text files.

@filerun
Author

filerun commented Jan 23, 2014

You can download an updated version using the same link: http://bit.ly/1dK1pnk

Some of the changes:

  • added "Flow.opts.resumeLargerThan" to specify the minimum size for the files that Flow should attempt resuming for.
  • support for zero-byte files (yeah, it's very useful in some special cases)
  • improved memory usage by cleaning up chunks and XHR objects after successful upload
  • exponential wait duration on retry (requires Flow.opts.chunkRetryInterval to be set)
  • the fix from above and many other fixes which I haven't kept track of

Here's the constructor I use with it:

new Flow({
    target: 'upload.php',
    chunkSize: 2097152,
    progressCallbacksInterval: 100,
    maxChunkRetries: 3,
    resumeLargerThan: 10485760,
    validateChunkResponse: function(status, message) {
        if (status == '200') {
            try {
                var rs = Ext.util.JSON.decode(message);
            } catch (er) {
                return 'booboo';
            }
            if (rs && rs.success) {
                return 'success';
            }
        }
        return 'booboo';
    },
    validateChunkResponseScope: this,
    query: {path: 'some/path'}
});

This version should be quite stable. I have tested it quite a lot; the fault tolerance works fine; I'm getting close to my goal.

Next on my todo list:

  • simultaneous file uploading:
    • testing it properly
    • improving its progress reporting, so that multiple files uploading at the same time don't refresh the progress too often or all at once; the delay for reporting overall progress needs to be kept inside Flow, not FlowFile
    • adding some smart way of automatically deciding when to upload more than one file at a time (usually many small files over a good connection)
  • removing the tons of JS closures

@AidasK
Member

AidasK commented Jan 24, 2014

Simultaneous file uploading to one backend will do no good.
To handle this feature, the target option should be renamed to targets.
E.g.:

targets: ['http://srv1.example.com', 'http://srv2.example.com']

And the simultaneous file upload count would be equal to targets.length.
Obviously, this feature is useful for the client only when it has a greater upload speed than one server can handle. So we will have to do some math here and calculate the average upload speed of one file and of all files.
If the average upload speed does not increase with an increased simultaneous uploads count, then the library should decrease the number of simultaneous uploads.
If the number of simultaneous uploads is less than targets.length, then targets should be picked in round-robin style.
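A round-robin picker is tiny; a sketch (makeTargetPicker is a made-up helper name):

// Hand out upload targets in round-robin order whenever the number of
// concurrent uploads is smaller than the number of configured targets.
function makeTargetPicker(targets) {
    var next = 0;
    return function () {
        var target = targets[next % targets.length];
        next += 1;
        return target;
    };
}

var pickTarget = makeTargetPicker(['http://srv1.example.com', 'http://srv2.example.com']);
pickTarget(); // 'http://srv1.example.com'
pickTarget(); // 'http://srv2.example.com'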

removing the tons of JS closures

I am thinking of writing flow.js v3 in closure style where possible. The library will be much smaller when minified.

@filerun
Author

filerun commented Jan 25, 2014

Simultaneous file uploading to one backend will do no good.

Not entirely true. What it is no good for is uploading two large chunks at the same time to the same server. In that case the bandwidth gets split in two, and it will take just as long as uploading the two chunks one after the other at double the speed. There is, however, a case where simultaneous upload can be a really big help: uploading many small files.
What most people calculating upload speeds don't consider is the server-side processing time. In many cases the server takes just as long to process a 200MB chunk as a 200KB one, because the server-side program still needs to connect to databases, check authentication, etc.
When uploading two large files in a few large chunks, that "wait time" is not important. But when you have to upload 2000 small files with an average size of 200KB (a larger dynamic website, for example), the accumulated wait time can be enormous.
In the case of uploading a small file, the wait time can be much longer than the time the browser took to complete the request. So why wait for the server to reply, when you can complete another request in the meantime? Let the browser wait for multiple server replies at the same time.

I promised a client that I would make a browser upload function that is at least as efficient as an FTP program. At this time, there is no such thing online. There is no HTML upload method that is as reliable as FTP for any amount of data. It's not only about uploading large files, but also about many small ones.
Would you consider uploading a WordPress or any other similar CMS-based website using Flow, without zipping the site first? Most probably not. And what if you don't have a good option for unzipping on the remote server? Back to FTP...

@AidasK
Member

AidasK commented Jan 25, 2014

Good point! How about making the number of simultaneous file uploads dynamic? We have a chunk size and the size of every file.
I suggest calculating it like this: we could transfer as many files as fit in a single chunk size.
E.g.:
We have a chunk size of 4MB, so we could transfer one file of 2MB and two files of 1MB at the same time.
However, simultaneous uploads should have a hard limit, so that the client doesn't try to upload 1000 files simultaneously, which browsers do not support.
So the final formula would be: transfer as many files as fit in a single chunk size, but not more than the client-defined hard limit (see the sketch below).

By default the hard limit should be based on browser limits: http://www.stevesouders.com/blog/2008/03/20/roundup-on-parallel-connections/

Also, this would play nice with dynamic chunk size calculation.
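A sketch of that formula (simultaneousUploads is a made-up helper; real code would also react to the queue changing):

// "As many files as fit in a single chunk size, but never more than the hard
// limit" - e.g. chunk size 4MB with pending files of 2MB, 1MB, 1MB => 3 uploads.
function simultaneousUploads(pendingFileSizes, chunkSize, hardLimit) {
    var used = 0;
    var count = 0;
    for (var i = 0; i < pendingFileSizes.length && count < hardLimit; i++) {
        if (used + pendingFileSizes[i] > chunkSize) {
            break;
        }
        used += pendingFileSizes[i];
        count += 1;
    }
    return Math.max(1, count); // always upload at least one file
}

simultaneousUploads([2 * 1024 * 1024, 1024 * 1024, 1024 * 1024], 4 * 1024 * 1024, 6); // -> 3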

@filerun
Author

filerun commented Jan 25, 2014

I made some quick tests, comparing the total upload time when uploading small files one by one versus simultaneously. I have only tested Chrome v32. With 2 simultaneous uploads the upload time drops by 50%, which is awesome. For some reason, Chrome doesn't really use more than 2 connections to the same host, so increasing that number seems to make no difference.
I'll return soon with an update to my library, which includes the "maxSimultaneousTransfers" configuration along with some other fixes and optimizations.

I'm quite happy with the progress so far; uploading one thousand small files or one gigantic (20GB) file is a piece of cake for me now :D

@AidasK
Member

AidasK commented Jan 26, 2014

I was thinking about the simultaneous uploads issue and it seems we are trying to solve it in the wrong way.
Instead of trying to upload one file in a single request and another in a concurrent request, we should upload all of them in one request, but the request should not exceed the chunk size. This would be a batch upload.
For example: a client needs to upload a WordPress site; it is 16MB in size and has 1162 files. The library chunk size is 4MB.
So all we need is 4 requests to upload it instead of 1162 requests. In my opinion this would be a huge improvement for the client and for the server.

Although this might bring some complexity, it's worth a shot.

The files transferred in the XHR could be put into an array:

form.append('file[' + file.uniqueIdentifier + ']', blob, file.name);
form.append('file[' + file.uniqueIdentifier + '][name]', file.name);
form.append('file[' + file.uniqueIdentifier + '][size]', file.size);
form.append('file[' + file.uniqueIdentifier + '][offset]', file.offset);

Once this is done, batch upload could be used for large files as well.

If a client has to upload 100 files of 5MB each, in 4MB chunks, then instead of 200 requests the library could do it in 5MB * 100 / 4MB = 125 requests.
This means that if a file's last chunk is only 1MB in size, the library should append a second file, or part of it, to the same request.
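A sketch of how such a batch could be packed (packBatch and the part objects are illustrative; it assumes each queued file tracks its size and the offset already sent):

// Pack queued files into one request that stays under the chunk size; a file
// larger than the remaining room contributes only a slice of itself, and the
// rest goes into the next request.
function packBatch(files, chunkSize) {
    var parts = [];
    var room = chunkSize;
    for (var i = 0; i < files.length && room > 0; i++) {
        var remaining = files[i].size - files[i].offset; // bytes of this file still to send
        if (remaining <= 0) {
            continue;
        }
        var take = Math.min(remaining, room);
        parts.push({ file: files[i], start: files[i].offset, end: files[i].offset + take });
        room -= take;
    }
    return parts; // each part becomes one entry in the multipart form data
}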

@filerun
Author

filerun commented Jan 26, 2014

You are so right. That would be a huge improvement.
Simultaneous uploads are still very useful though, and after my tests I am now convinced that it's worthwhile to upload more than one file at a time. As soon as I get that working properly (there are still issues with the queue management and the progress reporting) I will share my code and work on the idea of bundling multiple small files into fewer requests (which can also be simultaneous). I've had limited time to work on this over the past few days, but I'm getting there.

@AidasK
Member

AidasK commented Jan 27, 2014

Could you run some more tests with simultaneous uploads? The Chrome browser should support more than two connections to a single host for sure, or try Firefox. If we can't see any improvement with 3 or more connections, then something is fishy.
Maybe the issue is just that opening a connection takes time. So maybe we should open the next connection once we have finished uploading the previous chunk?
The condition to open the next connection would be:

xhr.upload.onprogress = function (event) {
    if (event.loaded == event.total) {
        // upload finished, waiting for server response
        // while the server is processing, we could send the next chunk
    }
};

Although this is still a simultaneous upload, for most of the time only one connection is open.

My version now submits the following two boolean variables: flowIsFirstChunk and flowIsLastChunk.

Be careful with these, because with simultaneous uploads the first chunk might get uploaded after the last chunk.

I would still recommend dropping the simultaneous uploads feature, because with batch uploads its benefit would be negligible.

@filerun
Author

filerun commented Jan 27, 2014

Be careful with these, because with simultaneous uploads the first chunk might get uploaded after the last chunk.

The only way I am using simultaneous uploads is with separate files, not with chunks. Chunks are always uploaded in their proper order. Letting chunks upload in a random order is a very, very bad idea.

I would still recommend dropping the simultaneous uploads feature, because with batch uploads its benefit would be negligible.

I still think it's a useful feature. Imagine a queue with one large file, like 2GB, followed by 50 small files of 1MB or less. With one transfer at a time, you have to wait quite a while for the large one to complete, unless you pause it to allow the small ones to transfer. With two simultaneous uploads, the small files get uploaded before the large one completes, without extending the transfer time of the large file by much.
The batch uploading will take quite some work. Once that is implemented, more tests and comparisons can be made.

@AidasK
Member

AidasK commented Jan 27, 2014

The only way I am using simultaneous uploads is with separate files, not with chunks. Chunks are always uploaded in their proper order. Letting chunks upload in a random order is a very, very bad idea.

Oh, this might bring some complexity.

I still think it's a useful feature. Imagine a queue with one large file, like 2GB, followed by 50 small files of 1MB or less. With one transfer at a time, you have to wait quite a while for the large one to complete, unless you pause it to allow the small ones to transfer. With two simultaneous uploads, the small files get uploaded before the large one completes, without extending the transfer time of the large file by much.

Well, if we have two large files at the beginning of the queue, then this does not help.

To keep the library simple, it should use the same logic for all files. I think batch uploading is the way to go and it solves all the issues above. Although I haven't tried to implement it yet, so maybe there are some unknown drawbacks.

@filerun
Author

filerun commented Jan 27, 2014

Oh, this might bring some complexity.

Already implemented it.

Well, if we have two large files at the beginning of the queue, then this does not help.

It doesn't do any harm either. The user can pause one of the files if he wants to. Most people upload very large files individually and only small files in bunches.

I will be posting today what will probably be my last public contribution. My code is getting far from the original, and this work is not done in my free time but is backed by a business.

@filerun
Author

filerun commented Jan 27, 2014

So, here's my latest version: http://bit.ly/1dK1pnk

The changes:

  • added the config option "Flow.opts.maxSimultaneous", which defaults to 1
  • replaced "Flow.assignBrowse" with the method "Flow.browseFiles()", which can be called programmatically to trigger the file/folder selection. Should work fine with any HTML5 browser.
  • dropped the silly default HTTP status checking and made "Flow.opts.validateChunkResponse" mandatory. If you want reliable uploads, you need to be sure that the HTTP requests are being processed by the server-side script and not by the HTTP server's default behavior.
  • dropped the preprocessing code. If anybody wishes to add additional processing, it means he has advanced needs and the code customization will not be light. Most people will not need any preprocessing.
  • dropped the "FlowFile.abort" method and added the internal method "FlowFile.reset"
  • the Flow object now keeps track of how many files are currently uploading and how many files have been successfully uploaded. This is accessible via the properties "flow.uploadingFiles" and "flow.completedFiles", instead of methods that use CPU. It was needed for limiting the number of simultaneous transfers.
  • bugfixes from my previous version

If this version becomes the default Flow code, I am considering contributing pull requests for the rest of the improvements that can be made (like "dynamic chunk size" and "bundling small files into fewer requests").

Enjoy!

P.S. Shameless plug: don't forget to check out FileRun (http://www.filerun.com), the project where this code will go.

@AidasK
Member

AidasK commented Jan 28, 2014

This version will not become the next version of flow.js, because:

  • The library must be properly tested with unit tests. You can't develop a library without them; something will always fail. The current library has some basic tests: https://github.com/flowjs/flow.js/tree/master/test
  • I haven't mentioned this, but one of the goals of the next flow.js version is to define some rules (a protocol), which could be easily implemented in other languages. It should be easy to upload files from Android and iOS devices without the browser.
    With such rules, implementing the server-side code should be easy too.

However, I am happy that you are developing this version, because you have come up with some nice ideas.

@filerun
Author

filerun commented Jan 28, 2014

Yeah, the bureaucratic things, I leave those to you :) Tests are nice, but I don't have time to write them, so I'll do my best to manually test the most common cases and wait for the actual users to report other possible issues :P
Anyway, I hope my code and ideas will be of some use to you or anybody else, as FlowJS was of use to me.

@ralyodio

ralyodio commented Jul 5, 2014

I uploaded a file using the Node.js example, but there seem to be some residual files left over in ./tmp. I'm not sure which one is the actual file and what I should delete.

@marcstreeter

Yo - any movement on this? I'm not sure if these changes were ever merged in. I'm more than willing to do the 'bureaucratic things' if that's all that's needed to get these great additions into the library. Also all of the bit.ly links are dead and I'd like to review the code that was submitted.

@AidasK
Member

AidasK commented Jan 13, 2015

I have started refactoring this lib on the develop branch (https://github.com/flowjs/flow.js/tree/develop); it has the basic structure with well-covered tests. There is no HTML example/demo or any JS documentation so far. It would be great to find a developer who would like to contribute.

@marcstreeter

Quick question -- are any of the changes that vvllaadd implemented present in the develop branch? Many of the vars and functions he mentioned adding seem to be absent.

@filerun
Author

filerun commented Jan 13, 2015

As soon as I get a bit of free time I will start a new public project based on my code. It has been successfully tested over a year with thousands of users and millions of uploads.
If I don't post anything here in a week or two, write a message to remind me about it ;)

@marcstreeter

Awesome - I'll put it on my calendar.

@Patrik-Lundqvist

Any updates on the new project @vvllaadd? :)

@filerun
Author

filerun commented Jan 28, 2015

@Patrik-Lundqvist, @marcstreeter and anybody else interested, you can watch this project: https://github.com/vvllaadd/filerun . I still need to find some free time to start it up, but it will certainly happen during February.

@filerun filerun closed this as completed Jan 28, 2015
@filerun
Author

filerun commented Feb 18, 2015

The project at the link above has been started and the initial code is available for download, with an included example.

@filerun filerun reopened this Feb 18, 2015
@filerun filerun closed this as completed Feb 18, 2015
@jdc18

jdc18 commented Jun 23, 2015

Hi,
I have an example with Java; it works, but it still has some errors: https://github.com/jdc18/ng-flow-with-java/tree/master/ng-flow-java . I updated the Java resumable example, if anybody wants to test it; it works with CORS too. This is my first project, so any improvements to the code are welcome.

@rohitkeshav

How does the joining of chunks work in the library, or is that something that needs to be handled on the server side?
