General discussion #4
Imagine uploading a 2GB file in 2MB chunks. That means 1000 GET requests just to test for the current status of the file transfer.
Hi, nice to hear that you are interested in flow.js development. Currently the library is far, far away from being a perfect solution, and this feature could make it a bit better. It would be a breaking change, though, so I am creating a develop branch and I hope to release a flow.js v3 version. How to implement this? We could use some ideas from other implementations. Tomorrow I hope to write some more thoughts about this, and maybe we could set some milestones for the next release (maybe do some refactoring in code and tests).
Great to hear from you, and that the project is active. For the testing function, the link you posted describes exactly the functionality I have in mind. I will stick to GET though, as I've seen servers reject HEAD requests, or handle them themselves instead of passing the request to the CGI module. GET is safe.
Another thing that I would like is for the server side to be able to process various chunk sizes. If a file has been split into 4 parts and two of them were uploaded, another upload process using a smaller chunk size (splitting the file into 8 parts) would not be able to restore the file properly. So Flow should first get the facts (how much is left to upload) and only then establish what the chunks are. So I'm now rewriting FlowFile.bootstrap; see the sketch below.
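A minimal sketch of such an offset-first bootstrap (hypothetical field and method names, not the shipped flow.js code): ask the server how many bytes it already holds, then derive the chunk boundaries from that offset, so the chunk size no longer has to match the previous session's.

    // Sketch: query the server's offset first, then compute the chunks.
    FlowFile.prototype.bootstrap = function () {
        var file = this;
        var xhr = new XMLHttpRequest();
        xhr.open('GET', this.flowObj.opts.target +
            '?flowIdentifier=' + encodeURIComponent(this.uniqueIdentifier), true);
        xhr.onload = function () {
            // assume the server replies with the uploaded byte count, or 404
            var offset = (xhr.status === 200)
                ? parseInt(xhr.responseText, 10) || 0 : 0;
            var chunkSize = file.flowObj.opts.chunkSize;
            file.chunks = [];
            for (var start = offset; start < file.size; start += chunkSize) {
                file.chunks.push({
                    startByte: start,
                    endByte: Math.min(start + chunkSize, file.size)
                });
            }
        };
        xhr.send();
    };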
Hello everyone,
Well, git is a must, and I can't imagine library development without it. It is not as hard to learn as you think; take a look at these links: http://try.github.io/levels/1/challenges/1, https://help.github.com/articles/set-up-git.
Could you give me some references about this issue?
I agree, simultaneous uploads give us no benefit if we have only one upload server; more about this: http://kufli.blogspot.dk/2012/06/speedier-upload-using-nodejs-and.html
Maybe we need a custom response-handling function here? The function would determine whether a request was successful or not. By default, it would check that the response status is 200 and that the chunk's retry count is less than the maximum allowed.
This would be great! Another issue in the bootstrap function is: https://github.com/flowjs/flow.js/blob/master/src/flow.js#L872 Furthermore, we should write a flow.js protocol example, just a simple one, something like this (Core Protocol): http://tus.io/protocols/resumable-upload.html#5 @codeperl I will write some milestones in a few days; take a look at them. If you think you can manage one, just say so.
We could also borrow some ideas from the Dropbox upload API.
Maybe we should implement a commit step? That way, on the server side, we wouldn't need to check after each chunk upload whether the file upload is finished.
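A hedged sketch of such a commit step (the /commit endpoint and the helper are inventions for illustration; flowIdentifier and flowTotalSize follow the parameter naming the library already uses):

    // Hypothetical commit request, sent once after the final chunk succeeds,
    // so the server knows explicitly that the file is complete.
    function commitUpload(flow, file, done) {
        var form = new FormData();
        form.append('flowIdentifier', file.uniqueIdentifier);
        form.append('flowTotalSize', file.size);
        var xhr = new XMLHttpRequest();
        xhr.open('POST', flow.opts.target + '/commit', true);
        xhr.onload = function () {
            // only now would the server assemble/rename the file and fire hooks
            done(xhr.status === 200);
        };
        xhr.send(form);
    }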
It is not an educated opinion, just my experience implementing upload systems across a large number of servers. I've had trouble with many IIS servers that didn't have HEAD in their list of "allowed verbs". I went with POST in the end, as it leaves the full length of the URL to the user implementing the library. Stuffing the URL with variables might leave the user limited (for instance, if he has to pass an absolute path in the URL).
Uploading one file at the same time to multiple servers would definitely give a nice speed boost, but it is currently out of the question for most people. Uploading a file's chunks in a random order creates a big problem on the server side, which outweighs the marginal speed gain: it requires the server to keep the chunks, with their numbers, in a temporary place until it has all of them to put together. That is almost unusable with large files, which are the main point of doing this in the first place. An 8GB file uploaded in 20MB chunks means 400 chunks. Imagine how long the user would need to wait after the last chunk for the server-side script to stitch 400 chunks into an 8GB file, assuming the process succeeds and the script is not killed for running too long or using too many server resources.
The reason I don't want to rely only on HTTP status responses is the simple fact that if PHP fails, the status is 200 OK even if the body contains an ugly error. There are actually quite a few cases where the status is 200 and the request got trashed. The user should be allowed to adapt this to the particularities of his case.
I have separated the calculation of the chunks from the bootstrap method because it needs to happen after getting the stats from the server. However, I don't think the chunks are actually created there; only their number and offsets are. The data itself is read only when the current chunk gets uploaded.
There is a dilemma regarding chunk size that needs to be taken into consideration. P.S. I spent a few hours last night on the library and I already have it working pretty much like I wanted. I will test it a bit more and tweak its code a bit more (I don't like how much CPU the progress calculation is using), and then share it with you. I must say, it was very easy working with the code; it's nicely organized.
Another change I'm making: pausing the queue should be separate from pausing individual files. I can pause one file so that it is skipped when its turn comes, while the rest of the files keep uploading. I can also pause the whole queue, which stops the transfer. When I resume the queue, the transfer continues, while the specifically paused file remains paused.
I hope it's OK that I'm using this discussion thread to document my changes, so that you can understand what I've done once I share my code.
Another thing "progressCallbacksInterval" seems like a good idea, but it fixes just half of the problem. Users don't care about how often the progress should be reported, they care when actual progress made. The user wants to know about the percentage when there is at least 1% (or even 0.X, but not 0.XXXXXXXXXX) of the file uploaded, so the progress event should be executed only when that's the case. In most cases the callback updates the user interface, and DOM manipulation is expensive, you don't want to resize elements quarters of a pixel, but at least one pixel. |
Well spotted, this must be done; I will file a milestone for it. About the 200 status on PHP fatal errors, we can avoid this: http://stackoverflow.com/questions/2331582/catch-php-fatal-error
Well yes, the library would gain more flexibility.
This parameter is also used for the upload speed and remaining-time calculations, and those depend on time, not on total upload percentage.

First thoughts on how the flow.js protocol should look: a POST request is used to create a new resource or resume an old one. Servers MUST acknowledge a successful POST operation with a 200 OK status. Example: the file is an image named img.png with a size of 100 bytes; by default the identifier is created by concatenating the file size and the file name (a request/response sketch follows below).

However, this example does not include the final step. How should the client indicate that this is the final chunk? Maybe the request could contain a parameter marking it as final? Another question: should the request include currentChunkSize? In PHP we could use $_FILES["blob"]["size"] instead. Also, I haven't mentioned error handling; can we keep the same logic?
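A reconstructed sketch of the exchange from the client's side (illustrative only: the parameter names follow what flow.js already sends, while the endpoint and response handling are assumptions):

    // Create/resume the resource and send one chunk of the 100-byte img.png.
    // Identifier = file size + file name = "100img.png".
    function sendChunk(chunkBlob, chunkNumber, onDone) {
        var form = new FormData();
        form.append('flowIdentifier', '100img.png');
        form.append('flowFilename', 'img.png');
        form.append('flowTotalSize', 100);
        form.append('flowChunkNumber', chunkNumber);
        form.append('blob', chunkBlob);
        var xhr = new XMLHttpRequest();
        xhr.open('POST', '/upload', true);
        xhr.onload = function () {
            // the server MUST acknowledge with 200 OK; anything else goes
            // through the usual retry logic
            onDone(xhr.status === 200);
        };
        xhr.send(form);
    }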
Using a combination of both time and actual progress should work well for both small and large files, and for large files uploaded in large chunks as well as large files uploaded in small chunks.
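A minimal sketch of that combined rule (the helper and its thresholds are hypothetical, not library code): fire the callback only when enough time has passed and enough actual progress has accumulated.

    // Gate progress callbacks on both elapsed time and progress delta.
    function makeProgressGate(intervalMs, minDelta) {
        var lastTime = 0, lastRatio = 0;
        return function shouldReport(ratio) {    // ratio: current progress, 0..1
            var now = Date.now();
            if (ratio >= 1 ||
                    (now - lastTime >= intervalMs && ratio - lastRatio >= minDelta)) {
                lastTime = now;
                lastRatio = ratio;
                return true;                     // time to fire the callback
            }
            return false;
        };
    }

    // example: at most every 100 ms, and only for >= 1% of new progress
    var gate = makeProgressGate(100, 0.01);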
Looking great.
Yes. My version now submits the following two boolean variables: flowIsFirstChunk and flowIsLastChunk.
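Roughly how the two flags can be derived when building a chunk request (the startByte/endByte names are assumptions about the chunk object):

    // Attach the two booleans to each chunk's form data. The server can then
    // create/truncate the target file on the first chunk and finalize it on
    // the last, without having to count chunks itself.
    function appendChunkFlags(form, chunk, file) {
        form.append('flowIsFirstChunk', chunk.startByte === 0 ? 1 : 0);
        form.append('flowIsLastChunk', chunk.endByte >= file.size ? 1 : 0);
    }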
The server should not care about the size of the chunk; it simply receives whatever data arrives over the network from the client.
In my case, I'm not trusting the 200 status at all. Almost no hosting service has its servers configured to throw HTTP 500 on PHP errors, and PHP errors can be very diverse, from parsing errors and warnings to silent failures. I am also not including the server's offset in an HTTP header, which might get lost through proxy and reverse-proxy servers; I'm including it in the response body.
How about making the chunk size dynamic and recalculating it after each uploaded chunk?
Managing headers in the library adds a bit of complexity; forcing the server to always return JSON would be a simpler solution.
I love the idea. The web upload process can definitely be made much "smarter" than it currently is. I was also thinking that we should keep count of the chunks that had to be retried, and if it reaches a predefined threshold, automatically lower the chunk size, hoping for fewer failures, the same way YouTube lowers video quality when it detects poor bandwidth.
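One possible policy combining both suggestions (every name and number here is a placeholder, not library code): scale the chunk toward a target upload time, and halve it when the last chunk needed retries. Note that variable chunk sizes pair naturally with the offset-based resume discussed above rather than with per-chunk testing.

    // Recalculate the chunk size after every uploaded chunk.
    function nextChunkSize(current, lastChunkMs, lastChunkRetries) {
        var TARGET_MS = 10000;                       // aim for ~10 s per chunk
        var MIN = 256 * 1024, MAX = 50 * 1024 * 1024;
        var size = current;
        if (lastChunkRetries > 0) {
            size = current / 2;                      // back off on failures
        } else if (lastChunkMs > 0) {
            size = current * (TARGET_MS / lastChunkMs);  // move toward target
        }
        return Math.max(MIN, Math.min(MAX, Math.round(size)));
    }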
I finally managed to get the overall progress tracked properly, including the removal of the temporary data that was uploaded before the XHR was forcefully aborted. That data was added to the progress but not subtracted when the chunk failed. So if you now pause a 100MB chunk right in the middle, the file's progress and the overall progress go back by 50MB, which is what should happen, as it reflects reality.

However, I'm writing this note as a TODO item, because it can be improved. The user has no idea that there is chunking going on in the background, so if he pauses the upload while the chunk size is large, say 100MB, it would be a pity to completely discard 90MB of an uploaded chunk just to pause at that instant, instead of waiting a moment for the chunk to complete. We could add a timer and threshold for that. P.S. Keeping track of progress while taking failures into account turned out to be more challenging than I thought :P
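A sketch of that timer-and-threshold idea (all of these hooks are hypothetical): if the in-flight chunk is nearly done, give it a grace period to finish before aborting it and rolling the progress back.

    // Soft pause: let a nearly finished chunk complete instead of discarding it.
    function pauseSoft(file, graceMs) {
        var chunk = file.currentChunk;          // assumed: the chunk in flight
        if (chunk && chunk.loaded / chunk.size > 0.8) {
            var timer = setTimeout(function () { chunk.abort(); }, graceMs);
            chunk.onComplete = function () { clearTimeout(timer); };
        } else if (chunk) {
            chunk.abort();                      // early in the chunk: just drop it
        }
        file.paused = true;
    }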
What do you think about multiplying "flowObj.opts.chunkRetryInterval" by "FlowChunk.retries" every time a chunk fails, with the purpose of increasing the wait time between attempts? Or at least doubling the retry interval with every failure. This should keep clients from continuously flooding the server when there is trouble.
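Either variant is a one-liner. A sketch (chunkRetryInterval and the per-chunk retry counter exist in the library; the scheduling helper itself is illustrative):

    // Back off between chunk retries: linear, or doubling per failure.
    function retryDelay(opts, retries) {
        return opts.chunkRetryInterval * (retries + 1);             // linear
        // exponential alternative:
        // return opts.chunkRetryInterval * Math.pow(2, retries);
    }

    // usage: setTimeout(function () { chunk.send(); },
    //                   retryDelay(flowObj.opts, chunk.retries));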
Ok, so here's my code: http://bit.ly/1dK1pnk
codeperl -> you said you wanted to contribute; you can do that by downloading and testing this version.
I opened an issue for this two days ago: #6. Thanks for sharing your work; I will review it tomorrow.
Ok, great, so I see that there are more people thinking it's a good idea. Cool! It should be implemented then, as it takes just a line of code.
@vvllaadd, just got the zip. Hope I can start very soon.
A quick overview of the new vars being passed to the server: flowGetOffset - a boolean marking the request that checks for already-uploaded data. The server should return the number of uploaded bytes in the response body, or 404 if no data has been uploaded so far.
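A minimal Node sketch of the server's side of that contract (the on-disk layout and port are assumptions; a real handler must also sanitize the identifier before using it in a path):

    var http = require('http');
    var fs = require('fs');
    var url = require('url');

    http.createServer(function (req, res) {
        var q = url.parse(req.url, true).query;
        if (q.flowGetOffset) {
            fs.stat('/tmp/uploads/' + q.flowIdentifier, function (err, stat) {
                if (err) {
                    res.writeHead(404);                  // nothing uploaded yet
                    res.end();
                } else {
                    res.writeHead(200);
                    res.end(String(stat.size));          // uploaded byte count
                }
            });
        } else {
            res.writeHead(404);                          // chunk handling omitted
            res.end();
        }
    }).listen(8080);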
A fix for progress monitoring for small files (files that are uploaded in a single small request):

    /**
     * The size of this chunk
     * @type {number}
     */
    this.size = this.endByte - this.startByte;

    // `$` refers to the chunk object. Clamp the XHR's numbers to the chunk
    // size, because event.loaded/event.total also count the HTTP headers.
    this.progressHandler = function(event) {
        if (event.lengthComputable) {
            $.loaded = Math.min(event.loaded, $.size);
            $.total = Math.min(event.total, $.size);
            if ($.loaded > $._lastLoaded) {
                // only count the newly transferred bytes
                var loadedNow = $.loaded - $._lastLoaded;
                $.fileObj.completedBytes += loadedNow;
                $.flowObj.completedBytes += loadedNow;
            }
            $._lastLoaded = event.loaded;
        }
        $.fileObj.chunkEvent('progress');
    };

(event.loaded can be larger than the file because it includes the HTTP headers)
Another improvement I will make is to skip the "FlowFile.getOffset" call for tiny files (files that would take just a couple of seconds to transfer), as it only adds delay when uploading a few hundred small text files.
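Something like this, reusing the resumeLargerThan option from this build (FlowFile.getOffset is mentioned above; startFrom is a made-up stand-in for kicking off the upload):

    // Only pay the extra round-trip when resuming can actually save time.
    function startFile(file) {
        if (file.size >= file.flowObj.opts.resumeLargerThan) {
            file.getOffset(function (offset) { file.startFrom(offset); });
        } else {
            file.startFrom(0);          // tiny file: just upload from byte 0
        }
    }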
You can download an updated version using the same link: http://bit.ly/1dK1pnk Some of the changes:
Here's the constructor I use with it:

    new Flow({
        target: 'upload.php',
        chunkSize: 2097152,
        progressCallbacksInterval: 100,
        maxChunkRetries: 3,
        resumeLargerThan: 10485760,
        // decide whether a chunk response counts as a success; anything that
        // is not a 200 with a JSON body containing success:true is a failure
        validateChunkResponse: function(status, message) {
            if (status == '200') {
                try {
                    var rs = Ext.util.JSON.decode(message);
                } catch (er) {
                    return 'booboo';
                }
                if (rs && rs.success) {
                    return 'success';
                }
            }
            return 'booboo';
        },
        validateChunkResponseScope: this,
        query: {path: 'some/path'}
    });

This version should be quite stable. I have tested it quite a lot and the fault tolerance works fine; I'm getting close to my goal. Next on my to-do list:
Simultaneous file uploading to one backend will do no good, and the simultaneous file upload count would be equal to the number of upload servers.
I am thinking of writing flow.js v3 in closure style where possible. The library will be much smaller when minified.
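The reason, illustrated with two toy snippets (neither is actual flow.js code): a minifier can freely rename everything local to a closure, while property names on a prototype have to survive in the output.

    // Prototype style: "send" and "retries" are property names, so the
    // minifier must keep them.
    function ProtoChunk(offset) { this.offset = offset; this.retries = 0; }
    ProtoChunk.prototype.send = function () { /* ... */ };

    // Closure style: offset, retries and send are locals and can all be
    // shortened to one-letter names; only the returned key survives.
    function createChunk(offset) {
        var retries = 0;
        function send() { /* ... */ }
        return { send: send };
    }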
Not entirely true. What is no good is uploading two large chunks at the same time to the same server: the bandwidth gets split in two, and it takes just as long as uploading the two chunks one after the other at double the speed. There is, however, a case where simultaneous upload can be of really big help: uploading many small files. I promised a client that I would make a browser upload function at least as efficient as an FTP program. At this time, there is no such thing online; there is no HTML upload method as reliable as FTP for any amount of data. It's not only about uploading large files, but also many small ones.
Good point! How about making the number of simultaneous file uploads dynamic? We know the chunk size and the size of every file. By default, the hard limit should be based on the browser's limits: http://www.stevesouders.com/blog/2008/03/20/roundup-on-parallel-connections/ Also, this would play nicely with dynamic chunk size calculation.
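A sketch of one such heuristic (the thresholds are made up): parallelize when the queue is dominated by files smaller than a chunk, otherwise give a single connection the full bandwidth.

    // Pick a simultaneousUploads value from the queue's composition.
    function pickSimultaneousUploads(files, chunkSize) {
        var small = files.filter(function (f) {
            return f.size <= chunkSize;
        }).length;
        if (small > files.length / 2) {
            // mostly small files: parallel connections help, capped near the
            // typical per-host browser limit of ~6 connections
            return Math.min(6, Math.max(2, Math.round(small / 10)));
        }
        return 1;   // mostly large files: one connection saturates the uplink
    }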
I made some quick tests comparing the total upload time of uploading small files one by one versus simultaneously. I have only tested Chrome v32. With 2 simultaneous uploads, the upload time drops by 50%, which is awesome. For some reason, Chrome doesn't seem to use more than 2 connections to the same host, so increasing that number makes no difference. I'm quite happy with the progress so far; uploading one thousand small files or one gigantic (20GB) file is a piece of cake for me now :D
I was thinking about the simultaneous uploads issue, and it seems we are trying to solve it the wrong way. This might bring some complexity, but it's worth a shot. Files could be transferred within one XHR as an array:

    form.append('file[' + file.uniqueIdentifier + ']', blob, file.name);
    form.append('file[' + file.uniqueIdentifier + '][name]', file.name);
    form.append('file[' + file.uniqueIdentifier + '][size]', file.size);
    form.append('file[' + file.uniqueIdentifier + '][offset]', file.offset);

Once this is done, batch upload could be used for large files as well. If a client has to upload 100 files of 5MB each, in 4MB chunks, then instead of 200 requests the library could do it in 5MB * 100 / 4MB = 125 requests.
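A sketch of the packing behind that arithmetic (an illustrative helper, not library code): greedily fill each request up to the chunk budget with whole small files and slices of larger ones. For 100 files of 5MB with a 4MB budget, this yields exactly 125 batches.

    // Pack {file, offset, bytes} slices into requests of at most chunkSize bytes.
    function packBatches(files, chunkSize) {
        var batches = [], current = [], used = 0;
        files.forEach(function (f) {
            var remaining = f.size, offset = 0;
            while (remaining > 0) {
                var take = Math.min(remaining, chunkSize - used);
                current.push({ file: f, offset: offset, bytes: take });
                used += take; offset += take; remaining -= take;
                if (used === chunkSize) {       // budget full: close the batch
                    batches.push(current);
                    current = []; used = 0;
                }
            }
        });
        if (current.length) batches.push(current);
        return batches;
    }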
You are so right. That would be a huge improvement.
Could you run some more tests with simultaneous uploads? Chrome should support more than two connections to a single host for sure, or try Firefox. If we can't see any improvement with 3 or more connections, then something is fishy.

    xhr.upload.onprogress = function(event) {
        if (event.loaded == event.total) {
            // upload finished, waiting for server response;
            // while the server is processing, we could send the next chunk
        }
    };

Although this is still a simultaneous upload, for most of the time only one connection is open.
Be careful with these, because with simultaneous uploads the first chunk might get uploaded after the last chunk. I would still recommend dropping the simultaneous-uploads feature, because with batch uploads its benefit would be negligible.
The only way I am using simultaneous uploads is with separate files, not with chunks. Chunks are always uploaded in their proper order. Letting chunks upload in a random order is a very, very bad idea.
I still think it's a useful feature. Imagine a queue with one large file, say 2GB, followed by 50 small files of 1MB or less. With one transfer at a time, you have to wait quite a while for the large one to complete, unless you pause it to let the small ones through. With two simultaneous uploads, the small files get uploaded before the large one completes, without extending the large file's transfer time by much.
Oh, this might bring some complexity.
Well, if we have two large files at the beginning of the queue, then this does not help. To keep the library simple, it should use the same logic for all files. I think batch uploading is the way to go, and it solves all the issues above. Although I haven't tried to implement it, and maybe there are some unknown drawbacks.
Already implemented it.
It doesn't do any harm either. The user can pause one of the files if he wants to. Most people upload very large files individually and only small files in bunches. I will be posting today what will probably be my last contribution to the public: my code is drifting far from the original, and this work is not done in my free time but backed by business.
So, here's my latest version: http://bit.ly/1dK1pnk The changes:
If this version becomes the default Flow code, I am considering contributing pull requests for the rest of the improvements that can be made (like "dynamic chunk size" and "bundle small files in fewer requests"). Enjoy! P.S. Shameless plug: don't forget to check out FileRun (http://www.filerun.com), the project where this code will end up.
This version will not become the next version of flow.js, because:
However, I am happy that you are developing this version, because you have come up with some nice ideas.
Yeah, the bureaucratic things I leave to you :) Tests are nice, but I don't have time to write them, so I'll do my best to manually test the most common cases and wait for actual users to report other possible issues :P
I uploaded a file using the Node.js example, but there seem to be some residual files left over in
Yo - any movement on this? I'm not sure these changes were ever merged in. I'm more than willing to do the 'bureaucratic things' if that's all that's needed to get these great additions into the library. Also, all of the bit.ly links are dead, and I'd like to review the code that was submitted.
I have started refactoring this lib on the develop branch: https://github.com/flowjs/flow.js/tree/develop. It has the basic structure with well-covered tests. Currently there is no HTML example/demo or any JS documentation. It would be great to find a developer who would like to contribute.
Quick question -- are any of the changes that @vvllaadd implemented present in the development branch? Many of the vars and functions he mentioned adding seem to be absent.
As soon as I get a bit of free time, I will start a new public project based on my code. It has been successfully tested over one year with thousands of users and millions of uploads.
Awesome - I'll put it on my calendar.
Any updates on the new project @vvllaadd? :)
@Patrik-Lundqvist, @marcstreeter, and anybody else interested: you can watch this project: https://github.com/vvllaadd/filerun . I still need to find some free time to start it up, but it will certainly happen during February.
The project at the link above has been started, and initial code is available for download, with an included example.
Hi,
How does the joining of chunks work in the library, or is it something that needs to be handled on the server side?
Hi,
Well done with the library! Although there currently doesn't seem to be much difference between this and resumablejs, I prefer flow's code structure.
Unfortunately it inherits the same big performance issue.
The "testChunks" idea is nice, but it is useless in its current form. If let's say one have uploaded 5/10 chunks in a previous session and wishes to resume the transfer, wouldn't be more practical for Flow to send only ONE single GET requests to retrieve the size of what has been already uploaded to the server and continue from there, instead of testing all chunks from 1 to 10?
There are a couple more big problems, but let's start with this one.
I am willing to collaborate.
Best regards,
Vlad