[upload] browser crashes while trying to upload large file (~600Mb) #89

Closed
2 tasks done
mariorodeghiero opened this issue Oct 9, 2020 · 17 comments · Fixed by #94

@mariorodeghiero
Contributor

mariorodeghiero commented Oct 9, 2020

Browser crashes while trying to upload a large CSV file (~600Mb) for the resource

Acceptance

  • Generate MD5 and SHA256 hashes for large files

Tasks

  • Split the file into chunks, generate the hash chunk by chunk, and then combine the chunks into the final hash (see the sketch below)
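
For illustration, a minimal browser-side sketch of that chunked approach, assuming the spark-md5 library for incremental MD5 (the actual implementation may use a different hasher):

  import SparkMD5 from 'spark-md5'; // assumed incremental-MD5 library, illustration only

  const CHUNK_SIZE = 1024 * 1024; // read the file 1 MB at a time

  async function md5OfLargeFile(file) {
    const spark = new SparkMD5.ArrayBuffer();
    for (let offset = 0; offset < file.size; offset += CHUNK_SIZE) {
      // Blob.slice is only a view on the file; just this chunk is read into memory
      const chunk = await file.slice(offset, offset + CHUNK_SIZE).arrayBuffer();
      spark.append(chunk); // feed the hasher chunk by chunk
    }
    return spark.end(); // hex digest of the whole file, compiled from the chunks
  }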

Analysis

There are two options:

  1. Uploading streams - Problem: stream upload is not supported by older browsers, since we would have to use the Fetch API.
    [Update]: This is not applicable. We would have to use FormData() with the file stream appended to it, which doesn't match the way the file is supposed to be uploaded to the storage. Currently we pass a buffer, not a FormData object containing a buffer/file/stream, and that won't be handled by the storage service.

  2. Multipart - Problem: We can directly modify the queries etc. of the upload storage request; however, this depends on which storage we use. For each provider we would have to make custom changes, which would basically end up becoming our own Uppy plugin. (Both upload shapes are sketched below.)
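
For reference, a hedged sketch of the two request shapes above, assuming axios and a browser File; which of the two the storage accepts depends entirely on the provider:

  import axios from 'axios';

  // Option 1: send the File/Blob directly as the request body (raw PUT),
  // the shape a presigned blob-storage URL typically expects.
  function uploadRaw(url, file) {
    return axios.put(url, file, {
      headers: { 'Content-Type': 'application/octet-stream' },
    });
  }

  // Option 2: wrap the file in FormData (multipart/form-data); the storage
  // service must explicitly support multipart parsing for this to work.
  function uploadMultipart(url, file) {
    const form = new FormData();
    form.append('file', file, file.name);
    return axios.post(url, form);
  }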

@kmanaseryan
Member

@mariorodeghiero can we close this? It seems to have been solved in #90.

@rufuspollock rufuspollock added this to In Review in DataPub Overview Oct 15, 2020
@mariorodeghiero
Contributor Author

I think yes.
FIXED. The hash issue should be fixed in #90.

DataPub Overview automation moved this from In Review to Done Oct 15, 2020
@mbeilin

mbeilin commented Oct 19, 2020

@mariorodeghiero after some tests in NHS, the following error is thrown when trying to upload a ~600 MB file:

[Screenshot from 2020-10-19 10-28-40 showing the error]

@mbeilin mbeilin reopened this Oct 19, 2020
@mariorodeghiero
Contributor Author

@mbeilin thanks for letting us know. Probably because file.row() and file.addSchema() don't support it.

cc: @risenW

@risenW risenW self-assigned this Oct 19, 2020
@mbeilin

mbeilin commented Oct 20, 2020

@risenW @mariorodeghiero I'm using this file, you can try it too - just download and extract the CSV file from the archive:
https://storage.googleapis.com/datopian-test/anon7762d88b53b9cae7c11295134f6fcb9284b81af1/EPD_202008-000000000000.csv.gz

LMK if it works for you locally.

@risenW
Member

risenW commented Oct 20, 2020

@mbeilin Sorry about my earlier comment, I was using a smaller file. I'm working on the issue now.

@risenW risenW mentioned this issue Oct 20, 2020
@risenW
Member

risenW commented Oct 21, 2020

@mbeilin This issue has been fixed and merged into the master branch. You can try it from your end.

cc: @mariorodeghiero

@mariorodeghiero
Contributor Author

great, let's test it.

@risenW
Member

risenW commented Oct 26, 2020

@mbeilin we found the cause of the problem. It's because of the buffer method. We'll work on it.

@risenW risenW reopened this Oct 26, 2020
@rufuspollock
Contributor

rufuspollock commented Oct 27, 2020

@risenW can we also get a reasonable local (automated) test with a large file so that we increase the likelihood of finding issues here rather than when integrating into upstream code? (NB: the large file might need to be stored outside of git ...)

@risenW
Member

risenW commented Oct 27, 2020

@rufuspollock Yes, I'll set this up locally.

@kmanaseryan
Member

How can we test this in the large file upload function? It's not described in #96.

@kmanaseryan kmanaseryan reopened this Oct 30, 2020
@kmanaseryan kmanaseryan moved this from Done to In Progress in DataPub Overview Nov 2, 2020
@rufuspollock
Contributor

@kmanaseryan I don't quite get this:

[Update]: This is not applicable. We would have to use FormData() with the file stream appended to it, which doesn't match the way the file is supposed to be uploaded to the storage. Currently we pass a buffer, not a FormData object containing a buffer/file/stream, and that won't be handled by the storage service.

What does that mean? Why does it mean this won't be handled by the storage service? Could you give a bit more detail?

@shevron
Collaborator

shevron commented Nov 3, 2020

I am a bit lost here, but here is what is needed in order to support large file uploads to Git LFS-based storage as implemented by ckan-client-js:

  • We need to be able to read files (in the browser, but it would be useful to also have it in Node.js) not entirely into memory but as a stream, e.g. read 1 MB and process it iteratively
  • We need this iterative / stream method to be used when calculating digest (sha256, md5, etc.)
  • We need this iterative / stream method to be available as an input to the Axios data parameter when doing POST or PUT requests. This should be possible from reading the Axios API (see below)
  • We need to be able to read the file between specific locations (or seek to a specific location and then read N bytes from it)
  • We need to be able to calculate a digest (sha256, md5) of a part of the file between two locations (see the sketch after this list)
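
For the last two items, a minimal browser sketch assuming a File/Blob input and the Web Crypto API (which can digest a single buffer, though not incrementally):

  // Hash only the bytes of a File/Blob between `start` and `end`.
  async function sha256OfRange(file, start, end) {
    const part = file.slice(start, end);      // a view on the file; nothing is read yet
    const buffer = await part.arrayBuffer();  // only this range is loaded into memory
    const digest = await crypto.subtle.digest('SHA-256', buffer);
    return Array.from(new Uint8Array(digest))
      .map((b) => b.toString(16).padStart(2, '0'))
      .join('');
  }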

Here is what axios accepts as upload data:

  // `data` is the data to be sent as the request body
  // Only applicable for request methods 'PUT', 'POST', 'DELETE', and 'PATCH'
  // When no `transformRequest` is set, must be of one of the following types:
  // - string, plain object, ArrayBuffer, ArrayBufferView, URLSearchParams
  // - Browser only: FormData, File, Blob
  // - Node only: Stream, Buffer

I'm really not much of a JS expert, so I can't say which one works best for us (maybe a different type depending on node vs browser) but I would be happy to jump on a call to figure this out.

@shevron
Collaborator

shevron commented Nov 3, 2020

Ok, after reading @kmanaseryan's analysis again I understand that some of these will not be supported in older browsers (IE 11 for example, and from what I can see even modern versions of Safari may be problematic but I'm not 100% sure on this one). In this case, if we need to support these browsers, the only option I see going forward is that uploading large files will only be possible with the multipart-basic Git LFS transfer mode. This can work as long as we can still calculate the sha256 digest of the entire file in the browser.

@kmanaseryan
Member

@shevron thanks for the insights.

  • We need to be able to read files (in the browser, but it would be useful to also have it in Node.js) not entirely into memory but as a stream, e.g. read 1 MB and process it iteratively

This is already implemented for both Node and the browser. We have stream(), which does that job. Moreover, for the browser we have a Node-like stream, thanks to the recent updates that @risenW made.

  • We need this iterative / stream method to be used when calculating digest (sha256, md5, etc.)

After the recent updates, the hash() function now uses streams.

  • We need this iterative / stream method to be available as an input to the Axios data parameter when doing POST or PUT requests. This should be possible from reading the Axios API (see below)
  • We need to be able to read the file between specific locations (or seek to a specific location and then read N bytes from it)
  • We need to be able to calculate a digest (sha256, md5) of a part of the file between two locations

We can implement a new method which will do this behind the scenes using streams (a sketch follows below).
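
A minimal Node-side sketch of what such a method could look like, using fs directly rather than our own stream() helper; the name hashRange is hypothetical:

  const fs = require('fs');
  const crypto = require('crypto');

  // Hypothetical helper: hash only the bytes of `path` between `start` and `end` (inclusive).
  async function hashRange(path, start, end, algorithm = 'sha256') {
    const hash = crypto.createHash(algorithm);
    const stream = fs.createReadStream(path, { start, end });
    for await (const chunk of stream) {
      hash.update(chunk); // chunks are hashed as they arrive; the file never sits fully in memory
    }
    return hash.digest('hex');
  }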

Here is what axios accepts as upload data:

  // `data` is the data to be sent as the request body
  // Only applicable for request methods 'PUT', 'POST', 'DELETE', and 'PATCH'
  // When no `transformRequest` is set, must be of one of the following types:
  // - string, plain object, ArrayBuffer, ArrayBufferView, URLSearchParams
  // - Browser only: FormData, File, Blob
  // - Node only: Stream, Buffer

I'm really not much of a JS expert, so I can't say which one works best for us (maybe a different type depending on node vs browser) but I would be happy to jump on a call to figure this out.

As for Axios, in the browser I've used FormData and File as the data for the payload, but it didn't work, i.e. the HTTP request for uploading to blob storage didn't fail, but the verification failed. In Node I haven't tried.

@kmanaseryan
Member

Closing this; you can follow datopian/ckan-client-js#32.

DataPub Overview automation moved this from In Progress to Done Nov 4, 2020
@risenW risenW removed this from Done in DataPub Overview Nov 5, 2020