Emphasising the importance of in-place writes #260

Open
hills opened this issue Jan 5, 2021 · 30 comments

@hills

hills commented Jan 5, 2021

I'm a bit late to the party, and for some time have been looking forward to porting an application which logs large media files to this API.

But I'm finding the spirit of the API to be undermined in practical use.

Writes go to a copy of the original file, which is then moved into place; this is effectively mandated by a spec that demands atomicity. It reduces the practical value of the API to almost nil, because the resulting behaviour is almost as constrained as using Blobs and downloading them. Performance for repeatedly appending to a file is O(n^2).
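To illustrate the cost (a minimal sketch, assuming a FileSystemFileHandle obtained from a picker; the helper name is hypothetical): each append has to copy the whole existing file into a swap file before writing, so n appends touch O(n^2) bytes in total.

// Hypothetical append helper using today's API: every call copies the
// entire file (keepExistingData), writes one chunk at the end, then
// atomically swaps the copy back into place on close().
async function appendChunk(fileHandle, chunk) {
  const file = await fileHandle.getFile();
  const writable = await fileHandle.createWritable({ keepExistingData: true });
  await writable.write({ type: 'write', position: file.size, data: chunk });
  await writable.close(); // full-file swap happens here
}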

I'd like to emphasise the importance of in-place writes -- it seems that "ISSUE 6" is pivotal for this API?

Because it defines which side of this watershed the API falls on:

  • a high-level API, for navigating the file system and synchronising whole files between JavaScript objects and disk, abstracting away as much as possible to fit a clear (and limited) use case operating mainly out of RAM, like text editors.

  • a conventional (POSIX-like) API, embodying decades of prior art and capable of almost any practical use, up to databases, logs/streams and more; held behind a permissions model suitable for the web.

With elements of both that I can see, a lengthy period of iterative development is likely to ensue, with case-by-case discussion that often leads to APIs becoming like a series of patches. My concern is it'll be a long time before I can really make use of this (if at all).

Whereas a design built around in-place writes (or expressly omitting them) is an essential principle for understanding the future of this API, and would accelerate the design.


ISSUE 6 There has been some discussion around and desire for a "inPlace" mode for createWritable (where changes will be written to the actual underlying file as they are written to the writer, for example to support in-place modification of large files or things like databases). This is not currently implemented in Chrome. Implementing this is currently blocked on figuring out how to combine the desire to run malware checks with the desire to let websites make fast in-place modifications to existing large files. #67

@mkruisselbrink
Contributor

I agree that not supporting in-place writes does limit what use cases this API can be useful for. On the other hand the model of writing to a temporary file which is then atomically moved in place after writing finishes is a long established method of saving files employed by many native applications to avoid data loss. As such I disagree that this means there is (almost) nil value in this API.

For use cases that really do require in-place modifications (like as you mention databases) we hope that Native IO will be a better fit. Since that API will be limited to website private storage that will of course not fit all use cases either, but I think with the two APIs the majority of use cases should be covered.

(some cases, like append-only logs/streams, are somewhere in between; that seems like something that could plausibly be supported by this API in some form, but as you point out it is kind of the worst-case scenario for this API today).

@hills
Author

hills commented Jan 6, 2021

That is a benefit of writing whole files and atomically moving them into place, but forcing it invalidates the case for having write/seek calls, as they can then only achieve the same things as the existing ArrayBuffer API.

The real value this API has the potential to introduce is the ability to incrementally get data out of the browser onto the filesystem; something that's impossible right now.

Serious multimedia use is mentioned, but that is going to need to append to files in O(1) time and update a header.

@hills
Author

hills commented Jan 6, 2021

we hope that Native IO will be a better fit. Since that API will be limited to website private storage that will of course not fit all use cases either

I think this risks a fight over the gap. This API makes progress in solving the granting of filesystem access, but will always be pushing for more I/O capability. That API will face a morass of fingerprinting and space-allocation issues as it progresses, leaving it pushing for the capabilities of this API.

@ddumont

ddumont commented Jan 17, 2021

I really really would like to see this. It's been hell without it.

@ddumont

ddumont commented Feb 5, 2021

For use cases that really do require in-place modifications (like as you mention databases) we hope that Native IO will be a better fit.

I could not disagree more. As other users have pointed out, we are trying to build apps in the browser that interop with other apps a user might have (ya know... like... real applications). The current behavior makes file watching useless.
It also presents huge problems for large log files that need to be appended to, or non-text files that need to be updated frequently. Just because I want to write a large file (be it a database, torrent file, log file, whatever) doesn't mean I want it to only be accessible by my site.

What if Adobe wanted to write Photoshop as a PWA and let you edit files? Their PSD file format can be huge. They would not like having to churn the files on the filesystem so frequently, and if THEY wanted to avoid data loss by using a temp file... THEY could do that. Why are you taking away the choice? Do you really have a good reason?

I don't think we need to force people to do it one way or another... but let us choose to write inPlace if we need it. I'm fine with the current behavior being the default... but the application writer should have the tools they need to be a responsible writer.

@ddumont

ddumont commented Feb 5, 2021

On the other hand the model of writing to a temporary file which is then atomically moved in place after writing finishes is a long established method of saving files employed by many native applications to avoid data loss. As such I disagree that this means there is (almost) nil value in this API.

No. There is actually almost nil value in the API as it is, compared to inPlace: true, because before, the application had the choice to use a temp file as a way to prevent data loss. Now they have no choice and face problems they cannot work around.

@ddumont

ddumont commented Feb 5, 2021

Here's a Chrome bug for the issue:
https://bugs.chromium.org/p/chromium/issues/detail?id=1168715

@jimmywarting

jimmywarting commented Mar 8, 2021

I want something like inPlace too.
It's important for things like large files, databases, multimedia, torrents, IPFS, file watching, making small random writes, etc., if PWAs are to replace native applications.
I definitely want to be able to download and stream a video at the same time; I can't do this if it's being written to a temporary file. I want something like a duplex stream. I want to be able to read what I have written to an unfinished open writable handle.

I do not wish to see yet another storage layer like Native IO that does almost the same thing.
I hope that Native IO and this File System Access API can merge somehow; it's enough with IDB, Blink's old sandbox filesystem, localStorage, cacheStorage and now this.

I think we can all agree that something like inPlace is important to have and that it should be added and supported.
The real stopper is Google's security team, which really wants to perform Safe Browsing analysis of written files before the files are available under their normal file name/extension.
Google should not have the only voice/decision in the making of this API.
This way of writing data to a temporary file is bad for some applications, and some alternative solution needs to arise for modifying existing files.

It would be better if they could scan the data that is being written to the file, like a man in the middle, rather than replacing the entire file. Kind of:

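// conceptual sketch only: fsAccess, malwareScanner and destination are
// hypothetical streams from the comment above, not real APIs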
readable
  .pipeThrough(fsAccess)
  .pipeThrough(malwareScanner)
  .pipeTo(destination)

@hills
Author

hills commented Mar 18, 2021

I didn't intend to just rally support for having in-place writes (though that is welcome).

I hoped to spell out the clear choice here. If the spec limits writes to whole files, then the entire FileSystemWritableFileStream API is effectively redundant. It's better, and already possible, to do that by using the existing Blob APIs and writing the file in one operation.

Of course, like the others who have contributed, I would find a good use for in-place writes to incrementally get large data out to the filesystem.

A good API would decouple policy (such as virus checking); here it has become intertwined. Policy like this differs over time, between platforms, etc. Whereas APIs are much more difficult to change, as by definition there are more users of an API than implementors of it.

In the specific case of virus checking, there could be solutions which do not involve copying the file, such as quarantining the existing file during the in-place writes. IMO it is important to get this as far out of an API spec as possible, and then design, holistically, an API which benefits from the last 40+ years of work in that area.

Or, simplify this API into a way to sync Blobs to files on the user's filesystem. That's a high-level operation which gives the browser incredible scope to optimise the implementation.

@kfdf
Contributor

kfdf commented Jun 14, 2021

Adding something like an appendOnly option to createWritable, so that the temporary file doesn't replace the original but is appended to it (keepExistingData is then implied to be true), would enable a lot of use cases.
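A sketch of how that might look (the appendOnly option is hypothetical; nothing like it is in the spec):

// Hypothetical: appendOnly would imply keepExistingData: true and
// restrict writes to the current end of the file, letting the browser
// append the temporary data to the original instead of replacing it.
const writable = await fileHandle.createWritable({ appendOnly: true });
await writable.write(logChunk); // always lands at the end
await writable.close();        // O(1) in the existing file size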

@ferdnyc

ferdnyc commented Jul 16, 2021

On-topic

I wonder if it's possible to take a page from the filesystem playbook, and use journaled transactions to provide crash-resilient file integrity? In so-called inPlace mode (though I think it would be best to change that name, for transaction-based writables), updates go to a journal file instead of to the actual output file, with every seek() starting a new transaction. The file isn't guaranteed to be readable in its updated state until the transaction is committed by releasing the write handle, at which point the journal is replayed and modifications are applied to the file in-place.

Commits would be expensive, sure, but so is copying the entire file to a temporary just to make it writable. And journaled writes would immensely improve the performance of small updates to large files, which sounds like a common use case for in-place writes. It wouldn't be at all helpful for random-access, mixed read/write use cases, but maybe Native IO is a better fit there. Not sure how this would interact with other concerns like malware scanning, though.
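A minimal sketch of the journal idea (all names hypothetical; the write(data, { at }) shape is borrowed from the access handle proposal purely for illustration):

// Hypothetical journal: buffer (position, data) records in order, and
// replay them against the real file only when the handle is released.
class WriteJournal {
  constructor() { this.entries = []; }
  write(position, data) { this.entries.push({ position, data }); }
  commit(accessHandle) {
    for (const { position, data } of this.entries) {
      accessHandle.write(data, { at: position }); // apply in order
    }
    accessHandle.flush(); // file is consistent only after a full replay
    this.entries = [];
  }
}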

Off-topic

@ddumont

What if Adobe wanted to write Photoshop as a PWA and let you edit files? Their PSD file format can be huge. They would not like having to churn the files on the filesystem so frequently, and if THEY wanted to avoid data loss by using a temp file... THEY could do that. Why are you taking away the choice? Do you really have a good reason?

That's actually a great example, really, because the PSD format is not merely huge — it's ancient, famously convoluted [1], and completely unsuited to incremental updates. Photoshop on the desktop doesn't update PSD files incrementally; in fact, that would be antithetical to Photoshop's non-destructive editing paradigm. Photoshop uses a scratch file for live updates, and only serializes to PSD on save — which writes a new file, rather than updating the original.

What's more, Adobe did want to write Photoshop as a web app, they have, and to make it work they had to admit to PSD's failings as a format (on occasion it was defended [2][3] somewhat quixotically) and introduce a completely new document format, the Photoshop cloud document, just to add incremental saving to Photoshop.

Meanwhile, MS Office introduced an incremental-write version of the Word document format back in Office 2003, and they ended up having to patch it out again due to concerns about deleted information being retained after save. Many data-interchange formats geared towards the end user sacrifice the ability to make speedy incremental updates in favor of data-integrity assurances.

My point here isn't that in-place edits aren't potentially useful to app developers — of course they are. But they're rarely what users actually want for their files (nor should they be required to possess the technical savvy to make that decision themselves), and the problems solved by incremental writes can often be addressed in other ways, without resorting to live-modifying user data. And to @jimmywarting 's point about Google driving the conversation about in-place writes: Agreed they shouldn't be the only voice in the discussion. But neither should app developers' desires to do what's most expedient for them be the overriding concern.

It doesn't feel unreasonable or inappropriate, to me, that the browser vendors' overriding concern for local-file access would be ensuring that when app developers push code into their browser, they aren't provided with APIs that expose potentially destructive operations. Even if the app's developers "want" to have that option. After all, it's not their own files we're talking about, it's the user's files.

Notes

  1. Content warning: Strong language
  2. Content warning: Corporate apologist language
  3. Also, apologies that the wayback capture breaks that blog's styling. Using select-all or a reader-mode extension is probably the easiest way to make the font readable against the dark background. That being said, there isn't all that much value to reading it anyway.

@bradisbell

Even if the app's developers "want" to have that option. After all, it's not their own files we're talking about, it's the user's files.

In these discussions, there is a regular assertion that developers and users are somehow in opposition in what they want. I strongly challenge this. The whole point of software is to make a machine do something of value for the user.

It is also asserted that the user agent/spec knows better than the users and developers as to what the user wants to do. This of course is impossible on the whole, as the user agent is a general platform, not a purpose-built solution.

If the user wants to do something with their files, they should be able to. That requires applications to access those files, in whatever mode is required to do the task at hand.

@jimmywarting

jimmywarting commented Jul 17, 2021

I want to touch on a related issue I faced earlier...

onunload = () => writable.close()
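// note: close() is async; the page can be torn down before the
// swap-file copy completes, which is presumably why this does not work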

Here the scenario is that you download something large using, say, something like WebTorrent, and in the middle of it all the user decides to close the tab/browser.
I thought I could just close the writable handle to save it and continue downloading the rest of the data later, but I couldn't. It did not work.
If I were allowed to write directly to the file "in place" then it would not be an issue.

@mkruisselbrink
Contributor

Here the scenario is that you download something large using, say, something like WebTorrent, and in the middle of it all the user decides to close the tab/browser.
I thought I could just close the writable handle to save it and continue downloading the rest of the data later, but I couldn't. It did not work.
If I were allowed to write directly to the file "in place" then it would not be an issue.

For this particular use case, I think something like the autoClose option proposed in #236 would also work?

Having said that, we realize the importance of in-place writes. Specifically for files in the origin private file system the proposed access handle API in #310 aims to solve that use case. We have no current plans to extend that to also allow in-place writes outside of the origin private file system though.
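For reference, a sketch of what that access handle shape looks like against the origin private file system (this later shipped as FileSystemSyncAccessHandle, available only in dedicated workers; the file name is illustrative):

// Inside a dedicated worker: open a file in the origin private file
// system and modify it in place, with no swap file involved.
const root = await navigator.storage.getDirectory();
const handle = await root.getFileHandle('data.bin', { create: true });
const access = await handle.createSyncAccessHandle();
access.write(new Uint8Array([1, 2, 3]), { at: 0 }); // in-place write
access.flush();
access.close();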

@bradisbell

We have no current plans to extend that to also allow in-place writes outside of the origin private file system though.

@mkruisselbrink Why? Why intentionally limit the user from using files on their own filesystem?

@jimmywarting

☝️ I second this. Why limit it to just the sandboxed origin private file system?

For this particular use case, I think something like the autoClose option proposed in #236 would also work?

Auto close would be good... But I also see it as unnecessary: if we could just call createAccessHandle and get an AccessHandle from everywhere, then the problem would be solved (mostly), depending on whether you flush after writes...
Also, I don't see any reason why you could not just close/flush during the unload event. Why would we need an extra option for this? Is there something wrong with onunload?

It's good that we even get a duplex stream / AccessHandle at all; it's unfortunate that it's limited to the sandboxed origin...

@rektide

rektide commented Aug 13, 2021

We have no current plans to extend that to also allow in-place writes outside of the origin private file system though.

This has now become a specification within a specification, and the one specification that,

  • has performance
  • is not heavily-intermediated by the user-agent
    • forcing more spec bloat/workarounds like autoClose

is also:

  • not useful for interacting with real files?

This seems extremely backwards. The Extensible Web Manifesto starts off by saying, "Browser vendors should provide new low-level capabilities that expose the possibilities of the underlying platform as closely as possible. They should seed the discussion of high-level APIs through JavaScript implementations of new features." That sounds like the opposite of what is happening here: there is a high level API that abstracts & restricts file-system access, and the low level tools are only being made available in a very small & limited capacity. It'd probably be better for everyone to drop the classic API altogether & only spec & ship Access Handles to start, as high-level techniques like atomic writes &c could be implemented performantly atop the low-level capabilities, were they exposed. Short of that radical rewrite, it would be horrible to only have performant write access to these concealed, non-user-facing origin-private sandboxes. Users need ways to write big PSD files, big torrent files, big sqlite files. We can't tell them no.

@jimmywarting

A horrible workaround would be to first write to the sandboxed origin and then move the files to disk (either with write access or a save dialog), only so you can have a duplex stream... This is a costly operation that requires copying the data and maintaining state about what is newest.

@ddumont

ddumont commented Jun 17, 2022

Please bring accessHandle support to things outside of the OPFS.

@geoffreylitt

I wanted to second some of the comments above: I'm curious why in-place writes aren't planned to be supported outside OPFS?

To motivate this, here's an example use case where it could be valuable to add support outside of OPFS.

I'm developing Riffle, a relational persistence framework based on SQLite. Riffle is available in both desktop apps and web apps. On desktop, we write to a SQLite file visible on the user's filesystem, which has a number of benefits. The user can see the files, back them up, share them with others, and even edit the files through other apps.

On web, we currently use absurd-sql and persist to IndexedDB. This approach lacks those benefits of having the SQLite files be visible to the user.

We're very excited about the ongoing work by the SQLite team to create a WASM build that persists to OPFS. But this work would be much more valuable to us if it could also persist to user-visible files.

Would appreciate any insight you can offer, @mkruisselbrink

@tomayac
Contributor

tomayac commented Sep 26, 2022

(Reflecting what I said on Twitter: “There’s nothing that would stop you from bridging the OPFS and the regular file system as a one-off. At a convenient time and when writes are committed, you could ask the user if they want to back up the database to a regular file, or let them initiate the process when they want.” This addresses the immediate use case that motivated the feature request in #260 (comment).)
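A sketch of that one-off bridge in an async context (file names illustrative):

// Copy a committed database file out of the OPFS to a user-visible
// file when the user asks for a backup.
const root = await navigator.storage.getDirectory();
const src = await root.getFileHandle('app.sqlite');
const file = await src.getFile();
const dest = await showSaveFilePicker({ suggestedName: 'app.sqlite' });
const writable = await dest.createWritable();
await writable.write(file); // Blob-backed, streamed by the browser
await writable.close();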

@jimmywarting

you could ask the user if they want to back up the database to a regular file, or let them initiate the process when they want

There are always going to be certain times when both users and developers want to write directly to a folder instead of OPFS.
OPFS plus periodic sync does not appeal to everyone, as it's a costly and slow operation that requires copying all the data, and replacing the file also destroys the file descriptor if something else has that same file open.

If you have a database and it's being used by something else, like a server, then you are mostly always going to want it to be in sync with the real database while you use the browser as a simple GUI tool.

@tomayac
Contributor

tomayac commented Sep 26, 2022

Right, this is definitely not supposed to meet any and all use cases. I was mostly just pointing out that it’s perfectly feasible to cross the boundaries between the two file systems.

@yume-chan

I saw #236 was closed in favor of whatwg/fs#19, then whatwg/fs#19 was closed in favor of a really unrelated issue. Now it seems the autoClose option also won't happen.

I'm building a Web app that allows users to record their Android phone screens to WebM files.

If the user hits the stop button on my page, I need to finalize the WebM file by seeking and writing to multiple positions, so the File System Access API is the only option.

But if they don't (for example, they closed the tab, the browser crashed, the system crashed, or the power failed), I need the file to exist on their file system. The file is playable, only without the duration field updated. Like OBS: when OBS is killed while recording, the output file is on my file system, not in some invisible OPFS!
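For illustration, the finalize step is roughly this (offsets and names hypothetical); the problem is that nothing reaches the real file until close() succeeds:

// Patch the WebM duration field once recording stops; if the tab dies
// before close(), the swap file is discarded and the recording is lost.
const writable = await fileHandle.createWritable({ keepExistingData: true });
await writable.write({ type: 'write', position: durationOffset, data: durationBytes });
await writable.close();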

@a-sully
Collaborator

a-sully commented Jan 9, 2023

Ah, thank you for bringing this up! whatwg/fs#19 had two parts - some proposals around file locking and in-place writes, and the autoClose flag. I closed it since the locking discussion is now happening on another issue, but I'd forgotten that autoClose had been lumped into it as well. I just opened whatwg/fs#86 to track autoClose.

As for this issue, it's related to whatwg/fs#41, which proposes an async alternative to SyncAccessHandles, though not quite a duplicate because presumably we'd still issue (much more performant!) writes to a swap file. I still don't expect we'll support in-place writes for files outside of the Origin Private File System anytime soon (for all the reasons @mkruisselbrink mentions above).

That being said, there's a lot of room for optimization even with that constraint. The most straightforward optimization would be to implement the swap file as a copy-on-write file (as I mentioned in a recent presentation with other browsers), so that specifying keepExistingData is essentially free (sketched below). Unfortunately,

  • this could only be supported on copy-on-write filesystems (including APFS, BTRFS, and ZFS, but notably excluding NTFS and ext4) and
  • this would require significant code changes to Chromium

I can't see us (Chromium) prioritizing this in the near future, especially since it doesn't help Windows at all.
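A minimal sketch of the line in question (keepExistingData is the real option discussed above; the handle name is hypothetical):

// Today this copies every byte of the existing file into the swap file
// before the first write; on a copy-on-write filesystem the copy could
// be a near-constant-time clone instead.
const writable = await fileHandle.createWritable({ keepExistingData: true });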

@rektide

rektide commented Jan 16, 2023

I still don't expect we'll support in-place writes for files outside of the Origin Private File System anytime soon (for all the reasons @mkruisselbrink mentions above).

Best I can tell, this refers to this quote from @mkruisselbrink:

On the other hand the model of writing to a temporary file which is then atomically moved in place after writing finishes is a long established method of saving files employed by many native applications to avoid data loss. As such I disagree that this means there is (almost) nil value in this API.

So, the justification for not having moderately performant file access to actual files (as opposed to the occluded, invisible non-user files present in OPFS) is that some use cases preferred avoiding data loss? And avoiding data loss continues to be seen as of such arch & vital importance that we won't even consider giving webapps moderately high-speed access to files? What do I have wrong here? Are we ok saying that image editors, databases, git clients, & other would-be web apps must be confined to non-user files, to being site-only tools? Why would limiting user agency to only un-performant regimes ever be acceptable? I feel like @mkruisselbrink was trying to use an occasionally-appropriate safe practice for non-performant concerns to override the ask of those who wanted a moderately performant general solution.

I must say, this feels like an impressively awful conclusion for this specification to have, seemingly, ended at. We deeply mis-serve the user & the asks of the community by limiting ourselves to these ends, and it's unapparent in the extreme why working with actual files users can see must have such incredibly poor speeds, via entirely different APIs. I understand the limitations of Windows & other filesystems not offering snapshots, and why keepExistingData is unlikely to be made available, but that was never the ask: having one good API that isn't bound to an extreme lowest common denominator is what I think most of us would hope for & expect. The concerns about safety that so constrained File System Access seem limited to only some folk, & many have chimed in clamoring for more authentic file system access: access handles give us that, but it seems cruel & pointless & self-defeating that they have been limited to such a small & narrow OPFS world. I don't think many in this thread see any comprehensible explanation for why the performant API for talking to files has been limited to such a narrow constraint. It seems uncontroversial to say: the good API for talking to files should be available for all files. This would seemingly greatly benefit the web platform in general, and the constraints imposed by OPFS seem only damaging.

@vapier
Contributor

vapier commented Sep 3, 2023

I haven't fully digested the thread here, but one thing stands out:

I hoped to spell out the clear choice here. If the spec limits writes to whole files, then the entire FileSystemWritableFileStream API is effectively redundant. It's better, and already possible, to do that by using the existing Blob APIs and writing the file in one operation.

This might be true for small files (like <<100 MB?), but it certainly is not beyond that. Blobs are in-memory only, and doing anything 100 MB+ or even GB-sized is completely not feasible with Blobs; it will OOM the tab/window/system. That is the primary use case for me -- downloading large files from a remote server, e.g. an FTP/SFTP client.

Uploading isn't an issue because the existing input/Files API makes it easy to chunk uploads without needing to mmap the entire thing.

The lack of in-place appends is annoying when resuming large downloads, but it's livable.

@hills
Author

hills commented Sep 26, 2023

I hoped to spell out the clear choice here. If the spec limits writes to whole files, then the entire FileSystemWritableFileStream API is effectively redundant. It's better, and already possible, to do that by using the existing Blob APIs and writing the file in one operation.

This might be true for small files (like <<100 MB?), but it certainly is not beyond that. Blobs are in-memory only, and doing anything 100 MB+ or even GB-sized is completely not feasible with Blobs; it will OOM the tab/window/system.

Your assertions here aren't correct in practice; mostly, Blobs are very usable beyond this case.

The Blob API is opaque, which has allowed the implementations to evolve. In practice they aren't tied to in-memory storage.

That's the benefit of its high-level API.

Low-level flexible APIs also have their benefits, but there's no place for complex APIs that actually only allow high-level things to happen.

@jimmywarting

jimmywarting commented Sep 26, 2023

From what I have heard/read/understood: if you create small Blobs a little bit at a time, the implementer can offload some of that data to disk, and combining them with new Blob(blob_chunks) resolves to a final large Blob with multiple read offsets recording where the smaller chunks are located.

so something like this could work:

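// assumption: 'stream' is an async-iterable ReadableStream of Uint8Array chunks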
const chunks = []
for await (const uint8_chunk of stream) {
  chunks.push(new Blob([uint8_chunk]))
}
new Blob(chunks) // final blob

This is just a theory. (Just don't create too many small Blobs at a time.)
https://docs.google.com/presentation/d/1MOm-8kacXAon1L2tF6VthesNjXgx0fp5AP17L7XDPSM/edit#slide=id.g8fe6c1657_0_5

Though I can verify that if you write small Blob chunks to IndexedDB and later join them, then you can also create huge Blobs that are GB in size.
Those Blobs will not be read into memory, as they will just point to some location on the disk.

@cyflux

cyflux commented Feb 13, 2024

This feature is very important to my simple large-file transfer service.
close() consumes too much time!!!
