Large Files #40

Closed
jWatsonDev opened this issue Feb 13, 2018 · 29 comments


@jWatsonDev

Could large file functionality potentially be added? This utility works great for parsing unl files, but it is unable to process larger files (e.g. 619 MB).

@adamreisnz
Owner

Yep, would be great if we could support this! Will flag as help wanted for now.

@MichalCz

Hmm... I may have some time to check this out - it would mean rewriting some of the helpers, but with my scramjet it shouldn't be such a big deal.

Would you still be interested in that pull request, @adamreisnz?

@adamreisnz
Owner

adamreisnz commented Sep 28, 2018

@MichalCz thanks, yes that sounds good. Could you make it in such a way that it requires an opt-in via a setting and isn't enabled by default? This is just because I'm unfamiliar with scramjet, and a lot of people rely on this package, and I wouldn't want to risk any unexpected behaviour breaking builds etc.

Happy to see what you're proposing though and help out as needed.

@MichalCz

Got you on the builds - some of my projects rely on the package too.

I'll try to scribble some kind of PoC first so we can see how this works. :)

@MichalCz

@adamreisnz - one question though - I think the current implementation actually replaces the file. In that case I guess using a temp file should be acceptable? There is another option, but it would mean writing quite a sophisticated random-access streaming module...

@adamreisnz
Owner

fs.writeFile is used under the hood to make the replacement, so that basically overwrites the file with the new (replaced) content. What approach did you have in mind?
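
For context, that approach boils down to something like the sketch below (simplified, not the actual replace-in-file source):

```js
const fs = require('fs');

// Whole-file approach: buffer the entire file in memory, replace,
// then overwrite the original. Memory use scales with file size,
// which is what breaks down for very large files.
function replaceInFileSketch(file, from, to) {
  const contents = fs.readFileSync(file, 'utf8');
  const replaced = contents.split(from).join(to);
  fs.writeFileSync(file, replaced, 'utf8');
}
```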

@MichalCz

Well, I'm thinking of actually reading and writing the same file using fs.open - meaning I'd read the file and write to it using fs.write and fs.read. The only catch here is that the write index cannot go beyond the read index (otherwise we'd read stuff we just wrote).

Actually I did write a simple version of that and I could clean this up and publish it as a separate module.

Now what that means is:

  • the memory footprint of such an operation is a couple of 16 kB buffers, plus the difference in length between the replaced content and its replacement
  • the file will be written at the same time as it's read

I'm quite sure it would be quicker for larger files (because the software won't be waiting for the whole file to load), but for smaller files the current approach may be quicker.
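
A minimal sketch of that same-file approach (assuming the replacement never grows the content, so the write position can never overtake the read position, and ignoring matches and multi-byte characters that span chunk boundaries):

```js
const fs = require('fs');

// Read and write the same file through one descriptor, keeping the
// write position at or behind the read position.
function replaceSameFile(file, from, to) {
  const fd = fs.openSync(file, 'r+');
  const buf = Buffer.alloc(16 * 1024); // one 16 kB read buffer
  let readPos = 0;
  let writePos = 0;
  let bytesRead;
  while ((bytesRead = fs.readSync(fd, buf, 0, buf.length, readPos)) > 0) {
    readPos += bytesRead;
    const out = Buffer.from(buf.toString('utf8', 0, bytesRead).split(from).join(to));
    writePos += fs.writeSync(fd, out, 0, out.length, writePos);
  }
  fs.ftruncateSync(fd, writePos); // drop leftover bytes if the content shrank
  fs.closeSync(fd);
}
```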

@adamreisnz
Owner

Sounds good, let's give it a try. If we can choose the approach via a flag, we can let people try it out first before making it the default (if it works well) in an upcoming major version.

It'd be good if we can come up with a test/benchmark suite to actually check the speed for different files, so we can compare the two approaches.

@MichalCz

MichalCz commented Oct 3, 2018

@adamreisnz I've got some not-really-nice code that actually does use fs.write and fs.read with offsets - so this approach is feasible. I'll try to make a module out of it that returns read and write streams to the same file. The write stream would be limited by the read stream for now (nothing can be written past the place we're reading), but I'm thinking I could get around this as well...

Then the additions in replace-in-file would be fairly small.

As to the test suite - definitely. I believe my approach has an edge case when a file is enlarged by the replacement (where it could lock up both streams at the same time).

@cliedelt

cliedelt commented Nov 5, 2018

Any update on this?

@MichalCz

MichalCz commented Nov 5, 2018

Kinda - I've had just a little time to clean up my code and publish it on GitHub - there's still a review to be done before actually adding a pull request here.

Here's the repo signicode/rw-stream.

@orgertot if you have some time and some free neurons to spend on this, the main worry I have is what will happen when the write stream overtakes the read stream. This is what's being tested by test.js - I'm replacing a single line with a couple of them - this way you can grow the file to gigabytes in a couple of runs of test.js. If you'd confirm my logic and perhaps run a couple of tests, I'd finish up, push the module to npm and issue a pull request.

M.

@adamreisnz adamreisnz added the 8.0 label Nov 5, 2018
@MichalCz

MichalCz commented Nov 7, 2018

@adamreisnz - this is my proposal for the replace procedure.

Streamed replace in signicode:master.

This is a little complicated and could be written more simply with scramjet, but I didn't want to add too many dependencies (scramjet adds 4 deps in total). Scramjet also requires Node v8.6+. If you'd prefer simpler code and don't mind 4 extra dependencies, I can rewrite it quickly to roughly 10 lines of code.

I haven't run this yet and it's not exposed, so please take a look and tell me how you feel about it.

Also let me know how you'd want to expose this in the CLI and in the module API.
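
Conceptually, the wiring looks something like the sketch below (this assumes rw-stream resolves to a read stream and a write stream over the same file; the exact API may differ, so see the linked branch for the real code):

```js
const { Transform } = require('stream');
const rw = require('rw-stream'); // assumed import and API

async function streamedReplace(file, from, to) {
  // Assumption: rw(file) resolves with read/write streams over the
  // same file, with the write side held behind the read side.
  const { readStream, writeStream } = await rw(file);

  const replace = new Transform({
    transform(chunk, encoding, done) {
      // Naive per-chunk replacement; matches spanning chunk
      // boundaries would need carry-over handling.
      done(null, chunk.toString().split(from).join(to));
    },
  });

  return new Promise((resolve, reject) => {
    readStream
      .pipe(replace)
      .pipe(writeStream)
      .on('finish', resolve)
      .on('error', reject);
  });
}
```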

@adamreisnz
Owner

Thanks, I'll try to look into this next week, want to make sure I have enough time to look into it carefully!

@MichalCz

MichalCz commented Nov 7, 2018

No problem. I'll try this some time next week, as I need similar code in scramjet...

@MichalCz

Hi @adamreisnz - I've made some checks and it looks like the code is solid even if I write a lot more than I read (it just keeps the excess in memory, which is still much less than keeping the whole file there). I've tried it on files in excess of 10 GB and it works well there.

I guess a sensible next step would be to run the test scenarios against this method?

@adamreisnz
Owner

Hi @MichalCz, that sounds good, thanks!

I've had a peek at the code, a couple of questions come to mind:

  • Why did you wrap the rw call in a Promise? From looking at that library it already returns a Promise itself.
  • Might be good to abstract the transformation function away into a helper to simplify the code a bit?
  • What is the Node version requirement for the rw-stream module? The package.json file suggests >= 8.6.0. That might be an issue as replace-in-file is aimed at Node 6 and above. However, if we release a new major version, I'd be inclined to drop Node 6 support and aim for 8 and higher.

Note: In any case, the file will be accessed and its contents may be overwritten even if there's no actual read operation done (due to Node.js stream buffering mechanisms).

This might be slightly worrying, could it lead to unexpected behaviour? I guess we could mention it as a caveat, but can you clarify this further?

I'll prepare a 4.0.0 branch that you can submit a PR against.

@MichalCz

  • Why did you wrap the rw call in a Promise? From looking at that library it already returns a Promise itself.

Hmm... That does seem unnecessary. :)

  • Might be good to abstract the transformation function away into a helper to simplify the code a bit?

Yup.

  • What is the Node version requirement for the rw-stream module? The package.json file suggests >= 8.6.0. That might be an issue as replace-in-file is aimed at Node 6 and above. However, if we release a new major version, I'd be inclined to drop Node 6 support and aim for 8 and higher.

I forked this from my new project template, which, due to extensive use of async/await, is set to require 8.6.0. We can run some tests and consider dropping the requirement down to 6.x if everything works or can be rewritten.

Note: In any case, the file will be accessed and its contents may be overwritten even if there's no actual read operation done (due to Node.js stream buffering mechanisms).

This might be slightly worrying, could it lead to unexpected behaviour? I guess we could mention it as a caveat, but can you clarify this further?

Well, I guess this is simply due to the way this transformation works - it doesn't affect the filesystem entry, it only changes the contents.

When we've read a file up to, let's say, the 64k mark, everything before that mark may already have some replacements. That means some reads must have been done; but if we cancel the operation in the middle, we'll have half the file replaced and half not. Most worrying is the fact that if the replacement is longer than the needle, some bytes may be missing, like this:

replacing: abc to abcd

abc+123
abc+123
abc+123 
abc+123 // SIGINT on reading this line!
abc+123
abc+123

we get:

abcd+123
abcd+123
abcd+
abc+123 // SIGINT on reading this line!
abc+123
abc+123

And the leftover buffer of 123 is lost, as there was no place to write it to.

So there is a real cost here that needs to be well documented.

I'll prepare a 4.0.0 branch that you can submit a PR against.

Cool. Mention me there please. :)

@MichalCz

This of course doesn't ruin the use case where we generate a file and then do some replacements on it, but the file is gigabytes in size, so it can't be read into memory, and there may not be enough disk space available (or disk space is metered and paid for).

I could add another flag though, where we'd use a temporary file in the same location for when disk space is not a problem. The same transform would work.
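
That temp-file variant is a standard pattern; a rough sketch (the tmpFile naming is hypothetical, and per-chunk replacement again ignores matches spanning chunk boundaries):

```js
const fs = require('fs');
const { Transform } = require('stream');

// Stream through a transform into a sibling temp file, then swap it
// into place. Needs free disk space roughly equal to the output size.
function replaceViaTempFile(file, from, to) {
  const tmpFile = file + '.tmp'; // hypothetical naming scheme
  const replace = new Transform({
    transform(chunk, encoding, done) {
      done(null, chunk.toString().split(from).join(to));
    },
  });
  return new Promise((resolve, reject) => {
    fs.createReadStream(file)
      .pipe(replace)
      .pipe(fs.createWriteStream(tmpFile))
      .on('finish', () => {
        fs.rename(tmpFile, file, err => (err ? reject(err) : resolve()));
      })
      .on('error', reject);
  });
}
```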

@adamreisnz
Owner

Ok, I suppose we could mark this opt-in streaming replacement as being in beta and mention the risks.

@adamreisnz
Owner

@MichalCz 4.0.0 is going to be released soon, is this still something you were wanting to create a PR for?

@MichalCz

Yes, I had, but there are no tests done and it's far from ready. The trouble is I'm a little tied up at the moment and I can't continue my work on this.

Is there anyone on your end who could take a look at this and finish up?

@adamreisnz
Owner

Ah, same here pretty much. Was hard enough to find the time to release 4.0.0 😢
I'll leave this open for the time being and if anyone else pops by they can possibly have a look.

@adamreisnz adamreisnz removed the 8.0 label Apr 21, 2019
@MichalCz

Great. I hope my time frees up for 4.1.

@erssebaggala

erssebaggala commented Jun 4, 2019

@MichalCz I have tested your implementation and I've noticed that if you are writing less than you are reading, the final result will have unwanted extra text. (see fork at https://github.com/erssebaggala/replace-in-file/tree/add_replace_stream )

For example, when replacing "re place" with "b" in "a re place c", you get "a b cplace c".

I suspect we need to truncate the file when finish is fired on the writable stream in your rw-stream. I will test this and see what happens.
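
A standalone demo of the suspected fix, using plain fs calls rather than rw-stream's internals: write the shorter replacement over the original bytes, then truncate to the number of bytes actually written.

```js
const fs = require('fs');

// Reproduce the reported case: "a re place c" with "re place" -> "b".
const file = 'demo.txt'; // hypothetical scratch file
fs.writeFileSync(file, 'a re place c');

const fd = fs.openSync(file, 'r+');
const replaced = Buffer.from('a b c');
const bytesWritten = fs.writeSync(fd, replaced, 0, replaced.length, 0);

// Without this the stale tail survives and the file reads "a b cplace c".
fs.ftruncateSync(fd, bytesWritten);
fs.closeSync(fd);

console.log(fs.readFileSync(file, 'utf8')); // "a b c"
```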

@MichalCz

MichalCz commented Jun 4, 2019

@erssebaggala that might be the case... If you'd add fs.truncate at the end of writing in rw-stream, that should work, but I'm not sure what the len should be...

On the other hand there's a descriptor there underneath, so I guess I could simply use that to truncate.

I'll take a look this week.

@MichalCz

MichalCz commented Jun 7, 2019

Just to update: @erssebaggala added a pull request to rw-stream. I'll do some fixes during the weekend, republish and let you know here.

@MichalCz

Hi @erssebaggala - this fix is now merged and released in 0.3.1

@adamreisnz
Owner

Anyone want to pick this up again and implement it as a parallel implementation, where it can be enabled with a flag?

@adamreisnz
Owner

Closing this now as it seems interest has faded, but if anyone wants to pick this up in the future be my guest.
