Large Files #40

Closed
jWatsonDev opened this issue Feb 13, 2018 · 29 comments


@jWatsonDev

Could large file functionality potentially be added? This utility works great for parsing unl files, but it is unable to process larger files (e.g. 619 MB).

@adamreisnz
Owner

Yep, would be great if we could support this! Will flag as help wanted for now.

@MichalCz

Hmm... I may have some time to check this out - it would mean rewriting some of the helpers, but with my scramjet it shouldn't be such a big deal.

Would you still be interested in that pull request, @adamreisnz?

@adamreisnz
Owner

adamreisnz commented Sep 28, 2018

@MichalCz thanks, yes that sounds good. Could you make it in such a way that it requires an opt-in via a setting and isn't enabled by default? This is just because I'm unfamiliar with scramjet, and a lot of people rely on this package, and I wouldn't want to risk any unexpected behaviour breaking builds etc.

Happy to see what you're proposing though and help out as needed.

@MichalCz

Got you on the builds - some of my projects rely on the package too.

I'll try to scribble some kind of PoC first so we can see how this works. :)

@MichalCz

@adamreisnz - one question though - I think the current implementation actually replaces the file. In that case I guess using a temp file should be acceptable? There is another option, but it would mean writing quite a sophisticated random-access streaming module...

@adamreisnz
Owner

fs.writeFile is used under the hood to make the replacement, so that basically overwrites the file with the new (replaced) content. What approach did you have in mind?
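
For context, that approach boils down to something like the sketch below (simplified, not the actual replace-in-file source):

```js
const fs = require('fs');

// Whole-file approach: buffer the entire file in memory, replace,
// then overwrite the original. Memory use scales with file size,
// which is what breaks down for very large files.
function replaceInFileSketch(file, from, to) {
  const contents = fs.readFileSync(file, 'utf8');
  const replaced = contents.split(from).join(to);
  fs.writeFileSync(file, replaced, 'utf8');
}
```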

@MichalCz

Well, I'm thinking of actually reading and writing the same file using fs.open - meaning I'd read the file and write to it using fs.write and fs.read. The only catch here is that the write index cannot go beyond the read index (otherwise we'd read stuff we just wrote).

Actually I did write a simple version of that and I could clean this up and publish it as a separate module.

Now what that means is:

  • the memory footprint of such an operation is a couple of 16 kB buffers, plus the difference in length between the replaced content and its replacement
  • the file will be written at the same time as it's read

I'm quite sure it would be quicker for larger files (because the software won't be waiting for the whole file to load), but for smaller files the current approach may be quicker.
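
A minimal sketch of that same-file approach (assuming the replacement never grows the content, so the write position can never overtake the read position, and ignoring matches and multi-byte characters that span chunk boundaries):

```js
const fs = require('fs');

// Read and write the same file through one descriptor, keeping the
// write position at or behind the read position.
function replaceSameFile(file, from, to) {
  const fd = fs.openSync(file, 'r+');
  const buf = Buffer.alloc(16 * 1024); // one 16 kB read buffer
  let readPos = 0;
  let writePos = 0;
  let bytesRead;
  while ((bytesRead = fs.readSync(fd, buf, 0, buf.length, readPos)) > 0) {
    readPos += bytesRead;
    const out = Buffer.from(buf.toString('utf8', 0, bytesRead).split(from).join(to));
    writePos += fs.writeSync(fd, out, 0, out.length, writePos);
  }
  fs.ftruncateSync(fd, writePos); // drop leftover bytes if the content shrank
  fs.closeSync(fd);
}
```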

@adamreisnz
Owner

Sounds good, let's give it a try. If we can choose the approach via a flag, we can let people try it out first before making it the default (if it works well) in an upcoming major version.

It'd be good if we can come up with a test/benchmark suite to actually check the speed for different files, so we can compare the two approaches.

@MichalCz

MichalCz commented Oct 3, 2018

@adamreisnz I've got some not-really-nice code that actually does use fs.write and fs.read with offsets - so this approach is feasible. I'll try to make a module out of it that returns read and write streams to the same file. The write stream would be limited by the read stream for now (nothing can be written past the place we're reading), but I'm thinking I could get around this as well...

Then the additions in replace-in-file would be fairly small.

As to the test suite - definitely. I believe my approach has an edge case when a file is enlarged by the replacement (where it could lock up both streams at the same time).

@cliedelt

cliedelt commented Nov 5, 2018

Any update on this?

@MichalCz

MichalCz commented Nov 5, 2018

Kinda - I've had just a little time to clean up my code and publish it on GitHub - there's still a review to be done before actually adding a pull request here.

Here's the repo signicode/rw-stream.

@orgertot if you have some time and some free neurons to spend on this, the main worry I have is what will happen when the write stream overtakes the read stream. This is what's being tested by test.js - I'm replacing a single line with a couple of them - this way you can grow the file to gigabytes in a couple of runs of test.js. If you'd confirm my logic and perhaps run a couple of tests, I'd finish up, push the module to npm and issue a pull request.

M.

@adamreisnz adamreisnz added the 8.0 label Nov 5, 2018
@MichalCz

MichalCz commented Nov 7, 2018

@adamreisnz - this is my proposal for the replace procedure.

Streamed replace in signicode:master.

This is a little complicated and could be written more simply with scramjet, but I didn't want to add too many dependencies (scramjet adds 4 deps in total). Scramjet also requires Node v8.6+. If you'd prefer simpler code and don't mind 4 extra dependencies, I can rewrite it quickly to roughly 10 lines of code.

I haven't run this yet and it's not exposed, so please take a look and tell me how you feel about it.

Also let me know how you'd want to expose this in the CLI and in the module API.
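
Conceptually, the wiring looks something like the sketch below (this assumes rw-stream resolves to a read stream and a write stream over the same file; the exact API may differ, so see the linked branch for the real code):

```js
const { Transform } = require('stream');
const rw = require('rw-stream'); // assumed import and API

async function streamedReplace(file, from, to) {
  // Assumption: rw(file) resolves with read/write streams over the
  // same file, with the write side held behind the read side.
  const { readStream, writeStream } = await rw(file);

  const replace = new Transform({
    transform(chunk, encoding, done) {
      // Naive per-chunk replacement; matches spanning chunk
      // boundaries would need carry-over handling.
      done(null, chunk.toString().split(from).join(to));
    },
  });

  return new Promise((resolve, reject) => {
    readStream
      .pipe(replace)
      .pipe(writeStream)
      .on('finish', resolve)
      .on('error', reject);
  });
}
```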

@adamreisnz
Owner

Thanks, I'll try to look into this next week, want to make sure I have enough time to look into it carefully!

@MichalCz

MichalCz commented Nov 7, 2018

No problem. I'll try this some time next week, as I need similar code in scramjet...

@MichalCz

Hi @adamreisnz - I've made some checks and it looks like the code is solid even if I write a lot more than I read (it just keeps the excess in memory, which is still much less than keeping the whole file there). I've tried it on files in excess of 10 GB and it works well there.

I guess a sensible next step would be to run the test scenarios against this method?

@adamreisnz
Owner

Hi @MichalCz, that sounds good, thanks!

I've had a peek at the code, a couple of questions come to mind:

  • Why did you wrap the rw call in a Promise? From looking at that library it already returns a Promise itself.
  • Might be good to abstract the transformation function away into a helper to simplify the code a bit?
  • What is the Node version requirement for the rw-stream module? The package.json file suggests >= 8.6.0. That might be an issue as replace-in-file is aimed at Node 6 and above. However, if we release a new major version, I'd be inclined to drop Node 6 support and aim for 8 and higher.

Note: In any case, the file will be accessed and its contents may be overwritten even if there's no actual read operation done (due to Node.js stream buffering mechanisms).

This might be slightly worrying, could it lead to unexpected behaviour? I guess we could mention it as a caveat, but can you clarify this further?

I'll prepare a 4.0.0 branch that you can submit a PR against.

@MichalCz

  • Why did you wrap the rw call in a Promise? From looking at that library it already returns a Promise itself.

Hmm... That does seem unnecessary. :)

  • Might be good to abstract the transformation function away into a helper to simplify the code a bit?

Yup.

  • What is the Node version requirement for the rw-stream module? The package.json file suggests >= 8.6.0. That might be an issue as replace-in-file is aimed at Node 6 and above. However, if we release a new major version, I'd be inclined to drop Node 6 support and aim for 8 and higher.

I forked this from my new project template, which, due to extensive use of async/await, is set to require 8.6.0. We can run some tests and consider dropping the requirement down to 6.x if everything works or can be rewritten.

Note: In any case, the file will be accessed and its contents may be overwritten even if there's no actual read operation done (due to Node.js stream buffering mechanisms).

This might be slightly worrying, could it lead to unexpected behaviour? I guess we could mention it as a caveat, but can you clarify this further?

Well, I guess this is simply due to the way this transformation works - it doesn't affect the filesystem entry, it only changes the contents.

When we've read a file up to, let's say, the 64k mark, everything before that mark may already have some replacements. That means some reads must have been done; but if we cancel the operation in the middle, we'll have half the file replaced and half not. Most worrying is the fact that if the replacement is longer than the needle, some bytes may be missing, like this:

replacing: abc to abcd

abc+123
abc+123
abc+123 
abc+123 // SIGINT on reading this line!
abc+123
abc+123

we get:

abcd+123
abcd+123
abcd+
abc+123 // SIGINT on reading this line!
abc+123
abc+123

And the leftover buffer of 123 is lost, as there was no place to write it to.

So there is a real cost here that needs to be well documented.

I'll prepare a 4.0.0 branch that you can submit a PR against.

Cool. Mention me there please. :)

@MichalCz

This of course doesn't ruin the use case where we generate a file and then do some replacements on it, but the file is gigabytes in size, so it can't be read into memory, and there may not be enough disk space available (or disk space is metered and paid for).

I could add another flag though, where we'd use a temporary file in the same location for when disk space is not a problem. The same transform would work.
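
That temp-file variant is a standard pattern; a rough sketch (the tmpFile naming is hypothetical, and per-chunk replacement again ignores matches spanning chunk boundaries):

```js
const fs = require('fs');
const { Transform } = require('stream');

// Stream through a transform into a sibling temp file, then swap it
// into place. Needs free disk space roughly equal to the output size.
function replaceViaTempFile(file, from, to) {
  const tmpFile = file + '.tmp'; // hypothetical naming scheme
  const replace = new Transform({
    transform(chunk, encoding, done) {
      done(null, chunk.toString().split(from).join(to));
    },
  });
  return new Promise((resolve, reject) => {
    fs.createReadStream(file)
      .pipe(replace)
      .pipe(fs.createWriteStream(tmpFile))
      .on('finish', () => {
        fs.rename(tmpFile, file, err => (err ? reject(err) : resolve()));
      })
      .on('error', reject);
  });
}
```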

@adamreisnz
Owner

Ok, I suppose we could mark this opt-in streaming replacement as being in beta and mention the risks.

@adamreisnz
Owner

@MichalCz 4.0.0 is going to be released soon, is this still something you were wanting to create a PR for?

@MichalCz

Yes, I had, but there are no tests done and it's far from ready. The trouble is I'm a little tied up at the moment and I can't continue my work on this.

Is there anyone on your end who could take a look at this and finish up?

@adamreisnz
Owner

Ah, same here pretty much. Was hard enough to find the time to release 4.0.0 😢
I'll leave this open for the time being and if anyone else pops by they can possibly have a look.

@adamreisnz adamreisnz removed the 8.0 label Apr 21, 2019
@MichalCz

Great. I hope my time frees up for 4.1.

@erssebaggala

erssebaggala commented Jun 4, 2019

@MichalCz I have tested your implementation and I've noticed that if you are writing less than you are reading, the final result will have unwanted extra text. (see fork at https://github.com/erssebaggala/replace-in-file/tree/add_replace_stream )

For example, when replacing "re place" with "b" in "a re place c", you get "a b cplace c".

I suspect we need to truncate the file when finish is fired on the writable stream in your rw-stream. I will test this and see what happens.
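
A standalone demo of the suspected fix, using plain fs calls rather than rw-stream's internals: write the shorter replacement over the original bytes, then truncate to the number of bytes actually written.

```js
const fs = require('fs');

// Reproduce the reported case: "a re place c" with "re place" -> "b".
const file = 'demo.txt'; // hypothetical scratch file
fs.writeFileSync(file, 'a re place c');

const fd = fs.openSync(file, 'r+');
const replaced = Buffer.from('a b c');
const bytesWritten = fs.writeSync(fd, replaced, 0, replaced.length, 0);

// Without this the stale tail survives and the file reads "a b cplace c".
fs.ftruncateSync(fd, bytesWritten);
fs.closeSync(fd);

console.log(fs.readFileSync(file, 'utf8')); // "a b c"
```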

@MichalCz

MichalCz commented Jun 4, 2019

@erssebaggala that might be the case... If you'd add fs.truncate at the end of writing in rw-stream, that should work, but I'm not sure what the len should be...

On the other hand there's a descriptor there underneath, so I guess I could simply use that to truncate.

I'll take a look this week.

@MichalCz

MichalCz commented Jun 7, 2019

Just to update: @erssebaggala added a pull request to rw-stream. I'll do some fixes during the weekend, republish and let you know here.

@MichalCz

Hi @erssebaggala - this fix is now merged and released in 0.3.1

@adamreisnz
Owner

Anyone want to pick this up again and implement it as a parallel implementation, where it can be enabled with a flag?

@adamreisnz
Owner

Closing this now as it seems interest has faded, but if anyone wants to pick this up in the future be my guest.
