Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 7033: File.rawWrite is slow on Windows #7590

Merged
merged 9 commits into from Aug 15, 2020
Merged

Fix issue 7033: File.rawWrite is slow on Windows #7590

merged 9 commits into from Aug 15, 2020

Conversation

ghost
Copy link

@ghost ghost commented Aug 13, 2020

To fix this issue, the newly added openModeFlags (and the enum OpenMode from which it's derived) is used to store the current open mode of a file. Thus, rawWrite can check if the file was already opened in binary mode to obviate temporarily changing the mode and thus executing two flushes.
A new method, parseOpenModeString is used to parse an open mode string, and it stores the desired open mode in openModeFlags. It's called in every method that changes a file's open mode. (See the documentation of this function for more information.)

I'd like your input on this, @schveiguy, since you asked to be notified about this.

@ghost ghost requested review from CyberShadow and schveiguy as code owners August 13, 2020 06:45
@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better, @canopyofstars! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
7033 normal File.rawWrite is slow on Windows

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#7590"

Copy link
Contributor

@MoonlightSentinel MoonlightSentinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed changes introduces a slight breakage for code which operates on the underlying FILE* as well as the File instances because openMode may become invalid when modifying the file mode via the FILE*.
Such code is obviously less than ideal but will be broken by this change.

std/stdio.d Outdated
p = 1 << 4
}

BitFlags!OpenMode openModeFlags;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Make it version(Windows) (it's only used on windows after all)
  • BitFlags is redundant given your enum definition
  • openModeFlags should be placed after isPopened for better packing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it matters anymore, but isn't the whole point of BitFlags to provide a safer interface than the raw enum value? If that's the case, I don't think it's superfluous.

std/stdio.d Outdated
Orientation orientation;
Orientation orientation;

enum OpenMode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short DDOC comments are appreciated

std/stdio.d Outdated
flush(); // before changing translation mode
immutable fd = ._fileno(_p.handle);
immutable mode = ._setmode(fd, _O_BINARY);
scope(exit) ._setmode(fd, mode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now happens before the actual write because the if introduces a scope relevant for scope (exit).

@CyberShadow
Copy link
Member

Is this actually faster? Are there any benchmarks?

I think the proper way to do bulk writes is to use lockingBinaryWriter.

@schveiguy
Copy link
Member

schveiguy commented Aug 13, 2020

I think the proper way to do bulk writes is to use lockingBinaryWriter.

I think you are right -- in the general case. However, when you open a file for writing binary mode (as the bug report indicates), the fact that rawWrite is going to flush before and after every write is insane. So we still should try and solve that problem.

However, my view on the approach has changed, now that I've looked at this more closely. I originally thought you could cache the last mode (binary or not), but there are other modes besides BINARY which the FILE could be. Plus there's the valid concern that @MoonlightSentinel brings up that someone can alter the mode outside of the File operations.

Here is my recommendation for implementation now (sorry for not giving more time to it earlier):

1. Create a member bool _likelyBinary, which is set to true when we expect the FILE * is in binary mode. You do not need to parse the input string, just use auto curMode = _setmode(_O_BINARY) to read the current mode (and then set it back if necessary).
2. Inside rawWrite, if _likelyBinary is true, try setting the mode before flushing, paying attention to the previous mode. If the mode is not binary, then clear the flag, set it back to the original mode, and run the current code (flush 2x, etc). If it is binary, no need to set the mode at the end, and no need to flush either time. Conversely, if it was binary, set the flag to avoid any further flushing.

Edit, let's try this without an extra member. All that is needed is to try _setmode before flushing, and if it's not binary, set it back to the original mode and then run the current code (flush -> set to binary -> write -> flush -> set back to original mode). If the original mode was binary, avoid all flushing.

Here is the proposal, in pseudocode:

rawWrite(T[]) {
   auto oldMode = _setmode(fd, _O_BINARY);
   if(oldMode != _O_BINARY) {
      // need to flush the data that was written with the original mode
      _setmode(fd, oldMode);
      flush();
      _setMode(fd, _O_BINARY);
   }

   ... // do writing

   if(oldMode != _O_BINARY) {
      flush();
      _setMode(fd, oldMode)
   }
}

The justification for calling _setmode an extra 2 times is that the cost of doing so is likely far lower than premature flushing.

All this should be versioned with Windows. This should solve the bug, AND satisfy @MoonlightSentinel's concerns.

@CyberShadow
Copy link
Member

That sounds like it's in the right direction. Though, if _setmode is O(1), then there is no need for _likelyBinary. We can look at the implementation for DMC, at least.

I think it would be good to introduce a version indicating whether text mode and binary mode are effectively the same, just so it's clear which particular difference between Windows and POSIX is addressed by using conditional compilation here.

@schveiguy
Copy link
Member

It's likely O(1), but still an opaque call at least, and at worst may have to lock something, which might make it more expensive? I'd expect that the likelihood that the binary mode changes between calls to rawWrite is pretty low.

For sure, we can do it without storing an extra boolean in the heap-allocated struct, but the cost of storing that extra boolean is pretty small.

Copy link
Member

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my suggestions in previous message.

@CyberShadow
Copy link
Member

For sure, we can do it without storing an extra boolean in the heap-allocated struct, but the cost of storing that extra boolean is pretty small.

I was thinking more about the complexity and the additional state (e.g. to a casual observer it wouldn't be clear what the interaction is between likelyBinary and orientation).

@schveiguy
Copy link
Member

The single point of the boolean is to do the minimum _setmode calls. We have to do at least one to see what the mode is. Certainly it's an optimization and isn't required to make the code faster than the current situation.

And thinking about it more, if you use _setmode and find out it wasn't binary, the cost of calling the extra _setmode is likely miniscule compared to the premature flushing that is about to happen. So sure, let's do it without the boolean at first. It makes the ctor simpler too. I'll update my proposal above.

@ghost
Copy link
Author

ghost commented Aug 13, 2020

Implemented the proposal.

By the way, this is DMC's setmode implementation; rather, this is it locking streams in a loop.

int setmode(int fd, int mode)
{
    // ...
    for (fp = &_iob[0]; fp < &_iob[_NFILE]; fp++)
    {
        __fp_lock(fp);
        // ...
    }
    // ...
}

@schveiguy
Copy link
Member

By the way, this is DMC's setmode implementation; rather, this is it locking streams in a loop.

Yikes! Note that DMC has a limit of 64 open file descriptors anyway. But yeah, that's painful. However, it's clear from that code that the binary mode is not set on the file descriptor but rather the FILE * or _iob. Perhaps locking that and checking the flag member instead of using setmode to determine the current mode is an option. There's already precedent for this inside std.stdio.

@schveiguy
Copy link
Member

Trailing whitespace on lines 1086, 1098, 1099, 1101

std/stdio.d Show resolved Hide resolved
@ghost
Copy link
Author

ghost commented Aug 13, 2020

Yikes! Note that DMC has a limit of 64 open file descriptors anyway. But yeah, that's painful. However, it's clear from that code that the binary mode is not set on the file descriptor but rather the FILE * or _iob. Perhaps locking that and checking the flag member instead of using setmode to determine the current mode is an option. There's already precedent for this inside std.stdio.

Unless I misunderstood, it seems that there is no _flag field for MS' implementation.

else version (CRuntime_Microsoft)
{
    // ...

    ///
    struct _iobuf
    {
        void* undefined;
    }

    // ...
}

std/stdio.d Show resolved Hide resolved
std/stdio.d Outdated
immutable mode = ._setmode(fd, _O_BINARY);
scope(exit) ._setmode(fd, mode);
immutable oldMode = ._setmode(fd, _O_BINARY);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing whitespace line 1080 here (according to autotester)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be good, assuming my editor did its job since I'm incapable of discerning needless whitespace.

@schveiguy
Copy link
Member

Unless I misunderstood, it seems that there is no _flag field for MS

Yep, you did not misunderstand. You would have to do something separate for digital mars stdio. It's not critical to do now I don't think, let's just get this in, and we can worry about squeezing more performance later. I still think the speed is going to be possibly comparable, and I also don't imagine many people are using DMC runtime over MSVC runtime.

It might be a good gut-check to run against the DMC runtime with the original bug report (but set the open mode to non-binary) and see how it does before and after.

@ghost
Copy link
Author

ghost commented Aug 13, 2020

Yep, you did not misunderstand. You would have to do something separate for digital mars stdio. It's not critical to do now I don't think, let's just get this in, and we can worry about squeezing more performance later. I still think the speed is going to be possibly comparable, and I also don't imagine many people are using DMC runtime over MSVC runtime.

Wouldn't it be something to this effect?

version (Windows)
{
	immutable fd = ._fileno(_p.handle);

	version (MICROSOFT_STDIO)
	{
		// Do changes made in this commit.
	}

	version (DIGITAL_MARS_STDIO)
	{
		immutable info = __fhnd_info[fd];

		if (info & FHND_TEXT)
		{
			flush();
			atomicOp!"&="(__fhnd_info[fd], ~FHND_TEXT);
		}
		
		scope (exit)
		{
			if (info & FHND_TEXT)
			{
				flush();
				__fhnd_info[fd] = info;
			}
		}
	}
}

@schveiguy
Copy link
Member

schveiguy commented Aug 14, 2020

Wouldn't it be something to this effect?

No, you have to duplicate the code in DMC runtime and just read the flags (though you don't have to search, you know the FILE *). The __fhnd_info seems to be a separate piece related to that old bug.

Copy link
Member

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a squash-merge

@ghost
Copy link
Author

ghost commented Aug 14, 2020

No, you have to duplicate the code in DMC runtime and just read the flags (though you don't have to search, you know the FILE *). The __fhnd_info seems to be a separate piece related to that old bug.

Understood. But as you mentioned, the prevalence of DMC's runtime makes me question whether it's worth it. Regardless, thank you for posting this issue on the NG. I didn't do all that much but bungle your proposal's implementation, but it was still a learning experience.

@schveiguy
Copy link
Member

I didn't do all that much but bungle your proposal's implementation, but it was still a learning experience.

You did great! Thanks for making this fix.

@schveiguy
Copy link
Member

@CyberShadow any more comments on this?

@CyberShadow
Copy link
Member

Well, in lieu of tests, some benchmark numbers would be nice, just so we're sure we're solving the right performance problem. (I see in principle why this would obviously be faster, but also, programs like cat don't buffer at all, and that hasn't caused problems, because the OS also does buffering unless the file was opened in direct mode.) That would also be an opportunity to check the effect of minimizing _setmode calls.

@schveiguy
Copy link
Member

programs like cat don't buffer at all

Sure it does: https://github.com/coreutils/coreutils/blob/5b8161ee4dc95308c9ef89a5268aa87a7f32d4bf/src/cat.c#L513-L517

buffering is faster because you spend less system calls. Yes, the OS does buffering, but the cost of getting at the buffer is much higher than writing to memory.

In the case of cat, it can decide how much data to fetch at a time (because the whole file is available). The use case here provides data a tiny bit at a time (i.e. the bug report example). flushing after every tiny bit is always going to perform worse than flushing after a buffer worth of data 100% of the time.

in lieu of tests

I hate to set up CI tests that depend on performance, because we can't control the system being used to test.

some benchmark numbers would be nice

I agree. I can see about running this PR on a Windows box. @canopyofstars if you have some benchmarks you can run, it would be useful to post here.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 14, 2020

Sure it does: https://github.com/coreutils/coreutils/blob/5b8161ee4dc95308c9ef89a5268aa87a7f32d4bf/src/cat.c#L513-L517

buffering is faster because you spend less system calls. Yes, the OS does buffering, but the cost of getting at the buffer is much higher than writing to memory.

No, cat does not buffer in the sense that C FILE does. cat continuously waits until data arrives, and as soon as it does, it immediately writes out however much it received. You can check this by doing e.g. for n in 1 2 3 4 5 ; do printf %s $n ; sleep 1 ; done | cat > a.txt.

In the case of cat, it can decide how much data to fetch at a time (because the whole file is available).

Pretty sure that's not how it works.

@schveiguy
Copy link
Member

The use case is completely different. Yes, cat reads as much as it has buffer space for, and immediately writes it out, but it does read it into a buffer. What it doesn't do is hold back data to write until it has a full buffer's worth of data, even if the read source is slow. In this sense, cat performs as well as the underlying source performs, and that's fine for that use case.

There is no way to for cat to duplicate the problem we are experiencing here, because what would need to happen is to have the data available, but only 4 bytes provided per read call. The kernel doesn't work that way.

My usual way of showing that buffering is worth it is to use dd:

$ time dd if=/dev/zero of=zeroes bs=4 count=1000000
1000000+0 records in
1000000+0 records out
4000000 bytes (4.0 MB, 3.8 MiB) copied, 1.91583 s, 2.1 MB/s

real	0m1.918s
user	0m0.681s
sys	0m1.237s
$ time dd if=/dev/zero of=zeroes bs=4000 count=1000
1000+0 records in
1000+0 records out
4000000 bytes (4.0 MB, 3.8 MiB) copied, 0.0069534 s, 575 MB/s

real	0m0.011s
user	0m0.004s
sys	0m0.007s

Both write 4MB, the first does it 4 bytes at a time, the second does it 4000 bytes at a time.

@CyberShadow
Copy link
Member

The use case is completely different. Yes, cat reads as much as it has buffer space for, and immediately writes it out, but it does read it into a buffer. What it doesn't do is hold back data to write until it has a full buffer's worth of data, even if the read source is slow. In this sense, cat performs as well as the underlying source performs, and that's fine for that use case.

Yes, that's exactly what I meant. What I meant to say in addition to that is that, when cat's output is a file, there is no additional buffering going on.

There is no way to for cat to duplicate the problem we are experiencing here, because what would need to happen is to have the data available, but only 4 bytes provided per read call. The kernel doesn't work that way.

The kernel does the buffering in the case of process pipes. I'm not sure if there's a guarantee there, though. I see there's a way to create a datagram-like pipe, but I didn't get it to work.

My usual way of showing that buffering is worth it is to use dd:

Good illustration, thanks.

@ghost
Copy link
Author

ghost commented Aug 14, 2020

I've benchmarked two programs, where one uses the current rawWrite (current.exe) and the other uses the version from this PR (new.exe). Each program is built from the D example used in the original issue. bytes_test.dat is written to an SSD (a SSD?).

ldc2 was used to compile both executables.

Program Command
current.exe ldc2 --O2 --of current.exe test.d
new.exe ldc2 --O2 --of new.exe test.d

Please note that new.exe was built in the presence of this PR's std/stdio.d.

To benchmark them, I created a brief PowerShell script. This script will print out the mean execution time (in milliseconds) over a number of repetitions (e.g. 1000).

.\benchmark.ps1 -prog .\current.exe -reps 1000
.\benchmark.ps1 -prog .\new.exe -reps 1000

The results:

Program Time (ms)
current.exe 3216.4089
new.exe 76.3247

As you can see, current.exe is, on average, ~42x slower on my machine.

@schveiguy
Copy link
Member

Thanks @canopyofstars. If possible, do you think you can run dmd instead of ldc, and make sure you use -m32 to ensure we are using the DMC runtime? I still think the improvement will be just as stark.

@CyberShadow
Copy link
Member

As you can see, current.exe is, on average, ~42x slower on my machine.

Fantastic. Thank you very much. It's good to have numbers like this on record to justify the change.

@ghost
Copy link
Author

ghost commented Aug 14, 2020

Thanks @canopyofstars. If possible, do you think you can run dmd instead of ldc, and make sure you use -m32 to ensure we are using the DMC runtime? I still think the improvement will be just as stark.

Program Command
current.exe dmd -m32 -O -of="current.exe" test.d
new.exe dmd -m32 -O -of="new.exe" test.d

.\benchmark.ps1 -prog .\current.exe -reps 100
.\benchmark.ps1 -prog .\new.exe -reps 100

I reduced the number of iterations in the interest of time. After a few tests with new.exe, I don't believe this change will have a significant impact on the results.

Program Time (ms)
current.exe 3361.1693
new.exe 119.3520

In this instance, current.exe is ~28x slower.

Fantastic. Thank you very much. It's good to have numbers like this on record to justify the change.

Indeed. I should've done so from the beginning with my first commit.

@wilzbach
Copy link
Member

Thanks @canopyofstars. If possible, do you think you can run dmd instead of ldc, and make sure you use -m32 to ensure we are using the DMC runtime? I still think the improvement will be just as stark.

FWIW (and future discussions) we no longer cater to DMD outdated optimizer or DMC's runtime here. It's a waste of time.
So the PR would have been good to go already ;-)

@schveiguy
Copy link
Member

we no longer cater to DMD outdated optimizer or DMC's runtime here.

understood. I just wanted to make sure it's not worse. I didn't see how it could be, but we do have another option in terms of handling DMC's runtime without the slow loop that currently is setmode. But the question has been answered, so we are all set.

@dlang-bot dlang-bot merged commit 90c62c6 into dlang:master Aug 15, 2020
nordlow pushed a commit to nordlow/phobos that referenced this pull request Sep 10, 2020
Fix issue 7033: File.rawWrite is slow on Windows
merged-on-behalf-of: Steven Schveighoffer <schveiguy@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants