
Make sure bisect-ppx-report handles corrupted coverage files gracefully #385

Closed
aantron opened this issue Oct 20, 2021 · 12 comments

aantron commented Oct 20, 2021

No description provided.

vch9 commented Oct 29, 2021

Regarding corrupted coverage files, do you have a hint about the difference between the following errors:

Info: found coverage files in '_coverage_output/'
bisect-ppx-report: internal error, uncaught exception:
                   Bisect_common.Invalid_file("_coverage_output/277891434.coverage", "unexpected end of file while reading magic number")

and

Info: found coverage files in '_coverage_output/'
bisect-ppx-report: internal error, uncaught exception:
                   Bisect_common.Invalid_file("_coverage_output/697300040.coverage", "exception reading data: End_of_file")

vch9 commented Oct 29, 2021

Well, I just had another exception raised. If you're not already working on this, @aantron, and need help (review, PR, etc.), I'll be happy to help, as we really need this feature.

aantron commented Oct 29, 2021

Regarding corrupted coverage files, do you have a hint about the difference between the following errors:

They are the same error (reading a truncated file, so End_of_file), but occurring during different parts of the read. The first is occurring during reading of a magic number in the file header, and the second during reading of the main file "payload." Neither should be fatal like this or reported in this way.
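
For illustration, both failures can be reproduced by a reader shaped roughly like the sketch below (this is not the actual Bisect_common code; the magic-number length and payload layout are invented for the example):

    (* Rough sketch only; the real Bisect_common reader differs, and the
       magic-number length and payload layout below are invented. *)
    let read_coverage path =
      let channel = open_in_bin path in
      Fun.protect ~finally:(fun () -> close_in_noerr channel) @@ fun () ->
      (* Stage 1: the header's magic number. A file truncated here yields
         "unexpected end of file while reading magic number". *)
      let _magic =
        try really_input_string channel 16
        with End_of_file ->
          failwith "unexpected end of file while reading magic number"
      in
      (* Stage 2: the main payload. A file truncated here yields
         "exception reading data: End_of_file". *)
      try
        let count = input_binary_int channel in
        List.init count (fun _ -> input_binary_int channel)
      with End_of_file ->
        failwith "exception reading data: End_of_file"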

I haven't started working on this in particular. What I've done so far is considerably refactored the reporter to make it easier to work on, because I dreaded working on it due to some legacy holdover from features deleted in the last couple of years. I've just removed all of those. You can see the progress in src/report in master.

Would you like to work on this directly? If not, my plan is to first clarify some of the I/O functions that are most responsible for all the I/O errors, and then fix this issue. If you want to work on this immediately, though, I will pause refactoring the reporter.

vch9 commented Oct 29, 2021

I haven't started working on this in particular. What I've done so far is considerably refactored the reporter to make it easier to work on, because I dreaded working on it due to some legacy holdover from features deleted in the last couple of years. I've just removed all of those. You can see the progress in src/report in master.

I saw this; this is great :)

Would you like to work on this directly? If not, my plan is to first clarify some of the I/O functions that are most responsible for all the I/O errors, and then fix this issue. If you want to work on this immediately, though, I will pause refactoring the reporter.

My priority would be to fix the bad writing of coverage files to avoid the loss of information. If you're already working on this or have plans to do it, that would be faster. However, I could test once the issue is fixed to see how well that kind of exception is handled.

aantron commented Oct 29, 2021

My priority would be to fix the bad writing of coverage files to avoid the loss of information. If you're already working on this or have plans to do it, that would be faster. However, I could test once the issue is fixed to see how well that kind of exception is handled.

I'm not working on that at all yet — I want to clean up the reporter a bit more first, before looking in detail at the writing of .coverage files. I also wanted to give time to get some more information/opinions in #384 before I do something that might turn out not to be necessary :)

aantron added a commit that referenced this issue Nov 3, 2021

aantron commented Nov 3, 2021

The errors should be somewhat clearer now, and I've added some documentation about them in d8917a3. I've simplified the code, so that it would be easier to implement graceful error handling. However, before adding it, it's better to work on making the writing as correct as possible instead, as per #385 (comment). Then, it will be clearer what the reader in the reporter has to actually adapt to. Any word on what from #384 (comment) makes sense in your testing environment?

vch9 commented Nov 4, 2021

Any word on what from #384 (comment) makes sense in your testing environment?

It is most certainly related to this issue: we launch several processes and kill them if something goes wrong.

aantron commented Nov 4, 2021

Any word on what from #384 (comment) makes sense in your testing environment?

It is most certainly related to this issue: we launch several processes and kill them if something goes wrong.

I mean what potential solution makes sense in your environment. Can you have the processes handle the signals better, for example?

vch9 commented Nov 4, 2021

Any word on what from #384 (comment) makes sense in your testing environment?

It is most certainly related to this issue: we launch several processes and kill them if something goes wrong.

I mean what potential solution makes sense in your environment. Can you have the processes handle the signals better, for example?

Yes, this is what I'm trying to do on my side. However, I still believe bisect-ppx-report could emit a warning instead of failing if, say, only one coverage file out of hundreds is corrupted.
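
Something like the following is what I have in mind (a rough sketch, not current bisect-ppx-report code; read_one stands for whatever per-file reader the reporter uses):

    (* Sketch of the suggested behavior: skip corrupted files with a warning
       instead of aborting the whole report. [read_one] is a hypothetical
       per-file reader; Invalid_file is the exception shown in the logs above. *)
    let read_coverage_files files =
      files |> List.filter_map (fun file ->
        match read_one file with
        | data -> Some data
        | exception Bisect_common.Invalid_file (path, reason) ->
          Printf.eprintf
            "Warning: skipping corrupted coverage file '%s': %s\n" path reason;
          None)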

aantron added a commit that referenced this issue Nov 7, 2021

aantron commented Nov 7, 2021

I've added a test that truncates a .coverage file to various lengths and shows the behavior of the reporter.
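
Roughly, it does something along these lines (a simplified sketch, not the actual test code):

    (* Simplified sketch of such a test (needs OCaml >= 4.14 for
       In_channel/Out_channel). It copies a valid .coverage file at every
       possible truncation length and runs the reporter on each copy to
       observe the resulting error. *)
    let truncate_and_report source =
      let data = In_channel.with_open_bin source In_channel.input_all in
      for length = 0 to String.length data do
        Out_channel.with_open_bin "truncated.coverage" (fun oc ->
          Out_channel.output_string oc (String.sub data 0 length));
        ignore (Sys.command "bisect-ppx-report summary truncated.coverage")
      done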

I still believe bisect-ppx-report could emit a warning instead of failing if, say, only one coverage file out of hundreds is corrupted.

We can probably add this as an option. I don't think it should be the default, because in a smaller project in which runs are more predictable:

  • corrupt coverage files should be extremely rare;
  • when they do occur, they suggest some kind of severe error that should not be ignored;
  • a missing coverage file causing only a warning (say, in CI) can silently skew final coverage statistics considerably in a smaller project.

aantron added a commit that referenced this issue Nov 7, 2021

arvidj commented Nov 15, 2021

Any word on what from #384 (comment) makes sense in your testing environment?

Broadly, we have three types of tests:

  • unit tests that neither fork nor spawn any processes
  • integration tests that spawn processes by forking
  • system tests that execute separate binaries

For the first type, I see no reason why we should have any corrupted file, nor have we seen that.

For the second type of test, of which we have very few, we've experimented with adding signal handlers, attempting to make sure that everything terminates gracefully w.r.t. bisect. However, it's been tricky to find a solution that works well, especially when running in gitlab CI for some reason. In the latest version, which seems to work OK, we have a signal handler that catches sigterms from the parent process and then makes sure that bisect's at_exit hook is finally called. It does seem like I need to mask sigterm in bisect's at_exit hook, though. You can read more about this here.
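
For reference, a handler of that shape could look roughly like the sketch below (simplified, not our actual code; it assumes the unix library is linked and that calling exit is enough to run bisect's at_exit dump):

    (* Simplified sketch of a SIGTERM handler that lets Bisect's at_exit
       coverage dump run before the process terminates. *)
    let install_sigterm_handler () =
      Sys.set_signal Sys.sigterm
        (Sys.Signal_handle (fun _signal ->
          (* Block further SIGTERMs so the at_exit hooks are not
             interrupted while the coverage file is being written. *)
          ignore (Unix.sigprocmask Unix.SIG_BLOCK [Sys.sigterm]);
          (* [exit] runs the at_exit hooks, then terminates. *)
          exit 0))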

Since we need to both install signal handlers and modify bisect's runtime for each test, I think I'll try to implement the --sigterm option you propose in #384 : this would give us a handier solution if similar problems pop up in the future.

Regarding the third set of tests, as long as the testing framework terminates the binaries gracefully, and as long as the terminated binary has a sigterm handler in place that does nothing crazy, it seems like no modifications are needed.

Our goal is to never have any corrupted traces written, to make sure each launched process writes a trace, and to fail fast when this is not the case. We figure this will give users more confidence in the coverage reports. We're just hoping this will not make our CI less robust and that we won't have to debug tests corrupting traces all the time.

aantron commented Nov 18, 2021

Thanks for that! I'm going to close this issue as likely unnecessary, since it seems it can be addressed by trying not to write corrupted coverage files. If that turns out to still not be practical, this issue can be revisited.

aantron closed this as completed Nov 18, 2021