
Request for mbio/mbinfo test data #319

Open
schwehr opened this issue Aug 15, 2019 · 4 comments

@schwehr
Collaborator

schwehr commented Aug 15, 2019

The best test data will already be in a public archive. The more metadata that goes with a data sample, the better.

https://www3.mbari.org/data/mbsystem/html/mbio.html
http://www3.mbari.org/products/mbsystem/formatdoc/index.html
https://github.com/dwcaress/MB-System/blob/master/src/mbio/mb_format.h

This is a call for sample data files to help with testing of MB-System. Having coverage of as many formats, with as many of the possible packet/message types as we can get, will let MB-System grow over time without regressing its existing ability to read so many formats. Any file donated will be assumed to be contributed under the MB-System license so that the files, or portions of those files, can ship along with MB-System. These will be part of MB-System's unittests and fuzzing infrastructure. Initially, I've just set up https://github.com/dwcaress/MB-System/blob/master/test/utilities/mbinfo_test.py with two formats covered: mb21/MBF_HSATLRAW and mb173/MBF_MGD77TXT. I've used mbcopy to create test files for other formats, but those are definitely suboptimal. Eventually, I'd like to have C++ tests that exercise each packet type on its own, but that is for down the road.
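To make the shape of those tests concrete, here is a minimal sketch of a per-format smoke test in the same spirit as mbinfo_test.py. This is not the actual contents of that file; the sample path is a placeholder, and it only checks that mbinfo runs cleanly rather than comparing against known-good output:

    # Sketch of a per-format mbinfo smoke test.  Assumes mbinfo is on PATH
    # and that a tiny sample file lives under testdata/ -- both placeholders.
    import subprocess
    import unittest


    class MbinfoSmokeTest(unittest.TestCase):

      def run_mbinfo(self, format_id, filename):
        cmd = ['mbinfo', '-F', str(format_id), '-I', filename]
        return subprocess.run(cmd, capture_output=True, text=True, check=True)

      def test_mb21_hsatlraw(self):
        # Hypothetical tiny sample logged from a real system.
        result = self.run_mbinfo(21, 'testdata/mb21/sample.mb21')
        # A real test would compare against known-good mbinfo output.
        self.assertTrue(result.stdout)


    if __name__ == '__main__':
      unittest.main()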

Many of these formats are in public archives. I’m happy to use those and could use help from the community finding good examples.

What is most useful initially:

  • First off, anything logged from a real system is a step forward. If we currently have nothing for a format, anything is a win.
  • Smaller is better. Bigger files are hard to put into git and slow everything down. E.g. for fuzzing, the default is to limit the entire file content to 4 KB. I often increase that to 20 KB, but much bigger and things get really slow. Unittests need to finish quickly, and more data means slower runs.
  • To contrast with smaller, we want every type of packet/datagram possible, especially things like SVPs, vessel configs, etc. If a format supports backscatter, sidescan, etc., it would be great to have those.
  • Different versions of systems. E.g. GSF has had a lot of changes over the years, so it would be great to have many versions on hand.

Not all of those files will be used in the unittests. For those that are, we may need to cut the files down for size and time reasons before they make good test inputs.

For later:

  • Files that cause MB-System to crash. These are great seeds for fuzzing. Eventually, MB-System should never crash, even on the most corrupt file, but that comes after unittests.
  • Files that can be used to test more complicated utilities in MB-System, like gridding and preprocessing.
  • Files that can be used for performance testing. These need to be larger than RAM caches so that the influence of other systems and transients is minimized.
  • Formats, or packets/datagrams, that are not yet supported. It would be awesome to have work staged for anyone willing to contribute code for new formats to MB-System.

Known sources that people can look into:

https://www.ngdc.noaa.gov/multibeam-survey-search/

@schwehr
Collaborator Author

schwehr commented Aug 15, 2019

I need to set up https://github.com/schwehr/mbreadsimrad/ to filter em### files down to their smallest size while preserving at least one of each datagram/packet type.
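The core idea is simple enough to sketch generically. In the sketch below, datagrams stands in for a hypothetical helper (not an existing mbreadsimrad API) that yields (datagram_type, raw_bytes) pairs from an EM file:

    # Keep only the first N datagrams of each type, shrinking the file while
    # preserving coverage of every packet type.  The datagrams iterable is a
    # hypothetical (type, raw bytes) stream, not an existing API.
    def shrink(datagrams, keep_per_type=1):
      counts = {}
      kept = []
      for dtype, raw in datagrams:
        if counts.get(dtype, 0) < keep_per_type:
          counts[dtype] = counts.get(dtype, 0) + 1
          kept.append(raw)
      return b''.join(kept)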

@dwcaress
Owner

Kurt,
A lot of MB-System failures are only exhibited when working with large amounts of data. I think we will ultimately need to construct a separate test repository with full size data samples - many problems result from the complexities of real datasets in which data records from different pings get mixed or some records are corrupted or some pings produce zero data, etc. Since that is true, we don't have to attempt to achieve comprehensive testing with small files used for unit testing embedded in the primary code archive. Just getting a representative small sample of most formats and checking if each i/o module works at all will achieve a first order goal.
Thanks,
Dave

@schwehr
Collaborator Author

schwehr commented Aug 16, 2019

@dwcaress Thanks for the comments. Some clarifications of what I am thinking about for this particular issue. Running large sets through as a less frequent test is a great idea, but I would typically call those integration tests.

Warning: rambling thoughts while bouncing along on HWY 17 in the mountains follow...

Here I'm aiming for fast and light "unit tests" that can be run for each commit. I think a large fraction of what you are talking about (but definitely not all) can be caught with these simpler small tests combined with fuzzing using pretty small corpus files. Once these tests are in place, we can set up ASAN and MSAN runners, along with finishing off all of the cppcheck complaints, perhaps with a side of Coverity (free for open source code) and the clang static analyzer. On top of that we can then add fuzzing + ASAN to really beat up the code.

With fuzzers, the component "corpus" files are usually pretty small. With GDAL, I typically keep them under 4 KB, but we can try with larger files. Using a coverage check, I can generate a corpus of files that covers as many of the code paths inside MB-System as possible. With GDAL (my own copy of gdal with < 5% of the drivers active), that is about 100K files. It might sound like a lot, but it goes pretty quickly to run them all through ASAN- and MSAN-built binaries to find regressions. And I can generate them pretty easily on a 30-core dev desktop over a couple of months of mostly hands-off running.
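For the "run them all through sanitizer-built binaries" part, a rough sketch of what I mean (the binary path is a placeholder for a sanitizer-instrumented build of an MB-System utility, not a real path in the tree):

    # Run every corpus file through a sanitizer-built binary and collect
    # anything that crashes or hangs.  ASAN_MBINFO is a placeholder path.
    import pathlib
    import subprocess

    ASAN_MBINFO = './build-asan/mbinfo'

    def sweep(corpus_dir):
      failures = []
      for path in sorted(pathlib.Path(corpus_dir).iterdir()):
        try:
          subprocess.run([ASAN_MBINFO, '-I', str(path)],
                         capture_output=True, timeout=30, check=True)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
          failures.append((path, err))
      return failures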

This strategy has worked really well for GDAL and all of its dependencies; it has caught >7K bugs in GDAL. About 90% of those weren't very interesting, with the biggest impact typically being poor error reporting or hindering code analyzers (both compilers and static analyzers). The result is that using GDAL has gotten drastically better for the users I support.

I expect this strategy to take quite a while to work through on the existing code, and it really isn't ever done as long as people continue contributing to the code base. It just becomes part of the process and is mostly automated. E.g. I just got a Coverity email about GDAL with another 200 things that it doesn't like.

Only after that would I worry about automated runs of large files. But if someone really wants to set up a periodic runner of large batches of data, they should feel free to go for it.

@schwehr
Collaborator Author

schwehr commented Sep 5, 2019

Format 92, MBF_ELMK2UNB, appears to have issues that surfaced while working on #365. Getting a sample would be really helpful for debugging. From mbio:

           MBIO Data Format ID:  92
           Format name:          MBF_ELMK2UNB
           Informal Description: Elac BottomChart MkII shallow
                                 water multibeam
           Attributes:           126 beam bathymetry and
                                 amplitude, binary, University
                                 of New Brunswick.
