Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modernize Capstone testing #1984

Open
Tracked by #2015
Rot127 opened this issue Apr 3, 2023 · 13 comments · May be fixed by #2384
Open
Tracked by #2015

Modernize Capstone testing #1984

Rot127 opened this issue Apr 3, 2023 · 13 comments · May be fixed by #2384
Labels
Testing Test related issue
Milestone

Comments

@Rot127
Copy link
Collaborator

Rot127 commented Apr 3, 2023

Testing Capstones disassembly results is possible via too many ways.

  1. Test binaries print (but not test!) the values of instruction detail (e.g. which operands are written and read etc.).
  2. MC tests taken from LLVMs MC test files (in suite/MC). They only test the disassembly of bytes to their assembly strings. To test those a:
  • Python script exists
  • And the cstest binary.
  1. Issue tests (files which name is issue.cs). They look very similar to the MC tests, but are not related. They can test detail information as well. cstest processes them.

This is very confusing and could be unified.

I propose to:

  1. Get rid of the test_<arch> binaries, because they hard code every test.
  2. Get rid of the Python scripts. Because they are a duplicate of cstest.
  3. Rewrite cstest

cstest

cstest should be written from scratch. It needs modernization anyway (e.g. remove global variables) and we could settle on a single test file format.

This new format should support simple bytes <-> assembly string testing, as well as testing the content of cs_detail.
Once this is done we can also write scripts to translate LLVMs MC regression tests to our file format.

Before the v6 release we should also test every possible detail setting for correctness. See #1984 (comment)

CI

@Rot127
Copy link
Collaborator Author

Rot127 commented Apr 3, 2023

Also considering #1281 when generating tests.

@Rot127
Copy link
Collaborator Author

Rot127 commented Apr 6, 2023

The current MC test assume that a byte string can be disassembled independently from other byte strings. This is not true for instructions which depend on each other (IT/VPT blocks in ARM for example).
This needs to be considered.

@Rot127
Copy link
Collaborator Author

Rot127 commented May 5, 2023

@kabeor @aquynh We thought about choosing yaml as file format for the test files.
It's better to read then json and can be used for the bindings as well.
Also there are libraries for yaml in most distros.
Any opinions?

@kabeor
Copy link
Member

kabeor commented May 6, 2023

@Rot127 Sounds make sense. Would you like to show us an code example?

@Rot127
Copy link
Collaborator Author

Rot127 commented May 6, 2023

I'll add one once the PPC AArch64 refactor is coming to an end.

@Rot127
Copy link
Collaborator Author

Rot127 commented May 30, 2023

Testing the build of bindings and their correct function isn't tested in the CI jobs as well (#2034 (comment)).

@XVilka
Copy link
Contributor

XVilka commented May 31, 2023

See also #1954

@peace-maker
Copy link
Contributor

To verify that the bindings lead to the same output as the raw C library, currently there are tests which reimplement the test_* binaries in the binding language including the same exact detail output and just diff the output of the test_* binary with e.g. the test_*.py script.

One could write a test runner in every supported binding language which can be triggered by the cstest program and outputs in the same format as the detail printers in cstest.

Currently there are multiple places where the instructions are printed which all have to be updated when the structs change:

Only having one place to stringify an instruction would streamline the update process. When using cstool as the baseline, instruction printing could be split into a separate library with arm_insn_to_string functions which only handles rendering the instruction details to a string array. Printing and reformating those parts can be done by the consumers.

@Rot127
Copy link
Collaborator Author

Rot127 commented Jul 27, 2023

Only having one place to stringify an instruction would streamline the update process.

I very much like the idea to centralize the printing of instructions. Would you mind putting this into its own issue?

One could write a test runner in every supported binding language which can be triggered by the cstest program and outputs in the same format as the detail printers in cstest

I am not sure if I understand you here correctly. But I think testing the instruction details by comparing strings between cstool and bindings output is a bad idea.

If we have the data to test in a yaml file, we can just as well test the objects directly.
This would also allow to test for binary stuff.

@XVilka
Copy link
Contributor

XVilka commented Aug 5, 2023

Using YAML to test the data layout is also a custom in the LLVM project:

@XVilka
Copy link
Contributor

XVilka commented Aug 5, 2023

Only having one place to stringify an instruction would streamline the update process. When using cstool as the baseline, instruction printing could be split into a separate library with arm_insn_to_string functions which only handles rendering the instruction details to a string array. Printing and reformating those parts can be done by the consumers.

The best approach would be a new small library in the same repository - libcsprint or something. It could be used on its own even outside cstool this way then.

@Rot127
Copy link
Collaborator Author

Rot127 commented Aug 31, 2023

Just an idea how to test every possible execution path and generate a set of instructions which cover every possible detail combination.

Depending on architecture the number of possible unique ways to set the detail struct should be around <num operand groups> * 2 (just a guess, but multiple groups share a single execution path, while some have a very diverse one. See add_cs_detail()).

For AArch64 for example this would be roughly 162 * 2 (a very large number in comparison to other archs).

To determine the execution path for each operand group:

  1. Execute every possible encoding (assuming 4byte instructions) and write valid decoding to a file.
  • Alternatively we can als just test every encoding from suite/MC/
  1. For every valid encoding, run gcov on it and diff the coverage graph of all add_cs_detail() calls.
  • If it contains a unique add_cs_detail() path, mark this encoding as valid test.
  • If not, decode next instruction.
  1. The resulting 200-300 instructions can be checked manually for valid details.
  2. Then add tests manually for paths which couldn't be reached with this method (e.g. PPC prefixed instructions).

@XVilka
Copy link
Contributor

XVilka commented Sep 22, 2023

Once it's done some files could be removed, e.g.:

  • suite/regress/*

@Rot127 Rot127 added this to the v6 milestone Mar 19, 2024
@Rot127 Rot127 added the Testing Test related issue label Mar 19, 2024
@Rot127 Rot127 linked a pull request Jul 4, 2024 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Testing Test related issue
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

4 participants