diff --git a/docs/AutoSync.md b/suite/auto-sync/ARCHITECTURE.md similarity index 54% rename from docs/AutoSync.md rename to suite/auto-sync/ARCHITECTURE.md index 937e9c926b..fe2ce23cea 100644 --- a/docs/AutoSync.md +++ b/suite/auto-sync/ARCHITECTURE.md @@ -1,12 +1,9 @@ -# Auto-Sync - -`auto-sync` is the architecture update tool for Capstone. -Because the architecture modules of Capstone use mostly code from LLVM, -we need to update this part with every LLVM release. `auto-sync` helps -with this synchronization between LLVM and Capstone's modules by -automating most of it. + -You can find it in `suite/auto-sync`. +# Architecture of the Auto-Sync framework This document is split into four parts. @@ -15,8 +12,8 @@ This document is split into four parts. 3. Instructions how to refactor an architecture to use `auto-sync`. 4. Notes about how to add a new architecture to Capstone with `auto-sync`. -Please read the section about architecture module design in -[ARCHITECTURE.md](ARCHITECTURE.md) before proceeding. +Please read the section about capstone module design in +[ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) before proceeding. The architectural understanding is important for the following. ## Update procedure @@ -98,102 +95,30 @@ _Note_: For details about this checkout `suite/auto-sync/CppTranslator/README.md Because the result of the `CppTranslator` is not perfect, we still have many syntax problems left. -Those need to be fixed by hand. -In order to ease this process we run the `Differ` after the `CppTranslator`. - -The `Differ` parses each _translated_ file and the corresponding source file _currently_ used in Capstone. -It then compares specific nodes from the just translated file to the equivalent nodes in the old file. - -The user can choose if she accepts the version from the translated file or the old file. -This decision is saved for every node. 
-If there exists a saved decision for a node, the previous decision automatically applied again. - -Every other syntax error must be solved manually. +Those need to be fixed partially by hand. -## Update an architecture +**Differ** -To update an architecture do the following: - -Rebase `llvm-capstone` onto the new LLVM release (if not already done). -``` -# 1. Clone Capstone's LLVM -git clone https://github.com/capstone-engine/llvm-capstone -cd llvm-capstone -git checkout auto-sync - -# 2. Rebase onto the new LLVM release and resolve the conflicts. - -# 3. Build tblgen -mkdir build -cd build -cmake -G Ninja -DLLVM_TARGETS_TO_BUILD= -DCMAKE_BUILD_TYPE=Debug ../llvm -cmake --build . --target llvm-tblgen --config Debug - -# 4. Run the updater -cd ../../suite/auto-sync/ -./Updater/ASUpdater.py -a -``` - -The update script will execute the steps described above and copy the new files to their directories. - -Afterward try to build Capstone and fix any build errors left. - -If new instructions or operands were added, add test cases for those -(recession tests for instructions are located in `suite/MC/`). - -TODO: Operand and detail tests - - -## Refactor an architecture for `auto-sync` - -To refactor an architecture to use `auto-sync`, you need to add it to the configuration. - -1. Add the architecture to the supported architectures list in `ASUpdater.py`. -2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`) - -Now, manually run the update commands within `ASUpdater.py` but *skip* the `Differ` step: - -``` -./Updater/ASUpdater.py -a -s IncGen Translate -``` - -The task after this is to: - -- Replace leftover C++ syntax with its C equivalent. -- Implement the `add_cs_detail()` handler in `Mapping` for each operand type. -- Add any missing logic to the translated files. -- Make it build and write tests. -- Run the Differ again and always select the old nodes. 
- -**Notes:** - -- If you find yourself fixing the same syntax error multiple times, -please consider adding a `Patch` to the `CppTranslator` for this case. - -- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own. - -- Running the `Differ` after everything is done, preserves your version of syntax corrections, and the next user can auto-apply them. - -- Sometimes the LLVM code uses a single function from a larger source file. -It is not worth it to translate the whole file just for this function. -Bundle those lonely functions in `DisassemblerExtension.c`. +In order to ease this process we run the `Differ` after the `CppTranslator`. -- Some generated enums must be included in the `include/capstone/.h` header. -At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets): +The `Differ` compares the two versions of each C file that we now have. +The first is the C file currently used by the architecture module. +The second is the freshly translated C file, which is still faulty and needs to be fixed. - ``` - // generate content begin - // generate content end - ``` +Most fixes concern syntax problems, and almost all of them were already resolved during the last update. +The `Differ` helps you compare the files and lets you select which version to accept. -The update script will insert the content of the `.inc` file at this place. +Occasionally the newly translated C files contain important changes. +In most cases, though, the old files are already correct. -## Adding a new architecture +The `Differ` parses both files into an abstract syntax tree and compares certain nodes with the same name +(mostly functions). -Adding a new architecture follows the same steps as above. With the exception that you need -to implement all the Capstone files from scratch. +The user can choose whether to accept the version from the translated file or the old file. 
+This decision is saved for every node. +If a saved decision exists for a pair of nodes, and the nodes did not change since the last time, +the `Differ` automatically applies the previous decision again. -Check out an `auto-sync` supporting architectures for guidance and open an issue if you need help. +The `Differ` is far from perfect. It only helps to automatically apply "known to be good" fixes +and gives the user a better interface to solve the other problems. +But there will still be syntax errors left afterward. These must be fixed by hand. diff --git a/suite/auto-sync/README.md b/suite/auto-sync/README.md index 3f98037a34..5c519139c1 100644 --- a/suite/auto-sync/README.md +++ b/suite/auto-sync/README.md @@ -1,15 +1,19 @@ -# Architecture updater +# Architecture updater - Auto-Sync -This is Capstones updater for some architectures. -Unfortunately not all architectures are supported yet. +`auto-sync` is the architecture update tool for Capstone. +Because the architecture modules of Capstone use mostly code from LLVM, +we need to update this part with every LLVM release. `auto-sync` helps +with this synchronization between LLVM and Capstone's modules by +automating most of it. -## Install dependencies +Please refer to [intro.md](intro.md) for an introduction to this tool. + +## Install Setup Python environment and Tree-sitter @@ -20,11 +24,25 @@ sudo apt install python3-venv # Setup virtual environment in Capstone root dir python3 -m venv ./.venv source ./.venv/bin/activate +``` + +Install Auto-Sync framework + +``` cd suite/auto-sync/ pip install -e . ``` -## Update +## Architecture + +Please read [ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) to understand how Auto-Sync works. + +This step is essential! Please don't skip it. + +## Update an architecture + +Updating an architecture module to the newest LLVM release is only possible if it uses Auto-Sync. +Not all architecture modules support Auto-Sync yet. 
Check if your architecture is supported. @@ -52,6 +70,14 @@ Run the updater ./src/autosync/ASUpdater.py -a ``` +## Update procedure + +1. Run the `ASUpdater.py` script. +2. Compare the functions in `DisassemblerExtension.*` to LLVM (search the function names in the LLVM root) +and update them if necessary. +3. Try to build Capstone and fix the build errors. + + ## Post-processing steps This update translates some LLVM C++ files to C. @@ -60,7 +86,7 @@ you will get build errors if you try to compile Capstone. The last step to finish the update is to fix those build errors by hand. -## Developer +## Additional details ### Overview updated files @@ -96,14 +122,7 @@ Those files are written by us: - `Mapping.*`: Binding code between the architecture module and the LLVM files. This is also where the detail is set. - `Module.*`: Interface to the Capstone core. -### Update procedure - -1. Run the `ASUpdater.py` script. -2. Compare the functions in `DisassemblerExtension.*` to LLVM (search the function names in the LLVM root) -and update them if necessary. -3. Try to build Capstone and fix the build errors. - -### Update details +### Relevant documentation and troubleshooting **LLVM file translation** @@ -129,9 +148,66 @@ Documentation about the `.inc` file generation is in the [llvm-capstone](https:/ **Formatting** -- If you make changes to the `CppTranslator` please format the files with `black` +- If you make changes to the `CppTranslator` please format the files with `black` and `usort` ``` - source ./.venv/bin/activate - pip3 install black - python3 -m black --line-length=120 CppTranslator/*/*.py + pip3 install black usort + python3 -m usort format src/autosync + python3 -m black src/autosync ``` + +## Refactor an architecture for Auto-Sync framework + +Not all architecture modules support Auto-Sync yet. +Here is an overview of the steps to add support for it. + +
+ +To refactor one of them to use `auto-sync`, you need to add it to the configuration. + +1. Add the architecture to the supported architectures list in `ASUpdater.py`. +2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`) + +Now, manually run the update commands within `ASUpdater.py` but *skip* the `Differ` step: + +``` +./Updater/ASUpdater.py -a -s IncGen Translate +``` + +The task after this is to: + +- Replace leftover C++ syntax with its C equivalent. +- Implement the `add_cs_detail()` handler in `Mapping` for each operand type. +- Edit the main header file of the architecture (`include/capstone/.h`) to include the generated enums (see below) +- Add any missing logic to the translated files. +- Make it build and write tests. +- Run the Differ again and always select the old nodes. + +**Notes:** + +- Some generated enums must be included in the `include/capstone/.h` header. +At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets): + + ``` + // generate content begin + // generate content end + ``` + +The update script will insert the content of the `.inc` file at this place. + +- If you find yourself fixing the same syntax error multiple times, +please consider adding a `Patch` to the `CppTranslator` for this case. + +- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own. + +- Running the `Differ` after everything is done, preserves your version of syntax corrections, and the next user can auto-apply them. + +- Sometimes the LLVM code uses a single function from a larger source file. +It is not worth it to translate the whole file just for this function. +Bundle those lonely functions in `DisassemblerExtension.c`. + +## Adding a new architecture + +Adding a new architecture follows the same steps as above. With the exception that you need +to implement all the Capstone files from scratch. 
+ +Check out an architecture that already supports `auto-sync` for guidance and open an issue if you need help. diff --git a/suite/auto-sync/intro.md b/suite/auto-sync/intro.md new file mode 100644 index 0000000000..b486006c72 --- /dev/null +++ b/suite/auto-sync/intro.md @@ -0,0 +1,96 @@ +## Why the Auto-Sync framework? + +Capstone provides a simple API to leverage the LLVM disassemblers, without +having the big footprint of LLVM itself. + +It does this by using a stripped-down copy of the LLVM disassemblers (one for each architecture) +and providing a uniform API to them. + +The actual disassembly task (bytes to asm-text and decoded operands) is completely done by +the LLVM code. +Capstone takes the disassembled instructions, adds details to them (operand read/write info etc.) +and organizes them into a uniform structure (`cs_insn`, `cs_detail` etc.). +These objects are then accessible from the API. + +Capstone is written in C and LLVM in C++. So to use the disassembler modules of LLVM, +Capstone effectively translates LLVM source files from C++ to C, without changing the semantics. +One could also call it a "disassembler port". + +Capstone supports multiple architectures. So whenever LLVM +has a new release and adds more instructions, Capstone needs to update its modules as well. + +In the past, the update procedure was done by hand and with some Python scripts. +But the task was tedious and error-prone. + +This is where Auto-Sync comes in. It eases the complicated update procedure. + +
+ +## How LLVM disassemblers work + +To use the LLVM disassembler logic effectively, one must understand how the disassemblers operate. + +Each architecture is defined in a so-called `.td` file, that is, a "Target Description" file. +Those files are a declarative description of an architecture. +They are written in a Domain-Specific Language called [TableGen](https://llvm.org/docs/TableGen/). +They contain instructions, registers, processor features, which operands the instructions read and write, and more information. + +These files are consumed by "TableGen Backends". They parse and process them to generate C++ code. +The generated code includes, for example, enums, decoding algorithms (for instructions and operands) and +lookup tables for register names or aliases. + +Additionally, LLVM has handwritten files. They use the generated code to build the actual instruction classes +and handle architecture-specific edge cases. + +Capstone uses both kinds of files: the generated ones as well as the handwritten ones. + +## Overview of updating steps + +An Auto-Sync update has multiple steps: + +**(1)** Changes in the auto-generated C++ files are handled completely automatically. +We have an LLVM fork with patched TableGen backends, so they emit C code. + +**(2)** Changes in LLVM's handwritten sources are handled semi-automatically. +For each source file, we search for C++ syntax and replace it with the equivalent C syntax. +For this task we have the CppTranslator. + +The end result is of course not perfectly valid C code. +It is merely an intermediate file, which still has some C++ syntax in it. + +Because this leftover syntax was likely already fixed in the equivalent C file currently in Capstone, +we have a last step. +The translated file is diffed with the corresponding old file in Capstone. + +The `Differ` tool parses both files into an abstract syntax tree. +From these ASTs it picks nodes with the same name and diffs them. +The diff is given to the user, and they can decide which version to accept. 
+ +All choices are also recorded and automatically applied next time. + +**Example** + +> Suppose there is a file `ArchDisassembler.cpp` in LLVM. +> Capstone has the C equivalent `ArchDisassembler.c`. +> +> Now LLVM has a new release, and there were several additions in `ArchDisassembler.cpp`. +> +> Auto-Sync will pass `ArchDisassembler.cpp` to the CppTranslator, which replaces most C++ syntax. +> The result is an intermediate file `transl_ArchDisassembler.cpp`. +> +> The result is close to what we want (C code), but still contains invalid syntax. +> Most of these syntax errors were fixed before. They must have been, because the C file `ArchDisassembler.c` +> is working fine. +> +> So the intermediate file `transl_ArchDisassembler.cpp` is compared to the old `ArchDisassembler.c`. +> The Differ parses both files into an AST and automatically patches all nodes it can. +> +> This effectively automates most of the boring, mechanical work involved in fixing up `transl_ArchDisassembler.cpp`. +> If something new comes up, it asks the user for a decision. +> +> The result is saved to `ArchDisassembler.c`, which is now up-to-date with the newest LLVM release. +> +> In practice this file will still contain syntax errors, but so few that they can easily be resolved. + +**(3)** After (1) and (2), some changes in Capstone-only files follow. +This step is manual work.