From 43d6545318fe517d0c77f7e8f2e0ed6bbea949b4 Mon Sep 17 00:00:00 2001 From: Rot127 Date: Sat, 11 May 2024 04:27:16 -0500 Subject: [PATCH 1/5] Restructure auto-sync docs to have them more contained in suite/auto-sync --- .../auto-sync/ARCHITECTURE.md | 103 ++-------------- suite/auto-sync/README.md | 110 +++++++++++++++--- 2 files changed, 98 insertions(+), 115 deletions(-) rename docs/AutoSync.md => suite/auto-sync/ARCHITECTURE.md (59%) diff --git a/docs/AutoSync.md b/suite/auto-sync/ARCHITECTURE.md similarity index 59% rename from docs/AutoSync.md rename to suite/auto-sync/ARCHITECTURE.md index 937e9c926b..1e9f21481b 100644 --- a/docs/AutoSync.md +++ b/suite/auto-sync/ARCHITECTURE.md @@ -1,12 +1,9 @@ -# Auto-Sync - -`auto-sync` is the architecture update tool for Capstone. -Because the architecture modules of Capstone use mostly code from LLVM, -we need to update this part with every LLVM release. `auto-sync` helps -with this synchronization between LLVM and Capstone's modules by -automating most of it. + -You can find it in `suite/auto-sync`. +# Architecture of Auto-Sync This document is split into four parts. @@ -106,94 +103,8 @@ It then compares specific nodes from the just translated file to the equivalent The user can choose if she accepts the version from the translated file or the old file. This decision is saved for every node. -If there exists a saved decision for a node, the previous decision automatically applied again. +If there exists a saved decision for two nodes, and the nodes did not change since the last time, +it applies the previous decision automatically again. Every other syntax error must be solved manually. -## Update an architecture - -To update an architecture do the following: - -Rebase `llvm-capstone` onto the new LLVM release (if not already done). -``` -# 1. Clone Capstone's LLVM -git clone https://github.com/capstone-engine/llvm-capstone -cd llvm-capstone -git checkout auto-sync - -# 2. 
Rebase onto the new LLVM release and resolve the conflicts. - -# 3. Build tblgen -mkdir build -cd build -cmake -G Ninja -DLLVM_TARGETS_TO_BUILD= -DCMAKE_BUILD_TYPE=Debug ../llvm -cmake --build . --target llvm-tblgen --config Debug - -# 4. Run the updater -cd ../../suite/auto-sync/ -./Updater/ASUpdater.py -a -``` - -The update script will execute the steps described above and copy the new files to their directories. - -Afterward try to build Capstone and fix any build errors left. - -If new instructions or operands were added, add test cases for those -(recession tests for instructions are located in `suite/MC/`). - -TODO: Operand and detail tests - - -## Refactor an architecture for `auto-sync` - -To refactor an architecture to use `auto-sync`, you need to add it to the configuration. - -1. Add the architecture to the supported architectures list in `ASUpdater.py`. -2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`) - -Now, manually run the update commands within `ASUpdater.py` but *skip* the `Differ` step: - -``` -./Updater/ASUpdater.py -a -s IncGen Translate -``` - -The task after this is to: - -- Replace leftover C++ syntax with its C equivalent. -- Implement the `add_cs_detail()` handler in `Mapping` for each operand type. -- Add any missing logic to the translated files. -- Make it build and write tests. -- Run the Differ again and always select the old nodes. - -**Notes:** - -- If you find yourself fixing the same syntax error multiple times, -please consider adding a `Patch` to the `CppTranslator` for this case. - -- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own. - -- Running the `Differ` after everything is done, preserves your version of syntax corrections, and the next user can auto-apply them. - -- Sometimes the LLVM code uses a single function from a larger source file. -It is not worth it to translate the whole file just for this function. 
-Bundle those lonely functions in `DisassemblerExtension.c`. - -- Some generated enums must be included in the `include/capstone/.h` header. -At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets): - - ``` - // generate content begin - // generate content end - ``` - -The update script will insert the content of the `.inc` file at this place. - -## Adding a new architecture - -Adding a new architecture follows the same steps as above. With the exception that you need -to implement all the Capstone files from scratch. - -Check out an `auto-sync` supporting architectures for guidance and open an issue if you need help. diff --git a/suite/auto-sync/README.md b/suite/auto-sync/README.md index 3f98037a34..afe8ba1a57 100644 --- a/suite/auto-sync/README.md +++ b/suite/auto-sync/README.md @@ -1,15 +1,17 @@ -# Architecture updater +# Architecture updater - Auto-Sync -This is Capstones updater for some architectures. -Unfortunately not all architectures are supported yet. +`auto-sync` is the architecture update tool for Capstone. +Because the architecture modules of Capstone use mostly code from LLVM, +we need to update this part with every LLVM release. `auto-sync` helps +with this synchronization between LLVM and Capstone's modules by +automating most of it. -## Install dependencies +## Install Setup Python environment and Tree-sitter @@ -20,11 +22,23 @@ sudo apt install python3-venv # Setup virtual environment in Capstone root dir python3 -m venv ./.venv source ./.venv/bin/activate +``` + +Install auto-sync + +``` cd suite/auto-sync/ pip install -e . ``` -## Update +## Architecture + +Please read [ARCHITECTURE.md](ARCHITECTURE.md) to understand how Auto-Sync works. + +## Update an architecture + +Updating an architecture module to the newest LLVM release, is only possible if it uses Auto-Sync. +Not all arch-modules support Auto-Sync yet. Check if your architecture is supported. 
@@ -52,6 +66,14 @@ Run the updater ./src/autosync/ASUpdater.py -a ``` +## Update procedure + +1. Run the `ASUpdater.py` script. +2. Compare the functions in `DisassemblerExtension.*` to LLVM (search the function names in the LLVM root) +and update them if necessary. +3. Try to build Capstone and fix the build errors. + + ## Post-processing steps This update translates some LLVM C++ files to C. @@ -60,7 +82,7 @@ you will get build errors if you try to compile Capstone. The last step to finish the update is to fix those build errors by hand. -## Developer +## Additional details ### Overview updated files @@ -96,14 +118,7 @@ Those files are written by us: - `Mapping.*`: Binding code between the architecture module and the LLVM files. This is also where the detail is set. - `Module.*`: Interface to the Capstone core. -### Update procedure - -1. Run the `ASUpdater.py` script. -2. Compare the functions in `DisassemblerExtension.*` to LLVM (search the function names in the LLVM root) -and update them if necessary. -3. Try to build Capstone and fix the build errors. - -### Update details +### Relevant documentation and troubleshooting **LLVM file translation** @@ -129,9 +144,66 @@ Documentation about the `.inc` file generation is in the [llvm-capstone](https:/ **Formatting** -- If you make changes to the `CppTranslator` please format the files with `black` +- If you make changes to the `CppTranslator` please format the files with `black` and `usort` ``` - source ./.venv/bin/activate - pip3 install black - python3 -m black --line-length=120 CppTranslator/*/*.py + pip3 install black usort + python3.11 -m usort format src/autosync + python3.11 -m black src/autosync ``` + +## Refactor an architecture for `auto-sync` + +Not all architecture modules support Auto-Sync yet. +Here is an overview of the steps to add support for it. + +
+ +To refactor one of them to use `auto-sync`, you need to add it to the configuration. + +1. Add the architecture to the supported architectures list in `ASUpdater.py`. +2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`) + +Now, manually run the update commands within `ASUpdater.py` but *skip* the `Differ` step: + +``` +./Updater/ASUpdater.py -a -s IncGen Translate +``` + +The task after this is to: + +- Replace leftover C++ syntax with its C equivalent. +- Implement the `add_cs_detail()` handler in `Mapping` for each operand type. +- Edit the main header file of the architecture (`include/capstone/.h`) to include the generated enums (see below) +- Add any missing logic to the translated files. +- Make it build and write tests. +- Run the Differ again and always select the old nodes. + +**Notes:** + +- Some generated enums must be included in the `include/capstone/.h` header. +At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets): + + ``` + // generate content begin + // generate content end + ``` + +The update script will insert the content of the `.inc` file at this place. + +- If you find yourself fixing the same syntax error multiple times, +please consider adding a `Patch` to the `CppTranslator` for this case. + +- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own. + +- Running the `Differ` after everything is done, preserves your version of syntax corrections, and the next user can auto-apply them. + +- Sometimes the LLVM code uses a single function from a larger source file. +It is not worth it to translate the whole file just for this function. +Bundle those lonely functions in `DisassemblerExtension.c`. + +## Adding a new architecture + +Adding a new architecture follows the same steps as above. With the exception that you need +to implement all the Capstone files from scratch. 
+ +Check out an architecture that already supports `auto-sync` for guidance and open an issue if you need help. From 7b9a08af672c67b042b61521d43e81d3b0575ad2 Mon Sep 17 00:00:00 2001 From: Rot127 Date: Sat, 11 May 2024 05:28:21 -0500 Subject: [PATCH 2/5] Enhance Differ documentation --- suite/auto-sync/ARCHITECTURE.md | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/suite/auto-sync/ARCHITECTURE.md b/suite/auto-sync/ARCHITECTURE.md index 1e9f21481b..2735e36008 100644 --- a/suite/auto-sync/ARCHITECTURE.md +++ b/suite/auto-sync/ARCHITECTURE.md @@ -13,7 +13,7 @@ This document is split into four parts. 4. Notes about how to add a new architecture to Capstone with `auto-sync`. Please read the section about architecture module design in -[ARCHITECTURE.md](ARCHITECTURE.md) before proceeding. +[ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) before proceeding. The architectural understanding is important for the following. ## Update procedure @@ -95,16 +95,30 @@ _Note_: For details about this checkout `suite/auto-sync/CppTranslator/README.md Because the result of the `CppTranslator` is not perfect, we still have many syntax problems left. -Those need to be fixed by hand. +Some of those need to be fixed by hand. + +**Differ** + In order to ease this process we run the `Differ` after the `CppTranslator`. -The `Differ` parses each _translated_ file and the corresponding source file _currently_ used in Capstone. -It then compares specific nodes from the just translated file to the equivalent nodes in the old file. +The `Differ` compares the two versions of the C files we now have. +One version is the set of C files currently used by the architecture module. +The other is the set of freshly translated C files. Those are still faulty and need to be fixed. + +Most fixes address syntactical problems, which were almost always resolved already during the last update. 
+The `Differ` helps you to compare the files and lets you select which version to accept. + +Occasionally, the newly translated C files contain important changes. +In most cases, though, the old files are already correct. + +The `Differ` parses both files into an abstract syntax tree and compares certain nodes with the same name +(mostly functions). The user can choose if she accepts the version from the translated file or the old file. This decision is saved for every node. If there exists a saved decision for two nodes, and the nodes did not change since the last time, it applies the previous decision automatically again. -Every other syntax error must be solved manually. - +The `Differ` is far from perfect. It only helps to automatically apply "known to be good" fixes +and gives the user a better interface to solve the other problems. +But there will still be syntax errors left afterward. These must be fixed by hand. From 03ae495b17e1ecc96f5906cad75a383a48e6e449 Mon Sep 17 00:00:00 2001 From: Rot127 Date: Sat, 11 May 2024 05:29:36 -0500 Subject: [PATCH 3/5] Fix link and emphasize importance of ARCHITECTURE.md --- suite/auto-sync/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/suite/auto-sync/README.md b/suite/auto-sync/README.md index afe8ba1a57..6789e67146 100644 --- a/suite/auto-sync/README.md +++ b/suite/auto-sync/README.md @@ -33,7 +33,9 @@ pip install -e . ## Architecture -Please read [ARCHITECTURE.md](ARCHITECTURE.md) to understand how Auto-Sync works. +Please read [ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) to understand how Auto-Sync works. + +This step is essential! Please don't skip it. 
## Update an architecture From d898404efa872d5eaac9aad3a52a8c7c41daf9e4 Mon Sep 17 00:00:00 2001 From: Rot127 Date: Sun, 12 May 2024 04:35:24 -0500 Subject: [PATCH 4/5] Add auto-syc intro.md document, based on @moste00 work --- suite/auto-sync/ARCHITECTURE.md | 2 +- suite/auto-sync/README.md | 2 + suite/auto-sync/intro.md | 96 +++++++++++++++++++++++++++++++++ 3 files changed, 99 insertions(+), 1 deletion(-) create mode 100644 suite/auto-sync/intro.md diff --git a/suite/auto-sync/ARCHITECTURE.md b/suite/auto-sync/ARCHITECTURE.md index 2735e36008..f1ec0f7be5 100644 --- a/suite/auto-sync/ARCHITECTURE.md +++ b/suite/auto-sync/ARCHITECTURE.md @@ -12,7 +12,7 @@ This document is split into four parts. 3. Instructions how to refactor an architecture to use `auto-sync`. 4. Notes about how to add a new architecture to Capstone with `auto-sync`. -Please read the section about architecture module design in +Please read the section about capstone module design in [ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) before proceeding. The architectural understanding is important for the following. diff --git a/suite/auto-sync/README.md b/suite/auto-sync/README.md index 6789e67146..b8d9db40dd 100644 --- a/suite/auto-sync/README.md +++ b/suite/auto-sync/README.md @@ -11,6 +11,8 @@ we need to update this part with every LLVM release. `auto-sync` helps with this synchronization between LLVM and Capstone's modules by automating most of it. +Please refer to [intro.md](intro.md) for an introduction about this tool. + ## Install Setup Python environment and Tree-sitter diff --git a/suite/auto-sync/intro.md b/suite/auto-sync/intro.md new file mode 100644 index 0000000000..3000290342 --- /dev/null +++ b/suite/auto-sync/intro.md @@ -0,0 +1,96 @@ +## Why AutoSync? + +Capstone provides a simple API to leverage the LLVM disassemblers, without +having the big footprint of LLVM itself. 
+ +It does this by using a stripped down copy of LLVM disassemblers (one for each architecture) +and provides a uniform API to them. + +The actual disassembly task (bytes to asm-text and decoded operands) is completely done by +the LLVM code. +Capstone takes the disassembled instructions, adds details to them (operand read/write info etc.) +and organizes them to a uniform structure (`cs_insn`, `cs_detail` etc.). +These objects are then accessible from the API. + +Capstone is in C and LLVM is in C++. So to use the disassembler modules of LLVM, +Capstone effectively translates LLVM source files from C++ to C, without changing the semantics. +One could also call it a "disassembler port". + +Capstone supports multiple architectures. So whenever LLVM +has a new release and adds more instructions, Capstone needs to update its modules as well. + +In the past, the update procedure was done by hand and with some Python scripts. +But the task was tedious and error-prone. + +To ease the complicated update procedure, Auto-Sync comes in. + +
+ +## How LLVM disassemblers work + +To use the LLVM disassembler logic effectively, one must understand how the disassemblers operate. + +Each architecture is defined in a so-called `.td` file, that is, a "Target Description" file. +Those files are a declarative description of an architecture. +They are written in a Domain-Specific Language called [TableGen](https://llvm.org/docs/TableGen/). +They contain instructions, registers, processor features, which operands an instruction reads and writes, and more information. + +These files are consumed by "TableGen Backends". They parse and process them to generate C++ code. +The generated code is for example: enums, decoding algorithms (for instructions and operands) or +lookup tables for register names or aliases. + +Additionally, LLVM has handwritten files. They use the generated code to build the actual instruction classes +and handle architecture-specific edge cases. + +Capstone uses both kinds of files: the generated ones as well as the handwritten ones. + +## Overview of updating steps + +An Auto-Sync update has multiple steps: + +**(1)** Changes in the auto-generated C++ files are handled completely automatically. +We have an LLVM fork with patched TableGen-backends, so they emit C code. + +**(2)** Changes in LLVM's handwritten sources are handled semi-automatically. +For each source file, we search for C++ syntax and replace it with the equivalent C syntax. +For this task we have the CppTranslator. + +The end result is of course not perfectly valid C code. +It is merely an intermediate file, which still has some C++ syntax in it. + +Because this leftover syntax was likely already fixed in the equivalent C file currently in Capstone, +we have a last step. +The translated file is diffed with the corresponding old file in Capstone. + +The `Differ` tool parses both files into an abstract syntax tree. +From this AST it picks nodes with the same name and diffs them. +The diff is given to the user, and they can decide which one to accept. 
+ +All choices are also recorded and automatically applied next time. + +**Example** + +> Suppose there is a file `ArchDisassembler.cpp` in LLVM. +> Capstone has the C equivalent `ArchDisassembler.c`. +> +> Now LLVM has a new release, and there were several additions in `ArchDisassembler.cpp`. +> +> Auto-Sync will pass `ArchDisassembler.cpp` to the CppTranslator, which replaces most C++ syntax. +> The result is an intermediate file `transl_ArchDisassembler.cpp`. +> +> The result is close to what we want (C code), but still contains invalid syntax. +> Most of these syntax errors were fixed before. They must have been, because the C file `ArchDisassembler.c` +> is working fine. +> +> So the intermediate file `transl_ArchDisassembler.cpp` is compared to the old `ArchDisassembler.c`. +> The Differ parses both files into an AST and automatically patches all nodes it can. +> +> This effectively automates most of the boring, mechanical work involved in fixing up `transl_ArchDisassembler.cpp`. +> If something new came up, it asks the user for a decision. +> +> The result is saved to `ArchDisassembler.c`, which is now up-to-date with the newest LLVM release. +> +> In practice this file will still contain syntax errors. But not many, so they can easily be resolved. + +**(3)** After (1) and (2), some changes in Capstone-only files follow. +This step is manual work. 
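Editor's illustration (not part of the patch series): the `Differ` behavior described in the patches above, matching nodes by name, asking the user which version to keep, and replaying a saved decision only while both nodes are unchanged, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual `Differ` code: the names `merge_nodes`, `node_hash`, `saved` and `ask` are hypothetical, nodes are passed in as plain name-to-source dictionaries instead of being parsed from an AST, and hashing the node text is only one possible way to detect that a node "did not change since the last time".

```python
import hashlib
from typing import Callable, Dict

OLD, NEW = "old", "new"


def node_hash(text: str) -> str:
    """Fingerprint a node's source text (hypothetical change-detection scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def merge_nodes(
    old_nodes: Dict[str, str],
    new_nodes: Dict[str, str],
    saved: Dict[str, Dict[str, str]],
    ask: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Pick, for every node name present in both files, the old or the new text.

    `saved` maps a node name to the last decision and to a fingerprint of the
    two node versions that decision was made for. It is updated in place.
    """
    merged: Dict[str, str] = {}
    for name in sorted(old_nodes.keys() & new_nodes.keys()):
        old_text, new_text = old_nodes[name], new_nodes[name]
        if old_text == new_text:
            # Nothing to decide, both versions are identical.
            merged[name] = old_text
            continue
        fingerprint = node_hash(old_text) + ":" + node_hash(new_text)
        record = saved.get(name)
        if record is not None and record["fingerprint"] == fingerprint:
            # Both nodes are unchanged since the last run: replay the decision.
            choice = record["choice"]
        else:
            # Unknown or changed node pair: ask the user and remember the answer.
            choice = ask(name, old_text, new_text)
            saved[name] = {"fingerprint": fingerprint, "choice": choice}
        merged[name] = old_text if choice == OLD else new_text
    return merged
```

In this sketch, a second call with the same inputs and the same `saved` dictionary does not invoke `ask` again, while any change to either version of a node invalidates the stored decision for that node. Nodes that exist in only one of the two files are deliberately left out, because the patches do not describe how they are handled.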
From b10cbf01c182dd328a88ddae8846e92de9689067 Mon Sep 17 00:00:00 2001 From: Rot127 Date: Sat, 1 Jun 2024 04:31:21 -0500 Subject: [PATCH 5/5] Be consistent with Auto-Sync naming and use python3 --- suite/auto-sync/ARCHITECTURE.md | 2 +- suite/auto-sync/README.md | 8 ++++---- suite/auto-sync/intro.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/suite/auto-sync/ARCHITECTURE.md b/suite/auto-sync/ARCHITECTURE.md index f1ec0f7be5..fe2ce23cea 100644 --- a/suite/auto-sync/ARCHITECTURE.md +++ b/suite/auto-sync/ARCHITECTURE.md @@ -3,7 +3,7 @@ Copyright © 2022 Rot127 SPDX-License-Identifier: BSD-3 --> -# Architecture of Auto-Sync +# Architecture of the Auto-Sync framework This document is split into four parts. diff --git a/suite/auto-sync/README.md b/suite/auto-sync/README.md index b8d9db40dd..5c519139c1 100644 --- a/suite/auto-sync/README.md +++ b/suite/auto-sync/README.md @@ -26,7 +26,7 @@ python3 -m venv ./.venv source ./.venv/bin/activate ``` -Install auto-sync +Install Auto-Sync framework ``` cd suite/auto-sync/ @@ -151,11 +151,11 @@ Documentation about the `.inc` file generation is in the [llvm-capstone](https:/ - If you make changes to the `CppTranslator` please format the files with `black` and `usort` ``` pip3 install black usort - python3.11 -m usort format src/autosync - python3.11 -m black src/autosync + python3 -m usort format src/autosync + python3 -m black src/autosync ``` -## Refactor an architecture for `auto-sync` +## Refactor an architecture for Auto-Sync framework Not all architecture modules support Auto-Sync yet. Here is an overview of the steps to add support for it. diff --git a/suite/auto-sync/intro.md b/suite/auto-sync/intro.md index 3000290342..b486006c72 100644 --- a/suite/auto-sync/intro.md +++ b/suite/auto-sync/intro.md @@ -1,4 +1,4 @@ -## Why AutoSync? +## Why the Auto-Sync framework? Capstone provides a simple API to leverage the LLVM disassemblers, without having the big footprint of LLVM itself.