Static single assignment (SSA) #352
…sion into subexpressions of a defined expression type.
joint work with Niko Schmidt
…flow graph into SSA; joint work with Niko Schmidt
Hi! That's a really great feature 😄 Thanks again to you, @mrphrazer and @itsacoderepo for the PR!
Hi, great! Code review and questions are welcome :) We actively use SSA for static backward slicing. In addition, it is an integral part of our SMT-based program analysis. In general, we hope that SSA is a good foundation to push forward the data flow analysis techniques in Miasm.
        return iterator

    def _check_itetator(self, iterator, e):
As this method is used only once, you can inline it in _dissect.
Is iteTator a typo?
According to the Python documentation, NotImplementedError is meant for abstract methods that subclasses must override (the name is a bit misleading). You can raise, for instance, a RuntimeError instead.
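A minimal sketch of that distinction. The names Base and check_iterator are hypothetical stand-ins, not the PR's code:

```python
class Base(object):
    def dissect(self, expr):
        # Abstract hook: subclasses are expected to override this method.
        raise NotImplementedError("subclasses must implement dissect()")


def check_iterator(iterator, expr):
    # Hypothetical helper: a failed runtime check is an ordinary error,
    # not a missing override, so RuntimeError describes it better.
    if iterator is None:
        raise RuntimeError("no iterator for expression %r" % (expr,))
    return iterator
```

Note that NotImplementedError is itself a subclass of RuntimeError, so a caller that catches RuntimeError handles both cases.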
I'm not sure ir |
@@ -322,3 +322,20 @@ def reassemble_expr(self, e):
                if id_rhs in self.expressions:
                    todo.add(id_rhs)
        return e


class SSAPath(SSABlock):
This class never seems to be used. Maybe a map of SSABlock.transform is enough, which would make this class useless for now.
You may be right. Maybe the analysis module is more likely to receive such a great PR than ir itself.
I will move it to analysis. However, I am not sure that SSAPath itself is useless. It is a simple way to provide backward slicing in a few lines for a whole path. Are you sure that I should remove it?
I'm not sure I correctly understand the purpose of SSAPath. Is it designed to always work on a path? If so, is a path a list of SSABlock?
In that case, SSAPath should not inherit from SSABlock, but from UserList or list, to get methods such as append, iteration, comparison, ... Then it can be initialized with a list of SSABlock and work directly on it (so the transform method will apply to its internal path).
Am I correct?
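To make the suggestion concrete, here is a rough sketch of a path modelled as a list of blocks. Block and Path are hypothetical stand-ins for SSABlock and SSAPath, and the transform bodies are placeholders:

```python
class Block(object):
    """Hypothetical stand-in for SSABlock; transform() only tags a label."""

    def __init__(self, label):
        self.label = label

    def transform(self):
        return "ssa(%s)" % self.label


class Path(list):
    """A path is a list of blocks, so append, iteration, etc. come for free."""

    def transform(self):
        # Apply the per-block transformation along the whole path.
        return [block.transform() for block in self]


path = Path([Block("entry"), Block("loop")])
path.append(Block("exit"))
labels = path.transform()
```

Whether the real class needs state shared across blocks (which plain list inheritance does not give by itself) is left open here.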
                continue

            # remember blocks which contain phi nodes
            if node not in self._phinodes:
You can also use setdefault here (which does not modify the value if the key already exists) and merge the line with self._phinodes[node][variable] = e.src.
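A small sketch of the setdefault idiom on a nested dict. The names phinodes and remember_phi and the string keys are illustrative assumptions, not the PR's data:

```python
phinodes = {}


def remember_phi(node, variable, src):
    # setdefault returns the existing inner dict for node, or inserts
    # an empty one and returns it; no separate membership test is needed.
    phinodes.setdefault(node, {})[variable] = src


remember_phi("loc_10", "RAX", "RAX.0")
remember_phi("loc_10", "RBX", "RBX.1")
remember_phi("loc_20", "RAX", "RAX.2")
```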
Thanks for the great feedback! I adapted everything besides one or two issues; I will comment on them later :)
        # walk in DFS over the dominator tree
        for block in dominator_tree.walk_depth_first_forward(head):
            # restore SSA variable stack of the predecessor in the dominator tree
            self._stack_rhs = stack.pop().copy()
Here, you always copy the top element on the stack before using it.
So, there is no need to copy elements while adding them in the stack (line 517, 537).
Well, we did this in the beginning. It turned out that it does not work. stack.append(self._stack_rhs) does not copy the elements; it stores a reference to self._stack_rhs. If the elements within the stack are modified later, the change applies to the same instance, which is located several times in stack.
It may be the case that one of the .copy() calls is unnecessary, but debugging this is hard. Side effects may be hidden in details.
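The aliasing effect described here can be reproduced with plain dicts and lists. This is a generic sketch with made-up variable names, not the SSA code itself:

```python
current = {"RAX": 0}

aliased = []
copied = []

aliased.append(current)         # stores a reference to the same dict
copied.append(current.copy())   # stores an independent (shallow) snapshot

current["RAX"] = 1              # later in-place modification

shared = aliased[0]["RAX"]      # sees the modification
snapshot = copied[0]["RAX"]     # still the value at push time
```

Note that dict.copy() is shallow: mutable values stored inside the dict would still be shared between the copies.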
Well, this is a great PR (my comments are mainly cosmetics). Do you have an example in mind? Not related, but it could be nice to add an option to |
Hi, thanks again! In the following, a few explanations.

Memory

Regarding the parallel instructions and memory, assume
In SSA form, it has to be
If we transform the left-hand side and RSP would be translated before the memory instruction, the memory instruction's SSA transformation is incorrect because of
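The ordering issue can be illustrated with a toy renamer. This is only a sketch under simplifying assumptions: variables are plain strings, memory is modelled as a single name "MEM", and rename_parallel is a hypothetical helper, not the PR's implementation or the example from the original comment.

```python
def rename_parallel(assignments, versions):
    """Rename one block of parallel assignments into SSA form.

    assignments: list of (dst, [src, ...]) pairs that execute in parallel.
    versions: dict mapping a variable name to its current SSA index.
    """
    # Step 1: rename every right-hand side with the versions that are
    # live before the block; no destination has been renamed yet.
    renamed_srcs = [
        ["%s.%d" % (src, versions.get(src, 0)) for src in srcs]
        for _, srcs in assignments
    ]
    # Step 2: only now give each destination a fresh version.
    result = []
    for (dst, _), srcs in zip(assignments, renamed_srcs):
        versions[dst] = versions.get(dst, 0) + 1
        result.append(("%s.%d" % (dst, versions[dst]), srcs))
    return result


# RSP is updated and, in parallel, memory is written through the old RSP.
block = [("RSP", ["RSP"]), ("MEM", ["RSP", "RAX"])]
ssa_block = rename_parallel(block, {})
```

Because all sources are renamed before any destination, the memory assignment still refers to RSP.0 even though RSP.1 is defined in the same block.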
SSA Path

I have quickly written a short script to illustrate the usage of SSAPath:

from miasm2.analysis.machine import Machine
from miasm2.analysis.ssa import SSAPath
from miasm2.expression.expression import ExprId


def analyse_path(ssa):
    # just an example
    irdst = ExprId(("IRDst", 1), 64)
    output = "{}: {}\n".format(irdst, ssa.reassemble_expr(irdst))
    print output


code = ""
code += "554889e5897dec8975e88b45e80145ecd165e88b55ec8b45e8"
code += "01d03d380500007514c745fc00000000c745ec000000008345"
code += "e802eb20c745fc060000008b45e80145ec8b55ec8b45fc01d0"
code += "85c07507b800000000eb05b8010000005dc3"

m = Machine("x86_64")
mdis = m.dis_engine(code.decode("hex"))
ira = m.ira(mdis.symbol_pool)
ira_ssa = m.ira(mdis.symbol_pool)
asm_blocks = mdis.dis_multibloc(0)

for block in asm_blocks:
    ira.add_bloc(block)

ssa = SSAPath(ira)
start_label = ira.get_bloc(0x0).label
end_label = ira.get_bloc(0x5b).label

for index, path in enumerate(ira.graph.find_path(start_label, end_label)):
    # transform path into SSA
    ssa.transform(path)
    # update ira_ssa
    ira_ssa.blocs.update(ssa.blocks)
    ira_ssa._gen_graph()
    # write graph
    open("/tmp/" + str(index) + ".dot", "wb").write(ira_ssa.graph.dot())
    # do some analysis
    analyse_path(ssa)
    # reset SSA
    ssa.reset()
    ira_ssa.blocs = dict()

Here,

Current state and todo

To sum up, the following things remain on the TODO
Regarding regression tests, we do not have an idea of what useful tests could look like. Perhaps this can stay open for another PR. In addition, we would prefer if you could add SSA to the examples. I will update the post when parts of the TODO have been applied.
ssa: modified ssa variable generation
ssa: integrated _convert_block in _rename_expressions
Okay, this worked better than expected. The items of the TODO have been finished besides the last three points I mentioned in the post above. |
@@ -674,3 +674,166 @@ def possible_values(expr):
        raise RuntimeError("Unsupported type for expr: %s" % type(expr))

    return consvals


class ExprDissector(object):
Hi!
I am currently reviewing the code (at last!). I have a question:
If I understand ExprDissector.dissect correctly, it extracts subexpressions of a certain type and stops the recursion once one is found.
Why not use Expression.visit?
Here is an example, in which I use your regression tests:
from miasm2.expression.expression_helper import *
from miasm2.expression.expression import *


def dissect_test(expr, expr_type, result):
    # collect matching expressions and stop descending into them
    if isinstance(expr, expr_type):
        result.add(expr)
        return False
    return True


def dissect_visit(expr, expr_type):
    result = set()
    expr.visit(lambda expr: expr,
               lambda expr: dissect_test(expr, expr_type, result))
    return result


# define expressions
cf = ExprId('cf', size=1)
rbp = ExprId('RBP', size=64)
rdx = ExprId('RDX', size=64)
int1 = ExprInt(0xfffffffffffffffc, 64)
int2 = ExprInt(0xffffffce, 32)
int3 = ExprInt(0x32, 32)
compose1 = ExprCompose([(int2, 0, 32), (int3, 32, 64)])
op1 = rbp + int1 + compose1 + rdx
op2 = int2 + int3
mem1 = ExprMem(op1, 32)
mem2 = ExprMem(op2, 32)
cond1 = ExprCond(mem1, int2, int3)
slice1 = ExprSlice(mem1 + mem2 + cond1, 31, 32)
aff1 = ExprAff(cf, slice1)

assert (dissect_visit(cond1, ExprOp) == {op1})
assert (dissect_visit(mem2, ExprOp) == {op2})
assert (dissect_visit(aff1, ExprSlice) == {slice1})
assert (dissect_visit(aff1, ExprCond) == {cond1})
assert (dissect_visit(aff1, ExprCompose) == {compose1})
assert (dissect_visit(aff1, ExprMem) == {mem1, mem2})
assert (dissect_visit(aff1, ExprAff) == {aff1})
assert (dissect_visit(aff1, ExprInt) == {int1, int2, int3})
assert (dissect_visit(aff1, ExprId) == {cf, rbp, rdx})
The trick is not to use the visitor callback, but the test function itself, to get the results. It allows stopping the recursion once the correct type is found, as you do in your dissector.
Is this correct, or am I missing something?
Hi!
Neat trick, I did not know that :) I am not entirely sure, but I see two problems:
- It is recursive and uses lambda expressions. Where possible, I try to avoid this in Python since it cannot be optimised and has large overheads; if expressions are deeply nested (or large), time and memory usage will explode.
- The main reason: Parsing standard subexpressions is a nice feature, but not really used that much within this SSA implementation. However, you can add additional filter criteria such as in variables (which is often used in SSA).
At least, I think it makes sense to have a class such as ExprDissector in which you can define arbitrary filters (e.g., parsing memory expressions that depend on the stack pointer). Do you agree?
I totally agree with you on recursive vs. worklist. But in this case (tell me if I am wrong), the parsed expressions are limited to expressions that come from the IR of an instruction, which are in most cases really simple. (This is different from fat expressions resulting from a symbolic emulation, for example.) The feature is indeed interesting
If you operate on assembly-transformed code, you are absolutely right: the translated instructions are really simple. Therefore, it will not make much difference in this case. However, I was thinking about more generalised use cases. We can also perform SSA-based analysis of custom IRA graphs. For instance, you might create basic block summaries (e.g., via symbolic execution of the basic block) and replace the basic blocks with their summaries. In addition, you might subclass SSA and extend it with custom expressions that constrain some characteristics. In these cases, the worklist approach might be an advantage.
Well, currently in Miasm, expressions are very often handled with recursive methods. So, as the visitor paradigm seems easier and less error-prone, my vote goes to it. There is no gain in iteratively handling an expression that you cannot use afterwards. This is a known limitation of Miasm (undocumented 😗): an expression cannot exceed a depth equal to Python's recursion limit. We don't need it for now; maybe in the future, but it will come with a pretty big patch. (Sorry for the delay, I didn't notice my opinion was expected. Your feature is actually very cool, so I expect it will be merged in the near future.)
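The recursion-depth limitation can be seen on a toy expression built from nested tuples. This is a generic Python sketch; the tuple encoding and the helper names are assumptions for illustration and do not reflect Miasm's expression classes:

```python
import sys


def nest(depth):
    # Build ("neg", ("neg", ... "x")) iteratively.
    expr = "x"
    for _ in range(depth):
        expr = ("neg", expr)
    return expr


def depth_recursive(expr):
    if isinstance(expr, tuple):
        return 1 + depth_recursive(expr[1])
    return 0


def depth_iterative(expr):
    depth = 0
    while isinstance(expr, tuple):
        expr = expr[1]
        depth += 1
    return depth


deep = nest(sys.getrecursionlimit() + 100)

try:
    depth_recursive(deep)
    recursion_failed = False
except RuntimeError:  # RecursionError subclasses RuntimeError in Python 3
    recursion_failed = True

iterative_depth = depth_iterative(deep)
```

The iterative walk handles the deep expression, while the recursive one hits the interpreter's limit; as noted above, this only matters if the rest of the framework can also handle such expressions.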
The feature has been added in the last release. Thanks again! |
This PR introduces static single assignment (SSA). It is joint work with Niko Schmidt. It implements a
We start with
Then, we transform
The transformed expressions are located in the dict ssa.expressions. In addition, copies of the IRA blocks in SSA form are in ssa.blocks.
We can view the transformed graph with
The classes SSABlock and SSAPath allow the reassembling of expressions, which is useful for static slicing:
The SSA transformation is prone to memory aliasing. In the future, an SSA form such as memory SSA may be considered.
The SSA form on DiGraph level is known as minimal SSA.
Up until now, regression tests for SSA are missing. Especially in the case of SSADiGraph, we are open to comments on what useful tests could look like.