Static single assignment (SSA) #352

Closed
wants to merge 20 commits into from

mrphrazer
Contributor

@mrphrazer mrphrazer commented Apr 17, 2016

This PR introduces static single assignment (SSA). It is joint work with Niko Schmidt. It implements a

  • generic class for SSA transformations
  • class for transformations on block level
  • class for transformations on path level
  • class for transformations on control flow graph level

We start with

# initialise IRA
ira = m.ira(mdis.symbol_pool)
# initialise IRA SSA
ira_ssa = m.ira(mdis.symbol_pool)

Then, we transform

# a single block
ssa = SSABlock(ira)
ssa.transform(block_label)

# a path
path = ira.graph.find_path(start, end)[0]
ssa = SSAPath(ira)
ssa.transform(path)

# an entire CFG
head = ira.get_bloc(start_addr).label
ssa = SSADiGraph(ira)
ssa.transform(head)

The transformed expressions are located in the dict ssa.expressions. In addition, copies of the IRA blocks in SSA form are in ssa.blocks.
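To make the block-level idea concrete, here is a minimal sketch that does not use Miasm at all. It renames a toy straight-line block, processed sequentially, into SSA form and collects the result in a dict, similar in spirit to ssa.expressions. The names BLOCK, IDENT and to_ssa are hypothetical and only serve the illustration; the variable.index naming is an assumption borrowed from the examples in this thread.

```python
import re

# Toy stand-in for IR assignments: (destination, source expression as text).
# These names are illustrative and are not part of Miasm.
BLOCK = [
    ("a", "b + c"),
    ("b", "a + 1"),
    ("a", "a + b"),
]

IDENT = re.compile(r"[A-Za-z_]\w*")


def to_ssa(assignments):
    """Rename a sequential straight-line block into SSA form.

    Returns a dict mapping each SSA variable to its defining expression.
    """
    version = {}      # variable -> index of its latest definition
    expressions = {}  # SSA variable -> renamed right-hand side

    def current(match):
        name = match.group(0)
        # Variables never assigned in the block keep their original name.
        return "%s.%d" % (name, version[name]) if name in version else name

    for dst, src in assignments:
        # Rename the reads first, using the versions visible before the write.
        renamed_src = IDENT.sub(current, src)
        # Then create a fresh version for the written variable.
        version[dst] = version.get(dst, -1) + 1
        expressions["%s.%d" % (dst, version[dst])] = renamed_src
    return expressions


ssa_expressions = to_ssa(BLOCK)
for var, expr in ssa_expressions.items():
    print(var, "=", expr)
```

Every variable on the left-hand side is defined exactly once, which is the property the later analyses rely on.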

We can view the transformed graph with

# update IRA SSA
ira_ssa.blocs.update(ssa.blocks)

print ira_ssa.graph.dot()

The classes SSABlock and SSAPath allow the reassembling of expressions, which is useful for static slicing:

e = ExprId("IRDst.55", 64)
print ssa.reassemble_expr(e)
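
As a rough illustration of why reassembling helps with slicing, the following sketch substitutes SSA definitions backwards until only input variables remain. It is a toy on text expressions, not the PR's reassemble_expr; EXPRESSIONS, SSA_VAR and reassemble are hypothetical names, and the sketch assumes acyclic definitions (no phi nodes).

```python
import re

# Hand-written SSA definitions for a toy block; names are illustrative only.
EXPRESSIONS = {
    "a.0": "b + c",
    "b.0": "a.0 + 1",
    "a.1": "a.0 + b.0",
}

SSA_VAR = re.compile(r"[A-Za-z_]\w*\.\d+")


def reassemble(var, expressions):
    """Substitute SSA definitions backwards until only inputs remain."""

    def expand(match):
        name = match.group(0)
        # Unknown SSA variables are left untouched.
        return "(%s)" % expressions[name] if name in expressions else name

    expr = var
    # Each SSA variable has exactly one definition, so repeated substitution
    # terminates as long as the definitions are acyclic.
    while True:
        replaced = SSA_VAR.sub(expand, expr)
        if replaced == expr:
            return expr
        expr = replaced


print(reassemble("a.1", EXPRESSIONS))
```

The result mentions only b and c, that is, the inputs the final value of a depends on.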

The SSA transformation is prone to memory aliasing problems. In the future, a memory SSA form may be considered.

The SSA form on DiGraph level is known as minimal SSA.
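
Assuming "minimal SSA" is meant in the usual textbook sense (phi nodes are placed on the iterated dominance frontier of each variable's definitions), the placement step can be sketched as follows. This is not the PR's code: the CFG, the block names and the helper functions are hypothetical, and every block is assumed to be reachable from the head.

```python
# Toy CFG as successor lists; block names and the variable "x" are illustrative.
CFG = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
DEFS = {"x": {"then", "else"}}  # blocks that assign each variable


def dominators(cfg, head):
    """Iterative data-flow computation of the dominator sets."""
    preds = {n: [p for p in cfg if n in cfg[p]] for n in cfg}
    dom = {n: set(cfg) for n in cfg}
    dom[head] = {head}
    changed = True
    while changed:
        changed = False
        for n in cfg:
            if n == head or not preds[n]:
                continue
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom, preds


def dominance_frontiers(cfg, head):
    """frontier[n]: blocks where the dominance of n ends."""
    dom, preds = dominators(cfg, head)
    frontier = {n: set() for n in cfg}
    for y in cfg:
        for p in preds[y]:
            for n in dom[p]:  # every n that dominates the predecessor p
                # n must not strictly dominate y itself
                if n == y or n not in dom[y]:
                    frontier[n].add(y)
    return frontier


def place_phis(cfg, head, defs):
    """Return, per block, the variables that need a phi node."""
    frontier = dominance_frontiers(cfg, head)
    phis = {n: set() for n in cfg}
    for var, blocks in defs.items():
        worklist = list(blocks)
        while worklist:
            block = worklist.pop()
            for y in frontier[block]:
                if var not in phis[y]:
                    phis[y].add(var)
                    # A phi node is itself a new definition of var.
                    worklist.append(y)
    return phis


print(place_phis(CFG, "entry", DEFS))
```

For the diamond-shaped CFG above, only the join block receives a phi node for x.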

So far, regression tests for SSA are missing. Especially for SSADiGraph, we are open to suggestions on what useful tests could look like.

@serpilliere
Contributor

Hi!

That's a really great feature 😄
I have successfully tested it by modifying the full.py script.
We will review the code with @commial in the coming days (we may have some questions).
Do you have any plans for the near future based on this feature?

Thanks again to you, @mrphrazer and @itsacoderepo for the PR!

@mrphrazer
Contributor Author

mrphrazer commented Apr 19, 2016

Hi,

great! Code review and questions are welcome :)

We actively use SSA for static backward slicing. In addition, it is an integral part of our SMT-based program analysis. In general, we hope that SSA is a good foundation to push forward the data flow analysis techniques in Miasm.


return iterator

def _check_itetator(self, iterator, e):
Contributor

As this method is used only once, you can in-line it in _dissect.
Is iteTator a typo?

According to the Python documentation, NotImplementedError is reserved to indicate abstract methods in abstract classes (so using it here is a bit misleading). You can raise, for instance, a RuntimeError instead.

@commial
Contributor

commial commented Apr 25, 2016

I'm not sure if ir.ssa is the best place for this. Maybe analysis could be considered. @serpilliere, any idea?

@@ -322,3 +322,20 @@ def reassemble_expr(self, e):
if id_rhs in self.expressions:
todo.add(id_rhs)
return e


class SSAPath(SSABlock):
Contributor

This class never seems to be used. Maybe a map of SSABlock.transform is enough, in which case this class is useless for now.

Contributor

You may be right. Maybe the analysis module is more likely to receive such a great PR than ir itself.

Contributor Author

@mrphrazer mrphrazer Apr 25, 2016

I will move it to analysis. However, I am not sure if SSAPath itself is useless. It is a simple way to provide backward slicing in a few lines for a whole path. Are you sure that I should remove that?

Contributor

I'm not sure I correctly understand the purpose of SSAPath. Is it designed to always work on a path? If so, is a path a list of SSABlock?

In that case, SSAPath should not inherit from SSABlock, but from UserList or list, to get methods such as append, iteration, comparison, ... Then it can be initialized with a list of SSABlock and work directly on it (so the transform method will apply to its internal path).

Am I correct?

continue

# remember blocks which contain phi nodes
if node not in self._phinodes:
Contributor

@commial commial Apr 25, 2016

You can also use setdefault here (which does not modify the value if the key already exists) and merge the line with self._phinodes[node][variable] = e.src
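
A minimal sketch of the setdefault suggestion, assuming the quoted "if node not in self._phinodes" branch only initialises an empty inner dict. The function remember_phi and the sample labels are hypothetical stand-ins for self._phinodes, node, variable and e.src.

```python
# Hypothetical stand-in for self._phinodes: block label -> {variable: source}.
phinodes = {}


def remember_phi(phinodes, node, variable, src):
    # setdefault returns the existing inner dict, or inserts {} and returns
    # it, so the membership test and the assignment collapse into one line.
    phinodes.setdefault(node, {})[variable] = src


remember_phi(phinodes, "loc_0", "RAX", "RAX.1")
remember_phi(phinodes, "loc_0", "RBX", "RBX.2")
remember_phi(phinodes, "loc_8", "RAX", "RAX.3")
print(phinodes)
```

An existing inner dict is never overwritten, so entries recorded earlier for the same block are preserved.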

@mrphrazer
Contributor Author

Thanks for the great feedback! I adapted everything besides one or two issues; I will comment on them later :)

# walk in DFS over the dominator tree
for block in dominator_tree.walk_depth_first_forward(head):
# restore SSA variable stack of the predecessor in the dominator tree
self._stack_rhs = stack.pop().copy()
Contributor

Here, you always copy the top element on the stack before using it.
So, there is no need to copy elements while adding them in the stack (line 517, 537).

Contributor Author

Well, we did this in the beginning. It turned out that it does not work. stack.append(self._stack_rhs) does not copy the elements; it stores a reference to self._stack_rhs. If the elements within the stack are modified, the change is applied to the same instance, which is located several times in the stack.

It may be the case that one of the .copy() is unnecessary, but debugging this is hard. Side effects may be hidden in details.
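
The aliasing effect described here can be reproduced in isolation. The sketch below uses a plain dict as a hypothetical stand-in for self._stack_rhs; whether the effect actually occurs in the PR depends on whether the dict is modified in place after it has been pushed, which this example does not try to settle.

```python
# Pushing without a copy: both stack slots alias the same dict.
current = {"RAX": 0}
stack = [current, current]
top = stack.pop().copy()   # copying on pop protects 'top' ...
current["RAX"] = 1         # ... but in-place updates still reach the stack
aliased_value = stack[0]["RAX"]

# Pushing a copy: the stack keeps a snapshot taken at push time.
current = {"RAX": 0}
stack = [current.copy(), current.copy()]
current["RAX"] = 1
snapshot_value = stack[0]["RAX"]

print(aliased_value, snapshot_value)  # prints "1 0"
```

Note that dict.copy() is shallow, so values that are themselves mutable would still be shared between the copies.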

@commial
Contributor

commial commented Apr 26, 2016

Well, this is a great PR (my comments are mainly cosmetic).
I'm just trying to convince myself on a last point (apart from the discussion on SSAPath), regarding the handling of parallel instructions and memory expressions.
Actually, I do not see why you need to handle memory expressions first.
As right-hand elements are handled in a distinct pass, before the one on left-hand elements, it must always work.

Do you have an example in mind?

Not related, but it could be nice to add an option to example/disasm/full.py to apply the SSA pass before generating the graph (I'll do it in another PR if you don't have the time).

@mrphrazer
Contributor Author

mrphrazer commented Apr 26, 2016

Hi,

thanks again! Here are a few explanations.

Memory

Regarding the parallel instructions and memory, assume push rbp. This will be translated into the following parallel instructions:

RSP = RSP + (- 0x8)
@64[RSP + (- 0x8)] = RBP
RBP = RSP

In SSA form, it has to be

RSP.0 = RSP + (- 0x8)
@64[RSP + (- 0x8)] = RBP
RBP.0 = RSP

If we transformed the left-hand sides in order and RSP were renamed before the memory assignment, the SSA form of the memory assignment would be incorrect, because its address would refer to the new RSP.0 instead of the original RSP:

RSP.0 = RSP + (- 0x8)
@64[RSP.0 + (- 0x8)] = RBP
RBP.0 = RSP
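
One way to get this ordering right can be sketched without Miasm: treat the address of a memory destination as a read, rename all reads in a first pass, and only then create new versions for the written registers. This is an illustration on text expressions, not the PR's implementation; PARALLEL, IDENT and rename_parallel are hypothetical names.

```python
import re

# Matches upper-case register names such as RSP or RBP (not 0x8 or 64).
IDENT = re.compile(r"\b[A-Z][A-Z0-9]*\b")

# The example from above as (destination, source) pairs. A destination
# starting with "@" is a memory write whose address is only read.
PARALLEL = [
    ("RSP", "RSP + (- 0x8)"),
    ("@64[RSP + (- 0x8)]", "RBP"),
    ("RBP", "RSP"),
]


def rename_parallel(assignments, version):
    def read(match):
        name = match.group(0)
        return "%s.%d" % (name, version[name]) if name in version else name

    # Pass 1: rename everything that is read (sources and memory addresses)
    # with the versions that were live before the parallel assignments.
    staged = []
    for dst, src in assignments:
        if dst.startswith("@"):
            dst = IDENT.sub(read, dst)
        staged.append((dst, IDENT.sub(read, src)))

    # Pass 2: only now create new versions for the written registers.
    result = []
    for dst, src in staged:
        if not dst.startswith("@"):
            version[dst] = version.get(dst, -1) + 1
            dst = "%s.%d" % (dst, version[dst])
        result.append((dst, src))
    return result


for dst, src in rename_parallel(PARALLEL, {}):
    print(dst, "=", src)
```

With an empty version table, the memory address keeps the original RSP, which matches the correct SSA form shown above.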

SSA Path

I have quickly written a short script to illustrate the usage of SSAPath:

from miasm2.analysis.machine import Machine
from miasm2.analysis.ssa import SSAPath
from miasm2.expression.expression import ExprId


def analyse_path(ssa):
    # just an example
    irdst = ExprId(("IRDst", 1), 64)

    output = "{}: {}\n".format(irdst, ssa.reassemble_expr(irdst))

    print output


code = ""
code += "554889e5897dec8975e88b45e80145ecd165e88b55ec8b45e8"
code += "01d03d380500007514c745fc00000000c745ec000000008345"
code += "e802eb20c745fc060000008b45e80145ec8b55ec8b45fc01d0"
code += "85c07507b800000000eb05b8010000005dc3"

m = Machine("x86_64")
mdis = m.dis_engine(code.decode("hex"))

ira = m.ira(mdis.symbol_pool)
ira_ssa = m.ira(mdis.symbol_pool)

asm_blocks = mdis.dis_multibloc(0)

for block in asm_blocks:
    ira.add_bloc(block)

ssa = SSAPath(ira)

start_label = ira.get_bloc(0x0).label
end_label = ira.get_bloc(0x5b).label

for index, path in enumerate(ira.graph.find_path(start_label, end_label)):
    # transform path into SSA
    ssa.transform(path)

    # update ira_ssa
    ira_ssa.blocs.update(ssa.blocks)
    ira_ssa._gen_graph()

    # write graph
    with open("/tmp/" + str(index) + ".dot", "wb") as dot_file:
        dot_file.write(ira_ssa.graph.dot())

    # do some analysis
    analyse_path(ssa)

    # reset SSA
    ssa.reset()
    ira_ssa.blocs = dict()

Here, analyse_path is a placeholder for an arbitrary analysis of the transformed path. One intention of SSAPath is to provide the ability to locate paths with certain characteristics between two nodes, for instance, in the case of data flow analysis.

Current state and todo

To sum up, the following items remain on the TODO list:

  • apply today's feedback
  • modify variable renaming
  • apply changes to _convert_block
  • apply changes to _gen_empty_phi
  • regression tests for SSA
  • discuss SSAPath
  • add SSA to example/disasm/full.py

Regarding regression tests, we do not have an idea of what useful tests could look like. Perhaps this can stay open for another PR. In addition, we would prefer it if you could add SSA to the examples.

I will update the post when parts of the TODO have been applied.

ssa: modified ssa variable generation

ssa: integrated _convert_block in _rename_expressions
@mrphrazer
Contributor Author

mrphrazer commented Apr 26, 2016

Okay, this worked better than expected. All TODO items have been completed except for the last three points I mentioned in the post above.

@@ -674,3 +674,166 @@ def possible_values(expr):
raise RuntimeError("Unsupported type for expr: %s" % type(expr))

return consvals


class ExprDissector(object):
Contributor

Hi!
I am currently reviewing the code (at last!). I have a question:
If I understand ExprDissector.dissect correctly, it extracts sub-expressions of a certain type and stops the recursion once one is found.

Why not use Expression.visit?
Here is an example, in which I use your regression tests:

from miasm2.expression.expression_helper import *
from miasm2.expression.expression import *

def dissect_test(expr, expr_type, result):
    if isinstance(expr, expr_type):
        result.add(expr)
        return False
    return True


def dissect_visit(expr, expr_type):
    result = set()
    expr.visit(lambda expr:expr,
               lambda expr:dissect_test(expr, expr_type, result))
    return result



# define expressions
cf = ExprId('cf', size=1)
rbp = ExprId('RBP', size=64)
rdx = ExprId('RDX', size=64)
int1 = ExprInt(0xfffffffffffffffc, 64)
int2 = ExprInt(0xffffffce, 32)
int3 = ExprInt(0x32, 32)
compose1 = ExprCompose([(int2, 0, 32), (int3, 32, 64)])
op1 = rbp + int1 + compose1 + rdx
op2 = int2 + int3
mem1 = ExprMem(op1, 32)
mem2 = ExprMem(op2, 32)
cond1 = ExprCond(mem1, int2, int3)
slice1 = ExprSlice(mem1 + mem2 + cond1, 31, 32)
aff1 = ExprAff(cf, slice1)




assert (dissect_visit(cond1, ExprOp) == {op1})
assert (dissect_visit(mem2, ExprOp) == {op2})
assert (dissect_visit(aff1, ExprSlice) == {slice1})
assert (dissect_visit(aff1, ExprCond) == {cond1})
assert (dissect_visit(aff1, ExprCompose) == {compose1})
assert (dissect_visit(aff1, ExprMem) == {mem1, mem2})
assert (dissect_visit(aff1, ExprAff) == {aff1})
assert (dissect_visit(aff1, ExprInt) == {int1, int2, int3})
assert (dissect_visit(aff1, ExprId) == {cf, rbp, rdx})

The trick is not to use the visitor callback, but the test function itself to get the results. This allows the recursion to stop once the correct type is found, as you do in your dissector.
Is this correct, or am I missing something?

Contributor Author

@mrphrazer mrphrazer Jun 4, 2016

Hi!

Neat trick, I did not know that :) I am not entirely sure, but I see two problems:

  1. It is recursive and uses lambda expressions. Where possible, I try to avoid this in Python since it cannot be optimised and has a large overhead; if expressions are deeply nested (or large), time and memory usage will explode.
  2. The main reason: Parsing standard subexpressions is a nice feature, but not really used that much within this SSA implementation. However, you can add additional filter criteria such as in variables (which is often used in SSA).

At least, I think it makes sense to have a class such as ExprDissector in which you can define arbitrary filters (e.g., parsing memory expressions that depend on the stack pointer). Do you agree?
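
To illustrate the kind of filter meant here, the following sketch walks a toy expression tree with an explicit worklist and an arbitrary predicate. It uses nested tuples instead of Miasm expression classes; dissect, depends_on and is_stack_memory are hypothetical names, not the ExprDissector API.

```python
from collections import deque

# Toy expression trees as nested tuples: (kind, payload...). Not Miasm classes.
RSP = ("id", "RSP")
RAX = ("id", "RAX")
stack_slot = ("mem", ("op", "+", RSP, ("int", 8)))
heap_slot = ("mem", ("op", "+", RAX, ("int", 16)))
expr = ("op", "+", stack_slot, ("op", "*", heap_slot, ("int", 2)))


def children(node):
    """Sub-expressions of a node (payload strings and ints are skipped)."""
    return [part for part in node[1:] if isinstance(part, tuple)]


def dissect(root, predicate):
    """Collect sub-expressions matching predicate, without recursion.

    The walk does not descend into a sub-expression once it has matched.
    """
    found = set()
    todo = deque([root])
    while todo:
        node = todo.pop()
        if predicate(node):
            found.add(node)
            continue
        todo.extend(children(node))
    return found


def depends_on(node, name):
    """True if the identifier `name` occurs anywhere inside node."""
    todo = [node]
    while todo:
        current = todo.pop()
        if current == ("id", name):
            return True
        todo.extend(children(current))
    return False


def is_stack_memory(node):
    # Custom filter: memory accesses whose address depends on RSP.
    return node[0] == "mem" and depends_on(node, "RSP")


stack_accesses = dissect(expr, is_stack_memory)
all_memory = dissect(expr, lambda node: node[0] == "mem")
print(stack_accesses)
```

The same dissect function serves both the plain type filter and the stack-pointer filter; only the predicate changes.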

@serpilliere
Contributor

I totally agree with you regarding recursion vs. worklist. But in this case (tell me if I am wrong), the parsed expressions are limited to expressions that come from the IR of an instruction, which are in most cases really simple. (This is different from the fat expressions resulting from symbolic emulation, for example.)

The feature is indeed interesting
@commial, any thoughts?

@mrphrazer
Contributor Author

If you operate on assembly-transformed code, you are absolutely right: the translated instructions are really simple. Therefore, it will not make much difference in this case.

However, I was thinking about more generalised use cases. We can also perform SSA-based analysis of custom IRA graphs. For instance, you might create basic block summaries (e.g., via symbolic execution of the basic block) and replace the basic blocks with their summaries. In addition, you might subclass SSA and extend it with custom expressions that constrain some characteristics. In these cases, the worklist approach might be an advantage.

@commial
Contributor

commial commented Jun 29, 2016

Well, currently in Miasm, expressions are very often handled with recursive methods.
For instance, take expression simplification (and hence symbolic execution), translators (and hence the emulation part) and even expression display (through __str__, __repr__).

So, as the visitor paradigm seems easier and less error-prone, my vote goes to it. There is no gain in iteratively handling expressions that you can't use afterwards.

This is a known limitation of Miasm (undocumented 😗): expressions can't exceed a depth equal to Python's recursion limit. We don't need it for now, maybe in the future, but it will come with a pretty big patch.
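
The depth limitation can be reproduced with a toy example in plain Python 3, independent of Miasm. The helpers nest, depth_recursive and depth_iterative are hypothetical; the expression is a chain of nested tuples that is ten times deeper than the interpreter's current recursion limit.

```python
import sys


def nest(depth):
    """Build ("neg", ("neg", ... ("id", "x"))) iteratively."""
    node = ("id", "x")
    for _ in range(depth):
        node = ("neg", node)
    return node


def depth_recursive(node):
    # One Python stack frame per nesting level.
    return 0 if node[0] == "id" else 1 + depth_recursive(node[1])


def depth_iterative(node):
    # Constant stack usage, regardless of the nesting depth.
    depth = 0
    while node[0] != "id":
        node, depth = node[1], depth + 1
    return depth


deep = nest(sys.getrecursionlimit() * 10)

try:
    depth_recursive(deep)
    recursive_ok = True
except RecursionError:
    recursive_ok = False

iterative_depth = depth_iterative(deep)
print(recursive_ok, iterative_depth == sys.getrecursionlimit() * 10)
```

The recursive walk fails on this input while the iterative one completes, which is the trade-off discussed in this thread; for the shallow expressions produced from instruction IR, both behave the same.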

(Sorry for the delay, I didn't notice my opinion was expected. Your feature is actually very cool, so I expect it will be merged in the near future.)

@commial
Contributor

commial commented Nov 25, 2018

The feature has been added in the last release. Thanks again!

@commial commial closed this Nov 25, 2018
3 participants