New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Static single assignment (SSA) #352

Closed
wants to merge 20 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@mrphrazer
Copy link
Contributor

mrphrazer commented Apr 17, 2016

This PR introduces static single assignment (SSA). It is joint work with Niko Schmidt. It implements a

  • generic class for SSA transformations
  • class for transformations on block level
  • class for transformations on path level
  • class for transformations on control flow graph level

We start with

# initialise IRA
ira = m.ira(mdis.symbol_pool)
# initialise IRA SSA
ira_ssa = m.ira(mdis.symbol_pool)

Then, we transform

# a single block
ssa = SSABlock(ira)
ssa.transform(block_label)

# a path
path = ira.graph.find_path(start, end)[0]
ssa = SSAPath(ira)
ssa.transform(path)

# an entire CFG
head = ira.get_bloc(start_addr).label
ssa = SSADiGraph(ira)
ssa.transform(head)

The transformed expressions are located in the dict ssa.expressions. In addition, copies of the IRA blocks in SSA form are in ssa.blocks.

We can view the transformed graph with

# update IRA SSA
ira_ssa.blocs.update(ssa.blocks)

print ira_ssa.graph.dot()

The classes SSABlock and SSAPath allow the reassembling of expressions, which is useful for static slicing:

e = ExprId("IRDst.55", 64)
print ssa.reassemble_expr(e)

The SSA transformation is prone to memory aliasing. In future, an SSA form as memory SSA may be considered.

The SSA form on DiGraph level is known as minimal SSA.

Up until now, regression tests for SSA are missing. Especially in the case of SSADiGraph, we are open for comments as useful tests could look like.

@mrphrazer mrphrazer force-pushed the mrphrazer:ssa branch from 5738469 to 3e79671 Apr 17, 2016

@serpilliere

This comment has been minimized.

Copy link
Contributor

serpilliere commented Apr 19, 2016

Hi!

That's a really great feature 😄
I have successfully tested it by modifying the full.py script.
We will review the code with @commial in the next days (we may have some questions)
Have you got some plans in a near future based on this feature?

Thanks again to you, @mrphrazer and @itsacoderepo for the PR!

@mrphrazer

This comment has been minimized.

Copy link
Contributor

mrphrazer commented Apr 19, 2016

Hi,

great! Code review and questions are welcome :)

We actively use SSA for static backward slicing. In addition, it is an integral part of our SMT-based program analysis. In general, we hope that SSA is a good foundation to push forward the data flow analysis techniques in Miasm.


return iterator

def _check_itetator(self, iterator, e):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

As this method is used only once, you can in-line it in _dissect.
Is iteTator a typo?

According to Python doc, NotImplementedError is reserved to indicate abstract methods in abstract classes (a bit misleading). You can raise, for instance, a RuntimeError instead.

if not iterator:
raise NotImplementedError("no handler for {}".format((type(e))))

def _dissect(self, e, expr_type):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

When it is possible, please use at least two or three character for variable name, it is easier to read / search / avoid a mess when the function is more complex.

This comment has been minimized.

@serpilliere

serpilliere Apr 25, 2016

Contributor

We are currently trying to limit one letter variables in new Miasm code: we recently refactored a code using many one letter variables, and we have learned not to use them the hard way 😄

"""
return self._dissect(e, m2_expr.ExprOp)

def slice(self, e):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

slice, int and id are keywords in Python. It is better to avoid using them as names.
There is a common workaround: add a _ after them.

import miasm2.expression.expression as m2_expr


class ExprDissector:

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

When it is possible, please inherit from object (to avoid old style classes).

You can also move this code in expression_helper

"""
# set architecture variables
if registers and irdst:
self.arch_vars = set(registers + [irdst])

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

I'm not a big fan of attributes which are sometimes defined, sometimes not.
You can always defined it, with set() when no registers are provided.

Then, instead of raising an AssertionError, you can raise an error more specific (or a silent pass).

"""Transforms into SSA"""
raise NotImplementedError("")

def get_block(self, block):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

It could be misleading to use block for an asm_label instance. In others code snippets, we preferably use label

:return: variable expression in SSA form
"""
index = stack[v]
name = v.name + "." + str(index)

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

Instead of using . as a magic separator, you can use your own class for the name. Indeed, this is not required that name is a string. Then, this class can easily handle the number associated to the var, and a str displaying it as the current output.
In other architecture, we may want to represent some register using a ., and it may end with a confusion.

This comment has been minimized.

@serpilliere

serpilliere Apr 25, 2016

Contributor

Correct: the only "property" name must have is it may be immutable.
For exemple, you can write:

a = ExprId(('EAX', 18), 32)
reg_name, reg_num = a.name

That's not really a hack, as it is used for storing ExprId of asm_label for example.

This comment has been minimized.

@mrphrazer

mrphrazer Apr 25, 2016

Contributor

That is a really great way. I will adapt that, but it may take some time since I expect that this will break other functions in that class.

instructions = []
for dst in assignblk:
# dst = src
aff = m2_expr.ExprAff(dst, assignblk[dst])

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

You can use assignblk.dst2ExprAff(dst) instead

# replace instructions of assignblock in IRA block
ib.irs[index] = AssignBlock(instructions)

def _copy_block(self, block):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

Idem, block is quite misleading here. You can use label instead.

As this function is more related to ir, you can move its core in the class ir.

ssa_expressions_block.append(e)

# replace blocks IR expressions with corresponding SSA transformations
self._convert_block(ib, ssa_expressions_block)

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

Instead of having a flat list of instructions in ssa_expressions_block and iterating again on each destination of each AssignBlock, you can move the end of _convert_block into the above for.
It also seems easier to debug / understand for me, avoiding the fake separation between instructions.

This comment has been minimized.

@mrphrazer

mrphrazer Apr 25, 2016

Contributor

Same here, really good point. I will change it, but it may last some time.

dst_ssa = self._transform_expression_lhs(dst)

# retrieve corresponding RHS expression
src_ssa = rhs.pop(0)

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

List are not very effective when used as queue. You can use deque instead, which is more effective


return ib_ssa

def _rename_expressions(self, block):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

Idem for block -> label

@commial

This comment has been minimized.

Copy link
Member

commial commented Apr 25, 2016

I'm not sure ir ir.ssa is the best place for this. May analysis could be considered. @serpilliere , any idea?

@@ -322,3 +322,20 @@ def reassemble_expr(self, e):
if id_rhs in self.expressions:
todo.add(id_rhs)
return e


class SSAPath(SSABlock):

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

This class never seems to be used. May a map of SSABlock.transform is enough, and then this class useless for now

This comment has been minimized.

@serpilliere

serpilliere Apr 25, 2016

Contributor

You may be right. Maybe the analysis module is more likely to receive such a great PR than ir itself.

This comment has been minimized.

@mrphrazer

mrphrazer Apr 25, 2016

Contributor

I will move it to analysis. However, I am not sure if SSAPath itself is useless. It is a simple way to provide backward slicing in a few lines for a whole path. Are you sure that I should remove that?

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

I'm not sure to correctly understand the purpose of SSAPath. Is it designed to always work on a path? If it is the case, is a path a list of SSABlock ?

In this last case, SSAPath should not inherits from SSABlock, but from UserList or list to get methods such as append, iteration, comparison, ... Then it can be initialized with a list of SSABlock, and directly works on it (so transform method will apply on its internal path).

Am I correct?

for block in self.graph.walk_depth_first_forward(head):
ib = self.get_block(block)
# blocks IR expressions
ir_expressions = (m2_expr.ExprAff(dst, assignblk[dst])

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

You can use dst2ExprAff instead

# blocks IR expressions
ir_expressions = (m2_expr.ExprAff(dst, assignblk[dst])
for assignblk in ib.irs for dst in assignblk)
for e in ir_expressions:

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

You only use e.dst here, that is to say assignblk destinations, so you can directly iterate on assignblk (which iterate over destinations)

# exclude architecture's instruction pointer
if e.dst in instruction_pointer:
continue
if e.dst not in self.defs:

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

Instead of this code pattern, you can use setdefault:

self.defs.setdefault(e.dst, set()).add(ib.label)
continue

# remember blocks which contain phi nodes
if node not in self._phinodes:

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

You can also use setdefault here (which does not modify the value if the key already exists) and merge the line with self._phinodes[node][variable] = e.src

:param v: variable
:return: ExprAff, empty phi function for v
"""
phi = m2_expr.ExprId("phi", v.size)

This comment has been minimized.

@commial

commial Apr 25, 2016

Member

As this method will be called several times with the same size, you could use a cache size -> phi

This comment has been minimized.

@serpilliere

serpilliere Apr 25, 2016

Contributor

Did I say this was a great feature?

This comment has been minimized.

@mrphrazer

mrphrazer Apr 25, 2016

Contributor

I am not exactly sure what you have in mind. Somethig as that?

def _gen_empty_phi(v):
    self.phi_dict = dict()
    if v.size not in phi_dict:
        self.phi_dict[v.size] = m2_expr.ExprId("phi", v.size)
    return  self.phi_dict[v.size] 

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

It can be rewritten as a class method, leading phi_dict (or phi_cache, ...) to be a class attribute. So you can declare it under the class statement, as:

class SSADiGraph(SSA):
   ...
   # _gen_empty_phi cache
   _phi_cache = {}

   def __init__ ...

And then:

@classmethod
def _gen_empty_phi(cls, expr):
    if expr.size not in cls.phi_cache:
        cls.phi_cache[expr.size] = m2_expr.ExprId("phi", expr.size)
    return  cls.phi_cache[expr.size] 

But there is no need to spend too much time on this little optimisation.

@mrphrazer

This comment has been minimized.

Copy link
Contributor

mrphrazer commented Apr 25, 2016

Thanks for the great feedback! I adapted everything besides one or to issues, I will comment them later :)

# walk in DFS over the dominator tree
for block in dominator_tree.walk_depth_first_forward(head):
# restore SSA variable stack of the predecessor in the dominator tree
self._stack_rhs = stack.pop().copy()

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

Here, you always copy the top element on the stack before using it.
So, there is no need to copy elements while adding them in the stack (line 517, 537).

This comment has been minimized.

@mrphrazer

mrphrazer Apr 26, 2016

Contributor

Well, we did this in the beginning. It turned out that is does not work. stack.append(self._stack_rhs) does not copy the elements, it stores a reference to self._stack_rhs. If the elements within the stack will be modified, this will be applied to the same instance that is located several times in stack.

It may be the case that one of the .copy() is unnecessary, but debugging this is hard. Side effects may be hidden in details.

"""
if block in self._phinodes:
# create temporary list of phi function assignments for inplace renaming
tmp = list(self._phinodes[block])

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

You can avoid the if statement using tmp = list(self._phinodes.get(block, []))

# if successor is in block's dominance frontier
if successor in self._phinodes:
# walk over all variables on LHS
for dst in self._phinodes[successor]:

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

Idem, you can avoid the if using a .get(, [])

for block in self._phinodes:
ib = self.get_block(block)
# list of instructions
instructions = []

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

Instead of using an intermediate instructions object, build a group of ExprAff and finally build an AssignBlock from it, you can directly declare an AssignBlock, and then assign values to it like a dictionnary:

assignblk = AssignBlock()
for dst in self._phinodes[block]:
    ...
    assignblk[dst] = src
mem1 = ExprMem(op1, 32)
mem2 = ExprMem(op2, 32)
cond = ExprCond(mem1, int2, int3)
slice = ExprSlice(mem1 + mem2 + cond, 31, 32)

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

You can use slice1 instead of the keyword slice

from miasm2.ir.ir import AssignBlock, irbloc
from miasm2.ir.ir import AssignBlock

from collections import deque

This comment has been minimized.

@commial

commial Apr 26, 2016

Member

(Not necessary to take in account) We try to keep an order in imports, which is:

  1. standard library imports (collections)
  2. related third party imports
  3. local application/library specific imports (miasm2)
@commial

This comment has been minimized.

Copy link
Member

commial commented Apr 26, 2016

Well, this is a great PR (my comments are mainly cosmetics).
I'm just trying to convince myself on a last point (apart the discussion on SSAPath), regarding the handling of parallel instructions and memory expressions.
Actually, I do not see why you need to handle memory expression first.
As right elements are handled in a distinct pass, before the one on left elements, it must always work.

Do you have an example in mind?

Not related, but it could be nice to add an option to example/disasm/full.py to apply the SSA pass before generating the graph (I'll do it in another PR if you don't have the time).

@mrphrazer

This comment has been minimized.

Copy link
Contributor

mrphrazer commented Apr 26, 2016

Hi,

thanks again! In the following a few explanations

Memory

Regarding the parallel instructions and memory, assume push rbp. This will be translated into the following parallel instructions:

RSP = RSP + (- 0x8)
@64[RSP + (- 0x8)] = RBP
RBP = RSP

In SSA form, it has to be

RSP.0 = RSP + (- 0x8)
@64[RSP + (- 0x8)] = RBP
RBP.0 = RSP

If we transform the left-hand side and RSP would be translated before the memory instruction, the memory instruction's SSA transformation is incorrect because of RSP:

RSP.0 = RSP + (- 0x8)
@64[RSP.0 + (- 0x8)] = RBP
RBP.0 = RSP

SSA Path

I have quickly written a short script to illustrate the usage of SSAPath:

from miasm2.analysis.machine import Machine
from miasm2.analysis.ssa import SSAPath
from miasm2.expression.expression import ExprId


def analyse_path(ssa):
    # just an example
    irdst = ExprId(("IRDst", 1), 64)

    output = "{}: {}\n".format(irdst, ssa.reassemble_expr(irdst))

    print output


code = ""
code += "554889e5897dec8975e88b45e80145ecd165e88b55ec8b45e8"
code += "01d03d380500007514c745fc00000000c745ec000000008345"
code += "e802eb20c745fc060000008b45e80145ec8b55ec8b45fc01d0"
code += "85c07507b800000000eb05b8010000005dc3"

m = Machine("x86_64")
mdis = m.dis_engine(code.decode("hex"))

ira = m.ira(mdis.symbol_pool)
ira_ssa = m.ira(mdis.symbol_pool)

asm_blocks = mdis.dis_multibloc(0)

for block in asm_blocks:
    ira.add_bloc(block)

ssa = SSAPath(ira)

start_label = ira.get_bloc(0x0).label
end_label = ira.get_bloc(0x5b).label

for index, path in enumerate(ira.graph.find_path(start_label, end_label)):
    # transform path into SSA
    ssa.transform(path)

    # update ira_ssa
    ira_ssa.blocs.update(ssa.blocks)
    ira_ssa._gen_graph()

    # write graph
    open("/tmp/" + str(index) + ".dot", "wb").write(ira_ssa.graph.dot())

    # do some analysis
    analyse_path(ssa)

    # reset SSA
    ssa.reset()
    ira_ssa.blocs = dict()

Here, analyse_path is a placeholder for an arbitrary analysis of the transformed path. One intention of SSAPath is to provide the ability to locate paths that with certain characteristics between two nodes, for instance, in the case of data flow analysis.

Current state and todo

To sum up, the following things remain on the TODO

  • apply the today's feedback
  • modify variable renaming
  • apply changes to _convert_block
  • apply changes to _gen_empty_phi
  • regressions tests for SSA
  • discuss SSAPath
  • add SSA to example/disasm/full.py

Regarding regression tests, we do not have an idea how useful tests could look like. Perhaps it can stay open for another PR. In addition, we would prefer if you could add SSA to the examples.

I will update the post when parts of the TODO have been applied.

mrphrazer added some commits Apr 26, 2016

ssa: refactored. thanks to serpilliere and commial
ssa: modified ssa variable generation

ssa: integrated _convert_block in _rename_expressions

@mrphrazer mrphrazer force-pushed the mrphrazer:ssa branch from f34415d to 3b7c94a Apr 26, 2016

@mrphrazer

This comment has been minimized.

Copy link
Contributor

mrphrazer commented Apr 26, 2016

Okay, this worked better than expected. The items of the TODO have been finished besides the last three points I mentioned in the post above.

@@ -674,3 +674,166 @@ def possible_values(expr):
raise RuntimeError("Unsupported type for expr: %s" % type(expr))

return consvals


class ExprDissector(object):

This comment has been minimized.

@serpilliere

serpilliere Jun 4, 2016

Contributor

Hi!
I am currently reviewing the code (at last!). I have a question:
If I understand correctly the ExprDissector.dissect, it extracts sub expressions of a certain type, and stop recursion if it's found.

Why not using the Expression.visit ?
Here is an example, in which I use your regression tests:

from miasm2.expression.expression_helper import *
from miasm2.expression.expression import *

def dissect_test(expr, expr_type, result):
    if isinstance(expr, expr_type):
        result.add(expr)
        return False
    return True


def dissect_visit(expr, expr_type):
    result = set()
    expr.visit(lambda expr:expr,
               lambda expr:dissect_test(expr, expr_type, result))
    return result



# define expressions
cf = ExprId('cf', size=1)
rbp = ExprId('RBP', size=64)
rdx = ExprId('RDX', size=64)
int1 = ExprInt(0xfffffffffffffffc, 64)
int2 = ExprInt(0xffffffce, 32)
int3 = ExprInt(0x32, 32)
compose1 = ExprCompose([(int2, 0, 32), (int3, 32, 64)])
op1 = rbp + int1 + compose1 + rdx
op2 = int2 + int3
mem1 = ExprMem(op1, 32)
mem2 = ExprMem(op2, 32)
cond1 = ExprCond(mem1, int2, int3)
slice1 = ExprSlice(mem1 + mem2 + cond1, 31, 32)
aff1 = ExprAff(cf, slice1)




assert (dissect_visit(cond1, ExprOp) == {op1})
assert (dissect_visit(mem2, ExprOp) == {op2})
assert (dissect_visit(aff1, ExprSlice) == {slice1})
assert (dissect_visit(aff1, ExprCond) == {cond1})
assert (dissect_visit(aff1, ExprCompose) == {compose1})
assert (dissect_visit(aff1, ExprMem) == {mem1, mem2})
assert (dissect_visit(aff1, ExprAff) == {aff1})
assert (dissect_visit(aff1, ExprInt) == {int1, int2, int3})
assert (dissect_visit(aff1, ExprId) == {cf, rbp, rdx})

The trick is not to use the visitor callback, but the test function itself to get the results. It allow to stop recursion if the correct type is found., as you do in your dissector.
Is this correct, or am I missing something?

This comment has been minimized.

@mrphrazer

mrphrazer Jun 4, 2016

Contributor

Hi!

Neat trick, I did not know that :) I am not definitively sure, but I see two problems:

  1. It is recursive and uses lambda expressions. Where possible, I try to avoid this in python since it cannot be optimised and has large overheads ; if expressions are nested (or large), time and memory usage will explode.
  2. The main reason: Parsing standard subexpressions is a nice feature, but not really used that much within this SSA implementation. However, you can add additional filter criteria such as in variables (which is often used in SSA).

At least, I think it makes sense to have a class as ExprDissector in which you can define arbitrary filters (e.g., parsing memory expressions that depend on the stack pointer). Do you agree?

@serpilliere

This comment has been minimized.

Copy link
Contributor

serpilliere commented Jun 5, 2016

I totally agree with you for the recursive vs worklist. But in this case, (tell me if I am wrong) the parsed expressions are limited to expressions which come from the IR of instruction, which are in most case really simple. (this is different from fat expressions resulted from a symbolic emulation for example)

The feature is indeed interesting
@commial, any thoughts?

@mrphrazer

This comment has been minimized.

Copy link
Contributor

mrphrazer commented Jun 5, 2016

If you operate on assembly-transformed code, you are absolutely right: the translated instructions are really simple. Therefore, it will not make much difference in this case.

However, I was thinking about more generalised use cases. We can also perform SSA-based analysis of custom IRA graphs. For instance, you might create basic block summaries (e.g., via symbolic execution of the basic block) and replace the basic blocks with their summaries. In addition, you might subclass SSA and extend it with custom expressions that constraint some characteristics. In these cases, the worklist approach might be an advantage.

@commial

This comment has been minimized.

Copy link
Member

commial commented Jun 29, 2016

Well, currently with Miasm, expression are very often handled with recursive methods.
For instance, you can take expression simplification (and then symbolic execution), translators (and then the emulation part) and, even, expression display (through __str__, __repr__).

So, as the visitor paradigm seems easier and less prone to error, my vote go for it. There is no gain to handle iteratively expression that you can't use after.

This is a known limitation of Miasm (undocumented 😗), expression can't exceed a depth equals to Python recursive limit. We don't need it for now, maybe in the future, but it will be associated with a pretty big patch.

(Sorry for the delay, I didn't notice my opinion was expected. Your feature is actually very cool, so I expect it will be merge in a near future)

@commial

This comment has been minimized.

Copy link
Member

commial commented Nov 25, 2018

The feature has been added in the last release. Thanks again!

@commial commial closed this Nov 25, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment