
Fcd does not handle tail calls #34

Open
fay59 opened this issue Dec 12, 2016 · 4 comments

fay59 (Owner) commented Dec 12, 2016

Functions that use the tail call optimization end with a jmp instruction instead of a ret instruction. Because of that, fcd inlines the target function instead of calling it and returning its result.

mewmew commented Apr 18, 2017

Hi Félix,

Thanks a lot for working on fcd! I'm very happy that you've decided to do small write-ups on your decompiler development adventures, where you document your progress, ideas, problems, setbacks, and major wins at http://zneak.github.io/fcd/

They have been a good source of inspiration!

If you haven't had a chance to see it yet, I'd very warmly recommend watching the talk Alessandro Di Federico presented at an LLVM developer conference in late 2016, where he introduced rev.ng, which turns machine code into LLVM IR using QEMU IR as an intermediate step in the translation.

Using QEMU is interesting, and it comes with its own set of benefits and drawbacks. What rev.ng does particularly well is (sometimes non-trivial) control flow analysis, by propagating constraints on register values (similar to what you do during type analysis).

Related to this issue, rev.ng identifies functions and tail calls in a generic way: it first collects candidate function addresses, and later prunes these candidates based on a set of constraints (a single entry basic block, if I recall correctly, etc.). A tail call can then be identified as a jump from a function that lands past other known functions; in other words, a jump to an address that is not part of the caller function's address region.
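
To make the idea concrete, here is a minimal Go sketch of that scheme, with made-up types and names (Addr, Func, PruneCandidates, IsTailCallJump); it only illustrates the idea and is not rev.ng's actual code:

package region

// Addr is a virtual address in the analyzed binary.
type Addr uint64

// Func is a recovered function, viewed as a half-open address region [Entry, End).
type Func struct {
	Entry Addr
	End   Addr
}

// PruneCandidates keeps only the candidate entry addresses that satisfy the
// given constraint (e.g. "the code at this address starts a single-entry
// basic block", standing in for whatever constraint set rev.ng really uses).
func PruneCandidates(candidates []Addr, ok func(Addr) bool) []Addr {
	var kept []Addr
	for _, c := range candidates {
		if ok(c) {
			kept = append(kept, c)
		}
	}
	return kept
}

// IsTailCallJump reports whether a jump issued from within caller targets an
// address outside the caller's own region, which is the criterion described
// above for recognizing a tail call.
func IsTailCallJump(caller Func, target Addr) bool {
	return target < caller.Entry || target >= caller.End
}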

I've been playing around with some of these concepts the last couple of days, in a toy experiment to lift 32-bit x86 assembly to LLVM IR.

A relevant extract from the isTailCall function at https://github.com/decomp/exp/blob/master/cmd/bin2ll/x86.go#L204:

// isTailCall reports whether the given instruction is a tail call instruction.
func (d *disassembler) isTailCall(funcEntry bin.Address, inst *instruction) bool {
...
		if funcEntry <= target && target < funcEnd {
			// Target inside function.
			return false
		}

The target address of a JMP instruction is compared against the known function start address (the entry basic block) and function end address (the last address of the function that is not part of succeeding data, e.g. switch jump tables, or of other functions).

I hope this gives you some ideas on how tail calls may be identified, and that you end up incorporating the useful parts and finding yet other ways to identify them.

Wish you all the best.

Happy reversing!

Cheers /u

fay59 (Owner, Author) commented Apr 19, 2017

Alessandro reached out to me a few months ago, after his talk, so I saw it. :)

The problem is not so much the idea as the time and interest that I have to put into this. Fcd's disassembler is very crude, and that's what's holding up the fix. It's one of the last components that has stayed almost identical since the beginning of the project, when I had a three-month deadline to produce something that worked and had to cut a few (many) corners to meet it.

Right now, the disassembler performs function identification, control flow reconstruction, and lifting all in a single pass. Because of that, it misses out on a lot of information that could be gathered if each of these tasks were a separate step. It would be much easier to handle tail calls and jump tables with a better disassembler.

The other thing is that I'm not sure that I want to get into the disassembler business. It's an extremely difficult problem and some people already do it much better. I'm floating ideas about using disassembly input from radare, Binary Ninja, Hopper, IDA, or any other useful source.

Once this problem is solved, identifying tail calls is straightforward, I think. You would just need to check whether the destination is a known function's entry point. You could also make assumptions based on the state of the stack frame, since a tail call pops its stack frame just like a normal return would. I don't really like the address range check (because functions don't have to be contiguous, although they usually are in cooperative programs), but rev.ng's logic looks like what I'd like to do for the same task.
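
A minimal sketch of what I have in mind, with hypothetical names (knownEntries, and stack heights assumed to come from some earlier analysis); it only illustrates the two checks from the previous paragraph:

package entrycheck

// Addr is a virtual address.
type Addr uint64

// looksLikeTailCall combines the two heuristics: the jump target is the entry
// point of a known function, and at the jump site the stack has been restored
// to the height it had on function entry, just as it would be right before a ret.
func looksLikeTailCall(target Addr, knownEntries map[Addr]bool, stackHeightAtJump, stackHeightAtEntry int64) bool {
	if !knownEntries[target] {
		return false
	}
	return stackHeightAtJump == stackHeightAtEntry
}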

Thanks for your interest! I love to get feedback on the stuff that I do.

mewmew commented Apr 19, 2017

The other thing is that I'm not sure that I want to get into the disassembler business. It's an extremely difficult problem and some people already do it much better. I'm floating ideas about using disassembly input from radare, Binary Ninja, Hopper, IDA, or any other useful source.

Understandably so. This is the path that remill and MC-Semantics have taken, using primarily IDA and Binary Ninja for now.

Once this problem is solved, identifying tail calls is straightforward, I think.

I would agree, for the most part. Identifying tail calls to static addresses is straightforward. Tail calls to dynamic addresses (e.g. jmp eax) will still require a substantial amount of work to get right, probably relying on information from symbolic execution and constraint propagation.
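
That said, the very simplest dynamic case, where the jump register is loaded with a constant earlier in the same basic block, can be handled without full symbolic execution. A toy Go sketch with a made-up Inst type (not bin2ll's actual instruction representation):

package dynjump

// Inst is a heavily simplified instruction: a mnemonic, an optional
// destination register, and an optional immediate operand.
type Inst struct {
	Mnemonic string // e.g. "mov", "jmp"
	DestReg  string // destination register, e.g. "eax"; empty if none
	Imm      uint64 // immediate operand, if any
	HasImm   bool
}

// resolveDynamicJumpTarget walks backwards from the "jmp reg" instruction at
// index jmpIdx and returns the constant target if the register was last
// defined by a "mov reg, imm" in the same basic block.
func resolveDynamicJumpTarget(block []Inst, jmpIdx int, reg string) (uint64, bool) {
	for i := jmpIdx - 1; i >= 0; i-- {
		inst := block[i]
		if inst.DestReg != reg {
			continue
		}
		if inst.Mnemonic == "mov" && inst.HasImm {
			return inst.Imm, true
		}
		// The register is redefined by something this sketch cannot model; give up.
		return 0, false
	}
	return 0, false
}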

I don't really like the address range check (because functions don't have to be contiguous, although they usually are in cooperative programs), but rev.ng's logic looks like what I'd like to do for the same task.

Fair point. The preliminary solution to this problem in bin2ll is to track function chunks, using a hash map from basic block address to function entry point. It works quite well for now and adds support for non-contiguous functions as well.
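
A minimal sketch of that chunk map, with made-up names; this is not bin2ll's actual implementation:

package chunks

// Addr is a virtual address.
type Addr uint64

// Owner maps the address of each known basic block (chunk) to the entry point
// of the function that owns it, so ownership no longer depends on the
// function being laid out contiguously.
type Owner map[Addr]Addr

// SameFunction reports whether a jump from a block owned by fromEntry to the
// block at target stays within the same (possibly non-contiguous) function.
func (o Owner) SameFunction(fromEntry, target Addr) bool {
	owner, ok := o[target]
	return ok && owner == fromEntry
}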

Thanks for your interest! I love to get feedback on the stuff that I do.

Very happy to. It's great to see so many different approaches to a somewhat tricky problem : )

mewmew commented Apr 19, 2017

For reference, bin2ll also relies on IDA for now for the discovery of function addresses, basic block addresses, and function chunk information. The dependency is not tied directly to IDA: a tool, lst2json, parses the relevant information (using regular expressions on the foo.lst files produced by IDA) to produce JSON input for bin2ll. In the future, support for Binary Ninja and potentially radare will be added, if the bin2ll experiment turns out fruitful.
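
Purely as an illustration, the kind of JSON such a pipeline might exchange could be decoded with a couple of structs like the following; the field names are made up and are not lst2json's actual output format:

package funcinfo

import "encoding/json"

// FuncInfo describes one function recovered by the external disassembler:
// its entry point and the addresses of the basic blocks (chunks) it owns.
type FuncInfo struct {
	Entry  uint64   `json:"entry"`
	Blocks []uint64 `json:"blocks"`
}

// Parse decodes a JSON array of FuncInfo records, standing in for reading
// the disassembler's exported output.
func Parse(data []byte) ([]FuncInfo, error) {
	var funcs []FuncInfo
	if err := json.Unmarshal(data, &funcs); err != nil {
		return nil, err
	}
	return funcs, nil
}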
