design: Add final version of the Control Flow Analysis section. Fixes #…
mewmew committed Apr 14, 2015
1 parent eda8cf4 commit 52bf596
Showing 9 changed files with 13 additions and 34 deletions.
2 changes: 1 addition & 1 deletion sections/6_design/3_middle-end_components.tex
@@ -9,7 +9,7 @@
\subsection{Middle-end Components}
\label{sec:middle-end_components}

The middle-end module is responsible for lifting the low-level IR generated by the front-end to a higher level. This is achieved through a set of decompilation stages, which identify high-level control flow primitives and, as a future ambition, propagate expressions. The former decompilation stage consists of two self-contained components which separate concerns related to the control flow analysis from the underlying details of LLVM IR. The first component generates unstructured CFGs from LLVM IR, as further described in section \ref{sec:design_control_flow_graph_generation}. And the second component structures the generated CFGs by identifying high-level control flow primitives, as further described in section \ref{sec:design_control_flow_analysis}. The interaction between the front-end, the \texttt{ll2dot} and \texttt{restructure} tools of the middle-end and the back-end is illustrated in figure \ref{fig:middle-end}.
The middle-end module is responsible for lifting the low-level IR generated by the front-end to a higher level. This is achieved through a set of decompilation stages, which identify high-level control flow primitives and, as a future ambition, propagate expressions. The former decompilation stage consists of two self-contained components which separate concerns related to the control flow analysis from the underlying details of LLVM IR. The first component generates unstructured CFGs from LLVM IR, as further described in section \ref{sec:design_control_flow_graph_generation}. The second component structures the generated CFGs by identifying high-level control flow primitives, as further described in section \ref{sec:design_control_flow_analysis}. The interaction between the front-end, the \texttt{ll2dot} (see section \ref{sec:design_control_flow_graph_generation}) and \texttt{restructure} (see section \ref{sec:design_control_flow_analysis}) tools of the middle-end and the back-end is illustrated in figure \ref{fig:middle-end}.

\begin{figure}[htbp]
\begin{center}
@@ -16,9 +16,8 @@ \subsubsection{Control Flow Graph Generation}
\begin{subfigure}[ht]{0.22\textwidth}
\includegraphics[width=\textwidth]{inc/cfg_gen_example.png}
\end{subfigure}
\caption{The return instructions of basic block \texttt{bar} and \texttt{baz} produces no directed edges, while the conditional branch instruction of basic block \texttt{foo} produces two directed edges, one for each target branch (i.e. \texttt{bar} and \texttt{baz}).}
\caption{The return instructions of basic blocks \texttt{bar} and \texttt{baz} produce no directed edges, while the conditional branch instruction of basic block \texttt{foo} produces two directed edges in the CFG, one for each target branch (i.e. \texttt{bar} and \texttt{baz}).}
\label{fig:cfg_gen_example}
\end{figure}


The \texttt{ll2dot} tool generates CFGs from LLVM IR in the DOT file format, which is a well-defined textual representation of graphs used by the Graphviz project. One benefit of expressing CFGs in this format is that the existing Graphviz tools may be used to produce image representations of the CFGs, as demonstrated in appendix \ref{app:control_flow_graph_generation_example}.
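The mapping from terminator instructions to directed edges can be illustrated with a minimal Python sketch. It is not the \texttt{ll2dot} implementation: the \texttt{blocks} dictionary is a hypothetical, simplified stand-in for parsed LLVM IR in which each basic block name maps to the branch targets of its terminator, and \texttt{cfg\_to\_dot} is a name invented for this example.

```python
# Hypothetical, simplified model of a function: each basic block name maps to
# the list of branch targets of its terminator (empty for a return instruction).
# Block names are assumed to be valid DOT identifiers.
def cfg_to_dot(name, blocks):
    """Render a CFG as DOT text: one node per block, one edge per branch target."""
    lines = [f"digraph {name} {{"]
    for block in blocks:
        lines.append(f"    {block}")
    for block, targets in blocks.items():
        for target in targets:
            lines.append(f"    {block} -> {target}")
    lines.append("}")
    return "\n".join(lines)


# foo ends in a conditional branch to bar and baz; bar and baz both return.
blocks = {"foo": ["bar", "baz"], "bar": [], "baz": []}
dot = cfg_to_dot("example", blocks)
print(dot)
```

The resulting text can be passed to the Graphviz \texttt{dot} tool to render an image of the graph.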
@@ -3,7 +3,7 @@
\subsubsection{Control Flow Analysis}
\label{sec:design_control_flow_analysis}

The control flow analysis component produces structured CFGs (in JSON format) from unstructured CFGs (in the DOT file format), by identifying high-level control flow primitives within the graph. The control flow structure is recorded by relating each node of the CFG to a specific node within the graph representation of a high-level control flow primitives, as illustrated in figure \ref{fig:representation_and_identification_of_primitive}.
The key idea behind the control flow analysis (see section \ref{sec:control_flow_analysis}) is that high-level control flow primitives may be represented using directed graphs. The problem of structuring low-level code may therefore be rephrased as the problem of identifying subgraphs (e.g. the graph representations of high-level control flow primitives) in graphs (e.g. the CFGs of low-level code) without considering node names, as illustrated in figure \ref{fig:representation_and_identification_of_primitive}. This problem is generally referred to as \textit{subgraph isomorphism search} and has been well studied \cite{subgraph_isomorphism_algorithms}. Rephrasing the problem in this manner aligns with the design principle of giving each component access to the least amount of information required to successfully accomplish its task. The control flow analysis component is only given access to control flow information (e.g. CFGs), and is oblivious to the underlying LLVM IR. This enables the component to be reused as-is when analyzing the control flow of other languages, such as REIL.

\begin{figure}[htbp]
\centering
@@ -15,31 +15,14 @@ \subsubsection{Control Flow Analysis}
\begin{subfigure}[ht]{0.18\textwidth}
\includegraphics[width=\textwidth]{poster/inc/foo.png}
\end{subfigure}
\caption{The left side contains the pseudo-code (top left) and graph representation (bottom left) of an if-statement; if \texttt{A} is true then do \texttt{B} followed by \texttt{C}, otherwise do \texttt{C}. The right side highlights (in red) an identified isomorphism of the if-statement's graph representation in the CFG of a simple function.}
\caption{The left side contains the pseudo-code (top left) and graph representation (bottom left) of an if-statement; if \texttt{A} is true then do \texttt{B} followed by \texttt{C}, otherwise do \texttt{C}. The right side highlights (in red) an identified isomorphism of the if-statement's graph representation, in the CFG of the \texttt{main} function presented in appendix \ref{app:clang_example}.}
\label{fig:representation_and_identification_of_primitive}
\end{figure}

% The \texttt{restructure} tool searches for subgraph isomorphisms of control flow primitives in a given CFG. Once located the nodes identified subgraph are merged into a single node which is labeled with the high-level control flow primitive. Successive iterations continue to simplify the CFG until only one node is left, at which point the high-level control flow primitive has been recovered. Should the \texttt{restructure} tool fail to reduce the graph into a single node, the graph is considered irreducible with regards to the supported high-level control flow primitives.
The \texttt{restructure} tool uses subgraph isomorphism search algorithms to locate isomorphisms of the graph representations of high-level control flow primitives in the CFG of a given function. The CFG is simplified by recursively replacing the identified subgraphs with single nodes until the entire CFG has been reduced into a single node; a step-by-step demonstration of which is presented in appendix \ref{app:control_flow_analysis_example}. By recording the node names of the identified subgraph isomorphisms and the names of their corresponding high-level control flow primitives, a structured CFG may be produced in which all nodes are known to belong to a high-level control flow primitive, as demonstrated in appendix \ref{app:restructure_example}.

%TODO: Add: A data-driven design separates data from source code to facilitate the extensibility of components.
The pseudo-code and graph representations of the supported high-level control flow primitives are presented in figure \ref{fig:graph_representations} of section \ref{sec:control_flow_analysis}. Should the control flow analysis fail to reduce a CFG into a single node, the CFG is considered irreducible with regard to the supported high-level control flow primitives, in which case a structured CFG cannot be produced.
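The reduction described above can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the algorithm used by \texttt{restructure}: the \texttt{PRIMITIVES} table, the explicit \texttt{entry} and \texttt{exit} fields, the matching rules (edges between matched nodes must coincide with the primitive's edges, only the entry may be reached from outside, only the exit may leave) and all function names are assumptions made for this example. Only two primitives are included, and the larger one is tried first.

```python
from itertools import permutations

# Hypothetical primitive descriptors: an edge set plus named entry and exit nodes.
PRIMITIVES = {
    "if": {"edges": {("A", "B"), ("A", "C"), ("B", "C")}, "entry": "A", "exit": "C"},
    "list": {"edges": {("A", "B")}, "entry": "A", "exit": "B"},
}


def find_match(nodes, edges, prim):
    """Return a mapping from primitive nodes to CFG nodes, or None if none exists."""
    pnodes = sorted({n for edge in prim["edges"] for n in edge})
    for cand in permutations(sorted(nodes), len(pnodes)):
        m = dict(zip(pnodes, cand))
        matched = set(cand)
        # Edges between matched CFG nodes must coincide with the primitive's edges.
        if any(((u, v) in prim["edges"]) != ((m[u], m[v]) in edges)
               for u in pnodes for v in pnodes):
            continue
        entry, exit_ = m[prim["entry"]], m[prim["exit"]]
        # Only the entry may have outside predecessors; only the exit outside successors.
        if all((src in matched or dst == entry) and (dst in matched or src == exit_)
               for src, dst in edges if src in matched or dst in matched):
            return m
    return None


def merge(nodes, edges, m, name):
    """Replace the matched nodes with a single node called name."""
    matched = set(m.values())
    rename = lambda n: name if n in matched else n
    new_edges = {(rename(s), rename(d)) for s, d in edges
                 if not (s in matched and d in matched)}
    return (nodes - matched) | {name}, new_edges


def restructure(nodes, edges):
    """Reduce the CFG to one node; return the recorded matches, or None if irreducible."""
    nodes, edges, found = set(nodes), set(edges), []
    while len(nodes) > 1:
        for prim_name, prim in PRIMITIVES.items():
            m = find_match(nodes, edges, prim)
            if m is not None:
                merged = f"{prim_name}{len(found)}"
                found.append((prim_name, m))
                nodes, edges = merge(nodes, edges, m, merged)
                break
        else:
            return None  # no supported primitive matches: irreducible
    return found


cfg_nodes = {"entry", "check", "then", "done"}
cfg_edges = {("entry", "check"), ("check", "then"), ("check", "done"), ("then", "done")}
result = restructure(cfg_nodes, cfg_edges)
print(result)
```

In this example the if-statement is identified first and collapsed into a single node, after which the remaining two nodes form a sequence. A two-node cycle, by contrast, matches neither primitive and is reported as irreducible.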

% TODO: Refer back to "each with access to the least amount of information required to successfully accomplish its task". Working on CFG, oblivious to the presence of LLVM IR or instructions for that matter.
The \texttt{restructure} tool relies entirely on subgraph isomorphism search to produce structured CFGs (in JSON format) from unstructured CFGs (in the DOT file format). The supported high-level control flow primitives are defined using DOT files, thus promoting a data-driven design which separates data regarding the primitives from the implementation of the \texttt{restructure} tool. A major benefit of this approach is that the \texttt{restructure} tool may, without modification, search for any high-level control flow primitive that can be expressed in the DOT file format.
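To illustrate what defining a primitive as data can look like, the following Python sketch reads the edges of an if-statement primitive from DOT text. The DOT snippet is an assumed descriptor written for this example, and \texttt{parse\_edges} is a hypothetical helper that only understands plain \texttt{a -> b} edge statements; it is not a full DOT parser and ignores attributes, edge chains and quoted identifiers.

```python
import re

# Assumed descriptor for an if-statement primitive. The descriptor files used
# by the tool may carry additional attributes that this sketch ignores.
IF_DOT = """
digraph if {
    A -> B
    A -> C
    B -> C
}
"""

# Matches simple edge statements of the form `source -> target`.
EDGE = re.compile(r"(\w+)\s*->\s*(\w+)")


def parse_edges(dot_text):
    """Extract (source, target) pairs from simple edge statements."""
    return set(EDGE.findall(dot_text))


edges = parse_edges(IF_DOT)
print(sorted(edges))
```

Because the primitive lives in a text file rather than in the program, supporting a new fixed-size primitive only requires adding another descriptor.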

% * Data-driven Design (potential and limitations)
% TODO: Mention: CFG invariants (e.g. single-entry, single-exit)

% TODO: Add: Limitations

% TODO: Add limitations related to the design choices. Which limitations are easily solvable given more time and which are fundamentally part of the design.
% - No support for n-way conditionals (e.g.switch-statements). Potential solution: Domain Specific Language; e.g. switch could be written as:
%
% digraph switch {
% A -> B [label="expr='n > 2'; label='case %d'"]
% A -> C
% B -> C
% }
% - No support for inf-loops (because of the CFG invariant). Potential solution: relax invariant to support control flow primitive descriptors which only specify an entry node? These control flow primitives would have to be marked as no-exit.

% - Unstructured CFG -> Structured CFG ([restructure](http://decomp.org/x/graphs/cmd/restructure).)

%The control flow analysis component structures low-level code by identifying high-level control flow primitives in the CFGs generated from LLVM IR (see section \ref{sec:design_control_flow_graph_generation}). As demonstrated in figure \ref{fig:representation_and_identification_of_primitive}, high-level control flow primitives may be represented using graphs. The problem of structuring low-level code may therefore be reformulated as the problem of locating subgraphs (e.g. the graph representations of control flow primitives) in graphs (e.g. the CFGs of low-level code) without considering node names. This problem is generally referred to as \textit{subgraph isomorphism search} and has been well studied \cite{subgraph_isomorphism_algorithms}.
One limitation of this approach is that it does not support graph representations of high-level control flow primitives with a variable number of nodes, as they cannot be described in the DOT file format. For this reason, the \texttt{restructure} tool does not support the recovery of n-way conditionals (e.g. \texttt{switch}-statements). Furthermore, the current design enforces a single-entry/single-exit invariant on the graph representation of high-level control flow primitives. This prevents the recovery of infinite loops, as their graph representation has no exit node. Section \ref{sec:design_validation} discusses how these issues may be mitigated in the future.
@@ -1,6 +1,7 @@
% ~~~ [ Code Generation ] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

\subsubsection{Code Generation}
\label{sec:design_code_generation}

% - Structured CFG -> Go ([ll2go](http://decomp.org/x/cmd/ll2go))
% + Truthfully `ll2go` does not make direct use of `restructure` but rather the graph libraries.
4 changes: 0 additions & 4 deletions sections/7_implementation/6_control_flow_analysis_tool.tex
@@ -3,10 +3,6 @@
\subsection{Control Flow Analysis Tool}
\label{sec:impl_control_flow_analysis_tool}

The control flow analysis stage uses subgraph isomorphism search algorithms to locate isomorphisms of the graph representations of high-level control flow primitives in the CFG of a given function, as further described in section \ref{sec:design_control_flow_analysis}. The CFG is simplified by recursively replacing the identified subgraphs with single nodes until the entire CFG has been reduced into a single node. By recoding the node names of the identified subgraph isomorphisms and the name of their corresponding high-level control flow primitives, a structured CFG may be produced in which all nodes are known to belong to a high-level control flow primitive.

The pseudo-code and graph representations of the supported high-level control flow primitives are presented in figure \ref{fig:graph_representations} of section \ref{sec:control_flow_analysis}. Should the control flow analysis fail to reduce a CFG into a single node, the CFG is considered irreducible with regards to the supported high-level control flow primitives; in which case a structured CFG cannot be produced.

% TODO: Rephrase the following paragraph so it fits with the preceding paragraphs.

The \texttt{restructure} tool provides an implementation of the control flow analysis component described in section \ref{sec:design_control_flow_analysis}. It structures CFGs by using the subgraph isomorphism search library (see section \ref{sec:subgraph_isomorphism_search_algorithm}) to identify high-level control flow primitives in a fashion similar to that demonstrated in appendix \ref{app:control_flow_analysis_example}. One problem with the demonstrated approach is that it may fail to reduce CFGs if smaller subgraphs are replaced before larger subgraphs. This is because subgraph isomorphisms of the smaller subgraphs may exist in the larger subgraphs, as is the case with .
@@ -3,7 +3,7 @@
\subsection{Control Flow Graph Generation Example}
\label{app:control_flow_graph_generation_example}

The \texttt{ll2dot} tool generates CFGs (in the DOT file format) from LLVM IR assembly files, as described in section \ref{sec:control_flow_graph_generation_tool}. Using the \texttt{ll2dot} tool, a CFG was generated from the \texttt{main} function of the LLVM IR assembly in listing \ref{lst:example1_ll}. A textual representation and an image representation of the generated CFG are presented on the left and the right side of figure \ref{fig:example1_cfg} respectively. The image representation was generated using the Graphviz \texttt{dot}\footnote{Drawing Graphs with dot: \url{http://www.graphviz.org/pdf/dotguide.pdf}} tool.
The \texttt{ll2dot} tool generates CFGs (in the DOT file format) from LLVM IR assembly files, as described in section \ref{sec:design_control_flow_graph_generation}. Using the \texttt{ll2dot} tool, a CFG was generated from the \texttt{main} function of the LLVM IR assembly in listing \ref{lst:example1_ll}. A textual representation and an image representation of the generated CFG are presented on the left and the right side of figure \ref{fig:example1_cfg} respectively. The image representation was generated using the Graphviz \texttt{dot}\footnote{Drawing Graphs with dot: \url{http://www.graphviz.org/pdf/dotguide.pdf}} tool.

\begin{figure}[htbp]
\centering
2 changes: 1 addition & 1 deletion sections/appendices/h_control_flow_analysis_example.tex
@@ -3,7 +3,7 @@
\subsection{Control Flow Analysis Example}
\label{app:control_flow_analysis_example}

This section provides a step-by-step demonstration of how the control flow analysis is conducted by analysing the \texttt{stmt} function of the c4\footnote{C in four functions: \url{https://github.com/rswier/c4}} compiler. For a detailed description of the control flow analysis stage, please refer to section \ref{sec:impl_control_flow_analysis_tool}. The control flow analysis operates exclusively on CFGs, which are generated by a set of components prior to the control flow analysis stage. Firstly, the C source code of the c4 compiler is translated into LLVM IR by the Clang compiler of the front-end. Secondly, the LLVM IR is optionally optimized by the \texttt{opt} tool of the LLVM compiler framework. Lastly, a CFG is generated for each function of the LLVM IR using the \texttt{ll2dot} tool. For this demonstration, the CFG of the \texttt{stmt} function is the starting point of the control flow analysis stage.
This section provides a step-by-step demonstration of how the control flow analysis is conducted by analysing the \texttt{stmt} function of the c4\footnote{C in four functions: \url{https://github.com/rswier/c4}} compiler. For a detailed description of the control flow analysis stage, please refer to section \ref{sec:design_control_flow_analysis}. The control flow analysis operates exclusively on CFGs, which are generated by a set of components prior to the control flow analysis stage. Firstly, the C source code of the c4 compiler is translated into LLVM IR by the Clang compiler of the front-end. Secondly, the LLVM IR is optionally optimized by the \texttt{opt} tool of the LLVM compiler framework. Lastly, a CFG is generated for each function of the LLVM IR using the \texttt{ll2dot} tool. For this demonstration, the CFG of the \texttt{stmt} function is the starting point of the control flow analysis stage.

The first step of the control flow analysis recursively locates subgraph isomorphisms of the graph representation of pre-test loops (see figure \ref{fig:pre_test_graph_representation}) in the original CFG of the \texttt{stmt} function, and replaces these subgraphs with single nodes as illustrated in figure \ref{fig:step_1}.

2 changes: 1 addition & 1 deletion sections/appendices/i_restructure_example.tex
@@ -3,6 +3,6 @@
\subsection{Restructure Example}
\label{app:restructure_example}

The \texttt{restructure} tool produces structured CFGs (in JSON format) from unstructured CFGs (in the DOT file format), as described in section \ref{sec:impl_control_flow_analysis_tool}. Listing \ref{lst:restructure_output} demonstrates the output of the restructure tool when analysing the CFG of the \texttt{main} function presented in figure \ref{fig:example1_cfg}.
The \texttt{restructure} tool produces structured CFGs (in JSON format) from unstructured CFGs (in the DOT file format), as described in section \ref{sec:design_control_flow_analysis}. Listing \ref{lst:restructure_output} demonstrates the output of the restructure tool when analysing the CFG of the \texttt{main} function presented in figure \ref{fig:example1_cfg}.

\lstinputlisting[language=go, style=go, caption={The structured control flow graph (in JSON format) produced by the \texttt{restructure} tool when analysing the CFG of the \texttt{main} function presented in figure \ref{fig:example1_cfg}. \label{lst:restructure_output}}]{appendices/restructure_example/example1_graphs/main.json}
2 changes: 1 addition & 1 deletion sections/appendices/j_code_generation_example.tex
@@ -2,6 +2,6 @@

\subsection{Code Generation Example}

The \texttt{ll2go} tool translates LLVM IR assembly into unpolished Go source code, as described in section \ref{sec:impl_code_generation_tool}. Using the \texttt{ll2go} tool, the LLVM IR assembly of listing \ref{lst:example1_ll} was translated into the unpolished Go source code presented in listing \ref{lst:example1_go}. Please note that the \texttt{ll2go} tool produces unpolished Go source code which may not compile, as it does not follow Go conventions for program status codes and may include undeclared identifiers. Appendix \ref{app:post-processing_example} demonstrates how the post-processing stage may improve the quality of the unpolished Go source code.
The \texttt{ll2go} tool translates LLVM IR assembly into unpolished Go source code, as described in section \ref{sec:design_code_generation}. Using the \texttt{ll2go} tool, the LLVM IR assembly of listing \ref{lst:example1_ll} was translated into the unpolished Go source code presented in listing \ref{lst:example1_go}. Please note that the \texttt{ll2go} tool produces unpolished Go source code which may not compile, as it does not follow Go conventions for program status codes and may include undeclared identifiers. Appendix \ref{app:post-processing_example} demonstrates how the post-processing stage may improve the quality of the unpolished Go source code.

\lstinputlisting[language=go, style=go, caption={Unpolished Go source code, which was produced by the \texttt{ll2go} tool when translating the LLVM IR of listing \ref{lst:example1_ll} into Go. \label{lst:example1_go}}]{appendices/ll2go_example/example1.go}
