Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program. A program's control flow graph (CFG) is used to determine those parts of a program to which a particular value assigned to a variable might propagate.
Please read this introductory article to understand the basic terminology, concepts and common algorithms for dataflow analysis.
The Microsoft.CodeAnalysis NuGet package provides public APIs to generate a ControlFlowGraph, with low-level IOperation nodes as the statements/instructions within each basic block. See more details.
We have built a dataflow analysis framework based on the above CFG API in this repo. Additionally, we have implemented certain well-known dataflow analyses on top of this framework. This enables you to implement either or both of the following:
- Write dataflow based analyzers which consume the analysis result from these well-known analyses.
- Write your own custom dataflow analyses, which can optionally consume analysis results from these well-known analyses.
Let us start by listing out the most important concepts and data types in our framework.
- DataflowAnalysis: Base type for all dataflow analyses on a control flow graph. It uses a worklist-based approach to flow abstract data values across the basic blocks until a fixed point is reached.
- AbstractDataFlowAnalysisContext: Base type for analysis contexts for execution of DataFlowAnalysis on a control flow graph. It is the primary input to the core dataflow analysis computation routine DataFlowAnalysis.Run, and includes inputs such as the CFG, the owning symbol, etc.
- DataFlowAnalysisResult: Result from execution of DataFlowAnalysis on a control flow graph. It stores:
  - Analysis values for all operations in the graph,
  - An AbstractBlockAnalysisResult for every basic block in the graph, and
  - Merged analysis state for all the unhandled throw operations in the graph.
- AbstractBlockAnalysisResult: Common base type for the result from execution of DataFlowAnalysis on a basic block.
- AbstractDomain: Abstract domain for DataFlowAnalysis to merge and compare values across different control flow paths. The primary abstract domains of interest are:
  - AbstractValueDomain: Abstract value domain for a DataFlowAnalysis to merge and compare individual dataflow analysis values.
  - AbstractAnalysisDomain: Abstract analysis domain for a DataFlowAnalysis to merge and compare entire analysis data sets or dictionaries of analysis values.
  Each dataflow analysis must define its own value domain and analysis domain. These domains are used by DataFlowAnalysis to perform the following primary operations:
  - Merge individual analysis values and analysis sets at various program points in the graph, as well as at the start of basic blocks that have more than one incoming control flow branch.
  - Compare analysis values at the same program point/basic block across different flow analysis iterations to determine if the algorithm has reached a fixed point and can be terminated.
- DataFlowOperationVisitor: Operation visitor to flow the abstract dataflow analysis values across a given statement (IOperation) in a basic block or a given control flow branch. The operation visitor essentially defines the transfer functions for analysis values.
- AnalysisEntity: Primary entity for which analysis data is tracked by the majority of dataflow analyses. The entity is based on one or more of the following:
  - An ISymbol,
  - One or more AbstractIndex indices to index into the parent entity (for example, an index into an array or collection),
  - The "this" or "Me" instance, or
  - An allocation or an object creation.

  Each entity has:
  - An associated non-null "Type",
  - A non-null "InstanceLocation" indicating the abstract location at which the entity is located, and
  - An optional parent entity, if this entity has the same "InstanceLocation" as the parent (i.e. the parent is a value type allocated on the stack).
- AbstractLocation: Represents an abstract analysis location. This may be used to represent a location where an AnalysisEntity resides, i.e. AnalysisEntity.InstanceLocation, or a location that is pointed to by a reference type variable and tracked with PointsToAnalysis. An abstract location can be created for one of the following cases:
  - An allocation or an object creation operation.
  - The implicit 'this' or 'Me' instance being analyzed.
  - Certain symbols which do not have a declaration in executable code, i.e. no IOperation for the declaration (such as parameter symbols, member symbols, etc.).
  - Flow capture entities, i.e. for an InterproceduralCaptureId created for an IFlowCaptureOperation or IFlowCaptureReferenceOperation.
- PointsToAbstractValue: Abstract PointsTo value for an AnalysisEntity/IOperation tracked by PointsToAnalysis. It contains the set of possible AbstractLocations that the entity or the operation can point to, and the "Kind" of the location(s).
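Independent of the Roslyn APIs, the interplay between the worklist, the per-block transfer functions, and the domains' merge/compare operations can be sketched as follows. All names here are illustrative, not part of the framework; the real framework operates on ControlFlowGraph basic blocks and rich abstract values rather than integers:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for a CFG basic block.
class Block
{
    public int Id;
    public List<Block> Successors = new();
    public List<Block> Predecessors = new();
    // Per-block transfer function: plays the role of DataFlowOperationVisitor.
    public Func<int, int> Transfer = v => v;
}

class WorklistSketch
{
    // "Value domain" merge: join values flowing in from multiple control flow paths.
    static int Merge(int a, int b) => Math.Max(a, b);

    static Dictionary<Block, int> Run(List<Block> blocks)
    {
        var output = blocks.ToDictionary(b => b, _ => 0); // start every block at bottom
        var worklist = new Queue<Block>(blocks);
        while (worklist.Count > 0)
        {
            var block = worklist.Dequeue();
            // Merge analysis data from all predecessors (join at block entry).
            int input = block.Predecessors.Aggregate(0, (acc, p) => Merge(acc, output[p]));
            int result = block.Transfer(input);
            // Compare against the previous value: if unchanged, a local fixed
            // point is reached; otherwise propagate to successors.
            if (result != output[block])
            {
                output[block] = result;
                foreach (var s in block.Successors) worklist.Enqueue(s);
            }
        }
        return output;
    }

    static void Main()
    {
        // b0 -> b1, b1 -> b1 (a loop). The transfer function saturates at 5,
        // so the worklist terminates once the value stops changing.
        var b0 = new Block { Id = 0 };
        var b1 = new Block { Id = 1, Transfer = v => Math.Min(v + 1, 5) };
        b0.Successors.Add(b1); b1.Predecessors.Add(b0);
        b1.Successors.Add(b1); b1.Predecessors.Add(b1);
        Console.WriteLine(Run(new List<Block> { b0, b1 })[b1]); // prints 5
    }
}
```

The saturating transfer function is what guarantees termination here; in the real framework, the same guarantee comes from the abstract domains having finite height (or widening).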
Now that we are familiar with the basic concepts and data types for flow analysis, let us walk through an existing flow analysis implementation and a step-by-step guide to creating your own flow analysis implementation, say MyCustomAnalysis. We will use ValueContentAnalysis as the sample flow analysis implementation.
- Start by creating a new folder, say MyCustomAnalysis, within this folder in the repo.
- Add MyCustomAnalysis.cs that defines MyCustomAnalysis: Sub-type of DataFlowAnalysis that provides the public entry point TryGetOrComputeResult into your analysis. It takes a bunch of input parameters, packages them into your analysis-specific analysis context, and invokes DataFlowAnalysis.TryGetOrComputeResultForAnalysisContext to compute the analysis result. The implementation of this type should be almost identical for all analyses. Note that the type arguments for this type will be defined in subsequent steps. See ValueContentAnalysis.cs for reference.
- Add MyCustomAbstractValue.cs that defines MyCustomAbstractValue: This type defines the core analysis value that needs to be tracked by your analysis. For example, ValueContentAnalysis defines ValueContentAbstractValue, such that each instance of ValueContentAbstractValue contains a set of potential constant literal values and a ValueContainsNonLiteralState for the non-literal state of the abstract value. It also defines a bunch of static instances of common value content values, see here.
- Add MyCustomAnalysis.MyCustomAbstractValueDomain.cs that defines MyCustomAbstractValueDomain: Sub-type of AbstractValueDomain that defines how to compare and merge different MyCustomAbstractValue instances across different control flow paths or flow analysis iterations. For example, see ValueContentAbstractValueDomain.
- Define the core data structure for the global CFG-wide and/or per-block analysis data that will be tracked by your analysis, i.e. MyCustomAnalysisData. For most analyses, you likely won't need to define a separate type or source file for MyCustomAnalysisData. A likely definition would be just a using alias such as using MyCustomAnalysisData = DictionaryAnalysisData<AnalysisEntity, MyCustomAbstractValue>;. See here for such an example for DisposeAnalysisData. However, for complex cases, you may need to define your own MyCustomAnalysisData user-defined type. For example, see ValueContentAnalysisData, which has aggregated analysis data containing the core per-basic-block analysis data dictionary from AnalysisEntity to ValueContentAbstractValue, plus additional predicated analysis data.
- Define MyCustomAnalysisDomain: Sub-type of AbstractAnalysisDomain that defines how to compare and merge different MyCustomAnalysisData data sets across different control flow paths or flow analysis iterations. For most analyses, where MyCustomAnalysisData is defined as a simple dictionary, this domain will be defined with a simple using alias such as using MyCustomAnalysisDomain = MapAbstractDomain<AnalysisEntity, MyCustomAbstractValue>;. See here for such an example for DisposeAnalysisDomain. However, for complex cases, you may need to define your own MyCustomAnalysisDomain user-defined type, such as ValueContentAnalysis.CoreAnalysisDataDomain.
- Add MyCustomAnalysis.MyCustomBlockAnalysisResult.cs that defines MyCustomBlockAnalysisResult: Sub-type of AbstractBlockAnalysisResult that is the immutable per-basic block result from dataflow analysis. It is the immutable equivalent of the mutable MyCustomAnalysisData that was used during flow analysis execution. For reference, see ValueContentBlockAnalysisResult, the core immutable data exposed being ImmutableDictionary<AnalysisEntity, ValueContentAbstractValue>.
- Define MyCustomAnalysisResult: Parameterized version of DataFlowAnalysisResult with MyCustomBlockAnalysisResult and MyCustomAbstractValue as type arguments. For most cases, this will be just a simple using directive such as using MyCustomAnalysisResult = DataFlowAnalysisResult<MyCustomBlockAnalysisResult, MyCustomAbstractValue>;. For example, see ValueContentAnalysisResult.
- Add MyCustomAnalysisContext.cs that defines MyCustomAnalysisContext: Sub-type of AbstractDataFlowAnalysisContext that packages the core input parameters to dataflow analysis. Most of the code in this type is common boilerplate that is identical for all analyses. You should be able to just clone an existing file, for example ValueContentAnalysisContext.cs, and do a simple find and replace of "ValueContent" with "MyCustom" in the file.
- Add MyCustomAnalysis.MyCustomDataFlowOperationVisitor.cs that defines MyCustomDataFlowOperationVisitor: Sub-type of DataFlowOperationVisitor that contains the core operation visitor, which tracks the CurrentAnalysisData and overrides the required VisitXXX methods to define the transfer functions for how the current analysis data changes with operations, and also computes the analysis values for the overridden operations (program points in the CFG). You have three potential options for implementing this operation visitor:
  - Derive from AnalysisEntityDataFlowOperationVisitor: If your core MyCustomAnalysisData is a dictionary keyed on AnalysisEntity, then you should most likely be deriving from this type. This operation visitor is intended for all analyses which track data pertaining to symbols, which are represented by analysis entities. For example, ValueContentDataFlowOperationVisitor derives from this visitor.
  - Derive from AbstractLocationDataFlowOperationVisitor: If your core MyCustomAnalysisData is a dictionary keyed on AbstractLocation, then you should most likely be deriving from this type. This operation visitor is intended for all analyses which track data pertaining to locations/allocations, which are represented by abstract locations. For example, DisposeDataFlowOperationVisitor derives from this visitor.
  - Derive from DataFlowOperationVisitor: If neither of the above two special visitor sub-types is suitable for your analysis, you can directly sub-type the core DataFlowOperationVisitor, although you will likely have a more complicated implementation with many more overrides. Hopefully, this will not be required often.
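Putting the alias-based definitions from the steps above together, the top of your analysis files would carry something like the following (the using aliases are quoted from the steps; MyCustomAbstractValue, MyCustomBlockAnalysisResult, etc. are the hand-written sub-types described in the corresponding steps):

```csharp
// Using aliases collected from the steps above. The concrete types
// (MyCustomAbstractValue, MyCustomBlockAnalysisResult, ...) are the
// hand-written sub-types described in the corresponding steps.
using MyCustomAnalysisData = DictionaryAnalysisData<AnalysisEntity, MyCustomAbstractValue>;
using MyCustomAnalysisDomain = MapAbstractDomain<AnalysisEntity, MyCustomAbstractValue>;
using MyCustomAnalysisResult = DataFlowAnalysisResult<MyCustomBlockAnalysisResult, MyCustomAbstractValue>;
```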
Once you have implemented the above custom analysis pieces, your dataflow analyzers can invoke the MyCustomAnalysis.TryGetOrComputeResult API to get the analysis result. Your analyzer can then consume any of the below components of the analysis result:
- Analysis values for any given operation in the graph, or
- AbstractBlockAnalysisResult for any basic block in the graph, or
- Merged analysis state for all the unhandled throw operations in the graph.
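As a rough sketch, consumption from an analyzer might look like the following. Note the hedges: the exact parameter list of TryGetOrComputeResult differs per analysis and framework version (see ValueContentAnalysis.cs for a real signature), and the GetControlFlowGraph() helper on the analyzer callback context is an assumption here:

```csharp
// Illustrative sketch only; verify signatures against the current sources.
var cfg = operationBlockContext.GetControlFlowGraph(); // helper assumed available
var result = MyCustomAnalysis.TryGetOrComputeResult(
    cfg, owningSymbol /*, ...analysis-specific inputs... */);
if (result != null)
{
    // Analysis value computed for a particular operation (program point).
    MyCustomAbstractValue value = result[operation];

    // Immutable per-basic-block result.
    MyCustomBlockAnalysisResult blockResult = result[cfg.Blocks[0]];
}
```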
PERFORMANCE NOTE: Dataflow analysis is computationally quite expensive, especially for large method bodies with complex control flow and symbols that can lead to large analysis data sets and multiple flow analysis iterations and data set merges to reach a fixed point. Analyzers should make sure that they have adequate syntactic and/or semantic checks upfront to limit the cases where flow analysis is invoked.
We have some common analyses that you will likely want to consume in your custom dataflow analysis implementation or use directly in dataflow analyzers:
- PointsToAnalysis: Dataflow analysis to track locations pointed to by AnalysisEntity and IOperation instances. This is the most commonly used dataflow analysis in all our flow-based analyzers/analyses. Consider the following example:

  ```csharp
  var x = new MyClass();
  object y = x;
  var z = flag ? new MyClass() : y;
  ```

  PointsToAnalysis will compute that variables x and y have an identical non-null PointsToAbstractValue, which contains a single AbstractLocation corresponding to the first IObjectCreationOperation for new MyClass(). Variable z has a different PointsToAbstractValue, which is guaranteed to be non-null, but has two potential AbstractLocations, one for each IObjectCreationOperation in the above code.
- CopyAnalysis: Dataflow analysis to track AnalysisEntity instances that share the same value type or reference type value, determined based on CopyAbstractValueKind. Consider the following example:

  ```csharp
  var x = new MyClass();
  object y = x;
  int c1 = 0;
  int c2 = 0;
  ```

  CopyAnalysis will compute that variables x and y have an identical CopyAbstractValue with CopyAbstractValueKind.KnownReferenceCopy and two AnalysisEntity instances, one for x and one for y. Similarly, it will compute that c1 and c2 have an identical CopyAbstractValue with CopyAbstractValueKind.KnownValueCopy and two AnalysisEntity instances, one for c1 and one for c2. CopyAnalysis is currently off by default for all analyzers as it has known performance issues and needs performance tuning. It can be enabled by end users with the editorconfig option copy_analysis.
- ValueContentAnalysis: Dataflow analysis to track possible constant values that might be stored in AnalysisEntity and IOperation instances. This is essentially constant propagation for constant values stored in non-constant symbols. Consider the following example:

  ```csharp
  int c1 = 0;
  int c2 = 0;
  int c3 = c1 + c2;
  int c4 = flag ? c3 : c3 + param; // assume 'param' is a parameter for this method block with unknown value content from callers
  ```

  ValueContentAnalysis will compute that variables c1, c2 and c3 have an identical ValueContentAbstractValue with a single literal value 0 and ValueContainsNonLiteralState.No, to indicate that they cannot contain a non-literal value. It will compute that c4 has a different ValueContentAbstractValue with a single literal value 0 and ValueContainsNonLiteralState.Maybe, to indicate that it may contain some non-literal value(s) on some code path(s).
- TaintedDataAnalysis: Dataflow analysis to track the tainted state of AnalysisEntity and IOperation instances.
- PropertySetAnalysis: Dataflow analysis to track values assigned to one or more properties of an object to identify and flag incorrect/insecure object state. See its PropertySetAnalysisTests.cs for examples.
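For instance, an analyzer consuming a ValueContentAnalysis result on the snippet above could check whether an operation's value is known to be the literal 0. This is a hedged sketch: LiteralValues and NonLiteralState follow the description of ValueContentAbstractValue given earlier, but verify the exact accessors against the current sources:

```csharp
using System.Linq;

// Sketch: querying the computed value content of an operation.
ValueContentAbstractValue value = valueContentResult[operation];
bool definitelyZero =
    value.NonLiteralState == ValueContainsNonLiteralState.No &&
    value.LiteralValues.Count == 1 &&
    value.LiteralValues.First() is int i && i == 0;
```

On the example above, such a check would hold for c1, c2 and c3 but not for c4, whose NonLiteralState is Maybe.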
We also support complete context-sensitive interprocedural flow analysis for invocations of methods within the same compilation. See InterproceduralAnalysisKind for more details.

Interprocedural analysis support is baked into the core DataFlowOperationVisitor, and each custom dataflow analysis implementation gets all this support for free, without requiring any code specific to interprocedural analysis. Each dataflow analysis defines the default InterproceduralAnalysisKind in its TryGetOrComputeResult entry point, and the analyzer is free to override the interprocedural analysis kind. Interprocedural analysis almost always leads to more precise analysis results at the expense of more computation resources, i.e. it will likely take more memory and time to complete. So, an analyzer should be extremely well tuned for performance if it enables context-sensitive interprocedural analysis by default. Note that the end user can override the interprocedural analysis kind for a specific rule ID or for all dataflow rules with the editorconfig option interprocedural-analysis-kind. This option takes precedence over the defaults in the TryGetOrComputeResult entry points to analyses and also over any overrides from individual analyzers invoking this API.
We also have a couple of additional configuration/customization points for interprocedural analysis:
- InterproceduralAnalysisConfiguration: Defines interprocedural analysis configuration parameters. For example, MaxInterproceduralMethodCallChain and MaxInterproceduralLambdaOrLocalFunctionCallChain control the maximum height of the interprocedural call tree. Each analyzer can override the defaults for these chain lengths (3 as of the current implementation), and end users can override them with the editorconfig options max_interprocedural_method_call_chain and max_interprocedural_lambda_or_local_function_call_chain.
- InterproceduralAnalysisPredicate: Optional predicates that can be provided by each analyzer to determine whether interprocedural analysis should be invoked for specific callsites. For example, the predicate used by dispose analysis significantly trims down the size of interprocedural call trees and provides huge performance improvements for interprocedural analysis.
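For end users, these knobs are set from .editorconfig. The option names below are the ones mentioned above; the dotnet_code_quality prefix and exact key/value syntax are assumptions here, so consult the repo's analyzer configuration documentation for the authoritative keys:

```ini
# Hypothetical example; verify exact key syntax in the repo's configuration docs.
dotnet_code_quality.copy_analysis = true
dotnet_code_quality.interprocedural-analysis-kind = ContextSensitive
dotnet_code_quality.max_interprocedural_method_call_chain = 2
```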