Feature Request: Integrate Scalpel for Call Graph and Control/Data Flow Analysis #11

@rahlk

Description

Is your feature request related to a problem? Please describe.

Currently, codeanalyzer-python generates a basic symbol table. Call graph analysis is planned but marked as not yet implemented (--analysis-level=2). The tool therefore lacks the program flow analysis needed to understand code behavior and dependencies:

  • Call Graph Construction: Planned but not implemented, so there is no call graph analysis that handles Python's dynamic features (higher-order functions, nested definitions, dynamic calls)
  • Control Flow Graphs (CFG): No support for intra-procedural or inter-procedural control flow analysis
  • Data Flow Analysis: Missing data flow tracking capabilities for understanding how data moves through the program

These limitations prevent users from performing advanced static analysis tasks like vulnerability propagation analysis, refactoring impact assessment, and comprehensive dependency tracking.

Describe the solution you'd like

I would like to integrate specific components from the Scalpel Python Static Analysis Framework (https://github.com/SMAT-Lab/Scalpel) to enhance codeanalyzer-python with robust graph-based analysis:

1. Enhanced Analysis Levels

--analysis-level 2  # Call graph analysis (implement using Scalpel)
--analysis-level 3  # Call graph + Control flow graphs  
--analysis-level 4  # Call graph + CFG + Data flow analysis

2. New CLI Options

--call-graph         # Generate comprehensive call graphs
--control-flow       # Generate control flow graphs
--data-flow          # Perform data flow analysis
--inter-procedural   # Enable inter-procedural analysis
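To make the interaction between levels and flags concrete, here is a minimal sketch. It uses argparse as a stand-in for the project's typer CLI so that it runs with the standard library alone. The cumulative level mapping follows the list in section 1; the rule that an explicit flag adds its analysis on top of the selected level is my assumption, and the names LEVEL_ANALYSES, build_parser, and requested_analyses are hypothetical.

```python
import argparse

# Cumulative analyses per level, as proposed in "Enhanced Analysis Levels".
LEVEL_ANALYSES = {
    1: {"symbol_table"},
    2: {"symbol_table", "call_graph"},
    3: {"symbol_table", "call_graph", "control_flow"},
    4: {"symbol_table", "call_graph", "control_flow", "data_flow"},
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the proposed options (argparse stand-in for typer)."""
    parser = argparse.ArgumentParser(prog="codeanalyzer")
    parser.add_argument("--input", required=True)
    parser.add_argument("--analysis-level", type=int,
                        choices=sorted(LEVEL_ANALYSES), default=1)
    parser.add_argument("--call-graph", action="store_true")
    parser.add_argument("--control-flow", action="store_true")
    parser.add_argument("--data-flow", action="store_true")
    parser.add_argument("--inter-procedural", action="store_true")
    return parser


def requested_analyses(args: argparse.Namespace) -> set:
    """Union of what the level implies and what the flags request (assumed rule)."""
    analyses = set(LEVEL_ANALYSES[args.analysis_level])
    for flag in ("call_graph", "control_flow", "data_flow"):
        if getattr(args, flag):
            analyses.add(flag)
    return analyses
```

With this rule, `--analysis-level 2 --control-flow` would run the symbol table, call graph, and CFG passes. Whether flags should instead override the level is an open design question.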

3. Scalpel Integration Focus

Target specific Scalpel capabilities:

  • Function 8: Call Graph Construction - Handles Python's dynamic features like higher-order functions and nested definitions
  • Function 2: Control-Flow Graph Construction - Generates intra-procedural CFGs that can be combined for inter-procedural analysis
  • Function 5: Constant Propagation - Provides data flow analysis capabilities

4. Enhanced Output Schema

from typing import Dict, List

from pydantic import BaseModel

# CallNode, CallEdge, CFG, and BasicBlock are placeholder models still to be defined.

class PyCallGraph(BaseModel):
    nodes: List[CallNode]           # Function/method nodes
    edges: List[CallEdge]           # Call relationships
    entry_points: List[str]         # Program entry points

class PyControlFlowGraph(BaseModel):
    function_cfgs: Dict[str, CFG]   # Per-function CFGs
    basic_blocks: List[BasicBlock]  # Code basic blocks

class PyDataFlow(BaseModel):
    def_use_chains: Dict[str, List] # Variable definitions and uses
    reaching_definitions: Dict      # Reaching definition analysis
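As an illustration of what def_use_chains might contain, here is a rough sketch that computes def-use chains with the standard ast module. It only handles straight-line module-level code (no branches, loops, or function scopes), and the entry shape {"def": line, "uses": [lines]} is my assumption rather than part of the proposed schema. A real implementation would come from the Scalpel-backed data flow pass.

```python
import ast
from collections import defaultdict


def def_use_chains(source: str) -> dict:
    """Link each variable read to the most recent prior definition.

    Straight-line code only: statements are processed in order and
    control flow is ignored.
    """
    chains = defaultdict(list)
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads are resolved before this statement's own writes,
        # so `x = x + 1` uses the previous definition of x.
        for name in names:
            if isinstance(name.ctx, ast.Load) and chains.get(name.id):
                chains[name.id][-1]["uses"].append(name.lineno)
        for name in names:
            if isinstance(name.ctx, ast.Store):
                chains[name.id].append({"def": name.lineno, "uses": []})
    return dict(chains)


SAMPLE = "x = 1\ny = x + 2\nx = x + y\nz = x\n"
chains = def_use_chains(SAMPLE)
```

For SAMPLE, the first definition of x (line 1) is used on lines 2 and 3, and the redefinition on line 3 is used on line 4.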

Describe alternatives you've considered

1. NetworkX-based custom implementation

The project already uses NetworkX, but building CFG/call graph analysis from scratch would be time-intensive and error-prone.

2. AST-only analysis

Python's AST module provides basic structure but lacks the sophisticated analysis needed for accurate call graphs in dynamic Python code.
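A small sketch of why AST-only analysis falls short: the function below (a hypothetical helper, not part of codeanalyzer-python) records an edge whenever a function body contains a call to a plain name. On the sample it sees that apply calls something named fn, but it cannot tell that fn is bound to greet.

```python
import ast


def naive_call_edges(source: str) -> set:
    """Collect (caller, callee_name) pairs from direct calls to plain names."""
    edges = set()
    for func in ast.walk(ast.parse(source)):
        if isinstance(func, ast.FunctionDef):
            # Note: this also descends into nested function definitions.
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    edges.add((func.name, node.func.id))
    return edges


SAMPLE = '''
def greet():
    return "hi"

def apply(fn):
    return fn()

def main():
    return apply(greet)
'''

edges = naive_call_edges(SAMPLE)
```

The result contains ("main", "apply") and ("apply", "fn"), but no edge that reaches greet. Resolving the parameter fn to greet requires tracking how function values flow through assignments and arguments.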

3. Existing call graph tools

  • pycg: Good for call graphs but limited CFG support
  • code2flow: Visualization-focused, not programmatic analysis
  • vulture: Dead code detection, not comprehensive flow analysis

Additional context

Specific Scalpel Advantages for Graph Analysis

  • Call Graph: Handles Python's complex dynamic features (decorators, metaclasses, dynamic imports)
  • CFG Construction: Provides precise basic block identification and control flow edges
  • Inter-procedural Analysis: Can combine function-level CFGs into program-wide flow graphs
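To sketch what combining function-level CFGs could look like, the snippet below links per-function graphs at call sites. Everything here is a simplified assumption on my part: the dict layout, the block ids, and the link_cfgs helper are hypothetical and do not reflect Scalpel's data structures. Each call site adds an edge from the calling block to the callee's entry block and from the callee's exit block back to the return block.

```python
def link_cfgs(cfgs: dict, call_sites: list) -> list:
    """Merge per-function CFG edges and add call/return edges.

    cfgs: {function: {"entry": block_id, "exit": block_id, "edges": [(src, dst)]}}
    call_sites: [(calling_block, callee_function, return_block)]
    """
    # Start with every intra-procedural edge unchanged.
    edges = [edge for cfg in cfgs.values() for edge in cfg["edges"]]
    for calling_block, callee, return_block in call_sites:
        edges.append((calling_block, cfgs[callee]["entry"]))  # call edge
        edges.append((cfgs[callee]["exit"], return_block))    # return edge
    return edges


cfgs = {
    "main": {"entry": "main:0", "exit": "main:2",
             "edges": [("main:0", "main:1"), ("main:1", "main:2")]},
    "helper": {"entry": "helper:0", "exit": "helper:0", "edges": []},
}
program_edges = link_cfgs(cfgs, [("main:0", "helper", "main:1")])
```

The call sites themselves would come from the call graph, which is why the call graph level precedes the CFG level.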

Current Project Readiness

  • Already has placeholder for call graph analysis (--analysis-level=2)
  • Uses NetworkX for graph operations
  • Extensible CLI architecture with typer
  • Established pattern for multiple analysis backends

Implementation Plan

# New module: codeanalyzer/semantic_analysis/scalpel/
├── __init__.py
├── scalpel_analyzer.py      # Main integration class
├── call_graph_builder.py    # Scalpel call graph integration
├── cfg_builder.py           # Control flow graph integration
└── data_flow_analyzer.py    # Data flow analysis integration
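One thing scalpel_analyzer.py could do is keep Scalpel an optional dependency. The sketch below checks for the package without importing it and only requires it for levels 2 and above. The function names are hypothetical, and I am assuming the importable top-level module is named scalpel.

```python
import importlib.util


def scalpel_available() -> bool:
    """Return True if a top-level `scalpel` module can be found (assumed name)."""
    return importlib.util.find_spec("scalpel") is not None


def select_backend(analysis_level: int) -> str:
    """Pick a backend name for the requested level (hypothetical helper)."""
    if analysis_level <= 1:
        # Symbol table generation does not need Scalpel.
        return "symbol_table"
    if not scalpel_available():
        raise RuntimeError(
            f"--analysis-level {analysis_level} requires Scalpel, "
            "which is not installed"
        )
    return "scalpel"
```

This keeps level 1 usable on installations without Scalpel and produces a clear error for higher levels instead of an ImportError at startup.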

Expected Output Enhancement

# Current (Level 1)
codeanalyzer --input project --analysis-level 1  # Symbol table only

# Enhanced (Levels 2-4 with Scalpel)
codeanalyzer --input project --analysis-level 2  # + Call graphs
codeanalyzer --input project --analysis-level 3  # + Control flow graphs  
codeanalyzer --input project --analysis-level 4  # + Data flow analysis

Example Usage Scenarios

  1. Security Analysis:

    codeanalyzer --input webapp --analysis-level 4 --data-flow
    # Trace data flow from user inputs to sensitive operations
  2. Refactoring Impact Assessment:

    codeanalyzer --input legacy_code --call-graph --inter-procedural
    # Understand function dependencies before refactoring
  3. Performance Analysis:

    codeanalyzer --input application --control-flow --analysis-level 3
    # Locate loops and complex branching in the CFG as candidates for profiling

Benefits

  • Comprehensive Analysis: Complete the missing call graph functionality and add powerful control/data flow analysis
  • Python-Specific: Handles Python's dynamic nature better than generic tools
  • Research-Backed: Scalpel is described in a published paper (arXiv:2202.11840)
  • Compatible: Both projects use Python 3.12+ and have compatible licenses
  • Modular: Can integrate specific components without full framework overhead

This focused integration would make codeanalyzer-python a comprehensive tool for program flow analysis without pulling in the full Scalpel framework.


Labels: enhancement
