# LLVM-IR → CFG → Grammar Extraction
This notebook contains the core logic from the project split into individual, executable cells. Each class and standalone function is isolated in its own cell to facilitate exploration and interactive experimentation.

In [None]:
import os
import re
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple
# llvm_ir_to_context_free_grammar and ContextFreeGrammar are defined in the
# cells below, so they are not imported from llvm_cfg_generator here.

### `BasicBlock` *(from `llvm_cfg_generator.py`)*

Lightweight, mutable metadata holder that models an LLVM *basic block*
inside a function-level control-flow graph (CFG).

In the textual LLVM IR a basic block is introduced by a label such as

    ``entry:``

followed by a *contiguous* list of instructions that concludes with a
single *terminator* (`br`, `switch`, `ret`, `invoke`, …).  The absence of
explicit `goto`-style jump statements within the body guarantees that
**all** control-flow into the block is funnelled through the label and
**all** control-flow out of the block emanates from the last instruction.

The present class does not attempt any semantic inspection of the
instructions; it simply acts as a lightweight container so that the CFG
construction algorithm can record structural relationships.

Attributes
----------
name:
    Canonical label that uniquely identifies the block inside its
    enclosing function.
instructions:
    List of raw LLVM textual lines *exactly* as they appear in the input
    file.  The list is filled incrementally while the parser iterates over
    the function body.
successors / predecessors:
    Outgoing and incoming adjacency lists respectively.  They are updated
    by :pyclass:`ControlFlowGraph.add_edge` once the parser encounters a
    terminator that creates a control-flow edge.
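
To make the label/terminator structure concrete, the following minimal sketch groups the lines of a function body under the label that introduces them. The helper `split_into_blocks` and the sample IR are assumptions made for illustration; they are not part of `llvm_cfg_generator.py`.

```python
import re
from typing import Dict, List

# Hypothetical helper for illustration only; not part of llvm_cfg_generator.py.
# LLVM labels may contain letters, digits, '_', '.', '$' and '-'.
LABEL = re.compile(r'^([\w.$-]+):')

def split_into_blocks(body: str) -> Dict[str, List[str]]:
    """Group instruction lines under the label that introduces them."""
    blocks: Dict[str, List[str]] = {}
    current = None
    for raw in body.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LABEL.match(line)
        if match:
            current = match.group(1)
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line)
    return blocks

sample = """
entry:
  %cmp = icmp sgt i32 %x, 0
  br i1 %cmp, label %then, label %done
then:
  br label %done
done:
  ret i32 %x
"""

blocks = split_into_blocks(sample)
for name, instructions in blocks.items():
    # The last instruction of each block is its terminator.
    print(name, "->", instructions[-1])
```

Each block's instruction list ends with exactly one terminator, which is the only place an outgoing edge can originate.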

In [None]:
class BasicBlock:
    name: str
    instructions: List[str]
    successors: List[str]
    predecessors: List[str]
    
    def __init__(self, name: str):
        self.name = name
        self.instructions = []
        self.successors = []
        self.predecessors = []

### `CFGEdge` *(from `llvm_cfg_generator.py`)*

Directed edge (source → target) that connects two :pyclass:`BasicBlock`
instances, identified by name, in a :pyclass:`ControlFlowGraph`.

The optional *condition* label provides minimal semantic context that is
later used by the grammar generator to create *choice* non-terminals.
Examples include ``"true"`` / ``"false"`` for a conditional `br`,
``"switch"`` for a `switch`-based dispatch, or ``None`` for an
unconditional jump.
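
As a rough sketch of how such condition labels could be attached, the snippet below derives edges from a `br` terminator. The `Edge` dataclass and the `edges_for` helper are illustrative stand-ins, assumed for this example only.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Edge:  # illustrative stand-in for CFGEdge
    source: str
    target: str
    condition: Optional[str] = None

COND_BR = re.compile(r'br\s+i1\s+[^,]+,\s*label\s+%([\w.$-]+),\s*label\s+%([\w.$-]+)')
UNCOND_BR = re.compile(r'br\s+label\s+%([\w.$-]+)')

def edges_for(source: str, terminator: str) -> List[Edge]:
    """Return the labelled edges created by a `br` terminator (hypothetical helper)."""
    match = COND_BR.search(terminator)
    if match:
        # The first label is taken when the predicate is true, the second otherwise.
        return [Edge(source, match.group(1), "true"),
                Edge(source, match.group(2), "false")]
    match = UNCOND_BR.search(terminator)
    if match:
        return [Edge(source, match.group(1))]  # condition stays None
    return []

conditional = edges_for("entry", "br i1 %cmp, label %then, label %else")
unconditional = edges_for("then", "br label %done")
print(conditional)
print(unconditional)
```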

In [None]:
@dataclass
class CFGEdge:
    source: str
    target: str
    condition: Optional[str] = None

### `ControlFlowGraph` *(from `llvm_cfg_generator.py`)*

Adjacency-list representation of the intra-procedural control-flow graph
that underpins the entire *IR → Grammar* pipeline.

Design choices
--------------
• **No edge de-duplication:** Different IR constructs can yield identical
  source/target pairs, e.g. a `switch` whose cases share a destination, or
  a conditional `br` whose two labels name the same block.  Retaining
  duplicates helps the grammar generator produce a richer set of rules
  that account for *how* control reaches a block.

• **Partial connectivity:** The parser deliberately allows *dangling*
  blocks that have no successors (e.g. blocks ending in `ret` or
  `unreachable`) or no predecessors (the entry block, or dead code that
  has not been eliminated yet).
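
The effect of keeping duplicates can be illustrated with a small adjacency list. The edge data below is made up for the example and only mirrors the idea, not the actual class.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple

# Illustrative edge list: (source, target, condition).
edges: List[Tuple[str, str, str]] = [
    ("entry", "small", "switch"),  # case 0
    ("entry", "small", "switch"),  # case 1 shares the target of case 0
    ("entry", "other", "switch"),  # default
]

successors: DefaultDict[str, List[str]] = defaultdict(list)
for source, target, _ in edges:
    successors[source].append(target)  # appended as-is, no de-duplication

# The multiplicity records how many distinct ways control reaches a target.
multiplicity = Counter(successors["entry"])
print(successors["entry"])
print(multiplicity["small"])
```

A set-based adjacency structure would collapse the two `small` entries and lose that multiplicity.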

In [None]:
class ControlFlowGraph:
    def __init__(self):
        self.blocks: Dict[str, BasicBlock] = {}
        self.edges: List[CFGEdge] = []
        self.entry_block: Optional[str] = None
        self.exit_blocks: Set[str] = set()
    
    def add_block(self, name: str) -> BasicBlock:
        if name not in self.blocks:
            self.blocks[name] = BasicBlock(name)
        return self.blocks[name]
    
    def add_edge(self, source: str, target: str, condition: Optional[str] = None):
        edge = CFGEdge(source, target, condition)
        self.edges.append(edge)
        
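        # Create both endpoints on demand: a branch often targets a block whose
        # label has not been parsed yet, and that block's predecessor list must
        # still be updated
        self.add_block(source).successors.append(target)
        self.add_block(target).predecessors.append(source)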

### `GrammarRule` *(from `llvm_cfg_generator.py`)*

Single production in a context-free grammar, following the conventional
Backus–Naur notation

    ``<lhs> ::= rhs₁ rhs₂ … rhsₙ``

where each *rhsᵢ* is either a terminal or a non-terminal.  (``__str__``
renders the same rule as ``lhs -> rhs₁ … rhsₙ``.)  The class is
intentionally *dumb*: it stores raw strings only, because the surrounding
framework maintains the global sets of terminals and non-terminals and
invariants such as *reachability* and *productivity*.

In [None]:
class GrammarRule:
    def __init__(self, lhs: str, rhs: List[str]):
        self.lhs = lhs  # Left-hand side (non-terminal)
        self.rhs = rhs  # Right-hand side (list of terminals/non-terminals)
    
    def __str__(self):
        return f"{self.lhs} -> {' '.join(self.rhs)}"
    
    def __repr__(self):
        return self.__str__()

### `ContextFreeGrammar` *(from `llvm_cfg_generator.py`)*

Aggregates the production rules that collectively form the *executable
model* of an LLVM function.

Apart from storing the rules themselves, the class keeps canonicalised
*sets* of terminals and non-terminals which are updated incrementally as
new rules are added.  This affords O(1) membership tests when the grammar
builder checks whether a symbol has been introduced before.
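
The incremental bookkeeping can be sketched as follows. `TinyGrammar` is a hypothetical stand-in that uses a simplified classification (a symbol is a non-terminal once it has appeared on a left-hand side) instead of the regex-based `is_terminal` check.

```python
from typing import List, Set, Tuple

class TinyGrammar:
    """Illustrative stand-in: classifies symbols by whether they ever appear as an LHS."""

    def __init__(self) -> None:
        self.rules: List[Tuple[str, List[str]]] = []
        self.non_terminals: Set[str] = set()
        self.terminals: Set[str] = set()

    def add_rule(self, lhs: str, rhs: List[str]) -> None:
        self.rules.append((lhs, rhs))
        # The LHS is a non-terminal; undo any earlier guess that it was a terminal.
        self.non_terminals.add(lhs)
        self.terminals.discard(lhs)
        for symbol in rhs:
            if symbol not in self.non_terminals:  # O(1) set lookup
                self.terminals.add(symbol)

grammar = TinyGrammar()
grammar.add_rule("S", ["A", "b"])  # "A" is provisionally a terminal here
grammar.add_rule("A", ["a"])       # and is reclassified once it gets a rule
print(sorted(grammar.non_terminals), sorted(grammar.terminals))
```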

In [None]:
class ContextFreeGrammar:
    def __init__(self):
        self.rules: List[GrammarRule] = []
        self.terminals: Set[str] = set()
        self.non_terminals: Set[str] = set()
        self.start_symbol: str = "S"
    
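    def add_rule(self, lhs: str, rhs: List[str]):
        rule = GrammarRule(lhs, rhs)
        self.rules.append(rule)
        # A symbol that appears on a left-hand side is a non-terminal, even if
        # it matches a terminal pattern (e.g. the all-caps BLOCK_ENTRY)
        self.non_terminals.add(lhs)
        self.terminals.discard(lhs)
        
        for symbol in rhs:
            if symbol in self.non_terminals:
                continue
            if self.is_terminal(symbol):
                self.terminals.add(symbol)
            else:
                self.non_terminals.add(symbol)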
    
    def is_terminal(self, symbol: str) -> bool:
        """Determine if a symbol is a terminal (enhanced pattern matching)"""
        # Terminals are typically instruction names, constants, or specific tokens
        terminal_patterns = [
            r'^[A-Z][A-Z_]*$',  # All caps tokens like ADD, LOAD, STORE
            r'^[a-z]+$',        # Simple instruction names like 'add', 'load', 'store'
            r'^%\w+$',          # LLVM registers
            r'^@\w+$',          # LLVM global symbols
            r'^\d+$',           # Constants
            r'^".*"$',          # String literals
            r'^EPSILON$',       # Empty string terminal
            r'^IF$|^THEN$|^ELSE$|^WHILE$|^DO$|^ASSIGN$|^COMPARE$|^VARIABLE$|^CONSTANT$|^EXPRESSION$',  # Control flow keywords
            r'^(ENTRY|EXIT|CONTINUE|BREAK|JUMP|FALLTHROUGH)$',  # Enhanced control flow
            r'^(ALLOCA|LOAD|STORE|GEP|BITCAST|INTTOPTR|PTRTOINT)$',  # Memory operations
            r'^(PHI|SELECT|EXTRACTVALUE|INSERTVALUE)$',  # Data flow operations
            r'^(INVOKE|LANDINGPAD|RESUME|UNREACHABLE)$',  # Exception handling
            r'^(FENCE|ATOMICRMW|CMPXCHG)$',  # Atomic operations
            r'^LABEL_\d+$',     # Basic block labels
            r'^ARG_\d+$',       # Function arguments
            r'^(NULL|VOID|TRUE|FALSE)$'  # Constants
        ]
        
        for pattern in terminal_patterns:
            if re.match(pattern, symbol):
                return True
        return False
    
    def __str__(self):
        result = []
        result.append(f"Start Symbol: {self.start_symbol}")
        result.append(f"Non-terminals: {{{', '.join(sorted(self.non_terminals))}}}")
        result.append(f"Terminals: {{{', '.join(sorted(self.terminals))}}}")
        result.append("Production Rules:")
        for rule in self.rules:
            result.append(f"  {rule}")
        return '\n'.join(result)

### `LLVMIRParser` *(from `llvm_cfg_generator.py`)*

Line-oriented LLVM-IR parser whose only responsibility is to recover
*control structure*; it deliberately ignores type information, bit-widths,
and other semantic details.

Implementation highlights
-------------------------
• **Regex-centric:** A handful of targeted regular expressions avoids the
  maintenance overhead of a full-blown IR grammar, at the cost of missing
  constructs the patterns do not anticipate (for example a `switch` whose
  case list spans several lines).

• **Resilience over completeness:** Whenever the parser encounters an
  exotic construct it cannot handle, the error is *contained* to that
  function; the surrounding module continues to be processed so that fuzz
  campaigns are not blocked by a single unparseable corner case.
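
The containment strategy amounts to wrapping each per-function step in its own `try`/`except`. The sketch below shows the idea with hypothetical helpers (`parse_each`, `count_labels`) that are not part of the project.

```python
from typing import Callable, Dict, List, Tuple

def parse_each(functions: Dict[str, str],
               parse_one: Callable[[str], int]) -> Tuple[Dict[str, int], List[str]]:
    """Apply parse_one to every function body; record failures instead of aborting."""
    parsed: Dict[str, int] = {}
    failed: List[str] = []
    for name, body in functions.items():
        try:
            parsed[name] = parse_one(body)
        except Exception:
            # The failure is contained to this function; the loop continues.
            failed.append(name)
    return parsed, failed

def count_labels(body: str) -> int:
    # Hypothetical per-function parser: counts labels and rejects `callbr`.
    if "callbr" in body:
        raise ValueError("unsupported terminator")
    return sum(1 for line in body.splitlines() if line.strip().endswith(":"))

parsed, failed = parse_each(
    {"f": "entry:\n  ret void", "g": "entry:\n  callbr void asm ..."},
    count_labels,
)
print(parsed, failed)
```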

In [None]:
class LLVMIRParser:
    
    def __init__(self):
        # Patterns for textual LLVM IR; labels may contain '.', '$' and '-' (e.g. `for.body`)
        self.function_pattern = re.compile(r'define\s+(?:dso_local\s+|internal\s+)?\w+\s+@(\w+)\s*\([^)]*\)\s*(?:#\d+\s*)?\{', re.IGNORECASE)
        self.block_pattern = re.compile(r'^([\w.$-]+):\s*(?:;.*)?$')
        self.numbered_block_pattern = re.compile(r'^(\d+):\s*(?:;.*)?$')
        self.branch_pattern = re.compile(r'br\s+i1\s+[^,]+,\s+label\s+%([\w.$-]+),\s+label\s+%([\w.$-]+)')
        self.unconditional_branch_pattern = re.compile(r'br\s+label\s+%([\w.$-]+)')
        # Anchored so that value names such as `%ret` are not mistaken for a return
        self.return_pattern = re.compile(r'^\s*ret\b')
        self.call_pattern = re.compile(r'call\s+.*?@(\w+)')
        self.instruction_pattern = re.compile(r'^\s*(?:%\w+\s*=\s*)?(\w+)\s+(.*)')
        self.switch_pattern = re.compile(r'switch\s+.*?\[([^\]]+)\]')
        self.indirectbr_pattern = re.compile(r'indirectbr\s+.*?\[([^\]]+)\]')
    
    def parse_llvm_ir(self, llvm_code: str) -> Dict[str, ControlFlowGraph]:
        """Parse LLVM-IR code and extract CFGs for each function with enhanced error handling"""
        functions = {}
        
        try:
            # Split into functions with better handling
            function_blocks = self._split_into_functions_enhanced(llvm_code)
            
            for func_name, func_code in function_blocks.items():
                try:
                    cfg = self._build_cfg_from_function(func_name, func_code)
                    if cfg.blocks:  # Only add non-empty CFGs
                        functions[func_name] = cfg
                except Exception as e:
                    print(f"Warning: Failed to parse function {func_name}: {e}")
                    continue
        except Exception as e:
            print(f"Error parsing LLVM IR: {e}")
        
        return functions
    
    def _split_into_functions_enhanced(self, llvm_code: str) -> Dict[str, str]:
        """Enhanced function splitting with better pattern matching"""
        functions = {}
        lines = llvm_code.split('\n')
        current_function = None
        current_code = []
        brace_count = 0
        in_function = False
        
        for i, line in enumerate(lines):
            original_line = line
            line = line.strip()
            
            # Skip empty lines, and comment lines that appear outside functions
            if not line or (line.startswith(';') and not in_function):
                continue
            
            # Check for function definition with enhanced pattern
            func_match = self.function_pattern.search(line)
            if func_match:
                # Save previous function if exists
                if current_function and current_code:
                    functions[current_function] = '\n'.join(current_code)
                
                current_function = func_match.group(1)
                current_code = [original_line]
                brace_count = line.count('{') - line.count('}')
                in_function = True
                continue
            
            if in_function and current_function:
                current_code.append(original_line)
                brace_count += line.count('{') - line.count('}')
                
                # Function ends when braces are balanced
                if brace_count == 0:
                    functions[current_function] = '\n'.join(current_code)
                    current_function = None
                    current_code = []
                    in_function = False
        
        # Handle last function if file doesn't end with closing brace
        if current_function and current_code:
            functions[current_function] = '\n'.join(current_code)
        
        return functions
    
    def _build_cfg_from_function(self, func_name: str, func_code: str) -> ControlFlowGraph:
        """Build CFG from a single function's LLVM-IR code with enhanced block detection"""
        cfg = ControlFlowGraph()
        lines = func_code.split('\n')
        
        current_block = None
        entry_found = False
        
        for line_num, line in enumerate(lines):
            original_line = line
            line = line.strip()
            
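            # Skip blanks, comments, the `define ... {` header and the closing brace
            # so that they are not recorded as instructions of the entry block
            if not line or line.startswith((';', 'define ')) or line == '}':
                continue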
            
            # Check for basic block label (both named and numbered)
            block_match = self.block_pattern.match(line) or self.numbered_block_pattern.match(line)
            if block_match:
                block_name = block_match.group(1)
                current_block = cfg.add_block(block_name)
                if not entry_found:
                    cfg.entry_block = block_name
                    entry_found = True
                continue
            
            # Handle function entry (first instruction after define)
            if not entry_found and current_block is None:
                # First instruction creates entry block
                current_block = cfg.add_block('entry')
                cfg.entry_block = 'entry'
                entry_found = True
            
            if current_block:
                current_block.instructions.append(original_line.strip())
                
                # Enhanced terminator instruction detection
                self._process_terminator_instruction(cfg, current_block, line)
        
        return cfg
    
    def _process_terminator_instruction(self, cfg: ControlFlowGraph, current_block: BasicBlock, line: str):
        """Process terminator instructions with enhanced pattern matching"""
        # Conditional branch
        branch_match = self.branch_pattern.search(line)
        if branch_match:
            true_target = branch_match.group(1)
            false_target = branch_match.group(2)
            cfg.add_edge(current_block.name, true_target, "true")
            cfg.add_edge(current_block.name, false_target, "false")
            return
        
        # Unconditional branch
        unconditional_match = self.unconditional_branch_pattern.search(line)
        if unconditional_match:
            target = unconditional_match.group(1)
            cfg.add_edge(current_block.name, target)
            return
        
        # Switch statement
        switch_match = self.switch_pattern.search(line)
        if switch_match:
            # Collect every label on the line so that the default target, which
            # precedes the bracketed case list, is included as well
            targets = re.findall(r'label\s+%([\w.$-]+)', line)
            for target in targets:
                cfg.add_edge(current_block.name, target, "switch")
            return
        
        # Indirect branch
        indirectbr_match = self.indirectbr_pattern.search(line)
        if indirectbr_match:
            targets_str = indirectbr_match.group(1)
            targets = re.findall(r'label\s+%([\w.$-]+)', targets_str)
            for target in targets:
                cfg.add_edge(current_block.name, target, "indirect")
            return
        
        # Return instruction
        if self.return_pattern.search(line):
            cfg.exit_blocks.add(current_block.name)
            return
        
        # Unreachable instruction
        if 'unreachable' in line:
            cfg.exit_blocks.add(current_block.name)
            return

### `CFGToGrammarConverter` *(from `llvm_cfg_generator.py`)*

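Synthesises a *context-free grammar* from the raw control-flow graph.
(Both terms abbreviate to "CFG"; in this notebook the abbreviation refers
to the control-flow graph.)  The grammar is intended to over-approximate
the function: each execution path should correspond to some derivation,
but the converse does not hold, because the converter deliberately adds
exploratory productions (optional blocks, repetition, interleaved
successors) that describe paths the function cannot take.

The converter also applies a set of *abstractions* (each LLVM opcode is
mapped to an upper-case terminal such as ``ADD``, ``LOAD`` or ``BRANCH``)
to keep the terminal alphabet tractable while preserving enough structural
richness to guide greybox fuzzers.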
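
A minimal sketch of the opcode abstraction is shown below. The table is a small excerpt and the `abstract` helper is an assumption for illustration; it looks up the opcode itself rather than searching for substrings, since names such as `%addr` would otherwise match `add`.

```python
import re
from typing import Optional

# Small excerpt of an opcode-to-terminal table, for illustration.
ABSTRACTION = {"add": "ADD", "store": "STORE", "load": "LOAD",
               "br": "BRANCH", "ret": "RETURN"}

# Optional "%result =" prefix, then the opcode as the first word.
OPCODE = re.compile(r'^(?:%[\w.$-]+\s*=\s*)?([a-z_]\w*)')

def abstract(instruction: str) -> Optional[str]:
    """Map one textual instruction to a terminal symbol (hypothetical helper)."""
    match = OPCODE.match(instruction.strip())
    if not match:
        return None
    opcode = match.group(1)
    # Unknown opcodes fall back to a generic OP_<NAME> terminal.
    return ABSTRACTION.get(opcode, f"OP_{opcode.upper()}")

for text in ["%sum = add nsw i32 %a, %b",
             "store i32 %sum, i32* %addr",
             "%v = fneg float %x"]:
    print(abstract(text))
```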

In [None]:
class CFGToGrammarConverter:
    def __init__(self):
        # Enhanced instruction abstraction for complex LLVM IR
        self.instruction_abstraction = {
            # Arithmetic operations
            'add': 'ADD', 'fadd': 'FADD', 'sub': 'SUB', 'fsub': 'FSUB',
            'mul': 'MUL', 'fmul': 'FMUL', 'udiv': 'UDIV', 'sdiv': 'SDIV',
            'fdiv': 'FDIV', 'urem': 'UREM', 'srem': 'SREM', 'frem': 'FREM',
            
            # Bitwise operations
            'shl': 'SHL', 'lshr': 'LSHR', 'ashr': 'ASHR', 'and': 'AND',
            'or': 'OR', 'xor': 'XOR',
            
            # Memory operations
            'alloca': 'ALLOCA', 'load': 'LOAD', 'store': 'STORE',
            'getelementptr': 'GEP', 'fence': 'FENCE',
            
            # Conversion operations
            'trunc': 'TRUNC', 'zext': 'ZEXT', 'sext': 'SEXT',
            'fptrunc': 'FPTRUNC', 'fpext': 'FPEXT', 'fptoui': 'FPTOUI',
            'fptosi': 'FPTOSI', 'uitofp': 'UITOFP', 'sitofp': 'SITOFP',
            'ptrtoint': 'PTRTOINT', 'inttoptr': 'INTTOPTR', 'bitcast': 'BITCAST',
            'addrspacecast': 'ADDRSPACECAST',
            
            # Other operations
            'icmp': 'ICMP', 'fcmp': 'FCMP', 'phi': 'PHI', 'select': 'SELECT',
            'call': 'CALL', 'va_arg': 'VA_ARG', 'landingpad': 'LANDINGPAD',
            'cleanuppad': 'CLEANUPPAD', 'catchpad': 'CATCHPAD',
            
            # Terminator instructions
            'ret': 'RETURN', 'br': 'BRANCH', 'switch': 'SWITCH',
            'indirectbr': 'INDIRECT_BR', 'invoke': 'INVOKE', 'resume': 'RESUME',
            'catchswitch': 'CATCHSWITCH', 'catchret': 'CATCHRET',
            'cleanupret': 'CLEANUPRET', 'unreachable': 'UNREACHABLE',
            
            # Aggregate operations
            'extractvalue': 'EXTRACTVALUE', 'insertvalue': 'INSERTVALUE',
            'extractelement': 'EXTRACTELEMENT', 'insertelement': 'INSERTELEMENT',
            'shufflevector': 'SHUFFLEVECTOR',
            
            # Atomic operations
            'atomicrmw': 'ATOMICRMW', 'cmpxchg': 'CMPXCHG'
        }
        
        # Enhanced patterns for complex control flow
        self.control_flow_patterns = {
            'loop': ['for', 'while', 'do_while'],
            'conditional': ['if_then', 'if_then_else', 'ternary'],
            'switch': ['multi_branch', 'jump_table'],
            'exception': ['try_catch', 'cleanup', 'landing_pad']
        }
    
    def convert_cfg_to_grammar(self, cfg: ControlFlowGraph, func_name: str) -> ContextFreeGrammar:
        """Convert a CFG to a context-free grammar optimized for effective fuzzing"""
        grammar = ContextFreeGrammar()
        grammar.start_symbol = f"FUNC_{func_name.upper()}"
        
        if not cfg.entry_block or not cfg.blocks:
            return grammar
        
        # Generate comprehensive grammar rules that capture all control flow paths
        self._generate_comprehensive_rules(grammar, cfg, func_name)
        
        return grammar
    
    def _generate_comprehensive_rules(self, grammar: ContextFreeGrammar, cfg: ControlFlowGraph, func_name: str):
        """Generate comprehensive grammar rules for effective fuzzing coverage"""
        
        # Main function entry point
        func_symbol = f"FUNC_{func_name.upper()}"
        grammar.add_rule(func_symbol, [f"BLOCK_{cfg.entry_block.upper()}"])
        
        # Generate rules for each basic block with all possible transitions
        for block_name, block in cfg.blocks.items():
            block_symbol = f"BLOCK_{block_name.upper()}"
            
            # Generate block content rules (instructions)
            self._generate_block_content_rules(grammar, block, block_symbol)
            
            # Generate transition rules for control flow paths
            self._generate_transition_rules(grammar, block, block_symbol)
            
            # Generate alternative path rules for fuzzing exploration
            self._generate_alternative_path_rules(grammar, cfg, block, block_symbol)
    
    def _generate_block_content_rules(self, grammar: ContextFreeGrammar, block: BasicBlock, block_symbol: str):
        """Generate rules for the content of a basic block"""
        if not block.instructions:
            grammar.add_rule(block_symbol, ["EPSILON"])
            return
            
        # Create instruction sequence rules
        instruction_symbols = []
        for i, instruction in enumerate(block.instructions):
            inst_symbol = self._abstract_instruction(instruction)
            if inst_symbol:
                instruction_symbols.append(inst_symbol)
        
        if instruction_symbols:
            # Single instruction rule
            if len(instruction_symbols) == 1:
                grammar.add_rule(block_symbol, instruction_symbols)
            else:
                # Multiple instruction sequence
                grammar.add_rule(block_symbol, ["INSTRUCTION_SEQ"])
                
                # Create flexible instruction sequence rules for fuzzing
                for i, inst in enumerate(instruction_symbols):
                    if i == 0:
                        grammar.add_rule("INSTRUCTION_SEQ", [inst])
                    grammar.add_rule("INSTRUCTION_SEQ", ["INSTRUCTION_SEQ", inst])
                
                # Alternative single instruction rules for fuzzing variations
                for inst in instruction_symbols:
                    grammar.add_rule("INSTRUCTION_SEQ", [inst])
        else:
            grammar.add_rule(block_symbol, ["EPSILON"])
    
    def _generate_transition_rules(self, grammar: ContextFreeGrammar, block: BasicBlock, block_symbol: str):
        """Generate control flow transition rules"""
        
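        # No successors (termination)
        if not block.successors:
            content_symbol = block_symbol.replace("BLOCK_", "CONTENT_")
            grammar.add_rule(block_symbol, [content_symbol])
            # Define the content symbol so it is not left without a production
            grammar.add_rule(content_symbol, ["INSTRUCTION_SEQ"])
            return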
        
        # Single successor (sequential flow)
        if len(block.successors) == 1:
            successor_symbol = f"BLOCK_{block.successors[0].upper()}"
            content_symbol = block_symbol.replace("BLOCK_", "CONTENT_")
            
            # Sequential transition
            grammar.add_rule(block_symbol, [content_symbol, successor_symbol])
            grammar.add_rule(content_symbol, ["INSTRUCTION_SEQ"])
            
        # Multiple successors (conditional or switch flow - choice points for fuzzing)
        else:
            successor_symbols = [f"BLOCK_{s.upper()}" for s in block.successors]
            content_symbol = block_symbol.replace("BLOCK_", "CONTENT_")
            
            # One alternative per successor - critical for fuzzing path exploration
            grammar.add_rule(block_symbol, [content_symbol, "CHOICE_POINT"])
            for successor_symbol in successor_symbols:
                grammar.add_rule("CHOICE_POINT", [successor_symbol])
            grammar.add_rule("CHOICE_POINT", successor_symbols)  # Fuzzing alternative: all successors in order
            grammar.add_rule(content_symbol, ["INSTRUCTION_SEQ"])
    
    def _generate_alternative_path_rules(self, grammar: ContextFreeGrammar, cfg: ControlFlowGraph, 
                                       block: BasicBlock, block_symbol: str):
        """Generate alternative path rules to enhance fuzzing coverage"""
        
        # Create loop detection and alternative path rules
        if self._is_loop_header(cfg, block.name):
            loop_symbol = f"LOOP_{block.name.upper()}"
            grammar.add_rule(loop_symbol, [block_symbol])
            grammar.add_rule(loop_symbol, [block_symbol, loop_symbol])  # Loop iteration
            grammar.add_rule(block_symbol, [loop_symbol])  # Alternative entry
        
        # Create convergence point rules for blocks with multiple predecessors
        if len(block.predecessors) > 1:
            merge_symbol = f"MERGE_{block.name.upper()}"
            grammar.add_rule(merge_symbol, [block_symbol])
            
            # Alternative paths leading to this merge point
            for pred in block.predecessors:
                pred_symbol = f"BLOCK_{pred.upper()}"
                grammar.add_rule(merge_symbol, [pred_symbol, block_symbol])
        
        # Create path variation rules for fuzzing
        self._generate_path_variation_rules(grammar, block, block_symbol)
    
    def _generate_path_variation_rules(self, grammar: ContextFreeGrammar, block: BasicBlock, block_symbol: str):
        """Generate path variation rules specifically for fuzzing optimization"""
        
        # Create optional execution rules
        optional_symbol = f"OPT_{block.name.upper()}"
        grammar.add_rule(optional_symbol, [block_symbol])
        grammar.add_rule(optional_symbol, ["EPSILON"])  # Optional execution for fuzzing
        
        # Create repetition rules for blocks that could be part of loops
        if block.successors and block.name in block.successors:
            repeat_symbol = f"REPEAT_{block.name.upper()}"
            grammar.add_rule(repeat_symbol, [block_symbol])
            grammar.add_rule(repeat_symbol, [block_symbol, repeat_symbol])
            grammar.add_rule(block_symbol, [repeat_symbol])
        
        # Create interleaving rules for complex control flows
        if len(block.successors) > 1:
            interleave_symbol = f"INTERLEAVE_{block.name.upper()}"
            for successor in block.successors:
                succ_symbol = f"BLOCK_{successor.upper()}"
                grammar.add_rule(interleave_symbol, [succ_symbol])
                # Cross-product rules for fuzzing exploration
                for other_successor in block.successors:
                    if other_successor != successor:
                        other_symbol = f"BLOCK_{other_successor.upper()}"
                        grammar.add_rule(interleave_symbol, [succ_symbol, other_symbol])
    
    def _is_loop_header(self, cfg: ControlFlowGraph, block_name: str) -> bool:
        """Check whether a block lies on a cycle (over-approximates 'is a loop header')"""
        if block_name not in cfg.blocks:
            return False
        
        block = cfg.blocks[block_name]
        
        # Heuristic: if any successor can reach this block again, it is part of a
        # cycle. Every block in a loop body qualifies, not only the header.
        for successor in block.successors:
            if self._can_reach(cfg, successor, block_name, visited=set()):
                return True
        
        return False
    
    def _can_reach(self, cfg: ControlFlowGraph, from_block: str, to_block: str, visited: set) -> bool:
        """Check if from_block can reach to_block (for loop detection)"""
        if from_block == to_block:
            return True
        
        if from_block in visited or from_block not in cfg.blocks:
            return False
        
        visited.add(from_block)
        
        for successor in cfg.blocks[from_block].successors:
            # Share one visited set: copying it per call makes the search exponential
            if self._can_reach(cfg, successor, to_block, visited):
                return True
        
        return False
    
    def _abstract_instruction(self, instruction: str) -> Optional[str]:
        """Abstract an LLVM instruction to a grammar terminal"""
        instruction = instruction.strip()
        if not instruction:
            return None
        
        # Extract the opcode: skip an optional "%result =" prefix and an optional
        # call marker (tail/musttail/notail), then take the first word.
        # Substring matching would misfire, e.g. 'or' occurs inside 'store'.
        match = re.match(
            r'^(?:%[\w.$-]+\s*=\s*)?(?:(?:tail|musttail|notail)\s+)?([A-Za-z_]\w*)',
            instruction,
        )
        if match:
            opcode = match.group(1).lower()
            if opcode in self.instruction_abstraction:
                return self.instruction_abstraction[opcode]
            # Default abstraction for unknown instructions
            return f"OP_{opcode.upper()}"
        
        return "UNKNOWN_OP"

### `llvm_ir_to_context_free_grammar` *(from `llvm_cfg_generator.py`)*

One-shot convenience wrapper that runs the *entire* pipeline:

1. Parse the textual IR into one :pyclass:`ControlFlowGraph` per function.
2. Convert each CFG into a structural :pyclass:`ContextFreeGrammar`.

The function performs no file I/O, which encourages re-use in batch
processing and unit tests.  Its only side effect is inherited from the
parser, which prints a warning when a function cannot be parsed.
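
The two stages can be pictured with a heavily simplified, self-contained sketch. The helpers `build_successors` and `to_rules` are hypothetical and handle only `br` terminators; the real pipeline produces many more rule kinds.

```python
import re
from typing import Dict, List

LABEL = re.compile(r'^([\w.$-]+):')
TARGET = re.compile(r'label\s+%([\w.$-]+)')

def build_successors(body: str) -> Dict[str, List[str]]:
    """Stage 1 (sketch): map each labelled block to its branch targets."""
    successors: Dict[str, List[str]] = {}
    current = None
    for raw in body.splitlines():
        line = raw.strip()
        label = LABEL.match(line)
        if label:
            current = label.group(1)
            successors[current] = []
        elif current is not None and line.startswith("br "):
            successors[current].extend(TARGET.findall(line))
    return successors

def to_rules(successors: Dict[str, List[str]]) -> List[str]:
    """Stage 2 (sketch): one production per outgoing edge, or a leaf rule."""
    rules: List[str] = []
    for block, targets in successors.items():
        lhs = f"BLOCK_{block.upper()}"
        content = f"CONTENT_{block.upper()}"
        if not targets:
            rules.append(f"{lhs} -> {content}")
        for target in targets:
            rules.append(f"{lhs} -> {content} BLOCK_{target.upper()}")
    return rules

ir = """
entry:
  br i1 %c, label %then, label %done
then:
  br label %done
done:
  ret void
"""
rules = to_rules(build_successors(ir))
print("\n".join(rules))
```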

In [None]:
def llvm_ir_to_context_free_grammar(llvm_ir_code: str) -> Dict[str, ContextFreeGrammar]:

    parser = LLVMIRParser()
    converter = CFGToGrammarConverter()
    
    # Step 1: Parse LLVM-IR and build CFGs
    function_cfgs = parser.parse_llvm_ir(llvm_ir_code)
    
    # Step 2: Convert each CFG to a context-free grammar
    function_grammars = {}
    for func_name, cfg in function_cfgs.items():
        grammar = converter.convert_cfg_to_grammar(cfg, func_name)
        function_grammars[func_name] = grammar
    
    return function_grammars

### `GrammarAnalytics` *(from `llvm_file_processor.py`)*

Post-processing data structure that distills *quantitative* properties from
a freshly generated :pyclass:`llvm_cfg_generator.ContextFreeGrammar`.

The collected metrics fall into four broad categories:

1. **Structural** – number of rules, branching factor, derivation depth.
2. **Control-flow** – basic blocks, choice points, loop patterns.
3. **Fuzzing-oriented** – path alternatives, optional/repetition rules.
4. **Coverage / Semantics** – instruction types and high-level flow
   patterns present in the grammar.

A higher-level orchestrator (CLI, notebook, GUI) can serialise instances of
this dataclass into human-readable reports or feed them into downstream
optimisation heuristics that decide which functions are *worth* fuzzing.
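
How the structural metrics might be derived from a rule list is sketched below. The definition of branching factor used here (average number of alternatives per non-terminal) is an assumption; the actual computation in `llvm_file_processor.py` is not shown in this notebook excerpt.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rule = Tuple[str, List[str]]  # (lhs, rhs symbols)

def structural_metrics(rules: List[Rule]) -> Dict[str, float]:
    """Compute a few structural metrics from a list of productions (sketch)."""
    alternatives: Dict[str, int] = defaultdict(int)
    for lhs, _ in rules:
        alternatives[lhs] += 1
    lengths = [len(rhs) for _, rhs in rules]
    return {
        "total_rules": len(rules),
        "max_rule_length": max(lengths, default=0),
        "avg_rule_length": sum(lengths) / len(lengths) if lengths else 0.0,
        # Assumed definition: average number of alternatives per non-terminal.
        "branching_factor": len(rules) / len(alternatives) if alternatives else 0.0,
    }

metrics = structural_metrics([
    ("S", ["A", "b"]),
    ("A", ["a"]),
    ("A", ["a", "A"]),
])
print(metrics)
```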

In [None]:
@dataclass
class GrammarAnalytics:
    
    # Basic Grammar Metrics
    function_name: str
    total_rules: int
    non_terminals_count: int
    terminals_count: int
    start_symbol: str
    
    # Control Flow Metrics
    basic_blocks_count: int
    choice_points_count: int
    loop_patterns_count: int
    instruction_sequences_count: int
    
    # Fuzzing Optimization Metrics
    path_alternatives_count: int
    optional_execution_rules: int
    repetition_patterns: int
    interleaving_rules: int
    merge_points: int
    
    # Complexity Metrics
    max_rule_length: int
    avg_rule_length: float
    branching_factor: float
    depth_estimation: int
    
    # Coverage Metrics
    instruction_types_covered: List[str]
    control_flow_patterns: List[str]
    fuzzing_readiness_score: float
    
    def __str__(self) -> str:
        return f"""
Grammar Analytics for {self.function_name}:
========================================

Basic Metrics:
- Production Rules: {self.total_rules}
- Non-terminals: {self.non_terminals_count}
- Terminals: {self.terminals_count}
- Start Symbol: {self.start_symbol}

Control Flow Coverage:
- Basic Blocks: {self.basic_blocks_count}
- Choice Points: {self.choice_points_count}
- Loop Patterns: {self.loop_patterns_count}
- Instruction Sequences: {self.instruction_sequences_count}

Fuzzing Optimization:
- Path Alternatives: {self.path_alternatives_count}
- Optional Executions: {self.optional_execution_rules}
- Repetition Patterns: {self.repetition_patterns}
- Interleaving Rules: {self.interleaving_rules}
- Merge Points: {self.merge_points}

Complexity Analysis:
- Max Rule Length: {self.max_rule_length}
- Avg Rule Length: {self.avg_rule_length:.2f}
- Branching Factor: {self.branching_factor:.2f}
- Estimated Depth: {self.depth_estimation}

Coverage Details:
- Instruction Types: {len(self.instruction_types_covered)} ({', '.join(self.instruction_types_covered[:5])}{'...' if len(self.instruction_types_covered) > 5 else ''})
- Control Flow Patterns: {', '.join(self.control_flow_patterns)}
- Fuzzing Readiness Score: {self.fuzzing_readiness_score:.2f}/10.0

"""

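To make the complexity metrics concrete, here is a minimal sketch that computes `max_rule_length`, `avg_rule_length` and `branching_factor` for a toy grammar, using the same formulas as `_analyze_grammar` and `_calculate_branching_factor` further down. `ToyRule` and the rule names are hypothetical stand-ins (assumed only to expose `lhs` and `rhs` like the project's rules), not the real types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRule:
    """Hypothetical stand-in for a production rule: lhs -> rhs symbols."""
    lhs: str
    rhs: List[str]


# A function with one conditional branch, written out by hand.
toy_rules = [
    ToyRule("FUNC_max", ["BLOCK_entry"]),
    ToyRule("BLOCK_entry", ["ICMP", "CHOICE_POINT_0"]),
    ToyRule("CHOICE_POINT_0", ["BLOCK_then"]),
    ToyRule("CHOICE_POINT_0", ["BLOCK_else"]),
    ToyRule("BLOCK_then", ["RETURN"]),
    ToyRule("BLOCK_else", ["RETURN"]),
]

rule_lengths = [len(rule.rhs) for rule in toy_rules]
max_rule_length = max(rule_lengths)
avg_rule_length = sum(rule_lengths) / len(rule_lengths)

# Branching factor: number of rules divided by number of distinct left-hand sides.
distinct_lhs = {rule.lhs for rule in toy_rules}
branching_factor = len(toy_rules) / len(distinct_lhs)

print(max_rule_length, round(avg_rule_length, 2), branching_factor)
```

Only `CHOICE_POINT_0` has more than one alternative, so the branching factor stays close to 1 (6 rules over 5 left-hand sides).
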
### `FileProcessingResult` *(from `llvm_file_processor.py`)*

Container returned by :pyfunc:`LLVMFileProcessor.process_file` and the
helper wrappers such as :pyfunc:`process_llvm_file`.

It captures both **high-level status** (success flag, aggregated metrics)
and the **detailed artefacts** (per-function grammars + analytics).

Having a single object makes it easier to forward data between batch
processing helpers and report generators without having to juggle several
loosely-coupled variables.

In [None]:
@dataclass
class FileProcessingResult:
    file_path: str
    success: bool
    error_message: Optional[str]
    grammars: Dict[str, ContextFreeGrammar]
    analytics: Dict[str, GrammarAnalytics]
    
    # File-level metrics
    functions_processed: int
    total_grammar_rules: int
    total_choice_points: int
    total_loop_patterns: int
    overall_fuzzing_score: float
    
    def __str__(self) -> str:
        if not self.success:
            return f"Failed to process {self.file_path}: {self.error_message}"
        
        result = f"""
Successfully processed: {self.file_path}
=============================================

File Summary:
- Functions Processed: {self.functions_processed}
- Total Grammar Rules: {self.total_grammar_rules}
- Total Choice Points: {self.total_choice_points}
- Total Loop Patterns: {self.total_loop_patterns}
- Overall Fuzzing Score: {self.overall_fuzzing_score:.2f}/10.0

Function Details:
"""
        
        for func_name, analytics in self.analytics.items():
            result += f"\n📊 {func_name}:"
            result += f"\n   Rules: {analytics.total_rules} | Choice Points: {analytics.choice_points_count} | Score: {analytics.fuzzing_readiness_score:.1f}/10"
        
        return result

### `LLVMFileProcessor` *(from `llvm_file_processor.py`)*

Facade that orchestrates *reading*, *parsing*, *grammar generation*, and
*analytics* for **one** LLVM-IR file at a time.

The class deliberately hides all the gritty details so that command-line
interfaces and GUIs can perform complex processing with a single method
call.

In [None]:
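class LLVMFileProcessor:
    def __init__(self):
        # Files are read as UTF-8 text below, so a '.bc' file must contain
        # textual IR; binary bitcode fails to decode and is reported as an error.
        self.supported_extensions = ['.ll', '.bc', '.txt']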
    
    def process_file(self, file_path: str) -> FileProcessingResult:
        
        # Validate file path
        if not os.path.exists(file_path):
            return FileProcessingResult(
                file_path=file_path,
                success=False,
                error_message=f"File not found: {file_path}",
                grammars={},
                analytics={},
                functions_processed=0,
                total_grammar_rules=0,
                total_choice_points=0,
                total_loop_patterns=0,
                overall_fuzzing_score=0.0
            )
        
        # Check file extension (case-insensitively, so 'sample.LL' is accepted)
        _, ext = os.path.splitext(file_path)
        if ext.lower() not in self.supported_extensions:
            return FileProcessingResult(
                file_path=file_path,
                success=False,
                error_message=f"Unsupported file extension: {ext}. Supported: {', '.join(self.supported_extensions)}",
                grammars={},
                analytics={},
                functions_processed=0,
                total_grammar_rules=0,
                total_choice_points=0,
                total_loop_patterns=0,
                overall_fuzzing_score=0.0
            )
        
        try:
            # Read LLVM-IR content
            with open(file_path, 'r', encoding='utf-8') as f:
                llvm_ir_content = f.read()
            
            # Generate grammars
            grammars = llvm_ir_to_context_free_grammar(llvm_ir_content)
            
            if not grammars:
                return FileProcessingResult(
                    file_path=file_path,
                    success=False,
                    error_message="No functions found in LLVM-IR file",
                    grammars={},
                    analytics={},
                    functions_processed=0,
                    total_grammar_rules=0,
                    total_choice_points=0,
                    total_loop_patterns=0,
                    overall_fuzzing_score=0.0
                )
            
            # Generate analytics for each grammar
            analytics = {}
            total_rules = 0
            total_choice_points = 0
            total_loop_patterns = 0
            total_score = 0.0
            
            for func_name, grammar in grammars.items():
                func_analytics = self._analyze_grammar(func_name, grammar)
                analytics[func_name] = func_analytics
                
                total_rules += func_analytics.total_rules
                total_choice_points += func_analytics.choice_points_count
                total_loop_patterns += func_analytics.loop_patterns_count
                total_score += func_analytics.fuzzing_readiness_score
            
            overall_score = total_score / len(grammars) if grammars else 0.0
            
            return FileProcessingResult(
                file_path=file_path,
                success=True,
                error_message=None,
                grammars=grammars,
                analytics=analytics,
                functions_processed=len(grammars),
                total_grammar_rules=total_rules,
                total_choice_points=total_choice_points,
                total_loop_patterns=total_loop_patterns,
                overall_fuzzing_score=overall_score
            )
            
        except Exception as e:
            return FileProcessingResult(
                file_path=file_path,
                success=False,
                error_message=f"Error processing file: {str(e)}",
                grammars={},
                analytics={},
                functions_processed=0,
                total_grammar_rules=0,
                total_choice_points=0,
                total_loop_patterns=0,
                overall_fuzzing_score=0.0
            )
    
    def _analyze_grammar(self, func_name: str, grammar: ContextFreeGrammar) -> GrammarAnalytics:
        """Generate comprehensive analytics for a context-free grammar"""
        
        # Basic metrics
        total_rules = len(grammar.rules)
        non_terminals_count = len(grammar.non_terminals)
        terminals_count = len(grammar.terminals)
        
        # Control flow metrics
        basic_blocks = [nt for nt in grammar.non_terminals if nt.startswith('BLOCK_')]
        choice_rules = [r for r in grammar.rules if 'CHOICE_POINT' in str(r)]
        loop_rules = [r for r in grammar.rules if 'LOOP_' in str(r)]
        instruction_rules = [r for r in grammar.rules if 'INSTRUCTION_SEQ' in str(r)]
        
        # Fuzzing optimization metrics
        optional_rules = [r for r in grammar.rules if 'OPT_' in str(r)]
        repetition_rules = [r for r in grammar.rules if 'REPEAT_' in str(r)]
        interleaving_rules = [r for r in grammar.rules if 'INTERLEAVE_' in str(r)]
        merge_rules = [r for r in grammar.rules if 'MERGE_' in str(r)]
        
        # Calculate path alternatives
        path_alternatives = len(choice_rules) + len(optional_rules) + len(repetition_rules)
        
        # Complexity metrics
        rule_lengths = [len(rule.rhs) for rule in grammar.rules]
        max_rule_length = max(rule_lengths) if rule_lengths else 0
        avg_rule_length = sum(rule_lengths) / len(rule_lengths) if rule_lengths else 0.0
        
        # Calculate branching factor
        branching_factor = self._calculate_branching_factor(grammar)
        
        # Estimate grammar depth
        depth_estimation = self._estimate_grammar_depth(grammar)
        
        # Identify instruction types and patterns
        instruction_types = self._extract_instruction_types(grammar)
        control_flow_patterns = self._identify_control_flow_patterns(grammar)
        
        # Calculate fuzzing readiness score
        fuzzing_score = self._calculate_fuzzing_score(
            choice_rules, loop_rules, optional_rules, 
            repetition_rules, interleaving_rules, basic_blocks
        )
        
        return GrammarAnalytics(
            function_name=func_name,
            total_rules=total_rules,
            non_terminals_count=non_terminals_count,
            terminals_count=terminals_count,
            start_symbol=grammar.start_symbol,
            basic_blocks_count=len(basic_blocks),
            choice_points_count=len(choice_rules),
            loop_patterns_count=len(loop_rules),
            instruction_sequences_count=len(instruction_rules),
            path_alternatives_count=path_alternatives,
            optional_execution_rules=len(optional_rules),
            repetition_patterns=len(repetition_rules),
            interleaving_rules=len(interleaving_rules),
            merge_points=len(merge_rules),
            max_rule_length=max_rule_length,
            avg_rule_length=avg_rule_length,
            branching_factor=branching_factor,
            depth_estimation=depth_estimation,
            instruction_types_covered=instruction_types,
            control_flow_patterns=control_flow_patterns,
            fuzzing_readiness_score=fuzzing_score
        )
    
    def _calculate_branching_factor(self, grammar: ContextFreeGrammar) -> float:
        """Calculate the average branching factor of the grammar"""
        nt_alternatives = {}
        
        for rule in grammar.rules:
            if rule.lhs not in nt_alternatives:
                nt_alternatives[rule.lhs] = 0
            nt_alternatives[rule.lhs] += 1
        
        if not nt_alternatives:
            return 0.0
        
        return sum(nt_alternatives.values()) / len(nt_alternatives)
    
    def _estimate_grammar_depth(self, grammar: ContextFreeGrammar) -> int:
        """Estimate the maximum derivation depth of the grammar"""
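        dependencies = {}
        
        # Accumulate children across *all* alternatives of a non-terminal;
        # plain assignment would keep only the last rule seen for each LHS.
        for rule in grammar.rules:
            children = dependencies.setdefault(rule.lhs, [])
            children.extend(symbol for symbol in rule.rhs if symbol in grammar.non_terminals)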
        
        def max_depth(symbol, visited=None):
            if visited is None:
                visited = set()
            if symbol in visited or symbol not in dependencies:
                return 0
            
            visited.add(symbol)
            max_child_depth = 0
            
            for child in dependencies[symbol]:
                max_child_depth = max(max_child_depth, max_depth(child, visited.copy()))
            
            return 1 + max_child_depth
        
        return max_depth(grammar.start_symbol)
    
    def _extract_instruction_types(self, grammar: ContextFreeGrammar) -> List[str]:
        """Extract the types of instructions covered by the grammar"""
        instruction_types = set()
        
        for terminal in grammar.terminals:
            if terminal in ['ADD', 'SUB', 'MUL', 'DIV', 'LOAD', 'STORE', 'BRANCH', 
                          'ICMP', 'FCMP', 'ALLOCA', 'PHI', 'CALL', 'RETURN']:
                instruction_types.add(terminal)
        
        return sorted(instruction_types)
    
    def _identify_control_flow_patterns(self, grammar: ContextFreeGrammar) -> List[str]:
        """Identify control flow patterns present in the grammar"""
        patterns = []
        
        rule_strings = [str(rule) for rule in grammar.rules]
        
        if any('CHOICE_POINT' in rule_str for rule_str in rule_strings):
            patterns.append('Conditional Branching')
        
        if any('LOOP_' in rule_str for rule_str in rule_strings):
            patterns.append('Iterative Loops')
        
        if any('INSTRUCTION_SEQ' in rule_str for rule_str in rule_strings):
            patterns.append('Sequential Execution')
        
        if any('MERGE_' in rule_str for rule_str in rule_strings):
            patterns.append('Control Flow Convergence')
        
        if any('INTERLEAVE_' in rule_str for rule_str in rule_strings):
            patterns.append('Complex Flow Interleaving')
        
        return patterns
    
    def _calculate_fuzzing_score(self, choice_rules, loop_rules, optional_rules, 
                                repetition_rules, interleaving_rules, basic_blocks) -> float:
        """
        Calculate a fuzzing readiness score (0-10) based on grammar characteristics.
        
        Higher scores indicate better suitability for fuzzing applications.
        """
        score = 0.0
        
        # Choice points are critical for fuzzing (0-3 points)
        choice_score = min(3.0, len(choice_rules) * 0.5)
        score += choice_score
        
        # Loop patterns enable iterative testing (0-2 points)
        loop_score = min(2.0, len(loop_rules) * 0.3)
        score += loop_score
        
        # Path alternatives provide fuzzing flexibility (0-2 points)
        alternatives_score = min(2.0, (len(optional_rules) + len(repetition_rules)) * 0.2)
        score += alternatives_score
        
        # Basic block coverage indicates completeness (0-2 points)
        block_score = min(2.0, len(basic_blocks) * 0.2)
        score += block_score
        
        # Complex flow patterns add fuzzing value (0-1 point)
        complexity_score = min(1.0, len(interleaving_rules) * 0.1)
        score += complexity_score
        
        return round(score, 2)
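
The scoring rule in `_calculate_fuzzing_score` is a sum of five capped, weighted counts. The sketch below restates it as a standalone function that takes plain integers instead of rule lists, so the arithmetic can be checked by hand. The function name `fuzzing_score` and the sample counts are made up for illustration.

```python
def fuzzing_score(choice, loops, optional, repetition, interleaving, blocks):
    """Capped weighted sum with the same weights and caps as the method above."""
    score = 0.0
    score += min(3.0, choice * 0.5)                    # choice points, max 3
    score += min(2.0, loops * 0.3)                     # loop patterns, max 2
    score += min(2.0, (optional + repetition) * 0.2)   # path alternatives, max 2
    score += min(2.0, blocks * 0.2)                    # basic blocks, max 2
    score += min(1.0, interleaving * 0.1)              # interleaving, max 1
    return round(score, 2)


# 4 choice rules, 2 loop rules, 3 optional + 1 repetition rules, 6 blocks:
# 2.0 + 0.6 + 0.8 + 1.2 + 0.0
small = fuzzing_score(choice=4, loops=2, optional=3, repetition=1,
                      interleaving=0, blocks=6)

# Every component saturates at its cap: 3 + 2 + 2 + 2 + 1
saturated = fuzzing_score(choice=50, loops=50, optional=50, repetition=50,
                          interleaving=50, blocks=50)

print(small, saturated)
```

Because each term is capped, the score stops distinguishing functions once a component saturates: six choice rules already earn the full 3 points, and ten basic blocks the full 2 points.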

### `process_llvm_file` *(from `llvm_file_processor.py`)*

Thin wrapper around :pyclass:`LLVMFileProcessor` for one-off calls.

This helper exists mostly for backwards compatibility and quick REPL
experimentation.

In [None]:
def process_llvm_file(file_path: str) -> FileProcessingResult:
    processor = LLVMFileProcessor()
    return processor.process_file(file_path)

### `process_multiple_files` *(from `llvm_file_processor.py`)*

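Process every path in *file_paths* with a single shared
:pyclass:`LLVMFileProcessor` instance.

The function collects the individual :pyclass:`FileProcessingResult`
objects in a dictionary keyed by file path so that callers can inspect
successes and failures side by side. Because ``process_file`` reports
failures through the result object instead of raising, one bad file does
not abort the batch.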

In [None]:
def process_multiple_files(file_paths: List[str]) -> Dict[str, FileProcessingResult]:
    processor = LLVMFileProcessor()
    results = {}
    
    for file_path in file_paths:
        results[file_path] = processor.process_file(file_path)
    
    return results

### `generate_batch_report` *(from `llvm_file_processor.py`)*

Human-readable summary for a whole *batch* of processed files.

The returned string is already formatted for terminal output and therefore
does **not** require additional pretty-printing.

In [None]:
def generate_batch_report(results: Dict[str, FileProcessingResult]) -> str:
    successful_files = [r for r in results.values() if r.success]
    failed_files = [r for r in results.values() if not r.success]
    
    if not results:
        return "No files processed."
    
    report = f"""
Batch Processing Report
======================

Summary:
- Total Files: {len(results)}
- Successful: {len(successful_files)}
- Failed: {len(failed_files)}

"""
    
    if successful_files:
        total_functions = sum(r.functions_processed for r in successful_files)
        total_rules = sum(r.total_grammar_rules for r in successful_files)
        total_choice_points = sum(r.total_choice_points for r in successful_files)
        avg_score = sum(r.overall_fuzzing_score for r in successful_files) / len(successful_files)
        
        report += f"""
Aggregate Statistics:
- Functions Processed: {total_functions}
- Grammar Rules Generated: {total_rules}
- Total Choice Points: {total_choice_points}
- Average Fuzzing Score: {avg_score:.2f}/10.0

Successful Files:
"""
        for result in successful_files:
            report += f"  ✅ {result.file_path} ({result.functions_processed} functions, score: {result.overall_fuzzing_score:.1f})\n"
    
    if failed_files:
        report += "\nFailed Files:\n"
        for result in failed_files:
            report += f"  ❌ {result.file_path}: {result.error_message}\n"
    
    return report
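
Note that the "Average Fuzzing Score" in the report is the unweighted mean of the per-file scores, so a file with one function counts as much as a file with many. The following sketch contrasts it with a function-weighted mean; the file names and numbers are invented for illustration.

```python
# (functions_processed, overall_fuzzing_score) for two hypothetical files
files = {"a.ll": (1, 8.0), "b.ll": (3, 2.0)}

# What the report prints: unweighted mean of the per-file scores
mean_of_file_scores = sum(score for _, score in files.values()) / len(files)

# Alternative: weight each file by the number of functions it contains
total_functions = sum(count for count, _ in files.values())
function_weighted = sum(count * score for count, score in files.values()) / total_functions

print(mean_of_file_scores, function_weighted)
```

Which of the two is preferable depends on whether the batch is meant to rank files or to summarise functions.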

### `dump_grammars_to_file` *(from `llvm_file_processor.py`)*

Serialise a set of grammars (plus optional analytics) into a text file.

The resulting *dump* is aimed at researchers who prefer to inspect the
grammar in a plain editor instead of loading the notebook / Python
objects.

In [None]:
def dump_grammars_to_file(
    grammars: Dict[str, ContextFreeGrammar],
    output_path: str,
    analytics: Optional[Dict[str, GrammarAnalytics]] = None,
) -> bool:
    from datetime import datetime  # local import so the cell stays self-contained

    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write("="*80 + "\n")
            f.write("COMPREHENSIVE CONTEXT-FREE GRAMMAR DUMP\n")
            f.write("="*80 + "\n")
            f.write("Generated from LLVM-IR analysis\n")
            f.write(f"Total Functions: {len(grammars)}\n")
            f.write(f"Timestamp: {datetime.now()}\n")
            f.write("="*80 + "\n\n")
            
            # Summary statistics
            total_rules = sum(len(g.rules) for g in grammars.values())
            total_non_terminals = sum(len(g.non_terminals) for g in grammars.values())
            total_terminals = sum(len(g.terminals) for g in grammars.values())
            
            f.write("SUMMARY STATISTICS\n")
            f.write("-"*50 + "\n")
            f.write(f"Total Production Rules: {total_rules}\n")
            f.write(f"Total Non-terminals: {total_non_terminals}\n")
            f.write(f"Total Terminals: {total_terminals}\n")
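            # Guard against an empty grammar dictionary (avoids ZeroDivisionError)
            avg_rules = total_rules / len(grammars) if grammars else 0.0
            f.write(f"Average Rules per Function: {avg_rules:.2f}\n\n")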
            
            # If analytics available, show fuzzing readiness overview
            if analytics:
                avg_fuzzing_score = sum(a.fuzzing_readiness_score for a in analytics.values()) / len(analytics)
                total_choice_points = sum(a.choice_points_count for a in analytics.values())
                total_loops = sum(a.loop_patterns_count for a in analytics.values())
                
                f.write("FUZZING READINESS OVERVIEW\n")
                f.write("-"*50 + "\n")
                f.write(f"Average Fuzzing Score: {avg_fuzzing_score:.2f}/10.0\n")
                f.write(f"Total Choice Points: {total_choice_points}\n")
                f.write(f"Total Loop Patterns: {total_loops}\n")
                f.write(f"Functions with High Fuzzing Score (>7.0): {sum(1 for a in analytics.values() if a.fuzzing_readiness_score > 7.0)}\n\n")
            
            # Detailed grammar dump for each function
            for i, (func_name, grammar) in enumerate(grammars.items(), 1):
                f.write("="*80 + "\n")
                f.write(f"FUNCTION {i}: {func_name}\n")
                f.write("="*80 + "\n\n")
                
                # Basic grammar information
                f.write("GRAMMAR OVERVIEW\n")
                f.write("-"*40 + "\n")
                f.write(f"Start Symbol: {grammar.start_symbol}\n")
                f.write(f"Production Rules: {len(grammar.rules)}\n")
                f.write(f"Non-terminals: {len(grammar.non_terminals)}\n")
                f.write(f"Terminals: {len(grammar.terminals)}\n\n")
                
                # Analytics if available
                if analytics and func_name in analytics:
                    anal = analytics[func_name]
                    f.write("FUZZING ANALYTICS\n")
                    f.write("-"*40 + "\n")
                    f.write(f"Fuzzing Readiness Score: {anal.fuzzing_readiness_score:.2f}/10.0\n")
                    f.write(f"Basic Blocks: {anal.basic_blocks_count}\n")
                    f.write(f"Choice Points: {anal.choice_points_count}\n")
                    f.write(f"Loop Patterns: {anal.loop_patterns_count}\n")
                    f.write(f"Path Alternatives: {anal.path_alternatives_count}\n")
                    f.write(f"Branching Factor: {anal.branching_factor:.2f}\n")
                    f.write(f"Estimated Depth: {anal.depth_estimation}\n")
                    f.write(f"Instruction Types: {', '.join(anal.instruction_types_covered[:10])}\n")
                    if len(anal.instruction_types_covered) > 10:
                        f.write(f"... and {len(anal.instruction_types_covered) - 10} more\n")
                    f.write(f"Control Flow Patterns: {', '.join(anal.control_flow_patterns)}\n\n")
                
                # Non-terminals
                f.write("NON-TERMINALS\n")
                f.write("-"*40 + "\n")
                sorted_nts = sorted(grammar.non_terminals)
                for j in range(0, len(sorted_nts), 8):
                    f.write("  " + ", ".join(sorted_nts[j:j+8]) + "\n")
                f.write("\n")
                
                # Terminals
                f.write("TERMINALS\n")
                f.write("-"*40 + "\n")
                sorted_ts = sorted(grammar.terminals)
                for j in range(0, len(sorted_ts), 10):
                    f.write("  " + ", ".join(sorted_ts[j:j+10]) + "\n")
                f.write("\n")
                
                # Production rules organized by categories
                f.write("PRODUCTION RULES\n")
                f.write("-"*40 + "\n")
                
                # Group rules by left-hand side
                rules_by_lhs = {}
                for rule in grammar.rules:
                    if rule.lhs not in rules_by_lhs:
                        rules_by_lhs[rule.lhs] = []
                    rules_by_lhs[rule.lhs].append(rule.rhs)
                
                # Show rules in organized categories
                categories = {
                    'Function Entry': ['FUNC_', 'ENTRY_'],
                    'Basic Blocks': ['BLOCK_', 'BB_'],
                    'Control Flow': ['CHOICE_POINT', 'BRANCH_', 'CONDITIONAL_'],
                    'Loops': ['LOOP_', 'WHILE_', 'FOR_', 'REPEAT_'],
                    'Instructions': ['INSTRUCTION_SEQ', 'INST_', 'OP_'],
                    'Data Flow': ['PHI_', 'SELECT_', 'ASSIGN_'],
                    'Memory Operations': ['LOAD_', 'STORE_', 'ALLOCA_'],
                    'Paths & Alternatives': ['PATH_', 'ALT_', 'OPT_', 'CHOICE_'],
                    'Other': []
                }
                
                for category, patterns in categories.items():
                    category_rules = []
                    for lhs in sorted(rules_by_lhs.keys()):
                        if patterns and any(pattern in lhs for pattern in patterns):
                            category_rules.append(lhs)
                        elif not patterns:  # 'Other' category
                            if not any(any(p in lhs for p in pats) for pats in list(categories.values())[:-1]):
                                category_rules.append(lhs)
                    
                    if category_rules:
                        f.write(f"\n{category} Rules:\n")
                        for lhs in category_rules[:15]:  # Limit to first 15 rules per category
                            alternatives = rules_by_lhs[lhs]
                            f.write(f"{lhs} ->")
                            for k, rhs in enumerate(alternatives[:3]):  # Show up to 3 alternatives
                                # Every continuation line gets the same indent and a '|';
                                # nothing is written after the last alternative.
                                connector = "      |" if k > 0 else ""
                                rhs_str = ' '.join(rhs) if rhs else 'EPSILON'
                                if len(rhs_str) > 80:
                                    rhs_str = rhs_str[:77] + "..."
                                f.write(f"{connector} {rhs_str}\n")
                            if len(alternatives) > 3:
                                f.write(f"      | ... and {len(alternatives) - 3} more alternatives\n")
                        
                        if len(category_rules) > 15:
                            f.write(f"... and {len(category_rules) - 15} more {category.lower()} rules\n")
                
                f.write("\n" + "-"*80 + "\n\n")
            
            f.write("="*80 + "\n")
            f.write("END OF GRAMMAR DUMP\n")
            f.write("="*80 + "\n")
        
        return True
        
    except Exception as e:
        print(f"Error writing grammar dump to {output_path}: {e}")
        return False
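
One detail of the category matching above is worth knowing: patterns are tested with `in`, i.e. as substrings, so a short pattern such as `'OP_'` also matches inside `'LOOP_'`. The sketch below compares substring matching with prefix matching on a reduced category table. The non-terminal name `LOOP_header` is hypothetical, and prefix matching is only an alternative under the assumption that non-terminals begin with their category prefix.

```python
categories = {
    "Loops": ["LOOP_", "WHILE_", "FOR_", "REPEAT_"],
    "Instructions": ["INSTRUCTION_SEQ", "INST_", "OP_"],
}


def by_substring(lhs):
    """Categories whose patterns occur anywhere in the name (as in the dump)."""
    return [name for name, pats in categories.items()
            if any(p in lhs for p in pats)]


def by_prefix(lhs):
    """Categories whose patterns occur at the start of the name."""
    return [name for name, pats in categories.items()
            if any(lhs.startswith(p) for p in pats)]


print(by_substring("LOOP_header"))
print(by_prefix("LOOP_header"))
```

With substring matching the loop rule is listed under both "Loops" and "Instructions"; with prefix matching it appears once.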

### `dump_grammars_from_file` *(from `llvm_file_processor.py`)*

Utility that combines :pyfunc:`process_llvm_file` **and**
:pyfunc:`dump_grammars_to_file` for one-stop *CLI* convenience.

In [None]:
def dump_grammars_from_file(llvm_file_path: str, output_path: Optional[str] = None) -> bool:
    # Auto-generate output path if not provided
    if output_path is None:
        base_name = os.path.splitext(os.path.basename(llvm_file_path))[0]
        output_path = f"{base_name}_grammar_dump.txt"
    
    # Process the LLVM file
    result = process_llvm_file(llvm_file_path)
    
    if not result.success:
        print(f"❌ Failed to process {llvm_file_path}: {result.error_message}")
        return False
    
    if not result.grammars:
        print(f"❌ No grammars generated from {llvm_file_path}")
        return False
    
    # Dump grammars to file
    success = dump_grammars_to_file(result.grammars, output_path, result.analytics)
    
    if success:
        print(f"Grammar dump saved to: {output_path}")
        print(f"Processed {result.functions_processed} functions")
        print(f"Generated {result.total_grammar_rules} grammar rules")
        print(f"Overall fuzzing score: {result.overall_fuzzing_score:.1f}/10.0")
    else:
        print(f"Failed to save grammar dump to {output_path}")
    
    return success
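
When *output_path* is omitted, the name is derived from the input file's base name. The short sketch below isolates that derivation; `default_dump_path` is a hypothetical helper name and the paths are examples only.

```python
import os


def default_dump_path(llvm_file_path: str) -> str:
    """Same derivation as above: strip directory and last extension, add a suffix."""
    base_name = os.path.splitext(os.path.basename(llvm_file_path))[0]
    return f"{base_name}_grammar_dump.txt"


print(default_dump_path("build/ir/sample.ll"))   # sample_grammar_dump.txt
print(default_dump_path("archive.v2.ll"))        # archive.v2_grammar_dump.txt
```

The directory part is dropped, so the dump is written relative to the current working directory rather than next to the input file, and two inputs with the same base name in different directories map to the same output name.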

### `extract_grammar_from_file` *(from `llvm_grammar_extractor.py`)*

High-level helper that loads *llvm_file_path* and converts it into
grammars using the core pipeline from :pymod:`llvm_cfg_generator`.

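This is the *lowest* file-based layer in the extractor module: helpers such
as :pyfunc:`extract_single_function_grammar` build on it. It raises
``FileNotFoundError`` for a missing path and ``ValueError`` when the IR
cannot be processed or contains no functions.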

In [None]:
def extract_grammar_from_file(llvm_file_path: str) -> Dict[str, ContextFreeGrammar]:
    
    # Validate file exists
    if not os.path.exists(llvm_file_path):
        raise FileNotFoundError(f"LLVM-IR file not found: {llvm_file_path}")
    
    try:
        # Read file content
        with open(llvm_file_path, 'r', encoding='utf-8') as f:
            llvm_ir_content = f.read()
        
        # Generate grammars
        grammars = llvm_ir_to_context_free_grammar(llvm_ir_content)
        
        if not grammars:
            raise ValueError("No functions found in LLVM-IR file")
        
        return grammars
        
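    except IOError as e:
        raise IOError(f"Error reading file {llvm_file_path}: {e}") from e
    except ValueError:
        # Already descriptive (e.g. "No functions found ..."); do not re-wrap
        raise
    except Exception as e: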
        raise ValueError(f"Error processing LLVM-IR: {e}")
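
The function above converts arbitrary processing failures into `ValueError`. A generic sketch of that wrap-and-re-raise pattern follows, written with explicit chaining (`raise ... from e`) so that the original exception stays reachable through `__cause__`. `parse_stub` and `extract` are hypothetical stand-ins, not the project's functions.

```python
def parse_stub(text: str) -> dict:
    """Hypothetical stand-in for the real IR-to-grammar conversion."""
    if "define" not in text:
        raise RuntimeError("unexpected token")
    return {"f": text}


def extract(text: str) -> dict:
    try:
        return parse_stub(text)
    except Exception as e:
        # 'from e' records the original exception as __cause__
        raise ValueError(f"Error processing LLVM-IR: {e}") from e


try:
    extract("not IR")
except ValueError as err:
    # Copy what is needed: 'err' is unbound once the except block ends
    message = str(err)
    cause = err.__cause__

print(message)
print(type(cause).__name__)
```

Callers that only care about success or failure can catch `ValueError`, while debugging code can still inspect the underlying error.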

### `extract_single_function_grammar` *(from `llvm_grammar_extractor.py`)*

Thin convenience wrapper around :pyfunc:`extract_grammar_from_file`.

It returns ``grammars.get(function_name)``, i.e. ``None`` when the file
contains no function of that name, which saves callers a separate dictionary
look-up. Errors raised by :pyfunc:`extract_grammar_from_file` propagate
unchanged.

In [None]:
def extract_single_function_grammar(llvm_file_path: str, function_name: str) -> Optional[ContextFreeGrammar]:
    
    grammars = extract_grammar_from_file(llvm_file_path)
    return grammars.get(function_name)

### `extract_grammar_from_string` *(from `llvm_grammar_extractor.py`)*

Like :pyfunc:`extract_grammar_from_file` but works on an in-memory
string instead of reading from disk.

In [None]:
def extract_grammar_from_string(llvm_ir_code: str) -> Dict[str, ContextFreeGrammar]:
    
    if not llvm_ir_code or not llvm_ir_code.strip():
        raise ValueError("Empty LLVM-IR code provided")
    
    try:
        grammars = llvm_ir_to_context_free_grammar(llvm_ir_code)
        
        if not grammars:
            raise ValueError("No functions found in LLVM-IR code")
        
        return grammars
        
    except ValueError:
        # Already descriptive (e.g. "No functions found ..."); do not re-wrap
        raise
    except Exception as e:
        raise ValueError(f"Error processing LLVM-IR: {e}")
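
For quick experiments, a small IR module can be kept in a Python string. The sketch below only prepares such an input and repeats the emptiness guard. The actual call is left as a comment because it needs the notebook's earlier cells; `SAMPLE_IR` and `is_blank` are illustrative names.

```python
# Small hand-written module: one function with a conditional branch.
SAMPLE_IR = """
define i32 @max(i32 %a, i32 %b) {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %then, label %else

then:
  ret i32 %a

else:
  ret i32 %b
}
"""


def is_blank(llvm_ir_code: str) -> bool:
    """Same emptiness test as the guard at the top of the function above."""
    return not llvm_ir_code or not llvm_ir_code.strip()


# Lines ending in ':' are the labels that introduce basic blocks.
block_labels = [line.rstrip(":") for line in SAMPLE_IR.splitlines()
                if line.endswith(":")]

print(is_blank(SAMPLE_IR), is_blank("   \n"), block_labels)

# With the cells above executed, the call would be:
# grammars = extract_grammar_from_string(SAMPLE_IR)
```

Whether the resulting dictionary is keyed by `max` or `@max` depends on the parser in `llvm_cfg_generator` and is not shown in this section.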