Pipespy

A Unix pipeline debugger, profiler, and optimizer. Give it any shell pipeline string and Pipespy runs each stage while capturing intermediate data, timing, and line/byte counts between stages. It then renders a visual flow report showing where data gets filtered, which stages are bottlenecks, and how much data flows through each pipe.

The novel part: Pipespy also acts as a linter for shell pipelines. It detects common anti-patterns (useless use of cat, sort-before-grep, grep piped to wc -l, redundant sorts, awk-used-as-cut) and suggests concrete, runnable optimizations — rewritten pipeline fragments with estimated speedup. Think of it as a profiler, debugger, and static analyzer combined into one tool for the humblest unit of Unix computing: the pipe.

Pipespy works with any pipeline you can express as a string. It handles quoted arguments, nested subshells, escaped pipes, and environment variable prefixes. It runs each stage sequentially with intercepted I/O, so you get an accurate picture of what happens at every step without modifying your original commands.
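The core of that parsing step is splitting the pipeline string on pipe characters that are not quoted, escaped, or inside a subshell. A minimal sketch of the idea (not the actual parser.py, which also handles env prefixes and other cases):

```python
def split_pipeline(cmd: str) -> list[str]:
    """Split a pipeline string on pipes that are not inside quotes,
    escaped with a backslash, or nested in a subshell."""
    stages, buf = [], []
    quote = None          # current quote character, if any
    depth = 0             # subshell paren nesting depth
    i = 0
    while i < len(cmd):
        c = cmd[i]
        if c == "\\" and i + 1 < len(cmd):
            buf.append(cmd[i:i + 2])   # keep escape pairs verbatim
            i += 2
            continue
        if quote:
            if c == quote:
                quote = None
            buf.append(c)
        elif c in "'\"":
            quote = c
            buf.append(c)
        elif c == "(":
            depth += 1
            buf.append(c)
        elif c == ")":
            depth -= 1
            buf.append(c)
        elif c == "|" and depth == 0:
            stages.append("".join(buf).strip())
            buf = []
        else:
            buf.append(c)
        i += 1
    stages.append("".join(buf).strip())
    return stages
```

So `grep 'a|b'` stays one stage, `\|` is not a stage boundary, and `(sort a | head)` is treated as a single subshell stage.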

Features

  • Stage-by-stage profiling: execution time, input/output line counts, byte sizes, and percentage breakdowns
  • Data flow visualization: see how many lines and bytes flow between each pair of stages, and where data gets dropped
  • Bottleneck detection: automatically identifies the slowest stage
  • Filter analysis: finds the stage that removes the most data and shows filter ratios
  • Anti-pattern detection: identifies 8 common pipeline inefficiencies
    • Useless use of cat
    • Sort before grep (wasted sort on unfiltered data)
    • Consecutive grep chains that could be combined
    • Redundant sorts
    • Echo piped to a command (use here-strings)
    • grep | wc -l (use grep -c)
    • awk used only as cut
    • Large sorts without prior filtering
  • Optimization suggestions: concrete rewrites with estimated speedups
    • Remove useless cat (use direct file read or redirection)
    • Replace grep | wc -l with grep -c
    • Move filters before sorts to reduce data volume
    • Add LC_ALL=C for faster byte-wise sorting/matching
    • Use sort --parallel for large datasets
  • Sample data inspection: peek at the actual data flowing through each stage
  • JSON output: machine-readable results for integration with other tools
  • Static analysis mode: analyze pipelines without executing them
  • Timeout control: per-stage timeouts to prevent runaway commands

Installation

Requires Python 3.9+. No external dependencies.

# Clone and install
git clone https://github.com/bkmashiro/pipespy.git
cd pipespy
pip install -e .

# Or run directly without installing
python -m pipespy "your | pipeline | here"

Quick Start

# Profile a pipeline and see where time and data go
pipespy "cat /var/log/syslog | grep error | sort | uniq -c | sort -rn | head -20"

# Get optimization suggestions for a messy pipeline
pipespy "cat access.log | sort | grep 404 | sort | grep -v static | wc -l" --no-color

# Static analysis only — no execution, just anti-pattern detection
pipespy "cat file | sort | grep pattern | wc -l" --no-run

# JSON output for scripting
pipespy "echo hello | wc -w" --json

# Inspect actual data at each stage
pipespy "printf 'banana\napple\ncherry\n' | sort | head -2" --samples

Usage

Basic profiling

pipespy "cat server.log | grep '\" 500 ' | awk '{print \$1}' | sort | uniq -c | sort -rn | head"

Output shows each stage with timing, line counts, filter ratios, and a visual time bar:

  Stage 1: cat server.log
    Time:      683us  ██░░░░░░░░░░░░░░░░░░  12%
    Out:       5,000 lines  (609.3 KB)
    --------------------------------------------------
  Stage 2 [BIGGEST FILTER]: grep '" 500 '
    Time:      1.1ms  ███░░░░░░░░░░░░░░░░░  19%
    In:        5,000 lines  (609.3 KB)
    Out:         284 lines  (34.7 KB)
    Filter:    94.3% removed
    ...

Anti-pattern detection

Pipespy automatically flags common mistakes:

pipespy "cat log | sort | grep ERROR | sort | grep FATAL | wc -l" --no-run
 Anti-patterns Detected
------------------------------------------------------------
  [i] useless-cat (stage 1)
      Useless use of cat — sort can read files directly.
      Suggestion: Replace `cat log | sort ...` with `sort ... log`

  [~] sort-before-grep (stage 2)
      sort runs before grep — the sort processes more data than necessary.
      Suggestion: Move grep before sort to filter first, then sort the smaller result set.

  [i] grep-wc (stage 6)
      grep piped to wc -l — grep has a built-in count flag.
      Suggestion: Replace `grep FATAL | wc -l` with `grep -c FATAL`
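Each detector is essentially a pattern over adjacent stages. As an illustration of the shape (the real detectors live in antipatterns.py and may differ), a grep-wc check could look like:

```python
import re

def detect_grep_wc(stages: list[str]) -> list[dict]:
    """Flag `grep ... | wc -l` pairs; grep's -c flag counts matching
    lines directly, saving a process and a pipe."""
    findings = []
    for i in range(len(stages) - 1):
        if stages[i].lstrip().startswith("grep") and re.fullmatch(
            r"wc\s+-l", stages[i + 1].strip()
        ):
            findings.append({
                "pattern": "grep-wc",
                "stage": i + 2,   # report stages as 1-indexed, like the output above
                "suggestion": stages[i].replace("grep", "grep -c", 1),
            })
    return findings
```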

Optimization suggestions

pipespy "cat access.log | grep 404 | awk '{print \$1}' | sort | uniq -c | sort -rn | head"
 Optimization Suggestions
------------------------------------------------------------
  remove-cat: Remove useless cat — grep can read access.log directly.
    Before: cat access.log | grep 404 | awk '{print $1}' | sort | uniq -c | sort -rn | head
    After:  grep 404 access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head
    Speedup: eliminates one process + pipe

  lc-all-c: Set LC_ALL=C for grep for byte-wise comparison (much faster on ASCII data).
    Before: grep 404
    After:  LC_ALL=C grep 404
    Speedup: ~2-5x for sort/grep on ASCII
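Optimization passes like lc-all-c are mechanical rewrites over the stage list. A simplified sketch of that pass (the actual optimizer.py may guard for more cases, e.g. non-ASCII data):

```python
def add_lc_all_c(stages: list[str]) -> list[str]:
    """Prepend LC_ALL=C to sort/grep stages that lack a locale prefix,
    enabling byte-wise comparison on ASCII data."""
    out = []
    for s in stages:
        cmd = s.strip()
        if cmd.split()[0] in ("sort", "grep") and "LC_ALL=" not in cmd:
            cmd = "LC_ALL=C " + cmd
        out.append(cmd)
    return out
```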

JSON output

pipespy "printf 'a\nb\nc\n' | sort | head -2" --json

Returns a structured JSON object with stages, summary, antipatterns, and optimizations arrays — useful for CI integration or building dashboards.
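For example, a CI job could fail a build when a committed pipeline triggers anti-patterns. A sketch using only the top-level keys named above (sub-key names inside each array are assumptions and may vary by version):

```python
import json

def gate_on_antipatterns(report_json: str, max_issues: int = 0) -> bool:
    """Return True if a pipespy --json report contains no more than
    max_issues detected anti-patterns."""
    report = json.loads(report_json)
    return len(report.get("antipatterns", [])) <= max_issues
```

Feed it the stdout of `pipespy "..." --json --no-run` and exit non-zero when it returns False.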

Data flow visualization

The Data Flow section shows exactly how data moves between stages:

 Data Flow
------------------------------------------------------------
           cat -->    5,000 lines (609.3 KB)  (-94.3%)  grep
          grep -->      284 lines (34.7 KB)             awk
           awk -->      284 lines (3.7 KB)              sort
          sort -->      284 lines (3.7 KB)   (-82.7%)   uniq
          uniq -->       49 lines (1.0 KB)              sort
          sort -->       49 lines (1.0 KB)   (-79.6%)   head
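The percentages on each edge are just the fraction of lines dropped between consecutive stages. Given the per-stage output line counts, the computation is:

```python
def filter_ratios(line_counts: list[int]) -> list[float]:
    """Percentage of lines removed at each pipe boundary, given the
    output line count of every stage in order."""
    ratios = []
    for prev, cur in zip(line_counts, line_counts[1:]):
        removed = 100.0 * (prev - cur) / prev if prev else 0.0
        ratios.append(round(removed, 1))
    return ratios
```

Applied to the counts above, 5,000 → 284 lines at the grep stage is a 94.3% reduction, matching the report.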

Feeding input data

# Feed a file as stdin to the first stage
pipespy "grep ERROR | awk '{print \$NF}' | sort | uniq -c" --input server.log

CLI flags

Flag              Description
-s, --samples     Show sample data (first/last 5 lines) at each stage
-j, --json        Output results as JSON
--no-color        Disable ANSI color codes
--no-run          Static analysis only (no execution)
-t, --timeout     Per-stage timeout in seconds (default: 60)
--keep            Keep intermediate temp files (prints paths to stderr)
-i, --input       Feed a file as stdin to the first stage
-V, --version     Show version

Architecture

src/pipespy/
  __init__.py        Package metadata and version
  __main__.py        python -m pipespy entry point
  cli.py             Argument parsing, orchestration
  parser.py          Pipeline string -> list of PipelineStage objects
                     Handles quoting, escapes, subshells, env prefixes
  executor.py        Runs each stage sequentially with intercepted I/O
                     Captures timing, byte counts, line counts, samples
  analyzer.py        Computes aggregate stats: bottleneck, biggest filter,
                     overall reduction, time fractions, data flow edges
  antipatterns.py    8 pattern detectors that flag common pipeline mistakes
  optimizer.py       5 optimization passes that suggest concrete rewrites
  display.py         Renders visual reports (ANSI terminal) or JSON output

The execution model is straightforward: the parser splits the pipeline string by unquoted pipe characters, the executor runs each stage as a subprocess (feeding the previous stage's output file as stdin), and the analyzer/antipatterns/optimizer modules examine the results to produce insights. All intermediate data is written to temporary files that are cleaned up after analysis (unless --keep is passed).
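A condensed sketch of that executor loop (the real executor.py writes intermediate data to temp files and enforces per-stage timeouts; output is kept in memory here for brevity):

```python
import subprocess
import time

def run_pipeline_staged(stages: list[str]) -> list[dict]:
    """Run each stage as its own subprocess, feeding the previous
    stage's captured output as stdin, and record timing and sizes."""
    prev = None            # bytes produced by the previous stage
    results = []
    for cmd in stages:
        start = time.perf_counter()
        proc = subprocess.run(
            cmd, shell=True, input=prev, capture_output=True, timeout=60
        )
        results.append({
            "cmd": cmd,
            "seconds": time.perf_counter() - start,
            "out_bytes": len(proc.stdout),
            "out_lines": proc.stdout.count(b"\n"),
        })
        prev = proc.stdout
    return results
```

Because each stage runs to completion before the next starts, the measurements are per-stage and exact, at the cost of losing the concurrency a real shell pipeline would have.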

Running Tests

pip install pytest
pytest tests/ -v

78 tests covering the parser, executor, analyzer, anti-pattern detector, optimizer, display renderer, and end-to-end CLI integration.

Running the Demo

bash demo/run_demo.sh

Generates a 5,000-line sample access log, then runs a realistic pipeline through Pipespy in visual, JSON, and static-analysis modes.

License

MIT
