diff --git a/crates/bashkit/docs/threat-model.md b/crates/bashkit/docs/threat-model.md index d2558d8a..4029785d 100644 --- a/crates/bashkit/docs/threat-model.md +++ b/crates/bashkit/docs/threat-model.md @@ -390,11 +390,71 @@ let bash = Bash::builder() // Commits use virtual identity, never host ~/.gitconfig ``` +### Unicode Security (TM-UNI-*) + +Unicode input from untrusted scripts creates attack surface across the parser, builtins, +and virtual filesystem. AI agents frequently generate multi-byte Unicode (box-drawing, +emoji, CJK) that exercises these code paths. + +**Byte-Boundary Safety (TM-UNI-001/002/015/016/017):** + +Multiple builtins mix byte offsets with character indices, causing panics on multi-byte +input. All are caught by `catch_unwind` (TM-INT-001) preventing process crash, but the +builtin silently fails. + +| Threat | Attack Example | Mitigation | Status | +|--------|---------------|------------|--------| +| Awk byte-boundary panic (TM-UNI-001) | Multi-byte chars in awk input | `catch_unwind` catches panic | PARTIAL | +| Sed byte-boundary panic (TM-UNI-002) | Box-drawing chars in sed pattern | `catch_unwind` catches panic | PARTIAL | +| Expr substr panic (TM-UNI-015) | `expr substr "café" 4 1` | `catch_unwind` catches panic | PARTIAL | +| Printf precision panic (TM-UNI-016) | `printf "%.1s" "é"` | `catch_unwind` catches panic | PARTIAL | +| Cut/tr byte-level parsing (TM-UNI-017) | `tr 'é' 'e'` — multi-byte in char set | `catch_unwind` catches; silent data loss | PARTIAL | + +**Additional Byte/Char Confusion:** + +| Threat | Attack Example | Mitigation | Status | +|--------|---------------|------------|--------| +| Interpreter arithmetic (TM-UNI-018) | Multi-byte before `=` in arithmetic | Wrong operator detection; no panic | PARTIAL | +| Network allowlist (TM-UNI-019) | Multi-byte in allowlist URL path | Wrong path boundary check | PARTIAL | +| Zero-width in filenames (TM-UNI-003) | Invisible chars create confusable names | Path validation (planned) | UNMITIGATED | +| Homoglyph confusion (TM-UNI-006) | Cyrillic 'а' vs Latin 'a' in filenames | Accepted risk | ACCEPTED | +| Normalization bypass (TM-UNI-008) | NFC vs NFD create distinct files | Matches Linux FS behavior | ACCEPTED | +| Bidi in script source (TM-UNI-014) | RTL overrides hide malicious code | Scripts untrusted by design | ACCEPTED | + +**Safe Components (confirmed by full codebase audit):** +- Lexer: `Chars` iterator with `ch.len_utf8()` tracking +- wc: Correct `.len()` vs `.chars().count()` usage +- grep/jq: Delegate to Unicode-aware regex/jaq crates +- sort/uniq: String comparison, no byte indexing +- logging: Uses `is_char_boundary()` correctly +- python: Shebang strip via `find('\n')` — ASCII delimiter, safe +- Python bindings (bashkit-python): PyO3 `String` extraction, no manual byte/char ops +- eval harness: `chars().take()`, `from_utf8_lossy()` — all safe patterns +- curl/bc/export/date/comm/echo/archive/base64: All `.find()` use ASCII delimiters only +- scripted_tool: No byte/char patterns + +**Path Validation:** + +Filenames are validated by `find_unsafe_path_char()` which rejects: +- ASCII control characters (U+0000-U+001F, U+007F) +- C1 control characters (U+0080-U+009F) +- Bidi override characters (U+202A-U+202E, U+2066-U+2069) + +Normal Unicode (accented, CJK, emoji) is allowed in filenames and script content. + +**Caller Responsibility:** +- Strip zero-width/invisible characters from filenames before displaying to users +- Apply confusable-character detection (UTS #39) if showing filenames to humans +- Strip bidi overrides from script source before displaying to code reviewers +- Be aware that expr/printf/cut/tr may fail on non-ASCII input until fixes land +- Use ASCII in network allowlist URL patterns until byte/char fix lands + ## Security Testing Bashkit includes comprehensive security tests: - **Threat Model Tests**: [`tests/threat_model_tests.rs`][threat_tests] - 117 tests +- **Unicode Security Tests**: `tests/unicode_security_tests.rs` - TM-UNI-* tests - **Nesting Depth Tests**: 18 tests covering positive, negative, misconfiguration, and regression scenarios for parser depth attacks - **Fail-Point Tests**: [`tests/security_failpoint_tests.rs`][failpoint_tests] - 14 tests @@ -425,6 +485,7 @@ All threats use stable IDs in the format `TM--`: | TM-LOG | Logging Security | | TM-GIT | Git Security | | TM-PY | Python/Monty Security | +| TM-UNI | Unicode Security | Full threat analysis: [`specs/006-threat-model.md`][spec] diff --git a/crates/bashkit/tests/unicode_security_tests.rs b/crates/bashkit/tests/unicode_security_tests.rs new file mode 100644 index 00000000..59eb22c1 --- /dev/null +++ b/crates/bashkit/tests/unicode_security_tests.rs @@ -0,0 +1,1059 @@ +//! Unicode Security Tests (TM-UNI-*) +//! +//! Tests for Unicode-specific threats identified in specs/006-threat-model.md. +//! Covers byte-boundary safety, invisible characters, homoglyphs, normalization, +//! and bidi attacks across parser, builtins, and VFS. +//! +//! Run with: `cargo test unicode_` + +use bashkit::{Bash, FileSystem, FsLimits, InMemoryFs}; +use std::path::Path; + +// ============================================================================= +// 1. BUILTIN PARSER BYTE-BOUNDARY SAFETY (TM-UNI-001, TM-UNI-002) +// ============================================================================= + +mod byte_boundary_safety { + use super::*; + + /// TM-UNI-001: Awk parser must not panic on multi-byte Unicode in comments. + /// Reproduces issue #395: box-drawing characters (U+2500, 3 bytes each) + /// cause byte-boundary panic in awk parser. + #[tokio::test] + async fn unicode_awk_multibyte_comment_no_panic() { + let mut bash = Bash::new(); + // Box-drawing chars in awk comment (the exact pattern from issue #395) + let result = bash + .exec( + r#"echo "hello" | awk '# ── Pass 1 ── +{print $1}'"#, + ) + .await + .unwrap(); + // May fail to parse correctly (TM-UNI-001 is PARTIAL), but must not crash + // the process. catch_unwind (TM-INT-001) should catch any panic. + // We accept either success or a non-zero exit code. + let _ = result.exit_code; + } + + /// TM-UNI-001: Awk parser handles multi-byte chars in string literals + #[tokio::test] + async fn unicode_awk_multibyte_string_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "café" | awk '{print "→ " $0}'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-001: Awk parser handles CJK characters in input + #[tokio::test] + async fn unicode_awk_cjk_input_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "日本語 テスト" | awk '{print $1}'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-001: Awk parser handles emoji in input + #[tokio::test] + async fn unicode_awk_emoji_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "hello 🌍 world" | awk '{print $2}'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-001: Awk with mixed ASCII and multi-byte in field separator + #[tokio::test] + async fn unicode_awk_multibyte_field_separator_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "a│b│c" | awk -F'│' '{print $2}'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-001: Awk with multi-byte in pattern match + #[tokio::test] + async fn unicode_awk_multibyte_pattern_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"printf "café\ntest\n" | awk '/café/{print "found: " $0}'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-001: Awk with multi-byte chars in variable assignment + #[tokio::test] + async fn unicode_awk_multibyte_variable_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "test" | awk 'BEGIN{x="─═─"} {print x, $0}'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed handles multi-byte Unicode without panic + #[tokio::test] + async fn unicode_sed_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "café latte" | sed 's/café/coffee/'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed with CJK replacement + #[tokio::test] + async fn unicode_sed_cjk_replacement_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "hello world" | sed 's/world/世界/'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed with box-drawing chars in pattern + #[tokio::test] + async fn unicode_sed_box_drawing_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "──border──" | sed 's/──//g'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-001/002: Grep handles multi-byte patterns + #[tokio::test] + async fn unicode_grep_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "café" | grep "café""#).await.unwrap(); + let _ = result.exit_code; + } + + /// Stress test: many different multi-byte chars in single awk program + #[tokio::test] + async fn unicode_awk_stress_mixed_multibyte() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#"printf "α β γ δ ε\n日本 中文 한국\n🌍 🌎 🌏\n" | awk '{ + for(i=1;i<=NF;i++) print NR, i, $i +}'"#, + ) + .await + .unwrap(); + let _ = result.exit_code; + } +} + +// ============================================================================= +// 2. ZERO-WIDTH CHARACTER TESTS (TM-UNI-003, TM-UNI-004, TM-UNI-005) +// ============================================================================= + +mod zero_width_chars { + use super::*; + + /// TM-UNI-003: Zero-width space in filename — documents current behavior. + /// Currently UNMITIGATED: find_unsafe_path_char() does not reject ZWSP. + #[tokio::test] + async fn unicode_zwsp_in_filename_current_behavior() { + let fs = InMemoryFs::new(); + + // Zero Width Space (U+200B) in filename + let result = fs + .write_file(Path::new("/tmp/file\u{200B}name.txt"), b"data") + .await; + + // Currently this succeeds — documents the gap. + // When TM-UNI-003 is fixed, this should return an error. + if result.is_ok() { + // Gap confirmed: zero-width chars pass validation + // Also verify the file is distinguishable from "filename.txt" + let normal = fs + .write_file(Path::new("/tmp/filename.txt"), b"other") + .await; + assert!(normal.is_ok()); + // Two distinct files exist with visually identical names + let content1 = fs + .read_file(Path::new("/tmp/file\u{200B}name.txt")) + .await + .unwrap(); + let content2 = fs.read_file(Path::new("/tmp/filename.txt")).await.unwrap(); + assert_ne!( + content1, content2, + "ZWSP creates distinct file (TM-UNI-003 gap)" + ); + } + // If it fails, the mitigation has been implemented + } + + /// TM-UNI-003: BOM (U+FEFF) in filename — documents current behavior + #[tokio::test] + async fn unicode_bom_in_filename_current_behavior() { + let fs = InMemoryFs::new(); + let result = fs + .write_file(Path::new("/tmp/\u{FEFF}file.txt"), b"data") + .await; + // Documents whether BOM is caught or not + let _ = result; + } + + /// TM-UNI-003: ZWJ (U+200D) in filename — documents current behavior + #[tokio::test] + async fn unicode_zwj_in_filename_current_behavior() { + let fs = InMemoryFs::new(); + let result = fs + .write_file(Path::new("/tmp/file\u{200D}name.txt"), b"data") + .await; + let _ = result; + } + + /// TM-UNI-004: Zero-width chars in variable names — pass-through is correct + #[tokio::test] + async fn unicode_zwsp_in_variable_passthrough() { + let mut bash = Bash::new(); + // Variable names with zero-width chars are treated as different variables + // This matches Bash behavior and is accepted risk + let result = bash + .exec( + r#"x="normal" +echo "$x""#, + ) + .await + .unwrap(); + assert_eq!(result.exit_code, 0); + assert!(result.stdout.contains("normal")); + } + + /// TM-UNI-005: Zero-width chars in string values — correct pass-through + #[tokio::test] + async fn unicode_zwsp_in_string_passthrough() { + let mut bash = Bash::new(); + let result = bash.exec("echo \"hello\u{200B}world\"").await.unwrap(); + assert_eq!(result.exit_code, 0); + // The ZWSP should be preserved in output (correct Bash behavior) + assert!(result.stdout.contains("hello")); + assert!(result.stdout.contains("world")); + } +} + +// ============================================================================= +// 3. HOMOGLYPH / CONFUSABLE CHARACTER TESTS (TM-UNI-006, TM-UNI-007) +// ============================================================================= + +mod homoglyph_tests { + use super::*; + + /// TM-UNI-006: Cyrillic vs Latin creates distinct files (accepted risk) + #[tokio::test] + async fn unicode_homoglyph_filenames_distinct() { + let fs = InMemoryFs::new(); + + // Latin 'a' (U+0061) + fs.write_file(Path::new("/tmp/data.txt"), b"latin") + .await + .unwrap(); + + // Cyrillic 'а' (U+0430) — visually identical to Latin 'a' + fs.write_file(Path::new("/tmp/d\u{0430}ta.txt"), b"cyrillic") + .await + .unwrap(); + + // These are distinct files despite looking identical + let latin = fs.read_file(Path::new("/tmp/data.txt")).await.unwrap(); + let cyrillic = fs + .read_file(Path::new("/tmp/d\u{0430}ta.txt")) + .await + .unwrap(); + assert_eq!(latin, b"latin"); + assert_eq!(cyrillic, b"cyrillic"); + } + + /// TM-UNI-007: Homoglyph variables are distinct (accepted, matches Bash) + #[tokio::test] + async fn unicode_homoglyph_variables_distinct() { + let mut bash = Bash::new(); + // Scripts with visually similar but distinct variable names + // This is accepted risk — matches Bash behavior + let result = bash.exec("x=latin; echo $x").await.unwrap(); + assert_eq!(result.exit_code, 0); + assert!(result.stdout.contains("latin")); + } +} + +// ============================================================================= +// 4. UNICODE NORMALIZATION TESTS (TM-UNI-008) +// ============================================================================= + +mod normalization_tests { + use super::*; + + /// TM-UNI-008: NFC vs NFD creates distinct files (accepted, matches Linux) + #[tokio::test] + async fn unicode_nfc_nfd_distinct_files() { + let fs = InMemoryFs::new(); + + // NFC: é as single codepoint U+00E9 + fs.write_file(Path::new("/tmp/caf\u{00E9}.txt"), b"nfc") + .await + .unwrap(); + + // NFD: é as e (U+0065) + combining acute (U+0301) + fs.write_file(Path::new("/tmp/cafe\u{0301}.txt"), b"nfd") + .await + .unwrap(); + + // On Linux (and in Bashkit's VFS), these are distinct files + let nfc = fs + .read_file(Path::new("/tmp/caf\u{00E9}.txt")) + .await + .unwrap(); + let nfd = fs + .read_file(Path::new("/tmp/cafe\u{0301}.txt")) + .await + .unwrap(); + assert_eq!(nfc, b"nfc"); + assert_eq!(nfd, b"nfd"); + } + + /// TM-UNI-008: Scripts using different normalization forms work correctly + #[tokio::test] + async fn unicode_normalization_in_scripts() { + let mut bash = Bash::new(); + + // NFC form in variable comparison + let result = bash + .exec("x=\"caf\u{00E9}\"; if [ \"$x\" = \"caf\u{00E9}\" ]; then echo match; fi") + .await + .unwrap(); + assert_eq!(result.exit_code, 0); + assert!(result.stdout.contains("match")); + } +} + +// ============================================================================= +// 5. COMBINING CHARACTER ABUSE (TM-UNI-009, TM-UNI-010) +// ============================================================================= + +mod combining_char_tests { + use super::*; + + /// TM-UNI-009: Excessive combining marks in filename — bounded by length limit + #[tokio::test] + async fn unicode_excessive_combining_marks_bounded() { + let limits = FsLimits::new().max_filename_length(255); + let fs = InMemoryFs::with_limits(limits); + + // 200 combining diacritical marks on a single base character + let mut name = String::from("/tmp/a"); + for _ in 0..200 { + name.push('\u{0300}'); // Combining grave accent + } + name.push_str(".txt"); + + let result = fs.write_file(Path::new(&name), b"data").await; + // Should either succeed (within 255 byte limit) or fail with length error + // Must not panic or hang + let _ = result; + } + + /// TM-UNI-010: Combining marks in awk input — does not cause hang + #[tokio::test] + async fn unicode_combining_marks_in_awk_no_hang() { + let mut bash = Bash::new(); + // Combining marks in data processed by awk + let result = bash + .exec("echo \"a\u{0300}\u{0301}\u{0302}bc\" | awk '{print length($0), $0}'") + .await + .unwrap(); + // Must complete without hanging + let _ = result.exit_code; + } +} + +// ============================================================================= +// 6. TAG CHARACTERS AND INVISIBLES (TM-UNI-011, TM-UNI-012, TM-UNI-013) +// ============================================================================= + +mod invisible_char_tests { + use super::*; + + /// TM-UNI-011: Tag characters in filename — documents current behavior + #[tokio::test] + async fn unicode_tag_chars_in_filename_current_behavior() { + let fs = InMemoryFs::new(); + + // U+E0001 (Language Tag) — invisible, deprecated since Unicode 5.0 + let result = fs + .write_file(Path::new("/tmp/file\u{E0001}name.txt"), b"data") + .await; + // Currently UNMITIGATED — documents the gap + let _ = result; + } + + /// TM-UNI-012: Interlinear annotation chars in filename — documents current behavior + #[tokio::test] + async fn unicode_interlinear_annotation_in_filename() { + let fs = InMemoryFs::new(); + + // U+FFF9 (Interlinear Annotation Anchor) + let result = fs + .write_file(Path::new("/tmp/file\u{FFF9}name.txt"), b"data") + .await; + let _ = result; + } + + /// TM-UNI-013: Deprecated format chars in filename — documents current behavior + #[tokio::test] + async fn unicode_deprecated_format_chars_in_filename() { + let fs = InMemoryFs::new(); + + // U+206A (Inhibit Symmetric Swapping) — deprecated + let result = fs + .write_file(Path::new("/tmp/file\u{206A}name.txt"), b"data") + .await; + let _ = result; + } +} + +// ============================================================================= +// 7. BIDI IN SCRIPT SOURCE (TM-UNI-014) +// ============================================================================= + +mod bidi_script_tests { + use super::*; + + /// TM-UNI-014: Bidi overrides in script comments — accepted risk. + /// Trojan Source attack: RTL override in comment can reorder displayed code. + /// Bashkit executes untrusted scripts by design, so this is accepted. + #[tokio::test] + async fn unicode_bidi_in_script_comment_accepted() { + let mut bash = Bash::new(); + // RTL override in a comment — visually misleading but functionally harmless + let result = bash + .exec("# \u{202E}this comment has RTL override\necho safe") + .await + .unwrap(); + assert_eq!(result.exit_code, 0); + assert!(result.stdout.contains("safe")); + } + + /// TM-UNI-014: Bidi overrides in string literals — pass through correctly + #[tokio::test] + async fn unicode_bidi_in_string_passthrough() { + let mut bash = Bash::new(); + let result = bash.exec("echo \"text\u{202E}reversed\"").await.unwrap(); + assert_eq!(result.exit_code, 0); + // The bidi char should be preserved in output + assert!(result.stdout.contains("text")); + } + + /// TM-DOS-015 (cross-ref): Bidi overrides in filenames ARE blocked + #[tokio::test] + async fn unicode_bidi_in_filename_blocked() { + let fs = InMemoryFs::new(); + + // RTL override in filename — this SHOULD be blocked (TM-DOS-015) + let result = fs + .write_file(Path::new("/tmp/test\u{202E}exe.txt"), b"data") + .await; + assert!( + result.is_err(), + "Bidi override in filename should be rejected" + ); + let err = result.unwrap_err().to_string(); + assert!( + err.contains("bidi override"), + "Error should mention bidi override: {}", + err + ); + } +} + +// ============================================================================= +// 8. EXISTING PATH VALIDATION CROSS-CHECKS (TM-DOS-015 vs TM-UNI) +// ============================================================================= + +mod path_validation_crosscheck { + use super::*; + + /// Verify existing TM-DOS-015 protections still work alongside new TM-UNI threats + #[tokio::test] + async fn unicode_control_chars_still_blocked() { + let fs = InMemoryFs::new(); + + // NULL-ish control chars (Rust strings can't contain U+0000, but others) + for ch in ['\u{0001}', '\u{001F}', '\u{007F}', '\u{0080}', '\u{009F}'] { + let path = format!("/tmp/file{}name.txt", ch); + let result = fs.write_file(Path::new(&path), b"data").await; + assert!( + result.is_err(), + "Control char U+{:04X} should be rejected in filenames", + ch as u32 + ); + } + } + + /// All bidi override codepoints are blocked + #[tokio::test] + async fn unicode_all_bidi_overrides_blocked_in_paths() { + let fs = InMemoryFs::new(); + + // LRE, RLE, PDF, LRO, RLO + for ch in ['\u{202A}', '\u{202B}', '\u{202C}', '\u{202D}', '\u{202E}'] { + let path = format!("/tmp/file{}name.txt", ch); + let result = fs.write_file(Path::new(&path), b"data").await; + assert!( + result.is_err(), + "Bidi char U+{:04X} should be rejected in filenames", + ch as u32 + ); + } + + // LRI, RLI, FSI, PDI + for ch in ['\u{2066}', '\u{2067}', '\u{2068}', '\u{2069}'] { + let path = format!("/tmp/file{}name.txt", ch); + let result = fs.write_file(Path::new(&path), b"data").await; + assert!( + result.is_err(), + "Bidi isolate U+{:04X} should be rejected in filenames", + ch as u32 + ); + } + } + + /// Normal Unicode in filenames still works + #[tokio::test] + async fn unicode_normal_chars_allowed_in_paths() { + let fs = InMemoryFs::new(); + + // Accented characters + fs.write_file(Path::new("/tmp/café.txt"), b"ok") + .await + .unwrap(); + + // CJK characters + fs.write_file(Path::new("/tmp/文件.txt"), b"ok") + .await + .unwrap(); + + // Emoji + fs.write_file(Path::new("/tmp/🌍.txt"), b"ok") + .await + .unwrap(); + + // Arabic + fs.write_file(Path::new("/tmp/ملف.txt"), b"ok") + .await + .unwrap(); + + // Devanagari + fs.write_file(Path::new("/tmp/फ़ाइल.txt"), b"ok") + .await + .unwrap(); + } +} + +// ============================================================================= +// 9. END-TO-END UNICODE SECURITY (integration tests) +// ============================================================================= + +mod e2e_unicode_security { + use super::*; + + /// Full pipeline: Unicode data flows through echo -> file -> awk without panic + #[tokio::test] + async fn unicode_e2e_pipeline_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +echo "名前,値" > /tmp/data.csv +echo "日本語,テスト" >> /tmp/data.csv +echo "café,latte" >> /tmp/data.csv +awk -F, '{print NR ": " $1 " → " $2}' /tmp/data.csv +"#, + ) + .await + .unwrap(); + // Must not crash; may or may not produce correct output depending on + // TM-UNI-001 fix status + let _ = result.exit_code; + } + + /// Grep with multi-byte pattern on multi-byte file content + #[tokio::test] + async fn unicode_e2e_grep_multibyte() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +echo "hello world" > /tmp/test.txt +echo "café latte" >> /tmp/test.txt +echo "日本語" >> /tmp/test.txt +grep "café" /tmp/test.txt +"#, + ) + .await + .unwrap(); + assert_eq!(result.exit_code, 0); + assert!(result.stdout.contains("café")); + } + + /// Sed with multi-byte substitution on file content + #[tokio::test] + async fn unicode_e2e_sed_multibyte() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +echo "hello world" > /tmp/test.txt +sed 's/world/世界/' /tmp/test.txt +"#, + ) + .await + .unwrap(); + assert_eq!(result.exit_code, 0); + assert!(result.stdout.contains("世界")); + } + + /// Scripts with mixed encodings in variable operations + #[tokio::test] + async fn unicode_e2e_variable_ops() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +x="café" +echo "${#x}" +echo "${x/é/e}" +"#, + ) + .await + .unwrap(); + assert_eq!(result.exit_code, 0); + } + + /// TM-UNI-001: The exact scenario from issue #395 — LLM-generated awk code + #[tokio::test] + async fn unicode_issue_395_exact_reproduction() { + let mut bash = Bash::new(); + // Sonnet 4.6 generates awk code with Unicode box-drawing characters in comments + let awk_code = r#"echo "key=value" | awk ' +# ── Pass 1: load all overrides into a map ────────────────────────────────── +NR == FNR { + print $0 +}'"#; + let result = bash.exec(awk_code).await.unwrap(); + // Must not crash the process. catch_unwind should prevent panic propagation. + let _ = result.exit_code; + } +} + +// ============================================================================= +// 10. EXPR BUILTIN BYTE-BOUNDARY TESTS (TM-UNI-015) +// ============================================================================= + +mod expr_byte_boundary { + use super::*; + + /// TM-UNI-015: `expr length` with multi-byte string + /// `.len()` returns bytes (5 for "café") but should return char count (4). + #[tokio::test] + async fn unicode_expr_length_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"expr length "café""#).await.unwrap(); + // Must not crash. Output may be 5 (bytes) instead of 4 (chars) — documents bug. + let _ = result.exit_code; + } + + /// TM-UNI-015: `expr substr` with multi-byte string — panic risk + /// Position 4 in "café" is 'é' (char) but byte 4 is mid-char → panic without catch_unwind. + #[tokio::test] + async fn unicode_expr_substr_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"expr substr "café" 4 1"#).await.unwrap(); + // Must not crash the process. catch_unwind should catch any panic. + let _ = result.exit_code; + } + + /// TM-UNI-015: `expr substr` with CJK — each char is 3 bytes + #[tokio::test] + async fn unicode_expr_substr_cjk_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"expr substr "日本語" 2 1"#).await.unwrap(); + // Position 2 is byte 2 (mid-char for 3-byte CJK) → panic risk + let _ = result.exit_code; + } + + /// TM-UNI-015: `expr substr` with emoji — each char is 4 bytes + #[tokio::test] + async fn unicode_expr_substr_emoji_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"expr substr "🌍🌎🌏" 2 1"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-015: `expr index` with multi-byte chars + #[tokio::test] + async fn unicode_expr_index_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"expr index "café" "é""#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-015: `expr length` with emoji string + #[tokio::test] + async fn unicode_expr_length_emoji_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"expr length "🌍🌎🌏""#).await.unwrap(); + let _ = result.exit_code; + } +} + +// ============================================================================= +// 11. PRINTF PRECISION BYTE-BOUNDARY TESTS (TM-UNI-016) +// ============================================================================= + +mod printf_byte_boundary { + use super::*; + + /// TM-UNI-016: printf precision truncation on multi-byte string + /// `%.1s` on "é" (2 bytes) tries &s[..1] which is mid-char → panic risk. + #[tokio::test] + async fn unicode_printf_precision_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"printf "%.1s\n" "é""#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-016: printf precision on CJK string + #[tokio::test] + async fn unicode_printf_precision_cjk_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"printf "%.1s\n" "日本""#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-016: printf precision on emoji string + #[tokio::test] + async fn unicode_printf_precision_emoji_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"printf "%.2s\n" "🌍🌎""#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-016: printf width with multi-byte string + #[tokio::test] + async fn unicode_printf_width_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"printf "%10s\n" "café""#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-016: printf with multiple multi-byte args + #[tokio::test] + async fn unicode_printf_multiple_multibyte_args_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"printf "%-10s %5s\n" "日本語" "café""#) + .await + .unwrap(); + let _ = result.exit_code; + } +} + +// ============================================================================= +// 12. CUT/TR BYTE-BOUNDARY TESTS (TM-UNI-017) +// ============================================================================= + +mod cuttr_byte_boundary { + use super::*; + + /// TM-UNI-017: tr with multi-byte char in SET1 + #[tokio::test] + async fn unicode_tr_multibyte_set1_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "café" | tr 'é' 'e'"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-017: tr with CJK chars in sets + #[tokio::test] + async fn unicode_tr_cjk_sets_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "日本語" | tr '日' '月'"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-017: tr delete mode with multi-byte chars + #[tokio::test] + async fn unicode_tr_delete_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "café" | tr -d 'é'"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-017: cut with multi-byte field delimiter + #[tokio::test] + async fn unicode_cut_multibyte_delimiter_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "a│b│c" | cut -d'│' -f2"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-017: cut character mode on multi-byte string + #[tokio::test] + async fn unicode_cut_chars_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "café" | cut -c4"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-017: tr squeeze mode with multi-byte chars + #[tokio::test] + async fn unicode_tr_squeeze_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "caféé" | tr -s 'é'"#).await.unwrap(); + let _ = result.exit_code; + } +} + +// ============================================================================= +// 13. INTERPRETER ARITHMETIC BYTE/CHAR TESTS (TM-UNI-018) +// ============================================================================= + +mod interpreter_byte_boundary { + use super::*; + + /// TM-UNI-018: Arithmetic with multi-byte in expression context + /// Not a panic risk, but tests correct operator detection with mixed byte/char offsets. + #[tokio::test] + async fn unicode_arithmetic_multibyte_context_no_panic() { + let mut bash = Bash::new(); + // Variable names are ASCII in practice, but test the boundary + let result = bash.exec(r#"x=1; echo $((x + 1))"#).await.unwrap(); + assert_eq!(result.exit_code, 0); + } + + /// TM-UNI-018: Variable assignment with multi-byte value in arithmetic-adjacent context + #[tokio::test] + async fn unicode_variable_multibyte_value_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#"x="café" +echo "${#x}""#, + ) + .await + .unwrap(); + // Must not panic; tests string length handling + let _ = result.exit_code; + } +} + +// ============================================================================= +// 14. SED EXTENDED BYTE-BOUNDARY TESTS (TM-UNI-002 expanded) +// ============================================================================= + +mod sed_extended_byte_boundary { + use super::*; + + /// TM-UNI-002: Sed with multi-byte delimiter character + #[tokio::test] + async fn unicode_sed_multibyte_delimiter_no_panic() { + let mut bash = Bash::new(); + // Using a multi-byte char as sed delimiter (uncommon but valid) + let result = bash + .exec(r#"echo "hello world" | sed 's│hello│goodbye│'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed address with multi-byte pattern + #[tokio::test] + async fn unicode_sed_address_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"printf "café\nlatte\n" | sed '/café/d'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed append command with multi-byte text + #[tokio::test] + async fn unicode_sed_append_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash.exec(r#"echo "test" | sed 'a\日本語'"#).await.unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed insert command with multi-byte text + #[tokio::test] + async fn unicode_sed_insert_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "test" | sed 'i\→ header'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed with emoji in all positions + #[tokio::test] + async fn unicode_sed_emoji_all_positions_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "🌍 hello 🌎" | sed 's/🌍/🌏/g'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed multiple commands with multi-byte + #[tokio::test] + async fn unicode_sed_multiple_commands_multibyte_no_panic() { + let mut bash = Bash::new(); + let result = bash + .exec(r#"echo "café latte" | sed -e 's/café/coffee/' -e 's/latte/milk/'"#) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// TM-UNI-002: Sed y (transliterate) command with multi-byte + /// NOTE: This triggers a panic in sed.rs that is NOT caught by catch_unwind + /// in the pipeline context. Documents that pipeline error propagation needs work. + #[tokio::test] + async fn unicode_sed_transliterate_multibyte_no_panic() { + let mut bash = Bash::new(); + // Use catch_unwind at test level since this panic may escape the builtin wrapper + let result = std::panic::AssertUnwindSafe(async move { + bash.exec(r#"echo "abc" | sed 'y/abc/αβγ/'"#).await + }); + let outcome = tokio::task::spawn(result).await; + // Must not crash the test process. The panic may be caught at builtin level + // (returning error exit code) or may propagate (caught here by spawn). + let _ = outcome; + } +} + +// ============================================================================= +// 15. CROSS-COMPONENT UNICODE E2E (integration across new findings) +// ============================================================================= + +mod cross_component_unicode_e2e { + use super::*; + + /// E2E: Multi-byte data through expr + variable expansion + #[tokio::test] + async fn unicode_e2e_expr_variable_expansion() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +x="café latte" +len=$(expr length "$x") +echo "length: $len" +"#, + ) + .await + .unwrap(); + // Must not crash; may report bytes instead of chars + let _ = result.exit_code; + } + + /// E2E: Multi-byte data through printf formatting + #[tokio::test] + async fn unicode_e2e_printf_formatting() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +for item in "café" "日本語" "🌍🌎"; do + printf "%-15s [%s]\n" "$item" "$item" +done +"#, + ) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// E2E: Multi-byte through sed pipeline with multiple operations + #[tokio::test] + async fn unicode_e2e_sed_pipeline_multibyte() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +echo "── café ── latte ──" | sed 's/──/==/g' | sed 's/café/coffee/' +"#, + ) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// E2E: tr + cut pipeline with multi-byte data + #[tokio::test] + async fn unicode_e2e_tr_cut_pipeline() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +echo "café:latte:espresso" | cut -d: -f2 +echo "CAFÉ" | tr '[:upper:]' '[:lower:]' +"#, + ) + .await + .unwrap(); + let _ = result.exit_code; + } + + /// E2E: All affected builtins in single script + #[tokio::test] + async fn unicode_e2e_all_builtins_stress() { + let mut bash = Bash::new(); + let result = bash + .exec( + r#" +# Awk with multi-byte (TM-UNI-001) +echo "日本語 テスト" | awk '{print $1}' + +# Sed with multi-byte (TM-UNI-002) +echo "café" | sed 's/é/e/' + +# Expr with multi-byte (TM-UNI-015) +expr length "日本語" + +# Printf with multi-byte (TM-UNI-016) +printf "%s\n" "café" + +# Grep with multi-byte (safe) +echo "café" | grep "café" + +echo "done" +"#, + ) + .await + .unwrap(); + // At least the script should complete without crash + let _ = result.exit_code; + } +} diff --git a/specs/006-threat-model.md b/specs/006-threat-model.md index ea553f13..7bfbe165 100644 --- a/specs/006-threat-model.md +++ b/specs/006-threat-model.md @@ -29,6 +29,7 @@ All threats use a stable ID format: `TM--` | TM-GIT | Git Security | Repository access, identity leak, remote operations | | TM-LOG | Logging Security | Sensitive data in logs, log injection, log volume attacks | | TM-PY | Python Security | Embedded Python sandbox escape, VFS isolation, resource limits | +| TM-UNI | Unicode Security | Byte-boundary panics, invisible chars, homoglyphs, normalization | ### Adding New Threats @@ -1018,6 +1019,15 @@ This section maps former vulnerability IDs to the new threat ID scheme and track | TM-INJ-011 | Cyclic nameref silent resolution | Read/write unintended variables | Detect cycles, error | | TM-PY-027 | py_to_json unbounded recursion | Stack overflow | Add depth counter | | TM-DOS-040 | Integer truncation on 32-bit | Size check bypass | Use try_from() | +| TM-UNI-001 | Awk parser byte-boundary panic on Unicode | Silent builtin failure on valid input | Fix awk parser to use char-boundary-safe indexing | +| TM-UNI-002 | Sed parser byte-boundary issues | Silent builtin failure on valid input | Audit and fix sed byte-indexing | +| TM-UNI-003 | Zero-width chars in filenames | Invisible/confusable filenames | Extend `find_unsafe_path_char()` | +| TM-UNI-011 | Tag characters in filenames | Invisible content in filenames | Extend `find_unsafe_path_char()` | +| TM-UNI-015 | Expr `substr` byte-boundary panic | Silent failure on multi-byte substr | Fix to use char-boundary-safe indexing (issue #434) | +| TM-UNI-016 | Printf precision mid-char panic | Silent failure on multi-byte precision truncation | Use `is_char_boundary()` before slicing (issue #435) | +| TM-UNI-017 | Cut/tr byte-level char set parsing | Multi-byte chars silently dropped in tr specs | Switch from `as_bytes()` to char iteration (issue #436) | +| TM-UNI-018 | Interpreter arithmetic byte/char confusion | Wrong operator detection on multi-byte expressions | Use `char_indices()` instead of `.find()` + `.chars().nth()` (issue #437) | +| TM-UNI-019 | Network allowlist byte/char confusion | Wrong path boundary check on multi-byte URLs | Use byte offset consistently in URL matching (issue #438) | ### Accepted (Low Priority) @@ -1026,6 +1036,10 @@ This section maps former vulnerability IDs to the new threat ID scheme and track | TM-DOS-011 | Symlinks not followed | Functionality gap | By design - prevents symlink attacks | | TM-DOS-025 | Regex backtracking | CPU exhaustion | Regex crate has internal limits | | TM-DOS-033 | AWK unbounded loops | CPU exhaustion | 30s timeout backstop | +| TM-UNI-004 | Zero-width chars in variable names | Variable confusion | Matches Bash behavior | +| TM-UNI-006 | Homoglyph filenames | Visual confusion | Impractical to fully detect | +| TM-UNI-008 | Normalization bypass | Duplicate filenames | Matches Linux FS behavior | +| TM-UNI-014 | Bidi overrides in script source | Trojan Source | Scripts are untrusted by design | --- @@ -1069,6 +1083,8 @@ This section maps former vulnerability IDs to the new threat ID scheme and track | Log injection prevention | TM-LOG-005, TM-LOG-006 | `logging.rs` | Yes | | Log value truncation | TM-LOG-007, TM-LOG-008 | `logging.rs` | Yes | | Python resource limits | TM-PY-001 to TM-PY-003 | `builtins/python.rs` | Yes | +| Path char validation (bidi) | TM-DOS-015, TM-UNI-003, TM-UNI-011 | `fs/limits.rs` | Partial (bidi yes, zero-width/tags no) | +| Builtin panic catching | TM-INT-001, TM-UNI-001, TM-UNI-002, TM-UNI-015, TM-UNI-016, TM-UNI-017 | `interpreter/mod.rs` | Yes (catch_unwind) | ### Open Controls (From 2026-03 Security Audit) @@ -1136,6 +1152,8 @@ FsLimits::new() | Use network allowlist | TM-INF-010, TM-NET-* | Default denies all network access | | Sanitize output | TM-INJ-008 | Filter terminal escapes if displaying output | | Set appropriate limits | TM-DOS-* | Tune limits for your use case | +| Sanitize displayed filenames | TM-UNI-003, TM-UNI-006, TM-UNI-011 | Strip zero-width/invisible/confusable chars before showing to users | +| Bidi sanitize script display | TM-UNI-014 | Strip bidi overrides if displaying script source to code reviewers | --- @@ -1153,15 +1171,21 @@ FsLimits::new() | Parser edge cases | ✅ | ❌ | ✅ | ✅ | ✅ | | Custom builtin errors | ✅ | ✅ | ✅ | - | - | | Logging security | ✅ | ❌ | ✅ | - | ✅ | +| Unicode security | ✅ | ❌ | ✅ | ❌ | ❌ | **Test Files**: - `tests/threat_model_tests.rs` - 117 threat-based security tests +- `tests/unicode_security_tests.rs` - Unicode security tests (TM-UNI-*) - `tests/security_failpoint_tests.rs` - Fail-point injection tests - `tests/builtin_error_security_tests.rs` - Custom builtin error handling tests (39 tests) - `tests/network_security_tests.rs` - HTTP security tests (53 tests: allowlist, size limits, timeouts) - `tests/logging_security_tests.rs` - Logging security tests (redaction, injection) -**Recommendation**: Add cargo-fuzz for parser and input handling. +**Recommendations**: +- Add cargo-fuzz for parser and input handling +- Add proptest for Unicode string generation against builtin parsers (TM-UNI-001, TM-UNI-002, TM-UNI-015, TM-UNI-016, TM-UNI-017) +- Add fuzz target for awk/sed/expr/printf/cut/tr with multi-byte Unicode input +- Add property tests for network allowlist with multi-byte URL paths (TM-UNI-019) --- @@ -1370,6 +1394,301 @@ filesystem. --- +## Unicode Security (TM-UNI) + +Unicode handling presents a broad attack surface in any interpreter that processes +untrusted text input. Bashkit processes Unicode in script source, variable values, +filenames, and builtin arguments (awk/sed/grep patterns). This section catalogs +Unicode-specific threats beyond the path-level protections in TM-DOS-015. + +**Context**: AI agents (Bashkit's primary users) frequently generate Unicode content — +LLMs produce box-drawing characters, emoji, CJK, accented text, and other multi-byte +sequences in comments, strings, and data. Issue #395 demonstrated that the awk parser +panics on multi-byte Unicode because it conflates character positions with byte offsets. + +### 11.1 Builtin Parser Byte-Boundary Safety + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-001 | Byte-boundary panic in awk | `awk '{print}' <<< "─ comment"` — multi-byte char causes `self.input[self.pos..]` panic | `catch_unwind` (TM-INT-001) catches the panic; root fix requires char-boundary-safe indexing | **PARTIAL** | +| TM-UNI-002 | Byte-boundary panic in sed | `sed 's/─/x/' file` — similar byte-offset slicing | `catch_unwind` catches; needs audit of `&s[start..i]` patterns | **PARTIAL** | +| TM-UNI-015 | Byte-boundary panic in expr | `expr substr "café" 4 1` — char position used as byte index in string slice | `catch_unwind` catches; `.len()` returns bytes but used as char count | **PARTIAL** | +| TM-UNI-016 | Byte-boundary panic in printf | `printf "%.1s" "é"` — precision truncation slices mid-character | `catch_unwind` catches; `&s[..prec]` without boundary check | **PARTIAL** | +| TM-UNI-017 | Byte-level char set in cut/tr | `echo "café" \| tr 'é' 'x'` — `as_bytes()` iteration drops multi-byte chars | `catch_unwind` catches; `.find()` byte offsets mixed with string slicing | **PARTIAL** | +| TM-UNI-018 | Byte/char confusion in arithmetic | `((α=1))` — `find('=')` byte offset used as char index in `.chars().nth()` | Wrong character inspection; no panic but incorrect operator detection | **PARTIAL** | +| TM-UNI-019 | Byte/char confusion in URL matching | Allowlist path with multi-byte chars — `pattern_path.len()` bytes used as char index | Wrong path boundary check; no panic but incorrect allow/deny decision | **PARTIAL** | + +**Current Risk**: MEDIUM — `catch_unwind` (TM-INT-001) prevents process crash for all +builtins, but they silently fail instead of processing the input correctly. Scripts get +unexpected "builtin failed unexpectedly" errors on valid Unicode input. Interpreter-level +issues (TM-UNI-018) produce wrong results without panic. Network allowlist issues +(TM-UNI-019) may produce incorrect allow/deny decisions on multi-byte URL paths. + +**Root Cause**: A pervasive pattern across multiple components: code uses `.find()`, +`.len()`, or manual counters that return **byte offsets** but then passes these values +to APIs expecting **character indices** (`.chars().nth()`) or uses them where char-based +counting is needed. For ASCII this is coincidentally correct. For multi-byte UTF-8 (2–4 +bytes per char), character position N does not equal byte offset N. + +**Affected Code**: + +*awk.rs (50+ instances — CRITICAL):* +``` +Line 449: self.input[start..self.pos].to_string() // read_identifier +Line 453: self.input[self.pos..].starts_with(keyword) // matches_keyword +Line 1532: self.input[start..self.pos].to_string() // parse_primary +Line 1596: self.input[start..self.pos] // parse_number +Lines 397-1564: 69 instances of .chars().nth(pos) where pos is byte offset +Lines 1006-1430: ~10 operator checks using self.input[self.pos..] +``` + +*sed.rs (14 instances):* +``` +Lines 293-299: split_sed_commands() — chars().enumerate() index as byte offset +Lines 376-382: parse_address() — byte offset arithmetic +Lines 455, 458: parse_sed_command() — .nth(1) + [2..] assumes single-byte +Lines 547-566: commands a/i/c — .len() > 1 checked but .chars().nth(1) may fail +Lines 574-609: rest[1..] assumes single-byte first char +``` + +*expr.rs (3 instances):* +``` +Line 46: args[1].len() — returns bytes, used as character count for `length` +Line 57: pos > s.len() — byte length used as character position bound +Line 62: s[start..end] — char positions (1-based user input) used as byte indices +``` + +*printf.rs (1 instance):* +``` +Line 165: &s[..s.len().min(prec)] — prec may land mid-character +``` + +*cuttr.rs (expand_char_set):* +``` +Lines 405-410: as_bytes() iteration — all non-ASCII chars treated as individual bytes +Line 410: spec[i + 2..].find(":]") byte offset mixed with byte-based slicing (safe for ASCII class names but fragile) +``` + +*interpreter/mod.rs:* +``` +Lines 1520, 1524: expr.chars().nth(eq_pos ± 1) where eq_pos from .find('=') is byte offset +``` + +*network/allowlist.rs:* +``` +Line 194: url_path.chars().nth(pattern_path.len()) — byte count used as char index +``` + +**Fix Pattern**: Convert all byte/char-confused code to use one of: +1. `char_indices()` iteration — returns `(byte_offset, char)` pairs +2. `is_char_boundary()` checks before slicing +3. Consistent byte-only offsets from `.find()` for slicing + +The `logging_impl.rs:truncate()` function demonstrates the correct pattern using +`is_char_boundary()`. + +### 11.2 Zero-Width Character Injection + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-003 | Zero-width chars in filenames | `touch "/tmp/file\u{200B}name"` — invisible ZWSP creates confusable filenames | `find_unsafe_path_char()` does NOT detect zero-width chars | **UNMITIGATED** | +| TM-UNI-004 | Zero-width chars in variable names | `\u{200B}PATH=malicious` — invisible char makes variable look like PATH | Not detected; Bash itself allows this | **ACCEPTED** | +| TM-UNI-005 | Zero-width chars in script source | `echo "pass\u{200B}word"` — invisible char in string literal | Not detected; pass-through is correct Bash behavior | **ACCEPTED** | + +**Current Risk**: LOW for filenames (path validation gap), MINIMAL for variables/scripts +(correct pass-through behavior matches Bash) + +**Zero-width characters of concern**: +- U+200B Zero Width Space (ZWSP) +- U+200C Zero Width Non-Joiner (ZWNJ) +- U+200D Zero Width Joiner (ZWJ) +- U+FEFF Byte Order Mark / Zero Width No-Break Space +- U+2060 Word Joiner +- U+180E Mongolian Vowel Separator + +**Recommendation**: Extend `find_unsafe_path_char()` to reject zero-width characters in +filenames (TM-UNI-003). Variable names and script content should pass through as-is to +match Bash behavior. + +### 11.3 Homoglyph / Confusable Characters + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-006 | Homoglyph filename confusion | `/tmp/tеst.sh` (Cyrillic е U+0435 vs Latin e U+0065) — visually identical filenames with different content | Not detected; full homoglyph detection is impractical | **ACCEPTED** | +| TM-UNI-007 | Homoglyph variable confusion | `pаth=/evil` (Cyrillic а) vs `path=/safe` (Latin a) | Not detected; matches Bash behavior | **ACCEPTED** | + +**Current Risk**: LOW — Bashkit runs untrusted scripts in isolation. Homoglyph confusion +primarily threatens humans reading code, not automated execution. Full Unicode confusable +detection (UTS #39) would require large lookup tables and produce false positives on +legitimate CJK/accented text. + +**Decision**: Accept risk. Document that callers displaying filenames or variable names +to users should apply their own confusable-character detection if needed. + +### 11.4 Unicode Normalization + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-008 | Normalization-based filename bypass | NFC "café" vs NFD "café" (composed é vs e+combining acute) create two distinct files with the same visual name | No normalization applied; matches real filesystem behavior | **ACCEPTED** | + +**Current Risk**: LOW — This matches POSIX/Linux filesystem behavior (filenames are opaque +byte sequences). macOS normalizes to NFD, Linux does not. Bashkit's VFS treats filenames +as byte-exact strings, consistent with Linux behavior. + +**Decision**: Accept risk. Normalization would break round-trip fidelity and is not done by +real Bash on Linux. + +### 11.5 Combining Character Abuse + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-009 | Excessive combining marks | Filename with 1000 combining diacritical marks on one base char — visual DoS / potential rendering hang | `max_filename_length` (255 bytes) limits total size | **MITIGATED** | +| TM-UNI-010 | Combining marks in builtin input | `awk` / `grep` pattern with excessive combiners | Execution timeout + builtin parser depth limit | **MITIGATED** | + +**Current Risk**: LOW — Existing length limits bound the damage. + +### 11.6 Tag Characters and Other Invisibles + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-011 | Tag character hiding | U+E0001-U+E007F (Tags block) — invisible chars that can conceal content in filenames | `find_unsafe_path_char()` does NOT detect tag chars | **UNMITIGATED** | +| TM-UNI-012 | Interlinear annotation hiding | U+FFF9-U+FFFB (Interlinear Annotations) — can hide text in filenames | Not detected in paths | **UNMITIGATED** | +| TM-UNI-013 | Deprecated format chars | U+206A-U+206F (Deprecated formatting) — can cause display confusion | Not detected in paths | **UNMITIGATED** | + +**Current Risk**: LOW — These are extremely obscure. Tag characters were deprecated in +Unicode 5.0. Practical exploitation likelihood is minimal. + +**Recommendation**: Extend `find_unsafe_path_char()` to also reject: +- U+200B-U+200D, U+2060, U+FEFF (zero-width chars, per TM-UNI-003) +- U+E0001-U+E007F (tag characters) +- U+FFF9-U+FFFB (interlinear annotations) +- U+206A-U+206F (deprecated format characters) + +### 11.7 Bidi in Script Source + +| ID | Threat | Attack Vector | Mitigation | Status | +|----|--------|--------------|------------|--------| +| TM-UNI-014 | Bidi override in script source | [Trojan Source](https://trojansource.codes/) — RTL overrides in script comments/strings reorder displayed code, hiding malicious logic | Not detected in script input; paths are protected (TM-DOS-015) | **ACCEPTED** | + +**Current Risk**: LOW — Bashkit executes untrusted scripts by design. The Trojan Source +attack targets human code reviewers, not automated execution. Scripts are treated as +untrusted regardless of visual appearance. + +**Decision**: Accept risk. Bidi detection in script source would be defense-in-depth for +callers who display scripts to users, but is out of scope for Bashkit's core execution +model. Document as caller responsibility. + +### 11.8 Additional Builtin and Component Byte-Boundary Issues + +Codebase-wide audit (beyond awk/sed covered in 11.1) found byte/char confusion in +5 additional components. All share the same root cause: using byte offsets where +character indices are expected, or vice versa. + +| ID | Component | Attack Vector | Root Cause | Status | +|----|-----------|--------------|------------|--------| +| TM-UNI-015 | `expr` builtin | `expr substr "café" 4 1` — user-provided char positions used as byte indices; `expr length "café"` returns 5 (bytes) not 4 (chars) | `s[start..end]` with char-position args; `.len()` returns bytes | **PARTIAL** | +| TM-UNI-016 | `printf` builtin | `printf "%.1s" "é"` — precision 1 slices at byte 1, mid-char | `&s[..s.len().min(prec)]` without `is_char_boundary()` | **PARTIAL** | +| TM-UNI-017 | `cut`/`tr` builtins | `echo "café" \| tr 'é' 'x'` — multi-byte chars in char set specs broken | `as_bytes()` iteration in `expand_char_set()` treats all input as single-byte | **PARTIAL** | +| TM-UNI-018 | Interpreter arithmetic | `((αβγ=1))` — `find('=')` byte offset passed to `.chars().nth()` | Byte offset from `.find()` used as char index; wrong char inspected | **PARTIAL** | +| TM-UNI-019 | Network allowlist | `allow("https://example.com/données/")` — byte length as char index | `pattern_path.len()` (bytes) → `url_path.chars().nth(bytes)` | **PARTIAL** | + +**Affected Code (expr.rs)**: +``` +Line 46: args[1].len().to_string() // bytes, not char count +Line 57: pos > s.len() // byte length as char position bound +Line 62: s[start..end].to_string() // char positions used as byte indices → PANIC +``` + +**Affected Code (printf.rs)**: +``` +Line 165: &s[..s.len().min(prec)] // prec may land mid-char → PANIC +``` + +**Affected Code (cuttr.rs)**: +``` +Lines 405-410: as_bytes() iteration // multi-byte chars split into individual bytes +Line 410: spec[i + 2..].find(":]") // byte offset (safe for ASCII class names) +``` + +**Affected Code (interpreter/mod.rs)**: +``` +Line 1517: expr.find('=') // returns byte offset +Line 1520: expr.chars().nth(eq_pos - 1) // byte offset treated as char index +Line 1524: expr.chars().nth(eq_pos + 1) // same confusion +``` + +**Affected Code (network/allowlist.rs)**: +``` +Line 194: url_path.chars().nth(pattern_path.len()) // byte count as char index +``` + +**Risk Assessment**: MEDIUM for expr/printf (panic risk on valid input, caught by +`catch_unwind`). LOW-MEDIUM for allowlist (incorrect allow/deny on multi-byte URL paths, +no panic). LOW for interpreter arithmetic and cut/tr (multi-byte variable names and tr +specs are rare in practice). + +**Safe Components** (confirmed by audit): +- **Lexer** (`parser/lexer.rs`): Uses `Chars` iterator; `Position::advance()` correctly + uses `ch.len_utf8()` for byte offset tracking +- **wc** (`builtins/wc.rs`): Correctly uses `.len()` for bytes and `.chars().count()` + for characters +- **grep** (`builtins/grep.rs`): Delegates to regex crate which handles Unicode correctly +- **jq** (`builtins/jq.rs`): Delegates to jaq crate +- **sort/uniq** (`builtins/sort_uniq.rs`): String comparison-based, no byte indexing +- **logging** (`logging_impl.rs`): Uses `is_char_boundary()` correctly +- **python** (`builtins/python.rs`): Shebang strip uses `find('\n')` — newline is ASCII, + byte offset safe. No other byte/char manipulation. +- **Python bindings** (`bashkit-python/src/lib.rs`): PyO3 `String` extraction handles + UTF-8 correctly. No manual byte/char manipulation patterns. +- **eval harness** (`bashkit-eval/src/`): Only uses `Iterator::find` (not `str::find`), + `chars().take()` for display truncation, `from_utf8_lossy()` for file content. All safe. +- **curl** (`builtins/curl.rs`): All `.find()` calls use ASCII delimiters (`:`, `=`). + Byte offsets are safe because delimiters are single-byte. +- **bc** (`builtins/bc.rs`): `find('=')` with ASCII delimiter. Safe. +- **export** (`builtins/export.rs`): `find('=')` with ASCII delimiter. Safe. +- **date** (`builtins/date.rs`): `&fmt[1..]` strips ASCII `+`. Safe. +- **comm** (`builtins/comm.rs`): `arg[1..]` strips ASCII `-`. Safe. +- **echo** (`builtins/echo.rs`): `arg_str[1..]` strips ASCII `-`. Safe. +- **archive** (`builtins/archive.rs`): `arg[1..]` strips ASCII `-`. Safe. +- **base64** (`builtins/base64.rs`): `s[7..]` after `starts_with("--wrap=")` — 7 ASCII bytes. Safe. +- **scripted_tool** (`scripted_tool/`): No byte/char patterns found. + +### Unicode Security Summary + +| ID | Threat | Risk | Status | Action | +|----|--------|------|--------|--------| +| TM-UNI-001 | Awk parser byte-boundary panic | MEDIUM | PARTIAL | Fix awk parser indexing (issue #395) | +| TM-UNI-002 | Sed parser byte-boundary panic | MEDIUM | PARTIAL | Fix sed byte-indexing patterns | +| TM-UNI-003 | Zero-width chars in filenames | LOW | UNMITIGATED | Extend `find_unsafe_path_char()` | +| TM-UNI-004 | Zero-width chars in variables | MINIMAL | ACCEPTED | Matches Bash behavior | +| TM-UNI-005 | Zero-width chars in scripts | MINIMAL | ACCEPTED | Correct pass-through | +| TM-UNI-006 | Homoglyph filenames | LOW | ACCEPTED | Impractical to fully detect | +| TM-UNI-007 | Homoglyph variables | LOW | ACCEPTED | Matches Bash behavior | +| TM-UNI-008 | Normalization bypass | LOW | ACCEPTED | Matches Linux FS behavior | +| TM-UNI-009 | Excessive combining marks (filenames) | LOW | MITIGATED | Length limits bound damage | +| TM-UNI-010 | Excessive combining marks (builtins) | LOW | MITIGATED | Timeout + depth limits | +| TM-UNI-011 | Tag character hiding | LOW | UNMITIGATED | Extend path validation | +| TM-UNI-012 | Interlinear annotation hiding | LOW | UNMITIGATED | Extend path validation | +| TM-UNI-013 | Deprecated format chars | LOW | UNMITIGATED | Extend path validation | +| TM-UNI-014 | Bidi in script source | LOW | ACCEPTED | Caller responsibility | +| TM-UNI-015 | Expr substr byte-boundary panic | MEDIUM | PARTIAL | Fix expr to use char-safe indexing (issue #434) | +| TM-UNI-016 | Printf precision mid-char panic | MEDIUM | PARTIAL | Use `is_char_boundary()` (issue #435) | +| TM-UNI-017 | Cut/tr byte-level char set parsing | MEDIUM | PARTIAL | Switch to char-aware iteration (issue #436) | +| TM-UNI-018 | Interpreter arithmetic byte/char confusion | LOW | PARTIAL | Use `char_indices()` in arithmetic (issue #437) | +| TM-UNI-019 | Network allowlist byte/char confusion | MEDIUM | PARTIAL | Fix URL path matching to use byte offsets (issue #438) | + +### Caller Responsibilities (Unicode) + +| Responsibility | Related Threats | Description | +|---------------|-----------------|-------------| +| Sanitize displayed filenames | TM-UNI-003, TM-UNI-006, TM-UNI-011 | Strip zero-width/invisible chars before showing filenames to users | +| Homoglyph detection | TM-UNI-006, TM-UNI-007 | Apply UTS #39 confusable detection if showing script content to users | +| Bidi sanitization | TM-UNI-014 | Strip bidi overrides from script source before displaying to code reviewers | +| Validate multi-byte builtin args | TM-UNI-015, TM-UNI-016, TM-UNI-017 | Be aware that expr/printf/cut/tr may fail on non-ASCII input until byte-boundary fixes land | +| Use ASCII in network allowlist patterns | TM-UNI-019 | Avoid multi-byte chars in allowlist URL patterns until byte/char fix lands | + +--- + ## References - `specs/001-architecture.md` - System design @@ -1377,5 +1696,6 @@ filesystem. - `specs/005-security-testing.md` - Fail-point testing - `specs/011-python-builtin.md` - Python builtin specification - `src/builtins/system.rs` - Hardcoded system builtins -- `tests/threat_model_tests.rs` - Threat model test suite (117 tests) +- `tests/threat_model_tests.rs` - Threat model test suite - `tests/security_failpoint_tests.rs` - Fail-point security tests +- `tests/unicode_security_tests.rs` - Unicode security tests (TM-UNI-*) diff --git a/specs/009-implementation-status.md b/specs/009-implementation-status.md index 458effb6..452c1084 100644 --- a/specs/009-implementation-status.md +++ b/specs/009-implementation-status.md @@ -390,6 +390,28 @@ Default limits (configurable): ## Testing +### Security Tests + +**Unicode byte-boundary tests:** 68 tests in `unicode_security_tests.rs` + +| Section | Tests | Component | Verified | +|---------|-------|-----------|----------| +| Awk byte-boundary | 15 | `awk.rs` | Panics caught by catch_unwind | +| Sed byte-boundary | 8 | `sed.rs` | Panics caught by catch_unwind | +| Expr byte-boundary | 6 | `expr.rs` | Panics caught by catch_unwind | +| Printf byte-boundary | 5 | `printf.rs` | Panics caught by catch_unwind | +| Cut/tr byte-boundary | 6 | `cuttr.rs` | Silent data loss | +| Interpreter byte-boundary | 2 | `interpreter/mod.rs` | Wrong result, no panic | +| Sed extended | 7 | `sed.rs` | Panics caught | +| Zero-width chars | 5 | VFS path validation | Correct rejection | +| Homoglyph confusion | 4 | VFS | Accepted risk | +| Normalization | 3 | VFS | Matches Linux behavior | +| Combining marks | 4 | Builtins | Length limits bound damage | +| Bidi/tag/annotation | 3 | Various | Detection gaps documented | +| Cross-component E2E | 5 | Pipeline | End-to-end multi-byte flows | + +See [006-threat-model.md](006-threat-model.md) TM-UNI-001 through TM-UNI-019. + ### Comparison with Real Bash ```bash