# PyGarble: Getting Started

This notebook demonstrates how to use **pygarble** to detect garbled, gibberish, or corrupted text.

pygarble is a **zero-dependency** Python library that uses statistical and linguistic heuristics (no AI/ML) to identify text that isn't valid natural language.

## Table of Contents
1. [Basic Usage](#1-basic-usage)
2. [Available Strategies](#2-available-strategies)
3. [Probability Scores](#3-probability-scores)
4. [Batch Processing](#4-batch-processing)
5. [Ensemble Detection](#5-ensemble-detection)
6. [Strategy Deep Dive](#6-strategy-deep-dive)
7. [Custom Thresholds](#7-custom-thresholds)
8. [Real-World Examples](#8-real-world-examples)

In [1]:
import sys
# Prepend the parent directory to the import path (useful when running from a source checkout)
sys.path.insert(0, '..')

from pygarble import GarbleDetector, Strategy, EnsembleDetector

## 1. Basic Usage

The simplest way to use pygarble is to create a `GarbleDetector` with a strategy and call `predict()`.

In [2]:
# Create a detector using the Markov chain strategy (best overall performer)
detector = GarbleDetector(Strategy.MARKOV_CHAIN)

# Test normal English text
print("Normal text:")
print(f"  'Hello world'           -> garbled: {detector.predict('Hello world')}")
print(f"  'The quick brown fox'   -> garbled: {detector.predict('The quick brown fox')}")
print(f"  'Python is awesome'     -> garbled: {detector.predict('Python is awesome')}")

print("\nGarbled text:")
print(f"  'asdfghjkl'             -> garbled: {detector.predict('asdfghjkl')}")
print(f"  'xkcd qwfp zxcv'        -> garbled: {detector.predict('xkcd qwfp zxcv')}")
print(f"  'bvnmxzqwp'             -> garbled: {detector.predict('bvnmxzqwp')}")

Normal text:
  'Hello world'           -> garbled: False
  'The quick brown fox'   -> garbled: False
  'Python is awesome'     -> garbled: False

Garbled text:
  'asdfghjkl'             -> garbled: True
  'xkcd qwfp zxcv'        -> garbled: True
  'bvnmxzqwp'             -> garbled: True


## 2. Available Strategies

pygarble provides 28 detection strategies, each analyzing a different aspect of text. The cell below groups 27 of them by category.

In [3]:
# List all available strategies
strategies_by_category = {
    "Character-level statistics": [
        Strategy.ENTROPY_BASED,
        Strategy.LETTER_FREQUENCY,
        Strategy.CHARACTER_FREQUENCY,
        Strategy.COMPRESSION_RATIO,
    ],
    "N-gram / sequence analysis": [
        Strategy.MARKOV_CHAIN,
        Strategy.BIGRAM_PROBABILITY,
        Strategy.NGRAM_FREQUENCY,
        Strategy.RARE_TRIGRAM,
    ],
    "Phonotactic / pronunciation": [
        Strategy.PRONOUNCEABILITY,
        Strategy.CONSONANT_SEQUENCE,
        Strategy.VOWEL_RATIO,
        Strategy.VOWEL_PATTERN,
    ],
    "Word-level analysis": [
        Strategy.WORD_LOOKUP,
        Strategy.WORD_LENGTH,
        Strategy.FUNCTION_WORD_DENSITY,
        Strategy.AFFIX_DETECTION,
        Strategy.ZIPF_CONFORMITY,
        Strategy.WORD_COLLOCATION,
    ],
    "Pattern detection": [
        Strategy.KEYBOARD_PATTERN,
        Strategy.LETTER_POSITION,
        Strategy.REPETITION,
        Strategy.PATTERN_MATCHING,
    ],
    "Encoding / Unicode": [
        Strategy.MOJIBAKE,
        Strategy.UNICODE_SCRIPT,
        Strategy.HEX_STRING,
    ],
    "Other": [
        Strategy.SYMBOL_RATIO,
        Strategy.STATISTICAL_ANALYSIS,
    ],
}

for category, strats in strategies_by_category.items():
    print(f"\n{category}:")
    for s in strats:
        print(f"  Strategy.{s.name:<25s} ({s.value})")


Character-level statistics:
  Strategy.ENTROPY_BASED             (entropy_based)
  Strategy.LETTER_FREQUENCY          (letter_frequency)
  Strategy.CHARACTER_FREQUENCY       (character_frequency)
  Strategy.COMPRESSION_RATIO         (compression_ratio)

N-gram / sequence analysis:
  Strategy.MARKOV_CHAIN              (markov_chain)
  Strategy.BIGRAM_PROBABILITY        (bigram_probability)
  Strategy.NGRAM_FREQUENCY           (ngram_frequency)
  Strategy.RARE_TRIGRAM              (rare_trigram)

Phonotactic / pronunciation:
  Strategy.PRONOUNCEABILITY          (pronounceability)
  Strategy.CONSONANT_SEQUENCE        (consonant_sequence)
  Strategy.VOWEL_RATIO               (vowel_ratio)
  Strategy.VOWEL_PATTERN             (vowel_pattern)

Word-level analysis:
  Strategy.WORD_LOOKUP               (word_lookup)
  Strategy.WORD_LENGTH               (word_length)
  Strategy.FUNCTION_WORD_DENSITY     (function_word_density)
  Strategy.AFFIX_DETECTION           (affix_detection)
  Strategy.Z
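
To give a feel for what a character-level statistic looks like, here is a minimal sketch of Shannon entropy over the characters of a string. It is not pygarble's `ENTROPY_BASED` implementation, which may normalize or threshold differently; the function name `shannon_entropy` is made up for this illustration.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the entropy of the character distribution in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = sum over characters of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


for sample in ["aaaaaaaa", "hello world", "xzqkjhfbvn"]:
    print(f"{sample!r}: {shannon_entropy(sample):.3f} bits/char")
```

A string made of one repeated character scores 0 bits, while a string whose characters are all different scores the maximum for its length. Entropy alone cannot separate short valid words from short gibberish, which is one reason to combine it with other signals.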

## 3. Probability Scores

Instead of a binary True/False, `predict_proba()` returns a probability score between 0.0 (definitely valid) and 1.0 (definitely garbled).

In [4]:
detector = GarbleDetector(Strategy.MARKOV_CHAIN)

texts = [
    "The quick brown fox jumps over the lazy dog",
    "Natural language processing is fascinating",
    "supercalifragilisticexpialidocious",  # unusual but pronounceable
    "qwertyuiop",                          # keyboard mashing
    "xzqkjhfbvn",                          # random characters
]

print(f"{'Text':<50s} {'Probability':>11s}  {'Garbled?':>8s}")
print("-" * 75)
for text in texts:
    proba = detector.predict_proba(text)
    label = detector.predict(text)
    display = text if len(text) <= 48 else text[:45] + '...'
    print(f"{display:<50s} {proba:>10.4f}   {'Yes' if label else 'No':>7s}")

Text                                               Probability  Garbled?
---------------------------------------------------------------------------
The quick brown fox jumps over the lazy dog            0.1236        No
Natural language processing is fascinating             0.1009        No
supercalifragilisticexpialidocious                     0.1445        No
qwertyuiop                                             0.6262       Yes
xzqkjhfbvn                                             0.9989       Yes


## 4. Batch Processing

Process multiple texts at once by passing a list. This is more efficient than calling `predict()` in a loop.

In [5]:
detector = GarbleDetector(Strategy.MARKOV_CHAIN)

# Batch of texts to check
texts = [
    "Hello, how are you?",
    "asdfghjkl",
    "The weather is nice today",
    "xkrf plmq bvzt",
    "Python programming",
    "qzxjkvbw",
]

# predict() and predict_proba() both accept lists
predictions = detector.predict(texts)
probabilities = detector.predict_proba(texts)

print(f"{'Text':<30s} {'Garbled?':>8s} {'Probability':>11s}")
print("-" * 55)
for text, pred, prob in zip(texts, predictions, probabilities):
    print(f"{text:<30s} {'Yes' if pred else 'No':>8s} {prob:>10.4f}")

Text                           Garbled? Probability
-------------------------------------------------------
Hello, how are you?                  No     0.1141
asdfghjkl                           Yes     0.9407
The weather is nice today            No     0.0611
xkrf plmq bvzt                      Yes     0.9659
Python programming                   No     0.1379
qzxjkvbw                            Yes     0.9981
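
A common next step after a batch call is to split the inputs by label. The sketch below does this in plain Python. The helper `partition` is hypothetical, and the `predictions` list is a stand-in for `detector.predict(texts)`, with values copied from the output above so the example runs without pygarble installed.

```python
from typing import List, Sequence, Tuple


def partition(texts: Sequence[str], predictions: Sequence[bool]) -> Tuple[List[str], List[str]]:
    """Split texts into (clean, garbled) according to parallel boolean labels."""
    if len(texts) != len(predictions):
        raise ValueError("texts and predictions must have the same length")
    clean: List[str] = []
    garbled: List[str] = []
    for text, flagged in zip(texts, predictions):
        (garbled if flagged else clean).append(text)
    return clean, garbled


texts = [
    "Hello, how are you?",
    "asdfghjkl",
    "The weather is nice today",
    "xkrf plmq bvzt",
    "Python programming",
    "qzxjkvbw",
]
# Stand-in for detector.predict(texts); values copied from the output above.
predictions = [False, True, False, True, False, True]

clean, garbled = partition(texts, predictions)
print("clean:  ", clean)
print("garbled:", garbled)
```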


## 5. Ensemble Detection

The `EnsembleDetector` combines multiple strategies for more robust detection. The examples below use four voting modes: `majority`, `any`, `all`, and `average`.

In [6]:
# Default ensemble: uses 5 high-precision strategies with majority voting
ensemble = EnsembleDetector()

test_texts = [
    "The cat sat on the mat",
    "asdfghjkl qwerty",
    "Python is great",
    "xzqkj bvnmw",
    "I love programming",
    "qqq zzz xxx jjj",
]

print("Default Ensemble (majority voting):")
print(f"{'Text':<30s} {'Garbled?':>8s} {'Probability':>11s}")
print("-" * 55)
for text in test_texts:
    pred = ensemble.predict(text)
    prob = ensemble.predict_proba(text)
    print(f"{text:<30s} {'Yes' if pred else 'No':>8s} {prob:>10.4f}")

Default Ensemble (majority voting):
Text                           Garbled? Probability
-------------------------------------------------------
The cat sat on the mat               No     0.0081
asdfghjkl qwerty                    Yes     0.6374
Python is great                      No     0.0392
xzqkj bvnmw                         Yes     0.9990
I love programming                   No     0.0274
qqq zzz xxx jjj                     Yes     0.8833


In [7]:
# Custom ensemble with different voting modes
strategies = [
    Strategy.MARKOV_CHAIN,
    Strategy.PRONOUNCEABILITY,
    Strategy.NGRAM_FREQUENCY,
]

text = "xkrf plmq bvzt nwsd"

# Majority voting: more than half must agree
ens_majority = EnsembleDetector(strategies=strategies, voting="majority")
print(f"Majority voting: {ens_majority.predict(text)}")

# Any voting: if ANY strategy flags it, return True (high recall)
ens_any = EnsembleDetector(strategies=strategies, voting="any")
print(f"Any voting:      {ens_any.predict(text)}")

# All voting: ALL strategies must agree (high precision)
ens_all = EnsembleDetector(strategies=strategies, voting="all")
print(f"All voting:      {ens_all.predict(text)}")

# Average voting: average probability across strategies
ens_avg = EnsembleDetector(strategies=strategies, voting="average")
print(f"Average voting:  {ens_avg.predict(text)} (proba: {ens_avg.predict_proba(text):.4f})")

Majority voting: True
Any voting:      True
All voting:      True
Average voting:  True (proba: 0.9812)
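
The four voting modes can be pictured as different ways of combining per-strategy scores. The following is a minimal sketch in plain Python, not pygarble's implementation: the function name `combine_votes`, the per-strategy cut-off of 0.5, the `>=` comparison, and the example scores are all assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def combine_votes(probas: Sequence[float], voting: str = "majority", threshold: float = 0.5) -> bool:
    """Combine per-strategy scores into one label (illustrative only)."""
    if not probas:
        raise ValueError("at least one score is required")
    # Assumption: each strategy votes "garbled" when its score reaches the threshold.
    votes = [p >= threshold for p in probas]
    if voting == "majority":
        return sum(votes) > len(votes) / 2   # more than half must agree
    if voting == "any":
        return any(votes)                    # one vote is enough (high recall)
    if voting == "all":
        return all(votes)                    # every strategy must agree (high precision)
    if voting == "average":
        return mean(probas) >= threshold     # compare the mean score to the threshold
    raise ValueError(f"unknown voting mode: {voting!r}")


scores = [0.9, 0.6, 0.2]  # hypothetical scores from three strategies
for mode in ("majority", "any", "all", "average"):
    print(f"{mode:<9s} -> {combine_votes(scores, mode)}")
```

With these scores the modes disagree: two of three strategies vote "garbled", so `majority`, `any`, and `average` flag the text while `all` does not. Choosing a mode is a trade-off between catching more garble and raising fewer false alarms.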


## 6. Strategy Deep Dive

Let's compare how different strategies perform on the same set of texts. Each strategy detects different types of garble.

In [8]:
# Compare strategies on different types of garbled text
test_cases = {
    # Normal text
    "Hello world":                        False,
    "The quick brown fox":                 False,
    # Keyboard mashing
    "asdfghjkl":                           True,
    "qwertyuiop":                          True,
    # Random gibberish
    "xzqkjhfbvn":                          True,
    # Mojibake (encoding corruption)
    "CafÃ© au lait":                       True,
    # Homoglyph attack (Cyrillic 'a' in 'paypal')
    "p\u0430ypal":                         True,
}

compare_strategies = [
    Strategy.MARKOV_CHAIN,
    Strategy.PRONOUNCEABILITY,
    Strategy.KEYBOARD_PATTERN,
    Strategy.MOJIBAKE,
    Strategy.UNICODE_SCRIPT,
]

# Print header
header = f"{'Text':<25s}"
for s in compare_strategies:
    header += f" {s.value[:12]:>12s}"
print(header)
print("-" * len(header))

for text, expected in test_cases.items():
    display = text if len(text) <= 23 else text[:20] + '...'
    row = f"{display:<25s}"
    for s in compare_strategies:
        det = GarbleDetector(s)
        prob = det.predict_proba(text)
        row += f" {prob:>11.3f} "
    print(row)

Text                      markov_chain pronounceabi keyboard_pat     mojibake unicode_scri
------------------------------------------------------------------------------------------
Hello world                     0.130        0.000        0.700        0.000        0.000 
The quick brown fox             0.117        0.000        0.367        0.000        0.000 
asdfghjkl                       0.941        1.000        1.000        0.000        0.000 
qwertyuiop                      0.626        1.000        1.000        0.000        0.000 
xzqkjhfbvn                      0.999        1.000        0.700        0.000        0.000 
CafÃ© au lait                   0.739        0.000        0.700        0.800        0.000 
pаypal                          0.918        0.000        0.700        0.000        0.850 


Each strategy specializes in detecting different types of garble:
- **markov_chain**: Best all-rounder, uses character bigram probabilities
- **pronounceability**: Catches unpronounceable consonant clusters
- **keyboard_pattern**: Detects QWERTY row sequences, but in the table above it also scores the normal text "Hello world" at 0.700
- **mojibake**: Specifically detects encoding corruption (e.g., UTF-8 misread as Latin-1)
- **unicode_script**: Catches Cyrillic/Greek homoglyphs mixed into Latin text
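
To make "character bigram probabilities" concrete, here is a toy sketch that learns bigram counts from a few sentences and scores text by its average log-probability. It is only an illustration of the general idea: the three-sentence corpus, the add-one smoothing, and the names `train_bigrams` and `avg_log_prob` are assumptions, and pygarble's `markov_chain` strategy may be built quite differently.

```python
import math
from collections import Counter

# Tiny illustrative corpus; a real model would be trained on far more text.
CORPUS = (
    "the quick brown fox jumps over the lazy dog "
    "natural language processing is fascinating "
    "the weather is nice today and python is awesome"
)
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def train_bigrams(corpus: str):
    """Count adjacent character pairs and how often each character starts a pair."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    first_counts = Counter(corpus[:-1])
    return pair_counts, first_counts


def avg_log_prob(text: str, pair_counts: Counter, first_counts: Counter) -> float:
    """Average log P(next char | current char), using add-one smoothing."""
    cleaned = "".join(c for c in text.lower() if c in ALPHABET)
    pairs = list(zip(cleaned, cleaned[1:]))
    if not pairs:
        return 0.0
    vocab = len(ALPHABET)
    total = 0.0
    for a, b in pairs:
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + vocab))
    return total / len(pairs)


pair_counts, first_counts = train_bigrams(CORPUS)
for sample in ["the weather", "xzqkjhfbvn"]:
    print(f"{sample!r}: {avg_log_prob(sample, pair_counts, first_counts):.3f}")
```

Text whose character transitions were seen in training gets a higher (less negative) average than a string of unseen transitions. Mapping that average onto a 0 to 1 score is a separate design choice that this sketch leaves out.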

## 7. Custom Thresholds

You can adjust the detection threshold and pass strategy-specific parameters.

In [9]:
# Default threshold is 0.5
detector_default = GarbleDetector(Strategy.MARKOV_CHAIN)

# Stricter threshold (fewer false positives, may miss some garble)
detector_strict = GarbleDetector(Strategy.MARKOV_CHAIN, threshold=0.8)

# More lenient threshold (catches more garble, but more false positives)
detector_lenient = GarbleDetector(Strategy.MARKOV_CHAIN, threshold=0.3)

text = "qwerty"  # borderline case
proba = detector_default.predict_proba(text)
print(f"Text: '{text}'")
print(f"Probability: {proba:.4f}")
print(f"Default (0.5):  garbled={detector_default.predict(text)}")
print(f"Strict  (0.8):  garbled={detector_strict.predict(text)}")
print(f"Lenient (0.3):  garbled={detector_lenient.predict(text)}")

Text: 'qwerty'
Probability: 0.4340
Default (0.5):  garbled=False
Strict  (0.8):  garbled=False
Lenient (0.3):  garbled=True
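
The relationship between score and label can be sketched as a single comparison. This is an illustration under an assumption: whether pygarble flags a score exactly equal to the threshold is not shown in this notebook, so the `>=` below and the helper name `is_garbled` are hypothetical.

```python
def is_garbled(proba: float, threshold: float = 0.5) -> bool:
    """Turn a score into a label. Assumption: scores at or above the cut-off are flagged."""
    return proba >= threshold


proba = 0.4340  # score reported for 'qwerty' in the output above
for name, threshold in [("default", 0.5), ("strict", 0.8), ("lenient", 0.3)]:
    print(f"{name:<8s} threshold={threshold:.1f} -> garbled={is_garbled(proba, threshold)}")
```

Only the 0.3 threshold sits below the score of 0.4340, which matches the labels printed above.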


In [10]:
# Strategy-specific parameters are passed as keyword arguments.
# For example, the word_lookup strategy accepts an unknown_threshold:
detector = GarbleDetector(
    Strategy.WORD_LOOKUP,
    unknown_threshold=0.7,  # require 70% unknown words to flag
)

print(f"'hello world':    garbled={detector.predict('hello world')}")
print(f"'xkrf plmq bvzt': garbled={detector.predict('xkrf plmq bvzt')}")

'hello world':    garbled=False
'xkrf plmq bvzt': garbled=True


## 8. Real-World Examples

Here are practical scenarios where garble detection is useful.

In [11]:
# Scenario 1: Filtering user input / form submissions
print("=== Form Input Validation ===")

detector = GarbleDetector(Strategy.MARKOV_CHAIN)

form_submissions = [
    "John Smith",
    "I need help with my account",
    "asdfasdfasdf",
    "Please reset my password",
    "jkljkljkl qweqweqwe",
    "My order number is 12345",
]

for text in form_submissions:
    is_garbled = detector.predict(text)
    status = "REJECTED" if is_garbled else "OK"
    print(f"  [{status:>8s}] {text}")

=== Form Input Validation ===
  [      OK] John Smith
  [      OK] I need help with my account
  [REJECTED] asdfasdfasdf
  [      OK] Please reset my password
  [REJECTED] jkljkljkl qweqweqwe
  [      OK] My order number is 12345


In [12]:
# Scenario 2: Detecting encoding corruption in a data pipeline
print("=== Encoding Corruption Detection ===")

mojibake_detector = GarbleDetector(Strategy.MOJIBAKE)

records = [
    "Caf\u00c3\u00a9 au lait",         # Mojibake: UTF-8 bytes misread as Latin-1
    "Caf\u00e9 au lait",               # Correct: proper UTF-8
    "na\u00c3\u00afve",                # Mojibake
    "naive",                            # Correct
    "r\u00c3\u00a9sum\u00c3\u00a9",   # Mojibake
    "r\u00e9sum\u00e9",               # Correct
]

for text in records:
    is_corrupted = mojibake_detector.predict(text)
    status = "CORRUPTED" if is_corrupted else "OK"
    print(f"  [{status:>9s}] {text}")

=== Encoding Corruption Detection ===
  [CORRUPTED] CafÃ© au lait
  [       OK] Café au lait
  [CORRUPTED] naÃ¯ve
  [       OK] naive
  [CORRUPTED] rÃ©sumÃ©
  [CORRUPTED] résumé
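
Note that the last record, the correctly encoded `résumé`, is also reported as CORRUPTED in the output above, which looks like a false positive. One way to cross-check such cases with the standard library alone is a round trip: if a string can be encoded as Latin-1 and the resulting bytes decode cleanly as UTF-8 into something different, it was probably UTF-8 misread as Latin-1. The helper `repair_mojibake` below is a hypothetical name for this sketch and is not part of pygarble.

```python
from typing import Optional


def repair_mojibake(text: str) -> Optional[str]:
    """Return the repaired string if text looks like UTF-8 misread as Latin-1, else None."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable in Latin-1, or the bytes are not valid UTF-8.
        return None
    # Pure ASCII survives the round trip unchanged, so it is not mojibake.
    return repaired if repaired != text else None


records = [
    "Caf\u00c3\u00a9 au lait",
    "Caf\u00e9 au lait",
    "na\u00c3\u00afve",
    "naive",
    "r\u00c3\u00a9sum\u00c3\u00a9",
    "r\u00e9sum\u00e9",
]
for text in records:
    fixed = repair_mojibake(text)
    status = f"CORRUPTED -> {fixed}" if fixed is not None else "OK"
    print(f"  {text!r}: {status}")
```

This check only covers the UTF-8 read as Latin-1 case named in the comments above. Other mix-ups, such as Windows-1252, would need the matching codec.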


In [13]:
# Scenario 3: Detecting homoglyph/spoofing attacks
print("=== Homoglyph Detection ===")

script_detector = GarbleDetector(Strategy.UNICODE_SCRIPT)

urls_and_names = [
    "paypal",                   # Normal Latin
    "p\u0430ypal",             # Cyrillic 'a' (U+0430) mixed with Latin
    "google",                   # Normal Latin
    "g\u043eogle",             # Cyrillic 'o' (U+043E) mixed with Latin
    "apple",                    # Normal Latin
    "\u0430pple",              # Cyrillic 'a' at start
]

for text in urls_and_names:
    is_spoofed = script_detector.predict(text)
    prob = script_detector.predict_proba(text)
    status = "SPOOFED" if is_spoofed else "OK"
    # Show the actual Unicode codepoints
    codepoints = ' '.join(f'U+{ord(c):04X}' for c in text)
    print(f"  [{status:>7s}] '{text}' (prob: {prob:.2f})")
    print(f"           codepoints: {codepoints}")

=== Homoglyph Detection ===
  [     OK] 'paypal' (prob: 0.00)
           codepoints: U+0070 U+0061 U+0079 U+0070 U+0061 U+006C
  [SPOOFED] 'pаypal' (prob: 0.85)
           codepoints: U+0070 U+0430 U+0079 U+0070 U+0061 U+006C
  [     OK] 'google' (prob: 0.00)
           codepoints: U+0067 U+006F U+006F U+0067 U+006C U+0065
  [SPOOFED] 'gоogle' (prob: 0.85)
           codepoints: U+0067 U+043E U+006F U+0067 U+006C U+0065
  [     OK] 'apple' (prob: 0.00)
           codepoints: U+0061 U+0070 U+0070 U+006C U+0065
  [SPOOFED] 'аpple' (prob: 0.85)
           codepoints: U+0430 U+0070 U+0070 U+006C U+0065
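
The idea behind mixed-script detection can be approximated with the standard library. Python's `unicodedata` module has no script property, so this sketch uses the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) as a rough proxy. The helper names `scripts_used` and `is_mixed_script` are hypothetical, and this is not how pygarble's `UNICODE_SCRIPT` strategy is necessarily implemented.

```python
import unicodedata
from typing import Set


def scripts_used(text: str) -> Set[str]:
    """Collect the leading word of each letter's Unicode name as a script label."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(text: str) -> bool:
    """Flag text whose letters come from more than one script."""
    return len(scripts_used(text)) > 1


for text in ["paypal", "p\u0430ypal", "google", "g\u043eogle"]:
    print(f"{text!r}: scripts={sorted(scripts_used(text))} mixed={is_mixed_script(text)}")
```

A check like this only catches mixing. A word written entirely in look-alike Cyrillic letters uses a single script and would pass, so it would need a separate comparison against the expected script.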


In [14]:
# Scenario 4: Quality check on scraped/OCR'd text
print("=== Text Quality Assessment ===")

# Use ensemble for robust detection
ensemble = EnsembleDetector()

scraped_texts = [
    "The annual report shows strong growth in Q4.",
    "xhfk23 sld#@ fgn!! kd0x nvm@ plw$$",
    "Scientists discover new species in the Amazon rainforest.",
    "asd asd asd asd asd asd",
    "Revenue increased by 15% compared to last year.",
]

for text in scraped_texts:
    is_garbled = ensemble.predict(text)
    prob = ensemble.predict_proba(text)
    quality = "LOW" if is_garbled else "HIGH"
    display = text if len(text) <= 55 else text[:52] + '...'
    print(f"  [Quality: {quality:>4s}] (prob: {prob:.2f}) {display}")

=== Text Quality Assessment ===
  [Quality: HIGH] (prob: 0.05) The annual report shows strong growth in Q4.
  [Quality:  LOW] (prob: 0.67) xhfk23 sld#@ fgn!! kd0x nvm@ plw$$
  [Quality: HIGH] (prob: 0.02) Scientists discover new species in the Amazon rainfo...
  [Quality: HIGH] (prob: 0.25) asd asd asd asd asd asd
  [Quality: HIGH] (prob: 0.02) Revenue increased by 15% compared to last year.
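
In the output above, the repeated filler `asd asd asd asd asd asd` is rated HIGH quality with a score of 0.25. A simple extra signal for this kind of input is the share of distinct words. The sketch below is plain Python with a hypothetical helper name, `distinct_word_ratio`; it is not pygarble's `REPETITION` strategy, and any cut-off applied to the ratio would be a choice to tune on your own data.

```python
def distinct_word_ratio(text: str) -> float:
    """Return distinct words divided by total words (1.0 means no repetition)."""
    words = text.lower().split()
    if not words:
        return 1.0
    return len(set(words)) / len(words)


samples = [
    "The annual report shows strong growth in Q4.",
    "asd asd asd asd asd asd",
    "Revenue increased by 15% compared to last year.",
]
for text in samples:
    print(f"{distinct_word_ratio(text):.2f}  {text}")
```

Short natural sentences tend to score near 1.0, while the repeated filler drops to one distinct word in six. Longer natural text repeats common words, so the ratio is most informative on short inputs.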


---

## Summary

| Class | Purpose | Key Methods |
|---|---|---|
| `GarbleDetector` | Single-strategy detection | `predict()`, `predict_proba()` |
| `EnsembleDetector` | Multi-strategy ensemble | `predict()`, `predict_proba()` |
| `Strategy` | Enum of all 28 strategies | `Strategy.MARKOV_CHAIN`, etc. |

**Tips:**
- Start with `Strategy.MARKOV_CHAIN` for general-purpose detection (best F1 score)
- Use `EnsembleDetector()` with default settings for highest precision
- Use specialized strategies for specific use cases (e.g., `MOJIBAKE` for encoding issues, `UNICODE_SCRIPT` for spoofing)
- Both `predict()` and `predict_proba()` accept single strings or lists of strings