# 🧠 The Einstein Guide to OpenSearch Analyzers: Practical Edition

Welcome to the definitive, interactive guide to Text Analysis in OpenSearch. We aren't just reading theory; we are going to run it against your local OpenSearch instance.

**Topics Covered:**
1.  **Text Analysis**: The big picture.
2.  **Analyzers**: The pre-packaged deals.
3.  **Tokenizers**: The chopping block.
4.  **Token Filters**: The polishers.
5.  **Character Filters**: The sanitizers.
6.  **Normalizers**: For keywords.
7.  **Stemming**: Finding the root.
8.  **Token Graphs**: Handling multi-word synonyms.

---

### 🛠️ Setup
First, let's set up a helper function to talk to your local OpenSearch (`localhost:19200`).
We'll use this to send requests to the `_analyze` API.

In [1]:
import requests
import json

# Configuration
OPENSEARCH_URL = "http://localhost:19200"
AUTH = ('admin', 'OpenSearch@2024') # Default demo credentials
VERIFY_SSL = False

def analyze(text, analyzer=None, tokenizer=None, filters=None, char_filters=None, explain=False):
    """
    Helper function to call the _analyze API.
    """
    url = f"{OPENSEARCH_URL}/_analyze"
    
    payload = {
        "text": text,
        "explain": explain
    }
    
    if analyzer:
        payload["analyzer"] = analyzer
    if tokenizer:
        payload["tokenizer"] = tokenizer
    if filters:
        payload["filter"] = filters
    if char_filters:
        payload["char_filter"] = char_filters
        
    try:
        response = requests.post(url, json=payload, auth=AUTH, verify=VERIFY_SSL)
        response.raise_for_status()
        
        # Pretty print the tokens
        result = response.json()
        if explain:
            print(json.dumps(result, indent=2))
        else:
            print(f"Input: '{text}'")
            print("-" * 40)
            if "tokens" in result:
                for token in result["tokens"]:
                    print(f"[{token['position']}] Token: {token['token']:<15} | Type: {token['type']}")
            else:
                print("No tokens produced.")
            print("-" * 40)
            
    except Exception as e:
        print(f"Error: {e}")
        if 'response' in locals():
            print(response.text)

print("✅ Setup Complete. Helper function `analyze()` is ready.")

✅ Setup Complete. Helper function `analyze()` is ready.


## 1. Text Analysis & Analyzers

**Text Analysis** is the process of breaking down text into terms.
An **Analyzer** is the package that does this. It's composed of three stages, which run in this order:
1.  **Char Filters** (0 or more)
2.  **Tokenizer** (Exactly 1)
3.  **Token Filters** (0 or more)
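
The three stages above can be mimicked in a few lines of plain Python. This is only a toy model to show the order of operations; the helper names (`run_pipeline`, `strip_tags`, `split_on_whitespace`, `lowercase`) are hypothetical stand-ins, not OpenSearch's real implementations.

```python
import re


def run_pipeline(text, char_filters, tokenizer, token_filters):
    """Toy analyzer: char filters -> exactly one tokenizer -> token filters."""
    for char_filter in char_filters:      # stage 1: rewrite the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # stage 2: split the string into tokens
    for token_filter in token_filters:    # stage 3: rewrite the token list
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real components
def strip_tags(text):
    return re.sub(r"<[^>]+>", " ", text)


def split_on_whitespace(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


tokens = run_pipeline(
    "<b>Quick</b> Brown FOX",
    char_filters=[strip_tags],
    tokenizer=split_on_whitespace,
    token_filters=[lowercase],
)
print(tokens)  # ['quick', 'brown', 'fox']
```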

Let's look at the `standard` analyzer, which is the default. It splits on word boundaries, drops most punctuation, and lowercases every token.

In [3]:
# Test the 'standard' analyzer
text = "The 2 QUICK Brown-Foxes jumped over,e=22012,LOG: the lazy dog's back!"
analyze(text, analyzer="standard")

Input: 'The 2 QUICK Brown-Foxes jumped over,e=22012,LOG: the lazy dog's back!'
----------------------------------------
[0] Token: the             | Type: <ALPHANUM>
[1] Token: 2               | Type: <NUM>
[2] Token: quick           | Type: <ALPHANUM>
[3] Token: brown           | Type: <ALPHANUM>
[4] Token: foxes           | Type: <ALPHANUM>
[5] Token: jumped          | Type: <ALPHANUM>
[6] Token: over            | Type: <ALPHANUM>
[7] Token: e               | Type: <ALPHANUM>
[8] Token: 22012           | Type: <NUM>
[9] Token: log             | Type: <ALPHANUM>
[10] Token: the             | Type: <ALPHANUM>
[11] Token: lazy            | Type: <ALPHANUM>
[12] Token: dog's           | Type: <ALPHANUM>
[13] Token: back            | Type: <ALPHANUM>
----------------------------------------


## 2. Tokenizers (The Chopping Block)

The **Tokenizer** receives a stream of characters and breaks it into tokens. You must have exactly one.

*   `standard`: Splits on word boundaries and drops most punctuation.
*   `whitespace`: Splits on whitespace and nothing else, so punctuation stays attached.
*   `keyword`: Doesn't split at all. The whole text becomes one token.
*   `pattern`: Splits based on a Regex.

Let's compare `standard` vs `whitespace` vs `keyword`.

In [4]:
text = "email@example.com,is my email."

print("--- Whitespace Tokenizer ---")
analyze(text, tokenizer="whitespace")

print("\n--- Standard Tokenizer ---")
analyze(text, tokenizer="standard")

print("\n--- Keyword Tokenizer ---")
analyze(text, tokenizer="keyword")

--- Whitespace Tokenizer ---
Input: 'email@example.com,is my email.'
----------------------------------------
[0] Token: email@example.com,is | Type: word
[1] Token: my              | Type: word
[2] Token: email.          | Type: word
----------------------------------------

--- Standard Tokenizer ---
Input: 'email@example.com,is my email.'
----------------------------------------
[0] Token: email           | Type: <ALPHANUM>
[1] Token: example.com     | Type: <ALPHANUM>
[2] Token: is              | Type: <ALPHANUM>
[3] Token: my              | Type: <ALPHANUM>
[4] Token: email           | Type: <ALPHANUM>
----------------------------------------

--- Keyword Tokenizer ---
Input: 'email@example.com,is my email.'
----------------------------------------
[0] Token: email@example.com,is my email. | Type: word
----------------------------------------
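
The `pattern` tokenizer from the list above isn't part of that comparison. Here is a minimal sketch of how an inline definition could look in an `_analyze` request body, assuming you want to split on commas (the `,` pattern and both helper names are illustrative choices). The `re.split` helper is only a rough local approximation, not what the cluster runs.

```python
import json
import re


def pattern_tokenizer_payload(text, pattern=","):
    """Build an _analyze request body with an inline pattern tokenizer."""
    return {
        "tokenizer": {"type": "pattern", "pattern": pattern},
        "text": text,
    }


def approximate_pattern_split(text, pattern=","):
    """Rough local approximation: split on the regex and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


sample = "email@example.com,is my email."
payload = pattern_tokenizer_payload(sample)
print(json.dumps(payload, indent=2))
print(approximate_pattern_split(sample))
```

To try it for real, post `payload` to `f"{OPENSEARCH_URL}/_analyze"` with `requests.post(...)`, the same way the setup helper does.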


## 3. Token Filters (The Polishers)

Once tokens are created, **Token Filters** can modify, add, or remove them.

*   `lowercase`: Converts to lowercase.
*   `stop`: Removes common words (the, and, is).
*   `unique`: Removes duplicates.
*   `synonym`: Adds synonyms.

Let's build a custom chain: `standard` tokenizer -> `lowercase` -> `stop` -> `unique`.

In [10]:
text = "The THE the quick Quick QUICK"

print("--- Raw Standard Tokenizer ---")
analyze(text, tokenizer="standard")

print("\n--- With Lowercase + Stop + Unique ---")
analyze(text, tokenizer="standard", filters=["lowercase", "stop", "unique"])

--- Raw Standard Tokenizer ---
Input: 'The THE the quick Quick QUICK'
----------------------------------------
[0] Token: The             | Type: <ALPHANUM>
[1] Token: THE             | Type: <ALPHANUM>
[2] Token: the             | Type: <ALPHANUM>
[3] Token: quick           | Type: <ALPHANUM>
[4] Token: Quick           | Type: <ALPHANUM>
[5] Token: QUICK           | Type: <ALPHANUM>
----------------------------------------

--- With Lowercase + Stop + Unique ---
Input: 'The THE the quick Quick QUICK'
----------------------------------------
[3] Token: quick           | Type: <ALPHANUM>
----------------------------------------
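
Notice that the surviving token sits at position `[3]`, not `[0]`: removed tokens leave gaps, and `unique` keeps the first occurrence. The toy simulation below reproduces that bookkeeping in plain Python. It is a sketch for intuition only; `simulate_chain` is a hypothetical helper and its one-word stopword list is far shorter than the real default.

```python
def simulate_chain(words, stopwords=("the",)):
    """Toy model of lowercase -> stop -> unique that keeps original positions."""
    # lowercase: every token keeps the position it was tokenized at
    tokens = [(position, word.lower()) for position, word in enumerate(words)]
    # stop: drop stopwords without renumbering the remaining tokens
    tokens = [(position, word) for position, word in tokens if word not in stopwords]
    # unique: keep only the first occurrence of each term
    seen, unique = set(), []
    for position, word in tokens:
        if word not in seen:
            seen.add(word)
            unique.append((position, word))
    return unique


result = simulate_chain("The THE the quick Quick QUICK".split())
print(result)  # [(3, 'quick')]
```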


## 4. Character Filters (The Sanitizers)

These run **before** the tokenizer. They work on the raw string.

*   `html_strip`: Removes HTML tags.
*   `mapping`: Replaces characters (e.g., `:)` -> `_happy_`).

Let's strip some HTML.

In [11]:
text = "<p>I <b>love</b> OpenSearch! &copy; 2024</p>"

print("--- Without Char Filter ---")
analyze(text, tokenizer="standard")

print("\n--- With HTML Strip ---")
analyze(text, tokenizer="standard", char_filters=["html_strip"])

--- Without Char Filter ---
Input: '<p>I <b>love</b> OpenSearch! &copy; 2024</p>'
----------------------------------------
[0] Token: p               | Type: <ALPHANUM>
[1] Token: I               | Type: <ALPHANUM>
[2] Token: b               | Type: <ALPHANUM>
[3] Token: love            | Type: <ALPHANUM>
[4] Token: b               | Type: <ALPHANUM>
[5] Token: OpenSearch      | Type: <ALPHANUM>
[6] Token: copy            | Type: <ALPHANUM>
[7] Token: 2024            | Type: <NUM>
[8] Token: p               | Type: <ALPHANUM>
----------------------------------------

--- With HTML Strip ---
Input: '<p>I <b>love</b> OpenSearch! &copy; 2024</p>'
----------------------------------------
[0] Token: I               | Type: <ALPHANUM>
[1] Token: love            | Type: <ALPHANUM>
[2] Token: OpenSearch      | Type: <ALPHANUM>
[3] Token: ©               | Type: <EMOJI>
[4] Token: 2024            | Type: <NUM>
----------------------------------------
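
The `mapping` char filter from the list above needs its rules spelled out, so it has to be defined inline rather than passed by name. Below is a minimal sketch of such a request body using the `:)` -> `_happy_` example. The helper names are hypothetical, and `approximate_mapping` is just plain string replacement to show the idea locally; the real filter may resolve overlapping rules differently.

```python
import json


def mapping_char_filter_payload(text, mappings):
    """Build an _analyze body with an inline mapping char filter."""
    return {
        "tokenizer": "standard",
        "char_filter": [{"type": "mapping", "mappings": mappings}],
        "text": text,
    }


def approximate_mapping(text, mappings):
    """Rough local approximation using plain string replacement."""
    for rule in mappings:
        source, target = (part.strip() for part in rule.split("=>"))
        text = text.replace(source, target)
    return text


rules = [":) => _happy_"]
sample = "I love OpenSearch :)"
print(json.dumps(mapping_char_filter_payload(sample, rules), indent=2))
print(approximate_mapping(sample, rules))  # I love OpenSearch _happy_
```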


## 5. Normalizers (For Keywords)

Normalizers are for `keyword` fields where you want exact matching but with some leniency (like case insensitivity). They **do not** tokenize.

Note: A custom normalizer lives in an index's settings, so testing it by name through `_analyze` requires an index that defines it. Without one, you can simulate it with the `keyword` tokenizer plus token filters.

Simulating a normalizer that lowercases:

In [14]:
text = "ID-123-XYZ"

# A normalizer effectively does this:
analyze(text, tokenizer="keyword", filters=["lowercase"])

Input: 'ID-123-XYZ'
----------------------------------------
[0] Token: id-123-xyz      | Type: word
----------------------------------------
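
For comparison, here is a sketch of what a real normalizer definition could look like in an index-creation body. The field name `product_id` and the normalizer name `lowercase_normalizer` are made up for illustration; nothing here is taken from an existing index.

```python
import json


def normalizer_index_body(field="product_id", normalizer="lowercase_normalizer"):
    """Build an index-creation body with a lowercasing normalizer on a keyword field."""
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    # Normalizers take char filters and token filters, but no tokenizer
                    normalizer: {"type": "custom", "filter": ["lowercase"]}
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "keyword", "normalizer": normalizer}
            }
        },
    }


body = normalizer_index_body()
print(json.dumps(body, indent=2))
```

Sending this body in a `PUT` to a new index name of your choice should create the index; after that, an index-scoped `_analyze` call with `{"field": "product_id", "text": "ID-123-XYZ"}` would exercise the normalizer itself.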


## 6. Stemming (Finding the Root)

Stemming reduces words to their base form.
*   `running` -> `run`
*   `cats` -> `cat`

Common choices: the `snowball` and `porter_stem` token filters.

In [15]:
text = "The foxes are running and jumping quickly"

# Using the 'snowball' filter
analyze(text, tokenizer="standard", filters=["lowercase", "snowball"])

Input: 'The foxes are running and jumping quickly'
----------------------------------------
[0] Token: the             | Type: <ALPHANUM>
[1] Token: fox             | Type: <ALPHANUM>
[2] Token: are             | Type: <ALPHANUM>
[3] Token: run             | Type: <ALPHANUM>
[4] Token: and             | Type: <ALPHANUM>
[5] Token: jump            | Type: <ALPHANUM>
[6] Token: quick           | Type: <ALPHANUM>
----------------------------------------
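
Different stemmers make different trade-offs, so it can be worth running the same sentence through several. The sketch below only builds the request bodies; `stemmer_payloads` is a hypothetical helper, and the three filter names are built-in stemming filters assumed to be available on your cluster.

```python
def stemmer_payloads(text, stem_filters=("porter_stem", "kstem", "snowball")):
    """Build one _analyze body per stemming filter so results can be compared."""
    return {
        name: {
            "tokenizer": "standard",
            "filter": ["lowercase", name],  # lowercase first, then stem
            "text": text,
        }
        for name in stem_filters
    }


payloads = stemmer_payloads("The foxes are running and jumping quickly")
for name, body in payloads.items():
    print(name, "->", body["filter"])
```

Each body can be posted to `f"{OPENSEARCH_URL}/_analyze"`, or you can simply call `analyze(text, tokenizer="standard", filters=["lowercase", name])` in a loop.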


## 7. Token Graphs (Multi-word Synonyms)

When a synonym spans multiple words (e.g., "ny" -> "new york"), simple token replacement gets messy because one token has to line up with two.
The `synonym_graph` filter handles this by emitting a token graph: "ny" starts at the same position as "new" and gets a `positionLength` of 2, so it spans the same positions as "new york".

*Note: This is complex to visualize in simple JSON, but look at the `positionLength` if available.*

In [16]:
# Define the synonym_graph filter inline in the request so we can supply custom synonyms
text = "I live in NY"

payload = {
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym_graph",
      "synonyms": ["ny, new york"]
    }
  ],
  "text": text
}

# Call the API directly so we can print the raw JSON; the helper's compact output omits positionLength
response = requests.post(f"{OPENSEARCH_URL}/_analyze", json=payload, auth=AUTH, verify=VERIFY_SSL)
print(json.dumps(response.json(), indent=2))

{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "live",
      "start_offset": 2,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "in",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "new",
      "start_offset": 10,
      "end_offset": 12,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "ny",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 3,
      "positionLength": 2
    },
    {
      "token": "york",
      "start_offset": 10,
      "end_offset": 12,
      "type": "SYNONYM",
      "position": 4
    }
  ]
}
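
One way to "see" the graph is to treat each token as an edge from `position` to `position + positionLength` and list every path from start to end. The sketch below does that for the positions shown in the response above; `token_graph_paths` is a hypothetical helper written for this guide, not part of any OpenSearch client.

```python
from collections import defaultdict


def token_graph_paths(tokens):
    """Enumerate every start-to-end path through an _analyze token graph."""
    edges = defaultdict(list)
    for token in tokens:
        start = token["position"]
        end = start + token.get("positionLength", 1)  # missing means length 1
        edges[start].append((end, token["token"]))
    if not edges:
        return []

    first = min(edges)
    last = max(end for outgoing in edges.values() for end, _ in outgoing)
    paths = []

    def walk(node, so_far):
        if node == last:
            paths.append(so_far)
            return
        for end, term in edges.get(node, []):
            walk(end, so_far + [term])

    walk(first, [])
    return paths


# Positions copied from the response above
sample_tokens = [
    {"token": "i", "position": 0},
    {"token": "live", "position": 1},
    {"token": "in", "position": 2},
    {"token": "new", "position": 3},
    {"token": "ny", "position": 3, "positionLength": 2},
    {"token": "york", "position": 4},
]
paths = token_graph_paths(sample_tokens)
for path in paths:
    print(" ".join(path))
# i live in new york
# i live in ny
```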


## 8. Real World: Analyzing `patronidata`

Let's look under the hood of your live index `patronidata`.
We will:
1.  Fetch the **Mapping** to see which analyzers are assigned to which fields.
2.  Fetch the **Settings** to see if any custom analyzers are defined.
3.  Test the analyzer on a specific field.

In [17]:
# 1. Get Mapping & Settings
index_name = "patronidata"

try:
    # Get Mapping
    mapping_url = f"{OPENSEARCH_URL}/{index_name}/_mapping"
    mapping_resp = requests.get(mapping_url, auth=AUTH, verify=VERIFY_SSL)
    mapping_resp.raise_for_status()
    
    # Get Settings
    settings_url = f"{OPENSEARCH_URL}/{index_name}/_settings"
    settings_resp = requests.get(settings_url, auth=AUTH, verify=VERIFY_SSL)
    settings_resp.raise_for_status()

    print(f"--- MAPPING for {index_name} ---")
    # Just printing the properties to keep it readable
    props = mapping_resp.json()[index_name]["mappings"].get("properties", {})
    print(json.dumps(props, indent=2))
    
    print(f"\n--- SETTINGS for {index_name} ---")
    analysis_settings = settings_resp.json()[index_name]["settings"]["index"].get("analysis", "No custom analysis settings found.")
    print(json.dumps(analysis_settings, indent=2))

except Exception as e:
    print(f"Error: {e}")

--- MAPPING for patronidata ---
{
  "@timestamp": {
    "type": "date"
  },
  "_raw": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  },
  "cribl_breaker": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  },
  "host": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  },
  "message": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  },
  "source": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

--- SETTINGS for patronidata ---
"No custom analysis settings found."


In [18]:
# 2. Test Analyzer on a Field
# We'll try to find a text field from the mapping above. 
# If you see a field like 'message' or 'log' in the output above, replace 'message' below.

field_to_test = "message" # Default guess, change this based on the mapping output!
sample_text = "Error: Connection failed to 192.168.1.1"

print(f"--- Analyzing text using the analyzer configured for field: '{field_to_test}' ---")

url = f"{OPENSEARCH_URL}/{index_name}/_analyze"
payload = {
  "field": field_to_test,
  "text": sample_text
}

try:
    response = requests.post(url, json=payload, auth=AUTH, verify=VERIFY_SSL)
    # If the field doesn't exist, this might fail or fall back to the default analyzer
    if response.status_code == 400:
        print(f"Field '{field_to_test}' might not exist. Check the mapping above.")
        print(response.text)
    else:
        result = response.json()
        for token in result.get("tokens", []):
             print(f"[{token['position']}] Token: {token['token']:<15} | Type: {token['type']}")

except Exception as e:
    print(f"Error: {e}")

--- Analyzing text using the analyzer configured for field: 'message' ---
[0] Token: error           | Type: <ALPHANUM>
[1] Token: connection      | Type: <ALPHANUM>
[2] Token: failed          | Type: <ALPHANUM>
[3] Token: to              | Type: <ALPHANUM>
[4] Token: 192.168.1.1     | Type: <NUM>
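
The mapping printed earlier nests `host.name` under `properties` and gives every text field a `keyword` multi-field under `fields`, so looping over only the top-level keys misses them. Here is a minimal sketch of a recursive walk that lists every field path; `flatten_fields` is a hypothetical helper, and the sample dict is a trimmed copy of the mapping output shown above.

```python
def flatten_fields(properties, prefix=""):
    """Yield (path, type, analyzer_or_normalizer) for every field in a mapping."""
    for name, config in properties.items():
        path = f"{prefix}{name}"
        if "type" in config:
            yield path, config["type"], config.get("analyzer") or config.get("normalizer")
        # Object fields nest under "properties"; multi-fields nest under "fields"
        for key in ("properties", "fields"):
            yield from flatten_fields(config.get(key, {}), prefix=f"{path}.")


# Trimmed copy of the mapping printed earlier
keyword_subfield = {"keyword": {"type": "keyword", "ignore_above": 256}}
sample_properties = {
    "host": {"properties": {"name": {"type": "text", "fields": keyword_subfield}}},
    "message": {"type": "text", "fields": keyword_subfield},
}

fields = list(flatten_fields(sample_properties))
for path, field_type, component in fields:
    print(f"{path:<22} {field_type:<8} {component}")
```

A `None` in the last column means the field names no analyzer or normalizer of its own.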


In [22]:
# 3. Deep Dive Audit: What's inside `patronidata`?
# Let's programmatically check for all the concepts we learned:
# Analyzers, Tokenizers, Filters, Char Filters, Normalizers.

print(f"🕵️‍♀️ AUDITING INDEX: {index_name}\n")

try:
    # Re-fetch for this cell's context
    settings = requests.get(f"{OPENSEARCH_URL}/{index_name}/_settings", auth=AUTH, verify=VERIFY_SSL).json()
    mapping = requests.get(f"{OPENSEARCH_URL}/{index_name}/_mapping", auth=AUTH, verify=VERIFY_SSL).json()
    
    index_settings = settings[index_name]["settings"]["index"]
    analysis_config = index_settings.get("analysis", {})
    
    # 1. Check for Custom Definitions
    print("--- 1. CUSTOM DEFINITIONS (Settings) ---")
    components = ["analyzer", "tokenizer", "filter", "char_filter", "normalizer"]
    found_custom = False
    for component in components:
        definitions = analysis_config.get(component, {})
        if definitions:
            found_custom = True
            print(f"✅ Custom {component.title()}s found: {list(definitions.keys())}")
            print(json.dumps(definitions, indent=2))
        else:
            print(f"❌ No custom {component}s defined.")
            
    if not found_custom:
        print("\n👉 This index relies on BUILT-IN defaults (Standard Analyzer, etc).")

    # 2. Check Field Assignments
    print("\n--- 2. FIELD ASSIGNMENTS (Mapping) ---")
    props = mapping[index_name]["mappings"].get("properties", {})
    
    text_fields = []
    keyword_fields = []
    
    for field, config in props.items():
        if "type" in config:
            if config["type"] == "text":
                analyzer = config.get("analyzer", "standard (default)")
                print(f"📝 Field '{field}' is TEXT using analyzer: '{analyzer}'")
                text_fields.append(field)
            elif config["type"] == "keyword":
                normalizer = config.get("normalizer", "None")
                print(f"🔑 Field '{field}' is KEYWORD using normalizer: '{normalizer}'")
                keyword_fields.append(field)
                
    # 3. Live Test on Actual Fields
    print("\n--- 3. LIVE TEST ---")
    
    # Test a Text Field (Stemming, Tokenization check)
    if text_fields:
        target_field = text_fields[0]
        test_text = "The Quick Foxes are Running!"
        print(f"\nTesting Analyzer on field '{target_field}' with text: '{test_text}'")
        
        # We use the index-specific analyze endpoint
        resp = requests.post(
            f"{OPENSEARCH_URL}/{index_name}/_analyze", 
            json={"field": target_field, "text": test_text}, 
            auth=AUTH, verify=VERIFY_SSL
        ).json()
        
        raw_tokens = resp.get("tokens", [])
        print(raw_tokens)
        tokens = [t["token"] for t in raw_tokens]
        print(f"Resulting Tokens: {tokens}")
        
        # --- AUTOMATED CHECKS ---
        print("\n🔍 Analysis Report:")
        
        # Lowercase Check
        # Compare against lower() so digit-only tokens (where islower() is False) don't count as uppercase
        is_lowercased = all(t == t.lower() for t in tokens)
        if is_lowercased:
            print("✅ Lowercasing: YES")
        else:
            print("❌ Lowercasing: NO (Uppercase preserved)")

        # Stop Word Check (Expect 'the' and 'are' to be removed if stop filter is on)
        has_stops = "the" in tokens or "are" in tokens
        if not has_stops:
            print("✅ Stop Words: YES (Removed 'the', 'are')")
        else:
            print("❌ Stop Words: NO (Preserved 'the', 'are')")

        # Stemming Check (Expect 'foxes'->'fox' or 'running'->'run')
        is_stemmed = ("fox" in tokens and "foxes" not in tokens) or ("run" in tokens and "running" not in tokens)
        if is_stemmed:
            print("✅ Stemming: YES ('foxes' -> 'fox' / 'running' -> 'run')")
        else:
            print("❌ Stemming: NO ('foxes' and 'running' preserved)")

        # Token Graph Check
        # Graphs are present if positionLength > 1 or if multiple tokens share the same position
        positions = [t.get("position") for t in raw_tokens]
        position_lengths = [t.get("positionLength", 1) for t in raw_tokens]
        
        has_overlapping_positions = len(positions) != len(set(positions))
        has_multi_position_tokens = any(pl > 1 for pl in position_lengths)

        if has_overlapping_positions or has_multi_position_tokens:
            print("✅ Token Graph: YES (Detected multi-position tokens or synonyms)")
            if has_overlapping_positions:
                print("   - Found multiple tokens at the same position.")
            if has_multi_position_tokens:
                print("   - Found tokens spanning multiple positions.")
        else:
            print("❌ Token Graph: NO (Linear token stream)")
    
    # Test a Keyword Field (Normalizer check)
    if keyword_fields:
        target_field = keyword_fields[0]
        test_text = "MixedCase-Value"
        print(f"\nTesting Normalizer on field '{target_field}' with text: '{test_text}'")
        
        # For keywords, _analyze shows the single token produced
        resp = requests.post(
            f"{OPENSEARCH_URL}/{index_name}/_analyze", 
            json={"field": target_field, "text": test_text}, 
            auth=AUTH, verify=VERIFY_SSL
        ).json()
        
        tokens = [t["token"] for t in resp.get("tokens", [])]
        print(f"Resulting Token: {tokens}")
        
        if tokens and tokens[0] == test_text:
            print("Observation: Exact match (No normalization).")
        else:
            print("Observation: Normalized (e.g., lowercased).")

except Exception as e:
    print(f"Error during audit: {e}")

🕵️‍♀️ AUDITING INDEX: patronidata

--- 1. CUSTOM DEFINITIONS (Settings) ---
❌ No custom analyzers defined.
❌ No custom tokenizers defined.
❌ No custom filters defined.
❌ No custom char_filters defined.
❌ No custom normalizers defined.

👉 This index relies on BUILT-IN defaults (Standard Analyzer, etc).

--- 2. FIELD ASSIGNMENTS (Mapping) ---
📝 Field '_raw' is TEXT using analyzer: 'standard (default)'
📝 Field 'cribl_breaker' is TEXT using analyzer: 'standard (default)'
📝 Field 'message' is TEXT using analyzer: 'standard (default)'
📝 Field 'source' is TEXT using analyzer: 'standard (default)'

--- 3. LIVE TEST ---

Testing Analyzer on field '_raw' with text: 'The Quick Foxes are Running!'
[{'token': 'the', 'start_offset': 0, 'end_offset': 3, 'type': '<ALPHANUM>', 'position': 0}, {'token': 'quick', 'start_offset': 4, 'end_offset': 9, 'type': '<ALPHANUM>', 'position': 1}, {'token': 'foxes', 'start_offset': 10, 'end_offset': 15, 'type': '<ALPHANUM>', 'posi