# Thai Grapheme Classification Algorithm

This algorithm sorts Thai graphemes into **three main classes** (each with subclasses) and a separate exception class for **อ**:

1. **ฐาน (tan)** - *foundation class*
   Consonant letters that act as the **base** for dependent marks (vowels, tone marks, etc.).

2. **สระ (sara)** - *vowel class*
   Vowel graphemes (both independent and dependent), which attach to a foundation consonant.

3. **ยุกต์ (yuk)** - *dependent class*
   Tone marks and other diacritics that cannot exist without a foundation consonant.

4. **ข้อยกเว้น (kho yok waen)** - *exception class*
   The consonant **อ** is treated separately, since it functions both as a **foundation** (carrier consonant) and as part of certain **vowel symbols**. The implementation in this notebook gives **ว** the same treatment.
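
The class scheme above can be sketched with plain Unicode ranges instead of the JSON resources the notebook loads. This is only an illustration: the names `FOUNDATION`, `VOWELS`, `DEPENDENT`, `EXCEPTION` and `classify_char` are hypothetical, and the assumption that the foundation class is "all consonants except อ" is mine, not something the data files confirm.

```python
# Hypothetical stand-ins for the notebook's JSON-backed lists, built from
# the Thai Unicode block (U+0E00-U+0E7F).
VOWEL_LETTERS = {"ฤ", "ฦ"}  # located in the consonant range but treated as vowels
CONSONANTS = {chr(cp) for cp in range(0x0E01, 0x0E2F)} - VOWEL_LETTERS
EXCEPTION = {"อ"}                    # assumption: only อ, as in the list above
FOUNDATION = CONSONANTS - EXCEPTION  # assumption: every other consonant
VOWELS = (
    {chr(cp) for cp in range(0x0E30, 0x0E3A)}    # U+0E30..U+0E39 (sara a .. sara uu)
    | {chr(cp) for cp in range(0x0E40, 0x0E46)}  # U+0E40..U+0E45 (leading vowels, lakkhangyao)
    | VOWEL_LETTERS
)
DEPENDENT = {chr(cp) for cp in range(0x0E47, 0x0E4E)}  # U+0E47..U+0E4D (tone marks, diacritics)


def classify_char(ch: str) -> str:
    """Return the class name for a single character, or 'unknown'."""
    if ch in EXCEPTION:
        return "exception"
    if ch in FOUNDATION:
        return "foundation"
    if ch in VOWELS:
        return "vowel"
    if ch in DEPENDENT:
        return "dependent"
    return "unknown"


for ch in "ยาอ":
    print(ch, classify_char(ch))
```

Checking the exception class first keeps อ from being reported as an ordinary foundation if it ever appears in both sets.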



In [16]:
import json

with open("res/foundation/foundation.json", "r", encoding="utf-8") as f:
    data_foundation = json.load(f)

with open("res/sara/sara_combos.json", "r", encoding="utf-8") as f:
    data_sara = json.load(f)

foundation = data_foundation["foundation"]
vowel = data_sara
dependent = []
exception = ["อ"]

print(len(foundation))
print(len(vowel))

43
79
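
The matcher in the next cell treats each entry of `sara_combos.json` as a template: `x` marks the initial-consonant slot, `f` marks a final consonant, and every other character is a literal vowel or mark. As a rough illustration of that notation, the hypothetical helper `describe_template` (not part of the notebook) splits one template into its parts. It assumes a template contains exactly one `x`, which the data file may or may not guarantee.

```python
from typing import Dict, Union


def describe_template(template: str) -> Dict[str, Union[str, bool]]:
    """Split a vowel template around its 'x' slot (assumes exactly one 'x')."""
    if template.count("x") != 1:
        raise ValueError(f"expected exactly one 'x' in {template!r}")
    before, after = template.split("x")
    return {
        "leading": before,                  # written before the initial consonant
        "trailing": after.replace("f", ""), # written after it, minus the final slot
        "has_final": "f" in after,          # whether a final consonant is required
    }


print(describe_template("xา"))
print(describe_template("เx็f"))
```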


In [17]:
import json
from typing import List, Dict, Optional, Tuple

# Load the data
with open("res/foundation/foundation.json", "r", encoding="utf-8") as f:
    data_foundation = json.load(f)

with open("res/sara/sara_combos.json", "r", encoding="utf-8") as f:
    sara_combos = json.load(f)

# Initialize character classifications
foundation = data_foundation["foundation"]
vowel_patterns = sara_combos
exception_chars = ["อ", "ว"]  # Both can act as foundation or vowel component

# Leading vowels that appear before consonants but read after
leading_vowels = ["เ", "แ", "โ", "ใ", "ไ"]

# Dependent marks (tone marks and other diacritics)  
dependent_marks = ["่", "้", "๊", "๋", "์", "็", "ํ"]
vowel_marks = ["ั", "ิ", "ี", "ึ", "ื", "ุ", "ู"]

def classifyThaiGraphemes(thai: str) -> List[Dict]:
    """
    Classify Thai graphemes with their class and reading order.

    Returns a list of dictionaries, sorted by (read_order, start_pos), containing:
    - grapheme: The pattern template for vowel units (e.g. "xา"); the literal text otherwise
    - instance: The actual matched text (e.g. "ยา")
    - class: "foundation", "cluster", "vowel", "dependent", "exception", or "unknown"
    - read_order: 0 = initial, 1 = vowel, 2 = final; unmatched characters get 3 + position
    - pattern: Vowel units only: the pattern that was matched
    - role: "initial" (x), "vowel", "final" (f), or "standalone" (unmatched)
    - start_pos: Starting position in the original string
    - positions: All positions occupied by this unit
    """
    if not thai:
        return []
    
    result = []
    used_positions = set()  # Track which positions have been consumed
    
    # Try to match vowel patterns at each position
    for start_pos in range(len(thai)):
        if start_pos in used_positions:
            continue
            
        # Try to match a vowel pattern starting here
        match_result = find_best_vowel_match(thai, start_pos, used_positions)
        
        if match_result:
            matched_pattern, pattern_template, positions_used, x_positions, f_positions = match_result
            
            # Calculate vowel-only positions (excluding x and f)
            vowel_only_positions = [p for p in positions_used if p not in x_positions and p not in f_positions]
            
            # Add the vowel unit - grapheme is the pattern, instance is the actual text
            vowel_unit = {
                'grapheme': pattern_template,  # The pattern (e.g. "xา")
                'instance': matched_pattern,   # The actual matched text (e.g. "ยา")
                'class': 'vowel',
                'pattern': pattern_template,
                'start_pos': min(vowel_only_positions) if vowel_only_positions else min(positions_used),
                'positions': positions_used,
                'role': 'vowel',
                'read_order': 1  # Vowel is always read second
            }
            result.append(vowel_unit)
            
            # Add initial consonant(s) (x)
            if x_positions:
                x_grapheme = ''.join(thai[p] for p in x_positions)
                result.append({
                    'grapheme': x_grapheme,
                    'instance': x_grapheme,  # For non-vowels, grapheme and instance are the same
                    'class': 'foundation' if len(x_positions) == 1 else 'cluster',
                    'start_pos': min(x_positions),
                    'positions': x_positions,
                    'role': 'initial',
                    'read_order': 0  # Initial is always read first
                })
            
            # Add final consonant (f)
            if f_positions:
                f_grapheme = ''.join(thai[p] for p in f_positions)
                result.append({
                    'grapheme': f_grapheme,
                    'instance': f_grapheme,  # For non-vowels, grapheme and instance are the same
                    'class': 'foundation',
                    'start_pos': min(f_positions),
                    'positions': f_positions,
                    'role': 'final',
                    'read_order': 2  # Final is always read third
                })
            
            # Mark all positions as used
            used_positions.update(positions_used)
    
    # Add any remaining unmatched characters
    for i in range(len(thai)):
        if i not in used_positions:
            char = thai[i]
            result.append({
                'grapheme': char,
                'instance': char,
                'class': classify_single_char(char),
                'start_pos': i,
                'positions': [i],
                'role': 'standalone',
                'read_order': 3 + i  # Standalone chars come after matched patterns
            })
            used_positions.add(i)
    
    return sorted(result, key=lambda x: (x['read_order'], x['start_pos']))

def find_best_vowel_match(text: str, start_pos: int, used_positions: set) -> Optional[Tuple]:
    """
    Find the best matching vowel pattern at the given position.
    Returns: (matched_string, pattern_template, positions_used, x_positions, f_positions)
    """
    best_match = None
    best_length = 0
    
    # Try each vowel pattern
    for pattern in vowel_patterns:
        match_result = try_match_vowel_pattern(text, start_pos, pattern, used_positions)
        
        if match_result:
            matched, positions_used, x_positions, f_positions = match_result
            # Keep the longest match; on a tie the pattern listed first wins
            if len(positions_used) > best_length:
                best_match = (matched, pattern, positions_used, x_positions, f_positions)
                best_length = len(positions_used)
    
    return best_match

def try_match_vowel_pattern(text: str, start_pos: int, pattern: str, used_positions: set) -> Optional[Tuple]:
    """
    Try to match a specific vowel pattern.
    x = initial consonant(s) - can be multiple consecutive foundations (cluster)
    f = final consonant - single foundation that comes after the vowel
    Other characters must match exactly.
    
    Returns: (matched_string, positions_used, x_positions, f_positions)
    """
    matched_chars = []
    positions_used = []
    x_positions = []
    f_positions = []
    text_pos = start_pos
    pattern_pos = 0
    
    # For patterns starting with x, we need to be at a foundation position
    if pattern[0] == 'x' and start_pos < len(text):
        if text[start_pos] not in foundation and text[start_pos] not in exception_chars:
            return None
    
    while pattern_pos < len(pattern) and text_pos < len(text):
        pattern_char = pattern[pattern_pos]
        
        if pattern_char == 'x':
            # x can be multiple consecutive foundations (cluster)
            if text[text_pos] in foundation or text[text_pos] in exception_chars:
                matched_chars.append(text[text_pos])
                x_positions.append(text_pos)
                positions_used.append(text_pos)
                text_pos += 1
                
                # Cluster check: absorb further foundations only when the pattern
                # ends here or its next symbol is another 'x', so a consonant the
                # pattern reserves for 'f' or a literal vowel mark is never taken.
                # Simplified: a pattern ending in 'x' (e.g. "เx") also absorbs one
                # trailing consonant, which is why "เลว" comes out as cluster ลว.
                while (text_pos < len(text) and
                       (text[text_pos] in foundation or text[text_pos] in exception_chars) and
                       (pattern_pos + 1 >= len(pattern) or pattern[pattern_pos + 1] == 'x')):
                    matched_chars.append(text[text_pos])
                    x_positions.append(text_pos)
                    positions_used.append(text_pos)
                    text_pos += 1
                    # Only advance pattern if the pattern has another 'x'
                    if pattern_pos + 1 < len(pattern) and pattern[pattern_pos + 1] == 'x':
                        pattern_pos += 1
                    else:
                        break
            else:
                return None
                
        elif pattern_char == 'f':
            # f is a single final consonant
            if text[text_pos] in foundation or text[text_pos] in exception_chars:
                matched_chars.append(text[text_pos])
                f_positions.append(text_pos)
                positions_used.append(text_pos)
                text_pos += 1
            else:
                return None
                
        else:
            # Must match exactly (vowel marks, tone marks, etc.)
            if text[text_pos] == pattern_char:
                matched_chars.append(text[text_pos])
                positions_used.append(text_pos)
                text_pos += 1
            else:
                return None
        
        pattern_pos += 1
    
    # Check if we matched the full pattern
    if pattern_pos != len(pattern):
        return None
    
    # Check if any non-foundation positions are already used
    for pos in positions_used:
        if pos in used_positions and pos not in x_positions and pos not in f_positions:
            return None
    
    # Validate: if pattern has 'f', we must have found a final consonant
    if 'f' in pattern and not f_positions:
        return None
    
    return (''.join(matched_chars), positions_used, x_positions, f_positions)

def classify_single_char(char: str) -> str:
    """Classify a single Thai character."""
    if char in foundation:
        return 'foundation'
    elif char in exception_chars:
        return 'exception'
    elif char in dependent_marks:
        return 'dependent'
    elif char in vowel_marks or char in leading_vowels or char in ["า", "ะ", "ำ", "ฤ", "ฦ", "ๅ"]:
        return 'vowel'
    else:
        return 'unknown'

# Test cases
thai1 = "ยา"  # simple: x=ย, vowel=า
thai2 = "เด็ก"  # pattern เx็f: x=ด, vowel=เ็, f=ก
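thai3 = "คน"  # intended: x=ค, f=น (hidden vowel); no pattern covers it, so both come out standalone
thai4 = "เลว"  # intended pattern เxf: x=ล, vowel=เ, f=ว (currently matched as เx with cluster ลว)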

thaihard1 = "เกรียน"  # pattern เxียf: x=กร (cluster), vowel=เีย, f=น
thaihard2 = "เอา"  # pattern เxา: x=อ (silent), vowel=เา
thaihard3 = "อย่า"  # x=อ (silent), vowel=่า
thaihard4 = "เอือม"  # pattern เxือf: x=อ, vowel=เือ, f=ม
thaihard5 = "ไกล"  # pattern ไx: x=กล (cluster), vowel=ไ

# Test the function
print("Testing Thai grapheme classification:\n")
print("Format: [read_order] 'grapheme' (class) role [instance]\n")
print("-" * 50)

test_cases = [
    (thai1, "simple: foundation + vowel"),
    (thai2, "เx็f pattern"),
    (thai3, "foundation + final"),
    (thai4, "เxf pattern")
]

for thai_text, description in test_cases:
    print(f"\n'{thai_text}' - {description}:")
    result = classifyThaiGraphemes(thai_text)
    for item in result:
        instance_str = f" [{item['instance']}]" if item['grapheme'] != item['instance'] else ""
        pos_str = f" pos={item['positions']}" if len(item['positions']) > 1 else f" pos={item['start_pos']}"
        print(f"  [{item['read_order']}] '{item['grapheme']}' ({item['class']}) {item['role']}{instance_str}{pos_str}")

Testing Thai grapheme classification:

Format: [read_order] 'grapheme' (class) role [instance]

--------------------------------------------------

'ยา' - simple: foundation + vowel:
  [0] 'ย' (foundation) initial pos=0
  [1] 'xา' (vowel) vowel [ยา] pos=[0, 1]

'เด็ก' - เx็f pattern:
  [0] 'ด' (foundation) initial pos=1
  [1] 'เx็f' (vowel) vowel [เด็ก] pos=[0, 1, 2, 3]
  [2] 'ก' (foundation) final pos=3

'คน' - foundation + final:
  [3] 'ค' (foundation) standalone pos=0
  [4] 'น' (foundation) standalone pos=1

'เลว' - เxf pattern:
  [0] 'ลว' (cluster) initial pos=[1, 2]
  [1] 'เx' (vowel) vowel [เลว] pos=[0, 1, 2]
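
In the output above, เลว is matched as `เx` with ลว as a cluster, because the matcher accepts any second consonant as part of the initial. One possible refinement, sketched here and not taken from the notebook, is to accept only the consonant pairs traditionally taught as true clusters. `TRUE_CLUSTERS` and `split_initial` are hypothetical names.

```python
from typing import Tuple

# The 15 pairs commonly listed as true consonant clusters in Thai.
TRUE_CLUSTERS = {
    "กร", "กล", "กว",
    "ขร", "ขล", "ขว",
    "คร", "คล", "คว",
    "ตร",
    "ปร", "ปล",
    "พร", "พล",
    "ผล",
}


def split_initial(consonants: str) -> Tuple[str, str]:
    """Split a run of consonants into (initial, remainder).

    The initial is two letters only when they form a whitelisted cluster;
    otherwise it is the first letter alone.
    """
    if consonants[:2] in TRUE_CLUSTERS:
        return consonants[:2], consonants[2:]
    return consonants[:1], consonants[1:]


print(split_initial("ลว"))   # ลว is not a cluster, so ว is left for the final slot
print(split_initial("กรน"))  # กร is a cluster, น is left over
```

A whitelist like this is still a heuristic: the same letter pair can straddle a syllable boundary in some words, so it narrows the ambiguity without resolving every case.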


In [18]:
# Test with all provided examples
print("="*60)
print("Testing all examples including hard cases:")
print("="*60)

all_test_cases = [
    (thai1, "simple: foundation + vowel"),
    (thai2, "vowel appears first"), 
    (thai3, "foundation + final"),
    (thai4, "leading vowel + foundation + ว"),
    (thaihard1, "cluster with multi-part vowel"),
    (thaihard2, "อ as silent foundation"),
    (thaihard3, "อ stress test with atomic vowel"),
    (thaihard4, "complex อ: foundation then vowel part"),
    (thaihard5, "ไ read after กล cluster")
]

for thai_text, description in all_test_cases:
    print(f"\n'{thai_text}' - {description}:")
    result = classifyThaiGraphemes(thai_text)
    
    # Show results sorted by reading order
    print("  By reading order:")
    for item in sorted(result, key=lambda x: x['read_order']):
        instance_str = f" [{item['instance']}]" if item['grapheme'] != item['instance'] else ""
        print(f"    [{item['read_order']}] '{item['grapheme']}' ({item['class']}) role={item['role']}{instance_str}")
    
    # Show results in original position order
    print("  By position:")
    for item in sorted(result, key=lambda x: x['start_pos']):
        instance_str = f" [{item['instance']}]" if item['grapheme'] != item['instance'] else ""
        positions_str = f" (spans positions {item['positions']})" if len(item['positions']) > 1 else ""
        print(f"    pos {item['start_pos']}: '{item['grapheme']}' ({item['class']}) role={item['role']} -> read at {item['read_order']}{positions_str}{instance_str}")

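============================================================
Testing all examples including hard cases:
============================================================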

'ยา' - simple: foundation + vowel:
  By reading order:
    [0] 'ย' (foundation) role=initial
    [1] 'xา' (vowel) role=vowel [ยา]
  By position:
    pos 0: 'ย' (foundation) role=initial -> read at 0
    pos 1: 'xา' (vowel) role=vowel -> read at 1 (spans positions [0, 1]) [ยา]

'เด็ก' - vowel appears first:
  By reading order:
    [0] 'ด' (foundation) role=initial
    [1] 'เx็f' (vowel) role=vowel [เด็ก]
    [2] 'ก' (foundation) role=final
  By position:
    pos 0: 'เx็f' (vowel) role=vowel -> read at 1 (spans positions [0, 1, 2, 3]) [เด็ก]
    pos 1: 'ด' (foundation) role=initial -> read at 0
    pos 3: 'ก' (foundation) role=final -> read at 2

'คน' - foundation + final:
  By reading order:
    [3] 'ค' (foundation) role=standalone
    [4] 'น' (foundation) role=standalone
  By position:
    pos 0: 'ค' (foundation) role=standalone -> read at 3
    pos 1: 'น' (foundation) role=standalone -> read at 4

'เลว' - leading vowel + foundation + ว:
  By