Zomi NLP v0.4.0 - Complete Native Pipeline
This release introduces a complete rule-based native Zomi NLP pipeline with no external dependencies!
What's New
Core Native Components
| Component | Description | Features |
|---|---|---|
| ZomiTokenizer | Pure Python tokenizer | Clitic splitting, reduplication, compounds, punctuation |
| ZomiPOSTagger | Rule-based POS tagger | 600+ lexicon entries, context-aware rules |
| ZomiLemmatizer | Morphological lemmatization | Clitic removal, affix stripping, irregular forms |
| ZomiDependencyParser | Modular dependency parser | Zomi grammar rules, ergative markers |
| ZomiNER | Named Entity Recognition | PERSON, LOCATION, GPE, DATE, NUMERIC |
| ZomiMorphologicalAnalyzer | Morpheme segmentation | Prefix/suffix detection, feature extraction |
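To make the "lexicon plus context-aware rules" idea in the table concrete, here is a minimal, self-contained sketch of lexicon-driven tagging. It does not use or reproduce the `ZomiPOSTagger` API. The `LEXICON` entries simply reuse the tags shown in the Quick Start output of these notes, and the `tag` function, the context rule, and the `NOUN` fallback are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon; the tags mirror the Quick Start output in these notes.
LEXICON: Dict[str, str] = {
    "tuni": "DATE",
    "ka": "PRON",
    "pai": "VERB",
    "hi": "PART",
}

PUNCTUATION = set(".,!?;:")


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens by lexicon lookup, then a context rule, then a default."""
    tagged = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if token in PUNCTUATION:
            pos = "PUNCT"
        elif key in LEXICON:
            pos = LEXICON[key]
        elif i + 1 < len(tokens) and tokens[i + 1].lower() == "hi":
            # Illustrative context rule (an assumption, not the library's rule):
            # an unknown word directly before the particle "hi" is tagged VERB.
            pos = "VERB"
        else:
            # Default for unknown words; also an assumption of this sketch.
            pos = "NOUN"
        tagged.append((token, pos))
    return tagged


print(tag(["Tuni", "ka", "pai", "hi", "."]))
```

A real tagger would draw on a much larger lexicon (these notes mention 600+ entries) and a richer rule set; the point here is only the lookup-then-rules ordering.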
Native Pipeline Architecture

```
User Input → Tokenizer → Tagger → Lemmatizer → Parser → NER → ZomiDoc
                 │          │          │          │      │
                 └──────────┴──────────┴──────────┴──────┘
                    All Native! No External Dependencies
```
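The staged flow above can be sketched as a chain of functions that each annotate a shared document object. This is an assumption-laden illustration, not the package's implementation: `Token`, `Doc`, `run_pipeline`, and the three stand-in stages are hypothetical names, and the tagging and lemmatization logic is deliberately trivial.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Token:
    text: str
    pos: str = ""
    lemma: str = ""


@dataclass
class Doc:
    text: str
    tokens: List[Token] = field(default_factory=list)


# A stage takes a Doc, annotates it, and returns it (hypothetical interface).
Stage = Callable[[Doc], Doc]


def tokenize(doc: Doc) -> Doc:
    # Split into runs of word characters and single punctuation marks.
    doc.tokens = [Token(t) for t in re.findall(r"\w+|[^\w\s]", doc.text)]
    return doc


def tag(doc: Doc) -> Doc:
    # Stand-in tagger: only separates punctuation from everything else.
    for tok in doc.tokens:
        tok.pos = "X" if tok.text.isalnum() else "PUNCT"
    return doc


def lemmatize(doc: Doc) -> Doc:
    # Stand-in lemmatizer: lowercases the surface form.
    for tok in doc.tokens:
        tok.lemma = tok.text.lower()
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Doc:
    doc = Doc(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("Tuni ka pai hi.", [tokenize, tag, lemmatize])
print([(t.text, t.pos, t.lemma) for t in doc.tokens])
```

Keeping every stage behind the same `Doc -> Doc` signature is what lets stages be added, removed, or swapped without touching the others.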
CLI Improvements
- New zomi-nlp --doctor command for diagnostics
- Better error messages with actionable fixes
- Installation status reports
Installation

```bash
# Minimal install (native only, no dependencies)
pip install zomi-nlp

# With optional backends (spaCy/Stanza for fallback)
pip install 'zomi-nlp[full]'
```

Quick Start
```python
from zomi_nlp import load

# Load native pipeline (auto-selects best backend)
nlp = load()

# Process Zomi text
text = "Tuni ka pai hi."
doc = nlp(text)

# Print one row per token; tokens without an entity type show "N/A"
for token in doc:
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {token.ent_type_ or 'N/A':<8}")
```

Output:
```
Tuni         DATE     tuni         DATE
ka           PRON     ka           N/A
pai          VERB     pai          N/A
hi           PART     hi           N/A
.            PUNCT    .            N/A
```
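The Quick Start describes `load()` as auto-selecting the best backend, with the native pipeline preferred and spaCy/Stanza as optional fallbacks. How the package does this internally is not documented here, so the following is only a sketch of one common approach: probing for optional modules with `importlib.util.find_spec`. The names `BACKEND_MODULES`, `available_backends`, and `select_backend` are hypothetical.

```python
from importlib.util import find_spec
from typing import Dict, List, Optional, Sequence

# Hypothetical backend table: the native backend needs no extra module,
# the optional ones are usable only if their package is importable.
BACKEND_MODULES: Dict[str, Optional[str]] = {
    "native": None,
    "spacy": "spacy",
    "stanza": "stanza",
}


def available_backends() -> List[str]:
    """Return backend names whose required module (if any) is installed."""
    found = []
    for name, module in BACKEND_MODULES.items():
        if module is None or find_spec(module) is not None:
            found.append(name)
    return found


def select_backend(preferred: Sequence[str] = ("native", "spacy", "stanza")) -> str:
    """Pick the first backend in preference order that is available."""
    available = available_backends()
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError("No usable backend found; try: pip install zomi-nlp")


print(select_backend())
```

Because the native entry has no module requirement, it is always available in this sketch, which matches the "no dependencies" minimal install described above.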
Performance
| Metric | Value |
|---|---|
| Speed | ~10,000 tokens/second |
| Memory | ~50MB |
| Dependencies | None (optional spaCy/Stanza) |
| Test coverage | 64% (95+ tests) |
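Throughput figures depend heavily on hardware and input, so it can be useful to measure tokens per second on your own text. The sketch below shows the general measurement pattern with `time.perf_counter`. It uses a stand-in regex tokenizer rather than the package itself, and `tokenize` and `tokens_per_second` are hypothetical helper names; swap in the real pipeline call to benchmark it.

```python
import re
import time
from typing import List


def tokenize(text: str) -> List[str]:
    # Stand-in tokenizer for this sketch; replace with the pipeline under test.
    return re.findall(r"\w+|[^\w\s]", text)


def tokens_per_second(texts: List[str], repeats: int = 100) -> float:
    """Process every text `repeats` times and return tokens per second."""
    start = time.perf_counter()
    total = 0
    for _ in range(repeats):
        for text in texts:
            total += len(tokenize(text))
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


rate = tokens_per_second(["Tuni ka pai hi."])
print(f"{rate:,.0f} tokens/second")
```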
Commands

```bash
# Check installation status
zomi-nlp --check

# Diagnose issues
zomi-nlp --doctor

# Process text from CLI
zomi-nlp "Tuni ka pai hi."
```

Documentation
Full Changelog
Added
- ZomiTokenizer - Complete tokenization module
- ZomiPOSTagger - Native POS tagging with 600+ lexicon entries
- ZomiLemmatizer - Rule-based lemmatization
- ZomiDependencyParser - Modular dependency parsing
- ZomiNER - Rule-based named entity recognition
- ZomiMorphologicalAnalyzer - Morpheme analysis
- lexicons/ module with centralized word data
- --doctor CLI command for diagnostics
- 95+ comprehensive tests
Changed
- Native backend now prioritized over spaCy/Stanza
- Reorganized native/ directory structure
- Improved feature parsing with LRU caching
- Better error messages for missing dependencies
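For readers unfamiliar with the LRU caching mentioned in the list above, here is a minimal sketch using the standard library's `functools.lru_cache`. The `parse_features` function and the `Key=Value|Key=Value` feature-string format are assumptions chosen for illustration; the package's actual feature format and cache size are not stated in these notes.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1024)
def parse_features(feats: str) -> Tuple[Tuple[str, str], ...]:
    """Parse 'Key=Value|Key=Value' into a sorted tuple of (key, value) pairs.

    A tuple is returned instead of a dict so that cached results are
    immutable and cannot be modified by one caller for all others.
    """
    if not feats or feats == "_":
        return ()
    pairs = []
    for item in feats.split("|"):
        key, _, value = item.partition("=")
        pairs.append((key, value))
    return tuple(sorted(pairs))


# The second call with the same string is served from the cache.
parse_features("Person=1|Number=Sing")
parse_features("Person=1|Number=Sing")
print(parse_features.cache_info())
```

Caching pays off here because the same small set of feature strings tends to recur across many tokens, so most lookups skip the parsing work entirely.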
Fixed
- NER over-matching (no more "Pasian sian" issues)
- Duplicate tokenization in pipeline
- Morphological analyzer feature merging
- CLI argument parsing for --doctor
Roadmap to v1.0
- v0.2.0 - spaCy/Stanza backends
- v0.3.0 - ZomiRuleBasedParser
- v0.4.0 - Complete native pipeline (this version)
- v0.5.0 - Word embeddings
- v0.6.0 - ML-based components
- v1.0.0 - Production ready
Contributors
Zomi NLP Community
Zomi language speakers and linguists