Release Zomi NLP v0.4.0 - Complete Native Pipeline · ZomiCommunity/zomi-nlp

🎉 Zomi NLP v0.4.0 - Complete Native Pipeline

This release introduces a complete rule-based native Zomi NLP pipeline with no external dependencies!

✨ What's New

Core Native Components

Component	Description	Features
ZomiTokenizer	Pure Python tokenizer	Clitic splitting, reduplication, compounds, punctuation
ZomiPOSTagger	Rule-based POS tagger	600+ lexicon entries, context-aware rules
ZomiLemmatizer	Morphological lemmatization	Clitic removal, affix stripping, irregular forms
ZomiDependencyParser	Modular dependency parser	Zomi grammar rules, ergative markers
ZomiNER	Named Entity Recognition	PERSON, LOCATION, GPE, DATE, NUMERIC
ZomiMorphologicalAnalyzer	Morpheme segmentation	Prefix/suffix detection, feature extraction
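To illustrate what a lexicon-driven tagger like the one in the table does at its simplest, here is a minimal sketch. It is not the ZomiPOSTagger implementation: the `LEXICON` dictionary and the `tag` function are hypothetical names, the four entries reuse the tags shown in the Quick Start output of these notes, and the context-aware rules mentioned above are left out.

```python
import string

# Hypothetical mini-lexicon. The tags are copied from the Quick Start output
# in these release notes, not from the library's real 600+ entry lexicon.
LEXICON = {
    "tuni": "DATE",
    "ka": "PRON",
    "pai": "VERB",
    "hi": "PART",
}


def tag(tokens):
    """Return (token, tag) pairs using lexicon lookup plus simple fallbacks."""
    tagged = []
    for token in tokens:
        if token and all(ch in string.punctuation for ch in token):
            tagged.append((token, "PUNCT"))
        else:
            # Unknown words fall back to "X" in this sketch.
            tagged.append((token, LEXICON.get(token.lower(), "X")))
    return tagged


print(tag(["Tuni", "ka", "pai", "hi", "."]))
```

A real tagger would add rules that look at neighbouring tokens before falling back to a default tag.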

Native Pipeline Architecture

User Input → Tokenizer → Tagger → Lemmatizer → Parser → NER → ZomiDoc
                 ↑          ↑          ↑          ↑      ↑
                 └──────────┴──────────┴──────────┴──────┘
                   All Native! No External Dependencies
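The staged flow in the diagram can be sketched as a list of callables that each receive and return a shared document object. This is an illustration under assumptions, not the library's code: `Doc`, `Pipeline`, `tokenize`, and `lowercase_lemmas` are hypothetical stand-ins, and the two stages are deliberately trivial.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical container standing in for ZomiDoc."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def tokenize(doc):
    # Placeholder tokenizer: whitespace split, then peel off a trailing period.
    for chunk in doc.text.split():
        if len(chunk) > 1 and chunk.endswith("."):
            doc.tokens.extend([chunk[:-1], "."])
        else:
            doc.tokens.append(chunk)
    return doc


def lowercase_lemmas(doc):
    # Placeholder lemmatizer: lowercasing only.
    doc.annotations["lemma"] = [t.lower() for t in doc.tokens]
    return doc


class Pipeline:
    """Run each stage in order, passing the same Doc along."""

    def __init__(self, stages):
        self.stages = list(stages)

    def __call__(self, text):
        doc = Doc(text)
        for stage in self.stages:
            doc = stage(doc)
        return doc


nlp = Pipeline([tokenize, lowercase_lemmas])
doc = nlp("Tuni ka pai hi.")
print(doc.tokens)
print(doc.annotations["lemma"])
```

Keeping every stage behind the same call signature is what lets a tagger, parser, or NER step be added or swapped without touching the others.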

CLI Improvements

New zomi-nlp --doctor command for diagnostics
Better error messages with actionable fixes
Installation status reports
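These notes do not describe what `--doctor` checks internally. As a rough idea of the kind of check a diagnostic command can perform, the sketch below reports the Python version and whether the optional backends are importable. The `diagnose` function and its report format are assumptions made for illustration; only the standard-library calls are real.

```python
import importlib.util
import sys


def diagnose(optional_modules=("spacy", "stanza")):
    """Report the Python version and which optional backends are importable."""
    report = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in optional_modules:
        # find_spec returns None when a top-level module cannot be found,
        # and it does not import the module itself.
        found = importlib.util.find_spec(name) is not None
        report[name] = "installed" if found else "missing (optional)"
    return report


for key, value in diagnose().items():
    print(f"{key:<8} {value}")
```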

📦 Installation

# Minimal install (native only, no dependencies)
pip install zomi-nlp

# With optional backends (spaCy/Stanza for fallback)
pip install 'zomi-nlp[full]'

🚀 Quick Start

from zomi_nlp import load

# Load the pipeline (the native backend is prioritized over spaCy/Stanza)
nlp = load()

# Process Zomi text
text = "Tuni ka pai hi."
doc = nlp(text)

for token in doc:
    # Print 'N/A' for tokens without an entity type, as in the output below
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {token.ent_type_ or 'N/A':<8}")

Output:

Tuni         DATE     tuni         DATE
ka           PRON     ka           N/A
pai          VERB     pai          N/A
hi           PART     hi           N/A
.            PUNCT    .            N/A

📊 Performance

Metric	Value
Speed	~10,000 tokens/second
Memory	~50MB
Dependencies	None (optional spaCy/Stanza)
Test coverage	64% (95+ tests)
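The notes do not say how the speed figure was measured. To check throughput on your own hardware, a simple timing loop is enough. In this sketch `tokens_per_second` is a hypothetical helper, `str.split` stands in for the pipeline, and the helper assumes the object returned by `process` supports `len()`.

```python
import time


def tokens_per_second(process, texts, repeats=5):
    """Time process(text) over texts and return the best tokens/second rate."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(len(process(text)) for text in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, total / elapsed)
    return best


# str.split is a stand-in; pass the loaded pipeline to measure the library.
sample = ["Tuni ka pai hi."] * 1000
rate = tokens_per_second(str.split, sample)
print(f"{rate:,.0f} tokens/second")
```

Taking the best of several runs reduces noise from other processes; the number printed for `str.split` says nothing about the library itself.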

🔧 Commands

# Check installation status
zomi-nlp --check

# Diagnose issues
zomi-nlp --doctor

# Process text from CLI
zomi-nlp "Tuni ka pai hi."

📚 Documentation

🔄 Full Changelog

Added

ZomiTokenizer - Complete tokenization module

ZomiPOSTagger - Native POS tagging with 600+ lexicon

ZomiLemmatizer - Rule-based lemmatization

ZomiDependencyParser - Modular dependency parsing

ZomiNER - Rule-based named entity recognition

ZomiMorphologicalAnalyzer - Morpheme analysis

lexicons/ module with centralized word data

--doctor CLI command for diagnostics

95+ comprehensive tests

Changed

Native backend now prioritized over spaCy/Stanza

Reorganized native/ directory structure

Improved feature parsing with LRU caching

Better error messages for missing dependencies
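The LRU caching mentioned in the list above can be pictured with `functools.lru_cache`. This is a sketch only: the function name `parse_features` and the `Key=Value|Key=Value` feature-string format (Universal Dependencies style) are assumptions, since the notes do not show the library's actual feature format or cache size.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def parse_features(feature_string):
    """Parse 'Key=Value|Key=Value' into a sorted tuple of (key, value) pairs."""
    if not feature_string or feature_string == "_":
        return ()
    pairs = []
    for item in feature_string.split("|"):
        key, _, value = item.partition("=")
        pairs.append((key, value))
    # A tuple is returned because cached results are shared between callers
    # and should not be mutable.
    return tuple(sorted(pairs))


parse_features("Number=Sing|Person=1")
parse_features("Number=Sing|Person=1")  # second call is served from the cache
print(parse_features.cache_info())
```

Caching pays off here because the same small set of feature strings tends to recur across many tokens.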

Fixed

NER over-matching (no more "Pasian sian" issues)

Duplicate tokenization in pipeline

Morphological analyzer feature merging

CLI argument parsing for --doctor

🎯 Roadmap to v1.0

v0.2.0 - spaCy/Stanza backends
v0.3.0 - ZomiRuleBasedParser
v0.4.0 - Complete native pipeline (this version)
v0.5.0 - Word embeddings
v0.6.0 - ML-based components
v1.0.0 - Production ready

🙏 Contributors

Zomi NLP Community

Zomi language speakers and linguists
