
Zomi NLP v0.4.0 - Complete Native Pipeline


@ZomiNLP ZomiNLP released this 27 Apr 14:38

🎉 Zomi NLP v0.4.0 - Complete Native Pipeline

This release introduces a complete rule-based native Zomi NLP pipeline with no external dependencies!


✨ What's New

Core Native Components

| Component | Description | Features |
|---|---|---|
| ZomiTokenizer | Pure Python tokenizer | Clitic splitting, reduplication, compounds, punctuation |
| ZomiPOSTagger | Rule-based POS tagger | 600+ lexicon entries, context-aware rules |
| ZomiLemmatizer | Morphological lemmatization | Clitic removal, affix stripping, irregular forms |
| ZomiDependencyParser | Modular dependency parser | Zomi grammar rules, ergative markers |
| ZomiNER | Named Entity Recognition | PERSON, LOCATION, GPE, DATE, NUMERIC |
| ZomiMorphologicalAnalyzer | Morpheme segmentation | Prefix/suffix detection, feature extraction |
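
To make the lexicon-plus-rules idea behind the tagger concrete, here is a minimal sketch. It is not ZomiPOSTagger's implementation: the three lexicon entries are copied from the Quick Start output in these notes, while `TOY_LEXICON`, `tag_tokens`, and the fallback rules are hypothetical names and choices made for illustration. Context-aware rules are left out.

```python
import string

# Toy lexicon: entries copied from the Quick Start output in these notes.
# The real tagger ships 600+ entries; this dict is only an illustration.
TOY_LEXICON = {"ka": "PRON", "pai": "VERB", "hi": "PART"}


def tag_tokens(tokens, lexicon=TOY_LEXICON, default="X"):
    """Tag each token by lexicon lookup, then by simple fallback rules."""
    tagged = []
    for tok in tokens:
        key = tok.lower()
        if key in lexicon:
            tag = lexicon[key]
        elif tok and all(ch in string.punctuation for ch in tok):
            tag = "PUNCT"  # fallback rule: token is made of punctuation only
        elif tok.isdigit():
            tag = "NUM"    # fallback rule: token is made of digits only
        else:
            tag = default  # unknown word
        tagged.append((tok, tag))
    return tagged


print(tag_tokens(["ka", "pai", "hi", "."]))
```

Lookup comes first so that curated entries always win over the generic fallbacks.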

Native Pipeline Architecture

User Input → Tokenizer → Tagger → Lemmatizer → Parser → NER → ZomiDoc
                 ↑          ↑         ↑           ↑      ↑
                 └──────────┴─────────┴───────────┴──────┘
                   All Native! No External Dependencies
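
The flow above is a chain of stages that each enrich one shared document object. The sketch below shows that pattern in isolation. It assumes nothing about the library's internals: `Doc`, `run_pipeline`, and the two placeholder stages are hypothetical and stand in for ZomiDoc and the real components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical container; stands in for ZomiDoc in this sketch."""
    text: str
    tokens: List[str] = field(default_factory=list)
    annotations: Dict[str, list] = field(default_factory=dict)


Stage = Callable[[Doc], Doc]


def whitespace_tokenize(doc: Doc) -> Doc:
    # Placeholder stage: splits on whitespace only.
    doc.tokens = doc.text.split()
    return doc


def lowercase_lemmas(doc: Doc) -> Doc:
    # Placeholder stage: uses the lowercased surface form as the lemma.
    doc.annotations["lemma"] = [t.lower() for t in doc.tokens]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Doc:
    """Pass one Doc through each stage in order."""
    doc = Doc(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("Tuni ka pai hi", [whitespace_tokenize, lowercase_lemmas])
print(doc.tokens, doc.annotations["lemma"])
```

Because every stage has the same signature, stages can be reordered or swapped without touching `run_pipeline`.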

CLI Improvements

  • New zomi-nlp --doctor command for diagnostics
  • Better error messages with actionable fixes
  • Installation status reports

📦 Installation

# Minimal install (native only, no dependencies)
pip install zomi-nlp

# With optional backends (spaCy/Stanza for fallback)
pip install 'zomi-nlp[full]'
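
These notes say the native backend is tried first and spaCy/Stanza act as optional fallbacks, but they do not document how `load()` makes that choice. As a rough illustration of priority-ordered selection, the sketch below checks whether each candidate can be imported. `pick_backend` and the backend names are hypothetical, not part of the zomi-nlp API.

```python
import importlib.util


def pick_backend(preferred=("native", "spacy", "stanza")):
    """Return the first usable backend name, trying `preferred` in order.

    "native" is treated as always available because it has no dependencies;
    the other names are checked by looking for an importable module.
    """
    for name in preferred:
        if name == "native" or importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("no backend available")


print(pick_backend())
```

With the default order this always returns "native"; passing a different tuple shows how a fallback would be reached.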

🚀 Quick Start

from zomi_nlp import load

# Load native pipeline (auto-selects best backend)
nlp = load()

# Process Zomi text
text = "Tuni ka pai hi."
doc = nlp(text)

for token in doc:
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {token.ent_type_ or 'N/A':<8}")

Output:

Tuni         DATE     tuni         DATE
ka           PRON     ka           N/A
pai          VERB     pai          N/A
hi           PART     hi           N/A
.            PUNCT    .            N/A
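
For downstream use it is often handier to collect these attributes than to print them. The sketch below relies only on the four token attributes shown in the Quick Start (`text`, `pos_`, `lemma_`, `ent_type_`). `tokens_to_rows` is a hypothetical helper, and treating an empty string or "N/A" as "no entity" is an assumption based on the output above.

```python
from types import SimpleNamespace


def tokens_to_rows(doc):
    """Collect the four attributes used in the Quick Start into dicts."""
    return [
        {
            "text": tok.text,
            "pos": tok.pos_,
            "lemma": tok.lemma_,
            # Assumption: "", None, or "N/A" all mean "not part of an entity".
            "ent_type": None if tok.ent_type_ in ("", None, "N/A") else tok.ent_type_,
        }
        for tok in doc
    ]


# Stand-in tokens so the sketch runs without zomi-nlp installed;
# with the library, pass the object returned by nlp(text) instead.
fake_doc = [
    SimpleNamespace(text="ka", pos_="PRON", lemma_="ka", ent_type_=""),
    SimpleNamespace(text="pai", pos_="VERB", lemma_="pai", ent_type_="N/A"),
]
rows = tokens_to_rows(fake_doc)
print(rows[0])
```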

📊 Performance

| Metric | Value |
|---|---|
| Speed | ~10,000 tokens/second |
| Memory | ~50 MB |
| Dependencies | None required (spaCy/Stanza optional) |
| Test coverage | 64% (95+ tests) |
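
Throughput depends on hardware and input, so it is worth measuring locally. The sketch below is one simple way to do that. `tokens_per_second` is a hypothetical helper, and `str.split` stands in for the pipeline so the example runs without zomi-nlp installed.

```python
import time


def tokens_per_second(process, texts, count_tokens=len):
    """Time `process` over `texts` and return tokens processed per second."""
    total = 0
    start = time.perf_counter()
    for text in texts:
        total += count_tokens(process(text))
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


# Stand-in for the pipeline: plain whitespace splitting.
rate = tokens_per_second(str.split, ["Tuni ka pai hi ."] * 1000)
print(f"{rate:,.0f} tokens/second")
```

To time the real pipeline, pass the object returned by `load()` as `process`. If its documents do not support `len()`, supply `count_tokens=lambda d: sum(1 for _ in d)`.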

🔧 Commands

# Check installation status
zomi-nlp --check

# Diagnose issues
zomi-nlp --doctor

# Process text from CLI
zomi-nlp "Tuni ka pai hi."

📚 Documentation

🔄 Full Changelog

Added

  • ZomiTokenizer - Complete tokenization module
  • ZomiPOSTagger - Native POS tagging with a lexicon of 600+ entries
  • ZomiLemmatizer - Rule-based lemmatization
  • ZomiDependencyParser - Modular dependency parsing
  • ZomiNER - Rule-based named entity recognition
  • ZomiMorphologicalAnalyzer - Morpheme analysis
  • lexicons/ module with centralized word data
  • --doctor CLI command for diagnostics
  • 95+ comprehensive tests

Changed

  • Native backend now prioritized over spaCy/Stanza
  • Reorganized native/ directory structure
  • Improved feature parsing with LRU caching
  • Better error messages for missing dependencies
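
The LRU caching change is easy to picture with the standard library's `functools.lru_cache`. The sketch below is an illustration, not the library's code: `parse_features` is a hypothetical name, and the `Key=Value|Key=Value` string format is an assumption borrowed from the Universal Dependencies convention.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def parse_features(feats: str) -> tuple:
    """Parse 'Key=Value|Key=Value' into a sorted tuple of (key, value) pairs.

    A tuple is returned instead of a dict so that the cached result is
    immutable and cannot be modified by one caller for all the others.
    """
    if not feats or feats == "_":
        return ()
    pairs = (item.split("=", 1) for item in feats.split("|") if "=" in item)
    return tuple(sorted((k, v) for k, v in pairs))


print(parse_features("Person=1|Number=Sing"))
print(parse_features("Person=1|Number=Sing"))  # second call is served from the cache
print(parse_features.cache_info().hits)
```

Caching pays off here because a rule-based pipeline tends to see the same small set of feature strings again and again.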

Fixed

  • NER over-matching (no more "Pasian sian" issues)
  • Duplicate tokenization in pipeline
  • Morphological analyzer feature merging
  • CLI argument parsing for --doctor
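
These notes do not describe how the NER over-matching was fixed. One common way to avoid that class of bug in a gazetteer-based recognizer is to match whole tokens only, prefer the longest entry, and never let spans overlap. The sketch below shows that technique under those assumptions. `match_entities` is hypothetical, and the gazetteer entries and labels are placeholders rather than real lexicon data.

```python
def match_entities(tokens, gazetteer):
    """Return (start, end, label) spans using whole-token, longest-first matching.

    `gazetteer` maps a tuple of lowercased tokens to a label. Each token
    can belong to at most one span, so a shorter entry can never match
    inside a span that a longer entry has already claimed.
    """
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + size]))
            if label is not None:
                spans.append((i, i + size, label))
                i += size  # skip past the matched span
                break
        else:
            i += 1  # no entry starts here
    return spans


# Placeholder gazetteer: the two-token entry wins over the one-token entry.
gazetteer = {("new", "york"): "GPE", ("york",): "GPE"}
print(match_entities(["I", "visited", "New", "York"], gazetteer))

# Whole-token matching: an entry never fires on a substring of a longer word.
print(match_entities(["Pasian"], {("sian",): "PLACEHOLDER"}))
```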

🎯 Roadmap to v1.0

  • v0.2.0 - spaCy/Stanza backends

  • v0.3.0 - ZomiRuleBasedParser

  • v0.4.0 - Complete native pipeline | This version

  • v0.5.0 - Word embeddings

  • v0.6.0 - ML-based components

  • v1.0.0 - Production ready

🙏 Contributors

Zomi NLP Community

Zomi language speakers and linguists