A unified language metadata toolkit for NLP: identifiers, scripts, speakers, geographic data, and traversable relationships across ~27,256 languoids.
Name: Qwanqwa is a phonetic spelling of 'ቋንቋ', which means 'language' in Amharic; qq is short to type.
Note
Explore in your browser: https://wesselpoelman.nl/qq/
Demo video: https://youtu.be/D9MmGCmJeNg
Paper (pre-print): https://arxiv.org/abs/2603.00620
- Identifiers: BCP-47, ISO 639-1, ISO 639-3, ISO 639-2B, ISO 639-2T, ISO 639-5, Glottocode, Wikidata ID, Wikipedia ID, NLLB-style codes
- Geographic information: Countries, subdivisions, and regions, traversable in both directions (from languoids to regions and back)
- Speaker information: Population counts, UNESCO endangerment status
- Writing systems: ISO 15924 script codes with canonical/historical metadata
- Multilingual names: Language names in 500+ languages
- Relationships: Traversable graph of language families, scripts, and geographic regions
- Phylogenetic data: Language family trees from Glottolog
In qq, language-like entities are referred to as Languoids: this includes dialects, macro-languages, and language families, not just individual languages. Not all languoids have coverage for all features.
uv add qwanqwa
# or
pip install qwanqwa
from qq import Database, IdType
# Load the pre-compiled database
db = Database.load()
# Get a language by BCP-47 code (default)
dutch = db.get("nl")
print(dutch.name) # "Dutch"
print(dutch.iso_639_3) # "nld"
print(dutch.speaker_count) # 24085200
# Also works with ISO 639-3, Glottocode, etc.
dutch2 = db.get("nld", id_type=IdType.ISO_639_3)
dutch3 = db.get("dutc1256", id_type=IdType.GLOTTOCODE)
dutch4 = db.guess("dut") # guessing works too
# These all resolve to the same languoid
assert dutch == dutch2 == dutch3 == dutch4
# Search by name
results = db.search("Chinese")
for lang in results:
    print(f"{lang.name} ({lang.glottocode})")
# Search also accepts identifiers and ranks them higher
db.search("nl")[0].name # "Dutch"
db.search("English")[0].bcp_47 # "en"
Important: qq makes a strict distinction between None (don't know) and False (it is not the case). When checking boolean attributes, prefer explicit checks over truthiness: use if script.is_canonical is None: rather than if not script.is_canonical:.
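The None/False distinction can be illustrated with a small stand-alone sketch. The Script dataclass below is a hypothetical stand-in, not qq's actual class; only the is_canonical attribute name is taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Script:
    """Hypothetical stand-in for a qq script object (not the real class)."""

    code: str
    is_canonical: Optional[bool]  # True, False, or None (unknown)


def describe(script: Script) -> str:
    """Classify the tri-state flag with explicit identity checks."""
    if script.is_canonical is None:
        return "unknown"
    if script.is_canonical is False:
        return "not canonical"
    return "canonical"


# Illustrative values only; these flags are made up for the example.
scripts = [Script("Latn", True), Script("Arab", False), Script("Cyrl", None)]
labels = {s.code: describe(s) for s in scripts}

# A truthiness check lumps "unknown" together with "not canonical":
falsy = [s.code for s in scripts if not s.is_canonical]
```

With a plain truthiness check, both the False and the None entry end up in falsy, which is the confusion the explicit checks avoid.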
Languoids, scripts, and geographic regions are all part of the same graph, which can be traversed:
dutch = db.get("nl")
# Language family navigation (Glottolog tree)
dutch.parent # Global Dutch
dutch.parent.parent # Modern Dutch
dutch.family_tree # [Global Dutch, Modern Dutch, ..., West Germanic, Germanic, Indo-European]
dutch.siblings # [Afrikaansic, Javindo, Petjo]
dutch.children # [North Hollandish, Central Northern Dutch, ...]
dutch.descendants() # All descendants (recursive)
# Writing systems
dutch.scripts # [Script(Latin, code=Latn)]
dutch.script_codes # ["Latn"]
dutch.canonical_scripts # scripts marked canonical in LinguaMeta
# Geographic regions
dutch.regions # [Aruba, Belgium, ..., Netherlands, Suriname, ...]
dutch.country_codes # ["AW", "BE", "BQ", "CW", "NL", "SR", "SX"]
# Reverse traversal to script
latin = dutch.scripts[0]
latin.languoids # All languages using Latin script
# Cross-domain queries
dutch.languoids_with_same_script # other languages sharing any script
dutch.languoids_in_same_region # other languages in the same regions
from qq import IdType
# Automatic detection
lang = db.guess("nld") # tries all identifier types
# Explicit conversion
db.convert("nl", IdType.BCP_47, IdType.ISO_639_3) # "nld"
db.convert("nld", IdType.ISO_639_3, IdType.GLOTTOCODE) # "dutc1256"
# Conversion where you don't know or care what the source is, just the target.
# Useful for normalizing multiple standards to one
db.convert("nl", IdType.ISO_639_3) # "nld"
db.convert("dutc1256", IdType.ISO_639_3) # "nld"
db.convert("mol", IdType.ISO_639_3) # "ron" (deprecated alias normalized silently)
# NLLB-style codes
dutch.nllb_codes() # ["nld_Latn"]
dutch.nllb_codes(use_bcp_47=True) # ["nl_Latn"]
get() and guess() warn when you use a deprecated code that still resolves to a replacement.
convert() does not; it silently normalizes deprecated aliases to the requested target identifier.
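If you want to log which inputs relied on deprecated codes, Python's standard warnings module can record them. The sketch below uses a hypothetical lookup() stand-in instead of qq itself, and it assumes the warning category is DeprecationWarning; the category qq actually emits is not stated here and may differ.

```python
import warnings

# Hypothetical stand-in table: deprecated code -> replacement.
# "mol" -> "ron" mirrors the example above; this is not qq's data.
DEPRECATED = {"mol": "ron"}


def lookup(code: str) -> str:
    """Stand-in for a lookup that warns on deprecated codes."""
    if code in DEPRECATED:
        replacement = DEPRECATED[code]
        # Assumption: DeprecationWarning; qq may use another category.
        warnings.warn(
            f"{code!r} is deprecated, using {replacement!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return replacement
    return code


def lookup_with_audit(code: str) -> tuple[str, list[str]]:
    """Resolve a code and return any warning messages it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        resolved = lookup(code)
    return resolved, [str(w.message) for w in caught]


resolved, messages = lookup_with_audit("mol")
clean, no_messages = lookup_with_audit("ron")
```

The same catch_warnings pattern should work around any call that uses the standard warnings machinery, so the stand-in can be swapped for a real lookup.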
# Name of Dutch in French
french = db.get("fr")
dutch.name_in("fr") # "néerlandais"
dutch.name_in(french) # also accepts a Languoid object
# Native name
dutch.endonym # "Nederlands"
# Look up a language
qq get nl
qq get nld --type ISO_639_3
# Search by name or identifier
qq search Dutch
qq search nl
# Database statistics and validation
qq validate
# Rebuild the database from sources
qq rebuild
# Check source status
qq status
# Update sources (only needed if you want to rebuild the database,
# not necessary in normal use)
qq update
See the examples/ directory for runnable scripts covering:
- 01_basic_usage.py: Loading and accessing attributes
- 02_identifiers.py: Working with identifier types and retired codes
- 03_conversion.py: Converting between identifiers
- 04_traversal.py: Language family navigation
- 05_search.py: Searching and filtering
- 06_names.py: Multilingual name data
- 07_geographic.py: Geographic regions and countries
- 08_relations.py: Relationship graph traversal
- 09_advanced_queries.py: Complex queries and statistics
- 10_linking_datasets.py: Joining datasets that use different identifier systems
- 11_normalizing_datasets.py: Normalizing mixed identifier codes to a single standard
The case-studies/ directory contains runnable analyses that use qq:
- huggingface-audit/: Scans all multilingual datasets on the HuggingFace Hub and classifies every language: tag as valid, deprecated, a misused country code, or unknown. qq resolves 99.2% of the 8,189 codes; the rest are deprecated, misused country codes, or HuggingFace-specific tags.
- linking-datasets/: Links four lexical datasets (Concepticon, WordNet, Etymon, Phonotacticon) that each use a different identifier standard. qq resolves these four to a shared canonical ID: 102 languages are covered by all four.
- latex-tables/: Generates a LaTeX table of language metadata (identifiers, scripts, speaker counts, families) for an imaginary 30-language NLP benchmark.
- identifier-coverage/: Visualizes which combinations of identifier standards (Glottocode, ISO 639-3, ISO 639-1, Wikidata) cover which languoids as an UpSet plot.
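The idea behind linking datasets can be sketched in a few lines: normalize every dataset's codes to one standard, then intersect. This is a minimal sketch, not the case study's code. The to_iso_639_3 function is a hypothetical stand-in for a converter such as db.convert(code, IdType.ISO_639_3), backed by a tiny table of pairs taken from this README, and the dataset names are made up.

```python
from typing import Callable, Optional

# Hypothetical stand-in table; a real run would ask qq instead.
_TO_ISO_639_3 = {
    "nl": "nld",        # BCP-47
    "nld": "nld",       # ISO 639-3
    "dut": "nld",       # ISO 639-2B
    "dutc1256": "nld",  # Glottocode
    "mol": "ron",       # deprecated alias
}


def to_iso_639_3(code: str) -> Optional[str]:
    """Return the ISO 639-3 code, or None when the code is unknown."""
    return _TO_ISO_639_3.get(code)


def shared_languages(
    datasets: dict[str, list[str]],
    normalize: Callable[[str], Optional[str]],
) -> set[str]:
    """Normalize each dataset's codes and intersect the results."""
    normalized = [
        {iso for code in codes if (iso := normalize(code)) is not None}
        for codes in datasets.values()
    ]
    return set.intersection(*normalized) if normalized else set()


# Three made-up datasets, each using a different identifier standard.
datasets = {
    "dataset_a": ["nl", "mol"],
    "dataset_b": ["dutc1256"],
    "dataset_c": ["dut", "xx-unknown"],
}
common = shared_languages(datasets, to_iso_639_3)
```

Unknown codes are dropped here rather than raising, which keeps the intersection simple; an audit-style analysis would instead collect them for reporting.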
This project builds on the work of many people. See docs/sources.md for the full list. All sources are available under Creative Commons BY or BY-SA licenses.
To rebuild the database from sources, install with the build extras:
uv add qwanqwa[build]
# or
pip install qwanqwa[build]
To install for local development:
git clone https://github.com/WPoelman/qwanqwa
cd qwanqwa
uv sync --group dev
The data sources qq incorporates have different licenses, see here.
We follow this example and license the software as Apache 2.0 and the data as CC BY-SA 4.0.
This means for instance that any data issues we encounter will be openly reported to the upstream sources (in accordance with ShareAlike principles of CC BY-SA), but that the software will ship with a compiled dataset (in accordance with the redistribution CC BY and CC BY-SA allow).
Ideally we'd use CC BY-SA for everything, but this is highly discouraged for software, even by Creative Commons themselves.