This FAQ answers questions regarding the editorial principles and design decisions that went into the creation of PHOIBLE. We appreciate and welcome feedback regarding these FAQs via our issue tracker or by contacting the editors directly.
You can get releases of PHOIBLE from Zenodo or from the release page. While the CLDF data in these releases can easily be read and analyzed by tools like R - or even just spreadsheet programs, in the remainder of this FAQ we will make use of the fact that each CLDF dataset can be converted to a SQLite database and queried using SQL.
To prepare an SQLite database with the PHOIBLE data, you will need
- the pycldf Python package
- an unzipped download of a PHOIBLE release.
Then you can run the cldf createdb
command, which was installed with pycldf
:
cldf createdb PATH/TO/DOWNLOAD/cldf/StructureDataset-metadata.json phoible.sqlite
Now you can query the database using the sqlite3
program,
either interactively, typing SQL queries at the sqlite>
prompt,
$ sqlite3 phoible.sqlite
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> select count(*) from 'inventories.csv';
3020
or by having sqlite3
read queries stored as SQL text files:
$ echo "select count(*) from 'inventories.csv';" > query.sql
$ sqlite3 phoible.sqlite < query.sql
3020
You may also use the GUI program DB Browser for SQLite to follow along.
For the most part, every phonological inventory in PHOIBLE is based on one-and-only-one language description (usually a research article, book chapter, dissertation, or descriptive grammar). The technical term for this in comparative linguistics is “doculect” (from “documented lect”), in which lect means a specific form of a language or dialect, i.e. an instance of documentation of an instance of linguistic behavior at a particular time and place (Cysouw & Good, 2013). A brief explanation and some history of why linguists use the term “doculect,” which has gained broad acceptance in light of the issues of language identification and the use of “language codes,” is given in this blog post by Michael Cysouw.
Contributors to PHOIBLE start with a doculect, extract the contrastive phonemes and allophones, and (if necessary) adapt the authors’ choice of symbols to align with PHOIBLE’s symbol guidelines. If the authors have not provided ISO 639-3 and Glottolog codes, these are determined before adding the inventory to PHOIBLE. Each inventory is then given a unique numeric ID. Doculects are tracked in PHOIBLE using BibTeX keys.
An exception to the “one doculect per inventory” rule arises for inventories that were originally part of a curated phonological database such as UPSID (Maddieson, 1984; Maddieson & Precoda, 1990) or SPA (Crothers, Lorentz, Sherman, & Vihman, 1979). In those collections, inventories were often based on multiple descriptions of linguistic behavior, written by different linguists; those descriptions were believed to be describing the same language, and disagreements between the descriptions were adjudicated by the experts who compiled the collection.
We can quickly see how many of PHOIBLE’s inventories are based on multiple doculects by looking at the mapping table between PHOIBLE inventories and sources:
select c as `Number of doculects consulted`,
count(iid) as `Number of inventories`
from (
-- Count number of consulted sources in the table relating inventories and sources.
select `inventories.csv_cldf_id` as iid,
count(`SourceTable_id`) as c
from `inventories.csv_SourceTable`
group by iid
order by c asc)
group by c;
Number of doculects consulted | Number of inventories |
---|---|
1 | 2434 |
2 | 442 |
3 | 93 |
4 | 29 |
5 | 10 |
6 | 7 |
7 | 2 |
9 | 1 |
11 | 1 |
Running the query in DB Browser's "Execute SQL" pane, will look like this
Clearly, the majority of inventories in PHOIBLE represent a phonological description from a single doculect. But it seems strange that a single phonological inventory in PHOIBLE could be based on 11 different doculects. Let’s examine it:
select
l.cldf_glottocode as `Glottocode`,
l.cldf_iso639p3code as `ISO6393`,
l.cldf_name as `LanguageName`,
s.cldf_id as `Source`
from (select `inventories.csv_cldf_id` as iid, count(`SourceTable_id`) as c
from `inventories.csv_SourceTable`
group by `inventories.csv_cldf_id`) as q,
ValueTable as v,
LanguageTable as l,
`inventories.csv` as i,
`inventory_sources.csv` as s
where c = 11
and iid = v.Inventory_ID
and v.cldf_languageReference = l.cldf_id
and i.cldf_id = iid
and i.Inventory_source_ID = s.cldf_id limit 1;
Glottocode | ISO6393 | LanguageName | Source |
---|---|---|---|
haus1257 | hau | Hausa | UPSID |
As we can see, this inventory represents a description of Hausa and was added to PHOIBLE from the UPSID database (Maddieson, 1984; Maddieson & Precoda, 1990). To understand why this UPSID entry consulted 11 different sources, consider first that Hausa is typologically interesting (e.g., it has both ejective and implosive phonation mechanisms) and has tens of millions of speakers, making it relatively well-studied (the Glottolog reference catalog has more than 1400 references related to Hausa).
Second, note that Maddieson’s work on UPSID involved “typologizing” phonological inventories from different doculects, so that they were comparable across all entries in his database (cf. Hyman, 2008). Maddieson’s work was groundbreaking at the time because he was the first typologist to generate a stratified language sample aimed at being genealogically balanced, i.e. for each language family he chose one representative language. This allowed Maddieson to make statements about the cross-linguistic distribution of contrastive speech sounds with some level of statistical confidence. In fact, much about what we know about the distribution of the sounds of the world’s languages is due to Maddieson’s original language sample and his meticulous curation of the data.
FIXME: complete!
select i.cldf_id, l.cldf_glottocode, l.cldf_iso639p3code, l.cldf_name
from `inventories.csv` as i,
ValueTable as v,
LanguageTable as l
where v.Inventory_ID = i.cldf_id
and v.cldf_languageReference = l.cldf_id
and l.cldf_iso639p3code = 'mis'
group by i.cldf_id, l.cldf_glottocode, l.cldf_iso639p3code, l.cldf_name;
cldf_id | cldf_glottocode | cldf_iso639P3code | cldf_name |
---|---|---|---|
2143 | pisa1245 | mis | Pisamira |
2352 | lizu1234 | mis | Lizu |
2388 | east2773 | mis | Eastern Newari |
2420 | zhon1235 | mis | Zhongu |
2434 | vach1239 | mis | Vach Khanty |
2450 | fore1274 | mis | Forest Nenets |
2691 | mink1237 | mis | Minkin |
2714 | guwa1244 | mis | Guwar |
2729 | mis | Djindewal | |
2748 | mith1236 | mis | Mithaka |
2773 | cola1237 | mis | Colac |
2778 | yari1243 | mis | Yari-Yari |
2782 | west2443 | mis | Western Victoria |
2783 | djad1246 | mis | Djadjawurrungic |
2792 | kera1256 | mis | Keramin |
2793 | lowe1402 | mis | Lower Riverland |
2794 | ngin1247 | mis | Ngindadj |
2818 | gudj1237 | mis | Gudjal |
2882 | kawa1290 | mis | Kawarrang-Ogh Undjan |
2883 | kawa1290 | mis | Kawarrang-Ogh Undjan |
2907 | wala1263 | mis | Walangama |
2911 | tyan1235 | mis | Tyanngayt-Mamngayt-Ntrwangayt-Ntrangit |
2913 | luth1234 | mis | Luthigh |
2914 | mbiy1238 | mis | Mbiywom |
2916 | ngko1236 | mis | Ngkoth |
2920 | yadh1237 | mis | Yadhaykenu |
2946 | bula1255 | mis | Bularnu |
2956 | yulp1239 | mis | Yulparija |
2988 | west2443 | mis | Western Victoria |
2999 | sout2770 | mis | Southern Inland Yuin |
select i.cldf_id, count(v.cldf_id), group_concat(v.cldf_value, ' ')
from `inventories.csv` as i,
ValueTable as v,
LanguageTable as l
where v.Inventory_ID = i.cldf_id
and v.cldf_languageReference = l.cldf_id
and l.cldf_iso639p3code = 'kbd'
group by i.cldf_id
order by cast(i.cldf_id as int);
cldf_id | c | phonemes |
---|---|---|
4 | 56 | b d̻ d̻z̻ f fʼ j kʷʰ kʷʼ k̟ʲʰ k̟ʲʼ m n̻ pʰ pʼ qχ qχʷ qχʷʼ qχʼ r s̻ t̻s̻ t̻s̻ʼ t̻ʰ t̻ʼ v w xʷ x̟ʲ z̻ ç̟ ħ ɡʷ ɡ̟ʲ ɣ̟ʲ ɦ ɬʲ ɬʲʼ ɮʲ ʁ ʁʷ ʃ ʃʼ ʒ ʔ ʔʷ ʕ ʝ̟ χ χʷ a̟ː e̞ː iː o̞ː uː ɜ ɨ |
391 | 56 | b d̪ d̪z̪ f fʼ j kʲʰ kʲʼ kʷʰ kʷʼ m n̪ pʰ pʼ qʷʼ qʼ qχ qχʷ r s̪ t̪s̪ t̪s̪ʼ t̪ʰ t̪ʼ v w xʲ xʷ z̪ ħ ɡʲ ɡʷ ɣʲ ɦ ɬ̪ʲ ɬ̪ʲʼ ɮ̪ʲ ʁ ʁʷ ʃ ʃʼ ʃ͇ ʒ ʒ͇ ʔ ʔʷ ʕ χ χʷ a̟ː e̞ː iː o̞ː uː ɜ ɨ |
2310 | 55 | b d dz d̠ʒ f fʼ j kʷ kʷʼ m n p pʼ q qʷ qʷʼ qʼ r s t ts tsʼ tʼ t̠ʃ t̠ʃʼ v w x xʷ z ħ ɡʷ ɣ ɬ ɬʼ ɮ ʁ ʁʷ ʃ ʆ ʆʼ ʒ ʓ ʔ ʔʷ χ χʷ ä e̞ː iː o̞ o̞ː uː ɑː ə |
2401 | 63 | b d dz dʑ f fʼ j kʷ kʷʼ kʼ l m n p pʼ q qʷ qʷʼ qʼ r s t ts tsʷʼ tsʼ tɕ tɕʼ tʷʼ tʼ t̠ʃ t̠ʃʼ v w x xʷ z zʷ ħ ɕ ɡʷ ɣ ɬ ɬʼ ʁ ʁʷ ʃ ʆ ʆʼ ʑ ʒ ʓ ʔ ʔʷ χ χʷ ä e̞ː iː o̞ o̞ː uː ɑː ə |
2610 | 51 | b d dz f fʼ h j kʲʼ kʷʰ kʷʼ l m n pʰ pʼ qʰ qʷʰ qʷʼ qʼ s ts tsʼ tʰ tʼ w xʷ z ç ħ ɡʲ ɡʷ ɬ ɬʼ ɾ ʁ ʁʷ ʃ ʃʼ ʒ ʔ ʔʷ ʝ χ χʷ äː eː iː oː uː ɐ ə |
query.sql:
select cinvs, count(iso)
from (select l.cldf_iso639p3code as iso, count(distinct i.cldf_id) as cinvs
from `inventories.csv` as i,
ValueTable as v,
LanguageTable as l
where v.Inventory_ID = i.cldf_id
and v.cldf_languageReference = l.cldf_id
group by l.cldf_iso639p3code)
group by cinvs
order by cinvs desc;
https://github.com/mkaz/termgraph
$ sqlite3 -csv -separator "," phoible.sqlite < query.sql | termgraph
30: ▏ 1.00
12: ▏ 1.00
10: ▏ 1.00
9 : ▏ 1.00
8 : ▏ 3.00
7 : ▏ 2.00
6 : ▏ 7.00
5 : ▏ 19.00
4 : ▇ 36.00
3 : ▇▇▇▇ 126.00
2 : ▇▇▇▇▇▇▇▇▇▇▇ 360.00
1 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1542.00
select cinvs, count(gc)
from (select l.cldf_glottocode as gc, count(distinct i.cldf_id) as cinvs
from `inventories.csv` as i,
ValueTable as v,
LanguageTable as l
where v.Inventory_ID = i.cldf_id
and v.cldf_languageReference = l.cldf_id
group by l.cldf_glottocode)
group by cinvs
order by cinvs desc;
$ sqlite3 -csv -separator "," phoible.sqlite < query2.sql | termgraph
11: ▏ 1.00
10: ▏ 1.00
9 : ▏ 1.00
8 : ▏ 2.00
7 : ▏ 2.00
6 : ▏ 4.00
5 : ▏ 18.00
4 : ▇ 36.00
3 : ▇▇▇ 119.00
2 : ▇▇▇▇▇▇▇▇▇▇ 352.00
1 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1640.00
select gc as `Glottocode`, names as `Names`, c as `Number of inventories`
from (
select gc,
group_concat(iname, ', ') as names,
count (iid) as c
from (
select distinct
l.cldf_glottocode as gc,
i.cldf_name as iname,
i.cldf_id as iid
from
`inventories.csv` as i,
ValueTable as v,
LanguageTable as l
where
v.Inventory_ID = i.cldf_id and
v.cldf_languageReference = l.cldf_id
) group by gc order by c desc)
where c > 5;
Glottocode | Names | Number of inventories |
---|---|---|
osse1243 | Ossetian, Iron Ossetic (Iron Ossetic (Tual-S)), Iron Ossetic (Iron Ossetic (Upper Alagir)), Iron Ossetic (Iron Ossetic (Tual-Ch)), Iron Ossetic (Iron Ossetic (Urs-Tual)), Iron Ossetic (Iron Ossetic (Roka)), Iron Ossetic (Iron Ossetic (Standard)), Iron Ossetic (Iron Ossetic (Alagir)), Iron Ossetic (Iron Ossetic (Ksani)), Iron Ossetic (Iron Ossetic (Tual-Sh)), Iron Ossetic (Iron Ossetic (Kurtat)) | 11 |
biri1256 | Barna, Biri, Garingbal, Miyan, Wiri, Yambina, Yangga, Yilba, Yuwi, Wangan | 10 |
stan1293 | English, English (American) (Western and Mid-Western US; Southern California), American English (Southeastern Michigan), English (Australian), English (British) (Liverpool), English (New Zealand) (Pākehā), English (British) (Tyneside English (spoken in Newcastle)), English (English (RP)), English (English (Liverpool)) | 9 |
kham1282 | Rgyalthang Tibetan, Brag-g.yab Tibetan, Nangchenpa Tibetan, Soghpo Tibetan, Kami Tibetan, Sangdam Tibetan, Dongwang Tibetan, Kham Tibetan | 8 |
dutc1256 | Dutch (Belgian Standard), Dutch (Belgian Standard Dutch), Dutch (the dialect of Hasselt (belongs to the West-Limburgian dialect group)), Dutch (the dialect of Maastricht (south-eastern dialect group; Central Limburgian)), Dutch (the Dutch dialect of Weert (West-Limburgian), rural variety (not Stadsweerts)), Dutch (the Belgian Limburg dialect of Hamont (belongs to the West Limburg dialects, subclassification: Dommellands)), Dutch (the Flemish-Brabant dialect of Orsmaal-gussenhoven), Dutch | 8 |
east2328 | Cheremis, MARI, Cheremis, Meadow Mari (Meadow Mari (Morki-Sernur)), Eastern Mari, Meadow Mari (Meadow Mari (Joškar-Ola)), Meadow Mari (Meadow Mari (Volga)) | 7 |
basq1248 | Basque, BASQUE, Basque (Goizueta Basque), Basque (Basque (Baztan)), Basque (Basque (Arbizu)), Basque (Basque (Gernika)), Basque (Basque (Ondarroa)) | 7 |
lith1251 | Lithuanian, LITHUANIAN, Lithuanian (Lithuanian (West Aukštaitian)), Lithuanian (Lithuanian (North Samogitian)), Lithuanian (Lithuanian (West Samogitian)), Lithuanian (Lithuanian (South Samogitian)) | 6 |
lazz1240 | Laz, Laz (Laz (Borçka)), Laz (Laz (Pazar)), Laz (Laz (Ardeşen)), Laz (Laz (Hopa)), Laz (Laz (Arhavi)) | 6 |
khan1273 | Ostyak, KHANTY, Eastern Khanty, Northern Khanty (Northern Khanty (Sob Obdorsk)), Northern Khanty (Northern Khanty (Polui Obdorsk)), Northern Khanty (Northern Khanty (Kazym)) | 6 |
gwan1268 | Gwandara (Karshi) (Karshi), Gwandara (Cancara) (Cancara), Gwandara (Toni) (Toni), Gwandara (Gitata) (Gitata), Gwandara (Koro) (Koro), Gwandara (Nimbia) (Nimbia) | 6 |
select src.cldf_id as `Source ID`, src.cldf_name as `Source name`, count(i.cldf_id) as `Number of inventories`
from `inventory_sources.csv` as src,
`inventories.csv` as i
where i.Inventory_source_ID = src.cldf_id
group by src.cldf_id, src.cldf_name
order by `Number of inventories` desc;
Source ID | Source name | Number of inventories |
---|---|---|
GM | Christopher Green and Steven Moran | 460 |
UPSID | UCLA Phonological Segment Inventory Database | 451 |
ER | Erich Round | 392 |
EA | Dmitry Nikolaev | 390 |
PH | Steven Moran and Daniel McCloy and Richard Wright. | 389 |
SAPHON | South American Phonological Inventory Database | 355 |
AA | Christian Chanard and Rhonda L. Hartell (AA) | 203 |
SPA | Stanford Phonology Archive | 197 |
RA | Ramaswami, N. | 100 |
UZ | Steven Moran | 83 |
select gc, c, names
from (select gc, count (srcid) as c, group_concat(srcid, ', ') as names from (select distinct l.cldf_glottocode as gc, i.Inventory_source_ID as srcid from `inventories.csv` as i, ValueTable as v, LanguageTable as l where v.Inventory_ID = i.cldf_id and v.cldf_languageReference = l.cldf_id) as q, `inventory_sources.csv` as src where src.cldf_id = q.srcid group by gc order by count (srcid) desc)
where c > 3;
gc | c | names |
---|---|---|
khar1287 | 5 | SPA, UPSID, PH, RA, EA |
hind1269 | 5 | SPA, UPSID, RA, UZ, EA |
beng1280 | 5 | SPA, UPSID, RA, UZ, EA |
yuec1235 | 4 | SPA, UPSID, UZ, EA |
west2369 | 4 | SPA, UPSID, UZ, EA |
viet1252 | 4 | SPA, UPSID, UZ, EA |
telu1262 | 4 | SPA, UPSID, RA, EA |
taga1270 | 4 | SPA, UPSID, GM, EA |
tach1250 | 4 | SPA, UPSID, GM, UZ |
stan1295 | 4 | SPA, UPSID, UZ, EA |
stan1290 | 4 | SPA, UPSID, UZ, EA |
stan1288 | 4 | SPA, UPSID, UZ, EA |
nucl1417 | 4 | SPA, UPSID, AA, UZ |
nucl1310 | 4 | SPA, UPSID, UZ, EA |
nucl1302 | 4 | SPA, UPSID, UZ, EA |
nucl1301 | 4 | SPA, UPSID, UZ, EA |
nepa1254 | 4 | UPSID, RA, UZ, EA |
mund1320 | 4 | SPA, UPSID, RA, EA |
mode1248 | 4 | SPA, UPSID, UZ, EA |
mand1415 | 4 | SPA, UPSID, PH, EA |
kore1280 | 4 | SPA, UPSID, UZ, EA |
khan1273 | 4 | SPA, UPSID, PH, EA |
iris1253 | 4 | SPA, UPSID, UZ, EA |
hung1274 | 4 | SPA, UPSID, UZ, EA |
haus1257 | 4 | SPA, UPSID, AA, UZ |
hakk1236 | 4 | SPA, UPSID, UZ, EA |
east2328 | 4 | SPA, UPSID, PH, EA |
bulg1262 | 4 | SPA, UPSID, UZ, EA |
basq1248 | 4 | SPA, UPSID, UZ, EA |
akan1250 | 4 | SPA, UPSID, AA, GM |
select ma, count(iso)
from (select distinct l.cldf_macroarea as ma, l.cldf_iso639p3code as iso
from languagetable as l,
valuetable as v
where v.cldf_languageReference = l.cldf_id)
group by ma
order by ma;
ma | count(iso) |
---|---|
1 | |
Africa | 709 |
Australia | 292 |
Eurasia | 439 |
North America | 144 |
Papunesia | 183 |
South America | 334 |
select ma, count(iid)
from (select distinct l.cldf_macroarea as ma, i.cldf_id as iid
from languagetable as l,
valuetable as v,
`inventories.csv` as i
where v.cldf_languageReference = l.cldf_id
and v.Inventory_ID = i.cldf_id)
group by ma
order by ma;
ma | count(iid) |
---|---|
1 | |
Africa | 887 |
Australia | 436 |
Eurasia | 816 |
North America | 184 |
Papunesia | 217 |
South America | 479 |
$ sqlite3 -csv -separator "," phoible.sqlite "select cldf_glottocode, cldf_latitude, cldf_longitude from LanguageTable" | glottolog --repos PATH/TO/glottolog/glottolog --repos-version v4.3 htmlmap --open