# Phoneme Analysis

This notebook compares the phonemes produced by our phonemization pipeline with the phonemes stored in [phoible](https://phoible.org/). The comparison guides the folding dictionaries that we add to the pipeline to make the resulting phoneme set for each language more coherent.
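
As a rough illustration of what a folding dictionary does, the sketch below maps phonemizer symbols onto the symbols used by a target inventory. The names `FOLDING` and `fold` are hypothetical, and the single entry (`au̯` to `au`) is taken from the Mandarin section of this notebook; it is not the actual dictionary used by the pipeline.

```python
# Hypothetical folding dictionary: phonemizer symbol -> inventory symbol.
# The one entry below mirrors the Mandarin example in this notebook.
FOLDING = {
    "au\u032f": "au",  # "au" with a non-syllabic mark folded to plain "au"
}


def fold(utterance, folding):
    """Replace every space-separated token that has a folding entry."""
    return " ".join(folding.get(token, token) for token in utterance.split())


print(fold("n au\u032f WORD_BOUNDARY", FOLDING))
```

Tokens without an entry, including `WORD_BOUNDARY`, pass through unchanged.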

In [1]:
import pandas as pd

from tqdm import tqdm

phoible = pd.read_csv('../../../data/phoible.csv')
phonemes = phoible.Phoneme.unique()



In [2]:
def find_unseen_phonemes(counts, phonemes):
    """ Finds phonemes in counts unknown in the phonemes list, and phonemes in the phonemes list not seen in counts """
    phonemes = set(phonemes)  # set membership is much faster than scanning an array for every lookup
    unseen = set()
    unknown = {}
    for phoneme in phonemes:
        if phoneme not in counts:
            unseen.add(phoneme)
    for phoneme, count in counts.items():
        if phoneme not in phonemes and phoneme != 'WORD_BOUNDARY':
            unknown[phoneme] = count
    unknown = {k: v for k, v in sorted(unknown.items(), key=lambda item: item[1], reverse=True)}
    return unseen, unknown

def find_phoneme(csv_file, phoneme, count=1):
    """ Find lines where phoneme appears and also print gloss """
    dataset = pd.read_csv(csv_file)
    phonemized = list(dataset['phonemized_utterance'])
    for i, line in enumerate(phonemized):
        if phoneme in line:
            print(f'Line {i}: {dataset["gloss"][i]} - {line}')
            count -= 1
        if count == 0:
            break

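def find_words_with_phoneme(csv_file, phoneme, single_phoneme=False):
    """ Count gloss/phonemized word pairs containing phoneme, sorted by frequency. If single_phoneme is True, phoneme must match a whole token rather than a substring. """
    words_counts = {}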
    dataset = pd.read_csv(csv_file)
    phonemized = list(dataset['phonemized_utterance'])
    glosses = list(dataset['gloss'])
    for i, line in tqdm(enumerate(phonemized)):
        if phoneme in line:
            words = glosses[i].split()
            phonemic_words = line.split(' WORD_BOUNDARY ')
            for j, word in enumerate(phonemic_words):
                if (single_phoneme and phoneme in word.split()) or (not single_phoneme and phoneme in word):
                    if j >= len(words):
                        continue
                    entry = words[j] + '/' + word.replace(' WORD_BOUNDARY', '')
                    if entry not in words_counts:
                        words_counts[entry] = 0
                    words_counts[entry] += 1
    # Sort by counts
    words_counts = {k: v for k, v in sorted(words_counts.items(), key=lambda item: item[1], reverse=True)}
    return words_counts

def print_unseen_phonemes(unseen, unknown):
    print(f'Unknown phonemes: {unknown if unknown else "None"}')
    print(f'Unseen phonemes: {unseen if unseen else "None"}')

def compare_phoneme_sets(csv_file, iso_code, language_names, inventory_id=None):
    """ Compares phonemes in the csv file with the phonemes in phoible.
    
    Four levels of comparison are made:
    1. Phonemes in the dataset not seen in any parts of phoible
    2. Phonemes in the dataset not seen in any phoible language with a specific language code (e.g. 'eng' for English)
    3. Phonemes in the dataset not seen in the specific inventories of phoible that match the language names (e.g. ['American English', 'English (American)'] for English)
    4. (Optional) Phonemes in the dataset not seen in a specific inventory of phoible (e.g. '2175' for English (American))
    """
    if not csv_file.endswith('.csv'):
        dataset = pd.read_csv('../CHILDES-dataset/'+csv_file+'/train.csv')
    else:
        dataset = pd.read_csv(csv_file)

    token_counts = {}
    for line in dataset['phonemized_utterance']:
        # Our tool combines tone markers with the preceding vowel. We remove tone markers in our comparison so that we don't get many "unknown phonemes" consisting of a known vowel + tone marker.
        # Multi-character markers come first so that they are removed as a unit.
        for tone_marker in ('˧˥', '˧˩̰', '˩˧', '˨', '˥', '˧', '˩', '˦'):
            line = line.replace(tone_marker, '')
        tokens = line.strip().split()
        for token in tokens:
            if token not in token_counts:
                token_counts[token] = 0
            token_counts[token] += 1

    print(f'Total phonemes in the dataset: {sum(token_counts.values())}')
    print(f'Unique phonemes: {list(token_counts.keys())}\n')

    _, unknown = find_unseen_phonemes(token_counts, phonemes)
    print('-'*100)
    print('ALL PHOIBLE:')
    print('Comparing the phonemes seen in the data with all phoible phonemes:')
    print(f'Unknown phonemes: {unknown if unknown else "None"}')


    language_phonemes = phoible[phoible.ISO6393 == iso_code].Phoneme.unique()
    language_unseen, language_unknown = find_unseen_phonemes(token_counts, language_phonemes)
    print('-'*100)
    print('LANGUAGE CODE:')
    print(f'\nComparing the phonemes seen in the data with just the phonemes in phoible with the {iso_code} language code:')
    print_unseen_phonemes(language_unseen, language_unknown)
    
    possible_language_names = phoible[phoible.ISO6393 == iso_code].LanguageName.unique()
    print(f'\nPhoible inventories names with the {iso_code} language code: {str(list(possible_language_names))}')

    possible_dialects = phoible[phoible.ISO6393 == iso_code].SpecificDialect.unique()
    print(f'Phoible dialects names with the {iso_code} language code: {str(list(possible_dialects))}')

    print('-'*100)
    print('INVENTORIES:')

    all_inventory_phonemes = set()
    inventory_ids = set()
    for inventory in language_names:
        dialect_phonemes = phoible[phoible.LanguageName == inventory].Phoneme.unique()
        ids = phoible[phoible.LanguageName == inventory].InventoryID.unique()
        all_inventory_phonemes.update(dialect_phonemes)
        inventory_ids.update(ids)

    dialect_unseen, dialect_unknown = find_unseen_phonemes(token_counts, all_inventory_phonemes)
    print(f'\nNow comparing the phonemes seen in the data with the phonemes in phoible with the specific language names {language_names}:')
    print(f'Inventory IDs matching the inventory names: {str(inventory_ids)}')
    print_unseen_phonemes(dialect_unseen, dialect_unknown)

    if inventory_id is not None:
        inventory_phonemes = phoible[phoible.InventoryID == inventory_id].Phoneme.unique()
        print(inventory_phonemes)
        inventory_unseen, inventory_unknown = find_unseen_phonemes(token_counts, inventory_phonemes)
        print('-'*100)
        print(f'INVENTORY ID:')
        print(f'\nNow comparing the phonemes seen in the data with the phonemes in phoible with the specific inventory {inventory_id}:')
        print_unseen_phonemes(inventory_unseen, inventory_unknown)
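
The unseen/unknown split that each comparison level relies on can be illustrated on toy data. This is a minimal sketch using set arithmetic; `split_unseen_unknown` is a hypothetical helper and the toy utterance and inventory are made up for illustration.

```python
from collections import Counter


def split_unseen_unknown(counts, inventory):
    """Return (unseen, unknown) for token counts against an inventory.

    unseen:  phonemes listed in the inventory but absent from the data
    unknown: phonemes in the data but absent from the inventory, with counts
    """
    inventory = set(inventory)
    seen = set(counts) - {"WORD_BOUNDARY"}
    unseen = inventory - seen
    unknown = {p: c for p, c in counts.most_common() if p in seen - inventory}
    return unseen, unknown


# Toy data, not taken from the corpus
counts = Counter("h a l oː WORD_BOUNDARY h a x".split())
unseen, unknown = split_unseen_unknown(counts, ["h", "a", "l", "ʒ"])
print(unseen)
print(unknown)
```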


# English (US) Analysis

The English section of our corpus is taken from Eng-NA in CHILDES, so we compare to all English languages and more specifically the two American English sections of phoible. For English, we use phonemizer.

The output of espeak, adjusted by our folding dictionary, brings the phoneme inventory fairly close to the American English inventories in phoible (specifically inventory 2175). The unseen phonemes mostly overlap with the unknown phonemes, with slightly altered diacritics.
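
One way to spot unknown and unseen phonemes that differ only in diacritics is to compare them after stripping combining marks and modifier letters. The sketch below is an assumption about how such a check could be done, not part of the original analysis; `base_form` and `pair_by_base` are hypothetical helpers.

```python
import unicodedata


def base_form(phoneme):
    """Strip combining marks (Mn) and modifier letters (Lm, e.g. ː, ʰ, ʲ)."""
    decomposed = unicodedata.normalize("NFD", phoneme)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) not in ("Mn", "Lm")
    )


def pair_by_base(unknown, unseen):
    """Map each unknown phoneme to the unseen phonemes sharing its base form."""
    by_base = {}
    for phoneme in unseen:
        by_base.setdefault(base_form(phoneme), []).append(phoneme)
    return {
        phoneme: sorted(by_base[base_form(phoneme)])
        for phoneme in unknown
        if base_form(phoneme) in by_base
    }


# A few symbols from the English (US) output as a small example
print(pair_by_base(["ɑ̃", "nʲ"], ["ɑː", "pʰ", "ʔ"]))
```

Pairs found this way are only candidates for folding; whether two symbols really denote the same phoneme still needs to be checked by hand.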

For English, we do not try to match the remaining unseen phonemes with the unknown phonemes, because our folding dictionary is based on the phonemization process used by [BabySLM](https://github.com/MarvinLvn/BabySLM), and we need to follow the same transcription process if we want our datasets to match theirs for training and evaluation.

In [3]:
compare_phoneme_sets('Eng-NA', 'eng', ['English (American)', 'American English'], 2175)

Total phonemes in the dataset: 29029125
Unique phonemes: ['s', 'iː', 'WORD_BOUNDARY', 'ð', 'ɛ', 'ɹ', 'z', 'ʌ', 'f', 'eɪ', 'w', 'ɪ', 'ɡ', 'l', 'æ', 'ɑ', 'h', 'ə', 'ʊ', 'k', 'p', 'uː', 'b', 'i', 't', 'aɪ', 'θ', 'ŋ', 'j', 'ɔ', 'm', 'ɔɪ', 'n', 'd', 'oʊ', 'aʊ', 'v', 'ɜː', 't̠ʃ', 'd̠ʒ', 'ʃ', 'iə', 'ʒ', 'ɑ̃', 'r', 'x', 'nʲ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the eng language code:
Unknown phonemes: {'ɑ̃': 13, 'nʲ': 8}
Unseen phonemes: {'pʰ', 'iɛ', 'ɪə', 'aː', 'ɚ', 'ɑe', 'øː', 'oː', 'æo', 'kx', 'ʊə', 'ʔ', 'ts', 'ɒː', 'tʰ', 'ʉə', 'ɵː', 'e', 'ʉː', 'æe', 'əʉ', 'ɛː', 'oe', 'ɛʉ', 'ʍ', 'ei', 'əː', 'ɐ', 'ɚː', 'əʊ', 'ɑː', 'a', 'iɪ', 'eː', 'o

# English (UK) Analysis

The English section of our corpus is taken from Eng-UK in CHILDES, so we compare to all English languages and more specifically the British English sections of phoible. We use the `en-gb` espeak accent which uses RP, so we compare to inventory 2252 (English RP). Note that much of the data comes from areas of the UK that do not use RP. 

With a few adjustments in our folding dictionary, we get a pretty good match to inventory 2252. There are just two extra vowels that the phonemizer outputs (`i`, and `ɐ`) as well as syllabic `n̩`. The rest of the unknown phonemes are sufficiently rare.
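
To make "sufficiently rare" more concrete, unknown phonemes can be expressed as occurrences per million tokens. The sketch below uses the counts reported in the output of this section; the threshold of 100 per million is an arbitrary assumption for illustration, and both helper names are hypothetical.

```python
def per_million(unknown, total):
    """Convert raw counts into occurrences per million phoneme tokens."""
    return {p: 1_000_000 * c / total for p, c in unknown.items()}


def frequent_unknown(unknown, total, threshold_per_million=100.0):
    """Keep only the unknown phonemes at or above the (assumed) threshold."""
    rates = per_million(unknown, total)
    return {p: r for p, r in rates.items() if r >= threshold_per_million}


# Counts copied from the Eng-UK language-code comparison
eng_uk_unknown = {"n̩": 3004, "ɑ̃": 28}
print(frequent_unknown(eng_uk_unknown, total=20714353))
```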

In [4]:
compare_phoneme_sets('Eng-UK', 'eng', ['English (British)', 'English'], 2252)

Total phonemes in the dataset: 20714353
Unique phonemes: ['eɪ', 't̠ʃ', 'WORD_BOUNDARY', 'w', 'ɒ', 'tʰ', 'd', 'ʌ', 'z', 'ð', 'æ', 'm', 'iː', 'n', 'e', 'kʰ', 's', 'ɪ', 'ɡ', 'ʊ', 'ɑː', 'ɔː', 'l', 'ə', 'ɹ', 'i', 'əʊ', 'uː', 'j', 'h', 'ɪə', 'ɔɪ', 'v', 'aɪ', 'f', 'ɜː', 'b', 'pʰ', 'd̠ʒ', 'ɐ', 'eə', 'ʃ', 'θ', 'ŋ', 'aʊ', 'ʊə', 'n̩', 'ʒ', 'r', 'ɑ̃', 'aː', 'ɔ', 'x']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the eng language code:
Unknown phonemes: {'n̩': 3004, 'ɑ̃': 28}
Unseen phonemes: {'iɛ', 'ɚ', 'ɑe', 'øː', 'oː', 'æo', 'kx', 'ʔ', 'ts', 'ɒː', 'ʉə', 'ɛ', 'ɵː', 'ʉː', 'æe', 'əʉ', 'ɛː', 'oe', 'p', 'ɛʉ', 'ʍ', 'ei', 'əː', 'ɚː', 'a', 'iɪ', 'eː', 'o̞

# German Analysis

We compare all German inventories in phoible to our German dataset. For German, we use the epitran backend.

With our folding dictionary, we are close to inventory 2398. 

The phoneme `ʒ` is unseen, and the unknown phonemes `x` and `ɐ` are missing from inventory 2398 but exist in other German inventories. The rest are sufficiently rare.


In [20]:
compare_phoneme_sets('../processed/German/train.csv', 'deu', ['German','GERMAN'], 2398)

Total phonemes in the dataset: 18573308
Unique phonemes: ['h', 'a', 'l', 'oː', 'WORD_BOUNDARY', 'j', 'aː', 'm', 'd̺', 's', 't̺ʰ', 'iː', 'n', 'z', 'ɛ', 'ts', 'ɪ', 'ɛː', 'ʀ', 'ʊ', 'ɡ', 'ŋ', 'v', 'uː', 'ç', 'b', 'x', 'ɔ', 'ʃ', 'eː', 'œ', 'kʰ', 'f', 'pʰ', 'øː', 'yː', 'ʏ', 't̠ʃ', 'pf', 'ə', 'ɐ', 'd̠ʒ', 'r', 'k']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the deu language code:
Unknown phonemes: {'r': 1}
Unseen phonemes: {'aɪ', 'e', 'ʔ', 'd', 't', 'ai', 'ʏː', 'aʊ', 'y', 'ʒ', 'au', 'p', 'ɔi', 'ɔɪ', 'ʁ', 'tʰ'}

Phoible inventories names with the deu language code: ['German', 'GERMAN']
Phoible dialects names with the deu language code: [nan, '

In [19]:
find_words_with_phoneme('../CHILDES-dataset/German/train.csv', 'r')

850297it [00:00, 6351000.10it/s]


{'grˆsser/ɡ rˆ z ɛː ʀ': 1}

# Japanese Analysis

We compare all Japanese inventories in phoible to our Japanese dataset. For Japanese, we use phonemizer.

The three phoible Japanese inventories do not seem to accurately reflect the Wikipedia page on Japanese phonology, but we get close to inventory 2196. There are few unseen phonemes: `ɴ` and `u`. Since phonemizer uses a segments backend, we do not output `ɴ` in the correct contexts, but it is a rather infrequent phoneme. `ɯː` may be output in place of `u`.

There are many unknown phonemes, many of which are valid phonemes according to the Wikipedia entry and appear in some of the other inventories.



In [6]:
compare_phoneme_sets('Japanese', 'jpn', ['Japanese', 'JAPANESE'], 2196)

Total phonemes in the dataset: 8714043
Unique phonemes: ['b', 'aː', 'WORD_BOUNDARY', 's', 'ɯ', 'm', 'i', 'h', 'a', 'ɾ', 'e', 't̠ʃ', 'n', 'ɯː', 'j', 'o', 'ʃ', 'd', 'ɡ', 'k', 'oː', 'd̠ʒ', 't', 'z', 'eː', 'w', 'p', 'pʲ', 'ɲ', 'ts', 'ɸ', 'ɾʲ', 'kʲ', 'ç', 'ɡʲ', 'bʲ', 'mʲ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the jpn language code:
Unknown phonemes: {'oː': 134200, 'eː': 20028, 'ɸ': 19093, 'kʲ': 8621, 'ɲ': 4225, 'ɾʲ': 2226, 'ɡʲ': 1705, 'ç': 1022, 'pʲ': 1013, 'bʲ': 790, 'mʲ': 121}
Unseen phonemes: {'iː', 'ʔ', 'kː', 'ɛ', '˥', 'n̪', 'ɛː', 'ɴ', 't̪', 'ɔ', 'çː', 't̠ʃː', 'd̠', 'u', 'd̪', 'ɯ̃', 's̪', 'ʃː', '˧', 'ɔː', 'tsː', 'pː', 'tː', 'z̪',

# Indonesian Analysis

We compare all Indonesian inventories in phoible to our Indonesian dataset. For Indonesian, we use the epitran backend. With minimal folding, we are close to inventory 1690.

The only unknown phoneme is `j`, which appears in the other Indonesian inventory. There are a few unseen phonemes, `e`, `ʔ`, `ɛ`, `v` and `y`. Three of these are not listed in the other inventory.



In [7]:
compare_phoneme_sets('Indonesian', 'ind', ['Indonesian'], 1690)

Total phonemes in the dataset: 7807672
Unique phonemes: ['n', 'i', 'h', 'WORD_BOUNDARY', 'l', 'o', 't', 'm', 'a', 'w', 's', 'd̠ʒ', 'ŋ', 'd', 'p', 'ə', 'ɡ', 'b', 'u', 'r', 'k', 'ɲ', 'j', 't̠ʃ', 'f', 'z', 'ʃ', 'x']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the ind language code:
Unknown phonemes: None
Unseen phonemes: {'oi̯', 'ɛ', 'e', 'y', 'au̯', 'ai̯', 'ʔ', 't̪', 'v'}

Phoible inventories names with the ind language code: ['Indonesian']
Phoible dialects names with the ind language code: ['Central Java', 'Standard']
----------------------------------------------------------------------------------------------------
INVENTORIES:

Now c

# French Analysis

We compare all French inventories in phoible to our French dataset. For French, we use phonemizer.

We find a few unknown phonemes, all very infrequent, and most are valid allophones according to the French inventory:
* `yː` is an allophone of `y`
* `aː` is an allophone of `a̟`
* `iː` is an allophone of `i`

The remaining two phonemes, `t̠ʃ` and `d̠ʒ`, seem to come from loan words such as `sandwich` and `jazz`, so they seem acceptable to keep.

Comparing to the specific inventory ID `2266` there is only one unseen phoneme, `ɥ` and the unknown phonemes are either those mentioned above, rare, or contained in one of the other French inventories (such as `œ̃`).

In [8]:
compare_phoneme_sets('French', 'fra', ['French', 'FRENCH'], 2269)

Total phonemes in the dataset: 7332327
Unique phonemes: ['t', 'y', 'WORD_BOUNDARY', 'v', 'j', 'ɛ̃', 'w', 'a', 'ʁ', 'd', 'e', 'ʒ', 'm', 'ɔ̃', 'p', 'ɛ', 'f', 'ɔ', 'ɑ̃', 's', 'z', 'l', 'ə', 'b', 'k', 'u', 'o', 'ʃ', 'ɡ', 'i', 'n', 'œ̃', 'ø', 'œ', 'oː', 'yː', 'ɲ', 'aː', 't̠ʃ', 'd̠ʒ', 'iː', 'ŋ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the fra language code:
Unknown phonemes: {'yː': 1393, 'aː': 1335, 't̠ʃ': 29, 'd̠ʒ': 9, 'iː': 3}
Unseen phonemes: {'r', 'ɒ̃', 'ɡ̟', 'ø̞', 'ɥ', 'a̟', 'n̪', 'ɛː', 'k̟', 't̪', 'l̪', 'æ̃', 'ɑ', 'ɒ', 'd̪', 's̪', 'õ', 'ɦ', 'ʀ', 'z̪'}

Phoible inventories names with the fra language code: ['French', 'FRENCH']
Phoi

# Spanish Analysis

We compare all Spanish inventories in phoible to our Spanish dataset. For Spanish, we use epitran.

With our folding dictionary, we get close to inventory 164. The unseen phonemes `θ` and `ʎ` do not always appear in Spanish phonology, depending on accent. `ʝ` is unknown but typically replaces `ʎ` in certain accents. `d` and `g` are unseen for this inventory but are listed in others. `ʃ` is unknown but mostly comes from English loan words. The remaining unknown phonemes are sufficiently rare.


In [9]:
compare_phoneme_sets('Spanish', 'spa', ['Spanish', 'SPANISH'], 164)

Total phonemes in the dataset: 5633193
Unique phonemes: ['k', 'o̞', 'm', 'WORD_BOUNDARY', 'e̞', 's', 'i', 'ɾ', 'n', 'a', 'u', 'p', 'd', 'l', 't', 'β', 'ɡ', 'w', 'ʝ', 'f', 'x', 'j', 'ɲ', 'r', 't̠ʃ', 'ʃ', 'tl', 'ü', 'ts', 'ë̞', 'ï', 'ɘ', 'ɛ', 'ä']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the spa language code:
Unknown phonemes: {'d': 138829, 'ʝ': 25234, 'tl': 56, 'ts': 45, 'ɘ': 15, 'ë̞': 12, 'ï': 6, 'ü': 1}
Unseen phonemes: {'u̯æi', 'i̯ɔ', 's̺̠', 'd̻', 'ou̯', 'æ', 'i̯a', 'ɣ', 'au', 'ua', 'ð͉', 'iu', 'oi̯', 'i̯o', 'θ', 'ʎ', 'e', 'ai', 'ᴅ', 'n̪', 'i̯e', 'i̯ɛi', 'ɛi', 'ai̯', 'χ', 'ei', 't̪', 'l̪', 'ei̯', 'ɔ', 'ia', 'o', 'ɟʝ', 'au̯

# Mandarin Analysis

We compare all Mandarin inventories in phoible to our Mandarin dataset. For Mandarin, we use pinyin_to_ipa.

There are four Mandarin inventories in Phoible, "Mandarin Chinese", "Standard Chinese; Mandarin", "MANDARIN" and "Standard Chinese". There is considerable disagreement between these.
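
The disagreement between inventories could be quantified with a pairwise Jaccard similarity over their phoneme sets. The sketch below assumes the inventories have already been collected into a mapping from inventory ID to a set of phonemes (for example from phoible's `InventoryID` and `Phoneme` columns); the function name, toy IDs and toy phoneme sets are hypothetical.

```python
from itertools import combinations


def pairwise_jaccard(inventories):
    """Jaccard similarity |A & B| / |A | B| for every pair of inventories.

    Assumes every inventory contains at least one phoneme.
    """
    scores = {}
    for id_a, id_b in combinations(sorted(inventories), 2):
        set_a, set_b = inventories[id_a], inventories[id_b]
        scores[(id_a, id_b)] = len(set_a & set_b) / len(set_a | set_b)
    return scores


# Toy inventories, not real phoible data
toy = {1: {"a", "i", "u"}, 2: {"a", "i", "y"}, 3: {"a", "i", "u"}}
for pair, score in pairwise_jaccard(toy).items():
    print(pair, score)
```

A score of 1.0 means two inventories list exactly the same phonemes; lower scores indicate more disagreement.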

Our output mostly aligns with inventory 2457 (Beijing Standard Chinese), with the help of a few folding rules to match the diacritics used by the inventory (e.g. `au` instead of `au̯`).

Note that our tool (and pinyin_to_ipa) combines tone markers with the preceding vowel, which is not how phoible lists tone markers. We remove tone markers in our comparison so that we don't get many "unknown phonemes" consisting of a known vowel + tone marker.
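
A compact alternative for stripping tone markers is a regular expression over the Chao tone letters (U+02E5 to U+02E9), optionally followed by the combining creaky-voice mark (U+0330). This is a sketch of the idea rather than the code used in `compare_phoneme_sets`, and it is slightly more permissive: it removes a creaky-voice mark after any tone sequence.

```python
import re

# One or more Chao tone letters, optionally followed by combining tilde below
TONE = re.compile("[\u02e5-\u02e9]+\u0330?")


def strip_tones(utterance):
    """Remove tone markers attached to phonemes in a phonemized utterance."""
    return TONE.sub("", utterance)


print(strip_tones("n i\u02e8\u02e9\u02e6 x au\u02e8\u02e9\u02e6"))
```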

There are a few key unknown phonemes:
* `o`, `ɛ`, `ʊ`, `e`, `ɔ`: these are vowels not listed in the inventory but are typically used in transcriptions as allophones of typical Mandarin phonemes. pinyin_to_ipa seems to favour a slightly more phonetic realisation of vowels.
* `ɻ̩` and `ɻ`: listed in other sources as a valid consonant (and syllabic consonant) in Mandarin but not present in this inventory
* `ɥ`: a glide used along with `j` and `w` in certain analyses of Mandarin, this inventory seems to prefer to write these as diphthongs (see below). 
* `ɚ`: the rhotic coda present in the Beijing dialect for a few words but not listed in this inventory. 

The unseen tokens consist of:
* `ɹ̺`: It is unclear why this is not produced
* `ua`, `iau`, `iu`, `uei`, `ye`, `uo`, `uai`, `ie`, `uə`, `ia`, `iou`: diphthongs resulting from the alternative analysis of glides where `j`, `ɥ` and `w` are not analysed as independent phonemes but instead as allophones of the high vowels `i`, `y` and `u`. Although this interpretation may be more common, we stick to the output of pinyin_to_ipa rather than try to correct this.

In [10]:
compare_phoneme_sets('../processed/Mandarin/train.csv', 'cmn', ['Standard Chinese; Mandarin', 'Mandarin Chinese'], 2457)

Total phonemes in the dataset: 5721192
Unique phonemes: ['au', 'WORD_BOUNDARY', 'n', 'a', 'ʃ̺', 'ɻ̩', 'ə', 'm', 'ɤ', 'p', 'j', 'e', 'kʰ', 'k', 'w', 'o', 't̠ʃ̺ʰ', 'ŋ', 't', 'ʊ', 'ɕ', 'i', 'l', 'x', 'u', 'ei', 'pʰ', 'ai', 'ou', 'tɕ', 'ts', 't̠ʃ̺', 's', 'ɹ̪̩', 'tɕʰ', 'ɛ', 'f', 'y', 'tʰ', 'ɻ', 'ɥ', 'ɔ', 'tsʰ', 'ɚ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the cmn language code:
Unknown phonemes: {'ɻ̩': 95797, 'o': 89893, 'ɛ': 52880, 'ʊ': 34560, 'e': 29436, 'ɻ': 18882, 'ɔ': 10547, 'ɚ': 3167}
Unseen phonemes: {'˦˨', 'u˞', 'cçʰ', 'ʂ', '˧˨˧', 'a̟˞', 'ə̃˞', 'cç', '˧˥', 'uei', 't̪ʰ|tʰ', 'ʈʂʰ', 'ɹ', 'ua', '˥˦', 'iu', 'uo', '˥', 'χ', '˦', '˧˨˥'

# Dutch Analysis

We compare all Dutch inventories in phoible to our Dutch dataset. For Dutch, we use phonemizer.

With our folding map, we get close to inventory 2405, although there are a few major differences.

The inventory lists `χ` as unseen, but instead we produce `ɣ` and `x` which are listed on Wikipedia. The inventory also lists `ɔ̃`, `ɛ̃`, `œː` and `ã` as missing. There are also many unknown phonemes. Many are rare, the others are mostly listed on Wikipedia or in other inventories.

In [12]:
compare_phoneme_sets('Dutch', 'nld', ['Dutch'], 2405)

Total phonemes in the dataset: 4513875
Unique phonemes: ['j', 'aː', 'WORD_BOUNDARY', 'ɦ', 'oː', 'ɾ', 'd', 'i', 'ɛ', 'p', 'ɪ', 'k', 'ɑ', 'l', 'eː', 'n', 's', 'v', 'ə', 'ɛi', 'ʋ', 'z', 't', 'm', 'ɣ', 'ʏ', 'ɔ', 'x', 'u', 'f', 'ŋ', 'øː', 'b', 'ɔː', 'ʌu', 'œy', 'tʲ', 'y', 'w', 'ʃ', 'ɛː', 't̠ʃ', 'ɲ', 'ʒ', 'ɡ', 'd̠ʒ', 'iː', 'a', 'ð', 'uː', 'e', 'yː']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the nld language code:
Unknown phonemes: {'t̠ʃ': 172, 'ð': 9, 'e': 2}
Unseen phonemes: {'ɑuː', 'ɪə', 'r', 'ɔːə', 'iːə', 'pʲ', 'œʏ', 'uɪ', 'æ', 'æ̃ː', 'c', 'øy', 'ʔ', 'œ', 'ɒː', 'eɪ', 'øɪ', 'aʊ', 'œ̃ː', 'øə', 'β̞', 'ɛ̃', 'ʎ', 'ou', 'œː', 'æː', 'uːə', 'χ'

# Serbian Analysis

We compare the Serbian inventory in phoible to our Serbian dataset. For Serbian, we use epitran. 

With our folding dictionary, we get close to inventory 2499 (the only inventory for Serbian).

The unseen phonemes are mostly long vowels, which epitran fails to produce (**this might be an issue**).

The unknown phoneme `j` seems to be missing from phoible and is a common phoneme in Serbian, according to Wikipedia. The remaining unknown phonemes are rare and come from English loan words that have been badly transcribed.


In [13]:
compare_phoneme_sets('Serbian', 'srp', ['Serbian'], 2499)

Total phonemes in the dataset: 3724972
Unique phonemes: ['d̪̻', 'ä', 'WORD_BOUNDARY', 'i', 'j', 'm', 's̪̻', 'l', 'n', 'e̞', 'ʋ', 'r', 'o̞', 't̪̻', 'k', 't̪̻s̪̻', 'p', 'u', 'ʃ̺', 'x', 'b', 'ʒ̺', 'ɡ', 't̻ʃ̻', 'f', 'z̪̻', 'ɲ', 'ʎ', 'd̻ʒ̻', 'y', 'w', '́']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: {'́': 1}
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the srp language code:
Unknown phonemes: {'j': 136459, 'y': 39, 'w': 38, '́': 1}
Unseen phonemes: {'iː', 'uː', 'äː', 'd̺̠ʒ̺ʷ', 'o̞ː', 't̺ʃ̺ʷ', 'e̞ː'}

Phoible inventories names with the srp language code: ['Serbian']
Phoible dialects names with the srp language code: ['Serbian (Standard Ekavian)']
------------------------------------------

# Estonian Analysis

We compare all Estonian inventories in phoible to our Estonian dataset. For Estonian, we use phonemizer.

To create our folding dictionary, we compare our output with phoible inventory 2181 and Wikipedia. Many changes were required in the folding dictionary, due to repeated letters for long vowels and consonants being produced by phonemizer.
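
Folding entries for repeated letters could be generated rather than written by hand. The sketch below assumes that a doubled letter arrives as a single token such as `aa`, which is an assumption about the phonemizer output rather than something shown in this notebook; `doubled_to_long` is a hypothetical helper.

```python
def doubled_to_long(symbols):
    """Build folding entries mapping a doubled symbol to its long form."""
    return {symbol + symbol: symbol + "ː" for symbol in symbols}


# A small illustrative subset of vowels and consonants
entries = doubled_to_long(["a", "e", "k", "s"])
print(entries)
```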

In the end, there is quite a bit of disagreement between our output and phoible inventory 2181: the inventory lists semi-long consonants (which phonemizer does not output), and our tool outputs many consonants, long vowels and diphthongs that are not listed in the inventory (but seem to be valid according to Wikipedia).

Generally, we get a pretty good match with the Wikipedia entry for Estonian phonology, except for the fact that we do not output many diphthongs, instead many vowels are listed independently (e.g. `e o` instead of `eo`). We could add rules to join these vowels but it is not clear if these should always be diphthongs, so we leave them. 
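
For reference, a rule that joins adjacent vowels into diphthongs might look like the sketch below. It is not applied in our pipeline; the rule table contains only the `e o` example mentioned above, and `join_diphthongs` is a hypothetical helper.

```python
# Hypothetical rule table: (first vowel, second vowel) -> diphthong
DIPHTHONGS = {("e", "o"): "eo"}


def join_diphthongs(tokens, rules):
    """Greedily merge adjacent token pairs that match a rule, left to right."""
    joined = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in rules:
            joined.append(rules[pair])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


print(join_diphthongs("t e o WORD_BOUNDARY".split(), DIPHTHONGS))
```

Because the merge is unconditional, it would also join vowel pairs that belong to separate syllables, which is the reason for leaving the output as it is.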

However, looking at individual examples, the output is not always very good (e.g. initial `b` being output as `p` or `i` randomly being added after some other vowels). The espeak-ng voice might not be very good for this language.


In [14]:
compare_phoneme_sets('Estonian', 'ekk', ['Estonian'], 2181)

Total phonemes in the dataset: 2508499
Unique phonemes: ['n', 'o', 'WORD_BOUNDARY', 'm', 'i', 's', 'a', 'æ', 'r', 'p', 'l', 'e', 'd', 'iː', 't', 'ʃ', 'kː', 'v', 'æi', 'pː', 'u', 'k', 'h', 'mː', 'eː', 'sː', 'ɑ', 'yː', 'uː', 'ɛ', 'j', 'ɡ', 'b', 'aː', 'sʲ', 'ɪ', 'ɔ', 'ɤː', 'ʊ', 'lː', 'ø', 'ɤ', 'øː', 'tː', 'ŋ', 'y', 'tʲ', 'oː', 'rː', 'ɲ', 'nː', 'w', 'tʲː', 'æː', 'øɪ̯', 'f', 'dʲ', 'sʲː', 't̠ʃ', 'ʃː', 'ʒ', 'fː', 'dː', 'z', 'yi']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the ekk language code:
Unknown phonemes: {'a': 209437, 'd': 67228, 'iː': 52732, 'ɡ': 32129, 'aː': 30848, 'ɛ': 28868, 'eː': 27859, 'b': 22808, 'ɪ': 14614, 'ʊ': 12601, 'oː': 

# Cantonese Analysis

We compare all Cantonese inventories in phoible to our Cantonese dataset. For Cantonese, we use pingyam.

There are four Cantonese inventories in Phoible, "Chinese", "Cantonese", "TAISHAN" and "Yue Chinese". 

Our output closely aligns with inventory 2166. The only unknown phoneme is syllabic m (`m̩`) and the only unseen phonemes are `kʷʰ`, `r`, `kʷ` and `β`.

In [34]:
compare_phoneme_sets('../processed/Cantonese/train.csv', 'yue', ['Cantonese', 'TAISHAN', 'Yue Chinese', 'Chinese'], 2166)

Total phonemes in the dataset: 1873310
Unique phonemes: ['a̞', 'WORD_BOUNDARY', 't', 'ɐ', 'k', 'l', 'j', 'ʊ̟', 'ɛ', 'n', 'ei', 'w', 'ɐi', 'm̩', 'm', 'ou', 'i', 'ts', 'ɔ̽', 'tʰ', 'f', 'aːĭ', 'p', 'h', 'ɵy', 'u', 'ŋ', 's', 'ɔːĭ', 'ɐu', 'ɪ̞', 'iːŭ', 'ɵ', 'tsʰ', 'kʰ', 'aːŭ', 'pʰ', 'y', 'œ̞', 'uːĭ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the yue language code:
Unknown phonemes: None
Unseen phonemes: {'ɔ', 'iu', 'ai', 't̪|t', 't̪ʰ|tʰ', 'kʷʰ', '˩', '˧˩', 't̠ʃʰ', '˧˩̰', 'ɛu', '˧˥', '˩˧', 't̠ʃ', 'l̪|l', 'ʃ', 'œ', '˥˦', 'au', 'ŋ̩', 'β', 'ʊ', 'l̪̥|l̥', 'n̪|n', 'r', '˦', '˧', 'l̥', 'ui', '˨', 'æ', '˥', 'a', 'ɔi', 'kʷ', 'ɪ'}

Phoible inve

# Polish Analysis

We compare the two Polish inventories in phoible to our Polish dataset. For Polish, we use phonemizer. 

With some mapping, we get close to inventory 1046. 

The phoneme `d̪z̪` does not seem to be produced by phonemizer, but the other two unseen phonemes `c` and `ɟ` are not listed on Wikipedia so it seems ok to miss these.

The phonemes our tool produces that are not in the inventory ("unknown phonemes" below) are the consonants `t, d, ɡʲ, kʲ, xʲ`, which are valid according to Wikipedia, and the vowel `ẽ`, which is listed in the other phoible set.



In [16]:
compare_phoneme_sets('Polish', 'pol', ['Polish'], 1046)

Total phonemes in the dataset: 1734629
Unique phonemes: ['x', 'o', 'dʑ', 'WORD_BOUNDARY', 'd̪', 'b', 'a', 'p', 'tɕ', 'i', 'z̪', 'r', 'u', 'n̪', 'ɲ', 'e', 'f', 'k', 'w', 'j', 'ʑ', 'v', 'l̪', 'ɕ', 'm', 't̪', 'z̻', 'ɨ', 's̻', 't̻s̻', 't̪s̪', 'ɡ', 's̪', 'ŋ', 'kʲ', 't', 'ɡʲ', 'ẽ', 'ɣ', 'd̻z̻', 'd', 'xʲ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the pol language code:
Unknown phonemes: {'kʲ': 3912, 'ɡʲ': 970, 'ẽ': 470, 'ɣ': 469, 'xʲ': 120, 't': 69, 'd': 37}
Unseen phonemes: {'ʈ̻ʂ̻', 'ä', 'ũ̯', 'ʂ̻', 'ɔ', 'ɘ̟', 'ɛ', 'd̪z̪', 'ĩ̯', 'c', 'ʐ̻', 'ɖ̻ʐ̻', 'ɟ'}

Phoible inventories names with the pol language code: ['Polish']
Phoible dialects

# Swedish Analysis

We compare the two Swedish inventories in phoible to our Swedish dataset. For Swedish, we use phonemizer.

With our folding map, we get close to inventory 1150.

There are no unseen phonemes from the inventory and only a few unknown phonemes (`e, ʂ, ʃ, ɵː, z`) which are either rare or are listed as valid phonemes for Swedish on Wikipedia.

In [17]:
compare_phoneme_sets('Swedish', 'swe', ['Swedish'], 1150)

Total phonemes in the dataset: 1449668
Unique phonemes: ['ɔ', 'ʝ', 'WORD_BOUNDARY', 'k', 'l', 'ɛ', 'm', 'd̪', 'e', 'ʉ̟', 'f', 'ɪ', 'ŋ', 'ɹ', 'a', 'n̪', 'iː', 'ɑː', 't̪', 'ɛː', 's̪', 'v', 'oː', 'uː', 'eː', 'ʊ', 'p', 'b', 'h', 'øː', 'yː', 'ɡ', 'ɵ', 'ʃ', 'ʂ', 'œ', 'ʏ', 'ɕ', 'ɧ', 'ɵː', 'z']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the swe language code:
Unknown phonemes: {'e': 19563, 'ʂ': 1073, 'ʃ': 263, 'ɵː': 40, 'z': 10}
Unseen phonemes: {'pʰ', 't̪ʰ', 'kʰ', 'ʉ̟ː', 'l̪'}

Phoible inventories names with the swe language code: ['Swedish']
Phoible dialects names with the swe language code: ['Stockholm', 'Swedish (Central)']
--------------

# Portuguese (Portugal) Analysis

We compare the Portuguese inventories in phoible to our Portuguese dataset. For Portuguese, we use phonemizer.

We need a substantial folding map, but by correcting the output we get very close to inventory 2206. However, this requires some folding-dictionary entries that do more than change the symbol used for a particular phoneme; other decisions also had to be made.

With these decisions, all that remains is one rare unknown phonemes used for "pizza".
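To make the idea of a folding dictionary concrete, the sketch below applies one to a space-separated phonemized utterance. The function name `apply_folding` and the entries in `toy_folding` are hypothetical and are not the actual Portuguese map. The sketch assumes a replacement may contain a space, so that one symbol can expand into several phonemes.

```python
def apply_folding(utterance, folding):
    """Replace each space-separated phoneme using the folding dictionary."""
    folded = []
    for phoneme in utterance.split():
        replacement = folding.get(phoneme, phoneme)
        # An empty replacement drops the phoneme; a replacement containing
        # spaces expands one symbol into several phonemes.
        folded.extend(replacement.split())
    return " ".join(folded)

# Hypothetical entries, not the notebook's actual Portuguese map.
toy_folding = {"ɹ": "ɾ", "r": "ʁ", "aj": "a j"}

print(apply_folding("k a ɹ u WORD_BOUNDARY p aj", toy_folding))
```

Entries like the first two only change the symbol, while an entry like `"aj": "a j"` is the kind of decision that goes beyond a symbol swap.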

In [18]:
compare_phoneme_sets('PortuguesePt', 'por', ['Portuguese', 'Portuguese (European)'], 2206)

Total phonemes in the dataset: 1304905
Unique phonemes: ['f', 'ɯ', 'ʃ', 'ɐ', 'WORD_BOUNDARY', 'a', 's', 'ĩ', 'n̪', 'ɐ̃u̜', 'd̪', 'p', 'oi', 'z', 'ɛ', 't̪', 'ɐ̃', 'm', 'b', 'ẽ', 'ʒ', 'l̪ˠ', 'i', 'ɡ', 'o', 'ɔ', 'ɾ', 'e', 'ai', 'eu̜', 'u', 'ɛu̜', 'ũ', 'ʁ', 'õ', 'ʎ', 'k', 'ɲ', 'v', 'ɐi', 'au̜', 'iu̜', 'ũi', 'ɐ̃i', 'õi', 'ui', 'ɛi', 'ɔi', 'ts']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the por language code:
Unknown phonemes: {'ts': 19}
Unseen phonemes: {'l', 'ẽɪ̯̃', 's̻', 'r', 'j', 'ŭ', 'oɪ̯', 'ɣ', 'ɛʊ̯', 'oʊ̯', 'ʊ̯ɐ̃', 'n', 'u̯', 'x', 'ʊ̯aɪ̯', 'ɐ̃ʊ̯̃', 'ĕ', 'ə', 'uʊ̯', 'ũɪ̯̃', 'õɪ̯̃', 'ɛɪ̯', 'ɔʊ̯', 'ä', 'eʊ̯', 'ĩ̯', 'eɪ̯', 

# Korean analysis

We compare the Korean inventories in phoible to our Korean dataset. For Korean, we use phonemizer.

Our system does not seem to output long vowels or diphthongs (this could be a problem), but inventory 423 also omits them, so we use our folding map to approach this inventory.

Unknown phonemes:
* `b`, `d`, `dʑ` and `ɡ` are the voiced counterparts of `p`, `t`, `tɕ` and `k`, which are valid but not listed in the inventory for some reason
* `ɾ` is an allophone of `l`, so it should be valid

We have several unseen phonemes. Five are consonants with glottal stops or aspiration, which don't seem to be produced by our tool. The remaining unseen phonemes are vowels. It is unclear why these are not produced by our tool. 



In [19]:
compare_phoneme_sets('Korean', 'kor', ['Korean', 'KOREAN'], 423)

Total phonemes in the dataset: 1085146
Unique phonemes: ['w', 'a', 'WORD_BOUNDARY', 'j', 'o', 'ɾ', 'i', 'h', 'n̪', 'ɯ', 'd', 'e', 'ŋ', 'p', 'dʑ', 'u', 'm', 'ɡ', 'ɤ̞', 'l', 't̠ʃ', 'æ', 'b', 's̪', 'k', 't̪', 'pʰ', 'kʰ', 'ɯi', 't̠ʃʰ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the kor language code:
Unknown phonemes: {'ɡ': 73759, 'd': 24529, 'dʑ': 14060, 'b': 5308}
Unseen phonemes: {'iɛ', 'iː', 'aː', 't͉', 'tɕ', 'oː', 'p͉', 'uɛ', 's̪ˀ', 'ʔ', 's͉', 'n', 'ɤː', 'lː', 'tʰ', 'uː', 'tɕʰ', 'k͉', 'iu', 'ua', 'ɛ', 'd̥ʑ̥', 'ɛː', 'ɘː', 'æː', 'uʌ', 'əː', 'ɐ', 'tˀ', 'ɡ̊', 'ɪ', 'ɛ̝', 'ø', 't̪ʰ', 'iʌ', 'kˀ', 'ia', 't͉ɕ͉', 'sʰ', 'eː', 'sˀ', 'ɯː', 'ie', 

# Italian analysis

We compare the Italian inventories in phoible to our Italian dataset. For Italian, we use phonemizer.

The inventory with the closest match is 1145 (which Wikipedia cites). 

We have no unseen phonemes but many unknown phonemes. These are:
* Long vowels: according to Wikipedia the distinction between long and short vowels is rare, but our tool seems to produce many long vowels. We could remove the length marker, but for now we keep it.
* Long consonants: according to Wikipedia, Italian consonants can geminate, so we keep these
* `ɾ` is an allophone of `r` so we keep it
* `ŋ` is an allophone of `n` so we keep it
* `ʒ` and `h` are very rare so we allow these
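The option of removing the length marker could look roughly like the sketch below, which merges long phonemes into their short counterparts in a count dictionary. The helper name `strip_length` and the `only` parameter are hypothetical, and the vowel set is an assumption. The counts are a subset of the unknown phonemes printed in the Italian output below.

```python
from collections import Counter

LENGTH_MARK = "ː"  # IPA length mark (U+02D0)

def strip_length(counts, only=None):
    """Merge long phonemes into their short counterparts.

    `only` optionally restricts the merge to a set of short base symbols
    (for example, vowels), leaving other long phonemes untouched.
    """
    merged = Counter()
    for phoneme, count in counts.items():
        base = phoneme.replace(LENGTH_MARK, "")
        if phoneme != base and (only is None or base in only):
            merged[base] += count
        else:
            merged[phoneme] += count
    return merged

# Subset of the Italian unknown counts; the vowel set is an assumption.
unknown = {"tː": 8710, "iː": 1962, "ɛː": 1577, "oː": 617}
vowels = {"i", "ɛ", "o"}
print(strip_length(unknown, only=vowels))
```

Restricting the merge to vowels keeps the geminate consonants intact, which matches the decision to keep long consonants.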




In [20]:
compare_phoneme_sets('Italian', 'ita', ['Italian'], 1145)

Total phonemes in the dataset: 1046215
Unique phonemes: ['i', 'l', 'WORD_BOUNDARY', 't', 'ɐ', 'o', 'k', 'ɔ', 'z', 'f', 'v', 'n', 'e', 'd', 'j', 't̠ʃ', 'b', 'ɛ', 'm', 'w', 's', 'ɛː', 'p', 'r', 'u', 'kː', 'ʎ', 'd̠ʒ', 'tː', 'pː', 'ts', 'ɐː', 'oː', 'iː', 't̠ʃː', 'ɾ', 'ɡ', 'sː', 'eː', 'dz', 'bː', 'd̠ʒː', 'ɲ', 'tsː', 'ʃ', 'dzː', 'ŋ', 'dː', 'a', 'uː', 'ɔː', 'ɡː', 'h', 'ʒ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the ita language code:
Unknown phonemes: {'tː': 8710, 'sː': 2547, 'iː': 1962, 'ɛː': 1577, 'ɐː': 1344, 'oː': 617, 'tsː': 454, 'ŋ': 414, 'eː': 372, 'dː': 223, 'dzː': 88, 'ɔː': 15, 'uː': 14, 'h': 7}
Unseen phonemes: {'iɛ', 'ui̯', 'uɛ

# Catalan analysis

We compare the Catalan inventories in phoible to our Catalan dataset. For Catalan, we use phonemizer.

With our folding dictionary we get close to inventory 2555. We have two unseen phonemes, `dz̺` and `i̯`. It is unclear why these are not produced. I'm also not sure what the distinction between `i` and `i̯` is. It is also worth noting that none of the Catalan inventories include diphthongs, nor does our tool produce diphthongs, even though Wikipedia says that there are phonemic diphthongs in Catalan.

The three unknown phonemes, `ð`, `β` and `ɣ`, are the lenited versions of `d`, `b` and `ɡ`, which are produced in Catalan, so we keep these.


In [21]:
compare_phoneme_sets('Catalan', 'cat', ['Catalan'], 2555)

Total phonemes in the dataset: 908431
Unique phonemes: ['t̪', 'ɛ', 'WORD_BOUNDARY', 'ð', 'o', 'n̺', 'a', 'ɡ', 'ɾ̺', 's̺', 'j', 'ə', 'β', 'i', 'ŋ', 'p', 'u', 'ʒ', 'k', 'z̺', 'w', 'm', 'ɣ', 'ʎ̟', 'ɫ̺', 'ʃ', 'f', 'r̺', 'd̪', 'd̠ʒ', 'u̯', 'e', 'ɲ̟', 'ts̺', 'b', 't̠ʃ', 'ɔ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the cat language code:
Unknown phonemes: {'β': 20590, 'ð': 12608, 'ɣ': 6873}
Unseen phonemes: {'l', 'ui̯', 'ɾ', 'r', 'ou̯', 'ə̹', 'i̯a', 'u̯ə', 'c', 'a̟', 'n', 'dz̺', 'ʎ', 'v', 'ɲ', 'ɛ̞', 'ei̯', 'ɔ̞', 'u̯a', 'ɟ', 'z', 'i̯', 'i̯ɛ', 's', 'ɛu̯', 'l̺'}

Phoible inventories names with the cat language code: ['Catalan']
Phoible diale

# Croatian analysis

We compare the Croatian inventory in phoible to our Croatian dataset. For Croatian, we use phonemizer.

With our folding dictionary, we get close to inventory 1139 (the only inventory for Croatian).

The unseen phonemes are mostly long vowels, which our tool fails to produce (**this might be an issue**). We also don't seem to produce `tɕ` and `dʑ`.

The unknown phonemes are sufficiently rare.
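One way to make "sufficiently rare" concrete is to express each unknown phoneme as a share of all phoneme tokens. The sketch below uses the counts and total printed in the Croatian output below. The 0.01% cut-off is a hypothetical threshold, and the total may include `WORD_BOUNDARY` tokens, so the shares are approximate.

```python
def relative_frequency(unknown_counts, total):
    """Return each unknown phoneme's share of all phoneme tokens, in percent."""
    return {p: 100 * c / total for p, c in unknown_counts.items()}

# Values copied from the Croatian output below.
unknown = {"y": 39, "q": 25, "w": 21}
total = 834971

# Hypothetical cut-off: treat anything under 0.01% of tokens as rare.
RARE_THRESHOLD_PERCENT = 0.01

shares = relative_frequency(unknown, total)
for phoneme, share in shares.items():
    label = "rare" if share < RARE_THRESHOLD_PERCENT else "check"
    print(f"{phoneme}: {share:.4f}% ({label})")
```

With these numbers, all three unknown phonemes fall under the assumed threshold.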

In [22]:
compare_phoneme_sets('Croatian', 'hrv', ['Croatian'], 1139)

Total phonemes in the dataset: 834971
Unique phonemes: ['m', 'i', 'ʃ', 'k', 'o', 'WORD_BOUNDARY', 'n', 'e', 't̪', 'd̪', 'r', 'a', 'p', 'u', 's', 'ʋ', 'j', 't̠ʃ', 'l', 'ɡ', 'x', 'ʒ', 'b', 't̪s', 'z', 'd̠ʒ', 'f', 'ɲ', 'ʎ', 'y', 'q', 'w']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the hrv language code:
Unknown phonemes: {'y': 39, 'q': 25, 'w': 21}
Unseen phonemes: {'iː', 'uː', 'aː', 'tɕ', 'eː', 'oː', 'ie', 'dʑ'}

Phoible inventories names with the hrv language code: ['Croatian']
Phoible dialects names with the hrv language code: [nan]
----------------------------------------------------------------------------------------------------
IN

# Welsh analysis

We compare the Welsh inventory in phoible to our Welsh dataset. For Welsh, we use phonemizer.

With our folding dictionary, we get close to inventory 2406 (the only inventory for Welsh).

Unseen phonemes:
* `m̥`, `r̥`, `ŋ̥` and `n̥` - voiceless consonants which are not included in some interpretations of Welsh phonology.
* `tʰ`, `pʰ` and `kʰ` - aspirated stops, not always included in Welsh phonology.
* `t̠ʃ` - a rare phoneme mostly found in loan words.
* `əu` - a diphthong that should be produced, unclear why it is not here.

Unknown phonemes:
* `d`, `b`, `g` - voiced stops. These should definitely be listed; it is not clear why the inventory does not include them.
* `ø` and `ʌ` - vowels not listed in the inventory; it's not obvious what they should be mapped to. They are both produced when there is a `y` in the orthography and should be different vowels in different contexts, so they can't be easily mapped. In most cases, they should be `iː` or `ɪ`. **This may be an issue**

In [23]:
compare_phoneme_sets('Welsh', 'cym', ['Welsh'], 2406)

Total phonemes in the dataset: 860863
Unique phonemes: ['d', 'eː', 'χ', 'r', 'ai', 'WORD_BOUNDARY', 'ɛ', 't', 'ɔ', 'w', 'a', 'n', 'j', 'au', 'ə', 'ɔi', 'ð', 'oː', 'iː', 'ɪ', 'b', 's', 'ø', 'ɡ', 'ʊi', 'h', 'ʊ', 'm', 'əi', 'θ', 'l', 'v', 'ʌ', 'k', 'ɬ', 'p', 'f', 'ɪu', 'uː', 'ʃ', 'ɑː', 'ɛu', 'ŋ', 'd̠ʒ', 'z']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the cym language code:
Unknown phonemes: {'d': 42792, 'b': 19346, 'ø': 17282, 'ɡ': 13860, 'ʌ': 11316}
Unseen phonemes: {'pʰ', 'tʰ', 'ŋ̥', 't̠ʃ', 'n̥', 'm̥', 'kʰ', 'r̥', 'əu'}

Phoible inventories names with the cym language code: ['Welsh']
Phoible dialects names with the cym language code: [

# Icelandic analysis

We compare the Icelandic inventories in phoible to our Icelandic dataset. For Icelandic, we use phonemizer.

With our folding dictionary, we get close to inventory 2568.

Unknown phonemes:
* `aː`, `ɛː`, `ɔː`, `uː`, `iː`, `ɪː`, `ʏː` and `œː` - long vowels that should maybe be shortened, but most of them do exist in the other inventory.
* `ə` - not in the other inventory, unclear if it should be here
* `ɡ` - not in either inventory but listed on Wikipedia. Could be mapped to `kʰ`?
* `z` and `ʃ` - sufficiently rare

Generally, it does not seem to be a great mapping.

In [24]:
compare_phoneme_sets('Icelandic', 'isl', ['Icelandic'], 2568)

Total phonemes in the dataset: 761780
Unique phonemes: ['aː', 'r̥', 'ɪ', 'WORD_BOUNDARY', 'ɛ', 't̪ʰ', 's̺', 'j', 'ä', 'pʰ', 'iː', 'k', 'ʋ', 'ɛː', 'r', 'ei̯', 'θ̻', 'i', 'l', 'n̪', 'uː', 'ð̺̞', 'ɡ', 'ɔ', 'n̪̥', 't̪', 'äu̯', 'ŋ̥', 'h', 'ʏ', 'm', 'f', 'ɔː', 'x', 'cʰ', 'ou̯', 'p', 'ŋ', 'äi̯', 'ɰ', 'ʏː', 'u', 'ɪː', 'œ', 'ç', 'ə', 'øɪ̯', 'ɬ', 'c', 'ɲ', 'm̥', 'œː', 'ɔi̯', 'ɲ̥', 'z', 'ʏi̯', 'ʃ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the isl language code:
Unknown phonemes: {'ɡ': 9860, 'ɪː': 4944, 'ə': 3449, 'ʏː': 826, 'z': 6, 'ʃ': 2}
Unseen phonemes: {'n', 'ø', 'tʰ', 't', 's', 'n̥', 'ŋˑ', 'a', 'θ', 'e', 'kʰ', 'j̥', 'eː', 'øː', 'l̥', '

# Danish analysis

We compare the Danish inventories in phoible to our Danish dataset. For Danish, we use phonemizer.

The output of phonemizer differs substantially from phoible and Wikipedia. With our folding dictionary we get close to inventory 2265, but there are some issues.

Unknown phonemes:
* `r` - phonemizer was producing `ɐ̯`, which is a correct phonetic realisation of `/r/`, but not listed in phoible, so we map it to the phoneme `/r/`.
* `d`, `g`, `b`, `ŋ`, `ɕ`, `ð`, `w` - consonants not listed in the other inventory (unclear why) but listed on Wikipedia.
* `eˤ`, `ɑˤ`, `iˤ`, `uˤ`, `oˤ`, `aˤ` - vowels with the Danish stød, which phonemizer seems to produce
* `ɒ`, `ɑ`, `ɒː`, `uː`, `œː`, `ɜ`, `ɔː`, `oː` - vowels not listed in this inventory, but listed in the other inventory or on Wikipedia
* Remaining phonemes are rare or from loan words

Unseen phonemes:
* `kʰ`, `pʰ` and `tsʰ` - aspirated consonants which the tool doesn't seem to produce
* `ø` - a vowel listed in this inventory that doesn't seem to be produced

Generally, it does not seem to be a great mapping.

In [25]:
compare_phoneme_sets('Danish', 'dan', ['Danish'], 2265)

Total phonemes in the dataset: 625565
Unique phonemes: ['d', 'e', 'WORD_BOUNDARY', 'ɛ', 'r', 'eˤ', 'n', 'm', 'uˤ', 's', 't', 'ɑˤː', 'k', 'j', 'f', 'ɑ', 'ɒ', 'ə', 'ʋ', 'a', 'l', 'iˤ', 'h', 'b', 'ʁ', 'ɔˤ', 'p', 'œ', 'i', 'ɡ', 'ɔ', 'u', 'ɕ', 'w', 'ð', 'oˤ', 'o', 'y', 'ŋ', 'aˤ', 'ɒː', 'œː', 't̠ʃ', 'aː', 'd̠ʒ', 'uː', 'ɔː', 'ɜ', 'oː', 'θ', 'eː', 'ʔ', 'yː', 'iː', 'ɛː']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the dan language code:
Unknown phonemes: {'d': 31926, 'eˤ': 19248, 'ɔˤ': 15181, 'ð': 8860, 'ɑˤː': 6425, 'iˤ': 6214, 'ɡ': 5285, 'uˤ': 4971, 'b': 4365, 'oˤ': 3231, 'aˤ': 460, 'ɒː': 325, 'uː': 96, 'œː': 67, 'ɜ': 47, 'ɔː': 37, 'd̠ʒ': 34, 

# Norwegian analysis

We compare the Norwegian inventories in phoible to our Norwegian dataset. For Norwegian, we use phonemizer.

With our folding dictionary we get close to inventory 499.

Unknown phonemes:
* `ɑ` - not listed in the inventory, but listed in other inventories and on Wikipedia.
* `kː`, `pː`, `ɡː`, `bː`, `tː`, `dː` - long consonants produced by phonemizer. Consonant length is not listed in phoible but is a feature of Norwegian (although perhaps not for `k` and `g`)
* `ʂ` - present in the other inventories and a valid consonant for Norwegian
* Remaining phonemes are rare or from loan words

Unseen phonemes:
* `æi` - diphthong that doesn't seem to be produced by our tool
* `ɳ`, `ɭ`, `ʈʰ` and `ɖ` - retroflex allophones of consonants we already have

In [9]:
compare_phoneme_sets('Norwegian', 'nob', ['Norwegian', 'NORWEGIAN'], 499)

Total phonemes in the dataset: 531777
Unique phonemes: ['t̪ʰ', 'ɑ', 'kː', 'WORD_BOUNDARY', 'ʋ', 'a', 'ɾ', 'ʃ', 'o̞ː', 'ɡ', 'uː', 'd̪', 'h', 'ʉː', 's', 'e̞', 'tː', 'eː', 'n̪', 'pː', 'ə', 'l', 'ɪ', 'b', 'iː', 'æ', 'j', 'kʰ', 'ʉ', 'ɒ̝', 'm', 'f', 'yː', 'ai', 'pʰ', 'øy', 'ŋ', 'ø̞ː', 'dː', 'œ', 'bː', 'ç', 'æː', 'ɑː', 'ʏ', 'æʉ', 'ʊ', 'ɡː', 'ɔy', 'ʂ', 'w', 'z']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the nob language code:
Unknown phonemes: {'ɑ': 27044, 'tː': 3493, 'kː': 3364, 'pː': 1733, 'ɡː': 538, 'dː': 472, 'ʂ': 266, 'bː': 89, 'w': 13, 'z': 2}
Unseen phonemes: {'˥˧˥', 'ɖ', 'ɳ', 'ɭ̩', 'æi', 'l̪', 'aː', 'ɭ', 'ʔ', '˩˨', 'm̥', 'ɹ̩', 'ʈʰ', 

In [8]:
find_words_with_phoneme('../processed/Norwegian/train.csv', 'ʉɪ')

25547it [00:00, 2896544.87it/s]


{'juice/j ʉɪ kʰ a': 5,
 'eplejuice/e̞ pʰ l eː j ʉɪ kʰ a': 2,
 'appelsinjuice/ɑ pː ə l s ɪ n̪ j ʉɪ kʰ a': 2,
 'juicekartong/j ʉɪ kʰ eː kʰ ɑ ɾ t̪ʰ ɒ̝ ŋ': 1,
 'juicen/j ʉɪ kʰ ə n̪': 1}

# Basque analysis

We compare the Basque inventories in phoible to our Basque dataset. For Basque, we use phonemizer.

The best inventory to match with the output of phonemizer seems to be 2161, since it is the only Basque inventory that contains diphthongs (which our tool produces and which exist in Basque). With our folding dictionary, we get a close match.

Unknown phonemes:
* `β`, `ð` and `ɣ` - the lenited allophones of `b`, `d` and `ɡ`, which should be listed in the inventory
* `ɟ` - a valid consonant listed on Wikipedia and in other inventories
* `θ` - rare, produced in Spanish loan words

Unseen phonemes:
* `˧˩` and `˧˥` - Pitch/tone is used to mark stress in some dialects of Basque, but our tool does not produce these


In [27]:
compare_phoneme_sets('Basque', 'eus', ['Basque', 'Zuberoan Basque', 'BASQUE'], 2161)

Total phonemes in the dataset: 516010
Unique phonemes: ['oi̯', 'WORD_BOUNDARY', 'a', 'ɾ', 'k', 't̠ʃ', 'i', 's̺', 'l', 'p', 'o', 'r', 'ai̯', 'n', 'm', 'ð', 'e', 't̪̻s̪̻', 'β', 's̪̻', 'ʎ', 'b', 'au̯', 't̪', 'ɣ', 'ɡ', 'c', 'u', 'ei̯', 'd̪', 't̺s̺', 'j', 'ɲ', 'f', 'ʃ', 'ɟ', 'eu̯', 'θ', 'x']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the eus language code:
Unknown phonemes: {'β': 11559, 'ð': 11008, 'ɣ': 7534, 'θ': 70}
Unseen phonemes: {'pʰ', 'ĩ', 'iː', 'ʂ', 'ts̻', 's̻', 'ỹ', 'd', '˧˥', 'h̃', 'd̠ʒ', 'tʰ', 'uː', 'ɾ̺', 'ʒ', 't̻s̻', 'χ', 'ɪ', 't̪|t', 't̪s̪|ts', 'ũ', 'äː', 'ä', 'dz̻', 'n̪|n', 'o̞ː', '˧˩', 'õ̞', 'z̺', 'ɾ̪|ɾ', 'l̪|l', 'i̯',

# Hungarian analysis

We compare the Hungarian inventories in phoible to our Hungarian dataset. For Hungarian, we use epitran.

With our folding dictionary, we get a good match with inventory 2191.

Unknown phonemes:
* `ŋ` - caused by nasal place assimilation, not always included in inventories but is valid

Unseen phonemes:
* `d̻z̻ː`, `ʒː`, `d̠ʒː` - long consonants that do not seem to be produced by our tool 

In [28]:
compare_phoneme_sets('Hungarian', 'hun', ['HUNGARIAN', 'Hungarian'], 2191)

Total phonemes in the dataset: 538043
Unique phonemes: ['m', 'ɛ', 'n̪', 'j', 'y', 'k', 'WORD_BOUNDARY', 'ɑ', 'r̪', 'aː', 'd̪', 'i', 'o', 'h', 'z̻', 'v', 'l̪', 'eː', 'oː', 'ʃ', 's̻', 't̪', 't̠ʃ', 'ɟʝ', 's̻ː', 'p', 'b', 'u', 'z̻ː', 'ɡ', 't̪ː', 'l̪ː', 'f', 'ɟʝː', 'uː', 'øː', 'ø', 'n̪ː', 'ɲ', 'iː', 't̻s̻', 'mː', 'r̪ː', 'ʃː', 'kː', 'ŋ', 't̠ʃː', 'jː', 'cç', 'ɡː', 'ɲː', 'bː', 'd̪ː', 'pː', 'ʒ', 'vː', 't̻s̻ː', 'cçː', 'fː', 'hː', 'd̻z̻', 'yː', 'd̠ʒ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the hun language code:
Unknown phonemes: {'ŋ': 241}
Unseen phonemes: {'l', 'dzː', 'ʒː', 'ɾ', 'r', 'æ', 'd', 'c', 'ts', 'rː', 'lː', 'n', 'dː', 'zː', 'd̻z̻ː

# Romanian analysis

We compare the Romanian inventories in phoible to our Romanian dataset. For Romanian, we use phonemizer.

With our folding dictionary, we get a close match with inventory 2443 (the other inventories do not include palatalised consonants, which phonemizer produces).

The main difference between the output of our tool and inventory 2443 is diphthongs. On one view, Romanian has only two phonemic diphthongs, /oa/ and /ea/; the traditional view, however, is that there are many diphthongs, which phonemizer seems to produce. We could split the diphthongs into vowel-glide sequences, but we leave them for now.

Unknown phonemes:
* `j`, `w` - valid approximant consonants that are listed in other inventories and Wikipedia but not this inventory
* `aɪ`, `uɪ`, `aʊ`, `eɪ`, `iɪ`, `əɪ`, `eʊ`, `iʊ`, `oʊ`, `əʊ`, `eo`, `oɪ`, `yɪ`, `äʊ` - valid Romanian diphthongs that our tool produces but aren't listed in the inventory. Strangely, only the falling diphthongs are produced (not the rising diphthongs).

Unseen phonemes:
* `i̯`, `u̯` - vowels listed in the inventory that are not produced by our tool. It is unclear why these exist in the inventory since we already have `i` and `u`.
* `ç` - not listed in other inventories and not produced by our tool.
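The option mentioned above of splitting diphthongs into vowel-glide sequences could be sketched as follows. The offglide correspondences (`ɪ` to `j`, `ʊ` to `w`) and the function name `split_falling_diphthongs` are assumptions made for illustration, and only a subset of the diphthongs is listed; this step is not applied in the notebook.

```python
# Offglide-to-glide correspondences; this choice is an assumption.
OFFGLIDES = {"ɪ": "j", "ʊ": "w"}

def split_falling_diphthongs(utterance, diphthongs):
    """Rewrite each listed falling diphthong as nucleus + glide."""
    out = []
    for phoneme in utterance.split():
        if phoneme in diphthongs and phoneme[-1] in OFFGLIDES:
            out.extend([phoneme[:-1], OFFGLIDES[phoneme[-1]]])
        else:
            out.append(phoneme)
    return " ".join(out)

# A subset of the falling diphthongs our tool produces.
diphthongs = {"aɪ", "uɪ", "aʊ", "eɪ", "eʊ", "oʊ"}
print(split_falling_diphthongs("m aɪ WORD_BOUNDARY n oʊ", diphthongs))
```

A diphthong such as `eo` does not end in one of the assumed offglides, so it would pass through unchanged and need a separate decision.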

In [29]:
compare_phoneme_sets('Romanian', 'ron', ['Rumanian', 'ROMANIAN', 'Romanian'], 2443)

Total phonemes in the dataset: 336366
Unique phonemes: ['ä', 'ʃ', 'WORD_BOUNDARY', 'm', 'e̞', 'h', 'd̪', 's̪', 'f', 't̠ʃ', 'l', 'ə', 'p', 'i', 'k', 'ɨ', 'n̪', 'uɪ', 't̪', 't̠ʃʲ', 'w', 'u', 't̪s̪', 'aɪ', 'ɾ̪', 'o̞', 'j', 'b', 'v', 'eɪ', 'o̯ä', 'tsʲ', 'ɡ', 'z̪', 'zʲ', 'e̯ä', 'iɪ', 'ʃʲ', 'aʊ', 'ʒ', 'əɪ', 'd̠ʒ', 'eʊ', 'nʲ', 'tʲ', 'iʊ', 'mʲ', 'ɾʲ', 'bʲ', 'kʲ', 'eo', 'd̠ʒʲ', 'dʲ', 'pʲ', 'əʊ', 'oʊ', 'sʲ', 'lʲ', 'fʲ', 'vʲ', 'ʒʲ', 'yɪ', 'oɪ', 'ɡʲ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: {'yɪ': 4}
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the ron language code:
Unknown phonemes: {'aɪ': 3590, 'uɪ': 1309, 'aʊ': 814, 'eɪ': 724, 'iɪ': 713, 'əɪ': 377, 'eʊ': 373, 'iʊ': 267, 'oʊ': 174, 'əʊ'

In [35]:
find_words_with_phoneme('../CHILDES-dataset/Romanian/train.csv', 'yɪ')

21550it [00:00, 3563182.53it/s]


{'întâi/ɨ n̪ t̪ yɪ': 4}

# Brazilian Portuguese analysis

We compare the Brazilian Portuguese inventories in phoible to our Brazilian Portuguese dataset. For Brazilian Portuguese, we use phonemizer.

With our folding dictionary, we get a close match with inventory 2207.

Unknown phonemes:
* `ɐ`, `õ`, `ɐ̃`, `ĩ`, `ũ` - vowels (mostly nasal) that exist in Brazilian Portuguese but are not listed in the inventory (they are listed in others)
* `w` - consonant that exists in Brazilian Portuguese but isn't listed in this inventory (it is listed in others)
* `d̠ʒ` - palatalised /d/ that isn't listed in the inventory but is valid in Brazilian Portuguese

Unseen phonemes:
* `õɪ̯̃`, `ɔʊ̯`, `uʊ̯`, `ɐ̃ɪ̯̃`, `ũɪ̯̃`, `ʊ̯aɪ̯`, `ʊ̯ɐ̃` - diphthongs that don't seem to be produced by our tool
* `ʎ` - consonant that doesn't seem to be produced by our tool

In [30]:
compare_phoneme_sets('../processed/PortugueseBr/train.csv', 'por', ['Portuguese', 'Portuguese (Brazilian)'], 2207)

Total phonemes in the dataset: 80855
Unique phonemes: ['k', 'i', 'WORD_BOUNDARY', 'm', 'aɪ̯', 's̪', 't̪', 'ẽɪ̯̃', 'a', 'n̪', 'ɛ', 'ɐ', 'z', 'ɡ', 'r', 'v', 'u', 'ɾ', 'd̪', 'oɪ̯', 'ɲ', 'e', 'ɐ̃ʊ̯̃', 'f', 'õ', 'p', 'ʒ', 'd̠ʒ', 'eʊ̯', 'aʊ̯', 'o', 'ũ', 'ĩ', 'ɐ̃', 'oʊ̯', 'l', 'b', 'ɣ', 'ɔ', 'ʃ', 'w', 'eɪ̯', 'iʊ̯', 'ɛʊ̯', 'ɔɪ̯', 'uɪ̯', 'ɛɪ̯']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the por language code:
Unknown phonemes: {'d̠ʒ': 630}
Unseen phonemes: {'õi', 'iu̜', 's̻', 'eu̜', 'j', 'ŭ', 'ʊ̯ɐ̃', 'au̜', 'n', 'u̯', 'x', 'ʊ̯aɪ̯', 'ũi', 'ĕ', 'ʎ', 'ai', 'ɛi', 'ɯ', 'ə', 'ɐi', 'uʊ̯', 'ũɪ̯̃', 'ɛu̜', 'õɪ̯̃', 'ä', 'ɔʊ̯', 'ɐ̃i', 'ĩ̯', 'i

# Irish analysis

We compare the Irish inventories in phoible to our Irish dataset. For Irish, we use phonemizer.

With a substantial folding dictionary, we get close to inventory 2521. 

Unknown phonemes:
* `w`, `j` - voiced approximants that are listed in other inventories for Irish but not this one.
* The rest are sufficiently rare.

Unseen phonemes:
* `ŋ̟`, `ʝ`, `ɟ` - consonants that our tool doesn't seem to produce
* `ɔʊ`, `uːə`, `ia` - diphthongs that our tool doesn't seem to produce (most of the others are produced)



In [7]:
compare_phoneme_sets('Irish', 'gle', ['Irish', 'Irish Gaelic', 'IRISH'], 2521)

Total phonemes in the dataset: 167003
Unique phonemes: ['a', 'ɡ', 'ə', 'sˠ', 'WORD_BOUNDARY', 'n̪ˠ', 'w', 'ɪ', 'l̪ˠ', 'oː', 'ɛ̝', 'ɑː', 'i̞', 'ʒ', 'bˠ', 'mˠ', 'ʃ', 'cʰ', 'x', 't̪ˠʰ', 'ʊ', 'iː', 'ɾ̪ˠ', 'eː', 'h', 'ɾ̪ʲ', 'fˠ', 'ɔ̝', 'd̪ˠ', 'kʰ', 'l̪ʲ', 'd̪ʲ', 'vˠ', 'n̪ʲ', 'pˠʰ', 'd̠ʒ', 'uː', 'ŋ', 'ɣ', 'j', 'ɐɪ', 't̪ʲʰ', 'iːə', 'ç', 'pʲʰ', 'uːe', 'mʲ', 'bʲ', 'ɐ', 'fʲ', 'tʲ', 'xʲ', 'z', 'e', 'vʲ', 'χ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the gle language code:
Unknown phonemes: {'z': 22, 'ʒ': 17, 'ɐ': 13, 'tʲ': 3, 'xʲ': 1, 'χ': 1}
Unseen phonemes: {'ʎ', 'd', 'ẽː', 'õ̞', 'ɾ̥ʲ', 'ɸʷˠ', 'ua', 'lˠ', 'əi', 'əu', 'ɔː', 'n', 'e̞', 'ʝ', 

# Turkish analysis

We compare the Turkish inventories in phoible to our Turkish dataset. For Turkish, we use phonemizer.

With our folding dictionary, we get close to inventory 2217. 

Unknown phonemes are all sufficiently rare.

The only unseen phoneme is `ɣ`, which is not always included in inventories.

In [32]:
compare_phoneme_sets('Turkish', 'tur', ['Turkish', 'TURKISH'], 2217)

Total phonemes in the dataset: 90808
Unique phonemes: ['s̪', 'e', 'n̪', 'WORD_BOUNDARY', 'o', 'j', 'u', 'a', 'ɾ', 'm', 'ɯ', 'k', 'i', 'lʲ', 'iː', 'v', 'd̪', 'd̠ʒ', 'y', 't̪', 'b', 'z̪', 'ʃ', 'ɟ', 'eː', 'p', 'ɡ', 'l̪ˠ', 'h', 't̠ʃ', 'f', 'œ', 'aː', 'c', 'uː', 'ʒ', 'w']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the tur language code:
Unknown phonemes: {'w': 2}
Unseen phonemes: {'l', 'pʰ', 'd̻', 'r', 's̻', 'ø̞', 'æ', 'd', 'ʔ', 'cʰ', 'ɣ', 'n', 'tʰ', 'l̻', 'ɛ', 'ʎ', 't̻ʰ', 'ɛː', 'œː', 'ɪ', 'ɯ̞', 'ʏ', 'äː', 'ä', 'o̞ː', 'ʊ', 'ɫ̻', 'ɯː', 'yː', 't̠ʃʰ', 'ɾ̻', 'n̻', 'z', 'l̪|l', 's', 'kʰ', 'e̞ː', 'o̞', 'e̞', 'z̻', 'ʋ'}

Phoible inventories nam

# Quechua analysis

We compare the Quechua inventories in phoible to our Quechua dataset. For Quechua, we use phonemizer.

There are many Quechuan languages and many phoible inventories. With our folding dictionary, we get close to inventory 104, but there is still a lot of disagreement. 

Unknown phonemes:
* `d`, `g`, `f` - consonants that aren't listed in this inventory but are in others
* `aː` - a vowel not in this inventory but listed in others
* Remaining are sufficiently rare

There are several phonemes in inventory 104 that are not seen in our data, which could be an issue.
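When a language has many candidate inventories, one rough way to pick a target is to rank them by overlap with the phonemes the tool produces. The sketch below uses Jaccard similarity over plain sets. The function name `rank_inventories`, the inventory IDs and the phoneme sets are all invented for illustration, and this is not necessarily how inventory 104 was chosen.

```python
def rank_inventories(produced, inventories):
    """Sort inventory IDs by Jaccard similarity with the produced phoneme set."""
    scores = {}
    for inv_id, inv_phonemes in inventories.items():
        union = produced | inv_phonemes
        scores[inv_id] = len(produced & inv_phonemes) / len(union) if union else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data: the IDs and phoneme sets are invented for illustration.
produced = {"p", "t", "k", "a", "ɪ", "ʊ"}
inventories = {
    "A": {"p", "t", "k", "a", "i", "u"},
    "B": {"p", "t", "k", "q", "a", "ɪ", "ʊ"},
}
print(rank_inventories(produced, inventories))
```

Jaccard similarity penalises both unknown and unseen phonemes, so a high score means few disagreements in either direction.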

In [33]:
compare_phoneme_sets('Quechua', 'quh', ['Quechua', 'QUECHUA', 'Huallaga (Huanuco) Quechua', 'Ancash Quechua', 'Cajamarca Quechua', 'Cuzco-Collao Quechua', 'Ferreñafe Quechua', 'Imbabura Quichua', 'Jauja-Huanca Quechua', 'North Junín Quechua', 'Ayacucho Quechua', 'Bolivian Quechua', 'Huallaga Huánuco Quechua', 'Huaylas-Conchucos Quechua', 'San Martin Quechua', 'Yaru Quechua', 'Santiago del Estero Quechua', 'Salasca Quechua', 'Tena Quechua' ], 104)

Total phonemes in the dataset: 68488
Unique phonemes: ['t', 's', 'aː', 'h', 'ʊ', 'ʎ', 'WORD_BOUNDARY', 'j', 'a', 'm', 'k', 'ɪ', 'n', 'r', 'p', 'tʼ', 'ʔ', 'q', 't̠ʃ', 'f', 'ɛ', 'β', 'w', 'ɡ', 'kʼ', 'l', 'pʼ', 't̠ʃʼ', 'qʼ', 'd', 'ɪː', 'ʊː', 'ɔ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the quh language code:
Unknown phonemes: {'d': 266, 'ɡ': 137, 'aː': 128, 'f': 42, 'ɪː': 31, 'ʊː': 1}
Unseen phonemes: {'pʰ', 'ɾ', 'ʃ', 'ɣ', 't̪ʰ|tʰ', 'tʰ', 't̪|t', 'ɲ', 'n̪|n', 't̠ʃʰ', 'ɾ̪|ɾ', 'l̪|l', 'ð', 'kʰ', 'ɸ', 'r̪|r', 't̪ʼ|tʼ', 'qʰ', 's̪|s'}

Phoible inventories names with the quh language code: ['Quechua', 'QUECHUA']
Phoible dialects names with 

# Farsi analysis

We compare the Farsi inventory in phoible to our Farsi dataset. For Farsi, we use phonemizer.

With our folding dictionary, we get a very good match with inventory 516. There are no unknown phonemes, just the unseen phonemes `χ`, `ʒ` and `ʔ`.

In [34]:
compare_phoneme_sets('Farsi', 'pes', ['FARSI' ], 516)

Total phonemes in the dataset: 32177
Unique phonemes: ['kʰ', 's', 'o', 'b', 'WORD_BOUNDARY', 'a̟', 'h', 'n̪', 't̠ʃ', 'i', 'j', 'd̪', 'e', 'ʃ', 'u', 'ɡ', 'r', 'f', 't̪ʰ', 'm', 'd̠ʒ', 'l', 'ɢ', 'v', 'z', 'pʰ', 'w', 'ɑ']

----------------------------------------------------------------------------------------------------
ALL PHOIBLE:
Comparing the phonemes seen in the data with all phoible phonemes:
Unknown phonemes: None
----------------------------------------------------------------------------------------------------
LANGUAGE CODE:

Comparing the phonemes seen in the data with just the phonemes in phoible with the pes language code:
Unknown phonemes: None
Unseen phonemes: {'iː', 'æ', 'd', 'ʔ', 'ɒː', 'ɣ', 'n', 'tʰ', 'x', 'uː', 'ʒ', 'χ', 'l̪', 'r̪', 'ɒ', 't̠ʃʰ', 'ɢʁ', 'o̞', 'e̞'}

Phoible inventories names with the pes language code: ['Persian', 'FARSI', 'Farsi']
Phoible dialects names with the pes language code: [nan]
--------------------------------------------------------------------