# Phoneme Analysis

This notebook compares the phonemes output by our phonemization with the phonemes stored in [phoible](https://phoible.org/). The comparison is used to improve the phonemization: we add replacement dictionaries that make the resulting phoneme set for each language more coherent.

In [2]:
import pandas as pd

from datasets import load_dataset
from tqdm import tqdm

phoible = pd.read_csv('../data/phoible.csv')
phonemes = phoible.Phoneme.unique()



In [3]:
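def find_unseen_phonemes(counts, phonemes):
    """Compare the phonemes counted in the data with a reference list.

    Returns (unseen, unknown): `unseen` is the set of reference phonemes that never
    occur in `counts`; `unknown` maps each phoneme in `counts` that is missing from
    the reference list to its count.
    """
    phonemes = set(phonemes)  # set membership is much faster than scanning an array
    unseen = {phoneme for phoneme in phonemes if phoneme not in counts}
    unknown = {}
    for phoneme, count in counts.items():
        # WORD_BOUNDARY is a separator token, not a phoneme
        if phoneme not in phonemes and phoneme != 'WORD_BOUNDARY':
            unknown[phoneme] = count
    return unseen, unknown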

def find_phoneme(dataset_section, phoneme, count=1):
    """ Print the first `count` lines in which phoneme appears, together with their gloss """
    dataset = load_dataset('../childes/CHILDES-dataset', dataset_section, split='train')
    phonemized = list(dataset['phonemized_utterance'])
    glosses = list(dataset['gloss'])
    for i, line in enumerate(phonemized):
        if count <= 0:
            break
        # Match whole tokens so that e.g. 'a' does not also match 'aː' or 'aɪ'
        if phoneme in line.split():
            print(f'Line {i}: {glosses[i]} - {line}')
            count -= 1

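def find_words_with_phoneme(dataset_section, phoneme):
    """ Count the word/transcription pairs whose transcription contains phoneme, most frequent first """
    words_counts = {}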
    dataset = load_dataset('../childes/CHILDES-dataset', dataset_section, split='train')
    phonemized = list(dataset['phonemized_utterance'])
    glosses = list(dataset['gloss'])
    for i, line in tqdm(enumerate(phonemized)):
        if phoneme in line:
            words = glosses[i].split()
            phonemic_words = line.split(' WORD_BOUNDARY ')
            for j, word in enumerate(phonemic_words):
                if phoneme in word.split():
                    if j >= len(words):
                        continue
                    entry = words[j] + '/' + word.replace(' WORD_BOUNDARY', '')
                    if entry not in words_counts:
                        words_counts[entry] = 0
                    words_counts[entry] += 1
    # Sort by counts
    words_counts = {k: v for k, v in sorted(words_counts.items(), key=lambda item: item[1], reverse=True)}
    return words_counts

def compare_phoneme_sets(dataset_section, iso_code, inventories):
    """ Compares phonemes in a specific section of the dataset (a specific language) with the phonemes in phoible.
    
    Three levels of comparison are made:
    1. Phonemes in the dataset not seen in any parts of phoible
    2. Phonemes in the dataset not seen in any phoible language with a specific language code (e.g. 'eng' for English)
    3. Phonemes in the dataset not seen in the specific inventories of phoible (e.g. ['American English', 'English (American)'] for English)
    """
    dataset = load_dataset('../childes/CHILDES-dataset', dataset_section, split='train')

    token_counts = {}
    for line in dataset['phonemized_utterance']:
        tokens = line.strip().split()
        for token in tokens:
            if token not in token_counts:
                token_counts[token] = 0
            token_counts[token] += 1

    print(f'Total phonemes in the dataset: {sum(token_counts.values())}')
    print(f'Unique phonemes: {list(token_counts.keys())}\n')

    _, unknown = find_unseen_phonemes(token_counts, phonemes)
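    print('Comparing the phonemes seen in the data with all phoible phonemes:')
    # NOTE: the label below says 'Unseen', but the value printed is `unknown`:
    # phonemes that occur in the data and appear nowhere in phoible
    print(f'Unseen phonemes:')
    print(unknown)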


    language_phonemes = phoible[phoible.ISO6393 == iso_code].Phoneme.unique()
    language_unseen, language_unknown = find_unseen_phonemes(token_counts, language_phonemes)
    print(f'\nComparing the phonemes seen in the data with just the phonemes in phoible with the {iso_code} language code:')
    print(f'Unknown phonemes:')
    print(language_unknown)
    print(f'Unseen phonemes:')
    print(language_unseen)
    
    possible_language_names = phoible[phoible.ISO6393 == iso_code].LanguageName.unique()
    print(f'\nPhoible inventories names with the {iso_code} language code:')
    print(possible_language_names)

    possible_dialects = phoible[phoible.ISO6393 == iso_code].SpecificDialect.unique()
    print(f'\nPhoible dialects names with the {iso_code} language code:')
    print(possible_dialects)

    all_inventory_phonemes = set()
    for inventory in inventories:
        dialect_phonemes = phoible[phoible.LanguageName == inventory].Phoneme.unique()
        all_inventory_phonemes.update(dialect_phonemes)

    if all_inventory_phonemes == set(language_phonemes):
        print(f'\nAll phonemes in the {iso_code} language code match with the specific inventories, no need for further comparison\n')
        return

    dialect_unseen, dialect_unknown = find_unseen_phonemes(token_counts, all_inventory_phonemes)
    print(f'\nNow comparing the phonemes seen in the data with the phonemes in phoible with the specific inventories {inventories}:')
    print(f'Unknown phonemes:')
    print(dialect_unknown)
    print(f'Unseen phonemes:')
    print(dialect_unseen)
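
The distinction between "unseen" and "unknown" phonemes used throughout this notebook can be illustrated without the dataset or phoible. The sketch below is a toy example under stated assumptions: the utterances and the inventory are invented, and `compare_sets` is a hypothetical stand-in for `find_unseen_phonemes`, not the function used for the results below.

```python
from collections import Counter

def compare_sets(counts, inventory):
    """Return (unseen, unknown) for token counts against a reference inventory."""
    inventory = set(inventory)
    # Listed in the inventory but never produced in the data
    unseen = inventory - set(counts)
    # Produced in the data but absent from the inventory, with frequencies
    unknown = {p: c for p, c in counts.items()
               if p not in inventory and p != 'WORD_BOUNDARY'}
    return unseen, unknown

# Toy phonemized utterances (invented for illustration)
utterances = [
    'k æ t WORD_BOUNDARY',
    'b ʌ ɾ ə WORD_BOUNDARY',
]
counts = Counter(tok for line in utterances for tok in line.split())

# Toy reference inventory (not a real phoible inventory)
inventory = ['k', 'æ', 't', 'b', 'ʌ', 'ə', 'kʰ']

unseen, unknown = compare_sets(counts, inventory)
print(unseen)   # phonemes the inventory lists but the data never uses
print(unknown)  # phonemes the data uses but the inventory lacks
```

Here `kʰ` ends up in `unseen` and `ɾ` in `unknown`, which mirrors the kind of mismatch discussed in the language sections.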

# English Analysis

The English section of our corpus is taken from Eng-NA in CHILDES, so we compare it to all English inventories in phoible and, more specifically, to the two American English inventories.

When comparing to all English inventories, we see only two frequent unknown phonemes, `ɾ` and `n̩`. The first is an [alternative](https://english.stackexchange.com/questions/549727/whats-the-difference-between-t%CC%AC-and-%C9%BE-in-american-english) to `t̬` and is marked as an allophone of several consonants in the English inventories of phoible. The second is a syllabic `n`, heard in words like `button`. It occurs in American English, although it is not listed in the two American English phoible inventories.

When comparing to the two American English phoible inventories, there are many more unknown phonemes. Most are long vowels or allophones of phonemes that do appear in these inventories (e.g. `k` instead of `kʰ`), and almost all of them appear in the main English inventory of phoible.
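
One way to check whether an unknown phoneme is only a length or aspiration variant of an inventory phoneme is to compare base forms. This is a minimal sketch, not part of the analysis code: `MODIFIERS`, `base_form` and `explain_unknown` are hypothetical names, the inventory is a toy list, and only two diacritics are handled.

```python
# Hypothetical set of marks to ignore when looking for a base phoneme
MODIFIERS = ('ː', 'ʰ')

def base_form(phoneme):
    """Strip length and aspiration marks, e.g. 'kʰ' -> 'k', 'iː' -> 'i'."""
    for mark in MODIFIERS:
        phoneme = phoneme.replace(mark, '')
    return phoneme

def explain_unknown(unknown, inventory):
    """Map each unknown phoneme to the inventory phonemes sharing its base form."""
    by_base = {}
    for p in inventory:
        by_base.setdefault(base_form(p), []).append(p)
    return {p: by_base.get(base_form(p), []) for p in unknown}

inventory = ['kʰ', 'pʰ', 'tʰ', 'i', 'u', 's']   # toy inventory, not from phoible
unknown = ['k', 'iː', 'uː', 'x']                # toy unknown phonemes
matches = explain_unknown(unknown, inventory)
for phoneme, candidates in matches.items():
    print(phoneme, '->', candidates)
```

An empty candidate list (as for `x` here) flags a phoneme that needs manual inspection rather than a simple variant mapping.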

In [4]:
compare_phoneme_sets('English2', 'eng', ['English (American)', 'American English'])

Generating train split: 1636954 examples [00:08, 182346.23 examples/s]
Generating valid split: 10000 examples [00:00, 160064.11 examples/s]


Total phonemes in the dataset: 29051643
Unique phonemes: ['s', 'iː', 'WORD_BOUNDARY', 'ð', 'ɛ', 'ɹ', 'z', 'ʌ', 'f', 'eɪ', 'w', 'ɪ', 'ɡ', 'l', 'æ', 'ɑ', 'h', 'ə', 'ʊ', 'k', 'p', 'uː', 'b', 'i', 't', 'aɪ', 'θ', 'ŋ', 'j', 'ɔ', 'm', 'ɔɪ', 'n', 'd', 'oʊ', 'aʊ', 'v', 'ɜː', 't̠ʃ', 'd̠ʒ', 'ʃ', 'iə', 'ʒ', 'ɑ̃', 'r', 'x', 'nʲ']

Comparing the phonemes seen in the data with all phoible phonemes:
Unseen phonemes:
{}

Comparing the phonemes seen in the data with just the phonemes in phoible with the eng language code:
Unknown phonemes:
{'ɑ̃': 13, 'nʲ': 8}
Unseen phonemes:
{'aː', 'ʉː', 'ɻ', 'kx', 'ʍ', 'ts', 'e', 'eə', 'e̞', 'æɔ', 'oe', 'pʰ', 'eɪ̯', 'ɪə', 'æo', 'øː', 'ɒː', 'ɛː', 'iɛ', 'u', 'ɐː', 'iɪ', 'o̞ː', 'ɒɯ', 'ɑː', 'æe', 'ʉə', 'ɚ', 'kʰ', 'ɔː', 'eː', 'oɪ', 'ɐʉ', 'ei', 'ɛʉ', 'ɑe', 'ɚː', 'əʉ', 'əʊ', 'əː', 'ɒ', 'ɐ', 'ɵː', 'ʔ', 'oː', 'ʊə', 'a', 'tʰ', 'æɪ', 'ɘ'}

Phoible inventories names with the eng language code:
['English' 'English (American)' 'American English' 'English (Australian)'
 'English (B

# French Analysis

We compare all French inventories in phoible to our French dataset.

We find a few unknown phonemes, all of them infrequent relative to the size of the dataset. Most are valid allophones according to the French inventory:
* `yː` is an allophone of `y`
* `aː` is an allophone of `a̟`
* `iː` is an allophone of `i`

The remaining two phonemes, `t̠ʃ` and `d̠ʒ`, seem to come from loan words such as `sandwich` and `jazz`, so it seems acceptable to keep them.
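
Tracing a phoneme back to the words that produce it is how loan words like these are identified. The sketch below shows the idea on in-memory data, assuming the same `WORD_BOUNDARY`-separated format as the dataset. The glosses and transcriptions are invented for illustration, and `words_with_phoneme` is a hypothetical, stricter variant of `find_words_with_phoneme`: it skips any utterance whose word counts do not align.

```python
from collections import Counter

def words_with_phoneme(glosses, phonemized, phoneme):
    """Count 'word/phonemes' pairs whose phonemic form contains `phoneme`."""
    counts = Counter()
    for gloss, line in zip(glosses, phonemized):
        words = gloss.split()
        phonemic_words = [w.strip() for w in line.split('WORD_BOUNDARY') if w.strip()]
        # Skip utterances where orthographic and phonemic words do not align
        if len(words) != len(phonemic_words):
            continue
        for word, phonemic in zip(words, phonemic_words):
            # Compare whole tokens, so 'ʃ' does not match the affricate 't̠ʃ'
            if phoneme in phonemic.split():
                counts[f'{word}/{phonemic}'] += 1
    return counts.most_common()

# Invented toy data in the dataset's format
glosses = ['un sandwich', 'du jazz', 'le chat']
phonemized = [
    'œ̃ WORD_BOUNDARY s ɑ̃ d w i t̠ʃ WORD_BOUNDARY',
    'd y WORD_BOUNDARY d̠ʒ a z WORD_BOUNDARY',
    'l ə WORD_BOUNDARY ʃ a WORD_BOUNDARY',
]

affricate_words = words_with_phoneme(glosses, phonemized, 't̠ʃ')
fricative_words = words_with_phoneme(glosses, phonemized, 'ʃ')
print(affricate_words)
print(fricative_words)
```

Matching on whole tokens matters here: a plain substring test for `ʃ` would also return the word containing `t̠ʃ`.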

In [11]:
compare_phoneme_sets('French', 'fra', ['French', 'FRENCH'])

Generating train split: 422133 examples [00:02, 174377.05 examples/s]
Generating valid split: 10000 examples [00:00, 170033.61 examples/s]


Total phonemes in the dataset: 7332327
Unique phonemes: ['t', 'y', 'WORD_BOUNDARY', 'v', 'j', 'ɛ̃', 'w', 'a', 'ʁ', 'd', 'e', 'ʒ', 'm', 'ɔ̃', 'p', 'ɛ', 'f', 'ɔ', 'ɑ̃', 's', 'z', 'l', 'ə', 'b', 'k', 'u', 'o', 'ʃ', 'ɡ', 'i', 'n', 'œ̃', 'ø', 'œ', 'oː', 'yː', 'ɲ', 'aː', 't̠ʃ', 'd̠ʒ', 'iː', 'ŋ']

Comparing the phonemes seen in the data with all phoible phonemes:
Unseen phonemes:
{}

Comparing the phonemes seen in the data with just the phonemes in phoible with the fra language code:
Unknown phonemes:
{'yː': 1393, 'aː': 1335, 't̠ʃ': 29, 'd̠ʒ': 9, 'iː': 3}
Unseen phonemes:
{'z̪', 'n̪', 'l̪', 'ɥ', 'ø̞', 'ɦ', 'æ̃', 'ʀ', 'd̪', 's̪', 'õ', 'ɑ', 'ɒ̃', 'r', 'ɡ̟', 'ɒ', 'k̟', 't̪', 'ɛː', 'a̟'}

Phoible inventories names with the fra language code:
['French' 'FRENCH']

Phoible dialects names with the fra language code:
[nan 'French (Parisian speaker)']

All phonemes in the fra language code match with the specific inventories, no need for further comparison



# German Analysis

We compare all German inventories in phoible to our German dataset.

We find a few unknown phonemes:
* `r` is an allophone of `ʀ`. It seems to occur in Austrian and Swiss Standard German rather than in the Standard German of Germany.
* `ɛɪ` is not listed in the German phoible inventories but is a non-native vowel according to [Wikipedia](https://en.wikipedia.org/wiki/Help:IPA/Standard_German). It seems to be produced mostly in loan words such as "okay".

The remaining phonemes are sufficiently rare and mostly produced by loan words:
* `ɔ̃` is produced in the loan word "pardon"
* `ã` is produced by the loan words "chance" and "restaurant"
* `w` is produced by loan words "Twix" and "twinners". It is also produced by rare words "twart" and "tweikt" and should possibly be `v`.
* `ɔː` is produced by the name "George"
* `i` is produced by the loan word "Region"
* `œ̃` is produced by the loan words "Terrain" and "Mannequin"

Decisions we made in the phonemization code:
* `ɑː` and `ɑ` were being produced very frequently and do not occur in the German phoneme inventories. We replaced them with `aː` and `a`, respectively.
* `ɜ` was being produced very frequently and does not occur in the German phoneme inventories. It was always produced for syllables ending in "er", such as in "aber". `ɐ` seems more standard, so we made this replacement.
* `ɾ` was being produced very frequently and does not occur in the German phoible inventories. It seems to be produced in words such as "wer" and "wir", resulting in `/veːɾ/` and `/viːɾ/`. `/veːɐ/` and `/viːɐ/` seem more standard, so we replace `ɾ` with `ɐ`.
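
The replacements listed above can be expressed as a replacement dictionary applied to each phoneme token. This is a minimal sketch, assuming space-separated tokens as in the dataset; `GERMAN_REPLACEMENTS` and `apply_replacements` are hypothetical names, and the actual phonemization code may apply its dictionaries differently.

```python
# Replacements described in the list above (hypothetical dictionary name)
GERMAN_REPLACEMENTS = {
    'ɑː': 'aː',
    'ɑ': 'a',
    'ɜ': 'ɐ',
    'ɾ': 'ɐ',
}

def apply_replacements(phonemized, replacements):
    """Replace whole phoneme tokens in a space-separated utterance."""
    return ' '.join(replacements.get(tok, tok) for tok in phonemized.split())

# "wer" as described above: /veːɾ/ becomes /veːɐ/
wer = apply_replacements('v eː ɾ WORD_BOUNDARY', GERMAN_REPLACEMENTS)
# "aber", assuming the phonemizer originally produced 'ɑː b ɜ'
aber = apply_replacements('ɑː b ɜ WORD_BOUNDARY', GERMAN_REPLACEMENTS)
print(wer)
print(aber)
```

Working on whole tokens rather than raw substrings keeps the `ɑ` rule from rewriting part of a longer token such as `ɑː` or a diphthong.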

In [12]:
compare_phoneme_sets('German', 'deu', ['German','GERMAN'])

Generating train split: 840888 examples [00:04, 189018.34 examples/s]
Generating valid split: 10000 examples [00:00, 179655.28 examples/s]


Total phonemes in the dataset: 17749318
Unique phonemes: ['h', 'a', 'l', 'oː', 'WORD_BOUNDARY', 'j', 'aː', 'ə', 'm', 'd', 's', 't', 'iː', 'n', 'z', 'ɛ', 'ts', 'ɪ', 'eː', 'ʀ', 'aʊ', 'ɡ', 'ŋ', 'ʊ', 'v', 'aɪ', 'uː', 'k', 'ç', 'b', 'ɐ', 'ʃ', 'ɔ', 'x', 'œ', 'f', 'p', 'ɛɪ', 'ʏ', 'yː', 'y', 'ɛː', 'pf', 'øː', 't̠ʃ', 'd̠ʒ', 'ʒ', 'ɔ̃', 'ã', 'w', 'ɔː', 'i', 'œ̃']

Comparing the phonemes seen in the data with all phoible phonemes:
Unseen phonemes:
{}

Comparing the phonemes seen in the data with just the phonemes in phoible with the deu language code:
Unknown phonemes:
{'ɛɪ': 3940, 'ɔ̃': 19, 'ã': 126, 'w': 13, 'ɔː': 4, 'i': 22, 'œ̃': 5}
Unseen phonemes:
{'ʏː', 'ai', 't̺ʰ', 'au', 'ɔɪ', 'd̺', 'kʰ', 'ʔ', 'tʰ', 'ʁ', 'e', 'pʰ', 'ɔi'}

Phoible inventories names with the deu language code:
['German' 'GERMAN']

Phoible dialects names with the deu language code:
[nan 'German (Standard)']

All phonemes in the deu language code match with the specific inventories, no need for further comparison



# Indonesian Analysis

We compare all Indonesian inventories in phoible to our Indonesian dataset.

We find a few unknown phonemes:
* `ɔ` is listed as an allophone of `o` and is apparently valid in some analyses.
* `χ` is produced by the names "Chiki" and "Michael". It should possibly be `t̠ʃ` in the first case and `k` in the second.

The remaining phonemes are sufficiently rare:
* `ɹ` is produced only by the word "per".

Decisions we made in the phonemization code:
* `aɪ` is produced frequently but is not listed in the Indonesian phoneme inventories. It seems more standard to use `ai̯`, so we make this replacement.
* `aʊ` is produced frequently but is not listed in the Indonesian phoneme inventories. It seems more standard to use `au̯`, so we make this replacement.
* `ç` was being produced occasionally (801 instances) in words such as "mesjid" or "syut". We replace these with `ʃ`.


In [14]:
compare_phoneme_sets('Indonesian', 'ind', ['Indonesian'])

Generating train split: 524469 examples [00:02, 259134.07 examples/s]
Generating valid split: 10000 examples [00:00, 262342.40 examples/s]


Total phonemes in the dataset: 7806445
Unique phonemes: ['n', 'i', 'h', 'WORD_BOUNDARY', 'l', 'o', 't', 'm', 'a', 'w', 's', 'd̠ʒ', 'ŋ', 'd', 'p', 'ɛ', 'ɡ', 'b', 'u', 'r', 'au̯', 'z', 'k', 'ɲ', 'j', 't̠ʃ', 'e', 'ə', 'ɔ', 'ai̯', 'f', 'v', 'ʔ', 'ʃ', 'χ', 'x', 'ɹ']

Comparing the phonemes seen in the data with all phoible phonemes:
Unseen phonemes:
{}

Comparing the phonemes seen in the data with just the phonemes in phoible with the ind language code:
Unknown phonemes:
{'ɔ': 8887, 'χ': 3273, 'ɹ': 4}
Unseen phonemes:
{'y', 'oi̯', 't̪'}

Phoible inventories names with the ind language code:
['Indonesian']

Phoible dialects names with the ind language code:
['Central Java' 'Standard']

All phonemes in the ind language code match with the specific inventories, no need for further comparison
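
To put "sufficiently rare" into numbers, the unknown counts can be expressed relative to the total token count. The sketch below uses the Indonesian figures printed above; `per_million` is a hypothetical helper, and the total includes `WORD_BOUNDARY` tokens, so the rates are slightly lower than they would be over phonemes alone.

```python
def per_million(count, total):
    """Occurrences of a token per million tokens."""
    return 1e6 * count / total

# Figures copied from the Indonesian output above
total_tokens = 7806445
unknown = {'ɔ': 8887, 'χ': 3273, 'ɹ': 4}

rates = {phoneme: per_million(count, total_tokens) for phoneme, count in unknown.items()}
for phoneme, rate in rates.items():
    print(f'{phoneme}: {unknown[phoneme]} tokens, {rate:.1f} per million')
```

On this scale `ɹ` occurs less than once per million tokens, whereas `ɔ` and `χ` are frequent enough to deserve the closer look given to them above.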

