# International Phonetic Alphabet - Consonants Visualization
### Anthony Kosinski

This notebook takes consonant phoneme data scraped from Wikipedia and visualizes it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

## Loading in the IPA Data

In [2]:
ipa_data = pd.read_csv('ipa_collection.csv')

In [3]:
lang_count = len(ipa_data['Language'].unique())
print(f"There are {lang_count} languages in the dataset.")

There are 126 languages in the dataset.


Looking for outliers

In [4]:
for lang in ipa_data['Language'].unique():
    cons_count = len(ipa_data[ipa_data['Language'] == lang])
    if cons_count > 100:
        print(lang)

Astur-Leonese
Nguni
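
The same check can probably be written without an explicit loop. The sketch below uses `Series.value_counts` on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `ipa_collection.csv`), with a threshold of 2 instead of 100 because the toy data is tiny.

```python
import pandas as pd

# Toy stand-in for ipa_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Language": ["A", "A", "A", "B", "B", "C"],
    "Name": ["p", "t", "k", "p", "p", "m"],
})

# Number of rows recorded for each language, computed in one pass
rows_per_lang = toy["Language"].value_counts()

# The notebook uses 100 as the cutoff; 2 is used here only for the toy data
threshold = 2
outliers = rows_per_lang[rows_per_lang > threshold].index.tolist()
print(outliers)
```

Note that `value_counts` orders languages by row count rather than by first appearance, so the printed order may differ from the loop above.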


Do these languages really have more than 100 distinct consonant phonemes? Let's double-check.

In [5]:
nguni_data = ipa_data[ipa_data['Language'] == 'Nguni']
nguni_ipa_names = []
for i in range(len(nguni_data)):
    print(f"Name: {nguni_data['Name'].iloc[i]} | Symbol: {nguni_data['Symbol'].iloc[i]}")
    nguni_ipa_names.append(nguni_data['Name'].iloc[i])

Name: Voiced bilabial plosive | Symbol: b
Name: Voiced bilabial implosive | Symbol: ɓ
Name: Voiced alveolar plosive | Symbol: d
Name: Voiced postalveolar affricate | Symbol: dʒ
Name: Voiced alveolar affricate | Symbol: dz
Name: Voiceless labiodental fricative | Symbol: f
Name: Voiced velar plosive | Symbol: ɡ
Name: Voiced velar implosive | Symbol: ɠ
Name: Voiceless glottal fricative | Symbol: h
Name: Voiced glottal fricative | Symbol: ɦ
Name: Voiced palatal approximant | Symbol: j
Name: Voiceless velar plosive | Symbol: k
Name: Voiced alveolar lateral approximant | Symbol: l
Name: Voiceless alveolar lateral fricative | Symbol: ɬ
Name: Voiced alveolar lateral fricative | Symbol: ɮ
Name: Voiced bilabial nasal | Symbol: m
Name: Voiced alveolar nasal | Symbol: n
Name: Voiced velar nasal | Symbol: ŋ
Name: Voiced palatal nasal | Symbol: ɲ
Name: Voiceless bilabial plosive | Symbol: p
Name: Voiced alveolar trill | Symbol: r
Name: Voiceless alveolar fricative | Symbol: s
Name: Voiceless postalv

In [6]:
nguni_ipa_names = np.array(nguni_ipa_names)

print(len(nguni_ipa_names)) # Nguni Phonemes in ipa_data
print(len(np.unique(nguni_ipa_names))) # Unique Phonemes in Nguni

102
34
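
102 rows over 34 unique names would be consistent with each phoneme being listed three times, although the two totals alone do not prove that. One way to see exactly which names repeat, and how often, is `value_counts` on the `Name` column. This is a minimal sketch on invented rows (`toy_lang` is a hypothetical stand-in for `nguni_data`):

```python
import pandas as pd

# Hypothetical stand-in for nguni_data with one phoneme listed three times
toy_lang = pd.DataFrame({
    "Name": [
        "Voiced bilabial plosive",
        "Voiced bilabial plosive",
        "Voiced alveolar nasal",
        "Voiced bilabial plosive",
    ],
    "Symbol": ["b", "b", "n", "b"],
})

# Total rows versus distinct phoneme names
total_rows = len(toy_lang)
unique_names = toy_lang["Name"].nunique()

# How many times each name occurs, keeping only the repeated ones
name_counts = toy_lang["Name"].value_counts()
repeated = name_counts[name_counts > 1]
print(total_rows, unique_names)
print(repeated.to_dict())
```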


Nguni has 102 rows but only 34 unique phoneme names, so its phonemes are duplicated in the data. How many languages have this issue?

In [7]:
dup_count = 0
for lang in ipa_data['Language'].unique():
    lang_ph = ipa_data[ipa_data['Language'] == lang]['Name']
    if len(lang_ph) != len(np.unique(lang_ph)):
        dup_count += 1

print(f"There are {dup_count} languages with duplicate phoneme data in ipa_data")

There are 35 languages with duplicate phoneme data in ipa_data
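
As an alternative to looping over languages, `DataFrame.duplicated` can flag repeated (Language, Name) pairs directly. The sketch below assumes the same column names as `ipa_data` but runs on invented rows (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rows: language "A" lists the phoneme "p" twice
toy = pd.DataFrame({
    "Language": ["A", "A", "A", "B", "B", "C"],
    "Name": ["p", "p", "t", "p", "t", "m"],
})

# True for every row that repeats an earlier (Language, Name) pair
is_repeat = toy.duplicated(subset=["Language", "Name"])

# Languages containing at least one repeated row
langs_with_dups = toy.loc[is_repeat, "Language"].unique()
dup_count = len(langs_with_dups)
print(f"There are {dup_count} languages with duplicate phoneme data")
```

This also yields the names of the affected languages, which the counter alone does not.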


As suspected, there are repeats in the data, so it should be cleaned by removing any phoneme that appears more than once within the same language.

## Removing duplicates

In [9]:
# Setting up new IPA DataFrame with Language, Name, and Symbol columns
new_ipa_data = pd.DataFrame(columns=["Language", "Name", "Symbol"])

# Creating Phoneme Name : Symbol dictionary to make inserting phonemes to new dataframe easier
# (each name maps to the first symbol recorded for it anywhere in the dataset)
ipa_dd = ipa_data.drop_duplicates(subset=['Name'])
phoneme_dict = dict(zip(ipa_dd['Name'], ipa_dd['Symbol']))

# Appending non-duplicated phonemes for each language to new IPA dataframe
dex = 0
for lang in ipa_data['Language'].unique():
    for ph in np.unique(ipa_data[ipa_data['Language'] == lang]['Name']):
        new_ipa_data.loc[dex] = [lang, ph, phoneme_dict[ph]]
        dex += 1
new_ipa_data

|      | Language | Name | Symbol |
|------|----------|------|--------|
| 0 | Standard_German | Glottal stop | ʔ |
| 1 | Standard_German | Voiced alveolar approximant | ɹ |
| 2 | Standard_German | Voiced alveolar fricative | z |
| 3 | Standard_German | Voiced alveolar lateral approximant | l |
| 4 | Standard_German | Voiced alveolar nasal | n |
| ... | ... | ... | ... |
| 3093 | Zhuang | Voiceless bilabial plosive | p |
| 3094 | Zhuang | Voiceless dental fricative | θ |
| 3095 | Zhuang | Voiceless glottal fricative | h |
| 3096 | Zhuang | Voiceless labiodental fricative | f |
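
For comparison, pandas can remove the repeats in a single call with `DataFrame.drop_duplicates`. The sketch below is not the method used in this notebook and behaves slightly differently: it keeps rows in their original order and keeps each row's own symbol, whereas the loop above sorts names within each language (via `np.unique`) and reuses the first symbol seen for a name across the whole dataset. The frame `toy` is a hypothetical stand-in for `ipa_data`.

```python
import pandas as pd

# Hypothetical rows: language "A" lists the phoneme "p" twice
toy = pd.DataFrame({
    "Language": ["B", "B", "A", "A", "A"],
    "Name": ["t", "p", "p", "p", "k"],
    "Symbol": ["t", "p", "p", "p", "k"],
})

# Keep the first occurrence of every (Language, Name) pair
deduped = (
    toy.drop_duplicates(subset=["Language", "Name"])
       .reset_index(drop=True)
)
print(deduped)
```

Growing a DataFrame one row at a time with `.loc` tends to be slow on larger inputs, so a single `drop_duplicates` call may be worth considering if the dataset grows.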


Let's check our new dataframe to make sure it doesn't have the same duplication issues as before.

In [10]:
dup_count = 0
for lang in new_ipa_data['Language'].unique():
    lang_ph = new_ipa_data[new_ipa_data['Language'] == lang]['Name']
    if len(lang_ph) != len(np.unique(lang_ph)):
        dup_count += 1

print(f"There are {dup_count} languages with duplicate phoneme data in new_ipa_data")

There are 0 languages with duplicate phoneme data in new_ipa_data


Success! We've removed the duplicate phoneme data from the IPA data.

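Before plotting (still to do), count the consonant phonemes in each language and sort the languages from largest to smallest inventory.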

In [12]:
# Count the de-duplicated consonant phonemes for each language
lang_count_data = pd.DataFrame(columns=["Language", "Phonemes"])

dex = 0
for lang in new_ipa_data['Language'].unique():
    dex += 1  # row labels start at 1
    cons_count = len(new_ipa_data[new_ipa_data['Language'] == lang])
    lang_count_data.loc[dex] = [lang, cons_count]

# Largest inventories first
lang_count_data = lang_count_data.sort_values('Phonemes', ascending=False)

lang_count_data

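|      | Language | Phonemes |
|------|----------|----------|
| 12 | Astur-Leonese | 41 |
| 105 | Sorbian | 37 |
| 2 | Adyghe | 34 |
| 108 | Nguni | 34 |
| 22 | Insular_Catalan | 34 |
| ... | ... | ... |
| 81 | Mongolian | 14 |
| 24 | Cantonese | 13 |
| 46 | Hawaiian | 11 |
| 77 | M%C4%81ori | 10 |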
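
A possible starting point for the plotting step is sketched below. It is an assumption about how the counts might be visualized, not the notebook's actual plot: it computes per-language counts with `groupby(...).size()` and draws a horizontal bar chart with plain matplotlib (seaborn, imported above as `sb`, could be swapped in). The frame `toy` is invented data standing in for `new_ipa_data`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the de-duplicated new_ipa_data
toy = pd.DataFrame({
    "Language": ["A", "A", "A", "B", "B", "C"],
    "Name": ["p", "t", "k", "p", "t", "m"],
})

# Phonemes per language, largest inventory first
counts = (
    toy.groupby("Language")
       .size()
       .sort_values(ascending=False)
       .rename("Phonemes")
)

# One horizontal bar per language
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(list(counts.index), list(counts.values))
ax.invert_yaxis()  # put the largest inventory at the top
ax.set_xlabel("Consonant phonemes")
ax.set_ylabel("Language")
fig.tight_layout()
```

With 126 languages, a taller figure or a plot limited to the top and bottom few languages would likely be easier to read.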
