## Wikipedia IPA

In this notebook, we find the IPA representation of English words by looking up their [Wiktionary](https://www.wiktionary.org/) pages. An earlier approach used Wikipedia pages and searched their content for IPA. Another earlier version called the Wikimedia API directly. Thankfully, someone has already done that work for us and packaged it up in [`WiktionaryParser`](https://github.com/Suyash458/WiktionaryParser). This is much quicker than using Wikipedia, and simpler than calling the API ourselves. Most Wiktionary pages have an IPA representation. Some words have many Wikipedia pages associated with them (e.g. [Python](https://en.wikipedia.org/wiki/Python)), and some pages have many IPA representations for their title (e.g. [Edinburgh](https://en.wikipedia.org/wiki/Edinburgh)). In these cases, I collect all IPA representations. For titles with no IPA, nothing is collected. We store everything in a `pandas` dataframe, and eventually save it to disk.

In [2]:
import re
import pandas as pd
import numpy as np
from wiktionaryparser import WiktionaryParser

What words shall we search for? Here, I've decided to search for all words in a wordlist. Most Unix-like systems have a wordlist at `/usr/share/dict/words`. If you're using something else, just replace the path.

In [17]:
path_to_wordlist = '/usr/share/dict/words'
with open(path_to_wordlist) as f:
    raw = f.read()
wordlist = raw.split()
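Wordlists sometimes contain blank lines or repeated entries. The sketch below is one possible way to load such a file while dropping both. The helper name `load_wordlist` and the sample file are made up for illustration; the cell above works fine without it.

```python
import tempfile
from pathlib import Path


def load_wordlist(path):
    """Read a one-word-per-line file, dropping blank lines and repeats."""
    text = Path(path).read_text(encoding="utf-8")
    seen = set()
    words = []
    for line in text.splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)  # keep the order in which words first appear
    return words


# Small throwaway file standing in for '/usr/share/dict/words'.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "words"
    sample.write_text("apple\nApple\n\nbanana\napple\n", encoding="utf-8")
    demo_words = load_wordlist(sample)

print(demo_words)
```

Case is preserved on purpose, since capitalisation matters when looking titles up in Wiktionary.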

It looks like `WiktionaryParser` can only retrieve one word at a time. This isn't ideal, for us or for the Wikimedia servers. I may write my own library for parsing Wiktionary data soon. In the meantime, let's just search for a few words that we're interested in by looping over a list. Make sure they're capitalised as they appear in Wiktionary, otherwise we won't get any data on them.

In [35]:
titles = ['France', 'Napoleon', 'phonetics', 'linguistics', 'Australia']

In [52]:
parser = WiktionaryParser()
words = []

In [56]:
for title in titles:
    data = parser.fetch(title)[0] # If there are multiple pages, just keep the first one.
    words.append(data)

In [63]:
words[0]['pronunciations']['text']

['(UK) IPA: /fɹɑːns/, /fɹæns/',
 '(US) IPA: /fɹæns/',
 'Rhymes: -ɑːns, -æns',
 'Rhymes: -æns']
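The strings above mix IPA transcriptions with other material such as rhymes. One way to pull out just the transcriptions is sketched below. The name `extract_ipa` and its regular expression are assumptions for illustration, not part of `WiktionaryParser`.

```python
import re

# Anything between two slashes, or between square brackets.
IPA_PATTERN = re.compile(r'/[^/]+/|\[[^\]]+\]')


def extract_ipa(lines):
    """Return the distinct IPA transcriptions found in pronunciation lines."""
    found = []
    for line in lines:
        if 'IPA' not in line:
            continue  # skip lines such as 'Rhymes: ...'
        for match in IPA_PATTERN.findall(line):
            if match not in found:
                found.append(match)  # keep first-seen order, no repeats
    return found


sample = [
    '(UK) IPA: /fɹɑːns/, /fɹæns/',
    '(US) IPA: /fɹæns/',
    'Rhymes: -ɑːns, -æns',
    'Rhymes: -æns',
]
ipa_list = extract_ipa(sample)
print(ipa_list)  # ['/fɹɑːns/', '/fɹæns/']
```

This sketch discards the region labels such as `(UK)` and `(US)`; keeping them would need a slightly richer return value.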

Next we take our list of words, make them a column in a dataframe, and use the `get_ipa` function to get their IPA representations and store them in another column. That function returns a list of varying size, depending on how many IPA representations it found.

In [5]:
# Build the frame from the title strings; `words` now holds the parser's dicts.
raw_df = pd.DataFrame(titles, columns=['title'])
raw_df['pron'] = raw_df['title'].map(get_ipa)
df = raw_df.copy() # make a copy, because it's expensive to create

The cell below is some pandas magic for unpacking those variable-sized lists into separate columns, one per element.

In [6]:
unpacked_prons = df['pron'].apply(pd.Series)
df = pd.concat([df['title'], unpacked_prons], axis=1)
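To see what this unpacking does, here is a small self-contained sketch on made-up data. The frame `toy` is an assumption for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['France', 'Australia'],
    'pron': [['/fɹɑːns/', '/fɹæns/'], ['/əˈstreɪliə/']],
})

# Each list becomes a row of a new frame; shorter lists are padded with NaN.
unpacked = toy['pron'].apply(pd.Series)
wide = pd.concat([toy['title'], unpacked], axis=1)
print(wide)
```

The new columns are named `0`, `1`, and so on, one for each position in the longest list.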

Now we make a [tidy](https://www.jstatsoft.org/article/view/v059i10) dataframe, where each row is an observation. The resulting dataframe has two columns, and each title may appear multiple times in the first column.

In [7]:
df = pd.melt(df, id_vars=['title'], value_name='ipa')
df.drop('variable', axis=1, inplace=True)
df.head()

Unnamed: 0,title,ipa
0,France,[fʁɑ̃s]
1,Napoleon,/nəˈpoʊliən ˈboʊnəpɑːrt/
2,Phonetics,/fəˈnɛtɪks/
3,Linguistics,
4,Australia,/əˈstreɪliə/
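As an aside, recent versions of pandas offer `DataFrame.explode`, which goes from a column of lists to one row per element in a single step. The sketch below uses made-up data and is an alternative to the unpack-and-melt cells, not what this notebook actually ran.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['France', 'Australia', 'Linguistics'],
    'pron': [['/fɹɑːns/', '/fɹæns/'], ['/əˈstreɪliə/'], []],
})

# One row per list element; an empty list becomes a single NaN row.
tidy = (
    toy.explode('pron')
       .rename(columns={'pron': 'ipa'})
       .reset_index(drop=True)
)
print(tidy)
```

Unlike `melt`, which lists every first pronunciation before any second one, `explode` keeps each title's rows together.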


Here we clean up a little. We get rid of any empty strings.

In [8]:
df.replace('', np.nan, inplace=True)
df.dropna(inplace=True)
df

Unnamed: 0,title,ipa
0,France,[fʁɑ̃s]
1,Napoleon,/nəˈpoʊliən ˈboʊnəpɑːrt/
2,Phonetics,/fəˈnɛtɪks/
4,Australia,/əˈstreɪliə/
5,France,[ʁepyblik fʁɑ̃sɛz]
6,Napoleon,[napɔleɔ̃ bɔnapaʁt]
9,Australia,/ɒ-/
11,Napoleon,[napoleˈoːne di bwɔnaˈparte]
14,Australia,/-ljə/
19,Australia,"[əˈstɹæɪljə, -liə]"
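The gaps in the index above (0, 1, 2, 4, ...) appear because `dropna` removes rows but keeps the original labels. A minimal sketch on made-up data, assuming you would prefer a contiguous index:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['France', 'Linguistics', 'Australia'],
    'ipa': ['[fʁɑ̃s]', '', '/əˈstreɪliə/'],
})

# Turn empty strings into NaN, then drop the rows with no IPA.
cleaned = toy.replace('', np.nan).dropna(subset=['ipa'])
print(cleaned.index.tolist())  # [0, 2]

# Optional: renumber the rows from zero.
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```

This version avoids `inplace=True`, so the original frame stays available for comparison.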


Some Wikipedia pages use `/.../`, some use `[...]`. Let's get rid of both.

In [9]:
def remove_brackets(ipa):
    """Remove slashes and square brackets from an IPA representation.

    Parameters
    ----------
    ipa : str
        IPA representation

    Returns
    -------
    str
        IPA representation without slashes or brackets.

    >>> remove_brackets('[fʁɑ̃s]')
    'fʁɑ̃s'
    """
    # Matches '/', '[' or ']' anywhere in the string.
    pat = re.compile(r'/|\[|\]')
    return pat.sub('', ipa)

In [10]:
df['cleaned'] = df['ipa'].map(remove_brackets)
df

Unnamed: 0,title,ipa,cleaned
0,France,[fʁɑ̃s],fʁɑ̃s
1,Napoleon,/nəˈpoʊliən ˈboʊnəpɑːrt/,nəˈpoʊliən ˈboʊnəpɑːrt
2,Phonetics,/fəˈnɛtɪks/,fəˈnɛtɪks
4,Australia,/əˈstreɪliə/,əˈstreɪliə
5,France,[ʁepyblik fʁɑ̃sɛz],ʁepyblik fʁɑ̃sɛz
6,Napoleon,[napɔleɔ̃ bɔnapaʁt],napɔleɔ̃ bɔnapaʁt
9,Australia,/ɒ-/,ɒ-
11,Napoleon,[napoleˈoːne di bwɔnaˈparte],napoleˈoːne di bwɔnaˈparte
14,Australia,/-ljə/,-ljə
19,Australia,"[əˈstɹæɪljə, -liə]","əˈstɹæɪljə, -liə"
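Note that `remove_brackets` deletes slashes and brackets wherever they occur, not only at the ends. If only the enclosing delimiters should go, a plain `str.strip` may be enough. The helper name `strip_delimiters` below is hypothetical.

```python
def strip_delimiters(ipa):
    """Remove '/', '[' and ']' from both ends of `ipa`, leaving the middle alone."""
    return ipa.strip().strip('/[]')


print(strip_delimiters('[fʁɑ̃s]'))
print(strip_delimiters('/ɒ-/'))
print(strip_delimiters('a/b'))  # the inner slash is kept
```

For the rows shown above both approaches give the same result, so this only matters if a transcription contains a delimiter in the middle.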


Looking at the `cleaned` column, some of the IPA representations aren't what we're looking for. In particular, we've got alternative affixes masquerading as complete pronunciations. This comes from the way we identified IPA in the HTML. We don't want those, so let's filter them out. One way to do that is to find all values containing a '-', which is not an IPA symbol and is used to mark affixes. We'll define a function to do that.

In [11]:
def is_affix(ipa):
    """Return True if `ipa` is likely an affix.
    
    Parameters
    ----------
    ipa : str
        IPA representation
    Returns
    -------
    bool
        
    >>> is_affix('fʁɑ̃s')
    False
    >>> is_affix('-ljə')
    True
    """
    pat = re.compile(r'-')
    return bool(pat.search(ipa))

In [12]:
df['affix'] = df['cleaned'].map(is_affix)
df

Unnamed: 0,title,ipa,cleaned,affix
0,France,[fʁɑ̃s],fʁɑ̃s,False
1,Napoleon,/nəˈpoʊliən ˈboʊnəpɑːrt/,nəˈpoʊliən ˈboʊnəpɑːrt,False
2,Phonetics,/fəˈnɛtɪks/,fəˈnɛtɪks,False
4,Australia,/əˈstreɪliə/,əˈstreɪliə,False
5,France,[ʁepyblik fʁɑ̃sɛz],ʁepyblik fʁɑ̃sɛz,False
6,Napoleon,[napɔleɔ̃ bɔnapaʁt],napɔleɔ̃ bɔnapaʁt,False
9,Australia,/ɒ-/,ɒ-,True
11,Napoleon,[napoleˈoːne di bwɔnaˈparte],napoleˈoːne di bwɔnaˈparte,False
14,Australia,/-ljə/,-ljə,True
19,Australia,"[əˈstɹæɪljə, -liə]","əˈstɹæɪljə, -liə",True
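Row 19 shows a side effect of this check: `əˈstɹæɪljə, -liə` holds a complete pronunciation followed by an alternative ending, yet the whole value is flagged as an affix. One possible refinement, sketched here with a hypothetical helper `full_pronunciations`, is to split on commas and keep only the parts that neither start nor end with a hyphen.

```python
def full_pronunciations(cleaned):
    """Split a cleaned IPA string on commas and drop affix-like parts."""
    parts = [part.strip() for part in cleaned.split(',')]
    return [
        part for part in parts
        if part and not (part.startswith('-') or part.endswith('-'))
    ]


print(full_pronunciations('əˈstɹæɪljə, -liə'))
print(full_pronunciations('ɒ-'))
print(full_pronunciations('fʁɑ̃s'))
```

Values that consist only of an affix come back as an empty list, so they can still be filtered out afterwards.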


Great. So that we don't have to do all that again, let's save the result as a CSV file in the working directory. We only want the pronunciations that are not affixes. In pandas, `~` negates a boolean mask, so `df[~df['affix']]` keeps the rows where `affix` is `False`.

In [13]:
df[~df['affix']].to_csv('wikipedia_ipa.csv', columns=['title', 'cleaned'])
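By default `to_csv` also writes the index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below, on made-up data and writing to an in-memory buffer instead of a file, shows the round trip with `index=False`.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    'title': ['France', 'Australia', 'Australia'],
    'cleaned': ['fʁɑ̃s', 'əˈstreɪliə', '-ljə'],
    'affix': [False, False, True],
})

buffer = io.StringIO()  # stands in for a file on disk
toy[~toy['affix']].to_csv(buffer, columns=['title', 'cleaned'], index=False)

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Whether to keep the index is a matter of taste; dropping it gives a file with just the two columns of interest.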