## Wikipedia IPA

In this notebook, we find IPA representations of English words by looking up their Wikipedia pages. The bulk of the work is done by the [`wikipedia`](http://wikipedia.readthedocs.io/en/latest/) package, which is a great interface to the content of Wikipedia. Most Wikipedia pages give an IPA representation of their title in parentheses in the first sentence. Even better, it is explicitly marked as IPA in the HTML. Getting the HTML using `wikipedia` is slow, but I chose this over simple regular expressions because it is more accurate. Some words have many Wikipedia pages associated with them (e.g. [Python](https://en.wikipedia.org/wiki/Python)), and some pages have many IPA representations for their title (e.g. [Edinburgh](https://en.wikipedia.org/wiki/Edinburgh)). In these cases, I collect all IPA representations. For titles with no IPA, nothing is collected. We store everything in a `pandas` dataframe and eventually save it to disk.

In [1]:
import re

import numpy as np
import pandas as pd
import wikipedia as wk
from bs4 import BeautifulSoup

wk.set_lang('en')  # query the English-language Wikipedia

The first thing we want is a function that takes a word, gets the Wikipedia page for that word, and looks through the HTML for the IPA. We wrap all the logic in a `try` block, so that if the lookup raises a Wikipedia error, the function just returns a list containing the empty string.

In [2]:
def get_ipa(title):
    """Return the IPA representation of `title` from Wikipedia.
    
    Some pages have no IPA, in which case a list containing the empty
    string is returned. Some pages have more than one IPA representation,
    in which case all of them are returned. If more than one Wikipedia
    page is related to `title`, the IPA representations for all of
    those pages are returned.
    
    IPA representations are marked in the HTML of the Wikipedia pages 
    as such. However, getting the HTML is a slow process.
    
    NB: To get the IPA from other languages' Wikipedias, we only need
    to change the way IPA is identified in the HTML (e.g. French uses `API`).
    
    Parameters
    ----------
    title : str
        Title of wikipedia page
    Returns
    -------
    list
        IPA representations found.
        
    >>> get_ipa('France')
    ['[fʁɑ̃s]', '[ʁepyblik fʁɑ̃sɛz]']
    
    """
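    try:
        wiki_page = wk.page(title)
        # Fetching the rendered HTML is the slow step.
        soup = BeautifulSoup(wiki_page.html(), 'html5lib')
        # IPA is marked up in <span> elements whose class contains 'IPA'.
        results = soup.find_all('span', class_=re.compile(r'IPA'))
        if results:
            return [r.text for r in results]
        return ['']
    except wk.DisambiguationError as e:
        # Ambiguous title: look up every suggested page and flatten the lists.
        titles = e.options
        ipas = [get_ipa(t) for t in titles]
        return sum(ipas, [])
    except wk.WikipediaException:
        # Any other Wikipedia error (e.g. no such page): report no IPA.
        return ['']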
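
To see the extraction step in isolation, without the network, here is a minimal sketch that pulls the text out of IPA-marked `<span>` elements using only the standard library's `html.parser` rather than BeautifulSoup. The `IPASpanParser` class, the `extract_ipa` helper and the HTML snippet are all made up for illustration, and the snippet is a simplified guess at what the markup looks like. Note that this version matches the class name `IPA` exactly, whereas the regular expression in `get_ipa` matches any class containing `IPA`.

```python
from html.parser import HTMLParser


class IPASpanParser(HTMLParser):
    """Collect the text of every <span> whose class list includes 'IPA'."""

    def __init__(self):
        super().__init__()
        self.results = []   # finished IPA strings
        self._depth = 0     # <span> nesting depth inside an IPA span (0 = outside)
        self._buffer = []   # text pieces of the IPA span currently being read

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._depth:
            self._depth += 1    # a span nested inside an IPA span
        elif 'IPA' in classes:
            self._depth = 1     # entering an IPA span
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'span' and self._depth:
            self._depth -= 1
            if self._depth == 0:    # the IPA span itself just closed
                self.results.append(''.join(self._buffer))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_ipa(html):
    """Return the text of all IPA spans found in an HTML string."""
    parser = IPASpanParser()
    parser.feed(html)
    parser.close()
    return parser.results


# Simplified, hypothetical markup for the first sentence of a page.
sample = ('<p><b>Phonetics</b> (<span class="IPA nopopups">'
          '<a href="/wiki/Help:IPA/English">/fəˈnɛtɪks/</a></span>) '
          'is a branch of linguistics.</p>')
print(extract_ipa(sample))
```

BeautifulSoup is still the more robust choice for real pages; the sketch is only meant to show what "marked in the HTML" buys us compared with guessing from the plain text.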

Now we need some words to look up on Wikipedia. We could choose random Wikipedia titles using `wk.random()`, or we could start with one page and follow all the links on it. Here, I've decided to look up every word in a wordlist. Most *nix systems have a wordlist at `/usr/share/dict/words`. If yours lives somewhere else, just replace the path.

In [3]:
path_to_wordlist = '/usr/share/dict/words'
with open(path_to_wordlist) as f:
    raw = f.read()
words = raw.split()

Using all these words is great, but because of how slow it is to get the HTML using `wikipedia`, I'm not going to do that here. Instead, as proof of concept, let's just use a few select words.

In [4]:
words = ['France', 'Napoleon', 'Phonetics', 'Linguistics', 'Australia']

Next we take our list of words, make them a column in a dataframe, and use the function from above to get their IPA representations and store them in another column. The return value of that function is a list of varying size, depending on how many IPA representations it found.

In [5]:
raw_df = pd.DataFrame(words, columns=['title'])
raw_df['pron'] = raw_df['title'].map(get_ipa)
df = raw_df.copy()  # work on a copy, because raw_df is expensive to recreate
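
Since the lookup is the expensive part, one option is to memoize it so that a repeated title is only fetched once. The sketch below uses `functools.lru_cache` from the standard library. `slow_lookup` is a hypothetical stand-in for `get_ipa` that returns placeholder strings (not real IPA), so the example runs without touching the network.

```python
from functools import lru_cache

calls = []  # records every uncached lookup, for illustration only


def slow_lookup(title):
    """Hypothetical stand-in for a slow network lookup such as get_ipa."""
    calls.append(title)
    return ['/' + title.lower() + '/']  # placeholder value, not real IPA


@lru_cache(maxsize=None)
def cached_lookup(title):
    # The argument must be hashable (a str is fine). Returning a tuple
    # keeps callers from mutating the cached value by accident.
    return tuple(slow_lookup(title))


for word in ['France', 'Napoleon', 'France']:
    cached_lookup(word)

print(calls)  # → ['France', 'Napoleon']
```

The cache only lives as long as the Python session, so saving the results to disk is still worthwhile.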

The cell below is some pandas magic for unpacking those variable-sized lists into a separate column for each element.

In [6]:
unpacked_prons = df['pron'].apply(pd.Series)
df = pd.concat([df['title'], unpacked_prons], axis=1)

Now we make a [tidy](https://www.jstatsoft.org/article/view/v059i10) dataframe, where each row is an observation. The resulting dataframe has 2 columns and each title may appear multiple times in the first column.

In [7]:
df = pd.melt(df, id_vars=['title'], value_name='ipa')
df.drop('variable', axis=1, inplace=True)
df.head()

|  | title | ipa |
|---|---|---|
| 0 | France | [fʁɑ̃s] |
| 1 | Napoleon | /nəˈpoʊliən ˈboʊnəpɑːrt/ |
| 2 | Phonetics | /fəˈnɛtɪks/ |
| 3 | Linguistics |  |
| 4 | Australia | /əˈstreɪliə/ |
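
For intuition, the unpack-and-melt steps above amount to flattening a mapping from titles to lists into one row per pronunciation. Here is a plain-Python sketch of that reshaping. The `prons` dictionary is a hand-made stand-in for the `pron` column, with values copied from outputs shown in this notebook.

```python
# One list of pronunciations per title, shaped like the 'pron' column.
prons = {
    'France': ['[fʁɑ̃s]', '[ʁepyblik fʁɑ̃sɛz]'],
    'Phonetics': ['/fəˈnɛtɪks/'],
    'Linguistics': [''],
}

# One (title, ipa) row per pronunciation: the "tidy" layout.
rows = [(title, ipa)
        for title, ipas in prons.items()
        for ipa in ipas]

for row in rows:
    print(row)
```

Unlike the pandas version, this produces no padding rows for titles with fewer pronunciations, and the rows come out grouped by title rather than by position in the list.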


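Here we clean up a little. We turn any empty strings into `NaN` and then drop every row with a missing value. This also removes the padding rows that the unpacking step created for titles with fewer pronunciations than the longest list.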

In [8]:
df.replace('', np.nan, inplace=True)
df.dropna(inplace=True)
df

|  | title | ipa |
|---|---|---|
| 0 | France | [fʁɑ̃s] |
| 1 | Napoleon | /nəˈpoʊliən ˈboʊnəpɑːrt/ |
| 2 | Phonetics | /fəˈnɛtɪks/ |
| 4 | Australia | /əˈstreɪliə/ |
| 5 | France | [ʁepyblik fʁɑ̃sɛz] |
| 6 | Napoleon | [napɔleɔ̃ bɔnapaʁt] |
| 9 | Australia | /ɒ-/ |
| 11 | Napoleon | [napoleˈoːne di bwɔnaˈparte] |
| 14 | Australia | /-ljə/ |
| 19 | Australia | [əˈstɹæɪljə, -liə] |


Some Wikipedia pages use `/.../`, some use `[...]`. Let's get rid of both.

In [9]:
def remove_brackets(ipa):
    """Remove enclosing brackets from IPA representation.
    
    Parameters
    ----------
    ipa : str
        IPA representation
    Returns
    -------
    str
        IPA representation without brackets.
        
    >>> remove_brackets('[fʁɑ̃s]')
    'fʁɑ̃s'
    """
    pat = re.compile(r'/|\[|\]')
    return re.sub(pat, '', ipa)

In [10]:
df['cleaned'] = df['ipa'].map(remove_brackets)
df

|  | title | ipa | cleaned |
|---|---|---|---|
| 0 | France | [fʁɑ̃s] | fʁɑ̃s |
| 1 | Napoleon | /nəˈpoʊliən ˈboʊnəpɑːrt/ | nəˈpoʊliən ˈboʊnəpɑːrt |
| 2 | Phonetics | /fəˈnɛtɪks/ | fəˈnɛtɪks |
| 4 | Australia | /əˈstreɪliə/ | əˈstreɪliə |
| 5 | France | [ʁepyblik fʁɑ̃sɛz] | ʁepyblik fʁɑ̃sɛz |
| 6 | Napoleon | [napɔleɔ̃ bɔnapaʁt] | napɔleɔ̃ bɔnapaʁt |
| 9 | Australia | /ɒ-/ | ɒ- |
| 11 | Napoleon | [napoleˈoːne di bwɔnaˈparte] | napoleˈoːne di bwɔnaˈparte |
| 14 | Australia | /-ljə/ | -ljə |
| 19 | Australia | [əˈstɹæɪljə, -liə] | əˈstɹæɪljə, -liə |


Looking at the `cleaned` column, some of the IPA representations aren't what we're looking for. In particular, we've got alternative affixes masquerading as complete pronunciations. This comes from the way we identified IPA in the HTML. We don't want those, so let's filter them out. One way to do that is to flag every value with a '-' in it, since the hyphen is not an IPA symbol and is used here to mark affixes. This is a blunt heuristic: an entry that lists a complete pronunciation alongside an affix gets flagged too. We'll define a function to do the flagging.

In [11]:
def is_affix(ipa):
    """Return True if `ipa` is likely an affix.
    
    Parameters
    ----------
    ipa : str
        IPA representation
    Returns
    -------
    bool
        
    >>> is_affix('fʁɑ̃s')
    False
    >>> is_affix('-ljə')
    True
    """
    # A plain substring test is enough here; no regular expression needed.
    return '-' in ipa

In [12]:
df['affix'] = df['cleaned'].map(is_affix)
df

|  | title | ipa | cleaned | affix |
|---|---|---|---|---|
| 0 | France | [fʁɑ̃s] | fʁɑ̃s | False |
| 1 | Napoleon | /nəˈpoʊliən ˈboʊnəpɑːrt/ | nəˈpoʊliən ˈboʊnəpɑːrt | False |
| 2 | Phonetics | /fəˈnɛtɪks/ | fəˈnɛtɪks | False |
| 4 | Australia | /əˈstreɪliə/ | əˈstreɪliə | False |
| 5 | France | [ʁepyblik fʁɑ̃sɛz] | ʁepyblik fʁɑ̃sɛz | False |
| 6 | Napoleon | [napɔleɔ̃ bɔnapaʁt] | napɔleɔ̃ bɔnapaʁt | False |
| 9 | Australia | /ɒ-/ | ɒ- | True |
| 11 | Napoleon | [napoleˈoːne di bwɔnaˈparte] | napoleˈoːne di bwɔnaˈparte | False |
| 14 | Australia | /-ljə/ | -ljə | True |
| 19 | Australia | [əˈstɹæɪljə, -liə] | əˈstɹæɪljə, -liə | True |
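
As the output above shows, the entry `əˈstɹæɪljə, -liə` is flagged as an affix even though it starts with a complete pronunciation. If we wanted to keep that part, one possible refinement is to split each cleaned string on commas and discard only the pieces that start or end with a hyphen. The `full_pronunciations` helper below is a hypothetical sketch and is not used in the rest of the notebook. It assumes commas only ever separate variants, which I have not checked beyond the handful of entries here.

```python
def full_pronunciations(cleaned):
    """Split a cleaned IPA string on commas and keep the complete variants.

    A variant is treated as partial if it starts or ends with '-'.
    """
    variants = [v.strip() for v in cleaned.split(',')]
    return [v for v in variants
            if v and not (v.startswith('-') or v.endswith('-'))]


print(full_pronunciations('əˈstɹæɪljə, -liə'))  # keeps only the first variant
print(full_pronunciations('-ljə'))              # → []
print(full_pronunciations('ɒ-'))                # → []
```

A function like this returns a list per entry, so it would need the same unpacking treatment as the `pron` column before saving.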


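Great. Now to make sure we don't have to do all that again, let's save it as a CSV file in the working directory. In particular, we want only the pronunciations that are not affixes. The `~` operator negates the boolean `affix` column, so the selection keeps the rows where it is `False`. Note that `to_csv` also writes the index as an unnamed first column by default; pass `index=False` if you don't want it.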

In [13]:
df[~df['affix']].to_csv('wikipedia_ipa.csv', columns=['title', 'cleaned'])