# Phonological

Learning phonological representations is closer to learning semantic representations than is often assumed. In both cases the data come as sequences that look linear on the surface but have an underlying hierarchical structure. For phonology, though, there is broader agreement on what the feature representations should be. For this reason, I want to learn and evaluate phonological representations using methods similar to those used for semantic ones.

[PHOIBLE](http://phoible.org/) is a great resource. It includes feature representations for over 2000 segments. More info on the features is [here](https://github.com/phoible/dev/tree/master/raw-data/FEATURES).

In [7]:
!wget -q -O raw_phonological_features.tsv https://raw.githubusercontent.com/phoible/dev/master/raw-data/FEATURES/phoible-segments-features.tsv

In [10]:
import pandas as pd
import numpy as np

In [21]:
# Transpose so that features are rows and segments are columns.
raw = pd.read_csv('raw_phonological_features.tsv', sep='\t', index_col=0).T
raw.head()

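| segment | m | k | i | a | j | p | u | w | n | s | ... | r̪̰ | o̞iˤ | r̪̥ | ɡ‼x | a̠ː | ʈɽ | ḭː | ḭˑ | cçʲ | j̤ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tone | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| stress | - | - | - | - | - | - | - | - | - | - | ... | - | - | - | - | - | - | - | - | - | - |
| syllabic | - | - | + | + | - | - | + | - | - | - | ... | - | + | - | - | + | - | + | + | - | - |
| short | - | - | - | - | - | - | - | - | - | - | ... | - | - | - | - | - | - | - | + | - | - |
| long | - | - | - | - | - | - | - | - | - | - | ... | - | - | - | - | + | - | + | + | - | - |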


I need to convert the values in the dataframe from strings to ints. Some values are comma-separated combinations of pluses and minuses, such as `+,-`. I'm not entirely sure what those mean, so for now I'm going to treat all of them as 0.

In [22]:
np.unique(raw.values)

array(['+', '+,-', '+,-,+', '+,-,+,-', '+,-,-', '-', '-,+', '-,+,+',
       '-,+,-', '0'], dtype=object)
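
Before collapsing the comma-separated values to 0, it may be worth checking how many cells they actually affect. The sketch below does this on a small toy table; the table, its feature names (`feat1` to `feat3`), and its values are invented for illustration and are not PHOIBLE data. With the real data, `raw` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for the feature table (features as rows, segments as columns).
# These values are made up for illustration only.
toy = pd.DataFrame(
    {'a': ['+', '-', '0'], 'b': ['-', '+,-', '0'], 'c': ['-,+', '-', '+']},
    index=['feat1', 'feat2', 'feat3'],
)

# A cell counts as a combination value if it contains a comma, e.g. '+,-'.
is_contour = toy.apply(lambda col: col.str.contains(',', regex=False))

contour_fraction = is_contour.to_numpy().mean()  # share of all cells
contours_per_feature = is_contour.sum(axis=1)    # count per feature (row)
segments_with_contour = is_contour.any(axis=0)   # which segments are affected
```

If the fraction turns out to be small, mapping these values to 0 loses little information; if it is concentrated in a few features, those features may deserve a closer look.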

In [23]:
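# Single-valued features: '+' becomes 1; '-' and '0' both become 0.
mapping = {'+': 1, '-': 0, '0': 0}
# Combination values such as '+,-' are all collapsed to 0 for now.
for value in ['+,-', '+,-,+', '+,-,+,-', '+,-,-', '-,+', '-,+,+', '-,+,-']:
    mapping[value] = 0
# Cast explicitly so the result holds ints rather than generic objects.
raw = raw.replace(mapping).astype(int)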
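
The hard-coded list above covers only the combinations that happen to occur in this file. As an alternative, here is a minimal sketch of an encoder that treats any comma-separated value the same way, so the policy can be changed in one place later. The function name `encode_feature_value`, its `contour` parameter, and the toy table are hypothetical and used only for illustration.

```python
import pandas as pd

def encode_feature_value(value, contour=0):
    """Map a raw feature string to an int.

    '+' -> 1, '-' and '0' -> 0, and any comma-separated
    combination (e.g. '+,-') -> `contour` (0 by default).
    """
    if ',' in value:
        return contour
    return 1 if value == '+' else 0

# Toy stand-in for the feature table; values are invented.
toy = pd.DataFrame(
    {'a': ['+', '-', '0'], 'b': ['-', '+,-', '0'], 'c': ['-,+', '-', '+']},
    index=['feat1', 'feat2', 'feat3'],
)

# Apply the encoder cell by cell, then make the integer dtype explicit.
encoded = toy.apply(lambda col: col.map(encode_feature_value)).astype(int)
```

With the default `contour=0` this reproduces the choice made in the cell above.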

In [24]:
raw.to_csv('phonological_features.csv')
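
Because the table was transposed before saving, the feature names sit in the first CSV column and each segment is a column. A minimal sketch of reading such a file back and pulling out one segment's feature vector follows; it uses an in-memory buffer and a toy table with invented values so that it runs on its own.

```python
import io

import pandas as pd

# Toy encoded table (features as rows, segments as columns); values invented.
encoded = pd.DataFrame(
    {'a': [1, 0, 0], 'c': [0, 0, 1]},
    index=['feat1', 'feat2', 'feat3'],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
encoded.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the feature names as the row index.
reloaded = pd.read_csv(buffer, index_col=0)

# Each column is one segment's feature vector.
vector_a = reloaded['a'].to_numpy()
```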