# Generating a Spanish Phonetic Dictionary with Prosody

The Spanish phonetic dictionary (https://github.com/Kyubyong/pron_dictionaries) will be used to determine the position of primary stress in the Spanish words used for elicitation. Each vowel will be marked as belonging to a stressed or an unstressed syllable: using a regex, the dictionary will be rewritten so that the vowel of each stressed syllable is uppercase. This dictionary will then be used to modify the dictionaries generated by the Montreal Forced Aligner's g2phone dictionary generator, so that the resulting phone-aligned TextGrids follow the same uppercase/lowercase convention.

In [112]:
import pandas as pd
import csv
import re
import numpy as np

In [86]:
sp_phon = pd.read_csv("dicts/sp_phon.csv")
sp_phon.sample()

Unnamed: 0,headword,pronunciation
14133,carcelarias,k a ɾ s e ˈ l a ɾ j a s


In [97]:
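# Drop the stress mark and uppercase the first vowel that follows it.
# regex=True is needed for a callable replacement; it is not the default in pandas 2.x.
sub = sp_phon.pronunciation.str.replace(r'ˈ\s([^aeiou]*\s*)([aeiou])', lambda m: m[1] + m[2].upper(), regex=True)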

In [98]:
sp_phon["Pron"] = sub
sp_phon.head()

Unnamed: 0,headword,pronunciation,Pron
0,-a,a,a
1,-aba,ˈ a β a,A β a
2,-able,ˈ a β l e,A β l e
3,-aca,ˈ a k a,A k a
4,-acas,ˈ a k o s,A k o s
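
To check what the substitution does on a single entry, the same pattern can be applied with plain `re.sub`. This is a minimal sketch: the helper name `mark_stress` is hypothetical, and because the pattern only looks for the five plain vowels, an entry whose stress mark is not followed by one of `a e i o u` would be left unchanged.

```python
import re

# Same pattern as above: the stress mark, a space, any run of non-vowel
# symbols (the syllable onset), then the first vowel of the stressed syllable.
STRESS = re.compile(r'ˈ\s([^aeiou]*\s*)([aeiou])')

def mark_stress(pron):
    """Drop the IPA stress mark and uppercase the stressed vowel."""
    return STRESS.sub(lambda m: m[1] + m[2].upper(), pron)

print(mark_stress("k a ɾ s e ˈ l a ɾ j a s"))  # k a ɾ s e l A ɾ j a s
print(mark_stress("ˈ p ɾ o s a"))              # p ɾ O s a
print(mark_stress("b e θ"))                    # b e θ (no stress mark, unchanged)
```

Monosyllables such as `b e θ` carry no stress mark in the dictionary, so they come out entirely lowercase.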


In [100]:
sp_phon.to_csv("dicts/sp_phon_stress.csv", index = False)
len(sp_phon)

51636

## Modifying the MFA-generated dictionary for the CBAS word list

In [137]:
female = pd.read_csv("dicts/cbas_female_dictionary.txt", sep = "\t", header = None)
female.columns = ["Word", "Pronunciation"]
female.sample(5)

Unnamed: 0,Word,Pronunciation
157,vagoneta,b a G o n e t a
79,tus,t u s
102,balcón,b a l k o+ ng
51,fosil,f o s i l
120,albino,a l b i n o


In [110]:
cbas_female_stress = female.merge(sp_phon, left_on = "Word", right_on = "headword", how = "left")
cbas_female_stress.sample(5)

Unnamed: 0,Word,Pronunciation,headword,pronunciation,Pron
178,anchas,a n tS a s,,,
38,hotel,o t e l,hotel,o ˈ t e l,o t E l
117,balada,b a l a D a,,,
116,prosa,p rf o s a,prosa,ˈ p ɾ o s a,p ɾ O s a
14,vez,b e s,vez,b e θ,b e θ


In [111]:
cbas_female_stress

Unnamed: 0,Word,Pronunciation,headword,pronunciation,Pron
0,carbohidratos,k a rf b o i D rf a t o s,,,
1,presente,p rf e s e n t e,presente,p r e ˈ s e n t e,p r e s E n t e
2,cabaret,k a b a rf e t,cabaret,k a ˈ β a ɾ e t,k a β A ɾ e t
3,revés,r e b e+ s,,,
4,unigenitos,u n i x e n i t o s,,,
...,...,...,...,...,...
185,cintura,T i n t u rf a,,,
186,vacío,b a s i+ o,,,
187,vampiro,b a m p i rf o,vampiro,b a m ˈ p i ɾ o,b a m p I ɾ o
188,terminarás,t e rf m i n a rf a+ s,,,


In [126]:
available = cbas_female_stress.dropna()
available = available[["Word", "Pron"]]
available.sample(5)

Unnamed: 0,Word,Pron
152,vaciar,b a s j A ɾ
7,hervir,e ɾ β I ɾ
5,vector,b e k t O ɾ
162,piel,p j E l
48,causal,k a w s A l
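
Only the rows that survive `dropna()` have a stress-marked pronunciation, so it helps to know how many words were lost in the lookup. The sketch below uses toy data rather than the real files, and the frame names `words` and `lexicon` are made up for illustration. `indicator=True` is a standard `DataFrame.merge` option that adds a `_merge` column recording whether each row found a match.

```python
import pandas as pd

# Toy stand-ins for the CBAS word list and the stress-marked dictionary.
words = pd.DataFrame({"Word": ["hotel", "anchas", "prosa", "balada"]})
lexicon = pd.DataFrame({
    "headword": ["hotel", "prosa"],
    "Pron": ["o t E l", "p ɾ O s a"],
})

merged = words.merge(lexicon, left_on="Word", right_on="headword",
                     how="left", indicator=True)

# Rows tagged "left_only" are words with no dictionary entry.
missing = merged.loc[merged["_merge"] == "left_only", "Word"].tolist()
coverage = (merged["_merge"] == "both").mean()

print(missing)   # ['anchas', 'balada']
print(coverage)  # 0.5
```

Run against the real frames, the same two lines would give the list of words that still need a stress assignment from some other source.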


## New Approach

This approach uses the syltippy package (https://github.com/nur-ag/syltippy) to syllabify each word and mark its stressed syllable in uppercase.

In [129]:
from syltippy import syllabize

In [141]:
def stress(word):
    # syllabize returns the list of syllables and the index of the stressed one
    syllables, stressed = syllabize(word)
    return ''.join(s.upper() if i == stressed else s for i, s in enumerate(syllables))

stress("comes")

'COmes'

In [146]:
female["pron"] = female["Word"].apply(lambda x : stress(x.replace(" ", "")))
female.sample(5)

Unnamed: 0,Word,Pronunciation,pron
15,pulverizar,p u l b e rf i T a rf,pulveriZAR
14,valuación,b a l w a T j o+ n,valuaCIÓN
104,abandonar,a b a n d o n a rf,abandoNAR
47,básico,b a+ s i k o,BÁsico
153,basura,b a s u rf a,baSUra
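
The `pron` column marks stress on the spelling, whereas the aligner needs it on the phone string. One possible way to carry it over, sketched here under strong assumptions, is to count vowel phones. The helper `stress_phones` is hypothetical: it takes the zero-based index of the stressed syllable (as `syllabize` returns it) and assumes that every vowel phone (`a e i o u`, optionally followed by `+`) is the nucleus of its own syllable, with glides written as `j`/`w`. Words in which two adjacent vowel phones share a syllable would need extra handling.

```python
import re

# A vowel phone in the MFA output, with an optional "+" accent marker.
VOWEL = re.compile(r'^[aeiou]\+?$')

def stress_phones(phones, stressed_syllable):
    """Uppercase the vowel phone of the syllable at index `stressed_syllable`.

    Assumption: one vowel phone per syllable (glides are written as j/w).
    """
    out = []
    nucleus = -1  # index of the most recent vowel phone seen
    for phone in phones.split():
        if VOWEL.match(phone):
            nucleus += 1
            if nucleus == stressed_syllable:
                phone = phone.upper()
        out.append(phone)
    return " ".join(out)

# ba-su-ra, stress on syllable index 1
print(stress_phones("b a s u rf a", 1))        # b a s U rf a
# va-lua-ción, stress on syllable index 2
print(stress_phones("b a l w a T j o+ n", 2))  # b a l w a T j O+ n
```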


In [None]:
# Map MFA's "rf" phone to the IPA tap ɾ used in the phonetic dictionary.
sub = female.Pronunciation.str.replace('rf', 'ɾ', regex=False)
sub.str.replace(r')