# Preparing contour txt files for SS-ANOVA in R

With a Praat script (Mietta Lennes, 2003), the first three formants were extracted at 2ms intervals within the DIMEx100 and CBAS corpora. The resulting .txt files need to be cleaned and turned into .csv files. To clean, the formant data needs to be merged with the textgrids and vowels need to be isolated. The formant measurements can then be normalized, and then they will be ready to upload to R and fitted with SS-ANOVA.

In [6]:
import pandas as pd
import os
from audiolabel import read_label

### Import formant data

First import DIME data and add new columns for gender and corpus.

In [7]:
# import dime female
dime_fem_con = pd.read_csv("data/dime_female_contour.txt", sep = "\t")
dime_fem_con = dime_fem_con.rename(columns = {"Filename": "Participant"})
dime_fem_con["Gender"] = "Female"
dime_fem_con["Corpus"] = "DIMEx100"

# import dime male
dime_male_con = pd.read_csv("data/dime_male_contour.txt", sep = "\t")
dime_male_con = dime_male_con.rename(columns = {"Filename": "Participant"})
dime_male_con["Gender"] = "Male"
dime_male_con["Corpus"] = "DIMEx100"

# concatenate male and female data
dime = pd.concat([dime_male_con, dime_fem_con], ignore_index = True)
dime.head()

Unnamed: 0,Participant,phone,Time,Interval,F1,F2,F3,Gender,Corpus
0,s00101,e,0.069,0.002,2094.613890102079,2623.550627119573,3466.0182725549303,Male,DIMEx100
1,s00101,e,0.071,0.004,2094.042033082855,2624.0054225415506,3465.6262512556896,Male,DIMEx100
2,s00101,e,0.073,0.006,2093.470176063631,2624.4602179635285,3465.2342299564493,Male,DIMEx100
3,s00101,e,0.075,0.008,2092.8983190444064,2624.915013385506,3464.842208657209,Male,DIMEx100
4,s00101,e,0.077,0.01,2092.326462025182,2625.3698088074834,3464.450187357969,Male,DIMEx100


Now do the same with the CBAS corpus.

In [17]:
# import cbas female
cbas_fem = pd.read_csv("data/cbas_female_contour.txt", sep = "\t")
cbas_fem = cbas_fem.rename(columns = {"Filename": "Participant"})
cbas_fem["Gender"] = "Female"
cbas_fem["Corpus"] = "CBAS"

# import cbas male
cbas_male = pd.read_csv("data/cbas_male_contour.txt", sep = "\t")
cbas_male = cbas_male.rename(columns = {"Filename": "Participant"})
cbas_male["Gender"] = "Male"
cbas_male["Corpus"] = "CBAS"

# combine cbas female and male
cbas = pd.concat([cbas_male, cbas_fem], ignore_index = True)
cbas.head()

Unnamed: 0,Participant,Segment label,F1.50,F2.50,F3.50,F1.25,F2.25,F3.25,F1.75,F2.75,F3.75,Gender,Corpus
0,p112,sil,1349.549295,2591.478429,3249.782912,1640.24642,2418.644376,3493.133471,703.933103,2157.879849,2892.455533,Male,CBAS
1,p112,b,538.241047,1256.248187,2799.236009,884.928365,2273.697186,3419.016247,203.380184,972.14597,2675.148493,Male,CBAS
2,p112,a,695.790976,1139.953282,2667.258621,689.324612,840.183174,2738.685252,672.665361,1108.588353,2626.355793,Male,CBAS
3,p112,x,751.966861,1524.518652,2973.220585,702.809393,1206.953717,2902.09406,765.207925,1004.810302,2809.626524,Male,CBAS
4,p112,o,423.620505,737.502557,2374.974921,438.92995,803.823366,2369.001274,442.314537,785.356913,2490.71866,Male,CBAS


Now concatenate the cbas and dime dataframes.

In [18]:
# combine cbas and dime
formants = pd.concat([cbas, dime], ignore_index = True)
formants.head()

Unnamed: 0,Participant,Segment label,F1.50,F2.50,F3.50,F1.25,F2.25,F3.25,F1.75,F2.75,F3.75,Gender,Corpus,phone,Time,Interval,F1,F2,F3,t1_ph
0,p112,sil,1349.549295,2591.478429,3249.782912,1640.24642,2418.644376,3493.133471,703.933103,2157.879849,2892.455533,Male,CBAS,,,,,,,
1,p112,b,538.241047,1256.248187,2799.236009,884.928365,2273.697186,3419.016247,203.380184,972.14597,2675.148493,Male,CBAS,,,,,,,
2,p112,a,695.790976,1139.953282,2667.258621,689.324612,840.183174,2738.685252,672.665361,1108.588353,2626.355793,Male,CBAS,,,,,,,
3,p112,x,751.966861,1524.518652,2973.220585,702.809393,1206.953717,2902.09406,765.207925,1004.810302,2809.626524,Male,CBAS,,,,,,,
4,p112,o,423.620505,737.502557,2374.974921,438.92995,803.823366,2369.001274,442.314537,785.356913,2490.71866,Male,CBAS,,,,,,,


In order to later merge with the associated TextGrids, create a new column called `t1_ph` that contains the timestamp of the interval start.

In [20]:
dime["t1_ph"] = dime["Time"]-dime["Interval"]
dime.head()

Unnamed: 0,Participant,Segment label,F1.50,F2.50,F3.50,F1.25,F2.25,F3.25,F1.75,F2.75,F3.75,Gender,Corpus,phone,Time,Interval,F1,F2,F3,t1_ph
0,p112,sil,1349.549295,2591.478429,3249.782912,1640.24642,2418.644376,3493.133471,703.933103,2157.879849,2892.455533,Male,CBAS,,,,,,,
1,p112,b,538.241047,1256.248187,2799.236009,884.928365,2273.697186,3419.016247,203.380184,972.14597,2675.148493,Male,CBAS,,,,,,,
2,p112,a,695.790976,1139.953282,2667.258621,689.324612,840.183174,2738.685252,672.665361,1108.588353,2626.355793,Male,CBAS,,,,,,,
3,p112,x,751.966861,1524.518652,2973.220585,702.809393,1206.953717,2902.09406,765.207925,1004.810302,2809.626524,Male,CBAS,,,,,,,
4,p112,o,423.620505,737.502557,2374.974921,438.92995,803.823366,2369.001274,442.314537,785.356913,2490.71866,Male,CBAS,,,,,,,


### Import TextGrids

In [10]:
cbasdf = pd.DataFrame({
    'relpath': 'textgrids/cbas',
    'fname': ['p112.TextGrid',
              'p119.TextGrid',
              'p113.TextGrid',
              'p115.TextGrid',
              'p120.TextGrid',
              'p124.TextGrid'],
    'subject': ['p112', 'p119', 'p113', 'p115', 'p120', 'p124']
})

dimedf = pd.DataFrame({
    'relpath': 'textgrids/dime',
    'fname' : os.listdir("textgrids/dime")})
dimedf['subject'] = dimedf['fname'].apply(lambda x: x[:6])

tgdf = pd.concat([cbasdf, dimedf], ignore_index = True)

In [11]:
# inputs 

def tg2df(row):
    '''Load 'phone' and 'word' tiers from a textgrid and merge them.
    
    Parameters
    ----------
    
    row: named tuple
    A namedtuple as provided by `itertuples` that can be used to load a Praat
    textgrid from a path identified by row.relpath and row.fname. The textgrid is
    expected to have 'phone' and 'word' tiers.

    Returns
    -------
    
    mergedf: the merged dataframe.
    '''
    [wddf, phdf] = read_label(
        os.path.join(row.relpath, row.fname).replace("\\","/"),
        ftype='praat',
        tiers=['word', 'phone']
    )
    # Throw an error if tiers are not strictly hierarchical.
    # words contain phones
    assert(wddf.t1.isin(phdf.t1).all())
    assert(wddf.t2.isin(phdf.t2).all())
    
    # Add phone duration and speaker
    phdf['dur_ph'] = phdf.t2 - phdf.t1
    phdf['Participant'] = row.subject

    # Merge phone and word tiers.
    phwddf = pd.merge_asof(
        phdf.rename({'t1': 't1_ph', 't2': 't2_ph'}, axis='columns'),
        wddf.drop('fname', axis='columns') \
            .rename({'t1': 't1_wd', 't2': 't2_wd'}, axis='columns'),
        left_on='t1_ph',
        right_on='t1_wd'
    )

    # Add word-init and -final columns
    phwddf['is_wdinit_ph'] = phwddf.t1_ph == phwddf.t1_wd
    phwddf['is_wdfin_ph'] = phwddf.t2_ph == phwddf.t2_wd

    # Merge context tier and return the result.
    return phwddf

In [12]:
dflist = [tg2df(row) for row in tgdf.itertuples()]

In [13]:
alldf = pd.concat(dflist, ignore_index=True)

alldf.sample(10)

Unnamed: 0,t1_ph,t2_ph,phone,fname,dur_ph,Participant,t1_wd,t2_wd,word,is_wdinit_ph,is_wdfin_ph
509,143.148,143.248,m,textgrids/cbas/p112.TextGrid,0.1,p112,142.548,143.378,vandalismo,False,False
13038,1.064,1.112,e,textgrids/dime/s05132.TextGrid,0.048,s05132,1.02,1.112,de,False,True
2965,241.2,241.44,ng,textgrids/cbas/p113.TextGrid,0.24,p113,240.33,241.44,bien,False,True
19158,0.598,0.671,n,textgrids/dime/s05538.TextGrid,0.073,s05538,0.531,0.739,una,False,False
9602,3.158,3.161088,,textgrids/dime/s00219.TextGrid,0.003088,s00219,3.158,3.161088,,True,True
6638,2.856,2.955,m,textgrids/dime/s00103.TextGrid,0.099,s00103,2.784,3.544,importante,False,False
10893,0.181,0.221,e,textgrids/dime/s00245.TextGrid,0.04,s00245,0.181,0.712,evolucio_7n,True,False
11168,1.715,1.74,i,textgrids/dime/s00250.TextGrid,0.025,s00250,1.21,1.862,construccio_7n,False,False
5916,198.451,200.028,sp,textgrids/cbas/p124.TextGrid,1.577,p124,198.451,200.028,,True,True
11729,2.233,2.303,n,textgrids/dime/s05110.TextGrid,0.07,s05110,1.497,2.303,administracio_7n,False,True


Create cols `prev_ph` and `next_ph` containing previous and following phones.

In [14]:
alldf['prev_ph'] = alldf.phone.shift(1).fillna('')
alldf['next_ph'] = alldf.phone.shift(-1).fillna('')
alldf = alldf[alldf["phone"]!=""]
alldf = alldf.reset_index(drop = True)

In [15]:
alldf

Unnamed: 0,t1_ph,t2_ph,phone,fname,dur_ph,Participant,t1_wd,t2_wd,word,is_wdinit_ph,is_wdfin_ph,prev_ph,next_ph
0,0.000,0.710,sil,textgrids/cbas/p112.TextGrid,0.710,p112,0.000,0.710,,True,True,,b
1,0.710,0.820,b,textgrids/cbas/p112.TextGrid,0.110,p112,0.710,1.140,bajo,True,False,sil,a
2,0.820,0.920,a,textgrids/cbas/p112.TextGrid,0.100,p112,0.710,1.140,bajo,False,False,b,x
3,0.920,1.000,x,textgrids/cbas/p112.TextGrid,0.080,p112,0.710,1.140,bajo,False,False,a,o
4,1.000,1.140,o,textgrids/cbas/p112.TextGrid,0.140,p112,0.710,1.140,bajo,False,True,x,sp
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22107,5.368,5.432,k,textgrids/dime/s05650.TextGrid,0.064,s05650,5.185,5.622,sector,False,False,e,t
22108,5.432,5.525,t,textgrids/dime/s05650.TextGrid,0.093,s05650,5.185,5.622,sector,False,False,k,o
22109,5.525,5.584,o,textgrids/dime/s05650.TextGrid,0.059,s05650,5.185,5.622,sector,False,False,t,r(
22110,5.584,5.622,r(,textgrids/dime/s05650.TextGrid,0.038,s05650,5.185,5.622,sector,False,True,o,.sil


### Merge formant and TextGrid data

In [23]:
# later change dime to formants
data = dime.merge(alldf, how='left', left_on=['Participant','t1_ph'], right_on = ['Participant','t1_ph'])
data = data.drop(["phone_y"], axis = 1)
data = data.rename(columns = {"phone_x": "phone"})
data.head()

Unnamed: 0,Participant,phone,Time,Interval,F1,F2,F3,Gender,Corpus,t1_ph,t2_ph,fname,dur_ph,t1_wd,t2_wd,word,is_wdinit_ph,is_wdfin_ph,prev_ph,next_ph
0,s00101,e,0.069,0.002,2094.613890102079,2623.550627119573,3466.0182725549303,Male,DIMEx100,0.067,0.15,textgrids/dime/s00101.TextGrid,0.083,0.067,0.215,en,True,False,,n
1,s00101,e,0.071,0.004,2094.042033082855,2624.0054225415506,3465.6262512556896,Male,DIMEx100,0.067,0.15,textgrids/dime/s00101.TextGrid,0.083,0.067,0.215,en,True,False,,n
2,s00101,e,0.073,0.006,2093.470176063631,2624.4602179635285,3465.2342299564493,Male,DIMEx100,0.067,0.15,textgrids/dime/s00101.TextGrid,0.083,0.067,0.215,en,True,False,,n
3,s00101,e,0.075,0.008,2092.8983190444064,2624.915013385506,3464.842208657209,Male,DIMEx100,0.067,0.15,textgrids/dime/s00101.TextGrid,0.083,0.067,0.215,en,True,False,,n
4,s00101,e,0.077,0.01,2092.326462025182,2625.3698088074834,3464.450187357969,Male,DIMEx100,0.067,0.15,textgrids/dime/s00101.TextGrid,0.083,0.067,0.215,en,True,False,,n


Now to fix errors/inconsistencies:

In [27]:
# run this cell when I don't have cbas data yet
data = data.dropna()
data = data.reset_index(drop=True)

In [28]:
import re

# fix phones from txt file, remove + following some vowels
data["phone"] = data["phone"].apply(lambda x: re.sub("([aeiou])\+", "\1", x))

# replace `r(` with `rf` for consistency
data['word'] = data['word'].apply(lambda x: re.sub("r\(", "rf", x))

# fix notation in dimex corpus, where V_7 yields accented V
data['word'] = data['word'].apply(lambda x: re.sub("a_7", "á", x))
data['word'] = data['word'].apply(lambda x: re.sub("i_7", "í", x))
data['word'] = data['word'].apply(lambda x: re.sub("o_7", "ó", x))
data['word'] = data['word'].apply(lambda x: re.sub("u_7", "ú", x))
data['word'] = data['word'].apply(lambda x: re.sub("e_7", "é", x))

# fix tildas
data['word'] = data['word'].apply(lambda x: re.sub("n~", "ñ", x))

# remove phones `sp` and `.sil`
data = data[(data['phone'] != ".sil") & (data['phone'] != "sp")]

data = data.reset_index(drop = True)

Now that we have made use of the Participant naming system in DIMEx to combine the tg and formant data, we can rename the Participant column, dropping the indication of the task number.

In [29]:
# fix naming of participant col
data["Participant"] = data["Participant"].apply(lambda x: x[:4])

Now to isolate the vowels.

In [106]:
# remove rows not containing vowels
vowelsdf = data[(data['phone']=="a") | 
                 (data['phone']=="e") | 
                 (data['phone']=="i") | 
                 (data['phone']=="u") | 
                 (data['phone']=="o")]
vowelsdf = vowelsdf.reset_index(drop = True)
vowelsdf = vowelsdf.rename(columns = {"phone": "Vowel"})
len(vowelsdf)

114773

### Speech rate

First we will take the number of vowels a speaker produces to be equal to the number of syllables they utter. Then we will take the unique values from the `t1_wd` and `t2_wd` columns and subtract t2 from t1 to obtain an array of the duration of each word uttered. Then we will sum the durations of all words and divide the number of syllables by this value.

In [107]:
def speech_rate(df):
    import numpy as np

    Participant = []
    speech_rate = []

    for i in df.Participant.unique():
        data = df[df["Participant"]==i]
        syllables = len(data.Vowel)
        end_times = data["t2_wd"]
        start_times = data["t1_wd"]
        durations = np.subtract(end_times, start_times)
        duration = sum(durations)
        rate = syllables/duration
    
        Participant.append(i)
        speech_rate.append(rate)

    rates = {k:v for k,v in zip(Participant, speech_rate)}
    rates_df = pd.DataFrame.from_dict(rates, orient = "index", columns = ['Speech Rate'])
    rates_df = rates_df.rename_axis('Participant').reset_index()
    
    df = pd.merge(left = df, right = rates_df, on = 'Participant', how = 'outer')
    return df

In [108]:
vowelsdf = speech_rate(vowelsdf)

### Stress

The syltippy package (https://github.com/nur-ag/syltippy) will be used to generate syllabified (stress-indicated) outputs for each word found in the transcriptions. Then, the corresponding vowels in the TextGrid-formant dataframes will be marked as either stressed or unstressed.

In [109]:
# function takes into dictionary.txt file with cols `word` and `ipa`
# input formants df with cols `Participant`, `word`, `t1_wd`, and `t1_ph`

def get_stress(vowels):
    # import required packages
    import csv
    import numpy as np
    from syltippy import syllabize
    
    # def fxn to create stress column in dictionary
    def stress(word):
        syllables, stress = syllabize(word)
        return ','.join(s if stress != i else s.upper() for (i, s) in enumerate(syllables))
    
    # add column to dictionary
    vowels["stress_syll"] = vowels["word"].apply(lambda x : stress(str(x)))
    
    # create separate column to hold only the vowels in each word
    vowels["syll_vowels"] = vowels["stress_syll"].apply(lambda x: re.sub(r'[^,aeiouAEIOUáéíóúÁÉÍÓÚ]', '', x))
    
    # define function to return index of 'vowels' column with stress
    def is_stress(word):
        # convert to list
        word = word.split(",")
        stress_vowel = 0
        for syllable in word:
            if syllable.isupper():
                stress_vowel = word.index(syllable)
        return stress_vowel
    
    # create new column which gives vowel number in given word that has stress
    vowels["stress_vowel"] = vowels["syll_vowels"].apply(lambda x: is_stress(x))
    
    # determine index of vowel in df
    vowels["vowel_ind"] = vowels.groupby(["fname", "t1_wd"])["t1_ph"].apply(lambda x:x.astype('category').cat.codes).astype(int)

    #vowels = vowels.reset_index(drop = True)
    
    # add column to formants to indicate stress
    vowels["stress"] = np.where(vowels['stress_vowel'] == vowels['vowel_ind'], "stressed", "unstressed")
    
    # drop unnecesary columns
    vowels = vowels.drop(["syll_vowels", "stress_syll"], axis = 1)
       
    return vowels

In [110]:
vowelsdf = get_stress(vowelsdf)

In [111]:
vowelsdf.sample(50)

Unnamed: 0,Participant,Vowel,Time,Interval,F1,F2,F3,Gender,Corpus,t1_ph,...,t2_wd,word,is_wdinit_ph,is_wdfin_ph,prev_ph,next_ph,Speech Rate,stress_vowel,vowel_ind,stress
78357,s055,u,5.034,0.018,401.9905318563255,1276.4251934028937,2834.8470770108,Female,DIMEx100,5.016,...,5.141,curriculum,False,False,l,m,1.893287,3,3,stressed
34646,s051,a,2.89,0.068,535.0795242177925,1863.8628669339164,2416.691567061131,Female,DIMEx100,2.822,...,3.179,salamanca,False,False,m,n,2.114681,2,2,stressed
45109,s051,e,0.426,0.048,368.93239561022847,2248.1457817369824,3164.2137874078544,Female,DIMEx100,0.378,...,0.652,elementos,False,False,m,n,2.114681,2,2,stressed
14116,s001,a,0.154,0.094,570.6473545159527,1568.7787055395484,2450.3417166173817,Male,DIMEx100,0.06,...,0.537,adriano,True,False,,d,2.167846,1,0,unstressed
89502,s055,e,1.399,0.076,441.71344797586585,2028.9561367920976,3098.2580011859527,Female,DIMEx100,1.323,...,1.49,en,True,False,r(,n,1.893287,0,0,stressed
6321,s001,a,0.644,0.024,422.0363441101842,1578.2845350090354,2371.064662461464,Male,DIMEx100,0.62,...,0.954,organización,False,False,s,s,2.167846,4,2,unstressed
108012,s056,e,2.219,0.076,571.6697728197018,2464.222552121423,3349.6053520025544,Female,DIMEx100,2.143,...,2.285,adecuadamente,False,True,t,.sil,1.923002,4,3,unstressed
912,s001,e,1.929,0.018,396.005735015644,1686.958087757669,2595.91793717752,Male,DIMEx100,1.911,...,1.961,de,False,True,d,m,2.167846,0,0,stressed
90896,s055,o,0.551,0.074,534.2424633075324,1662.0408008990628,3076.7893289401927,Female,DIMEx100,0.477,...,0.624,pioneros,False,False,r(,s,1.893287,1,1,stressed
31679,s002,e,1.338,0.02,456.90434289023136,1832.825697020567,2912.1275134003126,Male,DIMEx100,1.318,...,1.961,educación,True,False,l,d,2.270729,3,0,unstressed


### Normalization of vowel formants

Because both male and female speakers are represented in this data set, the formant frequencies need to be normalized to minimized vocal tract length differences.

Following Johnson (2018), I will use the line-fitting Delta F Normalization method, which makes use of the entire vowel space. To do so, the average vowel space will be calculated for each participant, and then each F1 and F2 measurement will be divided by this value.

First we will calculate the average formant measurements over each vowel production, to get an estimate of the 'midpoint'.

In [124]:
vowelsdf["F1"] = vowelsdf["F1"].astype(float)
vowelsdf["F2"] = vowelsdf["F2"].astype(float)
vowelsdf["F3 "] = vowelsdf["F3 "].astype(float)
vowelsdf = vowelsdf.rename(columns={"F3 ": "F3"})

In [138]:
import numpy as np

def delta_f(vowels): # df as argument
    
    Participant = []
    ll = []
    
    for i in vowels.Participant.unique():
        data = vowels[vowels['Participant']==i]
        
        delta = np.mean([np.true_divide(data["F1"], 0.5), 
                        np.true_divide(data["F2"], 1.5), 
                        np.true_divide(data["F3"], 2.5)
                       ])
        
        Participant.append(i)
        ll.append(delta)
    
    deltas = {k:v for k,v in zip(Participant, ll)}
    delta_df = pd.DataFrame.from_dict(deltas, orient = "index", columns = ['Delta F'])
    delta_df = delta_df.rename_axis('Participant').reset_index()
        
    return(delta_df)

In [139]:
def normalization(vowels):
    delta_df = delta_f(vowels)
    
    deltas = delta_df.set_index("Participant")
    deltas = deltas.reset_index()
    
    vowels_normalized = pd.merge(left = vowels,
                                 right = deltas,
                                 on = 'Participant',
                                 how = 'outer')
    vowels_normalized['F1_norm'] = vowels_normalized['F1']/vowels_normalized['Delta F']
    vowels_normalized['F2_norm'] = vowels_normalized['F2']/vowels_normalized['Delta F']
    
    return(vowels_normalized)

In [140]:
vowels_norm = normalization(vowelsdf)
print(len(vowels_norm))
vowels_norm.sample(10)

114773


Unnamed: 0,Participant,Vowel,Time,Interval,F1,F2,F3,Gender,Corpus,t1_ph,...,prev_ph,next_ph,Speech Rate,stress_vowel,vowel_ind,stress,F1_mid,Delta F,F1_norm,F2_norm
32343,s051,a,0.28,0.02,823.513739,1845.030799,1971.334897,Female,DIMEx100,0.26,...,r(,f,2.114681,0,1,unstressed,857.44264,1148.993852,0.716726,1.60578
113440,s056,e,2.15,0.044,509.241245,1945.738349,2943.770344,Female,DIMEx100,2.106,...,d,b,1.923002,0,0,stressed,514.409162,1155.803643,0.440595,1.683451
22421,s002,e,0.114,0.036,446.710276,2044.109721,2868.952192,Male,DIMEx100,0.078,...,,Z,2.270729,0,0,stressed,418.106052,1048.729308,0.425954,1.94913
106256,s056,e,0.831,0.02,582.809665,1889.6591,3031.555921,Female,DIMEx100,0.811,...,d,l,1.923002,0,0,stressed,554.846485,1155.803643,0.504246,1.634931
74657,s055,u,2.915,0.032,403.890779,1603.536598,2788.974967,Female,DIMEx100,2.883,...,r(,s,1.893287,0,1,unstressed,341.875626,1157.70087,0.348873,1.385104
100722,s056,a,3.36,0.01,414.576179,2095.726196,3118.557632,Female,DIMEx100,3.35,...,k,s,1.923002,4,3,unstressed,476.000243,1155.803643,0.358691,1.81322
64597,s053,e,0.275,0.036,488.161468,1941.556191,2986.323601,Female,DIMEx100,0.239,...,m,r(,2.113555,0,1,unstressed,456.534262,1152.303798,0.42364,1.684934
44411,s051,e,1.167,0.026,498.589669,2083.421502,2828.720917,Female,DIMEx100,1.141,...,b,r(,2.114681,4,2,unstressed,477.855254,1148.993852,0.433936,1.813257
30108,s002,a,4.772,0.048,981.191064,1556.047274,2832.002903,Male,DIMEx100,4.724,...,m,.sil,2.270729,0,1,unstressed,894.684223,1048.729308,0.9356,1.483745
113160,s056,a,2.657,0.008,471.192532,688.771064,2142.094124,Female,DIMEx100,2.649,...,r(,l,1.923002,2,2,stressed,682.326833,1155.803643,0.407675,0.595924


For SS-ANOVA, duration of each vowel needs to be scaled from 0 to 1. To do this, create a new column `RTime` that is the result of `Interval` divided by `dur_ph`.

In [142]:
vowels_norm["RTime"] = vowels_norm["Interval"]/vowels_norm["dur_ph"]
vowels_norm.head()

Unnamed: 0,Participant,Vowel,Time,Interval,F1,F2,F3,Gender,Corpus,t1_ph,...,next_ph,Speech Rate,stress_vowel,vowel_ind,stress,F1_mid,Delta F,F1_norm,F2_norm,RTime
0,s001,e,0.069,0.002,2094.61389,2623.550627,3466.018273,Male,DIMEx100,0.067,...,n,2.167846,0,0,stressed,834.207032,977.31502,2.143233,2.684447,0.024096
1,s001,e,0.071,0.004,2094.042033,2624.005423,3465.626251,Male,DIMEx100,0.067,...,n,2.167846,0,0,stressed,834.207032,977.31502,2.142648,2.684913,0.048193
2,s001,e,0.073,0.006,2093.470176,2624.460218,3465.23423,Male,DIMEx100,0.067,...,n,2.167846,0,0,stressed,834.207032,977.31502,2.142063,2.685378,0.072289
3,s001,e,0.075,0.008,2092.898319,2624.915013,3464.842209,Male,DIMEx100,0.067,...,n,2.167846,0,0,stressed,834.207032,977.31502,2.141478,2.685843,0.096386
4,s001,e,0.077,0.01,2092.326462,2625.369809,3464.450187,Male,DIMEx100,0.067,...,n,2.167846,0,0,stressed,834.207032,977.31502,2.140893,2.686309,0.120482


Create a grouping factor that will uniquely identify each vowel produced by each speaker.

In [145]:
vowels_norm["unique"] = vowels_norm["fname"] + vowels_norm["t1_ph"].astype(str)

In [146]:
vowels_norm.to_csv("data/contour_norm.csv", index = False)