# Python Data Processing Basics for Acoustic Analysis

## Single speaker test
We’ll first walk through retrieving and merging the data for a single speaker, one step at a time, and then combine those steps into larger code chunks. Finally, we’ll put the process into practice by looping over every speaker with data stored in our “data” folder.

#### Step 0: Load in libraries

In [1]:
import pandas as pd
from pathlib import Path
from phonlab.utils import dir2df # must use pip to install
from audiolabel import read_label
import parselmouth as ps
from parselmouth.praat import call as pcall

#### Step 1: Save path to data directory, identify speaker directories

In [2]:
# get the path to larger folder containing your data
datadir = Path('./data').absolute()

# build a df listing each speaker subfolder (relpath) and the wav file inside it (fname)
# fnpat matches only .wav files, so that spkrdf contains each speaker name only once
spkrdf = dir2df(datadir, fnpat=r'\.wav$')
spkrdf

Unnamed: 0,relpath,fname
0,S01,S01_interview.wav
1,S02,S02_interview.wav
2,S03,S03_interview.wav
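The `fnpat` argument is a regular expression. As a rough illustration of what the pattern `\.wav$` selects, the sketch below applies it to a few filenames with Python's built-in `re` module. That `dir2df` searches filenames in this same way is an assumption here, and the two non-wav filenames are made up for the example.

```python
import re

# raw string, so the backslash reaches the regex engine untouched
pattern = re.compile(r'\.wav$')

# the second and third names are hypothetical, included only for contrast
names = ['S01_interview.wav', 'S01_interview.TextGrid', 'notes_wav.txt']

# keep only names that end in a literal ".wav"
wavs = [name for name in names if pattern.search(name)]
print(wavs)
```

Writing the pattern as a raw string (`r'...'`) keeps Python from trying to interpret `\.` as a string escape before the regex engine sees it.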


### Processing TextGrid annotations

#### Step 2: Extract phones and words tiers from TextGrid

- For the first speaker-specific folder (`spkrdf.head(1)`), store the phones tier in `phdf`
- Then store the words tier in `wrdf`
- The `with_suffix()` method swaps the `.wav` extension for `.TextGrid` to locate the matching TextGrid file
- The `tiers` argument uses the names you gave your Praat tiers to identify the correct ones
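To see what `with_suffix()` does on its own, here is a minimal sketch using only `pathlib`. The relative `data` folder stands in for `datadir`, and nothing is read from disk.

```python
from pathlib import Path

# stand-in for datadir / row.relpath / row.fname from the loop
wav_path = Path('data') / 'S01' / 'S01_interview.wav'

# with_suffix() replaces only the final extension; folder and stem are unchanged
tg_path = wav_path.with_suffix('.TextGrid')

print(tg_path.name)
```

This relies on each TextGrid sharing its wav file's name and folder, which is how the data in this walkthrough is organized.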

In [3]:
for row in spkrdf.head(1).itertuples():
    spkrfile = Path(datadir, row.relpath, row.fname).with_suffix('.TextGrid')
    [phdf, wrdf] = read_label([spkrfile], ftype='praat', tiers=['phones', 'words'])

# check that each df is as expected
phdf

Unnamed: 0,t1,t2,phones,fname
0,0.00,0.640,,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
1,0.64,0.700,AY1,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
2,0.70,0.800,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
3,0.80,0.840,T,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
4,0.84,0.870,IH1,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
...,...,...,...,...
10130,1332.84,1332.870,K,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
10131,1332.87,1332.900,AO1,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
10132,1332.90,1332.970,R,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
10133,1332.97,1333.000,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...


In [4]:
wrdf

Unnamed: 0,t1,t2,words,fname
0,0.00,0.640,,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
1,0.64,0.700,i,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
2,0.70,0.930,still,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
3,0.93,1.140,haven't,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
4,1.14,1.360,met,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
...,...,...,...,...
3777,1332.16,1332.410,course,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
3778,1332.41,1332.710,of,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
3779,1332.71,1332.840,,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...
3780,1332.84,1333.000,course,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...


#### Step 3: Subsetting the phones dataframe
- Note that `copy()` makes the subset an independent DataFrame rather than a view of the original `phdf`, which avoids pandas' `SettingWithCopyWarning` when we add columns to it

In [5]:
# remove empty segments
phdf = phdf[phdf['phones']!=''].copy()

# add phone duration column
phdf['phone_dur'] = phdf['t2']-phdf['t1']

# add col for previous phone
phdf['prev']=phdf['phones'].shift() 

# add col for following phone
phdf['nxt']=phdf['phones'].shift(-1)

# keep only relevant phones, remove short tokens
phdf = phdf[phdf['phones'].isin(['S', 'SH']) & (phdf['phone_dur'] >= 0.05)]

# check updated df - should be no empty phone segments or segments <0.05s
phdf

Unnamed: 0,t1,t2,phones,fname,phone_dur,prev,nxt
2,0.70,0.80,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.10,AY1,T
48,13.76,13.99,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.23,Z,T
81,23.73,24.04,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.31,EH1,N
120,29.90,30.04,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.14,AE1,T
130,30.65,30.72,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.07,AE1,T
...,...,...,...,...,...,...,...
10011,1304.60,1304.69,SH,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.09,IH1,P
10042,1306.78,1307.01,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.23,T,N
10061,1309.64,1309.79,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.15,OW1,AE1
10068,1312.10,1312.16,S,/Users/ambergalvano/Desktop/Py-Data-Acoustics/...,0.06,EH1,DH
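The order of operations in Step 3 matters: `prev` and `nxt` are computed before the dataframe is narrowed to sibilants, so they refer to the neighboring phones in the recording rather than to the neighboring sibilant. A minimal sketch on a toy phones tier (the four rows are copied from the top of the `phdf` output above):

```python
import pandas as pd

# toy stand-in for phdf
toy = pd.DataFrame({
    't1': [0.00, 0.64, 0.70, 0.80],
    't2': [0.64, 0.70, 0.80, 0.84],
    'phones': ['', 'AY1', 'S', 'T'],
})

toy = toy[toy['phones'] != ''].copy()   # drop empty intervals first
toy['phone_dur'] = toy['t2'] - toy['t1']
toy['prev'] = toy['phones'].shift()     # label of the row above
toy['nxt'] = toy['phones'].shift(-1)    # label of the row below

# only now narrow to the phones of interest
sibs = toy[toy['phones'].isin(['S', 'SH']) & (toy['phone_dur'] >= 0.05)]
print(sibs[['phones', 'prev', 'nxt']])
```

One consequence worth keeping in mind: because empty intervals are removed before shifting, the `prev` or `nxt` of a token next to a pause is the phone on the far side of that pause.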


#### Step 4: Merge phones and words dfs
- Note we specify which columns from each df to retain to avoid duplicates or unwanted columns

In [6]:
# match each phone to the word whose start time is the latest one at or before the phone's start time
tg = pd.merge_asof(
        phdf[['t1', 't2', 'phones', 'phone_dur', 'prev', 'nxt']],               
        wrdf[['t1', 'words']], 
        on='t1', 
        suffixes=['_ph', '_wd'] # in case there are duplicates
    )

# check merged df is same length and has only specified columns
tg

Unnamed: 0,t1,t2,phones,phone_dur,prev,nxt,words
0,0.70,0.80,S,0.10,AY1,T,still
1,13.76,13.99,S,0.23,Z,T,student
2,23.73,24.04,S,0.31,EH1,N,yes
3,29.90,30.04,S,0.14,AE1,T,past
4,30.65,30.72,S,0.07,AE1,T,past
...,...,...,...,...,...,...,...
328,1304.60,1304.69,SH,0.09,IH1,P,wish
329,1306.78,1307.01,S,0.23,T,N,that's
330,1309.64,1309.79,S,0.15,OW1,AE1,gross
331,1312.10,1312.16,S,0.06,EH1,DH,yes
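By default, `pd.merge_asof` looks backward: each left-hand row is paired with the last right-hand row whose key is less than or equal to its own. That is the behavior we want, since a phone belongs to the word that has already started. The toy comparison below contrasts this with `direction='nearest'`; the phone at 0.90 s is made up for the illustration.

```python
import pandas as pd

# both frames must be sorted on the merge key
phones = pd.DataFrame({'t1': [0.70, 0.90], 'phones': ['S', 'L']})
words = pd.DataFrame({'t1': [0.64, 0.70, 0.93],
                      'words': ['i', 'still', "haven't"]})

# default direction='backward': last word starting at or before the phone
backward = pd.merge_asof(phones, words, on='t1')

# direction='nearest': closest word start in either direction
nearest = pd.merge_asof(phones, words, on='t1', direction='nearest')

print(backward['words'].tolist())
print(nearest['words'].tolist())
```

With `'nearest'`, a word-final phone could be assigned to the following word whenever that word's start time happens to be closer, so the default is the safer choice here.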


#### Step 5: Add in metadata

In [7]:
# add speaker column at the front of the df
tg.insert(0, 'speaker', row.relpath)

# add column for name of current recording as second column
tg.insert(1, 'recording', row.fname)

tg.head()

Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words
0,S01,S01_interview.wav,0.7,0.8,S,0.1,AY1,T,still
1,S01,S01_interview.wav,13.76,13.99,S,0.23,Z,T,student
2,S01,S01_interview.wav,23.73,24.04,S,0.31,EH1,N,yes
3,S01,S01_interview.wav,29.9,30.04,S,0.14,AE1,T,past
4,S01,S01_interview.wav,30.65,30.72,S,0.07,AE1,T,past


#### Putting it all together: Loading in your TextGrid data
Test compilation of all annotations and basic metadata for a single speaker

In [8]:
for row in spkrdf.head(1).itertuples():
    spkrfile = Path(datadir, row.relpath, row.fname)
    [phdf, wrdf] = read_label(spkrfile.with_suffix('.TextGrid'), ftype='praat', 
        tiers=['phones', 'words'])
    
    phdf = phdf[phdf['phones']!=''].copy()
    phdf['phone_dur'] = phdf['t2']-phdf['t1'] 
    phdf['prev']=phdf['phones'].shift()
    phdf['nxt']=phdf['phones'].shift(-1)
    phdf = phdf[phdf['phones'].isin(['S', 'SH']) & (phdf['phone_dur'] >= 0.05)] 
    
    tg = pd.merge_asof(
        phdf[['t1', 't2', 'phones', 'phone_dur', 'prev', 'nxt']],               
        wrdf[['t1', 'words']], 
        on='t1', 
        suffixes=['_ph', '_wd']
    )
    
    tg.insert(0, 'speaker', row.relpath)
    tg.insert(1, 'recording', row.fname) 
    
    print('Done with TG')

tg.head()

Done with TG


Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words
0,S01,S01_interview.wav,0.7,0.8,S,0.1,AY1,T,still
1,S01,S01_interview.wav,13.76,13.99,S,0.23,Z,T,student
2,S01,S01_interview.wav,23.73,24.04,S,0.31,EH1,N,yes
3,S01,S01_interview.wav,29.9,30.04,S,0.14,AE1,T,past
4,S01,S01_interview.wav,30.65,30.72,S,0.07,AE1,T,past


### Processing audio files

#### Step 6: Use wav path to create sound object

In [9]:
# get path to wav files
wav = datadir / row.relpath / row.fname
    
# use path name to create sound object 
sound = ps.Sound(str(wav))

#### Step 7: Filter out voicing
Skip this step if measuring voicing or other low-frequency variables

In [10]:
# Set low frequency threshold
voicing_hz_filter = 300

# create voicing-filtered sound object (this step could take some time)
sound = pcall(sound, 'Filter (stop Hann band)...', 0, voicing_hz_filter, 100)

#### Step 8: Use lambda function to extract spectral moments
Note that you will need to have generated the `tg` dataframe first

The `.get_...()` method can be swapped for any other Praat spectrum query, depending on the measurement of interest.

In [11]:
# convert sound object to spectrum
# use lambda function to get spectral moments between t1 and t2 for each token
tg['COG'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
                    x.t2-0.01).to_spectrum().get_center_of_gravity(), axis=1)

tg['SD'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
                    x.t2-0.01).to_spectrum().get_standard_deviation(), axis=1)

tg['skew'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
                    x.t2-0.01).to_spectrum().get_skewness(), axis=1)

tg['kurtosis'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
                    x.t2-0.01).to_spectrum().get_kurtosis(), axis=1)

tg.head()

Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words,COG,SD,skew,kurtosis
0,S01,S01_interview.wav,0.7,0.8,S,0.1,AY1,T,still,6843.643604,2399.00898,-1.40155,2.771497
1,S01,S01_interview.wav,13.76,13.99,S,0.23,Z,T,student,6441.362865,2262.10359,-0.709024,2.879648
2,S01,S01_interview.wav,23.73,24.04,S,0.31,EH1,N,yes,6824.1875,2231.494502,-1.464468,3.785619
3,S01,S01_interview.wav,29.9,30.04,S,0.14,AE1,T,past,5552.486811,3521.406978,-0.418952,-0.738118
4,S01,S01_interview.wav,30.65,30.72,S,0.07,AE1,T,past,4701.829887,3671.126603,-0.022515,-1.391621
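Each of the four `apply()` calls above extracts the same stretch of audio and converts it to a spectrum again. One possible refactor, sketched here, builds the spectrum once per token and returns all four moments together. The helper name `spectral_moments` and its `pad` parameter are hypothetical; the only methods it calls on the sound and spectrum objects are the ones already used in this step.

```python
import pandas as pd

def spectral_moments(row, sound, pad=0.01):
    """Return four spectral moments for one token, building the spectrum once."""
    # trim `pad` seconds from each edge, as in the cells above
    spectrum = sound.extract_part(row.t1 + pad, row.t2 - pad).to_spectrum()
    return pd.Series({
        'COG': spectrum.get_center_of_gravity(),
        'SD': spectrum.get_standard_deviation(),
        'skew': spectrum.get_skewness(),
        'kurtosis': spectrum.get_kurtosis(),
    })

# usage, assuming `tg` and `sound` exist as created in the earlier steps:
# moments = tg.apply(spectral_moments, axis=1, sound=sound)
# tg = pd.concat([tg, moments], axis=1)
```

Because the function returns a `Series`, `apply(..., axis=1)` yields a dataframe with one column per moment, which can then be joined back onto `tg`.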


#### Putting it all together: Loading in your audio data
Read in audio files and create spectrum object for a single speaker

In [12]:
voicing_hz_filter = 300

for row in spkrdf.head(1).itertuples(): 
    # get path to wav files
    wav = datadir / row.relpath / row.fname
    
    # use path name to create sound object 
    sound = ps.Sound(str(wav))

    # filter voicing out
    sound = pcall(sound, 'Filter (stop Hann band)...', 0, voicing_hz_filter, 100)    
    
    # convert sound object to spectrum
    # use lambda function to get spectral moments between t1 and t2 for each token
    tg['COG'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_center_of_gravity(), axis=1)

    tg['SD'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_standard_deviation(), axis=1)

    tg['skew'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_skewness(), axis=1)

    tg['kurtosis'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_kurtosis(), axis=1)

    print('Done with wav')

# check that the first 5 rows of compiled df look good
tg.head()

Done with wav


Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words,COG,SD,skew,kurtosis
0,S01,S01_interview.wav,0.7,0.8,S,0.1,AY1,T,still,6843.643604,2399.00898,-1.40155,2.771497
1,S01,S01_interview.wav,13.76,13.99,S,0.23,Z,T,student,6441.362865,2262.10359,-0.709024,2.879648
2,S01,S01_interview.wav,23.73,24.04,S,0.31,EH1,N,yes,6824.1875,2231.494502,-1.464468,3.785619
3,S01,S01_interview.wav,29.9,30.04,S,0.14,AE1,T,past,5552.486811,3521.406978,-0.418952,-0.738118
4,S01,S01_interview.wav,30.65,30.72,S,0.07,AE1,T,past,4701.829887,3671.126603,-0.022515,-1.391621


### Merging TG and wav data in single loop
This could take a few moments.

In [13]:
datadir = Path('./data').absolute()
spkrdf = dir2df(datadir, fnpat=r'\.wav$')
tglist = []
voicing_hz_filter = 300

for row in spkrdf.head(1).itertuples():
    spkrfile = Path(datadir, row.relpath, row.fname)
    [phdf, wrdf] = read_label(spkrfile.with_suffix('.TextGrid'), ftype='praat', 
        tiers=['phones', 'words'])
    
    phdf = phdf[phdf['phones']!=''].copy()
    phdf['phone_dur'] = phdf['t2']-phdf['t1'] 
    phdf['prev']=phdf['phones'].shift()
    phdf['nxt']=phdf['phones'].shift(-1)
    phdf = phdf[phdf['phones'].isin(['S', 'SH']) & (phdf['phone_dur'] >= 0.05)] 
    
    tg = pd.merge_asof(
        phdf[['t1', 't2', 'phones', 'phone_dur', 'prev', 'nxt']],               
        wrdf[['t1', 'words']], 
        on='t1', 
        suffixes=['_ph', '_wd']
    )
    
    tg.insert(0, 'speaker', row.relpath)
    tg.insert(1, 'recording', row.fname) 
    
    print('Done with TG')

    # wav portion
    wav = datadir / row.relpath / row.fname
    
    sound = ps.Sound(str(wav))

    sound = pcall(sound, 'Filter (stop Hann band)...', 0, voicing_hz_filter, 100)    
    
    tg['COG'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_center_of_gravity(), axis=1)

    tg['SD'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_standard_deviation(), axis=1)

    tg['skew'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_skewness(), axis=1)

    tg['kurtosis'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_kurtosis(), axis=1)

    print('Done with wav')

print('Done')
tg.head()

Done with TG
Done with wav
Done


Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words,COG,SD,skew,kurtosis
0,S01,S01_interview.wav,0.7,0.8,S,0.1,AY1,T,still,6843.643604,2399.00898,-1.40155,2.771497
1,S01,S01_interview.wav,13.76,13.99,S,0.23,Z,T,student,6441.362865,2262.10359,-0.709024,2.879648
2,S01,S01_interview.wav,23.73,24.04,S,0.31,EH1,N,yes,6824.1875,2231.494502,-1.464468,3.785619
3,S01,S01_interview.wav,29.9,30.04,S,0.14,AE1,T,past,5552.486811,3521.406978,-0.418952,-0.738118
4,S01,S01_interview.wav,30.65,30.72,S,0.07,AE1,T,past,4701.829887,3671.126603,-0.022515,-1.391621


## Looping through speakers
In most use cases, whether we are interested in phonetic variation, social variation, or both, we will want to compare our measurements for sounds of interest across multiple speakers. So long as your data for each speaker is stored in its own folder within `data`, we only need to add a few lines to make this happen. Namely:

- initialize a `tglist` to store our data for each speaker
- remove the `head()` method from our call to `spkrdf` to loop through the entirety of `spkrdf`
- add `print()` statements to let us know what speaker we're on and what step we're on for that speaker
- use `concat()` as a last step, to append each df in `tglist` into a single `fulldf`
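The accumulate-then-concatenate pattern in the list above can be tried on toy data before running it on real recordings. In this sketch the speaker names stand in for the rows of `spkrdf`, and the per-speaker values are made up.

```python
import pandas as pd

tglist = []

for speaker in ['S01', 'S02']:          # stand-ins for rows of spkrdf
    print(speaker)                      # progress message
    # made-up per-speaker results, standing in for the real tg
    tg = pd.DataFrame({'speaker': speaker, 't1': [0.70, 13.76]})
    tglist.append(tg)

# stack the per-speaker dfs and renumber the rows from 0
fulldf = pd.concat(tglist, ignore_index=True)
print(fulldf)
```

Appending to a list and concatenating once at the end is generally cheaper than growing a dataframe inside the loop, because each `concat()` call copies the data.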

This code could take some time to execute.

In [14]:
spkrdf = dir2df(datadir, fnpat=r'\.wav$')
tglist = []
voicing_hz_filter = 300

for row in spkrdf.itertuples():
    print(row.relpath)

    # TextGrid portion
    spkrfile = Path(datadir, row.relpath, row.fname)
    [phdf, wrdf] = read_label(spkrfile.with_suffix('.TextGrid'), ftype='praat', 
        tiers=['phones', 'words'])
    
    phdf = phdf[phdf['phones']!=''].copy()
    phdf['phone_dur'] = phdf['t2']-phdf['t1'] 
    phdf['prev']=phdf['phones'].shift()
    phdf['nxt']=phdf['phones'].shift(-1)
    phdf = phdf[phdf['phones'].isin(['S', 'SH']) & (phdf['phone_dur'] >= 0.05)] 
    
    tg = pd.merge_asof(
        phdf[['t1', 't2', 'phones', 'phone_dur', 'prev', 'nxt']],               
        wrdf[['t1', 'words']], 
        on='t1', 
        suffixes=['_ph', '_wd']
    )
    
    tg.insert(0, 'speaker', row.relpath)
    tg.insert(1, 'recording', row.fname)
    
    print('Done with TG')

    # wav portion
    wav = datadir / row.relpath / row.fname
    sound = ps.Sound(str(wav))

    sound = pcall(sound, 'Filter (stop Hann band)...', 0, voicing_hz_filter, 100)    
    
    tg['COG'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_center_of_gravity(), axis=1)

    tg['SD'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_standard_deviation(), axis=1)

    tg['skew'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_skewness(), axis=1)

    tg['kurtosis'] = tg.apply(lambda x: sound.extract_part(x.t1+0.01, 
        x.t2-0.01).to_spectrum().get_kurtosis(), axis=1)

    print('Done with wav')

    tglist.append(tg.reset_index(drop=True))
    
fulldf = pd.concat(tglist, axis='rows', ignore_index=True)
    
print('Done')
fulldf

S01
Done with TG
Done with wav
S02
Done with TG
Done with wav
S03
Done with TG
Done with wav
Done


Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words,COG,SD,skew,kurtosis
0,S01,S01_interview.wav,0.7000,0.8000,S,0.10,AY1,T,still,6843.643604,2399.008980,-1.401550,2.771497
1,S01,S01_interview.wav,13.7600,13.9900,S,0.23,Z,T,student,6441.362865,2262.103590,-0.709024,2.879648
2,S01,S01_interview.wav,23.7300,24.0400,S,0.31,EH1,N,yes,6824.187500,2231.494502,-1.464468,3.785619
3,S01,S01_interview.wav,29.9000,30.0400,S,0.14,AE1,T,past,5552.486811,3521.406978,-0.418952,-0.738118
4,S01,S01_interview.wav,30.6500,30.7200,S,0.07,AE1,T,past,4701.829887,3671.126603,-0.022515,-1.391621
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,S03,S03_interview.wav,1543.6415,1543.7715,S,0.13,AE1,OW1,so,3080.268516,4016.028948,1.080176,-0.477866
1096,S03,S03_interview.wav,1544.6515,1544.7715,S,0.12,L,OW0,also,4168.001556,4014.565520,0.375728,-1.368350
1097,S03,S03_interview.wav,1545.8515,1545.9615,S,0.11,IY0,P,specific,8253.935980,1950.407950,0.254103,2.525659
1098,S03,S03_interview.wav,1546.1015,1546.2515,S,0.15,AH0,IH1,specific,8932.145824,2474.852918,-1.538584,3.868505


In [15]:
fulldf[fulldf['speaker'] == 'S02']

Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,words,COG,SD,skew,kurtosis
333,S02,S02_interview.wav,9.21,9.31,S,0.10,AH0,AH0,semester,6301.049111,1757.248954,-0.743910,5.110758
334,S02,S02_interview.wav,9.40,9.51,S,0.11,EH1,T,semester,6292.591555,2018.731624,-0.936699,2.676309
335,S02,S02_interview.wav,10.21,10.31,S,0.10,V,T,starting,6814.782650,1642.943453,-0.620201,4.675261
336,S02,S02_interview.wav,21.06,21.16,S,0.10,N,P,span,6818.461623,1643.244091,-0.163057,6.053256
337,S02,S02_interview.wav,25.12,25.20,S,0.08,N,L,romance,7355.760990,1735.904500,-0.597158,4.588744
...,...,...,...,...,...,...,...,...,...,...,...,...,...
872,S02,S02_interview.wav,1717.15,1717.23,SH,0.08,P,AH0,misconception,3791.419917,2353.482655,1.153082,1.765094
873,S02,S02_interview.wav,1719.00,1719.08,S,0.08,K,P,experienced,7005.230209,2220.404713,0.261930,2.176154
874,S02,S02_interview.wav,1719.41,1719.49,S,0.08,N,T,experienced,6457.466212,1884.376495,-0.135447,4.527180
875,S02,S02_interview.wav,1719.54,1719.77,S,0.23,T,OW1,so,6750.270219,1584.384413,-0.855103,2.566215


## Adding in additional metadata
Adding coarser-grained phonetic categorizations

#### Step 12: Add columns for previous and following vowel height

Check the unique values for previous and following sounds; use them to populate the dictionaries below.

In [16]:
fulldf['prev'].unique()

array(['AY1', 'Z', 'EH1', 'AE1', 'D', 'N', 'V', 'K', 'AH1', 'OW1', 'T',
       'S', 'IH2', 'AH0', 'NG', 'IY0', 'EY1', 'ER1', 'L', 'F', 'M', 'IH0',
       'IH1', 'IY1', 'ER0', 'R', 'UW1', 'P', 'AA1', 'EH0', 'OW0', 'AW1',
       'G', 'spn', 'OW2', 'AE2', 'JH', 'UH1', 'AY2', 'UW0', 'TH', 'EH2'],
      dtype=object)

In [17]:
fulldf['nxt'].unique()

array(['T', 'N', 'IY1', 'AH0', 'OW2', 'L', 'Y', 'AY1', 'M', 'OW1', 'EH1',
       'UW1', 'W', 'R', 'IH2', 'OW0', 'HH', 'spn', 'IH0', 'AH1', 'P', 'S',
       'B', 'UH1', 'AE1', 'IH1', 'EY1', 'ER0', 'EH0', 'V', 'DH', 'K',
       'IY0', 'EH2', 'JH', 'F', 'AA1', 'AW1', 'AE2', 'D', 'ER1', 'CH',
       'AO1', 'G', 'UW0'], dtype=object)


Save `prev` and `nxt` without stress markings to `prev_short` and `nxt_short`.

In [18]:
fulldf.insert(8, 'prev_short', fulldf['prev'].str[:2])
fulldf.insert(9, 'nxt_short', fulldf['nxt'].str[:2])
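Slicing with `.str[:2]` works because ARPAbet vowel labels in this dataset are two letters plus a stress digit. It also shortens any label longer than two characters, such as `spn`. A possible alternative, sketched below, removes only a trailing digit. The label `AXR0` is hypothetical, included to show how a three-letter vowel symbol would behave.

```python
import pandas as pd

labels = pd.Series(['AY1', 'S', 'spn', 'AXR0'])

# approach used above: keep the first two characters
sliced = labels.str[:2]

# alternative: strip a single trailing digit, leave everything else alone
stripped = labels.str.replace(r'\d$', '', regex=True)

print(sliced.tolist())
print(stripped.tolist())
```

Either approach gives the same result for the two-letter vowels and the consonants shown in the outputs here; the difference only matters for longer labels.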


**Example 1:** Single attribute of previous vowels only (height)
- Write a dictionary with vowel heights as *keys* and a list of the vowels belonging to each height as the *values*
- Write a function that looks each label up in the dictionary, then `apply()` it to `prev_short` to generate `prev_height`

In [19]:
# Establish your desired grouping of phones by height (this groups diphthongs by the SECOND vowel quality)
# Note this dataset uses ARPAbet rather than IPA
vowel_height_prev = {
    'high': ['IY', 'UW', 'IH', 'UH', 'IX', 'UX', 'EY', 'OW', 'OY', 'AY', 'AW'], 
    'mid': ['EH', 'AO', 'AX', 'AH', 'ER', 'AXR'], 
    'low': ['AE', 'AA'] 
}

# write function that looks up the height of a single phone label in vowel_height_prev
def get_prev_height(vowel): 
    for height, vowels in vowel_height_prev.items():
        if vowel in vowels: # check whether the label is listed under this height
            return height # if yes, return its height
    return 'consonant' # if no height lists it, return 'consonant'


Let's try this on a copy of `fulldf` so that we can run the more complicated code in Example 2 without conflicts:

In [20]:
fulldf_temp = fulldf.copy()

fulldf_temp.insert(10, 'prev_height', fulldf_temp['prev_short'].apply(get_prev_height))
fulldf_temp

Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,prev_short,nxt_short,prev_height,words,COG,SD,skew,kurtosis
0,S01,S01_interview.wav,0.7000,0.8000,S,0.10,AY1,T,AY,T,high,still,6843.643604,2399.008980,-1.401550,2.771497
1,S01,S01_interview.wav,13.7600,13.9900,S,0.23,Z,T,Z,T,consonant,student,6441.362865,2262.103590,-0.709024,2.879648
2,S01,S01_interview.wav,23.7300,24.0400,S,0.31,EH1,N,EH,N,mid,yes,6824.187500,2231.494502,-1.464468,3.785619
3,S01,S01_interview.wav,29.9000,30.0400,S,0.14,AE1,T,AE,T,low,past,5552.486811,3521.406978,-0.418952,-0.738118
4,S01,S01_interview.wav,30.6500,30.7200,S,0.07,AE1,T,AE,T,low,past,4701.829887,3671.126603,-0.022515,-1.391621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,S03,S03_interview.wav,1543.6415,1543.7715,S,0.13,AE1,OW1,AE,OW,low,so,3080.268516,4016.028948,1.080176,-0.477866
1096,S03,S03_interview.wav,1544.6515,1544.7715,S,0.12,L,OW0,L,OW,consonant,also,4168.001556,4014.565520,0.375728,-1.368350
1097,S03,S03_interview.wav,1545.8515,1545.9615,S,0.11,IY0,P,IY,P,high,specific,8253.935980,1950.407950,0.254103,2.525659
1098,S03,S03_interview.wav,1546.1015,1546.2515,S,0.15,AH0,IH1,AH,IH,mid,specific,8932.145824,2474.852918,-1.538584,3.868505
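As a possible alternative to the lookup function in Example 1, the height dictionary can be inverted into a flat label-to-height mapping and passed to `map()`. This is a sketch rather than a replacement for the approach above; the name `height_of` is made up, and the four test labels are the `prev_short` values from the first rows of the output.

```python
import pandas as pd

vowel_height_prev = {
    'high': ['IY', 'UW', 'IH', 'UH', 'IX', 'UX', 'EY', 'OW', 'OY', 'AY', 'AW'],
    'mid': ['EH', 'AO', 'AX', 'AH', 'ER', 'AXR'],
    'low': ['AE', 'AA'],
}

# invert: one entry per vowel, pointing at its height
height_of = {vowel: height
             for height, vowels in vowel_height_prev.items()
             for vowel in vowels}

prev_short = pd.Series(['AY', 'Z', 'EH', 'AE'])

# labels missing from the mapping come back as NaN, then become 'consonant'
prev_height = prev_short.map(height_of).fillna('consonant')
print(prev_height.tolist())
```

One difference to be aware of: if a vowel were listed under two heights, the inverted dictionary would keep the last one, whereas the loop in `get_prev_height` returns the first.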



**Example 2**: Multiple attributes, multiple columns (height and backness)

Here, we use nested dictionaries to specify height and backness groupings separately for previous and following vowels.
- If your desired analysis is simpler, you could just specify one set of height and backness groupings to apply to vowels and diphthongs in any position.

In [21]:
# save combined dictionary as vowel_attributes
vowel_attributes = {
    # height dictionaries
    'height': {
        # height groupings for previous vowels
        'prev_heights': {
            'high': ['IY', 'UW', 'IH', 'UH', 'IX', 'UX', 'EY', 'OW', 'OY', 'AY', 'AW'],
            'mid': ['EH', 'AO', 'AX', 'AH', 'ER', 'AXR'],
            'low': ['AE', 'AA']
        },
        # height groupings for next vowels
        'nxt_heights': {
            'high': ['IY', 'IH', 'UH', 'IX', 'UX', 'EY'],
            'mid': ['EH', 'AO', 'AX', 'AH', 'ER', 'AXR', 'UW', 'EY', 'OW', 'OY'],
            'low': ['AE', 'AA', 'AY', 'AW']
        }
    },
    # backness dictionaries
    'backness': {
        # backness groupings for previous vowels
        'prev_backs': {
            'front': ['AE', 'AY', 'EH', 'ER', 'EY', 'IH', 'IY', 'OY'], 
            'central': ['AX', 'AXR', 'IX', 'UX'], 
            'back': ['AA', 'AH', 'AO', 'AW', 'OW', 'UH', 'UW']
        },
        # backness groupings for next vowels
        'nxt_backs': {
            'front': ['AE', 'EH', 'ER', 'EY', 'IH', 'IY'], 
            'central': ['AW', 'AY', 'AX', 'AXR', 'IX', 'UX'], 
            'back': ['AA', 'AH', 'AO', 'OW', 'UH', 'UW', 'OY']
        }
    }
}
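With hand-written groupings like these, it is easy to list a vowel under two categories by accident. In `nxt_heights` above, for instance, `'EY'` appears under both `'high'` and `'mid'`, and a lookup that returns the first match would class it as high. A small check along the following lines can flag such cases; `find_overlaps` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def find_overlaps(grouping):
    """Return labels that appear in more than one category of a grouping dict."""
    # set() so a label repeated within one category is counted once
    counts = Counter(v for vowels in grouping.values() for v in set(vowels))
    return sorted(v for v, n in counts.items() if n > 1)

# copied from vowel_attributes['height']['nxt_heights']
nxt_heights = {
    'high': ['IY', 'IH', 'UH', 'IX', 'UX', 'EY'],
    'mid': ['EH', 'AO', 'AX', 'AH', 'ER', 'AXR', 'UW', 'EY', 'OW', 'OY'],
    'low': ['AE', 'AA', 'AY', 'AW'],
}

print(find_overlaps(nxt_heights))  # ['EY']
```

Whether a flagged label should move to one category or the other is an analytic decision; the check only points out where that decision still needs to be made.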

In [22]:
# function that asks for a vowel, what attribute we want, and which grouping we want
def get_vowel_attribute(vowel, attribute_type, grouping):
    attribute_dict = vowel_attributes[attribute_type][grouping] # pick out the relevant subdictionary
    for attribute, vowels in attribute_dict.items():
        if vowel in vowels: # if our vowel is in the subdictionary
            return attribute # return its grouping for the attribute we want
    return 'consonant' # if not in the subdict, return "consonant"

# use lambda to loop through each vowel in the given column and apply the function
# use insert to specify the indices at which to insert the new columns
fulldf.insert(10, 'prev_height', fulldf['prev_short'].apply(lambda vowel: 
                                        get_vowel_attribute(vowel, 'height', 'prev_heights')))
fulldf.insert(11, 'nxt_height', fulldf['nxt_short'].apply(lambda vowel: 
                                        get_vowel_attribute(vowel, 'height', 'nxt_heights')))
fulldf.insert(12, 'prev_back', fulldf['prev_short'].apply(lambda vowel: 
                                        get_vowel_attribute(vowel, 'backness', 'prev_backs')))
fulldf.insert(13, 'nxt_back', fulldf['nxt_short'].apply(lambda vowel: 
                                        get_vowel_attribute(vowel, 'backness', 'nxt_backs')))

fulldf

Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,prev_short,nxt_short,prev_height,nxt_height,prev_back,nxt_back,words,COG,SD,skew,kurtosis
0,S01,S01_interview.wav,0.7000,0.8000,S,0.10,AY1,T,AY,T,high,consonant,front,consonant,still,6843.643604,2399.008980,-1.401550,2.771497
1,S01,S01_interview.wav,13.7600,13.9900,S,0.23,Z,T,Z,T,consonant,consonant,consonant,consonant,student,6441.362865,2262.103590,-0.709024,2.879648
2,S01,S01_interview.wav,23.7300,24.0400,S,0.31,EH1,N,EH,N,mid,consonant,front,consonant,yes,6824.187500,2231.494502,-1.464468,3.785619
3,S01,S01_interview.wav,29.9000,30.0400,S,0.14,AE1,T,AE,T,low,consonant,front,consonant,past,5552.486811,3521.406978,-0.418952,-0.738118
4,S01,S01_interview.wav,30.6500,30.7200,S,0.07,AE1,T,AE,T,low,consonant,front,consonant,past,4701.829887,3671.126603,-0.022515,-1.391621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,S03,S03_interview.wav,1543.6415,1543.7715,S,0.13,AE1,OW1,AE,OW,low,mid,front,back,so,3080.268516,4016.028948,1.080176,-0.477866
1096,S03,S03_interview.wav,1544.6515,1544.7715,S,0.12,L,OW0,L,OW,consonant,mid,consonant,back,also,4168.001556,4014.565520,0.375728,-1.368350
1097,S03,S03_interview.wav,1545.8515,1545.9615,S,0.11,IY0,P,IY,P,high,consonant,front,consonant,specific,8253.935980,1950.407950,0.254103,2.525659
1098,S03,S03_interview.wav,1546.1015,1546.2515,S,0.15,AH0,IH1,AH,IH,mid,high,back,front,specific,8932.145824,2474.852918,-1.538584,3.868505


Use this as a template for any other phonetic grouping variable (e.g., roundness or voice quality).

#### Step 13: Add speaker demographic information
Next, let’s add in our speaker demographic information.

| speaker | age | gender         | sexuality | race_ethnicity          | SES                 | place_origin          | L1s              |
|:--------|:----|:---------------|:----------|:------------------------|:--------------------|:----------------------|:-----------------|
| S01     | 22  | cisgender male | gay       | White, Latino           | middle class        | Seattle               | Spanish, English |
| S02     | 22  | male           | bisexual  | Caucasian               | low-middle class    | Michigan              | English          |
| S03     | 25  | cis woman      | straight  | Black, African-American | low-to-middle class | Houston & Lubbock, TX | English          |

Load in the toy demographic data and make sure the `speaker` column is at index 0.

In [23]:
# load in your speaker demographic data, make ‘speaker’ the leftmost column
spkr_demog = pd.read_csv('./spkr_demog.csv')
spkr_col = spkr_demog.pop('speaker')  
spkr_demog.insert(0, 'speaker', spkr_col)
spkr_demog

Unnamed: 0,speaker,age,gender,sexuality,race_ethnicity,SES,place_origin,L1s
0,S01,22,cisgender male,gay,"White, Latino",middle class,Seattle,"Spanish, English"
1,S02,22,male,bisexual,Caucasian,low-middle class,Michigan,English
2,S03,25,cis woman,straight,"Black, African-American",low-to-middle class,"Houston & Lubbock, TX",English


Now move `speaker` to the rightmost column of `fulldf`.

In [24]:
# in the df containing your acoustic measures, move the ‘speaker’ to be the rightmost column
spk = fulldf.pop('speaker')
fulldf['speaker'] = spk
fulldf

Unnamed: 0,recording,t1,t2,phones,phone_dur,prev,nxt,prev_short,nxt_short,prev_height,nxt_height,prev_back,nxt_back,words,COG,SD,skew,kurtosis,speaker
0,S01_interview.wav,0.7000,0.8000,S,0.10,AY1,T,AY,T,high,consonant,front,consonant,still,6843.643604,2399.008980,-1.401550,2.771497,S01
1,S01_interview.wav,13.7600,13.9900,S,0.23,Z,T,Z,T,consonant,consonant,consonant,consonant,student,6441.362865,2262.103590,-0.709024,2.879648,S01
2,S01_interview.wav,23.7300,24.0400,S,0.31,EH1,N,EH,N,mid,consonant,front,consonant,yes,6824.187500,2231.494502,-1.464468,3.785619,S01
3,S01_interview.wav,29.9000,30.0400,S,0.14,AE1,T,AE,T,low,consonant,front,consonant,past,5552.486811,3521.406978,-0.418952,-0.738118,S01
4,S01_interview.wav,30.6500,30.7200,S,0.07,AE1,T,AE,T,low,consonant,front,consonant,past,4701.829887,3671.126603,-0.022515,-1.391621,S01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,S03_interview.wav,1543.6415,1543.7715,S,0.13,AE1,OW1,AE,OW,low,mid,front,back,so,3080.268516,4016.028948,1.080176,-0.477866,S03
1096,S03_interview.wav,1544.6515,1544.7715,S,0.12,L,OW0,L,OW,consonant,mid,consonant,back,also,4168.001556,4014.565520,0.375728,-1.368350,S03
1097,S03_interview.wav,1545.8515,1545.9615,S,0.11,IY0,P,IY,P,high,consonant,front,consonant,specific,8253.935980,1950.407950,0.254103,2.525659,S03
1098,S03_interview.wav,1546.1015,1546.2515,S,0.15,AH0,IH1,AH,IH,mid,high,back,front,specific,8932.145824,2474.852918,-1.538584,3.868505,S03


Complete a left-merge to add the appropriate demographic information in `spkr_demog` to each row of `fulldf`; save the result as `merged_df`, and move `speaker` back to index 0.

In [25]:
# Complete left merge on speaker column
merged_df = fulldf.merge(spkr_demog, on='speaker', how='left')

# Move speaker column back to index 0
spk1 = merged_df.pop('speaker')  
merged_df.insert(0, 'speaker', spk1)
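A left merge keeps every row of the acoustic data, including rows for speakers missing from the demographic table, which would silently receive `NaN` values. The toy sketch below uses two optional `merge()` arguments to guard against this: `validate='many_to_one'` raises an error if a speaker appears more than once in the demographic table, and `indicator=True` adds a `_merge` column showing where each row found a match. Speaker `S04` and the rounded COG values are made up.

```python
import pandas as pd

measures = pd.DataFrame({
    'speaker': ['S01', 'S01', 'S02', 'S04'],   # S04 is hypothetical
    'COG': [6843.6, 6441.4, 6301.0, 5000.0],   # made-up values
})
demog = pd.DataFrame({'speaker': ['S01', 'S02'], 'age': [22, 22]})

merged = measures.merge(demog, on='speaker', how='left',
                        validate='many_to_one', indicator=True)

# speakers with measurements but no demographic row
unmatched = merged.loc[merged['_merge'] == 'left_only', 'speaker'].unique()
print(list(unmatched))
```

Once the check comes back empty, the `_merge` column can be dropped with `merged.drop(columns='_merge')`.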

Check that everything worked as intended (no duplicate columns, no lost columns, etc.)

In [26]:
# Check number of rows matches fulldf
merged_df.shape

(1100, 26)

In [27]:
# Check for column duplicates
merged_df.columns

Index(['speaker', 'recording', 't1', 't2', 'phones', 'phone_dur', 'prev',
       'nxt', 'prev_short', 'nxt_short', 'prev_height', 'nxt_height',
       'prev_back', 'nxt_back', 'words', 'COG', 'SD', 'skew', 'kurtosis',
       'age', 'gender', 'sexuality', 'race_ethnicity', 'SES', 'place_origin',
       'L1s'],
      dtype='object')

#### Step 14: Save your completed dataset

Save your fully updated dataset! Append today's date to the filename to keep track of your versions.

In [28]:
# Save w/ original filename plus tag for metadata and date for reference
merged_df.to_csv('./spectral_moments_w-demog_11-02-24.csv')
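Two optional refinements when saving, sketched on a toy dataframe: the date tag can be generated rather than typed by hand, and passing `index=False` keeps the row index out of the file so that it does not come back as an extra `Unnamed: 0` column on reload. The sketch assumes the date tag above is month-day-year, and it writes to an in-memory buffer instead of a real file.

```python
import io
from datetime import date

import pandas as pd

# build a filename stamped with today's date (assumed MM-DD-YY order)
stamp = date.today().strftime('%m-%d-%y')
outfile = f'./spectral_moments_w-demog_{stamp}.csv'

# toy stand-in for merged_df, with a made-up value
toy = pd.DataFrame({'speaker': ['S01'], 'COG': [6843.6]})

# round trip through an in-memory buffer instead of writing to disk
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(outfile)
print(reloaded.columns.tolist())
```

With real data, the equivalent call would be `merged_df.to_csv(outfile, index=False)`.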

Congratulations! Now our data is in the appropriate format to start exploring visualization and statistical analysis. 🎉

In [29]:
merged_df

Unnamed: 0,speaker,recording,t1,t2,phones,phone_dur,prev,nxt,prev_short,nxt_short,...,SD,skew,kurtosis,age,gender,sexuality,race_ethnicity,SES,place_origin,L1s
0,S01,S01_interview.wav,0.7000,0.8000,S,0.10,AY1,T,AY,T,...,2399.008980,-1.401550,2.771497,22,cisgender male,gay,"White, Latino",middle class,Seattle,"Spanish, English"
1,S01,S01_interview.wav,13.7600,13.9900,S,0.23,Z,T,Z,T,...,2262.103590,-0.709024,2.879648,22,cisgender male,gay,"White, Latino",middle class,Seattle,"Spanish, English"
2,S01,S01_interview.wav,23.7300,24.0400,S,0.31,EH1,N,EH,N,...,2231.494502,-1.464468,3.785619,22,cisgender male,gay,"White, Latino",middle class,Seattle,"Spanish, English"
3,S01,S01_interview.wav,29.9000,30.0400,S,0.14,AE1,T,AE,T,...,3521.406978,-0.418952,-0.738118,22,cisgender male,gay,"White, Latino",middle class,Seattle,"Spanish, English"
4,S01,S01_interview.wav,30.6500,30.7200,S,0.07,AE1,T,AE,T,...,3671.126603,-0.022515,-1.391621,22,cisgender male,gay,"White, Latino",middle class,Seattle,"Spanish, English"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,S03,S03_interview.wav,1543.6415,1543.7715,S,0.13,AE1,OW1,AE,OW,...,4016.028948,1.080176,-0.477866,25,cis woman,straight,"Black, African-American",low-to-middle class,"Houston & Lubbock, TX",English
1096,S03,S03_interview.wav,1544.6515,1544.7715,S,0.12,L,OW0,L,OW,...,4014.565520,0.375728,-1.368350,25,cis woman,straight,"Black, African-American",low-to-middle class,"Houston & Lubbock, TX",English
1097,S03,S03_interview.wav,1545.8515,1545.9615,S,0.11,IY0,P,IY,P,...,1950.407950,0.254103,2.525659,25,cis woman,straight,"Black, African-American",low-to-middle class,"Houston & Lubbock, TX",English
1098,S03,S03_interview.wav,1546.1015,1546.2515,S,0.15,AH0,IH1,AH,IH,...,2474.852918,-1.538584,3.868505,25,cis woman,straight,"Black, African-American",low-to-middle class,"Houston & Lubbock, TX",English
