# Select only the necessary audio feats

At this point, I have extracted all the audio features (MFCC+pitch). The file feats.scp serves as an index to these features, with the keys being utterance names, in SWBD-1 format.

I also have a list of utterances in utterances.txt. Each line has four tab-separated fields: the first is the utterance name in SWBD-NXT format; the second is the utterance name in SWBD-1 format; the third is the list of tokens; the last is the list of labels.

The goal here is to pull all the keys out of utterances.txt and use them to retrieve the actual features, as indexed by feats.scp. The features then go into a Python dictionary, where keys are utterance names (in SWBD-1 format) and values are the MFCC+pitch feature matrices for each utterance (kaldi_io returns these as numpy arrays, and the code below converts them to torch tensors). The dictionary is stored as a serialized (pickle) file.

This also generates a dictionary of labels, where the keys are utterance names, as above, and the values are the labels (in this case, a single binary value, where 1 indicates that the last token in the utterance is stressed and 0 indicates that it is not). This dictionary is also stored as a serialized (pickle) file.
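
To make the file layout concrete, here is a minimal sketch of how one line of utterances.txt maps to a key and a final-token label. The helper name `parse_utterance_line` is hypothetical (it is not part of the notebook), and the sketch assumes the four tab-separated fields described above; the example line is the first row of the data shown in Step 2.

```python
# Hypothetical helper (not part of the original notebook): split one
# tab-separated line of utterances.txt into its four fields.
def parse_utterance_line(line):
    nxt_name, swbd1_name, tokens, labels = line.rstrip("\n").split("\t")
    token_list = tokens.split()
    label_list = [int(x) for x in labels.split()]
    # The SWBD-1 name is the key; only the final token's label is kept.
    return swbd1_name, token_list, label_list[-1]

example = "sw2018A-ms98-a-0001\tsw02018-A_000000-000376\thello this is lois\t1 1 0 1"
key, tokens, final_label = parse_utterance_line(example)
print(key, tokens, final_label)
```

The notebook itself does the same thing with pandas rather than line by line.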


**Step 1:** Import libraries and set filename variables

In [1]:
import kaldi_io
import pandas as pd
import pickle
import torch

feats_file = '/home/elizabeth/repos/kaldi/egs/swbd/s5c/data/train/feats_pitch.scp'
#feats_file = '/afs/inf.ed.ac.uk/group/project/prosody/mfcc_pitch/feats.scp'
data_file = 'data/utterances.txt'
feat_dict_file = 'data/utterances_feats.pkl'
label_dict_file = 'data/utterances_labels.pkl'

**Step 2:** Load the utterances (tokens and labels in text format) and extract both labels and keys (the SWBD-1-formatted utterance names):

In [2]:
df = pd.read_csv(data_file,sep='\t',header=None)
labels = df.iloc[:,-1].tolist()
keepkeys = df.iloc[:,1].tolist()
df

Unnamed: 0,0,1,2,3
0,sw2018A-ms98-a-0001,sw02018-A_000000-000376,hello this is lois,1 1 0 1
1,sw2018A-ms98-a-0003,sw02018-A_000512-001158,and um i called you know from that the the ti ...,1 0 1 1 0 1 0 0 1 0 1 1 1 1
2,sw2018A-ms98-a-0005,sw02018-A_001394-001810,yeah this is about changes in women in the,0 1 0 1 1 0 1 0 0
3,sw2018A-ms98-a-0006,sw02018-A_001810-002363,uh there's really a lot isn't there i mean the...,0 0 1 0 1 1 1 1 0 0 1 1
4,sw2018A-ms98-a-0008,sw02018-A_002467-003200,oh i guess the work force would be the main wo...,0 0 1 0 1 0 0 1 0 1 1 0 1 0
...,...,...,...,...
7745,sw4890B-ms98-a-0055,sw04890-B_026274-026695,oh oh that's interesting,0 0 0 1
7746,sw4890B-ms98-a-0057,sw04890-B_027482-027716,uh-huh,0
7747,sw4890B-ms98-a-0059,sw04890-B_028002-028154,oh,0
7748,sw4890B-ms98-a-0061,sw04890-B_028397-028688,oh i see,0 0 1


**Step 2.5:** Check that these extracted lists make sense:

In [3]:
rownum = 67

row = df.iloc[rownum]
print('ROW:')
print(row)
print('LABEL:')
print(labels[rownum])
print('KEY:')
print(keepkeys[rownum])

ROW:
0                                  sw2018B-ms98-a-0018
1                              sw02018-B_004661-005898
2    i know when my mother was a you know going int...
3    0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 ...
Name: 67, dtype: object
LABEL:
0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1
KEY:
sw02018-B_004661-005898
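
Beyond eyeballing a single row, one further check that might be useful is confirming that every utterance has exactly one label per token. The sketch below is an assumption about what such a check could look like, not something the notebook runs; `find_length_mismatches` is a hypothetical name, and the third toy row is deliberately inconsistent so the function has something to report.

```python
# Hypothetical check: list the row indices where the number of tokens
# differs from the number of labels.
def find_length_mismatches(token_strings, label_strings):
    mismatches = []
    for i, (toks, labs) in enumerate(zip(token_strings, label_strings)):
        if len(str(toks).split()) != len(str(labs).split()):
            mismatches.append(i)
    return mismatches

# Toy data; the last row is made up and deliberately inconsistent.
toy_tokens = ["hello this is lois", "oh i see", "uh-huh"]
toy_labels = ["1 1 0 1", "0 0 1", "0 1"]
mismatched_rows = find_length_mismatches(toy_tokens, toy_labels)
print(mismatched_rows)
```

On the real data this could presumably be called as `find_length_mismatches(df.iloc[:,2], df.iloc[:,3])`, assuming the token and label columns are in positions 2 and 3 as in the table above.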


**Step 3:** Make labels into a dictionary where keys = utterance name, values = last token's label

In [12]:
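label_dict = {}
zero_lens = []  # token counts of utterances whose final label is 0
one_lens = []   # token counts of utterances whose final label is 1
for i,key in enumerate(keepkeys):
    utt_labels = labels[i].split()
    # Keep only the final token's label
    label_dict[key] = torch.tensor(int(utt_labels[-1]))
    # Count tokens, not characters: len(labels[i]) would be the string length
    if label_dict[key].item()==0:
        zero_lens.append(len(utt_labels))
    else:
        one_lens.append(len(utt_labels))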

(A quick data-analysis aside: how often is an utterance with an unaccented final token just a single token?)

In [22]:
# Final-token labels as plain Python ints
all_labels = [tens.item() for tens in label_dict.values()]
# Note: this prints a fraction between 0 and 1, not a percentage
print('Percent of 1 labels:',sum(all_labels)/len(all_labels))


import numpy as np
from scipy import stats
zero_lens = np.array(zero_lens)
one_lens = np.array(one_lens)

print(stats.mode(zero_lens))
print(zero_lens.shape)

print(stats.mode(one_lens))
print(one_lens.shape)

Percent of 1 labels: 0.4984516129032258
ModeResult(mode=array([1]), count=array([1302]))
(3887,)
ModeResult(mode=array([1]), count=array([745]))
(3863,)


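Hmm, pretty often: a single token is the most common length in both groups, covering 1302 of the 3887 utterances that end in an unaccented token (roughly a third) versus 745 of the 3863 that end in an accented one (about 19%). These could be backchannel-type tokens ('uh-huh', 'yeah', etc.). To do: check on this.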
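
One way to follow up on the backchannel hunch would be to count which tokens the single-token utterances actually are, split by final label. This is only a sketch under assumptions: `single_token_counts` is a hypothetical helper, and the toy rows below are illustrative rather than real counts from the corpus.

```python
from collections import Counter

# Hypothetical helper: for single-token utterances, count how often each
# token occurs, separately for final label 0 and final label 1.
def single_token_counts(token_strings, label_strings):
    counts = {0: Counter(), 1: Counter()}
    for toks, labs in zip(token_strings, label_strings):
        tok_list = str(toks).split()
        if len(tok_list) == 1:
            final_label = int(str(labs).split()[-1])
            counts[final_label][tok_list[0]] += 1
    return counts

# Toy data for illustration only (not real corpus statistics).
toy_tokens = ["uh-huh", "oh", "oh i see", "uh-huh", "yeah"]
toy_labels = ["0", "0", "0 0 1", "0", "1"]
counts = single_token_counts(toy_tokens, toy_labels)
print(counts[0].most_common(3))
print(counts[1].most_common(3))
```

If the top entries for label 0 on the real data turn out to be things like 'uh-huh' and 'yeah', that would support the backchannel explanation.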

**Step 4:** Go retrieve features and put into a dictionary

In [6]:
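feat_dict = {}
keepkeys_set = set(keepkeys)  # set lookup is O(1); the list lookup was O(n) per key
print("filtering keys ...")
# read_mat_scp yields (key, numpy matrix) pairs for every entry in the scp index
for key,mat in kaldi_io.read_mat_scp(feats_file):
    if key in keepkeys_set:
        feat_dict[key] = torch.tensor(mat)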



filtering keys ...


**Step 5:** If no features were found for an utterance, drop that utterance from label_dict as well:

In [7]:
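# Utterances that have a label but no features in feats.scp
missing_feats = set(label_dict.keys())-set(feat_dict.keys())

for utt in missing_feats:
    del label_dict[utt]

# Both dictionaries should now cover the same utterances
assert len(label_dict)==len(feat_dict)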
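
It may also be worth recording which utterances were dropped, since a large number could point to a key-format mismatch rather than genuinely missing audio. The sketch below is a hypothetical variant of the step above that returns the dropped keys instead of deleting in place; it uses toy dictionaries with plain Python values standing in for the tensors.

```python
# Hypothetical variant of Step 5: return the pruned label dictionary
# together with the list of keys that had no features.
def prune_to_shared_keys(label_dict, feat_dict):
    missing = sorted(set(label_dict) - set(feat_dict))
    pruned = {k: v for k, v in label_dict.items() if k in feat_dict}
    return pruned, missing

# Toy stand-ins: ints for labels, nested lists for feature matrices.
toy_labels = {"utt_a": 1, "utt_b": 0, "utt_c": 1}
toy_feats = {"utt_a": [[0.1, 0.2]], "utt_c": [[0.3, 0.4]]}
pruned, missing = prune_to_shared_keys(toy_labels, toy_feats)
print(f"dropped {len(missing)} of {len(toy_labels)} utterances: {missing}")

# Comparing key sets is a slightly stronger check than comparing lengths.
assert pruned.keys() == toy_feats.keys()
```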

**Step 6:** Pickle the dictionaries

In [8]:
with open(label_dict_file,'wb') as f:
    pickle.dump(label_dict,f)

with open(feat_dict_file,'wb') as f:
    pickle.dump(feat_dict,f)
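
For completeness, here is a minimal sketch of how the two pickles could be read back and checked for alignment later on. It writes toy dictionaries to a temporary directory so that it runs on its own; the file names and the toy values (plain lists instead of tensors) are assumptions for illustration, not the real data.

```python
import os
import pickle
import tempfile

# Toy stand-ins for label_dict and feat_dict (lists instead of tensors).
toy_labels = {"utt_a": 1, "utt_c": 0}
toy_feats = {"utt_a": [[0.1, 0.2], [0.3, 0.4]], "utt_c": [[0.5, 0.6]]}

with tempfile.TemporaryDirectory() as tmpdir:
    label_path = os.path.join(tmpdir, "labels.pkl")
    feat_path = os.path.join(tmpdir, "feats.pkl")

    # Write both dictionaries, mirroring Step 6.
    with open(label_path, "wb") as f:
        pickle.dump(toy_labels, f)
    with open(feat_path, "wb") as f:
        pickle.dump(toy_feats, f)

    # Read them back, as a downstream script would.
    with open(label_path, "rb") as f:
        loaded_labels = pickle.load(f)
    with open(feat_path, "rb") as f:
        loaded_feats = pickle.load(f)

# Every utterance with features should also have a label, and vice versa.
assert loaded_labels.keys() == loaded_feats.keys()
for key in loaded_feats:
    # Number of frames (rows) and the final-token label for each utterance
    print(key, len(loaded_feats[key]), loaded_labels[key])
```

With the real files, the same pattern would apply to label_dict_file and feat_dict_file, and torch would need to be importable when unpickling the tensors.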