# Pickle python dictionary of audio feats

Based on filter_feats.ipynb

At this point, I have extracted all the audio features (MFCC+pitch). The file feats.scp serves as an index to these features, with the keys being: 

`<filename identifier from BURNC>-starttime-endtime`
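To make the key format concrete, here is a minimal sketch that splits a key into its three parts. The helper name `parse_utt_key` is hypothetical, and the sketch assumes the two trailing fields are times in seconds.

```python
def parse_utt_key(key):
    """Split '<file_id>-<start>-<end>' into (file_id, start, end)."""
    # Split from the right so a hyphen inside the file identifier is preserved.
    file_id, start, end = key.rsplit('-', 2)
    # Assumption: start and end are times in seconds.
    return file_id, float(start), float(end)

file_id, start, end = parse_utt_key('f1arrlp1-0007.030-0015.300')
```

Splitting from the right means only the last two hyphens are treated as field separators.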

I also have a list of utterances in text2labels. The first element on each line is the utterance ID; the second is the list of tokens; the last is the list of labels.
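As a rough sketch of that line format, the function below splits one line into its three fields. It assumes the fields are tab-separated and that tokens and labels are space-separated within their fields. The function name and the utterance ID `utt-0001` are hypothetical.

```python
def parse_text2labels_line(line):
    """Return (utt_id, tokens, labels) for one line of the utterance file."""
    # Assumption: three tab-separated fields per line.
    utt_id, text, label_str = line.rstrip('\n').split('\t')
    tokens = text.split()
    labels = [int(lbl) for lbl in label_str.split()]
    if len(tokens) != len(labels):
        raise ValueError(f'{utt_id}: {len(tokens)} tokens but {len(labels)} labels')
    return utt_id, tokens, labels

# Hypothetical utterance ID; text and labels are illustrative
example = parse_text2labels_line('utt-0001\tin april the sjc\t0 1 0 1\n')
```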

The goal here is to pull out all the keys from the utterance list (text2labels) and use them to pull the actual features out, as indexed by feats.scp. Then make this into a Python dictionary, where keys are utterance IDs and values are the MFCC+pitch feature matrices for that utterance (read as numpy arrays and stored as torch tensors). Store this dictionary as a serialized file.

This also generates two dictionaries of labels, keyed by utterance ID as above. In the first, each value is a single binary label, where 1 indicates that the last token in the utterance is stressed and 0 indicates that it is not. In the second, each value is the full sequence of per-token labels. Both are stored as serialized files.


**Step 1:** Import libraries and set filename variables

In [1]:
import kaldi_io
import pandas as pd
import pickle
import torch

#feats_file = '/home/elizabeth/repos/kaldi/egs/burnc/kaldi_features/data/train/feats.scp'
#feats_file = '/home/elizabeth/repos/kaldi/egs/burnc/kaldi_features/data/train_breath_tok/feats.scp'
feats_file = '/home/elizabeth/repos/kaldi/egs/burnc/kaldi_features/data/train_breath_sent/feats.scp'
#feats_file = '/afs/inf.ed.ac.uk/group/project/prosody/mfcc_pitch/feats.scp'
#data_file = 'data/burnc/text2labels'
data_file = 'data/burnc/text2labels_breath_sent'


# Output files
feat_dict_file = 'data/burnc_mfcc_dict_breath_sent.pkl'
last_label_dict_file = 'data/burnc_last_label_dict_breath_sent.pkl'
label_dict_file = 'data/burnc_label_dict_breath_sent.pkl'

**Step 2:** Load the utterances (tokens and labels in text format) and extract the labels, the texts, and the keys (the utterance IDs in the `<filename identifier>-starttime-endtime` format):

In [2]:
df = pd.read_csv(data_file,sep='\t',header=None)
labels = df[2].tolist()
texts = df[1].tolist()
utt_ids = df[0].tolist()
texts

['wanted chief justice of the massachusetts supreme court',
 'in april the sjc',
 'current leader edward hennessy reaches a mandatory retirement age of seventy',
 'and a successor is expected to be named in march',
 'it may be the most important appointment governor michael dukakis makes during the remainder of his administration',
 'and one of the toughest',
 'as wbur margo melnicove reports',
 'hennessy will be a hard act to follow',
 'in nineteen seventy six democratic governor michael dukakis fulfilled a campaign promise to de-politicize judicial appointment',
 'he named republican edward hennessy to head the state supreme judicial court',
 'for hennessy it was another step along a distinguished career that began as a trial lawyer',
 'and led to an appointment as associate supreme court justice in nineteen seventy one',
 'that year thomas maffy now president of the massachusetts bar association was hennessy law clerk',
 'the author of more than eight hundred state supreme court opi

**Step 2.5:** Sanity-check the extracted lists on a single row:

In [3]:
rownum = 67

row = df.iloc[rownum]
print('ROW:')
print(row)
print('LABEL:')
print(labels[rownum])
print('KEY:')
print(utt_ids[rownum])
assert(len(labels[rownum].split())==len(texts[rownum].split()))

ROW:
0                           f1arrlp1-0007.030-0015.300
1    the legislation came complete with a message f...
2    0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1 1 1
Name: 67, dtype: object
LABEL:
0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1 1 1
KEY:
f1arrlp1-0007.030-0015.300
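The check above looks at one row only. A minimal sketch of the same token-count check applied to every row might look like this. The helper name `find_mismatched_rows` is hypothetical, and the toy lists stand in for `texts` and `labels`.

```python
def find_mismatched_rows(texts, labels):
    """Return indices of rows whose token count differs from their label count."""
    return [
        i for i, (text, label_str) in enumerate(zip(texts, labels))
        if len(text.split()) != len(label_str.split())
    ]

# Toy data: the second row deliberately has 4 labels for 5 tokens
toy_texts = ['in april the sjc', 'and one of the toughest']
toy_labels = ['0 1 0 1', '1 0 0 0']
bad_rows = find_mismatched_rows(toy_texts, toy_labels)
```

An empty result means every utterance has exactly one label per token.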


**Step 3:** Make labels into two dictionaries where keys = utterance name and values = the last token's label (last_label_dict) or the labels of all tokens (label_dict)

In [4]:
last_label_dict = {}  # utterance ID -> label of the final token (0-dim tensor)
label_dict = {}       # utterance ID -> labels of all tokens (1-D tensor)

for utt_id, label_str in zip(utt_ids, labels):
    token_labels = [int(lbl) for lbl in label_str.split()]
    last_label_dict[utt_id] = torch.tensor(token_labels[-1])
    label_dict[utt_id] = torch.tensor(token_labels)
    print(label_dict[utt_id],last_label_dict[utt_id])

tensor([1, 1, 1, 0, 0, 1, 1, 1]) tensor(1)
tensor([0, 1, 0, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]) tensor(1)
tensor([0, 0, 1, 0, 1, 0, 0, 0, 0, 1]) tensor(1)
tensor([0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]) tensor(1)
tensor([1, 0, 0, 0, 1]) tensor(1)
tensor([0, 1, 0, 1, 1]) tensor(1)
tensor([1, 0, 0, 0, 1, 0, 1, 0]) tensor(0)
tensor([0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1]) tensor(1)
tensor([0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1]) tensor(1)
tensor([0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]) tensor(0)
tensor([0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]) tensor(1)
tensor([1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0]) tensor(0)
tensor([0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]) tensor(1)
tensor([1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0]) tensor(0)
tensor([1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]) tensor(1)
tensor([0, 1, 0, 1, 1, 0, 0, 1, 1, 0]) tensor(0)
tensor([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0,

tensor([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]) tensor(1)
tensor([0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 0, 1, 0, 1, 0]) tensor(0)
tensor([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1]) tensor(1)
tensor([0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1]) tensor(1)
tensor([0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]) tensor(0)
tensor([1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]) tensor(1)
tensor([0, 1, 0, 1, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]) tensor(1)
tensor([0, 0, 1, 0, 1, 0, 1, 0, 1, 1]) tensor(1)
tensor([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]) tensor(1)
tensor([0, 1, 1, 1, 1]) tensor(1)
tensor([0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]) tensor(1)
tensor([0, 1, 0, 0, 1, 1, 0, 0, 1, 0]) tensor(0)
tensor([1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1]) tensor(1)
tensor([0, 1, 0, 1, 1, 0, 1]) tensor(1)
tensor([1, 1, 0, 1]) tensor(1)
tensor([0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0]) tens

tensor([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1]) tensor(1)
tensor([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0]) tensor(0)
tensor([1, 0, 1, 0, 1, 1, 0, 1, 0, 0]) tensor(0)
tensor([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]) tensor(1)
tensor([0, 1, 1, 1, 0, 1, 0, 1, 0]) tensor(0)
tensor([1, 1, 1, 0, 1, 0]) tensor(0)
tensor([0, 1, 1, 1, 1]) tensor(1)
tensor([1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]) tensor(1)
tensor([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
        1]) tensor(1)
tensor([1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 1, 0, 1]) tensor(1)
tensor([0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]) tensor(1)
tensor([0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1]) tensor(1)
tensor([1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]) tensor(1)
tensor([0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]) tensor(1)


tensor([1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 0]) tensor(0)
tensor([0, 1, 1, 0, 1, 1]) tensor(1)
tensor([0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]) tensor(1)
tensor([1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1]) tensor(1)
tensor([0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]) tensor(1)
tensor([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1]) tensor(1)
tensor([0, 1, 0, 1, 1, 1, 1, 0, 0]) tensor(0)
tensor([0, 0, 1, 0, 1, 1]) tensor(1)
tensor([0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
        0]) tensor(0)
tensor([0, 1, 1, 1, 0, 1, 1, 1, 0]) tensor(0)
tensor([1, 0, 1, 1]) tensor(1)
tensor([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]) tensor(1)
tensor([1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]) tensor(1)
tensor([0, 0, 1, 0, 1, 0, 1]) tensor(1)
tensor([1, 0, 0, 1]) tensor(1)
tensor([1, 0, 0, 1, 0, 1]) tensor(1)
tensor([1, 1, 0, 1, 

In [5]:
# Fraction of utterances whose last token is stressed (label 1)
lbls = [item.item() for item in last_label_dict.values()]
sum(lbls)/len(lbls)

0.820250284414107

**Step 4:** Go retrieve speech features and put into a dictionary

In [6]:
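feat_dict = {}
utt_id_set = set(utt_ids)  # set lookup is much faster than scanning the list per key
for utt_id,mat in kaldi_io.read_mat_scp(feats_file):
    # Keep only the features for utterances that appear in the label file
    if utt_id in utt_id_set:
        feat_dict[utt_id] = torch.tensor(mat)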
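The filtering pattern in this step can be sketched without Kaldi. In the toy version below, an iterator of `(utt_id, matrix)` pairs stands in for `kaldi_io.read_mat_scp(feats_file)`. The helper name `collect_feats`, the IDs, and the matrices are all made up for illustration.

```python
def collect_feats(pairs, wanted_ids):
    """Keep only the (utt_id, matrix) pairs whose ID is in wanted_ids."""
    wanted = set(wanted_ids)  # constant-time membership tests
    return {utt_id: mat for utt_id, mat in pairs if utt_id in wanted}

# Stand-in for the feature reader: hypothetical IDs and toy 2x2 matrices
toy_pairs = iter([
    ('utt-a', [[0.1, 0.2], [0.3, 0.4]]),
    ('utt-b', [[0.5, 0.6], [0.7, 0.8]]),
])
toy_feat_dict = collect_feats(toy_pairs, ['utt-a', 'utt-c'])
```

Here `utt-c` is requested but has no features, so it does not appear in the result.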



**Step 5:** If no features were found for an utterance, drop that utterance from the labels as well:

In [7]:
print(len(last_label_dict))
print(len(feat_dict))

2637
2636


In [8]:
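# Utterances that have labels but no features in feats.scp
missing_feats = set(last_label_dict) - set(feat_dict)

for utt in missing_feats:
    del last_label_dict[utt]
    del label_dict[utt]  # keep the full-sequence labels in sync as well

assert len(last_label_dict) == len(feat_dict)
assert len(label_dict) == len(feat_dict)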
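One way to keep several label dictionaries aligned with the features is a small helper like the one below. This is only a sketch: the name `drop_missing` is hypothetical, and plain lists and integers stand in for the tensors.

```python
def drop_missing(feat_dict, *label_dicts):
    """Remove from every label dict the keys that have no entry in feat_dict."""
    missing = set()
    for d in label_dicts:
        missing |= set(d) - set(feat_dict)
    for d in label_dicts:
        for key in missing:
            d.pop(key, None)  # pop with a default tolerates keys already absent
    return sorted(missing)

# Toy data with hypothetical IDs: 'utt-b' has labels but no features
toy_feats = {'utt-a': [[0.1, 0.2]]}
toy_last = {'utt-a': 1, 'utt-b': 0}
toy_all = {'utt-a': [0, 1], 'utt-b': [1, 0]}
dropped = drop_missing(toy_feats, toy_last, toy_all)
```

Returning the dropped keys makes it easy to log which utterances were excluded.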

**Step 6:** Pickle the dictionaries

In [9]:
with open(last_label_dict_file,'wb') as f:
    pickle.dump(last_label_dict,f)
    
with open(label_dict_file,'wb') as f:
    pickle.dump(label_dict,f)

with open(feat_dict_file,'wb') as f:
    pickle.dump(feat_dict,f)
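To confirm that a pickled dictionary can be read back, a round trip can be sketched as follows. The toy dictionary and temporary file are stand-ins for the real output files. Loading the real files would presumably also require `torch` to be installed, since their values are tensors.

```python
import os
import pickle
import tempfile

# Toy stand-in for one of the label dictionaries (hypothetical IDs)
toy_dict = {'utt-a': [0, 1, 0, 1], 'utt-b': [1, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_label_dict.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_dict, f)
    # Files written with 'wb' must be read back in binary mode
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)
```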