# Extracting activations


Download and unzip the directory with the model data from: https://drive.google.com/open?id=14jWnakZn5JvtJpqalBJIkmwCPbbHd4gn
This model was trained on [The Places Audio Caption Corpus](https://groups.csail.mit.edu/sls/downloads/placesaudio/index.cgi).


We will extract activations from two example audio files included in the repository.


In [1]:
import vg.activations
modelpath = "models/places-stack-s2-t.-s2i2-s2t.-t2s.-t2i.--f/model.12.pkl"
paths = ["data/example_audio/667626_18933d713e_0.wav", "data/example_audio/3637013_c675de7705_0.wav"]
data = vg.activations.from_audio(modelpath, paths)


The output is a list of numpy arrays, one array per audio file.
Each array has shape (TIMESTEP, LAYER, FEATURE); the number of time steps depends on the length of the audio.

In [2]:
print(len(data))
print(data[0].shape)
print(data[1].shape)

2
(131, 4, 1024)
(215, 4, 1024)
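
To make the (TIMESTEP, LAYER, FEATURE) layout concrete, here is a minimal sketch that indexes into an array of the same shape as `data[0]`. The array `acts` is a random stand-in rather than real model output, and the variable names are only illustrative.

```python
import numpy as np

# Stand-in with the same shape as data[0] above; random values, not real activations.
rng = np.random.default_rng(0)
acts = rng.standard_normal((131, 4, 1024))

# All time steps for layer index 3 (the last of the four): shape (131, 1024)
one_layer = acts[:, 3, :]

# One vector per layer, averaged over time: shape (4, 1024)
utterance_mean = acts.mean(axis=0)

print(one_layer.shape)
print(utterance_mean.shape)
```

With the real output, replace `acts` with `data[0]` or `data[1]`.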


# Forced alignment

You will need to install the [Gentle toolkit](https://github.com/lowerquality/gentle) and the associated Python library gentle.

Let's align an audio file with the transcript of the speech. The output is a Python dictionary with the alignment information. The phonetic notation uses the Arpabet system with some added suffixes. For example, the second phoneme of the word *girl* is written as `er_I`: `er` is the phoneme written as /ɝ/ or /ɹ/ in IPA, and the suffix `_I` indicates that this phoneme occurs inside a word (as opposed to at its beginning or end).

See https://raw.githubusercontent.com/gchrupala/encoding-of-phonology/master/src/phonemes.txt for a mapping between Arpabet and IPA.
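
The suffixed labels can be split into a phoneme and a word position with a few lines of Python. This is a small sketch, not part of the `vg` package: the helper `split_phone` is hypothetical, and only `_I` is described in the text above. The meanings given for `_B`, `_E` and `_S` are an assumption based on the Kaldi word-position convention.

```python
# Word-position suffixes. `_I` is described in the text; the other three
# are assumed to follow the Kaldi convention for position-dependent phones.
POSITIONS = {
    "B": "word-initial",
    "I": "word-internal",
    "E": "word-final",
    "S": "single-phoneme word",
}

def split_phone(label):
    """Split a label such as 'er_I' into (phoneme, position)."""
    phoneme, sep, suffix = label.rpartition("_")
    if not sep:
        # No suffix present: return the label unchanged.
        return label, None
    return phoneme, POSITIONS.get(suffix, suffix)

for label in ["g_B", "er_I", "l_E", "ah_S"]:
    print(label, split_phone(label))
```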

In [3]:
import vg.align
path = "data/example_audio/667626_18933d713e_0.wav"
text = "A girl is stretched out in shallow water."
alignment = vg.align.align(path, text)
alignment

{'transcript': 'A girl is stretched out in shallow water.',
 'words': [{'alignedWord': 'a',
   'case': 'success',
   'end': 0.39,
   'endOffset': 1,
   'phones': [{'duration': 0.07, 'phone': 'ah_S'}],
   'start': 0.32,
   'startOffset': 0,
   'word': 'A'},
  {'alignedWord': 'girl',
   'case': 'success',
   'end': 0.66,
   'endOffset': 6,
   'phones': [{'duration': 0.1, 'phone': 'g_B'},
    {'duration': 0.1, 'phone': 'er_I'},
    {'duration': 0.07, 'phone': 'l_E'}],
   'start': 0.39,
   'startOffset': 2,
   'word': 'girl'},
  {'alignedWord': 'is',
   'case': 'success',
   'end': 0.78,
   'endOffset': 9,
   'phones': [{'duration': 0.06, 'phone': 'ih_B'},
    {'duration': 0.06, 'phone': 'z_E'}],
   'start': 0.66,
   'startOffset': 7,
   'word': 'is'},
  {'alignedWord': 'stretched',
   'case': 'success',
   'end': 1.2,
   'endOffset': 19,
   'phones': [{'duration': 0.03, 'phone': 's_B'},
    {'duration': 0.08, 'phone': 't_I'},
    {'duration': 0.05, 'phone': 'r_I'},
    {'duration': 0.09, 
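
In the output above, each word carries a `start` time and each phone only a `duration`. One way to recover a time span per phone is to accumulate the durations from the word's start, as in the sketch below. The helper `phone_intervals` is hypothetical and may differ from what `vg.align` does internally. The example dictionary copies the first two words of the output.

```python
def phone_intervals(alignment):
    """Return (phone, start, end) tuples in seconds for successfully aligned words."""
    intervals = []
    for word in alignment["words"]:
        if word.get("case") != "success":
            continue  # skip words that were not aligned
        t = word["start"]
        for phone in word["phones"]:
            intervals.append((phone["phone"], t, t + phone["duration"]))
            t += phone["duration"]
    return intervals

# First two words of the alignment shown above.
example = {"words": [
    {"case": "success", "start": 0.32, "end": 0.39,
     "phones": [{"duration": 0.07, "phone": "ah_S"}]},
    {"case": "success", "start": 0.39, "end": 0.66,
     "phones": [{"duration": 0.1, "phone": "g_B"},
                {"duration": 0.1, "phone": "er_I"},
                {"duration": 0.07, "phone": "l_E"}]},
]}

for phone, start, end in phone_intervals(example):
    print(f"{phone}\t{start:.2f}\t{end:.2f}")
```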

# Phoneme activations

Given neural activations for an utterance and an alignment, we can extract the phoneme labels with their corresponding mean-pooled features using the function `vg.align.phoneme_activations`.
The output is two numpy arrays: the phoneme labels and the corresponding features.


In [4]:
labels, features = vg.align.phoneme_activations(data[0], alignment)
print(labels.shape)
print(features.shape)

(24,)
(24, 4, 1024)
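
To illustrate what mean-pooling over a phoneme means, the sketch below averages the time steps that fall inside each phoneme's time span. It is not the implementation of `vg.align.phoneme_activations`: the function `mean_pool`, the `(label, start, end)` interval format and the `seconds_per_step` value are all assumptions made for the example, and the activations are random stand-ins.

```python
import numpy as np

def mean_pool(acts, intervals, seconds_per_step):
    """Average activations of shape (T, L, F) over each (label, start, end) interval."""
    labels, pooled = [], []
    for label, start, end in intervals:
        lo = int(start / seconds_per_step)
        hi = max(lo + 1, int(np.ceil(end / seconds_per_step)))
        segment = acts[lo:hi]
        if len(segment) == 0:
            continue  # interval lies past the last time step
        labels.append(label)
        pooled.append(segment.mean(axis=0))
    features = np.stack(pooled) if pooled else np.empty((0,) + acts.shape[1:])
    return np.array(labels), features

# Random stand-in for data[0]; real activations would come from the model.
rng = np.random.default_rng(0)
acts = rng.standard_normal((131, 4, 1024))

# Time spans (in seconds) of the phonemes of "girl"; 0.02 s per step is hypothetical.
intervals = [("g_B", 0.39, 0.49), ("er_I", 0.49, 0.59), ("l_E", 0.59, 0.66)]
labels, features = mean_pool(acts, intervals, seconds_per_step=0.02)
print(labels.shape)
print(features.shape)
```

The result has one row per phoneme, with the LAYER and FEATURE dimensions preserved, which matches the shape pattern of the output above.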


The function `vg.align.from_audio` will extract the same information directly from audio with the associated transcripts.

In [5]:
modelpath = "models/places-stack-s2-t.-s2i2-s2t.-t2s.-t2i.--f/model.12.pkl"
paths = ["data/example_audio/667626_18933d713e_0.wav", "data/example_audio/3637013_c675de7705_0.wav"]
texts = ["A girl is stretched out in shallow water.", "A couple stands close at the water's edge."]
labels, features = vg.align.from_audio(modelpath, paths, texts)
print(labels.shape)
print(features.shape)
         

(51,)
(51, 4, 1024)
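
The label and feature arrays share their first dimension, so they can be sliced together. As a minimal sketch, the code below selects one layer and picks out the rows for a single phoneme label. The arrays are stand-ins with the same shapes as the output above (made-up labels, random features), and the layer index is arbitrary.

```python
import numpy as np

# Stand-ins shaped like the output above; hypothetical labels, random features.
rng = np.random.default_rng(0)
labels = np.array(["g_B", "er_I", "l_E"] * 17)      # shape (51,)
features = rng.standard_normal((51, 4, 1024))       # shape (51, 4, 1024)

layer = 2                        # any index from 0 to 3
X = features[:, layer, :]        # one 1024-dimensional vector per phoneme

mask = labels == "er_I"          # boolean mask over the 51 phonemes
er_features = X[mask]            # rows belonging to the label "er_I"

print(X.shape)
print(er_features.shape)
```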
