In [1]:
import pickle

# open dataset file
dataset_path = "./dataset/gsj_ac_dataset.pickle"
with open(dataset_path, "rb") as f:
    dataset = pickle.load(f)

Below is some light documentation of the dataset. Each entry is keyed by an id and has five fields:

```
{
    id: {
        "expected": # Target/working word, paired with "label".
                    # Contains the word form and its phoneme translation.
        "label":    # Ground-truth pronunciation label (0/1)
        "metadata": # Identifiers such as
                    # activityId, storyId, phraseIndex, and word_index
        "phrase":   # Text of the phrase that contains the expected word.
                    # Contains word and phoneme forms.
        "asr":      # The ASR data I consider most important
                    # for a potential automated mispronunciation system.
    }
    }
}
```

In [5]:
import random
# pick a random entry; randrange excludes len(dataset), so the index stays in range
sample = dataset[random.randrange(len(dataset))]
print(sample.keys())

dict_keys(['expected', 'label', 'metadata', 'phrase', 'asr'])
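A quick sanity check over the whole dataset can confirm that every entry has these five keys. The following is a minimal sketch, assuming the dataset is a dict keyed by id as in the schema above; `find_malformed` and the `toy` data are hypothetical names used only for illustration.

```python
# Top-level keys every entry is expected to have (taken from the output above).
EXPECTED_KEYS = {"expected", "label", "metadata", "phrase", "asr"}

def find_malformed(dataset):
    """Return the ids of entries whose top-level keys differ from EXPECTED_KEYS."""
    return [
        entry_id
        for entry_id, entry in dataset.items()
        if set(entry.keys()) != EXPECTED_KEYS
    ]

# Toy stand-in for the real dataset (assumption: a dict keyed by id).
toy = {
    0: {"expected": {}, "label": 1, "metadata": {}, "phrase": {}, "asr": {}},
    1: {"expected": {}, "label": 0, "metadata": {}, "phrase": {}},  # missing "asr"
}

malformed = find_malformed(toy)
print(malformed)
```

On the real data, an empty list would mean every entry follows the documented structure.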


First, "expected" and "label":

```
{
    id: {
        "expected": {
            "word":     # Target/working word as given in the labels.csv file
            "phoneme":  # Target/working word translated to AMIRABET phonemes
        }
        "label": 0/1    # Whether the reader pronounced the expected word correctly
    }
}
```

In [7]:
print(sample["expected"])
print(sample["label"])

{'word': 'take', 'phoneme': 'tak'}
1
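Since "label" is a binary target, it may be worth checking the class balance before building anything on top of it. This is a minimal sketch, assuming the dataset is a dict keyed by id; `label_counts` and the `toy` entries are hypothetical and only illustrate the idea.

```python
from collections import Counter

def label_counts(dataset):
    """Count how many entries carry each label value (0 or 1)."""
    return Counter(entry["label"] for entry in dataset.values())

# Toy stand-in: two correct pronunciations and one incorrect.
toy = {
    0: {"label": 1},
    1: {"label": 0},
    2: {"label": 1},
}

counts = label_counts(toy)
print(counts[1], counts[0])
```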


Next, "metadata": identifiers useful for diagnosing problems and for cross-referencing other data sources.

```
{
    id: {
        "metadata": {
            "activityId":   # Id corresponding to the reading session
            "storyId":      # Id corresponding to overall story being read
            "phraseIndex":  # Index of current phrase in story being read
            "word_index":   # Index of current word in phrase
        }
    }
}
```


In [8]:
print(sample["metadata"])

{'activityId': 'FC1D174030EB11EC89641635D148', 'storyId': '4B5718806EB011EABBC087B56D5C6D4A', 'phraseIndex': 5, 'word_index': 3}
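One way these identifiers could be used is to gather all entries that come from the same phrase in the same reading session. The sketch below assumes that the pair (activityId, phraseIndex) identifies one phrase reading, which the source does not state explicitly; `group_by_phrase` and the ids `"A1"`, `"A2"`, `"S1"` are hypothetical.

```python
from collections import defaultdict

def group_by_phrase(dataset):
    """Map (activityId, phraseIndex) to entry ids, ordered by word_index."""
    groups = defaultdict(list)
    for entry_id, entry in dataset.items():
        meta = entry["metadata"]
        groups[(meta["activityId"], meta["phraseIndex"])].append(entry_id)
    # Order the entries of each phrase by their position in the phrase.
    for ids in groups.values():
        ids.sort(key=lambda i: dataset[i]["metadata"]["word_index"])
    return dict(groups)

# Toy stand-in with made-up identifiers.
toy = {
    0: {"metadata": {"activityId": "A1", "storyId": "S1", "phraseIndex": 5, "word_index": 1}},
    1: {"metadata": {"activityId": "A1", "storyId": "S1", "phraseIndex": 5, "word_index": 0}},
    2: {"metadata": {"activityId": "A2", "storyId": "S1", "phraseIndex": 5, "word_index": 0}},
}

groups = group_by_phrase(toy)
print(groups)
```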


Next, "phrase": the phrase in word and phoneme form, each tokenized into a list.

```
{
    id: {
        "phrase": {
            "word":     # Current phrase in word form
            "phoneme":  # Current phrase in phoneme form
        }
    }
}
```


In [9]:
print(sample["phrase"])

{'word': ['val', 'helps', 'casey', 'take', 'care', 'of', 'the', 'lamb'], 'phoneme': ['væl', 'hɛlps', 'kasi', 'tak', 'kɛɹ', 'ʌv', 'θʌ', 'læm']}
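In this sample, `word_index` 3 points at `'take'` in the phrase, which matches the expected word. A small consistency check along those lines is sketched below. It assumes `word_index` is a 0-based position into the tokenized phrase, which holds for this sample but is not confirmed for the whole dataset; `expected_matches_phrase` is a hypothetical helper, and the values are copied from the outputs above.

```python
# Values copied from the sample printed in this notebook.
sample = {
    "expected": {"word": "take", "phoneme": "tak"},
    "metadata": {"phraseIndex": 5, "word_index": 3},
    "phrase": {
        "word": ["val", "helps", "casey", "take", "care", "of", "the", "lamb"],
        "phoneme": ["væl", "hɛlps", "kasi", "tak", "kɛɹ", "ʌv", "θʌ", "læm"],
    },
}

def expected_matches_phrase(entry):
    """Check that the phrase token at word_index equals the expected word and phoneme."""
    i = entry["metadata"]["word_index"]
    phrase = entry["phrase"]
    if not 0 <= i < len(phrase["word"]):
        return False  # index falls outside the phrase
    return (
        phrase["word"][i] == entry["expected"]["word"]
        and phrase["phoneme"][i] == entry["expected"]["phoneme"]
    )

print(expected_matches_phrase(sample))
```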


Finally, "asr": the ASR attributes I consider most valuable, stored in as flexible a form as possible. This includes tokenized word and phoneme forms and, depending on the ASR source, word-level confidence scores.

```
{
    id: {
        "asr": {
            "amazon_data":                  # word, word_confidence, and phoneme
            "kaldi_data":                   # word, word_confidence, and phoneme
            "kaldiNa_data":                 # word, word_confidence, and phoneme
            "wav2vec_transcript_words":     # word and phoneme
            "wav2vec_transcript_phonemes":  # phoneme
        }
    }
}
```
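To illustrate how one of these sources might be read, here is a minimal sketch that pulls the word, confidence, and phoneme at a given position. `asr_word_at` is a hypothetical helper that only applies to sources with a `"word"` list. The data is the `kaldiNa_data` entry from this notebook's sample, with confidences rounded to four decimals for brevity.

```python
def asr_word_at(asr_source, index):
    """Return (word, confidence, phoneme) at index, or None if index is out of range."""
    words = asr_source["word"]
    if not 0 <= index < len(words):
        return None
    confidences = asr_source.get("word_confidence")  # absent for some sources
    phonemes = asr_source.get("phoneme")
    return (
        words[index],
        confidences[index] if confidences is not None else None,
        phonemes[index] if phonemes is not None else None,
    )

# kaldiNa_data from the sample in this notebook (confidences rounded).
kaldi_na = {
    "word": ["val", "helps", "casey", "take", "care", "of", "the", "<UNK>"],
    "word_confidence": [0.6486, 1.0, 1.0, 1.0, 0.9791, 1.0, 0.9689, 0.678],
    "phoneme": ["væl", "hɛlps", "kasi", "tak", "kɛɹ", "ʌv", "θʌ", False],
}

print(asr_word_at(kaldi_na, 3))
print(asr_word_at(kaldi_na, 7))
```

Two caveats from the sample itself: an unknown word (`'<UNK>'`) has `False` in place of a phoneme string, and an ASR transcript can be longer than the phrase (the Amazon transcript has 10 tokens against 8 in the phrase), so indexing an ASR source directly with `word_index` is not guaranteed to line up with the expected word.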

In [13]:
print(sample["asr"].keys())
print()
print("amazon: ", sample["asr"]["amazon_data"])
print()
print("kaldi: ", sample["asr"]["kaldi_data"])
print()
print("kaldiNa: ", sample["asr"]["kaldiNa_data"])
print()
print("wave2vec_word: ", sample["asr"]["wav2vec_transcript_words"])
print()
print("wave2vec_phoneme: ", sample["asr"]["wav2vec_transcript_phonemes"])
print()

dict_keys(['amazon_data', 'kaldi_data', 'kaldiNa_data', 'wav2vec_transcript_words', 'wav2vec_transcript_phonemes'])

amazon:  {'word': ['fell', 'helps', 'casey', 'take', 'car', 'of', 'the', 'lap', 'first', 'casey'], 'word_confidence': (0.3942, 1, 0.8517, 1, 0.8232, 0.2946, 1, 0.3627, 1, 0.7284), 'phoneme': ['fɛl', 'hɛlps', 'kasi', 'tak', 'kɑɹ', 'ʌv', 'θʌ', 'læp', 'fɝst', 'kasi']}

kaldi:  {'word': ['val', 'helps', 'casey', 'take', 'car', 'of', 'the', 'lamb'], 'word_confidence': [0.6368449330329895, 1.0, 0.781516432762146, 0.9803750514984131, 0.46812132000923157, 1.0, 0.9921054840087891, 0.7626157999038696], 'phoneme': ['væl', 'hɛlps', 'kasi', 'tak', 'kɑɹ', 'ʌv', 'θʌ', 'læm']}

kaldiNa:  {'word': ['val', 'helps', 'casey', 'take', 'care', 'of', 'the', '<UNK>'], 'word_confidence': [0.6485823392868042, 1.0, 1.0, 1.0, 0.9791126847267151, 1.0, 0.968928337097168, 0.6780429482460022], 'phoneme': ['væl', 'hɛlps', 'kasi', 'tak', 'kɛɹ', 'ʌv', 'θʌ', False]}

wave2vec_word:  {'word': ['bell', 'help