# Lexicon - Orchestrator


## Overview

For this project, I will build a simple custom ochestrator that processes data objects from the "Lexicon" class.
    - These objects are custom datasets that are modeled after the Ted Talk speakers. 
    - Each Lexicon has a corpus and some helper methods aimed at training and prediction
    - Lexicon class will also have a preprocessing and caching function.
    - Each object will have two methods of prediction, n-gram language model and a recurrent neural network model
    - Each object has a custom reporting function that reports the results of training
    - Each object will be able to learn from any text data provided, and return a transcript with confidence values from input posed in speech utterances. 
        - I will use Google's cloud-based services to preprocess the input audio data and transcribe into an initial guess. Then I will train a model to improve on Google cloud speech API's response.


In [1]:
## Use to reload modules
from importlib import reload
%reload_ext autoreload
%autoreload 2

In [2]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS']=os.path.join(os.getcwd(),'Lexicon-e94eff39fad7.json')

In [3]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import tarfile

librispeech_dataset_folder_path = 'LibriSpeech'
tar_gz_path = 'dev-clean.tar.gz'

books_path = 'original-books.tar.gz'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isfile(books_path):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Librispeech Book Texts') as pbar:
        urlretrieve(
            'http://www.openslr.org/resources/12/original-books.tar.gz',
            books_path,
            pbar.hook)

if not isdir(librispeech_dataset_folder_path+'/books'):
    with tarfile.open(books_path) as tar:
        tar.extractall()
        tar.close()
        
        
        
if not isfile(tar_gz_path):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Librispeech dev-clean.tar.gz') as pbar:
        urlretrieve(
            'http://www.openslr.org/resources/12/dev-clean.tar.gz',
            tar_gz_path,
            pbar.hook)

if not isdir(librispeech_dataset_folder_path):
    with tarfile.open(tar_gz_path) as tar:
        tar.extractall()
        tar.close()
        
        
        

In [4]:
import io

# Imports the Google Cloud client library
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

# Instantiates a client
client = speech.SpeechClient()

# The name of the dev-test audio file to transcribe
dev_file_name_0 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0000.flac')
gt0 = 'GO DO YOU HEAR'

dev_file_name_1 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0001.flac')
gt1 = 'BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT'

# The name of the test audio file to transcribe
dev_file_name_2 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0002.flac')
gt2 = 'AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY'

dev_file_name_3 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0003.flac')
gt3 = 'AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE'

dev_file_name_4 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0004.flac')
gt4 = "D'AVRIGNY RUSHED TOWARDS THE OLD MAN AND MADE HIM INHALE A POWERFUL RESTORATIVE"


test_file_name_1 = os.path.join(
    os.getcwd(),
    'RNN-Tutorial-master',
    'data',
    'raw',
    'librivox',
    'LibriSpeech',
    'test-clean-wav',
    '4507-16021-0019.wav')


audio_files = {dev_file_name_0:gt0, dev_file_name_1:gt1, dev_file_name_2:gt2, dev_file_name_3:gt3, dev_file_name_4:gt4}


In [5]:
# Prepare a plain text corpus from which we train a languague model
import glob
import os
import utils
import nltk

# Gather all text files from directory
LIBRISPEECH_DIRECTORY = os.path.join(os.getcwd(),'LibriSpeech/')
TEDLIUM_DIRECTORY = os.path.join(os.getcwd(),'TEDLIUM_release1/')

# TRAINING_DIRECTORY = os.path.abspath(os.path.join(os.sep,'Volumes',"My\ Passport\ for\ Mac",'lexicon','LibriSpeech'))
dev_librispeech_path = "{}{}{}{}".format(LIBRISPEECH_DIRECTORY, 'dev-clean/', '**/', '*.txt*')
train_librispeech_path = "{}{}{}{}{}".format(LIBRISPEECH_DIRECTORY, 'books/', 'utf-8/', '**/', '*.txt*')
TED_path = "{}{}{}{}".format(TEDLIUM_DIRECTORY,'train/','**/', '*.stm')

text_paths = sorted(glob.glob(train_librispeech_path, recursive=True))
segmented_text_paths = sorted(glob.glob(dev_librispeech_path, recursive=True))
stm_paths = sorted(glob.glob(TED_path, recursive=True))

print('Found:',len(text_paths),"text files in the directories {0}\n{1} segmented text files in the {2} directory and \n{3} stm files in directory: {4}:".format(train_librispeech_path, 
        len(segmented_text_paths), dev_librispeech_path, len(stm_paths),TED_path ))

Found: 41 text files in the directories /src/lexicon/LibriSpeech/books/utf-8/**/*.txt*
97 segmented text files in the /src/lexicon/LibriSpeech/dev-clean/**/*.txt* directory and 
774 stm files in directory: /src/lexicon/TEDLIUM_release1/train/**/*.stm:


### Build Text Corpuses for Training

In [6]:
import tensorflow as tf
import re
import codecs
import string
from lexicon import Lexicon
from speech import Speech
      
librispeech_corpus = u""
stm_segments = []
lexicons = {} # {speaker_id: lexicon_object}
speeches = {} # {speech_id: speech_object}
segmented_librispeeches = {}

for book_filename in text_paths[:10]: # 1 Book
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        lines = book_file.read()
        librispeech_corpus += lines
for stm_filename in stm_paths: # Process STM files (Tedlium)
        stm_segments.append(utils.parse_stm_file(stm_filename))
        

# Train on 3 speakers
for segments in stm_segments[15:17]: 
    for segment in segments:
        segment_key = "{0}_{1}_{2}".format(segment.speaker_id.strip(), str(segment.start_time).replace('.','_'),
                                          str(segment.stop_time).replace('.','_'))
        if segment.speaker_id not in speeches.keys():
            source_file = os.path.join(os.getcwd(), 'TEDLIUM_release1',
                                       'train','sph', '{}.sph'.format(segment.filename))
            speech = Speech(speaker_id=segment.speaker_id,
                                           speech_id = segment_key,
                                           source_file=source_file,
                                           ground_truth = ' '.join(segment.transcript.split()[:-1]),
                                           start = segment.start_time,
                                           stop = segment.stop_time,
                                           audio_type = 'LINEAR16')
        else:
            speech = speeches[segment.speaker_id.strip()]
            print('Already found speech in list at location: ', speech)
        
        speeches[segment_key] = speech

        if segment.speaker_id not in lexicons.keys():
            lexicon = Lexicon(base_corpus=librispeech_corpus, name=segment.speaker_id)
            lexicons[segment.speaker_id.strip()] = lexicon
        else:
            lexicon = lexicons[segment.speaker_id.strip()]
        
        # Add Speech to Lexicon
        if speech not in lexicon.speeches:
            lexicon.add_speech(speech)


### Load GCS Transcripts using GCS Wrapper

In [7]:
import numpy as np
view_sentence_range = (0, 10)

for speaker_id, lexicon in lexicons.items():
    print('Dataset Stats')
    print('Roughly the number of unique words: {}'.format(lexicon.vocab_size))
    
    word_counts = [len(sentence.split()) for sentence in lexicon.corpus_sentences]
    print('Number of sentences: {}'.format(len(lexicon.corpus_sentences)))
    print('Average number of words in a sentence: {}'.format(np.average(word_counts)))

    print()
    print('Transcript sentences {} to {}:'.format(*view_sentence_range))
    print('\n'.join(lexicon.training_set[0][view_sentence_range[0]:view_sentence_range[1]]))
    print()
    print('Ground Truth sentences {} to {}:'.format(*view_sentence_range))
    print('\n'.join(lexicon.training_set[1][view_sentence_range[0]:view_sentence_range[1]]))
    print()

Dataset Stats
Roughly the number of unique words: 57965
Number of sentences: 27540
Average number of words in a sentence: 24.005228758169935

Transcript sentences 0 to 10:
 Merrick
 Would it not be better for me to send these
papers by a messenger to your house?"

"No; I'll take them myself
 No one will rob me
" And then the door
swung open and, chuckling in his usual whimsical fashion, Uncle John
came out, wearing his salt-and-pepper suit and stuffing; a bundle of
papers into his inside pocket


The Major stared at him haughtily, but made no attempt to openly
recognize the man
 Uncle John gave a start, laughed, and then walked
away briskly, throwing a hasty "good-bye" to the obsequious banker,
who followed him out, bowing low


The Major returned to his office with a grave face, and sat for the
best part of three hours in a brown study
 Then he took his hat and
went home


Patsy asked anxiously if anything had happened, when she saw his face;
but the Major shook his head


Uncle John 

### Preprocess Dataset - Tokenize Corpus

In [8]:
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import stopwords
import re
import codecs
import string

# reading the file in unicode format using codecs library    
stoplist = set(stopwords.words('english'))
# Strip punctuation
translate_table = dict((ord(char), None) for char in string.punctuation) 
        
corpus_raw = u""
for book_filename in text_paths:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        lines = book_file.read()
        corpus_raw += lines.translate(translate_table) # remove punctuations 

               
# Tokenize
tokenized_words = nltk.tokenize.word_tokenize(corpus_raw)

## Clean the tokens ##
# Remove stop words
tokenized_words = [word for word in tokenized_words if word not in stoplist]

# Remove single-character tokens (mostly punctuation)
tokenized_words = [word for word in tokenized_words if len(word) > 1]

# Remove numbers
tokenized_words = [word for word in tokenized_words if not word.isnumeric()]

# Lowercase all words (default_stopwords are lowercase too)
tokenized_words = [word.lower() for word in tokenized_words]

### Preprocess Dataset - Extract N-Gram Model

In [9]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk

# extracting the bi-grams and sorting them according to their frequencies
finder = BigramCollocationFinder.from_words(tokenized_words)
# finder.apply_freq_filter(3)

bigram_model = nltk.bigrams(tokenized_words)
bigram_model = sorted(bigram_model, key=lambda item: item[1], reverse=True)  
# print(bigram_model)
print('')
print('')
print('')
np.save("lang_model.npy",bigram_model)






In [10]:
fdist = nltk.FreqDist(bigram_model)

# Output top 50 words
print("Word|Freq:")
for word, frequency in fdist.most_common(50):
    print(u'{}|{}'.format(word, frequency))

Word|Freq:
('project', 'gutenbergtm')|1095
('project', 'gutenberg')|1014
('greater', 'part')|532
('captain', 'nemo')|452
('united', 'states')|407
('great', 'britain')|385
('uncle', 'john')|364
('gold', 'silver')|337
('let', 'us')|331
('of', 'course')|328
('new', 'york')|310
('old', 'man')|306
('gutenbergtm', 'electronic')|306
('mr', 'bounderby')|294
('public', 'domain')|293
('every', 'one')|291
('young', 'man')|284
('mrs', 'sparsit')|282
('one', 'day')|281
('one', 'another')|280
('archive', 'foundation')|279
('gutenberg', 'literary')|279
('literary', 'archive')|279
('dont', 'know')|275
('electronic', 'works')|272
('per', 'cent')|263
('could', 'see')|262
('ned', 'land')|254
('good', 'deal')|247
('two', 'three')|240
('set', 'forth')|225
('years', 'ago')|220
('old', 'woman')|219
('you', 'may')|218
('it', 'would')|207
('the', 'first')|206
('next', 'day')|201
('long', 'time')|200
('said', 'mrs')|199
('said', 'mr')|198
('of', 'the')|198
('first', 'time')|196
('every', 'day')|193
('one', 'thi

In [11]:
cfreq_2gram = nltk.ConditionalFreqDist(bigram_model)
# print('Conditional Frequency Conditions:\n', cfreq_2gram)
print()

# First access the FreqDist associated with "one", then the keys in that FreqDist
print("Listing the words that can follow after 'greater':\n", cfreq_2gram["greater"].keys())
print()

# Determine Most common in conditional frequency
print("Listing 20 most frequent words to come after 'greater':\n", cfreq_2gram["greater"].most_common(20))


Listing the words that can follow after 'greater':
 dict_keys(['valuable', 'smaller', 'gehenna', 'desolation', 'dexterity', 'opportunities', 'woe', 'returns', 'perithous', 'tenant', 'honor', 'indeed', 'agony', 'glorious', 'latter', 'account', 'divinity', 'sum', 'indignation', 'sin', 'advantage', 'transgression', 'flourish', 'necessary', 'teacher', 'activity', 'fortune', 'whole', 'rank', 'beauty', 'abundance', 'cost', 'disorders', 'sums', 'told', 'opening', 'action', 'parsimony', 'claim', 'worlds', 'beginning', 'convenience', 'labourers', 'change', 'supply', 'require', 'equal', 'found', 'weal', 'confidence', 'expected', 'knave', 'scarcity', 'quantity', 'gift', 'thoughts', 'trade', 'insult', 'deviation', 'stock', 'second', 'prince', 'extent', 'great', 'semblance', 'america', 'zeal', 'solidarity', 'clerk', 'want', 'among', 'rum', 'sun', 'riches', 'lesser', 'wealth', 'proportion', 'importation', 'slaves', 'liberty', 'grew', 'sorrow', 'intelligence', 'it', 'moment', 'need', 'inferiority', 

In [12]:
# For each word in the evaluation list:
# Select word and determine its frequency distribution
# Grab probability of second word in the list
# Continue this process until the sentence is scored

# Add small epsilon value to avoid division by zero
epsilon = 0.0000001

# Loads the audio into memory
for audio, ground_truth in audio_files.items():
    with io.open(audio, 'rb') as audio_file:
        content = audio_file.read()
        audio = types.RecognitionAudio(content=content)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US',
        max_alternatives=10,
        profanity_filter=False,
        enable_word_time_offsets=True)

    # Detects speech and words in the audio file
    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    result = operation.result(timeout=90)

    alternatives = result.results[0].alternatives


    #print("API Results: ", alternatives)
    print()
    print()

    rerank_results = {}
    for alternative in alternatives:
        sent = alternative.transcript

        words = nltk.tokenize.word_tokenize(sent)
        probs = np.ones_like(words, dtype=np.float32)*epsilon
        # print(words,'\n',probs)
        for word in words:
            if words.index(word) < len(words)-1: 
                freq = cfreq_2gram[word].freq(words[words.index(word)+1])
                probs[words.index(word)] = freq
            # print(probs)

        lexicon_score = np.sum(probs)
        # print(word_score)

        # Re-rank alternatives using a weighted average of the two scores
        api_weight = 0.90
        confidence_score = alternative.confidence*api_weight + lexicon_score*(1-api_weight)
        rerank_results[alternative.transcript] = confidence_score

    print("RE-RANKED Results: \n", rerank_results)
    print()
    print()

    import operator
    index, value = max(enumerate(list(rerank_results.values())), key=operator.itemgetter(1))
    # Select Corresponding Transcript:
    script=''
    for trnscript, confidence in rerank_results.items():
        if confidence == value:
            script = trnscript

    # Evaluate the differences between the Original and the Reranked transcript:
    print("ORIGINAL Transcript: \n'{0}' \nwith a confidence_score of: {1}".format(alternative.transcript, alternative.confidence))
    
    
    print()
    print()
    print("RE-RANKED Transcript: \n'{0}' \nwith a confidence_score of: {1}".format(script, value))
    
    print()
    print()
    print("GROUND TRUTH TRANSCRIPT: \n{0}".format(ground_truth))
    print()
    ranked_differences = list(set(nltk.tokenize.word_tokenize(alternative.transcript.lower())) -
                              set(nltk.tokenize.word_tokenize(script.lower())))
    if len(ranked_differences) == 0:  
        print("No reranking was performed. The transcripts match!")
    else:
        print("The original transcript was RE-RANKED. The transcripts do not match!")
        print("Differences between original and re-ranked: ", ranked_differences)
    print()
    print()
    
    # Evaluate Differences between the Original and Ground Truth:
    gt_orig_diff = list(set(nltk.tokenize.word_tokenize(alternative.transcript.lower())) -
                              set(nltk.tokenize.word_tokenize(ground_truth.lower())))
    if len(gt_orig_diff) == 0:  
        print("The ORIGINAL transcript matches ground truth!")
    else:
        print("The original transcript DOES NOT MATCH ground truth.")
        print("Differences between original and ground truth: ", gt_orig_diff)
    print()
    print()
    
    
    gt_rr_diff = list(set(nltk.tokenize.word_tokenize(script.lower())) -
                              set(nltk.tokenize.word_tokenize(ground_truth.lower())))
    if len(gt_rr_diff) == 0:  
        print("The RE-RANKED transcript matches ground truth!")
    else:
        print("The RE_RANKED transcript DOES NOT MATCH ground truth.")
        print("Differences between Reranked and ground truth: ", gt_rr_diff)
    print()
    print()
    
    print()
    print()
    
    
    # Compute the Levenshtein Distance (a.k.a. Edit Distance)
#     import nltk.metrics.distance as lev_dist
    
    # Google API Edit Distance
    goog_edit_distance = nltk.edit_distance(alternative.transcript.lower(), ground_truth.lower())
    
    # Re-Ranked Edit Distance
    rr_edit_distance = nltk.edit_distance(script.lower(), ground_truth.lower())

    
    print("ORIGINAL Edit Distance: \n{0}".format(goog_edit_distance))
    print("RE-RANKED Edit Distance: \n{0}".format(rr_edit_distance))
    print()
    print()
    

Waiting for operation to complete...


RE-RANKED Results: 
 {'go go do you hear': 0.85315002696588638, 'go do you here': 0.86520871338434524, 'I go do you hear': 0.81552877281792457, 'go do here': 0.85310941742492696, 'do you here': 0.75866525587625799, 'go do you hear': 0.77847528909333052, 'goat do you hear': 0.85976193998940287, 'do you hear': 0.75895958994515234, 'goat do you here': 0.85946760592050853, 'I go do you here': 0.81523443874903023}


ORIGINAL Transcript: 
'goat do you here' 
with a confidence_score of: 0.9545454978942871


RE-RANKED Transcript: 
'go do you here' 
with a confidence_score of: 0.8652087133843452


GROUND TRUTH TRANSCRIPT: 
GO DO YOU HEAR

The original transcript was RE-RANKED. The transcripts do not match!
Differences between original and re-ranked:  ['goat']


The original transcript DOES NOT MATCH ground truth.
Differences between original and ground truth:  ['here', 'goat']


The RE_RANKED transcript DOES NOT MATCH ground truth.
Differences between Rera

Waiting for operation to complete...


RE-RANKED Results: 
 {'and the cry issued from his pores if we made us speak a cry frightful and its silence': 0.8229485416784883, 'and the cry issued from his pores if we made the speak a cry frightful and its silence': 0.81755691375583417, 'and the cry issued from his pores if we may the speak a cry frightful and its silence': 0.80107996519655, 'and the cry issued from his pores if we made the speak a cry frightful and it silence': 0.77812602724879987, 'and the cry issued from his pores if we made the speak a cry frightful in it silence': 0.77820050343871117, 'and the cry issued from his pores if we made us speak a cry frightful in its silence': 0.8230056628584862, "and the cry issued from his pores if we made the speak a cry frightful and it's silence": 0.82358051147311928, 'and the cry issued from his pores if we may the speak a cry frightful in it silence': 0.76172350123524668, 'and the cry issued from his pores if we made the speak a cry fri

### Evaluate N-Gram Model on Dataset

In [13]:
# Gather all samples, load into dictionary
# Prepare a plain text corpus from which we train a languague model
import glob
import operator

# Gather all text files from directory
WORKING_DIRECTORY = os.path.join(os.getcwd(),'LibriSpeech/')

dev_path = "{}{}{}{}".format(WORKING_DIRECTORY, 'dev-clean/', '**/', '*.txt')
train_path = "{}{}{}{}{}".format(WORKING_DIRECTORY, 'books/', 'utf-8/', '**/', '*.txt*')

text_paths = sorted(glob.glob(dev_path, recursive=True))
print('Found',len(text_paths),'text files in the directory:', dev_path)

transcripts = {}
for document in text_paths:
    with codecs.open(document, 'r', 'utf-8') as filep:
        for i,line in enumerate(filep):
            transcripts[line.split()[0]] = ' '.join(line.split()[1:])

## Evaluate all samples found ##
cloud_speech_api_accuracy = []
custom_lang_model_accuracy = []
epsilon = 0.000000001
api_weight = 0.85
steps = 0
# Pull In Audio File
for filename, gt_transcript in transcripts.items():
    steps += 1
    dirs = filename.split('-')
    
    audio_filepath = dev_file_name_0 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    dirs[0],
    dirs[1],
    "{0}.flac".format(filename))
    
    

    # Load the audio into memory
    with io.open(audio_filepath, 'rb') as audio_file:
        content = audio_file.read()
        audio = types.RecognitionAudio(content=content)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US',
        max_alternatives=10,
        profanity_filter=False,
        enable_word_time_offsets=True)

    # Detects speech and words in the audio file
    operation = client.long_running_recognize(config, audio)
    result = operation.result(timeout=90)
    alternatives = result.results[0].alternatives


    # Evaluate API Results for Re-Ranking:
    rerank_results = {}
    for alternative in alternatives:
        sent = alternative.transcript
        
        # Strip punctuation
        translate_table = dict((ord(char), None) for char in string.punctuation)        
        sent = sent.translate(translate_table) # remove punctuations

        words = nltk.tokenize.word_tokenize(sent)
        probs = np.ones_like(words, dtype=np.float32)*epsilon

        for word in words:
            if words.index(word) < len(words)-1: 
                freq = cfreq_2gram[word].freq(words[words.index(word)+1])
                probs[words.index(word)] = freq

        lexicon_score = np.sum(probs)

        # Re-rank alternatives using a weighted average of the two scores
        confidence_score = alternative.confidence*api_weight + lexicon_score*(1-api_weight)
        rerank_results[alternative.transcript] = confidence_score


    
    index, value = max(enumerate(list(rerank_results.values())), key=operator.itemgetter(1))
    # Select Corresponding Transcript:
    script=''
    for trnscript, confidence in rerank_results.items():
        if confidence == value:
            script = trnscript
                
    # Compute the Accuracy, based on the Levenshtein Distance (a.k.a. Edit Distance)
    gcs_ed = nltk.edit_distance(alternative.transcript.lower(), gt_transcript.lower())
    gcs_upper_bound = max(len(alternative.transcript),len(gt_transcript))
    gcs_accuracy = (1.0 - gcs_ed/gcs_upper_bound)
    
    clm_ed = nltk.edit_distance(script.lower(), gt_transcript.lower())
    clm_upper_bound = max(len(script),len(gt_transcript))
    clm_accuracy = (1.0 - clm_ed/clm_upper_bound)
    
    cloud_speech_api_accuracy.append(gcs_accuracy)
    custom_lang_model_accuracy.append(clm_accuracy)

    if steps % 100 == 0:
        print("{0} Transcripts Processed.".format(steps))
        print('Average API Accuracy:', np.mean(cloud_speech_api_accuracy))
        print('Average Custom Model Accuracy:', np.mean(custom_lang_model_accuracy))
        print()


Found 97 text files in the directory: /src/lexicon/LibriSpeech/dev-clean/**/*.txt


KeyboardInterrupt: 

In [None]:
# Use other TED speeches for building test set
test_speeches = {}
for segments in stm_segments:
    for segment in segments:
        segment_key = "{0}_{1}_{2}".format(segment.speaker_id.strip(), str(segment.start_time).replace('.','_'),
                                          str(segment.stop_time).replace('.','_'))

        speech = None
        # If not already exist
        if segment.speaker_id not in test_speeches.keys():
            # Connect to Cloud API to get Candidate Transcripts
            source_file = os.path.join(os.getcwd(), 'TEDLIUM_release1', 'train','sph', '{}.sph'.format(segment.filename))
            speech = Speech(speaker_id=segment.speaker_id,
                                           speech_id = segment_key,
                                           source_file=source_file,
                                           ground_truth = ' '.join(segment.transcript.split()[:-1]),
                                           start = segment.start_time,
                                           stop = segment.stop_time,
                                           audio_type = 'LINEAR16')
        else:
            speech = test_speeches[segment.speaker_id.strip()]
            print('Already found speech in list at location: ', speech)
        
        
        
        test_speeches[segment_key] = speech

### Get Cloud Speech API Results

In [14]:
def get_audio_size(audio_filepath):
    statinfo = os.stat(audio_filepath)
    return statinfo.st_size

In [None]:
from gcs_api_wrapper import GCSWrapper

speaker_id, lexicon = list(lexicons.items())[0]
gcs = GCSWrapper()
cache_directory = os.path.join(os.getcwd(), 'datacache', 'speech_objects')
for speech_id, speech in test_speeches.items():
    # Not already saved in prepocess cache
    cache_file = os.path.join(cache_directory,'{}_preprocess.p'.format(speech.speech_id))
    if not speech.candidate_transcripts: 
        size = get_audio_size(speech.audio_file)
        
        #TODO: Split large audio file into new files, build new speech objects
        if size < 10485760:
            try:
                result = gcs.transcribe_speech(speech.audio_file)
            except:
                result = None
            if result:
                speech.populate_gcs_results(result)
                speech.preprocess_and_save()
                print('Adding speech with candidate_transcripts to lexicon')
                lexicon.add_speech(speech)

### Train LSTM Net and Evaluate

In [19]:
speaker_id, lexicon = list(lexicons.items())[0]
lexicon.optimize(early_stop=True)
#lexicon.evaluate_testset()

Epoch   0 Batch  500/2400 - Train Accuracy: 0.5570, Validation Accuracy: 0.6339, Loss: 3.0404
Epoch   0 Batch 1000/2400 - Train Accuracy: 0.3839, Validation Accuracy: 0.6339, Loss: 4.5766
Epoch   0 Batch 1500/2400 - Train Accuracy: 0.6307, Validation Accuracy: 0.6339, Loss: 2.7718
Epoch   0 Batch 2000/2400 - Train Accuracy: 0.3582, Validation Accuracy: 0.6339, Loss: 4.4786
Epoch   1 Batch  500/2400 - Train Accuracy: 0.6232, Validation Accuracy: 0.6339, Loss: 2.5335
Epoch   1 Batch 1000/2400 - Train Accuracy: 0.4174, Validation Accuracy: 0.6339, Loss: 3.7214
Epoch   1 Batch 1500/2400 - Train Accuracy: 0.6705, Validation Accuracy: 0.6339, Loss: 2.1842
Epoch   1 Batch 2000/2400 - Train Accuracy: 0.4183, Validation Accuracy: 0.6339, Loss: 3.9006
Epoch   2 Batch  500/2400 - Train Accuracy: 0.6415, Validation Accuracy: 0.6339, Loss: 2.2166
Epoch   2 Batch 1000/2400 - Train Accuracy: 0.4576, Validation Accuracy: 0.6339, Loss: 3.2150
Epoch   2 Batch 1500/2400 - Train Accuracy: 0.6903, Validati

Epoch  22 Batch  500/2400 - Train Accuracy: 0.8585, Validation Accuracy: 0.6339, Loss: 0.5283
Epoch  22 Batch 1000/2400 - Train Accuracy: 0.7723, Validation Accuracy: 0.6339, Loss: 0.5823
Epoch  22 Batch 1500/2400 - Train Accuracy: 0.9489, Validation Accuracy: 0.6339, Loss: 0.1528
Epoch  22 Batch 2000/2400 - Train Accuracy: 0.7788, Validation Accuracy: 0.6384, Loss: 0.6134
Epoch  23 Batch  500/2400 - Train Accuracy: 0.8493, Validation Accuracy: 0.6339, Loss: 0.4685
Epoch  23 Batch 1000/2400 - Train Accuracy: 0.7969, Validation Accuracy: 0.6339, Loss: 0.5515
Epoch  23 Batch 1500/2400 - Train Accuracy: 0.9716, Validation Accuracy: 0.6339, Loss: 0.1543
Epoch  23 Batch 2000/2400 - Train Accuracy: 0.7620, Validation Accuracy: 0.6362, Loss: 0.5848
Epoch  24 Batch  500/2400 - Train Accuracy: 0.8676, Validation Accuracy: 0.6339, Loss: 0.4613
Epoch  24 Batch 1000/2400 - Train Accuracy: 0.8013, Validation Accuracy: 0.6339, Loss: 0.5133
Epoch  24 Batch 1500/2400 - Train Accuracy: 0.9631, Validati

Epoch  44 Batch  500/2400 - Train Accuracy: 0.9375, Validation Accuracy: 0.6362, Loss: 0.1641
Epoch  44 Batch 1000/2400 - Train Accuracy: 0.9442, Validation Accuracy: 0.6339, Loss: 0.1380
Epoch  44 Batch 1500/2400 - Train Accuracy: 0.9943, Validation Accuracy: 0.6339, Loss: 0.0358
Epoch  44 Batch 2000/2400 - Train Accuracy: 0.8654, Validation Accuracy: 0.6362, Loss: 0.2172
Epoch  45 Batch  500/2400 - Train Accuracy: 0.9191, Validation Accuracy: 0.6384, Loss: 0.1719
Epoch  45 Batch 1000/2400 - Train Accuracy: 0.9509, Validation Accuracy: 0.6339, Loss: 0.1356
Epoch  45 Batch 1500/2400 - Train Accuracy: 0.9972, Validation Accuracy: 0.6362, Loss: 0.0339
Epoch  45 Batch 2000/2400 - Train Accuracy: 0.8942, Validation Accuracy: 0.6339, Loss: 0.1802
Epoch  46 Batch  500/2400 - Train Accuracy: 0.9283, Validation Accuracy: 0.6339, Loss: 0.1879
Epoch  46 Batch 1000/2400 - Train Accuracy: 0.9129, Validation Accuracy: 0.6339, Loss: 0.1220
Epoch  46 Batch 1500/2400 - Train Accuracy: 0.9716, Validati

Epoch  66 Batch  500/2400 - Train Accuracy: 0.9393, Validation Accuracy: 0.6339, Loss: 0.0905
Epoch  66 Batch 1000/2400 - Train Accuracy: 0.9866, Validation Accuracy: 0.6339, Loss: 0.0568
Epoch  66 Batch 1500/2400 - Train Accuracy: 0.9744, Validation Accuracy: 0.6362, Loss: 0.0192
Epoch  66 Batch 2000/2400 - Train Accuracy: 0.9760, Validation Accuracy: 0.6339, Loss: 0.0857
Epoch  67 Batch  500/2400 - Train Accuracy: 0.9540, Validation Accuracy: 0.6339, Loss: 0.0758
Epoch  67 Batch 1000/2400 - Train Accuracy: 0.9777, Validation Accuracy: 0.6339, Loss: 0.0533
Epoch  67 Batch 1500/2400 - Train Accuracy: 0.9858, Validation Accuracy: 0.6362, Loss: 0.0218
Epoch  67 Batch 2000/2400 - Train Accuracy: 0.9832, Validation Accuracy: 0.6339, Loss: 0.0672
Epoch  68 Batch  500/2400 - Train Accuracy: 0.9467, Validation Accuracy: 0.6339, Loss: 0.0947
Epoch  68 Batch 1000/2400 - Train Accuracy: 0.9777, Validation Accuracy: 0.6339, Loss: 0.0587
Epoch  68 Batch 1500/2400 - Train Accuracy: 1.0000, Validati

Epoch  88 Batch  500/2400 - Train Accuracy: 0.9614, Validation Accuracy: 0.6339, Loss: 0.0491
Epoch  88 Batch 1000/2400 - Train Accuracy: 0.9777, Validation Accuracy: 0.6339, Loss: 0.0569
Epoch  88 Batch 1500/2400 - Train Accuracy: 0.9943, Validation Accuracy: 0.6339, Loss: 0.0110
Epoch  88 Batch 2000/2400 - Train Accuracy: 0.9880, Validation Accuracy: 0.6339, Loss: 0.0480
Epoch  89 Batch  500/2400 - Train Accuracy: 0.9651, Validation Accuracy: 0.6339, Loss: 0.0488
Epoch  89 Batch 1000/2400 - Train Accuracy: 0.9911, Validation Accuracy: 0.6339, Loss: 0.0445
Epoch  89 Batch 1500/2400 - Train Accuracy: 1.0000, Validation Accuracy: 0.6339, Loss: 0.0130
Epoch  89 Batch 2000/2400 - Train Accuracy: 0.9856, Validation Accuracy: 0.6339, Loss: 0.0293


### Evaluate LSTM Net Only

In [None]:
speaker_id, lexicon = list(lexicons.items())[0]
print("List of Speeches:", len(lexicon.speeches))
lexicon.evaluate_testset()

In [None]:
import helper 
# Save parameters for checkpoint
speaker_id, lexicon = list(lexicons.items())[0]
helper.save_params(lexicon.cache_dir)

In [None]:
import tensorflow as tf
import numpy as np
import helper
speaker_id, lexicon = list(lexicons.items())[0]
_, (source_vocab_to_int, target_vocab_to_int), (source_int_to_vocab, target_int_to_vocab) = helper.load_preprocess()
load_path = helper.load_params(lexicon.cache_dir)