# Lexicon - Custom Language Model


## Overview

For this project, I will build a simple custom language model that learns from any text data provided and returns a transcript with confidence values for spoken input utterances. I will use Google's cloud-based services to preprocess the input audio and produce an initial transcription. Then I will train a model to improve on the Google Cloud Speech API's response.



## Getting Started

To use Google's cloud-based services, you first need to create an account on the [Google Cloud Platform](https://cloud.google.com/).

Then enable each service you want to use (here, the Cloud Speech API) for your project.

In [1]:
!pip install --upgrade google-cloud-speech

Requirement already up-to-date: google-cloud-speech in /Users/deanmwebb/anaconda/lib/python2.7/site-packages
Requirement already up-to-date: google-gax<0.16dev,>=0.15.14 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-cloud-speech)
Requirement already up-to-date: google-cloud-core<0.28dev,>=0.27.0 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-cloud-speech)
Requirement already up-to-date: googleapis-common-protos[grpc]<2.0dev,>=1.5.2 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-cloud-speech)
Requirement already up-to-date: ply==3.8 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.14->google-cloud-speech)
Requirement already up-to-date: dill<0.3dev,>=0.2.5 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.14->google-cloud-speech)
Requirement already up-to-date: future<0.17dev,>=0.16.0 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages

### Install the Google Cloud SDK: https://cloud.google.com/sdk/docs/

In [2]:
!CLOUDSDK_CORE_DISABLE_PROMPTS=1 ./google-cloud-sdk/install.sh

Welcome to the Google Cloud SDK!

To help improve the quality of this product, we collect anonymized usage data
and anonymized stacktraces when crashes are encountered; additional information
is available at <https://cloud.google.com/sdk/usage-statistics>. You may choose
to opt out of this collection now (by choosing 'N' at the below prompt), or at
any time in the future by running the following command:

    gcloud config set disable_usage_reporting true


Your current Cloud SDK version is: 170.0.1
The latest available version is: 171.0.0

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                   Components                                                   │
├──────────────────┬──────────────────────────────────────────────────────┬──────────────────────────┬───────────┤
│      Status      │                         Name                         │            ID            │    Si

## Authenticate with Google Cloud API:

In [3]:
!source google-cloud-sdk/completion.bash.inc && \
source google-cloud-sdk/path.bash.inc && \
gcloud auth activate-service-account lexicon-bot@exemplary-oath-179301.iam.gserviceaccount.com --key-file=Lexicon-e94eff39fad7.json

Activated service account credentials for: [lexicon-bot@exemplary-oath-179301.iam.gserviceaccount.com]


In [4]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS']='/Users/deanmwebb/Google Drive/Development/consulting/lexicon/Lexicon-e94eff39fad7.json'

### Test out Cloud Speech API

In [619]:
import io

# Imports the Google Cloud client library
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

# Instantiates a client
client = speech.SpeechClient()

# The name of the dev-test audio file to transcribe
dev_file_name_0 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0000.flac')
gt0 = 'GO DO YOU HEAR'

dev_file_name_1 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0001.flac')
gt1 = 'BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT'

# The name of the test audio file to transcribe
dev_file_name_2 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0002.flac')
gt2 = 'AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY'

dev_file_name_3 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0003.flac')
gt3 = 'AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE'

dev_file_name_4 = os.path.join(
    os.getcwd(),
    'LibriSpeech',
    'dev-clean',
    '84',
    '121123',
    '84-121123-0004.flac')
gt4 = "D'AVRIGNY RUSHED TOWARDS THE OLD MAN AND MADE HIM INHALE A POWERFUL RESTORATIVE"


test_file_name_1 = os.path.join(
    os.getcwd(),
    'RNN-Tutorial-master',
    'data',
    'raw',
    'librivox',
    'LibriSpeech',
    'test-clean-wav',
    '4507-16021-0019.wav')


audio_files = {dev_file_name_0:gt0, dev_file_name_1:gt1, dev_file_name_2:gt2, dev_file_name_3:gt3, dev_file_name_4:gt4}


# Loads the audio into memory
with io.open(dev_file_name_2, 'rb') as audio_file:
    content = audio_file.read()
    audio = types.RecognitionAudio(content=content)

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=10,
    profanity_filter=False,
    enable_word_time_offsets=True)

# Detects speech and words in the audio file
operation = client.long_running_recognize(config, audio)

print('Waiting for operation to complete...')
result = operation.result(timeout=90)

alternatives = result.results[0].alternatives
for alternative in alternatives:
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence Score: {}'.format(alternative.confidence))

    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        start = start_time.seconds + start_time.nanos * 1e-9
        end = end_time.seconds + end_time.nanos * 1e-9
        delta = end - start
        
        print('Word: {}, start_time (s): {}, end_time (s): {}, total_time (s): {}'.format(
            word,
            start,
            end,
            delta))
        
        #TODO: Do we need to figure out how to assign words to alternatives?
            # If same amounts, assign words to index of parsed word

Waiting for operation to complete...
Transcript: at this moment to the whole soul of the Old Man scene centered in his eyes which became bloodshot the veins of the throat swelled his cheeks and temples became purple as though he was struck with epilepsy nothing was wanting to complete this but the utterance of a cry
Confidence Score: 0.9484339356422424
Word: at, start_time (s): 0.1, end_time (s): 0.5, total_time (s): 0.4
Word: this, start_time (s): 0.5, end_time (s): 0.6000000000000001, total_time (s): 0.10000000000000009
Word: moment, start_time (s): 0.6000000000000001, end_time (s): 0.8, total_time (s): 0.19999999999999996
Word: to, start_time (s): 0.8, end_time (s): 1.1, total_time (s): 0.30000000000000004
Word: the, start_time (s): 1.1, end_time (s): 1.3, total_time (s): 0.19999999999999996
Word: whole, start_time (s): 1.3, end_time (s): 1.5, total_time (s): 0.19999999999999996
Word: soul, start_time (s): 1.5, end_time (s): 2.0, total_time (s): 0.5
Word: of, start_time (s): 2.0, en

### Download the Dataset

In [6]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import tarfile

librispeech_dataset_folder_path = 'LibriSpeech'
tar_gz_path = 'dev-clean.tar.gz'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isfile(tar_gz_path):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Librispeech dev-clean.tar.gz') as pbar:
        urlretrieve(
            'http://www.openslr.org/resources/12/dev-clean.tar.gz',
            tar_gz_path,
            pbar.hook)

if not isdir(librispeech_dataset_folder_path):
    # The with-block closes the archive automatically, so no explicit close() is needed
    with tarfile.open(tar_gz_path) as tar:
        tar.extractall()

### Preprocess Dataset - Download Dependencies

#### NLTK Dependencies

In [7]:
import nltk #NLP Toolkit
nltk.download()
nltk.download('punkt')
nltk.download('stopwords')

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### Gensim Dependencies

In [11]:
!pip install --upgrade gensim
import gensim

Requirement already up-to-date: gensim in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages
Requirement already up-to-date: six>=1.5.0 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: smart-open>=1.2.1 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: numpy>=1.11.3 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: scipy>=0.18.1 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: requests in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from smart-open>=1.2.1->gensim)
Requirement already up-to-date: bz2file in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from smart-open>=1.2.1->gensim)
Requirement already up-to-date: boto>=2.32 in /Users/deanmwebb/anaconda/envs/sdc_de

In [516]:
# Prepare a plain text corpus from which we train a language model
import glob

# Gather all text files from directory
WORKING_DIRECTORY = os.path.join(os.getcwd(),'LibriSpeech/')

# TRAINING_DIRECTORY = os.path.abspath(os.path.join(os.sep,'Volumes',"My\ Passport\ for\ Mac",'lexicon','LibriSpeech'))
dev_path = "{}{}{}{}".format(WORKING_DIRECTORY, 'dev-clean/', '**/', '*.txt')
train_path = "{}{}{}{}{}".format(WORKING_DIRECTORY, 'books/', 'utf-8/', '**/', '*.txt*')

text_paths = sorted(glob.glob(train_path, recursive=True))
print('Found',len(text_paths),'text files in the directory:', train_path)

Found 41 text files in the directory: /Users/deanmwebb/Google Drive/Development/consulting/lexicon/LibriSpeech/books/utf-8/**/*.txt*


### Preprocess Dataset - Tokenize Corpus

In [397]:
import re

def sentence_to_wordlist(raw):
    # Replace every non-letter character with a space, then split on whitespace
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [552]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.corpus import stopwords
import re
import codecs
import string

# English stop words from NLTK
stoplist = set(stopwords.words('english'))

# Read every book as UTF-8 (via the codecs library) and concatenate into one string
corpus_raw = u""
for book_filename in text_paths:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()

# Tokenize
tokenized_words = nltk.tokenize.word_tokenize(corpus_raw)

## Clean the tokens ##
# Remove stop words
tokenized_words = [word for word in tokenized_words if word not in stoplist]

# Remove single-character tokens (mostly punctuation)
tokenized_words = [word for word in tokenized_words if len(word) > 1]

# Remove numbers
tokenized_words = [word for word in tokenized_words if not word.isnumeric()]

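# Lowercase all words (the entries in stoplist are lowercase too).
# Note: this runs after the stop-word filter, so capitalized stop words such as 'The' are kept.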
tokenized_words = [word.lower() for word in tokenized_words]


## DON'T USE THIS ##
# Stemming words seems to make matters worse, disabled
# stemmer = nltk.stem.snowball.SnowballStemmer('english')
# tokenized_words = [stemmer.stem(word) for word in tokenized_words]
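
The cell above removes stop words before lowercasing, so a capitalized token such as "The" is not matched by the lowercase stop list and survives the filter. As a minimal sketch of the alternative ordering (lowercase first, then filter), here is a toy version using only the standard library. `STOPLIST` and `clean_tokens` are hypothetical names introduced for illustration, and the tiny stop list stands in for `stopwords.words('english')`.

```python
# Hypothetical stand-in for the NLTK English stop word list
STOPLIST = {"the", "of", "and", "a", "in"}

def clean_tokens(tokens, stoplist=STOPLIST):
    """Lowercase first, then drop stop words, one-character tokens and numbers."""
    lowered = [tok.lower() for tok in tokens]
    return [tok for tok in lowered
            if tok not in stoplist      # stop words, now matched regardless of case
            and len(tok) > 1            # single characters (mostly punctuation)
            and not tok.isnumeric()]    # pure numbers

tokens = ["The", "soul", "of", "the", "Old", "Man", ",", "1844", "!"]
cleaned = clean_tokens(tokens)
print(cleaned)
```

Whether this ordering is preferable depends on the corpus; it changes the bigram counts reported later in this notebook.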

### Preprocess Dataset - Extract N-Gram Model

In [609]:
import numpy as np  # needed for np.save below
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk

# Collocation finder over the token stream (frequency filter left disabled)
finder = BigramCollocationFinder.from_words(tokenized_words)
# finder.apply_freq_filter(3)

# Extract every adjacent word pair, then sort the pairs in reverse alphabetical
# order of their second word (frequencies are counted later with FreqDist)
bigram_model = nltk.bigrams(tokenized_words)
bigram_model = sorted(bigram_model, key=lambda item: item[1], reverse=True)  
# print(bigram_model)
print('')
print('')
print('')
np.save("lang_model.npy",bigram_model)
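
To make the bigram step concrete, here is a small sketch of what extracting and counting adjacent word pairs amounts to, using only the standard library rather than NLTK. The `bigrams` helper and the token list are made up for illustration; `collections.Counter` plays the role that `FreqDist` plays in the next cell.

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

# Toy token stream (not taken from the corpus)
tokens = ["old", "man", "said", "old", "man", "cried"]
pairs = bigrams(tokens)

# Count how often each pair occurs
counts = Counter(pairs)
print(counts.most_common(1))
```

A list of n tokens yields n - 1 bigrams, and sorting the pairs does not affect the counts.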






#### Frequency Distribution

In [610]:
fdist = nltk.FreqDist(bigram_model)

# Output the 50 most common bigrams
print("Word|Freq:")
for word, frequency in fdist.most_common(50):
    print(u'{}|{}'.format(word, frequency))

Word|Freq:
("''", '``')|7524
("''", 'said')|2491
('said', '``')|1331
('project', 'gutenberg-tm')|1095
('project', 'gutenberg')|1018
("''", 'the')|945
('``', 'the')|777
('--', "''")|673
('``', 'but')|630
("''", 'and')|619
('``', 'you')|617
('``', 'it')|586
('ca', "n't")|557
('greater', 'part')|532
("''", 'quoth')|515
('``', 'yes')|503
('``', 'what')|497
('captain', 'nemo')|483
('could', "n't")|462
('it', "'s")|453
("''", 'asked')|448
('``', 'no')|444
('``', 'well')|425
("''", "''")|412
('``', 'oh')|409
('united', 'states')|408
('uncle', 'john')|388
('``', 'and')|387
('great', 'britain')|384
('wo', "n't")|384
('ai', "n't")|371
("n't", 'know')|369
("''", 'replied')|368
('would', "n't")|362
('says', "''")|361
('``', 'we')|340
('gold', 'silver')|336
('replied', '``')|333
('let', 'us')|331
('``', 'why')|326
('old', 'man')|318
("''", 'cried')|314
('``', 'that')|311
("''", 'he')|310
('gutenberg-tm', 'electronic')|306
('new', 'york')|304
('public', 'domain')|293
('every', 'one')|292
('quoth', '

#### Conditional Frequency Distribution

In [614]:
cfreq_2gram = nltk.ConditionalFreqDist(bigram_model)
# print('Conditional Frequency Conditions:\n', cfreq_2gram)
print()

# First access the FreqDist associated with the query word, then the keys in that FreqDist.
# 'extraodinary' (as spelled here) has no recorded successors in the corpus, so both lookups come back empty.
print("Listing the words that can follow after 'extraodinary':\n", cfreq_2gram["extraodinary"].keys())
print()

# Determine the most common successors in the conditional frequency distribution
print("Listing 20 most frequent words to come after 'extraodinary':\n", cfreq_2gram["extraodinary"].most_common(20))


Listing the words that can follow after 'extraodinary':
 dict_keys([])

Listing 20 most frequent words to come after 'extraodinary':
 []
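
As a rough illustration of what a conditional frequency distribution over bigrams stores, the sketch below builds the same kind of table with the standard library: each word maps to a counter of the words that directly follow it. `conditional_counts` and `relative_freq` are hypothetical helpers, not NLTK functions, and the token list is invented.

```python
from collections import Counter, defaultdict

def conditional_counts(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    table = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table

def relative_freq(table, first, second):
    """Share of occurrences of `first` that are followed by `second` (0.0 if unseen)."""
    followers = table.get(first, Counter())
    total = sum(followers.values())
    return followers[second] / total if total else 0.0

tokens = ["old", "man", "said", "old", "man", "cried", "old", "woman"]
table = conditional_counts(tokens)

print(sorted(table["old"].items()))          # successors of 'old' with counts
print(relative_freq(table, "old", "man"))    # 2 of the 3 successors of 'old'
print(relative_freq(table, "extraodinary", "weight"))  # unseen word
```

An unseen condition word simply has no successors, which is why a lookup on a word missing from the corpus returns an empty result.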


### DEMO - Evaluate Sentences Using Language Model
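
The re-ranking idea in this demo can be sketched in isolation before reading the full cell: score each alternative by summing the bigram frequency of every adjacent word pair, then blend that score with the API confidence using a weighted average. The version below is a simplified sketch under stated assumptions: `bigram_score`, `rerank`, and `follow_freq` are hypothetical names, whitespace splitting replaces `word_tokenize`, and the toy frequencies and confidences are invented.

```python
def bigram_score(words, follow_freq, floor=1e-7):
    """Sum follow_freq(current, next) over adjacent pairs; the last word gets `floor`."""
    scores = [follow_freq(cur, nxt) for cur, nxt in zip(words, words[1:])]
    if words:
        scores.append(floor)  # the last word has no successor
    return sum(scores)

def rerank(alternatives, follow_freq, api_weight=0.95):
    """alternatives: list of (transcript, api_confidence) pairs.

    Returns the best transcript and the dict of blended scores.
    """
    blended = {}
    for transcript, confidence in alternatives:
        words = transcript.lower().split()
        lm_score = bigram_score(words, follow_freq)
        blended[transcript] = api_weight * confidence + (1 - api_weight) * lm_score
    best = max(blended, key=blended.get)
    return best, blended

# Toy bigram frequencies (assumed values, not from the corpus)
toy_table = {("old", "man"): 0.5, ("man", "seemed"): 0.25}
follow_freq = lambda cur, nxt: toy_table.get((cur, nxt), 0.0)

alternatives = [("old man scene", 0.90), ("old man seemed", 0.89)]
best, blended = rerank(alternatives, follow_freq)
print(best)
```

With these toy numbers the second alternative overtakes the first because its extra bigram is known to the language model; with `api_weight` close to 1.0 the API ranking dominates.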

In [629]:
# For each word in the evaluation list:
# Select word and determine its frequency distribution
# Grab probability of second word in the list
# Continue this process until the sentence is scored

# Small default score for positions that never receive a bigram frequency (e.g. the last word of a transcript)
epsilon = 0.0000001

# Transcribe each dev file and re-rank the returned alternatives
for audio_path, ground_truth in audio_files.items():
    # Loads the audio into memory
    with io.open(audio_path, 'rb') as audio_file:
        content = audio_file.read()
        audio = types.RecognitionAudio(content=content)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US',
        max_alternatives=10,
        profanity_filter=False,
        enable_word_time_offsets=True)

    # Detects speech and words in the audio file
    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    result = operation.result(timeout=90)

    alternatives = result.results[0].alternatives


    print("API Results: ", alternatives)
    print()
    print()

    rerank_results = {}
    for alternative in alternatives:
        sent = alternative.transcript

        # Lowercase so tokens match the lowercased bigram model
        words = nltk.tokenize.word_tokenize(sent.lower())
        probs = np.ones_like(words, dtype=np.float32)*epsilon
        # print(words,'\n',probs)
        # Index by position so repeated words are scored correctly
        # (words.index(word) always returns the first occurrence)
        for i, word in enumerate(words[:-1]):
            probs[i] = cfreq_2gram[word].freq(words[i + 1])
            # print(probs)

        lexicon_score = np.sum(probs)
        # print(word_score)

        # Re-rank alternatives using a weighted average of the two scores
        api_weight = 0.95
        confidence_score = alternative.confidence*api_weight + lexicon_score*(1-api_weight)
        rerank_results[alternative.transcript] = confidence_score

    print("RE-RANKED Results: \n", rerank_results)
    print()
    print()

    # Select the transcript with the highest blended score
    script = max(rerank_results, key=rerank_results.get)
    value = rerank_results[script]

    # Evaluate the differences between the Original and the Reranked transcript.
    # alternatives[0] is the API's top-ranked result; after the loop above,
    # `alternative` would refer to the last alternative instead.
    print("ORIGINAL Transcript: \n'{0}' \nwith a confidence_score of: {1}".format(alternatives[0].transcript, alternatives[0].confidence))
    print("RE-RANKED Transcript: \n'{0}' \nwith a confidence_score of: {1}".format(script, value))
    print("GROUND TRUTH TRANSCRIPT: \n{0}".format(ground_truth))
    print()
    ranked_differences = list(set(nltk.tokenize.word_tokenize(alternatives[0].transcript.lower())) -
                              set(nltk.tokenize.word_tokenize(script.lower())))
    if len(ranked_differences) == 0:  
        print("No reranking was performed. The transcripts match!")
    else:
        print("The original transcript was RE-RANKED. The transcripts do not match!")
        print("Differences between original and re-ranked: ", ranked_differences)
    print()
    print()
    
    # Evaluate Differences between the Original (top-ranked API result) and Ground Truth:
    gt_orig_diff = list(set(nltk.tokenize.word_tokenize(alternatives[0].transcript.lower())) -
                              set(nltk.tokenize.word_tokenize(ground_truth.lower())))
    if len(gt_orig_diff) == 0:  
        print("The ORIGINAL transcript matches ground truth!")
    else:
        print("The original transcript DOES NOT MATCH ground truth.")
        print("Differences between original and ground truth: ", gt_orig_diff)
    print()
    print()
    
    
    gt_rr_diff = list(set(nltk.tokenize.word_tokenize(script.lower())) -
                              set(nltk.tokenize.word_tokenize(ground_truth.lower())))
    if len(gt_rr_diff) == 0:  
        print("The RE-RANKED transcript matches ground truth!")
    else:
        print("The RE-RANKED transcript DOES NOT MATCH ground truth.")
        print("Differences between Reranked and ground truth: ", gt_rr_diff)
    print()
    print()
    
    print()
    print()
    print()
    print()

Waiting for operation to complete...
API Results:  [transcript: "at this moment to the whole soul of the Old Man scene centered in his eyes which became bloodshot the veins of the throat swelled his cheeks and temples became purple as though he was struck with epilepsy nothing was wanting to complete this but the utterance of a cry"
confidence: 0.948433518409729
words {
  start_time {
    nanos: 100000000
  }
  end_time {
    nanos: 500000000
  }
  word: "at"
}
words {
  start_time {
    nanos: 500000000
  }
  end_time {
    nanos: 600000000
  }
  word: "this"
}
words {
  start_time {
    nanos: 600000000
  }
  end_time {
    nanos: 800000000
  }
  word: "moment"
}
words {
  start_time {
    nanos: 800000000
  }
  end_time {
    seconds: 1
    nanos: 100000000
  }
  word: "to"
}
words {
  start_time {
    seconds: 1
    nanos: 100000000
  }
  end_time {
    seconds: 1
    nanos: 300000000
  }
  word: "the"
}
words {
  start_time {
    seconds: 1
    nanos: 300000000
  }
  end_time {
  