# Lexicon - Custom Language Model


## Overview

For this project, I will build a simple custom language model that is able to learn from any text data provided, and return a transcript with confidence values from input posed in speech utterances. I will use Google's cloud-based services to preprocess the input audio data and transcribe into an initial guess. Then I will train a model to improve on Google cloud speech API's response.



## Getting Started

In order to use Google's cloud-based services, you first need to create an account on the [Google Cloud Platform](https://cloud.google.com//).

Then, for each service you want to use, you have to enable use of that service.

In [1]:
!pip install --upgrade google-cloud-speech

Requirement already up-to-date: google-cloud-speech in /Users/deanmwebb/anaconda/lib/python2.7/site-packages
Requirement already up-to-date: google-gax<0.16dev,>=0.15.14 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-cloud-speech)
Requirement already up-to-date: google-cloud-core<0.28dev,>=0.27.0 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-cloud-speech)
Requirement already up-to-date: googleapis-common-protos[grpc]<2.0dev,>=1.5.2 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-cloud-speech)
Requirement already up-to-date: ply==3.8 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.14->google-cloud-speech)
Requirement already up-to-date: dill<0.3dev,>=0.2.5 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.14->google-cloud-speech)
Requirement already up-to-date: future<0.17dev,>=0.16.0 in /Users/deanmwebb/anaconda/lib/python2.7/site-packages

### Install the Google Cloud SDK: https://cloud.google.com/sdk/docs/

In [2]:
!CLOUDSDK_CORE_DISABLE_PROMPTS=1 ./google-cloud-sdk/install.sh

Welcome to the Google Cloud SDK!

To help improve the quality of this product, we collect anonymized usage data
and anonymized stacktraces when crashes are encountered; additional information
is available at <https://cloud.google.com/sdk/usage-statistics>. You may choose
to opt out of this collection now (by choosing 'N' at the below prompt), or at
any time in the future by running the following command:

    gcloud config set disable_usage_reporting true


Your current Cloud SDK version is: 170.0.1
The latest available version is: 171.0.0

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                   Components                                                   │
├──────────────────┬──────────────────────────────────────────────────────┬──────────────────────────┬───────────┤
│      Status      │                         Name                         │            ID            │    Si

## Authenticate with Google Cloud API:

In [3]:
!source google-cloud-sdk/completion.bash.inc && \
source google-cloud-sdk/path.bash.inc && \
gcloud auth activate-service-account lexicon-bot@exemplary-oath-179301.iam.gserviceaccount.com --key-file=Lexicon-e94eff39fad7.json

Activated service account credentials for: [lexicon-bot@exemplary-oath-179301.iam.gserviceaccount.com]


In [4]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS']='/Users/deanmwebb/Google Drive/Development/consulting/lexicon/Lexicon-e94eff39fad7.json'

### Test out Cloud Spech API

In [5]:
import io

# Imports the Google Cloud client library
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

# Instantiates a client
client = speech.SpeechClient()

# The name of the dev-test audio file to transcribe
dev_file_name_1 = os.path.join(
    os.getcwd(),
    'RNN-Tutorial-master',
    'data',
    'raw',
    'librivox',
    'LibriSpeech',
    'dev-clean-wav',
    '777-126732-0068.wav')

# The name of the test audio file to transcribe
dev_file_name_2 = os.path.join(
    os.getcwd(),
    'RNN-Tutorial-master',
    'data',
    'raw',
    'librivox',
    'LibriSpeech',
    'dev-clean-wav',
    '3752-4944-0041.wav')

test_file_name_1 = os.path.join(
    os.getcwd(),
    'RNN-Tutorial-master',
    'data',
    'raw',
    'librivox',
    'LibriSpeech',
    'test-clean-wav',
    '4507-16021-0019.wav')

test_file_name_2 = os.path.join(
    os.getcwd(),
    'RNN-Tutorial-master',
    'data',
    'raw',
    'librivox',
    'LibriSpeech',
    'test-clean-wav',
    '7176-92135-0009.wav')

# Loads the audio into memory
with io.open(dev_file_name_1, 'rb') as audio_file:
    content = audio_file.read()
    audio = types.RecognitionAudio(content=content)

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=10,
    profanity_filter=False,
    enable_word_time_offsets=True)

# Detects speech and words in the audio file
operation = client.long_running_recognize(config, audio)

print('Waiting for operation to complete...')
result = operation.result(timeout=90)

alternatives = result.results[0].alternatives
for alternative in alternatives:
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence Score: {}'.format(alternative.confidence))

    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        start = start_time.seconds + start_time.nanos * 1e-9
        end = end_time.seconds + end_time.nanos * 1e-9
        delta = end - start
        
        print('Word: {}, start_time (s): {}, end_time (s): {}, total_time (s): {}'.format(
            word,
            start,
            end,
            delta))
        
        #TODO: Do we need to figure out how to assign words to alternatives?
            # If same amounts, assign words to index of parsed word

Waiting for operation to complete...
Transcript: the boy hears too much of what is talked about here
Confidence Score: 0.9152035117149353
Word: the, start_time (s): 0.1, end_time (s): 0.5, total_time (s): 0.4
Word: boy, start_time (s): 0.5, end_time (s): 0.7000000000000001, total_time (s): 0.20000000000000007
Word: hears, start_time (s): 0.7000000000000001, end_time (s): 1.1, total_time (s): 0.4
Word: too, start_time (s): 1.1, end_time (s): 1.2, total_time (s): 0.09999999999999987
Word: much, start_time (s): 1.2, end_time (s): 1.4, total_time (s): 0.19999999999999996
Word: of, start_time (s): 1.4, end_time (s): 1.5, total_time (s): 0.10000000000000009
Word: what, start_time (s): 1.5, end_time (s): 1.6, total_time (s): 0.10000000000000009
Word: is, start_time (s): 1.6, end_time (s): 1.7000000000000002, total_time (s): 0.10000000000000009
Word: talked, start_time (s): 1.7000000000000002, end_time (s): 2.0, total_time (s): 0.2999999999999998
Word: about, start_time (s): 2.0, end_time (s):

### Download the Dataset

In [6]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import tarfile

librispeech_dataset_folder_path = 'LibriSpeech'
tar_gz_path = 'dev-clean.tar.gz'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isfile(tar_gz_path):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Librispeech dev-clean.tar.gz') as pbar:
        urlretrieve(
            'http://www.openslr.org/resources/12/dev-clean.tar.gz',
            tar_gz_path,
            pbar.hook)

if not isdir(librispeech_dataset_folder_path):
    with tarfile.open(tar_gz_path) as tar:
        tar.extractall()
        tar.close()

### Preprocess Dataset - Download Dependencies

#### NLTK Dependencies

In [7]:
import nltk #NLP Toolkit
nltk.download()
nltk.download('punkt')
nltk.download('stopwords')

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### Gensim Dependencies

In [11]:
!pip install --upgrade gensim
import gensim

Requirement already up-to-date: gensim in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages
Requirement already up-to-date: six>=1.5.0 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: smart-open>=1.2.1 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: numpy>=1.11.3 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: scipy>=0.18.1 in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: requests in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from smart-open>=1.2.1->gensim)
Requirement already up-to-date: bz2file in /Users/deanmwebb/anaconda/envs/sdc_dev/lib/python3.5/site-packages (from smart-open>=1.2.1->gensim)
Requirement already up-to-date: boto>=2.32 in /Users/deanmwebb/anaconda/envs/sdc_de

In [516]:
# Prepare a plain text corpus from which we train a languague model
import glob

# Gather all text files from directory
WORKING_DIRECTORY = os.path.join(os.getcwd(),'LibriSpeech/')

# TRAINING_DIRECTORY = os.path.abspath(os.path.join(os.sep,'Volumes',"My\ Passport\ for\ Mac",'lexicon','LibriSpeech'))
dev_path = "{}{}{}{}".format(WORKING_DIRECTORY, 'dev-clean/', '**/', '*.txt')
train_path = "{}{}{}{}{}".format(WORKING_DIRECTORY, 'books/', 'utf-8/', '**/', '*.txt*')

text_paths = sorted(glob.glob(train_path, recursive=True))
print('Found',len(text_paths),'text files in the directory:', train_path)

Found 41 text files in the directory: /Users/deanmwebb/Google Drive/Development/consulting/lexicon/LibriSpeech/books/utf-8/**/*.txt*


### Preprocess Dataset - Tokenize Corpus

In [397]:
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [552]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.corpus import stopwords
import re
import codecs
import string

# reading the file in unicode format using codecs library    
stoplist = set(stopwords.words('english'))

corpus_raw = u""
for book_filename in text_paths:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()

# Tokenize
tokenized_words = nltk.tokenize.word_tokenize(corpus_raw)

## Clean the tokens ##
# Remove stop words
tokenized_words = [word for word in tokenized_words if word not in stoplist]

# Remove single-character tokens (mostly punctuation)
tokenized_words = [word for word in tokenized_words if len(word) > 1]

# Remove numbers
tokenized_words = [word for word in tokenized_words if not word.isnumeric()]

# Lowercase all words (default_stopwords are lowercase too)
tokenized_words = [word.lower() for word in tokenized_words]


## DON'T USE THIS ##
# Stemming words seems to make matters worse, disabled
# stemmer = nltk.stem.snowball.SnowballStemmer('english')
# tokenized_words = [stemmer.stem(word) for word in tokenized_words]

### Preprocess Dataset - Extract N-Gram Model

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk

# extracting the bi-grams and sorting them according to their frequencies
finder = BigramCollocationFinder.from_words(tokenized_words)
finder.apply_freq_filter(3)

bigram_model = nltk.bigrams(tokenized_words)
bigram_model = sorted(bigram_model, key=lambda item: item[1], reverse=True)  
# print(bigram_model)
print('')
print('')
print('')
np.save("lang_model.npy",bigram_model)

#### Frequency Distribution

In [None]:
fdist = nltk.FreqDist(bigram_model)

# Output top 50 words
print("Word|Freq:")
for word, frequency in fdist.most_common(50):
    print(u'{}|{}'.format(word, frequency))

#### Conditional Frequency Distribution

In [None]:
cfreq_2gram = nltk.ConditionalFreqDist(bigram_model)
# print('Conditional Frequency Conditions:\n', cfreq_2gram)
print()

# First access the FreqDist associated with "one", then the keys in that FreqDist
print("Listing the words that can follow after 'said':\n", cfreq_2gram["come"].keys())
print()

# Determine Most common in conditional frequency
print("Listing 20 most frequent words to come after 'said':\n", cfreq_2gram["come"].most_common(20))

### Alternate Approach: Use Gensim to Create Sentence Embeddings

In [269]:
# Word Encoding
import codecs
# Regex
import glob
import re
# Natural Language Toolkit
import nltk
# Word2Vec
import gensim.models.word2vec as w2v
# Dimensionality Reduction
import sklearn.manifold
# Visualization
import seaborn as sns
# Stop List
from nltk.corpus import stopwords
import string

# reading the file in unicode format using codecs library    
stoplist = set(stopwords.words('english'))
transcripts = [[[word for word in line.lower().split()[1:] if word not in stoplist]
         for i,line in enumerate(codecs.open(document,"r","utf-8"))]
         for document in text_paths]

# Reshape list to be a transcript for each row
transcript_list = []
for doc in transcripts:
    for line in doc:
        if len(line) > 0:
            transcript_list.append(line)

# Convert wordlist to raw corpus
corpus_raw = u""
for line in transcript_list:
    corpus_raw += u" ".join(line)

# Tokenize
raw_sentences = nltk.word_tokenize(corpus_raw)
# print(raw_sentences)
print(len(raw_sentences))

24379


#### Alternate approaches

In [235]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
import re
import codecs
import numpy as np
import string

words_all = []
translate_table = dict((ord(char), None) for char in string.punctuation)
# reading the file in unicode format using codecs library    


stoplist = set('for a of the and to in'.split())
arr = []
texts = [[[word for word in line.lower().split()[1:] if word not in stoplist]
         #for i,line in enumerate(codecs.open(document,"r","utf-8"))]
         for i,line in enumerate(codecs.open(document,"r","utf-8"))]
         for document in text_paths]

# Reshape list to be a transcript for each row
for doc in texts:
    for line in doc:
        if len(line) > 0:
            arr.append(line)
# texts = arr
print(len(arr))
print(arr[-1])


# Convert wordlist to raw corpus
corpus_raw = u""
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
for line in arr:
    corpus_raw += u" ".join(line)+u". "
    # merged_line = u" ".join(line)
    # print('Merged Line:', merged_line)
    # print()
    # print()
    # raw_sentences.append(nltk.tokenize.word_tokenize(merged_line))
#     print("Corpus is now {0} characters long".format(len(corpus_raw)))
#     print()
# print(corpus_raw)

# Tokenize
# tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = nltk.tokenize.word_tokenize(corpus_raw)
# print(len(corpus_raw))
print(len(raw_sentences))
print(raw_sentences[-1])





# extracting the bi-grams and sorting them according to their frequencies
finder = TrigramCollocationFinder.from_words(raw_sentences)
finder.apply_freq_filter(5)
trigram_model = finder.ngram_fd
trigram_model = sorted(finder.ngram_fd, key=lambda item: item[1],reverse=True)  

print(bigram_model)
print('')
print('')
print('')
np.save("lang_model.npy",bigram_model)

2703
['thou', 'like', 'arcturus', 'steadfast', 'skies', 'with', 'tardy', 'sense', 'guidest', 'thy', 'kingdom', 'fair', 'bearing', 'alone', 'load', 'liberty']
46582
.
[('of', 'your'), ('on', 'your'), ('the', 'young'), ('and', 'you'), ('that', 'you'), ('for', 'you'), ('.', 'you'), ('tell', 'you'), ('if', 'you'), ('with', 'you'), ('perhaps', 'you'), ('then', 'you'), ('do', 'you'), ('as', 'you'), ('to', 'you'), ('but', 'you'), ('see', 'you'), ("n't", 'you'), ('of', 'you'), ('are', 'you'), ('and', 'yet'), ('.', 'yet'), ('.', 'yes'), ('a', 'year'), ('you', 'would'), ('she', 'would'), ('i', 'would'), ('it', 'would'), ('he', 'would'), ('the', 'world'), ('the', 'words'), ('a', 'word'), ('miss', 'woodley'), ('.', 'with'), ('satisfied', 'with'), ('up', 'with'), ('and', 'with'), ('covered', 'with'), ('it', 'with'), ('him', 'with'), ('the', 'window'), ('the', 'wind'), ('it', 'will'), ('we', 'will'), ('you', 'will'), ('i', 'will'), ('his', 'wife'), ('the', 'whole'), ('one', 'who'), ('.', 'who'), ('p

In [482]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.corpus import stopwords
import re
import codecs
import string

# reading the file in unicode format using codecs library    
stoplist = set(stopwords.words('english'))
print(stoplist)
# transcripts = [[[word for word in line.lower().split() if word not in stoplist]
#          for i,line in enumerate(codecs.open(document,"r","utf-8"))]
#          for document in text_paths]

# # Reshape list to be a transcript for each row
# transcript_list = []
# for document in transcripts:
#     for transcript in document:
#         if len(transcript) > 0:
#             transcript_list.append(transcript)
# print("Transcript List Length: {0:,}".format(len(transcript_list)))

# Convert wordlist to raw corpus
# corpus_raw = u""
# for line in transcript_list:
#     new_script = u" ".join(line)
#     corpus_raw += new_script
# raw_words = nltk.word_tokenize(corpus_raw)


# Tokenize sentences    
corpus_raw = u""
for book_filename in text_paths[:2]:
    # print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
#     print("Corpus is now {0} characters long".format(len(corpus_raw)))
#     print()

# Tokenize
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)
# print(raw_sentences[:10])


#sentence where each word is tokenized
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))
print("There are {0:,} sentences total".format(len(sentences)))

## Clean the tokens ##
# Remove stop words
sentences = [[word for word in sentence if word not in stoplist]
            for sentence in sentences]

# Remove single-character tokens (mostly punctuation)
sentences = [[word for word in sentence if len(word) > 1]
            for sentence in sentences]
# Remove numbers
sentences = [[word for word in sentence if not word.isnumeric()]
            for sentence in sentences]
# Lowercase all words (default_stopwords are lowercase too)
sentences = [[word.lower() for word in sentence]
            for sentence in sentences]
# # Stemming words seems to make matters worse, disabled
stemmer = nltk.stem.snowball.SnowballStemmer('english')
sentences = [[stemmer.stem(word) for word in sentence]
            for sentence in sentences]

# print("Raw Words Length: {0:,}".format(len(raw_words)))

{'until', 'this', 'their', 'herself', 'didn', 'ain', 'can', 'yours', 'having', 'i', 've', 'or', 'ours', 'than', 'm', 'me', 'how', 'once', 'himself', 'from', 'own', 'while', 'isn', 'most', 'under', 'for', 'after', 'during', 'he', 'so', 'themselves', 'only', 'further', 'being', 'off', 'has', 'but', 'and', 's', 'shan', 'to', 'against', 'below', 'an', 'down', 'hers', 'again', 'won', 'because', 'the', 'out', 'on', 'his', 'of', 'above', 'my', 'myself', 'll', 'are', 'same', 'we', 'aren', 'then', 'that', 'if', 'very', 'did', 'at', 'through', 'no', 'mustn', 'd', 'whom', 'our', 'over', 'doing', 'now', 'its', 'haven', 'them', 'will', 'couldn', 'wasn', 't', 'who', 'into', 'a', 'there', 'don', 'where', 'was', 'few', 'here', 'as', 'when', 'o', 'hasn', 'in', 'not', 'does', 'you', 'her', 'other', 'be', 're', 'do', 'by', 'such', 'which', 'him', 'theirs', 'with', 'up', 'ourselves', 'all', 'about', 'some', 'needn', 'any', 'between', 'they', 'been', 'wouldn', 'shouldn', 'she', 'y', 'these', 'weren', 'had'