In [47]:
from final_report import tokenize_doc, mask_tokens, train_mle, train_laplace, train_lidstone, train_stupid_backoff, test_model

# Abstract

Word prediction has recently become an integral part of everyday life as mobile devices become more accessible. In this paper, we build a word prediction system to deepen our understanding of the technology and to see how it can benefit everyday users as well as the field of augmentative and alternative communication (AAC). We test our predictor with two kinds of models: an N-Gram-based model and an LSTM-based model. Both are integral to our approach because we need models that can "remember" the words that came before the word currently being predicted. We use this information to calculate probabilities for the next possible word until the end of a sentence.

# Introduction

# Related Work

# Data

## Tokenization & Normalization

Tokenization and normalization are very important, as we need to ensure that our NLP model is not skewed by unclean data. Our first step in tokenization is to separate our training Amazon reviews into a list of sentence tokens using NLTK's sentence tokenizer. We then separate this list of sentence tokens into a list of lists of word tokens using NLTK's TweetTokenizer. As for normalization, we chose to lowercase all word tokens so that the results are not altered by capitalization.
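The two-stage pipeline described above produces a list of sentences, each of which is a list of lowercased word tokens. The following minimal sketch illustrates that output shape only. It uses simple regular expressions as stand-ins for NLTK's sentence tokenizer and TweetTokenizer, and the function name `tokenize_text_sketch` is hypothetical; it is not the `tokenize_doc` function used in this report.

```python
import re

def tokenize_text_sketch(text):
    """Split text into sentences, then into lowercased word tokens.

    Regex stand-ins are used here in place of NLTK's tokenizers, so the
    results will differ from the real pipeline on harder inputs.
    """
    # Stage 1: naive sentence split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Stage 2: lowercase, then extract words (with optional apostrophes)
    # or single punctuation marks as separate tokens.
    return [re.findall(r"\w+(?:'\w+)*|[^\w\s]", s.lower()) for s in sentences]

example = tokenize_text_sketch("This sound track was beautiful! I play it daily.")
print(example)
```

Keeping punctuation as separate tokens mirrors the example training sentence shown in this section, where `'!'` appears as its own token.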

In [63]:
sentence_tokens = []
with open('../data/sample_train.csv') as csv_file:
    print('Tokenizing and normalizing training data...', end=' ')
    sentence_tokens = tokenize_doc(csv_file)
    print('Done')

Tokenizing and normalizing training data... Done


In [64]:
import statistics

print('Total Training Sentence Tokens:', len(sentence_tokens))
print('Average Number of Training Word Tokens per Training Sentence Token:', statistics.mean(map(len, sentence_tokens)))
print('Example Training Sentence Token:', sentence_tokens[0])

Total Training Sentence Tokens: 165696
Average Number of Training Word Tokens per Training Sentence Token: 18.35504779837775
Example Training Sentence Token: ['this', 'sound', 'track', 'was', 'beautiful', '!']


# Method

## N-Gram Models

To begin, we chose to model our problem using several N-Gram models. This is a rather simple approach to word prediction, since we can store an arbitrary number of N-Grams and predict words based on the previous N - 1 words. To be more specific, we limited our N-Gram models to trigrams, bigrams, and unigrams, which means we predict words based on two or fewer previous words. The predicted word is the final item of the highest-probability N-Gram whose preceding items match the previous words, if any. The creation and preparation of such N-Grams for our models can be handled by NLTK's Everygram Preprocessor. The models come from NLTK's LanguageModels, which we decided to use because of the ease of use in terms of input and output for our tokens.
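To make the prediction rule above concrete, here is a minimal pure-Python sketch that counts unigrams, bigrams, and trigrams and then predicts the most frequent next word. It is an illustration under stated assumptions, not the NLTK-based code used in this report: the function names, the `<s>`/`</s>` padding symbols, and the choice to fall back to a shorter context when the full context was never seen are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def count_ngrams(sentences, max_order=3):
    """Count next-word frequencies for every context of length 0..max_order-1."""
    counts = defaultdict(Counter)  # context tuple -> Counter of next words
    for sentence in sentences:
        # Pad so that sentence-initial and sentence-final words have context.
        padded = ['<s>'] * (max_order - 1) + list(sentence) + ['</s>']
        for i in range(max_order - 1, len(padded)):
            for n in range(max_order):
                context = tuple(padded[i - n:i])  # the n words before position i
                counts[context][padded[i]] += 1
    return counts

def predict_next(counts, previous_words, max_order=3):
    """Return the most frequent word after the longest context seen in training."""
    context = tuple(previous_words)[-(max_order - 1):] if max_order > 1 else ()
    # Assumption: shorten the context until one seen in training is found.
    while context not in counts:
        if not context:
            return None  # nothing was counted at all
        context = context[1:]
    return counts[context].most_common(1)[0][0]

training = [
    ['this', 'album', 'was', 'great'],
    ['this', 'album', 'was', 'great'],
    ['this', 'album', 'was', 'boring'],
]
counts = count_ngrams(training)
print(predict_next(counts, ['album', 'was']))   # great
print(predict_next(counts, ['unseen', 'was']))  # great (via the bigram context)
```

Because the normalizing count is the same for every candidate word within a fixed context, picking the highest count is equivalent to picking the highest relative-frequency probability.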

### Maximum Likelihood Estimator (MLE)

The MLE model serves as the basis for our N-Gram modeling. It utilizes the algorithm described above without any smoothing or extra preparation. The model's only concern is the raw likelihoods for the word predictions.
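For reference, the unsmoothed estimate for a trigram is the standard relative-frequency ratio, where $c(\cdot)$ denotes a count in the training data (notation introduced here for illustration):

$$P_{\text{MLE}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$$

Any trigram that never occurs in the training data therefore receives a probability of zero.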

In [65]:
print('Training MLE model...', end=' ')
mle = train_mle(sentence_tokens)
print('Done')

Training MLE model... Done


### Laplace

The Laplace model builds on the MLE model by implementing add-1 smoothing. This generally leads to more accurate word probabilities, since we can assign non-zero probabilities to unseen words. We use this model as a direct comparison to the MLE model.

In [66]:
print('Training Laplace model...', end=' ')
laplace = train_laplace(sentence_tokens)
print('Done')

Training Laplace model... Done


### Lidstone

The Lidstone model generalizes the Laplace model: instead of add-1 smoothing, we can specify the amount of smoothing (gamma). We chose to create three different Lidstone models, initialized with add-0.25 smoothing, add-0.5 smoothing, and add-0.75 smoothing, respectively. We use these models to demonstrate how smoothing affects the word predictions.
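The add-gamma estimate behind both the Laplace and Lidstone models can be sketched in a few lines. This is a minimal illustration, not the NLTK implementation used in this report; the function name, the toy counts, and the vocabulary size are hypothetical values chosen for the example.

```python
from collections import Counter

def lidstone_probability(word, context_counts, vocab_size, gamma):
    """Add-gamma estimate of P(word | context).

    context_counts: Counter of the words observed after one fixed context.
    gamma=1 gives Laplace (add-1) smoothing; gamma=0 gives the unsmoothed
    MLE, which requires the context to have been seen at least once.
    """
    total = sum(context_counts.values())
    return (context_counts[word] + gamma) / (total + gamma * vocab_size)

# Toy counts for the words seen after one context, e.g. ('album', 'was').
after_context = Counter({'great': 3, 'boring': 1})
vocab_size = 5  # hypothetical vocabulary size for this toy example

for gamma in (0.25, 0.5, 0.75, 1.0):
    seen = lidstone_probability('great', after_context, vocab_size, gamma)
    unseen = lidstone_probability('awful', after_context, vocab_size, gamma)
    print(f'gamma={gamma}: P(great)={seen:.3f}, P(awful)={unseen:.3f}')
```

Larger values of gamma move more probability mass from seen words to unseen ones. For a fixed context, adding the same gamma to every count changes the probabilities but not the ordering of words by count.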

In [52]:
print('Training Lidstone models...')
print('Training Lidstone (gamma=0.25) model...', end=' ')
lidstone_25 = train_lidstone(sentence_tokens, 0.25)
print('Done')
print('Training Lidstone (gamma=0.5) model...', end=' ')
lidstone_50 = train_lidstone(sentence_tokens, 0.5)
print('Done')
print('Training Lidstone (gamma=0.75) model...', end=' ')
lidstone_75 = train_lidstone(sentence_tokens, 0.75)
print('Done')

Training Lidstone models...


### Stupid Backoff

The Stupid Backoff model builds on the MLE model: when an N-Gram is unseen, it backs off to the lower-order N-Gram and scales that lower-order score by a fixed factor (alpha). The downside is that the resulting scores do not form a true probability distribution. We chose to create three different Stupid Backoff models, initialized with alpha values of 0.25, 0.5, and 0.75, respectively. We use these models to determine to what degree lower-order probabilities affect the word predictions.
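The backoff rule can be sketched as a short recursive function. This is a minimal illustration under stated assumptions rather than the NLTK implementation used in this report: the function name and the layout of `toy_counts` (a dictionary from context tuples to counters, with the empty tuple holding non-empty unigram counts) are hypothetical.

```python
from collections import Counter

def stupid_backoff_score(word, context, counts, alpha):
    """Score a word given a context, backing off to shorter contexts.

    counts maps a context tuple to a Counter of following words; the empty
    tuple () holds the unigram counts and is assumed to be non-empty.
    The returned scores are not normalized probabilities.
    """
    context = tuple(context)
    if not context:
        unigrams = counts[()]
        return unigrams[word] / sum(unigrams.values())
    following = counts.get(context, Counter())
    if following[word] > 0:
        # Seen at this order: use the plain relative frequency.
        return following[word] / sum(following.values())
    # Unseen at this order: drop the oldest context word and discount by alpha.
    return alpha * stupid_backoff_score(word, context[1:], counts, alpha)

toy_counts = {
    (): Counter({'was': 4, 'great': 3, 'boring': 1}),
    ('was',): Counter({'great': 3, 'boring': 1}),
    ('album', 'was'): Counter({'great': 2}),
}
for candidate in ('great', 'boring', 'was'):
    score = stupid_backoff_score(candidate, ('album', 'was'), toy_counts, alpha=0.5)
    print(candidate, score)
```

In this toy example the three scores add up to more than one, which shows directly why the scores cannot be read as a probability distribution.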

In [53]:
print('Training Stupid Backoff models...')
print('Training Stupid Backoff (alpha=0.25) model...', end=' ')
stupid_backoff_25 = train_stupid_backoff(sentence_tokens, 0.25)
print('Done')
print('Training Stupid Backoff (alpha=0.5) model...', end=' ')
stupid_backoff_50 = train_stupid_backoff(sentence_tokens, 0.5)
print('Done')
print('Training Stupid Backoff (alpha=0.75) model...', end=' ')
stupid_backoff_75 = train_stupid_backoff(sentence_tokens, 0.75)
print('Done')

Training Stupid Backoff models...


# Results

In [54]:
sentence_tokens = []
with open('../data/sample_test.csv') as csv_file:
    print('Tokenizing and normalizing testing data...', end=' ')
    sentence_tokens = tokenize_doc(csv_file)
    print('Done')

print('Masking word tokens in each sentence token...', end=' ')
masked_sentence_tokens, masked_words = mask_tokens(sentence_tokens)
print('Done')

Tokenizing and normalizing testing data...
Masking word tokens in each sentence token...


## N-Gram Models

### Maximum Likelihood Estimator (MLE)

In [60]:
print('Testing MLE model...', end=' ')
mle_exact_accuracy = test_model(mle, masked_sentence_tokens, masked_words)
print('Done')
print('MLE Exact Prediction Accuracy:', mle_exact_accuracy, '%')

Testing MLE model...
MLE Exact Prediction Accuracy: 8.74894336432798 %


### Laplace

In [61]:
print('Testing Laplace model...', end=' ')
laplace_exact_accuracy = test_model(laplace, masked_sentence_tokens, masked_words)
print('Done')
print('Laplace Exact Prediction Accuracy:', laplace_exact_accuracy, '%')

Testing Laplace model...
Laplace Exact Prediction Accuracy: 7.6923076923076925 %


### Lidstone

In [62]:
print('Testing Lidstone models...')
print('Testing Lidstone (gamma=0.25) model...', end=' ')
lidstone_25_exact_accuracy = test_model(lidstone_25, masked_sentence_tokens, masked_words)
print('Done')
print('Testing Lidstone (gamma=0.50) model...', end=' ')
lidstone_50_exact_accuracy = test_model(lidstone_50, masked_sentence_tokens, masked_words)
print('Done')
print('Testing Lidstone (gamma=0.75) model...', end=' ')
lidstone_75_exact_accuracy = test_model(lidstone_75, masked_sentence_tokens, masked_words)
print('Done')
print('Lidstone (gamma=0.25) Exact Prediction Accuracy:', lidstone_25_exact_accuracy, '%')
print('Lidstone (gamma=0.50) Exact Prediction Accuracy:', lidstone_50_exact_accuracy, '%')
print('Lidstone (gamma=0.75) Exact Prediction Accuracy:', lidstone_75_exact_accuracy, '%')

Testing Lidstone models...
Testing Lidstone (gamma=0.25) model...
Testing Lidstone (gamma=0.50) model...
Testing Lidstone (gamma=0.75) model...
Lidstone (gamma=0.25) Exact Prediction Accuracy: 8.368554522400675 %
Lidstone (gamma=0.50) Exact Prediction Accuracy: 7.6923076923076925 %
Lidstone (gamma=0.75) Exact Prediction Accuracy: 8.241758241758241 %


### Stupid Backoff

In [59]:
print('Testing Stupid Backoff models...')
print('Testing Stupid Backoff (alpha=0.25) model...', end=' ')
stupid_backoff_25_exact_accuracy = test_model(stupid_backoff_25, masked_sentence_tokens, masked_words)
print('Done')
print('Testing Stupid Backoff (alpha=0.50) model...', end=' ')
stupid_backoff_50_exact_accuracy = test_model(stupid_backoff_50, masked_sentence_tokens, masked_words)
print('Done')
print('Testing Stupid Backoff (alpha=0.75) model...', end=' ')
stupid_backoff_75_exact_accuracy = test_model(stupid_backoff_75, masked_sentence_tokens, masked_words)
print('Done')
print('Stupid Backoff (alpha=0.25) Exact Prediction Accuracy:', stupid_backoff_25_exact_accuracy, '%')
print('Stupid Backoff (alpha=0.50) Exact Prediction Accuracy:', stupid_backoff_50_exact_accuracy, '%')
print('Stupid Backoff (alpha=0.75) Exact Prediction Accuracy:', stupid_backoff_75_exact_accuracy, '%')

Testing Stupid Backoff models...
Testing Stupid Backoff (alpha=0.25) model...
Testing Stupid Backoff (alpha=0.50) model...
Testing Stupid Backoff (alpha=0.75) model...
Stupid Backoff (alpha=0.25) Exact Prediction Accuracy: 9.171597633136095 %
Stupid Backoff (alpha=0.50) Exact Prediction Accuracy: 7.734573119188504 %
Stupid Backoff (alpha=0.75) Exact Prediction Accuracy: 9.129332206255283 %


# Discussion & Future Work