# Abstract

Word prediction has recently become an integral part of everyday life as mobile devices become more accessible. In this paper, we build a word predictor in order to deepen our understanding of the technology and to see how it can benefit both everyday users and the field of augmentative and alternative communication (AAC). We test two approaches, an N-Gram based model and an LSTM based model. Both are integral to our approach because we need models that can "remember" the words that precede the word currently being predicted. We use this context to calculate probabilities for each possible next word, repeating the process until the end of a sentence.

# Introduction

# Related Work

# Data

## Tokenization & Normalization

Tokenization and normalization are important because we need to ensure that our NLP model is not skewed by unclean data. Our first step is to split the Amazon reviews in our training set into a list of sentence tokens using NLTK's sentence tokenizer (sent_tokenize). We then split each sentence token into word tokens using NLTK's word tokenizer (word_tokenize), which yields a list of lists of word tokens. For normalization, we lowercase every sentence before splitting it into word tokens so that our results are not affected by capitalization.

In [121]:
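import nltk
import csv

sentence_tokens = []

# newline='' lets the csv module handle line breaks inside quoted fields.
with open('../data/sample.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    print("Tokenizing and normalizing...")
    for row in csv_reader:
        # row[2] is taken to be the review text: split it into sentences,
        # then lowercase each sentence and split it into word tokens.
        sentence_tokens += [nltk.word_tokenize(sentence.lower()) for sentence in nltk.sent_tokenize(row[2])]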

Tokenizing and normalizing...


In [109]:
import statistics

print("Total Training Sentence Tokens:", len(sentence_tokens))
print("Average Number of Training Word Tokens per Training Sentence Token:", statistics.mean(map(len, sentence_tokens)))
print("Example Training Sentence Token:", sentence_tokens[0])

Total Training Sentence Tokens: 470931
Average Number of Training Word Tokens per Training Sentence Token: 18.113396654711625
Example Training Sentence Token: ['my', 'lovely', 'pat', 'has', 'one', 'of', 'the', 'great', 'voices', 'of', 'her', 'generation', '.']


# Method

## N-Gram Models

To begin, we model our problem using several N-Gram models. This is a relatively simple approach to word prediction: we store counts for N-Grams up to a fixed order and predict each word from the previous N - 1 words. Specifically, we limit our models to trigrams, bigrams, and unigrams, which means we predict a word from at most two previous words. The predicted word is the one that forms the highest-probability N-Gram when it is placed after the previous words, if any. The creation and preparation of these N-Grams is handled by NLTK's padded_everygram_pipeline function. The models themselves come from NLTK's nltk.lm package, which we chose because it accepts our token lists directly and returns scores that are easy to work with.
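
To make the prediction rule above concrete, here is a minimal sketch in plain Python that does not use NLTK. The helper names `build_counts` and `predict_next` and the toy sentences are hypothetical, and shortening the context when it has never been seen is an assumption made for illustration, not necessarily how the NLTK models behave.

```python
from collections import Counter, defaultdict

def build_counts(sentences, max_order=3):
    """Map each context tuple (0 to max_order - 1 words) to a Counter of next words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for n in range(max_order):  # n is the number of context words
                if i - n < 0:
                    break
                counts[tuple(sentence[i - n:i])][word] += 1
    return counts

def predict_next(counts, previous_words, max_order=3):
    """Return the most frequent continuation, shortening the context until it has been seen."""
    k = max_order - 1
    context = tuple(previous_words[-k:]) if k else ()
    while True:
        if context in counts:
            return counts[context].most_common(1)[0][0]
        if not context:
            return None  # nothing was counted at all
        context = context[1:]  # assumption: drop the oldest context word and retry

# Hypothetical toy data standing in for the tokenized reviews.
toy_sentences = [
    ["the", "sound", "is", "great"],
    ["the", "sound", "is", "great"],
    ["the", "sound", "is", "clear"],
    ["the", "price", "is", "fair"],
]
counts = build_counts(toy_sentences)
print(predict_next(counts, ["sound", "is"]))    # uses the trigram context
print(predict_next(counts, ["price", "is"]))    # uses a different trigram context
print(predict_next(counts, ["battery", "is"]))  # unseen context, retried with ("is",)
```

Picking the most frequent continuation for a context is equivalent to picking the word with the highest unsmoothed conditional probability, since every candidate shares the same denominator.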

### Maximum Likelihood Estimator (MLE)

The MLE model serves as the basis for our N-Gram modeling. It uses the algorithm described above without any smoothing or extra preparation, so each candidate word is scored purely by its raw likelihood in the training data.

In [139]:
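from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Trigram model: each word is conditioned on up to two previous words.
mle = MLE(3)

print("Training MLE model...")
# The pipeline pads each sentence and yields all of its 1- to 3-grams,
# together with a flat stream of words used to build the vocabulary.
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
mle.fit(train, vocab)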

Training MLE model...
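
As a small illustration of what the fitted model returns, the sketch below trains a separate MLE model on a hypothetical three-sentence corpus and queries it with `score`. The names `toy_sentences`, `toy_mle`, and the word "awful" are made up for this example; the same calls should apply to the `mle` model above.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus, already tokenized and lowercased.
toy_sentences = [
    ["the", "sound", "is", "great"],
    ["the", "sound", "is", "great"],
    ["the", "sound", "is", "clear"],
]

toy_train, toy_vocab = padded_everygram_pipeline(3, toy_sentences)
toy_mle = MLE(3)
toy_mle.fit(toy_train, toy_vocab)

context = ["sound", "is"]
# Relative frequency of "great" among all words seen after "sound is".
p_great = toy_mle.score("great", context)
# "awful" never occurs in the toy corpus, so the unsmoothed estimate is zero.
p_awful = toy_mle.score("awful", context)

# Predict by taking the vocabulary entry with the highest score for this context.
best = max(toy_mle.vocab, key=lambda w: toy_mle.score(w, context))
print(p_great, p_awful, best)
```

The zero score for the unseen word is the behavior that the smoothed models in the following sections are meant to address.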


### Laplace

The Laplace model builds on the MLE model by implementing add-1 smoothing. Because every count is incremented by one, words that were never seen after a given context still receive a non-zero probability, which generally gives more usable estimates for rare contexts. We use this model as a direct comparison to the MLE model.

In [157]:
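from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

# Trigram model with add-1 smoothing.
laplace = Laplace(3)

print("Training Laplace model...")
# The pipeline returns single-use lazy iterators, so it is rebuilt for each model.
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
laplace.fit(train, vocab)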

Training Laplace model...
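
For reference, the two estimates can be written side by side. The notation is assumed here rather than taken from NLTK: $C(c, w)$ is the number of times word $w$ follows context $c$ in the training data, $C(c)$ is the total count of words following $c$, and $|V|$ is the vocabulary size.

```latex
P_{\mathrm{MLE}}(w \mid c) = \frac{C(c, w)}{C(c)}
\qquad
P_{\mathrm{Laplace}}(w \mid c) = \frac{C(c, w) + 1}{C(c) + |V|}
```

As a hypothetical worked case, with $C(c) = 3$ and $|V| = 7$, a word never seen after $c$ moves from $0$ under MLE to $1/10$ under Laplace, while a word seen twice moves from $2/3$ to $3/10$.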


### Lidstone

The Lidstone model is the same as the Laplace model, except that the amount of smoothing can be specified instead of being fixed at add-1. We created three Lidstone models, initialized with add-0.25, add-0.5, and add-0.75 smoothing, respectively. We use these models to demonstrate how the amount of smoothing affects the word predictions.

In [158]:
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

# Lidstone(gamma, order): trigram models with add-gamma smoothing.
lidstone_25 = Lidstone(0.25, 3)
lidstone_50 = Lidstone(0.5, 3)
lidstone_75 = Lidstone(0.75, 3)

print("Training Lidstone models...")
# The pipeline returns single-use lazy iterators, so it is rebuilt before each fit.
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
lidstone_25.fit(train, vocab)
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
lidstone_50.fit(train, vocab)
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
lidstone_75.fit(train, vocab)

Training Lidstone models...
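
To show how the smoothing amount shifts probability mass, the following sketch evaluates the add-gamma formula directly. The function `lidstone_probability` and the counts are hypothetical and chosen only for illustration; this is a sketch of the arithmetic, not a replacement for the NLTK models.

```python
def lidstone_probability(word_count, context_count, vocab_size, gamma):
    """Add-gamma estimate: (count + gamma) / (context total + gamma * vocabulary size)."""
    return (word_count + gamma) / (context_count + gamma * vocab_size)

# Hypothetical counts: the context was seen 3 times, the vocabulary has 7 entries.
context_count = 3
vocab_size = 7

for gamma in (0.25, 0.5, 0.75, 1.0):
    seen = lidstone_probability(2, context_count, vocab_size, gamma)    # word seen twice
    unseen = lidstone_probability(0, context_count, vocab_size, gamma)  # word never seen
    print(f"gamma={gamma}: seen={seen:.3f}, unseen={unseen:.3f}")
```

Larger values of gamma move more probability from seen words to unseen ones, and gamma = 1.0 reproduces the Laplace estimate.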


### Stupid Backoff

The Stupid Backoff model builds on the MLE model: when an N-Gram is unseen, it falls back to the lower-order N-Gram and scales that score by a fixed factor. The downside is that the resulting scores do not form a true probability distribution. We created three Stupid Backoff models, initialized with scaling factors of 0.25, 0.5, and 0.75, respectively. We use these models to determine the degree to which lower-order probabilities affect the word predictions.

In [160]:
from nltk.lm import StupidBackoff
from nltk.lm.preprocessing import padded_everygram_pipeline

# StupidBackoff(alpha, order): alpha scales the score each time the model backs off.
stupid_backoff_25 = StupidBackoff(0.25, 3)
stupid_backoff_50 = StupidBackoff(0.5, 3)
stupid_backoff_75 = StupidBackoff(0.75, 3)

print("Training Stupid Backoff models...")
# The pipeline returns single-use lazy iterators, so it is rebuilt before each fit.
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
stupid_backoff_25.fit(train, vocab)
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
stupid_backoff_50.fit(train, vocab)
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
stupid_backoff_75.fit(train, vocab)

Training Stupid Backoff models...
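
The effect of the scaling factor can be seen on a hypothetical toy corpus. In the sketch below, `toy_sentences` and the helper `fit_backoff` are made up for illustration; the trigram "sound is fair" never occurs, so its score has to come from the bigram "is fair".

```python
from nltk.lm import StupidBackoff
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus, already tokenized and lowercased.
toy_sentences = [
    ["the", "sound", "is", "great"],
    ["the", "sound", "is", "great"],
    ["the", "price", "is", "fair"],
]

def fit_backoff(alpha):
    """Fit a trigram Stupid Backoff model on the toy corpus (helper for this sketch only)."""
    train, vocab = padded_everygram_pipeline(3, toy_sentences)
    model = StupidBackoff(alpha, 3)
    model.fit(train, vocab)
    return model

for alpha in (0.25, 0.5, 0.75):
    model = fit_backoff(alpha)
    # Seen trigram: scored by its relative frequency, with no scaling applied.
    seen = model.score("great", ["sound", "is"])
    # Unseen trigram: alpha times the relative frequency of "fair" after "is".
    backed_off = model.score("fair", ["sound", "is"])
    print(f"alpha={alpha}: seen={seen:.3f}, backed_off={backed_off:.3f}")
```

The seen trigram keeps its full relative frequency while the backed-off score is added on top for the same context, so the scores for a context can sum to more than one.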


## LSTM Model

# Results

# Discussion and Future Work