# Abstract

Word prediction has become an integral part of everyday life as mobile devices become more widely accessible. In this paper we build a word predictor to deepen our understanding of the technology and to examine how it can benefit everyday users as well as the field of augmentative and alternative communication (AAC). We test our predictor with two models: an N-Grams based model and an LSTM based model. Both are integral to our approach because we need models that can "remember" the words that came before the word currently being predicted. We use this information to calculate probabilities for the next possible word, repeating until the end of a sentence.

# Introduction

# Related Work

# Data

## Tokenization & Normalization

Tokenization and normalization are important because we need to ensure that our NLP model is not skewed by unclean data. Our first step in tokenization is to separate our training Amazon reviews into a list of sentence tokens using NLTK's sentence tokenizer. We then separate each sentence token into word tokens using NLTK's TweetTokenizer, which gives a list of lists of word tokens. For normalization, we lowercase all the text so that words differing only in capitalization are treated as the same token.

In [104]:
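import csv

import nltk

sentence_tokens = []
tokenizer = nltk.tokenize.TweetTokenizer()

# newline='' lets the csv module handle line breaks inside quoted fields.
with open('../data/sample.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    print("Tokenizing and normalizing...")
    for row in csv_reader:
        # Split the text in the third column into sentences, then lowercase
        # and word-tokenize each sentence.
        sentence_tokens.extend(
            tokenizer.tokenize(sentence.lower())
            for sentence in nltk.sent_tokenize(row[2])
        )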

Tokenizing and normalizing...


In [109]:
import statistics

print("Total Training Sentence Tokens:", len(sentence_tokens))
print("Average Number of Training Word Tokens per Training Sentence Token:", statistics.mean(map(len, sentence_tokens)))
print("Example Training Sentence Token:", sentence_tokens[0])

Total Training Sentence Tokens: 470931
Average Number of Training Word Tokens per Training Sentence Token: 18.113396654711625
Example Training Sentence Token: ['my', 'lovely', 'pat', 'has', 'one', 'of', 'the', 'great', 'voices', 'of', 'her', 'generation', '.']


# Method

## N-Grams Model

We first model our problem using N-Grams. This is a rather simple approach to word prediction: we store counts of N-Grams and predict each word from the previous N - 1 words. Specifically, we limited our N-Grams model to trigrams, bigrams, and unigrams, which means we predict words based on at most two previous words. The predicted word is the last item of the N-Gram with the highest probability among those whose preceding items match the previous words, if any. This is handled by NLTK's padded everygram pipeline (to create the N-Grams) and NLTK's Maximum Likelihood Estimator (MLE), which calculates the probabilities used to make a prediction.
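For reference, the maximum likelihood estimate in the trigram case is the standard relative-frequency ratio. The symbols below are notation introduced here for illustration: $C(\cdot)$ is a count over the training N-Grams and $w_i$ is the word being predicted.

$$
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
$$

The predicted word is then $\hat{w}_i = \arg\max_{w} P(w \mid w_{i-2}, w_{i-1})$. With only one previous word the same ratio is taken over bigram counts, and with none it reduces to a unigram count divided by the total number of tokens.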

In [110]:
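from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Build padded unigrams, bigrams, and trigrams, then fit a trigram MLE model.
lm = MLE(3)
train, vocab = padded_everygram_pipeline(3, sentence_tokens)
print("Training the model...")
lm.fit(train, vocab)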

Training the model...
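The following is a minimal sketch of how a prediction could be read off such a model. It is fitted on a tiny hypothetical corpus rather than the Amazon reviews. The names `toy_sentences`, `toy_lm`, and `predict_next`, and the back-off loop that drops the oldest context word when a context was never seen, are illustrative assumptions and not necessarily the procedure used for the results in this paper.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

ORDER = 3  # trigrams, bigrams, and unigrams

# Hypothetical toy corpus, already tokenized and lowercased.
toy_sentences = [
    ["i", "love", "this", "album", "."],
    ["i", "love", "this", "song", "."],
    ["i", "love", "this", "album", "too", "."],
]

toy_lm = MLE(ORDER)
train, vocab = padded_everygram_pipeline(ORDER, toy_sentences)
toy_lm.fit(train, vocab)


def predict_next(model, history, order=ORDER):
    """Return the highest-scoring next word given the previous words.

    Assumption: if no word has a non-zero score for the current context,
    drop the oldest context word and try again, ending at unigrams.
    """
    context = list(history[-(order - 1):]) if order > 1 else []
    while True:
        # Score every vocabulary entry as a continuation of the context.
        scores = {word: model.score(word, context) for word in model.vocab}
        best = max(scores, key=scores.get)
        if scores[best] > 0 or not context:
            return best
        context = context[1:]  # back off to a shorter context


print(predict_next(toy_lm, ["i", "love"]))
print(predict_next(toy_lm, ["love", "this"]))
```

Scoring the whole vocabulary keeps the sketch short but is slow for a large vocabulary; reading the candidate words directly from the model's stored counts for the context would avoid that.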


## LSTM Model

# Results

# Discussion and Future Work