author: Fenton Taylor
date: 23 March, 2017
autosize: true
This application was created for the final assignment of the Johns Hopkins University Data Science Specialization Capstone Project, which was co-sponsored by SwiftKey.
The task was to create a model based on texts taken from online blogs, news articles, and Twitter posts (source) and use that model in an algorithm to predict the next word in a phrase.
The ultimate goal was to implement the model and prediction algorithm in a Shiny application that takes user input text and generates a prediction for the next word.
An N-gram (a contiguous sequence of n items from a given text) model was built from the text sources. The tm package was used to create a corpus of the texts for the necessary pre-processing, which included converting the text to ASCII and lower case, filtering profanity and web artifacts, and tagging sentence endings. For example:
```
[1] Raw Text: John gave her WHAT carat diamond?!?! #crazy
[1] Clean Text: john gave her what carat diamond <EOS>
```
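The cleaning step can be sketched in Python (the app itself uses the R tm package; this regex-based version is only an illustrative approximation, and the exact filters it applies are assumptions):

```python
import re

def clean_text(text):
    # Strip web artifacts such as hashtags and @-mentions (assumed filter set)
    text = re.sub(r"[#@]\w+", "", text)
    # Normalize to lower case
    text = text.lower()
    # Replace end-of-sentence punctuation with an <EOS> tag
    text = re.sub(r"[.!?]+", " <EOS>", text)
    # Drop any remaining punctuation and collapse whitespace
    text = re.sub(r"[^\w\s<>]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

On the example above, `clean_text` reproduces the cleaned output shown.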
After the corpus was pre-processed, N-grams of lengths 1 to 6 were created using the quanteda package and the frequency of each N-gram was counted. Finally, simple Good-Turing smoothing was performed to adjust for unseen words and N-grams. Example:
```
           words  freq r_smooth      pr
1     one of the 10529  10527.5 0.00084
2       a lot of  8975   8973.5 0.00071
3 thanks for the  7242   7240.5 0.00058
```
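The adjusted counts (r_smooth) come from Good-Turing's frequency-of-frequencies idea: an N-gram seen r times is re-estimated from how many N-grams were seen r and r+1 times. A minimal Python sketch of the basic Turing estimate (simple Good-Turing, as used here, additionally smooths the N_r values with a log-log regression per Gale and Sampson; that step is omitted in this sketch):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    # Frequency of frequencies: N_r = number of n-grams seen exactly r times
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, r in ngram_counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Basic Good-Turing estimate: r* = (r + 1) * N_{r+1} / N_r,
        # falling back to the raw count when N_{r+1} is zero
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted
```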
The prediction algorithm does the following:
- Takes user text input
- Pre-processes the text to match the format of the cleaned corpus text
- Searches the appropriate highest-order N-gram list for the user's text
- If no match is found, performs Stupid Backoff to lower-order lists until one is found
- Returns up to the top 5 words that complete the N-gram
- If no match is found at any order, returns the 5 most common words
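The lookup-and-backoff loop above can be sketched as follows (an illustrative Python version; full Stupid Backoff also discounts scores by a fixed factor, typically 0.4, at each backoff step, whereas this sketch simply returns the first matching order, as described above):

```python
from collections import Counter

def predict_next(ngram_tables, context, top_unigrams, k=5):
    # ngram_tables: {context tuple: Counter of observed next words}
    # Search the highest-order table first; when no match is found,
    # back off by dropping the leftmost context word (Stupid Backoff)
    ctx = tuple(context)
    while ctx:
        nexts = ngram_tables.get(ctx)
        if nexts:
            return [w for w, _ in nexts.most_common(k)]
        ctx = ctx[1:]
    # No match at any order: fall back to the most common words overall
    return top_unigrams[:k]
```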
Model Performance:
```
     model top_acc top3_acc top5_acc avg_time
1    Speed   0.170    0.250    0.293    0.103
2 Accuracy   0.186    0.258    0.298    0.652
```
The Shiny Application provides a simple user interface to interact with the prediction algorithm and the data. Notable features include:
- Interactive command line to input text for prediction
- Parameter selections for the algorithm: model choice, number of results
- Interactive plot of most frequent N-grams
- Interactive examples of pre/post-processed text