Capstone Project: Word Predicting Application

author: Fenton Taylor
date: 23 March, 2017
autosize: true

The Task

This application was created for the final assignment of the Johns Hopkins University Data Science Specialization Capstone Project, which was co-sponsored by SwiftKey.

The task was to create a model based on texts taken from online blogs, news articles, and Twitter posts (source) and use that model in an algorithm to predict the next word in a phrase.

The ultimate goal was to implement the model and prediction algorithm in a Shiny application that takes user input text and generates a prediction for the next word.

The Model

An N-gram model was created from the text sources, where an N-gram is a contiguous sequence of n items from a given sequence of text. The tm package was used to build a corpus of the texts for pre-processing. Pre-processing cleaned and prepared the text by converting it to ASCII and lower case, filtering out profanity and web text, and tagging the end of each sentence. For example:

[1] Raw Text: John gave her WHAT carat diamond?!?! #crazy
[1] Clean Text:  john gave her what carat diamond <EOS> 
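
The cleaning steps above can be sketched in a few lines. This is a minimal Python illustration, not the author's tm pipeline: the function name `clean_text`, the placeholder `PROFANITY` set, and the exact regular expressions are all assumptions, as is the choice to drop digits.

```python
import re
import unicodedata

# Hypothetical placeholder; a real filter would load a full word list.
PROFANITY = {"darn"}

def clean_text(raw, profanity=PROFANITY):
    """Return lower-case ASCII text with an <EOS> tag closing each sentence."""
    # Convert to ASCII, dropping characters that have no ASCII equivalent.
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Web-text filtering (assumed rules): URLs, hashtags and @mentions.
    text = re.sub(r"https?://\S+|www\.\S+|[#@]\w+", " ", text)
    sentences = []
    # Any run of ., ! or ? is treated as a sentence boundary.
    for chunk in re.split(r"[.!?]+", text):
        # Keep alphabetic words (with inner apostrophes), minus profanity.
        words = [w for w in re.findall(r"[a-z]+(?:'[a-z]+)*", chunk)
                 if w not in profanity]
        if words:
            sentences.append(" ".join(words) + " <EOS>")
    return " ".join(sentences)

print(clean_text("John gave her WHAT carat diamond?!?! #crazy"))
```

Under these assumed rules the raw example above comes out as `john gave her what carat diamond <EOS>`.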

After the corpus was pre-processed, N-grams of lengths 1 to 6 were created using the quanteda package, and the frequency of each N-gram was counted. Finally, simple Good-Turing smoothing was performed to adjust for unseen words and N-grams. Example:

           words  freq r_smooth      pr
1     one of the 10529  10527.5 0.00084
2       a lot of  8975   8973.5 0.00071
3 thanks for the  7242   7240.5 0.00058
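
To make the counting and smoothing steps concrete, here is a rough Python sketch. It is not the author's quanteda code, and the names `count_ngrams`, `fit_slope`, and `smoothed_count` are hypothetical. It covers only the log-linear part of simple Good-Turing: the frequency-of-frequency values are fitted with a line in log space, and each raw count r is replaced by r* = (r + 1) S(r + 1) / S(r).

```python
import math
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def fit_slope(counts):
    """Least-squares slope b of log(Z_r) = a + b * log(r).

    N_r is the number of distinct N-grams seen exactly r times. Z_r spreads
    N_r over the gap between the neighbouring observed values of r.
    Needs at least two distinct values of r.
    """
    n_r = Counter(counts.values())
    rs = sorted(n_r)
    xs, ys = [], []
    for i, r in enumerate(rs):
        lower = rs[i - 1] if i > 0 else 0
        upper = rs[i + 1] if i < len(rs) - 1 else 2 * r - lower
        xs.append(math.log(r))
        ys.append(math.log(2 * n_r[r] / (upper - lower)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def smoothed_count(r, b):
    """r* = (r + 1) * S(r + 1) / S(r), where S(r) = exp(a) * r ** b."""
    return r * (1 + 1 / r) ** (b + 1)

trigrams = count_ngrams(["one of the best <EOS>", "one of the most <EOS>"], 3)
print(trigrams["one of the"], smoothed_count(2, fit_slope(trigrams)))
```

For large r the smoothed count is approximately r + b + 1, so a slope near -2.5 would lower a large count by about 1.5. That matches the size of the adjustment in the table above, although the slope the author actually fitted is not stated. Full simple Good-Turing also uses the unsmoothed Turing estimate for small r and renormalizes the probabilities after reserving mass for unseen N-grams. Both steps are omitted here.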

The Prediction Function

The prediction algorithm does the following:

  • Takes user text input
  • Pre-processes the text to match the format of the cleaned corpus text
  • Searches the appropriate highest-order N-gram list for the user's text
  • If no match is found, applies Stupid Backoff, dropping to shorter N-grams until a match is found
  • Returns up to the top 5 words that complete the N-gram
  • If no match is found at any order, returns the 5 most common words
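
The steps above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's R code: `build_tables` and `predict` are hypothetical names, the input is only lower-cased and split rather than fully pre-processed, and the backoff factor of 0.4 is the value commonly used with Stupid Backoff rather than one confirmed by the source.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=3):
    """Map each context (tuple of preceding words) to counts of the next word."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, nxt = tokens[i:i + n]
                tables[tuple(context)][nxt] += 1
    return tables

def predict(text, tables, max_n=3, k=5, alpha=0.4):
    """Return up to k (word, score) pairs for the word following `text`."""
    tokens = text.lower().split()
    penalty = 1.0
    # Try the longest usable context first, then drop the leading word.
    for n in range(min(max_n - 1, len(tokens)), 0, -1):
        followers = tables.get(tuple(tokens[-n:]))
        if followers:
            total = sum(followers.values())
            return [(w, penalty * c / total)
                    for w, c in followers.most_common(k)]
        penalty *= alpha  # each backoff step discounts the score
    # No context matched: fall back to the most common single words.
    unigrams = tables.get((), Counter())
    total = sum(unigrams.values())
    return [(w, penalty * c / total) for w, c in unigrams.most_common(k)]

corpus = ["one of the best", "one of the most",
          "one of the best", "a lot of the time"]
tables = build_tables(corpus)
print(predict("one of the", tables))
print(predict("zzz the", tables))
```

In the second call the two-word context is unseen, so the sketch backs off to the one-word context and multiplies the scores by 0.4. The scores are relative frequencies, not normalized probabilities, which is what distinguishes Stupid Backoff from smoothing schemes that redistribute probability mass.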

Model Performance:

     model top_acc top3_acc top5_acc avg_time
1    Speed   0.170    0.250    0.293    0.103
2 Accuracy   0.186    0.258    0.298    0.652
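
The slides do not define the accuracy columns. One plausible reading, offered here as an assumption, is that top_acc is the share of test phrases whose actual next word is the first prediction, and that top3_acc and top5_acc count a hit when it appears among the first three or five. A minimal Python sketch of that metric, with the hypothetical name `top_k_accuracy`:

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of cases whose true next word is among the first k predictions.

    predictions: list of ranked word lists, best guess first.
    targets: list of the words that actually followed.
    """
    hits = sum(1 for ranked, target in zip(predictions, targets)
               if target in ranked[:k])
    return hits / len(targets)

ranked_guesses = [["the", "a", "of"], ["best", "most"], ["time"]]
actual_words = ["a", "best", "day"]
for k in (1, 3, 5):
    print(k, top_k_accuracy(ranked_guesses, actual_words, k))
```

Because the top-5 list always contains the top-3 list, the three accuracy figures can only stay level or rise from left to right, which is the pattern in the table.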

The App

The Shiny Application provides a simple user interface to interact with the prediction algorithm and the data. Notable features include:

  • Interactive command line to input text for prediction
  • Parameter selections for the algorithm: model choice, number of results
  • Interactive plot of most frequent N-grams
  • Interactive examples of pre/post-processed text

![app3](app3.png) ![app2](app2.png) ![app1](app1.png)