Next Word Prediction - Shiny Web Application

Next word prediction (NWP) is the task of suggesting the most probable word a user will type next. It is a natural language processing (NLP) task that can be applied in many areas where text is entered using a keyboard. An example of an NWP application is the SwiftKey smart keyboard, which aims to improve the typing experience on small mobile phones. Beyond improving keystroke efficiency, the literature points out other benefits of word prediction, including improved quality and quantity of written work, enhanced development of written literacy skills, and spelling assistance for people with various levels of spelling disorder [1,2].

NWP is based on the analysis of a corpus (a large collection of text files), resulting in probability distributions over the sequences of words occurring in the corpus. The resulting language model is then used to predict the most likely next word.
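As a toy illustration (with made-up counts), a simple bigram model estimates the probability of a word given the previous one directly from corpus counts:

# Toy illustration with made-up counts: maximum-likelihood bigram estimate
count_thank_you   <- 50   # occurrences of "thank you" in the corpus
count_thank       <- 55   # occurrences of "thank" followed by any word
p_you_given_thank <- count_thank_you / count_thank   # P("you" | "thank") ~ 0.91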

This repo contains code to build such a language model with R, save it into a local SQLite database and query the database from a shiny web application to perform predictions. Preprocessing of the corpus requires MeTa, a C++ data science toolkit developed at UIUC and used for text mining research.

The application can be accessed here.

Files

  • model_builder.R : script to build the model
  • helpers.R : helper functions
  • predict.R : functions to perform prediction based on the language model
  • shinyAPP/ : R code for the shiny server and user interface

Building the language model

This section briefly describes the main steps to obtain the language model from the initial corpus. The corpus can consist of any set of text files; in the present case I used data from Twitter, blog articles and news articles (about 1 GB in total).

Extracting ngrams (word sequences) from the corpus is computationally intensive, especially for large corpora. For this reason the first step was done using an optimized C++ toolkit called MeTa.

Step 1. Extracting ngrams

Ngrams of order 1 to 4 were extracted from each source file using MeTa. This can be done with the provided profile.cpp tool by customising the filter chain. I used the lowercase and alpha filters, the latter removing all non-alphanumeric characters. For this I modified the freq() function as follows:

using namespace meta::analyzers;

// tokenize the raw text with the ICU tokenizer...
std::unique_ptr<token_stream> stream
    = make_unique<tokenizers::icu_tokenizer>();
// ...then lowercase every token and strip unwanted characters
stream = make_unique<filters::lowercase_filter>(std::move(stream));
stream = make_unique<filters::alpha_filter>(std::move(stream));
// feed the filtered token stream to the ngram analyzer of order n
ngram_word_analyzer ana{n, std::move(stream)};

Note that, by default, profile.cpp handles ngrams of order up to 3, but higher orders can be used by adding the following lines in main() (or, alternatively, by passing the order directly as an argument to freq()):

if (all || args.find("--freq-4gram") != args.end())
    freq(file, config, 4);
if (all || args.find("--freq-5gram") != args.end())
    freq(file, config, 5);

Note that MeTa by default outputs the ngrams and their count using an underscore as delimiter between the words. I found this inconvenient for further parsing of the files, especially when the ngrams themselves may contain underscores, so I switched the delimiter to a white space in MeTa's source code (this is what the R code expects when reading MeTa's output). It is a simple change: locate the ngram_word_analyzer.cpp file and change the following line:

combined = tokens[i - j] + "_" + combined;

to:

combined = tokens[i - j] + " " + combined;

Then re-compile MeTa.

Step 2. Building the language model

Once the ngrams have been generated by MeTa and written out as text files, we can read them into R and start building the model. In order to deal with relatively large ngram tables, the data.table package is used and all processing steps are based on join/filter operations (looping through table rows would be far too slow here). The whole process typically takes less than a minute and the main steps are as follows (a simplified sketch of the join-based computation is given after the list):

  1. read the ngram files produced by MeTa (4 tables for each source file)
  2. merge the tables, resulting in 4 tables for ngrams of order 1 to 4
  3. filter out all ngrams containing non-alphabetic characters or appearing fewer than 5 times. Although an alpha filter is used in MeTa, I had some issues with Unicode characters that seem specific to the R + Windows combination, hence this additional filtering step. Filtering out ngrams with count < 5 also keeps the tables small enough to handle in memory (each table contains a large number of unique ngrams that contribute little to prediction power)
  4. compute the conditional probabilities. For the trigram table, for example, this corresponds to the probability of the third word given the first two. I used interpolated Kneser-Ney smoothing as the language model. This requires computing several intermediate results, including continuation counts, and then chaining the results together since the model is recursive. (A natural way to compute it would be a recursive function, but that forces you to loop through the tables, which is very inefficient, so the actual code uses join operations.)
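Here is a minimal sketch of the kind of join-based computation involved (not the actual model_builder.R code). It assumes that each line of a MeTa output file holds the space-separated words of an ngram followed by its count; the file names are hypothetical, and only the plain maximum-likelihood estimate is shown, whereas the real model applies interpolated Kneser-Ney smoothing on top of counts like these.

library(data.table)

# read a MeTa ngram file into a data.table; assumes "w1 ... wn count" per line
read_ngrams <- function(file, n) {
  fread(file, header = FALSE, sep = " ",
        col.names = c(paste0("w", seq_len(n)), "count"))
}

tri <- read_ngrams("twitter.3grams.txt", 3)   # hypothetical file names
bi  <- read_ngrams("twitter.2grams.txt", 2)

# drop rare ngrams and ngrams containing non-alphabetic characters
tri <- tri[count >= 5 & !grepl("[^a-z]", paste0(w1, w2, w3))]

# join on the two-word history and divide the counts:
# P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
tri <- merge(tri, bi[, .(w1, w2, hist_count = count)], by = c("w1", "w2"))
tri[, prob := count / hist_count]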

From the above steps, an interpolated KN model is built for each ngram order. The lowest order model (unigrams) simply corresponds to the maximum likelihood estimate, since no recursive call to a lower order model is possible. For the higher order models (2-4), the estimated probabilities take into account information from the lower order ngrams. When doing prediction, we first look into the model of order 4, trying to predict the next word based on the preceding three words. If no result is found (i.e. the sequence of three words was unseen in the training corpus), we simply back off to a lower order KN model.
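A simplified illustration of this back-off strategy, assuming the KN tables are kept in memory as a list of data.tables indexed by ngram order (the actual predict.R queries the SQLite database instead):

library(data.table)

# "models" is a list of data.tables, one per ngram order, each with history
# columns w1, w2, ..., a next_word column and a prob column (assumed layout)
backoff_predict <- function(models, history, npred = 3) {
  for (ord in seq(length(history) + 1, 2)) {       # highest order first, e.g. 4, 3, 2
    hist <- tail(history, ord - 1)                 # last (ord - 1) words of the input
    key  <- as.list(hist)
    names(key) <- paste0("w", seq_along(hist))
    hits <- models[[ord]][key, on = names(key), nomatch = NULL]
    if (nrow(hits) > 0)                            # history seen: rank the candidates
      return(head(hits[order(-prob)], npred))
  }
  head(models[[1]][order(-prob)], npred)           # nothing matched: unigram fallback
}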

For more details about how to compute the Kneser-Ney probabilities, I strongly recommend the thesis by Martin Körner and this article.
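For quick reference, the interpolated estimate at the highest order takes the standard textbook form (this is the general formula, not copied from the code):

    P_KN(w | h) = max(c(h w) - d, 0) / c(h) + lambda(h) * P_KN(w | h')
    lambda(h)   = d * N1+(h .) / c(h)

where h is the word history, h' is h with its first word dropped, d is the discount, c(.) denotes ngram counts and N1+(h .) is the number of distinct words observed after h. For the lower order models, the raw counts are replaced by continuation counts, as detailed in the references above.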

Step 3. Storing the model in a local database

Once the KN models of order 1 to 4 are computed, they are stored in a local SQLite database; I used the SQLite R package for this. At prediction time, one simply connects to the database and retrieves specific word sequences using SQL queries.
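A minimal sketch of this step, assuming the DBI and RSQLite packages; the table name, column names and the four_gram_table object below are illustrative, not necessarily those used in model_builder.R:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "data/NWP.SQLite")

# store one table per ngram order (four_gram_table is a hypothetical data.frame
# with columns w1, w2, w3, w4 and prob)
dbWriteTable(con, "ngram4", four_gram_table, overwrite = TRUE)

# at prediction time, retrieve the candidates for a given three-word history
res <- dbGetQuery(con,
  "SELECT w4, prob FROM ngram4 WHERE w1 = ? AND w2 = ? AND w3 = ?
   ORDER BY prob DESC LIMIT 5",
  params = list("thanks", "for", "the"))

dbDisconnect(con)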

Testing

predict.R includes a predict() function which can be used to test the algorithm (see the example call after the list below). It takes the following inputs:

  • database: the SQLite database containing the language model
  • raw_input: the input sentence from which we want to predict the next word. This sentence is preprocessed in the same way as the initial corpus used to build the model.
  • method: either "qML" for maximum likelihood estimation or "KN" for Kneser-Ney smoothed probability based prediction
  • npred: the number of candidate next words to return
  • max_order: the order of the highest order model, 4 by default
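A hypothetical example call, assuming the argument names listed above (the exact signature is in predict.R):

source("predict.R")

predict(database  = "data/NWP.SQLite",
        raw_input = "thanks for the",
        method    = "KN",
        npred     = 3,
        max_order = 4)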

The shiny App

The shiny web application requires the language model to be stored in a SQLite database in the data/ sub-folder; the database name is assumed to be 'data/NWP.SQLite'. The user interface is very basic and the server uses the same predict() function as in predict.R (located in shinyAPP/helpers.R).
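For illustration, a minimal app along these lines could look like the sketch below (this is not the actual shinyAPP/ code; it assumes predict() returns a table of candidate words):

library(shiny)
source("helpers.R")   # provides predict() in the actual app

ui <- fluidPage(
  textInput("sentence", "Type a sentence:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(input$sentence)
    predict(database  = "data/NWP.SQLite",
            raw_input = input$sentence,
            method    = "KN",
            npred     = 3,
            max_order = 4)
  })
}

shinyApp(ui, server)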

References and notes

  • [1] J. Klund and M. Novak, "If Word Prediction Can Help, Which Program Do You Choose?", available at http://trace.wisc.edu
  • [2] A. Newell, J. Arnott, L. Booth, W. Beattie, B. Brophy and I. Ricketts, "Effect of the 'PAL' word prediction system on the quality and quantity of text generation", Augmentative and Alternative Communication, 8(4), 304–311, 1992