SwiftPredict

SwiftPredict is an ngram model that predicts the next word in English text from the words that precede it (the history). To understand fully how it works, read the report in main.html. To see a working POC web app, go to https://dmitrytoda.shinyapps.io/SwiftPredict/

All the code comes as separate helper functions, one per file, described below.

Building the model

These functions are needed to build the model from a data set (an end-to-end sketch of the pipeline follows the list):

  • createSample() randomly chooses some lines from an input text file to create a smaller sample for exploratory analysis

  • file2sentences() reads text from file(s) and returns a quanteda::corpus object split into one-sentence documents

  • str2tokens() splits each (one-sentence) text into tokens (words) with the following preprocessing:

      - remove punctuation
      - remove special symbols
      - remove numbers
      - remove URLs
      - remove tokens that contain no letters
      - remove tokens that contain non-English characters
    
  • nFreq() builds an ngram frequency table for a given quanteda::tokens object (that is, counts how many times each ngram is observed)

  • removeOOD() receives an ngram frequency table, replaces out-of-dictionary words with a special unknown-word token and recalculates frequencies by collapsing equivalent ngrams

  • keep3() keeps only top-3 predictions for each possible history and replaces integer ngram frequencies with factor prediction ranks (1, 2, 3)
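
Below is a minimal sketch of how these helpers might be chained into a full pipeline. The file names, argument names and the dictionary object are assumptions for illustration only; check each helper's own file for its exact signature.

```r
# Illustrative pipeline only: file names, arguments and the `dictionary`
# object are assumed, not taken from the actual helper signatures.
library(quanteda)
library(data.table)

source("createSample.R")     # assumed file names, one helper per file
source("file2sentences.R")
source("str2tokens.R")
source("nFreq.R")
source("removeOOD.R")
source("keep3.R")

createSample("input.txt", "sample.txt")    # draw a random subset of lines
sentences <- file2sentences("sample.txt")  # corpus of one-sentence documents
toks      <- str2tokens(sentences)         # cleaned quanteda::tokens object

freq3  <- nFreq(toks, n = 3)               # trigram frequency table
freq3  <- removeOOD(freq3, dictionary)     # collapse out-of-dictionary words
model3 <- keep3(freq3)                     # keep top-3 predictions per history
```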

Using the model

Once you have a prepared model, only two functions are needed to use it (a short usage sketch follows the list):

  • combined_predict() makes predictions based on some history, an ngram model and a dictionary
  • my_cond() is a helper function to create data.table-compatible conditions for fast ngram binary search
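
A sketch of how a prediction call might look, together with the kind of data.table keyed lookup that my_cond() builds conditions for. Argument order and the model/dict object names are assumptions; see combined_predict.R for the real interface.

```r
# Assumed interface: argument order and object names are illustrative.
library(data.table)
source("my_cond.R")
source("combined_predict.R")

load("model20_with_dict")   # assumed to provide the dictionary and ngram model objects

combined_predict("I would like to", model, dict)
# returns the top-ranked next-word candidates for this history

# For context, binary search on a keyed data.table looks like this; my_cond()
# assembles conditions of this kind so histories are matched without a full
# vector scan:
dt <- data.table(w1 = c("would", "would"), w2 = c("like", "love"),
                 pred = c("to", "to"), rank = c(1L, 1L))
setkey(dt, w1, w2)
dt[.("would", "like")]      # fast keyed lookup
```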

Web app

The Shiny_app folder contains all the code for a POC Shiny web app that lets you input any text and get a prediction, along with the ngrams that contributed to it (a minimal skeleton is sketched after the file list).

  • server.R and UI.R contain the server and UI code, respectively
  • model20_with_dict contains the 20,000-word dictionary and the 1- to 6-gram model itself
  • other files are just copied from the root folder for deployment to Shiny servers
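
For orientation, here is a minimal Shiny skeleton of the kind that server.R and UI.R implement. The actual app code in Shiny_app is more elaborate; the input/output ids and displayed text below are illustrative only.

```r
# Illustrative skeleton only; the real server.R / UI.R differ in detail.
library(shiny)

ui <- fluidPage(
  titlePanel("SwiftPredict"),
  textInput("history", "Type some text:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$history)
    # the real app calls combined_predict() here with the loaded model/dictionary
    paste("Prediction for:", input$history)
  })
}

shinyApp(ui, server)
```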

Legacy files

  • _shrink model.R and collapse_ngrams.R were used to collapse the initial 50k-word dictionary model down to the current 20k-word one
