# Predictive Text App - DSS Capstone Project

This report consists of an exploratory data analysis and a brief summary of the steps taken to develop the Predictive Text App for the DSS Capstone Project. The data comes from Hans Christensen's HC Corpora, a collection of corpora gathered from publicly available sources by a web crawler.

The original datasets can be found on the following site: http://www.corpora.heliohost.org/

## Data Sources
The data provided consists of three files: en_US.news.txt, en_US.blogs.txt and en_US.twitter.txt. In accordance with the name of the file, each row in the files contains either a blog post, a news article or a tweet.

The files were read using the R library readr. On the first attempt, however, the Twitter file could not be read because of an embedded null character. Removing it with the *tr* command-line tool fixed the issue, and all three files were then read successfully.
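The report does not give the exact *tr* invocation that was used. As a rough illustration of the same idea, the following Python sketch (not the project's code; the function names are hypothetical) removes every NUL byte from a file before it is parsed as text:

```python
def strip_null_bytes(raw: bytes) -> bytes:
    """Return a copy of `raw` with every NUL (0x00) byte removed."""
    return raw.replace(b"\x00", b"")


def clean_file(src_path: str, dst_path: str) -> None:
    """Write a NUL-free copy of the file at `src_path` to `dst_path`.

    The file is handled as bytes so that the embedded nulls cannot
    break text decoding before they are removed.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_null_bytes(data))
```

For files of this size (around 300 MB), reading the whole file into memory is usually acceptable; a chunked loop would be the alternative on a more constrained machine.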

Some basic characteristics of the files are shown in the table below:

File | en_US.news | en_US.blogs | en_US.twitter
-----|------------|-------------|---------------
Size (MB) | 261.8 | 260.6 | 316
Lines | 1010242 | 899288 | 2360148
Words | 34275000 | 37242000 | 29876000

## Processing and cleaning of the data
The next step was to draw a sample of the data, given my PC’s processing limitations. The sample comprised 3.5% of the three datasets and was obtained using R’s rbinom function, with a seed set for reproducibility.
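The sampling idea can be sketched as one independent coin flip per line, which is what drawing a binomial indicator for each row amounts to. The Python sketch below is an illustration, not the project's R code; the function name and the seed value are assumptions.

```python
import random


def sample_lines(lines, fraction=0.035, seed=1234):
    """Keep each line independently with probability `fraction`.

    A dedicated, seeded generator makes the sample reproducible
    without touching the global random state.
    """
    rng = random.Random(seed)  # the seed value is arbitrary
    return [line for line in lines if rng.random() < fraction]
```

Because every line is kept or dropped independently, the sample size is only approximately 3.5% of the input, but the same seed always yields the same sample.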

Three steps were then taken to clean the sampled data:

1. All the text was transformed to lowercase and split into sentences according to this regular expression: *[\.|!|?][:space:]*. The split was needed because punctuation marks were removed in the next step; without it, n-grams would span sentence boundaries and lose grammatical meaning.

2. Punctuation marks and other unrecognized characters were removed using this regular expression: *[^[:alnum:]|[:space:]|']+*. The apostrophe was kept to avoid changing the meaning of contractions such as I’m and There’s.

3. Finally, using [George Carlin’s List of Seven Dirty Words](https://en.wikipedia.org/wiki/Seven_dirty_words) (admittedly a questionable choice), profanities were replaced with the token "UNK". That token would eventually be filtered out, but deleting the words at this step would end up creating some senseless n-grams.
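The three cleaning steps above can be sketched roughly as follows. This is a Python illustration rather than the R code used for the project: the regular expressions are ASCII-only approximations of the ones quoted above, the function name is hypothetical, and `PROFANITIES` holds a harmless placeholder instead of the actual word list.

```python
import re

# Placeholder standing in for the profanity list (assumption).
PROFANITIES = {"badword"}

# Sentence boundary: '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]\s+")
# Anything other than lowercase ASCII letters, digits, whitespace or "'".
NOT_ALLOWED = re.compile(r"[^a-z0-9\s']+")


def clean_text(text, profanities=PROFANITIES):
    """Return a list of sentences, each a list of cleaned tokens."""
    cleaned = []
    # Step 1: lowercase, then split into sentences.
    for sentence in SENTENCE_END.split(text.lower()):
        # Step 2: remove punctuation, keeping apostrophes.
        sentence = NOT_ALLOWED.sub("", sentence)
        # Step 3: mask profanities instead of deleting them, so the
        # neighbouring words are not joined into a false n-gram.
        tokens = ["UNK" if tok in profanities else tok
                  for tok in sentence.split()]
        if tokens:
            cleaned.append(tokens)
    return cleaned
```

Keeping each sentence as its own token list makes it straightforward to build n-grams that never cross a sentence boundary.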



## Data features
The next step was to perform different summarizing operations on the data. The first operation was to obtain n-grams; an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Using R’s quanteda package, n-grams for n = 1 to 5 were obtained.
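To make the definition concrete, here is a minimal Python sketch of n-gram extraction and counting. It is not quanteda's implementation; the function names are hypothetical, and the input is assumed to be a list of tokenized sentences.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, max_n=5):
    """Count n-grams for n = 1..max_n over a list of token lists."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            # Sentences shorter than n simply contribute nothing.
            counts[n].update(ngrams(tokens, n))
    return counts
```

For example, `ngrams(["a", "b", "c"], 2)` yields the two bigrams `("a", "b")` and `("b", "c")`.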

Beforehand, it was expected that a few words would account for most of the language used. The analysis confirmed this: only 150 words account for more than 50% of all word occurrences in the data, as marked in the next graph.

After the first few words, however, coverage grows much more slowly. For instance, the article "the" is the most used word in the dataset, accounting for 4.75% of all word occurrences, while the word "why", in the 150th position, accounts for less than 0.08%.
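The coverage figures above come from accumulating word frequencies in decreasing order. A small Python sketch of that calculation follows, assuming the unigram counts are available as a `Counter`; the function name is hypothetical.

```python
from collections import Counter


def words_needed_for_coverage(unigram_counts, target=0.5):
    """Return how many of the most frequent words are needed to
    cover at least `target` of all word occurrences."""
    total = sum(unigram_counts.values())
    covered = 0
    ranked = unigram_counts.most_common()  # most frequent first
    for rank, (_, count) in enumerate(ranked, start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(unigram_counts)
```

Applied to the sampled data, a call with `target=0.5` would correspond to the 150-word figure reported above.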

The next graph shows the number of appearances of the most common unigrams. From these unigrams, it is possible to conclude that the most common groups of words in the dataset are articles, prepositions and personal pronouns.

This same treatment was performed on higher order n-grams, but due to the brief nature of this report those results were not added to the document.

## Prediction algorithm

To develop a functional predictive text application, I opted for the Stupid Backoff algorithm, since it seemed to offer a fair trade-off between complexity and accuracy. The algorithm looks for the highest-order known n-gram that matches the end of the input phrase and uses it to predict the next word. If no such n-gram exists, it backs off to lower-order n-grams until a match is found. When no match is found at all, the algorithm outputs the most used 1-gram (in this case, "the"). Given my time constraints, the current implementation of the algorithm is sub-optimal and will have to be heavily modified to achieve major gains in speed.
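The backoff logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the app's implementation: it returns the top candidate from the longest matching context and leaves out the score discount (0.4 per backoff step) of the original Stupid Backoff formulation. The names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_n=4):
    """Map each context (tuple of preceding words) to a Counter of
    the words that followed it. The empty context holds unigrams."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model


def predict_next(model, phrase, max_n=4):
    """Predict the next word, backing off to shorter contexts."""
    tokens = phrase.lower().split()
    # Try the longest usable context first, then shorten it.
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the model is empty
```

With a model built from a few toy sentences, `predict_next(model, "I love")` returns the word that most often followed "i love", and an unseen phrase falls through to the most frequent unigram.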

One interesting finding is that going from 4-grams to 5-grams in the prediction model does not seem to make a major difference in accuracy. For this reason, I decided to cap the n-grams at 4 words. In any case, in the trade-off between accuracy and efficiency, I am leaning more towards efficiency.

## Application

Finally, the application can be found at the following link: [Predictive Text App](https://dylanjcastillo.shinyapps.io/predictive_text_app/)