## Week 2: Capstone Project: Task 2

**Tasks to accomplish**

1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
1. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

**Questions to consider**

1. **Some words are more frequent than others - what are the distributions of word frequencies?**
1. **What are the frequencies of 2-grams and 3-grams in the dataset?**
1. **How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?**
1. How do you evaluate how many of the words come from foreign languages?
1. Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

### 0. Environment Settings

In [None]:
library(data.tree)
library(DiagrammeR)
library(dplyr)
library(ggplot2)
library(igraph)
library(influenceR)
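# plyr is attached after dplyr, so plyr's count(), mutate(), arrange() and
# summarise() mask the dplyr versions; the code below relies on plyr::count()
library(plyr)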
library(SnowballC)
library(stringi)
library(stringr)
library(tidyr)
library(tidytext)
library(tokenizers)
library(tm)

### 1. Preparation

In [None]:
# file path for english data
blogs_path <- "~/Soft/Rtest/JHU_capstone_project_data/final/en_US/en_US.blogs.txt"
news_path <- "~/Soft/Rtest/JHU_capstone_project_data/final/en_US/en_US.news.txt"
twitter_path <- "~/Soft/Rtest/JHU_capstone_project_data/final/en_US/en_US.twitter.txt"

# load data
blogs <- readLines(blogs_path, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_path, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_path, encoding = "UTF-8", skipNul = TRUE)

# number of characters per line
blogs_nchar <- nchar(blogs)
news_nchar <- nchar(news)
twitter_nchar <- nchar(twitter)

In [None]:
# boxplot(
#     blogs_nchar
#     , news_nchar
#     , twitter_nchar
#     , log = "y"
#     , names = c("blogs", "news", "twitter")
#     , ylab = "Number of Characters (log scale)"
#     , xlab = "Type"
#     , main = "Distribution of Characters/Line"
# )

In [None]:
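# Wrap the character vectors in data frames (one line of text per row);
# keep text as character so unnest_tokens() can tokenize it
blogs <- data.frame(text = blogs, stringsAsFactors = FALSE)
news <- data.frame(text = news, stringsAsFactors = FALSE)
twitter <- data.frame(text = twitter, stringsAsFactors = FALSE)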

In [None]:
# Sampling
set.seed(1565)
sample_pct <- 0.1

blogs_sample <- blogs %>% sample_n(., nrow(blogs)*sample_pct)
news_sample <- news %>% sample_n(., nrow(news)*sample_pct)
twitter_sample <- twitter %>% sample_n(., nrow(twitter)*sample_pct)

In [None]:
# Create aggregate sample
agg_sample <- bind_rows(
    mutate(blogs_sample, source = 'blogs')
    , mutate(news_sample, source = 'news')
    , mutate(twitter_sample, source = 'twitter')
)
# change agg_sample$source type: chr --> factor
agg_sample$source <- factor(agg_sample$source)

In [None]:
head(agg_sample)

In [None]:
# Create filters: non-alphabetic characters, URLs, words with a doubled letter
replace_reg <- "[^[:alpha:][:space:]]*"
replace_url <- "http[^[:space:]]*"
# NOTE: this matches every word that contains two identical consecutive
# characters, so ordinary words such as "good" are removed as well
replace_aaa <- "\\b(?=\\w*(\\w)\\1)\\w+\\b"

# Clean sample
clean_sample <- agg_sample %>%
    mutate(text = str_replace_all(text, replace_reg, "")) %>%
    mutate(text = str_replace_all(text, replace_url, "")) %>%
    mutate(text = str_replace_all(text, replace_aaa, ""))
# optional: mutate(text = iconv(text, "ASCII//TRANSLIT"))
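To see what these filters do, the sketch below applies similar patterns to a toy string with base R only. The helper names `clean_text` and `drop_doubled` are hypothetical, and removing URLs before punctuation is an assumption made here for readability, not the order used in the cell above.

```r
# Hypothetical helper: strip URLs, then everything that is not a letter or space
clean_text <- function(x) {
  x <- gsub("http[^[:space:]]*", "", x)        # drop URLs
  x <- gsub("[^[:alpha:][:space:]]", "", x)    # drop digits and punctuation
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)
}

# Hypothetical helper: the doubled-letter filter from the notebook (needs PCRE)
drop_doubled <- function(x) {
  gsub("\\b(?=\\w*(\\w)\\1)\\w+\\b", "", x, perl = TRUE)
}

print(clean_text("Visit http://example.com now!!! It's 2 good"))
print(drop_doubled("a good cat"))
```

The second call shows that the doubled-letter pattern removes the whole word `good`, which is worth keeping in mind when reading the frequency tables below.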

In [None]:
rm(blogs
   , blogs_nchar
   , news
   , news_nchar
   , twitter
   , twitter_nchar
   , replace_reg
   , replace_url
   , replace_aaa
  )

In [None]:
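data("stop_words")
# X2g1c is the column name read.csv() derives from the first line of the file,
# which suggests the list has no header row
swear_words <- read.csv("/home/yanyuan/Soft/swear-words/en")
swear_words <- unnest_tokens(swear_words, word, X2g1c)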

In [None]:
# One token per row, with swear words and stop words removed
tidy_data <- clean_sample %>%
    unnest_tokens(word, text) %>%
    anti_join(swear_words, by = "word") %>%
    anti_join(stop_words, by = "word")

In [None]:
head(tidy_data)

### 2. Word Frequencies

Some words are more frequent than others - what are the distributions of word frequencies? 

In [None]:
data_count <- tidy_data %>% summarise(keys = n_distinct(word))
data_count

In [None]:
# plyr::count() returns one row per word with its count in `freq`
word_freq <- plyr::count(tidy_data, vars = "word") %>%
    mutate(proportion = freq / sum(freq)) %>%
    arrange(desc(proportion))

In [None]:
head(word_freq, 10)

After removing stop words, the ten most frequent words are `im`, `time`, `dont`, `day`, `people`, `life`, `rt`, `home`, `night`, and `lol`. The plot below shows the 20 most frequent words.

In [None]:
word_freq %>%
    top_n(20, proportion) %>%
    mutate(word = reorder(word, proportion)) %>%
    ggplot(aes(word, proportion)) +
    geom_col(col = "red", fill = "darkgrey") +
    # x is the word axis (drawn vertically after coord_flip), y the proportion
    labs(title = "Word Frequencies in en_US Data", x = "Word", y = "Proportion") +
    coord_flip()
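To look at the shape of the whole distribution rather than only the top words, a rank-frequency table is a common starting point. The sketch below builds one in base R on a toy token vector; `tokens` is hypothetical data standing in for `tidy_data$word`.

```r
# Toy tokens (hypothetical data standing in for tidy_data$word)
tokens <- c("the", "cat", "the", "dog", "the", "cat")

# Count each word and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)

rank_freq <- data.frame(
  word = names(freq),
  rank = seq_along(freq),
  freq = as.integer(freq),
  proportion = as.numeric(freq) / sum(freq),
  stringsAsFactors = FALSE
)
print(rank_freq)
```

On real text, plotting `freq` against `rank` with both axes on a log scale (for example `plot(rank_freq$rank, rank_freq$freq, log = "xy")`) usually gives a roughly straight line, the pattern known as Zipf's law.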

Next, we plot the word distribution by source.

In [None]:
head(tidy_data)

In [None]:
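# Proportion of each word within its source.
# dplyr::mutate() is called explicitly because plyr::mutate() (which masks it
# here) ignores the grouping and would divide by the total over all sources.
word_freq_source <- plyr::count(tidy_data, vars = c("word", "source")) %>%
    dplyr::group_by(source) %>%
    dplyr::mutate(proportion = freq / sum(freq)) %>%
    dplyr::ungroup() %>%
    dplyr::filter(proportion > 0.0005) %>%
    dplyr::arrange(desc(proportion), desc(freq)) %>%
    dplyr::mutate(word = reorder(word, proportion))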

In [None]:
# One panel per source, bars ordered by proportion
g <- ggplot(data = word_freq_source, aes(word, proportion)) +
    geom_col(col = "blue", fill = "darkgrey") +
    coord_flip() +
    facet_grid(~source, scales = "free") +
    labs(x = "Word", y = "Proportion")
g

### 3. Word Coverage

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%? 

In [None]:
# Words, most frequent first, whose cumulative share stays within 50%
cover_50 <- plyr::count(tidy_data, vars = "word") %>%
    mutate(proportion = freq / sum(freq)) %>%
    arrange(desc(proportion)) %>%
    mutate(coverage = cumsum(proportion)) %>%
    filter(coverage <= 0.5)

nrow(cover_50)

In [None]:
# Words, most frequent first, whose cumulative share stays within 90%
cover_90 <- plyr::count(tidy_data, vars = "word") %>%
    mutate(proportion = freq / sum(freq)) %>%
    arrange(desc(proportion)) %>%
    mutate(coverage = cumsum(proportion)) %>%
    filter(coverage <= 0.9)

nrow(cover_90)

In this 10% sample, after removing stop words and swear words, a frequency-sorted dictionary needs:
- 1305 words to cover 50% of all word instances
- 17470 words to cover 90% of all word instances
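The coverage calculation can be checked on a small example. The sketch below is a base R version on toy data; `words_needed` is a hypothetical helper, and it returns the first rank at which the cumulative share reaches the target. The `filter(coverage <= ...)` cells above instead count the rows whose running total has not yet passed the target, which can come out one word short.

```r
# Hypothetical helper: smallest number of top-ranked words whose combined
# share of all tokens is at least `target`
words_needed <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)
  coverage <- cumsum(as.numeric(freq)) / sum(freq)
  which(coverage >= target)[1]
}

# Toy tokens: "a" x4, "b" x3, "c" x2, "d" x1 (cumulative shares 0.4, 0.7, 0.9, 1.0)
tokens <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d")

print(words_needed(tokens, 0.5))
print(words_needed(tokens, 0.9))
```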

### 4. Frequencies of Bigrams

In [None]:
# Create bigrams; the source column is carried along for each bigram
bigram_data <- as.data.frame(clean_sample) %>%
    unnest_tokens(output = bigram, input = text, token = 'ngrams', n = 2)
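For readers unfamiliar with n-gram tokenization, the sketch below shows the idea in base R: slide a window of `n` words over the token sequence. `make_ngrams` is a hypothetical helper written for illustration; it is not how `unnest_tokens()` is implemented.

```r
# Hypothetical helper: all runs of n consecutive tokens, joined by spaces
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens <- c("the", "quick", "brown", "fox")
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A line with fewer than `n` words yields no n-grams, which is why very short lines contribute nothing to the bigram and trigram counts.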

In [None]:
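# Count each (bigram, source) pair, then keep the most frequent pairs whose
# cumulative share stays within 90% of all bigram instances
bigram_cover_90 <- bigram_data %>%
    plyr::count(vars = c("bigram", "source")) %>%
    mutate(proportion = freq / sum(freq)) %>%
    arrange(desc(proportion)) %>%
    mutate(coverage = cumsum(proportion)) %>%
    filter(coverage <= 0.9)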

In [None]:
head(bigram_cover_90)

The number of distinct (bigram, source) pairs needed to cover 90% of all bigram instances is:

In [None]:
nrow(bigram_cover_90)

In [None]:
# Plot the 20 most frequent bigrams
bigram_cover_90 %>%
    top_n(20, proportion) %>%
    mutate(bigram = reorder(bigram, proportion)) %>%
    ggplot(aes(bigram, proportion)) +
    geom_col() +
    labs(title = "Bigram: Top 20", x = "Bigram", y = "Proportion") +
    coord_flip()

### 5. Frequencies of Trigrams

In [None]:
# Create trigrams; the source column is carried along for each trigram
trigram_data <- as.data.frame(clean_sample) %>%
    unnest_tokens(output = trigram, input = text, token = 'ngrams', n = 3)

In [None]:
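# Count each (trigram, source) pair, then keep the most frequent pairs whose
# cumulative share stays within 90% of all trigram instances
trigram_cover_90 <- trigram_data %>%
    plyr::count(vars = c("trigram", "source")) %>%
    mutate(proportion = freq / sum(freq)) %>%
    arrange(desc(proportion)) %>%
    mutate(coverage = cumsum(proportion)) %>%
    filter(coverage <= 0.9)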

The number of distinct (trigram, source) pairs needed to cover 90% of all trigram instances is:

In [None]:
nrow(trigram_cover_90)

In [None]:
head(trigram_cover_90, 20)

In [None]:
# Plot the 20 most frequent trigrams
trigram_cover_90 %>%
    top_n(20, proportion) %>%
    mutate(trigram = reorder(trigram, proportion)) %>%
    ggplot(aes(trigram, proportion)) +
    geom_col() +
    labs(title = "Trigram: Top 20", x = "Trigram", y = "Proportion") +
    coord_flip()

## Week 2: Capstone Project: Task 3

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

**Tasks to accomplish**
1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
1. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn't observed.
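As a rough illustration of both tasks, the sketch below builds unigram, bigram and trigram counts from a toy corpus and predicts the next word by trying the longest matching context first, then backing off to shorter ones. Everything here (`corpus`, `tokenize`, `ngram_counts`, `predict_next`) is hypothetical and written in base R; it is a plain lookup with backoff, not a probabilistic scheme such as Katz backoff.

```r
# Toy corpus (hypothetical data, not the capstone corpus)
corpus <- c("i love new york", "i love new shoes", "we love new york")

# Lower-case and split on whitespace, dropping empty tokens
tokenize <- function(line) {
  tok <- strsplit(tolower(line), "[[:space:]]+")[[1]]
  tok[nzchar(tok)]
}

# Named integer vector of n-gram counts over all lines
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

counts <- lapply(1:3, function(n) ngram_counts(corpus, n))

# Try the longest usable context first, then back off to shorter contexts,
# and finally fall back to the most frequent single word
predict_next <- function(context, counts) {
  tok <- tokenize(context)
  n <- min(length(counts), length(tok) + 1)
  while (n >= 2) {
    prefix <- paste(tail(tok, n - 1), collapse = " ")
    tab <- counts[[n]]
    hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # keep only the last word
    }
    n <- n - 1
  }
  names(counts[[1]])[which.max(counts[[1]])]
}

print(predict_next("i love new", counts))      # trigram context is found
print(predict_next("they adore new", counts))  # backs off to the bigram table
print(predict_next("hello there", counts))     # falls back to unigrams
```

Ties are broken by whichever n-gram comes first in the table, which is an arbitrary choice made to keep the sketch short.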

**Questions to consider**
1. How can you efficiently store an n-gram model (think Markov Chains)?
1. How can you use the knowledge about word frequencies to make your model smaller and more efficient?
1. How many parameters do you need (i.e. how big is n in your n-gram model)?
1. Can you think of simple ways to "smooth" the probabilities (think about giving all n-grams a non-zero probability even if they aren't observed in the data) ?
1. How do you evaluate whether your model is any good?
1. How can you use backoff models to estimate the probability of unobserved n-grams?
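One simple answer to the smoothing question is add-one (Laplace) smoothing, which adds 1 to every bigram count so that unseen pairs still get a small probability:

$$P(w \mid w_{prev}) = \frac{c(w_{prev}\, w) + 1}{c(w_{prev}) + V}$$

where $V$ is the vocabulary size. The sketch below applies this to a toy token sequence in base R; `tokens`, `count_of` and `laplace_prob` are hypothetical names, and add-one smoothing is only one of several possible choices.

```r
# Toy token sequence (hypothetical data)
tokens <- c("i", "love", "new", "york", "i", "love", "new", "shoes")

vocab <- unique(tokens)
V <- length(vocab)

# Named integer vectors of unigram and bigram counts
as_counts <- function(x) {
  tab <- table(x)
  setNames(as.integer(tab), names(tab))
}
unigram <- as_counts(tokens)
bigram <- as_counts(paste(head(tokens, -1), tail(tokens, -1)))

# Count lookup that returns 0 for keys that were never observed
count_of <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0L
}

# Add-one smoothed estimate of P(word | prev)
laplace_prob <- function(prev, word) {
  (count_of(bigram, paste(prev, word)) + 1) / (count_of(unigram, prev) + V)
}

print(laplace_prob("new", "york"))  # observed bigram
print(laplace_prob("new", "love"))  # unseen bigram, still non-zero
```

Because every word in the vocabulary receives the extra count, the smoothed probabilities for a context such as `new` still sum to 1.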