# Introduction to Text Analytics

### Data Science 350
### Stephen Elston

## Introduction 

This notebook contains a tutorial introduction to basic text analytics with R.  

![](img/HAL.jpg)
<center>**AI has a ways to go!**</center>

### Text Data Are Everywhere

Raw text data is an unstructured and ubiquitous type of data. Most of the world’s data is unstructured. Volumes of unstructured data, including text, are growing much faster than structured data. There are many industry estimates for the fraction of all data which is unstructured. A few from the last 8 years include:   
- 2009 HP Survey: 70%
- Gartner: 80%
- Teradata: 85%
- But, **Beware of industry estimates!!**

How much text data are we talking about here? In only a few years, Twitter has recorded more text data than all that has been written in print in the history of mankind. (http://www.internetlivestats.com/twitter-statistics/)

### Applications of Text Analytics

Given the ubiquity and volume of text data, it is not surprising that numerous powerful applications which exploit text analytics are appearing. A few of these applications are listed below.

- Intelligent applications
  - Assistants
  - Chat bots
- Classification
  - Sentiment analysis
  - SPAM detection
- Speech recognition
- Search
- Information retrieval

### Analysis of Text Data

In this tutorial we investigate three areas of text analytics. The following three sections cover these topics.

- Preparing text for analysis.
- Classification of text and sentiment analysis.
- Topic models for document classification and retrieval. 

#### Preparing  text data

By its very nature, text data comes unstructured and poorly organized for analysis. Typically multiple steps are required to process text into a form suitable for analysis. You can think of this process as transforming the unstructured data into a structured set of features. 

Steps covered in this tutorial include the following:

- Organize text documents into a corpus
- Normalize the text to remove unneeded content
  - Tokenize text
  - Clean text
- Create term document matrix

#### Text analysis methods

There are a great many approaches which have been tried for text analytics and natural language processing (NLP). We only mention a few below. 

- The **bag of words (BOW) model** is a simple, widely used, and surprisingly effective model for analysis of text data. The BOW model uses only the frequency of the words in the document; the order of the words is not considered. Despite this seemingly ridiculous assumption, the model works well in many cases. 
  - The BOW model assumes **exchangeability** of words. 
  - The end product of applying the BOW model is a term-document matrix (tdm) or document-term matrix (dtm). The tdm or dtm is a structured representation of word frequency by document. 
  - The tdm or dtm can be used for classification if labels are available, or for clustering in unsupervised learning. 
- Another powerful model is the **word to vec** and **doc to vec** model. Word to vec uses a neural network model to determine similarity between words. These models are beyond the scope of this tutorial. You can find a good introduction to this model in the [article by Rong](http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf).
- Another widely used method is **Part of Speech (PoS) Tagging**. PoS tagging attempts to label or annotate words in a corpus (e.g. a collection of documents) as, say, nouns, verbs, pronouns, etc. PoS tagging is beyond the scope of this tutorial. PoS tags are often used as input to a parser, which creates a tree of the relationships between words in, say, a sentence. A related task is named entity recognition, which attempts to find proper nouns. 
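
As a minimal illustration of exchangeability, the base R sketch below counts the words in two toy sentences which contain the same words in a different order. The sentences and the `bag.of.words` helper are made up for this example and are not part of the tweet analysis.

```r
## Count the words in a string, ignoring word order (bag of words)
## NOTE: bag.of.words is a hypothetical helper for illustration only
bag.of.words = function(text){
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]  # split on non-letters
  tokens <- tokens[tokens != ""]                      # drop empty tokens
  table(tokens)                                       # word counts
}
bow1 = bag.of.words("the dog chased the cat")
bow2 = bag.of.words("the cat chased the dog")
bow1
identical(bow1, bow2)  # word order does not change the counts
```

Both sentences produce the same counts, even though their meanings differ. This is exactly the information the BOW model discards.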

  
####  Document Classification, Topic Models, and Information Retrieval

A common objective of text analysis is to classify and group documents. These methods have application in information retrieval and search. Understandably, a great many such methods have been developed over the years. We will only discuss a few examples in this tutorial.    

- **Classification** is a widely used supervised learning method for document analysis. For example, documents can be classified as SPAM or not SPAM, or as having positive or negative sentiment. 
- **Latent Semantic Analysis (LSA)** and **Doc-to-Vec** analysis are unsupervised learning methods used to determine which documents are closely related. These methods use similarity measures to rank documents as being related. These powerful methods are beyond the scope of this tutorial. 
- **Topic models** are a method wherein documents are allocated to one or more topics.  
  - The model allocates the probability that a document contains a topic.
  - Latent Dirichlet Allocation (LDA) is the topic model used in this tutorial.
- A variety of **distance metrics** have been developed to determine the distance between words, sentences and documents. These methods are related to the coding theory widely used in telecommunications engineering.
- **Clustering methods** are unsupervised learning models which seek to group similar documents into clusters. A variety of distance metrics can be used to define the structure of text clusters. 
  - K-means 
  - Hierarchical


***
**Note** To run this notebook you will need to install the following R packages:

- tm
- slam
- topicmodels
- SnowballC
- RTextTools
- ggplot2
***

## Text Preparation

Unstructured text must be processed into a uniform set of features suitable for further analysis. In this section we will step through some of the commonly used methods for converting unstructured text into a form we can use for analysis. There are three steps we need to transform text into a set of features we can analyze.  

- **Tokenize** the document.
- **Normalize** the text. 
- Compute the **term-document matrix** or **document-term matrix**.

***
**Note:** You can find additional information on the R `tm` package in the [vignette by Feinerer](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf).
***

### Tokenize Text

A first step in preparing a document for analysis is to **tokenize** the text. In general terms, tokenization is the process of dividing raw text into words, symbols and other elements, known as **tokens**. A collection of one or more tokenized documents is known as a **corpus**.
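
To make the idea concrete, here is a minimal base R sketch which splits a single made-up string into tokens on white space. This is only an illustration; the `tm` functions used in this notebook handle tokenization internally, and the variable names here are assumptions for the example.

```r
## Split raw text into tokens on white space
tweet.text = "I <3 statistics, it's my fAvoRitE!! 11!!!"  # made-up example text
tokens = unlist(strsplit(tweet.text, "[[:space:]]+"))
tokens          # punctuation stays attached to the tokens at this stage
length(tokens)  # number of tokens
```

Notice that the tokens still contain punctuation, mixed case and numbers, which is why normalization is needed.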

The first step in creating a corpus is reading the data set. This particular data set is comprised of 160,000 tweets. The sentiment of these tweets has been human labeled as negative or positive, coded as {0,4}. The code in the cell below reads the tweet text and sentiment. The sentiment is then recoded as {0,1} for negative and positive. Run this code to load the data. 

In [None]:
## Read the tweet data set
tweets = read.csv('Binary Classification_ Twitter sentiment analysis.csv', 
                   header = TRUE, stringsAsFactors = FALSE)
colnames(tweets) <- c("sentiment", "tweets") # Set the column names
tweets[, 'sentiment'] = ifelse(tweets$sentiment == 4, 1, 0)  # set sentiment to {0,1}
head(tweets) # Have a look at the data frame

Now that we have the data set read, we need to tokenize the tweets. The code in the cell below does the following:

- The `VectorSource` function adds attributes to each document in the corpus. In this case each tweet is considered a document in a vector of documents.  
- The text of each tweet is tokenized and organized into a document within the corpus by the `Corpus` function.  

Execute this code and examine the results. 

In [None]:
## Create a tm text corpus from the tweets
library(tm)  ## tm package for text mining
temp = VectorSource(tweets$tweets)  # pass a character vector so each tweet is one document
str(temp)
tweet.corpus <- Corpus(temp)
# What is the class of the corpus
cat('')
class(tweet.corpus)

### Text Normalization

With the corpus constructed we can perform text normalization on these documents. There are functions in tm which perform all of these steps, but for the purpose of illustration we will go step-by-step. 

Text normalization involves removing extraneous symbols and words, ensuring that text is uniformly coded, and converting words to their roots. The following list outlines some commonly used text normalization steps, including some examples. 

- **Strip extra white space:** {I <3 statistics $\ \ $, it’s my \u1072  $\ \ $    fAvoRitE!! 11!!!}
 $\longrightarrow$ {I <3 statistics, it’s my \u1072 fAvoRitE!! 11!!!}
- Remove **Unicode text**: {I <3 statistics, it’s my \u1072 fAvoRitE!! 11!!! $\longrightarrow$ I <3 statistics, it’s my fAvoRitE!! 11!!!} 
- Convert to **lower case:** {I <3 statistics, it’s my fAvoRitE!! 11!!!$\longrightarrow$ i <3 statistics, it’s my favorite!! 11!!!}
- **Remove punctuation:** {i <3 statistics, it’s my favorite!! 11!!! $\longrightarrow$ i 3 statistics its my favorite 11}
- **Remove numbers:** {i 3 statistics its my favorite 11 $\longrightarrow$ i statistics its my favorite}
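
The steps in the list above can be sketched with base R string functions. This is a simplified illustration under stated assumptions, not the `tm` implementation: `normalize.text` is a hypothetical helper, the non-ASCII character `\u0430` stands in for the Unicode text in the example, and white space is collapsed last so that gaps left by removed characters are cleaned up.

```r
## A minimal normalization pipeline using base R only
## NOTE: normalize.text is a hypothetical helper for illustration
normalize.text = function(x){
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- tolower(x)                                        # convert to lower case
  x <- gsub("[^a-z0-9[:space:]]", "", x)                 # remove punctuation and symbols
  x <- gsub("[0-9]+", "", x)                             # remove numbers
  x <- gsub("[[:space:]]+", " ", x)                      # collapse extra white space
  trimws(x)                                              # trim leading and trailing space
}
normalize.text("I <3 statistics  , it's my \u0430   fAvoRitE!! 11!!!")
```

The order of the steps matters. For example, removing punctuation before collapsing white space avoids leaving double spaces behind.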

The code in the cell below does the following:

- Uses `tm_map` to iterate over all of the documents in the corpus.
- The `content_transformer` function is used to transform each document. The argument to `content_transformer` specifies the type of transformation to be performed. 

Execute the code to perform some basic text normalization steps. 

In [None]:
## Normalize tweets text
tweet.corpus <- tm_map(tweet.corpus, content_transformer(removeNumbers))
tweet.corpus <- tm_map(tweet.corpus, content_transformer(removePunctuation))
tweet.corpus <- tm_map(tweet.corpus, content_transformer(stripWhitespace))
tweet.corpus <- tm_map(tweet.corpus, content_transformer(tolower))

### Term Document Matrix

Now that we have a corpus with some basic normalization applied, we can create a **term document matrix** (tdm). The tdm is a representation of the **Bag of Words** model. The tdm has the following properties:

- Each row contains the frequencies of a given term. Each column contains the term frequencies for a given document.
- The tdm is a sparse matrix, as most documents do not include many of the terms. Sparse matrix coding must be used for efficiency. 
- The document term matrix (dtm) is the transpose of the tdm.
- Using the distribution of a document’s TF (or TF-IDF) values, a number of analyses can be performed, including:
  - Characterizing writing styles
  - Comparing authors
  - Determining original authors
  - Finding plagiarism


Let's look at an example of a tdm. The figure below shows a corpus of text documents on the left. This corpus is transformed into the term document matrix shown on the right. Notice that the matrix is sparse as any given document may not contain a term. Additionally, some terms may appear in the document multiple times. 

![](img/tdm.png)
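
The same idea can be sketched in a few lines of base R. The three toy documents below are made up for illustration, and a dense matrix is used only because the example is tiny; a real corpus needs the sparse representation discussed above.

```r
## Build a small dense term document matrix from three toy documents
docs = c(doc1 = "oil prices fall",
         doc2 = "oil market weak oil",
         doc3 = "market prices")
tokens = strsplit(docs, " ")           # one vector of tokens per document
terms = sort(unique(unlist(tokens)))   # the vocabulary
## Count each term in each document: terms in rows, documents in columns
tdm.toy = sapply(tokens, function(tok) as.vector(table(factor(tok, levels = terms))))
rownames(tdm.toy) = terms
tdm.toy
mean(tdm.toy == 0)  # fraction of zero entries, i.e. the sparsity
t(tdm.toy)          # the document term matrix is the transpose
```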

The code in the cell below computes a tdm using the R `slam` sparse matrix package. Terms with very low frequency across all documents are removed from the matrix. These sparse terms are generally not informative. 

Run this code and examine the summary of the results. 

In [None]:
## ----- Convert the corpus to a term document matrix
to.tdm = function(corpus, sparse = 0.998){
  require(tm)
  ## Compute a term-document matrix and then remove the sparse terms
  require(slam) # Sparse matrix package
  tdm <- TermDocumentMatrix(corpus, control = list(stopwords = FALSE))
  tdm <- removeSparseTerms(tdm, sparse)
  tdm
}
tdm = to.tdm(tweet.corpus) # Create a term document matrix
str(tdm) # Look at sparse tdm
findFreqTerms(tdm, 2000) # Words that occur at least 2000 times


### Computing Term Frequency

Now that we have computed a tdm, how can we understand it? Recall that the simple **Bag of Words model** is just based on **Term Frequency (TF)**. In this case, the weighting of a document for a given term is just the frequency of that term in the document. 

In other cases we will use the **Inverse Document Frequency (IDF)** weighting. IDF weighting accounts for cases where only a few documents contain certain terms. The formula for the IDF weighting can be written as:

$$IDF = \log\left(\frac{\text{Number of Documents}}{\text{Number of Documents with Word}}\right)$$

The IDF can exhibit a problem, however. When there are a few documents with very frequent terms, the weighting is skewed toward those documents. To solve this problem, we reweight IDF by the frequency of the word to create a TFIDF matrix. The formula for computing TFIDF is: 

$$TF\text{-}IDF = \text{frequency}(\text{word}) \cdot \log\left(\frac{\text{Number of Documents}}{\text{Number of Documents with Word}}\right)$$
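
As a worked example of these formulas, the base R sketch below applies them to a small made-up term document matrix. The matrix values and variable names are assumptions for illustration, and the natural logarithm is used; other bases only rescale the weights.

```r
## Worked TF-IDF example on a toy term document matrix
tdm.toy = matrix(c(1, 0, 1, 1, 0,    # doc1
                   0, 1, 2, 0, 1,    # doc2
                   0, 1, 0, 1, 0),   # doc3
                 nrow = 5,
                 dimnames = list(c("fall", "market", "oil", "prices", "weak"),
                                 c("doc1", "doc2", "doc3")))
n.docs = ncol(tdm.toy)               # number of documents
doc.freq = rowSums(tdm.toy > 0)      # number of documents containing each term
idf = log(n.docs / doc.freq)         # inverse document frequency of each term
tfidf = tdm.toy * idf                # each row (term) is scaled by its IDF
round(tfidf, 3)
```

Terms which appear in only one document, such as `fall` and `weak`, receive the largest IDF. The term `oil` gets a high weight in `doc2` because it occurs twice there.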

The code in the cell below computes both the simple TF and the cumulative term frequency, starting from the most frequent terms and moving to the least frequent. Execute this code and examine the results. 

In [None]:
## Compute the word frequency from the tdm
to.wf = function(tdm){
  ## compute the word frequencies.
  require(slam)
  freq <- row_sums(tdm, na.rm = TRUE)   
  ## Sort the word frequency and build a dataframe
  ## including the cumulative frequency of the words.
  freq <- sort(freq, decreasing = TRUE)
  word.freq <- data.frame(word = factor(names(freq), levels = names(freq)), 
                          frequency = freq)
  word.freq$Cumulative <- cumsum(word.freq$frequency)/sum(word.freq$frequency)
  word.freq
}
wf = to.wf(tdm)
head(wf, n = 10)


You can see that certain common words are quite frequent. 

To further investigate term frequency, execute the code in the cells below to create ordered bar plots of term frequency and cumulative term frequency, and examine the results. 

In [None]:
## Make a bar chart of the word frequency
word.bar = function(wf, num = 50){
  require(ggplot2)
  ggplot(wf[1:num,], aes(word, frequency)) +
    geom_bar(stat = 'identity') +
    ggtitle('Frequency of common words') +
    ylab('Frequency') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
word.bar(wf)


In [None]:
## Make cumulative distribution plots of the most frequent words
word.cdf = function(wf, num = 50){
  require(ggplot2)
  ggplot(wf[1:num,], aes(word, Cumulative)) +
    geom_bar(stat = 'identity') +
    ggtitle('Cumulative fraction of common words') +
    ylab('Cumulative frequency') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
word.cdf(wf)


You can see that the frequency of terms drops off rather quickly. This is a common situation with many documents. 

### Removing Stop Words

From the foregoing analysis of word frequency in the tweets we can see that the most frequent words do not have any particular semantic meaning. We say these words are **stop words**. To compute a meaningful TF it is important to remove the stop words. An example of removing stop words is shown below:

i statistics its my favorite $\longrightarrow$ statistics favorite

The choice of stop words is often dependent on the application at hand. Some words may generally be stop words, but may be critical in some applications. For the tweets, the range of stop words must be extended to include the non-standard spelling often used in tweets. In practice, you may need to try several choices of stop word lists to find the best list for a given problem.
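
The example above can be reproduced with a short base R sketch. The three-word stop list and the `remove.stop.words` helper are hypothetical and chosen only to match the example; a practical list is much longer.

```r
## Remove stop words from a normalized string
stop.words.toy = c("i", "its", "my")  # hypothetical stop word list
remove.stop.words = function(text, stop.words){
  tokens <- strsplit(text, " ")[[1]]               # split into words
  keep <- tokens[!(tokens %in% stop.words)]        # drop the stop words
  paste(keep, collapse = " ")                      # reassemble the text
}
remove.stop.words("i statistics its my favorite", stop.words.toy)
```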

The code in the cell below reads a custom list of stop words, ensures they are unique, and then prints the first 100. Execute the code and examine the results.

In [None]:
## Load stop words from a file and ensure they are unique
stopWords = read.csv('stopwords.csv', header = TRUE, stringsAsFactors = FALSE)
stopWords = unique(stopWords) # Ensure the list is unique
cat(nrow(stopWords))
stopWords[1:100,] # Look at the first 100 stop words


These stop words contain many words which are unlikely to help us determine the sentiment of the tweets. 

Execute the code in the cell below to remove the stop words from the corpus of tweets.

In [None]:
## Remove the stop words from the corpus
tweet.corpus <- tm_map(tweet.corpus, removeWords, stopWords[, 'words'])

Next, execute the code in the cell below to examine the frequency of words in the tweets with the stop words removed. 

In [None]:
## View the results
tdm = to.tdm(tweet.corpus) # Create a term document matrix
findFreqTerms(tdm, 2000) # Words that occur at least 2000 times
wf = to.wf(tdm)  # Compute word frequency
head(wf, n = 10)  # Look at the most common words
word.bar(wf) # Plot word frequency
word.cdf(wf) # Plot cdf

Now, the most frequent words appear to have semantic meaning and will help us determine tweet sentiment. 

### Stemming Words

Words can appear in different forms. For example, the different tenses of a verb are the same word, but use different spelling. Despite the different spelling, the word has the same meaning and semantics. 

To ensure that the same word is treated the same in an analysis, despite spelling differences, inflected (or sometimes derived) words are **stemmed** to their roots. 
  - Stemming was pioneered by Julie Beth Lovins (1968). 
  - The Porter stemmer (1980, 2000) is a common algorithm for stemming English words.
  - Example 1: {relies, relied, rely} $\longrightarrow$ reli
  - Example 2: {statistics favorite $\longrightarrow$ statisti favori}

A related process, **substituting synonyms**, is sometimes necessary, but can be tricky. We will not attempt this process in our example. 

The code in the cell below uses the Porter Stemmer from the SnowballC package to stem the words in our corpus. Execute this code. 

In [None]:
## Use the Porter stemmer in the SnowballC package
##
require(SnowballC) ## For Porter stemming words
tweet.corpus <- tm_map(tweet.corpus, stemDocument)

Run the code in the cell below to examine the differences in the word frequency of the tweets following stemming.

In [None]:
## View the results
tdm = to.tdm(tweet.corpus, sparse = 0.99) # Create a term document matrix
findFreqTerms(tdm, 2000) # Words that occur at least 2000 times
wf = to.wf(tdm)  # Compute word frequency
head(wf, n = 10)  # Look at the most common words
word.bar(wf) # Plot word frequency
word.cdf(wf) # Plot cdf

### Word Clouds

Word clouds are a completely useless display of information that people love to see.

![](img/Wordcloud.png)

Beware of any presentation using word clouds!

## Classification and Sentiment Analysis

Now that we have a prepared TDM of the 160,000 tweets, let's build and evaluate models to classify the sentiment of these tweets. An outline of our process is as follows:

- Use TDM or TFIDF weighted TDM as features for training the model.
- Use marked cases for training and evaluation of model.
- A model trained on a sparse matrix requires regularization. Select a method from the following:
  - Feature selection is impractical, since there are over one million features.
  - SVD/PCA could be used to reduce the dimensionality of the problem.
  - In this case we will use the ridge and lasso methods offered in the elastic net model.
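
For reference, the elastic net penalty combines the ridge and lasso penalties. The expression below is a sketch in the parameterization used by the `glmnet` package. The symbols $\lambda$ (overall penalty strength), $\alpha$ (mixing parameter) and $\beta$ (model coefficients) are notation introduced here for illustration, and should not be confused with the Dirichlet parameters used in the topic model section.

$$\lambda \left[ \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right]$$

Setting $\alpha = 0$ gives the ridge penalty and $\alpha = 1$ gives the lasso penalty. Values in between mix the two.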

***
**Note:** For an in-depth discussion of supervised classification of documents using the R `RTextTools` package, see the [2013 article by Jurka et al.](https://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf).
***



The code in the cell below uses the `create_matrix` function from the RTextTools package to create a TDM. The `create_container` function is used to package the model matrix and labels for 120,000 training cases. The remaining  40,000 cases will be used to evaluate the model.

Execute the code below. This may take a while.

In [15]:
## Compute a tdm
require(RTextTools)
model.matrix = create_matrix(tweets$tweets, language="english",                               
                           removeNumbers=TRUE,
                           stemWords=TRUE, 
                           removeSparseTerms=.998, 
                           removeStopwords = TRUE, 
                           stripWhitespace = TRUE,
                           toLower = TRUE)                            

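## Create a container for the tdm and labels
## The first 120,000 cases are used for training and the remaining 40,000 for evaluation
tweet.cont = create_container(model.matrix, 
                              tweets$sentiment, 
                              trainSize = 1:120000, 
                              testSize = 120001:160000,
                              virgin=TRUE)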

Loading required package: RTextTools
"package 'RTextTools' was built under R version 3.3.3"Loading required package: SparseM
"package 'SparseM' was built under R version 3.3.2"
Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve


Attaching package: 'RTextTools'

The following objects are masked from 'package:SnowballC':

    getStemLanguages, wordStem



Execute the  code in the cell below to train a glmnet model on the 120,000 training cases in the container.

In [16]:
## Compute a logistic regression model for sentiment classification
tweet.glmnet <- train_model(tweet.cont, "GLMNET")

The code in the cell below scores the 40,000 tweets not used to train the  model and computes model performance metrics. 

In [17]:
## Test classification
tweet.class = classify_model(tweet.cont, tweet.glmnet)
tweet.metrics = create_analytics(tweet.cont, tweet.class)

Next, run the code in the cell below to display the classification of the tweets. 

In [15]:
## Examine some raw metrics
tweet.metrics@label_summary
cbind(head(tweet.metrics@document_summary, n = 10), head(tweets$sentiment, n = 10))


The code in the cell below computes the precision, recall and F-score of the model for positive and negative tweets. Execute the code and examine the results. 

In [20]:
create_precisionRecallSummary(tweet.cont, tweet.class)

| Label | GLMNET_PRECISION | GLMNET_RECALL | GLMNET_FSCORE |
|---|---|---|---|
| 0 | 0.77 | 0.57 | 0.66 |
| 1 | 0.66 | 0.83 | 0.74 |
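
The F-score values in this summary can be checked by hand. The sketch below assumes the reported F-score is the usual harmonic mean of precision and recall (the F1 score), which is consistent with the figures above; `f.score` is a hypothetical helper and not an `RTextTools` function.

```r
## F-score as the harmonic mean of precision and recall
## NOTE: f.score is a hypothetical helper for illustration
f.score = function(precision, recall){
  2 * precision * recall / (precision + recall)
}
round(f.score(0.77, 0.57), 2)  # label 0 (negative tweets)
round(f.score(0.66, 0.83), 2)  # label 1 (positive tweets)
```

The harmonic mean penalizes a large gap between precision and recall, so the low recall for negative tweets pulls the F-score for that label down.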


These figures are not particularly good. Perhaps we can do better with a model using a TFIDF weighted TDM. Execute the code in the cell below to do just this.

In [19]:
#----------------------------------------------
## Compute TFIDF weighted tdm
## Compute a tdm
tdm.tools2 = create_matrix(tweets$tweets, 
                           language="english",                               
                           removeNumbers=TRUE,
                           stemWords=TRUE, 
                           removeSparseTerms=.998, 
                           removeStopwords = TRUE, 
                           stripWhitespace = TRUE,
                           toLower = TRUE,
                           weighting = tm::weightTfIdf)

"empty document(s): why?????............... mh..what should i do?? 40 more to go where did you go? Here I am! Up in 3d BH with LB up at 6:30am is up and about so, here we go Up in 3D Here again E3 is here! I'm not in it Its over Up and at 'em. Ok, I'm out again. IT'S NOT ME..IT'S YOU Do i have to go? i would have  x They have... Here I am Go for it. all to myself So am I But why? with him 7th for $1600 DD is down. I'm off! k im out! Go UP!! F#@! off 17 again this is me: Why'd you have to go? Not again... It's a 2:2 is up!! I'm out doing some hw before doing some more hw not me? . . . . . and it's on! Up in 3D @yours12099 ???????????????????????????????????????????????????????????? what to do... what did i do What's up? is up and about! I â™¥ you And off I go ! or not... no..its not him ...and we did And there it is IM NOT... WHAT DO I DO???????????????????????????? not again ....she's out to be with you Here i am Its over ... that's all"

Execute the code in the cell below to create a container with 120,000 training cases from the TFIDF TDM. As before, this may take some  time.

In [20]:
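## Create a container for the TfIdf weighted tdm and labels
## The first 120,000 cases are used for training and the remaining 40,000 for evaluation
tweet.cont = create_container(tdm.tools2, tweets$sentiment, trainSize = 1:120000, 
                              testSize = 120001:160000, virgin=TRUE)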

Execute the code in the cell below to compute the new model. 

In [21]:
## Compute a logistic regression model for sentiment classification
tweet.glmnet.TfIdf <- train_model(tweet.cont,"GLMNET")

Execute the code in the cell below to classify the test cases and compute the model performance metrics.

In [22]:
## Test classification
tweet.class.TfIdf = classify_model(tweet.cont, tweet.glmnet.TfIdf)
tweet.metrics.TfIdf = create_analytics(tweet.cont, tweet.class.TfIdf)

Execute code in the cell below to see some of the raw labels and scores for the test cases.

In [23]:
## Examine some raw metrics
tweet.metrics.TfIdf@label_summary
results = head(tweet.metrics.TfIdf@document_summary, n = 20)
results

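| Label | NUM_CONSENSUS_CODED | NUM_PROBABILITY_CODED |
|---|---|---|
| 0 | 45523 | 45523 |
| 1 | 74477 | 74477 |


| GLMNET_LABEL | GLMNET_PROB | CONSENSUS_CODE | CONSENSUS_AGREE | PROBABILITY_CODE |
|---|---|---|---|---|
| 1 | 0.6957116 | 1 | 1 | 1 |
| 1 | 0.5108559 | 1 | 1 | 1 |
| 0 | 0.6060297 | 0 | 1 | 0 |
| 1 | 0.5108559 | 1 | 1 | 1 |
| 0 | 0.5318078 | 0 | 1 | 0 |
| 0 | 0.5141094 | 0 | 1 | 0 |
| 1 | 0.5108559 | 1 | 1 | 1 |
| 0 | 0.8484554 | 0 | 1 | 0 |
| 1 | 0.5668313 | 1 | 1 | 1 |
| 1 | 0.5108559 | 1 | 1 | 1 |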


Finally, execute the code in the cell below to display the performance metrics for both models. Examine and compare the results.

In [24]:
## Look at the performance metrics and compare to the unweighted tdm model
create_precisionRecallSummary(tweet.cont, tweet.class.TfIdf)
create_precisionRecallSummary(tweet.cont, tweet.class)

| Label | GLMNET_PRECISION | GLMNET_RECALL | GLMNET_FSCORE |
|---|---|---|---|
| 0 | 0.76 | 0.58 | 0.66 |
| 1 | 0.66 | 0.82 | 0.73 |


| Label | GLMNET_PRECISION | GLMNET_RECALL | GLMNET_FSCORE |
|---|---|---|---|
| 0 | 0.77 | 0.57 | 0.66 |
| 1 | 0.66 | 0.83 | 0.74 |


In this case, the use of different TDM weighting has not had much effect. 

## Topic Models

It is often useful to allocate documents to one or more topics. This process can be useful in, say, information retrieval and search. Models to perform this allocation to topics are known as **topic models**. 

A powerful topic model is known as **Latent Dirichlet Allocation** or **LDA**. LDA is an unsupervised Bayesian learning model.  We can summarize the LDA model as follows:

- The LDA model uses a fixed number of (sub) topics, k.
- The model computes the posterior probability of a document containing a topic.
- The model uses known word frequencies for documents in the corpus, e.g. the tdm.
- All other variables are estimated or **latent**, including the topics of each document. 

How does Latent Dirichlet Allocation work? It's a Bayesian model, so we need to define a likelihood and choose a prior. 

Our posterior distribution is categorical, since we have many topics. The Dirichlet distribution is the conjugate prior of the multinomial and categorical distributions.

All we actually know: $w_{ij}$ is the frequency of a specific word $j$ in document $i$.

What we want to know (latent): $\theta_i$ is the topic distribution of document $i$.

We also need to estimate (latent):
- $\phi_k$ is the word distribution for topic $k$
- $z_{ij}$ is the topic of the jth word in document $i$

The Bayesian model and its priors:

- Multinomial model:
$$z_{ij} \sim \text{Multinomial}(\theta_i)\\
w_{ij} \sim \text{Multinomial}(\phi_{z_{ij}})$$

- Dirichlet priors with parameters $\alpha$ and $\beta$:
$$\theta_i \sim \text{Dirichlet}(\alpha)\\
\phi_k \sim \text{Dirichlet}(\beta)$$

Since we don't know the allocation of topics in advance we generally use uniform priors across topics.

The likelihood is computed from the word frequencies in the tdm.
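
To build some intuition for this model, the base R sketch below simulates the generative process for a single toy document. Everything here is an assumption for illustration: the five-word vocabulary, the three topics, the prior parameters, and the `rdirichlet1` helper, which draws from a Dirichlet distribution by normalizing gamma draws. Fitting an LDA model runs this logic in reverse, inferring the latent variables from the observed words.

```r
## Simulate the LDA generative process for one toy document
set.seed(42)
## NOTE: rdirichlet1 is a hypothetical helper; one Dirichlet draw via normalized gammas
rdirichlet1 = function(alpha){
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
k = 3                                                      # number of topics
vocab = c("oil", "price", "opec", "market", "barrel")      # toy vocabulary
n.words = 8                                                # words in the document
## phi: one word distribution per topic (k rows, one column per word)
phi = t(replicate(k, rdirichlet1(rep(0.5, length(vocab)))))
## theta: topic distribution of the document, with a uniform prior
theta = rdirichlet1(rep(1, k))
## z: the topic of each word, drawn from theta
z = sample(1:k, n.words, replace = TRUE, prob = theta)
## w: each word, drawn from the word distribution of its topic
w = sapply(z, function(topic) sample(vocab, 1, prob = phi[topic, ]))
data.frame(topic = z, word = w)
```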

***
**Note:** An in depth introduction to fitting topic models with the R `topicmodels` package can be found in the [vignette by Grun and Hornik](https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf).
***

Let's try an example. In this example we apply LDA to a corpus of 20 business news articles from the Reuters news wire concerning the oil industry. We will apply an LDA model with 5 topics ($k = 5$). 

As a first step we will load the corpus of these documents. Execute the code in the cell below to load the corpus and examine the contents of the first document. 

In [1]:
## Load the data set as a vector corpus of 20 documents
library(tm)
data(crude)
writeLines(as.character(crude[[1]]))

"package 'tm' was built under R version 3.3.3"Loading required package: NLP
"package 'NLP' was built under R version 3.3.2"

Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter


Now that the corpus is loaded, let's create a term document matrix. The code in the cell below does just this, including text normalization. Execute this code and examine the results. 

In [2]:
## Compute the term document matrix
crude.tdm = TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                                     tolower = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     stemming = TRUE))
## Have a look at the tdm 
inspect(crude.tdm[202:210, 1:10])

<<TermDocumentMatrix (terms: 9, documents: 10)>>
Non-/sparse entries: 17/73
Sparsity           : 81%
Maximal term length: 8
Weighting          : term frequency (tf)
Sample             :
          Docs
Terms      127 144 191 194 211 236 237 242 246 248
  dlr        0   0   0   1   0   0   0   0   0   0
  dlrs       2   0   1   2   2   2   1   0   0   4
  doha       0   0   0   0   0   0   0   0   0   0
  dollar     0   0   0   0   0   2   0   1   0   0
  domest     0   0   0   0   0   0   1   0   0   0
  drawback   0   0   0   0   0   0   0   0   0   0
  drop       0   0   0   0   0   1   0   0   0   0
  due        0   0   0   0   0   0   1   1   1   0
  earli      0   0   0   0   0   0   1   0   1   0


You can see the term frequencies for the selected words and documents in the display above. As expected, most of the entries in the tdm are zeros. A few words occur more than once. 

To further examine the occurrence of the words in these documents, let's find words which occur 10 or more times. The `findFreqTerms` function does just that. Execute the code in the cell below to see the list of most common terms. 

In [3]:
## Which terms occur 10 times or more?
crudeTDMHighFreq <- findFreqTerms(crude.tdm, 10, Inf)
crudeTDMHighFreq

To further investigate the distribution of these most frequent terms, we can display the tdm for these terms and the first 5 documents in the corpus. Execute the code in the cell below and examine the results. 

In [4]:
# Do these terms show up in the first 5 documents?
inspect(crude.tdm[crudeTDMHighFreq, 1:5]) 

<<TermDocumentMatrix (terms: 35, documents: 5)>>
Non-/sparse entries: 57/118
Sparsity           : 67%
Maximal term length: 8
Weighting          : term frequency (tf)
Sample             :
         Docs
Terms     127 144 191 194 211
  crude     2   0   2   3   0
  dlrs      2   0   1   2   2
  market    2   5   0   0   0
  meet      0   6   0   0   0
  mln       0   4   0   0   2
  oil       5  12   2   1   1
  opec      0  15   0   0   0
  price     5   6   2   2   0
  product   1   6   0   0   0
  said      3  11   1   1   3


Even for these most frequent terms the tdm is sparse. 

The LDA model actually uses a DTM (as opposed to the TDM). Execute the code in the cell below to compute the DTM. 

In [5]:
## Compute the DTM
crude.dtm = DocumentTermMatrix(crude, control = list(removePunctuation = TRUE,
                                                     stopwords = TRUE))
crude.dtm  ## Check the dtm

<<DocumentTermMatrix (documents: 20, terms: 1000)>>
Non-/sparse entries: 1738/18262
Sparsity           : 91%
Maximal term length: 16
Weighting          : term frequency (tf)

Notice that there are about 1,700 non zero entries. 91% of the entries in the DTM are zero. 

Now, we are ready to apply the LDA model to the DTM. The code in the cell below does the following:

- Define parameters for the Gibbs sampler used to compute the posterior distribution.
- Set the number of topics for the model.
- Compute the posterior distribution of the topics. 

Execute this code. 

In [6]:
## Apply a topic model to the news articles
##load topic models library
library(topicmodels)

#Set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

#Number of topics
k <- 5

ldaOut = LDA(crude.dtm, k, method= "Gibbs", 
             control = list(nstart = nstart, 
                            seed = seed, 
                            best = best, 
                            burnin = burnin, 
                            iter = iter, 
                            thin=thin))

"package 'topicmodels' was built under R version 3.3.3"

Let's examine the properties of this model. Execute the code in the cell below to see the most likely topic for each article.

In [14]:
## Examine the topics
ldaOut.topics <- as.matrix(topics(ldaOut))
ldaOut.topics

| Document | Most likely topic |
|---|---|
| 127 | 2 |
| 144 | 1 |
| 191 | 2 |
| 194 | 2 |
| 211 | 4 |
| 236 | 1 |
| 237 | 4 |
| 242 | 3 |
| 246 | 5 |
| 248 | 1 |


Next, we can examine the most frequent terms for each of our 5 topics. Execute the code in the cell below to create a table of the most frequent terms in each of the 5 topics found by our model.

In [8]:
## And the terms
ldaOut.terms <- as.matrix(terms(ldaOut,6))
head(ldaOut.terms)

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| said | oil | will | oil | barrels |
| opec | dlrs | one | pct | sheikh |
| prices | last | new | industry | billion |
| oil | crude | prices | mln | petroleum |
| bpd | reuter | futures | group | gulf |
| mln | barrel | production | report | riyals |


In [9]:
#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
head(topicProbabilities)

| V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|
| 0.1926606 | 0.412844 | 0.1284404 | 0.13761468 | 0.1284404 |
| 0.3501577 | 0.1324921 | 0.3217666 | 0.08201893 | 0.1135647 |
| 0.1460674 | 0.4382022 | 0.1460674 | 0.11235955 | 0.1573034 |
| 0.1145833 | 0.46875 | 0.15625 | 0.11458333 | 0.1458333 |
| 0.1619048 | 0.2 | 0.1047619 | 0.38095238 | 0.152381 |
| 0.4920635 | 0.1650794 | 0.1333333 | 0.04761905 | 0.1619048 |


In [10]:
#Find relative importance of top topic
topic1ToTopic2 <- lapply(1:nrow(crude.dtm),function(x)
  sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])
unlist(topic1ToTopic2)

In [11]:
#Find relative importance of second most important topics
topic2ToTopic3 <- lapply(1:nrow(crude.dtm),function(x)
  sort(topicProbabilities[x,])[k-1]/sort(topicProbabilities[x,])[k-2])
unlist(topic2ToTopic3)

## Measuring Text Distance

Measuring the distance between words in a document is not as straightforward as it might seem. The choice of distance metric can have a significant effect on analytical results. This is particularly the case for unsupervised learning methods like cluster models. 

Let's look at a few of the commonly used distance metrics. 

**Hamming Distance**
- Line up the strings and count the number of positions that are different.
- Assumes strings are of the same length.

$$\text{Hamming}(101101, 100011) = 3\\
\text{Hamming}(beer, bear) = 1$$
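
A minimal base R sketch of the Hamming distance follows. The `hamming` function is a hypothetical helper written for this example; it stops with an error if the strings differ in length.

```r
## Hamming distance: count the positions at which two strings differ
## NOTE: hamming is a hypothetical helper for illustration
hamming = function(a, b){
  stopifnot(nchar(a) == nchar(b))   # strings must be of the same length
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}
hamming("101101", "100011")
hamming("beer", "bear")
```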

**Levenshtein distance** measures the edit distance between two strings (insertion, deletion, substitution only):

$$\text{Lev}(beer, bear) = 1\\
\text{Lev}(banana, ban) = 3$$
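
Base R can compute this distance directly. The sketch below uses the `adist` function from the `utils` package, which computes a generalized Levenshtein distance and returns a matrix of distances; with the default unit costs this should match the examples above.

```r
## Levenshtein (edit) distance using adist from the utils package
lev.beer = utils::adist("beer", "bear")[1, 1]     # one substitution
lev.banana = utils::adist("banana", "ban")[1, 1]  # three deletions
lev.beer
lev.banana
```

Unlike the Hamming distance, the edit distance is defined for strings of different lengths.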

**Jaccard distance** is one minus the Jaccard index, which is the size of the intersection of the characters divided by the size of the union of the characters.

$$J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}\\
J(beer, bear) = 1 - \frac{3}{4}\\
J(banana, ban) = 1 - \frac{3}{3} \leftarrow \text{This is a problem}$$
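
The following base R sketch computes this distance on the sets of characters in two strings. The `jaccard.dist` function is a hypothetical helper written for this example.

```r
## Jaccard distance on the sets of characters in two strings
## NOTE: jaccard.dist is a hypothetical helper for illustration
jaccard.dist = function(a, b){
  A <- unique(strsplit(a, "")[[1]])   # set of characters in a
  B <- unique(strsplit(b, "")[[1]])   # set of characters in b
  1 - length(intersect(A, B)) / length(union(A, B))
}
jaccard.dist("beer", "bear")    # 1 - 3/4
jaccard.dist("banana", "ban")   # 1 - 3/3, the character sets are identical
```

The second result shows the problem: two clearly different strings have a distance of zero because repeated characters are ignored.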

**Weighted Jaccard distance** accounts for repeated letters. For each letter $i$, calculate the minimum number of times it appears in the two strings, $m_i$, and the maximum number of times it appears, $M_i$.

$$J'(A, B) = 1 - \frac{\sum m_i}{\sum M_i}\\
J'(beer, bear) = 1 - \frac{m_a + m_e + m_b + m_r}{M_a + M_e + M_b + M_r}\\
J'(beer, bear) = 1 - \frac{0 + 1 + 1 + 1}{1 + 2 + 1 + 1}\\
J'(banana, ban) = 1 - \frac{1 + 1 + 1}{3 + 1 + 2}$$
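
A base R sketch of the weighted version is shown below. The `weighted.jaccard` function is a hypothetical helper written for this example; it counts each letter in both strings and compares the minimum and maximum counts.

```r
## Weighted Jaccard distance using letter counts
## NOTE: weighted.jaccard is a hypothetical helper for illustration
weighted.jaccard = function(a, b){
  chars.a <- strsplit(a, "")[[1]]
  chars.b <- strsplit(b, "")[[1]]
  letters.all <- union(chars.a, chars.b)                          # all letters seen
  count.a <- sapply(letters.all, function(ch) sum(chars.a == ch)) # counts in a
  count.b <- sapply(letters.all, function(ch) sum(chars.b == ch)) # counts in b
  1 - sum(pmin(count.a, count.b)) / sum(pmax(count.a, count.b))
}
weighted.jaccard("beer", "bear")    # 1 - 3/5
weighted.jaccard("banana", "ban")   # 1 - 3/6
```

With the weighting, the two strings in the second example are no longer at distance zero.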

#### Copyright 2017, Stephen F Elston. All rights reserved. 