First, we convert all the comments into a document corpus, a data type used in text mining.  We then pre-process the corpus by lowercasing the text and removing punctuation, numbers, stopwords and extra whitespace.

Next, we convert the documents to a document-term matrix, where each row is a commit comment and each column is a word.

In [9]:
require(jsonlite)
require(dplyr)
require(tm)

# import commits from JSON
comments <- fromJSON("http://pyro.primeprocessor.com:8888/files/comments.json")
#print(comments$comment[1:2])
# create corpus
comments <- VCorpus(VectorSource(comments$comment))
#comments <- VCorpus(VectorSource(c("aa bb c", "b c","b c d")))
# lowercase, then remove punctuation, numbers, stopwords and extra whitespace
# content_transformer() keeps each element a text document after tolower
mod <- tm_map(comments, content_transformer(tolower))
mod <- tm_map(mod, removePunctuation)
mod <- tm_map(mod, removeNumbers)
mod <- tm_map(mod, removeWords, stopwords("english"))
mod <- tm_map(mod, stripWhitespace)
print(mod)
mod <- tm_map(mod, PlainTextDocument)
dtm <- DocumentTermMatrix(mod)

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 7655
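
To make the row and column layout concrete, here is a minimal base-R sketch that builds a tiny document-term matrix by hand. The three comments are made-up stand-ins, not taken from the dataset, and `toy_dtm` is a hypothetical name; `DocumentTermMatrix` from `tm` produces the same kind of layout in sparse form.

```r
# Toy stand-ins for commit comments (hypothetical, not from the dataset)
docs <- c("fix usb timeout", "usb driver fix", "update pool timeout")

# Split each comment into words on single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: the distinct words across all comments
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word per comment (one column per comment here)
counts <- vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocab))),
  integer(length(vocab))
)

# Transpose so that rows are comments and columns are words
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(toy_dtm)
```

Each cell holds the number of times the column's word occurs in the row's comment, so most cells are zero once the vocabulary grows.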


We examine the size of the document-term matrix (dtm).  It has 7655 rows (comments) and 6202 columns (words).

In [10]:
dim(dtm)
inspect(dtm[1,1])

<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 0/1
Sparsity           : 100%
Maximal term length: 17
Weighting          : term frequency (tf)

              Terms
Docs           aadfabebeacaaeade
  character(0)                 0


The term frequency-inverse document frequency (TF-IDF) matrix weights each word in a document by how often it appears there.  Because some words are common across the whole corpus (a collection of documents) and so say little about any single document, each term frequency is multiplied by the inverse document frequency.  This tempers the weights of words that occur in many documents.

In [11]:
dtm_tfxidf <- weightTfIdf(dtm, normalize = FALSE)
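
As a rough illustration of the weighting, the sketch below computes unnormalized TF-IDF by hand on a made-up count matrix. It assumes the inverse document frequency is log2(number of documents / number of documents containing the term), which is our understanding of what `weightTfIdf` does with `normalize = FALSE`; the names `tf`, `idf` and `tfidf` are hypothetical.

```r
# Toy count matrix: 3 documents (rows) x 3 terms (columns); values are made up
tf <- matrix(c(2, 1, 0,
               1, 0, 1,
               1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("doc1", "doc2", "doc3"),
                             c("usb", "pool", "timeout")))

n_docs   <- nrow(tf)
doc_freq <- colSums(tf > 0)        # number of documents containing each term
idf      <- log2(n_docs / doc_freq)

# Multiply every column of term frequencies by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

Here "usb" occurs in every document, so its inverse document frequency is log2(3/3) = 0 and its weight vanishes everywhere, while "pool" and "timeout" each occur in a single document and keep a weight of log2(3) per occurrence.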

There are too many words to analyze, so we drop the ones that carry little weight.  We take the column sums of the weighted matrix, which give each term's total TF-IDF weight across all comments.  Then, we keep only the terms whose total weight is over 70.  This leaves us with 1296 words.

In [22]:
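# total tf-idf weight per term; slam::col_sums avoids building a dense matrix
term_wt <- slam::col_sums(dtm_tfxidf)
# number of terms whose total weight exceeds 70
sum(term_wt > 70)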

In [24]:
dtm_tfxidf <- dtm_tfxidf[,term_wt>70]

K-means clustering works in the following way.  First, we choose k, the number of clusters.  We randomly place k cluster centers in an n-dimensional space, where n equals the number of terms.  We assign each point (document) to its nearest center, then move each center to the mean of the points assigned to it.  We repeat these two steps until the assignments converge (no more changes) or the cumulative squared distance between the points in each cluster and their center no longer decreases by more than a given threshold.
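
The steps above can be sketched in a few lines of base R. This is a simplified Lloyd-style loop for illustration only: `simple_kmeans` is a hypothetical helper, the toy points are randomly generated, and R's built-in `kmeans` uses the Hartigan-Wong algorithm by default, so its internals differ.

```r
# Minimal Lloyd-style k-means sketch; simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Use k randomly chosen rows as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))

  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center (n x k)
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ], `-`)^2)
    })
    dists <- matrix(dists, nrow = nrow(x))

    # Assign each point to its nearest center
    new_assignment <- max.col(-dists, ties.method = "first")

    # Stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# Two well-separated toy groups of 20 points each
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit_toy <- simple_kmeans(pts, k = 2)
print(table(fit_toy$cluster))
```

Because the starting centers are random, different runs can converge to different clusterings on harder data, which is one reason to fix a seed or try several starts.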

In [28]:
# Determine optimal number of clusters
#tss <- c()
#for (i in 1:50) {
#  fit <- kmeans(dtm_tfxidf, i)
#  tss[i] <- sum(fit$tot.withinss)
#}

# no elbow in chart, pick k = 10
#plot(tss, main="Optimal K's", ylab="Total SS Within", xlab="# of ks")

options(warn=-1)

fit <- kmeans(dtm_tfxidf, 10)
clusters <- fit$cluster

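# convert the sparse weighted matrix to a dense data frame
clust.df <- as.matrix(dtm_tfxidf)
clust.df <- as.data.frame(clust.df)
# prepend each document's cluster assignment as the first column
all <- cbind(fit$cluster, clust.df)

names(all)[1] <- "cluster"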

options(warn=0)

Each cluster has a number of documents assigned to it.  To obtain the 5 highest-weighted terms in a cluster, we again sum the weights of each term, this time over only the docs in that cluster.  We then output the top 5 words for each cluster.

In [26]:
# Find top 5 words by weight for each cluster
freq_terms <- list()
for (i in 1:10) {
  selection <- all %>% filter(cluster==i)
  # total weight of each term within this cluster, largest first
  term_wts <- colSums(selection[,-1]) %>% sort(decreasing = TRUE)
  freq_terms[[i]] <- term_wts[1:5]
}
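
The same per-cluster ranking can be tried on a small made-up table. The sketch below uses base R only; `weights` and `top_n_terms` are hypothetical names, and the numbers are invented for illustration.

```r
# Toy weighted matrix: 4 documents x 3 terms, with made-up cluster labels
weights <- data.frame(
  cluster = c(1, 1, 2, 2),
  usb     = c(3.0, 2.0, 0.0, 0.5),
  pool    = c(0.0, 0.5, 4.0, 3.0),
  timeout = c(1.0, 0.0, 1.0, 0.0)
)

# For each cluster, sum the term weights over its documents and keep the top n
top_n_terms <- function(df, n = 2) {
  lapply(split(df[, -1], df$cluster), function(block) {
    head(sort(colSums(block), decreasing = TRUE), n)
  })
}

top_terms <- top_n_terms(weights)
print(top_terms)
```

Using `head()` rather than indexing with `1:n` avoids `NA` entries when a cluster has fewer than n terms.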

In [27]:
print(freq_terms)

[[1]]
   merge  version      api      usb  support 
2601.990 1354.445 1260.020 1176.678 1147.433 

[[2]]
     pool   message   lagging      flag     share 
1549.7309  133.7105  121.4225  118.5146  108.7957 

[[3]]
  timeout       win       usb   seconds   message 
702.57681 329.13314 110.01465  66.98480  61.28397 

[[4]]
generation       pool      based    sprintf       make 
 337.94412   23.13031   20.98589   19.80437   18.61704 

[[5]]
    driver     icarus        usb       send   firmware 
1336.86549   72.08843   66.96544   51.80738   51.22057 

[[6]]
      end  scanhash      data       now   without 
279.82453  47.41312  40.55564  37.54998  31.58612 

[[7]]
     will   cgminer    within   seconds     cause 
430.67414  55.69147  30.17854  29.77102  29.19678 

[[8]]
         file configuration     configure     reference       cgminer 
    514.33444     126.43499      39.07362      37.77102      37.12765 

[[9]]
      make       sure   possible        new    compile 
1247.34145  313.