In [30]:
# Loading the packages that will be used
list.of.packages <- c("tm", "dbscan", "proxy", "colorspace")

# (downloading and) requiring packages
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) 
  install.packages(new.packages)
for (p in list.of.packages) 
  require(p, character.only = TRUE)

In [31]:
rm(list = ls()) # Cleaning environment
options(header = FALSE, stringsAsFactors = FALSE, fileEncoding = "latin1")

### Opening notes on the problem

We are going to cluster a dataset consisting of health news tweets. These short sentences belong to one of the 16 sources of news considered in the dataset. We are then facing a multi-label classifying problem, with `num_classes = 16`.

In [32]:
truth.K <- 16

### Data Acquisition

We are about to download directly the data from the UCI Machine Learning repository. Thanks to native functions, we are able to download the zip file, extract it and fill a dataframe with all the text files read iteratively.

In [33]:
# Creating the empty dataset with the formatted columns
dataframe <- data.frame(ID=character(),
                      datetime=character(),
                      content=character(),
                      label=factor())
target.directory <- '/tmp/clustering-r'

In [15]:
#
# Download file
#
source.url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip'
temporary.file <- tempfile()
download.file(source.url, temporary.file)
unzip(temporary.file, exdir = target.directory)

In [34]:
#
# Reading the files
#
target.directory <- paste(target.directory, 'Health-Tweets', sep = '/')
files <- list.files(path = target.directory, pattern='.txt$')

In [35]:
files

In [36]:
# Filling the dataframe by reading the text content
for (f in files) {
  news.filename = paste(target.directory , f, sep ='/')
  news.label <- substr(f, 0, nchar(f) - 4) # Removing the 4 last characters => '.txt'
  news.data <- read.csv(news.filename,
                        encoding = 'UTF-8',
                        header = FALSE,
                        quote = "",
                        sep = '|',
                        col.names = c('ID', 'datetime', 'content'))
  
  # Trick to ignore last part of tweets which content contains the split character "|"
  # No satisfying solution has been found to split (as in Python) and merging extra-columns with the last one
  news.data <- news.data[news.data$content != "", ]
  news.data['label'] = news.label # We add the label of the tweet 
  
  # Only considering a little portion of data ...
  # ... because handling sparse matrix for generic usage is a pain
  news.data <- head(news.data, floor(nrow(news.data) * 0.05))
  dataframe <- rbind(dataframe, news.data)
}
# Deleting the temporary directory
unlink(target.directory, recursive =  TRUE)

In [37]:
sentences <- sub("http://([[:alnum:]|[:punct:]])+", '', dataframe$content)

In [38]:
corpus = tm::Corpus(tm::VectorSource(sentences))

# Cleaning up

# Handling UTF-8 encoding problem from the dataset
corpus.cleaned <- tm::tm_map(corpus, function(x) iconv(x, to='UTF-8', sub='byte')) 
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::removeWords, tm::stopwords('english')) # Removing stop-words
corpus.cleaned <- tm::tm_map(corpus, tm::stemDocument, language = "english") # Stemming the words 
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stripWhitespace) # Trimming excessive whitespaces

"transformation drops documents"

In [39]:
# Building the feature matrices
tdm <- tm::DocumentTermMatrix(corpus.cleaned)
tdm.tfidf <- tm::weightTfIdf(tdm)

# We remove A LOT of features. R is natively very weak with high dimensional matrix
tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf, 0.999)

# There is the memory-problem part
# - Native matrix isn't "sparse-compliant" in the memory
# - Sparse implementations aren't necessary compatible with clustering algorithms
tfidf.matrix <- as.matrix(tdm.tfidf)
# Cosine distance matrix (useful for specific clustering algorithms)
dist.matrix = proxy::dist(tfidf.matrix, method = "cosine")

In [40]:
tdm

<<DocumentTermMatrix (documents: 3159, terms: 7951)>>
Non-/sparse entries: 30486/25086723
Sparsity           : 100%
Maximal term length: 62
Weighting          : term frequency (tf)

**Partitioning clustering**. As a partitioning clustering, we will use the famous K-means algorithm. As we know the dataset, we can define properly the number of awaited clusters

In [41]:
clustering.kmeans <- kmeans(tfidf.matrix, truth.K)

**Hierarchical clustering**. R comes with an easy interface to run hierarchical clustering. All we have to define is the clustering criterion and the pointwise distance matrix. We will be using the Ward’s method as the clustering criterion.

In [42]:
clustering.hierarchical <- hclust(dist.matrix, method = "ward.D2")

**Density-based clustering**. To try the density-based clustering, we will run the HDBScan algorithm. We can run it easily from an external package, dbscan. Regarding the hyper-parameters of the algorithm, a more or less arbitrary value has been fixed.

In [43]:
clustering.dbscan <- dbscan::hdbscan(dist.matrix, minPts = 10)