### Clustering the Classic3 dataset using non-negative matrix factorization

In this example we cluster documents from the Classic3 collection, which has three categories (CISI, CRAN and MEDLINE), using non-negative matrix factorization (NMF) in Julia. The factorization itself is done with [NMF.jl](https://github.com/JuliaStats/NMF.jl).
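
For orientation, NMF approximates a non-negative matrix by the product of two non-negative factors of low rank. The notation below is the standard textbook formulation (stated up to constant scaling of the objective) and is not taken from NMF.jl; the symbols $m$, $n$ and $k$ are introduced here for illustration, with $m$ terms, $n$ documents and $k$ components:

$$
X \approx W H, \qquad X \in \mathbb{R}_{\ge 0}^{m \times n},\quad W \in \mathbb{R}_{\ge 0}^{m \times k},\quad H \in \mathbb{R}_{\ge 0}^{k \times n}
$$

$$
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2
$$

Each column of $H$ holds the $k$ coefficients of one document, so with $k = 3$ every document becomes a point in three dimensions that can be plotted directly.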

#### About the data
The Classic collection is a benchmark dataset used in text mining that can be obtained from ftp://ftp.cs.cornell.edu/pub/smart/. It consists of four different document collections: CACM, CISI, CRAN, and MED.

The composition of the collection is as follows:

* CACM: 3204 documents
* CISI: 1460 documents
* CRAN: 1398 documents
* MED: 1033 documents

When only CISI, CRAN and MED are used, the dataset is usually referred to as Classic3; with all four collections included it is referred to as Classic4.

As a further step, the whole dataset has been preprocessed, and the document-term matrix can be downloaded [here](http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/) in various forms.

The following files are given:

* docbyterm.mat: Term frequencies only (in Cluto’s MAT file format)
* docbyterm.tfidf.mat: Weighted with TFIDF scheme (in Cluto’s MAT file format)
* docbyterm.tfidf.norm.mat: Weighted with TFIDF scheme and normalized to 1 (in Cluto’s MAT file format)
* docbyterm.txt: Term frequencies only (in Coordinate file format)
* docbyterm.tfidf.txt: Weighted with TFIDF scheme (in Coordinate file format)
* docbyterm.tfidf.norm.txt: Weighted with TFIDF scheme and normalized to 1 (in Coordinate file format)
* documents.txt: List of the document names as they appear in the data matrix
* terms.txt: List of terms that appear in the data matrix
* terms_detailed.txt: A detailed list of terms (i.e., term id, term, number of documents in which the term appears)

In [11]:
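using NMF
using PlotlyJS
using DelimitedFiles  # readdlm (a standard library since Julia 0.7)
using SparseArrays    # sparse (a standard library since Julia 0.7)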

In [2]:
function nnmf_multmse_svd(X, k)
    # initialize W and H with NNDSVD
    W, H = NMF.nndsvd(X, k)
    # optimize with multiplicative updates, mean squared error objective
    alginst = NMF.MultUpdate{Float64}(obj=:mse, maxiter=100, verbose=false)
    res = NMF.solve!(alginst, X, W, H)
    res
end

function nnmf_multdiv_svd(X, k)
    # initialize W and H with NNDSVD
    W, H = NMF.nndsvd(X, k)
    # optimize with multiplicative updates, divergence objective
    alginst = NMF.MultUpdate{Float64}(obj=:div, maxiter=100, verbose=false)
    res = NMF.solve!(alginst, X, W, H)
    res
end

function nnmf_alspgrad_svdar(X, k)
    # initialize W and H with the NNDSVDar variant
    W, H = NMF.nndsvd(X, k, variant=:ar)
    # optimize with alternating least squares using projected gradient descent
    alginst = NMF.ALSPGrad{Float64}(maxiter=100, verbose=false)
    res = NMF.solve!(alginst, X, W, H)
    res
end

function nnmf_alspgrad_rand(X, k)
    # initialize W and H randomly
    W, H = NMF.randinit(X, k)
    # optimize with alternating least squares using projected gradient descent
    alginst = NMF.ALSPGrad{Float64}(maxiter=100, verbose=false)
    res = NMF.solve!(alginst, X, W, H)
    res
end

function plot_nnmf_results(H, class_labels, plot_title)
    # each column of H is one document; rows 1 to 3 are its coordinates
    cisi_docs = H[:, class_labels .== 1]
    trace_cisi = scatter3d(; x=cisi_docs[1,:], y=cisi_docs[2,:], z=cisi_docs[3,:], mode="markers",
                                name="CISI", marker=attr(color="#1f77b4", size=1, symbol="circle"))

    cran_docs = H[:, class_labels .== 2]
    trace_cran = scatter3d(; x=cran_docs[1,:], y=cran_docs[2,:], z=cran_docs[3,:], mode="markers",
                                name="CRAN", marker=attr(color="#9467bd", size=1, symbol="circle"))

    medline_docs = H[:, class_labels .== 3]
    trace_medline = scatter3d(; x=medline_docs[1,:], y=medline_docs[2,:], z=medline_docs[3,:], mode="markers",
                                name="MEDLINE", marker=attr(color="#bcbd22", size=1, symbol="circle"))

    layout = Layout(;title=plot_title)
    p = plot([trace_cisi, trace_cran, trace_medline], layout)
    p
end

plot_nnmf_results (generic function with 1 method)
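
To give a feel for what a multiplicative-update solver does, here is a minimal sketch of the classic Lee and Seung update rules for the squared-error objective, written in plain Julia on a tiny matrix. It is an illustration under stated assumptions, not NMF.jl's implementation: `toy_mult_update`, its keyword arguments and the toy matrix `X_toy` are hypothetical names introduced only for this example, and the initialization is random rather than NNDSVD.

```julia
using LinearAlgebra
using Random

# Hypothetical helper (not part of NMF.jl): Lee-Seung multiplicative updates
# for minimizing the squared error between X and W * H.
function toy_mult_update(X, k; maxiter=200, epsilon=1e-9, rng=MersenneTwister(0))
    m, n = size(X)
    W = rand(rng, m, k)   # random non-negative starting point
    H = rand(rng, k, n)
    for _ in 1:maxiter
        # Each factor is rescaled elementwise by a non-negative ratio,
        # so W and H never become negative.
        H .*= (W' * X) ./ (W' * W * H .+ epsilon)
        W .*= (X * H') ./ (W * (H * H') .+ epsilon)
    end
    return W, H
end

# Toy 3 x 4 non-negative matrix whose columns fall into two rough groups.
X_toy = [1.0 0.9 0.0 0.1;
         0.8 1.0 0.1 0.0;
         0.0 0.1 1.0 0.9]

W_toy, H_toy = toy_mult_update(X_toy, 2)
err = norm(X_toy - W_toy * H_toy)
println("reconstruction error: ", err)
```

The small `epsilon` in the denominators is only there to avoid division by zero. In the notebook itself, the NNDSVD initialization and the iteration are handled by `NMF.nndsvd` and `NMF.solve!`.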

In [3]:
# Read document-term matrix weighted with TFIDF scheme and normalized to 1 (in Coordinate file format)
docbyterm = readdlm("./classic_data/docbyterm.tfidf.norm.txt")

# create the sparse document matrix
doc_mat_sparse = sparse(Array{Int64}(docbyterm[2:end, 1]), Array{Int64}(docbyterm[2:end, 2]), docbyterm[2:end, 3])

#load the classes
class_label = Array{Int8}(readdlm("./classic_data/documents.txt")[:,2]);

In [4]:
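# NNMF - set up
k = 3
# dense term-by-document matrix (documents are columns)
X = Matrix(doc_mat_sparse')

# keep only the Classic3 documents (skip the first 3204, the CACM collection) so we can visualize in 3D
X = X[:, 3205:end]
class_label = class_label[3205:end];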

In [5]:
# NNMF - run 
res_multmse_svd = nnmf_multmse_svd(X,k)
res_multdiv_svd = nnmf_multdiv_svd(X,k)
res_alspgrad_svdar = nnmf_alspgrad_svdar(X,k)
res_alspgrad_rand = nnmf_alspgrad_rand(X,k)



NMF.Result{Float64}([1.56871 1.46235 0.278813; 7.74764 2.70909 0.691331; … ; 0.026561 0.0377876 0.0; 0.0816809 0.0418716 0.0],[0.00084571 0.00215405 … 0.000332873 0.000806681; 0.0 0.0 … 0.000318416 0.000380903; 0.0 0.000148392 … 0.000127985 5.18779e-5],31,true,3680.2763406122403)

In [6]:
#subplots didn't work - so examine each plot individually
p1 = plot_nnmf_results(res_multmse_svd.H, class_label, "Multiplicative Updating - MSE")
p1

In [7]:
p2 = plot_nnmf_results(res_multdiv_svd.H, class_label, "Multiplicative Updating - KL Div.")
p2

In [8]:
p3 = plot_nnmf_results(res_alspgrad_svdar.H, class_label, "Alternating Least Squares Using Projected Gradient Descent")
p3

In [9]:
# To find each document's strongest cluster membership, take the row index of the
# largest entry in each column of H, for example (Julia 1.x):
# H = res_multmse_svd.H
# max_col = [idx[1] for idx in vec(argmax(H, dims=1))]
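
The hint above can be taken one step further by tabulating the hard assignments against the known classes. The following is a minimal sketch on made-up data, assuming Julia 1.x: `H_toy`, `labels_toy`, `assignments` and `counts` are hypothetical names, and the numbers are invented for illustration rather than taken from the Classic3 results.

```julia
# Toy coefficient matrix: 3 components (rows) x 5 documents (columns).
H_toy = [0.9 0.1 0.0 0.2 0.05;
         0.1 0.8 0.1 0.7 0.05;
         0.0 0.1 0.9 0.1 0.90]

# Hypothetical class labels for the five toy documents.
labels_toy = [1, 2, 3, 2, 1]

# Hard assignment: row index of the largest coefficient in each column.
assignments = [idx[1] for idx in vec(argmax(H_toy, dims=1))]

# Contingency table: rows are true classes, columns are assigned components.
counts = zeros(Int, 3, 3)
for (label, cluster) in zip(labels_toy, assignments)
    counts[label, cluster] += 1
end

println(assignments)
println(counts)
```

Component indices produced by NMF are arbitrary, so component 1 need not correspond to class 1. A good clustering shows up as one dominant entry per row of the table, not necessarily on the diagonal.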