## LDA with collapsed Gibbs sampling

In [1]:
include("src/lda.jl")
ENV["LINES"] = 40
ENV["COLUMNS"] = 120

dictfile = "data/R3_all_Dictionary.txt"
documentfile = "data/R3-trn-all_run.txt" 
gtfile = "data/R3-Label.txt"
documentfile_test = "data/R3-tst-all_run.txt"
gtfile_test = "data/R3-GT.txt"

gt = readdlm(gtfile, Int64)
document_matrix, dictionary = ldac2docterm(dictfile, documentfile);
document_matrix_test, _ = ldac2docterm(dictfile, documentfile_test);
gt_test = readdlm(gtfile_test, Int64)

# Remove all stopwords by zeroing their columns in the document-term matrices
include("src/stopwords.jl")
stopword_idx = findin(dictionary, stopwords)
document_matrix[:, stopword_idx] = 0
document_matrix_test[:, stopword_idx] = 0

0

#### Run with T=3 topics

In [2]:
n_iter = 100
verbose = true
seed = 1234

alpha = 0.3
beta = 0.3
T = 3
phi, theta = lda(document_matrix, alpha, beta, T,
                    n_iter, verbose, seed);

iteration: 10, elapsed time: 9.985158920288086s
iteration: 20, elapsed time: 20.835213899612427s
iteration: 30, elapsed time: 31.613263845443726s
iteration: 40, elapsed time: 41.867663860321045s
iteration: 50, elapsed time: 52.646684885025024s
iteration: 60, elapsed time: 63.13225293159485s
iteration: 70, elapsed time: 74.37708497047424s
iteration: 80, elapsed time: 84.82683992385864s
iteration: 90, elapsed time: 94.62029981613159s
iteration: 100, elapsed time: 104.20041680335999s
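
The `lda` function lives in `src/lda.jl` and is not shown here. As a rough sketch of what one sweep of a collapsed Gibbs sampler for LDA typically does, the block below resamples the topic of every token from its conditional given all other assignments. The function name `gibbs_sweep!`, the token-list representation (`words`, `docs`, `z`) and the count tables (`nwt`, `ndt`, `nt`) are assumptions made for illustration, not the notebook's actual implementation.

```julia
# Hypothetical sketch of one collapsed Gibbs sweep; NOT the code in src/lda.jl.
# words[i], docs[i] : word index and document index of token i
# z[i]              : current topic assignment of token i
# nwt (W x T)       : number of times word w is assigned to topic t
# ndt (D x T)       : number of tokens in document d assigned to topic t
# nt  (T)           : total number of tokens assigned to topic t
function gibbs_sweep!(z, words, docs, nwt, ndt, nt, alpha, beta)
    W, T = size(nwt)
    p = zeros(T)
    for i in 1:length(z)
        w, d, t = words[i], docs[i], z[i]

        # Remove token i from the count tables
        nwt[w, t] -= 1
        ndt[d, t] -= 1
        nt[t] -= 1

        # Unnormalised conditional p(z_i = k | all other assignments, words)
        for k in 1:T
            p[k] = (nwt[w, k] + beta) / (nt[k] + W * beta) * (ndt[d, k] + alpha)
        end

        # Draw the new topic by walking the cumulative sum
        u = rand() * sum(p)
        t = 1
        acc = p[1]
        while acc < u && t < T
            t += 1
            acc += p[t]
        end

        # Add token i back under its new topic
        z[i] = t
        nwt[w, t] += 1
        ndt[d, t] += 1
        nt[t] += 1
    end
    return z
end
```

Repeating such a sweep `n_iter` times and then normalising the smoothed count tables would give estimates playing the role of `phi` and `theta`.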


### Top 30 words in the 3 topics

In [3]:
# top words in the topics
nwords = 30
topwords = Matrix{Any}(nwords, T)
for itopic in 1:T
    idx = sortperm(phi[:, itopic], rev=true)[1:nwords]
    topwords[:, itopic] = dictionary[idx]
end
topwords

30×3 Array{Any,2}:
 "oil"         "trade"          "said"      
 "reuter"      "reuter"         "reuter"    
 "said"        "said"           "bank"      
 "dlrs"        "states"         "market"    
 "crude"       "united"         "billion"   
 "prices"      "japan"          "trade"     
 "mln"         "told"           "exchange"  
 "day"         "countries"      "mln"       
 "barrels"     "agreement"      "pct"       
 "petroleum"   "japanese"       "dollar"    
 "energy"      "year"           "year"      
 "company"     "tariffs"        "currency"  
 "pct"         "reagan"         "money"     
 "price"       "foreign"        "treasury"  
 "production"  "markets"        "today"     
 "opec"        "washington"     "rate"      
 "barrel"      "president"      "rates"     
 "year"        "world"          "deficit"   
 "minister"    "talks"          "economic"  
 "bpd"         "house"          "central"   
 "today"       "international"  "foreign"   
 "corp"        "officials"      "dlr

#### Topic distribution for document `2`

The result seems correct: almost all of the weight falls on the first topic, whose top words include **oil** and **crude**.

*diamond shamrock dia cuts **crude** prices diamond shamrock corp said that effective today it had cut its contract prices for crude oil by **dlrs** a **barrel** the reduction brings its posted price for west texas intermediate to **dlrs** a **barrel** the copany said the price reduction today was made in the light of falling **oil** product prices and a weak **crude** **oil** market a company spokeswoman said diamond is the latest in a line of u s **oil** companies that have cut its contract or posted prices over the last two days citing weak **oil** markets reuter*

In [4]:
theta[2, :]

3-element Array{Float64,1}:
 0.984169  
 0.00791557
 0.00791557

### With T=10 topics

Document `2` still has most of its weight on the topic that seems related to **oil**.

In [5]:
T = 10
phi, theta = lda(document_matrix, alpha, beta, T,
                    n_iter, verbose, seed);

iteration: 10, elapsed time: 23.54747986793518s
iteration: 20, elapsed time: 48.257526874542236s
iteration: 30, elapsed time: 74.9956419467926s
iteration: 40, elapsed time: 99.66833901405334s
iteration: 50, elapsed time: 126.61610984802246s
iteration: 60, elapsed time: 154.38964891433716s
iteration: 70, elapsed time: 181.1843400001526s
iteration: 80, elapsed time: 206.21196794509888s
iteration: 90, elapsed time: 236.28232502937317s
iteration: 100, elapsed time: 268.20136404037476s


In [6]:
topwords = Matrix{Any}(nwords, T)
for itopic in 1:T
    idx = sortperm(phi[:, itopic], rev=true)[1:nwords]
    topwords[:, itopic] = dictionary[idx]
end
topwords

30×10 Array{Any,2}:
 "japanese"        "prices"      "said"        "house"           …  "foreign"        "oil"          "exchange"  
 "trade"           "oil"         "reuter"      "reuter"             "trade"          "said"         "dollar"    
 "reuter"          "reuter"      "mln"         "president"          "told"           "reuter"       "reuter"    
 "japan"           "opec"        "billion"     "trade"              "countries"      "energy"       "rate"      
 "washington"      "said"        "bank"        "reagan"             "exports"        "pct"          "says"      
 "said"            "output"      "year"        "said"            …  "economic"       "new"          "rates"     
 "agreement"       "minister"    "market"      "administration"     "economy"        "gas"          "paris"     
 "officials"       "production"  "today"       "i"                  "dlrs"           "petroleum"    "said"      
 "pact"            "market"      "money"       "committee"          "export"

In [7]:
theta[2, :]

10-element Array{Float64,1}:
 0.0075
 0.0075
 0.0075
 0.0075
 0.0075
 0.9325
 0.0075
 0.0075
 0.0075
 0.0075

#### Most likely topic

In [12]:
topwords[:, 6]

30-element Array{Any,1}:
 "reuter"   
 "crude"    
 "oil"      
 "said"     
 "dlrs"     
 "company"  
 "day"      
 "today"    
 "barrels"  
 "mln"      
 "prices"   
 "petroleum"
 "march"    
 "bpd"      
 "west"     
 "state"    
 "corp"     
 "barrel"   
 "texas"    
 "effective"
 "cts"      
 "bbl"      
 "contract" 
 "months"   
 "ecuador"  
 "venezuela"
 "energy"   
 "week"     
 "pipeline" 
 "price"    

## Using LDA for feature extraction, for classification (using kNN)

In [17]:
phi_test, theta_test = lda(document_matrix_test, alpha, beta, T,
                            n_iter, verbose, seed);

iteration: 10, elapsed time: 8.27159595489502s
iteration: 20, elapsed time: 18.085965871810913s
iteration: 30, elapsed time: 28.07868480682373s
iteration: 40, elapsed time: 37.76651096343994s
iteration: 50, elapsed time: 47.31500482559204s
iteration: 60, elapsed time: 57.26273584365845s
iteration: 70, elapsed time: 66.61199688911438s
iteration: 80, elapsed time: 76.3161928653717s
iteration: 90, elapsed time: 85.58695483207703s
iteration: 100, elapsed time: 96.14471077919006s


This gives a measly correct classification rate of roughly 20% (18% in the run below).

In [18]:
using Distributions
using Distances

function knn(Xtrain::AbstractArray, labels::AbstractVector,
            Xquery::AbstractArray, k::Int)
    # Simple brute-force kNN; rows of Xtrain and Xquery are samples
    n_queries = size(Xquery, 1)
    classification = zeros(Int, n_queries)

    for q in 1:n_queries
        # Euclidean distance from query q to every training sample
        dists = colwise(Euclidean(), Xtrain', Xquery[q, :])
        # Majority vote among the labels of the k nearest neighbours
        topclasses = labels[sortperm(dists)[1:k]]
        classification[q] = mode(topclasses)
    end
    
    classification
end

classification = knn(theta, vec(gt), theta_test, 5);

sum(classification .== gt_test) / length(gt_test)

0.18021201413427562

LDA is not an identified model: the topics may have been switched around between the two separate LDA runs, one on the training set and one on the test set (commonly known as label switching). Comparing the learned features directly therefore doesn't make sense, even though using a distance metric is in itself perfectly fine.

The classification rate varies between runs of the Gibbs sampler, suggesting that sampling from the true posterior distribution is difficult.
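
One simple way to make the two feature spaces comparable is to match each topic of the test run to its most similar topic in the training run before running kNN. The sketch below does this greedily by cosine similarity between topic-word columns. It assumes `phi` and `phi_test` are word-by-topic matrices with no all-zero column, as suggested by the `phi[:, itopic]` indexing above; `align_topics` is a hypothetical helper, not part of `src/lda.jl`, and greedy matching is only one of several possible choices.

```julia
# Hypothetical helper: greedily match columns (topics) of phi_b to columns of phi_a
# by cosine similarity. Returns perm such that phi_b[:, perm[k]] is matched to phi_a[:, k].
function align_topics(phi_a, phi_b)
    T = size(phi_a, 2)

    # Pairwise cosine similarity between topics of the two runs
    sim = zeros(T, T)
    for i in 1:T, j in 1:T
        a = phi_a[:, i]
        b = phi_b[:, j]
        sim[i, j] = sum(a .* b) / sqrt(sum(a .^ 2) * sum(b .^ 2))
    end

    perm = zeros(Int, T)
    for _ in 1:T
        # Pick the most similar pair among the topics not yet matched
        best, bi, bj = -Inf, 0, 0
        for i in 1:T, j in 1:T
            if sim[i, j] > best
                best, bi, bj = sim[i, j], i, j
            end
        end
        perm[bi] = bj
        sim[bi, :] .= -Inf   # topic bi of run A is now taken
        sim[:, bj] .= -Inf   # topic bj of run B is now taken
    end
    return perm
end
```

With `perm = align_topics(phi, phi_test)`, the reordered features `theta_test[:, perm]` would have column `k` refer to the same topic as column `k` of `theta`, provided the two runs found similar topics at all.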

Looking at misclassified documents: indices of the first three correctly classified and the first three misclassified test documents

In [19]:
find(classification .== gt_test)[1:3], find(classification .!= gt_test)[1:3]

# etc

([1,19,30],[2,3,4])

# Deriving the update rules

Use Bayes' rule and the conjugacy of the Dirichlet and multinomial distributions.
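
As a sketch of where this leads, here is the standard collapsed Gibbs result for LDA. The notation is assumed here rather than taken from `src/lda.jl`: $W$ is the vocabulary size, $T$ the number of topics, $n_{k,w}$ the number of times word $w$ is assigned to topic $k$, $n_{d,k}$ the number of tokens in document $d$ assigned to topic $k$, $n_k = \sum_w n_{k,w}$, $n_d = \sum_k n_{d,k}$, and a superscript $-i$ means the count excludes token $i$. Integrating out $\theta$ and $\phi$ and applying Bayes' rule to a single assignment $z_i$ gives

$$
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{-i}_{k,w_i} + \beta}{n^{-i}_{k} + W\beta}\,\bigl(n^{-i}_{d_i,k} + \alpha\bigr)
$$

and, given a sample of assignments, the point estimates

$$
\hat{\phi}_{w,k} = \frac{n_{k,w} + \beta}{n_{k} + W\beta},
\qquad
\hat{\theta}_{d,k} = \frac{n_{d,k} + \alpha}{n_{d} + T\alpha}.
$$

The `theta` values printed above for document `2` are consistent with this estimator for $n_d = 37$ and $\alpha = 0.3$: with $T = 3$, $0.3 / 37.9 \approx 0.0079156$ and $37.3 / 37.9 \approx 0.984169$; with $T = 10$, $0.3 / 40 = 0.0075$ and $37.3 / 40 = 0.9325$.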