## LDA with collapsed Gibbs sampling

In [1]:
include("src/lda.jl")
ENV["LINES"] = 40
ENV["COLUMNS"] = 120

dictfile = "data/R3_all_Dictionary.txt"
documentfile = "data/R3-trn-all_run.txt" 
gtfile = "data/R3-Label.txt"
documentfile_test = "data/R3-tst-all_run.txt"
gtfile_test = "data/R3-GT.txt"

gt = readdlm(gtfile, Int64)
document_matrix, dictionary = ldac2docterm(dictfile, documentfile);
document_matrix_test, _ = ldac2docterm(dictfile, documentfile_test);
gt_test = readdlm(gtfile_test, Int64)

# Remove all stopwords
include("src/stopwords.jl")
stopwords = findin(dictionary, stopwords)
document_matrix[:, stopwords] = 0
document_matrix_test[:, stopwords] = 0

0

#### Run with T=3 topics

In [5]:
n_iter = 100
verbose = true
seed = 1234

alpha = 0.3
beta = 0.3
T = 3
phi, theta = lda(document_matrix, alpha, beta, T,
                    n_iter, verbose, seed);

iteration: 10, elapsed time: 12.101094961166382s
iteration: 20, elapsed time: 25.0362651348114s
iteration: 30, elapsed time: 37.8934850692749s
iteration: 40, elapsed time: 50.713765144348145s
iteration: 50, elapsed time: 62.77731800079346s
iteration: 60, elapsed time: 77.1285560131073s
iteration: 70, elapsed time: 90.05349802970886s
iteration: 80, elapsed time: 103.11226201057434s
iteration: 90, elapsed time: 117.32974410057068s
iteration: 100, elapsed time: 132.2751760482788s
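
For intuition about what each of these iterations does, here is a minimal sketch of one sweep of a collapsed Gibbs sampler for LDA. It is not the code in `src/lda.jl`: the function name `gibbs_sweep!`, the token-list representation (`words`, `docs`, `z`) and the count tables `ndt`, `nwt`, `nt` are assumptions made for illustration, and the sketch is written for Julia 1.x.

```julia
using Random

# One sweep of collapsed Gibbs sampling over all tokens (illustrative sketch).
# words[i], docs[i]: word id and document id of token i; z[i]: its current topic.
# ndt (D×T), nwt (V×T), nt (length T): count tables consistent with z.
function gibbs_sweep!(z, words, docs, ndt, nwt, nt, alpha, beta, rng)
    V, T = size(nwt)
    p = zeros(T)
    for i in eachindex(z)
        w, d, t = words[i], docs[i], z[i]
        # remove token i from the counts
        ndt[d, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1
        # unnormalised full conditional p(z_i = k | all other assignments)
        for k in 1:T
            p[k] = (ndt[d, k] + alpha) * (nwt[w, k] + beta) / (nt[k] + V * beta)
        end
        # draw the new topic by inverting the cumulative sum of p
        u = rand(rng) * sum(p)
        t = 1
        acc = p[1]
        while acc < u && t < T
            t += 1
            acc += p[t]
        end
        # record the new assignment and add the token back to the counts
        z[i] = t
        ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    end
    z
end

# Toy corpus: 2 documents, 4 word types, 2 topics (made-up data)
rng = MersenneTwister(1234)
words = [1, 1, 2, 2, 3, 4, 4, 3]
docs  = [1, 1, 1, 1, 2, 2, 2, 2]
D, V, T = 2, 4, 2
alpha, beta = 0.3, 0.3

# Random initial assignment and matching count tables
z = rand(rng, 1:T, length(words))
ndt = zeros(Int, D, T); nwt = zeros(Int, V, T); nt = zeros(Int, T)
for i in eachindex(z)
    ndt[docs[i], z[i]] += 1; nwt[words[i], z[i]] += 1; nt[z[i]] += 1
end

for _ in 1:50
    gibbs_sweep!(z, words, docs, ndt, nwt, nt, alpha, beta, rng)
end

# Smoothed document-topic estimate; each row sums to one
theta = (ndt .+ alpha) ./ (sum(ndt, dims=2) .+ T * alpha)
```

Each sweep visits every token once, so the cost per iteration grows with the number of tokens times `T`, which fits the longer run times seen with more topics.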


### Top 30 words in the 3 topics

In [7]:
# top words in the topics
nwords = 30
topwords = Matrix{Any}(nwords, T)
for itopic in 1:T
    idx = sortperm(phi[:, itopic], rev=true)[1:nwords]
    topwords[:, itopic] = dictionary[idx]
end
topwords

30×3 Array{Any,2}:
 "oil"         "reuter"    "trade"        
 "said"        "said"      "reuter"       
 "reuter"      "market"    "said"         
 "dlrs"        "bank"      "united"       
 "prices"      "exchange"  "states"       
 "crude"       "mln"       "japan"        
 "mln"         "billion"   "told"         
 "year"        "dollar"    "agreement"    
 "pct"         "today"     "countries"    
 "barrels"     "pct"       "japanese"     
 "petroleum"   "treasury"  "year"         
 "day"         "currency"  "foreign"      
 "energy"      "rate"      "tariffs"      
 "company"     "money"     "international"
 "production"  "trade"     "reagan"       
 "opec"        "major"     "washington"   
 "barrel"      "foreign"   "house"        
 "price"       "rates"     "world"        
 "corp"        "official"  "cut"          
 "bpd"         "says"      "goods"        
 "dlr"         "central"   "minister"     
 "today"       "year"      "president"    
 "country"     "economic"  "industr

#### Topics distributions for document `2`

The assignment looks correct: almost all of the weight falls on the first topic, whose top words include **oil** and **crude**.

*diamond shamrock dia cuts **crude** prices diamond shamrock corp said that effective today it had cut its contract prices for crude oil by **dlrs** a **barrel** the reduction brings its posted price for west texas intermediate to **dlrs** a **barrel** the copany said the price reduction today was made in the light of falling **oil** product prices and a weak **crude** **oil** market a company spokeswoman said diamond is the latest in a line of u s **oil** companies that have cut its contract or posted prices over the last two days citing weak **oil** markets reuter*

In [10]:
theta[2, :]

3-element Array{Float64,1}:
 0.984169  
 0.00791557
 0.00791557

### With T=10 topics

Document `2` still has most of its weight (about 0.71) on the topic that appears to be related to **oil** (topic 5).

In [11]:
T = 10
phi, theta = lda(document_matrix, alpha, beta, T,
                    n_iter, verbose, seed);

iteration: 10, elapsed time: 31.26441192626953s
iteration: 20, elapsed time: 66.99765205383301s
iteration: 30, elapsed time: 99.31763696670532s
iteration: 40, elapsed time: 133.38119292259216s
iteration: 50, elapsed time: 172.10296392440796s
iteration: 60, elapsed time: 213.97391295433044s
iteration: 70, elapsed time: 250.41037797927856s
iteration: 80, elapsed time: 287.52402997016907s
iteration: 90, elapsed time: 325.1246690750122s
iteration: 100, elapsed time: 361.62379002571106s


In [25]:
topwords = Matrix{Any}(nwords, T)
for itopic in 1:T
    idx = sortperm(phi[:, itopic], rev=true)[1:nwords]
    topwords[:, itopic] = dictionary[idx]
end
topwords

30×10 Array{Any,2}:
 "exchange"    "year"        "trade"           "oil"          …  "opec"        "agreement"       "said"      
 "currency"    "said"        "house"           "pct"             "oil"         "trade"           "reuter"    
 "dollar"      "reuter"      "said"            "said"            "minister"    "general"         "market"    
 "new"         "billion"     "i"               "prices"          "day"         "reuter"          "bank"      
 "rate"        "mln"         "reuter"          "reuter"          "said"        "talks"           "today"     
 "bank"        "dlrs"        "committee"       "tax"          …  "barrels"     "said"            "mln"       
 "major"       "trade"       "administration"  "industry"        "prices"      "meeting"         "money"     
 "financial"   "exports"     "reagan"          "domestic"        "reuter"      "countries"       "treasury"  
 "banks"       "pct"         "congress"        "energy"          "market"      "european"        "st

In [26]:
theta[2, :]

10-element Array{Float64,1}:
 0.0075
 0.1825
 0.0575
 0.0075
 0.7075
 0.0075
 0.0075
 0.0075
 0.0075
 0.0075
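
The repeated value `0.0075` hints at how `theta` is computed. Assuming `lda` returns the usual smoothed estimate `theta[d, t] = (n_dt + alpha) / (N_d + T * alpha)` (an assumption, since `src/lda.jl` is not shown here), the row above can be inverted to recover the number of tokens of document `2` assigned to each topic. The vector `theta_doc2` below is copied from the output above; the snippet is a sketch written for Julia 1.x.

```julia
alpha = 0.3
theta_doc2 = [0.0075, 0.1825, 0.0575, 0.0075, 0.7075,
              0.0075, 0.0075, 0.0075, 0.0075, 0.0075]
T = length(theta_doc2)

# A topic with no tokens has theta = alpha / (N + T * alpha); solve for N
N = alpha / minimum(theta_doc2) - T * alpha

# Invert the smoothed estimate to get the per-topic token counts
counts = round.(Int, theta_doc2 .* (N + T * alpha) .- alpha)
```

Under that assumption the document has 37 tokens left after stopword removal, of which 28 are assigned to topic 5, 7 to topic 2 and 2 to topic 3. The same reading is consistent with the T=3 run, where `0.3 / (37 + 3 * 0.3)` gives `0.00791557`.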

#### Most likely topic

In [27]:
topwords[:, 5]

30-element Array{Any,1}:
 "oil"      
 "reuter"   
 "said"     
 "crude"    
 "dlrs"     
 "company"  
 "corp"     
 "petroleum"
 "barrel"   
 "price"    
 "prices"   
 "today"    
 "west"     
 "effective"
 "texas"    
 "cts"      
 "bbl"      
 "canada"   
 "barrels"  
 "day"      
 "unit"     
 "raises"   
 "refinery" 
 "pct"      
 "contract" 
 "posted"   
 "march"    
 "pay"      
 "raised"   
 "canadian" 

## Using LDA for feature extraction and kNN classification

In [35]:
# NOTE: despite the `_train` suffix, this second pass fits LDA to the *test* documents
phi_train, theta_train = lda(document_matrix_test, alpha, beta, T,
                            n_iter, verbose, seed);

iteration: 10, elapsed time: 9.880680084228516s
iteration: 20, elapsed time: 21.244205951690674s
iteration: 30, elapsed time: 31.501085996627808s
iteration: 40, elapsed time: 42.36569690704346s
iteration: 50, elapsed time: 54.194780111312866s
iteration: 60, elapsed time: 65.42524409294128s
iteration: 70, elapsed time: 78.56971597671509s
iteration: 80, elapsed time: 95.10010290145874s
iteration: 90, elapsed time: 108.33162498474121s
iteration: 100, elapsed time: 123.0820939540863s


With k = 5, kNN on the topic proportions classifies only about 55 % of the test documents correctly.

In [134]:
using StatsBase  # provides `mode`
using Distances

function kNN(Xtrain::AbstractArray, labels::AbstractVector,
            Xquery::AbstractArray, k::Int)
    # Simple brute force kNN
    n_train = size(Xtrain, 1)
    n_queries = size(Xquery, 1)
    classification = zeros(Int, n_queries)

    for q in 1:n_queries
        dists = colwise(Euclidean(), Xtrain', Xquery[q, :])
        topclasses = labels[sortperm(dists)[1:k]]
        classification[q] = mode(topclasses)
    end
    
    classification
end

classification = kNN(theta, vec(gt), theta_train, 5);

sum(classification .== gt_test) / length(gt_test)

0.5512367491166078

LDA is not an identified model: the topics may have been switched around between the two training passes (commonly known as label switching), so the distance metric is not comparing apples with apples.
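
One possible remedy, sketched below under stated assumptions, is to reorder the topics of the second fit so that they line up with those of the first before computing distances. The helper `align_topics` is hypothetical (it is not part of `src/lda.jl`) and uses a simple greedy match on the cosine similarity of the topic-word columns; an optimal assignment would need something like the Hungarian algorithm. The sketch is written for Julia 1.x.

```julia
using LinearAlgebra

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Greedily match each column (topic) of phi_ref to its most similar unused
# column of phi_other. Returns perm such that phi_other[:, perm[i]] is the
# topic matched to phi_ref[:, i].
function align_topics(phi_ref::AbstractMatrix, phi_other::AbstractMatrix)
    T = size(phi_ref, 2)
    size(phi_other, 2) == T || throw(ArgumentError("topic counts differ"))
    # sim[i, j] = similarity between reference topic i and other topic j
    sim = [cosine(phi_ref[:, i], phi_other[:, j]) for i in 1:T, j in 1:T]
    perm = zeros(Int, T)
    for _ in 1:T
        i, j = Tuple(argmax(sim))   # best remaining pair
        perm[i] = j
        sim[i, :] .= -Inf           # reference topic i is now taken
        sim[:, j] .= -Inf           # other topic j is now taken
    end
    perm
end

# Toy check (made-up numbers): phi_b is phi_a with its columns shuffled
phi_a = [0.7 0.1 0.2;
         0.2 0.8 0.1;
         0.1 0.1 0.7]
phi_b = phi_a[:, [3, 1, 2]]
perm = align_topics(phi_a, phi_b)
phi_b[:, perm] == phi_a   # columns are back in the reference order
```

With the variables of this notebook, `theta_train[:, align_topics(phi, phi_train)]` would put the test-document topic proportions in the topic order of the first fit. Two independent fits generally differ by more than a permutation, so this is only a partial fix; keeping the `phi` from the training pass fixed and inferring the test-document proportions against it would avoid the mismatch altogether.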

Looking at the first few correctly classified and misclassified documents

In [135]:
find(classification .== gt_test)[1:3], find(classification .!= gt_test)[1:3]

# etc

([1,2,3],[15,18,20])

## Deriving the update rules

Just use Bayes' rule and the conjugacy of the Dirichlet and multinomial distributions.
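
A sketch of the standard argument, assuming symmetric priors $\theta_d \sim \mathrm{Dir}(\alpha)$ and $\phi_t \sim \mathrm{Dir}(\beta)$. The notation is introduced here for illustration and need not match `src/lda.jl`: $n_{d,t}$ is the number of tokens in document $d$ assigned to topic $t$, $n_{t,w}$ the number of tokens of word $w$ assigned to topic $t$, $n_t = \sum_w n_{t,w}$, $N_d$ the length of document $d$, $V$ the vocabulary size, and a superscript $-i$ means that token $i$ is left out of the count.

By conjugacy, $\theta$ and $\phi$ integrate out in closed form, with $B(\cdot)$ the multivariate beta function:

$$
p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
= \prod_{d} \frac{B(\mathbf{n}_{d} + \alpha)}{B(\alpha)}
  \prod_{t} \frac{B(\mathbf{n}_{t} + \beta)}{B(\beta)},
\qquad
B(\mathbf{x}) = \frac{\prod_k \Gamma(x_k)}{\Gamma\left(\sum_k x_k\right)} .
$$

Bayes' rule gives the full conditional of one assignment as a ratio of two such joints, one with and one without token $i$. All factors that do not involve token $i$ cancel, and $\Gamma(x + 1) = x\,\Gamma(x)$ reduces the rest to

$$
p(z_i = t \mid \mathbf{z}^{-i}, \mathbf{w})
\propto
\left(n^{-i}_{d_i,t} + \alpha\right)
\frac{n^{-i}_{t,w_i} + \beta}{n^{-i}_{t} + V\beta} .
$$

The document-side denominator $N_{d_i} - 1 + T\alpha$ does not depend on $t$ and is absorbed into the proportionality. After sampling, the usual point estimates are

$$
\hat{\theta}_{d,t} = \frac{n_{d,t} + \alpha}{N_d + T\alpha},
\qquad
\hat{\phi}_{w,t} = \frac{n_{t,w} + \beta}{n_t + V\beta} .
$$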