# Compare SentimentAnalysis models

### 1. Dependencies & Pre-processing
To derive the files `test_X.txt` and `test_Y.txt`, I split each line of the original Amazon Reviews dataset into its label and its review text. I then strip the "\__label\__" prefix so that only the numerical sentiment label remains (1 = negative, 2 = positive), and write the reviews and the labels to two separate files.
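
The split described above can be sketched roughly as follows. This is an illustration, not the script that produced the files: it assumes each source line has the form `__label__<n> <review text>`, and the names `split_labeled_line` and `write_split` are hypothetical.

```julia
# Hypothetical helper: split one line such as "__label__2 Great product"
# into its numeric label and its review text.
function split_labeled_line(line::AbstractString)
    prefix = "__label__"
    parts = split(line, ' '; limit=2)   # label token, then the rest of the line
    if length(parts) != 2 || !startswith(parts[1], prefix)
        throw(ArgumentError("unexpected line format"))
    end
    # Remove the prefix and keep only the numerical sentiment label.
    label = parse(Int8, replace(parts[1], prefix => ""))
    return label, String(parts[2])
end

# Hypothetical driver: write reviews and labels to two separate files.
function write_split(src_path, x_path, y_path)
    open(x_path, "w") do fx
        open(y_path, "w") do fy
            for line in eachline(src_path)
                label, review = split_labeled_line(line)
                println(fx, review)
                println(fy, label)
            end
        end
    end
end
```

Writing one review and one label per line keeps the two files aligned, so line `k` of `test_Y.txt` is the label for line `k` of `test_X.txt`.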

In [13]:
using TextAnalysis

testset = Array{String, 1}(undef, 400000)
file1 = open("test_X.txt")

global cnt = 1
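for i in readlines(file1)
    if length(i) > 500
        # Julia strings are indexed by byte. A plain i[1:500] throws a
        # StringIndexError when byte 500 falls inside a multi-byte UTF-8
        # character, so back up to the start of that character instead.
        testset[cnt] = i[1:thisind(i, 500)]
    else
        testset[cnt] = i
    end
    global cnt += 1
end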
close(file1)

# The labels are read in a separate loop: iterating over both files
# simultaneously gave a bug for testset.

testlabels = Array{Int8, 1}(undef, 400000)
file2 = open("test_Y.txt")
global cnt = 1
for j in readlines(file2)
    testlabels[cnt] = parse(Int8, j)
    global cnt += 1
end
close(file2)

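
The truncation step slices strings by byte index, which can land in the middle of a multi-byte UTF-8 character. The sketch below shows two ways to truncate safely in Julia; `truncate_bytes` is a hypothetical helper name, while `thisind` and `first` are standard Base functions.

```julia
# Keep everything up to and including the character that occupies byte `nbytes`.
function truncate_bytes(s::AbstractString, nbytes::Integer)
    ncodeunits(s) <= nbytes && return s
    # thisind returns the index at which the character containing byte
    # `nbytes` starts, which is always a valid string index.
    return s[1:thisind(s, nbytes)]
end

s = "héllo"                      # 'é' occupies bytes 2 and 3
by_bytes = truncate_bytes(s, 3)  # "hé"; s[1:3] would throw a StringIndexError
by_chars = first(s, 3)           # "hél"; first counts characters, not bytes
```

`first(s, n)` is the simpler choice when the limit is meant in characters; the byte-based variant matches a limit expressed as a byte index.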


### 2. Load pretrained TextAnalysis model & test

I assess the model based on accuracy, precision, recall, and F1 score.
After pre-processing (truncating each review at 500 characters, stripping stopwords, stripping corrupt UTF-8 characters, and stemming), here are my results:
- Accuracy = $\frac{100 \times \text{total correct}}{\text{total}} = 49.64325$
<br>
<br>
- Precision = $\frac{\text{true positive}}{\text{true positive} + \text{false positive}} = 0.497469903015904$
<br>
<br>
- Recall = $\frac{\text{true positive}}{\text{true positive} + \text{false negative}} = 0.701445$
<br>
<br>
- F1 Score = $\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 0.5821059947510917$
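
As a compact restatement of the four formulas above, here is a minimal sketch that computes them from confusion-matrix counts. The function name `classification_metrics` and the toy counts are illustrative assumptions, not values from this experiment.

```julia
# Compute accuracy (in percent), precision, recall and F1 from the counts of
# true positives, false positives, false negatives and true negatives.
function classification_metrics(tp, fp, fn, tn)
    accuracy = 100 * (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return (accuracy = accuracy, precision = prec, recall = rec, f1 = f1)
end

# Toy counts chosen only to make the arithmetic easy to follow.
m = classification_metrics(7, 3, 3, 7)
println(m)
```

Note that accuracy needs the true negatives as well, whereas precision, recall and F1 depend only on `tp`, `fp` and `fn`.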

In [20]:
model = SentimentAnalyzer()
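global total_correct, total_valid = 0, 0
global cnt = 1               # progress counter, printed every 1000 reviews
global boundserror = Int[]   # indices of reviews on which the model threw an error
global tp, fp, fn = 0, 0, 0  # the positive class is positive sentiment (label 2)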

for i in 1:400000
    input = Document(testset[i])
    prepare!(input, strip_stopwords)
    prepare!(input, strip_corrupt_utf8)
    prepare!(input, stem_words)
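    try
        # Rounding the model's score and adding 1 maps it onto the dataset
        # labels: 1 (negative) and 2 (positive).
        pred = Int8.(round(model(input))) + 1
        if pred == 2
            (testlabels[i] == 2) ? (global total_correct += 1; global tp += 1) : (global fp += 1)
        else
            (testlabels[i] == 1) ? (global total_correct += 1) : (global fn += 1)
        end
        global total_valid += 1
    catch e
        # Record the index of any review the model fails on and skip it.
        push!(boundserror, i)
    end
    if i % 1000 == 0
        # Running accuracy (in percent) over the reviews scored so far.
        println("[", cnt, "]: ", total_correct/total_valid*100)
        global cnt += 1
    end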
end

In [17]:
println("The accuracy of TextAnalysis' pretrained model is ", total_correct/total_valid*100)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
println("Precision: ", precision)
println("Recall: ", recall)
println("F1 Score: ", 2*precision*recall/(precision+recall))
println(length(boundserror))  # number of reviews skipped because the model errored

The accuracy of TextAnalysis' pretrained model is 49.64325
Precision: 0.497469903015904
Recall: 0.701445
F1 Score: 0.5821059947510917
0
