# Compare SentimentAnalysis models: Comparison with Sklearn

### 1. Loading Dependencies

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

### 2. Pre-processing
From the previously generated train_Y and test_Y files, I replace positive sentiment labels "2" with "pos", negative sentiment labels "1" with "neg", in order to follow sklearn convention.

In [2]:
data_f = open("train_X.txt", "r", encoding="iso-8859-1")
labels_f = open("train_YY.txt", "r", encoding="iso-8859-1")
test_fd = open("test_X.txt", "r", encoding="iso-8859-1")
test_fl = open("test_YY.txt", "r", encoding="iso-8859-1")

def f(x):
    j = []
    for i in x:
        if len(i) > 102: #min length of both trainset & testset, sklearn requires same feature dims
            j.append(i[:102])
        else:
            j.append(i)
    return j

train_data = np.array(f(data_f))
train_labels = np.array(f(labels_f))
test_data = np.array(f(test_fd))
test_labels = np.array(f(test_fl))

### 3. Model definition & Training
As this is a binary classification task involving 2 labels: positive/negative, I use the logsitic regression model. Due to Out of Memory issues, I conduct training on a 12000-sample larger subset of the trainset, with a maximum of 200 uterations. 

In [3]:
log_model = LogisticRegression(max_iter=200)


vectorizer = CountVectorizer(
            analyzer = 'word',
            lowercase = False)

batch = 4000
iterr = 100

cnt = 0
total = len(test_labels)
correct = 0

features1 = vectorizer.fit_transform(train_data[:batch*3])
features1 = features1.toarray()
log_model = log_model.fit(X=features1, y=train_labels[:batch*3])

### 4. Testing
To avoid OOM issues, I conduct inference of batches of 4000 across in entire testset over 100 epochs.

In [4]:
for i in range(100):
    features2 = vectorizer.transform(test_data[cnt:cnt+batch])
    features2 = features2.toarray()
    cnt += batch
    y_pred = log_model.predict(features2)
    counter = 0
    for j in y_pred:
        if j == test_labels[counter]:
            correct += 1
    print("[" + str(i+1) + "] Correct: " + str(correct) + "/" + str(4000*(i+1)))
print("The final accuracy is: ", correct/total)


[1] Correct: 2030/4000
[2] Correct: 4113/8000
[3] Correct: 6163/12000
[4] Correct: 8201/16000
[5] Correct: 10259/20000
[6] Correct: 12352/24000
[7] Correct: 14408/28000
[8] Correct: 16367/32000
[9] Correct: 18342/36000
[10] Correct: 20364/40000
[11] Correct: 22315/44000
[12] Correct: 24342/48000
[13] Correct: 26331/52000
[14] Correct: 28338/56000
[15] Correct: 30261/60000
[16] Correct: 32312/64000
[17] Correct: 34360/68000
[18] Correct: 36371/72000
[19] Correct: 38448/76000
[20] Correct: 40474/80000
[21] Correct: 42503/84000
[22] Correct: 44411/88000
[23] Correct: 46458/92000
[24] Correct: 48425/96000
[25] Correct: 50372/100000
[26] Correct: 52358/104000
[27] Correct: 54434/108000
[28] Correct: 56397/112000
[29] Correct: 58342/116000
[30] Correct: 60363/120000
[31] Correct: 62467/124000
[32] Correct: 64422/128000
[33] Correct: 66352/132000
[34] Correct: 68323/136000
[35] Correct: 70294/140000
[36] Correct: 72328/144000
[37] Correct: 74362/148000
[38] Correct: 76329/152000
[39] Correct:

### 5. Conclusion

Despite experimenting with different hyperparameters and settings, namely the number of epochs (100 and 200) and the trainset size (4000 vs. 8000 vs. 12000), accuracy of sklearn's logistic regression model plateaus at 49.4%. At 100 epochs, the accuracy on a training set of 4000 samples is 0.4647175; the accuracy on a trainset of 8000 is 0.493485; the accuracy on a trainset is 0.4944. Surprisingly, this accuracy decreases to 0.49438 when the epoch num is increased from 100 to 200. To allow for better accuracy, more complex architecture needs to be used, and training needs to be conducted with my samples.

### 5. References
1. [Making Sentiment Analysis Easy With Scikit-Learn] (https://www.twilio.com/blog/2017/12/sentiment-analysis-scikit-learn.html)
2. [Sentiment Analysis — A how-to guide with movie reviews] (https://towardsdatascience.com/sentiment-analysis-a-how-to-guide-with-movie-reviews-9ae335e6bcb2)
3. [Movie Reviews Sentiment Analysis with Scikit-Learn] (https://www.pitt.edu/~naraehan/presentation/Movie%20Reviews%20sentiment%20analysis%20with%20Scikit-Learn.html)