Mount Google Drive so the dataset files can be loaded.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import all the libraries that will be used.

In [None]:
import pandas as pd
import numpy as np
from collections import namedtuple
from nltk.tokenize import word_tokenize 
import re
from nltk.stem import WordNetLemmatizer
from gensim.models import Doc2Vec
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

First, we load the dataset. The code below reads the CSV files with the pandas library. The train and test files are loaded into separate DataFrames, and the "verified_reviews" column of each is kept as its own Series.

In [None]:
train = pd.read_csv("drive/My Drive/Clader Follow-Up Task/train_dataset.csv")
test = pd.read_csv("drive/My Drive/Clader Follow-Up Task/test_dataset.csv")
review_train = train.verified_reviews # "verified_reviews" column of the train data as a single Series
review_test = test.verified_reviews # "verified_reviews" column of the test data as a single Series

Here is a preview of the loaded train data.

In [None]:
train

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Loved it!,1
1,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
2,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
3,5,31-Jul-18,Heather Gray Fabric,I received the echo as a gift. I needed anothe...,1
4,3,31-Jul-18,Sandstone Fabric,"Without having a cellphone, I cannot use many ...",1
...,...,...,...,...,...
2516,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
2517,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
2518,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
2519,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1


A good Natural Language Processing model, especially one for sentiment classification, needs data preprocessing first. The preprocessing steps in the function below are:
<ul> 
  <li> Removing unwanted special characters and digits (such as , . { } etc.) </li>
  <li> Lowercasing the text </li>
  <li> Tokenizing the text into words (for example, "I am a student" becomes ["I", "am", "a", "student"]) </li>
  <li> Lemmatizing each word </li>
</ul>

Lemmatizing means reducing a word to its root/base form. For example, given the set of words ["loves", "loving", "loved"], the result after lemmatizing is:
<ul> 
  <li> loves -> love </li>
  <li> loving -> love </li>
  <li> loved -> love </li>
</ul>

So, the purpose of lemmatizing is <b>to reduce the variance of the text data</b>: different inflections of the same word are mapped to a single token.
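
The first three steps can be tried in isolation with a minimal sketch. It reuses the character class from the preprocessing function in this notebook, but the helper name `clean_and_split` is hypothetical, and plain whitespace splitting stands in for `word_tokenize`, so the tokens may differ for contractions and similar cases. Lemmatizing is left out here because it needs the WordNet data.

```python
import re

# Same character class as the notebook's preprocessing function, written as a raw string.
SPECIAL_CHARS = re.compile(r'[~`!@#$%^&*():;"{}_/?><\|.,0-9]')

def clean_and_split(text):
    """Remove special characters and digits, lowercase, then split on whitespace.

    Whitespace splitting is a simplification (assumption); the notebook uses
    nltk.tokenize.word_tokenize instead.
    """
    cleaned = SPECIAL_CHARS.sub('', text).lower()
    return cleaned.split()

tokens = clean_and_split("Loved it! Bought 2 more, for the kids.")
print(tokens)
```

Note that the character class does not include the apostrophe or the hyphen, so those characters survive this cleaning step.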

In [None]:
def preprocessing(data):
  wnl = WordNetLemmatizer()
  result = []
  for text in data:
    # Remove special characters and digits (raw string avoids the invalid "\|" escape)
    words = re.sub(r'[~`!@#$%^&*():;"{}_/?><\|.,0-9]', '', text)
    # Lowercase, then split the text into tokens
    words = words.lower()
    words = word_tokenize(words)
    # Lemmatize every token as a verb ("v"), e.g. "loved" -> "love"
    words = [wnl.lemmatize(word, "v") for word in words]
    result.append(words)
  return result

In [None]:
review_train = preprocessing(review_train)
review_test = preprocessing(review_test)

Below is an example of preprocessing applied to the first record of the train data. The first line is the input text, and the second line is the preprocessed result.

In [None]:
print(train.verified_reviews[0])
print(review_train[0])

Loved it!
['love', 'it']


Because we will use the Doc2Vec( ) approach, the input data must be in document form. The code below converts each preprocessed review into a document-type object with two components: "words", which contains the preprocessed text, and "tags", which contains the label of the record.
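
As a small illustration of that structure, the sketch below builds one document by hand from the first train record shown earlier ("Loved it!", feedback 1). The name `AnalyzedDocument` is only illustrative. gensim's own `TaggedDocument` exposes the same two fields.

```python
from collections import namedtuple

# A document is a lightweight record with two fields: the tokens and the tags.
AnalyzedDocument = namedtuple('AnalyzedDocument', 'words tags')

# First train record after preprocessing, tagged with its feedback label.
doc = AnalyzedDocument(words=['love', 'it'], tags=[1])

print(doc.words)    # tokens of the review
print(doc.tags[0])  # label stored as the first (and only) tag
```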

In [None]:
dTrain = []
analyzedDocument1 = namedtuple('AnalyzedDocument1', 'words tags')
for i in range(len(train)):
    words = review_train[i]
    tags = [train.feedback[i]]
    dTrain.append(analyzedDocument1(words, tags))
dTrain = pd.Series(dTrain)

dTest = []
analyzedDocument2 = namedtuple('AnalyzedDocument2', 'words tags')
for i in range(len(test)):
    words = review_test[i]
    tags = [test.feedback[i]]
    dTest.append(analyzedDocument2(words, tags))
dTest = pd.Series(dTest)

After preprocessing, the next step is to build a vocabulary and train a model with the Doc2Vec( ) approach. Doc2Vec( ) is trained on the train data only. The trained model is then used to infer a vector for every record in both the train and test data.

In [None]:
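max_epochs = 20 # number of training iterations
vec_size = 5 # vector dimension
alpha = 0.025 # initial learning rate
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.0025,
                min_count=5, # ignore words that appear fewer than 5 times
                dm=1) # dm=1 selects the distributed memory (PV-DM) algorithm

model.build_vocab(dTrain) # scan the train documents to build the vocabulary (no training yet)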

In [None]:
for epoch in range(max_epochs):
    print('Iteration : {0}'.format(epoch+1))
    model.train(dTrain,
                total_examples=model.corpus_count,
                epochs=model.epochs)

    model.alpha -= 0.0002 # reduce the learning rate by 0.0002 after each iteration
    model.min_alpha = model.alpha # set min_alpha = alpha so the next train() call uses a constant rate

Iteration : 1
Iteration : 2
Iteration : 3
Iteration : 4
Iteration : 5
Iteration : 6
Iteration : 7
Iteration : 8
Iteration : 9
Iteration : 10
Iteration : 11
Iteration : 12
Iteration : 13
Iteration : 14
Iteration : 15
Iteration : 16
Iteration : 17
Iteration : 18
Iteration : 19
Iteration : 20
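
The manual decay in the loop above can be traced with plain arithmetic. The sketch below lists the value of `model.alpha` at the start of each iteration. The constant names are illustrative, and the values are copied from the training cells.

```python
START_ALPHA = 0.025   # initial learning rate used in the notebook
DECAY = 0.0002        # amount subtracted after every iteration
MAX_EPOCHS = 20       # number of outer-loop iterations

# Value of alpha at the start of iteration k (1-based): START_ALPHA - DECAY * (k - 1)
alphas = [START_ALPHA - DECAY * k for k in range(MAX_EPOCHS)]

print(f"iteration 1:  {alphas[0]:.4f}")
print(f"iteration 20: {alphas[-1]:.4f}")
print(f"after loop:   {START_ALPHA - DECAY * MAX_EPOCHS:.4f}")
```

With these settings the rate only falls from 0.0250 to about 0.0212 across the 20 iterations, so it never gets near the `min_alpha=0.0025` passed to the constructor.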


The function below (vec_for_learning) infers a vector for each record and returns the vectors together with their labels.

In [None]:
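def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    # For each document, pair its label (first tag) with its inferred vector.
    # Note: `steps` is the gensim 3.x name; gensim 4.0+ calls this parameter `epochs`.
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=5)) for doc in sents])
    return regressors, targets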

In [None]:
X_train,y_train = vec_for_learning(model, dTrain)
X_test,y_test = vec_for_learning(model, dTest)

For comparison, I use Random Forest and K-Nearest Neighbors (KNN) as the classifiers. Random Forest has the best training accuracy (98.73%), and KNN has the best testing accuracy (94.28%).

In [None]:
RF_clf = RandomForestClassifier(n_estimators=10, random_state = 42)
RF_clf.fit(X_train, y_train)
pred1 = RF_clf.predict(X_train)
print('Training accuracy (Random Forest) : %.2f%%' % (accuracy_score(pred1, y_train)*100))
pred2 = RF_clf.predict(X_test)
print('Testing accuracy (Random Forest) : %.2f%%' % (accuracy_score(pred2, y_test)*100))
print("Confusion Matrix of Random Forest:")
print(confusion_matrix(y_test,pred2))
print()
KNN_clf = KNeighborsClassifier(n_neighbors=9)
KNN_clf.fit(X_train, y_train)
pred3 = KNN_clf.predict(X_train)
print('Training accuracy (K-NN) : %.2f%%' % (accuracy_score(pred3, y_train)*100))
pred4 = KNN_clf.predict(X_test)
print('Testing accuracy (K-NN) : %.2f%%' % (accuracy_score(pred4, y_test)*100))
print("Confusion Matrix of K-NN:")
print(confusion_matrix(y_test,pred4))

Training accuracy (Random Forest) : 98.73%
Testing accuracy (Random Forest) : 93.00%
Confusion Matrix of Random Forest:
[[ 14  29]
 [ 15 571]]

Training accuracy (K-NN) : 93.97%
Testing accuracy (K-NN) : 94.28%
Confusion Matrix of K-NN:
[[  9  34]
 [  2 584]]
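
The test accuracies can be cross-checked from the confusion matrices. The same matrices also give the per-class recall, which is worth a look because only 43 of the 629 test reviews carry label 0. The sketch below is a hand calculation under that reading; the helper name `accuracy_and_recalls` is hypothetical, and the numbers are copied from the output above.

```python
def accuracy_and_recalls(cm):
    """Return overall accuracy and per-class recall for a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels, matching the
    layout of sklearn.metrics.confusion_matrix.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall_0 = tn / (tn + fp)   # share of label-0 reviews predicted as 0
    recall_1 = tp / (tp + fn)   # share of label-1 reviews predicted as 1
    return accuracy, recall_0, recall_1

rf_cm = [[14, 29], [15, 571]]   # Random Forest, copied from the output above
knn_cm = [[9, 34], [2, 584]]    # K-NN, copied from the output above

for name, cm in [("Random Forest", rf_cm), ("K-NN", knn_cm)]:
    acc, r0, r1 = accuracy_and_recalls(cm)
    print(f"{name}: accuracy={acc:.2%}, recall(0)={r0:.2%}, recall(1)={r1:.2%}")
```

Both classifiers label most negative reviews as positive (29 of 43 for Random Forest, 34 of 43 for K-NN), which the overall accuracy alone does not reveal.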


To make the predictions more accurate, we take the "rating" column of the dataset into account. For every predicted record, each label (0 and 1) has a prediction probability (from .predict_proba() ). We multiply each probability by a rating-dependent <b>threshold</b> (a weight) so that the rating influences the final prediction. In the two lists below, the item number is the rating (1 to 5) and the value is the threshold applied. <br>
Threshold 1 (for label 1):

1. 0.01
2. 0.1
3. 0.6
4. 0.9
5. 1.0

Threshold 2 (for label 0):

1. 1.0
2. 0.9
3. 0.6
4. 0.1
5. 0.01

For example, on a record we have: <br>
probability of 0 : 0.612 <br>
probability of 1 : 0.388 <br>
Prediction Result : 0 <br>
Actual Label : 1 <br>
Rating : 4 <br> <br>
The new prediction will be: <br>
P(0) = 0.612 x 0.1 = <b> 0.0612 </b> <br>
P(1) = 0.388 x 0.9 = <b> 0.3492 </b>

Because P(1) > P(0), the new prediction is 1, which matches the actual label.
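
The worked example can be reproduced with a minimal, self-contained sketch. The function name `reweight` and the constant names are hypothetical; the weights are the two threshold lists given above.

```python
WEIGHTS_LABEL_1 = [0.01, 0.1, 0.6, 0.9, 1.0]   # Threshold 1, indexed by rating - 1
WEIGHTS_LABEL_0 = WEIGHTS_LABEL_1[::-1]        # Threshold 2 is the same list reversed

def reweight(p0, p1, rating):
    """Scale both class probabilities by the rating-dependent weights and pick the larger."""
    new_p0 = p0 * WEIGHTS_LABEL_0[rating - 1]
    new_p1 = p1 * WEIGHTS_LABEL_1[rating - 1]
    label = 1 if new_p1 > new_p0 else 0
    return new_p0, new_p1, label

# Record from the example: P(0) = 0.612, P(1) = 0.388, rating 4.
new_p0, new_p1, label = reweight(p0=0.612, p1=0.388, rating=4)
print(f"P(0) = {new_p0:.4f}, P(1) = {new_p1:.4f}, prediction = {label}")
```

Taking the rating as an argument, rather than reading it from a global DataFrame, keeps the rule easy to test on single records.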

In [None]:
proba1 = RF_clf.predict_proba(X_test)
proba2 = KNN_clf.predict_proba(X_test)

In [None]:
def predict_using_threshold(proba):
  thresh1 = [0.01,0.1,0.6,0.9,1.0]
  thresh2 = np.flip(thresh1)
  predictions = []
  for i in range(0,len(proba)):
    rate = test.rating[i]
    new_prob1 = thresh1[rate-1] * proba[i][1]
    new_prob2 = thresh2[rate-1] * proba[i][0]
    if new_prob1>new_prob2:
      res=1
    else:
      res=0
    predictions.append(res)
  return predictions

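Here is the final accuracy of the sentiment classification. For Random Forest, the test accuracy rises from 93.00% to 97.30%, and the confusion matrix improves as well: correctly classified label-0 reviews go from 14 to 32.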

In [None]:
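pred5 = predict_using_threshold(proba1) # Random Forest probabilities
pred6 = predict_using_threshold(proba2) # K-NN probabilities; the recorded K-NN output below came from proba1 and should be regenerated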

print('Final Test accuracy (Random Forest): %.2f%%' % (accuracy_score(pred5, y_test)*100))
print("Final Confusion Matrix of Random Forest:")
print(confusion_matrix(y_test,pred5))
print()
print('Final Test accuracy (K-NN) : %.2f%%' % (accuracy_score(pred6, y_test)*100))
print("Final Confusion Matrix of K-NN:")
print(confusion_matrix(y_test,pred6))

Final Test accuracy (Random Forest): 97.30%
Final Confusion Matrix of Random Forest:
[[ 32  11]
 [  6 580]]

Final Test accuracy (K-NN) : 97.30%
Final Confusion Matrix of K-NN:
[[ 32  11]
 [  6 580]]
