# Sentiment Analysis - Feature Selection

In [4]:
!pip install -U nltk

Requirement already up-to-date: nltk in /home/gemilang/.local/lib/python3.7/site-packages (3.3)


In [5]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1337)

In [8]:
# Load the cleaned training tweets and drop rows whose text is missing
my_df_training = pd.read_csv('clean_tweet_training.csv', index_col=0)
tweet_text_training = my_df_training['text']
target_training = my_df_training['target']
X_training = tweet_text_training[pd.notnull(tweet_text_training)]
Y_training = target_training[pd.notnull(tweet_text_training)]




Because the training set is very large, we use only a small fraction of it (1%) so that training runs faster.

In [9]:
from sklearn.model_selection import train_test_split
# Keep 1% of the rows (part1) for training; the remaining 99% (part2) is not used
X_train_part1, X_train_part2, Y_train_part1, Y_train_part2 = train_test_split(X_training, Y_training, train_size=0.01, random_state=42)
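
The split above draws a plain random 1% of the rows. If the class balance of such a small sample matters, one option is to sample each class separately. The following is a minimal sketch in plain Python of that variation, not what this notebook does; `stratified_sample` and the toy labels are hypothetical. scikit-learn's `train_test_split` also accepts a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=42):
    """Return sorted row indices holding roughly `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Keep at least one row per class so no class disappears.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels (hypothetical data): 300 rows of one class, 100 of another.
labels = ["neg"] * 300 + ["pos"] * 100
sample_idx = stratified_sample(labels, fraction=0.01)
sample_labels = [labels[i] for i in sample_idx]
print(len(sample_idx), sample_labels.count("neg"), sample_labels.count("pos"))
```

The sample keeps the 3:1 class ratio of the toy data, which a purely random 1% draw only approximates.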



In [10]:
my_df_test = pd.read_csv('clean_tweet_test_fixed.csv',index_col=0)
my_df_test = my_df_test[my_df_test.target != 2]
tweet_text_test = my_df_test['text']
target_test = my_df_test['target']
X_test = tweet_text_test[pd.notnull(tweet_text_test)]
Y_test = target_test[pd.notnull(tweet_text_test)]

## Feature Extraction
As in the previous notebook, we use bag-of-words as the features. In the run shown below, 17631 terms (unique words) are indexed in the vocabulary.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
X_test_vector = vectorizer.transform(X_test)


Number of features:  17631
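
To make the bag-of-words representation concrete, here is a minimal plain-Python sketch of the idea: learn a vocabulary from the training texts, then turn each text into a vector of term counts. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are hypothetical; the token pattern is chosen to resemble `CountVectorizer`'s default (lowercased words of two or more characters), and the real class returns a sparse matrix rather than lists.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Map every term seen in the training docs to a column index."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Turn each doc into a list of counts; unseen terms are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            if term in vocabulary:
                row[vocabulary[term]] = n
        rows.append(row)
    return rows

train_docs = ["good day good mood", "sad day"]
vocab = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab)
test_rows = transform(["good night"], vocab)
print(vocab)
print(train_rows)
print(test_rows)
```

Note that "night" is dropped from the test vector because it never occurred in the training texts, which is also why the notebook fits the vectorizer on the training sample only and merely transforms the test set.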


To see how well the system performs with bag-of-words features, we apply one classifier, an SVM (Support Vector Machine) with a linear kernel. <br>
Using all 17631 features, the SVM classifies the test tweets with an accuracy of **74.09%** in the run shown below.

In [12]:
from sklearn import svm
from sklearn.metrics import accuracy_score
clf = svm.SVC(kernel='linear')
print("Training Classifier...")
clf.fit(X_training_vector, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_vector)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)

Training Classifier...
Predicting...
Accuracy: 0.7409470752089137


## (1) Frequency-based Feature Selection
Next, we apply the simplest feature selection method: frequency-based feature selection. <br>
We keep only the 700 features that occur most often in the training sample. <br>
The 10 most frequent words are *good, day, get, like, go, today, going, love, work, back*. <br>
With 700 features, the SVM reaches an accuracy of **72.98%** in the run shown below, slightly under the 74.09% obtained with the full vocabulary.


In [13]:
vectorizer = CountVectorizer(max_features=700)
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
print("------------------")

# Total count of each term across the training sample
sum_words = X_training_vector.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
most_freq = words_freq[:10]
for word, freq in most_freq:
    print(word, freq)

X_test_vector = vectorizer.transform(X_test)
print("------------------")
print("Training Classifier...")
clf.fit(X_training_vector, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_vector)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)


Number of features:  700
------------------
good 851
day 851
get 817
like 754
go 748
today 663
going 650
love 628
work 614
back 613
------------------
Training Classifier...
Predicting...
Accuracy: 0.7298050139275766
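
As a rough illustration of what frequency-based selection does, the sketch below counts every term across a corpus and keeps the `k` most frequent ones. It is plain Python with hypothetical names (`top_k_terms`) and toy data, not the `CountVectorizer(max_features=...)` implementation itself.

```python
from collections import Counter

def top_k_terms(tokenized_docs, k):
    """Return the k terms with the highest total count, with their counts."""
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    return totals.most_common(k)

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["good", "day"],
    ["good", "work", "today"],
    ["good", "day", "back"],
]
selected = top_k_terms(docs, k=2)
print(selected)
```

The ranking never looks at the class labels, so a word that is common in both positive and negative tweets can be kept while a rarer but more discriminative word is dropped.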


## (2) Select K-Best Feature Selection with Chi2
Next, we use the chi2 statistic as the score for choosing the K best features. <br>
As before, only 700 features are kept. According to chi2, the 10 most significant terms are *sad, thanks, miss, love, sick, good, bad, wish, sorry, thank*. <br>
In the run shown below, the accuracy is **74.37%**, slightly above the 74.09% obtained with the full vocabulary.

In [14]:
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = CountVectorizer()
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
print ("---------------------")
X_test_vector = vectorizer.transform(X_test)

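# Print the 10 features with the highest chi2 scores
feature_scores = chi2(X_training_vector, Y_train_part1)[0]
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
for score, fname in sorted(zip(feature_scores, vectorizer.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)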

# Keep the 700 features with the highest chi2 scores and check whether accuracy improves
ch2 = SelectKBest(chi2, k=700)
X_train_best = ch2.fit_transform(X_training_vector, Y_train_part1)
X_test_best = ch2.transform(X_test_vector)
print ("---------------------")
print("Training Classifier...")
clf.fit(X_train_best, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_best)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)

Number of features:  17631
---------------------
sad 238.54023574959746
thanks 209.71485062050235
miss 168.3154279421448
love 159.80138255363354
sick 133.3229217791157
good 131.14237035467895
bad 112.47065955957137
wish 108.39876788533053
sorry 105.39613173269423
thank 102.13420300182054
---------------------
Training Classifier...
Predicting...
Accuracy: 0.7437325905292479
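
To show where such scores come from, here is a minimal plain-Python sketch of a chi2 score per term: for each class, compare the observed total count of the term in that class with the count expected if the term were spread across classes in proportion to class size, then sum `(observed - expected)^2 / expected`. This follows the general recipe behind `sklearn.feature_selection.chi2` as I understand it; the function name `chi2_scores` and the toy data are hypothetical.

```python
def chi2_scores(rows, labels, terms):
    """rows: per-document term counts; labels: one class per document."""
    n_docs = len(rows)
    classes = sorted(set(labels))
    class_prob = {c: labels.count(c) / n_docs for c in classes}
    scores = {}
    for j, term in enumerate(terms):
        feature_count = sum(row[j] for row in rows)
        if feature_count == 0:
            # A term that never occurs has no defined score.
            scores[term] = float("nan")
            continue
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, y in zip(rows, labels) if y == c)
            expected = class_prob[c] * feature_count
            score += (observed - expected) ** 2 / expected
        scores[term] = score
    return scores

# Toy data (hypothetical): "sad" occurs only in negative tweets,
# "day" occurs equally often in both classes.
terms = ["sad", "day"]
rows = [[1, 1], [1, 0], [0, 1], [0, 0]]
labels = ["neg", "neg", "pos", "pos"]
scores = chi2_scores(rows, labels, terms)
best = sorted(scores, key=scores.get, reverse=True)[:1]
print(scores, best)
```

Selecting the K best features then amounts to sorting terms by score and keeping the first K, as the last lines do with K = 1.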


## (3) Select K-Best Feature Selection with Mutual Information
Next, we use Mutual Information (MI) as the score for choosing the K best features. <br>
As before, only 700 features are kept. According to MI, the 10 most significant terms are *sad, thanks, miss, love, sick, good, bad, wish, thank, sorry*. <br>
In the run shown below, the accuracy is **74.65%**, slightly above the 74.09% obtained with the full vocabulary.

In [15]:
from sklearn.feature_selection import mutual_info_classif
vectorizer = CountVectorizer()
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
print ("---------------------")
X_test_vector = vectorizer.transform(X_test)

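# Print the 10 features with the highest MI scores
feature_scores = mutual_info_classif(X_training_vector, Y_train_part1)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
for score, fname in sorted(zip(feature_scores, vectorizer.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)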

# Keep the 700 features with the highest MI scores and check whether accuracy improves
mic = SelectKBest(mutual_info_classif, k=700)
X_train_best = mic.fit_transform(X_training_vector, Y_train_part1)
X_test_best = mic.transform(X_test_vector)
print ("---------------------")
print("Training Classifier...")
clf.fit(X_train_best, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_best)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)

Number of features:  17631
---------------------
sad 0.009294996835169803
thanks 0.007328825481966443
miss 0.005785720183408666
love 0.005091748163164076
sick 0.004709206870385366
good 0.00408905897754629
bad 0.0038255447193213162
wish 0.003747292695550559
thank 0.003662163464954462
sorry 0.003574547204913292
---------------------
Training Classifier...
Predicting...
Accuracy: 0.7465181058495822
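
For intuition about the MI scores, the sketch below computes the mutual information between one discrete feature and the class labels directly from counts, using the natural logarithm (so the result is in nats). It is a plain-Python illustration with a hypothetical function name and toy data; `mutual_info_classif` may use a different estimator depending on whether it treats the features as discrete or continuous.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """MI between two discrete sequences of equal length, in nats."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (n_xy / n) * math.log(n_xy * n / (count_x[x] * count_y[y]))
    return mi

# Toy data (hypothetical): "sad" determines the class, "day" carries no information.
labels = ["neg", "neg", "pos", "pos"]
sad = [1, 1, 0, 0]
day = [1, 0, 1, 0]
mi_sad = mutual_information(sad, labels)
mi_day = mutual_information(day, labels)
print(mi_sad, mi_day)
```

A feature that fully determines a balanced binary class reaches the maximum of ln 2 (about 0.693 nats), while an independent feature scores 0. The very small values printed by the notebook reflect that any single word appears in only a small share of tweets.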
