# Sentiment Analysis - Feature Selection

In [1]:
!pip install -U nltk

Requirement already up-to-date: nltk in /home/gemilang/.local/lib/python3.7/site-packages (3.3)


In [3]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1337)


## Upload File to Google Colab

In [12]:
from google.colab import files  # provides the Colab upload widget
uploaded = files.upload()

Saving clean_tweet_training.csv to clean_tweet_training.csv


In [27]:
uploaded = files.upload()

Saving clean_tweet_test_fixed.csv to clean_tweet_test_fixed.csv


In [0]:
from sklearn import preprocessing
my_df_training = pd.read_csv('clean_tweet_training.csv',index_col=0)
tweet_text_training = my_df_training['text']
target_training = my_df_training['target']
X_training = tweet_text_training[pd.notnull(tweet_text_training)]
Y_training = target_training[pd.notnull(tweet_text_training)]


Because the training set is too large, we use only a small fraction of it (1%) so that training runs faster.

In [0]:
from sklearn.model_selection import train_test_split
# Keep 1% of the training data (part1); the remaining 99% (part2) is not used below.
X_train_part1, X_train_part2, Y_train_part1, Y_train_part2 = train_test_split(X_training, Y_training, train_size=0.01, random_state=42)

In [0]:
my_df_test = pd.read_csv('clean_tweet_test_fixed.csv',index_col=0)
my_df_test = my_df_test[my_df_test.target != 2]
tweet_text_test = my_df_test['text']
target_test = my_df_test['target']
X_test = tweet_text_test[pd.notnull(tweet_text_test)]
Y_test = target_test[pd.notnull(tweet_text_test)]

## Feature Extraction <br>
As in the previous notebook, we use bag-of-words as the features. In total, 17489 terms (unique words) are indexed in the vocabulary.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
X_test_vector = vectorizer.transform(X_test)


Number of features:  17489
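
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of what the vectorizer does conceptually. The toy tweets and the helper names `tokenize` and `to_vector` are hypothetical, and the tokenization rule (lowercase, tokens of at least two word characters) is an assumption meant to mimic `CountVectorizer`'s defaults, not a copy of its implementation.

```python
import re
from collections import Counter

# Hypothetical toy tweets, not taken from the notebook's dataset.
docs = ["good day today", "bad day", "good good work"]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (assumed to approximate CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique term mapped to a column index, in alphabetical order.
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

def to_vector(text):
    # Count each known term; terms outside the vocabulary are ignored,
    # which is also what happens when test data is transformed.
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for term, n in counts.items():
        if term in vocabulary:
            row[vocabulary[term]] = n
    return row

matrix = [to_vector(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one tweet and each column is one vocabulary term, so the real training matrix in this notebook has 17489 columns.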


To see how well the system performs with bag-of-words features, we apply one classifier model, an SVM (Support Vector Machine). <br>
Using all 17489 features, the SVM classifies the tweets with an accuracy of **72.70%**.

In [9]:
from sklearn import svm
from sklearn.metrics import accuracy_score
clf = svm.SVC(kernel='linear')
print("Training Classifier...")
clf.fit(X_training_vector, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_vector)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)

Training Classifier...
Predicting...
Accuracy: 0.7270194986072424


## (1) Frequency-based Feature Selection <br>
Next, we apply the simplest feature selection method: frequency-based feature selection. <br>
We keep only the 700 features that occur most frequently in the training data. <br>
The 10 most frequent words are *good, day, get, like, go, love, going, work, today, got*. <br>
With 700 features, the SVM's accuracy in classifying tweets rises to **75.21%**.


In [15]:
vectorizer = CountVectorizer(max_features=700)
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
print("------------------")

# Total count of each term across all training tweets
sum_words = X_training_vector.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
most_freq = words_freq[:10]
for word, freq in most_freq:
    print(word, freq)

X_test_vector = vectorizer.transform(X_test)
print("------------------")
print("Training Classifier...")
clf.fit(X_training_vector, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_vector)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)


Number of features:  700
------------------
good 948
day 880
get 811
like 791
go 737
love 672
going 653
work 648
today 644
got 637
------------------
Training Classifier...
Predicting...
Accuracy: 0.7520891364902507
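
The idea behind frequency-based selection can be sketched without any library: sum each term's count over the whole corpus and keep the `k` largest. The tokenized toy corpus and the helper `top_k_terms` below are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch rather than documented `CountVectorizer` behaviour.

```python
from collections import Counter

# Hypothetical, already-tokenized toy tweets.
tokenized_docs = [
    ["good", "day", "good"],
    ["bad", "day"],
    ["good", "work", "day"],
    ["love", "good"],
]

def top_k_terms(docs, k):
    # Total occurrences of each term across the corpus.
    totals = Counter(term for doc in docs for term in doc)
    # Rank by descending frequency; ties broken alphabetically (sketch assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

selected = top_k_terms(tokenized_docs, 2)
print(selected)
```

Note that this ranking never looks at the class labels, which is why frequent but neutral words such as *day* or *today* survive the cut.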


## (2) Select K-Best Feature Selection with Chi2 <br>
Next, we use the chi2 score as the criterion for choosing the K best features. <br>
As before, only 700 features are kept. According to chi2, the 10 most significant terms are *sad, thanks, miss, love, good, thank, sick, work, bad, hate*. <br>
Accuracy rises to **78.27%**.

In [18]:
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = CountVectorizer()
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
print ("---------------------")
X_test_vector = vectorizer.transform(X_test)

# Print the 10 features with the highest chi2 scores.
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
feature_scores = chi2(X_training_vector, Y_train_part1)[0]
for score, fname in sorted(zip(feature_scores, vectorizer.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

# Select the K best features by chi2 and check whether accuracy improves
ch2 = SelectKBest(chi2, k=700)
X_train_best = ch2.fit_transform(X_training_vector, Y_train_part1)
X_test_best = ch2.transform(X_test_vector)
print ("---------------------")
print("Training Classifier...")
clf.fit(X_train_best, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_best)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)

Number of features:  17489
---------------------
sad 218.08164455521614
thanks 212.3399637323019
miss 170.0616747325742
love 156.98849239358736
good 153.22887880726293
thank 143.04750613913407
sick 117.2589380599822
work 111.85689224903534
bad 108.65106545572228
hate 93.88677283331981
---------------------
Training Classifier...
Predicting...
Accuracy: 0.7827298050139275
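
To show what a chi2 score measures, the following pure-Python sketch compares, for each term, the count observed in each class with the count expected if the term were independent of the class. The toy matrix, the labels, and the function `chi2_scores` are hypothetical; the formula is my understanding of the statistic used for count features, not a reproduction of the library's code.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature.

    X: list of rows of non-negative term counts; y: list of class labels.
    """
    n_samples = len(y)
    n_features = len(X[0])
    # Total count of each term over all documents.
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in sorted(set(y)):
        class_prob = sum(1 for label in y if label == c) / n_samples
        for j in range(n_features):
            # Count of term j actually seen in class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Count expected if term j were spread evenly according to class size.
            expected = class_prob * feature_total[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

# Hypothetical toy data: columns are the terms "sad", "day", "love".
terms = ["sad", "day", "love"]
X_toy = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_toy = [0, 0, 1, 1]  # 0 = negative, 1 = positive (assumed labels)

scores = chi2_scores(X_toy, y_toy)
for term, score in zip(terms, scores):
    print(term, score)
```

In this toy example *sad* and *love* each appear in only one class and receive a high score, while *day* appears equally in both classes and scores zero. That is the same pattern as in the real output above, where sentiment-bearing words dominate the top 10.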


## (3) Select K-Best Feature Selection with Mutual Information <br>
Next, we use Mutual Information (MI) as the criterion for choosing the K best features. <br>
As before, only 700 features are kept. According to MI, the 10 most significant terms are *sad, thanks, miss, thank, good, love, sick, work, hate, bad*. <br>
Accuracy reaches **77.44%**, above the bag-of-words baseline but slightly below the chi2 result.

In [20]:
from sklearn.feature_selection import mutual_info_classif
vectorizer = CountVectorizer()
X_training_vector = vectorizer.fit_transform(X_train_part1)
print("Number of features:  %d" % len(vectorizer.vocabulary_))
print ("---------------------")
X_test_vector = vectorizer.transform(X_test)

# Print the 10 features with the highest MI scores.
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
feature_scores = mutual_info_classif(X_training_vector, Y_train_part1)
for score, fname in sorted(zip(feature_scores, vectorizer.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

# Select the K best features by Mutual Information and check whether accuracy improves
mic = SelectKBest(mutual_info_classif, k=700)
X_train_best = mic.fit_transform(X_training_vector, Y_train_part1)
X_test_best = mic.transform(X_test_vector)
print ("---------------------")
print("Training Classifier...")
clf.fit(X_train_best, Y_train_part1)
print("Predicting...")
prediction = clf.predict(X_test_best)
accuracy = accuracy_score(Y_test, prediction)
print ('Accuracy:', accuracy)

Number of features:  17489
---------------------
sad 0.008364170510188972
thanks 0.0073625548431539255
miss 0.006100445655051625
thank 0.005032626264982238
good 0.0049708010971418505
love 0.004921903333730723
sick 0.004302198033515895
work 0.0035984547239264
hate 0.003575654363850231
bad 0.0035398842870227893
---------------------
Training Classifier...
Predicting...
Accuracy: 0.7743732590529248
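
Mutual information can also be illustrated by hand. The sketch below computes the MI (in nats) between one term's count and the class label from their joint frequencies. The function `mutual_information` and the toy columns are hypothetical, and treating each count as a discrete value is an assumption of this sketch about how the scorer handles count features.

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """MI in nats between a discrete feature and discrete class labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))  # co-occurrence counts
    feature_counts = Counter(feature_values)      # marginal counts of the feature
    label_counts = Counter(labels)                # marginal counts of the label
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (n_xy / n) * math.log(n_xy * n / (feature_counts[x] * label_counts[y]))
    return mi

# Hypothetical toy data: per-tweet counts of two terms and the tweet labels.
y_toy = [0, 0, 1, 1]       # 0 = negative, 1 = positive (assumed labels)
sad_counts = [1, 1, 0, 0]  # "sad" occurs only in negative tweets
day_counts = [1, 0, 1, 0]  # "day" occurs equally in both classes

mi_sad = mutual_information(sad_counts, y_toy)
mi_day = mutual_information(day_counts, y_toy)
print("sad", mi_sad)
print("day", mi_day)
```

A term that fully determines the label reaches the maximum for two balanced classes, log 2, while a term that is independent of the label scores zero. The small absolute values in the real output above reflect that any single word appears in only a small fraction of tweets.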
