Sentiment analysis is the interpretation and classification of emotions (positive, negative, and neutral) within text data using text analysis techniques. Before starting, make sure the required packages are installed: ```pip install pandas nltk scikit-learn Sastrawi```

Before creating the model, we have to label a sample of tweets manually, assigning a sentiment to each one.

In [1]:
import pandas as pd

review_data = pd.read_csv('training_all_random.csv', sep=';', names=['Tweet', 'sentimen'])
review_data

Unnamed: 0,Tweet,sentimen
0,"rt @napqilla: no 1, 3 ambisinya menguasai raky...",1
1,rt @pandji: nah gue pikir sentimen petahana ok...,1
2,rt @pandji: urutan pertama best moment #debat2...,1
3,rt @pandji: ini artikel yg menjelaskan ternyat...,1
4,rt @mrtampi: agus makin santai.\nahok makin sa...,0
...,...,...
1501,rt @erixputra: rakyat adalah bos kami. kami ad...,1
1502,ahok \u2013 djarot waspadai politik uang dalam...,1
1503,@jokowi harusnya sdh tahu dari awal ini semaki...,1
1504,"#coblosnomor2 soal transportasi, ahok sebut an...",0


The next step is to tokenize each tweet. Tokenization is the process of dividing a sentence into tokens, which can be words or other units depending on the context.

In [2]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[A-Za-z]+')

review_data['Tweet'] = review_data.Tweet.map(lambda x: tokenizer.tokenize(x))
review_data.Tweet = review_data.Tweet.str.join(sep=' ')
review_data

Unnamed: 0,Tweet,sentimen
0,rt napqilla no ambisinya menguasai rakyat no a...,1
1,rt pandji nah gue pikir sentimen petahana oke ...,1
2,rt pandji urutan pertama best moment debat pil...,1
3,rt pandji ini artikel yg menjelaskan ternyata ...,1
4,rt mrtampi agus makin santai nahok makin santu...,0
...,...,...
1501,rt erixputra rakyat adalah bos kami kami adala...,1
1502,ahok u djarot waspadai politik uang dalam pilk...,1
1503,jokowi harusnya sdh tahu dari awal ini semakin...,1
1504,coblosnomor soal transportasi ahok sebut anies...,0
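
To make the effect of the pattern `[A-Za-z]+` concrete, here is a minimal sketch using Python's built-in `re` module instead of NLTK. It assumes that `RegexpTokenizer` in its default mode returns the regex matches in order, as `re.findall` does. The sample string is made up from fragments visible in the table above. Note that digits, `#`, and `@` are dropped, and that a literal `\u2013` in the raw text leaves a stray `u` token, which appears to be what happened in row 1502.

```python
import re

# Same pattern as the RegexpTokenizer above: runs of ASCII letters only.
TOKEN_PATTERN = re.compile(r'[A-Za-z]+')

def tokenize(text):
    """Return every run of ASCII letters in `text`, in order."""
    return TOKEN_PATTERN.findall(text)

# Raw string, so \u2013 stays as six literal characters (made-up sample).
sample = r'ahok \u2013 djarot #coblosnomor2 @jokowi'
tokens = tokenize(sample)
print(tokens)            # ['ahok', 'u', 'djarot', 'coblosnomor', 'jokowi']
print(' '.join(tokens))  # ahok u djarot coblosnomor jokowi
```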


The next step is to remove stop words. Stop words are common words that carry very little meaning on their own, such as "and", "the", "a", and "an". In Indonesian, common stop words include “yang”, “untuk”, “pada”, and “ke”.

In [3]:
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stopword = StopWordRemoverFactory().create_stop_word_remover()

review_data['Tweet'] = review_data.Tweet.map(lambda x: stopword.remove(x)) 
review_data

Unnamed: 0,Tweet,sentimen
0,rt napqilla no ambisinya menguasai rakyat no a...,1
1,rt pandji nah gue pikir sentimen petahana oke ...,1
2,rt pandji urutan pertama best moment debat pil...,1
3,rt pandji artikel yg menjelaskan ternyata deba...,1
4,rt mrtampi agus makin santai nahok makin santu...,0
...,...,...
1501,rt erixputra rakyat bos kami pelayan rakyat u ...,1
1502,ahok u djarot waspadai politik uang pilkada ht...,1
1503,jokowi harusnya sdh tahu awal semakin masif hi...,1
1504,coblosnomor soal transportasi ahok sebut anies...,0
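
The idea behind stop-word removal can be sketched without Sastrawi. The word list below is hypothetical: it contains the four examples from the text plus "ini", which row 3 of the table suggests Sastrawi also removes. Sastrawi's real list is much longer. Informal abbreviations such as "yg" are not in the list, so they survive, as they do in the table above.

```python
# Hypothetical mini stop-word list for illustration; not Sastrawi's actual list.
STOP_WORDS = {'yang', 'untuk', 'pada', 'ke', 'ini'}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that is in `stop_words`."""
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stop_words('rt pandji ini artikel yg menjelaskan'))
# rt pandji artikel yg menjelaskan
```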


The next step is stemming. Stemming is the process of reducing the morphological variants of a word to their root/base form. For English, there are several popular algorithms, including Porter's, Lovins, Dawson, and Krovetz. For Indonesian, there is a library called Sastrawi.

In [4]:
# # It is very slow compared with English stemmers, so this cell is disabled

# from multiprocessing.dummy import Pool as ThreadPool
# from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# def stem(x):
#     stemmer = StemmerFactory().create_stemmer()
#     return stemmer.stem(x)

# pool = ThreadPool(4)
# review_data['Tweet'] = pool.map(stem, review_data.Tweet)
# pool.close()
# pool.join()
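
The disabled cell builds a new stemmer inside `stem` for every tweet, which is probably part of the cost. A possible mitigation, sketched here under assumptions, is to create the stemmer once and cache results per word, since tweets repeat many words. The helper `make_cached_stemmer` and the `toy_stem` function are hypothetical names introduced for illustration; `toy_stem` is a stand-in so the sketch runs without Sastrawi and is not a real Indonesian stemmer.

```python
from functools import lru_cache

def make_cached_stemmer(stem_word, maxsize=None):
    """Wrap a word-level stem function so each distinct word is stemmed once."""
    cached = lru_cache(maxsize=maxsize)(stem_word)

    def stem_text(text):
        # Stem word by word so repeated words hit the cache.
        return ' '.join(cached(word) for word in text.split())

    return stem_text

# Toy stand-in: strips a trailing "nya" only. NOT a real Indonesian stemmer.
calls = []
def toy_stem(word):
    calls.append(word)
    return word[:-3] if word.endswith('nya') else word

stem_text = make_cached_stemmer(toy_stem)
print(stem_text('ambisinya rakyat ambisinya'))  # ambisi rakyat ambisi
print(len(calls))                               # 2
```

With Sastrawi, the same wrapper could presumably be given `StemmerFactory().create_stemmer().stem`, with the stemmer created once outside the mapping function; whether this makes stemming fast enough for this data set has not been measured here.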

The next step is to create a bag of words with CountVectorizer. CountVectorizer converts raw text into a numerical vector of token counts. It can count single words and n-grams; with the default settings used here, only single words are counted.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

review_data_tf = cv.fit_transform(review_data.Tweet)
cv.get_feature_names()  # removed in scikit-learn 1.2; use cv.get_feature_names_out() on newer versions

['aa',
 'aaakysdoke',
 'aagym',
 'aamandakasih',
 'aamiin',
 'aangku',
 'abah',
 'abaikan',
 'abal',
 'abdee',
 'abdusshomad',
 'abiiiis',
 'about',
 'abs',
 'absolutely',
 'abstrak',
 'academia',
 'acara',
 'acded',
 'aceh',
 'acehcenterid',
 'acg',
 'acta',
 'acung',
 'ad',
 'ada',
 'adakah',
 'adam',
 'addiems',
 'adhdhaifah',
 'adib',
 'adibhidayat',
 'adil',
 'adj',
 'adl',
 'adlh',
 'adsaimugbf',
 'adu',
 'aduh',
 'ae',
 'aef',
 'afb',
 'afhbfellzu',
 'afif',
 'afiit',
 'afnr',
 'afqzzgsz',
 'afu',
 'aga',
 'agama',
 'agamanya',
 'agamis',
 'agatha',
 'agitasi',
 'agkgoerfpz',
 'agree',
 'agrveu',
 'agungwidrajat',
 'agus',
 'agusfansclub',
 'agussylvi',
 'agussylvidki',
 'agusyudhoyono',
 'ah',
 'ahaaayeee',
 'ahhhh',
 'ahli',
 'ahok',
 'ahokbersinetron',
 'ahokdipecat',
 'ahokdjarot',
 'ahokdjarothebat',
 'ahokdjarotpilihan',
 'ahoker',
 'ahokers',
 'ahoklah',
 'ahokoke',
 'ahuv',
 'ahy',
 'ahydki',
 'ahymata',
 'ahymatana',
 'ahymatanajwa',
 'ai',
 'aib',
 'aid',
 'ainunnajib'

In [6]:
review_data_tf.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
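
The matrix above is mostly zeros because each tweet uses only a few words of the whole vocabulary. A minimal pure-Python sketch of the same idea, on two made-up documents, may help to read it: each column is a vocabulary word and each row is a document. The function `bag_of_words` is a hypothetical helper, not part of scikit-learn. CountVectorizer additionally drops single-character tokens by default and stores the result as a sparse matrix.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up documents for illustration.
docs = ['agus makin santai', 'ahok makin santun makin santai']
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['agus', 'ahok', 'makin', 'santai', 'santun']
print(matrix)      # [[1, 0, 1, 1, 0], [0, 1, 2, 1, 1]]
```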

Before we create a model, we have to split our data into two disjoint sets: a training set and a test set. The classifier is trained on the training set to produce a model, which we can then evaluate on the test set.

In [7]:
from sklearn.model_selection import train_test_split

trainX, testX, trainY, testY = train_test_split(review_data_tf, review_data.sentimen)
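
By default, `train_test_split` shuffles the rows and holds out 25% of them for testing. The sketch below reproduces that arithmetic with the standard library only; `split_indices` is a hypothetical helper, and the seed is an arbitrary choice. For the 1505 labelled tweets it gives 377 test rows, which matches the total of the confusion matrix reported later.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=0):
    """Shuffle range(n_samples) and cut it into disjoint train and test index lists."""
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1505)
print(len(train_idx), len(test_idx))        # 1128 377
print(set(train_idx).isdisjoint(test_idx))  # True
```

Because the split is random, rerunning the notebook gives slightly different scores. Passing a fixed `random_state` to `train_test_split` makes the split repeatable.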

In [8]:
trainX.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]])

In [9]:
testX.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [10]:
trainY.tolist()

[1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,


In [11]:
testY.tolist()

[1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,


The next step is to create a model by training a classifier on the training data. A common choice for text classification on word counts is MultinomialNB, a variant of the Naive Bayes classifier.

In [12]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(trainX, trainY)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
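
To show what the classifier learns, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive smoothing, where `alpha=1.0` matches the value in the output above. It is an illustration under simplifying assumptions (dense lists, toy data), not scikit-learn's implementation, and the function names are hypothetical. For each class it estimates a prior and a smoothed probability per word, then picks the class with the highest log posterior.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    n_features = len(X[0])
    classes = sorted(set(y))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word over all documents of class c.
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, row):
    """Return the class with the highest log posterior for one count vector."""
    classes, log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))

    return max(classes, key=score)

# Toy counts over a 3-word vocabulary; labels are arbitrary 0/1 classes.
X = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
y = [1, 1, 0, 0]
model = fit_multinomial_nb(X, y)
print(predict_multinomial_nb(model, [1, 0, 0]))  # 1
print(predict_multinomial_nb(model, [0, 1, 0]))  # 0
```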

The model's performance can be measured by feeding the test set into the trained model and comparing the predictions with the actual classes of the test set. The comparison can be summarized with a confusion matrix, a table that counts, for each actual class, how many samples were predicted as each class.

In [13]:
y_pred = mnb.predict(testX)

In [14]:
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, f1_score

confusion_matrix(y_true=testY, y_pred=y_pred, labels=review_data.sentimen.unique())

array([[149,  43],
       [ 41, 144]])

In [15]:
precision, recall, _, _ = precision_recall_fscore_support(y_true=testY, y_pred=y_pred, labels=review_data.sentimen.unique())

In [16]:
precision

array([0.78421053, 0.77005348])

In [17]:
recall

array([0.77604167, 0.77837838])

In [18]:
f1_score(y_true=testY, y_pred=y_pred, labels=review_data.sentimen.unique())

0.7801047120418849
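
The scores above can be recomputed by hand from the confusion matrix. The sketch below assumes the label order is `[1, 0]`, which is what `unique()` would return if the first row of the data has label 1, as the first table shows. In scikit-learn's confusion matrix, rows are actual classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above.
# Rows: actual label, columns: predicted label, both assumed in the order [1, 0].
matrix = [[149, 43],
          [41, 144]]

tp, fn = matrix[0]  # actual 1: predicted 1, predicted 0
fp, tn = matrix[1]  # actual 0: predicted 1, predicted 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 8))  # 0.78421053
print(round(recall, 8))     # 0.77604167
print(round(f1, 8))         # 0.78010471
print(round(accuracy, 4))   # 0.7772
```

The first three values agree with the first entries of `precision` and `recall` and with the `f1_score` output, which supports the assumed label order.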

Next, predict the sentiment of the tweets crawled in "Crawling Twitter Using Tweepy".

In [43]:
with open('sensus_penduduk_tweets.txt') as f:  # closes the file automatically
    tweets = f.readlines()

# Tokenize
tweets = list(map(lambda x: tokenizer.tokenize(x), tweets))
tweets = list(map(lambda x: ' '.join(x), tweets))

# Remove stop words
tweets = list(map(lambda x: stopword.remove(x),tweets))

# Predict sentiment
tweets_tf = cv.transform(tweets)
tweets_sentiments = mnb.predict(tweets_tf)

# Show as table
sensus_penduduk_sentiment = pd.DataFrame({'Tweet': tweets, 'Sentiment': tweets_sentiments})
sensus_penduduk_sentiment

Unnamed: 0,Tweet,Sentiment
0,Ayo input data sebelum tanggal Maret Etam suks...,0
1,OdedMD kangyanamulyana KHamidipradja NurShomad...,1
2,ASN DPMPTSP NAKER Melakukan Proses pengisian s...,1
3,Selamat malam Sehubungan sedang dilaksanakanny...,1
4,persen penduduk Kota Bogor lakukan sensus mand...,1
...,...,...
690,Sosialisasi sensus penduduk online acara sambu...,1
691,BABINSA KORAMIL BDK HADIRI RAPAT KOORDINASI KE...,1
692,LOGIN https t co cnQuhbjWSm Cara Isi Sensus Pe...,0
693,Rapat Koordinasi Sensus Penduduk Kegiatan dira...,0
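
New text has to go through exactly the same preprocessing as the training tweets before it reaches `cv.transform`. Since the same steps are repeated for every new data source, they could be wrapped in one function. The sketch below is one way to do that; `preprocess` and the two toy stand-ins are hypothetical names so that the block runs on its own. In the notebook, `tokenizer.tokenize` and `stopword.remove` from the earlier cells would be passed instead.

```python
import re

def preprocess(lines, tokenize, remove_stop_words):
    """Apply tokenization, re-joining, and stop-word removal to each line."""
    cleaned = []
    for line in lines:
        text = ' '.join(tokenize(line))
        cleaned.append(remove_stop_words(text))
    return cleaned

# Toy stand-ins for illustration only.
toy_tokenize = re.compile(r'[A-Za-z]+').findall

def toy_remove(text):
    return ' '.join(w for w in text.split() if w not in {'yang', 'ke'})

print(preprocess(['Ayo ke sensus #2020!\n'], toy_tokenize, toy_remove))
# ['Ayo sensus']
```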


Next, predict the sentiment of the news pages crawled in "Crawling Web Using Selenium Webdriver".

In [44]:
with open('sensus_penduduk_news.txt') as f:  # closes the file automatically
    news = f.readlines()

# Tokenize
news = list(map(lambda x: tokenizer.tokenize(x), news))
news = list(map(lambda x: ' '.join(x), news))

# Remove stop words
news = list(map(lambda x: stopword.remove(x),news))

# Predict sentiment
news_tf = cv.transform(news)
news_sentiments = mnb.predict(news_tf)

# Show as table
sensus_penduduk_sentiment = pd.DataFrame({'Tweet': news, 'Sentiment': news_sentiments})
sensus_penduduk_sentiment

Unnamed: 0,Tweet,Sentiment
0,Jumat Maret Cari Network Login Tribun Home Nas...,1
1,Jumat Maret Cari Network Login Tribun Home Nas...,1
2,UNTUK INDONESIA BERITA OPINI CERITA PROFIL BOL...,1
3,Bahasa Indonesia English Profil Ragam Layanan ...,0
4,MENU CARI JUM AT MARET Beranda Makro Nasional ...,1
...,...,...
100,Skip to content Pencarian MARET HOME NASIONAL ...,1
101,Login Buku Tambahkan KoleksikuTulis resensi Ha...,1
102,JUMAT MARET WIB HOME NEWS LINGKUNGAN POLITIK H...,1
103,LINE Today TOP Trending Showbiz News Life Regi...,1
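
`cv.transform` ignores every word that was not in the vocabulary learned from the training tweets. The scraped pages above contain many navigation words ("Login", "Home", "Cari"), so it may be worth checking how much of each text the model can actually see before trusting a prediction. The sketch below is a rough check under assumptions: `vocabulary_coverage` is a hypothetical helper and the vocabulary is a toy set. In the notebook, the fitted vocabulary is available as `cv.vocabulary_`.

```python
def vocabulary_coverage(text, vocabulary):
    """Fraction of whitespace-separated tokens in `text` that are in `vocabulary`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)

# Toy vocabulary for illustration only.
toy_vocabulary = {'ahok', 'agus', 'rakyat', 'sensus'}
print(vocabulary_coverage('Login Tribun Home sensus', toy_vocabulary))  # 0.25
```

A low coverage value means the prediction rests on only a few known words and should be read with caution.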
