# SENTIMENT ANALYSIS OF PUBLIC SERVICES IN JAKARTA FROM TWEET DATA USING THE NAIVE BAYES ALGORITHM

This study performs sentiment analysis on Twitter data using the Naive Bayes algorithm. The tweets collected relate to public services provided by the DKI Jakarta government and were gathered between March 2016 and December 2016.

The analysis proceeds in four stages:
- Data preprocessing
- Feature selection
- Building the classification model
- Predicting the test data

## Data

In [2]:
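# Class for loading the data
import os

import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r'C:\Users\Aldi\Documents\skripsi\html')

class data(object):
    def __init__(self, filename):
        self.filename = filename

    def create_df(self):
        # names= supplies the column labels, so the first row of the file is read as data
        tabel = pd.read_csv(self.filename, delimiter=';', names=['Tweet', 'Sentimen'])
        return tabel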

The data consists of tweets posted by, or mentioning, the following accounts: DKIJakarta, Jakartagoid, basuki_btp, info_DKI, tmcpoldametro, BiroHukumDKIJakarta, dkpsJakarta, PT_TransJakarta, SatpolPP_Prov, KebersihanDKI, and DinsosDKI1. Whether a tweet expresses a positive or a negative opinion is determined from the emoticon it contains: a happy face :) marks it as positive and a sad face :( marks it as negative. Positive tweets are stored in JakartaPos.csv and negative tweets in JakartaNeg.csv (the loading cell below reads the negative tweets from a file named JakartaNegBaru.csv).
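
The emoticon-based labeling described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `label_by_emoticon` and the exact emoticon lists are hypothetical, and the notebook does not show how the two CSV files were actually produced. The labels follow the notebook's convention of 1 for positive and 0 for negative.

```python
# Hypothetical emoticon lists; the actual collection rules are not shown in the notebook.
HAPPY = (":)", ":-)", ":D")
SAD = (":(", ":-(")

def label_by_emoticon(tweet):
    """Return 1 (positive), 0 (negative), or None when the signal is absent or mixed."""
    has_happy = any(e in tweet for e in HAPPY)
    has_sad = any(e in tweet for e in SAD)
    if has_happy == has_sad:  # neither kind present, or both present
        return None
    return 1 if has_happy else 0

examples = [
    "@PT_TransJakarta terima kasih infonya..... :-)",
    "macet parah :(",
    "halte penuh",
]
for text in examples:
    print(label_by_emoticon(text), text)
```

Tweets that carry no emoticon, or both kinds, return `None` here so that they can be left out of the training set rather than labeled by guesswork.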

In [3]:
a = data('JakartaPos.csv')
b = data('JakartaNegBaru.csv')
Positif = a.create_df()
print(len(Positif))
Negatif = b.create_df()
print(len(Negatif))

700
700


The positive and negative data are then combined into a single table that serves as the training data.

In [4]:
train_data = [Positif,Negatif]
TrainingTable = pd.concat(train_data,ignore_index = True)
TrainingData = TrainingTable[['Tweet','Sentimen']]
print(TrainingData)
ListData = [tuple(x) for x in TrainingData.values]
# print(ListData)

                                                  Tweet  Sentimen
0     blak-blakan juga ni orang  tanpa perlu basa-ba...         1
1     inikan sebagian besar sdh dijalankan pak @basu...         1
2     Tapi yg menarik ada kata " Menggusur " jadi in...         1
3     kau gugat donk @ecosocrights ! gi cepetan :) @...         1
4     #CherrybellewithAhok Kalo Kolaborasi di suatu ...         1
5     Wkwkwkwk kece koh..KECEbong bingits :p @basuki...         1
6        @PT_TransJakarta terima kasih infonya..... :-)         1
7     @Beritasatu cara @jokowi  memindah kan warga d...         1
8     @PT_TransJakarta @GunRomli waktu libur kejepit...         1
9     @MerryMP @bambangelf @basuki_btp serang ahok m...         1
10    Pak Ahok Jadi Super Strong deh kalo sama ChiBi...         1
11    Maaf  @basuki_btp menjadi gubenur hanya untuk ...         1
12    Selama jkt menjadi lebih baik  kenapa enggak? ...         1
13    Kapan nih @basuki_btp bikin KJKLM buat orang-o...         1
14    Dan 

In [5]:
# Collect the tweets as (text, label) tuples
tweets = []
for a,b in ListData:
    tweets.append((a,b))
print(tweets)

[('blak-blakan juga ni orang  tanpa perlu basa-basi bicara soal pak Ahok @basuki_btp :)) https://twitter.com/triwul82/status/781538144789291008??', 1), ('inikan sebagian besar sdh dijalankan pak @basuki_btp :)) moga kedepan tetap konsisten :) https://twitter.com/angelmaximilan/status/781375738599530497??', 1), ('Tapi yg menarik ada kata " Menggusur " jadi inget puisi " Tukang gusur tukang gusur tukang gusur " ... akuin aja @basuki_btp lebih baik :D https://twitter.com/Pradhana_Adi/status/781461528495087617??', 1), ('kau gugat donk @ecosocrights ! gi cepetan :) @politik_twit @jokowi @basuki_btp', 1), ('#CherrybellewithAhok Kalo Kolaborasi di suatu acara Bakalan jadi topik dunia tuh :) @basuki_btp @Cherrybelleindo\\n#GoCherrybelleAHOK', 1), ('Wkwkwkwk kece koh..KECEbong bingits :p @basuki_bt @basuki_btp', 1), ('@PT_TransJakarta terima kasih infonya..... :-)', 1), ('@Beritasatu cara @jokowi  memindah kan warga dari kumuh kotor miskin rusak ke RuSun itu sdh sama @basuki_btp kelessssss :)',

In [6]:
# Save the collected tweets
import csv

with open("datatweet.csv", "w", newline='') as f:
    writer = csv.writer(f, delimiter = ';')
    writer.writerows(tweets)

### Data Preprocessing

Data preprocessing prepares the data before it is modeled.
The preprocessing steps are:
- Removing symbols and characters that carry no information
- Stemming
- Tokenization
- Word filtering
- Stopword removal

In [7]:
# Preprocessing class
import re

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

class preprocessing(object):
    def __init__(self, tweet):
        self.tweet = tweet

    def cleansing(self):
        emoticon_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
        twit = self.tweet
        # Lowercasing is left disabled
        # twit = twit.lower()
        # Remove URLs
        twit = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+)|((pic\.[^\s]+)))', '', twit)
        # Remove @username mentions
        twit = re.sub(r'@[^\s]+', '', twit)
        # Strip the '#' from hashtags but keep the tag text
        twit = re.sub(r'#([^\s]+)', r'\1', twit)
        # Remove numbers
        twit = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', twit)
        # Remove emoticons. NOTE: the pattern is anchored with ^ and $ and is not
        # compiled with re.VERBOSE, so it does not match emoticons inside a tweet.
        # Leftover letters such as 'D' or 'p' (visible in the output below) are
        # dropped later by filtering().
        twit = re.sub(r'^' + emoticon_str + '$', '', twit)
        # Replace a literal "\n" that follows a lowercase letter with a space
        twit = re.sub(r"(?<=[a-z])\r?\\n", " ", twit)
        # Collapse a lowercase letter repeated three or more times into one
        twit = re.sub(r'([a-z])\1\1+', r'\1', twit)
        # Replace punctuation with a space
        twit = re.sub(r'[^\w\s]', ' ', twit)
        # Collapse runs of whitespace
        twit = re.sub(r'[\s]+', ' ', twit)
        # Strip leading and trailing quote characters
        twit = twit.strip('\'"')
        return twit

    # Drop words shorter than 3 characters (expects a list of tokens)
    def filtering(self):
        text = [word for word in self.tweet if len(word) >= 3]
        return text

    # Split a sentence into words (tokens)
    def tokenize(self):
        tweet = self.tweet
        return tweet.split()

    # Remove stopwords (expects a list of tokens)
    def hapus_stopword(self):
        StopWords = "StopWords_Eng-Ind.txt"
        with open(StopWords, encoding='utf-8', mode='r') as sw:
            stop = set(kata.strip() for kata in sw)
        kata = [item for item in self.tweet if item not in stop]
        return kata

    # Reduce words to their root form with the Sastrawi stemmer
    def stemming(self):
        factory = StemmerFactory()
        stemmer = factory.create_stemmer()
        kata_dasar = stemmer.stem(self.tweet)
        return kata_dasar

In [8]:
# Remove symbols and unwanted characters from the training data
tweet_removehastag = []
for (a,b) in tweets:
    tweets_kotor = preprocessing(a)
    tweet = tweets_kotor.cleansing()
    tweet_removehastag.append((tweet,b))
print(tweet_removehastag)

[('blak blakan juga ni orang tanpa perlu basa basi bicara soal pak Ahok ', 1), ('inikan sebagian besar sdh dijalankan pak moga kedepan tetap konsisten ', 1), ('Tapi yg menarik ada kata Menggusur jadi inget puisi Tukang gusur tukang gusur tukang gusur akuin aja lebih baik D ', 1), ('kau gugat donk gi cepetan ', 1), ('CherrybellewithAhok Kalo Kolaborasi di suatu acara Bakalan jadi topik dunia tuh ', 1), ('Wkwkwkwk kece koh KECEbong bingits p ', 1), (' terima kasih infonya ', 1), (' cara memindah kan warga dari kumuh kotor miskin rusak ke RuSun itu sdh sama keles ', 1), (' waktu libur kejepit kemaren udah coba BW D', 1), (' serang ahok mslh mulutnya aja Kl mslh pajak or korupsi or pembuktian terbalik jgn Ntar malu lho situ', 1), ('Pak Ahok Jadi Super Strong deh kalo sama ChiBi ChiBi hihi GOCherrybelleAHOK ', 1), ('Maaf menjadi gubenur hanya untuk bekerja bekerja bekerja bukan jalan lari spt mas p ', 1), ('Selama jkt menjadi lebih baik kenapa enggak Toh perkembangan jkt sjk pemerintahan ja

In [9]:
# Reduce the words in the training data to their root forms
tweets_stem = []
for (a,b) in tweet_removehastag:
    tweets_kotor = preprocessing(a)
    tweet = tweets_kotor.stemming()
    tweets_stem.append((tweet,b))
print(tweets_stem)

[('blak blakan juga ni orang tanpa perlu basa basi bicara soal pak Ahok ', 1), ('ini bagi besar sdh jalan pak moga depan tetap konsisten ', 1), ('Tapi yg tarik ada kata Menggusur jadi inget puisi Tukang gusur tukang gusur tukang gusur akuin aja lebih baik D ', 1), ('kau gugat donk gi cepetan ', 1), ('CherrybellewithAhok Kalo Kolaborasi di suatu acara Bakalan jadi topik dunia tuh ', 1), ('Wkwkwkwk kece koh KECEbong bingits p ', 1), (' terima kasih info ', 1), (' cara pindah kan warga dari kumuh kotor miskin rusak ke RuSun itu sdh sama les ', 1), (' waktu libur jepit kemaren udah coba BW D', 1), (' serang ahok mslh mulut aja Kl mslh pajak or korupsi or bukti balik jgn Ntar malu lho situ', 1), ('Pak Ahok Jadi Super Strong deh kalo sama ChiBi ChiBi hihi GOCherrybelleAHOK ', 1), ('Maaf jadi gubenur hanya untuk kerja kerja kerja bukan jalan lari spt mas p ', 1), ('Selama jkt jadi lebih baik kenapa enggak Toh kembang jkt sjk perintah jadi lebih baik kok ', 1), ('Kapan nih bikin KJKLM buat ora

In [10]:
# Split each sentence in the training data into words (tokens)
tweet_token = []
for (a,b) in tweets_stem:
    tweets_kotor = preprocessing(a)
    tweet = tweets_kotor.tokenize()
    tweet_token.append((tweet,b))
print(tweet_token)

[(['blak', 'blakan', 'juga', 'ni', 'orang', 'tanpa', 'perlu', 'basa', 'basi', 'bicara', 'soal', 'pak', 'Ahok'], 1), (['ini', 'bagi', 'besar', 'sdh', 'jalan', 'pak', 'moga', 'depan', 'tetap', 'konsisten'], 1), (['Tapi', 'yg', 'tarik', 'ada', 'kata', 'Menggusur', 'jadi', 'inget', 'puisi', 'Tukang', 'gusur', 'tukang', 'gusur', 'tukang', 'gusur', 'akuin', 'aja', 'lebih', 'baik', 'D'], 1), (['kau', 'gugat', 'donk', 'gi', 'cepetan'], 1), (['CherrybellewithAhok', 'Kalo', 'Kolaborasi', 'di', 'suatu', 'acara', 'Bakalan', 'jadi', 'topik', 'dunia', 'tuh'], 1), (['Wkwkwkwk', 'kece', 'koh', 'KECEbong', 'bingits', 'p'], 1), (['terima', 'kasih', 'info'], 1), (['cara', 'pindah', 'kan', 'warga', 'dari', 'kumuh', 'kotor', 'miskin', 'rusak', 'ke', 'RuSun', 'itu', 'sdh', 'sama', 'les'], 1), (['waktu', 'libur', 'jepit', 'kemaren', 'udah', 'coba', 'BW', 'D'], 1), (['serang', 'ahok', 'mslh', 'mulut', 'aja', 'Kl', 'mslh', 'pajak', 'or', 'korupsi', 'or', 'bukti', 'balik', 'jgn', 'Ntar', 'malu', 'lho', 'situ'],

In [11]:
# Remove words shorter than 3 characters from the training data
tweet_normalisasi = []
for (a,b) in tweet_token:
    tweets_kotor = preprocessing(a)
    tweet = tweets_kotor.filtering()
    tweet_normalisasi.append((tweet,b))
print(tweet_normalisasi)

[(['blak', 'blakan', 'juga', 'orang', 'tanpa', 'perlu', 'basa', 'basi', 'bicara', 'soal', 'pak', 'Ahok'], 1), (['ini', 'bagi', 'besar', 'sdh', 'jalan', 'pak', 'moga', 'depan', 'tetap', 'konsisten'], 1), (['Tapi', 'tarik', 'ada', 'kata', 'Menggusur', 'jadi', 'inget', 'puisi', 'Tukang', 'gusur', 'tukang', 'gusur', 'tukang', 'gusur', 'akuin', 'aja', 'lebih', 'baik'], 1), (['kau', 'gugat', 'donk', 'cepetan'], 1), (['CherrybellewithAhok', 'Kalo', 'Kolaborasi', 'suatu', 'acara', 'Bakalan', 'jadi', 'topik', 'dunia', 'tuh'], 1), (['Wkwkwkwk', 'kece', 'koh', 'KECEbong', 'bingits'], 1), (['terima', 'kasih', 'info'], 1), (['cara', 'pindah', 'kan', 'warga', 'dari', 'kumuh', 'kotor', 'miskin', 'rusak', 'RuSun', 'itu', 'sdh', 'sama', 'les'], 1), (['waktu', 'libur', 'jepit', 'kemaren', 'udah', 'coba'], 1), (['serang', 'ahok', 'mslh', 'mulut', 'aja', 'mslh', 'pajak', 'korupsi', 'bukti', 'balik', 'jgn', 'Ntar', 'malu', 'lho', 'situ'], 1), (['Pak', 'Ahok', 'Jadi', 'Super', 'Strong', 'deh', 'kalo', 'sama

In [12]:
# Remove stopwords from the training data
tweet_bersih = []
for (a,b) in tweet_normalisasi:
    tweets_kotor = preprocessing(a)
    tweet = tweets_kotor.hapus_stopword()
    tweet_bersih.append((tweet,b))
print(tweet_bersih)

[(['blak', 'blakan', 'orang', 'basa', 'basi', 'bicara', 'Ahok'], 1), (['sdh', 'jalan', 'moga', 'konsisten'], 1), (['Tapi', 'tarik', 'Menggusur', 'inget', 'puisi', 'Tukang', 'gusur', 'tukang', 'gusur', 'tukang', 'gusur', 'akuin', 'aja'], 1), (['kau', 'gugat', 'donk', 'cepetan'], 1), (['CherrybellewithAhok', 'Kalo', 'Kolaborasi', 'acara', 'Bakalan', 'topik', 'dunia', 'tuh'], 1), (['Wkwkwkwk', 'kece', 'koh', 'KECEbong', 'bingits'], 1), (['terima', 'kasih', 'info'], 1), (['pindah', 'warga', 'kumuh', 'kotor', 'miskin', 'rusak', 'RuSun', 'sdh', 'les'], 1), (['libur', 'jepit', 'kemaren', 'udah', 'coba'], 1), (['serang', 'ahok', 'mslh', 'mulut', 'aja', 'mslh', 'pajak', 'korupsi', 'bukti', 'jgn', 'Ntar', 'malu', 'lho', 'situ'], 1), (['Pak', 'Ahok', 'Jadi', 'Super', 'Strong', 'deh', 'kalo', 'ChiBi', 'ChiBi', 'hihi', 'GOCherrybelleAHOK'], 1), (['Maaf', 'gubenur', 'kerja', 'kerja', 'kerja', 'jalan', 'lari', 'spt', 'mas'], 1), (['Selama', 'jkt', 'Toh', 'kembang', 'jkt', 'sjk', 'perintah'], 1), (['K

## Feature Selection
Feature selection scores each word with a chosen measure to decide whether it should become a feature in the classification step. This analysis uses the **Categorical Probability Proportion Difference** method.

### Categorical Probability Proportion Difference
Categorical Probability Proportion Difference (CPPD) is a method for selecting the words that become features for classification. It combines two measures: **Categorical Proportion Difference** (CPD) and **Probability Proportion Difference** (PPD).

The CPPD procedure is:
- Count how often each word occurs across all documents
- Count how often each word occurs in each class
- Compute the total number of terms overall, in the positive class, and in the negative class
- Compute the CPD and PPD values of each word
- Keep a word as a feature when both its CPD and its PPD reach the predetermined thresholds
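
As a worked sketch of the last two steps, the snippet below scores two words with the same formulas and thresholds (0.5 for CPD, 0.0002 for PPD) that the `FiturSeleksi` function later in this notebook uses. The word counts and vocabulary sizes are copied from the notebook's printed output; the names `cppd_scores`, `is_feature`, and the constants are hypothetical and introduced only for illustration.

```python
# Vocabulary sizes as printed later in this notebook
TOTAL_UNIQUE_TERMS = 4177  # unique terms over all tweets
POS_TERMS = 2390           # unique terms in positive tweets
NEG_TERMS = 2552           # unique terms in negative tweets

CPD_MIN = 0.5
PPD_MIN = 0.0002

def cppd_scores(n_pos, n_neg):
    """Return (cpd, ppd) for a word seen n_pos / n_neg times in each class."""
    cpd = abs(n_pos - n_neg) / (n_pos + n_neg)
    ppd = abs(n_pos / (POS_TERMS + TOTAL_UNIQUE_TERMS)
              - n_neg / (NEG_TERMS + TOTAL_UNIQUE_TERMS))
    return cpd, ppd

def is_feature(n_pos, n_neg):
    """A word is kept only when both scores reach their thresholds."""
    cpd, ppd = cppd_scores(n_pos, n_neg)
    return cpd >= CPD_MIN and ppd >= PPD_MIN

# (word, count in positive tweets, count in negative tweets)
for word, n_pos, n_neg in [("Pak", 81, 56), ("bus", 10, 45)]:
    cpd, ppd = cppd_scores(n_pos, n_neg)
    print(word, round(cpd, 3), round(ppd, 4), is_feature(n_pos, n_neg))
```

'Pak' is frequent in both classes, so its CPD is only about 0.18 and it is rejected. 'bus' occurs mostly in negative tweets, giving a CPD of about 0.64, and it is kept. This agrees with the feature list printed further down, which contains 'bus' but not 'Pak'.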

In [13]:
# Count how often each word occurs across all documents
from collections import Counter

counter = Counter()
for tweet, sentiment in tweet_bersih:
    counter.update(tweet)

In [14]:
# Helper for plotting word frequencies as a bar chart
import matplotlib.pyplot as plt
import numpy as np

def diagram_vocab(vocab):
    kata = list(zip(*vocab))[0]
    score = list(zip(*vocab))[1]
    x_pos = np.arange(len(kata))
    plt.bar(x_pos, score,align='center')
    plt.xticks(x_pos, kata) 
    plt.ylabel('Popularity Score')
    return plt.show()

In [15]:
vocab_all = counter.most_common()
# diagram = diagram_vocab(vocab_all)
# print (vocab_all)

# Keep the full vocabulary (words only, most frequent first)
vocab = [word for word, count in vocab_all]
print(vocab)

['Pak', 'gak', 'nya', 'Ahok', 'aja', 'macet', 'nih', 'jam', 'jalan', 'bus', 'arah', 'min', 'udah', 'Jakarta', 'banget', 'kalo', 'tolong', 'nunggu', 'halte', 'dukung', 'kasih', 'DKI', 'busway', 'ahok', 'kerja', 'bgt', 'utk', 'sih', 'pilih', 'JKTKerja', 'selamat', 'jalur', 'bikin', 'tugas', 'orang', 'moga', 'deh', 'tdk', 'parah', 'liat', 'klo', 'polisi', 'tol', 'tuh', 'pagi', 'blm', 'info', 'sampe', 'bantu', 'terima', 'Selamat', 'banjir', 'coba', 'jakarta', 'gin', 'anak', 'biar', 'koridor', 'yah', 'jgn', 'mohon', 'nyata', 'masuk', 'rute', 'gubernur', 'udh', 'kendara', 'BeraniBerubah', 'makasih', 'dgn', 'sdh', 'gerak', 'motor', 'menit', 'jkt', 'gimana', 'armada', 'buka', 'rakyat', 'maju', 'tumpang', 'calon', 'skrg', 'lampu', 'salah', 'kaya', 'ngga', 'lihat', 'bagus', 'kayak', 'mobil', 'yaa', 'kota', 'kantor', 'KTP', 'bersih', 'warga', 'merdekaataumacet', 'nyaman', 'tau', 'Indonesia', 'kena', 'lupa', 'krn', 'salam', 'koh', 'merah', 'Gubernur', 'gitu', 'kampanye', 'Jadi', 'Terima', 'kali', 

In [16]:
# Count how often each word occurs in the positive class
counter_pos = Counter()
for tweet, sentiment in tweet_bersih:
    if sentiment == 1:
        counter_pos.update(tweet)
vocab_pos = counter_pos.most_common()
# diagramPos = diagram_vocab(vocab_pos)
print(vocab_pos)

# Return the count of a word in the positive class, or None if it never occurs there
def HitungTermPos(word):
    return counter_pos.get(word)

[('Pak', 81), ('Ahok', 60), ('aja', 36), ('Jakarta', 33), ('kasih', 33), ('nya', 32), ('JKTKerja', 30), ('dukung', 30), ('selamat', 29), ('gak', 29), ('DKI', 25), ('ahok', 24), ('min', 24), ('kalo', 22), ('nih', 21), ('moga', 21), ('pilih', 20), ('jalan', 20), ('kerja', 20), ('terima', 19), ('Selamat', 19), ('BeraniBerubah', 17), ('banget', 17), ('utk', 17), ('orang', 17), ('udah', 15), ('gubernur', 14), ('deh', 14), ('bikin', 13), ('coba', 13), ('jakarta', 13), ('makasih', 13), ('merdekaataumacet', 13), ('tuh', 13), ('salam', 12), ('bersih', 12), ('info', 12), ('dgn', 12), ('calon', 12), ('biar', 12), ('jgn', 12), ('tugas', 12), ('semangat', 11), ('bagus', 11), ('layan', 11), ('Terima', 11), ('anak', 11), ('Semoga', 11), ('tdk', 10), ('you', 10), ('bus', 10), ('lupa', 10), ('nyaman', 10), ('sih', 10), ('liat', 9), ('banjir', 9), ('lihat', 9), ('kayak', 9), ('kampanye', 9), ('Gubernur', 9), ('pagi', 9), ('maju', 9), ('Indonesia', 9), ('bang', 9), ('situ', 9), ('jkt', 9), ('keren', 8), 

In [17]:
# Count how often each word occurs in the negative class
counter_neg = Counter()
for tweet, sentiment in tweet_bersih:
    if sentiment == 0:
        counter_neg.update(tweet)

vocab_neg = counter_neg.most_common()
# diagramNeg = diagram_vocab(vocab_neg)
print(vocab_neg)

# Return the count of a word in the negative class, or None if it never occurs there
def HitungTermNeg(kata):
    return counter_neg.get(kata)

[('macet', 59), ('jam', 57), ('Pak', 56), ('arah', 54), ('gak', 49), ('bus', 45), ('nih', 42), ('nya', 42), ('jalan', 39), ('tolong', 35), ('nunggu', 33), ('busway', 32), ('aja', 31), ('udah', 30), ('halte', 28), ('banget', 27), ('bgt', 26), ('parah', 25), ('min', 24), ('tol', 23), ('jalur', 23), ('polisi', 22), ('sih', 22), ('kalo', 20), ('koridor', 18), ('blm', 18), ('sampe', 17), ('gerak', 16), ('klo', 16), ('tugas', 16), ('tdk', 15), ('liat', 15), ('rute', 15), ('bantu', 15), ('bikin', 15), ('utk', 15), ('menit', 14), ('gin', 14), ('motor', 13), ('mobil', 13), ('udh', 13), ('pagi', 13), ('Jakarta', 12), ('gimana', 12), ('tumpang', 12), ('kantor', 12), ('yah', 12), ('lampu', 12), ('masuk', 12), ('armada', 12), ('kerja', 12), ('kendara', 12), ('banjir', 11), ('buka', 11), ('pintu', 11), ('padat', 11), ('tutup', 11), ('sore', 11), ('mohon', 11), ('deh', 11), ('bulus', 11), ('ngga', 10), ('pilih', 10), ('sdh', 10), ('merah', 10), ('pulang', 10), ('grogol', 10), ('tuh', 10), ('DKI', 10)

In [18]:
# Number of unique terms across all documents
TotalUniqueTerm = len(vocab)
print("Number of unique terms in all documents: %d" % TotalUniqueTerm)
print()

# Number of unique terms in the positive class
TotalTermPos = len(vocab_pos)
print("Number of unique terms in the positive class: %d" % TotalTermPos)
print()

# Number of unique terms in the negative class
TotalTermNeg = len(vocab_neg)
print("Number of unique terms in the negative class: %d" % TotalTermNeg)

Number of unique terms in all documents: 4177

Number of unique terms in the positive class: 2390

Number of unique terms in the negative class: 2552


In [19]:
import math

# Feature selection with CPD and PPD
def FiturSeleksi(vocab):
    fitur = []
    rata2cpd = 0
    rata2ppd = 0
    for a in vocab:
        # Occurrences of the word in each class (0 when absent)
        Ntp = HitungTermPos(a)
        if Ntp is None:
            Ntp = 0
        Ntn = HitungTermNeg(a)
        if Ntn is None:
            Ntn = 0
        cpd = (math.fabs(Ntp-Ntn))/(Ntp+Ntn)
        ppd = math.fabs((Ntp/(TotalTermPos+TotalUniqueTerm))-(Ntn/(TotalTermNeg+TotalUniqueTerm)))
        # Accumulate the sums so the caller can compute the mean CPD and PPD
        rata2cpd = rata2cpd + cpd
        rata2ppd = rata2ppd + ppd
        # Keep the word when both scores reach their thresholds
        if cpd >= 0.5 and ppd >= 0.0002:
            fitur.append(a)
    return fitur, rata2cpd, rata2ppd

In [20]:
# Select the features and report the mean CPD and PPD over the vocabulary
finalfitur, ratacpd, ratappd = FiturSeleksi(vocab)
meancpd = ratacpd/(len(vocab))
print(meancpd)
meanppd = ratappd/(len(vocab))
print(meanppd)

0.8660671927046611
0.0002427005712350993


In [21]:
print(len(finalfitur))

822


In [22]:
# Save the selected features, one per line
with open('hasilfitur.txt', 'w') as fiturseleksi:
    for item in finalfitur:
        fiturseleksi.write("%s\n" % item)

### Data to Vector
Feature selection yields the set of features that will enter the classification step. These features must first be converted into vectors so that the machine learning model used for classification can work with them.

The data is converted into vectors in three steps:
- Count the occurrences of each word in each document (tf)
- Weight each word (idf)
- Normalize the vectors
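
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code the notebook runs: the function name `tfidf_vectors`, the toy documents, and the three-word vocabulary are hypothetical, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` with Euclidean normalization is assumed because it is the default behavior scikit-learn documents for its tf-idf vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Return one L2-normalized tf-idf vector (list of floats) per document."""
    n_docs = len(docs)

    # Step 1: term frequency, restricted to the chosen vocabulary
    tf = []
    for doc in docs:
        counts = Counter(doc.split())
        tf.append([counts[term] for term in vocabulary])

    # Step 2: inverse document frequency (smoothed variant, an assumption here)
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    # Step 3: weight each count, then scale the vector to unit Euclidean length
    vectors = []
    for row in tf:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        vectors.append([v / norm if norm else 0.0 for v in weighted])
    return vectors

docs = ["terima kasih info", "macet parah macet", "info macet"]
vocabulary = ["kasih", "macet", "info"]
for vec in tfidf_vectors(docs, vocabulary):
    print([round(v, 3) for v in vec])
```

Words outside the vocabulary ('terima', 'parah') are ignored, which mirrors how the selected features restrict the vectors built in the cells below.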

In [23]:
tweetfinal = []
for a,b in tweet_bersih:
    tweetfinal.append(" ".join(a))
print(tweetfinal)

['blak blakan orang basa basi bicara Ahok', 'sdh jalan moga konsisten', 'Tapi tarik Menggusur inget puisi Tukang gusur tukang gusur tukang gusur akuin aja', 'kau gugat donk cepetan', 'CherrybellewithAhok Kalo Kolaborasi acara Bakalan topik dunia tuh', 'Wkwkwkwk kece koh KECEbong bingits', 'terima kasih info', 'pindah warga kumuh kotor miskin rusak RuSun sdh les', 'libur jepit kemaren udah coba', 'serang ahok mslh mulut aja mslh pajak korupsi bukti jgn Ntar malu lho situ', 'Pak Ahok Jadi Super Strong deh kalo ChiBi ChiBi hihi GOCherrybelleAHOK', 'Maaf gubenur kerja kerja kerja jalan lari spt mas', 'Selama jkt Toh kembang jkt sjk perintah', 'Kapan nih bikin KJKLM orang orang susah modal usaha KJKLM Kartu Jakarta Kagak Lagi Miskin', 'Dan Cherrybelle Full Formation with Pak hehehe Thankyou Pak Ahok CherrybellewithAhok', 'Kenang kenang Cherrybelle Pak nih Semoga manfaat yaa Pak CherrybellewithAhok', 'golput Dua pasang standar banding', 'terima kasih layan Pak', 'Yang bikin klip nya Mas', 'a

In [24]:
# Save the preprocessing result, one tweet per line
with open("hasilpreprocessing.csv", "w") as f:
    for tweet in tweetfinal:
        f.write(tweet + "\n")

In [25]:
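# Convert the data into count vectors with CountVectorizer()
from sklearn.feature_extraction.text import CountVectorizer

# Because a fixed vocabulary is supplied, min_df and max_df are ignored.
# CountVectorizer lowercases the text by default, so capitalized vocabulary
# entries such as 'Ahok' never match and their columns stay at zero.
vector_transformer = CountVectorizer(analyzer='word',  min_df = 0.1, max_df = 0.9, ngram_range=(1,1), vocabulary=finalfitur)
data_transformer = vector_transformer.fit_transform(tweetfinal)
print(data_transformer)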

  (0, 11)	1
  (0, 170)	1
  (1, 16)	1
  (1, 338)	1
  (2, 66)	3
  (3, 75)	1
  (3, 383)	1
  (6, 9)	1
  (6, 22)	1
  (7, 109)	1
  (7, 653)	1
  (7, 706)	1
  (7, 752)	1
  (8, 202)	1
  (9, 11)	1
  (9, 54)	1
  (9, 290)	1
  (9, 477)	1
  (9, 480)	1
  (10, 11)	1
  (11, 408)	1
  (12, 201)	1
  (12, 274)	1
  (13, 68)	1
  (13, 148)	1
  :	:
  (1392, 339)	1
  (1393, 199)	1
  (1393, 266)	1
  (1393, 411)	1
  (1393, 628)	1
  (1394, 573)	1
  (1395, 3)	2
  (1395, 4)	1
  (1396, 1)	1
  (1396, 30)	1
  (1396, 174)	1
  (1396, 180)	1
  (1396, 194)	1
  (1397, 5)	1
  (1397, 424)	1
  (1397, 576)	1
  (1398, 53)	1
  (1398, 115)	1
  (1398, 649)	2
  (1399, 1)	1
  (1399, 11)	1
  (1399, 21)	1
  (1399, 64)	1
  (1399, 531)	1
  (1399, 642)	1


In [26]:
vector_transformer.get_feature_names()

['Ahok',
 'macet',
 'jam',
 'bus',
 'arah',
 'tolong',
 'nunggu',
 'halte',
 'dukung',
 'kasih',
 'busway',
 'ahok',
 'bgt',
 'JKTKerja',
 'selamat',
 'jalur',
 'moga',
 'parah',
 'polisi',
 'tol',
 'blm',
 'sampe',
 'terima',
 'Selamat',
 'koridor',
 'rute',
 'gubernur',
 'udh',
 'BeraniBerubah',
 'makasih',
 'gerak',
 'motor',
 'menit',
 'gimana',
 'armada',
 'tumpang',
 'calon',
 'lampu',
 'bagus',
 'mobil',
 'kantor',
 'bersih',
 'merdekaataumacet',
 'nyaman',
 'lupa',
 'salam',
 'merah',
 'Gubernur',
 'kampanye',
 'Terima',
 'sore',
 'bulus',
 'layan',
 'pulang',
 'situ',
 'Semoga',
 'bang',
 'pintu',
 'bayar',
 'grogol',
 'semangat',
 'bekas',
 'henti',
 'padat',
 'tutup',
 'you',
 'gusur',
 'loh',
 'susah',
 'msh',
 'Wah',
 'berita',
 'menang',
 'minggu',
 'pakai',
 'donk',
 'keren',
 'pribadi',
 'masyarakat',
 'knp',
 'Udah',
 'lebak',
 'BSD',
 'Ada',
 'untung',
 'mata',
 'slip',
 'parkir',
 'Gak',
 'depok',
 'kasi',
 'Dan',
 'indah',
 'stuck',
 'antri',
 'sayang',
 'senen',
 '

In [27]:
data_transformer.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]], dtype=int64)

In [28]:
countvect = pd.DataFrame(data_transformer.toarray(), columns=vector_transformer.get_feature_names())
print(countvect)

      Ahok  macet  jam  bus  arah  tolong  nunggu  halte  dukung  kasih  ...   \
0        0      0    0    0     0       0       0      0       0      0  ...    
1        0      0    0    0     0       0       0      0       0      0  ...    
2        0      0    0    0     0       0       0      0       0      0  ...    
3        0      0    0    0     0       0       0      0       0      0  ...    
4        0      0    0    0     0       0       0      0       0      0  ...    
5        0      0    0    0     0       0       0      0       0      0  ...    
6        0      0    0    0     0       0       0      0       0      1  ...    
7        0      0    0    0     0       0       0      0       0      0  ...    
8        0      0    0    0     0       0       0      0       0      0  ...    
9        0      0    0    0     0       0       0      0       0      0  ...    
10       0      0    0    0     0       0       0      0       0      0  ...    
11       0      0    0    0 

In [29]:
countvect.to_csv('countvect.csv', sep=';')

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

# Binary weighting: 1 if the feature occurs in the tweet, 0 otherwise
def binary_weighting(fitur):
    vector = CountVectorizer(analyzer='word', binary=True, ngram_range=(1, 1), vocabulary=fitur)
    hasil = vector.fit_transform(tweetfinal)
    return hasil

In [31]:
vector_binary = binary_weighting(finalfitur)
print(vector_binary)

  (0, 11)	1
  (0, 170)	1
  (1, 16)	1
  (1, 338)	1
  (2, 66)	1
  (3, 75)	1
  (3, 383)	1
  (6, 9)	1
  (6, 22)	1
  (7, 109)	1
  (7, 653)	1
  (7, 706)	1
  (7, 752)	1
  (8, 202)	1
  (9, 11)	1
  (9, 54)	1
  (9, 290)	1
  (9, 477)	1
  (9, 480)	1
  (10, 11)	1
  (11, 408)	1
  (12, 201)	1
  (12, 274)	1
  (13, 68)	1
  (13, 148)	1
  :	:
  (1392, 339)	1
  (1393, 199)	1
  (1393, 266)	1
  (1393, 411)	1
  (1393, 628)	1
  (1394, 573)	1
  (1395, 3)	1
  (1395, 4)	1
  (1396, 1)	1
  (1396, 30)	1
  (1396, 174)	1
  (1396, 180)	1
  (1396, 194)	1
  (1397, 5)	1
  (1397, 424)	1
  (1397, 576)	1
  (1398, 53)	1
  (1398, 115)	1
  (1398, 649)	1
  (1399, 1)	1
  (1399, 11)	1
  (1399, 21)	1
  (1399, 64)	1
  (1399, 531)	1
  (1399, 642)	1


In [32]:
vector_binary.dtype
dense_binary = vector_binary.toarray()
DF_vector_binary = pd.DataFrame(dense_binary)
# print(DF_vector_binary)

X_bin = DF_vector_binary
# X_bin

In [33]:
# Save the binary weights
# X_bin.to_csv('binaryweighting.csv', sep=';')

#### Tf-Idf
Tf-idf is a term-weighting method that combines term frequency (tf) with inverse document frequency (idf). The resulting value indicates how important a term is within a document.
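
For reference, the weighting can be written out as below. This is the form scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True` and `norm='l2'` settings, which is assumed here to be what the cells below compute. The symbols are introduced for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, $\mathrm{tf}(t, d)$ the count of $t$ in document $d$, and $v_d$ the vector of tf-idf weights of document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
\hat{v}_d = \frac{v_d}{\lVert v_d \rVert_2}
$$

Because each vector is scaled to unit length, a tweet that contains only one selected feature gets the weight 1.0 for that feature, as several rows of the output below show.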


In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the tf-idf vectors for a given feature vocabulary
def tfidf(fitur):
    tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), use_idf = True, vocabulary=fitur)
    hasil = tfidf.fit_transform(tweetfinal)
    return hasil

In [35]:
hasil_tfidf = tfidf(finalfitur)
print(hasil_tfidf)

  (0, 170)	0.872711725412
  (0, 11)	0.488235849081
  (1, 338)	0.818676299676
  (1, 16)	0.574255271067
  (2, 66)	1.0
  (3, 383)	0.768904396898
  (3, 75)	0.63936376847
  (6, 22)	0.71965246131
  (6, 9)	0.694334454662
  (7, 752)	0.501293356912
  (7, 706)	0.522319822908
  (7, 653)	0.522319822908
  (7, 109)	0.450631751555
  (8, 202)	1.0
  (9, 480)	0.521981913519
  (9, 477)	0.500969050376
  (9, 290)	0.484670207473
  (9, 54)	0.414877776355
  (9, 11)	0.263697010219
  (10, 11)	1.0
  (11, 408)	1.0
  (12, 274)	0.716887023621
  (12, 201)	0.697189354024
  (13, 148)	0.737975105365
  (13, 68)	0.674827936486
  :	:
  (1392, 19)	0.341415176328
  (1393, 628)	0.527143123177
  (1393, 411)	0.505922490782
  (1393, 266)	0.489462489526
  (1393, 199)	0.476013689253
  (1394, 573)	1.0
  (1395, 4)	0.448879408034
  (1395, 3)	0.893592343881
  (1396, 194)	0.497372963623
  (1396, 180)	0.497372963623
  (1396, 174)	0.485491944374
  (1396, 30)	0.417103881794
  (1396, 1)	0.309131353722
  (1397, 576)	0.659480905258
  (1397,

In [36]:
hasil_tfidf.shape

(1400, 822)

In [37]:
# Print the tf-idf vectors
hasil_tfidf.dtype
dense = hasil_tfidf.toarray()
DF_tfidf = pd.DataFrame(dense)
print(DF_tfidf)

      0         1         2         3         4         5    6    7    8    \
0     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
1     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
2     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
3     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
4     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
5     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
6     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
7     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
8     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
9     0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
10    0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0   
11    0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.0

In [38]:

# tfidfvect = pd.DataFrame(hasil.toarray(), columns=vector_transformer.get_feature_names())

Preparing the data

In [39]:
# Prepare the labels for training the model
import numpy as np
sentimen = [b for a, b in tweet_bersih]

In [40]:
print(sentimen)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [41]:
X = DF_tfidf
y = np.asarray(sentimen)

In [42]:
# Note: sklearn.grid_search was removed in scikit-learn 0.20; newer releases
# provide GridSearchCV in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Prepare the dataset: preprocessed tweets and their sentiment labels
X = tweetfinal
y = np.asarray(sentimen)

# Configure the classifier: TF-IDF weighting followed by Multinomial Naive Bayes
clf = Pipeline([
        ('tfidf',TfidfVectorizer(analyzer='word', ngram_range=(1,1), use_idf = True, vocabulary=finalfitur)),
        ('clf', MultinomialNB(alpha = 0.1))
    ])

params = {
    # 'tfidf__max_df': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'tfidf__min_df': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    # 'vect__ngram_range': ((1, 1), (1, 2)),
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    # Only the smoothing parameter alpha is tuned
    'clf__alpha': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__n_iter': (10, 50, 80),
}

grid = GridSearchCV(
    clf,
    params,
    cv=10,
)

clf = grid.fit(X,y)

print ("\nBest estimator:")
print()
print (clf.best_estimator_)

# print ("\nBest score:")
# print()
# print(clf.best_score_)

print ("\nGrid score MNB:")
print()
for params, mean_score, scores in clf.grid_scores_:
    print ("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params))
print()


Best estimator:

Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ..., 'Cuti', 'nTerima', 'boss'])), ('clf', MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))])

Grid score MNB:

0.804 (+/-0.008) for {'clf__alpha': 0.1}
0.802 (+/-0.010) for {'clf__alpha': 0.2}
0.801 (+/-0.010) for {'clf__alpha': 0.3}
0.798 (+/-0.010) for {'clf__alpha': 0.4}
0.798 (+/-0.011) for {'clf__alpha': 0.5}
0.795 (+/-0.010) for {'clf__alpha': 0.6}
0.794 (+/-0.011) for {'clf__alpha': 0.7}
0.791 (+/-0.010) for {'clf__alpha': 0.8}
0.790 (+/-0.011) for {'clf__alpha': 0.9}
0.791 (+/-0.011) for {'clf__alpha': 1.0}
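
The grid above varies only `alpha`, the additive smoothing parameter of the Naive Bayes classifier. As a rough illustration of what this parameter controls, the sketch below computes a smoothed word likelihood for a few values of `alpha`. The counts (50 tokens, 20-word vocabulary) and the helper name `smoothed_likelihood` are hypothetical and are not taken from the tweet data.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha):
    """Additive (Lidstone) estimate of P(word | class); alpha=1.0 is Laplace smoothing."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical class: 50 tokens in total, spread over a 20-word vocabulary.
word_counts = [30, 20] + [0] * 18

for alpha in (0.1, 0.5, 1.0):
    unseen = smoothed_likelihood(0, 50, 20, alpha)
    frequent = smoothed_likelihood(30, 50, 20, alpha)
    print("alpha=%.1f  P(unseen)=%.4f  P(frequent)=%.4f" % (alpha, unseen, frequent))
```

A smaller `alpha` keeps the estimates closer to the raw frequencies, while a larger `alpha` moves probability mass towards words never seen in the class.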



In [43]:
# Note: sklearn.grid_search was removed in scikit-learn 0.20; newer releases
# provide GridSearchCV in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the dataset: preprocessed tweets and their sentiment labels
X = tweetfinal
y = np.asarray(sentimen)

# Configure the classifier: binary term weighting followed by Multinomial Naive Bayes
clf = Pipeline([
        ('vector',CountVectorizer(analyzer='word', binary=True, ngram_range=(1, 1), vocabulary=finalfitur)),
        ('clf', MultinomialNB(alpha = 0.1))
    ])

params = {
    # 'vector__max_df': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'vector__min_df': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    # 'vect__ngram_range': ((1, 1), (1, 2)),
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    # Only the smoothing parameter alpha is tuned
    'clf__alpha': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__n_iter': (10, 50, 80),
}

grid = GridSearchCV(
    clf,
    params,
    cv=10,
)

clf = grid.fit(X,y)

print("\nBest estimator:")
print()
print(clf.best_estimator_)

# print ("\nBest score:")
# print()
# print(clf.best_score_)


print("\nGrid score BMNB:")
print()
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params))
print()


Best estimator:

Pipeline(steps=[('vector', CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri..., 'Cuti', 'nTerima', 'boss'])), ('clf', MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))])

Grid score BMNB:

0.804 (+/-0.010) for {'clf__alpha': 0.1}
0.801 (+/-0.012) for {'clf__alpha': 0.2}
0.799 (+/-0.012) for {'clf__alpha': 0.3}
0.800 (+/-0.012) for {'clf__alpha': 0.4}
0.799 (+/-0.013) for {'clf__alpha': 0.5}
0.796 (+/-0.013) for {'clf__alpha': 0.6}
0.793 (+/-0.013) for {'clf__alpha': 0.7}
0.791 (+/-0.013) for {'clf__alpha': 0.8}
0.791 (+/-0.013) for {'clf__alpha': 0.9}
0.793 (+/-0.013) for {'clf__alpha': 1.0}



In [44]:
# Note: sklearn.grid_search was removed in scikit-learn 0.20; newer releases
# provide GridSearchCV in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the dataset: preprocessed tweets and their sentiment labels
X = tweetfinal
y = np.asarray(sentimen)

# Configure the classifier: binary term weighting followed by Bernoulli Naive Bayes
clf = Pipeline([
        ('vector',CountVectorizer(analyzer='word', binary=True, ngram_range=(1, 1), vocabulary=finalfitur)),
        ('clf', BernoulliNB(alpha = 0.1))
    ])

params = {
    # 'vector__max_df': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'vector__min_df': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    # 'vect__ngram_range': ((1, 1), (1, 2)),
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    # Only the smoothing parameter alpha is tuned
    'clf__alpha': (0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0),
    # 'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__n_iter': (10, 50, 80),
}

grid = GridSearchCV(
    clf,
    params,
    cv=10,
)

clf = grid.fit(X,y)

print("\nBest estimator:")
print()
print(clf.best_estimator_)

# print ("\nBest score:")
# print()
# print(clf.best_score_)

print()
print("\nGrid score BNB:")
print()
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params))
print()


Best estimator:

Pipeline(steps=[('vector', CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...Terima', 'boss'])), ('clf', BernoulliNB(alpha=0.1, binarize=0.0, class_prior=None, fit_prior=True))])


Grid score BNB:

0.827 (+/-0.026) for {'clf__alpha': 0.1}
0.822 (+/-0.028) for {'clf__alpha': 0.2}
0.822 (+/-0.029) for {'clf__alpha': 0.3}
0.820 (+/-0.029) for {'clf__alpha': 0.4}
0.819 (+/-0.030) for {'clf__alpha': 0.5}
0.814 (+/-0.031) for {'clf__alpha': 0.6}
0.801 (+/-0.028) for {'clf__alpha': 0.7}
0.800 (+/-0.028) for {'clf__alpha': 0.8}
0.800 (+/-0.028) for {'clf__alpha': 0.9}
0.800 (+/-0.029) for {'clf__alpha': 1.0}
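
The three grids compare two event models. As a simplified sketch of how they differ, the code below scores one document under a multinomial model, where only the words that occur contribute, and under a Bernoulli model, where every vocabulary word contributes through its presence or absence. The toy documents, the helper names, and the omission of class priors (the toy classes are balanced) are assumptions made for illustration. This is not the scikit-learn implementation.

```python
import math

def fit_counts(docs, labels, vocab):
    """Per class: number of documents, token counts and document frequencies."""
    stats = {}
    for tokens, label in zip(docs, labels):
        s = stats.setdefault(label, {"n_docs": 0,
                                     "tf": dict.fromkeys(vocab, 0),
                                     "df": dict.fromkeys(vocab, 0)})
        s["n_docs"] += 1
        for w in tokens:
            if w in s["tf"]:
                s["tf"][w] += 1
        for w in set(tokens):
            if w in s["df"]:
                s["df"][w] += 1
    return stats

def multinomial_loglik(tokens, s, vocab, alpha):
    """Sum of log P(w | class) over the words that occur in the document."""
    denom = sum(s["tf"].values()) + alpha * len(vocab)
    return sum(math.log((s["tf"][w] + alpha) / denom) for w in tokens if w in s["tf"])

def bernoulli_loglik(tokens, s, vocab, alpha):
    """Every vocabulary word contributes: log p if present, log(1 - p) if absent."""
    present = set(tokens)
    loglik = 0.0
    for w in vocab:
        p = (s["df"][w] + alpha) / (s["n_docs"] + 2 * alpha)
        loglik += math.log(p if w in present else 1.0 - p)
    return loglik

# Hypothetical toy corpus: label 1 = positive, label 0 = negative.
docs = [["good", "fast"], ["good", "friendly"], ["jam", "slow"], ["jam", "dirty"]]
labels = [1, 1, 0, 0]
vocab = sorted({w for d in docs for w in d})
stats = fit_counts(docs, labels, vocab)

for name, loglik in (("multinomial", multinomial_loglik), ("bernoulli", bernoulli_loglik)):
    scores = {c: loglik(["good"], stats[c], vocab, 0.1) for c in stats}
    print(name, "predicts class", max(scores, key=scores.get))
```

Because the Bernoulli model also penalises the absence of words that are typical for a class, it can rank short documents such as tweets differently from the multinomial model even when both see the same binary features.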



In [45]:
# from sklearn.naive_bayes import MultinomialNB
# # Measure execution time for the Multinomial Naive Bayes model
# %time clf = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True).fit(X, y)

In [46]:
# from sklearn.naive_bayes import MultinomialNB
# # Measure execution time for the Binary Multinomial Naive Bayes model
# %time clf = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True).fit(X_bin, y)

In [47]:
# from sklearn.naive_bayes import BernoulliNB
# # Measure execution time for the Bernoulli Naive Bayes model
# %time clfBernoulli = BernoulliNB(alpha=0.1, binarize=0.0, class_prior=None, fit_prior=True).fit(X, y)

In [48]:
from sklearn.pipeline import Pipeline
import pickle

def model_multinomialnb(x,y):
    """Train the TF-IDF + Multinomial Naive Bayes pipeline and return it as pickled bytes."""
    clf = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
    # min_df/max_df are ignored by scikit-learn when a fixed vocabulary is supplied
    tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0.1, max_df = 0.1, use_idf = True, vocabulary=finalfitur)
    model_multinomial = Pipeline([('tfidf', tfidf),('naive_bayes', clf)])
    model_multinomial.fit(x,y)
    s = pickle.dumps(model_multinomial)
    return s

In [49]:
def model_multinomialnb_bin(x,y):
    """Train the binary-weighting + Multinomial Naive Bayes pipeline and return it as pickled bytes."""
    clf = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
    # min_df/max_df are ignored by scikit-learn when a fixed vocabulary is supplied
    vector = CountVectorizer(analyzer='word', binary=True, ngram_range=(1, 1), min_df = 0.1, max_df = 0.1, vocabulary=finalfitur)
    model_multinomial_bin = Pipeline([('binary_weighting', vector),('naive_bayes', clf)])
    model_multinomial_bin.fit(x,y)
    s = pickle.dumps(model_multinomial_bin)
    return s

In [50]:
def model_bernoullinb(x,y):
    """Train the binary-weighting + Bernoulli Naive Bayes pipeline and return it as pickled bytes."""
    clf = BernoulliNB(alpha=0.1, binarize=0.0, class_prior=None, fit_prior=True)
    # min_df/max_df are ignored by scikit-learn when a fixed vocabulary is supplied
    vector = CountVectorizer(analyzer='word', binary=True, ngram_range=(1, 1), min_df = 0.1, max_df = 0.1, vocabulary=finalfitur)
    model_bernoulli = Pipeline([('binary_weighting', vector),('naive_bayes', clf)])
    model_bernoulli.fit(x,y)
    s = pickle.dumps(model_bernoulli)
    return s

In [51]:
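def tfidf_weighting(fitur):
    """Fit a TF-IDF vectorizer on the training tweets using a fixed vocabulary."""
    # min_df/max_df are ignored by scikit-learn when a fixed vocabulary is supplied
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0.1, max_df = 0.1, use_idf = True, vocabulary=fitur)
    hasil = vectorizer.fit_transform(tweetfinal)
    return hasil, vectorizer

def binary_weighting(fitur):
    """Fit a binary (presence/absence) vectorizer on the training tweets using a fixed vocabulary."""
    vector = CountVectorizer(analyzer='word', binary=True, ngram_range=(1, 1), min_df = 0.1, max_df = 0.1, vocabulary=fitur)
    hasil = vector.fit_transform(tweetfinal)
    return hasil, vector

vector_binary, binary = binary_weighting(finalfitur)
# The helper is named tfidf_weighting so the `tfidf` vectorizer below does not overwrite it
vector_tfidf, tfidf = tfidf_weighting(finalfitur)

# Dense tables of the binary and TF-IDF weights
DF_binary = pd.DataFrame(vector_binary.toarray())
DF_tfidf = pd.DataFrame(vector_tfidf.toarray())


y = np.asarray(sentimen)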

In [52]:
binary_vector = pd.DataFrame(vector_binary.toarray(), columns=binary.get_feature_names())
tfidf_vector = pd.DataFrame(vector_tfidf.toarray(), columns=tfidf.get_feature_names())

In [53]:
# Save the TF-IDF and binary weights
tfidf_vector.to_csv('tfidf_weighting.csv', sep=';')
binary_vector.to_csv('binaryweighting.csv', sep=';')
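
To make the two weighting schemes saved above more concrete, here is a small pure-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches the `smooth_idf=True` and `norm='l2'` settings shown in the pipeline output earlier. The toy documents and the function names are hypothetical.

```python
import math

def binary_weights(doc_tokens, vocab):
    """1 if the vocabulary word occurs in the document, otherwise 0."""
    present = set(doc_tokens)
    return [1 if w in present else 0 for w in vocab]

def tfidf_weights(docs, vocab):
    """Term frequency times smoothed IDF, then L2-normalised per document."""
    n = len(docs)
    df = [sum(1 for d in docs if w in d) for w in vocab]
    idf = [math.log((1 + n) / (1 + f)) + 1 for f in df]
    rows = []
    for d in docs:
        raw = [d.count(w) * i for w, i in zip(vocab, idf)]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return rows

# Hypothetical tokenised documents
docs = [["service", "good"], ["service", "slow"], ["service", "good", "good"]]
vocab = ["good", "service", "slow"]

rows = tfidf_weights(docs, vocab)
for d, row in zip(docs, rows):
    print(binary_weights(d, vocab), [round(v, 3) for v in row])
```

In this toy corpus the word that occurs in every document ("service") receives a lower TF-IDF weight than a rarer word in the same document, whereas the binary scheme treats both identically.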

## K-Fold Cross Validation
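
As a minimal sketch of the idea behind K-fold cross-validation, the code below shuffles the sample indices once and lets each of the 10 folds serve as the test set exactly once. The sample count of 1400 follows from the 700 positive and 700 negative tweets loaded earlier. The function name `k_fold_indices` is hypothetical, and the shuffle will not reproduce the exact folds drawn by scikit-learn's `KFold`.

```python
import random

def k_fold_indices(n_samples, n_folds, seed):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        # The first `extra` folds take one additional sample
        size = base + (1 if i < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(1400, 10, seed=9999)
for k, (train, test) in enumerate(folds):
    print("fold %d: %d training samples, %d test samples" % (k, len(train), len(test)))
```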

In [54]:
from sklearn.cross_validation import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def evaluasi_multinomialnb(X,y):
    """Run 10-fold cross-validation of MultinomialNB and return the per-fold scores."""
    kf = KFold(y.shape[0], n_folds=10, shuffle=True, random_state=9999)
    accuracy = []
    fscore = []
    pscore = []
    rscore = []

    for k, (index_train,index_test) in enumerate(kf):
        # KFold yields positional indices, so rows are selected with .iloc
        X_train = X.iloc[index_train,:]
        X_test = X.iloc[index_test,:]
        y_train = y[index_train]
        y_test = y[index_test]
        clf = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True).fit(X_train, y_train)
        # Predict once and reuse the predictions for every metric
        y_pred = clf.predict(X_test)
        score = accuracy_score(y_test, y_pred)
        f1score = f1_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        accuracy.append(score)
        fscore.append(f1score)
        pscore.append(precision)
        rscore.append(recall)
        print('Model %d has accuracy %f with | f1score: %f | precision: %f | recall: %f '%(k,score,f1score,precision,recall))
    return accuracy, fscore, pscore, rscore
# evaluasi_multinomialnb(X,y)
# # hasil_multinomialnb = evaluasi_multinomialnb(X,y)
%time accuracy_multinomial, f1score_multinomial, pscore_multinomial, rscore_multinomial = evaluasi_multinomialnb(DF_tfidf,y)

# # print(accurac)

# Average each metric over the 10 folds
print("average accuracy of the multinomial naive bayes algorithm: %f" % np.mean(accuracy_multinomial))
print("average f1-score of the multinomial naive bayes algorithm: %f" % np.mean(f1score_multinomial))
print("average precision of the multinomial naive bayes algorithm: %f" % np.mean(pscore_multinomial))
print("average recall of the multinomial naive bayes algorithm: %f" % np.mean(rscore_multinomial))

Model 0 has accuracy 0.835714 with | f1score: 0.839161 | precision: 0.895522 | recall: 0.789474 
Model 1 has accuracy 0.835714 with | f1score: 0.843537 | precision: 0.925373 | recall: 0.775000 
Model 2 has accuracy 0.757143 with | f1score: 0.784810 | precision: 0.696629 | recall: 0.898551 
Model 3 has accuracy 0.842857 with | f1score: 0.845070 | precision: 0.789474 | recall: 0.909091 
Model 4 has accuracy 0.778571 with | f1score: 0.766917 | precision: 0.864407 | recall: 0.689189 
Model 5 has accuracy 0.750000 with | f1score: 0.761905 | precision: 0.700000 | recall: 0.835821 
Model 6 has accuracy 0.771429 with | f1score: 0.774648 | precision: 0.696203 | recall: 0.873016 
Model 7 has accuracy 0.842857 with | f1score: 0.838235 | precision: 0.904762 | recall: 0.780822 
Model 8 has accuracy 0.857143 with | f1score: 0.852941 | precision: 0.935484 | recall: 0.783784 
Model 9 has accuracy 0.842857 with | f1score: 0.825397 | precision: 0.764706 | recall: 0.896552 
Wall time: 614 ms
rata-rata ak
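
The per-fold figures above all derive from the four cells of a confusion matrix. The sketch below recomputes them for one fold. The counts (60 true positives, 7 false positives, 16 false negatives, 57 true negatives) are not printed by the notebook. They are inferred here from the ratios reported for Model 0 and a test fold of 140 tweets, so treat them as an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Inferred (assumed) confusion counts for fold 0 of the TF-IDF multinomial run
accuracy, precision, recall, f1 = classification_metrics(tp=60, fp=7, fn=16, tn=57)
print("accuracy: %f | f1score: %f | precision: %f | recall: %f" % (accuracy, f1, precision, recall))
```

Folds with high precision but lower recall, such as this one, are those where the classifier misses positive tweets more often than it mislabels negative ones.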

In [55]:
%time accuracy_multinomial_bin,f1score_multinomial_bin, pscore_multinomial_bin, rscore_multinomial_bin = evaluasi_multinomialnb(DF_binary,y)

# Average each metric over the 10 folds
print("average accuracy of the binary multinomial naive bayes algorithm: %f" % np.mean(accuracy_multinomial_bin))
print("average f1-score of the binary multinomial naive bayes algorithm: %f" % np.mean(f1score_multinomial_bin))
print("average precision of the binary multinomial naive bayes algorithm: %f" % np.mean(pscore_multinomial_bin))
print("average recall of the binary multinomial naive bayes algorithm: %f" % np.mean(rscore_multinomial_bin))

Model 0 has accuracy 0.842857 with | f1score: 0.845070 | precision: 0.909091 | recall: 0.789474 
Model 1 has accuracy 0.835714 with | f1score: 0.843537 | precision: 0.925373 | recall: 0.775000 
Model 2 has accuracy 0.757143 with | f1score: 0.784810 | precision: 0.696629 | recall: 0.898551 
Model 3 has accuracy 0.850000 with | f1score: 0.851064 | precision: 0.800000 | recall: 0.909091 
Model 4 has accuracy 0.792857 with | f1score: 0.781955 | precision: 0.881356 | recall: 0.702703 
Model 5 has accuracy 0.750000 with | f1score: 0.761905 | precision: 0.700000 | recall: 0.835821 
Model 6 has accuracy 0.771429 with | f1score: 0.774648 | precision: 0.696203 | recall: 0.873016 
Model 7 has accuracy 0.842857 with | f1score: 0.838235 | precision: 0.904762 | recall: 0.780822 
Model 8 has accuracy 0.864286 with | f1score: 0.861314 | precision: 0.936508 | recall: 0.797297 
Model 9 has accuracy 0.835714 with | f1score: 0.816000 | precision: 0.761194 | recall: 0.879310 
Wall time: 169 ms
rata-rata ak

In [56]:
from sklearn.naive_bayes import BernoulliNB

def evaluasi_bernoullinb(X,y):
    """Run 10-fold cross-validation of BernoulliNB and return the per-fold scores."""
    kf = KFold(y.shape[0], n_folds=10, shuffle=True, random_state=9999)
    accuracy = []
    fscore = []
    pscore = []
    rscore = []

    for k, (index_train,index_test) in enumerate(kf):
        # KFold yields positional indices, so rows are selected with .iloc
        X_train = X.iloc[index_train,:]
        X_test = X.iloc[index_test,:]
        y_train = y[index_train]
        y_test = y[index_test]
        clf = BernoulliNB(alpha=0.1, binarize=None, class_prior=None, fit_prior=True).fit(X_train, y_train)
        # Predict once and reuse the predictions for every metric
        y_pred = clf.predict(X_test)
        score = accuracy_score(y_test, y_pred)
        f1score = f1_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        accuracy.append(score)
        fscore.append(f1score)
        pscore.append(precision)
        rscore.append(recall)
        print('Model %d has accuracy %f with | f1score: %f | precision: %f | recall: %f '%(k,score,f1score,precision,recall))
    return accuracy, fscore, pscore, rscore
%time accuracy_bernoulli,f1score_bernoulli, pscore_bernoulli, rscore_bernoulli = evaluasi_bernoullinb(DF_binary,y)

# Average each metric over the 10 folds
print("average accuracy of the bernoulli naive bayes algorithm: %f" % np.mean(accuracy_bernoulli))
print("average f1-score of the bernoulli naive bayes algorithm: %f" % np.mean(f1score_bernoulli))
print("average precision of the bernoulli naive bayes algorithm: %f" % np.mean(pscore_bernoulli))
print("average recall of the bernoulli naive bayes algorithm: %f" % np.mean(rscore_bernoulli))

Model 0 has accuracy 0.850000 with | f1score: 0.869565 | precision: 0.823529 | recall: 0.921053 
Model 1 has accuracy 0.914286 with | f1score: 0.928571 | precision: 0.886364 | recall: 0.975000 
Model 2 has accuracy 0.750000 with | f1score: 0.779874 | precision: 0.688889 | recall: 0.898551 
Model 3 has accuracy 0.842857 with | f1score: 0.842857 | precision: 0.797297 | recall: 0.893939 
Model 4 has accuracy 0.878571 with | f1score: 0.893082 | precision: 0.835294 | recall: 0.959459 
Model 5 has accuracy 0.742857 with | f1score: 0.756757 | precision: 0.691358 | recall: 0.835821 
Model 6 has accuracy 0.778571 with | f1score: 0.783217 | precision: 0.700000 | recall: 0.888889 
Model 7 has accuracy 0.892857 with | f1score: 0.901961 | precision: 0.862500 | recall: 0.945205 
Model 8 has accuracy 0.842857 with | f1score: 0.855263 | precision: 0.833333 | recall: 0.878378 
Model 9 has accuracy 0.835714 with | f1score: 0.818898 | precision: 0.753623 | recall: 0.896552 
Wall time: 158 ms
rata-rata ak

In [57]:
# fig1 = plt.figure()
# fig1.suptitle('Accuracy plot', fontsize=20)
# plt.plot(accuracy_multinomial, 'r')
# plt.plot(accuracy_multinomial_bin, 'g')
# plt.plot(accuracy_bernoulli, 'b')  
# plt.show()

In [58]:
# fig2 = plt.figure()
# fig2.suptitle('F-score plot', fontsize=20)
# plt.plot(f1score_multinomial, 'r')
# plt.plot(f1score_multinomial_bin, 'g')
# plt.plot(f1score_bernoulli, 'b')  
# plt.show()

In [59]:
# fig3 = plt.figure()
# fig3.suptitle('Precision plot', fontsize=20)
# plt.plot(pscore_multinomial, 'r')
# plt.plot(pscore_multinomial_bin, 'g')
# plt.plot(pscore_bernoulli, 'b')  
# plt.show()

In [None]:
# fig4 = plt.figure()
# fig4.suptitle('Recall plot', fontsize=20)
# plt.plot(rscore_multinomial, 'r')
# plt.plot(rscore_multinomial_bin, 'g')
# plt.plot(rscore_bernoulli, 'b')  
# plt.show()

## TESTING

In [None]:
# Testing
import pickle

file = input("Enter the CSV file name: ")
print("Importing Data Test", flush = True)
TestData = data(file)
Testing = TestData.create_df()
print("done")

print("Importing Tweet", flush = True)
# Keep only the tweet text; the second column is not used for prediction
tweet_data = [tweet for tweet, _ in Testing.values]
print("done")

print("Preprocessing", flush = True)
# Apply the preprocessing steps in order: cleansing, stemming, tokenizing,
# filtering and stopword removal
langkah_preprocessing = ('cleansing', 'stemming', 'tokenize', 'filtering', 'hapus_stopword')
tweet_testing = []
for tweet in tweet_data:
    for langkah in langkah_preprocessing:
        tweet = getattr(preprocessing(tweet), langkah)()
    # Join the remaining tokens back into one string for the vectorizer
    tweet_testing.append(" ".join(tweet))
print("done")

print("Predict", flush = True)
model_multinomial = pickle.loads(model_multinomialnb(tweetfinal,y))
model_multinomial_bin = pickle.loads(model_multinomialnb_bin(tweetfinal,y))
model_bernoulli = pickle.loads(model_bernoullinb(tweetfinal,y))

predict_multinomial = model_multinomial.predict(tweet_testing)
predict_multinomial_bin = model_multinomial_bin.predict(tweet_testing)
predict_bernoulli = model_bernoulli.predict(tweet_testing)

print("done")


In [None]:
print(tweet_data)

In [None]:
import timeit

time = []

start_multinomial_timer = timeit.default_timer()
predict_multinomial = model_multinomial.predict(tweet_testing)
stop_multinomial_timer = timeit.default_timer()
timer_multinomial = stop_multinomial_timer - start_multinomial_timer
time.append(timer_multinomial)

start_multinomialbin_timer = timeit.default_timer()
predict_multinomial_bin = model_multinomial_bin.predict(tweet_testing)
stop_multinomialbin_timer = timeit.default_timer()
timer_multinomial_bin = stop_multinomialbin_timer - start_multinomialbin_timer
time.append(timer_multinomial_bin)

start_bernoulli_timer = timeit.default_timer()
predict_bernoulli = model_bernoulli.predict(tweet_testing)
stop_bernoulli_timer = timeit.default_timer()
timer_bernoulli = stop_bernoulli_timer - start_bernoulli_timer
time.append(timer_bernoulli)

print(time)
# x_pos = np.arange(len(time))
# plt.bar(time,x_pos)
# plt.show()

In [None]:
import matplotlib.pyplot as plt

def pie_chart(df, y):
    """Draw a pie chart from sentiment counts given in the order (negative, positive).

    The second argument is unused and is kept only so that existing calls still work.
    """
    labels = 'Negative', 'Positive'
    fig, ax = plt.subplots()
    ax.pie(df, explode=None, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90)
    ax.axis('equal')
    plt.title('chart')
    return plt

In [None]:
# fig = pl.figure()
# ax = pl.subplot(111)
# ax.bar(dates, values, width=100)
# ax.xaxis_date()

In [None]:
# Prepare the test data: drop the label column so only the tweets remain
Testing = Testing.drop('Sentimen', axis=1)

In [None]:
df_result_multinomial = pd.DataFrame(predict_multinomial)
df_result_multinomial_bin = pd.DataFrame(predict_multinomial_bin)
df_result_bernoulli = pd.DataFrame(predict_bernoulli)
final_multinomial = pd.concat([Testing, df_result_multinomial], axis=1)
final_multinomial_bin = pd.concat([Testing, df_result_multinomial_bin], axis=1)
final_bernoulli = pd.concat([Testing, df_result_bernoulli], axis=1)
final_multinomial.columns = ['Tweet', 'Sentimen']
final_multinomial_bin.columns = ['Tweet', 'Sentimen']
final_bernoulli.columns = ['Tweet', 'Sentimen']

In [None]:
print(final_multinomial)
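# Sort by label so the counts are ordered (negative, positive), matching the pie chart labels
df_sort_multinomial = final_multinomial['Sentimen'].value_counts().sort_index()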
print(df_sort_multinomial)
# df_sort_multinomial.values


In [None]:
# print(final_multinomial_bin)
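# Sort by label so the counts are ordered (negative, positive), matching the pie chart labels
df_sort_multinomial_bin = final_multinomial_bin['Sentimen'].value_counts().sort_index()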
print(df_sort_multinomial_bin)
# df_sort_multinomial_bin.values


In [None]:
print(final_bernoulli)
# Sort by label so the counts are ordered (negative, positive), matching the pie chart labels
df_sort_bernoulli = final_bernoulli['Sentimen'].value_counts().sort_index()
print(df_sort_bernoulli)

In [None]:
pie_chart(df_sort_multinomial.values, y)
pie_chart(df_sort_multinomial_bin.values, y)
pie_chart(df_sort_bernoulli.values, y)

plt.show()

In [None]:
search_kata = input("Search: ")
search_df_multinomial = final_multinomial[final_multinomial.Tweet.str.contains(search_kata)]
search_df_multinomial_bin = final_multinomial_bin[final_multinomial_bin.Tweet.str.contains(search_kata)]
search_df_bernoulli = final_bernoulli[final_bernoulli.Tweet.str.contains(search_kata)]
print(search_df_multinomial)
print(search_df_multinomial_bin)
print(search_df_bernoulli)

def hitung_sentimen(df):
    """Return [negative_count, positive_count] for a table of predictions."""
    positif = int((df['Sentimen'] == 1).sum())
    return [len(df) - positif, positif]

sizes_multinomial = hitung_sentimen(search_df_multinomial)
sizes_multinomial_bin = hitung_sentimen(search_df_multinomial_bin)
sizes_bernoulli = hitung_sentimen(search_df_bernoulli)


fig = plt.figure()

explode = (0, 0)
labels = 'Negative', 'Positive'

# fig, ax = plt.subplots(1,2,1)
ax1 = fig.add_axes([.1, .5, .4, .4], aspect=1)
ax1.pie(sizes_multinomial, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=False, startangle=90, radius = .3)
ax1.axis('equal')
plt.title('MultinomialNB')

# fig, ax = plt.subplots(1,2,2)
ax2 = fig.add_axes([.1, .05, .4, .4], aspect=1)
ax2.pie(sizes_multinomial_bin, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=False, startangle=90, radius = .3)
ax2.axis('equal')
plt.title('Binary MultinomialNB')

# # fig, ax = plt.subplots(1,2,3)
ax3 = fig.add_axes([.5, .5, .4, .4], aspect=1)
ax3.pie(sizes_bernoulli, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=False, startangle=90, radius = .3)
ax3.axis('equal')
plt.title('BernoulliNB')

plt.show()