
This notebook applies a text classification model to Twitter data, using a Naive Bayes classifier to label each tweet as prostitution-related (True) or not prostitution-related (False).


The dataset consists of training data and test data. There are 40,000 training tweets, split evenly into 20,000 True (prostitution) and 20,000 False (not prostitution) tweets, plus 10,000 tweets set aside as test data.

The evaluation measures the classification accuracy of the Naive Bayes classifier (NBC) and summarizes the results in a confusion matrix.

First, import the Python modules that will be used.

In [0]:
import pandas as pd
import string
import numpy as np
import nltk
import re
import random
from nltk.tokenize import word_tokenize
from nltk import FreqDist,classify, NaiveBayesClassifier
from sklearn.metrics import confusion_matrix,classification_report
from collections import Counter

Load the training and test data.

In [0]:
data_train_true = pd.read_excel('../content/twitter-prostitute.xlsx')
data_train_false = pd.read_excel('../content/twitter-not-prostitute.xlsx')
data_test = pd.read_csv('../content/random-tweet-06052020.csv')

In [28]:
print(data_train_true['tweet'].head())

0    AvaiL BO yaa beb😙\nWA 0831 9315 9762\n#AvailJo...
1    Include exclude Ready beb \nWa 0831 9315 9762\...
2    AvaiL Jogja Minat DM aja😍\nFasht Respon.\n#Ava...
3    MAEN SANTAI GA BURU" \nFULL SERVICE NO ANAL US...
4    New bie...Ready ya..2 slot aja 085647266101\n#...
Name: tweet, dtype: object


In [29]:
print(data_train_false['tweet'].head())

0    Lanjutannya, siapa yang naruh bawang disini _�...
1    1 Ramadan : Hari Juma'at.\n8 Ramadan : Hari Ju...
2    Selamat berbuka puasa kepada semua�_��� pic.tw...
3    Aku mau nepatin janji aku kalo taekook selca a...
4    Bagi Bagi Saldo\nada saldo 100k untuk 4 pemena...
Name: tweet, dtype: object


In [30]:
print(data_test['tweet'].head())

0                                           ko km gt:(
1               Waktu indonesia bagian overthinking :f
2             efek mau dapet huh kmrn w juga gitu ndoy
3                                                Ampun
4    Btw aku dapet ads ultah baekkie pic.twitter.co...
Name: tweet, dtype: object


In [7]:
print(f"Training tweets (True)\t:\t{len(data_train_true)}")
print(f"Training tweets (False)\t:\t{len(data_train_false)}")
print(f"Test tweets\t\t:\t{len(data_test)}")

Training tweets (True)	:	20000
Training tweets (False)	:	20000
Test tweets		:	10000


In [0]:
prostitute_tweets = data_train_true['tweet']
not_prostitute_tweets = data_train_false['tweet']
just_tweets = data_test['tweet']

Take the tweet column from each file.

In [0]:
def cleaning(text):
	text = re.sub(r'<[^>]+>', '', text)  # remove HTML tags
	text = re.sub(r'\S*twitter\.com\S*', '', text)  # remove Twitter image links (dot escaped so it matches literally)
	text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)  # remove URLs
	text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove user mentions
	text = re.sub(r'#[A-Za-z0-9]+', '', text)  # remove hashtags
	text = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', text)  # remove numbers
	text = re.sub(r"[^a-zA-Z]", " ", text)  # replace every non-letter with a space
	text = re.sub(r"(\w)(\1{2,})", r'\1', text)  # collapse runs of 3+ identical characters
	text = re.sub(r"\b[a-zA-Z]\b", "", text)  # remove single-letter words
	text = text.lower()  # convert to lowercase
	return text

Clean the noise out of the text: remove HTML tags, Twitter image links, other URLs, user mentions, hashtags, numbers, non-alphabetic characters, long runs of a repeated character, and single-letter words, then convert everything to lowercase.
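To see what these steps do to a single tweet, here is a small self-contained sketch. The function `cleaning_sketch` and the sample tweet are made up for illustration; the regular expressions mirror the ones in the cell above.

```python
import re

def cleaning_sketch(text):
    """Illustrative copy of the cleaning steps (hypothetical helper name)."""
    text = re.sub(r'<[^>]+>', '', text)                     # HTML tags
    text = re.sub(r'\S*twitter\.com\S*', '', text)          # Twitter image links
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)     # other URLs
    text = re.sub(r'@[A-Za-z0-9]+', '', text)               # user mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)               # hashtags
    text = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', text)   # numbers
    text = re.sub(r'[^a-zA-Z]', ' ', text)                  # anything that is not a letter
    text = re.sub(r'(\w)(\1{2,})', r'\1', text)             # runs of 3+ identical characters
    text = re.sub(r'\b[a-zA-Z]\b', '', text)                # single-letter words
    return text.lower()

# Made-up sample tweet, not taken from the dataset.
sample = "Haloooo @budi cek https://example.com/a pic.twitter.com/xyz #Promo 123 <b>OK</b>"
cleaned = cleaning_sketch(sample)
print(cleaned.split())  # ['halo', 'cek', 'ok']
```

The cleaned string still contains runs of spaces where text was removed; they disappear once the string is split into tokens.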

In [0]:
def tokenize(text):
	# stop words to drop from every tweet
	ignore_words = ['by', 'yang', 'ya', 'saya', 'dia', 'ia', 'ke', 'pun', 'rt']
	words = text.split()
	words = [w for w in words if w not in ignore_words]
	return words

Tokenize the text: split each sentence into individual words, drop the stop words, and collect the remaining words in a list.
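A minimal sketch of the same idea on one cleaned string. `STOP_WORDS` and `tokenize_sketch` are illustrative names; the stop-word list is the one used above, stored as a set here only because membership tests on a set are faster than on a list.

```python
# Same stop words as in the notebook cell, kept in a set for fast lookups.
STOP_WORDS = {'by', 'yang', 'ya', 'saya', 'dia', 'ia', 'ke', 'pun', 'rt'}

def tokenize_sketch(text):
    """Split on whitespace and drop stop words (hypothetical helper name)."""
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = tokenize_sketch("rt saya ready ya  slot jogja")
print(tokens)  # ['ready', 'slot', 'jogja']
```

Because `str.split()` without arguments splits on any run of whitespace, the extra spaces left behind by the cleaning step do not produce empty tokens.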

In [0]:
positive_tweet_tokens = []
for i in prostitute_tweets:
	positive_tweet_tokens.append(tokenize(cleaning(i)))

In [0]:
negative_tweet_tokens = []
for i in not_prostitute_tweets:
	negative_tweet_tokens.append(tokenize(cleaning(i)))

Build the word lists from all training data that has been cleaned and converted into tokens.

In [0]:
def get_all_words(cleaned_token_list):
	for tokens in cleaned_token_list:
		for token in tokens:
			yield token

def get_tweets_for_model(cleaned_tokens_list):
	for tweet_tokens in cleaned_tokens_list:
		yield dict([token, True] for token in tweet_tokens)

In [0]:
all_pos_words = get_all_words(positive_tweet_tokens)
freq_dist_pos = FreqDist(all_pos_words)

Build a list of words together with how often each word occurs across the whole word list **(bag of words)**.
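The counting step can be sketched with the standard library alone. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the `most_common` call behaves the same way; the token lists below are invented for the example.

```python
from collections import Counter

# Invented token lists standing in for cleaned, tokenized tweets.
token_lists = [['open', 'bo', 'jogja'], ['ready', 'bo'], ['bo', 'jogja']]

# Flatten all tweets into one stream of words and count each word.
word_counts = Counter(token for tokens in token_lists for token in tokens)
print(word_counts.most_common(2))  # [('bo', 3), ('jogja', 2)]
```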

In [27]:
print(f"Most frequent words: {freq_dist_pos.most_common(100)}")

Most frequent words: [('wa', 11387), ('open', 7716), ('dm', 6011), ('bo', 5391), ('ready', 5356), ('jogja', 4876), ('rr', 3333), ('avail', 3211), ('say', 3121), ('slot', 2941), ('yuk', 2931), ('cod', 2567), ('beb', 2524), ('dp', 2364), ('no', 2264), ('include', 2124), ('vcs', 2123), ('ini', 1876), ('info', 1874), ('minat', 1853), ('hotel', 1631), ('exclude', 1590), ('di', 1573), ('sayang', 1533), ('st', 1490), ('isi', 1468), ('area', 1459), ('inc', 1345), ('lt', 1344), ('malam', 1279), ('exc', 1278), ('yg', 1223), ('aja', 1210), ('langsung', 1200), ('masih', 1105), ('jam', 1064), ('expo', 1001), ('chat', 967), ('promo', 943), ('via', 906), ('bisa', 894), ('real', 878), ('kak', 859), ('serius', 835), ('khusus', 804), ('buat', 789), ('wajib', 783), ('cancel', 747), ('hari', 742), ('yaa', 736), ('mau', 707), ('bio', 678), ('or', 633), ('ga', 621), ('aku', 606), ('cocok', 594), ('main', 590), ('crot', 575), ('incld', 530), ('harga', 519), ('merapat', 515), ('lagi', 503), ('fast', 494),

In [0]:
positive_tokens_for_model = get_tweets_for_model(positive_tweet_tokens)
negative_tokens_for_model = get_tweets_for_model(negative_tweet_tokens)

In [0]:
positive_dataset = [(tweet_dict, "True")
						for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "False")
						for tweet_dict in negative_tokens_for_model]

Attach a label to each tweet's feature dictionary: "True" for tweets from the prostitution set and "False" for tweets from the other set.
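The resulting structure is a list of `(features, label)` pairs, which is the input shape NLTK's `NaiveBayesClassifier.train` works with. A small sketch with invented tokens (`to_features` is an illustrative helper, not part of the notebook):

```python
def to_features(tokens):
    """Turn a token list into a {word: True} feature dictionary."""
    return {token: True for token in tokens}

# Invented token lists, one inner list per tweet.
positive = [(to_features(t), "True") for t in [['open', 'bo'], ['ready', 'slot']]]
negative = [(to_features(t), "False") for t in [['selamat', 'pagi']]]

labeled = positive + negative
print(labeled[0])  # ({'open': True, 'bo': True}, 'True')
```

Only the presence of a word is recorded, so a word that appears twice in one tweet contributes the same feature as a word that appears once.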

In [0]:
dataset = positive_dataset + negative_dataset

Combine both labeled datasets into a single list.

In [0]:
random.shuffle(dataset)

In [0]:
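# NOTE: dataset has 40,000 items, so train_data is all of it and test_data
# (30,400 items) overlaps it; a held-out split would use disjoint slices.
train_data = dataset[:40000]
test_data = dataset[9600:]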

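Allocate the data between training and testing. Because the combined dataset holds 40,000 items, these two slices overlap: all 30,400 test items are also part of the training data, so the accuracy reported below is not a held-out estimate.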
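If a held-out evaluation is wanted, the training and test slices need to be disjoint. The following is a minimal sketch, assuming an 80/20 split and a fixed seed; both values and the helper name `split_dataset` are hypothetical choices, not taken from the notebook.

```python
import random

def split_dataset(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the items and cut it into two disjoint parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for the 40,000 labeled tweets.
train_part, test_part = split_dataset(range(40000))
print(len(train_part), len(test_part))  # 32000 8000
```

With a split like this, no tweet used for training is scored again during testing, so the measured accuracy would be expected to differ from the figure reported in this notebook.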

In [0]:
classifier = NaiveBayesClassifier.train(train_data)

In [22]:
print(f"Naive Bayes classification accuracy\t:\t{classify.accuracy(classifier, test_data) * 100:.2f} %")

Naive Bayes classification accuracy	:	98.61 %


In [0]:
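predicted_labels = []
reference_labels = []

for features, label in test_data:
	predicted_labels.append(classifier.classify(features))  # label chosen by the classifier
	reference_labels.append(label)  # gold label stored in the dataset

# nltk.ConfusionMatrix takes the reference labels first, then the predictions
c_matrix = nltk.ConfusionMatrix(reference_labels, predicted_labels)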

Evaluate the classifier with a confusion matrix.

In [24]:
print(f"Confusion Matrix :\n{c_matrix}")

Confusion Matrix :
      |     F       |
      |     a     T |
      |     l     r |
      |     s     u |
      |     e     e |
------+-------------+
False |<15122>   31 |
 True |   393<14854>|
------+-------------+
(row = reference; col = test)
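As a quick arithmetic check, the four counts in the matrix can be turned back into accuracy, precision, and recall by hand. The variable names below are illustrative; the numbers are copied from the matrix, with rows as reference labels and columns as predictions.

```python
# Counts copied from the confusion matrix (row = reference, col = predicted).
tn, fp = 15122, 31     # reference False: predicted False / predicted True
fn, tp = 393, 14854    # reference True:  predicted False / predicted True

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_true = tp / (tp + fp)   # of tweets predicted True, share that really are True
recall_true = tp / (tp + fn)      # of tweets that are True, share that were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(total)                      # 30400
print(f"{accuracy * 100:.2f} %")  # 98.61 %
print(f"{precision_true:.4f} {recall_true:.4f} {f1_true:.4f}")  # 0.9979 0.9742 0.9859
```

The total of 30,400 matches the length of `dataset[9600:]`, and the accuracy agrees with the 98.61 % printed earlier.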



In [25]:
labels = {'True', 'False'}

TP, FN, FP = Counter(), Counter(), Counter()
for i in labels:
	for j in labels:
		if i == j:
			TP[i] += int(c_matrix[i,j])
		else:
			FN[i] += int(c_matrix[i,j])
			FP[j] += int(c_matrix[i,j])
print("label   | precision             | recall                | f_measure         ")
print("--------+-----------------------+-----------------------+-------------------")
for label in sorted(labels):
	precision, recall = 0, 0
	if TP[label] == 0:
		f_measure = 0
	else:
		precision = float(TP[label]) / (TP[label]+FP[label])
		recall = float(TP[label]) / (TP[label]+FN[label])
		f_measure = float(2) * (precision * recall) / (precision + recall)
	print(f"{label}\t| {precision}\t| {recall}\t| {f_measure}")

label   | precision             | recall                | f_measure         
--------+-----------------------+-----------------------+-------------------
False	| 0.9746696745085401	| 0.9979542004883521	| 0.9861745141515587
True	| 0.9979173664763185	| 0.9742244375942808	| 0.9859285809106597


In [26]:
classifier.show_most_informative_features(20)  # prints the table itself and returns None, so no print() is needed

Most Informative Features
                      bo = True             True : False  =    910.3 : 1.0
                    baik = True            False : True   =    450.3 : 1.0
                 twitter = True            False : True   =    413.0 : 1.0
                 content = True            False : True   =    367.0 : 1.0
                     exc = True             True : False  =    365.0 : 1.0
                    saat = True            False : True   =    350.3 : 1.0
                 slotnya = True             True : False  =    315.0 : 1.0
                    like = True            False : True   =    309.7 : 1.0
                   salah = True            False : True   =    308.3 : 1.0
                      rr = True             True : False  =    266.3 : 1.0
                 menjadi = True            False : True   =    265.0 : 1.0
                  mereka = True            False : True   =    250.6 : 1.0
                 include = True             True : False  =    249.8 : 1.0