This notebook applies a text classification model to Twitter data, using a Naive Bayes classifier (NBC) to label each tweet as prostitution-related (True) or not prostitution-related (False).


The dataset consists of training data and test data. There are 40,000 training tweets, split evenly into 20,000 labeled True (prostitution) and 20,000 labeled False (not prostitution). A separate, manually labeled file serves as test data; the copy loaded below contains 5,327 rows.

The evaluation measures the classification accuracy of the NBC and summarizes the results in a confusion matrix.

First, import the Python modules that will be used.

In [0]:
import pandas as pd
import string
import numpy as np
import nltk
import re
import random
from nltk.tokenize import word_tokenize
from nltk import FreqDist,classify, NaiveBayesClassifier
from sklearn.metrics import confusion_matrix,classification_report
from collections import Counter

Load the training data and the test data.

In [0]:
data_train_true = pd.read_excel('../content/twitter-prostitute.xlsx')
data_train_false = pd.read_excel('../content/twitter-not-prostitute.xlsx')
data_test = pd.read_excel('../content/labeled-data-testing.xlsx')

In [3]:
print(data_train_true['tweet'].head())

0    AvaiL BO yaa beb😙\nWA 0831 9315 9762\n#AvailJo...
1    Include exclude Ready beb \nWa 0831 9315 9762\...
2    AvaiL Jogja Minat DM aja😍\nFasht Respon.\n#Ava...
3    MAEN SANTAI GA BURU" \nFULL SERVICE NO ANAL US...
4    New bie...Ready ya..2 slot aja 085647266101\n#...
Name: tweet, dtype: object


In [4]:
print(data_train_false['tweet'].head())

0    Lanjutannya, siapa yang naruh bawang disini _�...
1    1 Ramadan : Hari Juma'at.\n8 Ramadan : Hari Ju...
2    Selamat berbuka puasa kepada semua�_��� pic.tw...
3    Aku mau nepatin janji aku kalo taekook selca a...
4    Bagi Bagi Saldo\nada saldo 100k untuk 4 pemena...
Name: tweet, dtype: object


In [5]:
print(data_test['tweet'].head())

0    Design Kaos Jathilan Jogja by heljog | S, M, L...
1    #Jogja |  Dinkes Gunungkidul Telusuri Warga Re...
2    Jual Herbal Diabetes di Jogja, WA: 08967274191...
3    Promo Honda Anugerah Jogja\nBantu like dan sub...
4    AVAIL SEKARANG BEB  RR DM✅  ,WJB CPAS,NOANAL♥🔑...
Name: tweet, dtype: object


In [6]:
print(f"Jumlah data training (True)\t:\t{len(data_train_true)}")
print(f"Jumlah data training (False)\t:\t{len(data_train_false)}")
print(f"Jumlah data test\t\t:\t{len(data_test)}")

Jumlah data training (True)	:	20000
Jumlah data training (False)	:	20000
Jumlah data test		:	5327


In [0]:
prostitute_tweets = data_train_true['tweet']
not_prostitute_tweets = data_train_false['tweet']

The cell above takes the `tweet` column from each training file.

In [0]:
def cleaning(text):
	text = re.sub(r'<[^>]+>', '', text) #delete html tags
	text = re.sub(r'\S*twitter.com\S*', '', text)   #delete twitter image
	text = re.sub(r'https?://[A-Za-z0-9./]+','',text) #delete url
	text = re.sub(r'@[A-Za-z0-9]+','',text) #delete user mention
	text = re.sub(r'#[A-Za-z0-9]+','',text) #delete twitter hashtag
	text = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)','', text) #delete number
	text = re.sub(r"[^a-zA-Z]", " ", text) #only accept alphabet char
	text = re.sub(r"(\w)(\1{2,})", r'\1', text) #delete repeated char
	text = re.sub(r"\b[a-zA-Z]\b", "", text) #remove single character
	text = text.lower() #change to lowercase
	return text

The cell above strips noise from the text: it removes HTML tags, Twitter image links, URLs, user mentions, hashtags, numbers, and any remaining non-alphabetic characters, collapses runs of three or more repeated characters into one, drops single-letter words, and finally converts everything to lowercase.
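
To make the effect of these substitutions concrete, the sketch below re-declares the same `cleaning` function and runs it on two hypothetical tweets (neither comes from the dataset). The second input is an assumption chosen to show one side effect: the URL pattern stops at `?`, so query-string fragments such as `utm_source` survive as ordinary words. This may be one reason `utm` and `twitter` show up among the frequent words of the False class later in the notebook, though the notebook itself does not confirm that.

```python
import re

def cleaning(text):
    # Same substitutions, in the same order, as the notebook's cleaning()
    text = re.sub(r'<[^>]+>', '', text)                     # HTML tags
    text = re.sub(r'\S*twitter.com\S*', '', text)           # Twitter image links
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)     # URLs (stops at '?')
    text = re.sub(r'@[A-Za-z0-9]+', '', text)               # user mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)               # hashtags
    text = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', text)   # numbers
    text = re.sub(r"[^a-zA-Z]", " ", text)                  # non-alphabetic chars
    text = re.sub(r"(\w)(\1{2,})", r'\1', text)             # 3+ repeated chars
    text = re.sub(r"\b[a-zA-Z]\b", "", text)                # single letters
    return text.lower()

# Hypothetical inputs, not taken from the dataset
sample = "Ready beb @user123 #AvailJogja WA 0831 9315 9762 https://t.co/abc123 sooooo good"
tracked = "baca di https://example.com/page?utm_source=twitter"

sample_tokens = cleaning(sample).split()
tracked_tokens = cleaning(tracked).split()

print(sample_tokens)   # ['ready', 'beb', 'wa', 'so', 'good']
print(tracked_tokens)  # ['baca', 'di', 'utm', 'source', 'twitter']
```

If such fragments are unwanted, one option is to widen the URL pattern to cover query strings; that would change the recorded word counts, so it is not applied here.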

In [0]:
# Read the Indonesian stopword list; the with-block closes the file afterwards
with open("stopwords-id.txt", 'r') as stopwords_file:
	stopwords = [x.strip() for x in stopwords_file]
stopwords.extend(['by', 'rt', 'via'])

In [0]:
def tokenize(text):
	words = text.split()
	words = [w for w in words if w not in stopwords]
	return words

The cell above performs tokenization: it splits each cleaned tweet on whitespace into a list of words and drops every word that appears in the stopword list.
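
As a small illustration of this step, the sketch below uses a stand-in stopword set, since the contents of `stopwords-id.txt` are not shown in the notebook. Storing the stopwords in a `set` instead of a list is a suggestion, not what the notebook does; it keeps each membership test constant-time, which matters when 40,000 tweets are tokenized.

```python
# Stand-in stopword set (assumption); the notebook loads its list from stopwords-id.txt
stopwords = {"yang", "di", "dan", "ini", "by", "rt", "via"}

def tokenize(text):
    # Split on any whitespace, then keep only words that are not stopwords
    return [w for w in text.split() if w not in stopwords]

tokens = tokenize("rt ready   jogja hari ini via dm")
print(tokens)  # ['ready', 'jogja', 'hari', 'dm']
```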

In [0]:
positive_tweet_tokens = []
for i in prostitute_tweets:
	positive_tweet_tokens.append(tokenize(cleaning(i)))

In [0]:
negative_tweet_tokens = []
for i in not_prostitute_tweets:
	negative_tweet_tokens.append(tokenize(cleaning(i)))

The cells above build the token lists for the whole training set: every tweet is cleaned and then converted into a list of tokens, one list per tweet and one collection per class.

In [0]:
def get_all_words(cleaned_token_list):
	for tokens in cleaned_token_list:
		for token in tokens:
			yield token

def get_tweets_for_model(cleaned_tokens_list):
	for tweet_tokens in cleaned_tokens_list:
		yield dict([token, True] for token in tweet_tokens)

In [0]:
all_pos_words = get_all_words(positive_tweet_tokens)
all_neg_words = get_all_words(negative_tweet_tokens)
freq_dist_pos = FreqDist(all_pos_words)
freq_dist_neg = FreqDist(all_neg_words)

The cell above builds, for each class, the list of words together with how often each word occurs across all tweets of that class **(bag of words)**.
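
The same counting can be sketched with the standard library alone. `nltk.FreqDist` is a subclass of `collections.Counter` in NLTK 3, so `Counter` is used here as a stand-in; the token lists are hypothetical.

```python
from collections import Counter

# Hypothetical token lists, one per tweet
token_lists = [
    ["wa", "ready", "jogja"],
    ["wa", "open", "bo"],
    ["ready", "wa"],
]

# Flatten all tweets into one stream of tokens and count occurrences
freq = Counter(token for tokens in token_lists for token in tokens)

print(freq.most_common(2))  # [('wa', 3), ('ready', 2)]
```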

In [15]:
print(f"Kata yang sering muncul (True): {freq_dist_pos.most_common(10)}")

Kata yang sering muncul (True): [('wa', 11387), ('open', 7716), ('dm', 6011), ('ya', 5767), ('bo', 5391), ('ready', 5356), ('jogja', 4876), ('rr', 3333), ('avail', 3211), ('say', 3121)]


In [16]:
print(f"Kata yang sering muncul (False): {freq_dist_neg.most_common(10)}")

Kata yang sering muncul (False): [('yg', 3618), ('utm', 2935), ('orang', 2731), ('ya', 2060), ('dm', 1700), ('nak', 1659), ('ni', 1594), ('aja', 1302), ('twitter', 1068), ('ga', 1017)]


In [0]:
positive_tokens_for_model = get_tweets_for_model(positive_tweet_tokens)
negative_tokens_for_model = get_tweets_for_model(negative_tweet_tokens)

In [0]:
positive_dataset = [(tweet_dict, "True")
						for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "False")
						for tweet_dict in negative_tokens_for_model]

The cell above attaches a label to each tweet's feature dictionary: "True" for tweets from the prostitution set and "False" for tweets from the other set.
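
The shape of one labeled example can be sketched as follows. `to_features` is a hypothetical helper that mirrors what `get_tweets_for_model` yields for a single tweet; the tokens are made up. Note that a repeated token collapses into one key, so the features record presence, not counts.

```python
def to_features(tokens):
    # One key per distinct token; the value True only marks presence
    return {token: True for token in tokens}

tokens = ["ready", "jogja", "ready", "wa"]   # hypothetical tweet tokens
labeled_example = (to_features(tokens), "True")

print(labeled_example)
# ({'ready': True, 'jogja': True, 'wa': True}, 'True')
```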

In [0]:
dataset = positive_dataset + negative_dataset

In [20]:
print(len(dataset))

40000


The cell above combines both labeled lists into a single list.

In [0]:
random.shuffle(dataset)

In [0]:
train_data = dataset[:40000]

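The cell above sets the allocation between training and test data. Because `dataset` holds exactly 40,000 items, the slice `dataset[:40000]` keeps all of them for training; testing relies on the separate labeled file.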
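
If part of the labeled data were to be held out for validation instead, a seeded split is one option. This is a sketch under assumptions: `split_dataset`, the 80/20 ratio, and the seed are illustrative choices that do not appear in the notebook.

```python
import random

def split_dataset(dataset, train_fraction=0.8, seed=42):
    # Shuffle a copy with a private, seeded generator so the split is repeatable
    items = list(dataset)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

# Hypothetical labeled examples in the same (features, label) shape as the notebook
toy = [({f"w{i}": True}, "True" if i % 2 else "False") for i in range(10)]
train_part, held_out = split_dataset(toy)

print(len(train_part), len(held_out))  # 8 2
```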

In [0]:
classifier = NaiveBayesClassifier.train(train_data)
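
For intuition about what training does, the sketch below implements a stripped-down Naive Bayes over presence features using only the standard library. It is an illustration, not NLTK's implementation: the add-one smoothing, the handling of unseen tokens, and the names `train_nb` and `classify_nb` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    # Count examples per label and, per label, how many examples contain each token
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for token in features:
            token_counts[label][token] += 1
            vocab.add(token)
    return label_counts, token_counts, vocab

def classify_nb(model, features):
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)            # log prior
        for token in features:
            if token not in vocab:
                continue                       # unseen tokens are ignored
            # add-one smoothing over the two outcomes present/absent
            p = (token_counts[label][token] + 1) / (n + 2)
            score += math.log(p)               # log likelihood of this token
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training examples
toy_train = [
    ({"ready": True, "wa": True}, "True"),
    ({"open": True, "wa": True}, "True"),
    ({"selamat": True, "puasa": True}, "False"),
    ({"orang": True, "puasa": True}, "False"),
]
model = train_nb(toy_train)

print(classify_nb(model, {"wa": True, "ready": True}))  # True
print(classify_nb(model, {"puasa": True}))              # False
```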

Next, prepare the test samples. The test data was labeled True or False by hand beforehand, so that the predictions can be validated against those labels with a confusion matrix.

In [0]:
actual_status = []
for i in data_test['status']:
  if i == 1:
    actual_status.append('True')
  else:
    actual_status.append('False')

test_tweet = data_test['tweet']

result_clasify = []
tokenize_tweet_test = []
for i in test_tweet:
  test_tweet_tokens = tokenize(cleaning(i))
  tokenize_tweet_test.append(test_tweet_tokens)
  result_clasify.append(classifier.classify(dict([token, True] for token in test_tweet_tokens)))

tweet_tokens_model = get_tweets_for_model(tokenize_tweet_test)

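# Note: each feature dict is paired here with the classifier's own prediction
# (result_clasify), not with the manual label in actual_status.
datatest_result = []
for (tweet_dict, i) in zip(tweet_tokens_model, result_clasify):
  datatest_result += [(tweet_dict, i)]

In [29]:
# classify.accuracy compares the classifier with the labels stored in datatest_result.
# Those labels are its own predictions, so this figure is 100% by construction
# and does not measure agreement with actual_status.
print("Akurasi Klasifikasi Naive Bayes\t:\t"+"{:.2f}".format(classify.accuracy(classifier, datatest_result) * 100)+" %")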

Akurasi Klasifikasi Naive Bayes	:	100.00 %
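
Since `datatest_result` carries the classifier's own predictions as labels, the 100% above compares the model with itself. A hedged sketch of scoring against manual labels follows. `accuracy` is a hypothetical helper and the toy lists are made up; the second part only re-derives a figure from the four counts in the confusion matrix printed further down in this notebook, so it is arithmetic on reported numbers, not a recorded output of the original run.

```python
def accuracy(predicted, actual):
    # Fraction of positions where the prediction equals the manual label
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Toy illustration with hypothetical labels: 3 of 4 agree
toy_acc = accuracy(["True", "False", "False", "True"],
                   ["True", "False", "True", "True"])

# Counts copied from the confusion matrix shown later in the notebook
agreeing = 4042 + 971            # diagonal cells
total = 4042 + 24 + 290 + 971    # all cells
matrix_accuracy = agreeing / total

print(toy_acc)
print(f"{matrix_accuracy * 100:.2f} %")
```

With the notebook's own variables, the equivalent call would be `accuracy(result_clasify, actual_status)`.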


In [0]:
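# datatest_result holds (feature dict, predicted label) pairs
classifier_result = [label for (_, label) in datatest_result]

# nltk.ConfusionMatrix takes (reference, test). The predictions are passed first here,
# so rows of the printed matrix are predictions and columns are the manual labels.
c_matrix = nltk.ConfusionMatrix(classifier_result, actual_status)

The cell above builds the confusion matrix used for the evaluation. Because the predictions occupy the reference position, the matrix printed below has predictions as rows and manual labels as columns, and the precision and recall columns derived from `c_matrix` further down are interchanged relative to their usual definitions; the F-measure is unaffected.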

In [34]:
print(f"Confusion Matrix :\n{c_matrix}")

Confusion Matrix :
      |    F      |
      |    a    T |
      |    l    r |
      |    s    u |
      |    e    e |
------+-----------+
False |<4042>  24 |
 True |  290 <971>|
------+-----------+
(row = reference; col = test)
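
When reading a matrix like the one above, it helps to keep the orientation explicit. `nltk.ConfusionMatrix` treats its first argument as the reference (rows) and its second as the test (columns). The sketch below avoids any ambiguity by keying each cell as `(actual, predicted)`; `confusion_counts` and the label lists are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    # Each key is an (actual label, predicted label) pair
    return Counter(zip(actual, predicted))

# Hypothetical manual labels and predictions
actual    = ["True", "True",  "False", "False", "False"]
predicted = ["True", "False", "False", "False", "True"]

counts = confusion_counts(actual, predicted)

# Rows are actual labels, columns are predicted labels (False, True)
for a in ("False", "True"):
    print(a, [counts[(a, p)] for p in ("False", "True")])
```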



In [35]:
labels = {'True', 'False'}

TP, FN, FP = Counter(), Counter(), Counter()
for i in labels:
	for j in labels:
		if i == j:
			TP[i] += int(c_matrix[i,j])
		else:
			FN[i] += int(c_matrix[i,j])
			FP[j] += int(c_matrix[i,j])
print("label   | precision             | recall                | f_measure         ")
print("--------+-----------------------+-----------------------+-------------------")
for label in sorted(labels):
	precision, recall = 0, 0
	if TP[label] == 0:
		f_measure = 0
	else:
		precision = float(TP[label]) / (TP[label]+FP[label])
		recall = float(TP[label]) / (TP[label]+FN[label])
		f_measure = float(2) * (precision * recall) / (precision + recall)
	print(f"{label}\t| {precision}\t| {recall}\t| {f_measure}")

label   | precision             | recall                | f_measure         
--------+-----------------------+-----------------------+-------------------
False	| 0.933056325023084	| 0.9940973930152484	| 0.9626101452726838
True	| 0.9758793969849247	| 0.7700237906423474	| 0.8608156028368794
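
Which column is precision and which is recall depends on whether the matrix rows hold manual labels or predictions. The notebook builds `c_matrix` with the predictions as the first argument, which NLTK treats as the reference (rows). The sketch below copies the four printed counts and computes the metrics under both readings; `per_label_metrics` is a hypothetical helper, and the second reading is a re-derivation from reported counts, not an output of the original run.

```python
# Cells keyed as (row label, column label), copied from the printed matrix
matrix = {
    ("False", "False"): 4042, ("False", "True"): 24,
    ("True", "False"): 290,   ("True", "True"): 971,
}
labels = ["False", "True"]

def per_label_metrics(matrix, labels, rows_are_predictions):
    results = {}
    for label in labels:
        tp = matrix[(label, label)]
        row_total = sum(matrix[(label, other)] for other in labels)
        col_total = sum(matrix[(other, label)] for other in labels)
        if rows_are_predictions:
            predicted_total, actual_total = row_total, col_total
        else:
            predicted_total, actual_total = col_total, row_total
        precision = tp / predicted_total
        recall = tp / actual_total
        f_measure = 2 * precision * recall / (precision + recall)
        results[label] = (precision, recall, f_measure)
    return results

# Reading used by the table above: rows treated as manual labels
as_printed = per_label_metrics(matrix, labels, rows_are_predictions=False)
# Reading implied by the argument order: rows are predictions
rows_predicted = per_label_metrics(matrix, labels, rows_are_predictions=True)

for name, result in (("rows = manual labels", as_printed),
                     ("rows = predictions", rows_predicted)):
    for label in labels:
        p, r, f = result[label]
        print(f"{name:22s} {label:5s} precision={p:.4f} recall={r:.4f} f={f:.4f}")
```

Under the second reading the True class has lower precision and higher recall than the table suggests, while the F-measure is the same in both.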


In [36]:
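# show_most_informative_features prints the table itself and returns None,
# so it does not need to be wrapped in print()
classifier.show_most_informative_features(20)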

Most Informative Features
                      bo = True             True : False  =    910.3 : 1.0
                 twitter = True            False : True   =    413.0 : 1.0
                 content = True            False : True   =    367.0 : 1.0
                     exc = True             True : False  =    365.0 : 1.0
                 slotnya = True             True : False  =    315.0 : 1.0
                    like = True            False : True   =    309.7 : 1.0
                   salah = True            False : True   =    308.3 : 1.0
                      rr = True             True : False  =    266.3 : 1.0
                 include = True             True : False  =    249.8 : 1.0
                      tu = True            False : True   =    248.6 : 1.0
               indonesia = True            False : True   =    245.0 : 1.0
                     inc = True             True : False  =    244.6 : 1.0
                      lt = True             True : False  =    244.5 : 1.0
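
Each ratio in the list above compares how likely a token is under one label versus the other. The sketch below shows how such a ratio can arise from raw counts. It assumes add-0.5 (expected likelihood) smoothing over two outcomes per feature, which is taken here as NLTK's default estimator but is not confirmed by the notebook. The counts are hypothetical, chosen only to produce a ratio of similar size to the first row.

```python
def smoothed_ratio(count_a, count_b, n_a, n_b, gamma=0.5, bins=2):
    # P(token present | label) with add-gamma smoothing over `bins` outcomes
    p_a = (count_a + gamma) / (n_a + gamma * bins)
    p_b = (count_b + gamma) / (n_b + gamma * bins)
    return p_a / p_b

# Hypothetical: token seen in 1365 of 20000 True tweets and 1 of 20000 False tweets
ratio = smoothed_ratio(1365, 1, 20000, 20000)
print(f"{ratio:.1f} : 1.0")  # 910.3 : 1.0
```

A token that never appears in one class still gets a finite ratio thanks to the smoothing term, which is why very rare tokens can rank highly.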

At this point, the trained classifier can be used to predict the class of new tweets supplied as input.

In [0]:
custom_tweet_first = "jarang jarang hari ini bisa bertemu dengan orang tercinta"
custom_tweet_second = "hari ini ready jogja cod tanpa dp wajib pengaman"

In [0]:
cleaned_custom_tokens_first = tokenize(cleaning(custom_tweet_first))
cleaned_custom_tokens_second = tokenize(cleaning(custom_tweet_second))

In [0]:
result_tweet_first = classifier.classify(dict([token, True] for token in cleaned_custom_tokens_first))
result_tweet_second = classifier.classify(dict([token, True] for token in cleaned_custom_tokens_second))

In [46]:
print(f"First Tweet: {result_tweet_first}")
print(f"Second Tweet: {result_tweet_second}")

First Tweet: False
Second Tweet: True
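
The three steps above (clean, tokenize, classify) can be wrapped in one helper. This is a sketch: `predict` and `KeywordStub` are hypothetical names, and the stub only stands in for a trained classifier so that the example runs on its own.

```python
def predict(tweet, classifier, clean, tokenize):
    # Clean the raw text, tokenize it, build presence features, then classify
    tokens = tokenize(clean(tweet))
    features = {token: True for token in tokens}
    return classifier.classify(features)

class KeywordStub:
    """Hypothetical stand-in for a trained classifier with a classify() method."""
    def classify(self, features):
        return "True" if "ready" in features else "False"

label = predict("Hari ini READY jogja", KeywordStub(),
                clean=str.lower, tokenize=str.split)
print(label)  # True
```

In the notebook's context the call would look like `predict(custom_tweet_first, classifier, cleaning, tokenize)`.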
