This notebook applies a text classification model to Twitter data, using a Naive Bayes Classifier (NBC) to label each tweet as prostitution-related (True) or not prostitution-related (False).


The dataset covers training data and testing data. There are 40000 training tweets split into two halves, 20000 labelled True (prostitution) and 20000 labelled False (not prostitution), and 10000 tweets as testing data.

Testing is carried out to determine the classification accuracy of the NBC method, and uses a confusion matrix.

First, prepare the Python modules that will be used.

In [1]:
import pandas as pd
import string
import numpy as np
import re
import random
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.naive_bayes import MultinomialNB

Import the training and testing data

In [2]:
data_train_true = pd.read_excel('twitter-prostitute.xlsx')
data_train_false = pd.read_excel('twitter-not-prostitute.xlsx')

In [3]:
# Combine the two datasets into one
dataset = pd.concat([data_train_true, data_train_false], ignore_index = True)

In [4]:
print(dataset.head(10))

                    id                 date       username  \
0  1255052854739398658  2020-04-28 00:00:00   rina11091996   
1  1255052799798202373  2020-04-28 00:00:00   rina11091996   
2  1255052613646573569  2020-04-28 00:00:00  viollasyantik   
3  1255052558667661312  2020-04-28 00:00:00    lyannyhijab   
4  1255052557061287938  2020-04-28 00:00:00      dheajogja   
5  1255052446486761481  2020-04-28 00:00:00   ayudewijogja   
6  1255052156417146880  2020-04-28 00:00:00   alexabojogja   
7  1255051841395503104  2020-04-28 00:00:00     alexayolan   
8  1255051002836738048  2020-04-28 00:00:00     nanabellen   
9  1255049688308428800  2020-04-28 00:00:00  realacoount18   

                                               tweet  status  
0  AvaiL BO yaa beb😙\nWA 0831 9315 9762\n#AvailJo...       1  
1  Include exclude Ready beb \nWa 0831 9315 9762\...       1  
2  AvaiL Jogja Minat DM aja😍\nFasht Respon.\n#Ava...       1  
3  MAEN SANTAI GA BURU" \nFULL SERVICE NO ANAL US...       1  
4 

In [5]:
print(dataset['tweet'][:5])

0    AvaiL BO yaa beb😙\nWA 0831 9315 9762\n#AvailJo...
1    Include exclude Ready beb \nWa 0831 9315 9762\...
2    AvaiL Jogja Minat DM aja😍\nFasht Respon.\n#Ava...
3    MAEN SANTAI GA BURU" \nFULL SERVICE NO ANAL US...
4    New bie...Ready ya..2 slot aja 085647266101\n#...
Name: tweet, dtype: object


In [6]:
print(len(dataset.loc[dataset['status']==1]))
print(len(dataset.loc[dataset['status']==0]))

20000
20000


Once the data has been loaded, the next step is to preprocess the text. Preprocessing is applied to the tweet column, which holds the tweet text collected from Twitter.

Preprocessing consists of two steps: cleaning, which strips noise from the text, and tokenizing, which splits the text into individual words.

In [7]:
# Load the Indonesian stopword list (one word per line) and add Twitter-specific terms
with open("stopwords-id.txt", 'r') as stopwords_file:
    stopwords = {x.strip() for x in stopwords_file}
stopwords.update(['by', 'rt', 'via'])

def cleaning(text):
    text = re.sub(r'<[^>]+>', '', text)  # delete HTML tags
    text = re.sub(r'\S*twitter\.com\S*', '', text)  # delete twitter.com links (e.g. images)
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)  # delete URLs
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # delete user mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # delete hashtags
    text = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', text)  # delete numbers
    text = re.sub(r"[^a-zA-Z]", " ", text)  # keep alphabetic characters only
    text = re.sub(r"(\w)(\1{2,})", r'\1', text)  # collapse runs of 3+ repeated characters
    text = re.sub(r"\b[a-zA-Z]\b", "", text)  # remove single-character words
    text = text.lower()  # convert to lowercase
    return text

def tokenize(text):
    # Split on whitespace and drop stopwords
    words = text.split()
    words = [w for w in words if w not in stopwords]
    return words
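
To make one of the less obvious patterns concrete, the snippet below isolates the repeated-character rule from cleaning() and runs it on two made-up words. The helper name collapse_repeats is hypothetical and exists only for this illustration.

```python
import re

def collapse_repeats(text):
    # Same pattern as in cleaning(): a character followed by two or more
    # copies of itself is reduced to a single character.
    return re.sub(r"(\w)(\1{2,})", r"\1", text)

print(collapse_repeats("mantaaap"))  # mantap
print(collapse_repeats("yaa"))       # yaa (only two repeats, left unchanged)
```

This is why a word such as "yaa" survives cleaning unchanged: the rule only fires on runs of three or more identical characters.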

Clean the text of each tweet

In [8]:
dataset['tweet'] = dataset.tweet.map(lambda x: cleaning(x))

In [9]:
print(dataset['tweet'][:5])

0             avail bo yaa beb  wa                    
1    include exclude ready beb  wa                 ...
2    avail jogja minat dm aja  fasht respon        ...
3    maen santai ga buru   full service no anal use...
4      new bie   ready ya   slot aja                  
Name: tweet, dtype: object


In [10]:
dataset['tweet'] = dataset.tweet.apply(lambda x: tokenize(x))

In [11]:
print(dataset['tweet'][:5])

0                            [avail, bo, yaa, beb, wa]
1                   [include, exclude, ready, beb, wa]
2        [avail, jogja, minat, dm, aja, fasht, respon]
3    [maen, santai, ga, buru, full, service, no, an...
4                     [new, bie, ready, ya, slot, aja]
Name: tweet, dtype: object


Join each tweet's tokens back together with spaces, ready for the next step: building the feature vectors.

In [12]:
dataset['tweet'] = dataset.tweet.apply(lambda x: ' '.join(x))

Copy the labels into a label column

In [13]:
dataset['label'] = dataset.status.map(lambda x: x)

In [14]:
count_vect = CountVectorizer()
counts = count_vect.fit_transform(dataset['tweet'])

In [15]:
print(counts[:5])

  (0, 1715)	1
  (0, 4248)	1
  (0, 37085)	1
  (0, 2486)	1
  (0, 36375)	1
  (1, 2486)	1
  (1, 36375)	1
  (1, 13185)	1
  (1, 9869)	1
  (1, 28573)	1
  (2, 1715)	1
  (2, 14328)	1
  (2, 21932)	1
  (2, 8839)	1
  (2, 466)	1
  (2, 10073)	1
  (2, 28935)	1
  (3, 18837)	1
  (3, 29840)	1
  (3, 10628)	1
  (3, 4838)	1
  (3, 10577)	1
  (3, 31068)	1
  (3, 23900)	1
  (3, 975)	1
  (3, 36000)	1
  (3, 5072)	1
  (4, 28573)	1
  (4, 466)	1
  (4, 23215)	1
  (4, 3936)	1
  (4, 37084)	1
  (4, 31856)	1
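
Each line of the output above reads (row, column) count: the row is the tweet index and the column is the index of a word in the fitted vocabulary. The sketch below rebuilds that idea with the standard library on two of the cleaned tweets. It assumes an alphabetically sorted vocabulary, which is consistent with the indices printed above (avail < beb < bo < wa < yaa), and it is only an illustration, not CountVectorizer's actual implementation.

```python
from collections import Counter

# Two cleaned tweets taken from the output earlier in this notebook
docs = ["avail bo yaa beb wa", "include exclude ready beb wa"]

# Assumption: the vocabulary maps each distinct word to its alphabetical rank
vocab = {term: i for i, term in enumerate(sorted({w for d in docs for w in d.split()}))}

# Emit one (row, column) count entry per distinct word in each tweet
for row, doc in enumerate(docs):
    for term, count in Counter(doc.split()).items():
        print((row, vocab[term]), count)
```

With the full dataset the vocabulary is far larger, which is why the column indices in the real output run into the tens of thousands.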


Next, transform all of the word-count features using TF-IDF.

In [16]:
transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)
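
As a rough picture of what the transformer computes, the sketch below reproduces the weighting by hand on a made-up three-document corpus. It assumes scikit-learn's default settings for TfidfTransformer (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row); the corpus and the helper name tfidf_row are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus, for illustration only
docs = [
    ["avail", "jogja", "ready"],
    ["ready", "include", "room"],
    ["ready", "jogja"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

def tfidf_row(doc):
    # Term count times IDF, then scale the row to unit Euclidean length
    weights = {term: count * idf[term] for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(w, 3) for term, w in row.items()})
```

A word that appears in every document ("ready" here) gets the minimum IDF of 1, while rarer words get larger weights. The same effect explains why the weights within a single tweet differ in the real output.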

In [17]:
print(counts[:5])

  (0, 37085)	0.5911705953129078
  (0, 36375)	0.2822194474686629
  (0, 4248)	0.3843499205847922
  (0, 2486)	0.47674485769290337
  (0, 1715)	0.44255958911507276
  (1, 36375)	0.2866682929106074
  (1, 28573)	0.38428108764986996
  (1, 13185)	0.49885978763514066
  (1, 9869)	0.5355187636154758
  (1, 2486)	0.4842601590165411
  (2, 28935)	0.3746459280373387
  (2, 21932)	0.2760387849741051
  (2, 14328)	0.21527389492118496
  (2, 10073)	0.7561577668564705
  (2, 8839)	0.18528478090139275
  (2, 1715)	0.2457130613228591
  (2, 466)	0.2657446914706499
  (3, 36000)	0.38496965117902543
  (3, 31068)	0.322521532751217
  (3, 29840)	0.2868751168311405
  (3, 23900)	0.17759626056082173
  (3, 18837)	0.3707445553998642
  (3, 10628)	0.2008833216432625
  (3, 10577)	0.2868751168311405
  (3, 5072)	0.34646183022969057
  (3, 4838)	0.3628687482176093
  (3, 975)	0.34946812299802377
  (4, 37084)	0.2176563413269418
  (4, 31856)	0.28790259324420076
  (4, 28573)	0.23764296132175766
  (4, 23215)	0.44937909480755067
  (4, 393

Now that every word has its own weight from the TF-IDF transformation, split the data into training and testing sets with an 80% train : 20% test allocation.

The split uses a fixed random_state so that it is reproducible.

In [18]:
feature_train, feature_test, target_train, target_test = train_test_split(counts, dataset['label'], 
    train_size=0.8, test_size=0.2, random_state=123)
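
Note that in the cells above the vocabulary and the IDF weights are fitted on all tweets before the split, so the held-out tweets contribute to the features. A minimal sketch of one alternative, assuming the same scikit-learn components, is to split the raw text first and let a pipeline fit the vectorizer on the training part only. The toy texts and labels are made up for illustration, and this is not the workflow used to produce the results reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy tweets (1 = True, 0 = False)
texts = [
    "avail jogja ready include room",
    "ready bo jogja full service",
    "avail bo include exclude ready",
    "open bo jogja ready room",
    "semoga pandemi segera berakhir",
    "kegiatan kembali normal minggu depan",
    "selamat pagi semoga sehat selalu",
    "cuaca jogja cerah hari ini",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=123, stratify=labels)

# TfidfVectorizer combines counting and TF-IDF weighting in one step;
# inside the pipeline it is fitted on the training texts only
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(text_train, y_train)

pred = pipeline.predict(["ready jogja include room"])
print(pred)
```

A pipeline also keeps prediction simple, because new text passes through exactly the same fitted transforms as the training data.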

In [19]:
model = MultinomialNB()
model.fit(feature_train, target_train)
predicted = model.predict(feature_test)
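
For intuition about what the classifier does, here is a small standard-library sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which matches the default alpha=1.0 of MultinomialNB as far as I know. It works on raw word counts rather than TF-IDF weights, and the training examples and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tokenized training tweets with labels (1 = True, 0 = False)
train = [
    (["avail", "jogja", "ready"], 1),
    (["ready", "include", "room"], 1),
    (["semoga", "sehat", "selalu"], 0),
    (["kegiatan", "kembali", "normal"], 0),
]
alpha = 1.0  # smoothing strength

vocab = {w for words, _ in train for w in words}
class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for words, label in train:
    word_counts[label].update(words)

def log_posterior(words, label):
    # log P(class) + sum of log P(word | class), up to a shared constant
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in words:
        if w in vocab:  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(words):
    # Choose the class with the highest log posterior
    return max(class_docs, key=lambda label: log_posterior(words, label))

print(predict(["ready", "room"]))      # 1
print(predict(["semoga", "kembali"]))  # 0
```

Smoothing keeps a word that never occurred in one class from driving that class's probability to zero.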

To compute the accuracy of the trained model, use accuracy_score from scikit-learn.

In [20]:
accuracy = accuracy_score(target_test, predicted)

In [21]:
print(accuracy)

0.985125


For testing, use a confusion matrix as follows:

In [22]:
c_matrix = confusion_matrix(target_test, predicted)

In [23]:
print(c_matrix)

[[4009   71]
 [  48 3872]]
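
The numbers in this matrix are enough to recompute the headline metrics by hand. In scikit-learn's layout, rows are the true labels and columns the predicted labels, so for labels 0 and 1 the cells are [[TN, FP], [FN, TP]]. The snippet below, a small standard-library sketch, derives accuracy, precision and recall for class 1 from the four counts printed above.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 4009, 71
fn, tp = 48, 3872

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all tweets classified correctly
precision = tp / (tp + fp)     # of tweets predicted True, how many were True
recall = tp / (tp + fn)        # of tweets that were True, how many were found

print(accuracy)                # 0.985125
print(round(precision, 2))     # 0.98
print(round(recall, 2))        # 0.99
```

The accuracy agrees with the value returned by accuracy_score.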


To view the classification report, use classification_report.

In [24]:
c_report = classification_report(target_test, predicted)
print(c_report)

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      4080
           1       0.98      0.99      0.98      3920

    accuracy                           0.99      8000
   macro avg       0.99      0.99      0.99      8000
weighted avg       0.99      0.99      0.99      8000



In [25]:
def pred_text(score):
  if score == 0:
    return 'False (Not Prostitute)'
  else:
    return 'True (Prostitute)'

In [26]:
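input_text = 'Semoga covid-19 ini segera berakhir dan semua kegiatan dapat kembali seperti sedia kala'
# Join the tokens back into one string so the tweet is scored as a single document
input_text = ' '.join(tokenize(cleaning(input_text)))
# Apply the same count and TF-IDF transforms that were fitted for training
new_counts = transformer.transform(count_vect.transform([input_text]))
pred = model.predict(new_counts)
print(pred_text(pred[0]))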

False (Not Prostitute)


In [27]:
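input_text = 'area #Jogja cod no dp free cancel include room'
# Join the tokens back into one string so the tweet is scored as a single document
input_text = ' '.join(tokenize(cleaning(input_text)))
# Apply the same count and TF-IDF transforms that were fitted for training
new_counts = transformer.transform(count_vect.transform([input_text]))
pred = model.predict(new_counts)
print(pred_text(pred[0]))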

True (Prostitute)


In [28]:
print(model.score(feature_train, target_train))
print(model.score(feature_test, target_test))

0.99084375
0.985125


In [29]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, feature_train, target_train, cv=10)
print(scores)

[0.9871875 0.9834375 0.9871875 0.9875    0.981875  0.9884375 0.9884375
 0.9846875 0.9859375 0.9865625]


In [30]:
scores.mean()

0.986125
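
A single mean hides how much the folds disagree. As a small sketch using only the standard library, the fold scores printed above can be summarised with both their mean and their standard deviation.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above
scores = [0.9871875, 0.9834375, 0.9871875, 0.9875, 0.981875,
          0.9884375, 0.9884375, 0.9846875, 0.9859375, 0.9865625]

mean = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation across folds

print(f"{mean:.6f} +/- {spread:.6f}")
```

A small spread relative to the mean suggests the accuracy does not depend much on which tweets land in which fold.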