# Text Mining and Hate Speech Classification
In this notebook we build a *hate speech* classifier: a model that decides whether a given *tweet* is *hate speech* or not.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(r'https://raw.githubusercontent.com/ialfina/id-hatespeech-detection/master/IDHSD_RIO_unbalanced_713_2017.txt', sep='\t', header=None, names=['label', 'tweet'], skiprows=1, engine='python')
df

Unnamed: 0,label,tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...
...,...,...
708,HS,Muka Si BABi Ahok Tuh Yg Mirip SERBET Lantai.....
709,HS,"Betul bang hancurkan merka bang, musnahkan chi..."
710,HS,"Sapa Yg bilang Ahok anti korupsi!?, klo grombo..."
711,HS,"Gw juga ngimpi SENTILIN BIJI BABI AHOK, pcetar..."


## Data Preprocessing
Unlike the previous project, the *tweet* column here holds free text, which must be transformed into a form a computer can work with. First, let's import the *library* we will use to process the text, which is written in Indonesian.

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nitro\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Tokenizing
Tokenizing splits a sentence into individual words. This is needed because a machine processes individual words more easily than whole sentences. We will also convert all uppercase letters to lowercase and remove symbols that are not needed.

In [4]:
# before
print(df['tweet'][0])
nltk.word_tokenize(df["tweet"][0])

RT @spardaxyz: Fadli Zon Minta Mendagri Segera Menonaktifkan Ahok Jadi Gubernur DKI https:\/\/t.co\/KH5vIRwPdO


['RT',
 '@',
 'spardaxyz',
 ':',
 'Fadli',
 'Zon',
 'Minta',
 'Mendagri',
 'Segera',
 'Menonaktifkan',
 'Ahok',
 'Jadi',
 'Gubernur',
 'DKI',
 'https',
 ':',
 '\\/\\/t.co\\/KH5vIRwPdO']

In [5]:
# split each tweet into tokens
df['tokenized'] = df['tweet'].apply(nltk.word_tokenize)
df

Unnamed: 0,label,tweet,tokenized
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @, spardaxyz, :, Fadli, Zon, Minta, Menda..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @, baguscondromowo, :, Mereka, terus, mel..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @, lisdaulay28, :, Waspada, KTP, palsu, ...."
...,...,...,...
708,HS,Muka Si BABi Ahok Tuh Yg Mirip SERBET Lantai.....,"[Muka, Si, BABi, Ahok, Tuh, Yg, Mirip, SERBET,..."
709,HS,"Betul bang hancurkan merka bang, musnahkan chi...","[Betul, bang, hancurkan, merka, bang, ,, musna..."
710,HS,"Sapa Yg bilang Ahok anti korupsi!?, klo grombo...","[Sapa, Yg, bilang, Ahok, anti, korupsi, !, ?, ..."
711,HS,"Gw juga ngimpi SENTILIN BIJI BABI AHOK, pcetar...","[Gw, juga, ngimpi, SENTILIN, BIJI, BABI, AHOK,..."


In [6]:
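# remove symbols: keep purely alphabetic tokens and lowercase them
# (the column is built in one assignment; chained indexing such as
# df[col][i] = ... may be applied to a copy and triggers pandas warnings)
df['tokenized_nosymbol'] = df['tokenized'].apply(
    lambda words: [word.lower() for word in words if word.isalpha()]
)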
df

Unnamed: 0,label,tweet,tokenized,tokenized_nosymbol
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @, spardaxyz, :, Fadli, Zon, Minta, Menda...","[rt, spardaxyz, fadli, zon, minta, mendagri, s..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @, baguscondromowo, :, Mereka, terus, mel...","[rt, baguscondromowo, mereka, terus, melukai, ..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke...","[sylvi, bagaimana, gurbernur, melakukan, keker..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M...","[ahmad, dhani, tak, puas, debat, pilkada, masa..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @, lisdaulay28, :, Waspada, KTP, palsu, ....","[rt, waspada, ktp, palsu, pilkada, https]"
...,...,...,...,...
708,HS,Muka Si BABi Ahok Tuh Yg Mirip SERBET Lantai.....,"[Muka, Si, BABi, Ahok, Tuh, Yg, Mirip, SERBET,...","[muka, si, babi, ahok, tuh, yg, mirip, serbet,..."
709,HS,"Betul bang hancurkan merka bang, musnahkan chi...","[Betul, bang, hancurkan, merka, bang, ,, musna...","[betul, bang, hancurkan, merka, bang, musnahka..."
710,HS,"Sapa Yg bilang Ahok anti korupsi!?, klo grombo...","[Sapa, Yg, bilang, Ahok, anti, korupsi, !, ?, ...","[sapa, yg, bilang, ahok, anti, korupsi, klo, g..."
711,HS,"Gw juga ngimpi SENTILIN BIJI BABI AHOK, pcetar...","[Gw, juga, ngimpi, SENTILIN, BIJI, BABI, AHOK,...","[gw, juga, ngimpi, sentilin, biji, babi, ahok,..."


In [7]:
# after
df['tokenized_nosymbol'][0]

['rt',
 'spardaxyz',
 'fadli',
 'zon',
 'minta',
 'mendagri',
 'segera',
 'menonaktifkan',
 'ahok',
 'jadi',
 'gubernur',
 'dki',
 'https']
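
The filtering step can be reproduced on the first tweet without NLTK. The sketch below hard-codes the token list printed in cell In [4] and applies the same `isalpha()` and `lower()` logic; the helper name `clean_tokens` is an illustrative assumption, not part of the notebook. Because `isalpha()` rejects any token containing digits or punctuation, a username such as `lisdaulay28` or a URL fragment disappears entirely, while the bare `https` token survives.

```python
# Tokens as printed by nltk.word_tokenize for the first tweet (cell In [4]).
tokens = ['RT', '@', 'spardaxyz', ':', 'Fadli', 'Zon', 'Minta', 'Mendagri',
          'Segera', 'Menonaktifkan', 'Ahok', 'Jadi', 'Gubernur', 'DKI',
          'https', ':', '\\/\\/t.co\\/KH5vIRwPdO']


def clean_tokens(words):
    """Keep purely alphabetic tokens and lowercase them (hypothetical helper)."""
    return [word.lower() for word in words if word.isalpha()]


cleaned = clean_tokens(tokens)
print(cleaned)

# A token that mixes letters and digits is removed entirely.
print(clean_tokens(['lisdaulay28', 'Waspada', 'KTP']))
```

If usernames or the leftover `https` token turn out to be noise for the classifier, they would have to be filtered in a separate step; the notebook keeps them as they are.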

### Stemming
Stemming strips affixes (such as prefixes and suffixes) from each word, reducing it to its root form.

In [8]:
# before
df['tokenized_nosymbol'][1]

['rt',
 'baguscondromowo',
 'mereka',
 'terus',
 'melukai',
 'aksi',
 'dalam',
 'rangka',
 'memenjarakan',
 'ahok',
 'atau',
 'ahok',
 'gagal',
 'dalam',
 'pilkada']

In [9]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# build the Sastrawi stemmer for Indonesian
factory = StemmerFactory()
stemmer = factory.create_stemmer()

# stem every token of every tweet; assigning the whole column at once
# avoids chained indexing (df['stemmed'][i] = ...)
df['stemmed'] = df['tokenized_nosymbol'].apply(
    lambda words: [stemmer.stem(word) for word in words]
)
    
df

Unnamed: 0,label,tweet,tokenized,tokenized_nosymbol,stemmed
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,"[RT, @, spardaxyz, :, Fadli, Zon, Minta, Menda...","[rt, spardaxyz, fadli, zon, minta, mendagri, s...","[rt, spardaxyz, fadli, zon, minta, mendagri, s..."
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,"[RT, @, baguscondromowo, :, Mereka, terus, mel...","[rt, baguscondromowo, mereka, terus, melukai, ...","[rt, baguscondromowo, mereka, terus, luka, aks..."
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,"[Sylvi, :, bagaimana, gurbernur, melakukan, ke...","[sylvi, bagaimana, gurbernur, melakukan, keker...","[sylvi, bagaimana, gurbernur, laku, keras, per..."
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...","[Ahmad, Dhani, Tak, Puas, Debat, Pilkada, ,, M...","[ahmad, dhani, tak, puas, debat, pilkada, masa...","[ahmad, dhani, tak, puas, debat, pilkada, masa..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,"[RT, @, lisdaulay28, :, Waspada, KTP, palsu, ....","[rt, waspada, ktp, palsu, pilkada, https]","[rt, waspada, ktp, palsu, pilkada, https]"
...,...,...,...,...,...
708,HS,Muka Si BABi Ahok Tuh Yg Mirip SERBET Lantai.....,"[Muka, Si, BABi, Ahok, Tuh, Yg, Mirip, SERBET,...","[muka, si, babi, ahok, tuh, yg, mirip, serbet,...","[muka, si, babi, ahok, tuh, yg, mirip, serbet,..."
709,HS,"Betul bang hancurkan merka bang, musnahkan chi...","[Betul, bang, hancurkan, merka, bang, ,, musna...","[betul, bang, hancurkan, merka, bang, musnahka...","[betul, bang, hancur, merka, bang, musnah, chi..."
710,HS,"Sapa Yg bilang Ahok anti korupsi!?, klo grombo...","[Sapa, Yg, bilang, Ahok, anti, korupsi, !, ?, ...","[sapa, yg, bilang, ahok, anti, korupsi, klo, g...","[sapa, yg, bilang, ahok, anti, korupsi, klo, g..."
711,HS,"Gw juga ngimpi SENTILIN BIJI BABI AHOK, pcetar...","[Gw, juga, ngimpi, SENTILIN, BIJI, BABI, AHOK,...","[gw, juga, ngimpi, sentilin, biji, babi, ahok,...","[gw, juga, ngimpi, sentilin, biji, babi, ahok,..."


In [10]:
# after
df['stemmed'][1]

['rt',
 'baguscondromowo',
 'mereka',
 'terus',
 'luka',
 'aksi',
 'dalam',
 'rangka',
 'penjara',
 'ahok',
 'atau',
 'ahok',
 'gagal',
 'dalam',
 'pilkada']

### One Hot Encoding
Convert the token lists in the *stemmed* column into binary columns, one per distinct word. Only the *stemmed* column is encoded, so this step can be done together with defining the model's features and target.

In [11]:
feature = df['stemmed'].str.join('|').str.get_dummies()
feature

Unnamed: 0,a,aagym,aahhh,aamiin,abai,abang,abdi,abis,abu,acara,...,yme,yng,yoi,yos,you,youtube,yudijannis,yuk,zikir,zon
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
708,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
709,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
710,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
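
To make the encoding concrete, here is a minimal pure-Python sketch of what `str.join('|').str.get_dummies()` produces: one column per distinct token, in sorted order, holding 1 if the token occurs in the tweet and 0 otherwise. The two short token lists and the function name `one_hot` are made up for illustration and are not taken from the dataset.

```python
def one_hot(token_lists):
    """Return (vocabulary, rows) with a 0/1 indicator per vocabulary word."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    rows = []
    for tokens in token_lists:
        present = set(tokens)
        rows.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, rows


# Two toy "stemmed" tweets (illustrative assumption).
stemmed = [
    ['ahok', 'gagal', 'ahok'],
    ['pilkada', 'gagal'],
]

vocabulary, rows = one_hot(stemmed)
print(vocabulary)
for row in rows:
    print(row)
```

The result records presence only: a word that appears twice in a tweet, like `ahok` in the first toy list, still yields 1 rather than 2.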


In [12]:
target = df['label']
target

0      Non_HS
1      Non_HS
2      Non_HS
3      Non_HS
4      Non_HS
        ...  
708        HS
709        HS
710        HS
711        HS
712        HS
Name: label, Length: 713, dtype: object

## Model Building
Several models are trained and compared to find the best one.

### Train Test Split
To train the models, the dataframe is split into 70% of the data for *training* and 30% for *testing*.

In [13]:
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test  = train_test_split(feature , target, shuffle = True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(499, 2330)
(214, 2330)
(499,)
(214,)
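
The row counts above follow from the split ratio. scikit-learn appears to round the test share up and give the remainder to training, which is consistent with the printed shapes; the arithmetic below is a sketch of that assumed rounding rule, not a call into the library. The second dimension, 2330, is the number of distinct stemmed words, one column each.

```python
import math

n_samples = 713      # rows in the dataset
test_size = 0.3      # fraction passed to train_test_split

# Assumed rounding rule: the test count is rounded up, training gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```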


### Decision Tree

In [14]:
# Import Module
from sklearn import tree, metrics

# Modeling Decision Tree
dtc = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtc.fit(X_train, y_train)

# Predict to Test Data 
y_pred_dtc = dtc.predict(X_test)
y_pred_dtc

array(['HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS',
       'HS', 'Non_HS', 'HS', 'Non_HS', 'HS', 'Non_HS', 'HS', 'Non_HS',
       'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'HS', 'HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'HS', 'Non_HS', 'HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'HS',
       'Non_HS', 'HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS',
       'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS',
       'Non_HS', 'HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS'

In [17]:
# Show the Accuracy, Precision, Recall
acc_dtc = metrics.accuracy_score(y_test, y_pred_dtc)
prec_dtc = metrics.precision_score(y_test, y_pred_dtc, pos_label = 'HS')
rec_dtc = metrics.recall_score(y_test, y_pred_dtc, pos_label = 'HS')
f1_dtc = metrics.f1_score(y_test, y_pred_dtc, pos_label = 'HS')
kappa_dtc = metrics.cohen_kappa_score(y_test, y_pred_dtc)

print("Accuracy:", acc_dtc)
print("Precision:", prec_dtc)
print("Recall:", rec_dtc)
print("F1 Score:", f1_dtc)
print("Cohens Kappa Score:", kappa_dtc)

Accuracy: 0.8177570093457944
Precision: 0.8688524590163934
Recall: 0.6309523809523809
F1 Score: 0.7310344827586208
Cohens Kappa Score: 0.5984024636704841
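
The five scores are all functions of the confusion matrix. The counts used below (53 true positives, 8 false positives, 31 false negatives, 122 true negatives, with `HS` as the positive class) are not printed by the notebook; they are inferred from the reported precision and recall together with the 214 test rows, so treat them as an assumption. The function name `binary_scores` is likewise illustrative.

```python
def binary_scores(tp, fp, fn, tn):
    """Compute common binary-classification scores from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted HS, how many are really HS
    recall = tp / (tp + fn)      # of real HS, how many were found
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)

    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1, 'kappa': kappa}


# Inferred counts for the decision tree (assumption, see the text above).
scores = binary_scores(tp=53, fp=8, fn=31, tn=122)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the decision tree is rarely wrong when it flags a tweet as `HS` (precision about 0.87), but it misses roughly 37% of the actual hate speech tweets in the test set (recall about 0.63).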


### Naive Bayes

In [18]:
# Import Module
from sklearn.naive_bayes import GaussianNB 

# Modeling Naive Bayes Classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict to Test Data
y_pred_gnb= gnb.predict(X_test)

In [19]:
# Show the Accuracy, Precision, Recall
acc_gnb = metrics.accuracy_score(y_test, y_pred_gnb)
prec_gnb = metrics.precision_score(y_test, y_pred_gnb, pos_label = 'HS')
rec_gnb = metrics.recall_score(y_test, y_pred_gnb, pos_label = 'HS')
f1_gnb = metrics.f1_score(y_test, y_pred_gnb, pos_label = 'HS')
kappa_gnb = metrics.cohen_kappa_score(y_test, y_pred_gnb)

print("Accuracy:", acc_gnb)
print("Precision:", prec_gnb)
print("Recall:", rec_gnb)
print("F1 Score:", f1_gnb)
print("Cohens Kappa Score:", kappa_gnb)

Accuracy: 0.7523364485981309
Precision: 0.6534653465346535
Recall: 0.7857142857142857
F1 Score: 0.7135135135135134
Cohens Kappa Score: 0.4986296525506144


### Random Forest

In [23]:
# Import Module
from sklearn.ensemble import RandomForestClassifier

# Modeling Random Forest
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Predict to Test Data 
y_pred_clf = clf.predict(X_test)
y_pred_clf

array(['HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS',
       'HS', 'Non_HS', 'HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'HS', 'Non_HS', 'HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'HS',
       'Non_HS', 'HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'Non_HS',
       'Non_HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS',
       'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS', 'HS',
       'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS', 'HS', 'Non_HS',
       'Non_HS', 'HS', 'HS', 'HS', 'Non_HS', 'Non_HS', 'Non_HS', 'Non_HS',
       'HS', 'Non_HS', 'Non_HS', 'HS', 

In [24]:
# Show the Accuracy, Precision, Recall
acc_clf = metrics.accuracy_score(y_test, y_pred_clf)
prec_clf = metrics.precision_score(y_test, y_pred_clf, pos_label = 'HS')
rec_clf = metrics.recall_score(y_test, y_pred_clf, pos_label = 'HS')
f1_clf = metrics.f1_score(y_test, y_pred_clf, pos_label = 'HS')
kappa_clf = metrics.cohen_kappa_score(y_test, y_pred_clf)

print("Accuracy:", acc_clf)
print("Precision:", prec_clf)
print("Recall:", rec_clf)
print("F1 Score:", f1_clf)
print("Cohens Kappa Score:", kappa_clf)

Accuracy: 0.8317757009345794
Precision: 0.9
Recall: 0.6428571428571429
F1 Score: 0.75
Cohens Kappa Score: 0.6284722222222222


## Conclusion
Of the three models, **Random Forest** achieves the highest accuracy and precision, at **0.83** and **0.90** respectively. It also has the highest F1 score (0.75) and Cohen's kappa (0.63), while Naive Bayes has the highest recall (0.79).
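
The comparison can be made explicit by collecting the scores printed above, rounded to four decimals, and picking the best model per metric. The dictionary layout and variable names are illustrative assumptions; the numbers are copied from the notebook output.

```python
# Scores copied from the notebook output above, rounded to four decimals.
results = {
    'Decision Tree': {'accuracy': 0.8178, 'precision': 0.8689,
                      'recall': 0.6310, 'f1': 0.7310, 'kappa': 0.5984},
    'Naive Bayes':   {'accuracy': 0.7523, 'precision': 0.6535,
                      'recall': 0.7857, 'f1': 0.7135, 'kappa': 0.4986},
    'Random Forest': {'accuracy': 0.8318, 'precision': 0.9000,
                      'recall': 0.6429, 'f1': 0.7500, 'kappa': 0.6285},
}

metric_names = ['accuracy', 'precision', 'recall', 'f1', 'kappa']

# For each metric, find the model with the highest score.
best = {
    metric: max(results, key=lambda model: results[model][metric])
    for metric in metric_names
}

for metric, model in best.items():
    print(f"{metric}: {model} ({results[model][metric]:.4f})")
```

Which model is preferable depends on the cost of each error type: if missing hate speech is worse than a false alarm, the higher recall of Naive Bayes may matter more than the higher precision of Random Forest.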