<a href="https://colab.research.google.com/github/aurisaprastika/hate-speech-classification/blob/main/Text_Mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset was obtained from https://github.com/ialfina/id-hatespeech-detection


**Install nltk**

In [80]:
pip install nltk



**Install Sastrawi for Indonesian text processing**

In [81]:
pip install Sastrawi



**Import the required modules**

In [82]:
# Standard library
import os
import re
import string

# Third-party libraries
import matplotlib.pyplot as plt
import nltk
import nltk.corpus
import numpy as np
import pandas as pd
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Build the Sastrawi stemmer once so it can be reused for every tweet
factory = StemmerFactory()
stemmer = factory.create_stemmer()

In [83]:
df = pd.read_csv('IDHSD_RIO_unbalanced_713_2017.txt', sep='\t', header=None, names=['Label', 'Tweet'], skiprows=1, engine='python')
df

| | Label | Tweet |
|---|---|---|
| 0 | Non_HS | RT @spardaxyz: Fadli Zon Minta Mendagri Segera... |
| 1 | Non_HS | RT @baguscondromowo: Mereka terus melukai aksi... |
| 2 | Non_HS | Sylvi: bagaimana gurbernur melakukan kekerasan... |
| 3 | Non_HS | Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja... |
| 4 | Non_HS | RT @lisdaulay28: Waspada KTP palsu.....kawal P... |
| ... | ... | ... |
| 708 | HS | Muka Si BABi Ahok Tuh Yg Mirip SERBET Lantai..... |
| 709 | HS | Betul bang hancurkan merka bang, musnahkan chi... |
| 710 | HS | Sapa Yg bilang Ahok anti korupsi!?, klo grombo... |
| 711 | HS | Gw juga ngimpi SENTILIN BIJI BABI AHOK, pcetar... |
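As a rough illustration of what the `read_csv` arguments above do, the sketch below loads a tiny in-memory stand-in for the file. The two sample rows and the presence of a header line in the real file are assumptions made for the example; only the argument names come from the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the tab-separated dataset file.
sample = "Label\tTweet\nNon_HS\tcontoh tweet netral\nHS\tcontoh tweet kasar\n"

# skiprows=1 drops the first line of the file; header=None together with
# names= supplies the column labels explicitly.
toy = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                  names=["Label", "Tweet"], skiprows=1, engine="python")

# Counting rows per label is a quick way to see how unbalanced the classes are.
counts = toy["Label"].value_counts()
print(counts)
```

Running `df['Label'].value_counts()` on the real DataFrame would give the same kind of per-class summary.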


## **Preprocessing Stage**

Case Folding

In [84]:
df['Tweet'] = df['Tweet'].str.lower()
print('Case Folding Result: \n')
print(df['Tweet'])
print('\n\n\n')

Case Folding Result: 

0      rt @spardaxyz: fadli zon minta mendagri segera...
1      rt @baguscondromowo: mereka terus melukai aksi...
2      sylvi: bagaimana gurbernur melakukan kekerasan...
3      ahmad dhani tak puas debat pilkada, masalah ja...
4      rt @lisdaulay28: waspada ktp palsu.....kawal p...
                             ...                        
708    muka si babi ahok tuh yg mirip serbet lantai.....
709    betul bang hancurkan merka bang, musnahkan chi...
710    sapa yg bilang ahok anti korupsi!?, klo grombo...
711    gw juga ngimpi sentilin biji babi ahok, pcetar...
712    mudah2an gw ketemu sama si babi iwan bopeng di...
Name: Tweet, Length: 713, dtype: object






## **Tokenizing**

In [85]:
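def remove_tweet_special(text):
    # Replace literal backslash escapes (\t, \n, \u) and drop stray backslashes
    text = text.replace('\\t'," ").replace('\\n'," ").replace('\\u'," ").replace('\\',"")
    # Replace every non-ASCII character with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    # Remove mentions, hashtags, and URLs, then collapse whitespace
    # (raw string avoids invalid-escape warnings for \w, \/ and \S)
    text = ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)"," ", text).split())
    return text.replace("http://", " ").replace("https://", " ")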

df['Tweet'] = df['Tweet'].apply(remove_tweet_special)

#remove number
def remove_number(text):
    return  re.sub(r"\d+", "", text)

df['Tweet'] = df['Tweet'].apply(remove_number)

#remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("","",string.punctuation))

df['Tweet'] = df['Tweet'].apply(remove_punctuation)

#remove whitespace leading & trailing
def remove_whitespace_LT(text):
    return text.strip()

df['Tweet'] = df['Tweet'].apply(remove_whitespace_LT)

#collapse multiple whitespace characters into a single space
def remove_whitespace_multiple(text):
    return re.sub(r'\s+',' ',text)

df['Tweet'] = df['Tweet'].apply(remove_whitespace_multiple)

# remove single char
def remove_singl_char(text):
    return re.sub(r"\b[a-zA-Z]\b", "", text)

df['Tweet'] = df['Tweet'].apply(remove_singl_char)

In [86]:
print('Tokenizing Result: \n') 
print(df['Tweet'])
print('\n\n\n')

Tokenizing Result: 

0      rt fadli zon minta mendagri segera menonaktifk...
1      rt mereka terus melukai aksi dalam rangka meme...
2      sylvi bagaimana gurbernur melakukan kekerasan ...
3      ahmad dhani tak puas debat pilkada masalah jal...
4                      rt waspada ktp palsukawal pilkada
                             ...                        
708     muka si babi ahok tuh yg mirip serbet lantai btp
709    betul bang hancurkan merka bang musnahkan chin...
710    sapa yg bilang ahok anti korupsi klo grombolan...
711    gw juga ngimpi sentilin biji babi ahok pcetar ...
712    mudahan gw ketemu sama si babi iwan bopeng di ...
Name: Tweet, Length: 713, dtype: object
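The cleaning steps above delete punctuation without inserting a space, which is why `palsu.....kawal` in row 4 ends up as the single token `palsukawal`. The sketch below reproduces that effect on an illustrative string modeled on row 4 (the full tweet is truncated in the output, so the sample text is an assumption) and shows one possible alternative that maps punctuation to spaces instead. The `clean` helper and its `punct_to_space` flag are hypothetical names, not part of the notebook.

```python
import re
import string

def clean(text, punct_to_space=False):
    # Drop @mentions, #hashtags, and URLs (same pattern as remove_tweet_special).
    text = re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)", " ", text)
    # Delete digits.
    text = re.sub(r"\d+", "", text)
    if punct_to_space:
        # Alternative: turn each punctuation mark into a space.
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    else:
        # Notebook behaviour: delete punctuation outright.
        table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)
    # Trim the ends and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text.strip())

sample = "rt @lisdaulay28: waspada ktp palsu.....kawal pilkada"
merged = clean(sample)
spaced = clean(sample, punct_to_space=True)
print(merged)  # rt waspada ktp palsukawal pilkada
print(spaced)  # rt waspada ktp palsu kawal pilkada
```

Whether the space-preserving variant helps the classifiers is not something the notebook measures; it only keeps word boundaries intact.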






## **Stemming**

In [87]:
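def stemming(sentence):
    # Reduce every word in the sentence to its root form with Sastrawi
    return stemmer.stem(sentence)

# The stemmer is applied twice here; the results shown below come from this double pass
df['Tweet'] = df['Tweet'].apply(stemming).apply(stemming)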

df

| | Label | Tweet |
|---|---|---|
| 0 | Non_HS | rt fadli zon minta mendagri segera nonaktif ah... |
| 1 | Non_HS | rt mereka terus luka aksi dalam rangka penjara... |
| 2 | Non_HS | sylvi bagaimana gurbernur laku keras perempuan... |
| 3 | Non_HS | ahmad dhani tak puas debat pilkada masalah jal... |
| 4 | Non_HS | rt waspada ktp palsukawal pilkada |
| ... | ... | ... |
| 708 | HS | muka si babi ahok tuh yg mirip serbet lantai btp |
| 709 | HS | betul bang hancur merka bang musnah china babi... |
| 710 | HS | sapa yg bilang ahok anti korupsi klo grombolan... |
| 711 | HS | gw juga ngimpi sentilin biji babi ahok pcetar ... |


## **Encode**

In [88]:
# LabelEncoder assigns integer codes in sorted label order: HS -> 0, Non_HS -> 1
le1 = preprocessing.LabelEncoder()
df['Label'] = le1.fit_transform(df['Label'])

df

| | Label | Tweet |
|---|---|---|
| 0 | 1 | rt fadli zon minta mendagri segera nonaktif ah... |
| 1 | 1 | rt mereka terus luka aksi dalam rangka penjara... |
| 2 | 1 | sylvi bagaimana gurbernur laku keras perempuan... |
| 3 | 1 | ahmad dhani tak puas debat pilkada masalah jal... |
| 4 | 1 | rt waspada ktp palsukawal pilkada |
| ... | ... | ... |
| 708 | 0 | muka si babi ahok tuh yg mirip serbet lantai btp |
| 709 | 0 | betul bang hancur merka bang musnah china babi... |
| 710 | 0 | sapa yg bilang ahok anti korupsi klo grombolan... |
| 711 | 0 | gw juga ngimpi sentilin biji babi ahok pcetar ... |
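The output above shows `Non_HS` encoded as 1 and `HS` as 0. A minimal sketch of how to inspect that mapping and reverse it is given below; the three-label list is a toy input, and `encoder`, `mapping`, and `decoded` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Toy label list using the two class names from the dataset.
codes = encoder.fit_transform(["Non_HS", "HS", "Non_HS"])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'HS': 0, 'Non_HS': 1}

# inverse_transform turns integer codes back into the original labels.
decoded = [str(label) for label in encoder.inverse_transform([0, 1])]
print(decoded)  # ['HS', 'Non_HS']
```

Keeping this mapping in mind matters when reading the classification reports later: class 0 is hate speech and class 1 is non-hate speech.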


## **Feature extraction**

In [89]:
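# Use unigrams and bigrams; drop terms that appear in more than 75% of the tweets
# or in fewer than 5 tweets, and keep at most 10,000 features
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.75, min_df=5, max_features=10000)

# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(df['Tweet'])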
tfidf

<713x456 sparse matrix of type '<class 'numpy.float64'>'
	with 6516 stored elements in Compressed Sparse Row format>

## **Modelling**

Naive Bayes

In [90]:
# Naive Bayes modelling
X = tfidf
y = df['Label'].astype(int)
# GaussianNB needs a dense array, hence X.toarray(); 30% of the rows are held out for testing
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X.toarray(), y, random_state=42, test_size=0.30)
nbb = GaussianNB()
nbb.fit(X_train_tfidf, y_train)
y_preds = nbb.predict(X_test_tfidf)
acc2b = accuracy_score(y_test, y_preds)
report = classification_report(y_test, y_preds)
print(report)
print("Naive Bayes, Accuracy Score:", acc2b)

              precision    recall  f1-score   support

           0       0.55      0.86      0.67        66
           1       0.92      0.69      0.79       148

    accuracy                           0.74       214
   macro avg       0.74      0.78      0.73       214
weighted avg       0.81      0.74      0.75       214

Naive Bayes, Accuracy Score: 0.7429906542056075
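The derived columns of the report can be checked by hand from the per-class precision, recall, and support. The sketch below does that arithmetic with the rounded values printed above, so the last digit may differ slightly from what scikit-learn computes from unrounded counts; `rows`, `f1`, and the other names are illustrative.

```python
# (precision, recall, support) per class, copied from the Naive Bayes report.
rows = {0: (0.55, 0.86, 66), 1: (0.92, 0.69, 148)}

def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

total = sum(support for _, _, support in rows.values())

f1_hs = f1(*rows[0][:2])
f1_non_hs = f1(*rows[1][:2])

# Weighted averages weight each class by its support.
weighted_precision = sum(p * s for p, _, s in rows.values()) / total
weighted_recall = sum(r * s for _, r, s in rows.values()) / total

print(round(f1_hs, 2), round(f1_non_hs, 2))                    # 0.67 0.79
print(round(weighted_precision, 2), round(weighted_recall, 2))  # 0.81 0.74
```

The support-weighted recall equals the overall accuracy, which is why both appear as 0.74 in the report.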


Logistic Regression

In [91]:
# Note: this split uses test_size=0.2, while the Naive Bayes split above used 0.30,
# so the two accuracy scores are measured on different test sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
model = LogisticRegression().fit(X_train_tfidf, y_train)
y_preds = model.predict(X_test_tfidf)
report = classification_report(y_test, y_preds)
print(report)
acc = accuracy_score(y_test, y_preds)
print("Logistic Regression, Accuracy Score:", acc)

              precision    recall  f1-score   support

           0       0.87      0.73      0.80        45
           1       0.89      0.95      0.92        98

    accuracy                           0.88       143
   macro avg       0.88      0.84      0.86       143
weighted avg       0.88      0.88      0.88       143

Logistic Regression, Accuracy Score: 0.8811188811188811
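In the cells above, the TF-IDF vocabulary and IDF weights are fitted on all 713 tweets before the train/test split, so the test tweets influence the features. One possible way to avoid that, sketched here on made-up toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The texts, labels, and split size are assumptions for illustration, not the notebook's method or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy texts; 0 = HS, 1 = Non_HS, matching the encoding used above.
texts = [
    "kata kasar hina", "hina kasar sekali", "kasar dan hina", "sangat kasar hina",
    "debat pilkada jakarta", "berita pilkada hari ini", "jadwal debat pilkada", "berita debat jakarta",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first so the vectorizer never sees the test tweets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
pipe.fit(X_train, y_train)

preds = pipe.predict(X_test)
print(list(preds))
```

Reusing the same split for both Naive Bayes and Logistic Regression would also make their accuracy scores directly comparable.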
