# Data Science Trainee Hands On

## Introduction

### Background

Over the past several years, social media has become one of the largest channels for spreading information in society. It gives every individual room to produce, distribute, and consume information with great ease, unconstrained by place or time. Because sharing information and expressing oneself is so easy, the volume of social media data has grown very large, and this data holds a great deal of valuable information. One common use of social media is as a forum for discussion, or simply for voicing opinions on current issues.

This convenience does not only have positive effects. One of the most visible downsides is how easily people can direct hate speech at other users. This is a fairly serious problem because it can create a harmful environment for people using social media. It is quite possible that many of the mental health problems people experience are connected to this condition. It is also at odds with Indonesian culture, which is known for friendliness toward others. The physical distance that social media bridges seems to erode the Indonesian custom of being polite and courteous to everyone.

Negative effects caused by technology should also be addressable through technology. One way to reduce hate speech and bullying on social media is to classify incoming messages so that those suspected of carrying harmful intent can be filtered out. The problem analyzed and addressed in this implementation is therefore hate speech classification.

Before the main task, the text data on this topic is analyzed first. The analysis is intended to yield insights about the data that help in solving the problem. It is carried out by processing the data and visualizing it.

### Problem Description

The data has 6 labels: 5 hate speech labels (**religion, race, physical, gender, other**) and 1 **not hate speech** label. In this implementation I intend to analyze the data first and then decide whether to use all 6 labels or reduce them to 2.

The data is a collection of tweets labeled for abusive language and hate speech. The hate speech annotations carry further detail on the topic or domain of the hate speech and on its strength. The dataset folder also provides a typo and slang-word dictionary for normalizing tweet text, which contains many typos, abbreviations, and slang terms. A list of abusive words is also available and can be used for feature extraction.
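
The choice between 6 labels and 2 can be informed by comparing class counts under both schemes. The following is a minimal sketch on a handful of hypothetical labels (not taken from the dataset); `to_binary` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

# Hypothetical category labels, written in the text form this notebook uses.
categories = [
    "religion hatespeech", "not hatespeech", "other hatespeech",
    "not hatespeech", "gender hatespeech", "not hatespeech",
]

def to_binary(label):
    """Collapse the five hate speech categories into one positive class."""
    return "not hatespeech" if label == "not hatespeech" else "hatespeech"

six_way = Counter(categories)                        # counts per fine-grained label
two_way = Counter(to_binary(c) for c in categories)  # counts after collapsing

print(six_way)
print(two_way)
```

If the fine-grained classes turn out to be very small relative to the negative class, the 2-label setup is likely the easier one to train on.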


## Data Overview

### Data Preparation

#### Import Libraries

In [1]:
import re
import nltk
import string
import codecs
import pickle
import warnings
import numpy as np
import pandas as pd
import fasttext.util
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report, confusion_matrix

from tensorflow import keras
from tensorflow.keras import layers, models, initializers, regularizers, constraints, optimizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Conv1D, Embedding, Dropout, GlobalMaxPool1D, SpatialDropout1D, BatchNormalization, Bidirectional, LSTM, GlobalMaxPooling1D, MaxPooling1D, Flatten
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing import text, sequence

from tqdm import tqdm
from gensim.models.word2vec import Word2Vec

from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem.snowball import PorterStemmer

from spacy.lang.id.stop_words import STOP_WORDS as STOPWORDS


In [2]:
sns.set_style("white")
pd.options.mode.chained_assignment = None
pd.set_option('display.max_colwidth', None)
warnings.simplefilter(action="ignore", category=FutureWarning)

#### Import Data

In [3]:
df_raw = pd.read_csv("../input/jsc-handson/data.csv", encoding = "ISO-8859-1")
df_raw.head(3)

Unnamed: 0,Tweet,HS,Abusive,HS_Individual,HS_Group,HS_Religion,HS_Race,HS_Physical,HS_Gender,HS_Other,HS_Weak,HS_Moderate,HS_Strong
0,- disaat semua cowok berusaha melacak perhatian gue. loe lantas remehkan perhatian yg gue kasih khusus ke elo. basic elo cowok bego ! ! !',1,1,1,0,0,0,0,0,1,1,0,0
1,RT USER: USER siapa yang telat ngasih tau elu?edan sarap gue bergaul dengan cigax jifla calis sama siapa noh licew juga',0,1,0,0,0,0,0,0,0,0,0,0
2,"41. Kadang aku berfikir, kenapa aku tetap percaya pada Tuhan padahal aku selalu jatuh berkali-kali. Kadang aku merasa Tuhan itu ninggalkan aku sendirian. Ketika orangtuaku berencana berpisah, ketika kakakku lebih memilih jadi Kristen. Ketika aku anak ter",0,0,0,0,0,0,0,0,0,0,0,0


To simplify the analysis, the one-hot encoded labels are translated into text labels.

In [4]:
def categories_translate(row):
    if (row.HS_Religion):
        return "religion hatespeech"
    elif (row.HS_Race):
        return "race hatespeech"
    elif (row.HS_Physical):
        return "physical hatespeech"
    elif (row.HS_Gender):
        return "gender hatespeech"
    elif (row.HS_Other):
        return "other hatespeech"
    else:
        return "not hatespeech"
    
def target_translate(row):
    if (row.HS_Individual):
        return "individual hatespeech"
    elif (row.HS_Group):
        return "group hatespeech"
    else:
        return "not hatespeech"
    
def severity_translate(row):
    if (row.HS_Weak):
        return "weak hatespeech"
    elif (row.HS_Moderate):
        return "moderate hatespeech"
    elif (row.HS_Strong):
        return "strong hatespeech"
    else:
        return "not hatespeech"
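
Because the branches above are checked in a fixed order, a tweet flagged with several categories would receive only the first matching label. The sketch below illustrates that precedence on a hypothetical row represented as a plain dict; `first_category` and `all_categories` are illustrative helpers, and whether such multi-flag rows actually occur in the data is something to check rather than assume.

```python
# Same precedence as categories_translate: the first set flag wins.
PRIORITY = ["Religion", "Race", "Physical", "Gender", "Other"]

def first_category(row):
    """Return the label of the first HS_* flag that is set (row is a plain dict here)."""
    for name in PRIORITY:
        if row.get("HS_" + name):
            return name.lower() + " hatespeech"
    return "not hatespeech"

def all_categories(row):
    """Return every category whose flag is set, to reveal multi-label rows."""
    return [name.lower() for name in PRIORITY if row.get("HS_" + name)]

# Hypothetical row flagged with two categories at once.
row = {"HS_Religion": 0, "HS_Race": 1, "HS_Physical": 0, "HS_Gender": 1, "HS_Other": 0}
print(first_category(row))   # only the first match is kept
print(all_categories(row))   # both flags are visible here
```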

In [5]:
df_raw["hate_speech"] = df_raw.apply(lambda row: "hatespeech" if row.HS else "not hatespeech", axis=1)
df_raw["abusive"] = df_raw.apply(lambda row: "abusive" if row.Abusive else "not abusive", axis=1)  # read the Abusive column, not HS
df_raw["hate_speech_categories"] = df_raw.apply(lambda row: categories_translate(row), axis=1)
df_raw["hate_speech_target"] = df_raw.apply(lambda row: target_translate(row), axis=1)
df_raw["hate_speech_severity"] = df_raw.apply(lambda row: severity_translate(row), axis=1)

df_raw = df_raw.rename(columns={"Tweet": "text"})
df_raw = df_raw.loc[:, ['text', 'hate_speech', "abusive", "hate_speech_categories", "hate_speech_target", "hate_speech_severity"]]

The data is clearly still very noisy because it comes from social media, so it still contains mentions, hashtags, and URLs. Preprocessing is needed to handle this.

#### Missing Values

In [6]:
df_raw.isnull().sum()

text                      0
hate_speech               0
abusive                   0
hate_speech_categories    0
hate_speech_target        0
hate_speech_severity      0
dtype: int64

No rows contain null or undefined values, so the entire dataset can be processed.

#### Duplicate Values

In [7]:
df_raw[df_raw.duplicated(keep=False)]

Unnamed: 0,text,hate_speech,abusive,hate_speech_categories,hate_speech_target,hate_speech_severity
33,"RT USER: Kelakuan homok jaman now, ngentot aja sambil live di Blued #gayvid #gvid #lokalhangat URL",not hatespeech,not abusive,not hatespeech,not hatespeech,not hatespeech
40,GAPERNAH MENDALAMI AL-QURAN YA BANG??? PANTESAN MULUTNYA KAYA ORANG KAFIR BEJAT HINA farhatabbaslaw',hatespeech,abusive,religion hatespeech,individual hatespeech,weak hatespeech
43,"#GubernurZamanNow #GusIpulPuti2 #GanjarYasin1 #DjarotSihar2 #HasanAnton2 #KosterAce1 #NurdinSudirman3 #KarolinGidot2 RT USER: ""Keanekaragaman budaya, suku, agama, itulah sesungguhnya kekayaan kita,â?? tutur Djarot di hadapan peserta Rakercabsuâ?¦",not hatespeech,not abusive,not hatespeech,not hatespeech,not hatespeech
165,"Giat Bhabinkamtibmas Brigpol Mbs Sihombing pd hari ini, Rabu 11 April 2018 melaks sambang desa dan tatap muka sekaligus monitoring Kamtibmas menuju Pilkada 2018 di Desa Lobusiregar I, tepatnya dirumah Rudi Panjaitan, Helman Siahaan dan Hotman Siahaan",not hatespeech,not abusive,not hatespeech,not hatespeech,not hatespeech
182,USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER,not hatespeech,not abusive,not hatespeech,not hatespeech,not hatespeech
...,...,...,...,...,...,...
12793,I added a video to a USER playlist,not hatespeech,not abusive,not hatespeech,not hatespeech,not hatespeech
12952,USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER,not hatespeech,not abusive,not hatespeech,not hatespeech,not hatespeech
12953,"SETUJU SAYA,KLAU AHOK DAPAT GELAR SANTRI KEHORMATAN.... TAPI SANTRI KEHORMATAN BABI.... URL",hatespeech,abusive,other hatespeech,individual hatespeech,weak hatespeech
12985,USER ahmad dhani yang terhormat paling babi paling anjing aing kontol paling lonte tungu saat nya karier mu akan habis!!!!,hatespeech,abusive,gender hatespeech,individual hatespeech,weak hatespeech


Duplicate rows are removed because there are not many of them.

In [8]:
df_raw = df_raw[df_raw["text"] != ""]
df_raw = df_raw.drop_duplicates().reset_index(drop=True)  # drop=True avoids keeping the old index as a column

### Preprocessing
The preprocessing function transforms social media text into a cleaner form that no longer contains unneeded entities. The steps are:
- Lowercasing
- Removing URLs, mentions, and hashtags
- Normalizing slang words with the slang dictionary
- Removing stopwords, i.e. very frequent words that carry little information
- Removing everything except letters, so that the vectors later consist only of words
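
The steps above can be sketched end to end on a single string. This is a simplified, self-contained illustration: `STOPWORDS_DEMO`, `SLANG_DEMO`, and `clean` are hypothetical stand-ins, not the notebook's actual stopword set, slang dictionary, or `preprocessing` function.

```python
import re

# Hypothetical stand-ins for the notebook's stopword set and slang dictionary.
STOPWORDS_DEMO = {"yang", "dan", "user", "url", "rt", "yg"}
SLANG_DEMO = {"gak": "tidak", "bgt": "banget"}

def clean(text):
    text = text.lower()                                       # lowercase
    text = re.sub(r"https?://\S+", " ", text)                 # drop URLs
    words = [w for w in text.split() if w[0] not in "@#"]     # drop mentions and hashtags
    words = [SLANG_DEMO.get(w, w) for w in words]             # normalize slang
    words = [w for w in words if w not in STOPWORDS_DEMO]     # drop stopwords
    text = re.sub(r"[^a-z ]", "", " ".join(words))            # keep letters and spaces only
    return re.sub(r" +", " ", text).strip()                   # collapse repeated spaces

print(clean("RT @budi Filmnya GAK bagus bgt #review https://t.co/abc123 !!!"))
```

Note that slang lookup happens before punctuation is stripped, so a token such as `bgt!!!` would not match the dictionary; the order of steps matters.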

In [9]:
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS.update(["saya", "user", "url", "yg", "lo", "ya", "rt", "aja", "nya", "ga", "gak", "orang"])

# load the slang dictionary and close the file handle afterwards
with open("../input/kamus-slang-word-bahasa-indonesia/kamus_alay.pkl", 'rb') as slang_file:
    slangdict = pickle.load(slang_file)
slangwords = frozenset(slangdict)

def show_wordcloud(data):
    words = ''
     
    for sentence in data:
        tokens = str(sentence).split()
        for i in range(len(tokens)):
            tokens[i] = tokens[i].lower()
        words += " ".join(tokens) + " "
     
    wordcloud = WordCloud(width = 800, height = 800, background_color ='white', min_font_size = 12, stopwords=STOPWORDS).generate(words)
     
    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)
     
    plt.show()
    
def show_top_ngram(df_column):
    
    vectorizer = TfidfVectorizer(ngram_range=(2,2))

    ngrams = vectorizer.fit_transform(df_column)
    count_values = ngrams.toarray().sum(axis=0)
    vocab = vectorizer.vocabulary_
    df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)).rename(columns={0: 'frequency', 1:'bigram/trigram'})

    return df_ngram

# http(s) links; the scheme may be followed by "//" or by backslashes
URL_PATTERN = re.compile(r'https?:(?://|\\)+[\w:#@%/;$()~?+\-=\\.&!]*')

def delete_url(text):
    # replace every link with a single space
    return URL_PATTERN.sub(' ', text)

def delete_mention_tag(text):
               
    # keep only words that do not start with a mention or hashtag marker
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in ['@','#']:
                words.append(word)
    return ' '.join(words)

def slangword_converter(text):
    list_words = text.split()
    for i in range(len(list_words)):
        if list_words[i] in slangwords:
            list_words[i] = slangdict[list_words[i]]
    text = " ".join(list_words)
    return text

def preprocessing(text):
    text = text.lower()                    # convert to lowercase
    text = delete_url(text)                # remove URLs/links
    text = delete_mention_tag(text)        # remove mentions and hashtags
    text = text.strip()
    text = slangword_converter(text)       # normalize slang words
    text = " ".join([word for word in text.split() if word not in STOPWORDS])  # remove stopwords
    text = re.sub(r" \d+ ", " ", text)     # remove standalone digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"[^a-z ]", "", text)    # keep only letters and spaces
    text = re.sub(r" +", " ", text).strip()  # collapse repeated spaces
    return text

def word2vec_embedding():
    # load the pretrained vectors into a {word: vector} dictionary
    dictionary = {}
    with open("../input/word2vec-bahasa-indonesia/embedding_word2vec.txt", encoding="utf-8") as file:
        for line in tqdm(file, desc="Load Vector Model "):
            values = line.rstrip().split(' ')
            word, coefs = values[0], np.asarray(values[1:], dtype='float32')
            dictionary[word] = coefs
    return dictionary

def embedding_matrix(tokenizer, embeddings_index, size):
    embedding_matrix = np.zeros((len(tokenizer.word_index)+1, size))

    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

In [10]:
df_raw['text'] = df_raw['text'].apply(preprocessing)

## Model Experiments

### Splitting Data

In [11]:
from sklearn.model_selection import train_test_split

df_raw['hate_speech'] = (df_raw['hate_speech'] == "hatespeech").astype(int)

X = df_raw[["text"]]
y = df_raw['hate_speech']

X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(X, y, test_size=0.1, random_state=11)

## Deep Learning with Word Embeddings

In [12]:
X_train = X_train_raw['text'].values
X_test  = X_test_raw['text'].values

y_train = y_train_raw.values
y_test  = y_test_raw.values

In [13]:
tokenizer = Tokenizer(num_words=30000)
tokenizer.fit_on_texts(list(X_train))

X_train = tokenizer.texts_to_sequences(X_train)
X_test  = tokenizer.texts_to_sequences(X_test)

X_train = sequence.pad_sequences(X_train, maxlen=50)
X_test  = sequence.pad_sequences(X_test, maxlen=50)

print(X_train.shape)
print(X_test.shape)

(11736, 50)
(1304, 50)
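
To make the shapes above easier to interpret, here is a plain-Python sketch of what the tokenize-and-pad step does conceptually: words are indexed by frequency rank starting at 1, unknown words are dropped, and each sequence is left-padded with 0 (or truncated from the front) to a fixed length, which matches the Keras defaults for `pad_sequences`. The helper names and sample texts are hypothetical, and the sketch assumes text that is already cleaned and whitespace-separated.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable sort keeps first-seen order on ties
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def texts_to_padded(texts, word_index, maxlen):
    rows = []
    for t in texts:
        ids = [word_index[w] for w in t.split() if w in word_index]  # unknown words are dropped
        ids = ids[-maxlen:]                                          # truncate from the front
        rows.append([0] * (maxlen - len(ids)) + ids)                 # pad on the left with 0
    return rows

word_index = fit_word_index(["kamu bodoh sekali", "kamu baik"])
print(word_index)
print(texts_to_padded(["kamu sangat baik"], word_index, maxlen=5))
```

Since the index is fitted on the training texts only, any word that appears only in the test set is silently dropped from its sequence.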


In [14]:
word2vec_index = word2vec_embedding()
word2vec_matrix = embedding_matrix(tokenizer, word2vec_index, 50)

Load Vector Model : 466063it [00:06, 73041.42it/s]
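
Words that are missing from the pretrained vectors keep an all-zero row in the embedding matrix, so it can be worth checking how much of the tokenizer vocabulary is actually covered. A minimal sketch with hypothetical toy dictionaries follows; `embedding_coverage` is an illustrative helper, not part of the notebook.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (fraction of vocabulary with a pretrained vector, list of missing words)."""
    missing = [word for word in word_index if word not in embeddings_index]
    covered = len(word_index) - len(missing)
    fraction = covered / len(word_index) if word_index else 0.0
    return fraction, missing

# Hypothetical stand-ins for tokenizer.word_index and the loaded word2vec dictionary.
toy_word_index = {"kamu": 1, "baik": 2, "wkwk": 3, "bodoh": 4}
toy_embeddings = {"kamu": [0.1, 0.2], "baik": [0.3, 0.4], "bodoh": [0.5, 0.6]}

fraction, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(fraction, missing)
```

In the notebook, the same check could presumably be run with `tokenizer.word_index` and `word2vec_index` as arguments.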


### Model RNN with Word2Vec Embedding
The model is adapted from [this article](https://medium.com/@nehabhangale/toxic-comment-classification-models-comparison-and-selection-6c02add9d39f)

In [15]:
model = Sequential()

model.add(Embedding(input_dim=word2vec_matrix.shape[0], input_length=50, output_dim=word2vec_matrix.shape[1], weights=[word2vec_matrix], trainable=False))
model.add(SpatialDropout1D(0.5))
model.add(Bidirectional(LSTM(60, return_sequences=True)))
model.add(Conv1D(filters=128, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(3))
model.add(GlobalMaxPool1D())
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))


In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            1209200   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 50, 50)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 50, 120)           53280     
_________________________________________________________________
conv1d (Conv1D)              (None, 50, 128)           76928     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 16, 128)           0         
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 128)               5

In [17]:
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
model_hist = model.fit(X_train, y_train, batch_size=32, epochs=50, validation_split=0.1)



In [18]:
prediction = model.predict(X_test)
# pick the class with the highest softmax probability for each sample
# (named y_pred so that the `re` module is not shadowed)
y_pred = np.argmax(prediction, axis=1)

In [19]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.85      0.81       751
           1       0.76      0.65      0.70       553

    accuracy                           0.77      1304
   macro avg       0.77      0.75      0.76      1304
weighted avg       0.77      0.77      0.76      1304
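
The f1-score column in the report above is the harmonic mean of precision and recall. As a small worked check, the sketch below recomputes it from the rounded precision and recall values shown in the report; because the inputs are already rounded, the last digit of derived figures can differ slightly from what scikit-learn prints.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs copied from the report above.
report = {0: (0.77, 0.85), 1: (0.76, 0.65)}
for label, (p, r) in report.items():
    print(label, round(f1_score(p, r), 2))
```

The lower recall on class 1 (hate speech) is what pulls its F1 down: roughly a third of hate speech tweets in the test set are missed.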



### Model CNN with Word2Vec Embedding
The model is adapted from [this article](https://medium.com/@nehabhangale/toxic-comment-classification-models-comparison-and-selection-6c02add9d39f)

In [20]:
model = Sequential()

model.add(Embedding(input_dim=word2vec_matrix.shape[0], input_length=50, output_dim=word2vec_matrix.shape[1], weights=[word2vec_matrix], trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(filters=128, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(3))
model.add(GlobalMaxPool1D())
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

In [21]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            1209200   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 50, 50)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 50, 128)           32128     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 16, 128)           0         
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 128)               512       
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)              

In [22]:
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
model_hist = model.fit(X_train, y_train, batch_size=32, epochs=50, validation_split=0.1)



In [23]:
prediction = model.predict(X_test)
# pick the class with the highest softmax probability for each sample
# (named y_pred so that the `re` module is not shadowed)
y_pred = np.argmax(prediction, axis=1)

In [24]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.84      0.80       751
           1       0.75      0.65      0.70       553

    accuracy                           0.76      1304
   macro avg       0.76      0.74      0.75      1304
weighted avg       0.76      0.76      0.76      1304



## Deep Learning BERT

In [25]:
! pip install -q transformers

In [26]:
from transformers import BertTokenizer, TFBertModel

from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

In [27]:
def bert_encode(texts, tokenizer, max_len=100):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
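
The truncation, padding, and mask logic in `bert_encode` can be traced on a toy example. The sketch below mirrors that logic for a single token list; `toy_vocab` and its ids are hypothetical and do not correspond to the real IndoBERT vocabulary, and `encode_tokens` is an illustrative helper.

```python
def encode_tokens(tokens, vocab, max_len):
    """Mirror bert_encode for one example: truncate, add special tokens, pad, build mask."""
    tokens = tokens[:max_len - 2]                    # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = [vocab[t] for t in sequence] + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len       # 1 = real token, 0 = padding
    return ids, mask

# Hypothetical vocabulary; the real ids come from the pretrained tokenizer.
toy_vocab = {"[PAD]": 0, "[CLS]": 2, "[SEP]": 3, "kamu": 10, "baik": 11, "sekali": 12}

print(encode_tokens(["kamu", "baik"], toy_vocab, max_len=6))
print(encode_tokens(["kamu", "baik", "sekali", "kamu", "baik"], toy_vocab, max_len=4))
```

Padding with id 0 only works as intended if the tokenizer's padding token really has id 0, which is worth confirming via `tokenizer.pad_token_id`.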

In [28]:
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p2")
bert = TFBertModel.from_pretrained("indobenchmark/indobert-base-p2")

Some layers from the model checkpoint at indobenchmark/indobert-base-p2 were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at indobenchmark/indobert-base-p2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [29]:
def build_model(bert_layer, max_len=100):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output["pooler_output"]
    
    out = Dense(1, activation='sigmoid')(clf_output)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(tf.optimizers.Adam(learning_rate=5.95e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

def train_model(train_input, train_labels):
      
    model_BERT = build_model(bert, max_len=100)
    model_BERT.summary()
    
    checkpoint = ModelCheckpoint('model_BERT.h5', monitor='val_loss', save_best_only=True)

    model_BERT.fit(
        train_input, train_labels,
        validation_split = 0.15,
        epochs = 6,
        callbacks=[checkpoint],
        batch_size = 16
    )
    
    return model_BERT

In [30]:
X_train = bert_encode(X_train_raw['text'], tokenizer)
X_test = bert_encode(X_test_raw['text'], tokenizer)

In [31]:
model_BERT = train_model(X_train, y_train_raw)

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 100)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 100)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     TFBaseModelOutputWit 124441344   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [32]:
prediction = model_BERT.predict(X_test)
# threshold the sigmoid output at 0.5 to get binary labels
# (named y_pred so that the `re` module is not shadowed)
y_pred = (prediction[:, 0] >= 0.5).astype(int)

In [33]:
from sklearn.metrics import classification_report
print(classification_report(y_test_raw, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.80      0.84       751
           1       0.76      0.87      0.81       553

    accuracy                           0.83      1304
   macro avg       0.83      0.83      0.83      1304
weighted avg       0.84      0.83      0.83      1304



In [34]:
model_BERT.save_weights('model_weights.h5')