# Game Review Sentiment Analysis

This notebook performs sentiment analysis on game review comments. The data was scraped from the Google Play Store, as described in `scrape.py`, on 31 July 2024.

## Requirements

To run this notebook, install the following dependencies:
1. TensorFlow
2. Keras
3. Python
4. NumPy
5. NLTK
6. Matplotlib
7. scikit-learn
8. pandas
9. Gensim

In [1]:
import json
import re

import keras
import tensorflow as tf
import numpy as np
import matplotlib as mt
import matplotlib.pyplot as plt
import sklearn
import nltk
import pandas as pd
import gensim

from gensim.models import Word2Vec, FastText
from datetime import datetime
from keras import losses
from keras import optimizers
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay, confusion_matrix

2024-07-31 22:11:31.719043: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-31 22:11:31.766531: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
print(tf.__version__)
print(np.version.full_version)
print(mt.__version__)
print(sklearn.__version__)
print(nltk.__version__)
print(pd.__version__)
print(gensim.__version__)

2.16.1
1.26.4
3.9.0
1.5.0
3.8.1
2.2.2
4.3.2


The constants used throughout this notebook are defined below.

In [3]:
LANGUAGE = "indonesian"
WORKER_NUMBER = 16
WORD_EPOCH = 20
INPUT_SIZE = 300

LEARNING_EPOCH = 25
PATIENCE = 5

## Data Processing
In this stage, the data is imported and preprocessed. Stopwords are also removed from the sentences at this stage.

In [4]:
corpus = []
regex = r'[^a-zA-Z0-9\- \n"\']+'

### Dataset Loading

This section loads the dataset.

In [5]:
df = pd.read_csv("data/reviews.csv")

In [6]:
df.head()

Unnamed: 0,content,score
0,"Please dong yg game mininya, yang judul topeng...",5
1,Seru sih cuman sayang banyak jawaban yg ndk co...,3
2,mantap Applikasinya bisa buat asah otak,5
3,seruuuu,5
4,sangat menarik untuk di mainkan,4


In [7]:
df["score"].value_counts()

score
5    60119
1    13758
4     9248
3     5334
2     3140
Name: count, dtype: int64

In [8]:
data = df.to_dict(orient="records")

### Text Cleansing

The following cells clean the review text.

In [9]:
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package words to /home/miawheker/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/miawheker/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
stopwords = set(nltk.corpus.stopwords.words(LANGUAGE))

In [11]:
cleaned_data = []

for d in data:
    review = d["content"]
    result = []

    for sentence in nltk.sent_tokenize(review):
        sentence = sentence.lower()
        words = sentence.split()

        # Stopword removal
        sentence = " ".join([word for word in words if word not in stopwords])

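        # Remove special chars
        sentence = re.sub(regex, '', sentence)
        sentence = sentence.replace("-", " ")
        sentence = sentence.replace("\n", " ")
        sentence = sentence.replace("\"", "")
        # Remove URL remnants (e.g. "httpsexamplecom" once punctuation is stripped)
        sentence = re.sub(r'\bhttp[a-z0-9]+\b', '', sentence)
        # NOTE: `regex` above has already stripped '@', so this e-mail pattern cannot match here
        sentence = re.sub(r'\b.+@.+\b', '', sentence)
        # Remove tokens that start with "img" or "src"
        sentence = re.sub(r'\b(img|src)[a-z0-9]*\b', '', sentence)
        # Collapse repeated whitespace
        sentence = re.sub(r'\s{2,}', ' ', sentence)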

        # Remove whitespaces
        sentence = sentence.strip()

        if len(sentence) > 0:
            result.append(sentence)
    

    if len(result) > 0:
        cleaned_data.append({
            "review": result,
            "score": d["score"]
        })
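The cleaning steps above can be sketched as a single reusable function. This is a minimal sketch, not the notebook's own code: the name `clean_sentence`, the constant `SPECIAL_CHARS`, and the small `demo_stopwords` set are assumptions introduced for illustration (the notebook uses the NLTK Indonesian stopword list).

```python
import re

# Same character whitelist as the notebook's `regex` constant.
SPECIAL_CHARS = r'[^a-zA-Z0-9\- \n"\']+'


def clean_sentence(sentence, stopwords):
    """Lowercase, drop stopwords, strip special characters and URL remnants."""
    words = sentence.lower().split()
    sentence = " ".join(word for word in words if word not in stopwords)

    sentence = re.sub(SPECIAL_CHARS, '', sentence)
    sentence = sentence.replace("-", " ").replace("\n", " ").replace('"', "")
    sentence = re.sub(r'\bhttp[a-z0-9]+\b', '', sentence)       # URL remnants
    sentence = re.sub(r'\b(img|src)[a-z0-9]*\b', '', sentence)  # img/src tokens
    sentence = re.sub(r'\s{2,}', ' ', sentence)                 # repeated spaces
    return sentence.strip()


# Hypothetical stopword set, used only for this demonstration.
demo_stopwords = {"yang", "di", "untuk"}

print(clean_sentence("Sangat menarik untuk di mainkan!!", demo_stopwords))
print(clean_sentence("cek https://example.com ya", set()))
```

Wrapping the steps in one function would also let the review data and `data/corpus.txt` share the same cleaning logic instead of repeating it.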

In [12]:
corpus = []
for d in cleaned_data:
    for sentence in d["review"]:
        corpus.append(sentence)

with open("data/corpus.txt", "r") as f:
    for line in f:
        if len(line) < 10:
            continue

        for sentence in nltk.sent_tokenize(line):
            sentence = sentence.lower()
            words = sentence.split()

            # Stopword removal
            sentence = " ".join([word for word in words if word not in stopwords])

            # Remove special chars
            sentence = re.sub(regex, '', sentence)
            sentence = sentence.replace("-", " ")
            sentence = sentence.replace("\n", " ")
            sentence = sentence.replace("\"", "")
            sentence = re.sub(r'\bhttp[a-z0-9]+\b', '', sentence)
            sentence = re.sub(r'\b.+@.+\b', '', sentence)
            sentence = re.sub(r'\b(img|src)[a-z0-9]*\b', '', sentence)
            sentence = re.sub(r'\s{2,}', ' ', sentence)

            # Remove whitespaces
            sentence = sentence.strip()

            if len(sentence) < 10:
                continue

            corpus.append(sentence)

In [13]:
corpus[:10]

['please yg game mininya judul topeng makan lanjutin udah beres sampe level 500 sumpah candu banget game utamanya please',
 'seru sih cuman sayang yg ndk cocok pertanyaannya petunjuknya tolong d perbaiki ya membingungkan',
 'mantap applikasinya asah otak',
 'seruuuu',
 'menarik mainkan',
 'permainan bagus berfikir',
 'bagus mengasah otak',
 'bagus mengasah otak',
 'buset game nya keren banget suka banget main otak langsung pintar terimakasih yg gsme mengasah otak banget sih',
 'yh bosen']

### Vocabulary Size

In this section, we count the number of unique words (the vocabulary) in the corpus.

In [14]:
vocab = set()

for sentence in corpus:
    for word in sentence.split():
        vocab.add(word)

In [15]:
number_of_vocab = len(vocab)
number_of_vocab

20874
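As a small illustration of the same count, the sketch below uses `collections.Counter`, which yields both the vocabulary size and per-word frequencies. The `toy_corpus` list is a hypothetical stand-in for the notebook's `corpus`.

```python
from collections import Counter

# Toy corpus standing in for the notebook's `corpus` list (hypothetical data).
toy_corpus = ["bagus mengasah otak", "bagus mengasah otak", "yh bosen"]

# Count every whitespace-separated token across all sentences.
word_counts = Counter(word for sentence in toy_corpus for word in sentence.split())

vocab_size = len(word_counts)  # number of unique words
print(vocab_size)
print(word_counts["bagus"])
```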

## Feature Extraction

In this stage, features are extracted from the text.

### Word2Vec

In this section, we build a Word2Vec model.

In [16]:
word2vec_model = Word2Vec([sentence.split() for sentence in corpus], workers=WORKER_NUMBER, vector_size=INPUT_SIZE, epochs=WORD_EPOCH)

In [17]:
# Save model checkpoint
word2vec_model.save("models/corpus_word2vec.model")

In [18]:
print("Number of words:", len(word2vec_model.wv.index_to_key))

Number of words: 16719


In [19]:
word2vec_model.wv.most_similar("gak")

[('ngga', 0.4801954925060272),
 ('ngak', 0.45049017667770386),
 ('kagak', 0.42653602361679077),
 ('mesti', 0.4175211489200592),
 ('nggak', 0.41186514496803284),
 ('habis', 0.4066455662250519),
 ('tandanya', 0.3955079913139343),
 ('masak', 0.38996732234954834),
 ('nya', 0.38812580704689026),
 ('tau', 0.38671812415122986)]

In [20]:
word2vec_model.wv.most_similar("suka")

[('bagus', 0.4776606857776642),
 ('seru', 0.43819722533226013),
 ('seruu', 0.4221615493297577),
 ('banget', 0.41466301679611206),
 ('nya', 0.4096386730670929),
 ('seneng', 0.39435485005378723),
 ('ilang', 0.39239683747291565),
 ('kadang', 0.39008206129074097),
 ('lucu', 0.3742161691188812),
 ('mantapool', 0.35801175236701965)]
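The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes that measure with NumPy on made-up three-dimensional vectors; the vectors and the helper name `cosine_similarity` are assumptions for illustration only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy vectors, not taken from the trained model.
v_suka = [1.0, 2.0, 0.0]
v_bagus = [2.0, 4.0, 0.0]   # same direction as v_suka
v_other = [0.0, 0.0, 3.0]   # orthogonal to v_suka

print(cosine_similarity(v_suka, v_bagus))
print(cosine_similarity(v_suka, v_other))
```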

Next, the Word2Vec model is converted into a Keras layer.

In [21]:
text_layer = keras.layers.TextVectorization(
    max_tokens=(len(word2vec_model.wv.index_to_key) + 2), 
    output_mode="int",
    split="whitespace",
    trainable=False,
    standardize="lower_and_strip_punctuation",
    vocabulary=word2vec_model.wv.index_to_key
  )

vectors = []
for word in text_layer.get_vocabulary():
    try:
        vectors.append(word2vec_model.wv.get_vector(word))
    except KeyError:
        # Padding ("") and "[UNK]" tokens have no Word2Vec vector; use a random one
        vectors.append(np.random.rand(INPUT_SIZE))

word2vec_layer = keras.Sequential([
  keras.layers.Input(shape=(None,), dtype=tf.string),
  text_layer,
  keras.layers.Embedding(
    input_dim=(len(word2vec_model.wv.index_to_key) + 2),
    output_dim=INPUT_SIZE,
    weights=np.array(vectors),
    trainable=False
  )
])

# Cleanup
vectors = []

2024-07-31 22:11:44.783423: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

In [22]:
word2vec_layer.summary()
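The key idea in the conversion is that row `i` of the embedding matrix must hold the vector of token `i` in the vectorizer's vocabulary. A minimal NumPy-only sketch of that alignment follows. The `toy_vectors` dictionary, the `EMBED_DIM` value, and the short `vocabulary` list are assumptions; in the notebook these come from `word2vec_model.wv`, `INPUT_SIZE`, and `text_layer.get_vocabulary()`.

```python
import numpy as np

EMBED_DIM = 4  # hypothetical small size; the notebook uses INPUT_SIZE = 300

# Hypothetical stand-in for the trained word vectors: word -> vector.
toy_vectors = {
    "bagus": np.ones(EMBED_DIM),
    "seru": np.full(EMBED_DIM, 2.0),
}

# In "int" mode, TextVectorization reserves index 0 for padding ("")
# and index 1 for the out-of-vocabulary token "[UNK]".
vocabulary = ["", "[UNK]", "bagus", "seru"]

rng = np.random.default_rng(0)
rows = []
for token in vocabulary:
    if token in toy_vectors:
        rows.append(toy_vectors[token])
    else:
        # Tokens without a trained vector get a random one, as in the notebook.
        rows.append(rng.random(EMBED_DIM))

embedding_matrix = np.stack(rows)
print(embedding_matrix.shape)
```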

### FastText

In this section, we build a FastText model.

In [23]:
fasttext_model = FastText([sentence.split() for sentence in corpus], workers=WORKER_NUMBER, vector_size=INPUT_SIZE, epochs=WORD_EPOCH)

In [24]:
# Checkpoint
fasttext_model.save("models/corpus_fasttext.model")

In [25]:
print("Number of words in this model:", len(fasttext_model.wv.index_to_key))

Number of words in this model: 16719


In [26]:
fasttext_model.wv.most_similar("gak")

[('2xgak', 0.8523187637329102),
 ('ngak', 0.8152997493743896),
 ('ggak', 0.8001991510391235),
 ('gaktau', 0.7658352255821228),
 ('tegak', 0.7477834224700928),
 ('gaalak', 0.7431991100311279),
 ('kagak', 0.7363858819007874),
 ('nggak', 0.7331864237785339),
 ('sihgak', 0.7268201112747192),
 ('jugak', 0.7241016030311584)]

In [27]:
fasttext_model.wv.most_similar("suka")

[("suka'", 0.9256801009178162),
 ('sukaq', 0.9135703444480896),
 ('3suka', 0.9065844416618347),
 ('sukasukasuka', 0.858688473701477),
 ('sukakk', 0.8341118693351746),
 ('sukak', 0.8290210962295532),
 ('sukah', 0.8095313906669617),
 ('kusuka', 0.8083089590072632),
 ('sukahrs', 0.8072800636291504),
 ('sukaaseru', 0.7896515130996704)]
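The neighbours above mostly share substrings with the query word, which is consistent with FastText representing words through character n-grams. The sketch below extracts such n-grams for two words and shows their overlap. It is a simplification: the helper `char_ngrams` is hypothetical, the `min_n=3` and `max_n=6` values are assumed to match the Gensim defaults, and the hashing of n-grams into buckets is omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


# N-grams shared by "gak" and one of its reported neighbours, "ngak".
shared = char_ngrams("gak") & char_ngrams("ngak")
print(sorted(shared))
```

Words that share many n-grams end up with similar vectors, which may explain why spelling variants such as `ngak` or `sukak` score so high here compared with the Word2Vec results.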

Next, the FastText model is converted into a Keras layer.

In [28]:
text_layer = keras.layers.TextVectorization(
    max_tokens=(len(fasttext_model.wv.index_to_key) + 2), 
    output_mode="int",
    split="whitespace",
    trainable=False,
    standardize="lower_and_strip_punctuation",
    vocabulary=fasttext_model.wv.index_to_key
  )

vectors = []
for word in text_layer.get_vocabulary():
    try:
        vectors.append(fasttext_model.wv.get_vector(word))
    except KeyError:
        # Fallback for tokens FastText cannot build a vector for
        vectors.append(np.random.rand(INPUT_SIZE))

fasttext_layer = keras.Sequential([
  keras.layers.Input(shape=(None,), dtype=tf.string),
  text_layer,
  keras.layers.Embedding(
    input_dim=(len(fasttext_model.wv.index_to_key) + 2),
    output_dim=INPUT_SIZE,
    weights=np.array(vectors),
    trainable=False
  )
])

# Cleanup
vectors = []

In [29]:
fasttext_layer.summary()

### TF-IDF

In this section, we build a TF-IDF layer.

In [30]:
layer_tfidf = keras.layers.TextVectorization(
    max_tokens=number_of_vocab+1,
    output_mode="tf_idf",
    split="whitespace",
    sparse=False,
    pad_to_max_tokens=True,
    ngrams=1
)

with tf.device("CPU"):
    layer_tfidf.adapt(corpus)
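To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It assumes a smoothed IDF of the form `log(1 + N / (1 + df))`, which is one common variant; the exact formula used inside the Keras layer may differ. The documents and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy documents, already cleaned and lowercased.
docs = ["bagus mengasah otak", "bagus seru", "seru banget seru"]
tokenized = [doc.split() for doc in docs]
num_docs = len(tokenized)

# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for doc in tokenized for token in set(doc))

# Smoothed inverse document frequency (assumed variant).
idf = {token: math.log(1 + num_docs / (1 + df)) for token, df in doc_freq.items()}


def tfidf(doc_tokens):
    """Term count multiplied by IDF for every token in one document."""
    term_counts = Counter(doc_tokens)
    return {token: count * idf[token] for token, count in term_counts.items()}


weights = tfidf(tokenized[2])
print(weights)
```

Tokens that appear in fewer documents (here `banget`) receive a larger IDF than tokens that appear in many (here `seru`), while repeated tokens within one document scale with their count.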


## Labeling

In this stage, the dataset is labelled.

In [31]:
np.random.shuffle(cleaned_data)

In [32]:
rating_cnt = [0, 0, 0, 0, 0]

for d in cleaned_data:
    rating_cnt[d["score"]-1] += 1

In [33]:
rating_cnt

[13548, 3100, 5239, 9048, 58012]

The data is labelled using the scheme below, with the following rules:
* Rating < 3 is treated as negative sentiment
* Rating == 3 is treated as neutral sentiment
* Rating > 3 is treated as positive sentiment
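The rules above can be sketched as a small mapping function. The function name `score_to_label` is hypothetical; the one-hot order `[positive, neutral, negative]` follows the labelling cell later in this notebook.

```python
def score_to_label(score):
    """Map a 1-5 star rating to a one-hot label [positive, neutral, negative]."""
    if score < 3:
        return [0, 0, 1]  # negative
    if score == 3:
        return [0, 1, 0]  # neutral
    return [1, 0, 0]      # positive


for score in range(1, 6):
    print(score, score_to_label(score))
```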

In [34]:
labelled_data = []

weight = np.array([rating_cnt[0] + rating_cnt[1], rating_cnt[2], rating_cnt[3] + rating_cnt[4]])
max_weight = np.max(weight)
weight = max_weight / weight

weight

array([ 4.02811148, 12.8001527 ,  1.        ])
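The weights above are the size of the largest class divided by the size of each class, in the order `[negative, neutral, positive]`. The sketch below reproduces the arithmetic from the rating counts printed earlier; the name `class_counts` is introduced here for readability.

```python
import numpy as np

# Rating counts for 1 to 5 stars, as printed by the notebook.
rating_cnt = [13548, 3100, 5239, 9048, 58012]

class_counts = np.array([
    rating_cnt[0] + rating_cnt[1],  # negative: 1 and 2 stars
    rating_cnt[2],                  # neutral: 3 stars
    rating_cnt[3] + rating_cnt[4],  # positive: 4 and 5 stars
])

# The largest class gets weight 1; smaller classes get proportionally more.
weight = class_counts.max() / class_counts

print(class_counts)
print(weight)
```

With these counts, each neutral review weighs roughly 12.8 times as much as a positive review in the loss.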

In [35]:
# One-hot label order: [positive, neutral, negative]
for d in cleaned_data:
  if d["score"] < 3:
    labelled_data.append([" ".join(d["review"]), [0,0,1], weight[0]])
  elif d["score"] == 3:
    labelled_data.append([" ".join(d["review"]), [0,1,0], weight[1]])
  else:
    labelled_data.append([" ".join(d["review"]), [1,0,0], weight[2]])

## Data Splitting

In this stage, the data is split into three parts: training data (80%), validation data (10%), and test data (10%).

In [36]:
dataset_cnt = len(labelled_data)
dataset_cnt

88947

In [37]:
train_dataset = labelled_data[:int(dataset_cnt * 0.8)]
test_dataset = labelled_data[int(dataset_cnt * 0.8):int(dataset_cnt * 0.9)]
validation_dataset = labelled_data[int(dataset_cnt * 0.9):]
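The slice boundaries can be checked with plain arithmetic. The sketch below uses the dataset size printed above; the names `train_end`, `test_end`, and `sizes` are introduced only for this illustration.

```python
dataset_cnt = 88947  # value printed by the notebook

train_end = int(dataset_cnt * 0.8)  # end of the training slice
test_end = int(dataset_cnt * 0.9)   # end of the test slice

sizes = {
    "train": train_end,
    "test": test_end - train_end,
    "validation": dataset_cnt - test_end,
}
print(sizes)
```

Because the data was shuffled once beforehand and then sliced, the split is random but not stratified, so class proportions can differ slightly between the three parts.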

Let's look at the class distribution of each split.

In [38]:
cnt = [0, 0, 0]

for d in train_dataset:
    cnt[np.argmax(d[1])] += 1

cnt

[53640, 4238, 13279]

In [39]:
cnt = [0, 0, 0]

for d in test_dataset:
    cnt[np.argmax(d[1])] += 1

cnt

[6710, 509, 1676]

In [40]:
cnt = [0, 0, 0]

for d in validation_dataset:
    cnt[np.argmax(d[1])] += 1

cnt

[6710, 492, 1693]
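Raw counts are easier to compare as proportions. The sketch below converts the three printed count lists into class shares, in the order `[positive, neutral, negative]`; the helper `class_shares` is hypothetical.

```python
import numpy as np


def class_shares(counts):
    """Return each class's share of the total."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts / counts.sum()


# Counts printed by the notebook for each split.
split_counts = {
    "train": [53640, 4238, 13279],
    "test": [6710, 509, 1676],
    "validation": [6710, 492, 1693],
}

for name, counts in split_counts.items():
    print(name, np.round(class_shares(counts), 3))
```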

## Model Training

In this stage, the models are trained. Several variations are used:
1. Text vectorization: `word2vec_model`, `fasttext_model`, and TF-IDF (`layer_tfidf`).
2. Classification: three model types, namely a GRU-based model, a Bidirectional GRU model, and a Dense-only model used exclusively with TF-IDF.
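The combinations listed above result in five models. A small sketch that enumerates them, using the same naming pattern as the training cells (`{wordlayer}_{i}` plus `tfidf`):

```python
from itertools import product

embeddings = ["word2vec", "fasttext"]
recurrent_layers = ["gru", "bi-gru"]

# Every embedding is paired with every recurrent layer type.
model_names = [f"{emb}_{rnn}" for emb, rnn in product(embeddings, recurrent_layers)]

# TF-IDF is only paired with the Dense-only classifier.
model_names.append("tfidf")

print(model_names)
```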

In [41]:
class DataGenerator:
    """Yields one (text, one-hot label, sample weight) triple at a time, as batches of size 1."""

    def __init__(self, dataset, repeat=1):
        self.dataset = dataset
        self.data_length = len(dataset)
        self.repeat = repeat  # number of passes over the dataset

    def generate(self):
        for _ in range(self.repeat):
            for x, y, w in self.dataset:
                X = tf.convert_to_tensor([x], dtype=tf.string)
                Y = np.array([y], dtype=np.float32)
                W = np.array([w], dtype=np.float32)

                yield X, Y, W
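The generator above emits one review per step, so every epoch takes as many steps as there are reviews. As a possible variant (not what this notebook does), the sketch below groups several reviews into one batch using plain Python and NumPy. The function `batched` and the `toy_dataset` are assumptions for illustration.

```python
import numpy as np


def batched(dataset, batch_size):
    """Yield (texts, labels, weights) for consecutive chunks of the dataset."""
    for start in range(0, len(dataset), batch_size):
        chunk = dataset[start:start + batch_size]
        texts = [text for text, _, _ in chunk]
        labels = np.array([label for _, label, _ in chunk], dtype=np.float32)
        weights = np.array([weight for _, _, weight in chunk], dtype=np.float32)
        yield texts, labels, weights


# Hypothetical entries in the same [text, one-hot label, weight] layout.
toy_dataset = [
    ["bagus banget", [1, 0, 0], 1.0],
    ["biasa aja", [0, 1, 0], 12.8],
    ["jelek", [0, 0, 1], 4.0],
]

batches = list(batched(toy_dataset, batch_size=2))
for texts, labels, weights in batches:
    print(texts, labels.shape, weights.shape)
```

If such a variant were used for training, `steps_per_epoch` and `validation_steps` would need to be reduced to the number of batches.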

### Word2Vec and FastText Models

The following cell builds and trains the Word2Vec- and FastText-based models.

In [43]:
for wordlayer in ["word2vec", "fasttext"]:
  for i in ["gru", "bi-gru"]:
    train_gen = DataGenerator(train_dataset, repeat=LEARNING_EPOCH + 1)
    validation_gen = DataGenerator(validation_dataset, repeat=LEARNING_EPOCH + 1)

    train_ds = tf.data.Dataset.from_generator(train_gen.generate, output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.string),
        tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float64),
    ))
    validation_ds = tf.data.Dataset.from_generator(validation_gen.generate, output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.string),
        tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float64),
    ))

    callback = [
      keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=PATIENCE,
        restore_best_weights=True,
      ),
      keras.callbacks.ModelCheckpoint(
        # One checkpoint file per model variant, so runs do not overwrite each other
        filepath=f"models/checkpoint/{wordlayer}_{i}_checkpoint.keras",
        save_best_only=True,
      ),
      keras.callbacks.TensorBoard(
        log_dir=f"logs/{wordlayer}_{i}",
      ),
    ]

    model = keras.models.Sequential([
        keras.layers.Input(shape=(1,), dtype=tf.string),
        word2vec_layer if wordlayer == "word2vec" else fasttext_layer,
        keras.layers.GRU(128, dropout=0.2) if i == "gru" else keras.layers.Bidirectional(keras.layers.GRU(128, dropout=0.2)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(3, activation="softmax"),
      ],
      name=f"{wordlayer}_{i}",
    )

    model.summary()

    model.compile(
      loss=losses.CategoricalCrossentropy(),
      optimizer=optimizers.Adam(),
      metrics=[keras.metrics.CategoricalAccuracy()],
    )
    model.fit(
      train_ds,
      validation_data=validation_ds,
      epochs=LEARNING_EPOCH,
      callbacks=callback,
      steps_per_epoch=len(train_dataset),
      validation_steps=len(validation_dataset),
    )
    model.save(f"models/{wordlayer}_{i}.keras")

Epoch 1/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 532s 7ms/step - categorical_accuracy: 0.5985 - loss: 2.1285 - val_categorical_accuracy: 0.6461 - val_loss: 1.9387
Epoch 2/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 524s 7ms/step - categorical_accuracy: 0.6289 - loss: 1.9209 - val_categorical_accuracy: 0.6996 - val_loss: 1.7363
Epoch 3/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 524s 7ms/step - categorical_accuracy: 0.6253 - loss: 1.8719 - val_categorical_accuracy: 0.6787 - val_loss: 1.7679
Epoch 4/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 518s 7ms/step - categorical_accuracy: 0.6288 - loss: 1.8701 - val_categorical_accuracy: 0.6709 - val_loss: 1.7572
Epoch 5/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 520s 7ms/step - categorical_accuracy: 0.6416 - loss: 1.8717 - val_categorical_accuracy: 0.7196 - val_loss: 1.8752
Epoch 6/25
71157/71157 ━━━━━━━━━━━━━━━

Epoch 1/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 779s 11ms/step - categorical_accuracy: 0.6290 - loss: 2.0837 - val_categorical_accuracy: 0.6895 - val_loss: 1.7060
Epoch 2/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 798s 11ms/step - categorical_accuracy: 0.6534 - loss: 1.8480 - val_categorical_accuracy: 0.6704 - val_loss: 1.5971
Epoch 3/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 842s 12ms/step - categorical_accuracy: 0.6776 - loss: 1.7069 - val_categorical_accuracy: 0.6998 - val_loss: 1.5283
Epoch 4/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 796s 11ms/step - categorical_accuracy: 0.6789 - loss: 1.7098 - val_categorical_accuracy: 0.6931 - val_loss: 1.5026
Epoch 5/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 793s 11ms/step - categorical_accuracy: 0.6816 - loss: 1.6807 - val_categorical_accuracy: 0.7381 - val_loss: 1.5315
Epoch 6/25
71157/71157

Epoch 1/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 499s 7ms/step - categorical_accuracy: 0.6055 - loss: 2.1080 - val_categorical_accuracy: 0.6372 - val_loss: 1.8621
Epoch 2/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 492s 7ms/step - categorical_accuracy: 0.6183 - loss: 1.9627 - val_categorical_accuracy: 0.6616 - val_loss: 1.8511
Epoch 3/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 493s 7ms/step - categorical_accuracy: 0.6118 - loss: 1.9587 - val_categorical_accuracy: 0.6657 - val_loss: 1.7493
Epoch 4/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 497s 7ms/step - categorical_accuracy: 0.6167 - loss: 1.9360 - val_categorical_accuracy: 0.6016 - val_loss: 1.8131
Epoch 5/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 492s 7ms/step - categorical_accuracy: 0.6097 - loss: 1.9295 - val_categorical_accuracy: 0.6207 - val_loss: 1.8183
Epoch 6/25
71157/71157 ━━━━

Epoch 1/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 708s 10ms/step - categorical_accuracy: 0.6224 - loss: 2.1149 - val_categorical_accuracy: 0.7016 - val_loss: 1.8306
Epoch 2/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 718s 10ms/step - categorical_accuracy: 0.6272 - loss: 1.9250 - val_categorical_accuracy: 0.7029 - val_loss: 1.7623
Epoch 3/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 739s 10ms/step - categorical_accuracy: 0.6335 - loss: 1.8884 - val_categorical_accuracy: 0.6653 - val_loss: 1.6629
Epoch 4/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 743s 10ms/step - categorical_accuracy: 0.6473 - loss: 1.8231 - val_categorical_accuracy: 0.7022 - val_loss: 1.6401
Epoch 5/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 746s 10ms/step - categorical_accuracy: 0.6452 - loss: 1.8145 - val_categorical_accuracy: 0.6751 - val_loss: 1.5971
Epoch 6/25
71157/71157
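The training above relies on early stopping with a patience of 5 on `val_loss`. The sketch below simulates that rule in plain Python to show which epoch would be kept as best. It is a simplified model of the callback (no `min_delta` or other options), and the function `simulate_early_stopping` is hypothetical. The loss values are the five `val_loss` figures logged for the first run.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stopped_epoch); stopped_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# val_loss of epochs 1-5 from the first logged run.
first_run = [1.9387, 1.7363, 1.7679, 1.7572, 1.8752]

print(simulate_early_stopping(first_run, patience=5))
print(simulate_early_stopping(first_run, patience=3))
```

With a patience of 5, the best epoch so far is epoch 2 and training has not yet been stopped after epoch 5; a patience of 3 would have stopped at epoch 5.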

### TF-IDF Model

The following cell builds and trains the TF-IDF model.

In [44]:
train_gen = DataGenerator(train_dataset, repeat=LEARNING_EPOCH + 1)
validation_gen = DataGenerator(validation_dataset, repeat=LEARNING_EPOCH + 1)

train_ds = tf.data.Dataset.from_generator(train_gen.generate, output_signature=(
    tf.TensorSpec(shape=(None,), dtype=tf.string),
    tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.float64),
))
validation_ds = tf.data.Dataset.from_generator(validation_gen.generate, output_signature=(
    tf.TensorSpec(shape=(None,), dtype=tf.string),
    tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.float64),
))

callback = [
  keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=PATIENCE,
    restore_best_weights=True,
  ),
  keras.callbacks.ModelCheckpoint(
    # Separate checkpoint file, so the TF-IDF run does not overwrite other models
    filepath="models/checkpoint/tfidf_checkpoint.keras",
    save_best_only=True,
  ),
  keras.callbacks.TensorBoard(
    log_dir="logs/tfidf",
  ),
]

model = keras.models.Sequential([
    keras.layers.Input(shape=(1,), dtype=tf.string),
    layer_tfidf,
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(3, activation="softmax"),
  ],
  name=f"tfidf",
)

model.summary()

model.compile(
  loss=losses.CategoricalCrossentropy(),
  optimizer=optimizers.Adam(),
  metrics=[keras.metrics.CategoricalAccuracy()],
)
model.fit(
  train_ds,
  validation_data=validation_ds,
  epochs=LEARNING_EPOCH,
  callbacks=callback,
  steps_per_epoch=len(train_dataset),
  validation_steps=len(validation_dataset),
)
model.save(f"models/tfidf.keras")

Epoch 1/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 226s 3ms/step - categorical_accuracy: 0.7396 - loss: 1.8846 - val_categorical_accuracy: 0.8990 - val_loss: 1.1272
Epoch 2/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 225s 3ms/step - categorical_accuracy: 0.8909 - loss: 1.1036 - val_categorical_accuracy: 0.9295 - val_loss: 1.0109
Epoch 3/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 226s 3ms/step - categorical_accuracy: 0.9204 - loss: 0.9153 - val_categorical_accuracy: 0.9487 - val_loss: 0.9671
Epoch 4/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 229s 3ms/step - categorical_accuracy: 0.9377 - loss: 0.8262 - val_categorical_accuracy: 0.9448 - val_loss: 0.8286
Epoch 5/25
71157/71157 ━━━━━━━━━━━━━━━━━━━━ 226s 3ms/step - categorical_accuracy: 0.9382 - loss: 0.7987 - val_categorical_accuracy: 0.9575 - val_loss: 0.9937
Epoch 6/25
71157/71157 ━━━━
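The loss values in these logs are larger than plain cross-entropy would usually give at this accuracy, because every sample is multiplied by its class weight. The sketch below illustrates the effect for a single sample, assuming the per-sample loss is simply scaled by the weight (which is how a batch of size 1 is understood to behave here). The function `weighted_cce` and the prediction vector are hypothetical.

```python
import math

import numpy as np


def weighted_cce(y_true, y_pred, sample_weight):
    """Categorical cross-entropy for one sample, scaled by its sample weight."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip to avoid log(0).
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), 1e-7, 1.0)
    return float(sample_weight * -np.sum(y_true * np.log(y_pred)))


# Hypothetical prediction in the order [positive, neutral, negative].
prediction = [0.2, 0.6, 0.2]

unweighted = weighted_cce([0, 1, 0], prediction, 1.0)
neutral_weighted = weighted_cce([0, 1, 0], prediction, 12.8001527)

print(unweighted)
print(neutral_weighted)
```

The same prediction costs about 12.8 times more when the true class is neutral, so the reported `loss` and `val_loss` are not directly comparable with unweighted runs.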

In [47]:
# Cleanup
train_ds = None
validation_ds = None
train_gen = None
validation_gen = None
callback = None
model = None
train_dataset = None
validation_dataset = None
fasttext_model = None
word2vec_model = None
labelled_data = None