# Sentiment Analysis of Game Reviews

This notebook performs sentiment analysis on game review comments. The data was collected from the Google Play Store, as described in `scrape.py`, on 31 July 2024.

## Requirements

To run this notebook, install the following dependencies:
1. TensorFlow
2. Keras
3. Python
4. NumPy
5. NLTK
6. pandas
7. Gensim

In [1]:
import re

import keras
import tensorflow as tf
import numpy as np
import nltk
import pandas as pd
import gensim

from gensim.models import Word2Vec, FastText
from keras import losses
from keras import optimizers



In [2]:
print(tf.__version__)
print(np.version.full_version)
print(nltk.__version__)
print(pd.__version__)
print(gensim.__version__)

2.17.0
1.26.4
3.8.1
2.2.2
4.3.3


The constants below are used throughout this notebook.

In [3]:
LANGUAGE = "indonesian"
WORKER_NUMBER = 16
WORD_EPOCH = 20
INPUT_SIZE = 300

LEARNING_EPOCH = 25
PATIENCE = 5

## Data Processing
In this stage, the data is imported and processed. Stopwords are also removed from each sentence at this stage.

In [4]:
corpus = []
regex = r'[^a-zA-Z0-9\- \n"\']+'

### Dataset Loading

This section shows how the data is loaded.

In [5]:
df = pd.read_csv("data/reviews.csv")

In [6]:
df.head()

Unnamed: 0,content,score
0,"Please dong yg game mininya, yang judul topeng...",5
1,Seru sih cuman sayang banyak jawaban yg ndk co...,3
2,mantap Applikasinya bisa buat asah otak,5
3,seruuuu,5
4,sangat menarik untuk di mainkan,4


In [7]:
df["score"].value_counts()

score
5    60159
1    13760
4     9250
3     5335
2     3140
Name: count, dtype: int64

In [8]:
data = df.to_dict(orient="records")

### Text Cleansing

The text cleaning process is shown below.

In [9]:
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package words to /home/miawheker/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/miawheker/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
stopwords = set(nltk.corpus.stopwords.words(LANGUAGE))

In [11]:
cleaned_data = []

for d in data:
    review = d["content"]
    result = []

    for sentence in nltk.sent_tokenize(review):
        sentence = sentence.lower()
        words = sentence.split()

        # Stopword removal
        sentence = " ".join([word for word in words if word not in stopwords])

        # Remove special chars
        sentence = re.sub(regex, '', sentence)
        sentence = sentence.replace("-", " ")
        sentence = sentence.replace("\n", " ")
        sentence = sentence.replace("\"", "")
        sentence = re.sub(r'\bhttp[a-z0-9]+\b', '', sentence)
        sentence = re.sub(r'\b.+@.+\b', '', sentence)
        sentence = re.sub(r'\b(img|src)[a-z0-9]*\b', '', sentence)
        sentence = re.sub(r'\s{2,}', ' ', sentence)

        # Remove whitespaces
        sentence = sentence.strip()

        if len(sentence) > 0:
            result.append(sentence)
    

    if len(result) > 0:
        cleaned_data.append({
            "review": result,
            "score": d["score"]
        })

In [12]:
corpus = []
for d in cleaned_data:
    for sentence in d["review"]:
        corpus.append(sentence)

with open("data/corpus.txt", "r") as f:
    for line in f:
        if len(line) < 10:
            continue

        for sentence in nltk.sent_tokenize(line):
            sentence = sentence.lower()
            words = sentence.split()

            # Stopword removal
            sentence = " ".join([word for word in words if word not in stopwords])

            # Remove special chars
            sentence = re.sub(regex, '', sentence)
            sentence = sentence.replace("-", " ")
            sentence = sentence.replace("\n", " ")
            sentence = sentence.replace("\"", "")
            sentence = re.sub(r'\bhttp[a-z0-9]+\b', '', sentence)
            sentence = re.sub(r'\b.+@.+\b', '', sentence)
            sentence = re.sub(r'\b(img|src)[a-z0-9]*\b', '', sentence)
            sentence = re.sub(r'\s{2,}', ' ', sentence)

            # Remove whitespaces
            sentence = sentence.strip()

            if len(sentence) < 10:
                continue

            corpus.append(sentence)
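
The two cells above apply the same cleaning steps, once to the reviews and once to the extra corpus file. A minimal sketch of how those steps could be factored into one helper follows. The name `clean_sentence` and the tiny `DEMO_STOPWORDS` set are hypothetical and used only for illustration; the notebook itself uses the NLTK Indonesian stopword list.

```python
import re

# Same character whitelist as the notebook's `regex` constant.
REGEX = r'[^a-zA-Z0-9\- \n"\']+'

# Hypothetical mini stopword list for illustration only; the notebook
# uses nltk.corpus.stopwords.words("indonesian") instead.
DEMO_STOPWORDS = {"yang", "dan", "di", "untuk", "sangat"}


def clean_sentence(sentence, stopwords):
    """Apply the notebook's cleaning steps to a single sentence."""
    # Lowercase and drop stopwords.
    words = sentence.lower().split()
    sentence = " ".join(w for w in words if w not in stopwords)

    # Strip special characters, then leftovers of URLs, e-mails and HTML tags.
    sentence = re.sub(REGEX, "", sentence)
    sentence = sentence.replace("-", " ").replace("\n", " ").replace('"', "")
    sentence = re.sub(r"\bhttp[a-z0-9]+\b", "", sentence)
    sentence = re.sub(r"\b.+@.+\b", "", sentence)
    sentence = re.sub(r"\b(img|src)[a-z0-9]*\b", "", sentence)

    # Collapse repeated whitespace and trim.
    sentence = re.sub(r"\s{2,}", " ", sentence)
    return sentence.strip()


print(clean_sentence("Sangat menarik untuk di mainkan!!!", DEMO_STOPWORDS))
```

With a helper like this, both loops would reduce to one call per sentence, which keeps the review data and the extra corpus cleaned in exactly the same way.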

In [13]:
corpus[:10]

['please yg game mininya judul topeng makan lanjutin udah beres sampe level 500 sumpah candu banget game utamanya please',
 'seru sih cuman sayang yg ndk cocok pertanyaannya petunjuknya tolong d perbaiki ya membingungkan',
 'mantap applikasinya asah otak',
 'seruuuu',
 'menarik mainkan',
 'permainan bagus berfikir',
 'bagus mengasah otak',
 'bagus mengasah otak',
 'buset game nya keren banget suka banget main otak langsung pintar terimakasih yg gsme mengasah otak banget sih',
 'yh bosen']

### Vocabulary Size

In this section, we count the number of distinct words (the vocabulary) in the corpus.

In [14]:
vocab = set()

for sentence in corpus:
    for word in sentence.split():
        vocab.add(word)

In [15]:
number_of_vocab = len(vocab)
number_of_vocab

20875
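
As a possible variation, `collections.Counter` gives the vocabulary size and the word frequencies in one pass. The sketch below uses a hypothetical three-sentence `demo_corpus` in place of the real `corpus`.

```python
from collections import Counter

# Hypothetical three-sentence corpus standing in for the real `corpus`.
demo_corpus = ["bagus mengasah otak", "bagus mengasah otak", "yh bosen"]

# Count every whitespace-separated token across all sentences.
word_counts = Counter(word for sentence in demo_corpus for word in sentence.split())

print(len(word_counts))            # vocabulary size
print(word_counts.most_common(2))  # most frequent words with their counts
```

The frequencies can help when choosing a smaller `max_tokens` value, since rare words could then be dropped first.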

## Feature Extraction

In this stage, features are extracted from the text.

### TF-IDF

In this section, we build a TF-IDF layer.
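
To illustrate what a TF-IDF representation computes, here is a minimal pure-Python sketch. It assumes raw term counts and a smoothed inverse document frequency, `log(1 + N / (1 + df))`; the exact smoothing used by the Keras layer may differ, so treat the formula as an assumption. The helper names `fit_idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Return {token: idf} with idf = log(1 + N / (1 + df)) (assumed smoothing)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document.
        doc_freq.update(set(doc.split()))
    return {tok: math.log(1 + n_docs / (1 + df)) for tok, df in doc_freq.items()}


def tfidf_vector(document, idf):
    """Map one document to {token: count * idf}, skipping unknown tokens."""
    counts = Counter(document.split())
    return {tok: cnt * idf[tok] for tok, cnt in counts.items() if tok in idf}


demo_docs = ["game bagus", "game jelek", "bagus bagus banget"]
idf = fit_idf(demo_docs)
vec = tfidf_vector("game bagus bagus", idf)
print(vec)
```

Words that appear in fewer documents (here `jelek` and `banget`) receive a larger IDF than words that appear in many. Unlike this sketch, which skips unknown tokens, the Keras layer maps them to an out-of-vocabulary slot.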

In [16]:
layer_tfidf = keras.layers.TextVectorization(
    max_tokens=number_of_vocab+1,
    output_mode="tf_idf",
    split="whitespace",
    sparse=False,
    pad_to_max_tokens=True,
    ngrams=1
)

with tf.device("CPU"):
    layer_tfidf.adapt(corpus)



## Labeling

In this stage, the dataset is labelled.

In [17]:
np.random.shuffle(cleaned_data)

In [18]:
rating_cnt = [0, 0, 0, 0, 0]

for d in cleaned_data:
    rating_cnt[d["score"]-1] += 1

In [19]:
rating_cnt

[13550, 3100, 5240, 9050, 58052]

The data is labelled according to the following rules:
* Rating < 3 is treated as negative sentiment
* Rating == 3 is treated as neutral sentiment
* Rating > 3 is treated as positive sentiment
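
The rules above can be sketched as a small function. The name `score_to_label` is hypothetical; the one-hot order `[positive, neutral, negative]` follows the labelling code used later in this notebook.

```python
def score_to_label(score):
    """Map a 1-5 star rating to a one-hot vector [positive, neutral, negative]."""
    if score < 3:
        return [0, 0, 1]  # negative
    if score == 3:
        return [0, 1, 0]  # neutral
    return [1, 0, 0]      # positive


for score in range(1, 6):
    print(score, score_to_label(score))
```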

In [20]:
labelled_data = []

weight = np.array([rating_cnt[0] + rating_cnt[1], rating_cnt[2], rating_cnt[3] + rating_cnt[4]])
max_weight = np.max(weight)
weight = max_weight / weight

weight

array([ 4.03015015, 12.80572519,  1.        ])
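
The weights can be checked by hand from the rating counts printed earlier. This short sketch repeats the arithmetic in plain Python: each class weight is the size of the largest class divided by the size of that class, so the rarest class (neutral) gets the largest weight.

```python
# Counts for 1-5 stars, copied from the `rating_cnt` output above.
rating_cnt = [13550, 3100, 5240, 9050, 58052]

class_cnt = [
    rating_cnt[0] + rating_cnt[1],  # negative (1-2 stars)
    rating_cnt[2],                  # neutral (3 stars)
    rating_cnt[3] + rating_cnt[4],  # positive (4-5 stars)
]

# Weight = largest class size / this class size.
class_weight = [max(class_cnt) / c for c in class_cnt]

print(class_cnt)
print([round(w, 4) for w in class_weight])
```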

In [21]:
for d in cleaned_data:
  if d["score"] < 3:
    labelled_data.append([" ".join(d["review"]), [0,0,1], weight[0]])
  elif d["score"] == 3:
    labelled_data.append([" ".join(d["review"]), [0,1,0], weight[1]])
  else:
    labelled_data.append([" ".join(d["review"]), [1,0,0], weight[2]])

## Data Splitting

In this stage, the data is split into three parts: training data (80%), validation data (10%), and test data (10%).

In [22]:
dataset_cnt = len(labelled_data)
dataset_cnt

88992

In [23]:
train_dataset = labelled_data[:int(dataset_cnt * 0.8)]
test_dataset = labelled_data[int(dataset_cnt * 0.8):int(dataset_cnt * 0.9)]
validation_dataset = labelled_data[int(dataset_cnt * 0.9):]
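
As a quick check of the slicing above, the sketch below computes the three split sizes for the 88992 labelled samples. The helper `split_indices` is hypothetical; it mirrors the notebook's order (training, then test, then validation).

```python
def split_indices(n, train_end_frac=0.8, test_end_frac=0.9):
    """Return (train, test, validation) slices in the notebook's order."""
    train_end = int(n * train_end_frac)
    test_end = int(n * test_end_frac)
    return slice(0, train_end), slice(train_end, test_end), slice(test_end, n)


train_sl, test_sl, val_sl = split_indices(88992)
sizes = [s.stop - s.start for s in (train_sl, test_sl, val_sl)]
print(sizes)
```

The slices are contiguous and non-overlapping, so every sample lands in exactly one split.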

Let us look at the class distribution of each dataset.

In [24]:
cnt = [0, 0, 0]

for d in train_dataset:
    cnt[np.argmax(d[1])] += 1

cnt

[53742, 4191, 13260]

In [25]:
cnt = [0, 0, 0]

for d in test_dataset:
    cnt[np.argmax(d[1])] += 1

cnt

[6714, 511, 1674]

In [26]:
cnt = [0, 0, 0]

for d in validation_dataset:
    cnt[np.argmax(d[1])] += 1

cnt

[6646, 538, 1716]
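
To compare the three splits more easily, the counts above can be turned into percentages. The sketch below only reuses the numbers printed by the previous cells, in the order `[positive, neutral, negative]`.

```python
# Class counts per split, copied from the outputs above.
split_counts = {
    "train": [53742, 4191, 13260],
    "test": [6714, 511, 1674],
    "validation": [6646, 538, 1716],
}

for name, counts in split_counts.items():
    total = sum(counts)
    # Share of each class in percent, rounded to one decimal place.
    shares = [round(100 * c / total, 1) for c in counts]
    print(name, total, shares)
```

Each split turns out to be roughly 75% positive, 6% neutral, and 19% negative, so the random shuffle kept the class balance similar across splits.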

## Model Training

In this stage, the model is trained.

In [27]:
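class DataGenerator:
    """Yield (text, label, weight) triples, one sample per step."""

    def __init__(self, dataset, repeat=1):
        self.dataset = dataset
        self.data_length = len(dataset)
        self.repeat = repeat

    def generate(self):
        # Replay the dataset `repeat` times so each epoch can draw a full pass.
        for _ in range(self.repeat):
            for x, y, w in self.dataset:
                X = tf.convert_to_tensor([x], dtype=tf.string)
                Y = np.array([y], dtype=np.float32)
                # float64 matches the tf.float64 TensorSpec declared for the weights.
                W = np.array([w], dtype=np.float64)

                yield X, Y, W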
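
The generator above yields one sample per step, which is why training runs one step per sample in each epoch. A possible variation, shown here as a framework-free sketch rather than the notebook's method, groups samples into batches. The helper `batch_samples` and the `demo` data are hypothetical.

```python
def batch_samples(dataset, batch_size):
    """Yield (texts, labels, weights) lists with up to `batch_size` items each."""
    for start in range(0, len(dataset), batch_size):
        chunk = dataset[start:start + batch_size]
        texts = [x for x, _, _ in chunk]
        labels = [y for _, y, _ in chunk]
        weights = [w for _, _, w in chunk]
        yield texts, labels, weights


# Hypothetical samples in the same [text, one_hot_label, weight] layout
# as `labelled_data`.
demo = [
    ["bagus banget", [1, 0, 0], 1.0],
    ["lumayan", [0, 1, 0], 12.8],
    ["jelek", [0, 0, 1], 4.0],
]

batches = list(batch_samples(demo, batch_size=2))
for texts, labels, weights in batches:
    print(len(texts), texts)
```

If batching were adopted, `steps_per_epoch` and `validation_steps` would need to count batches instead of samples.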

### TFIDF Model

The model built on the TF-IDF layer is created and trained below.

In [28]:
train_gen = DataGenerator(train_dataset, repeat=LEARNING_EPOCH + 1)
validation_gen = DataGenerator(validation_dataset, repeat=LEARNING_EPOCH + 1)

train_ds = tf.data.Dataset.from_generator(train_gen.generate, output_signature=(
    tf.TensorSpec(shape=(None,), dtype=tf.string),
    tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.float64),
))
validation_ds = tf.data.Dataset.from_generator(validation_gen.generate, output_signature=(
    tf.TensorSpec(shape=(None,), dtype=tf.string),
    tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.float64),
))

callback = [
  # Stop when val_loss has not improved for PATIENCE epochs and restore the best weights.
  keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=PATIENCE,
    restore_best_weights=True,
  ),
  keras.callbacks.ModelCheckpoint(
    filepath="models/checkpoint/tfidf_checkpoint.keras",
    save_best_only=True,
  ),
  keras.callbacks.TensorBoard(
    log_dir="logs/tfidf",
  ),
]

model = keras.models.Sequential([
    keras.layers.Input(shape=(1,), dtype=tf.string),
    layer_tfidf,
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(3, activation="softmax"),
  ],
  name="tfidf",
)

model.summary()

model.compile(
  loss=losses.CategoricalCrossentropy(),
  optimizer=optimizers.Adam(),
  metrics=[keras.metrics.CategoricalAccuracy()],
)
model.fit(
  train_ds,
  validation_data=validation_ds,
  epochs=LEARNING_EPOCH,
  callbacks=callback,
  steps_per_epoch=len(train_dataset),
  validation_steps=len(validation_dataset),
)
model.save("models/tfidf.keras")

tfidf_model = model

Epoch 1/25
71193/71193 ━━━━━━━━━━━━━━━━━━━━ 253s 4ms/step - categorical_accuracy: 0.7311 - loss: 1.8819 - val_categorical_accuracy: 0.8767 - val_loss: 1.1962
Epoch 2/25
71193/71193 ━━━━━━━━━━━━━━━━━━━━ 253s 4ms/step - categorical_accuracy: 0.8889 - loss: 1.1365 - val_categorical_accuracy: 0.9261 - val_loss: 0.9440
Epoch 3/25
71193/71193 ━━━━━━━━━━━━━━━━━━━━ 252s 4ms/step - categorical_accuracy: 0.9211 - loss: 0.9549 - val_categorical_accuracy: 0.9274 - val_loss: 0.8295
Epoch 4/25
71193/71193 ━━━━━━━━━━━━━━━━━━━━ 242s 3ms/step - categorical_accuracy: 0.9272 - loss: 0.8660 - val_categorical_accuracy: 0.9385 - val_loss: 0.9303
Epoch 5/25
71193/71193 ━━━━━━━━━━━━━━━━━━━━ 251s 4ms/step - categorical_accuracy: 0.9318 - loss: 0.8198 - val_categorical_accuracy: 0.9498 - val_loss: 0.7662
Epoch 6/25
71193/71193 ━━━━

From these results, the TF-IDF model appears to perform best.
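
Training uses an early-stopping callback that monitors `val_loss`. The sketch below is a simplified illustration of that idea, not the Keras implementation: it stops once `patience` consecutive epochs fail to improve on the best validation loss. The function name `early_stopping_epoch` is hypothetical, and the input values are the `val_loss` figures for epochs 1 to 5 from the training log above.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if training never runs out of patience.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch


# val_loss for epochs 1-5, copied from the training log above.
logged = [1.1962, 0.9440, 0.8295, 0.9303, 0.7662]
print(early_stopping_epoch(logged, patience=5))
```

For the five logged epochs, the loss rises once at epoch 4 and then reaches a new minimum at epoch 5, so a patience of 5 is not exhausted within this part of the log.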

## Evaluation

In this section, we evaluate the trained model on the test dataset.

In [30]:
model = tfidf_model

# Same naming as in training: *_gen is the generator, *_ds is the tf.data.Dataset.
test_gen = DataGenerator(test_dataset, repeat=1)
test_ds = tf.data.Dataset.from_generator(test_gen.generate, output_signature=(
    tf.TensorSpec(shape=(None,), dtype=tf.string),
    tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.float64),
))

# evaluate() returns [loss, categorical_accuracy].
res = model.evaluate(test_ds, steps=len(test_dataset))
print(f"Model accuracy: {res[1]}")

8899/8899 ━━━━━━━━━━━━━━━━━━━━ 13s 1ms/step - categorical_accuracy: 0.9681 - loss: 0.5659
Model accuracy: 0.9658388495445251


These results show that the TF-IDF model reaches an accuracy of about 96.6% on the test dataset.
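
Because the classes are imbalanced, overall accuracy can hide weak performance on the rare neutral class. A hedged sketch of a per-class recall check follows. The labels and predictions are hypothetical class indices, not outputs of the notebook's model, and `per_class_recall` is an illustrative helper.

```python
def per_class_recall(y_true, y_pred, n_classes=3):
    """Return recall per class index; None when a class has no true samples."""
    correct = [0] * n_classes
    total = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return [c / n if n else None for c, n in zip(correct, total)]


# Hypothetical class indices (0 = positive, 1 = neutral, 2 = negative),
# not actual predictions from the notebook's model.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 2, 0]
recall = per_class_recall(y_true, y_pred)
print(recall)
```

In this made-up example the overall accuracy is 75%, yet only half of the neutral and negative samples are recovered, which is the kind of gap a single accuracy figure does not show.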

## Demo

In this section, we demonstrate the TF-IDF model on a few example sentences.

In [31]:
def predict_sentiment(model, sentence):
    result = model.predict(tf.convert_to_tensor([sentence], dtype=tf.string))
    argmax = np.argmax(result)

    return ["positif", "netral", "negatif"][argmax]

In [32]:
predict_sentiment(model, "Ih bagus banget")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 109ms/step


'positif'

In [33]:
predict_sentiment(model, "Game jelek banget")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step


'negatif'

In [35]:
predict_sentiment(model, "main game dulu")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 87ms/step


'netral'
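
The demo returns only the winning label. A small variation, sketched here with a hypothetical softmax vector instead of a real model call, also reports the probability of that label, which can help spot uncertain predictions. The helper `label_with_confidence` is an illustrative name.

```python
# Same label order as in predict_sentiment.
LABELS = ["positif", "netral", "negatif"]


def label_with_confidence(probabilities):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities[best]


# Hypothetical softmax output, not produced by the trained model.
print(label_with_confidence([0.10, 0.15, 0.75]))
```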