
Sentiment Analysis for Amazon Polarity Dataset

In [25]:
# Install the translation package before importing it
!pip install deep_translator
from deep_translator import GoogleTranslator
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf



In [26]:
!pip install -q datasets
from datasets import load_dataset
import pandas as pd

In [27]:
train_dataset, test_dataset = load_dataset("fancyzhx/amazon_polarity", split=['train[:2000]', 'test[:500]'])

train_df = pd.DataFrame(train_dataset['content'], columns=['content'])
train_df['label'] = train_dataset['label']

test_df = pd.DataFrame(test_dataset['content'], columns=['content'])
test_df['label'] = test_dataset['label']

In [28]:
def translate_to_indo(text):
    """Translate English text to Indonesian with Google Translate."""
    translator = GoogleTranslator(source='en', target='id')
    return translator.translate(text)

# Translation is disabled here; the next cell loads the reviews from CSV files instead.
# train_df['content'] = train_df['content'].apply(translate_to_indo)
# test_df['content'] = test_df['content'].apply(translate_to_indo)
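
The commented-out `apply` calls above send one translation request per review and stop at the first error. A minimal sketch of a more forgiving variant follows, assuming a hypothetical helper `translate_all` that is not part of the original notebook. It translates each distinct text once and keeps the original text when a request fails. The stand-in translator in the demonstration is made up so that the example runs without network access.

```python
from typing import Callable, Dict, List, Optional


def translate_all(
    texts: List[str],
    translate_fn: Callable[[str], str],
    cache: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Translate each text once, reusing results for repeated inputs.

    If `translate_fn` raises or returns nothing, the original text is kept
    so that one failed request does not abort the whole run.
    """
    cache = {} if cache is None else cache
    translated = []
    for text in texts:
        if text not in cache:
            try:
                result = translate_fn(text)
            except Exception:
                result = None
            cache[text] = result if result else text
        translated.append(cache[text])
    return translated


# Offline demonstration with a stand-in translator (hypothetical, no network).
def fake_translate(text: str) -> str:
    if text == "bad":
        raise RuntimeError("simulated request failure")
    return text.upper()


demo = translate_all(["good", "bad", "good"], fake_translate)
print(demo)  # ['GOOD', 'bad', 'GOOD']
```

In the notebook, `translate_all(train_df['content'].tolist(), translate_to_indo)` would take the place of the `apply` call. Note that failed requests leave untranslated English text in the data, so it may be worth counting them before training.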

In [29]:
train_df = pd.read_csv("train_review_product.csv")
test_df = pd.read_csv("test_review_product.csv")

In [30]:
train_df

Unnamed: 0,content,label
0,Produk ini sangat bagus dan memuaskan!,1
1,"Tidak sesuai dengan harapan, kualitas buruk.",0
2,Paket datang cepat dan barang sesuai deskripsi.,1
3,"Sangat kecewa, produk tidak berfungsi sama sek...",0
4,"Harga terjangkau, kualitas oke.",1
5,"Barang cacat, sangat mengecewakan.",0
6,"Produk berkualitas tinggi, sangat direkomendas...",1
7,"Layanan pengiriman sangat lambat, tidak puas.",0
8,"Barang sesuai dengan foto, sangat puas.",1
9,"Produk murah tapi kualitas buruk, tidak direko...",0


In [31]:
model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)

MAX_LENGTH = 100

train_tokenized = tokenizer(
    text=train_df['content'].tolist(),
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding=True,
    return_tensors='tf',
    return_token_type_ids=False,
    return_attention_mask=True
)

test_tokenized = tokenizer(
    text=test_df['content'].tolist(),
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding=True,
    return_tensors='tf',
    return_token_type_ids=False,
    return_attention_mask=True
)

train_input_ids = tf.cast(train_tokenized['input_ids'], tf.int32)
train_attention_mask = tf.cast(train_tokenized['attention_mask'], tf.int32)

test_input_ids = tf.cast(test_tokenized['input_ids'], tf.int32)
test_attention_mask = tf.cast(test_tokenized['attention_mask'], tf.int32)

train_labels = tf.convert_to_tensor(train_df['label'])
test_labels = tf.convert_to_tensor(test_df['label'])
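
The tokenizer calls above use `padding=True`, which pads each batch only up to its longest sequence (after truncation to `MAX_LENGTH`), whereas `padding='max_length'` always pads to `MAX_LENGTH`. The pure-Python sketch below illustrates that difference on toy token IDs. The helper `pad_batch`, the pad ID of 0 and the token IDs are assumptions made for illustration; this is not the tokenizer's actual implementation, and special-token handling during truncation is ignored.

```python
from typing import Dict, List


def pad_batch(
    batch: List[List[int]],
    max_length: int,
    pad_to_max_length: bool = False,
    pad_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Truncate every sequence to `max_length`, then pad on the right.

    pad_to_max_length=False mimics padding=True (pad to the longest sequence
    in the batch); pad_to_max_length=True mimics padding='max_length'.
    """
    truncated = [ids[:max_length] for ids in batch]
    target = max_length if pad_to_max_length else max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[2, 15, 8, 3], [2, 9, 3]]  # made-up token IDs
longest = pad_batch(toy_batch, max_length=6)
fixed = pad_batch(toy_batch, max_length=6, pad_to_max_length=True)
print(longest["input_ids"])     # [[2, 15, 8, 3], [2, 9, 3, 0]]
print(fixed["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
```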

In [32]:
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at cahya/bert-base-indonesian-522M and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=0.01)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = ['accuracy']

In [34]:
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

In [35]:
history = model.fit(
    [train_input_ids, train_attention_mask],
    train_labels,
    validation_data=([test_input_ids, test_attention_mask], test_labels),
    epochs=7,
    batch_size=32
)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
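
The training log above shows only the epoch counters, so the validation accuracy is not visible. A sketch of how accuracy, precision and recall could be computed from the model's logits is given below, assuming label 1 means positive as in the data shown earlier. The helper `binary_metrics` is hypothetical and the logits in the demonstration are made up.

```python
from typing import Dict, Sequence


def binary_metrics(
    labels: Sequence[int], logits: Sequence[Sequence[float]]
) -> Dict[str, float]:
    """Accuracy, precision and recall for the positive class (label 1)."""
    # Predict class 1 unless the class-0 logit is strictly larger.
    preds = [0 if row[0] > row[1] else 1 for row in logits]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return {
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up logits for four reviews; columns are [negative, positive].
demo_labels = [1, 0, 1, 0]
demo_logits = [[-1.2, 2.0], [1.5, -0.3], [0.4, 0.1], [-0.2, 0.9]]
print(binary_metrics(demo_labels, demo_logits))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5}
```

In the notebook, the logits would come from `model.predict([test_input_ids, test_attention_mask]).logits` and the labels from `test_df['label']`.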


In [36]:
# from_pt is an option of from_pretrained, not save_pretrained, so it is dropped here
model.save_pretrained("transformers-bert")

In [None]:
# import shutil

# # Create a ZIP archive of the model folder
# shutil.make_archive("transformers-bert", 'zip', "transformers-bert")

# # Download the archive from Colab
# from google.colab import files
# files.download("transformers-bert.zip")


In [37]:
model = TFBertForSequenceClassification.from_pretrained("transformers-bert")

Some layers from the model checkpoint at transformers-bert were not used when initializing TFBertForSequenceClassification: ['dropout_113']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at transformers-bert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [38]:
def predict_sentiment(texts):
    """Count positive and negative predictions for a string or a list of strings."""
    tokenized_text = tokenizer(
        text=texts,
        add_special_tokens=True,
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length',
        return_tensors='tf'
    )
    input_ids = tokenized_text['input_ids']
    attention_mask = tokenized_text['attention_mask']
    # use_multiprocessing/workers only affect generator inputs, so they are omitted
    predictions = model.predict([input_ids, attention_mask])
    result = {'positive': 0, 'negative': 0}
    # Logit columns are [negative, positive], matching labels 0 and 1
    for pred in predictions.logits:
        if pred[0] > pred[1]:
            result['negative'] += 1
        else:
            result['positive'] += 1
    return result
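
`predict_sentiment` returns only aggregate counts. A sketch of how the same logits could be turned into a per-text label with a softmax confidence follows. The helpers `softmax` and `label_with_confidence` are hypothetical additions, and the logits in the demonstration are made up.

```python
import math
from typing import List, Sequence, Tuple


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(
    logits: Sequence[Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (label, probability) per row; columns are [negative, positive]."""
    results = []
    for row in logits:
        p_negative, p_positive = softmax(row)
        # Same tie-breaking as predict_sentiment: ties count as positive.
        if p_negative > p_positive:
            results.append(("negative", p_negative))
        else:
            results.append(("positive", p_positive))
    return results


demo = label_with_confidence([[2.0, -1.0], [0.0, 0.0]])
for label, prob in demo:
    print(f"{label}: {prob:.3f}")
# negative: 0.953
# positive: 0.500
```

Inside the notebook, `label_with_confidence(predictions.logits)` could be called on the output of `model.predict` to inspect borderline cases such as short inputs.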

In [39]:
text = ["Produk ini sangat bagus dan saya sangat menyukainya!", "mantap", "jelek banget", "oke lah", "kualitas sangat bagus"]
sentiment = predict_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: {'positive': 4, 'negative': 1}


In [40]:
text = "Jelek banget"
sentiment = predict_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: {'positive': 0, 'negative': 1}


In [41]:
text = "Produk ini sangat mengecewakan. Kualitasnya buruk dan tidak sesuai dengan yang diiklankan. Saya tidak merekomendasikan produk ini kepada siapa pun."
sentiment = predict_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: {'positive': 0, 'negative': 1}


In [42]:
text = "Saya tidak puas dengan kualitas produk ini. Sepertinya saya akan mencari pilihan lain."
sentiment = predict_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: {'positive': 0, 'negative': 1}


In [43]:
text = "Saya merasa produk ini tidak memenuhi ekspektasi saya."
sentiment = predict_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: {'positive': 0, 'negative': 1}


In [44]:
text = "Produk ini benar-benar luar biasa! Saya sangat terkesan dengan kualitasnya dan sangat merekomendasikannya kepada teman-teman saya."
sentiment = predict_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: {'positive': 1, 'negative': 0}
