# Overview

This dataset contains tweets from Indonesian users expressing their opinions on the government's implementation of PPKM (Enforcement of Community Activity Restrictions). It consists of approximately 20,000 tweets collected between April 1, 2020 and April 1, 2022.

The collection window runs from when Indonesia began implementing PPKM widely to when the government revoked the policy. The tweets cover a range of public opinions, comments, and reactions to PPKM during that period.

The dataset can be used to analyze public sentiment toward the PPKM policy and to observe how opinions changed over time.

Label: 0 (Positive), 1 (Neutral), 2 (Negative)

In [82]:
# Standard library
import re
import string
import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the output readable

# Third-party
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, SpatialDropout1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stopword lists and Punkt tokenizer data used during preprocessing
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Bana\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Bana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
# The file is tab-separated; only the first 1,000 rows are used in this notebook
data = pd.read_csv("INA_TweetsPPKM_Labeled_Pure.csv", delimiter="\t")[:1000]

In [17]:
data.head()

Unnamed: 0,Date,User,Tweet,sentiment
0,2022-03-31 14:32:04+00:00,pikobar_jabar,Ketahui informasi pembagian #PPKM di wilayah J...,1
1,2022-03-31 09:26:00+00:00,inewsdotid,Tempat Ibadah di Wilayah PPKM Level 1 Boleh Be...,1
2,2022-03-31 05:02:34+00:00,vdvc_talk,"Juru bicara Satgas Covid-19, Wiku Adisasmito m...",1
3,2022-03-30 14:23:10+00:00,pikobar_jabar,Ketahui informasi pembagian #PPKM di wilayah J...,1
4,2022-03-30 11:28:57+00:00,tvOneNews,Kementerian Agama menerbitkan Surat Edaran Nom...,1


In [18]:
data.sentiment.value_counts()

sentiment
1    906
2     70
0     24
Name: count, dtype: int64
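
The counts above are heavily skewed toward the neutral class (906 of 1,000 rows). One common way to compensate, which this notebook does not apply, is to weight each class inversely to its frequency. The sketch below computes such weights with the `n_samples / (n_classes * count)` heuristic (the one scikit-learn calls `balanced`); the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
counts = {1: 906, 2: 70, 0: 24}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in sorted(counts.items())
}

for label, weight in class_weight.items():
    share = counts[label] / n_samples
    print(f"class {label}: share={share:.1%}, weight={weight:.2f}")
```

A dictionary of this shape could be passed as `class_weight` to `LGBMClassifier` or to Keras `model.fit`; whether it improves macro F1 on this data would need to be checked.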

## Preprocessing

In [19]:
def clean_text(text):
    # Lowercase text
    text = text.lower()
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Remove non-ASCII characters
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('indonesian'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Join tokens back into text
    cleaned_text = ' '.join(filtered_tokens)
    
    return cleaned_text


In [20]:
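# Note: this rebinds the name `clean_text` from the function to the resulting Series,
# so re-running this cell fails unless the function definition cell is run again first
clean_text = data.Tweet.apply(clean_text)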

In [21]:
clean_text

0      ketahui informasi pembagian ppkm wilayah jabar...
1      ibadah wilayah ppkm level berkapasitas persen ...
2      juru bicara satgas covid wiku adisasmito bukbe...
3      ketahui informasi pembagian ppkm wilayah jabar...
4      kementerian agama menerbitkan surat edaran nom...
                             ...                        
995    omicron bertambah ppkm jawabali diperpanjang f...
996    omicron meningkat ppkm jawabali diperpanjang y...
997    infoekon hai sahabatekon pemerintah update ter...
998    ketahui informasi pembagian ppkm wilayah jabar...
999    gabungan pengusaha nasional angkutan sungai da...
Name: Tweet, Length: 1000, dtype: object
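
Tokens such as `jawabali` in the output above come from deleting punctuation outright, which glues hyphenated words together; the same step turns URLs into a single long token. The sketch below reproduces the lowercase, digit, punctuation, and non-ASCII steps without NLTK (no tokenizer or stopword removal) and compares them with a hypothetical variant that strips URLs first and replaces punctuation with spaces. The function names and the sample tweet are made up for illustration.

```python
import re
import string

def basic_clean(text):
    """Lowercase, drop digits, delete punctuation, drop non-ASCII (as in the cell above, minus NLTK)."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return ' '.join(text.split())

def clean_keep_boundaries(text):
    """Variant: strip URLs first and replace punctuation with spaces instead of deleting it."""
    text = text.lower()
    text = re.sub(r'https?://\S+', ' ', text)
    text = re.sub(r'\d+', '', text)
    spaces = ' ' * len(string.punctuation)
    text = text.translate(str.maketrans(string.punctuation, spaces))
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return ' '.join(text.split())

sample = "PPKM Jawa-Bali diperpanjang 14 hari! https://t.co/abc"  # made-up example tweet
print(basic_clean(sample))            # ppkm jawabali diperpanjang hari httpstcoabc
print(clean_keep_boundaries(sample))  # ppkm jawa bali diperpanjang hari
```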

In [31]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')



In [38]:
# Pass a plain list of strings; encode() returns one embedding row per tweet
encoded_texts = encoder.encode(clean_text.tolist())

In [39]:
encoded_texts

array([[ 2.2088645e-02, -1.6602866e-01,  2.4198163e-03, ...,
         2.4661376e-01, -2.6212451e-01, -2.2927236e-02],
       [ 4.0184265e-01, -6.1241776e-02, -3.4493497e-01, ...,
         1.3564037e-01, -3.4816757e-01,  5.0411489e-02],
       [-5.7184729e-03,  2.1681347e-01,  5.1759981e-04, ...,
         7.1460614e-03, -4.7337633e-02,  1.3810094e-01],
       ...,
       [-1.7834090e-01, -2.1855620e-01,  8.3498836e-02, ...,
         3.3723388e-02, -7.4701734e-02, -7.7687828e-03],
       [ 3.9470021e-02, -1.3872997e-01, -1.9961135e-02, ...,
         2.3657681e-01, -2.5121626e-01,  1.8325572e-04],
       [ 4.4836566e-02,  6.5323249e-02,  3.1551573e-02, ...,
        -1.3469467e-01,  2.4939582e-01,  1.3760692e-01]], dtype=float32)

In [40]:
encoded_texts.shape

(1000, 384)

# Modeling using a Tree-Based Classifier (LightGBM)

In [81]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer

f1 = make_scorer(f1_score, average='macro')

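model = LGBMClassifier(random_state=42)
# cross_val_score uses 5-fold cross-validation by default (stratified for classifiers),
# so each entry in `scores` is the macro F1 on one held-out fold
scores = cross_val_score(model, encoded_texts, data.sentiment, scoring=f1)
print(scores)
# Note: the value printed below is the mean across folds, not a score on a separate test set
print(f'Test F1 Score (Macro-Averaged): {np.mean(scores) * 100:.2f}%')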

[0.65688889 0.4294135  0.4294135  0.47780324 0.38967691]
Test F1 Score (Macro-Averaged): 47.66%
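
The 47.66% figure is the mean of the five fold scores printed above, and the folds differ noticeably from one another. A small sketch, using only the numbers from that output, makes the spread explicit; with only 24 positive and 70 negative tweets, each fold holds very few minority examples, which is one plausible reason for the variation.

```python
from statistics import mean, stdev

# Per-fold macro F1 scores copied from the output above
fold_scores = [0.65688889, 0.4294135, 0.4294135, 0.47780324, 0.38967691]

avg = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds

print(f"mean macro F1: {avg * 100:.2f}%")
print(f"std across folds: {spread * 100:.2f} points")
print(f"min / max: {min(fold_scores):.3f} / {max(fold_scores):.3f}")
```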


In [46]:
model.fit(encoded_texts, data.sentiment)

In [52]:
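# Indonesian for roughly: "I am very upset with the government's performance"
text = "saya sangat emosi dengan kinerja pemerintah"
# encode() returns a single vector for a single string; wrap it in a list to get a 2D input
encoded_text = [encoder.encode(text)]
model.predict(encoded_text)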

array([1], dtype=int64)
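
The prediction is an integer class id. Under the label scheme given in the Overview, `1` means Neutral, so the model labels a sentence that expresses anger as neutral, which may reflect the class imbalance in the training rows. The helper below is an illustrative way to decode predictions; `LABEL_NAMES` and `decode` are names introduced here, not part of the notebook.

```python
# Label scheme stated in the Overview: 0 (Positive), 1 (Neutral), 2 (Negative)
LABEL_NAMES = {0: "Positive", 1: "Neutral", 2: "Negative"}

def decode(predictions):
    """Map an iterable of integer class ids to their label names."""
    return [LABEL_NAMES[int(p)] for p in predictions]

print(decode([1]))  # ['Neutral']
```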

# Modeling using LSTM

In [59]:
X = clean_text
y = data['sentiment'].values

# Encode labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [60]:
# Tokenize words
max_words = 1000
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences
max_len = 100
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post')
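
To make the two steps above concrete, here is a simplified pure-Python approximation of what `Tokenizer` and `pad_sequences` do with these settings. It is a sketch, not the Keras implementation: index 0 is reserved for padding, the OOV token takes index 1, remaining words are numbered by descending frequency, and sequences are padded at the end. The helper names `fit_word_index` and `to_padded` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Assign index 1 to the OOV token, then 2, 3, ... to words by descending frequency."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_padded(text, index, max_words, max_len):
    """Convert text to ids (unknown or out-of-vocabulary words become 1), then pad with zeros."""
    oov = 1
    seq = [index[w] if index.get(w, max_words) < max_words else oov
           for w in text.split()]
    seq = seq[-max_len:]  # default truncation keeps the last max_len tokens
    return seq + [0] * (max_len - len(seq))  # padding='post'

train_texts = ["ppkm diperpanjang", "ppkm level"]  # toy training set
index = fit_word_index(train_texts)
print(index)
print(to_padded("ppkm dicabut level", index, max_words=1000, max_len=5))  # [2, 1, 4, 0, 0]
```

Because the vocabulary is fitted on `X_train` only, any word that appears solely in the test tweets maps to the OOV index.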


In [71]:
# Define the model
embedding_dim = 100

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))
model.add(SpatialDropout1D(0.2))  # Dropout layer to reduce overfitting
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Summary of the model
model.summary()
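
The output of `model.summary()` is not included here, but the parameter counts can be worked out by hand from the layer sizes. The sketch below assumes default layer settings (biases enabled) and uses the standard LSTM parameter formula; `SpatialDropout1D` has no trainable parameters.

```python
max_words, embedding_dim, lstm_units, n_classes = 1000, 100, 100, 3

# Embedding: one embedding_dim-sized vector per vocabulary index
embedding_params = max_words * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense softmax head: weights plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 100000 80400 303 180703
```

That is roughly 180,000 trainable parameters fitted on about 800 training tweets, which is worth keeping in mind when reading the results.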


In [5]:
# Define F1 score macro-averaged function
def f1_macro(y_true, y_pred):
    return f1_score(y_true, y_pred, average='macro')

In [79]:
epochs = 10
batch_size = 32

history = model.fit(X_train_pad, y_train, epochs=epochs, batch_size=batch_size, 
                    validation_data=(X_test_pad, y_test))

Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 35ms/step - accuracy: 0.8911 - loss: 0.3978 - val_accuracy: 0.9100 - val_loss: 0.3849
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step - accuracy: 0.9043 - loss: 0.3593 - val_accuracy: 0.9100 - val_loss: 0.3871
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 31ms/step - accuracy: 0.9013 - loss: 0.3652 - val_accuracy: 0.9100 - val_loss: 0.3823
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step - accuracy: 0.8988 - loss: 0.3727 - val_accuracy: 0.9100 - val_loss: 0.3829
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step - accuracy: 0.8795 - loss: 0.4365 - val_accuracy: 0.9100 - val_loss: 0.3797
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step - accuracy: 0.9014 - loss: 0.3787 - val_accuracy: 0.9100 - val_loss: 0.3884
Epoch 7/10
25/25 ━━━━

In [80]:
# Evaluate the model using F1 score macro-averaged
y_pred = np.argmax(model.predict(X_test_pad), axis=1)
f1_macro_score = f1_score(y_test, y_pred, average='macro')

print(f'Test F1 Score (Macro-Averaged): {f1_macro_score * 100:.2f}%')

7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
Test F1 Score (Macro-Averaged): 31.76%
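
It helps to compare this score with a majority-class baseline. In the training log, `val_accuracy` stays at exactly 0.9100 in every epoch shown, and 0.91 × 200 test tweets = 182. A classifier that labels all 200 test tweets as neutral would, under that assumption, score the same macro F1 as reported above. This is an inference from the logged numbers, not something the notebook verifies. The sketch below computes macro F1 by hand for such a baseline.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# ASSUMPTION: 182 of the 200 test tweets are neutral (inferred from val_accuracy = 0.91).
# The 5/13 split of the other 18 between classes 0 and 2 is made up; it does not change the result.
y_true = [1] * 182 + [0] * 5 + [2] * 13
y_pred = [1] * 200  # always predict the majority class

baseline = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(f"{baseline * 100:.2f}%")  # 31.76%
```

If the assumption holds, the LSTM has learned to predict only the neutral class, and accuracy alone (91%) hides that.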


## Using an LLM

In [4]:
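# Chat with an intelligent assistant in your terminal
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# System prompt (Indonesian): "You are a machine learning model; you can classify
# text into 3 types: positive, neutral, and negative."
# First user message: "Hello, introduce yourself."
history = [
    {"role": "system", "content": "Kamu adalah model machine learning, kamu bisa mengklasifikasikan text menjadi 3 tipe yaitu, positif, netral, dan negatif."},
    {"role": "user", "content": "Halo, perkenalkan dirimu."},
]

# Runs until interrupted (Ctrl+C): stream the reply, store it, then read the next user message
while True: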
    completion = client.chat.completions.create(
        model="local-model", # this field is currently unused
        messages=history,
        temperature=0.7,
        stream=True,
    )

    new_message = {"role": "assistant", "content": ""}
    
    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            new_message["content"] += chunk.choices[0].delta.content

    history.append(new_message)

    print()
    history.append({"role": "user", "content": input("> ")})

Hai! Aku adalah model machine learning, dan saya dapat membantu Anda dengan tugas pemclassfication text.

**Bagaimana Anda ingin saya membantu Anda?**
Cocinakanlah, aku akan membantu! Apakah Anda memiliki masalah apa yang ingin diclassifikasi?
Text "Saya sangat senang hari ini" diclassyfikasi sebagai positif.
Text "Saya sangat sedih hari ini" diclassyfikasi sebagai negatif.
Text "Saya membaca buku dengan fokus" diclassyfikasi sebagai positif.


KeyboardInterrupt: Interrupted by user
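
The replies above are free text, so comparing the LLM with the other two models on macro F1 would first require mapping each reply to a label id. A minimal sketch of such a parser follows; `LABEL_IDS` and `parse_label` are hypothetical names, and the rule of taking the last label word in the reply is an assumption that may not suit every reply format.

```python
import re

# Indonesian label words from the system prompt, mapped to the dataset's label ids
LABEL_IDS = {"positif": 0, "netral": 1, "negatif": 2}

def parse_label(reply):
    """Return the label id for the last label word in the reply, or None if there is none."""
    matches = re.findall(r"\b(positif|netral|negatif)\b", reply.lower())
    return LABEL_IDS[matches[-1]] if matches else None

print(parse_label('Text "Saya sangat senang hari ini" diclassyfikasi sebagai positif.'))  # 0
print(parse_label('Text "Saya sangat sedih hari ini" diclassyfikasi sebagai negatif.'))   # 2
print(parse_label("Hai! Aku adalah model machine learning"))                              # None
```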