<a href="https://colab.research.google.com/github/akmaldika/Task-Classification-NLP/blob/main/13521070_task_classification_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Assignment: Text Classification Using ML

**Task Description**

>Build a sentiment analysis model using a traditional machine learning algorithm with bag-of-words features. Use the data in this repository: https://github.com/IndoNLP/indonlu/tree/master/dataset/smsa_doc-sentiment-prosa
>
>Run experiments on the feature extraction variations taught in class, namely bag of words, with traditional ML algorithms (at least 3 algorithms). State which preprocessing steps are used.
>
>Deliverables:
>A report containing an explanation of the program code (with a link to the code repository), the experiment scenarios, the experiment results, and an error analysis. Report Accuracy, Precision, Recall, and F1-score for every experiment.

## Preprocessing

### Import files

In [2]:
import pandas as pd
base_path = ''

train_data_path = base_path + 'train_preprocess.tsv'
valid_data_path = base_path + 'valid_preprocess.tsv'
test_data_path = base_path + 'test_preprocess.tsv'

train_data = pd.read_csv(train_data_path, sep='\t', header=None)
valid_data = pd.read_csv(valid_data_path, sep='\t', header=None)
test_data = pd.read_csv(test_data_path, sep='\t', header=None)

train_data.columns = ['text', 'label']
valid_data.columns = ['text', 'label']
test_data.columns = ['text', 'label']

X_train = train_data['text']
y_train = train_data['label']  # Sentiment label: positive, negative, or neutral

X_valid = valid_data['text']
y_valid = valid_data['label']  # Sentiment label: positive, negative, or neutral

X_test = test_data['text']
y_test = test_data['label']  # Sentiment label: positive, negative, or neutral

print(f"Train data: {X_train.shape}, Train labels: {y_train.shape}")
print(f"Valid data: {X_valid.shape}, Valid labels: {y_valid.shape}")
print(f"Test data: {X_test.shape}, Test labels: {y_test.shape}")

def load_vocab(vocab_file):
  with open(vocab_file, 'r') as f:
    vocab = f.read().splitlines()
    return vocab

vocab_file = base_path + 'vocab.txt'
vocab = load_vocab(vocab_file)
print(f"Vocab size: {len(vocab)}")

# Count how many samples each label has in the training data
print(f"Train data: {y_train.value_counts()}")


Train data: (11000,), Train labels: (11000,)
Valid data: (1260,), Valid labels: (1260,)
Test data: (500,), Test labels: (500,)
Vocab size: 19521
Train data: label
positive    6416
negative    3436
neutral     1148
Name: count, dtype: int64
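
The label counts above show that the training set is imbalanced, which matters when reading the accuracy figures later on. As a rough sketch (using only the three counts printed above, and assuming the validation and test splits have a broadly similar distribution, which this notebook does not verify), the class shares and the accuracy of an "always predict the majority label" baseline can be computed like this:

```python
# Label counts copied from the training-set output above.
label_counts = {"positive": 6416, "negative": 3436, "neutral": 1148}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent label,
# measured on the training set itself.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<10}{share:.3f}")
print(f"majority baseline ({majority_label}): {majority_baseline:.3f}")
```

Any model in the experiments should be compared against this trivial baseline rather than against 1/3.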


### NLTK Download and import

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# from nltk.stem import WordNetLemmatizer
import string
import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
stop_words = set(stopwords.words('indonesian'))
vocab_set = set(vocab)  # set lookup is O(1); scanning the vocab list for every token is slow

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove digits and special characters
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Keep only in-vocabulary tokens that are not stopwords
    # (punctuation has already been stripped by the regex above)
    tokens = [word for word in tokens if word in vocab_set and word not in stop_words]

    return ' '.join(tokens)

X_train_clean = X_train.apply(preprocess_text)
X_valid_clean = X_valid.apply(preprocess_text)
X_test_clean = X_test.apply(preprocess_text)

# Show a few preprocessing results
print("Original text:")
for i in range(5):
    print(X_train[i])
print("\nPreprocessed text:")
for i in range(5):
    print(X_train_clean[i])

# from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

Original text:
warung ini dimiliki oleh pengusaha pabrik tahu yang sudah puluhan tahun terkenal membuat tahu putih di bandung . tahu berkualitas , dipadu keahlian memasak , dipadu kretivitas , jadilah warung yang menyajikan menu utama berbahan tahu , ditambah menu umum lain seperti ayam . semuanya selera indonesia . harga cukup terjangkau . jangan lewatkan tahu bletoka nya , tidak kalah dengan yang asli dari tegal !
mohon ulama lurus dan k212 mmbri hujjah partai apa yang harus diwlh agar suara islam tidak pecah-pecah
lokasi strategis di jalan sumatera bandung . tempat nya nyaman terutama sofa di lantai 2 . paella nya enak , sangat pas dimakan dengan minum bir dingin . appetiser nya juga enak-enak .
betapa bahagia nya diri ini saat unboxing paket dan barang nya bagus ! menetapkan beli lagi !
duh . jadi mahasiswa jangan sombong dong . kasih kartu kuning segala . belajar dulu yang baik , tidak usahlah ikut-ikut politik . nanti sudah selesai kuliah nya mau ikut politik juga tidak telat . d
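
To make the effect of the two regex substitutions in `preprocess_text` concrete, here is a small standalone sketch applied to the second sample sentence above. It reproduces only the lowercasing and regex steps (no NLTK tokenization, stopword, or vocabulary filtering). The `punct_replacement` parameter is a hypothetical addition for illustration and is not part of the notebook's function.

```python
import re

def strip_digits_and_punctuation(text, punct_replacement=""):
    """Lowercase, drop digit runs, then replace non-word, non-space characters."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"[^\w\s]", punct_replacement, text)
    return text

sample = ("mohon ulama lurus dan k212 mmbri hujjah partai apa yang harus "
          "diwlh agar suara islam tidak pecah-pecah")

# Same behavior as the notebook: punctuation is deleted outright.
cleaned = strip_digits_and_punctuation(sample)

# Hypothetical variant: punctuation becomes a space instead.
cleaned_variant = strip_digits_and_punctuation(sample, punct_replacement=" ")

print(cleaned)
print(cleaned_variant)
```

With the notebook's settings, `k212` collapses to the single letter `k` and the reduplicated `pecah-pecah` is fused into `pecahpecah`. Whether those tokens then survive depends on the contents of `vocab.txt`. Replacing punctuation with a space keeps `pecah pecah` as two tokens, which may or may not be preferable for this dataset.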

## Feature Extraction

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create the bag-of-words transform with CountVectorizer
vectorizer = CountVectorizer(
  # ngram_range=(1, 2),                 # Unigrams and bigrams
)

# Fit on and transform the preprocessed X_train
X_train_bow = vectorizer.fit_transform(X_train_clean)

# Transform X_valid and X_test without refitting
X_valid_bow = vectorizer.transform(X_valid_clean)
X_test_bow = vectorizer.transform(X_test_clean)

# output
print(X_train_bow.shape)
print(X_valid_bow.shape)
print(X_test_bow.shape)
print(type(X_train_bow))
print(vectorizer.vocabulary_)

(11000, 16080)
(1260, 16080)
(500, 16080)
<class 'scipy.sparse._csr.csr_matrix'>
{'warung': 15740, 'dimiliki': 3571, 'pengusaha': 11233, 'pabrik': 10527, 'puluhan': 12057, 'terkenal': 14763, 'putih': 12096, 'bandung': 864, 'berkualitas': 1443, 'dipadu': 3625, 'keahlian': 6465, 'memasak': 8559, 'kretivitas': 7364, 'menyajikan': 9303, 'menu': 9272, 'utama': 15540, 'berbahan': 1219, 'ditambah': 3926, 'ayam': 707, 'selera': 13156, 'indonesia': 5710, 'harga': 5292, 'terjangkau': 14745, 'lewatkan': 7752, 'bletoka': 1842, 'nya': 10255, 'kalah': 6275, 'asli': 602, 'tegal': 14478, 'mohon': 9689, 'ulama': 15417, 'lurus': 7993, 'mmbri': 9653, 'hujjah': 5540, 'partai': 10742, 'diwlh': 4049, 'suara': 14007, 'islam': 5862, 'lokasi': 7868, 'strategis': 13967, 'jalan': 5950, 'sumatera': 14095, 'nyaman': 10264, 'sofa': 13755, 'lantai': 7571, 'paella': 10558, 'enak': 4352, 'pas': 10751, 'dimakan': 3544, 'minum': 9609, 'bir': 1798, 'dingin': 3599, 'appetiser': 494, 'betapa': 1696, 'bahagia': 766, 'unboxi
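
The fit-on-train, transform-on-everything pattern above is why all three matrices share the same 16080 columns. The sketch below is a simplified pure-Python illustration of that idea on toy documents. It is not scikit-learn's implementation: `CountVectorizer` returns a sparse matrix and its default tokenizer also drops single-character tokens. The helper names `fit_vocabulary` and `transform` are made up for this example.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct training token to a column index, in alphabetical order."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}

def transform(documents, vocabulary):
    """Build one dense count vector per document; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows

train_docs = ["tahu enak tahu murah", "tempat nyaman"]  # toy stand-ins for X_train_clean
valid_docs = ["tahu nyaman mahal"]                       # "mahal" never occurs in training

vocabulary = fit_vocabulary(train_docs)
train_bow = transform(train_docs, vocabulary)
valid_bow = transform(valid_docs, vocabulary)

print(vocabulary)
print(train_bow)
print(valid_bow)
```

Because the vocabulary is learned from the training documents only, a word that appears solely in the validation or test data (here `mahal`) contributes nothing to the feature vector.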

## Model Building (baseline)

In [None]:
# Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_bow, y_train)

# Predict on the validation set
y_pred_nb = nb_model.predict(X_valid_bow)
print(y_pred_nb)


# SVC model
svc_model = SVC(kernel='linear')
svc_model.fit(X_train_bow, y_train)

# Predict on the validation set
y_pred_svc = svc_model.predict(X_valid_bow)
print(y_pred_svc)


# Logistic Regression model
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train_bow, y_train)

# Predict on the validation set
y_pred_lr = lr_model.predict(X_valid_bow)
print(y_pred_lr)


['negative' 'positive' 'positive' ... 'negative' 'negative' 'positive']
['negative' 'positive' 'positive' ... 'negative' 'negative' 'positive']
['negative' 'positive' 'positive' ... 'negative' 'positive' 'positive']


## Evaluate Model

In [None]:
# Evaluation function (precision, recall, and F1 are support-weighted averages over classes)
def evaluate_model(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate Naive Bayes
accuracy_nb, precision_nb, recall_nb, f1_nb = evaluate_model(y_valid, y_pred_nb)

# Evaluate SVC
accuracy_svc, precision_svc, recall_svc, f1_svc = evaluate_model(y_valid, y_pred_svc)

# Evaluate Logistic Regression
accuracy_lr, precision_lr, recall_lr, f1_lr = evaluate_model(y_valid, y_pred_lr)


data = [
    ["Model", "Accuracy", "Precision", "Recall", "F1-Score"],
    ["Naive Bayes", accuracy_nb, precision_nb, recall_nb, f1_nb],
    ["Logistic Regression", accuracy_lr, precision_lr, recall_lr, f1_lr],
    ["SVC", accuracy_svc, precision_svc, recall_svc, f1_svc],
]

for row in data:
    for element in row:
        print(f"{element:<25}", end="")
    print()


Model                    Accuracy                 Precision                Recall                   F1-Score                 
Naive Bayes              0.8484126984126984       0.8507450299053643       0.8484126984126984       0.8462859089025592       
Logistic Regression      0.8404761904761905       0.8430390387031469       0.8404761904761905       0.8399551018038592       
SVC                      0.8238095238095238       0.8258759797848143       0.8238095238095238       0.8233815322454867       
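
In the table above, the Recall column is identical to the Accuracy column for every model. That follows from `average='weighted'`: each class's recall (correct predictions for the class divided by its support) is weighted by support divided by the total, so the supports cancel and the sum reduces to overall accuracy. The following pure-Python sketch checks this on made-up toy labels. The helper functions are illustrative and are not scikit-learn's code.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels, invented for illustration only.
y_true = ["positive", "positive", "positive", "negative", "negative", "neutral"]
y_pred = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

print(weighted_recall(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

Since weighted recall adds no information beyond accuracy, macro-averaged scores (`average='macro'` in scikit-learn), which weight every class equally, may be worth reporting alongside these numbers given the class imbalance.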


In [None]:
# Build a DataFrame of validation texts, true labels, and each model's predictions
df_valid = pd.DataFrame({
    'text': valid_data['text'],
    'true_label': y_valid,
    'predicted_label_nb': y_pred_nb,
    'predicted_label_lr': y_pred_lr,
    'predicted_label_svc': y_pred_svc
})

# Collect the misclassified examples for each model and print how many there are
errors_nb = df_valid[df_valid['true_label'] != df_valid['predicted_label_nb']]
errors_lr = df_valid[df_valid['true_label'] != df_valid['predicted_label_lr']]
errors_svc = df_valid[df_valid['true_label'] != df_valid['predicted_label_svc']]

print(errors_nb['predicted_label_nb'].count())
print(errors_lr['predicted_label_lr'].count())
print(errors_svc['predicted_label_svc'].count())

191
201
222
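
As a quick consistency check, the three error counts printed above can be converted back into accuracies using the validation-set size of 1260 reported when the data was loaded. The numbers below are copied from the notebook outputs and nothing is recomputed from the data itself.

```python
# Values copied from the outputs above: validation-set size and per-model error counts.
n_valid = 1260
error_counts = {"Naive Bayes": 191, "Logistic Regression": 201, "SVC": 222}

accuracy_from_errors = {
    model: (n_valid - errors) / n_valid for model, errors in error_counts.items()
}

for model, acc in accuracy_from_errors.items():
    print(f"{model:<25}{acc:.4f}")
```

The resulting values agree with the Accuracy column of the validation table, so the error DataFrames cover exactly the misclassified rows.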


## Test

In [None]:
# Naive Bayes
y_pred_test_nb = nb_model.predict(X_test_bow)

# SVC
y_pred_test_svc = svc_model.predict(X_test_bow)

# Logistic Regression
y_pred_test_lr = lr_model.predict(X_test_bow)


In [None]:
# Evaluate Naive Bayes on the test data
accuracy_test_nb, precision_test_nb, recall_test_nb, f1_test_nb = evaluate_model(y_test, y_pred_test_nb)

# Evaluate SVC on the test data
accuracy_test_svc, precision_test_svc, recall_test_svc, f1_test_svc = evaluate_model(y_test, y_pred_test_svc)

# Evaluate Logistic Regression on the test data
accuracy_test_lr, precision_test_lr, recall_test_lr, f1_test_lr = evaluate_model(y_test, y_pred_test_lr)

# Show the evaluation results on the test data
data_test = [
    ["Model", "Accuracy", "Precision", "Recall", "F1-Score"],
    ["Naive Bayes", accuracy_test_nb, precision_test_nb, recall_test_nb, f1_test_nb],
    ["Logistic Regression", accuracy_test_lr, precision_test_lr, recall_test_lr, f1_test_lr],
    ["SVC", accuracy_test_svc, precision_test_svc, recall_test_svc, f1_test_svc],
]

print("\nEvaluation Results on the Test Set:")
for row in data_test:
    for element in row:
        print(f"{element:<25}", end="")
    print()



Evaluation Results on the Test Set:
Model                    Accuracy                 Precision                Recall                   F1-Score                 
Naive Bayes              0.654                    0.6659749373433583       0.654                    0.6315623931623932       
Logistic Regression      0.74                     0.7392539259888231       0.74                     0.7312522656795416       
SVC                      0.692                    0.6923912284695348       0.692                    0.6883335106573456       


## Tuned

### Hyperparameter Tuning with Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

# Naive Bayes is left untuned here (MultinomialNB's smoothing parameter `alpha` could also be searched)

# Logistic Regression
lr_params = {
    'max_iter': [100, 200, 300],
    'C': [0.01, 0.1, 1],
    'solver': ['liblinear', 'lbfgs']
}

# SVC
svc_params = {
    'C': [0.01, 0.1, 1],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Create GridSearchCV for each model
lr_grid = GridSearchCV(LogisticRegression(), lr_params, cv=5, scoring='accuracy', verbose=1)
svc_grid = GridSearchCV(SVC(), svc_params, cv=5, scoring='accuracy', verbose=1)

# Fit the models on training data
lr_grid.fit(X_train_bow, y_train)
svc_grid.fit(X_train_bow, y_train)

# Get the best hyperparameters and the corresponding model
best_lr = lr_grid.best_estimator_
best_svc = svc_grid.best_estimator_

print(f"Best parameters for Logistic Regression: {lr_grid.best_params_}")
print(f"Best parameters for SVC: {svc_grid.best_params_}")


Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters for Logistic Regression: {'C': 1, 'max_iter': 100, 'solver': 'liblinear'}
Best parameters for SVC: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
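
The "18 candidates" and "12 candidates" in the log above are simply the sizes of the Cartesian products of the two parameter grids, and each candidate is fitted once per fold. The sketch below recomputes those totals with `itertools.product`. It only enumerates the combinations and does not train anything. `grid_candidates` is an illustrative helper, not a scikit-learn function.

```python
from itertools import product

# Parameter grids copied from the cell above.
lr_params = {
    "max_iter": [100, 200, 300],
    "C": [0.01, 0.1, 1],
    "solver": ["liblinear", "lbfgs"],
}
svc_params = {
    "C": [0.01, 0.1, 1],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

def grid_candidates(param_grid):
    """Enumerate every combination of values in a parameter grid."""
    keys = list(param_grid)
    return [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

cv_folds = 5
lr_candidates = grid_candidates(lr_params)
svc_candidates = grid_candidates(svc_params)

print(len(lr_candidates), len(lr_candidates) * cv_folds)
print(len(svc_candidates), len(svc_candidates) * cv_folds)
```

One thing to keep in mind when reading the SVC result: `gamma` is a coefficient of the RBF kernel and has no effect when `kernel='linear'`, so the linear candidates that differ only in `gamma` are effectively duplicates.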


### Predict

In [None]:


# Evaluate the tuned Logistic Regression model
y_pred_valid_lr = best_lr.predict(X_valid_bow)
accuracy_lr, precision_lr, recall_lr, f1_lr = evaluate_model(y_valid, y_pred_valid_lr)

# Evaluate the tuned SVC model
y_pred_valid_svc = best_svc.predict(X_valid_bow)
accuracy_svc, precision_svc, recall_svc, f1_svc = evaluate_model(y_valid, y_pred_valid_svc)


data = [
    ["Model", "Accuracy", "Precision", "Recall", "F1-Score"],
    ["Naive Bayes", accuracy_nb, precision_nb, recall_nb, f1_nb],
    ["Logistic Regression", accuracy_lr, precision_lr, recall_lr, f1_lr],
    ["SVC", accuracy_svc, precision_svc, recall_svc, f1_svc],
]

for row in data:
    for element in row:
        print(f"{element:<25}", end="")
    print()


Model                    Accuracy                 Precision                Recall                   F1-Score                 
Naive Bayes              0.8484126984126984       0.8507450299053643       0.8484126984126984       0.8462859089025592       
Logistic Regression      0.8428571428571429       0.8441571417466729       0.8428571428571429       0.8420087846191296       
SVC                      0.8492063492063492       0.8573541176899385       0.8492063492063492       0.8486648813048496       


### Test (Tuned)

In [None]:
# Predict on the test set with the tuned Logistic Regression model
y_pred_test_lr = best_lr.predict(X_test_bow)

# Predict on the test set with the tuned SVC model
y_pred_test_svc = best_svc.predict(X_test_bow)

# Evaluate the test set performance
accuracy_test_lr, precision_test_lr, recall_test_lr, f1_test_lr = evaluate_model(y_test, y_pred_test_lr)
accuracy_test_svc, precision_test_svc, recall_test_svc, f1_test_svc = evaluate_model(y_test, y_pred_test_svc)

# Show the evaluation results on the test data
data_test = [
    ["Model", "Accuracy", "Precision", "Recall", "F1-Score"],
    ["Naive Bayes", accuracy_test_nb, precision_test_nb, recall_test_nb, f1_test_nb],
    ["Logistic Regression", accuracy_test_lr, precision_test_lr, recall_test_lr, f1_test_lr],
    ["SVC", accuracy_test_svc, precision_test_svc, recall_test_svc, f1_test_svc],
]

print("\nEvaluation Results on the Test Set:")
for row in data_test:
    for element in row:
        print(f"{element:<25}", end="")
    print()




Evaluation Results on the Test Set:
Model                    Accuracy                 Precision                Recall                   F1-Score                 
Naive Bayes              0.654                    0.6659749373433583       0.654                    0.6315623931623932       
Logistic Regression      0.742                    0.739890188434048        0.742                    0.7304324783363204       
SVC                      0.708                    0.7314014136778231       0.708                    0.6948004401100275       
