In [28]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
%matplotlib inline

In [29]:
from google.colab import files
uploaded = files.upload()

Saving train_data.csv to train_data (1).csv


## 1. Data Cleansing

In [30]:
df = pd.read_csv('train_data.csv')

df.info()

df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60115 entries, 0 to 60114
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    60115 non-null  int64 
 1   Sentence_id   60115 non-null  object
 2   New_Sentence  59002 non-null  object
 3   Type          60115 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.8+ MB


Unnamed: 0.1,Unnamed: 0,Sentence_id,New_Sentence,Type
0,0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility
1,1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility
2,2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement
3,3,GERSKL27235,"Previous business experience, including but no...",Skill
4,4,HONSSK18415,Delivering fast and right the first time.,SoftSkill
5,5,INDRES9494,Track department expenses and capital budget .,Responsibility
6,6,HONRES51852,Meet performance metrics.,Responsibility
7,7,UAEREQ34683,YOU MUST HAVEBachelors degree in Engineering.,Requirement
8,8,UAEREQ4612,After COVID-19: Ability to travel to manufactu...,Requirement
9,9,INDEDU42721,Six Sigma qualification to at least Green Bel...,Education


### A. Handle missing values

In [31]:
# Drop rows with missing values in the 'New_Sentence' column
df = df.dropna(subset=['New_Sentence'])

In [32]:
df.info()

df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 59002 entries, 0 to 60114
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    59002 non-null  int64 
 1   Sentence_id   59002 non-null  object
 2   New_Sentence  59002 non-null  object
 3   Type          59002 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB


Unnamed: 0.1,Unnamed: 0,Sentence_id,New_Sentence,Type
0,0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility
1,1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility
2,2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement
3,3,GERSKL27235,"Previous business experience, including but no...",Skill
4,4,HONSSK18415,Delivering fast and right the first time.,SoftSkill


The data types, column names, and contents of df look correct, except for the 'Unnamed: 0' column, whose values simply duplicate the row index. It should therefore be dropped so that only relevant variables remain.

In [33]:
df = df.drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,Sentence_id,New_Sentence,Type
0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility
1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility
2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement
3,GERSKL27235,"Previous business experience, including but no...",Skill
4,HONSSK18415,Delivering fast and right the first time.,SoftSkill


Dropping 'Unnamed: 0' leaves a simpler dataset that is easier to analyze. Keeping only relevant features helps model performance, because the model learns only from informative inputs, and having fewer features should also make training and inference faster.
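An 'Unnamed: 0' column usually appears when a DataFrame was saved together with its index and then read back without declaring it. As a minimal sketch (the inline CSV below is made-up sample data, not the real train_data.csv), passing `index_col=0` to `pd.read_csv` reads that column as the index, so it never has to be dropped afterwards:

```python
import io
import pandas as pd

# Made-up two-row sample that mimics the layout of train_data.csv
csv_text = (
    ",Sentence_id,New_Sentence,Type\n"
    "0,GERRES15609,Author and/or Review architecture,Responsibility\n"
    "1,HONRES51852,Meet performance metrics.,Responsibility\n"
)

# Default read: the unnamed first column becomes 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either route ends with the same three columns; the difference is only whether the cleanup happens at load time or in a later `drop` call.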

### B. Handle duplicated data

In [34]:
# Show duplicates across the whole DataFrame
duplicates = df[df.duplicated(keep=False)]

print(duplicates)

duplicate_count = df.duplicated().sum()
print(f"Total number of duplicate rows: {duplicate_count}")

       Sentence_id                                       New_Sentence  \
4495    UAEEXP9379  his is a dummy block of text And this is repre...   
7733   UAESSK14803  his is a dummy block of text And this is repre...   
25566    GERSSK226  his is a dummy block of text And this is repre...   
33606  UAESSK14803  his is a dummy block of text And this is repre...   
41130   UAEEXP9379  his is a dummy block of text And this is repre...   
46749    GERSSK226  his is a dummy block of text And this is repre...   

             Type  
4495   Experience  
7733    SoftSkill  
25566   SoftSkill  
33606   SoftSkill  
41130  Experience  
46749   SoftSkill  
Total number of duplicate rows: 3
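The listing above shows six rows while the reported count is three. This is because `keep=False` marks every member of a duplicate group, whereas the default `keep='first'` leaves the first occurrence unmarked and counts only the extra copies. A small sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence_id": ["A1", "B2", "A1", "C3", "B2"],
    "Type": ["Skill", "Education", "Skill", "Skill", "Education"],
})

# Every row that belongs to a duplicate group
all_members = toy.duplicated(keep=False).sum()

# Only the copies that come after the first occurrence
extra_copies = toy.duplicated().sum()

# Dropping duplicates removes exactly the extra copies
deduplicated = toy.drop_duplicates(keep="first")

print(all_members, extra_copies, len(deduplicated))
```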


In [35]:
df_cleaned = df.drop_duplicates(keep='first')  # Keep the first occurrence

df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58999 entries, 0 to 60114
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Sentence_id   58999 non-null  object
 1   New_Sentence  58999 non-null  object
 2   Type          58999 non-null  object
dtypes: object(3)
memory usage: 1.8+ MB


In [36]:
duplicates = df_cleaned[df_cleaned.duplicated(keep=False)]

print(duplicates)

Empty DataFrame
Columns: [Sentence_id, New_Sentence, Type]
Index: []


### C. Handle outliers

One way to handle outliers in non-numeric data is to inspect how often each category occurs. Categories that appear very rarely (for example, only once or a handful of times in a large dataset) can be treated as outliers.

In [37]:
category_counts = df_cleaned['Type'].value_counts()

print(category_counts)

Type
Responsibility    15257
Requirement       13878
SoftSkill          9414
Experience         9077
Skill              6833
Education          4540
Name: count, dtype: int64


No category occurs with a very low frequency: even the rarest, Education, has 4,540 rows. Outlier handling is therefore not needed.
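The same check can be made explicit with a threshold. The sketch below reuses the counts printed above and flags any category whose share of rows falls under a cutoff; the 1% value of `MIN_SHARE` is a hypothetical choice for illustration, not something the notebook prescribes.

```python
import pandas as pd

# Counts copied from the value_counts() output above
category_counts = pd.Series({
    "Responsibility": 15257, "Requirement": 13878, "SoftSkill": 9414,
    "Experience": 9077, "Skill": 6833, "Education": 4540,
})

MIN_SHARE = 0.01  # hypothetical threshold: under 1% of rows counts as rare

shares = category_counts / category_counts.sum()
rare_categories = shares[shares < MIN_SHARE].index.tolist()

print(shares.round(3))
print("Rare categories:", rare_categories)
```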

### D. Feature transformation

In [38]:
from textblob import TextBlob

df_cleaned['Sentiment'] = df_cleaned['New_Sentence'].apply(lambda x: TextBlob(x).sentiment.polarity)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['Sentiment'] = df_cleaned['New_Sentence'].apply(lambda x: TextBlob(x).sentiment.polarity)
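The SettingWithCopyWarning above appears because pandas cannot tell whether df_cleaned is an independent object or something derived from df. One common way to avoid it, sketched here on made-up rows, is to take an explicit `.copy()` before adding columns. The `Length` column is only a placeholder for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "New_Sentence": ["Meet performance metrics.", None, "Good Excel knowledge ."],
    "Type": ["Responsibility", "Skill", "SoftSkill"],
})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column to it is unambiguous
cleaned = raw.dropna(subset=["New_Sentence"]).copy()
cleaned["Length"] = cleaned["New_Sentence"].str.split().str.len()

print(cleaned)
```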


In [39]:
df_cleaned

Unnamed: 0,Sentence_id,New_Sentence,Type,Sentiment
0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility,0.011667
1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility,0.250000
2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement,0.053333
3,GERSKL27235,"Previous business experience, including but no...",Skill,-0.065476
4,HONSSK18415,Delivering fast and right the first time.,SoftSkill,0.245238
...,...,...,...,...
60110,UAERES18030,"In this position, you will utilize your progr...",Responsibility,0.000000
60111,GERRES3026,"In addition, this individual will be responsib...",Responsibility,0.150000
60112,INDSSK5492,Good problem solving skills.,SoftSkill,0.700000
60113,PHESSK15092,Good Excel knowledge .,SoftSkill,0.700000


*   Sentiment Analysis

Each sentence is converted into a polarity score between -1 (negative) and 1 (positive), with 0 as neutral, to capture the attitude expressed in the sentence.

Running sentiment analysis on the New_Sentence column (which contains job-related descriptions) can help a company to:

1.   Understand how prospective employees may react to job descriptions.
2.   Optimize job descriptions so they are more appealing and inclusive.
3.   Tailor the message to the target audience and improve recruitment quality.
4.   Reduce bias and mismatches in job communication, and improve employee satisfaction and retention.

This analysis provides sentiment-based data that is useful for improving the quality of job descriptions and the overall recruitment strategy.

### E. Feature Encoding

The Type column holds categories that can be converted into numeric features with Label Encoding or One-Hot Encoding.

In [40]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df_cleaned['Type_Encoded'] = encoder.fit_transform(df_cleaned['Type'])

df_cleaned.head(20)


Unnamed: 0,Sentence_id,New_Sentence,Type,Sentiment,Type_Encoded
0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility,0.011667,3
1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility,0.25,3
2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement,0.053333,2
3,GERSKL27235,"Previous business experience, including but no...",Skill,-0.065476,4
4,HONSSK18415,Delivering fast and right the first time.,SoftSkill,0.245238,5
5,INDRES9494,Track department expenses and capital budget .,Responsibility,0.0,3
6,HONRES51852,Meet performance metrics.,Responsibility,0.0,3
7,UAEREQ34683,YOU MUST HAVEBachelors degree in Engineering.,Requirement,0.0,2
8,UAEREQ4612,After COVID-19: Ability to travel to manufactu...,Requirement,0.0,2
9,INDEDU42721,Six Sigma qualification to at least Green Bel...,Education,-0.25,0
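The integer codes in Type_Encoded follow from LabelEncoder sorting the class names alphabetically. A short sketch that rebuilds the mapping from the six class names (listed by hand here rather than read from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# The six class names that appear in the Type column
types = ["Responsibility", "Requirement", "SoftSkill",
         "Experience", "Skill", "Education"]

encoder = LabelEncoder()
encoder.fit(types)

# classes_ is sorted, so the position of each name is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into class names
decoded = list(encoder.inverse_transform([3, 2, 0]))
print(decoded)
```

Because the codes are arbitrary positions rather than a ranking, they suit a target column; for input features, One-Hot Encoding avoids implying an order between categories.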


#### Stopwords
Stopwords are common words (such as "the", "and", "is") that are often removed in text processing because they carry little information for sentiment analysis or classification.

In [41]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # Download the NLTK stopword lists

# English stopword list
stop_words = set(stopwords.words('english'))

# Remove stopwords from a sentence
def remove_stopwords(sentence):
    words = sentence.split()  # Split the sentence into whitespace-separated tokens
    filtered_words = [word for word in words if word.lower() not in stop_words]  # Drop stopwords
    return ' '.join(filtered_words)  # Join the remaining tokens back into a sentence

# Apply to the 'New_Sentence' column
df_cleaned['New_Sentence_No_Stopwords'] = df_cleaned['New_Sentence'].apply(remove_stopwords)

df_cleaned

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Sentence_id,New_Sentence,Type,Sentiment,Type_Encoded,New_Sentence_No_Stopwords
0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility,0.011667,3,Author and/or Review architecture/design techn...
1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility,0.250000,3,"able develop custom dynamic shape, object scri..."
2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement,0.053333,2,Experience working crosslly larger Engineering...
3,GERSKL27235,"Previous business experience, including but no...",Skill,-0.065476,4,"Previous business experience, including limite..."
4,HONSSK18415,Delivering fast and right the first time.,SoftSkill,0.245238,5,Delivering fast right first time.
...,...,...,...,...,...,...
60110,UAERES18030,"In this position, you will utilize your progr...",Responsibility,0.000000,3,"position, utilize program management skills dr..."
60111,GERRES3026,"In addition, this individual will be responsib...",Responsibility,0.150000,3,"addition, individual responsible managing 1 sm..."
60112,INDSSK5492,Good problem solving skills.,SoftSkill,0.700000,5,Good problem solving skills.
60113,PHESSK15092,Good Excel knowledge .,SoftSkill,0.700000,5,Good Excel knowledge .
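Because `remove_stopwords` splits on whitespace only, a stopword with punctuation attached (for example "and,") does not match the list and is kept. The following is a minimal sketch of one way to handle this. It assumes a tiny hand-written stopword set in place of the NLTK list, and `remove_stopwords_punct_aware` is a hypothetical helper name.

```python
import string

# Tiny hand-written stand-in for stopwords.words('english')
STOP_WORDS = {"in", "this", "and", "you", "will", "your", "the", "a", "to"}

def remove_stopwords_punct_aware(sentence):
    """Drop tokens whose punctuation-stripped, lowercased form is a stopword."""
    kept = []
    for token in sentence.split():
        core = token.strip(string.punctuation).lower()
        if core in STOP_WORDS:
            continue
        kept.append(token)
    return " ".join(kept)

result = remove_stopwords_punct_aware(
    "In this position, you will plan and, later, review."
)
print(result)
```

Tokens that survive keep their original punctuation, so the output stays comparable with the New_Sentence_No_Stopwords column.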


#### Sentence_Length
Adds a 'Sentence_Length' feature: the number of words in each sentence after stopword removal, which may be useful as a numeric feature.

In [42]:
df_cleaned['Sentence_Length'] = df_cleaned['New_Sentence_No_Stopwords'].apply(lambda x: len(x.split()))


#### Sentiment_No_Stopwords
Adds the sentiment polarity computed on the 'New_Sentence_No_Stopwords' feature.

In [43]:
from textblob import TextBlob

df_cleaned['Sentiment_No_Stopwords'] = df_cleaned['New_Sentence_No_Stopwords'].apply(lambda x: TextBlob(x).sentiment.polarity)


In [44]:
df_cleaned

Unnamed: 0,Sentence_id,New_Sentence,Type,Sentiment,Type_Encoded,New_Sentence_No_Stopwords,Sentence_Length,Sentiment_No_Stopwords
0,GERRES15609,Author and/or Review architecture/design and o...,Responsibility,0.011667,3,Author and/or Review architecture/design techn...,17,0.080000
1,PHERES15784,Should be able to develop custom dynamic shape...,Responsibility,0.250000,3,"able develop custom dynamic shape, object scri...",10,0.250000
2,GERREQ10457,Experience in working crosslly with a larger ...,Requirement,0.053333,2,Experience working crosslly larger Engineering...,10,0.053333
3,GERSKL27235,"Previous business experience, including but no...",Skill,-0.065476,4,"Previous business experience, including limite...",14,-0.119048
4,HONSSK18415,Delivering fast and right the first time.,SoftSkill,0.245238,5,Delivering fast right first time.,5,0.245238
...,...,...,...,...,...,...,...,...
60110,UAERES18030,"In this position, you will utilize your progr...",Responsibility,0.000000,3,"position, utilize program management skills dr...",13,0.000000
60111,GERRES3026,"In addition, this individual will be responsib...",Responsibility,0.150000,3,"addition, individual responsible managing 1 sm...",34,0.080000
60112,INDSSK5492,Good problem solving skills.,SoftSkill,0.700000,5,Good problem solving skills.,4,0.700000
60113,PHESSK15092,Good Excel knowledge .,SoftSkill,0.700000,5,Good Excel knowledge .,4,0.700000


#### Sentiment Classification
To classify sentiment from the 'Sentiment' and 'Sentiment_No_Stopwords' columns with the Naive Bayes algorithm, we first need to prepare the data and extract features. Here we build two separate Naive Bayes classifiers, one with 'Sentiment' as the target variable and the other with 'Sentiment_No_Stopwords'. For each, the polarity score is binarized (1 if positive, 0 otherwise), the text is converted to TF-IDF features, and a model is trained.

##### Naive Bayes

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Binarize polarity: 1 = positive (> 0), 0 = neutral or negative (<= 0)
df_cleaned['Sentiment_Class'] = df_cleaned['Sentiment'].apply(lambda x: 1 if x > 0 else 0)
df_cleaned['Sentiment_No_Stopwords_Class'] = df_cleaned['Sentiment_No_Stopwords'].apply(lambda x: 1 if x > 0 else 0)

# Features and target for the Sentiment model
X_sentiment = df_cleaned['New_Sentence']
y_sentiment = df_cleaned['Sentiment_Class']

# Split into training and test sets for Sentiment
X_train_sentiment, X_test_sentiment, y_train_sentiment, y_test_sentiment = train_test_split(X_sentiment, y_sentiment, test_size=0.2, random_state=42)

# Extract features with TF-IDF
vectorizer = TfidfVectorizer()
X_train_sentiment_tfidf = vectorizer.fit_transform(X_train_sentiment)
X_test_sentiment_tfidf = vectorizer.transform(X_test_sentiment)

# Train the Naive Bayes model for Sentiment
nb_model_sentiment = MultinomialNB()
nb_model_sentiment.fit(X_train_sentiment_tfidf, y_train_sentiment)

# Predict for Sentiment
y_pred_sentiment = nb_model_sentiment.predict(X_test_sentiment_tfidf)

# Evaluate the Sentiment model
print("Sentiment model accuracy:", accuracy_score(y_test_sentiment, y_pred_sentiment))
print("Classification report for Sentiment:\n", classification_report(y_test_sentiment, y_pred_sentiment))

# Now use Sentiment_No_Stopwords as the target (the input text is still 'New_Sentence')
X_no_stopwords = df_cleaned['New_Sentence']
y_no_stopwords = df_cleaned['Sentiment_No_Stopwords_Class']

# Split into training and test sets for Sentiment_No_Stopwords
X_train_no_stopwords, X_test_no_stopwords, y_train_no_stopwords, y_test_no_stopwords = train_test_split(X_no_stopwords, y_no_stopwords, test_size=0.2, random_state=42)

# Extract features with TF-IDF
X_train_no_stopwords_tfidf = vectorizer.fit_transform(X_train_no_stopwords)
X_test_no_stopwords_tfidf = vectorizer.transform(X_test_no_stopwords)

# Train the Naive Bayes model for Sentiment_No_Stopwords
nb_model_no_stopwords = MultinomialNB()
nb_model_no_stopwords.fit(X_train_no_stopwords_tfidf, y_train_no_stopwords)

# Predict for Sentiment_No_Stopwords
y_pred_no_stopwords = nb_model_no_stopwords.predict(X_test_no_stopwords_tfidf)

# Evaluate the Sentiment_No_Stopwords model
print("Sentiment_No_Stopwords model accuracy:", accuracy_score(y_test_no_stopwords, y_pred_no_stopwords))
print("Classification report for Sentiment_No_Stopwords:\n", classification_report(y_test_no_stopwords, y_pred_no_stopwords))


Sentiment model accuracy: 0.8626271186440678
Classification report for Sentiment:
               precision    recall  f1-score   support

           0       0.87      0.93      0.90      7958
           1       0.84      0.71      0.77      3842

    accuracy                           0.86     11800
   macro avg       0.86      0.82      0.84     11800
weighted avg       0.86      0.86      0.86     11800

Sentiment_No_Stopwords model accuracy: 0.8625423728813559
Classification report for Sentiment_No_Stopwords:
               precision    recall  f1-score   support

           0       0.87      0.94      0.90      8032
           1       0.84      0.70      0.76      3768

    accuracy                           0.86     11800
   macro avg       0.86      0.82      0.83     11800
weighted avg       0.86      0.86      0.86     11800



Conclusions:

1.   Model performance:
Both models are trained on the same original text (stopwords included) and differ only in their labels: one uses the polarity of the original sentence, the other the polarity of the sentence with stopwords removed. Their results are nearly identical, with accuracy of about 86% in both cases.
2.   Weakness on the positive class (1):
Although overall accuracy is good, the models are less sensitive to the positive class: recall for class 1 is 0.71 and 0.70, versus 0.93 and 0.94 for class 0. The models identify most non-positive (neutral or negative) examples but miss roughly 30% of the positive ones.
3.   Effect of stopwords:
Removing stopwords before computing the polarity labels (the Sentiment_No_Stopwords model) makes no meaningful difference to accuracy or overall performance. It may still help in some cases, especially when a model is influenced by frequent words that carry little information.
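Class 0 in the reports above groups neutral and negative sentences together, because the labels are defined as polarity > 0 versus everything else. If separating the two were of interest, a three-way labelling could be sketched as follows (`polarity_to_label` is a hypothetical helper, not part of the notebook):

```python
def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to one of three sentiment labels."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Polarity values taken from the Sentiment column shown earlier
sample_scores = [0.011667, 0.25, -0.065476, 0.0, 0.7, -0.25]
labels = [polarity_to_label(p) for p in sample_scores]
print(labels)
```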

### F. Handle class imbalance

In [None]:
!pip install imbalanced-learn



In [None]:
# Show the class distribution of the 'Type_Encoded' column
distribution = df_cleaned['Type_Encoded'].value_counts()
print(distribution)

Type_Encoded
3    15257
2    13878
5     9414
1     9077
4     6833
0     4540
Name: count, dtype: int64


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE

# Encoded target variable, e.g. "Type_Encoded" or "Sentiment_Class"
X = df_cleaned['New_Sentence']  # Features (text)
y = df_cleaned['Type_Encoded']  # Target (encoded labels)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Extract features with TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Use SMOTE to address class imbalance (applied to the training set only)
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_tfidf, y_train)

# Show class counts after SMOTE
print("\nClass counts after SMOTE:")
print(y_train_res.value_counts())

# Train the Naive Bayes model on the SMOTE-resampled data
nb_model = MultinomialNB()
nb_model.fit(X_train_res, y_train_res)

# Predict on the test data
y_pred = nb_model.predict(X_test_tfidf)

# Evaluate the model
print("Model accuracy after SMOTE:", accuracy_score(y_test, y_pred))
print("Classification report after SMOTE:\n", classification_report(y_test, y_pred))


Class counts after SMOTE:
Type_Encoded
0    12113
1    12113
2    12113
3    12113
5    12113
4    12113
Name: count, dtype: int64
Model accuracy after SMOTE: 0.681864406779661
Classification report after SMOTE:
               precision    recall  f1-score   support

           0       0.65      0.85      0.74       909
           1       0.74      0.83      0.78      1781
           2       0.67      0.32      0.43      2723
           3       0.81      0.86      0.83      3144
           4       0.46      0.60      0.52      1370
           5       0.64      0.74      0.69      1873

    accuracy                           0.68     11800
   macro avg       0.66      0.70      0.67     11800
weighted avg       0.69      0.68      0.67     11800



Comparison before and after SMOTE


*   Before SMOTE, a model may focus on the majority classes and be more accurate on them while neglecting the minority classes.
*   After SMOTE, the class distribution is balanced, but the model may be affected by the synthetic data and struggle to learn the right patterns, which can lower overall performance (accuracy).

Conclusions


*   SMOTE balances the classes, but it can reduce performance if it is not accompanied by suitable model adjustments.
*   Per-class performance is uneven: classes 3 and 1 score best (F1 of 0.83 and 0.78), while classes 2 and 4 are weakest (F1 of 0.43 and 0.52), with recall of only 0.32 for class 2.
*   Overall, accuracy after SMOTE is 68.19%. SMOTE can be a useful step for handling class imbalance, but the model needs further evaluation and may need to be improved with other techniques such as hyperparameter tuning or a more complex model.
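SMOTE creates each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below illustrates that single step with NumPy on made-up 2-D points. It is a simplified illustration, not the imbalanced-learn implementation, and `smote_like_sample` is a hypothetical helper name.

```python
import numpy as np

def smote_like_sample(minority, index, k, rng):
    """Create one synthetic point from minority[index] and one of its k nearest neighbours."""
    point = minority[index]
    distances = np.linalg.norm(minority - point, axis=1)
    # argsort()[0] is the point itself (distance 0), so skip it
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_ids)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return point + lam * (neighbour - point)

rng = np.random.default_rng(42)
# Made-up minority-class points; the last one is far from the others
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]])

synthetic = smote_like_sample(minority, index=0, k=2, rng=rng)
print(synthetic)
```

The synthetic point always lies on the segment between two existing points. In a sparse, high-dimensional TF-IDF space such interpolated vectors may not resemble real sentences, which is one possible reason for the weak classes above.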


### G. LLM innovation for the sentiment problem
For sentiment analysis problems, particularly on data with class imbalance where SMOTE gives less-than-optimal results, several newer approaches based on Large Language Models (LLMs) such as GPT, BERT, or T5 can be applied to improve performance and accuracy.

#### Transfer Learning with a Pre-trained LLM (BERT, RoBERTa, T5)

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import torch

# BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6)  # Number of classes (the six Type labels)

# Dataset wrapper that tokenizes each sentence on access
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = str(self.texts[item])
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[item], dtype=torch.long)
        }

# Split data
X_train, X_test, y_train, y_test = train_test_split(df_cleaned['New_Sentence'], df_cleaned['Type_Encoded'], test_size=0.2, random_state=42)

# DataLoader for training and validation
train_dataset = SentimentDataset(X_train.values, y_train.values, tokenizer, max_len=128)
test_dataset = SentimentDataset(X_test.values, y_test.values, tokenizer, max_len=128)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Fine-tune the model
trainer.train()

# Evaluation
trainer.evaluate()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

KeyboardInterrupt: 

Training was interrupted because it would take a long time, an estimated 55 hours. Even so, the approach has advantages:

*   BERT can handle long-range dependencies within a sentence, which is very useful for sentiment analysis involving long, nuanced context.
*   It may cope with class imbalance better, because the model has been pre-trained on a large amount of data and captures more complex context.
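For a rough sense of where the 55 hours go, the number of optimizer steps follows from the split and batch size used above. The sketch assumes that the 55-hour estimate covers all three epochs (the notebook does not say), that training runs on a single device, and that no gradient accumulation is used.

```python
import math

total_rows = 58999                         # rows in df_cleaned
test_rows = math.ceil(total_rows * 0.2)    # matches the support of 11800 above
train_rows = total_rows - test_rows
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

estimated_hours = 55                       # assumption: estimate covers all epochs
seconds_per_step = estimated_hours * 3600 / total_steps

print(train_rows, steps_per_epoch, total_steps, round(seconds_per_step, 1))
```

Under these assumptions each step takes a little over 22 seconds, which would be consistent with CPU-only execution; a GPU runtime, fewer epochs, or a shorter `max_len` would likely reduce the total.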

## 2. Feature Engineering

### A. Feature Selection

Feature selection was already carried out in the "Handle missing values" section, where we dropped the 'Unnamed: 0' column so that duplicate rows could be detected and removed.

### B. Feature Extraction

Already covered in the Feature Encoding and Feature Transformation sections.

### C. List at least 4 additional features

Additional features that could be created include:

1.   Bag of Words (BoW)
Bag of Words is a feature extraction method that turns text into a numeric representation based on word frequencies. Each word in the vocabulary becomes a feature, and its value is the number of times the word occurs in the sentence.
2.   Word Embeddings (Word2Vec, GloVe, FastText)
Word embeddings map words to dense numeric vectors that carry information about each word's meaning in context.
3.   Part-of-Speech (POS) Tagging
POS tagging labels each word in a sentence with its word class (noun, verb, adjective, and so on), which can help in analyzing sentence context and structure.
4.   N-grams
Using n-grams (for example, bigrams or trigrams) can add context about the relationships between neighbouring words in a sentence.
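Two of the ideas in the list, Bag of Words and n-grams, can be sketched with the standard library alone. The example uses a made-up sentence and a deliberately simple whitespace tokenizer; `tokenize` and `ngrams` are hypothetical helper names.

```python
from collections import Counter

def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return sentence.lower().split()

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Good problem solving skills and good communication skills"
tokens = tokenize(sentence)

bag_of_words = Counter(tokens)               # word -> frequency
bigram_counts = Counter(ngrams(tokens, 2))   # (word, next word) -> frequency

print(bag_of_words)
print(bigram_counts)
```

On the full dataset, scikit-learn's `CountVectorizer` with `ngram_range=(1, 2)` produces the same kind of counts as a sparse matrix.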

