# Task - PREPARE DATASET
Mount Google Drive, load the CSV file "/content/drive/MyDrive/Colab Notebooks/sentiment-analysis-training-data.csv" into a pandas DataFrame, and display the first few rows of the DataFrame.

## Mount Google Drive

### Subtask:
Connect Google Drive to the current Colab session.


**Reasoning**:
Mount Google Drive to access the file.



In [None]:
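# Optional: only needed when the CSV is read from Google Drive.
# The loading cell reads the dataset from a GitHub release URL, so the mount stays commented out.
# from google.colab import drive
# drive.mount('/content/drive')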

## Load CSV file

### Subtask:
Load the CSV file from the specified path into a pandas DataFrame.



**Reasoning**:
Import pandas and load the CSV file into a DataFrame.



In [None]:
import pandas as pd
try:
    df = pd.read_csv('https://github.com/erlanggadewasakti/Prinsip-Sains-Data/releases/download/prod/sa-psd-dataset.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    # Only raised for local paths; a failed download falls through to the generic handler below
    print("Error: dataset file not found. Make sure the file path is correct.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

# Task - EDA
Perform a comprehensive sentiment analysis EDA on the "ecinstruct-sentiment-analysis-samples.csv" dataset, focusing on visualizations and insights. The analysis should include data loading and inspection, data preprocessing (extracting sentiment labels and cleaning input text), target distribution analysis (sentiment countplot), text statistics analysis (character length, word count, average word length with histograms and boxplots), content analysis (N-grams with bar plots), visual analysis (word clouds for all text and per sentiment category), and a summary of key insights. Use the required libraries: pandas, matplotlib.pyplot, seaborn, nltk, and wordcloud.

## Load & inspect data

### Subtask:
Load the data from the CSV file and display basic information such as data types, the first few rows, and the number of missing values.


**Reasoning**:
Display the first few rows, summary information, and missing values of the DataFrame.



In [None]:
display(df.head())
df.info()
print("\nMissing values per column:")
print(df.isnull().sum())

## Data preprocessing

### Subtask:
Create new columns for the clean sentiment label and the cleaned input text (lowercase, no punctuation, no digits).


**Reasoning**:
Extract sentiment labels and clean the input text, then display the first few rows of the DataFrame with the new columns.



In [None]:
import re

# Extract sentiment labels
df['sentiment'] = df['output'].str.replace(r'^[A-E]:\s*', '', regex=True)

# Clean the input text
df['cleaned_input'] = df['input'].str.lower()
# df['cleaned_input'] = df['cleaned_input'].str.replace(r'[^\w\s]', '', regex=True) # Remove punctuation
# df['cleaned_input'] = df['cleaned_input'].str.replace(r'\d+', '', regex=True) # Remove numbers

df['sentiment'] = df['sentiment'].map({'very positive' : 'positive', 'very negative' : 'negative','positive':'positive','negative':'negative','neutral':'neutral'})

# Display the first few rows with new columns
display(df[['output', 'sentiment', 'input', 'cleaned_input']].head())
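
The label extraction above can be tried on a few strings without the DataFrame. The sketch below assumes raw `output` values look like `A: very positive` (a format inferred from the regex, not confirmed by the data) and uses a hypothetical helper, `extract_sentiment`. A label outside the mapping comes back as `None`, mirroring how `Series.map` yields `NaN` for unmapped values.

```python
import re

# Assumed raw format: an option letter A-E, a colon, then the label text.
OPTION_PREFIX = re.compile(r'^[A-E]:\s*')

# Same five-to-three collapse as the mapping used in the cell above.
COARSE_LABELS = {
    'very positive': 'positive',
    'positive': 'positive',
    'neutral': 'neutral',
    'negative': 'negative',
    'very negative': 'negative',
}

def extract_sentiment(raw_output):
    """Strip the option prefix and collapse to a 3-class label (None if unknown)."""
    fine_label = OPTION_PREFIX.sub('', raw_output).strip()
    return COARSE_LABELS.get(fine_label)

# 'F: mixed' is a made-up value that shows the unmapped case.
for raw in ['A: very positive', 'C: neutral', 'E: very negative', 'F: mixed']:
    print(f'{raw!r} -> {extract_sentiment(raw)!r}')
```

Checking `df['sentiment'].isna().sum()` after the mapping is a quick way to see whether any raw label fell outside the five expected values.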

## Sentiment distribution analysis

### Subtask:
Visualize the sentiment distribution with a bar chart and report any class imbalance.


**Reasoning**:
Import necessary libraries and create a countplot to visualize the distribution of sentiment labels.



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sentiment_colors = {'positive': 'green', 'negative': 'orange', 'neutral': 'blue'}


plt.figure(figsize=(8, 6))
sns.countplot(x='sentiment', data=df, hue='sentiment', palette=sentiment_colors, legend=False)
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
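
The subtask also asks for the imbalance to be reported, which the plot alone does not quantify. A minimal sketch of one way to do that, using a hypothetical helper `class_balance` and made-up toy labels; on the real data the labels would come from `df['sentiment']`.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts, per-class shares, and the largest/smallest class ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return counts, shares, imbalance_ratio

# Toy labels for illustration only, not the dataset's real distribution.
toy_labels = ['positive'] * 8 + ['negative'] * 3 + ['neutral'] * 1
counts, shares, ratio = class_balance(toy_labels)

for label, count in counts.most_common():
    print(f'{label:<9} {count:>3}  {shares[label]:.1%}')
print(f'Imbalance ratio (largest / smallest class): {ratio:.1f}')
```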

## Text statistics analysis

### Subtask:
Compute the character length, word count, and average word length of each review. Visualize the distribution of these text statistics with histograms and boxplots by sentiment.


**Reasoning**:
Calculate text statistics (character length, word count, and average word length) and store them in new columns.



In [None]:
df['char_length'] = df['cleaned_input'].str.len()
df['word_count'] = df['cleaned_input'].str.split().str.len()
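# char_length includes whitespace, so this ratio slightly overestimates the true mean word length.
# Masking zero word counts avoids inf for whitespace-only strings.
df['avg_word_length'] = df['char_length'] / df['word_count'].where(df['word_count'] > 0)
df['avg_word_length'] = df['avg_word_length'].fillna(0) # Rows with no words get 0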

display(df[['cleaned_input', 'char_length', 'word_count', 'avg_word_length']].head())
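
Because `char_length` counts spaces, `char_length / word_count` is an approximation of average word length rather than an exact one. The sketch below shows the difference on a single string; `text_stats` is a hypothetical helper, not part of the notebook.

```python
def text_stats(text):
    """Character length, word count, and mean word length with whitespace excluded."""
    words = text.split()
    char_length = len(text)
    word_count = len(words)
    chars_in_words = sum(len(word) for word in words)
    avg_word_length = chars_in_words / word_count if word_count else 0.0
    return {
        'char_length': char_length,
        'word_count': word_count,
        'avg_word_length': avg_word_length,
    }

stats = text_stats('good product')
print(stats)
# The notebook's ratio for the same text also counts the space:
print(stats['char_length'] / stats['word_count'])
```

For `'good product'` the exact mean is 5.5 characters per word, while the ratio including the space gives 6.0. The gap is small and consistent across rows, so the approximation is usually acceptable for comparing sentiments.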

**Reasoning**:
Visualize the distribution of text statistics (character length, word count, and average word length) using histograms for each sentiment category.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms for text statistics by sentiment
text_stats = ['char_length', 'word_count', 'avg_word_length']
for stat in text_stats:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x=stat, hue='sentiment', kde=True, multiple='stack', palette=sentiment_colors)
    plt.title(f'Distribution of {stat} by Sentiment')
    plt.xlabel(stat)
    plt.ylabel('Frequency')
    plt.show()

**Reasoning**:
Visualize the distribution of text statistics (character length, word count, and average word length) using boxplots for each sentiment category.



In [None]:
# Boxplots for text statistics by sentiment
text_stats = ['char_length', 'word_count', 'avg_word_length']
for stat in text_stats:
    plt.figure(figsize=(10, 6))
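    # Pass hue explicitly: recent seaborn versions deprecate palette without hue
    sns.boxplot(data=df, x='sentiment', y=stat, hue='sentiment', palette=sentiment_colors, legend=False)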
    plt.title(f'Distribution of {stat} by Sentiment')
    plt.xlabel('Sentiment')
    plt.ylabel(stat)
    plt.show()

## Content analysis (n-grams)

### Subtask:
Remove stopwords and compute unigram, bigram, and trigram frequencies from the cleaned text. Visualize the top 20 n-grams for each category.


**Reasoning**:
Import necessary libraries and download stopwords if not already downloaded.



In [None]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns

try:
    stopwords_english = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    stopwords_english = stopwords.words('english')

print("Libraries imported and stopwords downloaded.")

**Reasoning**:
Iterate through each sentiment category and N-gram range, calculate the top N-grams, and plot the results.



In [None]:
ngram_ranges = [(1, 1), (2, 2), (3, 3)]
sentiments = df['sentiment'].unique()


for sentiment in sentiments:
    print(f"Analyzing sentiment: {sentiment}")
    sentiment_df = df[df['sentiment'] == sentiment]
    cleaned_text = sentiment_df['cleaned_input'].dropna() # Drop NaN values

    if cleaned_text.empty:
        print(f"No cleaned text available for sentiment: {sentiment}")
        continue

    current_sentiment_color = sentiment_colors.get(sentiment, 'gray') # Default to gray if sentiment not found

    for n_range in ngram_ranges:
        print(f"  Analyzing {n_range}-grams")
        vectorizer = CountVectorizer(ngram_range=n_range, stop_words=stopwords_english)
        try:
            X = vectorizer.fit_transform(cleaned_text)
        except ValueError as e:
            print(f"  Could not fit vectorizer for {n_range}-grams and sentiment {sentiment}: {e}")
            continue

        sum_words = X.sum(axis=0)
        words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
        top_ngrams = words_freq[:20]

        if not top_ngrams:
            print(f"  No {n_range}-grams found for sentiment: {sentiment}")
            continue

        top_ngrams_df = pd.DataFrame(top_ngrams, columns=['ngram', 'count'])

        plt.figure(figsize=(10, 6))
        sns.barplot(x='count', y='ngram', data=top_ngrams_df, color=current_sentiment_color)
        plt.title(f'Top 20 {n_range}-grams for Sentiment: {sentiment}')
        plt.xlabel('Count')
        plt.ylabel(f'{n_range}-gram')
        plt.tight_layout()
        plt.show()

print("N-gram analysis complete.")
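
To make the counting step concrete, here is a dependency-free sketch of what the loop above computes. It is a simplification: tokens come from a plain whitespace split rather than `CountVectorizer`'s token pattern, and `top_ngrams` plus the toy texts are illustrative only. As in `CountVectorizer`, stopwords are dropped before n-grams are formed, so a bigram can join two words that were separated by a stopword in the original text.

```python
from collections import Counter

def top_ngrams(texts, n, stop_words, k=20):
    """Count word n-grams after stopword removal and return the k most frequent."""
    counts = Counter()
    for text in texts:
        tokens = [tok for tok in text.lower().split() if tok not in stop_words]
        counts.update(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Toy reviews and a toy stopword set, for illustration only.
toy_texts = ['the battery life is great', 'great battery life', 'the battery died']
toy_stop_words = {'the', 'is'}

print(top_ngrams(toy_texts, 1, toy_stop_words, k=3))
print(top_ngrams(toy_texts, 2, toy_stop_words, k=3))
```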

## Visual analysis (word clouds)

### Subtask:
Create a word cloud from all of the cleaned text and a separate word cloud for each sentiment category.


**Reasoning**:
Generate and display word clouds for all cleaned text and for each sentiment category.



In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
try:
    stopwords_english = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    stopwords_english = stopwords.words('english')


# Word cloud for all cleaned text
all_text = ' '.join(df['cleaned_input'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords_english).generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of All Text (excluding stopwords)')
plt.show()

# Identify overall most common words (excluding stopwords)
all_words = all_text.split()
all_words = [word for word in all_words if word not in stopwords_english]
overall_word_counts = Counter(all_words)
# Get the 50 most common words across all sentiments
most_common_overall = set([word for word, count in overall_word_counts.most_common(50)])


# Word clouds per sentiment category, excluding overall most common words
sentiments = df['sentiment'].unique()

for sentiment in sentiments:
    sentiment_text = ' '.join(df[df['sentiment'] == sentiment]['cleaned_input'].dropna())
    if sentiment_text:
        # Remove overall most common words from sentiment-specific text
        sentiment_words = sentiment_text.split()
        sentiment_words_filtered = [word for word in sentiment_words if word not in most_common_overall and word not in stopwords_english]
        filtered_sentiment_text = ' '.join(sentiment_words_filtered)

        if filtered_sentiment_text:
            wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords_english).generate(filtered_sentiment_text)
            plt.figure(figsize=(10, 5))
            plt.imshow(wordcloud, interpolation='bilinear')
            plt.axis('off')
            plt.title(f'Word Cloud for Sentiment: {sentiment} (excluding overall common words)')
            plt.show()
        else:
             print(f"No significant words remaining for sentiment: {sentiment} after filtering.")

    else:
        print(f"No cleaned text available for sentiment: {sentiment}")

## Summary

### Subtask:
Present a bullet-point summary of the key insights from the analysis and visualizations.


## Summary:

### Data Analysis Key Findings

*   The dataset contains 30,000 entries with no missing values in the `output`, `input`, `task`, and `split` columns, but 2,000 missing values in the `options` column.
*   Sentiment labels were successfully extracted and the input text was converted to lowercase. The punctuation and digit removal steps are present in the code but commented out, so they were not applied at this stage.
*   The sentiment distribution was visualized using a countplot, indicating the proportion of 'negative', 'neutral', and 'positive' sentiments.
*   Text statistics including character length, word count, and average word length were calculated and visualized using histograms and boxplots for each sentiment category.
*   N-gram analysis (unigrams, bigrams, and trigrams) was performed after removing stopwords, and the top 20 n-grams for each sentiment were visualized with bar plots.
*   Word clouds were generated for the entire cleaned text and separately for each sentiment category to visualize the most frequent words.

### Insights or Next Steps

*   Investigate the imbalance in sentiment distribution to understand its potential impact on subsequent modeling and consider techniques like resampling if necessary.
*   Explore the differences in text statistics (length, word count, average word length) across sentiments to see if they can serve as useful features for sentiment classification.


# Task - PREPROCESSING DATASET
Task: Perform further text cleaning on the 'cleaned_input' column by removing URLs, HTML tags, special characters, and normalizing whitespace. Then, display several examples of the cleaned text. Next, encode categorical sentiment labels ('positive', 'negative', 'neutral') into numerical representations and display the mapping and the updated DataFrame. After that, explain the importance of tokenization and context window for LLMs and their impact on data preprocessing, providing guidance on analyzing text length in tokens if necessary. Finally, split the DataFrame into training and testing sets for LLM model preparation and display the shapes of each set.

## Advanced Text Cleaning for LLM Compatibility

### Subtask:
Perform further text cleaning on the 'cleaned_input' column by removing URLs, HTML tags, and special characters, and by normalizing whitespace. Display several examples of the cleaned text.


**Reasoning**:
Define a Python function to perform advanced text cleaning, including removing URLs, HTML tags, special characters, and normalizing whitespace. This function will be applied to the 'cleaned_input' column.



In [None]:
import re

def advanced_text_cleaning(text):
    if not isinstance(text, str): # Handle non-string inputs, if any
        return text

    # 1. Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # 2. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # 3. Remove special characters (keep letters, numbers, and spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # 4. Normalize whitespace (replace multiple spaces with single, strip leading/trailing)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to the 'cleaned_input' column
df['further_cleaned_input'] = df['cleaned_input'].apply(advanced_text_cleaning)

# Display the first few rows with the relevant columns to verify
display(df[['input', 'cleaned_input', 'further_cleaned_input']].head())
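
It helps to run the cleaning steps on one string to see their side effects. The sketch below is a condensed copy of the function above (without the non-string guard), applied to a made-up review. Two effects are worth noticing: apostrophes disappear, so `doesn't` becomes `doesnt`, and a tag with no surrounding spaces glues its neighbours together.

```python
import re

def advanced_text_cleaning(text):
    """Condensed version of the notebook's cleaning steps, in the same order."""
    text = re.sub(r'http\S+|www\S+', '', text)   # 1. URLs
    text = re.sub(r'<.*?>', '', text)            # 2. HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # 3. anything but letters, digits, whitespace
    return re.sub(r'\s+', ' ', text).strip()     # 4. collapse and trim whitespace

# Made-up example text, for illustration only.
sample = "love it! <br />see https://example.com/review   it doesn't overheat :)"
print(advanced_text_cleaning(sample))       # love it see it doesnt overheat
print(advanced_text_cleaning('good<br/>bad'))  # goodbad
```

If glued words turn out to matter, replacing tags with a space instead of an empty string is one possible adjustment, since the final whitespace pass collapses any extra spaces.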

## Sentiment Label Encoding

### Subtask:
Convert the categorical sentiment labels ('positive', 'negative', 'neutral') into numerical representations. Display the mapping and the updated DataFrame.


**Reasoning**:
Create a dictionary to map sentiment labels to numerical values, apply this mapping to create a new 'sentiment_encoded' column, and then display the mapping and the relevant columns of the updated DataFrame.



In [None]:
sentiment_mapping = {'positive': 0, 'neutral': 1, 'negative': 2}
df['sentiment_encoded'] = df['sentiment'].map(sentiment_mapping)

print("Sentiment Mapping:")
print(sentiment_mapping)

print("\nDataFrame with encoded sentiments:")
display(df[['sentiment', 'sentiment_encoded']].head())
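
Model predictions come back as class ids, so the inverse of `sentiment_mapping` is needed to read them. A minimal sketch, with `id_to_sentiment` and `decode_predictions` as hypothetical names and a made-up list of predicted ids:

```python
sentiment_mapping = {'positive': 0, 'neutral': 1, 'negative': 2}

# Invert the mapping so class ids can be turned back into labels.
id_to_sentiment = {idx: label for label, idx in sentiment_mapping.items()}

def decode_predictions(class_ids):
    """Map an iterable of integer class ids back to sentiment labels."""
    return [id_to_sentiment[int(class_id)] for class_id in class_ids]

# Classification heads expect contiguous ids starting at 0.
assert sorted(id_to_sentiment) == list(range(len(sentiment_mapping)))

print(decode_predictions([2, 0, 1, 0]))  # ['negative', 'positive', 'neutral', 'positive']
```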

## Token-Level Text Length Analysis (Conceptual)

### Subtask:
Explain the importance of tokenization and the context window for LLMs, and how they affect data preprocessing. If necessary, provide guidance on analyzing text length in tokens.

### Explanation:

#### 1. Tokenization in the Context of Large Language Models (LLMs)

**Tokenization** is the process of converting a raw character sequence into smaller units called **tokens**. These units can be words, subwords (word pieces), or even single characters, depending on the tokenization algorithm (for example, Byte Pair Encoding (BPE), WordPiece, or SentencePiece). For LLMs, tokenization is a fundamental step because the model cannot process raw text directly; it works with numerical representations of tokens. Each token is mapped to a unique numeric ID, which is then converted into a dense vector (embedding) that the model can process.

**Why tokenization matters:**
*   **Data representation:** It converts text into a numerical format the model can process. Without tokenization, an LLM has no structured input to work with.
*   **Handling rare or new words (out-of-vocabulary):** Subword methods such as BPE or WordPiece let the model break unknown words into smaller, known subwords, which reduces the out-of-vocabulary problem and supports better generalization.
*   **Computational efficiency:** It keeps the vocabulary small and the representation compact, which in turn speeds up training and inference.

#### 2. The 'Context Window' of an LLM

The **context window** (or *context length*) is the maximum number of tokens an LLM can process or "see" at one time. It is a fundamental constraint of the transformer architecture that underlies most modern LLMs. When a model processes text it sees a sequence of tokens, and the length of that sequence is bounded by its context window.

**Limits and their implications:**
*   **Memory and compute limits:** Self-attention in a transformer has computational complexity that is quadratic in the input sequence length. As the context window grows, memory and compute requirements therefore grow quadratically: doubling the sequence length roughly quadruples the attention cost. This is why LLMs are designed with a specific context window limit (for example, 4K, 8K, 16K, 32K, or 128K tokens).
*   **Information truncation:** If the input text exceeds the context window, it is truncated, and any important information beyond the limit is lost. This is a serious problem for tasks that need long-range context, such as summarizing long documents or answering questions over extensive text.
*   **Response quality:** An LLM can only make decisions or generate output from the information inside its context window. If crucial context is cut off, response quality can drop sharply.
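
The quadratic scaling can be made concrete with a little arithmetic. The sketch below counts the entries of a single attention score matrix, which has one entry per pair of tokens. It ignores heads, layers, and memory-saving attention variants, so it shows the growth rate only and is not a cost model for any particular LLM.

```python
def attention_score_entries(seq_len):
    """Entries in one self-attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len

base = attention_score_entries(512)
for seq_len in (512, 1024, 2048, 4096):
    entries = attention_score_entries(seq_len)
    print(f'{seq_len:>5} tokens -> {entries:>10,} entries ({entries // base}x the 512-token cost)')
```

Each doubling of the sequence length multiplies the entry count by four, so going from 512 to 4,096 tokens costs 64 times as much for this matrix.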

#### 3. How Tokenization and the Context Window Affect Data Preprocessing

Understanding tokenization and the context window is crucial when preprocessing data for LLMs:

*   **Preventing information truncation:** Before sending text to an LLM, we must make sure it does not exceed the model's context window. If the text is too long, we need a strategy such as summarization, extraction of the relevant parts, or splitting into several smaller segments.
*   **Efficiency and cost:** Sending overly long text (even when it is not truncated) increases compute cost (especially with paid APIs) and inference time. Unnecessary tokens, such as filler words or duplicated content, should be identified and removed.
*   **Customizing tokenization:** In some cases we may need to adjust the tokenization process (for example, by adding special tokens) or choose the tokenizer that best fits the data and the LLM in use.
*   **Text length analysis:** Measuring text length in tokens (not just characters or words) gives an accurate picture of how well the data will fit within the LLM's context window.
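
Of the strategies listed above, splitting into segments is the simplest to sketch. The example below cuts a list of token ids into fixed-size windows with an optional overlap, so that text near a boundary appears in both neighbouring windows. `split_into_windows` is a hypothetical helper, and the integer ids are stand-ins for real tokenizer output.

```python
def split_into_windows(token_ids, window_size, overlap=0):
    """Split token ids into windows of at most window_size, sharing `overlap` ids between neighbours."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError('window_size must be positive and overlap must be in [0, window_size)')
    step = window_size - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # the last window already reaches the end of the sequence
    return windows

# Ten stand-in token ids, windows of 4 with 1 shared id between neighbours.
print(split_into_windows(list(range(10)), window_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

For classification, each window would then be scored separately and the per-window predictions combined, for example by averaging. That aggregation step is a design choice outside the scope of this notebook.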

#### 4. Guide to Analyzing Text Length in Tokens

To analyze text length in tokens, you normally use the same tokenizer as your target LLM. Libraries such as Hugging Face Transformers provide tokenizers for many models. The general steps are:

1.  **Choose the right tokenizer:** Make sure you use the tokenizer that matches the LLM you intend to use (for example, `AutoTokenizer.from_pretrained('bert-base-uncased')` for BERT).
2.  **Load the tokenizer:**
    ```python
    from transformers import AutoTokenizer

    # Example for the bert-base-uncased model
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    ```
3.  **Tokenize the text:** Use the tokenizer to encode the text. It returns a dictionary containing the token IDs and related information.
    ```python
    text_example = "This is an example text whose token length will be analyzed."
    encoded_input = tokenizer(text_example, return_tensors='pt')
    print(encoded_input)
    # Prints a dictionary of tensors with the keys 'input_ids', 'token_type_ids', and 'attention_mask'
    ```
4.  **Count the tokens:** The number of tokens is the length of `input_ids`.
    ```python
    num_tokens = len(encoded_input['input_ids'][0])
    print(f"Number of tokens: {num_tokens}")
    # The count includes the special [CLS] and [SEP] tokens added by the tokenizer
    ```
5.  **Apply it to the DataFrame:** You can create a new DataFrame column that stores the token count of each text entry.
    ```python
    # Example applied to the 'further_cleaned_input' column
    df['token_length'] = df['further_cleaned_input'].apply(lambda x: len(tokenizer(str(x), truncation=False)['input_ids']))

    # Show the distribution of token lengths
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(10, 6))
    sns.histplot(df['token_length'], bins=50, kde=True)
    plt.title('Token Length Distribution')
    plt.xlabel('Number of Tokens')
    plt.ylabel('Frequency')
    plt.show()

    print(f"Maximum token length: {df['token_length'].max()}")
    print(f"Average token length: {df['token_length'].mean():.2f}")
    print(f"Entries exceeding 512 tokens (example limit): {(df['token_length'] > 512).sum()}")
    ```

By analyzing the distribution of `token_length`, you can identify how many entries are likely to exceed the context window of your target LLM and plan how to handle them.

#### 5. Why Optimizing Input Text Length Matters

Optimizing the length of the input text for an LLM is important for:

*   **Preventing information loss:** Ensuring that all relevant context can be captured by the model.
*   **Resource efficiency:** Reducing compute cost and processing time, especially when working with large-scale data or paid APIs.
*   **Better quality and accuracy:** A model gives better results when its input is well optimized and fits within its design limits.
*   **Consistent performance:** Ensuring the model performs consistently without being affected by extreme variation in input length.

## Data Splitting for Model Training

### Subtask:
Split the DataFrame into training and test sets in preparation for LLM training. Display the shape of each set.


## Summary:

### Data Analysis Key Findings

*   Text data in the `cleaned_input` column underwent further cleaning by removing URLs, HTML tags, special characters (keeping only alphanumeric and spaces), and normalizing whitespace. The results were stored in a new column, `further_cleaned_input`, demonstrating successful removal of special characters and normalization of spaces.
*   Categorical sentiment labels were successfully encoded into numerical representations: 'positive' was mapped to `0`, 'neutral' to `1`, and 'negative' to `2`. A new column, `sentiment_encoded`, was added to the DataFrame reflecting this mapping.
*   A comprehensive conceptual explanation was provided detailing the importance of **tokenization** for converting raw text into a numerical format understandable by LLMs, handling out-of-vocabulary words, and improving computational efficiency.
*   The concept of **context window** was explained as the maximum number of tokens an LLM can process, highlighting its computational and memory limitations, and the risk of information truncation if input text exceeds this limit.
*   Guidance was provided on how these concepts impact data preprocessing, emphasizing the need to prevent information loss, optimize resource use, and analyze text length in tokens using specific tokenizers (e.g., from Hugging Face Transformers) to understand how data fits within an LLM's context window.

### Insights or Next Steps

*   The `further_cleaned_input` column is now ready for tokenization and length analysis, as it has been stripped of common noise that could interfere with LLM processing or inflate token counts.
*   With sentiment labels numerically encoded, the dataset is prepared for model training where sentiment can be used as a target variable for classification tasks.


# Task - SPLIT DATA and OVERSAMPLING
Split the DataFrame into training (70%), validation (15%), and test (15%) sets using `train_test_split`, stratified on the 'sentiment_encoded' column. Display the shape of each set and the sentiment distribution in the training set. Then oversample the training set by duplication to balance the sentiment classes, and display the sentiment distribution and shape of the training set after oversampling.

## Training, Validation, and Test Split

### Subtask:
Split the DataFrame into training (70%), validation (15%), and test (15%) sets using `train_test_split` with stratification to preserve the sentiment distribution. Display the shape of each set and the initial sentiment distribution in the training set.


**Reasoning**:
First, import the necessary function `train_test_split` from `sklearn.model_selection` to perform the data splitting.



In [None]:
from sklearn.model_selection import train_test_split

# Split df into training (70%) and a temporary set (30%)
df_train, df_temp = train_test_split(df, test_size=0.3, random_state=42, stratify=df['sentiment_encoded'])

# Split the temporary set into validation (15%) and test (15%)
# test_size=0.5 because 0.5 * 30% = 15% of the original dataset
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42, stratify=df_temp['sentiment_encoded'])

print("Shape of training set:", df_train.shape)
print("Shape of validation set:", df_val.shape)
print("Shape of test set:", df_test.shape)

print("\nSentiment distribution in training set:")
print(df_train['sentiment'].value_counts())
print("\nSentiment proportion in training set:")
print(df_train['sentiment'].value_counts(normalize=True))

## Oversampling the Training Set

### Subtask:
Oversample the training set by duplication (resampling) to balance the sentiment classes. It is important not to oversample the validation or test sets, to prevent data leakage. Display the sentiment distribution and shape of the training set after oversampling.

**Reasoning**:
To address the class imbalance, I will perform oversampling on the training set (`df_train`). This involves identifying the majority class and duplicating samples from the minority classes (`negative` and `neutral`) until their counts match the majority class. This method ensures that the model is not biased towards the majority class during training. The validation and test sets will remain untouched to avoid data leakage and ensure realistic performance evaluation.

In [None]:
# Identify the majority class count
majority_count = df_train['sentiment'].value_counts().max()

# Collect one frame per class and concatenate once at the end.
# Starting from an empty DataFrame and concatenating in a loop can degrade column dtypes to object.
resampled_parts = []
for sentiment_label in df_train['sentiment'].unique():
    sentiment_df = df_train[df_train['sentiment'] == sentiment_label]
    if len(sentiment_df) < majority_count:
        # Draw with replacement until the class matches majority_count
        sentiment_df = sentiment_df.sample(majority_count, replace=True, random_state=42)
    resampled_parts.append(sentiment_df)

df_train_oversampled = pd.concat(resampled_parts)

print("Shape of oversampled training set:", df_train_oversampled.shape)
print("\nSentiment distribution in oversampled training set:")
print(df_train_oversampled['sentiment'].value_counts())
print("\nSentiment proportion in oversampled training set:")
print(df_train_oversampled['sentiment'].value_counts(normalize=True))
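
One detail of the cell above: `sample(majority_count, replace=True)` draws every minority row with replacement, so some original rows may not appear in the result at all. A variant that keeps each original row and only adds random duplicates is sketched below in plain Python. This is an alternative to consider rather than the notebook's method, and `oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter

def oversample(rows, label_of, seed=42):
    """Keep every original row and add random duplicates until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)                                       # all originals
        balanced.extend(rng.choices(group, k=target - len(group)))   # duplicates only
    return balanced

# Toy (text, label) rows for illustration only.
toy_rows = (
    [(f'pos {i}', 'positive') for i in range(6)]
    + [(f'neg {i}', 'negative') for i in range(2)]
    + [('neu 0', 'neutral')]
)
balanced_rows = oversample(toy_rows, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced_rows))
```

Either way, the balanced set has the majority count in every class, so its size is the number of classes times the majority count.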

## Summary:

### Q&A
*   **What are the shapes of the training, validation, and test sets?**
    *   Training set: (21000, 12)
    *   Validation set: (4500, 12)
    *   Test set: (4500, 12)
*   **What is the initial sentiment distribution in the training set before oversampling?**
    *   Positive: 16530 (78.71%)
    *   Negative: 2877 (13.70%)
    *   Neutral: 1593 (7.59%)

### Data Analysis Key Findings
*   The dataset was successfully split into training (70%), validation (15%), and test (15%) sets, resulting in 21,000 samples for training, 4,500 for validation, and 4,500 for testing.
*   The `train_test_split` operation successfully maintained the original sentiment distribution through stratification.
*   Before oversampling, the training set exhibits a significant class imbalance: 'Positive' sentiment accounts for 78.71% of the data, while 'Negative' and 'Neutral' sentiments are 13.70% and 7.59% respectively.

### Insights or Next Steps
*   The current class imbalance in the training set (e.g., Positive: 78.71%, Neutral: 7.59%) could lead to models biased towards the majority class.
*   The next step involves performing oversampling on the training set to address this imbalance, as indicated in the task, to ensure all sentiment classes are adequately represented for model training.


# Task - TRAINING MODEL
**Tokenization and Data Preparation for the BERT Model**: Tokenize the 'further_cleaned_input' column using the BERT tokenizer. Analyze the distribution of token lengths to determine an optimal and efficient 'max_length'. Convert the tokenized data and sentiment labels into PyTorch TensorDataset and DataLoader objects for the training, validation, and test sets.

## Tokenization and Data Preparation for the BERT Model

### Subtask:
Tokenize the 'further_cleaned_input' column using the BERT tokenizer. Analyze the distribution of token lengths to determine an optimal and efficient 'max_length'. Convert the tokenized data and sentiment labels into PyTorch TensorDataset and DataLoader objects for the training, validation, and test sets.


**Reasoning**:
First, I need to import the `AutoTokenizer` from the `transformers` library and load a pre-trained BERT tokenizer, as specified in the instructions. This will be used for all subsequent tokenization steps.



In [None]:
from transformers import AutoTokenizer

# Load a pre-trained BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

print("BERT Tokenizer loaded successfully.")

**Reasoning**:
To determine an efficient `max_length` for BERT tokenization, I need to analyze the distribution of token lengths in the 'further_cleaned_input' column of the oversampled training data. This involves tokenizing the text without padding, truncating at the tokenizer's `model_max_length` (512 for `bert-base-uncased`), calculating the length of each encoded sequence, and visualizing these lengths with a histogram. Using `df_train_oversampled` keeps the analysis aligned with the data the model is trained on, although duplicated minority rows are counted more than once.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Calculate token lengths for the oversampled training data
# Explicitly truncate to the model's maximum length (e.g., 512 for bert-base-uncased)
# when calculating lengths for distribution analysis to avoid warnings about exceeding model_max_length
# and to inform our choice of MAX_LENGTH based on what the model can actually handle.
token_lengths = [len(tokenizer.encode(str(text), add_special_tokens=True, truncation=True, max_length=tokenizer.model_max_length)) for text in df_train_oversampled['further_cleaned_input']]

# Plot the distribution of token lengths
plt.figure(figsize=(10, 6))
sns.histplot(token_lengths, bins=50, kde=True)
plt.title('Distribution of Token Lengths in Oversampled Training Data (Truncated at 512)')
plt.xlabel('Number of Tokens')
plt.ylabel('Frequency')
plt.show()

# Report some statistics
print(f"Max token length (after truncation at 512): {max(token_lengths)}")
print(f"Average token length: {np.mean(token_lengths):.2f}")
print(f"95th percentile token length: {np.percentile(token_lengths, 95):.2f}")
print(f"99th percentile token length: {np.percentile(token_lengths, 99):.2f}")

# Choose max_length from the distribution; BERT accepts at most 512 tokens.
# A previous run reported a 95th percentile of 228.55 and a 99th percentile of 425.00 tokens,
# so MAX_LENGTH = 128 truncates at least 5% of the training texts.
# It is kept here as a speed/memory trade-off; consider a larger value if the truncated tails matter.
MAX_LENGTH = 128

print(f"\nChosen MAX_LENGTH for tokenization: {MAX_LENGTH}")
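
To make the trade-off behind `MAX_LENGTH` concrete, one option is to measure what share of texts fit under each candidate length without truncation. The sketch below assumes a list of token lengths like `token_lengths`; the helper `coverage` and the example numbers are made up for illustration.

```python
import numpy as np

def coverage(token_lengths, max_length):
    """Fraction of sequences that fit within max_length without truncation."""
    lengths = np.asarray(token_lengths)
    return float(np.mean(lengths <= max_length))

# Made-up lengths for illustration; in the notebook this would be `token_lengths`
example_lengths = [12, 40, 75, 128, 130, 200, 256, 300, 512, 512]

for candidate in (64, 128, 256, 512):
    share = coverage(example_lengths, candidate)
    print(f"max_length={candidate}: {share:.0%} of texts fit untruncated")
```

Because the lengths in the notebook were already capped at 512 during measurement, texts reported as exactly 512 tokens may be longer in reality.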

**Reasoning**:
Now that the `MAX_LENGTH` has been determined, I will define a function to tokenize the text using the loaded BERT tokenizer, applying padding and truncation according to the `MAX_LENGTH`. This function will be applied to the 'further_cleaned_input' column of the training, validation, and test datasets to prepare them for model input.



In [None]:
def tokenize_data(texts, tokenizer, max_length):
    return tokenizer(list(texts), padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')

# Apply tokenization to the oversampled training, validation, and test sets
X_train_tokenized = tokenize_data(df_train_oversampled['further_cleaned_input'], tokenizer, MAX_LENGTH)
X_val_tokenized = tokenize_data(df_val['further_cleaned_input'], tokenizer, MAX_LENGTH)
X_test_tokenized = tokenize_data(df_test['further_cleaned_input'], tokenizer, MAX_LENGTH)

print("Tokenization applied to training, validation, and test sets.")
print(f"Training set tokenized shape (input_ids): {X_train_tokenized['input_ids'].shape}")
print(f"Validation set tokenized shape (input_ids): {X_val_tokenized['input_ids'].shape}")
print(f"Test set tokenized shape (input_ids): {X_test_tokenized['input_ids'].shape}")
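
As a rough picture of what `padding='max_length'` and `truncation=True` produce, the toy function below builds fixed-length `input_ids` and an `attention_mask` from a list of token ids. It is a simplified stand-in, not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper and the content ids are arbitrary placeholders.

```python
# Special-token ids in the bert-base-uncased vocabulary
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(content_ids, max_length):
    """Toy stand-in: return (input_ids, attention_mask), each exactly max_length long."""
    kept = content_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([2023, 2003], max_length=6)
long_ids, long_mask = pad_or_truncate([7, 8, 9, 10, 11, 12], max_length=6)

print(short_ids, short_mask)  # short text: padded with zeros
print(long_ids, long_mask)    # long text: tail dropped, [SEP] kept
```

The attention mask is what lets the model ignore padded positions, which is why it is passed alongside `input_ids` during training.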

**Reasoning**:
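The tokenizer already returns PyTorch tensors (`return_tensors='pt'`), so this step only unpacks `input_ids`, `attention_mask`, and `token_type_ids` for each split and converts the sentiment labels into tensors. These tensors are what the PyTorch TensorDatasets and DataLoaders created in the following steps expect.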



In [None]:
import torch

# The tokenizer output already holds PyTorch tensors; unpack them per split.
# Labels are converted explicitly to int64 (torch.long), the dtype the classification loss expects.
input_ids_train = X_train_tokenized['input_ids']
attention_mask_train = X_train_tokenized['attention_mask']
token_type_ids_train = X_train_tokenized['token_type_ids']
labels_train = torch.tensor(df_train_oversampled['sentiment_encoded'].values.astype(int), dtype=torch.long)

input_ids_val = X_val_tokenized['input_ids']
attention_mask_val = X_val_tokenized['attention_mask']
token_type_ids_val = X_val_tokenized['token_type_ids']
labels_val = torch.tensor(df_val['sentiment_encoded'].values.astype(int), dtype=torch.long)

input_ids_test = X_test_tokenized['input_ids']
attention_mask_test = X_test_tokenized['attention_mask']
token_type_ids_test = X_test_tokenized['token_type_ids']
labels_test = torch.tensor(df_test['sentiment_encoded'].values.astype(int), dtype=torch.long)

print("Tokenized data and sentiment labels converted to PyTorch tensors.")
print(f"Labels train shape: {labels_train.shape}")
print(f"Labels val shape: {labels_val.shape}")
print(f"Labels test shape: {labels_test.shape}")

**Reasoning**:
Now that the tokenized data and labels are in PyTorch tensor format, I need to create PyTorch TensorDataset objects for the training, validation, and test sets. This combines the inputs (input_ids, attention_mask, token_type_ids) and corresponding labels for each set.



In [None]:
from torch.utils.data import TensorDataset, DataLoader

# Create TensorDataset for training, validation, and test sets
train_dataset = TensorDataset(input_ids_train, attention_mask_train, token_type_ids_train, labels_train)
val_dataset = TensorDataset(input_ids_val, attention_mask_val, token_type_ids_val, labels_val)
test_dataset = TensorDataset(input_ids_test, attention_mask_test, token_type_ids_test, labels_test)

print("TensorDatasets created successfully.")
print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")
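
A `TensorDataset` returns one tuple per example, with the tensors in the order they were passed to the constructor. The small sketch below uses made-up tensors to show that, with the four-tensor layout used here, the label sits at position 3.

```python
import torch
from torch.utils.data import TensorDataset

# Made-up tensors for two examples of three tokens each
input_ids = torch.tensor([[101, 7, 102], [101, 8, 102]])
attention_mask = torch.ones_like(input_ids)
token_type_ids = torch.zeros_like(input_ids)
labels = torch.tensor([0, 2])

toy_dataset = TensorDataset(input_ids, attention_mask, token_type_ids, labels)

sample = toy_dataset[1]      # tuple: (input_ids, attention_mask, token_type_ids, label)
print(len(sample))           # number of tensors per example
print(sample[3].item())      # label of the second example
```

Any code that later indexes into a batch has to follow this same ordering.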

**Reasoning**:
Now that the TensorDatasets are created, I will create DataLoader objects for the training, validation, and test sets with specified batch sizes and shuffle settings to prepare them for model training and evaluation.



In [None]:
BATCH_SIZE = 32

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print("DataLoaders created successfully.")
print(f"Number of batches in training DataLoader: {len(train_dataloader)}")
print(f"Number of batches in validation DataLoader: {len(val_dataloader)}")
print(f"Number of batches in test DataLoader: {len(test_dataloader)}")
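
The batch counts printed above follow directly from the dataset size and `BATCH_SIZE`. With the default `drop_last=False`, a DataLoader keeps the final partial batch, so the count is the ceiling of the division. The helper `num_batches` and the dataset size of 1000 below are hypothetical.

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return num_examples // batch_size       # the partial final batch is discarded
    return math.ceil(num_examples / batch_size)  # the partial final batch is kept

# Hypothetical dataset size for illustration
n_examples, batch_size = 1000, 32
print(num_batches(n_examples, batch_size))                   # batches per epoch
print(n_examples - (num_batches(n_examples, batch_size) - 1) * batch_size)  # size of the last batch
```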

## Loading the BERT Model and Configuring the Training Environment

### Subtask:
Load the pre-trained `BertForSequenceClassification` model from Hugging Face Transformers. Configure the model to use a GPU (if available) and set up the optimizer (AdamW) and a learning rate scheduler.


**Reasoning**:
First, I need to import the necessary components from `transformers` and `torch` to load the BERT model, configure the device, and set up the optimizer and learning rate scheduler.



In [None]:
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
import torch
from torch.optim import AdamW # Use PyTorch's AdamW implementation

# 1. Determine if a GPU is available and set the device accordingly
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

# 2. Load pre-trained BertForSequenceClassification model
# We have 3 labels for sentiment (positive, neutral, negative)
num_labels = 3
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = num_labels,
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# 3. Move the model to the detected device (GPU or CPU)
model.to(device)

print("BERT model loaded and moved to device successfully.")

# 4. Initialize the AdamW optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # Within the fine-tuning range suggested in the BERT paper (5e-5, 3e-5, 2e-5)
                  eps = 1e-8 # Adam's epsilon for numerical stability
                 )
print("AdamW optimizer initialized.")

# 5. Calculate total number of training steps for the learning rate scheduler
# This requires defining the number of epochs
EPOCHS = 3 # Example number of epochs
total_steps = len(train_dataloader) * EPOCHS

# 6. Set up the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
print(f"Learning rate scheduler initialized for {total_steps} training steps.")
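
The scheduler multiplies the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The pure-Python sketch below mirrors that rule for illustration; `linear_schedule_factor` is a hypothetical name, and the step count of 300 is made up.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total = 300  # hypothetical number of training steps

for step in (0, 150, 300):
    factor = linear_schedule_factor(step, num_warmup_steps=0, num_training_steps=total)
    print(f"step {step:>3}: lr = {base_lr * factor:.2e}")
```

With `num_warmup_steps = 0`, as configured above, training starts at the full learning rate and only the decay phase applies.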

**Reasoning**:
The previous code successfully loaded the BERT model, initialized the optimizer, and set up the learning rate scheduler. The next logical step, according to the overall task of training the model, is to implement the training loop. This involves defining a training function, including accuracy calculation, and then running the training for a specified number of epochs.



In [None]:
import time
import datetime

# Function to calculate accuracy
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Function to format elapsed time
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

# Set the seed for reproducible results
seed_val = 42
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average training loss for each epoch
loss_values = []

# Training loop
for epoch_i in range(0, EPOCHS):
    print(f'\n======== Epoch {epoch_i + 1} / {EPOCHS} ========')
    print('Training...')

    t0 = time.time()
    total_loss = 0
    model.train() # Set the model to training mode

    for step, batch in enumerate(train_dataloader):
        if step % 500 == 0 and step != 0:
            elapsed = format_time(time.time() - t0)
            print(f'  Batch {step:>5,}  of  {len(train_dataloader):>5,}.    Elapsed: {elapsed}.')

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[3].to(device)

        model.zero_grad()

        outputs = model(
            b_input_ids,
            token_type_ids=None,
            attention_mask=b_input_mask,
            labels=b_labels
        )

        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Clip the norm of the gradients to 1.0
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)
    loss_values.append(avg_train_loss)
    print(f'  Average training loss: {avg_train_loss:.2f}')
    print(f'  Training epoch took: {format_time(time.time() - t0)}')

    print('\nRunning Validation...')

    t0 = time.time()
    model.eval() # Set the model to evaluation mode
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps = 0

    for batch in val_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_token_type_ids, b_labels = batch

        with torch.no_grad():
            outputs = model(
                b_input_ids,
                token_type_ids=None,
                attention_mask=b_input_mask,
                labels=b_labels
            )

        logits = outputs.logits
        loss = outputs.loss
        eval_loss += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print(f'  Validation Loss: {eval_loss / nb_eval_steps:.2f}')
    print(f'  Validation Accuracy: {eval_accuracy / nb_eval_steps:.2f}')
    print(f'  Validation took: {format_time(time.time() - t0)}')

print('\nTraining complete!')
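
The validation loop above averages per-batch accuracies, which gives a smaller final batch the same weight as a full one. The sketch below contrasts that with an example-weighted accuracy on made-up predictions; both helper functions are hypothetical and take predicted class ids rather than logits to keep the example short.

```python
import numpy as np

def batch_mean_accuracy(batches):
    """Mean of per-batch accuracies (what the loop above computes)."""
    return float(np.mean([np.mean(preds == labels) for preds, labels in batches]))

def example_weighted_accuracy(batches):
    """Correct predictions divided by total examples across all batches."""
    correct = sum(int(np.sum(preds == labels)) for preds, labels in batches)
    total = sum(len(labels) for _, labels in batches)
    return correct / total

# Made-up data: a full batch of 4 (all correct) and a final batch of 1 (wrong)
batches = [
    (np.array([0, 1, 2, 1]), np.array([0, 1, 2, 1])),
    (np.array([0]), np.array([2])),
]

print(batch_mean_accuracy(batches))        # treats both batches equally
print(example_weighted_accuracy(batches))  # treats every example equally
```

The two numbers coincide when all batches have the same size; otherwise the example-weighted figure is the one that matches the usual definition of accuracy.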